[Paper Review📃] Frame Attention Networks for Facial Expression Recognition in Videos


- FAN -

Paper😙

์ด ๋…ผ๋ฌธ์€ ๋น„๋””์˜ค(์˜์ƒ)์„ ํ”„๋ ˆ์ž„์ฒ˜๋ฆฌํ•˜์—ฌ ์–ผ๊ตดํ‘œ์ •์„ ์ธ์‹ํ•˜๋Š”๋ฐ์˜ ํ•œ๊ณ„๋ฅผ ์ง€์ ํ•˜๊ณ  relation-attention ์ด๋ผ๋Š” ๊ฐœ๋…์„ ์ถ”๊ฐ€ํ•œ CNN ๋ชจ๋ธ์„ ์ œ์•ˆํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.


๋จผ์ €, Frame์„ ์ฒ˜๋ฆฌํ•˜์—ฌ FER(Facial Express Recognition)์„ ํ•˜๋Š”๋ฐ self-attention์ด๋ผ๋Š” ๊ฐœ๋…์„ ๋„์ž…ํ•œ ๋ฐฐ๊ฒฝ์„ ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

์ผ๋‹จ, ์˜์ƒ์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ํ•œ ๋น„๋””์˜ค ํŒŒ์ผ์„ frame(์ด๋ฏธ์ง€ ํŒŒ์ผ)์œผ๋กœ ๋ฐ”๊พธ์–ด์„œ ์ฒ˜๋ฆฌ๋ฅผ ํ•ด์•ผํ•˜๋Š”๋ฐ, ๊ฐ๊ฐ์˜ ํ”„๋ ˆ์ž„๋“ค์— ๋Œ€ํ•œ ์–ผ๊ตด์˜ ํŠน์ง•์„ ์ฐพ๋Š” ๊ฒƒ์€ ๋งค์šฐ ์ค‘์š”ํ•œ ์ผ์ž…๋‹ˆ๋‹ค.

๋”ฐ๋ผ์„œ ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ํ”„๋ ˆ์ž„์›Œํฌ์˜ ์ข…๋‹จ ํ•™์Šต์— ์žˆ์–ด์„œ *์ฐจ๋ณ„์ ์ธ ๋ถ€๋ถ„์„ ์ž๋™์œผ๋กœ ํ•˜์ด๋ผ์ดํŠธ ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

๊ทธ๋ ‡๋‹ค๋ฉด ์—ฌ๊ธฐ์„œ ์ฐจ๋ณ„์ ์ธ ๋ถ€๋ถ„์ด๋ž€, ๊ฐ ํ”„๋ ˆ์ž„๋“ค์ด ๊ฐ€์ง€๋Š” ์ค‘์š”๋„๋ฅผ ๋งํ•˜๊ณ  ์žˆ๋Š”๋ฐ์š”, ์ž์„ธํ•œ ๊ฒƒ์€ ์ฒœ์ฒœํžˆ ์„ค๋ช…ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

์ผ๋‹จ video-based FER์— ์žˆ์–ด์„œ ํŠน์ง•์„ ์ถ”์ถœํ•˜๋Š” ๋ฐฉ๋ฒ•์—๋Š” 3๊ฐ€์ง€๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

1) Static-based feature extraction (features extracted from individual frames)

ex) LBP, Gabor filters

2) Spatial-temporal methods (methods that model temporal / spatio-temporal dynamics)

ex) LSTM, C3D

3) Geometry-based methods (methods that extract and use facial key points)

์œ„์˜ 3๊ฐ€์ง€ ๋ฐฉ๋ฒ•์—์„œ 1) ๋ฒˆ์˜ ๋ฐฉ๋ฒ•์ด EmotiW ๋ผ๋Š” challenge์—์„œ ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๋ฐฉ์‹์ด๋ผ ์ด ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•œ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

๊ทธ๋Ÿฐ๋ฐ ๋ฌธ์ œ๋Š” ์—ฌ๊ธฐ์„œ 1) ๋ฒˆ์˜ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜๋ ค๋ฉด Frame aggregation ์ด๋ผ๋Š” ๊ฒƒ์„ ํ•ด์•ผํ•ฉ๋‹ˆ๋‹ค. ์ด ๋…ผ๋ฌธ ์ด์ „์˜ ๋…ผ๋ฌธ์„ ์‚ดํŽด๋ณด๋ฉด์€ ๊ณ ์ •๋œ ๊ธธ์ด์˜ video representation ๊ฐ’์€ n๊ฐœ์˜ ํด๋ž˜์Šค์˜ ํ™•๋ฅ ๋ถ„ํฌ ๋ฒกํ„ฐ๋กœ ํ˜•์„ฑํ•˜๋Š”๋ฐ ์ด๊ฒƒ์„ ํ”„๋ ˆ์ž„๋“ค์˜ ํ‰๊ท  ํ˜น์€ ํ™•์žฅ์˜ ๋ฐฉ์‹์œผ๋กœ ๋ฌถ์–ด์„œ ์ฒ˜๋ฆฌํ•œ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

์•„๋ž˜ ์ˆ˜์‹์„ ๋ณด๋ฉด

In $r=\sum_k a_kf_k$, $a_k$ is a linear weight, $f_k$ is the feature extracted from frame $k$, $k$ indexes the k-th frame of the video, and $r$ is the resulting representation.

๊ฐ ํ”„๋ ˆ์ž„์— ๋Œ€ํ•ด์„œ linear weight, ์—ฌ๊ธฐ์„œ๋Š” ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ์ธ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๋…ผ๋ฌธ์—์„œ ์ด weight ๊ฐ’์— ๋”ฐ๋ผ ์„ฑ๋Šฅ์˜ ์ฐจ์ด๊ฐ€ ์žˆ๋‹ค๊ณ  ๋งํ•œ ๊ฒƒ์„ ๋ฏธ๋ฃจ์–ด ๋ณด์•„ ์ž„์˜๋กœ ์ฃผ์–ด์ง€๋Š” ๋žœ๋ค๊ฐ’์ธ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค.

์ด๋Ÿฐ์‹์œผ๋กœ ๊ฐ ํ”„๋ ˆ์ž„์— ๋Œ€ํ•ด์„œ ๊ฐ€์ค‘์น˜๋ฅผ ๊ณฑํ•œ ๊ฒƒ๋“ค์˜ ํ•ฉ์œผ๋กœ ํ•˜๋‚˜์˜ representation ๊ฐ’์œผ๋กœ ๋ณธ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ํ•˜์ง€๋งŒ ์•„๋ž˜ ์‚ฌ์ง„์„ ๋ณด๋ฉด, ํ•œ ๋น„๋””์˜ค์— ๋Œ€ํ•ด frame์œผ๋กœ ์ชผ๊ฐœ์ง€๋ฉด์„œ ๊ฐ ํ”„๋ ˆ์ž„์ด ํ™•์‹คํ•œ ํ‘œ์ •์„ ๋ณด์ด๋Š” ๊ฒƒ๋„ ์žˆ์ง€๋งŒ ์• ๋งคํ•˜๊ฒŒ ์ƒ๊ฐ๋  ์ˆ˜ ์žˆ๋Š” ํ”„๋ ˆ์ž„๋„ ์žˆ๋Š”๋ฐ, ์ด๊ฒƒ์— ๋Œ€ํ•œ ๊ณ ๋ ค ์—†์ด (๊ฐ ํ”„๋ ˆ์ž„์˜ ํ‘œ์ •์˜ ํ™•์‹คํ•œ ์ •๋„์™€๋Š” ์ƒ๊ด€์—†์ด) ๊ฐ€์ค‘์น˜๋ฅผ ๋žœ๋คํ•˜๊ฒŒ ์ฃผ์–ด๋ฒ„๋ฆฌ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค๋ฉด happy ์™€๋Š” ์‚ฌ๋ญ‡ ๋‹ฌ๋ผ๋ณด์ด๋Š” ๋งˆ์ง€๋ง‰ ํ”„๋ ˆ์ž„์— ๋“ค์–ด๊ฐ„ ๋žœ๋ค ๊ฐ€์ค‘์น˜($a_k$)๊ฐ€ ์ปค๋ฒ„๋ฆฐ๋‹ค๋ฉด, ํ•™์Šตํ•  ๋•Œ ์• ๋งคํ•œ ๋ฐ์ดํ„ฐ์˜ ๋น„์ค‘์ด ๋” ์ปค์ง€๋Š” ๋ถˆ์ƒ์‚ฌ๊ฐ€ ์ผ์–ด๋‚˜๊ฒ ์ฃ !๐Ÿ˜ต๐Ÿ˜ต

image

๋”ฐ๋ผ์„œ, ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์œ„์™€ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ๋Š” ๊ฐ frame ๋“ค์— ๋Œ€ํ•œ ์ค‘์š”๋„๋ฅผ ๋ฌด์‹œํ•˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋‹ค๋ฉฐ ํ•œ๊ณ„๋ฅผ ์ง€์ ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

๊ทธ๋ ‡๊ฒŒ ํ•ด์„œ ๊ฐ ํ”„๋ ˆ์ž„์— ๋Œ€ํ•ด ์ค‘์š”ํ•œ ์ •๋„๋ฅผ ํŒ๋ณ„ํ•ด ์ค‘์š”๋„์˜ ๊ฐ€์ค‘์น˜๋ฅผ ์ฃผ์ž! ํ•˜๊ณ  ๋‚˜์˜จ๊ฒƒ์ด FAN ์ž…๋‹ˆ๋‹ค.

Network Architecture

image

๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‘ ๊ฐœ์˜ ๋ชจ๋“ˆ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์Šต๋‹ˆ๋‹ค.

1) Feature embedding module

์ž…๋ ฅ๊ฐ’์œผ๋กœ ๋“ค์–ด์˜จ ๋น„๋””์˜ค ํ”„๋ ˆ์ž„๋“ค์„ CNN ์ปจ๋ณผ๋ฃจ์…˜์„ ํ†ตํ•ด ๊ฐ ํŠน์ง•์  ๋ฒกํ„ฐ๋ฅผ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค.

2) Frame attention module

์ถ”์ถœ๋œ feature vector (ํŠน์ง•์  ๋ฒกํ„ฐ)๋“ค์„ ์—ฐ์‚ฐํ•˜์—ฌ attention weight๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ํ”„๋ ˆ์ž„์— ๋Œ€ํ•œ ํ™•๋ฅ ๊ฐ’์„ ์–ป์–ด classification์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.

attention module์„ ์ž์„ธํžˆ ์‚ดํŽด๋ณด๋ฉด,

image

๋จผ์ € ๋…ธ๋ž€์ƒ‰์œผ๋กœ ํ•˜์ด๋ผ์ดํŠธ๋œ ๋ถ€๋ถ„์—์„œ self-attention weight ์™€ global representation ๊ฐ’์„ ๊ตฌํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

self-attention wieght๋Š” attention weight๋ฅผ ๊ตฌํ•˜๋Š” ๋ฐฉ์‹๊ณผ ๋™์ผํ•˜๊ฒŒ ์ด๋ฃจ์–ด์ง€๋Š”๋ฐ์š”, attention weight๋Š” input์˜ hidden state ๋ฒกํ„ฐ์™€ ouput ์—์„œ ๋‚˜์˜ฌ ๊ฒƒ์ด๋ผ ์˜ˆ์ƒ๋˜๋Š” ํ•œ ์‹œ์ ์˜ state ์˜ ๋ฒกํ„ฐ ๊ฐ’์„ ๊ฐ๊ฐ dot product (๋‚ด์ ) ์‹œ์ผœ์„œ output์œผ๋กœ ๋‚˜์˜ค๊ฒŒ ๋  ๊ฐ’์— ๋Œ€ํ•ด ๋ชจ๋“  input๊ณผ์˜ ์œ ์‚ฌ๋„๋ฅผ ๊ตฌํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์™€ ๊ฐ™์€ Sequence-to-sequence ์˜ attention score๋ฅผ ์–ป๋Š” ๊ตฌ์กฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

image
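For reference, that classic seq2seq dot-product attention can be sketched as follows (NumPy; the encoder states and decoder state are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 4, 6

h = rng.normal(size=(seq_len, hidden))  # encoder hidden states (the inputs)
s = rng.normal(size=hidden)             # decoder state at one output step

# score_i = h_i . s: similarity between each input and the output-side state
scores = h @ s

# softmax turns the scores into attention weights that sum to 1
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# context vector: attention-weighted sum of the encoder states
context = weights @ h

print(weights.sum(), context.shape)
```

FAN replaces the softmax here with a sigmoid followed by explicit normalization, but the dot-product similarity idea is the same.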

์ด๊ฒƒ๊ณผ ์œ ์‚ฌํ•˜๊ฒŒ FAN์—์„œ๋„ feature vector ๊ฐ’์„ input์˜ hidden state, FC ๋ ˆ์ด์–ด์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐ’์„ ์ถœ๋ ฅ์œผ๋กœ ๋‚˜์˜ฌ ๊ฒƒ์ด๋ผ ์˜ˆ์ƒ๋˜๋Š” ๊ฐ’์œผ๋กœ ๋‘” ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค.

์—ฌ๊ธฐ์„œ $q^0$ ๋ฅผ FC ๋ ˆ์ด์–ด์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐ’์ด๋ผ๊ณ  ํ–ˆ๋Š”๋ฐ, ์‹ค์ œ๋กœ ๊ณต์‹ github ์— ๊ฐ€์„œ ์ฝ”๋“œ๋ฅผ ๋œฏ์–ด๋ณธ ๊ฒฐ๊ณผ FC ๋ ˆ์ด์–ด ์ž์ฒด๋ฅผ dot product ํ•˜๋Š” ๊ฒƒ ๊ฐ™์•˜์Šต๋‹ˆ๋‹ค. Attention machanism ์„ ์‚ฌ์šฉํ•˜๋Š” ๋‹ค๋ฅธ FER ๋…ผ๋ฌธ์—์„œ๋„ ํ™•์ธ ๊ฒฐ๊ณผ FC ๋ ˆ์ด์–ด ์ž์ฒด๋ฅผ ๋‚ด์ ํ•œ๋‹ค๊ณ  ์„ค๋ช…ํ–ˆ์œผ๋ฏ€๋กœ, ์ด ๋…ผ๋ฌธ์—์„œ๋„ FC ๋ ˆ์ด์–ด ์ž์ฒด๋ฅผ ๋‚ด์ ํ•˜๋Š” ๊ฒƒ์ด๋ผ๊ณ  ์ƒ๊ฐํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

(์ƒ๊ฐํ•ด๋ณด๋‹ˆ.. ๋‹ค์Œ์— ๋‚˜์˜ค๊ฒŒ ๋  ๊ฒƒ์ด๋ผ๊ณ  ์˜ˆ์ƒ๋˜๋Š” ๋ชจ๋“  ๊ฐ’๋“ค๊ณผ์˜ ์œ ์‚ฌ๋„๋ฅผ ์ง„ํ–‰ํ•ด์„œ ์–ด๋–ค๊ฒƒ์ด ๋†’์€ ๊ฐ’์œผ๋กœ ๋‚˜์˜ฌ์ง€๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด๋ฏ€๋กœ FC๋ ˆ์ด์–ด ์ž์ฒด๋กœ ์—ฐ์‚ฐํ•˜๋Š” ๊ฒƒ์ด ๋งž๋Š”๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค.)

๊ทธ ๋‹ค์Œ์€ attention module์—์„œ ๋นจ๊ฐ„์ƒ‰ ๋ถ€๋ถ„์— ๋Œ€ํ•œ ์„ค๋ช…์ž…๋‹ˆ๋‹ค. ์ด ๋ถ€๋ถ„์€ relation attention weight๋ฅผ ๊ตฌํ•˜๋Š”๋ฐ์š”, relation attention ์ด๋ผ๋Š” ๊ฐœ๋…์€ global feature์™€ local feature ๋‘˜ ๋ชจ๋‘๋ฅผ ๊ฐ–๊ณ  ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚ผ ๊ฒƒ์ด๋ผ๋Š” ๊ฐ€์ • ํ•˜์— ๋งŒ๋“ค์–ด์ง„ ๊ฐœ๋…์ž…๋‹ˆ๋‹ค.

๊ทธ๋ž˜์„œ ๋ณด๋ฉด, ์ผ๋‹จ ์ฒซ๋ฒˆ์งธ ๋‹จ์—์„œ ๋‚˜์˜จ feature ๊ฐ’๊ณผ self-attention weight๋ฅผ ํ†ตํ•ด ๋‚˜์˜จ global representation ๊ฐ’ ($f\prime_v$) ๋ฅผ concat ์‹œ์ผœ์„œ ์ด๋ฅผ ๋‹ค์‹œ FC๋ ˆ์ด์–ด์™€ ๋‚ด์ ์‹œ์ผœ ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜๋ฅผ ์ทจํ•ด weight ๊ฐ’์„ ๊ตฌํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

๋”ฐ๋ผ์„œ ๊ตฌํ•œ self-attention weight ($\alpha_i$)์™€ global representation ($\beta_i$) ๋ฅผ ๊ฐ€์ง€๊ณ  ์ตœ์ข… representation ๊ฐ’์ธ $f_v$๋ฅผ ๊ตฌํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

image

์ œ๊ฐ€ ํ•ด์„ํ•œ relation attention์€ ํ•œ ์ฐจ๋ก€๋กœ ๊ณ„์‚ฐ๋œ attention๊ฐ’(์œ ์‚ฌ๋„ ๊ฐ€์ค‘์น˜์™€ ๊ณ„์‚ฐ๋œ ํ”ผ์ณ๊ฐ’)๋“ค์„ ๋˜ ๋‹ค์‹œ ํ•œ์ฐจ๋ก€ ๋” attention ๊ณ„์‚ฐ์„ ํ•จ์œผ๋กœ ์จ feature์— ํ•œ์ฐจ๋ก€ ๊ณ„์‚ฐํ•˜์—ฌ ๋” ์ค‘์š”ํ•œ ๊ฐ’์— ๊ฐ’์„ ๋” ์ฃผ๊ณ , ์ค‘์š”ํ•˜์ง€๋งŒ ๋œ ์ค‘์š”ํ•œ ๊ฒƒ์—๋Š” ๊ฐ€์ค‘์น˜๋ฅผ ๋œ ๋ถ€์—ฌํ•˜๋Š” ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. (์ค‘์š”ํ•œ ํ”ผ์ณ๋ฅผ ๋” ๊ฐ•์กฐํ•˜๋Š” ๋Š๋‚Œ..? ์ •ํ™•ํ•œ ํ•ด์„์„ ํ•˜์‹œ๋Š” ๋ถ„ ์—ฐ๋ฝ์ฃผ์„ธ์š”..!!)


Comparing relation attention vs. self attention

image

self attention ๊ณผ relation attention ์„ ๋น„๊ตํ•˜๋ฉด ๋‘˜ ๋ชจ๋‘ ๊ฐ ์‹œํ€€์Šค ํ”„๋ ˆ์ž„๋“ค ์ค‘ ํ™•์‹คํ•œ ํ‘œ์ •์„ ๋ณด์ด๋Š” ๊ฒƒ์— ๋” ๋งŽ์€ ๊ฐ€์ค‘์น˜๋ฅผ ์ฃผ๋Š” ๊ฒƒ์„ ๋ณด์ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

๊ฑฐ๊ธฐ์— relation attention ์ด ์ข€ ๋” ์ข‹์€ ๊ฐ€์ค‘์น˜๋ฅผ ์ฃผ๋Š” ๊ฒƒ์„ ํ™•์ธํ–ˆ์Šต๋‹ˆ๋‹ค.


Experiment Results

- CK+

๋จผ์ €, CK+ ๋ฐ์ดํ„ฐ์…‹์—์„œ์˜ ์„ฑ๋Šฅ์„ ์‚ดํŽด๋ณด๋ฉด.. baseline ๋ชจ๋ธ๋ณด๋‹ค attention ์„ ์ทจํ•œ ๋ชจ๋ธ์ด ๋” ์ข‹์€ ์„ฑ๋Šฅ์„, relation attention ์„ ์‚ฌ์šฉํ•œ ๋ชจ๋ธ์ด ์กฐ๊ธˆ ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ CK+๋กœ ๋‚˜์˜จ ์„ฑ๋Šฅ์€ SOTA ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค!

image

- AFEW 8.0

์˜ํ™” ๋ฐ์ดํ„ฐ๋กœ ์ด๋ฃจ์–ด์ง„ AFEW 8.0 ๋ฐ์ดํ„ฐ์…‹์—์„œ์˜ ์„ฑ๋Šฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. SOTA ์„ฑ๋Šฅ๊นŒ์ง€๋Š” ์•„๋‹ˆ์ง€๋งŒ ๊ทธ๋ž˜๋„ relation attention์„ ์‚ฌ์šฉํ•  ๋•Œ์˜ ์„ฑ๋Šฅ์ด ๋” ์ข‹๊ฒŒ ๋‚˜์˜จ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

image

ํƒœ๊ทธ: , ,

์นดํ…Œ๊ณ ๋ฆฌ:

์—…๋ฐ์ดํŠธ:

๋Œ“๊ธ€๋‚จ๊ธฐ๊ธฐ