[Paper Review📃] Distract Your Attention: Multi-head Cross Attention Network for Facial Expression Recognition


-DAN-

paper😙

This time I want to analyze the paper that currently ranks 2nd on AffectNet according to Papers with Code.

I have been studying several meta learning techniques in order to apply meta learning to FER. The existing methods propose algorithms that work with only a few samples per class, on images that look completely different from one another.

With face data, however, every image belongs to the same category (a face) and the only differences are subtle changes in expression, so I have been thinking about how to design an algorithm that focuses on those differences.

So I looked for a paper that is somewhat distance-based and focuses on the parts of the face where the important expression changes happen, and studied it.

The overall model architecture proposed in the paper is shown below.

image

Like most models, it extracts features with a base model (ResNet here) and then passes those features through attention.

It is made up of three main modules: FCN, MAN, and AFN.


- FCN (Feature Clustering Network)

image


Affinity Loss

The formula is shown below. Here $M = Y$, where $Y$ is the set of classes (7 of them), and the loss is computed from the difference between each feature $x'$ and the randomly initialized center point of its class.

image
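To make the idea concrete, here is a minimal NumPy sketch of a center-based loss of this kind. The function name `affinity_loss`, the toy shapes, and in particular the normalization by the variance among class centers are assumptions for illustration; the exact formula is the one in the figure above.

```python
import numpy as np

def affinity_loss(features, labels, centers):
    """Mean squared distance of each feature to its own class center,
    divided by the spread of the centers (assumed normalization).

    features: (N, D) floats, labels: (N,) ints, centers: (C, D) floats.
    """
    diffs = features - centers[labels]        # offset from the own-class center
    sq_dist = np.sum(diffs ** 2, axis=1)      # squared L2 distance per sample
    center_var = centers.var(axis=0).sum()    # variance among class centers
    return float(sq_dist.mean() / center_var)

# Toy example: 7 classes, 16-dimensional features, randomly initialized centers.
rng = np.random.default_rng(0)
centers = rng.normal(size=(7, 16))
labels = rng.integers(0, 7, size=32)
features = centers[labels] + 0.1 * rng.normal(size=(32, 16))
loss = affinity_loss(features, labels, centers)
```

Under this assumed form, pulling features toward their own center lowers the numerator, while centers that are far apart raise the denominator, so both effects reduce the loss.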

The figure below shows that, with Affinity Loss, the data distribution is arranged so that inter-class distances are maximized while intra-class distances are minimized.

image


- MAN (Multi-head cross Attention Network)

Two kinds of attention modules are used: spatial attention and channel attention. I covered both in a previous paper review as well, but they seem to share only the names, and the contents look different.

For the paper that used SA and CA, see [Paper Summary📃] Noisy Student Training using Body Language Dataset Improves Facial Expression Recognition.

image

image

To extract local features better, the attention here passes the input through 3x3, 1x3, and 3x1 convolutions separately and sums the results. It may be obvious, but when you convolve with kernels of several sizes and add the outputs, the feature responses accumulate and the values in the important regions get larger, so local features are captured well.
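The summation of the three branches can be sketched as below. This is a simplified single-channel NumPy illustration, not the paper's implementation: the helper `conv2d_same`, the all-ones kernels, the ReLU, and the final reweighting of the input are all assumptions made to keep the example small.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Zero-padded 2D cross-correlation that keeps the input size."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def spatial_attention(x, k3x3, k1x3, k3x1):
    """Sum the 3x3, 1x3 and 3x1 responses, then reweight the input."""
    summed = conv2d_same(x, k3x3) + conv2d_same(x, k1x3) + conv2d_same(x, k3x1)
    attn = np.maximum(summed, 0.0)   # ReLU on the summed map (assumption)
    return x * attn, attn

# Toy input: a 4x4 map of ones with all-ones kernels.
x = np.ones((4, 4))
out, attn = spatial_attention(x, np.ones((3, 3)), np.ones((1, 3)), np.ones((3, 1)))
```

In this toy case an interior position receives 9 + 3 + 3 = 15 while a corner receives 4 + 2 + 2 = 8, which shows how positions covered by all three kernels end up with larger values.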

After that, the output of the spatial attention is fed in as the input of the channel attention, where it passes through two linear layers.
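A rough NumPy sketch of a channel attention step with two linear layers follows. Only the "two linear layers" part comes from the description above; the global average pooling, the ReLU between the layers, the sigmoid gate, and every name and shape here are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feature_map, w1, b1, w2, b2):
    """feature_map: (C, H, W); w1: (C, hidden); w2: (hidden, C)."""
    pooled = feature_map.mean(axis=(1, 2))        # global average pool -> (C,)
    hidden = np.maximum(pooled @ w1 + b1, 0.0)    # first linear layer + ReLU
    gate = sigmoid(hidden @ w2 + b2)              # second linear layer + sigmoid
    return pooled * gate                          # per-channel reweighting

# Toy example: 8 channels squeezed to 4 hidden units and expanded back.
rng = np.random.default_rng(0)
C, hidden_dim = 8, 4
fmap = rng.normal(size=(C, 7, 7))
vec = channel_attention(
    fmap,
    rng.normal(size=(C, hidden_dim)), np.zeros(hidden_dim),
    rng.normal(size=(hidden_dim, C)), np.zeros(C),
)
```

The result is one value per channel, scaled by a gate between 0 and 1 that the two linear layers compute from the pooled features.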


- AFN (Attention Fusion Network)

image

The features that come out of the MAN module are passed through a sigmoid function and then summed, and the result goes through linear -> batch normalization to produce the final feature.

Partition Loss

The loss used in AFN is shown below. Because the attention above is run k times, the loss is normalized by the standard deviation accordingly and computed with a log softmax.

image
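As a rough illustration, the sketch below assumes the loss has the form log(1 + k / variance), where the variance is taken across the k attention outputs. That form, the small `eps` term, and the names `partition_loss` and `head_features` are assumptions, not something spelled out in the text above, so treat this as one possible reading of the figure.

```python
import numpy as np

def partition_loss(head_features, eps=1e-8):
    """head_features: (N, k, D), one D-dimensional vector per attention head.

    Assumed form: log(1 + k / var), with var the variance across heads.
    """
    k = head_features.shape[1]
    if k < 2:
        return 0.0                                   # nothing to separate with one head
    var = head_features.var(axis=1, ddof=1).mean()   # variance across the k heads
    return float(np.log(1.0 + k / (var + eps)))

# Two heads that are close together vs. two heads that are further apart.
similar_heads = np.array([[[0.0, 0.0], [1.0, 1.0]]])
diverse_heads = np.array([[[0.0, 0.0], [2.0, 2.0]]])
loss_similar = partition_loss(similar_heads)
loss_diverse = partition_loss(diverse_heads)
```

With this form, heads whose outputs differ more give a smaller loss, which pushes the k attention heads to look at different regions.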

In the end, the final loss used for the update is the sum of all three losses (affinity + partition + CE), with weighting parameters multiplied in front (the paper sets them to 1.0).

image
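Written out, the weighted sum described here could take the following form. The symbols $L_{cls}$ (cross entropy), $L_{af}$ (affinity), $L_{pt}$ (partition), $\lambda_1$, and $\lambda_2$, as well as the choice of which terms carry a coefficient, are assumed notation for illustration.

$$
L = L_{cls} + \lambda_1 L_{af} + \lambda_2 L_{pt}
$$

With $\lambda_1 = \lambda_2 = 1.0$, this reduces to a plain sum of the three losses.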


Performance

image

The graph below was drawn to see how many attention repetitions give the best performance. Repeating 4 times performed best, and the figure on the right visualizes the features from each attention as it runs. You can see that the importance is distributed around the eyes and the mouth.

image

The paper says 4 repetitions give the best performance, and in a way that seems like a natural result. Both eyes and both corners of the mouth are the most important parts when an expression changes, and they are also the cues that separate the classes, which is probably why the result came out that way.

With only 3 repetitions, the cue for the lips or for one of the eyes would be missing. Since the symmetry of an expression matters (for example, when someone makes a knowing look and only one eye widens -_^, 🧐; if you classified using only one of the two eyes, the result would come out differently, right?), the accuracy seems to differ quite clearly.

It is about time to narrow down my topic and start tinkering with some experiments. For now I will analyze the AFEW dataset carefully while studying meta learning. Fighting!! 😍😍
