[Paper Review๐Ÿ“ƒ] Noisy Student Training using Body Language Dataset Improves Facial Expression Recognition


-Noisy Student FER-

Paper๐Ÿ˜™

์ด๋ฒˆ์—”, Papers with code ๊ธฐ์ค€, AFEW ๋ฐ์ดํ„ฐ๋กœ SOTA์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•œ FER ๋…ผ๋ฌธ์„ ๋ฆฌ๋ทฐํ•˜๋ ค๊ณ  ํ•œ๋‹ค.

์ฐธ ๋งŽ์ด๋„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. 3๊ฐ€์ง€ attention ๊ธฐ๋ฒ•, ์–ผ๊ตด ์ด๋ฏธ์ง€๋ฅผ 3๊ฐ€์ง€๋กœ ๋‚˜๋ˆ„์–ด์„œ ๋ถ„์„, noisy student์œผ๋กœ ํ•™์Šต, extra dataset ์‚ฌ์šฉ.. ์ด๋ ‡๊ฒŒ ๋‹ค ์‚ฌ์šฉํ•ด๋„ audio๋ฅผ ์‚ฌ์šฉํ•œ multi-modal model ๋ณด๋‹ค๋Š” ์•„๋ž˜์— ๋žญํฌ๋˜์–ด์žˆ๋‹ค.

**์•„๋ž˜ ์ด๋ฏธ์ง€๋Š” AFEW ๋ฐ์ดํ„ฐ์…‹ ๊ธฐ์ค€์œผ๋กœ ๋ชจ๋ธ ์ˆœ์œ„์ด๋‹ค. 6์œ„๋ฅผ ์ฐจ์ง€ํ•˜๋Š” ์ค‘

ํ™”๋ฉด ์บก์ฒ˜ 2022-01-05 234023

Introduction

  • Propose an efficient model that addresses the challenges posed by in-the-wild videos while tackling the shortage of labelled data

  • Previous video-based emotion recognition also used visual cues, but relied on a fusion of 5 different architectures with more than 300 million parameters. The proposed method instead uses a single model with approximately 25 million parameters and comparable performance

  • Use a SOTA pre-trained deep learning model (Enlighten-GAN) for preprocessing (because previous methods tend to amplify noise, distort tone, and introduce other artefacts)

  • Use a three-level attention mechanism (spatial-attention block, channel-attention block, frame-attention block)

์š”์•ฝํ•˜์ž๋ฉด, ์‹ฑ๊ธ€๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜๋ฉด์„œ, ์ด๋ฏธ์ง€ ์ „์ฒ˜๋ฆฌ์— GAN, Backbone์—์„œ๋Š” 3๋ฒˆ์˜ attention, unlabelled dataset(extra data)์„ ์‚ฌ์šฉํ•˜์—ฌ SOTA ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค.


Pre-processing

์ด๋ฏธ์ง€ ์ „์ฒ˜๋ฆฌ ๊ณผ์ •์„ ๋„์‹ํ™” ํ•ด๋ณด์•˜๋‹ค.

ํ™”๋ฉด ์บก์ฒ˜ 2022-01-06 202517

๋ชจ๋“  FER์ด ๊ทธ๋Ÿฌํ•˜๋“ฏ, ๋น„๋””์˜ค ๋ฐ์ดํ„ฐ๋ฅผ ํ”„๋ ˆ์ž„์ฒ˜๋ฆฌํ•˜๊ณ , ๊ฐ ํ”„๋ ˆ์ž„์—์„œ MTCNN์„ ์‚ฌ์šฉํ•ด ์–ผ๊ตด์„ ์ฐพ๊ณ  CROPํ•œ ํ›„ ๊ฐ๋„๋ฅผ ๋งž์ถฐ์ฃผ๋Š” ์ž‘์—…์„ ํ•œ๋‹ค. (MTCNN ๋…ผ๋ฌธ์€ ์ฝ๋Š”์ค‘์ด๋‹ค. ์ถ”ํ›„ ํฌ์ŠคํŒ…ํ•˜๊ฒ ๋‹ค..!)

์—ฌ๊ธฐ์„œ ์ถ”๊ฐ€๋˜๋Š”๊ฒŒ Enlighten-GAN์ธ๋ฐ, ์œ„์˜ ๊ทธ๋ฆผ์—์„œ๋„ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด ์–ด๋‘์šด ํ™”๋ฉด์—์„œ ์–ผ๊ตด์˜ ํŠน์ง•์ ์„ ์ฐพ๊ธฐ ํž˜๋“ค๊ธฐ ๋•Œ๋ฌธ์— GAN์„ ์‚ฌ์šฉํ•˜์—ฌ์„œ ์ด๋ฏธ์ง€๋ฅผ ๋ฐ๊ฒŒ ํ•˜๋Š” ์ฒ˜๋ฆฌ๋ฅผ ํ•ด์ฃผ์—ˆ๋‹ค.

๊ธฐ์กด์—๋Š” ์ „์ฒ˜๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜(gamma correction, difference of Gaussians, histogram equalization ๋“ฑ)์„ ์‚ฌ์šฉํ–ˆ์—ˆ๋Š”๋ฐ, ๋…ธ์ด์ฆˆ๋ฅผ ์ƒ์„ธํ™”์‹œํ‚ค๊ณ  ํ†ค์ด๋‚˜ ๋‹ค๋ฅธ ์ธ๊ณต๋ฌผ๋“ค์„ ์™œ๊ณก์‹œํ‚ค๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ์–ด GAN์„ ์‚ฌ์šฉํ–ˆ๋‹ค๊ณ  ํ•œ๋‹ค.

์ด๋ ‡๊ฒŒ ๋ฐ๊ธฐ ์ฒ˜๋ฆฌํ•ด์ค€ ์–ผ๊ตด ์ด๋ฏธ์ง€๋ฅผ ๋‹ค์‹œ MTCNN์„ ํ†ตํ•ด ์–‘์ชฝ ๋ˆˆ๊ณผ ์–‘์ชฝ ์ž…์ˆ ์„ ๋žœ๋“œ๋งˆํฌ๋ฅผ ์ฐพ์•„์„œ ๋ˆˆ์—์„œ half lower crop!, ์ž…์—์„œ half upper ๊นŒ์ง€ crop! ํ•ด์ค€ ํ›„ ๋‹ค์‹œ 224x224๋กœ resizeํ•ด์ค€๋‹ค.


Backbone Network with Spatial-Attention

๋‹ค์Œ์€ ResNet18 backbone ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ์ด๋‹ค. ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•˜๋Š” ์ด๋ฏธ์ง€๊ฐ€ ํ—ท๊ฐˆ๋ ค์„œ ๋‹ค์‹œ ์ˆ˜์ •ํ•ด์„œ ์ •๋ฆฌํ•ด๋ณด์•˜๋‹ค.

image

input์œผ๋กœ๋Š” 224x224x9 ๋กœ ์ด๋ฏธ์ง€ ์ „์ฒ˜๋ฆฌ์—์„œ ์–ป์€ face, eyes, mouth ์„ธ ์žฅ์˜ ์ด๋ฏธ์ง€๋ฅผ ์ž…๋ ฅํ•œ๋‹ค. ์ด ๊ตฌ์กฐ๋Š” group-convolution์„ ์‚ฌ์šฉํ•˜์—ฌ ๋…๋ฆฝ์ ์ธ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•œ๋‹ค๊ณ  ํ•œ๋‹ค. (์ด๋ฏธ์ง€๋กœ๋Š” 3๊ฐ€์ง€ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ ์ฒ˜๋Ÿผ ๋ณด์ด์ง€๋งŒ ํ•˜๋‚˜์˜ ๋ชจ๋ธ๋กœ ๋…๋ฆฝ์ ์ธ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ..!) ๋ณด๋ฉด, ์—ฐ๋‘์ƒ‰ ํ…Œ๋‘๋ฆฌ์˜ BOX๊ฐ€ Residual block์ด๊ณ , ๊ฐ Residual block์—์„œ SA(Spatial Attention) ๋ฅผ ์ˆ˜ํ–‰ํ•˜์—ฌ 1์ฐจ์›์˜ feature๋ฅผ ๋ฝ‘์•„์„œ 4๊ฐœ์˜ block์—์„œ ๋ฝ‘์€ feature๋“ค์„ concat ์‹œ์ผœ์„œ 960๊ฐœ์˜ feature vector์„ ๋ฝ‘์•„๋‚ธ๋‹ค. ์ด๋ ‡๊ฒŒ ๋ฝ‘์€ feature๋Š” ์ด๋ฏธ์ง€์—์„œ ์–ด๋Š ๋ถ€๋ถ„์ด ์ค‘์š”ํ•œ์ง€์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค.

- Spatial Attention ์—ฐ์‚ฐ

image

$W_sl$๊ณผ $W_s2$๋Š” ๊ฐ๊ฐ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ๊ณผ ๋ฒกํ„ฐ์ด๋‹ค. $L$์€ 2D-tensor๋กœ ์ฐจ์›์„ ๋ณ€๊ฒฝํ•ด์ค€ channel์ด๋ผ๊ณ  ๋ณด๋ฉด ๋œ๋‹ค. ์ž์„ธํ•œ๊ฑด ๋” ๊ณต๋ถ€ํ•ด์•ผ๊ฒ ์ง€๋งŒ, ์ˆ˜์‹์„ ๋ณด๋ฉด attention ๊ตฌํ•˜๋Š” ์ˆ˜์‹๊ณผ ๋™์ผํ•œ๋ฐ ์ข€ ๋‹ค๋ฅธ ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. attention์€ ๋ณธ๋ž˜, fc ๋ ˆ์ด์–ด์—์„œ ์—ฐ์‚ฐ์„ ํ•ด์ฃผ์—ˆ๋˜ ๋ฐ˜๋ฉด, ์—ฌ๊ธฐ์„œ๋Š” ํ–‰๋ ฌ๊ณฑ์œผ๋กœ ์—ฐ์‚ฐ์„ ํ•œ๋‹ค. ๋ˆ„๊ฐ€ ์ž์„ธํžˆ ์•„์‹œ๋Š”๋ถ„ ์žˆ์œผ๋ฉด ์—ฐ๋ฝ์ข€.. ์ฃผ์…”์š”


Channel Attention

image

๊ฐ face, eyes, mouth์—์„œ ๋ฝ‘์€ 960feature ๋กœ attention์—ฐ์‚ฐ์„ ํ†ตํ•ด ํ‰๊ท ์„ ๋‚ธ ํ•˜๋‚˜์˜ 960feature์„ ๋ฝ‘๊ฒŒ ๋œ๋‹ค. ์ด๋ฅผ ๋‹ค์‹œ 512 feature๋กœ ์ค„์ด๋ฉด ์ด feature๊ฐ€ ํ•œ ํ”„๋ ˆ์ž„์˜ feature์ด ๋œ๋‹ค

์ด๋ ‡๊ฒŒ ๊ฐ ํ”„๋ ˆ์ž„์— ๋Œ€ํ•œ ํ‰๊ท  feature๋“ค์„ ๊ณ„์‚ฐํ•ด ๊ตฌํ•ด๋‚ด์–ด 512 feature๋ฅผ ๋งŒ๋“ค๊ฒŒ ๋œ๋Š” ๊ฒƒ์ด๋‹ค.


Frame Attention

image

๊ฐ ํ”„๋ ˆ์ž„์— ๋Œ€ํ•œ 512 feature ์— ๋Œ€ํ•ด์„œ ๋˜ ๋‹ค์‹œ attention! ๊ทธ๋ฆฌ๊ณ  ์ตœ์ข…์ ์œผ๋กœ 7๊ฐœ์˜ label์— ๋Œ€ํ•ด classification ํ•ด์ค๋‹ˆ๋‹ค.

Noisy student training

image

AFEW ๋ฐ์ดํ„ฐ๋ฅผ ํ™•์ธํ•ด๋ณด๋‹ˆ ํ™•์‹คํžˆ ์–‘์ด ์ ์—ˆ๋‹ค. ์ด๊ฑธ๋กœ๋งŒ ํ•™์Šตํ•˜๋ฉด ์„ฑ๋Šฅ์ด ์ž˜ ์•ˆ๋‚˜์˜ค๊ธด ํ• ๊ฒƒ ๊ฐ™๋‹ค ใ…Žใ…Ž ๊ทธ๋ž˜์„œ ์ด ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•œ๊ฒŒ, Unlabelled๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ ธ์™€์„œ pseudo label์„ ํ•œ ํ›„ ํ•™์Šต์‹œํ‚จ ๋ชจ๋ธ๋กœ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ(noisy student ๊ฐœ๋…) ์ธ๋ฐ, ์—ฌ๊ธฐ์„œ๋Š” unlabel ๋œ ๋ฐ์ดํ„ฐ๋ฅผ BoLD(BodyLanguage Dataset)์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

์ˆœ์„œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  1. AFEW8.0์œผ๋กœ ํ•™์Šตํ•œ๋‹ค.
  2. ํ•™์Šตํ•œ ๊ฒƒ ์ค‘ ๊ฐ€์žฅ best model๋กœ BoLD ๋ฐ์ดํ„ฐ์— pseudo label์„ ์ง„ํ–‰ํ•œ๋‹ค.
  3. AFEW8.0์— ๋”ํ•˜์—ฌ label์ด ์ƒ์„ฑ๋œ BoLD ๋ฐ์ดํ„ฐ๋ฅผ ํ•ฉํ•œ ๋ฐ์ดํ„ฐ(์ตœ์ข…๋ฐ์ดํ„ฐ์…‹)๋กœ ํ•™์Šต์„ ์‹œํ‚ค๋Š”๋ฐ ์—ฌ๊ธฐ์— + noise๋ฅผ ์ถ”๊ฐ€ํ•ด์ค€๋‹ค
  4. ์ตœ์ข… ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ 3๋ฒˆ 4๋ฒˆ ๋ฐ˜๋ณตํ•œ๋‹ค.

(์‚ฌ์šฉํ•œ noise์—๋Š” dropout(0.5), contrast, brightness, translation, sharpness, flips ๋“ฑ์„ ๋žœ๋ค์œผ๋กœ ์‚ฌ์šฉํ•จ)

๐Ÿ˜† => Noisy Student ์ฐธ๊ณ 


Result

image

๊ฐ face, mouth, eyes ์— ๋Œ€ํ•ด์„œ afew8.0์œผ๋กœ ์„ฑ๋Šฅ์ธก์ •ํ•œ ๊ฒฐ๊ณผ์ธ๋ฐ์š”, ๊ฐ์ •ํ‘œํ˜„์— ์žˆ์–ด์„œ ๋ˆˆ์ด ๋งŽ์ด ์“ฐ์ด๋Š” sad๋Š” ์—ญ์‹œ eyes ์—์„œ ์„ฑ๋Šฅ์ด ์ข‹๊ณ , happy์™€ angry ๊ฐ™์ด ์ž…์˜ ํ‘œํ˜„์ด ์ค‘์š”ํ•œ ๊ฐ์ •์€ mouth์—์„œ ์„ฑ๋Šฅ์ด ์ข‹๊ฒŒ ๋‚˜์˜จ ๊ฒƒ์„ ํ™•์ธํ•˜์˜€์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ best ๋Š” ์ด ์„ธ๊ฐ€์ง€๋ฅผ ๋ชจ๋‘ ํ•ฉํ•ด์„œ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ! ์ด๋ผ๊ณ  ์ฃผ์žฅํ•˜๋„ค์š”


๋‹ค์Œ์€ iteration์„ ๋ฐ˜๋ณตํ•˜๋ฉด์„œ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋จ + ๊ท ํ˜•๋งž์ถ˜ unlabelled ๋ฐ์ดํ„ฐ์˜ ์ค‘์š”์„ฑ์„ ๋ณด์—ฌ์ฃผ๋Š” ์ •๋„๋กœ ํ•ด์„ํ•˜๋ฉด ๋  ๊ฒƒ ๊ฐ™์Œ

image


CK+๋ฐ์ดํ„ฐ์—์„  99.69%, AFEW ์—์„œ๋Š” 55.17%์˜ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜์˜€๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

image

component importance๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์—ญ์‹œ ์ด๊ฒƒ์ €๊ฒƒ ๋‹ค ๋ถ™์ด๊ณ  training๋„ ๋งŽ์ด ๋ฐ˜๋ณตํ•œ๊ฒŒ ์„ฑ๋Šฅ์ด ์ข‹๊ฒŒ ๋‚˜์˜ค๋„ค์š”.

image

๊ทธ๋Ÿผ์—๋„ AFEW8.0 ๋ฐ์ดํ„ฐ ๊ธฐ์ค€์œผ๋กœ 6๋“ฑ์ด๋ผ๋Š”๊ฒŒ.. ์ด๋ ‡๊ฒŒ ๋‹ค ๋ถ™์—ฌ๋„ ๊ฒฐ๊ตญ multi-modal model์„ ์ด๊ธธ ์ˆ˜ ์—†๋‹ค๋‹ˆโ€ฆ ๊ณต๋ถ€ํ•˜๋ฉด์„œ ๋งŽ์€ ์ƒ๊ฐ์„ ํ•˜๊ฒŒ ํ•˜๋Š” ๋…ผ๋ฌธ์ด์—ˆ์Šต๋‹ˆ๋‹ค. ์ตœ์‹  ๊ธฐ๋ฒ•๋“ค(GAN๊ณผ attention 3๊ฐ€์ง€)์„ ๋ชจ๋‘ ๋‹ค ๊ฐ–๋‹ค ์“ฐ๊ณ  ์‹ฌ์ง€์–ด extra dataset๊นŒ์ง€ ์‚ฌ์šฉํ•˜์˜€๋Š”๋ฐ AUDIO๋ฅผ ๊ฐ™์ด ์‚ฌ์šฉํ•œ model๊ณผ 10% ์”ฉ์ด๋‚˜ ์ฐจ์ด๊ฐ€ ๋‚œ๋‹ค. visual ๋งŒ์œผ๋กœ๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ๋Š” ๊ฒƒ์ธ์ง€ ์•„๋‹ˆ๋ฉด ์ƒˆ๋กœ์šด ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ œ์‹œํ•˜๊ณ  ๋ฐฉํ–ฅ์„ ์ „ํ™˜ํ•˜๋Š” ๊ฒƒ์„ ๊ณ ๋ฏผํ•ด๋ด์•ผ๊ฒ ๋‹ค..!

์˜ค๋Š˜๋„ ๋ฌด์‚ฌํžˆ ์„ธ๋ฏธ๋‚˜ ์™„๋ฃŒ!๐Ÿ˜ฝ

ํƒœ๊ทธ: , ,

์นดํ…Œ๊ณ ๋ฆฌ:

์—…๋ฐ์ดํŠธ:

๋Œ“๊ธ€๋‚จ๊ธฐ๊ธฐ