[Paper Review📃] Facial expression and attributes recognition based on multi-task learning of lightweight neural networks

Facial expression and attributes recognition based on multi-task learning of lightweight neural networks

- Multi-task EfficientNet-B2 -

Paper😙

This paper proposes a method that takes several frames from a video, computes the mean, std, min, and max of the features extracted from those frames, and uses them to recognize facial expressions.

Key points

  • To learn face attribute classification (age, gender, ethnicity) and face recognition, the network is trained on face images cropped with no margin.

  • Relatively fast and lightweight architectures (MobileNet, EfficientNet, RexNet) are used as the baseline networks.

  • The paper stresses how important fine-tuning the network is for predicting facial expressions (emotions).

The paper proposes what it calls multi-task networks, structured as shown below.

image

ํ•˜๋‚˜์”ฉ ์‚ดํŽด๋ณด์ž๋ฉด, ์œ„์—์„œ ์–ธ๊ธ‰ํ•œ MobileNet, EfficientNet, RexNet์„ facial recognition CNN(์–ผ๊ตด์ธ์‹ ๋ชจ๋ธ)์˜ backbone network๋กœ ์‚ฌ์šฉ์„ ํ•˜๊ณ ์žˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด CNN์„ ๊ฑฐ์ณ์„œ ๊ฐ๊ฐ ์„ฑ๋ณ„, ๋‚˜์ด, ๊ตญ์ , ๊ฐ์ • ์ด 4๊ฐ€์ง€์— ๋Œ€ํ•ด ์˜ˆ์ธก์„ ์ง„ํ–‰ํ•˜๊ณ  ์žˆ๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

First, the input image for the backbone (the face recognition CNN) is prepared by finding the face in the incoming frame, cutting out only the face region, and stretching it to 224x224 (cropped so that no margin is left). MTCNN face detection is used to find the face. MTCNN has a reputation for being slow, so for real-time use I am thinking of trying FaceBoxes instead.
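The crop-and-stretch step can be sketched roughly as follows. This is only an illustration under assumptions: the bounding box is taken as already given by a detector, `crop_face` is a hypothetical helper name, the target size is a parameter (224 here), and nearest-neighbour resampling is used just to keep the example free of image libraries.

```python
import numpy as np


def crop_face(frame, box, size=224):
    """Crop box = (x1, y1, x2, y2) with no margin and stretch it to size x size.

    Nearest-neighbour resampling keeps the sketch dependency-free;
    a real pipeline would more likely use an image library's resize.
    """
    height, width = frame.shape[:2]
    x1, y1, x2, y2 = box
    # Clamp the box to the frame so slicing never goes out of bounds.
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(width, x2), min(height, y2)
    if x2 <= x1 or y2 <= y1:
        raise ValueError("empty face box")
    face = frame[y1:y2, x1:x2]
    # Map every output pixel back to a source row/column index.
    rows = np.arange(size) * face.shape[0] // size
    cols = np.arange(size) * face.shape[1] // size
    return face[rows][:, cols]


frame = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy video frame
frame[100:300, 200:350] = 255                    # pretend this region is the face
face = crop_face(frame, (200, 100, 350, 300))    # box as a detector might return it
print(face.shape)
```

Note that a box that is taller than it is wide gets stretched to a square, so the aspect ratio of the face is not preserved.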

image

The SOTA results are as follows.

image

For the 7 emotion classes, accuracy on the AffectNet dataset was pushed up to 66.34%.

One more thing worth pointing out is that the layer that predicts age is deeper than the gender and ethnicity layers. According to the paper, gender and ethnicity have only a handful of categories, whereas age runs from 1 to N years in many fine-grained steps, so extra layers were added.

Also, as mentioned earlier, emotion recognition performs better when trained through fine-tuned layers, so the features pass through that part once before the prediction is made.

image

The green box in the figure above uses what are called frozen weights. Freezing weights means that, when you take an existing network architecture together with its pretrained weights and train on top of it, the existing weights are kept as they are rather than being overwritten by newly updated values; the technique is used because this is said to give better performance.

I still need to study freezing a bit more, but it makes sense to me too: this architecture already reached SOTA performance, and the weights trained with it should be well suited to it. So using those values as they are seems like the better choice, right?!

So in this network the face recognition CNN is kept frozen while the newly added layers are trained for a few epochs, and after that everything is trained together to update the weights. The number of epochs trained with frozen weights differs for the gender, age, ethnicity, and emotion layers, so check the paper!
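Here is a toy sketch of that two-stage schedule, assuming the usual reading of "frozen": the pretrained part receives no updates in stage 1, and everything is updated in stage 2. The model is a tiny linear stand-in rather than a real CNN, and `train`, `loss_and_grads`, the step counts, and the learning rate are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-part model: prediction = (x @ backbone) @ head
params = {
    "backbone": rng.normal(size=(8, 4)),  # stands in for pretrained weights
    "head": np.zeros((4, 1)),             # new layer, zero-initialised here
}
x = rng.normal(size=(32, 8))
y = rng.normal(size=(32, 1))


def loss_and_grads(params, x, y):
    """Mean squared error and its gradients for both parameter groups."""
    hidden = x @ params["backbone"]
    error = hidden @ params["head"] - y
    loss = float(np.mean(error ** 2))
    d_out = 2.0 * error / len(x)
    grads = {
        "head": hidden.T @ d_out,
        "backbone": x.T @ (d_out @ params["head"].T),
    }
    return loss, grads


def train(params, x, y, steps, trainable, lr=0.01):
    """Gradient descent that only touches the groups named in `trainable`."""
    for _ in range(steps):
        _, grads = loss_and_grads(params, x, y)
        for name in trainable:
            params[name] -= lr * grads[name]
    return loss_and_grads(params, x, y)[0]


pretrained = params["backbone"].copy()
loss_start = loss_and_grads(params, x, y)[0]
# Stage 1: backbone frozen, only the new head learns.
loss_stage1 = train(params, x, y, steps=50, trainable=["head"])
backbone_unchanged = np.array_equal(params["backbone"], pretrained)
# Stage 2: unfreeze and train everything together.
loss_stage2 = train(params, x, y, steps=50, trainable=["head", "backbone"])
print(backbone_unchanged, loss_start, loss_stage1, loss_stage2)
```

Freezing here is nothing more than leaving a parameter group out of the update loop; deep learning frameworks expose the same idea through a per-layer or per-tensor "trainable" switch.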


Emotion Recognition process

I summarized the emotion recognition process in a simple figure. As shown below, the face is located in every frame of the sequence (for non-sequence data, the same is done on a single frame) and cropped to 224x224 with no margin. Features are then extracted from each face image found this way, the mean, std, min, and max of the feature values are computed, and the four results are concatenated.

image


It is quicker to understand from the code, so see the code below!

image
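For reference, the aggregation step can be sketched in a few lines of NumPy. This is my own minimal reconstruction rather than the code from the screenshot, and it assumes the statistics are taken across frames for each feature dimension; `video_descriptor` is a hypothetical name.

```python
import numpy as np


def video_descriptor(frame_features):
    """Concatenate mean, std, min and max taken over the frame axis.

    frame_features: array of shape (num_frames, feature_dim), one row per
    cropped face. Returns a vector of length 4 * feature_dim.
    """
    frame_features = np.asarray(frame_features, dtype=np.float32)
    if frame_features.ndim != 2 or frame_features.shape[0] == 0:
        raise ValueError("expected a non-empty (num_frames, feature_dim) array")
    return np.concatenate([
        frame_features.mean(axis=0),
        frame_features.std(axis=0),
        frame_features.min(axis=0),
        frame_features.max(axis=0),
    ])


# Three frames with a 2-dimensional feature each (toy numbers).
features = np.array([[1.0, 4.0], [2.0, 4.0], [3.0, 4.0]])
descriptor = video_descriptor(features)
print(descriptor.shape)  # (8,)
```

Whatever the number of frames, the descriptor always has four times the feature dimension, which is what lets a fixed-size classifier sit on top of videos of different lengths.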

I tried it out on IU's pretty face.

image

First of all, it finds the face well and does a good job aligning tilted faces. The bounding box in the video on the right is a rectangle taller than it is wide, and the face is tilted, but in the cropped image on the left the face has been straightened to vertical and stretched to fill a 224-size image.

This is exactly the image that goes into the CNN as input for feature extraction.

In the console you can see numbers like [0.00959151 0.3924815 0.6546932 …; this is the list of mean, std, min, and max feature values for one face in one frame.


Experiments

Facial emotion recognition performance on the AffectNet dataset! A whopping 62.42%, achieving SOTA!

image

Since the whole point of this model is that it can run on mobile devices (which is why lightweight backbone architectures are used), the paper also checks whether it runs well on a CPU and how fast. MobileNet-v1 is the fast one. Looking at the code, it processes a single image at a time, but what matters to me is real-time video processing. The model was trained on AFEW, so I got excited thinking a video demo would be available, but there was only a single-image demo. So I am modifying the code so it can handle video!
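To check whether a per-frame pipeline is fast enough for real-time video on a CPU, a small timing helper like the one below is handy. It is only a sketch: `measure_fps` is a hypothetical helper, and `dummy_model` is a placeholder for the real detection and feature extraction, so the number it prints says nothing about the actual networks.

```python
import time
import numpy as np


def measure_fps(process_frame, frames, warmup=2):
    """Return the average frames per second of process_frame over frames."""
    for frame in frames[:warmup]:
        process_frame(frame)  # warm-up runs are not timed
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


def dummy_model(frame):
    """Placeholder for detection + feature extraction on one frame."""
    return frame.astype(np.float32).mean()


frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(20)]
fps = measure_fps(dummy_model, frames)
print(f"{fps:.1f} frames per second")
```

Swapping `dummy_model` for the real per-frame function gives a quick way to compare backbones or face detectors under the same conditions.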

My research topic centers on FER, with GAN models as a supporting piece, so from now on I plan to read and post about FER papers often..!

One way or another, let's make it through the master's program!
