[Paper Review 📃] Exploring Emotion Features and Fusion Strategies for Audio-Video Emotion Recognition


- Audio-Video Emotion Recognition -

Paper 😙 As of December 14, 2021, SOTA performance on the AFEW dataset

This paper achieves SOTA performance by fusing two modalities, audio and video.

Looking at the paper's main CONTRIBUTIONS:

1) Choosing an appropriate dataset and an appropriate model for transfer learning of the face CNN matters. (Every FER paper says this, but the authors stress that pre-training the face model well is the key to better performance.)

2) Three attention mechanisms are devised for fusing the visual and audio features.

3) Factorized bilinear pooling (FBP) is applied for cross-modal feature fusion. (Here, cross-modal simply means fusing the audio modality with the video modality.)


Pipeline of audio-video emotion recognition

The architecture proposed in the paper is shown below.

image

Looking at the figure: the audio and the video frames are each preprocessed, features are extracted with a separate CNN for each modality, attention mechanisms are applied, and the resulting features are fused to obtain the final representation.
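To make this flow concrete, here is a minimal structural sketch of such a pipeline in PyTorch. The module names and the `out_dim` attribute are placeholders I made up for illustration; this is not the paper's code.

```python
import torch.nn as nn

class AVEmotionPipeline(nn.Module):
    """Structural sketch: audio/visual encoders -> intra-modal attention
    -> cross-modal fusion -> classifier."""
    def __init__(self, audio_cnn, visual_cnn, audio_attn, visual_attn, fusion, num_classes=7):
        super().__init__()
        self.audio_cnn = audio_cnn      # e.g. AlexNet on spectrograms
        self.visual_cnn = visual_cnn    # e.g. IR50 face embeddings per frame
        self.audio_attn = audio_attn    # aggregates audio segment features
        self.visual_attn = visual_attn  # aggregates per-frame features
        self.fusion = fusion            # cross-modal fusion, e.g. FBP
        self.classifier = nn.Linear(fusion.out_dim, num_classes)

    def forward(self, spectrograms, face_frames):
        a = self.audio_cnn(spectrograms)   # per-segment audio features
        v = self.visual_cnn(face_frames)   # per-frame visual features
        a = self.audio_attn(a)             # single audio vector
        v = self.visual_attn(v)            # single visual vector
        z = self.fusion(a, v)              # fused representation
        return self.classifier(z)
```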

Let's look at each step in detail!

Preprocessing

For the audio and image preprocessing, I'll simply quote the paper (a small preprocessing sketch follows the list below).

  • AUDIO

Speech spectrogram -> use hamming window with 40msec window size, 10msec shift

Log mel-spectrogram -> calculate its deltas and delta-deltas

  • VIDEO

use dlib toolbox for face detection and alignment

Extend face bbox with ratio of 30%

Crop face and scale to 224 x 224

(If a face is not detected in a frame, that frame is not used.)
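Since the paper only states the settings above, here is a rough sketch (my own, not from the paper) of how the audio side could be computed with librosa; the 16 kHz sample rate and 64 mel bands are assumptions.

```python
import librosa
import numpy as np

def audio_features(wav_path, n_mels=64):
    y, sr = librosa.load(wav_path, sr=16000)
    win = int(0.040 * sr)   # 40 ms Hamming window
    hop = int(0.010 * sr)   # 10 ms shift
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=win, win_length=win, hop_length=hop,
        window="hamming", n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                 # log mel-spectrogram
    delta = librosa.feature.delta(log_mel)             # deltas
    delta2 = librosa.feature.delta(log_mel, order=2)   # delta-deltas
    return np.stack([log_mel, delta, delta2], axis=0)  # 3 x n_mels x T, image-like
```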


Feature Extraction


As we saw in the overall architecture, different CNN models are used to extract features from the audio and the visual images. For audio, the feature is taken from the last pooling layer of AlexNet, so it comes out as an H x W x C feature map.

For the visual side, the feature is taken from the last FC layer, so the output is a set of n vectors, one per frame. (The paper experiments with three models, VGGFace, ResNet18, and IR50; IR50 performed best, which is presumably why the final architecture figure is drawn with IR50.)
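As a rough sketch of the audio encoder: taking the output of torchvision AlexNet's convolutional stack (which ends in the last max-pooling layer) gives exactly the H x W x C feature map described above. The input shape and the flattening into a sequence of positions are my own illustration; the face-CNN side is only noted in a comment since IR50 is not available in torchvision.

```python
import torch
import torchvision.models as models

alexnet = models.alexnet(weights=None)   # in practice, loaded with weights trained for the audio task
audio_encoder = alexnet.features         # conv stack ending at the last MaxPool2d

spec = torch.randn(1, 3, 224, 224)       # log-mel + deltas stacked as 3 "image" channels
feat_map = audio_encoder(spec)           # -> (1, 256, 6, 6): C x H x W feature map
tokens = feat_map.flatten(2).transpose(1, 2)   # (1, 36, 256): H*W positions as a sequence

# Visual side (schematic): a face CNN such as IR50 maps each aligned 224x224
# face crop to one embedding vector, giving n vectors for n frames.
```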

Three Attention Mechanisms

Now let's look at the intra-modal part of the feature-fusion module. Here audio and video are still processed separately; the two modalities are not combined yet.

For the attention equations, and for why the similarity is obtained by taking a dot product between the features and the weights of an FC layer ($W$ in the equations below), see my FAN paper review, where this was already covered!

image

Besides self-attention and relation-attention, a new concept called transformer-attention is added! Learning a new concept is always fun 😆😆

Transformer-attention keeps the same idea as the existing attention, but to reduce the feature dimension it first passes the features through the following linear projection, and then computes the similarity.

The feature multiplied by $W_{m \times d}$ to reduce its dimension is denoted $f'_i$, and the squared dot product of these reduced features is used as the attention weight.
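To summarize the intra-modal aggregation in code, here is a minimal sketch of a FAN-style self-attention and of the transformer-style variant described above (relation-attention is omitted). The sigmoid scoring and the way the squared similarities are normalized into per-frame weights are my assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    """Sketch of FAN-style self-attention: each frame feature f_i gets a
    scalar score from an FC layer, and the clip-level feature is the
    score-weighted average of the f_i."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, 1)

    def forward(self, feats):                   # feats: (T, dim)
        alpha = torch.sigmoid(self.fc(feats))   # (T, 1) per-frame weights
        return (alpha * feats).sum(0) / alpha.sum(0)

class TransformerStyleAttention(nn.Module):
    """Sketch of the transformer-style variant: features are first projected
    to a lower dimension d with W (m x d), and squared dot-product
    similarities of the projected features give the frame weights."""
    def __init__(self, dim, d):
        super().__init__()
        self.proj = nn.Linear(dim, d, bias=False)   # W_{m x d}

    def forward(self, feats):                   # feats: (T, dim)
        f = self.proj(feats)                    # (T, d) reduced features
        sim = (f @ f.t()) ** 2                  # squared dot-product similarities
        alpha = sim.mean(dim=1, keepdim=True)   # one weight per frame (assumed pooling)
        alpha = alpha / alpha.sum()
        return (alpha * feats).sum(0)
```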


FBP (Factorized Bilinear Pooling) module

image

Before getting to that concept, let's first look at what bilinear pooling is.

Bilinear pooling is a method that lets every feature of two different modalities, such as speech and visual (vectors of different dimensions), interact in the computation.

As shown below, the audio feature vector $a$ lives in $m$ dimensions and the video feature vector $v$ lives in $n$ dimensions. How can vectors of two different dimensions be combined? Here, a projection matrix is used.

image

As in the equation below, to take an inner product of $a$ and $v$, a projection matrix $W$ of size $m \times n$ is inserted between them.

image
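For reference, the bilinear pooling equation in the figure above is typically written as

$$z_i = a^{\top} W_i\, v, \qquad a \in \mathbb{R}^{m},\; v \in \mathbb{R}^{n},\; W_i \in \mathbb{R}^{m \times n},$$

so producing an $o$-dimensional output $z = [z_1, \dots, z_o]$ requires $o \cdot m \cdot n$ parameters.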

This way, every element of the two vectors interacts in the computation, so all pairwise relationships between elements can be captured. However, as the figure shows, the parameter block behind $z$ is very large: the computation is expensive and there is a risk of overfitting.

FBP is one of the methods proposed to address this problem.

Below is my summary of the equations. Since this paper only describes FBP briefly, I wrote this part referring to the following paper.

image

The equation above looks complicated, but put simply: because of the drawbacks of bilinear pooling, matrix factorization tricks are used to replace the projection matrix $W_i$ with two low-rank matrices in the computation.
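As a minimal sketch of this factorization: each $W_i$ is approximated by a rank-$k$ product $U_i V_i^{\top}$, so $z_i = \mathbf{1}^{\top}(U_i^{\top} a \circ V_i^{\top} v)$, which can be implemented with two linear layers, an element-wise product, and sum-pooling over $k$. The power/L2 normalization at the end follows common FBP implementations rather than this paper specifically.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FBP(nn.Module):
    """Factorized bilinear pooling: W_i ~ U_i V_i^T (rank k), so
    z_i = sum over k of (U_i^T a) * (V_i^T v)."""
    def __init__(self, audio_dim, video_dim, out_dim, k=4):
        super().__init__()
        self.k, self.out_dim = k, out_dim
        self.U = nn.Linear(audio_dim, k * out_dim, bias=False)
        self.V = nn.Linear(video_dim, k * out_dim, bias=False)

    def forward(self, a, v):              # a: (B, audio_dim), v: (B, video_dim)
        joint = self.U(a) * self.V(v)     # (B, k*out_dim) element-wise interaction
        joint = joint.view(-1, self.out_dim, self.k).sum(dim=2)  # sum-pool over rank k
        # power and L2 normalization, as commonly used with FBP
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-12)
        return F.normalize(joint)
```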

Exploration of Emotion Features

Table 1 compares the three CNN models (VGGFace, ResNet18, IR50), each transfer-learned on the FER+, RAF-DB, and AffectNet datasets, and shows that IR50 transfer-learned on AffectNet performs best.

image

Table 2 compares combinations of self-, relation-, and transformer-attention for audio and visual, and shows that the best result comes from using transformer-attention for both modalities.

image

This week's seminar is done! 😽
