[Paper Review📃] Convolutional relation network for facial expression recognition in the wild with few-shot learning

Today's paper for review is ➡️ Zhu, Qing, et al. "Convolutional relation network for facial expression recognition in the wild with few-shot learning." Expert Systems with Applications 189 (2022): 116046.

This paper proposes a new metric aimed at improving facial expression recognition (FER) performance and compares it against other few-shot learning methods. Its contribution is a metric tailored to the FER domain, though I found the paper a little thin on details, which was a pity.

Still, for someone like me who wants to bring other learning paradigms to FER, it came as a very welcome read, haha.

Framework overview

First, let's look at the overall architecture of the model proposed in the paper.

image

As in standard few-shot learning, a support set and a query set are fed in as input. 1) Both pass through feature embedding. 2) The support features then go through depth average pooling (DAP), and the pooled support feature is concatenated with the query feature. 3) The concatenated feature passes through the final convolution layers, and that output is multiplied by the output obtained without step 2).

Let's go through what each of these stages means, one at a time!

Stage1 : Feature Embedding

image

The first stage extracts features from each of the incoming support set images and query set images. The model itself is built on a relation network, and the layers for this first stage are four convolution blocks, structured as follows.

  • Relation network convolution

    Each convolution block has

    3x3 convolution with 64 filters

    Batch normalization

    ReLU activation function layer

    2x2 max pooling
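To get a feel for how these blocks shrink the input, here is a minimal sketch that tracks the feature map shape through four such blocks. The padding of 1, the stride of 1, and the 84x84 input size are my assumptions for illustration; the post does not state them, and the function names are made up.

```python
def conv_block_shape(h, w, kernel=3, padding=1, pool=2):
    """Spatial size after one block: 3x3 conv (stride 1) followed by 2x2 max pooling.

    padding=1 is an assumption; it keeps the conv output the same size as its input.
    """
    # Convolution with stride 1: out = in + 2*padding - kernel + 1
    h = h + 2 * padding - kernel + 1
    w = w + 2 * padding - kernel + 1
    # Non-overlapping max pooling floors the division
    return h // pool, w // pool


def embedding_shape(h, w, num_blocks=4, channels=64):
    """Shape (channels, height, width) after stacking num_blocks blocks."""
    for _ in range(num_blocks):
        h, w = conv_block_shape(h, w)
    return channels, h, w


# Hypothetical 84x84 input image
print(embedding_shape(84, 84))
```

Under these assumptions an 84x84 image ends up as a 64-channel 5x5 feature map (84 → 42 → 21 → 10 → 5).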

Writing the feature embedding function as $f_\theta$,

  • feature map of support set : $f_\theta(S^{(i)})$
  • feature map of query set : $f_\theta(Q^{(j)})$

Stage2 : Salient Discriminative Feature Learning

image

1) Depth Average Pooling (DAP)

image

I found this part hard to interpret. Since it is called depth average pooling, I assumed it must be channel attention. But then it occurred to me that if it really were channel attention, the authors would have used the established term Global Average Pooling, so I went back and read the paper carefully. It turns out it is not channel attention. Several support set images sharing the same label come in, and DAP is applied across the images that belong to the same label! So the multiple support set images for one label are pooled into a single feature map. The paper does not spell this out, but judging from its equations and text, this reading should be right. I also think it is a reasonable approach, in that it captures the relationship among the support samples through average pooling.

DAP is applied only to the support set images. The reason given is that this pooling extracts the "commonality" shared by the images and discards the information in the parts that are not similar.
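My reading of DAP can be sketched as an element-wise average over the feature maps of the support images that share a label. This is a minimal pure-Python sketch of that interpretation, not the authors' code; the function name `depth_average_pool` and the toy 2x2 maps are made up for illustration.

```python
def depth_average_pool(feature_maps):
    """Average several same-shaped 2D feature maps element-wise into one map.

    feature_maps: list of maps, one per support image of the same label,
    each map given as a list of rows.
    """
    if not feature_maps:
        raise ValueError("need at least one feature map")
    n = len(feature_maps)
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    # Sum first, divide once at the end
    sums = [[0.0] * cols for _ in range(rows)]
    for fmap in feature_maps:
        for r in range(rows):
            for c in range(cols):
                sums[r][c] += fmap[r][c]
    return [[value / n for value in row] for row in sums]


# Three toy support feature maps for one emotion label (made-up numbers)
support = [
    [[1.0, 0.0], [2.0, 4.0]],
    [[1.0, 3.0], [2.0, 0.0]],
    [[1.0, 0.0], [2.0, 2.0]],
]
pooled = depth_average_pool(support)
```

Positions where the three maps agree (the left column here) survive unchanged, while positions where they differ are smoothed toward their mean, which fits the "commonality" intuition.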

2) JS Divergence

I covered this concept while reviewing the GAN paper and writing up the related notions, so ➡️ my posts "[GAN study] KL-divergence & JS-divergence & Maximum Likelihood Estimation" (concept notes) or "[Paper summary📃] Generative Adversarial Nets" should help you get up to speed quickly.

$D_{JS}^{i,j} (P(f^a_\theta(S^{(i)})),P(f^a_\theta(S^{(j)})))$

The advantages of this metric are:

  • Computing the loss with JS divergence improves the model's ability to tell emotion images apart.
  • It can penalize the model so that different classes end up far apart.
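For reference, the JS divergence between two distributions $P$ and $Q$ is $\frac{1}{2}KL(P\|M) + \frac{1}{2}KL(Q\|M)$ with $M = \frac{1}{2}(P+Q)$. A minimal sketch follows. Turning a feature map into a distribution with a softmax over its flattened values is my assumption about what $P(\cdot)$ does; this post does not record how the paper defines it, and the feature values are made up.

```python
import math


def softmax(values):
    """Turn a flat list of feature values into a probability distribution."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy flattened feature maps for two emotion classes (made-up numbers)
happy = softmax([2.0, 0.5, 0.1, 0.1])
angry = softmax([0.1, 0.1, 0.5, 2.0])

same = js_divergence(happy, happy)  # identical distributions
diff = js_divergence(happy, angry)  # dissimilar distributions
```

Identical distributions give a divergence of zero, and dissimilar ones give a positive value capped at $\log 2$, so the term has a natural scale for pushing different classes apart.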

The DAP expression

  • Feature map of the support set after DAP : $f^a_\theta(S^{(j)})$

So the final form of the loss function used in stage 2 is:

$L_{dist}^k = 1 - \frac{1}{N^2} \sum^N_{i=1} \sum^N_{j=1} [y_k^{i,j} - D_{JS}^{i,j} (P(f^a_\theta(S^{(i)})),P(f^a_\theta(S^{(j)})))]^2$

Stage3 : Emotion Similarity Learning

image

์ด ๋‹จ๊ณ„์—์„œ๋Š”, ๋ฐ”๋กœ ์ „ ๋‹จ๊ณ„์—์„œ ๊ตฌํ•œ DAP ๋ฅผ ๊ฑฐ์นœ support set feature ๊ณผ, query set feature ์„ concatenation ์„ ์‹œ์ผœ relation network์˜ ๋‚˜๋จธ์ง€ 4๊ฐœ์˜ layer์„ ๊ฑฐ์น˜๊ฒŒ ํ•œ๋‹ค.

๋˜ํ•œ, dap ๋ฅผ ๊ฑฐ์น˜์ง€ ์•Š์€ support set๊ณผ query set์˜ similarity๋ฅผ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด, concate ํ•œ feature๊ฐ€ ๋‚˜์˜จ feature์™€ ๊ทธ๋ƒฅ 8๊ฐœ์˜ layer์„ ๊ฑฐ์ณ๋‚˜์˜จ feature์˜ ๊ณฑ์œผ๋กœ loss๋ฅผ ๊ณ„์‚ฐํ•ด์ค€๋‹ค. ๋”ฐ๋ผ์„œ, ์ด๋ฒˆ ๋‹จ๊ณ„์—์„œ ๊ณ„์‚ฐํ•˜๋Š” loss function์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

$L_r^k = \frac{1}{N^2} \sum^N_{i=1} \sum^N_{j=1} [y_k^{i,j} - r^{i,j} (g_\varphi [C(f^a_\theta(S^{(i)}),f_\theta(Q^{(j)}))])]^2$
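The relation loss above is a mean squared error between the predicted relation scores and the pair labels. Here is a minimal sketch, assuming the usual relation network convention that the label is 1 when the pair shares an emotion class and 0 otherwise (the label is not defined explicitly in this post). The scores are made-up numbers standing in for the network output.

```python
def relation_loss(scores, labels):
    """Mean squared error over an N x N grid of pairs.

    scores[i][j]: predicted relation score for pair (i, j), assumed in [0, 1]
    labels[i][j]: 1.0 if the pair shares a class, else 0.0 (assumed convention)
    """
    n = len(scores)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += (labels[i][j] - scores[i][j]) ** 2
    return total / (n * n)


# Toy 2 x 2 example with hypothetical scores
scores = [[0.5, 0.0], [0.5, 1.0]]
labels = [[1.0, 0.0], [0.0, 1.0]]
loss = relation_loss(scores, labels)
```

Two of the four pairs are off by 0.5, so the loss is (0.25 + 0.25) / 4 = 0.125. A network that scores every pair correctly would reach 0.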

Final CRN loss

$L_{CRN} = \frac{1}{K} \sum^K_{k=1} (L_r^k + \lambda L_{dist}^k )$
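Putting the two terms together, the final loss averages the weighted sum over the $K$ values of $k$, with $\lambda$ setting how much the JS divergence term counts. A minimal sketch, where the value of $\lambda$ and the per-$k$ losses are placeholder numbers I made up:

```python
def crn_loss(relation_losses, dist_losses, lam):
    """Average of (L_r^k + lam * L_dist^k) over k = 1..K."""
    if len(relation_losses) != len(dist_losses):
        raise ValueError("both lists need one entry per k")
    if not relation_losses:
        raise ValueError("need at least one term")
    k = len(relation_losses)
    return sum(lr + lam * ld for lr, ld in zip(relation_losses, dist_losses)) / k


# Hypothetical per-k losses and weight
total = crn_loss([0.25, 0.75], [0.5, 1.5], lam=0.5)
```

Setting `lam=0` recovers the plain relation loss, which is a handy way to think about the with/without-JS comparison.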

Experiment details

The paper notes that the number of images per emotion label differs widely in each dataset. To deal with this imbalance, the labels with similar image counts were used for training and the labels with few images were used for testing.

image

However, the paper does not say clearly whether all the training images listed in the table were actually used, so I cannot be sure. The values of n and k for the n-shot k-way setting are not stated either.. hmm

Experiment Results

  • RAF-DB

image

  • FER2013

image

  • SFEW

image

Generated feature maps for different emotion categories, with and without the JS divergence

image

Model Ablation

image
