[Paper Review📃] When Facial Expression Recognition Meets Few-Shot Learning: A Joint and Alternate Learning Framework

When Facial Expression Recognition Meets Few-Shot Learning: A Joint and Alternate Learning Framework

-EGS-Net-

Paper😙

This paper proposes the Emotion Guided Similarity Network (EGS-Net), consisting of an emotion branch and a similarity branch, based on a two-stage learning framework. In the first stage, the similarity branch is jointly trained with the emotion branch in a multi-task fashion. The regularization from the emotion branch prevents the similarity branch from overfitting to sampled base classes that are highly overlapped across different episodes. In the second stage, the emotion branch and the similarity branch play a two-student game to alternately learn from each other, thereby further improving the inference ability of the similarity branch on unseen compound expressions.

Before discussing the whole architecture, let's take a quick look at what few-shot learning is😀

For more details about few-shot learning, see this post! -> what is few-shot learning?


Few shot learning

This concept comes from the observation that “humans learn new concepts with very little supervision!”. Deep learning models have traditionally been trained on large-scale datasets such as ImageNet and MS COCO. Some authors argued that even these are not enough and added JFT-300M, a large unlabelled dataset. The need for ever larger datasets has been a long-standing problem, and few-shot learning is one of the proposed solutions to it.

In few-shot learning, the model is expected to perform well even with a small-scale dataset. The picture below is an example of how it works.

In few-shot learning, each task uses only N of the classes rather than all of them, with K samples per class; this is the support set. Say we train on 3 classes with 2 samples each: this is called “3-way 2-shot learning”.

If you want to learn 2 classes with 1 sample each, it is called “2-way 1-shot learning”.

The batch set (or query set) works like a validation set: the model's goal is to predict the query set's samples correctly.

The image below shows the training tasks in 3-way 2-shot learning. A test task follows the same process as a training task, but uses unseen classes.

image [Image source] https://www.borealisai.com/en/blog/tutorial-2-few-shot-learning-and-meta-learning-i/
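To make the N-way K-shot setup concrete, here is a minimal sketch of how one episode could be sampled. The function name `sample_episode`, the `n_query` parameter, and the toy dataset are all hypothetical and only for illustration; they are not taken from the paper.

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query, rng=random):
    """Sample one N-way K-shot episode.

    dataset: dict mapping a class name to a list of samples (hypothetical layout).
    Returns (support, query), each a list of (sample, label) pairs.
    """
    # Pick N classes for this episode, not all the classes.
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label in classes:
        # Draw K support samples plus some disjoint query samples per class.
        picked = rng.sample(dataset[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Toy data: 5 made-up classes with 10 sample IDs each.
toy = {f"class_{c}": [f"class_{c}_img_{i}" for i in range(10)] for c in range(5)}

# "3-way 2-shot" with 1 query sample per class.
support, query = sample_episode(toy, n_way=3, k_shot=2, n_query=1,
                                rng=random.Random(0))
print(len(support), len(query))  # 6 support samples, 3 query samples
```

For a test task, the same function would simply be called on a dataset whose classes were never used during training.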


To recap briefly, the main points are:

  • Performs well even with a small-scale dataset (e.g., mini-ImageNet: 100 classes, 600 images each)
  • Trains with a few images per class, not all of them
  • Can run inference on unseen classes

Now let's move on to EGS-Net!

image


1st stage : Joint Learning

image


Emotion branch

In the emotion branch, global representative features are extracted. These features are used as a regularizer for the similarity branch.

image
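As a rough illustration of joint (multi-task) training, the sketch below adds an emotion classification loss to the similarity branch's episode loss, so the emotion term acts as a regularizer. This is only a sketch under assumptions: the use of cross-entropy for both terms, the weighted sum, and the `weight` parameter are illustrative choices and are not confirmed by this post as the paper's exact objective.

```python
import math

def softmax_cross_entropy(logits, target):
    """Cross-entropy of a softmax over `logits` against the index `target`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def joint_loss(emotion_logits, emotion_label,
               similarity_scores, query_label, weight=1.0):
    """Assumed multi-task objective: similarity loss + weight * emotion loss.

    emotion_logits: scores over the basic emotion classes (emotion branch).
    similarity_scores: scores of a query against the N episode classes
                       (similarity branch).
    weight: hypothetical trade-off factor, not a value from the paper.
    """
    loss_emotion = softmax_cross_entropy(emotion_logits, emotion_label)
    loss_similarity = softmax_cross_entropy(similarity_scores, query_label)
    return loss_similarity + weight * loss_emotion

# Example: 7 basic emotion classes, 5-way episode, made-up scores.
loss = joint_loss([2.0, 0.1, 0.3, 0.0, 0.2, 0.1, 0.4], 0,
                  [0.9, 0.2, 0.1, 0.3, 0.0], 0)
print(loss)
```

In a real training loop both terms would be backpropagated through the shared parts of the network, which is where the regularizing effect would come from.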


Similarity branch

This branch is the one that uses the few-shot learning concept. Features from the support set and the query set are compared with a metric-based computation (cosine similarity).

image
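The metric-based comparison can be sketched as follows: a query feature is assigned to the class whose support features it is most similar to under cosine similarity. Averaging the support features of each class into one prototype is an assumption made here to keep the example short; the post does not say how EGS-Net aggregates the K support samples, and the feature vectors and labels below are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def class_prototypes(support):
    """Average the support features of each class (an illustrative assumption)."""
    by_label = {}
    for feature, label in support:
        by_label.setdefault(label, []).append(feature)
    return {label: [sum(col) / len(feats) for col in zip(*feats)]
            for label, feats in by_label.items()}

def classify(query_feature, support):
    """Return the label whose prototype is most cosine-similar to the query."""
    prototypes = class_prototypes(support)
    return max(prototypes,
               key=lambda label: cosine_similarity(query_feature, prototypes[label]))

# 2-way 2-shot toy episode with hypothetical 2-D features.
support = [([1.0, 0.1], "happy"), ([0.9, 0.2], "happy"),
           ([0.1, 1.0], "sad"), ([0.2, 0.8], "sad")]
print(classify([0.8, 0.3], support))  # happy
```

Because cosine similarity ignores vector length, only the direction of the feature matters for the prediction.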


2nd stage : Alternate Learning

imageimage


whole process

image


setting details

image

Domain shift

Note: this section covers additional research that helps explain the experiments in this paper.

It shows that current few-shot learning algorithms are fragile under a large domain shift. You can compare the three tables below.

image

As you can see, the scenarios with a large domain shift, mini-ImageNet → CK+ and mini-ImageNet → RAF, do not perform well. However, in the scenario with a narrow domain shift, RAF basic → CK+, the best-performing algorithm reached 84.90% ± 0.53% accuracy when learning from only five samples.
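Numbers such as 84.90% ± 0.53% are typically obtained by averaging accuracy over many test episodes. The sketch below assumes the ± value is a 95% confidence interval under a normal approximation, which is a common convention in few-shot papers but is not stated in this post; the episode accuracies are made up.

```python
import math
import statistics

def mean_and_ci95(accuracies):
    """Mean accuracy and 95% confidence half-width over test episodes.

    Assumes a normal approximation (factor 1.96) and at least two episodes.
    """
    mean = statistics.mean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, half_width

# Made-up per-episode accuracies, for illustration only.
episode_accuracies = [0.80, 0.90, 0.85, 0.85]
mean, half_width = mean_and_ci95(episode_accuracies)
print(f"{100 * mean:.2f}% ± {100 * half_width:.2f}%")
```

With more test episodes the half-width shrinks, which is why few-shot results are usually reported over hundreds of episodes.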

In fact, due to the limited number of base classes in the FER task, the performance of existing FSL methods drops substantially.

=> To alleviate this problem, the paper proposes a novel EGS-Net with joint learning + alternate learning.


Experiments

image

The authors ran their experiments on two datasets that contain compound emotions, CFEE and EmotioNet. Because the training data contains only basic emotions (6 to 8 classes), each test dataset is split into _B (basic emotion) and _C (compound emotion) subsets (CFEE_B, CFEE_C, EmotioNet_B, EmotioNet_C) so that each setting can be examined in detail.

E_b denotes the emotion branch and S_b the similarity branch. (single) means that only RAF-DB was used for training, and (multiple) means that all the datasets were combined for training.

Clearly, 5-shot performs better than 1-shot, and performance on the basic emotion data is better than on the compound emotion data. This can be attributed to the smaller domain shift. Also, with (single) training the performance gap between the emotion branch and the similarity branch is small, but with (multiple) training the similarity branch performs better (inference ability on unseen data). I think this supports the validity of using few-shot learning.


references

[1] Matching Networks for One Shot Learning

[2] Revisiting few-shot learning for facial expression recognition
