[Paper Review] When Facial Expression Recognition Meets Few-Shot Learning: A Joint and Alternate Learning Framework
When Facial Expression Recognition Meets Few-Shot Learning: A Joint and Alternate Learning Framework
-EGS-Net-
Paper
This paper proposes the Emotion Guided Similarity Network (EGS-Net), which consists of an emotion branch and a similarity branch and is trained with a two-stage learning framework.
In the first stage, the similarity branch is jointly trained with the emotion branch in a multi-task fashion. With the regularization of the emotion branch, we prevent the similarity branch from overfitting to sampled base classes that are highly overlapped across different episodes. In the second stage, the emotion branch and the similarity branch play a two-student game to alternately learn from each other, thereby further improving the inference ability of the similarity branch on unseen compound expressions.
Before discussing the whole architecture, let's take a quick look at what few-shot learning is!
For more details about few-shot learning, go to this post! -> what is few-shot learning?
Few shot learning
This concept comes from the observation that "humans learn new concepts with very little supervision!" Previously, deep learning models have been trained on large-scale datasets such as ImageNet and MS COCO. Some authors argued that even these are not enough and added JFT-300M, an even larger dataset. The need for large-scale data has been a long-standing problem, and few-shot learning is one of the proposed solutions.
In few-shot learning, the model is expected to perform well even with a small dataset. The picture below is an example of how it works.
In few-shot learning, each task uses only N of the available classes, with K samples per class; this is the support set.
Let's say we train on 3 classes with 2 samples each; this is called "3-way 2-shot learning".
If you want to learn 2 classes with 1 sample each, it is called "2-way 1-shot learning".
The batch set (or query set) works like a validation set: the model's goal is to predict the query set's samples correctly.
The image below shows each training task in 3-way 2-shot learning. A test task follows the same process as a training task, but uses unseen classes and their samples.
[Image source] https://www.borealisai.com/en/blog/tutorial-2-few-shot-learning-and-meta-learning-i/
To recap briefly, the main points are:
- Performs well even with a small dataset (e.g., mini-ImageNet: 100 classes, 600 images each)
- Trains with a few images per task, not all of them
- Can run inference on unseen classes
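The N-way K-shot episode construction described above can be sketched in a few lines of Python. This is only a minimal illustration, not code from the paper: the function name `sample_episode`, the `n_query` parameter, and the toy dataset are all assumptions made for the example.

```python
import random


def sample_episode(data_by_class, n_way, k_shot, n_query, rng=None):
    """Sample one N-way K-shot episode (hypothetical helper).

    data_by_class maps a class label to a list of samples.
    Returns (support, query), each a list of (sample, label) pairs.
    """
    rng = rng or random.Random()
    # Pick N classes for this episode, not all available classes.
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Draw K support samples plus disjoint query samples for the class.
        picked = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


# Toy dataset (made up): 5 classes with 10 placeholder samples each.
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(10)] for c in range(5)}

# "3-way 2-shot" episode with one query sample per class.
support, query = sample_episode(toy, n_way=3, k_shot=2, n_query=1,
                                rng=random.Random(0))
print(len(support), len(query))  # 6 support samples, 3 query samples
```

At test time the same function would be called on a dictionary that holds only unseen classes.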
Now let's move on to EGS-Net!
1st stage : Joint Learning
Emotion branch
In the emotion branch, global representative features are extracted. These features are used as a regularizer for the similarity branch.
Similarity branch
This branch is the one that uses the few-shot learning concept. The features from the support set and the query set are compared with a metric-based computation (cosine similarity).
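As a rough illustration of metric-based classification with cosine similarity, the sketch below labels a query feature by comparing it with per-class averages of the support features. Averaging the support features into one prototype per class is an assumption made for this example and may differ from what EGS-Net actually does; the feature vectors and the labels "happy" and "sad" are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def class_prototypes(support):
    """Average the support features of each class (assumed design choice)."""
    grouped = {}
    for feature, label in support:
        grouped.setdefault(label, []).append(feature)
    return {
        label: [sum(column) / len(features) for column in zip(*features)]
        for label, features in grouped.items()
    }


def classify(query_feature, prototypes):
    """Return the label whose prototype is most similar to the query."""
    return max(
        prototypes,
        key=lambda label: cosine_similarity(query_feature, prototypes[label]),
    )


# Made-up 2-D features for a 2-way 2-shot support set.
support_features = [
    ([1.0, 0.1], "happy"),
    ([0.9, 0.2], "happy"),
    ([0.1, 1.0], "sad"),
    ([0.2, 0.8], "sad"),
]
prototypes = class_prototypes(support_features)
prediction = classify([0.8, 0.3], prototypes)
print(prediction)  # happy
```

In a real model the feature vectors would come from a learned encoder rather than being written by hand.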
2nd stage : Alternate Learning
whole process
setting details
Domain shift
This section covers additional research that helps explain the experiments in this paper.
It shows that current few-shot learning algorithms are fragile under a large domain shift. You can compare the three tables below.
As you can see, the scenarios with a large domain shift (mini-ImageNet → CK+ and mini-ImageNet → RAF) do not perform well. However, in the scenario with a narrow domain shift (RAF basic → CK+), the best-performing algorithm reached 84.90% ± 0.53% accuracy when learning from only five samples.
In fact, due to the limited number of base classes in our FER task, the performance of existing FSL methods drops substantially.
=> To alleviate this problem, the paper proposes a novel EGS-Net with joint learning + alternate learning.
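Accuracy figures such as 84.90% ± 0.53% are usually a mean over many test episodes. The sketch below shows one common way to compute such a number, assuming the ± term is a 95% confidence interval under a normal approximation. That is a widespread convention in few-shot papers, but this post does not state which definition is used, and the episode accuracies below are made up.

```python
import math
import statistics


def mean_and_ci95(episode_accuracies):
    """Mean accuracy and half-width of an approximate 95% confidence interval."""
    mean = statistics.mean(episode_accuracies)
    std = statistics.stdev(episode_accuracies)  # sample standard deviation
    # 1.96 is the normal-approximation factor for a 95% interval (assumed).
    half_width = 1.96 * std / math.sqrt(len(episode_accuracies))
    return mean, half_width


# Hypothetical per-episode accuracies; real evaluations use many more episodes.
accuracies = [0.80, 0.90, 0.85, 0.85]
mean, half_width = mean_and_ci95(accuracies)
print(f"{100 * mean:.2f}% +/- {100 * half_width:.2f}%")
```

With more test episodes the interval shrinks, which is why few-shot results are typically averaged over hundreds of episodes.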
Experiments
The authors ran experiments on two datasets that contain compound emotions, CFEE and EmotioNet. Because the training data contains only basic emotions (6 to 8 classes), each test dataset is split into _B (basic emotion) and _C (compound emotion) subsets (CFEE_B, CFEE_C, EmotioNet_B, EmotioNet_C) to examine the results in more detail.
E_b denotes the emotion branch and S_b denotes the similarity branch. (single) means that only RAF-DB was used for training, and (multiple) means that all the datasets were combined for training.
As expected, 5-shot performs better than 1-shot, and performance on the basic emotion data is better than on the compound emotion data. This can be attributed to domain shift, since training uses only basic emotions.
Also, in the single setting the performance gap between the emotion branch and the similarity branch is small, but in the multiple setting the similarity branch performs better (inference ability to unseen data). I think this supports the validity of using few-shot learning.
references
[1] Matching Networks for One Shot Learning
[2] Revisiting few-shot learning for facial expression recognition