[Paper Review📃] When Facial Expression Recognition Meets Few-Shot Learning: A Joint and Alternate Learning Framework

When Facial Expression Recognition Meets Few-Shot Learning: A Joint and Alternate Learning Framework

-EGS-Net-

This paper proposes the Emotion Guided Similarity Network (EGS-Net), consisting of an emotion branch and a similarity branch, based on a two-stage learning framework. In the first stage, the similarity branch is jointly trained with the emotion branch in a multi-task fashion. With the regularization of the emotion branch, we prevent the similarity branch from overfitting to sampled base classes that are highly overlapped across different episodes. In the second stage, the emotion branch and the similarity branch play a two-student game to alternately learn from each other, thereby further improving the inference ability of the similarity branch on unseen compound expressions.

Before discussing the whole architecture, let's take a quick look at what few-shot learning is 😀

For more details about few-shot learning, see this post! -> what is few-shot learning?

Few shot learning

The idea comes from the observation that “Humans learn new concepts with very little supervision!”. Deep learning models have traditionally been trained on large-scale datasets such as ImageNet and MS COCO. Some authors argued that even these are not enough and added JFT-300M, a much larger dataset. The need for ever more data has long been an unavoidable problem, and few-shot learning is one of the proposed solutions to it.

In few-shot learning, the model is expected to perform well even with a small dataset. The picture below is an example of how it works.

In few-shot learning, each task uses only N classes rather than all of them, with K samples per class. This is the support set. Say we train on 3 classes with 2 samples each: this is called "3 way 2 shot learning".

If you want to learn 2 classes with 1 sample each, it is called “2 way 1 shot learning”

The batch set (or query set) works like a validation set: the model's goal is to predict the query set's samples correctly.
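The support/query split described above can be sketched in a few lines of Python. This is only an illustration of how one N way K shot episode might be drawn. The function name `sample_episode`, the `n_query` parameter, and the toy labels are assumptions made for this example and do not come from the paper.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way, k_shot, n_query, rng=None):
    """Sample one N-way K-shot episode (illustrative sketch).

    labels: list of class labels, one per dataset item (index = item id).
    Returns (support, query): lists of (item_index, class_label) pairs.
    """
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Pick N classes for this episode, not all classes.
    classes = rng.sample(sorted(by_class), n_way)

    support, query = [], []
    for c in classes:
        # Draw K support items plus n_query query items without overlap.
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(i, c) for i in picked[:k_shot]]
        query += [(i, c) for i in picked[k_shot:]]
    return support, query

# Toy dataset: 5 classes with 10 items each (hypothetical labels).
toy_labels = [c for c in "ABCDE" for _ in range(10)]
support, query = sample_episode(toy_labels, n_way=3, k_shot=2, n_query=4,
                                rng=random.Random(0))
print(len(support), len(query))  # 6 12
```

With `n_way=3` and `k_shot=2` this is the "3 way 2 shot" case: 6 support items, and the model is scored on the 12 query items.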

The image below shows the training tasks in 3 way 2 shot learning. A test task follows the same process as a training task, but uses samples from unseen classes.

[Image source] https://www.borealisai.com/en/blog/tutorial-2-few-shot-learning-and-meta-learning-i/

To recap briefly, the main points are:

Performs well even with a small dataset (e.g. mini-ImageNet: 100 classes, 600 images each)
Trains with a few images per class, not all of them
Can run inference on unseen classes

Now let’s move on to EGS-Net!

1st stage : Joint Learning

Emotion branch

In the emotion branch, global representative features are extracted. These features are used as a regularizer for the similarity branch.
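One common way to let one branch regularize another in multi-task training is to add the two losses together. The sketch below shows that idea only. The weighting factor `lam`, the function names, and the use of plain softmax cross-entropy for both branches are assumptions for illustration; the paper's exact loss is not reproduced here.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def joint_loss(sim_logits, episode_target, emo_logits, emotion_target, lam=1.0):
    """Similarity-branch loss plus a weighted emotion-branch loss.

    lam is a hypothetical weight controlling how strongly the emotion
    branch regularizes the similarity branch.
    """
    l_sim = cross_entropy(sim_logits, episode_target)
    l_emo = cross_entropy(emo_logits, emotion_target)
    return l_sim + lam * l_emo

# Made-up logits: 3 episode classes, 7 basic emotion classes.
loss = joint_loss([2.0, 0.5, -1.0], 0,
                  [0.1, 1.5, 0.0, 0.0, 0.2, -0.3, 0.4], 1, lam=0.5)
print(round(loss, 4))
```

Setting `lam=0` would train the similarity branch on its own, which is the setting where overfitting to the sampled base classes is the concern.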

Similarity branch

This is the branch that uses the few-shot learning concept. Features from the support set and the query set are compared with a metric-based computation (cosine similarity).
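As a rough illustration of metric-based classification with cosine similarity, the sketch below compares a query feature against one averaged feature per class. Averaging the support features into a per-class prototype is an assumption made here for brevity, as are the function names and the toy 2-D features; the paper may compare features differently.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + eps)

def class_prototypes(support_feats, support_labels):
    """Average the support features of each class into one prototype."""
    groups = {}
    for feat, label in zip(support_feats, support_labels):
        groups.setdefault(label, []).append(feat)
    return {label: [sum(col) / len(feats) for col in zip(*feats)]
            for label, feats in groups.items()}

def classify(query_feat, prototypes):
    """Return the most similar class and all per-class similarities."""
    sims = {label: cosine(query_feat, proto)
            for label, proto in prototypes.items()}
    return max(sims, key=sims.get), sims

# Toy 2-way 2-shot support set with made-up 2-D features.
support_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_labels = ["happy", "happy", "sad", "sad"]
prototypes = class_prototypes(support_feats, support_labels)
pred, sims = classify([2.0, 0.2], prototypes)
print(pred)  # happy
```

Because cosine similarity ignores vector length, the query `[2.0, 0.2]` still matches the "happy" prototype even though its magnitude is larger than any support feature.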

2nd stage : Alternate Learning

whole process

setting details

Domain shift

THIS IS ADDITIONAL RESEARCH TO EXPLAIN THE EXPERIMENTS IN THIS PAPER.

It shows that current few-shot learning algorithms are fragile under a large domain shift. You can compare the three tables below.

As you can see, the scenarios with a large domain shift (mini-ImageNet → CK+ and mini-ImageNet → RAF) perform poorly. In the scenario with a narrow domain shift (RAF basic → CK+), however, the best-performing algorithm reached 84.90% ± 0.53% accuracy when learning from only five samples.
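Few-shot results written as "mean ± value" are usually an average over many test episodes together with a confidence interval. The sketch below assumes the ± is a 95% interval under a normal approximation, which is not stated in the source. The per-episode accuracies are made up for illustration.

```python
import math
import statistics

def mean_with_ci95(accuracies):
    """Mean accuracy and 95% confidence half-width (normal approximation)."""
    n = len(accuracies)
    mean = statistics.mean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(n)
    return mean, half_width

# Made-up per-episode accuracies, not results from any paper.
episode_acc = [0.80, 0.90, 0.85, 0.95, 0.75, 0.85]
mean, half_width = mean_with_ci95(episode_acc)
print(f"{100 * mean:.2f}% ± {100 * half_width:.2f}%")
```

The half-width shrinks with the square root of the number of episodes, which is why few-shot benchmarks are typically evaluated on hundreds of sampled test episodes.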

In fact, due to the limited number of base classes in our FER task, the performance of existing FSL methods drops substantially.

=> To alleviate this problem, the paper proposes a novel EGS-Net with joint learning + alternate learning.

Experiments

The authors ran experiments on two datasets that contain compound emotions, CFEE and EmotioNet. Because training uses only basic emotions (6 to 8 classes), each test dataset is split into _B (basic emotion) and _C (compound emotion) subsets (CFEE_B, CFEE_C, EmotioNet_B, EmotioNet_C) so that the settings can be examined in detail.

E_b denotes the emotion branch and S_b the similarity branch. (single) means that only RAF-DB was used for training, and (multiple) means that all the datasets were combined for training.

Performance is clearly better with 5 shot than with 1 shot, and better on the basic emotion data than on the compound emotion data. The latter can be attributed to the smaller domain shift. Also, in the (single) setting the performance gap between the emotion branch and the similarity branch is small, but in the (multiple) setting the similarity branch performs better (inference ability on unseen data). In my view, this supports the validity of using few-shot learning.

references

[1] Matching Networks for One Shot Learning

[2] Revisiting few-shot learning for facial expression recognition


zzennin