[Paper Review] Pseudo label

curious_cat 2023. 3. 12. 00:07

Overview:

Title: Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks

The paper dates from 2013, so it is somewhat old, but it is conceptually important in semi-supervised learning. Semi-supervised learning covers the setting where only part of a dataset is labeled, and it aims to make good use of the unlabeled data as well (training is supervised on the labeled part and unsupervised on the unlabeled part, hence semi-supervised). The paper addresses classification and proposes generating pseudo-labels for unlabeled data and using them as if they were ground-truth labels. The pseudo-label of an unlabeled sample is the class to which the model, at the current point in training, assigns the highest predicted probability.
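A minimal sketch of how a pseudo-label is picked, assuming the model output is simply a list of per-class probabilities. The helper names `pseudo_label` and `one_hot` and the example numbers are made up for illustration and do not come from the paper.

```python
def pseudo_label(probs):
    """Return the index of the class with the highest predicted probability."""
    return max(range(len(probs)), key=lambda i: probs[i])


def one_hot(index, num_classes):
    """Encode a class index as a 0/1 target vector."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


# Hypothetical model outputs for two unlabeled samples (3 classes).
predictions = [
    [0.10, 0.70, 0.20],
    [0.55, 0.05, 0.40],
]

# Each unlabeled sample gets the one-hot vector of its most likely class.
targets = [one_hot(pseudo_label(p), len(p)) for p in predictions]
```

The resulting `targets` are then treated exactly like ground-truth labels in the loss.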

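The paper also connects pseudo-label training to entropy regularization: training on pseudo-labels is related to lowering the conditional entropy of the class predictions on unlabeled data. As a result, the overlap between the class distributions shrinks, and the decision boundary ends up in a region with few data points. This property is called low-density separation.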

Method

Neural network

MLP with ReLU activations
A sigmoid is applied to the final features to predict probabilities (not softmax)

Loss

\( \sum_{i=1}^{C} L(y_i,f_i(x))\): here C is the number of classes, y is the label (for class i, only the i-th component is 1 and the rest are 0), and f is the MLP output after the sigmoid
\( L(y_i,f_i) = -y_i \log f_i - (1-y_i) \log (1-f_i)\) is the cross entropy loss
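The loss above can be sketched in plain Python as follows. The logits are hypothetical values, and the `eps` clamp is an addition for numerical safety that is not part of the formula.

```python
import math


def sigmoid(z):
    """Map a real-valued output to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sample_loss(y, f, eps=1e-12):
    """Sum over classes of -y_i log f_i - (1 - y_i) log(1 - f_i)."""
    total = 0.0
    for y_i, f_i in zip(y, f):
        f_i = min(max(f_i, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -y_i * math.log(f_i) - (1.0 - y_i) * math.log(1.0 - f_i)
    return total


logits = [2.0, -1.0, -3.0]        # hypothetical pre-sigmoid outputs
f = [sigmoid(z) for z in logits]  # independent per-class probabilities
y = [1.0, 0.0, 0.0]               # one-hot label for class 0
loss = sample_loss(y, f)
```

Because each class goes through its own sigmoid, the components of `f` do not have to sum to 1, unlike a softmax output.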

Training with pre-training & fine-tuning

Pre-training uses a denoising auto-encoder. It is not the core of this paper, so the details are skipped here
(Fine-tuning) dropout is used during training
Learning rate scheduler: exponential decay
Stochastic gradient descent + momentum
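As a rough illustration of the last two items, here is SGD with momentum combined with an exponentially decaying learning rate on a toy one-dimensional problem. The constants (`initial_lr=0.1`, `decay_rate=0.998`, `momentum=0.9`) and the toy objective are assumptions chosen for the example, not the paper's settings, and the momentum is kept constant for simplicity.

```python
def exponential_lr(initial_lr, decay_rate, step):
    """Learning rate after `step` updates: initial_lr * decay_rate**step."""
    return initial_lr * decay_rate ** step


def sgd_momentum_step(w, grad, velocity, lr, momentum):
    """One update: velocity = momentum * velocity - lr * grad; w += velocity."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity


# Minimize the toy objective 0.5 * w**2, whose gradient is simply w.
w, velocity = 5.0, 0.0
for step in range(200):
    lr = exponential_lr(initial_lr=0.1, decay_rate=0.998, step=step)
    w, velocity = sgd_momentum_step(w, grad=w, velocity=velocity, lr=lr, momentum=0.9)
```

After the loop, `w` has moved from 5.0 to a value close to the minimum at 0.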

Pseudo-Label

The idea is simple: use the class that the current model predicts with the highest probability as the label:

First pre-train, then train on labeled and unlabeled data simultaneously
Unlabeled data uses pseudo-labels, which are recomputed at every weight update (rather than computed once up front and reused)
n is the mini-batch size for labeled data, n' is the mini-batch size for unlabeled data
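For reference, the overall loss that the symbols in this section belong to (Eq. 15 in the paper) can be written as below. This is a reconstruction from the definitions given here and from my reading of the paper, so treat the exact notation as an assumption:

\[ L = \frac{1}{n} \sum_{m=1}^{n} \sum_{i=1}^{C} L(y_i^m, f_i^m) + \alpha(t) \frac{1}{n'} \sum_{m=1}^{n'} \sum_{i=1}^{C} L(y_i'^m, f_i'^m) \]

The first term is the usual supervised loss on a labeled mini-batch, and the second term is the same loss evaluated against pseudo-labels on an unlabeled mini-batch.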

y, f are the label and the MLP output for labeled data
y', f' are the pseudo-label and the MLP output for unlabeled data
\( \alpha (t)\) is a coefficient that balances the label and pseudo-label terms
Training first uses labeled data only, after which \( \alpha (t)\) is gradually increased up to a fixed value.
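A sketch of such a schedule and of the combined loss, assuming a piecewise-linear ramp: \( \alpha (t) = 0\) before step T1, a linear increase between T1 and T2, and a constant \( \alpha_f\) afterwards. The default values `t1=100`, `t2=600`, `alpha_f=3.0` are what I recall from the paper's experiments and should be treated as assumptions; the function names and the example data are hypothetical.

```python
import math


def alpha(t, t1=100, t2=600, alpha_f=3.0):
    """Piecewise-linear ramp: 0 before t1, alpha_f from t2 onward."""
    if t < t1:
        return 0.0
    if t < t2:
        return alpha_f * (t - t1) / (t2 - t1)
    return alpha_f


def bce(y, f, eps=1e-12):
    """Per-sample loss: sum over classes of the binary cross entropy."""
    total = 0.0
    for y_i, f_i in zip(y, f):
        f_i = min(max(f_i, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -y_i * math.log(f_i) - (1.0 - y_i) * math.log(1.0 - f_i)
    return total


def total_loss(labeled, unlabeled, t):
    """labeled: list of (y, f) pairs; unlabeled: list of outputs f'."""
    supervised = sum(bce(y, f) for y, f in labeled) / len(labeled)
    pseudo = 0.0
    for f_u in unlabeled:
        # The pseudo-label is recomputed from the current output every call.
        k = max(range(len(f_u)), key=lambda i: f_u[i])
        y_u = [1.0 if i == k else 0.0 for i in range(len(f_u))]
        pseudo += bce(y_u, f_u)
    return supervised + alpha(t) * pseudo / len(unlabeled)


# Hypothetical mini-batches: one labeled sample, two unlabeled samples.
labeled = [([1.0, 0.0], [0.9, 0.2])]
unlabeled = [[0.6, 0.3], [0.2, 0.7]]
```

With these defaults, `total_loss(labeled, unlabeled, 0)` is purely supervised, and the pseudo-label term reaches full weight from step 600 onward.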

Why does it work?

Low density separation between classes

According to the cluster assumption, a decision boundary separating classes generalizes well when it lies in a region with little data (low density). Entropy regularization serves this purpose.

Entropy regularization

Minimizing the conditional entropy of the class probabilities on unlabeled data favors low-density separation
\( H(y|x') = -\frac{1}{n'} \sum_{m=1}^{n'} \sum_{i=1}^{C} P(y_i^m=1 | x'^m ) \log P (y_i^m=1|x'^m)\)
- n': number of unlabeled samples
- C: number of classes
- \( y_i^m\): unknown label of the m-th unlabeled sample
- \( x'^m\): m-th unlabeled sample
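The objective that uses this entropy term (Eq. 18 in the paper) can be written as below. This is a reconstruction from the surrounding definitions and my reading of the paper, so the exact notation is an assumption; \( \theta\) denotes the model parameters and n the number of labeled samples:

\[ C(\theta, \lambda) = \sum_{m=1}^{n} \log P(y^m \mid x^m; \theta) - \lambda H(y \mid x'; \theta) \]

This quantity is maximized, so the minus sign in front of \( \lambda H\) means that lower entropy on unlabeled data is rewarded.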

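The first term of the maximum a posteriori (MAP) objective (Eq. 18 in the paper) is the log-likelihood of the labeled data
The second term is the entropy regularization described above
\( \lambda\): coefficient that sets the relative importance of the two terms
The objective can be read as asking for good classification of labeled data together with a low-density decision boundary on unlabeled data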
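The entropy term discussed above can be sketched as follows, assuming each unlabeled sample is represented by a list of class probabilities. The example predictions are invented, and the `eps` floor is only there to avoid `log(0)`.

```python
import math


def conditional_entropy(probs_per_sample, eps=1e-12):
    """Average over samples of -sum_i p_i log p_i."""
    total = 0.0
    for probs in probs_per_sample:
        total += -sum(p * math.log(max(p, eps)) for p in probs)
    return total / len(probs_per_sample)


uncertain = [[0.5, 0.5], [0.4, 0.6]]      # predictions near a class boundary
confident = [[0.99, 0.01], [0.02, 0.98]]  # predictions far from the boundary

h_uncertain = conditional_entropy(uncertain)
h_confident = conditional_entropy(confident)
```

Samples that sit near a decision boundary get predictions close to uniform and therefore high entropy, so pushing the entropy down pushes the boundary away from where the unlabeled data lies.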

Interpreting the pseudo-label method

Training with pseudo-labels so that each unlabeled sample is assigned to exactly one of the C classes (the second term of Eq. 15) can be interpreted as entropy regularization (the second term of Eq. 18)
The first term of Eq. 18 corresponds to the cross entropy loss used in Eq. 15
The accompanying figure shows the features (before the sigmoid) learned on the MNIST dataset, embedded in 2D with t-SNE (the loss on the training data is 0; the plotted distribution is of the test data). With pseudo-labels, the data density between classes is visibly lower.
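The link between the pseudo-label term and entropy regularization described in this section can be checked numerically on a toy case. The sketch below assumes two-class outputs of the form [p, 1 - p] and is only an illustration, not an experiment from the paper.

```python
import math


def pseudo_label_loss(f):
    """Binary cross entropy against the one-hot argmax of f itself."""
    k = max(range(len(f)), key=lambda i: f[i])
    return sum(
        -math.log(f_i) if i == k else -math.log(1.0 - f_i)
        for i, f_i in enumerate(f)
    )


def entropy(f):
    """Entropy -sum_i f_i log f_i of a single prediction."""
    return -sum(f_i * math.log(f_i) for f_i in f)


# Two-class outputs that grow more confident from left to right.
outputs = [[p, 1.0 - p] for p in (0.55, 0.7, 0.9, 0.99)]
losses = [pseudo_label_loss(f) for f in outputs]
entropies = [entropy(f) for f in outputs]
```

Both lists decrease together as the prediction sharpens, which is the sense in which minimizing the pseudo-label loss behaves like minimizing entropy.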
