[Paper Review] HiFace: High-Fidelity 3D Face Reconstruction by Learning Static and Dynamic Details

curious_cat 2023. 8. 1. 17:36

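Overview

Since Microsoft built its synthetic face dataset, face reconstruction has been steadily improving. The results in the dense-landmark-based face reconstruction paper were already impressive, and now even fine facial details (wrinkles) seem to be reconstructed well. Because Microsoft holds a highly realistic synthetic dataset, it can adopt training schemes that were not usable before. The dataset is not public, though, so for people like me who cannot access the data the practical impact is somewhat limited. The results the authors share make the quality gain easy to see, and the method also reaches SOTA on the REALY benchmark.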

The figure showing the paper's key results is organized as follows.
Reconstruction results: the second row shows the coarse reconstruction of the photos in the first row. Adding detail reconstruction on top gives the third row.
Animation results: rows 4, 5, and 6 show what happens when the reconstructed details are applied to the photos in the bottom row. Row 4 applies the static detail (wrinkles), row 5 applies the dynamic detail (expression-related), and row 6 applies both static and dynamic detail.
Coarse reconstruction uses a 3DMM, and detail reconstruction uses a detail basis obtained from scan data (built with PCA, similar to a 3DMM).

Method

Notation

Parameter list (refer back here if you forget the notation below)

\( \alpha \in \mathbb{R}^{300}\): albedo coefficient
\( \beta \in \mathbb{R}^{256}\): 3DMM identity (shape) coefficient
\( \xi \in \mathbb{R}^{233}\): 3DMM expression coefficient
\( p \in \mathbb{R}^{3\times4 + 3}\): 3DMM pose coefficient (a 3D rotation each for the head, neck, and two eyes) plus the 3D position of the root joint
\( \gamma \in \mathbb{R}^{9}\): spherical harmonics coefficient (illumination)
\( \varphi \in \mathbb{R}^{300}\): static coefficient for the (detail) displacement map
\( \phi = (\phi_{com}, \phi_{str}) \in (\mathbb{R}^{26},\mathbb{R}^{26} )\): dynamic coefficients for the compressed and stretched (detail) displacement maps, respectively

A ResNet regresses these parameters. The exception is the dynamic coefficient, which is computed from the static and expression coefficients and is therefore not in the list of regressed outputs.

Parameter details

3DMM

A 3DMM is built by scanning human faces and running PCA on the vertex positions. This paper represents the face as a mesh with 7,667 vertices and 14,832 triangles. The mean face is denoted \( \bar{S} \).
The PCA components for face shape (identity) are denoted \( B_{id}\).
The PCA components for facial expression are denoted \( B_{exp}\) (note that the pose described later must also be taken into account to get the exact face shape).
The neutral-expression face shape of a specific person is obtained with \( B_{id}\):
\[ S_{neu} = \bar{S} + \beta B_{id}\]
When that person makes an expression, \( B_{exp}\) is added as well to get the face shape \( S \):
\[ S = \bar{S} + \beta B_{id} + \xi B_{exp}\]
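
The two linear formulas above can be sketched in a few lines of NumPy. This is only an illustration: the basis matrices are random placeholders (the real 3DMM is not public), the vertex count is shrunk to a toy size, and the name `coarse_shape` is made up for this post.

```python
import numpy as np

rng = np.random.default_rng(0)

N_VERTS = 100            # toy size; the paper's mesh has 7,667 vertices
N_ID, N_EXP = 256, 233   # coefficient sizes from the notation list

# Hypothetical stand-ins for the real 3DMM (random, for shape-checking only).
S_mean = rng.normal(size=N_VERTS * 3)
B_id = rng.normal(size=(N_ID, N_VERTS * 3))
B_exp = rng.normal(size=(N_EXP, N_VERTS * 3))

def coarse_shape(beta, xi):
    """Return (S_neu, S), each reshaped to (N_VERTS, 3)."""
    s_neu = S_mean + beta @ B_id   # identity only: neutral face
    s = s_neu + xi @ B_exp         # identity + expression
    return s_neu.reshape(-1, 3), s.reshape(-1, 3)

beta = 0.1 * rng.normal(size=N_ID)
xi = np.zeros(N_EXP)               # zero expression -> S equals S_neu
S_neu, S = coarse_shape(beta, xi)
```

With a zero expression coefficient the two outputs coincide, which is a quick sanity check that the expression basis only adds an offset on top of the neutral face.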

Pose and camera model

Starting from the shape S above (identity plus expression), the pose is applied through linear blend skinning to get the final coarse face shape \( S_p\):
\[ S_p = LBS(S,p,J; W)\]
Here LBS is linear blend skinning, p is the pose values, J is the locations of the 4 joints, and W is the (blend) weights, which determine how much each joint's pose affects each vertex position.
The joint locations are computed as a linear function of \( \beta\) (see the Fake It Till You Make It paper for details).
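
To make the LBS step concrete, here is a minimal NumPy sketch of standard linear blend skinning with axis-angle rotations. The joint order, the parent hierarchy passed in `parents`, and the way the root translation is added at the end are my assumptions for illustration; the paper's exact convention may differ.

```python
import numpy as np

def rodrigues(rvec):
    """Axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-8:
        return np.eye(3)
    k = rvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def lbs(S, rotations, J, W, parents, translation):
    """S: (V, 3) rest-pose vertices, rotations: (n, 3) axis-angle per joint,
    J: (n, 3) rest-pose joint locations, W: (V, n) blend weights (rows sum to 1),
    parents: parent index per joint (-1 for the root, parents listed before children),
    translation: (3,) root position."""
    n = len(parents)
    G = np.zeros((n, 4, 4))  # global transform of each joint
    for j in range(n):
        local = np.eye(4)
        local[:3, :3] = rodrigues(rotations[j])
        # Offset of the joint relative to its parent in the rest pose.
        local[:3, 3] = J[j] if parents[j] < 0 else J[j] - J[parents[j]]
        G[j] = local if parents[j] < 0 else G[parents[j]] @ local
    # Make transforms relative to the rest pose (so zero pose is the identity).
    for j in range(n):
        G[j, :3, 3] -= G[j, :3, :3] @ J[j]
    S_h = np.concatenate([S, np.ones((len(S), 1))], axis=1)  # homogeneous coords
    T = np.einsum("vj,jab->vab", W, G)                       # blended per-vertex transform
    return np.einsum("vab,vb->va", T, S_h)[:, :3] + translation
```

With all rotations set to zero the function returns the input shape shifted by the translation, and a vertex fully bound to one joint simply rotates rigidly around that joint.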

Illumination

The albedo map can be thought of as the face's intrinsic color, stored as a 512 x 512 x 3 map. The color seen in an actual photo also depends on illumination. Like the 3DMM, a person's albedo map is obtained by running PCA on the colors of many faces and adding principal components \( B_{alb}\) to the mean color \( \bar{A}\):
\[ A = \bar{A} + \alpha B_{alb}\]
The texture map T that is actually observed is obtained from the albedo map by accounting for illumination. Illumination depends on the direction of the normal vector N at the mesh location corresponding to each albedo map position, and it is usually modeled with spherical harmonics. Spherical harmonics form a basis for functions on the sphere (the angular part of a function that separates into radial and angular parts), so the lighting depends only on direction. Using spherical harmonics therefore amounts to assuming that the light source is far away; the early papers that used this model call it a "distant" light source.
\[ T = A \odot \sum_{k=1}^{9} \gamma_k \Psi_k(N)\]
In this equation T is the texture map, A is the albedo map, \( \gamma\) is the illumination coefficient, and \( \Psi: \mathbb{R}^3 \rightarrow \mathbb{R}\) is a spherical harmonics basis function.
\( \odot\) is the element-wise (Hadamard) product.
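
The shading formula can be sketched as follows, using the first nine real spherical harmonics (bands 0 to 2) with their usual normalization constants. Implementations differ in how they order the basis functions and whether they fold the constants into \( \gamma\), so treat the ordering here and the names `sh_basis` and `shade` as assumptions.

```python
import numpy as np

def sh_basis(normals):
    """normals: (..., 3) unit vectors -> (..., 9) real SH basis values."""
    x, y, z = normals[..., 0], normals[..., 1], normals[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),      # band 0
        0.488603 * y,                    # band 1
        0.488603 * z,
        0.488603 * x,
        1.092548 * x * y,                # band 2
        1.092548 * y * z,
        0.315392 * (3.0 * z**2 - 1.0),
        1.092548 * x * z,
        0.546274 * (x**2 - y**2),
    ], axis=-1)

def shade(albedo, normals, gamma):
    """albedo: (H, W, 3), normals: (H, W, 3), gamma: (9,) -> texture (H, W, 3)."""
    shading = sh_basis(normals) @ gamma   # (H, W) scalar shading per texel
    return albedo * shading[..., None]    # element-wise product, broadcast over RGB
```

If only the constant term of \( \gamma\) is nonzero, the shading is uniform and the texture is just a scaled copy of the albedo, which matches the intuition of purely ambient light.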

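Detail

The detail map is a UV map, like the albedo map. Detail is added by displacing the coarse shape S along its normal direction at the location corresponding to each detail map position.
What is distinctive about this paper is that it splits the detail into static and dynamic parts:
\[D = D_{sta} + D_{dyn}\]
Static detail is the detail (wrinkles) present in the neutral expression, and dynamic detail is the detail (wrinkles) that appears when making an expression.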

Static detail

Static detail is obtained by running PCA on displacement maps of neutral-expression faces. A specific person's static detail is modeled as:
\[D_{sta} = \bar{D}_{sta} + \varphi B_{sta}\]
Here \( \bar{D}_{sta} \) is the mean displacement map, \( B_{sta}\) are the PCA components, and \( \varphi\) is the static coefficient.
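
As a rough sketch of how the static detail map is built and then applied, the snippet below evaluates the linear model and pushes surface points along their normals. The basis is random, the UV resolution is a toy value, and I assume the coarse surface positions and normals have already been rasterized into UV space; `static_detail` and `displace` are names I made up.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 16     # toy UV resolution for illustration
N_STA = 300    # size of the static coefficient from the notation list

# Hypothetical stand-ins for the PCA model of neutral-face displacement maps.
D_sta_mean = rng.normal(scale=1e-3, size=H * W)
B_sta = rng.normal(scale=1e-3, size=(N_STA, H * W))

def static_detail(varphi):
    """D_sta = mean + varphi @ B_sta, returned as an (H, W) displacement map."""
    return (D_sta_mean + varphi @ B_sta).reshape(H, W)

def displace(points, normals, displacement):
    """Move each surface point along its unit normal by a scalar displacement.
    points, normals: (H, W, 3); displacement: (H, W)."""
    return points + displacement[..., None] * normals
```

Because the displacement is a single scalar per texel, the detail can only move the surface in or out along the normal, which is enough for wrinkles but cannot slide the surface sideways.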

Dynamic detail

Unlike static detail, dynamic detail is not modeled well as a linear function of the dynamic coefficient.
To address this, the paper splits dynamic detail into compressed and stretched displacement maps. Each is linear in the dynamic coefficient, but the two are blended nonlinearly.
An example of a compressed expression is a frown, where the skin between the eyebrows is compressed and wrinkles form. An example of a stretched expression is one where the area between the eyebrows is fully relaxed.
The compressed / stretched displacement bases (\( B_{com}, B_{str}\), respectively) are also obtained with PCA, like the static displacement basis.
The dynamic coefficient \( \phi\) is computed from the static coefficient \( \varphi\) and the expression coefficient \( \xi\). In the paper's formula, \( \Phi \) is an MLP and \( \tilde{\xi}\) is obtained by passing the expression coefficient through an MLP. \( \mu\) is the mean and \( \sigma\) is the standard deviation. You can think of it as injecting \( \tilde{\xi} \) through AdaIN (adaptive instance normalization).
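
Here is a small sketch of what an AdaIN-style injection could look like. I am assuming the standard AdaIN form, where the static coefficient is normalized and then rescaled with the mean and standard deviation of \( \tilde{\xi}\), before the MLP \( \Phi\) maps it to the 26 + 26 dynamic coefficients. Layer sizes, random weights, and function names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights for a tiny MLP (placeholder for trained weights)."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, layers):
    """Apply linear layers with ReLU between them (no ReLU after the last)."""
    for i, (Wt, b) in enumerate(layers):
        x = x @ Wt + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

def adain(content, style, eps=1e-5):
    """Standard AdaIN: normalize content, then apply the style's mean and std."""
    normalized = (content - content.mean()) / (content.std() + eps)
    return style.std() * normalized + style.mean()

expr_mlp = init_mlp([233, 128, 300])  # xi -> xi_tilde (hidden sizes are assumptions)
phi_mlp = init_mlp([300, 128, 52])    # Phi: fused feature -> 26 + 26 coefficients

def dynamic_coefficients(varphi, xi):
    xi_tilde = mlp(xi, expr_mlp)
    fused = adain(varphi, xi_tilde)   # inject expression statistics into static code
    out = mlp(fused, phi_mlp)
    return out[:26], out[26:]         # (phi_com, phi_str)
```

The point of this design is that the same person (same \( \varphi\)) gets different dynamic coefficients for different expressions, while the person-specific structure of the static code is preserved by the normalization.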

From the dynamic coefficients obtained this way, the detail maps are built linearly:
\[ D_{com} = \bar{D}_{com} + \phi_{com} B_{com}\]
\[ D_{str} = \bar{D}_{str} + \phi_{str} B_{str}\]
When these detail maps are applied, a tension value decides whether a region counts as stretched or compressed. The tension at each vertex is computed as follows. For a given expression, let the edges connected to vertex \( v_i \in S \) be \( \{ e_1,...,e_K\}\). Let the edges of the corresponding vertex on the same person's neutral face (\( S_{neu}\)) be \( \{ e_1', ..., e_K' \}\). Tension measures whether the edges connected to the vertex are on average shorter or longer under the expression than in the neutral face (shorter on average means tension > 0, i.e. compressed; longer means tension < 0, i.e. stretched).
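
One way to turn this description into code is to take, for each vertex, one minus the mean ratio of expression edge length to neutral edge length. That specific formula is my assumption (it gives the right sign convention), and `vertex_tension` is a made-up name.

```python
import numpy as np

def vertex_tension(S_expr, S_neu, edges):
    """S_expr, S_neu: (V, 3) vertex positions; edges: (E, 2) vertex index pairs.
    Returns (V,) tension: > 0 compressed, < 0 stretched, 0 unchanged."""
    len_expr = np.linalg.norm(S_expr[edges[:, 0]] - S_expr[edges[:, 1]], axis=1)
    len_neu = np.linalg.norm(S_neu[edges[:, 0]] - S_neu[edges[:, 1]], axis=1)
    ratio = len_expr / len_neu

    n_verts = len(S_neu)
    ratio_sum = np.zeros(n_verts)
    count = np.zeros(n_verts)
    for col in (0, 1):  # each edge contributes to both of its endpoints
        np.add.at(ratio_sum, edges[:, col], ratio)
        np.add.at(count, edges[:, col], 1.0)

    # Vertices without edges get a ratio of 1, i.e. zero tension.
    mean_ratio = np.where(count > 0, ratio_sum / np.maximum(count, 1.0), 1.0)
    return 1.0 - mean_ratio
```

For example, uniformly shrinking a triangle to half its size gives a tension of 0.5 at every vertex, and doubling its size gives -1.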

The tension values are then converted to a UV map \( M_{uv}\); compressed detail is used where it is positive and stretched detail where it is negative (which is why the model is nonlinear in the dynamic coefficient):
\[ D_{dyn} = M^+_{uv} \odot D_{com} + M^-_{uv} \odot D_{str}\]
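
The blending step is short enough to write out directly. I am assuming that \( M^+_{uv}\) and \( M^-_{uv}\) are the magnitudes of the positive and negative parts of the tension map, so each texel picks exactly one of the two detail maps, scaled by how strong the tension is.

```python
import numpy as np

def blend_dynamic_detail(M_uv, D_com, D_str):
    """M_uv: (H, W) signed tension map; D_com, D_str: (H, W) detail maps."""
    M_pos = np.clip(M_uv, 0.0, None)    # magnitude where the skin is compressed
    M_neg = np.clip(-M_uv, 0.0, None)   # magnitude where the skin is stretched
    return M_pos * D_com + M_neg * D_str
```

Where the tension is zero (the face region did not deform), no dynamic detail is added at all, so the result falls back to the static detail alone.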

Loss function

Training uses static/dynamic detail losses, a coarse shape loss, a self-supervised loss, a KD loss (age-related), and regularization.

detail loss

\( M_{detail}\): a mask on the UV map that segments only the face region
\( D_{sta/com/str}\): detail maps built from the model output (static, compressed, stretched)
\( \hat{D}_{sta/com/str}\): GT detail maps (synthetic data has GT, but the paper does not explain what happens for real-world data; presumably this loss is skipped for real-world data)

shape loss

\( \mathcal{L}_{sh} = \mathcal{L}_{ver} + \mathcal{L}_{kl}\)
\( \mathcal{L}_{ver} = || M_{ver} \odot (S-\hat{S}) ||\)
- \( M_{ver} \) is a per-vertex mask that restricts the loss to vertices in the facial region of the coarse shape
- \( \hat{S}\): predicted coarse shape, S: GT coarse shape
\( \mathcal{L}_{kl} = \rho(\beta) (\log \rho(\beta) - \log \rho (\hat{\beta}))\)
- a loss that matches the distribution of the shape parameters
- \( \rho\): softmax
- \( \hat{\beta}\): predicted shape, \( \beta\): GT shape
The shape loss is probably also not used for real-world images.
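
A minimal sketch of the two shape terms is below. The norm in \( \mathcal{L}_{ver}\) is not specified in this post, so I assume a per-vertex L2 distance averaged over the masked vertices, and I read \( \mathcal{L}_{kl}\) as a sum over the softmax entries. Function names are made up.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def vertex_loss(S_pred, S_gt, M_ver):
    """S_pred, S_gt: (V, 3); M_ver: (V,) 0/1 mask selecting facial vertices."""
    dist = np.linalg.norm(M_ver[:, None] * (S_gt - S_pred), axis=1)
    return dist.sum() / np.maximum(M_ver.sum(), 1.0)

def kl_loss(beta_gt, beta_pred):
    """KL divergence between softmax-normalized GT and predicted shape codes."""
    p, q = softmax(beta_gt), softmax(beta_pred)
    return np.sum(p * (np.log(p) - np.log(q)))

def shape_loss(S_pred, S_gt, M_ver, beta_pred, beta_gt):
    return vertex_loss(S_pred, S_gt, M_ver) + kl_loss(beta_gt, beta_pred)
```

Both terms need GT shapes and GT shape coefficients, which is why they only make sense on the synthetic part of the training data.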

self-supervised loss

A loss that can also be applied to real-world images (the previous losses are only usable on synthetic data).
\( \mathcal{L}_{self} = \mathcal{L}_{pho} + \lambda_{id}\mathcal{L}_{id} + \lambda_{lmk}\mathcal{L}_{lmk}\)
The photometric loss is an L2 loss between the face image \(\hat{I}^r\) produced by a differentiable renderer (which renders using the face shape, albedo, and illumination) and the face in the photo \(I\). \( M_I\) is a mask that restricts the pixel loss to the facial region.
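
The masked photometric term can be sketched like this, taking the rendered image as a given array (the differentiable renderer itself is out of scope here). Normalizing by the number of masked pixels is my assumption.

```python
import numpy as np

def photometric_loss(I, I_render, M_I):
    """I, I_render: (H, W, 3) images; M_I: (H, W) 0/1 face mask."""
    diff = M_I[..., None] * (I - I_render)
    per_pixel = np.linalg.norm(diff, axis=-1)          # L2 distance over RGB
    return per_pixel.sum() / np.maximum(M_I.sum(), 1.0)
```

Pixels outside the mask (hair, background, occluders) contribute nothing, so the renderer is never penalized for things the face model cannot represent.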

The perceptual loss (denoted id) is a cosine similarity loss between features of a face recognition network \( \Gamma\).
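
A cosine similarity loss of this kind is commonly written as one minus the cosine similarity of the two feature vectors. The sketch below assumes that form and takes the features as plain arrays, since the recognition network \( \Gamma\) itself is not specified here.

```python
import numpy as np

def identity_loss(feat_input, feat_render, eps=1e-8):
    """1 - cosine similarity between two face-recognition feature vectors."""
    cos = feat_input @ feat_render / (
        np.linalg.norm(feat_input) * np.linalg.norm(feat_render) + eps)
    return 1.0 - cos
```

The loss is 0 when the two features point in the same direction and 1 when they are orthogonal, regardless of their magnitudes.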

The landmark loss is an L2 loss between the GT landmarks \( \mu \) and predicted landmarks \( \hat{\mu} \), additionally weighted by the uncertainty \( \sigma\) (the higher the uncertainty, the smaller the loss). The GT landmark positions and uncertainties come from the model in an earlier paper by the same group of authors.
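
As a sketch, the uncertainty weighting can be written as dividing each landmark's squared error by \( 2\sigma^2\), as in a Gaussian likelihood. The exact weighting used in the paper is not reproduced in this post, so this particular form is an assumption.

```python
import numpy as np

def landmark_loss(lmk_gt, lmk_pred, sigma):
    """lmk_gt, lmk_pred: (L, 2) landmark positions; sigma: (L,) uncertainties."""
    sq_dist = np.sum((lmk_gt - lmk_pred) ** 2, axis=-1)
    return np.mean(sq_dist / (2.0 * sigma ** 2))   # uncertain landmarks count less
```

Landmarks that the detector is unsure about (for example, occluded ones) get a large \( \sigma\) and therefore barely pull on the reconstruction.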

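KD loss (age loss)

This loss is based on the intuition that static detail carries age information.
First, a pretrained age classification network \( \Gamma_{age}\) gives a probability distribution over age.
The static coefficient \( \varphi \) is passed through an MLP to get a probability distribution \( \hat{p}_{age}\).
A KL divergence loss is applied between these two distributions.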
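
A small sketch of the distillation term follows. I assume the usual knowledge-distillation direction, KL(teacher || student), with the teacher distribution coming from \( \Gamma_{age}\) and the student logits coming from the MLP on the static coefficient; both are passed in as plain arrays here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kd_age_loss(p_age_teacher, student_logits, eps=1e-8):
    """KL(teacher || student) over age bins.
    p_age_teacher: (A,) probabilities; student_logits: (A,) raw MLP outputs."""
    p_hat = softmax(student_logits)
    return np.sum(p_age_teacher *
                  (np.log(p_age_teacher + eps) - np.log(p_hat + eps)))
```

Since the teacher only needs an image, this loss can be computed on real-world photos too, which gives the static coefficient a training signal even without GT detail maps.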

Regularization

L2 regularization is applied to \( \alpha, \beta, \xi, \varphi, \phi\).

Summary

Implementation detail

Data: as mentioned above, synthetic data and real data are used together.
Training hyperparameters: see the screenshot.
