[Paper] DreamWaltz: Make a Scene with Complex 3D Animatable Avatars

Abstract

DreamWaltz

: A novel framework for generating and animating complex 3D avatars given text guidance and parametric human body prior

Because generating high-quality animatable 3D avatars is difficult, DreamWaltz proposes optimizing an implicit neural representation [NeRF] in a canonical pose with 3D-consistent, occlusion-aware Score Distillation Sampling [SDS].

This approach provides view-aligned supervision through 3D-aware skeleton conditioning, which, according to the authors, makes it possible to generate complex avatars without artifacts or problems such as multiple faces.

For animation, DreamWaltz learns an animatable 3D avatar representation from the image priors of a diffusion model conditioned on diverse poses, so that arbitrary poses can be animated without retraining.

As a result, DreamWaltz is presented as an effective and robust approach for creating animatable 3D avatars that can take on complex shapes and appearances as well as new poses for animation.

Introduction

Existing approaches to creating complex 3D models often relied on manual work, so they were costly and time-consuming.

The paper focuses on proposing solutions that satisfy the following considerations.

Easily controllable over avatar properties through textual descriptions
Capable of producing high-quality and diverse 3D avatars with complex shapes and appearances
The generated avatars should be ready for animation and scene composition with diverse interactions

Deep Learning method
[+] Made it possible to generate 3D human models from monocular images or videos
[-]
Relies too heavily on strong visual priors from images and videos, as well as on human body geometry, so it is ill-suited to generating complex or imaginative shapes and appearances.
2D Generative Model + 3D modeling
[+] Makes 3D digitization more accessible and reduces deep learning's heavy dependence on datasets
[-]
1) Avatars usually require intricate appearance details [loose cloth, hair style]
2) Avatars have an articulated structure, so they can take on diverse poses in a coordinated & constrained way
3) Avatars can change shape and texture details, which makes animation difficult

We present DreamWaltz, a framework for generating high-quality 3D digital avatars from text prompts utilizing human body prior of shapes and poses, ready for animation and composition with diverse avatar-avatar, avatar-object and avatar-scene interactions

DreamWaltz uses NeRF as the 3D avatar representation, a pre-trained text-and-skeleton-conditional diffusion model for shape and appearance supervision, and SMPL for extracting 3D-aware posed skeletons.

Our method enables high-quality avatar generation with 3D-consistent SDS, which resolves the view disparity between the diffusion model’s supervision and NeRF’s rendering

Contribution

We propose a novel text-to-avatar generation framework named DreamWaltz, which is capable of creating animatable 3D avatars with complex shapes and appearances.
For avatar creation, we propose a SMPL-guided 3D-consistent Score Distillation Sampling strategy with occlusion culling, enabling the generation of high-quality avatars, e.g., avoiding the Janus [multi-face] problem and limb ambiguity.
We propose to learn an animatable NeRF representation from diffusion model and human pose prior, which enables the animation of complex avatars. Once trained, we can animate the created avatar with any pose sequence without retraining.
Experiments show that DreamWaltz is effective in creating high-quality and animatable avatars, ready for scene composition with diverse interactions across avatars and objects.

Method

Preliminary

Text-to-3D generation

SMPL

: 3D parametric human body model with a vertex-based linear deformation model, which decomposes body deformation into identity-related and pose-related shape deformation

It has N=6,890 vertices and K=24 keypoints. Thanks to its efficient and expressive representation of human motion, SMPL is used in many human-related fields.
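As a rough illustration of the "vertex-based linear deformation" idea, the toy sketch below adds a weighted sum of per-vertex shape directions to a template mesh. The three-vertex template, the two shape directions, and the name `apply_shape` are made up for illustration. The real SMPL model uses 6,890 vertices, also adds pose-related offsets, and then applies skinning.

```python
# Toy sketch of identity-related linear shape deformation (not the real SMPL model).
# TEMPLATE and SHAPE_DIRS are made-up values chosen only to make the idea visible.

TEMPLATE = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.5, 0.5, 0.0)]

# One per-vertex offset field per shape coefficient (hypothetical values).
SHAPE_DIRS = [
    [(0.0, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.05, 0.0)],    # roughly "taller"
    [(-0.1, 0.0, 0.0), (-0.1, 0.0, 0.0), (0.1, 0.0, 0.0)],   # roughly "wider"
]


def apply_shape(template, shape_dirs, betas):
    """Return template + sum_k betas[k] * shape_dirs[k], vertex by vertex."""
    shaped = []
    for i, vertex in enumerate(template):
        shaped.append(tuple(
            vertex[axis] + sum(b * dirs[i][axis] for b, dirs in zip(betas, shape_dirs))
            for axis in range(3)
        ))
    return shaped


# Only the first coefficient is active, so vertices move along the y axis.
shaped_vertices = apply_shape(TEMPLATE, SHAPE_DIRS, betas=[2.0, 0.0])
```

Because the deformation is linear in the coefficients, changing one coefficient moves every vertex along a fixed direction, which is what makes the shape easy to control with a handful of numbers.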

DreamWaltz: A Text-to-Avatar Generation Framework

Creating a Canonical Avatar

DreamWaltz uses a trainable NeRF as the 3D avatar representation. The SMPL prior is leveraged in two ways.

Initializing NeRF
Extracting 3D-aware and occlusion-aware posed-skeleton to condition ControlNet for 3D-consistent Score Distillation Sampling

SMPL-guided initialization

It is well known that NeRF is slow to train and to run. To speed up NeRF optimization, and to give the diffusion model a reasonable initial input from which it can produce informative supervision, the paper pre-trains the NeRF on the SMPL mesh.

The SMPL model can be placed in the canonical pose to avoid self-occlusion, or in a specific pose to generate a posed avatar. Concretely, a silhouette image of the SMPL model is rendered from a randomly sampled viewpoint, and the MSE loss between this image and the image rendered by the NeRF is minimized. Note that the NeRF renders images in the latent space of Stable Diffusion, so the silhouette image has to be mapped into the latent space by the VAE-based encoder before the loss is computed.
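The initialization loss described above can be sketched as follows. This is a minimal illustration under loud assumptions: `encode_to_latent` is a placeholder for the Stable Diffusion VAE encoder, and the "images" and "latents" are flat lists of floats rather than real feature maps.

```python
# Sketch of the SMPL-guided initialization loss with stand-ins for every model.
# encode_to_latent is a placeholder, NOT the real Stable Diffusion VAE encoder.


def encode_to_latent(silhouette):
    """Placeholder encoder: 2x average pooling over a flat 1D 'image'."""
    return [(silhouette[i] + silhouette[i + 1]) / 2.0
            for i in range(0, len(silhouette) - 1, 2)]


def mse(a, b):
    """Mean squared error between two equally sized latents."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def init_loss(nerf_latent, smpl_silhouette):
    """Compare the NeRF render (already in latent space) with the encoded silhouette."""
    return mse(nerf_latent, encode_to_latent(smpl_silhouette))


silhouette = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]   # toy SMPL mask from one view
target = encode_to_latent(silhouette)
loss_untrained = init_loss([0.5, 0.5, 0.5, 0.5], silhouette)
loss_fitted = init_loss(target, silhouette)
```

The point of the sketch is the order of operations: the silhouette is encoded first, and the comparison happens entirely in latent space.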

As a result, SMPL-guided NeRF initialization is reported to improve the geometry and speed up the convergence of avatar generation.

3D-consistent Score Distillation Sampling

Vanilla SDS uses view-dependent prompt augmentation such as "front view of .." to provide 3D view-consistent supervision.

However, this prompting strategy cannot guarantee precise view consistency, and it leaves unresolved the disparity between the viewpoint of the diffusion model's supervision image and the viewpoint of the NeRF's rendered image.

This inconsistency causes problems in 3D generation such as blurriness and the Janus [multi-face] problem.

To address this, the authors take inspiration from controllable image generation and use an additional 3D-aware conditioning image to improve SDS for 3D-consistent NeRF optimization.

The conditioning image is a skeleton map: a skeleton carries only a minimal structural prior, which helps when generating complex avatars. To obtain 3D-consistent supervision, the viewpoint of the conditioning image must be in sync with the NeRF's rendering viewpoint. For this reason, the SMPL model is used to generate the conditioning image.
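A minimal sketch of one such SDS step is given below, assuming the standard SDS gradient form (weight times predicted noise minus injected noise). The key line is that the skeleton map is drawn from the same camera as the NeRF rendering. All callables (`toy_nerf`, `toy_skeleton`, `toy_denoiser`), the weighting `1 - alpha_bar`, and the prompt are hypothetical stand-ins, not the paper's implementation.

```python
import random


def add_noise(latent, noise, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise."""
    a, b = alpha_bar ** 0.5, (1.0 - alpha_bar) ** 0.5
    return [a * x + b * n for x, n in zip(latent, noise)]


def sds_step(render_nerf, render_skeleton, predict_noise, camera, prompt, alpha_bar, rng):
    """One SDS gradient w.r.t. the rendered latent, conditioned on a view-aligned skeleton."""
    latent = render_nerf(camera)              # NeRF rendering from this camera
    skeleton_map = render_skeleton(camera)    # SMPL skeleton drawn from the SAME camera
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noisy = add_noise(latent, noise, alpha_bar)
    noise_pred = predict_noise(noisy, prompt, skeleton_map)
    weight = 1.0 - alpha_bar                  # assumed weighting; other choices exist
    return [weight * (p - n) for p, n in zip(noise_pred, noise)]


# Toy stand-ins (hypothetical) that record which camera each renderer received.
calls = []


def toy_nerf(camera):
    calls.append(("nerf", camera))
    return [0.2, -0.1]


def toy_skeleton(camera):
    calls.append(("skeleton", camera))
    return {"camera": camera}


def toy_denoiser(noisy, prompt, skeleton_map):
    return [0.0 for _ in noisy]   # a real model would predict the added noise


grad = sds_step(toy_nerf, toy_skeleton, toy_denoiser,
                camera=(30.0, 180.0), prompt="a knight in armor",
                alpha_bar=0.5, rng=random.Random(0))
```

In a real pipeline the returned gradient would be backpropagated through the renderer into the NeRF parameters. Here it only demonstrates that the supervision and the rendering share one viewpoint.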

Occlusion culling

Introducing 3D-aware conditioning images improves 3D consistency in the SDS optimization process. However, its effectiveness depends on how the diffusion model interprets the conditioning image.

In the paper's example, a back-view skeleton map is given to ControlNet as the conditioning image and text-to-image generation is performed. Yet a front-facing face still appears in the generated image. This flaw can lead to problems such as multiple faces or unclear facial features.

The authors therefore propose an occlusion culling algorithm that determines whether the facial keypoints are visible from the given viewpoint and removes them from the skeleton map if they are judged invisible. The body keypoints are left unchanged, because they reside in the SMPL mesh and it is hard to tell whether they are occluded without introducing a new prior.
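The culling step can be sketched as below. This write-up does not spell out the visibility test itself, so the sketch swaps in a simple back-face test (a keypoint counts as visible when its outward normal points toward the camera) as a stand-in. The keypoint names, positions, and normals are illustrative assumptions.

```python
# Stand-in occlusion culling: a back-face test replaces the paper's visibility check.
FACIAL_KEYPOINTS = {"nose", "left_eye", "right_eye", "left_ear", "right_ear"}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def is_visible(position, outward_normal, camera_position):
    """Back-face test: visible when the surface normal points toward the camera."""
    to_camera = tuple(c - p for c, p in zip(camera_position, position))
    return dot(outward_normal, to_camera) > 0.0


def cull_skeleton(keypoints, camera_position):
    """Drop hidden facial keypoints; keep every body keypoint untouched."""
    kept = {}
    for name, (position, normal) in keypoints.items():
        if name in FACIAL_KEYPOINTS and not is_visible(position, normal, camera_position):
            continue
        kept[name] = position
    return kept


# Hypothetical keypoints: (position, outward normal); body keypoints need no normal.
keypoints = {
    "nose": ((0.0, 1.6, 0.1), (0.0, 0.0, 1.0)),
    "neck": ((0.0, 1.4, 0.0), None),
}
front_view = cull_skeleton(keypoints, camera_position=(0.0, 1.5, 3.0))
back_view = cull_skeleton(keypoints, camera_position=(0.0, 1.5, -3.0))
```

From the back camera the nose is removed while the neck stays, so the skeleton map no longer suggests a face to the diffusion model.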

As a brief aside, ControlNet is a model that gives a text-to-image diffusion model a prior in the form of a conditioning image.

Adding Conditional Control to Text-to-Image Diffusion Models


OpenPose
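: Takes a skeleton and a prompt and generates an image based on the input pose

Canny
: Takes the edges [Canny] of an image as an additional condition and generates an image based on them

Depth map
: Generates an image based on the depth of the subject

MLSD
: Extracts only the straight-line contours of the input image and uses them as the condition

Reference-only
: Generates the desired image by referring only to the input image and the prompt, without any other condition

Segmentation
: Uses the segmentation mask of the input image to generate an image that matches the prompt on top of the mask

Scribble
: Generates an image from outlines drawn as a scribble ⇒ more creative, with a higher degree of freedom

In short, ControlNet takes the user's prompt together with another type of condition and generates the desired image.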

Learning an Animatable Avatar

As illustrated in the paper, during training SMPL models in randomly sampled viable poses from VPoser are given to ControlNet as the condition, and a generalizable density weighting function is learned to refine the vertex-based pose transformation, which makes complex avatars possible.

SMPL-guided avatar articulation

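SMPL defines a vertex transformation from observation space to canonical space.

The paper uses this SMPL-guided transformation to articulate the NeRF-represented avatar: a point p sampled in observation space is mapped into canonical space using the transformation of a nearby SMPL mesh vertex.

However, for complex avatars that are not skin-tight, p can lie far from the mesh vertices, so the coordinate transformation becomes erroneous and causes problems such as extra limbs or other artifacts.

To avoid these problems, the authors additionally propose a density weighting mechanism.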
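The nearest-vertex mapping behind this articulation, and why distance matters, can be sketched as follows. It assumes each posed vertex is related to its canonical position by a rigid transform `v_posed = R @ v_canonical + t`. The two-vertex "mesh", the helper names, and the use of a single nearest vertex are illustrative assumptions.

```python
import math


def rot_z(angle):
    """3x3 rotation about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


def matvec(m, v):
    return tuple(sum(m[r][k] * v[k] for k in range(3)) for r in range(3))


def transpose(m):
    return tuple(tuple(m[c][r] for c in range(3)) for r in range(3))


def to_canonical(p, posed_vertices, rotations, translations):
    """Map p to canonical space with the inverse transform of its nearest posed vertex.

    Vertex i is assumed to satisfy v_posed = R_i @ v_canonical + t_i, so the
    inverse is R_i^T @ (p - t_i). Also returns the distance to that vertex.
    """
    nearest = min(range(len(posed_vertices)),
                  key=lambda i: math.dist(p, posed_vertices[i]))
    shifted = tuple(a - b for a, b in zip(p, translations[nearest]))
    p_canonical = matvec(transpose(rotations[nearest]), shifted)
    return p_canonical, math.dist(p, posed_vertices[nearest])


# Toy mesh: a torso vertex that stays put and an arm vertex rotated by 90 degrees.
rotations = [rot_z(0.0), rot_z(math.pi / 2)]
translations = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
canonical_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
posed_vertices = [matvec(r, v) for r, v in zip(rotations, canonical_vertices)]

p_canonical, distance = to_canonical((0.0, 1.1, 0.0), posed_vertices,
                                     rotations, translations)
```

The query point sits just beyond the rotated arm vertex and is carried back along the canonical arm. The returned distance is the quantity that grows for loose clothing or hair, where borrowing a vertex transform becomes unreliable.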

Density weighting network

As mentioned above, the authors propose a density weighting network that suppresses the color contribution of incorrectly transformed points p and thereby reduces artifacts effectively.

To this end, they train a generalizable density weighting network, MLP_DWN.
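The effect of density weighting can be illustrated with a hand-set function in place of the learned network. This sketch assumes the weight depends on the distance between a sampled point and its nearest mesh vertex. That input, the sigmoid form, and the `sharpness` and `threshold` values are all assumptions for illustration, not the trained MLP_DWN.

```python
import math


def density_weight(distance, sharpness=50.0, threshold=0.1):
    """Hand-set stand-in for a learned weight: about 1 near the mesh, about 0 far away."""
    return 1.0 / (1.0 + math.exp(sharpness * (distance - threshold)))


def weighted_density(raw_density, distance):
    """Scale the NeRF density so unreliable points contribute little to rendering."""
    return raw_density * density_weight(distance)


near = weighted_density(10.0, distance=0.01)   # close to a vertex: density mostly kept
far = weighted_density(10.0, distance=0.50)    # far from every vertex: density suppressed
```

Since volume rendering weights each sample's color by its density, pushing the density toward zero is what removes the color contribution of a badly transformed point.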

Sampled human pose prior

To animate the generated avatar with arbitrary motion sequences, the density weighting network must be guaranteed to generalize to arbitrary poses.

To this end, VPoser, a VAE that learns a latent representation of human pose, is used as the human pose prior. During training, SMPL pose parameters are randomly sampled from VPoser to build the corresponding posed meshes. These meshes serve two purposes.

Extracting the skeleton map used as the conditioning image for 3D-consistent SDS
Providing mesh guidance for learning the animatable avatar representation
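Sampling from a VAE pose prior amounts to drawing a latent code from a standard normal distribution and decoding it. The sketch below shows that flow with a placeholder decoder. `toy_decoder`, the latent size of 32, and the 21 body joints are assumptions for illustration and do not call the real VPoser code.

```python
import math
import random

LATENT_DIM = 32        # assumed latent size
NUM_BODY_JOINTS = 21   # assumed number of body joints, 3 axis-angle values each


def toy_decoder(z):
    """Placeholder for the pose decoder: squashes latents into small rotations (radians)."""
    return [0.5 * math.tanh(z[j % len(z)]) for j in range(NUM_BODY_JOINTS * 3)]


def sample_pose(decode, rng, latent_dim=LATENT_DIM):
    """Draw z ~ N(0, I) and decode it into SMPL body pose parameters."""
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return decode(z)


rng = random.Random(0)
pose = sample_pose(toy_decoder, rng)

# The same sampled pose drives both uses listed above.
training_sample = {"skeleton_pose": pose, "mesh_guidance_pose": pose}
```

Reusing one pose for both the skeleton map and the mesh guidance is what keeps the supervision and the articulation consistent within a training step.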

This strategy aligns avatar articulation learning with the SDS supervision and ensures that the network can learn a generalizable density weighting function from diverse poses. The authors also report that the quality of the results improves.

Making a Scene with Animatable 3D Avatars

An advantage of DreamWaltz is that it can produce animations without retraining. However, a newly composed scene may show unruliness, artifacts, or interpenetration, because naive composite rendering does not take the interactions between the different components into account.
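One plausible reading of "naive composite rendering" is to merge the ray samples of all components, sort them by depth, and alpha-composite front to back. The sketch below shows that reading with scalar (grayscale) colors. The sample values and the function name `composite` are illustrative assumptions.

```python
def composite(samples):
    """Front-to-back alpha compositing of (depth, alpha, color) samples along one ray."""
    transmittance, color = 1.0, 0.0
    for depth, alpha, sample_color in sorted(samples):
        color += transmittance * alpha * sample_color
        transmittance *= 1.0 - alpha
    return color, transmittance


# Hypothetical samples along the same ray from two independently trained components.
avatar_samples = [(2.0, 0.5, 1.0)]   # bright avatar surface at depth 2
object_samples = [(1.0, 0.5, 0.0)]   # dark object in front of it at depth 1

pixel_color, remaining = composite(avatar_samples + object_samples)
```

Each component's samples are simply interleaved by depth. Nothing in this procedure stops two components from occupying the same space, which is consistent with the interpenetration issue noted above.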

Experiment

Implementation details

DreamWaltz is implemented in PyTorch and can be trained and evaluated on a single NVIDIA 3090 GPU. For the canonical avatar creation stage, we train the avatar representation for 30,000 iterations, which takes about an hour. For the animatable avatar learning stage, the avatar representation and the introduced density weighting network are further trained for 50,000 iterations. Inference takes less than 3 seconds per rendering frame. Note that the two stages can be combined for joint training, but we decouple them for training efficiency.

Dataset

SMPL-format motion sequences from 3DPW and AIST++, as well as in-the-wild videos, are used.

Evaluation of Canonical Avatars

High-quality avatar generation

Comparison with SOTA methods

User Studies

Evaluation of Animatable Avatars

Ablation Studies

Effectiveness of occlusion culling

Effectiveness of animation learning

Effects of Joint Optimization for Scene Composition

Further Analysis and Application

Shape control via SMPL parameter $\beta$

Creative and Diverse Avatar Generation

More results

Conclusion

Our method learns an animatable NeRF representation that could retarget the generated avatar to any pose without retraining, enabling realistic animation with arbitrary motion sequence
