---
license: apache-2.0
---
<meta name="google-site-verification" content="-XQC-POJtlDPD3i2KSOxbFkSBde_Uq9obAIh_4mxTkM" />

<div align="center">

<h2><a href="https://arxiv.org/abs/2505.10238">MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation</a></h2>

> Official project page of **MTVCrafter**, a novel framework for general and high-quality human image animation using raw 3D motion sequences.

<!--
[Yanbo Ding](https://github.com/DINGYANB),
[Shaobin Zhuang](https://scholar.google.com/citations?user=PGaDirMAAAAJ&hl=zh-CN&oi=ao),
[Kunchang Li](https://scholar.google.com/citations?user=D4tLSbsAAAAJ),
[Zhengrong Yue](https://arxiv.org/search/?searchtype=author&query=Zhengrong%20Yue),
[Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ&hl),
[Yali Wang†](https://scholar.google.com/citations?user=hD948dkAAAAJ)
-->

[![arXiv](https://img.shields.io/badge/arXiv-2505.10238-b31b1b.svg)](https://www.arxiv.org/abs/2505.10238)
[![GitHub](https://img.shields.io/badge/GitHub-MTVCrafter-blue?logo=github)](https://github.com/DINGYANB/MTVCrafter)
[![Hugging Face Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-yellow)](https://huggingface.co/yanboding/)
[![Project Page](https://img.shields.io/badge/🌐%20Page-GitHub.io-brightgreen)](https://dingyanb.github.io/MTVCtafter/)

</div>

## 🔍 Abstract

Human image animation has attracted increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information.
To tackle these problems, we propose **MTVCrafter (Motion Tokenization Video Crafter)**, the first framework that directly models raw 3D motion sequences for open-world human image animation, moving beyond intermediate 2D representations.

- We introduce **4DMoT (4D motion tokenizer)** to encode raw motion data into discrete motion tokens, preserving compact yet expressive 4D spatio-temporal information.
- We then propose **MV-DiT (Motion-aware Video DiT)**, which integrates a motion attention module and 4D positional encodings to effectively modulate vision tokens with motion tokens.
- The overall pipeline enables high-quality human video generation guided by 4D motion tokens.

MTVCrafter achieves **state-of-the-art results with an FID-VID of 6.98**, outperforming the second-best method by approximately **65%**. It generalizes well to diverse characters (single or multiple, full- or half-body) across various styles.
39
+
40
+ ## 🎯 Motivation
41
+
42
+ ![Motivation](./static/images/Motivation.png)
43
+
44
+ Our motivation is that directly tokenizing 4D motion captures more faithful and expressive information than traditional 2D-rendered pose images derived from the driven video.
45
+
46
+ ## πŸ’‘ Method
47
+
48
+ ![Method](./static/images/4DMoT.png)
49
+
50
+ *(1) 4DMoT*:
51
+ Our 4D motion tokenizer consists of an encoder-decoder framework to learn spatio-temporal latent representations of SMPL motion sequences,
52
+ and a vector quantizer to learn discrete tokens in a unified space.
53
+ All operations are performed in 2D space along frame and joint axes.
54
+
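To make the design concrete, below is a minimal PyTorch-style sketch of a vector-quantized motion tokenizer that convolves over the frame and joint axes. The class name, layer sizes, and codebook size are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of a vector-quantized motion tokenizer; layer sizes, the
# codebook size, and all names here are illustrative assumptions.
import torch
import torch.nn as nn

class MotionVQTokenizer(nn.Module):
    def __init__(self, in_dim=3, hidden=256, codebook_size=8192):
        super().__init__()
        # 2D convolutions act jointly over the frame and joint axes,
        # with the (x, y, z) joint coordinates as input channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_dim, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, hidden)
        self.decoder = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, in_dim, kernel_size=3, padding=1),
        )

    def forward(self, motion):  # motion: (B, 3, T frames, J joints)
        z = self.encoder(motion)                          # (B, H, T, J)
        b, h, t, j = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, h)       # (B*T*J, H)
        # Nearest-neighbor codebook lookup per spatio-temporal position.
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        q = self.codebook(idx).view(b, t, j, h).permute(0, 3, 1, 2)
        q = z + (q - z).detach()   # straight-through estimator for training
        recon = self.decoder(q)
        return recon, idx.view(b, t, j)
```

The discrete indices returned here play the role of the 4D motion tokens that condition the video model.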
![Method](./static/images/MV-DiT.png)

*(2) MV-DiT*:
Building on a video DiT architecture, we design a 4D motion attention module to combine motion tokens with vision tokens. Since tokenization and flattening disrupt positional information, we introduce 4D RoPE to recover the spatio-temporal relationships. To further improve generation quality and generalization, we use learnable unconditional tokens for motion classifier-free guidance.
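The following sketch shows how learnable unconditional tokens can support motion classifier-free guidance in a cross-attention block; the names, dimensions, and guidance scale are assumptions for exposition, not the actual MV-DiT code.

```python
# Illustrative motion attention with learnable unconditional tokens;
# dimensions, head count, and names are assumptions, not MV-DiT itself.
import torch
import torch.nn as nn

class MotionAttention(nn.Module):
    def __init__(self, dim=1024, num_uncond_tokens=16):
        super().__init__()
        # Learnable "no motion" tokens let the unconditional branch reuse
        # the same attention path during classifier-free guidance.
        self.uncond_tokens = nn.Parameter(torch.randn(1, num_uncond_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, vision_tokens, motion_tokens=None):
        if motion_tokens is None:  # unconditional branch
            motion_tokens = self.uncond_tokens.expand(vision_tokens.shape[0], -1, -1)
        # Vision tokens attend to motion tokens (cross-attention), then a
        # residual connection modulates the vision stream.
        out, _ = self.attn(vision_tokens, motion_tokens, motion_tokens)
        return vision_tokens + out

def motion_cfg(pred_cond, pred_uncond, scale=3.0):
    # Standard classifier-free guidance: push the prediction away from the
    # unconditional branch along the conditional direction.
    return pred_uncond + scale * (pred_cond - pred_uncond)
```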
---

## 🛠️ Installation

We recommend using a clean Python environment (Python 3.10+).

```bash
# Clone this repository
git clone https://github.com/DINGYANB/MTVCrafter.git && cd MTVCrafter

# Create virtual environment
conda create -n mtvcrafter python=3.11
conda activate mtvcrafter

# Install dependencies
pip install -r requirements.txt
```

## 🚀 Usage

To animate a human image with a given 3D motion sequence, first extract the SMPL motion sequences from the driving video:

```bash
python process_nlf.py "your_video_directory"
```

Then, run the following command to animate the image guided by 4D motion tokens:

```bash
python infer.py --ref_image_path "ref_images/hunam.png" --motion_data_path "data/sample_data.pkl" --output_path "inference_output"
```

- `--ref_image_path`: Path to the reference character image.
- `--motion_data_path`: Path to the motion sequence (`.pkl` format); see the snippet below for a quick way to inspect one.
- `--output_path`: Where to save the generated animation results.
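If you want to sanity-check a motion file before running inference, a plain `pickle` load is enough; the exact keys and array shapes are specific to this repo's preprocessing, so treat this as a generic inspection snippet rather than a format spec.

```python
# Generic inspection of a motion .pkl; the actual keys/shapes depend on
# how process_nlf.py serializes its output.
import pickle

with open("data/sample_data.pkl", "rb") as f:
    motion = pickle.load(f)

print(type(motion))
if isinstance(motion, dict):
    for key, value in motion.items():
        print(key, getattr(value, "shape", type(value)))
```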

To train our 4DMoT on your own dataset, run:

```bash
accelerate launch train_vqvae.py
```
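For orientation, a single `accelerate` training loop for a motion VQ-VAE might look like the sketch below, reusing the illustrative `MotionVQTokenizer` from the Method section; the real `train_vqvae.py` defines its own dataset, model, and loss terms.

```python
# Shape of one accelerate-based VQ-VAE training pass (illustrative only;
# MotionVQTokenizer is the sketch class shown in the Method section).
import torch
import torch.nn.functional as F
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()
model = MotionVQTokenizer()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# Dummy SMPL-like batches: (N, 3 coords, T=16 frames, J=24 joints).
loader = DataLoader(TensorDataset(torch.randn(64, 3, 16, 24)), batch_size=8)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for (motion,) in loader:
    recon, _ = model(motion)
    loss = F.mse_loss(recon, motion)  # real training adds VQ/commitment losses
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
```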

## 📄 Citation

If you find our work useful, please consider citing:

```bibtex
@article{ding2025mtvcrafter,
  title={MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation},
  author={Ding, Yanbo},
  journal={arXiv preprint arXiv:2505.10238},
  year={2025}
}
```

## 📬 Contact

For questions or collaboration, feel free to reach out via GitHub Issues or email me at 📧 [email protected].