---
license: apache-2.0
---

# MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation

> Official project page of **MTVCrafter**, a novel framework for general and high-quality human image animation using raw 3D motion sequences.
>
> 🔗 [Project Page](https://dingyanb.github.io/MTVCtafter/) | 📄 [ArXiv](https://arxiv.org/abs/2505.10238) | 💻 [Code](https://github.com/DINGYANB/MTVCrafter) | 🤗 [Hugging Face Model](https://huggingface.co/yanboding/MTVCrafter)

## 🔍 Abstract

Human image animation has attracted increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information. To tackle these problems, we propose **MTVCrafter (Motion Tokenization Video Crafter)**, the first framework that directly models raw 3D motion sequences for open-world human image animation beyond intermediate 2D representations.

- We introduce **4DMoT (4D motion tokenizer)** to encode raw motion data into discrete motion tokens, preserving compact yet expressive 4D spatio-temporal information.
- Then, we propose **MV-DiT (Motion-aware Video DiT)**, which integrates a motion attention module and 4D positional encodings to effectively modulate vision tokens with motion tokens.
- The overall pipeline facilitates high-quality human video generation guided by 4D motion tokens.

MTVCrafter achieves **state-of-the-art results with an FID-VID of 6.98**, outperforming the second-best method by approximately **65%**. It generalizes well to diverse characters (single/multiple, full/half-body) across various styles.

## 🎯 Motivation

![Motivation](./static/images/Motivation.png)

Our motivation is that directly tokenizing 4D motion captures more faithful and expressive information than traditional 2D-rendered pose images derived from the driving video.

## 💡 Method

![Method](./static/images/4DMoT.png)

*(1) 4DMoT*: Our 4D motion tokenizer consists of an encoder-decoder framework that learns spatio-temporal latent representations of SMPL motion sequences, and a vector quantizer that learns discrete tokens in a unified space. All operations are performed in 2D space along the frame and joint axes.

![Method](./static/images/MV-DiT.png)

*(2) MV-DiT*: Based on a video DiT architecture, we design a 4D motion attention module to combine motion tokens with vision tokens.
Since tokenization and flattening disrupt positional information, we introduce 4D RoPE to recover the spatio-temporal relationships. To further improve generation quality and generalization, we use learnable unconditional tokens for motion classifier-free guidance.

---

## 🛠️ Installation

We recommend using a clean Python environment (Python 3.10+).

```bash
# Clone this repository, then enter its directory
git clone https://github.com/DINGYANB/MTVCrafter.git
cd MTVCrafter

# Create virtual environment
conda create -n mtvcrafter python=3.11
conda activate mtvcrafter

# Install dependencies
pip install -r requirements.txt
```

## 🚀 Usage

To animate a human image with a given 3D motion sequence, you first need to extract the SMPL motion sequences from the driving video:

```bash
python process_nlf.py "your_video_directory"
```

Then, use the following command to animate the image guided by 4D motion tokens:

```bash
python infer.py \
  --ref_image_path "ref_images/hunam.png" \
  --motion_data_path "data/sample_data.pkl" \
  --output_path "inference_output"
```

- `--ref_image_path`: Path to the image of the reference character.
- `--motion_data_path`: Path to the motion sequence (.pkl format).
- `--output_path`: Where to save the generated animation results.

To train 4DMoT on your own dataset, run:

```bash
accelerate launch train_vqvae.py
```

## 📄 Citation

If you find our work useful, please consider citing:

```bibtex
@misc{ding2025mtvcrafter4dmotiontokenization,
  title={MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation},
  author={Yanbo Ding and Xirui Hu and Zhizhi Guo and Yali Wang},
  year={2025},
  eprint={2505.10238},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.10238},
}
```

## 📬 Contact

For questions or collaboration, feel free to reach out via GitHub Issues or email me at 📧 yb.ding@siat.ac.cn.
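
As a supplementary note on the vector quantizer mentioned for 4DMoT in the Method section, the following is a minimal sketch of nearest-codebook lookup, the generic idea behind turning continuous latents into discrete tokens. It is an illustration under assumptions, not the repository's implementation: the function names `nearest_code` and `quantize`, the toy codebook, and the `[frames][joints][dim]` latent layout are all hypothetical, and plain Python is used only to keep the example dependency-free.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def quantize(latents, codebook):
    """Map a [frames][joints][dim] grid of latents to a [frames][joints] grid of token ids."""
    return [[nearest_code(vec, codebook) for vec in frame] for frame in latents]


# Toy data (hypothetical): 3 codes of dimension 2, and 2 frames x 2 joints of latents.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [
    [[0.1, -0.2], [0.9, 1.2]],
    [[-0.8, 1.7], [0.4, 0.4]],
]

tokens = quantize(latents, codebook)
print(tokens)  # [[0, 1], [2, 0]]
```

Each latent vector is replaced by the id of its closest code, so the token grid keeps the frame and joint axes of the input. A trained tokenizer would learn the codebook jointly with the encoder and decoder; that part is outside this sketch.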