---
license: apache-2.0
---

# EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation

Rang Meng¹, Yan Wang, Weipeng Wu, Ruobing Zheng, Yuming Li², Chenguang Ma²
Terminal Technology Department, Alipay, Ant Group.

¹ Core Contributor  ² Corresponding Authors

## 🚀 EchoMimic Series

* EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation. [GitHub](https://github.com/antgroup/echomimic_v3)
* EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation. [GitHub](https://github.com/antgroup/echomimic_v2)
* EchoMimicV1: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditioning. [GitHub](https://github.com/antgroup/echomimic)

## 📣 Updates

* [2025.07.08] 🔥 Our [paper](https://arxiv.org/abs/2507.03905) is now publicly available on arXiv.

## 🌅 Gallery

For more demo videos, please refer to the project page.

## Quick Start

### Environment Setup

- Tested system environment: CentOS 7.2 / Ubuntu 22.04, CUDA >= 12.1
- Tested GPUs: A100 (80G) / RTX 4090D (24G) / V100 (16G)
- Tested Python versions: 3.10 / 3.11

### 🛠️ Installation

#### 1. Create a conda environment and install PyTorch, xformers

```
conda create -n echomimic_v3 python=3.10
conda activate echomimic_v3
```

#### 2. Other dependencies

```
pip install -r requirements.txt
```

### 🧱 Model Preparation

| Models | Download Link | Notes |
| ------------------- | ------------------------------------------------------------------------------ | ------------- |
| Wan2.1-Fun-1.3B-InP | 🤗 [Huggingface](https://huggingface.co/spaces/alibaba-pai/Wan2.1-Fun-1.3B-InP) | Base model    |
| wav2vec2-base       | 🤗 [Huggingface](https://huggingface.co/facebook/wav2vec2-base-960h)            | Audio encoder |
| EchoMimicV3         | 🤗 [Huggingface](https://huggingface.co/BadToBest/EchoMimicV3)                  | Our weights   |

The **weights** are organized as follows (a hedged download sketch is provided at the end of this README).

```
./models/
├── Wan2.1-Fun-1.3B-InP
├── wav2vec2-base-960h
└── transformer
    └── diffusion_pytorch_model.safetensors
```

### 🔑 Quick Inference

```
python infer.py
```

> Tips
> - Audio CFG: Audio CFG works best between 2~3. Increase it for better lip synchronization; decrease it for better visual quality.
> - Text CFG: Text CFG works best between 4~6. Increase it for better prompt following; decrease it for better visual quality.
> - TeaCache: The optimal range for `--teacache_thresh` is 0~0.1.
> - Sampling steps: 5 steps for a talking head, 15~25 steps for a talking body.
> - Long video generation: To generate a video longer than 138 frames, use Long Video CFG.

## 📝 TODO List

| Status | Milestone |
|:--------:|:-------------------------------------------------------------------------|
| 2025.08.08 | Inference code of EchoMimicV3 released on GitHub |
| 🚀 | Preview-version pretrained models trained on English and Chinese on HuggingFace |
| 🚀 | Preview-version pretrained models trained on English and Chinese on ModelScope |
| 🚀 | 720P pretrained models trained on English and Chinese on HuggingFace |
| 🚀 | 720P pretrained models trained on English and Chinese on ModelScope |
| 🚀 | Training code of EchoMimicV3 released on GitHub |

## 📒 Citation

If you find our work useful for your research, please consider citing the paper:

```
@misc{meng2025echomimicv3,
      title={EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation},
      author={Rang Meng and Yan Wang and Weipeng Wu and Ruobing Zheng and Yuming Li and Chenguang Ma},
      year={2025},
      eprint={2507.03905},
      archivePrefix={arXiv}
}
```

## 🌟 Star History

[![Star History Chart](https://api.star-history.com/svg?repos=antgroup/echomimic_v3&type=Date)](https://star-history.com/#antgroup/echomimic_v3&Date)
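## 📥 Model Download Sketch

The weights listed in Model Preparation can be fetched with `huggingface_hub`. The snippet below is a minimal sketch, not part of the official pipeline: the repo IDs mirror the links in the table above, and the target directories are assumptions based on the directory tree shown there, so the exact layout expected by `infer.py` should be verified before running.

```python
# Hedged sketch: download the three checkpoints into ./models/ with huggingface_hub.
# The local_dir layout below is an assumption taken from the tree in Model Preparation.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="alibaba-pai/Wan2.1-Fun-1.3B-InP",
                  local_dir="./models/Wan2.1-Fun-1.3B-InP")   # base video model
snapshot_download(repo_id="facebook/wav2vec2-base-960h",
                  local_dir="./models/wav2vec2-base-960h")    # audio encoder
snapshot_download(repo_id="BadToBest/EchoMimicV3",
                  local_dir="./models/transformer")           # EchoMimicV3 weights
```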