---
license: apache-2.0
pipeline_tag: image-to-video
library_name: phantom
---
Lijie Liu*, Tianxiang Ma*, Bingchuan Li*, Zhuowei Chen*, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, Xinglong Wu
> \* Equal contribution, † Project lead
>
> Intelligent Creation Team, ByteDance
## 🔥 Latest News!

* May 27, 2025: 🎉 We have released the Phantom-Wan-14B model, a more powerful Subject-to-Video generation model.
* Apr 23, 2025: 😊 Thanks to [ComfyUI-WanVideoWrapper](https://github.com/kijai/ComfyUI-WanVideoWrapper/tree/dev) for adapting ComfyUI to Phantom-Wan-1.3B. Everyone is welcome to use it!
* Apr 21, 2025: 👋 Phantom-Wan is coming! We adapted the Phantom framework to the [Wan2.1](https://github.com/Wan-Video/Wan2.1) video generation model. The inference code and checkpoint have been released.
* Apr 10, 2025: We have updated the [full version](https://arxiv.org/pdf/2502.11079v2) of the Phantom paper, which now includes more detailed descriptions of the model architecture and dataset pipeline.
* Feb 16, 2025: We proposed a novel subject-consistent video generation model, **Phantom**, and released the [report](https://arxiv.org/pdf/2502.11079v1) publicly. For more video demos, please visit the [project page](https://phantom-video.github.io/Phantom/).

## 📑 Todo List

- [x] Inference code and checkpoint of Phantom-Wan-1.3B
- [x] Checkpoint of Phantom-Wan-14B
- [ ] Checkpoint of Phantom-Wan-14B Pro
- [ ] Open-source Phantom-Data
- [ ] Training code of Phantom-Wan

## 📖 Overview

Phantom is a unified video generation framework for single- and multi-subject references, built on existing text-to-video and image-to-video architectures. It achieves cross-modal alignment using text-image-video triplet data by redesigning the joint text-image injection model. It also emphasizes subject consistency in human generation and enhances ID-preserving video generation.

## ⚡️ Quickstart

### Installation

Clone the repo:
```sh
git clone https://github.com/Phantom-video/Phantom.git
cd Phantom
```

Install dependencies:
```sh
# Ensure torch >= 2.4.0
pip install -r requirements.txt
```

### Model Download

| Models           | Download Link                                                                                      | Notes                       |
|------------------|----------------------------------------------------------------------------------------------------|-----------------------------|
| Phantom-Wan-1.3B | 🤗 [Huggingface](https://huggingface.co/bytedance-research/Phantom/blob/main/Phantom-Wan-1.3B.pth) | Supports both 480P and 720P |
| Phantom-Wan-14B  | 🤗 [Huggingface](https://huggingface.co/bytedance-research/Phantom/tree/main)                      | Supports both 480P and 720P |

First, download the original Wan2.1-1.3B model, since Phantom-Wan relies on the Wan2.1 VAE and text encoder. Download Wan2.1-T2V-1.3B using huggingface-cli:
```sh
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
```

Then download the Phantom-Wan-1.3B and Phantom-Wan-14B models:
```sh
huggingface-cli download bytedance-research/Phantom --local-dir ./Phantom-Wan-Models
```

Alternatively, you can manually download the required models and place them in the `Phantom-Wan-Models` folder.
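Before running inference, it can help to confirm that the checkpoints landed where the commands in the next section expect them. The snippet below is a minimal, optional sanity check (not part of this repo); the paths assume you used the `--local-dir` values shown above.

```python
# Optional sanity check (not part of the official repo): verify that the downloaded
# weights sit where the example commands below expect them.
from pathlib import Path

# Paths follow the huggingface-cli commands above; adjust if you used different --local-dir values.
expected = [
    Path("./Wan2.1-T2V-1.3B"),                          # Wan2.1 VAE + text encoder
    Path("./Phantom-Wan-Models/Phantom-Wan-1.3B.pth"),  # Phantom-Wan-1.3B checkpoint
]

missing = [p for p in expected if not p.exists()]
if missing:
    print("Missing model paths:")
    for p in missing:
        print(f"  - {p}")
else:
    print("All expected model paths are present.")
```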
### Run Subject-to-Video Generation

#### Phantom-Wan-1.3B

- Single-GPU inference
```sh
python generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models/Phantom-Wan-1.3B.pth --ref_image "examples/ref1.png,examples/ref2.png" --prompt "暖阳漫过草地,扎着双马尾、头戴绿色蝴蝶结、身穿浅绿色连衣裙的小女孩蹲在盛开的雏菊旁。她身旁一只棕白相间的狗狗吐着舌头,毛茸茸尾巴欢快摇晃。小女孩笑着举起黄红配色、带有蓝色按钮的玩具相机,将和狗狗的欢乐瞬间定格。" --base_seed 42
```

- Multi-GPU inference using FSDP + xDiT USP
```sh
pip install "xfuser>=0.4.1"
torchrun --nproc_per_node=8 generate.py --task s2v-1.3B --size 832*480 --ckpt_dir ./Wan2.1-T2V-1.3B --phantom_ckpt ./Phantom-Wan-Models/Phantom-Wan-1.3B.pth --ref_image "examples/ref3.png,examples/ref4.png" --dit_fsdp --t5_fsdp --ulysses_size 4 --ring_size 2 --prompt "夕阳下,一位有着小麦色肌肤、留着乌黑长发的女人穿上有着大朵立体花朵装饰、肩袖处带有飘逸纱带的红色纱裙,漫步在金色的海滩上,海风轻拂她的长发,画面唯美动人。" --base_seed 42
```

> 💡 Note:
> * Changing `--ref_image` switches between single-reference and multi-reference Subject-to-Video generation: pass comma-separated image paths, using at most 4 reference images.
> * For the best results, describe the visual content of the reference images as accurately as possible in `--prompt`. For example, "examples/ref1.png" can be described as "a toy camera in yellow and red with blue buttons".
> * If the generated video is unsatisfactory, the most straightforward fix is to change `--base_seed` and adjust the description in `--prompt` (see the sketch below).

For inference examples, please refer to `infer.sh`. Example generation results can be viewed on the [project page](https://phantom-video.github.io/Phantom/).
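As noted above, the easiest way to improve an unsatisfactory result is to rerun with different seeds. The sketch below is an illustrative helper (not part of this repo) that re-invokes the single-GPU command with several `--base_seed` values; the seed list and the English placeholder prompt are assumptions, while all flags mirror the Phantom-Wan-1.3B example above.

```python
# sweep_seeds.py -- illustrative helper (not part of this repo) that reruns the
# single-GPU Phantom-Wan-1.3B command with several --base_seed values.
import subprocess

SEEDS = [42, 123, 2025]  # arbitrary seeds to try; pick any values you like

BASE_CMD = [
    "python", "generate.py",
    "--task", "s2v-1.3B",
    "--size", "832*480",
    "--ckpt_dir", "./Wan2.1-T2V-1.3B",
    "--phantom_ckpt", "./Phantom-Wan-Models/Phantom-Wan-1.3B.pth",
    "--ref_image", "examples/ref1.png,examples/ref2.png",
    # Placeholder prompt; describe your own reference images as precisely as possible.
    "--prompt", "a little girl photographing her dog with a yellow-and-red toy camera on a sunny lawn",
]

for seed in SEEDS:
    print(f"Generating with --base_seed {seed} ...")
    subprocess.run(BASE_CMD + ["--base_seed", str(seed)], check=True)
```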