Upload folder using huggingface_hub
- .gitattributes +2 -2
- README (1).md +58 -0
- README.md +54 -36
- assets/icon0.png +3 -0
- assets/methods_overview.gif +3 -0
.gitattributes
CHANGED
@@ -34,5 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 assets/grid.gif filter=lfs diff=lfs merge=lfs -text
-assets/
-assets/
+assets/icon0.png filter=lfs diff=lfs merge=lfs -text
+assets/methods_overview.gif filter=lfs diff=lfs merge=lfs -text
README (1).md
ADDED
@@ -0,0 +1,58 @@
# Pusa VidGen

[Codes](https://github.com/Yaofang-Liu/Pusa-VidGen) | [Paper (Coming Soon)](https://huggingface.co/RaphaelLiu/Pusa-V0.5)

## Overview

Pusa is built on a novel video diffusion paradigm with frame-level noise control, unlike conventional video diffusion models; we originally introduced this idea in the [FVDM](https://arxiv.org/abs/2410.03160) paper. With this paradigm, Pusa smoothly supports many video generation tasks (e.g., Text/Image/Video-to-Video) while maintaining high-fidelity motion and strong prompt adherence, thanks to our slight modification of the base model. Pusa-V0.5 is an early preview based on [Mochi1-Preview](https://huggingface.co/genmo/mochi-1-preview). We are open-sourcing it to invite community collaboration to improve the method and extend its capabilities.

✨ **Key Features**
- **Multi-task support**: Text-to-Video, Image-to-Video, Interpolation, Transition, Loop, Long Video, and more
- **Shockingly efficient**: Trained with only 0.1k H800 GPU hours (roughly $0.1k) on 16 H800 GPUs with batch size 32 and 500 training iterations. Training could be made even more efficient on a single node with more parallelism techniques. Collaborations welcome :-)
- **Fully Open-Source**: Code, architecture, and training details included

🔍 **Unique Architecture**
- A novel diffusion model supporting frame-level noise with vectorized timesteps, originally introduced in the [FVDM paper](https://arxiv.org/abs/2410.03160), for flexibility and scalability (see the sketch below).
- Our modification to the base model does not affect its original Text-to-Video generation ability when no fine-tuning is applied.
- The method can be similarly applied to mainstream video diffusion models such as Hunyuan Video, Wan2.1, etc. Collaborations welcome again :-)
20 |
+
## Download Weights
|
21 |
+
|
22 |
+
You can use the Hugging Face CLI to download the model:
|
23 |
+
```
|
24 |
+
pip install huggingface_hub
|
25 |
+
huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>
|
26 |
+
|
27 |
+
```
|
28 |
+
Or, directly download the weights from [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5) to a folder on your computer.
|
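Equivalently, here is a small sketch using the `huggingface_hub` Python API (the local directory below is a placeholder; substitute your own path):

```python
# Download the checkpoint via the huggingface_hub Python API.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="RaphaelLiu/Pusa-V0.5",
    local_dir="./Pusa-V0.5",  # placeholder path -- point this wherever you like
)
```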

## Limitations
Pusa has a few known limitations. The base model Mochi generates videos at low resolution (480p). We expect better results when applying our proposed method to more powerful models like Wan2.1. We also welcome collaboration from the community to improve the model and extend its capabilities.

## Related Work
- [Mochi](https://huggingface.co/genmo/mochi-1-preview) is our base model, a top-tier open-source video generation model on the Artificial Analysis video generation leaderboard.
- [FVDM](https://arxiv.org/abs/2410.03160) introduces frame-level noise control with the vectorized timestep approach that inspired Pusa.
## BibTeX
```
@misc{Liu2025pusa,
  title={Pusa: A Next-Level All-in-One Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/Yaofang-Liu/Pusa-VidGen}}
}
```

```
@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}
```
README.md
CHANGED
@@ -1,62 +1,80 @@
-
 # Pusa VidGen

-

 ## Overview

-Pusa

-✨
-- **Multi-task support**: Text-to-Video, Image-to-Video, Interpolation, Transition, Loop, Long Video, and more
-- **Cost-efficient**: Trained with just 100 H100 GPU hours
-- **Full Open-Source**: Code, architecture, and training details included

-
-

-

-

-
-git clone https://github.com/Yaofang-Liu/Pusa-VidGen
-cd models
-pip install uv
-uv venv .venv
-source .venv/bin/activate
-uv pip install setuptools
-uv pip install -e . --no-build-isolation
-```

-
-```
-uv pip install -e .[flash] --no-build-isolation
-```

-

-##

-
-
 pip install huggingface_hub
 huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>
-
 ```
-Or, directly download the weights from [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5) to a folder on your computer.

 ## Limitations
-

 ## Related Work
-- [mochi](https://huggingface.co/genmo/mochi-1-preview) is our base model, top 3 open-source video generation models in Artifical Analysis Leaderboard for video generation.
-- [FVDM](https://arxiv.org/abs/2410.03160) introduces the vectorized timestep approach that inspired Pusa's frame-level noise control.

-
-
 @misc{Liu2025pusa,
 title={Pusa: A Next-Level All-in-One Video Diffusion Model},
 author={Yaofang Liu and Rui Liu},
@@ -67,7 +85,7 @@ Pusa has a few known limitations. The base model Mochi generates videos at 480p.
 }
 ```

-```
 @article{liu2024redefining,
 title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
 author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
# Pusa VidGen

<div align="center">

[](https://github.com/Yaofang-Liu/Pusa-VidGen)
[](https://huggingface.co/RaphaelLiu/Pusa-V0.5)

</div>

<p align="center">
  <img src="./Pusa-V0.5/assets/methods_overview.gif" width="80%">
</p>

## Overview

Pusa introduces a paradigm shift in video diffusion modeling through frame-level noise control, departing from conventional approaches. This innovation was first presented in our [FVDM](https://arxiv.org/abs/2410.03160) paper. Leveraging this architecture, Pusa seamlessly supports diverse video generation tasks (Text-to-Video, Image-to-Video, etc.) while maintaining exceptional motion fidelity and prompt adherence through our refined base model adaptations. Pusa-V0.5 represents an early preview based on [Mochi1-Preview](https://huggingface.co/genmo/mochi-1-preview). We are open-sourcing this work to foster community collaboration, enhance methodologies, and expand capabilities.

## ✨ Key Features

- **Comprehensive Multi-task Support**:
  - Text-to-Video generation
  - Image-to-Video transformation
  - Frame interpolation
  - Video transitions
  - Seamless looping
  - Extended video generation
  - And more...

- **Unprecedented Efficiency**:
  - Trained with only 0.1k H800 GPU hours
  - Total training cost: $0.1k
  - Hardware: 16 H800 GPUs
  - Configuration: Batch size 32, 500 training iterations
  - *Note: Efficiency can be further improved with single-node training and advanced parallelism techniques. Collaborations welcome!*

- **Complete Open-Source Release**:
  - Full codebase
  - Detailed architecture specifications
  - Comprehensive training methodology

## 🔍 Unique Architecture

- **Novel Diffusion Paradigm**: Implements frame-level noise control with vectorized timesteps, originally introduced in the [FVDM paper](https://arxiv.org/abs/2410.03160), enabling unprecedented flexibility and scalability (illustrated in the sketch below).

- **Non-destructive Modification**: Our adaptations to the base model preserve its original Text-to-Video generation capabilities without requiring fine-tuning.

- **Universal Applicability**: The methodology can be readily applied to leading video diffusion models including Hunyuan Video, Wan2.1, and others. *Collaborations enthusiastically welcomed!*
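As a rough, hypothetical illustration of how a vectorized timestep can encode the different tasks listed above (names and values are illustrative, not Pusa's actual interface), condition frames can simply be held at a near-zero noise level while the frames to be generated stay fully noised:

```python
# Hypothetical sketch: encoding tasks as per-frame timestep vectors.
# Illustrative only -- not Pusa's actual API, names, or noise schedule.
import torch

def task_timesteps(num_frames: int, task: str, t_noisy: int = 999, t_clean: int = 0) -> torch.Tensor:
    """Conditioned frames stay (almost) clean; frames to be generated stay fully noised."""
    t = torch.full((num_frames,), t_noisy)
    if task == "image_to_video":
        t[0] = t_clean                      # the first frame is the given image
    elif task == "interpolation":
        t[0], t[-1] = t_clean, t_clean      # both endpoint frames are given
    # "text_to_video": every frame stays fully noised, nothing to pin
    return t

print(task_timesteps(8, "image_to_video"))  # first frame clean, the rest fully noised
print(task_timesteps(8, "interpolation"))
```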

## Installation and Usage

### Download Weights

**Option 1**: Use the Hugging Face CLI:
```bash
pip install huggingface_hub
huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>
```

**Option 2**: Download directly from [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5) to your local machine.

## Limitations

Pusa currently has several known limitations:
- The base Mochi model generates videos at relatively low resolution (480p)
- We anticipate significant quality improvements when applying our methodology to more advanced models like Wan2.1
- We welcome community contributions to enhance model performance and extend its capabilities

## Related Work

- [Mochi](https://huggingface.co/genmo/mochi-1-preview): Our foundation model, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.
- [FVDM](https://arxiv.org/abs/2410.03160): Introduces the groundbreaking frame-level noise control with vectorized timestep approach that inspired Pusa.

## Citation

If you find our work useful in your research, please consider citing:

```bibtex
@misc{Liu2025pusa,
  title={Pusa: A Next-Level All-in-One Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/Yaofang-Liu/Pusa-VidGen}}
}
```

```bibtex
@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}
```
assets/icon0.png
ADDED
assets/methods_overview.gif
ADDED