RaphaelLiu committed on
Commit 8bd363d · verified · 1 Parent(s): 759dfe0

Upload folder using huggingface_hub

Files changed (5)
  1. .gitattributes +2 -2
  2. README (1).md +58 -0
  3. README.md +54 -36
  4. assets/icon0.png +3 -0
  5. assets/methods_overview.gif +3 -0
.gitattributes CHANGED
@@ -34,5 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  assets/grid.gif filter=lfs diff=lfs merge=lfs -text
- assets/grid.mp4 filter=lfs diff=lfs merge=lfs -text
- assets/mochi-factory.webp filter=lfs diff=lfs merge=lfs -text
+ assets/icon0.png filter=lfs diff=lfs merge=lfs -text
+ assets/methods_overview.gif filter=lfs diff=lfs merge=lfs -text
README (1).md ADDED
@@ -0,0 +1,58 @@
+
+ # Pusa VidGen
+
+ [Code](https://github.com/Yaofang-Liu/Pusa-VidGen) | [Paper (Coming Soon)](https://huggingface.co/RaphaelLiu/Pusa-V0.5)
+
+ ## Overview
+
+ Pusa is built on a novel video diffusion paradigm with frame-level noise control, in contrast to conventional video diffusion models. We originally introduced this paradigm in the [FVDM](https://arxiv.org/abs/2410.03160) paper. With it, Pusa smoothly supports many video generation tasks (e.g., Text/Image/Video-to-Video) while maintaining high-fidelity motion and strong prompt adherence, thanks to our slight modification of the base model. Pusa-V0.5 is an early preview version based on [Mochi1-Preview](https://huggingface.co/genmo/mochi-1-preview). We are open-sourcing it to invite community collaboration to improve the method and extend its capabilities.
+
+ ✨ **Key Features**
+ - **Multi-task support**: Text-to-Video, Image-to-Video, Interpolation, Transition, Loop, Long Video, and more
+ - **Shockingly efficient**: Trained with only 0.1k H800 GPU hours (about $0.1k in cost) on 16 H800 GPUs with batch size 32 and 500 training iterations. Training could be even more efficient on a single node with more parallelism techniques. Collaboration is welcome :-)
+ - **Full Open-Source**: Code, architecture, and training details included
+
+ 🔍 **Unique Architecture**
+ - A novel diffusion model supporting frame-level noise with vectorized timesteps, originally introduced in the [FVDM paper](https://arxiv.org/abs/2410.03160), for flexibility and scalability.
+ - Our modification to the base model does not affect its original Text-to-Video generation ability, even before fine-tuning.
+ - The method can be similarly applied to mainstream video diffusion models such as Hunyuan Video, Wan2.1, etc. Again, collaboration is welcome :-)
+
+ ## Download Weights
+
+ You can use the Hugging Face CLI to download the model:
+ ```bash
+ pip install huggingface_hub
+ huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>
+ ```
+ Or, directly download the weights from [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5) to a folder on your computer.
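If you prefer a programmatic download, the same weights can also be fetched with the `huggingface_hub` Python API. Below is a minimal sketch using `snapshot_download`; the local directory name is only an example placeholder.

```python
from huggingface_hub import snapshot_download

# Download the full Pusa-V0.5 repository snapshot to a local folder.
# "./Pusa-V0.5" is an arbitrary example path; point it at any directory you like.
local_dir = snapshot_download(
    repo_id="RaphaelLiu/Pusa-V0.5",
    local_dir="./Pusa-V0.5",
)
print(f"Weights downloaded to: {local_dir}")
```

Either route produces the same files as the CLI command above.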
+
+ ## Limitations
+ Pusa has a few known limitations. The base model Mochi generates videos at low resolution (480p). We expect better results when applying our proposed method to more powerful models like Wan2.1. We also welcome collaboration from the community to improve the model and extend its capabilities.
+
+ ## Related Work
+ - [Mochi](https://huggingface.co/genmo/mochi-1-preview) is our base model, a top-tier open-source video generation model on the Artificial Analysis leaderboard for video generation.
+ - [FVDM](https://arxiv.org/abs/2410.03160) introduces frame-level noise control with the vectorized timestep approach that inspired Pusa.
+
+ ## BibTeX
+ ```bibtex
+ @misc{Liu2025pusa,
+ title={Pusa: A Next-Level All-in-One Video Diffusion Model},
+ author={Yaofang Liu and Rui Liu},
+ year={2025},
+ publisher={GitHub},
+ journal={GitHub repository},
+ howpublished={\url{https://github.com/Yaofang-Liu/Pusa-VidGen}}
+ }
+ ```
+
+ ```bibtex
+ @article{liu2024redefining,
+ title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
+ author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
+ journal={arXiv preprint arXiv:2410.03160},
+ year={2024}
+ }
+ ```
README.md CHANGED
@@ -1,62 +1,80 @@
-
  # Pusa VidGen
 
- [Codes](https://github.com/Yaofang-Liu/Pusa-VidGen) | [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5)
 
  ## Overview
 
- Pusa is an advanced open-source video generation model that builds upon Mochi 1 with significant enhancements. It supports multiple video generation tasks while maintaining high-fidelity motion and strong prompt adherence. The model is released under a permissive Apache 2.0 license.
 
- **Key Features**
- - **Multi-task support**: Text-to-Video, Image-to-Video, Interpolation, Transition, Loop, Long Video, and more
- - **Cost-efficient**: Trained with just 100 H100 GPU hours
- - **Full Open-Source**: Code, architecture, and training details included
 
- 🔍 **Unique Architecture**
- - A novel diffusion model supporting frame-level noise with vectorized timesteps originally introduced in the [FVDM paper](https://arxiv.org/abs/2410.03160) for flexibility and scalability
 
- ## Installation
 
- Install using [uv](https://github.com/astral-sh/uv):
 
- ```bash
- git clone https://github.com/Yaofang-Liu/Pusa-VidGen
- cd models
- pip install uv
- uv venv .venv
- source .venv/bin/activate
- uv pip install setuptools
- uv pip install -e . --no-build-isolation
- ```
 
- If you want to install flash attention, you can use:
- ```
- uv pip install -e .[flash] --no-build-isolation
- ```
 
- You will also need to install [FFMPEG](https://www.ffmpeg.org/) to turn your outputs into videos.
 
- ## Download Weights
 
- You can use the Hugging Face CLI to download the model:
- ```
  pip install huggingface_hub
  huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>
-
  ```
- Or, directly download the weights from [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5) to a folder on your computer.
 
  ## Limitations
- Pusa has a few known limitations. The base model Mochi generates videos at 480p. We expect to get better results when use our proposed method to more powerful models like Wan2.1. We also welcom collobartion from the community to improve the model and extend its capabilities.
 
  ## Related Work
- - [mochi](https://huggingface.co/genmo/mochi-1-preview) is our base model, top 3 open-source video generation models in Artifical Analysis Leaderboard for video generation.
- - [FVDM](https://arxiv.org/abs/2410.03160) introduces the vectorized timestep approach that inspired Pusa's frame-level noise control.
 
- ## BibTeX
- ```
  @misc{Liu2025pusa,
  title={Pusa: A Next-Level All-in-One Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
@@ -67,7 +85,7 @@ Pusa has a few known limitations. The base model Mochi generates videos at 480p.
  }
  ```
 
- ```
  @article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
 
  # Pusa VidGen
 
+ <div align="center">
+
+ [![GitHub](https://img.shields.io/badge/GitHub-Code-blue?logo=github)](https://github.com/Yaofang-Liu/Pusa-VidGen)
+ [![Paper](https://img.shields.io/badge/Paper-Coming%20Soon-red)](https://huggingface.co/RaphaelLiu/Pusa-V0.5)
+
+ </div>
+
+ <p align="center">
+ <img src="./Pusa-V0.5/assets/methods_overview.gif" width="80%">
+ </p>
 
  ## Overview
 
+ Pusa introduces a paradigm shift in video diffusion modeling through frame-level noise control, departing from conventional approaches. This innovation was first presented in our [FVDM](https://arxiv.org/abs/2410.03160) paper. Leveraging this architecture, Pusa seamlessly supports diverse video generation tasks (Text-to-Video, Image-to-Video, etc.) while maintaining exceptional motion fidelity and prompt adherence through our refined base model adaptations. Pusa-V0.5 represents an early preview based on [Mochi1-Preview](https://huggingface.co/genmo/mochi-1-preview). We are open-sourcing this work to foster community collaboration, enhance methodologies, and expand capabilities.
 
+ ## ✨ Key Features
 
+ - **Comprehensive Multi-task Support**:
+   - Text-to-Video generation
+   - Image-to-Video transformation
+   - Frame interpolation
+   - Video transitions
+   - Seamless looping
+   - Extended video generation
+   - And more...
 
+ - **Unprecedented Efficiency**:
+   - Trained with only 0.1k H800 GPU hours
+   - Total training cost: $0.1k
+   - Hardware: 16 H800 GPUs
+   - Configuration: Batch size 32, 500 training iterations
+   - *Note: Efficiency can be further improved with single-node training and advanced parallelism techniques. Collaborations welcome! (See the back-of-envelope sketch after this list.)*
 
+ - **Complete Open-Source Release**:
+   - Full codebase
+   - Detailed architecture specifications
+   - Comprehensive training methodology
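As a quick sanity check on the efficiency figures above (reading "0.1k" as roughly 100), the implied wall-clock time and per-GPU-hour cost work out as follows:

```python
# Rough arithmetic on the reported training budget; the "100" values below
# are assumptions derived from the "0.1k" figures quoted above.
gpu_hours = 100      # ~0.1k H800 GPU hours
num_gpus = 16        # 16 H800 GPUs
cost_usd = 100       # ~$0.1k total cost

wall_clock_hours = gpu_hours / num_gpus   # ~6.25 hours of wall-clock training
usd_per_gpu_hour = cost_usd / gpu_hours   # ~$1 per GPU-hour implied by the figures
print(f"~{wall_clock_hours:.2f} h wall-clock, ~${usd_per_gpu_hour:.2f}/GPU-hour")
```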
 
+ ## 🔍 Unique Architecture
 
+ - **Novel Diffusion Paradigm**: Implements frame-level noise control with vectorized timesteps, originally introduced in the [FVDM paper](https://arxiv.org/abs/2410.03160), enabling unprecedented flexibility and scalability (see the toy sketch after this list).
 
+ - **Non-destructive Modification**: Our adaptations to the base model preserve its original Text-to-Video generation capabilities without requiring fine-tuning.
 
+ - **Universal Applicability**: The methodology can be readily applied to leading video diffusion models including Hunyuan Video, Wan2.1, and others. *Collaborations enthusiastically welcomed!*
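To make the frame-level noise idea concrete, here is a toy sketch (not Pusa or Mochi code) contrasting a conventional shared timestep with FVDM/Pusa-style vectorized, per-frame timesteps; the tensor shapes and the linear noise schedule are illustrative assumptions only.

```python
import torch

batch, frames = 2, 8
latents = torch.randn(batch, frames, 4, 32, 32)    # hypothetical per-frame latents

# Conventional video diffusion: one scalar timestep shared by every frame.
t_shared = torch.randint(0, 1000, (batch,))         # shape [batch]

# Vectorized timesteps: an independent timestep (noise level) per frame.
t_frame = torch.randint(0, 1000, (batch, frames))   # shape [batch, frames]

# Per-frame control is what enables conditioning tasks: e.g. keep the first
# frame (nearly) clean for Image-to-Video, or the first and last for interpolation.
t_frame[:, 0] = 0

# Apply a toy linear schedule frame-by-frame (illustrative only).
alpha = (1.0 - t_frame.float() / 1000.0).view(batch, frames, 1, 1, 1)
noisy = alpha.sqrt() * latents + (1.0 - alpha).sqrt() * torch.randn_like(latents)
```

The only structural change illustrated here is the timestep's shape, from `[batch]` to `[batch, frames]`.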
 
+ ## Installation and Usage
 
+ ### Download Weights
+
+ **Option 1**: Use the Hugging Face CLI:
+ ```bash
  pip install huggingface_hub
  huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>
  ```
 
+ **Option 2**: Download directly from [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5) to your local machine.
 
  ## Limitations
+
+ Pusa currently has several known limitations:
+ - The base Mochi model generates videos at relatively low resolution (480p)
+ - We anticipate significant quality improvements when applying our methodology to more advanced models like Wan2.1
+ - We welcome community contributions to enhance model performance and extend its capabilities
 
  ## Related Work
 
+ - [Mochi](https://huggingface.co/genmo/mochi-1-preview): Our foundation model, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.
+ - [FVDM](https://arxiv.org/abs/2410.03160): Introduces the groundbreaking frame-level noise control with vectorized timestep approach that inspired Pusa.
+
+ ## Citation
+
+ If you find our work useful in your research, please consider citing:
+
+ ```bibtex
  @misc{Liu2025pusa,
  title={Pusa: A Next-Level All-in-One Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  }
  ```
 
+ ```bibtex
  @article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
assets/icon0.png ADDED

Git LFS Details

  • SHA256: 3c4b25a6bb220be9fa35bd3e5dceec4cdfd7c05624d7578e1560cde16c641100
  • Pointer size: 131 Bytes
  • Size of remote file: 285 kB
assets/methods_overview.gif ADDED

Git LFS Details

  • SHA256: 3aff2d83b2c30b006ae85e221698adf70e143c605bf74cb037c2d0f4b4db66da
  • Pointer size: 131 Bytes
  • Size of remote file: 228 kB