ghunkins committed · Commit fe112cf · verified · 1 parent: 63df4e5

Update README.md

Files changed (1): README.md (+253 -139)

---
license: apache-2.0
language:
- en
- zh
pipeline_tag: image-to-video
library_name: diffusers
tags:
- video
- video-generation
---

# Wan2.1 + Lightx2v

<p align="center">
    💜 <a href=""><b>Wan</b></a> &nbsp;&nbsp;|&nbsp;&nbsp; 🖥️ <a href="https://github.com/Wan-Video/Wan2.1">GitHub</a> &nbsp;&nbsp;|&nbsp;&nbsp; 🤗 <a href="https://huggingface.co/Wan-AI/">Hugging Face</a> &nbsp;&nbsp;|&nbsp;&nbsp; 🤖 <a href="https://modelscope.cn/organization/Wan-AI">ModelScope</a> &nbsp;&nbsp;|&nbsp;&nbsp; 📑 <a href="">Paper (Coming soon)</a> &nbsp;&nbsp;|&nbsp;&nbsp; 📑 <a href="https://wanxai.com">Blog</a> &nbsp;&nbsp;|&nbsp;&nbsp; 💬 <a href="https://gw.alicdn.com/imgextra/i2/O1CN01tqjWFi1ByuyehkTSB_!!6000000000015-0-tps-611-1279.jpg">WeChat Group</a> &nbsp;&nbsp;|&nbsp;&nbsp; 📖 <a href="https://discord.gg/p5XbdQV7">Discord</a>
    <br>
</p>

<hr>

<p align="center">
    🔗 <a href="https://huggingface.co/lightx2v/Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v"><b>Lightx2v</b></a> — Distilled and optimized Wan2.1 for fast, high-quality 480P image-to-video generation
</p>

-----

[**Wan: Open and Advanced Large-Scale Video Generative Models**]() <br>

In this repository, we present **Wan2.1**, a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. **Wan2.1** offers these key features:
- 👍 **SOTA Performance**: **Wan2.1** consistently outperforms existing open-source models and state-of-the-art commercial solutions across multiple benchmarks.
- 👍 **Supports Consumer-grade GPUs**: The T2V-1.3B model requires only 8.19 GB VRAM, making it compatible with almost all consumer-grade GPUs. It can generate a 5-second 480P video on an RTX 4090 in about 4 minutes (without optimization techniques like quantization). Its performance is even comparable to some closed-source models.
- 👍 **Multiple Tasks**: **Wan2.1** excels in Text-to-Video, Image-to-Video, Video Editing, Text-to-Image, and Video-to-Audio, advancing the field of video generation.
- 👍 **Visual Text Generation**: **Wan2.1** is the first video model capable of generating both Chinese and English text, featuring robust text generation that enhances its practical applications.
- 👍 **Powerful Video VAE**: **Wan-VAE** delivers exceptional efficiency and performance, encoding and decoding 1080P videos of any length while preserving temporal information, making it an ideal foundation for video and image generation.

This repo contains our I2V-14B model, which is capable of generating 480P videos, offering fast generation and excellent quality.
+
38
+
39
+ ## Video Demos
40
+
41
+ <div align="center">
42
+ <video width="80%" controls>
43
+ <source src="https://cloud.video.taobao.com/vod/Jth64Y7wNoPcJki_Bo1ZJTDBvNjsgjlVKsNs05Fqfps.mp4" type="video/mp4">
44
+ Your browser does not support the video tag.
45
+ </video>
46
+ </div>
47
+
48
+ ## 🔥 Latest News!!
49
+
50
+ * Feb 25, 2025: 👋 We've released the inference code and weights of Wan2.1.
51
+
52
+

## 📑 Todo List
- Wan2.1 Text-to-Video
    - [x] Multi-GPU inference code of the 14B and 1.3B models
    - [x] Checkpoints of the 14B and 1.3B models
    - [x] Gradio demo
    - [x] Diffusers integration
    - [ ] ComfyUI integration
- Wan2.1 Image-to-Video
    - [x] Multi-GPU inference code of the 14B model
    - [x] Checkpoints of the 14B model
    - [x] Gradio demo
    - [x] Diffusers integration
    - [ ] ComfyUI integration

## Quickstart

#### Installation
Clone the repo:
```
git clone https://github.com/Wan-Video/Wan2.1.git
cd Wan2.1
```

Install dependencies:
```
# Ensure torch >= 2.4.0
pip install -r requirements.txt
```

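If you are unsure whether your environment meets the torch requirement noted in the comment above, a quick check (assuming torch is already installed):

```python
import torch

# Should print a version >= 2.4.0 per the requirement above.
print(torch.__version__)
```
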
#### Model Download

| Models       | Download Link | Notes |
|--------------|---------------|-------|
| T2V-14B      | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-14B) | Supports both 480P and 720P |
| I2V-14B-720P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-720P) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-720P) | Supports 720P |
| I2V-14B-480P | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-I2V-14B-480P) | Supports 480P |
| T2V-1.3B     | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) 🤖 [ModelScope](https://www.modelscope.cn/models/Wan-AI/Wan2.1-T2V-1.3B) | Supports 480P |

> 💡Note: The 1.3B model is capable of generating videos at 720P resolution. However, due to limited training at this resolution, the results are generally less stable compared to 480P. For optimal performance, we recommend using 480P resolution.

Download models using 🤗 huggingface-cli:
```
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P-Diffusers --local-dir ./Wan2.1-I2V-14B-480P-Diffusers
```

Download models using 🤖 modelscope-cli:
```
pip install modelscope
modelscope download Wan-AI/Wan2.1-I2V-14B-480P-Diffusers --local_dir ./Wan2.1-I2V-14B-480P-Diffusers
```
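
If you would rather fetch the weights from Python, the same repository can be pulled with `huggingface_hub.snapshot_download`; a minimal sketch using the repo id and target directory from the commands above:

```python
from huggingface_hub import snapshot_download

# Download the Diffusers-format checkpoint into a local folder,
# mirroring the CLI commands above.
snapshot_download(
    repo_id="Wan-AI/Wan2.1-I2V-14B-480P-Diffusers",
    local_dir="./Wan2.1-I2V-14B-480P-Diffusers",
)
```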

#### Run Image-to-Video Generation

Similar to Text-to-Video, Image-to-Video can be run with or without the prompt extension step. The specific parameters and their corresponding settings are as follows:
<table>
    <thead>
        <tr>
            <th rowspan="2">Task</th>
            <th colspan="2">Resolution</th>
            <th rowspan="2">Model</th>
        </tr>
        <tr>
            <th>480P</th>
            <th>720P</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>i2v-14B</td>
            <td style="color: red;">❌</td>
            <td style="color: green;">✔️</td>
            <td>Wan2.1-I2V-14B-720P</td>
        </tr>
        <tr>
            <td>i2v-14B</td>
            <td style="color: green;">✔️</td>
            <td style="color: red;">❌</td>
            <td>Wan2.1-I2V-14B-480P</td>
        </tr>
    </tbody>
</table>


##### (1) Without Prompt Extension

- Single-GPU inference
```
python generate.py --task i2v-14B --size 832*480 --ckpt_dir ./Wan2.1-I2V-14B-480P --image examples/i2v_input.JPG --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."
```

> 💡For the Image-to-Video task, the `size` parameter represents the area of the generated video, with the aspect ratio following that of the original input image.

- Multi-GPU inference using FSDP + xDiT USP

```
pip install "xfuser>=0.4.1"
torchrun --nproc_per_node=8 generate.py --task i2v-14B --size 832*480 --ckpt_dir ./Wan2.1-I2V-14B-480P --image examples/i2v_input.JPG --dit_fsdp --t5_fsdp --ulysses_size 8 --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."
```

Wan can also be run directly using 🤗 Diffusers!

```python
import torch
import numpy as np
from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel

# Available models: Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)

# Resize the input so its area is at most 480*832 pixels while keeping the aspect
# ratio and staying divisible by the VAE scale factor and transformer patch size.
max_area = 480 * 832
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
image = image.resize((width, height))

prompt = (
    "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in "
    "the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."
)
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
```
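
If the full pipeline does not fit in GPU memory, Diffusers' standard CPU-offloading hook can be used in place of `pipe.to("cuda")` above; this is a generic `DiffusionPipeline` feature rather than anything specific to this checkpoint:

```python
# Optional: keep submodules on the CPU and move each one to the GPU only while
# it runs, trading some speed for a much smaller peak memory footprint.
pipe.enable_model_cpu_offload()
```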

##### (2) Using Prompt Extension

Run with local prompt extension using `Qwen/Qwen2.5-VL-7B-Instruct`:
```
python generate.py --task i2v-14B --size 832*480 --ckpt_dir ./Wan2.1-I2V-14B-480P --image examples/i2v_input.JPG --use_prompt_extend --prompt_extend_model Qwen/Qwen2.5-VL-7B-Instruct --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."
```

Run with remote prompt extension using `dashscope`:
```
DASH_API_KEY=your_key python generate.py --task i2v-14B --size 832*480 --ckpt_dir ./Wan2.1-I2V-14B-480P --image examples/i2v_input.JPG --use_prompt_extend --prompt_extend_method 'dashscope' --prompt "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."
```

##### (3) Running the local Gradio demo

```
cd gradio
# if only the 480P model is used in Gradio
DASH_API_KEY=your_key python i2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_480p ./Wan2.1-I2V-14B-480P

# if only the 720P model is used in Gradio
DASH_API_KEY=your_key python i2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_720p ./Wan2.1-I2V-14B-720P

# if both the 480P and 720P models are used in Gradio
DASH_API_KEY=your_key python i2v_14B_singleGPU.py --prompt_extend_method 'dashscope' --ckpt_dir_480p ./Wan2.1-I2V-14B-480P --ckpt_dir_720p ./Wan2.1-I2V-14B-720P
```

## Manual Evaluation

We conducted extensive manual evaluations of the Image-to-Video model's performance; the results are presented in the table below. They clearly indicate that **Wan2.1** outperforms both closed-source and open-source models.

<div align="center">
    <img src="assets/i2v_res.png" alt="" style="width: 80%;" />
</div>

## Computational Efficiency on Different GPUs

We tested the computational efficiency of different **Wan2.1** models on different GPUs; the results in the table below are presented in the format **Total time (s) / peak GPU memory (GB)**.

<div align="center">
    <img src="assets/comp_effic.png" alt="" style="width: 80%;" />
</div>

> The parameter settings for the tests presented in this table are as follows:
> (1) For the 1.3B model on 8 GPUs, set `--ring_size 8` and `--ulysses_size 1`;
> (2) For the 14B model on 1 GPU, use `--offload_model True`;
> (3) For the 1.3B model on a single 4090 GPU, set `--offload_model True --t5_cpu`;
> (4) For all tests, no prompt extension was applied, meaning `--use_prompt_extend` was not enabled.
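
For example, setting (2) corresponds to adding the offload flag to the single-GPU I2V command from the Quickstart; an illustrative sketch (replace the placeholder prompt with your own):

```
python generate.py --task i2v-14B --size 832*480 --ckpt_dir ./Wan2.1-I2V-14B-480P --image examples/i2v_input.JPG --offload_model True --prompt "<your prompt>"
```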
 

-------

## Introduction of Wan2.1

**Wan2.1** is designed on the mainstream diffusion transformer paradigm, achieving significant advancements in generative capabilities through a series of innovations. These include our novel spatio-temporal variational autoencoder (VAE), scalable training strategies, large-scale data construction, and automated evaluation metrics. Collectively, these contributions enhance the model's performance and versatility.

##### (1) 3D Variational Autoencoders
We propose a novel 3D causal VAE architecture, termed **Wan-VAE**, specifically designed for video generation. By combining multiple strategies, we improve spatio-temporal compression, reduce memory usage, and ensure temporal causality. **Wan-VAE** demonstrates significant advantages in performance efficiency compared to other open-source VAEs. Furthermore, our **Wan-VAE** can encode and decode unlimited-length 1080P videos without losing historical temporal information, making it particularly well-suited for video generation tasks.

<div align="center">
    <img src="assets/video_vae_res.jpg" alt="" style="width: 80%;" />
</div>

##### (2) Video Diffusion DiT

**Wan2.1** is designed using the Flow Matching framework within the paradigm of mainstream Diffusion Transformers. Our model's architecture uses the T5 Encoder to encode multilingual text input, with cross-attention in each transformer block embedding the text into the model structure. Additionally, we employ an MLP with a Linear layer and a SiLU layer to process the input time embeddings and predict six modulation parameters individually. This MLP is shared across all transformer blocks, with each block learning a distinct set of biases. Our experimental findings reveal a significant performance improvement with this approach at the same parameter scale.

<div align="center">
    <img src="assets/video_dit_arch.jpg" alt="" style="width: 80%;" />
</div>

| Model | Dimension | Input Dimension | Output Dimension | Feedforward Dimension | Frequency Dimension | Number of Heads | Number of Layers |
|-------|-----------|-----------------|------------------|-----------------------|---------------------|-----------------|------------------|
| 1.3B  | 1536      | 16              | 16               | 8960                  | 256                 | 12              | 30               |
| 14B   | 5120      | 16              | 16               | 13824                 | 256                 | 40              | 40               |
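
To make the shared time-embedding MLP described above concrete, here is a minimal PyTorch sketch of one way such a modulation module could be wired; the class name, layer ordering, and shapes are illustrative assumptions, not the actual Wan2.1 implementation:

```python
import torch
import torch.nn as nn


class SharedTimeModulation(nn.Module):
    """Illustrative sketch (not the actual Wan2.1 code): a single SiLU + Linear
    stack, shared by every transformer block, predicts six modulation parameters
    from the time embedding; each block only adds its own learned bias on top."""

    def __init__(self, dim: int, num_params: int = 6):
        super().__init__()
        self.num_params = num_params
        self.proj = nn.Sequential(nn.SiLU(), nn.Linear(dim, num_params * dim))

    def forward(self, time_emb: torch.Tensor, block_bias: torch.Tensor):
        # time_emb: (batch, dim); block_bias: (num_params, dim), distinct per block.
        params = self.proj(time_emb).view(time_emb.shape[0], self.num_params, -1)
        params = params + block_bias  # broadcast over the batch dimension
        # e.g. shift / scale / gate for the attention and feed-forward sub-layers
        return params.unbind(dim=1)


# Toy usage with the 1.3B model width from the table above.
shared_mod = SharedTimeModulation(dim=1536)
block_bias = nn.Parameter(torch.zeros(6, 1536))  # one such bias per transformer block
mods = shared_mod(torch.randn(2, 1536), block_bias)
print(len(mods), mods[0].shape)  # 6 tensors of shape (2, 1536)
```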

##### Data

We curated and deduplicated a candidate dataset comprising a vast amount of image and video data. During data curation, we designed a four-step cleaning process focusing on fundamental dimensions, visual quality, and motion quality. Through this robust data processing pipeline, we can easily obtain high-quality, diverse, and large-scale training sets of images and videos.

![figure1](assets/data_for_diff_stage.jpg "figure1")

##### Comparisons to SOTA
We compared **Wan2.1** with leading open-source and closed-source models to evaluate its performance. Using our carefully designed set of 1,035 internal prompts, we tested across 14 major dimensions and 26 sub-dimensions. We then computed the total score as a weighted sum of the per-dimension scores, with weights derived from human preferences in the matching process. The detailed results, shown in the table below, demonstrate our model's superior performance compared to both open-source and closed-source models.

![figure1](assets/vben_vs_sota.png "figure1")
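
As a small illustration of this weighted aggregation (the dimension names and weights below are made up and are not the actual benchmark configuration):

```python
# Hypothetical per-dimension scores and human-preference weights (illustrative only).
scores = {"visual_quality": 0.82, "motion_quality": 0.78, "text_alignment": 0.85}
weights = {"visual_quality": 0.45, "motion_quality": 0.35, "text_alignment": 0.20}

total_score = sum(weights[d] * scores[d] for d in scores)
print(f"weighted total score: {total_score:.3f}")  # 0.45*0.82 + 0.35*0.78 + 0.20*0.85 = 0.812
```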

## Citation
If you find our work helpful, please cite us.

```
@article{wan2.1,
    title   = {Wan: Open and Advanced Large-Scale Video Generative Models},
    author  = {Wan Team},
    journal = {},
    year    = {2025}
}
```

## License Agreement
The models in this repository are licensed under the Apache 2.0 License. We claim no rights over the content you generate, granting you the freedom to use it while ensuring that your usage complies with the provisions of this license. You are fully accountable for your use of the models, which must not involve sharing any content that violates applicable laws, causes harm to individuals or groups, disseminates personal information intended for harm, spreads misinformation, or targets vulnerable populations. For a complete list of restrictions and details regarding your rights, please refer to the full text of the [license](LICENSE.txt).

## Acknowledgements

We would like to thank the contributors to the [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [Qwen](https://huggingface.co/Qwen), [umt5-xxl](https://huggingface.co/google/umt5-xxl), [diffusers](https://github.com/huggingface/diffusers), and [HuggingFace](https://huggingface.co) repositories for their open research.

## Contact Us
If you would like to leave a message for our research or product teams, feel free to join our [Discord](https://discord.gg/p5XbdQV7) or [WeChat groups](https://gw.alicdn.com/imgextra/i2/O1CN01tqjWFi1ByuyehkTSB_!!6000000000015-0-tps-611-1279.jpg)!