
Update model card: Add license and pipeline_tag, improve paper links

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +70 -64
README.md CHANGED
@@ -1,3 +1,12 @@
+---
+license: apache-2.0
+pipeline_tag: image-to-video
+---
+
+---
+license: apache-2.0
+pipeline_tag: image-to-video
+---
 # MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
 
 <div align="center">
@@ -31,7 +40,7 @@ To our knowledge, this is the first publicly available large-scale video-generat
 
 ## 🔥 Latest News
 
-* Oct. 21, 2025: 👋 We are excited to announce the release of the **MUG-V 10B** [technical report](#). We welcome feedback and discussions.
+* Oct. 21, 2025: 👋 We are excited to announce the release of the **MUG-V 10B** [technical report](https://arxiv.org/abs/2510.17519). We welcome feedback and discussions.
 * Oct. 21, 2025: 👋 We've released Megatron-LM–based [training framework](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training) addressing the key challenges of training billion-parameter video generators.
 * Oct. 21, 2025: 👋 We've released **MUG-V video enhancement** [inference code](https://github.com/Shopee-MUG/MUG-V/tree/main/mug_enhancer) and [weights](https://huggingface.co/MUG-V/MUG-V-inference) (based on WAN-2.1 1.3B).
 * Oct. 21, 2025: 👋 We've released **MUG-V 10B** ([e-commerce edition](https://github.com/Shopee-MUG/MUG-V)) inference code and weights.
@@ -233,64 +242,64 @@ MUGDiT adopts the latent diffusion transformer paradigm with rectified flow matc
 
 #### Core Components
 
-1. **VideoVAE**: 8×8×8 spatiotemporal compression
-   - Encoder: 3D convolutions + temporal attention
-   - Decoder: 3D transposed convolutions + temporal upsampling
-   - KL regularization for stable latent space
-
-2. **3D Patch Embedding**: Converts video latents to tokens
-   - Patch size: 2×2×2 (non-overlapping)
-   - Final compression: ~2048× vs. pixel space
-
-3. **Position Encoding**: 3D Rotary Position Embeddings (RoPE)
-   - Extends 2D RoPE to handle temporal dimension
-   - Frequency-based encoding for spatiotemporal modeling
-
-4. **Conditioning Modules**:
-   - **Caption Embedder**: Projects text embeddings (4096-dim) for cross-attention
-   - **Timestep Embedder**: Embeds diffusion timestep via sinusoidal encoding
-   - **Size Embedder**: Handles variable resolution inputs
-
-5. **MUGDiT Transformer Block**:
-
-   ```mermaid
-   graph LR
-       A[Input] --> B[AdaLN]
-       B --> C[Self-Attn<br/>QK-Norm]
-       C --> D[Gate]
-       D --> E1[+]
-       A --> E1
-
-       E1 --> F[LayerNorm]
-       F --> G[Cross-Attn<br/>QK-Norm]
-       G --> E2[+]
-       E1 --> E2
-
-       E2 --> I[AdaLN]
-       I --> J[MLP]
-       J --> K[Gate]
-       K --> E3[+]
-       E2 --> E3
-
-       E3 --> L[Output]
-
-       M[Timestep<br/>Size Info] -.-> B
-       M -.-> I
-
-       N[Text] -.-> G
-
-       style C fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
-       style G fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
-       style J fill:#fff3e0,stroke:#ff9800,stroke-width:2px
-       style E1 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
-       style E2 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
-       style E3 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
-   ```
-
-6. **Rectified Flow Scheduler**:
-   - More stable training than DDPM
-   - Logit-normal timestep sampling
-   - Linear interpolation between noise and data
+1. **VideoVAE**: 8×8×8 spatiotemporal compression
+   - Encoder: 3D convolutions + temporal attention
+   - Decoder: 3D transposed convolutions + temporal upsampling
+   - KL regularization for stable latent space
+
+2. **3D Patch Embedding**: Converts video latents to tokens
+   - Patch size: 2×2×2 (non-overlapping)
+   - Final compression: ~2048× vs. pixel space
+
+3. **Position Encoding**: 3D Rotary Position Embeddings (RoPE)
+   - Extends 2D RoPE to handle temporal dimension
+   - Frequency-based encoding for spatiotemporal modeling
+
+4. **Conditioning Modules**:
+   - **Caption Embedder**: Projects text embeddings (4096-dim) for cross-attention
+   - **Timestep Embedder**: Embeds diffusion timestep via sinusoidal encoding
+   - **Size Embedder**: Handles variable resolution inputs
+
+5. **MUGDiT Transformer Block**:
+
+   ```mermaid
+   graph LR
+       A[Input] --> B[AdaLN]
+       B --> C[Self-Attn<br/>QK-Norm]
+       C --> D[Gate]
+       D --> E1[+]
+       A --> E1
+
+       E1 --> F[LayerNorm]
+       F --> G[Cross-Attn<br/>QK-Norm]
+       G --> E2[+]
+       E1 --> E2
+
+       E2 --> I[AdaLN]
+       I --> J[MLP]
+       J --> K[Gate]
+       K --> E3[+]
+       E2 --> E3
+
+       E3 --> L[Output]
+
+       M[Timestep<br/>Size Info] -.-> B
+       M -.-> I
+
+       N[Text] -.-> G
+
+       style C fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
+       style G fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
+       style J fill:#fff3e0,stroke:#ff9800,stroke-width:2px
+       style E1 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
+       style E2 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
+       style E3 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
+   ```
+
+6. **Rectified Flow Scheduler**:
+   - More stable training than DDPM
+   - Logit-normal timestep sampling
+   - Linear interpolation between noise and data
 
 ## Citation
 If you find our work helpful, please cite us.
@@ -299,7 +308,7 @@ If you find our work helpful, please cite us.
 @article{mug-v2025,
   title={MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models},
   author={Yongshun Zhang and Zhongyi Fan and Yonghang Zhang and Zhangzikang Li and Weifeng Chen and Zhongwei Feng and Chaoyue Wang and Peng Hou and Anxiang Zeng},
-  journal = {arXiv preprint},
+  journal = {arXiv preprint arXiv:2510.17519},
   year={2025}
 }
 ```
@@ -313,7 +322,4 @@ This project is licensed under the Apache License 2.0 - see the [LICENSE](https:
 
 ## Acknowledgements
 
-We would like to thank the contributors to the [Open-Sora](https://github.com/hpcaitech/Open-Sora), [DeepFloyd/t5-v1_1-xxl](https://huggingface.co/DeepFloyd/t5-v1_1-xxl), [Wan-Video](https://github.com/Wan-Video), [Qwen](https://huggingface.co/Qwen), [HuggingFace](https://huggingface.co), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [TransformerEngine](https://github.com/NVIDIA/TransformerEngine), [DiffSynth](https://github.com/modelscope/DiffSynth-Studio), [diffusers](https://github.com/huggingface/diffusers), [PixArt](https://github.com/PixArt-alpha/PixArt-alpha), etc. repositories, for their open research.
-
-
-
+We would like to thank the contributors to the [Open-Sora](https://github.com/hpcaitech/Open-Sora), [DeepFloyd/t5-v1_1-xxl](https://huggingface.co/DeepFloyd/t5-v1_1-xxl), [Wan-Video](https://github.com/Wan-Video), [Qwen](https://huggingface.co/Qwen), [HuggingFace](https://huggingface.co), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [TransformerEngine](https://github.com/NVIDIA/TransformerEngine), [DiffSynth](https://github.com/modelscope/DiffSynth-Studio), [diffusers](https://github.com/huggingface/diffusers), [PixArt](https://github.com/PixArt-alpha/PixArt-alpha), etc. repositories, for their open research.
 
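The Core Components hunk above is dense, so a few sketches follow to make its bullets concrete. First, the compression bookkeeping: an 8×8×8 VAE stride followed by 2×2×2 patchification shrinks every axis 16×. The walk-through below is a minimal sketch; the clip size, and the reading of the card's ~2048× figure, are assumptions rather than values taken from the MUG-V code.

```python
# Shape walk-through of the compression pipeline described in the diff above.
# The 8x8x8 VAE stride and 2x2x2 patch size come from the model card; the
# example clip size is an illustrative guess.

VAE_STRIDE, PATCH = 8, 2
stride = VAE_STRIDE * PATCH                        # 16x reduction per axis

frames, height, width = 96, 720, 1280              # hypothetical ~4 s, 720p clip
t, h, w = frames // stride, height // stride, width // stride

tokens = t * h * w
pixels = frames * height * width
print(f"token grid {t}x{h}x{w} = {tokens} tokens")
print(f"pixel positions per token: {pixels // tokens}")  # 16**3 = 4096
# The card's "~2048x vs. pixel space" is presumably an element-count ratio
# that also folds in channels (3 RGB channels in, a wider token dim out);
# the exact accounting is not spelled out in the card.
```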
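For the 3D RoPE bullet, a common construction in open video DiTs splits each attention head's channels into three groups and applies ordinary 1D rotary embeddings along the token's t, h, and w indices. The sketch below follows that recipe; the equal three-way split, the pairing convention, and the base frequency are assumptions, not confirmed details of MUG-V.

```python
import torch

def rope_1d(pos: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Complex rotations exp(i * pos * freq), shape [len(pos), dim // 2]."""
    freqs = 1.0 / base ** (torch.arange(0, dim, 2).float() / dim)
    angles = torch.outer(pos.float(), freqs)
    return torch.polar(torch.ones_like(angles), angles)

def apply_rope_3d(q: torch.Tensor, tpos: torch.Tensor,
                  hpos: torch.Tensor, wpos: torch.Tensor) -> torch.Tensor:
    """q: [n_tokens, head_dim]; tpos/hpos/wpos: per-token (t, h, w) indices.
    head_dim must be divisible by 6 for the equal three-way split used here."""
    d = q.shape[-1] // 3
    rot = torch.cat([rope_1d(p, d) for p in (tpos, hpos, wpos)], dim=-1)
    q_complex = torch.view_as_complex(q.float().reshape(*q.shape[:-1], -1, 2))
    return torch.view_as_real(q_complex * rot).flatten(-2)

# Example: a 4x8x8 latent token grid with 96-dim heads (32 dims per axis).
t, h, w = torch.meshgrid(torch.arange(4), torch.arange(8), torch.arange(8),
                         indexing="ij")
q = torch.randn(4 * 8 * 8, 96)
q_rot = apply_rope_3d(q, t.flatten(), h.flatten(), w.flatten())
```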
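Among the conditioning modules, the timestep embedder is the most standard piece: "sinusoidal encoding" is plausibly the DiT/ADM-style embedding below, with the width and max period chosen arbitrarily here rather than taken from the release.

```python
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 256,
                       max_period: float = 10000.0) -> torch.Tensor:
    """Map diffusion timesteps t: [batch] to [batch, dim] sinusoidal features."""
    half = dim // 2
    freqs = torch.exp(-torch.log(torch.tensor(max_period)) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 250, 999]))  # -> shape [3, 256]
```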
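The Mermaid diagram above reads directly as a forward pass: an AdaLN-modulated, gated self-attention branch, a plain-LayerNorm cross-attention branch for text, and an AdaLN-modulated, gated MLP branch, each closed by a residual add. Below is a PyTorch sketch of that topology; the layer widths are placeholders and the real 10B block surely differs in detail (e.g. the exact norm used for QK-Norm).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Multi-head attention with QK-Norm: LayerNorm on per-head queries and
    keys before the dot product, as labelled in the diagram."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads, self.hd = heads, dim // heads
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.q_norm = nn.LayerNorm(self.hd)
        self.k_norm = nn.LayerNorm(self.hd)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        split = lambda t: t.view(B, -1, self.heads, self.hd).transpose(1, 2)
        q = self.q_norm(split(self.to_q(x)))
        k = self.k_norm(split(self.to_k(ctx)))
        v = split(self.to_v(ctx))
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, N, D))

class MUGDiTBlockSketch(nn.Module):
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = Attention(dim, heads)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = Attention(dim, heads)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # AdaLN: shift/scale for the two modulated norms plus the two gates,
        # all regressed from the pooled timestep + size conditioning vector.
        self.ada = nn.Linear(dim, 6 * dim)

    def forward(self, x, text, cond):
        # x: [B, N, dim] video tokens; text: [B, M, dim]; cond: [B, dim]
        s1, b1, g1, s2, b2, g2 = self.ada(cond)[:, None].chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1              # AdaLN
        x = x + g1 * self.self_attn(h, h)              # gated residual
        x = x + self.cross_attn(self.norm2(x), text)   # plain LN + cross-attn
        h = self.norm3(x) * (1 + s2) + b2              # AdaLN
        return x + g2 * self.mlp(h)                    # gated residual
```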
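Finally, the Rectified Flow Scheduler bullets pin down a standard training step: draw t from a logit-normal distribution, interpolate linearly between data and noise, and regress the constant velocity between them. A minimal sketch, assuming a unit-normal logit sampler and a velocity-prediction MSE loss:

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """One training step; x0 is a batch of clean VAE latents."""
    b = x0.shape[0]
    # Logit-normal timestep sampling: t = sigmoid(n), n ~ N(0, 1).
    t = torch.sigmoid(torch.randn(b, device=x0.device))
    t_ = t.view(b, *([1] * (x0.dim() - 1)))        # broadcast over latent dims
    noise = torch.randn_like(x0)
    x_t = (1 - t_) * x0 + t_ * noise               # linear data-noise interpolation
    v_target = noise - x0                          # constant velocity along the path
    return F.mse_loss(model(x_t, t), v_target)
```

Compared with DDPM's discrete noise schedule, the straight-line path gives a simple, well-conditioned regression target, which is the training stability the card alludes to.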