
Update model card: Add license and pipeline_tag, improve paper links

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +70 -64
README.md CHANGED
@@ -1,3 +1,12 @@
+---
+license: apache-2.0
+pipeline_tag: image-to-video
+---
+
+---
+license: apache-2.0
+pipeline_tag: image-to-video
+---
 # MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
 
 <div align="center">
@@ -31,7 +40,7 @@ To our knowledge, this is the first publicly available large-scale video-generat
 
 ## 🔥 Latest News
 
-* Oct. 21, 2025: 👋 We are excited to announce the release of the **MUG-V 10B** [technical report](#). We welcome feedback and discussions.
+* Oct. 21, 2025: 👋 We are excited to announce the release of the **MUG-V 10B** [technical report](https://arxiv.org/abs/2510.17519). We welcome feedback and discussions.
 * Oct. 21, 2025: 👋 We've released Megatron-LM–based [training framework](https://github.com/Shopee-MUG/MUG-V-Megatron-LM-Training) addressing the key challenges of training billion-parameter video generators.
 * Oct. 21, 2025: 👋 We've released **MUG-V video enhancement** [inference code](https://github.com/Shopee-MUG/MUG-V/tree/main/mug_enhancer) and [weights](https://huggingface.co/MUG-V/MUG-V-inference) (based on WAN-2.1 1.3B).
 * Oct. 21, 2025: 👋 We've released **MUG-V 10B** ([e-commerce edition](https://github.com/Shopee-MUG/MUG-V)) inference code and weights.
@@ -233,64 +242,64 @@ MUGDiT adopts the latent diffusion transformer paradigm with rectified flow matc
 
 #### Core Components
 
-1. **VideoVAE**: 8×8×8 spatiotemporal compression
-   - Encoder: 3D convolutions + temporal attention
-   - Decoder: 3D transposed convolutions + temporal upsampling
-   - KL regularization for stable latent space
-
-2. **3D Patch Embedding**: Converts video latents to tokens
-   - Patch size: 2×2×2 (non-overlapping)
-   - Final compression: ~2048× vs. pixel space
-
-3. **Position Encoding**: 3D Rotary Position Embeddings (RoPE)
-   - Extends 2D RoPE to handle temporal dimension
-   - Frequency-based encoding for spatiotemporal modeling
-
-4. **Conditioning Modules**:
-   - **Caption Embedder**: Projects text embeddings (4096-dim) for cross-attention
-   - **Timestep Embedder**: Embeds diffusion timestep via sinusoidal encoding
-   - **Size Embedder**: Handles variable resolution inputs
-
-5. **MUGDiT Transformer Block**:
-
-   ```mermaid
-   graph LR
-       A[Input] --> B[AdaLN]
-       B --> C[Self-Attn<br/>QK-Norm]
-       C --> D[Gate]
-       D --> E1[+]
-       A --> E1
-
-       E1 --> F[LayerNorm]
-       F --> G[Cross-Attn<br/>QK-Norm]
-       G --> E2[+]
-       E1 --> E2
-
-       E2 --> I[AdaLN]
-       I --> J[MLP]
-       J --> K[Gate]
-       K --> E3[+]
-       E2 --> E3
-
-       E3 --> L[Output]
-
-       M[Timestep<br/>Size Info] -.-> B
-       M -.-> I
-
-       N[Text] -.-> G
-
-       style C fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
-       style G fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
-       style J fill:#fff3e0,stroke:#ff9800,stroke-width:2px
-       style E1 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
-       style E2 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
-       style E3 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
-   ```
-
-6. **Rectified Flow Scheduler**:
-   - More stable training than DDPM
-   - Logit-normal timestep sampling
-   - Linear interpolation between noise and data
+1. **VideoVAE**: 8×8×8 spatiotemporal compression
+   - Encoder: 3D convolutions + temporal attention
+   - Decoder: 3D transposed convolutions + temporal upsampling
+   - KL regularization for stable latent space
+
+2. **3D Patch Embedding**: Converts video latents to tokens
+   - Patch size: 2×2×2 (non-overlapping)
+   - Final compression: ~2048× vs. pixel space
+
+3. **Position Encoding**: 3D Rotary Position Embeddings (RoPE)
+   - Extends 2D RoPE to handle temporal dimension
+   - Frequency-based encoding for spatiotemporal modeling
+
+4. **Conditioning Modules**:
+   - **Caption Embedder**: Projects text embeddings (4096-dim) for cross-attention
+   - **Timestep Embedder**: Embeds diffusion timestep via sinusoidal encoding
+   - **Size Embedder**: Handles variable resolution inputs
+
+5. **MUGDiT Transformer Block**:
+
+   ```mermaid
+   graph LR
+       A[Input] --> B[AdaLN]
+       B --> C[Self-Attn<br/>QK-Norm]
+       C --> D[Gate]
+       D --> E1[+]
+       A --> E1
+
+       E1 --> F[LayerNorm]
+       F --> G[Cross-Attn<br/>QK-Norm]
+       G --> E2[+]
+       E1 --> E2
+
+       E2 --> I[AdaLN]
+       I --> J[MLP]
+       J --> K[Gate]
+       K --> E3[+]
+       E2 --> E3
+
+       E3 --> L[Output]
+
+       M[Timestep<br/>Size Info] -.-> B
+       M -.-> I
+
+       N[Text] -.-> G
+
+       style C fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
+       style G fill:#f3e5f5,stroke:#9c27b0,stroke-width:2px
+       style J fill:#fff3e0,stroke:#ff9800,stroke-width:2px
+       style E1 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
+       style E2 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
+       style E3 fill:#c8e6c9,stroke:#4caf50,stroke-width:2px
+   ```
+
+6. **Rectified Flow Scheduler**:
+   - More stable training than DDPM
+   - Logit-normal timestep sampling
+   - Linear interpolation between noise and data
 
 ## Citation
 If you find our work helpful, please cite us.
@@ -299,7 +308,7 @@ If you find our work helpful, please cite us.
 @article{mug-v2025,
   title={MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models},
   author={Yongshun Zhang and Zhongyi Fan and Yonghang Zhang and Zhangzikang Li and Weifeng Chen and Zhongwei Feng and Chaoyue Wang and Peng Hou and Anxiang Zeng},
-  journal = {arXiv preprint},
+  journal = {arXiv preprint arXiv:2510.17519},
   year={2025}
 }
 ```
@@ -313,7 +322,4 @@ This project is licensed under the Apache License 2.0 - see the [LICENSE](https:
 
 ## Acknowledgements
 
-We would like to thank the contributors to the [Open-Sora](https://github.com/hpcaitech/Open-Sora), [DeepFloyd/t5-v1_1-xxl](https://huggingface.co/DeepFloyd/t5-v1_1-xxl), [Wan-Video](https://github.com/Wan-Video), [Qwen](https://huggingface.co/Qwen), [HuggingFace](https://huggingface.co), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [TransformerEngine](https://github.com/NVIDIA/TransformerEngine), [DiffSynth](https://github.com/modelscope/DiffSynth-Studio), [diffusers](https://github.com/huggingface/diffusers), [PixArt](https://github.com/PixArt-alpha/PixArt-alpha), etc. repositories, for their open research.
-
-
-
+We would like to thank the contributors to the [Open-Sora](https://github.com/hpcaitech/Open-Sora), [DeepFloyd/t5-v1_1-xxl](https://huggingface.co/DeepFloyd/t5-v1_1-xxl), [Wan-Video](https://github.com/Wan-Video), [Qwen](https://huggingface.co/Qwen), [HuggingFace](https://huggingface.co), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [TransformerEngine](https://github.com/NVIDIA/TransformerEngine), [DiffSynth](https://github.com/modelscope/DiffSynth-Studio), [diffusers](https://github.com/huggingface/diffusers), [PixArt](https://github.com/PixArt-alpha/PixArt-alpha), etc. repositories, for their open research.
 
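The Core Components hunk above is dense, so a few sketches follow to make its bullets concrete. First, the compression bookkeeping: an 8×8×8 VAE stride followed by 2×2×2 patchification shrinks every axis 16×. The walk-through below is a minimal sketch; the clip size, and the reading of the card's ~2048× figure, are assumptions rather than values taken from the MUG-V code.

```python
# Shape walk-through of the compression pipeline described in the diff above.
# The 8x8x8 VAE stride and 2x2x2 patch size come from the model card; the
# example clip size is an illustrative guess.

VAE_STRIDE, PATCH = 8, 2
stride = VAE_STRIDE * PATCH                        # 16x reduction per axis

frames, height, width = 96, 720, 1280              # hypothetical ~4 s, 720p clip
t, h, w = frames // stride, height // stride, width // stride

tokens = t * h * w
pixels = frames * height * width
print(f"token grid {t}x{h}x{w} = {tokens} tokens")
print(f"pixel positions per token: {pixels // tokens}")  # 16**3 = 4096
# The card's "~2048x vs. pixel space" is presumably an element-count ratio
# that also folds in channels (3 RGB channels in, a wider token dim out);
# the exact accounting is not spelled out in the card.
```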
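For the 3D RoPE bullet, a common construction in open video DiTs splits each attention head's channels into three groups and applies ordinary 1D rotary embeddings along the token's t, h, and w indices. The sketch below follows that recipe; the equal three-way split, the pairing convention, and the base frequency are assumptions, not confirmed details of MUG-V.

```python
import torch

def rope_1d(pos: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Complex rotations exp(i * pos * freq), shape [len(pos), dim // 2]."""
    freqs = 1.0 / base ** (torch.arange(0, dim, 2).float() / dim)
    angles = torch.outer(pos.float(), freqs)
    return torch.polar(torch.ones_like(angles), angles)

def apply_rope_3d(q: torch.Tensor, tpos: torch.Tensor,
                  hpos: torch.Tensor, wpos: torch.Tensor) -> torch.Tensor:
    """q: [n_tokens, head_dim]; tpos/hpos/wpos: per-token (t, h, w) indices.
    head_dim must be divisible by 6 for the equal three-way split used here."""
    d = q.shape[-1] // 3
    rot = torch.cat([rope_1d(p, d) for p in (tpos, hpos, wpos)], dim=-1)
    q_complex = torch.view_as_complex(q.float().reshape(*q.shape[:-1], -1, 2))
    return torch.view_as_real(q_complex * rot).flatten(-2)

# Example: a 4x8x8 latent token grid with 96-dim heads (32 dims per axis).
t, h, w = torch.meshgrid(torch.arange(4), torch.arange(8), torch.arange(8),
                         indexing="ij")
q = torch.randn(4 * 8 * 8, 96)
q_rot = apply_rope_3d(q, t.flatten(), h.flatten(), w.flatten())
```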
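Among the conditioning modules, the timestep embedder is the most standard piece: "sinusoidal encoding" is plausibly the DiT/ADM-style embedding below, with the width and max period chosen arbitrarily here rather than taken from the release.

```python
import torch

def timestep_embedding(t: torch.Tensor, dim: int = 256,
                       max_period: float = 10000.0) -> torch.Tensor:
    """Map diffusion timesteps t: [batch] to [batch, dim] sinusoidal features."""
    half = dim // 2
    freqs = torch.exp(-torch.log(torch.tensor(max_period)) * torch.arange(half) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

emb = timestep_embedding(torch.tensor([0, 250, 999]))  # -> shape [3, 256]
```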
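The Mermaid diagram above reads directly as a forward pass: an AdaLN-modulated, gated self-attention branch, a plain-LayerNorm cross-attention branch for text, and an AdaLN-modulated, gated MLP branch, each closed by a residual add. Below is a PyTorch sketch of that topology; the layer widths are placeholders and the real 10B block surely differs in detail (e.g. the exact norm used for QK-Norm).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Multi-head attention with QK-Norm: LayerNorm on per-head queries and
    keys before the dot product, as labelled in the diagram."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads, self.hd = heads, dim // heads
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.q_norm = nn.LayerNorm(self.hd)
        self.k_norm = nn.LayerNorm(self.hd)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        split = lambda t: t.view(B, -1, self.heads, self.hd).transpose(1, 2)
        q = self.q_norm(split(self.to_q(x)))
        k = self.k_norm(split(self.to_k(ctx)))
        v = split(self.to_v(ctx))
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, N, D))

class MUGDiTBlockSketch(nn.Module):
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = Attention(dim, heads)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = Attention(dim, heads)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # AdaLN: shift/scale for the two modulated norms plus the two gates,
        # all regressed from the pooled timestep + size conditioning vector.
        self.ada = nn.Linear(dim, 6 * dim)

    def forward(self, x, text, cond):
        # x: [B, N, dim] video tokens; text: [B, M, dim]; cond: [B, dim]
        s1, b1, g1, s2, b2, g2 = self.ada(cond)[:, None].chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1              # AdaLN
        x = x + g1 * self.self_attn(h, h)              # gated residual
        x = x + self.cross_attn(self.norm2(x), text)   # plain LN + cross-attn
        h = self.norm3(x) * (1 + s2) + b2              # AdaLN
        return x + g2 * self.mlp(h)                    # gated residual
```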
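Finally, the Rectified Flow Scheduler bullets pin down a standard training step: draw t from a logit-normal distribution, interpolate linearly between data and noise, and regress the constant velocity between them. A minimal sketch, assuming a unit-normal logit sampler and a velocity-prediction MSE loss:

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """One training step; x0 is a batch of clean VAE latents."""
    b = x0.shape[0]
    # Logit-normal timestep sampling: t = sigmoid(n), n ~ N(0, 1).
    t = torch.sigmoid(torch.randn(b, device=x0.device))
    t_ = t.view(b, *([1] * (x0.dim() - 1)))        # broadcast over latent dims
    noise = torch.randn_like(x0)
    x_t = (1 - t_) * x0 + t_ * noise               # linear data-noise interpolation
    v_target = noise - x0                          # constant velocity along the path
    return F.mse_loss(model(x_t, t), v_target)
```

Compared with DDPM's discrete noise schedule, the straight-line path gives a simple, well-conditioned regression target, which is the training stability the card alludes to.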