Update README.md
README.md CHANGED
@@ -1,7 +1,7 @@
---
language:
- en
-license:
+license: other
pipeline_tag: text-to-video
tags:
- video-generation

@@ -124,4 +124,169 @@ CogVideoX is an open-source video generation model similar to [QingYing](https:/
</tr>
</table>

**Data Explanation**

+ Testing with the `diffusers` library enabled all optimizations the library provides. This scheme has not been tested
  on devices other than NVIDIA A100/H100, but it should generally work on all devices with the NVIDIA Ampere
  architecture or newer. Disabling these optimizations can triple VRAM usage but also increase speed by 3-4x. You can
  selectively disable certain optimizations, including:

```python
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```

+ For multi-GPU inference, the `enable_sequential_cpu_offload()` optimization needs to be disabled (see the sketch
  after this list).
+ Using the INT8 model lets GPUs with lower VRAM run inference normally with minimal loss in video quality, at the
  cost of a significant reduction in inference speed.
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
  used to quantize the text encoder, transformer, and VAE modules, reducing CogVideoX's memory requirements and making
  it feasible to run the model on GPUs with less VRAM. TorchAO quantization is fully compatible with `torch.compile`,
  which significantly improves inference speed. `FP8` precision requires an NVIDIA H100 or newer GPU as well as source
  installation of `torch`, `torchao`, `diffusers`, and `accelerate`; `CUDA 12.4` is recommended.
+ Inference speed testing also used the above VRAM optimizations; without them, speed increases by about 10%. Only the
  `diffusers` versions of the models support quantization.
+ The model supports English input only; prompts in other languages should be translated into English during prompt
  refinement with a larger model.
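
The multi-GPU note above is easiest to see in code. The following is a minimal sketch and not part of the original
card: it assumes a `diffusers` version whose `from_pretrained` accepts `device_map="balanced"` for sharding pipeline
components across the visible GPUs, and it simply omits `enable_sequential_cpu_offload()` as the note requires.

```python
# Hedged sketch for the multi-GPU note above (not from the original card).
# Assumes a recent diffusers release where device_map="balanced" spreads the
# text encoder / transformer / VAE across all visible GPUs.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16,
    device_map="balanced",  # shard components over the GPUs instead of CPU offloading
)

# Per the note above, enable_sequential_cpu_offload() is NOT called here.
# VAE slicing/tiling can still be used to keep decoding memory low.
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

video = pipe(
    prompt="A panda playing a guitar in a bamboo forest.",
    num_inference_steps=50,
    num_frames=81,
    guidance_scale=6,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```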

**Note**

+ Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference with, and fine-tuning of, the SAT-version
  models. Check our GitHub for more details.

## Getting Started Quickly 🤗

This model supports deployment using the Hugging Face `diffusers` library. You can follow the steps below to get started.

**We recommend that you visit our [GitHub](https://github.com/THUDM/CogVideo) to learn about prompt optimization and
conversion, which will give you a better experience.**

1. Install the required dependencies

```shell
# diffusers (from source)
# transformers>=4.46.2
# accelerate>=1.1.1
# imageio-ffmpeg>=0.5.1
pip install git+https://github.com/huggingface/diffusers
pip install --upgrade transformers accelerate imageio-ffmpeg
```

2. Run the code

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    torch_dtype=torch.bfloat16
)

pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=81,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```
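
As noted in the Data Explanation section, the memory-saving calls in the example above trade speed for VRAM. The
variant below is a sketch of the faster configuration on a single high-VRAM GPU (e.g. A100/H100); it is our addition,
not part of the original example, and the `torch.compile` call is optional.

```python
# Hedged sketch (not from the original card): trade VRAM for speed on a high-VRAM GPU
# by keeping the whole pipeline resident on the GPU instead of sequentially offloading it.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX1.5-5B", torch_dtype=torch.bfloat16)
pipe.to("cuda")  # no enable_sequential_cpu_offload(): much faster, much more VRAM

# Optional: compiling the transformer can further speed up repeated calls.
pipe.transformer = torch.compile(pipe.transformer)

video = pipe(
    prompt="A panda playing a guitar in a bamboo forest.",
    num_inference_steps=50,
    num_frames=81,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]
export_to_video(video, "output_fast.mp4", fps=8)
```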

## Quantized Inference

[PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
used to quantize the text encoder, transformer, and VAE modules to reduce CogVideoX's memory requirements. This allows
the model to run on a free T4 Colab or on GPUs with lower VRAM. Also, note that TorchAO quantization is fully
compatible with `torch.compile`, which can significantly accelerate inference.

```python
# To get started, install PytorchAO from its GitHub source and use a PyTorch nightly build.
# Source and nightly installation is only required until the next release.

import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video
from transformers import T5EncoderModel
from torchao.quantization import quantize_, int8_weight_only

quantization = int8_weight_only

text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX1.5-5B", subfolder="text_encoder",
                                              torch_dtype=torch.bfloat16)
quantize_(text_encoder, quantization())

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX1.5-5B", subfolder="transformer",
                                                          torch_dtype=torch.bfloat16)
quantize_(transformer, quantization())

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX1.5-5B", subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, quantization())

# Create the text-to-video pipeline from the quantized components and run inference
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
)

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

prompt = "A little girl is riding a bicycle at high speed. Focused, detailed, realistic."
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=81,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```
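
The example above uses TorchAO. The card also mentions Optimum-quanto; the sketch below is our addition, assuming
optimum-quanto's `quantize`/`freeze` API, and applies int8 weight quantization to the same three modules on an
already-loaded pipeline.

```python
# Hedged sketch (not from the original card): the same idea with Optimum-quanto.
# Assumes `pip install optimum-quanto` and its quantize/freeze API.
import torch
from diffusers import CogVideoXPipeline
from optimum.quanto import quantize, freeze, qint8

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX1.5-5B", torch_dtype=torch.bfloat16)

# Quantize the weights of the heavy modules to int8, then freeze to materialize them.
for module in (pipe.text_encoder, pipe.transformer, pipe.vae):
    quantize(module, weights=qint8)
    freeze(module)

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()
```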

Additionally, these models can be serialized with PytorchAO and stored in quantized data types to save disk space. You
can find examples and benchmarks at the following links (a minimal serialization sketch follows the links below):

- [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
- [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)
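
For completeness, here is a minimal serialization sketch of our own, assuming the TorchAO-quantized module can simply
be pickled with `torch.save`; the gists above show the tested approach and the corresponding disk-space benchmarks.

```python
# Hedged sketch (not from the original card): persist a TorchAO int8-quantized transformer
# so it does not have to be re-quantized on every run.
import torch
from diffusers import CogVideoXTransformer3DModel
from torchao.quantization import quantize_, int8_weight_only

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX1.5-5B", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())

# Save the quantized module; the file stores int8 weights, saving disk space.
torch.save(transformer, "cogvideox_transformer_int8.pt")

# Later: reload it and pass it to the pipeline via the `transformer=` argument.
transformer = torch.load("cogvideox_transformer_int8.pt", weights_only=False)
```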

## Further Exploration

Feel free to visit our [GitHub](https://github.com/THUDM/CogVideo), where you'll find:

1. More detailed technical explanations and code.
2. Optimized prompt examples and conversions.
3. Detailed code for model inference and fine-tuning.
4. Project update logs and more interactive opportunities.
5. The CogVideoX toolchain to help you make better use of the model.
6. INT8 model inference code.

## Model License

This model is released under the [CogVideoX LICENSE](LICENSE).

## Citation

```
@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}
```