THUDM-Space committed
Commit fdc5267 · verified · 1 Parent(s): 1df5201

Update README.md

Files changed (1):
  1. README.md +167 -2
README.md CHANGED
@@ -1,7 +1,7 @@
  ---
  language:
  - en
- license: apache-2.0
+ license: other
  pipeline_tag: text-to-video
  tags:
  - video-generation
@@ -124,4 +124,169 @@ CogVideoX is an open-source video generation model similar to [QingYing](https:/
  </tr>
  </table>

- **(rest of the content remains the same as the original)**
+ **Data Explanation**
+
+ + Testing with the `diffusers` library enabled all the optimizations the library provides. This scheme has not been
+ tested on devices other than the NVIDIA A100/H100, but it should generally work on any device of NVIDIA Ampere
+ architecture or newer. Disabling the optimizations roughly triples VRAM usage but makes inference 3-4 times faster.
+ You can selectively disable individual optimizations (see the sketch after this list), including:
+
+ ```python
+ pipe.enable_sequential_cpu_offload()
+ pipe.vae.enable_slicing()
+ pipe.vae.enable_tiling()
+ ```
+
+ + In multi-GPU inference, the `enable_sequential_cpu_offload()` optimization needs to be disabled.
+ + Using an INT8 model meets the VRAM requirements of smaller GPUs with only minimal video-quality degradation, at
+ the cost of a significant reduction in inference speed.
+ + [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
+ used to quantize the text encoder, Transformer, and VAE modules, reducing CogVideoX's memory requirements and making it
+ feasible to run the model on GPUs with less VRAM. TorchAO quantization is fully compatible with `torch.compile`,
+ which significantly improves inference speed. `FP8` precision requires an NVIDIA H100 or newer and a source
+ installation of `torch`, `torchao`, `diffusers`, and `accelerate`. Using `CUDA 12.4` is recommended.
+ + Inference speed testing used the VRAM optimizations above; without them, inference is about 10% faster. Only
+ `diffusers` versions of the models support quantization.
+ + The models support English input only; prompts in other languages should be translated into English with a larger
+ model during prompt crafting.
+
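+ A minimal sketch of running with the optimizations disabled (an illustration, not part of the original README; it
+ assumes a single GPU with enough VRAM to hold the whole pipeline, e.g. an A100/H100):
+
+ ```python
+ import torch
+ from diffusers import CogVideoXPipeline
+
+ pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX1.5-5B", torch_dtype=torch.bfloat16)
+
+ # Keep the whole pipeline resident on the GPU instead of sequentially offloading
+ # submodules to the CPU; this is also what multi-GPU inference requires.
+ pipe.to("cuda")
+
+ # Leave VAE slicing/tiling off: decoding the full latent at once uses more VRAM
+ # but avoids the extra passes (these are the defaults, shown here for contrast).
+ pipe.vae.disable_slicing()
+ pipe.vae.disable_tiling()
+ ```
+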
+ **Note**
+
+ + Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT-version models. Check
+ our GitHub for more details.
+
+ ## Getting Started Quickly 🤗
+
+ This model can be deployed with the Hugging Face `diffusers` library. Follow the steps below to get started.
+
+ **We recommend that you visit our [GitHub](https://github.com/THUDM/CogVideo) to check out prompt optimization and
+ conversion for a better experience.**
+
+ 1. Install the required dependencies
+
+ ```shell
+ # diffusers is installed from source; the others are minimum versions:
+ # transformers>=4.46.2
+ # accelerate>=1.1.1
+ # imageio-ffmpeg>=0.5.1
+ pip install git+https://github.com/huggingface/diffusers
+ pip install --upgrade transformers accelerate imageio-ffmpeg
+ ```
+
+ 2. Run the code
+
+ ```python
+ import torch
+ from diffusers import CogVideoXPipeline
+ from diffusers.utils import export_to_video
+
+ prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
+
+ pipe = CogVideoXPipeline.from_pretrained(
+     "THUDM/CogVideoX1.5-5B",
+     torch_dtype=torch.bfloat16
+ )
+
+ pipe.enable_sequential_cpu_offload()
+ pipe.vae.enable_tiling()
+ pipe.vae.enable_slicing()
+
+ video = pipe(
+     prompt=prompt,
+     num_videos_per_prompt=1,
+     num_inference_steps=50,
+     num_frames=81,
+     guidance_scale=6,
+     generator=torch.Generator(device="cuda").manual_seed(42),
+ ).frames[0]
+
+ export_to_video(video, "output.mp4", fps=8)
+ ```
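+
+ If you want to compare the VRAM impact of the optimization settings above on your own hardware, a small check (an
+ illustrative addition, not part of the original README):
+
+ ```python
+ # Peak GPU memory allocated during generation, in GiB.
+ print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
+ ```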
+
+ ## Quantized Inference
+
+ [PytorchAO](https://github.com/pytorch/ao) and [Optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be
+ used to quantize the text encoder, transformer, and VAE modules to reduce CogVideoX's memory requirements. This allows
+ the model to run on a free T4 Colab or on GPUs with less VRAM. Also note that TorchAO quantization is fully compatible
+ with `torch.compile`, which can significantly accelerate inference; a `torch.compile` sketch follows the example below.
+
+ ```python
+ # To get started, PytorchAO needs to be installed from GitHub source, along with PyTorch Nightly.
+ # Source and nightly installation is only required until the next release.
+
+ import torch
+ from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
+ from diffusers.utils import export_to_video
+ from transformers import T5EncoderModel
+ from torchao.quantization import quantize_, int8_weight_only
+
+ quantization = int8_weight_only
+
+ text_encoder = T5EncoderModel.from_pretrained(
+     "THUDM/CogVideoX1.5-5B", subfolder="text_encoder", torch_dtype=torch.bfloat16
+ )
+ quantize_(text_encoder, quantization())
+
+ transformer = CogVideoXTransformer3DModel.from_pretrained(
+     "THUDM/CogVideoX1.5-5B", subfolder="transformer", torch_dtype=torch.bfloat16
+ )
+ quantize_(transformer, quantization())
+
+ vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX1.5-5B", subfolder="vae", torch_dtype=torch.bfloat16)
+ quantize_(vae, quantization())
+
+ # Create the pipeline and run inference. CogVideoX1.5-5B is a text-to-video model,
+ # so the text-to-video pipeline is used here.
+ pipe = CogVideoXPipeline.from_pretrained(
+     "THUDM/CogVideoX1.5-5B",
+     text_encoder=text_encoder,
+     transformer=transformer,
+     vae=vae,
+     torch_dtype=torch.bfloat16,
+ )
+
+ pipe.enable_model_cpu_offload()
+ pipe.vae.enable_tiling()
+ pipe.vae.enable_slicing()
+
+ prompt = "A little girl is riding a bicycle at high speed. Focused, detailed, realistic."
+ video = pipe(
+     prompt=prompt,
+     num_videos_per_prompt=1,
+     num_inference_steps=50,
+     num_frames=81,
+     guidance_scale=6,
+     generator=torch.Generator(device="cuda").manual_seed(42),
+ ).frames[0]
+
+ export_to_video(video, "output.mp4", fps=8)
+ ```
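+
+ As noted above, TorchAO-quantized modules work with `torch.compile`. A minimal sketch (an illustration, not from the
+ original README; compilation adds a one-time warm-up cost on the first call):
+
+ ```python
+ # Compile the transformer, the dominant cost of each denoising step.
+ pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
+ ```
+
+ On an NVIDIA H100 or newer, you could also swap `int8_weight_only` for `float8_weight_only` from
+ `torchao.quantization` to use `FP8` weights instead.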
+
+ Additionally, these models can be serialized and stored using PytorchAO in quantized data types to save disk space. You
+ can find examples and benchmarks at the following links:
+
+ - [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
+ - [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)
+
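+ A minimal save/load sketch (an illustration under stated assumptions, not taken from the linked gists; it assumes the
+ torchao version in use serializes its quantized tensor subclasses through `torch.save`):
+
+ ```python
+ # Save the quantized transformer's weights, quantized tensors included.
+ torch.save(transformer.state_dict(), "cogvideox_transformer_int8.pt")
+
+ # Later: load the quantized weights back in place of the module's parameters.
+ state_dict = torch.load("cogvideox_transformer_int8.pt", weights_only=False)
+ transformer.load_state_dict(state_dict, assign=True)
+ ```
+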
+ ## Further Exploration
+
+ Feel free to visit our [GitHub](https://github.com/THUDM/CogVideo), where you'll find:
+
+ 1. More detailed technical explanations and code.
+ 2. Optimized prompt examples and conversions.
+ 3. Detailed code for model inference and fine-tuning.
+ 4. Project update logs and more opportunities for interaction.
+ 5. The CogVideoX toolchain to help you make better use of the model.
+ 6. INT8 model inference code.
+
+ ## Model License
+
+ This model is released under the [CogVideoX LICENSE](LICENSE).
+
+ ## Citation
+
+ ```bibtex
+ @article{yang2024cogvideox,
+   title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
+   author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
+   journal={arXiv preprint arXiv:2408.06072},
+   year={2024}
+ }
+ ```