metadata
language:
- en
license: apache-2.0
pipeline_tag: text-to-video
tags:
- video-generation
- thudm
- image-to-video
inference: false
library_name: diffusers
CogVideoX1.5-5B
Read in Chinese | Hugging Face Space | GitHub | arXiv
Visit QingYing and API Platform to experience larger-scale commercial video generation models.
Model Introduction
CogVideoX is an open-source video generation model similar to QingYing. The table below lists the video generation models currently available, along with basic information about each.
Model Name | CogVideoX1.5-5B (Latest) | CogVideoX1.5-5B-I2V (Latest) | CogVideoX-2B | CogVideoX-5B | CogVideoX-5B-I2V |
---|---|---|---|---|---|
Release Date | November 8, 2024 | November 8, 2024 | August 6, 2024 | August 27, 2024 | September 19, 2024 |
Video Resolution | 1360 * 768 | Min(W, H) = 768; 768 ≤ Max(W, H) ≤ 1360; Max(W, H) % 16 = 0 | 720 * 480 | 720 * 480 | 720 * 480 |
Number of Frames | Should be 16N + 1 where N <= 10 (default 81) | Should be 16N + 1 where N <= 10 (default 81) | Should be 8N + 1 where N <= 6 (default 49) | Should be 8N + 1 where N <= 6 (default 49) | Should be 8N + 1 where N <= 6 (default 49) |
Inference Precision | BF16 (Recommended), FP16, FP32, FP8*, INT8; not supported: INT4 | BF16 (Recommended), FP16, FP32, FP8*, INT8; not supported: INT4 | FP16* (Recommended), BF16, FP32, FP8*, INT8; not supported: INT4 | BF16 (Recommended), FP16, FP32, FP8*, INT8; not supported: INT4 | BF16 (Recommended), FP16, FP32, FP8*, INT8; not supported: INT4 |
Single GPU Memory Usage | SAT BF16: 76GB; diffusers BF16: from 10GB*; diffusers INT8 (torchao): from 7GB* | SAT BF16: 76GB; diffusers BF16: from 10GB*; diffusers INT8 (torchao): from 7GB* | SAT FP16: 18GB; diffusers FP16: 4GB minimum*; diffusers INT8 (torchao): 3.6GB minimum* | SAT BF16: 26GB; diffusers BF16: 5GB minimum*; diffusers INT8 (torchao): 4.4GB minimum* | SAT BF16: 26GB; diffusers BF16: 5GB minimum*; diffusers INT8 (torchao): 4.4GB minimum* |
Multi-GPU Memory Usage | BF16: 24GB* using diffusers | BF16: 24GB* using diffusers | FP16: 10GB* using diffusers | BF16: 15GB* using diffusers | BF16: 15GB* using diffusers |
Inference Speed (Step = 50, FP/BF16) | Single A100: ~1000 seconds (5-second video); Single H100: ~550 seconds (5-second video) | Single A100: ~1000 seconds (5-second video); Single H100: ~550 seconds (5-second video) | Single A100: ~90 seconds; Single H100: ~45 seconds | Single A100: ~180 seconds; Single H100: ~90 seconds | Single A100: ~180 seconds; Single H100: ~90 seconds |
Prompt Language | English* | English* | English* | English* | English* |
Prompt Token Limit | 224 Tokens | 224 Tokens | 226 Tokens | 226 Tokens | 226 Tokens |
Video Length | 5 seconds or 10 seconds | 5 seconds or 10 seconds | 6 seconds | 6 seconds | 6 seconds |
Frame Rate | 16 frames / second | 16 frames / second | 8 frames / second | 8 frames / second | 8 frames / second |
Position Encoding | 3d_rope_pos_embed | 3d_rope_pos_embed | 3d_sincos_pos_embed | 3d_rope_pos_embed | 3d_rope_pos_embed + learnable_pos_embed |
Download Link (Diffusers) | HuggingFace, ModelScope, WiseModel | HuggingFace, ModelScope, WiseModel | HuggingFace, ModelScope, WiseModel | HuggingFace, ModelScope, WiseModel | HuggingFace, ModelScope, WiseModel |
Download Link (SAT) | HuggingFace, ModelScope, WiseModel | HuggingFace, ModelScope, WiseModel | SAT | SAT | SAT |
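
The frame-count and resolution rules in the table above can be checked before starting a long generation run. The following is a minimal Python sketch of those checks. The function names and the `family` argument are hypothetical and are not part of diffusers or SAT. The sketch also assumes N >= 1, which the table does not state explicitly.

```python
# Hypothetical helpers for illustration only; not part of diffusers or SAT.

def valid_num_frames(num_frames: int, family: str = "1.5") -> bool:
    """Check the frame-count rule from the table.

    CogVideoX1.5 models: 16N + 1 with N <= 10.
    CogVideoX 2B/5B models: 8N + 1 with N <= 6.
    Assumption: N must be at least 1.
    """
    step, max_n = (16, 10) if family == "1.5" else (8, 6)
    n, remainder = divmod(num_frames - 1, step)
    return remainder == 0 and 1 <= n <= max_n


def valid_i2v_resolution(width: int, height: int) -> bool:
    """Check the CogVideoX1.5-5B-I2V resolution rule from the table."""
    short_side, long_side = min(width, height), max(width, height)
    return (
        short_side == 768
        and 768 <= long_side <= 1360
        and long_side % 16 == 0
    )


def clip_seconds(num_frames: int, fps: int) -> float:
    """Clip length in seconds, not counting the extra first frame."""
    return (num_frames - 1) / fps


if __name__ == "__main__":
    print(valid_num_frames(81))              # default for CogVideoX1.5
    print(valid_num_frames(49, family="1"))  # default for CogVideoX 2B/5B
    print(valid_i2v_resolution(1360, 768))
    print(clip_seconds(81, 16), clip_seconds(49, 8))
```

With the defaults from the table, 81 frames at 16 frames per second and 49 frames at 8 frames per second work out to the listed 5-second and 6-second clip lengths.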
(rest of the content remains the same as the original)