---
language:
- en
license: apache-2.0
pipeline_tag: text-to-video
tags:
- video-generation
- thudm
- image-to-video
inference: false
library_name: diffusers
---

# CogVideoX1.5-5B
<div align="center">
<img src="https://github.com/THUDM/CogVideo/raw/main/resources/logo.svg" width="50%"/>
</div>
<p align="center">
<a href="https://huggingface.co/THUDM/CogVideoX1.5-5B/blob/main/README_zh.md">中文阅读 (Read in Chinese)</a> |
<a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space">🤗 Hugging Face Space</a> |
<a href="https://github.com/THUDM/CogVideo">GitHub</a> |
<a href="https://arxiv.org/pdf/2408.06072">arXiv</a>
</p>
<p align="center">
Visit <a href="https://chatglm.cn/video?lang=en?fr=osm_cogvideo">QingYing</a> and <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models.
</p>

## Model Introduction
CogVideoX is an open-source video generation model similar to [QingYing](https://chatglm.cn/video?lang=en?fr=osm_cogvideo). The table below lists the video generation models currently available, along with their key specifications.

<table style="border-collapse: collapse; width: 100%;">
<tr>
<th style="text-align: center;">Model Name</th>
<th style="text-align: center;">CogVideoX1.5-5B (Latest)</th>
<th style="text-align: center;">CogVideoX1.5-5B-I2V (Latest)</th>
<th style="text-align: center;">CogVideoX-2B</th>
<th style="text-align: center;">CogVideoX-5B</th>
<th style="text-align: center;">CogVideoX-5B-I2V</th>
</tr>
<tr>
<td style="text-align: center;">Release Date</td>
<td style="text-align: center;">November 8, 2024</td>
<td style="text-align: center;">November 8, 2024</td>
<td style="text-align: center;">August 6, 2024</td>
<td style="text-align: center;">August 27, 2024</td>
<td style="text-align: center;">September 19, 2024</td>
</tr>
<tr>
<td style="text-align: center;">Video Resolution</td>
<td colspan="1" style="text-align: center;">1360 * 768</td>
<td colspan="1" style="text-align: center;">Min(W, H) = 768 <br> 768 ≤ Max(W, H) ≤ 1360 <br> Max(W, H) % 16 = 0</td>
<td colspan="3" style="text-align: center;">720 * 480</td>
</tr>
<tr>
<td style="text-align: center;">Number of Frames</td>
<td colspan="2" style="text-align: center;">Should be <b>16N + 1</b> where N ≤ 10 (default 81)</td>
<td colspan="3" style="text-align: center;">Should be <b>8N + 1</b> where N ≤ 6 (default 49)</td>
</tr>
<tr>
<td style="text-align: center;">Inference Precision</td>
<td colspan="2" style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, Not supported: INT4</td>
<td style="text-align: center;"><b>FP16* (Recommended)</b>, BF16, FP32, FP8*, INT8, Not supported: INT4</td>
<td colspan="2" style="text-align: center;"><b>BF16 (Recommended)</b>, FP16, FP32, FP8*, INT8, Not supported: INT4</td>
</tr>
<tr>
<td style="text-align: center;">Single GPU Memory Usage</td>
<td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 76GB <br><b>diffusers BF16: from 10GB*</b><br><b>diffusers INT8 (torchao): from 7GB*</b></td>
<td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: 4GB minimum*</b><br><b>diffusers INT8 (torchao): 3.6GB minimum*</b></td>
<td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: 5GB minimum*</b><br><b>diffusers INT8 (torchao): 4.4GB minimum*</b></td>
</tr>
<tr>
<td style="text-align: center;">Multi-GPU Memory Usage</td>
<td colspan="2" style="text-align: center;"><b>BF16: 24GB* using diffusers</b></td>
<td style="text-align: center;"><b>FP16: 10GB* using diffusers</b></td>
<td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b></td>
</tr>
<tr>
<td style="text-align: center;">Inference Speed<br>(Steps = 50, FP16/BF16)</td>
<td colspan="2" style="text-align: center;">Single A100: ~1000 seconds (5-second video)<br>Single H100: ~550 seconds (5-second video)</td>
<td style="text-align: center;">Single A100: ~90 seconds<br>Single H100: ~45 seconds</td>
<td colspan="2" style="text-align: center;">Single A100: ~180 seconds<br>Single H100: ~90 seconds</td>
</tr>
<tr>
<td style="text-align: center;">Prompt Language</td>
<td colspan="5" style="text-align: center;">English*</td>
</tr>
<tr>
<td style="text-align: center;">Prompt Token Limit</td>
<td colspan="2" style="text-align: center;">224 Tokens</td>
<td colspan="3" style="text-align: center;">226 Tokens</td>
</tr>
<tr>
<td style="text-align: center;">Video Length</td>
<td colspan="2" style="text-align: center;">5 seconds or 10 seconds</td>
<td colspan="3" style="text-align: center;">6 seconds</td>
</tr>
<tr>
<td style="text-align: center;">Frame Rate</td>
<td colspan="2" style="text-align: center;">16 frames / second</td>
<td colspan="3" style="text-align: center;">8 frames / second</td>
</tr>
<tr>
<td style="text-align: center;">Position Encoding</td>
<td colspan="2" style="text-align: center;">3d_rope_pos_embed</td>
<td style="text-align: center;">3d_sincos_pos_embed</td>
<td style="text-align: center;">3d_rope_pos_embed</td>
<td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
</tr>
<tr>
<td style="text-align: center;">Download Link (Diffusers)</td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B">🟣 WiseModel</a></td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5B-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5B-I2V">🟣 WiseModel</a></td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
<td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td>
</tr>
<tr>
<td style="text-align: center;">Download Link (SAT)</td>
<td colspan="2" style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5b-SAT">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">🟣 WiseModel</a></td>
<td colspan="3" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
</tr>
</table>
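
The Number of Frames, Frame Rate, and Video Length rows are related: the frame count divided by the frame rate gives the clip length. The following is a minimal sketch of that arithmetic, using only values copied from the table. The names `FrameSpec`, `valid_frame_counts`, `is_valid_frame_count`, and `clip_seconds` are illustrative helpers, not part of diffusers or SAT, and the sketch assumes N starts at 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FrameSpec:
    """Frame rule from the table: num_frames = step * N + 1 with N <= max_n."""
    step: int
    max_n: int
    fps: int
    default_frames: int


# Values copied from the table above; the constant names are hypothetical.
COGVIDEOX_1_5 = FrameSpec(step=16, max_n=10, fps=16, default_frames=81)
COGVIDEOX_1_0 = FrameSpec(step=8, max_n=6, fps=8, default_frames=49)


def valid_frame_counts(spec: FrameSpec) -> List[int]:
    """Enumerate every frame count of the form step * N + 1 (assumes N >= 1)."""
    return [spec.step * n + 1 for n in range(1, spec.max_n + 1)]


def is_valid_frame_count(spec: FrameSpec, num_frames: int) -> bool:
    """Check a requested frame count against the rule for this model family."""
    return num_frames in valid_frame_counts(spec)


def clip_seconds(spec: FrameSpec, num_frames: int) -> float:
    """Approximate clip length in seconds at the model's frame rate."""
    return num_frames / spec.fps


if __name__ == "__main__":
    for name, spec in (("CogVideoX1.5", COGVIDEOX_1_5), ("CogVideoX", COGVIDEOX_1_0)):
        seconds = clip_seconds(spec, spec.default_frames)
        print(f"{name}: default {spec.default_frames} frames is about {seconds:.2f} s")
```

Under these assumptions, the default 81 frames at 16 frames per second comes to just over 5 seconds, 161 frames to just over 10 seconds, and 49 frames at 8 frames per second to just over 6 seconds, which is consistent with the Video Length row.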
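
The resolution constraints listed for CogVideoX1.5-5B-I2V in the Video Resolution row can be checked before starting a long inference run. The sketch below is one plain-Python reading of those three conditions. `check_i2v_resolution` is a hypothetical helper and not a diffusers API, and it reports violations rather than resizing anything.

```python
from typing import List


def check_i2v_resolution(width: int, height: int) -> List[str]:
    """Return the violated constraints for a CogVideoX1.5-5B-I2V output size.

    An empty list means the size satisfies all three conditions from the table:
    Min(W, H) = 768, 768 <= Max(W, H) <= 1360, and Max(W, H) % 16 = 0.
    """
    short_side = min(width, height)
    long_side = max(width, height)
    problems = []
    if short_side != 768:
        problems.append(f"min(W, H) must be 768, got {short_side}")
    if not 768 <= long_side <= 1360:
        problems.append(f"max(W, H) must be between 768 and 1360, got {long_side}")
    if long_side % 16 != 0:
        problems.append(f"max(W, H) must be a multiple of 16, got {long_side}")
    return problems


if __name__ == "__main__":
    for size in ((1360, 768), (768, 1360), (1000, 768), (720, 480)):
        issues = check_i2v_resolution(*size)
        status = "ok" if not issues else "; ".join(issues)
        print(f"{size[0]} x {size[1]}: {status}")
```

Because the check only looks at the shorter and longer side, landscape and portrait sizes are treated the same way. Whether the model expects a particular orientation is not stated in the table, so that remains an assumption of this sketch.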