🦜VideoChat-Flash-Qwen2_5-7B-1M_res224⚡

[📰 Blog] [📂 GitHub] [📜 Tech Report] [🗨️ Chat Demo]

VideoChat-Flash-Qwen2_5-7B-1M_res224 is constructed upon UMT-L (300M) and Qwen2.5-7B-1M, employing only 16 tokens per frame. By leveraging YaRN to extend the context window to 1M tokens (Qwen2.5-7B-1M's native context window is 128k), our model supports input sequences of up to approximately 50,000 frames.
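If you want to check the extended context settings yourself, they are visible in the released model config. A minimal sketch, assuming the standard Hugging Face `rope_scaling` convention (field names and values shown here are illustrative, and may sit on a language-model sub-config in multimodal wrappers):

```python
from transformers import AutoConfig

# Illustrative only: inspect the context-extension settings in config.json.
config = AutoConfig.from_pretrained(
    "OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224", trust_remote_code=True
)
print(getattr(config, "max_position_embeddings", None))  # context window after YaRN extension
print(getattr(config, "rope_scaling", None))             # e.g. {"type": "yarn", ...}
```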

Note: Due to a predominantly English training corpus, the model exhibits only basic Chinese comprehension; to ensure optimal performance, we recommend interacting in English.

📈 Performance

| Model | MVBench | LongVideoBench | VideoMME (w/o sub) | Max input frames |
| :---- | :-----: | :------------: | :----------------: | :--------------: |
| VideoChat-Flash-Qwen2_5-2B@448 | 70.0 | 58.3 | 57.0 | 10000 |
| VideoChat-Flash-Qwen2-7B@224 | 73.2 | 64.2 | 64.0 | 10000 |
| VideoChat-Flash-Qwen2_5-7B-1M@224 | 73.4 | 66.5 | 63.5 | 50000 |
| VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B@224 | 74.3 | 64.5 | 65.1 | 10000 |
| VideoChat-Flash-Qwen2-7B@448 | 74.0 | 64.7 | 65.3 | 10000 |
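The 50,000-frame figure follows directly from the per-frame token budget. A quick back-of-the-envelope check:

```python
# Rough arithmetic behind the ~50,000-frame limit: 16 visual tokens per frame
# against a 1M-token context window (extended via YaRN).
tokens_per_frame = 16
context_window = 1_000_000
print(context_window // tokens_per_frame)  # 62500 frames in the ideal case;
# roughly 50,000 in practice once prompt text, special tokens, and the
# response budget are accounted for.
```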

🚀 How to use the model

First, you need to install FlashAttention-2 (`flash-attn`) and a few other dependencies. We provide a simple installation example below:

```shell
pip install transformers==4.40.1
pip install av
pip install imageio
pip install decord
pip install opencv-python
pip install flash-attn --no-build-isolation
```
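After installation, an optional sanity check confirms that the key modules import cleanly (versions shown are what the commands above install):

```python
# Optional sanity check: all of these should import without errors.
import av
import cv2          # from opencv-python
import decord
import flash_attn
import imageio
import transformers

print(transformers.__version__)  # expect 4.40.1
print(flash_attn.__version__)
```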

Then you can use our model:

```python
from transformers import AutoModel, AutoTokenizer

# model setting
model_path = 'OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224'

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
image_processor = model.get_vision_tower().image_processor

mm_llm_compress = False  # whether to enable global token compression inside the LLM
if mm_llm_compress:
    model.config.mm_llm_compress = True
    model.config.llm_compress_type = "uniform0_attention"
    model.config.llm_compress_layer_list = [4, 18]
    model.config.llm_image_token_ratio_list = [1, 0.75, 0.25]
else:
    model.config.mm_llm_compress = False

# evaluation setting
max_num_frames = 512
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1
)

video_path = "your_video.mp4"

# single-turn conversation
question1 = "Describe this video in detail."
output1, chat_history = model.chat(
    video_path=video_path,
    tokenizer=tokenizer,
    user_prompt=question1,
    return_history=True,
    max_num_frames=max_num_frames,
    generation_config=generation_config
)
print(output1)

# multi-turn conversation: pass the returned chat_history back in
question2 = "How many people appear in the video?"
output2, chat_history = model.chat(
    video_path=video_path,
    tokenizer=tokenizer,
    user_prompt=question2,
    chat_history=chat_history,
    return_history=True,
    max_num_frames=max_num_frames,
    generation_config=generation_config
)
print(output2)
```
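`model.chat` samples frames from the video internally, capped at `max_num_frames`. As a rough illustration of what uniform frame sampling looks like with `decord` (a sketch only; the `uniform_sample_frames` helper below is hypothetical and not part of the model's API):

```python
import numpy as np
from decord import VideoReader, cpu

def uniform_sample_frames(video_path, max_num_frames=512):
    """Uniformly sample up to max_num_frames frames from a video.
    Illustrative only; model.chat performs its own sampling internally."""
    vr = VideoReader(video_path, ctx=cpu(0))
    num = min(max_num_frames, len(vr))
    indices = np.linspace(0, len(vr) - 1, num=num, dtype=int)
    return vr.get_batch(indices).asnumpy()  # uint8 array of shape (N, H, W, 3)

frames = uniform_sample_frames("your_video.mp4")
print(frames.shape)
```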

✏️ Citation


```bibtex
@article{li2024videochatflash,
  title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
  author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and others},
  journal={arXiv preprint arXiv:2501.00574},
  year={2024}
}
```