---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
base_model: Qwen/Qwen2.5-VL-7B-Instruct
---

# 💡 VideoChat-R1-thinking_7B

[\[📂 GitHub\]](https://github.com/OpenGVLab/VideoChat-R1) [\[📜 Tech Report\]](https://arxiv.org/pdf/2504.06958)

## 🚀 How to use the model

We provide a simple installation example below:

```bash
pip install transformers
pip install qwen_vl_utils
```

Then you can use our model:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "OpenGVLab/VideoChat-R1-thinking_7B"

# Default: load the model on the available device(s).
# flash_attention_2 requires the flash-attn package; remove the argument
# to fall back to the default attention implementation.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)

# Default processor
processor = AutoProcessor.from_pretrained(model_path)

video_path = "your_video.mp4"
question = "Where is the final cup containing the object?"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": f"""{question} Output your thought process within the <think> </think> tags, including analysis with either specific timestamps (xx.xx) or time ranges (xx.xx to xx.xx) in the <timestep> </timestep> tags. Then, provide your final answer within the <answer> </answer> tags."""},
        ],
    }
]

# In Qwen2.5-VL, frame-rate information is also fed into the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

## ✏️ Citation

```bibtex
@article{li2025videochatr1,
  title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal={arXiv preprint arXiv:2504.06958},
  year={2025}
}
```
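## 🔎 Parsing the response

Since the prompt asks the model to wrap its reasoning in `<think> </think>` tags and its final answer in `<answer> </answer>` tags, you will usually want to separate the two in post-processing. Below is a minimal sketch, assuming the tag names used in the prompt above and the `output_text` list produced by the usage snippet; the helper name `split_thought_and_answer` is our own, not part of the model or library:

```python
import re

def split_thought_and_answer(response: str):
    """Extract the <think> and <answer> spans from a model response.

    Returns (thought, answer); either may be None if the model
    did not emit the corresponding tag pair.
    """
    think_match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    answer_match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    thought = think_match.group(1).strip() if think_match else None
    answer = answer_match.group(1).strip() if answer_match else None
    return thought, answer

# Example usage with the decoded output from the snippet above:
# thought, answer = split_thought_and_answer(output_text[0])
# print("Answer:", answer)
```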