---
base_model: Qwen/Qwen2.5-VL-72B-Instruct
library_name: transformers
license: other
tags:
- llama-factory
- full
- generated_from_trainer
pipeline_tag: video-text-to-text
model-index:
- name: bal_imb_cap_full_lr2e-4_epoch10.0_freezevisTrue_fps8
  results: []
---

## Model description

This model is a fine-tuned version of [Qwen/Qwen2.5-VL-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct), trained on the largest high-quality camera motion dataset that is currently publicly available. This preview model is the current SOTA for camera motion classification and for video-text retrieval with camera motion captions using [VQAScore](https://arxiv.org/pdf/2404.01291). Find more information about our work on our GitHub page for [CameraBench](https://github.com/sy77777en/CameraBench). *More updates to the benchmark and models will come in the future. Stay tuned!*

## Intended uses & limitations

The usage is identical to a [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) model. Our model is primarily useful for camera motion classification in videos as well as video-text retrieval (current SOTA in both tasks).

**A quick demo is shown below:**
<details>
<summary>Generative Scoring (for classification and retrieval):</summary>

```python
# Import necessary libraries
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "chancharikm/qwen2.5-vl-72B-cam-motion-preview", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")

# Prepare input data
video_path = "file:///path/to/video1.mp4"
text_description = "the camera tilting upward"
question = f"Does this video show \"{text_description}\"?"

# Format the input for the model
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "fps": 8.0,  # Recommended FPS for optimal inference
            },
            {"type": "text", "text": question},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs
)
inputs = inputs.to("cuda")

# Generate with score output
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1,
        do_sample=False,  # Use greedy decoding to get reliable logprobs
        output_scores=True,
        return_dict_in_generate=True
    )

# Calculate probability of "Yes" response
scores = outputs.scores[0]
probs = torch.nn.functional.softmax(scores, dim=-1)
yes_token_id = processor.tokenizer.encode("Yes")[0]
score = probs[0, yes_token_id].item()

print(f"Video: {video_path}")
print(f"Description: '{text_description}'")
print(f"Score: {score:.4f}")
```
</details>
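
<details>
<summary>Ranking multiple candidate descriptions (illustrative sketch):</summary>

The snippet below is a minimal sketch, not part of the original demo: it reuses the `model` and `processor` loaded in the block above and ranks several candidate camera-motion descriptions for a single video by their "Yes" probability, which is how the scoring recipe extends to classification over a label set or to video-text retrieval. The video path and candidate list are illustrative placeholders.

```python
# Illustrative sketch: rank candidate camera-motion descriptions for one video
# by their "Yes" probability, reusing `model` and `processor` from the demo above.
import torch
from qwen_vl_utils import process_vision_info

video_path = "file:///path/to/video1.mp4"  # placeholder path
candidate_descriptions = [  # hypothetical label set
    "the camera tilting upward",
    "the camera panning to the left",
    "the camera zooming in",
    "a static camera with no motion",
]

scores = {}
for text_description in candidate_descriptions:
    question = f"Does this video show \"{text_description}\"?"
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path, "fps": 8.0},
                {"type": "text", "text": question},
            ],
        }
    ]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
        **video_kwargs
    ).to("cuda")

    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=1,
            do_sample=False,
            output_scores=True,
            return_dict_in_generate=True
        )
    probs = torch.nn.functional.softmax(outputs.scores[0], dim=-1)
    yes_token_id = processor.tokenizer.encode("Yes")[0]
    scores[text_description] = probs[0, yes_token_id].item()

# The highest-scoring description is the predicted camera motion
for description, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.4f}  {description}")
```
</details>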

<details>
<summary>Natural Language Generation</summary>

```python
# The model is trained on 8.0 FPS, which we recommend for optimal inference
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "chancharikm/qwen2.5-vl-72B-cam-motion-preview", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "chancharikm/qwen2.5-vl-72B-cam-motion-preview",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-72B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "fps": 8.0,
            },
            {"type": "text", "text": "Describe the camera motion in this video."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
</details>
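
<details>
<summary>Captioning a folder of videos (illustrative sketch):</summary>

The sketch below is a small, illustrative extension that is not part of the original card: it reuses the `model` and `processor` loaded above to caption every `.mp4` file in a hypothetical `./videos` directory with the same 8.0 FPS recipe.

```python
# Illustrative sketch: caption all .mp4 files in a hypothetical ./videos directory,
# reusing `model` and `processor` from the demo above.
import glob
import os

from qwen_vl_utils import process_vision_info

for video_path in sorted(glob.glob("./videos/*.mp4")):
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": f"file://{os.path.abspath(video_path)}",
                    "fps": 8.0,  # Recommended FPS for optimal inference
                },
                {"type": "text", "text": "Describe the camera motion in this video."},
            ],
        }
    ]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
        **video_kwargs,
    ).to("cuda")

    generated_ids = model.generate(**inputs, max_new_tokens=128)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    caption = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    print(f"{video_path}: {caption}")
```
</details>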

## Training and evaluation data

Training and evaluation data can be found in our [repo](https://github.com/sy77777en/CameraBench).

## ✏️ Citation

If you find this repository useful for your research, please use the following citation.
```bibtex
@article{lin2025camerabench,
  title={Towards Understanding Camera Motions in Any Video},
  author={Lin, Zhiqiu and Cen, Siyuan and Jiang, Daniel and Karhade, Jay and Wang, Hewei and Mitra, Chancharik and Ling, Tiffany and Huang, Yuhan and Liu, Sifan and Chen, Mingyu and Zawar, Rushikesh and Bai, Xue and Du, Yilun and Gan, Chuang and Ramanan, Deva},
  journal={arXiv preprint arXiv:2504.15376},
  year={2025},
}
```