---
license: cc-by-nc-sa-4.0
datasets:
- lmms-lab/LLaVA-Video-178K
language:
- en
metrics:
- accuracy
base_model:
- lmms-lab/LLaVA-Video-7B-Qwen2
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- Action
- Video
- MQA
- multimodal
model-index:
- name: LLaVAction-7B
  results:
  - task:
      type: multimodal
    dataset:
      name: EgoSchema
      type: egoschema
    metrics:
    - type: accuracy
      value: 59.0
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: MVBench
      type: mvbench
    metrics:
    - type: accuracy
      value: 61.1
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: NextQA
      type: nextqa
    metrics:
    - type: accuracy
      value: 82.8
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: PercepTest
      type: percepTest
    metrics:
    - type: accuracy
      value: 70.2
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: LongVideoBench
      type: longvideobench
    metrics:
    - type: accuracy
      value: 58.6
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMME
      type: videomme
    metrics:
    - type: accuracy
      value: 63.9
      name: accuracy
      verified: true
  - task:
      type: multimodal
    dataset:
      name: VideoMME (w-subs)
      type: videomme
    metrics:
    - type: accuracy
      value: 71.4
      name: accuracy
      verified: true
---

# LLaVAction-7B

## Model Summary
The LLaVAction models are 7B-parameter models trained on LLaVA-Video-178K and EPIC-KITCHENS-100-MQA, built on the Qwen2 language model with a context window of 32K tokens.

This model supports at most 64 frames per video.

- **Project Page**: [https://mmathislab.github.io/llavaction/](https://mmathislab.github.io/llavaction/)
- **Paper**: For more details, please check our [paper](https://arxiv.org/abs/tbd)
- **Repository**: [https://github.com/AdaptiveMotorControlLab/LLaVAction](https://github.com/AdaptiveMotorControlLab/LLaVAction)
- **Point of Contact**: [Mackenzie Mathis](https://people.epfl.ch/mackenzie.mathis)
- **Languages**: English

## Use

### Intended use
The model was trained on EPIC-KITCHENS-100-MQA and [LLaVA-Video-178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K). It has improved capabilities for understanding human egocentric actions in videos.

**Feel free to share your generations in the Community tab!**

### Generation
We provide a simple generation example below. For more details, please refer to the [GitHub repository](https://github.com/AdaptiveMotorControlLab/LLaVAction).

```python
# Install with: pip install llavaction
from llavaction.model.builder import load_pretrained_model
from llavaction.mm_utils import tokenizer_image_token
from llavaction.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llavaction.conversation import conv_templates
from decord import VideoReader, cpu
import numpy as np
import copy
import torch
import warnings

warnings.filterwarnings("ignore")


def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Uniformly sample up to `max_frames_num` frames from a video."""
    if max_frames_num == 0:
        return np.zeros((1, 336, 336, 3))
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    fps = round(vr.get_avg_fps() / fps)  # frame stride for the requested sampling rate
    frame_idx = [i for i in range(0, len(vr), fps)]
    if len(frame_idx) > max_frames_num or force_sample:
        sample_fps = max_frames_num
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
    frame_time = [i / vr.get_avg_fps() for i in frame_idx]
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time


pretrained = "MLAdaptiveIntelligence/LLaVAction-7B"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map
)  # add any other kwargs you want to pass as llava_model_args
model.eval()

video_path = "XXXX"  # path to your video
max_frames_num = 64
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().half()
video = [video]

conv_template = "qwen_1_5"  # make sure you use the correct chat template for different models
time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. "
perspective_prompt = "You are seeing this video from egocentric view and you are the person. Your hands are sometimes interacting with objects. What action are you doing?"
task_prompt = "Describe in details what you see from the video frames."
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}"

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=video,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(text_outputs)
```
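
LLaVAction-7B is also trained for multiple-choice question answering (MQA) about egocentric actions via EPIC-KITCHENS-100-MQA. The snippet below is a minimal sketch, not the official EPIC-KITCHENS-100-MQA evaluation prompt: it reuses the objects defined above (`tokenizer`, `model`, `image_processor`, `video`, `time_instruction`, `conv_template`), and the candidate options are illustrative placeholders.

```python
# Minimal MQA sketch reusing the objects defined in the generation example above.
# The candidate options below are illustrative placeholders, not drawn from the
# EPIC-KITCHENS-100 label set.
options = ["cut onion", "wash pan", "open fridge", "pour oil", "stir food"]
option_text = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
mqa_prompt = (
    "You are seeing this video from egocentric view and you are the person. "
    "What action are you doing? Select the best answer from the following options, "
    "and reply with only the letter of your choice.\n" + option_text
)
mqa_question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{mqa_prompt}"

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], mqa_question)
conv.append_message(conv.roles[1], None)
mqa_input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

answer = model.generate(
    mqa_input_ids,
    images=video,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=16,
)
print(tokenizer.batch_decode(answer, skip_special_tokens=True)[0].strip())
```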


## Training

### Model
- **Architecture**: SO400M + Qwen2
- **Initialized Model**: lmms-lab/LLaVA-Video-7B-Qwen2
- **Data**: A mixture of LLaVA-Video-178K and EPIC-KITCHENS-100-MQA, trained for 2 epochs with full-model fine-tuning
- **Precision**: bfloat16 (see the sanity-check sketch below)
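
You can inspect the initialized architecture and precision of a loaded checkpoint at load time. This is a minimal sketch that assumes `model` was loaded with `load_pretrained_model(..., torch_dtype="bfloat16")` as in the generation example, and that the checkpoint exposes standard Hugging Face config attributes.

```python
# Quick sanity check of the loaded checkpoint (assumes `model` from the
# generation example above). Attribute names follow the standard Hugging Face
# config schema and may differ for custom model classes.
print(model.config.model_type)               # Qwen2-based LLaVA variant
print(model.config.max_position_embeddings)  # context window (32K tokens)
print(next(model.parameters()).dtype)        # expected: torch.bfloat16
```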


### Hardware & Software
- **GPUs**: 32 × NVIDIA GH200 (for training the whole model series)
- **Orchestration**: Hugging Face Trainer
- **Neural networks**: PyTorch

## Citation

```bibtex
@article{YeQi2025llavaction,
  title={LLaVAction: evaluating and training multi-modal large language models for action recognition},
  author={Ye, Shaokai and Qi, Haozhe and Mathis, Alexander and Mathis, Mackenzie W.},
  journal={arXiv preprint},
  year={2025}
}
```