---
library_name: transformers
tags:
- multi-modal
- large-language-model
- video-language-model
license: apache-2.0
datasets:
- lmms-lab/LLaVA-OneVision-Data
- allenai/pixmo-docs
- HuggingFaceM4/Docmatix
- lmms-lab/LLaVA-Video-178K
- ShareGPT4Video/ShareGPT4Video
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
metrics:
- accuracy
pipeline_tag: visual-question-answering
base_model:
- Qwen/Qwen2.5-7B-Instruct
---

<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/626938b16f8f86ad21deb989/tt5KYnAUmQlHtfB1-Zisl.png" width="150" style="margin-bottom: 0.2;"/>
</p>

<h3 align="center"><a href="https://arxiv.org/abs/2501.13106">VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding</a></h3>

<h5 align="center">If you like our project, please give us a star ⭐ on <a href="https://github.com/DAMO-NLP-SG/VideoLLaMA3">GitHub</a> for the latest updates.</h5>

## 📰 News
<!-- * **[2025.01.23]** 👋👋 Update technical report. If you have works closely related to VideoLLaMA3 but not mentioned in the paper, feel free to let us know. -->
* **[2025.01.24]** 🔥🔥 Online Demo is available: [VideoLLaMA3-Image-7B](https://huggingface.co/spaces/lixin4ever/VideoLLaMA3-Image), [VideoLLaMA3-7B](https://huggingface.co/spaces/lixin4ever/VideoLLaMA3).
* **[2025.01.22]** Released the models and inference code of VideoLLaMA 3.

## 🌟 Introduction
VideoLLaMA 3 is a state-of-the-art series of multimodal foundation models for image and video understanding. It processes and interprets visual content across a wide range of contexts and is built for complex multimodal tasks such as integrating textual and visual information, extracting insights from sequential video data, and performing high-level reasoning over both dynamic and static visual scenes.

## 🌎 Model Zoo
| Model | Base Model | HF Link |
| -------------------- | ------------ | ------------------------------------------------------------ |
| VideoLLaMA3-7B | Qwen2.5-7B | [DAMO-NLP-SG/VideoLLaMA3-7B](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-7B) |
| VideoLLaMA3-2B | Qwen2.5-1.5B | [DAMO-NLP-SG/VideoLLaMA3-2B](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-2B) |
| VideoLLaMA3-7B-Image (**This Checkpoint**) | Qwen2.5-7B | [DAMO-NLP-SG/VideoLLaMA3-7B-Image](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-7B-Image) |
| VideoLLaMA3-2B-Image | Qwen2.5-1.5B | [DAMO-NLP-SG/VideoLLaMA3-2B-Image](https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-2B-Image) |

We also release the fine-tuned vision encoder of VideoLLaMA3-7B for broader use (see the loading sketch after the table):

| Model | Base Model | HF Link |
| ----------------------------- | ------------------------- | ------------------------------------------------------------ |
| VideoLLaMA3-7B Vision Encoder | siglip-so400m-patch14-384 | [DAMO-NLP-SG/VL3-SigLIP-NaViT](https://huggingface.co/DAMO-NLP-SG/VL3-SigLIP-NaViT) |

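The snippet below is a minimal loading sketch for the standalone vision encoder, not taken from the original card: it assumes the Hub repository registers classes compatible with `AutoModel` and `AutoImageProcessor`, and the exact preprocessing and forward signature are defined by that repository's remote code.

```python
import torch
from transformers import AutoModel, AutoImageProcessor

# Minimal sketch (assumption): load the standalone vision encoder with
# trust_remote_code so the custom classes shipped with the repo are used.
encoder_name = "DAMO-NLP-SG/VL3-SigLIP-NaViT"
encoder = AutoModel.from_pretrained(
    encoder_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
image_processor = AutoImageProcessor.from_pretrained(encoder_name, trust_remote_code=True)
# Check the encoder repository's model card for the expected inputs and
# outputs before wiring it into another pipeline.
```
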
## 🚀 Main Results

<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/626938b16f8f86ad21deb989/ArHgZAmidn8Qlz8BwOdJI.png">

* \* denotes the reproduced results.

## 🤖 Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "DAMO-NLP-SG/VideoLLaMA3-7B-Image"

# Load the model and its multimodal processor; both rely on custom code
# shipped with the checkpoint, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Single-turn image conversation
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": {"image_path": "https://github.com/DAMO-NLP-SG/VideoLLaMA3/blob/main/assets/sora.png?raw=true"}},
            {"type": "text", "text": "What is the woman wearing?"},
        ]
    }
]

# Preprocess, move tensors to the GPU, and cast pixel values to bfloat16.
inputs = processor(conversation=conversation, return_tensors="pt")
inputs = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

# Generate a response and decode it.
output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)
```
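
FlashAttention 2 requires the `flash-attn` package and a supported GPU. If it is not available in your environment, one fallback (an alternative we suggest here, not the configuration shown above) is PyTorch's built-in scaled-dot-product attention:

```python
import torch
from transformers import AutoModelForCausalLM

model_name = "DAMO-NLP-SG/VideoLLaMA3-7B-Image"

# Assumption: flash-attn is unavailable, so fall back to PyTorch's
# scaled-dot-product attention ("sdpa") instead of flash_attention_2.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
)
```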

## Citation

If you find VideoLLaMA useful for your research and applications, please cite it using this BibTeX:
```bibtex
@article{damonlpsg2025videollama3,
  title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
  author={Zhang, Boqiang and Li, Kehan and Cheng, Zesen and Hu, Zhiqiang and Yuan, Yuqian and Chen, Guanzheng and Leng, Sicong and Jiang, Yuming and Zhang, Hang and Li, Xin and Jin, Peng and Zhang, Wenqi and Wang, Fan and Bing, Lidong and Zhao, Deli},
  journal={arXiv preprint arXiv:2501.13106},
  year={2025},
  url={https://arxiv.org/abs/2501.13106}
}

@article{damonlpsg2024videollama2,
  title={VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs},
  author={Cheng, Zesen and Leng, Sicong and Zhang, Hang and Xin, Yifei and Li, Xin and Chen, Guanzheng and Zhu, Yongxin and Zhang, Wenqi and Luo, Ziyang and Zhao, Deli and Bing, Lidong},
  journal={arXiv preprint arXiv:2406.07476},
  year={2024},
  url={https://arxiv.org/abs/2406.07476}
}

@article{damonlpsg2023videollama,
  title={Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  author={Zhang, Hang and Li, Xin and Bing, Lidong},
  journal={arXiv preprint arXiv:2306.02858},
  year={2023},
  url={https://arxiv.org/abs/2306.02858}
}
```