qingy2024 committed Commit 2239ab7 (verified) · 1 Parent(s): 7bf17e6

Create README.md

Files changed (1):
  1. README.md ADDED (+54 -0)
---
license: apache-2.0
extra_gated_prompt: "You agree to not use the model to conduct experiments that cause harm to human subjects."
extra_gated_fields:
  Name: text
  Company/Organization: text
  Country: text
  E-Mail: text
---

# Model Card for InternVideo2 (Vision-Only)

This model card describes the **vision encoder** component extracted from the InternVideo2 foundation model series.

## Model Details

This checkpoint contains only the vision backbone parameters, suitable for video or image feature extraction tasks. It was obtained by filtering a multimodal InternVideo2 checkpoint (e.g., S2_6B).
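
The export script itself is not included in this repository. As a rough, illustrative sketch, a vision-only state dict can be produced by keeping only the keys belonging to the vision tower of the full checkpoint. The source file name `InternVideo2_S2_6B.pt` and the `vision_encoder.` key prefix below are assumptions for illustration; inspect the actual checkpoint's keys and adjust accordingly.

```python
import torch

# Illustrative sketch only (not the uploader's actual export script).
# ASSUMPTIONS: the full multimodal checkpoint is named "InternVideo2_S2_6B.pt"
# and its vision-tower parameters live under a "vision_encoder." key prefix;
# check full_state_dict.keys() and adjust the prefix to match reality.
full_state_dict = torch.load("InternVideo2_S2_6B.pt", map_location="cpu")

# Keep only vision-tower entries and strip the prefix so the keys match a
# standalone vision backbone.
vision_state_dict = {
    key.removeprefix("vision_encoder."): value
    for key, value in full_state_dict.items()
    if key.startswith("vision_encoder.")
}

torch.save(vision_state_dict, "InternVideo2_S2_6B_vision.pt")
```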

### Model Sources

- **Original Project Repository:** [InternVideo2](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2)
- **Original Paper:** [arXiv:2403.15377](https://arxiv.org/abs/2403.15377)
- **Original Point of Contact:** [InternVideo Group](mailto:[email protected])

### Uploader

- **This specific vision-only checkpoint uploaded by:** [qingy2024](https://huggingface.co/qingy2024)

## How to Use

This file (`InternVideo2_S2_6B_vision.pt`) is a standard PyTorch state dictionary containing only the vision encoder weights. It can be loaded into a compatible vision model architecture using `model.load_state_dict()`.

```python
import torch

# Load the vision-encoder weights; map_location='cpu' keeps them on the CPU
# (use 'cuda' to place them directly on the GPU instead).
vision_state_dict = torch.load("InternVideo2_S2_6B_vision.pt", map_location='cpu')
```
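
Assuming the matching InternVideo2 vision backbone has been instantiated from the original repository's code, loading typically looks like the sketch below. `build_vision_encoder` is a placeholder name, not a function shipped with this checkpoint, and `strict=False` is used in case a few key names differ between the exported state dict and your model definition.

```python
# Sketch only: `build_vision_encoder` stands in for however you construct the
# InternVideo2 vision backbone from the original code base; it is not provided
# by this repository.
model = build_vision_encoder()

# strict=False reports, rather than errors on, keys that do not line up
# (e.g. leftover multimodal heads or renamed prefixes).
missing, unexpected = model.load_state_dict(vision_state_dict, strict=False)
print(f"Missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```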

## Limitations

This model contains only the vision encoder. It **does not** include the text or audio encoders and cannot perform tasks requiring multimodal inputs unless combined with separate models for those modalities.

## Citation

If you use this vision encoder, please cite the original InternVideo2 paper:

```bibtex
@article{wang2024internvideo2,
  title={InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding},
  author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
  journal={arXiv preprint arXiv:2403.15377},
  year={2024}
}
```