---
license: apache-2.0
extra_gated_prompt: >-
  You agree to not use the model to conduct experiments that cause harm to human
  subjects.
extra_gated_fields:
  Name: text
  Company/Organization: text
  Country: text
  E-Mail: text
datasets:
- OpenGVLab/InternVid
pipeline_tag: image-feature-extraction
---

# Model Card for InternVideo2 (Vision-Only)

This model card describes the **vision encoder** component extracted from the InternVideo2 foundation model series.

## Model Details

This checkpoint contains only the vision backbone parameters and is suitable for video or image feature extraction. It was obtained by filtering the vision-encoder weights out of a full multimodal InternVideo2 checkpoint (e.g., the Stage 2, 6B-parameter "S2_6B" model); a sketch of such a filtering step is shown below.
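
For reference, a minimal sketch of how this kind of filtering can be done with plain PyTorch. The input/output file names and the `vision_encoder.` key prefix are assumptions for illustration, not the exact layout of the released InternVideo2 checkpoints:

```python
import torch

# Load the full multimodal checkpoint (file name is a placeholder).
full_ckpt = torch.load("InternVideo2_S2_6B.pt", map_location="cpu")

# Some checkpoints wrap the state dict under a key such as "module"; unwrap if present.
state_dict = full_ckpt.get("module", full_ckpt)

# Keep only tensors belonging to the vision encoder.
# NOTE: the "vision_encoder." prefix is an assumed naming convention.
vision_state_dict = {
    k.removeprefix("vision_encoder."): v
    for k, v in state_dict.items()
    if k.startswith("vision_encoder.")
}

torch.save(vision_state_dict, "InternVideo2_S2_6B_vision.pt")
```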

### Model Sources

- **Original Project Repository:** [InternVideo2](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2)
- **Original Paper:** [2403.15377](https://arxiv.org/abs/2403.15377)
- **Original Point of Contact:** [InternVideo Group](mailto:[email protected])

### Uploader

- **This specific vision-only checkpoint uploaded by:** [qingy2024](https://huggingface.co/qingy2024)

## How to Use

This file (`InternVideo2_S2_6B_vision.pt`) is a standard PyTorch state dictionary containing only the vision encoder weights. It can be loaded into a compatible vision model architecture using `model.load_state_dict()`.

```python
import torch

# The checkpoint is a plain state dict (parameter name -> tensor),
# not a full serialized model object.
vision_state_dict = torch.load("InternVideo2_S2_6B_vision.pt", map_location="cpu")  # or "cuda"
```
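
To apply the weights, you still need to instantiate the matching vision backbone from the original repository. A minimal sketch of inspecting the checkpoint and loading it; the `model` instance is a placeholder for the InternVideo2 vision encoder class, which is not shown here:

```python
import torch

vision_state_dict = torch.load("InternVideo2_S2_6B_vision.pt", map_location="cpu")

# Inspect what the checkpoint contains before wiring it into a model.
print(f"{len(vision_state_dict)} tensors")
for name, tensor in list(vision_state_dict.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)

# Assuming `model` is an instance of the compatible InternVideo2 vision backbone
# built from the original repository (placeholder, construction not shown):
# missing, unexpected = model.load_state_dict(vision_state_dict, strict=False)
# print("missing keys:", missing)
# print("unexpected keys:", unexpected)
```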

## Limitations

This model contains only the vision encoder. It **does not** include the text or audio encoders and cannot perform tasks requiring multimodal inputs unless combined with separate models for those modalities.

## Citation

If you use this vision encoder, please cite the original InternVideo2 paper:

```bibtex
@article{wang2024internvideo2,
  title={InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding},
  author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
  journal={arXiv preprint arXiv:2403.15377},
  year={2024}
}
```