---
license: apache-2.0
extra_gated_prompt: >-
  You agree to not use the model to conduct experiments that cause harm to human
  subjects.
extra_gated_fields:
  Name: text
  Company/Organization: text
  Country: text
  E-Mail: text
datasets:
- OpenGVLab/InternVid
pipeline_tag: image-feature-extraction
---
# Model Card for InternVideo2 (Vision-Only)
This model card describes the **vision encoder** component extracted from the InternVideo2 foundation model series.
## Model Details
This checkpoint contains only the vision backbone parameters and is suitable for video or image feature extraction. It was obtained by filtering the vision-encoder weights out of a full multimodal InternVideo2 checkpoint (e.g., S2_6B), as sketched below.
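For reference, a minimal sketch of how such a vision-only state dict can be produced. The source filename, the `module` wrapper key, and the `vision_encoder.` prefix are assumptions for illustration; inspect the actual checkpoint to confirm its real key names.

```python
import torch

# Load the full multimodal checkpoint (filename is illustrative).
full_ckpt = torch.load("InternVideo2_S2_6B.pt", map_location="cpu")

# Some checkpoints wrap the state dict under a "module" key; unwrap if present.
state_dict = full_ckpt.get("module", full_ckpt)

# Keep only the vision backbone parameters. "vision_encoder." is an assumed
# prefix for illustration; adjust it to match the checkpoint's actual keys.
vision_state_dict = {
    k.removeprefix("vision_encoder."): v
    for k, v in state_dict.items()
    if k.startswith("vision_encoder.")
}

torch.save(vision_state_dict, "InternVideo2_S2_6B_vision.pt")
```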
### Model Sources
- **Original Project Repository:** [InternVideo2](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2)
- **Original Paper:** [2403.15377](https://arxiv.org/abs/2403.15377)
- **Original Point of Contact:** [InternVideo Group](mailto:[email protected])
### Uploader
- **This specific vision-only checkpoint uploaded by:** [qingy2024](https://huggingface.co/qingy2024)
## How to Use
This file (`InternVideo2_S2_6B_vision.pt`) is a standard PyTorch state dictionary containing only the vision encoder weights. It can be loaded into a compatible vision model architecture using `model.load_state_dict()`.
```python
import torch

# Load the vision-encoder state dict onto the CPU; pass map_location="cuda"
# to place the tensors directly on a GPU instead.
vision_state_dict = torch.load("InternVideo2_S2_6B_vision.pt", map_location="cpu")
```
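The loading step itself might look like the sketch below. `build_vision_encoder` is a hypothetical placeholder; use whatever constructor the original InternVideo2 repository provides for the vision backbone.

```python
# Hypothetical: `build_vision_encoder` stands in for the vision backbone
# constructor from the InternVideo2 repository.
model = build_vision_encoder()

# strict=False tolerates benign mismatches (e.g., missing buffers); check the
# returned key lists to confirm nothing important was skipped.
missing, unexpected = model.load_state_dict(vision_state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```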
## Limitations
This model contains only the vision encoder. It **does not** include the text or audio encoders and cannot perform tasks requiring multimodal inputs unless combined with separate models for those modalities.
## Citation
If you use this vision encoder, please cite the original InternVideo2 paper:
```bibtex
@article{wang2024internvideo2,
title={InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding},
author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
journal={arXiv preprint arXiv:2403.15377},
year={2024}
}
```