---
license: apache-2.0
extra_gated_prompt: >-
  You agree to not use the model to conduct experiments that cause harm to human
  subjects.
extra_gated_fields:
  Name: text
  Company/Organization: text
  Country: text
  E-Mail: text
datasets:
- OpenGVLab/InternVid
pipeline_tag: image-feature-extraction
---
# Model Card for InternVideo2 (Vision-Only)

This model card describes the **vision encoder** component extracted from the InternVideo2 foundation model series.

## Model Details

This checkpoint contains only the vision backbone parameters and is suitable for video or image feature extraction. It was obtained by filtering a multimodal InternVideo2 checkpoint (in this case, S2_6B) down to its vision-encoder weights.
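For reference, a vision-only dump like this one can be produced by filtering the full multimodal state dictionary by key prefix, as in the illustrative sketch below; the source file name, the optional `module` wrapper key, and the `vision_encoder.` prefix are assumptions rather than confirmed details of the original export.

```python
import torch

# Illustrative only: the source checkpoint name and the "vision_encoder."
# key prefix are assumptions; inspect the actual checkpoint for its key names.
full_state = torch.load("InternVideo2_S2_6B.pt", map_location="cpu")
if isinstance(full_state, dict) and "module" in full_state:
    # Some training frameworks nest the weights under a wrapper key.
    full_state = full_state["module"]

prefix = "vision_encoder."
vision_state = {
    key[len(prefix):]: value
    for key, value in full_state.items()
    if key.startswith(prefix)
}
torch.save(vision_state, "InternVideo2_S2_6B_vision.pt")
```

Only keys under the chosen prefix survive this filtering, which is why the text and audio towers are absent from this upload.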
### Model Sources

- **Original Project Repository:** [InternVideo2](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2)
- **Original Paper:** [arXiv:2403.15377](https://arxiv.org/abs/2403.15377)
- **Original Point of Contact:** [InternVideo Group](mailto:[email protected])

### Uploader

- **This specific vision-only checkpoint uploaded by:** [qingy2024](https://huggingface.co/qingy2024)

## How to Use

This file (`InternVideo2_S2_6B_vision.pt`) is a standard PyTorch state dictionary containing only the vision encoder weights. It can be loaded into a compatible vision model architecture using `model.load_state_dict()`.
```python
import torch

# Load the vision-only state dict onto CPU; pass map_location='cuda' to place
# the tensors on GPU instead.
vision_state_dict = torch.load("InternVideo2_S2_6B_vision.pt", map_location='cpu')
```
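To use the weights, instantiate the matching vision backbone from the original InternVideo2 repository and load the state dict into it. The sketch below is a minimal, hedged example: the builder name is a placeholder rather than an actual function from the repo, and only the inspection part runs as-is.

```python
import torch

# Sanity-check the parameter names and shapes before wiring up the architecture.
vision_state_dict = torch.load("InternVideo2_S2_6B_vision.pt", map_location="cpu")
for name, tensor in list(vision_state_dict.items())[:5]:
    print(name, tuple(tensor.shape))

# Hypothetical usage -- `build_internvideo2_vision_6b` is a placeholder for the
# actual vision backbone constructor in the InternVideo2 repository:
#
#   vision_model = build_internvideo2_vision_6b()
#   missing, unexpected = vision_model.load_state_dict(vision_state_dict, strict=False)
#   print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
#
# strict=False tolerates keys that were dropped when the multimodal checkpoint
# was filtered down to the vision encoder; check the returned lists before use.
```

Prefer `strict=True` once the key names are confirmed to line up exactly with the instantiated architecture.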
## Limitations

This model contains only the vision encoder. It **does not** include the text or audio encoders and cannot perform tasks requiring multimodal inputs unless combined with separate models for those modalities.

## Citation

If you use this vision encoder, please cite the original InternVideo2 paper:
```bibtex
@article{wang2024internvideo2,
  title={InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding},
  author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
  journal={arXiv preprint arXiv:2403.15377},
  year={2024}
}
```