qingy2024
/

InternVideo2_S2_6B_Vision

Image Feature Extraction


qingy2024 committed on May 31

Commit

2239ab7

·

verified ·

1 Parent(s): 7bf17e6

Create README.md

Files changed (1)

README.md +54 -0

README.md ADDED

	@@ -0,0 +1,54 @@

+---
+license: apache-2.0
+extra_gated_prompt: "You agree to not use the model to conduct experiments that cause harm to human subjects."
+extra_gated_fields:
+  Name: text
+  Company/Organization: text
+  Country: text
+  E-Mail: text
+---
+# Model Card for InternVideo2 (Vision-Only)
+This model card describes the **vision encoder** component extracted from the InternVideo2 foundation model series.
+## Model Details
+This checkpoint contains only the vision backbone parameters, suitable for video or image feature extraction tasks. It was obtained by filtering a multimodal InternVideo2 checkpoint (e.g., S2_6B).
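Review note: the README does not say how the filtering was done. A minimal sketch of one common approach, keeping only the entries whose key starts with a given prefix, is shown below. The prefix `vision_encoder.`, the helper `extract_vision_weights`, and the toy keys are all hypothetical, and whether the prefix was stripped in the uploaded file is not stated. Plain strings stand in for tensors so the sketch runs without PyTorch.

```python
from typing import Any, Dict

# Hypothetical key prefix; the real prefix in an InternVideo2 checkpoint may differ.
VISION_PREFIX = "vision_encoder."


def extract_vision_weights(
    state_dict: Dict[str, Any], prefix: str = VISION_PREFIX
) -> Dict[str, Any]:
    """Keep only entries whose key starts with `prefix`, stripping the prefix."""
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }


# Stand-in for a multimodal state dict; strings replace tensors for illustration.
full_state_dict = {
    "vision_encoder.patch_embed.weight": "tensor-a",
    "vision_encoder.blocks.0.attn.weight": "tensor-b",
    "text_encoder.embeddings.weight": "tensor-c",
}

vision_only = extract_vision_weights(full_state_dict)
print(sorted(vision_only))
```

With a real checkpoint the same comprehension would run over the dictionary returned by `torch.load`, and the result could be written back out with `torch.save`.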
+### Model Sources
+- **Original Project Repository:** [InternVideo2](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2)
+- **Original Paper:** [2403.15377](https://arxiv.org/abs/2403.15377)
+- **Original Point of Contact:** [InternVideo Group](mailto:[email protected])
+### Uploader
+- **This specific vision-only checkpoint uploaded by:** [qingy2024](https://huggingface.co/qingy2024)
+## How to Use
+This file (`InternVideo2_S2_6B_vision.pt`) is a standard PyTorch state dictionary containing only the vision encoder weights. It can be loaded into a compatible vision model architecture using `model.load_state_dict()`.
+```python
+import torch
+# Load the weights onto the CPU; pass map_location='cuda' to load them onto a GPU instead.
+vision_state_dict = torch.load("InternVideo2_S2_6B_vision.pt", map_location='cpu')
+```
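Review note: the snippet in the README stops at loading the dictionary. A rough sketch of the `model.load_state_dict()` step follows. `TinyVisionEncoder` and the temporary file `toy_vision.pt` are hypothetical stand-ins so the example runs without the real checkpoint; the actual InternVideo2 vision architecture comes from the original repository. Passing `strict=False` is one way to see which keys fail to line up instead of getting an exception.

```python
import os
import tempfile

import torch
from torch import nn


class TinyVisionEncoder(nn.Module):
    """Hypothetical stand-in for the real vision architecture."""

    def __init__(self) -> None:
        super().__init__()
        self.patch_embed = nn.Linear(8, 4)
        self.head = nn.Linear(4, 2)


# Save a partial state dict to a temporary file to mimic a filtered checkpoint.
source = TinyVisionEncoder()
partial_state = {
    k: v for k, v in source.state_dict().items() if k.startswith("patch_embed.")
}
checkpoint_path = os.path.join(tempfile.mkdtemp(), "toy_vision.pt")
torch.save(partial_state, checkpoint_path)

model = TinyVisionEncoder()
state_dict = torch.load(checkpoint_path, map_location="cpu")

# strict=False loads the matching keys and reports mismatches instead of raising.
result = model.load_state_dict(state_dict, strict=False)
print("missing:", result.missing_keys)
print("unexpected:", result.unexpected_keys)
```

If both lists come back empty for the real model, the checkpoint and the architecture agree key for key; non-empty lists usually point to a prefix mismatch or a different model configuration.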
+## Limitations
+This model contains only the vision encoder. It **does not** include the text or audio encoders and cannot perform tasks requiring multimodal inputs unless combined with separate models for those modalities.
+## Citation
+If you use this vision encoder, please cite the original InternVideo2 paper:
+```bibtex
+@article{wang2024internvideo2,
+  title={InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding},
+  author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
+  journal={arXiv preprint arXiv:2403.15377},
+  year={2024}
+}
+```