---
license: apache-2.0
extra_gated_prompt: >-
  You agree to not use the model to conduct experiments that cause harm to human
  subjects.
extra_gated_fields:
  Name: text
  Company/Organization: text
  Country: text
  E-Mail: text
datasets:
- OpenGVLab/InternVid
pipeline_tag: image-feature-extraction
---
# Model Card for InternVideo2 (Vision-Only)

This model card describes the **vision encoder** component extracted from the InternVideo2 foundation model series.

## Model Details

This checkpoint contains only the vision backbone parameters and is suitable for video or image feature extraction. It was obtained by filtering a multimodal InternVideo2 checkpoint (in this case, S2_6B) down to its vision-encoder weights.
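For reference, a vision-only dump like this one can be produced by filtering the full multimodal state dictionary by key prefix, as in the illustrative sketch below; the source file name, the optional `module` wrapper key, and the `vision_encoder.` prefix are assumptions rather than confirmed details of the original export.

```python
import torch

# Illustrative only: the source checkpoint name and the "vision_encoder."
# key prefix are assumptions; inspect the actual checkpoint for its key names.
full_state = torch.load("InternVideo2_S2_6B.pt", map_location="cpu")
if isinstance(full_state, dict) and "module" in full_state:
    # Some training frameworks nest the weights under a wrapper key.
    full_state = full_state["module"]

prefix = "vision_encoder."
vision_state = {
    key[len(prefix):]: value
    for key, value in full_state.items()
    if key.startswith(prefix)
}
torch.save(vision_state, "InternVideo2_S2_6B_vision.pt")
```

Only keys under the chosen prefix survive this filtering, which is why the text and audio towers are absent from this upload.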
### Model Sources

- **Original Project Repository:** [InternVideo2](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2)
- **Original Paper:** [arXiv:2403.15377](https://arxiv.org/abs/2403.15377)
- **Original Point of Contact:** [InternVideo Group](mailto:[email protected])

### Uploader

- **This specific vision-only checkpoint uploaded by:** [qingy2024](https://huggingface.co/qingy2024)

## How to Use

This file (`InternVideo2_S2_6B_vision.pt`) is a standard PyTorch state dictionary containing only the vision encoder weights. It can be loaded into a compatible vision model architecture using `model.load_state_dict()`.
```python
import torch

# Load the vision-only state dict onto CPU; pass map_location='cuda' to place
# the tensors on GPU instead.
vision_state_dict = torch.load("InternVideo2_S2_6B_vision.pt", map_location='cpu')
```
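To use the weights, instantiate the matching vision backbone from the original InternVideo2 repository and load the state dict into it. The sketch below is a minimal, hedged example: the builder name is a placeholder rather than an actual function from the repo, and only the inspection part runs as-is.

```python
import torch

# Sanity-check the parameter names and shapes before wiring up the architecture.
vision_state_dict = torch.load("InternVideo2_S2_6B_vision.pt", map_location="cpu")
for name, tensor in list(vision_state_dict.items())[:5]:
    print(name, tuple(tensor.shape))

# Hypothetical usage -- `build_internvideo2_vision_6b` is a placeholder for the
# actual vision backbone constructor in the InternVideo2 repository:
#
#   vision_model = build_internvideo2_vision_6b()
#   missing, unexpected = vision_model.load_state_dict(vision_state_dict, strict=False)
#   print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
#
# strict=False tolerates keys that were dropped when the multimodal checkpoint
# was filtered down to the vision encoder; check the returned lists before use.
```

Prefer `strict=True` once the key names are confirmed to line up exactly with the instantiated architecture.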
## Limitations

This model contains only the vision encoder. It **does not** include the text or audio encoders and cannot perform tasks requiring multimodal inputs unless combined with separate models for those modalities.

## Citation

If you use this vision encoder, please cite the original InternVideo2 paper:
```bibtex
@article{wang2024internvideo2,
  title={InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding},
  author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
  journal={arXiv preprint arXiv:2403.15377},
  year={2024}
}
```