nielsr (HF Staff) committed
Commit e0e642f · verified · 1 Parent(s): 067788b

Correct pipeline tag, add link to paper


Correcting the pipeline tag and linking it to the paper, ensuring people can find your model at https://huggingface.co/models?pipeline_tag=video-text-to-text.
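As a side note, the Hub exposes that same filter programmatically; below is a minimal sketch, assuming a recent `huggingface_hub` release in which `list_models` accepts `pipeline_tag` and `search` arguments:

```python
# Minimal sketch: list Hub models carrying the video-text-to-text pipeline tag,
# i.e. the same filter as the URL above. Assumes a recent huggingface_hub release.
from huggingface_hub import list_models

for model in list_models(pipeline_tag="video-text-to-text", search="SmolVLM2", limit=10):
    print(model.id)  # repo ids of matching models
```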

Files changed (1)
  1. README.md +8 -11
README.md CHANGED
@@ -1,6 +1,6 @@
  ---
- library_name: transformers
- license: apache-2.0
+ base_model:
+ - HuggingFaceTB/SmolVLM-256M-Instruct
  datasets:
  - HuggingFaceM4/the_cauldron
  - HuggingFaceM4/Docmatix
@@ -14,17 +14,19 @@ datasets:
  - TIGER-Lab/VISTA-400K
  - Enxin/MovieChat-1K_train
  - ShareGPT4Video/ShareGPT4Video
- pipeline_tag: image-text-to-text
  language:
  - en
- base_model:
- - HuggingFaceTB/SmolVLM-256M-Instruct
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: video-text-to-text
  ---

  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/SmolVLM2_banner.png" width="800" height="auto" alt="Image description">

  # SmolVLM2-256M-Video

+ This repository contains the model as presented in [SmolVLM: Redefining small and efficient multimodal models](https://huggingface.co/papers/2504.05299).
+
  SmolVLM2-256M-Video is a lightweight multimodal model designed to analyze video content. The model processes videos, images, and text inputs to generate text outputs - whether answering questions about media files, comparing visual content, or transcribing text from images. Despite its compact size, requiring only 1.38GB of GPU RAM for video inference. This efficiency makes it particularly well-suited for on-device applications that require specific domain fine-tuning and computational resources may be limited.
  ## Model Summary

@@ -207,12 +209,7 @@ You can cite us in the following way:
  ## Training Data
  SmolVLM2 used 3.3M samples for training originally from ten different datasets: [LlaVa Onevision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [M4-Instruct](https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data), [Mammoth](https://huggingface.co/datasets/MAmmoTH-VL/MAmmoTH-VL-Instruct-12M), [LlaVa Video 178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K), [FineVideo](https://huggingface.co/datasets/HuggingFaceFV/finevideo), [VideoStar](https://huggingface.co/datasets/orrzohar/Video-STaR), [VRipt](https://huggingface.co/datasets/Mutonix/Vript), [Vista-400K](https://huggingface.co/datasets/TIGER-Lab/VISTA-400K), [MovieChat](https://huggingface.co/datasets/Enxin/MovieChat-1K_train) and [ShareGPT4Video](https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video).
  In the following plots we give a general overview of the samples across modalities and the source of those samples.
- <!--
- <center><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_split.png" width="auto" height="auto" alt="Image description">
- </center>

- ### Details
- <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_datadetails.png" width="auto" height="auto" alt="Image description"> -->

  ## Data Split per modality

@@ -266,4 +263,4 @@ In the following plots we give a general overview of the samples across modaliti
  | video-star/starb | 2.2% |
  | vista-400k/combined | 2.2% |
  | vript/long | 1.0% |
- | ShareGPT4Video/all | 0.8% |
+ | ShareGPT4Video/all | 0.8% |
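For readers landing on this commit, the corrected `video-text-to-text` tag corresponds to usage roughly along the following lines. This is a rough sketch, not taken from the diffed README: it assumes a `transformers` release that ships SmolVLM2 support, a CUDA device, a local `video.mp4`, and the repo id `HuggingFaceTB/SmolVLM2-256M-Video-Instruct` (substitute this repository's actual id if it differs):

```python
# Rough sketch of video-text-to-text inference with SmolVLM2-256M-Video.
# Assumptions: a transformers release with SmolVLM2 support, torch with a CUDA
# device, a local "video.mp4" (placeholder path), and the repo id below.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-256M-Video-Instruct"  # assumed repo id

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "video.mp4"},  # placeholder video file
            {"type": "text", "text": "Describe this video in detail."},
        ],
    }
]

# Build model inputs from the chat template (tokenized text plus sampled video frames).
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```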