TinyLLaVA-Video


Here we introduce TinyLLaVA-Video-Phi2-Naive-16-512. For the LLM and vision tower, we use Phi-2 and siglip-so400m-patch14-384, respectively. The model adopts the Naive Video-Level Resampler: it samples 16 frames from each video and represents the whole video sequence with a fixed budget of 512 tokens.
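The sampling and token-budget scheme above can be sketched as follows. This is a conceptual illustration only, not the model's actual code: the uniform sampling strategy, the 27×27 patch grid and 1152-dim features assumed for siglip-so400m-patch14-384, and the average-pooling reduction are all assumptions made for the sketch.

```python
import numpy as np

def uniform_frame_indices(num_video_frames: int, num_samples: int = 16) -> np.ndarray:
    """Pick frame indices spread uniformly across the whole video."""
    return np.linspace(0, num_video_frames - 1, num_samples).astype(int)

def naive_video_resample(frame_features: np.ndarray, num_query_tokens: int = 512) -> np.ndarray:
    """Flatten per-frame vision tokens, then pool down to a fixed token budget.

    frame_features: (num_frames, tokens_per_frame, hidden_dim)
    returns:        (num_query_tokens, hidden_dim)
    """
    num_frames, tokens_per_frame, hidden = frame_features.shape
    tokens = frame_features.reshape(num_frames * tokens_per_frame, hidden)
    # Average-pool contiguous groups of tokens down to the 512-token budget
    # (pooling choice is an assumption for this sketch).
    groups = np.array_split(tokens, num_query_tokens, axis=0)
    return np.stack([group.mean(axis=0) for group in groups])

# 16 sampled frames; assume the vision tower emits a 27x27 = 729 patch grid
# with 1152-dim features per frame.
feats = np.random.randn(16, 729, 1152).astype(np.float32)
video_tokens = naive_video_resample(feats)
print(video_tokens.shape)  # (512, 1152)
```

Whatever the internal pooling, the key property is that the video-level token count stays at 512 regardless of video length, since the 16 sampled frames are spread over the full duration.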

Results

| Model (HF Path) | #Frames/Query | Video-MME | MVBench | LongVideoBench | MLVU |
|---|---|---|---|---|---|
| Zhang199/TinyLLaVA-Video-Qwen2.5-3B-Group-1fps-512 | 1 fps / 512 | 47.7 | 47.0 | 42.0 | 52.6 |
| Zhang199/TinyLLaVA-Video-Qwen2.5-3B-Group-16-512 | 16 / 512 | 47.0 | 45.5 | 42.4 | 52.5 |
| Zhang199/TinyLLaVA-Video-Qwen2.5-3B-Naive-16-512 | 16 / 512 | 44.7 | 42.5 | 37.6 | 48.1 |
| Zhang199/TinyLLaVA-Video-Phi2-Naive-16-512 | 16 / 512 | 42.7 | 42.0 | 42.2 | 46.5 |
Model size: 3.39B parameters (FP16, Safetensors)
