TinyLLaVA-Video


Here we introduce TinyLLaVA-Video-Phi2-Naive-16-512. For the LLM and vision tower, we use Phi-2 and siglip-so400m-patch14-384, respectively. The model adopts the Naive Video-Level Resampler: it samples 16 frames from each video and represents the whole video sequence with a fixed budget of 512 tokens.
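The sampling and token-budget scheme above can be sketched as follows. This is a conceptual illustration only, not the model's actual code: the uniform sampling strategy, the 27×27 patch grid and 1152-dim features assumed for siglip-so400m-patch14-384, and the average-pooling reduction are all assumptions made for the sketch.

```python
import numpy as np

def uniform_frame_indices(num_video_frames: int, num_samples: int = 16) -> np.ndarray:
    """Pick frame indices spread uniformly across the whole video."""
    return np.linspace(0, num_video_frames - 1, num_samples).astype(int)

def naive_video_resample(frame_features: np.ndarray, num_query_tokens: int = 512) -> np.ndarray:
    """Flatten per-frame vision tokens, then pool down to a fixed token budget.

    frame_features: (num_frames, tokens_per_frame, hidden_dim)
    returns:        (num_query_tokens, hidden_dim)
    """
    num_frames, tokens_per_frame, hidden = frame_features.shape
    tokens = frame_features.reshape(num_frames * tokens_per_frame, hidden)
    # Average-pool contiguous groups of tokens down to the 512-token budget
    # (pooling choice is an assumption for this sketch).
    groups = np.array_split(tokens, num_query_tokens, axis=0)
    return np.stack([group.mean(axis=0) for group in groups])

# 16 sampled frames; assume the vision tower emits a 27x27 = 729 patch grid
# with 1152-dim features per frame.
feats = np.random.randn(16, 729, 1152).astype(np.float32)
video_tokens = naive_video_resample(feats)
print(video_tokens.shape)  # (512, 1152)
```

Whatever the internal pooling, the key property is that the video-level token count stays at 512 regardless of video length, since the 16 sampled frames are spread over the full duration.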

Results

| Model (HF Path) | #Frames/Query | Video-MME | MVBench | LongVideoBench | MLVU |
|---|---|---|---|---|---|
| Zhang199/TinyLLaVA-Video-Qwen2.5-3B-Group-1fps-512 | 1 fps / 512 | 47.7 | 47.0 | 42.0 | 52.6 |
| Zhang199/TinyLLaVA-Video-Qwen2.5-3B-Group-16-512 | 16 / 512 | 47.0 | 45.5 | 42.4 | 52.5 |
| Zhang199/TinyLLaVA-Video-Qwen2.5-3B-Naive-16-512 | 16 / 512 | 44.7 | 42.5 | 37.6 | 48.1 |
| Zhang199/TinyLLaVA-Video-Phi2-Naive-16-512 | 16 / 512 | 42.7 | 42.0 | 42.2 | 46.5 |
Model size: 3.39B parameters (FP16, Safetensors)
