mfarre
/

Video-LLaVA-7B-hf-CinePile

Video-Text-to-Text

Model card Files Files and versions Community

mfarre HF staff commited on Jul 26

Commit

0ca7fb8

•

1 Parent(s): 8eedcd2

Update README.md

Files changed (1) hide show

README.md +2 -0

README.md CHANGED Viewed

@@ -7,6 +7,8 @@ license: apache-2.0
 ---
 # Model Card for Video-LLaVA - CinePile fine tune
 Video multimodal research often emphasizes activity recognition and object-centered tasks, such as determining "what is the person wearing a red hat doing?" However, this focus overlooks areas like theme exploration, narrative and plot analysis, and character and relationship dynamics.
 CinePile addresses these areas in their benchmark and they find that Large Language Models significantly lag behind human performance in these tasks. Additionally, there is a notable disparity in performance between open and closed models.

 ---
 # Model Card for Video-LLaVA - CinePile fine tune
+![CinePile Benchmark - Open vs Closed models](https://huggingface.co/mfarre/Video-LLaVA-7B-hf-CinePile/resolve/main/benchmark.png)
 Video multimodal research often emphasizes activity recognition and object-centered tasks, such as determining "what is the person wearing a red hat doing?" However, this focus overlooks areas like theme exploration, narrative and plot analysis, and character and relationship dynamics.
 CinePile addresses these areas in their benchmark and they find that Large Language Models significantly lag behind human performance in these tasks. Additionally, there is a notable disparity in performance between open and closed models.