xjtupanda commited on
Commit
a372578
·
verified ·
1 Parent(s): 7a76818

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +0 -9
README.md CHANGED
@@ -26,15 +26,6 @@ This model is a part of the project [Sparrow](https://github.com/VITA-MLLM/Sparr
26
  [MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5).
27
 
28
 
29
- ## Paper
30
-
31
- **Title:** Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation
32
-
33
- **Abstract:**
34
-
35
- Recent years have witnessed the success of Multimodal Large Language Models (MLLMs) in the vision understanding domain. The success of these models can largely be attributed to the dominant scaling law, which states that larger parameter sizes and data volumes contribute to better performance. Notably, data scaling has mainly been powered by automatic data pipelines, which center around the self-instruction of LLMs. The paradigm has been taken for granted for quite some time, but the study of the effectiveness of scaling with these data has been neglected for a long time. In this context, this work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. Our main study approach is fine-tuning pre-trained image-LLMs with video data and investigating learning efficiency through data scaling. Results from our preliminary experiments reveal a low learning efficiency phenomenon when simply scaling up video data samples, which, through our probing, can be ascribed to a lack of instruction diversity. Aiming at this issue, we propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data. Mixing these synthetic samples with the video data enables a more efficient training scheme. Through comprehensive experiments, we demonstrate that our proposed method achieves performance comparable to or even superior to baselines trained with many more samples. Meanwhile, we find that incorporating these synthetic samples can boost the performance of long video understanding without training with long video data.
36
-
37
-
38
  ## License
39
 
40
  #### Model License
 
26
  [MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5).
27
 
28
 
 
 
 
 
 
 
 
 
 
29
  ## License
30
 
31
  #### Model License