## POINTS-Yi-1.5-9B-Chat

### Introduction

We are excited to announce the first version of POINTS, which integrates recent advancements in vision-language models and new techniques proposed by researchers from WeChat AI.

### What's new in POINTS?

**Key Innovations**

1. **Strong Baseline**: We integrate the most recent advancements in vision-language models, i.e., CapFusion, Dual Vision Encoder, and Dynamic High Resolution, into POINTS (a dynamic-high-resolution tiling sketch follows the figure below).

2. **Pre-training Dataset Filtering**: We propose filtering the pre-training dataset using perplexity as a metric. With this filtering strategy, we can significantly reduce the size of the pre-training dataset while improving the model's performance (a filtering sketch follows the figure below).

3. **Model Soup**: We propose applying model soup to models fine-tuned with different visual instruction tuning datasets, which further improves the model's performance significantly (a model-soup sketch follows the figure below).

<p align="center">
<img src="https://github.com/user-attachments/assets/6af35008-f501-400a-a870-b66a9bf2baab" width="60%"/>
</p>
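
The snippet below is a minimal, illustrative sketch of the dynamic-high-resolution idea: a large image is split into fixed-size tiles plus a downscaled global view before being passed to the vision encoder. The helper name, tile size, and tile budget are assumptions for illustration, not the POINTS implementation.

```python
from PIL import Image


def dynamic_high_res_tiles(image: Image.Image, tile_size: int = 336, max_tiles: int = 9):
    """Split a large image into sub-tiles plus a global thumbnail view.

    Illustrative sketch only: tile_size, max_tiles, and the cropping scheme
    are assumed values, not the exact POINTS recipe.
    """
    # Global view: a downscaled copy of the whole image
    global_view = image.resize((tile_size, tile_size))

    # Choose a grid roughly matching the image's size, capped at max_tiles
    cols = min(max(1, round(image.width / tile_size)), max_tiles)
    rows = min(max(1, round(image.height / tile_size)), max_tiles)
    while cols * rows > max_tiles:
        cols, rows = max(1, cols - 1), max(1, rows - 1)

    # Resize so the image divides evenly into the grid, then cut the tiles
    resized = image.resize((cols * tile_size, rows * tile_size))
    tiles = [
        resized.crop((c * tile_size, r * tile_size,
                      (c + 1) * tile_size, (r + 1) * tile_size))
        for r in range(rows) for c in range(cols)
    ]
    return [global_view] + tiles
```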
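
Below is a minimal sketch of the perplexity-based pre-training data filtering idea: a small reference language model scores each caption, and only the lowest-perplexity fraction is kept. The reference model (`gpt2`), the keep ratio, and the scoring loop are illustrative assumptions, not the exact POINTS pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def perplexity(text: str, lm, tok) -> float:
    """Perplexity of `text` under a reference causal language model."""
    ids = tok(text, return_tensors='pt').input_ids.to(lm.device)
    loss = lm(ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()


def filter_by_perplexity(captions, keep_ratio=0.5, ref_model='gpt2'):
    """Keep the `keep_ratio` fraction of captions with the lowest perplexity.

    `gpt2` and `keep_ratio=0.5` are placeholder choices for illustration.
    """
    tok = AutoTokenizer.from_pretrained(ref_model)
    lm = AutoModelForCausalLM.from_pretrained(ref_model).eval()
    ranked = sorted(captions, key=lambda c: perplexity(c, lm, tok))
    return ranked[:int(len(ranked) * keep_ratio)]
```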
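
Finally, a minimal sketch of the model-soup step: the weights of several checkpoints fine-tuned on different visual instruction tuning datasets are averaged parameter-wise. A uniform soup over placeholder checkpoint paths is shown for simplicity; the actual recipe may weight or select checkpoints differently.

```python
import torch
from transformers import AutoModelForCausalLM


def uniform_model_soup(checkpoint_paths):
    """Average the floating-point parameters of several fine-tuned checkpoints."""
    soup = AutoModelForCausalLM.from_pretrained(
        checkpoint_paths[0], trust_remote_code=True, torch_dtype=torch.float32)
    soup_state = {k: v.clone() for k, v in soup.state_dict().items()}

    # Accumulate the remaining checkpoints' weights parameter by parameter
    for path in checkpoint_paths[1:]:
        other = AutoModelForCausalLM.from_pretrained(
            path, trust_remote_code=True, torch_dtype=torch.float32)
        for k, v in other.state_dict().items():
            if soup_state[k].is_floating_point():
                soup_state[k] += v

    # Uniform average (skip integer buffers such as position ids)
    for k, v in soup_state.items():
        if v.is_floating_point():
            soup_state[k] = v / len(checkpoint_paths)

    soup.load_state_dict(soup_state)
    return soup
```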

### How to use POINTS?

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import CLIPImageProcessor
from PIL import Image
import torch


image_path = '/path/to/local/image.jpg'
prompt = 'please describe the image in detail'
pil_image = Image.open(image_path)
# Local path or Hugging Face repo id of this model
model_path = ''
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, device_map='cuda').to(torch.bfloat16)
image_processor = CLIPImageProcessor.from_pretrained(model_path)
generation_config = {
    'max_new_tokens': 1024,
    'temperature': 0.0,
    'top_p': 0.0,
    'num_beams': 1,
}
# `chat` is provided by the model's custom code (enabled via trust_remote_code)
res = model.chat(
    pil_image,
    prompt,
    tokenizer,
    image_processor,
    True,
    generation_config
)
print(res)
```

### Evaluation

| Benchmark | InternVL2-8B | LLaVA-OneVision | POINTS |
| :-------: | :----------: | :-------------: | :----: |
| MMBench-dev-en | - | 80.8 | 83.2 |
| MathVista | 58.3 | 62.3 | 60.7 |
| HallusionBench | 45.0 | 31.6 | 48.0 |
| OCRBench | 79.4 | 62.2 | 70.6 |
| AI2D | 83.6 | 82.4 | 78.5 |
| MMVet | 54.3 | 51.9 | 50.0 |
| MMStar | 61.5 | 61.9 | 56.4 |
| MMMU | 51.2 | 47.9 | 46.9 |
| ScienceQA | 97.1 | 95.4 | 92.9 |
| MME | 2215.1 | 1993.6 | 2017.8 |
| RealWorldQA | 64.2 | 69.9 | 65.9 |
| LLaVA-Wild | 73.3 | 81.0 | 69.3 |

### Citation

If you find our work helpful, feel free to cite us:

```bibtex
@article{liu2024points,
  title={POINTS: Improving Your Vision-language Model with Affordable Strategies},
  author={Liu, Yuan and Zhao, Zhongyin and Zhuang, Ziyuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2409.04828},
  year={2024}
}

@article{liu2024rethinking,
  title={Rethinking Overlooked Aspects in Vision-Language Models},
  author={Liu, Yuan and Tian, Le and Zhou, Xiao and Zhou, Jie},
  journal={arXiv preprint arXiv:2405.11850},
  year={2024}
}
```