nielsr (HF staff) committed
Commit 57f7933 · verified · 1 Parent(s): 5193ee6

Add image-text-to-text pipeline tag, transformers library, and link to paper and project page


This PR adds the `image-text-to-text` pipeline tag and the `transformers` library name to the model card metadata to improve discoverability and clarity. It also adds links to the paper and the project page.

Files changed (1)
  1. README.md +114 -4
README.md CHANGED
@@ -1,9 +1,119 @@
  ---
- license: apache-2.0
  language:
  - en
  metrics:
  - accuracy
- base_model:
- - Qwen/Qwen2-VL-7B-Instruct
- ---
  ---
+ base_model:
+ - Qwen/Qwen2-VL-7B-Instruct
  language:
  - en
+ license: apache-2.0
  metrics:
  - accuracy
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ ---
+
+ # DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
+
+ This is the official repository of **DeepPerception**, an MLLM enhanced with cognitive visual perception capabilities.
+
+ [Project Page](https://deepperception-kvg.github.io/)
+
+ [Paper](https://arxiv.org/abs/2503.12797)
+
+ ## Overview
+
+ <p align="center">
+ <img src="figs/header.png" width="100%"><br>
+ Figure 1: (a) <strong>DeepPerception</strong> employs knowledge-driven reasoning to derive answers, while the baseline model directly outputs predictions without cognitive processing. (b) <strong>DeepPerception</strong> demonstrates superior cognitive visual perception capabilities that cannot be elicited in the foundation model through simplistic zero-shot CoT prompting.
+ </p>
+
+ #### Abstract
+
+ Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features, a capability that remains underdeveloped in current Multimodal Large Language Models (MLLMs). Despite possessing vast expert-level knowledge, MLLMs struggle to integrate reasoning into visual perception, often generating direct responses without deeper analysis.
+
+ To bridge this gap, we introduce knowledge-intensive visual grounding (KVG), a novel visual grounding task that requires both fine-grained perception and domain-specific knowledge integration. To address the challenges of KVG, we propose **DeepPerception**, an MLLM enhanced with cognitive visual perception capabilities. Our approach consists of (1) an automated data synthesis pipeline that generates high-quality, knowledge-aligned training samples, and (2) a two-stage training framework combining supervised fine-tuning for cognitive reasoning scaffolding and reinforcement learning to optimize perception-cognition synergy. To benchmark performance, we introduce KVG-Bench, a comprehensive dataset spanning 10 domains with 1.3K manually curated test cases.
+
+ Experimental results demonstrate that DeepPerception significantly outperforms direct fine-tuning, achieving +8.08% accuracy improvements on KVG-Bench and exhibiting +4.60% superior cross-domain generalization over baseline approaches. Our findings highlight the importance of integrating cognitive processes into MLLMs for human-like visual perception and open new directions for multimodal reasoning research.
+
+ #### Key Contributions
+
+ - We introduce the task of **Knowledge-intensive Visual Grounding (KVG)** to explore the concept of cognitive visual perception for MLLMs, aiming to integrate their inherent knowledge and reasoning capabilities into visual perception.
+ - We propose **[DeepPerception](https://huggingface.co/MaxyLee/DeepPerception)**, an MLLM with enhanced cognitive visual perception capabilities. To achieve this, we develop an automated dataset creation pipeline and a two-stage framework integrating supervised cognitive capability enhancement with perception-oriented reinforcement learning.
+ - We introduce **[KVG-Bench](https://huggingface.co/datasets/MaxyLee/KVG-Bench)**, a manually curated benchmark for the KVG task involving diverse knowledge domains and entities. Experiments on KVG-Bench and other fine-grained visual recognition tasks demonstrate DeepPerception's exceptional cognitive visual perception capabilities and superior cross-domain generalization performance.
+
+ ## Get Started
+
+ ### Contents:
+
+ - [Environment](#environment)
+ - [Data Preparation](#data-preparation)
+ - [Checkpoints](#checkpoints)
+ - [Evaluation](#evaluation)
+ - [Training](#training)
+
+ ### Environment
+
+ 1. Clone this repository and navigate to the DeepPerception folder:
+ ```bash
+ git clone https://github.com/MaxyLee/DeepPerception.git
+ cd DeepPerception
+ ```
+ 2. Install the packages needed for evaluation:
+ ```bash
+ conda create -n deepperception python=3.9
+ conda activate deepperception
+
+ pip install -r requirements.txt
+ ```
+
+ ### Data Preparation
+
+ | Dataset | Links |
+ |--------- |---------------------------------------|
+ | KVG-Bench | [`🤗HuggingFace`](https://huggingface.co/datasets/MaxyLee/KVG-Bench) |
+ | KVG Training | [`🤗HuggingFace`](https://huggingface.co/datasets/MaxyLee/KVG) |
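+
+ If you prefer to fetch the data programmatically, a minimal `huggingface_hub` sketch looks like this (the `local_dir` paths are placeholders; `eval.sh` only needs the resulting benchmark path):
+
+ ```python
+ # Sketch: download KVG-Bench and the KVG training data from the Hugging Face Hub.
+ # The local directories are placeholders; pass the benchmark path to eval.sh later.
+ from huggingface_hub import snapshot_download
+
+ kvg_bench_path = snapshot_download(
+     repo_id="MaxyLee/KVG-Bench",
+     repo_type="dataset",
+     local_dir="data/KVG-Bench",  # placeholder location
+ )
+ kvg_train_path = snapshot_download(
+     repo_id="MaxyLee/KVG",
+     repo_type="dataset",
+     local_dir="data/KVG",  # placeholder location
+ )
+ print(kvg_bench_path, kvg_train_path)
+ ```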
+ ---
+
+ ### Checkpoints
+
+ | Model | Links |
+ |--------- |---------------------------------------|
+ | DeepPerception | [`🤗HuggingFace`](https://huggingface.co/MaxyLee/DeepPerception) |
+ | DeepPerception-FGVR | [`🤗HuggingFace`](https://huggingface.co/MaxyLee/DeepPerception-FGVR) |
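+
+ DeepPerception is fine-tuned from Qwen2-VL-7B-Instruct, so a minimal `transformers` inference sketch follows the standard Qwen2-VL usage. This assumes the checkpoint keeps the base model's processor and chat template; the image file and prompt below are placeholders.
+
+ ```python
+ # Sketch: image-text-to-text inference with the DeepPerception checkpoint,
+ # assuming it loads like its Qwen2-VL base model. Image and prompt are placeholders.
+ from PIL import Image
+ from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
+
+ model_id = "MaxyLee/DeepPerception"
+ model = Qwen2VLForConditionalGeneration.from_pretrained(
+     model_id, torch_dtype="auto", device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ image = Image.open("example.jpg")  # placeholder image
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image"},
+         {"type": "text", "text": "Locate the Airbus A380 in this image."},  # placeholder query
+     ],
+ }]
+ prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
+ inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
+
+ output_ids = model.generate(**inputs, max_new_tokens=256)
+ answer = processor.batch_decode(
+     output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
+ )[0]
+ print(answer)
+ ```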
+ ---
+
+ ### Evaluation
+
+ ```bash
+ # Evaluate on KVG-Bench
+ bash eval.sh [CUDA_IDS] [KVG_BENCH_PATH] [CKPT_PATH]
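+ # For example (the CUDA ids and paths are placeholders; adjust to your setup):
+ #   bash eval.sh 0,1,2,3 data/KVG-Bench checkpoints/DeepPerception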
+ ```
+ Note: please modify the script if you want to evaluate the baseline Qwen2-VL model.
+
+ ### Training
+
+ TODO
+
+ ## Citation
+
+ If you find DeepPerception useful for your research or applications, please cite using this BibTeX:
+
+ ```bibtex
+ @misc{ma2025deepperception,
+     title={DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding},
+     author={Xinyu Ma and Ziyang Ding and Zhicong Luo and Chi Chen and Zonghao Guo and Derek F. Wong and Xiaoyi Feng and Maosong Sun},
+     year={2025},
+     url={https://arxiv.org/abs/2503.12797},
+ }
+ ```
+
+ ## Acknowledgement
+
+ - [Qwen2-VL](https://github.com/QwenLM/Qwen2.5-VL)
+ - [vLLM](https://github.com/vllm-project/vllm)
+ - [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)
+ - [R1-V](https://github.com/Deep-Agent/R1-V)
+
+ ## License
+
+ [![Code License](https://img.shields.io/badge/Code%20License-MIT-Green.svg)](https://github.com/twbs/bootstrap/blob/main/LICENSE)
+ [![Data License](https://img.shields.io/badge/Data%20License-Apache_2.0-Green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE)