</a>
</div>

<p align="center">
<a href="#-introduction">📖 Introduction</a> •
<a href="#-news">📌 News</a> •
<a href="#-visrag-pipeline">✨ VisRAG Pipeline</a> •
<a href="#-training">⚡️ Training</a>
</p>
<p align="center">
<a href="#-requirements">📦 Requirements</a> •
<a href="#-usage">🔧 Usage</a> •
<a href="#-license">📄 License</a> •
<a href="#-citation">📑 Citation</a> •
<a href="#-contact">📧 Contact</a>
</p>

# 📖 Introduction
**VisRAG** is a novel vision-language model (VLM)-based RAG pipeline. Instead of first parsing the document to obtain text, VisRAG embeds the document directly as an image using a VLM and then retrieves it to enhance the generation of a VLM. Compared with traditional text-based RAG, **VisRAG** maximizes the retention and utilization of the information in the original documents, eliminating the information loss introduced by the parsing process.
<p align="center"><img width=800 src="https://github.com/openbmb/VisRAG/blob/master/assets/main_figure.png?raw=true"/></p>
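
To make the two stages concrete, here is a minimal, illustrative sketch of the retrieve-then-generate flow in the figure above. The `embed_page`, `embed_query`, and `generate_answer` helpers are stand-ins for VisRAG-Ret and a generator VLM, not the repository's actual API; see the Usage section below for the real retrieval snippet.

```python
import numpy as np

# Stand-in encoders: in VisRAG these are VisRAG-Ret forward passes over a page
# image or a query string; here they return deterministic random unit vectors.
def embed_page(page_image_path: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(page_image_path)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

def embed_query(query: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

# Stand-in generator: in VisRAG this is an off-the-shelf VLM that reads the
# retrieved page images together with the query.
def generate_answer(query: str, retrieved_pages: list) -> str:
    return f"answer to {query!r} grounded in {retrieved_pages}"

def visrag(query: str, page_paths: list, top_k: int = 3) -> str:
    doc_embs = np.stack([embed_page(p) for p in page_paths])  # pages embedded directly as images
    q_emb = embed_query(query)
    scores = doc_embs @ q_emb                                 # dot product of unit vectors = cosine similarity
    top_pages = [page_paths[i] for i in np.argsort(-scores)[:top_k]]
    return generate_answer(query, top_pages)                  # generation conditioned on retrieved pages

print(visrag("What was the 2023 revenue?", [f"page_{i}.png" for i in range(10)]))
```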

# 📌 News

* 20241015: Released our training and test data in the [VisRAG](https://huggingface.co/collections/openbmb/visrag-6717bbfb471bb018a49f1c69) Collection on Hugging Face, which is referenced at the beginning of this page.
* 20241014: Released our [Paper](https://arxiv.org/abs/2410.10594) on arXiv. Released our [Model](https://huggingface.co/openbmb/VisRAG-Ret) on Hugging Face.

# ✨ VisRAG Pipeline

## VisRAG-Ret
**VisRAG-Ret** is a document embedding model built on [MiniCPM-V 2.0](https://huggingface.co/openbmb/MiniCPM-V-2), a vision-language model that integrates [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) as the vision encoder and [MiniCPM-2B](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) as the language model.
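
As a hedged sketch, loading the retriever with 🤗 Transformers might look like the following; the `trust_remote_code` and `torch_dtype` settings are our assumptions (MiniCPM-V-based checkpoints ship custom modeling code), and the full encoding example is in the Usage section below.

```python
from transformers import AutoModel, AutoTokenizer
import torch

model_id = "openbmb/VisRAG-Ret"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed dtype; adjust to your hardware
    trust_remote_code=True,
).eval()
model.to("cuda" if torch.cuda.is_available() else "cpu")  # optional GPU placement
```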

## VisRAG-Gen
In the paper, we use MiniCPM-V 2.0, MiniCPM-V 2.6, and GPT-4o as the generators. In practice, you can use any VLM you like!
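
For example, here is a hedged sketch of querying GPT-4o (one of the generators used in the paper) through the OpenAI Python SDK with a retrieved page image; the prompt wording and file name are illustrative, not the paper's exact setup.

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_from_page(query: str, page_image_path: str) -> str:
    # Encode the retrieved page image so it can be sent inline.
    with open(page_image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Answer the question based on the page image.\nQuestion: {query}"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example: generate_from_page("What was the 2023 revenue?", "retrieved_page.png")
```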

# ⚡️ Training

## VisRAG-Ret
Our training dataset of 362,110 query-document (Q-D) pairs for **VisRAG-Ret** comprises the train sets of openly available academic datasets (34%) and a synthetic dataset of pages from web-crawled PDF documents augmented with VLM-generated (GPT-4o) pseudo-queries (66%). It can be found in the `VisRAG` Collection on Hugging Face, which is referenced at the beginning of this page.
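
To inspect the data, a hedged sketch with 🤗 Datasets is shown below; the repository id is a placeholder, so substitute the concrete dataset name from the VisRAG collection.

```python
from datasets import load_dataset

# Placeholder id: pick the concrete dataset repository from the VisRAG collection
# (https://huggingface.co/collections/openbmb/visrag-6717bbfb471bb018a49f1c69).
dataset_id = "openbmb/<visrag-train-dataset>"

train_data = load_dataset(dataset_id, split="train")  # split name assumed to be "train"
print(train_data)       # features and number of Q-D pairs
print(train_data[0])    # one query paired with its document page image
```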

## VisRAG-Gen
The generation part does not use any fine-tuning; we directly use off-the-shelf LLMs/VLMs for generation.

# 📦 Requirements
```
torch==2.1.2
torchvision==0.16.2
decord==0.6.0
Pillow==10.1.0
```
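
As a quick sanity check that the pinned packages above are importable in your environment:

```python
import torch
import torchvision
import decord
import PIL

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("decord:", decord.__version__)
print("Pillow:", PIL.__version__)
```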

# 🔧 Usage

## VisRAG-Ret
```python
from transformers import AutoModel, AutoTokenizer
import torch

# ... (model loading and query/document encoding)

scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())
```
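
As a follow-up to the snippet above, here is a small, hedged helper for turning the score matrix into a per-query ranking; it assumes `scores` has shape `(num_queries, num_docs)` and that `documents` holds the candidate page images in the same order they were encoded (these variable names are ours).

```python
import torch

def rank_documents(scores: torch.Tensor, documents: list, top_k: int = 3):
    """Return, for each query, the top_k (document, score) pairs sorted by score."""
    k = min(top_k, scores.shape[1])
    values, indices = torch.topk(scores, k=k, dim=1)
    return [
        [(documents[j], values[i, n].item()) for n, j in enumerate(indices[i].tolist())]
        for i in range(scores.shape[0])
    ]

# Example with the score matrix computed above:
# rankings = rank_documents(scores, documents=doc_images, top_k=3)
```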

# 📄 License

* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of **VisRAG-Ret** model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of **VisRAG-Ret** are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, **VisRAG-Ret** weights are also available for free commercial use.

# 📑 Citation

```
@misc{yu2024visragvisionbasedretrievalaugmentedgeneration,
      ...
}
```

# 📧 Contact

- Shi Yu: [email protected]
- Chaoyue Tang: [email protected]