</a>
</div>

<p align="center">
<a href="#-introduction">📖 Introduction</a> •
<a href="#-news">📌 News</a> •
<a href="#-visrag-pipeline">✨ VisRAG Pipeline</a> •
<a href="#-training">⚡️ Training</a>
</p>
<p align="center">
<a href="#-requirements">📦 Requirements</a> •
<a href="#-usage">🔧 Usage</a> •
<a href="#-license">📄 License</a> •
<a href="#-citation">📑 Citation</a> •
<a href="#-contact">📧 Contact</a>
</p>

# 📖 Introduction
**VisRAG** is a novel vision-language model (VLM)-based RAG pipeline. Instead of first parsing the document to obtain text, VisRAG embeds the document directly as an image using a VLM and then retrieves it to enhance the generation of a VLM. Compared with traditional text-based RAG, **VisRAG** maximizes the retention and utilization of the information in the original documents, eliminating the information loss introduced by the parsing process.
<p align="center"><img width=800 src="https://github.com/openbmb/VisRAG/blob/master/assets/main_figure.png?raw=true"/></p>
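
To make the two stages concrete, here is a minimal, illustrative sketch of the retrieve-then-generate flow in the figure above. The `embed_page`, `embed_query`, and `generate_answer` helpers are stand-ins for VisRAG-Ret and a generator VLM, not the repository's actual API; see the Usage section below for the real retrieval snippet.

```python
import numpy as np

# Stand-in encoders: in VisRAG these are VisRAG-Ret forward passes over a page
# image or a query string; here they return deterministic random unit vectors.
def embed_page(page_image_path: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(page_image_path)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

def embed_query(query: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(query)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

# Stand-in generator: in VisRAG this is an off-the-shelf VLM that reads the
# retrieved page images together with the query.
def generate_answer(query: str, retrieved_pages: list) -> str:
    return f"answer to {query!r} grounded in {retrieved_pages}"

def visrag(query: str, page_paths: list, top_k: int = 3) -> str:
    doc_embs = np.stack([embed_page(p) for p in page_paths])  # pages embedded directly as images
    q_emb = embed_query(query)
    scores = doc_embs @ q_emb                                 # dot product of unit vectors = cosine similarity
    top_pages = [page_paths[i] for i in np.argsort(-scores)[:top_k]]
    return generate_answer(query, top_pages)                  # generation conditioned on retrieved pages

print(visrag("What was the 2023 revenue?", [f"page_{i}.png" for i in range(10)]))
```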

# 📌 News

* 20241015: Released our training and test data in the [VisRAG](https://huggingface.co/collections/openbmb/visrag-6717bbfb471bb018a49f1c69) Collection on Hugging Face, which is referenced at the beginning of this page.
* 20241014: Released our [Paper](https://arxiv.org/abs/2410.10594) on arXiv. Released our [Model](https://huggingface.co/openbmb/VisRAG-Ret) on Hugging Face.

# ✨ VisRAG Pipeline

## VisRAG-Ret
**VisRAG-Ret** is a document embedding model built on [MiniCPM-V 2.0](https://huggingface.co/openbmb/MiniCPM-V-2), a vision-language model that integrates [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) as the vision encoder and [MiniCPM-2B](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) as the language model.
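
As a hedged sketch, loading the retriever with 🤗 Transformers might look like the following; the `trust_remote_code` and `torch_dtype` settings are our assumptions (MiniCPM-V-based checkpoints ship custom modeling code), and the full encoding example is in the Usage section below.

```python
from transformers import AutoModel, AutoTokenizer
import torch

model_id = "openbmb/VisRAG-Ret"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumed dtype; adjust to your hardware
    trust_remote_code=True,
).eval()
model.to("cuda" if torch.cuda.is_available() else "cpu")  # optional GPU placement
```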

## VisRAG-Gen
In the paper, we use MiniCPM-V 2.0, MiniCPM-V 2.6, and GPT-4o as the generators. In practice, you can use any VLM you like!
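
For example, here is a hedged sketch of querying GPT-4o (one of the generators used in the paper) through the OpenAI Python SDK with a retrieved page image; the prompt wording and file name are illustrative, not the paper's exact setup.

```python
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def generate_from_page(query: str, page_image_path: str) -> str:
    # Encode the retrieved page image so it can be sent inline.
    with open(page_image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Answer the question based on the page image.\nQuestion: {query}"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example: generate_from_page("What was the 2023 revenue?", "retrieved_page.png")
```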

# ⚡️ Training

## VisRAG-Ret
Our training dataset of 362,110 query-document (Q-D) pairs for **VisRAG-Ret** comprises the train sets of openly available academic datasets (34%) and a synthetic dataset of pages from web-crawled PDF documents augmented with VLM-generated (GPT-4o) pseudo-queries (66%). It can be found in the `VisRAG` Collection on Hugging Face, which is referenced at the beginning of this page.
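
To inspect the data, a hedged sketch with 🤗 Datasets is shown below; the repository id is a placeholder, so substitute the concrete dataset name from the VisRAG collection.

```python
from datasets import load_dataset

# Placeholder id: pick the concrete dataset repository from the VisRAG collection
# (https://huggingface.co/collections/openbmb/visrag-6717bbfb471bb018a49f1c69).
dataset_id = "openbmb/<visrag-train-dataset>"

train_data = load_dataset(dataset_id, split="train")  # split name assumed to be "train"
print(train_data)       # features and number of Q-D pairs
print(train_data[0])    # one query paired with its document page image
```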

## VisRAG-Gen
The generation part does not use any fine-tuning; we directly use off-the-shelf LLMs/VLMs for generation.

# 📦 Requirements
```
torch==2.1.2
torchvision==0.16.2
decord==0.6.0
Pillow==10.1.0
```
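
As a quick sanity check that the pinned packages above are importable in your environment:

```python
import torch
import torchvision
import decord
import PIL

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("decord:", decord.__version__)
print("Pillow:", PIL.__version__)
```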

# 🔧 Usage

## VisRAG-Ret
```python
from transformers import AutoModel, AutoTokenizer
import torch

# ... (model loading and query/document encoding)

scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())
```
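
As a follow-up to the snippet above, here is a small, hedged helper for turning the score matrix into a per-query ranking; it assumes `scores` has shape `(num_queries, num_docs)` and that `documents` holds the candidate page images in the same order they were encoded (these variable names are ours).

```python
import torch

def rank_documents(scores: torch.Tensor, documents: list, top_k: int = 3):
    """Return, for each query, the top_k (document, score) pairs sorted by score."""
    k = min(top_k, scores.shape[1])
    values, indices = torch.topk(scores, k=k, dim=1)
    return [
        [(documents[j], values[i, n].item()) for n, j in enumerate(indices[i].tolist())]
        for i in range(scores.shape[0])
    ]

# Example with the score matrix computed above:
# rankings = rank_documents(scores, documents=doc_images, top_k=3)
```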

# 📄 License

* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of **VisRAG-Ret** model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of **VisRAG-Ret** are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, **VisRAG-Ret** weights are also available for free commercial use.

# 📑 Citation

```
@misc{yu2024visragvisionbasedretrievalaugmentedgeneration,
      ...
}
```

# 📧 Contact

- Shi Yu: [email protected]
- Chaoyue Tang: [email protected]