---
library_name: transformers
tags:
- colpali
license: apache-2.0
datasets:
- vidore/colpali_train_set
language:
- en
base_model:
- vidore/colqwen2-base
pipeline_tag: visual-document-retrieval
---

> [!WARNING]
> EXPERIMENTAL: Wait for https://github.com/huggingface/transformers/pull/35778 to be merged before using!

> [!IMPORTANT]
> This version of ColQwen2 should be loaded with the `transformers 🤗` release, not with `colpali-engine`.
> It was converted using the `convert_colqwen2_weights_to_hf.py` script
> from the [`vidore/colqwen2-v1.0-merged`](https://huggingface.co/vidore/colqwen2-v1.0-merged) checkpoint.

# ColQwen2: Visual Retriever based on Qwen2-VL-2B-Instruct with ColBERT strategy

ColQwen2 is built on a novel architecture and training strategy that uses Vision Language Models (VLMs) to efficiently index documents from their visual features.
It is a [Qwen2-VL-2B](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) extension that generates [ColBERT](https://arxiv.org/abs/2004.12832)-style multi-vector representations of text and images.
It was introduced in the paper [ColPali: Efficient Document Retrieval with Vision Language Models](https://arxiv.org/abs/2407.01449) and first released in [this repository](https://github.com/ManuelFay/colpali).

<p align="center"><img width=800 src="https://github.com/illuin-tech/colpali/blob/main/assets/colpali_architecture.webp?raw=true"/></p>

The HuggingFace `transformers` 🤗 implementation was contributed by Tony Wu ([@tonywu71](https://huggingface.co/tonywu71)) and Yoni Gozlan ([@yonigozlan](https://huggingface.co/yonigozlan)).

## Model Description

Read the `transformers` 🤗 model card: https://huggingface.co/docs/transformers/en/model_doc/colqwen2.

## Model Training

### Dataset
Our training dataset of 127,460 query-page pairs comprises the train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%).
Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in [*ViDoRe*](https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d) and in the train set to prevent evaluation contamination.
A validation set is created with 2% of the samples to tune hyperparameters.
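
For reference, the train set listed in the metadata above can be inspected directly; a minimal sketch, assuming the `datasets` library is installed and that the split exposes `query` and `image` columns (the column names are an assumption here, not documented in this card):

```python
from datasets import load_dataset

# Load the query-page training pairs referenced above.
train_set = load_dataset("vidore/colpali_train_set", split="train")

print(train_set)               # number of rows and column names
print(train_set[0]["query"])   # pseudo-question paired with the first page image (column name assumed)
```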

## Usage

```python
import torch
from PIL import Image

from transformers import ColQwen2ForRetrieval, ColQwen2Processor
from transformers.utils.import_utils import is_flash_attn_2_available


model_name = "vidore/colqwen2-v1.0-hf"

model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()

processor = ColQwen2Processor.from_pretrained(model_name)

# Your inputs (replace dummy images with screenshots of your documents)
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
queries = [
    "What is the organizational structure for our R&D department?",
    "Can you provide a breakdown of last year’s financial performance?",
]

# Process the inputs
batch_images = processor(images=images).to(model.device)
batch_queries = processor(text=queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images).embeddings
    query_embeddings = model(**batch_queries).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)

```
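
`score_retrieval` gives one late-interaction score per (query, image) pair, with higher meaning a better match. A possible way to turn these scores into a ranking, assuming `scores` is a 2-D PyTorch tensor with queries along the first dimension:

```python
# Rank the document images for each query (continuing from the snippet above).
best_image_per_query = scores.argmax(dim=1)       # index of the top-scoring image for each query
ranked = scores.argsort(dim=1, descending=True)   # full ranking of images per query

for i, query in enumerate(queries):
    print(f"{query!r} -> image {best_image_per_query[i].item()} (score {scores[i].max().item():.2f})")
```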

## Limitations

 - **Focus**: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
 - **Support**: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering efforts to adapt to widely used vector retrieval frameworks that lack native multi-vector support (see the sketch below).
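
For context on the last point, the operation a retrieval backend needs to reproduce is the ColBERT-style MaxSim late interaction between token-level embeddings. A minimal sketch of that scoring (a plain restatement of the formula, not the `transformers` implementation):

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction score between one query and one document page.

    query_emb: (num_query_tokens, dim) multi-vector embedding of the query
    doc_emb:   (num_doc_tokens, dim)   multi-vector embedding of the page
    """
    sim = query_emb @ doc_emb.T            # token-to-token similarity matrix
    return sim.max(dim=1).values.sum()     # best document token per query token, summed over query tokens
```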

## License

ColQwen2's vision-language backbone model (Qwen2-VL) is released under the `apache-2.0` license. ColQwen2 inherits this `apache-2.0` license.

## Contact

- Manuel Faysse: [email protected]
- Hugues Sibille: [email protected]
- Tony Wu: [email protected]

## Citation

If you use any datasets or models from this organization in your research, please cite the original work as follows:

```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models}, 
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449}, 
}
```