---
license: mit
datasets:
- openbmb/VisRAG-Ret-Train-Synthetic-data
- openbmb/VisRAG-Ret-Train-In-domain-data
- tsystems/vqa_de_en_batch1
- vidore/colpali_train_set
- llamaindex/vdr-multilingual-train
- Metric-AI/tabfquad_train_set
language:
- en
- fr
- es
- it
- de
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- multimodal_embedding
- multilingual_embedding
- Text-to-Visual Document (T→VD) retrieval
library_name: peft
pipeline_tag: visual-document-retrieval
---
# ColQwen2.5-3b-multilingual-v1.0: Multilingual Visual Retriever based on Qwen2.5-VL-3B-Instruct with ColBERT strategy

### This is the base version, trained on 8xH100 80GB with per_device_batch_size=128 for 8 epochs.

ColQwen is a model built on a novel architecture and training strategy that uses Vision Language Models (VLMs) to efficiently index documents from their visual features.
It is a [Qwen2.5-VL-3B](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) extension that generates [ColBERT](https://arxiv.org/abs/2004.12832)-style multi-vector representations of text and images.
It was introduced in the paper [ColPali: Efficient Document Retrieval with Vision Language Models](https://arxiv.org/abs/2407.01449) and first released in [this repository](https://github.com/ManuelFay/colpali).

<p align="center"><img width=800 src="https://github.com/illuin-tech/colpali/blob/main/assets/colpali_architecture.webp?raw=true"/></p>

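The core of the ColBERT strategy is late-interaction scoring: every query-token embedding is compared with every image-patch embedding, the best match per query token is kept, and the maxima are summed. The snippet below is a minimal sketch of that principle in plain PyTorch; the tensor shapes and the 128-dimensional projection size are illustrative assumptions, and the library's actual implementation is exposed as `processor.score_multi_vector` in the usage example further down.

```python
import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim: for each query token, keep its best-matching
    document patch, then sum over query tokens.

    query_emb: (num_query_tokens, dim) multi-vector query embedding
    doc_emb:   (num_doc_patches, dim) multi-vector page embedding
    """
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_patches) similarities
    return sim.max(dim=1).values.sum()   # MaxSim over patches, summed over the query

# Toy example with random embeddings (128-dim projection assumed)
q = torch.randn(14, 128)    # e.g. 14 query tokens
d = torch.randn(768, 128)   # e.g. 768 image patches for one page
print(late_interaction_score(q, d))
```
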
## Version specificity
This model takes dynamic image resolutions as input and does not resize them, so their aspect ratio is not distorted as it is in ColPali.
The maximal resolution is set so that at most 768 image patches are created. Experiments show clear improvements with a larger number of image patches, at the cost of higher memory requirements.

This version is trained with `colpali-engine==0.3.9`.

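As a rough illustration of the 768-patch budget: assuming each visual token covers a 28×28-pixel area (Qwen2.5-VL's 14-pixel patches with 2×2 spatial merging, an assumption about the backbone rather than something configured here), an image is only downscaled when its patch grid would exceed the budget.

```python
import math

# Back-of-the-envelope patch count, assuming 28x28 pixels per visual token
# (Qwen2.5-VL: 14-pixel patches merged 2x2) -- an illustrative assumption.
def num_patches(width: int, height: int, pixels_per_token: int = 28) -> int:
    return math.ceil(width / pixels_per_token) * math.ceil(height / pixels_per_token)

print(num_patches(700, 800))    # 25 * 29 = 725 patches -> kept at native resolution
print(num_patches(1240, 1754))  # 45 * 63 = 2835 patches -> downscaled to stay within 768
```
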
## Data
- **German & English**: Taken from the `tsystems/vqa_de_en_batch1` dataset.
- **Multilingual dataset**: Taken from `llamaindex/vdr-multilingual-train`.
- **Synthetic data**: Taken from the `openbmb/VisRAG-Ret-Train-Synthetic-data` dataset.
- **In-domain VQA dataset**: Taken from the `openbmb/VisRAG-Ret-Train-In-domain-data` dataset.
- **ColPali dataset**: Taken from `vidore/colpali_train_set`.

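The training sets listed above are standard Hugging Face datasets; a minimal way to inspect one of them (access terms may apply) is sketched below.

```python
from datasets import load_dataset

# Inspect one of the training sets listed above
ds = load_dataset("tsystems/vqa_de_en_batch1")
print(ds)  # available splits, column names, and row counts
```
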
## Model Training

### Parameters
We train models using low-rank adapters ([LoRA](https://arxiv.org/abs/2106.09685))
with `alpha=128` and `r=128` on the transformer layers of the language model,
as well as on the final randomly initialized projection layer, and we use a `paged_adamw_8bit` optimizer.
We train on an 8xH100 GPU setup with distributed data parallelism (via `accelerate`), a learning rate of 2e-4 with linear decay and 1% warmup steps, and a per-device batch size of 128 in `bfloat16` format.

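A `peft` configuration matching the hyperparameters above might look as follows. This is a sketch, not a reproduction of this run: the `target_modules` list is an illustrative guess, and the authoritative values live in the colpali-engine training configuration.

```python
from peft import LoraConfig

# LoRA hyperparameters as described above; target_modules is an assumption for illustration.
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="FEATURE_EXTRACTION",
)
```
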
## Installation
```bash
pip install git+https://github.com/illuin-tech/colpali
pip install transformers==4.49.0
pip install flash-attn --no-build-isolation
```
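With `flash-attn` installed, the model can optionally be loaded with Flash Attention 2 via the standard `transformers` loading option; this is a variant of the usage snippet below, not a requirement.

```python
import torch
from colpali_engine.models import ColQwen2_5

# Optional: enable Flash Attention 2 at load time (uses the flash-attn install above)
model = ColQwen2_5.from_pretrained(
    "tsystems/colqwen2.5-3b-multilingual-v1.0",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
).eval()
```
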
## Usage

```python
import torch
from PIL import Image

from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

model = ColQwen2_5.from_pretrained(
    "tsystems/colqwen2.5-3b-multilingual-v1.0",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
).eval()
processor = ColQwen2_5_Processor.from_pretrained("tsystems/colqwen2.5-3b-multilingual-v1.0")

# Your inputs
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```
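The resulting `scores` tensor has one row per query and one column per image; for retrieval, the highest-scoring page per query is the match. A small follow-up, continuing the snippet above:

```python
# (continued from the snippet above)
# scores has shape (num_queries, num_images); a higher late-interaction score means a better match
best_page_per_query = scores.argmax(dim=1)
print(scores)
print(best_page_per_query)  # index of the best-matching image for each query
```
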
## Limitations

- **Focus**: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
- **Support**: The model relies on multi-vector retrieval derived from the ColBERT late interaction mechanism, which may require engineering efforts to adapt to widely used vector retrieval frameworks that lack native multi-vector support.

## License

ColQwen2.5's vision language backbone model (Qwen2.5-VL) is under the Apache 2.0 license. The adapters attached to the model are under the MIT license.

## Citation

If you use models from this organization in your research, please cite the original paper as follows:

```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```

- **Developed by:** [T-Systems International](https://www.t-systems.com/de/en)