---
license: apache-2.0
language:
- en
tags:
- ColBERT
- passage-retrieval
- knowledge-distillation
pretty_name: Independent Implementation of ColBERTv2.0+ Models - modern_colbert_base_en_v1.
new_version: prithivida/modern_colbert_base_en_v1
---

<center>
<img src="./dost_logo.png" alt="DonkeyStereotype" width="350px">
<p> Trained by <a href="https://donkeystereotype.com">Donkey Stereotype</a></p>
</center>

<br><br>

# Independent Implementation of ColBERTv2.0+ Models

> <div style="background-color: #dbeafe; padding: 15px; border-radius: 8px; border-left: 4px solid #1e40af;">
> <strong style="color: #1d4ed8;">Background:</strong>
> <span style="color: #374151;">As part of this project, we will be releasing a set of models across weight classes: 1.) Models that worked well, 2.) Experimental models, including failed attempts. This work stands on the shoulders of all previous robust research on ColBERT and variants.</span>
> </div>
>
> <div style="background-color: #dbeafe; padding: 15px; border-radius: 8px; margin-top: 10px; border-left: 4px solid #2563eb;">
> <strong style="color: #1d4ed8;">What does this independent implementation entail?</strong>
> <ul style="color: #374151; margin: 10px 0;">
> <li>This is a humble effort to <span style="color: #dc2626; font-weight: 600;"> independently implement LightOn AI's GTE-ModernColBERT </span>.</li>
> <li> <span style="color: #dc2626; font-weight: 600;"> Without using existing ColBERT libraries </span> (or codebases) like PyLate or Stanford's recipe.</li>
> <li> <span style="color: #dc2626; font-weight: 600;"> Without any funding, grand GPU budgets, </span> or formal research background.</li>
> </ul>
> </div>


As of this writing (2nd July 2025):

1. <a href="https://huggingface.co/lightonai/GTE-ModernColBERT-v1">LightOn AI's GTE-ModernColBERT is the best ColBERT</a> in the world and can be considered SOTA. <br/>
2. **Today we are humbled and thrilled to announce that prithivida/modern_colbert_base_en_v1 is the 2nd best ColBERT in the world.** Borrowing Antoine's words: <br/>
> This is the 2nd model to outperform ColBERT-small on BEIR. While it is also bigger, it is still a very lightweight model and benefits from the efficiency of ModernBERT!

<br/>

# Comparison with Top ColBERTv2.0+ Models

| Dataset / Model | GTE-ModernColBERT<br/>(LightOn AI) | modern_colbert_base_en_v1<br/>(Ours) | ColBERT-small<br/>(Answer AI, reproduced by LightOn) | ColBERT-small<br/>(Answer AI, reported) |
|:-----------------|:-----------------:|:-----------------:|:------------------------:|:------------------------:|
| **Outfit type** | AI Lab with PhDs | Indie Researcher, <br/> No PhD, No GPUs :-) | AI Lab with PhDs | AI Lab with PhDs |
| **BEIR Average** | **54.89** (🥇) | **54.51** (🥈) | 53.35 | 53.79 |
| **FiQA2018** | **48.51** | 43.96 | 41.01 | 41.15 |
| **NFCorpus** | **37.93** | 37.23 | 36.86 | 37.3 |
| **TREC-COVID** | 83.59 | 83.4 | 83.14 | **84.59** |
| **Touche2020** | **31.23** | 29.32 | 24.95 | 25.69 |
| **ArguAna** | 48.51 | **52.05** | 46.76 | 50.09 |
| **QuoraRetrieval** | 86.61 | 87.54 | **87.89** | 87.72 |
| **SCIDOCS** | 19.06 | **19.42** | 18.72 | 18.42 |
| **SciFact** | 76.34 | **76.44** | 74.02 | 74.77 |
| **NQ** | **61.8** | 61.68 | 59.42 | 59.1 |
| **ClimateFEVER** | 30.62 | 28.29 | 32.83 | **33.07** |
| **HotpotQA** | **77.32** | 76.667 | 76.88 | 76.11 |
| **DBPedia** | **48.03** | 46.31 | 46.36 | 45.58 |
| **CQADupstack** | 41 | **42.2** | 39.36 | 38.75 |
| **FEVER** | 87.44 | 88.106 | 88.66 | **90.96** |
| **MSMARCO** | **45.32** | 44.993 | 43.44 | 43.5 |



Turns out a very modest GPU budget and a humble background are enough to independently implement the ColBERTs that are in circulation today.
*Detailed scores will be added soon.*

<br/>

# Comparison with legacy ColBERT models

Both the GTE-ModernColBERT and ColBERT-small model cards carry this comparison against older ColBERT models; please refer to them.

-----

# Running inference

There are already really strong storage and retrieval abstractions: vector DBs like Qdrant, Weaviate, and Vespa that support multi-vectors, and strong ColBERT training libraries like PyLate. So we feel it is best to work with their authors and integrate.
For now we offer only code to load the model, run inference, and do some lightweight in-memory ranking (no heavy lifting like storing and retrieving using FAISS indexes).

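Both code paths below score query-document pairs with ColBERT-style late interaction (MaxSim): each query token embedding is matched to its best document token embedding, and the per-token maxima are summed. A minimal sketch with random tensors (shapes are illustrative only):

```python
import torch

q_reps = torch.randn(2, 32, 128)   # [num_queries, query_tokens, dim]
p_reps = torch.randn(5, 300, 128)  # [num_passages, doc_tokens, dim]

# token_scores[q, i, p, j] = dot product of query q's token i with passage p's token j
token_scores = torch.einsum("qin,pjn->qipj", q_reps, p_reps)

# Max over document tokens (MaxSim), then sum over query tokens
scores = token_scores.max(-1).values.sum(1)
print(scores.shape)  # torch.Size([2, 5]) -> one score per query-passage pair
```
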
<details>
<summary><b>Click here for inference code using Transformers</b></summary>

> [!TIP]
> Run the second snippet (the class and helper definitions) first; the snippet right below only shows usage.

```python
model_path = "prithivida/modern_colbert_base_en_v1"

try:
    colbert = ColBERT.load_for_inference(model_path, max_query_len=32, max_doc_len=300)

    # Test data
    queries = [
        "How does deep learning work?",
        "What is machine learning?",
        "What are neural networks?"
    ]

    documents = [
        "Machine learning is the idea of approximating a real world phenomenon using data, the approximation can be mathematical or otherwise.",
        "Deep learning uses neural networks with multiple layers to process data.",
        "Neural networks are computing systems inspired by biological neural networks.",
        "Artificial intelligence encompasses machine learning and deep learning.",
        "Here is how you train dogs",
    ]

    # Test single query ranking
    print("\n=== Single Query Ranking ===")
    query = "How does deep learning work?"
    results = colbert.rank_documents(query, documents, top_k=3)

    print(f"Query: {query}")
    for i, (doc_idx, score, doc_text) in enumerate(results):
        print(f"  {i+1}. Score: {score:.4f} | Doc: {doc_text}")

except Exception as e:
    print(f"Error during testing: {e}")
```

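The same class can also score several queries against several documents in one call via `search`; a short follow-up (assuming `colbert`, `queries`, and `documents` from the snippet above):

```python
# Score all queries against all documents in one call
scores, q_reps, p_reps = colbert.search(queries, documents, return_scores=True)
print(scores.shape)  # [num_queries, num_documents]

# Best document per query
for qi, q in enumerate(queries):
    best = scores[qi].argmax().item()
    print(f"{q} -> {documents[best]} ({scores[qi, best]:.4f})")
```
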
```python
import torch
from torch import nn
from transformers import PreTrainedModel, AutoConfig, AutoModel, AutoTokenizer
from transformers.modeling_outputs import BaseModelOutput
from tqdm import tqdm
from typing import List, Optional
import string
import os


class TaggingHead(nn.Module):
    def __init__(self, input_size, num_labels):
        super().__init__()
        self.classifier = nn.Linear(input_size, num_labels, bias=False)
        nn.init.xavier_uniform_(self.classifier.weight)

    def forward(self, x):
        return self.classifier(x)


class ColBERT(PreTrainedModel):
    config_class = AutoConfig
    base_model_prefix = "backbone"

    def __init__(self, config):
        super().__init__(config)
        self.backbone = AutoModel.from_config(config)
        hidden_dim = config.hidden_size
        self.heads = nn.ModuleDict({
            "col_pooling": TaggingHead(hidden_dim, num_labels=128)
        })

        # Inference settings (will be set when loading for inference)
        self.tokenizer = None
        self.max_query_len = 256
        self.max_doc_len = 300
        self.Q_PID = None
        self.D_PID = None

    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()

    def forward(self, input_ids, attention_mask=None, position_ids=None, return_dict=False, **kwargs):
        kwargs.pop("token_type_ids", None)

        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            return_dict=True,
            **kwargs
        )

        reps = outputs.last_hidden_state
        reps = torch.nn.functional.normalize(reps, p=2, dim=2)
        reps *= attention_mask[:, :, None].float()
        logits = self.heads["col_pooling"](reps)

        if return_dict:
            return BaseModelOutput(last_hidden_state=logits)
        return logits

    @classmethod
    def load_for_inference(cls, model_name_or_path: str, max_query_len: int = 256,
                           max_doc_len: int = 300, device: str = None):
        """
        Load ColBERT model with tokenizer for inference

        Args:
            model_name_or_path: HuggingFace model path or local directory
            max_query_len: Maximum query length
            max_doc_len: Maximum document length
            device: Device to run inference on (auto-detect if None)
        """
        device = device or ("cuda" if torch.cuda.is_available() else "cpu")

        try:
            # Load model and tokenizer (from a local directory or the HuggingFace Hub)
            source = "local directory" if os.path.exists(model_name_or_path) else "HuggingFace Hub"
            print(f"Loading model from {source}: {model_name_or_path}")
            config = AutoConfig.from_pretrained(model_name_or_path)
            model = cls.from_pretrained(model_name_or_path, config=config)
            tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

            # Setup inference configuration
            model.tokenizer = tokenizer
            model.max_query_len = max_query_len
            model.max_doc_len = max_doc_len
            model.Q_PID = tokenizer.convert_tokens_to_ids("[unused0]")
            model.D_PID = tokenizer.convert_tokens_to_ids("[unused1]")
            # Setup post-tokenization punctuation masking
            model.skip_ids = {tokenizer.encode(c, add_special_tokens=False)[0]
                              for c in string.punctuation}

            model.to(device)
            model.eval()

            print(f"ColBERT model loaded on {device}")
            print(f"Query max length: {max_query_len}, Document max length: {max_doc_len}")

            return model

        except Exception as e:
            print(f"Error loading model: {e}")
            raise

    def _encode_batch(self, ids: torch.Tensor, mask: torch.Tensor, to_cpu: bool = False):
        """Internal encoding function"""
        if self.tokenizer is None:
            raise RuntimeError("Model not loaded for inference. Use ColBERT.load_for_inference()")

        ids, mask = ids.to(self.device), mask.to(self.device)
        pos = torch.arange(ids.size(1), device=self.device).unsqueeze(0).expand_as(ids)

        with torch.no_grad():
            rep = self(input_ids=ids, attention_mask=mask, position_ids=pos)

        return rep.cpu() if to_cpu else rep

    def encode_queries(self, queries: List[str], batch_size: Optional[int] = None, to_cpu: bool = False):
        """
        Encode queries for ColBERT retrieval

        Args:
            queries: List of query strings
            batch_size: Batch size for processing (None for single batch)
            to_cpu: Whether to move results to CPU

        Returns:
            Query representations tensor
        """
        if self.tokenizer is None:
            raise RuntimeError("Model not loaded for inference. Use ColBERT.load_for_inference()")

        print(f"Encoding {len(queries)} queries...")

        # Tokenize with query prefix
        enc = self.tokenizer(queries, add_special_tokens=True, truncation=False)
        id_lists = [[self.Q_PID] + ids for ids in enc["input_ids"]]

        # Apply dynamic augmentation with length cap
        cap = self.max_query_len or (self.tokenizer.model_max_length - 1)
        id_lists = [_dynamic_augment(ids, self.tokenizer.mask_token_id, cap) for ids in id_lists]

        # Pad sequences
        padded = self.tokenizer.pad({"input_ids": id_lists}, padding=True, return_tensors="pt")
        ids, mask = padded["input_ids"], padded["attention_mask"]

        # Process in batches if specified
        if batch_size:
            reps = []
            for i, a in tqdm(_split_into_batches(ids, mask, batch_size), desc="Encoding query batches"):
                reps.append(self._encode_batch(i, a, to_cpu))
            return torch.cat(reps)

        return self._encode_batch(ids, mask, to_cpu)

    def encode_documents(self, documents: List[str], batch_size: Optional[int] = None,
                         keep_dims: bool = True, to_cpu: bool = False):
        """
        Encode documents for ColBERT retrieval with post-tokenization punctuation masking

        Args:
            documents: List of document strings
            batch_size: Batch size for processing (None for single batch)
            keep_dims: Whether to keep tensor dimensions (True) or return list of variable-length tensors
            to_cpu: Whether to move results to CPU

        Returns:
            Document representations tensor or list
        """
        if self.tokenizer is None:
            raise RuntimeError("Model not loaded for inference. Use ColBERT.load_for_inference()")

        print(f"Encoding {len(documents)} documents...")

        # Tokenize documents WITHOUT removing punctuation (post-tokenization masking)
        enc = self.tokenizer(documents, add_special_tokens=True,
                             truncation=True, max_length=self.max_doc_len - 1)
        id_lists = [[self.D_PID] + ids for ids in enc["input_ids"]]

        # Pad sequences
        padded = self.tokenizer.pad({"input_ids": id_lists}, padding=True, return_tensors="pt")
        ids, mask = padded["input_ids"], padded["attention_mask"]

        # Apply post-tokenization punctuation masking
        mask[torch.isin(ids, torch.tensor(list(self.skip_ids), device=ids.device))] = 0

        # Process in batches if specified
        if batch_size:
            ids_s, mask_s, rev = _sort_by_length(ids, mask, batch_size)
            reps = []

            for i, a in tqdm(_split_into_batches(ids_s, mask_s, batch_size), desc="Encoding document batches"):
                rep = self._encode_batch(i, a, to_cpu)
                if not keep_dims:
                    # Convert to list of variable-length tensors
                    m = a.cpu().bool() if to_cpu else a.bool()
                    rep = [r[m[idx]] for idx, r in enumerate(rep)]
                reps.append(rep)

            if keep_dims:
                return _stack_3D_tensors(reps)[rev]
            else:
                # Flatten and reorder
                flat = [d for g in reps for d in g]
                return [flat[i] for i in rev.tolist()]

        # Single batch processing
        rep = self._encode_batch(ids, mask, to_cpu)
        if not keep_dims:
            m = mask.cpu().bool() if to_cpu else mask.bool()
            rep = [r[m[idx]] for idx, r in enumerate(rep)]

        return rep

    @staticmethod
    def compute_similarity(q_reps: torch.Tensor, p_reps: torch.Tensor):
        """
        Compute ColBERT-style max similarity between queries and passages

        Args:
            q_reps: Query representations [num_queries, max_q_len, dim]
            p_reps: Passage representations [num_passages, max_p_len, dim]

        Returns:
            Similarity scores [num_queries, num_passages]
        """
        token_scores = torch.einsum("qin,pjn->qipj", q_reps, p_reps)
        scores, _ = token_scores.max(-1)
        scores = scores.sum(1)
        return scores

    def search(self, queries: List[str], documents: List[str],
               batch_size: Optional[int] = None, return_scores: bool = True):
        """
        End-to-end search: encode queries and documents, compute similarities

        Args:
            queries: List of query strings
            documents: List of document strings
            batch_size: Batch size for encoding
            return_scores: Whether to return similarity scores

        Returns:
            If return_scores=True: (scores, query_reps, doc_reps)
            If return_scores=False: (query_reps, doc_reps)
        """
        # Encode queries and documents
        q_reps = self.encode_queries(queries, batch_size=batch_size, to_cpu=True)
        p_reps = self.encode_documents(documents, batch_size=batch_size, to_cpu=True)

        if return_scores:
            # Compute similarities
            print("Computing similarities...")
            scores = self.compute_similarity(q_reps, p_reps)
            return scores, q_reps, p_reps

        return q_reps, p_reps

    def rank_documents(self, query: str, documents: List[str], top_k: int = 10):
        """
        Rank documents for a single query

        Args:
            query: Query string
            documents: List of document strings
            top_k: Number of top results to return

        Returns:
            List of (document_index, score, document_text) tuples
        """
        scores, _, _ = self.search([query], documents, return_scores=True)
        scores = scores.squeeze(0)  # Remove query dimension

        # Get top-k results
        top_indices = torch.topk(scores, min(top_k, len(documents))).indices

        results = []
        for idx in top_indices:
            results.append((idx.item(), scores[idx].item(), documents[idx.item()]))

        return results


# ---------------------------------------------------------------------------
# Helper Functions
# ---------------------------------------------------------------------------

def _split_into_batches(ids: torch.Tensor, mask: torch.Tensor, bsize: int):
    return [(ids[i:i + bsize], mask[i:i + bsize])
            for i in range(0, ids.size(0), bsize)]


def _sort_by_length(ids: torch.Tensor, mask: torch.Tensor, bsize: int):
    if ids.size(0) <= bsize:
        return ids, mask, torch.arange(ids.size(0))

    lengths = mask.sum(-1)
    order = lengths.sort().indices
    reverse = order.sort().indices
    return ids[order], mask[order], reverse


def _dynamic_augment(ids: List[int], mask_id: int, max_cap: Optional[int] = None) -> List[int]:
    if max_cap is not None and len(ids) > max_cap:
        return ids[:max_cap]

    q_len = len(ids)
    target = max(32, ((q_len + 31) // 32) * 32)
    if target - q_len < 8:
        target = q_len + 8
    if max_cap is not None:
        target = min(target, max_cap)
    return ids + [mask_id] * (target - q_len)


def _stack_3D_tensors(groups):
    bsize = sum(x.size(0) for x in groups)
    maxlen = max(x.size(1) for x in groups)
    hdim = groups[0].size(2)
    out = torch.zeros(bsize, maxlen, hdim, device=groups[0].device, dtype=groups[0].dtype)
    ptr = 0
    for g in groups:
        out[ptr:ptr + g.size(0), :g.size(1)] = g
        ptr += g.size(0)
    return out
```
</details>

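One detail worth calling out from the helpers above: `_dynamic_augment` implements ColBERT's query augmentation by appending `[MASK]` token ids until the query reaches a rounded-up target length. A few worked cases (using `mask_id=0` purely for illustration):

```python
# 10 tokens -> target = max(32, ceil(10/32)*32) = 32 -> 22 mask ids appended
assert len(_dynamic_augment(list(range(10)), mask_id=0)) == 32

# 31 tokens -> rounded target 32 leaves fewer than 8 mask slots -> target = 31 + 8 = 39
assert len(_dynamic_augment(list(range(31)), mask_id=0)) == 39

# 40 tokens with max_cap=32 -> truncated to the cap, no augmentation
assert len(_dynamic_augment(list(range(40)), mask_id=0, max_cap=32)) == 32
```
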
<details>
<summary><b>Click here for inference code using ONNX</b></summary>

> [!TIP]
> Run the second snippet (the ONNXColBERT class and helper definitions) first; the snippet right below only shows usage.

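Note that `ort.InferenceSession` and the standalone tokenizer both read local files, so the repo files need to be on disk first. One way to fetch them is sketched below (the `snapshot_download` call and resulting local paths are our suggestion, adapt as needed):

```python
from huggingface_hub import snapshot_download

# Download the full repo (tokenizer files + onnx/model.onnx) to a local directory
local_dir = snapshot_download("prithivida/modern_colbert_base_en_v1")
onnx_model_path = f"{local_dir}/onnx/model.onnx"
model_path = local_dir  # used as tokenizer_path below
```
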
```python
model_path = "prithivida/modern_colbert_base_en_v1"
onnx_model_path = "prithivida/modern_colbert_base_en_v1/onnx/model.onnx"

# Load ONNX model for inference using the standalone tokenizer path
onnx_colbert = ONNXColBERT(onnx_model_path, model_path, max_query_len=32, max_doc_len=300)  # Pass model_path as tokenizer_path

# Test inference
queries = [
    "How does deep learning work?",
    "What is machine learning?",
    "What are neural networks?"
]

documents = [
    "Machine learning is the idea of approximating a real world phenomenon using data, the approximation can be mathematical or otherwise.",
    "Deep learning uses neural networks with multiple layers to process data.",
    "Neural networks are computing systems inspired by biological neural networks.",
    "Artificial intelligence encompasses machine learning and deep learning.",
    "Here is how you train dogs",
]

# Test single query ranking
print("\n=== ONNX Standalone Single Query Ranking ===")
query = "How does deep learning work?"
results = onnx_colbert.rank_documents(query, documents, top_k=3)

print(f"Query: {query}")
for i, (doc_idx, score, doc_text) in enumerate(results):
    print(f"  {i+1}. Score: {score:.4f} | Doc: {doc_text}")
```

```python
import numpy as np
import onnxruntime as ort
from tokenizers import AddedToken, Tokenizer
import json
import string
from pathlib import Path
from typing import List, Optional, Tuple, Union
from tqdm import tqdm


# ---------------------------------------------------------------------------
# ONNX ColBERT Class
# ---------------------------------------------------------------------------

class ONNXColBERT:
    def __init__(self, onnx_model_path: str, tokenizer_path: str,
                 max_query_len: int = 256, max_doc_len: int = 300,
                 providers: Optional[List[str]] = None):
        """
        ONNX ColBERT - identical to PyTorch ColBERT.load_for_inference()

        Args:
            onnx_model_path: Path to the ONNX model file
            tokenizer_path: Path to the tokenizer directory
            max_query_len: Maximum query length
            max_doc_len: Maximum document length
            providers: ONNX Runtime providers
        """
        # Load standalone tokenizer
        self.model_dir = Path(tokenizer_path)
        self.tokenizer = self._get_tokenizer(max_length=512)
        self.max_query_len = max_query_len
        self.max_doc_len = max_doc_len

        # Setup inference configuration
        self.Q_PID = self.tokenizer.token_to_id("[unused0]")
        self.D_PID = self.tokenizer.token_to_id("[unused1]")
        self.mask_token_id = self.tokenizer.token_to_id("[MASK]")

        if None in [self.Q_PID, self.D_PID, self.mask_token_id]:
            raise ValueError("Could not find required special tokens in tokenizer")

        # Setup post-tokenization punctuation masking
        self.skip_ids = set()
        for c in string.punctuation:
            encoded = self.tokenizer.encode(c, add_special_tokens=False)
            if len(encoded.ids) > 0:
                self.skip_ids.add(encoded.ids[0])

        print(f"Identified {len(self.skip_ids)} punctuation token IDs to skip")

        # Initialize ONNX Runtime session
        if providers is None:
            providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']

        self.session = ort.InferenceSession(onnx_model_path, providers=providers)
        print(f"✅ ONNX ColBERT loaded with providers: {self.session.get_providers()}")
        print(f"Query max length: {max_query_len}, Document max length: {max_doc_len}")

    def _get_tokenizer(self, max_length: int = 512) -> Tokenizer:
        """Initialize tokenizer from local files"""
        with open(str(self.model_dir / "config.json")) as config_file:
            config = json.load(config_file)
        with open(str(self.model_dir / "tokenizer_config.json")) as tokenizer_config_file:
            tokenizer_config = json.load(tokenizer_config_file)
        with open(str(self.model_dir / "special_tokens_map.json")) as tokens_map_file:
            tokens_map = json.load(tokens_map_file)

        tokenizer = Tokenizer.from_file(str(self.model_dir / "tokenizer.json"))
        tokenizer.enable_truncation(max_length=min(tokenizer_config["model_max_length"], max_length))
        tokenizer.enable_padding(pad_id=config["pad_token_id"], pad_token=tokenizer_config["pad_token"])

        for token in tokens_map.values():
            if isinstance(token, str):
                tokenizer.add_special_tokens([token])
            elif isinstance(token, dict):
                tokenizer.add_special_tokens([AddedToken(**token)])

        return tokenizer

    def _encode_batch(self, ids: np.ndarray, mask: np.ndarray, to_cpu: bool = False) -> np.ndarray:
        """Internal encoding function"""
        # Create position IDs
        pos = np.arange(ids.shape[1])[None, :].repeat(ids.shape[0], axis=0)

        # ONNX inference
        inputs = {
            "input_ids": ids.astype(np.int64),
            "attention_mask": mask.astype(np.int64),
            "position_ids": pos.astype(np.int64)
        }

        outputs = self.session.run(["last_hidden_state"], inputs)
        return outputs[0]

    def encode_queries(self, queries: List[str], batch_size: Optional[int] = None,
                       to_cpu: bool = False) -> np.ndarray:
        """Encode queries - IDENTICAL to PyTorch ColBERT.encode_queries()"""
        print(f"Encoding {len(queries)} queries...")

        # Tokenize with query prefix
        encoded_queries = self.tokenizer.encode_batch(queries, add_special_tokens=True)
        id_lists = [[self.Q_PID] + encoded.ids for encoded in encoded_queries]

        # Apply dynamic augmentation with length cap
        cap = self.max_query_len or 511
        id_lists = [_dynamic_augment(ids, self.mask_token_id, cap) for ids in id_lists]

        # Manual padding
        max_len = max(len(ids) for ids in id_lists)
        batch_size_actual = len(id_lists)

        ids = np.zeros((batch_size_actual, max_len), dtype=np.int64)
        mask = np.zeros((batch_size_actual, max_len), dtype=np.int64)

        for i, id_list in enumerate(id_lists):
            ids[i, :len(id_list)] = id_list
            mask[i, :len(id_list)] = 1

        # Process in batches if specified
        if batch_size:
            reps = []
            for i, a in tqdm(_split_into_batches(ids, mask, batch_size), desc="Encoding query batches"):
                reps.append(self._encode_batch(i, a, to_cpu))
            return np.concatenate(reps, axis=0)

        return self._encode_batch(ids, mask, to_cpu)

    def encode_documents(self, documents: List[str], batch_size: Optional[int] = None,
                         keep_dims: bool = True, to_cpu: bool = False) -> Union[np.ndarray, List[np.ndarray]]:
        """Encode documents - IDENTICAL to PyTorch ColBERT.encode_documents()"""
        print(f"Encoding {len(documents)} documents...")

        # Encode documents individually to preserve natural lengths
        encoded_docs = []
        for doc in documents:
            encoded = self.tokenizer.encode(doc, add_special_tokens=True)
            encoded_docs.append(encoded)

        id_lists = []
        for encoded in encoded_docs:
            ids = encoded.ids
            # Truncate to max_doc_len - 1
            if len(ids) > self.max_doc_len - 1:
                ids = ids[:self.max_doc_len - 1]
            # Add D_PID prefix
            ids = [self.D_PID] + ids
            id_lists.append(ids)

        # Manual padding
        max_len = max(len(ids) for ids in id_lists)
        batch_size_actual = len(id_lists)

        ids = np.zeros((batch_size_actual, max_len), dtype=np.int64)
        mask = np.zeros((batch_size_actual, max_len), dtype=np.int64)

        for i, id_list in enumerate(id_lists):
            ids[i, :len(id_list)] = id_list
            mask[i, :len(id_list)] = 1

        # Apply post-tokenization punctuation masking
        for skip_id in self.skip_ids:
            mask[ids == skip_id] = 0

        # Process in batches if specified
        if batch_size:
            ids_s, mask_s, rev = _sort_by_length(ids, mask, batch_size)
            reps = []

            for i, a in tqdm(_split_into_batches(ids_s, mask_s, batch_size), desc="Encoding document batches"):
                rep = self._encode_batch(i, a, to_cpu)
                if not keep_dims:
                    m = a.astype(bool)
                    rep = [r[m[idx]] for idx, r in enumerate(rep)]
                reps.append(rep)

            if keep_dims:
                return _stack_3D_arrays(reps)[rev]
            else:
                flat = [d for g in reps for d in g]
                return [flat[i] for i in rev.tolist()]

        # Single batch processing
        rep = self._encode_batch(ids, mask, to_cpu)
        if not keep_dims:
            m = mask.astype(bool)
            rep = [r[m[idx]] for idx, r in enumerate(rep)]

        return rep

    @staticmethod
    def compute_similarity(q_reps: np.ndarray, p_reps: np.ndarray) -> np.ndarray:
        """Compute ColBERT similarity - IDENTICAL to PyTorch version"""
        # Identical to PyTorch: torch.einsum("qin,pjn->qipj", q_reps, p_reps)
        token_scores = np.einsum("qin,pjn->qipj", q_reps, p_reps)

        # Identical to PyTorch: scores, _ = token_scores.max(-1)
        scores = np.max(token_scores, axis=-1)

        # Identical to PyTorch: scores = scores.sum(1)
        scores = np.sum(scores, axis=1)

        return scores

    def search(self, queries: List[str], documents: List[str],
               batch_size: Optional[int] = None, return_scores: bool = True):
        """End-to-end search - IDENTICAL to PyTorch ColBERT.search()"""
        # Encode queries and documents
        q_reps = self.encode_queries(queries, batch_size=batch_size, to_cpu=True)
        p_reps = self.encode_documents(documents, batch_size=batch_size, to_cpu=True)

        if return_scores:
            # Compute similarities
            print("Computing similarities...")
            scores = self.compute_similarity(q_reps, p_reps)
            return scores, q_reps, p_reps

        return q_reps, p_reps

    def rank_documents(self, query: str, documents: List[str], top_k: int = 10) -> List[Tuple]:
        """Rank documents - IDENTICAL to PyTorch ColBERT.rank_documents()"""
        scores, _, _ = self.search([query], documents, return_scores=True)
        scores = scores.squeeze(0)

        # Get top-k results
        top_indices = np.argsort(scores)[::-1][:min(top_k, len(documents))]

        results = []
        for idx in top_indices:
            results.append((int(idx), float(scores[idx]), documents[idx]))

        return results


# ---------------------------------------------------------------------------
# Helper Functions (NumPy versions)
# ---------------------------------------------------------------------------

def _split_into_batches(ids: np.ndarray, mask: np.ndarray, bsize: int):
    return [(ids[i:i + bsize], mask[i:i + bsize])
            for i in range(0, ids.shape[0], bsize)]


def _sort_by_length(ids: np.ndarray, mask: np.ndarray, bsize: int):
    if ids.shape[0] <= bsize:
        return ids, mask, np.arange(ids.shape[0])

    lengths = mask.sum(-1)
    order = np.argsort(lengths)
    reverse = np.argsort(order)
    return ids[order], mask[order], reverse


def _dynamic_augment(ids: List[int], mask_id: int, max_cap: Optional[int] = None) -> List[int]:
    if max_cap is not None and len(ids) > max_cap:
        return ids[:max_cap]

    q_len = len(ids)
    target = max(32, ((q_len + 31) // 32) * 32)
    if target - q_len < 8:
        target = q_len + 8
    if max_cap is not None:
        target = min(target, max_cap)
    return ids + [mask_id] * (target - q_len)


def _stack_3D_arrays(groups):
    bsize = sum(x.shape[0] for x in groups)
    maxlen = max(x.shape[1] for x in groups)
    hdim = groups[0].shape[2]
    out = np.zeros((bsize, maxlen, hdim), dtype=groups[0].dtype)
    ptr = 0
    for g in groups:
        out[ptr:ptr + g.shape[0], :g.shape[1]] = g
        ptr += g.shape[0]
    return out
```
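
Since the ONNX path is meant to be numerically identical to the PyTorch path, a quick sanity check is to compare rankings from both backends (assuming `colbert` and `onnx_colbert` from the earlier snippets are loaded with the same length settings):

```python
query = "How does deep learning work?"
pt_results = colbert.rank_documents(query, documents, top_k=3)
onnx_results = onnx_colbert.rank_documents(query, documents, top_k=3)

# Same ranking order and near-identical scores are expected
for (pt_idx, pt_score, _), (ox_idx, ox_score, _) in zip(pt_results, onnx_results):
    assert pt_idx == ox_idx, "Ranking mismatch between PyTorch and ONNX"
    print(f"doc {pt_idx}: torch={pt_score:.4f} onnx={ox_score:.4f}")
```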

</details>

<br/>

_____

# Notes on reproducing

We welcome anyone to reproduce our results. Here are some tips and observations:

- Please pay attention to the query length. We tried our best to look at what the original ColBERTv2.0 used and what LightOn AI used, and we also spoke to Nils Reimers about the liberty taken in the choice of query lengths.
- Note on query length from the ColBERTv2.0 paper:
  > Unless otherwise stated, we keep the default query maximum sequence length for ColBERTv2 and RocketQAv2, which is 32 tokens. For the ArguAna test in BEIR, as the queries are themselves long documents, we set the maximum query length used by ColBERTv2 and RocketQAv2 to 300. For Climate-FEVER, as the queries are relatively long sentence claims, we set the maximum query length used by ColBERTv2 to 64.
- Query lengths used by LightOn AI's PyLate (assuming the OSS code is what they used; see the sketch after this list for how these caps can be applied):
  ```python
  query_len = {
      "quora": 32,
      "climate-fever": 64,
      "nq": 32,
      "msmarco": 32,
      "hotpotqa": 32,
      "nfcorpus": 32,
      "scifact": 48,
      "trec-covid": 48,
      "fiqa": 32,
      "arguana": 64,
      "scidocs": 48,
      "dbpedia-entity": 32,
      "webis-touche2020": 32,
      "fever": 32,
      "cqadupstack/android": 32,
      "cqadupstack/english": 32,
      "cqadupstack/gaming": 32,
      "cqadupstack/gis": 32,
      "cqadupstack/mathematica": 32,
      "cqadupstack/physics": 32,
      "cqadupstack/programmers": 32,
      "cqadupstack/stats": 32,
      "cqadupstack/tex": 32,
      "cqadupstack/unix": 32,
      "cqadupstack/webmasters": 32,
      "cqadupstack/wordpress": 32,
  }
  ```
- This is what OG Nils had to say when I asked why query length is given so much liberty:
  > Comparison is always hard... I think query length doesn't skew too much. Retrieval compute scales linearly with the number of query tokens. So if people are comfortable comparing models with largely different parameter counts, comparing different query token lengths would be fine as well.
- Nota bene: There *may be* minor differences in the numbers when reproducing; for instance, BGE-M3 reports an nDCG@10 of 59.3 for MIRACL Hindi and we observed only 58.9. But these are not massive differences like those between the reported and reproduced ColBERT-small numbers on some datasets.

Here are our numbers for the full Hindi run on BGE-M3:

```python
{'NDCG@1': 0.49714, 'NDCG@3': 0.5115, 'NDCG@5': 0.53908, 'NDCG@10': 0.58936, 'NDCG@100': 0.6457, 'NDCG@1000': 0.65336}
{'MAP@1': 0.28845, 'MAP@3': 0.42424, 'MAP@5': 0.46455, 'MAP@10': 0.49955, 'MAP@100': 0.51886, 'MAP@1000': 0.51933}
{'Recall@10': 0.73032, 'Recall@50': 0.8987, 'Recall@100': 0.93974, 'Recall@200': 0.95763, 'Recall@500': 0.97813, 'Recall@1000': 0.9902}
{'P@1': 0.49714, 'P@3': 0.33048, 'P@5': 0.24629, 'P@10': 0.15543, 'P@100': 0.0202, 'P@1000': 0.00212}
{'MRR@10': 0.60893, 'MRR@100': 0.615, 'MRR@1000': 0.6151}
```

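For reference, metric dictionaries in this shape are what the `beir` package's evaluator emits; a minimal sketch of the call, assuming `beir` is installed and `results` holds your retrieval run (the tiny qrels/run dicts here are illustrative only):

```python
from beir.retrieval.evaluation import EvaluateRetrieval

# qrels: {query_id: {doc_id: relevance}}; results: {query_id: {doc_id: score}}
qrels = {"q1": {"d1": 1}}
results = {"q1": {"d1": 12.3, "d2": 7.1}}

ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results, [1, 3, 5, 10, 100, 1000])
mrr = EvaluateRetrieval.evaluate_custom(qrels, results, [10, 100, 1000], metric="mrr")
print(ndcg, mrr)
```
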
- We made sure all quirks and known BEIR ColBERT issues are taken care of:
  - [ArguAna and Quora (?) self-match issues](https://github.com/beir-cellar/beir/issues/67)
  - TBA

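As mentioned in the query-length notes above, here is a sketch of how those per-dataset caps could be plugged into the inference class from this card (the evaluation step is a placeholder, not a real API):

```python
model_path = "prithivida/modern_colbert_base_en_v1"

for dataset, qlen in query_len.items():
    # Reload with the dataset-specific query cap; document length stays at 300
    colbert = ColBERT.load_for_inference(model_path, max_query_len=qlen, max_doc_len=300)
    # ... run BEIR retrieval + evaluation for `dataset` here ...
```
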
# Acknowledgements

- Thanks to Nils Reimers for the tips and inputs.
- Thanks to Nandan Thakur for answering questions.
- Thanks to Antoine Chaffin and the LightOn team for PyLate.
- We thank Prithivi Da for his generous funding for this work :-)