---
license: apache-2.0
language:
- en
tags:
- ColBERT
- passage-retrieval
- knowledge-distillation
pretty_name: Independent Implementation of ColBERTv2.0+ Models - modern_colbert_base_en_v1.
new_version: prithivida/modern_colbert_base_en_v1
---

<center>
<img src="./dost_logo.png" alt="DonkeyStereotype" width="350px">
<p> Trained by <a href="https://donkeystereotype.com">Donkey Stereotype</a></p>
</center>

<br><br>

# Independent Implementation of ColBERTv2.0+ Models

> <div style="background-color: #dbeafe; padding: 15px; border-radius: 8px; border-left: 4px solid #1e40af;">
> <strong style="color: #1d4ed8;">Background:</strong>
> <span style="color: #374151;">As part of this project, we will be releasing a set of models across weight classes: 1.) Models that worked well, 2.) Experimental models, including failed attempts. This work stands on the shoulders of all previous robust research on ColBERT and variants.</span>
> </div>
>
> <div style="background-color: #dbeafe; padding: 15px; border-radius: 8px; margin-top: 10px; border-left: 4px solid #2563eb;">
> <strong style="color: #1d4ed8;">What does this independent implementation entail?</strong>
> <ul style="color: #374151; margin: 10px 0;">
> <li>This is a humble effort to <span style="color: #dc2626; font-weight: 600;"> independently implement LightOn AI's GTE-ModernColBERT </span>.</li>
> <li> <span style="color: #dc2626; font-weight: 600;"> Without using existing ColBERT libraries </span> (or codebases) like PyLate or Stanford's recipe.</li>
> <li> <span style="color: #dc2626; font-weight: 600;"> Without any funding, grand GPU budgets, </span> or formal research background.</li>
> </ul>
> </div>


As of this writing (2nd July 2025):

1. <a href="https://huggingface.co/lightonai/GTE-ModernColBERT-v1">LightOn AI's GTE-ModernColBERT is the best ColBERT</a> in the world and can be considered SOTA. <br/>
2. **Today we are humbled and thrilled to announce that prithivida/modern_colbert_base_en_v1 is the 2nd best ColBERT in the world.** Borrowing Antoine's words: <br/>
> This is the 2nd model to outperform ColBERT-small on BEIR. While it is also bigger, it is still a very lightweight model and benefits from the efficiency of ModernBERT!

<br/>

# Comparison with Top ColBERTv2.0+ Models

| Dataset / Model | GTE-ModernColBERT<br/>(LightOn AI) | modern_colbert_base_en_v1<br/>(Ours) | ColBERT-small<br/>(Answer AI, reproduced by LightOn) | ColBERT-small<br/>(Answer AI, reported) |
|:-----------------|:-----------------:|:-----------------:|:------------------------:|:------------------------:|
| **Outfit type** | AI Lab with PhDs | Indie Researcher, <br/> No PhD, No GPUs :-) | AI Lab with PhDs | AI Lab with PhDs |
| **BEIR Average** | **54.89** (🥇) | **54.51** (🥈) | 53.35 | 53.79 |
| **FiQA2018** | **48.51** | 43.96 | 41.01 | 41.15 |
| **NFCorpus** | **37.93** | 37.23 | 36.86 | 37.3 |
| **TREC-COVID** | 83.59 | 83.4 | 83.14 | **84.59** |
| **Touche2020** | **31.23** | 29.32 | 24.95 | 25.69 |
| **ArguAna** | 48.51 | **52.05** | 46.76 | 50.09 |
| **QuoraRetrieval** | 86.61 | 87.54 | **87.89** | 87.72 |
| **SCIDOCS** | 19.06 | **19.42** | 18.72 | 18.42 |
| **SciFact** | 76.34 | **76.44** | 74.02 | 74.77 |
| **NQ** | **61.8** | 61.68 | 59.42 | 59.1 |
| **ClimateFEVER** | 30.62 | 28.29 | 32.83 | **33.07** |
| **HotpotQA** | **77.32** | 76.667 | 76.88 | 76.11 |
| **DBPedia** | **48.03** | 46.31 | 46.36 | 45.58 |
| **CQADupstack** | 41 | **42.2** | 39.36 | 38.75 |
| **FEVER** | 87.44 | 88.106 | 88.66 | **90.96** |
| **MSMARCO** | **45.32** | 44.993 | 43.44 | 43.5 |



Turns out a very modest GPU budget and a humble background are enough to independently implement the ColBERTs that are in circulation today.
*Detailed scores will be added soon.*

<br/>

# Comparison with legacy ColBERT models

Both the GTE-ModernColBERT and ColBERT-small model cards carry this comparison against older ColBERT models; please refer to them.

-----

# Running inference

There are already really strong storage and retrieval abstractions: vector DBs like Qdrant, Weaviate, and Vespa that support multi-vectors, and strong ColBERT training libraries like PyLate. So we feel it is best to work with their authors and integrate.
For now we offer only code to load the model, run inference, and do some lightweight in-memory ranking (no heavy lifting like storing and retrieving using FAISS indexes).

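Both code paths below score query-document pairs with ColBERT-style late interaction (MaxSim): each query token embedding is matched to its best document token embedding, and the per-token maxima are summed. A minimal sketch with random tensors (shapes are illustrative only):

```python
import torch

q_reps = torch.randn(2, 32, 128)   # [num_queries, query_tokens, dim]
p_reps = torch.randn(5, 300, 128)  # [num_passages, doc_tokens, dim]

# token_scores[q, i, p, j] = dot product of query q's token i with passage p's token j
token_scores = torch.einsum("qin,pjn->qipj", q_reps, p_reps)

# Max over document tokens (MaxSim), then sum over query tokens
scores = token_scores.max(-1).values.sum(1)
print(scores.shape)  # torch.Size([2, 5]) -> one score per query-passage pair
```
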
<details>
<summary><b>Click here for inference code using Transformers</b></summary>

> [!TIP]
> Run the second snippet (the class and helper definitions) first; the snippet right below only shows usage.

```python
model_path = "prithivida/modern_colbert_base_en_v1"

try:
    colbert = ColBERT.load_for_inference(model_path, max_query_len=32, max_doc_len=300)

    # Test data
    queries = [
        "How does deep learning work?",
        "What is machine learning?",
        "What are neural networks?"
    ]

    documents = [
        "Machine learning is the idea of approximating a real world phenomenon using data, the approximation can be mathematical or otherwise.",
        "Deep learning uses neural networks with multiple layers to process data.",
        "Neural networks are computing systems inspired by biological neural networks.",
        "Artificial intelligence encompasses machine learning and deep learning.",
        "Here is how you train dogs",
    ]

    # Test single query ranking
    print("\n=== Single Query Ranking ===")
    query = "How does deep learning work?"
    results = colbert.rank_documents(query, documents, top_k=3)

    print(f"Query: {query}")
    for i, (doc_idx, score, doc_text) in enumerate(results):
        print(f"  {i+1}. Score: {score:.4f} | Doc: {doc_text}")

except Exception as e:
    print(f"Error during testing: {e}")
```

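The same class can also score several queries against several documents in one call via `search`; a short follow-up (assuming `colbert`, `queries`, and `documents` from the snippet above):

```python
# Score all queries against all documents in one call
scores, q_reps, p_reps = colbert.search(queries, documents, return_scores=True)
print(scores.shape)  # [num_queries, num_documents]

# Best document per query
for qi, q in enumerate(queries):
    best = scores[qi].argmax().item()
    print(f"{q} -> {documents[best]} ({scores[qi, best]:.4f})")
```
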
```python
import torch
from torch import nn
from transformers import PreTrainedModel, AutoConfig, AutoModel, AutoTokenizer
from transformers.modeling_outputs import BaseModelOutput
from tqdm import tqdm
from typing import List, Optional
import string
import os


class TaggingHead(nn.Module):
    def __init__(self, input_size, num_labels):
        super().__init__()
        self.classifier = nn.Linear(input_size, num_labels, bias=False)
        nn.init.xavier_uniform_(self.classifier.weight)

    def forward(self, x):
        return self.classifier(x)


class ColBERT(PreTrainedModel):
    config_class = AutoConfig
    base_model_prefix = "backbone"

    def __init__(self, config):
        super().__init__(config)
        self.backbone = AutoModel.from_config(config)
        hidden_dim = config.hidden_size
        self.heads = nn.ModuleDict({
            "col_pooling": TaggingHead(hidden_dim, num_labels=128)
        })

        # Inference settings (will be set when loading for inference)
        self.tokenizer = None
        self.max_query_len = 256
        self.max_doc_len = 300
        self.Q_PID = None
        self.D_PID = None

    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()

    def forward(self, input_ids, attention_mask=None, position_ids=None, return_dict=False, **kwargs):
        kwargs.pop("token_type_ids", None)

        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            return_dict=True,
            **kwargs
        )

        reps = outputs.last_hidden_state
        reps = torch.nn.functional.normalize(reps, p=2, dim=2)
        reps *= attention_mask[:, :, None].float()
        logits = self.heads["col_pooling"](reps)

        if return_dict:
            return BaseModelOutput(last_hidden_state=logits)
        return logits

    @classmethod
    def load_for_inference(cls, model_name_or_path: str, max_query_len: int = 256,
                           max_doc_len: int = 300, device: str = None):
        """
        Load ColBERT model with tokenizer for inference

        Args:
            model_name_or_path: HuggingFace model path or local directory
            max_query_len: Maximum query length
            max_doc_len: Maximum document length
            device: Device to run inference on (auto-detect if None)
        """
        device = device or ("cuda" if torch.cuda.is_available() else "cpu")

        try:
            # Load model and tokenizer (from a local directory or the HuggingFace Hub)
            source = "local directory" if os.path.exists(model_name_or_path) else "HuggingFace Hub"
            print(f"Loading model from {source}: {model_name_or_path}")
            config = AutoConfig.from_pretrained(model_name_or_path)
            model = cls.from_pretrained(model_name_or_path, config=config)
            tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

            # Setup inference configuration
            model.tokenizer = tokenizer
            model.max_query_len = max_query_len
            model.max_doc_len = max_doc_len
            model.Q_PID = tokenizer.convert_tokens_to_ids("[unused0]")
            model.D_PID = tokenizer.convert_tokens_to_ids("[unused1]")
            # Setup post-tokenization punctuation masking
            model.skip_ids = {tokenizer.encode(c, add_special_tokens=False)[0]
                              for c in string.punctuation}

            model.to(device)
            model.eval()

            print(f"ColBERT model loaded on {device}")
            print(f"Query max length: {max_query_len}, Document max length: {max_doc_len}")

            return model

        except Exception as e:
            print(f"Error loading model: {e}")
            raise

    def _encode_batch(self, ids: torch.Tensor, mask: torch.Tensor, to_cpu: bool = False):
        """Internal encoding function"""
        if self.tokenizer is None:
            raise RuntimeError("Model not loaded for inference. Use ColBERT.load_for_inference()")

        ids, mask = ids.to(self.device), mask.to(self.device)
        pos = torch.arange(ids.size(1), device=self.device).unsqueeze(0).expand_as(ids)

        with torch.no_grad():
            rep = self(input_ids=ids, attention_mask=mask, position_ids=pos)

        return rep.cpu() if to_cpu else rep

    def encode_queries(self, queries: List[str], batch_size: Optional[int] = None, to_cpu: bool = False):
        """
        Encode queries for ColBERT retrieval

        Args:
            queries: List of query strings
            batch_size: Batch size for processing (None for single batch)
            to_cpu: Whether to move results to CPU

        Returns:
            Query representations tensor
        """
        if self.tokenizer is None:
            raise RuntimeError("Model not loaded for inference. Use ColBERT.load_for_inference()")

        print(f"Encoding {len(queries)} queries...")

        # Tokenize with query prefix
        enc = self.tokenizer(queries, add_special_tokens=True, truncation=False)
        id_lists = [[self.Q_PID] + ids for ids in enc["input_ids"]]

        # Apply dynamic augmentation with length cap
        cap = self.max_query_len or (self.tokenizer.model_max_length - 1)
        id_lists = [_dynamic_augment(ids, self.tokenizer.mask_token_id, cap) for ids in id_lists]

        # Pad sequences
        padded = self.tokenizer.pad({"input_ids": id_lists}, padding=True, return_tensors="pt")
        ids, mask = padded["input_ids"], padded["attention_mask"]

        # Process in batches if specified
        if batch_size:
            reps = []
            for i, a in tqdm(_split_into_batches(ids, mask, batch_size), desc="Encoding query batches"):
                reps.append(self._encode_batch(i, a, to_cpu))
            return torch.cat(reps)

        return self._encode_batch(ids, mask, to_cpu)

    def encode_documents(self, documents: List[str], batch_size: Optional[int] = None,
                         keep_dims: bool = True, to_cpu: bool = False):
        """
        Encode documents for ColBERT retrieval with post-tokenization punctuation masking

        Args:
            documents: List of document strings
            batch_size: Batch size for processing (None for single batch)
            keep_dims: Whether to keep tensor dimensions (True) or return list of variable-length tensors
            to_cpu: Whether to move results to CPU

        Returns:
            Document representations tensor or list
        """
        if self.tokenizer is None:
            raise RuntimeError("Model not loaded for inference. Use ColBERT.load_for_inference()")

        print(f"Encoding {len(documents)} documents...")

        # Tokenize documents WITHOUT removing punctuation (post-tokenization masking)
        enc = self.tokenizer(documents, add_special_tokens=True,
                             truncation=True, max_length=self.max_doc_len - 1)
        id_lists = [[self.D_PID] + ids for ids in enc["input_ids"]]

        # Pad sequences
        padded = self.tokenizer.pad({"input_ids": id_lists}, padding=True, return_tensors="pt")
        ids, mask = padded["input_ids"], padded["attention_mask"]

        # Apply post-tokenization punctuation masking
        mask[torch.isin(ids, torch.tensor(list(self.skip_ids), device=ids.device))] = 0

        # Process in batches if specified
        if batch_size:
            ids_s, mask_s, rev = _sort_by_length(ids, mask, batch_size)
            reps = []

            for i, a in tqdm(_split_into_batches(ids_s, mask_s, batch_size), desc="Encoding document batches"):
                rep = self._encode_batch(i, a, to_cpu)
                if not keep_dims:
                    # Convert to list of variable-length tensors
                    m = a.cpu().bool() if to_cpu else a.bool()
                    rep = [r[m[idx]] for idx, r in enumerate(rep)]
                reps.append(rep)

            if keep_dims:
                return _stack_3D_tensors(reps)[rev]
            else:
                # Flatten and reorder
                flat = [d for g in reps for d in g]
                return [flat[i] for i in rev.tolist()]

        # Single batch processing
        rep = self._encode_batch(ids, mask, to_cpu)
        if not keep_dims:
            m = mask.cpu().bool() if to_cpu else mask.bool()
            rep = [r[m[idx]] for idx, r in enumerate(rep)]

        return rep

    @staticmethod
    def compute_similarity(q_reps: torch.Tensor, p_reps: torch.Tensor):
        """
        Compute ColBERT-style max similarity between queries and passages

        Args:
            q_reps: Query representations [num_queries, max_q_len, dim]
            p_reps: Passage representations [num_passages, max_p_len, dim]

        Returns:
            Similarity scores [num_queries, num_passages]
        """
        token_scores = torch.einsum("qin,pjn->qipj", q_reps, p_reps)
        scores, _ = token_scores.max(-1)
        scores = scores.sum(1)
        return scores

    def search(self, queries: List[str], documents: List[str],
               batch_size: Optional[int] = None, return_scores: bool = True):
        """
        End-to-end search: encode queries and documents, compute similarities

        Args:
            queries: List of query strings
            documents: List of document strings
            batch_size: Batch size for encoding
            return_scores: Whether to return similarity scores

        Returns:
            If return_scores=True: (scores, query_reps, doc_reps)
            If return_scores=False: (query_reps, doc_reps)
        """
        # Encode queries and documents
        q_reps = self.encode_queries(queries, batch_size=batch_size, to_cpu=True)
        p_reps = self.encode_documents(documents, batch_size=batch_size, to_cpu=True)

        if return_scores:
            # Compute similarities
            print("Computing similarities...")
            scores = self.compute_similarity(q_reps, p_reps)
            return scores, q_reps, p_reps

        return q_reps, p_reps

    def rank_documents(self, query: str, documents: List[str], top_k: int = 10):
        """
        Rank documents for a single query

        Args:
            query: Query string
            documents: List of document strings
            top_k: Number of top results to return

        Returns:
            List of (document_index, score, document_text) tuples
        """
        scores, _, _ = self.search([query], documents, return_scores=True)
        scores = scores.squeeze(0)  # Remove query dimension

        # Get top-k results
        top_indices = torch.topk(scores, min(top_k, len(documents))).indices

        results = []
        for idx in top_indices:
            results.append((idx.item(), scores[idx].item(), documents[idx.item()]))

        return results


# ---------------------------------------------------------------------------
# Helper Functions
# ---------------------------------------------------------------------------

def _split_into_batches(ids: torch.Tensor, mask: torch.Tensor, bsize: int):
    return [(ids[i:i + bsize], mask[i:i + bsize])
            for i in range(0, ids.size(0), bsize)]


def _sort_by_length(ids: torch.Tensor, mask: torch.Tensor, bsize: int):
    if ids.size(0) <= bsize:
        return ids, mask, torch.arange(ids.size(0))

    lengths = mask.sum(-1)
    order = lengths.sort().indices
    reverse = order.sort().indices
    return ids[order], mask[order], reverse


def _dynamic_augment(ids: List[int], mask_id: int, max_cap: Optional[int] = None) -> List[int]:
    if max_cap is not None and len(ids) > max_cap:
        return ids[:max_cap]

    q_len = len(ids)
    target = max(32, ((q_len + 31) // 32) * 32)
    if target - q_len < 8:
        target = q_len + 8
    if max_cap is not None:
        target = min(target, max_cap)
    return ids + [mask_id] * (target - q_len)


def _stack_3D_tensors(groups):
    bsize = sum(x.size(0) for x in groups)
    maxlen = max(x.size(1) for x in groups)
    hdim = groups[0].size(2)
    out = torch.zeros(bsize, maxlen, hdim, device=groups[0].device, dtype=groups[0].dtype)
    ptr = 0
    for g in groups:
        out[ptr:ptr + g.size(0), :g.size(1)] = g
        ptr += g.size(0)
    return out
```
</details>

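One detail worth calling out from the helpers above: `_dynamic_augment` implements ColBERT's query augmentation by appending `[MASK]` token ids until the query reaches a rounded-up target length. A few worked cases (using `mask_id=0` purely for illustration):

```python
# 10 tokens -> target = max(32, ceil(10/32)*32) = 32 -> 22 mask ids appended
assert len(_dynamic_augment(list(range(10)), mask_id=0)) == 32

# 31 tokens -> rounded target 32 leaves fewer than 8 mask slots -> target = 31 + 8 = 39
assert len(_dynamic_augment(list(range(31)), mask_id=0)) == 39

# 40 tokens with max_cap=32 -> truncated to the cap, no augmentation
assert len(_dynamic_augment(list(range(40)), mask_id=0, max_cap=32)) == 32
```
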
<details>
<summary><b>Click here for inference code using ONNX</b></summary>

> [!TIP]
> Run the second snippet (the ONNXColBERT class and helper definitions) first; the snippet right below only shows usage.

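Note that `ort.InferenceSession` and the standalone tokenizer both read local files, so the repo files need to be on disk first. One way to fetch them is sketched below (the `snapshot_download` call and resulting local paths are our suggestion, adapt as needed):

```python
from huggingface_hub import snapshot_download

# Download the full repo (tokenizer files + onnx/model.onnx) to a local directory
local_dir = snapshot_download("prithivida/modern_colbert_base_en_v1")
onnx_model_path = f"{local_dir}/onnx/model.onnx"
model_path = local_dir  # used as tokenizer_path below
```
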
```python
model_path = "prithivida/modern_colbert_base_en_v1"
onnx_model_path = "prithivida/modern_colbert_base_en_v1/onnx/model.onnx"

# Load ONNX model for inference using the standalone tokenizer path
onnx_colbert = ONNXColBERT(onnx_model_path, model_path, max_query_len=32, max_doc_len=300)  # Pass model_path as tokenizer_path

# Test inference
queries = [
    "How does deep learning work?",
    "What is machine learning?",
    "What are neural networks?"
]

documents = [
    "Machine learning is the idea of approximating a real world phenomenon using data, the approximation can be mathematical or otherwise.",
    "Deep learning uses neural networks with multiple layers to process data.",
    "Neural networks are computing systems inspired by biological neural networks.",
    "Artificial intelligence encompasses machine learning and deep learning.",
    "Here is how you train dogs",
]

# Test single query ranking
print("\n=== ONNX Standalone Single Query Ranking ===")
query = "How does deep learning work?"
results = onnx_colbert.rank_documents(query, documents, top_k=3)

print(f"Query: {query}")
for i, (doc_idx, score, doc_text) in enumerate(results):
    print(f"  {i+1}. Score: {score:.4f} | Doc: {doc_text}")
```

```python
import numpy as np
import onnxruntime as ort
from tokenizers import AddedToken, Tokenizer
import json
import string
from pathlib import Path
from typing import List, Optional, Tuple, Union
from tqdm import tqdm


# ---------------------------------------------------------------------------
# ONNX ColBERT Class
# ---------------------------------------------------------------------------

class ONNXColBERT:
    def __init__(self, onnx_model_path: str, tokenizer_path: str,
                 max_query_len: int = 256, max_doc_len: int = 300,
                 providers: Optional[List[str]] = None):
        """
        ONNX ColBERT - identical to PyTorch ColBERT.load_for_inference()

        Args:
            onnx_model_path: Path to the ONNX model file
            tokenizer_path: Path to the tokenizer directory
            max_query_len: Maximum query length
            max_doc_len: Maximum document length
            providers: ONNX Runtime providers
        """
        # Load standalone tokenizer
        self.model_dir = Path(tokenizer_path)
        self.tokenizer = self._get_tokenizer(max_length=512)
        self.max_query_len = max_query_len
        self.max_doc_len = max_doc_len

        # Setup inference configuration
        self.Q_PID = self.tokenizer.token_to_id("[unused0]")
        self.D_PID = self.tokenizer.token_to_id("[unused1]")
        self.mask_token_id = self.tokenizer.token_to_id("[MASK]")

        if None in [self.Q_PID, self.D_PID, self.mask_token_id]:
            raise ValueError("Could not find required special tokens in tokenizer")

        # Setup post-tokenization punctuation masking
        self.skip_ids = set()
        for c in string.punctuation:
            encoded = self.tokenizer.encode(c, add_special_tokens=False)
            if len(encoded.ids) > 0:
                self.skip_ids.add(encoded.ids[0])

        print(f"Identified {len(self.skip_ids)} punctuation token IDs to skip")

        # Initialize ONNX Runtime session
        if providers is None:
            providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']

        self.session = ort.InferenceSession(onnx_model_path, providers=providers)
        print(f"✅ ONNX ColBERT loaded with providers: {self.session.get_providers()}")
        print(f"Query max length: {max_query_len}, Document max length: {max_doc_len}")

    def _get_tokenizer(self, max_length: int = 512) -> Tokenizer:
        """Initialize tokenizer from local files"""
        with open(str(self.model_dir / "config.json")) as config_file:
            config = json.load(config_file)
        with open(str(self.model_dir / "tokenizer_config.json")) as tokenizer_config_file:
            tokenizer_config = json.load(tokenizer_config_file)
        with open(str(self.model_dir / "special_tokens_map.json")) as tokens_map_file:
            tokens_map = json.load(tokens_map_file)

        tokenizer = Tokenizer.from_file(str(self.model_dir / "tokenizer.json"))
        tokenizer.enable_truncation(max_length=min(tokenizer_config["model_max_length"], max_length))
        tokenizer.enable_padding(pad_id=config["pad_token_id"], pad_token=tokenizer_config["pad_token"])

        for token in tokens_map.values():
            if isinstance(token, str):
                tokenizer.add_special_tokens([token])
            elif isinstance(token, dict):
                tokenizer.add_special_tokens([AddedToken(**token)])

        return tokenizer

    def _encode_batch(self, ids: np.ndarray, mask: np.ndarray, to_cpu: bool = False) -> np.ndarray:
        """Internal encoding function"""
        # Create position IDs
        pos = np.arange(ids.shape[1])[None, :].repeat(ids.shape[0], axis=0)

        # ONNX inference
        inputs = {
            "input_ids": ids.astype(np.int64),
            "attention_mask": mask.astype(np.int64),
            "position_ids": pos.astype(np.int64)
        }

        outputs = self.session.run(["last_hidden_state"], inputs)
        return outputs[0]

    def encode_queries(self, queries: List[str], batch_size: Optional[int] = None,
                       to_cpu: bool = False) -> np.ndarray:
        """Encode queries - IDENTICAL to PyTorch ColBERT.encode_queries()"""
        print(f"Encoding {len(queries)} queries...")

        # Tokenize with query prefix
        encoded_queries = self.tokenizer.encode_batch(queries, add_special_tokens=True)
        id_lists = [[self.Q_PID] + encoded.ids for encoded in encoded_queries]

        # Apply dynamic augmentation with length cap
        cap = self.max_query_len or 511
        id_lists = [_dynamic_augment(ids, self.mask_token_id, cap) for ids in id_lists]

        # Manual padding
        max_len = max(len(ids) for ids in id_lists)
        batch_size_actual = len(id_lists)

        ids = np.zeros((batch_size_actual, max_len), dtype=np.int64)
        mask = np.zeros((batch_size_actual, max_len), dtype=np.int64)

        for i, id_list in enumerate(id_lists):
            ids[i, :len(id_list)] = id_list
            mask[i, :len(id_list)] = 1

        # Process in batches if specified
        if batch_size:
            reps = []
            for i, a in tqdm(_split_into_batches(ids, mask, batch_size), desc="Encoding query batches"):
                reps.append(self._encode_batch(i, a, to_cpu))
            return np.concatenate(reps, axis=0)

        return self._encode_batch(ids, mask, to_cpu)

    def encode_documents(self, documents: List[str], batch_size: Optional[int] = None,
                         keep_dims: bool = True, to_cpu: bool = False) -> Union[np.ndarray, List[np.ndarray]]:
        """Encode documents - IDENTICAL to PyTorch ColBERT.encode_documents()"""
        print(f"Encoding {len(documents)} documents...")

        # Encode documents individually to preserve natural lengths
        encoded_docs = []
        for doc in documents:
            encoded = self.tokenizer.encode(doc, add_special_tokens=True)
            encoded_docs.append(encoded)

        id_lists = []
        for encoded in encoded_docs:
            ids = encoded.ids
            # Truncate to max_doc_len - 1
            if len(ids) > self.max_doc_len - 1:
                ids = ids[:self.max_doc_len - 1]
            # Add D_PID prefix
            ids = [self.D_PID] + ids
            id_lists.append(ids)

        # Manual padding
        max_len = max(len(ids) for ids in id_lists)
        batch_size_actual = len(id_lists)

        ids = np.zeros((batch_size_actual, max_len), dtype=np.int64)
        mask = np.zeros((batch_size_actual, max_len), dtype=np.int64)

        for i, id_list in enumerate(id_lists):
            ids[i, :len(id_list)] = id_list
            mask[i, :len(id_list)] = 1

        # Apply post-tokenization punctuation masking
        for skip_id in self.skip_ids:
            mask[ids == skip_id] = 0

        # Process in batches if specified
        if batch_size:
            ids_s, mask_s, rev = _sort_by_length(ids, mask, batch_size)
            reps = []

            for i, a in tqdm(_split_into_batches(ids_s, mask_s, batch_size), desc="Encoding document batches"):
                rep = self._encode_batch(i, a, to_cpu)
                if not keep_dims:
                    m = a.astype(bool)
                    rep = [r[m[idx]] for idx, r in enumerate(rep)]
                reps.append(rep)

            if keep_dims:
                return _stack_3D_arrays(reps)[rev]
            else:
                flat = [d for g in reps for d in g]
                return [flat[i] for i in rev.tolist()]

        # Single batch processing
        rep = self._encode_batch(ids, mask, to_cpu)
        if not keep_dims:
            m = mask.astype(bool)
            rep = [r[m[idx]] for idx, r in enumerate(rep)]

        return rep

    @staticmethod
    def compute_similarity(q_reps: np.ndarray, p_reps: np.ndarray) -> np.ndarray:
        """Compute ColBERT similarity - IDENTICAL to PyTorch version"""
        # Identical to PyTorch: torch.einsum("qin,pjn->qipj", q_reps, p_reps)
        token_scores = np.einsum("qin,pjn->qipj", q_reps, p_reps)

        # Identical to PyTorch: scores, _ = token_scores.max(-1)
        scores = np.max(token_scores, axis=-1)

        # Identical to PyTorch: scores = scores.sum(1)
        scores = np.sum(scores, axis=1)

        return scores

    def search(self, queries: List[str], documents: List[str],
               batch_size: Optional[int] = None, return_scores: bool = True):
        """End-to-end search - IDENTICAL to PyTorch ColBERT.search()"""
        # Encode queries and documents
        q_reps = self.encode_queries(queries, batch_size=batch_size, to_cpu=True)
        p_reps = self.encode_documents(documents, batch_size=batch_size, to_cpu=True)

        if return_scores:
            # Compute similarities
            print("Computing similarities...")
            scores = self.compute_similarity(q_reps, p_reps)
            return scores, q_reps, p_reps

        return q_reps, p_reps

    def rank_documents(self, query: str, documents: List[str], top_k: int = 10) -> List[Tuple]:
        """Rank documents - IDENTICAL to PyTorch ColBERT.rank_documents()"""
        scores, _, _ = self.search([query], documents, return_scores=True)
        scores = scores.squeeze(0)

        # Get top-k results
        top_indices = np.argsort(scores)[::-1][:min(top_k, len(documents))]

        results = []
        for idx in top_indices:
            results.append((int(idx), float(scores[idx]), documents[idx]))

        return results


# ---------------------------------------------------------------------------
# Helper Functions (NumPy versions)
# ---------------------------------------------------------------------------

def _split_into_batches(ids: np.ndarray, mask: np.ndarray, bsize: int):
    return [(ids[i:i + bsize], mask[i:i + bsize])
            for i in range(0, ids.shape[0], bsize)]


def _sort_by_length(ids: np.ndarray, mask: np.ndarray, bsize: int):
    if ids.shape[0] <= bsize:
        return ids, mask, np.arange(ids.shape[0])

    lengths = mask.sum(-1)
    order = np.argsort(lengths)
    reverse = np.argsort(order)
    return ids[order], mask[order], reverse


def _dynamic_augment(ids: List[int], mask_id: int, max_cap: Optional[int] = None) -> List[int]:
    if max_cap is not None and len(ids) > max_cap:
        return ids[:max_cap]

    q_len = len(ids)
    target = max(32, ((q_len + 31) // 32) * 32)
    if target - q_len < 8:
        target = q_len + 8
    if max_cap is not None:
        target = min(target, max_cap)
    return ids + [mask_id] * (target - q_len)


def _stack_3D_arrays(groups):
    bsize = sum(x.shape[0] for x in groups)
    maxlen = max(x.shape[1] for x in groups)
    hdim = groups[0].shape[2]
    out = np.zeros((bsize, maxlen, hdim), dtype=groups[0].dtype)
    ptr = 0
    for g in groups:
        out[ptr:ptr + g.shape[0], :g.shape[1]] = g
        ptr += g.shape[0]
    return out
```
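
Since the ONNX path is meant to be numerically identical to the PyTorch path, a quick sanity check is to compare rankings from both backends (assuming `colbert` and `onnx_colbert` from the earlier snippets are loaded with the same length settings):

```python
query = "How does deep learning work?"
pt_results = colbert.rank_documents(query, documents, top_k=3)
onnx_results = onnx_colbert.rank_documents(query, documents, top_k=3)

# Same ranking order and near-identical scores are expected
for (pt_idx, pt_score, _), (ox_idx, ox_score, _) in zip(pt_results, onnx_results):
    assert pt_idx == ox_idx, "Ranking mismatch between PyTorch and ONNX"
    print(f"doc {pt_idx}: torch={pt_score:.4f} onnx={ox_score:.4f}")
```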

</details>

<br/>

_____

# Notes on reproducing

We welcome anyone to reproduce our results. Here are some tips and observations:

- Please pay attention to the query length. We tried our best to look at what the original ColBERTv2.0 used and what LightOn AI used, and we also spoke to Nils Reimers about the liberty taken in the choice of query lengths.
- Note on query length from the ColBERTv2.0 paper:
  > Unless otherwise stated, we keep the default query maximum sequence length for ColBERTv2 and RocketQAv2, which is 32 tokens. For the ArguAna test in BEIR, as the queries are themselves long documents, we set the maximum query length used by ColBERTv2 and RocketQAv2 to 300. For Climate-FEVER, as the queries are relatively long sentence claims, we set the maximum query length used by ColBERTv2 to 64.
- Query lengths used by LightOn AI's PyLate (assuming the OSS code is what they used; see the sketch after this list for how these caps can be applied):
  ```python
  query_len = {
      "quora": 32,
      "climate-fever": 64,
      "nq": 32,
      "msmarco": 32,
      "hotpotqa": 32,
      "nfcorpus": 32,
      "scifact": 48,
      "trec-covid": 48,
      "fiqa": 32,
      "arguana": 64,
      "scidocs": 48,
      "dbpedia-entity": 32,
      "webis-touche2020": 32,
      "fever": 32,
      "cqadupstack/android": 32,
      "cqadupstack/english": 32,
      "cqadupstack/gaming": 32,
      "cqadupstack/gis": 32,
      "cqadupstack/mathematica": 32,
      "cqadupstack/physics": 32,
      "cqadupstack/programmers": 32,
      "cqadupstack/stats": 32,
      "cqadupstack/tex": 32,
      "cqadupstack/unix": 32,
      "cqadupstack/webmasters": 32,
      "cqadupstack/wordpress": 32,
  }
  ```
- This is what OG Nils had to say when I asked why query length is given so much liberty:
  > Comparison is always hard... I think query length doesn't skew too much. Retrieval compute scales linearly with the number of query tokens. So if people are comfortable comparing models with largely different parameter counts, comparing different query token lengths would be fine as well.
- Nota bene: There *may be* minor differences in the numbers when reproducing; for instance, BGE-M3 reports an nDCG@10 of 59.3 for MIRACL Hindi and we observed only 58.9. But these are not massive differences like those between the reported and reproduced ColBERT-small numbers on some datasets.

Here are our numbers for the full Hindi run on BGE-M3:

```python
{'NDCG@1': 0.49714, 'NDCG@3': 0.5115, 'NDCG@5': 0.53908, 'NDCG@10': 0.58936, 'NDCG@100': 0.6457, 'NDCG@1000': 0.65336}
{'MAP@1': 0.28845, 'MAP@3': 0.42424, 'MAP@5': 0.46455, 'MAP@10': 0.49955, 'MAP@100': 0.51886, 'MAP@1000': 0.51933}
{'Recall@10': 0.73032, 'Recall@50': 0.8987, 'Recall@100': 0.93974, 'Recall@200': 0.95763, 'Recall@500': 0.97813, 'Recall@1000': 0.9902}
{'P@1': 0.49714, 'P@3': 0.33048, 'P@5': 0.24629, 'P@10': 0.15543, 'P@100': 0.0202, 'P@1000': 0.00212}
{'MRR@10': 0.60893, 'MRR@100': 0.615, 'MRR@1000': 0.6151}
```

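For reference, metric dictionaries in this shape are what the `beir` package's evaluator emits; a minimal sketch of the call, assuming `beir` is installed and `results` holds your retrieval run (the tiny qrels/run dicts here are illustrative only):

```python
from beir.retrieval.evaluation import EvaluateRetrieval

# qrels: {query_id: {doc_id: relevance}}; results: {query_id: {doc_id: score}}
qrels = {"q1": {"d1": 1}}
results = {"q1": {"d1": 12.3, "d2": 7.1}}

ndcg, _map, recall, precision = EvaluateRetrieval.evaluate(qrels, results, [1, 3, 5, 10, 100, 1000])
mrr = EvaluateRetrieval.evaluate_custom(qrels, results, [10, 100, 1000], metric="mrr")
print(ndcg, mrr)
```
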
- We made sure all quirks and known BEIR ColBERT issues are taken care of:
  - [ArguAna and Quora (?) self-match issues](https://github.com/beir-cellar/beir/issues/67)
  - TBA

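As mentioned in the query-length notes above, here is a sketch of how those per-dataset caps could be plugged into the inference class from this card (the evaluation step is a placeholder, not a real API):

```python
model_path = "prithivida/modern_colbert_base_en_v1"

for dataset, qlen in query_len.items():
    # Reload with the dataset-specific query cap; document length stays at 300
    colbert = ColBERT.load_for_inference(model_path, max_query_len=qlen, max_doc_len=300)
    # ... run BEIR retrieval + evaluation for `dataset` here ...
```
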
# Acknowledgements

- Thanks to Nils Reimers for the tips and inputs.
- Thanks to Nandan Thakur for answering questions.
- Thanks to Antoine Chaffin and the LightOn team for PyLate.
- We thank Prithivi Da for his generous funding for this work :-)