license: cc-by-nc-4.0
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
inference: false
tags:
- ColBERT
- passage-retrieval
Trained by Jina AI.
JinaColBERT V2: your multilingual late interaction retriever!
JinaColBERT V2 (jina-colbert-v2) is a new model based on JinaColBERT V1 that expands the capabilities and improves the performance of jina-colbert-v1-en. Like the previous release, it supports Jina AI’s 8192-token input context and offers the efficiency, performance, and explainability of token-level embeddings and late interaction.
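To make the late interaction idea concrete, here is a minimal sketch of ColBERT-style MaxSim scoring in plain Python. The toy 2-dimensional vectors and the `maxsim_score` helper are illustrative assumptions, not part of the jina-colbert-v2 API; real token embeddings come from the model.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: for each query token, take its best-matching
    document token, then sum those maxima over all query tokens."""
    per_token = [max(dot(q, d) for d in doc_vecs) for q in query_vecs]
    return sum(per_token), per_token


# Toy, hand-written token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.6, 0.8]]   # has a good match for both query tokens
doc_b = [[0.0, -1.0], [0.8, 0.6]]  # weaker matches

score_a, contrib_a = maxsim_score(query, doc_a)
score_b, contrib_b = maxsim_score(query, doc_b)
print(score_a, contrib_a)
print(score_b, contrib_b)
```

The per-token maxima are what make the score explainable: each query token can be traced back to the document token it matched best.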
This new release adds new functionality and performance improvements:
- Multilingual support for dozens of languages, with strong performance on major global languages.
- Matryoshka embeddings, which allow users to trade between efficiency and precision flexibly.
- Superior retrieval performance when compared to the English-only jina-colbert-v1-en.
JinaColBERT V2 is available in three versions with different embedding dimensions:
- jinaai/jina-colbert-v2: 128-dimensional embeddings
- jinaai/jina-colbert-v2-96: 96-dimensional embeddings
- jinaai/jina-colbert-v2-64: 64-dimensional embeddings
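As a rough illustration of the Matryoshka idea, the sketch below keeps the leading components of a token embedding and re-normalizes the result. This is a generic Matryoshka-style truncation, shown under the assumption that the leading dimensions carry most of the information; the `truncate` helper and the 4-dimensional toy vector are hypothetical, and this card does not state whether the 96 and 64 dimension checkpoints are produced in exactly this way.

```python
import math


def truncate(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


full = [0.5, 0.5, 0.5, 0.5]   # unit-length toy embedding (hypothetical)
small = truncate(full, 2)     # half the dimensions, so half the storage
print(len(small), small)
```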
Usage
Installation
jina-colbert-v2 is trained with flash attention and therefore requires einops and flash_attn to be installed.
To use the model, you can use either the Stanford ColBERT library or the PyLate or RAGatouille packages.
pip install -U einops flash_attn
pip install -U ragatouille # or
pip install -U colbert-ai # or
pip install -U pylate
PyLate
# Please refer to Pylate: https://github.com/lightonai/pylate for detailed usage
from pylate import indexes, models, retrieve
model = models.ColBERT(
    model_name_or_path="jinaai/jina-colbert-v2",
    query_prefix="[QueryMarker]",
    document_prefix="[DocumentMarker]",
    attend_to_expansion_tokens=True,
    trust_remote_code=True,
)
RAGatouille
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")
docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both 8k context length, fast and accurate retrieval.",
]
RAG.index(docs, index_name="demo")
query = "What does ColBERT do?"
results = RAG.search(query)
Stanford ColBERT
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint
ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())
docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model but based on JinaBERT so it can support both 8k context length, fast and accurate retrieval.",
]
# queryFromText applies query-side encoding to the texts;
# use ckpt.docFromText(docs, bsize=2) to encode them as documents instead.
query_vectors = ckpt.queryFromText(docs, bsize=2)
Evaluation Results
Retrieval Benchmarks
BEIR
| NDCG@10 | jina-colbert-v2 | jina-colbert-v1 | ColBERTv2.0 | BM25 |
|---|---|---|---|---|
| avg | 0.531 | 0.502 | 0.496 | 0.440 |
| nfcorpus | 0.346 | 0.338 | 0.337 | 0.325 |
| fiqa | 0.408 | 0.368 | 0.354 | 0.236 |
| trec-covid | 0.834 | 0.750 | 0.726 | 0.656 |
| arguana | 0.366 | 0.494 | 0.465 | 0.315 |
| quora | 0.887 | 0.823 | 0.855 | 0.789 |
| scidocs | 0.186 | 0.169 | 0.154 | 0.158 |
| scifact | 0.678 | 0.701 | 0.689 | 0.665 |
| webis-touche | 0.274 | 0.270 | 0.260 | 0.367 |
| dbpedia-entity | 0.471 | 0.413 | 0.452 | 0.313 |
| fever | 0.805 | 0.795 | 0.785 | 0.753 |
| climate-fever | 0.239 | 0.196 | 0.176 | 0.213 |
| hotpotqa | 0.766 | 0.656 | 0.675 | 0.603 |
| nq | 0.640 | 0.549 | 0.524 | 0.329 |
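For readers unfamiliar with the metric in the table above, NDCG@10 can be sketched as follows. This is a minimal implementation using the linear-gain formulation, which is an assumption; the evaluation toolkit behind the reported numbers may differ in details such as tie handling. The `ndcg_at_k` helper and the relevance labels are illustrative only.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """NDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking.

    ranked_relevances: graded labels of the returned documents, in rank order.
    all_relevances: graded labels of every judged document for the query.
    """
    ideal_dcg = dcg(sorted(all_relevances, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0
    return dcg(ranked_relevances[:k]) / ideal_dcg


# Hypothetical query with two relevant documents, retrieved at ranks 1 and 3.
ranking = [1, 0, 1, 0, 0]
score = ndcg_at_k(ranking, all_relevances=[1, 1, 0, 0, 0])
print(round(score, 3))
```

A score of 1.0 means every relevant document was ranked ahead of every non-relevant one within the top 10.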
MS MARCO Passage Retrieval
| MRR@10 | jina-colbert-v2 | jina-colbert-v1 | ColBERTv2.0 | BM25 |
|---|---|---|---|---|
| MSMARCO | 0.396 | 0.390 | 0.397 | 0.187 |
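MRR@10, the metric reported above, averages the reciprocal rank of the first relevant passage within the top 10 results. The sketch below is a minimal illustration; the `mrr_at_k` helper and the three toy queries are hypothetical and are not taken from the MS MARCO evaluation script.

```python
def mrr_at_k(rankings, k=10):
    """Mean reciprocal rank: average of 1/rank of the first relevant hit
    within the top k, counting 0 when no relevant passage appears."""
    total = 0.0
    for flags in rankings:
        for rank, is_relevant in enumerate(flags[:k], start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Three hypothetical queries: first relevant hit at rank 1, at rank 4, and none.
runs = [
    [True, False, False],
    [False, False, False, True],
    [False] * 10,
]
score = mrr_at_k(runs)
print(score)
```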
Multilingual Benchmarks
MIRACL
| NDCG@10 | jina-colbert-v2 | mDPR (zero shot) |
|---|---|---|
| avg | 0.627 | 0.427 |
| ar | 0.753 | 0.499 |
| bn | 0.750 | 0.443 |
| de | 0.504 | 0.490 |
| es | 0.538 | 0.478 |
| en | 0.570 | 0.394 |
| fa | 0.563 | 0.480 |
| fi | 0.740 | 0.472 |
| fr | 0.541 | 0.435 |
| hi | 0.600 | 0.383 |
| id | 0.547 | 0.272 |
| ja | 0.632 | 0.439 |
| ko | 0.671 | 0.419 |
| ru | 0.643 | 0.407 |
| sw | 0.499 | 0.299 |
| te | 0.742 | 0.356 |
| th | 0.772 | 0.358 |
| yo | 0.623 | 0.396 |
| zh | 0.523 | 0.512 |
mMARCO
| MRR@10 | jina-colbert-v2 | BM25 | ColBERT-XM |
|---|---|---|---|
| avg | 0.313 | 0.141 | 0.254 |
| ar | 0.272 | 0.111 | 0.195 |
| de | 0.331 | 0.136 | 0.270 |
| nl | 0.330 | 0.140 | 0.275 |
| es | 0.341 | 0.158 | 0.285 |
| fr | 0.335 | 0.155 | 0.269 |
| hi | 0.309 | 0.134 | 0.238 |
| id | 0.319 | 0.149 | 0.263 |
| it | 0.337 | 0.153 | 0.265 |
| ja | 0.276 | 0.141 | 0.241 |
| pt | 0.337 | 0.152 | 0.276 |
| ru | 0.298 | 0.124 | 0.251 |
| vi | 0.287 | 0.136 | 0.226 |
| zh | 0.302 | 0.116 | 0.246 |
Matryoshka Representation Benchmarks
BEIR
| NDCG@10 | dim=128 | dim=96 | dim=64 |
|---|---|---|---|
| avg | 0.599 | 0.591 | 0.589 |
| nfcorpus | 0.346 | 0.340 | 0.347 |
| fiqa | 0.408 | 0.404 | 0.404 |
| trec-covid | 0.834 | 0.808 | 0.805 |
| hotpotqa | 0.766 | 0.764 | 0.756 |
| nq | 0.640 | 0.640 | 0.635 |
MSMARCO
| MRR@10 | dim=128 | dim=96 | dim=64 |
|---|---|---|---|
| msmarco | 0.396 | 0.391 | 0.388 |
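The trade-off in the two Matryoshka tables can be quantified with a little arithmetic. The sketch below uses only the reported scores and assumes that index storage per token scales linearly with the embedding dimension (same numeric precision, no compression), which is a simplification.

```python
# Reported scores from the Matryoshka tables (dimension -> score).
beir_avg = {128: 0.599, 96: 0.591, 64: 0.589}
msmarco = {128: 0.396, 96: 0.391, 64: 0.388}

# For each dimension: (relative storage, share of BEIR score kept,
# share of MS MARCO score kept), all relative to the 128-dimension model.
summary = {
    dim: (dim / 128, beir_avg[dim] / beir_avg[128], msmarco[dim] / msmarco[128])
    for dim in (128, 96, 64)
}

for dim, (storage, beir_kept, marco_kept) in summary.items():
    print(f"dim={dim}: storage {storage:.0%}, "
          f"BEIR {beir_kept:.1%}, MS MARCO {marco_kept:.1%}")
```

Under that assumption, the 64-dimension variant halves per-token storage while keeping roughly 98% of the 128-dimension scores on these benchmarks.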
Other Models
Additionally, we provide the following embedding models, which can also be used for retrieval:
- jina-embeddings-v2-base-en: 137 million parameters.
- jina-embeddings-v2-base-zh: 161 million parameters, Chinese-English bilingual model.
- jina-embeddings-v2-base-de: 161 million parameters, German-English bilingual model.
- jina-embeddings-v2-base-es: 161 million parameters, Spanish-English bilingual model.
- jina-reranker-v2: multilingual reranker model.
- jina-clip-v1: English multimodal (text-image) embedding model.
Contact
Join our Discord community and chat with other community members about ideas.
@misc{jha2024jinacolbertv2generalpurposemultilinguallate,
title={Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever},
author={Rohan Jha and Bo Wang and Michael Günther and Saba Sturua and Mohammad Kalim Akram and Han Xiao},
year={2024},
eprint={2408.16672},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2408.16672},
}