update README
README.md
CHANGED
@@ -1,148 +1,276 @@
---
license: apache-2.0
base_model:
- microsoft/MiniLM-L6-v2
tags:
- transformers
- sentence-transformers
- sentence-similarity
- feature-extraction
- text-embeddings-inference
- information-retrieval
- knowledge-distillation
language:
- en
---
<div style="display: flex; justify-content: center;">
<div style="display: flex; align-items: center; gap: 10px;">
<img src="logo.png" alt="MongoDB Logo" style="height: 36px; width: auto;">
<span style="font-size: 32px; font-weight: bold">MongoDB/mdbr-leaf-ir</span>
</div>
</div>

**mdbr-leaf-ir** is a compact, high-performance text embedding model designed specifically for **information retrieval (IR)** tasks.

For even greater efficiency, `mdbr-leaf-ir` supports [flexible asymmetric architectures](#asymmetric-retrieval-setup) and is robust to [vector quantization](#vector-quantization) and [MRL truncation](#mrl).

If you are looking to perform other tasks such as classification, clustering, semantic sentence similarity, or summarization, please check out our [`mdbr-leaf-mt`](https://huggingface.co/MongoDB/mdbr-leaf-mt) model.

**Note**: this model was developed by MongoDB Research and is not part of MongoDB's commercial offerings.

## Technical Report

A technical report detailing our proposed `LEAF` training procedure is [available here (TBD)](http://FILL_HERE_ARXIV_LINK).

## Highlights

* **State-of-the-Art Performance**: `mdbr-leaf-ir` achieves new state-of-the-art results for compact embedding models, ranking <span style="color:red">#TBD</span> on the public BEIR benchmark leaderboard among models with fewer than 30M parameters, with an average nDCG@10 score of <span style="color:red">[TBD HERE]</span>.
* **Flexible Architecture Support**: `mdbr-leaf-ir` supports asymmetric retrieval architectures, which can further improve retrieval results. [See below](#asymmetric-retrieval-setup) for more information.
* **MRL and Quantization Support**: embedding vectors generated by `mdbr-leaf-ir` remain robust when truncated (MRL) and/or stored using more compact types such as `int8` and `binary`. [See below](#mrl) for more information.

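To make the storage savings behind truncation and quantization concrete, the sketch below computes the bytes needed per embedding under a few combinations. It assumes a 768-dimensional output (the size of the projection layer used in the Transformers example of this README); the helper name `bytes_per_vector` is our own and purely illustrative.

```python
# Bytes needed to store one embedding under different compression choices.
# ASSUMPTION: 768 output dimensions, as suggested by the projection layer in this README.

BITS_PER_VALUE = {"float32": 32, "int8": 8, "int4": 4, "binary": 1}


def bytes_per_vector(dims: int, dtype: str) -> int:
    """Return the storage size in bytes of one vector with `dims` entries."""
    bits = dims * BITS_PER_VALUE[dtype]
    return (bits + 7) // 8  # round up to whole bytes


FULL_DIMS = 768
baseline = bytes_per_vector(FULL_DIMS, "float32")

for dims in (FULL_DIMS, 256):
    for dtype in BITS_PER_VALUE:
        size = bytes_per_vector(dims, dtype)
        print(f"{dims:>4} dims, {dtype:<7}: {size:>5} bytes ({baseline / size:.0f}x vs. full float32)")
```

Under these assumptions a full `float32` vector takes 3072 bytes, while a 256-dimensional `int4` vector takes 128 bytes. How much retrieval quality each setting preserves has to be measured separately.
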
## Performance

### Benchmark Results

* Values are nDCG@10.
* Scores exclude CQADupstack and MSMARCO; full BEIR results are available on the [public leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
* Bold marks scores where our model, in either standard or asymmetric mode, outperforms the comparison models; comparison scores are also bolded where they outperform our model in standard mode. Blue marks scores where asymmetric mode outperforms standard mode.
* `BM25` scores are obtained with `(k₁=0.9, b=0.4)`.

| Model | Size | arg. | fiqa | nfc | scid. | scif. | quora | covid | nq | fever | c-fever | dbp. | hotpot | avg. |
|-------|------|------|------|-----|-------|-------|-------|-------|----|-------|---------|------|--------|------|
| **`mdbr-leaf-ir` (asym.)** | 23M | **<span style="color:blue">58.5</span>** | **<span style="color:blue">42.1</span>** | **36.1** | <span style="color:blue">20.4</span> | **69.9** | <span style="color:blue">86.2</span> | **<span style="color:blue">83.7</span>** | **<span style="color:blue">61.4</span>** | **<span style="color:blue">86.4</span>** | **<span style="color:blue">37.4</span>** | **<span style="color:blue">44.8</span>** | **<span style="color:blue">69.0</span>** | **<span style="color:blue">58.0</span>** |
| **`mdbr-leaf-ir`** | 23M | **56.7** | **38.1** | **36.2** | 19.5 | **70.0** | 71.0 | **83.0** | **58.2** | **85.4** | **32.4** | 43.7 | 68.2 | **55.2** |
| **Comparisons** | | | | | | | | | | | | | | |
| `snowflake-arctic-embed-xs` | 23M | 52.1 | 34.5 | 30.9 | 18.4 | 64.5 | 86.6 | 79.4 | 54.8 | 83.4 | 29.9 | 40.2 | 65.3 | 53.3 |
| `MiniLM-L6-v2` | 23M | 50.2 | 36.9 | 31.6 | **21.6** | 64.5 | **87.6** | 47.2 | 43.9 | 51.9 | 20.3 | 32.3 | 46.5 | 44.5 |
| `BM25` | -- | 40.8 | 23.8 | 31.8 | 15.0 | 67.6 | 78.7 | 58.9 | 30.5 | 63.8 | 16.2 | 31.9 | 62.9 | 43.5 |
| `SPLADE v2` | 110M | 47.9 | 33.6 | 33.4 | 15.8 | 69.3 | 83.8 | 71.0 | 52.1 | 78.6 | 23.5 | 43.5 | **68.4** | 51.7 |
| `ColBERT v2` | 110M | 46.3 | 35.6 | 33.8 | 15.4 | 69.3 | 85.2 | 73.8 | 56.2 | 78.5 | 17.6 | **44.6** | 66.7 | 51.9 |

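As a reminder of what the scores in the table measure, here is a minimal sketch of nDCG@10 for a single query. It uses the linear-gain form of DCG; the function names and the toy relevance labels are our own illustration and not part of any benchmark toolkit, so an official evaluation should rely on the benchmark's own tooling.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking.

    ranked_relevances: labels of the retrieved documents, in ranked order.
    all_relevances:    labels of every judged document for the query.
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical query with two relevant documents, retrieved at ranks 1 and 3.
ranked = [1, 0, 1, 0, 0]
judged = [1, 1]
score = ndcg_at_k(ranked, judged)
print(f"nDCG@10 = {100 * score:.1f}")
# nDCG@10 = 92.0
```

The benchmark score for a dataset is the mean of this per-query value over all queries, reported here on a 0 to 100 scale.
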
## Quickstart

### Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("MongoDB/mdbr-leaf-ir")

# Example queries and documents
queries = [
    "What is machine learning?",
    "How does neural network training work?"
]

documents = [
    "Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.",
    "Neural networks are trained through backpropagation, adjusting weights to minimize prediction errors."
]

# Encode queries (with the "query" prompt) and documents (without a prompt)
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute similarity scores
scores = model.similarity(query_embeddings, document_embeddings)

# Print results
for i, query in enumerate(queries):
    print(f"Query: {query}")
    for j, doc in enumerate(documents):
        print(f" Similarity: {scores[i, j]:.4f} | Document {j}: {doc[:80]}...")

# Expected output (document text abbreviated):
# Query: What is machine learning?
#  Similarity: 0.6908 | Document 0: Machine learning is a subset of ...
#  Similarity: 0.4598 | Document 1: Neural networks are trained ...
#
# Query: How does neural network training work?
#  Similarity: 0.4432 | Document 0: Machine learning is a subset of ...
#  Similarity: 0.5794 | Document 1: Neural networks are trained ...
```

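In a retrieval setting the similarity matrix is usually turned into a ranking per query. The sketch below does this in plain Python for the scores printed in the Quickstart output; the helper `top_k` is a hypothetical name of ours, and a real system would typically delegate this step to a vector index.

```python
# Similarity matrix copied from the Quickstart output
# (rows: queries, columns: documents).
scores = [
    [0.6908, 0.4598],
    [0.4432, 0.5794],
]


def top_k(score_row, k):
    """Return (document index, score) pairs for the k highest-scoring documents."""
    ranked = sorted(enumerate(score_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


for query_idx, row in enumerate(scores):
    best_doc, best_score = top_k(row, k=1)[0]
    print(f"Query {query_idx}: best match is document {best_doc} ({best_score:.4f})")
```
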
### Transformers Usage

<span style="color:red">CHECK THAT safe_open WORKS WITH URLS; link to code in repo</span>

<!-- ```python
import torch
import torch.nn.functional as F
from huggingface_hub import hf_hub_download
from safetensors import safe_open
from transformers import AutoModel, AutoTokenizer

MODEL = "MongoDB/mdbr-leaf-ir"

# Load the model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

# safe_open expects a local file path, so download the dense projection weights first
dense_path = hf_hub_download(repo_id=MODEL, filename="2_Dense/model.safetensors")

tensors = {}
with safe_open(dense_path, framework="pt") as f:
    for k in f.keys():
        tensors[k] = f.get_tensor(k)

W_out = torch.nn.Linear(in_features=384, out_features=768, bias=True)
W_out.load_state_dict({
    "weight": tensors["linear.weight"],
    "bias": tensors["linear.bias"]
})

_ = model.eval()
_ = W_out.eval()

# Example queries and documents
queries = [
    "What is machine learning?",
    "How does neural network training work?"
]

documents = [
    "Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.",
    "Neural networks are trained through backpropagation, adjusting weights to minimize prediction errors."
]

# Tokenize; only queries receive the retrieval prefix
QUERY_PREFIX = 'Represent this sentence for searching relevant passages: '
queries_with_prefix = [QUERY_PREFIX + query for query in queries]

query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)
document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=512)

# Perform inference; every step stays inside inference mode
with torch.inference_mode():
    y_queries = model(**query_tokens).last_hidden_state
    y_docs = model(**document_tokens).last_hidden_state

    # Mean pooling over non-padding tokens
    y_queries = y_queries * query_tokens.attention_mask.unsqueeze(-1)
    y_queries_pooled = y_queries.sum(dim=1) / query_tokens.attention_mask.sum(dim=1, keepdim=True)

    y_docs = y_docs * document_tokens.attention_mask.unsqueeze(-1)
    y_docs_pooled = y_docs.sum(dim=1) / document_tokens.attention_mask.sum(dim=1, keepdim=True)

    # Map to the desired output dimension
    y_queries_out = W_out(y_queries_pooled)
    y_docs_out = W_out(y_docs_pooled)

    # Normalize
    query_embeddings = F.normalize(y_queries_out, dim=-1)
    document_embeddings = F.normalize(y_docs_out, dim=-1)

similarities = query_embeddings @ document_embeddings.T
print(f"Similarities:\n{similarities}")
# Similarities:
# tensor([[0.6908, 0.4598],
#         [0.4432, 0.5794]])
``` -->

### Asymmetric Retrieval Setup

`mdbr-leaf-ir` is *aligned* to [`snowflake-arctic-embed-m-v1.5`](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5), the model it has been distilled from, making the asymmetric system below possible:

```python
from sentence_transformers import SentenceTransformer

# `queries` and `documents` are the lists defined in the Sentence Transformers example above

# Use a larger model for document encoding (one-time, at index time)
doc_model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v1.5")
document_embeddings = doc_model.encode(documents)

# Use mdbr-leaf-ir for query encoding (real-time, low latency)
query_model = SentenceTransformer("MongoDB/mdbr-leaf-ir")
query_embeddings = query_model.encode(queries, prompt_name="query")

# Compute similarities
scores = query_model.similarity(query_embeddings, document_embeddings)
```

Retrieval results in asymmetric mode are usually better than those in the [standard mode above](#sentence-transformers).

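The split between index time and query time can be sketched without any model download. The class below is a hypothetical illustration of ours, not an API of any library: it takes two encoder callables, embeds documents once with the first, and embeds each incoming query with the second. The toy keyword-count encoder only exists so that the example runs offline.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Encoder = Callable[[Sequence[str]], List[Vector]]


def normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class AsymmetricIndex:
    """Documents are embedded once with `doc_encoder`; queries use `query_encoder`."""

    def __init__(self, doc_encoder: Encoder, query_encoder: Encoder) -> None:
        self.doc_encoder = doc_encoder
        self.query_encoder = query_encoder
        self.documents: List[str] = []
        self.doc_vectors: List[Vector] = []

    def add(self, documents: Sequence[str]) -> None:
        """Index time: encode and store the documents."""
        self.doc_vectors.extend(normalize(v) for v in self.doc_encoder(documents))
        self.documents.extend(documents)

    def search(self, query: str, k: int = 3) -> List[Tuple[str, float]]:
        """Query time: encode the query and rank documents by cosine similarity."""
        q = normalize(self.query_encoder([query])[0])
        if self.doc_vectors and len(q) != len(self.doc_vectors[0]):
            raise ValueError("query and document encoders must share one embedding space")
        scored = [(doc, sum(a * b for a, b in zip(q, vec)))
                  for doc, vec in zip(self.documents, self.doc_vectors)]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def toy_encoder(texts: Sequence[str]) -> List[Vector]:
    """Hypothetical stand-in encoder: counts two keywords per text."""
    return [[float(t.lower().count("learning")), float(t.lower().count("network"))]
            for t in texts]


index = AsymmetricIndex(doc_encoder=toy_encoder, query_encoder=toy_encoder)
index.add([
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are trained through backpropagation.",
])
results = index.search("How does neural network training work?", k=1)
print(results)
```

In a real deployment the two callables would wrap the teacher model for documents and `mdbr-leaf-ir` (with the query prompt) for queries. The dimension check guards against mixing encoders with different output sizes, but it cannot verify that two encoders are actually aligned.
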
### MRL

The embeddings were trained with [Matryoshka Representation Learning (MRL)](https://arxiv.org/abs/2205.13147) and can be truncated for more efficient storage:

```python
from torch.nn import functional as F

# `model`, `queries` and `documents` are defined in the Sentence Transformers example above
query_embeds = model.encode(queries, prompt_name="query", convert_to_tensor=True)
doc_embeds = model.encode(documents, convert_to_tensor=True)

# Truncate to the first 256 dimensions, then re-normalize
query_embeds = F.normalize(query_embeds[:, :256], dim=-1)
doc_embeds = F.normalize(doc_embeds[:, :256], dim=-1)

similarities = model.similarity(query_embeds, doc_embeds)

print('After MRL:')
print(f"* Embeddings dimension: {query_embeds.shape[1]}")
print(f"* Similarities:\n\t{similarities}")

# After MRL:
# * Embeddings dimension: 256
# * Similarities:
#   tensor([[0.7202, 0.5006],
#           [0.4744, 0.6083]])
```

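The re-normalization step matters because a truncated unit vector is no longer unit length, so a plain dot product stops being a cosine similarity. The following sketch shows this on hypothetical 4-dimensional vectors of our own choosing; real embeddings would come from `model.encode(...)`.

```python
import math


def truncate_and_normalize(vec, dims):
    """Keep the first `dims` entries of `vec` and rescale the result to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs have unit length."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical unit-length vectors standing in for real embeddings.
query = [0.5, 0.5, 0.5, 0.5]
doc = [0.7, 0.1, 0.5, 0.5]

full = dot(query, doc)
truncated_raw = dot(query[:2], doc[:2])
truncated_norm = dot(truncate_and_normalize(query, 2), truncate_and_normalize(doc, 2))

print(f"full (4 dims):             {full:.2f}")
print(f"truncated, not normalized: {truncated_raw:.2f}")
print(f"truncated and normalized:  {truncated_norm:.2f}")
```

Without re-normalization the truncated score shrinks simply because part of each vector's length was cut away, which makes scores incomparable across truncation sizes.
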
### Vector Quantization

Embeddings can also be quantized, for example to `int8` or `binary`.

**Note**: For vector quantization to types other than binary, we suggest performing a calibration to determine the optimal ranges; [see here](https://sbert.net/examples/sentence_transformer/applications/embedding-quantization/README.html#scalar-int8-quantization).
Good initial values, according to the [teacher model's documentation](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5#compressing-to-128-bytes), are:

* `int8`: -0.3 and +0.3
* `int4`: -0.18 and +0.18

Quantization to `int8` can be performed as follows:

```python
import torch
from sentence_transformers.quantization import quantize_embeddings

# `model`, `queries` and `documents` are defined in the Sentence Transformers example above
query_embeds = model.encode(queries, prompt_name="query")
doc_embeds = model.encode(documents)

# Quantize embeddings to int8 using -0.3 and +0.3 as calibration ranges
ranges = torch.tensor([[-0.3], [+0.3]]).expand(2, query_embeds.shape[1]).cpu().numpy()
query_embeds = quantize_embeddings(query_embeds, "int8", ranges=ranges)
doc_embeds = quantize_embeddings(doc_embeds, "int8", ranges=ranges)

# Calculate similarities; cast to int64 to avoid under/overflow
similarities = query_embeds.astype("int64") @ doc_embeds.astype("int64").T

print('After quantization:')
print(f"* Embeddings type: {query_embeds.dtype}")
print(f"* Similarities:\n{similarities}")

# After quantization:
# * Embeddings type: int8
# * Similarities:
# [[119073 78877]
#  [ 76174 99127]]
```

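The example above covers `int8`; for `binary`, the idea can be sketched from scratch as follows. This is an illustrative implementation of sign-based binarization with Hamming-distance comparison, not the code path of any particular library. The vectors are hypothetical, and the helper names are ours.

```python
def binarize(vec):
    """Pack the sign pattern of `vec` into a Python int, one bit per dimension."""
    bits = 0
    for value in vec:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming_distance(a, b):
    """Number of bit positions in which two packed codes differ."""
    return bin(a ^ b).count("1")


# Hypothetical 8-dimensional embeddings; real ones would come from model.encode(...).
query = [0.12, -0.05, 0.31, -0.22, 0.08, 0.01, -0.17, 0.09]
doc_a = [0.10, -0.02, 0.25, -0.30, 0.05, -0.03, -0.11, 0.07]
doc_b = [-0.20, 0.14, -0.09, 0.27, -0.01, 0.06, 0.19, -0.04]

q_code = binarize(query)
for name, doc in (("doc_a", doc_a), ("doc_b", doc_b)):
    distance = hamming_distance(q_code, binarize(doc))
    print(f"{name}: Hamming distance to query = {distance}")
# doc_a: Hamming distance to query = 1
# doc_b: Hamming distance to query = 7
```

A smaller Hamming distance means more matching signs and therefore, roughly, a higher similarity. Binary codes are often used for a fast first pass, with the top candidates rescored using higher-precision vectors.
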
## Citation

If you use this model in your work, please cite:

```bibtex
@article{mdb_leaf,
  title = {LEAF: Lightweight Embedding Alignment Knowledge Distillation Framework},
  author = {Robin Vujanic and Thomas Rueckstiess},
  year = {2025},
  eprint = {TBD},
  archiveprefix = {arXiv},
  primaryclass = {FILL HERE},
  url = {FILL HERE}
}
```

## License

This model is released under the Apache 2.0 <span style="color:red">(TBD)</span> License.

## Contact

For questions or issues, please open an issue or pull request. You can also contact the MongoDB ML research team at [email protected].
logo.png
ADDED