update README
README.md
CHANGED
@@ -1,148 +1,276 @@
|
1 |
+
---
|
2 |
+
license: apache-2.0
|
3 |
+
base_model:
|
4 |
+
- microsoft/MiniLM-L6-v2
|
5 |
+
tags:
|
6 |
+
- transformers
|
7 |
+
- sentence-transformers
|
8 |
+
- sentence-similarity
|
9 |
+
- feature-extraction
|
10 |
+
- text-embeddings-inference
|
11 |
+
- information-retrieval
|
12 |
+
- knowledge-distillation
|
13 |
+
language:
|
14 |
+
- en
|
15 |
+
---
|
16 |
+
<div style="display: flex; justify-content: center;">
|
17 |
+
<div style="display: flex; align-items: center; gap: 10px;">
|
18 |
+
<img src="logo.png" alt="MongoDB Logo" style="height: 36px; width: auto;">
|
19 |
+
<span style="font-size: 32px; font-weight: bold">MongoDB/mdbr-leaf-ir</span>
|
20 |
+
</div>
|
21 |
+
</div>
|
22 |
+
|
23 |
+
**mdbr-leaf-ir** is a compact, high-performance text embedding model designed specifically for **information retrieval (IR)** tasks.
|
24 |
+
|
25 |
+
Enabling even greater efficiency, `mdbr-leaf-ir` supports [flexible asymmetric architectures](#asymmetric-retrieval-setup) and is robust to [vector quantization](#vector-quantization) and [MRL truncation](#mrl).
|
26 |
+
|
27 |
+
If you are looking to perform other tasks such as classification, clustering, semantic sentence similarity, or summarization, please check out our [`mdb-leaf-mt`](https://huggingface.co/MongoDB/mdb-leaf-mt) model.
|
28 |
+
|
29 |
+
**Note**: this model has been developed by MongoDB Research and is not part of MongoDB's commercial offerings.
|
30 |
+
|
31 |
+
## Technical Report
|
32 |
+
|
33 |
+
A technical report detailing our proposed `LEAF` training procedure is [available here (TBD)](http://FILL_HERE_ARXIV_LINK).
|
34 |
+
|
35 |
+
## Highlights
|
36 |
+
|
37 |
+
* **State-of-the-Art Performance**: `mdbr-leaf-ir` achieves new state-of-the-art results for compact embedding models, ranking <span style="color:red">#TBD</span> on the public BEIR benchmark leaderboard for models <30M parameters with an average nDCG@10 score of <span style="color:red">[TBD HERE]</span>.
|
38 |
+
* **Flexible Architecture Support**: `mdbr-leaf-ir` supports asymmetric retrieval architectures, enabling even better retrieval results. [See below](#asymmetric-retrieval-setup) for more information.
|
39 |
+
* **MRL and Quantization Support**: embedding vectors generated by `mdbr-leaf-ir` compress well when truncated (MRL) and/or stored using more compact types such as `int8` and `binary`. [See below](#mrl) for more information.
|
40 |
+
|
41 |
+
|
42 |
+
## Performance
|
43 |
+
|
44 |
+
### Benchmark Results
|
45 |
+
|
46 |
+
* Values are nDCG@10
|
47 |
+
* Scores exclude CQADupstack and MSMARCO; full BEIR results are available on the [public leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
|
48 |
+
* Bold marks scores where our model, in either standard or asymmetric mode, outperforms the comparison models; we also bold cases where a comparison model outperforms our model in standard mode. Blue marks scores where asymmetric mode outperforms standard mode.
|
49 |
+
* `BM25` scores are obtained with `(k₁=0.9, b=0.4)`.
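
For readers unfamiliar with the two `BM25` parameters, the textbook Okapi BM25 scoring function is reproduced here as background. The exact variant and implementation behind the `BM25` row are not stated in this README, so this is not a description of the baseline that was actually run.

$$
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t)\,\frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}
$$

Here $f(t, d)$ is the frequency of term $t$ in document $d$, $|d|$ is the document length, and $\mathrm{avgdl}$ is the average document length in the collection. The parameter $k_1$ controls how quickly repeated occurrences of a term saturate, and $b$ controls how strongly scores are normalized by document length.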
|
50 |
+
|
51 |
+
| Model | Size | arg. | fiqa | nfc | scid. | scif. | quora | covid | nq | fever | c-fever | dbp. | hotpot | avg. |
|-------|------|------|------|-----|-------|-------|-------|-------|----|-------|---------|------|--------|------|
| **`mdbr-leaf-ir` (asym.)** | 23M | **<span style="color:blue">58.5</span>** | **<span style="color:blue">42.1</span>** | **36.1** | <span style="color:blue">20.4</span> | **69.9** | <span style="color:blue">86.2</span> | **<span style="color:blue">83.7</span>** | **<span style="color:blue">61.4</span>** | **<span style="color:blue">86.4</span>** | **<span style="color:blue">37.4</span>** | **<span style="color:blue">44.8</span>** | **<span style="color:blue">69.0</span>** | **<span style="color:blue">58.0</span>** |
| **`mdbr-leaf-ir`** | 23M | **56.7** | **38.1** | **36.2** | 19.5 | **70.0** | 71.0 | **83.0** | **58.2** | **85.4** | **32.4** | 43.7 | 68.2 | **55.2** |
| **Comparisons** | | | | | | | | | | | | | | |
| `snowflake-arctic-embed-xs` | 23M | 52.1 | 34.5 | 30.9 | 18.4 | 64.5 | 86.6 | 79.4 | 54.8 | 83.4 | 29.9 | 40.2 | 65.3 | 53.3 |
| `MiniLM-L6-v2` | 23M | 50.2 | 36.9 | 31.6 | **21.6** | 64.5 | **87.6** | 47.2 | 43.9 | 51.9 | 20.3 | 32.3 | 46.5 | 44.5 |
| `BM25` | -- | 40.8 | 23.8 | 31.8 | 15.0 | 67.6 | 78.7 | 58.9 | 30.5 | 63.8 | 16.2 | 31.9 | 62.9 | 43.5 |
| `SPLADE v2` | 110M | 47.9 | 33.6 | 33.4 | 15.8 | 69.3 | 83.8 | 71.0 | 52.1 | 78.6 | 23.5 | 43.5 | **68.4** | 51.7 |
| `ColBERT v2` | 110M | 46.3 | 35.6 | 33.8 | 15.4 | 69.3 | 85.2 | 73.8 | 56.2 | 78.5 | 17.6 | **44.6** | 66.7 | 51.9 |
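
Since every value in the table is an nDCG@10 score, a minimal sketch of how that metric can be computed for a single query may help when reading it. The relevance labels below are made up for illustration, and the linear gain (the relevance label itself, not `2**rel - 1`) is an assumption; this is not the evaluation code used to produce the table, which reports scores multiplied by 100.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k ranked items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(all_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical relevance labels of the retrieved documents, in ranked order
ranked = [1, 0, 1, 0, 0]
# Hypothetical relevance labels of every judged document for the query
judged = [1, 1, 0, 0, 0]

score = ndcg_at_k(ranked, judged)
print(f"nDCG@10 = {score:.4f}")
# nDCG@10 = 0.9197
```

The ranking above places the second relevant document at rank 3 instead of rank 2, which is why the score falls slightly below 1.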
|
61 |
+
|
62 |
+
## Quickstart
|
63 |
+
|
64 |
+
### Sentence Transformers
|
65 |
+
|
66 |
+
```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("MongoDB/mdbr-leaf-ir")

# Example queries and documents
queries = [
    "What is machine learning?",
    "How does neural network training work?"
]

documents = [
    "Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.",
    "Neural networks are trained through backpropagation, adjusting weights to minimize prediction errors."
]

# Encode queries and documents
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Compute similarity scores
scores = model.similarity(query_embeddings, document_embeddings)

# Print results
for i, query in enumerate(queries):
    print(f"Query: {query}")
    for j, doc in enumerate(documents):
        print(f" Similarity: {scores[i, j]:.4f} | Document {j}: {doc[:80]}...")

# Query: What is machine learning?
# Similarity: 0.6908 | Document 0: Machine learning is a subset of ...
# Similarity: 0.4598 | Document 1: Neural networks are trained ...
#
# Query: How does neural network training work?
# Similarity: 0.4432 | Document 0: Machine learning is a subset of ...
# Similarity: 0.5794 | Document 1: Neural networks are trained ...
```
|
104 |
+
|
105 |
+
### Transformers Usage
|
106 |
+
|
107 |
+
<span style="color:red">CHECK THAT safe_open WORKS WITH URLS; link to code in repo</span>
|
108 |
+
|
109 |
+
<!-- ```python
import torch
import torch.nn.functional as F
from huggingface_hub import hf_hub_download
from safetensors import safe_open
from transformers import AutoModel, AutoTokenizer

MODEL = "MongoDB/mdbr-leaf-ir"

# Load the model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

# Load the dense output layer. safe_open expects a local file path, so the
# weights are downloaded first (assumption: they live at 2_Dense/model.safetensors in the repo)
dense_path = hf_hub_download(MODEL, "2_Dense/model.safetensors")
tensors = {}
with safe_open(dense_path, framework="pt") as f:
    for k in f.keys():
        tensors[k] = f.get_tensor(k)

W_out = torch.nn.Linear(in_features=384, out_features=768, bias=True)
W_out.load_state_dict({
    "weight": tensors["linear.weight"],
    "bias": tensors["linear.bias"]
})

_ = model.eval()
_ = W_out.eval()

# Example queries and documents
queries = [
    "What is machine learning?",
    "How does neural network training work?"
]

documents = [
    "Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.",
    "Neural networks are trained through backpropagation, adjusting weights to minimize prediction errors."
]

# Tokenize
QUERY_PREFIX = 'Represent this sentence for searching relevant passages: '
queries_with_prefix = [QUERY_PREFIX + query for query in queries]

query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)
document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=512)

# Perform inference
with torch.inference_mode():
    y_queries = model(**query_tokens).last_hidden_state
    y_docs = model(**document_tokens).last_hidden_state

    # Mean pooling over non-padding tokens
    y_queries = y_queries * query_tokens.attention_mask.unsqueeze(-1)
    y_queries_pooled = y_queries.sum(dim=1) / query_tokens.attention_mask.sum(dim=1, keepdim=True)

    y_docs = y_docs * document_tokens.attention_mask.unsqueeze(-1)
    y_docs_pooled = y_docs.sum(dim=1) / document_tokens.attention_mask.sum(dim=1, keepdim=True)

    # Map to the desired output dimension
    y_queries_out = W_out(y_queries_pooled)
    y_docs_out = W_out(y_docs_pooled)

    # Normalize
    query_embeddings = F.normalize(y_queries_out, dim=-1)
    document_embeddings = F.normalize(y_docs_out, dim=-1)

similarities = query_embeddings @ document_embeddings.T
print(f"Similarities:\n{similarities}")
# Similarities:
# tensor([[0.6908, 0.4598],
#         [0.4432, 0.5794]])
``` -->
|
175 |
+
|
176 |
+
### Asymmetric Retrieval Setup
|
177 |
+
|
178 |
+
`mdbr-leaf-ir` is *aligned* to [`snowflake-arctic-embed-m-v1.5`](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5), the model it has been distilled from, making the asymmetric system below possible:
|
179 |
+
|
180 |
+
```python
from sentence_transformers import SentenceTransformer

# `queries` and `documents` are the lists defined in the Sentence Transformers example above

# Use a larger model for document encoding (one-time, at index time)
doc_model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v1.5")
document_embeddings = doc_model.encode(documents)

# Use mdbr-leaf-ir for query encoding (real-time, low latency)
query_model = SentenceTransformer("MongoDB/mdbr-leaf-ir")
query_embeddings = query_model.encode(queries, prompt_name="query")

# Compute similarities
scores = query_model.similarity(query_embeddings, document_embeddings)
```
|
192 |
+
Retrieval results from asymmetric mode are usually superior to the [standard mode above](#sentence-transformers).
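
To make the index-time versus query-time split concrete, here is a minimal sketch of a brute-force index that stores document embeddings once and answers top-k queries against them. The `DocumentIndex` class and `normalize` helper are hypothetical names, the random vectors are stand-ins for real model outputs, and the 768-dimensional size is an assumption taken from the dense output layer of this model; a production system would more likely use a vector database.

```python
import numpy as np

def normalize(x):
    """L2-normalize along the last axis so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

class DocumentIndex:
    """Hypothetical brute-force index holding embeddings computed once at index time."""

    def __init__(self, doc_embeddings):
        self.doc_embeddings = normalize(np.asarray(doc_embeddings, dtype=np.float32))

    def search(self, query_embedding, k=3):
        """Return (doc_id, score) pairs for the k most similar documents."""
        q = normalize(np.asarray(query_embedding, dtype=np.float32))
        scores = self.doc_embeddings @ q
        top = np.argsort(-scores)[:k]
        return [(int(i), float(scores[i])) for i in top]

# Stand-in vectors; in practice these would come from doc_model.encode(documents)
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(5, 768))
index = DocumentIndex(doc_vectors)

# Stand-in for query_model.encode(query, prompt_name="query"): a noisy copy of document 2
query_vector = doc_vectors[2] + 0.1 * rng.normal(size=768)
results = index.search(query_vector, k=2)
print(results)
```

Only the query encoder runs at request time in this arrangement; the document side is a precomputed matrix, which is where the latency benefit of a small query model comes from.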
|
193 |
+
|
194 |
+
### MRL
|
195 |
+
|
196 |
+
Embeddings have been trained via [MRL](https://arxiv.org/abs/2205.13147) and can be truncated for more efficient storage:
|
197 |
+
```python
from torch.nn import functional as F

query_embeds = model.encode(queries, prompt_name="query", convert_to_tensor=True)
doc_embeds = model.encode(documents, convert_to_tensor=True)

# Truncate and normalize according to MRL
query_embeds = F.normalize(query_embeds[:, :256], dim=-1)
doc_embeds = F.normalize(doc_embeds[:, :256], dim=-1)

similarities = model.similarity(query_embeds, doc_embeds)

print('After MRL:')
print(f"* Embeddings dimension: {query_embeds.shape[1]}")
print(f"* Similarities:\n\t{similarities}")

# After MRL:
# * Embeddings dimension: 256
# * Similarities:
# tensor([[0.7202, 0.5006],
#         [0.4744, 0.6083]])
```
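
As a rough illustration of what truncation and narrower element types save, the arithmetic below computes the storage cost of a single vector. The full dimension of 768 is an assumption based on the dense output layer of this model, and `bytes_per_vector` is a hypothetical helper; index overhead is ignored.

```python
def bytes_per_vector(dims, bits_per_value):
    """Bytes needed to store one embedding of `dims` values at the given bit width."""
    return dims * bits_per_value // 8

FULL_DIMS = 768       # assumed full output dimension
TRUNCATED_DIMS = 256  # MRL truncation used in the example above

for dims in (FULL_DIMS, TRUNCATED_DIMS):
    for name, bits in (("float32", 32), ("int8", 8), ("int4", 4), ("binary", 1)):
        print(f"{dims:>4} dims, {name:<7}: {bytes_per_vector(dims, bits):>5} bytes")
```

Under these assumptions a full `float32` vector takes 3072 bytes, while a 256-dimensional `binary` vector takes 32 bytes.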
|
219 |
+
|
220 |
+
### Vector Quantization
|
221 |
+
Vector quantization, for example to `int8` or `binary`, can be performed as follows:
|
222 |
+
|
223 |
+
**Note**: For vector quantization to types other than binary, we suggest performing a calibration to determine the optimal ranges, [see here](https://sbert.net/examples/sentence_transformer/applications/embedding-quantization/README.html#scalar-int8-quantization).
|
224 |
+
Good initial values, according to the [teacher model's documentation](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5#compressing-to-128-bytes), are:
|
225 |
+
* `int8`: -0.3 and +0.3
|
226 |
+
* `int4`: -0.18 and +0.18
|
227 |
+
```python
from sentence_transformers.quantization import quantize_embeddings
import torch

query_embeds = model.encode(queries, prompt_name="query")
doc_embeds = model.encode(documents)

# Quantize embeddings to int8 using -0.3 and +0.3 as calibration ranges
ranges = torch.tensor([[-0.3], [+0.3]]).expand(2, query_embeds.shape[1]).cpu().numpy()
query_embeds = quantize_embeddings(query_embeds, "int8", ranges=ranges)
doc_embeds = quantize_embeddings(doc_embeds, "int8", ranges=ranges)

# Calculate similarities; cast to int64 to avoid under/overflow
similarities = query_embeds.astype(int) @ doc_embeds.astype(int).T

print('After quantization:')
print(f"* Embeddings type: {query_embeds.dtype}")
print(f"* Similarities:\n{similarities}")

# After quantization:
# * Embeddings type: int8
# * Similarities:
# [[119073 78877]
# [ 76174 99127]]
```
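
The snippet above covers `int8` only. As a sketch of the idea behind `binary` quantization, the example below keeps just the sign of each dimension, packs eight dimensions into one byte, and compares vectors by Hamming distance. It uses plain NumPy rather than the library call, thresholding at zero is an assumption, the `to_binary` and `hamming_distance` helpers are hypothetical, and the random vectors stand in for real embeddings.

```python
import numpy as np

def to_binary(embeddings):
    """Keep only the sign of each dimension and pack 8 dimensions per byte."""
    bits = (np.asarray(embeddings) > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

def hamming_distance(a, b):
    """Number of differing bits between two packed binary vectors."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

# Stand-in float embeddings; in practice use the output of model.encode(...)
rng = np.random.default_rng(0)
doc_embeds = rng.normal(size=(3, 256)).astype(np.float32)
# Stand-in query: a noisy copy of document 1
query_embed = doc_embeds[1] + 0.2 * rng.normal(size=256).astype(np.float32)

doc_bits = to_binary(doc_embeds)     # shape (3, 32), dtype uint8
query_bits = to_binary(query_embed)  # shape (32,), dtype uint8

# A smaller Hamming distance means a more similar document
distances = [hamming_distance(query_bits, d) for d in doc_bits]
best = int(np.argmin(distances))
print(doc_bits.shape, distances, best)
```

A common pattern is to use the cheap binary comparison to shortlist candidates and then rescore the shortlist with higher-precision vectors.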
|
252 |
+
|
253 |
+
|
254 |
+
## Citation
|
255 |
+
|
256 |
+
If you use this model in your work, please cite:
|
257 |
+
|
258 |
+
```bibtex
@article{mdb_leaf,
  title = {LEAF: Lightweight Embedding Alignment Knowledge Distillation Framework},
  author = {Robin Vujanic and Thomas Rueckstiess},
  year = {2025},
  eprint = {TBD},
  archiveprefix = {arXiv},
  primaryclass = {FILL HERE},
  url = {FILL HERE}
}
```
|
269 |
+
|
270 |
+
## License
|
271 |
+
|
272 |
+
This model is released under the Apache 2.0 <span style="color:red">(TBD)</span> License.
|
273 |
+
|
274 |
+
## Contact
|
275 |
+
|
276 |
+
For questions or issues, please open an issue or pull request. You can also contact the MongoDB ML research team at [email protected].
|
logo.png
ADDED