rvo committed
Commit cdf0a7b · verified · 1 Parent(s): ea98995

update README

Files changed (2)
  1. README.md +276 -148
  2. logo.png +0 -0
README.md CHANGED
@@ -1,148 +1,276 @@
- ---
- tags:
- - sentence-transformers
- - sentence-similarity
- - feature-extraction
- - dense
- base_model: sentence-transformers/all-MiniLM-L6-v2
- pipeline_tag: sentence-similarity
- library_name: sentence-transformers
- ---
-
- # SentenceTransformer based on sentence-transformers/all-MiniLM-L6-v2
-
- This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
-
- ## Model Details
-
- ### Model Description
- - **Model Type:** Sentence Transformer
- - **Base model:** [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) <!-- at revision c9745ed1d9f207416be6d2e6f8de32d1f16199bf -->
- - **Maximum Sequence Length:** 512 tokens
- - **Output Dimensionality:** 768 dimensions
- - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->
-
- ### Model Sources
-
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
-
- ### Full Model Architecture
-
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
-   (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
-   (2): Dense({'in_features': 384, 'out_features': 768, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity'})
-   (3): Normalize()
- )
- ```
-
- ## Usage
-
- ### Direct Usage (Sentence Transformers)
-
- First install the Sentence Transformers library:
-
- ```bash
- pip install -U sentence-transformers
- ```
-
- Then you can load this model and run inference.
- ```python
- from sentence_transformers import SentenceTransformer
-
- # Download from the 🤗 Hub
- model = SentenceTransformer("sentence_transformers_model_id")
- # Run inference
- queries = [
-     "Which planet is known as the Red Planet?",
- ]
- documents = [
-     "Venus is often called Earth's twin because of its similar size and proximity.",
-     'Mars, known for its reddish appearance, is often referred to as the Red Planet.',
-     'Saturn, famous for its rings, is sometimes mistaken for the Red Planet.',
- ]
- query_embeddings = model.encode_query(queries)
- document_embeddings = model.encode_document(documents)
- print(query_embeddings.shape, document_embeddings.shape)
- # [1, 768] [3, 768]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(query_embeddings, document_embeddings)
- print(similarities)
- # tensor([[0.4166, 0.6312, 0.5094]])
- ```
-
- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Framework Versions
- - Python: 3.12.7
- - Sentence Transformers: 5.1.0
- - Transformers: 4.52.4
- - PyTorch: 2.6.0+cu126
- - Accelerate: 1.2.1
- - Datasets: 3.1.0
- - Tokenizers: 0.21.0
-
- ## Citation
-
- ### BibTeX
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
-
- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
+ ---
+ license: apache-2.0
+ base_model:
+ - microsoft/MiniLM-L6-v2
+ tags:
+ - transformers
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - text-embeddings-inference
+ - information-retrieval
+ - knowledge-distillation
+ language:
+ - en
+ ---
+ <div style="display: flex; justify-content: center;">
+     <div style="display: flex; align-items: center; gap: 10px;">
+         <img src="logo.png" alt="MongoDB Logo" style="height: 36px; width: auto;">
+         <span style="font-size: 32px; font-weight: bold">MongoDB/mdbr-leaf-ir</span>
+     </div>
+ </div>
+
+ **mdbr-leaf-ir** is a compact, high-performance text embedding model designed specifically for **information retrieval (IR)** tasks.
+
+ For even greater efficiency, `mdbr-leaf-ir` supports [flexible asymmetric architectures](#asymmetric-retrieval-setup) and is robust to [vector quantization](#vector-quantization) and [MRL truncation](#mrl).
+
+ If you are looking to perform other tasks such as classification, clustering, semantic sentence similarity, or summarization, please check out our [`mdbr-leaf-mt`](https://huggingface.co/MongoDB/mdbr-leaf-mt) model.
+
+ **Note**: This model has been developed by MongoDB Research and is not part of MongoDB's commercial offerings.
+
+ ## Technical Report
+
+ A technical report detailing our proposed `LEAF` training procedure is [available here (TBD)](http://FILL_HERE_ARXIV_LINK).
+
+ ## Highlights
+
+ * **State-of-the-Art Performance**: `mdbr-leaf-ir` achieves new state-of-the-art results for compact embedding models, ranking <span style="color:red">#TBD</span> on the public BEIR benchmark leaderboard for models with fewer than 30M parameters, with an average nDCG@10 score of <span style="color:red">[TBD HERE]</span>.
+ * **Flexible Architecture Support**: `mdbr-leaf-ir` supports asymmetric retrieval architectures, which enable even better retrieval results. [See below](#asymmetric-retrieval-setup) for more information.
+ * **MRL and Quantization Support**: embedding vectors generated by `mdbr-leaf-ir` compress well when truncated (MRL) and/or stored using more compact types such as `int8` and `binary`. [See below](#mrl) for more information.
+
+
+ ## Performance
+
+ ### Benchmark Results
+
+ * Values are nDCG@10.
+ * Scores exclude CQADupstack and MSMARCO; full BEIR results are available on the [public leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
+ * Bold marks cases where our model outperforms the comparison models in either standard or asymmetric mode, as well as cases where a comparison model outperforms our model in standard mode. Blue marks scores where asymmetric mode outperforms standard mode.
+ * `BM25` scores are obtained with `(k₁=0.9, b=0.4)`; a minimal scoring sketch is shown below the table.
+
+ | Model | Size | arg. | fiqa | nfc | scid. | scif. | quora | covid | nq | fever | c-fever | dbp. | hotpot | avg. |
+ |-------|------|------|------|-----|-------|-------|--------|-------|----|-------|---------|------|--------|------|
+ | **`mdbr-leaf-ir` (asym.)** | 23M | **<span style="color:blue">58.5</span>** | **<span style="color:blue">42.1</span>** | **36.1** | <span style="color:blue">20.4</span> | **69.9** | <span style="color:blue">86.2</span> | **<span style="color:blue">83.7</span>** | **<span style="color:blue">61.4</span>** | **<span style="color:blue">86.4</span>** | **<span style="color:blue">37.4</span>** | **<span style="color:blue">44.8</span>** | **<span style="color:blue">69.0</span>** | **<span style="color:blue">58.0</span>** |
+ | **`mdbr-leaf-ir`** | 23M | **56.7** | **38.1** | **36.2** | 19.5 | **70.0** | 71.0 | **83.0** | **58.2** | **85.4** | **32.4** | 43.7 | 68.2 | **55.2** |
+ | **Comparisons** | | | | | | | | | | | | | | |
+ | `snowflake-arctic-embed-xs` | 23M | 52.1 | 34.5 | 30.9 | 18.4 | 64.5 | 86.6 | 79.4 | 54.8 | 83.4 | 29.9 | 40.2 | 65.3 | 53.3 |
+ | `MiniLM-L6-v2` | 23M | 50.2 | 36.9 | 31.6 | **21.6** | 64.5 | **87.6** | 47.2 | 43.9 | 51.9 | 20.3 | 32.3 | 46.5 | 44.5 |
+ | `BM25` | -- | 40.8 | 23.8 | 31.8 | 15.0 | 67.6 | 78.7 | 58.9 | 30.5 | 63.8 | 16.2 | 31.9 | 62.9 | 43.5 |
+ | `SPLADE v2` | 110M | 47.9 | 33.6 | 33.4 | 15.8 | 69.3 | 83.8 | 71.0 | 52.1 | 78.6 | 23.5 | 43.5 | **68.4** | 51.7 |
+ | `ColBERT v2` | 110M | 46.3 | 35.6 | 33.8 | 15.4 | 69.3 | 85.2 | 73.8 | 56.2 | 78.5 | 17.6 | **44.6** | 66.7 | 51.9 |
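+
+ For reference, the `BM25` baseline above uses the parameters `k₁=0.9, b=0.4`. The snippet below is a minimal, illustrative sketch of this kind of lexical scoring using the `rank_bm25` package; the package choice is an assumption made for illustration only, and the reported leaderboard scores were not necessarily produced with it.
+
+ ```python
+ from rank_bm25 import BM25Okapi
+
+ # Whitespace-tokenized toy corpus (reusing the example documents from the Quickstart below)
+ documents = [
+     "Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.",
+     "Neural networks are trained through backpropagation, adjusting weights to minimize prediction errors."
+ ]
+ tokenized_corpus = [doc.lower().split() for doc in documents]
+
+ # Same BM25 parameters as in the benchmark note above
+ bm25 = BM25Okapi(tokenized_corpus, k1=0.9, b=0.4)
+
+ # One lexical relevance score per document
+ print(bm25.get_scores("what is machine learning".split()))
+ ```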
+
+ ## Quickstart
+
+ ### Sentence Transformers
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Load the model
+ model = SentenceTransformer("MongoDB/mdbr-leaf-ir")
+
+ # Example queries and documents
+ queries = [
+     "What is machine learning?",
+     "How does neural network training work?"
+ ]
+
+ documents = [
+     "Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.",
+     "Neural networks are trained through backpropagation, adjusting weights to minimize prediction errors."
+ ]
+
+ # Encode queries and documents
+ query_embeddings = model.encode(queries, prompt_name="query")
+ document_embeddings = model.encode(documents)
+
+ # Compute similarity scores
+ scores = model.similarity(query_embeddings, document_embeddings)
+
+ # Print results
+ for i, query in enumerate(queries):
+     print(f"Query: {query}")
+     for j, doc in enumerate(documents):
+         print(f"  Similarity: {scores[i, j]:.4f} | Document {j}: {doc[:80]}...")
+
+ # Query: What is machine learning?
+ #   Similarity: 0.6908 | Document 0: Machine learning is a subset of ...
+ #   Similarity: 0.4598 | Document 1: Neural networks are trained ...
+ #
+ # Query: How does neural network training work?
+ #   Similarity: 0.4432 | Document 0: Machine learning is a subset of ...
+ #   Similarity: 0.5794 | Document 1: Neural networks are trained ...
+ ```
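+
+ For retrieval over a larger corpus, the embeddings above can be plugged into `sentence_transformers.util.semantic_search` to obtain top-k results per query. This is a minimal sketch reusing the variables from the Quickstart snippet, not an additional required step:
+
+ ```python
+ from sentence_transformers import util
+
+ # Top-k retrieval over the document embeddings computed above
+ hits = util.semantic_search(query_embeddings, document_embeddings, top_k=2)
+
+ for query, query_hits in zip(queries, hits):
+     print(f"Query: {query}")
+     for hit in query_hits:
+         print(f"  {hit['score']:.4f}  {documents[hit['corpus_id']][:60]}...")
+ ```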
+
+ ### Transformers Usage
+
+ <span style="color:red">CHECK THAT safe_open WORKS WITH URLS; link to code in repo</span>
+
+ <!-- ```python
+ import torch
+ from torch.nn import functional as F
+ from huggingface_hub import hf_hub_download
+ from safetensors import safe_open
+ from transformers import AutoModel, AutoTokenizer
+
+ MODEL = "MongoDB/mdbr-leaf-ir"
+
+ # Load the model
+ tokenizer = AutoTokenizer.from_pretrained(MODEL)
+ model = AutoModel.from_pretrained(MODEL)
+
+ # Load the dense output projection (the 2_Dense module of the Sentence Transformers pipeline)
+ dense_path = hf_hub_download(repo_id=MODEL, filename="2_Dense/model.safetensors")
+ tensors = {}
+ with safe_open(dense_path, framework="pt") as f:
+     for k in f.keys():
+         tensors[k] = f.get_tensor(k)
+
+ W_out = torch.nn.Linear(in_features=384, out_features=768, bias=True)
+ W_out.load_state_dict({
+     "weight": tensors["linear.weight"],
+     "bias": tensors["linear.bias"]
+ })
+
+ _ = model.eval()
+ _ = W_out.eval()
+
+ # Example queries and documents
+ queries = [
+     "What is machine learning?",
+     "How does neural network training work?"
+ ]
+
+ documents = [
+     "Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data.",
+     "Neural networks are trained through backpropagation, adjusting weights to minimize prediction errors."
+ ]
+
+ # Tokenize
+ QUERY_PREFIX = 'Represent this sentence for searching relevant passages: '
+ queries_with_prefix = [QUERY_PREFIX + query for query in queries]
+
+ query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)
+ document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=512)
+
+ # Perform inference
+ with torch.inference_mode():
+     y_queries = model(**query_tokens).last_hidden_state
+     y_docs = model(**document_tokens).last_hidden_state
+
+     # Mean pooling over non-padding tokens
+     y_queries = y_queries * query_tokens.attention_mask.unsqueeze(-1)
+     y_queries_pooled = y_queries.sum(dim=1) / query_tokens.attention_mask.sum(dim=1, keepdim=True)
+
+     y_docs = y_docs * document_tokens.attention_mask.unsqueeze(-1)
+     y_docs_pooled = y_docs.sum(dim=1) / document_tokens.attention_mask.sum(dim=1, keepdim=True)
+
+     # Map to the desired output dimension
+     y_queries_out = W_out(y_queries_pooled)
+     y_docs_out = W_out(y_docs_pooled)
+
+     # Normalize
+     query_embeddings = F.normalize(y_queries_out, dim=-1)
+     document_embeddings = F.normalize(y_docs_out, dim=-1)
+
+ similarities = query_embeddings @ document_embeddings.T
+ print(f"Similarities:\n{similarities}")
+ # Similarities:
+ # tensor([[0.6908, 0.4598],
+ #         [0.4432, 0.5794]])
+ ``` -->
+
+ ### Asymmetric Retrieval Setup
+
+ `mdbr-leaf-ir` is *aligned* to [`snowflake-arctic-embed-m-v1.5`](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5), the model it has been distilled from. This makes the asymmetric setup below possible:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Use a larger model for document encoding (one-time cost, at index time)
+ doc_model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v1.5")
+ document_embeddings = doc_model.encode(documents)
+
+ # Use mdbr-leaf-ir for query encoding (real-time, low latency)
+ query_model = SentenceTransformer("MongoDB/mdbr-leaf-ir")
+ query_embeddings = query_model.encode(queries, prompt_name="query")
+
+ # Compute similarities
+ scores = query_model.similarity(query_embeddings, document_embeddings)
+ ```
+
+ Retrieval results from asymmetric mode are usually superior to those from the [standard mode above](#sentence-transformers).
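+
+ A quick way to probe this alignment is to embed the same text with both models and compare the resulting vectors; the sketch below is purely illustrative and reuses the `query_model` and `doc_model` defined above:
+
+ ```python
+ import torch
+
+ text = ["Mars is often referred to as the Red Planet."]
+
+ e_leaf = query_model.encode(text, convert_to_tensor=True)     # 768-dim student embedding
+ e_teacher = doc_model.encode(text, convert_to_tensor=True)    # 768-dim teacher embedding
+
+ # Cosine similarity between the two models' embeddings of the same sentence;
+ # for an aligned student/teacher pair this should be high
+ print(torch.nn.functional.cosine_similarity(e_leaf, e_teacher))
+ ```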
+
+ ### MRL
+
+ Embeddings have been trained via [MRL](https://arxiv.org/abs/2205.13147) and can be truncated for more efficient storage:
+
+ ```python
+ from torch.nn import functional as F
+
+ query_embeds = model.encode(queries, prompt_name="query", convert_to_tensor=True)
+ doc_embeds = model.encode(documents, convert_to_tensor=True)
+
+ # Truncate and normalize according to MRL
+ query_embeds = F.normalize(query_embeds[:, :256], dim=-1)
+ doc_embeds = F.normalize(doc_embeds[:, :256], dim=-1)
+
+ similarities = model.similarity(query_embeds, doc_embeds)
+
+ print('After MRL:')
+ print(f"* Embeddings dimension: {query_embeds.shape[1]}")
+ print(f"* Similarities:\n\t{similarities}")
+
+ # After MRL:
+ # * Embeddings dimension: 256
+ # * Similarities:
+ #       tensor([[0.7202, 0.5006],
+ #               [0.4744, 0.6083]])
+ ```
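+
+ Alternatively, recent versions of Sentence Transformers can truncate for you at load time via the `truncate_dim` argument. The snippet below is a sketch of that equivalent-in-spirit approach, not an additional required step:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Request 256-dimensional embeddings directly at load time
+ model_256 = SentenceTransformer("MongoDB/mdbr-leaf-ir", truncate_dim=256)
+
+ query_embeds = model_256.encode(queries, prompt_name="query")
+ print(query_embeds.shape)  # (2, 256)
+ ```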
+
+ ### Vector Quantization
+
+ Vector quantization, for example to `int8` or `binary`, can be performed as follows:
+
+ **Note**: For vector quantization to types other than binary, we suggest performing a calibration to determine the optimal ranges, [see here](https://sbert.net/examples/sentence_transformer/applications/embedding-quantization/README.html#scalar-int8-quantization).
+ Good initial values, according to the [teacher model's documentation](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5#compressing-to-128-bytes), are:
+ * `int8`: -0.3 and +0.3
+ * `int4`: -0.18 and +0.18
+
+ ```python
+ from sentence_transformers.quantization import quantize_embeddings
+ import torch
+
+ query_embeds = model.encode(queries, prompt_name="query")
+ doc_embeds = model.encode(documents)
+
+ # Quantize embeddings to int8 using -0.3 and +0.3 as calibration ranges
+ ranges = torch.tensor([[-0.3], [+0.3]]).expand(2, query_embeds.shape[1]).cpu().numpy()
+ query_embeds = quantize_embeddings(query_embeds, "int8", ranges=ranges)
+ doc_embeds = quantize_embeddings(doc_embeds, "int8", ranges=ranges)
+
+ # Calculate similarities; cast to int64 to avoid under/overflow
+ similarities = query_embeds.astype(int) @ doc_embeds.astype(int).T
+
+ print('After quantization:')
+ print(f"* Embeddings type: {query_embeds.dtype}")
+ print(f"* Similarities:\n{similarities}")
+
+ # After quantization:
+ # * Embeddings type: int8
+ # * Similarities:
+ # [[119073  78877]
+ #  [ 76174  99127]]
+ ```
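+
+ Binary quantization, mentioned above as an option, works similarly. The sketch below uses the same `quantize_embeddings` helper with the `"ubinary"` precision and ranks documents by Hamming distance; it is an illustration of the approach, not a measured configuration:
+
+ ```python
+ import numpy as np
+ from sentence_transformers.quantization import quantize_embeddings
+
+ query_embeds = model.encode(queries, prompt_name="query")
+ doc_embeds = model.encode(documents)
+
+ # Pack each 768-dim float vector into 96 bytes (1 bit per dimension)
+ query_bits = quantize_embeddings(query_embeds, "ubinary")
+ doc_bits = quantize_embeddings(doc_embeds, "ubinary")
+ print(query_bits.dtype, query_bits.shape)  # uint8 (2, 96)
+
+ # Rank documents by Hamming distance (lower = more similar)
+ hamming = (np.unpackbits(query_bits[:, None, :], axis=-1)
+            != np.unpackbits(doc_bits[None, :, :], axis=-1)).sum(axis=-1)
+ print(hamming)
+ ```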
+
+
+ ## Citation
+
+ If you use this model in your work, please cite:
+
+ ```bibtex
+ @article{mdb_leaf,
+     title = {LEAF: Lightweight Embedding Alignment Knowledge Distillation Framework},
+     author = {Robin Vujanic and Thomas Rueckstiess},
+     year = {2025},
+     eprint = {TBD},
+     archiveprefix = {arXiv},
+     primaryclass = {FILL HERE},
+     url = {FILL HERE}
+ }
+ ```
+
+ ## License
+
+ This model is released under the Apache 2.0 <span style="color:red">(TBD)</span> license.
+
+ ## Contact
+
+ For questions or issues, please open an issue or pull request. You can also contact the MongoDB ML research team at [email protected].
logo.png ADDED