---
license: mit
datasets:
- BAAI/Infinity-Instruct
- HuggingFaceFW/fineweb-edu
language:
- en
base_model:
- answerdotai/ModernBERT-large
---

## 1 Introduction

Cooperating with [Richinfo](https://www.richinfo.cn/index.html), we trained this released model with a novel approach. While we have not yet fully understood the underlying principles, we have achieved promising results, so we have decided to open-source the model and hope that **someone will test the model and provide us with feedback!**

**The technical report will be completed this week.**

The core training method of this model will be implemented in the [RAG-Retrieval repository](https://github.com/NovaSearch-Team/RAG-Retrieval) open sourced by the NovaSearch Team, welcome to star!

This model is based on [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large). An excellent model; thanks to the authors for sharing it!

The embedding model has the following features:

1. The max length is 128k tokens, the parameter size is 395M, and only English is supported.
2. It supports both single-vector and multi-vector representations (similar to ColBERT, but with far fewer vectors, only 0.5% of the number of tokens).
3. It achieves quite impressive results on the short-text evaluation (MTEB-eng-v2) without using the MTEB training set, even surpassing several 7B-sized models.
4. On the long-text evaluation LongEmbed, the single-vector setting already surpasses many large and commercial models. With multi-vectors, the average score takes first place: our score is 0.86, while the previous first-place score is 0.79.
5. Ultra-fast encoding speed: benefiting from the architectural advantages of ModernBERT, encoding long texts is still very fast.
6. A highly flexible multi-vector combination method: the multi-vectors operate at the span (i.e. chunk) level rather than the token level, so how chunks are specified can be fully customized to your own scenario.

## 2 Usage

We suggest you read the following contents together with the model architecture diagram.



We do hope you read `modeling_dewey_v1.py` and `custom_st.py` carefully; this code is easy to read and will help you a lot!

### 2.1 Prompts

Our model is an instruct-embedding model: when using it, you should prepend a prompt to the text.

For the **retrieval task**, you **MUST** use our provided prompts:\
query: `<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>`\
passage: `<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>`

For the **STS task**, you **MUST** use our provided prompt:\
`<|START_INSTRUCTION|>Generate semantically similar text<|END_INSTRUCTION|>`

For **classification and clustering tasks**, you should design your own prompt; below are some examples:\
`<|START_INSTRUCTION|>Classify text into intents<|END_INSTRUCTION|>`\
`<|START_INSTRUCTION|>Classify text into toxic or not toxic<|END_INSTRUCTION|>`\
`<|START_INSTRUCTION|>Output main category of Medrxiv papers based on the titles<|END_INSTRUCTION|>`\
`<|START_INSTRUCTION|>Output topic or theme of news articles<|END_INSTRUCTION|>`
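
A prompted input is simply the prompt string concatenated directly in front of the text, with no separator (the code in section 2.2 below does the same with f-strings):

```python
# a minimal illustration of how prompts are combined with text
RETRIEVE_Q_PROMPT = "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>"
RETRIEVE_P_PROMPT = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"

query_input = RETRIEVE_Q_PROMPT + "why the sky is blue"
passage_input = RETRIEVE_P_PROMPT + "Shorter wavelengths of light are scattered more by the atmosphere."
```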

### 2.2 Single Vector

For single-vector usage, our model is compatible with `SentenceTransformer`.

```python
import os

# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
import torch
from sentence_transformers import SentenceTransformer

RETRIEVE_Q_PROMPT = "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>"
RETRIEVE_P_PROMPT = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"
model = SentenceTransformer(
    "infgrad/dewey_en_beta",
    trust_remote_code=True,
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "attn_implementation": "flash_attention_2"
    },
    config_kwargs={"single_vector_type": "mean"}
).cuda().bfloat16().eval()
# the choice of single_vector_type:
## for short text (<1k): cls_add_mean
## for long text (>1k): mean

# the max length of the model is 128*1024
model.max_seq_length = 32 * 1024

query_vectors = model.encode(
    sentences=[f"{RETRIEVE_Q_PROMPT}What is a computer composed of?", f"{RETRIEVE_Q_PROMPT}why the sky is blue"]
)
passage_vectors = model.encode(
    sentences=[
        f"{RETRIEVE_P_PROMPT}Central processing unit (CPU), memory (RAM), storage (hard drive or SSD), input/output devices (keyboard, mouse, monitor), and a motherboard",
        f"{RETRIEVE_P_PROMPT}Shorter wavelengths of light, such as blue and violet, are scattered more by gases and particles in Earth's atmosphere.",
    ]
)

print(query_vectors @ passage_vectors.T)
# the output is:
# [[0.52512825 0.19771025]
#  [0.17617573 0.5918883 ]]
```

### 2.3 Multi Vectors

Our multi vectors are based on text spans (i.e. chunks), so each vector can be considered a contextual chunk vector.
**In order to get the multi vectors of a document, you should first get its chunks and their spans.**

Below are the detailed steps to get multi vectors:

**Step 1:** Chunk the document to get chunks and spans. This can be done with our `encode` function, or you can chunk documents yourself according to your scenario.\
**Note that if you decide to chunk by yourself, your chunks and spans should not contain the prompt!!!**\
**Step 2:** Encode the text to get its token embeddings.\
**Step 3:** Use the span (i.e. start_position and end_position) to get the chunk vector; we use the mean of the span's token embeddings as the chunk vector, i.e. normalize(token_embed[start_position:end_position].mean(axis=0)). A sketch follows this list.\
**Step 4:** Repeat Step 3 for each span until you have all chunk vectors. You can also add span (0, 1) and span (1+prompt_len, text_len-1) to get global vectors.
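
Below is a minimal NumPy sketch of Step 3, assuming you already have one document's token embeddings as an array of shape (seq_len, dim); it only illustrates the pooling, it is not the model's internal implementation:

```python
import numpy as np

def chunk_vector(token_embed: np.ndarray, start: int, end: int) -> np.ndarray:
    """Mean-pool the span's token embeddings, then L2-normalize the result."""
    v = token_embed[start:end].mean(axis=0)
    return v / np.linalg.norm(v)

# toy example: 10 tokens with 2048-dim embeddings, a chunk spanning tokens 2..7
token_embed = np.random.randn(10, 2048).astype(np.float32)
vec = chunk_vector(token_embed, 2, 7)
print(vec.shape)  # (2048,)
```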

For retrieval tasks, the query vector should be a **single vector**, so the final score between a query and a document is the maximum score of the query against every document vector.
This is compatible with FAISS, Milvus, and so on: just enlarge the top-k and de-duplicate the retrieved documents.
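
As a concrete illustration, here is a minimal NumPy sketch of this scoring scheme (the function names are ours): the query is one normalized vector, each document is a matrix of normalized chunk vectors, and the de-duplication step mirrors the "enlarge the top-k and de-duplicate" advice above:

```python
import numpy as np

def doc_score(query_vec: np.ndarray, doc_chunk_vecs: np.ndarray) -> float:
    """query_vec: (dim,); doc_chunk_vecs: (num_chunks, dim); all vectors normalized."""
    # final query-document score is the max over the document's chunk vectors
    return float((doc_chunk_vecs @ query_vec).max())

# with a vector database, index every chunk vector under its document id,
# retrieve an enlarged top-k of chunks, then keep the best score per document
def dedup_by_doc(doc_ids: list, scores: list) -> list:
    best = {}
    for doc_id, s in zip(doc_ids, scores):
        best[doc_id] = max(s, best.get(doc_id, float("-inf")))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```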

Below are detailed code examples.

#### 2.3.1 Chunk text in the `encode` function

You can directly use the `encode` method of our model to get multi vectors.
This method chunks text automatically.
You can choose the chunking strategy via the `fast_chunk` parameter: if `fast_chunk` is true, the text is chunked directly on the input ids; otherwise `RecursiveCharacterTextSplitter` is used.

```python
import os
import numpy as np

# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from pydantic import BaseModel
from typing import Optional, List
from transformers import AutoTokenizer, AutoModel


class TextSpan(BaseModel):
    s: int  # start token position of the span
    e: int  # end token position of the span
    text: Optional[str] = None
    module_name: str  # "cls_linear" for the cls vector, "chunk_linear" for chunk vectors


RETRIEVE_Q_PROMPT = "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>"
RETRIEVE_P_PROMPT = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"
model = AutoModel.from_pretrained(
    "infgrad/dewey_en_beta",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).cuda().bfloat16()
model.tokenizer = AutoTokenizer.from_pretrained("infgrad/dewey_en_beta")
max_seq_length = 32 * 1024

q_list = ["why the sky is blue"]
p_list = [
    """
I’ve been trying to understand why the sky changes colors, and I think I understand most of it, but something in the online explanations doesn’t make it clear for me:

I’ve read:

sky is blue because blue light gets scattered the most during the day.

in the evening it turns red because now even more of the blue light gets scattered

So a few questions:

The scattering of light during the day: does it mean that blue light gets reflected off air particles and reaches our eyes, while the rest of the frequencies pass through and reach the ground?

Surely some of the other frequencies also get scattered during the day, just in much smaller amounts?

So during the evening blue light gets scattered even more, to the point where even less of it reaches the eyes?

And so it gets red because now we can see the lower frequencies being scattered without blue overshadowing them?

Trying to word it myself: during the day only the highest frequencies get filtered, but during the evening also lower frequencies get filtered, because now the “light strainer” (air) is just catching more of it?

It gets darker in the evening without a good ability to see colors because there’s is no blue and so on light to reflect off of objects?

Is it ok to speak about light as a frequency? Or it’s only correct to say “wave length”?

Blue light is scattered in all directions by the tiny molecules of air in Earth's atmosphere. Blue is scattered more than other colors because it travels as shorter, smaller waves.
This is why we see a blue sky most of the time. Closer to the horizon, the sky fades to a lighter blue or white.
"""
]

# a query should be encoded into a single vector, so we set chunk_size to -1 to avoid chunking.
# If chunk_size is -1, the model returns an array of shape (2, 2048) per text, consisting of the cls_vector and the mean_vector (mean of all token embeddings).
query_vectors = model.encode(
    sentences=q_list,
    use_cuda=True,
    show_progress_bar=True,
    chunk_size=-1,
    chunk_overlap=32,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=RETRIEVE_Q_PROMPT,
    fast_chunk=False
)[0]
# a query does not need multi vectors; we only use the mean vector (row 1) as the final single vector
query_single_vectors = [vecs[1:2, :] for vecs in query_vectors]

# spans_list contains each chunk's span; you can use a span to recover the chunk's text
spans_list: List[List[TextSpan]]
passage_vectors_list: List[np.ndarray]
passage_vectors_list, spans_list = model.encode(
    sentences=p_list,
    use_cuda=True,
    show_progress_bar=True,
    chunk_size=64,
    chunk_overlap=8,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=RETRIEVE_P_PROMPT,
    fast_chunk=True,  # if fast_chunk is true, chunk directly on input ids; otherwise use RecursiveCharacterTextSplitter
)
# spans_list stores each passage's spans and passage_vectors_list stores each passage's vectors,
# so len(spans_list) == len(p_list) == len(passage_vectors_list).
# Within one passage, each span corresponds to one vector (1*2048), so len(spans_list[idx]) == len(passage_vectors_list[idx])
print((query_vectors[0] @ passage_vectors_list[0].T).max())
# output 0.7331543
# get each chunk's content
for spans, passage in zip(spans_list, p_list):
    text_ids = model.tokenizer.encode(RETRIEVE_P_PROMPT + passage)
    for span in spans:
        s, e = span.s, span.e
        chunk_text = model.tokenizer.decode(
            text_ids[s:e],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True
        ).strip()
```

Please read the docstring of the `encode` method for more information.

#### 2.3.2 Chunk text by yourself

If you want to chunk text by yourself, just set the `batch_text_spans` parameter in the `encode` function.

```python
import os
import numpy as np

# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from pydantic import BaseModel
from typing import Optional, List
from transformers import AutoTokenizer, AutoModel


class TextSpan(BaseModel):
    s: int  # start token position of the span
    e: int  # end token position of the span
    text: Optional[str] = None
    module_name: str  # "cls_linear" for the cls vector, "chunk_linear" for chunk vectors


prompt = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"

# load model
model = AutoModel.from_pretrained(
    "infgrad/dewey_en_beta",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
)
model.tokenizer = AutoTokenizer.from_pretrained("infgrad/dewey_en_beta")
max_seq_length = 32 * 1024

# chunk text
passage = "this sentence 1. this sentence 2. this sentence 3"
chunks = ["this sentence 1. this sentence 2.", "this sentence 2. this sentence 3"]
prompt_length = len(model.tokenizer.tokenize(prompt))
text_spans = [
    # s=0, e=1 marks the cls vector, so the module_name is cls_linear; for chunk vectors the module_name is chunk_linear
    TextSpan(s=0, e=1, module_name="cls_linear")
]
for chunk in chunks:
    s = passage.find(chunk)
    e = s + len(chunk)
    text_spans.append(
        TextSpan(
            # add 1, as there is a [CLS] token at the beginning of the text
            s=1 + prompt_length + len(model.tokenizer.tokenize(passage[:s])),
            e=1 + prompt_length + len(model.tokenizer.tokenize(passage[:e])),
            module_name="chunk_linear"
        )
    )

passage_vectors_list: List[np.ndarray]
passage_vectors_list, _ = model.encode(
    sentences=[passage],
    use_cuda=False,
    show_progress_bar=True,
    chunk_size=64,
    chunk_overlap=12,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=prompt,
    fast_chunk=True,
    batch_text_spans=[text_spans]
)
print(passage_vectors_list[0].shape, passage_vectors_list[0][:, 2])
# the output is (3, 2048) [0.01461297 0.02085092 0.0022509 ]
```
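
To use these custom chunk vectors for retrieval, score them against a query's single vector exactly as in section 2.3.1. A brief sketch, reusing the `model` object from the example above (the query text is our own toy example):

```python
RETRIEVE_Q_PROMPT = "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>"
# chunk_size=-1 returns [cls_vector, mean_vector] per query; row 1 is the mean vector
query_vectors = model.encode(
    sentences=["what do these sentences talk about?"],
    use_cuda=False,
    show_progress_bar=True,
    chunk_size=-1,
    chunk_overlap=12,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=RETRIEVE_Q_PROMPT,
    fast_chunk=False
)[0]
# final query-document score: max similarity over the document's chunk vectors
print((query_vectors[0][1:2, :] @ passage_vectors_list[0].T).max())
```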

## 3 Evaluation

### 3.1 MTEB(eng, v2)

URL: http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28eng%2C+v2%29

Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_mteb_dewey_en_beta.py

| **Model** | **Zero-shot** | **Parameters** | **Dimensions** | **Max Tokens** | **Mean (Task)** | **Mean (TaskType)** | **Classification** | **Clustering** | **Pair Classification** | **Reranking** | **Retrieval** | **STS** | **Summarization** |
|:------------------------------------------------------------------------------------------------------------------------:|:-------------:|:--------------:|:--------------:|:--------------:|:---------------:|:-------------------:|:------------------:|:--------------:|:-----------------------:|:-------------:|:-------------:|:-------:|:-----------------:|
| [gemini-embedding-exp-03-07](https://developers.googleblog.com/en/gemini-embedding-text-model-now-available-gemini-api/) | 95% | Unknown | 3072 | 8192 | 73.3 | 67.67 | 90.05 | 59.39 | 87.7 | 48.59 | 64.35 | 85.29 | 38.28 |
| [jasper_en_vision_language_v1](https://huggingface.co/NovaSearch/jasper_en_vision_language_v1) | 56% | 1B | 8960 | 131072 | 71.41 | 66.65 | 90.27 | 60.52 | 88.14 | 50 | 56.05 | 84.37 | 37.19 |
| [gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct) | NA | 7B | 3584 | 32768 | 70.72 | 65.77 | 88.52 | 58.97 | 85.9 | 50.47 | 58.09 | 82.69 | 35.74 |
| [stella_en_1.5B_v5](https://huggingface.co/NovaSearch/stella_en_1.5B_v5) | 56% | 1B | 8960 | 131072 | 69.43 | 65.32 | 89.38 | 57.06 | 88.02 | 50.19 | 52.42 | 83.27 | 36.91 |
| [SFR-Embedding-2_R](https://huggingface.co/Salesforce/SFR-Embedding-2_R) | 85% | 7B | 4096 | 32768 | 69.82 | 65.31 | 90.54 | 59.39 | 88.09 | 48.99 | 53.75 | 80.86 | 35.54 |
| [Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) | 95% | 7B | 4096 | 32768 | 69.8 | 65.29 | 83 | 54.07 | 88.44 | 49.44 | 60.14 | 84.69 | 37.26 |
| [NV-Embed-v2](https://huggingface.co/nvidia/NV-Embed-v2) | 56% | 7B | 4096 | 32768 | 69.81 | 65 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
| [SFR-Embedding-Mistral](https://huggingface.co/Salesforce/SFR-Embedding-Mistral) | 85% | 7B | 4096 | 32768 | 69.31 | 64.94 | 80.47 | 54.93 | 88.59 | 50.15 | 59.33 | 84.77 | 36.32 |
| [stella_en_400M_v5](https://huggingface.co/NovaSearch/stella_en_400M_v5) | 56% | 435M | 4096 | 8192 | 69.39 | 64.84 | 88.25 | 57.65 | 87.17 | 49.6 | 52.73 | 83.93 | 34.53 |
| [text-embedding-004](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings) | 95% | Unknown | 768 | 2048 | 69.53 | 64.82 | 86.03 | 51.52 | 87.65 | 48.48 | 59.06 | 84.84 | 36.12 |
| [text-embedding-005](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings) | 95% | Unknown | 768 | 2048 | 69.6 | 64.77 | 86.03 | 51.91 | 87.62 | 48.84 | 58.77 | 85.18 | 35.05 |
| [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) | 95% | 7B | 4096 | 32768 | 67.97 | 64 | 79.85 | 51.44 | 88.42 | 49.78 | 57.62 | 84.32 | 36.57 |
| [text-multilingual-embedding-002](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings) | 95% | Unknown | 768 | 2048 | 67.67 | 63.52 | 84.65 | 50.41 | 86.6 | 47.48 | 54.7 | 83.94 | 36.84 |
| [NV-Embed-v1](https://huggingface.co/nvidia/NV-Embed-v1) | 56% | 7B | 4096 | 32768 | 68.32 | 63.37 | 84.11 | 49.5 | 87.05 | 49.16 | 60.13 | 82.2 | 31.4 |
| **[infgrad/dewey_en_beta](https://huggingface.co/infgrad/dewey_en_beta)** | 95% | 395M | 2048 | 131072 | 68 | 63.30 | 81.83 | 51.75 | 86.82 | 46.35 | 56.32 | 84.21 | 35.79 |
| [gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) | NA | 1B | 8960 | 32768 | 67.2 | 63.26 | 85.84 | 53.54 | 87.52 | 49.25 | 50.25 | 82.51 | 33.94 |
| [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) | 95% | 7B | 4096 | 4096 | 67.07 | 63.22 | 81.25 | 50.82 | 87.29 | 49.59 | 54.95 | 83.03 | 35.65 |
| [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B) | 95% | 57B | 4096 | 4096 | 66.16 | 62.42 | 79.98 | 51.48 | 85.23 | 49.22 | 52.46 | 82.93 | 35.65 |
| [text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/) | NA | Unknown | 3072 | 8191 | 66.43 | 62.15 | 79.15 | 48.9 | 85.81 | 47.45 | 57.98 | 81.44 | 34.31 |
| [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) | 100% | 335M | 1024 | 512 | 66.26 | 62.04 | 79.1 | 47.48 | 87.2 | 48.05 | 55.4 | 84.42 | 32.63 |
| [GIST-large-Embedding-v0](https://huggingface.co/avsolatorio/GIST-large-Embedding-v0) | 80% | 335M | 1024 | 512 | 66.25 | 61.96 | 78.91 | 48.84 | 86.7 | 48.76 | 54.52 | 84.44 | 31.52 |
| [bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 100% | 335M | 1024 | 512 | 65.89 | 61.87 | 78.34 | 48.01 | 87.13 | 48.26 | 55.44 | 82.79 | 33.13 |
| [UAE-Large-V1](https://huggingface.co/WhereIsAI/UAE-Large-V1) | 100% | 335M | 1024 | 512 | 66.4 | 61.85 | 79.08 | 47.86 | 87.25 | 48.35 | 55.91 | 84.37 | 30.13 |

### 3.2 LongEmbed

URL: http://mteb-leaderboard.hf.space/?benchmark_name=LongEmbed

Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_long_embed.py

| **Model** | **Zero-shot** | **Number of Parameters** | **Embedding Dimensions** | **Max Tokens** | **Mean (Task)** | **Mean (TaskType)** | **Retrieval** |
|:-------------------------------------------------------------------------------------------------------------------------:|:-------------:|:------------------------:|:------------------------:|:--------------:|:---------------:|:-------------------:|:-------------:|
| **[infgrad/dewey_en_beta-MultiVectors](https://huggingface.co/infgrad/dewey_en_beta)** | 100% | 395M | 2048 | 131072 | 86.59 | 86.59 | 86.59 |
| [voyage-multilingual-2](https://blog.voyageai.com/2024/06/10/voyage-multilingual-2-multilingual-embedding-model/) | 100% | Unknown | 1024 | 32000 | 79.17 | 79.17 | 79.17 |
| [voyage-law-2](https://blog.voyageai.com/2024/04/15/domain-specific-embeddings-and-retrieval-legal-edition-voyage-law-2/) | 100% | Unknown | 1024 | 16000 | 78.85 | 78.85 | 78.85 |
| **[infgrad/dewey_en_beta-SingleVector](https://huggingface.co/infgrad/dewey_en_beta)** | 100% | 395M | 2048 | 131072 | 77.98 | 77.98 | 77.98 |
| [voyage-3](https://blog.voyageai.com/2024/09/18/voyage-3/) | 100% | Unknown | 1024 | 32000 | 74.06 | 74.06 | 74.06 |
| [inf-retriever-v1](https://huggingface.co/infly/inf-retriever-v1) | 100% | 7B | 3584 | 32768 | 73.19 | 73.19 | 73.19 |

### 3.3 LoCoV1

URL: https://huggingface.co/datasets/hazyresearch/LoCoV1-Queries\
https://huggingface.co/datasets/hazyresearch/LoCoV1-Documents

Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_loco.py

Metric: NDCG@10
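
For reference, NDCG@10 is the standard normalized discounted cumulative gain truncated at rank 10, where IDCG@10 is the DCG@10 of the ideal ranking:

$$
\mathrm{NDCG@10}=\frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}},\qquad \mathrm{DCG@10}=\sum_{i=1}^{10}\frac{2^{\mathrm{rel}_i}-1}{\log_2(i+1)}
$$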

Result:

| **dataset-name** | **bge-m3-8k** | **gte-modernbert-base-8k** | **Linq-Embed-Mistral-4k** | **Linq-Embed-Mistral-8k** | **SFR-Embedding-Mistral-8k** | **e5-mistral-7b-instruct-8k** | **dewey_en_beta-8k** | **dewey_en_beta-64k** | **dewey_en_beta-64k-multi-vectors** |
|:---------------------------------:|:-------------:|:--------------------------:|:-------------------------:|:-------------------------:|:----------------------------:|:-----------------------------:|:--------------------:|:------------------------:|:--------------------------------------:|
| **2wikimqa_test** | 0.9271 | 0.8658 | 0.8884 | 0.9067 | 0.8965 | 0.8901 | 0.8953 | 0.9051 | 0.9775 |
| **courtlistener_HTML_test** | 0.1933 | 0.2349 | 0.3551 | 0.3670 | 0.3647 | 0.3543 | 0.3415 | 0.3616 | 0.4775 |
| **courtlistener_Plain_Text_test** | 0.1888 | 0.2478 | 0.3675 | 0.3761 | 0.3679 | 0.3579 | 0.3377 | 0.3485 | 0.4426 |
| **gov_report_test** | 0.9869 | 0.9750 | 0.9832 | 0.9837 | 0.9816 | 0.9823 | 0.9855 | 0.9883 | 0.9853 |
| **legal_case_reports_test** | 0.3702 | 0.4476 | 0.5398 | 0.5432 | 0.5319 | 0.4850 | 0.5474 | 0.5875 | 0.6534 |
| **multifieldqa_test** | 0.9373 | 0.9341 | 0.9345 | 0.9327 | 0.9450 | 0.9321 | 0.9687 | 0.9564 | 0.9754 |
| **passage_retrieval_test** | 0.4493 | 0.5271 | 0.3470 | 0.3407 | 0.2902 | 0.3248 | 0.7562 | 0.7389 | 0.8550 |
| **qasper_abstract_test** | 1.0000 | 0.9806 | 0.9982 | 0.9982 | 0.9973 | 0.9965 | 0.9973 | 0.9982 | 0.9982 |
| **qasper_title_test** | 0.9860 | 0.8892 | 0.9838 | 0.9833 | 0.9861 | 0.9812 | 0.9742 | 0.9742 | 0.9840 |
| **qmsum_test** | 0.6668 | 0.6307 | 0.6816 | 0.7237 | 0.7169 | 0.7148 | 0.7438 | 0.7613 | 0.8154 |
| **stackoverflow_test** | 0.9634 | 0.9087 | 0.9760 | 0.9760 | 0.9766 | 0.9690 | 0.9362 | 0.9369 | 0.9443 |
| **summ_screen_fd_test** | 0.9320 | 0.9379 | 0.9747 | 0.9635 | 0.9656 | 0.9580 | 0.9796 | 0.9821 | 0.9788 |
| **Average** | 0.7168 | 0.7150 | 0.7525 | 0.7579 | 0.7517 | 0.7455 | 0.7886 | **0.7949** | **0.8406** |

## 4 Limitations

- English text only.
- On short-text tasks, the performance might not be as good as that of conventional short-text embedding models.
- As noted above, this model is still in an alpha/beta stage; it may show some unexpected behaviour.