infgrad committed · Commit 035fc10 · verified · 1 Parent(s): c5d9fc7

Update README.md

---
license: mit
datasets:
- BAAI/Infinity-Instruct
- HuggingFaceFW/fineweb-edu
language:
- en
base_model:
- answerdotai/ModernBERT-large
---
## 1 Introduction

This released model was trained in cooperation with [Richinfo](https://www.richinfo.cn/index.html) using a novel approach.
While we have not yet fully understood the underlying principles, we have achieved promising results, so we have decided
to open-source the model and hope that **someone will test the model and provide us with feedback!**

**The technical report will be completed this week.**

The core training method of this model will be implemented in
the [RAG-Retrieval repository](https://github.com/NovaSearch-Team/RAG-Retrieval) open-sourced by the NovaSearch Team.
You are welcome to star it!

This model is based on [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large), an excellent
model; thanks to its authors for sharing it!

The embedding model has the following features:

1. The max length is 128k tokens, the parameter count is 395M, and only English is supported.
2. It supports both single-vector and multi-vector representations (similar to ColBERT, but with far fewer vectors,
   only 0.5% of the number of tokens).
3. It achieves quite impressive results on the short-text evaluation (MTEB-eng-v2) without using the MTEB training set,
   even surpassing several 7B-sized models.
4. On the long-text evaluation LongEmbed, the single-vector version surpasses many large and commercial models. If
   multi-vectors are used, the average score takes first place: our score is 0.86, while the current first-place score
   is 0.79.
5. Ultra-fast encoding speed: benefiting from the architectural advantages of ModernBERT, encoding remains very fast
   even for long texts.
6. Highly flexible multi-vector combination: the multi-vectors are span (chunk) level rather than token level, so how
   chunks are specified can be fully customized to your own scenario.

## 2 Usage

We suggest reading the following sections together with the model architecture diagram.

![avatar](./imgs/inference_architecture.png)

We also hope you will read `modeling_dewey_v1.py` and `custom_st.py` carefully; the code is easy to read and will help
you a lot!

### 2.1 Prompts

Our model is an instruction-following embedding model: when using it, you should prepend a prompt to the text
(a minimal example follows the prompt list below).

For the **Retrieval task**, you **MUST** use our provided prompts:\
query: `<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>`\
passage: `<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>`

For the **STS task**, you **MUST** use our provided prompt:\
`<|START_INSTRUCTION|>Generate semantically similar text<|END_INSTRUCTION|>`

For **Classification and Clustering tasks**, you should design your own prompts; below are some examples:\
`<|START_INSTRUCTION|>Classify text into intents<|END_INSTRUCTION|>`\
`<|START_INSTRUCTION|>Classify text into toxic or not toxic<|END_INSTRUCTION|>`\
`<|START_INSTRUCTION|>Output main category of Medrxiv papers based on the titles<|END_INSTRUCTION|>`\
`<|START_INSTRUCTION|>Output topic or theme of news articles<|END_INSTRUCTION|>`

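In all cases the prompt is simply concatenated in front of the raw text before encoding. As a minimal sketch (the same
pattern appears in the full examples below; the query and passage strings here are only illustrative):

```python
# Prompts are plain strings prepended to the text that will be encoded.
RETRIEVE_Q_PROMPT = "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>"
RETRIEVE_P_PROMPT = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"

query = "why the sky is blue"
passage = "Shorter wavelengths of light are scattered more by the atmosphere."

prompted_query = f"{RETRIEVE_Q_PROMPT}{query}"
prompted_passage = f"{RETRIEVE_P_PROMPT}{passage}"
```
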
### 2.2 Single Vector

For single vectors, our model is compatible with `SentenceTransformer`.

```python
import os

# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
import torch
from sentence_transformers import SentenceTransformer

RETRIEVE_Q_PROMPT = "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>"
RETRIEVE_P_PROMPT = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"
model = SentenceTransformer(
    "infgrad/dewey_en_beta",
    trust_remote_code=True,
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "attn_implementation": "flash_attention_2"
    },
    config_kwargs={"single_vector_type": "mean"}
).cuda().bfloat16().eval()
# the choice of single_vector_type:
## for short text (<1k): cls_add_mean
## for long text (>1k): mean

# the max length of the model is 128*1024
model.max_seq_length = 32 * 1024

query_vectors = model.encode(
    sentences=[f"{RETRIEVE_Q_PROMPT}What is a computer composed of?", f"{RETRIEVE_Q_PROMPT}why the sky is blue"]
)
passage_vectors = model.encode(
    sentences=[
        f"{RETRIEVE_P_PROMPT}Central processing unit (CPU), memory (RAM), storage (hard drive or SSD), input/output devices (keyboard, mouse, monitor), and a motherboard",
        f"{RETRIEVE_P_PROMPT}Shorter wavelengths of light, such as blue and violet, are scattered more by gases and particles in Earth's atmosphere.",
    ]
)

print(query_vectors @ passage_vectors.T)
# the output is:
# [[0.52512825 0.19771025]
#  [0.17617573 0.5918883 ]]
```

### 2.3 Multi Vectors

Our multi vectors are based on text spans (i.e. chunks), so each vector can be considered a contextual chunk vector.
**In order to get the multi vectors of a document, you should get the chunks and their spans first.**

Below are the detailed steps to get multi vectors (a short sketch of the chunk-vector computation follows this list):

**Step 1:** Chunk the document to get chunks and spans. This can be done with our `encode` function, or you can chunk
documents yourself according to your scenario.\
**Note that, if you decide to chunk by yourself, your chunks and spans should not contain the prompt!!!**\
**Step 2:** Encode the text to get token embeddings.\
**Step 3:** Use the span (i.e. start_position and end_position) to get the chunk vector. We use the mean of the span's
token embeddings as the chunk vector (i.e. normalize(token_embed[start_position:end_position].mean(axis=0))).\
**Step 4:** Repeat Step 3 for each span until you have all chunk vectors. You can also add span (0, 1) and
span (1 + prompt_len, text_len - 1) to get global vectors.

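As a minimal NumPy sketch of Steps 3 and 4 (here `token_embeddings` and the span positions are hypothetical stand-ins
for the model's token-level output; in the released model the pooled vector additionally passes through a linear head,
see `module_name` in Section 2.3.2, which this sketch omits):

```python
import numpy as np

# Hypothetical token-level output for one document: (seq_len, hidden_dim).
token_embeddings = np.random.randn(1024, 2048).astype(np.float32)

def chunk_vector(token_embed: np.ndarray, start_position: int, end_position: int) -> np.ndarray:
    """Step 3: mean-pool the span's token embeddings, then L2-normalize."""
    v = token_embed[start_position:end_position].mean(axis=0)
    return v / np.linalg.norm(v)

# Step 4: one vector per span; span (0, 1) would correspond to the [CLS] position.
spans = [(1, 65), (57, 121)]
chunk_vectors = np.stack([chunk_vector(token_embeddings, s, e) for s, e in spans])
print(chunk_vectors.shape)  # (2, 2048)
```
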
For retrieval tasks, the query vector should be a **single vector**, so the final score between a query and a document
is the maximum score between the query and each of the document's vectors.
This is compatible with FAISS, Milvus and so on: just enlarge the top-k and de-duplicate the retrieved documents.

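The following is a minimal NumPy sketch of this scoring scheme (the names `chunk_matrix` and `doc_ids` are illustrative;
with FAISS or Milvus you would store the chunk vectors in the index and keep a parallel vector-to-document mapping):

```python
import numpy as np

# Illustrative data: 3 documents, each contributing several normalized chunk vectors.
chunk_matrix = np.random.randn(10, 2048).astype(np.float32)
chunk_matrix /= np.linalg.norm(chunk_matrix, axis=1, keepdims=True)
doc_ids = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])  # which document each chunk belongs to

query_vector = np.random.randn(2048).astype(np.float32)
query_vector /= np.linalg.norm(query_vector)

# Retrieve a generous top-k over chunk vectors, then de-duplicate by document,
# keeping each document's best (maximum) chunk score.
scores = chunk_matrix @ query_vector
top_k = np.argsort(-scores)[:8]
best_per_doc = {}
for idx in top_k:
    d = int(doc_ids[idx])
    best_per_doc[d] = max(best_per_doc.get(d, -1.0), float(scores[idx]))

ranked_docs = sorted(best_per_doc.items(), key=lambda kv: -kv[1])
print(ranked_docs)  # [(doc_id, max chunk score), ...] highest first
```
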
Below are detailed code examples.

#### 2.3.1 Chunk text in the `encode` function

You can directly use the `encode` method of our model to get multi vectors.
This method chunks text automatically.
You can choose the chunking strategy by setting the `fast_chunk` parameter: if `fast_chunk` is true, it chunks directly
on input ids, otherwise it uses RecursiveCharacterTextSplitter.

```python
import os
import numpy as np

# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from pydantic import BaseModel
from typing import Optional, List
from transformers import AutoTokenizer, AutoModel


class TextSpan(BaseModel):
    s: int
    e: int
    text: Optional[str] = None
    module_name: str


RETRIEVE_Q_PROMPT = "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>"
RETRIEVE_P_PROMPT = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"
model = AutoModel.from_pretrained(
    "infgrad/dewey_en_beta",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).cuda().bfloat16()
model.tokenizer = AutoTokenizer.from_pretrained("infgrad/dewey_en_beta")
max_seq_length = 32 * 1024

q_list = ["why the sky is blue"]
p_list = [
    """
I’ve been trying to understand why the sky changes colors, and I think I understand most of it, but something in the online explanations doesn’t make it clear for me:

I’ve read:

sky is blue because blue light gets scattered the most during the day.

in the evening it turns red because now even more of the blue light gets scattered

So a few questions:

The scattering of light during the day: does it mean that blue light gets reflected off air particles and reaches our eyes, while the rest of the frequencies pass through and reach the ground?

Surely some of the other frequencies also get scattered during the day, just in much smaller amounts?

So during the evening blue light gets scattered even more, to the point where even less of it reaches the eyes?

And so it gets red because now we can see the lower frequencies being scattered without blue overshadowing them?

Trying to word it myself: during the day only the highest frequencies get filtered, but during the evening also lower frequencies get filtered, because now the “light strainer” (air) is just catching more of it?

It gets darker in the evening without a good ability to see colors because there’s is no blue and so on light to reflect off of objects?

Is it ok to speak about light as a frequency? Or it’s only correct to say “wave length”?

Blue light is scattered in all directions by the tiny molecules of air in Earth's atmosphere. Blue is scattered more than other colors because it travels as shorter, smaller waves.
This is why we see a blue sky most of the time. Closer to the horizon, the sky fades to a lighter blue or white.
"""
]

# The query should be a single vector, so we set chunk_size to -1 to avoid chunking.
# If chunk_size is -1, the model returns an array of shape (2, 2048) consisting of the cls_vector and the mean_vector (mean of all token embeddings).
query_vectors = model.encode(
    sentences=q_list,
    use_cuda=True,
    show_progress_bar=True,
    chunk_size=-1,
    chunk_overlap=32,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=RETRIEVE_Q_PROMPT,
    fast_chunk=False
)[0]
# Query vectors do not need multi vectors; we only use the mean as the final single vector.
pred = [vecs[1:2, :] for vecs in query_vectors]

# spans_list contains each chunk's span; you can use the span to get its text.
spans_list: List[List[TextSpan]]
passage_vectors_list: List[np.ndarray]
passage_vectors_list, spans_list = model.encode(
    sentences=p_list,
    use_cuda=True,
    show_progress_bar=True,
    chunk_size=64,
    chunk_overlap=8,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=RETRIEVE_P_PROMPT,
    fast_chunk=True,  # if fast_chunk is true, chunk directly on input ids, else use RecursiveCharacterTextSplitter
)
# spans_list stores each passage's spans and passage_vectors_list stores each passage's vectors,
# so len(spans_list) == len(p_list) and len(spans_list) == len(passage_vectors_list).
# For one passage, each span corresponds to one vector (1*2048), so len(spans_list[idx]) == len(passage_vectors_list[idx]).
print((query_vectors[0] @ passage_vectors_list[0].T).max())
# output 0.7331543
# get each chunk's content
for spans, passage in zip(spans_list, p_list):
    text_ids = model.tokenizer.encode(RETRIEVE_P_PROMPT + passage)
    for span in spans:
        s, e = span.s, span.e
        chunk_text = model.tokenizer.decode(
            text_ids[s:e],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True
        ).strip()
```

Please read the docstring of this `encode` method to get more information.

#### 2.3.2 Chunk text by yourself

If you want to chunk text by yourself, you just need to pass the `batch_text_spans` parameter to the `encode` function.

```python
import os
import numpy as np

# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from pydantic import BaseModel
from typing import Optional, List
from transformers import AutoTokenizer, AutoModel


class TextSpan(BaseModel):
    s: int
    e: int
    text: Optional[str] = None
    module_name: str


prompt = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"

# load model
model = AutoModel.from_pretrained(
    "infgrad/dewey_en_beta",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
)
model.tokenizer = AutoTokenizer.from_pretrained("infgrad/dewey_en_beta")
max_seq_length = 32 * 1024

# chunk text
passage = "this sentence 1. this sentence 2. this sentence 3"
chunks = ["this sentence 1. this sentence 2.", "this sentence 2. this sentence 3"]
prompt_length = len(model.tokenizer.tokenize(prompt))
text_spans = [
    # s=0, e=1 means that this vector is the cls vector, so the module_name is cls_linear; otherwise the module_name is chunk_linear
    TextSpan(s=0, e=1, module_name="cls_linear")
]
for chunk in chunks:
    s = passage.find(chunk)
    e = s + len(chunk)
    text_spans.append(
        TextSpan(
            # add 1, as there is a [CLS] token at the beginning of the text.
            s=1 + prompt_length + len(model.tokenizer.tokenize(passage[:s])),
            e=1 + prompt_length + len(model.tokenizer.tokenize(passage[:e])),
            module_name="chunk_linear"
        )
    )

spans_list: List[List[TextSpan]]
passage_vectors_list: List[np.ndarray]
passage_vectors_list, _ = model.encode(
    sentences=[passage],
    use_cuda=False,
    show_progress_bar=True,
    chunk_size=64,
    chunk_overlap=12,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=prompt,
    fast_chunk=True,
    batch_text_spans=[text_spans]
)
print(passage_vectors_list[0].shape, passage_vectors_list[0][:, 2])
# the output is (3, 2048) [0.01461297 0.02085092 0.0022509 ]
```

## 3 Evaluation

### 3.1 MTEB(eng, v2)

URL: http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28eng%2C+v2%29

Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_mteb_dewey_en_beta.py

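For a rough idea of the workflow, the evaluation follows the standard `mteb` pattern sketched below. The linked script
is the authoritative way to reproduce these numbers; the benchmark name and `mteb` API shown here may need adjusting to
your installed `mteb` version, and the task-specific prompts from Section 2.1 (which the script handles) are omitted:

```python
import mteb
from sentence_transformers import SentenceTransformer

# Load the model as in Section 2.2 (dtype/attention options omitted here for brevity).
model = SentenceTransformer("infgrad/dewey_en_beta", trust_remote_code=True)
model.max_seq_length = 32 * 1024

# Select the benchmark and run it; results are written as JSON under output_folder.
benchmark = mteb.get_benchmark("MTEB(eng, v2)")
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, output_folder="results/dewey_en_beta")
```
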
| **Model** | **Zero-shot** | **Parameters** | **Dimensions** | **Max Tokens** | **Mean (Task)** | **Mean (TaskType)** | **Classification** | **Clustering** | **Pair Classification** | **Reranking** | **Retrieval** | **STS** | **Summarization** |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [gemini-embedding-exp-03-07](https://developers.googleblog.com/en/gemini-embedding-text-model-now-available-gemini-api/) | 95% | Unknown | 3072 | 8192 | 73.3 | 67.67 | 90.05 | 59.39 | 87.7 | 48.59 | 64.35 | 85.29 | 38.28 |
| [jasper_en_vision_language_v1](https://huggingface.co/NovaSearch/jasper_en_vision_language_v1) | 56% | 1B | 8960 | 131072 | 71.41 | 66.65 | 90.27 | 60.52 | 88.14 | 50 | 56.05 | 84.37 | 37.19 |
| [gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct) | NA | 7B | 3584 | 32768 | 70.72 | 65.77 | 88.52 | 58.97 | 85.9 | 50.47 | 58.09 | 82.69 | 35.74 |
| [stella_en_1.5B_v5](https://huggingface.co/NovaSearch/stella_en_1.5B_v5) | 56% | 1B | 8960 | 131072 | 69.43 | 65.32 | 89.38 | 57.06 | 88.02 | 50.19 | 52.42 | 83.27 | 36.91 |
| [SFR-Embedding-2_R](https://huggingface.co/Salesforce/SFR-Embedding-2_R) | 85% | 7B | 4096 | 32768 | 69.82 | 65.31 | 90.54 | 59.39 | 88.09 | 48.99 | 53.75 | 80.86 | 35.54 |
| [Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) | 95% | 7B | 4096 | 32768 | 69.8 | 65.29 | 83 | 54.07 | 88.44 | 49.44 | 60.14 | 84.69 | 37.26 |
| [NV-Embed-v2](https://huggingface.co/nvidia/NV-Embed-v2) | 56% | 7B | 4096 | 32768 | 69.81 | 65 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
| [SFR-Embedding-Mistral](https://huggingface.co/Salesforce/SFR-Embedding-Mistral) | 85% | 7B | 4096 | 32768 | 69.31 | 64.94 | 80.47 | 54.93 | 88.59 | 50.15 | 59.33 | 84.77 | 36.32 |
| [stella_en_400M_v5](https://huggingface.co/NovaSearch/stella_en_400M_v5) | 56% | 435M | 4096 | 8192 | 69.39 | 64.84 | 88.25 | 57.65 | 87.17 | 49.6 | 52.73 | 83.93 | 34.53 |
| [text-embedding-004](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings) | 95% | Unknown | 768 | 2048 | 69.53 | 64.82 | 86.03 | 51.52 | 87.65 | 48.48 | 59.06 | 84.84 | 36.12 |
| [text-embedding-005](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings) | 95% | Unknown | 768 | 2048 | 69.6 | 64.77 | 86.03 | 51.91 | 87.62 | 48.84 | 58.77 | 85.18 | 35.05 |
| [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) | 95% | 7B | 4096 | 32768 | 67.97 | 64 | 79.85 | 51.44 | 88.42 | 49.78 | 57.62 | 84.32 | 36.57 |
| [text-multilingual-embedding-002](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings) | 95% | Unknown | 768 | 2048 | 67.67 | 63.52 | 84.65 | 50.41 | 86.6 | 47.48 | 54.7 | 83.94 | 36.84 |
| [NV-Embed-v1](https://huggingface.co/nvidia/NV-Embed-v1) | 56% | 7B | 4096 | 32768 | 68.32 | 63.37 | 84.11 | 49.5 | 87.05 | 49.16 | 60.13 | 82.2 | 31.4 |
| **[infgrad/dewey_en_beta](https://huggingface.co/infgrad/dewey_en_beta)** | 95% | 395M | 2048 | 131072 | 0.68 | 63.30 | 81.83 | 51.75 | 86.82 | 46.35 | 56.32 | 84.21 | 35.79 |
| [gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) | NA | 1B | 8960 | 32768 | 67.2 | 63.26 | 85.84 | 53.54 | 87.52 | 49.25 | 50.25 | 82.51 | 33.94 |
| [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) | 95% | 7B | 4096 | 4096 | 67.07 | 63.22 | 81.25 | 50.82 | 87.29 | 49.59 | 54.95 | 83.03 | 35.65 |
| [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B) | 95% | 57B | 4096 | 4096 | 66.16 | 62.42 | 79.98 | 51.48 | 85.23 | 49.22 | 52.46 | 82.93 | 35.65 |
| [text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/) | NA | Unknown | 3072 | 8191 | 66.43 | 62.15 | 79.15 | 48.9 | 85.81 | 47.45 | 57.98 | 81.44 | 34.31 |
| [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) | 100% | 335M | 1024 | 512 | 66.26 | 62.04 | 79.1 | 47.48 | 87.2 | 48.05 | 55.4 | 84.42 | 32.63 |
| [GIST-large-Embedding-v0](https://huggingface.co/avsolatorio/GIST-large-Embedding-v0) | 80% | 335M | 1024 | 512 | 66.25 | 61.96 | 78.91 | 48.84 | 86.7 | 48.76 | 54.52 | 84.44 | 31.52 |
| [bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 100% | 335M | 1024 | 512 | 65.89 | 61.87 | 78.34 | 48.01 | 87.13 | 48.26 | 55.44 | 82.79 | 33.13 |
| [UAE-Large-V1](https://huggingface.co/WhereIsAI/UAE-Large-V1) | 100% | 335M | 1024 | 512 | 66.4 | 61.85 | 79.08 | 47.86 | 87.25 | 48.35 | 55.91 | 84.37 | 30.13 |

### 3.2 LongEmbed

URL: http://mteb-leaderboard.hf.space/?benchmark_name=LongEmbed

Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_long_embed.py

| **Model** | **Zero-shot** | **Number of Parameters** | **Embedding Dimensions** | **Max Tokens** | **Mean (Task)** | **Mean (TaskType)** | **Retrieval** |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **[infgrad/dewey_en_beta-MultiVectors](https://huggingface.co/infgrad/dewey_en_beta)** | 100% | 395M | 2048 | 131072 | 86.59 | 86.59 | 86.59 |
| [voyage-multilingual-2](https://blog.voyageai.com/2024/06/10/voyage-multilingual-2-multilingual-embedding-model/) | 100% | Unknown | 1024 | 32000 | 79.17 | 79.17 | 79.17 |
| [voyage-law-2](https://blog.voyageai.com/2024/04/15/domain-specific-embeddings-and-retrieval-legal-edition-voyage-law-2/) | 100% | Unknown | 1024 | 16000 | 78.85 | 78.85 | 78.85 |
| **[infgrad/dewey_en_beta-SingleVector](https://huggingface.co/infgrad/dewey_en_beta)** | 100% | 395M | 2048 | 131072 | 77.98 | 77.98 | 77.98 |
| [voyage-3](https://blog.voyageai.com/2024/09/18/voyage-3/) | 100% | Unknown | 1024 | 32000 | 74.06 | 74.06 | 74.06 |
| [inf-retriever-v1](https://huggingface.co/infly/inf-retriever-v1) | 100% | 7B | 3584 | 32768 | 73.19 | 73.19 | 73.19 |

### 3.3 LoCoV1

URL: https://huggingface.co/datasets/hazyresearch/LoCoV1-Queries\
https://huggingface.co/datasets/hazyresearch/LoCoV1-Documents

Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_loco.py

Metric: NDCG@10

Result:

| **dataset-name** | **bge-m3-8k** | **gte-modernbert-base-8k** | **Linq-Embed-Mistral-4k** | **Linq-Embed-Mistral-8k** | **SFR-Embedding-Mistral-8k** | **e5-mistral-7b-instruct-8k** | **dewey_en_beta-8k** | **dewey_en_beta_64k** | **dewey_en_beta_64k-multi-vectors** |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **2wikimqa_test** | 0.9271 | 0.8658 | 0.8884 | 0.9067 | 0.8965 | 0.8901 | 0.8953 | 0.9051 | 0.9775 |
| **courtlistener_HTML_test** | 0.1933 | 0.2349 | 0.3551 | 0.3670 | 0.3647 | 0.3543 | 0.3415 | 0.3616 | 0.4775 |
| **courtlistener_Plain_Text_test** | 0.1888 | 0.2478 | 0.3675 | 0.3761 | 0.3679 | 0.3579 | 0.3377 | 0.3485 | 0.4426 |
| **gov_report_test** | 0.9869 | 0.9750 | 0.9832 | 0.9837 | 0.9816 | 0.9823 | 0.9855 | 0.9883 | 0.9853 |
| **legal_case_reports_test** | 0.3702 | 0.4476 | 0.5398 | 0.5432 | 0.5319 | 0.4850 | 0.5474 | 0.5875 | 0.6534 |
| **multifieldqa_test** | 0.9373 | 0.9341 | 0.9345 | 0.9327 | 0.9450 | 0.9321 | 0.9687 | 0.9564 | 0.9754 |
| **passage_retrieval_test** | 0.4493 | 0.5271 | 0.3470 | 0.3407 | 0.2902 | 0.3248 | 0.7562 | 0.7389 | 0.8550 |
| **qasper_abstract_test** | 1.0000 | 0.9806 | 0.9982 | 0.9982 | 0.9973 | 0.9965 | 0.9973 | 0.9982 | 0.9982 |
| **qasper_title_test** | 0.9860 | 0.8892 | 0.9838 | 0.9833 | 0.9861 | 0.9812 | 0.9742 | 0.9742 | 0.9840 |
| **qmsum_test** | 0.6668 | 0.6307 | 0.6816 | 0.7237 | 0.7169 | 0.7148 | 0.7438 | 0.7613 | 0.8154 |
| **stackoverflow_test** | 0.9634 | 0.9087 | 0.9760 | 0.9760 | 0.9766 | 0.9690 | 0.9362 | 0.9369 | 0.9443 |
| **summ_screen_fd_test** | 0.9320 | 0.9379 | 0.9747 | 0.9635 | 0.9656 | 0.9580 | 0.9796 | 0.9821 | 0.9788 |
| **Average** | 0.7168 | 0.7150 | 0.7525 | 0.7579 | 0.7517 | 0.7455 | 0.7886 | **0.7949** | **0.8406** |

## 4 Limitations

- English text only.
- On short-text tasks, the performance might not be as good as that of conventional short-text embedding models.
- As noted above, this model is still at an alpha or beta stage and may show some unexpected behaviour.