---
base_model:
- answerdotai/ModernBERT-large
datasets:
- BAAI/Infinity-Instruct
- HuggingFaceFW/fineweb-edu
language:
- en
license: mit
pipeline_tag: feature-extraction
tags:
- sentence-transformers
- transformers
library_name: sentence-transformers
---

# Dewey Long Context Embedding Model: A Technical Report

The model was presented in the paper [Dewey Long Context Embedding Model: A Technical Report](https://huggingface.co/papers/2503.20376).

## Paper abstract

The abstract of the paper is the following:

```
In this technical report, we introduce Dewey, a novel long context embedding model designed to enhance retrieval performance in long document scenarios. Dewey builds upon the ModernBERT architecture, known for its efficient handling of extended sequences, and incorporates an instruction-based training approach to align embeddings with specific task requirements. Key features of Dewey include its 128k context window, multi-vector representation for improved granularity, and a flexible chunking mechanism that allows customizable vector combinations. We evaluate Dewey on the LongEmbed benchmark, where it achieves state-of-the-art results, surpassing several larger models. Additionally, we present comprehensive usage examples and implementation details to facilitate the adoption and adaptation of Dewey for various applications.
```

## 1 Introduction

In cooperation with [Richinfo](https://www.richinfo.cn/index.html), this released model was trained using a novel approach. While we have not yet fully understood the underlying principles, we have achieved promising results, so we have decided to open-source the model and hope that
**someone will test the model and provide us with feedback!**

The technical report: https://arxiv.org/abs/2503.20376

The core training method of this model will be implemented in
the [RAG-Retrieval repository](https://github.com/NovaSearch-Team/RAG-Retrieval) open-sourced by the NovaSearch Team.
You are welcome to star it!

This model is based on [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large).
An excellent model; thanks to the authors for sharing it!

The embedding model has the following features:

1. Max length is 128k tokens, parameter size is 395M, and it supports English only.
2. Supports both single-vector and multi-vector representations (similar to ColBERT, but with far fewer vectors: only 0.5% of the number of
   tokens).
3. Achieves quite impressive results on the short-text evaluation MTEB(eng, v2) without using the MTEB training set,
   even surpassing several 7B-sized models.
4. On the long-text evaluation LongEmbed, the single-vector setting surpasses many large and commercial models. With multi-vectors,
   the average score ranks first: our score is 0.86, while the previous first-place score is 0.79.
5. Ultra-fast encoding: thanks to the architectural advantages of ModernBERT, encoding remains very fast even for long
   texts.
6. A very flexible multi-vector combination method: the multi-vectors operate at the span (chunk) level rather than the
   token level, so how chunks are specified can be fully customized for your own scenario.

## 2 Usage

We suggest you read the following contents with the model architecture diagram.

![avatar](./imgs/inference_architecture.png)

We do hope you read `modeling_dewey_v1.py` and `custom_st.py` carefully; this code is easy to read and
will help you a lot!

### 2.1 Prompts

Our model is an instruction-following embedding model; when using it, you should prepend a prompt to the text.

For **Retrieval task**, you **MUST** use our provided prompt:\
query: `<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>`\
passage: `<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>`

For **STS task**, you **MUST** use our provided prompt:\
`<|START_INSTRUCTION|>Generate semantically similar text<|END_INSTRUCTION|>`

For **Classification and Clustering task**, you should design your own prompt, below are some examples:\
`<|START_INSTRUCTION|>Classify text into intents<|END_INSTRUCTION|>`\
`<|START_INSTRUCTION|>Classify text into toxic or not toxic<|END_INSTRUCTION|>`\
`<|START_INSTRUCTION|>Output main category of Medrxiv papers based on the titles<|END_INSTRUCTION|>`\
`<|START_INSTRUCTION|>Output topic or theme of news articles<|END_INSTRUCTION|>`
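
As an illustration, prompts are plain text prefixes: you simply prepend them to the raw text before encoding. Below is a minimal sketch; the example query and passage strings are made up, and the prompted strings mirror what is passed to `encode` in the sections below.

```python
RETRIEVE_Q_PROMPT = "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>"
RETRIEVE_P_PROMPT = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"
STS_PROMPT = "<|START_INSTRUCTION|>Generate semantically similar text<|END_INSTRUCTION|>"

# Retrieval: queries and passages use different prompts.
query_text = RETRIEVE_Q_PROMPT + "why the sky is blue"
passage_text = RETRIEVE_P_PROMPT + "Shorter wavelengths of light are scattered more by the atmosphere."

# STS: the same prompt is used on both sides of the pair.
sts_pair = [STS_PROMPT + "A quick brown fox.", STS_PROMPT + "A fast brown fox."]
```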

### 2.2 Single Vector

For single-vector usage, our model is compatible with `SentenceTransformer`.

```python
import os

# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
import torch
from sentence_transformers import SentenceTransformer

RETRIEVE_Q_PROMPT = "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>"
RETRIEVE_P_PROMPT = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"
model = SentenceTransformer(
    "infgrad/dewey_en_beta",
    trust_remote_code=True,
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "attn_implementation": "flash_attention_2"
    },
    config_kwargs={"single_vector_type": "mean"}
).cuda().bfloat16().eval()
# the choice of single_vector_type:
## for short text (<1k): cls_add_mean
## for long text (>1k): mean

# the max length of model is 128*1024
model.max_seq_length = 32 * 1024

query_vectors = model.encode(
    sentences=[f"{RETRIEVE_Q_PROMPT}What is a computer composed of?", f"{RETRIEVE_Q_PROMPT}why the sky is blue"]
)
passage_vectors = model.encode(
    sentences=[
        f"{RETRIEVE_P_PROMPT}Central processing unit (CPU), memory (RAM), storage (hard drive or SSD), input/output devices (keyboard, mouse, monitor), and a motherboard",
        f"{RETRIEVE_P_PROMPT}Shorter wavelengths of light, such as blue and violet, are scattered more by gases and particles in Earth's atmosphere.",
    ]
)

print(query_vectors @ passage_vectors.T)
# the output is:
# [[0.52512825 0.19771025]
#  [0.17617573 0.5918883 ]]
```

### 2.3 Multi Vectors

Our multi-vectors are based on text spans (i.e., chunks), so each vector can be considered a contextual chunk vector.
**To get the multi-vectors of a document, you should first get its chunks and their spans.**

Below are detailed steps to get multi vectors:

**Step 1:** Chunk the document to get chunks and spans. This can be done with our `encode` function, or you can
chunk documents yourself according to your scenario.\
**Note that if you decide to chunk by yourself, your chunks and spans should not contain the prompt!**\
**Step 2:** Encode the text to get token embeddings.\
**Step 3:** Use each span (i.e., start_position and end_position) to get a chunk vector;
we use the mean of the span's token embeddings as the chunk vector (i.e., normalize(token_embed[start_position:end_position].mean(
axis=0))).\
**Step 4:** Repeat Step 3 for each span until you have all chunk vectors. You can also add span(0, 1) and span(1 + prompt_len,
text_len - 1) to get global vectors.
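
To make Steps 3 and 4 concrete, here is a minimal NumPy sketch. It assumes you already have a document's token embeddings and its token-level spans; the arrays below are random stand-ins, not real model outputs, and the released model additionally passes span vectors through its linear modules (`cls_linear` / `chunk_linear`, see Section 2.3.2).

```python
import numpy as np

# Stand-in for the token embeddings of one encoded document: (seq_len, dim).
token_embeddings = np.random.randn(512, 2048).astype(np.float32)
# Token-level spans, e.g. a global span plus two overlapping chunk spans.
spans = [(0, 1), (10, 74), (66, 130)]

chunk_vectors = []
for start, end in spans:
    v = token_embeddings[start:end].mean(axis=0)   # Step 3: mean of the span's token embeddings
    v = v / np.linalg.norm(v)                      # Step 3: L2-normalize the chunk vector
    chunk_vectors.append(v)

chunk_vectors = np.stack(chunk_vectors)            # Step 4: (num_spans, dim) multi-vector representation
```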

For retrieval tasks, the query vector should be a **single vector**, so the final score between a query and a document is the maximum
score between the query and every vector of that document.\
This is compatible with FAISS, Milvus, and so on: just enlarge the top-k and de-duplicate the retrieved documents.
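
Here is a minimal sketch of that scoring scheme in plain NumPy. The flat `chunk_matrix` and `doc_ids` arrays are hypothetical stand-ins for a real FAISS/Milvus index; the point is the enlarged chunk-level top-k followed by max-score de-duplication per document.

```python
import numpy as np

# Hypothetical flat index: one row per chunk vector, plus the id of the document each chunk came from.
chunk_matrix = np.random.randn(1000, 2048).astype(np.float32)
chunk_matrix /= np.linalg.norm(chunk_matrix, axis=1, keepdims=True)
doc_ids = np.random.randint(0, 100, size=1000)

query_vector = np.random.randn(2048).astype(np.float32)
query_vector /= np.linalg.norm(query_vector)

# 1) Retrieve an enlarged top-k at the chunk level (a vector DB would do this step).
scores = chunk_matrix @ query_vector
top_chunk_idx = np.argsort(-scores)[:50]

# 2) De-duplicate chunks into documents, keeping each document's best (max) chunk score.
best_per_doc = {}
for idx in top_chunk_idx:
    d = int(doc_ids[idx])
    best_per_doc[d] = max(best_per_doc.get(d, float("-inf")), float(scores[idx]))

top_docs = sorted(best_per_doc.items(), key=lambda kv: -kv[1])[:10]
print(top_docs)  # [(doc_id, max chunk score), ...] ranked by max-score fusion
```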

Below are detailed code examples.

#### 2.3.1 Chunk text in the `encode` function

You can directly use the `encode` method of our model to get multi-vectors.\
This method will chunk text automatically.\
You can choose the chunking strategy via the `fast_chunk` parameter: if `fast_chunk` is true, chunking is done directly on input
ids; otherwise RecursiveCharacterTextSplitter is used.

```python
import os
import numpy as np

# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from pydantic import BaseModel
from typing import Optional, List
from transformers import AutoTokenizer, AutoModel


class TextSpan(BaseModel):
    s: int
    e: int
    text: Optional[str] = None
    module_name: str


RETRIEVE_Q_PROMPT = "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>"
RETRIEVE_P_PROMPT = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"
model = AutoModel.from_pretrained(
    "infgrad/dewey_en_beta",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).cuda().bfloat16()
model.tokenizer = AutoTokenizer.from_pretrained("infgrad/dewey_en_beta")
max_seq_length = 32 * 1024

q_list = ["why the sky is blue"]
p_list = [
    """
    I’ve been trying to understand why the sky changes colors, and I think I understand most of it, but something in the online explanations doesn’t make it clear for me:

I’ve read:

sky is blue because blue light gets scattered the most during the day.

in the evening it turns red because now even more of the blue light gets scattered

So a few questions:

The scattering of light during the day: does it mean that blue light gets reflected off air particles and reaches our eyes, while the rest of the frequencies pass through and reach the ground?

Surely some of the other frequencies also get scattered during the day, just in much smaller amounts?

So during the evening blue light gets scattered even more, to the point where even less of it reaches the eyes?

And so it gets red because now we can see the lower frequencies being scattered without blue overshadowing them?\

Trying to word it myself: during the day only the highest frequencies get filtered, but during the evening also lower frequencies get filtered, because now the “light strainer” (air) is just catching more of it?\

It gets darker in the evening without a good ability to see colors because there’s is no blue and so on light to reflect off of objects?\

Is it ok to speak about light as a frequency? Or it’s only correct to say “wave length”?

Blue light is scattered in all directions by the tiny molecules of air in Earth's atmosphere. Blue is scattered more than other colors because it travels as shorter, smaller waves. 
This is why we see a blue sky most of the time. Closer to the horizon, the sky fades to a lighter blue or white.
    """
]

# The query should be a single vector, so we set chunk_size to -1 to avoid chunking.
# If chunk_size is -1, the model returns, for each sentence, an array of shape (2, 2048) consisting of the cls_vector and the mean_vector (mean of all token embeddings).
query_vectors = model.encode(
    sentences=q_list,
    use_cuda=True,
    show_progress_bar=True,
    chunk_size=-1,
    chunk_overlap=32,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=RETRIEVE_Q_PROMPT,
    fast_chunk=False
)[0]
# the query does not need multi-vectors; we only use the mean as the final single vector
pred = [vecs[1:2, :] for vecs in query_vectors]

# spans_list contains each chunk's span; you can use a span to get the chunk's text
spans_list: List[List[TextSpan]]
passage_vectors_list: List[np.ndarray]
passage_vectors_list, spans_list = model.encode(
    sentences=p_list,
    use_cuda=True,
    show_progress_bar=True,
    chunk_size=64,
    chunk_overlap=8,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=RETRIEVE_P_PROMPT,
    fast_chunk=True,  # if fast_chunk is true, directly chunk on input ids, else using RecursiveCharacterTextSplitter
)
# spans_list stores each passage's spans, passage_vectors_list stores each passage's vectors so len(spans_list) == len(p_list) and len(spans_list) == len(passage_vectors_list)
# for a passage's spans and vectors, each span corresponds to a vector (1*2048). So, len(spans_list[idx]) ==  len(passage_vectors_list[idx])
print((query_vectors[0] @ passage_vectors_list[0].T).max())
# output 0.7331543
# get each chunk's content
for spans, passage in zip(spans_list, p_list):
    text_ids = model.tokenizer.encode(RETRIEVE_P_PROMPT + passage)
    for span in spans:
        s, e = span.s, span.e
        chunk_text = model.tokenizer.decode(
            text_ids[s:e],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True
        ).strip()
```

Please read the annotations of the `encode` method for more information.

#### 2.3.2 Chunk text by yourself

If you want to chunk text by yourself, just set the `batch_text_spans` parameter in the `encode` function.

```python
import os
import numpy as np

# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from pydantic import BaseModel
from typing import Optional, List
from transformers import AutoTokenizer, AutoModel


class TextSpan(BaseModel):
    s: int
    e: int
    text: Optional[str] = None
    module_name: str


prompt = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"

# load model
model = AutoModel.from_pretrained(
    "infgrad/dewey_en_beta",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
)
model.tokenizer = AutoTokenizer.from_pretrained("infgrad/dewey_en_beta")
max_seq_length = 32 * 1024

# chunk text
passage = "this sentence 1. this sentence 2. this sentence 3"
chunks = ["this sentence 1. this sentence 2.", "this sentence 2. this sentence 3"]
prompt_length = len(model.tokenizer.tokenize(prompt))
text_spans = [
    # s=0, e=1 means that this vector is cls vector, so the module_name is cls_linear, otherwise the module_name is chunk_linear
    TextSpan(s=0, e=1, module_name="cls_linear")
]
for chunk in chunks:
    s = passage.find(chunk)
    e = s + len(chunk)
    text_spans.append(
        TextSpan(
            # add 1, as there is a [CLS] token at the beginning of text.
            s=1 + prompt_length + len(model.tokenizer.tokenize(passage[:s])),
            e=1 + prompt_length + len(model.tokenizer.tokenize(passage[:e])),
            module_name="chunk_linear"
        )
    )

spans_list: List[List[TextSpan]]
passage_vectors_list: List[np.ndarray]
passage_vectors_list, _ = model.encode(
    sentences=[passage],
    use_cuda=False,
    show_progress_bar=True,
    chunk_size=64,
    chunk_overlap=12,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=prompt,
    fast_chunk=True,
    batch_text_spans=[text_spans]
)
print(passage_vectors_list[0].shape, passage_vectors_list[0][:, 2])
# the output is (3, 2048) [0.01461297 0.02085092 0.0022509 ]
```

## 3 Evaluation

### 3.1 MTEB(eng, v2)

URL: http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28eng%2C+v2%29

Reproduction
script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_mteb_dewey_en_beta.py

|                                                        **Model**                                                         | **Zero-shot** | **Parameters** | **Dimensions** | **Max Tokens** | **Mean (Task)** | **Mean (TaskType)** | **Classification** | **Clustering** | **Pair Classification** | **Reranking** | **Retrieval** | **STS** | **Summarization** |
|:------------------------------------------------------------------------------------------------------------------------:|:-------------:|:--------------:|:--------------:|:--------------:|:---------------:|:-------------------:|:------------------:|:--------------:|:-----------------------:|:-------------:|:-------------:|:-------:|:-----------------:|
| [gemini-embedding-exp-03-07](https://developers.googleblog.com/en/gemini-embedding-text-model-now-available-gemini-api/) | 95%           | Unknown        | 3072           | 8192           | 73.3            | 67.67               | 90.05              | 59.39          | 87.7                    | 48.59         | 64.35         | 85.29   | 38.28             |
|              [jasper_en_vision_language_v1](https://huggingface.co/NovaSearch/jasper_en_vision_language_v1)              | 56%           | 1B             | 8960           | 131072         | 71.41           | 66.65               | 90.27              | 60.52          | 88.14                   | 50            | 56.05         | 84.37   | 37.19             |
|                    [gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)                     | NA            | 7B             | 3584           | 32768          | 70.72           | 65.77               | 88.52              | 58.97          | 85.9                    | 50.47         | 58.09         | 82.69   | 35.74             |
|                         [stella_en_1.5B_v5](https://huggingface.co/NovaSearch/stella_en_1.5B_v5)                         | 56%           | 1B             | 8960           | 131072         | 69.43           | 65.32               | 89.38              | 57.06          | 88.02                   | 50.19         | 52.42         | 83.27   | 36.91             |
|                         [SFR-Embedding-2_R](https://huggingface.co/Salesforce/SFR-Embedding-2_R)                         | 85%           | 7B             | 4096           | 32768          | 69.82           | 65.31               | 90.54              | 59.39          | 88.09                   | 48.99         | 53.75         | 80.86   | 35.54             |
|                     [Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral)                     | 95%           | 7B             | 4096           | 32768          | 69.8            | 65.29               | 83                 | 54.07          | 88.44                   | 49.44         | 60.14         | 84.69   | 37.26             |
|                                 [NV-Embed-v2](https://huggingface.co/nvidia/NV-Embed-v2)                                 | 56%           | 7B             | 4096           | 32768          | 69.81           | 65                  | 87.19              | 47.66          | 88.69                   | 49.61         | 62.84         | 83.82   | 35.21             |
|                     [SFR-Embedding-Mistral](https://huggingface.co/Salesforce/SFR-Embedding-Mistral)                     | 85%           | 7B             | 4096           | 32768          | 69.31           | 64.94               | 80.47              | 54.93          | 88.59                   | 50.15         | 59.33         | 84.77   | 36.32             |
|                         [stella_en_400M_v5](https://huggingface.co/NovaSearch/stella_en_400M_v5)                         | 56%           | 435M           | 4096           | 8192           | 69.39           | 64.84               | 88.25              | 57.65          | 87.17                   | 49.6          | 52.73         | 83.93   | 34.53             |
|        [text-embedding-004](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings)        | 95%           | Unknown        | 768            | 2048           | 69.53           | 64.82               | 86.03              | 51.52          | 87.65                   | 48.48         | 59.06         | 84.84   | 36.12             |
|        [text-embedding-005](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings)        | 95%           | Unknown        | 768            | 2048           | 69.6            | 64.77               | 86.03              | 51.91          | 87.62                   | 48.84         | 58.77         | 85.18   | 35.05             |
|                     [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct)                     | 95%           | 7B             | 4096           | 32768          | 67.97           | 64                  | 79.85              | 51.44          | 88.42                   | 49.78         | 57.62         | 84.32   | 36.57             |
| [text-multilingual-embedding-002](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings)  | 95%           | Unknown        | 768            | 2048           | 67.67           | 63.52               | 84.65              | 50.41          | 86.6                    | 47.48         | 54.7          | 83.94   | 36.84             |
|                                 [NV-Embed-v1](https://huggingface.co/nvidia/NV-Embed-v1)                                 | 56%           | 7B             | 4096           | 32768          | 68.32           | 63.37               | 84.11              | 49.5           | 87.05                   | 49.16         | 60.13         | 82.2    | 31.4              |
|                      **[infgrad/dewey_en_beta](https://huggingface.co/infgrad/dewey_en_beta)**                      | 95%           | 395M           | 2048           | 131072         | 68              | 63.30               | 81.83              | 51.75          | 86.82                   | 46.35         | 56.32         | 84.21   | 35.79             |
|                  [gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct)                   | NA            | 1B             | 8960           | 32768          | 67.2            | 63.26               | 85.84              | 53.54          | 87.52                   | 49.25         | 50.25         | 82.51   | 33.94             |
|                                   [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B)                                   | 95%           | 7B             | 4096           | 4096           | 67.07           | 63.22               | 81.25              | 50.82          | 87.29                   | 49.59         | 54.95         | 83.03   | 35.65             |
|                                 [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B)                                 | 95%           | 57B            | 4096           | 4096           | 66.16           | 62.42               | 79.98              | 51.48          | 85.23                   | 49.22         | 52.46         | 82.93   | 35.65             |
|                 [text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/)                 | NA            | Unknown        | 3072           | 8191           | 66.43           | 62.15               | 79.15              | 48.9           | 85.81                   | 47.45         | 57.98         | 81.44   | 34.31             |
|                    [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1)                     | 100%          | 335M           | 1024           | 512            | 66.26           | 62.04               | 79.1               | 47.48          | 87.2                    | 48.05         | 55.4          | 84.42   | 32.63             |
|                  [GIST-large-Embedding-v0](https://huggingface.co/avsolatorio/GIST-large-Embedding-v0)                   | 80%           | 335M           | 1024           | 512            | 66.25           | 61.96               | 78.91              | 48.84          | 86.7                    | 48.76         | 54.52         | 84.44   | 31.52             |
|                            [bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5)                            | 100%          | 335M           | 1024           | 512            | 65.89           | 61.87               | 78.34              | 48.01          | 87.13                   | 48.26         | 55.44         | 82.79   | 33.13             |
|                              [UAE-Large-V1](https://huggingface.co/WhereIsAI/UAE-Large-V1)                               | 100%          | 335M           | 1024           | 512            | 66.4            | 61.85               | 79.08              | 47.86          | 87.25                   | 48.35         | 55.91         | 84.37   | 30.13             |

### 3.2 LongEmbed

URL: http://mteb-leaderboard.hf.space/?benchmark_name=LongEmbed

Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_long_embed.py

|                                                         **Model**                                                         | **Zero-shot** | **Number of Parameters** | **Embedding Dimensions** | **Max Tokens** | **Mean (Task)** | **Mean (TaskType)** | **Retrieval** |
|:-------------------------------------------------------------------------------------------------------------------------:|:-------------:|:------------------------:|:------------------------:|:--------------:|:---------------:|:-------------------:|:-------------:|
|                  **[infgrad/dewey_en_beta-MultiVectors](https://huggingface.co/infgrad/dewey_en_beta)**                   | 100%          | 395M                     | 2048                     | 131072         | 86.59           | 86.59               | 86.59         |
|     [voyage-multilingual-2](https://blog.voyageai.com/2024/06/10/voyage-multilingual-2-multilingual-embedding-model/)     | 100%          | Unknown                  | 1024                     | 32000          | 79.17           | 79.17               | 79.17         |
| [voyage-law-2](https://blog.voyageai.com/2024/04/15/domain-specific-embeddings-and-retrieval-legal-edition-voyage-law-2/) | 100%          | Unknown                  | 1024                     | 16000          | 78.85           | 78.85               | 78.85         |
|                  **[infgrad/dewey_en_beta-SingleVector](https://huggingface.co/infgrad/dewey_en_beta)**                   | 100%          | 395M                     | 2048                     | 131072         | 77.98           | 77.98               | 77.98         |
|                                [voyage-3](https://blog.voyageai.com/2024/09/18/voyage-3/)                                 | 100%          | Unknown                  | 1024                     | 32000          | 74.06           | 74.06               | 74.06         |
|                             [inf-retriever-v1](https://huggingface.co/infly/inf-retriever-v1)                             | 100%          | 7B                       | 3584                     | 32768          | 73.19           | 73.19               | 73.19         |

### 3.3 LoCoV1

URL: https://huggingface.co/datasets/hazyresearch/LoCoV1-Queries\
https://huggingface.co/datasets/hazyresearch/LoCoV1-Documents

Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_loco.py

Metric: NDCG@10

Result:

| **dataset-name**                  | **bge-m3-8k** | **gte-modernbert-base-8k** | **Linq-Embed-Mistral-4k** | **Linq-Embed-Mistral-8k** | **SFR-Embedding-Mistral-8k** | **e5-mistral-7b-instruct-8k** | **dewey_en_beta-8k** |  **dewey_en_beta_64k**   |  **dewey_en_beta_64k-multi-vectors**   |
|:---------------------------------:|:-------------:|:--------------------------:|:-------------------------:|:-------------------------:|:----------------------------:|:-----------------------------:|:--------------------:|:------------------------:|:--------------------------------------:|
| **2wikimqa_test**                 | 0.9271        | 0.8658                     | 0.8884                    | 0.9067                    | 0.8965                       | 0.8901                        | 0.8953               |          0.9051          |                 0.9775                 |
| **courtlistener_HTML_test**       | 0.1933        | 0.2349                     | 0.3551                    | 0.3670                    | 0.3647                       | 0.3543                        | 0.3415               |          0.3616          |                 0.4775                 |
| **courtlistener_Plain_Text_test** | 0.1888        | 0.2478                     | 0.3675                    | 0.3761                    | 0.3679                       | 0.3579                        | 0.3377               |          0.3485          |                 0.4426                 |
| **gov_report_test**               | 0.9869        | 0.9750                     | 0.9832                    | 0.9837                    | 0.9816                       | 0.9823                        | 0.9855               |          0.9883          |                 0.9853                 |
| **legal_case_reports_test**       | 0.3702        | 0.4476                     | 0.5398                    | 0.5432                    | 0.5319                       | 0.4850                        | 0.5474               |          0.5875          |                 0.6534                 |
| **multifieldqa_test**             | 0.9373        | 0.9341                     | 0.9345                    | 0.9327                    | 0.9450                       | 0.9321                        | 0.9687               |          0.9564          |                 0.9754                 |
| **passage_retrieval_test**        | 0.4493        | 0.5271                     | 0.3470                    | 0.3407                    | 0.2902                       | 0.3248                        | 0.7562               |          0.7389          |                 0.8550                 |
| **qasper_abstract_test**          | 1.0000        | 0.9806                     | 0.9982                    | 0.9982                    | 0.9973                       | 0.9965                        | 0.9973               |          0.9982          |                 0.9982                 |
| **qasper_title_test**             | 0.9860        | 0.8892                     | 0.9838                    | 0.9833                    | 0.9861                       | 0.9812                        | 0.9742               |          0.9742          |                 0.9840                 |
| **qmsum_test**                    | 0.6668        | 0.6307                     | 0.6816                    | 0.7237                    | 0.7169                       | 0.7148                        | 0.7438               |          0.7613          |                 0.8154                 |
| **stackoverflow_test**            | 0.9634        | 0.9087                     | 0.9760                    | 0.9760                    | 0.9766                       | 0.9690                        | 0.9362               |          0.9369          |                 0.9443                 |
| **summ_screen_fd_test**           | 0.9320        | 0.9379                     | 0.9747                    | 0.9635                    | 0.9656                       | 0.9580                        | 0.9796               |          0.9821          |                 0.9788                 |
| **Average**                       | 0.7168        | 0.7150                     | 0.7525                    | 0.7579                    | 0.7517                       | 0.7455                        | 0.7886               |**0.7949**                |**0.8406**                              |

## 4 Limitations

- English text only.
- On short-text tasks, the performance might not be as good as that of conventional short-text embedding models.
- As mentioned above, this model is still in an alpha/beta stage and may exhibit some unexpected behaviour.

## 5 Cite

```
@misc{zhang2025deweylongcontextembedding,
      title={Dewey Long Context Embedding Model: A Technical Report}, 
      author={Dun Zhang and Panxiang Zou and Yudong Zhou},
      year={2025},
      eprint={2503.20376},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2503.20376}, 
}
```