---
license: mit
datasets:
- BAAI/Infinity-Instruct
- HuggingFaceFW/fineweb-edu
language:
- en
base_model:
- answerdotai/ModernBERT-large
---

## 1 Introduction

Cooperating with [Richinfo](https://www.richinfo.cn/index.html), we trained this released model with a novel approach. While we have not yet fully understood the underlying principles, we have achieved promising results, so we have decided to open-source the model and hope that **someone will test the model and provide us with feedback!**

**The technical report will be completed this week.**

The core training method of this model will be implemented in the [RAG-Retrieval repository](https://github.com/NovaSearch-Team/RAG-Retrieval) open sourced by the NovaSearch Team, welcome to star!

This model is based on [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large). An excellent model; thanks to the authors for sharing it!

The embedding model has the following features:

1. The max length is 128k tokens, the parameter size is 395M, and only English is supported.
2. It supports both single-vector and multi-vector representations (similar to ColBERT, but with far fewer vectors, only 0.5% of the number of tokens).
3. It achieves quite impressive results on the short-text evaluation (MTEB-eng-v2) without using the MTEB training set, even surpassing several 7B-sized models.
4. On the long-text evaluation LongEmbed, the single-vector setting already surpasses many large and commercial models. With multi-vectors, the average score takes first place: our score is 0.86, while the previous first-place score is 0.79.
5. Ultra-fast encoding speed: benefiting from the architectural advantages of ModernBERT, encoding long texts is still very fast.
6. A highly flexible multi-vector combination method: the multi-vectors operate at the span (i.e. chunk) level rather than the token level, so how chunks are specified can be fully customized to your own scenario.

## 2 Usage

We suggest you read the following contents together with the model architecture diagram.



We do hope you read `modeling_dewey_v1.py` and `custom_st.py` carefully; this code is easy to read and will help you a lot!

### 2.1 Prompts

Our model is an instruct-embedding model: when using it, you should prepend a prompt to the text.

For the **retrieval task**, you **MUST** use our provided prompts:\
query: `<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>`\
passage: `<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>`

For the **STS task**, you **MUST** use our provided prompt:\
`<|START_INSTRUCTION|>Generate semantically similar text<|END_INSTRUCTION|>`

For **classification and clustering tasks**, you should design your own prompt; below are some examples:\
`<|START_INSTRUCTION|>Classify text into intents<|END_INSTRUCTION|>`\
`<|START_INSTRUCTION|>Classify text into toxic or not toxic<|END_INSTRUCTION|>`\
`<|START_INSTRUCTION|>Output main category of Medrxiv papers based on the titles<|END_INSTRUCTION|>`\
`<|START_INSTRUCTION|>Output topic or theme of news articles<|END_INSTRUCTION|>`
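
A prompted input is simply the prompt string concatenated directly in front of the text, with no separator (the code in section 2.2 below does the same with f-strings):

```python
# a minimal illustration of how prompts are combined with text
RETRIEVE_Q_PROMPT = "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>"
RETRIEVE_P_PROMPT = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"

query_input = RETRIEVE_Q_PROMPT + "why the sky is blue"
passage_input = RETRIEVE_P_PROMPT + "Shorter wavelengths of light are scattered more by the atmosphere."
```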

### 2.2 Single Vector

For single-vector usage, our model is compatible with `SentenceTransformer`.

```python
import os

# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
import torch
from sentence_transformers import SentenceTransformer

RETRIEVE_Q_PROMPT = "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>"
RETRIEVE_P_PROMPT = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"
model = SentenceTransformer(
    "infgrad/dewey_en_beta",
    trust_remote_code=True,
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "attn_implementation": "flash_attention_2"
    },
    config_kwargs={"single_vector_type": "mean"}
).cuda().bfloat16().eval()
# the choice of single_vector_type:
## for short text (<1k): cls_add_mean
## for long text (>1k): mean

# the max length of the model is 128*1024
model.max_seq_length = 32 * 1024

query_vectors = model.encode(
    sentences=[f"{RETRIEVE_Q_PROMPT}What is a computer composed of?", f"{RETRIEVE_Q_PROMPT}why the sky is blue"]
)
passage_vectors = model.encode(
    sentences=[
        f"{RETRIEVE_P_PROMPT}Central processing unit (CPU), memory (RAM), storage (hard drive or SSD), input/output devices (keyboard, mouse, monitor), and a motherboard",
        f"{RETRIEVE_P_PROMPT}Shorter wavelengths of light, such as blue and violet, are scattered more by gases and particles in Earth's atmosphere.",
    ]
)

print(query_vectors @ passage_vectors.T)
# the output is:
# [[0.52512825 0.19771025]
#  [0.17617573 0.5918883 ]]
```

### 2.3 Multi Vectors

Our multi vectors are based on text spans (i.e. chunks), so each vector can be considered a contextual chunk vector.
**In order to get the multi vectors of a document, you should first get its chunks and their spans.**

Below are the detailed steps to get multi vectors:

**Step 1:** Chunk the document to get chunks and spans. This can be done with our `encode` function, or you can chunk documents yourself according to your scenario.\
**Note that if you decide to chunk by yourself, your chunks and spans should not contain the prompt!!!**\
**Step 2:** Encode the text to get its token embeddings.\
**Step 3:** Use the span (i.e. start_position and end_position) to get the chunk vector; we use the mean of the span's token embeddings as the chunk vector, i.e. normalize(token_embed[start_position:end_position].mean(axis=0)). A sketch follows this list.\
**Step 4:** Repeat Step 3 for each span until you have all chunk vectors. You can also add span (0, 1) and span (1+prompt_len, text_len-1) to get global vectors.
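
Below is a minimal NumPy sketch of Step 3, assuming you already have one document's token embeddings as an array of shape (seq_len, dim); it only illustrates the pooling, it is not the model's internal implementation:

```python
import numpy as np

def chunk_vector(token_embed: np.ndarray, start: int, end: int) -> np.ndarray:
    """Mean-pool the span's token embeddings, then L2-normalize the result."""
    v = token_embed[start:end].mean(axis=0)
    return v / np.linalg.norm(v)

# toy example: 10 tokens with 2048-dim embeddings, a chunk spanning tokens 2..7
token_embed = np.random.randn(10, 2048).astype(np.float32)
vec = chunk_vector(token_embed, 2, 7)
print(vec.shape)  # (2048,)
```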

For retrieval tasks, the query vector should be a **single vector**, so the final score between a query and a document is the maximum score of the query against every document vector.
This is compatible with FAISS, Milvus, and so on: just enlarge the top-k and de-duplicate the retrieved documents.
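
As a concrete illustration, here is a minimal NumPy sketch of this scoring scheme (the function names are ours): the query is one normalized vector, each document is a matrix of normalized chunk vectors, and the de-duplication step mirrors the "enlarge the top-k and de-duplicate" advice above:

```python
import numpy as np

def doc_score(query_vec: np.ndarray, doc_chunk_vecs: np.ndarray) -> float:
    """query_vec: (dim,); doc_chunk_vecs: (num_chunks, dim); all vectors normalized."""
    # final query-document score is the max over the document's chunk vectors
    return float((doc_chunk_vecs @ query_vec).max())

# with a vector database, index every chunk vector under its document id,
# retrieve an enlarged top-k of chunks, then keep the best score per document
def dedup_by_doc(doc_ids: list, scores: list) -> list:
    best = {}
    for doc_id, s in zip(doc_ids, scores):
        best[doc_id] = max(s, best.get(doc_id, float("-inf")))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```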

Below are detailed code examples.

#### 2.3.1 Chunk text in the `encode` function

You can directly use the `encode` method of our model to get multi vectors.
This method chunks text automatically.
You can choose the chunking strategy via the `fast_chunk` parameter: if `fast_chunk` is true, the text is chunked directly on the input ids; otherwise `RecursiveCharacterTextSplitter` is used.

```python
import os
import numpy as np

# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from pydantic import BaseModel
from typing import Optional, List
from transformers import AutoTokenizer, AutoModel


class TextSpan(BaseModel):
    s: int  # start token position of the span
    e: int  # end token position of the span
    text: Optional[str] = None
    module_name: str  # "cls_linear" for the cls vector, "chunk_linear" for chunk vectors


RETRIEVE_Q_PROMPT = "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>"
RETRIEVE_P_PROMPT = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"
model = AutoModel.from_pretrained(
    "infgrad/dewey_en_beta",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
).cuda().bfloat16()
model.tokenizer = AutoTokenizer.from_pretrained("infgrad/dewey_en_beta")
max_seq_length = 32 * 1024

q_list = ["why the sky is blue"]
p_list = [
    """
I’ve been trying to understand why the sky changes colors, and I think I understand most of it, but something in the online explanations doesn’t make it clear for me:

I’ve read:

sky is blue because blue light gets scattered the most during the day.

in the evening it turns red because now even more of the blue light gets scattered

So a few questions:

The scattering of light during the day: does it mean that blue light gets reflected off air particles and reaches our eyes, while the rest of the frequencies pass through and reach the ground?

Surely some of the other frequencies also get scattered during the day, just in much smaller amounts?

So during the evening blue light gets scattered even more, to the point where even less of it reaches the eyes?

And so it gets red because now we can see the lower frequencies being scattered without blue overshadowing them?

Trying to word it myself: during the day only the highest frequencies get filtered, but during the evening also lower frequencies get filtered, because now the “light strainer” (air) is just catching more of it?

It gets darker in the evening without a good ability to see colors because there’s is no blue and so on light to reflect off of objects?

Is it ok to speak about light as a frequency? Or it’s only correct to say “wave length”?

Blue light is scattered in all directions by the tiny molecules of air in Earth's atmosphere. Blue is scattered more than other colors because it travels as shorter, smaller waves.
This is why we see a blue sky most of the time. Closer to the horizon, the sky fades to a lighter blue or white.
"""
]

# a query should be encoded into a single vector, so we set chunk_size to -1 to avoid chunking.
# If chunk_size is -1, the model returns an array of shape (2, 2048) per text, consisting of the cls_vector and the mean_vector (mean of all token embeddings).
query_vectors = model.encode(
    sentences=q_list,
    use_cuda=True,
    show_progress_bar=True,
    chunk_size=-1,
    chunk_overlap=32,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=RETRIEVE_Q_PROMPT,
    fast_chunk=False
)[0]
# a query does not need multi vectors; we only use the mean vector (row 1) as the final single vector
query_single_vectors = [vecs[1:2, :] for vecs in query_vectors]

# spans_list contains each chunk's span; you can use a span to recover the chunk's text
spans_list: List[List[TextSpan]]
passage_vectors_list: List[np.ndarray]
passage_vectors_list, spans_list = model.encode(
    sentences=p_list,
    use_cuda=True,
    show_progress_bar=True,
    chunk_size=64,
    chunk_overlap=8,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=RETRIEVE_P_PROMPT,
    fast_chunk=True,  # if fast_chunk is true, chunk directly on input ids; otherwise use RecursiveCharacterTextSplitter
)
# spans_list stores each passage's spans and passage_vectors_list stores each passage's vectors,
# so len(spans_list) == len(p_list) == len(passage_vectors_list).
# Within one passage, each span corresponds to one vector (1*2048), so len(spans_list[idx]) == len(passage_vectors_list[idx])
print((query_vectors[0] @ passage_vectors_list[0].T).max())
# output 0.7331543
# get each chunk's content
for spans, passage in zip(spans_list, p_list):
    text_ids = model.tokenizer.encode(RETRIEVE_P_PROMPT + passage)
    for span in spans:
        s, e = span.s, span.e
        chunk_text = model.tokenizer.decode(
            text_ids[s:e],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=True
        ).strip()
```

Please read the docstring of the `encode` method for more information.

#### 2.3.2 Chunk text by yourself

If you want to chunk text by yourself, just set the `batch_text_spans` parameter in the `encode` function.

```python
import os
import numpy as np

# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from pydantic import BaseModel
from typing import Optional, List
from transformers import AutoTokenizer, AutoModel


class TextSpan(BaseModel):
    s: int  # start token position of the span
    e: int  # end token position of the span
    text: Optional[str] = None
    module_name: str  # "cls_linear" for the cls vector, "chunk_linear" for chunk vectors


prompt = "<|START_INSTRUCTION|>Candidate document<|END_INSTRUCTION|>"

# load model
model = AutoModel.from_pretrained(
    "infgrad/dewey_en_beta",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
)
model.tokenizer = AutoTokenizer.from_pretrained("infgrad/dewey_en_beta")
max_seq_length = 32 * 1024

# chunk text
passage = "this sentence 1. this sentence 2. this sentence 3"
chunks = ["this sentence 1. this sentence 2.", "this sentence 2. this sentence 3"]
prompt_length = len(model.tokenizer.tokenize(prompt))
text_spans = [
    # s=0, e=1 marks the cls vector, so the module_name is cls_linear; for chunk vectors the module_name is chunk_linear
    TextSpan(s=0, e=1, module_name="cls_linear")
]
for chunk in chunks:
    s = passage.find(chunk)
    e = s + len(chunk)
    text_spans.append(
        TextSpan(
            # add 1, as there is a [CLS] token at the beginning of the text
            s=1 + prompt_length + len(model.tokenizer.tokenize(passage[:s])),
            e=1 + prompt_length + len(model.tokenizer.tokenize(passage[:e])),
            module_name="chunk_linear"
        )
    )

passage_vectors_list: List[np.ndarray]
passage_vectors_list, _ = model.encode(
    sentences=[passage],
    use_cuda=False,
    show_progress_bar=True,
    chunk_size=64,
    chunk_overlap=12,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=prompt,
    fast_chunk=True,
    batch_text_spans=[text_spans]
)
print(passage_vectors_list[0].shape, passage_vectors_list[0][:, 2])
# the output is (3, 2048) [0.01461297 0.02085092 0.0022509 ]
```
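
To use these custom chunk vectors for retrieval, score them against a query's single vector exactly as in section 2.3.1. A brief sketch, reusing the `model` object from the example above (the query text is our own toy example):

```python
RETRIEVE_Q_PROMPT = "<|START_INSTRUCTION|>Answer the question<|END_INSTRUCTION|>"
# chunk_size=-1 returns [cls_vector, mean_vector] per query; row 1 is the mean vector
query_vectors = model.encode(
    sentences=["what do these sentences talk about?"],
    use_cuda=False,
    show_progress_bar=True,
    chunk_size=-1,
    chunk_overlap=12,
    convert_to_tensor=False,
    max_seq_length=max_seq_length,
    batch_size=8,
    normalize_embeddings=True,
    prompt=RETRIEVE_Q_PROMPT,
    fast_chunk=False
)[0]
# final query-document score: max similarity over the document's chunk vectors
print((query_vectors[0][1:2, :] @ passage_vectors_list[0].T).max())
```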

## 3 Evaluation

### 3.1 MTEB(eng, v2)

URL: http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28eng%2C+v2%29

Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_mteb_dewey_en_beta.py

| **Model** | **Zero-shot** | **Parameters** | **Dimensions** | **Max Tokens** | **Mean (Task)** | **Mean (TaskType)** | **Classification** | **Clustering** | **Pair Classification** | **Reranking** | **Retrieval** | **STS** | **Summarization** |
|:------------------------------------------------------------------------------------------------------------------------:|:-------------:|:--------------:|:--------------:|:--------------:|:---------------:|:-------------------:|:------------------:|:--------------:|:-----------------------:|:-------------:|:-------------:|:-------:|:-----------------:|
| [gemini-embedding-exp-03-07](https://developers.googleblog.com/en/gemini-embedding-text-model-now-available-gemini-api/) | 95% | Unknown | 3072 | 8192 | 73.3 | 67.67 | 90.05 | 59.39 | 87.7 | 48.59 | 64.35 | 85.29 | 38.28 |
| [jasper_en_vision_language_v1](https://huggingface.co/NovaSearch/jasper_en_vision_language_v1) | 56% | 1B | 8960 | 131072 | 71.41 | 66.65 | 90.27 | 60.52 | 88.14 | 50 | 56.05 | 84.37 | 37.19 |
| [gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct) | NA | 7B | 3584 | 32768 | 70.72 | 65.77 | 88.52 | 58.97 | 85.9 | 50.47 | 58.09 | 82.69 | 35.74 |
| [stella_en_1.5B_v5](https://huggingface.co/NovaSearch/stella_en_1.5B_v5) | 56% | 1B | 8960 | 131072 | 69.43 | 65.32 | 89.38 | 57.06 | 88.02 | 50.19 | 52.42 | 83.27 | 36.91 |
| [SFR-Embedding-2_R](https://huggingface.co/Salesforce/SFR-Embedding-2_R) | 85% | 7B | 4096 | 32768 | 69.82 | 65.31 | 90.54 | 59.39 | 88.09 | 48.99 | 53.75 | 80.86 | 35.54 |
| [Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) | 95% | 7B | 4096 | 32768 | 69.8 | 65.29 | 83 | 54.07 | 88.44 | 49.44 | 60.14 | 84.69 | 37.26 |
| [NV-Embed-v2](https://huggingface.co/nvidia/NV-Embed-v2) | 56% | 7B | 4096 | 32768 | 69.81 | 65 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
| [SFR-Embedding-Mistral](https://huggingface.co/Salesforce/SFR-Embedding-Mistral) | 85% | 7B | 4096 | 32768 | 69.31 | 64.94 | 80.47 | 54.93 | 88.59 | 50.15 | 59.33 | 84.77 | 36.32 |
| [stella_en_400M_v5](https://huggingface.co/NovaSearch/stella_en_400M_v5) | 56% | 435M | 4096 | 8192 | 69.39 | 64.84 | 88.25 | 57.65 | 87.17 | 49.6 | 52.73 | 83.93 | 34.53 |
| [text-embedding-004](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings) | 95% | Unknown | 768 | 2048 | 69.53 | 64.82 | 86.03 | 51.52 | 87.65 | 48.48 | 59.06 | 84.84 | 36.12 |
| [text-embedding-005](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings) | 95% | Unknown | 768 | 2048 | 69.6 | 64.77 | 86.03 | 51.91 | 87.62 | 48.84 | 58.77 | 85.18 | 35.05 |
| [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) | 95% | 7B | 4096 | 32768 | 67.97 | 64 | 79.85 | 51.44 | 88.42 | 49.78 | 57.62 | 84.32 | 36.57 |
| [text-multilingual-embedding-002](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-text-embeddings) | 95% | Unknown | 768 | 2048 | 67.67 | 63.52 | 84.65 | 50.41 | 86.6 | 47.48 | 54.7 | 83.94 | 36.84 |
| [NV-Embed-v1](https://huggingface.co/nvidia/NV-Embed-v1) | 56% | 7B | 4096 | 32768 | 68.32 | 63.37 | 84.11 | 49.5 | 87.05 | 49.16 | 60.13 | 82.2 | 31.4 |
| **[infgrad/dewey_en_beta](https://huggingface.co/infgrad/dewey_en_beta)** | 95% | 395M | 2048 | 131072 | 68 | 63.30 | 81.83 | 51.75 | 86.82 | 46.35 | 56.32 | 84.21 | 35.79 |
| [gte-Qwen2-1.5B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct) | NA | 1B | 8960 | 32768 | 67.2 | 63.26 | 85.84 | 53.54 | 87.52 | 49.25 | 50.25 | 82.51 | 33.94 |
| [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) | 95% | 7B | 4096 | 4096 | 67.07 | 63.22 | 81.25 | 50.82 | 87.29 | 49.59 | 54.95 | 83.03 | 35.65 |
| [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B) | 95% | 57B | 4096 | 4096 | 66.16 | 62.42 | 79.98 | 51.48 | 85.23 | 49.22 | 52.46 | 82.93 | 35.65 |
| [text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/) | NA | Unknown | 3072 | 8191 | 66.43 | 62.15 | 79.15 | 48.9 | 85.81 | 47.45 | 57.98 | 81.44 | 34.31 |
| [mxbai-embed-large-v1](https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1) | 100% | 335M | 1024 | 512 | 66.26 | 62.04 | 79.1 | 47.48 | 87.2 | 48.05 | 55.4 | 84.42 | 32.63 |
| [GIST-large-Embedding-v0](https://huggingface.co/avsolatorio/GIST-large-Embedding-v0) | 80% | 335M | 1024 | 512 | 66.25 | 61.96 | 78.91 | 48.84 | 86.7 | 48.76 | 54.52 | 84.44 | 31.52 |
| [bge-large-en-v1.5](https://huggingface.co/BAAI/bge-large-en-v1.5) | 100% | 335M | 1024 | 512 | 65.89 | 61.87 | 78.34 | 48.01 | 87.13 | 48.26 | 55.44 | 82.79 | 33.13 |
| [UAE-Large-V1](https://huggingface.co/WhereIsAI/UAE-Large-V1) | 100% | 335M | 1024 | 512 | 66.4 | 61.85 | 79.08 | 47.86 | 87.25 | 48.35 | 55.91 | 84.37 | 30.13 |

### 3.2 LongEmbed

URL: http://mteb-leaderboard.hf.space/?benchmark_name=LongEmbed

Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_long_embed.py

| **Model** | **Zero-shot** | **Number of Parameters** | **Embedding Dimensions** | **Max Tokens** | **Mean (Task)** | **Mean (TaskType)** | **Retrieval** |
|:-------------------------------------------------------------------------------------------------------------------------:|:-------------:|:------------------------:|:------------------------:|:--------------:|:---------------:|:-------------------:|:-------------:|
| **[infgrad/dewey_en_beta-MultiVectors](https://huggingface.co/infgrad/dewey_en_beta)** | 100% | 395M | 2048 | 131072 | 86.59 | 86.59 | 86.59 |
| [voyage-multilingual-2](https://blog.voyageai.com/2024/06/10/voyage-multilingual-2-multilingual-embedding-model/) | 100% | Unknown | 1024 | 32000 | 79.17 | 79.17 | 79.17 |
| [voyage-law-2](https://blog.voyageai.com/2024/04/15/domain-specific-embeddings-and-retrieval-legal-edition-voyage-law-2/) | 100% | Unknown | 1024 | 16000 | 78.85 | 78.85 | 78.85 |
| **[infgrad/dewey_en_beta-SingleVector](https://huggingface.co/infgrad/dewey_en_beta)** | 100% | 395M | 2048 | 131072 | 77.98 | 77.98 | 77.98 |
| [voyage-3](https://blog.voyageai.com/2024/09/18/voyage-3/) | 100% | Unknown | 1024 | 32000 | 74.06 | 74.06 | 74.06 |
| [inf-retriever-v1](https://huggingface.co/infly/inf-retriever-v1) | 100% | 7B | 3584 | 32768 | 73.19 | 73.19 | 73.19 |

### 3.3 LoCoV1

URL: https://huggingface.co/datasets/hazyresearch/LoCoV1-Queries\
https://huggingface.co/datasets/hazyresearch/LoCoV1-Documents

Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_loco.py

Metric: NDCG@10
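
For reference, NDCG@10 is the standard normalized discounted cumulative gain truncated at rank 10, where IDCG@10 is the DCG@10 of the ideal ranking:

$$
\mathrm{NDCG@10}=\frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}},\qquad \mathrm{DCG@10}=\sum_{i=1}^{10}\frac{2^{\mathrm{rel}_i}-1}{\log_2(i+1)}
$$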

Result:

| **dataset-name** | **bge-m3-8k** | **gte-modernbert-base-8k** | **Linq-Embed-Mistral-4k** | **Linq-Embed-Mistral-8k** | **SFR-Embedding-Mistral-8k** | **e5-mistral-7b-instruct-8k** | **dewey_en_beta-8k** | **dewey_en_beta-64k** | **dewey_en_beta-64k-multi-vectors** |
|:---------------------------------:|:-------------:|:--------------------------:|:-------------------------:|:-------------------------:|:----------------------------:|:-----------------------------:|:--------------------:|:------------------------:|:--------------------------------------:|
| **2wikimqa_test** | 0.9271 | 0.8658 | 0.8884 | 0.9067 | 0.8965 | 0.8901 | 0.8953 | 0.9051 | 0.9775 |
| **courtlistener_HTML_test** | 0.1933 | 0.2349 | 0.3551 | 0.3670 | 0.3647 | 0.3543 | 0.3415 | 0.3616 | 0.4775 |
| **courtlistener_Plain_Text_test** | 0.1888 | 0.2478 | 0.3675 | 0.3761 | 0.3679 | 0.3579 | 0.3377 | 0.3485 | 0.4426 |
| **gov_report_test** | 0.9869 | 0.9750 | 0.9832 | 0.9837 | 0.9816 | 0.9823 | 0.9855 | 0.9883 | 0.9853 |
| **legal_case_reports_test** | 0.3702 | 0.4476 | 0.5398 | 0.5432 | 0.5319 | 0.4850 | 0.5474 | 0.5875 | 0.6534 |
| **multifieldqa_test** | 0.9373 | 0.9341 | 0.9345 | 0.9327 | 0.9450 | 0.9321 | 0.9687 | 0.9564 | 0.9754 |
| **passage_retrieval_test** | 0.4493 | 0.5271 | 0.3470 | 0.3407 | 0.2902 | 0.3248 | 0.7562 | 0.7389 | 0.8550 |
| **qasper_abstract_test** | 1.0000 | 0.9806 | 0.9982 | 0.9982 | 0.9973 | 0.9965 | 0.9973 | 0.9982 | 0.9982 |
| **qasper_title_test** | 0.9860 | 0.8892 | 0.9838 | 0.9833 | 0.9861 | 0.9812 | 0.9742 | 0.9742 | 0.9840 |
| **qmsum_test** | 0.6668 | 0.6307 | 0.6816 | 0.7237 | 0.7169 | 0.7148 | 0.7438 | 0.7613 | 0.8154 |
| **stackoverflow_test** | 0.9634 | 0.9087 | 0.9760 | 0.9760 | 0.9766 | 0.9690 | 0.9362 | 0.9369 | 0.9443 |
| **summ_screen_fd_test** | 0.9320 | 0.9379 | 0.9747 | 0.9635 | 0.9656 | 0.9580 | 0.9796 | 0.9821 | 0.9788 |
| **Average** | 0.7168 | 0.7150 | 0.7525 | 0.7579 | 0.7517 | 0.7455 | 0.7886 | **0.7949** | **0.8406** |

## 4 Limitations

- English text only.
- On short-text tasks, the performance might not be as good as that of conventional short-text embedding models.
- As noted above, this model is still in an alpha/beta stage; it may show some unexpected behaviour.