E5 Base, Arctic Edition

This model is the result of the Arctic Embed walkthrough example for training embedding models using the open-source Arctic Embed codebase. In the walkthrough, we fine-tune e5-base-unsupervised on an improved dataset that applies modern hard-negative mining practices and includes three more high-quality retrieval datasets than the original E5 fine-tuning pipeline.

Model	BEIR Score (nDCG@10)	CLEF English (nDCG@10)
e5-base-v2	50.19	45.38
arctic-e5-base	54.70	52.77
gte-base-en-v1.5	54.02	47.91
arctic-embed-m-v1.0	54.89	47.62
arctic-embed-m-v2.0	55.38	54.06

NOTE: This model was trained as an example and heavily leverages in-domain datasets from the data sources used by the BEIR benchmark. Though it performs well on the CLEF English dataset, it may be substantially overfit to the domains of the BEIR benchmark and may not generalize well to certain applications.
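
For readers unfamiliar with the nDCG@10 metric reported in the table above, the following is a minimal sketch of how it can be computed for a single query. It assumes linear gain (some definitions use 2^rel - 1 instead) and assumes the input list holds the relevance labels of every judged document in ranked order, so the ideal ordering can be derived by sorting it. The function names `dcg_at_k` and `ndcg_at_k` are illustrative and are not taken from the Arctic Embed codebase or the BEIR tooling.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    # The result at 0-based position `rank` is discounted by log2(rank + 2),
    # so the top result is divided by log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents were judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Toy example: relevant documents were returned at positions 1 and 3.
example_score = ndcg_at_k([1, 0, 1, 0, 0])
print(round(example_score, 4))
```

A benchmark score is then typically the mean of this per-query value over all queries. The table values appear to be reported on a 0 to 100 scale.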

Usage

Using Sentence Transformers

You can use the sentence-transformers package to run this model, as shown below.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-e5-base")

queries = ['what is snowflake?', 'Where can I get the best tacos?']
documents = ['The Data Cloud!', 'Mexico City of Course!']

query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

scores = query_embeddings @ document_embeddings.T
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)

Produces:

Query: what is snowflake?
0.2747492 The Data Cloud!
0.19998045 Mexico City of Course!
Query: Where can I get the best tacos?
0.29974818 Mexico City of Course!
0.2344071 The Data Cloud!

Using Hugging Face Transformers

You can also run the model directly with the transformers package, as shown below. For optimal retrieval quality, embed each text with the CLS token (not mean pooling) and add the standard E5 query and document prefixes used in the code.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Snowflake/snowflake-arctic-e5-base')
model = AutoModel.from_pretrained('Snowflake/snowflake-arctic-e5-base')
model.eval()

query_prefix = 'query: '
queries = ['what is snowflake?', 'Where can I get the best tacos?']
queries_with_prefix = ["{}{}".format(query_prefix, q) for q in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)

document_prefix = 'passage: '
documents = ['The Data Cloud!', 'Mexico City of Course!']
documents_with_prefix = ["{}{}".format(document_prefix, d) for d in documents]
document_tokens = tokenizer(documents_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)

# Run the model and keep the first-token (CLS) embedding of each sequence
with torch.inference_mode():
    query_embeddings = model(**query_tokens)[0][:, 0]
    document_embeddings = model(**document_tokens)[0][:, 0]

# Normalize embeddings to unit length so the dot product equals cosine similarity
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)

scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
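
To make the pooling and scoring steps above concrete, here is a dependency-free sketch of the same operations on toy values: taking the first-token (CLS) vector, the mean-pooling alternative that this model card advises against, and L2 normalization followed by a dot product. The helper names and the tiny two-dimensional vectors are hypothetical and chosen only for illustration. Real embeddings come from the model.

```python
import math


def cls_pool(token_embeddings):
    """Return the first-position (CLS) vector of each sequence."""
    return [sequence[0] for sequence in token_embeddings]


def mean_pool(token_embeddings, attention_mask):
    """Average the non-padding token vectors of each sequence."""
    pooled = []
    for sequence, mask in zip(token_embeddings, attention_mask):
        kept = [vec for vec, keep in zip(sequence, mask) if keep]
        dim = len(sequence[0])
        pooled.append([sum(vec[i] for vec in kept) / len(kept) for i in range(dim)])
    return pooled


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


# One toy sequence of three token vectors; the last position is padding.
token_embeddings = [[[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]]
attention_mask = [[1, 1, 0]]

print(cls_pool(token_embeddings))                   # first-token vector only
print(mean_pool(token_embeddings, attention_mask))  # average over real tokens

# After normalization, the dot product is the cosine of the angle between vectors.
query_vec = l2_normalize([3.0, 4.0])
doc_vec = l2_normalize([4.0, 3.0])
print(dot(query_vec, doc_vec))
```

The two pooling strategies generally give different vectors for the same input, which is why using the pooling the model was trained with matters for retrieval quality.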

License

Arctic is licensed under Apache-2.0. The released models can be used for commercial purposes free of charge.