What similarity function is best for arctic-embed-v2.0 models?

#6
by Androssi - opened

Hey there, thanks for open-sourcing arctic-embed-m-v2.0 and arctic-embed-l-v2.0.
I am trying them out in some simple RAG tasks and they are really impressive!

I was wondering if there is any significant difference switching the similarity function from cosine similarity to dot product.
I see that in the model cards:

  • the Sentence Transformers example uses cosine similarity
  • the Hugging Face Transformers examples (both Python and JS) seem to use dot product.

So I guess the models can work reasonably well with both.
But I am curious if you have any insights on which approach I should use.
E.g., what similarity did you use when evaluating on MTEB/MIRACL/CLEF?

Snowflake org

If you look at the Hugging Face Transformers example code closely, you'll see that the query and document vectors are normalized to length 1.0. The dot product of normalized vectors is mathematically equivalent to their cosine similarity, so there isn't a difference you need to worry about here. In practice, normalizing ahead of time reduces the amount of computation required to compute similarity during retrieval, so this is the approach we recommend for most practitioners. We do not provide any evals or suggestions for dot-product similarity on unnormalized vectors.

Equation

When vectors $a$ and $b$ are unit length, we have equivalence between cosine similarity and dot product:

$$||a|| = ||b|| = 1$$

implies

$$cossim(a, b) = \frac{a^\intercal b}{||a|| \cdot ||b||} = \frac{a^\intercal b}{1 \cdot 1} = a^\intercal b = dotproduct(a, b)$$

See also: https://en.wikipedia.org/wiki/Cosine_similarity
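To make the equivalence concrete, here is a minimal NumPy sketch (the vectors are made-up toy values for illustration, not actual model outputs or model card code): normalizing once up front and then using a plain dot product gives the same score as computing cosine similarity on the raw vectors.

```python
import numpy as np

# Toy stand-ins for a query and a document embedding
# (made-up values; in practice these come from the model).
q = np.array([0.3, -1.2, 0.8, 0.5])
d = np.array([1.1, -0.4, 0.2, 0.9])

def cosine(a, b):
    """Cosine similarity of two un-normalized vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Normalize ahead of time, as the example code does.
q_hat = q / np.linalg.norm(q)
d_hat = d / np.linalg.norm(d)

# Dot product of the normalized vectors equals cosine similarity of the
# originals (up to floating-point error), so either similarity function
# gives the same ranking once the embeddings are normalized.
assert np.isclose(q_hat @ d_hat, cosine(q, d))
print(cosine(q, d), float(q_hat @ d_hat))
```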

Aaaaaaah thank you, I missed the fact that the vectors were normalized beforehand in that case! My bad!!
Thank you for your super quick and thorough reply, I really appreciate it!

Androssi changed discussion status to closed
