Seems to prefer shorter content? Model config issue?
I've been trying to debug why this model doesn't seem to perform as well as others (e.g. bge-base, also 768 dimensions) for retrieval on Cloudflare's public documentation.
The model supports 2048 tokens, but I'm only chunking to 512 tokens or fewer. Even so, it seems to penalize longer chunks and match shorter ones that are similar in length to the query, even though they're far less relevant.
Here's an example notebook:
The query is "How do I get started with Queues?" and the most relevant snippet in our corpus is in there (titled "Queues · Getting started") – but it appears way down the list in terms of cosine similarity, with a bunch of much shorter, but far less relevant snippets (not even related to Queues at all) – of which I've included a few.
It seems to me that it's penalizing the most relevant document for being longer and starting to include more information (even though this is highly relevant to getting started with queues).
One thing I'm wondering is whether the padding tokens are contributing more to the similarity than the extra content tokens are? i.e. it prefers shorter snippets because they'd be heavily padded, just as the short query would be.
Is this a known issue that even ~500 tokens are too many for it to be accurate? Or is there some adjustment that could be made to the padding or pooling that might help here?
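A crude way to check the padding idea, I think, would be to encode the same short text once on its own and once batched together with a much longer text (so it gets padded), and compare the two embeddings. This is just my own diagnostic sketch, not anything from the model card:

from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer("google/embeddinggemma-300m")

short = "task: search result | query: How do I get started with Queues?"
long_doc = "title: Queues · Getting started | text: " + "Cloudflare Queues is a flexible messaging queue. " * 40

# Encoded on its own: no padding is needed for the short text.
alone = model.encode([short], convert_to_tensor=True)

# Encoded in the same batch as a long document: the short text gets padded
# up to the longest sequence in the batch.
batched = model.encode([short, long_doc], convert_to_tensor=True)[0:1]

# If padding tokens leaked into the mean pooling, these two vectors would differ.
print(torch.nn.functional.cosine_similarity(alone, batched))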
From the notebook:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("google/embeddinggemma-300m")
# The first Queues Getting Started document is clearly the most relevant – but it's longer and has actual content about how to get started
# But the other three documents all score higher, even though they've got nothing to do with Queues
# This happens with other documents and queries too
docs = model.encode([
"title: Queues · Getting started | text: Cloudflare Queues is a flexible messaging queue that allows you to queue messages for asynchronous processing. By following this guide, you will create your first queue, a Worker to publish messages to that queue, and a consumer Worker to consume messages from that queue.\n\n## Prerequisites\n\nTo use Queues, you will need:\n\n1. Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).\n2. Install [`Node.js`](https://docs.npmjs.com/downloading-and-installing-node-js-and-npm).\n\n### Node.js version manager\n\nUse a Node version manager like [Volta](https://volta.sh/) or [nvm](https://github.com/nvm-sh/nvm) to avoid permission issues and change Node.js versions. [Wrangler](/workers/wrangler/install-and-update/), discussed later in this guide, requires a Node version of `16.17.0` or later.\n\n## 1. Create a Worker project\n\nYou will access your queue from a Worker, the producer Worker. You must create at least one producer Worker to publish messages onto your queue. If you are using [R2 Bucket Event Notifications](/r2/buckets/event-notifications/), then you do not need a producer Worker.\n\nTo create a producer Worker, run:\n\n```sh\nnpm create cloudflare@latest -- \"producer-worker\n```\n\nThis will create a new directory, which will include both a `src/index.ts` Worker script, and a [`wrangler.jsonc`](/workers/wrangler/configuration/) configuration file. After you create your Worker, you will create a Queue to access.\n\nMove into the newly created directory:\n\n```sh\ncd producer-worker\n```\n\n## 2. Create a queue\n\nTo use queues, you need to create at least one queue to publish messages to and consume messages from.\n\nTo create a queue, run:\n\n```sh\nnpx wrangler queues create \u003CMY-QUEUE-NAME\u003E\n```\n",
"title: Web3 · Ethereum Gateway · Concepts | text: As you get started with Cloudflare's Ethereum Gateway, you may want to read through the following concepts.\n\n:::note\n\nFor help with additional concepts, refer to the [Ethereum documentation](https://ethereum.org/).\n:::",
"title: Load Balancing · Get started | text: Get started with load balancing in one of two ways:\n\n* [Quickstart](/load-balancing/get-started/quickstart/): Get up and running quickly with Load Balancing.\n* [Learning path](/learning-paths/load-balancing/concepts/): Check an in-depth walkthrough for how to plan and set up a load balancer.",
"title: Workers · Tutorials · Build a Slackbot | text: If you want to get started building your own projects, review the existing list of [Quickstart templates](/workers/get-started/quickstarts/).",
])
query = "How do I get started with Queues?"
print(model.similarity(docs, model.encode([f"task: search result | query: {query}"])))
print(model.similarity(docs, model.encode([f"task: question answering | query: {query}"])))
# tensor([[0.4166],
# [0.4299],
# [0.4668],
# [0.4799]])
# tensor([[0.3731],
# [0.4365],
# [0.4623],
# [0.4645]])
weird. it's not padding that's doing it tho. i'm testing the model without sentence transformers so that i can tokenize without padding, and the results are still the same.
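roughly what i mean, as a sketch – tokenize one text at a time so no padding is ever added, then plain mean pooling over the last hidden state (note this skips the Dense projections and final Normalize that the full sentence-transformers stack applies, so absolute scores won't match; it's only meant to take padding out of the equation):

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("google/embeddinggemma-300m")
model = AutoModel.from_pretrained("google/embeddinggemma-300m")
model.eval()

def embed(text: str) -> torch.Tensor:
    # one text at a time, so the tokenizer never inserts padding tokens
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    # plain mean pooling over all (real) tokens
    return hidden.mean(dim=1)

q = embed("task: search result | query: How do I get started with Queues?")
d = embed("title: Queues · Getting started | text: Cloudflare Queues is a flexible messaging queue ...")
print(F.cosine_similarity(q, d))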
this model seems broken as-is. here are MTEB NanoMSMARCORetrieval scores for this model vs Snowflake/snowflake-arctic-embed-m-v2.0:
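for reference, the run itself was along these lines (sketch – exact task names and the mteb api differ a bit between versions):

import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

tasks = mteb.get_tasks(tasks=["NanoMSMARCORetrieval"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/embeddinggemma-300m")

# same thing again with Snowflake/snowflake-arctic-embed-m-v2.0 for the comparison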
Oh awesome, thanks for running that. Seems something's not quite right.
Others are seeing issues too:
https://www.reddit.com/r/LocalLLaMA/comments/1ncfk97/googleembeddinggemma300m_is_broken/
https://www.reddit.com/r/LocalLLaMA/comments/1n8egxb/comment/nceexuz/
https://www.reddit.com/r/LocalLLaMA/comments/1n8egxb/comment/ncf8k9i/
I think something's up with the model weights or config.
If I use a different model source:
!pip install sentence-transformers[onnx]

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "onnx-community/embeddinggemma-300m-ONNX",
    backend="onnx",
    model_kwargs={
        "provider": "CPUExecutionProvider",
        "file_name": "onnx/model.onnx",
    },
)
Then I get much better accuracy:
# The first result is a much better match than the others now
tensor([[0.7392],
[0.4998],
[0.6326],
[0.5803]])
tensor([[0.7187],
[0.5220],
[0.6596],
[0.5953]])
You can try that model out in your browser here too (the demo uses dot product instead of cosine, but same results):
https://huggingface.co/spaces/webml-community/semantic-galaxy
If I print(model) for the ONNX model, I get:
SentenceTransformer(
(0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'ORTModelForFeatureExtraction'})
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
For the official google/embeddinggemma-300m model, I get:
SentenceTransformer(
(0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'Gemma3TextModel'})
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
(3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
(4): Normalize()
)
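One difference that stands out is that the Dense and Normalize modules don't show up in the ONNX stack. A quick way to see what each backend actually returns (and whether dot product vs cosine even matters, per the demo note above) is to compare output shapes and norms – rough sketch:

from sentence_transformers import SentenceTransformer

pt_model = SentenceTransformer("google/embeddinggemma-300m")
onnx_model = SentenceTransformer(
    "onnx-community/embeddinggemma-300m-ONNX",
    backend="onnx",
    model_kwargs={
        "provider": "CPUExecutionProvider",
        "file_name": "onnx/model.onnx",
    },
)

text = ["How do I get started with Queues?"]
for name, m in [("pytorch", pt_model), ("onnx", onnx_model)]:
    emb = m.encode(text, convert_to_tensor=True)
    # a norm of ~1.0 means the embeddings are unit length, in which case
    # dot product and cosine similarity rank results identically
    print(name, tuple(emb.shape), emb.norm(dim=1))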
I think I might have encountered a similar problem. When I applied this model to my own retrieval task, it did not work as well as expected.
Also, I noticed that when I run the example code from the model card, the output does not match the values in the comments of the example. I wonder if this is normal?
- example's output:
tensor([[0.3011, 0.6359, 0.4930, 0.4889]])
- my output:
tensor([[0.4989, 0.7087, 0.5910, 0.5932]])
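In case it's an environment difference, this is the kind of thing that's probably worth comparing (just a sketch of what I'd check – library versions and the dtype the model loads in):

import torch
import transformers
import sentence_transformers
from sentence_transformers import SentenceTransformer

print(sentence_transformers.__version__, transformers.__version__, torch.__version__)

model = SentenceTransformer("google/embeddinggemma-300m")
# dtype the weights actually loaded in (float32 vs bfloat16 can shift scores slightly)
print(next(model.parameters()).dtype)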
hey, good catch! here are results for "onnx-community/embeddinggemma-300m-ONNX" on MTEB "AppsRetrieval" and "NanoMSMARCORetrieval": https://pastebin.com/77qwWth4
AppsRetrieval is in the ballpark of what i'd expect for not using the query format; NanoMSMARCORetrieval is improved but still seems low? i probably need to make a custom model wrapper to benchmark it a bit better (rough sketch below), but things are looking up!
interesting comparison between the model configs too
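here's the rough shape of the wrapper i had in mind – just prepending the EmbeddingGemma prompts before encoding. treat it as a sketch: the exact hook mteb calls is version-dependent (newer versions pass prompt_type into encode(), older ones want encode_queries/encode_corpus), and the "title: none" document prefix is my assumption for untitled passages:

import numpy as np
from sentence_transformers import SentenceTransformer

class PromptedEmbeddingGemma:
    def __init__(self, name: str, **st_kwargs):
        self.model = SentenceTransformer(name, **st_kwargs)

    def encode(self, sentences, prompt_type=None, **kwargs) -> np.ndarray:
        # mteb may pass a PromptType enum or a plain string; normalize it
        kind = getattr(prompt_type, "value", prompt_type)
        if kind == "query":
            sentences = [f"task: search result | query: {s}" for s in sentences]
        else:
            # document-side prompt; "none" as the title is assumed for untitled passages
            sentences = [f"title: none | text: {s}" for s in sentences]
        return self.model.encode(
            sentences,
            batch_size=kwargs.get("batch_size", 32),
            convert_to_numpy=True,
        )

wrapped = PromptedEmbeddingGemma("google/embeddinggemma-300m")
# then pass `wrapped` into the same mteb run as above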