Seems to prefer shorter content? Model config issue?
I've just been trying to debug why this model doesn't seem to perform as well as others (eg, bge-base, also 768 dimensions) for retrieval on Cloudflare's public documentation.
The model supports 2048 tokens, but I'm only chunking to 512 or fewer. I'm still finding that it seems to penalize longer chunks and match shorter ones that are a similar length to the query, even though they're far less relevant.
Here's an example notebook:
The query is "How do I get started with Queues?" and the most relevant snippet in our corpus is in there (titled "Queues · Getting started") – but it appears way down the list in terms of cosine similarity, with a bunch of much shorter, but far less relevant snippets (not even related to Queues at all) – of which I've included a few.
It seems to me that it's penalizing the most relevant document for being longer and including more information (even though that information is highly relevant to getting started with Queues).
One thing I'm wondering is whether the padding tokens are driving the similarity up more than the extra content tokens are – ie, it prefers shorter snippets because they'd be padded, just as the short query would be.
Is this a known issue that even ~500 tokens are too many for it to be accurate? Or is there some adjustment that could be made to the padding or pooling that might help here?
From the notebook:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("google/embeddinggemma-300m")
# The first Queues Getting Started document is clearly the most relevant – but it's longer and has actual content about how to get started
# But the other three documents all score higher, even though they've got nothing to do with Queues
# This happens with other documents and queries too
docs = model.encode([
"title: Queues · Getting started | text: Cloudflare Queues is a flexible messaging queue that allows you to queue messages for asynchronous processing. By following this guide, you will create your first queue, a Worker to publish messages to that queue, and a consumer Worker to consume messages from that queue.\n\n## Prerequisites\n\nTo use Queues, you will need:\n\n1. Sign up for a [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages).\n2. Install [`Node.js`](https://docs.npmjs.com/downloading-and-installing-node-js-and-npm).\n\n### Node.js version manager\n\nUse a Node version manager like [Volta](https://volta.sh/) or [nvm](https://github.com/nvm-sh/nvm) to avoid permission issues and change Node.js versions. [Wrangler](/workers/wrangler/install-and-update/), discussed later in this guide, requires a Node version of `16.17.0` or later.\n\n## 1. Create a Worker project\n\nYou will access your queue from a Worker, the producer Worker. You must create at least one producer Worker to publish messages onto your queue. If you are using [R2 Bucket Event Notifications](/r2/buckets/event-notifications/), then you do not need a producer Worker.\n\nTo create a producer Worker, run:\n\n```sh\nnpm create cloudflare@latest -- \"producer-worker\n```\n\nThis will create a new directory, which will include both a `src/index.ts` Worker script, and a [`wrangler.jsonc`](/workers/wrangler/configuration/) configuration file. After you create your Worker, you will create a Queue to access.\n\nMove into the newly created directory:\n\n```sh\ncd producer-worker\n```\n\n## 2. Create a queue\n\nTo use queues, you need to create at least one queue to publish messages to and consume messages from.\n\nTo create a queue, run:\n\n```sh\nnpx wrangler queues create \u003CMY-QUEUE-NAME\u003E\n```\n",
"title: Web3 · Ethereum Gateway · Concepts | text: As you get started with Cloudflare's Ethereum Gateway, you may want to read through the following concepts.\n\n:::note\n\nFor help with additional concepts, refer to the [Ethereum documentation](https://ethereum.org/).\n:::",
"title: Load Balancing · Get started | text: Get started with load balancing in one of two ways:\n\n* [Quickstart](/load-balancing/get-started/quickstart/): Get up and running quickly with Load Balancing.\n* [Learning path](/learning-paths/load-balancing/concepts/): Check an in-depth walkthrough for how to plan and set up a load balancer.",
"title: Workers · Tutorials · Build a Slackbot | text: If you want to get started building your own projects, review the existing list of [Quickstart templates](/workers/get-started/quickstarts/).",
])
query = "How do I get started with Queues?"
print(model.similarity(docs, model.encode([f"task: search result | query: {query}"])))
print(model.similarity(docs, model.encode([f"task: question answering | query: {query}"])))
# tensor([[0.4166],
# [0.4299],
# [0.4668],
# [0.4799]])
# tensor([[0.3731],
# [0.4365],
# [0.4623],
# [0.4645]])
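(Side note: Sentence Transformers can also apply the model's built-in prompts for you – a sketch assuming the checkpoint defines a "query" prompt mapping to the same "task: search result | query: " prefix, as the model card suggests:)
# Same comparison, but letting the library apply the named prompt instead of
# building the f-string by hand (assumes a "query" prompt is defined)
query_emb = model.encode([query], prompt_name="query")
print(model.similarity(docs, query_emb))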
weird. it's not padding that's doing it tho. i'm testing the model without sentence transformers so that i can tokenize without padding, and the results are still the same.
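roughly like this – a sketch of the no-padding test (plain transformers, one sequence at a time, mean pooling by hand; this skips the Dense projection layers the full SentenceTransformer pipeline applies, so absolute scores won't match, but it isolates the padding question):
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/embeddinggemma-300m")
mdl = AutoModel.from_pretrained("google/embeddinggemma-300m")

def embed(text: str) -> torch.Tensor:
    # one sequence per forward pass, so there are no padding tokens at all
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = mdl(**inputs)
    # mean-pool over all token positions (no attention mask needed without padding)
    return out.last_hidden_state.mean(dim=1).squeeze(0)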
this model seems broken as-is. here are MTEB NanoMSMARCORetrieval scores for this model vs Snowflake/snowflake-arctic-embed-m-v2.0:
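(for reference, this is roughly how i'm running the benchmark – a sketch using the mteb package, assuming it accepts a SentenceTransformer directly:)
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")
tasks = mteb.get_tasks(tasks=["NanoMSMARCORetrieval"])
results = mteb.MTEB(tasks=tasks).run(model, output_folder="results")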
Oh awesome, thanks for running that. Seems something's not quite right.
Others are seeing issues too:
https://www.reddit.com/r/LocalLLaMA/comments/1ncfk97/googleembeddinggemma300m_is_broken/
https://www.reddit.com/r/LocalLLaMA/comments/1n8egxb/comment/nceexuz/
https://www.reddit.com/r/LocalLLaMA/comments/1n8egxb/comment/ncf8k9i/
I think something's up with the model weights or config.
If I use a different model source:
!pip install sentence-transformers[onnx]
model = SentenceTransformer("onnx-community/embeddinggemma-300m-ONNX", backend="onnx",
model_kwargs={
"provider": "CPUExecutionProvider",
"file_name": "onnx/model.onnx",
})
Then I get much better accuracy:
# The first result is a much better match than the others now
tensor([[0.7392],
[0.4998],
[0.6326],
[0.5803]])
tensor([[0.7187],
[0.5220],
[0.6596],
[0.5953]])
You can try that model out in your browser here too (the demo uses dot product instead of cosine, but same results):
https://huggingface.co/spaces/webml-community/semantic-galaxy
If I print(model) for the ONNX model, I get:
SentenceTransformer(
(0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'ORTModelForFeatureExtraction'})
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
For the official google/embeddinggemma-300m model, I get:
SentenceTransformer(
(0): Transformer({'max_seq_length': 2048, 'do_lower_case': False, 'architecture': 'Gemma3TextModel'})
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
(3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
(4): Normalize()
)
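A quick sanity check for how far apart the two backends actually are – a sketch, where pt_model and onnx_model are just the two SentenceTransformer instances loaded above:
import numpy as np

text = "How do I get started with Queues?"
a = pt_model.encode(text)    # official google/embeddinggemma-300m weights
b = onnx_model.encode(text)  # onnx-community/embeddinggemma-300m-ONNX
# cosine similarity between the two backends' embeddings; ~1.0 means they agree
print(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))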
I think I might have encountered a similar problem as well. When I applied this model to my own retrieval task, the model did not work as well as expected.
Also, I noticed that when using the example code from the model card, the output is not the same as the comments in the example. I wonder if this is normal?
- example's output:
tensor([[0.3011, 0.6359, 0.4930, 0.4889]])
- my output:
tensor([[0.4989, 0.7087, 0.5910, 0.5932]])
hey, good catch! here are results for "onnx-community/embeddinggemma-300m-ONNX" on MTEB "AppsRetrieval" and "NanoMSMARCORetrieval" : https://pastebin.com/77qwWth4
AppsRetrieval is in the ballpark of what i'd expect for not using the query format; NanoMSMARCORetrieval is improved but still seems low? i probably need to make a custom model wrapper to benchmark it a bit better, but things are looking up!
interesting comparison between the model configs too
Same here, the results when using the SentenceTransformer implementation (no quant) are pretty bad.
I've not been able to double-check this with confidence, but are you all on the required transformers version? Without it, the model will use causal attention instead of bidirectional attention:
pip install git+https://github.com/huggingface/[email protected]
pip install sentence-transformers>=5.0.0
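You can double-check what's actually installed with:
import sentence_transformers
import transformers

print(transformers.__version__)           # should be the Embedding-Gemma preview build
print(sentence_transformers.__version__)  # should be >= 5.0.0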
Source: https://huggingface.co/blog/embeddinggemma#sentence-transformers
With this correct version, I get:
tensor([[0.3008, 0.6361, 0.4927, 0.4889]])
- Tom Aarsen
oh goodness! It seems the problem was hidden in the specific version of transformers. After switching transformers from the latest version to v4.56.0-Embedding-Gemma-preview, everything went back to normal.
Thank you very much!
Use transformers with version v4.56.0-Embedding-Gemma-preview and everything will go well 😊
lol, i thought > 4.56 had it in there, but it's not coming until 4.57
Hi @hichaelmart,
Could you please try running with the latest transformers and sentence_transformers versions? I have tried the latest versions of these libraries and was able to get good results.
The working versions of transformers and sentence_transformers:
Please find the attached gist file for your reference.
Thanks.
I was following the instructions from here: https://huggingface.co/google/embeddinggemma-300m#usage
That just says to use pip install -U sentence-transformers
The versions you're using seem to be very specific ones that aren't mentioned on the model page (or in the official documentation?). Are you sure they're official versions of these libraries? It just seems strange that it's called @v4.56.0-Embedding-Gemma-preview.
In any case, glad this is solved – but seems this should be mentioned in more places officially.
Hi @hichaelmart,
Yes, this is a specific dev version of transformers; I have already forwarded the same concern to the respective team to look into. Thank you so much for your patience and understanding.
Thanks.
I also originally forked a notebook called "embeddinggemma-300m.ipynb" from here: https://huggingface.co/google/embeddinggemma-300m.ipynb
It didn't previously include these specific version instructions, but it seems it's now been updated with them (and a lot more).
Glad to see it!
On an MPS device, I get
tensor([[0.3008, 0.6361, 0.4927, 0.4889]])
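(loaded with the model-card example, just pinned to MPS – a minimal sketch:)
from sentence_transformers import SentenceTransformer

# same model-card example, explicitly on Apple's MPS backend
model = SentenceTransformer("google/embeddinggemma-300m", device="mps")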
Hey @BalakrishnaCh – can you ping them again? The official instructions still haven't been updated:
https://huggingface.co/google/embeddinggemma-300m#usage
Hi @hichaelmart,
I have forwarded your concern to the team, they are currently looking into it. Your patience is really appreciated in this matter.
Thanks.
@BalakrishnaCh any update here?
I created a pull request in case there's any confusion about what needs to be changed in the README: https://huggingface.co/google/embeddinggemma-300m/discussions/24
Thanks for raising the PR with the changes. The team will review them and merge once the review is done.