Suggestion: Different base model


Hello @kenoc !

The https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1 model is very strong, but it's only trained on English texts. Notably, it also uses an English tokenizer. This means that it's not well suited for other languages, as it isn't really familiar with how words in those languages are written or split into tokens.

Luckily, they also made a German-English model! https://huggingface.co/mixedbread-ai/deepset-mxbai-embed-de-large-v1
This one uses a multilingual tokenizer, and I think it should work very well when finetuned (or even before).
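As a rough sketch, loading that model works like any other Sentence Transformers model; the sentences below are just placeholders for a quick sanity check before finetuning:

from sentence_transformers import SentenceTransformer

# Load the German-English base model instead of the English-only one
model = SentenceTransformer("mixedbread-ai/deepset-mxbai-embed-de-large-v1")

# Placeholder sentences, just to sanity-check the embeddings before finetuning
embeddings = model.encode(["Wie hoch ist die Zugspitze?", "How tall is the Zugspitze?"])
print(embeddings.shape)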

One more comment: in your training samples I'm noticing that positive is your first column and anchor your second. In Sentence Transformers, the order of the columns matters a lot. So, right now you're training with:

  • Given the answer, which of these questions is the corresponding one?
    instead of
  • Given the question, which of these answers is the corresponding one?

I would recommend changing the order with:

train_dataset = train_dataset.select_columns(["anchor", "positive"])  # anchor (question) first, positive (answer) second
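Putting it together, here's a rough sketch of what the corrected training setup could look like. The dataset id is a placeholder, and I'm assuming a MultipleNegativesRankingLoss-style setup with (anchor, positive) pairs; adjust to your actual data and loss:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Placeholder dataset id; replace with your own question-answer pairs
train_dataset = load_dataset("your-username/your-qa-dataset", split="train")
# Anchor (question) first, positive (answer) second
train_dataset = train_dataset.select_columns(["anchor", "positive"])

model = SentenceTransformer("mixedbread-ai/deepset-mxbai-embed-de-large-v1")
loss = MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()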
- Tom Aarsen
