Today I'm introducing a method to train static embedding models that run 100x to 400x faster on CPU than common embedding models, while retaining 85%+ of the quality - including 2 fully open models with training scripts, datasets, and metrics.
We apply our recipe to train 2 Static Embedding models that we release today. We release:
- an English Retrieval model and a general-purpose Multilingual similarity model (e.g. classification, clustering, etc.), both Apache 2.0
- my modern training strategy: ideation -> dataset choice -> implementation -> evaluation
- my training scripts, using the Sentence Transformers library
- my Weights & Biases reports with losses & metrics
- my list of 30 training and 13 evaluation datasets
The 2 Static Embedding models have the following properties:
- Extremely fast: e.g. 107,500 sentences per second on a consumer CPU, compared to 270 for 'all-mpnet-base-v2' and 56 for 'gte-large-en-v1.5'
- Zero active parameters: no Transformer blocks, no attention, not even a matrix multiplication. Super speed!
- No maximum sequence length: embed texts of any length (note: longer texts may embed worse)
- Linear instead of quadratic complexity: 2x longer text takes 2x longer to embed, instead of 2.5x or more
- Matryoshka support: truncate embeddings with minimal performance loss (e.g. 4x smaller with a 0.56% performance decrease on English Similarity tasks)
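As a rough illustration of why these models are so fast, here is a minimal sketch of what a static embedding model computes: a table lookup per token, mean pooling, and optional Matryoshka-style truncation. The vocabulary, dimensions, and vectors below are toy placeholders, not the released models:

```python
import numpy as np

# Toy static embedding model: each token maps to a fixed vector, and the
# sentence embedding is the mean of its token vectors. No attention, no
# matrix multiplication - just lookups and averaging, hence the speed and
# the linear scaling with text length.
rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3, "<unk>": 4}
dim = 8  # real models use much larger dimensions, e.g. 1024
table = rng.normal(size=(len(vocab), dim)).astype(np.float32)

def embed(text, truncate_dim=None):
    # Whitespace tokenization stands in for a real tokenizer.
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]
    emb = table[ids].mean(axis=0)   # pure lookup + mean pooling
    if truncate_dim is not None:    # Matryoshka-style truncation
        emb = emb[:truncate_dim]
    return emb / np.linalg.norm(emb)  # L2-normalize

full = embed("the cat sat")
small = embed("the cat sat", truncate_dim=4)  # 2x smaller embedding
```

There is no per-position interaction between tokens, which is also why such models have no maximum sequence length.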
The blog post contains a lengthy list of possible advancements; I'm very confident that our 2 models are only the tip of the iceberg, and we may be able to get even better performance.
Hey, it has been a while... I was busy participating in a Gemma Competition!
Here's the idea: Gemma open models have a large vocabulary size (256K), so improving them for a specific language or cultural context should be pretty affordable - no need for continued pre-training.
In this notebook, I show how I improve the performance of Gemma 2 2B on Italian via Post-Training. I believe this method is adaptable to other languages and model sizes.
Key Steps
- Choose reference metrics
- Data curation for Instruction Fine-Tuning: identify existing datasets + generate synthetic data
- Efficient Instruction Fine-Tuning with Spectrum
- Data curation for Preference Tuning: identify existing datasets + generate synthetic data
- Efficient Direct Preference Optimization with Spectrum
- Evaluation
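To make the preference-tuning step concrete, here is a hedged sketch of the Direct Preference Optimization loss on a single preference pair; the log-probability values are invented for illustration, and an actual training run would use a library such as TRL rather than this hand-rolled function:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given summed log-probs of
    each response under the policy and the frozen reference model."""
    # Margin: how much more the policy prefers chosen over rejected,
    # relative to the reference model's preference.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # Negative log-sigmoid: minimizing it pushes the margin up.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Made-up numbers: the loss shrinks as the policy more strongly
# prefers the chosen response over the rejected one.
loose = dpo_loss(-10.0, -12.0, ref_chosen=-10.0, ref_rejected=-10.0)
tight = dpo_loss(-5.0, -20.0, ref_chosen=-10.0, ref_rejected=-10.0)
```

Spectrum fits in upstream of this: it selects which layers to unfreeze before fine-tuning, so both the SFT and DPO stages update only a subset of the model's parameters.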
I'm also planning a Gemma Giveaway (on LinkedIn - https://www.linkedin.com/in/stefano-fiorucci) in the next few days - sharing techniques, datasets, and models I used for my project... so stay tuned!