🏎️ Today I'm introducing a method to train static embedding models that run 100x to 400x faster on CPU than common embedding models, while retaining 85%+ of the quality! Including 2 fully open models: training scripts, datasets, metrics.
We apply our recipe to train 2 Static Embedding models that we release today! We release:
2️⃣ an English Retrieval model and a general-purpose Multilingual similarity model (e.g. classification, clustering, etc.), both Apache 2.0
🧠 my modern training strategy: ideation -> dataset choice -> implementation -> evaluation
📜 my training scripts, using the Sentence Transformers library
📊 my Weights & Biases reports with losses & metrics
📕 my list of 30 training and 13 evaluation datasets
The 2 Static Embedding models have the following properties:
🏎️ Extremely fast, e.g. 107500 sentences per second on a consumer CPU, compared to 270 for 'all-mpnet-base-v2' and 56 for 'gte-large-en-v1.5'
0️⃣ Zero active parameters: No Transformer blocks, no attention, not even a matrix multiplication. Super speed!
📏 No maximum sequence length! Embed texts at any length (note: longer texts may embed worse)
📐 Linear instead of quadratic complexity: 2x longer text takes 2x longer, instead of 2.5x or more.
🪆 Matryoshka support: allows you to truncate embeddings with minimal performance loss (e.g. 4x smaller with a 0.56% perf. decrease for English Similarity tasks)
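To make the "no Transformer blocks, no attention" point concrete, here is a minimal pure-Python sketch of what a static embedding model does at inference: look up one vector per token and average them. The vocabulary, the vector values and the whitespace tokenizer are all made up for illustration; the released models use a real tokenizer and learned weights, so treat this as a toy under those assumptions rather than the actual implementation.

```python
import math

# Hypothetical toy embedding table. A real model has a learned table with
# tens of thousands of tokens; these 4-dimensional values are invented.
EMBEDDINGS = {
    "fast": [0.9, 0.1, 0.0, 0.2],
    "cpu": [0.7, 0.3, 0.1, 0.0],
    "embedding": [0.1, 0.8, 0.4, 0.1],
    "model": [0.2, 0.7, 0.5, 0.0],
}
UNKNOWN = [0.0, 0.0, 0.0, 0.0]  # fallback for tokens outside the toy vocabulary


def embed(text: str) -> list[float]:
    """Embed a text by averaging one looked-up vector per token."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenizer
    if not tokens:
        return list(UNKNOWN)
    total = [0.0] * len(UNKNOWN)
    for token in tokens:  # a single pass, so cost grows linearly with length
        vector = EMBEDDINGS.get(token, UNKNOWN)
        for i, value in enumerate(vector):
            total[i] += value
    return [value / len(tokens) for value in total]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


query = embed("fast cpu")
print(cosine(query, embed("fast")))             # related text
print(cosine(query, embed("embedding model")))  # less related text
```

Because the only work is a table lookup and a running sum, there is nothing that caps the input length, and word order does not affect the result. That is part of why longer texts may embed worse.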
The blogpost contains a lengthy list of possible advancements; I'm very confident that our 2 models are only the tip of the iceberg, and we may be able to get even better performance.
That didn't take long! Nomic AI has finetuned the new ModernBERT-base encoder model into a strong embedding model for search, classification, clustering and more!
Details:
🤖 Based on ModernBERT-base with 149M parameters.
📊 Outperforms both nomic-embed-text-v1 and nomic-embed-text-v1.5 on MTEB!
🏎️ Immediate FA2 and unpadding support for super efficient inference.
🪆 Trained with Matryoshka support, i.e. 2 valid output dimensionalities: 768 and 256.
➡️ Maximum sequence length of 8192 tokens!
2️⃣ Trained in 2 stages: unsupervised contrastive data -> high quality labeled datasets.
➕ Integrated in Sentence Transformers, Transformers, LangChain, LlamaIndex, Haystack, etc.
🏛️ Apache 2.0 licensed: fully commercially permissible
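For anyone wondering what "2 valid output dimensionalities" means in practice: with a Matryoshka-trained model you can keep only the first 256 of the 768 values and re-normalize. Below is a small pure-Python sketch of that truncation step. The input vector is synthetic (not a real model output) and the helper names are my own; in Sentence Transformers the `truncate_dim` argument covers the truncation part for you.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length (returned unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def truncate_embedding(vector: list[float], dim: int) -> list[float]:
    """Keep the first `dim` values, then re-normalize to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError(f"dim must be between 1 and {len(vector)}, got {dim}")
    return normalize(vector[:dim])


# Synthetic stand-in for a 768-dimensional embedding. The values are invented
# and only decay with the index to mimic "most information comes first".
full = normalize([math.sin(i) / (1 + i) for i in range(768)])

small = truncate_embedding(full, 256)
print(len(full), len(small))
```

Re-normalizing after truncation keeps cosine similarity and dot product interchangeable on the smaller vectors, which is handy if your vector database assumes unit-length embeddings.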