‼️Sentence Transformers v5.0 is out! The biggest update yet introduces Sparse Embedding models, encode method improvements, a Router module for asymmetric models & much more. Sparse + Dense = 🔥 hybrid search performance! Details:
1️⃣ Sparse Encoder Models
Brand new support for sparse embedding models that generate high-dimensional embeddings (30,000+ dims) where <1% are non-zero:
- Full SPLADE, Inference-free SPLADE, and CSR architecture support
- 4 new modules, 12 new losses, 9 new evaluators
- Integration with @elastic-co, @opensearch-project, @NAVER LABS Europe, @qdrant, @IBM, etc.
- Decode interpretable embeddings to understand token importance
- Hybrid search integration to get the best of both worlds
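To make "sparse", "decode" and "hybrid" a bit more concrete, here is a minimal pure-Python sketch. It does not use the Sentence Transformers API: the tokens, weights, dense vectors and the `hybrid_score` weighted sum are all made up for illustration, and real systems usually normalize the two scores or fuse ranks instead of adding raw values.

```python
from math import sqrt

# Toy sparse embeddings: only non-zero dimensions are stored, keyed by token.
# All tokens and weights here are invented for illustration.
query_sparse = {"capital": 1.2, "france": 1.8}
doc_sparse = {"paris": 2.1, "france": 1.5, "capital": 0.9, "city": 0.4}


def sparse_dot(a, b):
    """Dot product over the dimensions that are non-zero in both vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dict
    return sum(weight * b[token] for token, weight in a.items() if token in b)


def decode(sparse, top_k=3):
    """Return the top_k highest-weighted (token, weight) pairs."""
    return sorted(sparse.items(), key=lambda item: item[1], reverse=True)[:top_k]


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


def hybrid_score(sparse_score, dense_score, alpha=0.5):
    """Hypothetical fusion: a plain weighted sum of the two scores."""
    return alpha * sparse_score + (1 - alpha) * dense_score


sparse_score = sparse_dot(query_sparse, doc_sparse)
dense_score = cosine([0.1, 0.9, 0.2], [0.2, 0.8, 0.1])  # made-up dense embeddings

print(decode(doc_sparse))
print(round(sparse_score, 2), round(hybrid_score(sparse_score, dense_score), 3))
```

Storing only the non-zero entries is what keeps 30,000+ dimensions cheap, and reading off the highest-weighted tokens is what makes these embeddings interpretable.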
2️⃣ Enhanced Encode Methods & Multi-Processing
- New encode_query & encode_document methods that automatically use predefined prompts
- No more manual pool management - just pass a device list directly to encode()
- Much cleaner and easier to use than the old multi-process approach
3️⃣ Router Module & Advanced Training
- Router module with different processing paths for queries vs documents
- Custom learning rates for different parameter groups
- Composite loss logging - see individual loss components
- Perfect for two-tower architectures
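The idea of separate query and document paths can be sketched in a few lines of plain Python. This is not the library's Router module: `TinyRouter`, `STATIC_WEIGHTS` and both encode functions are hypothetical stand-ins, with a cheap lookup for queries and a placeholder for the heavier document path.

```python
class TinyRouter:
    """Hypothetical stand-in for a router: one callable per task type."""

    def __init__(self, routes):
        self.routes = routes  # e.g. {"query": fn, "document": fn}

    def encode(self, text, task):
        if task not in self.routes:
            raise ValueError(f"Unknown task {task!r}; expected one of {sorted(self.routes)}")
        return self.routes[task](text)


# Made-up static token weights for the cheap query path.
STATIC_WEIGHTS = {"cheap": 0.7, "flights": 1.4, "paris": 1.9}


def encode_query(text):
    """Cheap path: look up precomputed token weights, no model call."""
    return {tok: STATIC_WEIGHTS[tok] for tok in text.lower().split() if tok in STATIC_WEIGHTS}


def encode_document(text):
    """Placeholder for the expensive path: here just raw term counts."""
    counts = {}
    for tok in text.lower().split():
        counts[tok] = counts.get(tok, 0.0) + 1.0
    return counts


router = TinyRouter({"query": encode_query, "document": encode_document})
query_vec = router.encode("Cheap flights to Paris", task="query")
doc_vec = router.encode("Paris flights Paris hotels", task="document")
print(query_vec)
print(doc_vec)
```

Both paths write into the same token space, so query and document vectors stay comparable even though they are produced differently.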
4️⃣ Comprehensive Documentation & Training
- New Training Overview, Loss Overview, API Reference docs
- 6 new training example documentation pages
- Full integration examples with major search engines
- Extensive blogpost on training sparse models
What's next? We would love to hear from the community! What sparse encoder models would you like to see? And what new capabilities should Sentence Transformers handle - multimodal embeddings, late interaction models, or something else? Your feedback shapes our roadmap!
🤗 Just published: "Consent by Design" - exploring how we're building better consent mechanisms across the HF ecosystem!
Our research shows open AI development enables:
- Community-driven ethical standards
- Transparent accountability
- Context-specific implementations
- Privacy as core infrastructure
Check out our Space Privacy Analyzer tool that automatically generates privacy summaries of applications!
Effective consent isn't about perfect policies; it's about architectures that empower users while enabling innovation. 🚀
I just released Sentence Transformers v4.1, featuring ONNX and OpenVINO backends for rerankers offering 2-3x speedups, and improved hard negatives mining, which helps prepare stronger training datasets. Details:
🏎️ ONNX, OpenVINO, Optimization, Quantization
- I've added ONNX and OpenVINO support with just one extra argument: "backend" when loading the CrossEncoder reranker, e.g.: CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2", backend="onnx")
- The export_optimized_onnx_model, export_dynamic_quantized_onnx_model, and export_static_quantized_openvino_model functions now work with CrossEncoder rerankers, allowing you to optimize (e.g. fusions, gelu approximations, etc.) or quantize (int8 weights) rerankers.
- I've uploaded ~340 ONNX & OpenVINO models for all existing models under the cross-encoder Hugging Face organization. You can use these without having to export when loading.
⛏ Improved Hard Negatives Mining
- Added 'absolute_margin' and 'relative_margin' arguments to mine_hard_negatives.
  - absolute_margin ensures that sim(query, negative) < sim(query, positive) - absolute_margin, i.e. an absolute margin between the negative & positive similarities.
  - relative_margin ensures that sim(query, negative) < sim(query, positive) * (1 - relative_margin), i.e. a relative margin between the negative & positive similarities.
- Inspired by the excellent NV-Retriever paper from NVIDIA.
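The two margin conditions above can be written out directly. This is a minimal sketch of the filtering rule only, not the mine_hard_negatives implementation: `keep_negative` is a hypothetical helper and the similarity scores are made up.

```python
def keep_negative(pos_sim, neg_sim, absolute_margin=None, relative_margin=None):
    """Return True if a candidate negative passes every margin check that is set."""
    if absolute_margin is not None and not neg_sim < pos_sim - absolute_margin:
        return False
    if relative_margin is not None and not neg_sim < pos_sim * (1 - relative_margin):
        return False
    return True


pos_sim = 0.80  # made-up sim(query, positive)
candidates = {"neg_a": 0.78, "neg_b": 0.74, "neg_c": 0.40}  # made-up sim(query, negative)

# absolute_margin=0.05 keeps negatives scoring below 0.80 - 0.05 = 0.75
abs_kept = [name for name, sim in candidates.items()
            if keep_negative(pos_sim, sim, absolute_margin=0.05)]

# relative_margin=0.10 keeps negatives scoring below 0.80 * 0.90 = 0.72
rel_kept = [name for name, sim in candidates.items()
            if keep_negative(pos_sim, sim, relative_margin=0.10)]

print(abs_kept)  # ['neg_b', 'neg_c']
print(rel_kept)  # ['neg_c']
```

Candidates that score almost as high as the positive are the ones most likely to be unlabeled positives, which is why both margins discard them.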
With this release, I introduce near-feature parity between the SentenceTransformer embedding & CrossEncoder reranker models, which I've wanted to do for quite some time! With rerankers very strongly supported now, it's time to look forward to other useful architectures!
✨MLLM
> R1 Omni by Alibaba Tongyi - 0.5B
> Qwen2.5 Omni by Alibaba Qwen - 7B with Apache2.0

🖼️Video
> CogView-4 by ZhipuAI - Apache2.0
> HunyuanVideo-I2V by TencentHunyuan
> Open Sora2.0 - 11B with Apache2.0
> Stepvideo TI2V by StepFun AI - 30B with MIT license

⚡️Image/3D
> Hunyuan3D 2mv/2mini (0.6B) by @TencentHunyuan
> FlexWorld by ByteDance - MIT license
> Qwen2.5-VL-32B-Instruct by Alibaba Qwen - Apache2.0
> Tripo SG (1.5B)/SF by VastAIResearch - MIT license
> InfiniteYou by ByteDance
> LHM by Alibaba AIGC team - Apache2.0
> Spatial LM by ManyCore

🧠Reasoning
> QwQ-32B by Alibaba Qwen - Apache2.0
> Skywork R1V - 38B with MIT license
> RWKV G1 by RWKV AI - 0.1B pure RNN reasoning model with Apache2.0
> Fin R1 by SUFE AIFLM Lab - financial reasoning

🔠LLM
> DeepSeek v3 0324 by DeepSeek - MIT license
> Babel by Alibaba DAMO - 9B/83B/25 languages
‼️Sentence Transformers v4.0 is out! You can now train and finetune reranker models with multi-GPU training, bf16 support, loss logging, callbacks & much more. I also prove that finetuning on your domain helps much more than you might think.
1️⃣ Reranker Training Refactor
Reranker models can now be trained using an extensive trainer with a lot of powerful features:
- MultiGPU Training (Data Parallelism (DP) and Distributed Data Parallelism (DDP))
- bf16 training support; loss logging
- Evaluation datasets + evaluation loss
- Improved callback support + an excellent Weights & Biases integration
- Gradient checkpointing, gradient accumulation
- Model card generation
- Resuming from a training checkpoint without performance loss
- Hyperparameter Optimization
and much more!
Read my detailed blogpost to learn about the components that make up this new training approach: https://huggingface.co/blog/train-reranker Notably, the release is fully backwards compatible: all deprecations are soft, meaning that they still work but emit a warning informing you how to upgrade.
2️⃣ New Reranker Losses
- 11 new losses:
  - 2 traditional losses: BinaryCrossEntropy and CrossEntropy
  - 2 distillation losses: MSE and MarginMSE
  - 2 in-batch negatives losses: MNRL (a.k.a. InfoNCE) and CMNRL
  - 5 learning to rank losses: Lambda, p-ListMLE, ListNet, RankNet, ListMLE

3️⃣ New Reranker Documentation
- New Training Overview, Loss Overview, API Reference docs
- 5 new, 1 refactored training examples docs pages
- 13 new, 6 refactored training scripts
- Migration guides (2.x -> 3.x, 3.x -> 4.x)
4️⃣ Blogpost
Alongside the release, I've written a blogpost where I finetune ModernBERT on a generic question-answer dataset. My finetunes easily outperform all general-purpose reranker models, even models 4x as big. Finetuning on your domain is definitely worth it: https://huggingface.co/blog/train-reranker
An assembly of 18 European companies, labs, and universities have banded together to launch 🇪🇺 EuroBERT! It's a state-of-the-art multilingual encoder for 15 European languages, designed to be finetuned for retrieval, classification, etc.
🇪🇺 15 Languages: English, French, German, Spanish, Chinese, Italian, Russian, Polish, Portuguese, Japanese, Vietnamese, Dutch, Arabic, Turkish, Hindi
3️⃣ 3 model sizes: 210M, 610M, and 2.1B parameters - very useful sizes in my opinion
➡️ Sequence length of 8192 tokens! Nice to see these higher sequence lengths for encoders becoming more common.
⚙️ Architecture based on Llama, but with bi-directional (non-causal) attention to turn it into an encoder. Flash Attention 2 is supported.
🔥 A new Pareto frontier (stronger *and* smaller) for multilingual encoder models
📊 Evaluated against mDeBERTa, mGTE, XLM-RoBERTa for Retrieval, Classification, and Regression (after finetuning for each task separately): EuroBERT punches way above its weight.
📝 Detailed paper covering everything, incl. data: FineWeb for English and CulturaX for multilingual data, The Stack v2 and Proof-Pile-2 for code.
The next step is for researchers to build upon the 3 EuroBERT base models and publish strong retrieval, zero-shot classification, etc. models for all to use. I'm very much looking forward to it!
I just released Sentence Transformers v3.4.0, featuring a memory leak fix, compatibility between the powerful Cached... losses and the Matryoshka loss modifier, and a bunch of fixes & small features.
🪆 Matryoshka & Cached loss compatibility It is now possible to combine the powerful Cached... losses (which use in-batch negatives & a caching mechanism to allow for endless batch size & negatives) with the Matryoshka loss modifier which modifies a base loss such that it is trained not only on the maximum dimensionality (e.g. 1024 dimensions), but also on many lower dimensions (e.g. 768, 512, 256, 128, 64, 32). After training, these models' embeddings can be truncated for faster retrieval, etc.
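What "truncating" a Matryoshka embedding means in practice can be sketched as follows: keep the first few dimensions and re-normalize. This is a pure-Python illustration with made-up 8-dimensional vectors, not library code, and `truncate_and_normalize` is a hypothetical helper.

```python
from math import sqrt


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = embedding[:dim]
    norm = sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


# Made-up "full" embeddings; Matryoshka training pushes the most useful
# information into the leading dimensions, which these toy vectors mimic.
a = [0.9, 0.1, -0.3, 0.2, 0.05, -0.02, 0.01, 0.03]
b = [0.8, 0.2, -0.2, 0.1, -0.04, 0.03, 0.02, -0.01]

full_sim = cosine(a, b)
small_a = truncate_and_normalize(a, 4)
small_b = truncate_and_normalize(b, 4)
small_sim = cosine(small_a, small_b)

print(round(full_sim, 3), round(small_sim, 3))
```

Halving the dimensionality halves storage and speeds up similarity search, at the cost of a small shift in the similarity scores.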
🎞️ Resolve memory leak when Model and Trainer are reinitialized Due to a circular dependency between Trainer -> Model -> ModelCardData -> Trainer, deleting both the trainer & model still didn't free up the memory. This led to a memory leak in scripts where you repeatedly do so.
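Why a cycle like this keeps memory around can be shown with plain Python objects. This is a generic sketch of reference cycles on CPython, not the library's classes or its actual fix: `Model`, `CardData` and `Trainer` below are empty hypothetical stand-ins.

```python
import gc
import weakref


class Model:
    def __init__(self):
        self.card_data = None


class CardData:
    def __init__(self):
        self.trainer = None


class Trainer:
    def __init__(self, model):
        self.model = model


gc.disable()  # make the demo deterministic: no automatic cycle collection

model = Model()
model.card_data = CardData()
trainer = Trainer(model)
model.card_data.trainer = trainer  # closes the cycle Trainer -> Model -> CardData -> Trainer

probe = weakref.ref(model)  # lets us check whether the model object still exists
del model, trainer

alive_after_del = probe() is not None  # reference counting alone cannot free a cycle
gc.collect()                           # the cycle collector can
alive_after_collect = probe() is not None

gc.enable()
print(alive_after_del, alive_after_collect)  # True False on CPython
```

In a training script the objects in such a cycle hold model weights, so every re-initialization can pile up memory until a cycle collection runs.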
➕ New Features Many new small features, e.g. multi-GPU support for 'mine_hard_negatives', a 'margin' parameter to TripletEvaluator, and Matthews Correlation Coefficient in the BinaryClassificationEvaluator.
🐛 Bug Fixes Also a bunch of fixes, for example that subsequent batches were not sorted when using the "no_duplicates" batch sampler. See the release notes for more details.
🏎️ Today I'm introducing a method to train static embedding models that run 100x to 400x faster on CPU than common embedding models, while retaining 85%+ of the quality! Including 2 fully open models: training scripts, datasets, metrics.
We apply our recipe to train 2 Static Embedding models that we release today! We release:
2️⃣ an English Retrieval model and a general-purpose Multilingual similarity model (e.g. classification, clustering, etc.), both Apache 2.0
🧠 my modern training strategy: ideation -> dataset choice -> implementation -> evaluation
📜 my training scripts, using the Sentence Transformers library
📊 my Weights & Biases reports with losses & metrics
📕 my list of 30 training and 13 evaluation datasets
The 2 Static Embedding models have the following properties:
🏎️ Extremely fast, e.g. 107500 sentences per second on a consumer CPU, compared to 270 for 'all-mpnet-base-v2' and 56 for 'gte-large-en-v1.5'
0️⃣ Zero active parameters: No Transformer blocks, no attention, not even a matrix multiplication. Super speed!
📏 No maximum sequence length! Embed texts at any length (note: longer texts may embed worse)
📐 Linear instead of superlinear complexity: 2x longer text takes 2x longer, instead of 2.5x or more.
🪆 Matryoshka support: allows you to truncate embeddings with minimal performance loss (e.g. 4x smaller with a 0.56% perf. decrease for English Similarity tasks)
The blogpost contains a lengthy list of possible advancements; I'm very confident that our 2 models are only the tip of the iceberg, and we may be able to get even better performance.
That didn't take long! Nomic AI has finetuned the new ModernBERT-base encoder model into a strong embedding model for search, classification, clustering and more!
Details:
🤖 Based on ModernBERT-base with 149M parameters.
📊 Outperforms both nomic-embed-text-v1 and nomic-embed-text-v1.5 on MTEB!
🏎️ Immediate FA2 and unpacking support for super efficient inference.
🪆 Trained with Matryoshka support, i.e. 2 valid output dimensionalities: 768 and 256.
➡️ Maximum sequence length of 8192 tokens!
2️⃣ Trained in 2 stages: unsupervised contrastive data -> high quality labeled datasets.
➕ Integrated in Sentence Transformers, Transformers, LangChain, LlamaIndex, Haystack, etc.
🏛️ Apache 2.0 licensed: fully commercially permissible
After some heated discussion 🔥, we clarify our intent re. storage limits on the Hub
TL;DR:
- public storage is free, and (unless blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible
- private storage is paid above a significant free tier (1TB if you have a paid account, 100GB otherwise)
I just released Sentence Transformers v3.3.0 & it's huge! 4.5x speedup for CPU with OpenVINO int8 static quantization, training with prompts for a free perf. boost, PEFT integration, evaluation on NanoBEIR, and more! Details:
1. We integrate Post-Training Static Quantization using OpenVINO, a very efficient solution for CPUs that processes 4.78x as many texts per second on average, while only hurting performance by 0.36% on average. There's a new export_static_quantized_openvino_model method to quantize a model.
2. We add the option to train with prompts, e.g. strings like "query: ", "search_document: " or "Represent this sentence for searching relevant passages: ". It's as simple as using the prompts argument in SentenceTransformerTrainingArguments. Our experiments show that you can easily reach 0.66% to 0.90% relative performance improvement on NDCG@10 at no extra cost by adding "query: " before each training query and "document: " before each training answer.
3. Sentence Transformers now supports training PEFT adapters via 7 new methods for adding new adapters or loading pre-trained ones. You can also directly load a trained adapter with SentenceTransformer as if it's a normal model. Very useful for e.g. 1) training multiple adapters on 1 base model, 2) training bigger models than otherwise possible, or 3) cheaply hosting multiple models by switching multiple adapters on 1 base model.
4. We added easy evaluation on NanoBEIR, a subset of BEIR a.k.a. the MTEB Retrieval benchmark. It contains 13 datasets with 50 queries and up to 10k documents each. Evaluation is fast, and can easily be done during training to track your model's performance on general-purpose information retrieval tasks.
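For readers unfamiliar with NDCG@10, the retrieval metric quoted in point 2, here is a minimal sketch of how it can be computed for a single query with binary relevance. It is a plain-Python illustration, not the library's evaluator, and the relevance labels are made up.

```python
from math import log2


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: each hit is discounted by log2(position + 1)."""
    return sum(rel / log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the produced ranking divided by the DCG of a perfect ranking."""
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Made-up example: the corpus has 2 relevant documents for this query,
# and the model ranked them at positions 2 and 5.
ranked = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
all_labels = [1, 1]

score = ndcg_at_k(ranked, all_labels)
perfect = ndcg_at_k([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], all_labels)
print(round(score, 4), perfect)
```

A benchmark score is then the average of this value over all queries of a dataset.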
📣 Sentence Transformers v3.2.0 is out, marking the biggest release for inference in 2 years! 2 new backends for embedding models: ONNX (+ optimization & quantization) and OpenVINO, allowing for speedups up to 2x-3x AND Static Embeddings for 500x speedups at 10-20% accuracy cost.
1️⃣ ONNX Backend: This backend uses the ONNX Runtime to accelerate model inference on both CPU and GPU, reaching up to 1.4x-3x speedup depending on the precision. We also introduce 2 helper methods for optimizing and quantizing models for (much) faster inference.
2️⃣ OpenVINO Backend: This backend uses Intel's OpenVINO instead, outperforming ONNX in some situations on CPU.
Usage is as simple as SentenceTransformer("all-MiniLM-L6-v2", backend="onnx"). Does your model not have an ONNX or OpenVINO file yet? No worries - it'll be autoexported for you. Thank me later 😉
🔒 Another major new feature is Static Embeddings: think word embeddings like GloVe and word2vec, but modernized. Static Embeddings are bags of token embeddings that are summed together to create text embeddings, allowing for lightning-fast embeddings that don't require any neural networks. They're initialized in one of 2 ways:
1️⃣ via Model2Vec, a new technique for distilling any Sentence Transformer model into static embeddings. Either via a pre-distilled model with from_model2vec or with from_distillation where you do the distillation yourself. It'll only take 5 seconds on GPU & 2 minutes on CPU, no dataset needed.
2️⃣ Random initialization. This requires finetuning, but finetuning is extremely quick (e.g. I trained with 3 million pairs in 7 minutes). My final model was 6.6% worse than bge-base-en-v1.5, but 500x faster on CPU.
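The "bag of token embeddings" idea fits in a few lines. This is a toy pure-Python sketch, not the library's static embedding module: the 3-dimensional `TOKEN_EMBEDDINGS` table, the whitespace tokenizer and the `embed` helper are all made up for illustration.

```python
from math import sqrt

# Made-up 3-dimensional token embeddings; a real table has one row per vocabulary token.
TOKEN_EMBEDDINGS = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.1],
    "purr": [0.7, 0.0, 0.2],
    "stocks": [0.0, 0.9, 0.4],
    "fell": [0.1, 0.7, 0.5],
}
DIM = 3


def embed(text):
    """Sum the embeddings of known tokens; unknown tokens are skipped."""
    total = [0.0] * DIM
    for token in text.lower().split():
        vector = TOKEN_EMBEDDINGS.get(token)
        if vector is not None:
            total = [t + v for t, v in zip(total, vector)]
    return total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


related = cosine(embed("Cats purr"), embed("dogs"))
unrelated = cosine(embed("Cats purr"), embed("stocks fell"))
print(round(related, 3), round(unrelated, 3))
```

Embedding a text is one table lookup and one vector addition per token, so the cost grows linearly with text length. Averaging instead of summing would only rescale the vector, which leaves cosine similarity unchanged.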
My biggest release of the year, a series of 7 specialized embedding models for information retrieval within tax documents, is now available for free on Hugging Face 🤗
These new models aim to offer an open source alternative for in-domain semantic search from large text corpora and will improve RAG systems and context addition for large language models.
Trained on more than 43 million tax tokens derived from semi-synthetic and raw-synthetic data, enriched by various methods (in particular MSFT's evol-instruct by @intfloat), and corrected by humans, this project is the fruit of hundreds of hours of work and is the culmination of a global effort to open up legal technologies that has only just begun.
A big thank you to Microsoft for Startups for giving me access to state-of-the-art infrastructure to train these models, and to @julien-c, @clem 🤗, @thomwolf and the whole HF team for the inference endpoint API and the generous provision of Meta Llama-3.1-70B. Special thanks also to @tomaarsen for his invaluable advice on training embedding models and loss functions ❤️
#phdone - I defended my PhD yesterday! A key lesson: it is amazing how open science and open source can empower beginners with limited resources:
I first learned about instruction-based classifiers like BERT-NLI 3-4 years ago, through the @HuggingFace ZeroShotClassificationPipeline. Digging deeper into this, it was surprisingly easy to find new datasets, newer base models, and reusable fine-tuning scripts on the HF Hub to create my own zeroshot models - although I didn't know much about fine-tuning at the time.
Thanks to the community effect of the Hub, my models were downloaded hundreds of thousands of times after a few months. Seeing my research being useful for people motivated me to improve and upload newer models. Leaving my contact details in the model cards led to academic cooperation and consulting contracts (and eventually my job at HF).
That's the power of open science & open source: learning, sharing, improving, collaborating.
I mean every word in my thesis acknowledgments (screenshot). I'm very grateful to my supervisors @vanatteveldt @CasAndreu @KasperWelbers for their guidance; to @profAndreaRenda and @CEPS_thinktank for enabling me to work part-time during the first year; to @huggingface for creating awesome tools and an awesome platform; and to many others who are not active on social media.
Links to the full thesis and the collection of my most recent models are below.
PS: If someone happens to speak Latin, let me know if my diploma contains some hidden Illuminati code or something :D
I've just shipped the Sentence Transformers v3.1.1 patch release, fixing the hard negatives mining utility for some models. This utility is extremely useful to get more performance out of your embedding training data.
🔓 Beyond that, this release removes the numpy<2 restriction from v3.1.0. This was previously required for Windows as not all third-party libraries were updated to support numpy v2. With Sentence Transformers, you can now choose v1 or v2 of numpy.
🎉SetFit v1.1.0 is out! Training efficient classifiers on CPU or GPU now uses the Sentence Transformers Trainer, and we resolved a lot of issues caused by updates of third-party libraries (like Transformers). Details:
Training a SetFit classifier model consists of 2 phases:
1. Finetuning a Sentence Transformer embedding model
2. Training a Classifier to map embeddings -> classes
🔌The first phase now uses the SentenceTransformerTrainer that was introduced in the Sentence Transformers v3 update. This brings some immediate upsides like MultiGPU support, without any (intended) breaking changes.
➡️ Beyond that, we softly deprecated the "evaluation_strategy" argument in favor of "eval_strategy" (following a Transformers deprecation), and deprecated Python 3.7. In return, we add official support for Python 3.11 and 3.12.
✨ There are some more minor changes too, like max_steps and eval_max_steps now being hard limits instead of approximate ones, training/validation losses now logging nicely in Notebooks, and the "device" parameter no longer being ignored in some situations.