F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data
Abstract
F2LLM is a suite of large language models that achieves state-of-the-art embedding performance by fine-tuning foundation models directly on open-source datasets.
We introduce F2LLM (Foundation to Feature Large Language Models), a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is fine-tuned directly from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future work.
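The abstract does not spell out the training objective, but fine-tuning on query-document-negative tuples is typically done with a contrastive, InfoNCE-style loss over in-batch and hard negatives. The sketch below illustrates one such recipe under stated assumptions: the backbone name, mean pooling, maximum sequence length, and temperature are illustrative choices, not details confirmed by the paper.

```python
# Hedged sketch of contrastive fine-tuning on (query, positive document, hard negative)
# tuples. Backbone, pooling, and temperature are assumptions for illustration only.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"  # placeholder backbone; the actual F2LLM backbone may differ
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
encoder = AutoModel.from_pretrained(model_name)

def embed(texts: list[str]) -> torch.Tensor:
    """Encode texts and mean-pool last hidden states into L2-normalized embeddings."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    outputs = encoder(**batch)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, dim=-1)

def contrastive_loss(queries, positives, negatives, temperature=0.05):
    """InfoNCE: each query should rank its positive above in-batch and hard negatives."""
    q = embed(queries)                              # (B, d)
    p = embed(positives)                            # (B, d)
    n = embed(negatives)                            # (B, d)
    candidates = torch.cat([p, n], dim=0)           # (2B, d): positives first, then negatives
    logits = q @ candidates.T / temperature         # (B, 2B) cosine similarities / temperature
    labels = torch.arange(len(queries))             # positive for query i sits at column i
    return F.cross_entropy(logits, labels)

# Toy usage with a single query-document-negative tuple.
loss = contrastive_loss(
    queries=["what is contrastive learning?"],
    positives=["Contrastive learning trains encoders to pull matching pairs together."],
    negatives=["The Eiffel Tower is located in Paris, France."],
)
loss.backward()
```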
Community
We present F2LLM, a family of fully open embedding models that strike a strong balance between training cost, model size, and embedding performance, serving as a strong, reproducible, and budget-friendly baseline for developing embedding models in the future.
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Granite Embedding R2 Models (2025)
- EmbeddingGemma: Powerful and Lightweight Text Representations (2025)
- Training LLMs to be Better Text Embedders through Bidirectional Reconstruction (2025)
- MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch (2025)
- Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings (2025)
- BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation (2025)
- QZhou-Embedding Technical Report (2025)