arxiv:2510.02294

F2LLM Technical Report: Matching SOTA Embedding Performance with 6 Million Open-Source Data

Published on Oct 2 · Submitted by Ziyin Zhang on Oct 3
Authors: Ziyin Zhang, et al.

Abstract

We introduce F2LLM (Foundation to Feature Large Language Models), a suite of state-of-the-art embedding models in three sizes: 0.6B, 1.7B, and 4B. Unlike previous top-ranking embedding models that require massive contrastive pretraining, sophisticated training pipelines, and costly synthetic training data, F2LLM is directly finetuned from foundation models on 6 million query-document-negative tuples curated from open-source, non-synthetic datasets, striking a strong balance between training cost, model size, and embedding performance. On the MTEB English leaderboard, F2LLM-4B ranks 2nd among models with approximately 4B parameters and 7th overall, while F2LLM-1.7B ranks 1st among models in the 1B-2B size range. To facilitate future research in the field, we release the models, training dataset, and code, positioning F2LLM as a strong, reproducible, and budget-friendly baseline for future work.
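
The abstract does not spell out the training objective, but fine-tuning on query-document-negative tuples is typically done with an InfoNCE-style contrastive loss using in-batch negatives plus one hard negative per query. The sketch below is a minimal, hypothetical illustration of that setup; the temperature, pooling, and negative handling are assumptions, not F2LLM's confirmed recipe.

```python
# Minimal sketch (not F2LLM's actual code): InfoNCE over (query, positive, hard-negative)
# triples with in-batch negatives. Random vectors stand in for pooled LLM hidden states.
import torch
import torch.nn.functional as F

def info_nce_loss(q_emb, pos_emb, neg_emb, temperature=0.05):
    """All inputs are (batch, dim) L2-normalized embeddings."""
    sim_pos = q_emb @ pos_emb.T                             # query vs. every in-batch positive
    sim_neg = (q_emb * neg_emb).sum(dim=-1, keepdim=True)   # query vs. its own hard negative
    logits = torch.cat([sim_pos, sim_neg], dim=1) / temperature
    labels = torch.arange(q_emb.size(0), device=q_emb.device)  # positive for query i is column i
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings.
q = F.normalize(torch.randn(8, 1024), dim=-1)
p = F.normalize(torch.randn(8, 1024), dim=-1)
n = F.normalize(torch.randn(8, 1024), dim=-1)
print(info_nce_loss(q, p, n))
```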

Community

Paper author and submitter:

We present F2LLM, a family of fully open embedding models that balance training cost, model size, and embedding performance, serving as a strong, reproducible, and budget-friendly baseline for future embedding-model development.
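
As a hedged usage sketch of how embeddings are usually extracted from a decoder-only backbone like F2LLM: pool the hidden state of the last non-padding token and normalize. The model ID below is a placeholder, and last-token pooling is an assumption rather than a confirmed detail of the release.

```python
# Hypothetical example: last-token pooling to obtain text embeddings.
# "your-org/F2LLM-1.7B" is a placeholder; check the official release for the real repo name.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "your-org/F2LLM-1.7B"  # placeholder identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
if tokenizer.pad_token is None:            # decoder-only tokenizers often lack a pad token
    tokenizer.pad_token = tokenizer.eos_token

texts = ["what is dense retrieval?",
         "Dense retrieval matches queries and documents by embedding similarity."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state          # (batch, seq_len, dim)

last_idx = batch["attention_mask"].sum(dim=1) - 1      # index of last non-padding token
emb = hidden[torch.arange(hidden.size(0)), last_idx]   # (batch, dim)
emb = F.normalize(emb, dim=-1)
print(emb @ emb.T)                                     # cosine similarities
```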


Models citing this paper: 3

Datasets citing this paper: 1


Collections including this paper: 2