ReasonIR: Training Retrievers for Reasoning Tasks
Abstract
We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part because existing training datasets focus on short factual queries tied to documents that straightforwardly answer them. We develop a synthetic data generation pipeline that, for each document, creates a challenging and relevant query, along with a plausibly related but ultimately unhelpful hard negative. By training on a mixture of our synthetic data and existing public data, ReasonIR-8B achieves a new state-of-the-art of 29.9 nDCG@10 without a reranker and 36.9 nDCG@10 with a reranker on BRIGHT, a widely used reasoning-intensive information retrieval (IR) benchmark. When applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4% and 22.6% respectively, relative to the closed-book baseline, outperforming other retrievers and search engines. In addition, ReasonIR-8B uses test-time compute more effectively: on BRIGHT, its performance consistently increases with longer and more information-rich rewritten queries, and it continues to outperform other retrievers when combined with an LLM reranker. Our training recipe is general and can be easily extended to future LLMs; to this end, we open-source our code, data, and model.
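For readers unfamiliar with the headline metric, the following is a minimal sketch of how nDCG@10 can be computed for a single query. It uses the standard linear-gain formulation with a log2 rank discount; the function names and the toy document IDs are illustrative assumptions, not taken from the paper's code, and the official BRIGHT evaluation may differ in details such as tie handling. Scores such as 29.9 are presumably this quantity averaged over queries and multiplied by 100.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades."""
    # rank is 0-based, so the discount for the top result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_doc_ids, relevant, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document IDs in the order the retriever returned them.
    relevant: dict mapping doc ID -> graded relevance (missing IDs count as 0).
    """
    gains = [relevant.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking places the most relevant documents first.
    ideal = sorted(relevant.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

# Hypothetical toy query: two relevant documents, only one retrieved (at rank 2).
score = ndcg_at_k(["d3", "d1", "d7"], {"d1": 1, "d2": 1})
print(round(score, 3))  # 0.387
```

A retriever that ranks every relevant document ahead of all irrelevant ones scores 1.0 under this definition, so the metric rewards both finding relevant documents and placing them near the top.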
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Distillation and Refinement of Reasoning in Small Language Models for Document Re-ranking (2025)
- OpenRAG: Optimizing RAG End-to-End via In-Context Retrieval Learning (2025)
- Leveraging LLMs for Utility-Focused Annotation: Reducing Manual Effort for Retrieval and RAG (2025)
- SUNAR: Semantic Uncertainty based Neighborhood Aware Retrieval for Complex QA (2025)
- FinDER: Financial Dataset for Question Answering and Evaluating Retrieval-Augmented Generation (2025)
- Rank-R1: Enhancing Reasoning in LLM-based Document Rerankers via Reinforcement Learning (2025)
- Imagine All The Relevance: Scenario-Profiled Indexing with Knowledge Expansion for Dense Retrieval (2025)