SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation
Abstract
SearchInstruct enhances supervised fine-tuning datasets for large language models by expanding domain-specific questions and retrieving accurate answers, improving model performance and enabling efficient model editing.
Supervised Fine-Tuning (SFT) is essential for training large language models (LLMs), significantly enhancing critical capabilities such as instruction following and in-context learning. Nevertheless, creating suitable training datasets tailored to specific domains remains challenging due to unique domain constraints and data scarcity. In this paper, we propose SearchInstruct, a method designed to construct high-quality instruction datasets for SFT. Our approach begins with a limited set of domain-specific, human-generated questions, which are systematically expanded using a large language model. Subsequently, domain-relevant resources are dynamically retrieved to generate accurate and contextually appropriate answers for each augmented question. Experimental evaluation demonstrates that SearchInstruct enhances both the diversity and quality of SFT datasets, leading to measurable improvements in LLM performance within specialized domains. Additionally, we show that beyond dataset generation, the proposed method can also facilitate tasks such as model editing, enabling efficient updates to existing models. To facilitate reproducibility and community adoption, we provide full implementation details, the complete set of generated instruction-response pairs, and the source code in a publicly accessible Git repository: [https://github.com/mostafaamiri/SearchInstruct](https://github.com/mostafaamiri/SearchInstruct).
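The abstract describes three stages: expand seed questions with an LLM, retrieve domain resources for each question, and generate an answer grounded in those resources. The following is a minimal sketch of that flow, not the authors' implementation. The names `build_sft_dataset`, `InstructionPair`, `toy_expand`, `toy_retrieve`, `toy_answer`, and `CORPUS` are all hypothetical, and the LLM and search calls are replaced by toy stand-ins so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class InstructionPair:
    question: str
    context: List[str]
    answer: str


def build_sft_dataset(
    seed_questions: List[str],
    expand: Callable[[str], List[str]],
    retrieve: Callable[[str, int], List[str]],
    answer: Callable[[str, List[str]], str],
    top_k: int = 3,
) -> List[InstructionPair]:
    """Expand seeds, retrieve context per question, then generate answers."""
    pairs: List[InstructionPair] = []
    seen = set()
    for seed in seed_questions:
        # Keep the human-written seed alongside its expansions.
        for question in [seed, *expand(seed)]:
            key = question.strip().lower()
            if key in seen:  # skip duplicate questions
                continue
            seen.add(key)
            docs = retrieve(question, top_k)
            pairs.append(InstructionPair(question, docs, answer(question, docs)))
    return pairs


# --- Toy stand-ins (assumptions): a real pipeline would call an LLM and a
# --- search or retrieval backend here.
CORPUS = [
    "Saffron is harvested by hand in autumn.",
    "Basmati rice should be soaked before cooking.",
]


def toy_expand(question: str) -> List[str]:
    # Stand-in for LLM-based question expansion.
    return [f"{question} Explain step by step.", f"{question} Give a short answer."]


def toy_retrieve(question: str, top_k: int) -> List[str]:
    # Stand-in for retrieval: rank documents by word overlap with the question.
    words = set(question.lower().replace("?", "").split())
    scored = [(len(words & set(doc.lower().rstrip(".").split())), doc) for doc in CORPUS]
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    return [doc for score, doc in ranked if score > 0][:top_k]


def toy_answer(question: str, docs: List[str]) -> str:
    # Stand-in for LLM answer generation conditioned on retrieved documents.
    return " ".join(docs) if docs else "No supporting document found."


dataset = build_sft_dataset(
    ["How is saffron harvested?"], toy_expand, toy_retrieve, toy_answer
)
for pair in dataset:
    print(pair.question, "->", pair.answer)
```

Passing the three stages in as callables keeps the pipeline independent of any particular model or search provider, so each toy function can be replaced without touching the loop.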