Textbooks Are All You Need II: phi-1.5 technical report Paper • 2309.05463 • Published Sep 11, 2023 • 87
TinyStories: How Small Can Language Models Be and Still Speak Coherent English? Paper • 2305.07759 • Published May 12, 2023 • 36
Scaling Synthetic Data Creation with 1,000,000,000 Personas Paper • 2406.20094 • Published Jun 28, 2024 • 102
Instruction Pre-Training: Language Models are Supervised Multitask Learners Paper • 2406.14491 • Published Jun 20, 2024 • 94
Improving Text Embeddings with Large Language Models Paper • 2401.00368 • Published Dec 31, 2023 • 81
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations Paper • 2305.14233 • Published May 23, 2023 • 6
Adapting Large Language Models via Reading Comprehension Paper • 2309.09530 • Published Sep 18, 2023 • 79
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models Paper • 2401.01335 • Published Jan 2, 2024 • 68
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing Paper • 2406.08464 • Published Jun 12, 2024 • 70
WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation Paper • 2312.14187 • Published Dec 20, 2023 • 52
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling Paper • 2401.16380 • Published Jan 29, 2024 • 51
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models Paper • 2402.13064 • Published Feb 20, 2024 • 49
AgentInstruct: Toward Generative Teaching with Agentic Flows Paper • 2407.03502 • Published Jul 3, 2024 • 51
Toward General Instruction-Following Alignment for Retrieval-Augmented Generation Paper • 2410.09584 • Published Oct 12, 2024 • 49
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset Paper • 2402.10176 • Published Feb 15, 2024 • 38
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows Paper • 2402.10379 • Published Feb 16, 2024 • 32
Best Practices and Lessons Learned on Synthetic Data for Language Models Paper • 2404.07503 • Published Apr 11, 2024 • 32
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models Paper • 2312.06585 • Published Dec 11, 2023 • 29
Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning Paper • 2307.03692 • Published Jul 5, 2023 • 26
Simple synthetic data reduces sycophancy in large language models Paper • 2308.03958 • Published Aug 7, 2023 • 22
CodecLM: Aligning Language Models with Tailored Synthetic Data Paper • 2404.05875 • Published Apr 8, 2024 • 18
Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources Paper • 2409.08239 • Published Sep 12, 2024 • 21
WizardLM: Empowering Large Language Models to Follow Complex Instructions Paper • 2304.12244 • Published Apr 24, 2023 • 14
Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation Paper • 2402.18334 • Published Feb 28, 2024 • 12
Synthesizing Text-to-SQL Data from Weak and Strong LLMs Paper • 2408.03256 • Published Aug 6, 2024 • 10
Self-Instruct: Aligning Language Model with Self Generated Instructions Paper • 2212.10560 • Published Dec 20, 2022 • 9
Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs Paper • 2310.13961 • Published Oct 21, 2023 • 5
M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models Paper • 2406.16783 • Published Jun 24, 2024 • 4
Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations Paper • 2310.07849 • Published Oct 11, 2023 • 2
Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through Active Exploration Paper • 2310.09168 • Published Oct 13, 2023 • 2
Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions Paper • 2306.04140 • Published Jun 7, 2023 • 2
SALMON: Self-Alignment with Principle-Following Reward Models Paper • 2310.05910 • Published Oct 9, 2023 • 2
Better Synthetic Data by Retrieving and Transforming Existing Datasets Paper • 2404.14361 • Published Apr 22, 2024 • 2
Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing Paper • 2305.16635 • Published May 26, 2023 • 1
Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena Paper • 2407.10627 • Published Jul 15, 2024 • 1
ZeroGen: Efficient Zero-shot Learning via Dataset Generation Paper • 2202.07922 • Published Feb 16, 2022 • 1
West-of-N: Synthetic Preference Generation for Improved Reward Modeling Paper • 2401.12086 • Published Jan 22, 2024 • 1
Automatic Instruction Evolving for Large Language Models Paper • 2406.00770 • Published Jun 2, 2024 • 2
Generative AI for Synthetic Data Generation: Methods, Challenges and the Future Paper • 2403.04190 • Published Mar 7, 2024 • 1
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey Paper • 2406.15126 • Published Jun 14, 2024
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias Paper • 2306.15895 • Published Jun 28, 2023
A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models Paper • 2404.14445 • Published Apr 20, 2024
TarGEN: Targeted Data Generation with Large Language Models Paper • 2310.17876 • Published Oct 27, 2023
#InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models Paper • 2308.07074 • Published Aug 14, 2023
Orca 2: Teaching Small Language Models How to Reason Paper • 2311.11045 • Published Nov 18, 2023 • 76
Orca: Progressive Learning from Complex Explanation Traces of GPT-4 Paper • 2306.02707 • Published Jun 5, 2023 • 46
WizardCoder: Empowering Code Large Language Models with Evol-Instruct Paper • 2306.08568 • Published Jun 14, 2023 • 29
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset Paper • 2309.11998 • Published Sep 21, 2023 • 25
Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models Paper • 2310.13671 • Published Oct 20, 2023 • 19
Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models Paper • 2406.13542 • Published Jun 19, 2024 • 17
Auto-Instruct: Automatic Instruction Generation and Ranking for Black-Box Language Models Paper • 2310.13127 • Published Oct 19, 2023 • 12
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct Paper • 2308.09583 • Published Aug 18, 2023 • 7
GenQA: Generating Millions of Instructions from a Handful of Prompts Paper • 2406.10323 • Published Jun 14, 2024 • 5
UltraFeedback: Boosting Language Models with High-quality Feedback Paper • 2310.01377 • Published Oct 2, 2023 • 5
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor Paper • 2212.09689 • Published Dec 19, 2022 • 1
Aligning Large Language Models through Synthetic Feedback Paper • 2305.13735 • Published May 23, 2023 • 1
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision Paper • 2305.03047 • Published May 4, 2023 • 1
Mixture of Soft Prompts for Controllable Data Generation Paper • 2303.01580 • Published Mar 2, 2023 • 1
Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs Paper • 2402.08005 • Published Feb 12, 2024 • 1
Harnessing the Power of David against Goliath: Exploring Instruction Data Generation without Using Closed-Source Models Paper • 2308.12711 • Published Aug 24, 2023 • 1
Generating Training Data with Language Models: Towards Zero-Shot Language Understanding Paper • 2202.04538 • Published Feb 9, 2022
Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models Paper • 2311.00287 • Published Nov 1, 2023
GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation Paper • 2104.08826 • Published Apr 18, 2021
Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models Paper • 2302.00618 • Published Feb 1, 2023
MIND: Math Informed syNthetic Dialogues for Pretraining LLMs Paper • 2410.12881 • Published Oct 15, 2024 • 1
Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation Paper • 2305.14327 • Published May 23, 2023
Automatically Generating Numerous Context-Driven SFT Data for LLMs across Diverse Granularity Paper • 2405.16579 • Published May 26, 2024
Unsupervised Neural Machine Translation with Generative Language Models Only Paper • 2110.05448 • Published Oct 11, 2021
Content preserving text generation with attribute controls Paper • 1811.01135 • Published Nov 3, 2018
Large Language Models Are Human-Level Prompt Engineers Paper • 2211.01910 • Published Nov 3, 2022 • 1
PersonaMath: Enhancing Math Reasoning through Persona-Driven Data Augmentation Paper • 2410.01504 • Published Oct 2, 2024