Article Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia? By davanstrien • May 7, 2024 • 8
Article Train 400x faster Static Embedding Models with Sentence Transformers 15 days ago • 128
Tulu 3 Datasets Collection All datasets released with Tulu 3 -- state of the art open post-training recipes. • 33 items • Updated 1 day ago • 64
PixMo Collection A set of vision-language datasets built by Ai2 and used to train the Molmo family of models. Read more at https://molmo.allenai.org/blog • 9 items • Updated 24 days ago • 55
Gemma 2: Improving Open Language Models at a Practical Size Paper • 2408.00118 • Published Jul 31, 2024 • 76
Article PyTorchModelHubMixin: Bridging the Gap for Custom AI Models on Hugging Face By not-lain • Nov 11, 2024 • 16
Qwen2.5-Coder Collection Code-specific model series based on Qwen2.5 • 40 items • Updated Nov 28, 2024 • 268
Article Recipe: Preparing Multilingual Speech Datasets for TTS Training By PHBJT • Nov 4, 2024 • 16
MobileLLM Collection Optimizing Sub-billion Parameter Language Models for On-Device Use Cases (ICML 2024) https://arxiv.org/abs/2402.14905 • 9 items • Updated Nov 27, 2024 • 104
🌌 Synthetic textbooks Collection Synthetically generated textbooks • 5 items • Updated Jun 2, 2024 • 2
Article How to optimize your data labelling project with custom interfaces By burtenshaw • Oct 16, 2024 • 18
Article Synthetic dataset generation techniques: Self-Instruct By davanstrien • May 15, 2024 • 14