view article Article Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models By loubnabnl and 2 others • Mar 20, 2024 • 94
view article Article Atlaset Dataset for Moroccan Darija: From Data Collection, Analysis, to Model Trainings By atlasia and 1 other • Mar 6 • 24
view article Article MiniMax-01 is Now Open-Source: Scaling Lightning Attention for the AI Agent Era By MiniMax-AI • Jan 15 • 46
view article Article Arabic RAG Leaderboard: A Comprehensive Framework for Evaluating Arabic Language Retrieval Systems By Navid-AI and 1 other • Feb 9 • 12
view article Article Darija Chatbot Arena: Making LLMs Compete in the Moroccan Dialect By atlasia and 2 others • Feb 10 • 14
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published Feb 4 • 232
view article Article Open-R1: a fully open reproduction of DeepSeek-R1 By eliebak and 2 others • Jan 28 • 862
Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect Paper • 2409.17912 • Published Sep 26, 2024 • 29
view article Article Yay! Organizations can now publish blog Articles By huggingface and 3 others • Jan 20 • 45
view article Article TerjamaBench: A Cultural Benchmark for English-Darija Machine Translation By imomayiz and 4 others • Jan 10 • 33
SmolLM2 Collection State-of-the-art compact LLMs for on-device applications: 1.7B, 360M, 135M • 16 items • Updated 30 days ago • 265
📀 Dataset comparison models Collection 1.8B models trained on 350BT to compare different pretraining datasets • 8 items • Updated Jun 12, 2024 • 38
view article Article Finding Moroccan Arabic (Darija) in Fineweb 2 By omarkamali and 3 others • Dec 8, 2024 • 23
Falcon Mamba: The First Competitive Attention-free 7B Language Model Paper • 2410.05355 • Published Oct 7, 2024 • 36