FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published 27 days ago • 61
SmolVLM: Redefining small and efficient multimodal models Paper • 2504.05299 • Published Apr 7 • 193
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published Feb 4 • 236
Towards Best Practices for Open Datasets for LLM Training Paper • 2501.08365 • Published Jan 14 • 64
SelfCodeAlign: Self-Alignment for Code Generation Paper • 2410.24198 • Published Oct 31, 2024 • 25