FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published 2 days ago • 23
How Programming Concepts and Neurons Are Shared in Code Language Models Paper • 2506.01074 • Published 26 days ago • 3
Tracing Multilingual Factual Knowledge Acquisition in Pretraining Paper • 2505.14824 • Published May 20 • 4
On Relation-Specific Neurons in Large Language Models Paper • 2502.17355 • Published Feb 24 • 9
How Transliterations Improve Crosslingual Alignment Paper • 2409.17326 • Published Sep 25, 2024 • 1
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages Paper • 2410.23825 • Published Oct 31, 2024 • 4
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment Paper • 2410.05873 • Published Oct 8, 2024 • 3