FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published 2 days ago • 23
Article Transformers backend integration in SGLang By marcsun13 and 4 others • 5 days ago • 35
Article Tiny Agents: a MCP-powered agent in 50 lines of code By julien-c • Apr 25 • 283
How Programming Concepts and Neurons Are Shared in Code Language Models Paper • 2506.01074 • Published 26 days ago • 3
Tracing Multilingual Factual Knowledge Acquisition in Pretraining Paper • 2505.14824 • Published May 20 • 4
Multilingual k-Nearest-Neighbor Machine Translation Paper • 2310.14644 • Published Oct 23, 2023 • 2
Qwen2.5 Collection Qwen2.5 language models, including pretrained and instruction-tuned models of 7 sizes, including 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B. • 46 items • Updated Apr 28 • 624
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published Feb 4 • 235
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems Paper • 2504.01990 • Published Mar 31 • 292
UI is a good thing Collection cool spaces with a cool UI, what could be better? • 5 items • Updated May 5 • 20
On Relation-Specific Neurons in Large Language Models Paper • 2502.17355 • Published Feb 24 • 9
MMTEB Collection Our contribution to the Massive Multilingual Text Embedding Benchmark (MMTEB). Retrieval and reranking benchmarks in 16 languages. • 4 items • Updated Jun 6, 2024 • 3
MMTEB: Massive Multilingual Text Embedding Benchmark Paper • 2502.13595 • Published Feb 19 • 36
CommonCrawl Collection Large web-mined general corpus based on CommonCrawl. • 8 items • Updated Apr 13 • 3
NoLiMa: Long-Context Evaluation Beyond Literal Matching Paper • 2502.05167 • Published Feb 7 • 15
Article Finding Moroccan Arabic (Darija) in Fineweb 2 By omarkamali and 3 others • Dec 8, 2024 • 23