A collection of datasets for LLM pretraining

Hugging Face Smol Models Research
AI & ML interests
Exploring smol models (for text, vision, and video) and high-quality web and synthetic datasets
This is the home for smol models (SmolLM & SmolVLM) and high-quality pre-training datasets. We released:
- FineWeb-Edu: a version of the FineWeb dataset filtered for educational content. Paper available here.
- Cosmopedia: the largest open synthetic dataset, with 25B tokens and 30M samples. It contains synthetic textbooks, blog posts, stories, and posts, all generated by Mixtral. Blog post available here.
- Smollm-Corpus: the pre-training corpus of SmolLM, made up of Cosmopedia v0.2, FineWeb-Edu dedup, and Python-Edu. Blog post available here.
- FineMath: the best public math pretraining dataset, with 50B tokens of mathematical and problem-solving data.
- Stack-Edu: the best open code pretraining dataset, with educational code in 15 programming languages.
- SmolLM2 models: a series of strong small models in three sizes: 135M, 360M, and 1.7B.
- SmolVLM2: a family of small video and vision models in three sizes: 2.2B, 500M, and 256M. Blog post available here.
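The datasets in the list above are published under the HuggingFaceTB organization on the Hugging Face Hub, so one way to inspect them without downloading a full corpus is to stream a few rows. The following is a minimal sketch, not an official recipe: it assumes the third-party `datasets` library is installed, and the helper name `peek_rows`, its `loader` parameter, and the default `split="train"` are hypothetical choices made for illustration. Some repositories require a configuration name, which has to be taken from the dataset card.

```python
from itertools import islice
from typing import Any, Callable, Optional


def peek_rows(
    repo_id: str,
    n: int = 3,
    *,
    config: Optional[str] = None,
    split: str = "train",  # assumption: split names vary per dataset
    loader: Optional[Callable[..., Any]] = None,
) -> list:
    """Return the first `n` rows of a Hub dataset without a full download.

    `loader` is a hypothetical injection point so the helper can be tried
    offline; when omitted, `datasets.load_dataset` is used.
    """
    if loader is None:
        # Imported lazily so the module itself has no hard dependency.
        from datasets import load_dataset

        loader = load_dataset

    # streaming=True yields rows lazily instead of materializing the dataset.
    stream = loader(repo_id, name=config, split=split, streaming=True)
    return list(islice(stream, n))
```

A call would look like `peek_rows("HuggingFaceTB/finemath", 2, config=...)`, with the configuration and split filled in from the dataset card. Passing a custom `loader` keeps the helper usable in environments without network access.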
News
- HuggingSnap: turn your iPhone into a visual assistant using SmolVLM2. App Store - Source code
- Stack-Edu: 125B tokens of educational code in 15 programming languages. Dataset

Collections (13)
- SmolVLM • 70 • Generate text by analyzing images and videos
- SmolVLM2 HighlightGenerator • 53 • Generate video highlights from uploaded video
- SmolVLM2 IPhone Waitlist • 17 • Sign in to receive news on the iPhone app
- SmolVLM2 XSPFGenerator (VLC prototype) • 24 • Generate video highlights and playlist
spaces (13)
- SmolLM2 1.7B Instruct WebGPU • Running • 28 • A blazingly fast & powerful AI chatbot that runs in-browser!
- SmolVLM 256M Instruct WebGPU • Running • 47 • Generate descriptions for images using WebGPU technology
- Smolvlm Web Benchmarking • Running • 4
- SmolVLM2 IPhone Waitlist • Running • 17 • Sign in to receive news on the iPhone app
- SmolVLM2 XSPFGenerator (VLC prototype) • Sleeping • 24 • Generate video highlights and playlist
- SmolVLM2 HighlightGenerator • Runtime error • 53 • Generate video highlights from uploaded video
models (75)
- HuggingFaceTB/simplewiki-pruned-text-350k
- HuggingFaceTB/SmolLM2-360M-Instruct • Text Generation • 861k • 115
- HuggingFaceTB/SmolLM2-135M-Instruct • Text Generation • 386k • 186
- HuggingFaceTB/SmolLM2-1.7B-Instruct • Text Generation • 84k • 609
- HuggingFaceTB/SmolVLM2-2.2B-Base • Image-Text-to-Text • 176 • 3
- HuggingFaceTB/SmolVLM-256M-Instruct • Image-Text-to-Text • 412k • 220
- HuggingFaceTB/SmolVLM-Instruct • Image-Text-to-Text • 77.1k • 434
- HuggingFaceTB/SmolVLM-500M-Instruct • Image-Text-to-Text • 31.6k • 119
- HuggingFaceTB/SmolVLM2-256M-Video-Instruct • Image-Text-to-Text • 23.2k • 55
- HuggingFaceTB/SmolVLM2-500M-Video-Instruct • Image-Text-to-Text • 18.1k • 57
datasets (40)
- HuggingFaceTB/wikispeedia-traces • Viewer • 420 • 14
- HuggingFaceTB/stack-edu • Viewer • 167M • 2.19k • 32
- HuggingFaceTB/issues-kaggle-notebooks • Viewer • 16.1M • 1.14k • 8
- HuggingFaceTB/dclm-edu • Viewer • 1B • 17.9k • 25
- HuggingFaceTB/SmolLM2-intermediate-evals • Viewer • 582 • 58
- HuggingFaceTB/smoltalk • Viewer • 2.2M • 7.26k • 330
- HuggingFaceTB/smol-smoltalk • Viewer • 485k • 1.41k • 38
- HuggingFaceTB/finemath • Viewer • 48.3M • 16.2k • 307
- HuggingFaceTB/everyday-conversations-llama3.1-2k • Viewer • 2.38k • 623 • 98
- HuggingFaceTB/MagPie-Pro-300k-MT • Viewer • 300k • 76 • 2
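
The listings on this page show abbreviated counts such as 861k, 48.3M, and 1B. For anyone who wants to sort or compare these entries programmatically, the sketch below converts such strings to integers. The function name `parse_count` and the suffix table are illustrative assumptions, not part of any Hugging Face API, and because the displayed values are rounded, the results are approximations.

```python
from decimal import Decimal

# Assumed meaning of the suffixes used in the listings.
_SUFFIXES = {"k": 10**3, "M": 10**6, "B": 10**9}


def parse_count(text: str) -> int:
    """Convert an abbreviated count such as '2.19k' or '1B' to an integer."""
    text = text.strip()
    if not text:
        raise ValueError("empty count")

    multiplier = _SUFFIXES.get(text[-1])
    if multiplier is None:
        # No recognized suffix: treat the whole string as a plain number.
        return int(Decimal(text))

    # Decimal avoids binary floating-point error in values like 2.19 * 1000.
    return int(Decimal(text[:-1]) * multiplier)


if __name__ == "__main__":
    for raw in ["420", "861k", "2.19k", "48.3M", "1B"]:
        print(raw, "->", parse_count(raw))
```

Using `Decimal` rather than `float` is a deliberate choice here, since truncating a float product can land one below the intended integer.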