FineData


AI & ML interests

We release large pre-training datasets to accelerate open LLM development. We are part of the Hugging Face Science team (hf.co/science).

Recent Activity

joelniklaus updated a Space about 1 month ago

HuggingFaceFW/finephrase

craffel authored a paper about 1 month ago

Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

cfahlgren1 submitted a paper about 1 month ago

From AGI to ASI


Papers

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale


Organization Card


🍷 FineData

This is the home of the 🍷 FineData team, a branch of the 🤗 Hugging Face Science Team that releases large-scale pre-training datasets to accelerate open LLM development.

🍷 FineWeb: a 15T-token English dataset for LLM pre-training. See the blog post and paper.
📚 FineWeb-Edu: a filtered subset of the most educational content from FineWeb.
🥂 FineWeb2: an extension of FineWeb to over 1000 languages. See the paper.
📄 FinePDFs: 3T tokens of text data extracted from PDFs sourced from the Web. See the blog post.
🌐 FineWiki: an updated, better-extracted version of Wikipedia in 300+ languages.
📄 FinePDFs-Edu: 350B+ highly educational tokens filtered from 📄 FinePDFs.
💬 FineTranslations: 1+1T tokens of parallel text translated from 500+ 🥂 FineWeb2 languages.

Buckets 2

HuggingFaceFW/finephrase-checkpoints

HuggingFaceFW/finephrase-rephrased

Collections 8


Spaces 8

The Synthetic Data Playbook: Generating Trillions of the Finest Tokens

Visualize synthetic‑data experiments as an interactive bookshelf

FinePDFs: Liberating 3T of the finest tokens from PDFs

FineWiki Viewer

Viewer to explore the finewiki dataset

FineWeb: decanting the web for the finest text data at scale

Explore and download the FineWeb web‑scale text dataset

Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks

Evaluate multilingual models using FineTasks

Models 105

HuggingFaceFW/finepdfs_edu_classifier_eng_Latn

0.4B • Updated Nov 11, 2025 • 34 • 2

HuggingFaceFW/finepdfs_dclm_classifier_eng_Latn

0.4B • Updated Oct 6, 2025 • 46

HuggingFaceFW/finepdfs_edu_classifier_v2_eng_Latn

0.4B • Updated Oct 6, 2025 • 14 • 1

HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn

0.4B • Updated Oct 6, 2025 • 17

HuggingFaceFW/finepdfs_edu_classifier_guj_Gujr

0.3B • Updated Oct 6, 2025 • 4

HuggingFaceFW/finepdfs_edu_classifier_nno_Latn

0.3B • Updated Oct 6, 2025 • 24

HuggingFaceFW/finepdfs_edu_classifier_kaz_Cyrl

0.3B • Updated Oct 6, 2025 • 3

HuggingFaceFW/finepdfs_edu_classifier_tam_Taml

0.3B • Updated Oct 6, 2025 • 8

HuggingFaceFW/finepdfs_edu_classifier_azj_Latn

0.3B • Updated Oct 6, 2025 • 7

HuggingFaceFW/finepdfs_edu_classifier_afr_Latn

0.3B • Updated Oct 6, 2025 • 4


Datasets 35

HuggingFaceFW/finepdfs

Viewer • Updated Apr 3 • 476M • 27.1k • 898

HuggingFaceFW/finephrase

Viewer • Updated Mar 31 • 1.02B • 192k • 136

HuggingFaceFW/finepdfs_edu_50BT-dclm_30BT-fineweb_edu_20BT-shuffled

Viewer • Updated Mar 2 • 56.1M • 1.93k • 1

HuggingFaceFW/finepdfs_edu_50BT-dclm_30BT-fineweb_edu_20BT

Viewer • Updated Mar 2 • 56.1M • 14k

HuggingFaceFW/finepdfs_50BT-dclm_30BT-fineweb_edu_20BT-shuffled

Viewer • Updated Mar 2 • 62.1M • 754 • 4

HuggingFaceFW/finepdfs_50BT-dclm_30BT-fineweb_edu_20BT

Viewer • Updated Mar 2 • 62.1M • 18.3k • 2

HuggingFaceFW/finepdfs_edu_100BT-shuffled

Viewer • Updated Mar 2 • 17.8M • 452

HuggingFaceFW/finepdfs_edu_100BT

Viewer • Updated Mar 2 • 17.8M • 2.42k

HuggingFaceFW/finepdfs_100BT-shuffled

Viewer • Updated Mar 2 • 14.6M • 589

HuggingFaceFW/finepdfs_100BT

Viewer • Updated Mar 2 • 29.9M • 2.36k
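
Several of the dataset names above appear to encode a token mixture, for example finepdfs_edu_50BT-dclm_30BT-fineweb_edu_20BT. The following is a minimal sketch of how such a name could be split into its parts. It assumes that each component has the form `<source>_<N>BT`, that "BT" means billions of tokens, that components are joined by "-", and that a trailing "-shuffled" marks a shuffled variant. None of this is confirmed by the page, and `parse_mixture` is a hypothetical helper, not part of any Hugging Face library.

```python
import re

# Assumption: one mixture component looks like "<source>_<N>BT",
# where N is read as billions of tokens.
_COMPONENT = re.compile(r"^(?P<source>.+)_(?P<billions>\d+)BT$")


def parse_mixture(repo_id: str) -> dict:
    """Split a repo id such as 'org/a_50BT-b_30BT-shuffled' into its parts.

    Hypothetical helper: the naming scheme is inferred from the listing,
    not documented by the dataset authors.
    """
    # Drop the organization prefix if present.
    name = repo_id.split("/", 1)[-1]
    parts = name.split("-")

    # A trailing "shuffled" token is treated as a flag, not a component.
    shuffled = parts[-1] == "shuffled"
    if shuffled:
        parts = parts[:-1]

    components = {}
    for part in parts:
        match = _COMPONENT.match(part)
        if match is None:
            raise ValueError(f"not a '<source>_<N>BT' component: {part!r}")
        components[match["source"]] = int(match["billions"])

    total = sum(components.values())
    # Share of each source in the mixture, as a fraction of the total.
    shares = {source: count / total for source, count in components.items()}
    return {
        "components": components,
        "total_billions": total,
        "shares": shares,
        "shuffled": shuffled,
    }


if __name__ == "__main__":
    info = parse_mixture(
        "HuggingFaceFW/finepdfs_edu_50BT-dclm_30BT-fineweb_edu_20BT-shuffled"
    )
    print(info["components"], info["total_billions"], info["shuffled"])
```

Names that do not follow the assumed pattern, such as HuggingFaceFW/finepdfs, raise a `ValueError` rather than being guessed at.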
