arXiv:2506.14111

Essential-Web v1.0: 24T tokens of organized web data

Published on Jun 17 · Submitted by Research-EAI on Jun 17

Abstract

AI-generated summary: The 24-trillion-token Essential-Web v1.0 dataset, annotated with a twelve-category taxonomy, is competitive with or outperforms existing datasets across several domains using only simple filtering techniques.

Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
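The "SQL-style filters" mentioned in the abstract amount to selecting documents by their taxonomy labels. The sketch below shows what such a filter might look like with the Hugging Face `datasets` library in streaming mode; it is a minimal illustration, and the field names `eai_taxonomy`, `free_decimal_correspondence`, and `text` are assumptions, so consult the dataset card for the actual schema before relying on them.

```python
# Minimal sketch: carve a math-focused subset out of Essential-Web v1.0
# with a simple metadata filter, in the spirit of the paper's SQL-style filters.
# NOTE: "eai_taxonomy", "free_decimal_correspondence", and "text" are assumed
# field names used for illustration; check the dataset card for the real schema.
from datasets import load_dataset

# Stream the dataset so nothing close to 24T tokens has to be downloaded up front.
ds = load_dataset("EssentialAI/essential-web-v1.0", split="train", streaming=True)

def looks_like_math(example):
    # Hypothetical predicate: keep documents whose topic code falls under
    # mathematics (Dewey-style 51x); adjust to whatever the taxonomy actually uses.
    taxonomy = example.get("eai_taxonomy") or {}
    topic_code = str(taxonomy.get("free_decimal_correspondence", ""))
    return topic_code.startswith("51")

math_subset = ds.filter(looks_like_math)

# Peek at a few matching documents.
for doc in math_subset.take(3):
    print(doc["text"][:200])
```

The same pattern extends to the other domains reported in the abstract (web code, STEM, medical) by swapping in different taxonomy predicates.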

Community

Amazing work!! 🔥

Great work!

Fantastic!

Impressive!!!!!

Models citing this paper 1

Datasets citing this paper 8

Spaces citing this paper 0

Collections including this paper 4