HuggingFaceFW
Organization Card
🤗 HuggingFace 🍷 FineWeb datasets
Read our technical report!
This organization hosts the 🍷 FineWeb datasets, a collection of text datasets sourced from the web (CommonCrawl), released under a permissive license (ODC-By).
Creating 🍷 FineWeb involved careful processing and filtering of large amounts of web data, with the aim of lowering the barrier to entry for anyone who wants to pretrain high-performance large language models.
All code and artefacts needed for reproduction are public and built on top of open-source libraries, such as the 🤗 libraries datatrove, nanotron, or lighteval.
Version 1 of the 🍷 FineWeb dataset is available here. Our ablation models can be found here.
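As a minimal sketch of how the hosted data might be inspected without downloading it in full, the snippet below streams a few documents with the 🤗 `datasets` library. The repository id `HuggingFaceFW/fineweb` comes from the listing on this page; the `sample-10BT` configuration name, the `train` split, and the `text` field are assumptions not stated here and should be checked against the dataset card. The helper names `take_texts` and `stream_fineweb` are hypothetical.

```python
from itertools import islice


def take_texts(records, n, field="text"):
    """Return the `field` value of the first `n` records of any iterable of dicts."""
    return [record[field] for record in islice(records, n)]


def stream_fineweb(n=3):
    """Stream the first `n` documents of FineWeb (needs network access and `pip install datasets`)."""
    # Imported lazily so that take_texts stays usable without the library installed.
    from datasets import load_dataset

    # Assumption: the "sample-10BT" config and the "train" split exist for this repository.
    dataset = load_dataset(
        "HuggingFaceFW/fineweb",
        name="sample-10BT",
        split="train",
        streaming=True,  # iterate lazily instead of downloading the whole dataset
    )
    return take_texts(dataset, n)
```

Calling `stream_fineweb(3)` should return a list of three document strings if the assumptions above hold. Streaming is used here because the full dataset is far too large to download for a quick look.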
Collections: 4
Models: 30
HuggingFaceFW/Datasets-Metrics-Viewer-Data
HuggingFaceFW/ablation-model-fineweb-edu: Text Generation, 691 downloads, 11 likes
HuggingFaceFW/fineweb-edu-classifier: Text Classification, 260k downloads, 124 likes
HuggingFaceFW/ablation-exp-filter-custom-all_filters-28BT: Text Generation, 13 downloads, 1 like
HuggingFaceFW/ablation-exp-filter-custom-line_char_duplicated_0.01-28BT: Text Generation, 15 downloads, 2 likes
HuggingFaceFW/ablation-exp-filter-custom-line_ratio_0.67-28BT: Text Generation, 17 downloads
HuggingFaceFW/ablation-exp-filter-custom-lines_punct_0.12-28BT: Text Generation, 15 downloads, 3 likes
HuggingFaceFW/ablation-exp-filter-baseline_c4-28BT: Text Generation, 17 downloads, 2 likes
HuggingFaceFW/ablation-exp-filter-baseline_cc-28BT: Text Generation, 15 downloads, 4 likes
HuggingFaceFW/ablation-exp-filter-c4-word_lengths-28BT: Text Generation, 13 downloads, 2 likes
Datasets: 5
HuggingFaceFW/fineweb-edu: 3B rows, 592k downloads, 532 likes
HuggingFaceFW/fineweb: 46B rows, 383k downloads, 1.74k likes
HuggingFaceFW/fineweb-edu-llama3-annotations: 467k rows, 241 downloads, 34 likes
HuggingFaceFW/fineweb-edu-score-2: 11.8B rows, 32.3k downloads, 58 likes
HuggingFaceFW/admin: 2 rows, 7.71k downloads, 3 likes