FineData


AI & ML interests

We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science).

Recent Activity

joelniklaus published a Space about 9 hours ago
HuggingFaceFW/finephrase
joelniklaus updated a dataset about 9 hours ago
HuggingFaceFW/finephrase
joelniklaus updated a Space about 20 hours ago
HuggingFaceFW/finephrase

HuggingFaceFW's collections 8

🤏 Smol-Data
Tried and tested mixes for strong pretraining. Inspired by https://huggingface.co/blog/codelion/optimal-dataset-mixing
📀 Dataset comparison models
1.8B models trained on 350BT to compare different pretraining datasets
📚 FineWeb-Edu
FineWeb-Edu datasets, classifier and ablation model
🧪 FineWeb v1 data experiments
Ablation models trained for our data experiments.