FineWeb: decanting the web for the finest text data at scale • Generate high-quality web text data for LLM training
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale • Paper • 2406.17557 • Published Jun 25, 2024
Dataset comparison models • Collection • 1.8B models trained on 350BT to compare different pretraining datasets • 8 items • Updated Jun 12, 2024
FineWeb v1 data experiments • Collection • Ablation models trained for our data experiments • 22 items • Updated Jun 12, 2024