- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents
Apache 2.0 licensed. V2 pre-training data mix coming soon!
Wow, impressive 340B model by NVIDIA with a nice permissive license! The technical report is full of insights and seems to use a learning rate schedule other than cosine, probably a variant of WSD (warmup-stable-decay). Hope to get more info on that!
The FineWeb technical report is out, and so is FineWeb-Edu, a 1.3-trillion-token dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.
We used Llama 3 generations to train an educational quality classifier, filtering the 15 trillion tokens of FineWeb to select only those with high educational value (an approach also used in Llama 3 and Phi-3 training datasets). We're releasing both FineWeb-Edu and the classifier, along with a larger, less heavily filtered version containing 5.4 trillion tokens.
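To make the filtering idea concrete, here is a minimal sketch of threshold-based filtering. It is not the FineWeb-Edu pipeline: `toy_score` is a hypothetical keyword counter standing in for the trained classifier, and the threshold values are made up for illustration. It only shows how a stricter threshold yields a smaller set and a looser one yields a larger, less heavily filtered set.

```python
from typing import Callable, Iterable, Iterator


def filter_by_educational_score(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> Iterator[str]:
    """Yield only the documents whose score meets the threshold."""
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc


def toy_score(doc: str) -> float:
    """Hypothetical stand-in for a trained classifier: counts keyword hits."""
    keywords = ("theorem", "explain", "example", "lesson")
    text = doc.lower()
    return float(sum(word in text for word in keywords))


docs = [
    "Buy cheap watches now!!!",
    "This lesson will explain the theorem with a worked example.",
    "An example of photosynthesis in plants.",
]

# Assumed thresholds, chosen only so the two subsets differ.
strict = list(filter_by_educational_score(docs, toy_score, threshold=3))
loose = list(filter_by_educational_score(docs, toy_score, threshold=1))

print(len(strict), len(loose))
```

In a real setup, `score_fn` would wrap a model's prediction, and the threshold would be chosen by checking downstream benchmark results rather than picked by hand.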
You can find more details about the dataset and the experiments we ran in the FineWeb technical report. It's a 45-minute read, but it contains all the secret sauce for building high-quality web datasets.
We've just published a detailed blog post on the creation of the Cosmopedia dataset. We hope it will provide insights into generating synthetic data at scale for pre-training. https://huggingface.co/blog/cosmopedia
Here are some key takeaways:
- Prompt curation is crucial: we want to cover many topics with few duplicates.
- You can leverage various resources for diversity: different seed data, generation formats, and target audiences.
- A good technical stack matters: it enables scalable generation with tools like llm-swarm, plus fast model training and evaluation.
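As a rough illustration of the first two takeaways, the sketch below crosses seed topics with generation formats and target audiences, and skips near-duplicate seeds. This is not the Cosmopedia code: `build_prompts`, the prompt template, and the sample values are all hypothetical, and the deduplication is a simple case and whitespace normalization rather than anything used in practice.

```python
from itertools import product


def build_prompts(seeds, formats, audiences):
    """Combine seed topics, formats and audiences into unique prompts."""
    seen = set()
    prompts = []
    for seed, fmt, audience in product(seeds, formats, audiences):
        # Normalize the seed so trivially different duplicates collapse.
        key = (" ".join(seed.lower().split()), fmt, audience)
        if key in seen:
            continue
        seen.add(key)
        prompts.append(f"Write a {fmt} about '{seed.strip()}' for {audience}.")
    return prompts


# Hypothetical inputs; the second seed duplicates the first up to case and spacing.
seeds = ["Photosynthesis", "photosynthesis ", "Linear algebra"]
formats = ["textbook chapter", "blog post"]
audiences = ["young children", "college students"]

prompts = build_prompts(seeds, formats, audiences)
print(len(prompts))
print(prompts[0])
```

Varying the format and audience multiplies the number of distinct prompts per seed, which is one cheap way to get diversity without needing more seed data.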
Today we're releasing The Stack v2 & StarCoder2: a series of 3B, 7B & 15B code generation models trained on 3.3 to 4.5 trillion tokens of code:
- StarCoder2-15B matches or outperforms CodeLlama 34B, and approaches DeepSeek-33B on multiple benchmarks.
- StarCoder2-3B outperforms StarCoderBase-15B and similar-sized models.
- The Stack v2 is a 4x larger dataset than The Stack v1, resulting in 900B unique code tokens.
As always, we released everything from models and datasets to curation code. Enjoy!