---
library_name: transformers
datasets:
- HuggingFaceTB/cosmo2_training_data_subset_1M
---

# cosmo2-tokenizer

Tokenizer for the training of cosmo2. This tokenizer was trained on 1M samples from:

- FineWeb-Edu 70%
- Cosmopedia v2 15%
- StarCoderData 8%
- OpenWebMath 5%
- Stack Overflow 2%
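
As a rough illustration of the mixture above, the sketch below turns the percentages into per-source sample counts for the 1M-sample training set. It assumes the percentages are measured by sample count (the card does not say whether they refer to samples or tokens). The `load_tokenizer` helper is hypothetical, and its default repository id `HuggingFaceTB/cosmo2-tokenizer` is an assumption inferred from the dataset's organization and this card's title, so adjust it if the tokenizer lives elsewhere.

```python
# Mixture from the list above, as integer percentages.
MIXTURE_PERCENT = {
    "FineWeb-Edu": 70,
    "Cosmopedia v2": 15,
    "StarCoderData": 8,
    "OpenWebMath": 5,
    "Stack Overflow": 2,
}
TOTAL_SAMPLES = 1_000_000  # "1M samples" from the card


def samples_per_source(mixture, total):
    """Return samples per source, ASSUMING the percentages are by sample count."""
    if sum(mixture.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding in the counts.
    return {name: total * pct // 100 for name, pct in mixture.items()}


def load_tokenizer(repo_id="HuggingFaceTB/cosmo2-tokenizer"):
    """Load the tokenizer from the Hub.

    Hypothetical helper: repo_id is an assumption, and the call needs
    network access plus the `transformers` package.
    """
    # Imported lazily so the arithmetic above runs without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


counts = samples_per_source(MIXTURE_PERCENT, TOTAL_SAMPLES)
for name, n in counts.items():
    print(f"{name}: {n:,}")
```

With network access, `load_tokenizer()` should return a tokenizer object that can be called on a string, for example `load_tokenizer()("def add(a, b): return a + b")["input_ids"]`, to inspect how code and prose are split into tokens.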