BUT-FIT/csmpt7b

@@ -5,6 +5,9 @@ datasets:
 - BUT-FIT/adult_content_classifier_dataset
 language:
 - cs
 ---
 # Introduction
 CSMPT7b is a large Czech language model continuously pretrained on 272b training tokens from the English [MPT7b](https://huggingface.co/mosaicml/mpt-7b) model. The model was pretrained on the ~67b-token [Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/BUT-LCC) using a Czech tokenizer obtained with our vocabulary swap method (see below).
@@ -20,7 +23,6 @@ Training was done on [Karolina](https://www.it4i.cz/en) cluster.
 - 01/10/2024 We released [BenCzechMark](https://huggingface.co/spaces/CZLC/BenCzechMark), the first Czech evaluation suite for fair open-weights model comparison.
 - 18/04/2024 We released all our training checkpoints (in MosaicML format & packed using ZPAQ) at [czechllm.fit.vutbr.cz/csmpt7b/checkpoints/](https://czechllm.fit.vutbr.cz/csmpt7b/checkpoints/)
 - 06/05/2024 We released a small, manually annotated [dataset of adult content](https://huggingface.co/datasets/BUT-FIT/adult_content_classifier_dataset). We used a classifier trained on this dataset to filter our corpus.
--
 # Evaluation
 Development-set evaluation on CS-HellaSwag (an automatically translated HellaSwag benchmark).
 | Model | CS-HellaSwag Accuracy |

 - BUT-FIT/adult_content_classifier_dataset
 language:
 - cs
+tags:
+- llama-cpp
+- gguf-my-repo
 ---
 # Introduction
 CSMPT7b is a large Czech language model continuously pretrained on 272b training tokens from the English [MPT7b](https://huggingface.co/mosaicml/mpt-7b) model. The model was pretrained on the ~67b-token [Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/BUT-LCC) using a Czech tokenizer obtained with our vocabulary swap method (see below).
 - 01/10/2024 We released [BenCzechMark](https://huggingface.co/spaces/CZLC/BenCzechMark), the first Czech evaluation suite for fair open-weights model comparison.
 - 18/04/2024 We released all our training checkpoints (in MosaicML format & packed using ZPAQ) at [czechllm.fit.vutbr.cz/csmpt7b/checkpoints/](https://czechllm.fit.vutbr.cz/csmpt7b/checkpoints/)
 - 06/05/2024 We released a small, manually annotated [dataset of adult content](https://huggingface.co/datasets/BUT-FIT/adult_content_classifier_dataset). We used a classifier trained on this dataset to filter our corpus.
 # Evaluation
 Development-set evaluation on CS-HellaSwag (an automatically translated HellaSwag benchmark).
 | Model | CS-HellaSwag Accuracy |