Text Generation · Transformers · Safetensors · GGUF · Czech · mpt · llama-cpp · gguf-my-repo · custom_code · text-generation-inference
mfajcik committed · verified
Commit dea96b2 · 1 Parent(s): 401fc64

Update README.md

Files changed (1)
  1. README.md +3 -1
README.md CHANGED
@@ -5,6 +5,9 @@ datasets:
  - BUT-FIT/adult_content_classifier_dataset
  language:
  - cs
+ tags:
+ - llama-cpp
+ - gguf-my-repo
  ---
  # Introduction
  CSMPT7b is a large Czech language model continously pretrained on 272b training tokens from English [MPT7b](https://huggingface.co/mosaicml/mpt-7b) model. Model was pretrained on ~67b token [Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/BUT-LCC) using Czech tokenizer, obtained using our vocabulary swap method (see below).
@@ -20,7 +23,6 @@ Training was done on [Karolina](https://www.it4i.cz/en) cluster.
  - 01/10/2024 We released [BenCzechMark](https://huggingface.co/spaces/CZLC/BenCzechMark), the first Czech evaluation suite for fair open-weights model comparison.
  - 18/04/2024 We released all our training checkpoints (in MosaicML format & packed using ZPAQ) at [czechllm.fit.vutbr.cz/csmpt7b/checkpoints/](https://czechllm.fit.vutbr.cz/csmpt7b/checkpoints/)
  - 06/05/2024 We released small manually annotated [dataset of adult content](https://huggingface.co/datasets/BUT-FIT/adult_content_classifier_dataset). We used classifier trained on this dataset for filtering our corpus.
- -
  # Evaluation
  Dev eval at CS-HellaSwag (automatically translated HellaSwag benchmark).
  | Model | CS-HellaSwag Accuracy |
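
The `llama-cpp` and `gguf-my-repo` tags added by this commit advertise a GGUF export of the model runnable through llama.cpp bindings. As a rough sketch only (the GGUF repo id and quantization filename below are placeholders, not taken from this commit), loading such a build with llama-cpp-python could look like:

```python
# Hedged sketch: running a GGUF export of CSMPT7b via llama-cpp-python.
# The repo id and filename are placeholders; check the actual repository
# produced by gguf-my-repo for the real values.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="BUT-FIT/csmpt7b-Q4_K_M-GGUF",  # placeholder GGUF repo id
    filename="*q4_k_m.gguf",                # placeholder quantized file (glob)
    n_ctx=2048,
)

out = llm("Nejznámějším českým spisovatelem je", max_tokens=64)
print(out["choices"][0]["text"])
```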
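For the base (non-GGUF) checkpoint described in the README, the `mpt` and `custom_code` tags indicate a standard transformers model that ships custom MPT modeling code. A minimal loading sketch, assuming the repo id `BUT-FIT/csmpt7b` (not stated in this commit):

```python
# Minimal sketch, assuming the base checkpoint lives at BUT-FIT/csmpt7b.
# trust_remote_code=True is needed because MPT uses custom modeling code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "BUT-FIT/csmpt7b"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.bfloat16
)

inputs = tokenizer("Praha je hlavní město", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_p=0.9)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```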