Update README.md
README.md
CHANGED
@@ -5,6 +5,9 @@ datasets:
 - BUT-FIT/adult_content_classifier_dataset
 language:
 - cs
+tags:
+- llama-cpp
+- gguf-my-repo
 ---
 # Introduction
 CSMPT7b is a large Czech language model continuously pretrained on 272b training tokens from English [MPT7b](https://huggingface.co/mosaicml/mpt-7b) model. Model was pretrained on ~67b token [Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/BUT-LCC) using Czech tokenizer, obtained using our vocabulary swap method (see below).
@@ -20,7 +23,6 @@ Training was done on [Karolina](https://www.it4i.cz/en) cluster.
 - 01/10/2024 We released [BenCzechMark](https://huggingface.co/spaces/CZLC/BenCzechMark), the first Czech evaluation suite for fair open-weights model comparison.
 - 18/04/2024 We released all our training checkpoints (in MosaicML format & packed using ZPAQ) at [czechllm.fit.vutbr.cz/csmpt7b/checkpoints/](https://czechllm.fit.vutbr.cz/csmpt7b/checkpoints/)
 - 06/05/2024 We released small manually annotated [dataset of adult content](https://huggingface.co/datasets/BUT-FIT/adult_content_classifier_dataset). We used classifier trained on this dataset for filtering our corpus.
--
 # Evaluation
 Dev eval at CS-HellaSwag (automatically translated HellaSwag benchmark).
 | Model | CS-HellaSwag Accuracy |