---
language: cs
license: cc-by-nc-sa-4.0
tags:
- Czech
- GEC
- GECCC dataset
pipeline_tag: text2text-generation
library_name: transformers
base_model: google/byt5-small
---

# Model Card for byt5-small-geccc-mate

The `byt5-small-geccc-mate` model is a sequence-to-sequence model performing grammar error correction in Czech, described in the paper [Refining Czech GEC: Insights from a Multi-Experiment Approach](https://arxiv.org/abs/2506.22402). It is a finetuned version of [byt5-small](https://huggingface.co/google/byt5-small) using the MATE method and the [GECCC dataset](https://hdl.handle.net/11234/1-4861).

## Model Description

- **Developed by:** [Seznam.cz](https://seznam.cz) and [Charles University, MFF, ÚFAL](https://ufal.mff.cuni.cz/)
- **Language(s) (NLP):** Czech
- **Model type:** character-based encoder-decoder Transformer model
- **Finetuned from model:** `google/byt5-small`
- **Finetuned on:**
  - first, synthetic errors generated by the MATE method (see [the paper](https://arxiv.org/abs/2506.22402)),
  - then, the [GECCC dataset](https://hdl.handle.net/11234/1-4861)
- **License:** CC BY-NC-SA 4.0

## Model Sources

- **Repository:** https://github.com/ufal/tsd2025-gec
- **Paper:** [Refining Czech GEC: Insights from a Multi-Experiment Approach](https://arxiv.org/abs/2506.22402)
- **Dataset:** [GECCC dataset](https://hdl.handle.net/11234/1-4861)

## Evaluation
*Performance bubble chart (image omitted).*
| Model | Parameters | GECCC F-0.5 score | AKCES F-0.5 score |
|:------|-----------:|:-----------------:|:-----------------:|
| [**byt5-small-geccc-mate**](https://hf.co/ufal/byt5-small-geccc-mate) | **300M** | **72.56** | |
| [byt5-base-geccc-mate](https://hf.co/ufal/byt5-base-geccc-mate) | 582M | 75.15 | |
| [byt5-large-geccc-mate](https://hf.co/ufal/byt5-large-geccc-mate) | 1275M | 77.01 | |
| [byt5-large-akces-mate](https://hf.co/ufal/byt5-large-akces-mate) | 1275M | | 84.40 |
| [transformer-base-geccc-mate](https://hf.co/ufal/transformer-base-geccc-mate) | 65M | 73.73 | |

## Uses

The model can be used directly to correct grammar in space-tokenized Czech text.

## How to Get Started with the Model

Use the code below to get started with the model. Note that the input must be **space-tokenized**, i.e., every token (produced by the [UDPipe 1](https://ufal.mff.cuni.cz/udpipe/1) tokenizer with the [czech-pdt-ud-2.5-191206.udpipe](https://hdl.handle.net/11234/1-3131) model) must be separated by a space; a sketch of this preprocessing step is given after the citation below.

```python
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("ufal/byt5-small-geccc-mate")
model = transformers.AutoModelForSeq2SeqLM.from_pretrained("ufal/byt5-small-geccc-mate")

# The input is already space-tokenized; note the space before the final period.
batch = tokenizer(["Sveřepý šakali zavile vyly na býlí mesýc ."], return_tensors="pt")
outputs = model.generate(batch.input_ids, max_length=256, num_beams=4)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

## BibTeX Citation

```
@InProceedings{10.1007/978-3-032-02551-7_7,
  author="Pechman, Petr and Straka, Milan and Strakov{\'a}, Jana and N{\'a}plava, Jakub",
  editor="Ek{\v{s}}tein, Kamil and Konop{\'i}k, Miloslav and Pra{\v{z}}{\'a}k, Ond{\v{r}}ej and P{\'a}rtl, Franti{\v{s}}ek",
  title="Refining Czech GEC: Insights from a Multi-experiment Approach",
  booktitle="Text, Speech, and Dialogue",
  year="2026",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="64--76",
  isbn="978-3-032-02551-7",
  doi="10.1007/978-3-032-02551-7_7"
}
```
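## Tokenizing Raw Input

The usage snippet above assumes the input is already space-tokenized. Below is a minimal sketch of that preprocessing step, assuming the `ufal.udpipe` Python bindings (`pip install ufal.udpipe`) and the [czech-pdt-ud-2.5-191206.udpipe](https://hdl.handle.net/11234/1-3131) model file downloaded into the working directory; the local file name is an assumption, not part of this repository.

```python
# Sketch: space-tokenize raw Czech text with UDPipe 1 before passing it to
# the GEC model. Assumes `pip install ufal.udpipe` and that the
# czech-pdt-ud-2.5-191206.udpipe model file (assumed local path) has been
# downloaded into the working directory.
from ufal.udpipe import Model, Pipeline, ProcessingError

udpipe_model = Model.load("czech-pdt-ud-2.5-191206.udpipe")
assert udpipe_model is not None, "Cannot load the UDPipe model file"

# "tokenize" input reads plain text; the "horizontal" output format writes
# one sentence per line with tokens separated by single spaces.
pipeline = Pipeline(udpipe_model, "tokenize", Pipeline.NONE, Pipeline.NONE, "horizontal")

error = ProcessingError()
tokenized = pipeline.process("Sveřepý šakali zavile vyly na býlí mesýc.", error)
assert not error.occurred(), error.message

print(tokenized)  # expected: "Sveřepý šakali zavile vyly na býlí mesýc ."
```

The resulting line(s) can then be passed directly to the `tokenizer(...)` call in the snippet above.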