---
language:
- ca
license: apache-2.0
tags:
- catalan
- masked-lm
- distilroberta
widget:
- text: El Català és una llengua molt <mask>.
- text: Salvador Dalí va viure a <mask>.
- text: La Costa Brava té les millors <mask> d'Espanya.
- text: El cacaolat és un batut de <mask>.
- text: <mask> és la capital de la Garrotxa.
- text: Vaig al <mask> a buscar bolets.
- text: Antoni Gaudí va ser un <mask> molt important per la ciutat.
- text: Catalunya és una referència en <mask> a nivell europeu.
---

# DistilRoBERTa-base-ca

## Model description

This model is a distilled version of [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2).

It follows the same training procedure as [DistilBERT](https://arxiv.org/abs/1910.01108), using the implementation of Knowledge Distillation from the paper's [official repository](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation).

The resulting architecture consists of 6 layers, 768-dimensional embeddings, and 12 attention heads.
This adds up to a total of 82M parameters, considerably fewer than the 125M of a standard RoBERTa-base model.
This makes the model lighter and faster than the original, at the cost of slightly lower performance.
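
Since this is a masked language model, it can be queried directly for fill-mask predictions. The snippet below is a minimal usage sketch with the Hugging Face Transformers `pipeline` API; the repository id is a placeholder, so substitute the actual Hub id of this model.

```python
from transformers import pipeline

# Placeholder repository id: replace it with the actual Hub id of this model.
model_id = "distilroberta-base-ca"

# The fill-mask pipeline loads the tokenizer and the masked-LM head.
unmasker = pipeline("fill-mask", model=model_id)

# Print the most likely completions for the masked position.
for prediction in unmasker("El Català és una llengua molt <mask>."):
    print(f"{prediction['sequence']}  (score={prediction['score']:.3f})")
```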

## Training

### Training procedure

This model was trained using a technique known as Knowledge Distillation, which shrinks a network to a reasonable size while minimizing the loss in performance.

It consists of distilling a large language model (the teacher) into a more lightweight, energy-efficient, and production-friendly model (the student).

In this “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of the larger teacher model. As a result, the student achieves lower inference time and can run on commodity hardware.
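
The sketch below illustrates the teacher-student objective described above; it is not the actual training script, which lives in the distillation example repository linked earlier. It combines the standard masked-language-modeling loss on hard labels with a temperature-scaled KL-divergence term that pulls the student's output distribution towards the teacher's. The temperature and loss weights shown are illustrative assumptions, and the full DistilBERT recipe additionally uses a cosine-embedding loss between hidden states, omitted here for brevity.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha_ce=0.5, alpha_mlm=0.5):
    """Schematic teacher-student loss: soft-target KL term + hard-label MLM term."""
    # Soft targets: align the student's distribution with the teacher's,
    # both softened by the temperature (rescaled by T^2, a standard convention
    # to keep gradient magnitudes comparable).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard targets: regular masked-LM cross-entropy; positions that were not
    # masked carry the label -100 and are ignored.
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    return alpha_ce * soft_loss + alpha_mlm * mlm_loss
```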

### Training data

The training corpus combines several sources gathered from web crawling and publicly available corpora, as shown in the table below:

| Corpus                    | Size (GB) |
|---------------------------|----------:|
| Catalan Crawling          | 13.00 |
| RacoCatalá                | 8.10 |
| Catalan Oscar             | 4.00 |
| CaWaC                     | 3.60 |
| Cat. General Crawling     | 2.50 |
| Wikipedia                 | 1.10 |
| DOGC                      | 0.78 |
| Padicat                   | 0.63 |
| ACN                       | 0.42 |
| Nació Digital             | 0.42 |
| Cat. Government Crawling  | 0.24 |
| Vilaweb                   | 0.06 |
| Catalan Open Subtitles    | 0.02 |
| Tweets                    | 0.02 |

## Evaluation

### Evaluation benchmark

This model has been fine-tuned on the downstream tasks of the [Catalan Language Understanding Evaluation benchmark (CLUB)](https://club.aina.bsc.es/), which includes the following datasets:

| Dataset   | Task | Total   | Train   | Dev    | Test   |
|:----------|:-----|:--------|:--------|:-------|:-------|
| AnCora    | NER  | 13,581  | 10,628  | 1,427  | 1,526  |
| AnCora    | POS  | 16,678  | 13,123  | 1,709  | 1,846  |
| STS-ca    | STS  | 3,073   | 2,073   | 500    | 500    |
| TeCla     | TC   | 137,775 | 110,203 | 13,786 | 13,786 |
| TE-ca     | RTE  | 21,163  | 16,930  | 2,116  | 2,117  |
| CatalanQA | QA   | 21,427  | 17,135  | 2,157  | 2,135  |
| XQuAD-ca  | QA   | -       | -       | -      | 1,189  |

### Evaluation results

This is how the distilled model compares to its teacher when both are fine-tuned on the aforementioned downstream tasks:

| Model \ Task           | NER (F1) | POS (F1) | STS-ca (Comb.) | TeCla (Acc.) | TE-ca (Acc.) | CatalanQA (F1/EM) | XQuAD-ca <sup>1</sup> (F1/EM) |
|:-----------------------|:---------|:---------|:---------------|:-------------|:-------------|:------------------|:------------------------------|
| RoBERTa-base-ca-v2     | 89.29    | 98.96    | 79.07          | 74.26        | 83.14        | 89.50/76.63       | 73.64/55.42                   |
| DistilRoBERTa-base-ca  | 87.88    | 98.83    | 77.26          | 73.20        | 76.00        | 84.07/70.77       | 62.93/45.08                   |

<sup>1</sup>: Trained on CatalanQA, tested on XQuAD-ca (which has no train set).
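
As an illustration of the kind of fine-tuning behind these results, the sketch below fine-tunes the distilled model on a single CLUB task (text classification on TeCla) with the Hugging Face `Trainer`. The model and dataset ids, column names, split names, and hyperparameters are placeholders and assumptions, not the exact configuration used for the reported numbers.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder ids: substitute the actual Hub ids of this model and of the
# TeCla dataset, and adapt column/split names to the dataset's real schema.
model_id = "distilroberta-base-ca"
dataset_id = "projecte-aina/tecla"

dataset = load_dataset(dataset_id)
num_labels = dataset["train"].features["label"].num_classes  # assumes a ClassLabel column named "label"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=num_labels)

def tokenize(batch):
    # Assumes the input text lives in a column named "text".
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="distilroberta-base-ca-tecla",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=3e-5,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```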