---
language:
- ca
license: apache-2.0
tags:
- catalan
- masked-lm
- distilroberta
widget:
- text: El Català és una llengua molt <mask>.
- text: Salvador Dalí va viure a <mask>.
- text: La Costa Brava té les millors <mask> d'Espanya.
- text: El cacaolat és un batut de <mask>.
- text: <mask> és la capital de la Garrotxa.
- text: Vaig al <mask> a buscar bolets.
- text: Antoni Gaudí va ser un <mask> molt important per la ciutat.
- text: Catalunya és una referència en <mask> a nivell europeu.
---
# DistilRoBERTa-base-ca
## Model description
This model is a distilled version of [projecte-aina/roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2).
It follows the same training procedure as [DistilBERT](https://arxiv.org/abs/1910.01108), using the implementation of Knowledge Distillation
from the paper's [official repository](https://github.com/huggingface/transformers/tree/main/examples/research_projects/distillation).
The resulting architecture consists of 6 layers, 768-dimensional embeddings and 12 attention heads,
which adds up to a total of 82M parameters, considerably fewer than the 125M of a standard RoBERTa-base model.
This makes the model lighter and faster than the original, at the cost of slightly lower performance.
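
A minimal usage sketch with the `transformers` fill-mask pipeline is shown below; the Hub identifier is a placeholder and may differ from this repository's actual name:

```python
from transformers import pipeline

# Placeholder Hub id -- replace with this repository's actual identifier.
model_id = "projecte-aina/distilroberta-base-ca-v2"

# The fill-mask pipeline loads the tokenizer and the distilled model in one call.
fill_mask = pipeline("fill-mask", model=model_id)

# Same kind of prompt as the widget examples above.
for pred in fill_mask("El Català és una llengua molt <mask>."):
    print(f"{pred['token_str']:>15}  {pred['score']:.4f}")
```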
## Training
### Training procedure
This model has been trained using a technique known as Knowledge Distillation,
which is used to shrink networks to a reasonable size while minimizing the loss in performance.
It consists of distilling a large language model (the teacher) into a more
lightweight, energy-efficient, and production-friendly model (the student).
In this “teacher-student learning” setup, a relatively small student model is trained to mimic the behavior of the larger teacher model.
As a result, the student has lower inference time and can run on commodity hardware.
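
For illustration, the sketch below shows the kind of objective used in the DistilBERT recipe, combining a temperature-scaled soft-target loss with the standard MLM loss. The loss weights and temperature are illustrative defaults, the cosine-embedding term of the original recipe is omitted, and this is not the exact training configuration of this model:

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha_ce=5.0, alpha_mlm=2.0):
    """Simplified DistilBERT-style objective: soft-target KL + hard-target MLM loss."""
    # Soft targets: KL divergence between temperature-scaled output distributions
    # (the original recipe applies this only at masked positions; omitted here for brevity).
    ce_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard targets: standard masked-language-modelling cross-entropy
    # (non-masked positions carry the label -100 and are ignored).
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    return alpha_ce * ce_loss + alpha_mlm * mlm_loss
```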
### Training data
The training corpus consists of several corpora gathered from web crawling and public sources, as shown in the table below:
| Corpus | Size (GB) |
|--------------------------|-----------:|
| Catalan Crawling | 13.00 |
| RacoCatalá | 8.10 |
| Catalan Oscar | 4.00 |
| CaWaC | 3.60 |
| Cat. General Crawling | 2.50 |
| Wikipedia | 1.10 |
| DOGC | 0.78 |
| Padicat | 0.63 |
| ACN | 0.42 |
| Nació Digital | 0.42 |
| Cat. Government Crawling | 0.24 |
| Vilaweb | 0.06 |
| Catalan Open Subtitles | 0.02 |
| Tweets | 0.02 |
## Evaluation
### Evaluation benchmark
This model has been fine-tuned on the downstream tasks of the [Catalan Language Understanding Evaluation benchmark (CLUB)](https://club.aina.bsc.es/), which includes the following datasets:
| Dataset | Task| Total | Train | Dev | Test |
|:----------|:----|:--------|:-------|:------|:------|
| AnCora | NER | 13,581 | 10,628 | 1,427 | 1,526 |
| AnCora | POS | 16,678 | 13,123 | 1,709 | 1,846 |
| STS-ca | STS | 3,073 | 2,073 | 500 | 500 |
| TeCla | TC | 137,775 | 110,203| 13,786| 13,786|
| TE-ca | RTE | 21,163 | 16,930 | 2,116 | 2,117 |
| CatalanQA | QA | 21,427 | 17,135 | 2,157 | 2,135 |
| XQuAD-ca | QA | - | - | - | 1,189 |
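
As a rough illustration of the fine-tuning setup, the sketch below fine-tunes the model on a CLUB classification task with the `transformers` `Trainer`. The Hub identifiers, dataset column and split names, and hyperparameters are assumptions, not the configuration used for the reported scores:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder identifiers -- the actual Hub names of this model and of the
# CLUB datasets may differ from the ones assumed here.
model_id = "projecte-aina/distilroberta-base-ca-v2"
dataset = load_dataset("projecte-aina/tecla")  # TeCla, the TC task of CLUB

tokenizer = AutoTokenizer.from_pretrained(model_id)

def tokenize(batch):
    # The text column is assumed to be named "text"; adjust to the dataset schema.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

# The label column is assumed to be a ClassLabel feature named "label".
num_labels = tokenized["train"].features["label"].num_classes
model = AutoModelForSequenceClassification.from_pretrained(model_id,
                                                           num_labels=num_labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="distilroberta-ca-tecla",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],  # split name assumed
    tokenizer=tokenizer,
)
trainer.train()
print(trainer.evaluate())
```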
### Evaluation results
This is how it compares to its teacher when fine-tuned on the aforementioned downstream tasks:
| Model \ Task |NER (F1)|POS (F1)|STS-ca (Comb.)|TeCla (Acc.)|TE-ca (Acc.)|CatalanQA (F1/EM)| XQuAD-ca <sup>1</sup> (F1/EM) |
| ------------------------|:-------|:-------|:-------------|:-----------|:----------|:----------------|:------------------------------|
| RoBERTa-base-ca-v2 | 89.29 | 98.96 | 79.07 | 74.26 | 83.14 | 89.50/76.63 | 73.64/55.42 |
| DistilRoBERTa-base-ca | 87.88 | 98.83 | 77.26 | 73.20 | 76.00 | 84.07/70.77 | 62.93/45.08 |
<sup>1</sup> : Trained on CatalanQA, tested on XQuAD-ca (no train set).