---
language:
- ca
tags:
- roberta
- fill-mask
- catalan
license: apache-2.0
library_name: transformers
---

# RoBERTa-ca Model Card

RoBERTa-ca is a new foundational Catalan language model built on the [RoBERTa](https://huggingface.co/FacebookAI/roberta-base) architecture. It is obtained through vocabulary adaptation from [mRoBERTa](https://huggingface.co/BSC-LT/mRoBERTa): all weights are initialized from mRoBERTa, while the embedding matrix receives a specialized treatment that carefully handles the differences between the two tokenizers. The model is then continually pretrained on a Catalan-only corpus consisting of 95GB of high-quality Catalan data.

## Technical Description

Technical details of the RoBERTa-ca model.

| Description       | Value    |
|-------------------|:---------|
| Model Parameters  | 125M     |
| Tokenizer Type    | SPM      |
| Vocabulary size   | 50,304   |
| Precision         | bfloat16 |
| Context length    | 512      |

Training Hyperparameters

| Hyperparameter            | Value                            |
|---------------------------|:---------------------------------|
| Pretraining Objective     | Masked Language Modeling         |
| Learning Rate             | 3E-05                            |
| Learning Rate Scheduler   | Cosine                           |
| Warmup                    | 2425                             |
| Optimizer                 | AdamW                            |
| Optimizer Hyperparameters | AdamW (β1=0.9, β2=0.98, ε=1e-06) |
| Optimizer Decay           | 1E-02                            |
| Global Batch Size         | 1024                             |
| Dropout                   | 1E-01                            |
| Attention Dropout         | 1E-01                            |
| Activation Function       | GeLU                             |

### EVALUATION: CLUB Benchmark

Model performance in Catalan is assessed using the [CLUB (Catalan Language Understanding Benchmark)](https://club.aina.bsc.es/datasets.html), which consists of 6 tasks: Named Entity Recognition (NER), Part-of-Speech Tagging (POS), Semantic Textual Similarity (STS), Text Classification (TC), Textual Entailment (TE), and Question Answering (QA). This benchmark evaluates the model's capabilities in the Catalan language.

The following base foundational models have been considered for the comparison:

| Foundational Model | Number of Parameters | Vocab Size | Description |
|--------------------|----------------------|------------|-------------|
| [BERTa](https://huggingface.co/PlanTL-GOB-ES/roberta-base-ca) | 126M | 52K | BERTa is a Catalan-specific language model pretrained with Catalan-only data. |
| [BERTinho](https://huggingface.co/dvilares/bertinho-gl-base-cased) | 109M | 30K | BERTinho is a monolingual BERT model for the Galician language. |
| [mBERT](https://huggingface.co/google-bert/bert-base-multilingual-cased) | 178M | 120K | Multilingual BERT model pretrained on the top 104 languages with the largest Wikipedias. |
| [mRoBERTa](https://huggingface.co/BSC-LT/mRoBERTa) | 283M | 256K | RoBERTa base model pretrained on 35 European languages with a larger vocabulary size. |
| [roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) | 125M | 50K | RoBERTa base model pretrained with 570GB of data from web crawls performed by the National Library of Spain from 2009 to 2019. |
| [RoBERTa-ca](https://huggingface.co/BSC-LT/RoBERTa-ca) | 125M | 50K | RoBERTa-ca is a Catalan-specific language model obtained through vocabulary adaptation from mRoBERTa. |
| [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) | 279M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data covering 100 languages. |
| [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) | 561M | 250K | Foundational RoBERTa model pretrained with CommonCrawl data covering 100 languages. |
The results obtained for each task are shown below:

| Task           | roberta-base-bne (125M) | BERTa (126M) | mBERT (178M) | xlm-roberta-base (279M) | xlm-roberta-large (561M) | RoBERTa-ca (125M) | mRoBERTa (283M) |
|----------------|-------------------------|--------------|--------------|-------------------------|--------------------------|-------------------|-----------------|
| NER (F1)       | 87.59 | 89.47 | 85.89 | 87.50 | 89.47 | 89.70 | 88.33 |
| POS (F1)       | 98.64 | 98.89 | 98.78 | 98.91 | 99.03 | 99.00 | 98.98 |
| STS (Pearson)  | 74.27 | 81.39 | 77.05 | 75.11 | 83.49 | 82.99 | 79.52 |
| TC (Acc.)      | 73.86 | 73.16 | 72.00 | 73.05 | 74.10 | 72.81 | 72.41 |
| TE (Acc.)      | 72.27 | 80.11 | 75.86 | 78.27 | 86.63 | 82.14 | 82.38 |
| ViquiQuAD (F1) | 82.56 | 86.74 | 87.42 | 86.81 | 90.35 | 87.31 | 87.86 |
| XQuAD (F1)     | 60.56 | 67.38 | 67.72 | 68.56 | 76.08 | 70.53 | 69.40 |
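
## How to Use

Since the model is trained with a masked language modeling objective, it can be queried directly through the `fill-mask` pipeline of the `transformers` library. The sketch below is a minimal example; it assumes the model is available on the Hugging Face Hub as `BSC-LT/RoBERTa-ca` (the repository linked in the comparison table above), and the Catalan example sentence is illustrative only.

```python
from transformers import pipeline

# Load the fill-mask pipeline; "BSC-LT/RoBERTa-ca" is assumed to be the Hub repository for this model.
unmasker = pipeline("fill-mask", model="BSC-LT/RoBERTa-ca")

# RoBERTa-style tokenizers use "<mask>" as the mask token; the sentence below is a hypothetical example.
predictions = unmasker("La capital de Catalunya és <mask>.")

# Each prediction contains the completed sequence, the predicted token, and its probability score.
for pred in predictions:
    print(f"{pred['sequence']}  (token: {pred['token_str']}, score: {pred['score']:.3f})")
```

For downstream CLUB-style tasks (NER, POS, classification, QA), the same checkpoint can instead be loaded with the corresponding `AutoModelFor...` classes and fine-tuned as a standard RoBERTa base model.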