PortBERT: Navigating the Depths of Portuguese Language Models

PortBERT is a family of RoBERTa-based language models pre-trained from scratch on the Portuguese portion of CulturaX (the deduplicated mC4 and OSCAR 23 corpora). The models are designed to offer strong downstream performance on Portuguese NLP tasks while providing insight into the cost-performance tradeoffs of training across hardware backends.

We release two variants:

  • PortBERT-base: 126M parameters, trained on 8× A40 GPUs (fp32)
  • PortBERT-large: 357M parameters, trained on a TPUv4-128 pod (fp32)
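
For quick experimentation, the models can be loaded with Hugging Face Transformers. The snippet below is a minimal sketch that assumes the base checkpoint is published under the repository id `PortBERT/PortBERT_base`; the repository id and the example sentence are illustrative.

```python
# Minimal sketch: masked-token prediction with PortBERT-base.
# The repository id "PortBERT/PortBERT_base" is an assumption; adjust to the published name.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="PortBERT/PortBERT_base")

# RoBERTa-style models use "<mask>" as the mask token.
for prediction in fill_mask("Lisboa é a <mask> de Portugal."):
    print(f"{prediction['token_str']:>12}  {prediction['score']:.3f}")
```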

Model Details

| Detail | PortBERT-base | PortBERT-large |
|---|---|---|
| Architecture | RoBERTa-base | RoBERTa-large |
| Parameters | ~126M | ~357M |
| Tokenizer | GPT-2 style (52k vocab) | Same |
| Pretraining corpus | Deduplicated mC4 and OSCAR 23 from CulturaX | Same |
| Objective | Masked Language Modeling | Same |
| Training time | ~27 days on 8× A40 | ~6.2 days on TPUv4-128 pod |
| Precision | fp32 | fp32 |
| Framework | fairseq | fairseq |

Downstream Evaluation (ExtraGLUE)

We evaluate PortBERT on ExtraGLUE, a Portuguese adaptation of the GLUE benchmark. Fine-tuning was conducted using HuggingFace Transformers, with NNI-based grid search over batch size and learning rate (28 configurations per task). Each task was fine-tuned for up to 10 epochs. Metrics were computed on validation sets due to the lack of held-out test sets.
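
The evaluation scripts are not reproduced here; the sketch below only illustrates the kind of fine-tuning loop described above. The dataset id/config, column names, and the reduced 2×2 hyperparameter grid are assumptions, and NNI's search logic is replaced by a plain Python loop.

```python
# Illustrative fine-tuning sketch, not the exact evaluation scripts.
# The original setup searched 28 (batch size, learning rate) configurations with NNI.
import itertools
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

model_id = "PortBERT/PortBERT_base"                      # assumed repository id
dataset = load_dataset("PORTULAN/extraglue", "rte_pt")   # assumed dataset id/config
tokenizer = AutoTokenizer.from_pretrained(model_id)

def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

encoded = dataset.map(tokenize, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

results = {}
for lr, bs in itertools.product([1e-5, 3e-5], [16, 32]):  # reduced grid for illustration
    model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
    args = TrainingArguments(output_dir=f"out/lr{lr}_bs{bs}", learning_rate=lr,
                             per_device_train_batch_size=bs, num_train_epochs=10,
                             save_strategy="no", report_to="none")
    trainer = Trainer(model=model, args=args,
                      train_dataset=encoded["train"], eval_dataset=encoded["validation"],
                      data_collator=DataCollatorWithPadding(tokenizer),
                      compute_metrics=accuracy)
    trainer.train()
    results[(lr, bs)] = trainer.evaluate()["eval_accuracy"]

print("best (learning rate, batch size):", max(results, key=results.get))
```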

The AVG column is the unweighted mean of the following six metrics (recomputed in the sketch after this list):

  • STSB Spearman
  • STSB Pearson
  • RTE Accuracy
  • WNLI Accuracy
  • MRPC Accuracy
  • MRPC F1
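
For reference, the AVG column can be recomputed directly from these six metrics. The small sketch below does so for PortBERT-base using the values from the results table; other rows may differ by ±0.01 since the table entries are themselves rounded.

```python
# Recompute the AVG column for PortBERT-base from the six metrics listed above
# (values copied from the results table below).
portbert_base = {
    "STSB Spearman": 87.39,
    "STSB Pearson": 87.65,
    "RTE Accuracy": 68.95,
    "WNLI Accuracy": 60.56,
    "MRPC Accuracy": 87.75,
    "MRPC F1": 91.13,
}
avg = sum(portbert_base.values()) / len(portbert_base)
print(f"AVG = {avg:.2f}")  # 80.57, matching the results table
```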

🧪 Evaluation Results

Legend: Bold = best, italic = second-best per model size.

| Model | STSB_Sp | STSB_Pe | STSB_Mean | RTE_Acc | WNLI_Acc | MRPC_Acc | MRPC_F1 | AVG |
|---|---|---|---|---|---|---|---|---|
| **Large models** | | | | | | | | |
| XLM-RoBERTa_large | **90.00** | **90.27** | **90.14** | **82.31** | 57.75 | *90.44* | *93.31* | **84.01** |
| EuroBERT-610m | 88.46 | 88.59 | 88.52 | *78.34* | *59.15* | **91.91** | **94.20** | *83.44* |
| PortBERT_large | 88.53 | 88.68 | 88.60 | 72.56 | **61.97** | 89.46 | 92.39 | 82.26 |
| BERTimbau_large | *89.40* | *89.61* | *89.50* | 75.45 | *59.15* | 88.24 | 91.55 | 82.23 |
| **Base models** | | | | | | | | |
| RoBERTaLexPT_base | 86.68 | 86.86 | 86.77 | 69.31 | *59.15* | **89.46** | **92.34** | **80.63** |
| PortBERT_base | *87.39* | *87.65* | *87.52* | 68.95 | **60.56** | 87.75 | 91.13 | *80.57* |
| RoBERTaCrawlPT_base | 87.34 | 87.45 | 87.39 | **72.56** | 56.34 | *87.99* | 91.20 | 80.48 |
| BERTimbau_base | **88.39** | **88.60** | **88.50** | *70.40* | 56.34 | 87.25 | 90.97 | 80.32 |
| XLM-RoBERTa_base | 85.75 | 86.09 | 85.92 | 68.23 | **60.56** | 87.75 | *91.32* | 79.95 |
| EuroBERT-210m | 86.54 | 86.62 | 86.58 | 65.70 | 57.75 | 87.25 | 91.00 | 79.14 |
| AlBERTina 100M PTPT | 86.52 | 86.51 | 86.52 | 70.04 | 56.34 | 85.05 | 89.57 | 79.01 |
| AlBERTina 100M PTBR | 85.97 | 85.99 | 85.98 | 68.59 | 56.34 | 85.78 | 89.82 | 78.75 |
| AiBERTa | 83.56 | 83.73 | 83.65 | 64.98 | 56.34 | 82.11 | 86.99 | 76.29 |
| roBERTa PT | 48.06 | 48.51 | 48.29 | 56.68 | *59.15* | 72.06 | 81.79 | 61.04 |

Fairseq Checkpoint

Get the fairseq checkpoint here.
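
If you prefer the original training framework, the checkpoint can be loaded through fairseq's RoBERTa hub interface. The sketch below makes assumptions about the checkpoint layout (the directory path, `model.pt`, and the bundled dictionary/BPE files are placeholders), so treat it as a starting point rather than a verified recipe.

```python
# Sketch: loading the original checkpoint with fairseq (paths and filenames are assumptions).
from fairseq.models.roberta import RobertaModel

portbert = RobertaModel.from_pretrained(
    "/path/to/portbert_base_fairseq",   # directory assumed to contain model.pt, dict.txt, BPE files
    checkpoint_file="model.pt",
)
portbert.eval()  # disable dropout for deterministic features

tokens = portbert.encode("Lisboa é a capital de Portugal.")
features = portbert.extract_features(tokens)  # shape: (1, sequence length, hidden size)
print(features.shape)
```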

📜 License

MIT License
