Model Card for RNA-BERTa

Model Description

RNA-BERTa is a lightweight BERT-style model pretrained following the RoBERTa approach.
It features a context window of 512 tokens and an embedding dimension of 512 (compared to the standard 768), resulting in approximately 55.56 million parameters. This design aligns with the 1:20 parameter-to-token compute-optimal ratio suggested by Hoffmann et al. (2022). The model was pretrained on a masked language modeling (MLM) task using 9,757,119 RNA sequences sourced from RNACentral and NCBI, totaling 1.07 billion training tokens.

RNA-BERTa can be fine-tuned or used to generate embeddings from RNA sequences for a variety of downstream applications.
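
The configuration reported above can be checked directly from the released checkpoint; this short sketch only inspects the stored config and counts parameters (the exact totals depend on whether the MLM head and position embeddings are included in the count):

from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("IlPakoZ/RNA-BERTa9700")

# Hidden size and maximum positions from the stored config
# (RoBERTa-style checkpoints usually report max_position_embeddings = context length + 2).
print(model.config.hidden_size, model.config.max_position_embeddings)

# Total parameter count, MLM head included
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters")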

  • Developed by: Pasquale Lobascio
  • Shared by: IlPakoZ
  • Model type: RoBERTa-based Transformer with MLM head
  • License: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

Direct Use

RNA-BERTa can be used to generate embeddings from RNA sequences, which can be applied to various downstream biological sequence analysis tasks.
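
A minimal embedding-extraction sketch; the mean pooling over the last hidden states and the raw single-sequence input are illustrative choices, not a prescribed recipe.

import torch
from transformers import RobertaModel, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("IlPakoZ/RNA-BERTa9700")
encoder = RobertaModel.from_pretrained("IlPakoZ/RNA-BERTa9700")
encoder.eval()

# Illustrative RNA sequence; real inputs should stay within the 512-token context window
sequence = "AUGGCUACGUAGCUAGCGGAU"
inputs = tokenizer(sequence, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state        # (1, seq_len, 512)

# Mask-aware mean pooling into a single 512-dimensional embedding
mask = inputs["attention_mask"].unsqueeze(-1)
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)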

Downstream Use

The model can be fine-tuned for a wide range of RNA-related tasks, such as classification, motif detection, or other predictive modeling involving RNA sequences.
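
As an illustration, the encoder can be loaded under a sequence-classification head and fine-tuned with standard Hugging Face tooling; the two-label setup below is a placeholder, and the classification head is newly initialized rather than part of the released checkpoint.

from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("IlPakoZ/RNA-BERTa9700")

# The classification head is randomly initialized and must be trained on labeled RNA data.
model = RobertaForSequenceClassification.from_pretrained(
    "IlPakoZ/RNA-BERTa9700", num_labels=2
)
# ...tokenize a labeled dataset and train, e.g. with transformers.Trainer.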

Out-of-Scope Use

This model is not intended for RNA sequences longer than 512 tokens, as its context length is limited. It may not perform well on tasks unrelated to RNA sequence embeddings or masked language modeling.

Bias, Risks, and Limitations

  • The model’s context length is limited to 512 tokens, which restricts its use on longer RNA sequences.
  • As the model was pretrained on specific datasets, it may have biases related to sequence representation or coverage.
  • Potential biases or limitations inherent in the training data (RNACentral and NCBI) may affect downstream tasks.
  • This is a domain-specific model; it may not generalize outside RNA sequence analysis.

How to Get Started with the Model

from transformers import RobertaForMaskedLM, RobertaTokenizerFast, RobertaModel

# Load with MLM head
model = RobertaForMaskedLM.from_pretrained("IlPakoZ/RNA-BERTa9700")
tokenizer = RobertaTokenizerFast.from_pretrained("IlPakoZ/RNA-BERTa9700")

# Alternatively, load only the encoder for downstream tasks
encoder = RobertaModel.from_pretrained("IlPakoZ/RNA-BERTa9700")
tokenizer = RobertaTokenizerFast.from_pretrained("IlPakoZ/RNA-BERTa9700")
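
A small usage sketch of the MLM head through the fill-mask pipeline; the example sequence is illustrative, and how a masked position maps to nucleotides depends on the model's tokenizer vocabulary.

from transformers import pipeline

fill = pipeline("fill-mask", model="IlPakoZ/RNA-BERTa9700")
masked = f"AUGGCUACG{fill.tokenizer.mask_token}AGCUAGCGG"
print(fill(masked)[:3])   # top predictions for the masked span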

Training Data

The pretraining dataset comprises 10,841,246 RNA sequences collected from RNACentral and NCBI, split into a training set of 9,757,119 sequences and a validation set of 1,084,127 sequences, for a total of ~1.22 billion tokens (~1.07 billion of which are training tokens).
The data includes the following RNA types:

  • 7,979,027 ribosomal RNA sequences, collected from the SILVA database;
  • 492,955 pre- and mature miRNA sequences, collected from the RNACentral database;
  • 3,137 repeat sequences, collected from the RNACentral database;
  • 90,467 riboswitch sequences, collected from the RNACentral database;
  • 29,581 ribozyme sequences, collected from the RNACentral database;
  • 2,246,079 viral sequences, collected from the NCBI Virus database.

Only sequences up to 2,000 nucleotides long were selected.
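
A minimal sketch of that length filter, assuming FASTA input and Biopython; the file names are placeholders.

from Bio import SeqIO

# Keep only sequences of at most 2,000 nucleotides (paths are hypothetical).
kept = (rec for rec in SeqIO.parse("rna_raw.fasta", "fasta") if len(rec.seq) <= 2000)
SeqIO.write(kept, "rna_filtered.fasta", "fasta")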

Training Procedure

The pretraining of RNA-BERTa consisted of three main steps:

  1. Hyperparameter Optimization (HO):
    We performed 32 trials of HO using Optuna with a Tree-structured Parzen Estimator (TPE) sampler. Due to the high cost of tuning large transformer models, we leveraged μParametrization (μP; Yang et al., 2022) to optimize the learning rate and warm-up steps on a smaller version of RNA-BERTa, which allowed us to transfer the hyperparameters to the full model efficiently. Because μP only transfers non-regularization hyperparameters, weight decay was fixed at 0.01.
    To avoid the instability reported for 16-bit floating point values during HO with μP (Blake et al., 2024), we used full precision. Training during HO was limited to 5,000 steps per model (instead of the full 38,000 steps), which saved considerable compute while maintaining precise hyperparameter transfer for this post-layer-normalization transformer architecture.
    To cover slight variations in the optimal learning rate, three final models were trained using learning rates scaled by factors of 1, 0.875, and 0.75 from the HO optimum. Overall, this approach improved efficiency by approximately 4.5× compared to full-scale HO. HO was conducted on 4 NVIDIA V100 GPUs with 150 GB of RAM and completed in about two days. A minimal Optuna sketch of the search loop is shown after this list.

  2. Pretraining Schedule:
    Throughout pretraining, we used a cosine learning rate schedule with warm-up and an approximate 10× decay over one full training epoch, following the recommendations of Rae et al. (2021) and the Chinchilla scaling laws (Hoffmann et al., 2022). A scheduler sketch implementing this schedule is shown after this list.

  3. Implementation Details:
    The μP implementations of AdamW and RoBERTa from the original μP authors were employed. An attention multiplier of √32 was used to enable smooth parameter transfer during subsequent fine-tuning.
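
A minimal sketch of the search loop from step 1, using Optuna's TPE sampler over learning rate and warm-up steps; the search ranges and the proxy-training helper are assumptions, not the exact values or code used for RNA-BERTa.

import optuna

def train_proxy_and_eval(learning_rate, warmup_steps):
    # Hypothetical helper: train the small muP proxy model for 5,000 steps with the
    # given hyperparameters and return its validation MLM loss.
    return 0.0  # placeholder so the sketch runs end to end

def objective(trial):
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    warmup_steps = trial.suggest_int("warmup_steps", 100, 4000)
    return train_proxy_and_eval(learning_rate, warmup_steps)

study = optuna.create_study(
    direction="minimize", sampler=optuna.samplers.TPESampler(seed=42)
)
study.optimize(objective, n_trials=32)
print(study.best_params)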

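A sketch of the warm-up + cosine schedule from step 2, decaying the learning rate to roughly one tenth of its peak over the full epoch; the warm-up length, peak learning rate, and dummy parameters are placeholders.

import math
import torch

TOTAL_STEPS = 38_000   # one full pretraining epoch
WARMUP_STEPS = 2_000   # placeholder; the actual value came from the HO stage
MIN_RATIO = 0.1        # ~10x decay from the peak learning rate

def lr_lambda(step):
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)                  # linear warm-up
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return MIN_RATIO + (1.0 - MIN_RATIO) * cosine           # cosine decay to 0.1x

# Dummy parameter and peak LR, just to make the sketch self-contained
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=6e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)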

Citation

BibTeX:

@article{10.48550/arXiv.2203.15556,
  title={Training compute-optimal large language models},
  author={Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and Casas, Diego de Las and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and others},
  journal={arXiv preprint arXiv:2203.15556},
  year={2022}
}

@article{10.1093/nar/gkaa921,
  title={RNAcentral 2021: secondary structure integration, improved sequence search and new member databases},
  journal={Nucleic acids research},
  volume={49},
  number={D1},
  pages={D212--D220},
  year={2021},
  publisher={Oxford University Press}
}

@article{10.1093/nar/gkae979,
  title={Database resources of the National Center for Biotechnology Information in 2025},
  author={Sayers, Eric W and Beck, Jeffrey and Bolton, Evan E and Brister, J Rodney and Chan, Jessica and Connor, Ryan and Feldgarden, Michael and Fine, Anna M and Funk, Kathryn and Hoffman, Jinna and others},
  journal={Nucleic acids research},
  volume={53},
  number={D1},
  pages={D20--D29},
  year={2025},
  publisher={Oxford University Press}
}

@article{10.48550/arXiv.2203.03466,
  title={Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer},
  author={Yang, Greg and Hu, Edward J and Babuschkin, Igor and Sidor, Szymon and Liu, Xiaodong and Farhi, David and Ryder, Nick and Pachocki, Jakub and Chen, Weizhu and Gao, Jianfeng},
  journal={arXiv preprint arXiv:2203.03466},
  year={2022}
}

@article{10.48550/arXiv.2112.11446,
  title={Scaling language models: Methods, analysis \& insights from training gopher},
  author={Rae, Jack W and Borgeaud, Sebastian and Cai, Trevor and Millican, Katie and Hoffmann, Jordan and Song, Francis and Aslanides, John and Henderson, Sarah and Ring, Roman and Young, Susannah and others},
  journal={arXiv preprint arXiv:2112.11446},
  year={2021}
}

@article{10.48550/arXiv.2407.17465,
  title={u-$\mu$P: The Unit-Scaled Maximal Update Parametrization},
  author={Blake, Charlie and Eichenberg, Constantin and Dean, Josef and Balles, Lukas and Prince, Luke Y and Deiseroth, Bj{\"o}rn and Cruz-Salinas, Andres Felipe and Luschi, Carlo and Weinbach, Samuel and Orr, Douglas},
  journal={arXiv preprint arXiv:2407.17465},
  year={2024}
}