mRNAbert
mRNAbert is a transformer-based RNA language model trained on millions of transcriptomic sequences from the human genome. It is used as the foundation model for downstream fine-tuning tasks in the G4mer project, including rG4 structure prediction and variant effect analysis.
Model Details
- Architecture: BERT-base
- Tokenization: Overlapping 6-mers (see the example after this list)
- Pretraining data: Human transcriptome (GENCODE v40, hg38)
- Task: Masked language modeling (MLM)
- Input: RNA sequences (ACGT)
- Max length: 512 nt
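To make the overlapping 6-mer tokenization and MLM objective concrete, here is a minimal sketch of running the base model on a short sequence. It assumes the released Biociphers/mRNAbert checkpoint exposes its pretrained MLM head via AutoModelForMaskedLM; the toy sequence is invented for illustration.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Overlapping 6-mer tokenization: each position yields one whitespace-separated token
def to_kmers(seq, k=6):
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

seq = "GGGAGGGCGCGTGTGG"   # toy sequence, invented for illustration
print(to_kmers(seq))       # GGGAGG GGAGGG GAGGGC ...

tokenizer = AutoTokenizer.from_pretrained("Biociphers/mRNAbert")
model = AutoModelForMaskedLM.from_pretrained("Biociphers/mRNAbert")

inputs = tokenizer(to_kmers(seq), return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # per-position vocabulary logits from the MLM head
print(logits.shape)
```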
Disclaimer
This is the official implementation of the G4mer model as described in the manuscript:
Zhuang, Farica, et al. "G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." bioRxiv (2024).
See our Bitbucket repo for code, data, and tutorials.
G4mer Model Details
G4mer is a transformer-based model trained on transcriptome-wide RNA sequences to predict:
- Binary classification: whether a 70-nt sequence region forms an rG4 structure
- Subtype classification: which rG4 structural subtype the region forms
- Regression: the strength of rG4 formation
All models use overlapping 6-mer tokenization and are trained from scratch on the human transcriptome.
Variants
Model | Task | Size |
---|---|---|
Biociphers/g4mer | rG4 binary classification | ~46M |
Biociphers/g4mer-subtype | rG4 subtype classification | ~46M |
Biociphers/g4mer-regression | rG4 strength regression | ~46M |
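Each variant loads through the standard transformers API. Below is a minimal sketch for the binary classifier; it assumes the released checkpoints ship with their task heads attached, and the other variants load the same way by swapping the model name.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Overlapping 6-mer tokenization shared by all variants
def to_kmers(seq, k=6):
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

# Binary rG4 classifier; g4mer-subtype and g4mer-regression load the same way
tokenizer = AutoTokenizer.from_pretrained("Biociphers/g4mer")
model = AutoModelForSequenceClassification.from_pretrained("Biociphers/g4mer")
model.eval()

seq = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA"  # rG4 example from the Usage section
inputs = tokenizer(to_kmers(seq), return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print("P(rG4) =", probs[0, 1].item())
```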
Usage
Fine-tune
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW

# Example dataset
sequences = [
    "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA",  # rG4
    "TCTGGGAAAAGCTACTGTAAGTAGGAGCAGATTCTGGGTTTAATCGGAGG",  # non-rG4
]
labels = [1, 0]

# Tokenization with overlapping 6-mers
def to_kmers(seq, k=6):
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

tokenizer = AutoTokenizer.from_pretrained("Biociphers/mRNAbert")
tokenized = [
    tokenizer(to_kmers(seq), return_tensors="pt", padding="max_length",
              truncation=True, max_length=512)
    for seq in sequences
]

# Dataset class
class rG4Dataset(Dataset):
    def __init__(self, tokenized_inputs, labels):
        self.inputs = tokenized_inputs
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val.squeeze(0) for key, val in self.inputs[idx].items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

dataset = rG4Dataset(tokenized, labels)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

# Load the base model with a fresh classification head
model = AutoModelForSequenceClassification.from_pretrained("Biociphers/mRNAbert", num_labels=2)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Training loop (1 epoch for demo)
model.train()
for batch in loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print("Loss:", loss.item())
```
Web Tool
The mRNAbert model was fine-tuned to create G4mer, a state-of-the-art model for predicting RNA G-quadruplexes and their subtypes. You can explore G4mer predictions interactively through our web tool. Features include:
- RNA sequence prediction (binary rG4-forming vs. non-forming)
- Transcriptome-wide prediction of rG4s and subtypes
- Variant effect annotation using gnomAD SNVs
- Search and filter by gene, transcript, region (5′UTR, CDS, 3′UTR), and sequence context
No installation needed — just visit and start exploring.
Citation - MLA
Zhuang, Farica, et al. "G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." bioRxiv (2024): 2024-10.
Contact
For questions, feedback, or discussions about G4mer, please post on the Biociphers Google Group.