mRNAbert

mRNAbert is a transformer-based RNA language model trained on millions of transcriptomic sequences from the human genome. It is used as the foundation model for downstream fine-tuning tasks in the G4mer project, including rG4 structure prediction and variant effect analysis.

Model Details

  • Architecture: BERT-base
  • Tokenization: Overlapping 6-mers
  • Pretraining data: Human transcriptome (GENCODE v40, hg38)
  • Task: Masked language modeling (MLM)
  • Input: RNA sequences in the DNA alphabet (ACGT; T in place of U, as in the examples below)
  • Max length: 512 nt
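
Because mRNAbert is pretrained with masked language modeling, the base model can be queried directly for masked 6-mer predictions. A minimal sketch, assuming the standard Hugging Face MLM head ships with the checkpoint (the sequence and mask position here are illustrative):

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("Biociphers/mRNAbert")
model = AutoModelForMaskedLM.from_pretrained("Biociphers/mRNAbert")
model.eval()

# Build overlapping 6-mers, then mask one of them
seq = "GGGAGGGCGCGTGTGG"
kmers = [seq[i:i+6] for i in range(len(seq) - 5)]
kmers[3] = tokenizer.mask_token
inputs = tokenizer(" ".join(kmers), return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 candidate 6-mers at the masked position
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
top5 = logits[0, mask_pos].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top5.tolist()))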

Disclaimer

This is the official implementation of the G4mer model as described in the manuscript:

Zhuang, Farica, et al. "G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." bioRxiv (2024).

See our Bitbucket repo for code, data, and tutorials.

G4mer Model Details

G4mer is a transformer-based model trained on transcriptome-wide RNA sequences to predict:

  • Binary classification: whether a 70-nt sequence region forms an rG4 structure
  • Subtype classification: the structural subtype of an rG4-forming region (g4mer-subtype)
  • Regression: rG4 strength (g4mer-regression)

All models use overlapping 6-mer tokenization and are trained from scratch on the human transcriptome; a minimal loading sketch follows the variant table below.

Variants

Model                         Task                         Size
Biociphers/g4mer              rG4 binary classification    ~46M
Biociphers/g4mer-subtype      rG4 subtype classification   ~46M
Biociphers/g4mer-regression   rG4 strength regression      ~46M
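
These checkpoints load like any Hugging Face sequence-classification model. A minimal inference sketch for the binary variant, reusing the to_kmers helper from the Usage section (the label order, index 1 = rG4-forming, mirrors the fine-tuning example below and is an assumption about the released classification head):

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("Biociphers/g4mer")
model = AutoModelForSequenceClassification.from_pretrained("Biociphers/g4mer")
model.eval()

def to_kmers(seq, k=6):
    # Space-separated overlapping k-mers, as the tokenizer expects
    return " ".join(seq[i:i+k] for i in range(len(seq) - k + 1))

seq = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA"
inputs = tokenizer(to_kmers(seq), return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print(probs)  # assumed order: [P(non-rG4), P(rG4)]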

Usage

Fine-tune

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW
import torch.nn.functional as F

# Example dataset
sequences = ["GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA",  # rG4
             "TCTGGGAAAAGCTACTGTAAGTAGGAGCAGATTCTGGGTTTAATCGGAGG"]  # non-rG4
labels = [1, 0]

# Tokenization with 6-mers
def to_kmers(seq, k=6):
    return ' '.join([seq[i:i+k] for i in range(len(seq)-k+1)])

tokenizer = AutoTokenizer.from_pretrained("Biociphers/mRNAbert")
tokenized = [tokenizer(to_kmers(seq), return_tensors='pt', padding='max_length', truncation=True, max_length=512) for seq in sequences]

# Dataset class
class rG4Dataset(Dataset):
    def __init__(self, tokenized_inputs, labels):
        self.inputs = tokenized_inputs
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val.squeeze(0) for key, val in self.inputs[idx].items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

dataset = rG4Dataset(tokenized, labels)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

# Load base model for classification
model = AutoModelForSequenceClassification.from_pretrained("Biociphers/mRNAbert", num_labels=2)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Training loop (1 epoch for demo)
model.train()
for batch in loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print("Loss:", loss.item())

Web Tool

The mRNAbert model was fine-tuned to create G4mer, a state-of-the-art model for predicting RNA G-quadruplexes and their subtypes.

You can explore G4mer predictions interactively through our web tool:

G4mer Web Tool

Features include:

  • RNA sequence prediction (binary rG4-forming vs. non-forming)
  • Transcriptome-wide prediction of rG4s and subtypes
  • Variant effect annotation using gnomAD SNVs
  • Search and filter by gene, transcript, region (5′UTR, CDS, 3′UTR), and sequence context

No installation needed — just visit and start exploring.

Citation - MLA

Zhuang, Farica, et al. "G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." bioRxiv (2024): 2024-10.

Contact

For questions, feedback, or discussions about G4mer, please post on the Biociphers Google Group.
