G4mer
G4mer is a transformer-based RNA foundation model trained to identify RNA G-quadruplexes (rG4s) from sequence input, fine-tuned from mRNAbert (Biociphers/mRNAbert).
Disclaimer
This is the official implementation of the G4mer model as described in the manuscript:
Zhuang, Farica, et al. G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data. bioRxiv (2024).
See our Bitbucket repo for code, data, and tutorials.
Model Details
G4mer is a transformer-based model trained on transcriptome-wide RNA sequences to predict:
- Binary classification: whether a 70-nt sequence region forms an rG4 structure
All models use overlapping 6-mer tokenization and are trained from scratch on the human transcriptome.
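As a minimal sketch of what overlapping 6-mer tokenization means in practice, the snippet below applies the same `to_kmers` helper used in the Usage section to a short fragment. The 10-nt `example` string is a hypothetical input chosen only for illustration. The counts shown are for the k-mer string alone, before any special tokens the tokenizer may add.

```python
def to_kmers(seq, k=6):
    """Split seq into overlapping k-mers (stride 1), joined by spaces."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


example = "GGGAGGGCGC"  # hypothetical 10-nt fragment, for illustration only
kmers = to_kmers(example).split()
print(kmers)  # 10 - 6 + 1 = 5 overlapping 6-mers

# A full 70-nt window yields 70 - 6 + 1 = 65 6-mers.
print(len(to_kmers("G" * 70).split()))
```

Because adjacent 6-mers share five nucleotides, a sequence of length n becomes n - 5 tokens rather than roughly n / 6.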
Variants
Model | Task | Size |
---|---|---|
Biociphers/g4mer | rG4 binary class | ~46M |
Biociphers/g4mer-subtype | rG4 subtype class | ~46M |
Biociphers/g4mer-regression | rG4 strength | ~46M |
Usage
Binary rG4 Prediction
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Biociphers/g4mer")
model = AutoModelForSequenceClassification.from_pretrained("Biociphers/g4mer")
model.eval()  # inference mode (disables dropout)

sequence = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA"  # max length: 70nt window

# Split a sequence into overlapping k-mers separated by spaces
def to_kmers(seq, k=6):
    return ' '.join([seq[i:i+k] for i in range(len(seq) - k + 1)])

sequence = to_kmers(sequence, k=6)  # Convert to 6-mers
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():  # no gradients needed for inference
    outputs = model(**inputs)
logits = outputs.logits
rG4_probability = torch.softmax(logits, dim=1)[:, 1].item()  # class 1 = rG4-forming
print(rG4_probability)
G4mer was trained on a maximum of 70nt per sequence. For sequences longer than 70nt, we recommend scanning the input sequence with a sliding window of 70nt and taking the maximum rG4 score across all windows.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Biociphers/g4mer")
model = AutoModelForSequenceClassification.from_pretrained("Biociphers/g4mer")
model.eval()

# Define k-mer function
def to_kmers(seq, k=6):
    return ' '.join([seq[i:i+k] for i in range(len(seq) - k + 1)])

# Define a long sequence (must contain only A/C/G/T)
sequence = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA" * 2  # ~100nt

# Slide 70nt window with stride 1
window_size = 70
stride = 1
windows = [sequence[i:i+window_size] for i in range(0, len(sequence) - window_size + 1, stride)]

# Score each window using G4mer
scores = []
for w in windows:
    kmer_seq = to_kmers(w, k=6)
    tokens = tokenizer(kmer_seq, return_tensors="pt")
    with torch.no_grad():
        output = model(**tokens)
    prob = torch.nn.functional.softmax(output.logits, dim=-1)
    scores.append(prob[0][1].item())  # class 1 = rG4-forming

# Final rG4 score for the long sequence
max_score = max(scores)
print(f"Max rG4 score across windows: {max_score:.3f}")
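The sliding-window script above assumes the input is at least 70 nt long and already uses the A/C/G/T alphabet. If the sequence is shorter, the window list is empty and `max(scores)` raises an error. The sketch below shows one possible way to prepare inputs more defensively. The helper names `normalize` and `make_windows` are hypothetical and not part of G4mer. Mapping U to T and passing a short sequence through as a single window are assumptions made here for convenience, not documented behavior.

```python
def normalize(seq):
    """Uppercase, map U to T (assumption), and reject other characters."""
    seq = seq.strip().upper().replace("U", "T")
    bad = set(seq) - set("ACGT")
    if bad:
        raise ValueError(f"unsupported characters: {sorted(bad)}")
    return seq


def make_windows(seq, window_size=70, stride=1):
    """Return (start, window) pairs; a short sequence yields one window."""
    if not seq:
        raise ValueError("empty sequence")
    if len(seq) <= window_size:
        return [(0, seq)]
    last_start = len(seq) - window_size
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the 3' end is covered
    return [(s, seq[s:s + window_size]) for s in starts]


windows = make_windows(normalize("gggagggcgc" * 10))  # 100-nt input
print(len(windows))  # 100 - 70 + 1 = 31 windows at stride 1
```

Keeping the start offset alongside each window makes it possible to report where the highest-scoring window sits in the original sequence, not only its score. A larger `stride` reduces the number of model calls at the cost of coarser coverage.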
Web Tool
You can explore G4mer predictions interactively through our web tool:
Features include:
- RNA sequence prediction: runs G4mer on GPU to compute the probability of rG4 formation
- Transcriptome-wide prediction of rG4s and subtypes
- Variant effect annotation using gnomAD SNVs
- Search and filter by gene, transcript, region (5′UTR, CDS, 3′UTR), and sequence context
No installation needed — just visit and start exploring.
Citation - MLA
Zhuang, Farica, et al. "G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." bioRxiv (2024): 2024-10.
Contact
For questions, feedback, or discussions about G4mer, please post on the Biociphers Google Group.