G4mer Subtype
G4mer-Subtype is a transformer-based RNA language model that predicts RNA G-quadruplex (rG4) subtypes from sequence input. It is fine-tuned from Biociphers/mRNAbert and trained on 70-nt sequences labeled with experimentally derived rG4 subtype categories.
Disclaimer
This is the official subtype classification model from the G4mer framework as described in the manuscript:
Zhuang, Farica, et al. G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data. bioRxiv (2024).
See our Bitbucket repo for code, data, and tutorials.
Model Details
G4mer-Subtype is trained to classify each 70-nt RNA sequence into one of eight rG4 subtypes, each representing a distinct sequence/structure motif observed in experimental rG4 data.
Subtype Mapping
Class Index | Subtype Description |
---|---|
0 | G≥40% |
1 | Unknown |
2 | Bulges |
3 | Canonical |
4 | Long loop |
5 | Potential G-quadruplex & G≥40% |
6 | Potential G-triplex & G≥40% |
7 | Two-quartet |
All models use overlapping 6-mer tokenization and were fine-tuned on human transcriptome-derived sequences with subtype labels.
Variants
Model | Task | Size |
---|---|---|
Biociphers/g4mer | rG4 binary class | ~46M |
Biociphers/g4mer-subtype | rG4 subtype class | ~46M |
Biociphers/g4mer-regression | rG4 strength (score) | ~46M |
Usage
Predict rG4 Subtypes
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load binary rG4 model and tokenizer
binary_tokenizer = AutoTokenizer.from_pretrained("biociphers/g4mer")
binary_model = AutoModelForSequenceClassification.from_pretrained("biociphers/g4mer")
binary_model.eval()
# Load subtype model and tokenizer
subtype_tokenizer = AutoTokenizer.from_pretrained("biociphers/g4mer-subtype")
subtype_model = AutoModelForSequenceClassification.from_pretrained("biociphers/g4mer-subtype")
subtype_model.eval()
# Input sequence (max 70 nt)
sequence = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA"
# Convert to space-separated 6-mers
def to_kmers(seq, k=6):
    return ' '.join([seq[i:i+k] for i in range(len(seq) - k + 1)])
kmer_sequence = to_kmers(sequence)
# Predict rG4 binary score
binary_inputs = binary_tokenizer(kmer_sequence, return_tensors="pt")
with torch.no_grad():
    binary_output = binary_model(**binary_inputs)
    rG4_prob = torch.nn.functional.softmax(binary_output.logits, dim=-1)[0][1].item()
# Classify the subtype only when the binary model is moderately confident (threshold 0.7)
if rG4_prob > 0.7:
    subtype_inputs = subtype_tokenizer(kmer_sequence, return_tensors="pt")
    with torch.no_grad():
        subtype_output = subtype_model(**subtype_inputs)
    subtype_probs = torch.nn.functional.softmax(subtype_output.logits, dim=-1)
    predicted_class = torch.argmax(subtype_probs, dim=-1).item()
    subtype_mapping = {
        0: "G≥40%",
        1: "Unknown",
        2: "Bulges",
        3: "Canonical",
        4: "Long loop",
        5: "Potential G-quadruplex & G≥40%",
        6: "Potential G-triplex & G≥40%",
        7: "Two-quartet"
    }
    print(f"Predicted subtype: {subtype_mapping[predicted_class]}")
else:
    print(f"Not a confident rG4 (score = {rG4_prob:.2f}); skipping subtype classification.")
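The example above reports only the single most likely subtype. When several subtypes score similarly, it can help to look at the full distribution. The following is a minimal sketch of one way to rank subtypes from raw class logits; the helper name `rank_subtypes`, the `top_k` parameter, and the example logits are hypothetical and not part of the G4mer code base.

```python
import math

# Same class-index mapping as in the table above.
SUBTYPE_MAPPING = {
    0: "G≥40%",
    1: "Unknown",
    2: "Bulges",
    3: "Canonical",
    4: "Long loop",
    5: "Potential G-quadruplex & G≥40%",
    6: "Potential G-triplex & G≥40%",
    7: "Two-quartet",
}

def rank_subtypes(logits, top_k=3):
    """Return the top_k (label, probability) pairs from a list of class logits."""
    # Numerically stable softmax: subtract the largest logit before exponentiating.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class indices by probability, highest first.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(SUBTYPE_MAPPING[idx], prob) for idx, prob in ranked[:top_k]]

# Made-up logits for illustration only; these are not real model output.
example_logits = [0.1, -1.2, 0.3, 2.5, 0.0, 1.1, -0.4, 0.2]
for label, prob in rank_subtypes(example_logits):
    print(f"{label}: {prob:.3f}")
```

With the model loaded as in the usage example, the logits could be passed as `subtype_output.logits[0].tolist()`.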
Training data
The model was trained on experimentally validated rG4 regions annotated with subtype labels based on loop lengths, bulges, guanine content, and overall folding potential. Each 70-nt training window was associated with one of the eight subtype labels shown above.
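Because training used 70-nt windows, longer inputs need to be cut into windows of at most that length before prediction. Below is a minimal sketch of one way to do this with a sliding window. The function name `sliding_windows` and the stride of 10 nt are assumptions chosen for illustration, not settings taken from the G4mer pipeline.

```python
def sliding_windows(seq, window=70, stride=10):
    """Yield (start, subsequence) pairs that cover seq with fixed-length windows."""
    # Short inputs are returned unchanged as a single window.
    if len(seq) <= window:
        yield 0, seq
        return
    last_start = len(seq) - window
    for start in range(0, last_start + 1, stride):
        yield start, seq[start:start + window]
    # Add a final window so the 3' end is covered when the stride does not land on it.
    if last_start % stride != 0:
        yield last_start, seq[last_start:]

# Toy input built by repeating the example sequence from the usage section.
transcript = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA" * 2
for start, window_seq in sliding_windows(transcript):
    print(start, len(window_seq))
```

Each window can then be converted to 6-mers and scored as in the usage example; how per-window scores are combined into a call for the full transcript is left open here.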
Intended use
G4mer-Subtype is intended for researchers studying:
- RNA G-quadruplex structural diversity
- Subtype-specific regulatory roles in the transcriptome
- Effects of sequence variation on rG4 formation patterns
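For the sequence-variation use case, one simple approach is to score a reference window and the same window carrying a single-nucleotide change, then compare the two scores. The sketch below illustrates that idea only; `apply_snv`, `delta_score`, and the stand-in scorer `g_fraction` are hypothetical names, and this is not presented as the variant-effect procedure used in the manuscript.

```python
def apply_snv(seq, pos, ref, alt):
    """Return seq with the base at 0-based index pos changed from ref to alt."""
    # Guard against coordinate or strand mix-ups before substituting.
    if seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at index {pos}, found {seq[pos]!r}")
    return seq[:pos] + alt + seq[pos + 1:]

def delta_score(score_fn, seq, pos, ref, alt):
    """Score of the alternate sequence minus the score of the reference sequence."""
    return score_fn(apply_snv(seq, pos, ref, alt)) - score_fn(seq)

def g_fraction(seq):
    """Stand-in scorer (guanine fraction), used here only so the sketch runs offline."""
    return seq.count("G") / len(seq)

reference = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA"
# Change the G at index 1 to A, disrupting the first G-run.
print(delta_score(g_fraction, reference, 1, "G", "A"))
```

In practice `score_fn` would be replaced by a function that returns the rG4 probability from the binary model, or the probability of a given subtype.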
Web Tool
You can explore G4mer predictions interactively through our web tool:
Features include:
- RNA sequence prediction: runs G4mer on GPU to compute the probability of rG4 formation
- Transcriptome-wide prediction of rG4s and subtypes
- Variant effect annotation using gnomAD SNVs
- Search and filter by gene, transcript, region (5′UTR, CDS, 3′UTR), and sequence context
No installation needed — just visit and start exploring.
Citation - MLA
Zhuang, Farica, et al. "G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." bioRxiv (2024): 2024-10.
Contact
For questions, feedback, or discussions about G4mer, please post on the Biociphers Google Group.