G4mer-Subtype

G4mer-Subtype is a transformer-based RNA language model that predicts RNA G-quadruplex (rG4) subtypes from sequence input. It is fine-tuned from Biociphers/mRNAbert and trained on 70-nt sequences labeled with experimentally derived rG4 subtype categories.

Disclaimer

This is the official subtype classification model from the G4mer framework as described in the manuscript:

Zhuang, Farica, et al. G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data. bioRxiv (2024).

See our Bitbucket repo for code, data, and tutorials.

Model Details

G4mer-Subtype is trained to classify each 70-nt RNA sequence into one of eight rG4 subtypes, each representing a distinct sequence/structure motif observed in experimental rG4 data.

Subtype Mapping

Class Index	Subtype Description
0	G≥40%
1	Unknown
2	Bulges
3	Canonical
4	Long loop
5	Potential G-quadruplex & G≥40%
6	Potential G-triplex & G≥40%
7	Two-quartet

All models use overlapping 6-mer tokenization and were fine-tuned on human transcriptome-derived sequences with subtype labels.
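
As a rough illustration of what overlapping 6-mer tokenization means for input length, the sketch below splits a sequence into 6-mers that slide one nucleotide at a time. The helper name `overlapping_kmers` is hypothetical and is not part of the G4mer code base. The count shown is the number of k-mers before any special tokens the tokenizer may add.

```python
def overlapping_kmers(seq, k=6):
    """Return the overlapping k-mers of seq, sliding one nucleotide at a time."""
    if len(seq) < k:
        return []  # too short to form a single k-mer
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# A 10-nt sequence yields 10 - 6 + 1 = 5 overlapping 6-mers.
print(overlapping_kmers("GGGAGGGCGC"))
# ['GGGAGG', 'GGAGGG', 'GAGGGC', 'AGGGCG', 'GGGCGC']

# A full 70-nt window yields 70 - 6 + 1 = 65 overlapping 6-mers.
print(len(overlapping_kmers("A" * 70)))  # 65
```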

Variants

Model	Task	Size
`Biociphers/g4mer`	rG4 binary class	~46M
`Biociphers/g4mer-subtype`	rG4 subtype class	~46M
`Biociphers/g4mer-regression`	rG4 strength (score)	~46M

Usage

Predict rG4 Subtypes

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load binary rG4 model and tokenizer
binary_tokenizer = AutoTokenizer.from_pretrained("biociphers/g4mer")
binary_model = AutoModelForSequenceClassification.from_pretrained("biociphers/g4mer")
binary_model.eval()

# Load subtype model and tokenizer
subtype_tokenizer = AutoTokenizer.from_pretrained("biociphers/g4mer-subtype")
subtype_model = AutoModelForSequenceClassification.from_pretrained("biociphers/g4mer-subtype")
subtype_model.eval()

# Input sequence (max 70 nt)
sequence = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA"

# Convert to space-separated 6-mers
def to_kmers(seq, k=6):
    return ' '.join([seq[i:i+k] for i in range(len(seq) - k + 1)])

kmer_sequence = to_kmers(sequence)

# Predict rG4 binary score
binary_inputs = binary_tokenizer(kmer_sequence, return_tensors="pt")
with torch.no_grad():
    binary_output = binary_model(**binary_inputs)
    rG4_prob = torch.nn.functional.softmax(binary_output.logits, dim=-1)[0][1].item()

# Classify the subtype only when the sequence is confidently predicted to be an rG4.
# Here, the rG4 threshold is set to 0.7 (moderately confident).
if rG4_prob > 0.7:
    subtype_inputs = subtype_tokenizer(kmer_sequence, return_tensors="pt")
    with torch.no_grad():
        subtype_output = subtype_model(**subtype_inputs)
        subtype_probs = torch.nn.functional.softmax(subtype_output.logits, dim=-1)
        predicted_class = torch.argmax(subtype_probs, dim=-1).item()

    subtype_mapping = {
        0: "G≥40%",
        1: "Unknown",
        2: "Bulges",
        3: "Canonical",
        4: "Long loop",
        5: "Potential G-quadruplex & G≥40%",
        6: "Potential G-triplex & G≥40%",
        7: "Two-quartet"
    }
    print(f"Predicted subtype: {subtype_mapping[predicted_class]}")
else:
    print(f"Not a confident rG4 (score = {rG4_prob:.2f}); skipping subtype classification.")
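
The example above reports only the top-scoring subtype. When several subtypes receive similar probabilities, it can help to look at a ranked list instead. The sketch below is one way to do that in plain Python. The helper name `rank_subtypes` and the logit values are made up for illustration; only the class-index mapping comes from this model card.

```python
import math

SUBTYPE_MAPPING = {
    0: "G≥40%",
    1: "Unknown",
    2: "Bulges",
    3: "Canonical",
    4: "Long loop",
    5: "Potential G-quadruplex & G≥40%",
    6: "Potential G-triplex & G≥40%",
    7: "Two-quartet",
}


def rank_subtypes(logits, top_k=3):
    """Apply a softmax to raw logits and return the top_k (label, probability) pairs."""
    if len(logits) != len(SUBTYPE_MAPPING):
        raise ValueError("expected one logit per subtype class")
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(SUBTYPE_MAPPING[i], p) for i, p in ranked[:top_k]]


# Made-up logits, used only to show the output format.
example_logits = [0.1, -1.0, 0.3, 2.5, 0.0, -0.5, -2.0, 1.2]
for label, prob in rank_subtypes(example_logits):
    print(f"{label}: {prob:.3f}")
```

With the objects from the usage example, `rank_subtypes(subtype_output.logits[0].tolist())` would produce the same kind of ranked list for a real prediction.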

Training data

The model was trained on experimentally validated rG4 regions annotated with subtype labels based on loop lengths, bulges, guanine content, and overall folding potential. Each 70-nt training window was associated with one of the eight subtype labels shown above.
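
Because the model operates on 70-nt windows, longer sequences need to be split before prediction. The sketch below shows one possible tiling scheme. The helper name `sliding_windows` and the stride of 10 nt are assumptions for illustration; this model card does not state how windows were tiled during training or in the web tool.

```python
def sliding_windows(seq, window=70, stride=10):
    """Yield (start, subsequence) pairs covering seq with fixed-size windows.

    The stride is a hypothetical choice, not a documented G4mer setting.
    """
    if len(seq) <= window:
        yield 0, seq  # short input: return it unchanged as a single window
        return
    last_start = len(seq) - window
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the 3' end is covered
    for start in starts:
        yield start, seq[start:start + window]


# A 95-nt placeholder sequence gives windows starting at 0, 10, 20, and 25.
for start, sub in sliding_windows("G" * 95):
    print(start, len(sub))
```

Each window can then be converted to 6-mers and scored as in the usage example.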

Intended use

G4mer-Subtype is intended for researchers studying:

RNA G-quadruplex structural diversity
Subtype-specific regulatory roles in the transcriptome
Effects of sequence variation on rG4 formation patterns
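
For the last use case, one simple approach is to score a reference window and the same window carrying a single-nucleotide variant, then compare the two predictions. The sketch below only builds the paired windows. The helper name `variant_windows`, the 0-based coordinate convention, and the choice to center the window on the variant are assumptions for illustration, not the procedure used in the G4mer manuscript.

```python
def variant_windows(transcript, pos, ref, alt, window=70):
    """Return (ref_window, alt_window) around a 0-based SNV position.

    The window is centered on the variant where possible and shifted
    inward near the transcript ends so it stays within bounds.
    """
    if len(ref) != 1 or len(alt) != 1:
        raise ValueError("this sketch handles single-nucleotide variants only")
    if transcript[pos] != ref:
        raise ValueError("reference allele does not match the transcript")
    half = window // 2
    start = max(0, min(pos - half, len(transcript) - window))
    end = min(len(transcript), start + window)
    ref_window = transcript[start:end]
    offset = pos - start  # position of the variant inside the window
    alt_window = ref_window[:offset] + alt + ref_window[offset + 1:]
    return ref_window, alt_window


# Placeholder 100-nt transcript with a G>T change at position 40.
transcript = "A" * 40 + "G" + "C" * 59
ref_window, alt_window = variant_windows(transcript, 40, "G", "T")
print(len(ref_window), len(alt_window))
```

Running both windows through the binary and subtype models and comparing the outputs gives a per-variant view of how the prediction changes.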

Web Tool

You can explore G4mer predictions interactively through our web tool:

G4mer Web Tool

Features include:

RNA sequence prediction, which runs G4mer on a GPU to compute the probability that a sequence forms an rG4
Transcriptome-wide prediction of rG4s and subtypes
Variant effect annotation using gnomAD SNVs
Search and filter by gene, transcript, region (5′UTR, CDS, 3′UTR), and sequence context

No installation needed — just visit and start exploring.

Citation - MLA

Zhuang, Farica, et al. "G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." bioRxiv (2024): 2024-10.

Contact

For questions, feedback, or discussions about G4mer, please post on the Biociphers Google Group.
