G4mer

G4mer is a transformer-based RNA foundation model that identifies RNA G-quadruplexes (rG4s) from sequence input. It is fine-tuned from mRNAbert (Biociphers/mRNAbert).

Disclaimer

This is the official implementation of the G4mer model as described in the manuscript:

Zhuang, Farica, et al. G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data. bioRxiv (2024).

See our Bitbucket repo for code, data, and tutorials.

Model Details

G4mer is a transformer-based model trained on transcriptome-wide RNA sequences to predict:

  • Binary classification: whether a 70-nt sequence region forms an rG4 structure

All models use overlapping 6-mer tokenization and are trained from scratch on the human transcriptome.
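
As a minimal sketch of what overlapping 6-mer tokenization means, the helper below emits every 6-nt substring at stride 1, so a sequence of length n yields n - 6 + 1 k-mers (65 for a 70-nt window). The function name `to_kmers` follows the usage examples in this card; the 10-nt example sequence is arbitrary, and any special tokens the tokenizer itself adds are not shown.

```python
def to_kmers(seq, k=6):
    """Return overlapping k-mers (stride 1) joined by single spaces."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

example = "GGGAGGGCGC"  # arbitrary 10-nt illustration, not from the training data
kmers = to_kmers(example).split()
print(kmers)       # ['GGGAGG', 'GGAGGG', 'GAGGGC', 'AGGGCG', 'GGGCGC']
print(len(kmers))  # 10 - 6 + 1 = 5
```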

Variants

Model                          Task                 Size
Biociphers/g4mer               rG4 binary class     ~46M
Biociphers/g4mer-subtype       rG4 subtype class    ~46M
Biociphers/g4mer-regression    rG4 strength         ~46M

Usage

Binary rG4 Prediction

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Biociphers/g4mer")
model = AutoModelForSequenceClassification.from_pretrained("Biociphers/g4mer")
model.eval()  # disable dropout for inference

sequence = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA"  # input window; at most 70 nt

def to_kmers(seq, k=6):
    """Split a sequence into overlapping k-mers separated by spaces."""
    return ' '.join([seq[i:i+k] for i in range(len(seq) - k + 1)])

kmer_sequence = to_kmers(sequence, k=6)  # convert to overlapping 6-mers
inputs = tokenizer(kmer_sequence, return_tensors="pt")
with torch.no_grad():  # gradients are not needed for inference
    logits = model(**inputs).logits

# Probability of class 1 (rG4-forming)
rG4_probability = torch.softmax(logits, dim=1)[:, 1].item()
print(rG4_probability)
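
RNA sequences are often written with U and in mixed case, whereas the examples in this card use uppercase A/C/G/T. The sketch below normalizes an input before k-mer conversion, assuming the tokenizer vocabulary covers only A/C/G/T k-mers (as the comment in the sliding-window example suggests). The helper name `normalize_sequence` is hypothetical and not part of the G4mer code.

```python
def normalize_sequence(seq):
    """Uppercase, map U to T, and reject characters outside A/C/G/T."""
    cleaned = seq.strip().upper().replace("U", "T")
    invalid = set(cleaned) - set("ACGT")
    if invalid:
        # Ambiguity codes such as N are rejected rather than guessed at
        raise ValueError(f"Unsupported characters: {sorted(invalid)}")
    return cleaned

print(normalize_sequence("gggaggguuagg"))  # GGGAGGGTTAGG
```

Rejecting ambiguous bases is a conservative choice for this sketch; how such positions should be handled in practice is not specified here.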

G4mer was trained on sequences of at most 70 nt. For longer sequences, we recommend scanning the input with a 70-nt sliding window and taking the maximum rG4 score across all windows.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Biociphers/g4mer")
model = AutoModelForSequenceClassification.from_pretrained("Biociphers/g4mer")
model.eval()

# Define k-mer function
def to_kmers(seq, k=6):
    return ' '.join([seq[i:i+k] for i in range(len(seq) - k + 1)])

# Define a long sequence (must contain only A/C/G/T)
sequence = "GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA" * 2  # 100 nt

# Slide a 70-nt window with stride 1; a sequence shorter than 70 nt is scored as a single window
window_size = 70
stride = 1
windows = [sequence[i:i+window_size] for i in range(0, max(1, len(sequence) - window_size + 1), stride)]

# Score each window using G4mer
scores = []
for w in windows:
    kmer_seq = to_kmers(w, k=6)
    tokens = tokenizer(kmer_seq, return_tensors="pt")
    with torch.no_grad():
        output = model(**tokens)
        prob = torch.nn.functional.softmax(output.logits, dim=-1)
        scores.append(prob[0][1].item())  # class 1 = rG4-forming

# Final rG4 score for the long sequence
max_score = max(scores)
print(f"Max rG4 score across windows: {max_score:.3f}")
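
When scanning a long sequence it can also help to know where the highest-scoring window starts. The sketch below wraps the sliding-window logic in a function that takes any scoring callable and returns both the best score and its start position. The names `scan_windows` and `g_fraction` are hypothetical; `g_fraction` is only a placeholder scorer so the example runs without downloading the model, and in practice it would be replaced by a function wrapping the G4mer forward pass.

```python
def scan_windows(sequence, score_fn, window_size=70, stride=1):
    """Return (best_score, best_start) over sliding windows of a sequence.

    score_fn: any callable mapping a window string to a float score.
    Sequences shorter than window_size are scored as a single window.
    """
    last_start = max(0, len(sequence) - window_size)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the 3' end is covered when stride > 1
    scored = [(score_fn(sequence[s:s + window_size]), s) for s in starts]
    return max(scored, key=lambda pair: pair[0])  # first window with the top score

def g_fraction(window):
    """Placeholder scorer (NOT G4mer): fraction of G in the window."""
    return window.count("G") / len(window)

toy = "A" * 30 + "G" * 70 + "A" * 30  # arbitrary toy input
print(scan_windows(toy, g_fraction))  # (1.0, 30)
```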

Web Tool

You can explore G4mer predictions interactively through our web tool:

G4mer Web Tool

Features include:

  • RNA sequence prediction: runs G4mer on a GPU to compute the probability that an input sequence forms an rG4
  • Transcriptome-wide prediction of rG4s and subtypes
  • Variant effect annotation using gnomAD SNVs
  • Search and filter by gene, transcript, region (5′UTR, CDS, 3′UTR), and sequence context

No installation is needed: just visit the site and start exploring.

Citation - MLA

Zhuang, Farica, et al. "G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." bioRxiv (2024): 2024-10.

Contact

For questions, feedback, or discussions about G4mer, please post on the Biociphers Google Group.
