RNAGenesis: A Generalist Foundation Model for Functional RNA Therapeutics
Model Description
RNAGenesis is a generalist RNA foundation model that integrates sequence representation, structural prediction, and de novo functional design within a single generative framework. Trained on diverse clustered non-coding RNAs, RNAGenesis leverages a BERT-style encoder, query-based latent compression, and a diffusion-guided decoder enhanced by inference-time alignment with gradient guidance and beam search strategies.
This model achieves state-of-the-art performance on:
- 11 of 13 tasks in the BEACON benchmark
- Inverse folding and 3D structure prediction
- De novo structure design
- RNA therapeutics prediction (ASOs, siRNAs, shRNAs, circRNAs, UTR variants)
- Functional RNA design including aptamers and CRISPR sgRNA scaffolds
Model Details
- Model Type: Generalist RNA Foundation Model
- Architecture: BERT-style encoder with query-based latent compression and diffusion-guided decoder
- Input: RNA sequences (AUGC notation)
- Output: Sequence embeddings, structure predictions, functional designs
- Training Data: Diverse clustered non-coding RNAs
- Key Features:
  - Sequence representation learning
  - Structural prediction capabilities
  - De novo functional design
  - Inference-time alignment with gradient guidance
  - Beam search optimization strategies
Usage
Installation
pip install transformers torch
Basic Usage
from transformers import AutoModel, AutoTokenizer
import torch
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-username/RNAGenesis", trust_remote_code=True)
model = AutoModel.from_pretrained("your-username/RNAGenesis", trust_remote_code=True, torch_dtype=torch.bfloat16)
# Prepare your RNA sequence
rna_sequence = "GCCGGGCAUGGUGGCGCAUGCCUGUAGUCCCAGCUACCCGGGGAGGCUGAGGCAGAAGGAUCACUCGAGCCCAGGAGUUUGAGGUUGCUGUGAGCUAGGCUGACGCCACGGCACUCAGUCUAGCCUGGGCAACAAAGCGAGACUCUGUCUCCA"
# Tokenize and get embeddings
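# Note: stock Hugging Face tokenizers treat a plain string passed to convert_tokens_to_ids as one token.
# If this call returns a single ID, pass list(rna_sequence) instead (assumes one token per nucleotide).
input_ids = torch.tensor(tokenizer.convert_tokens_to_ids(rna_sequence)).unsqueeze(0)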
with torch.no_grad():
    outputs = model(input_ids)
embeddings = outputs.last_hidden_state.mean(dim=1)  # Average pooling over the token dimension
print(f"Embedding shape: {embeddings.shape}")
Advanced Usage - Batch Processing
sequences = [
    "AUGCGAUCGAUCGAUCG",
    "GCGCGCAUAUAUAUAUA",
    "UUUUAAAACCCCGGGGA",
]
# Process multiple sequences one at a time
embeddings = []
for seq in sequences:
    input_ids = torch.tensor(tokenizer.convert_tokens_to_ids(seq)).unsqueeze(0)
    with torch.no_grad():
        outputs = model(input_ids)
    seq_embedding = outputs.last_hidden_state.mean(dim=1)  # Shape: (1, hidden_size)
    embeddings.append(seq_embedding)
# Stack embeddings
all_embeddings = torch.cat(embeddings, dim=0)
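The loop above encodes one sequence per forward pass, so no padding is involved. If sequences of different lengths are instead padded into a single tensor, a plain `mean(dim=1)` would also average the padding positions. The following is a minimal sketch of mask-aware average pooling, using dummy tensors in place of real model outputs. The helper name `masked_mean_pool` is hypothetical, and whether the RNAGenesis model accepts padded batches or an attention mask is an assumption not confirmed by this card.

```python
import torch


def masked_mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real (non-padding) positions only.

    hidden: (batch, seq_len, hidden_size) token embeddings
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    mask = mask.unsqueeze(-1).to(hidden.dtype)   # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)          # padding rows contribute zero
    counts = mask.sum(dim=1).clamp(min=1.0)      # guard against all-padding rows
    return summed / counts


# Dummy stand-in for outputs.last_hidden_state: 2 sequences, 3 positions, 2 features.
# The last position of the first sequence is padding and must be ignored.
hidden = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]],
])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)
```

Each row of `pooled` is the mean over that sequence's unmasked positions, so the padded value in the first sequence does not affect its embedding.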
Performance Highlights
BEACON Benchmark
- State-of-the-art performance on 11 of 13 tasks
- Superior performance in structure-aware modeling tasks
RNATx-Bench (RNA Therapeutics Benchmark)
- Evaluated on >100,000 experimentally validated sequences
- Strong predictive performance across:
  - Antisense oligonucleotides (ASOs)
  - Small interfering RNAs (siRNAs)
  - Short hairpin RNAs (shRNAs)
  - Circular RNAs (circRNAs)
  - Untranslated region (UTR) variants
Experimental Validation
- Aptamer Design: IGFBP3-targeting aptamers with KD values as low as 4.02 nM
- CRISPR Enhancement: Up to 2.5-fold improvement in editing efficiency across:
  - CRISPR-Cas9 systems
  - Base editing systems
  - Prime editing systems
Limitations
- Maximum sequence length: Depends on model configuration
- Input must be valid RNA sequences using standard AUGC notation
- Model performance may vary on sequences significantly different from training data
- The accompanying paper is a preprint; its results have not been peer-reviewed
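Because input is limited to the AUGC alphabet, it can help to check sequences before tokenizing them. The sketch below is one possible pre-check, not part of the model's API: the helper name `normalize_rna` is hypothetical, and the choices to uppercase the input, strip whitespace, and map DNA-style `T` to `U` are assumptions about what a caller might want.

```python
import re

# Only the four standard RNA nucleotides are accepted.
_VALID_RNA = re.compile(r"[AUGC]+")


def normalize_rna(seq: str) -> str:
    """Return an uppercase AUGC-only sequence, or raise ValueError.

    Assumptions (not stated in the model card): whitespace is removed,
    lowercase is uppercased, and T is converted to U.
    """
    cleaned = "".join(seq.split()).upper().replace("T", "U")
    if not cleaned:
        raise ValueError("Empty RNA sequence")
    if not _VALID_RNA.fullmatch(cleaned):
        bad = sorted(set(cleaned) - set("AUGC"))
        raise ValueError(f"Unsupported symbols in RNA sequence: {bad}")
    return cleaned


print(normalize_rna("atgc gauc"))  # prints AUGCGAUC
```

Sequences containing ambiguity codes such as `N` are rejected here rather than silently passed to the tokenizer.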
Citation
If you use this model in your research, please cite:
@article{zhang2024rnagenesis,
title={RNAGenesis: A Generalist Foundation Model for Functional RNA Therapeutics},
author={Zhang, Zaixi and Jin, Ruofan and Chao, Linlin and Xu, Guangxue and Zhang, Yikun and Zhou, Guowei and Yin, Di and Guo, Yingqing and Fu, Yaqi and Yang, Yukang and Huang, Kaixuan and Wang, Xiaotong and Zhang, Junze and Yang, Yujie and Yang, Qirong and Xu, Ziyao and E, Weinan and Zhou, Ruhong and Zhang, Xiaoming and Wang, Mengdi and Cong, Le},
journal={bioRxiv},
year={2024},
doi={10.1101/2024.12.30.630826},
note={Preprint}
}
Paper: https://doi.org/10.1101/2024.12.30.630826
License
This model is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
Access
This model requires approval for access. Please fill out the access request form with:
- Your intended use case
- Your affiliation
- Whether the use is for commercial or research purposes
Authors
Zaixi Zhang, Ruofan Jin, Linlin Chao, Guangxue Xu, Yikun Zhang, Guowei Zhou, Di Yin, Yingqing Guo, Yaqi Fu, Yukang Yang, Kaixuan Huang, Xiaotong Wang, Junze Zhang, Yujie Yang, Qirong Yang, Ziyao Xu, Weinan E, Ruhong Zhou, Xiaoming Zhang, Mengdi Wang, Le Cong
Contact
For questions or issues, please open an issue on the model repository.