RNAGenesis: A Generalist Foundation Model for Functional RNA Therapeutics

Model Description

RNAGenesis is a generalist RNA foundation model that integrates sequence representation, structural prediction, and de novo functional design within a single generative framework. Trained on diverse clustered non-coding RNAs, RNAGenesis leverages a BERT-style encoder, query-based latent compression, and a diffusion-guided decoder enhanced by inference-time alignment with gradient guidance and beam search strategies.

This model achieves state-of-the-art performance on:

11 of 13 tasks in the BEACON benchmark
Inverse folding and 3D structure prediction
De novo structure design
RNA therapeutics prediction (ASOs, siRNAs, shRNAs, circRNAs, UTR variants)
Functional RNA design including aptamers and CRISPR sgRNA scaffolds

Model Details

Model Type: Generalist RNA Foundation Model
Architecture: BERT-style encoder with query-based latent compression and diffusion-guided decoder
Input: RNA sequences (AUGC notation)
Output: Sequence embeddings, structure predictions, functional designs
Training Data: Diverse clustered non-coding RNAs
Key Features:
- Sequence representation learning
- Structural prediction capabilities
- De novo functional design
- Inference-time alignment with gradient guidance
- Beam search optimization strategies
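
To make the beam search idea concrete, here is a toy sketch in plain Python. It is not the RNAGenesis implementation: the reward `gc_fraction` is a hypothetical stand-in for whatever property predictor would guide a real design run, and the search grows sequences base by base rather than steering a diffusion decoder.

```python
from typing import Callable, List, Tuple

ALPHABET = "AUGC"


def gc_fraction(seq: str) -> float:
    """Hypothetical reward: fraction of G/C bases (stand-in for a real predictor)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def beam_search(
    score: Callable[[str], float], length: int, beam_width: int = 4
) -> List[Tuple[str, float]]:
    """Grow sequences one base at a time, keeping the `beam_width` best prefixes."""
    beam = [""]
    for _ in range(length):
        # Extend every surviving prefix with every possible base
        candidates = [prefix + base for prefix in beam for base in ALPHABET]
        # Highest score first; ties broken alphabetically so the result is deterministic
        candidates.sort(key=lambda s: (-score(s), s))
        beam = candidates[:beam_width]
    return [(s, score(s)) for s in beam]


results = beam_search(gc_fraction, length=8)
best_seq, best_score = results[0]
print(best_seq, best_score)
```

A wider beam explores more candidates per step at a proportionally higher scoring cost; with `beam_width=1` the procedure reduces to greedy search.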

Usage

Installation

pip install transformers torch

Basic Usage

from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-username/RNAGenesis", trust_remote_code=True)
model = AutoModel.from_pretrained("your-username/RNAGenesis", trust_remote_code=True, torch_dtype=torch.bfloat16)

# Prepare your RNA sequence
rna_sequence = "GCCGGGCAUGGUGGCGCAUGCCUGUAGUCCCAGCUACCCGGGGAGGCUGAGGCAGAAGGAUCACUCGAGCCCAGGAGUUUGAGGUUGCUGUGAGCUAGGCUGACGCCACGGCACUCAGUCUAGCCUGGGCAACAAAGCGAGACUCUGUCUCCA"

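# Tokenize and get embeddings
# convert_tokens_to_ids maps a bare string to a single ID, so split the sequence
# into per-nucleotide tokens first (assumes a single-nucleotide vocabulary)
input_ids = torch.tensor(tokenizer.convert_tokens_to_ids(list(rna_sequence))).unsqueeze(0)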
with torch.no_grad():
    outputs = model(input_ids)
    embeddings = outputs.last_hidden_state.mean(dim=1)  # Average pooling

print(f"Embedding shape: {embeddings.shape}")

Advanced Usage - Batch Processing

sequences = [
    "AUGCGAUCGAUCGAUCG",
    "GCGCGCAUAUAUAUAUA",
    "UUUUAAAACCCCGGGGA"
]

# Process multiple sequences
embeddings = []
for seq in sequences:
    # Split into per-nucleotide tokens; a bare string would map to a single ID
    input_ids = torch.tensor(tokenizer.convert_tokens_to_ids(list(seq))).unsqueeze(0)
    with torch.no_grad():
        outputs = model(input_ids)
        seq_embedding = outputs.last_hidden_state.mean(dim=1)
        embeddings.append(seq_embedding)

# Stack embeddings
all_embeddings = torch.cat(embeddings, dim=0)
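
The loop above runs one forward pass per sequence. If sequences of different lengths are instead padded into a single batch, plain `mean(dim=1)` would average padding positions into the embedding. The sketch below shows mask-aware mean pooling, using a random tensor as a stand-in for `outputs.last_hidden_state`. Whether the RNAGenesis model accepts an attention mask and which padding ID it expects are assumptions to check against the repository's remote code.

```python
import torch


def masked_mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real (mask == 1) positions only.

    hidden: (batch, seq_len, dim)
    mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = mask.unsqueeze(-1).to(hidden.dtype)   # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)          # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1.0)      # guard against all-padding rows
    return summed / counts


# Stand-in for the hidden states of a padded batch of two sequences
hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0]])           # second sequence has 3 real tokens
pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)
```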

Performance Highlights

BEACON Benchmark

State-of-the-art performance on 11 of 13 tasks
Superior performance in structure-aware modeling tasks

RNATx-Bench (RNA Therapeutics Benchmark)

Evaluated on >100,000 experimentally validated sequences
Strong predictive performance across:
- Antisense oligonucleotides (ASOs)
- Small interfering RNAs (siRNAs)
- Short hairpin RNAs (shRNAs)
- Circular RNAs (circRNAs)
- Untranslated region (UTR) variants

Experimental Validation

Aptamer Design: IGFBP3-targeting aptamers with KD values as low as 4.02 nM
CRISPR Enhancement: Up to 2.5-fold improvement in editing efficiency across:
- CRISPR-Cas9 systems
- Base editing systems
- Prime editing systems

Limitations

Maximum sequence length: Depends on model configuration
Input must be valid RNA sequences using standard AUGC notation
Model performance may vary on sequences significantly different from training data
The accompanying paper is a preprint; its results have not been peer-reviewed
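
Because inputs must use the AUGC alphabet, it can help to check sequences before tokenizing them. The helper below is a minimal sketch and not part of the released code: `normalize_rna` is a hypothetical name, and converting DNA-style `T` to `U` is a convenience assumption that may not suit every workflow.

```python
import re

_RNA_RE = re.compile(r"[AUGC]+")


def normalize_rna(seq: str) -> str:
    """Strip whitespace, uppercase, map T to U, and reject any other symbol."""
    cleaned = "".join(seq.split()).upper().replace("T", "U")
    if not _RNA_RE.fullmatch(cleaned):
        bad = sorted(set(cleaned) - set("AUGC"))
        raise ValueError(f"Invalid RNA sequence; unexpected symbols: {bad or 'none (empty input)'}")
    return cleaned


print(normalize_rna("augc tt"))
```

Ambiguity codes such as `N` are rejected here rather than silently dropped, so problems surface before the sequence reaches the model.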

Citation

If you use this model in your research, please cite:

@article{zhang2024rnagenesis,
  title={RNAGenesis: A Generalist Foundation Model for Functional RNA Therapeutics},
  author={Zhang, Zaixi and Jin, Ruofan and Chao, Linlin and Xu, Guangxue and Zhang, Yikun and Zhou, Guowei and Yin, Di and Guo, Yingqing and Fu, Yaqi and Yang, Yukang and Huang, Kaixuan and Wang, Xiaotong and Zhang, Junze and Yang, Yujie and Yang, Qirong and Xu, Ziyao and E, Weinan and Zhou, Ruhong and Zhang, Xiaoming and Wang, Mengdi and Cong, Le},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.12.30.630826},
  note={Preprint}
}

Paper: https://doi.org/10.1101/2024.12.30.630826

License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

Access

This model requires approval for access. Please fill out the access request form with:

Your intended use case
Your affiliation
Whether the use is for commercial or research purposes

Authors

Zaixi Zhang, Ruofan Jin, Linlin Chao, Guangxue Xu, Yikun Zhang, Guowei Zhou, Di Yin, Yingqing Guo, Yaqi Fu, Yukang Yang, Kaixuan Huang, Xiaotong Wang, Junze Zhang, Yujie Yang, Qirong Yang, Ziyao Xu, Weinan E, Ruhong Zhou, Xiaoming Zhang, Mengdi Wang, Le Cong

Contact

For questions or issues, please open an issue on the model repository.
