# Protein to RNA CDS Sequence Generation Model
This repository contains a custom PyTorch model that generates RNA CDS sequences from protein sequences. The model uses a transformer-based architecture that incorporates an ESM-2 encoder and a Mixture-of-Experts (MoE) layer.
## Model Architecture
The model `ActorModel_encoder_esm2` is defined in `utils.py`. The key parameters used for instantiation are:
- `d_model`: Dimension of the model's internal representation (768).
- `nhead`: Number of attention heads (8).
- `num_encoder_layers`: Number of transformer encoder layers (8).
- `dim_feedforward`: Dimension of the feedforward network (`d_model * 2`).
- `esm2_dim`: Dimension of the ESM-2 embeddings (1280 for esm2_t33_650M_UR50D).
- `dropout`: Dropout rate (0.3).
- `num_experts`: Number of experts in the MoE layer (6).
- `top_k_experts`: Number of top experts each token is routed to (2); see the sketch after this list.
- `device`: The device to run the model on.
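To make `num_experts` and `top_k_experts` concrete, here is a minimal sketch of top-k MoE routing in PyTorch. It is *not* the actual `Encoder` from `transformer_encoder_MoE.py` (see that file for the real implementation); the class name `TopKMoE` and all implementation details below are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative top-k Mixture-of-Experts feed-forward layer (not the repo's Encoder)."""

    def __init__(self, d_model=768, dim_feedforward=768 * 2, num_experts=6, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, dim_feedforward),
                nn.ReLU(),
                nn.Linear(dim_feedforward, d_model),
            )
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts)  # router: scores each expert per token
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        scores = self.gate(x)                           # (B, S, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep each token's top-k experts
        weights = F.softmax(weights, dim=-1)            # normalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                 # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Routing a dummy batch through the layer:
moe = TopKMoE()
print(moe(torch.randn(2, 10, 768)).shape)  # torch.Size([2, 10, 768])
```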
## Files in this Repository
- `homo_mrna.pt`: The PyTorch state_dict of the trained model for *Homo sapiens* mRNA.
- `homo_circ.pt`: The PyTorch state_dict of the trained model for *Homo sapiens* circular RNA.
- `Arabidopsis.pt`: The PyTorch state_dict of the trained model for *Arabidopsis thaliana* mRNA.
- `CR.pt`: The PyTorch state_dict of the trained model for *Chlamydomonas reinhardtii* mRNA.
- `EscherichiaColi.pt`: The PyTorch state_dict of the trained model for *Escherichia coli* mRNA.
- `PC.pt`: The PyTorch state_dict of the trained model for *Penicillium chrysogenum* mRNA.
- `TK.pt`: The PyTorch state_dict of the trained model for *Thermococcus kodakarensis* KOD1 mRNA.
- `utils.py`: Contains the definitions of the `ActorModel_encoder_esm2` class and the `Tokenizer` class.
- `transformer_encoder_MoE.py`: Contains the definition of the `Encoder` class.
- `README.md`: This file.
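All of the `.pt` files appear to be state_dicts for the same `ActorModel_encoder_esm2` class, so the loading procedure below should apply to each of them; only the weights filename changes.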
## How to Load the Model
Since this is a custom model, you need to download `utils.py`, `transformer_encoder_MoE.py`, and a `.pt` weights file, then instantiate the model class and load the state dictionary.
**1. Download files.** You can download the files using the `huggingface_hub` library:

```python
from huggingface_hub import hf_hub_download
import os

repo_id = "sglin/RNARL"
local_dir = "./my_RNARL"

# Download model weights and helper modules
hf_hub_download(repo_id=repo_id, filename="homo_mrna.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="homo_circ.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="Arabidopsis.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="CR.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="EscherichiaColi.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="PC.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="TK.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="utils.py", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="transformer_encoder_MoE.py", local_dir=local_dir)

# utils.py, transformer_encoder_MoE.py, and the model weights are now in ./my_RNARL
```
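If you want every file at once, `snapshot_download` from the same library fetches the whole repository in one call:

```python
from huggingface_hub import snapshot_download

# Downloads all files in the repo (weights, utils.py, transformer_encoder_MoE.py, ...)
local_dir = snapshot_download(repo_id="sglin/RNARL", local_dir="./my_RNARL")
```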
**2. Import the model class.**

```python
# Make sure the downloaded files are on your Python path, e.g.:
import sys
sys.path.append("./my_RNARL")

# (Alternatively, copy utils.py and transformer_encoder_MoE.py into your
# current working directory.)
from utils import Tokenizer, ActorModel_encoder_esm2
```
**3. Load ESM-2 (dependency).** The model requires the ESM-2 encoder, which you load separately from the Hugging Face Hub:

```python
from transformers import AutoTokenizer, EsmModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

esm2_tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
esm2_model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D").to(device)
esm2_model.eval()

esm2_dim = esm2_model.config.hidden_size  # get the actual embedding dimension (1280)
```
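As a quick sanity check that ESM-2 is wired up correctly, you can embed a protein sequence (the sequence below is an arbitrary example):

```python
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence

inputs = esm2_tokenizer(protein, return_tensors="pt").to(device)
with torch.no_grad():
    esm2_embedding = esm2_model(**inputs).last_hidden_state

print(esm2_embedding.shape)  # (1, sequence length incl. special tokens, 1280)
```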
> **Note:** The original training script loaded ESM-2 from a local path (`./esm2_model_t33_650M_UR50D`). When loading from the Hub, use the official `facebook/esm2_t33_650M_UR50D` repo as shown above; there is no need to mirror the ESM-2 files in this repo, since they are already on the Hub.
**4. Instantiate the custom model and load the weights.** Instantiate `ActorModel_encoder_esm2` with the parameters used during training and load the state dictionary:

```python
# Parameters used during training
d_model = 768
nhead = 8
num_encoder_layers = 8
dim_feedforward = d_model * 2  # or the exact value used in training
dropout = 0.3
num_experts = 6
top_k_experts = 2

# vocab_size needs to match the custom Tokenizer
tokenizer = Tokenizer()             # instantiate the custom tokenizer
vocab_size = len(tokenizer.tokens)  # get vocab size from the tokenizer

# Instantiate the model
model = ActorModel_encoder_esm2(
    vocab_size=vocab_size,
    d_model=d_model,
    nhead=nhead,
    num_encoder_layers=num_encoder_layers,
    dim_feedforward=dim_feedforward,
    esm2_dim=esm2_dim,  # use the ESM-2 model's hidden size
    dropout=dropout,
    num_experts=num_experts,
    top_k_experts=top_k_experts,
    device=device,
)

# Load the state dictionary
model_weights_path = os.path.join(local_dir, "homo_mrna.pt")
model.load_state_dict(torch.load(model_weights_path, map_location=device))
model.to(device)
model.eval()

print("Model loaded successfully!")

# You can now use 'model' for inference. Remember that you also need the
# custom Tokenizer and the ESM-2 tokenizer/model.
```
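The inference interface is defined by `ActorModel_encoder_esm2` in `utils.py` and is not documented here. Purely as orientation, a greedy decoding loop might look like the sketch below; the `forward` call signature and the tokenizer's `bos_id`/`eos_id`/`decode` members are assumptions, not the actual API, so adapt them to the definitions in `utils.py`.

```python
# Hypothetical greedy decoding sketch. The model's forward() signature and the
# Tokenizer's bos_id/eos_id/decode() members are ASSUMPTIONS; check utils.py
# for the real interface before using this.
with torch.no_grad():
    generated = [tokenizer.bos_id]             # assumed start-of-sequence id
    for _ in range(3 * len(protein)):          # roughly one codon per residue
        tgt = torch.tensor([generated], device=device)
        logits = model(tgt, esm2_embedding)    # assumed forward signature
        next_id = logits[0, -1].argmax().item()
        if next_id == tokenizer.eos_id:        # assumed end-of-sequence id
            break
        generated.append(next_id)

rna_cds = tokenizer.decode(generated[1:])      # assumed decode() helper
print(rna_cds)
```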
## Dependencies

- torch
- transformers
- huggingface_hub
- pandas
- numpy
- The specific ESM-2 checkpoint (`esm2_t33_650M_UR50D`, or whichever one was used in training)
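The Python packages can be installed with `pip install torch transformers huggingface_hub pandas numpy`.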
## License

[MIT, Apache 2.0]