# Protein to RNA CDS Sequence Generation Model
This repository contains a custom PyTorch model that generates RNA CDS sequences from protein sequences. The model uses a transformer-based architecture that incorporates an ESM-2 encoder and a Mixture-of-Experts (MoE) layer.
## Model Architecture
The model `ActorModel_encoder_esm2` is defined in `utils.py`. The key parameters used for instantiation are:
- `d_model`: Dimension of the model's internal representation (768).
- `nhead`: Number of attention heads (8).
- `num_encoder_layers`: Number of transformer encoder layers (8).
- `dim_feedforward`: Dimension of the feedforward network (`d_model * 2`).
- `esm2_dim`: Dimension of the ESM-2 embeddings (1280 for esm2_t33_650M_UR50D).
- `dropout`: Dropout rate (0.3).
- `num_experts`: Number of experts in the MoE layer (6).
- `top_k_experts`: Number of top experts each token is routed to (2); see the sketch after this list.
- `device`: The device to run the model on.
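To make `num_experts` and `top_k_experts` concrete, here is a minimal sketch of top-k MoE routing in PyTorch. It is *not* the actual `Encoder` from `transformer_encoder_MoE.py` (see that file for the real implementation); the class name `TopKMoE` and all implementation details below are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative top-k Mixture-of-Experts feed-forward layer (not the repo's Encoder)."""

    def __init__(self, d_model=768, dim_feedforward=768 * 2, num_experts=6, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, dim_feedforward),
                nn.ReLU(),
                nn.Linear(dim_feedforward, d_model),
            )
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts)  # router: scores each expert per token
        self.top_k = top_k

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        scores = self.gate(x)                           # (B, S, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep each token's top-k experts
        weights = F.softmax(weights, dim=-1)            # normalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                 # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Routing a dummy batch through the layer:
moe = TopKMoE()
print(moe(torch.randn(2, 10, 768)).shape)  # torch.Size([2, 10, 768])
```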
## Files in this Repository
- `homo_mrna.pt`: The PyTorch state_dict of the trained model for *Homo sapiens* mRNA.
- `homo_circ.pt`: The PyTorch state_dict of the trained model for *Homo sapiens* circular RNA.
- `Arabidopsis.pt`: The PyTorch state_dict of the trained model for *Arabidopsis thaliana* mRNA.
- `CR.pt`: The PyTorch state_dict of the trained model for *Chlamydomonas reinhardtii* mRNA.
- `EscherichiaColi.pt`: The PyTorch state_dict of the trained model for *Escherichia coli* mRNA.
- `PC.pt`: The PyTorch state_dict of the trained model for *Penicillium chrysogenum* mRNA.
- `TK.pt`: The PyTorch state_dict of the trained model for *Thermococcus kodakarensis* KOD1 mRNA.
- `utils.py`: Contains the definitions of the `ActorModel_encoder_esm2` class and the `Tokenizer` class.
- `transformer_encoder_MoE.py`: Contains the definition of the `Encoder` class.
- `README.md`: This file.
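All of the `.pt` files appear to be state_dicts for the same `ActorModel_encoder_esm2` class, so the loading procedure below should apply to each of them; only the weights filename changes.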
## How to Load the Model
Since this is a custom model, you need to download `utils.py`, `transformer_encoder_MoE.py`, and a `.pt` weights file, then instantiate the model class and load the state dictionary.
**1. Download files.** You can download the files using the `huggingface_hub` library:

```python
from huggingface_hub import hf_hub_download
import os

repo_id = "sglin/RNARL"
local_dir = "./my_RNARL"

# Download model weights and helper modules
hf_hub_download(repo_id=repo_id, filename="homo_mrna.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="homo_circ.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="Arabidopsis.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="CR.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="EscherichiaColi.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="PC.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="TK.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="utils.py", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="transformer_encoder_MoE.py", local_dir=local_dir)

# utils.py, transformer_encoder_MoE.py, and the model weights are now in ./my_RNARL
```
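If you want every file at once, `snapshot_download` from the same library fetches the whole repository in one call:

```python
from huggingface_hub import snapshot_download

# Downloads all files in the repo (weights, utils.py, transformer_encoder_MoE.py, ...)
local_dir = snapshot_download(repo_id="sglin/RNARL", local_dir="./my_RNARL")
```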
**2. Import the model class.**

```python
# Make sure the downloaded files are on your Python path, e.g.:
import sys
sys.path.append("./my_RNARL")

# (Alternatively, copy utils.py and transformer_encoder_MoE.py into your
# current working directory.)
from utils import Tokenizer, ActorModel_encoder_esm2
```
**3. Load ESM-2 (dependency).** The model requires the ESM-2 encoder, which you load separately from the Hugging Face Hub:

```python
from transformers import AutoTokenizer, EsmModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

esm2_tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
esm2_model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D").to(device)
esm2_model.eval()

esm2_dim = esm2_model.config.hidden_size  # get the actual embedding dimension (1280)
```
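As a quick sanity check that ESM-2 is wired up correctly, you can embed a protein sequence (the sequence below is an arbitrary example):

```python
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # arbitrary example sequence

inputs = esm2_tokenizer(protein, return_tensors="pt").to(device)
with torch.no_grad():
    esm2_embedding = esm2_model(**inputs).last_hidden_state

print(esm2_embedding.shape)  # (1, sequence length incl. special tokens, 1280)
```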
> **Note:** The original training script loaded ESM-2 from a local path (`./esm2_model_t33_650M_UR50D`). When loading from the Hub, use the official `facebook/esm2_t33_650M_UR50D` repo as shown above; there is no need to mirror the ESM-2 files in this repo, since they are already on the Hub.
**4. Instantiate the custom model and load the weights.** Instantiate `ActorModel_encoder_esm2` with the parameters used during training and load the state dictionary:

```python
# Parameters used during training
d_model = 768
nhead = 8
num_encoder_layers = 8
dim_feedforward = d_model * 2  # or the exact value used in training
dropout = 0.3
num_experts = 6
top_k_experts = 2

# vocab_size needs to match the custom Tokenizer
tokenizer = Tokenizer()             # instantiate the custom tokenizer
vocab_size = len(tokenizer.tokens)  # get vocab size from the tokenizer

# Instantiate the model
model = ActorModel_encoder_esm2(
    vocab_size=vocab_size,
    d_model=d_model,
    nhead=nhead,
    num_encoder_layers=num_encoder_layers,
    dim_feedforward=dim_feedforward,
    esm2_dim=esm2_dim,  # use the ESM-2 model's hidden size
    dropout=dropout,
    num_experts=num_experts,
    top_k_experts=top_k_experts,
    device=device,
)

# Load the state dictionary
model_weights_path = os.path.join(local_dir, "homo_mrna.pt")
model.load_state_dict(torch.load(model_weights_path, map_location=device))
model.to(device)
model.eval()

print("Model loaded successfully!")

# You can now use 'model' for inference. Remember that you also need the
# custom Tokenizer and the ESM-2 tokenizer/model.
```
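The inference interface is defined by `ActorModel_encoder_esm2` in `utils.py` and is not documented here. Purely as orientation, a greedy decoding loop might look like the sketch below; the `forward` call signature and the tokenizer's `bos_id`/`eos_id`/`decode` members are assumptions, not the actual API, so adapt them to the definitions in `utils.py`.

```python
# Hypothetical greedy decoding sketch. The model's forward() signature and the
# Tokenizer's bos_id/eos_id/decode() members are ASSUMPTIONS; check utils.py
# for the real interface before using this.
with torch.no_grad():
    generated = [tokenizer.bos_id]             # assumed start-of-sequence id
    for _ in range(3 * len(protein)):          # roughly one codon per residue
        tgt = torch.tensor([generated], device=device)
        logits = model(tgt, esm2_embedding)    # assumed forward signature
        next_id = logits[0, -1].argmax().item()
        if next_id == tokenizer.eos_id:        # assumed end-of-sequence id
            break
        generated.append(next_id)

rna_cds = tokenizer.decode(generated[1:])      # assumed decode() helper
print(rna_cds)
```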
## Dependencies

- torch
- transformers
- huggingface_hub
- pandas
- numpy
- The specific ESM-2 checkpoint (`esm2_t33_650M_UR50D`, or whichever one was used in training)
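The Python packages can be installed with `pip install torch transformers huggingface_hub pandas numpy`.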
## License

[MIT, Apache 2.0]