---
tags:
- generation
- protein-sequence
- rna-sequence
- pytorch
---

# Protein to RNA CDS Sequence Generation Model

This model is a custom PyTorch model designed to generate RNA CDS sequences from protein sequences. It uses a custom transformer-based architecture that combines an ESM-2 encoder with a Mixture-of-Experts (MoE) layer.

## Model Architecture

The model `ActorModel_encoder_esm2` is defined in `utils.py`. The key parameters used for instantiation are:

- `d_model`: Dimension of the model's internal representation (768).
- `nhead`: Number of attention heads (8).
- `num_encoder_layers`: Number of transformer encoder layers (8).
- `dim_feedforward`: Dimension of the feedforward network (`d_model * 2`).
- `esm2_dim`: Dimension of the ESM-2 embeddings (1280 for esm2_t33_650M_UR50D).
- `dropout`: Dropout rate (0.3).
- `num_experts`: Number of experts in the MoE layer (6).
- `top_k_experts`: Number of top experts to use (2).
- `device`: The device to run the model on.

## Files in this Repository

- `homo_mrna.pt`: The PyTorch state_dict of the trained model for Homo sapiens mRNA.
- `homo_circ.pt`: The PyTorch state_dict of the trained model for Homo sapiens circular RNA.
- `Arabidopsis.pt`: The PyTorch state_dict of the trained model for Arabidopsis thaliana mRNA.
- `CR.pt`: The PyTorch state_dict of the trained model for Chlamydomonas reinhardtii mRNA.
- `EscherichiaColi.pt`: The PyTorch state_dict of the trained model for Escherichia coli mRNA.
- `PC.pt`: The PyTorch state_dict of the trained model for Penicillium chrysogenum mRNA.
- `TK.pt`: The PyTorch state_dict of the trained model for Thermococcus kodakarensis KOD1 mRNA.
- `utils.py`: Contains the definitions of the `ActorModel_encoder_esm2` class and the `Tokenizer` class.
- `transformer_encoder_MoE.py`: Contains the definition of the `Encoder` class.
- `README.md`: This file.

## How to Load the Model

Since this is a custom model, you need to download `utils.py`, `transformer_encoder_MoE.py`, and the `.pt` file for your target species, then instantiate the model class and load the state dictionary.

1. **Download Files:** You can download the files using the `huggingface_hub` library:

```python
from huggingface_hub import hf_hub_download
import os

repo_id = "sglin/RNARL"
local_dir = "./my_RNARL"

# Download model weights and the custom module files
hf_hub_download(repo_id=repo_id, filename="homo_mrna.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="homo_circ.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="Arabidopsis.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="CR.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="EscherichiaColi.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="PC.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="TK.pt", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="utils.py", local_dir=local_dir)
hf_hub_download(repo_id=repo_id, filename="transformer_encoder_MoE.py", local_dir=local_dir)

# Now utils.py, transformer_encoder_MoE.py, and the model weights are in ./my_RNARL
```

2. **Import Model Class:** (an optional sanity check follows this step)

```python
# If utils.py is not in your current working directory,
# add the download directory to your path first:
# import sys
# sys.path.append("./my_RNARL")

from utils import Tokenizer, ActorModel_encoder_esm2
```
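Optionally, sanity-check the import before moving on. This is a minimal sketch; it assumes the files from step 1 landed in `./my_RNARL` and relies only on the `tokens` attribute of `Tokenizer`, which step 4 below also uses:

```python
import sys
sys.path.append("./my_RNARL")  # directory used in step 1

from utils import Tokenizer, ActorModel_encoder_esm2

# The Tokenizer's vocabulary size is needed when instantiating the model (step 4)
tokenizer = Tokenizer()
print(f"CDS vocabulary size: {len(tokenizer.tokens)}")
```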
3. **Load ESM-2 (Dependency):** The model requires the ESM-2 encoder, which you need to load separately from the Hugging Face Hub:

```python
from transformers import AutoTokenizer, EsmModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

esm2_tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
esm2_model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D").to(device)
esm2_model.eval()
esm2_dim = esm2_model.config.hidden_size  # Get the actual embedding dimension
```

*Note:* If you keep a local copy of ESM-2 (e.g., `./esm2_model_t33_650M_UR50D`), you can pass that path to `from_pretrained` instead; otherwise, loading directly from the official `facebook/esm2_t33_650M_UR50D` repository on the Hub is recommended.

4. **Instantiate Custom Model and Load Weights:** Instantiate `ActorModel_encoder_esm2` with the parameters listed above and load the state dictionary:

```python
# Parameters used during training
d_model = 768
nhead = 8
num_encoder_layers = 8
dim_feedforward = d_model * 2  # must match the value used in training
dropout = 0.3
num_experts = 6
top_k_experts = 2

# vocab_size must match the custom Tokenizer
tokenizer = Tokenizer()
vocab_size = len(tokenizer.tokens)

# Instantiate the model
model = ActorModel_encoder_esm2(
    vocab_size=vocab_size,
    d_model=d_model,
    nhead=nhead,
    num_encoder_layers=num_encoder_layers,
    dim_feedforward=dim_feedforward,
    esm2_dim=esm2_dim,  # Use the loaded ESM-2 model's dimension
    dropout=dropout,
    num_experts=num_experts,
    top_k_experts=top_k_experts,
    device=device
)

# Load the state dictionary (choose the .pt file for your target species)
model_weights_path = os.path.join(local_dir, "homo_mrna.pt")
model.load_state_dict(torch.load(model_weights_path, map_location=device))
model.to(device)
model.eval()

print("Model loaded successfully!")

# The 'model' object is now ready for inference; you also need the custom
# Tokenizer and the ESM-2 tokenizer/model loaded above.
```

For a hypothetical end-to-end usage template, see the appendix at the end of this README.

## Dependencies

- `torch`
- `transformers`
- `huggingface_hub`
- `pandas`
- `numpy`
- The specific ESM-2 checkpoint used (`facebook/esm2_t33_650M_UR50D`, or the variant matching your weights)

## License

MIT / Apache 2.0

## Contact

linsg4521@sjtu.edu.cn
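## Appendix: End-to-End Inference Sketch

This README does not pin down the model's generation API; the exact forward/decoding signature of `ActorModel_encoder_esm2` is defined in `utils.py`. The sketch below is therefore a hypothetical template, not the definitive usage: it assumes the forward pass accepts ESM-2 embeddings and returns logits over the `Tokenizer` vocabulary, and that `tokenizer.tokens` can be indexed by token id. Adapt the marked lines to the actual signatures in `utils.py`.

```python
import torch

# --- Hypothetical usage sketch; adapt to the real API in utils.py ---
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example protein sequence

# 1) Embed the protein with ESM-2 (esm2_tokenizer / esm2_model from step 3)
inputs = esm2_tokenizer(protein, return_tensors="pt").to(device)
with torch.no_grad():
    esm2_embeddings = esm2_model(**inputs).last_hidden_state  # (1, L, esm2_dim)

# 2) Run the custom model. ASSUMPTION: forward takes ESM-2 embeddings and
#    returns logits over the CDS token vocabulary; check utils.py for the
#    actual signature and any extra arguments (masks, lengths, etc.).
with torch.no_grad():
    logits = model(esm2_embeddings)

# 3) Greedy decoding for illustration; depending on how the model was
#    trained, sampling from the logits may be preferable.
token_ids = logits.argmax(dim=-1).squeeze(0).tolist()

# 4) ASSUMPTION: tokenizer.tokens maps token ids back to codon/nucleotide
#    strings; prefer the Tokenizer's own decode method if it provides one.
cds = "".join(tokenizer.tokens[i] for i in token_ids)
print(cds)
```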