sglin committed
Commit 9c396a5 · 1 Parent(s): 1a6cb78
Add initial model weights, utils.py, transformer_encoder_MoE.py, and README
Browse files
- Arabidopsis.pt +3 -0
- CR.pt +3 -0
- EscherichiaColi.pt +3 -0
- PC.pt +3 -0
- README.md +153 -0
- TK.pt +3 -0
- homo_circ.pt +3 -0
- homo_mrna.pt +3 -0
- transformer_encoder_MoE.py +555 -0
- utils.py +844 -0
Arabidopsis.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:725088277142fa26aa1781806b107155363a903d0eba542bdcf312f5dbde48c8
size 534071434
CR.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b69bb49ff4a336cc7a4f47ffb438dba298d67b237f678403a43bee511c8c1928
size 534069804
EscherichiaColi.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6833a55fc82e708e8e5307dd63be0983504d768d32a1b1ec7a83b616bdf49671
size 534072130
PC.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fe6d83b0824ced0f00c43189a58a9eaecd4acf02226dffb057b27166f08d8f1a
size 534069804
README.md
ADDED
@@ -0,0 +1,153 @@
---
tags:
- generation
- protein-sequence
- rna-sequence
- pytorch
---

# Protein to RNA CDS Sequence Generation Model

This model is a custom PyTorch model designed to generate RNA CDS sequences from protein sequences. It utilizes a custom transformer-based architecture incorporating an ESM-2 encoder and a Mixture-of-Experts (MoE) layer.

## Model Architecture

The model `ActorModel_encoder_esm2` is defined in `utils.py`.

The key parameters used for instantiation are:

- `d_model`: Dimension of the model's internal representation (768).
- `nhead`: Number of attention heads (8).
- `num_encoder_layers`: Number of transformer encoder layers (8).
- `dim_feedforward`: Dimension of the feedforward network (`d_model * 2`).
- `esm2_dim`: Dimension of the ESM-2 embeddings (1280 for esm2_t33_650M_UR50D).
- `dropout`: Dropout rate (0.3).
- `num_experts`: Number of experts in the MoE layer (6).
- `top_k_experts`: Number of top experts to use (2).
- `device`: The device to run the model on.

## Files in this Repository

- `homo_mrna.pt`: The PyTorch state_dict of the trained model for Homo sapiens mRNA.
- `homo_circ.pt`: The PyTorch state_dict of the trained model for Homo sapiens circular RNA.
- `Arabidopsis.pt`: The PyTorch state_dict of the trained model for Arabidopsis thaliana mRNA.
- `CR.pt`: The PyTorch state_dict of the trained model for Chlamydomonas reinhardtii mRNA.
- `EscherichiaColi.pt`: The PyTorch state_dict of the trained model for Escherichia coli mRNA.
- `PC.pt`: The PyTorch state_dict of the trained model for Penicillium chrysogenum mRNA.
- `TK.pt`: The PyTorch state_dict of the trained model for Thermococcus kodakarensis KOD1 mRNA.
- `utils.py`: Contains the definition of the `ActorModel_encoder_esm2` class and the `Tokenizer` class.
- `transformer_encoder_MoE.py`: Contains the definition of the `Encoder` class.
- `README.md`: This file.

## How to Load the Model

Since this is a custom model, you need to download `utils.py`, `transformer_encoder_MoE.py`, and the `.pt` weights file, then instantiate the model class and load the state dictionary.

1. **Download Files:**
   You can download the files using the `huggingface_hub` library:
   ```python
   from huggingface_hub import hf_hub_download
   import os

   repo_id = "sglin/RNARL"
   local_dir = "./my_RNARL"

   # Download model weights and utils.py
   hf_hub_download(repo_id=repo_id, filename="homo_mrna.pt", local_dir=local_dir)
   hf_hub_download(repo_id=repo_id, filename="homo_circ.pt", local_dir=local_dir)
   hf_hub_download(repo_id=repo_id, filename="Arabidopsis.pt", local_dir=local_dir)
   hf_hub_download(repo_id=repo_id, filename="CR.pt", local_dir=local_dir)
   hf_hub_download(repo_id=repo_id, filename="EscherichiaColi.pt", local_dir=local_dir)
   hf_hub_download(repo_id=repo_id, filename="PC.pt", local_dir=local_dir)
   hf_hub_download(repo_id=repo_id, filename="TK.pt", local_dir=local_dir)
   hf_hub_download(repo_id=repo_id, filename="utils.py", local_dir=local_dir)
   hf_hub_download(repo_id=repo_id, filename="transformer_encoder_MoE.py", local_dir=local_dir)

   # Now utils.py, transformer_encoder_MoE.py, and the model weights are in ./my_RNARL
   ```

2. **Import Model Class:**

   ```python
   # Assuming you are in, or have added, ./my_RNARL on your path
   # Example: if in local_dir
   # import sys
   # sys.path.append("./my_RNARL")
   # from utils import Tokenizer, ActorModel_encoder_esm2

   # Or if you copied utils.py to your current working directory:
   from utils import Tokenizer, ActorModel_encoder_esm2
   ```

3. **Load ESM-2 (Dependency):**
   The model requires the ESM-2 encoder. You'll need to load it separately, typically from the Hugging Face Hub.
   ```python
   from transformers import AutoTokenizer, EsmModel
   import torch

   device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

   esm2_tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
   esm2_model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D").to(device)
   esm2_model.eval()
   esm2_dim = esm2_model.config.hidden_size  # Get the actual dimension
   ```
   *Note:* The original training script used a local path (`./esm2_model_t33_650M_UR50D`). Users loading from the Hub will likely prefer the official Hugging Face repo unless the ESM-2 files are explicitly provided in this repository, which is usually unnecessary since they are already on the Hub.

4. **Instantiate Custom Model and Load Weights:**
   Instantiate `ActorModel_encoder_esm2` with the parameters used during training and load the state dictionary.
   ```python
   # Define the parameters used during training
   d_model = 768
   nhead = 8
   num_encoder_layers = 8
   dim_feedforward = d_model * 2  # or the exact value you used
   dropout = 0.3
   num_experts = 6
   top_k_experts = 2
   # vocab_size needs to match your Tokenizer
   tokenizer = Tokenizer()  # Instantiate your custom tokenizer
   vocab_size = len(tokenizer.tokens)  # Get vocab size from your tokenizer

   # Instantiate the model
   model = ActorModel_encoder_esm2(
       vocab_size=vocab_size,
       d_model=d_model,
       nhead=nhead,
       num_encoder_layers=num_encoder_layers,
       dim_feedforward=dim_feedforward,
       esm2_dim=esm2_dim,  # Use the esm2_model's dimension
       dropout=dropout,
       num_experts=num_experts,
       top_k_experts=top_k_experts,
       device=device
   )

   # Load the state dictionary
   model_weights_path = os.path.join(local_dir, "homo_mrna.pt")
   model.load_state_dict(torch.load(model_weights_path, map_location=device))
   model.to(device)
   model.eval()

   print("Model loaded successfully!")

   # Now you can use the 'model' object for inference
   # Remember you also need your Tokenizer and the ESM-2 tokenizer/model
   ```
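5. **Run Inference (Sketch):**
   The repository does not ship a generation script, so the following is only a minimal sketch of how the pieces fit together. It assumes the `forward(tokenizer_encoded_proteins, esm2_encoded_proteins)` signature defined in `utils.py`, simple greedy (argmax) decoding, and that the ESM-2 tokenizer and the custom `Tokenizer` both add two special tokens so the two encodings have the same length; the example protein sequence is made up.
   ```python
   protein = "MKTAYIAKQR"            # hypothetical example sequence
   max_length = len(protein) + 2    # [START] + residues + [END]

   # Encode with the custom tokenizer and pad to max_length
   ids = tokenizer.pad(tokenizer.encode_pro(protein, max_length), max_length)
   tokenizer_encoded = torch.tensor([ids], device=device)

   # Per-residue ESM-2 embeddings (cls/eos tokens keep the lengths aligned)
   esm_inputs = esm2_tokenizer(protein, return_tensors="pt").to(device)
   with torch.no_grad():
       esm2_encoded = esm2_model(**esm_inputs).last_hidden_state  # [1, seq_len, esm2_dim]
       logits, router_logits, entropy_loss = model(tokenizer_encoded, esm2_encoded)

   # Greedy codon choice at each residue position (skip [START]/[END])
   pred_ids = logits[0, 1:1 + len(protein)].argmax(dim=-1).tolist()
   print(tokenizer.decode(pred_ids))  # RNA CDS, 3 nt per residue
   ```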

## Dependencies

- `torch`
- `transformers`
- `huggingface_hub`
- `pandas`
- `numpy`
- `pytorch-crf` (provides the `torchcrf` module imported by `utils.py`)
- The specific ESM-2 model used (`facebook/esm2_t33_650M_UR50D`, or the one you used).

## License

[Specify your license here, e.g., MIT, Apache 2.0]

## Contact

[Optional: Your email or other contact info]
TK.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3737e4d94a4e266e0c89105a98ddcf8a6e7f444af5ef88810db669cd4ecf084d
size 534069804
homo_circ.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5249399ad29c6939703e79ef65532f23867ba5ce0ae6b3e4c2e4e6075ae18466
size 534071260
homo_mrna.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c3c825975f8deade11bf02d4783d0703d276f1b5ed7b70a406b01e8843255d66
size 534069218
transformer_encoder_MoE.py
ADDED
@@ -0,0 +1,555 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from torch.nn.parallel import parallel_apply
from typing import Tuple, List, Optional, Union
import torch.utils.checkpoint as checkpoint


class MultiHeadAttention(nn.Module):
    """Efficient multi-head attention implementation."""

    def __init__(self, model_dim: int, n_heads: int):
        super().__init__()
        assert model_dim % n_heads == 0, "model_dim must be divisible by n_heads"

        self.model_dim = model_dim
        self.d_k = model_dim // n_heads
        self.n_heads = n_heads

        # A single linear layer computes the Q, K, V projections together to reduce overhead
        self.qkv_linear = nn.Linear(model_dim, 3 * model_dim, bias=False)
        self.out_linear = nn.Linear(model_dim, model_dim, bias=False)

        # Initialize parameters to improve training stability
        nn.init.xavier_uniform_(self.qkv_linear.weight)
        nn.init.xavier_uniform_(self.out_linear.weight)

        self.scale = 1.0 / math.sqrt(self.d_k)

    def forward(self,
                q: torch.Tensor,
                k: torch.Tensor,
                v: torch.Tensor,
                mask: Optional[torch.Tensor] = None,
                key_padding_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        batch_size = q.size(0)

        # If the inputs are identical, use the more efficient self-attention path
        is_self_attention = q.data_ptr() == k.data_ptr() == v.data_ptr()
        if is_self_attention:
            # [batch, seq, 3*dim] -> 3 x [batch, seq, dim]
            qkv = self.qkv_linear(q).chunk(3, dim=-1)
            q, k, v = map(
                lambda x: x.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2),
                qkv
            )
        else:
            # Use separate slices of the fused projection for cross-attention
            q = self.qkv_linear(q)[:, :, :self.model_dim]
            k = self.qkv_linear(k)[:, :, self.model_dim:2*self.model_dim]
            v = self.qkv_linear(v)[:, :, 2*self.model_dim:]

            q = q.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
            k = k.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
            v = v.view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product attention
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale

        # Mask handling (fill value chosen for numerical stability)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -6.0e4)
        if key_padding_mask is not None:
            scores = scores.masked_fill(key_padding_mask.unsqueeze(1).unsqueeze(2), -6.0e4)

        attn_weights = F.softmax(scores, dim=-1)

        # Aggregate the values with the attention weights
        context = torch.matmul(attn_weights, v)

        # Reshape and apply the output projection
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.model_dim)
        output = self.out_linear(context)

        return output


class MoE(nn.Module):
    """Optimized Mixture-of-Experts module with parallel expert computation and efficient expert selection.

    Args:
        d_model (int): Model hidden dimension
        num_experts (int): Number of experts
        d_ff (int): Feed-forward dimension
        dropout (float): Dropout probability
        top_k (int): Number of experts selected per token
    """

    def __init__(self, d_model: int, num_experts: int, d_ff: int, dropout: float, top_k: int):
        super().__init__()
        # Parameter setup
        self.num_experts = num_experts
        self.top_k = min(top_k, num_experts)  # ensure top_k does not exceed the number of experts
        self.d_model = d_model

        # Gating network: maps the input to expert scores [d_model -> num_experts]
        self.gate = nn.Linear(d_model, num_experts, bias=False)

        # Expert networks: list of parallel expert modules
        self.experts = nn.ModuleList([
            nn.Sequential(  # each expert:
                nn.Linear(d_model, d_ff, bias=False),   # [d_model -> d_ff]
                nn.GELU(),                              # activation, no shape change
                nn.Dropout(dropout),                    # no shape change
                nn.Linear(d_ff, d_model, bias=False)    # [d_ff -> d_model]
            ) for _ in range(num_experts)
        ])

        # Parameter initialization
        for expert in self.experts:
            nn.init.kaiming_uniform_(expert[0].weight, a=math.sqrt(5))  # first linear layer
            nn.init.zeros_(expert[3].weight)  # zero-init the output layer, shape [d_ff, d_model]

        nn.init.zeros_(self.gate.weight)  # zero-init the gate, shape [d_model, num_experts]

    def orthogonal_loss(self) -> torch.Tensor:
        """Compute an orthogonality loss between experts to encourage expert diversity.

        Returns:
            torch.Tensor: scalar orthogonality loss
        """
        total_loss = 0.0
        num_pairs = 0

        # First- and last-layer weights of every expert
        # expert_weights_1 shape: [num_experts, d_ff, d_model]
        expert_weights_1 = torch.stack([expert[0].weight for expert in self.experts])
        # expert_weights_2 shape: [num_experts, d_model, d_ff]
        expert_weights_2 = torch.stack([expert[3].weight for expert in self.experts])

        # Orthogonality loss over all pairs of experts
        for i in range(self.num_experts):
            w1_i = expert_weights_1[i]  # [d_ff, d_model]
            w2_i = expert_weights_2[i]  # [d_model, d_ff]

            for j in range(i+1, self.num_experts):
                w1_j = expert_weights_1[j]  # [d_ff, d_model]
                w2_j = expert_weights_2[j]  # [d_model, d_ff]

                # Similarity of the first-layer weights
                w1_sim = torch.sum((w1_i @ w1_j.T)**2) / (w1_i.size(0) * w1_j.size(0))  # scalar
                # Similarity of the second-layer weights
                w2_sim = torch.sum((w2_i.T @ w2_j)**2) / (w2_i.size(1) * w2_j.size(1))  # scalar

                total_loss += (w1_sim + w2_sim) / 2  # mean similarity
                num_pairs += 1

        return total_loss / max(num_pairs, 1)  # average orthogonality loss

    def entropy_regularization_loss(self, routing_probs: torch.Tensor) -> torch.Tensor:
        """Compute an entropy regularization loss that encourages a more uniform routing distribution.

        Args:
            routing_probs (torch.Tensor): routing probabilities, shape [batch*seq_len, num_experts]

        Returns:
            torch.Tensor: scalar entropy loss
        """
        # Numerically stable log
        log_probs = torch.log(torch.clamp(routing_probs, min=1e-6))  # [batch*seq, num_experts]
        # Per-token entropy
        entropy = -torch.sum(routing_probs * log_probs, dim=-1)  # [batch*seq]
        return entropy.mean()  # scalar

    def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """MoE forward pass with efficient expert selection and combination.

        Args:
            hidden_states (torch.Tensor): input tensor, shape [batch_size, seq_len, d_model]

        Returns:
            Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
                - output tensor [batch_size, seq_len, d_model]
                - router logits [batch_size*seq_len, num_experts]
                - scalar entropy regularization loss
        """
        batch_size, seq_len, d_model = hidden_states.shape
        combined_batch_size = batch_size * seq_len
        # Flatten the input for parallel processing
        flat_hidden = hidden_states.reshape(combined_batch_size, d_model)  # [batch*seq, d_model]

        # Routing
        router_logits = self.gate(flat_hidden)            # [batch*seq, num_experts]
        routing_probs = F.softmax(router_logits, dim=-1)  # [batch*seq, num_experts]

        # Select the top-k experts
        routing_weights, selected_experts = torch.topk(routing_probs, self.top_k, dim=-1)  # both [batch*seq, top_k]
        # Normalize the weights
        routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)  # [batch*seq, top_k]

        # Compute all expert outputs in parallel
        flat_expert_inputs = [flat_hidden] * self.num_experts  # num_experts tensors of [batch*seq, d_model]
        expert_outputs = parallel_apply(self.experts, flat_expert_inputs)  # num_experts tensors of [batch*seq, d_model]
        expert_outputs = torch.stack(expert_outputs, dim=1)  # [batch*seq, num_experts, d_model]

        # Build the expert weight matrix
        expert_weights_matrix = torch.zeros(
            combined_batch_size, self.num_experts, device=hidden_states.device
        )  # [batch*seq, num_experts]

        # Aggregate the weights efficiently with scatter_add
        for k in range(self.top_k):
            k_indices = selected_experts[:, k]              # [batch*seq]
            k_weights = routing_weights[:, k].unsqueeze(1)  # [batch*seq, 1]
            # Accumulate the weight at the selected expert position
            expert_weights_matrix.scatter_add_(
                1,
                k_indices.unsqueeze(1),  # [batch*seq, 1]
                k_weights                # [batch*seq, 1]
            )

        # Combine the expert outputs with a batched matrix product
        combined_output = torch.bmm(
            expert_weights_matrix.unsqueeze(1),  # [batch*seq, 1, num_experts]
            expert_outputs                       # [batch*seq, num_experts, d_model]
        ).squeeze(1)  # [batch*seq, d_model]

        # Restore the original shape
        output = combined_output.reshape(batch_size, seq_len, d_model)  # [batch_size, seq_len, d_model]

        # Entropy regularization loss
        entropy_loss = self.entropy_regularization_loss(routing_probs)

        return output, router_logits, entropy_loss


class EncoderLayer(nn.Module):
    """Optimized encoder layer with gradient checkpointing and pre-norm residual connections."""

    def __init__(self, model_dim: int, n_heads: int, ff_hidden_dim: int,
                 dropout: float, num_experts: int, top_k: int):
        super().__init__()
        self.model_dim = model_dim

        # Pre-LayerNorm (Pre-LN) structure for better training stability
        self.norm1 = nn.LayerNorm(model_dim)
        self.norm2 = nn.LayerNorm(model_dim)

        self.self_attn = MultiHeadAttention(model_dim, n_heads)
        self.moe = MoE(model_dim, num_experts, ff_hidden_dim, dropout, top_k)

        self.dropout = nn.Dropout(dropout)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

        # Optional projection layer for residual connections with mismatched sizes
        self.use_projection = False
        if not self.use_projection:
            self.residual_scale = nn.Parameter(torch.ones(1))

    def _sa_block(self, x: torch.Tensor,
                  mask: Optional[torch.Tensor] = None,
                  key_padding_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        """Wrap the self-attention computation so it can be used with gradient checkpointing."""
        x = self.self_attn(x, x, x, mask=mask, key_padding_mask=key_padding_mask)
        return self.dropout1(x)

    def _moe_block(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """Wrap the MoE computation so it can be used with gradient checkpointing."""
        return self.moe(x)

    def forward(self,
                x: torch.Tensor,
                src_mask: Optional[torch.Tensor] = None,
                src_key_padding_mask: Optional[torch.Tensor] = None,
                use_checkpoint: bool = False) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """
        Encoder layer forward pass.

        Args:
            x: input tensor [batch_size, seq_len, model_dim]
            src_mask: source sequence mask
            src_key_padding_mask: padding mask
            use_checkpoint: whether to use gradient checkpointing to save memory
        """
        # Pre-LN
        normalized_x = self.norm1(x)

        # Self-attention block (optionally checkpointed)
        if use_checkpoint and self.training:
            attn_output = checkpoint.checkpoint(
                self._sa_block, normalized_x, src_mask, src_key_padding_mask
            )
        else:
            attn_output = self._sa_block(normalized_x, src_mask, src_key_padding_mask)

        # First residual connection
        x = x + attn_output * self.residual_scale

        # Pre-LN
        normalized_x = self.norm2(x)

        # MoE block (optionally checkpointed)
        if use_checkpoint and self.training:
            moe_output, router_logits, entropy_loss = checkpoint.checkpoint(
                self._moe_block, normalized_x
            )
        else:
            moe_output, router_logits, entropy_loss = self._moe_block(normalized_x)

        # Second residual connection
        x = x + self.dropout2(moe_output) * self.residual_scale

        return x, router_logits, entropy_loss


class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.linear2(self.dropout(self.relu(self.linear1(x))))


class EncoderLayer_nomoe(nn.Module):

    def __init__(self, model_dim: int, n_heads: int, ff_hidden_dim: int,
                 dropout: float):
        super().__init__()

        # Pre-LayerNorm (Pre-LN) structure for better training stability
        self.norm1 = nn.LayerNorm(model_dim)
        self.norm2 = nn.LayerNorm(model_dim)

        self.self_attn = MultiHeadAttention(model_dim, n_heads)
        self.feed_forward = PositionwiseFeedForward(model_dim, ff_hidden_dim, dropout)

        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self,
                x: torch.Tensor,
                src_mask: Optional[torch.Tensor] = None,
                src_key_padding_mask: Optional[torch.Tensor] = None) -> torch.Tensor:

        # Pre-LN
        normalized_x = self.norm1(x)

        attn_output = self.self_attn(normalized_x, normalized_x, normalized_x, src_mask, src_key_padding_mask)

        # First residual connection
        x = x + self.dropout1(attn_output)

        # Pre-LN
        normalized_x = self.norm2(x)

        ff_output = self.feed_forward(normalized_x)

        # Second residual connection
        x = x + self.dropout2(ff_output)

        return x


class PositionalEncoding(nn.Module):
    """Efficient sinusoidal positional encoding."""

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute and cache the positional encodings once
        pe = torch.zeros(1, max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        pe[0, :, 0::2] = torch.sin(position * div_term)
        pe[0, :, 1::2] = torch.cos(position * div_term)

        # Register as a buffer rather than a parameter to save memory
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Add the positional encoding to the input.

        Args:
            x: input tensor [batch_size, seq_len, model_dim]
        """
        pos_encoding = self.pe[:, :x.size(1)]
        x = x + pos_encoding
        return self.dropout(x)


class Encoder(nn.Module):
    """Optimized encoder architecture."""

    def __init__(self,
                 input_dim: int,
                 model_dim: int,
                 n_heads: int,
                 num_layers: int,
                 ff_hidden_dim: int,
                 dropout: float,
                 num_experts: int,
                 top_k: int,
                 if_embedding: bool = True,
                 if_pos_encoding: bool = True,
                 use_checkpointing: bool = False):
        super().__init__()

        self.model_dim = model_dim
        self.num_layers = num_layers
        self.if_embedding = if_embedding
        self.if_pos_encoding = if_pos_encoding
        self.use_checkpointing = use_checkpointing

        # Embedding layer
        if if_embedding:
            self.embedding = nn.Embedding(input_dim, model_dim)
            # Improved embedding initialization
            nn.init.normal_(self.embedding.weight, mean=0, std=model_dim**-0.5)

        # Positional encoding
        if if_pos_encoding:
            self.pos_encoding = PositionalEncoding(model_dim, dropout)

        # Encoder layers
        self.layers = nn.ModuleList([
            EncoderLayer(
                model_dim, n_heads, ff_hidden_dim, dropout, num_experts, top_k
            ) for _ in range(num_layers)
        ])

        # Output normalization
        self.final_norm = nn.LayerNorm(model_dim)

    def forward(self,
                src: torch.Tensor,
                src_mask: Optional[torch.Tensor] = None,
                src_key_padding_mask: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, List, float]:
        """
        Encoder forward pass.

        Args:
            src: input sequence [batch_size, seq_len] or [batch_size, seq_len, model_dim]
            src_mask: source sequence mask
            src_key_padding_mask: padding mask

        Returns:
            tuple: (output tensor, list of router logits, entropy loss)
        """
        # Embedding
        if self.if_embedding:
            x = self.embedding(src) * math.sqrt(self.model_dim)
        else:
            x = src

        # Positional encoding
        if self.if_pos_encoding:
            x = self.pos_encoding(x)

        # Track the entropy loss and router logits
        total_entropy_loss = 0.0
        router_logits_list = []

        # Pass through the encoder layers
        for layer in self.layers:
            x, router_logits, entropy_loss = layer(
                x,
                src_mask=src_mask,
                src_key_padding_mask=src_key_padding_mask,
                use_checkpoint=self.use_checkpointing
            )
            total_entropy_loss += entropy_loss

            # Keep only a CPU copy of the router logits to reduce memory usage
            if not self.training:  # store router logits only at inference time
                router_logits_list.append(router_logits.detach().cpu().tolist())

        # Final layer normalization
        x = self.final_norm(x)

        # Average entropy loss over the layers
        avg_entropy_loss = total_entropy_loss / self.num_layers

        return x, router_logits_list, avg_entropy_loss


class Encoder_nomoe(nn.Module):
    """Optimized encoder architecture (without MoE)."""

    def __init__(self,
                 input_dim: int,
                 model_dim: int,
                 n_heads: int,
                 num_layers: int,
                 ff_hidden_dim: int,
                 dropout: float,
                 if_embedding: bool = True,
                 if_pos_encoding: bool = True):
        super().__init__()

        self.model_dim = model_dim
        self.num_layers = num_layers
        self.if_embedding = if_embedding
        self.if_pos_encoding = if_pos_encoding

        # Embedding layer
        if if_embedding:
            self.embedding = nn.Embedding(input_dim, model_dim)
            # Improved embedding initialization
            nn.init.normal_(self.embedding.weight, mean=0, std=model_dim**-0.5)

        # Positional encoding
        if if_pos_encoding:
            self.pos_encoding = PositionalEncoding(model_dim, dropout)

        # Encoder layers
        self.layers = nn.ModuleList([
            EncoderLayer_nomoe(
                model_dim, n_heads, ff_hidden_dim, dropout
            ) for _ in range(num_layers)
        ])

        # Output normalization
        self.final_norm = nn.LayerNorm(model_dim)

    def forward(self,
                src: torch.Tensor,
                src_mask: Optional[torch.Tensor] = None,
                src_key_padding_mask: Optional[torch.Tensor] = None) -> torch.Tensor:

        # Embedding
        if self.if_embedding:
            x = self.embedding(src) * math.sqrt(self.model_dim)
        else:
            x = src

        # Positional encoding
        if self.if_pos_encoding:
            x = self.pos_encoding(x)

        # Pass through the encoder layers
        for layer in self.layers:
            x = layer(
                x,
                src_mask=src_mask,
                src_key_padding_mask=src_key_padding_mask
            )

        # Final layer normalization
        x = self.final_norm(x)

        return x
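

# ---------------------------------------------------------------------------
# Illustrative smoke test (not part of the original module): a minimal sketch
# of how the Encoder is wired together, using made-up small dimensions. The
# MoE layer relies on torch.nn.parallel.parallel_apply, so the check is only
# run when a CUDA device is available.
# ---------------------------------------------------------------------------
if __name__ == "__main__" and torch.cuda.is_available():
    enc = Encoder(input_dim=100, model_dim=32, n_heads=4, num_layers=2,
                  ff_hidden_dim=64, dropout=0.1, num_experts=4, top_k=2).cuda()
    enc.eval()  # router logits are only collected outside training mode
    tokens = torch.randint(0, 100, (2, 7), device="cuda")          # [batch, seq]
    pad_mask = torch.zeros(2, 7, dtype=torch.bool, device="cuda")  # True marks padded positions
    out, router_logits, entropy_loss = enc(tokens, src_key_padding_mask=pad_mask)
    print(out.shape)  # expected: torch.Size([2, 7, 32])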
utils.py
ADDED
@@ -0,0 +1,844 @@
1 |
+
import numpy as np
|
2 |
+
import torch
|
3 |
+
import torch.nn as nn
|
4 |
+
import torch.nn.functional as F
|
5 |
+
import math
|
6 |
+
from torch.utils.data.distributed import DistributedSampler
|
7 |
+
import torch.optim.lr_scheduler as lr_scheduler
|
8 |
+
from transformer_encoder_MoE import Encoder,Encoder_nomoe
|
9 |
+
from itertools import chain
|
10 |
+
from torch.nn.parallel import parallel_apply
|
11 |
+
from typing import List, Dict, Tuple, Optional, Union
|
12 |
+
from torchcrf import CRF
|
13 |
+
|
14 |
+
|
15 |
+
class Tokenizer:
|
16 |
+
"""处理序列编码和解码的分词器,支持蛋白质序列和mRNA序列。"""
|
17 |
+
|
18 |
+
def __init__(self):
|
19 |
+
# 定义特殊标记和生物序列标记
|
20 |
+
self.special_tokens = ['[START]', '[END]', '[PAD]', '[UNK]', '[SEG]']
|
21 |
+
self.amino_acids = ['A', 'R', 'S', 'I', 'L', 'G', 'V', 'T', 'P', 'N',
|
22 |
+
'D', 'C', 'Q', 'E', 'H', 'K', 'F', 'Y', 'M', 'W', '*']
|
23 |
+
self.protein_alphabet = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I',
|
24 |
+
'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
|
25 |
+
# 生成所有可能的密码子组合
|
26 |
+
self.codons = [''.join([n1, n2, n3]) for n1 in 'UCAG' for n2 in 'UCAG' for n3 in 'UCAG']
|
27 |
+
|
28 |
+
|
29 |
+
|
30 |
+
# 合并所有标记并创建映射
|
31 |
+
self.tokens = self.special_tokens + self.amino_acids + self.codons
|
32 |
+
self.token_to_id = {token: idx for idx, token in enumerate(self.tokens)}
|
33 |
+
self.id_to_token = {idx: token for token, idx in self.token_to_id.items()}
|
34 |
+
|
35 |
+
# 缓存常用的特殊标记索引以提高性能
|
36 |
+
self.padding_idx = self.token_to_id['[PAD]']
|
37 |
+
self.start_idx = self.token_to_id['[START]']
|
38 |
+
self.end_idx = self.token_to_id['[END]']
|
39 |
+
self.unk_idx = self.token_to_id['[UNK]']
|
40 |
+
self.seg_idx = self.token_to_id['[SEG]']
|
41 |
+
|
42 |
+
def encode_pro(self, sequence: str, max_length: int) -> List[int]:
|
43 |
+
"""编码蛋白质序列。
|
44 |
+
|
45 |
+
Args:
|
46 |
+
sequence: 输入的蛋白质序列
|
47 |
+
max_length: 编码后序列的最大长度
|
48 |
+
|
49 |
+
Returns:
|
50 |
+
编码后的ID列表
|
51 |
+
"""
|
52 |
+
# 添加开始标记,并为每个字符获取ID
|
53 |
+
ids = [self.start_idx] + [self.token_to_id.get(token, self.unk_idx) for token in sequence]
|
54 |
+
|
55 |
+
# 处理序列长度并添加结束标记
|
56 |
+
if len(ids) < max_length - 1:
|
57 |
+
ids.append(self.end_idx)
|
58 |
+
else:
|
59 |
+
ids = ids[:max_length-1] + [self.end_idx]
|
60 |
+
|
61 |
+
return ids
|
62 |
+
|
63 |
+
def encode_mrna(self, sequence: str, max_length: int) -> List[int]:
|
64 |
+
"""编码mRNA序列,每三个核苷酸作为一个密码子。
|
65 |
+
|
66 |
+
Args:
|
67 |
+
sequence: 输入的mRNA序列
|
68 |
+
max_length: 编码后序列的最大长度
|
69 |
+
|
70 |
+
Returns:
|
71 |
+
编码后的ID列表
|
72 |
+
"""
|
73 |
+
ids = [self.start_idx]
|
74 |
+
|
75 |
+
# 每三个字符(一个密码子)作为一个单位处理
|
76 |
+
for i in range(0, len(sequence), 3):
|
77 |
+
codon = sequence[i:i+3]
|
78 |
+
if len(codon) == 3 and codon in self.token_to_id:
|
79 |
+
ids.append(self.token_to_id[codon])
|
80 |
+
else:
|
81 |
+
ids.append(self.unk_idx)
|
82 |
+
|
83 |
+
# 处理序列长度并添加结束标记
|
84 |
+
if len(ids) < max_length - 1:
|
85 |
+
ids.append(self.end_idx)
|
86 |
+
else:
|
87 |
+
ids = ids[:max_length-1] + [self.end_idx]
|
88 |
+
|
89 |
+
return ids
|
90 |
+
|
91 |
+
def decode(self, ids: List[int]) -> str:
|
92 |
+
"""将ID序列解码为文本。
|
93 |
+
|
94 |
+
Args:
|
95 |
+
ids: 编码后的ID列表
|
96 |
+
|
97 |
+
Returns:
|
98 |
+
解码后的文本
|
99 |
+
"""
|
100 |
+
return ''.join([self.id_to_token.get(id, '[UNK]') for id in ids])
|
101 |
+
|
102 |
+
def pad(self, ids: List[int], max_length: int) -> List[int]:
|
103 |
+
"""对序列进行填充至指定长度。
|
104 |
+
|
105 |
+
Args:
|
106 |
+
ids: 编码后的ID列表
|
107 |
+
max_length: 目标长度
|
108 |
+
|
109 |
+
Returns:
|
110 |
+
填充后的ID列表
|
111 |
+
"""
|
112 |
+
padding_length = max_length - len(ids)
|
113 |
+
if padding_length > 0:
|
114 |
+
return ids + [self.padding_idx] * padding_length
|
115 |
+
return ids
|
116 |
+
|
117 |
+
|
118 |
+
# 生成密码子表和相关映射
|
119 |
+
class BiologicalMappings:
|
120 |
+
"""生物序列编码的映射工具类。"""
|
121 |
+
|
122 |
+
@staticmethod
|
123 |
+
def get_codon_table() -> Dict[str, str]:
|
124 |
+
"""返回密码子到氨基酸的映射表。"""
|
125 |
+
return {
|
126 |
+
'GCU':'A', 'GCC':'A', 'GCA':'A', 'GCG':'A', 'CGU':'R', 'CGC':'R',
|
127 |
+
'CGA':'R', 'CGG':'R', 'AGA':'R', 'AGG':'R', 'UCU':'S', 'UCC':'S',
|
128 |
+
'UCA':'S', 'UCG':'S', 'AGU':'S', 'AGC':'S', 'AUU':'I', 'AUC':'I',
|
129 |
+
'AUA':'I', 'UUA':'L', 'UUG':'L', 'CUU':'L', 'CUC':'L', 'CUA':'L',
|
130 |
+
'CUG':'L', 'GGU':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G', 'GUU':'V',
|
131 |
+
'GUC':'V', 'GUA':'V', 'GUG':'V', 'ACU':'T', 'ACC':'T', 'ACA':'T',
|
132 |
+
'ACG':'T', 'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P', 'AAU':'N',
|
133 |
+
'AAC':'N', 'GAU':'D', 'GAC':'D', 'UGU':'C', 'UGC':'C', 'CAA':'Q',
|
134 |
+
'CAG':'Q', 'GAA':'E', 'GAG':'E', 'CAU':'H', 'CAC':'H', 'AAA':'K',
|
135 |
+
'AAG':'K', 'UUU':'F', 'UUC':'F', 'UAU':'Y', 'UAC':'Y', 'AUG':'M',
|
136 |
+
'UGG':'W','UAG':'*', 'UGA':'*', 'UAA':'*'}
|
137 |
+
|
138 |
+
@staticmethod
|
139 |
+
def get_amino_acid_to_codon() -> Dict[str, List[str]]:
|
140 |
+
"""返回氨基酸到密码子的映射表。"""
|
141 |
+
return {
|
142 |
+
'A':['GCU','GCC','GCA','GCG'], 'R':['CGU','CGC','CGA','CGG','AGA','AGG'],
|
143 |
+
'S':['UCU','UCC','UCA','UCG','AGU','AGC'],'I':['AUU','AUC','AUA'],
|
144 |
+
'L':['UUA','UUG','CUU','CUC','CUA','CUG'],'G':['GGU','GGC','GGA','GGG'],
|
145 |
+
'V':['GUU','GUC','GUA','GUG'],'T':['ACU','ACC','ACA','ACG'],
|
146 |
+
'P':['CCU','CCC','CCA','CCG'],'N':['AAU','AAC'],'D':['GAU','GAC'],
|
147 |
+
'C':['UGU','UGC'],'Q':['CAA','CAG'],'E':['GAA','GAG'],'H':['CAU','CAC'],
|
148 |
+
'K':['AAA','AAG'],'F':['UUU','UUC'],'Y':['UAU','UAC'],'M':['AUG'],'W':['UGG'],
|
149 |
+
'*':['UAG','UGA','UAA']
|
150 |
+
}
|
151 |
+
|
152 |
+
@staticmethod
|
153 |
+
def create_token_mapping(tokenizer: Tokenizer) -> torch.Tensor:
|
154 |
+
"""创建从密码子令牌到氨基酸令牌的映射张量。
|
155 |
+
|
156 |
+
Args:
|
157 |
+
tokenizer: 用于获取令牌到ID映射的分词器
|
158 |
+
|
159 |
+
Returns:
|
160 |
+
映射张量,索引为密码子ID,值为对应的氨基酸ID
|
161 |
+
"""
|
162 |
+
codon_table = BiologicalMappings.get_codon_table()
|
163 |
+
token_codon_to_amino_acid = torch.full((len(tokenizer.tokens),),
|
164 |
+
tokenizer.unk_idx,
|
165 |
+
dtype=torch.long)
|
166 |
+
|
167 |
+
for codon, amino_acid in codon_table.items():
|
168 |
+
codon_id = tokenizer.token_to_id.get(codon, tokenizer.unk_idx)
|
169 |
+
amino_acid_id = tokenizer.token_to_id.get(amino_acid, tokenizer.unk_idx)
|
170 |
+
token_codon_to_amino_acid[codon_id] = amino_acid_id
|
171 |
+
|
172 |
+
return token_codon_to_amino_acid
|
173 |
+
|
174 |
+
|
175 |
+
class ActorModel_encoder_noesm2(nn.Module):
|
176 |
+
"""基于编码器的Actor模型,用于序列生成任务。"""
|
177 |
+
|
178 |
+
def __init__(self, vocab_size: int, d_model: int, nhead: int,
|
179 |
+
num_encoder_layers: int, dim_feedforward: int, dropout: float,
|
180 |
+
num_experts: int, top_k_experts: int, device: torch.device):
|
181 |
+
"""初始化模型。
|
182 |
+
|
183 |
+
Args:
|
184 |
+
vocab_size: 词汇表大小
|
185 |
+
d_model: 模型维度
|
186 |
+
nhead: 注意力头数
|
187 |
+
num_encoder_layers: 编码器层数
|
188 |
+
dim_feedforward: 前馈网络维度
|
189 |
+
dropout: Dropout率
|
190 |
+
num_experts: 专家数量
|
191 |
+
top_k_experts: 使用的顶部专家数量
|
192 |
+
device: 计算设备
|
193 |
+
"""
|
194 |
+
super(ActorModel_encoder_noesm2, self).__init__()
|
195 |
+
self.device = device
|
196 |
+
|
197 |
+
# 获取生物映射并预计算掩码
|
198 |
+
self.amino_acid_to_codon = BiologicalMappings.get_amino_acid_to_codon()
|
199 |
+
self.precomputed_masks = self._precompute_masks()
|
200 |
+
|
201 |
+
# 创建编码器和输出层
|
202 |
+
self.encoder = Encoder(vocab_size, d_model, nhead, num_encoder_layers,
|
203 |
+
dim_feedforward, dropout, num_experts, top_k_experts)
|
204 |
+
|
205 |
+
# 使用序列化的输出层以提高性能
|
206 |
+
self.mrna_output_layer = nn.Sequential(
|
207 |
+
nn.Linear(d_model, d_model//2),
|
208 |
+
nn.LayerNorm(d_model//2),
|
209 |
+
nn.ReLU(),
|
210 |
+
nn.Dropout(dropout),
|
211 |
+
nn.Linear(d_model//2, vocab_size)
|
212 |
+
)
|
213 |
+
|
214 |
+
|
215 |
+
|
216 |
+
def _precompute_masks(self) -> Dict[int, torch.Tensor]:
|
217 |
+
"""预计算每个氨基酸对应的密码子掩码,以提高性能。"""
|
218 |
+
tokenizer = Tokenizer() # 创建分词器实例
|
219 |
+
masks = {}
|
220 |
+
|
221 |
+
for amino_acid, codons in self.amino_acid_to_codon.items():
|
222 |
+
amino_acid_id = tokenizer.token_to_id.get(amino_acid, tokenizer.unk_idx)
|
223 |
+
mask = torch.zeros(len(tokenizer.tokens), dtype=torch.bool, device=self.device)
|
224 |
+
|
225 |
+
for codon in codons:
|
226 |
+
codon_id = tokenizer.token_to_id.get(codon, tokenizer.unk_idx)
|
227 |
+
if codon_id != tokenizer.unk_idx:
|
228 |
+
mask[codon_id] = True
|
229 |
+
|
230 |
+
masks[amino_acid_id] = mask
|
231 |
+
|
232 |
+
return masks
|
233 |
+
|
234 |
+
def forward(self, tokenizer_encoded_proteins: torch.Tensor) -> Tuple[torch.Tensor, list, torch.Tensor]:
|
235 |
+
"""模型前向传播。
|
236 |
+
|
237 |
+
Args:
|
238 |
+
tokenizer_encoded_proteins: 编码后的蛋白质序列,形状为(batch_size, seq_len)
|
239 |
+
|
240 |
+
Returns:
|
241 |
+
logits: 输出逻辑值,表示模型预测
|
242 |
+
router_logits_list: 路由器逻辑值列表
|
243 |
+
entropy_loss: 熵损失
|
244 |
+
"""
|
245 |
+
# 创建源序列的填充掩码
|
246 |
+
tokenizer = Tokenizer() # 创建分词器实例
|
247 |
+
src_padding_mask = (tokenizer_encoded_proteins == tokenizer.padding_idx)
|
248 |
+
|
249 |
+
# 通过编码器处理
|
250 |
+
x, router_logits_list, entropy_loss = self.encoder(
|
251 |
+
tokenizer_encoded_proteins,
|
252 |
+
src_key_padding_mask=src_padding_mask
|
253 |
+
)
|
254 |
+
|
255 |
+
# 为批次中的每个项目和序列位置生成掩码
|
256 |
+
batch_size, seq_len = tokenizer_encoded_proteins.shape
|
257 |
+
|
258 |
+
# 使用索引查询预计算的掩码,通过广播优化性能
|
259 |
+
amino_acid_to_codon_mask = torch.stack([
|
260 |
+
self.precomputed_masks.get(
|
261 |
+
tok.item(),
|
262 |
+
torch.zeros(len(tokenizer.tokens), dtype=torch.bool, device=self.device)
|
263 |
+
)
|
264 |
+
for tok in tokenizer_encoded_proteins.reshape(-1)
|
265 |
+
]).view(batch_size, seq_len, -1)
|
266 |
+
|
267 |
+
# 计算输出逻辑值并应用掩码
|
268 |
+
mrna_logits = self.mrna_output_layer(x)
|
269 |
+
|
270 |
+
|
271 |
+
# 使用masking而不是scatter来提高性能
|
272 |
+
mrna_logits = mrna_logits.masked_fill(~amino_acid_to_codon_mask, -6.0e4)
|
273 |
+
|
274 |
+
return mrna_logits, router_logits_list, entropy_loss
|
275 |
+
|
276 |
+
class ActorModel_encoder_esm2(nn.Module):
|
277 |
+
"""基于编码器的Actor模型,用于序列生成任务。"""
|
278 |
+
|
279 |
+
def __init__(self, vocab_size: int, d_model: int, nhead: int,
|
280 |
+
num_encoder_layers: int, dim_feedforward: int, esm2_dim: int,dropout: float,
|
281 |
+
num_experts: int, top_k_experts: int, device: torch.device):
|
282 |
+
|
283 |
+
super(ActorModel_encoder_esm2, self).__init__()
|
284 |
+
self.device = device
|
285 |
+
|
286 |
+
# 获取生物映射并预计算掩码
|
287 |
+
self.amino_acid_to_codon = BiologicalMappings.get_amino_acid_to_codon()
|
288 |
+
self.precomputed_masks = self._precompute_masks()
|
289 |
+
|
290 |
+
self.dim_trans=nn.Linear(esm2_dim, d_model)
|
291 |
+
# 创建编码器和输出层
|
292 |
+
self.encoder = Encoder(vocab_size, d_model, nhead, num_encoder_layers,
|
293 |
+
dim_feedforward, dropout, num_experts, top_k_experts,if_embedding=False,if_pos_encoding=False)
|
294 |
+
|
295 |
+
# 使用序列化的输出层以提高性能
|
296 |
+
self.mrna_output_layer = nn.Sequential(
|
297 |
+
nn.Linear(d_model, d_model//2),
|
298 |
+
nn.LayerNorm(d_model//2),
|
299 |
+
nn.ReLU(),
|
300 |
+
nn.Dropout(dropout),
|
301 |
+
nn.Linear(d_model//2, vocab_size)
|
302 |
+
)
|
303 |
+
|
304 |
+
|
305 |
+
|
306 |
+
|
307 |
+
def _precompute_masks(self) -> Dict[int, torch.Tensor]:
|
308 |
+
"""预计算每个氨基酸对应的密码子掩码,以提高性能。"""
|
309 |
+
tokenizer = Tokenizer() # 创建分词器实例
|
310 |
+
masks = {}
|
311 |
+
|
312 |
+
for amino_acid, codons in self.amino_acid_to_codon.items():
|
313 |
+
amino_acid_id = tokenizer.token_to_id.get(amino_acid, tokenizer.unk_idx)
|
314 |
+
mask = torch.zeros(len(tokenizer.tokens), dtype=torch.bool, device=self.device)
|
315 |
+
|
316 |
+
for codon in codons:
|
317 |
+
codon_id = tokenizer.token_to_id.get(codon, tokenizer.unk_idx)
|
318 |
+
if codon_id != tokenizer.unk_idx:
|
319 |
+
mask[codon_id] = True
|
320 |
+
|
321 |
+
masks[amino_acid_id] = mask
|
322 |
+
|
323 |
+
return masks
|
324 |
+
|
325 |
+
def forward(self, tokenizer_encoded_proteins,esm2_encoded_proteins) -> Tuple[torch.Tensor, list, torch.Tensor]:
|
326 |
+
|
327 |
+
# 创建源序列的填充掩码
|
328 |
+
tokenizer = Tokenizer() # 创建分词器实例
|
329 |
+
src_padding_mask = (tokenizer_encoded_proteins == tokenizer.padding_idx)
|
330 |
+
|
331 |
+
# 通过编码器处理
|
332 |
+
x=self.dim_trans(esm2_encoded_proteins)
|
333 |
+
|
334 |
+
x, router_logits_list, entropy_loss = self.encoder(
|
335 |
+
x,
|
336 |
+
src_key_padding_mask=src_padding_mask
|
337 |
+
)
|
338 |
+
|
339 |
+
# 为批次中的每个项目和序列位置生成掩码
|
340 |
+
batch_size, seq_len = tokenizer_encoded_proteins.shape
|
341 |
+
|
342 |
+
# 使用索引查询预计算的掩码,通过广播优化性能
|
343 |
+
amino_acid_to_codon_mask = torch.stack([
|
344 |
+
self.precomputed_masks.get(
|
345 |
+
tok.item(),
|
346 |
+
torch.zeros(len(tokenizer.tokens), dtype=torch.bool, device=self.device)
|
347 |
+
)
|
348 |
+
for tok in tokenizer_encoded_proteins.reshape(-1)
|
349 |
+
]).view(batch_size, seq_len, -1)
|
350 |
+
|
351 |
+
# 计算输出逻辑值并应用掩码
|
352 |
+
mrna_logits = self.mrna_output_layer(x)
|
353 |
+
|
354 |
+
|
355 |
+
# 使用masking而不是scatter来提高性能
|
356 |
+
mrna_logits = mrna_logits.masked_fill(~amino_acid_to_codon_mask, -6.0e4)
|
357 |
+
|
358 |
+
return mrna_logits, router_logits_list, entropy_loss
|
359 |
+
|
360 |
+
def get_embedding(self, tokenizer_encoded_proteins,esm2_encoded_proteins):
|
361 |
+
|
362 |
+
# 创建源序列的填充掩码
|
363 |
+
tokenizer = Tokenizer() # 创建分词器实例
|
364 |
+
src_padding_mask = (tokenizer_encoded_proteins == tokenizer.padding_idx)
|
365 |
+
|
366 |
+
# 通过编码器处理
|
367 |
+
x=self.dim_trans(esm2_encoded_proteins)
|
368 |
+
|
369 |
+
x, router_logits_list, entropy_loss = self.encoder(
|
370 |
+
x,
|
371 |
+
src_key_padding_mask=src_padding_mask
|
372 |
+
)
|
373 |
+
return x
|
374 |
+
def get_router_logits(self, tokenizer_encoded_proteins,esm2_encoded_proteins):
|
375 |
+
|
376 |
+
# 创建源序列的填充掩码
|
377 |
+
tokenizer = Tokenizer() # 创建分词器实例
|
378 |
+
src_padding_mask = (tokenizer_encoded_proteins == tokenizer.padding_idx)
|
379 |
+
|
380 |
+
# 通过编码器处理
|
381 |
+
x=self.dim_trans(esm2_encoded_proteins)
|
382 |
+
|
383 |
+
x, router_logits_list, entropy_loss = self.encoder(
|
384 |
+
x,
|
385 |
+
src_key_padding_mask=src_padding_mask
|
386 |
+
)
|
387 |
+
return router_logits_list
|
388 |
+
|
389 |
+
class ActorModel_encoder_nomoe(nn.Module):
|
390 |
+
"""基于编码器的Actor模型,用于序列生成任务。"""
|
391 |
+
|
392 |
+
def __init__(self, vocab_size: int, d_model: int, nhead: int,
|
393 |
+
num_encoder_layers: int, dim_feedforward: int, esm2_dim: int,dropout: float, device: torch.device):
|
394 |
+
super(ActorModel_encoder_nomoe, self).__init__()
|
395 |
+
self.device = device
|
396 |
+
|
397 |
+
# 获取生物映射并预计算掩码
|
398 |
+
self.amino_acid_to_codon = BiologicalMappings.get_amino_acid_to_codon()
|
399 |
+
self.precomputed_masks = self._precompute_masks()
|
400 |
+
|
401 |
+
self.dim_trans=nn.Linear(esm2_dim, d_model)
|
402 |
+
# 创建编码器和输出层
|
403 |
+
self.encoder = Encoder_nomoe(vocab_size, d_model, nhead, num_encoder_layers,
|
404 |
+
dim_feedforward, dropout,if_embedding=False,if_pos_encoding=False)
|
405 |
+
|
406 |
+
# 使用序列化的输出层以提高性能
|
407 |
+
self.output_layer = nn.Sequential(
|
408 |
+
nn.Linear(d_model, d_model//2),
|
409 |
+
nn.LayerNorm(d_model//2),
|
410 |
+
nn.ReLU(),
|
411 |
+
nn.Dropout(dropout),
|
412 |
+
nn.Linear(d_model//2, vocab_size)
|
413 |
+
)
|
414 |
+
|
415 |
+
def _precompute_masks(self) -> Dict[int, torch.Tensor]:
|
416 |
+
"""预计算每个氨基酸对应的密码子掩码,以提高性能。"""
|
417 |
+
tokenizer = Tokenizer() # 创建分词器实例
|
418 |
+
masks = {}
|
419 |
+
|
420 |
+
for amino_acid, codons in self.amino_acid_to_codon.items():
|
421 |
+
amino_acid_id = tokenizer.token_to_id.get(amino_acid, tokenizer.unk_idx)
|
422 |
+
mask = torch.zeros(len(tokenizer.tokens), dtype=torch.bool, device=self.device)
|
423 |
+
|
424 |
+
for codon in codons:
|
425 |
+
codon_id = tokenizer.token_to_id.get(codon, tokenizer.unk_idx)
|
426 |
+
if codon_id != tokenizer.unk_idx:
|
427 |
+
mask[codon_id] = True
|
428 |
+
|
429 |
+
masks[amino_acid_id] = mask
|
430 |
+
|
431 |
+
return masks
|
432 |
+
|
433 |
+
def forward(self, tokenizer_encoded_proteins,esm2_encoded_proteins):
|
434 |
+
"""模型前向传播。
|
435 |
+
|
436 |
+
Args:
|
437 |
+
tokenizer_encoded_proteins: 编码后的蛋白质序列,形状为(batch_size, seq_len)
|
438 |
+
|
439 |
+
Returns:
|
440 |
+
logits: 输出逻辑值,表示模型预测
|
441 |
+
router_logits_list: 路由器逻辑值列表
|
442 |
+
entropy_loss: 熵损失
|
443 |
+
"""
|
444 |
+
# 创建源序列的填充掩码
|
445 |
+
tokenizer = Tokenizer() # 创建分词器实例
|
446 |
+
src_padding_mask = (tokenizer_encoded_proteins == tokenizer.padding_idx)
|
447 |
+
|
448 |
+
x=self.dim_trans(esm2_encoded_proteins)
|
449 |
+
|
450 |
+
# 通过编码器处理
|
451 |
+
x= self.encoder(
|
452 |
+
x,
|
453 |
+
src_key_padding_mask=src_padding_mask
|
454 |
+
)
|
455 |
+
|
456 |
+
# 为批次中的每个项目和序列位置生成掩码
|
457 |
+
batch_size, seq_len = tokenizer_encoded_proteins.shape
|
458 |
+
|
459 |
+
# 使用索引查询预计算的掩码,通过广播优化性能
|
460 |
+
amino_acid_to_codon_mask = torch.stack([
|
461 |
+
self.precomputed_masks.get(
|
462 |
+
tok.item(),
|
463 |
+
torch.zeros(len(tokenizer.tokens), dtype=torch.bool, device=self.device)
|
464 |
+
)
|
465 |
+
for tok in tokenizer_encoded_proteins.reshape(-1)
|
466 |
+
]).view(batch_size, seq_len, -1)
|
467 |
+
|
468 |
+
# 计算输出逻辑值并应用掩码
|
469 |
+
logits = self.output_layer(x)
|
470 |
+
|
471 |
+
# 使用masking而不是scatter来提高性能
|
472 |
+
logits = logits.masked_fill(~amino_acid_to_codon_mask, -6.0e4)
|
473 |
+
|
474 |
+
return logits
|
475 |
+
|
476 |
+
class RewardModel_encoder(nn.Module):
|
477 |
+
def __init__(self, vocab_size, d_model, nhead, num_encoder_layers, dim_feedforward,dropout,num_experts,top_k_experts,device):
|
478 |
+
super(RewardModel_encoder, self).__init__()
|
479 |
+
self.tokenizer=Tokenizer()
|
480 |
+
self.device=device
|
481 |
+
|
482 |
+
self.encoder = Encoder(vocab_size, d_model, nhead, num_encoder_layers,
|
483 |
+
dim_feedforward, dropout, num_experts, top_k_experts)
|
484 |
+
self.reward_output_layer = nn.Sequential(
|
485 |
+
nn.Linear(d_model, d_model//2),
|
486 |
+
nn.LayerNorm(d_model//2), # 对线性层的输出进行归一化
|
487 |
+
nn.ReLU(),
|
488 |
+
nn.Dropout(dropout),
|
489 |
+
nn.Linear(d_model//2, 1)
|
490 |
+
)
|
491 |
+
|
492 |
+
|
493 |
+
def forward(self, tokenizer_encoded_mrnas):
|
494 |
+
|
495 |
+
src_padding_mask = (tokenizer_encoded_mrnas==self.tokenizer.padding_idx)
|
496 |
+
|
497 |
+
x,router_logits_list,entropy_loss = self.encoder(tokenizer_encoded_mrnas, src_key_padding_mask=src_padding_mask)
|
498 |
+
|
499 |
+
|
500 |
+
reward=self.reward_output_layer(x)
|
501 |
+
reward=reward[:,0,:].squeeze()
|
502 |
+
|
503 |
+
return reward,router_logits_list,entropy_loss
|
504 |
+
|
505 |
+
|
506 |
+
|
507 |
+
class LengthAwareDistributedSampler_human(DistributedSampler):
|
508 |
+
def __init__(self, dataset, lengths, data_num_rat=None,num_replicas=None, rank=None, shuffle=True):
|
509 |
+
super().__init__(dataset, num_replicas=num_replicas, rank=rank, shuffle=shuffle)
|
510 |
+
|
511 |
+
self.lengths = lengths # 每个样本的长度列表
|
512 |
+
self.weights = self.calculate_weights() # 根据长度初始化权重
|
513 |
+
self.data_num_rat=data_num_rat
|
514 |
+
self.total_size = int(len(dataset) * data_num_rat)
|
515 |
+
|
516 |
+
def calculate_weights(self):
|
517 |
+
# 分段式加权策略
|
518 |
+
weights = np.ones(len(self.lengths))
|
519 |
+
weights[np.array(self.lengths) >= 1300] = 85.64*200
|
520 |
+
weights[(np.array(self.lengths) >= 1200) & (np.array(self.lengths) < 1300)] = 5.02*200
|
521 |
+
weights[(np.array(self.lengths) >= 1100) & (np.array(self.lengths) < 1200)] = 4.36*100
|
522 |
+
weights[(np.array(self.lengths) >= 1000) & (np.array(self.lengths) < 1100)] = 3.63*100
|
523 |
+
weights[(np.array(self.lengths) >= 900) & (np.array(self.lengths) < 1000)] = 3.15
|
524 |
+
weights[(np.array(self.lengths) >= 800) & (np.array(self.lengths) < 900)] = 2.20
|
525 |
+
weights[(np.array(self.lengths) >= 700) & (np.array(self.lengths) < 800)] = 1.64
|
526 |
+
weights[(np.array(self.lengths) >= 600) & (np.array(self.lengths) < 700)] = 1.36
|
527 |
+
weights[(np.array(self.lengths) >= 500) & (np.array(self.lengths) < 600)] = 1.0
|
528 |
+
weights[(np.array(self.lengths) >= 400) & (np.array(self.lengths) < 500)] = 0.75
|
529 |
+
weights[(np.array(self.lengths) >= 300) & (np.array(self.lengths) < 400)] = 0.63
|
530 |
+
weights[(np.array(self.lengths) >= 200) & (np.array(self.lengths) < 300)] = 0.60
|
531 |
+
weights[(np.array(self.lengths) >= 100) & (np.array(self.lengths) < 200)] = 0.71
|
532 |
+
weights[np.array(self.lengths) < 100] = 3.68*100
|
533 |
+
|
534 |
+
return weights / np.sum(weights) # 将权重归一化
|
535 |
+
|
536 |
+
def __iter__(self):
|
537 |
+
# 根据加权采样进行索引选择
|
538 |
+
indices = np.random.choice(len(self.dataset), self.total_size, replace=True, p=self.weights)
|
539 |
+
|
540 |
+
# 边界处理:截断到可以整除 num_replicas 的长度
|
541 |
+
total_size_local = (len(indices) // self.num_replicas) * self.num_replicas
|
542 |
+
indices = indices[:total_size_local] # 截断多余的样本
|
543 |
+
|
544 |
+
# 将样本分配给不同进程
|
545 |
+
indices = indices[self.rank:total_size_local:self.num_replicas]
|
546 |
+
|
547 |
+
if self.shuffle:
|
548 |
+
np.random.shuffle(indices)
|
549 |
+
|
550 |
+
return iter(indices.tolist())
|
551 |
+
|
552 |
+
def set_epoch(self, epoch):
|
553 |
+
super().set_epoch(epoch)
|
554 |
+

class LengthAwareDistributedSampler_Arabidopsis(DistributedSampler):
    def __init__(self, dataset, lengths, data_num_rat=None, num_replicas=None, rank=None, shuffle=True):
        super().__init__(dataset, num_replicas=num_replicas, rank=rank, shuffle=shuffle)

        self.lengths = lengths  # list of per-sample lengths
        self.weights = self.calculate_weights()  # initialize the sampling weights from the lengths
        self.data_num_rat = data_num_rat
        self.total_size = int(len(dataset) * data_num_rat)

    def calculate_weights(self):
        # piecewise weighting strategy
        weights = np.ones(len(self.lengths))
        weights[np.array(self.lengths) >= 1300] = 630.75 * 20
        weights[(np.array(self.lengths) >= 1200) & (np.array(self.lengths) < 1300)] = 17.05 * 20
        weights[(np.array(self.lengths) >= 1100) & (np.array(self.lengths) < 1200)] = 11.52 * 20
        weights[(np.array(self.lengths) >= 1000) & (np.array(self.lengths) < 1100)] = 7.17 * 10
        weights[(np.array(self.lengths) >= 900) & (np.array(self.lengths) < 1000)] = 5.56 * 10
        weights[(np.array(self.lengths) >= 800) & (np.array(self.lengths) < 900)] = 3.54
        weights[(np.array(self.lengths) >= 700) & (np.array(self.lengths) < 800)] = 2.51
        weights[(np.array(self.lengths) >= 600) & (np.array(self.lengths) < 700)] = 1.62
        weights[(np.array(self.lengths) >= 500) & (np.array(self.lengths) < 600)] = 1.0
        weights[(np.array(self.lengths) >= 400) & (np.array(self.lengths) < 500)] = 0.68
        weights[(np.array(self.lengths) >= 300) & (np.array(self.lengths) < 400)] = 0.49
        weights[(np.array(self.lengths) >= 200) & (np.array(self.lengths) < 300)] = 0.49
        weights[(np.array(self.lengths) >= 100) & (np.array(self.lengths) < 200)] = 0.49
        weights[np.array(self.lengths) < 100] = 1.23 * 10

        return weights / np.sum(weights)  # normalize the weights

    def __iter__(self):
        # select indices by weighted sampling
        indices = np.random.choice(len(self.dataset), self.total_size, replace=True, p=self.weights)

        # boundary handling: truncate to a length divisible by num_replicas
        total_size_local = (len(indices) // self.num_replicas) * self.num_replicas
        indices = indices[:total_size_local]  # drop the leftover samples

        # assign the samples to the different processes
        indices = indices[self.rank:total_size_local:self.num_replicas]

        if self.shuffle:
            np.random.shuffle(indices)

        return iter(indices.tolist())

    def set_epoch(self, epoch):
        super().set_epoch(epoch)


class LengthAwareDistributedSampler_CR(DistributedSampler):
    def __init__(self, dataset, lengths, data_num_rat=None, num_replicas=None, rank=None, shuffle=True):
        super().__init__(dataset, num_replicas=num_replicas, rank=rank, shuffle=shuffle)

        self.lengths = lengths  # list of per-sample lengths
        self.weights = self.calculate_weights()  # initialize the sampling weights from the lengths
        self.data_num_rat = data_num_rat
        self.total_size = int(len(dataset) * data_num_rat)

    def calculate_weights(self):
        # piecewise weighting strategy
        weights = np.ones(len(self.lengths))
        weights[np.array(self.lengths) >= 1300] = 61.55 * 20
        weights[(np.array(self.lengths) >= 1200) & (np.array(self.lengths) < 1300)] = 3.66 * 20
        weights[(np.array(self.lengths) >= 1100) & (np.array(self.lengths) < 1200)] = 2.96 * 10
        weights[(np.array(self.lengths) >= 1000) & (np.array(self.lengths) < 1100)] = 2.54 * 10
        weights[(np.array(self.lengths) >= 900) & (np.array(self.lengths) < 1000)] = 2.11 * 10
        weights[(np.array(self.lengths) >= 800) & (np.array(self.lengths) < 900)] = 1.79
        weights[(np.array(self.lengths) >= 700) & (np.array(self.lengths) < 800)] = 1.39
        weights[(np.array(self.lengths) >= 600) & (np.array(self.lengths) < 700)] = 1.11
        weights[(np.array(self.lengths) >= 500) & (np.array(self.lengths) < 600)] = 1.0
        weights[(np.array(self.lengths) >= 400) & (np.array(self.lengths) < 500)] = 0.82
        weights[(np.array(self.lengths) >= 300) & (np.array(self.lengths) < 400)] = 0.73
        weights[(np.array(self.lengths) >= 200) & (np.array(self.lengths) < 300)] = 0.67
        weights[(np.array(self.lengths) >= 100) & (np.array(self.lengths) < 200)] = 0.66
        weights[np.array(self.lengths) < 100] = 1.18 * 10

        return weights / np.sum(weights)  # normalize the weights

    def __iter__(self):
        # select indices by weighted sampling
        indices = np.random.choice(len(self.dataset), self.total_size, replace=True, p=self.weights)

        # boundary handling: truncate to a length divisible by num_replicas
        total_size_local = (len(indices) // self.num_replicas) * self.num_replicas
        indices = indices[:total_size_local]  # drop the leftover samples

        # assign the samples to the different processes
        indices = indices[self.rank:total_size_local:self.num_replicas]

        if self.shuffle:
            np.random.shuffle(indices)

        return iter(indices.tolist())

    def set_epoch(self, epoch):
        super().set_epoch(epoch)


class LengthAwareDistributedSampler_PC(DistributedSampler):
    def __init__(self, dataset, lengths, data_num_rat=None, num_replicas=None, rank=None, shuffle=True):
        super().__init__(dataset, num_replicas=num_replicas, rank=rank, shuffle=shuffle)

        self.lengths = lengths  # list of per-sample lengths
        self.weights = self.calculate_weights()  # initialize the sampling weights from the lengths
        self.data_num_rat = data_num_rat
        self.total_size = int(len(dataset) * data_num_rat)

    def calculate_weights(self):
        # piecewise weighting strategy
        weights = np.ones(len(self.lengths))
        weights[np.array(self.lengths) >= 1300] = 318.0 * 200
        weights[(np.array(self.lengths) >= 1200) & (np.array(self.lengths) < 1300)] = 13.98 * 200
        weights[(np.array(self.lengths) >= 1100) & (np.array(self.lengths) < 1200)] = 10.26 * 100
        weights[(np.array(self.lengths) >= 1000) & (np.array(self.lengths) < 1100)] = 7.62 * 100
        weights[(np.array(self.lengths) >= 900) & (np.array(self.lengths) < 1000)] = 6.14 * 100
        weights[(np.array(self.lengths) >= 800) & (np.array(self.lengths) < 900)] = 3.80
        weights[(np.array(self.lengths) >= 700) & (np.array(self.lengths) < 800)] = 2.67
        weights[(np.array(self.lengths) >= 600) & (np.array(self.lengths) < 700)] = 1.88
        weights[(np.array(self.lengths) >= 500) & (np.array(self.lengths) < 600)] = 1.0
        weights[(np.array(self.lengths) >= 400) & (np.array(self.lengths) < 500)] = 0.88
        weights[(np.array(self.lengths) >= 300) & (np.array(self.lengths) < 400)] = 0.75
        weights[(np.array(self.lengths) >= 200) & (np.array(self.lengths) < 300)] = 0.76
        weights[(np.array(self.lengths) >= 100) & (np.array(self.lengths) < 200)] = 0.83
        weights[np.array(self.lengths) < 100] = 1.87 * 100

        return weights / np.sum(weights)  # normalize the weights

    def __iter__(self):
        # select indices by weighted sampling
        indices = np.random.choice(len(self.dataset), self.total_size, replace=True, p=self.weights)

        # boundary handling: truncate to a length divisible by num_replicas
        total_size_local = (len(indices) // self.num_replicas) * self.num_replicas
        indices = indices[:total_size_local]  # drop the leftover samples

        # assign the samples to the different processes
        indices = indices[self.rank:total_size_local:self.num_replicas]

        if self.shuffle:
            np.random.shuffle(indices)

        return iter(indices.tolist())

    def set_epoch(self, epoch):
        super().set_epoch(epoch)


class LengthAwareDistributedSampler_EscherichiaColi(DistributedSampler):
    def __init__(self, dataset, lengths, data_num_rat=None, num_replicas=None, rank=None, shuffle=True):
        super().__init__(dataset, num_replicas=num_replicas, rank=rank, shuffle=shuffle)

        self.lengths = lengths  # list of per-sample lengths
        self.weights = self.calculate_weights()  # initialize the sampling weights from the lengths
        self.data_num_rat = data_num_rat
        self.total_size = int(len(dataset) * data_num_rat)

    def calculate_weights(self):
        # piecewise weighting strategy
        weights = np.ones(len(self.lengths))
        weights[np.array(self.lengths) >= 1300] = 211.0 * 200
        weights[(np.array(self.lengths) >= 1200) & (np.array(self.lengths) < 1300)] = 26.38 * 200
        weights[(np.array(self.lengths) >= 1100) & (np.array(self.lengths) < 1200)] = 15.07 * 100
        weights[(np.array(self.lengths) >= 1000) & (np.array(self.lengths) < 1100)] = 11.72 * 100
        weights[(np.array(self.lengths) >= 900) & (np.array(self.lengths) < 1000)] = 11.11 * 100
        weights[(np.array(self.lengths) >= 800) & (np.array(self.lengths) < 900)] = 4.06
        weights[(np.array(self.lengths) >= 700) & (np.array(self.lengths) < 800)] = 2.81
        weights[(np.array(self.lengths) >= 600) & (np.array(self.lengths) < 700)] = 2.07
        weights[(np.array(self.lengths) >= 500) & (np.array(self.lengths) < 600)] = 1.0
        weights[(np.array(self.lengths) >= 400) & (np.array(self.lengths) < 500)] = 0.46
        weights[(np.array(self.lengths) >= 300) & (np.array(self.lengths) < 400)] = 0.30
        weights[(np.array(self.lengths) >= 200) & (np.array(self.lengths) < 300)] = 0.25
        weights[(np.array(self.lengths) >= 100) & (np.array(self.lengths) < 200)] = 0.25
        weights[np.array(self.lengths) < 100] = 0.47

        return weights / np.sum(weights)  # normalize the weights

    def __iter__(self):
        # select indices by weighted sampling
        indices = np.random.choice(len(self.dataset), self.total_size, replace=True, p=self.weights)

        # boundary handling: truncate to a length divisible by num_replicas
        total_size_local = (len(indices) // self.num_replicas) * self.num_replicas
        indices = indices[:total_size_local]  # drop the leftover samples

        # assign the samples to the different processes
        indices = indices[self.rank:total_size_local:self.num_replicas]

        if self.shuffle:
            np.random.shuffle(indices)

        return iter(indices.tolist())

    def set_epoch(self, epoch):
        super().set_epoch(epoch)

class LengthAwareDistributedSampler_TK(DistributedSampler):
    def __init__(self, dataset, lengths, data_num_rat=None, num_replicas=None, rank=None, shuffle=True):
        super().__init__(dataset, num_replicas=num_replicas, rank=rank, shuffle=shuffle)

        self.lengths = lengths  # list of per-sample lengths
        self.weights = self.calculate_weights()  # initialize the sampling weights from the lengths
        self.data_num_rat = data_num_rat
        self.total_size = int(len(dataset) * data_num_rat)

    def calculate_weights(self):
        # piecewise weighting strategy
        weights = np.ones(len(self.lengths))
        # note: unlike the other samplers, there is no bucket for lengths >= 1300,
        # so those samples keep the default weight of 1
        weights[(np.array(self.lengths) >= 1200) & (np.array(self.lengths) < 1300)] = 12.25 * 10
        weights[(np.array(self.lengths) >= 1100) & (np.array(self.lengths) < 1200)] = 8.17 * 10
        weights[(np.array(self.lengths) >= 1000) & (np.array(self.lengths) < 1100)] = 24.5 * 10
        weights[(np.array(self.lengths) >= 900) & (np.array(self.lengths) < 1000)] = 8.17 * 10
        weights[(np.array(self.lengths) >= 800) & (np.array(self.lengths) < 900)] = 3.27
        weights[(np.array(self.lengths) >= 700) & (np.array(self.lengths) < 800)] = 2.33
        weights[(np.array(self.lengths) >= 600) & (np.array(self.lengths) < 700)] = 1.09
        weights[(np.array(self.lengths) >= 500) & (np.array(self.lengths) < 600)] = 1.0
        weights[(np.array(self.lengths) >= 400) & (np.array(self.lengths) < 500)] = 0.25
        weights[(np.array(self.lengths) >= 300) & (np.array(self.lengths) < 400)] = 0.17
        weights[(np.array(self.lengths) >= 200) & (np.array(self.lengths) < 300)] = 0.13
        weights[(np.array(self.lengths) >= 100) & (np.array(self.lengths) < 200)] = 0.10
        weights[np.array(self.lengths) < 100] = 0.22

        return weights / np.sum(weights)  # normalize the weights

    def __iter__(self):
        # select indices by weighted sampling
        indices = np.random.choice(len(self.dataset), self.total_size, replace=True, p=self.weights)

        # boundary handling: truncate to a length divisible by num_replicas
        total_size_local = (len(indices) // self.num_replicas) * self.num_replicas
        indices = indices[:total_size_local]  # drop the leftover samples

        # assign the samples to the different processes
        indices = indices[self.rank:total_size_local:self.num_replicas]

        if self.shuffle:
            np.random.shuffle(indices)

        return iter(indices.tolist())

    def set_epoch(self, epoch):
        super().set_epoch(epoch)

class LengthAwareDistributedSampler_human_circ(DistributedSampler):
    def __init__(self, dataset, lengths, data_num_rat=None, num_replicas=None, rank=None, shuffle=True):
        super().__init__(dataset, num_replicas=num_replicas, rank=rank, shuffle=shuffle)

        self.lengths = lengths  # list of per-sample lengths
        self.weights = self.calculate_weights()  # initialize the sampling weights from the lengths
        self.data_num_rat = data_num_rat
        self.total_size = int(len(dataset) * data_num_rat)

    def calculate_weights(self):
        # piecewise weighting strategy
        weights = np.ones(len(self.lengths))
        weights[np.array(self.lengths) >= 1300] = 89.62 * 20
        weights[(np.array(self.lengths) >= 1200) & (np.array(self.lengths) < 1300)] = 5.24 * 20
        weights[(np.array(self.lengths) >= 1100) & (np.array(self.lengths) < 1200)] = 4.58 * 10
        weights[(np.array(self.lengths) >= 1000) & (np.array(self.lengths) < 1100)] = 3.82 * 10
        weights[(np.array(self.lengths) >= 900) & (np.array(self.lengths) < 1000)] = 3.30
        weights[(np.array(self.lengths) >= 800) & (np.array(self.lengths) < 900)] = 2.34
        weights[(np.array(self.lengths) >= 700) & (np.array(self.lengths) < 800)] = 1.74
        weights[(np.array(self.lengths) >= 600) & (np.array(self.lengths) < 700)] = 1.36
        weights[(np.array(self.lengths) >= 500) & (np.array(self.lengths) < 600)] = 1.0
        weights[(np.array(self.lengths) >= 400) & (np.array(self.lengths) < 500)] = 0.74
        weights[(np.array(self.lengths) >= 300) & (np.array(self.lengths) < 400)] = 0.57
        weights[(np.array(self.lengths) >= 200) & (np.array(self.lengths) < 300)] = 0.46
        weights[(np.array(self.lengths) >= 100) & (np.array(self.lengths) < 200)] = 0.38
        weights[np.array(self.lengths) < 100] = 0.48

        return weights / np.sum(weights)  # normalize the weights

    def __iter__(self):
        # select indices by weighted sampling
        indices = np.random.choice(len(self.dataset), self.total_size, replace=True, p=self.weights)

        # boundary handling: truncate to a length divisible by num_replicas
        total_size_local = (len(indices) // self.num_replicas) * self.num_replicas
        indices = indices[:total_size_local]  # drop the leftover samples

        # assign the samples to the different processes
        indices = indices[self.rank:total_size_local:self.num_replicas]

        if self.shuffle:
            np.random.shuffle(indices)

        return iter(indices.tolist())

    def set_epoch(self, epoch):
        super().set_epoch(epoch)
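As a reference for the sharding step shared by every `__iter__` above, the strided slice `indices[rank:total:num_replicas]` splits the drawn indices into disjoint per-rank subsets. A small standalone check (the index values are arbitrary and only illustrate the mechanism):

# Illustrative sharding check (not part of the original file)
import numpy as np

num_replicas = 4
indices = np.arange(10)                                 # stand-in for indices drawn by np.random.choice
total = (len(indices) // num_replicas) * num_replicas   # truncate to a multiple of num_replicas
indices = indices[:total]

shards = [indices[rank:total:num_replicas] for rank in range(num_replicas)]
print(shards)                                           # [0 4], [1 5], [2 6], [3 7]
assert sum(len(s) for s in shards) == total             # each kept index goes to exactly one rank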