Sparse Autoencoders Improve Low-N Protein Function Prediction and Design

This repository contains the models used in the paper "Sparse Autoencoders Improve Low-N Protein Function Prediction and Design" by Darin Tsui, Kunal Talreja, and Amirali Aghazadeh. A link to the paper can be found here.

models/
β”œβ”€β”€ esm2_t33_650M_UR50D/           # Base ESM2 protein language model
β”œβ”€β”€ SPG1_STRSG_Wu_2016/            # Protein-specific models for SPG1
β”œβ”€β”€ SPG1_STRSG_Olson_2014/         # Protein-specific models for SPG1 (Olson)
β”œβ”€β”€ GRB2_HUMAN_Faure_2021/         # Protein-specific models for GRB2
β”œβ”€β”€ GFP_AEQVI_Sarkisyan_2016/      # Protein-specific models for GFP
β”œβ”€β”€ F7YBW8_MESOW_Ding_2023/        # Protein-specific models for F7YBW8
└── DLG4_HUMAN_Faure_2021/         # Protein-specific models for DLG4

Each protein directory contains two types of models:

Fine-tuned ESM2 Models (esm_ft/)

  • adapter_model.safetensors: Fine-tuning adapter weights
  • adapter_config.json: Adapter configuration
  • README.md: Model card with detailed information
  • tokenizer_config.json: Tokenizer configuration
  • special_tokens_map.json: Special token mapping

Sparse Autoencoders (sae/)

  • checkpoints/: Training checkpoints saved at different steps and loss values
  • Checkpoint files follow the naming pattern: esm2_plm{PLM_DIM}_l{SAE_LAYER}_sae{SAE_DIM}_k{SAE_K}_auxk{SAE_AUXK}_-step={step}-avg_mse_loss={loss}.ckpt
  • The most recent checkpoint is simply named last.ckpt (see the inspection sketch below)
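
The .ckpt extension suggests standard PyTorch Lightning checkpoints, so a checkpoint's contents can be inspected directly with torch.load. This is a minimal sketch under that assumption, not code from the paper; the path uses one of the protein directories listed above:

import torch

# Load a checkpoint on CPU and list the SAE parameter shapes.
# weights_only=False is needed on recent PyTorch versions, since Lightning
# checkpoints store metadata beyond raw tensors.
ckpt = torch.load(
    "models/SPG1_STRSG_Wu_2016/sae/checkpoints/last.ckpt",
    map_location="cpu",
    weights_only=False,
)
for name, tensor in ckpt["state_dict"].items():
    print(name, tuple(tensor.shape))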

Usage

Loading Fine-tuned Models

from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Name of an assay directory from the tree above, e.g.:
DMS_ASSAY = "SPG1_STRSG_Wu_2016"

# Load the base ESM2 model, then attach the fine-tuning adapter
base_model = AutoModelForMaskedLM.from_pretrained("models/esm2_t33_650M_UR50D")
ft_model = PeftModel.from_pretrained(base_model, f"models/{DMS_ASSAY}/esm_ft")
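
Extracting Layer Activations

Once the adapter is attached, per-residue activations can be pulled from the model. This is a minimal sketch, not code from the paper: the sequence is a placeholder, and the last hidden layer is used for illustration (the SAE is trained on the specific layer encoded as SAE_LAYER in the checkpoint filename).

import torch

# Tokenizer files ship alongside each adapter (see the esm_ft/ contents above)
tokenizer = AutoTokenizer.from_pretrained(f"models/{DMS_ASSAY}/esm_ft")

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQ"  # placeholder; substitute your protein
inputs = tokenizer(sequence, return_tensors="pt")

ft_model.eval()
with torch.no_grad():
    outputs = ft_model(**inputs, output_hidden_states=True)

# hidden_states[i] holds the activations after layer i;
# shape (1, tokens, 1280) for the 650M ESM2 model
layer_acts = outputs.hidden_states[-1]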