Contrastively Learned Attention-Based Stratified PTM Predictor (CLASPP): a unified PTM prediction model
CLASPP is an ESM2-150M protein language model that predicts PTM events occurring on a substrate from its primary protein sequence. It covers 12 different PTM types as a form of multi-label classification. The encoder is trained on a supervised contrastive learning task, and the classification head is then fine-tuned on the multi-label classification task. Existing PTM prediction models predominantly focus on either single PTM types or employ ensemble methods that combine multiple models to predict different PTM types. This fragmentation is largely driven by the vast imbalance in data availability across PTM types, making it difficult to predict multiple PTM types with a single model. To address this limitation, we present the Contrastively Learned Attention-Based Stratified PTM Predictor (CLASPP), a unified PTM prediction model.
Quick overview of the dependencies
From conda:
From pip:
For torch/PyTorch: installing torch can be the most complex part. Go to the PyTorch website (https://pytorch.org) and follow its installation recommendation for your system.
How to Get Started with the Model
Downloading this repository
Make sure git lfs is installed; the weight files are too large to store without it.
```
git clone https://huggingface.co/esbglab/Claspp_forward
cd Claspp_forward
```
Creating this conda environment
Type the following lines into the terminal after downloading this repository (this assumes you already have Anaconda installed):
```
conda create -n claspp_forward python=3.9.23
conda deactivate
conda activate claspp_forward
pip3 install numpy==2.0.2
pip3 install transformers==4.53.2
pip3 install datasets==4.0.0
```
**For torch, follow the specification on the PyTorch website (https://pytorch.org) if you want GPU acceleration.**

```
pip3 install torch torchvision torchaudio
```

The line above might look different for you, depending on your platform and CUDA version.
We provide code to test CLASPP (see the section below).
:tada: you are now ready to run the code :tada:
Use the code below to get started with the model.
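As a minimal sketch of preparing input for the model's `-S 0` mode (the peptides and the file name `random.txt` here are made-up examples), the expected txt format is newline-separated 21-residue peptides whose center residue is the candidate PTM site:

```python
# Prepare a txt input for claspp_forward.py -S 0:
# each line is a 21-residue peptide whose center residue
# (index 10) is the candidate PTM site.
peptides = [
    "MKVLAAGTSDSYKLNNPQRST",  # hypothetical 21-mer, center = S
    "AAAAAGGGGGSTTTTTCCCCC",  # hypothetical 21-mer, center = S
]
assert all(len(p) == 21 for p in peptides)

with open("random.txt", "w") as fh:
    fh.write("\n".join(peptides))

# then run: python3 claspp_forward.py -B 100 -S 0 -i random.txt
```

The newline-separated format lets the CLI skip fasta parsing and peptide scraping, which is why the txt path can be faster.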
Model Details
Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., ... & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123-1130.
| PTM type | Residues trained on | Number of clusters allocated | Output index(es) | Input label index(es) (training) |
|---|---|---|---|---|
| ST_Phosphorylation | S,T | 5 | 0 or 1 | 0-4 |
| Y_Phosphorylation | Y | 1 | 3 | 25 |
| K_Ubiquitination | K | 20 | 2 | 5-24 |
| K_Acetylation | K | 10 | 4 | 26-35 |
| AM_Acetylation | A,M | 1 | 13 or 14 | 49 |
| N_N-linked-Glycosylation | N | 1 | 5 | 36 |
| ST_O-linked-Glycosylation | S,T | 5 | 6 or 7 | 37-41 |
| RK_Methylation | R,K | 4 | 8 or 9 | 42-45 |
| K_Sumoylation | K | 1 | 10 | 46 |
| K_Malonylation | K | 1 | 11 | 53 |
| M_Sulfoxidation | M | 1 | 12 | 48 |
| C_Glutathionylation | C | 1 | 15 | 50 |
| C_S-palmitoylation | C | 1 | 16 | 51 |
| PK_Hydroxylation | P,K | 1 | 17 or 18 | 52 |
| negative | all residues | N/A | 19 | 53 |
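The table above can be read as a lookup from the model's output vector indices to PTM types. A small illustrative mapping, transcribed directly from the table (this helper dict is not part of the shipped code):

```python
# Output index -> PTM type, transcribed from the table above.
# PTM types trained on two residue classes occupy two output indices.
OUTPUT_INDEX_TO_PTM = {
    0: "ST_Phosphorylation", 1: "ST_Phosphorylation",
    2: "K_Ubiquitination",
    3: "Y_Phosphorylation",
    4: "K_Acetylation",
    5: "N_N-linked-Glycosylation",
    6: "ST_O-linked-Glycosylation", 7: "ST_O-linked-Glycosylation",
    8: "RK_Methylation", 9: "RK_Methylation",
    10: "K_Sumoylation",
    11: "K_Malonylation",
    12: "M_Sulfoxidation",
    13: "AM_Acetylation", 14: "AM_Acetylation",
    15: "C_Glutathionylation",
    16: "C_S-palmitoylation",
    17: "PK_Hydroxylation", 18: "PK_Hydroxylation",
    19: "negative",
}

print(OUTPUT_INDEX_TO_PTM[2])  # K_Ubiquitination
```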
Data organization and number of clusters
| Repo | Link | Description |
|---|---|---|
| GitHub | github version Data_cur | This version contains code but no data; you must run the code to generate all the helper files (this will take some time) |
| GitHub | github version Forward | This version contains code but NOT any weights (the files are too big for GitHub) |
| Huggingface | huggingface version Forward | This version contains code and trained weights |
| Zenodo | zenodo version training_data | Zenodo version of the training/testing/validation data |
| webtool | website version of webtool | Webtool hosted on a server |
- Repository: [More Information Needed]
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]
Uses
Direct Use
```
Usage: python3 claspp_forward.py [OPTION]... --input INPUT [FASTA_FILE or TXT_FILE]...
Predict PTM events on peptides or full sequences.

Example 1: python3 claspp_forward.py -B 100 -S 0 -i random.txt
Example 2: python3 claspp_forward.py -B 50 -S 1 -i random.fasta

FASTA_FILE contains protein sequences in proper fasta or a2m format.
TXT_FILE contains protein peptides 21 residues in length, with the center
residue being the PTM modification site.

Pattern selection and interpretation:
  -B, --batch_size    (int) how many predictions are made at a time on the
                      GPU (reduce this if you run out of GPU memory)
  -S, --scrape_fasta  (int) should be 1 or 0
                      1 = read a fasta file and scrape all possible 21-mer
                          peptides that can be modified by a PTM
                      0 = read a txt file that already has the 21-mers
                          separated by '\n' (can be faster than the fasta
                          option)
  -h, --help          you're reading it right now
  -i, --input         location of the input fasta or txt file
  -o, --output        location of the output csv
```
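As a conceptual sketch of what the `-S 1` fasta-scraping mode does (the real scraper in `claspp_forward.py` may pad termini or filter residues differently; this simplified version skips sites closer than 10 residues to either end, and the sequence below is made up):

```python
def scrape_21mers(seq, residues="STYKNARMCP"):
    """Collect 21-residue windows centered on candidate PTM residues.

    `residues` is the union of the residue columns in the PTM table;
    the real CLI may use a different candidate set.
    """
    peptides = []
    for i, aa in enumerate(seq):
        # keep only sites with 10 residues of context on each side
        if aa in residues and i >= 10 and i + 10 < len(seq):
            peptides.append(seq[i - 10 : i + 11])
    return peptides

seq = "MSTAPKRQEDLYKSGHVNCWILFMTTRKPA"  # hypothetical 30-residue sequence
for pep in scrape_21mers(seq):
    assert len(pep) == 21
```

Each extracted window is then scored exactly like a line of the txt input, so both modes feed the model the same 21-mer shape.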
- Developed by: Nathan Gravel (major author for most of the code).
- Fine-tuning code inspired by Zhongliang Zhou.
- Contrastive learning code inspired by Ruili Fang.
- Codebase testing and version control by Austin Downes.
- Webtool development by Saber Soleymani.
- Funded by [optional]: [NIH]
- Shared by [optional]: [More Information Needed]
- Model type: [Text classification]
- Language(s) (NLP): [Protein Sequence]
- License: [MIT]
- Finetuned from model [optional]: [ESM-2 150M]
Model tree for esbglab/Claspp_forward
Base model
facebook/esm2_t30_150M_UR50D