Protein Location Predictor
A GUI application for predicting protein subcellular localization. It uses SVM and Random Forest classifiers trained on embeddings from protein language models, including PROST-T5 and ESM-C.
Features
Multiple Model Support: Choose from three different prediction models:
- PROST-T5: Transformer-based protein language model
- ESM-C 300M: Evolutionary Scale Modeling (300M parameters)
- ESM-C 600M: Evolutionary Scale Modeling (600M parameters)
User-Friendly GUI: Simple Tkinter-based interface with progress tracking (screenshot: doc/screenshots/gui_example.png)
Sequential Processing: Process multiple protein sequences from FASTA files
Flexible Output: Save predictions with confidence scores in text (CSV) format
Error Handling: Comprehensive error handling and user feedback
Supported Python Version
This project has been tested on Python 3.10+.
Requirements
Dependencies (Full environment.yml)
The complete environment definition is located in environment.yml. This file includes all necessary packages for PyTorch, Transformers, ESM models, and GUI operation. Here is a brief excerpt:
name: tesisEnv
channels:
- bioconda
- anaconda
- conda-forge
- defaults
# Python version and major packages
dependencies:
- python=3.10.16
- pytorch=2.6.0
- torchvision=0.21.0
- torchtext=0.18.0
- transformers=4.46.3
- scikit-learn=1.6.1
- biopython=1.85
- esm=3.1.4
- numpy=1.26.4
- joblib=1.4.2
- tk
# plus many others (see full file for complete list)
To ensure exact reproducibility, use:
conda env create -f environment.yml
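After creating the environment, it can be useful to confirm that the installed packages match the pins in the excerpt. The following is a minimal sketch using only the standard library, not part of the project. The package names are pip distribution names, which is an assumption (conda's `pytorch` package is exposed to Python as `torch`), and `find_mismatches` is a hypothetical helper.

```python
from importlib import metadata

# Pins copied from the environment.yml excerpt. Keys are pip distribution
# names (assumption: conda's "pytorch" shows up in metadata as "torch").
PINNED = {
    "torch": "2.6.0",
    "transformers": "4.46.3",
    "scikit-learn": "1.6.1",
    "biopython": "1.85",
    "numpy": "1.26.4",
    "joblib": "1.4.2",
}


def find_mismatches(pinned):
    """Return {package: (expected, found)} for every pin that is not met."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every listed pin is satisfied in the active environment.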
Hardware Requirements
- Minimum: 8 GB RAM, CPU-only execution
- Recommended: 16 GB+ RAM, NVIDIA GPU with 8 GB+ VRAM
- Storage: ~5 GB for model weights and cache
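To see which of the two configurations above applies on a given machine, a short check like the one below may help. It is a sketch, not the application's own device logic: `pick_device` is a hypothetical helper, and it falls back to the CPU when PyTorch cannot be imported.

```python
def pick_device():
    """Return "cuda" when PyTorch sees an NVIDIA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is missing (for example, the environment is not active).
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print(f"Embeddings would be computed on: {pick_device()}")
```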
Installation
Clone the repository (with Git LFS for large model files):
git lfs install
git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
If you prefer to skip downloading model weights initially:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/jpuglia/ProteinLocationPredictor
Navigate into the project directory:
cd ProteinLocationPredictor
Create and activate the Conda environment:
conda env create -f environment.yml
conda activate tesisEnv
(If skipped above) Download model weights manually: Model files live in the Models/ directory. If you used GIT_LFS_SKIP_SMUDGE, run:
git lfs pull
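When a clone skips the smudge step, Git LFS leaves small text pointer files in place of the real weights, and each pointer starts with the line `version https://git-lfs.github.com/spec/v1`. The sketch below uses that fact to list files in Models/ that have not been downloaded yet. The helper names are hypothetical and the script is not part of the repository.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """True if the file is a Git LFS pointer stub rather than real content."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_HEADER)) == LFS_HEADER


def missing_weights(models_dir="Models"):
    """List files directly under models_dir that are still pointer stubs."""
    root = Path(models_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir() if p.is_file() and is_lfs_pointer(p)
    )


if __name__ == "__main__":
    stubs = missing_weights()
    if stubs:
        print("Run `git lfs pull`; still pointers:", ", ".join(stubs))
    else:
        print("No LFS pointer stubs found in Models/.")
```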
Usage
GUI Mode
Launch the application:
python gui.py
In the menu, click File → Load FASTA and select your input file (.fasta, .fa, or .fas).
Choose one of the prediction models (PROST-T5, ESM-C 300M, or ESM-C 600M).
Click Run Prediction and monitor the progress bar.
When complete, you will be prompted to choose an output directory and filename.
Example Input & Output
Input FASTA (example/input.fasta):
>protein_1
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
>protein_2
MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
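To illustrate how an input like this breaks down into records, here is a dependency-free sketch of a FASTA reader. It is an illustration only: Biopython is listed in the environment and its `SeqIO.parse` is the usual tool for this, but how gui.py actually reads its input is not stated here, and `read_fasta` is a hypothetical name.

```python
def read_fasta(lines):
    """Yield (sequence_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The ID is the first whitespace-delimited token after ">".
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            chunks.append(line)  # sequences may span several lines
    if seq_id is not None:
        yield seq_id, "".join(chunks)


example = """>protein_1
MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG
>protein_2
MKTIIALSYIFCLVFAHATAKASEQTDNLQWDLAAIDNSGGHNAVDIKQNLQFQCQNNLHGCF
"""

for seq_id, sequence in read_fasta(example.splitlines()):
    print(seq_id, len(sequence))
```

The sequence IDs yielded here correspond to the Sequence_ID column of the output file.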
Output CSV (example/output.csv):
Sequence_ID,Prediction 1,Prediction 2,Prediction 3,Prediction 4,Prediction 5,Prediction 6
protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081),Periplasmic (0.0029),Extracellular (0.0019),OuterMembrane (0.0007),Cellwall (0.0003)
protein_2,SignalPeptide (0.7523),Extracellular (0.1234),CytoplasmicMembrane (0.0645),Cellwall (0.0345),Periplasmic (0.0201),OuterMembrane (0.0052)
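Each prediction cell combines a label and a confidence score in the form `Label (0.1234)`, so downstream scripts need to split them apart. The sketch below shows one way to do that with the standard library. It assumes the cell format seen in the example above, and `parse_predictions` is a hypothetical helper, not a function shipped with the project.

```python
import csv
import io
import re

# Matches cells such as "Cytoplasmic (0.9860)".
CELL = re.compile(r"^(?P<label>.+?) \((?P<score>[0-9.]+)\)$")


def parse_predictions(handle):
    """Map each Sequence_ID to its (label, confidence) pairs in column order."""
    results = {}
    for row in csv.DictReader(handle):
        seq_id = row.pop("Sequence_ID")
        ranked = []
        for column, cell in row.items():
            if column is None or not cell:
                continue  # skip surplus or empty fields
            match = CELL.match(cell.strip())
            if match:
                ranked.append((match["label"], float(match["score"])))
        results[seq_id] = ranked
    return results


sample = io.StringIO(
    "Sequence_ID,Prediction 1,Prediction 2\n"
    "protein_1,Cytoplasmic (0.9860),CytoplasmicMembrane (0.0081)\n"
)
top_label, top_score = parse_predictions(sample)["protein_1"][0]
print(top_label, top_score)
```

For a real run, pass an open file handle instead of the `io.StringIO` sample, for example `open("output.csv", newline="")`.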
Project Structure
ProteinLocationPredictor/
├── gui.py
├── src/
│   └── my_utils.py
├── Models/
│   ├── ProstT5_svm.joblib
│   ├── ESMC-300m_svm.joblib
│   ├── ESMC-600m_svm.joblib
│   └── ...
├── environment.yml
├── README.md
└── doc/
    └── screenshots/
        └── gui_example.png
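For scripted use outside the GUI, the classifier files in this listing can be located from the model names offered in the interface. The mapping below is an assumption inferred from the file names. gui.py may organise this differently, and `classifier_path` is a hypothetical helper.

```python
from pathlib import Path

# Assumed mapping from the GUI's model names to the files in the listing.
MODEL_FILES = {
    "PROST-T5": "ProstT5_svm.joblib",
    "ESM-C 300M": "ESMC-300m_svm.joblib",
    "ESM-C 600M": "ESMC-600m_svm.joblib",
}


def classifier_path(model_name, project_root="."):
    """Return the expected path of the saved classifier for model_name."""
    try:
        filename = MODEL_FILES[model_name]
    except KeyError:
        raise ValueError(
            f"Unknown model {model_name!r}; choose from {sorted(MODEL_FILES)}"
        ) from None
    return Path(project_root) / "Models" / filename


print(classifier_path("ESM-C 300M"))
```

A path resolved this way can be handed to `joblib.load`, the standard call for reading `.joblib` files. Loading is most reliable with the scikit-learn version pinned in environment.yml.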
Contributing
Fork the repository
Create a feature branch:
git checkout -b feature/amazing-feature
Commit your changes:
git commit -m "Add amazing feature"
Push to your branch:
git push origin feature/amazing-feature
Open a Pull Request or start a discussion: Repository Discussions