ESMC Protein Function Predictor

An Evolutionary-scale Model (ESM) for predicting protein function from amino acid sequences using the Gene Ontology (GO). The model is based on the ESM Cambrian Transformer architecture, pre-trained on UniRef, MGnify, and the Joint Genome Institute's database, and fine-tuned on the AmiGO Boost protein function dataset. Given a protein sequence, it predicts the GO subgraph for that protein, giving you insight into its molecular function, the biological processes it participates in, and where in the cell its activity takes place.

What are GO terms?

"The Gene Ontology (GO) is a concept hierarchy that describes the biological function of genes and gene products at different levels of abstraction (Ashburner et al., 2000). It is a good model to describe the multi-faceted nature of protein function."

"GO is a directed acyclic graph. The nodes in this graph are functional descriptors (terms or classes) connected by relational ties between them (is_a, part_of, etc.). For example, terms 'protein binding activity' and 'binding activity' are related by an is_a relationship; however, the edge in the graph is often reversed to point from binding towards protein binding. This graph contains three subgraphs (subontologies): Molecular Function (MF), Biological Process (BP), and Cellular Component (CC), defined by their root nodes. Biologically, each subgraph represent a different aspect of the protein's function: what it does on a molecular level (MF), which biological processes it participates in (BP) and where in the cell it is located (CC)."

From CAFA 5 Protein Function Prediction
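
To make the graph structure concrete, below is a minimal sketch that walks the GO DAG with obonet and networkx (the graph library obonet builds on). The go-basic.obo path and the example term (GO:0005515, protein binding) are illustrative; any term in the ontology will work.

import networkx as nx
import obonet

# Load the ontology as a networkx MultiDiGraph.
graph = obonet.read_obo("./dataset/go-basic.obo")

term = "GO:0005515"  # protein binding
print(graph.nodes[term]["name"])

# In the obonet graph, edges point from a term to its parents (is_a,
# part_of, ...), so the graph-theoretic descendants of a term are its
# ancestors in the ontology, up to one of the three root nodes
# (molecular_function, biological_process, or cellular_component).
ancestors = nx.descendants(graph, term)
print(sorted(ancestors))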

Pretrained Models

The following pretrained models are available on HuggingFace Hub.

| Name | Embedding Dim. | Attn. Heads | Encoder Layers | Context Length | Total Parameters |
|---|---|---|---|---|---|
| andrewdalpino/ESMC-300M-Protein-Function | 960 | 15 | 30 | 2048 | 361M |
| andrewdalpino/ESMC-600M-Protein-Function | 1152 | 18 | 36 | 2048 | 644M |
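
As a quick sanity check of the sizes above, you can load a checkpoint and sum its parameters. This is a minimal sketch that assumes the classifier behaves like a standard PyTorch module; it uses the same from_pretrained call as the example below.

from esmc_function_classifier.model import EsmcGoTermClassifier

# Load one of the checkpoints listed above (assumes the classifier is a
# standard PyTorch nn.Module exposing .parameters()).
model = EsmcGoTermClassifier.from_pretrained("andrewdalpino/ESMC-300M-Protein-Function")

total_params = sum(p.numel() for p in model.parameters())

print(f"{total_params / 1e6:.0f}M parameters")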

Basic Pretrained Example

First, install the esmc_function_classifier and obonet packages using pip.

pip install esmc_function_classifier obonet

Then, we'll load the model weights from HuggingFace Hub, load the GO graph with obonet (the go-basic.obo release used here can be downloaded from the Gene Ontology website), tokenize the amino acid sequence, and infer the GO subgraph.

import torch
import obonet

from esm.tokenization import EsmSequenceTokenizer
from esmc_function_classifier.model import EsmcGoTermClassifier

model_name = "andrewdalpino/ESMC-300M-Protein-Function"
go_db_path = "./dataset/go-basic.obo"

sequence = "MPPKGHKKTADGDFRPVNSAGNTIQAKQKYSIDDLLYPKSTIKNLAKETLPDDAIISKDALTAIQRAATLFVSYMASHGNASAEAGGRKKIT"

top_p = 0.5

# Load the Gene Ontology graph and the pretrained classifier.
graph = obonet.read_obo(go_db_path)

tokenizer = EsmSequenceTokenizer()

model = EsmcGoTermClassifier.from_pretrained(model_name)

model.load_gene_ontology(graph)

# Tokenize the amino acid sequence, truncating to the model's context length.
out = tokenizer(sequence, max_length=2048, truncation=True)

input_ids = torch.tensor(out["input_ids"], dtype=torch.int64)

# Predict the GO subgraph and the probability of each term.
subgraph, go_term_probabilities = model.predict_subgraph(
    input_ids, top_p=top_p
)
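
To inspect the prediction, you can look each predicted term up in the obonet graph. This is a sketch that assumes go_term_probabilities maps GO term IDs to probabilities; see the repository linked below for the exact return types of predict_subgraph.

# List the predicted terms with their names and namespaces (assumes
# go_term_probabilities is a dict of GO term ID -> probability).
for go_id, prob in sorted(go_term_probabilities.items(), key=lambda kv: kv[1], reverse=True):
    node = graph.nodes[go_id] if go_id in graph else {}
    print(f"{go_id}  {prob:.3f}  {node.get('name', '?')}  ({node.get('namespace', '?')})")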

Code Repository

The code for this model can be found at https://github.com/andrewdalpino/ESMC-Function-Classifier

References:

  • T. Hayes, et al. Simulating 500 million years of evolution with a language model, 2024.
  • M. Ashburner, et al. Gene Ontology: tool for the unification of biology, 2000.