A Guide to Designing New Functional Proteins and Improving Protein Function, Stability, and Diversity with Generative AI

Community Article Published July 2, 2024

image/png

Above we see an image of a protein backbone generated by RFDiffusion All Atom to bind to a specified small molecule ligand, a task which it is exceptional at, visualized using Discovery Studio.

Introduction

Recently, researchers applying deep learning to biochemistry have released a multitude of new AI models. These models are powerful and effective at improving the thermostability, binding affinity, and function of proteins by modifying their 3D structures and sequences. They are also useful for designing entirely new proteins, de novo, with specific functions. Because these models are so new, however, they are not yet widely understood or used by researchers, and understanding how they should be used together to solve complex problems in biochemistry is difficult for newcomers for several reasons. First, using them generally requires some coding experience, which many biochemists do not have. Second, understanding their capabilities and use cases requires some understanding of deep learning, which many biochemists find difficult due to the complexity of some of these models and the mathematical depth of their inner workings. This has led to the models being labeled "uninterpretable black boxes" and has caused some disdain on the part of biochemists. Third, the volume of AI methods being produced is so large that sifting through the research and determining which methods are actually effective can be time consuming and laborious. All of these barriers conspire to impede adoption, effective usage, and understanding of these methods, although some platforms are trying to address them.

In the following we will discuss in detail how to use a suite of AI models for proteins and small molecules to optimize and diversify proteins, and how to create new proteins with similar function to a protein of interest. We will provide some specific examples with real proteins and small molecules to illustrate the usefulness of this methodology, focusing on two examples in particular:

(1) plastic degrading proteins which bind to the PET polymer

  • Molecules Involved: PETase (PDB ID: 5XJH) and PET ligand
  • Benefit of Strengthening Interaction: Enhancing the interaction between PETase and PET polymers can lead to more efficient breakdown of PET plastics, which are commonly used in bottles and packaging materials. A stronger binding affinity could increase the rate of hydrolysis, thereby accelerating the degradation process.
  • Explanation: PETase is an enzyme that hydrolyzes PET into smaller, more manageable molecules that can be further degraded or upcycled. By strengthening this interaction, the efficiency of PET degradation in recycling processes and natural environments would improve, contributing to reduced plastic waste.

(2) A protein-protein interaction (PPI) between Brain-Derived Neurotrophic Factor (BDNF, PDB ID: 1BND) and Tropomyosin receptor kinase B (TrkB, PDB ID: 4AT3)

  • Proteins Involved: Brain-Derived Neurotrophic Factor (BDNF, PDB ID: 1BND) and Tropomyosin receptor kinase B (TrkB, PDB ID: 4AT3)
  • Benefit of Strengthening Interaction: Increasing the interaction strength between BDNF and TrkB could improve neuronal survival, growth, and differentiation, which is crucial in neurological disorders such as depression, Alzheimer's disease, and other neurodegenerative diseases.
  • Implications: A stronger BDNF-TrkB interaction can promote neuronal health and plasticity. Therapeutic strategies that mimic or enhance this interaction could potentially slow down neurodegenerative processes and improve outcomes in various neurological disorders.

We will describe a procedure using a collection of AI models which will improve binding affinity and thermostability. The general procedure is as follows:

  1. Predict the structure of the protein-small molecule or protein-protein complex with RoseTTAFold All Atom
  2. Give the output PDB from the previous step to RFDiffusion All Atom (or RFDiffusion) to perform partial diffusion to obtain diverse protein backbones similar to the original protein
  3. Use AF2Bind, Evo, a protein language model like ESM-2, AlphaMissense, and/or UniProt or PDB annotations to identify important structural motifs in the original protein such as binding sites and active sites and scaffold these motifs with RFDiffusion All Atom (or RFDiffusion) to obtain new proteins that are dissimilar to the original protein
  4. Optionally use AlphaFlow to obtain conformational ensembles which recapitulate molecular dynamics simulations (MD simulations) of the protein backbones to better handle transient binding pockets and obtain additional residues or motifs of importance
  5. Optionally sample the Boltzmann distribution for your proteins using Distributional Graphormer (this can be used in place of the previous step and will provide more information about your protein's dynamics and the transitions between the various metastable states)
  6. Scaffold functional structural motifs with RFDiffusion or RFDiffusion All Atom
  7. Optionally use Evo to determine which point mutations are likely to improve function and which are likely to be deleterious to function to use for biasing residues towards or away from particular amino acids at those locations when designing sequences with LigandMPNN
  8. Use LigandMPNN to design diverse and chemically favorable sequences for the protein backbones generated in previous steps, optionally biasing particular residues towards or away from certain amino acids using the information provided by Evo, AF2Bind, AlphaMissense, etc.
  9. Validate and assess the quality of your newly designed protein sequences with AlphaFold2 (or OpenFold)
  10. Predict binding affinity between your protein and small molecule ligand or in your protein-protein interaction by computing the LIS score from the PAE output of RoseTTAFold All Atom to filter out the best sequences
  11. Optionally predict thermostability with ThermoMPNN
  12. Follow up with experimental validation!

Predicting Structure for a Protein-Small Molecule Complex with RoseTTAFold All Atom

The instructions for setting up RoseTTAFold All Atom can be found on the GitHub here. You will also need to make sure you have space for the MSA and template databases mentioned in step (7) of those setup instructions, which take up just over 300GB. These databases will speed up computations significantly, which will be useful for high throughput prediction of binding affinity using the LIS score obtained from the PAE output of RoseTTAFold All Atom. Once you have set up RoseTTAFold All Atom, you can provide the SMILES string for your ligand and the sequence for your protein to predict the structure of the plastic degrading protein bound to the PET molecule. If you are interested in the second example, you can instead provide the two protein sequences to RoseTTAFold All Atom to predict the structure of the PPI. Either way, this will give us a PDB file of the protein-small molecule complex or the protein-protein complex, which we will use as input to RFDiffusion All Atom (or RFDiffusion).

Diversifying Protein Structures using Partial Diffusion with RFDiffusion or RFDiffusion All Atom

For Google Colab (.ipynb) versions of RFDiffusion, see the following links:

Link 1

Link 2

Once you have your PDB output from RoseTTAFold All Atom, you can give it as input to RFDiffusion All Atom or RFDiffusion. RFDiffusion All Atom is able to use the small molecule ligand as context, whereas RFDiffusion only works with proteins. Given the PDB that you obtained in step (1), you can perform "partial diffusion" on the protein structure. This adds a small amount of noise, specified by you, to the protein backbone structure, and then denoises it to obtain a new backbone that is similar but not identical to your original protein. The more noise you add, the more diverse your protein backbones will be. If you do this with RFDiffusion All Atom with the small molecule ligand as context, you will be able to design new backbones with higher shape complementarity to the ligand. This means the protein and ligand will fit together better and the binding affinity will likely increase. If you are doing this with RFDiffusion on a protein-protein complex, this will again improve the shape complementarity between your protein binder and your protein target, likely increasing binding affinity. During this process, you may choose particular residues to add noise to, or you may noise the entire protein structure. Choosing which parts of the protein to noise can be done based on prior knowledge of the protein or based on the next steps using AF2Bind and/or Evo or a pLM.
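To make the partial diffusion step more concrete, here is a minimal sketch that assembles a command line for the public RFdiffusion repository. The option names (`inference.input_pdb`, `contigmap.contigs`, `diffuser.partial_T`, and so on) follow the RFdiffusion README as we understand it. The helper name `partial_diffusion_args`, the file name, and the chain length of 290 are placeholders of our own, and RFDiffusion All Atom uses a different entry point and additional ligand options that are not shown here.

```python
import shlex


def partial_diffusion_args(input_pdb, protein_length, partial_t=10,
                           num_designs=10,
                           output_prefix="outputs/partial_diffusion"):
    """Assemble RFdiffusion arguments for partial diffusion of one chain.

    partial_t is the number of noising steps: larger values add more noise
    and therefore give more diverse backbones.
    """
    if partial_t < 1:
        raise ValueError("partial_t must be at least 1 noising step")
    return [
        "./scripts/run_inference.py",
        f"inference.input_pdb={input_pdb}",
        f"inference.output_prefix={output_prefix}",
        # For partial diffusion the contig length must match the input chain.
        f"contigmap.contigs=[{protein_length}-{protein_length}]",
        f"inference.num_designs={num_designs}",
        f"diffuser.partial_T={partial_t}",
    ]


# Hypothetical input file and chain length, for illustration only.
args = partial_diffusion_args("rfaa_prediction.pdb", protein_length=290,
                              partial_t=10)
print(shlex.join(args))
```

Starting with a small `partial_t` and increasing it gradually is a reasonable way to explore how far the new backbones drift from the original structure.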

While we will not discuss this in any detail here, RFDiffusion also has the ability to design entirely new protein binders, de novo, and the designs very often have high affinity and specificity, meaning they bind very well to the intended protein target and have very few, if any, off-target interactions. Similarly, RFDiffusion All Atom can design entirely new proteins that bind to a specified small molecule, de novo, and the binding pockets very often have high shape complementarity to the ligand. You should experiment with this capability if you are interested in, for example, disrupting particular protein-protein interactions.

We should also note that RFDiffusion has many other capabilities such as symmetric oligomer generation, symmetric motif scaffolding, and unconditional generation of protein backbones, fold conditioning, and the option to use guiding potentials, which we will not discuss here.

Identifying Important Structural Motifs

AF2Bind

We will also want to use the motif scaffolding functionality of RFDiffusion to design new proteins that are very different from our original protein but that contain the same important motifs. For example, we might want to identify the binding sites of the protein. We can use the residues within some cutoff distance from the interface between the ligand or target protein and our protein structure, or we can use a method like AF2Bind to identify binding sites. Additionally, we might look for annotations in UniProt or the PDB to obtain active sites, catalytic sites, or other regions of interest. AF2Bind will help us determine binding sites, but it will also give us a good idea of which "bait amino acids" have favorable interactions with each residue of our protein, which is indicative of the chemical properties of those residues. This can help us choose how to bias LigandMPNN to design sequences for our protein backbones with favorable chemical properties which help improve binding affinity. A diagram of how AF2Bind works can be seen below.
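The distance cutoff approach mentioned above can be sketched in a few lines of standard-library Python. This is only an illustration: the 5 Å cutoff, the ligand residue name `LIG`, and the helper names are assumptions of ours, and the parser reads only the fixed-width `ATOM` and `HETATM` columns of the PDB format.

```python
import math


def parse_atoms(pdb_text):
    """Read ATOM/HETATM records from PDB text using fixed-width columns."""
    atoms = []
    for line in pdb_text.splitlines():
        record = line[0:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        atoms.append({
            "record": record,
            "resname": line[17:20].strip(),
            "chain": line[21],
            "resseq": int(line[22:26]),
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        })
    return atoms


def binding_site_residues(pdb_text, ligand_resname, cutoff=5.0):
    """Return (chain, residue number) pairs with any atom within cutoff of the ligand."""
    atoms = parse_atoms(pdb_text)
    ligand = [a["xyz"] for a in atoms
              if a["record"] == "HETATM" and a["resname"] == ligand_resname]
    site = set()
    for atom in atoms:
        if atom["record"] != "ATOM":
            continue
        if any(math.dist(atom["xyz"], xyz) <= cutoff for xyz in ligand):
            site.add((atom["chain"], atom["resseq"]))
    return sorted(site)


def demo_line(record, serial, name, resname, chain, resseq, x, y, z):
    """Format one toy PDB record so that the columns line up correctly."""
    return (f"{record:<6}{serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}"
            f"    {x:8.3f}{y:8.3f}{z:8.3f}")


# Toy complex: three CA atoms and one ligand atom (made-up coordinates).
toy_pdb = "\n".join([
    demo_line("ATOM", 1, "CA", "SER", "A", 10, 0.0, 0.0, 0.0),
    demo_line("ATOM", 2, "CA", "HIS", "A", 11, 3.8, 0.0, 0.0),
    demo_line("ATOM", 3, "CA", "ASP", "A", 50, 30.0, 0.0, 0.0),
    demo_line("HETATM", 4, "C1", "LIG", "B", 1, 1.0, 1.0, 0.0),
])
site = binding_site_residues(toy_pdb, "LIG", cutoff=5.0)
print(site)
```

The resulting residue list can serve as a first guess at the motif to scaffold, and can be compared against the binding sites that AF2Bind or the UniProt and PDB annotations suggest.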

image/png

Evo and Protein Language Models

Evo, a DNA language model based on the StripedHyena architecture, can perform a variety of generative and predictive tasks. One of its use cases is variant effect prediction: determining which mutations are likely beneficial to function and which are likely deleterious, and quantifying just how beneficial or deleterious those mutations are.

This is similar to using log-likelihood ratios (LLR) to predict the effects of every point mutation of a protein, and plotting the results in a heatmap, which can be done with a protein language model (pLM) like ESM-2 as follows:

from transformers import AutoTokenizer, EsmForMaskedLM
import torch
import matplotlib.pyplot as plt
import numpy as np

# Load the model and tokenizer
model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmForMaskedLM.from_pretrained(model_name)

# Input protein sequence
protein_sequence = "MAPLRKTYVLKLYVAGNTPNSVRALKTLNNILEKEFKGVYALKVIDVLKNPQLAEEDKILATPTLAKVLPPPVRRIIGDLSNREKVLIGLDLLYEEIGDQAEDDLGLE"

# Tokenize the input sequence
input_ids = tokenizer.encode(protein_sequence, return_tensors="pt")
sequence_length = input_ids.shape[1] - 2  # Excluding the special tokens

# List of amino acids
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")

# Initialize heatmap
heatmap = np.zeros((20, sequence_length))

# Calculate LLRs for each position and amino acid
for position in range(1, sequence_length + 1):
    # Mask the target position
    masked_input_ids = input_ids.clone()
    masked_input_ids[0, position] = tokenizer.mask_token_id
    
    # Get logits for the masked token
    with torch.no_grad():
        logits = model(masked_input_ids).logits
        
    # Calculate log probabilities
    probabilities = torch.nn.functional.softmax(logits[0, position], dim=0)
    log_probabilities = torch.log(probabilities)
    
    # Get the log probability of the wild-type residue
    wt_residue = input_ids[0, position].item()
    log_prob_wt = log_probabilities[wt_residue].item()
    
    # Calculate LLR for each variant
    for i, amino_acid in enumerate(amino_acids):
        log_prob_mt = log_probabilities[tokenizer.convert_tokens_to_ids(amino_acid)].item()
        heatmap[i, position - 1] = log_prob_mt - log_prob_wt

# Visualize the heatmap
plt.figure(figsize=(15, 5))
plt.imshow(heatmap, cmap="viridis", aspect="auto")
plt.xticks(range(sequence_length), list(protein_sequence))
plt.yticks(range(20), amino_acids)
plt.xlabel("Position in Protein Sequence")
plt.ylabel("Amino Acid")
plt.title("Predicted Effects of Mutations on Protein Sequence (LLR)")
plt.colorbar(label="Log Likelihood Ratio (LLR)")
plt.show()

This will display a heatmap like the following:

image/png

In this heatmap, we can see regions that are highly conserved, where mutations are either poorly tolerated or only a very restricted set of substitutions is not deleterious. We also see the opposite, where there are residues or regions that can be mutated to almost any amino acid without detrimental effects. This provides us with an idea of which regions might be important to preserve or fix, and which regions to redesign. It also tells us which amino acids might be beneficial, or deleterious, at specific residues, allowing us to bias LigandMPNN later on when we are designing sequences. Understanding the effects of mutations in protein sequences is crucial for elucidating the molecular basis of various biological processes. The code snippet provided aims to predict the potential consequences of amino acid substitutions at different positions within a protein sequence. It utilizes a pretrained transformer model to estimate Log Likelihood Ratios (LLRs) for amino acid variants, which are indicative of the likelihood of a given mutation being deleterious, neutral, or beneficial. This can provide us with additional information about how to bias particular residues towards or away from some subset of the 20 standard amino acids when using LigandMPNN to design sequences.

Methods:

  1. Tokenization:

The code begins by importing necessary libraries, loading the pretrained ESM-2 model and tokenizer, and specifying the input protein sequence. The sequence is tokenized using the tokenizer, resulting in a sequence of token IDs. Each amino acid in the protein sequence is mapped to a corresponding token using the tokenizer's vocabulary.

  2. LLR Calculation:

For each position p along the protein sequence, LLRs are calculated for each of the 20 standard amino acids. Let i represent the index of an amino acid variant in the list of amino acids. The LLR for amino acid substitution i at position p is given by:

$$LLR_{i,p} = \log\left(\frac{P_{i,p}}{P_{\text{wt},p}}\right)$$

where $P_{i,p}$ is the probability of observing amino acid i at position p, and $P_{\text{wt},p}$ is the probability of observing the wild-type amino acid at position p.

  3. Model Inference:

At each position p, the target amino acid is masked, and the model is used to predict the probability distribution of amino acid tokens at that position. The logits output by the model for each amino acid token are transformed into probabilities using the softmax function:

$$P_{i,p} = \text{softmax}\left(\text{logits}_{i,p}\right)$$

  4. Log Probability Calculation:

The predicted probabilities are then used to calculate the log probabilities:

$$\log P_{i,p} = \log\left(\text{softmax}\left(\text{logits}_{i,p}\right)\right)$$

  5. LLR Calculation for Wild-Type:

The log probability of the wild-type amino acid at position p, denoted as $\log P_{\text{wt},p}$, is retrieved from the log probability tensor.

  6. LLR Calculation for Variant Amino Acids:

The log probability of amino acid variant i at position p, denoted as $\log P_{i,p}$, is calculated similarly.

Results:

The LLRs for all amino acid substitutions at each position are calculated and stored in a heatmap, where rows correspond to amino acid variants and columns correspond to positions along the protein sequence. The LLR value represents the relative impact of substituting the wild-type amino acid with the corresponding variant at a particular position.
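One way to put the heatmap to work is to turn it into per-residue amino acid biases and a list of positions to keep fixed. The sketch below is a rough illustration under our own assumptions: the clipping value, the conservation threshold, and the helper names are arbitrary choices. The output layout (a dictionary keyed by chain letter plus residue number, mapping amino acids to bias values) mirrors the JSON format that, to our understanding, LigandMPNN accepts for per-residue biases. It expects the LLR matrix as nested lists, so a NumPy heatmap like the one above would first be converted with `heatmap.tolist()`.

```python
import json


def llr_to_bias(llr, amino_acids, residue_ids, scale=1.0, clip=3.0):
    """Convert an LLR matrix (rows = amino acids, columns = positions)
    into a per-residue bias dictionary, clipping extreme values."""
    bias = {}
    for p, res_id in enumerate(residue_ids):
        per_aa = {}
        for i, aa in enumerate(amino_acids):
            value = max(-clip, min(clip, scale * llr[i][p]))
            if value != 0.0:  # the wild-type residue has an LLR of exactly 0
                per_aa[aa] = round(value, 3)
        bias[res_id] = per_aa
    return bias


def conserved_positions(llr, residue_ids, threshold=-2.0):
    """Flag positions whose mean LLR over all amino acids is below threshold,
    meaning most substitutions are predicted to be unfavorable."""
    n_rows = len(llr)
    flagged = []
    for p, res_id in enumerate(residue_ids):
        mean_llr = sum(llr[i][p] for i in range(n_rows)) / n_rows
        if mean_llr < threshold:
            flagged.append(res_id)
    return flagged


# Toy example with three amino acids and two positions (made-up numbers).
toy_amino_acids = ["A", "G", "W"]
toy_llr = [[0.0, -4.0],
           [-1.5, 0.0],
           [0.5, -5.0]]
toy_ids = ["A1", "A2"]

bias = llr_to_bias(toy_llr, toy_amino_acids, toy_ids)
fixed = conserved_positions(toy_llr, toy_ids)
print(json.dumps(bias, indent=2))
print(fixed)
```

Positions flagged as conserved are natural candidates to hold fixed during sequence design, while the bias dictionary nudges the remaining positions towards substitutions the language model considers plausible.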

AlphaMissense

Yet another very robust way to predict which mutations are beneficial, which are neutral, and which are deleterious to a protein's function is AlphaMissense, which uses AlphaFold2. Diagrams of how AlphaMissense works can be seen below.

image/png

image/png

Choosing the right method for determining variant effects can be difficult if you are not familiar with how these work. For state-of-the-art performance, we recommend using Evo, but the simplest to implement if you do not code is probably the ESM-2 example we provided above. If you do code, implementing this same type of scoring and heatmap visualization with Evo is relatively simple and is very much the same idea. AlphaMissense is the method that got the most attention of the three due to the fact that it uses the very popular and widely known AlphaFold2 and due to the fact that the results were published in Nature. It is likely better performing than using a pLM like ESM-2, but it does not perform as well as Evo.

Obtaining Conformational Ensembles with AlphaFlow

AlphaFlow is a flow matching model, which is a generalization of diffusion models, trained partially on MD simulation data. AlphaFlow is generative, but instead of producing single static protein backbones the way RFDiffusion does, it generates some specified number of conformations of a protein backbone, which may recapitulate molecular dynamics simulations. Below, we can see how AlphaFlow recapitulates the frames of an MD trajectory.

image/gif

Sampling the Boltzmann Distribution of a Protein with Distributional Graphormer

Distributional Graphormer, or "DiG", is a generative diffusion model that provides us with a way to sample the Boltzmann distribution of proteins. It also supports sampling transition paths between metastable states, ligand binding structure generation for given protein pockets, adsorbate configuration sampling on catalytic surfaces, and property-guided structure generation (inverse design).

image/png

Using DiG, we can get a better handle on transient binding pockets which may only be present in particular metastable states or in the transitional states in between those metastable states. This can provide us with a more comprehensive set of residues or structural motifs involved in binding a ligand or protein, which can in turn give us additional motifs or sites to scaffold with RFDiffusion or RFDiffusion All Atom. To obtain these new binding sites, we simply run AF2Bind on the various conformations generated by AlphaFlow or Distributional Graphormer.
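Once binding sites have been predicted for every conformation in an ensemble, the per-conformation results need to be combined. A minimal sketch of one way to do this is shown below: it counts how often each residue is called a binding-site residue across conformations and keeps those above a chosen frequency. The residue labels, the input format (one set of residue identifiers per conformation), and the `min_fraction` threshold are illustrative assumptions, not outputs of AF2Bind, AlphaFlow, or DiG.

```python
from collections import Counter


def recurrent_site_residues(per_conformation_sites, min_fraction=0.2):
    """Return residues predicted as binding-site residues in at least
    min_fraction of the conformations, with their observed frequency."""
    n_conformations = len(per_conformation_sites)
    if n_conformations == 0:
        return {}
    counts = Counter(res for site in per_conformation_sites for res in set(site))
    return {res: count / n_conformations
            for res, count in counts.items()
            if count / n_conformations >= min_fraction}


# Made-up predictions for four conformations of the same protein.
sites = [{"A45", "A46"}, {"A45", "A90"}, {"A45"}, {"A45", "A90"}]
frequent = recurrent_site_residues(sites, min_fraction=0.5)
print(frequent)
```

A low threshold keeps residues that line transient pockets seen in only a few conformations, while a high threshold keeps only the residues that are consistently involved in binding.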

Creating Additional Protein Backbones with Motif Scaffolding and Sequence Inpainting using RFDiffusion or RFDiffusion All Atom

Once we have all of the important structural motifs that we would like to use, either from AF2Bind or from annotations in UniProt or the PDB, we can design entirely new protein backbones that hold these motifs in place or that "scaffold the motifs". We can specify length ranges for the regions between our motifs, which can be specified exactly or sampled at random each time a new scaffold backbone is generated. We can also use the sequence inpainting capabilities of RFDiffusion to allow RFDiffusion to redesign particular residues which aren't very important to the structure or function of our protein. Below, we provide a visualization of functional motif scaffolding with EvoDiff, another diffusion model that works in protein sequence space rather than protein structure space like RFDiffusion and RFDiffusion All Atom. We will not be discussing EvoDiff here, but it is a perfectly good model for functional motif scaffolding. There are downsides and limitations to using EvoDiff though. For one, it is not able to use a second protein or a small molecule ligand as context to improve its performance and to produce binders or scaffolds with high shape complementarity to the protein or small molecule target. Additionally, it is not able to generate de novo binders, symmetric oligomers, or symmetric motif scaffolds. It also does not have functionality for partial diffusion and backbone diversification. It also does not support guiding potentials. Thus we do not recommend using it and prefer RFDiffusion and RFDiffusion All Atom.
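As a rough sketch of how the motifs and length ranges described above are communicated to RFdiffusion, the helpers below build contig and sequence inpainting arguments as strings. The `contigmap.contigs` and `contigmap.inpaint_seq` option names and the bracketed, slash-separated syntax follow the RFdiffusion README as we understand it. The helper names and the residue numbers are placeholders of our own.

```python
def scaffold_contig(motifs, linker_ranges):
    """Build a contig argument that interleaves designed linkers and fixed motifs.

    motifs: list of (chain, start, end) residue ranges copied from the input PDB.
    linker_ranges: list of (min_len, max_len) for the designed regions before,
    between, and after the motifs, so it needs len(motifs) + 1 entries.
    """
    if len(linker_ranges) != len(motifs) + 1:
        raise ValueError("need one linker range before, between, and after motifs")
    parts = []
    for k, (low, high) in enumerate(linker_ranges):
        if low > high:
            raise ValueError("linker range must satisfy min_len <= max_len")
        if high > 0:  # a (0, 0) range means no designed residues here
            parts.append(f"{low}-{high}")
        if k < len(motifs):
            chain, start, end = motifs[k]
            parts.append(f"{chain}{start}-{end}")
    return "contigmap.contigs=[" + "/".join(parts) + "]"


def inpaint_seq_arg(residue_ranges):
    """Build the argument listing motif residues whose sequence may be redesigned."""
    return "contigmap.inpaint_seq=[" + "/".join(residue_ranges) + "]"


# One motif (chain A, residues 163-181) flanked by 10-40 designed residues.
contig = scaffold_contig([("A", 163, 181)], [(10, 40), (10, 40)])
inpaint = inpaint_seq_arg(["A163-168", "A170-171", "A179"])
print(contig)
print(inpaint)
```

Giving a range such as `10-40` lets RFdiffusion sample a different linker length for each design, while an exact length is written as a range with equal bounds, for example `(25, 25)`.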

image/gif

Designing Sequences with LigandMPNN

The LigandMPNN model, pictured below, is another generative model which autoregressively designs protein sequences for given backbone structures. LigandMPNN operates on three different graphs. First, a protein-only graph with residues as nodes and, as edge features, the 25 pairwise distances between the N, Cα, C, O, and virtual Cβ atoms of residues i and j. Second, an intra-ligand graph with atoms as nodes that encodes chemical element types and distances between atoms as edges. Third, a protein-ligand graph with residues and ligand atoms as nodes and edges encoding residue j and ligand atom geometry. The LigandMPNN model has three neural network blocks: a protein backbone encoder, a protein-ligand encoder, and a decoder. Protein sequences and side-chain torsion angles are autoregressively decoded to obtain sequence and full protein structure samples.

image/png

This model is used to design sequences for the backbones that we have generated thus far. The approach described for its predecessor in "Improving protein expression, stability, and function with ProteinMPNN" can be applied with LigandMPNN as well, and because LigandMPNN outperforms ProteinMPNN, we expect improved performance on the sequence design (inverse folding) task. Additionally, LigandMPNN uses ligands as additional context, which also improves performance. You have the option of biasing particular residues towards subsets of the 20 standard amino acids, and away from others of your choosing, which can be tuned using weights for the individual residues. This, coupled with the knowledge we gained from AF2Bind, Evo, a pLM like ESM-2, AlphaMissense, and/or UniProt or PDB annotations, allows us to have more fine-grained control over the chemical properties of the generated protein sequences. This sequence design step alone can improve binding affinity and thermostability of proteins, and can be used on our original starting protein backbone(s), as well as the backbones we generated with RFDiffusion or RFDiffusion All Atom using partial diffusion and/or motif scaffolding.

LigandMPNN has many other functionalities and various knobs you can tune. Like RFDiffusion and RFDiffusion All Atom, it can handle symmetry. It can also designate residues as transmembrane buried residues or transmembrane interface residues based on user input. Additionally, it can generate side-chain conformations, and hyperparameters such as temperature can be adjusted to trade off sequence diversity against sequence recovery. You can also choose the level of Gaussian noise that was added to the backbones the model was trained on, or use SolubleMPNN to generate more soluble protein sequences. Additionally, there are various outputs such as the following:

out_dict = {}
out_dict["logits"] - raw logits from the model
out_dict["probs"] - softmax(logits)
out_dict["log_probs"] - log_softmax(logits)
out_dict["decoding_order"] - decoding order used (logits will depend on the decoding order)
out_dict["native_sequence"] - parsed input sequence in integers
out_dict["mask"] - mask for missing residues (usually all ones)
out_dict["chain_mask"] - controls which residues are decoded first
out_dict["alphabet"] - amino acid alphabet used
out_dict["residue_names"] - dictionary to map integers to residue_names, e.g. {0: "C10", 1: "C11"}
out_dict["sequence"] - parsed input sequence in alphabet
out_dict["mean_of_probs"] - averaged over batch_size*number_of_batches probabilities, [protein_length, 21]
out_dict["std_of_probs"] - same as above, but std

or logits or probabilities of the form $p(AA_i \mid \text{backbone})$ and $p(AA_i \mid \text{backbone}, AA_{\text{all except } AA_i})$ that can be returned as output.
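To give some intuition for the temperature knob and for the relationship between the `logits` and `probs` outputs, here is a small sketch of standard temperature scaling. We are assuming that LigandMPNN's temperature behaves like this textbook formulation, and the logits below are made up for three hypothetical amino acids at a single position.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities, dividing by the temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats: higher means more diverse sampling."""
    return -sum(p * math.log(p) for p in probs if p > 0)


position_logits = [2.0, 1.0, 0.1]  # made-up values
cold = softmax_with_temperature(position_logits, temperature=0.1)
warm = softmax_with_temperature(position_logits, temperature=1.0)
print([round(p, 3) for p in cold], round(entropy(cold), 3))
print([round(p, 3) for p in warm], round(entropy(warm), 3))
```

At low temperature nearly all of the probability mass sits on the top-scoring amino acid, which favors sequence recovery, whereas higher temperatures flatten the distribution and yield more diverse sequences.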

Validating Designed Sequences with AlphaFold2

Once you have designed one or multiple sequences for each protein backbone with LigandMPNN, it is standard to validate them and check their quality with AlphaFold2 (or OpenFold). We do this by predicting the structure of the LigandMPNN generated sequence using OpenFold without an MSA or templates, that is, with the sequence alone. We then compare this predicted structure to the structure that RFDiffusion or RFDiffusion All Atom generated using a metric like RMSD. While there are other metrics you can use, RMSD is standard. This validation step allows us to filter out low quality sequences based on the RMSD scores. If the RMSD between the predicted structure and the RFDiffusion or RFDiffusion All Atom structure is high, we know the quality of the sequence that LigandMPNN designed is low.
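For readers who want to see what the RMSD comparison involves, the sketch below computes the RMSD between two matched sets of coordinates (for example Cα atoms) after optimal superposition. It uses only the standard library and finds the optimal rotation through the largest eigenvalue of Horn's 4x4 quaternion matrix, estimated by power iteration. In practice a structural biology library or NumPy-based Kabsch implementation would be the usual choice. The coordinates are made up, and the function name is our own.

```python
import math


def superposed_rmsd(coords_a, coords_b, iterations=2000):
    """RMSD between matched coordinate lists after optimal rigid superposition."""
    n = len(coords_a)
    if n == 0 or n != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")

    def centered(coords):
        cx, cy, cz = (sum(p[k] for p in coords) / n for k in range(3))
        return [(x - cx, y - cy, z - cz) for x, y, z in coords]

    a, b = centered(coords_a), centered(coords_b)
    # Cross-covariance sums S[i][j] = sum over atoms of a_i * b_j.
    s = [[sum(p[i] * q[j] for p, q in zip(a, b)) for j in range(3)] for i in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    key = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    ga = sum(x * x + y * y + z * z for x, y, z in a)
    gb = sum(x * x + y * y + z * z for x, y, z in b)

    # Shift by a Gershgorin bound so the largest eigenvalue dominates in magnitude.
    shift = max(sum(abs(v) for v in row) for row in key)
    lam = 0.0
    if shift > 0:
        vec = [1.0, 0.5, 0.25, 0.125]
        for _ in range(iterations):
            new = [sum(key[i][j] * vec[j] for j in range(4)) + shift * vec[i]
                   for i in range(4)]
            norm = math.sqrt(sum(v * v for v in new))
            vec = [v / norm for v in new]
        # Rayleigh quotient gives the largest eigenvalue of the unshifted matrix.
        lam = sum(vec[i] * sum(key[i][j] * vec[j] for j in range(4)) for i in range(4))
    mean_sq = max(0.0, (ga + gb - 2.0 * lam) / n)
    return math.sqrt(mean_sq)


# A toy backbone and a rotated, translated copy of it (made-up coordinates).
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (7.6, 3.8, 1.0)]
moved = [(-y + 5.0, x - 2.0, z + 1.0) for x, y, z in reference]
print(superposed_rmsd(reference, moved))
```

A design would then be kept only if the RMSD between its predicted structure and the generated backbone falls below a threshold of your choosing.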

Predicting Binding Affinity with LIS Scores from RoseTTAFold All Atom

The AlphaFold-Multimer LIS Score is a metric computed from the PAE or "Predicted Aligned Error" outputs of AlphaFold-Multimer (or OpenFold). This score is a very effective new method that predicts protein-protein interactions. At present, no one has used this with RoseTTAFold All Atom to predict interactions between proteins and small molecules or proteins and DNA/RNA, but the method generalizes to both of these scenarios. The PAE (and somewhat erroneously the ipTM and pLDDT scores) are often used by researchers to help predict the strength of PPIs. The LIS score is computed from the PAE and has better predictive power compared to PAE and various deep learning models trained specifically for predicting binding affinity. Below, we see a figure showing how AFM-LIS computes the LIS.

image/png

We recommend using the PAE output from OpenFold or RoseTTAFold All Atom for computing the LIS score and predicting protein-protein interactions, and we recommend RoseTTAFold All Atom for protein-small molecule or protein-DNA/RNA interactions. This allows us to filter out more low quality sequences based on the LIS score.
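The sketch below shows how an LIS-style score can be computed from a PAE matrix. It follows our understanding of the AFM-LIS definition: inter-chain PAE values below a cutoff (12 Å is assumed here) are rescaled to the range 0 to 1 and averaged. The function name, the input format (a square nested list with chain A occupying the first `len_a` rows and columns), and the toy PAE values are our own assumptions. For a small molecule, the ligand atoms would take the place of the second chain, which, as noted above, is an extension that has not been benchmarked.

```python
def local_interaction_score(pae, len_a, cutoff=12.0):
    """Compute an LIS-style score and the number of contributing residue pairs.

    pae: square matrix (nested lists) of predicted aligned errors.
    len_a: number of residues in the first chain; the rest form the partner.
    """
    n = len(pae)
    chain_a = range(len_a)
    chain_b = range(len_a, n)

    def block_score(rows, cols):
        values = [pae[i][j] for i in rows for j in cols if pae[i][j] < cutoff]
        if not values:
            return 0.0, 0
        # Low PAE means high confidence, so rescale to (cutoff - PAE) / cutoff.
        return sum((cutoff - v) / cutoff for v in values) / len(values), len(values)

    lis_ab, pairs_ab = block_score(chain_a, chain_b)
    lis_ba, pairs_ba = block_score(chain_b, chain_a)
    return (lis_ab + lis_ba) / 2.0, pairs_ab + pairs_ba


# Toy PAE matrix: two residues in chain A and one in chain B (made-up values).
toy_pae = [
    [0.0, 1.0, 6.0],
    [1.0, 0.0, 18.0],
    [3.0, 30.0, 0.0],
]
lis, n_pairs = local_interaction_score(toy_pae, len_a=2)
print(lis, n_pairs)
```

Designs can then be ranked by this score, with higher values indicating a more confidently predicted interface.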

Predict Thermostability with ThermoMPNN

Lastly, if increasing thermostability is among your list of goals, we recommend using ThermoMPNN (see here for the preprint), a finetuned version of ProteinMPNN.

ThermoMPNN architecture and primary dataset statistics are pictured below. (a) Model architecture of ThermoMPNN, a graph neural network trained on embeddings extracted from a pre-trained sequence recovery model (ProteinMPNN, left panel) to predict thermostability changes caused by protein point mutations. The input protein is passed through ProteinMPNN, where the learned embeddings from each decoder layer are extracted and concatenated with the learned sequence embedding to create a vector representation of the residue environment. This vector is passed through a light attention block (LA, purple block) which uses self-attention to reweight the vector based on learned context. Finally, a small multilayer perceptron (MLP, red block) predicts a ΔΔG° for mutation to each possible amino acid. (b) Curation, clustering, and data splitting procedure for the Megascale and Fireprot datasets used in this study. Each split is labelled with its total number of mutations, and homologues are shown in yellow. Each clustering result is labeled with the number of clusters in each dataset. (c) Histogram of mutations per protein distribution for each dataset. (d) Histogram of protein length distribution for each dataset. (e) Donut charts of percentage of mutations to alanine compared to other polar and nonpolar residues for each dataset, along with natural residue abundance for all proteins in the SwissProt database for comparison.

image/png

This allows us to filter the designed protein sequences even further and eliminate proteins with low thermostability.

Concluding Remarks

With all of these methods combined, you should now be able to improve the binding affinity, thermostability, and function of proteins, as well as design entirely new proteins which bind to your target small molecule or protein. This provides us with a high impact pipeline which can be used for improving the environment, curing diseases, improving human health, and many other tasks of interest. Hopefully this guide provides you with a good starting point for becoming an expert protein engineer, computational biochemist, and AI scientist, an intersection of skillsets that will have an incredibly positive impact on the world.