Computing the energy of a protein and a single mutation#

In this tutorial we familiarize ourselves with foldx, and how to e.g. measure the stability of a wild type and a single mutation of it.

Note

We assume that you have already installed foldx, and that you know where its binaries are located.

If you need a refresher/want to install it, check the previous chapter here.

This tutorial is heavily inspired by how LaMBO uses foldx.

Setting up folders and files#

Let’s start by setting up some environment variables that will be essential. FoldX works by reading and creating several files, so we need to set up path variables to be able to

  1. Copy some essentials from the foldx files (namely, the rotabase.txt file)

  2. Create a working directory where foldx will put all the files it wants to create.

This notebook assumes that your foldx files are inside to your home directory, i.e. we expect that

  • the binary ~/foldx/foldx exists, and

  • the file ~/foldx/rotabase.txt exists.

These two can be found in your foldx installation. You might need to rename the binary.

from pathlib import Path
import shutil

# The path to the folder where the FoldX files are located
PATH_TO_FOLDX_FILES = Path().home() / "foldx"

# The path to where this notebook is located
THIS_DIR = Path().resolve()

# Creating a working directory for foldx's files
WORKING_DIR = THIS_DIR / "tmp"
WORKING_DIR.mkdir(exist_ok=True)

# Copying the rotabase.txt file to the working directory
shutil.copyfile(PATH_TO_FOLDX_FILES / "rotabase.txt", WORKING_DIR / "rotabase.txt")

Downloading a PDB file#

In this example, we use a protein responsible for transporting oxigen in sperm whales (called 101m). You could download it by hand from pdb, but in this notebook we will use Python, and we will download it directly on the working directory at ./tmp:

import urllib.request

# Downloading the PDB file
web_address = "https://files.rcsb.org/view/101M.pdb"
urllib.request.urlretrieve(
    web_address,
    WORKING_DIR / "101m.pdb"
)

Of course, feel free to adapt this to another .pdb file (as long as you place it in ./tmp). To aid this generalization, we’ll define a pdb_name variable, which will correspond to 101m in our example

# In case you choose a different pdb, you can change the name here
pdb_name = "101m"

Repairing the file#

The common way to process a pdb for mutation is to start by repairing the file using the RepairPDB command inside FoldX. Check their documentation for more details.

We will check whether the file exists (for future re-runs), since the process is a little bit time-consumig (>400sec).

import subprocess

# Building the command for repairing the wildtype PDB
if not (WORKING_DIR / "101m_Repair.pdb").exists():
    repair_cmd = [
        str(PATH_TO_FOLDX_FILES / "foldx"),
        "--pdb",
        "101m.pdb",
        "--command",
        "RepairPDB",
        "--water",
        "-CRYSTAL",
        "--pH",
        "7.0",
    ]

    # Running the repair command
    subprocess.run(repair_cmd, cwd=WORKING_DIR)

After we run this repair process, this is the tree structure of the tmp working directory:

└── tmp
    ├── 101m.pdb
    ├── 101m_Repair.fxout
    ├── 101m_Repair.pdb
    ├── Unrecognized_molecules.txt
    └── rotabase.txt

101m_Repair.pdb is the main output of this RepairPDB, and we will use it onwards.

Now, let’s parse this file and inspect it.

Parsing the repaired wildtype#

We can easily load up the structure using biopython:

from Bio import PDB

parser = PDB.PDBParser()
structure = parser.get_structure("pdb", WORKING_DIR / "101m_Repair.pdb")
residues = list(structure.get_residues())

At this point, the list residues is full of Bio.PDB.Residues. These contain all the relevant information of each residue. Let’s dive into some of the attributes and methods of the first one:

first_residue = residues[0]
print(f"Residue's name: {first_residue.resname}")
print(f"The sequence index: {first_residue.id[1]}")
print(f"Chain id: {first_residue.get_parent().id}")
Residue's name: MET
The sequence index: 0
Chain id: A

As you can see, the residue is specified as the 3-letter code. We can extract the 1-letter code using Bio.SeqUtils’ seq1.

from Bio.SeqUtils import seq1

print(f"Residue as one letter: {seq1(first_residue.resname)}")
Residue as one letter: M

These are all the ingredients we will need for defining mutations. This is what we focus on next.

Defining mutations#

As they say in their documentation, we need to specify mutations by either providing the mutated 1-letter sequence, or by specifying a list of individual mutations.

In this tutorial, we will show how to work with the latter version. Mutations are defined in an individual_list.txt file. Each row of this file has the following structure:

row = f"{original_residue}{chain_id}{position_in_sequence}{mutant_residue};"

For our running example, we know that

  • original_residue = seq1(first_residue.resname) is M,

  • chain_id = first_residue.get_parent().id is A,

  • position_in_sequence = first_residue.id[1] is 0

  • mutant_residue is for us to choose, let’s say G (i.e. Glycine (?)).

Warning

Be careful with position_in_sequence. In general, this position is not the position in the list of residues. The sure-fire way of getting it is, as we did, by accessing .id[1].

Let’s define this mutation in a way that’s easily modifiable. We can create a mutation_list with the position in the residue list we want to modify, and to which residue:

mutation_list = [
    {
        "residue_idx": 0,
        "to": "G",
    }
]

Now we can create the individual_list.txt file inside our working directory:

with open(WORKING_DIR / "individual_list.txt", "w") as f:
    for mutation in mutation_list:
        # We get the initial residue from the PDB file, this is a Residue type.
        # You can check the documentation of these inside Biopython.
        residue = residues[mutation["residue_idx"]]

        # The original residue lies in residue.resname. This is the 3-letter code.
        # To tranform it into the 1-letter code, we use seq1 from Bio.SeqUtils.
        original_residue = seq1(residue.resname)

        # We can read the position as the second position of the ID:
        # residue.id = (' ', position_in_chain, ' ')
        position = residue.id[1]

        # The chain ID can be read from the parent of the residue
        chain_id = residue.get_parent().id

        # The line we need to write, then, is:
        f.write(f"{original_residue}{chain_id}{position}{mutation['to']};" + "\n")

The contents of indiviual_list.txt then are:

MA0G;

Warning

It is important that the list of mutations is called individual_list.txt. Otherwise, foldx won’t be able to process it.

Computing the mutation’s energy#

foldx’s BuildModel command allows us to compute (among several things) the total Gibbs energy of both the original wild type, as well as mutations of it.

The command to run goes as follows:

foldx_cmd = [
    str(PATH_TO_FOLDX_FILES / "foldx"),
    "--pdb",
    "101m_Repair.pdb",
    "--command",
    "BuildModel",
    "--mutant-file",
    str(WORKING_DIR / "individual_list.txt"),
    "--water",
    "-CRYSTAL",
    "--pH",
    "7.0",
]

which can be run using subprocess’ run method:

subprocess.run(foldx_cmd, cwd=WORKING_DIR)
Hide code cell output
   ********************************************
   ***                                      ***
   ***             FoldX 5 (c)            ***
   ***                                      ***
   ***     code by the FoldX Consortium     ***
   ***                                      ***
   ***     Jesper Borg, Frederic Rousseau   ***
   ***    Joost Schymkowitz, Luis Serrano   ***
   ***    Peter Vanhee, Erik Verschueren    ***
   ***     Lies Baeten, Javier Delgado      ***
   ***       and Francois Stricher          ***
   *** and any other of the 9! permutations ***
   ***   based on an original concept by    ***
   ***   Raphael Guerois and Luis Serrano   ***
   ********************************************

1 models read: 101m_Repair.pdb
1 models read: 101m_Repair.pdb
BackHbond       =               -142.58
SideHbond       =               -48.61
Energy_VdW      =               -179.63
Electro         =               -8.33
Energy_SolvP    =               245.28
Energy_SolvH    =               -238.89
Energy_vdwclash =               3.42
energy_torsion  =               6.70
backbone_vdwclash=              158.16
Entropy_sidec   =               105.87
Entropy_mainc   =               231.69
water bonds     =               0.00
helix dipole    =               -8.75
loop_entropy    =               0.00
cis_bond        =               0.00
disulfide       =               0.00
kn electrostatic=               0.00
partial covalent interactions = 0.00
Energy_Ionisation =             1.56
Entropy Complex =               0.00
-----------------------------------------------------------
Total          = 				  -32.28
BackHbond       =               -142.58
SideHbond       =               -48.61
Energy_VdW      =               -179.63
Electro         =               -8.33
Energy_SolvP    =               245.28
Energy_SolvH    =               -238.89
Energy_vdwclash =               3.42
energy_torsion  =               6.70
backbone_vdwclash=              158.16
Entropy_sidec   =               105.87
Entropy_mainc   =               231.69
water bonds     =               0.00
helix dipole    =               -8.75
loop_entropy    =               0.00
cis_bond        =               0.00
disulfide       =               0.00
kn electrostatic=               0.00
partial covalent interactions = 0.00
Energy_Ionisation =             1.56
Entropy Complex =               0.00
-----------------------------------------------------------
Total          = 				  -32.28

Starting BuildModel
Reading MA0G;
Residue to Mutate META0 has residue index 0
Mutating residue = META0 into GLY
Your file run OK
End time of FoldX: Thu Nov 21 17:39:03 2024
Total time spend: 24.49 seconds.
validated file "101m_Repair_1.pdb" => successfully finished
Cleaning BuildModel...DONE
CompletedProcess(args=['/Users/sjt972/foldx/foldx', '--pdb', '101m_Repair.pdb', '--command', 'BuildModel', '--mutant-file', '/Users/sjt972/Projects/poli-docs/docs/poli-docs/understanding_foldx/01-single-mutation-using-foldx/tmp/individual_list.txt', '--water', '-CRYSTAL', '--pH', '7.0'], returncode=0)

After running this command, 6 new files are created in our working directory:

└── tmp
    ├── 101m.pdb
    ├── 101m_Repair.fxout
    ├── 101m_Repair.pdb
    ├── 101m_Repair_1.pdb           # <--
    ├── Average_101m_Repair.fxout   # <--
    ├── Dif_101m_Repair.fxout       # <--
    ├── PdbList_101m_Repair.fxout   # <--
    ├── Raw_101m_Repair.fxout       # <--
    ├── Unrecognized_molecules.txt
    ├── WT_101m_Repair_1.pdb        # <--
    ├── individual_list.txt
    └── rotabase.txt

Of these, the resulting energy can be found in the last two lines of Raw_{pdb_name}_Repair.fxout, which form a table with several more quantities. Let’s create a pandas dataframe:

# The details of these can be found in:
# https://foldxsuite.crg.eu/command/BuildModel

column_names = [
    "Pdb",
    "total energy",
    "Backbone Hbond",
    "Sidechain Hbond",
    "Van der Waals",
    "Electrostatics",
    "Solvation Polar",
    "Solvation Hydrophobic",
    "Van der Waals clashes",
    "entropy sidechain",
    "entropy mainchain",
    "sloop_entropy", 
    "mloop_entropy",
    "cis_bond",
    "torsional clash",
    "backbone clash",
    "helix dipole",
    "water bridge",
    "disulfide",
    "electrostatic kon",
    "partial covalent bonds",
    "energy Ionisation",
    "Entropy Complex"
]
import pandas as pd

with open(WORKING_DIR / f"Raw_{pdb_name}_Repair.fxout") as f:
    lines = f.readlines()

# The important data is in the last two lines
df = pd.DataFrame(
    [line.split() for line in lines[-2:]],
    columns=column_names
)
/var/folders/l3/qk9dx6g958765kmn_2wn34t00000gn/T/ipykernel_60159/1758354106.py:1: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd

Of all these columns, we are interested in the total energy (the predicted overall stability of the protein, according to foldx). Let’s check what the total energy was for both of these:

df[["Pdb", "total energy"]]
Pdb total energy
0 101m_Repair_1.pdb -31.7457
1 WT_101m_Repair_1.pdb -34.3436

In foldx’s notation, the row at index 0 corresponds to the mutated version of the wild type (which appears at index 1).

Computing the SASA score#

SASA stands for solvent-accessible surface area. An algorithm for computing this value was described in [Shrake and Rupley, 1973].

Computing it is pretty easy now that we have both the PDB for the wild type and for the mutated molecule. Biopython has all the functionality available:

from Bio.PDB import SASA

# Let's load up the structure again, for both the wildtype and the mutant
parser = PDB.PDBParser(QUIET=True)
wt_structure = parser.get_structure("pdb", WORKING_DIR / "101m_Repair.pdb")
mut_structure = parser.get_structure("pdb1", WORKING_DIR / "101m_Repair_1.pdb")

# Now, we can create a SASA computer
sasa = SASA.ShrakeRupley()

# We can attach, to each structure, its SASA
sasa.compute(wt_structure, level="S")
sasa.compute(mut_structure, level="S")

print(f"Wildtype SASA: {wt_structure.sasa}")
print(f"Mutant SASA: {mut_structure.sasa}")
Wildtype SASA: 8407.731560227876
Mutant SASA: 8439.063468009845

Conclusion#

This short tutorial shows how to use the binaries of foldx to measure the impact of a mutation on a wild type’s predicted energy. To do so, we

  1. repaired the .pdb we downloaded from the database,

  2. defined a desired mutation using an individual_list.txt file,

  3. used foldx’s BuildModel command, passing this list of desired mutations.

  4. computed both the stability and the SASA of the wild type and its mutation.

These are quantities we want to optimize: the lesser the energy, the stabler a protein might be, and higher SASA correlates with e.g. length of fluorescence in certain proteins. Indeed, these are the quantities described and optimized one of the tasks presented in LaMBO [Stanton et al., 2022].