EMBO Course Haddock Practical.

EMBO Practical NMR Course 2013


Haddock Practical
Haddock is a data-driven docking program based on the principle that knowledge of residues
involved in interactions between molecules can be used to derive ambiguous interaction
restraints (AIRs). These restraints can then be used in structure calculations, based broadly on the
Aria protocols. The residues that make up the restraints are divided into active residues, for which
there is experimental evidence of involvement, and passive residues, which may be involved in
the interaction. Together these form distance restraints, one for each active residue in partner A and
one for each active residue in partner B.

An AIR is defined as an ambiguous intermolecular distance (d_iAB) with a maximum value of 3 Å
between any atom m of residue i of protein A (m_iA) and any atom n of both active and passive
residues k (N_res in total) of protein B (n_kB).
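How one ambiguous restraint can cover many atom pairs is easiest to see numerically. The sketch below assumes the sum of inverse sixth powers that HADDOCK is generally described as using to combine the individual atom-atom distances into one effective distance; that formula is not given in this handout, and the distances are made-up values for illustration.

```python
def effective_distance(distances):
    """Combine atom-atom distances (in Angstrom) into one effective distance.

    Assumed formula (not stated in this handout):
        d_eff = (sum over all atom pairs of 1/d^6) ^ (-1/6)
    The shortest distances dominate the sum, so a single close contact
    is enough to satisfy the restraint.
    """
    return sum(d ** -6 for d in distances) ** (-1.0 / 6.0)


# Hypothetical distances between atoms of one active residue of protein A
# and atoms of active/passive residues of protein B.
example_distances = [3.0, 8.0, 12.0, 20.0]
d_eff = effective_distance(example_distances)

# The restraint counts as satisfied when d_eff is at or below the 3 Angstrom limit.
satisfied = d_eff <= 3.0
print(f"effective distance = {d_eff:.3f} A, satisfied = {satisfied}")
```

Because the sum is dominated by the closest pair, the effective distance is always a little shorter than the shortest individual distance.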

The experimental data used in HADDOCK can be noisy, causing some residues to be defined as
active, whereas in reality, they are not making any contact with the partner molecule (false
positives). To deal with noisy data a fraction of the restraints is discarded at random. This option is
enabled by default, with 50% of the restraints discarded. In HADDOCK, the docking process is
repeated several thousand times, starting from different initial random orientations, and also
with a different set of restraints being discarded. This means that some of the docking solutions, by
chance, will be driven only by restraints that are correct, leading to correct docking solutions.
HADDOCK relies on the fact that these correct solutions should have better scores to discriminate
them from solutions driven by wrong restraints.
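The random removal described above can be illustrated with a few lines of Python. This is only a sketch of the idea, not HADDOCK's own implementation: the function name and the use of the Sec5 active residue numbers as stand-ins for restraints are choices made for this example.

```python
import random


def discard_restraints(restraints, fraction=0.5, rng=None):
    """Return a random subset of restraints with `fraction` of them removed."""
    rng = rng or random.Random()
    n_discard = int(len(restraints) * fraction)
    n_keep = len(restraints) - n_discard
    return sorted(rng.sample(list(restraints), n_keep))


# One restraint per active residue (Sec5 residue numbers used as labels).
all_restraints = [11, 14, 16, 17, 27, 67, 85, 87, 88, 91, 92]

# Each docking trial sees a different random subset of the restraints.
rng = random.Random(2013)  # fixed seed so the example is repeatable
trials = [discard_restraints(all_restraints, 0.5, rng) for _ in range(3)]
for number, kept in enumerate(trials, start=1):
    print(f"trial {number}: restraints kept = {kept}")
```

If a few of the restraints are false positives, some trials will by chance drop all of them, which is why many repeats are needed.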

The docking protocol consists of three stages:


(i) randomization of orientations and a rigid-body energy minimization (1000 structures).
Molecules are separated by 25 Å and a random rotation is applied to each.
(ii) semi-flexible simulated annealing in torsion angle space (200 structures). The amino acids
at the interface are allowed to move to optimize the packing.
(iii) refinement in explicit solvent (200 structures).

After each stage, the structures are scored and ranked, and the best structures are kept for the next
stage. The HADDOCK score is a weighted sum of van der Waals, electrostatic, desolvation and
restraint violation energies together with buried surface area.

The resulting models are then clustered based on a similarity measure, the positional interface ligand
RMSD (iL-RMSD), which captures conformational changes about the interface by fitting on the
interface of the first molecule and calculating the RMSDs of the interface of the second molecule.

The Practical

In this practical, we will use the Sec5 protein that you looked at in the Analysis practical. We have
the structure of the free domain (pdb code 1HK6), the structure of its binding partner Ral (pdb code
2KE5) and the structure of their complex (pdb code 1UAD). All of these pdb files are provided in
the haddock_practical directory.

Generating the restraints

The Sec5 residues that interact with Ral have been defined by chemical shift mapping. The
chemical shift mapping results are shown here graphically:
[Figure: bar chart of the combined chemical shift change, [(Δδ15N)^2 + 4(Δδ1H)^2]^1/2, against residue number (0 to 100).]
Black bars = experimental shift changes. Dotted bars = residues whose peaks have disappeared
(assigned a shift change of 1.0). Dotted line = average experimental shift change.

The list of shift changes is in the file shifts_0_to_1.5 in the haddock practical directory. The average
chemical shift change is 0.646. The first task is to find the residues whose chemical shift change is
larger than the average, so write these down here:
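If you would rather not pick the residues out by eye, a short script can do the comparison. The sketch below assumes the shift file holds one residue per line as "residue_number shift_change"; the real layout of shifts_0_to_1.5 may differ, and the sample values are invented for illustration.

```python
def residues_above_average(lines, average=None):
    """Return residue numbers whose shift change exceeds the average.

    Each line is assumed to hold "residue_number shift_change";
    lines that do not fit this pattern are skipped.
    """
    shifts = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            shifts[int(fields[0])] = float(fields[1])
        except ValueError:
            continue  # header or comment line
    if average is None:
        average = sum(shifts.values()) / len(shifts)
    return sorted(res for res, change in shifts.items() if change > average)


# Invented sample data in the assumed two-column layout.
sample_lines = ["10 0.12", "11 0.95", "12 0.30", "14 1.00"]
selected = residues_above_average(sample_lines, average=0.646)
print(selected)

# With the real file (assuming the same layout):
# with open("shifts_0_to_1.5") as handle:
#     print(residues_above_average(handle, average=0.646))
```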

Now it is necessary to exclude those residues which are unlikely to be involved in the interaction,
that is, those which are buried in the structure. To find these, run a program that measures solvent
accessibility. One such program is naccess, which rolls a probe around the van der Waals surface
of the molecule. The radius of the probe is the same as the radius of a water molecule.
Type naccess 1HK6.pdb in the haddock directory. This creates a file called 1HK6.rsa:

REM Relative accessibilites read from external file "/usr/local/naccess2.1.1/standard.data"
REM File of summed (Sum) and % (per.) accessibilities for
REM RES _ NUM    All-atoms   Total-Side   Main-Chain    Non-polar    All polar
REM              ABS   REL    ABS   REL    ABS   REL    ABS   REL    ABS   REL
RES HIS A   3 227.15 124.2 166.55 113.2  60.60 169.3 112.41 115.7 114.74 133.8
RES MET A   4 180.39  92.9 162.41 103.7  17.98  47.9 162.41 102.9  17.98  49.5
RES ARG A   5 171.65  71.9 152.50  75.8  19.14  51.0  69.80  89.7 101.84  63.3

The ABS column shows you the absolute accessibility of each residue in Å² and the REL column shows the relative
accessibility (compared to Ala-X-Ala). You now need to filter out residues whose chemical shifts
have changed but are not sufficiently accessible. The usual cutoff is to keep residues which have
either the main-chain or the side-chain > 50% accessible, i.e. the column Total-Side (REL) or
Main-Chain (REL) must be more than 50%. To find these quickly, use the awk command:

awk '{if ($8>50||$10>50) print $0}' < 1HK6.rsa

to print any line with more than 50 in field 8 or field 10.
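The same filter can be written in Python, which makes it easy to combine with your list of shifted residues afterwards. This is a sketch that splits on whitespace in the same way as awk, so it assumes every RES line carries a chain identifier and that no columns run together. The first three lines are taken from the 1HK6.rsa excerpt; the fourth is a made-up buried residue added to show a rejection.

```python
def accessible_residues(rsa_lines, cutoff=50.0):
    """Return residue numbers with side-chain or main-chain REL above cutoff."""
    accessible = []
    for line in rsa_lines:
        fields = line.split()
        # Only the per-residue lines start with RES; skip REM headers etc.
        if len(fields) < 10 or fields[0] != "RES":
            continue
        side_rel = float(fields[7])   # awk field 8: Total-Side REL
        main_rel = float(fields[9])   # awk field 10: Main-Chain REL
        if side_rel > cutoff or main_rel > cutoff:
            accessible.append(int(fields[3]))
    return accessible


rsa_lines = [
    "REM RES _ NUM All-atoms Total-Side Main-Chain Non-polar All polar",
    "RES HIS A 3 227.15 124.2 166.55 113.2 60.60 169.3 112.41 115.7 114.74 133.8",
    "RES MET A 4 180.39 92.9 162.41 103.7 17.98 47.9 162.41 102.9 17.98 49.5",
    "RES ARG A 5 171.65 71.9 152.50 75.8 19.14 51.0 69.80 89.7 101.84 63.3",
    # Hypothetical buried residue, not from the real file:
    "RES LEU A 6 5.00 2.8 3.00 2.1 2.00 5.3 4.00 2.8 1.00 2.9",
]
exposed = accessible_residues(rsa_lines)
print(exposed)

# Keep only the shifted residues that are also exposed (example list).
shifted = [4, 6]
active = [res for res in shifted if res in set(exposed)]
print(active)
```

Checking that the first field is RES also keeps the REM header lines out of the output.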


Now you can strike out of your list those residues which are buried. The remaining residues are the
active residues. You should be left with: 11, 14, 16, 17, 27, 67, 85, 87, 88, 91, 92

To generate the list of passive residues, you need to find all the residues that are contacting the
active residues and are also solvent accessible. This can be done using Pymol or Rasmol, but to
save time, we will let HADDOCK figure them out for us. This works well when you have sufficient
experimental data to define the active residues completely. If you have very little data (e.g. only 1-2
mutations), you might want to define more passive residues based on other knowledge of the
protein, e.g. where homologous proteins are known to interact.

The Ral residues that interact with Sec5 have been defined by mutagenesis. Mutation of residues
14, 36, 38, 47, 49, 50, 51, 67 disrupts the binding of Sec5. Check the solvent accessibility of these
as before to filter out anything that is not accessible.

Running Haddock

The Haddock webserver is at http://haddock.science.uu.nl/ and this is the one that you will use for
your own calculations. You must apply for an account via this page and then you can perform your
own Haddock calculations.

For this course, we will use course accounts which only calculate a subset of structures for speed
(250 and 50 instead of 1000 and 200). This is not ideal, because Haddock needs to cluster a large
number of structures to work properly.

Go to the course webserver at:

milou.science.uu.nl/services/HADDOCK

Select HADDOCK server: the Easy interface

Find First molecule and unfold the menu.


The first structure will be uploaded from the haddock practical directory on your computer, because
it includes a Mg ion and the pdb file has been edited to explicitly include the charge. The file is
called 2KE5.pdb. Select to use all the chains of the pdb file (it only has one chain).
Input the active residues that you have defined and select define passive residues automatically
around the active residues.

Find Second molecule. Select Download from the pdb option, use all the chains again. Use the
pdb code of Sec5: 1HK6. This pdb entry contains a family of NMR structures but Haddock only
uses the first structure in the ensemble. In this case, the first structure is selected to be the most
representative structure (closest to the mean structure). Input the active and passive residues that
you have defined.

If you use a family of structures, you should check that this is the best structure by going to the
pdb (e.g. www.rcsb.org) and looking at the pdb file header. Somewhere in the header, it should
state which is the best representative in the ensemble. If it were not the first one, it would be
necessary to download the pdb file and create a new file containing only the best structure.

A number of course accounts have been set up. Use the username I assign you and the password
EMBO2013 and Submit Query to start the run.

Next click the link to the results and wait. The Haddock run should take about 30 minutes.

You can use another browser window or tab to access a docking run that I ran earlier while you are
waiting for your own to finish. It is at:

http://milou.science.uu.nl/serviceresults/HADDOCK/40620101233/run2/

and you can use it to carry out some of the assessments described below while you wait for your
own structures.

Assessing the Results

After the Haddock run is complete, a results page is generated. If you were using your own account,
an email would be sent to you.

The docking creates a large set of models, with the assumption that some of them will be correct. At
the end of the docking run all of the models are clustered to remove isolated structures that do not
resemble many others in the pool of models. Each cluster is then analysed based on the haddock
scores.

The results page includes an overview of the top ten clusters, ranked by the average HADDOCK
score of their four best structures, and statistics of energetic terms and other structural measures for
each cluster. This allows a quick assessment of the quality of the generated models.
The haddock score is calculated as:

HADDOCK score = 1.0 × E_vdW + 0.2 × E_elec + 0.1 × E_AIR + 1.0 × E_desolv

The top cluster in the list is considered to be the best on the basis of the haddock score and the
associated Z-score.
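The weighted sum and the cluster ranking can be reproduced in a few lines. The sketch below uses the weights from the formula above and ranks clusters by the average score of their four best (lowest-scoring) structures, as the results page does. All energy values and cluster names are invented for illustration.

```python
def haddock_score(e_vdw, e_elec, e_air, e_desolv):
    """Weighted sum of energy terms, using the weights quoted in the text."""
    return 1.0 * e_vdw + 0.2 * e_elec + 0.1 * e_air + 1.0 * e_desolv


def rank_clusters(cluster_scores, n_best=4):
    """Rank clusters by the mean score of their n_best lowest-scoring models."""
    averages = {}
    for name, scores in cluster_scores.items():
        best = sorted(scores)[:n_best]  # lower score = better model
        averages[name] = sum(best) / len(best)
    return sorted(averages.items(), key=lambda item: item[1])


# Invented energy terms for a single model.
example_score = haddock_score(e_vdw=-40.0, e_elec=-200.0, e_air=50.0, e_desolv=10.0)
print(f"score = {example_score:.1f}")

# Invented per-model scores for two clusters.
cluster_scores = {
    "cluster1": [-80.0, -75.0, -70.0, -65.0, -20.0],
    "cluster2": [-60.0, -55.0, -50.0, -45.0],
}
ranking = rank_clusters(cluster_scores)
for name, average in ranking:
    print(f"{name}: average of 4 best = {average:.1f}")
```

Averaging only the four best structures stops a large cluster with a few poor models from being penalised relative to a small one.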

At the bottom of the page, there are several plots of different measures of the models to assess the
quality of the clusters.

The meanings of some of the terms used in these plots are defined at the bottom of the page:
i-RMSD   RMSD of backbone atoms of all residues in the interface
l-RMSD   RMSD of the Sec5 protein after fitting models on the backbone of Ral
FCC      fraction of intermolecular contacts that are common between the models

It is not easy to decide which of the models are best based on these plots, but they give you an
idea of how good some of the clusters are and whether the models within each cluster are similar
to each other.

To have a good look at the structures produced we need to download the results of the haddock run.
You can download my results from:
http://computing.bio.cam.ac.uk/~hrm28/haddock_practical/run2.tgz

Once the download is complete, move the run2.tgz file into the haddock practical directory. Go to
the directory where the download was put and use mv run2.tgz /home/mott/haddock_practical/.

To unpack the tar file go into the haddock practical directory and type:
tar zxvf run2.tgz
and cd run2.

The first thing that you can do is to compare the best model from each cluster. To do this, type
pymol cluster*_1.pdb which will load cluster1_1, cluster2_1, cluster3_1 and cluster4_1 into
pymol.

In pymol:
Find all on the right and click H (hide) then everything. Click S (show) and Cartoon.
Go to cluster1_1 and click A (action), then align, then all to this. This will align all the structures
over the Ral molecule. To show and hide individual models click on their names.

Find the S at the bottom right of the screen, where you can switch on the sequence display to select the active
residues of the two chains in the individual models (NB Ral is the first sequence you see and Sec5
is the second). Select the residues in the sequence by clicking on them, then find sele on the right
hand side, click C and colour them by element (choose whichever colour scheme you like). Then
click S and lines, which shows the sidechains in a stick representation, coloured by atom. Now have
a good look at the active residues on each chain and decide whether you think they are properly
satisfied in the model. If there is time you can have a more extensive look at all the interface
residues to decide whether there are any clashes there.

Now compare the models to the known crystal structure of Sec5 with Ral. Use the file menu in
pymol to load the structure 1UAD_single.pdb from the analysis directory and align all the
clustered models onto the known structure. Display the sequence again to see the
alignment. You should see that the larger molecule (Ral) is aligned in all of the
structures, but that only some of the models have Sec5 aligned to the known structure. This allows
you to see which one(s) are closer to correct. You can display the cartoon and try to figure out
which of the models is the best match for the known structure. If you are struggling with this, try to
locate the N terminus of the Sec5 molecule in the known structure and in the model.

Is the best cluster as defined by Haddock the closest to the real structure?

What extra data do you think we need to be able to generate a better model?

The structures generated by the haddock run are in the following directories:
structures/it0         output from the rigid body minimization
structures/it1         output from the TAD SA
structures/it1/water   output from the water refinement

In the water directory, we can look at the following files:


The pdb files corresponding to each structure are in complex_*.pdb
cluster.out contains a list of the structures that are in each cluster, designated by their number X (i.e.
cluster_X.pdb).

A number of Analysis scripts have been run on each structure and the contents of the files are:

file.nam_clustX_bsa
contains the buried surface area of each structure of cluster X
file.nam_clustX_dH
contains the total energy difference calculated as total energy of the complex - Sum of total energies of the
individual components
file.nam_clustX_Edesol
contains the desolvation energy calculated using the empirical atomic solvation parameters from Fernandez-
Recio et al. JMB 335:843 (2004)
file.nam_clustX_ener
contains all the energy terms (intermolecular, Van der Waals, electrostatic and AIR) for each structure of
cluster X
file.nam_clustX_haddock-score
contains the combined haddock score
file.nam_clustX_rmsd
contains the RMSD of each structure of cluster X from the best (lowest) HADDOCK score structure of cluster
X.
file.nam_clustX_rmsd-Emin
contains the RMSD of each structure of cluster X from the best (lowest) HADDOCK score structure of all
calculated structures
file.nam_clustX_viol
contains the number of AIR and dihedral violations per structure

Eight files containing various averages over clusters are created:


cluster_bsa.txt
contains the average buried surface area of each cluster and the standard deviation
cluster_dH.txt
contains the average total energy difference calculated as total energy of the complex - Sum of total
energies of the individual components
cluster_Edesolv.txt
contains the average desolvation energy calculated using the empirical atomic solvation parameters
from Fernandez-Recio et al. JMB 335:843 (2004)
cluster_ener.txt
contains the average energy terms of each cluster and the standard deviations
cluster_haddock.txt
contains the average combined haddock score
cluster_rmsd.txt
contains the average RMSD and standard deviation from the best (lowest) HADDOCK score of the
structures belonging to that cluster
cluster_rmsd-Emin.txt
contains the average RMSD and standard deviation of the clusters from the best (lowest) HADDOCK
score structure of all calculated structures
cluster_viol.txt
contains the average AIR and dihedral violations for each cluster and the standard deviations

Can you use any of these files to decide which of the clusters are best?

There are a number of other files that summarize this information; their description is at the
Haddock homepage (http://www.nmr.chem.uu.nl/haddock/) under Analysis.

If you want to have a look at the full haddock run of this protein complex, you can download the
results from http://computing.bio.cam.ac.uk/~hrm28/haddock_practical/full_run4.tgz.

Can you use the same criteria to discriminate between correct and incorrect models?