Bioinformatics Notes - 17Bt54: Module - 4
MODULE - 4:
HOMOLOGY MODELING
As the name suggests, homology modeling predicts protein structures based on sequence
homology with known structures. It is also known as comparative modeling. The principle
behind it is that if two proteins share a high enough sequence similarity, they are likely to have
very similar three-dimensional structures. If one of the protein sequences has a known structure,
then the structure can be copied to the unknown protein with a high degree of confidence.
Homology modeling produces an all-atom model based on alignment with template proteins. The
overall homology modeling procedure consists of six steps. The first step is template selection,
which involves identification of homologous sequences in the protein structure database to be
used as templates for modeling. The second step is alignment of the target and template
sequences. The third step is to build a framework structure for the target protein consisting of
main chain atoms. The fourth step of model building includes the addition and optimization of
side chain atoms and loops. The fifth step is to refine and optimize the entire model according to
energy criteria. The final step involves evaluating the overall quality of the model obtained. If
necessary, alignment and model building are repeated until a satisfactory result is obtained.
Template Selection
The first step in protein structural modeling is to select appropriate structural templates. This
forms the foundation for the rest of the modeling process. Template selection involves searching
the Protein Data Bank (PDB) for homologous proteins with determined structures. The search
can be performed using a heuristic pairwise alignment search program such as BLAST or
FASTA. As a rule of thumb, a database protein should have at least 30% sequence identity with
the query sequence to be selected as template. Occasionally, a 20% identity level can be used as
threshold as long as the identity of the sequence pair falls within the “safe zone”. Often, multiple
database structures with significant similarity can be found as a result of the search. In that case,
it is recommended that the structure(s) with the highest percentage identity, highest resolution,
and the most appropriate cofactors are selected as templates. On the other hand, there may be a
situation in which no highly similar sequences can be found in the structure database. In that
instance, template selection can become difficult. Either a more sensitive profile-based PSI-BLAST
method or a fold recognition method such as threading can be used to identify distant
homologs. Most likely, in such a scenario, only local similarities can be identified with distant
homologs. Modeling can therefore only be done with the aligned domains of the target protein.
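The selection rule above (an identity cutoff, then highest identity and best resolution) can be sketched in a few lines of Python. This is only an illustration: the Hit record, the template names and the numbers are hypothetical, and cofactor suitability is left out because it needs human judgment.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    name: str          # hypothetical template identifier
    identity: float    # percent sequence identity to the query
    resolution: float  # resolution in angstroms (lower is better)

def rank_templates(hits, min_identity=30.0):
    """Drop hits below the identity cutoff, then sort by identity
    (highest first) and by resolution (lowest first) to break ties."""
    usable = [h for h in hits if h.identity >= min_identity]
    return sorted(usable, key=lambda h: (-h.identity, h.resolution))

hits = [Hit("tmplA", 42.0, 2.8), Hit("tmplB", 42.0, 1.9), Hit("tmplC", 25.0, 1.5)]
print([h.name for h in rank_templates(hits)])  # ['tmplB', 'tmplA']
```

The 30% default mirrors the rule of thumb in the text; it would be lowered only when the sequence pair is still judged to lie in the safe zone.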
Sequence Alignment
Once the structure with the highest sequence similarity is identified as a template, the full-length
sequences of the template and target proteins need to be realigned using refined alignment
algorithms to obtain optimal alignment. This realignment is the most critical step in homology
modeling, which directly affects the quality of the final model. This is because incorrect
alignment at this stage leads to incorrect designation of homologous residues and therefore to
incorrect structural models. Errors made in the alignment step cannot be corrected in the
following modeling steps. Even alignment using the best alignment program may not be error
free and should be visually inspected to ensure that conserved key residues are correctly aligned.
If necessary, manual refinement of the alignment should be carried out to improve alignment
quality.
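As a rough illustration of what a global realignment does, the sketch below implements the classic Needleman-Wunsch dynamic programming algorithm. The scoring values (match +1, mismatch -1, gap -2) and the two short sequences are assumptions chosen for readability; real alignment programs use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

target_aln, template_aln, total = needleman_wunsch("MKVLAT", "MKLAT")
print(target_aln)
print(template_aln)
print(total)
```

The gap column in the output marks a residue of the target with no counterpart in the template, which is exactly the kind of position that later has to be handled by loop modeling.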
Loop Modeling
In the sequence alignment for modeling, there are often regions caused by insertions and
deletions producing gaps in sequence alignment. The gaps cannot be directly modeled, creating
“holes” in the model. Closing the gaps requires loop modeling, which is a very difficult problem
in homology modeling and is also a major source of error. Loop modeling can be considered a
Pruthvish R, Department of Biotechnology, Acharya Institute of Technology, Bangalore-560107
AIT/IQAC/Aca/19-20/Bioinfo/Notes
mini–protein modeling problem by itself. Unfortunately, there are no mature methods available
that can model loops reliably. Currently, there are two main techniques used to approach the
problem: the database searching method and the ab initio method. The database method involves
finding “spare parts” from known protein structures in a database that fit onto the two stem
regions of the target protein. The stems are defined as the main chain atoms that precede and
follow the loop to be modeled. The procedure begins by measuring the orientation and distance
of the anchor regions in the stems and searching PDB for segments of the same length that also
match the above endpoint conformation. Usually, many different alternative segments that fit the
endpoints of the stems are available. The best loop can be selected based on sequence similarity
as well as minimal steric clashes with the neighboring parts of the structure. The conformation of
the best matching fragments is then copied onto the anchoring points of the stems. The ab initio
method generates many random loops and searches for the one that does not clash with nearby
side chains and also has reasonably low energy and φ and ψ angles in the allowable regions in
the Ramachandran plot.
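The database method can be sketched as a filter over candidate fragments. The version below is a simplification under stated assumptions: it compares only the end-to-end distance of each fragment with the distance between the two stem anchors and ignores anchor orientation, sequence similarity and steric clashes. The fragment names and coordinates are hypothetical.

```python
import math

def candidate_loops(stem_start, stem_end, loop_length, fragments, tolerance=0.5):
    """Return names of fragments that have the required length and whose
    end-to-end distance matches the stem-to-stem distance, best match first.
    Each fragment holds the loop residues plus one anchor residue at each end."""
    target = math.dist(stem_start, stem_end)
    picks = []
    for name, coords in fragments.items():
        if len(coords) != loop_length + 2:
            continue
        mismatch = abs(math.dist(coords[0], coords[-1]) - target)
        if mismatch <= tolerance:
            picks.append((mismatch, name))
    return [name for _, name in sorted(picks)]

fragments = {
    "f1": [(0, 0, 0), (2, 2, 0), (4, 2, 0), (6.2, 0, 0)],
    "f2": [(0, 0, 0), (1, 1, 0), (2, 1, 0), (9, 0, 0)],   # ends too far apart
    "f3": [(0, 0, 0), (3, 2, 0), (6, 0, 0)],              # wrong length
    "f4": [(1, 0, 0), (3, 2, 0), (5, 2, 0), (7.1, 0, 0)],
}
print(candidate_loops((0, 0, 0), (6, 0, 0), 2, fragments))  # ['f4', 'f1']
```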
conformational search can be much improved. After adding the most frequently occurring
rotamers, the conformations have to be further optimized to minimize steric overlaps with the
rest of the model structure.
Model Evaluation
The final homology model has to be evaluated to make sure that the structural features of the
model are consistent with the physicochemical rules. This involves checking anomalies in φ–ψ
angles, bond lengths, close contacts, and so on. Another way of checking the quality of a protein
model is to implicitly take these stereochemical properties into account. This is a method that
detects errors by compiling statistical profiles of spatial features and interaction energy from
experimentally determined structures. By comparing the statistical parameters with the
constructed model, the method reveals which regions of a sequence appear to be folded normally
and which regions do not. If structural irregularities are found, the region is considered to have
errors and has to be further refined.
Procheck is a program that is able to check general physicochemical parameters such as φ–ψ
angles, chirality, bond lengths, bond angles, and so on. The parameters of the model are used to
compare with those compiled from well-defined, high-resolution structures. If the program
detects unusual features, it highlights the regions that should be checked or refined further.
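The φ and ψ angles checked by such programs are dihedral (torsion) angles: φ is defined by the atoms C(i-1), N, Cα, C and ψ by N, Cα, C, N(i+1). The following sketch computes a dihedral angle from four points with the standard atan2 formula. It is a generic geometry helper, not code from Procheck, and the test coordinates are invented.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = _dot(b2, _cross(n1, n2))
    x = math.sqrt(_dot(b2, b2)) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))   # 90.0
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, -1, 1)))  # -90.0
```

A full check would compare each residue's (φ, ψ) pair against the allowed regions of the Ramachandran plot, which are not reproduced here.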
Verify3D is another server using the statistical approach. It uses a precomputed database
containing eighteen environmental profiles based on secondary structures and solvent exposure,
compiled from high-resolution protein structures. To assess the quality of a protein model, the
secondary structure and solvent exposure propensity of each residue are calculated. If the
parameters of a residue fall within one of the profiles, it receives a high score, otherwise a low
score. The result is a two-dimensional graph illustrating the folding quality of each residue of the
protein structure. The threshold value is normally set at zero. Residues with scores below zero
are considered to have an unfavorable environment.
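Reading such a per-residue score plot amounts to finding stretches that fall below the threshold. A minimal sketch, assuming the scores are already available as a list with one value per residue (the example values are made up):

```python
def low_scoring_regions(scores, threshold=0.0):
    """Return (first, last) residue numbers (1-based) of every contiguous
    stretch whose score is below the threshold."""
    regions, start = [], None
    for i, s in enumerate(scores, start=1):
        if s < threshold and start is None:
            start = i
        elif s >= threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(scores)))
    return regions

scores = [0.4, 0.2, -0.1, -0.3, 0.1, 0.5, -0.2]
print(low_scoring_regions(scores))  # [(3, 4), (7, 7)]
```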
Because no single method is clearly superior to any other, a good strategy is to use multiple
verification methods and identify the consensus between them. It is also important to keep in
mind that the evaluation tests performed by these programs only check the stereochemical
correctness, regardless of the accuracy of the model, which may or may not have any biological
meaning.
SWISS-MODEL MODES
SWISS-MODEL server gives the user the choice between three main interaction modes.
Alignment mode
In the 'alignment mode' the modeling procedure is initiated by submitting a sequence alignment.
The user specifies which sequence in the given alignment is the target sequence and which one
corresponds to a structurally known protein chain from the ExPDB template library. The server
will build the model based on the given alignment.
Project mode
The 'project mode' allows the user to submit a manually optimized modeling request to the
SWISS-MODEL server. The starting point for this mode is a DeepView project file. It contains
the superposed template structures, and the alignment between the target and the templates. This
mode gives the user control over a wide range of parameters, e.g. template selection or gap
placement in the alignment. Furthermore, the project mode can also be used to iteratively
improve the output of the 'first approach mode'.
MODELING PROCEDURE
All homology-modeling methods consist of the following four steps: (i) template selection; (ii)
target template alignment; (iii) model building; and (iv) evaluation. These steps can be iteratively
repeated, until a satisfying model structure is achieved. Several different techniques for model
building have been developed. The SWISS-MODEL server approach can be described as rigid
fragment assembly.
Template selection
The SWISS-MODEL server template library ExPDB is extracted from the PDB. In order to
allow a stable and automated workflow of the server, the PDB coordinate files are split into
individual protein chains and unreliable entries, e.g. theoretical models and low quality structures
providing only Cα coordinates, are removed. Additional information useful for template selection
is gathered and added to the file header, e.g. probable quaternary structure, quality indicators or
ANOLEA scores. To select templates for a given protein, the sequences of the template structure
library are searched. If these templates cover distinct regions of the target sequence, the
modeling process will be split into separate independent batches.
Alignment
Up to five template structures per batch are superposed using an iterative least squares algorithm.
A structural alignment is generated after removing incompatible templates, i.e. omitting
structures with high Cα root mean square deviations to the first template. A local pair-wise
alignment of the target sequence to the main template structures is calculated, followed by a step to
improve the alignment for modeling purposes. The placement of insertions and deletions is
optimized considering the template structure context. In particular, isolated residues in the
alignment ('islands') are moved to the flanks to facilitate the loop building process.
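The root mean square deviation used to reject incompatible templates can be computed as follows once two structures have been superposed. This sketch assumes the superposition has already been done and that both lists contain Cα coordinates of equivalent residues in the same order; the coordinates are invented.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

first_template = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
other_template = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0)]
print(rmsd(first_template, other_template))  # 1.0
```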
Model building
To generate the core of the model, the backbone atom positions of the template structure are
averaged. The templates are thereby weighted by their sequence similarity to the target sequence,
while significantly deviating atom positions are excluded. Regions of insertions and deletions in the alignment cannot be built from template coordinates. To generate those parts, an ensemble
of fragments compatible with the neighboring stems is constructed using constraint space
programming (CSP). The best loop is selected using a scoring scheme, which accounts for force
field energy, steric hindrance and favorable interactions like hydrogen bond formation. In cases
where CSP does not give a satisfying solution and for loops above 10 residues, a loop library
derived from experimental structures is searched to find compatible loop fragments.
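The weighted averaging of template backbone positions described at the start of this section can be illustrated with a small function. The weights stand for the sequence similarity of each template to the target. The deviation cutoff of 3.0 angstroms, measured against the first template, is an assumption made for this sketch and is not a value taken from SWISS-MODEL.

```python
import math

def average_backbone_atom(positions, weights, max_deviation=3.0):
    """Similarity-weighted mean of one atom position across templates.
    Positions further than max_deviation from the first template are dropped."""
    reference = positions[0]
    kept = [(p, w) for p, w in zip(positions, weights)
            if math.dist(p, reference) <= max_deviation]
    total = sum(w for _, w in kept)
    return tuple(sum(w * p[k] for p, w in kept) / total for k in range(3))

positions = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
weights = [3.0, 1.0, 2.0]   # hypothetical similarity weights
print(average_backbone_atom(positions, weights))  # (0.5, 0.0, 0.0)
```

The third position is excluded as an outlier, so the result is the 3:1 weighted mean of the first two.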
Side chain modeling
The reconstruction of the model side chains is based on the weighted positions of corresponding
residues in the template structures. Starting with conserved residues, the model side chains are
built. Possible side chain conformations are selected from a backbone dependent rotamer library,
which has been constructed carefully taking into account the quality of the source structures. A
scoring function assessing favorable interactions (hydrogen bonds, disulfide bridges) and
unfavorably close contacts is applied to select the most likely conformation.
Energy minimization
Deviations in the protein structure geometry, which have been introduced by the modeling
algorithm when joining rigid fragments are regularized in the last modeling step by steepest
descent energy minimization. Energy minimization or molecular dynamics methods are in
general not able to improve the accuracy of the models, and are used in SWISS-MODEL only to
regularize the structure.
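Steepest descent itself is simple: move every coordinate a small step against the energy gradient and repeat. The toy below regularizes a one-dimensional chain of three atoms joined by harmonic bonds. The energy function, ideal bond length, step size and iteration count are all assumptions for illustration; a real modeling package uses a full force field.

```python
def energy(x, r0=1.5, k=1.0):
    """Sum of harmonic bond terms k * (length - r0)^2 along a 1-D chain."""
    return sum(k * ((x[i + 1] - x[i]) - r0) ** 2 for i in range(len(x) - 1))

def gradient(x, r0=1.5, k=1.0):
    """Analytical derivative of the energy with respect to each coordinate."""
    g = [0.0] * len(x)
    for i in range(len(x) - 1):
        d = 2 * k * ((x[i + 1] - x[i]) - r0)
        g[i] -= d
        g[i + 1] += d
    return g

def steepest_descent(x, step=0.1, iterations=200):
    """Repeatedly move downhill along the negative gradient."""
    x = list(x)
    for _ in range(iterations):
        g = gradient(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

start = [0.0, 1.0, 3.5]          # one bond too short, one too long
final = steepest_descent(start)
print(energy(start), energy(final))
```

The minimizer only removes the bond-length distortions; it has no way of knowing whether the overall fold is right, which is the point made in the text.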
Profile Method
In the profile-based method, a profile is constructed for a group of related protein structures. The
structural profile is generated by superimposition of the structures to expose corresponding
residues. Statistical information from these aligned residues is then used to construct a profile.
The profile contains scores that describe the propensity of each of the twenty amino acid residues
to be at each profile position. The profile scores contain information for secondary structural
types, the degree of solvent exposure, polarity, and hydrophobicity of the amino acids. To predict
the structural fold of an unknown query sequence, the query sequence is first predicted for its
secondary structure, solvent accessibility, and polarity. The predicted information is then used
for comparison with propensity profiles of known structural folds to find the fold that best
represents the predicted profile.
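A much reduced sketch of profile matching is shown below. Each fold profile is a list of positions, and each position maps residues to a propensity score. The query is scored against every profile without gaps, and the best fold is reported. The profiles, scores and default penalty are hypothetical, and the sketch uses residue identity alone rather than the predicted secondary structure, solvent accessibility and polarity described above.

```python
def profile_score(sequence, profile, default=-1.0):
    """Sum the per-position propensity of each residue (ungapped)."""
    if len(sequence) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return sum(column.get(res, default) for res, column in zip(sequence, profile))

def best_fold(sequence, fold_profiles):
    """Name of the fold whose profile gives the highest score."""
    return max(fold_profiles,
               key=lambda name: profile_score(sequence, fold_profiles[name]))

fold_profiles = {
    "fold_helical": [{"A": 2.0, "L": 1.0}, {"L": 2.0}, {"E": 1.0}],
    "fold_strand":  [{"V": 2.0}, {"I": 2.0}, {"T": 1.0}],
}
print(best_fold("ALE", fold_profiles))  # fold_helical
```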
3D-PSSM (www.bmm.icnet.uk/~3dpssm/) is a web-based program that employs the structural
profile method to identify protein folds. The profiles for each protein superfamily are constructed
by combining multiple smaller profiles. First, protein structures in a superfamily based on the
SCOP classification are superimposed and are used to construct a structural profile by
incorporating secondary structures and solvent accessibility information for corresponding
residues. In addition, each member in a protein structural superfamily has its own sequence-
based PSI-BLAST profile computed. These sequence profiles are used in combination with the
structure profile to form a large superfamily profile in which each position contains both
sequence and structural information. For the query sequence, PSI-BLAST is performed to
generate a sequence-based profile. PSI-PRED is used to predict its secondary structure. Both the
sequence profile and predicted secondary structure are compared with the precomputed protein
superfamily profiles, using a dynamic programming approach. The matching scores are
calculated in terms of secondary structure, solvation energy, and sequence profiles and ranked to
find the highest scored structure fold.
number of models are built and their overall energy potentials calculated. The conformation with
the lowest global free energy is chosen as the best model.
CASP
Discussion of protein structural prediction would not be complete without mentioning CASP
(Critical Assessment of Techniques for Protein Structure Prediction). With so many protein
structure prediction programs available, there is a need to know the reliability of the prediction
methods. For that purpose, a common benchmark is needed to measure the accuracies of the
prediction methods. It allows developers to predict unknown protein structures through blind
testing so that the reliability of new prediction methods can be objectively evaluated. This is the
experiment of CASP.
surface representations. Atoms may also be labelled with arbitrary text strings. Alternate
conformers and multiple NMR models may be specially coloured and identified in atom labels.
Different parts of the molecule may be represented and coloured independently of the rest of the
molecule or displayed in several representations simultaneously. The displayed molecule may be
rotated, translated, zoomed and z-clipped (slabbed) interactively using the mouse, the scroll bars,
the command line or an attached dial box. RasMol can read a prepared list of commands from a
'script' file (or via inter-process communication) to allow a given image or viewpoint to be
restored quickly. RasMol can also create a script file containing the commands required to
regenerate the current image. Finally, the rendered image may be written out in a variety of
formats including either raster or vector PostScript, GIF, PPM, BMP, PICT, Sun rasterfile or as a
MolScript input script or Kinemage.
The RasMol help facility can be accessed by typing "help <topic>" or "help <topic> <subtopic>"
from the command line. A complete list of RasMol commands may be displayed by typing "help
commands". A single question mark may also be used to abbreviate the keyword "help".
Backbone
The RasMol 'backbone' command permits the representation of a polypeptide backbone as a
series of bonds connecting the adjacent alpha carbons of each amino acid in a chain. The display
of these backbone 'bonds' is turned on and off by the command parameter in the same way as
with the 'wireframe' command. The command 'backbone off' turns off the selected 'bonds', and
'backbone on' or 'backbone <number>' turns them on. The number can be used to specify the cylinder
radius of the representation in either Ångstrom or RasMol units. A parameter value of 500 (2.0
Ångstroms) or above results in a "Parameter value too large" error. Backbone objects may be
coloured using the RasMol 'colour backbone' command.
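The statement that 500 corresponds to 2.0 Ångstroms implies 250 RasMol units per Ångstrom. A small helper, written here as an illustration and not part of RasMol, converts a radius and builds the command string while enforcing the limit mentioned above:

```python
RASMOL_UNITS_PER_ANGSTROM = 250   # follows from 500 units = 2.0 angstroms

def backbone_command(radius_angstrom):
    """Return a 'backbone <number>' command for a radius given in angstroms."""
    units = round(radius_angstrom * RASMOL_UNITS_PER_ANGSTROM)
    if units <= 0:
        raise ValueError("radius must be positive")
    if units >= 500:
        raise ValueError("Parameter value too large")
    return f"backbone {units}"

print(backbone_command(0.4))  # backbone 100
```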
The reserved word backbone is also used as a predefined set ("help sets") and as a parameter to
the 'set hbond' and 'set ssbond' commands. The RasMol command 'trace' renders a smoothed
backbone, in contrast to 'backbone' which connects alpha carbons with straight lines.
Background
The RasMol 'background' command is used to set the colour of the "canvas" background. The
colour may be given as either a colour name or a comma separated triple of Red, Green and Blue
(RGB) components enclosed in square brackets. Typing the command 'help colours' will give
a list of the predefined colour names recognised by RasMol.
Bond
The RasMol command 'bond <number> <number> +' adds the designated bond to the drawing,
increasing the bond order if the bond already exists. The command 'bond <number> <number>
pick' selects the two atoms specified by the atom serial numbers as the two ends of a bond
around which the 'rotate bond <angle>' command will be applied. If no bond exists, it is
created.
Rotation around a previously picked bond may be specified by the 'rotate bond <angle>'
command, or may also be controlled with the mouse, using the 'bond rotate on/off' or the
equivalent 'rotate bond on/off' commands.
Cartoon
The RasMol 'cartoon' command displays the molecule's 'ribbons' as Richardson
(MolScript) style protein 'cartoons', implemented as thick (deep) ribbons. The easiest way to
obtain a cartoon representation of a protein is to use the 'Cartoons' option on the 'Display'
menu. The 'cartoon' command represents the currently selected residues as a deep ribbon with
width specified by the command's argument. Using the command without a parameter results in
the ribbon's width being taken from the protein's secondary structure, as described in the
'ribbons' command. By default, the C-termini of beta-sheets are displayed as arrow heads. This
may be enabled and disabled using the 'set cartoons' command. The depth of the cartoon may
be adjusted using the 'set cartoons <number>' command. The 'set cartoons' command
without any parameters returns these two options to their default values.
Centre
The RasMol 'centre' command defines the point about which the 'rotate' command and the
scroll bars rotate the current molecule. Without a parameter the centre command resets the centre
of rotation to be the centre of gravity of the molecule. If an atom expression is specified, RasMol
rotates the molecule about the centre of gravity of the set of atoms specified by the expression.
Hence, if a single atom is specified by the expression, that atom will remain 'stationary' during
rotations.
Type 'help expression' for more information on RasMol atom expressions.
Colour
Colour the atoms (or other objects) of the selected region. The colour may be given as either a
colour name or a comma separated triple of Red, Green and Blue (RGB) components enclosed in
square brackets. Typing the command 'help colours' will give a list of all the predefined colour
names recognised by RasMol.
Allowed objects are 'atoms', 'bonds', 'backbone', 'ribbons', 'labels', 'dots', 'hbonds', 'map', and
'ssbonds'. If no object is specified, the default keyword 'atom' is assumed. Some colour schemes
are defined for certain object types. The colour scheme 'none' can be applied to all objects except
atoms and dots, stating that the selected objects have no colour of their own, but use the colour of
their associated atoms (i.e. the atoms they connect). 'Atom' objects can also be coloured by
'amino', 'chain', 'charge', 'cpk', 'group', 'model', 'shapely', 'structure', 'temperature' or
'user'. Hydrogen bonds can also be coloured by 'type' and dot surfaces can also be coloured by
'electrostatic potential'. For more information type 'help colour <colour>'. Map
objects may be coloured by a specific colour or by the nearest atom.
Amino Colours
The RasMol 'amino' colour scheme colours amino acids according to traditional amino acid
properties. The purpose of colouring is to identify amino acids in an unusual or surprising
environment. The outer parts of a protein that are polar are visible (bright) colours and non-polar
residues darker. Most colours are hallowed by tradition. This colour scheme is similar to the
'shapely' scheme.
Chain Colours
The RasMol 'chain' colour scheme assigns each macromolecular chain a unique colour. This
colour scheme is particularly useful for distinguishing the parts of a multimeric structure or
the individual 'strands' of a DNA chain. 'Chain' can be selected from the RasMol 'Colours' menu.
Charge Colours
The RasMol 'charge' colour scheme colour codes each atom according to the charge value stored
in the input file (or beta factor field of PDB files). High values are coloured in blue (positive)
and lower values coloured in red (negative). Rather than use a fixed scale this scheme determines
the maximum and minimum values of the charge/temperature field and interpolates from red to
blue appropriately. Hence, green cannot be assumed to mean 'no net charge'.
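Why the midpoint colour does not mean zero charge can be seen by working through the scaling. The sketch below only computes each value's relative position between the minimum and maximum of the field (0 at the red end, 1 at the blue end). How RasMol maps that position to an actual colour is not reproduced here, and the linear scaling is an assumption.

```python
def ramp_positions(values):
    """Relative position of each value between the field minimum (0.0)
    and maximum (1.0); a constant field maps to the midpoint."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

charges = [-1.0, 0.0, 3.0]
print(ramp_positions(charges))  # [0.0, 0.25, 1.0]
```

With these example charges an uncharged atom sits a quarter of the way along the ramp, not in the middle.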
The difference between the 'charge' and 'temperature' colour schemes is that increasing
temperature values proceed from blue to red, whereas increasing charge values go from red to
blue. If the charge/temperature field stores reasonable values it is possible to use the RasMol 'colour
dots potential' command to colour code a dot surface (generated by the 'dots' command) by
electrostatic potential.
CPK Colours
The RasMol 'cpk' colour scheme is based upon the colours of the popular plastic spacefilling
models which were developed by Corey and Pauling, and later improved by Koltun. This colour
scheme colours 'atom' objects by the atom (element) type. This is the scheme conventionally
used by chemists. The assignment of the most commonly used element types to colours is given
below.
Note that except for green, white, blue, and orange, these colour names are not the ones
specified as "Predefined colours" in RasMol; thus, they can only be specified on the command
line as RGB triplets.
In the CPK colouring scheme, RasMol will attempt to assign a colour to each element from the
periodic table from a list of 16 colours (the colour codes listed are to help in understanding the
mapping and are not used by RasMol):
In the CPKnew colouring scheme, RasMol uses brighter colours:
For X-ray crystallographic models of proteins and nucleic acids (i.e. without hydrogens) the
display can be 'brightened' by converting the O, C, and N atoms from the RasMol default cpk
colors to "true red, white and blue" using RasMol's predefined colour scheme.
Group Colours
The RasMol 'group' colour scheme colour codes residues by their position in a macromolecular
chain. Each chain is drawn as a smooth spectrum from blue through green, yellow and orange to
red. Hence the N terminus of proteins and 5' terminus of nucleic acids are coloured blue and the C
terminus of proteins and 3' terminus of nucleic acids are drawn in red. If a chain has a large
number of heterogeneous molecules associated with it, the macromolecule may not be drawn in
the full 'range' of the spectrum. 'Group' can be selected from the RasMol 'Colours' menu.
Shapely Colours
The RasMol 'shapely' colour scheme colour codes residues by amino acid property. This scheme
is based upon Bob Fletterick's "Shapely Models". Each amino acid and nucleic acid residue is
given a unique colour. The 'shapely' colour scheme is used by David Bacon's Raster3D program.
This colour scheme is similar to the 'amino' colour scheme.
Structure Colours
The RasMol 'structure' colour scheme colours the molecule by protein secondary structure.
Alpha helices are coloured magenta, [240,0,128], beta sheets are coloured yellow, [255,255,0],
turns are coloured pale blue, [96,128,255] and all other residues are coloured white. The
secondary structure is either read from the PDB file (HELIX, SHEET and TURN records), if
available, or determined using Kabsch and Sander's DSSP algorithm. The RasMol 'structure'
command may be used to force DSSP's structure assignment to be used.
Temperature Colours
The RasMol 'temperature' colour scheme colour codes each atom according to the anisotropic
temperature (beta) value stored in the PDB file. Typically this gives a measure of the
mobility/uncertainty of a given atom's position. High values are coloured in warmer (red) colours
and lower values in colder (blue) colours. This feature is often used to associate a "scale" value
[such as amino acid variability in viral mutants] with each atom in a PDB file, and colour the
molecule appropriately. The difference between the 'temperature' and 'charge' colour schemes is that
increasing temperature values proceed from blue to red, whereas increasing charge values go from red to
blue.
Potential Colours
The RasMol 'potential' colour scheme applies only to dot surfaces, hence is used in the
command 'colour dots potential'. This scheme colours each currently displayed dot by the
electrostatic potential at that point in space. This potential is calculated using Coulomb's law
taking the temperature/charge field of the input file to be the charge associated with that atom.
This is the same interpretation used by the 'colour charge' command. Like the 'charge' colour
scheme low values are blue/white and high values are red.
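The Coulomb calculation behind this scheme can be sketched as a sum of charge over distance for every atom. The constant of 1.0 and the absence of any distance cutoff or dielectric term are simplifying assumptions, so the numbers are in arbitrary units and are not RasMol's values.

```python
import math

def potential_at(point, atoms, constant=1.0):
    """Electrostatic potential at a point from a list of (position, charge)
    pairs, using Coulomb's law: V = constant * sum(q / r)."""
    total = 0.0
    for position, charge in atoms:
        r = math.dist(point, position)
        if r == 0.0:
            continue   # skip an atom sitting exactly on the point
        total += constant * charge / r
    return total

atoms = [((0.0, 0.0, 0.0), 1.0), ((4.0, 0.0, 0.0), -1.0)]
print(potential_at((2.0, 0.0, 0.0), atoms))  # 0.0
print(potential_at((1.0, 0.0, 0.0), atoms))
```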
Connect
The RasMol 'connect' command is used to force RasMol to (re)calculate the connectivity of the
current molecule. If the original input file contained connectivity information, this is discarded.
The command 'connect false' uses a fast heuristic algorithm that is suitable for determining
bonding in large bio-molecules such as proteins and nucleic acids. The command "connect true"
uses a slower more accurate algorithm based upon covalent radii that is more suitable to small
molecules containing inorganic elements or strained rings. If no parameters are given, RasMol
determines which algorithm to use based on the number of atoms in the input file. Greater than
255 atoms causes RasMol to use the faster implementation. This is the method used to determine
bonding, if necessary, when a molecule is first read in using the 'load' command.
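A distance test based on covalent radii can be sketched as follows. The radii are approximate textbook values, the 0.45 Ångstrom tolerance is an assumption, and the all-pairs loop is the slow but simple approach; none of this is taken from the RasMol source code. The 255-atom switch follows the rule stated above.

```python
import math

# Approximate covalent radii in angstroms (illustrative values).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}

def find_bonds(atoms, tolerance=0.45):
    """Return index pairs of atoms closer than the sum of their covalent
    radii plus a tolerance. atoms is a list of (element, position) pairs."""
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (elem_i, pos_i), (elem_j, pos_j) = atoms[i], atoms[j]
            limit = COVALENT_RADII[elem_i] + COVALENT_RADII[elem_j] + tolerance
            if math.dist(pos_i, pos_j) <= limit:
                bonds.append((i, j))
    return bonds

def default_algorithm(atom_count):
    """More than 255 atoms selects the fast heuristic."""
    return "fast" if atom_count > 255 else "accurate"

water = [("O", (0.0, 0.0, 0.0)), ("H", (0.96, 0.0, 0.0)), ("H", (-0.24, 0.93, 0.0))]
print(find_bonds(water))          # [(0, 1), (0, 2)]
print(default_algorithm(len(water)))  # accurate
```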
Spacefill
The RasMol 'spacefill' command is used to represent all of the currently selected atoms as solid
spheres. This command is used to produce both union-of-spheres and ball-and-stick models of a
molecule. The command, 'spacefill true', the default, represents each atom as a sphere of van der
Waals radius. The command 'spacefill off' turns off the representation of the selected atom as
spheres. The 'temperature' option sets the radius of each sphere to the value stored in its
temperature field. Zero or negative values have no effect and values greater than 2.0 are
truncated to 2.0. The 'user' option allows the radius of each sphere to be specified by additional
lines in the molecule's PDB file using Raster 3D's COLOUR record extension.
The RasMol command 'cpk' is synonymous with the 'spacefill' command.
Depth
The RasMol 'depth' command enables, disables or positions the back-clipping plane of the
molecule. The program only draws those portions of the molecule that are closer to the viewer
than the clipping plane. Integer values range from zero at the very back of the molecule to 100
which is completely in front of the molecule. Intermediate values determine the percentage of the
molecule to be drawn.
This command interacts with the 'slab <value>' command, which clips to the front of a given z-
clipping plane.
Dots
The RasMol 'dots' command is used to generate a van der Waals' dot surface around the
currently selected atoms. Dot surfaces display regularly spaced points on a sphere of van der
Waals' radius about each selected atom. Dots that would be 'buried' within the van der Waals'
radius of any other atom (selected or not) are not displayed. The command 'dots on' deletes any
existing dot surface and generates a dot surface around the currently selected atom set with a
default dot density of 100. The command 'dots off' deletes any existing dot surface. The dot
density may be specified by providing a numeric parameter between 1 and 1000. This value
approximately corresponds to the number of dots on the surface of a medium sized atom.
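The idea of a dot surface can be reproduced in a few lines: spread points evenly over each atom's sphere and discard those that lie inside another atom. The sketch uses a Fibonacci spiral to space the points and places the same number of dots on every atom. Both choices are assumptions for illustration and may differ from what RasMol does internally.

```python
import math

def sphere_points(n):
    """n roughly evenly spaced points on the unit sphere (Fibonacci spiral)."""
    golden_angle = math.pi * (3 - math.sqrt(5))
    points = []
    for k in range(n):
        z = 1 - 2 * (k + 0.5) / n
        r = math.sqrt(1 - z * z)
        theta = golden_angle * k
        points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points

def dot_surface(atoms, selected, density=100):
    """Dots on the van der Waals spheres of the selected atoms, omitting any
    dot buried inside another atom. atoms is a list of (centre, radius)."""
    unit = sphere_points(density)
    dots = []
    for i in selected:
        (cx, cy, cz), radius = atoms[i]
        for ux, uy, uz in unit:
            p = (cx + radius * ux, cy + radius * uy, cz + radius * uz)
            buried = any(j != i and math.dist(p, centre) < r
                         for j, (centre, r) in enumerate(atoms))
            if not buried:
                dots.append(p)
    return dots

pair = [((0.0, 0.0, 0.0), 1.7), ((1.5, 0.0, 0.0), 1.7)]
print(len(dot_surface(pair[:1], [0])), len(dot_surface(pair, [0, 1])))
```

A lone atom keeps all of its dots, while two overlapping atoms lose the dots in the region where their spheres intersect.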
By default, the colour of each point on a dot surface is the colour of its closest atom at the time
the surface is generated. The colour of the whole dot surface may be changed using the 'colour
dots' command.
HBonds
The RasMol 'hbond' command is used to represent the hydrogen bonding of the protein
molecule's backbone. This information is useful in assessing the protein's secondary structure.
Hydrogen bonds are represented as either dotted lines or cylinders between the donor and
acceptor residues. The first time the 'hbond' command is used, the program searches the
structure of the molecule to find hydrogen bonded residues and reports the number of bonds to
the user. The command 'hbonds on' displays the selected 'bonds' as dotted lines, and 'hbonds
off' turns off their display. The colour of hbond objects may be changed by the 'colour hbond'
command. Initially, each hydrogen bond has the colours of its connected atoms.
By default the dotted lines are drawn between the accepting oxygen and the donating nitrogen.
By using the 'set hbonds' command the alpha carbon positions of the appropriate residues may
be used instead. This is especially useful when examining proteins in backbone representation.
Label
The RasMol 'label' command allows an arbitrary formatted text string to be associated with each
currently selected atom. This string may contain embedded 'expansion specifiers' which display
properties of the atom being labelled. An expansion specifier consists of a '%' character followed
by a single alphabetic character specifying the property to be displayed (similar to C's printf
syntax). An actual '%' character may be displayed by using the expansion specifier '%%'.
Atom labelling for the currently selected atoms may be turned off with the command 'label off'.
By default, if no string is given as a parameter, RasMol uses labels appropriate for the current
molecule. RasMol uses the label '%n%r:%c.%a' if the molecule contains more than one chain,
'%e%i' if the molecule has only a single residue (a small molecule) and '%n%r.%a' otherwise.
The colour of each label may be changed using the 'colour label' command. By default, each
label is drawn in the same colour as the atom to which it is attached. The size and spacing of the
displayed text may be changed using the 'set fontsize' command.
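The expansion of label specifiers can be illustrated with a short Python sketch. This is not RasMol's own code: the atom record is a hypothetical dictionary, and the meaning assumed for each specifier letter (%a atom name, %c chain, %e element, %i serial number, %n residue name, %r residue number) should be checked against the RasMol manual.

```python
# Hypothetical mapping from specifier letter to a key of the atom record.
FIELDS = {
    "a": "atom_name",
    "c": "chain",
    "e": "element",
    "i": "serial",
    "n": "res_name",
    "r": "res_num",
}

def expand_label(fmt, atom):
    """Expand RasMol-style '%x' specifiers in fmt using the atom dictionary."""
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt):
            spec = fmt[i + 1]
            if spec == "%":
                out.append("%")              # '%%' gives a literal percent sign
            elif spec in FIELDS:
                out.append(str(atom[FIELDS[spec]]))
            else:
                out.append("%" + spec)       # unknown specifier is left as typed
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

def default_label(n_chains, n_residues):
    """Choose the default label format described in the text."""
    if n_chains > 1:
        return "%n%r:%c.%a"
    if n_residues == 1:
        return "%e%i"
    return "%n%r.%a"

atom = {"atom_name": "CA", "chain": "A", "element": "C",
        "serial": 2, "res_name": "ALA", "res_num": 12}
print(expand_label(default_label(n_chains=2, n_residues=150), atom))
```

For a multi-chain protein the default format is used, so the alpha carbon of alanine 12 in chain A is labelled with its residue, chain and atom name.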
Map
The RasMol 'map' commands manipulate electron density maps in coordination with the display
of molecules. These commands are very memory intensive and may not work on machines with
limited memory. Each molecule may have as many maps as available memory permits. Maps
may be read from files or generated from Gaussian density distributions around atoms.
Molecule
The RasMol 'molecule' command selects one of up to 5 previously loaded molecules for active
manipulation. While all the molecules are displayed and may be rotated collectively, only one
molecule at a time is active for manipulation by the commands which control the details of
rendering.
Monitor
The RasMol 'monitor' command allows the display of distance monitors. A distance monitor is a
dashed (dotted) line between an arbitrary pair of atoms, optionally labelled by the distance
between them. The RasMol command 'monitor <number> <number>' adds such a distance
monitor between the two atoms specified by the atom serial numbers given as parameters.
Distance monitors may also be added to a molecule interactively with the mouse, using the 'set
picking monitor' command. Clicking on an atom results in its being identified on the RasMol
command line. In addition, every atom picked increments a modulo counter such that, in monitor
mode, every second atom displays the distance between this atom and the previous one. The shift
key may be used to form distance monitors between a fixed atom and several consecutive
positions. A distance monitor may also be removed (toggled) by selecting the appropriate pair of
atom end points a second time.
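The picking behaviour described above can be sketched in Python. This is only an illustration under simple assumptions: atoms are held in a hypothetical dictionary from serial number to (x, y, z) coordinates in Ångstroms, and the class name is invented for the example.

```python
import math

class MonitorPicker:
    """Toy model of 'set picking monitor': every second pick forms a monitor."""

    def __init__(self, atoms):
        self.atoms = atoms        # serial number -> (x, y, z) in Angstroms
        self.count = 0            # modulo-2 pick counter
        self.first = None         # serial of the first atom of the current pair
        self.monitors = set()     # set of frozenset({serial_a, serial_b})

    def pick(self, serial):
        """Register a pick; return the distance when a new monitor is added."""
        self.count = (self.count + 1) % 2
        if self.count == 1:
            self.first = serial   # first atom of a pair: nothing to show yet
            return None
        pair = frozenset((self.first, serial))
        if pair in self.monitors:
            self.monitors.remove(pair)   # same end points again: toggle off
            return None
        self.monitors.add(pair)
        return math.dist(self.atoms[self.first], self.atoms[serial])

atoms = {1: (0.0, 0.0, 0.0), 2: (3.0, 4.0, 0.0)}
picker = MonitorPicker(atoms)
picker.pick(1)
print(picker.pick(2))   # distance between atoms 1 and 2
```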
Restrict
The RasMol 'restrict' command both defines the currently selected region of the molecule and
disables the representation of (most of) those parts of the molecule no longer selected. All
subsequent RasMol commands that modify a molecule's colour or representation affect only the
currently selected region. The parameter of a 'restrict' command is a RasMol atom expression
that is evaluated for every atom of the current molecule. This command is very similar to the
RasMol 'select' command, except 'restrict' disables the 'wireframe', 'spacefill' and 'backbone'
representations in the non-selected region.
Ribbons
The RasMol 'ribbons' command displays the currently loaded protein or nucleic acid as a
smooth solid "ribbon" surface passing along the backbone of the protein. The ribbon is drawn
between each amino acid whose alpha carbon is currently selected. The colour of the ribbon is
changed by the RasMol 'colour ribbon' command. If the current ribbon colour is 'none' (the
default), the colour is taken from the alpha carbon at each position along its length.
The width of the ribbon at each position is determined by the optional parameter in the usual
RasMol units. By default the width of the ribbon is taken from the secondary structure of the
protein or a constant value of 720 (2.88 Ångstroms) for nucleic acids. The default width of
protein alpha helices and beta sheets is 380 (1.52 Ångstroms) and 100 (0.4 Ångstroms) for turns
and random coil. The secondary structure assignment is either from the PDB file or calculated
using the DSSP algorithm as used by the 'structure' command. This command is similar to the
RasMol command 'strands' which renders the biomolecular ribbon as parallel depth-cued
curves.
Rotate
Rotate the molecule about the specified axis. Permitted values for the axis parameter are "x", "y"
and "z". The integer parameter states the angle in degrees for the structure to be rotated. For the
X and Y axes, positive values move the closest point up and right, and negative values move it
down and left, respectively. For the Z axis, a positive rotation acts clockwise and a negative
angle anti-clockwise.
Script
The RasMol 'script' command reads a set of RasMol commands sequentially from a text file and
executes them. This allows sequences of commonly used commands to be stored and performed
by a single command. A RasMol script file may contain a further script command up to a
maximum "depth" of 10, allowing complicated sequences of actions to be executed.
Scripts may also be created with a text editor.
Select
Define the currently selected region of the molecule. All subsequent RasMol commands that
manipulate a molecule or modify its colour or representation only affect the currently selected
region. The parameter of a 'select' command is a RasMol expression that is evaluated for every
atom of the current molecule. The currently selected (active) region of the molecule consists of those
atoms that cause the expression to evaluate true. To select the whole molecule use the RasMol
command 'select all'. The behaviour of the 'select' command without any parameters is
determined by the RasMol 'hetero' and 'hydrogen' parameters.
Slab
The RasMol 'slab' command enables, disables or positions the z-clipping plane of the molecule.
The program only draws those portions of the molecule that are further from the viewer than the
slabbing plane. Integer values range from zero at the very back of the molecule to 100 which is
completely in front of the molecule. Intermediate values determine the percentage of the
molecule to be drawn.
SSBonds
The RasMol 'ssbonds' command is used to represent the disulphide bridges of the protein
molecule as either dotted lines or cylinders between the connected cysteines. The first time that
the 'ssbonds' command is used, the program searches the structure of the protein to find half-
cysteine pairs (cysteines whose sulphurs are within 3 Ångstroms of each other) and reports the
number of bridges to the user. The command 'ssbonds on' displays the selected "bonds" as
dotted lines, and the command 'ssbonds off' disables the display of ssbonds in the currently
selected area. Selection of disulphide bridges is identical to normal bonds, and may be adjusted
using the RasMol 'set bondmode' command. The colour of disulphide bonds may be changed
using the 'colour ssbonds' command. By default, each disulphide bond has the colours of its
connected atoms.
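The search for half-cysteine pairs can be sketched as a simple distance test. The sketch below assumes the sulphur (SG) atoms of the cysteines have already been extracted into a list of (residue label, coordinates) pairs; the function name and input layout are hypothetical, and only the 3 Ångstrom cutoff comes from the text.

```python
import math
from itertools import combinations

def find_disulphides(sg_atoms, cutoff=3.0):
    """Return pairs of cysteines whose sulphurs lie within cutoff Angstroms."""
    bridges = []
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sg_atoms, 2):
        if math.dist(xyz_a, xyz_b) <= cutoff:
            bridges.append((res_a, res_b))
    return bridges

# Made-up coordinates: CYS3 and CYS40 are 2 Angstroms apart, CYS16 is far away.
sg_atoms = [
    ("CYS3", (0.0, 0.0, 0.0)),
    ("CYS40", (2.0, 0.0, 0.0)),
    ("CYS16", (10.0, 0.0, 0.0)),
]
bridges = find_disulphides(sg_atoms)
print("Number of bridges:", len(bridges), bridges)
```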
Stereo
The RasMol 'stereo' command provides side-by-side stereo display of images. Stereo viewing of
a molecule may be turned on (and off) either by selecting 'Stereo' from the 'Options' menu, or by
typing the commands 'stereo on' or 'stereo off'.
Strands
The RasMol 'strands' command displays the currently loaded protein or nucleic acid as a smooth
"ribbon" of depth-cued curves passing along the backbone of the protein. The ribbon is
composed of a number of strands that run parallel to one another along the peptide plane of each
residue. The ribbon is drawn between each amino acid whose alpha carbon is currently selected.
The colour of the ribbon is changed by the RasMol 'colour ribbon' command. If the current
ribbon colour is 'none' (the default), the colour is taken from the alpha carbon at each position
along its length. The central and outermost strands may be coloured independently using the
'colour ribbon1' and 'colour ribbon2' commands, respectively. The number of strands in the
ribbon may be altered using the RasMol 'set strands' command.
Structure
The RasMol 'structure' command calculates secondary structure assignments for the currently
loaded protein. If the original PDB file contained structural assignment records (HELIX, SHEET
and TURN) these are discarded. Initially, the hydrogen bonds of the current molecule are found,
if this hasn't been done already. The secondary structure is then determined using Kabsch and
Sander's DSSP algorithm. Once finished the program reports the number of helices, strands and
turns found.
Surface
The RasMol 'surface' command renders a Lee-Richards molecular surface resulting from rolling
a probe atom on the selected atoms. The value given specifies the radius of the probe. If given in
the first form, the evolute of the surface of the probe is shown (the solvent excluded surface). If
given in the second form, the envelope of the positions of the center of the probe is shown (the
solvent accessible surface).
Wireframe
The RasMol 'wireframe' command represents each bond within the selected region of the
molecule as a cylinder, a line or a depth-cued vector. The display of bonds as depth-cued vectors
(drawn darker the further away from the viewer) is turned on by the command 'wireframe' or
'wireframe on'. The selected bonds are displayed as cylinders by specifying a radius either as an
integer in RasMol units or containing a decimal point as a value in Ångstroms. A parameter
value of 500 (2.0 Ångstroms) or above results in a "Parameter value too large" error. Bonds
may be coloured using the 'colour bonds' command. If the selected bonds involve atoms of
alternate conformers then the bonds are narrowed in the middle to a radius of 0.8 of the specified
radius (or to the radius specified as the optional second parameter).
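The two ways of giving a radius can be illustrated with a small Python sketch. It assumes that one RasMol unit is 1/250 of an Ångstrom, which is consistent with the values quoted in these notes (500 units = 2.0 Ångstroms, 720 units = 2.88 Ångstroms). The function names are hypothetical and this is not RasMol's own parser.

```python
RASMOL_UNITS_PER_ANGSTROM = 250   # assumption: 1 RasMol unit = 1/250 Angstrom

def to_angstroms(units):
    """Convert RasMol units to Angstroms."""
    return units / RASMOL_UNITS_PER_ANGSTROM

def parse_radius(text):
    """Interpret a wireframe radius parameter and return it in RasMol units.

    An integer is taken as RasMol units; a value containing a decimal
    point is taken as Angstroms, as described in the text above.
    """
    if "." in text:
        units = round(float(text) * RASMOL_UNITS_PER_ANGSTROM)
    else:
        units = int(text)
    if units >= 500:
        raise ValueError("Parameter value too large")
    return units

print(parse_radius("40"), parse_radius("0.4"))   # integer units vs Angstroms
print(to_angstroms(380))                          # default helix ribbon width
```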
Zoom
Change the magnification of the currently displayed image. Boolean parameters either magnify
or reset the scale of current molecule. An integer parameter specifies the desired magnification
as a percentage of the default scale. The minimum parameter value is 10; the maximum
parameter value is dependent upon the size of the molecule being displayed. For medium sized
proteins this is about 500.
aligned. Each residue in the active layer will be colored according to its RMS backbone
deviation from the corresponding amino acid of the reference protein (the first loaded).
NOTE: Colors are mapped from a fixed linear scale, in which dark blue is for RMS = 0
Å, and red is for RMS = 5 Å. A relative scale can be selected in Preferences>General
where the best fit is dark blue and the worst fit is red.
• By B-Factor
Colors sidechains and backbones, independently, according to their respective largest B-factor
per group. In the case of a model returned by Swiss-Model, the B-factor column contains the
Model Confidence Factor.
NOTE: The coloring gradient can be adjusted in Preferences>General to fit the range of
B-factor values present in the structure.
• By Secondary Structure
Colors the selected object according to the three common secondary structure types:
Helix=red, Strand=yellow, and Coil=gray. Especially useful for coloring ribbon
drawings.
Default colors can be redefined in Preferences>Colors.
• By Secondary Struct. Success.
Produces a gradient along the polypeptide chain from the N-terminus (blue) to the
C-terminus (red). Each secondary structure element gets a single color, and random coils
are gray. Especially useful for coloring ribbon drawings.
Ramachandran Plot is colored in yellow. The backbone of proline residues whose angle
deviates more than 25° from the ideal –65° value is colored in red. Buried sidechains of
residues that could make H-bonds but do not are colored in orange. Clashes are computed
and will appear as pink dotted lines.
Color menu, fourth block
• By Other Color
Prompts you for a single color to be applied to the entire layer. It is
functionally equivalent to a shift-click on any color box of the Control Panel window.
• By Backbone, Sidechain, Ribbon, Surface, Label Color
WORKING ON A LAYER
Classification
Advanced commands that can be applied to a single layer can be grouped into four categories:
• Mutates amino acids
DALI is a structure comparison web server that uses the intramolecular distance method. It
works by maximizing the similarity of two distance graphs. The matrices are based on distances
between all Cα atoms for each individual protein. Two distance matrices are overlaid and moved
one relative to the other to identify most similar regions. DALI uses a statistical significance
value called a Z-score to evaluate structural alignment. The Z-score is the number of standard
deviations from the average score derived from the database background distribution. The higher
the Z-score when comparing a pair of protein structures, the less likely the similarity observed is
a result of random chance. Empirically, a Z-score > 4 indicates a significant level of structure
similarity. The web server is at the same time a database that contains Z-scores of all precomputed
structure pairs of proteins in PDB. The user can upload a structure to compare it with all known
structures, or perform a pairwise comparison of two uploaded structures.
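The two ingredients described above, the intramolecular Cα distance matrix and the Z-score, can be sketched in Python. This is only the textbook definition under simple assumptions: the background scores are made-up numbers, and the real DALI scoring and its background distribution are more elaborate than shown here.

```python
import math
from statistics import mean, pstdev

def distance_matrix(ca_coords):
    """All-against-all distances between the C-alpha atoms of one protein."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]

def z_score(score, background_scores):
    """Number of standard deviations of score above the background mean."""
    mu = mean(background_scores)
    sigma = pstdev(background_scores)
    return (score - mu) / sigma

# Tiny illustration with made-up values.
print(distance_matrix([(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]))
background = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical background scores
z = z_score(14, background)
print(z, "significant" if z > 4 else "not significant")
```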
The Dali method uses a weighted sum of similarities of intra-molecular distances, which
correlates with expert classifications in the sense that the structures of homologous proteins
typically get higher similarity scores than the structures of evolutionarily unrelated proteins. This
property is useful to a biologist using structure comparison to learn more about her query
protein: the biologically informative neighbors are found at the top of the match list with
relatively few false leads.
In DaliLite v.3, new options for database searching (DaliLite –quick) and database updates
(DaliLite –update) are introduced. The new protocols improve server throughput and vastly
simplify the updates, making the complete system portable. The key change from earlier is that
the all versus all matrix of similarities is abandoned in favor of a connected graph of
similarities.
The nodes of the graph represent protein structures and edges represent structural alignments.
Whereas before each representative structure was directly linked to all its structurally similar
neighbors, it is now required only that there is a path of continuous structural similarity through the
graph. The structural neighbors of a query structure are collected by walks through the graph.
Not only need the graph be less densely connected than the all versus all matrices, thus saving
computational effort, but also there is the added benefit that the incremental updates of the
structural similarity graph and the choice of structural representatives are completely decoupled.
Methods
PDB clustering
The PDB is highly redundant. A representative subset at the 90% sequence identity level
(PDB90), derived from the current set of PDB sequences, is used. The PDB contains over 100 000
structures (chains), which is reduced to about 20 000 PDB90 representatives.
Structural similarity graph
The structural similarity graph and alignment data are stored in a relational database (MySQL).
The graph is updated incrementally. If a new structure has strong similarity to structures already
in the graph, one edge is sufficient to connect the new structure to the graph in the proper
neighborhood. If there is no strong match, the new structure is compared to all existing structures
and edges are added for all significant similarities. Similarity is measured by Dali Z-scores.
'Significant similarities' have a Z-score above 2; they usually correspond to similar folds.
'Strong matches' have sequence identity above 20% or a Z-score above a cutoff that depends on
the size of the query protein. The Z-score cutoff was set to n/10 − 4, where n is the number of
residues in the query structure. A segment of the query structure longer than 80 residues without
any structural matches always disqualifies a strong match.
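The thresholds above can be written out as a small Python sketch. The function and parameter names are hypothetical; only the numbers (Z-score above 2, identity above 20%, cutoff n/10 − 4, unmatched segment longer than 80 residues) are taken from the text.

```python
def z_cutoff(n_residues):
    """Size-dependent Z-score cutoff for a strong match: n/10 - 4."""
    return n_residues / 10 - 4

def is_significant(z):
    """A significant similarity has a Z-score above 2."""
    return z > 2

def is_strong_match(z, seq_identity_pct, n_residues, longest_unmatched):
    """Apply the strong-match rules to one query/neighbor comparison.

    longest_unmatched is the length of the longest query segment that has
    no structural match; a segment over 80 residues always disqualifies.
    """
    if longest_unmatched > 80:
        return False
    return seq_identity_pct > 20 or z > z_cutoff(n_residues)

# A 200-residue query needs Z > 16 unless sequence identity exceeds 20%.
print(z_cutoff(200))
print(is_strong_match(z=18, seq_identity_pct=10, n_residues=200, longest_unmatched=30))
```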
Database searching
The database search option DaliLite –quick compares a query structure to all structures in the
PDB, as organized in the structural similarity graph. To initiate a transitive search of structures in
the graph, the query structure must be attached to some structural neighbors. Fast feature filters
are often successful in finding near neighbors. Currently, sequence comparison by BLAST,
GTG sequence motifs and secondary structure triplets are implemented to rank the structures
in PDB90. Feature filter scores are converted to Z-scores in order to combine the ranked lists.
The top 100 structures are compared using the normal Dali procedures. If a strong match is
found, the search moves to the next step (transitive alignment). Otherwise, the query structure is compared
against all 20 000 structures in PDB90.
The entry points connect the query structure to one or more structures in the structural similarity
graph. These are direct (first shell) neighbors of the query. Structures in the second shell are
compared in batches of 100, selecting those with the strongest connections first. Connection
strength is the lesser Z-score along the path from query to the first neighbor to the second
neighbor. The transitive alignment (via first neighbor) between the query structure and second
neighbor is used as starting point for refinement, skipping the costly alignment optimization from
scratch. The expansion is repeated until the connection strength drops below a Z-score cutoff of
2, or a maximum number of matches have been reported (default: MAX_HITS = 500).
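The walk through the similarity graph can be sketched as a best-first search in which the strength of a path is its weakest edge. This is a simplified illustration, not the DaliLite implementation: it omits the batches of 100 and the refinement of each transitive alignment, and the graph layout (a dictionary of neighbor-to-Z-score dictionaries) is an assumption made for the example.

```python
import heapq

def collect_neighbors(graph, entry_points, z_cutoff=2.0, max_hits=500):
    """Collect structural neighbors in order of decreasing connection strength.

    graph:        node -> {neighbor: Z-score of that edge}
    entry_points: node -> Z-score between the query and that first neighbor
    """
    # heapq is a min-heap, so strengths are stored negated.
    heap = [(-z, node) for node, z in entry_points.items()]
    heapq.heapify(heap)
    seen = set()
    hits = []
    while heap and len(hits) < max_hits:
        neg_strength, node = heapq.heappop(heap)
        strength = -neg_strength
        if strength < z_cutoff:
            break                      # all remaining paths are weaker still
        if node in seen:
            continue
        seen.add(node)
        hits.append((node, strength))
        for neighbor, z in graph.get(node, {}).items():
            if neighbor not in seen:
                # Connection strength is the lesser Z-score along the path.
                heapq.heappush(heap, (-min(strength, z), neighbor))
    return hits

graph = {
    "A": {"B": 10.0, "C": 1.5},
    "B": {"A": 10.0, "D": 3.0},
    "C": {"A": 1.5},
    "D": {"B": 3.0},
}
print(collect_neighbors(graph, {"A": 20.0}))
```

In this toy graph structure C is never reported, because the only path to it has a connection strength of 1.5, which is below the cutoff of 2.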
VAST (Vector Alignment Search Tool) is a web server that performs alignment using both the
inter- and intramolecular approaches. The superposition is based on information of directionality
of secondary structural elements (represented as vectors). Optimal alignment between two
structures is defined by the highest degree of vector matches.
Protein structure neighbors in
Entrez are determined by direct comparison of 3D protein structures with the Vector Alignment
Search Tool (VAST) algorithm. Each of the more than 87,000 domains and complete protein
chains in MMDB is compared to every other one. Entrez can list structure neighbors; however, the
VAST Structure Neighbors pages provide further information and displays of structure
superpositions and structure-based alignments.
VAST pages begin with a brief text description of the query domain, including PubMed links.
The precomputed structure neighbors, ranked by a selected similarity measure, are displayed
below in a graphic or table. Individual 3D superpositions can be selected by clicking check boxes
and viewed in Cn3D. The corresponding sequence alignments can be displayed in HTML, text,
and FASTA formats. The "Find" feature is convenient for looking for particular structure
neighbors, where the user wants to specify a particular identifier.
VAST similarity measures
All of the similarity measures for each structure neighbor detected by VAST can be listed in a
table to facilitate the examination of VAST results. The table includes the following columns:
• Aligned Length: The number of equivalent pairs of C-alpha atoms superimposed
between the two structures, i.e. how many residues have been used to calculate the 3D
superposition.
• SCORE: The VAST structure-similarity score. This number is related to the number of
secondary structure elements superimposed and the quality of that superposition. Higher
VAST scores correlate with higher similarity.
• P-VAL: The VAST p value is a measure of the significance of the comparison, expressed
as a probability. For example, if the p value is 0.001, then the odds are 1000 to 1 against
seeing a match of this quality by pure chance. The p value from VAST is adjusted for the
effects of multiple comparisons using the assumption that there are 500 independent and
unrelated types of domains in the MMDB database. The p value shown thus corresponds
to the p value for the pairwise comparison of each domain pair, divided by 500.
• RMSD: The root mean square superposition residual in Angstroms. This number is
calculated after optimal superposition of two structures, as the square root of the mean
square distances between equivalent C-alpha atoms. Note that the RMSD value scales
with the extent of the structural alignments and that this size must be taken into
consideration when using RMSD as a descriptor of overall structural similarity.
• %Id: Percent identical residues in the aligned sequence region. This is a raw measure of
sequence similarity in the parts of the proteins that have been superimposed.
• LHM: Loop Hausdorff Metric. A Loop Similarity measure that shows how well two
structures conform to each other in the loop regions, after structural superposition. The
"loop regions" are the parts of the structures between aligned secondary structure
elements (helices and strands). LHM is measured in Angstroms, with a smaller value
indicative of greater similarity. The loop similarity may be undefined (indicated by 'NA')
if there are too many residues with missing coordinates in the loops.
• GSP: Gapped Score. A combination (algebraic) score that uses RMSD, aligned length,
and the number of gapped regions in the alignment. A smaller gapped score correlates
with greater similarity.
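The RMSD column in the list above can be illustrated with a short Python sketch. It assumes the two structures have already been optimally superimposed and that the equivalent C-alpha atoms are supplied in matching order; the superposition step itself is not shown, and the function name is hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square distance between equivalent, superimposed atoms."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

# Two aligned C-alpha atoms, each displaced by 1 Angstrom along z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(structure_a, structure_b))
```

Because the value is averaged over the aligned pairs only, it should be read together with the aligned length, as the RMSD entry notes.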
Note:
Base-calling
Base-calling converts raw or processed data from a sequencing instrument into sequences and
quality scores.
All currently available commercial next-gen sequencing platforms use optical detection and CCD
cameras. Images are the raw data. Base-calling usually refers to the conversion of intensity data
into sequences and quality scores. Intensity information is extracted from images by the image
analysis.
Images are analysed closer to the instrument (on the instrument control PC), then base-calls are
transferred to a secondary analysis server.
Quality scores quantify the probability that a base-call is correct (or wrong).
Terminology:
Phred scores
A base quality score assigned by the phred software (or a program based on phred).
A quality score expressed on a logarithmic scale: Q = -10 log10(probability of an error)
Example: Q20 = 1% error probability
The Phred method assigns quality scores to a base-call based on observed
properties of the base (predictors)
Phred is a two-step process:
Training: Given a set of reads, labels as to which bases are correct, and a set of
quality statistics for each base, produce a model that can predict error rates for
unseen bases
Application: Given new reads and quality statistics, predict the quality for each of
the bases.
Phred is essentially a big lookup table!
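The logarithmic quality scale defined above can be checked with a few lines of Python. This sketch only converts between an error probability and a quality score; it does not model the training step or the lookup table, and the function names are hypothetical.

```python
import math

def phred_quality(p_error):
    """Quality score Q = -10 * log10(probability of an error)."""
    return -10 * math.log10(p_error)

def error_probability(q):
    """Inverse relation: probability of an error = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

for q in (10, 20, 30):
    print(f"Q{q}: error probability {error_probability(q):.4f}")
```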