Bioinformatics Notes - 17Bt54: Module - 4

AIT/IQAC/Aca/19-20/Bioinfo/Notes
BIOINFORMATICS NOTES - 17BT54
MODULE - 4:
HOMOLOGY MODELING
As the name suggests, homology modeling predicts protein structures based on sequence
homology with known structures. It is also known as comparative modeling. The principle
behind it is that if two proteins share a high enough sequence similarity, they are likely to have
very similar three-dimensional structures. If one of the protein sequences has a known structure,
then the structure can be copied to the unknown protein with a high degree of confidence.
Homology modeling produces an all-atom model based on alignment with template proteins. The
overall homology modeling procedure consists of six steps. The first step is template selection,
which involves identification of homologous sequences in the protein structure database to be
used as templates for modeling. The second step is alignment of the target and template
sequences. The third step is to build a framework structure for the target protein consisting of
main chain atoms. The fourth step of model building includes the addition and optimization of
side chain atoms and loops. The fifth step is to refine and optimize the entire model according to
energy criteria. The final step involves evaluating the overall quality of the model obtained. If
necessary, alignment and model building are repeated until a satisfactory result is obtained.
Template Selection
The first step in protein structural modeling is to select appropriate structural templates. This
forms the foundation for the rest of the modeling process. Template selection involves searching
the Protein Data Bank (PDB) for homologous proteins with determined structures. The search
can be performed using a heuristic pairwise alignment search program such as BLAST or
FASTA. As a rule of thumb, a database protein should have at least 30% sequence identity with
the query sequence to be selected as template. Occasionally, a 20% identity level can be used as
threshold as long as the identity of the sequence pair falls within the “safe zone”. Often, multiple
database structures with significant similarity can be found as a result of the search. In that case,
it is recommended that the structure(s) with the highest percentage identity, highest resolution,
and the most appropriate cofactors be selected as templates. On the other hand, there may be a
situation in which no highly similar sequences can be found in the structure database. In that
instance, template selection can become difficult. Either a more sensitive profile-based PSI-BLAST
method or a fold recognition method such as threading can be used to identify distant
homologs. Most likely, in such a scenario, only local similarities can be identified with distant
homologs. Modeling can therefore only be done with the aligned domains of the target protein.

Pruthvish R, Department of Biotechnology, Acharya Institute of Technology, Bangalore-560107
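
The rule of thumb for template selection can be illustrated with a minimal Python sketch. It assumes the search hits are already available as records; the `Hit` class, its field names and the identifiers `tmplA` to `tmplD` are hypothetical, and cofactor suitability is not modeled.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    pdb_id: str        # hypothetical identifier, not a real PDB entry
    identity: float    # percent sequence identity to the target
    resolution: float  # in Angstroms; lower values mean better resolution

def select_templates(hits, min_identity=30.0):
    """Keep hits at or above the identity threshold, best candidates first."""
    usable = [h for h in hits if h.identity >= min_identity]
    # Highest identity first; ties are broken by the best (lowest) resolution.
    return sorted(usable, key=lambda h: (-h.identity, h.resolution))

hits = [
    Hit("tmplA", 42.0, 2.5),
    Hit("tmplB", 42.0, 1.8),
    Hit("tmplC", 27.0, 1.5),
    Hit("tmplD", 65.0, 3.0),
]
ranked = select_templates(hits)
print([h.pdb_id for h in ranked])
```

Lowering `min_identity` to 20.0 corresponds to the relaxed threshold mentioned above and would admit `tmplC` as the last-ranked candidate.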
Sequence Alignment
Once the structure with the highest sequence similarity is identified as a template, the full-length
sequences of the template and target proteins need to be realigned using refined alignment
algorithms to obtain optimal alignment. This realignment is the most critical step in homology
modeling, which directly affects the quality of the final model. This is because incorrect
alignment at this stage leads to incorrect designation of homologous residues and therefore to
incorrect structural models. Errors made in the alignment step cannot be corrected in the
following modeling steps. Even alignment using the best alignment program may not be error
free and should be visually inspected to ensure that conserved key residues are correctly aligned.
If necessary, manual refinement of the alignment should be carried out to improve alignment
quality.
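
As a small aid to inspecting an alignment, the following Python sketch computes percent identity and checks that chosen key columns are identically aligned. It assumes identity is counted over columns where neither sequence has a gap (other conventions exist), and the example sequences and column indices are made up.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # gap columns are skipped under this convention
        aligned += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0

def key_residues_aligned(aligned_target, aligned_template, key_columns):
    """True if every listed column pairs two identical, non-gap residues."""
    return all(
        aligned_target[i] == aligned_template[i] != "-" for i in key_columns
    )

target = "MKT-AYIAKQR"    # made-up aligned target
template = "MKTLAY-AKHR"  # made-up aligned template
pid = percent_identity(target, template)
print(round(pid, 1), key_residues_aligned(target, template, [0, 1, 2]))
```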
Backbone Model Building

Once optimal alignment is achieved, residues in the aligned regions of the target protein can
be assumed to adopt a structure similar to that of the template proteins, meaning that the coordinates of the
corresponding residues of the template proteins can simply be copied onto the target protein. In
backbone modeling, it is simplest to use only one template structure. As mentioned, the structure
with the best quality and highest resolution is normally chosen if multiple options are available.
This structure tends to carry the fewest errors. Occasionally, multiple template structures are
available for modeling. In this situation, the template structures have to be optimally aligned and
superimposed before being used as templates in model building. One can either choose to use
average coordinate values of the templates or the best parts from each of the templates to model.
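
Copying coordinates for aligned residues can be sketched in Python as follows. This is an illustration under simplifying assumptions: a single template, one coordinate triple per residue instead of all main chain atoms, and made-up sequences and coordinates.

```python
def copy_backbone(aligned_target, aligned_template, template_coords):
    """Return {target residue index: coordinates} for aligned residue pairs.

    template_coords holds one (x, y, z) tuple per template residue, in the
    order of the ungapped template sequence.
    """
    model = {}
    t_idx = 0  # position in the ungapped target sequence
    s_idx = 0  # position in the ungapped template sequence
    for a, b in zip(aligned_target, aligned_template):
        if a != "-" and b != "-":
            model[t_idx] = template_coords[s_idx]
        if a != "-":
            t_idx += 1
        if b != "-":
            s_idx += 1
    return model

coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
model = copy_backbone("AC-DE", "A-GDE", coords)
print(model)
```

Target residue 1 has no template partner here, so it receives no coordinates; residues like this are the ones that need separate treatment.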
Loop Modeling
In the sequence alignment for modeling, there are often regions caused by insertions and
deletions producing gaps in sequence alignment. The gaps cannot be directly modeled, creating
“holes” in the model. Closing the gaps requires loop modeling, which is a very difficult problem
in homology modeling and is also a major source of error. Loop modeling can be considered a
mini–protein modeling problem by itself. Unfortunately, there are no mature methods available
that can model loops reliably. Currently, there are two main techniques used to approach the
problem: the database searching method and the ab initio method. The database method involves
finding “spare parts” from known protein structures in a database that fit onto the two stem
regions of the target protein. The stems are defined as the main chain atoms that precede and
follow the loop to be modeled. The procedure begins by measuring the orientation and distance
of the anchor regions in the stems and searching PDB for segments of the same length that also
match the above endpoint conformation. Usually, many different alternative segments that fit the
endpoints of the stems are available. The best loop can be selected based on sequence similarity
as well as minimal steric clashes with the neighboring parts of the structure. The conformation of
the best matching fragments is then copied onto the anchoring points of the stems. The ab initio
method generates many random loops and searches for the one that does not clash with nearby
side chains and also has reasonably low energy and φ and ψ angles in the allowable regions in
the Ramachandran plot.
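
The database searching method can be illustrated with a toy Python sketch. It is heavily simplified: only the distance between the two stem anchor points is matched (orientation, sequence similarity and steric clashes are ignored), and the fragment library, its field names and the tolerance value are hypothetical.

```python
import math

def find_loop_candidates(library, loop_length, stem_start, stem_end,
                         tolerance=1.0):
    """Return names of fragments with the right length and end-to-end span.

    Candidates are sorted by how closely their end-to-end distance matches
    the distance between the two stem anchor points.
    """
    target_span = math.dist(stem_start, stem_end)
    hits = []
    for frag in library:
        if len(frag["sequence"]) != loop_length:
            continue  # only segments of the same length are considered
        diff = abs(frag["end_to_end"] - target_span)
        if diff <= tolerance:
            hits.append((diff, frag["name"]))
    return [name for _, name in sorted(hits)]

library = [
    {"name": "frag1", "sequence": "GSGA", "end_to_end": 5.3},
    {"name": "frag2", "sequence": "GSG", "end_to_end": 5.0},
    {"name": "frag3", "sequence": "PDGT", "end_to_end": 4.9},
    {"name": "frag4", "sequence": "AAAA", "end_to_end": 7.5},
]
candidates = find_loop_candidates(library, 4, (0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
print(candidates)
```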
Side Chain Refinement

Once main chain atoms are built, the positions of side chains that are not modeled must be
determined. Modeling side chain geometry is very important in evaluating protein–ligand
interactions at active sites and protein–protein interactions at the contact interface. A side chain
can be built by searching every possible conformation at every torsion angle of the side chain to
select the one that has the lowest interaction energy with neighboring atoms. However, this
approach is computationally prohibitive in most cases. In fact, most current side chain prediction
programs use the concept of rotamers, which are favored side chain torsion angles extracted
from known protein crystal structures. A collection of preferred side chain conformations is a
rotamer library in which the rotamers are ranked by their frequency of occurrence. Having a
rotamer library reduces the computational time significantly because only a small number of
favored torsion angles are examined. In prediction of side chain conformation, only the possible
rotamers with the lowest interaction energy with nearby atoms are selected.
In many cases, even applying the rotamer library for every residue can be computationally too
expensive. To reduce search time further, backbone conformation can be taken into account. It
has been observed that there is a correlation of backbone conformations with certain rotamers.
By using such correlations, many possible rotamers can be eliminated and the speed of
conformational search can be much improved. After adding the most frequently occurring
rotamers, the conformations have to be further optimized to minimize steric overlaps with the
rest of the model structure.
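
Rotamer selection by lowest interaction energy can be sketched in Python as below. The rotamer list, its frequencies and the two-dimensional `toy_energy` function are invented for illustration; a real program would use a published rotamer library and a proper force field.

```python
import math

def toy_energy(chi, neighbour=(1.5, 0.0)):
    """Toy repulsion: a side chain atom placed by chi1 near one neighbour."""
    angle = math.radians(chi[0])
    atom = (1.5 * math.cos(angle), 1.5 * math.sin(angle))
    d = math.dist(atom, neighbour)
    return 1.0 / (d * d + 0.01)  # closer contact gives higher energy

def pick_rotamer(rotamers, energy_fn):
    """Choose the rotamer with the lowest energy.

    rotamers is a list of (chi_angles, frequency) pairs; ties in energy are
    broken in favour of the more frequent rotamer.
    """
    return min(rotamers, key=lambda r: (energy_fn(r[0]), -r[1]))

# Hypothetical library entries for one residue: (chi angles, frequency)
rotamers = [((-60.0,), 0.50), ((180.0,), 0.35), ((60.0,), 0.15)]
best = pick_rotamer(rotamers, toy_energy)
print(best)
```

In this toy case the most frequent rotamer is not chosen, because it sits closer to the neighbouring atom than the 180 degree rotamer does.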
Model Refinement Using Energy Function

In the loop modeling and side chain modeling steps, potential energy calculations are applied
to improve the model. However, this does not guarantee that the entire raw homology model is
free of structural irregularities such as unfavorable bond angles, bond lengths, or close atomic
contacts. These kinds of structural irregularities can be corrected by applying the energy
minimization procedure on the entire model, which moves the atoms in such a way that the
overall conformation has the lowest energy potential. The goal of energy minimization is to
relieve steric collisions and strains without significantly altering the overall structure. However,
energy minimization has to be used with caution because excessive energy minimization often
moves residues away from their correct positions. Therefore, only limited energy minimization is
recommended (a few hundred iterations) to remove major errors, such as short bond distances
and close atomic clashes. Key conserved residues and those involved in cofactor binding have to
be restrained if necessary during the process.
Another often used structure refinement procedure is molecular dynamics simulation. This
practice is derived from the concern that energy minimization only moves atoms toward a local
minimum without searching for all possible conformations, often resulting in a suboptimal
structure. To search for a global minimum requires moving atoms uphill as well as downhill in a
rough energy landscape. It is hoped that this simulation follows the protein folding process and
has a better chance at finding the true structure.
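
Limited energy minimization can be illustrated with a toy steepest descent in Python. The sketch assumes a single harmonic bond between two atoms on a line; the force constant, ideal length, step size and iteration cap are arbitrary illustrative values, not parameters of any real force field.

```python
def bond_energy(x, k=10.0, d0=1.5):
    """Harmonic energy of one bond between two atoms on a line."""
    d = x[1] - x[0]
    return k * (d - d0) ** 2

def bond_gradient(x, k=10.0, d0=1.5):
    """Derivative of bond_energy with respect to each atom position."""
    d = x[1] - x[0]
    g = 2.0 * k * (d - d0)
    return [-g, g]

def steepest_descent(grad, x, step=0.01, max_iter=200):
    """Move downhill along the negative gradient for a limited number of steps."""
    x = list(x)
    for _ in range(max_iter):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

start = [0.0, 1.0]  # bond is too short (1.0 instead of 1.5)
relaxed = steepest_descent(bond_gradient, start)
print(relaxed, bond_energy(relaxed))
```

Steepest descent only ever moves downhill, which is why it settles in the nearest local minimum; this is the limitation that motivates molecular dynamics as an alternative.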

Model Evaluation
The final homology model has to be evaluated to make sure that the structural features of the
model are consistent with the physicochemical rules. This involves checking anomalies in φ–ψ
angles, bond lengths, close contacts, and so on. Another way of checking the quality of a protein
model is to implicitly take these stereochemical properties into account. This is a method that
detects errors by compiling statistical profiles of spatial features and interaction energy from
experimentally determined structures. By comparing the statistical parameters with the
constructed model, the method reveals which regions of a sequence appear to be folded normally
and which regions do not. If structural irregularities are found, the region is considered to have
errors and has to be further refined.
Procheck is a program that is able to check general physicochemical parameters such as φ–ψ
angles, chirality, bond lengths, bond angles, and so on. The parameters of the model are used to
compare with those compiled from well-defined, high-resolution structures. If the program
detects unusual features, it highlights the regions that should be checked or refined further.
Verify3D is another server using the statistical approach. It uses a precomputed database
containing eighteen environmental profiles based on secondary structures and solvent exposure,
compiled from high-resolution protein structures. To assess the quality of a protein model, the
secondary structure and solvent exposure propensity of each residue are calculated. If the
parameters of a residue fall within one of the profiles, it receives a high score, otherwise a low
score. The result is a two-dimensional graph illustrating the folding quality of each residue of the
protein structure. The threshold value is normally set at zero. Residues with scores below zero
are considered to have an unfavorable environment.
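
Reading a per-residue score profile against the zero threshold can be sketched in Python as follows. The scores are made up and the function name is hypothetical; it only groups consecutive residues whose score falls below the threshold.

```python
def low_scoring_regions(scores, threshold=0.0):
    """Return (first, last) residue numbers of runs scoring below threshold.

    Residues are numbered from 1, following the order of the scores list.
    """
    regions = []
    start = None
    for i, s in enumerate(scores, start=1):
        if s < threshold:
            if start is None:
                start = i  # a new low-scoring run begins here
        elif start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(scores)))
    return regions

scores = [0.4, 0.2, -0.1, -0.3, 0.1, 0.5, -0.2]  # made-up profile
regions = low_scoring_regions(scores)
print(regions)
```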
Because no single method is clearly superior to any other, a good strategy is to use multiple
verification methods and identify the consensus between them. It is also important to keep in
mind that the evaluation tests performed by these programs only check the stereochemical
correctness, regardless of the accuracy of the model, which may or may not have any biological
meaning.

SWISS-MODEL (http://swissmodel.expasy.org) is a server for automated comparative
modeling of three-dimensional (3D) protein structures. SWISS-MODEL provides several levels
of user interaction through its World Wide Web interface: in the “first approach mode” only an
amino acid sequence of a protein is submitted to build a 3D model. Template selection,
alignment and model building are fully automated by the server. In the “alignment
mode”, the modeling process is based on a user-defined target-template alignment. Complex
modeling tasks can be handled with the “project mode” using DeepView (Swiss-PdbViewer), an
integrated sequence-to-structure workbench. All models are sent back via email with a detailed
modeling report. WhatCheck analyses and ANOLEA evaluations are provided optionally.
SWISS-MODEL MODES
SWISS-MODEL server gives the user the choice between three main interaction modes.
First approach mode

The “first approach mode” provides a simple interface and requires only an amino acid sequence
as input data. The server will automatically select suitable templates. Optionally, the user can
specify up to five template structures, either from the ExPDB library or uploaded coordinate
files. The automated modeling procedure will start if at least one modeling template is available
that has a sequence identity of more than 25% with the submitted target sequence. However,
users need to be aware that the model reliability decreases as the sequence identity decreases and
that target-template pairs sharing less than 50% sequence identity may often require manual
adjustment of the alignment.
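
The two identity thresholds quoted above can be expressed as a small decision rule. This Python sketch is only a reading aid; the function name and the returned messages are hypothetical and are not part of the SWISS-MODEL interface.

```python
def first_approach_advice(identities):
    """Map the best template identity (percent) to the outcome described above."""
    best = max(identities, default=0.0)
    if best <= 25.0:
        return "no automated model"  # needs a template above 25% identity
    if best < 50.0:
        return "model built; check the alignment manually"
    return "model built"

print(first_approach_advice([30.0, 22.0]))
```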
Alignment mode
In the “alignment mode” the modeling procedure is initiated by submitting a sequence alignment.
The user specifies which sequence in the given alignment is the target sequence and which one
corresponds to a structurally known protein chain from the ExPDB template library. The server
will build the model based on the given alignment.

Project mode
The “project mode” allows the user to submit a manually optimized modeling request to the
SWISS-MODEL server. The starting point for this mode is a DeepView project file. It contains
the superposed template structures, and the alignment between the target and the templates. This
mode gives the user control over a wide range of parameters, e.g. template selection or gap
placement in the alignment. Furthermore, the project mode can also be used to iteratively
improve the output of the “first approach mode”.
MODELING PROCEDURE
All homology-modeling methods consist of the following four steps: (i) template selection; (ii)
target template alignment; (iii) model building; and (iv) evaluation. These steps can be iteratively
repeated, until a satisfying model structure is achieved. Several different techniques for model
building have been developed. The SWISS-MODEL server approach can be described as rigid
fragment assembly.
Template selection
The SWISS-MODEL server template library ExPDB is extracted from the PDB. In order to
allow a stable and automated workflow of the server, the PDB coordinate files are split into
individual protein chains and unreliable entries, e.g. theoretical models and low quality structures
providing only Cα coordinates, are removed. Additional information useful for template selection
is gathered and added to the file header, e.g. probable quaternary structure, quality indicators or
ANOLEA scores. To select templates for a given protein, the sequences of the template structure
library are searched. If these templates cover distinct regions of the target sequence, the
modeling process will be split into separate independent batches.
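
Splitting templates into independent batches can be sketched in Python as below. This assumes that "distinct regions" means non-overlapping residue ranges on the target, which the text does not spell out; the template names and ranges are made up.

```python
def split_into_batches(coverage):
    """Group templates whose covered target ranges overlap.

    coverage maps a template name to an inclusive (start, end) residue range
    on the target sequence.
    """
    batches = []
    for name, (start, end) in sorted(coverage.items(), key=lambda kv: kv[1]):
        if batches and start <= batches[-1]["end"]:
            # Overlaps the current batch: extend it.
            batches[-1]["templates"].append(name)
            batches[-1]["end"] = max(batches[-1]["end"], end)
        else:
            batches.append({"start": start, "end": end, "templates": [name]})
    return batches

coverage = {"t1": (1, 120), "t2": (10, 130), "t3": (200, 320)}
batches = split_into_batches(coverage)
print(batches)
```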
Alignment
Up to five template structures per batch are superposed using an iterative least squares algorithm.
A structural alignment is generated after removing incompatible templates, i.e. omitting
structures with high Cα root mean square deviations to the first template. A local pair-wise
alignment of the target sequence to the main template structures is calculated, followed by a step to
improve the alignment for modeling purposes. The placement of insertions and deletions is
optimized considering the template structure context. In particular, isolated residues in the
alignment (“islands”) are moved to the flanks to facilitate the loop building process.
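
Removing incompatible templates by their deviation from the first template can be sketched in Python as follows. The sketch assumes the structures are already superposed and have the same number of equivalent alpha carbons; the 3.0 Å cutoff and the coordinates are hypothetical, since the text gives no threshold.

```python
import math

def ca_rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length coordinate lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

def compatible_templates(templates, max_rmsd=3.0):
    """Keep templates within max_rmsd of the first (reference) template."""
    names = list(templates)
    reference = templates[names[0]]
    return [n for n in names if ca_rmsd(reference, templates[n]) <= max_rmsd]

templates = {
    "first": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    "close": [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0)],
    "far": [(0.0, 5.0, 0.0), (1.0, 5.0, 0.0), (2.0, 5.0, 0.0)],
}
kept = compatible_templates(templates)
print(kept)
```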
Model building
To generate the core of the model, the backbone atom positions of the template structures are
averaged. The templates are thereby weighted by their sequence similarity to the target sequence,
while significantly deviating atom positions are excluded. Regions of insertions or deletions in
the alignment cannot be built from the template coordinates. To generate those parts, an ensemble
of fragments compatible with the neighboring stems is constructed using constraint space
programming (CSP). The best loop is selected using a scoring scheme, which accounts for force
field energy, steric hindrance and favorable interactions like hydrogen bond formation. In cases
where CSP does not give a satisfying solution and for loops above 10 residues, a loop library
derived from experimental structures is searched to find compatible loop fragments.
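
The similarity-weighted averaging of template backbone positions used for the model core can be illustrated with a short Python sketch. It covers a single atom position only and omits the exclusion of strongly deviating positions; the weights and coordinates are made-up values.

```python
def weighted_average(positions, weights):
    """Average (x, y, z) positions from several templates using given weights."""
    if len(positions) != len(weights) or not positions:
        raise ValueError("need one weight per position")
    total = sum(weights)
    return tuple(
        sum(w * p[axis] for p, w in zip(positions, weights)) / total
        for axis in range(3)
    )

# The same backbone atom taken from two superposed templates; the first
# template is given three times the weight of the second.
avg = weighted_average([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [3.0, 1.0])
print(avg)
```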
Side chain modeling
The reconstruction of the model side chains is based on the weighted positions of corresponding
residues in the template structures. Starting with conserved residues, the model side chains are
built. Possible side chain conformations are selected from a backbone dependent rotamer library,
which has been constructed carefully taking into account the quality of the source structures. A
scoring function assessing favorable interactions (hydrogen bonds, disulfide bridges) and
unfavorably close contacts is applied to select the most likely conformation.
Energy minimization
Deviations in the protein structure geometry, which have been introduced by the modeling
algorithm when joining rigid fragments, are regularized in the last modeling step by steepest
descent energy minimization. Energy minimization or molecular dynamics methods are in
general not able to improve the accuracy of the models, and are used in SWISS-MODEL only to
regularize the structure.
MODELING RESULTS AND EVALUATION

Possible applications of protein models depend largely on the quality of the models. The
accuracy of a model can vary significantly, even within different regions of the same protein:
usually highly-conserved core regions can be modeled much more reliably than variable loop
regions or surface residues. Several tools are provided to allow the SWISS-MODEL user to
evaluate the reliability of the model. WhatCheck reports and evaluation by the atomic mean
force potential ANOLEA are provided by SWISS-MODEL to assess the quality of the model.

THREADING AND FOLD RECOGNITION

There is only a small number of protein folds available (<1,000), compared to millions of protein
sequences. This means that protein structures tend to be more conserved than protein sequences.
Consequently, many proteins can share a similar fold even in the absence of sequence
similarities. This has allowed the development of computational methods to predict protein structures
beyond sequence similarities. Determining whether a protein sequence adopts a known
three-dimensional structure fold relies on threading and fold recognition methods.
By definition, threading or structural fold recognition predicts the structural fold of an unknown
protein sequence by fitting the sequence into a structural database and selecting the best-fitting
fold. The comparison emphasizes matching of secondary structures, which are most
evolutionarily conserved. Therefore, this approach can identify structurally similar proteins even
without detectable sequence similarity.
The algorithms can be classified into two categories, pairwise energy based and profile based.
The pairwise energy–based method was originally referred to as threading and the profile-based
method was originally defined as fold recognition. However, the two terms are now often used
interchangeably without distinction in the literature.
Pairwise Energy Method

In the pairwise energy-based method, a protein sequence is searched against a structural fold
database to find the best matching structural fold using energy-based criteria. The detailed
procedure involves aligning the query sequence with each structural fold in a fold library. The
alignment is performed essentially at the sequence profile level using dynamic programming or
heuristic approaches. Local alignment is often adjusted to get lower energy and thus better
fitting. The adjustment can be achieved using algorithms such as double-dynamic programming.
The next step is to build a crude model for the target sequence by replacing aligned residues in
the template structure with the corresponding residues in the query. The third step is to calculate
the energy terms of the raw model, which include pairwise residue interaction energy, solvation
energy, and hydrophobic energy. Finally, the models are ranked based on the energy terms to
find the lowest energy fold that corresponds to the structurally most compatible fold.
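
Ranking folds by energy can be illustrated with a toy Python sketch. It assumes gapless threading of the query onto each fold, represents a fold only by its list of residue contacts, and uses an invented hydrophobic/polar contact energy; solvation terms are left out, and the fold names are hypothetical.

```python
HYDROPHOBIC = set("AVLIMFWC")  # simplified residue classes for the toy energy

def pair_energy(a, b):
    """Toy contact energy: reward hydrophobic pairs, penalise mixed pairs."""
    ha, hb = a in HYDROPHOBIC, b in HYDROPHOBIC
    if ha and hb:
        return -1.0
    if ha != hb:
        return 0.5
    return 0.0

def thread_energy(sequence, contacts):
    """Sum pair energies over the residue contacts defined by a fold."""
    return sum(pair_energy(sequence[i], sequence[j]) for i, j in contacts)

def rank_folds(sequence, folds):
    """Return fold names sorted from lowest to highest energy."""
    return sorted(folds, key=lambda name: thread_energy(sequence, folds[name]))

folds = {
    "foldX": [(0, 2), (2, 4), (0, 4)],
    "foldY": [(0, 1), (2, 3), (1, 3)],
}
ranking = rank_folds("LKVEA", folds)
print(ranking)
```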

Profile Method
In the profile-based method, a profile is constructed for a group of related protein structures. The
structural profile is generated by superimposition of the structures to expose corresponding
residues. Statistical information from these aligned residues is then used to construct a profile.
The profile contains scores that describe the propensity of each of the twenty amino acid residues
to be at each profile position. The profile scores contain information for secondary structural
types, the degree of solvent exposure, polarity, and hydrophobicity of the amino acids. To predict
the structural fold of an unknown query sequence, the query sequence is first predicted for its
secondary structure, solvent accessibility, and polarity. The predicted information is then used
for comparison with propensity profiles of known structural folds to find the fold that best
represents the predicted profile.
3D-PSSM (www.bmm.icnet.uk/~3dpssm/) is a web-based program that employs the structural
profile method to identify protein folds. The profiles for each protein superfamily are constructed
by combining multiple smaller profiles. First, protein structures in a superfamily based on the
SCOP classification are superimposed and are used to construct a structural profile by
incorporating secondary structures and solvent accessibility information for corresponding
residues. In addition, each member in a protein structural superfamily has its own sequence-
based PSI-BLAST profile computed. These sequence profiles are used in combination with the
structure profile to form a large superfamily profile in which each position contains both
sequence and structural information. For the query sequence, PSI-BLAST is performed to
generate a sequence-based profile. PSI-PRED is used to predict its secondary structure. Both the
sequence profile and predicted secondary structure are compared with the precomputed protein
superfamily profiles, using a dynamic programming approach. The matching scores are
calculated in terms of secondary structure, solvation energy, and sequence profiles and ranked to
find the highest scored structure fold.

AB INITIO PROTEIN STRUCTURAL PREDICTION

Both homology and fold recognition approaches rely on the availability of template structures in
the database to achieve predictions. If no correct structures exist in the database, the methods
fail. However, proteins in nature fold on their own without checking what the structures of their
homologs are in databases. Obviously, there is some information in the sequences that provides
instruction for the proteins to “find” their native structures. Early biophysical studies have shown
that most proteins fold spontaneously into a stable structure that has near minimum energy. This
structural state is called the native state. This folding process appears to be nonrandom; however,
its mechanism is poorly understood.
The limited knowledge of protein folding forms the basis of ab initio prediction. As the name
suggests, the ab initio prediction method attempts to produce all-atom protein models based on
sequence information alone without the aid of known protein structures. The perceived
advantage of this method is that predictions are not restricted by known folds and that novel
protein folds can be identified. However, because the physicochemical laws governing protein
folding are not yet well understood, the energy functions used in the ab initio prediction are at
present rather inaccurate. The folding problem remains one of the greatest challenges in
bioinformatics today.
Current ab initio algorithms are not yet able to accurately simulate the protein folding process.
They work by using some type of heuristics. Because the native state of a protein structure is
near energy minimum, the prediction programs are thus designed using the energy minimization
principle. These algorithms search for every possible conformation to find the one with the
lowest global energy. However, searching for a fold with the absolute minimum energy may not
be valid in reality. This contributes to one of the fundamental flaws of this approach. In addition,
searching for all possible structural conformations is not yet computationally feasible. Some
recent ab initio methods combine fragment search and threading to yield a model of an unknown
protein. The following web program is such an example using the hybrid approach.
Rosetta is a web server that predicts protein three-dimensional conformations using the ab initio
method. This in fact relies on a “mini-threading” method. The method first breaks down the
query sequence into many very short segments (three to nine residues) and predicts the
secondary structure of the small segments using a hidden Markov model–based program,
HMMSTR. The segments with assigned secondary structures are subsequently assembled into a
three-dimensional configuration. Through random combinations of the fragments, a large
number of models are built and their overall energy potentials calculated. The conformation with
the lowest global free energy is chosen as the best model.
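
The idea of building many models from random fragment combinations and keeping the lowest-energy one can be sketched in Python. This toy does not reproduce Rosetta: fragments are reduced to single secondary structure labels, `junction_energy` is an invented scoring function, and the sample count and seed are arbitrary.

```python
import random

def junction_energy(model):
    """Toy energy: penalise state changes between segments and coil states."""
    changes = sum(1.0 for a, b in zip(model, model[1:]) if a != b)
    coils = sum(0.5 for state in model if state == "C")
    return changes + coils

def assemble(candidates, energy_fn, n_models=500, seed=0):
    """Sample random fragment combinations and keep the lowest-energy model."""
    rng = random.Random(seed)
    best_model, best_energy = None, float("inf")
    for _ in range(n_models):
        model = tuple(rng.choice(options) for options in candidates)
        e = energy_fn(model)
        if e < best_energy:
            best_model, best_energy = model, e
    return best_model, best_energy

# Hypothetical candidate states for three consecutive segments
candidates = [["H", "E", "C"], ["H", "C"], ["H", "E", "C"]]
model, energy = assemble(candidates, junction_energy)
print(model, energy)
```

Because the search is random, the returned model is the best of the sampled combinations and is not guaranteed to be the global minimum.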
CASP
Discussion of protein structural prediction would not be complete without mentioning CASP
(Critical Assessment of Techniques for Protein Structure Prediction). With so many protein
structure prediction programs available, there is a need to know the reliability of the prediction
methods. For that purpose, a common benchmark is needed to measure the accuracies of the
prediction methods. CASP provides this benchmark: it allows developers to predict unknown
protein structures through blind testing so that the reliability of new prediction methods can be
objectively evaluated.
Structure Visualisation – RASMOL and Swiss PDB Viewer

RasMol is a molecular graphics program intended for the visualisation of proteins, nucleic acids
and small molecules. The program is aimed at display, teaching and generation of publication
quality images. RasMol runs on a wide range of architectures and operating systems including
Microsoft Windows, Apple Macintosh, UNIX and VMS systems. UNIX and VMS versions
require an 8, 24 or 32 bit colour X Windows display (X11R4 or later). The X Windows version
of RasMol provides optional support for a hardware dials box and accelerated shared memory
communication (via the XInput and MIT-SHM extensions) if available on the current X Server.
The program reads in a molecule coordinate file and interactively displays the molecule on the
screen in a variety of colour schemes and molecule representations. Currently available
representations include depth-cued wireframes, sticks, spacefilling (CPK) spheres, ball and stick,
solid and strand biomolecular ribbons, atom labels and dot surfaces.
Up to 5 molecules may be loaded and displayed at once. Any one or all of the molecules may be
rotated and translated.
Supported input file formats include
Protein Data Bank (PDB), Sybyl Mol2 formats, Molecular Design Limited's (MDL) Mol file
format, CHARMm format, CIF format and mmCIF format files. If connectivity information is
not contained in the file this is calculated automatically. The loaded molecule can be shown as
wireframe bonds, stick bonds, alpha-carbon trace, space-filling (CPK) spheres, macromolecular
ribbons (either smooth shaded solid ribbons or parallel strands), hydrogen bonding and dot
surface representations. Atoms may also be labelled with arbitrary text strings. Alternate
conformers and multiple NMR models may be specially coloured and identified in atom labels.
Different parts of the molecule may be represented and coloured independently of the rest of the
molecule or displayed in several representations simultaneously. The displayed molecule may be
rotated, translated, zoomed and z-clipped (slabbed) interactively using the mouse, the scroll bars,
the command line or an attached dial box. RasMol can read a prepared list of commands from a
'script' file (or via inter-process communication) to allow a given image or viewpoint to be
restored quickly. RasMol can also create a script file containing the commands required to
regenerate the current image. Finally, the rendered image may be written out in a variety of
formats including either raster or vector PostScript, GIF, PPM, BMP, PICT, Sun rasterfile or as a
MolScript input script or Kinemage.
The RasMol help facility can be accessed by typing "help <topic>" or "help <topic> <subtopic>"
from the command line. A complete list of RasMol commands may be displayed by typing "help
commands". A single question mark may also be used to abbreviate the keyword "help".
Backbone
The RasMol 'backbone' command permits the representation of a polypeptide backbone as a
series of bonds connecting the adjacent alpha carbons of each amino acid in a chain. The display
of these backbone 'bonds' is turned on and off by the command parameter in the same way as
with the 'wireframe' command. The command 'backbone off' turns off the selected 'bonds', and
'backbone on' or with a number turns them on. The number can be used to specify the cylinder
radius of the representation in either Ångstrom or RasMol units. A parameter value of 500 (2.0
Ångstroms) or above results in a "Parameter value too large" error. Backbone objects may be
coloured using the RasMol 'colour backbone' command.
The reserved word backbone is also used as a predefined set ("help sets") and as a parameter to
the 'set hbond' and 'set ssbond' commands. The RasMol command 'trace' renders a smoothed
backbone, in contrast to 'backbone' which connects alpha carbons with straight lines.
Background
The RasMol 'background' command is used to set the colour of the "canvas" background. The
colour may be given as either a colour name or a comma separated triple of Red, Green and Blue
(RGB) components enclosed in square brackets. Typing the command 'help colours' will give
a list of the predefined colour names recognised by RasMol.
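
Since RasMol can read commands from a script file, a short script can be generated programmatically. The Python sketch below builds such a script from the 'background', 'backbone' and 'colour backbone' commands described above. The helper names are hypothetical, and the conversion factor of 250 RasMol units per Ångstrom is inferred from the statement that 500 units equal 2.0 Ångstroms.

```python
UNITS_PER_ANGSTROM = 250   # inferred: 500 RasMol units correspond to 2.0 Angstroms
MAX_RASMOL_UNITS = 500     # values of 500 or above are rejected by RasMol

def backbone_command(radius_angstrom):
    """Build a 'backbone <number>' command with the radius in RasMol units."""
    units = round(radius_angstrom * UNITS_PER_ANGSTROM)
    if not 0 < units < MAX_RASMOL_UNITS:
        raise ValueError("radius must be above 0 and below 2.0 Angstroms")
    return f"backbone {units}"

def build_script(radius_angstrom, background_rgb, backbone_colour="red"):
    """Return script text that sets the background and draws a backbone."""
    r, g, b = background_rgb
    lines = [
        f"background [{r},{g},{b}]",          # RGB triple in square brackets
        backbone_command(radius_angstrom),
        f"colour backbone {backbone_colour}",
    ]
    return "\n".join(lines)

script = build_script(0.4, (255, 255, 255))
print(script)
```

The resulting text can be saved to a file and loaded as a RasMol script to restore the same view.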
Bond
The RasMol command 'bond <number> <number> +' adds the designated bond to the drawing,
increasing the bond order if the bond already exists. The command 'bond <number> <number>
pick' selects the two atoms specified by the atom serial numbers as the two ends of a bond
around which the 'rotate bond <angle>' command will be applied. If no bond exists, it is
created.
Rotation around a previously picked bond may be specified by the 'rotate bond <angle>'
command, or may also be controlled with the mouse, using the 'bond rotate on/off' or the
equivalent 'rotate bond on/off' commands.
Cartoon
The RasMol 'cartoon' command displays molecular 'ribbons' as Richardson
(MolScript) style protein 'cartoons', implemented as thick (deep) ribbons. The easiest way to
obtain a cartoon representation of a protein is to use the 'Cartoons' option on the 'Display'
menu. The 'cartoon' command represents the currently selected residues as a deep ribbon with
width specified by the command's argument. Using the command without a parameter results in
the ribbon's width being taken from the protein's secondary structure, as described in the
'ribbons' command. By default, the C-termini of beta-sheets are displayed as arrow heads. This
may be enabled and disabled using the 'set cartoons' command. The depth of the cartoon may
be adjusted using the 'set cartoons <number>' command. The 'set cartoons' command
without any parameters returns these two options to their default values.

Centre
The RasMol 'centre' command defines the point about which the 'rotate' command and the
scroll bars rotate the current molecule. Without a parameter the centre command resets the centre
of rotation to be the centre of gravity of the molecule. If an atom expression is specified, RasMol
rotates the molecule about the centre of gravity of the set of atoms specified by the expression.
Hence, if a single atom is specified by the expression, that atom will remain 'stationary' during
rotations.
Type 'help expression' for more information on RasMol atom expressions.
Colour
Colour the atoms (or other objects) of the selected region. The colour may be given as either a
colour name or a comma separated triple of Red, Green and Blue (RGB) components enclosed in
square brackets. Typing the command 'help colours' will give a list of all the predefined colour
names recognised by RasMol.
Allowed objects are 'atoms', 'bonds', 'backbone', 'ribbons', 'labels', 'dots', 'hbonds', 'map', and
'ssbonds'. If no object is specified, the default keyword 'atom' is assumed. Some colour schemes
are defined for certain object types. The colour scheme 'none' can be applied to all objects except
atoms and dots, stating that the selected objects have no colour of their own, but use the colour of
their associated atoms (i.e. the atoms they connect). 'Atom' objects can also be coloured by
'amino', 'chain', 'charge', 'cpk', 'group', 'model', 'shapely', 'structure', 'temperature' or
'user'. Hydrogen bonds can also be coloured by 'type' and dot surfaces can also be coloured by
'electrostatic potential'. For more information type 'help colour <colour>'. Map
objects may be coloured by a specific colour or by nearest atom.
Amino Colours
The RasMol 'amino' colour scheme colours amino acids according to traditional amino acid
properties. The purpose of colouring is to identify amino acids in an unusual or surprising
environment. Polar residues, typically found on the outer parts of a protein, are given visible (bright)
colours and non-polar residues darker ones. Most colours are hallowed by tradition. This colour scheme is similar to the
'shapely' scheme.

Chain Colours
The RasMol 'chain' colour scheme assigns each macromolecular chain a unique colour. This
colour scheme is particularly useful for distinguishing the parts of a multimeric structure or
the individual 'strands' of a DNA chain. 'Chain' can be selected from the RasMol 'Colours' menu.
Charge Colours
The RasMol 'charge' colour scheme colour codes each atom according to the charge value stored
in the input file (or beta factor field of PDB files). High values are coloured in blue (positive)
and lower values coloured in red (negative). Rather than use a fixed scale this scheme determines
the maximum and minimum values of the charge/temperature field and interpolates from red to
blue appropriately. Hence, green cannot be assumed to mean 'no net charge'.
The difference between the 'charge' and 'temperature' colour schemes is that increasing
temperature values proceed from blue to red, whereas increasing charge values go from red to
blue. If the charge/temperature field stores reasonable values it is possible to use the RasMol 'colour
dots potential' command to colour code a dot surface (generated by the 'dots' command) by
electrostatic potential.
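The relative scaling described above can be sketched in Python. This is only an illustration: it assumes the interpolation runs through the hue spectrum (red, green, blue) in HSV space, which is why the midpoint comes out green, and RasMol's actual palette may differ. The function name and the 'reverse' flag are hypothetical.

```python
import colorsys


def relative_colour(value, vmin, vmax, reverse=False):
    """Map a value onto a red-to-blue spectrum scaled to [vmin, vmax].

    reverse=False: lowest value red, highest blue (as for 'charge').
    reverse=True:  lowest value blue, highest red (as for 'temperature').
    """
    if vmax == vmin:
        frac = 0.5  # a flat field has no range to interpolate over
    else:
        frac = (value - vmin) / (vmax - vmin)
    frac = min(1.0, max(0.0, frac))  # clamp to the observed range
    if reverse:
        frac = 1.0 - frac
    hue = frac * (240.0 / 360.0)  # hue 0 is red, hue 240 degrees is blue
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return [round(r * 255), round(g * 255), round(b * 255)]
```

If the charges in a file range from 0 to +2, an atom with charge +1 sits at the midpoint and is drawn green, which illustrates why green cannot be read as zero charge.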
CPK Colours
The RasMol 'cpk' colour scheme is based upon the colours of the popular plastic spacefilling
models which were developed by Corey and Pauling and later improved by Koltun. This colour
scheme colours 'atom' objects by the atom (element) type. This is the scheme conventionally
used by chemists. The assignment of the most commonly used element types to colours is given
below.
Note that except for green, white, blue, and orange, these colour names are not the ones
specified as "Predefined colours" in RasMol; thus, they can only be specified on the command
line as RGB triplets.
In the CPK colouring scheme, RasMol will attempt to assign a colour to each element from the
periodic table from a list of 16 colours (the colour codes listed are to help in understanding the
mapping and are not used by RasMol):
In the CPKnew colouring scheme, RasMol uses brighter colours:
For X-ray crystallographic models of proteins and nucleic acids (i.e. without hydrogens) the
display can be 'brightened' by converting the O, C, and N atoms from the RasMol default cpk
colors to "true red, white and blue" using RasMol's predefined colour scheme.

Group Colours
The RasMol 'group' colour scheme colour codes residues by their position in a macromolecular
chain. Each chain is drawn as a smooth spectrum from blue through green, yellow and orange to
red. Hence the N terminus of proteins and 5' terminus of nucleic acids are coloured blue and the C
terminus of proteins and 3' terminus of nucleic acids are drawn in red. If a chain has a large
number of heterogeneous molecules associated with it, the macromolecule may not be drawn in
the full 'range' of the spectrum. 'Group' can be selected from the RasMol 'Colours' menu.
Shapely Colours
The RasMol 'shapely' colour scheme colour codes residues by amino acid property. This scheme
is based upon Bob Fletterick's "Shapely Models". Each amino acid and nucleic acid residue is
given a unique colour. The 'shapely' colour scheme is used by David Bacon's Raster3D program.
This colour scheme is similar to the 'amino' colour scheme.
NMR Model Colours

The RasMol 'model' colour scheme codes each NMR model with a distinct colour. The NMR
model number is taken as a numeric value. High values are coloured in blue and lower values
coloured in red. Rather than use a fixed scale this scheme determines the maximum value of the
NMR model number and interpolates from red to blue appropriately.
Structure Colours
The RasMol 'structure' colour scheme colours the molecule by protein secondary structure.
Alpha helices are coloured magenta, [240,0,128], beta sheets are coloured yellow, [255,255,0],
turns are coloured pale blue, [96,128,255] and all other residues are coloured white. The
secondary structure is either read from the PDB file (HELIX, SHEET and TURN records), if
available, or determined using Kabsch and Sander's DSSP algorithm. The RasMol 'structure'
command may be used to force DSSP's structure assignment to be used.
Temperature Colours
The RasMol 'temperature' colour scheme colour codes each atom according to the anisotropic
temperature (beta) value stored in the PDB file. Typically this gives a measure of the
mobility/uncertainty of a given atom's position. High values are coloured in warmer (red) colours
and lower values in colder (blue) colours. This feature is often used to associate a "scale" value
[such as amino acid variability in viral mutants] with each atom in a PDB file, and colour the
molecule appropriately. The difference between the 'temperature' and 'charge' colour schemes is that
increasing temperature values proceed from blue to red, whereas increasing charge values go from red to
blue.
HBond Type Colours

The RasMol 'type' colour scheme applies only to hydrogen bonds, hence is used in the command
'colour hbonds type'. This scheme colour codes each hydrogen bond according to the distance
along a protein chain between hydrogen bond donor and acceptor. This schematic representation
was introduced by Belhadj-Mostefa and Milner-White. This representation gives a good insight
into protein secondary structure (hbonds forming alpha helices appear red, those forming sheets
appear yellow and those forming turns appear magenta).
Potential Colours
The RasMol 'potential' colour scheme applies only to dot surfaces, hence is used in the
command 'colour dots potential'. This scheme colours each currently displayed dot by the
electrostatic potential at that point in space. This potential is calculated using Coulomb's law
taking the temperature/charge field of the input file to be the charge associated with that atom.
This is the same interpretation used by the 'colour charge' command. Like the 'charge' colour
scheme low values are blue/white and high values are red.
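The Coulomb's law calculation mentioned above can be sketched as follows. The units are an assumption made for this example (charges in electron units, distances in Ångstroms, result in kcal/mol per unit charge), and the 'dielectric' parameter is hypothetical; the notes do not state how RasMol treats either.

```python
import math

# Coulomb constant for charges in e and distances in Angstroms,
# giving kcal/mol per unit charge. The choice of units is an
# assumption of this sketch.
COULOMB_KCAL = 332.0636


def potential_at(point, atoms, dielectric=1.0):
    """Sum the Coulomb potential at 'point' from a list of atoms.

    atoms: iterable of (x, y, z, charge) tuples, where 'charge' is the
    value read from the temperature/charge field of the input file.
    """
    total = 0.0
    for x, y, z, charge in atoms:
        r = math.dist(point, (x, y, z))
        if r == 0.0:
            continue  # skip an atom that sits exactly on the point
        total += COULOMB_KCAL * charge / (dielectric * r)
    return total
```

Each dot of a dot surface would be evaluated with a call such as potential_at(dot, atoms) and the result mapped to a colour.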
Connect
The RasMol 'connect' command is used to force RasMol to (re)calculate the connectivity of the
current molecule. If the original input file contained connectivity information, this is discarded.
The command 'connect false' uses a fast heuristic algorithm that is suitable for determining
bonding in large bio-molecules such as proteins and nucleic acids. The command 'connect true'
uses a slower, more accurate algorithm based upon covalent radii that is more suitable to small
molecules containing inorganic elements or strained rings. If no parameters are given, RasMol
determines which algorithm to use based on the number of atoms in the input file. If there are more
than 255 atoms, RasMol uses the faster implementation. This is the method used to determine
bonding, if necessary, when a molecule is first read in using the 'load' command.
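A minimal Python sketch of covalent-radius bonding and of the 255-atom rule described above. The radii are approximate textbook values and the 0.56 Ångstrom tolerance is an assumption of this example, not a value taken from these notes; RasMol's own tables may differ.

```python
import math

# Approximate covalent radii in Angstroms (illustrative values only).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.56  # assumed slack added to the sum of the two radii


def default_connect_command(n_atoms):
    """Pick the command RasMol is described as using when none is given."""
    return "connect false" if n_atoms > 255 else "connect true"


def find_bonds(atoms):
    """Return index pairs of atoms close enough to be covalently bonded.

    atoms: list of (element, (x, y, z)) tuples.
    """
    bonds = []
    for i in range(len(atoms)):
        elem_i, pos_i = atoms[i]
        for j in range(i + 1, len(atoms)):
            elem_j, pos_j = atoms[j]
            limit = COVALENT_RADII[elem_i] + COVALENT_RADII[elem_j] + TOLERANCE
            if math.dist(pos_i, pos_j) <= limit:
                bonds.append((i, j))
    return bonds
```

The all-pairs loop is quadratic in the number of atoms, which is consistent with reserving this kind of test for small molecules.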
Spacefill
The RasMol 'spacefill' command is used to represent all of the currently selected atoms as solid
spheres. This command is used to produce both union-of-spheres and ball-and-stick models of a
molecule. The command 'spacefill true' (the default) represents each atom as a sphere of van der
Waals radius. The command 'spacefill off' turns off the representation of the selected atoms as
spheres. The 'temperature' option sets the radius of each sphere to the value stored in its
temperature field. Zero or negative values have no effect and values greater than 2.0 are
truncated to 2.0. The 'user' option allows the radius of each sphere to be specified by additional
lines in the molecule's PDB file using Raster3D's COLOUR record extension.
The RasMol command 'cpk' is synonymous with the 'spacefill' command.
Depth
The RasMol 'depth' command enables, disables or positions the back-clipping plane of the
molecule. The program only draws those portions of the molecule that are closer to the viewer
than the clipping plane. Integer values range from zero at the very back of the molecule to 100
which is completely in front of the molecule. Intermediate values determine the percentage of the
molecule to be drawn.
This command interacts with the 'slab <value>' command, which clips to the front of a given z-
clipping plane.
Dots
The RasMol 'dots' command is used to generate a van der Waals' dot surface around the
currently selected atoms. Dot surfaces display regularly spaced points on a sphere of van der
Waals' radius about each selected atom. Dots that are 'buried' within the van der Waals'
radius of any other atom (selected or not) are not displayed. The command 'dots on' deletes any
existing dot surface and generates a dot surface around the currently selected atom set with a
default dot density of 100. The command 'dots off' deletes any existing dot surface. The dot
density may be specified by providing a numeric parameter between 1 and 1000. This value
approximately corresponds to the number of dots on the surface of a medium sized atom.
By default, the colour of each point on a dot surface is the colour of its closest atom at the time
the surface is generated. The colour of the whole dot surface may be changed using the 'colour
dots' command.
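The construction of a dot surface can be sketched in Python as below. Two simplifications are assumed: points are spread over each sphere with a golden-angle (Fibonacci) spiral, which is one common way to get near-regular spacing and is not necessarily what RasMol does, and every atom receives the same number of points regardless of its radius. All atoms passed in are treated as selected.

```python
import math


def sphere_points(n):
    """Return n roughly evenly spaced points on the unit sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - 2.0 * (k + 0.5) / n       # z runs from near +1 to near -1
        ring = math.sqrt(1.0 - z * z)        # radius of the circle at height z
        theta = golden_angle * k
        points.append((ring * math.cos(theta), ring * math.sin(theta), z))
    return points


def dot_surface(atoms, density=100):
    """Return surface dots for atoms given as (centre, vdw_radius) pairs.

    A dot is dropped when it lies inside the van der Waals sphere of
    any other atom, i.e. when it is 'buried'.
    """
    unit = sphere_points(density)
    dots = []
    for i, (centre, radius) in enumerate(atoms):
        for ux, uy, uz in unit:
            dot = (centre[0] + radius * ux,
                   centre[1] + radius * uy,
                   centre[2] + radius * uz)
            buried = any(
                math.dist(dot, other_centre) < other_radius
                for j, (other_centre, other_radius) in enumerate(atoms)
                if j != i
            )
            if not buried:
                dots.append(dot)
    return dots
```

An isolated atom keeps all of its dots, while two overlapping atoms lose the dots that fall inside each other's spheres.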
HBonds
The RasMol 'hbond' command is used to represent the hydrogen bonding of the protein
molecule's backbone. This information is useful in assessing the protein's secondary structure.
Hydrogen bonds are represented as either dotted lines or cylinders between the donor and
acceptor residues. The first time the 'hbond' command is used, the program searches the
structure of the molecule to find hydrogen bonded residues and reports the number of bonds to
the user. The command 'hbonds on' displays the selected 'bonds' as dotted lines, and 'hbonds
off' turns off their display. The colour of hbond objects may be changed by the 'colour hbond'
command. Initially, each hydrogen bond has the colours of its connected atoms.
By default the dotted lines are drawn between the accepting oxygen and the donating nitrogen.
By using the 'set hbonds' command the alpha carbon positions of the appropriate residues may
be used instead. This is especially useful when examining proteins in backbone representation.
Label
The RasMol 'label' command allows an arbitrary formatted text string to be associated with each
currently selected atom. This string may contain embedded 'expansion specifiers' which display
properties of the atom being labelled. An expansion specifier consists of a '%' character followed
by a single alphabetic character specifying the property to be displayed (similar to C's printf
syntax). An actual '%' character may be displayed by using the expansion specifier '%%'.
Atom labelling for the currently selected atoms may be turned off with the command 'label off'.
By default, if no string is given as a parameter, RasMol uses labels appropriate for the current
molecule. RasMol uses the label '%n%r:%c.%a' if the molecule contains more than one chain,
'%e%i' if the molecule has only a single residue (a small molecule) and '%n%r.%a' otherwise.
The colour of each label may be changed using the 'colour label' command. By default, each
label is drawn in the same colour as the atom to which it is attached. The size and spacing of the
displayed text may be changed using the 'set fontsize' command.
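How expansion specifiers turn a format string into a label can be sketched in Python. The meanings given to the letters here (a = atom name, c = chain, e = element, i = atom serial number, n = residue name, r = residue number) are assumed from the default label strings above and should be checked against the RasMol manual; the dictionary keys are hypothetical names for this example.

```python
# Assumed meaning of each specifier letter (see the lead-in).
SPECIFIERS = {
    "a": "atom_name",
    "c": "chain",
    "e": "element",
    "i": "serial",
    "n": "res_name",
    "r": "res_number",
}


def expand_label(fmt, atom):
    """Expand '%x' specifiers in fmt using the properties in 'atom'."""
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt):
            code = fmt[i + 1]
            if code == "%":
                out.append("%")                    # '%%' is a literal percent
            elif code in SPECIFIERS:
                out.append(str(atom[SPECIFIERS[code]]))
            else:
                out.append("%" + code)             # unknown specifier: keep as is
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def default_format(n_chains, n_residues):
    """Choose the default label string as described in the notes."""
    if n_chains > 1:
        return "%n%r:%c.%a"
    if n_residues == 1:
        return "%e%i"
    return "%n%r.%a"
```

With this reading, the alpha carbon of serine 70 in chain A of a multi-chain protein would be labelled SER70:A.CA.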
Map
The RasMol 'map' commands manipulate electron density maps in coordination with the display
of molecules. These commands are very memory intensive and may not work on machines with
limited memory. Each molecule may have as many maps as available memory permits. Maps
may be read from files or generated from Gaussian density distributions around atoms.
Molecule
The RasMol 'molecule' command selects one of up to 5 previously loaded molecules for active
manipulation. While all the molecules are displayed and may be rotated collectively, only one
molecule at a time is active for manipulation by the commands which control the details of
rendering.
Monitor
The RasMol 'monitor' command allows the display of distance monitors. A distance monitor is a
dashed (dotted) line between an arbitrary pair of atoms, optionally labelled by the distance
between them. The RasMol command 'monitor <number> <number>' adds such a distance
monitor between the two atoms specified by the atom serial numbers given as parameters.
Distance monitors may also be added to a molecule interactively with the mouse, using the 'set
picking monitor' command. Clicking on an atom results in its being identified on the RasMol
command line. In addition every atom picked increments a modulo counter such that, in monitor
mode, every second atom displays the distance between this atom and the previous one. The shift
key may be used to form distance monitors between a fixed atom and several consecutive
positions. A distance monitor may also be removed (toggled) by selecting the appropriate pair of
atom end points a second time.
Restrict
The RasMol 'restrict' command both defines the currently selected region of the molecule and
disables the representation of (most of) those parts of the molecule no longer selected. All
subsequent RasMol commands that modify a molecule's colour or representation affect only the
currently selected region. The parameter of a 'restrict' command is a RasMol atom expression
that is evaluated for every atom of the current molecule. This command is very similar to the
RasMol 'select' command, except 'restrict' disables the 'wireframe', 'spacefill' and 'backbone'
representations in the non-selected region.

Ribbons
The RasMol 'ribbons' command displays the currently loaded protein or nucleic acid as a
smooth solid "ribbon" surface passing along the backbone of the protein. The ribbon is drawn
between each amino acid whose alpha carbon is currently selected. The colour of the ribbon is
changed by the RasMol 'colour ribbon' command. If the current ribbon colour is 'none' (the
default), the colour is taken from the alpha carbon at each position along its length.
The width of the ribbon at each position is determined by the optional parameter in the usual
RasMol units. By default the width of the ribbon is taken from the secondary structure of the
protein or a constant value of 720 (2.88 Ångstroms) for nucleic acids. The default width of
protein alpha helices and beta sheets is 380 (1.52 Ångstroms) and 100 (0.4 Ångstroms) for turns
and random coil. The secondary structure assignment is either from the PDB file or calculated
using the DSSP algorithm as used by the 'structure' command. This command is similar to the
RasMol command 'strands' which renders the biomolecular ribbon as parallel depth-cued
curves.
Rotate
Rotate the molecule about the specified axis. Permitted values for the axis parameter are "x", "y"
and "z". The integer parameter states the angle in degrees for the structure to be rotated. For the
X and Y axes, positive values move the closest point up and right, and negative values move it
down and left, respectively. For the Z axis, a positive rotation acts clockwise and a negative
angle anti-clockwise.
Script
The RasMol 'script' command reads a set of RasMol commands sequentially from a text file and
executes them. This allows sequences of commonly used commands to be stored and performed
by a single command. A RasMol script file may contain a further script command up to a
maximum "depth" of 10, allowing complicated sequences of actions to be executed.
Scripts may also be created with a text editor.
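Because a script is plain text with one command per line, it can also be produced by a short program. The Python sketch below assembles a small script from commands described in these notes; the particular sequence is only an example, and the helper name is hypothetical.

```python
def build_script(commands):
    """Join RasMol commands into script text, one command per line."""
    return "\n".join(commands) + "\n"


# Example view: white canvas, cartoon representation coloured by
# secondary structure. The choice of commands is illustrative.
EXAMPLE_COMMANDS = [
    "background [255,255,255]",
    "select all",
    "wireframe off",
    "cartoon",
    "colour structure",
]

script_text = build_script(EXAMPLE_COMMANDS)
```

Saving script_text to a file (for instance view.scr, a name chosen for this example) and typing 'script view.scr' at the RasMol prompt would then run the five commands in order.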

Select
Define the currently selected region of the molecule. All subsequent RasMol commands that
manipulate a molecule or modify its colour or representation only affect the currently selected
region. The parameter of a 'select' command is a RasMol expression that is evaluated for every
atom of the current molecule. The currently selected (active) region of the molecule consists of those
atoms that cause the expression to evaluate true. To select the whole molecule use the RasMol
command 'select all'. The behaviour of the 'select' command without any parameters is
determined by the RasMol 'hetero' and 'hydrogen' parameters.
Slab
The RasMol 'slab' command enables, disables or positions the z-clipping plane of the molecule.
The program only draws those portions of the molecule that are further from the viewer than the
slabbing plane. Integer values range from zero at the very back of the molecule to 100 which is
completely in front of the molecule. Intermediate values determine the percentage of the
molecule to be drawn.
SSBonds
The RasMol 'ssbonds' command is used to represent the disulphide bridges of the protein
molecule as either dotted lines or cylinders between the connected cysteines. The first time that
the 'ssbonds' command is used, the program searches the structure of the protein to find half-
cysteine pairs (cysteines whose sulphurs are within 3 Ångstroms of each other) and reports the
number of bridges to the user. The command 'ssbonds on' displays the selected "bonds" as
dotted lines, and the command 'ssbonds off' disables the display of ssbonds in the currently
selected area. Selection of disulphide bridges is identical to normal bonds, and may be adjusted
using the RasMol 'set bondmode' command. The colour of disulphide bonds may be changed
using the 'colour ssbonds' command. By default, each disulphide bond has the colours of its
connected atoms.
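The search for disulphide bridges described above amounts to a distance test between cysteine sulphur atoms. A minimal Python sketch, assuming the sulphur coordinates have already been extracted from the structure (the input layout and function name are hypothetical):

```python
import math


def find_disulphides(sulphurs, cutoff=3.0):
    """Return pairs of cysteines whose sulphurs lie within 'cutoff' Angstroms.

    sulphurs: list of (residue_label, (x, y, z)) for each cysteine sulphur.
    """
    bridges = []
    for i in range(len(sulphurs)):
        label_i, pos_i = sulphurs[i]
        for j in range(i + 1, len(sulphurs)):
            label_j, pos_j = sulphurs[j]
            if math.dist(pos_i, pos_j) <= cutoff:
                bridges.append((label_i, label_j))
    return bridges
```

The length of the returned list corresponds to the number of bridges that would be reported to the user.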
Stereo
The RasMol 'stereo' command provides side-by-side stereo display of images. Stereo viewing of
a molecule may be turned on (and off) either by selecting 'Stereo' from the 'Options' menu, or by
typing the commands 'stereo on' or 'stereo off'.
Strands
The RasMol 'strands' command displays the currently loaded protein or nucleic acid as a smooth
"ribbon" of depth-cued curves passing along the backbone of the protein. The ribbon is
composed of a number of strands that run parallel to one another along the peptide plane of each
residue. The ribbon is drawn between each amino acid whose alpha carbon is currently selected.
The colour of the ribbon is changed by the RasMol 'colour ribbon' command. If the current
ribbon colour is 'none' (the default), the colour is taken from the alpha carbon at each position
along its length. The central and outermost strands may be coloured independently using the
'colour ribbon1' and 'colour ribbon2' commands, respectively. The number of strands in the
ribbon may be altered using the RasMol 'set strands' command.
Structure
The RasMol 'structure' command calculates secondary structure assignments for the currently
loaded protein. If the original PDB file contained structural assignment records (HELIX, SHEET
and TURN) these are discarded. Initially, the hydrogen bonds of the current molecule are found,
if this hasn't been done already. The secondary structure is then determined using Kabsch and
Sander's DSSP algorithm. Once finished the program reports the number of helices, strands and
turns found.
Surface
The RasMol 'surface' command renders a Lee-Richards molecular surface resulting from rolling
a probe atom on the selected atoms. The value given specifies the radius of the probe. If given in
the first form, the evolute of the surface of the probe is shown (the solvent excluded surface). If
given in the second form, the envelope of the positions of the center of the probe is shown (the
solvent accessible surface).

Wireframe
The RasMol 'wireframe' command represents each bond within the selected region of the
molecule as a cylinder, a line or a depth-cued vector. The display of bonds as depth-cued vectors
(drawn darker the further away from the viewer) is turned on by the command 'wireframe' or
'wireframe on'. The selected bonds are displayed as cylinders by specifying a radius either as an
integer in RasMol units or containing a decimal point as a value in Ångstroms. A parameter
value of 500 (2.0 Ångstroms) or above results in a "Parameter value too large" error. Bonds
may be coloured using the 'colour bonds' command. If the selected bonds involve atoms of
alternate conformers then the bonds are narrowed in the middle to a radius of 0.8 of the specified
radius (or to the radius specified as the optional second parameter).
Zoom
Change the magnification of the currently displayed image. Boolean parameters either magnify
or reset the scale of current molecule. An integer parameter specifies the desired magnification
as a percentage of the default scale. The minimum parameter value is 10; the maximum
parameter value is dependent upon the size of the molecule being displayed. For medium sized
proteins this is about 500.

SWISS PDB VIEWER

I. OVERVIEW
DeepView, the Swiss-PdbViewer (or SPDBV), is an interactive molecular graphics program
for viewing and analyzing protein and nucleic acid structures. In combination with Swiss-Model
(a server for automated comparative protein modeling), new protein structures can also be modeled.
To facilitate understanding of the following chapters, some essential terms are introduced here:
A molecular coordinate file (e.g. *.pdb, *.mmCIF, etc.) is a text file containing, amongst other
information, the atom coordinates of one or several molecules. It can be opened from a local
directory or imported from a remote server by entering its PDB accession code. The content of
one coordinate file is loaded in one (or more) layers; the first one will be referred to as the
"reference layer".
DeepView can simultaneously display several layers, and this constitutes a project. When
working on projects, the layer that is currently governed by the Control Panel is called the
currently active layer.
Each molecule is composed of groups, which can be amino acids, hetero-groups, water
molecules, etc. and each group is composed of atoms.
Non-coordinate files contain specific information other than atom coordinates. Molecular
surfaces, electrostatic potential maps, and electron density maps are examples of non-
coordinate files, which can either be computed by DeepView, or loaded from specialized
external programs.
II. WORKING ENVIRONMENT
DeepView can display up to eight interconnected interactive windows.
1. Graphic window:
It is used to visualize loaded molecules, which can be rotated, translated and zoomed.
Display of the coordinate axis is optional. Molecular surfaces, electrostatic potential maps, and
electron density maps can also be displayed on the Graphic window.
2. Control panel:
This table-like window is for controlling the visual representation of the currently active
layer. It lets you enable the display of backbones, side chains, labels, molecular surfaces, and
ribbons for each group; and set the colors for the different objects on display.
3. Toolbar:
Contains the menus and tools of the program.
4. Layers Info window:
This table-like window is for controlling the display of individual layers.
You can toggle on and off the visualization and movement of layers, and enable the display of
certain objects (e.g. H-bonds or water molecules), for each layer.
5. Alignment window:
Shows the amino-acid sequence of loaded proteins in one-letter abbreviations. This window
is used to compare and to align sequences of two or more proteins. During homology modeling,
it allows correcting the alignment of target sequences onto the templates.
6. Ramachandran plot window:
Displays a Ramachandran plot.
Each dot represents the phi/psi backbone angles of one amino acid residue.
Ramachandran plots are used to judge the quality of a model, by finding residues whose
conformational angles lie outside allowed regions.
7. Surface and cavity window:
Gives the surface (Å²) and volume (Å³) of a molecule and its cavities.
This window can only be displayed if a molecular surface has been computed. It is mainly for
information purposes, but can also be used to center the view on specific cavities.
8. Electron Density Map Infos window:
This is a table-like window that lets you control the appearance of electron density maps and
electrostatic potential maps.
9. Text windows:
In addition to all previously described windows, you can open many windows for viewing
text files such as PDB files, energy reports, BLAST results, help texts, etc.
Text files cannot be edited or printed directly in DeepView. Please use any text editor for this
purpose.
Initiating a DeepView session means:
• displaying molecules by loading molecular coordinate files,
• optionally loading non-coordinate files such as molecular surfaces, electrostatic potential maps,
and electron density maps (molecular surfaces and electrostatic potential maps can also
be computed),
• displaying the required windows.

Loading molecular coordinate files

The File menu offers the following commands to load a molecular coordinate file. This can be a
PDB, mmCIF, or MOL file:
Loading non-coordinate files
The File menu offers the following commands to load a non-coordinate file:
Classification
The following basic DeepView commands are mainly for setting the visualization of molecules
by selecting, displaying, and coloring objects, as well as for analyzing molecules by measuring
distances and angles between atoms. They can be grouped according to their location:
Center the visible groups,
Translate, zoom, and rotate molecules,
Measure distances between atoms,
Measure bond angles,
Measure dihedral angles,
Identify groups and atoms,
Display/select groups within a distance of a picked atom,
Tools
- center the model on a picked atom
Edit commands
- edit the identification of a molecule
Select commands
- apply basic selections
- select groups by type
- select groups by property
- select groups by secondary structure
- select groups with respect to a reference
- select groups by distance
- select groups by structural criteria
Display commands
- show/hide various objects
- select various views for displaying a molecule
- set the style of labels placed by the Control Panel
- clear all labels placed by the tools
• Selecting groups by type

• Selecting groups by property
Basic Arg, Lys, His
Acidic Asp, Glu
Polar Asn, Gln, Ser, Thr, Tyr
non-Polar Ala, Cys, Gly, Ile, Leu, Met, Phe, Pro, Trp, Val
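The property classes listed above can be written as a lookup table. The Python sketch below mimics a selection by property over a list of residues; it only illustrates the classification and is not DeepView's implementation, and the function names are hypothetical.

```python
# Residue classes exactly as listed in the table above.
PROPERTY_CLASSES = {
    "Basic": ["Arg", "Lys", "His"],
    "Acidic": ["Asp", "Glu"],
    "Polar": ["Asn", "Gln", "Ser", "Thr", "Tyr"],
    "non-Polar": ["Ala", "Cys", "Gly", "Ile", "Leu",
                  "Met", "Phe", "Pro", "Trp", "Val"],
}

# Reverse lookup: three-letter code (upper case) -> class name.
RESIDUE_TO_CLASS = {
    residue.upper(): name
    for name, members in PROPERTY_CLASSES.items()
    for residue in members
}


def residue_class(residue):
    """Return the property class of a three-letter residue code, or None."""
    return RESIDUE_TO_CLASS.get(residue.upper())


def select_by_property(sequence, class_name):
    """Return 1-based positions of residues belonging to class_name."""
    return [
        position
        for position, residue in enumerate(sequence, start=1)
        if residue_class(residue) == class_name
    ]
```

For the sequence Met-Lys-Asp-Arg, selecting the Basic class picks positions 2 and 4.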
• Selecting groups by secondary structure
• Selecting groups with respect to a reference
The following commands presuppose that a structural alignment has been computed:
Command Action
• aa identical to ref.
Selects residues that are strictly conserved between the currently active layer and the
reference layer (first loaded).
• aa similar to ref. Selects similar residues between the currently active layer and the reference
layer (first loaded). By default, the PAM 200 matrix will be used, and the minimum score
needed to be considered similar can be modified in Preferences>Alignment.
• aa matching ref. structure
Selects residues of the currently active layer whose backbone has an RMS deviation from the
reference layer less than or equal to a certain threshold.
Selecting groups by distance
The three following commands prompt the previously described Display Radius dialog box,
which allows selecting groups on the Control Panel, or displaying groups on the Graphic
window, within a distance that you can specify. The dialog lets you extend a selection/display
around a previous selection/display, and includes an option to act on all layers.
Command Action
• Neighbors of selected aa
Selects/displays groups with at least one atom within the specified distance of any atom of
selected groups.
• Groups close to another chain
Selects/displays any group that is near any other group with a different chain ID. This command
is useful to highlight residues at the interface of two chains.

• Groups close to another layer

Selects/displays any group that is near any other group from a different layer. It applies to all
layers, and is useful when interacting chains have been loaded into separate layers.
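The rule behind 'Neighbors of selected aa' (at least one atom within the specified distance of any atom of the selected groups) can be sketched in Python. The data layout is an assumption for this example, and the sketch returns only the newly found groups, leaving out the selected groups themselves.

```python
import math


def neighbours_of_selection(groups, selected, distance):
    """Return names of unselected groups close to the selected ones.

    groups:   dict mapping a group name to a list of (x, y, z) atom positions.
    selected: set of group names that are currently selected.
    distance: cutoff in Angstroms.
    """
    selected_atoms = [atom for name in selected for atom in groups[name]]
    found = set()
    for name, atoms in groups.items():
        if name in selected:
            continue
        # One atom pair within the cutoff is enough to include the group.
        if any(math.dist(atom, ref) <= distance
               for atom in atoms for ref in selected_atoms):
            found.add(name)
    return found
```

The chain and layer variants follow the same pattern, with the comparison restricted to groups that carry a different chain ID or belong to a different layer.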
Selecting groups by structural criteria

Finally, use the five following commands to select groups according to specific structural
criteria.
Command Action
• Accessible aa Selects residues with an accessible surface area higher than a given
percentage, which you will be prompted for in a dialog.
• aa Making Clashes
Selects residues with atoms too close to atoms of other residues. Since van der Waals radii
are not assigned when files are loaded, DeepView looks for atoms that are closer than the
minimal H-bond distance (as set in Preferences>H bond detection threshold, when no
hydrogen atoms are present). A finer way to find clashes is to color the molecule by
force field energy: residues that have a high non-bonded energy (colored in red) are too close
to each other.
• aa Making Clashes with Backbone
Selects groups with at least one atom too close to the backbone of another group.
• Sidechain lacking Proper H-bonds
Selects those buried residues whose sidechain could make an H-bond or a salt-bridge, but
makes none.
Color menu, first block
• By CPK Colors the selected object by element type, using a default standard CPK
scheme: N=blue, O=red, C=white, H=cyan, P=orange, S=yellow, other=gray. This
command is only effective if backbones and/or sidechains are selected for coloring.
Default colors can be redefined in Preferences>Colors.
• By Type Colors the selected object by residue property: Acidic=red, Basic=blue,
Polar=yellow, and Non-Polar=gray. Default colors
can be redefined in Preferences>Colors.
• By RMS At least two proteins must have been loaded, superposed, and structurally
aligned. Each residue in the active layer will be colored according to its RMS backbone
deviation from the corresponding amino acid of the reference protein (the first loaded).
NOTE: Colors are mapped from a fixed linear scale, in which dark blue is for RMS = 0
Å, and red is for RMS = 5 Å. A relative scale can be selected in Preferences>General
where the best fit is dark blue and the worst fit is red.
• By B-Factor Colors sidechains and backbones, independently, according to their
respective largest B-factor per group. In the case of a model returned by Swiss-Model, the
B-factor column contains the Model Confidence Factor.
NOTE: The coloring gradient can be adjusted in Preferences>General to fit the range of
B factor values present in the structure.
• By Secondary Structure
Colors the selected object according to the three common secondary structure types:
Helix=red, Strand=yellow, and Coil =gray. Especially useful for coloring ribbon
drawings.
Default colors can be redefined in Preferences>Colors.
• By Secondary Struct. Succession
Produces a gradient along the polypeptide chain from the N-terminus (blue) to the
C-terminus (red). Each secondary structure element gets a single color, and random coils
are gray. Especially useful for coloring ribbon drawings.
Color menu, second block

• By Selection Colors selected residues in cyan and non-selected residues in dark gray.
Useful to quickly find where selected residues are located in the model. Default colors
can be redefined in Preferences>Colors.
• By Layer Each layer gets a single unique color. The layers are colored in order from the
first as: yellow, blue, green, red, gray, magenta, cyan, salmon, purple, light green, and
brown. The color succession is repeated for additional layers. Ideal for viewing
superposed structures.
• By Chain Colors each chain by a different color: yellow, blue, green, red, gray, magenta,
cyan, salmon, purple, light green, and brown. The color succession is repeated for
additional chains.

Color menu, third block

• By Alignment Diversity
At least two proteins must have been loaded, superposed, and structurally aligned. Applies a
blue-to-red color gradient to all layers, according to the degree of similarity among all
aligned residues. Blue indicates identical or very similar, and red indicates that residues have
dissimilar properties.
• By Accessibility Each group is colored by its relative accessibility. Colors range from dark
blue for completely buried amino acids, to red for residues with at least 75% of their
maximum surface exposure.
• By Threading Energy
Colors each residue of the protein according to its energy. Dark blue means that the threading
energy is low (the residue is happy with its environment), red means that the threading
energy is high (the residue is not happy with its environment).
• By Force Field Energy
Colors each residue according to its force field energy. Especially useful during refinement of
a model, as you can color by bond and angle deviations only, and this will identify distorted
parts of the protein.
• By Protein Problems
The backbone of residues whose phi/psi angles fall outside the favored regions of the
Ramachandran Plot is colored in yellow. The backbone of proline residues whose phi angle
deviates more than 25° from the ideal -65° value is colored in red. Buried sidechains of
residues that could make H-bonds but do not are colored in orange. Clashes are computed
and will appear as pink dotted lines.
Color menu, fourth block
• By Other Color Prompts you for a single color to be applied to the entire layer. It is
functionally equivalent to a shift-click on any color box of the Control Panel window.
• By Backbone, Sidechain, Ribbon, Surface, Label Color
WORKING ON A LAYER
Classification
Advanced commands that can be applied to a single layer can be grouped into four categories:
Modifying commands (modify the structure of molecules)
• Mutates amino acids
• Modifies torsion angles of selected groups
• Build>Build Loop, Build>Scan Loop Database, Build>Find best Fitting,
Build>Break/Ligate Backbone, Build>Add C-terminal oxygen, Tools>Set Omega/Phi/Psi,
Ramachandran Plot window: Modify the backbone (break/ligate it, alter conformational
angles, add OXT groups)
• Build>Add / Remove: Add/remove structural elements (bonds, hydrogen atoms, H-bonds)
• Tools>Fix Selected Sidechain: Re-orientates sidechains
• Tools>Randomize Selected Groups: Randomly translates all atoms of selected groups
• Edit>Assign Helix/Strand/Coil Type, Tools>Detect Secondary Structure: Alter the
visualization of the ribbon secondary structure
Searching commands
• Edit>Find Sequence, Edit>Find Next: Search a layer for segments that match a given
amino acid sequence
• Edit>Search for PROSITE pattern: Searches a layer for segments that match PROSITE
patterns
• Edit>BLAST Selection vs. SwissProt, Edit>BLAST Selection vs. ExPDB: Search protein
databases for homologous amino acid sequences
Computing commands
• Tools>Compute H-bonds: Computes H-bonds
• Tools>Compute Molecular Surface: Computes molecular surfaces
• Tools>Compute Electrost. Potential: Computes electrostatic potential maps
• Tools>Triangulate Maps: Triangulates maps
• Tools>Compute Energy (Threading)
• Tools>Compute Energy (Force Field)
• Tools>Energy Minimisation: Performs energy minimisations
Crystallographic commands
• Tools>Transl. Layer along Unit Cell: Translates a molecule along its unit cell
• Tools>Build Crystallogr. Symmetry: Applies crystallographic symmetries
• File>Open Electron Density Map: Loads and displays electron density maps
Superposing commands
Superposing two structures
Concept
Two given structures can be superposed in the Graphic window.
Examples of application
Superposing two molecules lets you compare their structures, for various purposes.
Procedure
The Fit menu offers three commands (Magic Fit, Iterative Magic Fit and Explore Alternate Fits)
to superpose a molecule onto another.
DeepView offers three modes to visualize a molecule in the Graphic window:
Mode Main display features
• Normal: Backbones, sidechains, ribbons, and molecular surfaces are rendered as wire frame.
Van der Waals and accessible surfaces are dotted. This is the fastest rendering mode (not
available for SGI and Linux versions).
• 3D-rendering: Renders molecules in solid 3D. Two 3D-rendering types are available: one
applies to ribbons and surfaces only, and the other renders the whole molecule in solid 3D.
• Stereoscopic: Allows visualizing molecules in real 3D. Depending on the characteristics of
your computer, up to three stereoscopic modes might be available.
Slab Display Mode

Click Display>Slab: this toggles the slab mode on and off. Slab mode delimits a slab of the
molecule parallel to the screen by hiding those groups that lie too far into or out of the screen.

PROTEIN STRUCTURE COMPARISON

With the visualization and computer graphics tools available, it becomes easy to observe and
compare protein structures. To compare protein structures is to analyze two or more protein
structures for similarity. The comparative analysis often, but not always, involves the direct
alignment and superimposition of structures in three-dimensional space to reveal which parts of
the structure are conserved and which parts differ at the three-dimensional level. Structure
comparison is one of the fundamental techniques in protein structure analysis. The comparative
approach is important in finding remote protein homologs. Because protein structures have a
much higher degree of conservation than the sequences, proteins can share common structures
even without sequence similarity. Thus, structure comparison can often reveal distant
evolutionary relationships between proteins, which is not feasible using the sequence-based
alignment approach alone. In addition, protein structure comparison is a prerequisite for protein
structural classification into different fold classes. It is also useful in evaluating protein
structure prediction methods by comparing theoretically predicted structures with
experimentally determined ones.
Intermolecular Method
The intermolecular approach is normally applied to relatively similar structures. To compare and
superpose two protein structures, one of the structures has to be moved with respect to the other
in such a way that the two structures have a maximum overlap in a three-dimensional space. This
procedure starts with identifying equivalent residues or atoms. After residue–residue
correspondence is established, one of the structures is moved laterally and vertically toward the
other structure, a process known as translation, to allow the two structures to be in the same
location (or same coordinate frame). The structures are further rotated relative to each other
around the three-dimensional axes, during which process the distances between equivalent
positions are constantly measured. The rotation continues until the shortest intermolecular
distance is reached. At this point, an optimal superimposition of the two structures is reached.
After superimposition, equivalent residue pairs can be identified, which helps to quantitate the
fitting between the two structures.
An important measurement of the structure fit during superposition is the distance between
equivalent positions on the protein structures. This requires using a least-squares fitting
function called root mean square deviation (RMSD), which is the square root of the averaged sum
of the squared distances between equivalent positions:

$$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} D_i^{2}}$$

where D_i is the distance between the i-th pair of coordinate data points and N is the total
number of corresponding residue pairs.
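The translation step and the RMSD measure described above can be sketched in a few lines of
Python. This is a minimal illustration, not a full superposition program: it assumes the
residue-residue correspondence is already known, it only performs the translation (moving both
centroids to the origin), and it omits the rotation search. The function names and the
coordinates are hypothetical.

```python
import math

def centroid(coords):
    """Return the mean x, y, z of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def translate_to_origin(coords):
    """Translation step: shift coordinates so their centroid sits at the origin."""
    cx, cy, cz = centroid(coords)
    return [(x - cx, y - cy, z - cz) for x, y, z in coords]

def rmsd(coords_a, coords_b):
    """RMSD over equivalent positions: sqrt((1/N) * sum of squared distances)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty, equally long coordinate lists")
    total = 0.0
    for a, b in zip(coords_a, coords_b):
        total += sum((p - q) ** 2 for p, q in zip(a, b))  # squared distance D_i^2
    return math.sqrt(total / len(coords_a))

# Hypothetical C-alpha coordinates (in Angstroms) for three equivalent residues.
target = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
template = [(10.0, 5.0, 1.0), (13.8, 5.0, 1.0), (17.6, 5.0, 2.0)]

before = rmsd(target, template)
after = rmsd(translate_to_origin(target), translate_to_origin(template))
print(f"RMSD before translation: {before:.2f} A, after: {after:.2f} A")
```

Bringing the two structures into the same coordinate frame already lowers the RMSD sharply in
this toy case. A real program would follow the translation with a rotation that minimizes the
same quantity.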
DALI is a structure comparison web server that uses the intramolecular distance method. It
works by maximizing the similarity of two distance graphs. The matrices are based on distances
between all Cα atoms for each individual protein. Two distance matrices are overlaid and moved
one relative to the other to identify most similar regions. DALI uses a statistical significance
value called a Z-score to evaluate structural alignment. The Z-score is the number of standard
deviations from the average score derived from the database background distribution. The higher
the Z-score when comparing a pair of protein structures, the less likely the similarity observed is
a result of random chance. Empirically, a Z-score > 4 indicates a significant level of structure
similarity. The web server is at the same time a database that contains Z-scores of all
precomputed structure pairs of proteins in PDB. The user can upload a structure to compare it
with all known
structures, or perform a pairwise comparison of two uploaded structures.
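The two ingredients mentioned above, a matrix of Cα-Cα distances for each protein and a Z-score
against a background distribution, can be sketched as follows. This is only an illustration
under stated assumptions: the coordinates and background scores are made up, and the simple
mean absolute difference used here is a stand-in, not DALI's actual scoring function.

```python
import math
import statistics

def distance_matrix(ca_coords):
    """All-against-all C-alpha distances within one protein."""
    return [[math.dist(a, b) for b in ca_coords] for a in ca_coords]

def mean_abs_difference(matrix_a, matrix_b):
    """Average absolute difference between two equally sized distance matrices."""
    n = len(matrix_a)
    total = sum(abs(matrix_a[i][j] - matrix_b[i][j])
                for i in range(n) for j in range(n))
    return total / (n * n)

def z_score(raw_score, background_scores):
    """Number of standard deviations between raw_score and the background mean."""
    mean = statistics.mean(background_scores)
    spread = statistics.pstdev(background_scores)
    return (raw_score - mean) / spread

# Hypothetical C-alpha coordinates; protein_b is protein_a moved elsewhere in space.
protein_a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
protein_b = [(10.0, 10.0, 10.0), (13.0, 10.0, 10.0), (13.0, 14.0, 10.0)]

matrix_a = distance_matrix(protein_a)
matrix_b = distance_matrix(protein_b)
difference = mean_abs_difference(matrix_a, matrix_b)

# Hypothetical raw similarity score and background distribution of scores.
z = z_score(20.0, [8.0, 10.0, 12.0])
print(f"matrix difference: {difference:.2f}, Z-score: {z:.2f}")
```

Because the matrices hold only distances inside each protein, moving a structure does not
change them, so no superposition is needed before the comparison.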
The Dali method uses a weighted sum of similarities of intra-molecular distances, which
correlates with expert classifications in the sense that the structures of homologous proteins
typically get higher similarity scores than the structures of evolutionarily unrelated proteins. This
property is useful to a biologist using structure comparison to learn more about her query
protein: the biologically informative neighbors are found at the top of the match list with
relatively few false leads.
In DaliLite v.3, new options for database searching (DaliLite –quick) and database updates
(DaliLite –update) are introduced. The new protocols improve server throughput and vastly
simplify the updates, making the complete system portable. The key change from earlier versions
is that the all-versus-all matrix of similarities is abandoned in favor of a connected graph of
similarities.
The nodes of the graph represent protein structures and edges represent structural alignments.
Whereas before each representative structure was directly linked to all its structurally similar
neighbors, it is now only required that there is a path of continuous structural similarity
through the graph. The structural neighbors of a query structure are collected by walks through
the graph.
Not only can the graph be less densely connected than the all-versus-all matrix, thus saving
computational effort, but there is also the added benefit that the incremental updates of the
structural similarity graph and the choice of structural representatives are completely decoupled.
Methods
PDB clustering
The PDB is highly redundant. A representative subset at the 90% sequence identity level
(PDB90), derived from the current set of PDB sequences, is used. The PDB contains over 100 000
structures (chains), which is reduced to about 20 000 PDB90 representatives.
Structural similarity graph
The structural similarity graph and alignment data are stored in a relational database (MySQL).
The graph is updated incrementally. If a new structure has strong similarity to structures already
in the graph, one edge is sufficient to connect the new structure to the graph in the proper
neighborhood. If there is no strong match, the new structure is compared to all existing
structures and edges are added for all significant similarities. Similarity is measured by Dali
Z-scores. 'Significant similarities' have a Z-score above 2; they usually correspond to similar
folds. 'Strong matches' have sequence identity above 20% or a Z-score above a cutoff that
depends on the size of the query protein. The Z-score cutoff was set to n/10 - 4, where n is the
number of residues in the query structure. A segment of the query structure longer than 80
residues without any structural matches always disqualifies a strong match.
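The thresholds above translate directly into a small decision function. The sketch below uses
only the numbers given in the text (Z-score above 2, identity above 20%, cutoff n/10 - 4, and
the 80-residue rule); the function and parameter names are hypothetical.

```python
def z_cutoff(n_residues):
    """Size-dependent Z-score cutoff for a strong match: n/10 - 4."""
    return n_residues / 10 - 4

def is_significant(z):
    """Significant similarities have a Z-score above 2."""
    return z > 2

def is_strong_match(z, seq_identity_pct, n_residues, longest_unmatched_segment):
    """Apply the strong-match rule described in the text."""
    # An unmatched query segment longer than 80 residues always disqualifies.
    if longest_unmatched_segment > 80:
        return False
    return seq_identity_pct > 20 or z > z_cutoff(n_residues)

# For a 200-residue query the Z-score cutoff is 200/10 - 4 = 16.
print(z_cutoff(200))
print(is_strong_match(z=18.0, seq_identity_pct=10.0, n_residues=200,
                      longest_unmatched_segment=30))
```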
Database searching
The database search option DaliLite –quick compares a query structure to all structures in the
PDB, as organized in the structural similarity graph. To initiate a transitive search of structures in
the graph, the query structure must be attached to some structural neighbors. Fast feature filters
are often successful in finding near neighbors. Currently, sequence comparison by BLAST, GTG
sequence motifs, and secondary structure triplets are implemented to rank the structures in
PDB90. Feature filter scores are converted to Z-scores in order to combine the ranked lists.
The top 100 structures are compared using the normal Dali procedures. If a strong match is
found, the search moves to the next step (transitive alignment). Otherwise, the query structure
is compared against all 20 000 structures in PDB90.
The entry points connect the query structure to one or more structures in the structural similarity
graph. These are direct (first shell) neighbors of the query. Structures in the second shell are
compared in batches of 100, selecting those with the strongest connections first. Connection
strength is the lesser Z-score along the path from the query to the first neighbor to the second
neighbor. The transitive alignment (via the first neighbor) between the query structure and the
second neighbor is used as the starting point for refinement, skipping the costly alignment
optimization from scratch. The expansion is repeated until the connection strength drops below
a Z-score cutoff of 2, or a maximum number of matches has been reported (default:
MAX_HITS = 500).
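The walk through the similarity graph can be sketched as a best-first expansion in which the
connection strength of a path is its weakest edge. This is a simplified illustration, not the
DaliLite implementation: it assumes the graph is a plain dictionary of stored Z-scores, it
skips the batching and the alignment refinement, and all names and values are hypothetical.

```python
import heapq

def collect_neighbors(graph, entry_points, z_cutoff=2.0, max_hits=500):
    """Return (structure, connection_strength) pairs, strongest first.

    graph maps a structure name to {neighbor: Z-score of the edge};
    entry_points maps first-shell neighbors to their Z-score with the query.
    """
    # heapq is a min-heap, so strengths are negated to pop the strongest first.
    heap = [(-z, name) for name, z in entry_points.items()]
    heapq.heapify(heap)
    reported = {}
    while heap and len(reported) < max_hits:
        neg_strength, name = heapq.heappop(heap)
        strength = -neg_strength
        if strength < z_cutoff:
            break  # everything still on the heap is at least as weak
        if name in reported:
            continue
        reported[name] = strength
        for neighbor, edge_z in graph.get(name, {}).items():
            if neighbor not in reported:
                # Connection strength is the lesser Z-score along the path.
                heapq.heappush(heap, (-min(strength, edge_z), neighbor))
    return list(reported.items())

# Hypothetical similarity graph; the query is attached to structure "A" with Z = 20.
graph = {
    "A": {"B": 12.0, "C": 3.0},
    "B": {"A": 12.0, "D": 1.5},
    "C": {"A": 3.0, "E": 8.0},
    "D": {"B": 1.5},
    "E": {"C": 8.0},
}
hits = collect_neighbors(graph, {"A": 20.0})
print(hits)
```

In this toy graph, structure "D" is never reported because the only path to it has a
connection strength of 1.5, which is below the cutoff of 2.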
VAST (Vector Alignment Search Tool) is a web server that performs alignment using both the
inter- and intramolecular approaches. The superposition is based on information of directionality
of secondary structural elements (represented as vectors). Optimal alignment between two
structures is defined by the highest degree of vector matches.
Protein structure neighbors in Entrez are determined by direct comparison of 3D protein
structures with the Vector Alignment Search Tool (VAST) algorithm. Each of the more than 87,000
domains and complete protein chains in MMDB is compared to every other one. Entrez can list
structure neighbors; however, VAST Structure Neighbors pages provide further information and
displays of structure superpositions and structure-based alignments.
VAST pages begin with a brief text description of the query domain, including PubMed links.
The precomputed structure neighbors, ranked by a selected similarity measure, are displayed
below in a graphic or table. Individual 3D superpositions can be selected by clicking check boxes
and viewed in Cn3D. The corresponding sequence alignments can be displayed in HTML, text,
and FASTA formats. The "Find" feature is convenient for looking for particular structure
neighbors, where the user wants to specify a particular identifier.
VAST similarity measures
All of the similarity measures for each structure neighbor detected by VAST can be listed in a
table to facilitate the examination of VAST results. The table includes the following columns:
• Aligned Length: The number of equivalent pairs of C-alpha atoms superimposed
between the two structures, i.e. how many residues have been used to calculate the 3D
superposition.
• SCORE: The VAST structure-similarity score. This number is related to the number of
secondary structure elements superimposed and the quality of that superposition. Higher
VAST scores correlate with higher similarity.
• P-VAL: The VAST p value is a measure of the significance of the comparison, expressed
as a probability. For example, if the p value is 0.001, then the odds are 1000 to 1 against
seeing a match of this quality by pure chance. The p value from VAST is adjusted for the
effects of multiple comparisons using the assumption that there are 500 independent and
unrelated types of domains in the MMDB database. The p value shown thus corresponds
to the p value for the pairwise comparison of each domain pair, divided by 500.
• RMSD: The root mean square superposition residual in Angstroms. This number is
calculated after optimal superposition of two structures, as the square root of the mean
square distances between equivalent C-alpha atoms. Note that the RMSD value scales
with the extent of the structural alignments and that this size must be taken into
consideration when using RMSD as a descriptor of overall structural similarity.
• %Id: Percent identical residues in the aligned sequence region. This is a raw measure of
sequence similarity in the parts of the proteins that have been superimposed.
• LHM: Loop Hausdorff Metric. A Loop Similarity measure that shows how well two
structures conform to each other in the loop regions, after structural superposition. The
"loop regions" are the parts of the structures between aligned secondary structure
elements (helices and strands). LHM is measured in Angstroms, with a smaller value
indicative of greater similarity. The loop similarity may be undefined (indicated by 'NA')
if there are too many residues with missing coordinates in the loops.
• GSP: Gapped Score. A combination (algebraic) score that uses RMSD, aligned length,
and the number of gapped regions in the alignment. A smaller gapped score correlates
with greater similarity.

Structure Prediction Flowchart
Base-calling
Base-calling converts raw or processed data from a sequencing instrument into
sequences and quality scores.
All currently available commercial next-gen sequencing platforms use optical
detection and CCD cameras. Images are the raw data. Base-calling usually refers to the
conversion of intensity data into sequences and quality scores. Intensity information is
extracted from images by the image analysis.
Images are analysed close to the instrument (on the instrument control PC), then base-calls
are transferred to a secondary analysis server.

Image analysis identifies clusters and extracts intensity traces:
• Detection: Find all clusters on the image
• Registration: Track clusters over multiple sequencing cycles
• Extraction: Provide intensity estimates for clusters in a given image

Base-calling has two aspects: identifying the base-call and assigning a confidence estimate to
the call.
Making a base-call is usually based on the intensity estimates. Signal processing needs to
correct for confounding factors:
• Frequency cross-talk (optical detection mechanism)
• Phasing effects (imperfect chemistry)
• Signal decay
Assignment of a confidence estimate or quality score is vital for downstream analysis. The
phred method can be extended to next-gen technologies.

Quality scores quantify the probability that a base-call is correct (or wrong).
Terminology:
• Base quality scores: Individual bases have quality scores which reflect the likelihood of
the base being correct/incorrect
• Alignment scores: Probability that an alignment to a given position in the reference genome
is correct
• Allele scores, SNP scores, …: Probability that a given allele or SNP was observed (often
conditional on the alignment being correct)
Base and alignment scores are single-read scores; SNP scores are consensus scores. Consensus
calls use information from multiple reads.

Phred scores
A base quality score assigned by the phred software (or a program based on phred).
A quality score expressed on a logarithmic scale: Q = -10 log10(probability of an error)
Example: Q20 = 1% error probability
The Phred method assigns quality scores to a base-call based on observed
properties of the base (predictors)
Phred is a two-step process:
Training: Given a set of reads, labels as to which bases are correct, and a set of
quality statistics for each base, produce a model that can predict error rates for
unseen bases
Application: Given new reads and quality statistics, predict the quality for each of
the bases.
Phred is essentially a big lookup table!
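The logarithmic scale and the lookup-table idea can be illustrated with a short Python sketch.
The conversion functions follow the formula given above. The training function is a deliberately
simplified assumption: it uses a single hypothetical predictor bin per base and add-one
smoothing, whereas the real phred software combines several predictors.

```python
import math
from collections import defaultdict

def phred_quality(p_error):
    """Q = -10 * log10(probability of an error)."""
    return -10 * math.log10(p_error)

def error_probability(q):
    """Inverse of the formula above: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def train_lookup(observations):
    """Training step: build a {predictor_bin: Q} table from labelled bases.

    observations is an iterable of (predictor_bin, was_correct) pairs.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for bin_id, was_correct in observations:
        totals[bin_id] += 1
        if not was_correct:
            errors[bin_id] += 1
    # Add-one smoothing (an assumption) avoids log10(0) for bins with no errors.
    return {b: phred_quality((errors[b] + 1) / (totals[b] + 2)) for b in totals}

# Hypothetical labelled training data: two signal-quality bins.
observations = ([("high_signal", True)] * 98
                + [("low_signal", True)] * 8
                + [("low_signal", False)] * 2)
table = train_lookup(observations)

# Application step: look up the quality for a new base by its predictor bin.
print(round(phred_quality(0.01)), round(table["high_signal"]), round(table["low_signal"]))
```

Once trained, assigning a quality score to a new base is a dictionary lookup, which is the sense
in which the method behaves like a lookup table.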




Pruthvish R, Department of Biotechnology, Acharya Institute of Technology, Bangalore-560107
