
Introduction

Computational Biology
Computational biology is the combined application of mathematics, statistics and computer science to solve biology-based problems, in areas such as genetics, evolution, cell biology and biochemistry.

Underpinnings Of Computational Biology


The beginnings of computational biology essentially date to the origins of computer science.
British mathematician and logician Alan Turing, often called the father of computing, used
early computers to implement a model of biological morphogenesis (the development of
pattern and form in living organisms) in the early 1950s, shortly before his death. At about the
same time, a computer called MANIAC, built at the Los Alamos National Laboratory in New
Mexico for weapons research, was applied to such purposes as modeling hypothesized genetic
codes. (Pioneering computers had been used even earlier in the 1950s for numeric calculations
in population genetics, but the first instances of authentic computational modeling in biology
were the work by Turing and by the group at Los Alamos.)
By the 1960s, computers had been applied to deal with much more-varied sets of analyses,
namely those examining protein structure. These developments marked the rise of
computational biology as a field, and they originated from studies centred on
protein crystallography, in which scientists found computers indispensable for carrying out
laborious Fourier analyses to determine the three-dimensional structure of proteins.
Starting in the 1950s, taxonomists began to incorporate computers into their work, using the
machines to assist in the classification of organisms by clustering them based on similarities of
sets of traits. Such taxonomies have been useful particularly for phylogenetics (the study of
evolutionary relationships). In the 1960s, when existing techniques were extended to the level
of DNA sequences and amino acid sequences of proteins and combined with a burgeoning
knowledge of cellular processes and protein structures, a whole new set of computational
methods was developed in support of molecular phylogenetics. These computational methods
entailed the creation of increasingly sophisticated techniques for the comparison of strings of
symbols that benefited from the formal study of algorithms and the study
of dynamic programming in particular. Indeed, efficient algorithms always have been of
primary concern in computational biology, given the scale of data available, and biology has
in turn provided examples that have driven much advanced research in computer science.

M. P. Garud Page 1
Examples include graph algorithms for genome mapping (the process of locating fragments of
DNA on chromosomes) and for certain types of DNA and peptide sequencing methods,
clustering algorithms for gene expression analysis and phylogenetic reconstruction, and pattern
matching for various sequence search problems.
Beginning in the 1980s, computational biology drew on further developments in computer
science, including a number of aspects of artificial intelligence (AI). Among these were
knowledge representation, which contributed to the development of ontologies (the
representation of concepts and their relationships) that codify biological knowledge in
“computer-readable” form, and natural-language processing, which provided a technological
means for mining information from text in the scientific literature. Perhaps most significantly,
the subfield of machine learning found wide use in biology, from modeling sequences for
purposes of pattern recognition to the analysis of high-dimensional (complex) data from large-
scale gene-expression studies.
Difference between Computational Biology and Bioinformatics
The differences between the fields are subtle but practical. The National Institutes of Health describes computational biology as the application of “data-analytical and theoretical methods,
mathematical modeling and computational simulation techniques to the study of biological,
behavioral, and social systems.”1 Computational biologists tend to draw upon skills in the
development of algorithms, mathematical modeling, and statistical evaluation to make
inferences from complicated data sets.

Bioinformatics, on the other hand, uses computational tools and approaches to expand the use
of “biological, medical, behavioral or health data, including those to acquire, store, organize,
archive, analyze, or visualize such data.”1 Bioinformatics approaches tend to draw upon skills
in software development, database development and management, and visualization methods
to convey information contained within data sets.
Simply put, computational biology is about studying biology using computational techniques,
which further the understanding of the science. Bioinformatics focuses more on the engineering
side and the creation of tools that work with biological data to solve problems.

Computational Biology Algorithms


Some examples of algorithms used in computational biology are:
• Global Matching
• Local Sequence Matching
• Hidden Markov Models
• Population Genetics
• Evolutionary Trees
• Gene Regulation Networks
• Chemical Equations
Global Matching (also known as the Needleman-Wunsch problem) and Local Sequence
Matching (also known as the Smith-Waterman problem) make use of our knowledge about
the proteins of one organism to learn more about the proteins of other organisms.
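As an illustration, a minimal Needleman-Wunsch dynamic program can be sketched in a few lines of Python; the scoring values used here (match = 1, mismatch = -1, gap = -2) are illustrative assumptions, not a standard substitution matrix.

```python
# Minimal Needleman-Wunsch global alignment score (sketch).
# Scoring scheme is an illustrative assumption.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        dp[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s,   # align both symbols
                           dp[i - 1][j] + gap,     # gap in b
                           dp[i][j - 1] + gap)     # gap in a
    return dp[n][m]
```

Smith-Waterman local alignment uses the same recurrence but additionally clamps every cell at zero and reports the best-scoring cell anywhere in the matrix.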
Markov models are used for modelling sequences. In these models, the probability of an event
depends only on the previous state (such a model can, for instance, be used to model a DNA
sequence). A Hidden Markov Model (Figure 1) instead uses a probabilistic finite-state machine
in which, depending on the probabilities of the state we are in, we emit a letter and then move
to the next state. The next state may be the same as the current one.

Figure 1: Hidden Markov Model [2]
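The emit-then-move behaviour described above can be sketched by sampling from a small HMM; the two states ("AT-rich" vs "GC-rich") and all probabilities below are illustrative assumptions, not taken from Figure 1.

```python
import random

# Sketch: sample a DNA sequence from a two-state HMM.
# States, transition and emission probabilities are illustrative.

TRANSITIONS = {"AT": {"AT": 0.9, "GC": 0.1},
               "GC": {"AT": 0.2, "GC": 0.8}}
EMISSIONS = {"AT": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
             "GC": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

def sample_sequence(length, start="AT", rng=None):
    """Emit one letter per step, then move to the next (possibly same) state."""
    rng = rng or random.Random()
    state, seq = start, []
    for _ in range(length):
        symbols, probs = zip(*EMISSIONS[state].items())
        seq.append(rng.choices(symbols, weights=probs)[0])
        states, tprobs = zip(*TRANSITIONS[state].items())
        state = rng.choices(states, weights=tprobs)[0]
    return "".join(seq)
```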

Population genetics tries to model evolution. To do so, it commonly makes use of the Wright-Fisher
model (also called the Fisher-Wright model). This model simulates what happens at a gene locus
under selection, mutation and crossover conditions.
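A minimal sketch of the Wright-Fisher model under neutrality (omitting the selection, mutation and crossover terms) simply resamples the gene copies of each generation from the previous one; the population size and starting allele frequency below are illustrative assumptions.

```python
import random

# Sketch: neutral Wright-Fisher drift. Each generation, every one of the
# n_copies gene copies picks a parent copy at random (sampling with
# replacement), so the allele count follows a binomial distribution.

def wright_fisher(n_copies=100, freq=0.5, generations=200, rng=None):
    """Track one allele's frequency until loss, fixation or time-out."""
    rng = rng or random.Random()
    count = int(n_copies * freq)
    history = [count / n_copies]
    for _ in range(generations):
        count = sum(rng.random() < count / n_copies for _ in range(n_copies))
        history.append(count / n_copies)
        if count in (0, n_copies):   # allele lost or fixed
            break
    return history
```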
Evolutionary trees (Figure 2) can be created based on some form of evolutionary distance.
There are two main types of evolutionary trees: distance-based trees and sequence-based trees.
Evolutionary trees are used to represent the evolutionary distances between different species.

Figure 2: Evolutionary Trees [3]
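A distance-based tree can be sketched with a toy UPGMA-style clustering: repeatedly merge the two clusters with the smallest average pairwise distance. The distances below are hypothetical, and the sketch returns only the tree topology, not branch lengths.

```python
# Toy UPGMA clustering (sketch) for building a distance-based tree.
# `dist` maps frozenset({a, b}) -> distance between leaves a and b.

def upgma(labels, dist):
    """Repeatedly merge the two closest clusters; return a nested-tuple tree."""
    clusters = {label: (label,) for label in labels}   # name -> member leaves
    trees = {label: label for label in labels}         # name -> subtree
    while len(clusters) > 1:
        # Average-linkage distance between two clusters.
        def d(x, y):
            pairs = [(a, b) for a in clusters[x] for b in clusters[y]]
            return sum(dist[frozenset(p)] for p in pairs) / len(pairs)
        # Find and merge the closest pair.
        x, y = min(((x, y) for x in clusters for y in clusters if x < y),
                   key=lambda p: d(*p))
        merged = x + "+" + y
        clusters[merged] = clusters.pop(x) + clusters.pop(y)
        trees[merged] = (trees.pop(x), trees.pop(y))
    return next(iter(trees.values()))
```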
Gene regulation networks arise from the interactions of different proteins in an
organism. The proteins control each other, and the nature of these interactions
determines the cell type.
Finally, chemical equations (Figure 3) can be used to describe the mechanics behind gene
regulation networks. The reaction rates depend on the concentrations of the elements in
the chemical equations.

Figure 3: Chemical Equations [4]
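The concentration-dependent reaction rates mentioned above can be sketched with mass-action kinetics for a hypothetical reaction A + B → C, integrated with a simple Euler scheme; the rate constant, step size and initial concentrations are illustrative assumptions.

```python
# Sketch: mass-action kinetics for a toy reaction A + B -> C.
# The rate is proportional to the product of reactant concentrations.

def simulate(a=1.0, b=0.8, c=0.0, k=2.0, dt=0.001, steps=5000):
    """Integrate d[C]/dt = k[A][B] with explicit Euler steps."""
    for _ in range(steps):
        rate = k * a * b          # mass-action rate law
        a -= rate * dt            # reactants are consumed...
        b -= rate * dt
        c += rate * dt            # ...and product accumulates
    return a, b, c
```

Note that mass is conserved at every Euler step: a + c and b + c stay constant throughout the integration.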
Applications of Computational biology:
1. Varietal Information System
2. Plant Genetic Resources Data Base
3. Biometrical Analysis
4. Storage and Retrieval of Data
5. Studies on Plant Modelling
6. Pedigree Analysis
7. Preparation of Reports
8. Updating of Information
9. Diagrammatic Representation
10. Planning of Breeding Program
Application # 1. Varietal Information System:
Bioinformatics has useful application in developing a varietal information system. In connection
with the Plant Variety Protection (PVP) Act, various terms such as extant variety, candidate variety,
reference variety, example variety and farmer's variety are frequently used. Hence, knowledge
of these terms is essential. These are defined below:
Extant Variety. All released, notified and unprotected varieties.
Candidate Variety. A variety to be registered under the Plant Variety Protection Act is referred to
as a candidate variety.
Reference Variety. All released and notified extant varieties of common knowledge which are
in the seed production chain.

Example Variety. A variety that is used for comparison for a particular character is called an
example variety.
Farmer's Variety. A variety that has been developed by a farmer and used for commercial
cultivation for several years is called a farmer's variety.
The detailed information about various types of varieties can be developed using highly
heritable characters.
Such information can be used in various ways as given below:
(i) For varietal identification in DUS testing.
(ii) In grouping of varieties on the basis of various highly heritable characters.
(iii) In sorting out of cultivars for use in pre-breeding and traditional breeding.

The information can be stored in the computer memory and be retrieved as and when required.

Application # 2. Plant Genetic Resources Data Base:


The genetic material of plants which is of value for present and future generations of people is
referred to as plant genetic resources. It is also known as the gene pool, genetic stock or
germplasm. The germplasm is evaluated for several characters, such as highly heritable
morphological characters, yield-contributing characters, quality characters, resistance to biotic and
abiotic stresses, and characters of agronomic value.
The International Plant Genetic Resources Institute (IPGRI), Rome, Italy, has developed descriptors
and descriptor states for various crop plants. Such descriptors help in the uniform recording of
observations on the germplasm of crop plants all over the world. Thus huge data is collected on
crop genetic resources over several years. Bioinformatics plays an important role in the systematic
management of this huge data.
Bioinformatics is useful in handling such data in several ways as follows:
(i) It maintains the data of several locations and several years in a systematic way.
(ii) It permits addition, deletion and updating of information.
(iii) It helps in the storage and retrieval of huge data.
(iv) It also helps in the classification of PGR data based on various criteria.
(v) It helps in the retrieval of data belonging to a specific group, such as early maturity, late maturity,
dwarf types, tall types, resistance to biotic stresses, resistance to abiotic stresses, superior quality,
marker genes, etc.
All such data can be easily managed by computer aided programs and can be manipulated to
get meaningful results.

Application # 3. Biometrical Analysis:
In crop improvement, various biometrical analyses are performed.
Important biometrical analyses that are performed in plant breeding and genetics are given
below:
(1) Simple measures of variability such as mean, standard deviation, standard error, coefficient
of variation, etc.
(2) Correlations: It includes genotypic, phenotypic and environmental correlations. It also
includes simple, partial and multiple correlations.
(3) Path Coefficients: It includes analysis of genotypic, phenotypic and environmental paths.
(4) Discriminant function analysis.
(5) Metroglyph analysis and D2 statistics
(6) Stability analysis
(7) Diallel, partial diallel, line × tester, triallel, quadriallel, biparental and triple test cross
analysis.
(8) Generation mean analysis, etc.
All these analyses can be easily performed through computer aided programs.
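As a sketch of the simple variability measures in item (1), the following computes the mean, sample standard deviation, standard error and coefficient of variation for a hypothetical set of plot observations.

```python
import math

# Sketch: simple measures of variability for a set of observations
# (the input values are hypothetical plot-yield data).

def variability(values):
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (n - 1 denominator).
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    se = sd / math.sqrt(n)       # standard error of the mean
    cv = 100 * sd / mean         # coefficient of variation, in %
    return {"mean": mean, "sd": sd, "se": se, "cv": cv}
```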

Application # 4. Storage and Retrieval of Data:


In crop improvement, huge data is collected on the following aspects:
(i) Segregating populations:
Single plant selections are made in segregating populations and data are recorded on various
characters such as yield components, quality characters, resistance to biotic and abiotic stresses,
etc.
(ii) Multi-location Experiments:
Such experiments are conducted mainly for identification and release of new varieties and
hybrids and also for assessment of varietal stability.
(iii) Multi-seasonal Experiments:
Such experiments are conducted for several years (3-5 years) for the identification of new varieties
and hybrids. The above data remain in active use generally for two decades. Handling such
huge data is a difficult task.
However, such data can easily be stored on various storage devices such as hard disks, compact
discs, pen drives, data cards, etc. Storage of data in computers requires less space and is very
safe as compared to storage of data in paper registers and files.

Application # 5. Studies on Plant Modelling:
Computers are useful tools for undertaking studies on the modelling of plants. First the theoretical
model can be prepared with the help of a computer, keeping in view various plant characters.
Then such model plants can be developed through hybridization and directional selection.
This type of study is useful in developing crop ideotypes, or ideal plant types, in different field
crops. First the conceptual model is prepared, and then efforts are made to achieve it by
combining desirable genes from different sources into a single genotype through appropriate
breeding procedures. Such studies have been made in field pea.

Application # 6. Pedigree Analysis:


Computer-aided studies are useful in the pedigree analysis of various cultivars and hybrids.
Information about the parentage of cultivars and hybrids is entered into computer memory,
from which it can be retrieved at any time. The lists of parents that are common to the pedigrees
of various cultivars and hybrids can be sorted out easily.
It helps in pedigree analysis, which in turn can be used in planning plant breeding programs,
especially in the selection of parents for use in hybridization programs. The study of proteomics
also helps in pedigree analysis.

Application # 7. Preparation of Reports:


After biometrical analysis of data, results are interpreted and various types of reports or
documents are prepared.
In crop improvement, the following types of reports are prepared:
(i) Research Project Report: The annual progress report of each project is prepared and salient
findings are documented.
(ii) Monthly, quarterly, half yearly and annual progress reports of all the research projects are
also prepared.
(iii) Sometimes, bulletin and booklets are prepared to document specific information for
adoption and benefit of farmers.
(iv) Research papers and popular articles are prepared based on research findings.
(v) Germplasm catalogues are prepared for various characters.
Such reports can easily be prepared with the help of computers using a word processing program
such as MS Word. This information can be stored in computer memory and reused as and when
required. The editing and updating of reports can be done at any time without much extra effort.

Application # 8. Updating of Information:
In plant breeding and genetics, the results of multi-seasonal and long-term experiments require
continuous updating. Computers have made this task very simple. The information related to
any experiment that is already stored in computer memory can be updated at any time by
editing the concerned file. Any portion of the information can be deleted or revised easily.

Application # 9. Diagrammatic Representation:


Inclusion of diagrams makes the reports, research papers, articles, bulletins, etc. more
attractive, informative and easily understandable.
The following types of diagrams are made in plant breeding:
(i) Line diagrams, bar diagrams, histograms and pie diagrams.
(ii) Cluster diagram: It is prepared when data is subjected to D2 analysis.
(iii) Path diagram: It is prepared when data is subjected to path coefficient analysis.
(iv) Vr-Wr Graph: It is prepared when data is subjected to Hayman’s graphical approach of
diallel cross analysis.
(v) Metroglyph Chart: It is prepared when data is subjected to Metroglyph analysis.
All these diagrams can be easily prepared with the help of a computer using a specific program.

Application # 10. Planning of Breeding Programs:


Plant breeders have to plan various breeding programs every year. Computers are useful tools
in such planning. The following activities can be easily planned with the help of computers.
(i) Sowing plans of various breeding experiments.
(ii) Selfing and crossing plans.
(iii) Breeder seed production plan.
(iv) Hybrid seed production plan.
(v) Germplasm collection, conservation, evaluation, distribution, utilization and documentation
plan.
(vi) Screening plan of breeding material against biotic and abiotic stresses.
(vii) Selection, quality evaluation and multi-location testing plans.

All the above plans can be easily prepared with the help of a computer well in advance. This is
very important for the proper implementation of various breeding programs. Computers are also
useful for printing out labels and lists of observations to be recorded in various breeding
experiments.

Genomic analysis
Genomic analysis is the identification, measurement or comparison of genomic features such
as DNA sequence, structural variation, gene expression, or regulatory and functional element
annotation at a genomic scale. Methods for genomic analysis typically require high-throughput
sequencing or microarray hybridization and bioinformatics.
Introduction to Whole-Genome Sequencing
Whole-genome sequencing (WGS) is a comprehensive method for analyzing entire genomes.
Genomic information has been instrumental in identifying inherited disorders, characterizing
the mutations that drive cancer progression, and tracking disease outbreaks. Rapidly dropping
sequencing costs and the ability to produce large volumes of data with today’s sequencers make
whole-genome sequencing a powerful tool for genomics research.

While this method is commonly associated with sequencing human genomes, the scalable,
flexible nature of next-generation sequencing (NGS) technology makes it equally useful for
sequencing any species, such as agriculturally important livestock, plants, or disease-related
microbes.
Advantages of Genome analysis:

• Provides a high-resolution, base-by-base view of the genome
• Captures both large and small variants that might be missed with targeted approaches
• Identifies potential causative variants for further follow-up studies of gene expression and regulation mechanisms
• Delivers large volumes of data in a short amount of time to support assembly of novel genomes

Web based Genome browser Introduction:

With the rapid development of next-generation sequencing technologies, hundreds of
eukaryotic and thousands of prokaryotic genomes have been sequenced
(http://www.genomesonline.org/). All the sequence data as well as the annotations generated
through most completed or ongoing genome projects are collected in the genome databases and
are publicly available through web portals such as the NCBI genome portal
(http://www.ncbi.nlm.nih.gov/genome/) and the EBI genome database website
(http://www.ebi.ac.uk/Databases/genomes.html).

By systematically integrating genome sequences with annotations generated from heterogeneous
data, genome browsers provide a unique platform for molecular biologists to browse, search,
retrieve and analyze genomic data efficiently and conveniently. With a graphical interface, a
genome browser helps users extract and summarize information intuitively from huge amounts
of raw data. Web-based genome browsers are useful in promoting biological research due to
their data quality, flexible accessibility and high performance. First, dedicated organizations
often collect and integrate high-quality annotation data into web-based genome browsers,
providing plentiful up-to-date information for the community. Second, users can access them
anywhere with a standard web browser, avoiding any additional effort of setting up a local
environment for application installation and data preparation. Third, web-based genome
browsers are usually installed on high-performance servers and can support more complex and
larger-scale data types and applications.

Currently, there are two types of web-based genome browsers. The first type is the multiple-species
genome browsers implemented in, among others, the UCSC genome database [5], the
Ensembl project [6], the NCBI Map Viewer website [7], and the Phytozome and Gramene platforms
[8]. These genome browsers integrate sequences and annotations for dozens of organisms and
further promote cross-species comparative analysis. Most of them contain abundant
annotations, covering gene models, transcript evidence, expression profiles, regulatory data,
genomic conservation, etc. Each set of pre-computed annotation data is called a track in
genome browsers. The essence of a genome browser is to pile up multiple tracks under the
same genomic coordinate along the Y-axis, so that users can easily examine the consistency or
differences of the annotation data and make judgments about the functions or other features
of a genomic region.
The other type is the species-specific genome browsers which mainly focus on one model
organism and may have more annotations for a particular species. Powered by the Generic
Model Organism Database (GMOD) project (http://gmod.org/), dozens of open-source
software tools are collected for creating and managing genome biological databases, and the
GBrowse framework [9] is one of the most popular tools in the GMOD project. Currently, most
of these species-specific genome browsers are implemented based on the GBrowse framework,
such as MGI, FlyBase, WormBase, SGD and TAIR

Genome browser frameworks
Building a web-based genome browser from scratch is both time- and labor-consuming, while
well-designed genome browser frameworks can be useful in this respect.

Genome browsers can be divided into two categories based on whether the image is rendered
on the server side or on the client side. Server-side rendering browsers such as UCSC,
Ensembl, GBrowse and ABrowse extract the requested data from the back-end databases,
render them into pictures on the server, and then send the pictures to the client web browsers.
Client-side rendering browsers such as Anno-J and JBrowse send the requested data to the client
web browsers directly and draw the pictures dynamically there.
FUNCTIONALITIES AND FEATURES
Visualization
The principal function of the genome browser is to aggregate different types of annotation data
together and integrate them into an abstract graphical view. Annotations are organized under a
uniform genome coordinate with the chromosome as the X-axis and various types of data being
displayed along the Y-axis.
Data retrieval and analysis
In addition to graphical data navigation, data retrieval and analysis are useful features for a
genome browser. Most of the existing genome browsers support search functions to locate

genomic regions by coordinates, sequences or keywords. Some genome browsers employ a
system to retrieve bulk data. For example, the UCSC system offers Table Browser to retrieve
specified datasets [19], while the Ensembl, Gramene and ABrowse projects employ the
BioMart system [20, 21] for making large data queries. To facilitate further data analysis,
multiple data access approaches are supported for analysis tools to retrieve data from the
genome browsers. The Galaxy genome browser Trackster supports analysis by integrating
tools in the same platform, connecting data manipulation tightly with visualization. Users
can view the data in the genome browser seamlessly and further filter the visualized data on
the fly, which helps to refine the results conveniently and efficiently.
Customization
It is much easier to build a genome browser based on a framework. Most of the frameworks
have configuration files for users to customize local data. Currently, it is easy for users to
integrate annotations into general genome browsers using several popular data formats, such as
GFF, BED, SAM and WIG.
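As a small illustration of one of these formats, a minimal parser for a BED line (tab-separated; the first three required columns are chrom, chromStart and chromEnd in 0-based, half-open coordinates) might look like the sketch below; the optional name and score columns are handled when present.

```python
# Sketch: parse the required (and two common optional) columns of a BED line.

def parse_bed_line(line):
    fields = line.rstrip("\n").split("\t")
    record = {"chrom": fields[0],
              "start": int(fields[1]),   # 0-based, inclusive
              "end": int(fields[2])}     # exclusive
    if len(fields) > 3:
        record["name"] = fields[3]       # optional feature name
    if len(fields) > 4:
        record["score"] = int(fields[4]) # optional score, 0-1000
    return record
```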

Phred
Targeted re-sequencing is one of the most powerful and widely used strategies for population
genetics studies because it allows screening of variation in a way that is unbiased with respect to
the allele frequency spectrum, and because it is suitable for a wide variety of living organisms.
Although next-generation sequencing (NGS) technologies offer a plethora of new opportunities,
re-sequencing studies are traditionally performed using Sanger DNA sequencing.
This is due, in part, to the widespread availability of automatic sequencers based on capillary
electrophoresis, and also to the fact that Sanger sequencing is still less prone to base-calling
errors, which is critical in population genetics studies, for which the accurate identification of
substitutions carried by unique chromosomes (singletons) is highly informative.
Examples of studies in different areas of genetics that require re-sequencing data are:
(a) Inferences of past demographic parameters of populations of humans, animals, plants and
microorganisms, and of the action of natural selection, based on ascertainment-bias-free allelic
spectra;
(b) Epidemiological studies designed to capture rare polymorphisms responsible for complex
traits;
(c) Screening for variation in populations that are not included in public databases such as
HapMap, to optimally select informative single-nucleotide polymorphisms (tag-SNPs)
for association studies;
(d) Forensic studies or analyses based on mitochondrial DNA data; and
(e) Screenings for mutations in families or small populations with high incidences of specific
genetic diseases.

Two of the most popular, powerful and freely available tools for re-sequencing studies are
(1) The software package Phred-Phrap-Consed-PolyPhred (PPCP), which performs base calling,
alignment, graphical editing and polymorphism identification;
(2) The DNA Sequence Polymorphism software (DnaSP), which performs a wide set of
population genetics analyses through a user-friendly Windows interface.

Phred/Phrap/Consed is a worldwide distributed package for:


a. Trace file (chromatogram) reading;
b. Quality (confidence) assignment to each individual base;
c. Vector and repeat sequence identification and masking;
d. Sequence assembly;
e. Assembly visualization and editing;
f. Automatic finishing.
Phred is a program that performs several tasks:
a. Reads trace files – compatible with most file formats: SCF (standard chromatogram
format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases – attributes a base for each identified peak with a lower error rate than
the standard base calling programs.
c. Assigns quality values to the bases – a “Phred value” based on an error rate estimation
calculated for each individual base.
d. Creates output files – base calls and quality values are written to output files.

History :
In 1995, Bonfield and Staden proposed a method to use base-specific quality scores to improve
the accuracy of consensus sequences in DNA sequencing projects.[5]

However, early attempts to develop base-specific quality scores[6][7] had only limited success.
The first program to develop accurate and powerful base-specific quality scores was the
program Phred. Phred was able to calculate highly accurate quality scores that were
logarithmically linked to the error probabilities. Phred was quickly adopted by all the major
genome sequencing centers as well as many other laboratories; the vast majority of the DNA
sequences produced during the Human Genome Project were processed with Phred.
A Phred quality score is a measure of the quality of the identification of the nucleobases
generated by automated DNA sequencing. It was originally developed for Phred base calling
to help in the automation of DNA sequencing in the Human Genome Project.
The quality score of a base, also known as a Phred or Q score, is an integer value representing
the estimated probability that the base call is an error, i.e. that the base is incorrect.
Phred quality scores are assigned to each nucleotide base call in automated sequencer traces.
The FASTQ format encodes phred scores as ASCII characters alongside the read sequences.
Phred quality scores have become widely accepted to characterize the quality of DNA
sequences and can be used to compare the efficacy of different sequencing methods. Perhaps
the most important use of Phred quality scores is the automatic determination of accurate,
quality-based consensus sequences
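For example, assuming the common Phred+33 encoding (Sanger and Illumina 1.8+), each quality character in a FASTQ quality string can be decoded by subtracting the ASCII offset:

```python
# Sketch: decode Phred scores from a FASTQ quality string.
# The +33 offset is the common Sanger/Illumina 1.8+ convention;
# older Illumina data used a +64 offset instead.

def decode_qualities(quality_string, offset=33):
    """Map each ASCII character to its integer Phred score."""
    return [ord(ch) - offset for ch in quality_string]
```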
Phred value formula

q = -10 × log10(p)

where
q = quality value
p = estimated probability of error for a base call

Examples:
q = 20 means p = 10^-2 (1 error in 100 bases)
q = 40 means p = 10^-4 (1 error in 10,000 bases)
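The formula and its inverse can be written directly as code; a short sketch:

```python
import math

# Sketch: convert between an error probability and a Phred quality value.

def phred_from_error(p):
    """Q = -10 * log10(p)."""
    return -10 * math.log10(p)

def error_from_phred(q):
    """p = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)
```

In practice Phred scores are reported as integers, so the value from `phred_from_error` would be rounded.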

METHODS
Base-Calling Algorithm Overview
The phred base-caller uses a four-phase procedure to determine a sequence of base-calls from
the processed trace. In the first phase, idealized peak locations (predicted peaks) are
determined; the idea is to use the fact that fragments are, on average, locally fairly evenly
spaced in most regions of the gel to determine the correct number of bases and their
idealized, evenly spaced locations in regions where the peaks are not well resolved, noisy,
or displaced (as in compressions). In the second phase, observed peaks are identified in the
trace. In the third phase, observed peaks are matched to the predicted peak locations,
omitting some peaks and splitting others; as each observed peak comes from a specific array
and is thus associated with 1 of the 4 bases, the ordered list of matched observed peaks
determines a base sequence for the trace. In the final phase, the uncalled (i.e., unmatched)
observed peaks are checked for any peak that appears to represent a base but could not be
assigned to a predicted peak in the third phase; if such a peak is found, the corresponding
base is inserted into the read sequence. The entire procedure is rapid, taking less than half
a second per trace on typical workstations.
Phred takes as input chromatogram files, in ABI format or Standard Chromatogram Format
(SCF), containing the processed trace data, typically traces produced by the ABI analysis
software.

Phred score (Q score)

Quality value for each sequenced base


The Phred score is a measure for base quality in DNA sequencing. The larger the Phred
value, the better the quality of a sequenced base.

Phred    Error rate      Accuracy
10       1 in 10         90%        (very low quality)
20       1 in 100        99%        (minimum quality for many tools)
30       1 in 1,000      99.9%      (reasonably good quality)
40       1 in 10,000     99.99%     (high quality)
50       1 in 100,000    99.999%    (very high quality)

Application:
Phred quality scores are used for assessment of sequence quality, recognition and removal of
low-quality sequence (end clipping), and determination of accurate consensus sequences.
Phrap also uses Phred quality scores to estimate whether discrepancies between two
overlapping sequences are more likely to arise from random errors, or from different copies
of a repeated sequence.
Within the Human Genome Project, the most important use of Phred quality scores was for
automatic determination of consensus sequences.

EGassembler

EGassembler is an online service which provides an automated as well as a user-customized
analysis tool for cleaning, repeat masking, vector trimming, organelle masking, clustering and
assembly of ESTs and genomic fragments. EGassembler consists of a pipeline of the following
five components, each using highly reliable open-source tools (see Acknowledgements for
details) and a non-redundant custom-made database of vectors and repeats covering almost
all publicly available vector and repeat databases. Figure 1 shows a flow chart of the
EGassembler process.

Pipeline Description

The web server accepts any type of DNA sequence in FASTA format (EST, GSS,
cDNA, gDNA). The pipeline includes:

o Sequence Cleaning: automated trimming and screening for various
contaminants, low-quality and low-complexity sequences.
o Repeat Masking: masking DNA sequences for repetitive elements including
small RNA pseudogenes, LINEs, SINEs, LTR elements, microsatellites and
other interspersed repeats. By default it uses our custom-made non-redundant
repeats database, which includes RepBase, TREP repeats, TIGR plant repeats
and thousands of other publicly available repeat sequences on the Internet.
Researchers can also use their own libraries of repeats for screening.
o Vector Masking: screening out vectors, adaptors and other contamination. It
uses by default the NCBI's UniVec core vector/adaptor library, and EMBL's
emvec vector library as an option. Users can also upload their own database of
vector sequences for screening.
o Organelle Masking: using NCBI's entire current organelle database (762
mitochondria, 42 plastids, 14 plasmids and 3 nucleomorphs), users have the
opportunity to screen their sequences against all plastids and mitochondrial
genomes (Fungi, Metazoan, Plants and plasmids). Users can also use their
organelle sequences for screening.
o Sequence Assembly: clustering and assembling the sequences into contigs and
singletons using CAP3.
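Every component above consumes and produces sequences in FASTA format, so the input is easy to handle programmatically. A minimal Python sketch of a FASTA reader (an illustration, not part of EGassembler itself):

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):          # a new record begins
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        else:                             # sequence may span several lines
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records

fasta = ">EST1 example\nACGTACGT\nACGT\n>EST2\nTTGGCC\n"
records = parse_fasta(fasta)
```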

Interface Description

The EGassembler web interface has three sub-menus, each targeted at different users.

1. One-Click Assembly
2. Step-by-Step Assembly
3. Stand-Alone Processing

One-Click Assembly
This option suits users new to bioinformatics. All the components in the
pipeline run consecutively with their default options. Users only
select the libraries for masking repeats, vectors and organelles. Each process
runs in turn until all processes have finished.

Step-by-Step Assembly
Users can run all the components outlined in the pipeline interactively and
have the opportunity to run each one of them with advanced options. The
output of each step of the process will be automatically used as input to the
next step of the pipeline; users can also jump to any step at any time using
the previous results.

Stand-Alone Processing
Users can use each one of the components alone, with all options available.
The web interface displays the default parameters of the original programs,
any of which users can change for each program. This option is the same as
Step-by-Step Assembly; the only difference is that here users cannot use the
output of one process as input to another process.

Browser Compatibility
The web server has been tested successfully on a number of browsers, including Internet
Explorer, Firefox, Mozilla, Opera, Safari and Maxthon, on three operating systems
(Microsoft Windows, Linux and Mac).

Example 1
Using One-Click Assembly to download and assemble all ESTs of Arabidopsis
lyrata deposited in GenBank.

Downloading EST:

1. Go to the web site: http://www.ncbi.nlm.nih.gov/


2. Search Nucleotide for "arabidopsis lyrata AND gbdiv_est[PROP]"
3. Change Display format to FASTA
4. Send to File
5. Save on your computer
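The browser steps above can also be scripted against NCBI's E-utilities. The sketch below only constructs the request URLs (no network call is made); the query term mirrors step 2, and the `retmax` cap is an illustrative choice:

```python
from urllib.parse import urlencode

# NCBI E-utilities base endpoint (documented public service)
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db, term, retmax=100000):
    """URL for an esearch query returning up to retmax record IDs."""
    return f"{BASE}/esearch.fcgi?" + urlencode({"db": db, "term": term, "retmax": retmax})

def efetch_url(db, ids, rettype="fasta"):
    """URL to fetch the given record IDs as FASTA text (mirrors steps 3-5)."""
    return f"{BASE}/efetch.fcgi?" + urlencode(
        {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": "text"})

url = esearch_url("nucleotide", "arabidopsis lyrata AND gbdiv_est[PROP]")
```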

One-Click Assembly:

6. Go to web site: https://www.genome.jp/tools/egassembler/


7. Choose File Upload and upload your file, or you may copy and paste your
sequence into the text field.
8. Enable Sequence Cleaning Process if needed (you also can choose CPU
numbers and identity threshold)
9. Enable Repeat Masking Process to select the library of repeats that you want
to screen against (here choose RepBase database and arabidopsis option)
10. Enable Vector Masking Process: select the library of the vector that you want
to screen against (here NCBI's core vector library)
11. Enable Organelle Masking Process: select either plastids or Mitochondria
database (here plastids, arabidopsis)
12. Enable Sequence Assembly Process; overlap percent identity cutoff can be
modified.
13. Click on Submit button.

The output is a page with hyperlinks to all the results from the different processes.

Viewing Results

1- Sequence Cleaning Process

The .clean file contains the query sequences, with poly-A/poly-T tails and low-complexity regions removed.
The .cln file is a summary of the cleaning.

2- Repeat Masking Process

The .masked file is the query sequence with repetitive parts masked by X's.
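The poly-A/poly-T removal reported in the .clean file can be illustrated with a small Python sketch (the minimum run length of 8 is an assumed threshold, not EGassembler's actual setting):

```python
import re

def trim_est(seq, min_run=8):
    """Remove a trailing poly-A tail and a leading poly-T head, as the
    cleaning step does before writing the .clean file.
    min_run is an assumed threshold, not EGassembler's actual value."""
    seq = re.sub("A{%d,}$" % min_run, "", seq)  # trailing poly-A tail
    seq = re.sub("^T{%d,}" % min_run, "", seq)  # leading poly-T head
    return seq

cleaned = trim_est("TTTTTTTTTTGCGCAAAAAAAAAA")  # "GCGC"
```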

Genome Tools
The GenomeTools genome analysis system is a free collection of bioinformatics tools (in the
realm of genome informatics) combined into a single binary named gt. It is based on a C
library named libgenometools which contains a wide variety of classes for efficient and
convenient implementation of sequence and annotation processing software.

This list shows all GenomeTools tools and their functions.

 gt The GenomeTools genome analysis system.


 gt bed_to_gff3 Parse BED file and convert it to GFF3.
 gt cds Add CDS (coding sequence) features to exon features given in GFF3 file.
 gt chain2dim Chain pairwise matches.
 gt chseqids Change sequence ids by the mapping given in a mapping file.
 gt clean Remove all files in the current directory which are automatically created by
gt.
 gt compreads Call a fastq file compression tool.
 gt compreads compress Generates compact encoding for fastq data.
 gt compreads decompress Decodes a file of compressed reads.
 gt compreads refcompress Generates compact encoding for fastq data using Reference
Compressed Reads (RCR).
 gt compreads refdecompress Decodes a given RCR (Reference Compressed Reads).
 gt condenseq Call one of the CONDENSER tools to prepare or manipulate
redundancy compressed genomic data.
 gt congruence Call a congruence subtool and pass argument(s) to it.
 gt congruence spacedseed Match spaced seeds.
 gt convertseq Parse and convert sequence file formats (FASTA/FASTQ, GenBank,
EMBL).
 gt csa Transform spliced alignments from GFF3 file into consensus spliced
alignments.
 gt dot Prints feature graphs in dotfile format.
 gt dupfeat Duplicate internal feature nodes in given GFF3 files.
 gt encseq Call an encoded sequence manipulation tool and pass argument(s) to it.
 gt encseq bench Perform benchmark on extractions from encseq.
 gt encseq bitextract Extracts internal data from encoded sequences.

 gt encseq check Check the consistency of an encoded sequence file.
 gt encseq decode Decode/extract encoded sequences.
 gt encseq encode Encode sequence files (FASTA/FASTQ, GenBank, EMBL)
efficiently.
 gt encseq info Display meta-information about an encoded sequence.
 gt encseq md5 Display MD5 sums for an encoded sequence.
 gt encseq sample Decode/extract encoded sequences by random choice.
 gt encseq2spm Compute suffix prefix matches from encoded sequence.
 gt eval Compare annotation files and show accuracy measures (prediction vs.
reference).
 gt extractfeat Extract features given in GFF3 file from sequence file.
 gt extractseq Extract sequences from given sequence file(s) or fastaindex.
 gt fastq_sample Print samples by random choice from given FASTQ files using at
least n sequence-chars. Output is fastq/fasta format depending on whether qualities
are available.
 gt featureindex Retrieve annotations from a persistent feature index as GFF3 output.
 gt fingerprint Compute MD5 fingerprints for each sequence given in a set of sequence
files.
 gt genomediff Calculates Kr: pairwise distances between genomes.
 gt gff3 Parse, possibly transform, and output GFF3 files.
 gt gff3_to_gtf Parse GFF3 file(s) and show them as GTF2.2.
 gt gff3validator Strictly validate given GFF3 files.
 gt gtf_to_gff3 Parse GTF2.2 file and convert it to GFF3.
 gt hop Cognate sequence-based homopolymer error correction.
 gt id_to_md5 Change sequence IDs in given GFF3 files to MD5 fingerprints of the
corresponding sequences.
 gt inlineseq_add Adds inline sequences from external source to GFF3 input.
 gt inlineseq_split Split GFF3 annotations with inline sequences into separate files.
 gt interfeat Add intermediary features between outside features in given GFF3 file(s).
 gt loccheck Checks parent-child containment in GFF3 input.
 gt ltrclustering Cluster features of LTRs.
 gt ltrdigest Identifies and annotates sequence features in LTR retrotransposon
candidates.
 gt ltrharvest Predict LTR retrotransposons.
 gt matchtool Parse match formats and/or invoke matching tools.
 gt matstat Compute matching statistics.

 gt md5_to_id Change MD5 fingerprints used as sequence IDs in given GFF3 files to
“regular” ones.
 gt merge Merge sorted GFF3 files in sorted fashion.
 gt mergefeat Merge adjacent features without children of the same type in given
GFF3 file(s).
 gt mkfeatureindex Creates a new FeatureIndex from annotation data.
 gt mmapandread Map the supplied files into memory and read them once.
 gt orffinder Identifies ORFs (open reading frames) in sequences.
 gt packedindex Call a packed index subtool and pass argument(s) to it.
 gt prebwt Precompute bwt-bounds for some prefix length.
 gt readjoiner Readjoiner: a string graph-based sequence assembler.
 gt readjoiner assembly Construct string graph and output contigs.
 gt readjoiner overlap Compute suffix prefix matches from encoded sequence.
 gt readjoiner prefilter Remove contained and low-quality reads and encode read set in
GtEncseq format.
 gt repfind Compute maximal exact matches (and more).
 gt scriptfilter Get info about and validate Lua script filters.
 gt seed_extend Calculate local alignments using the seed and extend algorithm.
 gt select Select certain features (specified by the used options) from given GFF3
file(s).
 gt seq Parse the given sequence file(s) and construct the corresponding index files.
 gt seqfilter Filter the given sequence file(s) and show the results on stdout.
 gt seqids Show sequence IDs from annotation file.
 gt seqmutate Mutate the sequences of the given sequence file(s).
 gt seqorder Output sequences as MultiFasta in specified order.
 gt seqstat Calculate statistics for fasta file(s).
 gt seqtransform Perform simple transformations on the given sequence file(s).
 gt seqtranslate Translates a nucleotide sequence into a protein sequence.
 gt sequniq Filter out repeated sequences in given sequence files.
 gt shredder Shredder sequence file(s) into consecutive pieces of random length.
 gt shulengthdist Compute distribution of pairwise shustring lengths.
 gt simreads Simulate sequencing reads from random positions in the input
sequence(s).
 gt sketch Create graphical representation of GFF3 annotation files.
 gt sketch_page Draw a multi-page PDF/PS representation of an annotation file.

 gt snpper Annotates SNPs according to their effect on the genome as given by a
genomic annotation.
 gt speck Checks spec definition compliance in GFF3 input.
 gt splicesiteinfo Show information about splice sites given in GFF3 files.
 gt splitfasta Split the supplied fasta file.
 gt stat Show statistics about features contained in GFF3 files.
 gt tagerator Map short sequence tags in given index.
 gt tallymer Call a tallymer subtool and pass argument(s) to it.
 gt tallymer mkindex Count and index k-mers in the given enhanced suffix array for a
fixed value of k.
 gt tallymer occratio Compute occurrence ratio for a set of sequences represented by
an enhanced suffix array.
 gt tallymer search Search a set of k-mers in an index constructed by “gt tallymer
mkindex”.
 gt tirvish Identify Terminal Inverted Repeat (TIR) elements, such as DNA
transposons.
 gt uniq Filter out repeated feature node graphs in a sorted GFF3 file.
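Several of the tools above (gt fingerprint, gt encseq md5) compute one MD5 checksum per sequence so that identical sequences can be detected across files. The idea can be sketched in Python; note that upper-casing before hashing is an assumption here, not necessarily gt's exact normalization:

```python
import hashlib

def sequence_fingerprint(seq):
    """MD5 hex digest of the upper-cased sequence, in the spirit of
    `gt fingerprint` (the exact normalization gt applies is an
    assumption, not taken from its source)."""
    return hashlib.md5(seq.upper().encode("ascii")).hexdigest()

# Identical sequences get identical fingerprints regardless of case,
# which is what makes fingerprints useful for duplicate detection.
fp = sequence_fingerprint("acgt")
```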

Online Bioinformatics Resources Collection
The urgent need to organize bioinformatics resources has been raised in recent years. Among
the existing efforts to address the problem are the Molecular Biology Database Collection
compiled by NAR, the Bioinformatics Links Directory, the ExPASy Life Sciences Directory
(http://www.expasy.org/links.html), the DBcat, the Database of Databases and the Pathguide.
In order to help biomedical researchers quickly find the most relevant bioinformatics
resources for their specific information needs, the Online Bioinformatics Resources Collection
(OBRC) was developed at the Health Sciences Library System (HSLS), University of
Pittsburgh. This collection currently includes 1542 online bioinformatics databases and
software tools, most of which have been published by NAR or listed in its Molecular Biology
Database Collection (2). In addition, the Vivísimo Clustering Engine was implemented in OBRC
to help users navigate through their search results.
The number of online databases listed in the Nucleic Acids Research (NAR)
Molecular Biology Database Collection alone increased more than 14-fold, from 58 in 1996
to 858 in 2006. The majority of these newly emerged online resources are specialized databases
and Web servers that provide not only sequence information, but also data on gene expression,
macromolecular structures, and the genotypes and phenotypes of model organisms, as well as
computational tools for analyzing macromolecular sequences/structures and global gene
expression.
The OBRC contains links and annotations for over 2000 bioinformatics databases and
software tools. It was created in 2006 by the Molecular Biology Information Service of the
Health Sciences Library System at the University of Pittsburgh and can be accessed via
search from the HSLS MolBio webpage.
The purpose of the manually curated OBRC is to bridge the gap between the rising
information needs of biological and medical researchers and the rapidly growing number of
online bioinformatics resources. This freely available, searchable database arranges resources
by categories and sub-categories such as Structure Databases and Analysis Tools, Proteomics
Resources, and Enzymes and Pathways. The OBRC is the largest online collection of its kind
and the only one with advanced search-results clustering. It is a one-stop guided information
gateway to the major bioinformatics databases and software tools on the Web.
OBRC is available at the University of Pittsburgh's HSLS Web site
(http://www.hsls.pitt.edu/guides/genetics/obrc).

METHODOLOGY
The new search strategy consists of two major components:
 A centralized collection of the curated information on major online bioinformatics
databases and software tools,
 The implementation of the Vivísimo Clustering Engine to enhance the output of
search results.
Source materials:
The primary sources of OBRC are the databases and software tools published by the NAR
(http://nar.oxfordjournals.org/). Specifically, the source materials were mainly the databases
published in the NAR Annual Database Issues from 2001 to 2006, and the software tools
published in the NAR Annual Web Server Issues from 2004 to 2006. Other databases listed in
the NAR Molecular Biology Database Collection, including those published by NAR before
2001 and those not published by the NAR, were also selected. Selected databases and software
tools described in other peer-reviewed journals, such as Bioinformatics and BMC
Bioinformatics, were included in the collections. In addition, a number of unpublished but
popular online software tools were also entered.
Collection construction, organization and maintenance:
Information on each resource was entered using the HSLS content management system built
on the Zope Web application server. For each entry, the information for the following fields
was entered: URL to the resource; name of the resource; a one-sentence description of the
major functions; URL to the relevant PubMed abstract(s); last modification date of the entry;
highlights of the resource; and keywords. The title, description and highlights for each entry
were generated based on the PubMed abstract(s), as well as the content and scope of the
resource. As a major part of curation efforts, keywords were generated based on the
information in the PubMed abstract(s), the MESH terms of the abstract(s), the information
posted on corresponding web site, as well as the domain knowledge in molecular biology.
Standard terminologies, commonly used by researchers in their publications, were used. The
main types of keywords include biological concepts, entities, organism names, widely studied
gene and protein names, and common molecular biology tasks. OBRC implemented a
categorical structure and a basic classification theme.
Vivísimo Clustering Engine implementation:
The Vivísimo Clustering Engine is based on a novel, intricate three-pass algorithm that is
augmented with hundreds of special processing heuristics and endowed with thousands of
specific facts and general patterns of English and other languages (http://Vivisimo.com/). It
automatically organizes large numbers of search results into different groups and enables users
to quickly survey and identify relevant groups. The Vivísimo Clustering Engine has been
successfully applied on the Web by search engines such as Clusty (http://clusty.com) and
ClusterMed (http://www.clustermed.info). Queries can be formed with basic Boolean
operators. Queries are first processed by the Zope-based search engine that leverages Zope
search tools. The results are then processed by the Vivísimo Clustering Engine on the fly,
using the textual information from a set of fields selected from the following: title,
descriptions, highlights and keywords. The search results organized by the Vivísimo
Clustering Engine are finally presented to the users.

RESULTS
Organized with a three-level hierarchical category classification, OBRC was divided into 13
major categories, 40 secondary categories and 12 tertiary categories to assist users in browsing
the entire collection. The top five main categories are ‘DNA Sequence Databases and
Analysis Tools’ (325), ‘Protein Sequence Databases and Analysis Tools’ (306), ‘Genomic
Databases and Analysis Tools’ (270), ‘Structure Databases and Analysis Tools’ (244) and
‘RNA Databases and Tools’ (130). The top five specific topics are ‘Protein structures’ (214),
‘Regulatory sites and transcription factors’ (112), ‘Protein sequence motifs, active or
functional sites, and functional annotations’ (77), ‘Human mutations and diseases’ (76) and
‘General protein sequence databases, sequence similarity search, analysis, and alignment
tools’ (68). Some resources were listed in multiple categories.

The Online Bioinformatics Resources Collection (OBRC) contains annotations and links for
2458 bioinformatics databases and software tools.

 DNA Sequence Databases and Analysis Tools (463)


 Enzymes and Pathways (242)
 Gene Mutations, Genetic Variations and Diseases (257)
 Genomics Databases and Analysis Tools (636)
 Immunological Databases and Tools (49)
 Microarray, SAGE, and other Gene Expression (166)

 Organelle Databases (25)
 Other Databases and Tools (Literature Mining, Lab Protocols, Medical Topics, and
others) (147)
 Plant Databases (146)
 Protein Sequence Databases and Analysis Tools (408)
 Proteomics Resources (58)
 RNA Databases and Analysis Tools (222)
 Structure Databases and Analysis Tools (385)

URL (http://www.hsls.pitt.edu/guides/genetics/obrc)

Bio-Molecule Software

ENSEMBL

Ensembl provides genes and other annotation such as regulatory regions, conserved base pairs
across species, and sequence variations. The Ensembl gene set is based on protein and mRNA
evidence in UniProtKB and NCBI RefSeq databases, along with manual annotation from the
VEGA/Havana group. All the data are freely available and can be accessed via the web browser
at www.ensembl.org.

Gene sequences can be downloaded from the Ensembl browser itself, or through the
BioMart web interface, which can extract information from the Ensembl databases without the
need for programming knowledge. A sister browser at www.ensemblgenomes.org is set up to
access non-chordates, namely bacteria, plants, fungi, metazoa, and protists.
Ensembl is a joint project between the EBI (European Bioinformatics Institute) and the
Wellcome Trust Sanger Institute that annotates chordate genomes (i.e. vertebrates and closely
related invertebrates with a notochord, such as the sea squirt). Gene sets from model organisms
(e.g. yeast, fruitfly and worm) are also imported for comparative analysis by the Ensembl
Comparative Genomics team. Most annotation is updated every two months, leading to
increasing Ensembl versions (such as version 73); however, the gene sets are determined less
frequently.

The picture below shows the Ensembl homepage for human. Links to the human karyotype,
to the previous human assembly and a summary of gene and genome information are found
on this index page.
[Figure: the Ensembl human homepage, with callouts for links to example features, news
features, the search box ("Search in Ensembl"), and genome information and statistics.]

Click on the different taxa to see their homepages. Each of them is colour-coded:

BioMart is a web interface that can extract information from the Ensembl databases and
present the user with a table of information, without the need for programming. It can be used
to output sequences or tables of genes along with gene positions (chromosome and base pair
locations), single nucleotide polymorphisms (SNPs), homologues, and other annotation in
HTML, text, or Microsoft Excel format. BioMart can also translate one type of ID to another,
identify genes associated with InterPro domains or Gene Ontology (GO) terms, export gene
expression data, and more.
Advantages of Ensembl
• View genes with other annotation along the chromosome;
• View alternative transcripts (i.e. splice variants) for a given gene;
• Explore homologues and phylogenetic trees across more than 65 chordate species for any
gene;
• Compare whole genome alignments and conserved regions across species;
• View microarray sequences that match Ensembl genes;
• View ESTs, clones, mRNA and proteins for any chromosomal region;
• Examine single nucleotide polymorphisms (SNPs) for a gene or chromosomal region;
• View SNPs across strains (rat, mouse) and human populations;
• View positions and sequences of mRNAs and proteins that align with an Ensembl gene;
• Display your own data on the Ensembl browser;
• Use BLAST or BLAT against any Ensembl genome;
• Export sequence or create a table of gene information with BioMart;
• Use the Variant Effect Predictor;
• Share Ensembl views with your colleagues and collaborators.

How to process and access data?
Go to www.ensembl.org and click on the human icon to open the human home page.

Type ESPN into the search bar and click the Go button.

Click on Gene and Human to find the hits.

One gene matches the query (ESPN) in human. Links to the Gene tab (from the HGNC official
name and Ensembl gene ID), Location tab and Variation table are provided.

Let’s view the genomic region in which this gene and its transcript are located by clicking on
the 1:6484848-6521430 link. The Location tab should open.
The first image shows an overview of the chromosome where the human ESPN gene has been
mapped. In this image, you can see the G-banding pattern of the chromosome as well as the
regions that correspond to the patches and haplotypes of human chromosome 1.
The second image shows a 1Mb region around the ESPN gene. This view allows scrolling back
and forth along the chromosome.

BLAST/BLAT
BLAST [1] and BLAT [2] are sequence similarity search tools that can be used for both DNA
and proteins. BLAT is the default tool in Ensembl due to its faster speed. See other differences
between BLAT and BLAST on our FAQ page.

How to run BLAST/BLAT

1) Entering a query sequence

Paste in a sequence (the suggested format is FASTA) or upload a sequence as a file. Up to 30
sequences can be added. If inputting multiple sequences, make sure a header (for example a
FASTA header) separates each one.

2) Selecting the species to search against

You can perform multiple similarity searches at once by choosing different genomes and
adding them to your list of species.

Click on 'Add/Remove Species' to open the Species Selector box. If you start typing the species
name you wish to add, the search box will auto-fill with matches. Selecting these species will
add them to your BLAST search. Alternatively, you can click on the species divisions (in green)
to browse and select (by checking the boxes) any of the available species in Ensembl. The
selected species will appear on the right side of the species selector box. To remove a species,
click the (-) button to the right of its name. Once you have selected all the species you wish to
run through BLAST/BLAT, click Apply to return to the query page.

The databases available for similarity searches are DNA and protein target databases.

DNA databases

 Genomic sequence

Repetitive and/or low-complexity regions are not masked

 Genomic sequence (hard masked)

Genomic sequences have been run through the RepeatMasker program and repetitive and/or
low complexity regions have been masked as Ns

 Genomic sequence (soft masked)

Genomic sequences have been run through the RepeatMasker program and repetitive and/or
low complexity regions have been masked as lower case letters

 cDNAs (transcripts/splice variant)

 Ab-initio cDNAs (Genscan/SNAP)

Predictions based on the sequence alone, therefore not supported by experimental evidence

 Ensembl Non-coding RNA genes

Protein databases

 Proteins (GENCODE/Ensembl)

 Ab-initio peptides (Genscan/SNAP)
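As the database descriptions above note, hard masking replaces repeats with Ns while soft masking lower-cases them. Converting a soft-masked sequence to its hard-masked equivalent is therefore a one-line transformation (a sketch of the relationship between the two formats, not what RepeatMasker itself does):

```python
def soft_to_hard_mask(seq):
    """Replace every soft-masked (lower-case) base with N, turning a
    soft-masked sequence into its hard-masked equivalent."""
    return "".join("N" if base.islower() else base for base in seq)

hard = soft_to_hard_mask("ACGTacgtACGT")  # "ACGTNNNNACGT"
```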

3) Selecting the search tool

The following options are available:

 BLAT: nucleotide sequences against nucleotide databases

 BLASTN: nucleotide sequences against nucleotide databases

 TBLASTX: translated nucleotide sequences against a translated nucleotide database

 BLASTX: translated nucleotide sequences against protein databases

 BLASTP: amino acid sequences against protein databases

 TBLASTN: amino acid sequences against translated nucleotide databases

Pre-Configured Sets

For BLAST searches, you can change the 'Search Sensitivity' from normal to the following:

Near match (to find closer matches; more stringent settings than 'normal')
Short sequences (for short sequences like primers; BLASTN only)
Distant homologies (to allow lower-scoring pairs to pass through)

Specific parameters for these configurations can be found by expanding the Configuration
options. Alternatively, change the configuration options to customise your own BLAST search.

Running the job

You can give a name or description to this BLAST or BLAT search in the Description
(optional) field.

Once your parameters are set, click RUN to start the search.

4) Recent BLAST tickets

The table lists jobs that are currently running or recently completed. A ticket ID is assigned to
each job, and additional information is provided, i.e. Analysis, Jobs and Submitted at (date and
time). You can customise the table by showing/hiding columns.

The progression of the job gets automatically refreshed every 10 seconds until the job is fully
completed.

You can view the results by clicking on the ticket number or on the link View results.

5) Results

The results are displayed in three different sections:

A) Job details

Details of the job include job name, species, search type (e.g. BLAT), sequence, query and
database types and configuration settings.

Click on the title or (-) to collapse the Job details section.

B) Results table

It can be viewed on the page or downloaded as a file.

This table lists all hits in order of high to low score (and E-value) but it can be customised to
show/hide columns. The results can be sorted by any parameters available in the table.
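The ordering described above (hits ranked from high to low score, with E-value breaking ties) can be reproduced on any list of hit records; the field names below are illustrative, not Ensembl's actual output schema:

```python
# Illustrative hit records (not Ensembl's real output fields)
hits = [
    {"id": "hit1", "score": 250, "evalue": 1e-60},
    {"id": "hit2", "score": 980, "evalue": 0.0},
    {"id": "hit3", "score": 250, "evalue": 1e-58},
]

# Rank by score descending, then by E-value ascending for ties
ranked = sorted(hits, key=lambda h: (-h["score"], h["evalue"]))
order = [h["id"] for h in ranked]  # ["hit2", "hit1", "hit3"]
```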

Hover over the links provided in the results table and click on them to get:

 Genomic location: shows the BLAST/BLAT hit on the Region in detail view in the Location
tab of the Ensembl Genome Browser. The BLAST/BLAT hit will appear as a red bar along the
genome. You may want to click on the red bar to view a summary of the search, including E-
Value, %ID, etc.

 Sequence: shows the genomic sequence or query sequence

 Alignment: shows the BLAST/BLAT alignment

Click on the title or (-) to collapse the Results table.

C) HSP distribution on genome

High-scoring segment pair (HSP) is a local alignment with no gaps that achieves one of the
highest alignment scores in a given search. It corresponds to the matching region between the
query and the database hit sequence.
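For two sequences already aligned without gaps, the highest-scoring segment pair can be found with a maximum-subarray scan over per-column scores. This toy Python sketch (match +1, mismatch -1, an arbitrary scoring scheme) illustrates what an HSP is, not BLAST's actual algorithm:

```python
def best_hsp(query, subject, match=1, mismatch=-1):
    """Best-scoring ungapped segment pair of two equal-length aligned
    sequences, via a maximum-subarray (Kadane) scan over per-column
    scores. Returns (score, (start, end)) with a half-open span."""
    scores = [match if q == s else mismatch for q, s in zip(query, subject)]
    best_score, best_span = 0, (0, 0)
    cur_score, cur_start = 0, 0
    for i, sc in enumerate(scores):
        if cur_score <= 0:               # restart the candidate segment here
            cur_score, cur_start = sc, i
        else:
            cur_score += sc
        if cur_score > best_score:
            best_score, best_span = cur_score, (cur_start, i + 1)
    return best_score, best_span

score, span = best_hsp("AAAA", "TTAA")   # the matching segment is positions 2..4
```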

The HSP distribution can be visualised on the karyotype (if the karyotype is available for a
given species) and the hits are represented as arrows (the best hit is represented in a box).

Click on the arrows for a pop-up window with a summary of BLAST/BLAT hits such as
Genomic location (bp), Score, E-value, etc for all target features available. Links to the
Alignment (A), Query sequence (S) and Genomic sequence (G) are also available.

Click on the title or (-) to collapse the karyotype image.

D) HSP distribution on query sequence

The HSP distribution can be visualised on the query, which is shown as a chain of black and
white boxes. Fragments of the query sequence that hit other places in the genome are shown as
red boxes (click on those for more information). Usually these fragments are small (they vary
between 100-200 nt) and map to various locations. These sequences are of low complexity,
such as repetitive sequences.

Click on the title or (-) to collapse the query sequence image.

6) Configuration Options

General options

A) Maximum number of hits to report:

Number of database hits that are displayed. The actual number of alignments may be greater
than this. It ranges from 10 to 100000, and the default is 100.

B) Maximum E-value for reported alignments:

Only hits with E-values lower than the selected threshold are reported. The threshold ranges
from 1e-200 to 1000, and the default value is 1e-1.

C) Word size for seeding alignments:

This option is available for BLAST searches only, not BLAT. It is the length of the seed that
initiates an alignment between the query and the target sequences. It ranges from 2 to 15; the
default is 11 (nucleotides) for DNA and 3 (residues) for protein.
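The defaults and ranges described in this section can be captured in a small configuration sketch (illustrative only; these are not names from Ensembl's own code):

```python
# Defaults as described above; a sketch, not Ensembl's implementation
DEFAULTS = {"max_hits": 100, "max_evalue": 1e-1}

def default_word_size(query_type):
    """Word size default: 11 nucleotides for DNA, 3 residues for protein."""
    return 11 if query_type == "dna" else 3

def validate_word_size(value):
    """Word size must lie in the documented 2..15 range (BLAST only)."""
    if not 2 <= value <= 15:
        raise ValueError("word size must be between 2 and 15")
    return value
```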

BioMart
BioMart is an easy-to-use web-based tool that allows extraction of data without any
programming knowledge or understanding of the underlying database structure. You can
navigate through the BioMart web interface using the left panel. Filters and attributes can be
selected in the right panel. A summary of your choices is also displayed in the left panel.

1. Select a mart database (a type of data)

First, select a mart database which will correspond to the type of data you are interested in. In
Ensembl, you can choose data from one of our four mart databases:

 Ensembl Genes: This mart contains the Ensembl gene set and allows you to retrieve Ensembl

genes, transcripts and proteins as well as external references, microarrays, protein domains,

structure, sequences, variants (only variants mapped to Ensembl Transcripts) and homology

data.

 Ensembl Variation: This mart allows you to retrieve germline and somatic variants as well as

germline and somatic structural variants. This mart also contains variants' phenotypes,

citations, synonyms, consequences and flanking sequences; you can also retrieve Ensembl

genes, transcripts, regulatory and motif features mapped to variants.

 Ensembl Regulation: This mart allows you to retrieve regulatory features, evidence and

segments, miRNA target regions, binding motifs and other regulatory regions.

 Vega: This mart contains the Ensembl Vega gene set (manual annotation coming from Havana)

and allows you to retrieve Ensembl Vega genes, transcripts and proteins as well as external

references, structures, sequences and protein domains.

2. Select a mart dataset (a species)

Next, select a mart dataset which corresponds to the species you are interested in and want to
retrieve data from.

The marts are available for the following species:

 Ensembl Genes: All the Ensembl species, full list on the species page.

 Ensembl Variation: We only have variation data available for a subset of our species; the full
list is on the variation species page.

 Ensembl Regulation: We have regulation data available for human, mouse and fruitfly.

 Vega: We have data available for human, mouse, rat, pig and zebrafish.

3. Filter your mart query (query restriction and input data)

BioMart allows you to restrict your query with information that you know, e.g. by inputting a list
of IDs or restricting the query to a region. You can access the filter page by clicking on the
"Filters" button located on the left panel. Filters are organised into different sections; clicking
on the "+/-" boxes will expand/collapse a section and display its content.

In the following image, I have expanded the "Region" section by clicking on the "+" box as
indicated by the number 1. Then, I have selected "1" in the chromosome list to restrict my
query to all the genes located on human chromosome 1. The left panel now displays
"Chromosome: 1" as indicated by the number 2. Clicking on the "Count" button as indicated
by number 3 will display the number of Ensembl genes that the query will return next to
"Dataset" as indicated by the number 4. This means that the query will return 5406 Ensembl genes
located on human chromosome 1.

4. Select mart attributes (desired output)

By clicking on the "Attributes" button on the left panel, you will access the mart attribute page.
This page allows you to select your desired output; the default output is "Ensembl Gene ID"
and "Ensembl Transcript ID" in the Ensembl Genes mart. The attributes are organised in pages
that you can access by selecting the radio buttons as displayed by the red box on the image
below.

5. Display and retrieve your query

Clicking on the "Results" button will bring you to the mart result page. This page will, by
default, show you a preview of the first 10 results of your query in HTML format. The number
of results previewed and format can be changed as indicated by number 1 on the image below.
You can also automatically remove all the duplicated results from your query by clicking on
the "Unique results only" button as indicated by number 2. If you are happy with your query,
you can use the "Export all results to" section as indicated by number 3 (select a format and
click on the "Go" button) to download your results.

You can also request the results of your query to be sent to you by email. To do this, select
"Compressed web file (notify by email)" in the first dropdown in the red box on the image
below, then select your desired format, type your email address in the "Email notification to"
box and press the "Go" button. BioMart will compile the result of your query in the background
and send you a link to the compressed file by email.
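The same kind of query built above through the web interface can also be expressed programmatically. The sketch below only builds the XML string that BioMart's martservice endpoint accepts (no request is sent); the dataset, filter and attribute names shown are assumptions based on the Ensembl Genes mart and may differ between releases.

```python
from urllib.parse import urlencode

def biomart_query_xml(dataset, filters, attributes):
    """Build a BioMart XML query string (TSV output, unique rows)."""
    filter_xml = "".join(
        f'<Filter name="{name}" value="{value}"/>' for name, value in filters.items()
    )
    attr_xml = "".join(f'<Attribute name="{name}"/>' for name in attributes)
    return (
        '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
        '<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1">'
        f'<Dataset name="{dataset}" interface="default">{filter_xml}{attr_xml}</Dataset>'
        "</Query>"
    )

# Human genes on chromosome 1, returning gene and transcript IDs --
# the programmatic equivalent of the filter/attribute choices above.
xml = biomart_query_xml(
    "hsapiens_gene_ensembl",
    {"chromosome_name": "1"},
    ["ensembl_gene_id", "ensembl_transcript_id"],
)
# The query would then be sent to the martservice endpoint, e.g.:
url = "https://www.ensembl.org/biomart/martservice?" + urlencode({"query": xml})
print(url[:80])
```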

BioMart tutorials, multiple dataset query, Perl API, RESTful and Bio

UCSC
History

Initially built in 2000 by Jim Kent, then a graduate student, and David Haussler, professor of
Computer Science (now Biomolecular Engineering) at the University of California, Santa Cruz,
and still managed by them, the UCSC Genome Browser began as a resource for the distribution
of the initial fruits of the Human Genome Project. Funded by the Howard Hughes
Medical Institute and the National Human Genome Research Institute, NHGRI (one of the US
National Institutes of Health), the browser offered a graphical display of the first full-
chromosome draft assembly of human genome sequence. Today the browser is used by
geneticists, molecular biologists and physicians as well as students and teachers of evolution
for access to genomic information.

The University of California Santa Cruz (UCSC) Genome Bioinformatics website consists of
a suite of free, open-source, online tools that can be used to browse, analyze, and query genomic
data. These tools are available to anyone who has an Internet browser and an interest in
genomics. The website provides a quick and easy-to-use visual display of genomic data. It
places annotation tracks beneath genome coordinate positions, allowing rapid visual
correlation of different types of information. Many of the annotation tracks are submitted by
scientists worldwide; the others are computed by the UCSC Genome Bioinformatics group
from publicly available sequence data. It also allows users to upload and display their own
experimental results or annotation sets by creating a custom track. The suite of tools,
downloadable data files, and links to documentation and other information can be found at
http://genome.ucsc.edu/.

On June 22, 2000, UCSC and the other members of the International Human Genome Project
consortium completed the first working draft of the human genome assembly, forever ensuring
free public access to the genome and the information it contains. A few weeks later, on July 7,
2000, the newly assembled genome was released on the web at http://genome.ucsc.edu, along
with the initial prototype of a graphical viewing tool, the UCSC Genome Browser. It is an
interactive website offering access to genome sequence data from a variety of vertebrate and
invertebrate species and major model organisms, integrated with a large collection of aligned
annotations.

Genomes

UCSC Genomes

In the years since its inception, the UCSC Browser has expanded to accommodate genome
sequences of all vertebrate species and selected invertebrates for which high-coverage genomic
sequence is available,[5] now including 46 species.

Genomes

great apes: human, baboon, bonobo, chimpanzee, gibbon, gorilla, orangutan

non-ape primates: bushbaby, marmoset, mouse lemur, rhesus macaque, squirrel monkey, tarsier,
tree shrew

non-primate mammals: mouse, alpaca, armadillo, cat, Chinese hamster, cow, dog, dolphin,
elephant, ferret, guinea pig, hedgehog, horse, kangaroo rat, manatee, Minke whale, naked
mole-rat, opossum, panda, pig, pika, platypus, rabbit, rat, rock hyrax, sheep, shrew, sloth,
squirrel, Tasmanian devil, tenrec, wallaby, white rhinoceros

non-mammal chordates: American alligator, Atlantic cod, budgerigar, chicken, coelacanth,
elephant shark, Fugu, lamprey, lizard, medaka, medium ground finch, Nile tilapia, painted
turtle, stickleback, Tetraodon, turkey, Xenopus tropicalis, zebra finch, zebrafish

invertebrates: Caenorhabditis spp. (5), Drosophila spp. (11), Ebola virus, honey bee, lancelet,
mosquito, P. pacificus, sea hare, sea squirt, sea urchin, yeast

The UCSC Genome Browser presents a diverse collection of annotation datasets (known as
"tracks" and presented graphically), including mRNA alignments, mappings of DNA repeat
elements, gene predictions, gene-expression data, disease-association data (representing the

relationships of genes to diseases), and mappings of commercially available gene chips (e.g.,
Illumina and Agilent). The basic paradigm of display is to show the genome sequence in the
horizontal dimension, and show graphical representations of the locations of the mRNAs, gene
predictions, etc. Blocks of color along the coordinate axis show the locations of the alignments
of the various data types. The ability to show this large variety of data types on a single
coordinate axis makes the browser a handy tool for the vertical integration of the data.
One unique and useful feature that distinguishes the UCSC Browser from other genome
browsers is the continuously variable nature of the display. Sequence of any size can be
displayed, from a single DNA base up to an entire chromosome (human chr1 = 245 megabases,
Mb) with full annotation tracks. Researchers can display a single gene, a single exon, or
an entire chromosome band, showing dozens or hundreds of genes and any combination of the
many annotations. A convenient drag-and-zoom feature allows the user to choose any region
in the genome image and expand it to occupy the full screen.
Researchers may also use the browser to display their own data via the Custom Tracks tool.
This feature allows users to upload a file of their own data and view the data in the context of
the reference genome assembly. Users may also use the data hosted by UCSC, creating subsets
of the data of their choosing with the Table Browser tool (such as only the SNPs that change
the amino acid sequence of a protein) and display this specific subset of the data in the browser
as a Custom Track.
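As an illustration of the Custom Tracks mechanism, the sketch below renders a minimal custom track: one track definition line followed by BED lines (chromosome, 0-based start, end, name). The track name, description and features are invented for the example, and only a small subset of the supported track/BED fields is shown.

```python
def make_custom_track(name, description, features):
    """Render a minimal UCSC custom track as text: a 'track' line, then
    one tab-separated BED line (chrom, 0-based start, end, name) per
    feature. The result can be saved to a file and uploaded via the
    Custom Tracks page."""
    lines = [f'track name="{name}" description="{description}" visibility=2']
    for chrom, start, end, label in features:
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines)

# Two made-up regions on chromosome 1.
track = make_custom_track(
    "myPeaks", "Example regions on chr1",
    [("chr1", 1000, 5000, "regionA"), ("chr1", 8000, 9000, "regionB")],
)
print(track)
```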

Tracks

UCSC Genome Browser Tracks


Below the displayed image of the UCSC Genome browser are nine categories of additional
tracks that can be selected and displayed alongside the original data. These categories are
Mapping and Sequencing, Genes and Gene Predictions, Phenotype and Literature, mRNA and
EST, Expression, Regulation, Comparative Genomics, Variation, and Repeats.

Mapping and Sequencing: allows control over the style of sequencing displayed.
Example tracks: Base Position, Alt Map, Gap

Genes and Gene Predictions: which programs to predict genes with and which databases to
display known genes from. Example tracks: GENCODE v24, Geneid Genes, Pfam in UCSC Gene

Phenotype and Literature: databases containing specific styles of phenotype data.
Example tracks: OMIM Alleles, Cancer Gene Expr Super-track

mRNA and EST: access to mRNAs and ESTs for human-specific searches or general all-purpose
searches. Example tracks: Human ESTs, Other ESTs, Other mRNAs

Expression: display unique expressions of predetermined sequences.
Example tracks: GTEx Gene, Affy U133

Regulation: information relevant to regulation of transcription from different studies.
Example tracks: ENCODE Regulation Super-track Settings, ORegAnno

Comparative Genomics: allows the comparison of the searched sequence with other groups of
animals with sequenced genomes. Example tracks: Conservation, Cons 7 Verts, Cons 30 Primates

Variation: compares the searched sequence with known variations.
Example tracks: Common SNPs(150), All SNPs(146), Flagged SNPs(144)

Repeats: allows tracking of different kinds of repeated sequences in the query.
Example tracks: RepeatMasker, Microsatellite, WM + SDust

Mapping and Sequencing


These tracks allow for user control over the display of genomic coordinates, sequences, and
gaps. Researchers can select the tracks that best represent their query, so that the most
applicable data are displayed for the type and depth of research being done. The mapping and
sequencing tracks can also display a percentage-based track to show a researcher whether a
particular genetic element is more prevalent in the specified area.
Genes and Gene Predictions
The gene and gene predictions tracks control the display of genes and their subsequent parts.
The different tracks allow the user to display gene models, protein coding regions, and non-
coding RNA as well as other gene related data. There are numerous tracks available allowing
researchers to quickly compare their query with pre-selected sets of genes to look for
correlations between known sets of genes.
Phenotype and Literature
Phenotype and Literature tracks deal with phenotypes directly linked with genes as well as
genetic phenotypes. These tracks are intended primarily for physicians and other professionals
concerned with genetic disorders, for genetics researchers, and for advanced students in science
and medicine. A researcher can also display a track that shows the genomic positions of natural
and artificial amino acid variants.
mRNA and EST
These tracks are related to expressed sequence tags and messenger RNA. ESTs are single-read
sequences, typically about 500 bases in length, that usually represent fragments of transcribed
genes. The mRNA tracks allow the display of mRNA alignment data in humans as well as
other species. There are also tracks allowing comparison with regions of ESTs that show signs
of splicing when aligned with the genome.
Expression
Expression tracks are used to relate genetic data with the tissue areas it is expressed in. This
allows a researcher to discover if a particular gene or sequence is linked with various tissues
throughout the body. The expression tracks also allow for displays of consensus data about the
tissues that express the query region.
Regulation
The regulation tracks of the UCSC Genome browser are a category of tracks that control the
representation of promoter and control regions within the genome. A researcher can adjust the
regulation tracks to add a display graph to the genome browser. These displays allow for more
detail about regulatory regions, transcription factor binding sites, RNA binding sites, regulatory
variants, haplotypes, and other regulatory elements.
Comparative Genomics
The UCSC Genome Browser allows the user to display different kinds of conservation data.
The user can select from different tracks including primates, vertebrates, mammals among
others, and see how the gene sequence they searched is conserved amongst other species. The
comparative alignments give a graphical view of the evolutionary relationships among species.
This makes it a useful tool both for the researcher, who can visualize regions of conservation
among a group of species and make predictions about functional elements in unknown DNA
regions, and in the classroom as a tool to illustrate one of the most compelling arguments for
the evolution of species. The 44-way comparative track on the human assembly clearly shows
that the farther one goes back in evolutionary time, the less sequence homology remains, but
functionally important regions of the genome (e.g., exons and control elements, but not introns
typically) are conserved much farther back in evolutionary time.

Variation data
Many types of variation data are also displayed. For example, the entire contents of each release
of the dbSNP database from NCBI are mapped to human, mouse and other genomes. This
includes the fruits of the 1000 Genomes Project, as soon as they are released in dbSNP. Other
types of variation data include copy-number variation data (CNV) and human population allele
frequencies from the HapMap project.
Repeats
The repeat tracks of the genome browser allow the user to see a visual representation of the
DNA areas with low complexity repetitions. Being able to visualize repetitions in a sequence
allows for quick inferences about a search query in the genome browser. A researcher has the
potential to quickly see that their specified search contains large amounts of repeated sequences
at a glance and adjust their search or track displays accordingly.

MUMmer
MUMmer is a bioinformatics software system for sequence alignment. It is based on
the suffix tree data structure and is one of the fastest and most efficient systems available for
this task, enabling it to be applied to very long sequences. It has been widely used for
comparing different genomes to one another. In recent years it has become a popular algorithm
for comparing genome assemblies to one another, which allows scientists to determine how a
genome has changed after adding more DNA sequence or after running a different genome
assembly program. The acronym "MUMmer" comes from "Maximal Unique Matches", or
MUMs. The original algorithms in the MUMmer software package were designed by Art
Delcher, Simon Kasif and Steven Salzberg. MUMmer was the first whole-genome comparison
system developed in bioinformatics. The system is maintained primarily by Steven Salzberg
and Arthur Delcher at the Center for Computational Biology at Johns Hopkins University. It
was originally applied to the comparison of two related strains of bacteria. The MUMmer
software is open source and can be found at the MUMmer home page, which also has links to
technical papers describing the system.

Overview
MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form.
For example, MUMmer 3.0 can find all 20-basepair or longer exact matches between a pair of
5-megabase genomes in 13.7 seconds, using 78 MB of memory, on a 2.4 GHz Linux desktop
computer. MUMmer can also align incomplete genomes; it can easily handle the hundreds or
thousands of contigs from a shotgun sequencing project, and will align them to another set of
contigs or a genome using the NUCmer program included with the system. If the species are too
divergent for a DNA sequence alignment to detect similarity, then the PROmer program can
generate alignments based upon the six-frame translations of both input sequences. There are a
number of versions of MUMmer, such as the original MUMmer 1.0, version 2.1, MUMmer 3.0
and a GPU-accelerated version called MUMmerGPU.

Examples

1. mummer

mummer is a suffix tree algorithm designed to find maximal exact matches of some minimum
length between two input sequences. The match lists produced by mummer can be used alone
to generate alignment dot plots, or can be passed on to the clustering algorithms for the
identification of longer non-exact regions of conservation. These match lists have great
versatility because they contain huge amounts of information and can be passed forward to
other interpretation programs for clustering, analysis, searching, etc.
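The definition of a maximal unique match can be made concrete without suffix trees (which are what make MUMmer fast). The brute-force toy below is not MUMmer's algorithm, only a quadratic sketch of the definition: substrings that occur exactly once in each sequence and cannot be extended on either side.

```python
def maximal_unique_matches(ref, qry, min_len=3):
    """Brute-force illustration of the MUM definition. Returns
    (ref_pos, qry_pos, length) tuples for substrings of length >= min_len
    that occur exactly once in each sequence and are not extendable."""
    def occurrences(s, sub):
        return [i for i in range(len(s) - len(sub) + 1) if s[i:i + len(sub)] == sub]

    mums = []
    for length in range(min_len, min(len(ref), len(qry)) + 1):
        for i in range(len(ref) - length + 1):
            sub = ref[i:i + length]
            r_occ, q_occ = occurrences(ref, sub), occurrences(qry, sub)
            if len(r_occ) == 1 and len(q_occ) == 1:
                j = q_occ[0]
                extend_left = i > 0 and j > 0 and ref[i - 1] == qry[j - 1]
                extend_right = (i + length < len(ref) and j + length < len(qry)
                                and ref[i + length] == qry[j + length])
                if not extend_left and not extend_right:
                    mums.append((i, j, length))
    return mums

# "ACT" occurs once in each sequence and cannot be extended either way.
print(maximal_unique_matches("GGACTT", "CCACTG"))
# → [(2, 2, 3)]
```

MUMmer finds the same matches in time linear in the sequence lengths, which is what makes it usable on whole genomes.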

In the following sections, a short example is given that demonstrates how to use mummer. This
example compares a single query sequence to a single reference sequence using mummer, and
then uses mummerplot to generate a dot plot representation of the comparison.

The following input files will be used to demonstrate this example:

H_pylori26695_Eslice.fasta

H_pyloriJ99_Eslice.fasta

The following output files will be generated by this example:

mummer.gp

mummer.mums

mummer.fplot

mummer.rplot

mummer.ps

1.1. Running mummer

mummer can handle multiple reference and multiple query sequences; however, a dotplot of
more than two sequences can be confusing, so for this example we will be dealing with a single
reference and a single query sequence.

mummer -mum -b -c H_pylori26695_Eslice.fasta H_pyloriJ99_Eslice.fasta > mummer.mums

This command will find all maximal unique matches (-mum) between the reference and query
on both the forward and reverse strands (-b) and report all the match positions relative to the
forward strand (-c). Output is to stdout, so we will redirect it into a file named mummer.mums.
This file lists all of the MUMs of the default length or greater between the two input sequences.

1.2. Running mummerplot

A dotplot of all the MUMs between two sequences can reveal their macroscopic similarity.

mummerplot -x "[0,275287]" -y "[0,265111]" -postscript -p mummer mummer.mums

This command will plot all of the MUMs in the mummer.mums file in postscript format (-
postscript) between the given ranges for the X and Y axes. When plotting mummer output, it
is necessary to use the lengths of the input sequences to set the plot ranges, otherwise the plot
will be automatically scaled around the minimum and maximum data points. The four output
files are prefixed by the string specified with the -p option. The .fplot and .rplot files contain the
data points, mummer.gp is a gnuplot script for plotting those data points, and
mummer.ps is the postscript plot generated by the gnuplot script. Below, you can see the
mummer.ps file displayed with ghostview. Note that with newer versions of mummerplot the
color and thickness of the plot lines may be different.

mummer postscript plot

Most image manipulation programs can edit the postscript output, or it can be sent directly to
a printer with the lpr command. If you would rather use the default terminal for gnuplot, simply
remove the -postscript option from the mummerplot call.

1.3. Viewing the output

mummerplot example

The above postscript plot represents the set of all MUMs between the two input sequences used
in this example. Forward MUMs are plotted as red lines/dots while reverse MUMs are plotted
as green lines/dots (blue may be used for reverse matches in newer versions). A line of dots
with slope == 1 represents an undisturbed segment of conservation between the two sequences,
while a line of slope == -1 represents an inverted segment of conservation between the two
sequences. The green segment in the upper left quadrant of the graph shows both an inversion
and translocation, as it is of negative slope and inconsistently located relative to the rest of the
plot which falls on a line approximated by f(x) = x. However the green segment in the upper
right quadrant of the graph shows only an inversion, as it is of negative slope but is consistent
in location with the rest of the plot. Generally, the closer a plot is to an imaginary line f(x) = x
(or -x) the fewer macroscopic differences exist between the two sequences.
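The slope-based reading of the plot can be expressed in code. The toy classifier below is an illustration of that interpretation, not part of MUMmer; the tuple format (ref_start, qry_start, length, strand) and the offset threshold are invented for the example.

```python
def classify_mums(mums, offset_tol=1000):
    """Label each MUM relative to the main diagonal f(x) = x + c.
    Forward-strand ('+') matches lie on slope 1 (collinear); reverse
    ('-') matches lie on slope -1 (inverted). Matches whose offset
    qry_start - ref_start is far from the dominant forward offset are
    additionally flagged as translocated."""
    forward = [m for m in mums if m[3] == "+"]
    # Estimate the main diagonal from the longest forward match.
    base = max(forward, key=lambda m: m[2])
    main_offset = base[1] - base[0]
    report = []
    for ref_s, qry_s, length, strand in mums:
        kind = "collinear" if strand == "+" else "inverted"
        if abs((qry_s - ref_s) - main_offset) > offset_tol:
            kind += "+translocated"
        report.append((ref_s, qry_s, kind))
    return report

mums = [(0, 0, 5000, "+"), (6000, 6100, 4000, "+"),
        (12000, 12050, 800, "-"),    # reversed in place: inversion only
        (20000, 45000, 900, "-")]    # reversed and relocated
print(classify_mums(mums))
# → [(0, 0, 'collinear'), (6000, 6100, 'collinear'),
#    (12000, 12050, 'inverted'), (20000, 45000, 'inverted+translocated')]
```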

2. nucmer

nucmer is MUMmer's most user-friendly alignment script for standard DNA sequence
alignment. It is a robust pipeline that allows for multiple reference and multiple query
sequences to be aligned in a many vs. many fashion. For instance, a very common use for
nucmer is to determine the position and orientation of a set of sequence contigs in relation to a
finished or draft genome. It is equally useful in comparing two finished sequences, or two
assemblies of the same genome to one another.

In the following sections, a short example is given that demonstrates how to use
nucmer. This example aligns a set of draft sequence contigs to a finished sequence using
nucmer; displays the alignment coordinates using show-coords; and tiles them across the
reference using show-tiling.

The following input files will be used to demonstrate this example:

B_anthracis_Mslice.fasta

B_anthracis_contigs.fasta

The following output files will be generated by this example:

nucmer.coords

nucmer.delta

nucmer.snps

nucmer.tiling

2.1. Running nucmer

Like mummer, nucmer can handle multiple reference and query sequences; however, this
example will demonstrate the alignment of multiple query sequences to a single reference. We
will align a number of B. anthracis draft contigs to the final assembly.

nucmer -mumreference -c 100 -p nucmer B_anthracis_Mslice.fasta B_anthracis_contigs.fasta

The -mumreference option can be omitted as this is the default; this reports all matches that are
maximal in length and that are unique (after extending to maximal length) in the reference
genome. You could instead use -maxmatch, which reports all matches regardless of uniqueness,
but be very careful with this option: if you have N repeats in a genome, then you will be
instructing nucmer to report N^2 alignments (all versus all), which you probably don't want.
The minimum cluster size here is bumped up to 100 (-c 100). The two output files are prefixed
by the string specified with the -p option. nucmer.delta is an encoded file that represents the
alignment between the two inputs. At this stage, the alignment of the two inputs is complete,
however it is necessary to parse the nucmer.delta file with the provided utilities in order to
extract useful information from the comparison.

2.2. Running show-coords

To view a summary of all the alignments produced by NUCmer, we can run the nucmer.delta
file through the show-coords utility.

show-coords -r -c -l nucmer.delta > nucmer.coords

This command will list the coordinates, percent identities and other useful statistics of each
alignment in a table. Each line of the table represents an individual pairwise alignment, and
each line is sorted by its starting reference coordinate (-r). Additional information, like
alignment coverage (-c) and sequence length (-l) can be added to the table with the appropriate
options. Output is to stdout, so we have redirected it into the file, nucmer.coords.
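A row of that table can be pulled apart with a few lines of code. The sketch below assumes the column layout produced by show-coords -r -c -l ([S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | ...); check the header of your own nucmer.coords file, since the exact columns depend on the options used.

```python
def parse_coords_line(line):
    """Parse one data row of a show-coords table into a dict, treating
    the '|' column separators as whitespace. Assumed layout:
    [S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | ..."""
    fields = line.replace("|", " ").split()
    return {
        "ref_start": int(fields[0]), "ref_end": int(fields[1]),
        "qry_start": int(fields[2]), "qry_end": int(fields[3]),
        "aln_len_ref": int(fields[4]), "aln_len_qry": int(fields[5]),
        "pct_identity": float(fields[6]),
    }

# An invented but representative row.
row = parse_coords_line("  101  1100 |    1  1000 | 1000 1000 | 98.50 | ref_seq contig7")
print(row["ref_start"], row["qry_end"], row["pct_identity"])
# → 101 1000 98.5
```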

2.3. Running show-snps

To view a summary of all the SNPs and indels between the two sequence sets, we need to run
the nucmer.delta file through the show-snps utility.

show-snps -C nucmer.delta > nucmer.snps

This will generate a report of all the SNPs internal to the alignments contained in the
nucmer.delta file. Each line of the table represents a single mismatch in the pairwise alignment.
With the -C option, only SNPs from uniquely aligned regions will be reported. Additional
information can be added or removed with the command line switches described in the manual.
Output is to stdout, so we have redirected it into the file, nucmer.snps.

2.4. Running show-tiling

To produce a minimal tiling of contigs across the reference sequence, we need to run the
nucmer.delta file through the show-tiling utility.
show-tiling nucmer.delta > nucmer.tiling

This command will list the contigs and positions that generate the maximal alignment coverage
across the reference sequence using the fewest contigs possible. This output can aid the closure
of a draft genome when a closely related organism has already been finished.
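The "maximal coverage with the fewest contigs" idea can be sketched with the classic interval-cover greedy. This is only an illustration of the concept with invented data, not show-tiling's actual algorithm.

```python
def minimal_tiling(ref_len, alignments):
    """Greedy sketch of a minimal tiling path: given contig alignments as
    (start, end, name) in 1-based reference coordinates, repeatedly pick
    the contig that reaches farthest right while still touching the
    region covered so far."""
    tiling, covered = [], 0
    while covered < ref_len:
        # Contigs that start at or before the current coverage edge.
        candidates = [a for a in alignments if a[0] <= covered + 1 and a[1] > covered]
        if not candidates:
            break  # uncovered gap: no contig spans this position
        best = max(candidates, key=lambda a: a[1])
        tiling.append(best[2])
        covered = best[1]
    return tiling

# c3 is skipped: c2 covers the same region and reaches farther right.
print(minimal_tiling(100, [(1, 40, "c1"), (30, 80, "c2"),
                           (35, 60, "c3"), (75, 100, "c4")]))
# → ['c1', 'c2', 'c4']
```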

2.5. Viewing the output

nucmer and show-tiling output can both be viewed with mummerplot; however, these plots
would offer little additional information for this example.

3. promer

promer is a close relative of the NUCmer script. It follows the exact same steps as NUCmer
and even uses most of the same programs in its pipeline, with one exception: all matching and
alignment routines are performed on the six-frame amino acid translation of the DNA input
sequences. This gives promer much higher sensitivity than nucmer, because protein sequences
tend to diverge much more slowly than their underlying DNA sequences. Therefore, on the
same input sequences, promer may find many conserved regions that nucmer will not, simply
because the DNA sequence is not as highly conserved as the amino acid translation.
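The six-frame translation promer applies to both inputs before matching can be sketched as below (standard genetic code; stops shown as '*', incomplete trailing codons dropped). This illustrates the transformation only; promer's own translation and matching machinery is more involved.

```python
# Standard codon table, built from the classic TCAG layout.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"   # first base T
         "LLLLPPPPHHQQRRRR"   # first base C
         "IIIMTTTTNNKKSSRR"   # first base A
         "VVVVAAAADDEEGGGG")  # first base G
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(dna):
    """Translate one reading frame, dropping any incomplete final codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frame(dna):
    """Three forward frames, then three frames of the reverse complement."""
    rc = dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return [translate(seq[f:]) for seq in (dna, rc) for f in range(3)]

print(six_frame("ATGGCC"))
# → ['MA', 'W', 'G', 'GH', 'A', 'P']
```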

In the following sections, a short example is given that demonstrates how to use promer. This
example aligns a few query sequences to a single reference sequence using promer; displays the
alignment coordinates using show-coords; and prints a pairwise alignment of one of the contigs
using show-aligns.

The following input files will be used to demonstrate this example:

D_melanogaster_2Rslice.fasta

D_pseudoobscura_contigs.fasta

The following output files will be generated by this example:

promer.aligns

promer.coords

promer.delta

3.1. Running promer

Like mummer, promer can handle multiple reference and query sequences; however, it is most
commonly used to map a set of query sequences to a single reference sequence. This example
will demonstrate that functionality, as two D. pseudoobscura draft contigs will be mapped to
the final D. melanogaster assembly.

promer -p promer D_melanogaster_2Rslice.fasta D_pseudoobscura_contigs.fasta

Default parameters were used to align the two inputs; however, if the alignment is too sensitive
or not sensitive enough, the minimum match length and cluster sizes can be adjusted
accordingly. The two output files are prefixed by the string specified with the -p option.
promer.delta is an encoded file that represents the alignment between the two inputs. At this
stage, the alignment of the two inputs is complete, however it is necessary to parse the
promer.delta file with the provided utilities in order to extract useful information from the
comparison.

3.2. Running show-coords

To view a summary of all the alignments produced by PROmer, we need to run the promer.delta
file through the show-coords utility.

show-coords -r -c -l -L 100 -I 50 promer.delta > promer.coords

This command will list the coordinates, percent identities and other useful statistics of each
alignment in a table. Each line of the table represents an individual pairwise alignment, and
each line is sorted by its starting reference coordinate (-r). Additional information, like
alignment coverage (-c) and sequence length (-l) can be added to the table with the appropriate
options. And minimum length (-L) and minimum percent identity (-I) cutoffs can be specified
to reduce poor alignments. Output is to stdout, so we have redirected it into the file,
promer.coords.

3.3. Running show-aligns

To view all the pairwise alignments between two of the input sequences, we need to run the
promer.delta file through the show-aligns utility.

show-aligns promer.delta "D_melanogaster_2Rslice" "3214968" > promer.aligns

This command will print all of the pairwise alignments stored in the promer.delta file for the
sequences "D_melanogaster_2Rslice" and "3214968". Output is to stdout, so we have
redirected it into the file, promer.aligns. If the alignments do not fit within your screen width,
or you would like them to be printed on longer lines, the screen width can be adjusted with the
-w option. Since show-aligns only displays the alignments between two sequences, it will have
to be run separately for each desired pair of sequences.

3.4. Viewing the output

promer output can be viewed with mummerplot; however, these plots would offer little
additional information for this example.

4. run-mummer1

run-mummer1 is a legacy script from the original MUMmer1.0 release. It has been updated to
utilize the new suffix tree code of version 3.0, however all other programs called from this
script are identical to the original MUMmer release back in 1999. Even though it is an outdated
program, it still has some advantages over the newer alignment scripts (nucmer, promer,
run-mummer3). Like all of the alignment scripts, run-mummer1 is a three-step process: matching,
clustering and extension. However, unlike the newer alignment scripts, run-mummer1 uses the
gaps program for its clustering step. The gaps program does not allow for rearrangements like
mgaps; instead it finds the single longest increasing subset of matches across the full length of
both sequences. This makes it well suited for SNP and small indel identification between small
(< 10 Mbp), very similar sequences with few to no rearrangements.

In the following sections, a short example is given that demonstrates how to use run-mummer1.
This example aligns a single query sequence to a single reference sequence using run-
mummer1.

The following input files will be used to demonstrate this example:

H_pylori26695_Bslice.fasta

H_pyloriJ99_Bslice.fasta

The following output files will be generated by this example:

mummer1.align

mummer1.errorsgaps

mummer1.gaps

mummer1.out

4.1. Running run-mummer1

run-mummer1 is only suited for a single reference and query sequence that have few to zero
inversions or translocations. This example aligns two such sequences.

run-mummer1 H_pylori26695_Bslice.fasta H_pyloriJ99_Bslice.fasta mummer1

To adjust the minimum match length for the comparison, the user must manually edit the run-
mummer1 script. Output files are prefixed by the string specified at the end of the command
line call. mummer1.align displays the alignments of each gap between adjacent MUMs,
mummer1.errorsgaps lists each MUM and the number of errors between it and the previous
MUM, mummer1.gaps lists the ordered set of MUMs and the gap distance to the previous
MUM, and mummer1.out simply lists all of the MUMs greater than or equal to the minimum
match length.

4.2. Viewing the output

There are no visualization tools designed for run-mummer1 output. To view a MUM dotplot,
run mummer by itself on two individual sequences as demonstrated in the mummer
walkthrough section of this tutorial.

BIND: the Biomolecular Interaction Network Database
 The Biomolecular Interaction Network Database (BIND: http://bind.ca) archives
biomolecular interaction, complex and pathway information.
 A web based system is available to query, view and submit records. BIND continues to
grow with the addition of individual submissions as well as interaction data from the
PDB and a number of large-scale interaction and complex mapping experiments using
yeast two hybrid, mass spectrometry, genetic interactions and phage display.
 The Biomolecular Interaction Network Database (BIND) is designed to capture protein
function, defined at the molecular level as the set of other molecules with which a
protein interacts or reacts along with the molecular outcome.
 BIND stores information about interactions, molecular complexes and pathways.
Interactions occur between two biological ‘objects’, A and B, which could be protein,
RNA, DNA, molecular complex, small molecule, photon (light) or gene.
 BIND is based on an extensive ASN.1 data specification that can describe much of the detail
underlying biochemical and genetic networks.
 Initially, BIND was designed only to support physical/biochemical interactions. Stemming
from a collaboration with a yeast genetic mapping project, version 3.0 has wide-ranging
support for genetic interactions (valid when A and B are genes), where both the genetic
experiment and its result can be described in detail.
 Recently, all molecular interactions in PDB were imported into BIND, via the validated
MMDB database, using MMDBBIND.
The public BIND site currently runs on a shared mid-size web server. A transition to larger,
redundant servers is being planned in conjunction with the launch of the SeqHound service
upon which BIND depends. SeqHound is our in-house integrated database, similar in scope to
the Entrez system, which provides extensive C, C++ and Perl programming APIs.

URL: http://bind.ca/
Highlights:

 BIND is a collection of records documenting molecular interactions, including high-
throughput data submissions and hand-curated information gathered from the scientific
literature.
 A BIND record represents an interaction between two or more objects that is believed to
occur in a living organism. A biological object can be a protein, DNA, RNA, ligand,
molecular complex, gene, photon or an unclassified biological entity.
 BIND records are created for interactions which have been shown experimentally and
published in at least one peer-reviewed journal. A record also references any papers with
experimental evidence that support or dispute the associated interaction.
 Data from the PDB and a number of large-scale interaction and complex mapping
experiments using yeast two hybrid, mass spectrometry, genetic interactions and phage
display are added.
 A new graphical analysis tool provides users with a view of the domain composition of
proteins in interaction and complex records to help relate functional domains to protein
interactions.
 The BIND database was updated in Nov. 2004 with a new web site.

FUNCTIONAL ALIGNMENT SEARCH TOOL (FAST)


Many proteins contain a number of structural and functional modules such as SH3, SH2, kinase
and DNA binding domains (14). Most of these domains mediate protein interactions with other
biomolecules. A collection of interaction information, such as BIND, enables the study of the
relationships between protein domain architecture and protein–protein interactions.
Specifically, it is possible to classify the interactors of a protein into distinct groups based on
domain composition.
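The classification of interactors by domain composition described above can be sketched in a few lines of Python. The proteins and domain assignments below are invented examples; in BIND/FAST the assignments come from an RPS-BLAST scan against the Conserved Domain Database.

```python
from collections import defaultdict

# Sketch: group the interactors of a protein by their domain composition.
# Domain assignments here are invented for illustration.
def group_by_domains(domain_map):
    groups = defaultdict(list)
    for protein, domains in domain_map.items():
        groups[frozenset(domains)].append(protein)
    return dict(groups)

interactors = {
    "Vav":  {"SH2", "SH3", "PH"},
    "Grb2": {"SH2", "SH3"},
    "Nck":  {"SH2", "SH3"},
}
for domains, members in group_by_domains(interactors).items():
    print(sorted(domains), sorted(members))
```

Interactors sharing an identical domain set fall into the same group, which is exactly the view FAST presents graphically.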
FAST is an application that displays the domain annotation for a group of functionally related
proteins. In BIND, these groups of related proteins can be proteins that interact with a common
partner or are found together in molecular complexes. The domain annotation is from
SeqHound, which contains a complete RPS-BLAST analysis of the GenBank nr dataset against
the Conserved Domain Database, performed on a 216-CPU Beowulf cluster. FAST has a web-based
graphical interface, based on Macromedia Flash vector graphics, that displays a set of proteins
and their domains. Vector graphics format was chosen as it provides improved resolution and
zooming ability over bitmap images. FAST is accessible from BIND via interaction and
molecular complex records.

When accessed from an interaction record, the protein and its protein interactors in BIND are
displayed. When accessed from a complex record, the protein subunits are displayed. Domain
composition is shown as uniquely coloured horizontal bars above a line representing the
sequence.
Fig: Functional Alignment Search Tool (FAST). Domain composition for a set of proteins that
interact with mouse Fyn is shown as uniquely coloured horizontal bars above a line representing
the sequence. Expanded view of Vav, linked to via right-pointing red arrows, where domains are
shown correctly situated on the amino acid sequence of each protein. For brevity, this figure does
not show all Fyn-interacting proteins in BIND or, in the expanded view, all of the domains in
Vav.

Clicking on the arrow beside each protein links a user to an expanded display where domains
are shown with respect to the amino acid sequence of the protein. Users can zoom in and out
to examine the boundaries of a domain of interest in more detail using the Flash control tool.
A domain summary table for the protein set, containing links to information on each protein
and domain, can be accessed from the FAST image page. Visualization of a list of related
proteins and their domains is a powerful approach to help direct future interaction studies.
For example, the human and mouse variants of the protein tyrosine kinase Fyn each have nine
recorded interactions in BIND (Fig.). The human and mouse forms of Fyn share six similar
interactions, however, the mouse variant is known to interact with a second protein tyrosine

kinase Vav, whereas the human Fyn currently has no recorded interaction with the human Vav
homologue. Using FAST, it is easy to see that many Fyn-interacting proteins, including Vav,
contain common cell-signaling modules such as SH2 and SH3 domains. In combination with
other tools and databases such as NCBI’s CDART, human homologues with similar domain
architectures to mouse Fyn interactors can be identified (e.g. VAV-3 and TIM). These proteins
potentially interact with human Fyn.
FAST can also be used to study the topology and function of molecular complexes. A number
of protein complexes were recently identified in large-scale mass-spectrometry studies. FAST
can help decipher the interaction topology of these complexes by grouping proteins according
to their domain composition. For example, part of the proteasome complex was identified using
the protein Ygl004c as bait (BIND complex ID 11939). The domain architecture of the
identified proteins reveals three distinct subgroups corresponding to three functional elements
that control proteasome activity: ATPase (Rpt5, Rpt4, Rpt3, Rpt2, Rpt1), proteasome (Rpn9,
Rpn7, Rpn6, Rpn5, Rpn3) and proteasome regulatory subunits (Rpn8, Rpn11).

DIP

The Database of Interacting Proteins (DIP) aims to integrate the diverse body of experimental
knowledge about interacting proteins into a single, easily accessed database. The DIP database
catalogs experimentally determined interactions between proteins. It combines information
from a variety of sources to create a single, consistent set of protein-protein interactions. The
data stored within the DIP database were curated both manually by expert curators and
automatically, using computational approaches that utilize the knowledge about protein-protein
interaction networks extracted from the most reliable, core subset of the DIP data.

Biological knowledge about protein–protein interactions is contained in many different
scientific journals and in archives such as MEDLINE (National Library of Medicine, MD,
USA). Although the literature and archives are used daily by the scientific community,
retrieving specialized data from such sources requires more effort than from the DIP, which
combines information from multiple observations and experimental techniques as well as
providing information about networks of interacting proteins.
The primary goal of DIP is to extract and integrate the wealth of information about protein–
protein interactions into a user-friendly environment. Although organism-specific databases

such as YPD (1) for yeast, EcoCyc (2) for Escherichia coli, and FlyNet for Drosophila (3)
often contain information regarding protein pathways and protein complexes as do pathway
databases such as KEGG (4) and CNSB (5), the DIP was created to complement the existing
databases and to include interacting proteins from many organisms allowing scientists to
expand and complement the observations of protein–protein interactions in one organism with
observations from other organisms.

DESCRIPTION AND STRUCTURE OF THE DATABASE


The DIP database is composed of three linked tables: a table of protein information, a table of
protein–protein interactions, and a table describing details of experiments detecting the
protein–protein interactions.

(i) The protein information table contains protein identification codes from the SWISS-
PROT (7), PIR (8) and GenBank (9) sequence databases, as well as each protein’s gene name,
description, enzyme code and cellular localization, when known.
(ii) The interaction table describes proteins that interact from the protein information table, as
well as the ranges of amino acids and the protein domains involved in the protein–protein
interaction, when known.
(iii) The experimental article table details the experiments used to detect the interactions from
the interaction table and their associated literature citations. This table includes the MEDLINE
standard article code (PMID/UID), as well as the authors, title, journal and year of publication
of the article. Over 20 different experimental techniques are represented in DIP, including co-
immunoprecipitation, yeast two-hybrid and in vitro binding assays; for a complete list see
http://dip.doe-mbi.ucla.edu/help.html . Where determined, a dissociation constant is also
included.
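The three linked tables can be sketched as a small relational schema. The table and column names below are illustrative guesses based on the description above, not DIP's actual internal schema.

```python
import sqlite3

# Sketch of DIP's three linked tables (protein, interaction, experiment)
# as a relational schema. Names are illustrative, not DIP's real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (
    id INTEGER PRIMARY KEY,
    swissprot_ac TEXT, pir_ac TEXT, genbank_ac TEXT,
    gene_name TEXT, description TEXT,
    enzyme_code TEXT, localization TEXT
);
CREATE TABLE interaction (
    id INTEGER PRIMARY KEY,
    protein_a INTEGER REFERENCES protein(id),
    protein_b INTEGER REFERENCES protein(id),
    aa_range_a TEXT, aa_range_b TEXT, domains TEXT
);
CREATE TABLE experiment (
    id INTEGER PRIMARY KEY,
    interaction_id INTEGER REFERENCES interaction(id),
    pmid TEXT, authors TEXT, title TEXT, journal TEXT, year INTEGER,
    technique TEXT, dissociation_constant REAL
);
""")
conn.execute("INSERT INTO protein (gene_name) VALUES ('FYN')")
print(conn.execute("SELECT gene_name FROM protein").fetchone())
```

The experiment table's foreign key to interaction, and interaction's two foreign keys to protein, mirror the "three linked tables" structure described in the text.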

SEARCHING THE DATABASE OF INTERACTING PROTEINS

DIP can be searched in a variety of ways. One can look for interactions involving a specific
protein by entering its gene name or its accession code from GenBank, PIR or SWISS-PROT.
More general searches can be performed for information such as organisms, protein
superfamilies, keywords, experimental techniques or literature citations. A search returns a list
of protein–protein interactions, each hyperlinked to a DIP entry. Each resulting DIP entry
reports information about the two interacting proteins, the protein domains and range of amino
acids involved, the curator, date of entry and updating and the articles describing the
interaction, and the corresponding experiments. For example, a search on a single protein
returns all of the interactions recorded in DIP in which that protein participates.

PRIDE
The Proteomics Identifications database (PRIDE), previously described by Martens et al. (2)
is a PSI-compliant public repository for proteomics identifications to which any proteomics
laboratory is welcome to submit data. It is envisaged, but not mandated, that any such
submission would normally be in the context of the corresponding submission of a manuscript
to a journal describing the identifications submitted to PRIDE. As such, PRIDE aims to become
the proteomics equivalent of the ArrayExpress database (3) used to capture microarray
experiment data in support of journal publications.
PRIDE is not alone in this endeavor. Several other publicly available databases exist for the
purpose of capturing and disseminating proteomics data from mass spectrometry. Such
databases include the Global Proteome Machine Database (gpmDB) (4), The Institute for
Systems Biology's PeptideAtlas (5) and the University of Texas' Open Proteomics Database
(opd). PRIDE, the ‘PRoteomics IDEntifications database’ (http://www.ebi.ac.uk/pride), is a
database of protein and peptide identifications that have been described in the scientific
literature. These identifications will typically be from specific species, tissues and sub-cellular
locations, perhaps under specific disease conditions. Any post-translational modifications that
have been identified on individual peptides can be described. These identifications may be
annotated with supporting mass spectra. At the time of writing, PRIDE includes the full set of
identifications as submitted by individual laboratories participating in the HUPO Plasma
Proteome Project and a profile of the human platelet proteome submitted by the University of
Ghent in Belgium. Proteomics laboratories are encouraged to submit their identifications and
spectra to PRIDE to support their manuscript submissions to proteomics journals. Data can be
submitted in PRIDE XML format if identifications are included or mzData format if the
submitter is depositing mass spectra without identifications. PRIDE is a web application, so
submission, searching and data retrieval can all be performed using an internet browser. PRIDE
can be searched by experiment accession number, protein accession number, literature
reference and sample parameters including species, tissue, sub-cellular location and disease
state. Data can be retrieved as machine-readable PRIDE or mzData XML (the latter for mass
spectra without identifications), or as human-readable HTML.
DATABASE DESCRIPTION
What is the scope of PRIDE?
PRIDE can store
i. The title and description of the experiment, together with contact details of the
submitter.
ii. Literature references.
iii. Protein identifications by accession number supported by a corresponding list of one or
more peptide identifications.
iv. For each peptide identified, the sequence and coordinates of the peptide within the
protein that it provides evidence for. Optionally, a reference to any submitted mass
spectra that form the evidence for the peptide identification.

v. Any post-translational modifications (natural or artefactual) coordinated in relation to
the specific peptide that they have been found upon.
vi. A description of the sample under analysis, including but not limited to the species of
origin, tissue, sub-cellular location (if appropriate), disease state and any other relevant
annotation.
vii. A description of the instrumentation used to perform the analysis, including mass
spectrometer source, analysers and detector, instrument settings and software settings
used in data processing to generate peak lists.
viii. Processed peak lists supporting the identifications in PRIDE in the versatile PSI mzData
format.
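Items (i)–(viii) above can be summarized as a minimal data model. The class and field names below are our own shorthand, not the element names of the PRIDE XML or mzData schemas.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Minimal sketch of what a PRIDE submission captures (items i-viii above).
# Field names are illustrative; PRIDE XML defines its own element names.
@dataclass
class PeptideID:
    sequence: str
    start: int                                   # coordinates within the parent protein
    end: int
    modifications: List[str] = field(default_factory=list)
    spectrum_ref: Optional[int] = None           # link to a submitted mass spectrum

@dataclass
class ProteinID:
    accession: str
    peptides: List[PeptideID]                    # one or more supporting peptides

@dataclass
class Experiment:
    title: str
    submitter: str
    references: List[str]                        # literature references
    sample: dict                                 # species, tissue, location, disease state
    instrument: dict                             # source, analysers, detector, software
    proteins: List[ProteinID]

exp = Experiment(
    title="Platelet proteome profile",
    submitter="example@lab.org",                 # invented contact
    references=["PMID:12345678"],                # invented reference
    sample={"species": "Homo sapiens", "tissue": "platelet"},
    instrument={"source": "ESI"},
    proteins=[ProteinID("P06241", [PeptideID("LIEDNEYTAR", 420, 429)])],
)
print(exp.proteins[0].peptides[0].sequence)
```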

Datasets currently available in PRIDE


A significant dataset that is publicly available from PRIDE at the time of writing is the set of
protein and peptide identifications from the individual laboratories involved in the HUPO
Plasma Proteome Project (8). This project was in part responsible for the requirements
statement that initiated the PRIDE project.
Another publicly available dataset in PRIDE is a profile of the human platelet proteome (9)
submitted by the Department of Medical Protein Research, Ghent University. This department
is also scheduled to contribute a substantial dataset identifying proteolytic cleavage by caspases

in apoptotic Jurkat T-cells (10), as well as a large set of spectra used to evaluate spectrum quality
filtering software.
Data security in PRIDE: PRIDE as a tool for journal review
Data submitted to PRIDE is marked as public or private. Private data can be shared through a
collaborative mechanism that allows individuals to apply to join a collaboration, their
application then being confirmed or rejected by the creator of the collaboration. As well as
allowing collaborating laboratories to share their data, this mechanism can also be used to allow
manuscript reviewers to access the corresponding PRIDE entry in a confidential manner on a
neutral site.

THREADING / FOLD RECOGNITION

There are three computational approaches to protein three-dimensional structural modeling and
prediction. They are homology modeling, threading, and ab initio prediction. The first two
are knowledge-based methods; they predict protein structures based on knowledge of existing
protein structural information in databases. Homology modeling builds an atomic model based
on an experimentally determined structure that is closely related at the sequence level.
Threading identifies proteins that are structurally similar, with or without detectable sequence
similarities. The ab initio approach is simulation based and predicts structures based on
physicochemical principles governing protein folding without the use of structural templates.

There are only a small number of protein folds available (<1,000), compared to millions
of protein sequences. This means that protein structures tend to be more conserved than protein
sequences. Consequently, many proteins can share a similar fold even in the absence of
sequence similarities. This allowed the development of computational methods to predict
protein structures beyond sequence similarities. To determine whether a protein sequence
adopts a known three-dimensional structure fold relies on threading and fold recognition
methods.
By definition, threading or structural fold recognition predicts the structural fold of an
unknown protein sequence by fitting the sequence into a structural database and selecting the
best-fitting fold. The comparison emphasizes matching of secondary structures, which are most
evolutionarily conserved. Therefore, this approach can identify structurally similar proteins
even without detectable sequence similarity. The algorithms can be classified into two
categories, pairwise energy based and profile based. The pairwise energy–based method was
originally referred to as threading and the profile-based method was originally defined as fold
recognition. However, the two terms are now often used interchangeably without distinction
in the literature.

Fig: Outline of the threading method using the pairwise energy approach to predict protein
structural folds from sequence. By fitting the sequence into a structural fold library and assessing
the energy terms of the resulting raw models, the best-fit structural fold can be selected.

Pairwise Energy Method
In the pairwise energy based method, a protein sequence is searched for in a structural fold
database to find the best matching structural fold using energy-based criteria. The detailed
procedure involves aligning the query sequence with each structural fold in a fold library. The
alignment is performed essentially at the sequence profile level using dynamic programming
or heuristic approaches. Local alignment is often adjusted to get lower energy and thus better
fitting. The adjustment can be achieved using algorithms such as double-dynamic
programming (see Chapter 14). The next step is to build a crude model for the target sequence
by replacing aligned residues in the template structure with the corresponding residues in the
query. The third step is to calculate the energy terms of the raw model, which include pairwise
residue interaction energy, solvation energy, and hydrophobic energy. Finally, the models are
ranked based on the energy terms to find the lowest energy fold that corresponds to the
structurally most compatible fold (Fig.).
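The final ranking step can be caricatured as follows: sum the energy terms of each raw model and select the lowest-energy fold. The fold names and energy values below are invented placeholders standing in for real pairwise interaction, solvation, and hydrophobic terms.

```python
# Sketch of the final ranking step in energy-based threading: sum the
# energy terms of each raw model and pick the lowest-energy fold.
# Numbers are invented; real terms come from statistical potentials.
def rank_folds(models):
    def total_energy(m):
        return m["pairwise"] + m["solvation"] + m["hydrophobic"]
    return sorted(models, key=total_energy)

models = [
    {"fold": "TIM barrel",    "pairwise": -210.0, "solvation": -35.0, "hydrophobic": -60.0},
    {"fold": "Ig-like",       "pairwise": -150.0, "solvation": -20.0, "hydrophobic": -40.0},
    {"fold": "Rossmann fold", "pairwise": -240.0, "solvation": -50.0, "hydrophobic": -55.0},
]
best = rank_folds(models)[0]
print(best["fold"])  # the fold with the lowest total energy
```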

Profile Method
In the profile-based method, a profile is constructed for a group of related protein structures.
The structural profile is generated by superimposition of the structures to expose corresponding
residues. Statistical information from these aligned residues is then used to construct a profile.
The profile contains scores that describe the propensity of each of the twenty amino acid
residues to be at each profile position. The profile scores contain information for secondary
structural types, the degree of solvent exposure, polarity, and hydrophobicity of the amino
acids. To predict the structural fold of an unknown query sequence, the query sequence is first
predicted for its secondary structure, solvent accessibility, and polarity. The predicted
information is then used for comparison with propensity profiles of known structural folds to
find the fold that best represents the predicted profile. Because threading and fold recognition
detect structural homologs without completely relying on sequence similarities, they have been
shown to be far more sensitive than PSI-BLAST in finding distant evolutionary relationships.
In many cases, they can identify more than twice as many distant homologs as PSI-BLAST.
However, this high sensitivity can also be their weakness because high sensitivity is often
associated with low specificity. The predictions resulting from threading and fold recognition
often come with very high rates of false positives. Therefore, much caution is required in
accepting the prediction results. Threading and fold recognition assess the compatibility of an
amino acid sequence with a known structure in a fold library. If the protein fold to be predicted
does not exist in the fold library, the method will fail. Another disadvantage compared to

homology modeling lies in the fact that threading and fold recognition do not generate fully
refined atomic models for the query sequences. This is because accurate alignment between
distant homologs is difficult to achieve. Instead, threading and fold recognition procedures only
provide a rough approximation of the overall topology of the native structure. A number of
threading and fold recognition programs are available using either or both prediction strategies.
At present, no single algorithm is always able to provide reliable fold predictions. Some
algorithms work well with some types of structures, but fail with others. It is a good practice
to compare results from multiple programs for consistency and judge the correctness by using
external knowledge. 3D-PSSM (www.bmm.icnet.uk/∼3dpssm/) is a web-based program that
employs the structural profile method to identify protein folds. The profiles for each protein
superfamily are constructed by combining multiple smaller profiles. First, protein structures in
a superfamily based on the SCOP classification are superimposed and are used to construct a
structural profile by incorporating secondary structures and solvent accessibility information
for corresponding residues. In addition, each member in a protein structural superfamily has its
own sequence-based PSI-BLAST profile computed. These sequence profiles are used in
combination with the structure profile to form a large superfamily profile in which each position
contains both sequence and structural information. For the query sequence, PSI-BLAST is
performed to generate a sequence-based profile. PSI-PRED is used to predict its secondary
structure. Both the sequence profile and predicted secondary structure are compared with the
precomputed protein superfamily profiles, using a dynamic programming approach. The
matching scores are calculated in terms of secondary structure, solvation energy, and sequence
profiles and ranked to find the highest scored structure fold (Fig.).
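A toy version of the profile comparison might look like this: the query's predicted secondary-structure string is scored against per-position propensities of each precomputed fold profile. Real profiles also encode solvent exposure, polarity, and sequence information, and the comparison uses dynamic programming rather than the fixed per-position sum used here; the profiles and prediction are invented.

```python
# Toy sketch of profile-based fold recognition: score a predicted
# secondary-structure string (H/E/C) against per-position propensities
# of each precomputed fold profile, then rank folds by total score.
def score(pred_ss, profile):
    return sum(pos[s] for s, pos in zip(pred_ss, profile))

fold_profiles = {
    "all-alpha": [{"H": 0.8, "E": 0.1, "C": 0.1}] * 6,
    "all-beta":  [{"H": 0.1, "E": 0.8, "C": 0.1}] * 6,
}
pred_ss = "HHHHCC"  # e.g. from a PSIPRED-like secondary structure predictor
ranked = sorted(fold_profiles, key=lambda f: score(pred_ss, fold_profiles[f]),
                reverse=True)
print(ranked[0])  # the mostly-helical query matches the all-alpha profile
```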
GenThreader (http://bioinf.cs.ucl.ac.uk/psipred/index.html) is a web-based program
that uses a hybrid of the profile and pairwise energy methods. The initial step is similar to 3D-
PSSM; the query protein sequence is subject to three rounds of PSI-BLAST. The resulting
multiple sequence hits are used to generate a profile. Its secondary structure is predicted using
PSIPRED. Both are used as input for threading computation based on a pairwise energy
potential method. The threading results are evaluated using neural networks that combine
energy potentials, sequence alignment scores, and length information to create a single score
representing the relationship between the query and template proteins. Fugue (www-
cryst.bioc.cam.ac.uk/∼fugue/prfsearch.html) is a profile-based fold recognition server. It has
precomputed structural profiles compiled from multiple alignments of homologous structures,
which take into account local structural environment such as secondary structure, solvent
accessibility, and hydrogen bonding status. The query sequence (or a multiple sequence
alignment if the user prefers) is used to scan the database of structural profiles. The comparison
between the query and the structural profiles is done using global alignment or local alignment
depending on sequence variability.

Fig: Schematic diagram of fold recognition by 3D-PSSM. A profile for protein structures in a
SCOP superfamily is precomputed based on the structure profile of all members of the
superfamily, as well as on PSI-BLAST sequence profiles of individual members of the
superfamily. For the query sequence, a PSI-BLAST profile is constructed and its secondary
structure information is predicted, which together are used to compare with the precomputed
protein superfamily profile.
HOMOLOGY MODELING

As the name suggests, homology modeling predicts protein structures based on
sequence homology with known structures. It is also known as comparative modeling. The
principle behind it is that if two proteins share a high enough sequence similarity, they are
likely to have very similar three-dimensional structures. If one of the protein sequences has a
known structure, then the structure can be copied to the unknown protein with a high degree of
confidence. Homology modeling produces an all-atom model based on alignment with
template proteins.

The overall homology modeling procedure consists of six steps. The first step is
template selection, which involves identification of homologous sequences in the protein
structure database to be used as templates for modeling. The second step is alignment of the
target and template sequences. The third step is to build a framework structure for the target
protein consisting of main chain atoms. The fourth step of model building includes the addition
and optimization of side chain atoms and loops. The fifth step is to refine and optimize the
entire model according to energy criteria. The final step involves evaluating the overall
quality of the model obtained (Fig.). If necessary, alignment and model building are repeated
until a satisfactory result is obtained.
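The six steps can be sketched as a control-flow skeleton. Every helper below is a trivial stub standing in for a real tool (a BLAST/FASTA template search, an alignment program, a model builder, an energy minimizer, a quality checker); only the iterate-until-satisfactory loop is the point.

```python
# Skeleton of the six homology-modeling steps. All helpers are stubs with
# invented return values; real pipelines call external tools at each step.
def select_template(seq):            return "1ABC"              # invented PDB id
def align(seq, template):            return (seq, template)
def build_backbone(aln, template):   return {"aln": aln, "refined": False}
def add_side_chains_and_loops(m):    return m
def refine(m):                       m["refined"] = True; return m
def evaluate(m):                     return m["refined"]        # e.g. stereochemistry checks

def homology_model(target_seq, max_rounds=3):
    template = select_template(target_seq)        # step 1: template selection
    model = None
    for _ in range(max_rounds):
        aln = align(target_seq, template)         # step 2: target-template alignment
        model = build_backbone(aln, template)     # step 3: main-chain framework
        model = add_side_chains_and_loops(model)  # step 4: side chains and loops
        model = refine(model)                     # step 5: refinement
        if evaluate(model):                       # step 6: evaluation
            return model                          # accept; otherwise realign and rebuild
    return model

print(homology_model("MKT...")["refined"])
```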

Template Selection

The first step in protein structural modeling is to select appropriate structural templates.
This forms the foundation for the rest of the modeling process. The template selection involves
searching the Protein Data Bank (PDB) for homologous proteins with determined structures.
The search can be performed using a heuristic pairwise alignment search program such as
BLAST or FASTA. However, the use of dynamic programming based search programs such
as SSEARCH or ScanPS can result in more sensitive search results. The relatively small size
of the structural database means that the search time using the exhaustive method is still within
reasonable limits, while giving a more sensitive result to ensure the best possible similarity
hits. As a rule of thumb, a database protein should have at least 30% sequence identity with the
query sequence to be selected as template. Occasionally, a 20% identity level can be used as
threshold as long as the identity of the sequence pair falls within the “safe zone.” Often, multiple
database structures with significant similarity can be found as a result of the search. In that
case, it is recommended that the structure(s) with the highest percentage identity, highest
resolution, and the most appropriate cofactors is selected as a template. On the other hand, there
may be a situation in which no highly similar sequences can be found in the structure database.
In that instance, template selection can become difficult. Either a more sensitive profile-based
PSI-BLAST method or a fold recognition method such as threading can be used to identify distant
homologs. Most likely, in such a scenario, only local similarities can be identified with distant
homologs. Modeling can therefore only be done with the aligned domains of the target protein.
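The rule of thumb above can be made concrete: compute percent identity over the aligned (non-gap) positions and accept a hit as a template when it clears the threshold. The sequences below are toy examples over an already-gapped alignment.

```python
# Sketch of the template-selection rule of thumb: accept a database hit as a
# template when pairwise identity over aligned positions is >= 30%
# (occasionally 20% if the pair still falls in the "safe zone").
def percent_identity(aln_a, aln_b):
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def acceptable_template(aln_a, aln_b, threshold=30.0):
    return percent_identity(aln_a, aln_b) >= threshold

a = "MKT-AYIAKQR"   # toy target, with one gap
b = "MKTQAYLAKQR"   # toy template hit
print(round(percent_identity(a, b), 1))  # 9 identities over 10 aligned columns
```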

Sequence Alignment

Once the structure with the highest sequence similarity is identified as a template, the
full-length sequences of the template and target proteins need to be realigned using refined
alignment algorithms to obtain optimal alignment. This realignment is the most critical step in
homology modeling, which directly affects the quality of the final model. This is because
incorrect alignment at this stage leads to incorrect designation of homologous residues and
therefore to incorrect structural models. Errors made in the alignment step cannot be corrected
in the following modeling steps. Therefore, the best possible multiple alignment algorithms,
such as Praline and T-Coffee, should be used for this purpose. Even alignment using the best
alignment program may not be error free and should be visually inspected to ensure that
conserved key residues are correctly aligned. If necessary, manual refinement of the alignment
should be carried out to improve alignment quality.

Backbone Model Building

Once optimal alignment is achieved, residues in the aligned regions of the target protein
can assume a similar structure as the template proteins, meaning that the coordinates of the
corresponding residues of the template proteins can be simply copied onto the target protein.
If the two aligned residues are identical, coordinates of the side chain atoms are copied along
with the main chain atoms. If the two residues differ, only the backbone atoms can be copied.
The side chain atoms are rebuilt in a subsequent procedure.
In backbone modeling, it is simplest to use only one template structure. As mentioned, the
structure with the best quality and highest resolution is normally chosen if multiple options are
available. This structure tends to carry the fewest errors. Occasionally, multiple template
structures are available for modeling. In this situation, the template structures have to be
optimally aligned and superimposed before being used as templates in model building. One
can either choose to use average coordinate values of the templates or the best parts from each
of the templates to model.
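The coordinate copying described above can be sketched as follows: for identical aligned residues all template atoms are copied, while for differing residues only the backbone atoms (N, CA, C, O) are kept and side chains are left for later rebuilding. The atom records and coordinates are invented and simplified to name-to-xyz mappings.

```python
# Sketch of coordinate copying during backbone building. Identical aligned
# residues keep all template atoms; differing residues keep backbone only.
BACKBONE = {"N", "CA", "C", "O"}

def copy_coordinates(target_res, template_res, template_atoms):
    if target_res == template_res:
        return dict(template_atoms)               # side chain copied as well
    return {name: xyz for name, xyz in template_atoms.items()
            if name in BACKBONE}                  # backbone atoms only

# Invented coordinates for one template residue (a serine, with CB shown).
template_atoms = {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0),
                  "C": (2.1, 1.2, 0.0), "O": (3.3, 1.2, 0.0),
                  "CB": (1.9, -1.4, 0.0)}
print(sorted(copy_coordinates("A", "S", template_atoms)))  # backbone only
```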
Loop Modeling
In the sequence alignment for modeling, there are often regions caused by insertions
and deletions producing gaps in sequence alignment. The gaps cannot be directly modeled,
creating “holes” in the model. Closing the gaps requires loop modeling, which is a very difficult
problem in homology modeling and is also a major source of
error. Loop modeling can be considered a mini–protein
modeling problem by itself. Unfortunately, there are no
mature methods available that can model loops reliably.
Currently, there are two main techniques used to approach the
problem: the database searching method and the ab initio
method.

Figure 2: Schematic of loop modeling by fitting a loop structure onto the endpoints of existing
stem structures, represented by cylinders.

The database method involves finding “spare parts” from known protein structures in a database
that fit onto the two stem regions of the target protein. The stems are defined as the main chain
atoms that precede and follow the loop to be modeled. The procedure begins by
measuring the orientation and distance of the anchor regions in the stems and searching PDB
for segments of the same length that also match the above endpoint conformation. Usually,
many different alternative segments that fit the endpoints of the stems are available. The best
loop can be selected based on sequence similarity as well as minimal steric clashes with the
neighboring parts of the structure. The conformation of the best matching fragments is then
copied onto the anchoring points of the stems (Fig.). The ab initio method generates many

M. P. Garud Page 94
random loops and searches for the one that does not clash with nearby side chains and also has
reasonably low energy and φ and ψ angles in the allowable regions in the Ramachandran plot.
If the loops are relatively short (three to five residues), reasonably correct models can be built
using either of the two methods. If the loops are longer, it is very difficult to achieve a reliable
model. The following are specialized programs for loop modeling.
FREAD (www-cryst.bioc.cam.ac.uk/cgi-bin/coda/fread.cgi) is a web server that models loops
using the database approach. PETRA (www-cryst.bioc.cam.ac.uk/cgi-bin/coda/pet.cgi) is a
web server that uses the ab initio method to model loops.
CODA (www-cryst.bioc.cam.ac.uk/∼charlotte/Coda/search coda.html) is a web server that
uses a consensus method based on the prediction results from FREAD and PETRA. For loops
of three to eight residues, it uses consensus conformation of both methods and for nine to thirty
residues, it uses FREAD prediction only.
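The endpoint-matching step of the database method can be sketched as follows. This is a toy illustration only: the fragment library, coordinates, and tolerance are illustrative stand-ins, not a real PDB-derived library or the algorithm used by FREAD.

```python
import numpy as np

def match_loop_fragments(stem_n, stem_c, fragments, tol=0.5):
    """Toy sketch of the database ('spare parts') loop search: keep the
    fragments whose end-to-end geometry matches the gap between the two
    stem anchors. stem_n/stem_c are anchor Ca coordinates; `fragments`
    is a hypothetical list of (name, coords) pairs."""
    gap = np.linalg.norm(np.asarray(stem_c) - np.asarray(stem_n))
    hits = []
    for name, coords in fragments:
        coords = np.asarray(coords)
        span = np.linalg.norm(coords[-1] - coords[0])  # fragment end-to-end distance
        if abs(span - gap) <= tol:                     # endpoints fit the stems
            hits.append(name)
    return hits
```

In a real implementation the surviving candidates would then be ranked by sequence similarity and steric clashes with the neighboring structure, as described above.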
Side Chain Refinement
Once main chain atoms are built, the positions of side chains that are not modeled must
be determined. Modeling side chain geometry is very important in evaluating protein–ligand
interactions at active sites and protein–protein interactions at the contact interface. A side chain
can be built by searching every possible conformation at every torsion angle of the side chain
to select the one that has the lowest interaction energy with neighboring atoms. However, this
approach is computationally prohibitive in most cases. In fact, most current side chain
prediction programs use the concept of rotamers, which are favored side chain torsion angles
extracted from known protein crystal structures. A collection of preferred side chain
conformations is a rotamer library in which the rotamers are ranked by their frequency of
occurrence. Having a rotamer library reduces the computational time significantly because only
a small number of favored torsion angles are examined. In prediction of side chain
conformation, only the possible rotamers with the lowest interaction energy with nearby atoms
are selected.
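In outline, rotamer-based side chain placement reduces to scanning a short, frequency-ranked candidate list and keeping the least-clashing candidate. The sketch below assumes hypothetical rotamer labels and an illustrative energy function, not a real rotamer library or force field.

```python
def place_side_chain(rotamers, interaction_energy):
    """Pick, from a frequency-ranked rotamer list, the candidate with
    the lowest interaction energy with neighbouring atoms. `rotamers`
    and `interaction_energy` are illustrative stand-ins for a real
    rotamer library and a force-field energy term."""
    best, best_e = None, float("inf")
    for rot in rotamers:               # only the favored torsions are tried
        e = interaction_energy(rot)
        if e < best_e:
            best, best_e = rot, e
    return best
```

Scanning only the library's favored torsions, rather than every possible angle, is exactly what makes the rotamer approach computationally tractable.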
In many cases, even applying the rotamer library for every residue can be
computationally too expensive. To reduce search time further, backbone conformation can be
taken into account. It has been observed that there is a correlation of backbone conformations
with certain rotamers. By using such correlations, many possible rotamers can be eliminated
and the speed of conformational search can be much improved. After adding the most
frequently occurring rotamers, the conformations have to be further optimized to minimize
steric overlaps with the rest of the model structure. Most modeling packages incorporate the
side chain refinement function. A specialized side chain modeling program that has reasonably
good performance is SCWRL (side chain placement with a rotamer library;
www.fccc.edu/research/labs/dunbrack/scwrl/), a UNIX program that works by placing side
chains on a backbone template according to preferences in the backbone dependent rotamer
library. It removes rotamers that have steric clashes with main chain atoms. The final, selected
set of rotamers has minimal clashes with main chain atoms and other side chains.

Model Refinement Using Energy Function

In these loop modeling and side chain modeling steps, potential energy calculations are
applied to improve the model. However, this does not guarantee that the entire raw homology
model is free of structural irregularities such as unfavorable bond angles, bond lengths, or close
atomic contacts. These kinds of structural irregularities can be corrected by applying the energy
minimization procedure on the entire model, which moves the atoms in such a way that the
overall conformation has the lowest energy potential. The goal of energy minimization is to
relieve steric collisions and strains without significantly altering the overall structure. However,
energy minimization has to be used with caution because excessive energy minimization often
moves residues away from their correct positions. Therefore, only limited energy minimization
is recommended (a few hundred iterations) to remove major errors, such as short bond distances
and close atomic clashes. Key conserved residues and those involved in cofactor binding have
to be restrained if necessary during the process. Another often used structure refinement
procedure is molecular dynamic simulation. This practice is derived from the concern that
energy minimization only moves atoms toward a local minimum without searching for all
possible conformations, often resulting in a suboptimal structure. To search for a global
minimum requires moving atoms uphill as well as downhill in a rough energy landscape. This
requires thermodynamic calculations of the atoms. In this process, a protein molecule is
“heated” or “cooled” to simulate the uphill and downhill molecular motions. Thus, it helps
overcome energy hurdles that are inaccessible to energy minimization. It is hoped that this
simulation follows the protein folding process and has a better chance at finding the true
structure. A more realistic simulation can include water molecules surrounding the structure.
This makes the process an even more computationally expensive procedure than energy
minimization, however. Furthermore, it shares a similar weakness of energy minimization: a
molecular structure can be “loosened up” such that it becomes less realistic. Much caution is
therefore needed in using these molecular dynamic tools.
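The “limited minimization” recommendation above can be pictured as a steepest-descent loop capped at a few hundred iterations. The gradient function here is a stand-in for a real force field, and the fixed step size is a simplification; production minimizers use line searches or conjugate gradients.

```python
import numpy as np

def minimize_limited(coords, energy_grad, n_steps=300, step=0.01):
    """Limited steepest-descent minimization: move atoms downhill along
    the energy gradient for a fixed, small number of iterations, as the
    text recommends, rather than minimizing to full convergence.
    `energy_grad` is an illustrative stand-in for a force-field gradient."""
    x = np.asarray(coords, dtype=float).copy()
    for _ in range(n_steps):
        x -= step * energy_grad(x)   # downhill move toward a local minimum
    return x
```

Capping the iteration count is the programmatic equivalent of the caution above: enough steps to relieve short bonds and clashes, not enough to drift residues away from their correct positions.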

GROMOS (www.igc.ethz.ch/gromos/) is a UNIX program for molecular dynamic
simulation. It is capable of performing energy minimization and thermodynamic simulation of
proteins, nucleic acids, and other biological macromolecules. The simulation can be done in
vacuum or in solvents. A lightweight version of GROMOS has been incorporated in SwissPDB
Viewer.
Model Evaluation
The final homology model has to be evaluated to make sure that the structural features
of the model are consistent with the physicochemical rules. This involves checking anomalies
in φ–ψ angles, bond lengths, close contacts, and so on. Another way of checking the quality of
a protein model is to implicitly take these stereo-chemical properties into account. This is a
method that detects errors by compiling statistical profiles of spatial features and interaction
energy from experimentally determined structures. By comparing the statistical parameters
with the constructed model, the method reveals which regions of a sequence appear to be folded
normally and which regions do not. If structural irregularities are found, the region is
considered to have errors and has to be further refined. Procheck
(www.biochem.ucl.ac.uk/roman/procheck/procheck.html) is a UNIX program that is able to
check general physicochemical parameters such as φ–ψ angles, chirality, bond lengths, bond
angles, and so on. The parameters of the model are used to compare with those compiled from
well-defined, high-resolution structures. If the program detects unusual features, it highlights
the regions that should be checked or refined further.
WHAT IF (www.cmbi.kun.nl:1100/WIWWWI/) is a comprehensive protein analysis
server that validates a protein model for chemical correctness. It has many functions, including
checking of planarity, collisions with symmetry axes (close contacts), proline puckering,
anomalous bond angles, and bond lengths. It also allows the generation of Ramachandran plots
as an assessment of the quality of the model.
ANOLEA (Atomic Non-Local Environment Assessment; http://protein.bio.puc.cl/
cardex/servers/anolea/index.html) is a web server that uses the statistical evaluation approach.
It performs energy calculations for atomic interactions in a protein chain and compares these
interaction energy values with those compiled from a database of protein x-ray structures. If
the energy terms of certain regions deviate significantly from those of the standard crystal
structures, it defines them as unfavorable regions.
An example of the output from the verification of a homology model is shown in Figure A.
The threshold for unfavorable residues is normally set at 5.0. Residues with scores above 5.0
are considered regions with errors. Verify3D (www.doembi.ucla.edu/Services/Verify 3D/) is
another server using the statistical approach. It uses a precomputed database containing
eighteen environmental profiles based on secondary structures and solvent exposure, compiled
from high-resolution protein structures. To assess the quality of a protein model, the secondary
structure and solvent exposure propensity of each residue are calculated. If the parameters of a
residue fall within one of the profiles, it receives a high score, otherwise a low score. The result
is a two-dimensional graph illustrating the folding quality of each residue of the protein
structure.
A verification output of the above homology model is shown in Figure B. The threshold
value is normally set at zero. Residues with scores below zero are considered to have an
unfavorable environment. The assessment results can differ between verification
programs. As shown in the Figure, ANOLEA appears to be less stringent than Verify3D. Although
the full-length protein chain of this model is declared favorable by ANOLEA, residues in the
C-terminus of the protein are considered to be of low quality by Verify3D. Because no single
method is clearly superior to any other, a good strategy is to use multiple verification methods
and identify the consensus between them. It is also important to keep in mind that the evaluation
tests performed by these programs only check the stereochemical correctness of the model,
regardless of its accuracy; a stereochemically sound model may or may not have any biological meaning.
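The thresholding logic shared by these evaluation servers is simple to express. The sketch below is generic; the scores passed in are illustrative, not real ANOLEA or Verify3D output.

```python
def flag_unfavorable(scores, threshold, bad_above):
    """Flag residue positions whose per-residue score falls on the
    unfavorable side of the threshold. For an ANOLEA-style scale use
    threshold=5.0, bad_above=True (scores above 5.0 indicate errors);
    for a Verify3D-style scale use threshold=0.0, bad_above=False
    (scores below zero indicate an unfavorable environment)."""
    if bad_above:
        return [i for i, s in enumerate(scores) if s > threshold]
    return [i for i, s in enumerate(scores) if s < threshold]
```

Running both conventions over the same model and intersecting the flagged regions is one simple way to implement the consensus strategy recommended above.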

Figure: Example of protein model evaluation outputs by ANOLEA and Verify3D. The
protein model was obtained from the Swiss model database (model code 1n5d). (A) The
assessment result by the ANOLEA server. The threshold for unfavorable residues is
normally set at 5.0. Residues with scores above 5.0 are considered regions with errors. (B)
The assessment result by the Verify3D server. The threshold value is normally set at zero.
The residues with scores below zero are considered to have an unfavorable environment.
Comprehensive Modeling Programs
A number of comprehensive modeling programs are able to perform the complete
procedure of homology modeling in an automated fashion. The automation requires assembling
a pipeline that includes target selection, alignment, model generation, and model evaluation.
Some freely available protein modeling programs and servers are listed. Modeller
(http://bioserv.cbs.cnrs.fr/HTML BIO/frame mod.html) is a web server for homology
modeling. The user provides a predetermined sequence alignment of a template(s) and a target
to allow the program to calculate a model containing all of the heavy atoms (nonhydrogen
atoms). The program models the backbone using a homology-derived restraint method, which
relies on multiple sequence alignment between target and template proteins to distinguish
highly conserved residues from less conserved ones. Conserved residues are given high
restraints in copying from the template structures. Less conserved residues, including loop
residues, are given less or no restraints, so that their conformations can be built in a more or
less ab initio fashion. The entire model is optimized by energy minimization and molecular
dynamics procedures. Swiss-Model (www.expasy.ch/swissmod/SWISS-MODEL.html) is an
automated modeling server that allows a user to submit a sequence and to get back a structure
automatically. The server constructs a model by automatic alignment (First Approach mode)
or manual alignment (Optimize mode). In the First Approach mode, the user provides sequence
input for modeling. The server performs alignment of the query with sequences in PDB using
BLAST. After selection of suitable templates, a raw model is built. Refinement of the structure
is done using GROMOS. Alternatively, the user can specify or upload structures as templates.
The final model is sent to the user by e-mail. In the Optimize mode, the user constructs a
sequence alignment in Swiss Pdb Viewer and submits it to the server for model construction.
3D-JIGSAW (www.bmm.icnet.uk/servers/3djigsaw/) is a modeling server that works in either
the automatic mode or the interactive mode. Its loop modeling relies on the database method.
The interactive mode allows the user to edit alignments and select templates, loops, and side
chains during modeling, whereas the automatic mode allows no human intervention and models
a submitted protein sequence if it has an identity >40% with known protein structures.
Homology Model Databases
The availability of automated modeling algorithms has allowed several research groups
to use the fully automated procedure to carry out large-scale modeling projects. Protein models
for entire sequence databases or entire translated genomes have been generated. Databases for
modeled protein structures that include nearly one third of all known proteins have been
established. They provide some useful information for understanding evolution of protein
structures. The large databases can also aid in target selection for drug development. However,
it has also been shown that the automated procedure is unable to model moderately distant
protein homologs. Automated modeling tends to be less accurate than modeling that requires
human intervention because of inappropriate template selection, suboptimal alignment, and
difficulties in modeling loops and side chains. ModBase
(http://alto.compbio.ucsf.edu/modbase-cgi/index.cgi) is a database of protein models generated
by the Modeller program. For most sequences that have been modeled, only partial sequences
or domains that share strong similarities with templates are actually modeled. 3Dcrunch
(www.expasy.ch/swissmod/SWISS-MODEL.html) is another database archiving results of
large-scale homology modeling projects. Models of partial sequences from the Swiss-Prot
database are derived using the Swiss-Model program.



METHODS FOR COMPARISON OF 3D STRUCTURE OF PROTEIN
Because structure is much more conserved than sequence, structure
comparisons allow us to look even further back into Earth’s prehistory to track the origins and
evolution of many key enzymes and proteins. The structure of a molecule in 3D space is the
main factor which determines its chemical properties as well as its function. All information
required for a protein to be folded in its natural 3D structure is coded in its amino acid sequence.
Therefore, the 3D representation of a residue sequence and the way this sequence folds in
3D space are very important for understanding the “logic” on which a function
or biological action of a protein is based. With technological innovation and the rapid
development of X-ray crystallography methods and NMR spectroscopy techniques, a
large number of new 3D structures of protein molecules are being determined.
Why compare structures?
• In evolution, structure is better preserved than sequence
• Structure comparison gives a powerful method for searching for homologous proteins.
• Structure comparison allows us to study protein evolution
• To classify structures

The majority of the proteome is made up of amino-acid sequences that, due to evolutionary
selection, reliably and reproducibly form essentially the same three-dimensional structure. This
observation formed the basis of the “one sequence, one structure” paradigm that dominated
protein science for a long time. However, the growing redundancy of protein structure
databases, i.e. the increase in the number of structures per protein, made it clear that these
fascinating molecules possess a lot more than a simple, unique rigid structure, and that varying
degrees of the inherent flexibility of proteins are critical for their functioning. Consequently,
quantifying the structural differences in a sensible way becomes essential.

Structure comparison methods have been actively developed and used in the field of
computational modeling assessments for quantitative evaluation of correctness of predicted
models. Since 1994, a community-wide experiment called CASP (Critical Assessment of
techniques for protein Structure Prediction) provides the modeling community with the
possibility to evaluate their methods in blind prediction of structures of newly solved (but
unpublished at the moment of the assessment) proteins. The submitted models are compared
to an experimental structure using various criteria specifically developed for this task. In the
recent years, other initiatives of this kind have emerged, including CAPRI (Critical Assessment
of PRedicted Interactions) and GPCR Dock, the assessment of modeling and docking methods
for human G-protein coupled receptor targets, and the assessment of the docking and scoring
algorithms.

Nevertheless, there are tools and techniques that make it possible to compare closely or
relatively similar three-dimensional structures. The most common method is called structure
superposition. Superposition, or superimposition, is simply the process of rotating or orienting
an object until it can be superimposed on top of a similar object. This is very similar to the
process humans normally perform when putting the last piece of a jigsaw puzzle into place,
rotating and translating the piece until it finally fits. The simplest route to
three-dimensional superposition is to identify two sets of at least three common
reference points, one set on the object to be superimposed and another set on the reference
object, and then to orient the object so that the two sets of reference points match as closely
as possible (i.e., are minimally different). Fortunately, there are mathematical approaches
that can find this optimal superposition, as long as the two objects have the same number of
identified points. These approaches include Lagrangian multipliers,
quaternion methods, and matrix diagonalization techniques.
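One widely used matrix-based solution is the Kabsch algorithm, which finds the optimal rotation via a singular value decomposition. The sketch below is a generic illustration of that family of approaches, not the specific implementation of any program named in this text.

```python
import numpy as np

def superpose(mobile, reference):
    """Least-squares superposition of `mobile` onto `reference` (two
    n x 3 arrays of matched reference points) via the Kabsch/SVD
    method. Returns the superposed copy of `mobile`."""
    P = np.asarray(mobile, float)
    Q = np.asarray(reference, float)
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)   # center both point sets
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)               # SVD of the covariance matrix
    d = np.sign(np.linalg.det(Vt.T @ U.T))            # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T           # optimal proper rotation
    return Pc @ R.T + Q.mean(axis=0)                  # rotate, then translate back
```

Centering on the centroids handles the translation, and the SVD handles the rotation; together they implement exactly the rotate-and-translate search described in the jigsaw-puzzle analogy above.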
METHODS
Sequence-dependent vs. sequence-independent methods—Sequence-dependent methods of
protein structure comparison assume strict one-to-one correspondence between target and
model residues. In sequence-independent methods, structural superimposition is performed
independently, followed by the evaluation of residue correspondence obtained from such
superimposition. The usefulness of the sequence-independent approach is limited to cases
where a model approximately captures the correct target fold but the amino-acid sequence
threading within this fold is incorrect, e.g., when a one-turn shift of an alpha-helix occurs. An
example of an alignment-independent measure is the AL0 score routinely used in CASP model
evaluation (13). AL0 score measures model accuracy by counting the number of correctly
aligned residues in the sequence-independent superposition of the model and the reference
target structure. A model residue is considered to be correctly aligned if the Ca atom falls within
3.8 Å of the corresponding atom in the experimental structure, and there is no other
experimental structure Ca atom nearer. AL0 score values are clearly dependent on the
superimposition; in its original implementation used for CASP model assessment, the score is
calculated using the so called Local/Global Alignment (LGA (14)) superimposition of the two
structures. A variety of sequence-independent structural alignment methods have been
developed in the field: CE (15), DALI (16), DejaVu (17), MAMMOTH (18), Structal (19),
FOLDMINER (20), KENOBI/K2 (21), LSQMAN (22), Matras (23, 24), PrISM (25), ProSup
(26), SSM (27), and others.
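The AL0 counting rule described above can be sketched directly. The superimposition itself (done by LGA in CASP) is assumed to have been applied already; the coordinates used here are illustrative.

```python
import numpy as np

def al0_count(model_ca, target_ca, cutoff=3.8):
    """Count correctly aligned residues under the AL0 rule: a model Ca
    is correct if it lies within `cutoff` (in Angstroms) of its
    corresponding target Ca and no other target Ca is nearer. Both
    inputs are n x 3 arrays assumed to be already superimposed."""
    model_ca = np.asarray(model_ca, float)
    target_ca = np.asarray(target_ca, float)
    correct = 0
    for i, ca in enumerate(model_ca):
        d = np.linalg.norm(target_ca - ca, axis=1)  # distance to every target Ca
        if d[i] <= cutoff and d.argmin() == i:      # within 3.8 A and nearest
            correct += 1
    return correct
```

The "no other Ca nearer" condition is what makes this a sequence-independent check: a residue threaded one position off will sit closer to the wrong target Ca and is not counted.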
The results of alignment-dependent and alignment-independent structure comparison are
highly correlated with the exception of very distant homology cases.
Superimposition-based vs. superimposition-independent methods— Any method that
relies on distance measurements between reference points in the model and their respective
counterparts in the reference template requires prior superimposition of the model onto
template, with the results of the comparison clearly dependent on the superimposition. Finding
an optimal superimposition is an ambiguous task that has multiple solutions optimizing specific
parameters, therefore, all superimposition dependent methods suffer from this ambiguity.
Superimposition of a specific subset may not resolve this issue because the choice of the subset
is subjective and ambiguous. A method that iteratively optimizes the superimposition of two
protein structures by assigning lower weights to the most deviating fragments, in this way
finding the largest superimposable core of the two proteins, is described below. However, even
in this approach, the choice of weight decay function is rather arbitrary and subjective which
may lead to multiple solutions introducing ambiguity in any similarity score derived from these
superimpositions. Superimposition-independent methods, such as contact based measures are
devoid of this ambiguity.

Root Mean Square Deviation (RMSD) is the most commonly used quantitative measure of the
similarity between two superimposed sets of atomic coordinates. RMSD values are presented
in Å and calculated by

    RMSD = sqrt( (1/n) Σ di² )

where the averaging is performed over the n pairs of equivalent atoms and di is the distance
between the two atoms in the i-th pair. RMSD can be calculated for any type and subset of
atoms; for example, Cα atoms of the entire protein, Cα atoms of all residues in a specific subset
(e.g. the trans membrane helices, binding pocket, or a loop), all heavy atoms of a specific subset
of residues, or all heavy atoms of a small-molecule ligand.
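As a minimal sketch, the RMSD defined above, for two already-superimposed and matched coordinate sets, is:

```python
import numpy as np

def rmsd(a, b):
    """RMSD between two superimposed coordinate sets (n x 3 arrays of
    matched atoms): the square root of the mean squared distance di
    over the n equivalent atom pairs, in the units of the input (Å)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    d2 = np.sum((a - b) ** 2, axis=1)   # squared distance for each atom pair
    return float(np.sqrt(d2.mean()))
```

Because the pairwise distances are squared before averaging, a few large deviations dominate the value, which is exactly the weakness discussed below.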



The main disadvantage of the RMSD lies in the fact that it is dominated by the amplitudes of
errors. Two structures that are identical with the exception of a position of a single loop or a
flexible terminus typically have a large global backbone RMSD and cannot be effectively
superimposed by any algorithm that optimizes the global RMSD.
RMSD is always non-negative, and a value of 0 (almost never achieved in practice) would
indicate a perfect fit to the data. In general, a lower RMSD is better than a higher one. However,
comparisons across different types of data would be invalid because the measure is dependent
on the scale of the numbers used.
An example of such a pair is given by the active and inactive conformations of the estrogen
receptor α (ERα), which differ only in the position of a single helix, helix 12 (Figure 1). By
global backbone RMSD, this pair is virtually indistinguishable from the pair of albumin
structures where multiple smaller scale rearrangements occur. The colored map in Figure 1
shows the distribution of the protein backbone RMSD for a large number of experimentally
determined structure pairs of identical proteins in the PDB. It demonstrates that for the majority
of pairs, the RMSD ranges from 0 to 1.2 Å, due to inherent protein flexibility and experimental
resolution limits. Figure 1 also presents the results of comparison of most accurate GPCR Dock
2010 models to their respective reference (answer) structures. It is clear that the backbone
RMSD values are distributed around 2.3 Å for the easier homology modeling case, D3, and
around 4.5 Å for the distant homology modeling case, CXCR4. It is however important to
realize that these RMSD distributions do not reflect the true model accuracy because they are
largely affected by flexible and poorly defined regions such as C-termini and extracellular
loops in both GPCRs.



Application
 In meteorology, to see how effectively a mathematical model predicts the behavior of
the atmosphere.
 In bioinformatics, the root-mean-square deviation of atomic positions is the measure of the
average distance between the atoms of superimposed proteins.
 In structure based drug design, the RMSD is a measure of the difference between a crystal
conformation of the ligand conformation and a docking prediction.
 In economics, the RMSD is used to determine whether an economic model fits economic
indicators. Some experts have argued that RMSD is less reliable than Relative Absolute
Error.[6]
 In experimental psychology, the RMSD is used to assess how well mathematical or
computational models of behavior explain the empirically observed behavior.
 In GIS, the RMSD is one measure used to assess the accuracy of spatial analysis and
remote sensing.
 In hydrogeology, RMSD and NRMSD are used to evaluate the calibration of a
groundwater model.[7]
 In imaging science, the RMSD is part of the peak signal-to-noise ratio, a measure used to
assess how well a method to reconstruct an image performs relative to the original
image.
 In computational neuroscience, the RMSD is used to assess how well a system learns a
given model.[8]
 In protein nuclear magnetic resonance spectroscopy, the RMSD is used as a measure to
estimate the quality of the obtained bundle of structures.
 Submissions for the Netflix Prize were judged using the RMSD from the test dataset's
undisclosed "true" values.
 In the simulation of energy consumption of buildings, the RMSE and CV(RMSE) are
used to calibrate models to measured building performance.[9]
 In X-ray crystallography, RMSD (and RMSZ) is used to measure how much the
molecular internal coordinates deviate from the restraint library values.



Phylogenetic analysis
A phylogenetic tree or evolutionary tree is a branching diagram or "tree" showing the
evolutionary relationships among various biological species or other entities, based upon
similarities and differences in their physical or genetic characteristics. All life on Earth is
part of a single phylogenetic tree, indicating common ancestry.
Purposes of phylogenetic tree
• Understanding human origin
• Understanding biogeography
• Understanding the origin of particular traits
• Understanding the process of molecular evolution
• Origin of disease
• The aim of phylogenetic tree construction is to find the tree which best describes the
relationships between objects in a set. Usually the objects are species.
Applications
• The inference of phylogenies with computational methods has many important
applications in medical and biological research, such as drug discovery and conservation
biology.
• Phylogenetic trees have already witnessed applications in numerous practical domains,
such as in conservation biology (illegal whale hunting), epidemiology (predictive
evolution), forensics (dental practice HIV transmission), gene function prediction and drug
development.
• Other applications of phylogenies include multiple sequence alignment, protein structure
prediction, gene and protein function prediction and drug design.
• The computation of the tree of life, containing representatives of all living beings on Earth,
is considered to be one of the grand challenges in Bioinformatics.
Limitations
• It is important to remember that trees do have limitations. For example, trees are meant
to provide insight into a research question and not intended to represent an entire species
history.
• Several factors, like gene transfers, may affect the output placed into a tree.
• All knowledge of limitations related to DNA degradation over time must be considered,
especially in the case of evolutionary trees aimed at ancient or extinct organisms.



Terminology :
 node : a node represents a taxonomic unit. This can be a taxon (an existing species) or
an ancestor (unknown species : represents the ancestor of 2 or more species).
 branch : defines the relationship between the taxa in terms of descent and ancestry.
 topology : is the branching pattern.
 branch length : often represents the number of changes that have occurred in that
branch.
 root : is the common ancestor of all taxa.
 distance scale : scale which represents the number of differences between sequences
(e.g. 0.1 means 10 % differences between two sequences).

Figure : The tree terminology.



Methods of phylogenetic analysis :
There are two major groups of analyses to examine phylogenetic relationships between
sequences :
1. Phenetic methods : trees are calculated by similarities of sequences and are based
on distance methods. The resulting tree is called a dendrogram and does not
necessarily reflect evolutionary relationships. Distance methods compress all of the
individual differences between pairs of sequences into a single number.
2. Cladistic methods : trees are calculated by considering the various possible pathways
of evolution and are based on parsimony or likelihood methods. The resulting tree is
called a cladogram. Cladistic methods use each alignment position as evolutionary
information to build a tree.
Phenetic methods based on distances :
1. Starting from an alignment, pairwise distances are calculated between DNA
sequences as the sum of all base pair differences between two sequences (the most
similar sequences are assumed to be closely related). This creates a distance matrix.
o All base changes can be considered equally or a matrix of the possible
replacements can be used.
o Insertions and deletions are given a larger weight than replacements. Insertions
or deletions of multiple bases at one position are given less weight than multiple
independent insertions or deletions.
o It is possible to correct for multiple substitutions at a single site.
2. From the obtained distance matrix, a phylogenetic tree is calculated with clustering
algorithms. These cluster methods construct a tree by linking the least distant pair of
taxa, followed by successively more distant taxa.
o UPGMA clustering (Unweighted Pair Group Method using Arithmetic
averages) : this is the simplest method
o Neighbor Joining : this method tries to correct the UPGMA method for its
assumption that the rate of evolution is the same in all taxa.
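Step 1 of the phenetic procedure above, in its simplest form (counting base differences between pre-aligned sequences, with none of the gap weighting or multiple-substitution corrections mentioned in the text), can be sketched as:

```python
def distance_matrix(seqs):
    """Pairwise distance matrix from pre-aligned DNA sequences: each
    entry is the number of positions at which two sequences differ.
    `seqs` maps sequence names to aligned strings of equal length."""
    return {(a, b): sum(x != y for x, y in zip(sa, sb))
            for a, sa in seqs.items() for b, sb in seqs.items()}
```

The resulting matrix is the sole input to the clustering algorithms that follow; all of the individual differences between each pair of sequences are compressed into its single numbers.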
UPGMA
Abbreviation of “Unweighted Pair Group Method with Arithmetic Mean”.
Originally developed for numeric taxonomy in 1958 by Sokal and Michener
UPGMA characteristics

· UPGMA is the simplest method for constructing trees.



· The great disadvantage of UPGMA is that it assumes the same evolutionary speed on
all lineages, i.e. the rate of mutations is constant over time and for all lineages in the tree.
This is called a 'molecular clock hypothesis'. This would mean that all leaves (terminal
nodes) have the same distance from the root. In reality the individual branches are very
unlikely to have the same mutation rate. Therefore, UPGMA frequently generates wrong
tree topologies.
· Generates rooted trees (re-rooting is not allowed!)
· Generates ultrametric trees
How to construct a tree with UPGMA
Prepare a distance matrix
Repeat step 1 and step 2 until there are only two clusters
Step 1:
Cluster a pair of leaves (taxa) by shortest distance
Step 2:
Recalculate a new average distance with the new cluster and other taxa, and make a new
distance matrix
Example

New average distance between AB and C is: C to AB = (60 + 50) / 2 = 55


Distance from D to AB is: D to AB = (100 + 90) / 2 = 95
Distance from E to AB is: E to AB = (90 + 80) / 2 = 85



New average distance between AB and DE is: AB to DE = (95 + 85) / 2 = 90

New Average distance between CDE and AB is: CDE to AB = (90 + 55) / 2 = 72.5
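The two-step loop above can be sketched as follows. Note that it uses the same simple pair-average distance update as the worked example (strictly speaking this is WPGMA-style averaging; unweighted UPGMA would weight the average by cluster sizes). Cluster names and input distances are whatever the caller provides.

```python
def upgma_steps(dist):
    """Minimal sketch of the UPGMA clustering loop, using the simple
    pair-average distance update shown in the worked example above.
    `dist` maps frozenset({x, y}) -> distance between clusters x and y;
    returns the new clusters in the order they are created."""
    clusters = {name for pair in dist for name in pair}
    merges = []
    while len(clusters) > 1:
        # Step 1: cluster the pair of taxa at the shortest distance.
        pair = min((p for p in dist if p <= clusters), key=dist.get)
        a, b = sorted(pair)
        new = a + b
        # Step 2: recalculate the distance from the new cluster to each
        # remaining cluster as the average of the two old distances.
        for c in clusters - {a, b}:
            dist[frozenset({new, c})] = (dist[frozenset({a, c})] +
                                         dist[frozenset({b, c})]) / 2
        clusters -= {a, b}
        clusters.add(new)
        merges.append(new)
    return merges
```

Fed a matrix consistent with the example (AB merging first, then DE), this loop reproduces the updated distances 55, 95, 85, 90, and 72.5 computed above.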

Neighbor Joining method


The NJ method for clustering was developed by Saitou and Nei (1987). It reconstructs an
unrooted phylogenetic tree with branch lengths using the minimum-evolution criterion,
which minimizes the total length of the tree. It does not assume constancy of substitution
rates across sites and does not require the data to be ultrametric, unlike UPGMA. Hence, this method is
more appropriate for sites with variable rates of evolution. The principle of this method is
to find pairs of operational taxonomic units (OTUs [= neighbors]) that minimize the total
branch length at each stage of clustering of OTUs starting with a starlike tree.
Advantages of NJ
· fast (suited for large datasets)
· does not require ultrametric data: suited for datasets comprising lineages with largely
varying rates of evolution
· permits correction for multiple substitutions
Disadvantages of NJ
· information is reduced (distance matrix based)
· gives only one tree (out of several possible trees)
· the resulting tree depends on the model of evolution used

M. P. Garud Page 112


This algorithm does not assume a molecular clock and adjusts for rate variation among
branches. It begins with an unresolved, star-like tree (fig (a)). Each pair is evaluated for
joining, and the sum of all branch lengths of the resultant tree is calculated. The pair that
yields the smallest sum is considered the closest pair of neighbors and is joined. A new
branch is inserted between them and the rest of the tree (fig (b)), and the branch lengths are
recalculated. This process is repeated until the tree is completely resolved.
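The joining loop just described can be sketched as follows (topology only; branch-length bookkeeping is omitted for brevity). The Q-criterion and distance-update formulas are the standard ones from Saitou and Nei (1987); the four-taxon distance matrix is an invented additive example.

```python
def neighbor_joining(D, taxa):
    """Return an unrooted NJ tree topology as nested tuples.
    D: dict {frozenset({x, y}): distance}. Branch lengths are omitted."""
    nodes = list(taxa)
    while len(nodes) > 3:
        n = len(nodes)
        # net divergence of each node from all others
        r = {i: sum(D[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Q-criterion: join the pair minimising the total tree length
        i, j = min(
            ((x, y) for a, x in enumerate(nodes) for y in nodes[a + 1:]),
            key=lambda p: (n - 2) * D[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        u = (i, j)                   # join the chosen neighbours at a new node
        for k in nodes:
            if k != i and k != j:
                D[frozenset((u, k))] = 0.5 * (
                    D[frozenset((i, k))] + D[frozenset((j, k))] - D[frozenset((i, j))]
                )
        nodes = [k for k in nodes if k != i and k != j] + [u]
    return tuple(nodes)              # the last three branches meet centrally

# Additive distances from a tree in which A,B and C,D are neighbour pairs
dm = {frozenset(p): v for p, v in {
    ("A", "B"): 5, ("A", "C"): 7, ("A", "D"): 8,
    ("B", "C"): 8, ("B", "D"): 9, ("C", "D"): 9,
}.items()}
tree = neighbor_joining(dm, ["A", "B", "C", "D"])
print(tree)                          # ('C', 'D', ('A', 'B'))
```

On this additive matrix the Q-criterion correctly picks A and B as neighbours even though, for instance, d(A, B) is not the smallest entry a naive nearest-pair rule would consider in a non-clocklike tree.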

Maximum parsimony
The Maximum parsimony (MP) method is based on the simple principle of searching for the
tree, or collection of trees, that minimizes the number of evolutionary changes (changes of
one character state into another) needed to explain the observed differences at the
informative sites of the OTUs. There are two problems under the parsimony criterion: a)
determining the length of a tree, i.e. estimating the number of changes in character states,
and b) searching over all possible tree topologies to find the tree that involves the minimum
number of changes. Finally, all the trees with the minimum number of changes are identified
for each of the informative sites. Fitch’s algorithm is used to calculate the changes for a
fixed tree topology (Fitch, 1971). If the number of OTUs, N, is moderate, this algorithm can be used to

M. P. Garud Page 113


calculate the changes for all possible tree topologies, and the most parsimonious rooted
tree with the minimum number of changes is then inferred. However, if N is very large it
becomes computationally expensive to evaluate the large number of possible rooted
trees. In such cases, a branch and bound algorithm is used to restrict the search space of tree
topologies in accordance with Fitch’s algorithm to arrive at a parsimonious tree (Hendy &
Penny, 1982). However, this approach may miss some parsimonious topologies in order to
reduce the search space.
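Fitch's count of changes for one fixed topology can be sketched as below. The site states are hypothetical stand-ins for an informative site like site 5 of the example (the table itself is not reproduced here); the point is that the tree grouping identical states needs a single substitution while an alternative grouping needs two.

```python
def fitch(tree, states):
    """Minimum number of state changes on a fixed rooted binary tree
    (Fitch, 1971). `tree` is a nested tuple of leaf names; `states`
    maps each leaf to its nucleotide at one aligned site."""
    changes = 0
    def post_order(node):
        nonlocal changes
        if isinstance(node, str):            # leaf: its state set is fixed
            return {states[node]}
        left, right = (post_order(child) for child in node)
        common = left & right
        if common:                           # intersection: no change forced here
            return common
        changes += 1                         # empty intersection: one substitution
        return left | right
    post_order(tree)
    return changes

# Hypothetical states for one informative site: A=G, B=G, C=T, D=T
site = {"A": "G", "B": "G", "C": "T", "D": "T"}
print(fitch((("A", "B"), ("C", "D")), site))   # 1 change: the favoured tree
print(fitch((("A", "C"), ("B", "D")), site))   # 2 changes
```

Running this over every informative site and every candidate topology, and keeping the topologies with the smallest total, is exactly the exhaustive search described above.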
An illustrative example of phylogeny analysis using Maximum parsimony is shown in the
Table and Figure below. The Table shows a snapshot of an MSA of 4 sequences in which 5
columns show the aligned nucleotides. Since there are four taxa (A, B, C & D), three possible
unrooted trees can be obtained for each site. Out of the 5 character sites, only two, viz. sites
4 & 5, are informative, i.e. sites having at least two different types of characters
(nucleotides/amino acids), each with a minimum frequency of 2. In the Maximum parsimony
method, only informative sites are analysed. The Figure shows the Maximum parsimony
phylogenetic analysis of site 5. Three possible unrooted trees are shown for site 5, and the
tree length is calculated in terms of the number of substitutions. Tree II is favoured over
trees I and III as it can explain the observed changes in the sequences with just a single
substitution. In the same way, unrooted trees can be obtained for the other informative sites,
such as site 4. The most parsimonious tree among them will be selected as the final
phylogenetic tree. If two or more trees are found and no unique tree can be inferred, the
trees are said to be equally parsimonious.

Example of phylogenetic analysis from 5 aligned character sites in 4 OTUs using Maximum
parsimony method

M. P. Garud Page 114


Example showing various tree topologies based on site 5 in Table using the Maximum parsimony
method.

This method is suitable for a small number of sequences with higher similarity and was
originally developed for protein sequences. Since it examines the number of evolutionary
changes in all possible trees, it is computationally intensive and time consuming. Thus, it is
not the method of choice for large genome sequences with high variation. Unequal rates of
variation at different sites can lead to an erroneous parsimony tree, with some branches
having longer lengths than others, because the parsimony method assumes the rate of
change to be equal across all sites.

Maximum likelihood
In this method, probabilistic models for phylogeny are developed and the tree is
reconstructed by a sampling method for the given set of sequences. The main difference
between this method and other available methods is that it ranks the various possible tree
topologies according to their likelihood. The same can be done using the Bayesian approach
(likelihood based on posterior probabilities, i.e. using P(tree|data)). This method also
facilitates computing the likelihood of a sub-tree topology along a branch.
To make the method operative, one must know how to compute P(x*|T, t*), the probability
of the set of data x* given tree topology T and the set of branch lengths t*. The tree having
maximum probability, i.e. the one which maximizes the likelihood, is chosen as the best tree.
The maximization can also be based on the posterior probability P(tree|data), which is
obtained from P(x*|T, t*) = P(data|tree) by applying Bayes' theorem.
The exercise of maximization involves two steps:
a. A search over all possible tree topologies with order of assignment of sequences at the leaves
specified.
b. For each topology, a search over all possible lengths of edges in t*
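As a concrete instance of step (b), the sketch below grid-searches the branch length t that maximizes P(data | tree, t) for the simplest possible "tree": a single branch joining two aligned sequences. The Jukes-Cantor (JC69) substitution model is an assumption here (the text does not fix a model); under it the grid estimate agrees with the closed-form JC69 distance.

```python
from math import exp, log

def jc_site_prob(same, t):
    """JC69 probability of one aligned site pair at total path length t
    (expected substitutions per site), with a uniform 1/4 base prior."""
    p_same = 0.25 + 0.75 * exp(-4.0 * t / 3.0)
    return 0.25 * (p_same if same else (1.0 - p_same) / 3.0)

def log_likelihood(seq1, seq2, t):
    return sum(log(jc_site_prob(a == b, t)) for a, b in zip(seq1, seq2))

s1, s2 = "ACGTACGTAC", "ACGTACGTAT"      # 1 mismatch in 10 sites (p = 0.1)
# step (b): search the branch length that maximises the likelihood
t_hat = max((i / 1000 for i in range(1, 2000)),
            key=lambda t: log_likelihood(s1, s2, t))
# closed-form JC69 distance for comparison: -3/4 ln(1 - 4p/3)
t_formula = -0.75 * log(1 - 4 * 0.1 / 3)
print(round(t_hat, 3), round(t_formula, 3))   # both round to 0.107
```

A full ML phylogeny program performs this branch-length optimization jointly over all edges of each candidate topology (step a), usually with numerical optimizers rather than a grid.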

M. P. Garud Page 116


DNA Barcoding Database: BOLD
DNA barcoding employs sequence diversity in short, standardized gene regions to aid
species identification and discovery in large assemblages of life. The Consortium for the
Barcode of Life (CBOL) was launched in May 2004 and now includes more than 120
organizations from 45 nations. CBOL is fostering development of the international research
alliances needed to build, over the next 20 years, a barcode library for all eukaryotic life.
Although BOLD aids the assembly of barcode data and maintains these records, a copy
of all sequence and key specimen data also migrates to NCBI or its sister genomic repositories
[the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory
(EMBL)] as soon as results are ready for public release. Access to BOLD is open to any
researcher with interests in DNA barcoding; computational resources and personnel are
available to sustain its primary site until 2011. BOLD now comprises more than 65,000 lines
of combined code written in Java (for business logic and light analytics), C++ (for heavy
analytics), and PHP (for the front end). It runs in a Linux environment with all data residing
in a PostgreSQL relational database (www.postgresql.org). The Barcode of Life Data System
(BOLD) is an online workbench and database that supports the assembly and use of DNA
barcode data. It is a collaborative hub for the scientific community and a public resource for
citizens at large.
The Barcode of Life Data Systems (BOLD) is a web platform that provides an
integrated environment for the assembly and use of DNA barcode data. It delivers an online
database for the collection and management of specimen, distributional, and molecular data as
well as analytical tools to support their validation. Since its launch in 2005, BOLD has been
extended to provide a range of functionality including data organization, validation,
visualization and publication. The most recent version of the system, launched in October 2013,
brings a collection of iterative improvements supporting data collection and analysis but also
includes novel modules improving data dissemination, citation, and annotation.
BOLD is freely available to any researcher with interests in DNA Barcoding. By
providing specialized services, it aids in the publication of records that meet the standards
needed to gain BARCODE designation in the international nucleotide sequence databases.
Because of its web-based delivery and flexible data security model, it is also well positioned
to support projects that involve broad research alliances.

M. P. Garud Page 117


M. P. Garud Page 118
Searching Public Data
Users can enter a combination of search terms to refine their searches in all four
BOLD public databases. For example, searching "Lepidoptera Canada" in the Public Data
Portal will return all of the Lepidoptera records collected in Canada.
Searchable keywords include taxonomy (scientific names only), geography, collectors,
identifiers, or institutions as well as BOLD Sample IDs, Process IDs, and Project Codes.
Below are the current search guidelines accepted in the system:

M. P. Garud Page 119


· Multiple terms from the same domain can be searched to retrieve all results matching
either term. For example, "Anura Caudata" will deliver results for records from both
orders.
· Multiple terms from different domains can be searched to retrieve the intersection of
the results. For example, "Canada Aves" will return results for Aves collected in Canada
only.
· Quotation marks must be used for exact-match retrieval of multi-word terms in a
multi-term search. For example, "United States" Aves will deliver results for US birds.
· The minus operator (-) will omit certain results from the search. For example,
"Lepidoptera -Saturniidae" will deliver results for Lepidoptera but not Saturniidae (a
family within Lepidoptera).
· Combination searches are possible within and across domains. For example,
"Biodiversity Institute of Ontario" Sesiidae -Manitoba will deliver results for the
Sesiidae stored in the Biodiversity Institute of Ontario, but not collected in Manitoba.
· BOLD Project Codes can be searched for published projects/datasets. For example,
"NBCAD" will return all the records from that project.
· Researchers' names may be searched to find records that the researcher collected or
identified. For example, "Xin Zhou" will deliver results for all records that were
collected or identified by researcher Xin Zhou.

M. P. Garud Page 120


Public Search Bar as illustrated in the Public Data Portal.
Troubleshooting For The Public Search Bar
There are several reasons why searches may not be retrieving the desired results. These are the
most commonly encountered issues:
· A typo, spelling mistake, or invalid synonym was entered instead of the proper search
term.
· The database does not have records matching the exact search term(s); it may be useful
to broaden the search.
· An additional space was entered when using the negative sign. For example, "- Ontario"
was used instead of "-Ontario".
· The search terms are retrieving results from a different domain than expected. In this
case it may be useful to append a domain code to the search term, such as "[tax]",
"[geo]", or "[identifier]", to narrow results to a specific domain.
Downloading Public Data

M. P. Garud Page 121


BOLD Systems provides the option to download public data from the search results page.
Several download options and file formats are available.
In the BIN Database and the Public Data Portal, users can choose to download specimen data
(in XML or TSV formats), sequences (in FASTA format), trace files (in either .ab1 or .scf
formats), or both specimen details and sequences (in XML or TSV formats). In addition,
occurrence maps of specimens, species, or barcodes are also available for download in these
databases. To download data for all the records returned, simply click on the download option
desired. To choose a selection of records, use the checkboxes to the left of each record.
Primer details and bibliographies can also be downloaded from the Primer Database and
the Publication Database, respectively.
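Bulk downloads can also be scripted. The sketch below only builds a request URL for BOLD's public data web service; the endpoint and parameter names (API_Public/combined, taxon, geo, format) are taken from BOLD's API documentation as of this writing and should be verified against the current docs before use.

```python
from urllib.parse import urlencode

# Base endpoint of BOLD's public data API (an assumption to verify; service
# and parameter names may change between releases)
BASE = "http://www.boldsystems.org/index.php/API_Public/combined"

def bold_query_url(**filters):
    """Build a URL requesting public specimen + sequence data as TSV."""
    params = {k: v for k, v in filters.items() if v}   # drop empty filters
    params["format"] = "tsv"
    return BASE + "?" + urlencode(params)

url = bold_query_url(taxon="Lepidoptera", geo="Canada")
print(url)
```

Fetching this URL (for example with urllib or a browser) would return the same records as the "Lepidoptera Canada" search described above, in tab-separated form.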

Public Data Portal with the available download tools highlighted in the red box.

Registering for a User Account


Getting an account on BOLD extends the benefits available to a user of the system beyond
access to public data and use of the BOLD Identification Engine. Upon signing in, users can

M. P. Garud Page 122


submit data to BOLD and gain access to other in-progress, private projects with the permission
of the data owners. Moreover, it is possible for users to annotate published data, as well as help
to curate and clean the identification library. Once data is on BOLD, a large set of analytical
tools is available for validation and for generating reports for publications.
To register for an account, click on Workbench in the header. Under the login section is a link
called Create Account. After the registration form is submitted, users can log in immediately
and will also receive a welcome e-mail within a few minutes. Users who are not associated with
any formal institution may register for an account by registering a new institution, in the
recommended format "Research Collection of Jane S. Stewards".

M. P. Garud Page 123


Registration form for new users.
User Preferences
Once logged in to the system, users may update their profile anytime by clicking on the gear
symbol on the top right corner of the screen and selecting User Preferences. BOLD allows
users to change their name, institution, password, or email.

M. P. Garud Page 124


Databases: Identification
BOLD Identification Engine

The library of sequences collected in BOLD is available for facilitating identification of


unknown sequences. The BOLD Identification Engine uses all sequences uploaded to BOLD
from public and private projects to locate the closest match. To ensure data security, sequences
from private records are never exposed.

Animal Identification (COI)

The BOLD ID Engine accepts sequences from the 5’ region of the mitochondrial gene COI
and returns a species-level identification (when possible). BOLD uses the BLAST algorithm
to identify single-base indels before aligning the protein translation against a profile Hidden
Markov Model of the COI protein. There are four types of databases that can be used to identify
COI sequences. The BOLD ID Engine provides historical copies of the COI databases dating
back to 2009 for use in replicating results from previous years. The Full-Length COI database

M. P. Garud Page 125


is designed for use with short query sequences as it provides maximum overlap in the barcode
region of COI.

Fungal (ITS) And Plant (RbcL & MatK) Identification

In the BOLD ID Engine, ITS is the default identification tool for fungal barcodes and rbcL
and matK are the defaults for plant barcodes. Both return a species-level identification (when
possible). The BLAST algorithm is employed in place of BOLD’s internal identification engine
for these sequences. The number of fungal and plant sequences in BOLD is relatively limited
compared to the number of animal sequences and thus a successful species match may not be
possible. As new sequences are added to the database, the number of successful matches should
improve. These databases include many species represented by only one or two specimens, as
well as all species with interim taxonomy. Both searches will return a list of the nearest matches
but do not provide a probability of placement to a taxon.
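Identification requests can likewise be scripted against the ID Engine's web service. The endpoint and database code below (Ids_xml, COX1_SPECIES) follow BOLD's published API notes but are assumptions to confirm against the current documentation; the snippet only constructs the query URL.

```python
from urllib.parse import urlencode

# BOLD ID Engine web service (hedged: endpoint and database codes should be
# confirmed against the current BOLD API documentation)
ID_ENGINE = "http://www.boldsystems.org/index.php/Ids_xml"

def id_query_url(sequence, db="COX1_SPECIES"):
    """URL submitting one COI sequence for identification; the response
    is XML listing the top matches with similarity scores."""
    return ID_ENGINE + "?" + urlencode({"db": db, "sequence": sequence})

url = id_query_url("ACTATACCTAATCTTCGGCGCATGAGCTGGA")
print(url)
```

The `db` parameter selects among the identification databases described in the table below the same way the web interface does.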

M. P. Garud Page 126


Descriptions of the 6 types of identification databases on BOLD

Database Name             Description                                Database Size
All Barcode Records       Every COI sequence on BOLD >500bp          >1,390,000 sequences
Species Barcode Records   Every COI sequence >500bp with a           >1,150,000 sequences
                          species-level identification
Public Barcode Records    Every public COI sequence >500bp           >270,000 sequences
Full-Length Barcode       Every COI sequence on BOLD >640bp          >950,000 sequences
Records
Fungal Records            Every ITS sequence on BOLD >100bp          >15,000 sequences
Plant Records             Every rbcL and matK sequence on BOLD       >95,000 & >70,000
                          >500bp                                     sequences respectively
The results page for a typical animal sequence identification is illustrated below. For each
sequence queried, an overview is provided describing the best match, links to both the taxonomic
page and the BIN cluster for the match, as well as a Taxon ID Tree placing the query sequence
among the 100 closest matches. The top matches listed in the table provide links to the
public record where available. A map is provided displaying the collection locations of all the
public records in the top 100 matches. For a batch of queried sequences, each result page is
accessible via the accordion tabs on the page.

M. P. Garud Page 127


Taxonomy Browser

M. P. Garud Page 128


The Taxonomy Browser is a synthetic database that allows users to examine the progress of
DNA barcoding by browsing through the different levels of the taxonomic hierarchy available
on BOLD.

Within the Taxonomy Browser, users can select phyla in the Animal, Plant, Fungus, or Protist
kingdoms and navigate from phylum to species level. Statistics on the progress of DNA
barcoding at each taxon are generated from both public and private data while protecting
private, user-owned data. To look up a specific taxon directly, use the search function by
entering a taxonomic name into the search bar at the top of the Taxonomy Browser or on
the BOLD Home page. Descriptions of the features on each taxon page are illustrated and
described below.

M. P. Garud Page 129


M. P. Garud Page 130
BOLD Taxonomy Browser

1. Lineage: Displays the taxon name and the higher taxonomic levels.
2. Search Bar: Enter a taxonomic name to go directly to a page.
3. Sub-Taxonomy: Links to all sub-taxa, with the number of specimen records for each.
4. Taxon Description: Displays the description of this taxon from the Wikipedia website.
5. Statistics: Statistics compiled by BOLD for this taxon. A species progress list can be
downloaded for each rank that has sub-taxa. The published and released sequences for
this taxon can be downloaded from this section.
6. Sample Sources: A graph of the top institutions that provided specimens, with their
specimen tallies.
7. Imagery: A random selection of the images available for the subtaxa of this taxon.
Mousing over an image selects it for higher-resolution display to the right.
8. Image Details: The taxonomic identifier, the sample identifier, license, and attribution
are all displayed beneath the selected image.
9. Collection Sites: A map of the collection sites, including a list of the top countries.
10. Taxon Occurrence: A map of the occurrence data for this taxon from GBIF.

Publication Database

The Publication Database contains details on publications that are relevant to the barcoding
community and are submitted by users of the system. It is accessible without logging into
BOLD. This database indexes title, abstract, year, and authors, allowing for broad searches.
Expanding a publication from the results list will provide details on the publication, including
a link to the article on the journal’s site, as illustrated below. A citation or set of citations can
be downloaded from BOLD using the drop-down menu to the right of the search bar.

Bibliographies can be submitted to this database by users, following the Bibliography
Submission protocol. By associating records with a bibliography on BOLD, the article citation
will appear everywhere the records appear in BOLD.

M. P. Garud Page 132


Publication database showing an example search for an author name.

Primer Database

The Primer Database is a database of all the public primers available in BOLD. It can be
accessed without a BOLD account. Using the search bar, users can enter terms that appear in
the primer code, submitter, or reference fields. Selecting a primer from the database will
provide details on the primer, including primer performance statistics derived from data
submitted to BOLD, as illustrated below. A primer or set of selected primers can be downloaded
in FASTA format using the Download Selected Primers button to the right of the search
bar. If users have previously registered a primer in BOLD, it will be available in the Primer
Database when the user is signed in to BOLD, allowing private primers to be edited (i.e., to make
them publicly available or to add citation information). New primers must be registered from

M. P. Garud Page 133


the User Console before trace files generated using them are submitted to records on BOLD
following the Trace Submission protocol.

Primer database showing an example search for primers associated with the keyword "bird".

Public Data Portal

Searching the Public Data Portal: The BOLD Public Data Portal is a database of all of the public
records on BOLD, including those in the early data-release phase of the iBOL project, where
information is still masked. This database can be used to access and download specimen data
and sequences.

Public users can search the Public Data Portal using taxonomy, geography (country and
state/province), and institution keywords, or by using a Sample ID or BOLD Process ID to find
individual records. Any combination of keywords can be entered into the search bar. For example, searching

"Lepidoptera Canada" will return all of the Lepidoptera records collected in Canada. Searching
"Lepidoptera Canada -Ontario" will return the same results, but with the specimens collected
in Ontarioomitted.
For further details and examples of the search functionality, see the search help section,
available by clicking on the help button to the right of the search bar. The search results will
display a list of the public records that match the searched terms, as illustrated below.
Toggling to "BINs" next to the search button will convert the list to all BINs available.

Public Data Portal results from a search for "Chordata"

Specimen Record: The record page gives information on the specimen identifier, taxonomy,
specimen details, collection data (including the collection site), sequence information, specimen
image details, and attribution details. The figure below shows the details page for a particular
record. A record page will reference a BIN when one is available and provides links to GenBank
records.

Public Record Page

Barcode Index Numbers (BINs)

The Barcode Index Number System is an online framework that clusters barcode sequences
algorithmically, generating a web page for each cluster. Since clusters show high concordance

M. P. Garud Page 137


with species, this system can be used to verify species identifications as well as document
diversity when taxonomic information is lacking. This system consists of three parts:

1. A clustering algorithm employing graph theoretic methods to generate operational


taxonomic units (OTUs) and putative species from sequence data without prior
taxonomic information.
2. A curated registry of barcode clusters integrated with an online database of specimen
and taxonomic data with support for community annotations.
3. An Annotation framework that allows researchers to review and critique the
taxonomic identifications associated with each BIN and notify data owners of errors.

The BIN framework can greatly expedite the evaluation and annotation of described species
and putative new ones while reducing the need to generate interim names, a non-trivial issue
in barcoding datasets. The BIN algorithm has been effectively tested on a broad set of
taxonomic groups and shows potential for applications in species abundance studies and
environmental barcoding. The registry employs modern URI and web service functionality
enabling integration with other databases.

BIN pages display aggregated data in several sections described and illustrated below.

M. P. Garud Page 138


BIN page example

1. BIN Details: BIN identifiers (URI and DOI), the member count, and distributional
information. Nearest-neighbour BIN details are also provided, along with the nearest
member and the taxonomy of that record.
2. Taxonomy: The taxonomy of the public data is visible for the BIN, with highlighting to
indicate taxonomic concordance and discordance. NEW! For each taxon, logged-in users
can search the records that they have access to by clicking on the magnifying glass icon.
3. Annotation: Via the Add Tags & Comments button, BIN pages support community vetting
through annotation of individual data elements (taxonomy, images, collection sites, and
attribution). Please see the Annotation section for more details.
4. Distance Distribution: A histogram provides the distribution of distances between
sequences within the BIN and against the nearest-neighbour sequence.
5. Associated Publications: A list of the publications that contain sequences from the BIN.
6. Dendrogram of Sequences: For BINs with 3-150 members, a circle tree is displayed which
also includes the nearest neighbour. Hovering over taxon names on the circular tree
highlights the terminal branch. A PDF version of the tree is available for download for
all BINs with more than 2 members.
7. Haplotype Network: The interactive diagram allows for investigation of the haplotypes in
the BIN cluster along species and geographical splits. Hovering over a haplotype node in
the diagram reveals details on which species or geographical information are grouped.
The larger the node, the more sequences in the haplotype. The thicker the line between
nodes, the more closely related those two haplotypes are.
8. Collection and Owner Data: A list of the collection countries and the number of specimens
collected per country, followed by a list of the owners of the public and private sequences
contained within the BIN. NEW! For each country, logged-in users can search the records
that they have access to by clicking on the magnifying glass icon.
9. BIN Barcode Compliance: BINs are marked as compliant if they contain at least one
sequence that meets Barcode Compliance standards.
10. Specimen Images: Displays all images for records clustered in the BIN, with license
information available for each.
11. Sampling Sites: Displays a map of the collection sites based on GPS coordinates.
12. Attribution: Lists the institutions where specimens are deposited and sequenced, along
with photographers, collectors, taxonomists, and funding sources. NEW! For each
Specimen Depository, logged-in users can search the records that they have access to by
clicking on the magnifying glass icon.

Public Annotation on Databases

As the volume of barcode data being generated increases rapidly, the need for routine curation
has become apparent. BOLD’s annotation and notification system supports rapid, community-
based validation of barcode data. Annotation can occur at the project level, at the record level,
and on specific data elements, including the taxonomy, images, and sequences on BIN pages.
The Annotation System leverages the large user base and expert knowledge for curation of
both private data within collaborative projects and public data through the Public Data Portal.
Tagging allows for categorization using custom and controlled tags. Both custom and
controlled tags can be used for filters, searches, and workflow management.

Comments and tags applied to data by BOLD users will appear in the Activity Report on
the User Console and the Activity Report on the appropriate Project Console. Comments

M. P. Garud Page 141


will persist on the data element with the user's full name and a date stamp. Tags can be removed
at any time by any user.

Annotation is available wherever the Add Tags and Comments button appears within BOLD.
Users must be signed in to BOLD to be able to add tags and comments.

The figure below illustrates the annotation window which allows for comments as well as the
option to choose an existing tag or create a new tag.

M. P. Garud Page 142

You might also like