
Bioinformatics

Board of Studies

Prof. H. N. Verma
Vice-Chancellor
Jaipur National University, Jaipur

Prof. M. K. Ghadoliya
Director, School of Distance Education and Learning
Jaipur National University, Jaipur

Dr. Rajendra Takale
Prof. and Head Academics
SBPIM, Pune

___________________________________________________________________________________________
Subject Expert Panel

Dr. Ramchandra G. Pawar
Director, SIBACA, Lonavala, Pune

Ashwini Pandit
Subject Matter Expert

___________________________________________________________________________________________
Content Review Panel

Gaurav Modi
Subject Matter Expert

Shubhada Pawar
Subject Matter Expert

___________________________________________________________________________________________
Copyright ©

This book contains the course content for Bioinformatics.

First Edition 2013

Printed by
Universal Training Solutions Private Limited

Address
05th Floor, I-Space,
Bavdhan, Pune 411021.

All rights reserved. This book or any portion thereof may not, in any form or by any means including electronic
or mechanical or photocopying or recording, be reproduced or distributed or transmitted or stored in a retrieval
system or be broadcasted or transmitted.

___________________________________________________________________________________________
Index

I. Content....................................................................... II

II. List of Figures............................................................ V

III. Abbreviations..........................................................VI

IV. Application............................................................ 104

V. Bibliography........................................................... 112

VI. Self Assessment Answers...................................... 115

Book at a Glance

Contents
Chapter I........................................................................................................................................................ 1
Introduction to Bioinformatics.................................................................................................................... 1
Aim................................................................................................................................................................. 1
Objectives....................................................................................................................................................... 1
Learning outcome........................................................................................................................................... 1
1.1 Introduction............................................................................................................................................... 2
1.2 Bioinformatics: The Brain of Biotechnology .......................................................................................... 3
1.3 Evolutionary Biology................................................................................................................................ 3
1.4 Origin & History of Bioinformatics ....................................................................................................... 4
1.5 Origin of Bioinformatic/Biological Databases ........................................................................................ 6
1.6 Importance of Bioinformatics................................................................................................................... 7
1.7 Use of Bioinformatics .............................................................................................................................. 8
1.8 Basics of Molecular Biology.................................................................................................................... 9
1.9 Definitions of Fields Related to Bioinformatics .....................................................................................11
1.10 Bioinformatics Applications ................................................................................................................ 13
Summary...................................................................................................................................................... 16
References.................................................................................................................................................... 17
Recommended Reading.............................................................................................................................. 17
Self Assessment............................................................................................................................................ 18

Chapter II.................................................................................................................................................... 20
Biological Databases................................................................................................................................... 20
Aim............................................................................................................................................................... 20
Objectives..................................................................................................................................................... 20
Learning outcome......................................................................................................................................... 20
2.1 Introduction............................................................................................................................................. 21
2.2 Categories of Biological Databases ....................................................................................................... 22
2.3 The Database Industry ........................................................................................................................... 22
2.4 Classification of Biological Databases .................................................................................................. 23
2.5 The Creation of Sequence Databases ..................................................................................................... 29
2.6 Bioinformatics Programs and Tools ....................................................................................................... 31
2.7 Bioinformatics Tools............................................................................................................................... 32
2.8 Application of Programmes in Bioinformatics....................................................................................... 35
Summary...................................................................................................................................................... 36
References.................................................................................................................................................... 36
Recommended Reading.............................................................................................................................. 37
Self Assessment............................................................................................................................................ 38

Chapter III................................................................................................................................................... 40
Genomics & Proteomics............................................................................................................................. 40
Aim............................................................................................................................................................... 40
Objectives..................................................................................................................................................... 40
Learning outcome......................................................................................................................................... 40
3.1 DNA, Genes and Genomes..................................................................................................................... 41
3.2 DNA Sequencing.................................................................................................................................... 41
3.3 Genome Mapping.................................................................................................................................... 42
3.4 Implications of Genomics for Medical Science...................................................................................... 42
3.5 Proteomics............................................................................................................................................... 43
3.6 Application of Proteomics to Medicine.................................................................................................. 46
3.7 Difference between Proteomics and Genomics...................................................................................... 46

3.8 Protein Modeling.................................................................................................................................... 46
Summary...................................................................................................................................................... 48
References.................................................................................................................................................... 48
Recommended Reading.............................................................................................................................. 48
Self Assessment............................................................................................................................................ 49

Chapter IV................................................................................................................................................... 51
Sequence Alignment.................................................................................................................................... 51
Aim............................................................................................................................................................... 51
Objectives..................................................................................................................................................... 51
Learning outcome......................................................................................................................................... 51
4.1 Introduction............................................................................................................................................. 52
4.2 Pairwise Sequence Alignment................................................................................................................. 52
4.3 Multiple Sequence Alignment (MSA).................................................................................................... 56
4.4 Substitution Matrices.............................................................................................................................. 56
4.5 Two Sample Applications....................................................................................................................... 56
Summary...................................................................................................................................................... 57
References.................................................................................................................................................... 57
Recommended Reading.............................................................................................................................. 57
Self Assessment............................................................................................................................................ 58

Chapter V..................................................................................................................................................... 60
Phylogenetic Analysis................................................................................................................................. 60
Aim............................................................................................................................................................... 60
Objectives..................................................................................................................................................... 60
Learning outcome......................................................................................................................................... 60
5.1 Introduction............................................................................................................................................. 61
5.2 Fundamental Elements of Phylogenetic Models..................................................................................... 62
5.3 Tree Interpretation: Importance of Identifying Paralogs and Orthologs................................................. 63
5.4 Phylogenetic Data Analysis.................................................................................................................... 64
5.4.1 Alignment: Building the Data Model...................................................................................... 64
5.4.2 Determining the Substitution Model....................................................................................... 64
5.4.3 Tree-Building Methods........................................................................................................... 65
5.4.4 Tree Evaluation....................................................................................................................... 65
Summary...................................................................................................................................................... 66
References.................................................................................................................................................... 66
Recommended Reading.............................................................................................................................. 66
Self Assessment............................................................................................................................................ 67

Chapter VI................................................................................................................................................... 69
Microarray Technology: A Boon to Biological Sciences.......................................................................... 69
Aim............................................................................................................................................................... 69
Objectives..................................................................................................................................................... 69
Learning outcome......................................................................................................................................... 69
6.1 Introduction to Microarray...................................................................................................................... 70
6.2 Microarray Technique............................................................................................................................. 70
6.3 Potential of Microarray Analysis ........................................................................................................... 72
6.4 Microarray Products .............................................................................................................................. 73
6.5 Microarray: Identifying Interactions....................................................................................................... 73
6.6 Applications of Microarrays................................................................................................................... 73
Summary...................................................................................................................................................... 76
References.................................................................................................................................................... 76
Recommended Reading.............................................................................................................................. 77
Self Assessment............................................................................................................................................ 78

Chapter VII................................................................................................................................................. 80
Bioinformatics in Drug Discovery: A Brief Overview............................................................................. 80
Aim............................................................................................................................................................... 80
Learning outcome......................................................................................................................................... 80
7.1 Introduction............................................................................................................................... 81
7.2 Drug Discovery....................................................................................................................................... 82
7.3 Informatics and Medical Sciences.......................................................................................................... 82
7.4 Bioinformatics and Medical Sciences..................................................................................................... 83
7.5 Bioinformatics in Computer-Aided Drug Design................................................................................... 84
7.6 Bioinformatics Tools .............................................................................................................................. 86
Summary...................................................................................................................................................... 88
References.................................................................................................................................................... 88
Recommended Reading.............................................................................................................................. 88
Self Assessment............................................................................................................................................ 89

Chapter VIII................................................................................................................................................ 91
Human Genome Project............................................................................................................................. 91
Objectives..................................................................................................................................................... 91
Learning outcome......................................................................................................................................... 91
8.1 Introduction............................................................................................................................................. 92
8.2 Human Genome Project.......................................................................................................................... 92
8.3 Genome Sequenced in the Public (HGP) and Private Projects............................................................... 92
8.4 Funding for Human Genome Sequencing............................................................................................... 93
8.5 DNA Sequencing ................................................................................................................................... 93
8.6 Bioinformatic Analysis: Finding Functions............................................................................................ 94
8.7 Insights Learned from the Human DNA Sequence................................................................................. 97
8.8 Future Challenges .................................................................................................................................. 98
Summary.................................................................................................................................................... 100
References.................................................................................................................................................. 100
Recommended Reading............................................................................................................................ 101
Self Assessment.......................................................................................................................................... 102

List of Figures
Fig. 1.1 Genes encode the recipes for proteins............................................................................................... 9
Fig. 2.1 Growth of the GenBank database.................................................................................................... 24
Fig. 2.2 GenBank file format........................................................................................................................ 25
Fig. 2.3 International nucleotide data banks................................................................................................. 27
Fig. 2.4 Application of bioinformatics in medical science........................................................................... 34
Fig. 4.1 Dot matrix........................................................................................................................................ 56
Fig. 5.1 Clade and node................................................................................................................................ 61
Fig. 5.2 A phylogenetic tree.......................................................................................................................... 63
Fig. 6.1 Gene expression data....................................................................................................................... 74
Fig. 6.2 Gene expression over time.............................................................................................................. 75

Abbreviations
ADMET - Absorption, Distribution, Metabolism, Excretion, Toxicity

ASCII - American Standard Code for Information Interchange

BLAST - Basic Local Alignment Search Tool

BLOSUM - Blocks Substitution Matrix

CADD - Computer-Aided Drug Design

cDNA - Complementary DNA

COPIA - Consensus Pattern Identification and Analysis

DDBJ - DNA Data Bank of Japan

DNA - Deoxyribonucleic Acid

EBI - European Bioinformatics Institute

ELSI - Ethical, Legal and Social Issues

EMBL - European Molecular Biology Laboratory

EMBOSS - European Molecular Biology Open Software Suite

EMR - Electronic Medical Records

ESI - Electro-Spray Ionisation

GIS - Geographic Information System

HGP - Human Genome Project

HMM - Hidden Markov Models

HTML - HyperText Markup Language

JVM - Java Virtual Machine

LC - Liquid Chromatography

MALDI - Matrix-Assisted Laser Desorption Ionisation

MS - Mass Spectrometry

NCBI - National Center for Biotechnology Information

NIH - National Institutes of Health

NMR - Nuclear Magnetic Resonance

OMIM - Online Mendelian Inheritance in Man

ORF - Open Reading Frame

PCR - Polymerase Chain Reaction

PDB - Protein Data Bank

PROSPECT - Protein Structure Prediction and Evaluation Computer ToolKit

PSA - Prostate-Specific Antigen

QSAR - Quantitative Structure Activity Relationships

RNA - Ribonucleic Acid

SCOP - Structural Classification of Proteins

TOF - Time-of-Flight

vHTS - Virtual High-Throughput Screening

W3C - World-Wide Web Consortium

WWW - World Wide Web

XML - Extensible Markup Language

Chapter I
Introduction to Bioinformatics

Aim
The aim of this chapter is to:

• define bioinformatics

• list the components of bioinformatics

• describe evolutionary biology

Objectives
The objectives of this chapter are to:

• explain history of bioinformatics

• elucidate use of HTML and Java

• describe XML

Learning outcome
At the end of this chapter, you will be able to:

• understand bioinformation infrastructure

• describe CORBA

• explain features of ROSETTA


1.1 Introduction
Bioinformatics is a recently emerged scientific discipline concerned with the computational analysis and storage of
biological data. The word bioinformatics is derived from two words: ‘Bio’, meaning biology, and ‘Informatique’
(a French word), meaning ‘data processing’.

Bioinformatics is the combination of biology and information technology. The discipline encompasses any
computational tools and methods used to manage, analyse and manipulate large sets of biological data. Essentially,
bioinformatics has three components:
• The creation of databases allowing the storage and management of large biological data sets.
• The development of algorithms and statistics to determine relationships among members of large data sets.
• The use of these tools for the analysis and interpretation of various types of biological data, including DNA,
RNA and protein sequences, protein structures, gene expression profiles, and biochemical pathways.
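
The three components listed above can be illustrated with a minimal Python sketch. It is not taken from any particular bioinformatics system: the table layout, the accession names and the toy sequences are all hypothetical, and GC content stands in for the far richer statistics used in practice.

```python
import sqlite3


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def build_database(records):
    """Component 1: store (accession, organism, sequence) rows in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequences (accession TEXT PRIMARY KEY, organism TEXT, sequence TEXT)"
    )
    conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


def gc_by_organism(conn):
    """Components 2 and 3: apply a simple statistic to every stored sequence."""
    rows = conn.execute("SELECT organism, sequence FROM sequences ORDER BY accession")
    return [(organism, round(gc_content(seq), 2)) for organism, seq in rows]


# Hypothetical toy records; the accessions and sequences are invented for illustration.
TOY_RECORDS = [
    ("TOY001", "organism A", "ATGCGC"),
    ("TOY002", "organism B", "ATATAT"),
]

if __name__ == "__main__":
    connection = build_database(TOY_RECORDS)
    for organism, gc in gc_by_organism(connection):
        print(organism, gc)
```

The database layer and the analysis function are kept separate here so that a different statistic could be swapped in without touching the storage code.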

The term bioinformatics came into widespread use in the 1990s and was originally synonymous with the management and
analysis of DNA, RNA and protein sequence data. Computational tools for sequence analysis had been available since
the 1960s, but they remained a minority interest until advances in sequencing technology led to a rapid expansion
in the number of sequences stored in databases such as GenBank. The term has since expanded to incorporate many
other types of biological data, for example protein structures, gene expression profiles and protein interactions. Each
of these areas requires its own set of databases, algorithms and statistical methods.

Bioinformatics is largely, although not exclusively, a computer-based discipline. Computers are important in
bioinformatics for two reasons:

Firstly, many bioinformatics problems require the same task to be repeated millions of times. Examples include comparing
a new sequence to every other sequence stored in a database, or comparing a group of sequences systematically
to determine evolutionary relationships. In such cases, the ability of computers to process information and test
alternative solutions rapidly is indispensable.
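
The repeated comparison of one query against every stored sequence can be sketched in a few lines of Python. This is a deliberately naive illustration. It scores ungapped, position-by-position identity, whereas real search tools such as BLAST rely on alignment and heuristics. The database contents and the function names are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of matching positions, compared over the shorter length (no gaps)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n


def rank_database(query: str, database: dict) -> list:
    """Compare the query against every stored sequence and sort best match first."""
    scores = [(name, percent_identity(query, seq)) for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy database; the names and sequences are invented for illustration.
TOY_DATABASE = {
    "seq_1": "ATGCCGTT",
    "seq_2": "ATGACGTT",
    "seq_3": "TTTTTTTT",
}

if __name__ == "__main__":
    for name, score in rank_database("ATGACGTA", TOY_DATABASE):
        print(f"{name}: {score:.1f}% identical")
```

The loop performs one comparison per database entry, so the work grows in direct proportion to the size of the database, which is why such searches are left to computers.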

Secondly, computers are required for their problem-solving power. Typical problems that might be addressed using
bioinformatics include predicting the folding pathway of a protein from its amino acid sequence, or deducing a
biochemical pathway from a collection of RNA expression profiles. Computers can help with such problems, but
expert input and robust original data are also required.

Bioinformatics is the field in which biology, computer science and information technology merge into a single discipline
for managing and analysing biological data using advanced computing techniques. It has emerged
as a full-fledged interdisciplinary subject that links developments in computer science and information
technology with the biological sciences. Knowledge of computer science and information technology is applied to
the creation and management of databases, data warehousing, data mining and communication networking
throughout the world.

Bioinformatics is the application of computer technology to the management and analysis of biological
data. The result is that computers are being used to gather, store, analyse and merge biological data. 

The ultimate goal of bioinformatics is to uncover the wealth of biological information hidden in the mass of data
and obtain a clearer insight into the fundamental biology of organisms. This new knowledge could have profound
impacts on fields as varied as human health, agriculture, the environment, energy and biotechnology.

Bioinformatics means conceptualising biology in terms of molecules (in the sense of physical chemistry) and then
applying “informatics” techniques (derived from disciplines such as applied mathematics, computer science and
statistics) to understand and organise the information associated with these molecules on a large scale.

The three terms bioinformatics, computational biology and bioinformation infrastructure are very similar and are often
used interchangeably. However, they can be distinguished as follows:
• Bioinformatics refers to database-like activities, involving persistent sets of data that are maintained in a consistent
state over essentially indefinite periods of time.
• Computational biology encompasses the use of algorithmic tools to facilitate biological analyses.
• Bioinformation infrastructure comprises the entire collection of information management systems, analysis
tools and communication networks supporting biology. Thus, the latter may be viewed as a computational
scaffold of the former two.

The future of bioinformatics is integration. For example, integrating a wide variety of data sources, such as clinical
and genomic data, will allow us to use disease symptoms to predict genetic mutations and vice versa. The integration
of GIS data, such as maps and weather systems, with crop health and genotype data will allow us to predict successful
outcomes of agricultural experiments.

Another future area of research in bioinformatics is large-scale comparative genomics. For example, the development
of tools that can do 10-way comparisons of genomes will push forward the discovery rate in this field of bioinformatics.
Along these lines, the modelling and visualisation of full networks of complex systems could be used in the future
to predict how the system (or cell) reacts to a drug, for example. Bioinformatics also faces a set of technical challenges,
which are being addressed by faster computers, technological advances in disk storage space, and increased bandwidth.
Finally, a key research question for the future of bioinformatics will be how to computationally compare complex
biological observations, such as gene expression patterns and protein networks. Bioinformatics is about converting
biological observations into a model that a computer can process. This is a very challenging task, since biology
can be very complex. The problem of how to digitise phenotypic data such as behaviour, electrocardiograms and
crop health into a computer-readable form offers exciting challenges for future bioinformaticians.

A bioinformaticist is an expert who not only knows how to use bioinformatics tools, but also knows how to write
interfaces for effective use of the tools. A bioinformatician, on the other hand, is a trained individual who knows only
how to use bioinformatics tools, without a deeper understanding.

1.2 Bioinformatics: The Brain of Biotechnology
The practical aim of bioinformatics is to understand the code of life, that is, to decode the information that resides in
the nucleotide sequence. It is well known that DNA is the basic molecule of life and directly controls the fundamental
biology of nearly all organisms (except those in which RNA is the genetic material). The nucleotide sequence constitutes
the genes, which in turn are expressed as proteins. Any variations or errors in the nucleotide sequence of the
genomic DNA (mutations) may lead to genetic disorders or other metabolic changes. Therefore,
researchers in biotechnology and molecular biology need to know the nature of the individual
genomes of various prokaryotic and eukaryotic organisms. Many DNA sequencing projects have already been
completed and many more are in progress, generating a huge amount of biological information. This has contributed to
the growth of the science of bioinformatics. Handling and interpreting such an enormous amount of information would
not be possible without bioinformatics. Hence, bioinformatics can be called the brain of biotechnology.

1.3 Evolutionary Biology


New insight into the molecular basis of a disease may come from investigating the function of homologs of a disease
gene in model organisms. In this case, homology refers to two genes sharing a common evolutionary history. Scientists
also use the term homology, or homologous, to simply mean similar, regardless of the evolutionary relationship.


Equally exciting is the potential for uncovering evolutionary relationships and patterns between different forms
of life. With the aid of nucleotide and protein sequences, it should be possible to find the ancestral ties between
different organisms. Thus far, experience has taught us that closely related organisms have similar sequences and
that more distantly related organisms have more dissimilar sequences. Proteins that show significant sequence
conservation, indicating a clear evolutionary relationship, are said to be from the same protein family. By studying
protein folds (distinct protein building blocks) and families, scientists are able to reconstruct the evolutionary
relationship between two species and to estimate the time of divergence between two organisms since they last
shared a common ancestor.

1.4 Origin & History of Bioinformatics


Over a century ago, bioinformatics history started with an Austrian monk named Gregor Mendel. He is known as
the “Father of Genetics”. He cross-fertilised plants of the same species that bore flowers of different colours. He kept
careful records of the colours of the flowers that he cross-fertilised and the colour(s) of flowers they produced. Mendel illustrated that
the inheritance of traits could be more easily explained if it was controlled by factors passed down from generation
to generation.

Since Mendel, bioinformatics and genetic record keeping have come a long way. The understanding of genetics has
advanced remarkably in the last thirty years. In 1972, Paul Berg made the first recombinant DNA molecule using
ligase. In that same year, Stanley Cohen, Annie Chang and Herbert Boyer produced the first recombinant DNA
organism. In 1973, two important things happened in the field of genomics:
• Joseph Sambrook led a team that refined the DNA electrophoresis technique using agarose gel, and
• Herbert Boyer and Stanley Cohen invented DNA cloning.

By 1977, a method for sequencing DNA had been discovered and the first genetic engineering company, Genentech,
had been founded.

By 1981, 579 human genes had been mapped and mapping by in situ hybridisation had become a standard method.
Marvin Caruthers and Leroy Hood made a huge leap in bioinformatics when they invented a method for automated
DNA sequencing. In 1988, the Human Genome Organisation (HUGO) was founded. This is an international
organisation of scientists involved in the Human Genome Project. In 1989, the first complete genome map, that of
the bacterium Haemophilus influenzae, was published. The following year, the Human Genome Project was started. By 1991, a total
of 1879 human genes had been mapped. In 1993, Genethon, a human genome research centre in France, produced a
physical map of the human genome. Three years later, Genethon published the final version of the Human Genetic
Map. This concluded the first phase of the Human Genome Project.

In the mid-1970s, it would take a laboratory at least two months to sequence 150 nucleotides. Ten years ago, the
only way to track genes was to scour large, well-documented family trees of relatively inbred populations, such as
the Ashkenazi Jews from Europe. These types of genealogical searches [...] 11 million nucleotides a day for its
corporate clients and company research.

Bioinformatics was fuelled by the need to create huge databases, such as GenBank, EMBL and the DNA Data Bank of
Japan, to store and compare the DNA sequence data erupting from the human genome and other genome sequencing
projects. Today, bioinformatics embraces protein structure analysis, gene and protein functional information, data
from patients, pre-clinical and clinical trials, and the metabolic pathways of numerous species.

Origin of internet
The management and, more importantly, accessibility of this data is directly attributable to the development of the
Internet, particularly the World Wide Web (WWW). The Internet was originally developed for military purposes in the
1960s and expanded by the National Science Foundation in the 1980s; its scientific use grew dramatically following
the release of the WWW by CERN in 1992.

HTML
The WWW is a graphical interface based on hypertext by which text and graphics can be displayed and highlighted.
Each highlighted element is a pointer to another document or an element in another document, which can reside on
any internet host computer. Page display, hypertext links and other features are coded using a simple, cross-platform
HyperText Markup Language (HTML) and viewed on UNIX workstations, PCs and Apple Macs as WWW pages
using a browser.

Java
The first graphical WWW browser - Mosaic for X and the first molecular biology WWW server - ExPASy were made
available in 1993. In 1995, Sun Microsystems released Java, an object-oriented, portable programming language
based on C++. In addition to being a standalone programming language in the classic sense, Java provides a highly
interactive, dynamic content to the Internet and offers a uniform operational level for all types of computers, provided
they implement the ‘Java Virtual Machine’ (JVM). Thus, programs can be written, transmitted over the internet and
executed on any other type of remote machine running a JVM. Java is also integrated into Netscape and Microsoft
browsers, providing both the common interface and programming capability, which is vital in sorting through and
interpreting the gigabytes of bioinformatics data now available and increasing at an exponential rate.

XML
The new XML standard is a project of the World-Wide Web Consortium (W3C), which extends the power of the
WWW to deliver not only HTML documents but an unlimited range of document types using customised markup.
This will enable the bioinformatics community to exchange data objects such as sequence alignments, chemical
structures, spectra and so on together with appropriate tools to display them, just as easily as they exchange HTML
documents today. Both Microsoft and Netscape support this new technology in their latest browsers.

CORBA
Another new technology, called CORBA, provides a way of bringing together many existing or ‘legacy’ tools
and databases with a common interface that can be used to drive them and access data. CORBA frameworks for
bioinformatics tools and databases have been developed by, for example, NetGenics and the European Bioinformatics
Institute (EBI).

Representatives from industry and the public sector under the umbrella of the Object Management Group are
working on open CORBA-based standards for biological information representation. The Internet offers scientists a
universal platform on which to share and search for data and the tools to ease data searching, processing, integration
and interpretation. The same hardware and software tools are also used by companies and organisations in more
private yet still global Intranet networks. One such company, Oxford GlycoSciences in the UK, has developed a
bioinformatics system as a key part of its proteomics activity.

ROSETTA
ROSETTA focuses on protein expression data and sets out to identify the specific proteins, which are up or down-
regulated in a particular disease; characterise these proteins with respect to their primary structure, post-translational
modifications and biological function; evaluate them as drug targets and markers of disease; and develop novel drug
candidates. Oxford GlycoSciences (OGS) uses a technique called fluorescent IPG-PAGE to separate and measure different protein types in
a biological sample such as a body fluid or purified cell extract. After separation, each protein is collected and then
broken up into many different fragments using controlled techniques. The mass and sequence of these fragments are
determined with great accuracy using a technique called mass spectrometry. The sequence of the original protein
can then be theoretically reconstructed by fitting these fragments back together in a kind of jigsaw. This reassembly
of the protein sequence is a task well-suited to signal processing and statistical methods.
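
The "jigsaw" idea can be illustrated with a toy sketch in Python. This is not the ROSETTA method, which the text does not describe in detail; it is a simple greedy overlap merge, and the function names, the `min_overlap` parameter and the peptide fragments are all invented for illustration. Real reconstruction from mass spectrometry data has to deal with noise, ambiguous residues and fragments contained inside other fragments, none of which is handled here.

```python
def overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble(fragments: list[str], min_overlap: int = 2) -> list[str]:
    """Greedily merge the pair of fragments with the largest overlap.

    Stops when no pair overlaps by at least `min_overlap` residues and
    returns whatever pieces are left (ideally a single sequence).
    """
    pieces = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(pieces):
            for j, right in enumerate(pieces):
                if i != j:
                    size = overlap(left, right)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size < min_overlap:
            return pieces
        merged = pieces[best_i] + pieces[best_j][best_size:]
        pieces = [p for k, p in enumerate(pieces) if k not in (best_i, best_j)]
        pieces.append(merged)


# Hypothetical overlapping peptide fragments (invented data).
fragments = ["QRQISFVK", "MKTAYIAK", "YIAKQRQI"]
print(assemble(fragments))  # ['MKTAYIAKQRQISFVK']
```

The greedy choice is easy to follow but can be misled by repeated motifs, which is one reason statistical methods are preferred in practice.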

ROSETTA is built on an object-relational database system, which stores demographic and clinical data on sample
donors and tracks the processing of samples and analytical results. It also interprets protein sequence data and matches
this data with that held in public, client and proprietary protein and gene databases. ROSETTA comprises a suite
of linked HTML pages, which allow data to be entered, modified and searched and allows the user easy access to
other databases. A high level of intelligence is provided through a sophisticated suite of proprietary search, analytical
and computational algorithms. These algorithms facilitate searching through the gigabytes of data generated by the
Company’s proteome projects, matching sequence data, carrying out de novo peptide sequencing and correlating
results with clinical data. These processing tools are mostly written in C, C++ or Java to run on a variety of computer
platforms and use the networking protocol of the internet, TCP/IP, to co-ordinate the activities of a wide range of
laboratory instrument computers, reliably identifying samples and collecting data for analysis.

The need to analyse ever increasing numbers of biological samples using increasingly complex analytical techniques
is insatiable. Searching for signals and trends in noisy data continues to be a challenging task, requiring great
computing power. Fortunately, this power is available with today’s computers, but of key importance is the integration
of analytical data, functional data and biostatistics. The protein expression data in ROSETTA forms only part of an
elaborate network of the type of data, which can now be brought to bear in biology. The need to integrate different
information systems into a collaborative network with a friendly face is bringing together an exciting mixture of
talents in the software world and has brought the new science of bioinformatics to life.

1.5 Origin of Bioinformatic/Biological Databases


The first bioinformatic/biological databases were constructed a few years after the first protein sequences began to
become available. The first protein sequence reported was that of bovine insulin in 1956, consisting of 51 residues.
Nearly a decade later, the first nucleic acid sequence was reported, that of yeast alanine tRNA with 77 bases. Just a
year later, Dayhoff gathered all the available sequence data to create the first bioinformatic database. The Protein
Data Bank followed in 1972 with a collection of ten X-ray crystallographic protein structures, and the SWISSPROT
protein sequence database began in 1987. A huge variety of divergent data resources of different types and sizes
is now available, either in the public domain or, more recently, from commercial third parties. All of the original
databases were organised in a very simple way, with data entries being stored in flat files, either one per entry or
as a single large text file. Later on, lookup indexes were added to allow convenient keyword searching
of header information.
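
As a rough illustration of a flat file with a keyword lookup index, the Python sketch below parses a single text block into entries and indexes the words of each header line. The record layout (a ">" header line followed by sequence lines), the entry identifiers and the sequences are assumptions made up for this example; the early databases each used their own formats.

```python
from collections import defaultdict


def parse_flat_file(text):
    """Split '>'-headed records into {entry_id: (description, sequence)}."""
    entries = {}
    entry_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: first word is the id, the rest is free text.
            entry_id, _, description = line[1:].partition(" ")
            entries[entry_id] = (description, "")
        elif entry_id is not None:
            # Sequence line: append to the current entry.
            description, sequence = entries[entry_id]
            entries[entry_id] = (description, sequence + line)
    return entries


def build_keyword_index(entries):
    """Map each lower-cased header word to the set of entry ids containing it."""
    index = defaultdict(set)
    for entry_id, (description, _sequence) in entries.items():
        for word in description.lower().split():
            index[word].add(entry_id)
    return index


# Invented toy database held as one large text "file".
TOY_DB = """\
>E1 toy insulin fragment
GIVEQ
CCAS
>E2 toy kinase fragment
MKTAY
"""

entries = parse_flat_file(TOY_DB)
index = build_keyword_index(entries)
print(sorted(index["toy"]))  # ['E1', 'E2']
print(entries["E1"][1])      # GIVEQCCAS
```

The index only covers header words, so a search on the sequences themselves still needs a scan of every entry.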

Origin of tools
After the formation of the databases, tools became available to search sequence databases - at first in a very simple
way, looking for keyword matches and short sequence words, and then more sophisticated pattern matching and
alignment based methods. The rapid but less rigorous BLAST algorithm has been the mainstay of sequence database
searching since its introduction a decade ago, complemented by the more rigorous and slower FASTA and
Smith-Waterman algorithms. Suites of analysis algorithms, written by leading academic researchers at Stanford, CA,
Cambridge, UK and Madison, WI for their in-house projects, began to become more widely available for basic
sequence analysis. These algorithms were typically single function black boxes that took input and produced output
in the form of formatted files. UNIX style commands were used to operate the algorithms, with some suites having
hundreds of possible commands, each taking different command options and input formats. Since these early efforts,
significant advances have been made in automating the collection of sequence information.
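
The two styles of search mentioned above can be sketched in a few lines of Python. The first function finds short "words" shared by two sequences; the second computes a Smith-Waterman local alignment score by dynamic programming. This is a minimal sketch, not the BLAST or FASTA implementation: the function names, the word length and the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions chosen for illustration.

```python
def shared_words(query, target, k=3):
    """Return the set of length-k words that occur in both sequences."""
    query_words = {query[i:i + k] for i in range(len(query) - k + 1)}
    target_words = {target[i:i + k] for i in range(len(target) - k + 1)}
    return query_words & target_words


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best


print(sorted(shared_words("ACGTAC", "TTACGT")))   # ['ACG', 'CGT', 'TAC']
print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # 6
```

Word matching is cheap and is typically used to decide where a slower alignment is worth running; the dynamic programming step examines every pair of positions, which is why it is more rigorous and slower.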

Rapid innovation in biochemistry and instrumentation has brought us to the point where the entire genomic
sequences of at least 20 organisms, mainly microbial pathogens, are known, and projects to elucidate at least 100
more prokaryotic and eukaryotic genomes are currently under way. Groups are now even competing to finish the
sequence of the entire human genome. With new technologies we can directly examine the changes in expression
levels of both mRNA and proteins in living cells, both in a disease state or following an external challenge. We can
go on to identify patterns of response in cells that lead us to an understanding of the mechanism of action of an agent
on a tissue. The volume of data arising from projects of this nature is unprecedented in the pharmaceutical industry,
and will have a profound effect on the ways in which data are used and experiments performed in drug discovery
and development projects. This is true not least because, with much of the available interesting data being in the
hands of commercial genomics companies, pharmaceutical companies are unable to get exclusive access to many gene sequences
or their expression profiles.

The competition between co-licensees of a genomic database is effectively a race to establish a mechanistic role
or other utility for a gene in a disease state in order to secure a patent position on that gene. Much of this work is
carried out by informatics tools. Despite the huge progress in sequencing and expression analysis technologies, and
the correspondingly larger volume of data that is held in the public, private and commercial databases, the tools
used for storage, retrieval, analysis and dissemination of data in bioinformatics are still very similar to the original
systems gathered together by researchers 15-20 years ago.

Many are simple extensions of the original academic systems, which have served the needs of both academic and
commercial users for many years. These systems are now beginning to fall behind as they struggle to keep up
with the pace of change in the pharmaceutical industry. Databases are still gathered, organised, disseminated and
searched using flat files. Relational databases are still few and far between, and object-relational or fully object
oriented systems are rarer still in mainstream applications. Interfaces still rely on command lines, fat client interfaces,
which must be installed on every desktop, or HTML/CGI forms. While these tools were in the hands of bioinformatics
specialists, pharmaceutical companies were relatively undemanding of them. Now that the problems have expanded to cover
the mainstream discovery process, much more flexible and scalable solutions are needed to serve pharmaceutical
R&D informatics requirements.

There are different views of the origin of bioinformatics. From T K Attwood and D J Parry-Smith’s “Introduction to
Bioinformatics”, Prentice-Hall 1999 [Longman Higher Education; ISBN 0582327881]: “The term bioinformatics
is used to encompass almost all computer applications in biological sciences, but was originally coined in the
mid-1980s for the analysis of biological sequence data.” From Mark S. Boguski’s article in the “Trends Guide to
Bioinformatics”, Elsevier, Trends Supplement 1998 p1: “The term ‘bioinformatics’ is a relatively recent invention,
not appearing in the literature until 1991 and then only in the context of the emergence of electronic publishing. The
National Center for Biotechnology Information (NCBI), is celebrating its 10th anniversary this year, having been
written into existence by US Congressman Claude Pepper and President Ronald Reagan in 1988. So, bioinformatics
has, in fact, been in existence for more than 30 years and is now middle-aged.”

1.6 Importance of Bioinformatics


The greatest challenge facing the molecular biology community today is to make sense of the wealth of
data that has been produced by the genome sequencing projects. Traditionally, molecular biology research
was carried out entirely at the experimental laboratory bench but the huge increase in the scale of data
being produced in this genomic era has seen a need to incorporate computers into this research process. 

Sequence generation, and its subsequent storage, interpretation and analysis are entirely computer dependent tasks.
However, the molecular biology of an organism is a very complex issue with research being carried out at different levels
including the genome, proteome, transcriptome and metabolome levels. Following on from the explosion in volume of
genomic data, similar increases in data have been observed in the fields of proteomics, transcriptomics and metabolomics.

The first challenge facing the bioinformatics community today is the intelligent and efficient storage of this mass of
data. It is then their responsibility to provide easy and reliable access to this data. The data itself is meaningless before
analysis and the sheer volume present makes it impossible for even a trained biologist to begin to interpret it manually.
Therefore, incisive computer tools must be developed to allow the extraction of meaningful biological information. 

There are three central biological processes around which bioinformatics tools must be developed:
• DNA sequence determines protein sequence
• Protein sequence determines protein structure
• Protein structure determines protein function
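
The first of these three steps is simple enough to sketch in Python: reading a coding DNA sequence three bases at a time and looking each codon up in the standard genetic code. The function name and the example sequence are illustrative assumptions, and the sketch reads only one frame on one strand. The other two steps, predicting structure from sequence and function from structure, are much harder problems and are not attempted here.

```python
from itertools import product

# Standard genetic code, laid out in the conventional order with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence from its first base up to the first stop codon."""
    protein = []
    dna = dna.upper()
    for i in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[i:i + 3]]
        if residue == "*":  # stop codon: TAA, TAG or TGA
            break
        protein.append(residue)
    return "".join(protein)


print(translate("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"))  # MAIVMGR
```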

The integration of information learned about these key biological processes should allow us to achieve the long
term goal of the complete understanding of the biology of organisms.


1.7 Use of Bioinformatics


Bioinformatics is used to:
• Store/retrieve biological information (databases)
• Retrieve/compare gene sequences
• Predict function of unknown genes/proteins
• Search for previously known functions of a gene
• Compare data with other researchers
• Compile/distribute data for other researchers

Due to the spectacular growth of biotechnology and molecular biology, a tremendous amount of nucleotide
and protein sequence data is being produced. This is where bioinformatics comes in:
• To uncover the wealth of biological information hidden in the mass of nucleotide sequences.
• To determine amino acid sequences from nucleotide sequences.
• To determine the structure of proteins from their amino acid sequences.
• To predict the functional aspects of proteins from their structures.

Besides these, the other aims of bioinformatics are:
• To provide biological data and other related literature on the Internet.
• To obtain a clearer insight into the fundamental biology of organisms.
• To use this information for the welfare of mankind.

Therefore, it is clear that bioinformatics is not limited to the computation of data; in reality it can be used to
solve many biological problems and to understand how living things work.

The major applications of bioinformatics are to access, search, visualise and retrieve information from sequence
databases, as well as to understand the structural information of biomolecules, proteome analysis and so on. Other
applications include cell metabolism, biodiversity, downstream processing in chemical engineering, and drug and vaccine
design. These are the areas in which bioinformatics is an integral component. Current efforts in molecular biology
(for example, genome projects) are producing a large quantity of data that is not only providing exciting opportunities
for knowledge discovery, but also creating a growing problem of information overload. Bioinformatics also concerns the
development of new tools for the analysis of genomic and molecular biological data. This can be applied to all fields
of biological science, such as agricultural science, environmental science, pharmaceutical science, chemical science and
medical science.

Fig. 1.1 Genes encode the recipes for proteins
(Source: http://birg.cs.wright.edu/text/Ch2.ppt)

1.8 Basics of Molecular Biology

The key concepts are:


Cell
Our body consists of a number of organs. Each organ is composed of a number of tissues, and each tissue is composed
of cells of the same type. The individual cell is the minimal self-reproducing unit in all living species. It performs
two types of functions: it carries out the chemical reactions necessary to maintain life, and it passes the
information for maintaining life to the next generation. Since the cell is the vehicle for transmission of the genetic
information in all living species, it needs to store the genetic information in the form of double-stranded DNA. The
cell replicates its information by separating the paired DNA strands and using each as a template for polymerisation
to make a new DNA strand with a complementary sequence of nucleotides. The same strategy is used to transcribe
portions of the information from DNA into molecules of the closely related polymer, RNA. RNA is the intermediate
between DNA and protein, and it guides the synthesis of protein molecules by the complex machinery of translation,
the ribosome. The resultant proteins are the main catalysts for almost all the chemical reactions in the cell.
In addition to acting as catalysts, proteins serve as building blocks and carry out transport, signaling, and other functions.
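
The template idea behind replication and transcription can be written down as a small Python sketch. Base pairing (A with T, C with G) fixes the sequence of the new strand, and an RNA copy uses uracil (U) where DNA has thymine (T). The function names are illustrative, and the sketch treats strands as plain strings of the letters A, C, G and T; it says nothing about the enzymes that do the work in the cell.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand):
    """Return the complementary DNA strand, written 5' to 3'."""
    return strand.upper().translate(DNA_COMPLEMENT)[::-1]


def transcribe(coding_strand):
    """Return the RNA with the same sequence as the coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


strand = "ATGC"
print(reverse_complement(strand))  # GCAT
print(transcribe(strand))          # AUGC
```

Taking the reverse complement twice gives back the original strand, which is the string-level counterpart of each strand serving as a template for the other.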

Proteins: Molecular Machines


Proteins constitute most of a cell’s dry mass. They are not only the building blocks from which cells are built; they
also execute nearly all cell functions. Understanding proteins can help us to understand how our bodies function
and how other biological processes work. A protein is made from a long chain of amino acids, each linked to its
neighbor through a covalent peptide bond. There are 20 types of amino acids in proteins, and each amino acid has
different chemical properties. Proteins range in length from 20 to more than 5000 amino acids; on average, a protein
contains around 350 amino acids. Proteins are therefore also known as polypeptides. In order to perform its chemical
function, a protein needs to fold into a particular three-dimensional shape. Several interactions cause proteins to fold,
such as the sets of weak non-covalent bonds that form between one part of the chain and another. The weak bonds
are of three types: hydrogen bonds, ionic bonds, and Van der Waals attractions. In addition to these three
weak bonds, a fourth weak force, the hydrophobic interaction, also has a central role in determining the
shape of a protein. The correct shape is vital to a protein’s functionality.


Proteins have a variety of roles that they must fulfill:


• They are the enzymes that rearrange chemical bonds.
• They carry signals to and from the outside of the cell, and within the cell.
• They transport small molecules.
• They form many of the cellular structures.
• They regulate cell processes, turning them on and off and controlling their rates.

This variety of roles is accomplished by the variety of proteins, which collectively can assume a variety of three-
dimensional shapes.

A protein’s three-dimensional shape, in turn, is determined by the particular one-dimensional composition of the
protein. Each protein is a linear sequence made of smaller constituent molecules called amino acids. The constituent
amino acids are joined by a “backbone” composed of a regularly repeating sequence of bonds. There are 20 different
types of amino acids. The three-dimensional shape assumed by the protein is determined by the specific linear
sequence of amino acids from N-terminus to C-terminus. Different sequences of amino acids fold into different
three-dimensional shapes.
• Proteins in your muscles allow you to move (myosin and actin)
• Enzymes (digestion, catalysis)
• Structure (collagen)
• Signaling (hormones, kinases)
• Transport (energy, oxygen)

DNA
DNA is the genetic material in all organisms (certain viruses being the exception) and it stores the instructions
needed by the cell to perform its daily life functions. DNA can be thought of as a large cookbook with recipes for making
every protein in the cell. The information in DNA is used like a library. Library books can be read and reread many
times. Similarly, the information in the genes is read, perhaps millions of times in the life of an organism, but the
DNA itself is never used up. DNA consists of two long interwoven strands that form the famous “double helix”. Each
strand is a chain built from a small set of constituent molecules called nucleotides.

DNA Structure
DNA has a double-helix structure: it consists of two strands that are interwoven to resemble a twisted
ladder. If you look at it in detail, you can observe that the rungs consist of chemical compounds called
bases, while the sides of the ladder are the sugar (deoxyribose) and phosphate molecules. These three parts,
base, sugar, and phosphate, form the small molecules known as nucleotides. There are 4 types of bases
that form the rungs of the DNA double helix; these are the 4 letters of the genetic code (A/Adenine, G/Guanine,
C/Cytosine, and T/Thymine). The correct structure of DNA was first deduced by J. D. Watson and F. H. C. Crick in 1953.

RNA
Chemically, RNA is very similar to DNA. There are two main differences:
• RNA uses the sugar ribose instead of deoxyribose in its backbone (from which RNA, RiboNucleic Acid, gets
its name).
• RNA uses the base uracil (U) instead of thymine (T). U is chemically similar to T, and in particular is also
complementary to A.

RNA has two properties important for our purposes. First, it tends to be single-stranded in its “normal” cellular
state. Secondly, because RNA (like DNA) has base-pairing capability, it often forms intramolecular hydrogen bonds,
partially hybridising to itself. Because of this, RNA, like proteins, can fold into complex three-dimensional shapes.

RNA has some of the properties of both DNA and proteins. It has the same information storage capability as DNA
due to its sequence of nucleotides. But its ability to form three-dimensional structures allows it to have enzymatic
properties like those of proteins. Because of this dual functionality of RNA, it has been conjectured that life may
have originated from RNA alone, DNA and proteins having evolved later.

Genome
The genome of an organism is its complete set of DNA. All the genetic information in an organism is referred
to collectively as a “genome”. Genomes vary widely in size: the smallest known genome for a free-living organism
(a bacterium of the genus Mycoplasma, such as Mycoplasma genitalium) contains about 600,000 DNA base pairs,
while human and mouse genomes have some 3 billion. Except for mature red blood cells, all human cells contain
a complete genome.

Chromosome
The 3 billion bases of the human genome are not all in one continuous strand of DNA. Rather, the human genome
is divided into 23 pairs of separate DNA molecules, called chromosomes. Chromosomes are structures within the cell
nucleus that carry genes.

Gene
A gene is a DNA sequence that encodes a protein or an RNA molecule. Each chromosome contains many genes,
which are the basic physical and functional units of heredity. Each gene occupies a particular position on a particular
chromosome. The human genome is expected to contain 30,000 - 35,000 genes.

1.9 Definitions of Fields Related to Bioinformatics


Computational Biology
Computational biologists interest themselves more with evolutionary, population and theoretical biology rather than
cell and molecular biomedicine. It is inevitable that molecular biology is profoundly important in computational
biology, but it is certainly not what computational biology is all about. In these areas of computational biology it
seems that computational biologists have tended to prefer statistical models for biological phenomena over physico-
chemical ones.

One computational biologist (Paul J Schulte) objected to the above and made the entirely valid point that this
definition derives from a popular use of the term, rather than a correct one. Paul works on water flow in plant cells.
He points out that biological fluid dynamics is a field of computational biology in itself. He argues that this, and
any application of computing to biology, can be described as “computational biology”. Computational biology is
not a “field”, but an “approach” involving the use of computers to study biological processes and hence it is an area
as diverse as biology itself.

Genomics
Genomics is a field that existed before the completion of genome sequences, but only in the crudest of forms.
For example, the referenced estimate of 100,000 genes in the human genome derived from a famous piece of “back
of an envelope” genomics: guessing the weight of chromosomes and the density of the genes they bear. Genomics
is any attempt to analyse or compare the entire genetic complement of one or more species. It is, of course,
possible to compare genomes by comparing more-or-less representative subsets of genes within genomes.

Proteomics
Michael J. Dunn, the Editor-in-Chief of Proteomics, defines the “proteome” as “the Protein complement of the
genome” and proteomics as being concerned with “Qualitative and quantitative studies of gene expression at the level
of the functional proteins themselves”, that is, “an interface between protein biochemistry and molecular biology”.
Characterising the many tens of thousands of proteins expressed in a given cell type at a given time, whether by
measuring their molecular weights or isoelectric points, identifying their ligands or determining their structures,
involves the storage and comparison of vast amounts of data. Inevitably this requires bioinformatics.


Pharmacogenomics
Pharmacogenomics is the application of genomic approaches and technologies to the identification of drug targets.
Examples include trawling entire genomes for potential receptors by bioinformatics means, or by investigating
patterns of gene expression in both pathogens and hosts during infection, or by examining the characteristic expression
patterns found in tumours or patient samples for diagnostic purposes (or in the pursuit of potential cancer therapy
targets).

Pharmacogenetics
Individuals respond differently to drug treatments: some respond positively, others show little obvious change
in their condition, and yet others suffer side effects or allergic reactions. Much of this variation is known to have a
genetic basis. Pharmacogenetics is a subset of pharmacogenomics that uses genomic and bioinformatics methods
to identify genomic correlates, for example SNPs (Single Nucleotide Polymorphisms), characteristic of particular
patient response profiles, and uses those markers to inform the administration and development of therapies. Strikingly,
such approaches have been used to “resurrect” drugs previously thought to be ineffective but subsequently found
to work within a subset of patients, and to optimise the doses of chemotherapy for particular patients.
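
To make the idea of an SNP concrete, the following is a minimal sketch that reports the positions at which a patient sequence differs from a reference by a single base. It assumes the two sequences are already aligned and of equal length; the function name `find_snps` and both sequences are invented for illustration and do not come from any real patient or database.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for every mismatch.

    Assumes the two sequences are already aligned and have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned and of equal length")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(
            zip(reference.upper(), sample.upper()), start=1
        )
        if ref_base != alt_base
    ]


# Invented sequences for illustration only.
reference = "ATGGCCATTGTAATGGGCCGC"
patient = "ATGGCCATTGTTATGGGCCGC"

snps = find_snps(reference, patient)
for position, ref_base, alt_base in snps:
    print(f"position {position}: {ref_base} -> {alt_base}")
```

In practice, such differences would be collected across many patients and then correlated with their recorded drug responses; that statistical step is outside the scope of this sketch.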

Cheminformatics
The Web advertisement for Cambridge Healthtech Institute’s Sixth Annual Cheminformatics conference describes
the field thus: “the combination of chemical synthesis, biological screening, and data-mining approaches used to
guide drug discovery and development” but this, again, sounds more like a field being identified by some of its
most popular (and lucrative) activities, rather than by including all the diverse studies that come under its general
heading.

The story of one of the most successful drugs of all time, penicillin, seems bizarre, but the way we discover and
develop drugs even now has similarities, being the result of chance, observation and a lot of slow, intensive chemistry.
Until recently, drug design always seemed doomed to continue to be a labour-intensive, trial-and-error process. The
possibility of using information technology to plan intelligently and to automate processes related to the chemical
synthesis of possible therapeutic compounds is very exciting for chemists and biochemists. The rewards for bringing
a drug to market more rapidly are huge, so naturally this is what a lot of cheminformatics work is about. The span
of academic cheminformatics is wide and is exemplified by the interests of the cheminformatics groups at the Centre
for Molecular and Biomolecular Informatics at the University of Nijmegen in the Netherlands.

These interests include: Synthesis Planning, Reaction and Structure Retrieval, 3-D Structure Retrieval, Modelling,
Computational Chemistry, and Visualisation Tools and Utilities. Trinity University’s Cheminformatics Web page, for
another example, concerns itself with cheminformatics as the use of the Internet in chemistry.

Medical Informatics
“Biomedical Informatics is an emerging discipline that has been defined as the study, invention, and implementation
of structures and algorithms to improve communication, understanding and management of medical information.”
Medical informatics is more concerned with structures and algorithms for the manipulation of medical data than
with the data itself. This suggests that one difference between bioinformatics and medical informatics as disciplines
lies in their approaches to the data: there are bioinformaticists interested in the theory behind the manipulation
of that data, and there are bioinformatics scientists concerned with the data itself and its biological implications.
Medical informatics, for practical reasons, is more likely to deal with data obtained at “grosser” biological levels,
that is, information from super-cellular systems right up to the population level, while most bioinformatics is concerned
with information about cellular and biomolecular structures and systems.

1.10 Bioinformatics Applications
Molecular medicine
The human genome will have profound effects on the fields of biomedical research and clinical medicine. Every
disease has a genetic component. This may be inherited (as is the case with an estimated 3000-4000 hereditary
diseases, including cystic fibrosis and Huntington’s disease) or a result of the body’s response to an environmental
stress which causes alterations in the genome (for example, cancers, heart disease and diabetes).

The completion of the human genome means that we can search for the genes directly associated with different
diseases and begin to understand the molecular basis of these diseases more clearly. This new knowledge of the
molecular mechanisms of disease will enable better treatments, cures and even preventative tests to be developed.

Personalised medicine
Clinical medicine will become more personalised with the development of the field of pharmacogenomics. This is
the study of how an individual’s genetic inheritance affects the body’s response to drugs. At present, some drugs fail
to make it to the market because a small percentage of the clinical patient population shows adverse effects to a drug
due to sequence variants in their DNA. As a result, potentially life-saving drugs never make it to the marketplace.
Today, doctors have to use trial and error to find the best drug to treat a particular patient, as those with the same
clinical symptoms can show a wide range of responses to the same treatment. In the future, doctors will be able to
analyse a patient’s genetic profile and prescribe the best available drug therapy and dosage from the beginning.

Preventative medicine
With the specific details of the genetic mechanisms of diseases being unraveled, the development of diagnostic tests
to measure a person’s susceptibility to different diseases may become a distinct reality. Preventative actions such
as change of lifestyle or having treatment at the earliest possible stages when they are more likely to be successful,
could result in huge advances in our struggle to conquer disease.

Gene therapy
In the not too distant future, the potential for using genes themselves to treat disease may become a reality. Gene
therapy is the approach used to treat, cure or even prevent disease by changing the expression of a person’s genes.
Currently, this field is in its infancy, with clinical trials ongoing for many different types of cancer and other
diseases.

Drug development
At present all drugs on the market target only about 500 proteins. With an improved understanding of disease
mechanisms and using computational tools to identify and validate new drug targets, more specific medicines that
act on the cause, not merely the symptoms, of the disease can be developed. These highly specific drugs promise
to have fewer side effects than many of today’s medicines.

Microbial genome applications
Microorganisms are ubiquitous, that is, they are found everywhere. They have been found surviving and thriving
in extremes of heat, cold, radiation, salt, acidity and pressure. They are present in the environment, our bodies, the
air, food and water. Traditionally, use has been made of a variety of microbial properties in the baking, brewing
and food industries. The arrival of the complete genome sequences and their potential to provide a greater insight
into the microbial world and its capacities could have broad and far-reaching implications for environment, health,
energy and industrial applications. For these reasons, in 1994, the US Department of Energy (DOE) initiated the
MGP (Microbial Genome Project) to sequence genomes of bacteria useful in energy production, environmental
cleanup, industrial processing and toxic waste reduction. By studying the genetic material of these organisms,
scientists can begin to understand these microbes at a very fundamental level and isolate the genes that give them
their unique abilities to survive under extreme conditions.

Waste cleanup
Deinococcus radiodurans, often described as the world’s toughest bacterium, is one of the most radiation-resistant
organisms known. Scientists are interested in this organism because of its potential usefulness in cleaning up waste
sites that contain radiation and toxic chemicals.

Climate change studies
Increasing levels of carbon dioxide emission, mainly through the expanding use of fossil fuels for energy, are thought
to contribute to global climate change. Recently, the DOE (Department of Energy, USA) launched a program to
decrease atmospheric carbon dioxide levels. One method of doing so is to study the genomes of microbes that use
carbon dioxide as their sole carbon source.

Alternative energy sources
Scientists are studying the genomes of organisms that have an unusual capacity for generating energy from light.

Biotechnology
Certain archaea and bacteria have potential for practical applications in industry and government-funded
environmental remediation. These microorganisms thrive in water temperatures above the boiling point and may
therefore provide the DOE, the Department of Defence, and private companies with heat-stable enzymes suitable for
use in industrial processes.

Other microbes are of high industrial interest as research objects because they are used by the chemical
industry for the biotechnological production of the amino acid lysine, one of the essential amino acids in animal
nutrition. Biotechnologically produced lysine is added to feed concentrates as a source of protein, and is an
alternative to soybeans or meat and bone meal.

Micro-organisms are also useful in the dairy industry, for manufacturing dairy products like buttermilk, yogurt and
cheese. They are used to prepare pickled vegetables, beer, wine, some breads and sausages, and other fermented
foods. Researchers anticipate that understanding the physiology and genetic make-up of these bacteria will prove
invaluable for food manufacturers, as well as for the pharmaceutical industry as a vehicle for delivering drugs.

Antibiotic resistance
Scientists have been examining the genome of a bacterium. They have discovered a virulence region made up of a
number of antibiotic-resistance genes that may contribute to the bacterium’s transformation from a harmless gut
bacterium to a menacing invader. The discovery of the region, known as a pathogenicity island, could provide useful
markers for detecting pathogenic strains and help to establish controls to prevent the spread of infection in hospital
wards.

Forensic analysis of microbes
Scientists used genomic tools to distinguish the strain of the rod-shaped bacterium used in the 2001 terrorist
attack in Florida from closely related anthrax strains.

The reality of bioweapon creation
Scientists have recently built the poliomyelitis virus using entirely artificial means. They did this using genomic
data available on the Internet and materials from a mail-order chemical supplier. The research was financed by the US
Department of Defense as part of a biowarfare response program, to prove to the world the reality of bioweapons.
The researchers also hope their work will discourage officials from ever relaxing programs of immunisation. This
project has been met with very mixed feelings.

Evolutionary studies
The sequencing of genomes from all three domains of life, eukaryota, bacteria and archaea means that evolutionary
studies can be performed in a quest to determine the tree of life and the last universal common ancestor.

Crop improvement
Comparative genetics of plant genomes has shown that the organisation of their genes has remained more
conserved over evolutionary time than was previously believed. These findings suggest that information obtained
from the model crop systems can be used to suggest improvements to other food crops. At present the complete
genomes of thale cress (Arabidopsis thaliana) and rice are available.

Insect resistance
Genes from Bacillus thuringiensis that can control a number of serious pests have been successfully transferred to
cotton, maize and potatoes. This new ability of the plants to resist insect attack means that the amount of insecticides
being used can be reduced and hence the nutritional quality of the crops is increased.

Improve nutritional quality


Scientists have recently succeeded in transferring genes into rice to increase levels of Vitamin A, iron and other
micronutrients. This work could have a profound impact in reducing occurrences of blindness and anaemia caused
by deficiencies in Vitamin A and iron respectively. Scientists have inserted a gene from yeast into the tomato, and
the result is a plant whose fruit stays longer on the vine and has an extended shelf life.

Development of drought-resistant varieties
Progress has been made in developing cereal varieties that have a greater tolerance for soil alkalinity, free aluminium
and iron toxicities. These varieties will allow agriculture to succeed in poorer soil areas, thus adding more land
to the global production base. Research is also in progress to produce crop varieties capable of tolerating reduced
water conditions.

Veterinary science
Sequencing projects of many farm animals including cows, pigs and sheep are now well under way in the hope that
a better understanding of the biology of these organisms will have huge impacts for improving the production and
health of livestock and ultimately have benefits for human nutrition.

Comparative studies
Analysing and comparing the genetic material of different species is an important method for studying the functions
of genes, the mechanisms of inherited diseases and species evolution. Bioinformatics tools can be used to make
comparisons between the numbers, locations and biochemical functions of genes in different organisms. Organisms
that are suitable for use in experimental research are termed model organisms.

They have a number of properties that make them ideal for research purposes: they have short life spans, reproduce
rapidly, are easy to handle and inexpensive to keep, and can be manipulated at the genetic level.

An example of a model organism for humans is the mouse. Mouse and human are very closely related (>98%) and for
the most part we see a one-to-one correspondence between genes in the two species. Manipulation of the mouse at
the molecular level and genome comparisons between the two species can reveal, and are revealing, detailed
information on the functions of human genes, the evolutionary relationship between the two species and the
molecular mechanisms of many human diseases.

Summary
• Bioinformatics is a newly emerged scientific discipline for the computational analysis and storage of biological
data.
• Bioinformatics is the combination of biology and information technology. The discipline encompasses any
computational tools and methods used to manage, analyse and manipulate large sets of biological data.
• Bioinformatics is the field in which biology, computer science and information technology merge into single
discipline for managing and analysing biological data using advanced computing techniques.
• The three terms bioinformatics, computational biology and bioinformation infrastructure are very similar and
are used interchangeably most of the time.
• Bioinformatics refers to database like activities, involving persistent sets of data that are maintained in a consistent
state over essentially indefinite periods of time.
• Computational biology encompasses the use of algorithmic tools to facilitate biological analyses.
• Bioinformation infrastructure comprises the entire collection of information management systems, analysis tools
and communication networks supporting biology. Thus, the latter may be viewed as a computational scaffold
of the former two.
• A bioinformaticist is an expert who not only knows how to use bioinformatics tools, but also knows how to
write interfaces for effective use of the tools.
• A bioinformatician, on the other hand, is a trained individual who only knows to use bioinformatics tools without
a deeper understanding.
• Page display, hypertext links and other features are coded using a simple, cross-platform HyperText Markup
Language (HTML) and viewed on UNIX workstations, PCs and Apple Macs as WWW pages using a
browser.
• The first graphical WWW browser - Mosaic for X and the first molecular biology WWW server - ExPASy were
made available in 1993.
• CORBA frameworks for bioinformatics tools and databases have been developed by, for example, NetGenics
and the European Bioinformatics Institute (EBI).
• ROSETTA is built on an object-relational database system, which stores demographic and clinical data on
sample donors and tracks the processing of samples and analytical results.
• Sequence generation, and its subsequent storage, interpretation and analysis are entirely computer dependent
tasks.
• Due to the spectacular growth of biotechnology and molecular biology, a tremendous amount of nucleotide
and protein sequence data is being produced.
• DNA is the genetic material in all organisms (with certain viruses being the exception) and it stores the
instructions needed by the cell to perform its daily life functions.
• A gene is a DNA sequence that encodes a protein or an RNA molecule.

References
• Koslow, S. H., & Huerta, M. F., 2000. Electronic collaboration in science, Routledge.
• Rastogi, S. C., Mendiratta, N. & Rastogi, P., 2006. Bioinformatics-Concepts, Skills, Applications, 2nd ed., PHI
Learning Pvt. Ltd.
• Robbins. Bioinformatics: Essential Infrastructure For Global Biology [pdf] Available at:
<http://www.esp.org/oecd.pdf> [Accessed 28 February 2012].
• Khandekar. Role of Bioinformatics In Medical Informatics A Case Study : Tuberculosis [pdf] Available at:
<http://www.jbtdrc.org/Symposium/Topics/Role_bio.pdf> [Accessed 28 February 2012].
• InsGenomeSciences, 2010. Introduction to Bioinformatics [Video Online] Available at:
<http://www.youtube.com/watch?v=xODTm4a6nsM> [Accessed 28 February 2012].
• plantbreedgenomics, 2010. Bioinformatics 101 - Part 2 Intro [Video Online] Available at:
<http://www.youtube.com/watch?v=WlVGTtqT4Tg&feature=related> [Accessed 28 February 2012].

Recommended Reading
• Ramsden, J., 2009. Bioinformatics: An introduction, 2nd ed., Springer.
• Polański, A. & Kimmel, M., 2007. Bioinformatics, Springer.
• Letovsky, S. Bioinformatics: Databases and Systems, O’REILLY.

Self Assessment

1. The word ‘bio’ refers to _________.


a. biology
b. data mining
c. data warehousing
d. analysis

2. _________ encompasses the use of algorithmic tools to facilitate biological analyses.


a. Computational biology
b. Bioinformation infrastructure
c. Bioinformatics
d. Biology

3. What refers to database like activities, involving persistent sets of data that are maintained in a consistent state
over essentially indefinite periods of time?
a. Computational biology
b. Bioinformation infrastructure
c. Bioinformatics
d. Biology

4. What comprises the entire collection of information management systems, analysis tools and communication
networks supporting biology?
a. Computational biology
b. Bioinformation infrastructure
c. Bioinformatics
d. Biology

5. ___________ refers to two genes sharing a common evolutionary history.


a. Homology
b. Evolutionary biology
c. Biology
d. Bioinformatics

6. The _______ is a graphical interface based on hypertext by which text and graphics can be displayed and
highlighted.
a. WWW
b. HTML
c. UNIX
d. JVM

7. Which of the following statements is false?
a. Proteins that show significant sequence conservation, indicating a clear evolutionary relationship, are said
to be from the same protein family.
b. Protein is the basic molecule of life that directly controls the fundamental biology of nearly all organisms
(except those where RNA is genetic material).
c. Any variations and errors in the nucleotide sequence of the genomic DNA or mutations may lead to
development of genetic disorders or other metabolic changes.
d. A bioinformaticist is an expert who not only knows how to use bioinformatics tools, but also knows how to
write interfaces for effective use of the tools.

8. Which of the following statements is false?


a. Data mining refers to database like activities, involving persistent sets of data that are maintained in a
consistent state over essentially indefinite periods of time.
b. Computational biology encompasses the use of algorithmic tools to facilitate biological analyses.
c. Bioinformation infrastructure comprises the entire collection of information management systems, analysis
tools and communication networks supporting biology. Thus, the latter may be viewed as a computational
scaffold of the former two.
d. The ultimate goal of bioinformatics is to uncover the wealth of biological information hidden in the mass
of data and obtain a clearer insight into the fundamental biology of organisms.

9. Which of the following statements is false?


a. A bioinformaticist is a trained individual who only knows to use bioinformatics tools without a deeper
understanding.
b. Scientists also use the term homology, or homologous, to simply mean similar, regardless of the evolutionary
relationship.
c. Homology refers to two genes sharing a common evolutionary history.
d. Proteins that show significant sequence conservation, indicating a clear evolutionary relationship, are said
to be from the same protein family.

10. Who is known as the father of genetics?


a. Stanley Cohen
b. Gregor Mendel
c. Paul Berg
d. Herbert Boyer

Chapter II
Biological Databases

Aim
The aim of this chapter is to:

• define biological database

• describe four major sequence databases

• understand the GenBank file format

Objectives
The objectives of this chapter are to:

• describe nucleotide databases

• elucidate the principal requirements on the public data services

• classify biological databases

Learning outcome
At the end of this chapter, you will be able to:

• understand the specific features of biological databases

• enumerate the categories of biological databases

• differentiate between primary and secondary database

2.1 Introduction
Modern genomic research generates huge amounts of raw sequence data. As the volume of genomic data grows,
sophisticated computational methodologies are required to manage it. The challenge in the genomics era is to store
and handle this volume of information through the establishment and use of computer databases. Thus, the
development of databases to handle the vast amount of molecular biological data is a fundamental task of
bioinformatics.

A biological database is a large, organised body of persistent data, usually associated with computerised software
designed to update, query and retrieve components of the data stored within the system. A simple database might
be a single file containing many records, each including the same set of information. A record in a
nucleotide sequence database contains information such as a contact name, the input sequence with a description of
the type of molecule, the scientific name of the source organism from which it was isolated, and literature citations
associated with the sequence.

For researchers to benefit from the data stored in a database, two additional requirements must be met:
• Easy access to the information
• A method for extracting only that information needed to answer a specific biological question
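
The record structure and the two requirements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `SequenceRecord`, the helper `find_by_organism`, the accession numbers, contact names and citation titles are all invented for this example and do not correspond to the schema or contents of any real database.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """One record of a simple nucleotide sequence database (hypothetical layout)."""
    accession: str          # identifier of the record
    contact_name: str       # person who submitted the sequence
    molecule_type: str      # description of the type of molecule, e.g. "DNA"
    organism: str           # scientific name of the source organism
    sequence: str           # the input sequence itself
    citations: list[str] = field(default_factory=list)  # literature citations


# A "database" in its simplest form: many records with the same set of fields.
# All values below are made up for illustration.
database = [
    SequenceRecord("EX000001", "A. Researcher", "DNA", "Homo sapiens",
                   "ATGGCCATTGTAATGGGCCGC", ["Example paper A"]),
    SequenceRecord("EX000002", "B. Researcher", "mRNA", "Mus musculus",
                   "AUGGCCAUUGUAAUGGGCCGC"),
    SequenceRecord("EX000003", "C. Researcher", "DNA", "Homo sapiens",
                   "ATGAAACGCATTAGCACCACC", ["Example paper B"]),
]


def find_by_organism(records: list[SequenceRecord], organism: str) -> list[SequenceRecord]:
    """Extract only the records needed to answer a question about one organism."""
    wanted = organism.lower()
    return [record for record in records if record.organism.lower() == wanted]


human_records = find_by_organism(database, "homo sapiens")
print([record.accession for record in human_records])
```

A real database system adds indexing, persistence and concurrent access on top of this idea, but the principle of uniform records plus a targeted query is the same.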

Currently, a lot of bioinformatics work is concerned with the technology of databases. These databases include both
‘public’ repositories of gene data, such as GenBank or the Protein Data Bank (PDB), and private databases, such as
those used by research groups involved in gene mapping projects or those held by biotech companies. Making such
databases accessible through open standards (such as the Web) is very important, since consumers of bioinformatics
data use a range of computer platforms, from the more powerful and forbidding UNIX boxes favoured by the
developers and curators to the far friendlier Macs often found populating the labs of computer-wary biologists.

RNA and DNA are the nucleic acids that store hereditary information about an organism. These macromolecules have a
fixed structure, which biologists analyse with the help of bioinformatic tools and databases. A few popular databases
are GenBank from NCBI (National Center for Biotechnology Information), SwissProt from the Swiss Institute of
Bioinformatics and PIR, the Protein Information Resource.

History of biological databases


• 1965: Margaret Dayhoff et al. published the ‘Atlas of Protein Sequence and Structure.’
• 1980: Only 80 genes had been fully sequenced.
• 1982: EMBL initiates DNA sequence databases, followed within a year by GenBank and in 1984 by the DNA
Database of Japan.
• 1983: The development of the PCR technique leads to a tremendous increase in nucleotide sequence data.
• 1988: EMBL/GenBank/DDBJ agree on a common format for data elements.

The specific features of biological databases are:


• Sub-class of scientific databases
• Autonomous: many independent maintainers
• Heterogeneous data formats (for example, various data formats for the same data entities; various types of
biological data: genomic, microarray, proteomic)
• Dynamic: frequent and continuous changes in data content (and in data schema)
• Broad domain knowledge
• Workflow-oriented: databases and rich set of analysis tools
• Information integration is essential: data aggregation from several databases

Depending on the research project, biological data comes in many different flavours. Most researchers work with
a number of different formats, even though they may not realise this at first. Some of the kinds of data that can be
found when researching any biological question are briefly listed below. Some of the data in these databases
partly overlap and refer to each other.
• Text: Examples of text databases are PubMed and OMIM containing textual information and references related
to biological data.
• Sequence data: GenBank and UniProt exemplify biological databases containing DNA and protein sequences,
respectively.
• Protein structure: You can also find databases specifically related to protein structure files (for example, the
PDB, SCOP and CATH databases).
• Links: Most databases contain information on sequence data within a specific field or subject. A different type
of database is for example, the InterPro database consisting of a collection of links from protein domains and
families to other databases providing related resources.
• Images: In the field of 2D gel and microscopic images you can also find various databases containing data, for
example, identified on reference gel images.
• Numerical data: Gene expression data as well as other microarray data are also accessible from a number of
databases. An example is the ArrayExpress database of the European Bioinformatics Institute, EBI.
• Biological matter: Frozen bacterial strains, vectors and so on are also to be found in databases collecting
information on each of these specific biological matters, for example, UniVec database hosted by NCBI.

2.2 Categories of Biological Databases


These include:
• Nucleotide sequences
• Genomics (information on gene chromosomal location and nomenclature, provide links to sequence
databases)
• Mutation/polymorphism (sequence variations linked or not to genetic diseases)
• Protein sequences
• Protein domain/family
• Proteomics
• Microarray (high-dimensional data: profiles of thousands of genes depending on hundreds/thousands of various
conditions)
• Organism-specific
• 3D structure
• Metabolism (for example, metabolic pathways – graph data)
• Bibliography

2.3 The Database Industry


The public databases have become the major medium through which genome sequence data are published. This
is because of the high rate of data production and need for researchers to have rapid access to new data. Public
databases and data services that support them are important resources in bioinformatics. However, successful public
data services suffer from continually escalating demands from the biological community.

EMBL and GenBank are the two major nucleotide databases. EMBL is the European version and GenBank is the
American. EMBL and GenBank collaborate and synchronise their databases so that they contain the same
information. The rate of growth of DNA databases has been following an exponential trend, with a doubling time
now estimated to be 9 to 12 months. In January 1998, EMBL contained more than a million entries, representing more
than 15,500 species, although most data is from model organisms. These databases are updated on a daily basis.
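
A doubling time T implies that a database holding N0 entries grows to N0 × 2^(t/T) entries after a time t. The short sketch below works through this arithmetic. The function name `projected_size` is an assumption introduced here, the starting figure of one million entries is rounded from the January 1998 count quoted above, and the outputs are projections of the formula, not measured database sizes.

```python
def projected_size(initial_entries: float, months_elapsed: float,
                   doubling_months: float) -> float:
    """Exponential growth: size doubles once every `doubling_months` months."""
    return initial_entries * 2 ** (months_elapsed / doubling_months)


# Rounded starting point taken from the text: about one million entries.
start = 1_000_000

# Compare the two ends of the 9 to 12 month doubling-time estimate
# over a three-year (36-month) horizon.
slow = projected_size(start, 36, 12)   # three doublings
fast = projected_size(start, 36, 9)    # four doublings

print(f"12-month doubling: {slow:,.0f} entries after 3 years")
print(f" 9-month doubling: {fast:,.0f} entries after 3 years")
```

The gap between the two projections shows why even a small change in the doubling time matters a great deal for storage and service planning.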

The principal requirements on the public data services are:
• Data quality: Data quality has to be of the highest priority. However, because the data services in most cases
lack access to supporting data, the quality of the data must remain the primary responsibility of the submitter.
• Supporting data: Database users will need to examine the primary experimental data, either in the database
itself, or by following cross-references back to network accessible laboratory databases.
• Deep annotation: Deep, consistent annotation comprising supporting and ancillary information should be attached
to each basic data object in the database.
• Timeliness: The basic data should be available on an Internet-accessible server within days (or hours) of
publication or submission.
• Integration: Each data object in the database should be cross-referenced to representation of the same or related
biological entities in other databases. Data services should provide capabilities for following these links from
one database or data service to another.

2.4 Classification of Biological Databases

Biological databases can be broadly classified into sequence and structure databases:

Sequence databases
With the current speed of sequencing projects, a lot of work is needed to store and organise sequence data. Most
sequence databases store additional information along with the sequence. This could be references to the original
research papers stored in PubMed, information about annotated regions, regions where conflicting residues have been
published, information on species and much more. So far, a common standard for handling all of this information
has not been created. Thus, every database has its own standard for how to store the data. However, most data is
stored in a plain text format (flat file) and can thus be opened in standard software such as Word, Notepad and so
on. Large amounts of plain text may not be easily comprehensible, though. Another problem with storing data in a
flat file format is the size of the database. Databases with long sequence entries may become too large for most
users to handle on a normal PC.

Sequence databases are applicable to both nucleic acid sequences and protein sequences, whereas structure databases
are applicable only to proteins. The first database was created within a short period after the insulin protein sequence
was made available in 1956. Incidentally, insulin was the first protein to be sequenced; its sequence
consists of just 51 residues. An alternative to flat files, used by most websites with
large databases, is to store all the information in a relational database.

Relational databases have connections or pointers to additional data in other databases or tables. Thus, one can
quickly and easily retrieve a large amount of information on one particular sequence.
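
As a minimal sketch of this idea, the example below uses Python's built-in sqlite3 module to link a table of sequences to a table of citations through a shared accession number. The table names, column names and all inserted values are hypothetical and chosen only for illustration; real sequence databases use far richer schemas.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sequence (
        accession TEXT PRIMARY KEY,
        organism  TEXT NOT NULL,
        residues  TEXT NOT NULL
    );
    CREATE TABLE citation (
        id        INTEGER PRIMARY KEY,
        accession TEXT NOT NULL REFERENCES sequence(accession),
        title     TEXT NOT NULL
    );
    """
)

# Invented records for illustration only.
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [
        ("EX000001", "Homo sapiens", "ATGGCCATTGTAATGGGCCGC"),
        ("EX000002", "Mus musculus", "ATGAAACGCATTAGCACCACC"),
    ],
)
conn.executemany(
    "INSERT INTO citation (accession, title) VALUES (?, ?)",
    [
        ("EX000001", "Example paper A"),
        ("EX000001", "Example paper B"),
        ("EX000002", "Example paper C"),
    ],
)


def citations_for(accession: str) -> list[tuple[str, str]]:
    """Follow the pointer from one sequence to all of its citations."""
    return conn.execute(
        """
        SELECT s.organism, c.title
        FROM sequence AS s
        JOIN citation AS c ON c.accession = s.accession
        WHERE s.accession = ?
        ORDER BY c.id
        """,
        (accession,),
    ).fetchall()


print(citations_for("EX000001"))
```

The accession column plays the role of the pointer: the sequence is stored once, and any number of related rows in other tables can refer back to it.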

One of the characteristics of these databases is that they are maintained and kept up to date on a regular basis. The
four major sequence databases are:
• GenBank: A US-based comprehensive collection of various biological data.
• EMBL: The main European Resource of Nucleotide Sequence Data.
• DDBJ: The DNA Data Bank of Japan.
• UniProt: The universal protein resource.

The four databases are further described in detail as follows:

GenBank at NCBI
GenBank (Genetic Sequence Databank) is one of the fastest growing repositories of known genetic sequences. It
has a flat file structure: each entry is an ASCII text file, readable by both humans and computers. In addition to
sequence data, GenBank files contain information such as accession numbers and gene names, phylogenetic
classification and references to published literature. As of June 1994, it held approximately 191,400,000 bases in
183,000 sequences.
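
Because a flat file is plain text, a short program can read it. The sketch below parses a heavily simplified, GenBank-style record: keyword lines such as LOCUS, DEFINITION and ACCESSION, an ORIGIN section holding the numbered sequence, and a closing `//` line. The sample record is invented for illustration and is not a real GenBank entry, the function name `parse_flat_record` is an assumption, and real entries contain further sections (for example feature tables and multi-line fields) that this sketch ignores.

```python
# Invented, simplified record; not a real GenBank entry.
SAMPLE_RECORD = """\
LOCUS       EX000001      24 bp    DNA     linear
DEFINITION  Illustrative record for a textbook example.
ACCESSION   EX000001
SOURCE      Homo sapiens
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_flat_record(text: str) -> dict[str, str]:
    """Parse one simplified flat-file record into a dictionary.

    Header lines that start with a keyword in the first column are stored
    under that keyword. Indented continuation lines in the header are
    ignored. Lines after ORIGIN are treated as sequence data until '//'.
    """
    fields: dict[str, str] = {}
    sequence_parts: list[str] = []
    in_sequence = False

    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_sequence:
            # Keep letters only, dropping position numbers and spaces.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line and not line[0].isspace():
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()

    fields["sequence"] = "".join(sequence_parts).upper()
    return fields


record = parse_flat_record(SAMPLE_RECORD)
print(record["ACCESSION"], record["SOURCE"], len(record["sequence"]))
```

For real work, an established parser from a maintained bioinformatics library is a safer choice than hand-written code, since the full format has many details this sketch leaves out.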

The National Center for Biotechnology Information (NCBI), part of the US National Institutes of Health and hosted
at http://www.ncbi.nlm.nih.gov/, has achieved a strong position in collecting biological data of almost any kind. In
addition to storing sequence data, NCBI stores almost all kinds of biological sequence-related data. PubMed is
probably the most used service that NCBI offers to its users, together with BLAST, a tool for searching the entire
database for homologous sequences. The NCBI staff also provide software tools for handling sequence data.

[Figure: bases in GenBank, in billions (axis from 0 to 90), plotted against date from January 1983 to January 2007]

Fig. 2.1 Growth of the GenBank database


(Source: http://www.clcbio.com/sciencearticles/BE-biodatabase.pdf)

Fig. 2.2 GenBank file format
(Source: www.bioinf.org.uk/teaching/c40/ppt/intro_databases.ppt)


EMBL
The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the
scientific literature and patent applications and directly submitted by researchers and sequencing groups. Data
collection is done in collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ). In June 1994 the
database was doubling in size every 18 months and contained nearly 200 million bases from 182,615
sequence entries, a figure in line with the GenBank total for the same date.

The EMBL Nucleotide Sequence Database is maintained by the European Bioinformatics Institute (EBI), an outstation of
the European Molecular Biology Laboratory (EMBL), and is hosted at http://www.ebi.ac.uk/embl/. DNA and RNA sequences
are directly submitted to the database by individual researchers and also by genome sequencing
projects and patent applications. The database is produced and maintained in collaboration with both GenBank and
the DNA Data Bank of Japan (DDBJ). The international collection of sequence data is exchanged between EMBL,
GenBank and DDBJ on a daily basis, so global sequence information can be retrieved from any of
the three databases.

DNA Data Bank of Japan (DDBJ)


DDBJ (DNA Data Bank of Japan) is a nucleotide database hosted in Japan that accepts DNA submissions mainly from
Japanese researchers. It works in close collaboration with GenBank and EMBL, and the three databases
store almost identical data. DDBJ also provides various search and analysis tools through the website
http://www.ddbj.nig.ac.jp/.

UniProt
UniProt is the universal protein resource, and as stated on its website the database intends to be both comprehensive and
of high quality. At the UniProt website, http://www.uniprot.org/, data has been divided into three classifications:
• Core data
• Supporting data
• Information

You can search for information on your sequence in these three categories. You can also run BLAST searches and
create alignments, and a few other services are provided. The UniProt Knowledgebase, UniProtKB, contains
translations of the coding sequences submitted to EMBL, GenBank and DDBJ, and so covers all
publicly available protein sequences.

[Figure: the three collaborating nucleotide data banks: EMBL (Europe; EBI; TrEMBL), GenBank (USA; NLM; NCBI; NRDB)
and DDBJ (Japan; NIG; CIB), linked by the International Advisory Meeting and the Collaborative Meeting]

Fig. 2.3 International nucleotide data banks


(Source: www.bioinf.org.uk/teaching/c40/ppt/intro_databases.ppt)

Other valuable databases and resources

SwissProt
This is a protein sequence database that provides a high level of integration with other databases and also has a very
low level of redundancy (that is, few identical sequences are present in the database).

PubMed
PubMed provides biological literature in text format. This service, provided by the U.S. National Library of Medicine,
links to more than 17 million resources from different journals within the field of life science. A relatively new
feature of the NCBI website is the possibility of signing up for an account at My NCBI, a service offering
customised and automated PubMed updates. After registering at My NCBI, you can save your searches and set up
automated searches that alert you by e-mail. You can also customise, for example, the filtering options on searches.
PubMed can be accessed at http://www.ncbi.nlm.nih.gov/pubmed/.

Ensembl
Ensembl is a project developing software for automatic annotation of eukaryotic genomes.
EMBL-EBI and the Sanger Institute are behind the project, and at the website,
http://www.ensembl.org/index.html, you can search all the data from the Ensembl project, divided by
species.

EBI
One of the larger European bioinformatics centers, the European Bioinformatics Institute (EBI), hosts a number of
databases and many methods to help analyse all this data. The EBI website http://www.ebi.ac.uk/ also hosts the
EMBL nucleotide sequence database (http://www.ebi.ac.uk/embl/).


InterPro
Most of the databases mentioned above also provide links to related information in other databases. The InterPro
database goes further, linking each protein domain or family to a number of different databases, each of which
contains a lot of relevant information. InterPro does not contain any sequence information itself but is largely a
mesh of hyperlinks to various other resources. The link to InterPro is http://www.ebi.ac.uk/interpro/

Pfam
A very useful database for finding protein domains is the Pfam database. Pfam currently stores information on more
than 9000 protein families. When working on an unknown protein, it is often very valuable to retrieve information on
its protein family, because identifying functional domains within a protein sequence can improve your knowledge
of the role and function of the protein. You can access the Pfam database at http://pfam.janelia.org/.

Structure databases
Information about protein structure is not growing as fast as sequence data because of the slower pace of
solving the 3D structures of proteins. The RCSB Protein Data Bank, hosted at http://www.pdb.org, holds slightly more
than 48000 structures. At the website, you can download structure files, and a number of tools for
structure studies are provided.
• SCOP: Structural Classification of Proteins is accessible at http://scop.berkeley.edu/. The SCOP database
describes structural and evolutionary relationships between all known protein structures and also provides a
number of links to other on-line resources related to protein structure and to sequence databases in general.
• CATH: Protein Structure Classification. The CATH database hosted at http://www.cathdb.info/ classifies
protein structures from the PDB according to a four-level hierarchy.

Species-specific databases
A number of species-specific databases hold very detailed information about only one particular species.
In the early 1970s, the 3D structures of proteins were being studied and the well-known PDB was developed as the first
protein structure database. It had only 10 entries in 1972 and has since grown into a large database with over 10,000
entries. While the initial databases of protein sequences were maintained at individual laboratories, the development
of a consolidated formal database known as the SWISS-PROT protein sequence database was initiated in 1986. It
now holds about 70,000 protein sequences from more than 5000 model organisms, a small fraction of all known
organisms. This huge variety of data resources is now available for study and research by both academic
institutions and industry. The data are made available as public domain information, in the larger interest of the
research community, through the Internet (www.ncbi.nlm.nih.gov) and on CD-ROMs (on request from www.rcsb.org).
Databases can be classified into primary, secondary and composite databases:

Primary databases
A primary database contains information on the sequence or structure alone. Examples include Swiss-Prot and
PIR for protein sequences, GenBank and DDBJ for genome sequences and the Protein Data Bank for protein structures.
Biological databases are archives of consistent data stored in an efficient manner. These databases contain data from
a wide spectrum of molecular biology areas. Primary or archived databases contain information on and annotation of
DNA and protein sequences, DNA and protein structures and DNA and protein expression profiles.

Secondary databases
A secondary database contains information derived from the primary database. A secondary sequence database
contains information such as the conserved sequence, signature sequence and active site residues of protein
families, arrived at by multiple sequence alignment of a set of related proteins.

A secondary structure database contains entries of the PDB in an organised way. The entries are classified
according to their structure, such as all-alpha proteins, all-beta proteins, and so on. These databases also contain
information on conserved secondary structure motifs of a particular protein. Secondary databases created and hosted
by researchers at their individual laboratories include SCOP, developed at Cambridge University; CATH,
developed at University College London; PROSITE, of the Swiss Institute of Bioinformatics; and eMOTIF, at Stanford.

Secondary or derived databases contain the results of analysis on the primary resources, including information on
sequence patterns or motifs, variants and mutations and evolutionary relationships. Information from the literature
is contained in bibliographic databases (Medline). It is essential that these databases are easily accessible and
that an intuitive query system is provided to allow researchers to obtain very specific information on a particular
biological subject. The data should be provided in a clear, consistent manner with some visualisation tools for
biological interpretation. Specialist databases for particular subjects have been set up, for example the EMBL
database for nucleotide sequence data, the UniProtKB/Swiss-Prot protein database and PDB (a 3D protein structure
database).

Scientists also need to be able to integrate the information obtained from the underlying heterogeneous databases in
a sensible manner in order to gain an overview of their biological subject. The Sequence Retrieval System (SRS) is a
powerful querying tool provided by EBI that links information from more than 150 heterogeneous resources.

Composite databases
A composite database amalgamates different primary database sources, which obviates the need to search multiple
resources. Different composite databases use different primary databases and different criteria in their search
algorithms. Various search options have also been incorporated in composite databases. NCBI hosts these nucleotide
and protein databases on its large, highly available, redundant array of computer servers, and provides free access
to anyone involved in research.

NCBI also links to OMIM (Online Mendelian Inheritance in Man), which contains information about the proteins
involved in genetic diseases. The growth of the primary databases gave rise to questions about the format of sequences
and the reliability and comprehensiveness of databases. To address the format issues, in-house software solutions have
been developed to convert the format of one database to another. A public domain program (FORCON) can also be used.
Newer analysis tools accept data in multiple formats.

The main problem with data reliability is the possibility of misannotation. Misannotations are sometimes introduced
by the automation of the annotation process carried out with computers. A misannotation, once introduced,
multiplies in subsequent additions and may accumulate to an unbelievable extent and create confusion. A possible
solution is to flag any protein sequence that has been annotated by sequence
comparison but whose function has not been validated by experimental methods.

2.5 The Creation of Sequence Databases


Most biological databases consist of long strings of nucleotides (guanine, adenine, thymine, cytosine and uracil)
and/or amino acids (threonine, serine, glycine, and so on). Each sequence of nucleotides or amino acids represents
a particular gene or protein, respectively. Sequences are represented in shorthand, using single letter designations.
This decreases the space necessary to store information and increases processing speed for analysis.
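
As a small illustration of the single-letter shorthand, the sketch below converts a list of three-letter amino acid abbreviations to the standard one-letter codes. The function name is hypothetical; the code table itself is the standard one for the 20 common amino acids.

```python
# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(residues):
    """Convert three-letter residue names (any letter case) to a one-letter string."""
    return "".join(THREE_TO_ONE[name.capitalize()] for name in residues)


peptide = ["Thr", "Ser", "Gly", "Met"]
short = to_one_letter(peptide)
print(short)                                   # one character per residue
print(len("".join(peptide)), "->", len(short)) # characters needed before and after
```

Each residue takes one character instead of three, which is the saving in storage that the paragraph above refers to.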

While most biological databases contain nucleotide and protein sequence information, there are also databases, which
include taxonomic information such as the structural and biochemical characteristics of organisms. However, the
power and ease of using sequence information has made it the method of choice in modern analysis.

Contributions from the fields of biology and chemistry have facilitated an increase in the speed of sequencing
genes and proteins. The advent of cloning technology allowed foreign DNA sequences to be easily introduced into
bacteria. In this way, rapid mass production of particular DNA sequences became possible. Oligonucleotide synthesis
provided researchers with the ability to construct short fragments of DNA with sequences of their own choosing.
These oligonucleotides could then be used in probing vast libraries of DNA to extract genes containing that sequence.


Alternatively, these DNA fragments could also be used in polymerase chain reactions to amplify existing DNA
sequences or to modify these sequences. With these techniques in place, progress in biological research increased
exponentially. However, for researchers to benefit from all this information, two additional things were required:
• Ready access to the collected pool of sequence information and
• A way to extract from this pool only those sequences of interest to a given researcher

Collecting all necessary sequence information of interest to a given project from published journal articles quickly
became a formidable task. After collection, the organisation and analysis of this data still remained. It could take
weeks to months for a researcher to search sequences by hand in order to find related genes or proteins. Computer
technology has provided the obvious solution to this problem. Not only can computers be used to store and organise
sequence information into databases, but they can also be used to analyse sequence data rapidly.

The evolution of computing power and storage capacity has, so far, been able to outpace the increase in sequence
information being created. Theoretical scientists have derived new and sophisticated algorithms, which allow
sequences to be readily compared using probability theories. These comparisons become the basis for determining
gene function, developing phylogenetic relationships and simulating protein models.

The physical linking of a vast array of computers in the 1970s provided a few biologists with ready access to the
expanding pool of sequence information. This web of connections, now known as the Internet, has evolved and
expanded so that nearly everyone has access to this information and the tools necessary to analyse it. Databases
of existing sequencing data can be used to identify homologues of new molecules that have been amplified and
sequenced in the lab. The property of sharing a common ancestor, homology, can be a very powerful indicator in
bioinformatics.

Acquisition of sequence data


Bioinformatics tools can be used to obtain sequences of genes or proteins of interest, either from material obtained,
labelled, prepared and examined in electric fields by individual researchers/groups or from repositories of sequences
from previously investigated material.

Analysis of data
Both types of sequence can then be analysed in many ways with bioinformatics tools. They can be assembled. Note
that this is one of the occasions when the meaning of a biological term differs markedly from a computational one:
computer scientists should banish from their minds any thought of assembly language. Sequencing can only be performed
for relatively short stretches of a biomolecule, and finished sequences are therefore prepared by arranging overlapping
‘reads’ of monomers (single beads on a molecular chain) into a single continuous passage of ‘code’.

This is the bioinformatic sense of assembly. Sequences can also be mapped (that is, parsed to find sites
where so-called ‘restriction enzymes’ will cut them). They can be compared, usually by aligning corresponding
segments and looking for matching and mismatching letters in their sequences. Genes or proteins that are
sufficiently similar are likely to be related and are therefore said to be ‘homologous’ to each other (the whole truth
is rather more complicated than this). Such cousins are called ‘homologues’. If a homologue (a related molecule)
exists, then a newly discovered protein may be modelled; that is, the three-dimensional structure of the gene product
can be predicted without doing laboratory experiments.
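
The assembly of overlapping reads described above can be sketched with a simple greedy strategy: repeatedly merge the two reads that share the longest suffix-to-prefix overlap. This is only an illustration under strong assumptions (error-free reads, a single strand, no repeats); the function names and the minimum overlap of three letters are arbitrary choices, and real assemblers use considerably more elaborate methods.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble(reads, min_len=3):
    """Greedily merge the pair of reads with the largest overlap until none overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break                      # no overlaps left: contigs stay separate
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs


print(assemble(["ATGGCC", "GCCATT", "ATTGTA"]))
```

With the three example reads, the shared letters GCC and ATT link the reads into one continuous sequence.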

Bioinformatics is used in primer design. Primers are short sequences needed to make many copies of (amplify) a
piece of DNA, as used in PCR (the Polymerase Chain Reaction). Bioinformatics is also used to attempt to predict the
function of actual gene products. Information about the similarity and, by implication, the relatedness of proteins
is used to trace the ‘family trees’ of different molecules through evolutionary time.

There are various other applications of computer analysis to sequence data but, with so much raw data being
generated by the Human Genome Project and other initiatives in biology, computers are presently essential for
many biologists just to manage their day-to-day results. Molecular modelling/structural biology is a growing field,
which can be considered part of bioinformatics. There are, for example, tools (often available via the Net) that
make fairly good predictions of the secondary structure of proteins arising from a given amino acid sequence, often
based on known ‘solved’ structures and other sequenced molecules acquired by structural biologists. Structural
biologists use ‘bioinformatics’ to handle the vast and complex data from X-ray crystallography, nuclear magnetic
resonance (NMR) and electron microscopy investigations and to create the 3-D models of molecules that seem to be
everywhere in the media.

2.6 Bioinformatics Programs and Tools


Bioinformatic tools are software programs designed to extract meaningful information from the
mass of data and to carry out the analysis step.

Factors that must be taken into consideration when designing these tools are:
• The end user (the biologist) may not be a frequent user of computer technology
• These software tools must be made available over the internet given the global distribution of the scientific
research community

Major categories of bioinformatics tools


There are both standard and customised products to meet the requirements of particular projects. There is
data-mining software that retrieves data from genomic sequence databases, and there are visualisation tools to analyse
and retrieve information from proteomic databases. These can be classified as homology and similarity tools, protein
functional analysis tools, sequence analysis tools and miscellaneous tools. Here is a brief description of a few of
these. Everyday bioinformatics is done with sequence search programs like BLAST, sequence analysis programs
like the EMBOSS and Staden packages, structure prediction programs like THREADER or PHD, or molecular
imaging/modelling programs like RasMol and WHATIF.

Structural analysis
These sets of tools allow you to compare structures with the known structure databases. The function of a protein
is more directly a consequence of its structure rather than its sequence with structural homologs tending to share
functions. The determination of a protein’s 2D/3D structure is crucial in the study of its function.

Homology and similarity tools


Homologous sequences are sequences that are related by divergence from a common ancestor. Thus, the degree of
similarity between two sequences can be measured while their homology is a case of being either true or false. This
set of tools can be used to identify similarities between novel query sequences of unknown structure and function
and database sequences whose structure and function have been elucidated.
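
The distinction drawn above, that similarity is a measurable quantity while homology is a yes-or-no conclusion, can be made concrete with a minimal sketch that computes percent identity between two sequences that have already been aligned. The function name and the convention of ignoring columns that contain a gap are assumptions made here; other definitions of percent identity are also in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of gap-free alignment columns in which both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where neither sequence has a gap character.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


print(percent_identity("ACGTACGT", "ACGTTCGT"))
```

A high value makes homology more plausible, but the number itself does not settle the question: whether two sequences share a common ancestor remains an inference drawn from the measurement.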

Protein function analysis


These groups of programs allow you to compare your protein sequence to the secondary (or derived) protein databases
that contain information on motifs, signatures and protein domains. Highly significant hits against these different
pattern databases allow you to approximate the biochemical function of your query protein.
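
Searching a protein for a known motif can be sketched with regular expressions. The example below translates a small subset of PROSITE-style pattern syntax (literal residues, `[allowed]` sets, `{excluded}` sets, `x` for any residue and `(n)` or `(n,m)` repeats) into a Python regular expression. The helper names are hypothetical and the protein sequence is made up; the pattern `N-{P}-[ST]-{P}` is the well-known N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        repeat = ""
        if "(" in element:                       # repetition such as x(2) or x(2,4)
            element, count = element.rstrip(")").split("(")
            repeat = "{" + count + "}"
        if element == "x":
            parts.append("." + repeat)           # any residue
        elif element.startswith("{"):
            parts.append("[^" + element[1:-1] + "]" + repeat)   # excluded residues
        else:
            parts.append(element + repeat)       # literal residue or [allowed set]
    return "".join(parts)


def find_motif(pattern, protein):
    """Return (1-based position, matched text) for every occurrence, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(protein)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif("N-{P}-[ST]-{P}", "MKNGSAANPTQ"))
```

In the made-up sequence, the asparagine at position 3 matches the pattern, while the one at position 8 does not because it is followed by a proline.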

Protein sequence analysis


Apart from maintaining the large databases, mining useful information from this set of primary and secondary
databases is very important. Many efficient algorithms have been developed for data mining and knowledge
discovery. These are computation intensive and need fast, parallel computing facilities for handling multiple
queries simultaneously. It is these search tools that connect the user to the databases. One of the most widely used
search programs is BLAST (Basic Local Alignment Search Tool).


2.7 Bioinformatics Tools

BLAST
Sequence data are compared with one another using the Basic Local Alignment Search Tool or BLAST (Altschul et
al., 1990). This algorithm attempts to find ‘‘high-scoring segment pairs’’ (HSPs), which are pairs of sequences that
can be aligned with one another and, when aligned, meet certain scoring and statistical criteria.
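
The notion of a high-scoring segment pair can be illustrated with a much simplified sketch: find short exact word matches between two sequences, extend each match in both directions without gaps, and keep the extensions whose score reaches a threshold. Everything specific here is an assumption made for illustration (the function names, a word size of 3, scores of +1 and -1, the drop-off of 2 and the threshold of 5). The real BLAST programs use substitution matrices and a statistical assessment of each hit, which this sketch omits.

```python
def extend(query, subject, qi, si, word, match=1, mismatch=-1, drop=2):
    """Extend an exact word hit without gaps.

    Returns (score, query_start, subject_start, length) of the best extension.
    """
    score = best = word * match
    left = right = 0                 # reach of the best extension on each side
    k = 0
    while qi + word + k < len(query) and si + word + k < len(subject):
        score += match if query[qi + word + k] == subject[si + word + k] else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score >= drop:   # score fell too far below the best: stop
            break
    score = best
    k = 0
    while qi - k - 1 >= 0 and si - k - 1 >= 0:
        score += match if query[qi - k - 1] == subject[si - k - 1] else mismatch
        k += 1
        if score > best:
            best, left = score, k
        elif best - score >= drop:
            break
    return best, qi - left, si - left, left + word + right


def find_hsps(query, subject, word=3, min_score=5):
    """Collect ungapped segment pairs that score at least `min_score`."""
    index = {}
    for si in range(len(subject) - word + 1):
        index.setdefault(subject[si:si + word], []).append(si)
    hsps = set()
    for qi in range(len(query) - word + 1):
        for si in index.get(query[qi:qi + word], []):
            hit = extend(query, subject, qi, si, word)
            if hit[0] >= min_score:
                hsps.add(hit)
    return sorted(hsps, reverse=True)


print(find_hsps("AAACGTCCC", "TTACGTCGG"))
```

Here the two sequences share the segment ACGTC starting at index 2 (counting from 0) in both, so the three word hits inside it all extend to the same segment pair.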

BLAST is a set of similarity search programs designed to explore all of the available sequence databases regardless
of whether the query is protein or DNA. The BLAST programs have been designed for speed, with a minimal
sacrifice of sensitivity. The scores assigned in a BLAST search have a well-defined statistical interpretation, making
real matches easier to distinguish from random background hits. BLAST uses a heuristic algorithm that seeks
local as opposed to global alignments and is therefore able to detect relationships among sequences that share
only isolated regions of similarity. This is a primary criterion in sequence analysis. Other available tools include
CLUSTALW for multiple sequence alignment.

BLAST (Basic Local Alignment Search Tool) comes under the category of homology and similarity tools. It is a set
of search programs used to perform fast similarity searches regardless of
whether the query is for protein or DNA. A nucleotide query can be compared against the nucleotide sequences in a
database, and a protein database can be searched to find a match against the queried protein sequence. NCBI has also
introduced a queuing system for BLAST (Q BLAST) that allows users to retrieve results at their convenience and format
their results multiple times with different formatting options.

Depending on the type of sequences to compare, there are different programs:


• blastp compares an amino acid query sequence against a protein sequence database
• blastn compares a nucleotide query sequence against a nucleotide sequence database
• blastx compares a nucleotide query sequence translated in all reading frames against a protein sequence
database
• tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in
all reading frames
• tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations
of a nucleotide sequence database.
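
The choice among the five programs listed above depends only on the type of the query and the type of the database, and the translated searches rely on translating a nucleotide sequence in all six reading frames. Both ideas are sketched below. The function names and the frame labels (+1 to +3 for the given strand, -1 to -3 for the reverse complement) are conventions chosen for this example; the codon table is the standard genetic code.

```python
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_type, database_type, translated=False):
    """Pick a BLAST program; `translated` selects tblastx for two nucleotide inputs."""
    if translated and query_type == database_type == "nucleotide":
        return "tblastx"
    return PROGRAMS[(query_type, database_type)]


# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Complement each base and reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frames(dna):
    """Translate both strands at all three codon offsets."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(choose_program("nucleotide", "protein"))
print(six_frames("ATGGCCTAA"))
```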

Databases available for BLAST search include the following:

Protein sequence databases


• nr: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
• month: All new or revised GenBank CDS translation+PDB+SwissProt+PIR+PRF released in the last 30
days.
• swissprot: Last major release of the SWISS-PROT protein sequence database (no updates)
• Drosophila genome: Drosophila genome proteins provided by Celera and Berkeley Drosophila
Genome Project (BDGP) (www.fruitfly.org)
• yeast: Yeast (Saccharomyces cerevisiae) genomic CDS translations
• ecoli: Escherichia coli genomic CDS translations
• pdb: Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank
(www.pdb.org)
• kabat: Kabat’s database of sequences of immunological interest (http://immuno.bme.nwu.edu)
• alu: Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query
sequences.

Nucleotide sequence databases
• nr: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences).
No longer ‘non-redundant’.
• month: All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
• Drosophila genome: Drosophila genome provided by Celera and Berkeley Drosophila Genome Project
• dbest: Database of GenBank+EMBL+DDBJ sequences from EST Divisions
• dbsts: Database of GenBank+EMBL+DDBJ sequences from STS Divisions
• htgs: Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2
• gss: Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR
sequences.
• yeast: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
• ecoli: Escherichia coli genomic nucleotide sequences
• pdb: Sequences derived from the 3-dimensional structure from Brookhaven Protein Data Bank
• kabat: Kabat’s database of sequences of immunological interest
• vector: Vector subset of GenBank(R), NCBI, in ftp://ncbi.nlm.nih.gov/blast/db/
• mito: Database of mitochondrial sequences
• alu: Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available
by anonymous FTP from ncbi.nlm.nih.gov (under the /pub/jmc/alu directory).
• epd: Eukaryotic Promoter Database found on the web at http://www.genome.ad.jp/dbgetbin/www_bfind?epd


Fig. 2.4 Application of bioinformatics in medical science


(Source: http://www.mrc-lmb.cam.ac.uk/genomes/madanm/pdfs/biodbseq.pdf)

FASTA
FASTA is an alignment program for DNA and protein sequences created by Pearson and Lipman in 1988. The program is one
of the many heuristic algorithms proposed to speed up sequence comparison. The basic idea is to add a fast pre-screen
step to locate the highly matching segments between two sequences, and then extend these matching segments to
local alignments using more rigorous algorithms such as Smith-Waterman.
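
The two-stage idea described above can be sketched as follows: a fast pre-screen counts short shared words (k-tuples) along each diagonal of the comparison, and only sequence pairs that pass the pre-screen are scored with the slower Smith-Waterman dynamic programming. This is a simplified illustration, not the FASTA program itself. The word size, scoring values, hit threshold and function names are all assumptions, and the sketch scores the whole pair rather than restricting the alignment to a band around the best diagonal.

```python
from collections import Counter


def best_diagonal(a, b, k=2):
    """Count shared k-tuples per diagonal (i - j); return (diagonal, count) of the richest."""
    positions = {}
    for j in range(len(b) - k + 1):
        positions.setdefault(b[j:j + k], []).append(j)
    diagonals = Counter()
    for i in range(len(a) - k + 1):
        for j in positions.get(a[i:i + k], []):
            diagonals[i - j] += 1
    return diagonals.most_common(1)[0] if diagonals else (None, 0)


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score by dynamic programming with a linear gap penalty."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


def compare(a, b, k=2, min_hits=2):
    """Run the rigorous alignment only if the fast pre-screen finds enough word hits."""
    _, hits = best_diagonal(a, b, k)
    if hits < min_hits:
        return None                  # rejected by the pre-screen
    return smith_waterman_score(a, b)


print(best_diagonal("TTACGT", "ACGTGG"))
print(compare("TTACGT", "ACGTGG"))
```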

EMBOSS
EMBOSS (European Molecular Biology Open Software Suite) is a software-analysis package. It can work with
data in a range of formats and also retrieve sequence data transparently from the Web. Extensive libraries are also
provided with this package, allowing other scientists to release their software as open source. It provides a set of
sequence-analysis programs, and also supports all UNIX platforms.

ClustalW
It is a fully automated sequence alignment tool for DNA and protein sequences. It returns the best match over the
total length of the input sequences, whether they are protein or nucleic acid.

RasMol
It is a powerful research tool to display the structure of DNA, proteins, and smaller molecules. Protein Explorer, a
derivative of RasMol, is an easier to use program.

PROSPECT
PROSPECT (Protein Structure Prediction and Evaluation Computer ToolKit) is a protein structure prediction system
that employs a computational technique called protein threading to construct a protein’s 3-D model.

PatternHunter
PatternHunter, based on Java, can identify all approximate repeats in a complete genome in a short time using little
memory on a desktop computer. Its distinguishing features are its advanced patented algorithm and data structures, and
the Java language used to create it. The Java version of PatternHunter is just 40 KB, only 1% of the size of BLAST,
while offering a large portion of its functionality.

COPIA
COPIA (Consensus Pattern Identification and Analysis) is a protein structure analysis tool for discovering motifs
(conserved regions) in a family of protein sequences. Such motifs can then be used to determine whether new protein
sequences belong to the family, to predict the secondary and tertiary structure and function of proteins and to study
the evolutionary history of the sequences.

2.8 Application of Programmes in Bioinformatics


The applications include:

JAVA in bioinformatics
Since research centers are scattered all around the globe ranging from private to academic settings, and a range of
hardware and OSs are being used, Java is emerging as a key player in bioinformatics. Physiome Sciences’ computer-
based biological simulation technologies and Bioinformatics Solutions’ PatternHunter are two examples of the
growing adoption of Java in bioinformatics.

Perl in bioinformatics
String manipulation, regular expression matching, file parsing and data format inter-conversion are common
text-processing tasks in bioinformatics. Perl excels at such tasks and is used by many developers.
Perl has no standard built-in modules designed specifically for bioinformatics. However, developers
have written several modules of their own for the purpose, which have become quite popular and are
coordinated by the BioPerl project.
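
The text-processing tasks named above do not depend on any one language. As a minimal sketch, shown here in Python rather than Perl, the code below parses sequences in the widely used FASTA text format (a header line beginning with `>` followed by sequence lines) and computes a simple statistic. The function names and the example records are made up for illustration.

```python
def parse_fasta(text):
    """Read FASTA-formatted text into a dict of {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # The identifier is the first word of the header line.
            name = header[0] if header else f"unnamed_{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def gc_content(dna):
    """Fraction of bases that are G or C."""
    return (dna.count("G") + dna.count("C")) / len(dna) if dna else 0.0


EXAMPLE = """>seq1 made-up example
ATGC
atgc
>seq2
GGCC
"""

for identifier, sequence in parse_fasta(EXAMPLE).items():
    print(identifier, sequence, gc_content(sequence))
```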


Summary
• A biological database is a large, organised body of persistent data, usually associated with computerised software
designed to update, query and retrieve components of the data stored within the system.
• RNA and DNA are the nucleic acids that store hereditary information about an organism.
• Biological databases are archives of consistent data stored in an efficient manner.
• Primary or archived databases contain information and annotation of DNA and protein sequences, DNA and
protein structures and DNA and protein expression profiles.
• Secondary or derived databases contain the results of analysis on the primary resources including information
on sequence patterns or motifs, variants and mutations and evolutionary relationships.
• EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected
from the scientific literature and patent applications and directly submitted from researchers and sequencing
groups.
• DDBJ (DNA Data Bank of Japan) is a nucleotide database hosted in Japan and is accepting DNA submission
from mainly Japanese researchers.
• UniProt is the universal protein resource, and as stated on its website the database intends to be both comprehensive
and of high quality.
• Ensembl is a project developing software for automatic annotation of eukaryotic genomes.
• The SCOP database describes structural and evolutionary relationships between all known protein structures
and also provides a number of links to other on-line resources related to protein structure and to sequence
databases in general.
• The CATH database hosted at http://www.cathdb.info/ classifies protein structures from the PDB according to
a four-level hierarchy.
• Composite database amalgamates different primary database sources, which obviates the need to search multiple
resources.
• Bioinformatic tools are software programs that are designed for extracting the meaningful information from the
mass of data and to carry out this analysis step.
• Homologous sequences are sequences that are related by divergence from a common ancestor.
• BLAST is a set of similarity search programs designed to explore all of the available sequence databases
regardless of whether the query is protein or DNA.
• BLAST (Basic Local Alignment Search Tool) comes under the category of homology and similarity tools.
• COPIA (Consensus Pattern Identification and Analysis) is a protein structure analysis tool for discovering motifs
(conserved regions) in a family of protein sequences.
• PatternHunter, based on Java, can identify all approximate repeats in a complete genome in a short time using
little memory on a desktop computer.
• EMBOSS (European Molecular Biology Open Software Suite) is a software-analysis package.

References
• Baxevanis, A. D. & Ouellette, B. F., 1998. Bioinformatics: A Practical Guide to the Analysis of Genes and
Proteins, John Wiley and Sons, New York.
• Rastogi, S. C., Mendiratta, N. & Rastogi, P., 2006. Bioinformatics-Concepts, Skills, Applications, 2nd ed., PHI
Learning Pvt. Ltd.
• clcbio. Bioinformatics explained: Biological databases [Online] Available at:
<http://www.clcbio.com/index.php?id=1238> [Accessed 28 February 2012].
• EMBL-EBI. What is Bioinformatics? [Online] Available at:
<http://www.ebi.ac.uk/2can/bioinformatics/bioinf_biodatabases_1.html> [Accessed 28 February 2012].
• InsGenomeSciences, 2010. Introduction to Bioinformatics [Video Online] Available at:
<http://www.youtube.com/watch?v=xODTm4a6nsM> [Accessed 28 February 2012].
• jv51jjv5, 2010. NCBI BLAST Tutorial - Part 1 [Video Online] Available at:
<http://www.youtube.com/watch?v=ZuBMBJmfn-4&feature=related> [Accessed 28 February 2012].

Recommended Reading
• Lehninger, A. L. 1984. Principles of Biochemistry, CBS publishers and distributors, New Delhi, India.
• Shanmughavel, P. 2005. Principles of Bioinformatics, Pointer Publishers, Jaipur, India.
• Markel, S. and Leon, D., Sequence Analysis in A Nutshell, O’REILLY.

Self Assessment
1. RNA and DNA are the ________that store hereditary information about an organism.
a. proteins
b. nucleotides
c. biological databases
d. software programs

2. ______is a set of similarity search programs designed to explore all of the available sequence databases regardless
of whether the query is protein or DNA.
a. BLAST
b. COPIA
c. CATH
d. EMBOSS

3. _________is a software-analysis package.
a. BLAST
b. COPIA
c. CATH
d. EMBOSS

4. ________can identify all approximate repeats in a complete genome in a short time using little memory on a
desktop computer.
a. PatternHunter
b. BLAST
c. COPIA
d. CATH

5. _____is the universal protein resource, and as stated on its website the database intends to be both comprehensive
and of high quality.
a. PatternHunter
b. UniProt
c. Ensembl
d. COPIA

6. ________is a project developing software for automatic annotation of eukaryotic genomes.
a. PatternHunter
b. UniProt
c. Ensembl
d. COPIA

7. Which is a protein structure analysis tool for discovering motifs (conserved regions) in a family of protein
sequences?
a. PatternHunter
b. UniProt
c. Ensembl
d. COPIA

8. Which of the following statements is false?
a. String manipulation, regular expression matching, file parsing, data format inter-conversion and so on are
the common text-processing tasks performed in bioinformatics.
b. The Java language version of PatternHunter is just 40 KB, only 1% the size of Blast, while offering a large
portion of its functionality.
c. EMBOSS is a protein structure prediction system that employs a computational technique called protein
threading to construct a protein’s 3-D model.
d. Protein Explorer, a derivative of RasMol, is an easier to use program.

9. Which of the following statements is false?
a. FASTA is an alignment program for protein sequences created by Pearson and Lipman in 1988.
b. EMBOSS (European Molecular Biology Open Software Suite) is a software-analysis package.
c. blastx compares an amino acid query sequence against a protein sequence database.
d. blastn compares a nucleotide query sequence against a nucleotide sequence database.

10. Which of the following comes under the category of homology and similarity tools?
a. BLAST
b. FASTA
c. EMBOSS
d. COPIA

Chapter III
Genomics & Proteomics

Aim
The aim of this chapter is to:

• define proteomics

• describe application of proteomics to medicine

• explain genome mapping

Objectives
The objectives of this chapter are to:

• describe genomics

• define proteome

• elucidate the implications of genomics for medical science

Learning outcome
At the end of this chapter, you will be able to:

• identify proteomics

• understand DNA sequencing

• differentiate between proteomics and genomics

3.1 DNA, Genes and Genomes
Deoxyribonucleic acid (DNA) is the chemical compound that contains the instructions needed to develop and direct
the activities of nearly all living organisms. DNA molecules are made of two twisting, paired strands, often referred
to as a double helix.

Each DNA strand is made of four chemical units, called nucleotide bases, which comprise the genetic “alphabet.”
The bases are adenine (A), thymine (T), guanine (G), and cytosine (C). Bases on opposite strands pair specifically:
an A always pairs with a T; a C always pairs with a G. The order of the As, Ts, Cs, and Gs determines the meaning
of the information encoded in that part of the DNA molecule just as the order of letters determines the meaning of
a word.
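
The pairing rule described above can be illustrated with a short Python sketch. The function names are illustrative only, and reading the opposite strand in reverse (the 5'-to-3' convention) is a standard assumption that the text does not state.

```python
# Pairing rules from the text: A pairs with T, and C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand):
    """Return the base that pairs with each position of `strand`."""
    return "".join(PAIRS[base] for base in strand.upper())

def reverse_complement(strand):
    """Return the opposite strand written in its own reading direction."""
    return complement(strand)[::-1]

print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```

Because each base fixes its partner, knowing one strand is enough to write down the other.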

An organism’s complete set of DNA is called its genome. Virtually every single cell in the body contains a complete
copy of the approximately 3 billion DNA base pairs, or letters, that make up the human genome.

With its four-letter language, DNA contains the information needed to build the entire human body. A gene is the
unit of DNA that carries the instructions for making a specific protein or set of proteins. Each of the estimated 20,000
to 25,000 genes in the human genome codes for an average of three proteins.

Located on 23 pairs of chromosomes packed into the nucleus of a human cell, genes direct the production of proteins
with the assistance of enzymes and messenger molecules. Specifically, an enzyme copies the information in a gene’s
DNA into a molecule called messenger ribonucleic acid (mRNA). The mRNA travels out of the nucleus and into
the cell’s cytoplasm, where the mRNA is read by a tiny molecular machine called a ribosome, and the information
is used to link together small molecules called amino acids in the right order to form a specific protein.
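
As a rough illustration of this flow from gene to protein, the following Python sketch copies a DNA sequence into mRNA and then reads the mRNA three bases at a time. It assumes that the input is the coding strand, that reading starts at the first base, and that the standard genetic code applies; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(coding_strand):
    """Copy a DNA coding strand into mRNA (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """Read codons from the first base until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGGCCTGGTAA")
print(mrna)             # AUGGCCUGGUAA
print(translate(mrna))  # MAW
```

The output uses one-letter amino acid codes: M (methionine), A (alanine) and W (tryptophan).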

Proteins make up body structures like organs and tissue, as well as control chemical reactions and carry signals
between cells. If a cell’s DNA is mutated, an abnormal protein may be produced, which can disrupt the body’s usual
processes and lead to a disease, such as cancer.

Genomics is the study of genes and non-coding sequences of DNA in organisms.

• It is the study of sequences, gene organisation and mutations at the DNA level.
• It is the study of information flow within a cell.

Proteomics is the large-scale study of proteins, particularly their structures and functions. The term was coined by
analogy with genomics, and proteomics is often viewed as the “next step”, but it is much more complicated than
genomics. Most importantly, while the genome is a rather constant entity, the proteome is constantly changing
through its biochemical interactions with the genome. One organism will have radically different protein expression
in different parts of its body and in different stages of its life cycle. The entirety of proteins in existence in an
organism is referred to as the proteome.

3.2 DNA Sequencing


Sequencing simply means determining the exact order of the bases in a strand of DNA. Because bases exist as pairs,
and the identity of one of the bases in the pair determines the other member of the pair, researchers do not have to
report both bases of the pair.

In the most common type of sequencing used today, called the chain termination method, a DNA strand is treated with
a variety of nucleotides, a set of enzymes, and a specific primer to generate a collection of smaller DNA fragments.
Four fluorescent tags, each specific for a given base, are part of the mixture. Each of the fragments differs in length
by one base and is marked with a fluorescent tag that identifies the last base of the fragment. The fragments are then
separated according to size and passed by a detector that reads the fluorescent tag. Then, a computer reconstructs
the entire sequence of the long DNA strand by identifying the base at each position from the size of each fragment
and the particular fluorescent signal at its end.
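
The reconstruction step can be sketched in a few lines of Python. This is a simplified model: each fragment is represented by a hypothetical (length, tag) pair, exactly one fragment is assumed for every length, and the tag is taken to name the base directly.

```python
def read_sequence(fragments):
    """Rebuild a sequence from (length, tag_base) pairs, one per fragment."""
    ordered = sorted(fragments)  # separation by size: shortest fragment first
    lengths = [length for length, _ in ordered]
    if lengths != list(range(1, len(ordered) + 1)):
        raise ValueError("expected exactly one fragment for every length 1..n")
    # The tag on the fragment of length k identifies the base at position k.
    return "".join(base for _, base in ordered)

# Fragments arrive in no particular order; sorting plays the role of the gel.
fragments = [(3, "G"), (1, "A"), (4, "C"), (2, "T")]
print(read_sequence(fragments))  # ATGC
```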

At present, this technology can determine the order of only up to 800 base pairs of DNA at a time. So, to assemble
the sequence of all the bases in a large piece of DNA, such as a gene, researchers need to read the sequence of
overlapping segments. This allows the longer sequence to be assembled from shorter pieces, somewhat like putting
together a linear jigsaw puzzle. In this process, each base has to be read not just once, but at least several times in
the overlapping segments to ensure accuracy.
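
The jigsaw analogy can be made concrete with a small Python sketch. It uses a simple greedy strategy (repeatedly merging the two reads with the longest overlap), which is only one possible approach and is not necessarily what production assemblers do; the reads, the function names and the minimum overlap of three bases are all illustrative assumptions.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair of reads until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to join; the pieces stay separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # ['ATGGCGTACCTGA']
```

Each base in the middle of the assembled sequence is covered by more than one read, which is what allows errors in a single read to be detected.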

Researchers can use DNA sequencing to search for genetic variations and/or mutations that may play a role in the
development or progression of a disease. The disease-causing change may be as small as the substitution, deletion,
or addition of a single base pair or as large as a deletion of thousands of bases.
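
The simplest of these changes, the substitution of a single base, can be found by comparing two sequences position by position. The sketch below assumes that the sequences are already aligned and of equal length, so it does not detect deletions or additions; the sequences shown are made up for illustration.

```python
def point_substitutions(reference, sample):
    """List (position, reference_base, sample_base) for every mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (pos, ref_base, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]

print(point_substitutions("ATGGCCTAA", "ATGGTCTAA"))  # [(5, 'C', 'T')]
```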

The Human Genome Project, which was led at the National Institutes of Health (NIH) by the National Human
Genome Research Institute, produced a very high-quality version of the human genome sequence that is freely
available in public databases. This international project was successfully completed in April 2003, under budget
and more than two years ahead of schedule.

The sequence is not that of one person, but is a composite derived from several individuals. Therefore, it is a
“representative” or generic sequence. To ensure anonymity of the DNA donors, more blood samples (nearly 100)
were collected from volunteers than were used, and no names were attached to the samples that were analysed.
Thus, not even the donors knew whether their samples were actually used.

The Human Genome Project was designed to generate a resource that could be used for a broad range of biomedical
studies. One such use is to look for the genetic variations that increase risk of specific diseases, such as cancer, or
to look for the type of genetic mutations frequently seen in cancerous cells. More research can then be done to fully
understand how the genome functions and to discover the genetic basis for health and disease.

The International HapMap Project, in which NIH also played a leading role, represents a major step in that direction.
In October 2005, the project published a comprehensive map of human genetic variation that is already speeding the
search for genes involved in common, complex diseases, such as heart disease, diabetes, blindness, and cancer.

Another initiative that builds upon the tools and technologies created by the Human Genome Project is The Cancer
Genome Atlas pilot project. This three-year pilot, which was launched in December 2005, will develop and test
strategies for a comprehensive exploration of the universe of genetic factors involved in cancer.

3.3 Genome Mapping


Genomic maps serve as a scaffold for orienting sequence information. A few years ago, a researcher wanting to
localise a gene, or nucleotide sequence, was forced to manually map the genomic region of interest, a time-consuming
and often painstaking process. Today, thanks to new technologies and the influx of sequence data, a number of high-
quality, genome-wide maps are available to the scientific community for use in their research.

Computerised maps make gene hunting faster, cheaper, and more practical for almost any scientist. In a nutshell,
scientists would first use a genetic map to assign a gene to a relatively small area of a chromosome. They would then
use a physical map to examine the region of interest close up, to determine a gene’s precise location. In light of these
advances, a researcher’s burden has shifted from mapping a genome or genomic region of interest to navigating a
vast number of Web sites and databases.

3.4 Implications of Genomics for Medical Science


Virtually every human ailment, except perhaps trauma, has some basis in our genes. Until recently, doctors were
able to take the study of genes, or genetics, into consideration only in cases of birth defects and a limited set of
other diseases. These were conditions, such as sickle cell anemia, which have very simple, predictable inheritance
patterns because each is caused by a change in a single gene.

With the vast trove of data about human DNA generated by the Human Genome Project and the HapMap Project,
scientists and clinicians have much more powerful tools to study the role that genetic factors play in much more
complex diseases, such as cancer, diabetes, and cardiovascular disease, which constitute the majority of health problems
in the United States. Genome-based research is already enabling medical researchers to develop more effective
diagnostic tools, to better understand the health needs of people based on their individual genetic make-ups, and to
design new treatments for disease. Thus, the role of genetics in health care is starting to change profoundly and the
first examples of the era of personalised medicine are on the horizon.

It is important to realise, however, that it often takes considerable time, effort, and funding to move discoveries from
the scientific laboratory into the medical clinic. Most new drugs based on genome-based research are estimated to be
at least 10 to 15 years away. According to biotechnology experts, it usually takes more than a decade for a company
to conduct the kinds of clinical studies needed to receive approval from the Food and Drug Administration.

Screening and diagnostic tests, however, are expected to arrive more quickly. Rapid progress is also anticipated in
the emerging field of pharmacogenomics, which involves using information about a patient’s genetic make-up to
better tailor drug therapy to their individual needs.

Clearly, genetics remains just one of several factors that contribute to people’s risk of developing most common
diseases. Diet, lifestyle, and environmental exposures also come into play for many conditions, including many types
of cancer. Still, a deeper understanding of genetics will shed light on more than just hereditary risks by revealing
the basic components of cells and, ultimately, explaining how all the various elements work together to affect the
human body in both health and disease.

3.5 Proteomics
Proteomics studies the structure and function of proteins, the principal constituents of the protoplasm of all cells.

Proteome
The word “proteome” is derived from proteins expressed by a genome, and it refers to all the proteins produced by
an organism, much like the genome is the entire set of genes. The human body may contain more than 2 million
different proteins, each having different functions. As the main components of the physiological pathways of the
cells, proteins serve vital functions in the body such as:
• catalysing various biochemical reactions, for example, enzymes
• acting as messengers, for example, neurotransmitters
• acting as control elements that regulate cell reproduction
• influencing growth and development of various tissues, for example, trophic factors
• transporting oxygen in the blood, for example, hemoglobin
• defending the body against disease, for example, antibodies

Proteins are fairly large molecules made up of strings of amino acids linked like a chain. While there are only 20
amino acids, they combine in different ways to form tens of thousands of proteins, each with a unique, genetically
defined sequence that determines the protein’s specific shape and function. In addition, each protein can undergo a
variety of post-translational modifications that further influence its shape and function. Researchers and scientists
are working on developing a map of the human proteome much like that of the human genome that identifies novel
protein families, protein interactions and signaling pathways.

Proteomics is “the analysis of complete complements of proteins”. Proteomics includes not only the identification
and quantification of proteins, but also the determination of their localisation, modifications, interactions, activities,
and, ultimately, their function. Initially encompassing protein separation and identification, proteomics now refers
to any procedure that characterises large sets of proteins. The explosive growth of this field is driven by multiple
forces: genomics and its revelation of more and more new proteins; powerful protein technologies, such as newly
developed mass spectrometry approaches; and innovative computational tools and methods to process, analyse, and
interpret prodigious amounts of data.

There are many different subdivisions of proteomics, including:


• Structural proteomics: In-depth analysis of protein structure
• Expression proteomics: Analysis of expression and differential expression of proteins
• Interaction proteomics: Analysis of interactions between proteins to characterise complexes and determine
function.

The theme of molecular biology research, in the past, has been oriented around the gene rather than the protein.
This is not to say that researchers have neglected to study proteins, but rather that the approaches and techniques
most commonly used have looked primarily at the nucleic acids and then later at the protein(s) implicated. The main
reason for this has been that the technologies available, and the inherent characteristics of nucleic acids, have made
the genes the low-hanging fruit. This situation has changed recently and continues to change as larger-scale,
higher-throughput methods are developed for both nucleic acids and proteins. The majority of processes that take place in
a cell are not performed by the genes themselves, but rather by the proteins that they code for.

A disease can arise when a gene/protein is over- or under-expressed, when a mutation in a gene results in a
malformed protein, or when post-translational modifications alter a protein’s function. Thus, to truly understand
a biological process, the relevant proteins must be studied directly. Studying proteins is more challenging than
studying genes, however, because of their complex 3-D structure, which, as in a machine, is closely related to
their function.

Proteomics is defined as the systematic large-scale analysis of protein expression under normal and perturbed
(stressed, diseased, and/or drugged) states, and generally involves the separation, identification, and characterisation
of all of the proteins in a cell or tissue sample. The meaning of the term has also been expanded, and is now used
loosely to refer to the approach of analysing which proteins a particular type of cell synthesises, how much the
cell synthesises, how cells modify proteins after synthesis, and how all of those proteins interact. There are orders
of magnitude more proteins than genes in an organism: based on alternative splicing (several per gene) and
post-translational modifications (over 100 known), there are estimated to be a million or more.

Fortunately there are features such as folds and motifs, which allow them to be categorised into groups and families,
making the task of studying them more tractable. There is a broad range of technologies used in proteomics, but
the central paradigm has been the use of 2-D gel electrophoresis (2D-GE) followed by mass spectrometry (MS).
2D-GE is used to first separate the proteins by isoelectric point and then by size.

The individual proteins are subsequently removed from the gel and prepared, then analysed by MS to determine their
identity and characteristics. There are various types of mass analysers used in proteomics MS including quadrupole,
time-of-flight (TOF), and ion trap, and each has its own particular capabilities. Tandem arrangements are often used,
such as quadrupole-TOF, to provide more analytical power. The recent development of soft ionisation techniques,
namely matrix-assisted laser desorption ionisation (MALDI) and electro-spray ionisation (ESI), has allowed large
biomolecules to be introduced into the mass analyser without completely decomposing their structures, or even
without breaking them at all, depending on the design of the experiment.

There are techniques, which incorporate liquid chromatography (LC) with MS, and others that use LC by itself.
Robotics has been applied to automate several steps in the 2DGE-MS process such as spot excision and enzyme
digests. To determine a protein’s structure, X-ray diffraction (XRD) and nuclear magnetic resonance (NMR) techniques
are being improved to reach higher throughput and better performance. For example, automated high-throughput
crystallisation methods are being used upstream of
XRD to alleviate that bottleneck. For NMR, cryo-probes and flow probes shorten analysis time and decrease sample
volume requirements. The hope is that determining about 10,000 protein structures will be enough to characterise
the estimated 5,000 or so folds, which will feed into more reliable in silico structural prediction methods.

Structure by itself does not provide all of the desired information, but is a major step in the right direction. Protein
chips are being developed for many of the processes in proteomics. For example, researchers are developing protocols
for protein microarrays at institutions such as Harvard and Stanford as well as at several companies. These chips -
grids of attached peptide fragments, attached antibodies, or gel “pads” with proteins suspended inside - will be used
for various experiments such as protein-protein interaction studies and differential expression analysis.

They can also be used to filter out high abundance proteins before further experiments; one of the major challenges
in proteomics is isolating and analysing the low abundance proteins, which are thought to be the most important.
There are many other types of protein chips, and the number will continue to grow. For example, microfluidics chips
can combine the sample preparation steps prior to MS, such as enzyme digests, with nanoelectrospray ionisation,
all on one chip. Alternatively, the samples can be ionised directly off the surface of the chip, similar to a MALDI target.
Microfluidics chips are also being combined with NMR.

In the next few years, various protein chips will be used increasingly in diagnostic applications as well. The
bioinformatics side of proteomics includes both databases and analysis software. There are many public and
private databases containing protein data ranging from sequences, to functions, to post translational modifications.
Typically, a researcher will first perform 2D-GE followed by MS; this will result in a fingerprint, molecular weight,
or even sequence for each protein of interest, which can then be used to query databases for similarities or other
information.
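
The database query can be sketched as follows. This is a minimal illustration under several assumptions that the text does not state: the digesting enzyme is trypsin (modelled as cutting after K or R unless the next residue is P), peptide masses are computed from approximate average residue masses, and the "database" is a small made-up dictionary of hypothetical sequences.

```python
# Approximate average residue masses in daltons; one water is added per peptide.
WATER = 18.01528
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}

def tryptic_digest(seq):
    """Cut after K or R, except when the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides

def peptide_mass(peptide):
    """Average mass of a peptide: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def fingerprint(seq):
    """Sorted list of peptide masses produced by digesting `seq`."""
    return sorted(round(peptide_mass(p), 1) for p in tryptic_digest(seq))

def best_match(observed, database, tolerance=0.5):
    """Return the database entry whose fingerprint explains the most masses."""
    def score(seq):
        theoretical = fingerprint(seq)
        return sum(any(abs(m - t) <= tolerance for t in theoretical) for m in observed)
    return max(database, key=lambda name: score(database[name]))

# Hypothetical two-entry database and a fingerprint "measured" from protein_B.
database = {"protein_A": "AKRPGKLM", "protein_B": "GGKAAR"}
observed = fingerprint("GGKAAR")
print(tryptic_digest("AKRPGKLM"))      # ['AK', 'RPGK', 'LM']
print(best_match(observed, database))  # protein_B
```

Real search tools use more careful scoring and allow for modifications and missed cleavages; the sketch only shows the basic idea of comparing measured masses with masses predicted from stored sequences.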

Swiss-Prot and TrEMBL, developed in collaboration between the Swiss Institute of Bioinformatics and the European
Bioinformatics Institute, are currently the major databases dedicated to cataloging protein data, but there are dozens
of more specialised databases and tools. New bioinformatics approaches are constantly being introduced. Recent
customised versions of PSI-BLAST can, for example, utilise not only the curated protein entries in Swiss-Prot
but also linguistic analyses of biomedical journal articles to help determine protein family relationships. Publicly
available databases and tools are popular, but there are also several companies offering subscriptions to proprietary
databases, which often include protein-protein interaction maps.

The proteomics market comprises instrument manufacturers, bioinformatics companies, laboratory product
suppliers, service providers, and other biotech-related companies, which can defy categorisation. A given company
can often overlap more than one of these areas. Many of the companies involved in the proteomics market are actually
doing drug discovery as their major focus, while partnering, or providing services or subscriptions, to other companies
to generate short term revenues. The market for proteomics products and services was estimated to be $1.0B in 2000,
growing at a CAGR of 42% to about $5.8B in 2005. The major drivers will continue to be the biopharmaceutical
industry’s pursuit of blockbuster drugs and the recent technological advances, which have allowed large-scale
studies of genes and proteins. Alliances are becoming increasingly important in this field, because it is challenging
for companies to find all of the necessary expertise to cover the different activities involved in proteomics.
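
The market figures quoted above can be checked with a short calculation. The sketch below assumes annual compounding over the five years from 2000 to 2005; the function names are illustrative.

```python
def project(value, rate, years):
    """Value after compounding at `rate` per year for `years` years."""
    return value * (1 + rate) ** years

def cagr(start, end, years):
    """Compound annual growth rate implied by a start and end value."""
    return (end / start) ** (1 / years) - 1

# $1.0B in 2000 growing at 42% per year for 5 years (values in $B).
print(round(project(1.0, 0.42, 5), 2))  # 5.77
# Growth rate implied by $1.0B rising to $5.8B over 5 years.
print(round(cagr(1.0, 5.8, 5), 3))      # 0.421
```

The two figures agree: growth of 42% per year takes $1.0B to roughly $5.8B in five years.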

Synergies must be created by combining forces. For example, many companies working with mass spectrometry,
both the manufacturers and end user labs, are collaborating with protein chip related companies. There are
many combinations of diagnostics, instrumentation, chip, and bioinformatics companies, which create effective
partnerships.

In general, proteomics appears to hold great promise in the pursuit of biological knowledge. There has been a
general realisation that the large-scale approach to biology, as opposed to the strictly hypothesis-driven approach,
will rapidly generate much more useful information.

The two approaches are not mutually exclusive, and the happy medium seems to be the formation of broad hypotheses,
which are subsequently investigated by designing large-scale experiments and selecting the appropriate data.
Proteomics and genomics, and other varieties of ‘omics’, will all continue to complement each other in providing
the tools and information for this type of research.

3.6 Application of Proteomics to Medicine


Proteomic technologies will play an important role in drug discovery, diagnostics and molecular medicine because
proteomics is the link between genes, proteins and disease. As researchers study defective proteins that cause particular diseases,
their findings will help develop new drugs that either alter the shape of a defective protein or mimic a missing
one.

Already, many of the best-selling drugs today either act by targeting proteins or are proteins themselves. Advances
in proteomics may help scientists eventually create medications that are “personalised” for different individuals
to be more effective and have fewer side effects. Current research is looking at protein families linked to diseases
including cancer, diabetes and heart disease.

Identifying unique patterns of protein expression, or biomarkers, associated with specific diseases is one of the most
promising areas of clinical proteomics. One of the first biomarkers used in disease diagnosis was prostate-specific
antigen (PSA). Today, serum PSA levels are commonly used in diagnosing prostate cancer in men. Unfortunately,
many single protein biomarkers have proven to be unreliable. Researchers are now developing diagnostic tests
that simultaneously analyse the expression of multiple proteins in hopes of improving the specificity and sensitivity
of these types of assays.     
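
Sensitivity and specificity have standard definitions that are easy to compute from test results. In the sketch below the patient counts are hypothetical and serve only to show the arithmetic.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of people with the disease that the test correctly flags."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Fraction of people without the disease that the test correctly clears."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical study: 100 patients with the disease, 100 without.
print(sensitivity(true_positives=80, false_negatives=20))  # 0.8
print(specificity(true_negatives=90, false_positives=10))  # 0.9
```

A test with low specificity produces many false alarms, and one with low sensitivity misses real cases.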

3.7 Difference between Proteomics and Genomics


Unlike the genome, which is relatively static, the proteome changes constantly in response to tens of thousands of
intra- and extracellular environmental signals. The proteome varies with health or disease, the nature of each tissue,
the stage of cell development, and effects of drug treatments. As such, the proteome often is defined as “the proteins
present in one sample (tissue, organism, cell culture) at a certain point in time.”

In many ways, proteomics runs parallel to genomics: genomics starts with the gene and makes inferences about its
products (proteins), whereas proteomics begins with the functionally modified protein and works back to the gene
responsible for its production.

The sequencing of the human genome has increased interest in proteomics because while DNA sequence information
provides a static snapshot of the various ways in which the cell might use its proteins, the life of the cell is a dynamic
process. This new data set holds great promise for proteomic applications in science, medicine, and, most
notably, pharmaceuticals.

3.8 Protein Modeling


The process of evolution has resulted in the production of DNA sequences that encode proteins with specific functions.
In the absence of a protein structure that has been determined by X-ray crystallography or nuclear magnetic resonance
(NMR) spectroscopy, researchers can try to predict the three-dimensional structure using protein or molecular
modeling. This method uses experimentally determined protein structures (templates) to predict the structure of
another protein that has a similar amino acid sequence (target).

Although molecular modeling may not be as accurate at determining a protein’s structure as experimental methods,
it is still extremely helpful in proposing and testing various biological hypotheses. Molecular modeling also provides
a starting point for researchers wishing to confirm a structure through X-ray crystallography and NMR spectroscopy.
As the different genome projects are producing more sequences and as novel protein folds and families are being
determined, protein modeling will become an increasingly important tool for scientists working to understand normal
and disease-related processes in living organisms.

The four steps of protein modeling are:
• Identify the proteins with known three-dimensional structures that are related to the target sequence.
• Align the related three-dimensional structures with the target sequence and determine those structures that will
be used as templates.
• Construct a model for the target sequence based on its alignment with the template structure(s).
• Evaluate the model against a variety of criteria to determine if it is satisfactory.
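
The template selection in the first two steps can be sketched with a simple similarity measure. The example below assumes that each candidate template has already been aligned with the target (gaps shown as "-") and ranks templates by percent identity, which is one common criterion but not the only one; the sequences and template names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over the aligned, gap-free columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)

def choose_template(alignments):
    """Pick the template whose alignment with the target has the highest identity."""
    return max(alignments, key=lambda name: percent_identity(*alignments[name]))

# Each entry: template name -> (aligned target, aligned template).
alignments = {
    "template_A": ("MKT-AYIA", "MKTLAYLA"),
    "template_B": ("MKTAYIA", "MRTGYLA"),
}
for name, (target, template) in alignments.items():
    print(name, round(percent_identity(target, template), 1))
print(choose_template(alignments))  # template_A
```

Building and evaluating the model itself (the last two steps) requires dedicated modeling software and is not shown here.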

Summary
• Deoxyribonucleic acid (DNA) is the chemical compound that contains the instructions needed to develop and
direct the activities of nearly all living organisms.
• Genomics is the study of genes and non-coding sequences of DNA in organisms.
• Sequencing simply means determining the exact order of the bases in a strand of DNA.
• In the most common type of sequencing used today, called the chain termination method, a DNA strand is
treated with a variety of nucleotides, a set of enzymes, and a specific primer to generate a collection of smaller
DNA fragments.
• The Human Genome Project, which was led at the National Institutes of Health (NIH) by the National Human
Genome Research Institute, produced a very high-quality version of the human genome sequence that is freely
available in public databases.
• The International HapMap Project, in which NIH also played a leading role, represents a major step towards
discovering the genetic basis for health and disease.
• Genomic maps serve as a scaffold for orienting sequence information.
• Proteomics studies the structure and function of proteins, the principal constituents of the protoplasm of all
cells.
• The word “proteome” is derived from proteins expressed by a genome, and it refers to all the proteins produced
by an organism, much like the genome is the entire set of genes.
• Proteins are fairly large molecules made up of strings of amino acids linked like a chain.
• Proteomics is “the analysis of complete complements of proteins”.
• Proteomics is defined as the systematic large-scale analysis of protein expression under normal and perturbed
(stressed, diseased, and/or drugged) states, and generally involves the separation, identification, and
characterisation of all of the proteins in a cell or tissue sample.

References
• Mount, D. W., 2001. Bioinformatics Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press.
• Rastogi, S. C., Mendiratta, N. & Rastogi, P., 2006. Bioinformatics-Concepts, Skills, Applications, 2nd ed., PHI
Learning Pvt. Ltd.
• NHGRI. A Brief Guide to Genomics [Online] Available at: <http://www.genome.gov/18016863> [Accessed
28 February 2012].
• cisn. Genomics [Online] Available at: <http://cisncancer.org/research/what_we_know/omics/genomics.html>
[Accessed 28 February 2012].
• genomicseducation, 2009. What is Genomics Part 2 - The Human Genome Project [Video Online] Available
at: <http://www.youtube.com/watch?v=C86YbyEsct8&feature=results_main&playnext=1&list=PLE62E79AB3FDD7867>
[Accessed 28 February 2012].
• genomicseducation, 2010. What is Genomics - Chapter 1 [Video Online] Available at:
<http://www.youtube.com/watch?v=9jZF74iqLac&feature=related> [Accessed 28 February 2012].

Recommended Reading
• Patthy, L., 1999. Protein Evolution, Blackwell Science.
• Shanmughavel, P., 2005. Principles of Bioinformatics, Pointer Publishers, Jaipur, India.
• Branden C., & Tooze, J., 1999. Introduction to Protein Structure, Garland Publishing, New York.

Self Assessment
1. _______is the chemical compound that contains the instructions needed to develop and direct the activities of
nearly all living organisms.
a. DNA
b. RNA
c. Protein
d. Proteome

2. Sequencing simply means determining the exact order of the bases in a strand of___________.
a. RNA
b. DNA
c. protein
d. proteome

3. _______serve as a scaffold for orienting sequence information.


a. Genomic maps 
b. Proteins
c. Proteome
d. Sequencing

4. _______is defined as “the proteins present in one sample (tissue, organism, cell culture) at a certain point in
time.”
a. Genome
b. Proteome
c. Databases
d. Nucleotides

5. _____________molecules are made of two twisting, paired strands, often referred to as a double helix.
a. RNA
b. DNA
c. protein
d. proteome

6. An organism’s complete set of DNA is called its_________.


a. genome
b. proteome
c. genome map
d. protein model

7. Which of the following statements is false?


a. Molecular modeling also provides a starting point for researchers wishing to confirm a structure through
X-ray crystallography and NMR spectroscopy.
b. Genomics starts with the gene and makes inferences about its products (proteins).
c. Proteomics begins with the functionally modified protein and works back to the gene responsible for its
production.
d. The genome changes constantly in response to tens of thousands of intra- and extracellular environmental
signals.

8. Which of the following statements is false?


a. Proteomic technologies will play an important role in drug discovery, diagnostics and molecular medicine
because proteomics is the link between genes, proteins and disease.
b. Genomics is “the analysis of complete complements of proteins”.
c. Proteomics includes not only the identification and quantification of proteins, but also the determination of
their localisation, modifications, interactions, activities, and, ultimately, their function.
d. Proteins are fairly large molecules made up of strings of amino acids linked like a chain.

9. Which of the following statements is true?


a. The word “proteome” is derived from proteins expressed by a genome.
b. Genome refers to all the proteins produced by an organism, much like the genome is the entire set of
genes.
c. The human body may contain more than 2 million different proteins, each having same functions.
d. Proteomics studies the structure and function of nucleotides, the principal constituents of the protoplasm
of all cells.

10. What refers to the analysis of expression and differential expression of proteins?
a. Structural proteomics 
b. Expression proteomics 
c. Interaction proteomics 
d. Genomics

Chapter IV
Sequence Alignment

Aim
The aim of this chapter is to:

• define sequence alignment

• describe pairwise sequence alignment

• explain global alignment

Objectives
The objectives of this chapter are to:

• explain the Needleman-Wunsch algorithm

• elucidate alignment scoring function

• describe local alignment

Learning outcome
At the end of this chapter, you will be able to:

• describe the identity matrix

• know the Smith-Waterman algorithm

• understand features of substitution matrices

4.1 Introduction
Once a genome is completely sequenced, many sorts of analyses can be performed on it. Some of the goals of sequence
analysis are the following:
• Identify the genes.
• Determine the function of each gene. One way to hypothesise the function is to find another gene (possibly
from another organism) whose function is known and to which the new gene has high sequence similarity. This
assumes that sequence similarity implies functional similarity, which may or may not be true.
• Identify the proteins involved in the regulation of gene expression.
• Identify sequence repeats.
• Identify other functional regions.

Many of these tasks are computational in nature. Given the incredible rate at which sequence data is being produced,
the integration of computer science, mathematics, and biology will be integral to analysing those sequences.

Sequence alignment in bioinformatics is a field of research focused on developing tools for comparing and finding
similar sequences of amino acids or DNA base pairs with the aid of computers. The sequence similarity is used to
assess gene and protein homology, classify genes and proteins, predict biological function, secondary and tertiary
protein structure, detect point mutations, construct evolutionary trees, and so on. There are two main areas of
sequence alignment: pairwise sequence alignment and multiple sequence alignment.

Sequence alignment is an arrangement of two or more sequences, highlighting their similarity. The sequences are
padded with gaps (dashes) so that wherever possible, columns contain identical characters from the sequences
involved.

tcctctgcctctgccatcat---caaccccaaagt
|||| ||| ||||| |||||   ||||||||||||
tcctgtgcatctgcaatcatgggcaaccccaaagt

4.2 Pairwise Sequence Alignment


Pairwise sequence alignment is concerned with comparing two DNA or amino acid sequences and finding the global
or local “optimum alignment” of the two sequences. Based on differences between the two sequences, one can
calculate the “cost” of aligning the two sequences by using replacements, deletions and insertions, and assign a
similarity score.

The problem has tractable solutions by means of dynamic programming and Hidden Markov Models and is the basis
of popular heuristic search methods such as FASTA or BLAST. Needleman and Wunsch (1970) were the first to
present a dynamic programming algorithm that could find the global alignment between two amino acid sequences.
Smith and Waterman (1981) introduced a new algorithm with a different method of scoring similarity, aimed at
finding optimum local alignment sub-sequences at the expense of the global score. Global algorithms are generally
not sensitive enough for highly diverged sequences that share only some localised similarities.

A particular application of pairwise sequence alignment is quickly searching large DNA and protein databases for
matches to a query sequence. Popular heuristic algorithms, such as those from the FASTA (Pearson and Lipman
1985, 1988) or BLAST (Altschul et al 1990, 1997) families are much faster than algorithms based on dynamic
programming.

Pairwise sequence alignment methods are concerned with finding the best-matching piecewise local or global
alignments of protein (amino acid) or DNA (nucleic acid) sequences. Typically, the purpose of this is to find
homologues (relatives) of a gene or gene-product in a database of known examples.

This information is useful for answering a variety of biological questions:
• The identification of sequences of unknown structure or function.
• The study of molecular evolution.

Global alignment
A global alignment between two sequences is an alignment in which all the characters in both sequences participate
in the alignment. Global alignments are useful mostly for finding closely-related sequences. As these sequences
are also easily identified by local alignment methods, global alignment is now somewhat deprecated as a technique.
Further, there are several complications to molecular evolution (such as domain shuffling) which prevent these
methods from being useful. The aim is to find the global best fit between two sequences.

Example: the sequences s = VIVALASVEGAS and t = VIVADAVIS align like:

A(s,t) = V I V A L A S V E G A S
         | | | |   |   |       |
         V I V A D A - V - - I S

The Needleman-Wunsch algorithm


The Needleman-Wunsch algorithm performs a global alignment on two sequences (s and t) and is applied to align
protein or nucleotide sequences. The Needleman-Wunsch algorithm is an example of dynamic programming, and
is guaranteed to find the alignment with the maximum score. It works for DNA sequences as well as for protein
sequences.

Alignment scoring function


The cost of aligning two symbols xi and yi is given by the scoring function σ(xi, yi).

Alignment cost
The cost of the entire alignment (c is the number of columns in the alignment):

$$M = \sum_{i=1}^{c} \sigma(x_i, y_i)$$

A simple scoring function


σ(-,a) = σ(a,-) = -1
σ(a,b) = -1 if a ≠ b
σ(a,b) = 1 if a = b

The substitution matrix


A more realistic scoring function is given by the biologically inspired substitution matrix:

-     A     G     C     T
A    10    -1    -3    -4
G    -1     7    -5    -3
C    -3    -5     9     0
T    -4    -3     0     8
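
As a minimal sketch of how such a matrix is used, the snippet below stores the table above as a nested dictionary and sums the scores column by column for two ungapped sequences of equal length. The function name ungapped_score is a hypothetical helper introduced here for illustration, and gap penalties are deliberately left out because the matrix itself does not define them.

```python
# Nucleotide substitution matrix copied from the table above (it is symmetric).
SUBST = {
    "A": {"A": 10, "G": -1, "C": -3, "T": -4},
    "G": {"A": -1, "G": 7, "C": -5, "T": -3},
    "C": {"A": -3, "G": -5, "C": 9, "T": 0},
    "T": {"A": -4, "G": -3, "C": 0, "T": 8},
}


def ungapped_score(x, y, matrix=SUBST):
    """Sum the substitution scores, column by column, of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(matrix[a][b] for a, b in zip(x, y))


print(ungapped_score("AGCT", "AGCT"))  # 10 + 7 + 9 + 8 = 34
print(ungapped_score("AGCT", "GACT"))  # -1 - 1 + 9 + 8 = 15
```

Note that, unlike the simple scoring function, identical pairs no longer all receive the same score: an A aligned with an A counts for more than a G aligned with a G.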

Scoring function
The cost of aligning the two sequences s = VIVALASVEGAS and t = VIVADAVIS as

A(s,t) = V I V A L A S V E G A S
         | | | |   |   |       |
         V I V A D A - V - - I S

under the simple scoring function (match = 1, mismatch = -1, gap = -1) is:
M(A) = 7 matches + 2 mismatches + 3 gaps
     = 7 - 2 - 3 = 2
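
The same calculation can be sketched in a few lines of Python. The function names sigma and alignment_score are hypothetical helpers chosen for this illustration; the two rows are the gapped sequences of the alignment A(s,t) written without spaces.

```python
def sigma(a, b):
    """Simple scoring function: +1 for a match, -1 for a mismatch or a gap."""
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def alignment_score(row_s, row_t):
    """Sum sigma over all columns of an alignment given as two gapped rows."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(sigma(a, b) for a, b in zip(row_s, row_t))


row_s = "VIVALASVEGAS"
row_t = "VIVADA-V--IS"
print(alignment_score(row_s, row_t))  # 7 matches - 2 mismatches - 3 gaps = 2
```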

Optimal global alignment


The optimal global alignment A* between two sequences s and t is the alignment A(s,t) that maximises the total
alignment score M(A) over all possible alignments.
A* = argmax M(A)

Finding the optimal alignment A* looks like a combinatorial optimisation problem:


• generate all possible alignments
• compute the score M
• select the alignment A* with the maximum score M*
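
Generating every possible alignment is not practical, because the number of alignments grows exponentially with the sequence lengths. Dynamic programming, as used by the Needleman-Wunsch algorithm, reaches the same optimum by filling a table of best scores for all pairs of prefixes. The following is a minimal sketch under the simple scoring function (match = 1, mismatch = -1, gap = -1); the function name and parameter names are assumptions made for this illustration, and when several alignments share the maximum score only one of them is returned.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, row_s, row_t)."""
    m, n = len(s), len(t)

    def sub(i, j):
        # Score for aligning s[i-1] with t[j-1].
        return match if s[i - 1] == t[j - 1] else mismatch

    # F[i][j] holds the best score for aligning the prefixes s[:i] and t[:j].
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub(i, j),  # align two symbols
                          F[i - 1][j] + gap,            # gap in t
                          F[i][j - 1] + gap)            # gap in s

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(i, j):
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return F[m][n], "".join(reversed(row_s)), "".join(reversed(row_t))


score, row_s, row_t = needleman_wunsch("VIVALASVEGAS", "VIVADAVIS")
print(score)  # 2
print(row_s)
print(row_t)
```

The table has (m + 1) × (n + 1) cells and each cell is computed in constant time, so the work is proportional to the product of the two sequence lengths rather than to the number of possible alignments.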

Local alignment
Local alignment methods find related regions within sequences; that is, they can consist of a subset of the characters
within each sequence. For example, positions 20-40 of sequence A might be aligned with positions 50-70 of sequence B.
This is a more flexible technique than global alignment and has the advantage that related regions which appear in
a different order in the two proteins (which is known as domain shuffling) can be identified as being related. This
is not possible with global alignment methods.

The Smith-Waterman algorithm


The Smith-Waterman algorithm (1981) determines similar regions between two nucleotide or protein sequences.
Smith-Waterman is also a dynamic programming algorithm and improves on Needleman-Wunsch. As such, it has
the desirable property that it is guaranteed to find the optimal local alignment with respect to the scoring system
being used (which includes the substitution matrix and the gap-scoring scheme). However, the Smith-Waterman
algorithm is demanding of time and memory resources: in order to align two sequences of lengths m and n, O(mn)
time and space are required.

As a result, it has largely been replaced in practical use by the BLAST algorithm; although not guaranteed to find
optimal alignments, BLAST is much more efficient.
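
A minimal sketch of the Smith-Waterman recurrence is given below, assuming the simple scoring function (match = 1, mismatch = -1, gap = -1) rather than a substitution matrix. Compared with global alignment, no cell of the table is allowed to fall below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The function name and the two test sequences are hypothetical and chosen only for illustration.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-1):
    """Local alignment by dynamic programming; returns (score, row_s, row_t)."""
    m, n = len(s), len(t)

    def sub(i, j):
        # Score for aligning s[i-1] with t[j-1].
        return match if s[i - 1] == t[j - 1] else mismatch

    # H[i][j] holds the best score of a local alignment ending at s[i-1], t[j-1].
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i][j] = max(0,                            # start a new local alignment
                          H[i - 1][j - 1] + sub(i, j),  # align two symbols
                          H[i - 1][j] + gap,            # gap in t
                          H[i][j - 1] + gap)            # gap in s
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a cell with score zero is reached.
    row_s, row_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


# Two unrelated flanks around a shared core region.
score, row_s, row_t = smith_waterman("TTTTACGTACGTCCCC", "GGGACGTACGTAAA")
print(score)  # 8
print(row_s)  # ACGTACGT
print(row_t)  # ACGTACGT
```

The table H has (m + 1) × (n + 1) entries, which is where the O(mn) time and space requirement mentioned above comes from.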

Sequence similarities and scoring


Given two sequences, how similar are they? This question cannot be answered in general because the answer depends on
the context. Depending on the application, sequences may need to have the same trend (stock market), contain the
same pattern (text), or have the same frequencies (speech) in order to count as similar to one another.

Identity matrix
For biological sequences it is known how one sequence can mutate into another. First, there are point mutations,
in which one nucleotide or amino acid is changed into another. Secondly, there are deletions, in which one element
(nucleotide or amino acid) or a whole subsequence of elements is deleted from the sequence. Thirdly, there are
insertions, in which one element or a subsequence is inserted into the sequence. As a first approach, the similarity of
two biological sequences can be expressed through the minimal number of mutations needed to transform one sequence
into the other. Not all mutations are equally likely. Point mutations are more likely because an amino acid can be
replaced by an amino acid with similar chemical properties without changing the function. Deletions and insertions
are more prone to destroying the function of the protein, and their length must be taken into account. For simplicity,
we can count the length of insertions and deletions. Finally, we are left with simply counting the number of amino
acids that match in the two sequences (this number is the sum of the lengths of both sequences, minus the inserted
and deleted positions and twice the number of mismatches, divided by two).

Here is an example:
BIOINFORMATICS              BIOI--N-FORMATICS
                     =>
BOILING FOR MANICS          B-OILINGFORMANICS

The hit count gives 12 identical letters out of the 14 letters of BIOINFORMATICS. The mutations would be:
• delete I: BOINFORMATICS
• insert LI: BOILINFORMATICS
• insert G: BOILINGFORMATICS
• change T into N: BOILINGFORMANICS

These two texts seem to be very similar. Note that insertions and deletions cannot be distinguished when only the two
sequences are presented (was I deleted from the first string or inserted into the second?). Therefore, both are denoted
by a “-” (note that two “-” are never matched to one another). The task for bioinformatics algorithms is to find, from
the two strings (left-hand side in the above example), the optimal alignment (right-hand side in the above example).
The optimal alignment is the arrangement of the two strings in a way that the number of mutations is minimal. This
optimality criterion scores matches (the same amino acid) with 1 and mismatches (different amino acids) with 0. If
these scores for pairs of amino acids are written in matrix form, then the identity matrix is obtained. The number of
mutations is one criterion for optimality, but others exist. In general, an alignment algorithm searches for the
arrangement of two sequences such that a criterion is optimised. The sequences can be arranged by inserting “-” into
the strings and moving them horizontally against each other. For long sequences the search for an optimal alignment
can be very difficult.
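
The hit count and the counting rule given above can be checked with a short sketch. The two gapped rows below are an assumption: the gap placement was reconstructed from the four mutations listed for the BIOINFORMATICS example. The helper names are hypothetical.

```python
def identity_score(a, b):
    """Identity-matrix score: 1 for two identical letters, 0 otherwise."""
    return 1 if a == b and a != "-" else 0


# Gapped rows consistent with the four mutations listed for this example (assumed).
top = "BIOI--N-FORMATICS"
bottom = "B-OILINGFORMANICS"

hits = sum(identity_score(a, b) for a, b in zip(top, bottom))
gaps = sum(1 for a, b in zip(top, bottom) if a == "-" or b == "-")
mismatches = sum(1 for a, b in zip(top, bottom)
                 if a != "-" and b != "-" and a != b)

# Ungapped sequence lengths: 14 and 16.
len_top = len(top.replace("-", ""))
len_bottom = len(bottom.replace("-", ""))

print(hits)        # 12
print(gaps)        # 4
print(mismatches)  # 1
# Counting rule from the text: (both lengths - indels - 2 * mismatches) / 2
print((len_top + len_bottom - gaps - 2 * mismatches) // 2)  # 12
```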

One tool for representing alignments is the dot matrix, where one sequence is written horizontally on the top and
the other one vertically on the left. This gives a matrix where each letter of the first sequence is paired with each
letter of the second sequence. For each pair of matching letters, a dot is written in the corresponding position in the
matrix. Which pairs appear in the optimal alignment? We will see later that each path through the dot matrix
corresponds to an alignment. The dots on diagonals correspond to matching regions.

[Dot matrix: BIOINFORMATICS is written across the top (columns) and BOILINGFORMANICS down the left (rows);
a dot marks each cell where the two letters match.]

Fig. 4.1 Dot matrix


(Source: http://www.master-bioinformatik.at/curriculum/BioInf_I_Notes.pdf)
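
A dot matrix such as the one in Fig. 4.1 can be generated with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, "*" is printed where a dot would be drawn, and "." marks an empty cell.

```python
def dot_matrix(top, left):
    """Return a matrix of booleans: True where a letter of `left` equals a letter of `top`."""
    return [[a == b for a in top] for b in left]


def render(top, left):
    """Draw the dot matrix as text, with `top` as the header row and `left` as row labels."""
    lines = ["  " + " ".join(top)]
    for b, row in zip(left, dot_matrix(top, left)):
        lines.append(b + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("BIOINFORMATICS", "BOILINGFORMANICS"))
```

In the printed matrix, the shared region FORMA shows up as an unbroken diagonal run of marks, which illustrates the statement that dots on diagonals correspond to matching regions.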

4.3 Multiple Sequence Alignment (MSA)


Multiple sequence alignment aims to find similarities between many sequences. MSA is hard and less tractable than
pairwise alignment. Dynamic programming is impractical for a large number of sequences. The most successful
MSA solutions are heuristic algorithms with approximate approaches, such as the CLUSTAL family of programs
created by Higgins, which use a progressive algorithm (Feng and Doolittle 1987): CLUSTAL (1988), ClustalV
(1992), ClustalW (1994), ClustalX (1998). Profile Hidden Markov Models (HMMs) provide another successful
solution to the problem of MSA. They were introduced by Krogh and colleagues in 1994.

4.4 Substitution Matrices


Both pairwise and multiple sequence alignment algorithms use substitution matrices to score the sequence alignment.
In substitution matrices each possible residue substitution is given a score reflecting the probability of such a change.
There are two popular protein substitution matrix models: Percent Accepted Mutation (PAM - Dayhoff 1978) and
Blocks Substitution Matrix (BLOSUM - Henikoff and Henikoff 1992).

4.5 Two Sample Applications


Sequence alignment algorithms are often used to characterise newly sequenced genes or gene products. For
example, the sequenced genome of the SARS virus was investigated by using BLAST, FASTA, Pfam, and ClustalX
to find proteins with sequences similar to those expected to be produced by the SARS virus ORFs (Marra 2003,
Rota 2003). Biological function and structure were then predicted for the SARS proteins based on the information
available for the homologous proteins. Another application of sequence alignment tools is the study of phylogenetics.
Phylogenetics is a field of molecular evolution that correlates mutations in DNA and protein sequences with
evolutionary divergence.

Molecular distances of evolution between species can be calculated using various metrics based on DNA or protein
sequence difference. The smaller the number of differences in the DNA and/or protein sequences of similar genes
from two related organisms, the less they have evolutionarily diverged from each other.
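
One of the simplest such metrics is the proportion of aligned sites at which two sequences differ (often called the p-distance). The sketch below is a minimal illustration; the three toy sequences and their labels are invented for this example and are not real gene sequences, and columns containing a gap are simply skipped, which is only one of several possible treatments.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences, skipping gap columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no gap-free columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


# Hypothetical aligned fragments of the same gene from three organisms.
human = "ACGTACGTAC"
mouse = "ACGTACGTTC"
fly = "ACCTAGGTTA"

print(p_distance(human, mouse))  # 0.1
print(p_distance(human, fly))    # 0.4
```

In this toy example the human and mouse fragments differ at fewer sites than the human and fly fragments, so they would be regarded as less diverged.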

Summary
• Sequence alignment in bioinformatics is a field of research focused on developing tools for comparing and
finding similar sequences of amino acids or DNA base pairs with the aid of computers.
• Sequence alignment is an arrangement of two or more sequences, highlighting their similarity.
• Pairwise sequence alignment is concerned with comparing two DNA or amino acid sequences and finding the global
or local “optimum alignment” of the two sequences.
• Needleman and Wunsch (1970) were the first to present a dynamic programming algorithm that could find the
global alignment between two amino acid sequences.
• Smith and Waterman (1981) introduced a new algorithm with a different method of scoring similarity aimed at
finding optimum local alignment sub-sequences, at the expense of the global score.
• Global algorithms are generally not sensitive for highly diverged sequences with some localised similarities
within them.
• A global alignment between two sequences is an alignment in which all the characters in both sequences
participate in the alignment.
• The Needleman-Wunsch algorithm performs a global alignment on two sequences (s and t) and is applied to
align protein or nucleotide sequences.
• Local alignment methods find related regions within sequences. They can consist of a subset of the characters
within each sequence.
• The Smith-Waterman algorithm (1981) determines similar regions between two nucleotide or protein
sequences.
• Multiple sequence alignment aims to find similarities between many sequences. MSA is hard and less tractable
than pairwise alignment.
• Both pairwise and multiple sequence alignment algorithms use substitution matrices to score the sequence
alignment.
• Sequence alignment algorithms are often used to characterise newly sequenced genes or gene products.

References
• Branden, C. & Tooze, J., 1998. An Introduction to Protein Structure. Garland, 1998.
• Rastogi, S. C., Mendiratta, N. & Rastogi, P., 2006. Bioinformatics-Concepts, Skills, Applications, 2nd ed., PHI
Learning Pvt. Ltd.
• Huson, D., 2005. A Brief Guide to Genomics [Online] Available at: <http://lectures.molgen.mpg.de/
Algorithmische_Bioinformatik_WS0607/reinert1.pd> [Accessed 28 February 2012].
• Biology Computers. Pairwise sequence alignment [Online] Available at: <http://gtbinf.wordpress.com/biol-
41506150/pairwise-sequence-alignment/> [Accessed 28 February 2012].
• ABNOVA1, 2010. BLAST - Multiple Alignment [Video Online] Available at: <http://www.youtube.com/
watch?v=xdF6iZEPH_s> [Accessed 28 February 2012].
• sanjaysingh765, 2011. Multiple sequence alignment with clustalw and boxshade [Video Online] Available at:
<http://www.youtube.com/watch?v=BrzhdNvXXDs> [Accessed 28 February 2012].

Recommended Reading
• Livingstone & Barton., 1993. Protein Sequence Alignments: a Strategy for the Hierarchical Analysis of Residue
Conservation, Computer Applications in the Biosciences.
• Shanmughavel, P., 2005. Principles of Bioinformatics, Pointer Publishers, Jaipur, India.
• International Human Genome Sequencing Consortium, 2001. Initial Sequencing and Analysis of the Human
Genome, Nature.

Self Assessment
1. There are _____main areas of sequence alignment.
a. two
b. three
c. four
d. five

2. ________is an arrangement of two or more sequences, highlighting their similarity.


a. Sequence alignment
b. FASTA
c. BLAST
d. Pfam

3. __________aims to find similarities between many sequences.


a. Multiple sequence alignment
b. Global alignment
c. Local alignment
d. Pairwise alignment

4. ________is a field of molecular evolution that correlates mutations in DNA and protein sequences with
evolutionary divergence.
a. Bioinformatics
b. Phylogenetics
c. Genomics
d. Proteomics

5. _________methods find related regions within sequences - they can consist of a subset of the characters within
each sequence.
a. Multiple sequence alignment
b. Global alignment
c. Local alignment
d. Pairwise alignment

6. A __________between two sequences is an alignment in which all the characters in both sequences participate
in the alignment.
a. Multiple sequence alignment
b. Global alignment
c. Local alignment
d. Pairwise alignment

7. _________ is useful mostly for finding closely-related sequences.


a. Multiple sequence alignment
b. Global alignment
c. Local alignment
d. Pairwise alignment

8. Which of the following statements is false?
a. The Smith-Waterman algorithm performs a global alignment on two sequences (s and t) and is applied to
align protein or nucleotide sequences.
b. The Needleman-Wunsch algorithm is an example of dynamic programming, and is guaranteed to find the
alignment with the maximum score.
c. The Smith-Waterman algorithm (1981) is for determining similar regions between two nucleotide or protein
sequences.
d. Smith-Waterman is also a dynamic programming algorithm and improves on Needleman-Wunsch.

9. Which of the following statements is false?


a. In point mutation, one nucleotide or amino acid is changed into another one.
b. In deletion, one element (nucleotide or amino acid) or a whole subsequence of element is deleted from the
sequence.
c. In insertion, one element or a subsequence is inserted into the sequence.
d. Alignments are more prone to destroying the function of the protein, where the length of deletions and
insertions must be taken into account.

10. Which of the following statements is false?


a. The optimal alignment is the arrangement of the two strings in a way that the number of mutations is
minimal.
b. In general, an alignment algorithm searches for the arrangement of two sequences such that a criterion is
optimised.
c. The sequences can be arranged by inserting “@” into the strings and moving them horizontally against
each other.
d. One tool for representing alignments is the dot matrix, where one sequence is written horizontally on the
top and the other one vertically on the left.

Chapter V
Phylogenetic Analysis

Aim
The aim of this chapter is to:

• define phylogenetics

• describe phylogenetic analysis

• explain fundamental elements of phylogenetic models

Objectives
The objectives of this chapter are to:

• define bootstrapping

• describe tree evaluation

• elucidate the tree-building methods

Learning outcome
At the end of this chapter, you will be able to:

• differentiate between paralogs and orthologs

• understand tree interpretation

• enumerate the steps of phylogenetic data analysis

5.1 Introduction
Phylogenetic analysis is the process you use to determine the evolutionary relationships between organisms. The
results of an analysis can be drawn in a hierarchical diagram called a cladogram or phylogram (phylogenetic tree).
The branches in a tree are based on the hypothesised evolutionary relationships (phylogeny) between organisms. Each
member in a branch, also known as a monophyletic group, is assumed to be descended from a common ancestor.
Originally, phylogenetic trees were created using morphology, but now, determining evolutionary relationships
includes matching patterns in nucleic acid and protein sequences.

Phylogenetics is the study of evolutionary relationships. Phylogenetic analysis is the means of inferring or estimating
these relationships. The evolutionary history inferred from phylogenetic analysis is usually depicted as branching,
treelike diagrams that represent an estimated pedigree of the inherited relationships among molecules (‘gene trees’),
organisms, or both. Phylogenetics is sometimes called cladistics because the word ‘clade,’ a set of descendants
from a single ancestor, is derived from the Greek word for branch. However, cladistics is a particular method of
hypothesising about evolutionary relationships.

The basic tenet behind cladistics is that members of a group or clade share a common evolutionary history and
are more related to each other than to members of another group. A given group is recognised by sharing unique
features that were not present in distant ancestors. These shared, derived characteristics can be anything that can be
observed and described, from two organisms having developed a spine to two sequences having developed a mutation
at a certain base pair of a gene. Usually, cladistic analysis is performed by comparing multiple characteristics or
‘characters’ at once, either multiple phenotypic characters or multiple base pairs or amino acids in a sequence.

There are three basic assumptions in cladistics:
• Any group of organisms is related by descent from a common ancestor (a fundamental tenet of evolutionary
theory).
• There is a bifurcating pattern of cladogenesis. This assumption is controversial.
• Change in characteristics occurs in lineages over time. This is a necessary condition for cladistics to work.

The resulting relationships from cladistic analysis are most commonly represented by a phylogenetic tree:

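[Tree diagram with three leaves labelled Human, Mouse and Fly; one branch point is labelled ‘A node’ and one
group of leaves is labelled ‘A clade’.]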

Fig. 5.1 Clade and node


(Source: http://www.bioon.com/book/biology/bioinformatics/chapter-14.pdf)

Even with this simple tree, a number of terms that are used frequently in phylogenetic analysis can be introduced:
• A clade is a monophyletic taxon. Clades are groups of organisms or genes that include the most recent common
ancestor of all of their members and all of the descendants of that most recent common ancestor. Clade is derived
from the Greek word ‘klados,’ meaning branch or twig.
• A taxon is any named group of organisms but not necessarily a clade.
• In some analyses, branch lengths correspond to divergence (for example, in the tree above, mouse is slightly
more related to fly than human is to fly).
• A node is a bifurcating branch point.

Macromolecules, especially sequences, have surpassed morphological and other organism characters as the most
popular form of data for phylogenetic or cladistic analysis. Although numerous phylogenetic algorithms, procedures,
and computer programs have been devised, their reliability and practicality are, in all cases, dependent on the
structure and size of the data.

The danger of generating incorrect results is inherently greater in computational phylogenetics than in many other
fields of science. The events yielding a phylogeny happened in the past and can only be inferred or estimated.
Despite the well-documented limitations of available phylogenetic procedures, current biological literature is replete
with examples of conclusions derived from the results of analyses in which data had been simply run through
one or another phylogeny program. The limiting factor in phylogenetic analysis is usually not the computational
method used; more often than not, it is the user’s understanding of what the method is actually doing with the
data.

5.2 Fundamental Elements of Phylogenetic Models


Phylogenetic tree-building methods presume particular evolutionary models. For a given data set, these models
can be violated because of occurrences such as the transfer of genetic material between organisms. Thus, when
interpreting a given analysis, one should always consider the model used and its assumptions and entertain other
possible explanations for the observed results. As an example, consider the tree in Fig. 5.2. An investigation
of organismal relationships in the tree suggests that eukaryote 1 is more related to the bacteria than to the other
eukaryotes. Because the vast majority of other cladistic analyses, including those based on morphological features,
suggest that eukaryote 1 is more related to the other eukaryotes than to bacteria, we suspect that for this analysis
the assumption of a bifurcating pattern of evolution is incorrect. We suspect that horizontal gene transfer from an
ancestor of bacteria 1, 2, and 3 to the ancestor of eukaryote 1 occurred, because this would most simply explain
the results.

Models inherent in phylogenetics methods make additional ‘default’ assumptions:


• The sequence is correct and originates from the specified source.
• The sequences are homologous (that is, they are all descended in some way from a shared ancestral sequence).
• Each position in a sequence alignment is homologous with every other in that alignment.
• Each of the multiple sequences included in a common analysis has a common phylogenetic history with the
others (for example, there are no mixtures of nuclear and organellar sequences).
• The sampling of taxa is adequate to resolve the problem of interest.
• Sequence variation among the samples is representative of the broader group of interest.
• The sequence variability in the sample contains phylogenetic signal adequate to resolve the problem of
interest.

Fig. 5.2 A phylogenetic tree
(Source: http://www.bioon.com/book/biology/bioinformatics/chapter-14.pdf)

Example of a phylogenetic tree based on genes that do not match organismal phylogeny, suggesting horizontal gene
transfer has occurred. The ancestor of protozoan eukaryote 1 (underlined and marked with an arrow) appears to
have obtained the gene from the ancestor of Bacteria 1, 2, and 3, as this is the simplest explanation for the results.
This unexpected result is not without precedent: there have been a number of reported phylogenetic analyses that
suggest that protozoa have taken up genes from bacteria, most likely from bacteria that they have ingested.

There are additional assumptions that are defaults in some methods but can be at least partially corrected for in
others:
• The sequences in the sample evolved according to a single stochastic process.
• All positions in the sequence evolved according to the same stochastic process.
• Each position in the sequence evolved independently.

Errors in published phylogenetic analyses can often be attributed to violations of one or more of the foregoing
assumptions. Every sequence data set must be evaluated against these assumptions, with other possible explanations
for the observed results considered.

5.3 Tree Interpretation: Importance of Identifying Paralogs and Orthologs


As more genomes are sequenced, we are becoming more interested in learning about protein or gene evolution (that
is, investigating gene phylogeny rather than organismal phylogeny). This can aid our understanding of the function
of proteins and genes.

Studies of protein and gene evolution involve the comparison of homologs: sequences that have common origins but
may or may not have common activity. Sequences that share an arbitrary threshold level of similarity, determined
by alignment of matching bases, are termed homologous. They are inherited from a common ancestor that possessed
similar structure, although the structure of the ancestor may be difficult to determine because it has been modified
through descent.

Homologs are most commonly orthologs, paralogs, or xenologs.


• Orthologs are homologs produced by speciation. They represent genes derived from a common ancestor that
diverged due to divergence of the organisms they are associated with. They tend to have similar function.
• Paralogs are homologs produced by gene duplication. They represent genes derived from a common ancestral gene
that duplicated within an organism and then subsequently diverged. They tend to have different functions.
• Xenologs are homologs resulting from horizontal gene transfer between two organisms. The determination
of whether a gene of interest was recently transferred into the current host by horizontal gene transfer is often
difficult.

Occasionally, the %(G + C) content may be so vastly different from that of the average gene in the current host that a
conclusion of external origin is nearly inescapable; often, however, it is unclear whether a gene has a horizontal origin.
The function of xenologs can vary depending on how significant the change in context was for the horizontally
moving gene; in general, however, the function tends to be similar.
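
The %(G + C) comparison described above can be sketched in Python. This is only a minimal illustration: the function names, the example sequence and the 10-percentage-point threshold are assumptions made for the example, and base composition alone is rarely sufficient evidence of horizontal transfer.

```python
def gc_percent(seq):
    """Return the percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


def looks_atypical(gene, host_gc_percent, threshold=10.0):
    """Flag a gene whose %(G + C) differs from the host average by more
    than `threshold` percentage points (the threshold is an assumed value)."""
    return abs(gc_percent(gene) - host_gc_percent) > threshold


# Hypothetical example: a GC-rich gene in a host averaging 50% G + C.
candidate = "GGCGCCGGGCGCGCCGGCGC"
print(gc_percent(candidate))            # 100.0
print(looks_atypical(candidate, 50.0))  # True
```

A flagged gene is only a candidate xenolog; as noted above, in many cases the origin remains unclear.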

5.4 Phylogenetic Data Analysis


A straightforward phylogenetic analysis consists of four steps:
• Alignment (both building the data model and extracting a phylogenetic dataset)
• Determining the substitution model
• Tree building
• Tree evaluation

Each step is critical for the analysis and should be handled accordingly. For example, trees are only as good as the
alignment they are based on. When performing a phylogenetic analysis, it is often insightful to build trees based on
different modifications of the alignment to see how the alignment proposed influences the resulting tree.

5.4.1 Alignment: Building the Data Model


Phylogenetic sequence data usually consist of multiple sequence alignments; the individual, aligned-base positions
are commonly referred to as ‘sites.’ These sites are equivalent to ‘characters’ in theoretical phylogenetic discussions,
and the actual base (or gap) occupying a site is the ‘character state.’

Aligned sequence positions subjected to phylogenetic analysis represent a priori phylogenetic conclusions because
the sites themselves (not the actual bases) are effectively assumed to be genealogically related, or homologous. Sites
at which one is confident of homology and that contain changes in character states useful for the given phylogenetic
analysis are often referred to as ‘informative sites.’
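
To make the terms 'site' and 'character state' concrete, the following Python sketch walks over the columns of a small alignment and reports which sites are informative. The text gives no formal criterion, so the sketch assumes one common operational definition from parsimony analysis: a site is informative when at least two character states each occur in at least two sequences. The alignment and function names are hypothetical.

```python
from collections import Counter


def columns(alignment):
    """Yield each site (column) of an alignment as a string of character states."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    for i in range(length):
        yield "".join(seq[i] for seq in alignment)


def is_parsimony_informative(site):
    """True if at least two states each occur in at least two sequences.
    Gaps ('-') are ignored here, which is only one possible treatment."""
    counts = Counter(state for state in site if state != "-")
    return sum(1 for n in counts.values() if n >= 2) >= 2


alignment = [
    "ACGTA",
    "ACGAA",
    "ATGAC",
    "ATG-C",
]
informative = [i for i, site in enumerate(columns(alignment))
               if is_parsimony_informative(site)]
print(informative)  # [1, 4]
```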

Steps in building the alignment include selection of the alignment procedure(s) and extraction of a phylogenetic
data set from the alignment. The latter procedure requires determination of how ambiguously aligned regions and
insertion/deletions (referred to as indels, or gaps) will be treated in the tree-building procedure.
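
One simple way of treating indels when extracting the phylogenetic data set is to discard every site that contains a gap in any sequence. The Python sketch below shows this treatment; it is an illustrative assumption rather than a recommended policy, since other treatments (for example, scoring a gap as an additional character state) are also used.

```python
def remove_gapped_sites(alignment, gap="-"):
    """Return the alignment with every column containing a gap removed."""
    keep = [i for i in range(len(alignment[0]))
            if all(seq[i] != gap for seq in alignment)]
    return ["".join(seq[i] for i in keep) for seq in alignment]


# Hypothetical three-sequence alignment with two gapped columns.
aligned = ["ACG-TA", "AC-GTA", "ACGGTA"]
print(remove_gapped_sites(aligned))  # ['ACTA', 'ACTA', 'ACTA']
```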

A typical alignment procedure involves the application of a program such as CLUSTAL W, followed by manual
alignment editing and submission to a tree building program. This procedure should be performed with the following
questions and considerations in mind.

5.4.2 Determining the Substitution Model


The substitution model should be given the same emphasis as alignment and tree building. As implied in the preceding
section, the substitution model influences both alignment and tree building; hence, a recursive approach is warranted.
At the present time, two elements of the substitution model can be computationally assessed for nucleotide data but
not for amino acid or codon data. One element is the model of substitution between particular bases; the other is the
relative rate of overall substitution among different sites in the sequence. Simple computational procedures have not
been developed for assessing more complex variables (for example, site- or lineage-specific substitution models).
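
As a worked illustration of a model of substitution between bases, the Python sketch below applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), to the observed proportion p of differing sites. Jukes-Cantor is chosen here only because it is the simplest nucleotide model (all substitutions equally likely, equal rates at all sites); the text does not prescribe it, and the sequences are hypothetical.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of compared sites that differ (gapped sites are skipped)."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -(3/4) ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
print(p)                          # 0.2
print(round(jukes_cantor(p), 4))  # 0.2326
```

The corrected distance is larger than the observed proportion because the model allows for multiple substitutions at the same site.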

5.4.3 Tree-Building Methods
Tree-building methods can be sorted into distance-based and character-based methods. Much of the discussion in
molecular phylogenetics dwells on the utility of distance and character-based methods (for example, Saitou, 1996; Li,
1997). Distance methods compute pairwise distances according to some measure and then discard the actual data,
using only the fixed distances to derive trees. Character-based methods derive trees that optimise the distribution
of the actual data patterns for each character. Pairwise distances are, therefore, not fixed, as they are determined
by the tree topology. The most commonly applied distance-based methods include neighbor-joining and the most
common character-based methods include maximum parsimony and maximum likelihood.
• Distance-based: Transform the data into pairwise distances (dissimilarities), and then use a matrix during tree
building.
• Character-based: Use the aligned characters, such as DNA or protein sequences, directly during tree inference
– based on substitutions.
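
The way a distance-based method works from the matrix alone can be sketched with the first step of neighbor-joining. The Python example below assumes the standard neighbor-joining selection criterion, Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of the distances from taxon i to all other taxa; the taxon names and distances are hypothetical, and only the choice of the first pair to join is shown, not the full algorithm.

```python
from itertools import combinations


def nj_first_pair(names, dist):
    """Return the pair of taxa that neighbor-joining would join first,
    together with the full table of Q values.

    `dist` maps frozenset({a, b}) to the pairwise distance between a and b.
    """
    n = len(names)
    # r[a]: sum of distances from taxon a to every other taxon.
    r = {a: sum(dist[frozenset((a, b))] for b in names if b != a)
         for a in names}
    q = {(a, b): (n - 2) * dist[frozenset((a, b))] - r[a] - r[b]
         for a, b in combinations(names, 2)}
    return min(q, key=q.get), q


# Hypothetical pairwise distances among five taxa.
names = ["a", "b", "c", "d", "e"]
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
dist = {frozenset(pair): d for pair, d in raw.items()}

pair, q = nj_first_pair(names, dist)
print(pair)  # ('a', 'b')
```

In this example the pair with the smallest raw distance is d and e, yet a and b are joined first, because the criterion also takes into account how far each taxon is from all the others.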

5.4.4 Tree Evaluation


Several procedures are available that evaluate the phylogenetic signal in the data and the robustness of trees (Swofford
et al., 1996; Li, 1997). The most popular of the former class are tests of data signal versus randomised data (skewness
and permutation tests). The latter class includes tests of tree support from resampling of observed data (nonparametric
bootstrap). The likelihood ratio test provides a means of evaluating both the substitution model and the tree.

Bootstrap
Bootstrapping is a resampling tree evaluation method that works with distance, parsimony, likelihood, and just about
any other tree derivation method. It was invented in 1979 (Efron, 1979) and introduced as a tree evaluation method
in phylogenetic analysis by Felsenstein (1985). The result of bootstrap analysis is typically a number associated
with a particular branch in the phylogenetic tree that gives the proportion of bootstrap replicates that supports the
monophyly of the clade.
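
The bootstrap value for a branch can be sketched as a simple proportion. In the Python example below, each replicate tree is represented, purely as an assumption for illustration, by the set of clades it contains, and each clade by the set of taxon names it groups; the four replicate trees are hypothetical.

```python
def bootstrap_support(clade, replicate_trees):
    """Proportion of replicate trees in which `clade` appears.

    Each tree is a set of clades; each clade is a frozenset of taxon names.
    """
    clade = frozenset(clade)
    hits = sum(1 for tree in replicate_trees if clade in tree)
    return hits / len(replicate_trees)


# Four hypothetical replicate trees for the taxa A, B, C and D.
replicates = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AC"), frozenset("ACD")},
    {frozenset("AB"), frozenset("CD")},
]
print(bootstrap_support("AB", replicates))  # 0.75
print(bootstrap_support("CD", replicates))  # 0.5
```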

Bootstrapping can be considered a two-step process comprising the generation of (many) new data sets from the
original set and the computation of a number that gives the proportion of times that a particular branch (for example, a
taxon) appeared in the tree. That number is commonly referred to as the bootstrap value. New data sets are created
from the original data set by sampling columns of characters at random from the original data set with replacement.
‘With replacement’ means that each site can be sampled again with the same probability as any of the other sites.
As a consequence, each of the newly created data sets has the same number of total positions as the original data
set, but some positions are duplicated or triplicated and others are missing. It is therefore possible that some of the
newly created data sets are completely identical to the original set—or, on the other extreme, that only one of the
sites is replicated, say, 500 times, whereas the remaining 499 positions in the original data set are dropped.
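
The resampling step can be sketched in Python as follows. The sketch draws column indices at random with replacement and rebuilds every sequence from the same draw, so each pseudo-replicate has the same number of positions as the original. The alignment, the function name and the fixed seed are assumptions made for the example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudo-replicate by sampling columns with replacement.

    Returns the resampled alignment and the list of sampled column indices.
    """
    n_sites = len(alignment[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in alignment], picks


alignment = ["ACGTAC", "ACGAAC", "ATGACC"]
rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
replicate, picks = bootstrap_alignment(alignment, rng)
print(picks)      # some column indices may repeat, others may be absent
print(replicate)
```

In a full analysis this step is repeated many times, a tree is built from each pseudo-replicate, and the trees are then compared.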

Although it has become common practice to include bootstrapping as part of a thorough phylogenetic analysis,
there is some discussion on what exactly is measured by this method. It was originally suggested that the bootstrap
value is a measure of repeatability (Felsenstein, 1985). In more recent interpretations, it has been considered to be
a measure of accuracy, a biologically more relevant parameter that gives the probability that the true phylogeny has
been recovered. On the basis of simulation studies, it has been suggested that, under favourable conditions (roughly
equal rates of change, symmetric branches), bootstrap values greater than 70% correspond to a probability of greater
than 95% that the true phylogeny has been found (Hillis and Bull, 1993). By the same token, under less favourable
conditions, bootstrap values greater than 50% will be overestimates of accuracy (Hillis and Bull, 1993). Simply put,
under certain conditions, high bootstrap values can make the wrong phylogeny look good; therefore, the conditions
of the analysis must be considered. Bootstrapping can be used in experiments in which trees are recomputed after
internal branches are deleted one at a time. The results provide information on branching orders that are ambiguous
in the full data set (cf. Leipe et al., 1994).

Summary
• Bootstrapping is a resampling tree evaluation method that works with distance, parsimony, likelihood, and just
about any other tree derivation method.
• Distance methods compute pairwise distances according to some measure and then discard the actual data, using
only the fixed distances to derive trees.
• Character-based methods derive trees that optimise the distribution of the actual data patterns for each
character.
• The substitution model should be given the same emphasis as alignment and tree building.
• Phylogenetic sequence data usually consist of multiple sequence alignments; the individual, aligned-base
positions are commonly referred to as ‘sites.’
• Phylogenetic tree-building methods presume particular evolutionary models.
• Phylogenetics is the study of evolutionary relationships.
• Phylogenetic analysis is the means of inferring or estimating these relationships. It is the process you use to
determine the evolutionary relationships between organisms.
• Clades are groups of organisms or genes that include the most recent common ancestor of all of its members
and all of the descendants of that most recent common ancestor.

References
• Baxevanis, A. D. & Ouellette, B. F., 2001. Bioinformatics: a practical guide to the analysis of genes and proteins,
John Wiley and Sons.
• Rastogi, S. C., Mendiratta, N. & Rastogi, P., 2006. Bioinformatics-Concepts, Skills, Applications, 2nd ed., PHI
Learning Pvt. Ltd., p.456.
• Brinkman, F. S. L., 2001. Phylogenetic Analysis [pdf] Available at: <http://www.bioon.com/book/biology/bioinformatics/chapter-14.pdf> [Accessed 28 February 2012].
• NCBI. Systematics and Molecular Phylogenetics [Online] Available at: <http://www.ncbi.nlm.nih.gov/About/primer/phylo.html> [Accessed 28 February 2012].
• Thermy33, 2011. Understanding Phylogenetic Trees [Video Online] Available at: <http://www.youtube.com/watch?v=xwuhmMIIspo> [Accessed 28 February 2012].
• UCBerkeley, 2010. Biology 1B - Lecture 24: Phylogenetics [Video Online] Available at: <http://www.youtube.com/watch?v=vrGfDPteKqU> [Accessed 28 February 2012].

Recommended Reading
• Livingstone & Barton, 1993. Protein Sequence Alignments: a Strategy for the Hierarchical Analysis of Residue
Conservation, Computer Applications in the Biosciences.
• Steel, M. A., 2003. Phylogenetics, Oxford University Press.
• Jogota, A., 2005. Computational Methods in Phylogenetic Analysis.

Self Assessment
1. ________is a resampling tree evaluation method that works with distance, parsimony, likelihood, and just about
any other tree derivation method.
a. Phylogenetics
b. Bootstrapping
c. Phylogenetic analysis
d. Cladistics

2. Which of the following statements is false?


a. Distance methods compute pairwise distances according to some measure and then discard the actual data,
using only the fixed distances to derive trees.
b. Distance-based methods derive trees that optimise the distribution of the actual data patterns for each
character.
c. The substitution model should be given the same emphasis as alignment and tree building.
d. Phylogenetic sequence data usually consist of multiple sequence alignments; the individual, aligned-base
positions are commonly referred to as ‘sites.’

3. Which of the following statements is false?


a. Phylogenetic tree-building methods presume particular evolutionary models.
b. Cladistics is the study of evolutionary relationships.
c. Phylogenetic analysis is the means of inferring or estimating these relationships.
d. Phylogenetic analysis is the process you use to determine the evolutionary relationships between
organisms.

4. Bootstrapping can be considered a ______process.


a. two-step
b. three-step
c. four-step
d. six-step

5. Which of these is not a step of phylogenetic analysis?


a. Alignment
b. Determining the substitution model
c. Tree building
d. Tree prediction

6. ______are homologs produced by speciation.


a. Orthologs
b. Paralogs
c. Xenologs
d. Analogs

7. _______are homologs produced by gene duplication.


a. Orthologs
b. Paralogs
c. Xenologs
d. Analogs

8. _______are homologs resulting from horizontal gene transfer between two organisms.
a. Orthologs
b. Paralogs
c. Xenologs
d. Analogs

9. Which of the following statements is false?


a. Phylogenetic tree-building methods presume particular evolutionary models.
b. Macromolecules, especially sequences, have surpassed morphological and other organismal characters as
the most popular form of data for phylogenetic or cladistic analysis.
c. Clade is derived from the Greek word ‘klados,’ meaning branch or twig.
d. A node is any named group of organisms but not necessarily a clade.

10. Which of the following statements is false?


a. A clade is a bifurcating branch point.
b. In some analyses, branch lengths correspond to divergence.
c. There is a bifurcating pattern of cladogenesis. This assumption is controversial.
d. Change in characteristics occurs in lineages over time.

Chapter VI
Microarray Technology: A Boon to Biological Sciences

Aim
The aim of this chapter is to:

• define microarray

• describe the microarray technique

• enumerate characteristics of microarrays

Objectives
The objectives of this chapter are to:

• define hybridisation

• describe potential of microarray analysis

• elucidate the microarray products

Learning outcome
At the end of this chapter, you will be able to:

• understand hybridisation technique

• enumerate the applications of microarrays

• know gene discovery with bioinformatics

6.1 Introduction to Microarray


Molecular biology research evolves through the development of the technologies used to carry it out. It is not
practical to study a large number of genes using traditional methods. DNA microarray is one such technology,
which enables researchers to investigate and address issues that were once thought to be non-traceable. One
can analyse the expression of many genes in a single reaction quickly and efficiently. DNA microarray
technology has empowered the scientific community to understand the fundamental aspects underlying the growth
and development of life, as well as to explore the genetic causes of anomalies occurring in the functioning of the
human body.

Although all of the cells in the human body contain identical genetic material, the same genes are not active in every
cell. Studying active genes and inactive genes in different cell types helps scientists to understand both how these cells
function normally and how they are affected when various genes do not perform properly. In the past, scientists have
only been able to conduct these genetic analyses on a few genes at once. With the development of DNA microarray
technology, however, scientists can now examine how active thousands of genes are at any given time.

All living organisms contain DNA, a molecule that encodes all the information required for the development and
functioning of an organism. Finding and deciphering the information encoded in DNA, and understanding how
such a simple molecule can give rise to the amazing biological diversity of life, is a goal shared in some way by
all life scientists. Microarrays provide an unprecedented view into the biology of DNA, and thus a rich way to
examine living systems. DNA is a physical molecule that is able to encode information in a linear structure. Cells
express information from different parts of this structure in a context-dependent fashion. DNA encodes for genes,
and regulatory elements control whether genes are on or off. For instance, all the cells of the human body contain
the same DNA, yet there are hundreds of different types of cells, each expressing a unique configuration of genes
from the DNA. In this regard, DNA could be described as existing in some number of states. Microarrays are a tool
used to read the states of DNA. Microarrays have had a transforming effect on the biological sciences. In the past,
biologists had to work very hard to generate small amounts of data that could be used to explore a hypothesis with
one observation at a time.

With the advent of microarrays, individual experiments generate thousands of data points or observations. This
turns the experiment from a hypothesis-driven endeavour to a hypothesis generating endeavour because every
experiment sheds light across an entire terrain of gene expression, letting relevant genes reveal themselves, often
in surprising ways.

The highly parallel nature of microarrays that are used to make biological observations signifies that most experiments
generate more information than the experimenter could possibly interpret. Indeed, from a statistical point of view,
every gene measured on a microarray is an independent variable in a highly parallel experiment. The number of
hypotheses to which the data may or may not lend support cannot be known in advance. To take advantage of the
excess information in microarray data, repositories have been set up in which people can deposit their experiments,
thus making them available to a wide community of researchers with questions to explore.

A typical microarray experiment involves the hybridisation of an mRNA molecule to the DNA template from which
it originated. Many DNA samples are used to construct an array. The amount of mRNA bound to each site on the
array indicates the expression level of the corresponding gene. The number of sites may run into thousands. All the data are collected
and a profile is generated for gene expression in the cell.

6.2 Microarray Technique


An array is an orderly arrangement of samples where matching of known and unknown DNA samples is done based
on base pairing rules. An array experiment makes use of common assay systems such as microplates or standard
blotting membranes. The sample spots are typically less than 200 microns in diameter, and an array usually contains thousands
of spots.

Thousands of spotted samples (DNA) known as probes (with known identity) are immobilised on a solid support
(microscope glass slides or silicon chips or nylon membrane). These are used to determine complementary binding of
the unknown sequences thus allowing parallel analysis for gene expression and gene discovery. An experiment with
a single DNA chip can provide information on thousands of genes simultaneously. An orderly arrangement of the
probes on the support is important as the location of each spot on the array is used for the identification of a gene.
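
Because the position of a spot identifies the gene, the array layout can be thought of as a lookup table from position to probe identity. The Python sketch below illustrates this idea only; the 2 x 2 layout, the gene names and the intensity values are hypothetical.

```python
def annotate_spots(layout, intensities):
    """Pair each measured intensity with the gene printed at that position.
    Spots with no entry in the layout are skipped."""
    return {layout[pos]: value
            for pos, value in intensities.items() if pos in layout}


# Hypothetical 2 x 2 array: (row, column) -> gene name of the probe.
layout = {(0, 0): "geneA", (0, 1): "geneB", (1, 0): "geneC", (1, 1): "geneD"}
# Hypothetical signal read from each spot after hybridisation.
intensities = {(0, 0): 1520.0, (0, 1): 87.5, (1, 0): 640.0, (1, 1): 12.0}

profile = annotate_spots(layout, intensities)
print(profile["geneA"])  # 1520.0
```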

The genome is an information scaffold


Microarrays measure events in the genome. An event may be the transcription of a gene, the binding of a protein to
a segment of the DNA, the presence or absence of a mutation, a change in the copy number of a locus, a change in
the methylation state of the DNA, or any of a number of states or activities that are associated with DNA or RNA
molecules. As a genomic readout, microarrays identify where these events occur. The idea that one can accurately
describe the genome, let alone measure its activity in a comprehensive way, is a relatively novel concept. Several
factors have led to the recent enhancement and blending of molecular biology into a field called genomics. The
first is genome-sequencing projects.

Today, sequencing a genome is considered a routine activity. However, in the late 1980s when sequencing the human
genome was first suggested as a serious endeavour, the community was divided. Given the sequencing technology
available at the time, the project looked as if it would consume colossal resources over a long time frame that many
thought could be put to better use on more practical projects. However, visionaries were banking on two precepts:
once given the mandate, the technology would transform itself and new sequencing methods would be invented
that would increase the rate of sequence accumulation. The second aspect is that the finished project, full genome
sequences, would be a public gold mine of a resource that would pay off for all biologists. Both of these assumptions
have come to fruition. Genome sequences accumulate at rates few imagined possible. Biologists can expect the
sequence of their model organism to exist in GenBank or to be in someone’s sequencing pipeline. More important,
having a map of the full genomic sequence of an organism has transformed the way biology is studied.

DNA gives rise to the organism and so is a scaffold for information. The genomic map is like a landscape of code,
openly visible to all and for anyone to figure out. Through experimentation, often involving microarrays, DNA
is annotated with functional information. In addition, the large-scale sequencing effort served as a kind of space
program for biology, whereby the genome was a new frontier. It made possible previously unforeseen possibilities
and conceptually paved the way for a host of parallel analysis methods. The unveiling of a unified map begged
the creation of microarrays, as well as other large genome-sized projects, such as the systematic deletion of every
yeast gene, the systematic fusion of every yeast promoter to a reporter gene and many other similar projects. As
the invention of the telescope changed how we view the universe, microarrays have changed the way we view the
genome.

Gene expression is detected by hybridisation


The purpose of a microarray is to examine expression of multiple genes simultaneously in response to some biological
perturbation. More generally, a microarray serves to interrogate the concentrations of molecules in a complex mixture
and thus, can serve as a powerful analytical tool for many kinds of experiments. To understand how this occurs, it
may be useful to review the structure of DNA and examine how the unique structure of this molecule plays a role
in identifying itself. Although DNA is remarkably informationally complex, the general structure of the molecule
is really quite simple.

DNA is made up of four chemical building blocks called bases: adenine (A), cytosine (C), guanine (G), and thymine
(T). Each base, together with its sugar and phosphate, forms a subunit referred to as a nucleotide. A strand of DNA consists of a
sugar phosphate backbone to which these bases are covalently linked such that they form a series. Because these
four bases can form sequences, it is possible to use them to encode information based on their patterns of occurrence.
Indeed, from an information point of view, DNA has a potential data density of 145 million bits per inch and has
been considered as a substrate for computation whereby the sequences are referred to as software.

Like strings of text in a book, the sequences that make up a strand of DNA have directionality such that information
can be encoded in a given direction. The amount of DNA, and thus the amount of sequence, varies from organism
to organism. For instance, the microorganism Escherichia coli has about 4.5 million bases of sequence, whereas human
cells have about 3 billion bases. Exactly how much biological information is encoded in these sequences is unknown,
representing one of the deepest mysteries of biology, but microarrays provide a way to gain clues. Cellular DNA
most often consists not just of one strand but of two strands anti-parallel to each other. The two strands are hydrogen
bonded together by interactions between the bases, forming a structure in the cell. The structure is helical, similar
to a spiral staircase in which the bases are attached to each side and interact in a plane to form the steps of the
staircase. Besides the hydrogen bonds between the bases of opposite strands, the overlapping and proximity of the
bases to each other lead to a second kind of non covalent force called a stacking interaction that contributes to the
stability of the double-stranded structure.

The bases of one strand interact with the bases of the other strand according to a set of pairing rules, such that A
pairs with T and C pairs with G. Thus, if one knows the sequence of one strand, by definition, then one knows the
sequence of the opposite strand. This property has profound consequences in the study of biology. It is also what
the cell uses to replicate itself. As the interaction between the bases is non covalent, consisting only of hydrogen
bonds, the strands can essentially be melted apart and separated, thus opening the way for a copying mechanism to
read each single strand and re-create the second complementary strand for each half of the pair, resulting in a new
double-stranded molecule for each cell. This is also the mechanism by which cells express genes. The strands are
opened by the gene expression machinery so that some number of RNA copies of a gene can be synthesised.

The RNA transcript has the same sequence as the gene with the exception that uracil (U) replaces T; though the
hybridisation pairing rules remain the same (U and T can both pair with A). This property of complementarity is
also what is used for measuring gene expression on microarrays.
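
The pairing rules can be expressed directly in code. The following Python sketch derives the opposite strand of a DNA sequence and the RNA transcript of a gene; the function names and the example sequence are assumptions made for illustration, and the gene is taken to be written as the strand whose sequence the transcript shares.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Sequence of the opposite (anti-parallel) strand, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(gene):
    """RNA transcript of a gene: the same sequence with U in place of T."""
    return gene.upper().replace("T", "U")


gene = "ATGGCCATT"
print(reverse_complement(gene))  # AATGGCCAT
print(transcribe(gene))          # AUGGCCAUU
```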

Just as energy can melt strands apart and separate them into single molecules, the process is reversible such that single
strands that are complementary to each other can come together and reanneal to form a double stranded complex.
This process is called hybridisation and is the basis for many assays or experiments in molecular biology. In the
cell, hybridisation is at the center of several biological processes, whereas in the lab complementarity is identity
and thus, hybridisation is at the center of many in vitro reactions and analytical techniques. The molecules can come
from completely different sources, but if they match, they will hybridise.
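
The idea that matching molecules will hybridise regardless of their source can be sketched as a string search. The Python example below treats a probe as hybridising when the target contains a stretch that is perfectly complementary to it. This is a deliberate simplification: real hybridisation depends on temperature, salt and sequence composition and can tolerate some mismatches. The probe and target sequences are hypothetical.

```python
# Pairing table: A-T, C-G, and U (in RNA) pairs with A.
_PAIRS = str.maketrans("ACGTU", "TGCAA")


def reverse_complement(seq):
    """DNA reverse complement; U in an RNA sequence is treated like T."""
    return seq.upper().translate(_PAIRS)[::-1]


def can_hybridise(probe, target):
    """True if `target` contains a stretch perfectly complementary to `probe`."""
    target_as_dna = target.upper().replace("U", "T")
    return reverse_complement(probe) in target_as_dna


probe = "AATGGCCAT"       # hypothetical DNA probe fixed on the array
mrna = "GGAUGGCCAUUCC"    # hypothetical transcript in the sample
print(can_hybridise(probe, mrna))        # True
print(can_hybridise("AAAAAAAAA", mrna))  # False
```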

6.3 Potential of Microarray Analysis


The academic research community stands to benefit from microarray technology just as much as the pharmaceutical
industry. The ability to use it in place of existing technology will allow researchers to perform experiments faster and
more cheaply, and will enable them to concentrate on analysing the results of microarray experiments rather than
simply performing the experiments. This research could then lead to a better understanding of the disease process,
which will require many different levels of research. While the field of expression has received most attention
so far, looking at the gene copy level and protein level is just as important. Microarray technology has potential
applications in each of these three levels.

Identifying drug targets provided the initial market for the microarrays. A good drug target has extraordinary value
for developing pharmaceuticals. By comparing the ways in which genes are expressed in a normal and diseased
heart, for example, scientists might be able to identify the genes and hence the associated proteins that are part of
the disease process. Researchers could then use that information to synthesise drugs that interact with these proteins,
thus reducing the disease’s effect on the body.

Gene sequences can be measured simultaneously and calculated instantly when an ordered set of DNA molecules
of known sequence (a microarray) is used. Consequently, scientists can evaluate an entire set of genes at once, rather
than looking at physiological changes one gene at a time. For example, Genetics Institute, a biotechnology company
in Cambridge, Massachusetts, built an array consisting of genes for cytokines, which are proteins that affect cell
physiology during the inflammatory response, among other effects. The full set of DNA molecules contained more
than 250 genes. While that number was not large by current standards of microarrays, it vastly outnumbered the
one or two genes examined in typical pre-microarray experiments. The Genetics Institute scientists used the array
to study how changes experienced by cells in the immune system during the inflammatory response are reflected in
the behavior of all 250 genes at the same time. This experiment established the potential for using the patterns of
response to help locate points in the body at which drugs could prove most effective.

6.4 Microarray Products


Within that basic technological foundation, microarray companies have created a variety of products and services.
They range in price, and involve several different technical approaches. A kit containing a simple array with limited
density can cost as little as $1,100, while a versatile system favoured by R&D laboratories in pharmaceutical and
biotechnology companies costs more than $200,000. The differences among products lie in the basic components
and the precise nature of the DNA on the arrays.

The type of molecule placed on the array units also varies according to circumstances. The most commonly used
molecule is complementary DNA (cDNA). Since each cDNA is derived from a distinct messenger RNA, each feature
represents an expressed gene.

6.5 Microarray: Identifying Interactions


To detect interactions at microarray features, scientists must label the test sample in such a way that an appropriate
instrument can recognise it. Since the minute size of microarray features limits the amount of material that can be
located at any feature, detection methods must be extremely sensitive.
Other than a few low-end systems that use radioactive or chemiluminescent tagging, most microarrays use fluorescent
tags as their means of identification. These labels can be delivered to the DNA units in several different ways.
While relatively simple, this approach has low sensitivity because it delivers only one unit of label per interaction.
Technologists can achieve more sensitivity by multiplexing the labelled entity, that is, by delivering more than one unit
of label per interaction.

6.6 Applications of Microarrays


Microarray technology will help researchers to learn more about many different diseases, including heart disease,
mental illness and infectious diseases, to name only a few. One intense area of microarray research at the National
Institutes of Health (NIH) is the study of cancer. In the past, scientists have classified different types of cancers
based on the organs in which the tumours develop. With the help of microarray technology, however, they will be
able to further classify these types of cancers based on the patterns of gene activity in the tumour cells. Researchers
will then be able to design treatment strategies targeted directly to each specific type of cancer. Additionally, by
examining the differences in gene activity between untreated and treated tumour cells - for example those that are
radiated or oxygen-starved - scientists will understand exactly how different therapies affect tumours and be able
to develop more effective treatments.

Gene discovery: DNA microarray technology helps to identify new genes and to learn about their functioning
and expression levels under different conditions.

Disease diagnosis: DNA Microarray technology helps researchers learn more about different diseases such as heart
diseases, mental illness, infectious disease and especially the study of cancer. Until recently, different types of cancer
have been classified on the basis of the organs in which the tumours develop. Now, with the evolution of microarray
technology, it will be possible for the researchers to further classify the types of cancer on the basis of the patterns
of gene activity in the tumour cells. This will tremendously help the pharmaceutical community to develop more
effective drugs as the treatment strategies will be targeted directly to the specific type of cancer.

Drug discovery: Microarray technology has an extensive application in pharmacogenomics. Pharmacogenomics is
the study of correlations between therapeutic responses to drugs and the genetic profiles of the patients. Comparative
analysis of the genes from a diseased and a normal cell will help the identification of the biochemical constitution of
the proteins synthesised by the diseased genes. The researchers can use this information to synthesise drugs that
combat these proteins and reduce their effect.

Toxicological research: Microarray technology provides a robust platform for research into the impact of toxins on
cells and into how these effects are passed on to the progeny. Toxicogenomics establishes correlations between responses to toxicants
and the changes in the genetic profiles of the cells exposed to such toxicants.

The characteristics of microarrays include:


• They allow simultaneous measurement of the expression of many genes.
• They reveal differential expression and changes in expression over time.
• A single microarray can test ~10k genes.
• Data are obtained faster than they can be processed.
• They can find genes that behave similarly.

Fig. 6.1 Gene expression data


(Source: http://www.science.co.il/enuka/essays/microarray-review.pdf)

Each spot represents the expression level of a gene in two different experiments. Yellow or red spots indicate that the
gene is expressed in one experiment. Green spots show that the gene is expressed at the same level in both experiments.
Each box represents one gene’s expression over time. Tracking one sample over a period of time shows how gene
expression changes over time; tracking two different samples under the same conditions shows the difference in their
gene expression.
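
The comparison of one spot across two experiments can be sketched in code. The example below is only an illustration: the log2 ratio, the 2-fold threshold, the function name `classify_spot` and the intensity values are all assumptions that do not come from the text.

```python
import math


def classify_spot(intensity_a, intensity_b, fold_threshold=2.0):
    """Compare one spot's intensity in experiment A and experiment B.

    Returns 'higher in A', 'higher in B' or 'similar', based on the log2
    ratio of the two intensities. The 2-fold threshold is illustrative.
    """
    if intensity_a <= 0 or intensity_b <= 0:
        raise ValueError("intensities must be positive")
    log_ratio = math.log2(intensity_a / intensity_b)
    cutoff = math.log2(fold_threshold)
    if log_ratio >= cutoff:
        return "higher in A"
    if log_ratio <= -cutoff:
        return "higher in B"
    return "similar"


# Hypothetical spot intensities: gene -> (experiment A, experiment B).
spots = {
    "gene1": (1200.0, 300.0),
    "gene2": (450.0, 500.0),
    "gene3": (80.0, 640.0),
}
calls = {gene: classify_spot(a, b) for gene, (a, b) in spots.items()}
for gene, call in calls.items():
    print(gene, call)
```

Working with the logarithm of the ratio treats a doubling and a halving symmetrically, which is why it is a common choice for this kind of comparison.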

Fig. 6.2 Gene expression over time
(Source: http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt)


Summary
• Molecular biology research evolves through the development of the technologies used to carry it out.
• DNA is a physical molecule that is able to encode information in a linear structure.
• DNA encodes for genes, and regulatory elements control whether genes are on or off.
• Microarrays are a tool used to read the states of DNA. Microarrays have had a transforming effect on the
biological sciences.
• A typical microarray experiment involves the hybridisation of an mRNA molecule to the DNA template from
which it originated.
• An array is an orderly arrangement of samples where matching of known and unknown DNA samples is done
based on base pairing rules.
• An array experiment makes use of common assay systems such as micro plates or standard blotting
membranes.
• Sample spots are typically less than 200 microns in diameter, and an array usually contains thousands of spots.
• Microarrays measure events in the genome.
• Several factors have led to the recent enhancement and blending of molecular biology into a field called
genomics.
• The purpose of a microarray is to examine expression of multiple genes simultaneously in response to some
biological perturbation.
• DNA is made up of four chemical building blocks called bases: adenine (A), cytosine (C), guanine (G) and
thymine (T).
• Cellular DNA most often consists not just of one strand but of two strands anti-parallel to each other.
• Identifying drug targets provided the initial market for the microarrays.
• Gene sequences can be measured simultaneously and calculated instantly when an ordered set of DNA molecules
of known sequence (a microarray) is used.
• Microarray technology will help researchers to learn more about many different diseases, including heart disease,
mental illness and infectious diseases, to name only a few.
• DNA microarray technology helps to identify new genes and to learn about their function and expression
levels under different conditions.
• Microarray technology has an extensive application in Pharmacogenomics. 
• Microarray technology provides a robust platform for the research of the impact of toxins on the cells and their
passing on to the progeny.

References
• Baxevanis, A. D. & Ouellette, B. F., 2001. Bioinformatics: a practical guide to the analysis of genes and proteins,
John Wiley and Sons.
• Rastogi, S. C., Mendiratta, N. & Rastogi, P., 2006. Bioinformatics-Concepts, Skills, Applications, 2nd ed., PHI
Learning Pvt. Ltd., p.456.
• Korol, A. B., 2001. Microarray cluster analysis and applications [pdf] Available at:
<http://www.science.co.il/enuka/essays/microarray-review.pdf> [Accessed 28 February 2012].
• Clustering [Online] Available at: <http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt>
[Accessed 28 February 2012].
• HaroonBBT, 2011. Microarray [Video Online] Available at:
<http://www.youtube.com/watch?v=wKcQZVeIK-k&feature=related> [Accessed 28 February 2012].
• wenl888, 2012. Easy to use microarray data analysis tool - No training needed: Goober [Video Online] Available
at: <http://www.youtube.com/watch?v=nSlhCaJKhjY> [Accessed 28 February 2012].

Recommended Reading
• Stekel, D., 2003. Microarray bioinformatics, Cambridge University Press, p.263.
• Borlak, 2005. Handbook of toxicogenomics:Strategies and applications, Wiley-VCH.
• Zelikovsky, A., 2008. Bioinformatics algorithms: techniques and applications, John Wiley & Sons.


Self Assessment
1. All living organisms contain_____, a molecule that encodes all the information required for the development
and functioning of an organism.
a. DNA
b. RNA
c. protein
d. nucleotide

2. _________is an orderly arrangement of samples where matching of known and unknown DNA samples is done
based on base pairing rules.
a. Microplate
b. Array
c. Blotting membrane
d. Probe

3. Thousands of spotted samples (DNA) known as __________(with known identity) are immobilised on a solid
support.
a. microplates
b. arrays
c. blotting membranes
d. probes

4. Which of these is not included as a solid support for microarray techniques?


a. Microscope glass slides
b. Silicon chips
c. Nylon membrane
d. Copper membrane

5. Microarrays measure events in the_________.


a. genome
b. proteome
c. DNA
d. RNA

6. The _______is like a landscape of code, openly visible to all and for anyone to figure out.
a. genomic map
b. proteome
c. microarray
d. probe

7. Through experimentation, often involving microarrays, DNA is ______with functional information.


a. annotated
b. marked
c. tagged
d. methylated

8. DNA is made up of four chemical building blocks called ___________.
a. bases
b. arrays
c. sugar phosphate backbones
d. bonds

9. Which of the following statements is false?


a. A good drug target has extraordinary value for developing pharmaceuticals.
b. Gene sequences can be measured simultaneously and calculated instantly when an ordered set of protein
molecules of known sequence (a microarray) is used.
c. Since the minute size of microarray features limits the amount of material that can be located at any feature,
detection methods must be extremely sensitive.
d. Microarray technology will help researchers to learn more about many different diseases, including heart
disease, mental illness and infectious diseases.

10. Which of the following statements is false?


a. DNA Microarray technology helps in the identification of new genes, know about their functioning and
expression levels under different conditions.
b. DNA Microarray technology helps researchers learn more about different diseases such as heart diseases,
mental illness, infectious disease and especially the study of cancer.
c. Different types of cancer have been classified on the basis of the organs in which the tumours develop.
d. Hybridisation is the study of correlations between therapeutic responses to drugs and the genetic profiles
of the patients.


Chapter VII
Bioinformatics in Drug Discovery: A Brief Overview

Aim
The aim of this chapter is to:

• define drug discovery

• explain the concept of electronic medical records

• describe the impact of bioinformatics in medical sciences

Objectives

The objectives of this chapter are to:

• define drug-likeness

• describe potential of pharmacogenomics

• elucidate the drug discovery process

Learning outcome
At the end of this chapter, you will be able to:

• understand application of bioinformatics in computer-aided drug design

• enumerate the benefits of CADD

• know bioinformatics tools

7.1 Introduction
In recent years, we have seen an explosion in the amount of biological information that is available. Various databases
are doubling in size every 15 months and we now have the complete genome sequences of more than 100 organisms.
It appears that the ability to generate vast quantities of data has surpassed the ability to use this data meaningfully.
The pharmaceutical industry has embraced genomics as a source of drug targets. It also recognises that the field of
bioinformatics is crucial for validating these potential drug targets and for determining which ones are the most
suitable for entering the drug development pipeline.

Recently, there has been a change in the way that medicines are being developed due to our increased understanding
of molecular biology. In the past, new synthetic organic molecules were tested in animals or in whole organ
preparations. This has been replaced with a molecular target approach in which in-vitro screening of compounds
against purified, recombinant proteins or genetically modified cell lines is carried out with a high throughput. This
change has come about as a consequence of better and ever improving knowledge of the molecular basis of disease.
All marketed drugs today target only about 500 gene products. The elucidation of the human genome, which has an
estimated 30,000 to 40,000 genes, presents immense new opportunities for drug discovery and simultaneously creates
a potential bottleneck regarding the choice of targets to support the drug discovery pipeline. The major advances in
genomics and sequencing mean that finding an attractive target is no longer a problem; finding the targets that
are most likely to succeed has become the challenge. The focus of bioinformatics in the drug discovery process has
therefore shifted from target identification to target validation.

A lot of factors need to be taken into account concerning a candidate target from a multitude of heterogeneous
resources. The types of information that one needs to gather about potential targets include nucleotide and protein
sequencing information, homologues, mapping information, function prediction, pathway information, disease
associations, variants, structural information, gene and protein expression data and species/taxonomic distribution
among others. Different bioinformatics tools can be used to gather this information. Accumulating this
information about potential targets in databases means that pharmaceutical companies can avoid spending
time, effort and expense on bench work for targets that will ultimately fail. The information that is
gathered helps to characterise the different targets into families and subfamilies. It also classifies the behaviour of
the different molecules in a biochemical and cellular context.

Decisions about which families provide the best potential targets are guided by a number of criteria. It is important
that the potential target has a suitable structure for interacting with drug molecules. Structural genomics helps to
prioritise the families in terms of their 3D structures. Sometimes we want to develop broad spectrum drugs that
are effective against a wide range of pathogenic species while at other times we want to develop narrow spectrum
drugs that are highly specific to a particular organism. Comparative genomics helps to find protein families that
are widely taxonomically dispersed and those that are unique to a particular organism. For example, when we want
to develop a broad spectrum antibiotic, we look for targets that are present in a large number of bacteria
yet have no close homologues in humans. This means that the antibiotic will be effective against many bacteria,
killing them while causing no harm to the human host. To determine the role our potential drug target plays in a
particular disease mechanism, we use DNA and protein chips. These chips can measure the amount of transcript or
protein expressed by a cell at different times or in different states (healthy versus diseased).
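
The broad spectrum antibiotic example above amounts to a set operation: keep the protein families shared by all the pathogens of interest and discard any that also occur in humans. The sketch below assumes the family assignments have already been made; the function name, organism names and family labels are hypothetical.

```python
def broad_spectrum_candidates(pathogen_families, human_families):
    """Return protein families found in every pathogen but not in human.

    pathogen_families maps an organism name to the set of protein
    families detected in its genome (hypothetical input format).
    """
    genomes = [set(families) for families in pathogen_families.values()]
    if not genomes:
        return set()
    shared = set.intersection(*genomes)   # present in all pathogens
    return shared - set(human_families)   # absent from the human host


# Hypothetical family assignments, for illustration only.
pathogens = {
    "bacterium_1": {"famA", "famB", "famC"},
    "bacterium_2": {"famA", "famB", "famD"},
    "bacterium_3": {"famA", "famB", "famC", "famE"},
}
human = {"famB", "famX"}

candidates = broad_spectrum_candidates(pathogens, human)
print(sorted(candidates))
```

A narrow spectrum search would instead keep families found in one organism and in no other.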

Clustering algorithms are used to organise this expression data into different biologically relevant clusters. We can
then compare the expression profiles from the diseased and healthy cells to help us understand the role our gene or
protein plays in a disease process. All of these computational tools can help to compose a detailed picture about a
protein family, its involvement in a disease process and its potential as a possible drug target.
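
As a minimal sketch of how expression profiles can be grouped, the example below links genes whose profiles have a Pearson correlation above a threshold and reports the connected groups, which is the same as single-linkage clustering cut at a fixed similarity. The choice of method, the 0.9 threshold and the gene profiles are assumptions for illustration; the text does not name a specific algorithm.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cluster_profiles(profiles, min_corr=0.9):
    """Group genes whose profiles are linked by correlation >= min_corr."""
    genes = sorted(profiles)
    cluster_of = {gene: index for index, gene in enumerate(genes)}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if pearson(profiles[g], profiles[h]) >= min_corr:
                old, new = cluster_of[h], cluster_of[g]
                if old != new:
                    # Merge the two groups by relabelling one of them.
                    for k in genes:
                        if cluster_of[k] == old:
                            cluster_of[k] = new
    clusters = {}
    for gene in genes:
        clusters.setdefault(cluster_of[gene], []).append(gene)
    return sorted(clusters.values())


# Hypothetical expression levels at four time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.5, 2.0],
}
clusters = cluster_profiles(profiles)
print(clusters)
```

Correlation is used rather than absolute difference so that genes with the same trend but different magnitudes fall into the same cluster.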

Following on from the genomics explosion and the huge increase in the number of potential drug targets, there has
been a move from the classical linear approach of drug discovery to a non linear and high throughput approach.
The field of bioinformatics has become a major part of the drug discovery pipeline playing a key role for validating
drug targets. By integrating data from many inter-related yet heterogeneous resources, bioinformatics can help in
our understanding of complex biological processes and help improve drug discovery.


7.2 Drug Discovery


Drug discovery is the process of discovering and designing drugs, which includes target identification, target
validation, lead identification, lead optimisation and introduction of the new drugs to the public. This process is
very important, involving analysing the causes of the diseases and finding ways to tackle them.

Bioinformatics, a term coined for the application of computer science to biology, is now emerging as a major
element in contemporary biology and biomedical research. There is a paradigm shift in biological research towards
the large-scale use of computers, software tools and computational models. Walter Gilbert, a renowned scientist,
described this shift in biology as follows:

‘The new paradigm, now emerging, is that all of the ‘genes’ will be known (in the sense of being resident in databases
available electronically), and that the starting point of a biological investigation will be theoretical. An individual
scientist will begin with a theoretical conjecture only then turning to experiment to follow or test that hypothesis.’

Bioinformatics deals with the exponential growth in biological data, which has led to the development of primary and
secondary databases of nucleic acid sequences, protein sequences and structures. Some of the well-known databases
include GenBank, SWISS-PROT, PDB, PIR, SCOP and CATH. These databases are publicly available and hosted
on various Internet servers across the world. Basic research and modelling is done
using these databases with the help of sequence analysis tools such as BLAST, FASTA and CLUSTALW, and the
modelled structures are visualised using tools such as WebLab, MOLMOL and RasMol.

Bioinformatics plays an important role for the integration of broad disciplines of biology to understand the complex
mechanisms of the cell. Bioinformatics also aids the way in which biomedical investigators use the information
in their testing. The complete process of data collection to analysis of the results of such tests may be categorised
under a separate area named ‘Clinical Informatics’.

7.3 Informatics and Medical Sciences


It is widely held that most doctors are averse to computers. To overcome this problem, one of the solutions
proposed, after a survey of 1500 doctors from different cities, is to introduce palmtops specially
tailored for physicians. These palmtops are small enough to fit into the pocket of a lab coat. They allow the
doctor to enter medical data in sequence while moving from ward to ward. This
addresses the basic need of any medical analysis, namely data capture and the creation of Electronic Medical Records (EMR),
which eventually develop into a database for reference and analysis.

The major advantage of Electronic Medical Records (EMR) is that the
information can be easily accessed and shared in comparison to traditional medical records. EMR also drastically
reduces the possibilities of introduction of errors due to frustration and other psychological disturbances during
the manual data entry process after collecting the necessary information on paper. It also helps to eliminate the
manual task of extracting data from charts or filling out specialised data sheets. The data required for a study can
be obtained directly from the electronic record, thus making research data collection for analysis, a by-product of
routine clinical record keeping. The record environment can help to assure compliance with a research protocol,
pointing out to a clinician when a patient is eligible for a study, or when the protocol for a study calls for a specific
management plan given the currently available data about that patient.

In the near future one can see a situation where the complete information on the patient can be accessed from the
EMR. This information can be of any type, ranging from drug trial data to the various tests performed on that
patient and the outcome of such experiments. The challenge in such cases will be to organise and integrate the
heterogeneity of the information into a comprehensive, knowledge based database from which an individual can
access the necessary portion of the record for any research analysis.

7.4 Bioinformatics and Medical Sciences
Bioinformatics has a profound impact on medical sciences. Biological databases are helping physicians to
diagnose diseases and develop strategies for therapy. Consider a situation where a patient with a genetic form
of haemophilia meets a physician. The physician is not sure about the symptoms of the disease, and the only
clue is that the patient’s family has suffered from haemophilia earlier. The physician could search the web for
information on the disease by checking the OMIM (Online Mendelian Inheritance in Man) resources available
at http://www.ncbi.nlm.nih.gov/omim/ which provide detailed information on genetic disorders. A focused search
for haemophilia would reveal multiple disorders, including Von Willebrand disease, and also provide the information
that the primary defect in this disorder is a low level of anti-haemophilic globulin (AHG; factor VIII). Further, a
search on ‘Factor VIII’ in the protein sequence database would return a match for human Factor VIII
with the complete cDNA and corresponding protein sequence. The gene is linked to its DNA sequence, protein
sequence and a set of references in the MEDLINE literature database. Following the MEDLINE link,
the original research article (where the association of factor VIII with haemophilia is discussed) is obtained.

By following the link to the protein sequence, detailed information is obtained from the SWISS-PROT database
and the Protein Information Resource (PIR). Information on the crystal structure can be obtained by following the
link to the Protein Data Bank (PDB) provided in the SWISS-PROT database. Following the link to the DNA sequence
in the genetic database GenBank, the nucleotide sequence of the gene is obtained along with records of gene
irregularities. Thus, the physician uses a number of databases to collect information about the disease, which helps
in diagnosing it and devising strategies for therapy.

Infectious diseases are now the world’s biggest killers of children and young adults. They account for more than 13
million deaths a year - one in two deaths in developing countries as stated by the WHO. Most deaths from infectious
diseases occur in developing countries. The cause for this has been attributed to the unavailability of efficient drugs
and if at all available, the high cost associated with those drugs. Development of cheap and efficient drugs for a
disease is one of the major problems faced by mankind. The solution to this problem could be from rational drug
design using Bioinformatics.

The focus of the pharmaceutical industry has shifted from the trial and error process of drug discovery to a rational,
structure based drug design. A successful and reliable drug design process could reduce the time and cost of developing
useful pharmacological agents. Computational methods are used for the prediction of drug-likeness, which is the
identification and elimination of candidate molecules that are unlikely to survive the later stages of discovery and
development. Drug-likeness could be predicted by genetic algorithm and neural network based approaches.
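
To make the idea of a drug-likeness filter concrete, the sketch below uses Lipinski's rule of five, a simple rule-based check, in place of the genetic algorithm and neural network approaches named above. It assumes the four descriptors have already been computed by some other tool; the compound names and descriptor values are hypothetical.

```python
def lipinski_violations(mol_weight, log_p, h_donors, h_acceptors):
    """List the rule-of-five criteria that a compound violates."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if log_p > 5:
        violations.append("logP > 5")
    if h_donors > 5:
        violations.append("more than 5 hydrogen-bond donors")
    if h_acceptors > 10:
        violations.append("more than 10 hydrogen-bond acceptors")
    return violations


def is_drug_like(mol_weight, log_p, h_donors, h_acceptors, max_violations=1):
    """Treat a compound as drug-like if it breaks at most one criterion."""
    found = lipinski_violations(mol_weight, log_p, h_donors, h_acceptors)
    return len(found) <= max_violations


# Hypothetical descriptors: (molecular weight, logP, H-donors, H-acceptors).
library = {
    "compound_1": (350.4, 2.1, 2, 5),
    "compound_2": (720.9, 6.3, 7, 12),
}
kept = [name for name, desc in library.items() if is_drug_like(*desc)]
print(kept)
```

A filter of this kind is cheap enough to run over a whole compound library before any more expensive modelling is attempted.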

People have been working on constructing efficient algorithms and better energy functions to predict protein structures
and the interaction of small molecules with them. The technical barrier to these approaches is that they are computation
intensive, and we do not have the computational power to handle such massive requirements. Realising the amount of
raw computational power needed for such problems, IBM recently announced a new $100 million exploratory
research initiative to build a supercomputer that is 500 times more powerful than the world’s fastest existing
computer and 2 million times faster than today’s fastest desktop PC. This new computer, nicknamed ‘Blue Gene’
by IBM researchers, will be capable of performing close to one petaflop (10^15 operations per second).

As stated earlier, from the pharmaceutical industry’s point of view, bioinformatics is the key to rational drug design.
It reduces the number of trials in the screening of drug compounds and helps in identifying potential drug targets for a
particular disease using high power computing workstations and software like Insight. The application
of bioinformatics to genome sequences has led to a new area in pharmacology, pharmacogenomics, which is the
study of the genetic basis for the differences between individuals in response to drugs. These differences are mainly
due to Single Nucleotide Polymorphisms (SNPs). In order to develop innovative and safe drugs, pharmacogenomics needs to
be integrated into the drug development process. Knowing the importance of SNPs, major pharmaceutical
companies have formed an international consortium, of which IBM is also a member, to produce a map of human SNPs
(which could aid pharmacogenomics). In future, drug design is going to rely on the variation in SNPs. In fact,
SNPs with combinatorial chemistry can speed up the process of drug discovery and may also result in identifying
a new set of target proteins that cross-react with drugs in the preliminary clinical trials.


Taking into account all the above-mentioned factors that go into developing effective drugs, there has
been a strong urge to start the Human Proteomics Initiative. This initiative aims at identifying the functions and
polymorphisms of all the proteins coded in the human genome and predicting their structures, or solving the structures of
these proteins where possible, so that they can be used as potential targets for developing drugs.

Need for Integration


Rapid advances in the field of computers coupled with increasing computer literacy among professionals favour
the implementation of computer applications in medical practice. Further, the availability of numerous databases
on the Internet has revolutionised the way in which a physician devises a strategy for treatment. Projects like the
Human Proteomics Initiative are classic examples of the necessity of integrating Bioinformatics (to predict
structures and functions of proteins), Medical Sciences (to identify proteins that are important in metabolic or other
disorders) and Pharmacology (drug discovery, to identify novel drugs against the predicted targets). Thus, it is apt
to conclude that all three areas must work in concert to achieve the ultimate goal of understanding the basis of
life processes and applying it for the betterment of human lives.

Fig. 7.1 Drug discovery process


(Source: http://www.vls3d.com/courses_talk/Villoutreix_intro_drug_design.pdf)

7.5 Bioinformatics in Computer-Aided Drug Design


Computer-Aided Drug Design (CADD) is a specialised discipline that uses computational methods to simulate drug-
receptor interactions. CADD methods are heavily dependent on bioinformatics tools, applications and databases.
As such, there is considerable overlap in CADD research and bioinformatics.

Bioinformatics hub
Bioinformatics can be thought of as a central hub that unites several disciplines and methodologies. On the support side
of the hub, Information Technology, Information Management, software applications, databases and computational
resources all provide the infrastructure for bioinformatics. On the scientific side of the hub, bioinformatic methods
are used extensively in molecular biology, genomics, proteomics, other emerging areas (that is metabolomics,
transcriptomics) and in CADD research.

[Figure: Bioinformatics at the centre of a hub linking Molecular Biology; Genomics/Proteomics/x-omics; CADD (Computer-Aided Drug Design); Information Technology/Information Management; Applications/Databases; and Computational Resources]

Fig. 7.2 Bioinformatics hub


(Source: http://www.b-eye-network.com/view/852)

There are several key areas where bioinformatics supports CADD research. 
• Virtual High-Throughput Screening (vHTS): Pharmaceutical companies are always searching for new leads
to develop into drug compounds. One search method is virtual high-throughput screening. In vHTS, protein
targets are screened against databases of small-molecule compounds to see which molecules bind strongly to
the target. If there is a ‘hit’ with a particular compound, it can be extracted from the database for further testing.
With today’s computational resources, several million compounds can be screened in a few days on sufficiently
large clustered computers. Pursuing a handful of promising leads for further development can save researchers
considerable time and expense. ZINC is a good example of a vHTS compound library.
• Sequence Analysis: In CADD research, one often knows the genetic sequence of multiple organisms or the
amino acid sequence of proteins from several species. It is very useful to determine how similar or dissimilar
the organisms are based on gene or protein sequences. With this information one can infer the evolutionary
relationships of the organisms, search for similar sequences in bioinformatic databases and find related species
to those under investigation. There are many bioinformatic sequence analysis tools that can be used to determine
the level of sequence similarity.    
• Homology Modeling: Another common challenge in CADD research is determining the 3-D structure of
proteins. Most drug targets are proteins, so it’s important to know their 3-D structure in detail. It’s estimated
that the human body has 500,000 to 1 million proteins. However, the 3-D structure is known for only a small
fraction of these. Homology modeling is one method used to predict 3-D structure. In homology modeling, the
amino acid sequence of a specific protein (target) is known, and the 3-D structures of proteins related to the target
(templates) are known. Bioinformatics software tools are then used to predict the 3-D structure of the target
based on the known 3-D structures of the templates.  MODELLER is a well-known tool in homology modeling,
and the SWISS-MODEL Repository is a database of protein structures created with homology modeling.
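
For the sequence analysis item in the list above, one simple measure of similarity is the percent identity of two sequences that have already been aligned. The sketch below is illustrative only: it assumes the alignment was produced by another tool, counts only columns without gaps, and uses made-up sequences. Other definitions of percent identity use a different denominator.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical aligned protein fragments (one gap, one mismatch).
seq_1 = "MKTAYIAK-QR"
seq_2 = "MKTAHIAKEQR"
identity = percent_identity(seq_1, seq_2)
print(f"{identity:.1f}% identical")
```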


Similarity Searches:  A common activity in biopharmaceutical companies is the search for drug analogues. Starting
with a promising drug molecule, one can search for chemical compounds with similar structure or properties to a
known compound. There are a variety of methods used in these searches, including sequence similarity, 2D and
3D shape similarity, substructure similarity, electrostatic similarity and others. A variety of bioinformatic tools and
search engines are available for this work.
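
One widely used measure for substructure-based similarity is the Tanimoto coefficient: the number of features two compounds share divided by the number of features present in either. The sketch below assumes each compound has already been reduced to a set of feature identifiers (a fingerprint); the identifiers, compound names and the 0.5 cut-off are hypothetical.

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient of two feature sets (shared / combined)."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_analogues(query, library, min_similarity=0.5):
    """Return (name, similarity) pairs above the cut-off, best first."""
    scored = [(name, tanimoto(query, feats)) for name, feats in library.items()]
    hits = [(name, score) for name, score in scored if score >= min_similarity]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical fingerprints: each integer stands for one substructure.
query = {1, 4, 7, 9, 12}
library = {
    "cmpd_A": {1, 4, 7, 9, 13},
    "cmpd_B": {2, 3, 5},
    "cmpd_C": {1, 4, 7, 9, 12, 15},
}
hits = rank_analogues(query, library)
for name, score in hits:
    print(name, round(score, 3))
```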

Drug Lead Optimisation: When a promising lead candidate has been found in a drug discovery program, the next
step (a very long and expensive step) is to optimise the structure and properties of the potential drug. This usually
involves a series of modifications to the primary structure (scaffold) and secondary structure (moieties) of the
compound. This process can be enhanced using software tools that explore related compounds (bioisosteres) to the
lead candidate. OpenEye’s WABE is one such tool. Lead optimisation tools such as WABE offer a rational approach
to drug design that can reduce the time and expense of searching for related compounds.

Physicochemical Modeling. Drug-receptor interactions occur on atomic scales. To form a deep understanding of
how and why drug compounds bind to protein targets, we must consider the biochemical and biophysical properties
of both the drug itself and its target at an atomic level. Swiss-PDB is an excellent tool for doing this. Swiss-PDB
can predict key physicochemical properties, such as hydrophobicity and polarity, that have a profound influence on
how drugs bind to proteins.

Drug Bioavailability and Bioactivity. Most drug candidates fail in Phase III clinical trials after many years of research
and millions of dollars have been spent on them, and most fail because of toxicity or problems with metabolism.
The key characteristics for drugs are Absorption, Distribution, Metabolism, Excretion, Toxicity (ADMET) and
efficacy, in other words bioavailability and bioactivity. Although these properties are usually measured in the lab,
they can also be predicted in advance with bioinformatics software.

Benefits of CADD
CADD methods and bioinformatics tools offer significant benefits for drug discovery programs.
Cost Savings: The Tufts Report suggests that the cost of drug discovery and development has reached $800
million for each drug successfully brought to market. Many biopharmaceutical companies now use computational
methods and bioinformatics tools to reduce this cost burden. Virtual screening, lead optimisation and predictions
of bioavailability and bioactivity can help guide experimental research. Only the most promising experimental
lines of inquiry need be followed, and experimental dead-ends can be avoided early based on the results of CADD
simulations.

Time-to-Market:  The predictive power of CADD can help drug research programs to choose only the most
promising drug candidates. By focusing drug research on specific lead candidates and avoiding potential ‘dead-end’
compounds, biopharmaceutical companies can get drugs to market more quickly. 

Insight: One of the non-quantifiable benefits of CADD and the use of bioinformatics tools is the deep insight that
researchers acquire about drug-receptor interactions. Molecular models of drug compounds can reveal intricate,
atomic scale binding properties that are difficult to envision in any other way. When we show researchers new
molecular models of their putative drug compounds, their protein targets and how the two bind together, they often
come up with new ideas on how to modify the drug compounds for improved fit. This is an intangible benefit that
can help design research programs.

CADD and bioinformatics together are a powerful combination in drug research and development. An important
challenge for us going forward is finding skilled, experienced people to manage all the bioinformatics tools available
to us.

7.6 Bioinformatics Tools


Designing new drugs with bioinformatics tools has opened a new area of research. Computational
techniques assist in searching for drug targets and in designing drugs in silico, but the process still takes considerable
time and money. To design a new drug, one needs to follow the path below.

Identify target disease
One needs to know all about the disease and existing or traditional remedies. It is also important to look at very similar
afflictions and their known treatments. Target identification alone is not sufficient in order to achieve a successful
treatment of a disease. A real drug needs to be developed. This drug must influence the target protein in such a way
that it does not interfere with normal metabolism. Bioinformatics methods have been developed to virtually screen
the target for compounds that bind and inhibit the protein.

Study interesting compounds


One needs to identify and study the lead compounds that have some activity against a disease. These may be only
marginally useful and may have severe side effects. These compounds provide a starting point for refinement of
the chemical structures.

Detect the molecular basis of disease


If it is known that a drug must bind to a particular spot on a particular protein or nucleotide then a drug can be
tailor made to bind at that site. This is often modeled computationally using any of several different techniques.
Traditionally, the primary way of determining what compounds would be tested computationally was provided by
the researchers’ understanding of molecular interactions. A second method is the brute force testing of large numbers
of compounds from a database of available structures.

Rational drug design techniques


These techniques attempt to build the researchers’ understanding of how to choose likely compounds into a
software package that is capable of modeling a very large number of compounds in an automated way. Many
different algorithms have been used for this type of testing, many of which were adapted from artificial intelligence
applications. The complexity of biological systems makes it very difficult to determine the structures of large
biomolecules. Ideally, an experimentally determined (X-ray or NMR) structure is desired, but biomolecules are very
difficult to crystallise.

Refinement of compounds
Once a number of lead compounds have been found, computational and laboratory techniques have been very
successful in refining the molecular structures to give greater drug activity and fewer side effects. This is done
both in the laboratory and computationally, by examining the molecular structures to determine which aspects are
responsible for the drug activity and which for the side effects.

Quantitative Structure Activity Relationships (QSAR)


Computational techniques can be used to identify the functional groups in a compound that should be modified to
refine the drug. QSAR consists of computing every possible number that can describe a molecule, then doing an
enormous curve fit to find out which aspects of the molecule correlate well with the drug activity or side effect
severity. This information can then be used to suggest new chemical modifications for synthesis and testing.
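
The correlation step described above can be sketched in a few lines of Python. This is only an illustration under
stated assumptions: the descriptor names (logP, mol_weight), the five compounds and their activity values are
invented for the example, and a real QSAR study would fit many descriptors at once rather than ranking them one at
a time by a simple Pearson correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_descriptors(descriptors, activity):
    """Sort descriptor names by |correlation| with activity, strongest first."""
    scores = {name: pearson(values, activity) for name, values in descriptors.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical descriptor values for five compounds (not measured data).
descriptors = {
    "logP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mol_weight": [310.0, 250.0, 400.0, 280.0, 330.0],
}
# Hypothetical activities, constructed here as 2 * logP + 1.
activity = [3.0, 5.0, 7.0, 9.0, 11.0]

for name, r in rank_descriptors(descriptors, activity):
    print(f"{name}: r = {r:+.2f}")
```

In this toy data set the descriptor that tracks activity most closely is listed first, which is the kind of hint a
chemist could use when choosing the next modification to synthesise.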

Solubility of molecule
Whether a molecule is water soluble or readily soluble in fatty tissue will affect which part of the body it becomes
concentrated in, so solubility needs to be checked. The ability to get a drug to the correct part of the body is an
important factor in its potency. Ideally, there is a continual exchange of information between the researchers doing
QSAR studies, synthesis and testing.

These techniques are frequently used and often very successful since they do not rely on knowing the biological
basis of the disease, which can be very difficult to determine.

Drug testing
Once a drug has been shown to be effective by an initial assay technique, much more testing must be done before it
can be given to human patients. Animal testing is the primary type of testing at this stage. Eventually, the compounds,
which are deemed suitable at this stage, are sent on to clinical trials. In the clinical trials, additional side effects may
be found and human dosages are determined.


Summary
• Clustering algorithms are used to organise this expression data into different biologically relevant clusters.
• Drug discovery is the process of discovering and designing drugs, which includes target identification, target
validation, lead identification, lead optimisation and introduction of the new drugs to the public.
• Bioinformatics deals with the exponential growth in biological data, which led to the development of primary
and secondary databases of nucleic acid sequences, protein sequences and structures.
• Bioinformatics plays an important role for the integration of broad disciplines of biology to understand the
complex mechanisms of the cell.
• The complete process of data collection to analysis of the results of such tests may be categorised under a
separate area named ‘Clinical Informatics’.
• EMR also drastically reduces the possibilities of introduction of errors due to frustration and other psychological
disturbances during the manual data entry process after collecting the necessary information on paper.
• Bioinformatics has a profound impact in medical sciences. The biological databases are helping physicians to
diagnose the disease and develop strategies for its therapy.
• Computational methods are used for the prediction of ‘drug-likeness’, which is the identification and elimination
of candidate molecules that are unlikely to survive the later stages of discovery and development.
• Pharmacogenomics is the study of genetic basis for the differences between individuals in response to drugs.
• Computer-Aided Drug Design (CADD) is a specialised discipline that uses computational methods to simulate
drug-receptor interactions.
• In CADD research, one often knows the genetic sequence of multiple organisms or the amino acid sequence
of proteins from several species.
• MODELLER is a well-known tool in homology modeling, and the SWISS-MODEL Repository is a database
of protein structures created with homology modeling.
• QSAR consists of computing every possible number that can describe a molecule, then doing an enormous curve
fit to find out which aspects of the molecule correlate well with the drug activity or side effect severity.

References
• Chorghade, M. S., 2006. Drug discovery and development, John Wiley and Sons.
• Rastogi, S. C., Mendiratta, N. & Rastogi, P., 2006. Bioinformatics-Concepts, Skills, Applications, 2nd ed., PHI
Learning Pvt. Ltd., p.456.
• Young, D. Computational Techniques in the Drug Design Process [Online] Available at:
<http://www.ccl.net/cca/documents/dyoung/topics-orig/drug.html> [Accessed 28 February 2012].
• Computer aided drug design and bioinformatics: A current tool for designing [Online] Available at:
<http://www.pharmainfo.net/reviews/computer-aided-drug-design-and-bioinformatics-current-tool-designing>
[Accessed 28 February 2012].
• Novartis, 2011. Drug discovery and development process [Video Online] Available at:
<http://www.youtube.com/watch?v=3Gl0gAcW8rw> [Accessed 28 February 2012].
• nicolazonta, 2008. User driven molecular modeling in drug design [Video Online] Available at:
<http://www.youtube.com/watch?v=hd2YaygJC-w&feature=related> [Accessed 28 February 2012].

Recommended Reading
• Larson, S., 2005. Bioinformatics and Drug Discovery, Humana Press, p.444.
• Borlak, 2005. Handbook of toxicogenomics: Strategies and applications, Wiley-VCH.
• Barnes, R., 2007. Bioinformatics for geneticists: A bioinformatics primer for the analysis of genetic data, John
Wiley & Sons.

Self Assessment
1. The focus of bioinformatics in the drug discovery process has shifted from target identification to________.
a. target validation
b. target evaluation
c. target prediction
d. target mapping

2. What helps to prioritise the families in terms of their 3D structures?


a. Bioinformatics
b. Pharmacogenomics
c. Structural genomics
d. Proteomics

3. Which of these is not a well-known database?


a. GenBank
b. SWISS-PROT
c. PDB
d. EMR

4. _________eventually develops into a database for reference and analysis.


a. GenBank
b. PIR
c. PDB
d. EMR

5. The biological databases are helping physicians to _____the disease and develop strategies for its therapy.
a. diagnose
b. treat
c. predict
d. clinically evaluate

6. Drug-likeness could be predicted by _________and neural network based approaches.


a. computational method
b. genetic algorithm
c. drug discovery
d. data capturing

7. Which of the following statements is true?


a. The full form of OMIM is Online Morey Inheritance of Man.
b. The full form of PIR is Protein Information Resource.
c. The full form of PDB is Proteome Data Bank.
d. The full form of EMR is Electronic Medical Resource.


8. Which of the following statements is false?


a. In the near future one can see a situation where the complete information on the patient can be accessed
from the EMR.
b. From the pharmaceutical industry point of view, bioinformatics is the key to rational drug design.
c. Proteomics is the study of genetic basis for the differences between individuals in response to drugs.
d. Computer-Aided Drug Design (CADD) is a specialised discipline that uses computational methods to
simulate drug-receptor interactions.

9. Which of the following statements is false?


a. Bioinformatic methods are used extensively in molecular biology, genomics, and proteomics and in CADD
research.
b. IT, Information management, software applications, databases and computational resources all provide the
infrastructure for bioinformatics.
c. Pharmaceutical companies are always searching for new leads to develop into drug compounds.
d. In sequence analysis, protein targets are screened against databases of small-molecule compounds to see
which molecules bind strongly to the target.

10. Which is a well-known tool in homology modeling?


a. MODELLER
b. SWISS-MODEL
c. OpenEye’s WABE
d. ZINC

Chapter VIII
Human Genome Project

Aim

The aim of this chapter is to:

• define genome

• enlist characteristics of HGP

• describe project goals of HGP

Objectives
The objectives of this chapter are to:

• explain genome sequenced in the public (HGP) and private projects

• elucidate about funding organisations for human genome sequencing

• describe DNA sequencing

Learning outcome
At the end of this chapter, you will be able to:

• compare draft sequence and finished sequence

• understand bioinformatic analysis

• know features of BLAST


8.1 Introduction
Bioinformatics is the branch of biology concerned with the acquisition, storage and analysis of the information found
in nucleic acid and protein sequence data. Computers and bioinformatics software are the tools of the trade.

When the Human Genome Project began in 1990, it was understood that to meet the project’s goals, the speed
of DNA sequencing would have to increase and the cost would have to come down. Over the life of the project
virtually every aspect of DNA sequencing was improved. It took the project approximately four years to sequence
its first one billion bases but just four months to sequence the second billion bases.

During the month of January, 2003, 1.5 billion bases were sequenced. As the speed of DNA sequencing increased,
the cost decreased from 10 dollars per base in 1990 to 10 cents per base at the conclusion of the project in April
2003. Although the Human Genome Project is officially over, improvements in DNA sequencing continue to be
made. Researchers are experimenting with new methods for sequencing DNA that have the potential to sequence a
human genome in just a matter of weeks for a few thousand dollars.

DNA sequencing performed on an industrial scale has produced a vast amount of data to analyse. In August 2005,
it was announced that the three largest public collections of DNA and RNA sequences together store one hundred
billion bases, representing over 165,000 different organisms. As sequence data began to pile up, the need for new
and better methods of sequence analysis was critical.

Genetic data represent a treasure trove for researchers and companies interested in how genes contribute to our
health and well being. Almost half of the genes identified by the Human Genome Project have no known function.
Researchers are using bioinformatics to identify genes, establish their functions, and develop gene-based strategies
for preventing, diagnosing, and treating disease.

8.2 Human Genome Project


Begun formally in 1990, the U.S. Human Genome Project was a 13-year effort coordinated by the U.S. Department
of Energy and the National Institutes of Health. The project originally was planned to last 15 years, but rapid
technological advances accelerated the completion date to 2003. The project goals were to:
• Identify all the approximately 20,000-25,000 genes in human DNA
• Determine the sequences of the 3 billion chemical base pairs that make up human DNA
• Store this information in databases
• Improve tools for data analysis
• Transfer related technologies to the private sector and
• Address the ethical, legal, and social issues (ELSI) that may arise from the project.

DNA sequencing, the process of determining the exact order of the 3 billion chemical building blocks called bases
(abbreviated as A, T, C, and G) that make up the DNA of the 24 different human chromosomes, was the greatest
technical challenge in the Human Genome Project. Achieving this goal has helped reveal the estimated 20,000-
25,000 human genes within our DNA as well as their controlling regions. The resulting DNA sequence maps are
being used by the scientists to explore human biology and other complex phenomena. To meet the Human Genome
Project sequencing goals by 2003 required continual improvements in sequencing speed, reliability and costs.

8.3 Genome Sequenced in the Public (HGP) and Private Projects


The human genome reference sequences do not represent any one person’s genome. Rather, they serve as a starting
point for broad comparisons across humanity. The knowledge obtained from the sequences applies to everyone
because all humans share the same basic set of genes and genomic regulatory regions that control the development
and maintenance of their biological structures and processes.

In the international public-sector Human Genome Project (HGP), researchers collected blood (female) or sperm
(male) samples from a large number of donors. Only a few samples were processed as DNA resources. Thus, donors’
identities were protected so neither they nor scientists could know whose DNA was sequenced. DNA clones from
many libraries were used in the overall project.

In the Celera Genomics private-sector project, DNA from a few different genomes was mixed and processed for
sequencing. DNA for these studies came from anonymous donors of European, African, American (North, Central,
South), and Asian ancestry. The lead scientist of Celera Genomics at that time, Craig Venter, has since acknowledged
that his DNA was among those sequenced.

8.4 Funding for Human Genome Sequencing


Human Genome Project research was funded at many laboratories across the U.S. by the Department of Energy
(DOE), the National Institutes of Health (NIH), or both. Other researchers at numerous colleges, universities, and
laboratories throughout the U.S. also received DOE and NIH funding for human genome research. At any given
time, the DOE Human Genome Project funded about 100 principal investigators. Many large and small private
U.S. companies also conduct genome research. At least 18 other countries have participated in the Human
Genome Project.

8.5 DNA Sequencing


A DNA sequencing reaction produces a sequence that is several hundred bases long. Gene sequences typically
run for thousands of bases. To study genes, scientists first assemble long DNA sequences from series of shorter
overlapping sequences.
• Chromosomes ranging in size from 50 million to 250 million bases must first be broken into much shorter
pieces.
• Each short piece is used as a template to generate a set of fragments that differ in length from each other by a
single base that will be identified in a later step.
• The fragments in a set are separated by a technique called gel electrophoresis. New fluorescent dyes allow
separation of all four fragments in a single lane on the gel.
• The final base at the end of each fragment is identified. This process recreates the original sequence of As, Ts,
Cs and Gs for each short piece generated in the first step.
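
The last two steps of the list above can be mimicked with a short Python sketch. It is a simplification made for
illustration only: the eight-base piece is hypothetical, sorting by length stands in for gel electrophoresis, and
the chemistry that actually generates and labels the fragments is ignored.

```python
import random


def make_fragment_set(piece):
    """Return every prefix of the piece, so fragment lengths differ by one base."""
    return [piece[:n] for n in range(1, len(piece) + 1)]


def read_sequence(fragments):
    """Order fragments by length (the role of the gel) and read each final base."""
    ordered = sorted(fragments, key=len)
    return "".join(fragment[-1] for fragment in ordered)


piece = "ATGCCGTA"                 # hypothetical short piece of DNA
fragments = make_fragment_set(piece)
random.shuffle(fragments)          # the fragments start out mixed together
print(read_sequence(fragments))    # the original order of bases is recovered
```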

After the bases are ‘read,’ computers are used to assemble the short sequences (in blocks of about 500 bases each,
called the read length) into long continuous stretches that are analysed for errors, gene-coding regions, and other
characteristics. Finished sequences are submitted to major public sequence databases, such as GenBank. Thus,
Human Genome Project sequence data are freely available to anyone around the world.
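
How overlapping reads can be joined into a longer stretch is sketched below. This is a toy greedy approach assumed
for illustration, not the assembly software used by the project: the three reads are hypothetical, the function
names are invented, and real assemblers must also cope with sequencing errors, repeats and both DNA strands.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps; leave the pieces separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical short reads that overlap end to end.
reads = ["CCGTAGGC", "ATGCCGTA", "AGGCTTAA"]
print(greedy_assemble(reads))
```

If some reads share no overlap of at least `min_len` bases, the function returns more than one piece, which loosely
corresponds to the gaps that remain in a draft sequence.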

Scientists enter their assembled sequences into genetic databases so that other scientists may use the data. Since the
sequences of the two DNA strands are complementary, it is only necessary to enter the sequence of one DNA strand
into a database. By selecting an appropriate computer program, scientists can use sequence data to look for genes,
get clues to gene functions, examine genetic variation, and explore evolutionary relationships. New bioinformatics
software is being developed while existing software is continually updated.
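
Because the two strands are complementary, the second strand can always be computed from the first. A minimal
Python sketch, assuming the sequence contains only the upper-case letters A, C, G and T:

```python
# Pairing rules: A pairs with T, and C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, written in the conventional 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


stored = "ATGCCGTA"                  # hypothetical database entry
print(reverse_complement(stored))    # the strand that was not stored
```

Applying the function twice returns the original sequence, which is why storing one strand loses no information.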

Difference between draft sequence and finished sequence


In generating the draft sequence (released in June 2000), scientists determined the order of base pairs in each
chromosomal area at least 4 to 5 times to ensure data accuracy and to help with reassembling DNA fragments in their
original order. This repeated sequencing is known as genome ‘depth of coverage.’ Draft sequence data are mostly
in the form of 10,000 base pair-sized fragments whose approximate chromosomal locations are known.


To generate a high-quality reference sequence (completed in April 2003) additional sequencing was done to close
gaps and allow for only a single error every 10,000 bases, the agreed-upon standard for the HGP. Investigators
believe a high-quality sequence is critical for recognising gene-regulatory components important in understanding
human biology and disorders such as heart disease, cancer, and diabetes. The finished version provides an estimated
8x to 9x coverage of each chromosome.
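
The coverage and accuracy figures above can be related with some simple arithmetic. The sketch below uses the
usual definition of average coverage (total bases read divided by genome size); the genome size, read length and
error standard come from the text, while the number of reads is an assumed value chosen only to give a round result.

```python
def depth_of_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size


def expected_errors(genome_size, bases_per_error=10_000):
    """Errors expected at a standard of one error per `bases_per_error` bases."""
    return genome_size // bases_per_error


GENOME_SIZE = 3_000_000_000   # about 3 billion bases
READ_LENGTH = 500             # about 500 bases per read
NUM_READS = 24_000_000        # assumed number of reads, for illustration

print(depth_of_coverage(NUM_READS, READ_LENGTH, GENOME_SIZE))   # average coverage
print(expected_errors(GENOME_SIZE))                             # errors at the finished standard
```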

Completely sequenced genomes


The small genomes of several viruses and bacteria and the much larger genomes of three higher organisms have
been completely sequenced; they are bakers’ or brewers’ yeast, the roundworm and the fruit fly. In October 2001,
the draft sequence of the pufferfish, the first vertebrate after the human, was completed; and scientists finished the
first genetic sequence of a plant, that of the weed Arabidopsis thaliana, in December 2000. Many more genome sequences
have been completed since then.

8.6 Bioinformatic Analysis: Finding Functions


One of the most important aspects of bioinformatics is identifying genes within a long DNA sequence. Once a nucleic
acid or amino acid sequence has been assembled, bioinformatic analysis can be used to determine if the sequence is
similar to that of a known gene. This is where sequences from model organisms are helpful. For example, suppose a
bioinformatic analysis finds a similar sequence from mouse that is associated with a gene that codes for a membrane
protein that regulates salt balance. It is then a good bet that the human sequence is also part of a gene that codes
for a membrane protein that regulates salt balance. Determining the similarity of two sequences is not as easy as you
might think. For example, it was recently reported that the genomes of humans and chimpanzees are 96% similar.
Consider the following two sequences:

Each sequence consists of 20 bases. There is just one base difference between them. Because the two sequences
match at 19 out of 20 bases, we can say that the two sequences are 95% similar. Now, consider the following two
DNA sequences:

Now, 16 out of 20 bases match. We can say that the two sequences are 80% the same. Careful inspection however
reveals another sort of similarity between Sequences 3 and 4. If we align the sequences as below, it is seen that the
two sequences differ by just a missing base in Sequence 4 (or an added base to Sequence 3).

The deletion (or insertion) of a single base is not equal to four base substitutions as suggested in this example. While
comparing sequences, we must be concerned not only with the quantity of the differences but the type as well.
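
Both ideas, counting matches and distinguishing the type of difference, can be sketched in Python. The 20-base
sequences below are hypothetical and are not the ones shown in the original figures; the gap character "-" marks a
missing base in an alignment that is assumed to have been made already.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def classify_differences(a, b, gap="-"):
    """Count (substitutions, gap positions) between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    substitutions = gaps = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if x == gap or y == gap:
            gaps += 1
        else:
            substitutions += 1
    return substitutions, gaps


reference = "GATTACAGATTACAGATTAC"      # hypothetical 20-base sequence
one_change = "GATTACAGATCACAGATTAC"     # one base substituted
one_missing = "GATTACAGAT-ACAGATTAC"    # one base deleted, shown aligned

print(percent_identity(reference, one_change))
print(classify_differences(reference, one_change))
print(classify_differences(reference, one_missing))
```

Reporting substitutions and gaps separately keeps a single deleted base from being mistaken for several
unrelated substitutions.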

Scientists have written computer programs that can be used to see if a particular DNA sequence is similar to any
others that are stored in a sequence database. One of the most popular programs is called BLAST (Basic Local
Alignment Search Tool). Using this program is somewhat like using a search engine on the Internet. The user
provides the program with a biological sequence (when using BLAST) or a subject (when using a search engine).
In each case, the program compares the input information to the information found in the database. The results are
given with the most closely matching items (or sequences) listed first, followed by items (or sequences) that match
less well.
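
The search-and-rank behaviour described above can be imitated with a toy Python example. This is not the BLAST
algorithm: counting shared four-letter words is a crude stand-in for its scoring, and the query, the database
entries and their names are all hypothetical.

```python
def kmers(seq, k):
    """Set of all words of length k found in the sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_hits(query, database, k=4):
    """Score each database entry by shared words and list the best matches first."""
    query_words = kmers(query, k)
    scored = [(name, len(query_words & kmers(seq, k))) for name, seq in database.items()]
    return sorted(scored, key=lambda hit: hit[1], reverse=True)


# Hypothetical miniature "database" of named sequences.
database = {
    "unrelated": "GGGGGGGGGG",
    "partial": "ATGCCAAA",
    "exact": "TTATGCCGTAGG",
}
query = "ATGCCGTA"

for name, score in rank_hits(query, database):
    print(name, score)
```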


Consider an example of a BLAST search. The input sequence that is being compared to others in the database is called
the query sequence. In our example, the query is the short human DNA sequence listed below.

Once the query sequence is submitted, the BLAST program compares it, one-at-a-time, to every sequence in its
database. Typically, the search results are displayed so that the query sequence is shown at the top and the matching
sequences are listed below it. The listed sequence ‘hits’ also may include links to relevant bibliographic information.
The results from this search are shown below.

Examining variation
The BLAST program compares a single input sequence, one at a time, to others in a sequence database. The results
can provide clues as to the identity and function of the input sequence. Sometimes you may want to compare a
number of different sequences, all at the same time to see where they are alike and where they are different. The
CLUSTAL program was developed to produce such multiple alignments. CLUSTAL gets its name because it deals
with clusters of sequences.

CLUSTAL alignments are sometimes used by scientists examining genetic variation within a population. For
example, once a gene has been associated with a disease, scientists can use CLUSTAL to examine how the gene
sequence varies among people with and without the disease. The example below shows a CLUSTAL alignment
of DNA sequences from a portion of the gene associated with cystic fibrosis. The person affected by the disease is
seen to be missing a three-base DNA sequence.
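
Once an alignment exists, finding where the sequences differ is a simple column-by-column scan. The Python sketch
below assumes the alignment has already been produced (for instance by CLUSTAL) and uses short hypothetical
sequences, not the actual cystic fibrosis gene sequence; the "-" characters mark the missing bases.

```python
def variable_columns(aligned):
    """Return the 0-based positions at which the aligned sequences are not all identical."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have the same length")
    return [i for i in range(length) if len({seq[i] for seq in aligned}) > 1]


# Hypothetical aligned sequences; the second is missing three bases.
unaffected = "ATCATCTTTGGT"
affected = "ATCAT---TGGT"

print(variable_columns([unaffected, affected]))
```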

Multiple sequence alignments are also useful to scientists investigating the evolutionary relationships among species.
For example, the CLUSTAL program can be used to align a series of related sequences from different species. Once
the program has produced the best alignment for the sequences, another program can calculate the evolutionary
relationships between them. These data can be used to construct a tree diagram showing the evolutionary relationships
for that sequence among the various species.
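
One simple way to turn an alignment into evolutionary distances is to count the proportion of positions at which
two sequences differ. The sketch below assumes that measure and uses made-up eight-base sequences labelled with
species names purely for illustration; tree-building programs use more refined distance estimates and then group
the closest sequences first.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


def pairwise_distances(aligned):
    """Distance for every pair of named sequences."""
    return {
        (n1, n2): p_distance(aligned[n1], aligned[n2])
        for n1, n2 in combinations(sorted(aligned), 2)
    }


# Hypothetical aligned sequences (toy data, not real genes).
aligned = {
    "human": "ATGCCGTA",
    "chimp": "ATGCCGTT",
    "mouse": "ATGACGAT",
}

distances = pairwise_distances(aligned)
closest = min(distances, key=distances.get)
print(distances)
print("closest pair:", closest)
```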

8.7 Insights Learned from the Human DNA Sequence


The human genome contains 3.2 billion chemical nucleotide base pairs (A, C, T, and G). The average gene consists
of 3,000 base pairs, but sizes vary greatly, with the largest known human gene being dystrophin at 2.4 million base
pairs. The total number of genes is estimated at 25,000, much lower than previous estimates of 80,000 to 140,000
that had been based on extrapolations from gene-rich areas as opposed to a composite of gene-rich and gene-poor
areas.


The human genome sequence is almost exactly the same (99.9%) in all people. Functions are unknown for more
than 50% of discovered genes. About 2% of the genome encodes instructions for the synthesis of proteins. Repeat
sequences that do not code for proteins make up at least 50% of the human genome.

Repeat sequences are thought to have no direct functions, but they shed light on chromosome structure and dynamics.
Over time, these repeats reshape the genome by rearranging it, thereby creating entirely new genes or modifying
and reshuffling existing genes. During the past 50 million years, a dramatic decrease seems to have occurred in the
rate of accumulation of repeats in the human genome.

The human genome’s gene-dense ‘urban centers’ are predominantly composed of the DNA building blocks G and
C. In contrast, the gene-poor ‘deserts’ are rich in the DNA building blocks A and T. GC- and AT-rich regions usually
can be seen through a microscope as light and dark bands on chromosomes.
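
GC-rich and AT-rich stretches can be located computationally by measuring the fraction of G and C bases in
successive windows along a sequence. A minimal Python sketch, using a hypothetical 16-base sequence and an
arbitrarily chosen window size:

```python
def gc_content(seq):
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window):
    """GC fraction of each consecutive, non-overlapping window."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


sequence = "GGCCGGCCAATTAATT"   # hypothetical: a GC-rich half, then an AT-rich half
print(gc_content(sequence))
print(windowed_gc(sequence, 8))
```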

Genes appear to be concentrated in random areas along the genome, with vast expanses of non-coding DNA between.
Particular gene sequences have been associated with numerous diseases and disorders, including breast cancer,
muscle disease, deafness, and blindness. Stretches of up to 30,000 C and G bases repeating over and over often
occur adjacent to gene-rich areas, forming a barrier between the genes and the ‘junk DNA.’

Comparison of human genome with other organisms


• Unlike the human’s seemingly random distribution of gene-rich areas, many other organisms’ genomes are more
uniform, with genes evenly spaced throughout.
• Humans have on average three times as many kinds of proteins as the fly or worm because of mRNA transcript
‘alternative splicing’ and chemical modifications to the proteins. This process can yield different protein products
from the same gene.
• Humans share most of the same protein families with worms, flies, and plants, but the number of gene family
members has expanded in humans, especially in proteins involved in development and immunity.
• The human genome has a much greater portion (50%) of repeat sequences than the mustard weed (11%), the
worm (7%), and the fly (3%).
• Over 40% of predicted human proteins share similarity with fruit-fly or worm proteins.
• Although humans appear to have stopped accumulating repeated DNA over 50 million years ago, there seems
to be no such decline in rodents. This may account for some of the fundamental differences between hominids
and rodents, although gene estimates are similar in these species. Scientists have proposed many theories to
explain evolutionary contrasts between humans and other organisms, including those of life span, litter sizes,
inbreeding, and genetic drift.
• Variations and mutations: Scientists have identified millions of locations where single-base DNA differences occur
in humans. This information promises to revolutionise the processes of finding DNA sequences associated with
some common diseases.

8.8 Future Challenges


The working-draft DNA sequence and the more polished 2003 version represent an enormous achievement,
akin in scientific importance, some say, to developing the periodic table of elements. And, as in
most major scientific advances, much work remains to realise the full potential of the accomplishment.
Early explorations of the human genome, now joined by projects on the genomes of several other organisms, are
generating data whose volume and complex analyses are unprecedented in biology. Genomic-scale technologies
will be needed to study and compare entire genomes, sets of expressed RNAs or proteins, gene families from a
large number of species, variation among individuals, and the classes of gene regulatory elements.

Deriving meaningful knowledge from DNA sequences will define biological research through the coming decades
and require the expertise and creativity of teams of biologists, chemists, engineers, and computational scientists,
among others. A sampling follows of some research challenges in genetics: what we still don’t know, even with the
full human DNA sequence in hand.

The draft sequence already is having an impact on finding genes associated with disease. One of the greatest impacts
of having the sequence may be in enabling an entirely new approach to biological research. In the past, researchers
studied one or a few genes at a time. With whole-genome sequences and new high-throughput technologies, they
can approach questions systematically and on a grand scale. They can study all the genes in a genome, for example,
or all the transcripts in a particular tissue or organ or tumour, or how tens of thousands of genes and proteins work
together in interconnected networks to orchestrate the chemistry of life.

Post-sequencing projects are well under way worldwide. These explorations will result in a profound, new, and more
comprehensive understanding of complex living systems, with applications to agriculture, human health, energy,
global climate change, and environmental remediation, among others.

The checklist for future research includes:


• Gene number, exact locations, and functions
• Gene regulation
• DNA sequence organisation
• Chromosomal structure and organisation
• Non-coding DNA types, amount, distribution, information content, and functions
• Coordination of gene expression, protein synthesis, and post-translational events
• Interaction of proteins in complex molecular machines
• Predicted vs experimentally determined gene function
• Evolutionary conservation among organisms
• Protein conservation (structure and function)
• Proteomes (total protein content and function) in organisms
• Correlation of SNPs (single-base DNA variations among individuals) with health and disease
• Disease-susceptibility prediction based on gene sequence variation
• Genes involved in complex traits and multigene diseases
• Complex systems biology, including microbial consortia useful for environmental restoration
• Developmental genetics, genomics


Summary
• Bioinformatics is the branch of biology concerned with the acquisition, storage and analysis of the information
found in nucleic acid and protein sequence data.
• Researchers are using bioinformatics to identify genes, establish their functions, and develop gene-based
strategies for preventing, diagnosing, and treating disease.
• Begun formally in 1990, the U.S. Human Genome Project was a 13-year effort coordinated by the U.S. Department
of Energy and the National Institutes of Health.
• DNA sequencing, the process of determining the exact order of the 3 billion chemical building blocks called
bases (abbreviated as A, T, C, and G) that make up the DNA of the 24 different human chromosomes, was the
greatest technical challenge in the Human Genome Project.
• In the Celera Genomics private-sector project, DNA from a few different genomes was mixed and processed
for sequencing.
• Human Genome Project research was funded at many laboratories across the U.S. by the Department of Energy
(DOE), the National Institutes of Health (NIH), or both.
• A DNA sequencing reaction produces a sequence that is several hundred bases long.
• Gene sequences typically run for thousands of bases.
• To study genes, scientists first assemble long DNA sequences from series of shorter overlapping sequences.
• Since the sequences of the two DNA strands are complementary, it is only necessary to enter the sequence of
one DNA strand into a database.
• By selecting an appropriate computer program, scientists can use sequence data to look for genes, get clues to
gene functions, examine genetic variation, and explore evolutionary relationships.
• Once a nucleic acid or amino acid sequence has been assembled, bioinformatic analysis can be used to determine
if the sequence is similar to that of a known gene.
• Scientists have written computer programs that can be used to see if a particular DNA sequence is similar to any
others that are stored in a sequence database. One of the most popular programs is called BLAST (Basic Local
Alignment Search Tool).
• The BLAST program compares a single input sequence, one at a time, to others in a sequence database.
• CLUSTAL alignments are sometimes used by scientists examining genetic variation within a population.
• Multiple sequence alignments are also useful to scientists investigating the evolutionary relationships among
species.
• The human genome contains 3.2 billion chemical nucleotide base pairs (A, C, T, and G).
• The human genome sequence is almost exactly the same (99.9%) in all people.
• Repeat sequences that do not code for proteins make up at least 50% of the human genome.
• The human genome’s gene-dense ‘urban centers’ are predominantly composed of the DNA building blocks G
and C.
• Stretches of up to 30,000 C and G bases repeating over and over often occur adjacent to gene-rich areas, forming
a barrier between the genes and the ‘junk DNA.’

References
• Toriello, J., 2003. The Human Genome Project, The Rosen Publishing Group.
• Cooper, G., 1994. The Human Genome Project: deciphering the blueprint of heredity, University Science
Books.
• NHGRI, Bioinformatics: Examining Variation [Online] Available at: <http://www.genome.gov/25020003>
[Accessed 28 February 2012].
• Insights Learned from the Human DNA Sequence [Online] Available at: <http://www.ornl.gov/sci/techresources/
Human_Genome/project/journals/insights.shtml> [Accessed 28 February 2012].

• bimaticsblog, 2009. Human genome sequencing-Animated tutorial [Video Online] Available at: <http://www.
youtube.com/watch?v=-gVh3z6MwdU> [Accessed 28 February 2012].
• norman466, 2011. Biology Project: Bioinformatics [Video Online] Available at: <http://www.youtube.com/
watch?v=saW1oEbboUM> [Accessed 28 February 2012].

Recommended Reading
• Ramsden, J., 2009. Bioinformatics: An introduction, 2nd ed., Springer.
• Polański, A. & Kimmel, M., 2007. Bioinformatics, Springer.
• Boon, K. The human genome project: What does decoding DNA mean for us?, Enslow Publishers.


Self Assessment
1. DNA sequencing is the process of determining the exact order of the 3 billion chemical building blocks called
_________.
a. bases
b. nucleotides
c. arrays
d. amino acids

2. A DNA sequencing reaction produces a sequence that is several ______ bases long.
a. hundred
b. thousand
c. million
d. billion

3. Which type of data is mostly in the form of 10,000 base pair-sized fragments whose approximate chromosomal
locations are known?
a. High-quality reference sequence data
b. Finished sequence data
c. Draft sequence data
d. Analytical data

4. One of the most important aspects of bioinformatics is ________.
a. identifying genes within a long DNA sequence
b. finding similar sequence
c. assembling sequence
d. sequencing DNA

5. Which of the following statements is false?
a. Once a nucleic acid or amino acid sequence has been assembled, bioinformatic analysis can be used to
determine if the sequence is similar to that of a known gene.
b. Scientists have written computer programs that can be used to see if a particular DNA sequence is similar
to any others that are stored in a sequence database.
c. Once the query sequence is submitted, the BLAST program compares it, one-at-a-time, to every sequence
in its database.
d. The CLUSTAL program compares a single input sequence, one at a time, to others in a sequence
database.

6. Which of the following statements is false?
a. CLUSTAL alignments are sometimes used by scientists examining genetic variation within a population.
b. The BLAST program was developed to produce such multiple alignments.
c. CLUSTAL gets its name because it deals with clusters of sequences.
d. The scientists can use CLUSTAL to examine how the gene sequence varies among people with and without
the disease.

7. The human genome contains 3.2 billion chemical__________.
a. nucleotide base pairs
b. amino acids
c. nucleotides
d. genes

8. About 2% of the genome encodes instructions for the synthesis of _________.
a. proteins
b. amino acids
c. nucleotides
d. DNA

9. Repeat sequences that do not code for proteins make up at least ______ of the human genome.
a. 10%
b. 50%
c. 100%
d. 80%

10. Stretches of up to 30,000 C and G bases repeating over and over often occur adjacent to gene-rich areas, forming
a barrier between the genes and the ________.
a. junk DNA
b. urban centers
c. gene-dense
d. rich DNA

Application I
Limitations of Current Multicore Architectures for Bioinformatics Applications – A Case Study on Cell BE

Abstract
The fast growth of biological databases has attracted the attention of computer scientists calling for greater efforts
to improve computational performance. From a computer architecture point of view, we intend to investigate how
bioinformatics applications can benefit from future multicore processors. Here, we present a preliminary study of
the Cell BE limitations when executing a representative bioinformatics application performing multiple sequence
alignment (namely ClustalW). The inherent large parallelism of the core algorithm used makes it ideal for architectures
supporting multiple dimensions of parallelism. However, in the case of Cell BE we identified several architectural
limitations that need a careful study.

Introduction
Bioinformatics is a rapidly growing field that requires High Performance Computing (HPC) systems in order to cope
with the fast increase of biological databases. One of the most important tasks in bioinformatics is the alignment
of biological sequences (DNA, proteins, RNA). Popular alignment algorithms like Needleman-Wunsch (NW) use
dynamic programming techniques and are in most cases extremely computationally intensive. ClustalW is a widely
used application that features NW as its main hot-spot kernel. The inherent multi-dimensional parallelism present
in this type of application makes it ideal to be mapped on a multicore platform where both thread-level and
data-level parallelism can be exploited. We have used the Cell BE processor as an example of a modern multicore
processor.

With this research, we aim at identifying the architectural and micro-architectural limitations that Cell BE exhibits
when targeting a representative multiple sequence alignment application such as ClustalW. We present different
optimisation and parallelisation strategies and analyse the factors that limit the performance. Recent publications
have mapped bioinformatics applications on Cell BE with a focus on software optimisation. Our work aims at
identifying limitations of current architectures in order to guide the design of future multicore systems.

ClustalW Implementation on Cell BE
ClustalW performs the multiple alignment of a set of sequences in three main steps:
• all-to-all pairwise alignment
• creation of a phylogenetic tree
• final multiple alignment

Profiling experiments reveal that the core function of the first step (forward_pass) consumes about
70% of the total execution time. This function is called n(n-1)/2 times, once for each pair among the n input
sequences, to calculate a similarity score between two sequences, implementing a modified version of NW. Not only
does the independence among calls make parallelisation appealing, but vectorisation of the inner loop is also
possible. We ported the forward_pass function to the SPU ISA and implemented a number of optimisations. DMA
transfers are used to exchange data between main memory and the SPUs' local store (LS). Saturated addition and
maximum instructions, which are not present in the SPUs, were emulated with 9 and 2 SPU instructions respectively.
The first optimisation uses 16-bit vector elements instead of 32-bit ones, allowing a theoretical doubling of
throughput but requiring the implementation of an overflow check in software. The inner loop contains
random scalar memory accesses and a complex branch for checking boundary conditions. These types of operations
are very inefficient in the SPUs. We have unrolled this loop and manually evaluated the boundary conditions outside
the inner loop. In the multi-SPU versions, the PPU distributes pairs of sequences for each SPU to be processed
independently. Such a version was first implemented using a simple round-robin strategy but the load distribution
was not efficient. A second strategy uses a table of flags that SPUs can raise to indicate idleness. This way the PPU
can make better decisions on where to allocate the tasks.
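
The pairwise stage described above can be sketched in Python. This is a plain, unoptimised Needleman-Wunsch score with a linear gap penalty, not ClustalW's modified forward_pass; the scoring values, function names and example sequences are assumptions chosen for illustration.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming, keeping one row."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two residues, gap in b, gap in a.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def all_pairs_scores(seqs):
    """Score every unordered pair: n(n-1)/2 independent kernel calls."""
    return {(i, j): nw_score(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}

sequences = ["GATTACA", "GCATGCU", "GATTTACA", "ACGT"]
scores = all_pairs_scores(sequences)
print(len(scores))  # 6 pairs for n = 4
```

Each entry of the dictionary depends only on its own two sequences, which is the independence that makes distributing the calls across SPUs attractive.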

Results and Analysis


Fig. 1 shows a comparison of ClustalW running on various single-core platforms as compared to different versions
using a single SPU. Since the clock frequency of the G5 is less than half that of the Cell, it is clear that, in
terms of cycles, the G5 outperforms any Cell 1-SPU version. The G5 platform contains a powerful out-of-order superscalar
PowerPC970 that runs scalar code very efficiently while the PPU has more limited capabilities (fewer functional
units and registers, in-order execution, and so on). The fourth bar shows the straightforward SPU implementation
of ClustalW without optimisations. The fifth bar shows a significant speedup (1.7×) when using the 16-bit data type.
This doubled vector parallelism is achievable most of the time, but the program should always check for overflow
and go back to the 32-bit version if needed. Since the SPUs do not provide support for overflow checks (unlike the
PPU), this had to be implemented in software, which consequently affects the performance. The next two bars show
results for unrolling a small loop located within the inner loop of the kernel, allowing us to achieve a cumulative
2.6× speedup. The last two versions removed boundary conditions involving a scalar branch and handled them
explicitly outside the loop. This final (cumulative) optimisation provided about 4.2× speedup with respect to the
initial version.

Fig. 2 shows the scalability of ClustalW kernel when using multiple SPUs. The black part of the bars reveals a
perfect scalability (8× for 8 SPUs). This is due to the relatively low amount of data transferred and the independence
between every instance of the kernel. In future experiments, it will be interesting to see how far this perfect scalability
will continue.
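
The two work-distribution strategies used in the multi-SPU versions can be compared with a small simulation. It is a sketch under stated assumptions: task cost is taken to be proportional to the product of the two sequence lengths, workers are identical, and the idle-flag strategy is modelled as always handing the next task to the worker that becomes free first. None of the names or numbers below come from the study.

```python
import heapq
from itertools import combinations

def pair_costs(lengths):
    """Assumed cost model: DP work grows with len(a) * len(b)."""
    return [a * b for a, b in combinations(lengths, 2)]

def round_robin_makespan(costs, workers):
    """Task k always goes to worker k mod workers, regardless of load."""
    load = [0] * workers
    for k, cost in enumerate(costs):
        load[k % workers] += cost
    return max(load)

def idle_first_makespan(costs, workers):
    """Each task goes to whichever worker becomes idle first."""
    finish_times = [0] * workers
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)

lengths = [50, 400, 60, 350, 80, 300, 90, 500]  # hypothetical sequence lengths
costs = pair_costs(lengths)
print(round_robin_makespan(costs, 4), idle_first_makespan(costs, 4))
```

When task costs are uneven, a fixed round-robin assignment can leave some workers waiting while others are overloaded, which is the inefficiency the flag table was introduced to address. Neither policy is guaranteed to be optimal.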

After the successful reduction of the execution time for forward pass, significant application speedups are only
possible by accelerating other parts of the program. The progressive alignment phase is now the portion consuming
most of the time. This issue is currently being studied.
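
Why the rest of the program now dominates follows from Amdahl's law. The sketch below applies it to the roughly 70% kernel share quoted earlier; treating that share as exactly 0.70 and the remainder as not accelerated at all are simplifying assumptions, so the printed values are illustrative rather than measured.

```python
def overall_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: only the kernel part of the run time gets faster."""
    remaining = (1.0 - kernel_fraction) + kernel_fraction / kernel_speedup
    return 1.0 / remaining

def speedup_limit(kernel_fraction):
    """Upper bound reached when the kernel takes no time at all."""
    return 1.0 / (1.0 - kernel_fraction)

for s in (2, 8, 32, 1000):
    print(s, round(overall_speedup(0.70, s), 2))
print("limit", round(speedup_limit(0.70), 2))
```

Under this model, a program whose kernel accounts for 70% of the time cannot run more than about 3.3 times faster overall, however much the kernel is accelerated.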

Fig. 1. ClustalW performance for different platforms and optimisations

Fig. 2. ClustalW speedup using multiple SPUs

The following is a list of the limitations we have found in the experiments:

Unaligned data accesses: The lack of hardware support for unaligned data accesses is one of the issues that can
limit the performance the most. When the application needs to do unaligned loads or stores, the compiler must
introduce extra code that contains additional memory accesses plus instructions for data reorganisation. In ClustalW,
this situation appears in critical parts of the code and then performance is affected.

Scalar operations: Given the SIMD-only nature of the SPU ISA and the lack of unaligned access support, scalar
instructions may cause performance degradation too. Since there are only vector instructions, scalar operations must
be performed employing vectors with only one useful element. Apart from power inefficiency issues, this works
well only if the scalars are in the appropriate position within the vector. If not, the compiler has to introduce some
extra instructions to align the scalar operands and perform the instruction. This limitation is responsible for
a significant efficiency reduction.

Saturated arithmetic: These frequently executed operations are present in Altivec but not in the SPU ISA. They
are used to compute partial scores while preventing them from wrapping around to zero when overflow occurs with
unsigned addition. This limitation may become expensive depending on the data types. For signed short, 9 additional
SPU instructions are required.
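
What a saturating addition has to do can be shown in scalar Python. This illustrates the arithmetic only; it is not the 9-instruction SPU sequence from the study, and the function names are hypothetical.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def wrapping_add_i16(a, b):
    """Two's-complement 16-bit addition: the result wraps on overflow."""
    total = (a + b) & 0xFFFF
    return total - 0x10000 if total > INT16_MAX else total

def saturating_add_i16(a, b):
    """Clamp the result to the signed 16-bit range instead of wrapping."""
    return max(INT16_MIN, min(INT16_MAX, a + b))

print(wrapping_add_i16(32000, 1000))    # -32536: the score wraps around
print(saturating_add_i16(32000, 1000))  # 32767: the score is clamped
```

A wrapped score would look like a very poor alignment, whereas a clamped one at least stays at the top of the representable range.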

Max instruction: One of the most important and frequent operations in this application is the computation of a
maximum between two or more values. Since the SPU ISA does not provide such an instruction, it is necessary to
use two SPU instructions.
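
A maximum can be built from a comparison followed by a bitwise select, which is a common way to emulate it in two steps on SIMD hardware. The sketch below mimics that pattern on Python lists standing in for vector registers of unsigned 16-bit lanes. That this mirrors the two SPU instructions mentioned above is an assumption, and the helper names are hypothetical.

```python
MASK16 = 0xFFFF

def compare_gt(a, b):
    """Per-lane mask: all ones where a > b, all zeros elsewhere."""
    return [MASK16 if x > y else 0 for x, y in zip(a, b)]

def select_bits(a, b, mask):
    """Per-lane bitwise select: bits of b where mask is 1, else bits of a."""
    return [(x & ~m & MASK16) | (y & m) for x, y, m in zip(a, b, mask)]

def vector_max(a, b):
    """max(a, b) per lane, built from one compare and one select."""
    return select_bits(a, b, compare_gt(b, a))

print(vector_max([3, 9, 0, 7], [5, 2, 0, 8]))  # [5, 9, 0, 8]
```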

Overflow flag: This flag is not available in the SPUs and has to be implemented in software, adding overhead.

Branch prediction: The SPUs do not handle branches efficiently and the penalty of a mispredicted branch is about
18 cycles. The kernel of ClustalW has several branches that, when mispredicted, reduce the application execution
speed.

Conclusions and future work


We have described the mapping and some optimisation alternatives of a representative bioinformatics application
targeting Cell BE. Our study revealed various architectural aspects that negatively impact Cell BE performance for
bioinformatics workloads. More precisely, the missing HW support for unaligned memory accesses and the lack of
saturating arithmetic instructions appear to be the most critical. We are performing additional experiments in order
to have a quantitative measure of all these aspects.

Additionally, we intend to explore solutions to the issues found and validate them with simulations using UNISIM.
We are using this research as guidance for the architecture design of future multicore systems incorporating domain
specific accelerators for bioinformatics. We intend to widen our study to other applications of the same field. We
believe that heterogeneous multicore architectures able to exploit specialisation and multiple dimensions of parallelism
will bring great performance improvements for bioinformatics workloads.

(Source: Isaza, S. & Gaydadjiev, G., Limitations of Current Multicore Architectures for Bioinformatics Applications
– A Case Study on Cell BE, Computer Engineering Lab, Delft University of Technology, The Netherlands [pdf]
Available at: http://ce.et.tudelft.nl/publicationfiles/1783_742_acaces_abstract_word.pdf)

Questions
1. Enumerate the Cell BE limitations when executing a representative bioinformatics application performing
multiple sequence alignment.
Answer: The limitations of Cell BE include:
• Unaligned data accesses
• Scalar operations
• Saturated arithmetic
• Max instruction
• Overflow flag
• Branch prediction

2. What was the conclusion drawn in the preliminary study of the Cell BE?
Answer: We have described the mapping and some optimisation alternatives of a representative bioinformatics
application targeting Cell BE. Our study revealed various architectural aspects that negatively impact Cell BE
performance for bioinformatics workloads. More precisely, the missing HW support for unaligned memory
accesses and the lack of saturating arithmetic instructions appear to be the most critical. We are performing
additional experiments in order to have a quantitative measure of all these aspects.

3. What are the future works that need to be carried out to improve the performance for bioinformatics
workloads?
Answer: We intend to explore solutions to the issues found and validate them with simulations
using UNISIM. We are using this research as guidance for the architecture design of future multicore systems
incorporating domain specific accelerators for bioinformatics. We intend to widen our study to other applications
of the same field. We believe that heterogeneous multicore architectures able to exploit specialisation and multiple
dimensions of parallelism will bring great performance improvements for bioinformatics workloads.

Application II

DNA Sequencing and Personal Genomics Case Study for Intro Biology


The rapid advances in DNA sequencing technology are beginning to affect how human diseases are diagnosed, and
will soon affect significant numbers of people in the developed world. Because this technology will fundamentally
alter many fields of biological research, students even in freshman biology courses should become aware of the
technology and its potential impact. I think stories of children with mystery diseases, who are diagnosed by genome
sequencing and successfully treated as a result, will make a compelling learning experience and lead students to
questions that address most aspects of genomics appropriate for a college-level introductory biology course.

Isn’t sequencing a human genome prohibitively expensive and time consuming?


The graph below from the National Human Genome Research Institute shows that the cost of DNA sequencing
has plummeted in recent years. The $1,000 human genome sequence is within sight. The figure below shows the
advances in reducing the cost of sequencing a million bases of DNA, compared with Moore’s Law for advances in
computing power.

The rapid decline in the cost of sequencing resulted from the advent of next-generation sequencing platforms such as
Roche 454 (a YouTube playlist covering a number of different sequencing technologies is available at
http://www.youtube.com/view_play_list?p=1B2FEA81FFAD1748).

The development of massively parallel high-throughput sequencing technologies, coupled with single-molecule
sequencing (the so-called third generation), in a highly competitive marketplace, continues to lower the cost of
obtaining a whole human genome sequence. Illumina announced in a June 8, 2011 press release a huge price drop for
a personal whole genome sequence, from $19,500 to $9,500 (Illumina 6/08/2011), along with release of a personal
genome browser app for the iPad.

What does this mean for ordinary people?
It means that the era of personalised genomic medicine has arrived. Instead of individual genetic tests, it will become
cost-effective for each person to have his or her own genome sequenced.
A series of excellent articles in the Milwaukee Journal Sentinel describes the first published use of genome
sequencing to diagnose and identify a cure for a boy, Nicholas Volker, suffering from a previously unknown
disease.

In this case, rather than sequencing the entire genome, the researchers sequenced the boy’s exome, the 2% of the
genome that encodes proteins. Their paper was published in March 2011 in Genetics in Medicine.

So what do you get when your DNA is sequenced?


Too much information? A bunch of As, Gs, Cs and Ts, in strings of 200-400 letters. Your DNA sample is shredded and
random fragments are sequenced. To get 99% of the target DNA sequenced at least once, the researchers sequenced
Nick’s exome to an average coverage of 34-fold. Individual sequence strings are matched against the human reference
genome and the differences are noted. For Nick Volker’s exome, Worthey et al. found more than 16,000 differences from
the reference human sequence. Which of these, if any, is causing the boy’s disease? The paper by Worthey et al.
describes the process of sifting through the chaff to identify candidate gene mutations.
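
The two ideas in this paragraph, sequencing depth and comparing reads with a reference, can be sketched as follows. The numbers and sequences are invented for illustration, and the read is assumed to be already aligned to the reference without gaps; real pipelines rely on dedicated aligners and variant callers.

```python
def average_coverage(num_reads, read_length, target_size):
    """Mean number of times each target base is sequenced."""
    return num_reads * read_length / target_size

def differences(reference, read, offset):
    """List (position, reference base, read base) for each mismatch.

    `offset` is the 0-based position in the reference where the read starts.
    """
    window = reference[offset:offset + len(read)]
    return [(offset + i, ref_base, read_base)
            for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base]

reference = "ATGGCGTACCTTGACCGTA"
read = "GTACGTTGA"  # hypothetical read starting at position 5
print(differences(reference, read, 5))  # [(9, 'C', 'G')]
print(average_coverage(num_reads=1_000, read_length=300, target_size=10_000))  # 30.0
```

Sequencing each position many times matters because a difference seen in only one read may be a sequencing error, while one seen consistently across many overlapping reads is more likely to be real.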

(Source: DNA Sequencing and Personal Genomics Case Study for Intro Biology [Online] Available at: <http://
jchoigt.wordpress.com/2011/06/10/dna-sequencing-and-personal-genomics-case-study-for-intro-biology/> [Accessed
7 March 2011].)

Questions
1. Is sequencing a human genome expensive and time consuming? Justify.
2. How is the rapid decline in the cost of sequencing considered a benefit?
3. According to you, what could be the possible ethical, legal and social considerations for human genome
sequencing?

Application III

Ivy Genomics-Based Medicine Project

Client
The Ben and Catherine Ivy Foundation, and The Translational Genomics Research Institute (TGen).

Overview
Funded by the Ben and Catherine Ivy Foundation and guided by TGen, the Ivy Genomics-Based Medicine Project is
an active collaboration among nine US institutions working together to better understand how the genetic differences
in individual brain tumours can potentially inform the prediction of what will be the most effective treatment option
for each patient.

This project will categorise tumours by molecular profiling and, for the first time in brain cancer research, test each
tumour against a wide spectrum of treatments to match differences in response with the profiles. It challenges not
only many of the traditional boundaries of IT but the business processes that support the anticipated throughput of
collaborative science and the consortia model.

Problem
The Ivy GBM project challenged not only many of the traditional boundaries of IT but the business processes that
support the anticipated throughput of collaborative science and the consortia model. The vision was to provide:
• First access to the chemo vulnerability- and genomic-profiling data
• Full access to any GBM models used in the consortium
• Use of consortium resources for independent research and/or follow-on sustained research projects
• Demonstration and practice of profile-guided management
• Synergy of collaborative, interdisciplinary, multi-institutional research
• Positioning to participate in stage II clinical trial of profile-guided treatments

Solution
5AM Solutions created a highly collaborative environment for participating institutions across the country to share
study information before, during and after the study. We developed customised workflows, inventory tracking, and
role-based information collection and sharing, supporting subject enrolment, clinical data collection, specimen
creation and tracking, data export and online analysis. 5AM Solutions launched an initial study in 4 weeks to meet
the demanding timeline set by the consortium. 5AM effectively balanced the needs for speed, reliability, accuracy
and the exposure of progress.

Benefits
A key benefit occurred up front through the study definition and elicitation process. This series of activities sharpened
the direction of the study (not just the software), forced the analysis of how the research would be run from a variety
of perspectives and allowed us to be able to incrementally meet the needs as they were derived. 5AM’s hosting
of the software eliminated concerns of HIPAA, data backup, encryption, in-house maintenance, and collaborator/
customer service and support.

For the collaborators, the project portal provided unprecedented ability to share study related documents and SOPs
for the study. User adoption was quick as new collaborators were able to contribute within an hour of supervised
training.

(Source: 5AM Solutions, Ivy Genomics-Based Medicine Project [Online] Available at:
<http://www.5amsolutions.com/resources/casestudies/ivy_gbm.php>)

Questions
1. What was the vision of the Ivy GBM project?
2. What were the efforts of 5AM Solutions in Ivy GBM project?
3. What were the benefits provided by 5AM Solutions?

Bibliography
Reference
• ABNOVA1, 2010. BLAST - Multiple Alignment [Video Online] Available at: <http://www.youtube.com/
watch?v=xdF6iZEPH_s> [Accessed 28 February 2012].
• Baxevanis, A. D. & Ouellette, B. F., 1998. Bioinformatics: A Practical Guide to the Analysis of Genes and
Proteins, John Wiley and Sons, New York.
• bimaticsblog, 2009. Human genome sequencing-Animated tutorial [Video Online] Available at: <http://www.
youtube.com/watch?v=-gVh3z6MwdU> [Accessed 28 February 2012].
• Biology Computers. Pairwise sequence alignment [Online] Available at: <http://gtbinf.wordpress.com/biol-
41506150/pairwise-sequence-alignment/> [Accessed 28 February 2012].
• Branden, C. & Tooze, J., 1998. An Introduction to Protein Structure, Garland.
• Brinkma, F. S. L., 2001. Phylogenetic Analysis [pdf] Available at: <http://www.bioon.com/book/biology/
bioinformatics/chapter-14.pdf > [Accessed 28 February 2012].
• Chorghade, M. S., 2006. Drug discovery and development, John Wiley and Sons.
• cisn. Genomics [Online] Available at: <http://cisncancer.org/research/what_we_know/omics/genomics.html>
[Accessed 28 February 2012].
• clcbio. Bioinformatics explained: Biological databases [Online] Available at: <http://www.clcbio.com/index.
php?id=1238> [Accessed 28 February 2012].
• Clustering [Online] Available at: <http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt > [Accessed
28 February 2012].
• Computer aided drug design and bioinformatics: A current tool for designing [Online] Available at: <http://www.
pharmainfo.net/reviews/computer-aided-drug-design-and-bioinformatics-current-tool-designing> [Accessed
28 February 2012].
• Cooper, G. 1994. The Human Genome Project: deciphering the blueprint of heredity, University Science
Books.
• EMBL-EBI. What is Bioinformatics? [Online] Available at: <http://www.ebi.ac.uk/2can/bioinformatics/
bioinf_biodatabases_1.html> [Accessed 28 February 2012].
• genomicseducation, 2009. What is Genomics Part 2 - The Human Genome Project [Video Online] Available
at: <http://www.youtube.com/watch?v=C86YbyEsct8&feature=results_main&playnext=1&list=PLE62E79A
B3FDD7867> [Accessed 28 February 2012].
• genomicseducation, 2010. What is Genomics - Chapter 1 [Video Online] Available at: <http://www.youtube.
com/watch?v=9jZF74iqLac&feature=related> [Accessed 28 February 2012].
• HaroonBBT, 2011. Microarray [Video Online] Available at: <http://www.youtube.com/watch?v=wKcQZVeIK-
k&feature=related > [Accessed 28 February 2012].
• Huson, D., 2005. A Brief Guide to Genomics [Online] Available at: <http://lectures.molgen.mpg.de/
Algorithmische_Bioinformatik_WS0607/reinert1.pd> [Accessed 28 February 2012].
• InsGenomeSciences, 2010. Introduction to Bioinformatics [Video Online] Available at: <http://www.youtube.
com/watch?v=xODTm4a6nsM> [Accessed 28 February 2012].
• Insights Learned from the Human DNA Sequence [Online] Available at: <http://www.ornl.gov/sci/techresources/
Human_Genome/project/journals/insights.shtml> [Accessed 28 February 2012].
• jv51jjv5, 2010. NCBI BLAST Tutorial - Part 1 [Video Online] Available at: <http://www.youtube.com/
watch?v=ZuBMBJmfn-4&feature=related> [Accessed 28 February 2012].
• Khandekar. Role of Bioinformatics In Medical Informatics A Case Study : Tuberculosis [pdf] Available at:
<http://www.jbtdrc.org/Symposium/Topics/Role_bio.pdf> [Accessed 28 February 2012].
• Korol, A. B., 2001. Microarray cluster analysis and applications [pdf] Available at: <http://www.science.co.il/
enuka/essays/microarray-review.pdf > [Accessed 28 February 2012].

• Koslow, S. H., & Huerta, M. F., 2000. Electronic collaboration in science, Routledge.
• Mount, D. W., 2001. Bioinformatics Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press.
• NCBI. Systematics and Molecular Phylogenetics [Online] Available at: <http://www.ncbi.nlm.nih.gov/About/
primer/phylo.html> [Accessed 28 February 2012].
• NHGRI, Bioinformatics: Examining Variation [Online] Available at: <http://www.genome.gov/25020003>
[Accessed 28 February 2012].
• NHGRI. A Brief Guide to Genomics [Online] Available at: <http://www.genome.gov/18016863> [Accessed
28 February 2012].
• nicolazonta, 2008. User driven molecular modeling in drug design [Video Online] Available at: <http://www.
youtube.com/watch?v=hd2YaygJC-w&feature=related> [Accessed 28 February 2012].
• norman466, 2011. Biology Project: Bioinformatics [Video Online] Available at: <http://www.youtube.com/
watch?v=saW1oEbboUM> [Accessed 28 February 2012].
• Novartis, 2011. Drug discovery and development process [Video Online] Available at: <http://www.youtube.
com/watch?v=3Gl0gAcW8rw> [Accessed 28 February 2012].
• plantbreedgenomics, 2010. Bioinformatics 101 - Part 2 Intro [Video Online] Available at: <http://www.youtube.
com/watch?v=WlVGTtqT4Tg&feature=related> [Accessed 28 February 2012].
• Rastogi, S. C., Mendiratta, N. & Rastogi, P., 2006. Bioinformatics-Concepts, Skills, Applications, 2nd ed., PHI
Learning Pvt. Ltd.
• Robbins. Bioinformatics: Essential Infrastructure For Global Biology [pdf] Available at: <http://www.esp.org/
oecd.pdf> [Accessed 28 February 2012].
• sanjaysingh765, 2011. Multiple sequence alignment with clustalw and boxshade [Video Online] Available at:
<http://www.youtube.com/watch?v=BrzhdNvXXDs> [Accessed 28 February 2012].
• Thermy33, 2011. Understanding Phylogenetic Trees [Video Online] Available at: <http://www.youtube.com/
watch?v=xwuhmMIIspo> [Accessed 28 February 2012].
• Toriello, J., 2003. The Human Genome Project, The Rosen Publishing Group.
• UCBerkeley, 2010. Biology 1B - Lecture 24: Phylogenetics [Video Online] Available at: <http://www.youtube.
com/watch?v=vrGfDPteKqU> [Accessed 28 February 2012].
• wenl888, 2012. Easy to use microarray data analysis tool - No training needed: Goober [Video Online] Available
at: <http://www.youtube.com/watch?v=nSlhCaJKhjY > [Accessed 28 February 2012].
• Young, D. Computational Techniques in the Drug Design Process [Online] Available at: < http://www.
ccl.net/cca/documents/dyoung/topics-orig/drug.html> [Accessed 28 February 2012].

Recommended Reading
• Barnes, R., 2007. Bioinformatics for geneticists: A bioinformatics primer for the analysis of genetic data, John
Wiley & Sons.
• Boon, K., The human genome project: what does decoding DNA mean for us?, Enslow Publishers.
• Borlak, 2005. Handbook of toxicogenomics: strategies and applications, Wiley-VCH.
• Branden C., & Tooze, J., 1999. Introduction to Protein Structure, Garland Publishing, New York.
• International Human Genome Sequencing Consortium, 2001. Initial Sequencing and Analysis of the Human
Genome, Nature.
• Jogota, A., 2005. Computational Methods in Phylogenetic Analysis, p.74.
• Larson, S. 2005. Bioinformatics and Drug Discovery, Humana Press, p.444.
• Lehninger, A. L., 1984. Principles of Biochemistry, CBS publishers and distributors, New Delhi, India.
• Letovsky, S. Bioinformatics: Databases and Systems, O’REILLY.
• Livingstone & Barton, 1993. Protein Sequence Alignments: a Strategy for the Hierarchical Analysis of Residue
Conservation, Computer Applications in the Biosciences.

• Markel, S. and Leon, D., Sequence Analysis in A Nutshell, O’REILLY.
• Patthy, L., 1999. Protein Evolution, Blackwell Science.
• Polański, A. & Kimmel, M., 2007. Bioinformatics, Springer.
• Ramsden, J., 2009. Bioinformatics: an introduction, 2nd ed., Springer.
• Shanmughavel, P., 2005. Principles of Bioinformatics, Pointer Publishers, Jaipur, India.
• Steel, M. A., 2003. Phylogenetics, Oxford University Press.
• Stekel, D., 2003. Microarray bioinformatics, Cambridge University Press, p.263.
• Zelikovsky, A., 2008. Bioinformatics algorithms: Techniques and applications, John Wiley & Sons.

Self Assessment Answers
Chapter I
1. a
2. a
3. c
4. b
5. a
6. a
7. b
8. a
9. a
10. b

Chapter II
1. a
2. a
3. d
4. a
5. b
6. c
7. d
8. c
9. c
10. a

Chapter III
1. a
2. b
3. a
4. b
5. b
6. a
7. d
8. b
9. a
10. b

Chapter IV
1. a
2. a
3. a
4. b
5. c
6. b
7. b
8. a
9. d
10. c

Chapter V
1. b
2. b
3. b
4. a
5. d
6. a
7. b
8. c
9. d
10. a

Chapter VI
1. a
2. b
3. d
4. d
5. a
6. a
7. a
8. a
9. b
10. d

Chapter VII
1. a
2. c
3. d
4. d
5. a
6. b
7. b
8. c
9. d
10. a

Chapter VIII
1. a
2. a
3. c
4. a
5. d
6. b
7. a
8. a
9. b
10. a
