L11 INTRODUCTION TO BIOINFORMATICS


Introduction
As we have noted in our previous lectures, the molecular biology approach to understanding the flow
and expression of genetic information involves studying the structure of macromolecules, DNA, RNA
and proteins, and the metabolic steps that mediate the flow of information from the genome to the
phenotype of the organism. Molecular biologists study the structure and functions of the molecules that
are critical to the flow of information, both from genome to phenotype and from
generation to generation. Processes such as DNA replication, transcription, translation, and protein
targeting mediate and control the expression of genetic information. Put simply, the goal of
molecular biology is to understand the mechanism, specificity and regulation of these processes. Thus,
a view has been presented that molecular biology is an information science. With the advent of large
databases of genetic information for these macromolecules from a large number of organisms
including humans, a completely new approach to studying gene expression and its regulation has been
developed. This field is known as bioinformatics.

WHAT IS BIOINFORMATICS?
Bioinformatics is conceptualizing biology in terms of molecules and applying informatics techniques
(derived from disciplines such as applied mathematics, computer science and statistics) to understand
and organize the information associated with these molecules on a large scale. Hence, bioinformatics
is a management information system for molecular biology, with many practical applications in
various fields. The fundamental question for bioinformatics is: how do we describe, analyze, simulate and
predict the dynamics of various biological processes using information technology tools?

11.1 BIOINFORMATICS: THE INFORMATION SCIENCE


Bioinformatics is the science of using information to understand biological phenomena. It offers the
tools that can be used to answer some of the questions related to biological phenomena. It is a subset of
the larger science of computational biology, which is the application of quantitative analytical
techniques in modeling and solving problems in biological systems. Bioinformatics is, in a way, the
application of statistical methods, pattern recognition and other computational methods.
Bioinformatics at the basic level deals with biological information: data collection and storage;
data searching and retrieval; analysis and prediction of patterns. Fredj Tekaia of the Pasteur Institute
suggested the following definition of bioinformatics: “The mathematical, statistical and computing
methods that aim to solve biological problems using DNA and amino acid sequences and related
information”.
Most of the large biomolecules are polymers, which are ordered chains of simpler molecular
modules called monomers that can be thought of as beads or building blocks which, despite having
different colours and shapes, have the same thickness and the same way of connecting to one another.
Each monomer is of the same general class, but each kind of monomer has its own well-defined set of
characteristics. Many monomers can be joined together to form a single, far larger macromolecule
which has exquisitely specific information content and chemical properties. According to this scheme,
the monomers in a given macro-molecule of DNA or protein can be treated computationally as letters
of an alphabet put together in preprogrammed arrangements to carry messages or do work in a cell.
Bioinformatics and Data Analysis
The 21st century, aptly called the gene age, envisages that biologists across the globe must not only be
computer-literate but also must become proficient in the use of biological databases towards a
complete understanding of the various life forms. The field of bioinformatics involves the use of
sophisticated computer resource systems in the management of biological information.
Rapid DNA sequencing technology, which emerged during the 1970s, has tremendously increased
the amount of biological information available to bioscientists. The Human Genome Project has
sequenced about 3 × 10^9 bases (early estimates put the number of genes near 100,000, whereas the
finished sequence points to roughly 20,000 protein-coding genes), necessitating the use of a good
computer-based system for the analysis of such enormous data. Other genome projects, such as the
Schistosoma sp., Plasmodium sp. and yeast genome projects, have further increased this information
repertoire. A vast number of nucleotides (10^8) and amino acids of mammals, primates, rodents,
bacteria and other life forms have already been classified and stored in various publicly available
databases.
It follows that today's biologists must be properly equipped to handle and manage the enormous
amount of biological information stored in the databases. These databases include GenBank, EMBL,
PIR, Swiss-Prot, SEQDB, OMIM, CLDB, PROSITE, MEDLINE, etc., which store information on
nucleic acid and protein sequences, cloning vectors, restriction enzymes and transcription factors.
These databases also store information on cell lines and genomic disorders.

Bioinformatics beyond Data Analysis


In a broader sense, bioinformatics has moved beyond data analysis alone and includes the following:
DNA microarrays: These are new technologies designed to measure the relative number of copies of a
genetic message (levels of gene expression) at different stages in development, in disease or in
different tissues.
Functional genomics: Large-scale ways of identifying gene functions and associations (for example,
yeast two-hybrid methods).
Structural genomics: Attempts to crystallize and/or predict the structure of all proteins.
Comparative genomics: Study of multiple whole genomes for understanding the differences and
similarities between all the genes of multiple species. From such studies we can draw particular
conclusions about species and their evolution.

Medical informatics: The management of biomedical experimental data associated with particular
molecules—from mass spectroscopy to in vitro assays to clinical side effects.
As a result of the massive surge in the data and its complexity, many of the challenges in biology have
actually become challenges in computing. Such an approach is ideal because of the ease with which
computers can handle large quantities of data and probe the complex dynamics observed in nature.
Bioinformatics is often regarded as the application of computational techniques to understand and
organize the information associated with macromolecules. The confluence of two dissimilar fields is
largely attributed to the fact that biology itself is an information science; genes, which at the basic
level can be viewed as digital repositories of information, largely dictate an organism’s physiology and
behaviour.

11.2 OBJECTIVES OF BIOINFORMATICS


Following are the objectives:
(1) At its simplest and most basic level, bioinformatics organizes data in a way that allows researchers to
access existing information and to submit new entries as they are produced, e.g. the Protein Data Bank for
3D macromolecular structures. While data curation is an essential task, the information stored in these
databases is essentially useless unless analyzed. Thus the purpose of bioinformatics extends far
beyond mere volume control of data.
(2) To develop tools and resources that aid in the analysis of data. For example, having sequenced a
particular protein, it is of interest to compare it with previously characterized sequences. This requires
more than just a straightforward database search. As such, programs such as FASTA and PSI-BLAST
must consider what constitutes a biologically significant resemblance. Development of such resources
requires extensive knowledge of computational theory, as well as a thorough understanding of biology.
(3) To use these tools to analyze the data and interpret the results in a biologically meaningful manner.
Traditionally, biological studies examined individual systems in detail, and frequently compared them
with a few that are related. In bioinformatics one can also conduct global analyses of all the available
data with the aim of uncovering common principles that apply across many systems and highlight
features that are unique to some.

Challenge for Bioinformaticians


Elucidating structure and function from sequence is a challenging problem for several reasons. First,
genetic information is highly redundant. In coding regions, for example, each amino acid has on
average 3.05 codons to choose from (61 codons coding for 20 amino acids). This means that even a short
protein like human insulin, 51 amino acids in length, has roughly 3^51, or about 10^24, distinct DNA
sequences that can encode an identical amino acid sequence.
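
The arithmetic behind this estimate can be checked with a short Python sketch. It is an illustration only: it assumes the standard genetic code, written in the conventional compact layout with bases ordered T, C, A, G, and the names CODON_TABLE, DEGENERACY and count_encodings are introduced here for the example. The tripeptide used at the end is hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code in compact form (assumption: standard code only).
# Codons are enumerated with bases in the order T, C, A, G; '*' is a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of synonymous codons for each amino acid (stop codons excluded).
DEGENERACY = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

sense_codons = sum(DEGENERACY.values())           # codons that encode an amino acid
average_codons = sense_codons / len(DEGENERACY)   # mean codons per amino acid


def count_encodings(peptide):
    """Exact number of DNA sequences that encode the given peptide."""
    return prod(DEGENERACY[aa] for aa in peptide)


print(sense_codons, len(DEGENERACY), average_codons)
# Rough estimate from the text: about 3 codons per residue over 51 residues.
print(f"3**51 is about {3**51:.2e}")
# Hypothetical tripeptide Leu-Ser-Arg: each residue has six codons.
print(count_encodings("LSR"))
```

The exact count for a real protein depends on its composition, because residues such as leucine have six codons while methionine and tryptophan have one, so 3^51 should be read as an order-of-magnitude figure.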
There is structural redundancy in genetic information as well. There are over 700 globin
sequences in the current protein databases, and all of them have a nearly identical three-dimensional
structure, but the sequences are so different as to be unrecognisable by most methods currently in use.
This implies that many different protein sequences can encode precisely the same folding, just as many
different DNA sequences can encode the same protein sequence. Nature makes use of this redundancy
to encode multiple functions in genetic information. Within a single eukaryotic coding region, the
DNA sequence will show evidence for codon choice that regulates the rate of translation, dinucleotide
selection that favours bending of the DNA into chromatin, as well as the protein coding information
itself. Finally, the genetic information is inherently one-dimensional, but structure and function depend
upon three-dimensional attributes.

Vastness of Biological Data


Biological data that is available is enormous and can be used for bioinformatics to predict the function
of actual gene products. The data can be broadly divided into four categories:
A. Raw DNA/protein sequence data
B. Protein structure data
C. Data on gene expression
D. Experimental data on biological systems and metabolic pathways
A. Raw DNA/Protein Data
Computational analysis can help to reveal important biological information from nucleotide sequences
of DNA and amino acid sequences of proteins. These provide a clue to the structure and functions of
genes and proteins. Applications of this information are:
(1) DNA and protein sequence databases: Enormous amounts of sequence data have been generated and
deposited in public databases like GenBank, EMBL Databank, SWISS-PROT, NRDB etc. The
GenBank repository of nucleic acid sequences holds more than 9.5 billion bases in 8.2
million entries. Automated DNA sequencers and international genome sequencing efforts have led to
an exponential growth in sequence information requiring more and more efficient methods for
database management and data mining. At the next level are protein sequences, comprising strings
of 20 amino acid letters; about 300,000 protein sequences are known.
Macromolecular structural data represents a more complex form of information. There are currently
19,000 entries in the protein data bank (PDB), most of which are protein structures. A typical PDB file
for a medium-sized protein contains the xyz coordinates of approximately 2,000 atoms.
(2) Sequence similarity search: Searching a database for similar sequences is an important operation.
Whenever a new sequence is found, it can be compared with known sequences, usually by aligning
corresponding segments and looking for matching and mismatching letters. Genes or proteins that are
sufficiently similar are likely to be related and are therefore said to be homologous to each other.
BLAST and FASTA are some widely used programmes that help in rapid sequence similarity searches
against huge databases.
(3) Sequence alignment: Sequences are compared to determine evolutionary relationships or
to infer their functions based on structure. The relatedness of proteins is used to trace the family trees
of different molecules through evolutionary time.
(4) Diversity in size: There are invariably more sequence-based data than structural data because of the
relative ease with which they can be produced. This is partly related to the greater complexity and
information content of individual structures compared to individual sequences. While more biological
information can be derived from a single structure than a protein sequence, the problem is overcome in
the latter by analyzing larger quantities of data.
(5) Predicting genes in genomic sequences: A concept that underpins most research methods in
bioinformatics is that much of the data can be grouped together based on biologically meaningful
similarities. For example, sequence segments are often repeated at different positions of genomic DNA
in which genes can be identified. A database search can lead to detection of previously characterized
genes. But the real challenge lies in the prediction of new genes in genomic sequences, which can help
in designing experiments to determine their functions. Genes can be clustered into those with particular
functions or according to metabolic pathway to which they belong.
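
The comparison described under sequence similarity search, counting matching and mismatching letters between aligned segments, can be sketched in Python as follows. This is a minimal illustration and not how BLAST or FASTA work internally; the function name percent_identity, the use of "-" as the gap symbol, and the choice to skip gapped columns are assumptions made for the example.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of compared columns in which both sequences carry the same letter.

    Columns where either sequence has a gap are skipped; this is only one of
    several conventions in use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned DNA segments: 5 of 7 columns match.
print(round(percent_identity("GATTACA", "GACTATA"), 1))
```

A high percentage suggests, but does not prove, homology; what counts as "sufficiently similar" depends on sequence length and on the statistical model used by the search program.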

B. Protein Structure Data


The genetic information is contained in DNA which functions as the blue-print of life, but this
information is expressed in the form of proteins, the molecules of biological specificity. Proteins are
undoubtedly the most complex biological molecules, which are integral constituents of cell structure,
help in signal transduction and cellular communication, participate in immunological functions, and
function as enzymes to catalyze biochemical reactions. Proteins are, therefore, important
macromolecules which are an expression of the genetic information residing in DNA. In order to use
this valuable genetic information, one must learn to classify new protein sequences into existing
superfamilies and new genes into existing phylogenies, so that one can commit valuable time to study
the truly novel genes and proteins. More importantly, it is also critical that one should be able to
reason from example and deduce novel structures and functions from sequence databases.
(1) Protein sequence databases: Protein sequence databases are categorized as primary and composite
or secondary. Primary databases contain over 300,000 protein sequences and function as repositories for
the raw data. Two major protein sequence databases are SWISS-PROT and PIR-International, which
differ from the nucleotide databases in that they are both curated. This means that groups of
designated curators (scientists) prepare the entries from literature and/or contacts with external experts.
(2) Protein structure databases: Structural databases are databases of macromolecular structures. The
PDB is the main primary database for 3D structures of proteins determined by X-ray crystallography
and NMR. Structural biologists usually deposit their data in the PDB, which was established in the
1970s. In fact, the PDB provides a primary archive of all 3D structures for macromolecules, such as
proteins, RNA, DNA and various complexes. Most of the 19,000 structures are solved by X-ray
crystallography and NMR, but some theoretical models are also available.
(3) Structure visualization: Protein structure data is stored as collections of x, y, z coordinates, but
proteins cannot be visualized simply by plotting those points. The connectivity between atoms in
proteins has to be taken into account, and for the visualization to be effective, a virtual 3D
environment, which provides the illusion of depth, needs to be created. A variety of structure
visualization tools are available, including software programmes like RasMol, VMD and MidasPlus.
These are powerful tools that help in displaying 3D models and in creating high-quality
graphics through molecular modeling.
(4) Identification of protein family: One of the most popular databases is PROSITE, a database of
protein families and domains. It consists of short sequence patterns and profiles that characterize
biologically significant sites in proteins and that help to reliably identify the known family to which
a new sequence belongs.
(5) Prediction of protein structure: The sequence of a protein can be easily determined, but its structure
is much harder to determine experimentally. It would be interesting to predict function from sequence, to
identify functional sites in uncharacterized 3D structures, and eventually to build designed proteins:
molecular machines that do whatever we want them to do. But without an understanding of how
sequence determines structure, these other goals cannot reliably be achieved. Although many workers
have tried to develop methods for structure prediction, the only methods that produce a large number
of successful 3D structure predictions are those based on sequence homology.
(6) Drug design: Proteins are the molecules of biological specificity. Faulty proteins often result
in disease. The behaviour of proteins can be controlled by drug molecules, which may provide a tool
for curing diseases. Hence, knowledge of protein structure can be exploited to design drug molecules
that can bind to target proteins to enhance or inhibit their activity.
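
The short sequence patterns mentioned under identification of protein family can be illustrated with a small Python sketch that converts a PROSITE-style pattern into a regular expression and scans a sequence for it. This is a simplified sketch under stated assumptions: it handles only the basic pattern elements (x, [..], {..}, repeat counts, and the < and > anchors outside brackets), the helper names prosite_to_regex and find_sites are invented for the example, and the test sequence is hypothetical. The pattern shown is the commonly cited N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                       # element separators
    regex = regex.replace("x", ".")                      # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) = 2 to 4 residues
    regex = regex.replace("<", "^").replace(">", "$")    # terminus anchors
    return regex


def find_sites(pattern, sequence):
    """Return (1-based start, matched text) for every match, overlaps included."""
    lookahead = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


N_GLYCOSYLATION = "N-{P}-[ST]-{P}"
print(prosite_to_regex(N_GLYCOSYLATION))
# Hypothetical peptide: the first N starts a site, the second is followed by P.
print(find_sites(N_GLYCOSYLATION, "MKNGSAANPTQ"))
```

Short patterns of this kind match by chance quite often, which is why such hits are usually treated as candidates to be checked against profiles or experimental evidence.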
C. Data on Gene Expression
DNA microarrays are miniaturized laboratories for the study of gene expression. Each DNA
microarray or gene chip contains a deliberately designed array of probe molecules that can bind
specific pieces of DNA or mRNA. Labeling the DNA or RNA with fluorescent tags allows the level of
expression of any gene in a cellular preparation to be measured quantitatively. Microarrays also have
other applications in molecular biology, but their use in studying gene expression has opened up a new
way of measuring genome functions. DNA microarray technology was developed in the 1990s,
and since then gene expression data have grown rapidly, paralleling the growth of the sequence and
structure databases.
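
One common way of turning fluorescence readings into a quantitative expression measure is the log2 ratio between a sample channel and a reference channel. The Python sketch below shows that calculation only; it is not described in the text above, the gene names and intensities are made up for illustration, and the pseudocount of 1 is an assumption used to avoid division by zero.

```python
import math


def log2_ratios(sample, reference, pseudocount=1.0):
    """log2(sample / reference) for every gene measured in both channels."""
    shared_genes = sample.keys() & reference.keys()
    return {
        gene: math.log2(
            (sample[gene] + pseudocount) / (reference[gene] + pseudocount)
        )
        for gene in shared_genes
    }


# Hypothetical fluorescence intensities for three probes.
sample = {"geneA": 1599.0, "geneB": 99.0, "geneC": 399.0}
reference = {"geneA": 399.0, "geneB": 399.0, "geneC": 399.0}

for gene, ratio in sorted(log2_ratios(sample, reference).items()):
    print(gene, round(ratio, 2))
```

On this scale a value of +2 means a fourfold increase relative to the reference, -2 a fourfold decrease, and 0 no change. Real analyses also involve background correction and normalization, which are omitted here.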
D. Experimental Data
In addition to sequence and structure information, laboratory experimentation is generating enormous
amounts of biochemical and biophysical data. A number of bioinformatics tools are available
which help in the analysis and interpretation of the experimental data.
(1) Statistical analysis: Besides finding relationships between different proteins, much of
bioinformatics involves the analysis of one type of data to infer and understand the observations for
another type of data. An example is the use of sequence and structural data to predict the secondary
and tertiary structures of new protein sequences. These methods are often based on statistical rules
derived from structures, such as the propensity for certain amino acid sequences to produce different
secondary structural elements. Another example is the use of structural data to understand a protein's
function; here studies have investigated the relationship between different protein folds and their
functions and analyzed similarities between different binding sites in the absence of homology.
Combined with similarity measurements, these studies provide us with an understanding of how much
biological information can be accurately transferred between homologous proteins.
(2) Image analysis: There are other fields, for example, medical imaging/image analysis that might be
considered part of bioinformatics. Electrophoretic separation of DNA/protein could be done by one or
two dimensional gel electrophoresis and then subjected to image analysis. Image analysis is an
important tool in microarray analysis while identifying gene expression.
(3) Classification databases: Classification databases are taxonomies of protein structure, and they
bear a strong resemblance to the morphology-based taxonomies developed by biologists. Proteins
that look grossly the same in terms of shape and topology are classified as more closely related than
proteins that look substantially different. The classification databases can be construed as trees with
many branchings at each branch point, very similar to phylogenetic trees.
(4) Simulation experiments: Bioinformatics is often regarded as the application of computational
techniques to understand and organize the information associated with biological systems.
Experiments can be simulated to predict an organism's physiology and the properties of drugs, and to support drug design.
Models can be built to predict disease outbreak in a community or population. Although several
methods of machine learning are used to predict function from sequence or structure, there are many
methods for simulating metabolism from known biological functions. Medical science would like to
understand pathways to know which genomic changes could give rise to each known inherited disease.
The pharmaceutical industry would like to use this knowledge to produce drugs, proteins or genetic
therapies that can reverse disease phenotypes.

Table 28.1: Types of data that are analyzed in bioinformatics research

28.3 BIOINFORMATICS APPROACH TO BIOLOGICAL PROBLEMS


Molecular biology deals with biological activity at the molecular level; the other levels of
abstraction are the atomic level, which is the most basic, and the network level. While it is
important to study the atomic basis of biological activity, the approach taken by molecular biologists is
to reduce it to the molecular level. The interaction between various macromolecules and their
transformation within the cell is a part of the network study.
The systems approach is useful for understanding complex systems such as biological systems.
The basic building blocks are molecules like nucleic acids and proteins. The systems view makes it
easier to integrate various approaches to understanding their structure and function. While the
traditional approach is experimental, performing various experiments in the laboratory to understand
the molecular biology, bioinformatics offers a synthetic approach. The synthetic approach is much
faster and, in certain cases, as reliable as the experimental approach (Fig. 28.1).

Fig. 28.1 Bioinformatics approach to the understanding of the biological systems.

Bioinformatics Approach
The information flow in the cell as understood through the central dogma is fairly simple. However, the
complexity is enormous and gives rise to several problems for the biologist to resolve. Some of the
problems and their bioinformatics approaches are summarized in Table 28.2.

28.4 BIOINFORMATICS APPLICATIONS


What are the kinds of problems bioinformatics can address and what is the approach? Some of these
are illustrated in the discussion below.
DNA Level
1. Routine re-sequencing of megabase regions of genomic DNA: The aim is to understand disease
susceptibilities and predispositions, basically by sequencing affected and unaffected people. For example,
isolate cancer cells, sequence tumour and normal tissue, and genotype both. Massively parallel DNA
arrays are among the new sequencing technologies under development.
2. Systematic identification of common variants in genes: Usually there are a small number of common
variants per locus. Variants provide clues to susceptibilities, e.g. three variants of apolipoprotein E in
Alzheimer’s. Major applications include understanding of cardiovascular disease, thrombosis, heart
disease, obesity, HIV resistance. The approach is to get most such variant sequences from sequences of
100 random individuals and move from family-based linkage analyses to association analyses. This is
usually done by testing disease susceptibility against all common variants simultaneously by
genotyping a well-characterized clinical group with a comprehensive DNA array. This is done by
characterizing the SNPs (Single Nucleotide Polymorphisms) associated with a given human disease.
Use of DNA array technology for human SNP analysis is already underway with the availability of the
Affymetrix GeneChip® and others. This helps in the identification of susceptibility genes.

Table 28.2: Some problems and their bioinformatics approaches

3. Rapid sequencing of other organisms: Comparative whole genome analysis can provide clues about
molecular evolution. While conserved sequences provide information about key motifs, sequence
differences provide information about diversity of form and function. Bioinformatics provides
facilities for sequence storage and automation of the analysis.
4. DNA sequence assembly: Given DNA sequence fragments of 200-700 base pairs
in length, the problem is to assemble them into the original DNA sequence from which the fragments were derived. One
approach is pairwise sequence alignment. For a given pair of sequences (DNA or protein) and a
method of scoring similarity of sequences, one has to determine how similar the two sequences are
(best similarity score) and show where the two sequences match (best alignment).
5. Repetitive sequences in DNA: In the DNA domain, a motivation for multiple sequence alignment
arises in the study of repetitive sequences. These are sequences of DNA, often without clearly
understood biological function that are repeated many times throughout the genome. The repetitions
are generally not exact, but differ from each other in a small number of insertions, deletions, and
substitutions. As an example, the Alu repeat is approximately 300 bp long, and appears over 600,000
times in the human genome. It is believed that as much as 60% of the human genome may be
attributable to repetitive sequences without known biological function. In order to highlight the
similarities and differences among the instances of such a repeat family, one would like to display a
good multiple sequence alignment of its constituent sequences.
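
The pairwise alignment problem stated in item 4 (find the best similarity score and the alignment that achieves it) can be sketched with dynamic programming. The Python example below uses the Needleman-Wunsch global alignment method, which is one standard choice and is not named in the text; the scoring values (match +1, mismatch -1, gap -2) and the function name global_alignment are assumptions made for illustration.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: return (best score, aligned a, aligned b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = global_alignment("ACGT", "AGT")
print(best)
print(row_a)
print(row_b)
```

For fragment assembly, a variant that scores only the overlap between the end of one fragment and the start of another is usually more appropriate than a full global alignment; the table-filling idea is the same.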
RNA Level
Simultaneous monitoring of expression of all genes: The mRNA levels define the state of the cell. The
approach is to monitor all mRNAs at a quantitative sensitivity of one molecule per cell, and at a
sensitivity sufficient to distinguish alternative splicing. Use of DNA microarrays can
help in this. This information can be used for:
(a) Description—catalogues of proteins present in different cells, different stages and different
environments,
(b) Classification—classify protein re-susceptibilities, disease, and population subtypes,
(c) Circuitry—gene networks and gene expression circuits for development and response pathways.
Monitoring Level and Modification State of all Proteins
It is important to monitor post-translational modifications of proteins (e.g. phosphorylation
state) and changes in genetic networks. It is possible to do a 2D gel analysis of proteins followed by mass
spectrometry analysis. Mass spectrometry analysis of peptide fragments helps in identification of
protein “signatures”.
Identification of all Basic Protein Shapes
Most probably, there is a limited number of protein shapes and hence a limited number of protein
families. One can analyze amino acid sequences against a database of protein shapes. Some of the
bioinformatics databases being used are:
(a) Pfam—protein multiple sequence alignments and common protein domains
(b) SCOP—Structural classification of proteins
(c) CATH—protein classification by Class, Architecture, Topology, and Homology.
Multiple Sequence Alignment of Proteins
An important motivation of studying the similarity among multiple strings is the fact that protein
databases are often categorized by protein families. A protein family is a collection of proteins with
similar structure (i.e. three-dimensional shape), similar function, or similar evolutionary history. When
one has a newly sequenced protein, one would like to know to which family it belongs, as this provides
hypotheses about its structure, function or evolutionary history. The new protein may not be
particularly similar to a single protein in the database, yet may still share considerable similarity with
the collective members of a family of proteins. One approach is to construct a representation of each
protein family, for example, a good multiple sequence alignment of all its members. Then, when one
has a newly sequenced protein and wants to find its family, one only has to compare it to the
representation of each family. Common structure, function or origin of a molecule may only be weakly
reflected in its sequence. For example, the three-dimensional structure of a protein is very difficult to
infer from its sequence, and yet is very important to predict its function. Multiple sequence
comparisons may help highlight weak sequence similarity, and shed light on structure, function or
origin.
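
The idea of comparing a new protein against a representation of each family, rather than against individual members, can be sketched in Python as follows. This is a deliberately simplified illustration: it assumes each family is already aligned, gap-free and of the same length as the query, it scores by plain column frequencies rather than the log-odds profiles used by real tools, and the family names and sequences are hypothetical.

```python
from collections import Counter


def build_profile(alignment):
    """Per-column letter counts for an alignment of equal-length sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(column) for column in zip(*alignment)]


def consensus(profile):
    """Most frequent letter in each column."""
    return "".join(column.most_common(1)[0][0] for column in profile)


def profile_score(profile, sequence):
    """Mean fraction of family members that agree with `sequence` per column."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match the profile length")
    total = sum(
        column[letter] / sum(column.values())
        for column, letter in zip(profile, sequence)
    )
    return total / len(profile)


def best_family(families, sequence):
    """Name of the family whose profile scores the sequence highest."""
    return max(
        families,
        key=lambda name: profile_score(build_profile(families[name]), sequence),
    )


# Hypothetical pre-aligned families.
families = {
    "family_a": ["MKVLA", "MKILA", "MRVLA"],
    "family_b": ["GSPEC", "GSPDC", "GTPEC"],
}
query = "MRILA"  # identical to no single member, yet close to family_a as a whole
print(consensus(build_profile(families["family_a"])))
print(best_family(families, query))
```

The query here differs from every individual member of family_a, but each of its letters occurs in that family's alignment at the corresponding column, which is the kind of collective similarity the paragraph above describes.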

Determining Gene Function


Much of bioinformatics is focused on helping the biologist determine gene function. To do this, one
needs to
• Find genes in a genome
• Predict the gene product
• Predict the gene function
Approaches to finding genes:
• Search by sequence similarity: find genes by looking for matches to sequences that are
known to be related to genes
• Search by signal: find genes by identifying the sequence signals involved in gene expression
• Search by content: find genes by statistical properties that distinguish protein-coding DNA
from noncoding DNA.
Evidence for genes can consist of matches to:
• Known proteins
• Protein motifs (e.g. zinc finger, ATP- and GTP-binding motifs, etc.)
• Expressed sequence tags (ESTs)
Searching for matches to known proteins:
• Translate the DNA sequence in all reading frames
• Search against a protein database
• High-scoring matches suggest the presence of homologous genes in the DNA.
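
The first of these steps, translating a DNA sequence in all reading frames, can be sketched in Python as below. It is an illustration under stated assumptions: the standard genetic code is used, codons containing anything other than A, C, G or T are translated as "X", frames are labelled +1 to +3 and -1 to -3 by a convention chosen for this example, and the input sequence is hypothetical. The database search that follows in practice is not shown.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of `dna`; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to protein translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical 9-base sequence: start codon, alanine codon, stop codon.
for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Each of the six translations would then be compared against a protein database, and a high-scoring match in any frame points to a possible gene on the corresponding strand.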
CONCLUSION
The advent of highly advanced electronic communication networks has enabled on-line
applications in biological research, giving biologists access to these databases through the Internet.
Through it, biologists are able to maintain directories, digests, newsletters and bibliography
databases. Comparison of nucleotide sequence homology between distant genes, sequence analysis,
PCR primer design, protein domain similarity searches and location of coding sequences are some of the many
uses of these databases and software programmes. The biologist must be made aware of such on-line
programmes, such as GOPHER, BLAST, FASTA, WWW, WAIS, etc.

Computer-aided applications have revolutionized the pace of scientific research, and have contributed
greatly to problems ranging from the prediction of molecular properties of compounds to molecular
modeling, molecular graphics and 3D structure determination by means of image processing.
Bioscientists must, therefore, keep pace with the development of efficient, accurate and selective
techniques, software and programmes for an effective and efficient management and
