l11 Introduction to Bioinformatics
WHAT IS BIOINFORMATICS?
Bioinformatics is conceptualizing biology in terms of molecules and applying informatics techniques
(derived from disciplines such as applied mathematics, computer science and statistics) to understand
and organize the information associated with these molecules on a large scale. Hence, bioinformatics
is the management information system for molecular biology which has many practical applications in
various fields. The fundamental issue for bioinformatics is: how do we describe, analyze, simulate and
predict the dynamics of various biological processes using information technology tools?
The Schistosoma sp., Plasmodium sp. and yeast genome projects have further increased this information
repertoire. A vast number of nucleotides (10^8) and amino acids from mammals, primates, rodents,
bacteria and other life forms have already been classified and stored in various publicly available
databases.
It follows that today’s biologists must be properly equipped to handle and manage the enormous
amount of biological information stored in these databases, which include GenBank, EMBL,
PIR, Swiss-Prot, SEQDB, OMIM, CLDB, PROSITE, MEDLINE, etc. They store information on
nucleic acid and protein sequences, cloning vectors, restriction enzymes and transcription factors,
as well as on cell lines and genomic disorders.
Medical informatics: The management of biomedical experimental data associated with particular
molecules—from mass spectroscopy to in vitro assays to clinical side effects.
As a result of the massive surge in the data and its complexity, many of the challenges in biology have
actually become challenges in computing. Such an approach is ideal because of the ease with which
computers can handle large quantities of data and probe the complex dynamics observed in nature.
Bioinformatics is often regarded as the application of computational techniques to understand and
organize the information associated with macromolecules. The confluence of two dissimilar fields is
largely attributed to the fact that biology itself is an information science; genes, which at the basic
level can be viewed as digital repositories of information, largely dictate an organism’s physiology and
behaviour.
Although more information can be derived from a single structure than from a protein sequence, the
problem is overcome in the latter by analyzing larger quantities of data.
(5) Predicting genes in genomic sequences: A concept that underpins most research methods in
bioinformatics is that much of the data can be grouped together based on biologically meaningful
similarities. For example, sequence segments are often repeated at different positions of the genomic DNA
in which genes can be identified. A database search can lead to the detection of previously characterized
genes. But the real challenge lies in predicting new genes in genomic sequences, which can help
in designing experiments to determine their functions. Genes can be clustered into those with particular
functions or according to the metabolic pathway to which they belong.
(6) Drug design: Proteins are the molecules of biological specificity, and faulty proteins often result
in disease. The behaviour of proteins can be controlled by drug molecules, which may provide a tool
for curing diseases. Hence, knowledge of protein structure can be exploited to design drug molecules
that bind to target proteins to enhance or inhibit their activity.
C. Data on Gene Expression
DNA microarrays are miniaturized laboratories for the study of gene expression. Each DNA
microarray or gene chip contains a deliberately designed array of probe molecules that can bind
specific pieces of DNA or mRNA. Labeling the DNA or RNA with fluorescent tags allows the level of
expression of any gene in a cellular preparation to be measured quantitatively. Microarrays also have
other applications in molecular biology, but their use in studying gene expression has opened up a new
way of measuring genome functions. DNA microarray technology was developed in the mid-1990s,
and since then gene expression data have grown rapidly, paralleling the growth of the sequence and
structure databases.
D. Experimental Data
In addition to sequence and structure information, laboratory experimentation is generating enormous
amounts of biochemical and biophysical data. A number of bioinformatics tools are available
to help analyze and interpret these experimental data.
(1) Statistical analysis: Besides finding relationships between different proteins, much of
bioinformatics involves the statistical analysis of one type of data to infer and understand the
observations for another type of data. An example is the use of sequence and structural data to predict
the secondary and tertiary structures of new protein sequences. These methods are often based on
statistical rules derived from structures, such as the propensity of certain amino acid sequences to form
different secondary structural elements. Another example is the use of structural data to understand a
protein’s function; here studies have investigated the relationship between different protein folds and
their functions, and analyzed similarities between different binding sites in the absence of homology.
Combined with similarity measurements, these studies provide us with an understanding of how much
biological information can be accurately transferred between homologous proteins.
(2) Image analysis: Other fields, for example medical imaging and image analysis, might also be
considered part of bioinformatics. DNA or proteins can be separated by one- or two-dimensional gel
electrophoresis and the resulting gels subjected to image analysis. Image analysis is also an
important tool in microarray analysis when measuring gene expression.
(3) Classification databases: Classification databases are taxonomies of protein structure, and they
bear a strong resemblance to the morphology-based taxonomies developed by biologists. Proteins
that look grossly the same in terms of shape and topology are classified as more closely related than
proteins that look substantially different. The classification databases can be construed as trees with
many branchings at each branch point, much like phylogenetic trees.
(4) Simulation experiments: Bioinformatics is often regarded as the application of computational
techniques to understand and organize the information associated with biological systems.
Experiments can be simulated to predict an organism’s physiology and the properties of drugs, and to
support drug design. Models can be built to predict disease outbreaks in a community or population.
While several machine learning methods are used to predict function from sequence or structure, there
are also many methods for simulating metabolism from known biological functions. Medical science
seeks to understand pathways in order to know which genomic changes give rise to each known
inherited disease. The pharmaceutical industry would like to use this knowledge to produce drugs,
proteins or genetic therapies that can reverse disease phenotypes.
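The statistical rules mentioned under (1), such as the propensity of amino acids to occur in a given
secondary structural element, can be illustrated with a small calculation. The sketch below is only an
illustration under stated assumptions: the function name, the one-letter structure labels (H for helix,
C for coil) and the toy sequences are all hypothetical and are not taken from any real database. The
propensity of a residue is computed as its frequency inside the chosen structural state divided by the
overall frequency of that state.

```python
from collections import Counter


def secondary_structure_propensities(sequences, structures, state="H"):
    """Return {amino_acid: propensity} for one structural state.

    propensity = (fraction of that amino acid found in `state`)
                 / (fraction of all residues found in `state`)
    Values above 1 mean the residue is over-represented in the state.
    """
    residue_total = Counter()
    residue_in_state = Counter()
    for seq, labels in zip(sequences, structures):
        if len(seq) != len(labels):
            raise ValueError("sequence and structure labels differ in length")
        for residue, label in zip(seq, labels):
            residue_total[residue] += 1
            if label == state:
                residue_in_state[residue] += 1

    n_total = sum(residue_total.values())
    n_state = sum(residue_in_state.values())
    if n_state == 0:
        return {}  # the state never occurs, so no propensity is defined
    background = n_state / n_total
    return {
        residue: (residue_in_state[residue] / count) / background
        for residue, count in residue_total.items()
    }


# Toy, made-up data: two short sequences with per-residue labels.
toy_sequences = ["AEAAGG", "AELGPG"]
toy_structures = ["HHHHCC", "HHHCCC"]
helix_propensity = secondary_structure_propensities(toy_sequences, toy_structures)
for residue, value in sorted(helix_propensity.items()):
    print(residue, round(value, 2))
```

Real prediction methods derive such tables from many known structures; a toy table like this one only
shows how the rule is computed.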
Molecular biology deals with biological activity at the molecular level; the other levels of
abstraction are the atomic level, which is the most basic, and the network level. While it is
important to study the atomic basis of biological activity, the approach taken by molecular biologists is
to reduce it to the molecular level. The interactions between various macromolecules and their
transformations within the cell are part of the network study.
The systems approach is useful for understanding complex systems such as biological systems.
The basic building blocks are molecules such as nucleic acids and proteins. The systems view makes it
easier to integrate various approaches to understanding their structure and function. While the
traditional approach is experimental, performing various experiments in the laboratory to understand
the molecular biology, bioinformatics offers a synthetic approach. The synthetic approach is much
faster and, in certain cases, as reliable as the experimental approach (Fig. 28.1).
Bioinformatics Approach
The information flow in the cell, as understood through the central dogma, is fairly simple. However, the
complexity is enormous and gives rise to several problems for the biologist to resolve. Some of these
problems and their bioinformatics approaches are summarized in Table 28.2.
3. Rapid sequencing of other organisms: Comparative whole genome analysis can provide clues about
molecular evolution. While conserved sequences provide information about key motifs, sequence
differences provide information about diversity of form and function. Bioinformatics provides
facilities for sequence storage and for automating the analysis.
4. DNA sequence assembly: Given DNA sequence fragments of 200-700 base pairs in length, the problem
is to assemble them into the original DNA sequence from which the fragments were derived. One
approach is pairwise sequence alignment: for a given pair of sequences (DNA or protein) and a
method of scoring sequence similarity, one determines how similar the two sequences are
(best similarity score) and shows where they match (best alignment).
5. Repetitive sequences in DNA: In the DNA domain, a motivation for multiple sequence alignment
arises in the study of repetitive sequences. These are sequences of DNA, often without clearly
understood biological function, that are repeated many times throughout the genome. The repetitions
are generally not exact, but differ from each other in a small number of insertions, deletions, and
substitutions. As an example, the Alu repeat is approximately 300 bp long, and appears over 600,000
times in the human genome. It is believed that as much as 60% of the human genome may be
attributable to repetitive sequences without known biological function. In order to highlight the
similarities and differences among the instances of such a repeat family, one would like to display a
good multiple sequence alignment of its constituent sequences.
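The pairwise alignment problem described in item 4 (best similarity score and best alignment for two
sequences under a given scoring method) can be sketched with a standard dynamic programming scheme for
global alignment, commonly known as the Needleman-Wunsch algorithm. This is a minimal sketch, not the
method of any particular assembly program: the function name and the scoring values (match +1,
mismatch -1, gap -2) are assumptions chosen only for illustration, and fragment assembly in practice
usually relies on overlap-oriented variants of this idea.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    n, m = len(a), len(b)

    def substitution(i, j):
        # Score for aligning a[i-1] with b[j-1] (1-based matrix indices).
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align two residues
                score[i - 1][j] + gap,                     # gap in b
                score[i][j - 1] + gap,                     # gap in a
            )

    # Trace back from the bottom-right cell to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best_score, row_a, row_b = global_alignment("ACGT", "AGT")
print(best_score)
print(row_a)
print(row_b)
```

The score matrix answers "how similar are the two sequences", and the traceback answers "where do they
match"; changing the scoring values changes which alignment is considered best.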
RNA Level
Simultaneous monitoring of expression of all genes: The mRNA levels define the state of the cell. The
approach is to monitor all mRNAs at a quantitative sensitivity of one molecule per cell, and at a
level sufficient to distinguish alternative splicing. DNA microarrays can
help with this. This information can be used for:
(a) Description—catalogues of proteins present in different cells, different stages and different
environments,
(b) Classification—classify protein re-susceptibilities, disease, and population subtypes,
(c) Circuitry—gene networks and gene expression circuits for development and response pathways.
Monitoring Level and Modification State of all Proteins
It is important to monitor post-translational modifications of proteins, e.g. phosphorylation state,
along with modifications of genetic networks. Proteins can be separated by 2D gel analysis and then
analyzed by mass spectrometry. Mass spectrometry of peptide fragments helps in identifying
protein “signatures”.
Identification of all Basic Protein Shapes
Most probably, there is a limited number of protein shapes and hence a limited number of protein
families. One can compare amino acid sequences against a database of protein shapes. Some of the
bioinformatics databases used for this are:
(a) Pfam—protein multiple sequence alignments and common protein domains
(b) SCOP—Structural classification of proteins
(c) CATH—protein classification by Class, Architecture, Topology, and Homology.
Multiple Sequence Alignment of Proteins
An important motivation for studying the similarity among multiple strings is the fact that protein
databases are often categorized by protein families. A protein family is a collection of proteins with
similar structure (i.e. three-dimensional shape), similar function, or similar evolutionary history. When
one has a newly sequenced protein, one would like to know to which family it belongs, as this provides
hypotheses about its structure, function or evolutionary history. The new protein may not be
particularly similar to a single protein in the database, yet may still share considerable similarity with
the collective members of a family of proteins. One approach is to construct a representation of each
protein family, for example, a good multiple sequence alignment of all its members. Then, when one
has a newly sequenced protein and wants to find its family, one only has to compare it to the
representation of each family. Common structure, function or origin of a molecule may only be weakly
reflected in its sequence. For example, the three-dimensional structure of a protein is very difficult to
infer from its sequence, and yet is very important for predicting its function. Multiple sequence
comparisons may help highlight weak sequence similarity, and shed light on structure, function or
origin.
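The idea of comparing a new protein to a representation of each family, rather than to individual
proteins, can be sketched with a simple position-specific profile. This is a minimal sketch under
simplifying assumptions that are not from the text: each family is given as a gapless alignment of
equal-length sequences, the query has the same length, the background is uniform over the 20 standard
amino acids, and a pseudocount of 1 is used. The family names and sequences are made up for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_profile(aligned_sequences, pseudocount=1.0):
    """Build per-column log-odds scores from a gapless alignment."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_sequences) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in range(length):
        counts = Counter(seq[column] for seq in aligned_sequences)
        profile.append({
            residue: math.log(((counts[residue] + pseudocount) / total) / background)
            for residue in AMINO_ACIDS
        })
    return profile


def profile_score(profile, sequence):
    """Sum the column scores of `sequence` against the profile."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must equal profile length")
    return sum(column[residue] for column, residue in zip(profile, sequence))


def best_family(families, sequence):
    """Return (family name, score) of the best-scoring family profile."""
    scores = {
        name: profile_score(build_profile(members), sequence)
        for name, members in families.items()
    }
    name = max(scores, key=scores.get)
    return name, scores[name]


# Hypothetical toy families, not real protein families.
toy_families = {
    "family_A": ["GKGSFG", "GEGAFG", "GQGSYG"],
    "family_B": ["CPECGK", "CKVCGR", "CPICGK"],
}
print(best_family(toy_families, "GEGSFG"))
```

The query "GEGSFG" is not identical to any single member of family_A, yet it scores well against the
family as a whole, which is the point made above about weak similarity to individual proteins.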
• Search by sequence similarity: find genes by looking for matches to sequences that are
known to be related to genes
• Search by signal: find genes by identifying the sequence signals involved in gene expression
• Search by content: find genes by statistical properties that distinguish protein-coding DNA
from noncoding DNA.
Evidence for genes can consist of matches to:
• Known proteins
• Protein motifs (e.g. zinc finger, ATP- and GTP-binding motifs, etc.)
• Expressed sequence tags (ESTs)
Searching for matches to known proteins:
• Translate the DNA sequence in all reading frames
• Search against a protein database
High-scoring matches suggest the presence of homologous genes in the DNA.
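The first step above, translating a DNA sequence in all reading frames, can be sketched as follows.
The sketch assumes the standard genetic code and a sequence containing only A, C, G and T; the function
names and frame labels (+1 to +3 for the given strand, -1 to -3 for the reverse complement) are
illustrative choices. The database search itself is not shown, since it would depend on the search
program and protein database used.

```python
from itertools import product

# Standard genetic code, listed with bases in the order T, C, A, G.
BASES = "TCAG"
RESIDUES = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), RESIDUES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; '*' marks a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return translations of all six reading frames keyed by frame label."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


for label, protein in six_frame_translation("ATGGCCTAA").items():
    print(label, protein)
```

Each of the six translated sequences would then be compared against a protein database, and
high-scoring matches in any frame point to a candidate gene on the corresponding strand.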
CONCLUSION
The advent of highly advanced electronic communication networks has enabled on-line
applications in biological research, giving biologists access to these databases through the Internet.
Through it, biologists are able to maintain directories, digests, newsletters and bibliographic
databases. Comparison of nucleotide sequence homology between distant genes, sequence analysis,
PCR primer design, protein domain similarity searches and location of coding sequences are some of the
many uses of these databases and software programmes. Biologists must be made aware of on-line
tools and services such as GOPHER, BLAST, FASTA, WWW and WAIS.
Computer-aided applications have accelerated the pace of scientific research and have contributed
greatly to problems ranging from the prediction of molecular properties of compounds to molecular
modeling, molecular graphics and 3D structure determination by means of image processing.
Bioscientists must, therefore, keep pace with the development of efficient, accurate and selective
techniques, software and programmes for an effective and efficient management and