How Bioinformatics can change your life

Basic Concepts of Bioinformatics
 Introduction
 Basic concepts in Molecular biology
 Bioinformatics techniques
 Areas in bioinformatics
 Applications
 Related Computer Technology
 Conference in Glasgow
 Acknowledgements
 Reference
 A major event happened that was to
change the course of human history
 It was a joint British and American effort
 Nothing to do with Iraq!
 It was a race: who will complete it first?
 Race test: not whether they have
taken drugs but whether they can
produce them!
 The human genome was sequenced
A Situ…somewhere in the
near future
 A virus (not the ‘I love you’ virus) creates an epidemic
 Geneticists and bioinformaticians role on their
 Genetic material of the virus is compared with the
existing base of known genetic material of other viruses
 As the characteristics of the other viruses are
 From the genetic material, computer programs will
derive the proteins necessary for the survival of the virus
 When the protein (sequence and structure) is
known then medicines can be designed

What is Bioinformatics?
 The marriage between computer
science and molecular biology
 The algorithms and techniques of
computer science are being used to
solve the problems faced by molecular biology
 ‘Information technology applied to
the management and analysis of
biological data’
 Storage and Analysis are two of the
important functions – bioinformaticians
build tools for each

[Diagram: Biology, Chemistry, Computer Science, Statistics]


What is..
 This is the age of the Information
 However storing info is nothing new
 Information equal in volume to the
Britannica Encyclopedia is stored in
each of our cells
 ‘Bioinformatics tries to determine
what info is biologically important’

Molecular Biology….

DNA & Genes
 DNA is where the genetic information is stored
 Blonde hair and blue eyes are inherited by
 Gene - The basic unit of heredity
 There are genes for characteristics, e.g. a gene
for blond hair etc.
 Genes contain the information as a
sequence of nucleotides
 Genes are abstract concepts – like
longitude and latitudes in the sense that
you cannot see them separately
 Genes are made up of nucleotides

Nucleotide (nt)
 Each nt is made up of
 Sugar
 Phosphate group
 Base
 The base it contains is the only
difference between one nt and another
 There are 4 different bases
 G(uanine), A(denine), T(hymine), C(ytosine)
 The information is in the order of the nucleotides:
the order is the info
 Genes can be many thousands of nt long
 The complete set of genetic instructions is
called the genome

 DNA strings make

 Analogy
 Letters - nt
 Sentences - genes
 Individual volumes of Britannica encyclopedia - chromosomes
 All volumes together - genome

Double Helix
 The DNA is a double helix
 Each strand has complementary
 Each particular base in one strand is
bonded with a particular base in the
other strand
G- C
A- T
 For example -
 AATGC one strand
 TTACG other strand
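
The pairing rule above (G with C, A with T) can be sketched in a few lines of Python. This is only an illustration: the names `PAIRS` and `complement` are made up for this example, and the partner strand is written in the same left-to-right order as on the slide.

```python
# Base-pairing rules from the slide: G pairs with C, A pairs with T.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the partner strand, base by base, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand.upper())


print(complement("AATGC"))  # TTACG
```

Because every base has exactly one partner, either strand is enough to rebuild the other.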

 Proteins are a very important
biological feature
 Amino acids (aa) make up the proteins
 There are 20 different amino acids
 The function of a protein is
dependent on the order of the amino acids

 The information required to assemble aa into proteins is
stored in DNA
 DNA sequence determines amino acid sequence
 Amino acid sequence determines protein structure
 Protein structure determines protein function
 A Substance called RNA is used to carry
the Info stored in the DNA that in turn is
used to make proteins
 Storage - DNA
 Information Transfer – RNA
 RNA is the message boy!
Central dogma

DNA -> (transcription, RNA polymerase) -> RNA -> (translation, ribosomes) -> Protein

 Since there are 20 amino acids to
encode, one nt cannot correspond
to one aa (only 4 possibilities), and neither
can two nt (only 4 x 4 = 16)
 So protein information is carried in
triplet codes called codons (4 x 4 x 4 = 64)
 The codons that do not correspond
to an amino acid are stop codons: UAA,
UAG, UGA (RNA has U instead of T)
 Some codons are used as start
codons: AUG starts translation as well as coding
for an amino acid (methionine)
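
The route from DNA to RNA to protein can be sketched as a small Python example. It is a simplification under stated assumptions: `transcribe` just swaps T for U on the coding strand (the cell actually copies the opposite strand), `translate` starts at the first AUG and stops at the first stop codon, and all names are made up for this illustration. The table is the standard genetic code, with `*` marking the three stop codons.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, listed in the order
# UUU, UUC, UUA, UUG, UCU, ... GGG; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(dna: str) -> str:
    """Simplified transcription: rewrite the coding strand with U instead of T."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Read codons from the first AUG until a stop codon; return one-letter aa codes."""
    start = rna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# 4 one-letter codes, 16 two-letter codes, 64 three-letter codes
print([len(BASES) ** n for n in (1, 2, 3)])
print(translate(transcribe("ATGGCCTAA")))  # MA
```

Only the three-letter code has enough combinations (64) to cover 20 amino acids plus the stop signals.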
Protein Structure
 Shows a wide variety as opposed to
DNA, whose structure is uniform
 X-ray crystallography or Nuclear Magnetic
Resonance (NMR) is used to figure out the structure
 Structure is related to the function, or rather
structure determines the function
 Although proteins are created as a linear
chain of aa, they fold into a 3D structure
 If you stretch them and let go, they will
go back to this structure: this is the native
structure of a protein
 A protein functions well only in its native structure
 Even after translation is over, a protein
goes through some changes to its structure
Gene Expression
 Gene expression: the process of
transcribing DNA and translating RNA
to make protein
 Where do the genes begin in a
 How does the RNA identify the beginning
of a gene to make a protein
 A single nt cannot be taken to point out the
beginning of a gene as they occur
 But a particular combination of nucleotides
can be
 Promoter sequences – the order of nt
which mark the beginning of a gene
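
A rough calculation shows why a single nt cannot mark the start of a gene but a longer pattern can. The sketch below assumes the four bases are equally likely and independent at every position, which real genomes only approximate; `expected_chance_hits` is a name made up for this example.

```python
def expected_chance_hits(motif_length: int, sequence_length: int) -> float:
    """Expected number of positions where one fixed motif appears purely by chance.

    Assumes the four bases are equally likely and independent at every position.
    """
    positions = max(sequence_length - motif_length + 1, 0)
    return positions * 0.25 ** motif_length


for k in (1, 6, 12):
    print(k, expected_chance_hits(k, 1_000_000))
```

In a sequence of one million nt a given single base is expected about 250,000 times, while a given 12 nt pattern is expected less than once, so a longer pattern is a far more specific signal.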

Prediction and Pattern
 The two main areas of bioinformatics
 Pattern recognition
 ‘A particular sequence or structure has
been seen before’ and that a particular
characteristic can be associated with it
 Prediction
 From a sequence (what we know) we
can predict the structure and function
(what we don’t know)
Dot plots….
 Simple way of evaluating
similarity between two sequences
 In a graph one sequence is on
one axis and the other sequence on the other axis
 Where there are matches
between the two sequences the
graph is marked
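
A dot plot is simple enough to draw as text. The following is a minimal sketch: one sequence runs across the top, the other down the side, and `*` marks every pair of matching characters. The function name and the two example sequences are made up for this illustration.

```python
def dot_plot(seq_a: str, seq_b: str) -> str:
    """Text dot plot: seq_a across the top, seq_b down the side, '*' where characters match."""
    rows = ["  " + seq_a]
    for b in seq_b:
        marks = "".join("*" if a == b else "." for a in seq_a)
        rows.append(b + " " + marks)
    return "\n".join(rows)


print(dot_plot("GATTACA", "GATCACA"))
```

Stretches where the two sequences agree show up as diagonal runs of `*`.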
 A match for similarity between the characters of two or
more sequences
 There are many ways to align two sequences
 So which one do we choose and on what basis?
 The solution is to provide a match score and a mismatch score

 Introduce gaps and a penalty
score for gaps

 However not all gaps are bad

 How do we align?
 ---CAA---
 These gaps at the ends are not biologically significant
 Semi-global alignments do not penalize them

Scoring Matrix
 For DNA/protein sequence alignment we create a matrix of scores
 A aligned with A: score is 1
 A aligned with T: score is -5
 A aligned with C: score is -1
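
To compare candidate alignments, each column is looked up in the scoring matrix and the scores are added. The sketch below uses the three values from the slide. Everything else is an assumption made for illustration: other identical pairs score 1, other mismatches score -1, and a gap (`-`) costs -2.

```python
# Pair scores given on the slide.
SCORES = {("A", "A"): 1, ("A", "T"): -5, ("A", "C"): -1}
# Assumed values for everything the slide does not list.
DEFAULT_MATCH = 1
DEFAULT_MISMATCH = -1
GAP_PENALTY = -2


def pair_score(x: str, y: str) -> int:
    """Score one alignment column; '-' denotes a gap."""
    if x == "-" or y == "-":
        return GAP_PENALTY
    if (x, y) in SCORES:
        return SCORES[(x, y)]
    if (y, x) in SCORES:
        return SCORES[(y, x)]
    return DEFAULT_MATCH if x == y else DEFAULT_MISMATCH


def alignment_score(row_a: str, row_b: str) -> int:
    """Sum the column scores of two already-aligned, equal-length rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(pair_score(x, y) for x, y in zip(row_a, row_b))


print(alignment_score("AAC", "ATC"))  # 1 - 5 + 1 = -3
print(alignment_score("A-C", "ATC"))  # 1 - 2 + 1 = 0
```

With these numbers, opening a gap is cheaper than forcing A against T, so the second alignment is preferred.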

Dynamic Programming
 As the length of the query sequences
increases and the difference in length
between the two sequences also increases,
more gaps have to be inserted in various positions
 We cannot perform an exhaustive search
 Combinatorial explosion occurs: too many
combinations to search
 Dynamic programming avoids the exhaustive search:
it builds the best alignment from the best alignments
of shorter prefixes, so each sub-problem is solved only once
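
One well-known dynamic programming formulation for sequence alignment is Needleman-Wunsch global alignment. The sketch below computes only the best score, not the alignment itself, and the match, mismatch and gap values are assumptions chosen for illustration.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b (Needleman-Wunsch, score only)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned against a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned against a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # 7
print(global_alignment_score("AC", "AGC"))           # 0
```

The work grows with the product of the two lengths instead of with the number of possible gap placements.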
 Sequence info is stored in databases (db)
 So that it can be manipulated
 The db (next slide) are located
at different places
 They exchange info on a daily
basis so that they are up to date
and in sync
 Primary db: sequence data
Major Primary DB
Nucleic Acid       Protein
EMBL (Europe)      PIR - Protein Information Resource
GenBank (USA)      MIPS
                   SWISS-PROT (University of Geneva, now with EBI)
                   TrEMBL (a supplement to SWISS-PROT)
Composite DB
 As there are many db, which one should we
search? Some are good in some
aspects and weak in others
 A composite db is the answer: it
has several db as its base data
 Search on these db is indexed and
streamlined so that the same stored
sequence is not searched twice in
different db

Composite DB
 OWL has these as its primary sources
 SWISS PROT (top priority)
 GenBank
 NRL-3D
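
The idea of a non-redundant composite db can be sketched as a merge that keeps each sequence only once, taking it from the highest-priority source that has it. This is an illustration, not how OWL is actually built. The slide only says SWISS-PROT has top priority, so the order of the other two is an assumption taken from the slide order. The entry ids and sequences in the usage example are invented.

```python
# SWISS-PROT first as on the slide; the order of the other two is assumed.
PRIORITY = ["SWISS-PROT", "GenBank", "NRL-3D"]


def build_composite(sources: dict) -> dict:
    """Merge {db_name: [(entry_id, sequence), ...]} into {sequence: (db_name, entry_id)}.

    A sequence found in several sources is kept once, from the highest-priority one.
    """
    composite = {}
    for db_name in PRIORITY:
        for entry_id, sequence in sources.get(db_name, []):
            composite.setdefault(sequence, (db_name, entry_id))
    return composite


# Hypothetical entries, for illustration only.
example = {
    "GenBank": [("gb1", "MKV"), ("gb2", "MAA")],
    "SWISS-PROT": [("sp1", "MKV")],
    "NRL-3D": [("n1", "MAA"), ("n2", "MGG")],
}
print(build_composite(example))
```

A search over the merged result touches each distinct sequence once, however many source db contain it.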

Secondary db
 Store secondary structure info
or results of searches of the
primary db



Database Searches
 We have sequenced and identified many
genes, so we know what they do
 The sequences are stored in databases
 So if we find a new gene in the
human genome we compare it with
the already found genes which are
stored in the databases.
 Since there are a large number of
databases we cannot do a full sequence
alignment against each and every entry
 So heuristics must be used.
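
One common kind of search heuristic, in the spirit of word-based search tools, is to keep only those database sequences that share some short words (k-mers) with the query and to align only those. The sketch below is an assumed example of that idea, not the method of any particular tool. The function names, the word length `k` and the threshold `min_shared` are all made up for illustration.

```python
def kmers(seq: str, k: int) -> set:
    """All distinct words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_hits(query: str, database: dict, k: int = 4, min_shared: int = 2) -> list:
    """Return (name, shared_word_count) for entries sharing enough k-mers with the query."""
    query_words = kmers(query, k)
    hits = []
    for name, seq in database.items():
        shared = len(query_words & kmers(seq, k))
        if shared >= min_shared:
            hits.append((name, shared))
    return sorted(hits, key=lambda hit: -hit[1])


# Hypothetical database entries, for illustration only.
db = {"s1": "ACGTACGTAC", "s2": "TTTTTTTTTT"}
print(candidate_hits("ACGTAC", db))  # [('s1', 3)]
```

The filter can miss a true relative that shares no exact word with the query. That is the price of speed, and it is what makes this a heuristic.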
Areas in bioinformatics

 Because of the multicellular structure, each
cell type does gene expression in a
different way, although each cell has the
same content as far as the genetic information goes
 i.e. all the information for a liver cell to be a
liver cell is also present in a nose cell, so
gene expression is the only thing that differs

Genomics - Finding Genes
 Gene in sequence data: a needle in a haystack
 However, whereas the needle is different
from the haystack, genes are not different
from the rest of the sequence data
 In a whole array of nt we try to find and
mark the borders of a set of nt as a gene
 This is one of the challenges of bioinformatics
 Neural networks and dynamic
programming are being employed
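
The simplest possible way to mark the borders of a candidate gene is to look for an open reading frame: a start codon followed, in the same reading frame, by a stop codon. The sketch below is much cruder than the neural network and dynamic programming methods mentioned above. It scans only the forward strand, and `find_orfs` and `min_codons` are names made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA spelling of UAA, UAG, UGA


def find_orfs(dna: str, min_codons: int = 2) -> list:
    """Return (start, end) index pairs of ATG...stop stretches in the three forward frames.

    end is exclusive and includes the stop codon; min_codons counts codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


sequence = "CCATGAAATAGCC"
for begin, end in find_orfs(sequence):
    print(begin, end, sequence[begin:end])  # 2 11 ATGAAATAG
```

Real gene finders also have to deal with the reverse strand, with chance open reading frames, and with genes that are split into pieces.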

Organism       Genome size (Mbp)   Gene number   Web site
Yeast          13.5                6,241         http://genome-
Fruit flies    180                 13,601
Homo sapiens   3,000               45,000        http://www.ncbi.n
 Proteome is the sum total of an
organism’s proteins
 More difficult than genomics
 4 nucleotides versus 20 amino acids
 Simple chemical makeup versus complex
 DNA can be duplicated, proteins can’t
 We are entering the ‘post
genome era’
 Meaning much has been done with
the genes, not that it is all over
 The relationship between the RNA and the protein it codes are
usually very different
 After translation proteins do change
 So the aa sequence does not tell us anything about the post
translation changes
 Proteins are not active until they are combined into a larger
complex or moved to a relevant location inside or outside the cell
 So the aa sequence only hints at these things
 Also proteins must be handled more carefully in labs as they tend
to change when in touch with an inappropriate material

Protein Structure Prediction

 Is one of the biggest challenges
of bioinformatics and esp.
 There is no algorithm now that can
consistently predict the structure
of proteins

Structure Prediction methods

 Comparative Modeling
 The target protein’s structure is
compared with related proteins
 Proteins with similar sequences
are searched for known structures

 The taxonomical system reflects
evolutionary relationships
 Phylogenetic trees are things which reflect
the evolutionary relationship through a
 Rooted trees where there is only one
 Unrooted trees just showing the
 Phylogenetic tree reconstruction algorithms
are also an area of research

Medical Implications
 Pharmacogenomics
 Not all drugs work on all patients, some good
drugs cause death in some patients
 So by doing a gene analysis before the
treatment the offensive drugs can be avoided
 Also drugs which cause death to most can be
used on a minority to whose genes that drug is
well suited – volunteers wanted!
 Customized treatment
 Gene Therapy
 Replace or supply the defective or missing gene
 E.g. insulin, and Factor VIII for haemophilia

 BioWeapons (??)

Diagnosis of Disease
 Diagnosis of disease
 Identification of the genes which cause a
disease will help detect the disease at an early
stage, e.g. Huntington’s disease
 Symptoms – uncontrollable dance like
movements, mental disturbance,
personality changes and intellectual
 Death in 10-15 years
 The gene responsible for the disease has
been identified
 Contains excessively repeated sections of
 So once analyzed the couple can be
Drug Design
 Can go up to 15yrs and
 One of the goals of
bioinformatics is to reduce the
time and cost involved with it.
 The process
 Discovery
 Computational methods can
improve this
 Testing
Target identification
 Identifying the molecule on which the
germ relies for its survival
 Then we develop another molecule
i.e. drug which will bind to the target
 So the germ will not be able to interact
with the target.
 Proteins are the most common targets

 For example HIV produces HIV
protease, which is a protein and
which in turn eats other proteins
 This HIV protease has an active
site where it binds to other proteins
 So an HIV drug will go and bind
with that active site
 Easier said than done!
 Lead compounds are the
molecules that go and bind to
the target protein’s active site
 Traditionally this has been a trial
and error method
 Now this is being moved into the
realm of computers

Related Computer Technology

 Perl is commonly used for
bioinformatics calculations because of its
ability to manipulate character strings
 The default CGI language
 It started out as a scripting language
but has become a fully fledged programming language
 It has everything now, even web
service support
Data bases and Mining
 Many of the sequence databases are
publicly available
 As there is a DB involved, various
data mining techniques are used to
pull the data out
 As there is a lot of literature (articles
etc.) in this area, data mining on
the literature, not on the sequence
data, has also become a PhD topic
for many
European Molecular Biology
Network (EMBnet)
 A central system for sharing, training
and centralizing up to date bio info
 Some of the EMBnet sites are:
 EBI – European Bioinformatics Institute
 Dan E. Krane and Michael L. Raymer, Basic Concepts of Bioinformatics
 Arthur M Lesk, Intro to Bioinformatics
 T.K. Attwood & D. J. Parry-Smith, Intro to Bioinformatics
 Dr Patrick Dixon, The Genetic Revolution
 Prof David Gilbert’s Site

Thank You!

