Intro To Bioinfo
2
2000
A major event happened that was to
change the course of human history
It was largely a joint British and American
effort
It was a race: who would finish first?
A race test: not whether they had
taken drugs, but whether they could
produce them!
The human genome was sequenced (a working draft was announced)
3
Bioinformatics is:
driven by the generation of data,
moderated by hardware and
analysis methods
[Figure: computing power, analysis methods, data generation platforms]
4
What is Bioinformatics?
The merging between computer
science and molecular biology
The algorithms and techniques of
computer science are being used to
solve the problems faced by molecular
biologists
‘Information technology applied to
the management and analysis of
biological data’
Storage and Analysis are two of the
important functions – bioinformaticians
build tools for each
5
[Figure: Bioinformatics in relation to Biology, Chemistry, Computer Science and Statistics]
6
What is..
7
Basics of Molecular Biology
8
DNA & Genes
DNA is where the genetic information is
stored
Traits such as blond hair and blue eyes are
inherited through it
Gene - the basic unit of heredity
There are genes for characteristics, e.g. genes
that influence hair colour
Genes contain the information as a
sequence of nucleotides
Genes are abstract concepts, like
longitude and latitude, in the sense that
you cannot see them separately
Genes are made up of nucleotides
9
Nucleotide (nt)
Each nt is made up of
Sugar
Phosphate group
Base
The base is the only difference
between one nt and another
There are 4 different bases
G(uanine), A(denine), T(hymine), C(ytosine)
The information is in the order of the nucleotides:
the order is the info
Genes can be many thousands of nt long
The complete set of genetic instructions is
called the genome
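To make "the order is the info" concrete, here is a minimal Python sketch. The two sequences are made up for illustration and `base_composition` is a hypothetical helper name. It shows two DNA strings with exactly the same bases that still carry different information because the order differs.

```python
from collections import Counter

def base_composition(seq):
    """Count how often each of the four bases occurs in a DNA string."""
    seq = seq.upper()
    unknown = set(seq) - set("GATC")
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    return Counter(seq)

# Two made-up sequences: same bases, different order.
seq_a = "GATTACA"
seq_b = "ACATTAG"

# The composition is identical ...
same_composition = base_composition(seq_a) == base_composition(seq_b)
# ... but the sequences (the information) are not.
same_information = seq_a == seq_b
```

Here `same_composition` is true while `same_information` is false, which is why sequence data is stored and compared as ordered strings rather than as base counts.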
10
Proteins
Proteins are very important
biological molecules
Amino acids make up the proteins
There are 20 different amino acids
The function of a protein is
dependent on the order of the amino
acids
11
Proteins…
The information required to make proteins is stored
in DNA
DNA sequence determines amino acid
sequence
Amino Acid sequence determines protein
structure
Protein structure determines protein function
A substance called RNA is used to carry the
info stored in the DNA, which in turn is used to
make proteins
Storage - DNA
Information Transfer – RNA
RNA is the message boy!
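The DNA-to-RNA step can be sketched in a few lines of Python. This is a simplification under one stated assumption: the DNA is given as the coding strand, so the message has the same order of bases with U in place of T. The function name `transcribe` is illustrative only.

```python
def transcribe(coding_strand):
    """Return the mRNA for a DNA coding strand: same order, U instead of T."""
    dna = coding_strand.upper()
    unknown = set(dna) - set("GATC")
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    return dna.replace("T", "U")

# Made-up example sequence.
message = transcribe("ATGGCCTAA")
```

For the made-up input above, `message` is `"AUGGCCUAA"`: DNA remains the storage form and the RNA string is the copy that gets read.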
12
Central dogma
13
14
Proteins…..
There are 20 amino acids to encode, so one nt
cannot correspond to one aa (only 4 possibilities),
and neither can pairs of nt (4 x 4 = 16)
So protein information is carried in triplet
codes, called codons (4 x 4 x 4 = 64)
The codons that do not correspond to an
amino acid are stop codons: UAA, UAG,
UGA (RNA has U instead of T)
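The triplet code can be sketched in Python as a lookup table. The table below is the standard genetic code written with one-letter amino acid symbols (`*` marks a stop codon); the names `CODON_TABLE` and `translate` are illustrative. The sketch assumes the mRNA is already in the correct reading frame.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG order; '*' = stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# 4 x 4 x 4 = 64 codons, each mapped to an amino acid or a stop signal.
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna):
    """Read an mRNA string codon by codon until a stop codon or the end."""
    mrna = mrna.upper()
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # UAA, UAG or UGA
            break
        protein.append(amino_acid)
    return "".join(protein)

stop_codons = sorted(c for c, a in CODON_TABLE.items() if a == "*")
```

`translate("AUGGCCUAA")` gives `"MA"` (methionine, alanine, then stop), and `stop_codons` lists the three stop codons named above.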
15
Protein Structure
Protein structure shows wide variety, as opposed to DNA,
whose structure is uniform
X-ray crystallography or Nuclear Magnetic
Resonance (NMR) is used to determine the
structure
Structure is related to function, or rather
structure determines function
Although proteins are created as a linear
chain of aa, they fold into a 3D structure.
If you stretch them and let go, they will usually go
back to this structure: this is the native structure
of a protein
Only in its native structure does a protein function
well
Even after translation is over, a protein goes
through some changes to its structure
16
Bioinformatics
Techniques…..
17
Prediction and Pattern
Recognition
The two main areas of bioinformatics
are
Pattern recognition
‘A particular sequence or structure has
been seen before’, so a particular
characteristic can be associated with it
Prediction
From a sequence (what we know) we
can predict the structure and function
(what we don’t know)
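As a small illustration of pattern recognition, the Python sketch below scans a protein sequence for a known motif. It uses the N-glycosylation site pattern N-{P}-[ST]-{P} (asparagine, anything but proline, serine or threonine, anything but proline) written as a regular expression. The protein string is made up and `find_motif` is a hypothetical helper name.

```python
import re

# N-{P}-[ST]-{P} as a regular expression.
# The lookahead (?=...) lets overlapping sites be reported too.
N_GLYCOSYLATION = re.compile(r"(?=N[^P][ST][^P])")

def find_motif(protein, pattern=N_GLYCOSYLATION):
    """Return the 0-based start positions where the motif occurs."""
    return [m.start() for m in pattern.finditer(protein.upper())]

# Made-up protein sequence for illustration.
sites = find_motif("MKNGSAANPTQNLTV")
```

For this input `sites` is `[2, 11]`: the stretch starting at position 7 (`NPTQ`) is rejected because a proline follows the asparagine. A match means "this has been seen before", so the associated characteristic can be proposed for the new sequence.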
18
Dot plots….
21
Dynamic Programming
As the length of the query sequences
increases and the difference in length
between the two sequences also increases,
more gaps have to be inserted in various
places
We cannot perform an exhaustive search
Combinatorial explosion occurs: too many
combinations to search through
Dynamic programming avoids this by building
the best alignment from the best alignments
of shorter prefixes, so it is exact rather than heuristic
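A minimal Python sketch of dynamic programming for global alignment (in the style of Needleman-Wunsch) is given below. The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration, and the function only returns the best score, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]
```

Each cell is computed once from three neighbours, so the work grows with the product of the two lengths instead of with the number of possible gap placements. For example, `global_alignment_score("ACGT", "ACT")` is 1: three matches and one gap.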
22
Databases
Sequence info is stored in databases
so that it can be manipulated
easily
The databases (next slide) are located at different
places
They exchange info on a daily basis
so that they are up to date and
in sync
Primary databases hold sequence data
23
Nucleic acid (DNA/RNA)
sequence databases
One main database arising from a partnership between
GenBank at the NCBI (National Center for
Biotechnology Information, USA), the EMBL data
library at the EBI (European Bioinformatics Institute,
UK) and the DNA Data Bank of Japan (DDBJ) at the NIG (National
Institute of Genetics, Japan).
Daily exchanges between the 3 partners keep the
databases synchronised.
DNA and RNA sequences: curated, archived,
distributed.
Sequences come from genome projects, scientific articles and
patent applications. Most scientific journals require DNA
and RNA sequences related to each publication to be
publicly available.
Sequences are deposited early and go through a review
cycle: unannotated, preliminary, unreviewed,
standard.
Format: human and computer readable.
24
25
Major Primary DB
Nucleic Acid:
EMBL (Europe)
GenBank (USA)
DDBJ (Japan)
NCBI
Protein:
PIR (Protein Information Resource)
MIPS, NCBI
SWISS-PROT (University of Geneva, now with EBI)
TrEMBL (a supplement to SWISS-PROT)
NRL-3D
Composite DB
27
Composite DB
GenBank
NRL-3D
28
Secondary db
Compo Primary
DB Source
PROSITE SWISS-PROT
PRINTS OWL
29
Structural databases
30
Database Searches
Many genes have already been sequenced and
identified, so we know what they do
The sequences are stored in databases
So if we find a new gene in the human
genome we compare it with the already
known genes which are stored in the
databases.
Since the databases hold a very large number
of sequences, we cannot do a full sequence
alignment against each and every sequence
So heuristics must be used.
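One common kind of heuristic is to index short words (k-mers) from the database once, then only look closely at entries that share words with the query. The Python sketch below illustrates that idea under stated assumptions: the word length, the threshold, the gene names and the sequences are all made up, and it is not the algorithm of any specific search tool.

```python
from collections import defaultdict

def build_kmer_index(database, k=3):
    """Map every k-letter word to the names of the sequences containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def candidate_hits(query, index, k=3, min_shared=2):
    """Names of database entries sharing at least min_shared words with query."""
    shared = defaultdict(int)
    query_words = {query[i:i + k] for i in range(len(query) - k + 1)}
    for word in query_words:
        for name in index.get(word, ()):
            shared[name] += 1
    return sorted(name for name, count in shared.items() if count >= min_shared)

# Made-up toy database.
toy_db = {"geneA": "ATGGCCATTG", "geneB": "TTTTTTTTTT", "geneC": "GGCCATAAAA"}
toy_index = build_kmer_index(toy_db)
hits = candidate_hits("GGCCAT", toy_index)
```

Here `hits` is `["geneA", "geneC"]`; only those candidates would then be passed to a full alignment, which is far cheaper than aligning the query against every stored sequence.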
31
Areas in
Bioinformatics…
32
Genomics
In a multicellular organism, each
cell type expresses its genes in a
different way, although every cell has the
same content as far as the genetic
information is concerned
i.e. all the information for a liver cell to be a
liver cell is also present in a nose cell, so
gene expression is the only thing that
differentiates them
33
Genomics - Finding Genes
Gene in sequence data: needle in a
haystack
However, whereas a needle looks different
from the hay, genes do not look different
from the rest of the sequence data
In a whole array of nt we try to find
and mark the borders of a set of nt as a gene
This is one of the challenges of
bioinformatics
Neural networks and dynamic
programming are being employed
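The border-marking problem can be illustrated with a much simpler technique than the neural network and dynamic programming methods mentioned above: a naive open reading frame (ORF) scan. The Python sketch below looks, on the forward strand only, for an ATG followed in the same frame by a stop codon. The function name, the `min_codons` threshold and the test sequence are assumptions for illustration. Real gene finders need far more evidence than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ATG...stop stretches, forward strand."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three possible reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # candidate left border
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3)) # right border after the stop
                start = None
    return sorted(orfs)

# Made-up sequence with one short ORF.
toy_orfs = find_orfs("CCATGGCCTAAGG")
```

For this input `toy_orfs` is `[(2, 11)]`, marking the stretch `ATGGCCTAA`.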
34
Table columns: Organism | Genome Size (Mb; 1 Mb = 1,000,000 bp) | Gene Number | Web Site
37
Protein Structure Prediction
38
Structure Prediction methods
Comparative Modeling
The target protein's structure is
modeled on that of related proteins
Proteins with similar sequences and
known structures are searched for
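The template search step can be sketched in Python as picking the known-structure protein whose sequence is most similar to the target. This sketch assumes the sequences are already aligned to equal length. The 30% identity cut-off is a rule of thumb used here as an assumption, and the names `percent_identity`, `pick_template`, `tmplA` and `tmplB` are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of aligned positions with the same residue (gaps never match)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def pick_template(target, candidates, threshold=30.0):
    """Return (name, identity) of the best template, or None if none is close."""
    if not candidates:
        return None
    best = max(candidates, key=lambda name: percent_identity(target, candidates[name]))
    identity = percent_identity(target, candidates[best])
    return (best, identity) if identity >= threshold else None

# Made-up target and candidate sequences with (assumed) known structures.
choice = pick_template("MKTAYIAK", {"tmplA": "MKTAYLAK", "tmplB": "MQSGYIEE"})
```

Here `choice` is `("tmplA", 87.5)`; the structure of that template would then serve as the starting point for modeling the target.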
39
Phylogenetics
The taxonomic system reflects
evolutionary relationships
Phylogenetic trees show
the evolutionary relationships as a
picture/graph
Rooted trees: there is a single common
ancestor
Unrooted trees: just show the
relationships
Phylogenetic tree reconstruction algorithms
are also an area of research
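One classic reconstruction approach can be sketched in Python: UPGMA, which repeatedly joins the two closest clusters and yields a rooted tree. The sketch assumes equal-length aligned sequences and uses the fraction of differing positions as the distance. The four sequences are toy data invented for illustration, not real genes, and the tree is returned as nested tuples.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def upgma(sequences):
    """Build a rooted tree (nested tuples) by average-distance clustering."""
    # Each cluster is keyed by its subtree and stores its leaf names.
    clusters = {name: [name] for name in sequences}

    def average_distance(t1, t2):
        pairs = [(x, y) for x in clusters[t1] for y in clusters[t2]]
        total = sum(p_distance(sequences[x], sequences[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        t1, t2 = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        clusters[(t1, t2)] = clusters.pop(t1) + clusters.pop(t2)
    return next(iter(clusters))

# Toy sequences, invented for illustration only.
toy = {
    "human": "ACGTACGT",
    "chimp": "ACGTACGA",
    "mouse": "ACGAACTA",
    "chicken": "TCTAGCTT",
}
tree = upgma(toy)
```

For the toy data, `tree` is `("chicken", ("mouse", ("human", "chimp")))`: the two most similar sequences are joined first and the most different one joins last, at the root.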
40
Applications….
41
Medical Implications
Pharmacogenomics
Not all drugs work on all patients; some good
drugs cause death in some patients
So by doing a gene analysis before the
treatment, the harmful drugs can be avoided
Also, drugs which cause death in most patients can be
used on the minority whose genes that drug is
well suited to - volunteers wanted!
Customized treatment
Gene Therapy
Replace or supply the defective or missing gene
E.g. insulin, or Factor VIII for haemophilia
BioWeapons (??)
42
Diagnosis of Disease
Identifying the genes which cause a
disease helps detect the disease at an early
stage, e.g. Huntington's disease
Symptoms: uncontrollable dance-like
movements, mental disturbance,
personality changes and intellectual
impairment
Death in 10-15 years
The gene responsible for the disease has
been identified
43
Drug Design
Developing a drug can take up to 15 years and
cost $700 million
One of the goals of bioinformatics
is to reduce the time and cost
involved.
The process
Discovery
Computational methods can improve
this
Testing
44
Discovery
Target identification
Identifying the molecule on which the
germ relies for its survival
Then we develop another molecule,
i.e. a drug, which will bind to the target
So the germ will not be able to interact
with the target.
Proteins are the most common targets
45
Discovery…
47
Related Computer
Technology………….
48
PERL
Perl is commonly used for
bioinformatics work because of its ability
to manipulate character strings
The de facto CGI language
It started out as a scripting language
but has become a fully fledged
language
It has everything now, even web
service support
http://bio.perl.org
49
The place of XML & Web
Services
Various markup languages are being created,
e.g. Gene Markup Language, to represent
sequence/gene data
Web Services: program-to-program interaction,
making the web application-centric as opposed to
human-centric
So this has to be platform and language independent
Protocols like SOAP help in this regard
In bioinformatics various databases, platforms,
languages etc. are being used
So web services help achieve platform
independence and program interaction
Since sequence databases come in various formats
and on various platforms, SOAP also helps in this regard
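To illustrate why a markup format helps programs exchange sequence data, the Python sketch below parses a small XML record with the standard library. The element and attribute names (`sequence`, `organism`, `residues`, `id`, `type`) are hypothetical and invented for this example. They are not the schema of Gene Markup Language or of any real database.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout, made up for illustration.
RECORD = """<sequence id="demo1" type="dna">
  <organism>Homo sapiens</organism>
  <residues>ATGGCCTAA</residues>
</sequence>"""

def parse_sequence_record(xml_text):
    """Turn one XML sequence record into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "type": root.get("type"),
        "organism": root.findtext("organism"),
        "residues": root.findtext("residues", default="").strip(),
    }

record = parse_sequence_record(RECORD)
```

Any program on any platform that understands the agreed tags can read the same record, which is the kind of independence described above.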
50
Data bases and Mining
51
European Molecular Biology
Network (EMBnet)
A network for sharing, training
and centralizing up-to-date bioinformatics info
Some of the EMBnet sites are:
SEQNET
http://www.seqnet.dl.ac.uk
UCL
http://www.biochem.ucl.ac.uk/bsm/dbbrowser/embnet/
EBI – European Bioinformatics
Institute
www.ebi.ac.uk
52
53