
How Bioinformatics can change your life

Basic Concepts of Bioinformatics


Introduction……

2000
 A major event happened that was to change the course of human history
 It was an international effort, led largely by American and British groups
 It was a race: who would finish first?
 Race test: not whether they have taken drugs but whether they can produce them!
 A working draft of the human genome sequence was announced

Bioinformatics is:
driven by the generation of data, moderated by hardware and analysis methods
[Figure labels: Computing power; Analysis methods; Data generation platforms]
What is Bioinformatics?
 The merging of computer science and molecular biology
 The algorithms and techniques of computer science are used to solve the problems faced by molecular biologists
 ‘Information technology applied to the management and analysis of biological data’
 Storage and analysis are two of its most important functions; bioinformaticians build tools for each

[Figure: Bioinformatics at the intersection of Biology, Chemistry, Computer Science and Statistics]

What is Bioinformatics? (continued)
 This is the age of information technology
 However, storing information is nothing new
 Information equal in volume to the Encyclopaedia Britannica is stored in each of our cells
 ‘Bioinformatics tries to determine what information is biologically important’

Basics of Molecular Biology….

DNA & Genes
 DNA is where the genetic information is stored
 Traits such as blond hair and blue eyes are inherited through it
 Gene: the basic unit of heredity
 Genes influence characteristics, e.g. hair colour
 Genes contain their information as a sequence of nucleotides
 Genes are abstract concepts, like lines of longitude and latitude, in the sense that you cannot see them as separate objects
 Genes are made up of nucleotides

Nucleotide (nt)
 Each nt is made up of
 Sugar
 Phosphate group
 Base
 The base is the only difference between one nt and another
 There are 4 different bases in DNA
 G(uanine), A(denine), T(hymine), C(ytosine)
 The information lies in the order of the nucleotides
 Genes can be many thousands of nt long
 The complete set of genetic instructions of an organism is called its genome

Proteins
 Proteins are very important biological molecules
 Proteins are made up of amino acids (aa)
 There are 20 different amino acids
 The function of a protein depends on the order of its amino acids

Proteins…
 The information required to make proteins is stored in DNA
 DNA sequence determines amino acid sequence
 Amino acid sequence determines protein structure
 Protein structure determines protein function
 A substance called RNA carries the information stored in the DNA, which in turn is used to make proteins
 Storage: DNA
 Information transfer: RNA
 RNA is the messenger!

Central dogma
DNA → (transcription, by RNA polymerase) → RNA → (translation, by ribosomes) → Protein

Proteins…..
 There are 20 amino acids to encode but only 4 bases, so one nt cannot correspond to one aa (4 possibilities), and neither can a pair of nt (4 × 4 = 16)
 So protein information is carried in triplets of nt called codons (4 × 4 × 4 = 64 possibilities)
 The codons that do not correspond to an amino acid are stop codons: UAA, UAG, UGA (RNA has U instead of T)
 AUG is used as the start codon and also codes for methionine
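The codon rules on this slide can be tried out with a small sketch. It assumes the standard genetic code, a DNA input given as the coding strand (so transcription is just T to U), and translation that starts at the first AUG and stops at the first in-frame stop codon. The function names and the example sequence are made up for illustration.

```python
# Standard genetic code, with the bases of each codon position ordered U, C, A, G.
# "*" marks the three stop codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_strand):
    """Turn a DNA coding strand into the matching RNA (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


def translate(rna):
    """Translate from the first AUG up to (not including) the first in-frame stop."""
    start = rna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    rna = transcribe("CCATGGCCTAA")  # hypothetical toy sequence
    print(rna, "->", translate(rna))
```

The table has 64 entries for 20 amino acids plus 3 stops, which is why several codons map to the same amino acid.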

Protein Structure
 Shows wide variety, as opposed to DNA, whose structure is uniform
 X-ray crystallography or Nuclear Magnetic Resonance (NMR) is used to determine the structure
 Structure is related to function, or rather structure determines function
 Although proteins are created as a linear chain of aa, they fold into a 3D structure
 If you unfold them and let go, they tend to return to this structure: the native structure of the protein
 A protein functions well only in its native structure
 Even after translation is over, a protein can go through further changes to its structure

Bioinformatics Techniques…..

Prediction and Pattern
Recognition
 The two main areas of bioinformatics
are
 Pattern recognition
 ‘A particular sequence or structure has
been seen before’ and that a particular
characteristic can be associated with it
 Prediction
 From a sequence (what we know) we
can predict the structure and function
(what we don’t know)
Dot plots….
 A simple way of evaluating similarity between two sequences
 In a grid, one sequence is written along one axis and the other sequence along the other axis
 Wherever the two sequences match, the grid is marked with a dot
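A dot plot is simple enough to build in a few lines. The sketch below is one possible way to do it, comparing single characters with no windowing or filtering (real tools usually add both). It reuses the two sequences from the Alignments slide; the function names are made up for illustration.

```python
def dot_plot(seq_a, seq_b):
    """Return a grid where cell [i][j] is True when seq_a[i] equals seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a, seq_b):
    """Draw the grid as text: seq_b across the top, seq_a down the side."""
    lines = ["  " + " ".join(seq_b)]
    for a, row in zip(seq_a, dot_plot(seq_a, seq_b)):
        lines.append(a + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("TTACTATA", "TAGATA"))
```

Runs of marks along a diagonal show stretches where the two sequences agree.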
Alignments
 An alignment pairs up the characters of two or more sequences so as to bring out their similarity
 Eg.
 TTACTATA
 TAGATA
 There are so many ways to align the above two
sequences
 1.
 TTACTATA
 TAGATA
 2.
 TTACTATA
 TAGATA
 3.
 TTACTATA
 TAGATA
 So which one do we choose, and on what basis?
 The solution is to assign a match score and a mismatch score, and to prefer the alignment with the best total score
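To make the scoring idea concrete, here is a minimal sketch that scores three candidate alignments of the two example sequences. The gap placements are hypothetical (chosen for illustration, not taken from the slide), and the scores of +1 for a match, -1 for a mismatch and -1 for a gap are assumed values.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Add up column scores for two aligned rows; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# Three hypothetical ways of placing two gaps in the shorter sequence.
CANDIDATES = [
    ("TTACTATA", "TAGATA--"),
    ("TTACTATA", "-TAGATA-"),
    ("TTACTATA", "-TAG-ATA"),
]

if __name__ == "__main__":
    for row_a, row_b in CANDIDATES:
        print(row_a, row_b, score_alignment(row_a, row_b))
```

Under these assumed scores the third candidate comes out best, because it lines up five matching characters.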

Dynamic Programming
 As the query sequences get longer and the difference in length between them increases, more gaps have to be inserted in various places
 We cannot perform an exhaustive search of all possible alignments
 A combinatorial explosion occurs: there are too many combinations to search
 Dynamic programming avoids this by building the best alignment from the best alignments of shorter prefixes, so each sub-problem is solved only once
 It is exact rather than heuristic: it finds an optimal alignment for the given scores
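One classic dynamic programming formulation is Needleman-Wunsch global alignment. The slide does not say which method the author had in mind, so treat the sketch below as an illustration only. The scores (+1 match, -1 mismatch, -1 per gap position) are assumed values, and the function name is made up.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    best, row_a, row_b = needleman_wunsch("TTACTATA", "TAGATA")
    print(best)
    print(row_a)
    print(row_b)
```

The table has only (len(a) + 1) × (len(b) + 1) cells, so the work grows with the product of the sequence lengths instead of with the number of possible alignments.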
Databases
 Sequence information is stored in databases
 So that it can be searched and manipulated easily
 The databases (next slide) are located at different sites
 They exchange information on a daily basis so that they are up to date and in sync
 Primary db: sequence data

Nucleic acid (DNA/RNA) sequence databases
 One main database arising from a partnership between GenBank at the NCBI (National Center for Biotechnology Information, USA), the EMBL data library at the EBI (European Bioinformatics Institute, UK) and the DNA Data Bank of Japan at the NIG (National Institute of Genetics, Japan).
 Daily exchanges between the 3 partners keep the databases synchronised.
 DNA and RNA sequences: curated, archived, distributed.
 Sequences come from genome projects, scientific articles and patent applications. Most scientific journals require DNA and RNA sequences related to each publication to be publicly available.
 Sequences are deposited early and go through a review cycle: unannotated, preliminary, unreviewed, standard.
 Format: human and computer readable.
Major Primary DB
Nucleic Acid    | Protein
EMBL (Europe)   | PIR (Protein Information Resource)
GenBank (USA)   | MIPS, NCBI
DDBJ (Japan)    | SWISS-PROT (University of Geneva, now with EBI)
NCBI            | TrEMBL (a supplement to SWISS-PROT)
                | NRL-3D
Composite DB
 As there are many databases, which one should we search? Each is strong in some aspects and weak in others
 A composite db is the answer: it draws on several primary db for its base data
 Searching a composite db is indexed and streamlined so that the same sequence, stored in different db, is not searched twice
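The "not searched twice" idea can be sketched as a merge that keeps only one copy of each sequence, taking it from the highest-priority source. This is a simplified assumption about how a composite database might be built (exact-match de-duplication only); the entry identifiers and sequences below are invented toy data.

```python
def build_composite(sources):
    """Merge source databases, listed in priority order, into a non-redundant set.

    sources: list of (db_name, {entry_id: sequence}) pairs.
    Returns {sequence: (db_name, entry_id)} with the first source seen winning.
    """
    composite = {}
    for db_name, entries in sources:
        for entry_id, sequence in entries.items():
            if sequence not in composite:
                composite[sequence] = (db_name, entry_id)
    return composite


if __name__ == "__main__":
    # Hypothetical entries; real records carry far more than a bare sequence.
    sources = [
        ("SWISS-PROT", {"sp_1": "MKTAYIAK", "sp_2": "MAAGLLPR"}),
        ("PIR", {"pir_1": "MKTAYIAK", "pir_2": "MGGSWQER"}),
    ]
    for sequence, origin in build_composite(sources).items():
        print(sequence, origin)
```

The duplicated sequence is kept once, attributed to the source listed first.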

Composite DB
 OWL has these as its primary db
 SWISS-PROT (top priority)
 PIR
 GenBank
 NRL-3D

Secondary db
 Store information derived from the primary db, for example patterns found by analysing or searching the primary sequences

Database | Primary source
PROSITE  | SWISS-PROT
PRINTS   | OWL

Structural databases
 The main database of protein structures is the PDB (Protein Data Bank).
 The PDB started in 1971 at Brookhaven National Labs (NY, USA) and is now a distributed organisation (Research Collaboratory for Structural Bioinformatics, www.rcsb.org) of US partners (Rutgers, NJ; San Diego Supercomputer Centre, CA; NIST, MD).
 The PDB includes protein structures (and a few DNA and other structures) determined by X-ray crystallography and Nuclear Magnetic Resonance.

Database Searches
 Many genes have been sequenced and identified, so we know what they do
 Their sequences are stored in databases
 So if we find a new gene in the human genome, we compare it with the known genes stored in the databases
 Since the databases hold a very large number of sequences, we cannot run a full alignment against each and every one
 So heuristics must be used to narrow down the search
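One common kind of heuristic is to index short words (k-mers) from every database sequence and fully align the query only against sequences that share enough words with it. The sketch below shows just that filtering step. It is the general idea behind word-seeded search tools such as BLAST, not their actual algorithm; the word length, threshold, sequence names and sequences are all assumed toy values.

```python
from collections import defaultdict


def build_kmer_index(database, k=3):
    """Map every k-letter word to the names of the sequences that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidate_hits(query, index, k=3, min_shared=2):
    """Return names of sequences sharing at least min_shared distinct words with the query."""
    query_words = {query[i:i + k] for i in range(len(query) - k + 1)}
    shared = defaultdict(int)
    for word in query_words:
        for name in index.get(word, ()):
            shared[name] += 1
    return sorted(name for name, count in shared.items() if count >= min_shared)


if __name__ == "__main__":
    # Hypothetical three-entry database.
    database = {"geneA": "TTACTATA", "geneB": "GGGCCCGG", "geneC": "CTATAGG"}
    index = build_kmer_index(database)
    print(candidate_hits("ACTATA", index))
```

Only the candidates that pass this cheap filter would then go on to a full alignment.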

Areas in Bioinformatics…

Genomics
 In a multicellular organism each cell type expresses its genes in a different way, although every cell has the same genetic content
 i.e. all the information needed for a liver cell to be a liver cell is also present in a nose cell, so gene expression is what differentiates them

Genomics - Finding Genes
 A gene in sequence data is like a needle in a haystack
 However, whereas a needle looks different from the hay, genes do not look different from the rest of the sequence data
 In a long array of nt we try to find a set of nt that forms a gene and to mark its borders
 This is one of the challenges of bioinformatics
 Neural networks and dynamic programming are being employed
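The simplest possible starting point for marking gene borders is to scan for open reading frames (ORFs): stretches that begin with ATG and run to the first in-frame stop codon. This is a much cruder technique than the neural networks and dynamic programming the slide mentions. The sketch looks at the forward strand only and ignores introns and everything else that makes real gene finding hard. The function name, minimum length and test sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of forward-strand ORFs, stop codon included.

    An ORF starts at ATG and ends at the first in-frame stop codon.
    min_codons counts the codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                for j in range(i, len(dna) - 2, 3):
                    if dna[j:j + 3] in STOP_CODONS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3))
                        i = j  # continue scanning after this stop codon
                        break
            i += 3
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAATTTGGGTAACC"  # hypothetical toy sequence
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])
```

Each result is a half-open slice, so `dna[start:end]` gives the ORF from its ATG through its stop codon.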

Organism     | Genome size (Mb; 1 Mb = 1,000,000 bp) | Gene number | Web site
Yeast        | 13.5  | 6,241  | http://genome-www.stanford.edu/Saccharomyces
Fruit flies  | 180   | 13,601 | http://flybase.bio.indiana.edu
Homo sapiens | 3,000 | 45,000 | http://www.ncbi.nlm.nih.gov/genome/guide
Proteomics
 The proteome is the sum total of an organism's proteins
 Proteomics is more difficult than genomics:
 DNA has an alphabet of 4 bases; proteins have an alphabet of 20 amino acids
 DNA has a simple chemical makeup; that of proteins is complex
 DNA can be duplicated; proteins cannot
 We are entering the ‘post-genome era’
 Meaning much has been done with genes, not that the work on them is over
Proteomics…..
 The relationship between an RNA and the protein it codes for is usually not straightforward
 Proteins change after translation
 So the aa sequence does not tell us anything about post-translational changes
 Many proteins are not active until they are combined into a larger complex or moved to a relevant location inside or outside the cell
 So the aa sequence only hints at these things
 Proteins must also be handled more carefully in the lab, as they tend to change when in contact with inappropriate materials

Protein Structure Prediction
 Is one of the biggest challenges of bioinformatics and especially of biochemistry
 At present there is no algorithm that consistently predicts the structure of proteins

Structure Prediction methods
 Comparative Modeling
 The target protein's structure is modelled on the structures of related proteins
 Proteins with similar sequences are searched for known structures

Phylogenetics
 The taxonomic system reflects evolutionary relationships
 Phylogenetic trees depict these evolutionary relationships as a picture/graph
 Rooted trees have a single common ancestor at the root
 Unrooted trees just show the relationships, without identifying a common ancestor
 Phylogenetic tree reconstruction algorithms are also an area of research
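As one example of a tree reconstruction algorithm, the sketch below uses UPGMA: it repeatedly joins the two closest clusters, with the distance between clusters taken as the average distance between their members. The slide does not name a method, so this is only an illustration. It assumes the sequences are already aligned and of equal length, measures distance as the fraction of differing positions, and uses invented toy sequences labelled A to D.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(sequences):
    """Build a rooted tree, returned as nested tuples, from {name: aligned_sequence}."""
    # Each cluster is (subtree, list of leaf names in that subtree).
    clusters = [(name, [name]) for name in sequences]

    def cluster_distance(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(p_distance(sequences[x], sequences[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


if __name__ == "__main__":
    # Hypothetical aligned sequences, not real data.
    toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA", "D": "TCGAAGTC"}
    print(upgma(toy))
```

The nesting of the tuples is the tree: the innermost pair was joined first, and the outermost pair meets at the root.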

Applications….

Medical Implications
 Pharmacogenomics
 Not all drugs work on all patients; some good drugs cause death in some patients
 So by doing a gene analysis before treatment, harmful drugs can be avoided
 Also, drugs that are dangerous to most people could be used on the minority whose genes suit that drug well (volunteers wanted!)
 Customized treatment
 Gene Therapy
 Replace or supply the defective or missing gene
 E.g. the genes for insulin, or for Factor VIII in haemophilia

 BioWeapons (??)

Diagnosis of Disease
 Identifying the genes that cause a disease helps to detect it at an early stage, e.g. Huntington's disease
 Symptoms: uncontrollable dance-like movements, mental disturbance, personality changes and intellectual impairment
 Death in 10-15 years
 The gene responsible for the disease has been identified

Drug Design
 Developing a drug can take up to 15 years and cost up to $700 million
 One of the goals of bioinformatics is to reduce the time and cost involved
 The process
 Discovery
 Computational methods can improve this
 Testing
Discovery

Target identification
 Identifying the molecule on which the germ relies for its survival
 Then we develop another molecule, i.e. a drug, which will bind to the target
 So the germ will not be able to interact with the target
 Proteins are the most common targets

Discovery…
 For example, HIV produces HIV protease, a protein which in turn cuts up other proteins
 HIV protease has an active site where it binds to other molecules
 So an HIV drug can bind to that active site and block it
Discovery…
 Lead compounds are the molecules that bind to the target protein's active site
 Traditionally, finding them has been a trial and error process
 Now this is being moved into the realm of computers

Related Computer Technology………….

PERL
 Perl is commonly used for bioinformatics work because of its ability to manipulate character strings
 Long the de facto CGI language
 It started out as a scripting language but has become a fully fledged language
 It has everything now, even web service support
 http://bio.perl.org

The place of XML & Web Services
 Various markup languages, such as Gene Markup Language, are being created to represent sequence/gene data
 Web Services: program-to-program interaction, making the web application-centric as opposed to human-centric
 So this has to be platform and language independent
 Protocols like SOAP help in this regard
 In bioinformatics various databases, platforms, languages etc. are in use
 So web services help achieve platform independence and program interaction
 Since sequence databases come in various formats and on various platforms, SOAP also helps in this regard

Databases and Mining
 Many of the sequence databases are publicly available
 As databases are involved, various data mining techniques are used to pull the data out
 As there is a lot of literature (articles etc.) in this area, data mining is also done on the literature

European Molecular Biology Network (EMBnet)
 A central system for sharing, training and centralizing up-to-date bioinformatics information
 Some of the EMBnet sites are:
 SEQNET
 http://www.seqnet.dl.ac.uk
 UCL
 http://www.biochem.ucl.ac.uk/bsm/dbbrowser/embnet/
 EBI (European Bioinformatics Institute)
 www.ebi.ac.uk
