Intro to Bioinfo

How Bioinformatics can change your life
Basic Concepts of Bioinformatics

Introduction

2000
• A major event happened that was to change the course of human history
• It was largely a joint British and American effort
• It was a race: who would finish first?
• A race test: not whether the competitors had taken drugs, but whether they could produce them!
• A working draft of the human genome was sequenced

Bioinformatics is:
driven by the generation of data, moderated by hardware and analysis methods
[Diagram labels: computing power, analysis methods, data generation platforms]

What is Bioinformatics?
• The merging of computer science and molecular biology
• The algorithms and techniques of computer science are used to solve the problems faced by molecular biologists
• ‘Information technology applied to the management and analysis of biological data’
• Storage and analysis are two of the important functions; bioinformaticians build tools for each

[Diagram: Bioinformatics shown together with Biology, Chemistry, Computer Science and Statistics]

What is Bioinformatics? (continued)
• This is the age of information technology
• However, storing information is nothing new
• Information on the scale of the Encyclopaedia Britannica is stored in each of our cells
• ‘Bioinformatics tries to determine what info is biologically important’

Basics of Molecular Biology

DNA & Genes
• DNA is where the genetic information is stored
• Traits such as blonde hair and blue eyes are inherited through it
• Gene: the basic unit of heredity
• Put simply, there are genes for characteristics, e.g. a gene for blonde hair
• Genes contain their information as a sequence of nucleotides
• Genes are abstract concepts, like lines of longitude and latitude, in the sense that you cannot see them as separate objects
• Genes are made up of nucleotides

Nucleotide (nt)
• Each nt is made up of
  • Sugar
  • Phosphate group
  • Base
• The base is the only difference between one nt and another
• There are 4 different bases
  • G(uanine), A(denine), T(hymine), C(ytosine)
• The information lies in the order of the nucleotides: the order is the info
• Genes can be many thousands of nt long
• The complete set of genetic instructions is called the genome

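The point that the order of the bases, not their mere presence, carries the information can be illustrated with a minimal Python sketch. The function name and the two short sequences are made up for illustration and are not taken from any real gene.

```python
from collections import Counter

def base_counts(dna: str) -> Counter:
    """Count how often each of the four bases occurs in a DNA string."""
    dna = dna.upper()
    unknown = set(dna) - set("GATC")
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    return Counter(dna)

# Two hypothetical sequences built from exactly the same bases.
seq_a = "GATTACA"
seq_b = "ACATTAG"

print(base_counts(seq_a) == base_counts(seq_b))  # same composition
print(seq_a == seq_b)                            # different order, different message
```

Both strings contain the same bases in the same amounts, yet they are different sequences, which is why sequence databases store the full order rather than a summary.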
Proteins
• Proteins are a very important class of biological molecule
• Amino acids (aa) make up proteins
• There are 20 different amino acids
• The function of a protein depends on the order of its amino acids

Proteins (continued)
• The information required to make proteins is stored in DNA
• DNA sequence determines amino acid sequence
• Amino acid sequence determines protein structure
• Protein structure determines protein function
• A substance called RNA carries the information stored in the DNA, which in turn is used to make proteins
• Storage: DNA
• Information transfer: RNA
• RNA is the messenger boy!

Central dogma
DNA --(transcription: RNA polymerase)--> RNA --(translation: ribosomes)--> Protein

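The transcription step can be sketched in a few lines of Python. This assumes the input is the coding strand of the DNA, in which case the messenger RNA has the same sequence with U in place of T. The function name and the example sequence are illustrative only.

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand: same letters, U instead of T."""
    return coding_strand.upper().replace("T", "U")

print(transcribe("ATGGCCTTTTAA"))
```

In the cell, RNA polymerase reads the complementary template strand; working from the coding strand is simply the more convenient convention on a computer.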
Proteins (continued)
• Since there are 20 amino acids to encode, one nt cannot correspond to one aa (only 4 possibilities), and neither can a pair of nt (only 4 × 4 = 16)
• So protein information is carried in triplet codes called codons (4 × 4 × 4 = 64 possibilities)
• The codons that do not correspond to an amino acid are stop codons: UAA, UAG, UGA (RNA has U instead of T)
• A codon is also used as the start signal: AUG, which codes for methionine as well

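The counting argument and the role of start and stop codons can be checked with a small Python sketch. It builds the standard genetic code as a lookup table and translates from the first AUG to the first stop codon. The function name and the example mRNA are assumptions for illustration; real translation involves more than this.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' marks a stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(4 ** 1, 4 ** 2, 4 ** 3)                                  # codes available with 1, 2, 3 nt
print(sorted(c for c, aa in CODON_TABLE.items() if aa == "*"))  # the stop codons
print(translate("GGAUGGCCUUUUAAGG"))                           # hypothetical mRNA
```

The table has 64 entries for only 20 amino acids plus stop, so most amino acids are encoded by more than one codon.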
Protein Structure
• Protein structures show wide variety, as opposed to DNA, whose structure is uniform
• X-ray crystallography or nuclear magnetic resonance (NMR) is used to determine the structure
• Structure is related to function; or rather, structure determines function
• Although proteins are created as a linear chain of aa, they fold into a 3D structure
• If you stretch them out and let go, they return to this structure: the native structure of the protein
• Proteins function well only in their native structure
• Even after translation is over, a protein goes through some changes to its structure

Bioinformatics Techniques

Prediction and Pattern Recognition
• The two main areas of bioinformatics are
  • Pattern recognition
    • Recognising that a particular sequence or structure ‘has been seen before’ and that a particular characteristic can be associated with it
  • Prediction
    • From a sequence (what we know) we can predict the structure and function (what we don’t know)

Dot plots
• A simple way of evaluating similarity between two sequences
• In a grid, one sequence runs along one axis and the other sequence along the other axis
• Wherever the two sequences match, the grid is marked

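A dot plot is simple enough to sketch directly in Python. The version below marks every position where the two characters are identical, with no windowing or filtering, and prints the grid as text. The function names are illustrative.

```python
def dot_plot(seq_x: str, seq_y: str) -> list[list[bool]]:
    """Return a grid with one row per character of seq_y and one column per character of seq_x."""
    return [[a == b for a in seq_x] for b in seq_y]

def render(seq_x: str, seq_y: str) -> str:
    """Draw the grid as text: '*' where the characters match, '.' elsewhere."""
    lines = ["  " + seq_x]
    for b, row in zip(seq_y, dot_plot(seq_x, seq_y)):
        lines.append(b + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)

print(render("TTACTATA", "TAGATA"))
```

Runs of marks along a diagonal indicate stretches where the two sequences agree; real tools usually add a sliding window to suppress the scattered single matches.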
Alignments
• A matching up of the characters of two or more sequences to show their similarity
• E.g.
  TTACTATA
  TAGATA
• There are many ways to align these two sequences: the shorter one can be slid along the longer one, and gaps can be inserted at different positions
• So which one do we choose, and on what basis?
• The solution is to assign a match score and a mismatch score, and to prefer the alignment with the best total

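Scoring candidate alignments can be sketched as follows. The score values (+1 for a match, -1 for a mismatch, -2 for a gap) and the candidate gap placements are assumptions chosen for illustration; the slides do not specify them.

```python
def score_alignment(top: str, bottom: str, match: int = 1,
                    mismatch: int = -1, gap: int = -2) -> int:
    """Score two aligned strings of equal length; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

# Hypothetical ways of laying TAGATA against TTACTATA.
candidates = ["TAGATA--", "--TAGATA", "-TAG-ATA"]
for bottom in candidates:
    print("TTACTATA")
    print(bottom, score_alignment("TTACTATA", bottom))
```

With these values the third layout scores highest, which gives an objective basis for choosing it over the other two.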
Dynamic Programming
• As the query sequences get longer and the difference in length between the two sequences increases, more gaps have to be inserted in various places
• We cannot perform an exhaustive search
• Combinatorial explosion occurs: there are too many combinations to search
• Dynamic programming avoids this by building the best alignment from the best alignments of shorter prefixes, so that each sub-problem is solved only once

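As a minimal sketch of dynamic programming for alignment, the function below fills a score table in the style of the Needleman-Wunsch global alignment method. The slides do not name a specific algorithm, and the scoring values are the same illustrative assumptions as before (+1 match, -1 mismatch, -2 gap).

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b, computed by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # a prefix of a aligned against nothing
        score[i][0] = i * gap
    for j in range(1, cols):          # a prefix of b aligned against nothing
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal,                  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    return score[-1][-1]

print(global_alignment_score("TTACTATA", "TAGATA"))
```

The table has (len(a) + 1) × (len(b) + 1) cells and each is filled once, so the work grows with the product of the two lengths rather than with the number of possible alignments.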
Databases
• Sequence info is stored in databases
• So that it can be manipulated easily
• The databases (next slide) are located at different places
• They exchange info on a daily basis so that they are up to date and in sync
• Primary db: sequence data

Nucleic acid (DNA/RNA) sequence databases
• One main database arising from a partnership between GenBank at the NCBI (National Center for Biotechnology Information, USA), the EMBL data library at the EBI (European Bioinformatics Institute, UK) and the DNA Data Bank of Japan (DDBJ) at the NIG (National Institute of Genetics, Japan).
• Daily exchanges between the 3 partners keep the databases synchronised.
• DNA and RNA sequences: curated, archived, distributed.
• Sequences come from genome projects, scientific articles and patent applications. Most scientific journals require DNA and RNA sequences related to each publication to be publicly available.
• Sequences are deposited early and go through a review cycle: unannotated, preliminary, unreviewed, standard.
• Format: human and computer readable.

Major Primary DB
Nucleic Acid      Protein
EMBL (Europe)     PIR (Protein Information Resource)
GenBank (USA)     MIPS, NCBI
DDBJ (Japan)      SWISS-PROT (University of Geneva, now with EBI)
NCBI              TrEMBL (a supplement to SWISS-PROT)
                  NRL-3D

Composite DB
• With so many db, which one should we search? Some are strong in some aspects and weak in others
• A composite db is the answer: it draws its base data from several db
• Searches on a composite db are indexed and streamlined so that the same stored sequence is not searched twice in different db

Composite DB (continued)
• OWL draws on these primary db
  • SWISS-PROT (top priority)
  • PIR
  • GenBank
  • NRL-3D

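The idea of merging several sources by priority, so that the same sequence is not stored and searched twice, can be sketched as below. This is only an illustration of the principle, not how OWL is actually built. The entry identifiers and sequences are invented, and duplicates are detected by exact sequence identity.

```python
def build_composite(sources):
    """Merge (db_name, {entry_id: sequence}) pairs given in priority order.

    A sequence already contributed by a higher-priority source is skipped.
    """
    seen = set()
    composite = []
    for db_name, entries in sources:
        for entry_id, sequence in entries.items():
            if sequence in seen:
                continue
            seen.add(sequence)
            composite.append((db_name, entry_id, sequence))
    return composite

# Hypothetical entries; the database names follow the priority list above.
sources = [
    ("SWISS-PROT", {"sp_1": "MAFGK", "sp_2": "MKTLL"}),
    ("PIR",        {"pir_1": "MAFGK", "pir_2": "MSSQE"}),
    ("NRL-3D",     {"nrl_1": "MSSQE"}),
]
for record in build_composite(sources):
    print(record)
```

Only three of the five entries survive, and where two sources hold the same sequence the higher-priority one is kept.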
Secondary db
• Store information derived from the primary db, such as the results of searches and analyses of their sequences

Secondary DB    Source DB (primary or composite)
PROSITE         SWISS-PROT
PRINTS          OWL

Structural databases
• The main database of protein structures is the PDB (Protein Data Bank).
• The PDB started in 1971 at Brookhaven National Laboratory (NY, USA) and is now run by a distributed organisation (Research Collaboratory for Structural Bioinformatics, www.rcsb.org) of US partners (Rutgers, NJ; San Diego Supercomputer Center, CA; NIST, MD).
• The PDB includes protein structures (and a few DNA and other structures) determined by X-ray crystallography and nuclear magnetic resonance.

Database Searches
• Many genes have already been sequenced and identified, so we know what they do
• Their sequences are stored in databases
• So if we find a new gene in the human genome, we compare it with the already known genes stored in the databases
• Since the databases hold a very large number of sequences, we cannot run a full sequence alignment against each and every one
• So heuristics must be used

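One common kind of heuristic is to index short words (k-mers) from every database sequence once, and then only consider sequences that share words with the query. The sketch below illustrates that idea in Python; it is a simplification in the spirit of word-based search tools, not the algorithm of any particular one. The sequence names and the word length k = 3 are assumptions.

```python
from collections import defaultdict

def build_kmer_index(database: dict[str, str], k: int = 3) -> dict[str, set[str]]:
    """Map every word of length k to the names of the sequences containing it."""
    index = defaultdict(set)
    for name, sequence in database.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(name)
    return index

def candidate_hits(query: str, index: dict[str, set[str]], k: int = 3) -> list[tuple[str, int]]:
    """Rank database sequences by how many query words they share."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name in index.get(query[i:i + k], ()):
            hits[name] += 1
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical miniature database.
database = {"gene_a": "TTACTATA", "gene_b": "GGGCCCGG", "gene_c": "CTATAGG"}
index = build_kmer_index(database)
print(candidate_hits("ACTATA", index))
```

Only the top candidates would then be passed to a full alignment, which is what makes the search fast; the price is that a distant relative sharing no exact word with the query can be missed.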
Areas in Bioinformatics

Genomics
• In a multicellular organism, each cell type expresses its genes in a different way, although every cell has the same content as far as the genetic information is concerned
• I.e. all the information needed for a liver cell to be a liver cell is also present in a nose cell, so gene expression is the only thing that differentiates them

Genomics - Finding Genes
• A gene in sequence data is a needle in a haystack
• However, whereas a needle looks different from the hay, genes do not look different from the rest of the sequence data
• In a whole array of nt, we try to find a set of nt and mark its borders as a gene
• This is one of the challenges of bioinformatics
• Neural networks and dynamic programming are being employed

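To make the task concrete, here is a deliberately naive baseline in Python: scan the three forward reading frames for an ATG followed, in the same frame, by a stop codon. This is far simpler than the neural network and dynamic programming methods mentioned above and ignores the reverse strand and introns. The function name, the minimum length and the example sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) positions of open reading frames on the forward strand.

    An ORF runs from an ATG to the first in-frame stop codon (stop included);
    min_codons is the minimum number of codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

dna = "CCATGGCCTTTTAAGG"   # hypothetical sequence
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

On real genomes such a scan produces many false candidates, which is why gene finding remains a hard problem.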
Organism       Genome size (Mb)   Gene number   Web site
Yeast          13.5               6,241         http://genome-www.stanford.edu/Saccharomyces
Fruit flies    180                13,601        http://flybase.bio.indiana.edu
Homo sapiens   3,000              45,000        http://www.ncbi.nlm.nih.gov/genome/guide
(1 Mb = 1,000,000 bp. The human gene number is an early estimate; later counts of protein-coding genes are nearer 20,000.)

Proteomics
• The proteome is the sum total of an organism's proteins
• More difficult than genomics

  Genome (DNA)               Proteome (proteins)
  4 bases                    20 amino acids
  Simple chemical makeup     Complex chemical makeup
  Can be duplicated          Cannot be duplicated

• We are entering the ‘post-genome era’
• Meaning that much has been done with the genes, not that the work is over

Proteomics (continued)
• The relationship between an RNA and the protein it codes for is often far from direct
• After translation, proteins do change
• So the aa sequence does not tell us anything about the post-translational changes
• Proteins are often not active until they are combined into a larger complex or moved to the relevant location inside or outside the cell
• So the aa sequence only hints at these things
• Also, proteins must be handled more carefully in the lab, as they tend to change when in contact with an inappropriate material

Protein Structure Prediction
• One of the biggest challenges of bioinformatics, and especially of biochemistry
• No algorithm yet predicts the structure of proteins consistently

Structure Prediction methods
• Comparative modeling
  • The target protein's structure is modelled by comparison with related proteins
  • Databases are searched for proteins with similar sequences and known structures

Phylogenetics
• The taxonomic system reflects evolutionary relationships
• Phylogenetic trees show these evolutionary relationships as a picture or graph
  • Rooted trees, in which there is a single common ancestor
  • Unrooted trees, which just show the relationships
• Phylogenetic tree reconstruction algorithms are also an area of research

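Many tree reconstruction methods start from pairwise distances between aligned sequences. The sketch below computes a simple distance (the fraction of positions that differ) and picks the closest pair, which is the first joining step of distance-based methods. The three aligned sequences and their labels are invented for illustration and are not real data.

```python
from itertools import combinations

def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def closest_pair(aligned: dict[str, str]) -> tuple[str, str]:
    """Return the two names whose sequences are most similar."""
    return min(combinations(sorted(aligned), 2),
               key=lambda pair: p_distance(aligned[pair[0]], aligned[pair[1]]))

# Hypothetical aligned sequences.
aligned = {"species_a": "GATTACAG", "species_b": "GATTACAA", "species_c": "GACTATAA"}
print(closest_pair(aligned))
```

A full method would merge the closest pair into one node, recompute the distances and repeat until a single tree remains.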
Applications

Medical Implications
• Pharmacogenomics
  • Not all drugs work on all patients; some good drugs cause death in some patients
  • So by doing a gene analysis before treatment, the harmful drugs can be avoided
  • Also, drugs that are fatal to most patients can be used on the minority whose genes suit them (volunteers wanted!)
  • Customized treatment
• Gene therapy
  • Replace or supply the defective or missing gene
  • E.g. insulin, or Factor VIII for haemophilia
• Bioweapons (??)

Diagnosis of Disease
• Identifying the genes that cause a disease helps to detect it at an early stage, e.g. Huntington's disease
  • Symptoms: uncontrollable dance-like movements, mental disturbance, personality changes and intellectual impairment
  • Death in 10-15 years
  • The gene responsible for the disease has been identified

Drug Design
• Developing a drug can take up to 15 years and cost $700 million
• One of the goals of bioinformatics is to reduce the time and cost involved
• The process
  • Discovery
    • Computational methods can improve this
  • Testing

Discovery
Target identification
• Identify the molecule on which the germ relies for its survival
• Then develop another molecule, i.e. a drug, that will bind to the target
• So the germ will no longer be able to interact with the target
• Proteins are the most common targets

Discovery (continued)
• For example, HIV produces HIV protease, a protein that in turn cuts up other proteins
• HIV protease has an active site where it binds to other molecules
• So an HIV drug is designed to bind to that active site

Discovery (continued)
• Lead compounds are the molecules that bind to the target protein's active site
• Traditionally, finding them has been a matter of trial and error
• Now this is moving into the realm of computers

Related Computer Technology

PERL
• Perl is commonly used for bioinformatics work because of its ability to manipulate character strings
• The traditional language of choice for CGI
• It started out as a scripting language but has become a fully fledged language
• It has everything now, even web service support
• http://bio.perl.org

The place of XML & Web Services
• Various markup languages are being created (Gene Markup Language etc.) to represent sequence/gene data
• Web services: program-to-program interaction, making the web application-centric as opposed to human-centric
• So this has to be platform and language independent
• Protocols like SOAP help in this regard
• In bioinformatics, various databases, platforms, languages etc. are in use
• So web services help achieve platform independence and program interaction
• Since sequence databases come in various formats and on various platforms, SOAP also helps in this regard

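To show why a markup representation helps programs exchange sequence data, here is a small Python sketch that reads a gene record from XML using the standard library. The element and attribute names are invented for this example and do not follow any particular gene markup language.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the tag and attribute names are illustrative only.
RECORD = """<gene id="example-1">
  <organism>Homo sapiens</organism>
  <sequence type="dna">ATGGCCTTTTAA</sequence>
</gene>"""

def read_gene(xml_text: str) -> dict[str, str]:
    """Extract the id, organism, sequence type and sequence from one gene record."""
    root = ET.fromstring(xml_text)
    sequence = root.find("sequence")
    return {
        "id": root.get("id"),
        "organism": root.findtext("organism"),
        "type": sequence.get("type"),
        "sequence": sequence.text.strip(),
    }

print(read_gene(RECORD))
```

Any program on any platform that can parse XML can read the same record, which is the platform and language independence the slide refers to.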
Databases and Mining
• Many of the sequence databases are publicly available
• As databases are involved, various data mining techniques are used to pull the data out
• As there is also a lot of literature (articles etc.) in this area, data mining is applied to the literature as well

European Molecular Biology Network (EMBnet)
• A central system for sharing up-to-date bio info and for training
• Some of the EMBnet sites are:
  • SEQNET
    • http://www.seqnet.dl.ac.uk
  • UCL
    • http://www.biochem.ucl.ac.uk/bsm/dbbrowser/embnet/
  • EBI (European Bioinformatics Institute)
    • www.ebi.ac.uk

References
• Dan E. Krane and Michael L. Raymer, Basic Concepts of Bioinformatics
• Arthur M. Lesk, Intro to Bioinformatics
• T.K. Attwood & D.J. Parry-Smith, Intro to Bioinformatics
• Dr Patrick Dixon, The Genetic Revolution
• Prof David Gilbert's site, http://www.brc.dcs.gla.ac.uk/~drg/
