Bioinformatics Lesson 01

AN INTRODUCTION
From Biology to Bioinformatics
• Tremendous recent progress in
– Biology (molecular biology, genetics, etc.)
– Technological advancements
• Opened many new domains
• From Academic Interest
– To Commercial interest
• From Knowledge discovery
– To industrial development
• Added pursuit for
– Longer life
– Cure for diseases
Human Genome Project (HGP)
• Started 1986 (formally 1990); completed April 2003
• U.S. Department of Energy (DoE) and the National Institutes of Health (NIH)
Goals:
■ identify all the genes in human DNA,
■ determine all the sequences of chemical base pairs that make up human DNA,
■ store this information in databases,
■ improve tools for data analysis,
■ transfer related technologies to the private sector, and
■ address the ethical, legal, and social issues (ELSI) that may arise from the project.
http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
Human Genome Project (HGP)
By the Numbers
• About 3 billion (3164.7 million) chemical nucleotide bases (A, C, T, and G).
• The average gene consists of 3000 bases, but sizes vary greatly (the largest one has 2.4 million bases).
• The total number of genes is estimated at around 30,000, much lower than previous estimates of 80,000 to 140,000.
• The functions are unknown for over 50% of discovered genes.
http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
How does the human genome stack up?
Organism                              Genome Size (Bases)   Estimated Genes
Human (Homo sapiens)                  3 billion             30,000
Laboratory mouse (M. musculus)        2.6 billion           30,000
Mustard weed (A. thaliana)            100 million           25,000
Roundworm (C. elegans)                97 million            19,000
Fruit fly (D. melanogaster)           137 million           13,000
Yeast (S. cerevisiae)                 12.1 million          6,000
Bacterium (E. coli)                   4.6 million           3,200
Human immunodeficiency virus (HIV)    9700                  9
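One way to read the table above is to compute gene density (estimated genes per million bases). The following is a minimal Python sketch using only the figures from the table; the function name `genes_per_megabase` and the dictionary layout are illustrative choices, not part of the original slides.

```python
# Genome size (bases) and estimated gene count, copied from the table above.
GENOMES = {
    "Human (H. sapiens)": (3_000_000_000, 30_000),
    "Laboratory mouse (M. musculus)": (2_600_000_000, 30_000),
    "Mustard weed (A. thaliana)": (100_000_000, 25_000),
    "Roundworm (C. elegans)": (97_000_000, 19_000),
    "Fruit fly (D. melanogaster)": (137_000_000, 13_000),
    "Yeast (S. cerevisiae)": (12_100_000, 6_000),
    "Bacterium (E. coli)": (4_600_000, 3_200),
    "HIV": (9_700, 9),
}


def genes_per_megabase(size_bases, gene_count):
    """Return the estimated number of genes per million bases."""
    return gene_count / (size_bases / 1e6)


if __name__ == "__main__":
    # Print organisms from lowest to highest gene density.
    for name, (size, genes) in sorted(
        GENOMES.items(), key=lambda item: genes_per_megabase(*item[1])
    ):
        print(f"{name:32s} {genes_per_megabase(size, genes):10.1f} genes/Mb")
```

With these figures the human genome comes out least dense (about 10 genes per million bases), which illustrates that genome size and gene count do not scale together.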
What about Chimpanzee Genome?
Other Genome Projects:

www.tigr.org/tdb
Chimpanzee vs. Human
• Completed in August 2005
• 2.8 billion base pairs
• 29% of genes are absolutely identical
• Average number of protein-changing mutations: 2
• Some genes have radical changes
• But genome similarity is not everything
• Domestic dog genomes are 99.85% similar
• M. musculus and M. spretus show a comparable level of similarity
• Will help to unveil the evolution
Quantity of Data
• Sequencing, proteomics, gene expression data, metabolic studies, etc.
• For different organisms: from very simple viruses/bacteria to the very complex Homo sapiens
• These data are stored in many public and private databases
• Human genome: 30,000 genes and 1.5 million proteins (approx.)
• One gene needs: 300 TB (approx.) trace data
• Medical imaging alone generates 400MGB
Quality of Data
• Biological data is highly
– Intricate
– Interrelated
– Imperfect (noisy)
• If you start with a protein, you need to take care of structure, function, interaction, sequence, relation, etc.
• We need sophisticated repositories and tools to deal with such data
What Can Bioinformatics Do?
• Organize a variety of related information
• Develop tools and techniques for analyzing and interpreting these data
• Gene annotation, gene function prediction, gene library establishment
• Gene-to-protein identification, function and structure prediction
• Methodology for identifying and understanding the molecular machineries
• Modeling, simulation and inference of metabolic, genetic and protein networks
• Provide guidelines to identify the origin of disease
• Help in drug discovery and in curing disease
What is Bioinformatics? Definition
• Simple definition – bringing biological themes to computers
• Peter Elkin: Primer on Medical Genomics: Part V: Bioinformatics
– “Bioinformatics is the discipline that develops and applies informatics to the field of molecular biology.”
• BISTIC Bioinformatics Definition
– “Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data”
• BISTIC Computational Biology Definition
– “Computational Biology: the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.”
• http://www.bisti.nih.gov/
Bioinformatics
The application of computer science to molecular biology, in particular to the study of macromolecules such as proteins and nucleic acids.
Synonyms: Molecular Bioinformatics, Computational Biology, Biocomputing
How Technology Interacts with Bioinformatics
Useful/Necessary Bioinformatics Skills
• Strong background in some aspect of molecular biology!
• Ability to communicate biological questions comprehensibly to computer scientists
• Thorough comprehension of the problems in the bioinformatics field
• Statistics (association studies, clustering, sampling)
• Ability to filter and parse data and determine the relationships between the data sets
• Mathematics (e.g. algorithm development)
• Engineering (e.g. robotics)
• Good knowledge of a few molecular biology software packages (molecular modeling / sequence analysis)
• Command line computing environment (Linux/Unix knowledge)
• Data administration (esp. relational database concepts, e.g. Sybase, Oracle), computer programming skills/experience (C/C++, Java), and scripting language knowledge (Perl and perhaps Python)
Explosion of Genome Sequence Data
High throughput DNA sequencing Centre
DNA sequences are meaningless!
gggtctctcttgttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgcttcaagtagtgtgtgcccgtctgttgtgtgactctgatagctagagatcccttcagaccaaatttagtcagtgtgaaaa
atctctagcagtggcgcctgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcaggactcggcttgctgaagcgcgcacggcaagaggcgaggggacggcgactggtgagtacgccaaaattttgactagcggaggctagaaggagagagatgggtgc
gagagcgtcgatattaagcgggggaggattagatagatgggaaaaaattcggttaaggccagggggaaagaaaaaatatagattaaaacatttagtatgggcaagcagggagctagaacgattcgcagtcaatcctggcctattagaaacatcagaaggttgtagacaaatac
tgggacaactacaaccagcccttcagacaggatcagaagaacttagatcattatataatacagtagcaaccctctattgtgtgcatcaaaagatagatgtaaaagacaccaaggaagctttagataagatagaggaagagcaaaacaaaagtaagaaaaaagcacagcaagca
gcagctgacacaggaaatagcagccaggtcagccaaaattaccccatagtgcagaacatccaggggcaaatggtacatcaggccatatcacctagaactttaaatgcatgggtaaaagtagtagaagagaaggctttcagcccagaagtaatacccatgttttcagcattatc
agaaggagccaccccacaagatttaaacaccatgctaaacacagtggggggacatcaagcagccatgcaaatgttaaaagagaccatcaatgaggaagctgcagaatgggatagattgcatccagtgcatgcagggcctcatccaccaggccagatgagagaaccaaggggaa
gtgacatagcaggaactactagtacccttcaggaacaaatagcatggatgacaaataatccacctatcccagtaggagaaatctataagagatggataatcctgggattaaataaaatagtaaggatgtatagccctaccagcattctggacataaaacaaggaccaaaggaa
ccctttagagactatgtagaccggttctataagactctaagagccgagcaagcttcacaggaggtaaaaaattggatgacagaaaccttgttggtccaaaatgcgaacccagattgtaagactattttaaaagcattgggaccagcagctacactagaagaaatgatgacagc
atgtcagggagtgggaggacccggccataaagcaagagttttggcagaagcaatgagccaagtaacaaattcagctaccataatgatgcagaaaggcaattttaggaaccaaagaaaaattgttaagtgtttcaattgtggcaaagaagggcacatagccaaaaattgcaggg
cccctaggaaaaggggctgttggaaatgtggaaaggagggacaccaaatgaaagattgtactgagagacaggctaattttttagggaaaatctggccttcccacaggggaaggccagggaattttcctcagaacagactagagccaacagccccaccagccccaccagaagag
agcttcaggtttggggaagagacaacaactccctctcagaagcaggagctgatagacaaggaactgtatccttcagcttccctcaaatcactctttggcaacgaccccttgtcacaataaagataggggggcaactaaaggaagctctattagatacaggagcagatgataca
gtattagaagaaataaatttgccaggaagatggaaaccaaaaatgatagggggaattggaggttttatcaaagtaagacagtatgatcaaatactcgtagaaatctgtggacataaagctataggtacagtattagtaggacctacacctgtcaacataattggaagaaatct
gttgactcagattggttgcactttaaattttcccattagtcctattgaaactgtaccagtaaaattaaagccaggaatggatggcccaaaagttaaacaatggccattgacagaagaaaaaataaaagcattagtagaaatctgtacagaaatggaaaaggaaggaaaaattt
caaaaatcgggcctgaaaatccatataatactccagtatttgccataaagaaaaaagacagtactaaatggagaaaattagtagatttcagagaacttaataagaaaactcaagacttctgggaagttcaattaggaataccacatcccgcagggttaaaaaagaaaaaatca
gtaacagtactggatgtgggtgatgcatatttttcagttcccttagataaagaattcaggaagtacactgcatttaccatacctagtataaacaatgagacaccagggattagatatcagtacaatgtgcttccacagggatggaaaggatcaccagcaatattccaaagcag
catgacaaaaatcttagagccttttagaaaacaaaatccagacatagttatctatcaatacatggacgatttgtatgtaggatctgacttagaaatagggcagcatagaacaaaaatagaggaactgagacaacatctgttgaagtggggatttaccacaccagacaaaaaac
atcagaaagaacctccattcctttggatgggttatgaactccatcctgataaatggacagtacagcctatagtgctgccagaaaaggacagctggactgtcaatgacatacagaagttagtgggaaaattgaattgggcaagtcagatttacccagggattaaagtaaagcaa
ttatgtagactccttaggggaaccaaggcactaacagaagtaataccactaacaaaagaagcagagctagaactggcagaaaacagggaaattctaaaagaaccagtacatggagtgtattatgacccatcaaaagacttaatagcggaaatacagaagcaggggcaaggtca
atggacatatcaaatttatcaagagccatttaaaaatctgaaaacaggaaaatatgcaagaatgaggggtgcccacactaatgatgtaaaacaattaacagaggcagtgcaaaaaataaccacagaaagcatagtaatatggggaaagactcctaaatttaaactacccatac
aaaaagaaacatgggaaacatggtggacagagtattggcaagccacctggattcctgagtgggagtttgtcaatacccctcccttagtaaaattatggtaccagttagagaaagaacccataataggagcagaaactttctatgtagatggggcagctaacagggagactaaa
ttaggaaaagcaggatatgttactaacaaagggagacaaaaagttgtctccataactgacacaacaaatcagaagactgagttacaagcaattcttctagcattacaggattctggattagaagtaaacatagtaacagactcacaatatgcattaggaatcattcaagcaca
accagataaaagtgaatcagagatagtcagtcaaataatagagcagttaataaaaaaagaaaaggtctacctgacatgggtaccagcgcacaaaggaattggaggaaatgaacaagtagataaattagtcagtactggaatcaggaaagtactctttttagatggaatagata
aagcccaagaagaacatgaaaaatatcacagtaattggagggcaatggctagtgattttaacctgccacctgtggtagcaaaagagatagtagccagctgtgataaatgtcagctaaaaggagaagccatgcatggacaagtagactgtagtccaggaatatggcaactagat
tgtacacatttagaaggaaaaattatcctggtagcagttcatgtagccagtggatatatagaagcagaagttattccagcagaaacagggcaggaaacagcatactttctcttaaaattagcaggaagatggccagtaaaaacagtacatacagacaatggcagcaatttcac
cagtactacagttaaggccgcctgttggtgggcaggaatcaagcaggaatttggcattccctacaatccccaaagtcaaggagtagtagaatctataaataaagaattaaagaaagttataggacagataagagatcaggctgaacatcttaagacagcagtacaaatggcag
tattcatccacaattttaaaagaaaaggggggattggggggtacagtgcaggggaaagaatagtagacataatagcaacagacatacaaactaaagaactacaaaaacaaattacaaaaattcaaaattttcgggtttattacagggacagcagagatccactttggaaagga
ccagcaaagcttctctggaaaggtgaaggggcagtagtaatacaagataatagtgacataaaagtagtgccaagaagaaaagcaaagatcattagggattatggaaaacagatggcaggtgatgattgtgtggcaagtagacaggatgaggattagaacatggaaaagtttag
taaaacaccatatgtatgtttcaaggaaagctaagggatggttttatagacatcactatgaaagtactcatccgagaataagttcagaagtacacatcccactagggaatgcaaaattggtaataacaacatattggggtctacatacaggagaaagagactggcatttgggt
caaggagtctccatagaattgaggaaaaggagatatagcacacaattagaccctaacctagcagaccaactaattcatctgcattactttgattgtttttcagaatctgctataagaaatgccatattaggacatatagttagccctaggtgtgaatatcaagcaggacataa
caaggtaggatctctacagtacttggcactaacagcattagtaagaccaagaaaaaagataaagccacctttgcctagtgttacaaaactgacagaggatagatggaacaagccccagaagaccaagggccacaaagggaaccatacaatgaatggacactagaacttttaga
ggagctcaagaatgaagctgttagacattttcctaggatatggctccatagcttagggcaacatatctatgaaacttatggagatacttgggcaggagtggaagccataataagaattctgcaacaactgctgtttattcatttcagaattgggtgtcaacatagcagaatag
acattcttcgacgaaggagagcaagaaatggagccagtagatcctagactagagccctggaagcatccaggaagtcagcctaggactgcttgtaccaattgctattgtaaaaagtgttgctttcattgccaagtttgtttcataacaaaaggcttaggcatctcctatggcag
gaagaagcggagacagcgacgaagagctcctcaagacagtcagactcatcaagtttctctatcaaagcagtaagtagtacatgtaatgcaatctttacaaatattagcagtagtagcattagtagtagcagcaataatagcaatagttgtgtggtccatagtattcatagaat
ataggaaaataagaagacaaaacaaaatagaaaggttgattgatagaataatagaaagagcagaagacagtggcaatgagagtgacggagatcaggaagaattatcagcacttgtggaaatggggcacgatgctccttgggatgttaatgatctgtaaagctgcagaaaattt
gtgggtcacagtttattatggggtacctgtgtggaaagaagcaaccaccactctattttgtgcctcagatgctaaagcgtatgatacagaggtacataatgtttgggccacacatgcctgtgtacccacagaccccaacccacaagaagtagaactgaagaatgtgacagaaa
attttaacatgtggaaaaataacatggtagaccaaatgcatgaggatataattagtttatgggatcaaagcctaaagccatgtgtaaaattaaccccactctgtgttactttaaattgcactgattatgggaatgatactaacaccaataatagtagtgctactaaccccact
agtagtagcgggggaatggaggggagaggagaaataaaaaattgctctttcaatatcaccagaagcataagagataaagtgaagaaagaatatgcacttttttatagtcttgatgtaataccaataaaagatgataatactagctataggttgagaagttgtaacacctcagt
cattacacaggcctgtccaaaggtatcctttgaaccaattcccatacattattgtgccccggctggttttgcgattctaaagtgtaatgataaaaagttcaatggaaaaggaccatgtacaaatgtcagcacagtacaatgtacacatggaattaggccagtagtatcaactc
aactgctgttaaatggcagtctagcagaagaagaggtagtaattagatcagacaatttctcggacaatgctaaagtcataatagtacatctgaatgaatctgtagaaattaattgtacaagactcaacaacattacaaggagaagtatacatgtaggacatgtaggaccaggc
agagcaatttatacaacaggaataataggaaaaataagacaagcacattgtaacattagtagagcaaaatggaataacactttaaaacagatagttacaaaattaagagaacaatttaagaataaaacaatagtctttaatcaatcctcaggaggggacccagaaattgtaat
gcacagttttaattgtggaggggaatttttctactgtaattcaacacaactgtttaacagtacttggaatggtactgcatggtcaaataacactgaaggaaatgaaaatgacacaatcacactcccatgcagaataaaacaaattataaacatgtggcaggaagtaggaaaag
caatgtatgcacctcccatcagaggacaaattagatgttcatcaaatattacagggctgatattaacaagagatggtggtattaaccagaccaacaccaccgagattttcaggcctggaggaggagatatgaaggacaattggagaagtgaattatataaatataaagtagta
aaaattgaaccattaggagtagcacccaccaaggcaaagagaagagtggtgcaaagagaaaaaagagcagtgggaataataggagctatgctccttgggttcttgggagcagcaggaagcactatgggcgcagcgtcaatgacgctgacggtacaggccagacaattattgtc
tggtatagtgcaacagcagaacaatttgctgagggctattgaggcgcaacagcatctgttgcacctcacagtctggggcatcaagcagctccaagcaagagtcctggctgtggaaagatacctaagggatcaacagctcctggggttttggggttgctctggaaaactcattt
gcaccactgctgtgccttggaatactagttggagtaataaatctctgagtcagatttgggataacatgacctggatgcagtgggaaagggaaattgataattacacaagcttaatatacaacttaattgaagaatcgcaaaaccaacaagaaaagaatgaacaagagttattg
gaattagataactgggcaagtttgtggaattggtttagcataacaaattggctgtggtatataaaaatattcataatgatagtaggaggcttggtaggtttaagaatagtttttactgtactttctatagtaaatagagttaggcagggatactcaccattgtcgtttcagac
gcgcctcccagccaggaggggacccgacaggcccgaaggaatcgaagaagaaggtggagagagagacagagacagatccggtcaattagtggatggattcttagcaattatctgggtcgacctgcggagcctgtgcctcttcagctaccaccgcttgagagacttactcttga
ttgtaacgaggattgtggaacttctgggacgcagggggtgggaagccctcaaatattggtggaatctcctacaatattggattcaggaactaaagaatagtgctgttagcttgctcaacgccacagccatagcagtagctgagggaactgatagggttatagaagtattacaa
agagcttgtagagctattctccacatacctagaagaataagacagggcttagaaagggctttgcaataagatgggtggtaagtggtcaaaaagtagtaaaattggatggcctactgtaagggaaagaatgagaagagctgagccagcagcagatggggtgggagcagtatctc
gagacctggaaaaacatggagcaatcacaagtagtaatacagcaactaacaatgctgattgtgcctggctagaagcacaagaggaggaggaggtgggttttccagtcagacctcaggtacctttaagaccaatgacttacaagggagcgttagatcttagccactttttaaaa
gaaaaggggggactggaagggctaatttggtcccagaaaagacaagacatccttgatttgtgggtccaccacacacaaggctacttccctgattggcagaactacacaccagggccagggatcagatatccactgacctttggttggtgcttcaagctagtaccagttgagcc
agagaaggtagaagaggccaatgaaggagagaacaacagattgttacaccctgtgagcctgcatgggatggaggacccggagaaagaagtgttagtatggaggtttgacagccgcctagtactccgtcacatggcccgagagctgcatccggagtactacaaggactgctgac
actgagctttctacaagggactttccgctggggactttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagatgctgcatataagcagctgctttttgcctgtactgggtctctcttgttagaccagatctgagcctgggagctctctggctaactagg
gaacccactgcttaagcctcaataaagcttgccttgagtgcttca
From gene to protein and its function(s)
Gene Function
> DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC
TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA
TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA
ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG
TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA
TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG
GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA
CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC
TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA
ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG
TAAGAAGATCGCGAACATCTAGTAGA

> Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDV
NIDELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISG
KKVALFGSYGWGDGKWMRDFEERMNGYGCVVVETP
LIVQNEPDEAEQDCIEFGKKIANI
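The step from the DNA sequence to the protein sequence shown above can be sketched in Python with the standard genetic code. This is a minimal illustration, not the tool used to prepare the slide: the function name `translate`, the choice to start at the first ATG, and the use of "X" for unrecognised codons are all assumptions made for the example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate from the first ATG up to (not including) the first stop codon.

    Assumption for this sketch: the coding region starts at the first ATG
    on the given strand. Codons with unexpected letters become 'X'.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # First bases of the DNA sequence shown above.
    print(translate("AATTCATGAAAATCGTATACTGGTCTGGTACC"))
```

Applied to the first bases of the DNA sequence, the sketch yields MKIVYWSGT, which matches the start of the protein sequence on the slide.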
Goals of Functional Genomics
What is the function of these structures?
What is the function of this sequence?
What is the function of this motif?

– the fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions
– knowing the fold and function allows us to rationalise how the structure affects its function at the molecular level
Bioinformatics Application Levels
• Basic Level
• Organization of the collected data
• Maintenance: correction and update
• Types of data sets:
• Genome sequence
• Macromolecular structures
• Functional genomics experimental data
• Others
– Phylogenetic trees, metabolic pathways, scientific literature, etc.
• Very sophisticated databases are needed
Protein Data Bank (PDB)
http://www.rcsb.org/pdb/
PDB holdings by experimental method and molecule type (as of Tuesday Jun 26, 2007)

Exp. Method            Proteins   Nucleic Acids   Protein/NA Complexes   Other   Total
X-ray                     35091             973                   1624      28   37716
NMR                        5457             773                    130       7    6367
Electron Microscopy         101              10                     38       0     149
Other                        81               4                      3       0      88
Total                     40730            1760                   1795      35   44320

SWISS-PROT/TrEMBL
• Collaboration between the SIB (CH) and EMBL/EBI (UK)
• SWISS-PROT: fully annotated (manually), non-redundant, cross-referenced, documented protein sequence database
• TrEMBL: automatically generated (from annotated EMBL coding sequences (CDS)) and annotated using software tools
http://ca.expasy.org/sprot/
SWISS-PROT/TrEMBL
The 10-Jul-2007 release of UniProtKB/TrEMBL contains 4,553,922 sequence entries
http://ca.expasy.org/sprot/
NCBI Entrez Genome Projects
http://www.ncbi.nlm.nih.gov/entrez/
• Second Level
• Development of tools and resources
• For analysis and interpretation of data
• More challenging task
• More important and interesting to biologists
• One important task is searching for similarity
BLAST: Sequence Similarity Searches
VAST: Structure Similarity Searches
• Third Level
• Modeling and simulating different bio-modules
• Use system level analysis and interpretation
• Search for and unravel the origin of life and the rules of evolution
• Use the acquired knowledge for treating and curing disease and aging
Some Bioinformatics Applications
• Information Search and Retrieval
• One indispensable tool needed in bioinformatics
• Gigantic databases are piling up
• We need very capable search tools
– An example is PUBMED
http://www.ncbi.nlm.nih.gov/sites/entrez?db=PubMed
Genetics Based Applications
• Three types of computation problems:

• Gene Annotation
– Identify the genes
– locate promoters, binding sites etc.
• Homology Detection
– assess similarity with known genes
• Genome-wide Analysis
– derive evolutionary relationship
– identify gene families
– determination of chromosomal location
Sequence Comparison
• One of the most useful applications for biologists
• Similarity search is helpful for
• homology detection
• distance measure
• evolutionary relationship detection
• Most popular tools are
• BLAST
• FASTA
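To make the idea of a similarity score concrete, here is a minimal Python sketch of local alignment scoring by dynamic programming (the Smith-Waterman recurrence). It is not how BLAST or FASTA are implemented, since both are faster heuristic tools; the function name and the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
def local_alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences.

    Uses the Smith-Waterman recurrence with a linear gap penalty and keeps
    only two rows of the scoring matrix in memory.
    """
    previous = [0] * (len(seq_b) + 1)
    best = 0
    for i in range(1, len(seq_a) + 1):
        current = [0] * (len(seq_b) + 1)
        for j in range(1, len(seq_b) + 1):
            substitution = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            current[j] = max(
                0,                               # start a new local alignment
                previous[j - 1] + substitution,  # align the two residues
                previous[j] + gap,               # gap in seq_b
                current[j - 1] + gap,            # gap in seq_a
            )
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    print(local_alignment_score("ACGT", "TTACGTTT"))
```

A higher score means a longer or better-matching shared region; a score of 0 means no residue of one sequence matches any residue of the other under this scoring.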
Linkage Analysis
• Used to identify the chromosomal location of genes
• Involves the analysis of large amounts of data
• Has important implications for disease identification
• Many programs are available
• http://linkage.rockefeller.edu
Phylogenetic Analysis
• Also known as molecular taxonomy
• Evolutionary relationships are presented in the form of a tree
• One popular tool is PHYLIP
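Tree building commonly starts from pairwise distances between aligned sequences. The Python sketch below computes a simple p-distance (the fraction of compared positions that differ) and reports the closest pair. It is an illustration only and does not reproduce any PHYLIP program; the function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of differing positions between two aligned sequences.

    Positions where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(seq_a, seq_b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differences += 1
    return differences / compared if compared else 0.0


def closest_pair(aligned):
    """Return the pair of names whose sequences have the smallest p-distance."""
    return min(
        combinations(sorted(aligned), 2),
        key=lambda pair: p_distance(aligned[pair[0]], aligned[pair[1]]),
    )


if __name__ == "__main__":
    # Hypothetical aligned toy sequences, not real data.
    toy = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTACGA"}
    print(closest_pair(toy))
```

Distance-based tree methods repeatedly join the closest groups, so the closest pair is where such a tree would start.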
Rational Drug Design
• Understanding how structures bind other molecules (function)
• Designing inhibitors
• Docking, structure modeling
Drug Lead Screening & Docking
Complementarity
- Shape
- Chemical
- Electrostatic
Computer Aided Drug Design (CADD)
• A recently emerging discipline
• Uses
– Bioinformatics Tools
– Chemoinformatics
– Combinatorial Chemistry
• Commercially very important
– Some tools are already available
Drug Development Life Cycle
• Discovery (2 to 10 years)
• Preclinical Testing (lab and animal testing)
• Phase I (20-30 healthy volunteers used to check for safety and dosage)
• Phase II (100-300 patient volunteers used to check for efficacy and side effects)
• Phase III (1000-5000 patient volunteers used to monitor reactions to long-term drug use)
• FDA Review & Approval
• Post-Marketing Testing
• With the aid of bioinformatics
• Time axis: 0 to 16 years; overall 7 to 15 years!
Drug lead screening
• 5,000 to 10,000 compounds screened
• 250 lead candidates in Preclinical Testing
• 5 drug candidates enter Clinical Testing; 80% pass Phase I, 30% pass Phase II, 80% pass Phase III
• One drug approved by the FDA
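The arithmetic behind the funnel above can be checked with a short Python sketch. It assumes the three phase pass rates apply one after another to the 5 clinical candidates; the function name `expected_approvals` is a hypothetical helper for this example.

```python
def expected_approvals(candidates, pass_rates):
    """Multiply the number of candidates by each successive pass rate."""
    expected = float(candidates)
    for rate in pass_rates:
        expected *= rate
    return expected


if __name__ == "__main__":
    # Phase I, Phase II and Phase III pass rates from the slide.
    phase_pass_rates = [0.80, 0.30, 0.80]
    print(f"{expected_approvals(5, phase_pass_rates):.2f}")
```

Five candidates times 0.80, 0.30 and 0.80 gives roughly 0.96, which is consistent with about one approved drug per 5,000 to 10,000 compounds screened.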
Systems Biology
• System-level identification of organisms and organelles
• How does the system work with its constituents?
• How are outputs generated from given inputs?
• More concerned with modeling and simulations
• Holds great promise for disease treatment and cure
Applications of Bioinformatics (Summary)
[Figure: summary diagram of bioinformatics applications; chemical structure drawings, example genome sequence and protein alignment omitted]
• Bioinformatics: data analysis, algorithms, visualization, statistics, etc.
• Applications: Search for new drugs, DNA chips, Genetic Variations, Biochemical Networks, Optimizing therapies
• Data and analyses: Genomes, Proteins, Molecular Interactions, Structure Prediction, Sequence Analysis
