Bioinformatics Lesson 01
INTRODUCTION
From Biology to Bioinformatics
• Tremendous recent progress in
– Biology (molecular biology, genetics, etc.)
– Technological advancements
• Opened many new domains
• From academic interest
– To commercial interest
• From knowledge discovery
– To industrial development
• Added pursuit for
– Longer life
– Cure for diseases
Human Genome Project (HGP)
• Started in 1986 (formally in 1990); completed in April 2003
• Coordinated by the U.S. Department of Energy (DoE) and the National Institutes of Health (NIH)
Goals:
■ identify all the genes in human DNA,
■ determine all the sequences of chemical base pairs that make up human DNA,
■ store this information in databases,
■ improve tools for data analysis,
■ transfer related technologies to the private sector, and
■ address the ethical, legal, and social issues (ELSI) that may arise from the project.
http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
Human Genome Project (HGP)
By the Numbers
• About 3 billion (3,164.7 million) chemical nucleotide bases (A, C, T, and G).
• The average gene consists of 3,000 bases, but sizes vary greatly (the largest is 2.4 million bases).
• The total number of genes is estimated at around 30,000, much lower than previous estimates of 80,000 to 140,000.
• The functions are unknown for over 50% of discovered genes.
http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
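The figures above allow a rough back-of-the-envelope check of how much of the genome the estimated genes would cover. The sketch below simply multiplies the slide's average gene size by its estimated gene count; the variable names are illustrative, and because gene sizes vary greatly, the result is only a crude estimate.

```python
# Figures quoted on the slide (Human Genome Project, "By the Numbers").
GENOME_BASES = 3_164_700_000   # 3164.7 million nucleotide bases
AVERAGE_GENE_BASES = 3_000     # average gene size in bases
ESTIMATED_GENES = 30_000       # estimated number of genes

# Crude estimate: assumes every gene has exactly the average size.
gene_bases = AVERAGE_GENE_BASES * ESTIMATED_GENES
fraction = gene_bases / GENOME_BASES

print(f"Bases covered by genes (rough): {gene_bases:,}")
print(f"Share of the genome (rough):    {fraction:.1%}")
```

Under these assumptions the genes account for about 90 million bases, or roughly 3% of the genome.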
How does the human genome stack up?
Organism                Genome Size (Bases)    Estimated Genes
Human (Homo sapiens)    3 billion              30,000
Gene Function
• Basic Level
• Organization of the collected data
• Maintenance: correction and update
• Types of data sets:
• Genome sequence
• Macromolecular structures
• Functional genomics experimental data
• Others
– Phylogenetic trees, metabolic pathways, scientific literature, etc.
• Very sophisticated databases are needed
Protein Data Bank (PDB)
http://www.rcsb.org/pdb/
SWISS-PROT/TrEMBL
http://ca.expasy.org/sprot/
NCBI Entrez Genome Projects
http://www.ncbi.nlm.nih.gov/entrez/
Bioinformatics Application Levels
• Second Level
• Development of tools and resources
• For analysis and interpretation of data
• More challenging task
• More important and interesting to biologists
• One important task is searching for similarity
BLAST: Sequence Similarity Searches
VAST: Structure Similarity Searches
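To make the idea of a sequence similarity search concrete, here is a minimal sketch of local alignment scoring in the style of Smith-Waterman. It is not how BLAST works internally (BLAST uses faster heuristics), and the function name, the scoring values, and the toy database are assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Smith-Waterman style dynamic programming with a linear gap penalty.
    The scoring values are illustrative defaults, not BLAST's.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + step, # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def rank_by_similarity(query, database):
    """Sort (name, sequence) pairs by descending local alignment score."""
    scored = [(local_alignment_score(query, seq), name) for name, seq in database]
    return sorted(scored, reverse=True)


# Hypothetical toy database, for illustration only.
toy_database = [("seq1", "TTACGTTT"), ("seq2", "GGGGGG"), ("seq3", "ACGT")]
print(rank_by_similarity("ACGT", toy_database))
```

A similarity search then amounts to scoring the query against every database entry and reporting the best hits; production tools add indexing and statistical significance estimates on top of this idea.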
Bioinformatics Application Levels
• Third Level
• Modeling and simulating different bio-modules
• Use system level analysis and interpretation
• Search for and unravel the origin of life and the rules of evolution
• Use the acquired knowledge for treating and curing disease and aging
Some Bioinformatics Applications
http://www.ncbi.nlm.nih.gov/sites/entrez?db=PubMed
Genetics Based Applications
Complementarity
- Shape
- Chemical
- Electrostatic
Computer Aided Drug Design (CADD)
With the aid of Bioinformatics
• Drug lead screening
• Preclinical Testing (lab and animal testing)
• Phase I (20-30 healthy volunteers used to check for safety and dosage)
• Phase II (100-300 patient volunteers used to check for efficacy and side effects)
• Phase III (1000-5000 patient volunteers used to monitor reactions to long-term drug use)
• FDA Review & Approval
• Post-Marketing Testing
• One drug approved by the FDA
Time axis: 0 to 16 years (7 to 15 years in total!)
Systems Biology
DNA chips
Genetic Variations
Biochemical Networks
Optimizing therapies
Data analysis, algorithms, visualization, statistics, etc.
Genomes
caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagcc
aaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatct
cttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatat
tgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaa
actttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgaca
ttgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagca
agcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaa
tacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatga
aatccaccattcaattacaacaagatcctctttcttgcacttgg
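As a small illustration of what analysis of such a genome fragment can look like, the sketch below computes base counts, GC content, and the reverse complement for the first line of the sequence shown above. The function names are illustrative, and the code assumes the sequence contains only the letters a, c, g, and t.

```python
from collections import Counter

# First line of the DNA fragment shown on the slide.
FRAGMENT = "caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagcc"


def base_counts(seq):
    """Count occurrences of a, c, g, and t (case-insensitive)."""
    counts = Counter(seq.lower())
    return {base: counts.get(base, 0) for base in "acgt"}


def gc_content(seq):
    """Return the fraction of bases that are g or c (0.0 for empty input)."""
    counts = base_counts(seq)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["g"] + counts["c"]) / total


def reverse_complement(seq):
    """Return the reverse complement of a lowercase-normalized DNA string."""
    complement = str.maketrans("acgt", "tgca")
    return seq.lower().translate(complement)[::-1]


print(base_counts(FRAGMENT))
print(f"GC content: {gc_content(FRAGMENT):.2f}")
print(reverse_complement(FRAGMENT))
```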
Proteins
d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSS-VNELGVKIMQGKKTWFSI
d8dfr   LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSH-VNEAGVKIQMGKKTWFSI
d4dfra_ ISLIAALAVDRVIGMENAMPW- NLPADLAWFKRNT-L--D-----
Interactions
Sequence Analysis
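One simple form of sequence analysis is measuring how similar two aligned proteins are. The sketch below computes percent identity for the d1dhfa_ and d8dfr rows listed under Proteins, assuming the two rows are already aligned column by column and that "-" marks a gap. The function name is illustrative, and columns containing a gap are skipped, which is only one of several common conventions.

```python
# Aligned rows as they appear on the slide ("-" marks a gap).
D1DHFA = "LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSS-VNELGVKIMQGKKTWFSI"
D8DFR = "LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSH-VNEAGVKIQMGKKTWFSI"


def percent_identity(row_a, row_b, gap="-"):
    """Percent of identical residues over aligned columns without gaps."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            continue  # skip columns where either row has a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


print(f"d1dhfa_ vs d8dfr: {percent_identity(D1DHFA, D8DFR):.1f}% identity")
```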