Introduction To Bioinformatics 1
GENERAL INFORMATION
Course Methodology
• Each lecture starts with a "mini-exam" of three short questions on
the previous lecture.
• In the skills classes (SCs) several programming tasks are performed, one of
which must be submitted before the next SC.
mini-exams
* Closed Book
Skills Class:
Final Exam:
* Open book
Grading:
Study Points:
6 ECTS/ 4 NSP
Course Book:
Introduction to Computational Genomics: A Case Studies Approach
Nello Cristianini, Matthew W. Hahn
Additional recommended texts:
Introduction to Bioinformatics.
LECTURES
LECTURE 1:
* Prologue (In praise of cells)
Prologue :
In praise of cells
GENOMICS and PROTEOMICS
Genomics is the study of an organism's genome and the use of the genes.
It deals with the systematic use of genome information, associated with other
data, to provide answers in biology, medicine, and industry.
modern map-makers
have mapped the entire
human genome
Metabolic activity in GENETIC PATHWAYS
How can we measure metabolic processes and gene activity?
EXAMPLE:
Caenorhabditis elegans
Some fine day in 1982 …
Boy, do I want to map the activity of these genes!
Until recently we lacked tools to measure
gene activity
Stephen Fodor
Developer of the microarray
Some fine day many, many, many years later …
Now I’m almost there …
Using microarray technology we can
now make time series of the activity of
our 22,000 genes: so-called
genome-wide expression profiles
The identification of genetic pathways
from microarray time series
Sequences of genome-wide expression profiles
at consecutive time points
become more realistic
with decreasing costs …
Genomewide expression profiles: 25,000 genes
Now the problem is to map these
microarray-series of genome-wide
expression profiles into something that
tells us what the genes are actually doing
… for instance a network representing
their interaction
GENOMICS: structure and coding
DNA
Deoxyribonucleic acid (DNA) is a nucleic acid that
contains the genetic instructions specifying the biological
development of all cellular forms of life (and most viruses).
DNA under electron microscope
3D model of a section of the DNA molecule
James Watson and Francis Crick
Genetic code
The genetic code is a set of rules that maps DNA sequences
to proteins in the living cell, and is employed in the process of
protein synthesis.
Nearly all living things use the same genetic code, called the
standard genetic code, although a few organisms use minor
variations of the standard code.
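As a minimal sketch of how such a rule set maps a DNA sequence to a protein, the Python snippet below builds the standard genetic code as a lookup table and translates a coding sequence codon by codon. The function name `translate`, the choice to stop at the first stop codon, and the example sequence are illustrative assumptions, not part of the course material.

```python
from itertools import product

# Standard genetic code, with codons enumerated using bases in the order T, C, A, G.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence (reading frame 0) into one-letter amino acids.

    Translation stops at the first stop codon; trailing bases that do not
    form a complete codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon reached
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical example sequence: ATG GCC AAA TGA
print(translate("ATGGCCAAATGA"))
```

Organisms that use a variant code would need a different `AMINO_ACIDS` string; only the standard code is covered here.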
Replication of DNA
Genetic code: TRANSCRIPTION
DNA → RNA
Transcription is the process through which a DNA sequence is enzymatically
copied by an RNA polymerase to produce a complementary RNA. Or, in other
words, the transfer of genetic information from DNA into RNA. In the case of
protein-encoding DNA, transcription is the beginning of the process that
ultimately leads to the translation of the genetic code (via the mRNA
intermediate) into a functional peptide or protein. Transcription has some
proofreading mechanisms, but they are fewer and less effective than the
controls for DNA; therefore, transcription has a lower copying fidelity than
DNA replication.
Like DNA replication, transcription proceeds in the 5' → 3' direction (i.e., the old
polymer is read in the 3' → 5' direction and the new, complementary
fragments are generated in the 5' → 3' direction).
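The directionality described above can be illustrated with a small Python sketch. It assumes the template strand is given written 3' → 5', so that reading it left to right yields the complementary RNA written 5' → 3'. The function name `transcribe` and the example template are hypothetical.

```python
# Base-pairing rules when RNA is synthesised on a DNA template (U replaces T in RNA).
DNA_TO_RNA = {"A": "U", "C": "G", "G": "C", "T": "A"}


def transcribe(template_3_to_5: str) -> str:
    """Return the RNA (written 5' -> 3') for a template strand written 3' -> 5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


# Hypothetical template strand, written 3' -> 5'.
template = "TACCGGTTT"
print(transcribe(template))
```

The sketch ignores promoters, termination and RNA processing; it only shows the complementarity and reading direction.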
Protein Structure:
primary structure
Protein Structure:
secondary structure
a: alpha-helix,
b: beta-sheet
Protein Structure:
super-secondary Structure
Protein Structure = protein function:
EVOLUTION and the origin of SPECIES
Tree of Life
Phylogenetic relations between
cetaceans and artiodactyls
Unsolved problems in biology
Life. How did it start? Is life a cosmic phenomenon? Are the conditions necessary for the origin of
life narrow or broad? How did life originate and diversify over hundreds of millions of years? Why, after
rapid diversification, do microorganisms remain unchanged for millions of years? Did life start on
this planet or was there an extraterrestrial intervention (for example a meteor from another planet)?
Why have so many biological systems developed sexual reproduction? How do organisms
recognize like species? How are the sizes of cells, organs, and bodies controlled? Is immortality
possible?
DNA / Genome. Do all organisms link together to a primary source? Given a DNA sequence, what
shape will the protein fold into? Given a particular desired shape, what DNA sequence will produce
it? What are all the functions of the DNA? Other than the structural genes, which is the simpler part
of the system? What is the complete structure and function of the proteome, the proteins expressed by
a cell or organ at a particular time and under specific conditions? What is the complete function of
the regulator genes? The building block of life may be a precursor to a generation of electronic
devices and computers, but what are the electronic properties of DNA? Does Junk DNA function as
molecular garbage?
Viruses / Immune system. What causes immune system deficiencies? What are the signs of
current or past infection to discover where Ebola hides between human outbreaks? What is the
origin of antibody diversity? What leads to the complexity of the immune system? What is the
relationship between the immune system and the brain?
Humanity: Why are there drastic changes in hominid morphology? Why are there giant hominid
skeletons and very small hominid skeletons? Is hominid evolution static? Is hominid devolution
possible? Are there Human-Neanderthal hybrids? What explains the differences between Human
and Neanderthal fossils?
Introduction to Bioinformatics.
LECTURE 1:
CHAPTER 1:
The first look at a genome (sequence statistics)
* All models are wrong, but some are useful. (G. Box)
• Definition of genome
• Prokaryotic genomes
• Eukaryotic genomes
• Viral genomes
• Organellar genomes
* p = {pA,pC,pG,pT}, pA + pC + pG + pT = 1
* $P(s) = \prod_{i=1}^{n} p(s(i))$
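As a sketch of the multinomial model above, the following Python snippet evaluates the probability of a sequence as the product of its per-nucleotide probabilities, working in log space to avoid underflow on long sequences. The function name and the uniform example distribution are assumptions made for illustration.

```python
import math


def sequence_log_probability(seq: str, p: dict[str, float]) -> float:
    """Log-probability of `seq` under the multinomial (i.i.d.) sequence model."""
    if not math.isclose(sum(p.values()), 1.0):
        raise ValueError("nucleotide probabilities must sum to 1")
    # log P(s) = sum over positions i of log p(s(i))
    return sum(math.log(p[s]) for s in seq.upper())


# Hypothetical uniform distribution over the four nucleotides.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
log_p = sequence_log_probability("ACGT", uniform)
print(math.exp(log_p))  # equals 0.25 ** 4 under the uniform model
```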
Markov
sequence model
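The slide only names the model, so the following is a hedged sketch under the assumption that a first-order Markov chain is meant, in which the probability of each nucleotide depends on the one before it. The transition probabilities are estimated from adjacent-pair counts; the function name and the toy sequence are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(seq: str) -> dict[str, dict[str, float]]:
    """Estimate P(next nucleotide | current nucleotide) from adjacent pairs in `seq`."""
    seq = seq.upper()
    pair_counts = Counter(zip(seq, seq[1:]))  # counts of (current, next) pairs
    totals = Counter(seq[:-1])                # how often each symbol has a successor
    probs: dict[str, dict[str, float]] = defaultdict(dict)
    for (current, nxt), count in pair_counts.items():
        probs[current][nxt] = count / totals[current]
    return dict(probs)


# Toy sequence, not real genomic data.
print(transition_probabilities("AACA"))
```

Transitions that never occur in the input are simply absent from the result; a real analysis would usually add pseudocounts.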
Evidence for co-evolution of gene order and recombination rate
Csaba Pál & Laurence D. Hurst
Nature Genetics 33, 392 - 395 (2003)
Organism GC content (%)
H. influenzae 38.8
M. tuberculosis 65.8
S. enteritidis 49.5
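GC content values like those in the table are the percentage of G and C nucleotides in a sequence. A minimal Python sketch follows; the function name and the short test sequence are illustrative assumptions, and ambiguous symbols such as N are ignored by choice.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among the A, C, G, T symbols of `seq`."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / total


# Toy sequence: 4 of 6 bases are G or C.
print(round(gc_content("GGCCAT"), 1))
```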
GC content
• Detect foreign genetic material
• Evolution:
Archaea > Eubacteria > Eukaryotes
Change points in lambda phage
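One simple way to look for change points in GC content along a genome such as lambda phage is to compute the GC fraction in windows and inspect where it shifts. The sketch below assumes fixed-size windows and uses a toy sequence rather than real phage data; the function name and the window and step sizes are hypothetical choices, and it is not a statistical change-point method.

```python
def windowed_gc(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """GC fraction in successive windows, as (start position, GC fraction) pairs."""
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        profile.append((start, gc / window))
    return profile


# Toy sequence (not real phage data): a GC-rich half followed by an AT-rich half.
toy = "GCGCGCGCGC" + "ATATATATAT"
for start, gc in windowed_gc(toy, window=5, step=5):
    print(start, gc)
```

A jump between neighbouring windows, as between positions 5 and 10 in the toy sequence, is a candidate change point.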
k-mer frequency motif bias
• dimer, trimer, k-mer: nucleotide word of length 2, 3, k
• “unusual” k-mers
• 2-mer in H. influenzae
2-mer (dinucleotide) density in H. influenzae
     A       C       G       T
A  0.1202  0.0505  0.0483  0.0912
C  0.0665  0.0372  0.0396  0.0484
G  0.0514  0.0522  0.0363  0.0499
T  0.0721  0.0518  0.0656  0.1189
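A table of this kind can be produced by counting overlapping k-mers and dividing by the number of windows. The following Python sketch does this for arbitrary k; the function name and the toy sequence are assumptions for illustration.

```python
from collections import Counter


def kmer_frequencies(seq: str, k: int) -> dict[str, float]:
    """Relative frequency of every overlapping k-mer that occurs in `seq`."""
    seq = seq.upper()
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    return {kmer: count / n_windows for kmer, count in counts.items()}


# Toy sequence with five overlapping dimers: AA, AC, CG, GT, TT.
print(kmer_frequencies("AACGTT", 2))
```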
AAAGTGCGGT
ACCGCACTTT
Why?
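One property of the two words above that can be checked directly is that each is the reverse complement of the other, so an occurrence of one on a given strand implies an occurrence of its partner on the opposite strand. Whether this is the full answer the slide has in mind is an assumption. A short Python check, with a hypothetical helper name:

```python
# Translation table mapping each nucleotide to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("AAAGTGCGGT"))
print(reverse_complement("AAAGTGCGGT") == "ACCGCACTTT")
```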
Unusual DNA-words
Compare OBSERVED with EXPECTED frequency
of a word using multinomial model
Observed/expected ratio:
     A       C       G       T
A  1.2491  0.8496  0.8210  0.9535
C  1.1182  1.0121  1.0894  0.8190
G  0.8736  1.4349  1.0076  0.8526
T  0.7541  0.8763  1.1204  1.2505
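Under the multinomial model the expected frequency of a dinucleotide xy is p(x) · p(y), so the observed/expected ratio is the observed dimer frequency divided by that product. The Python sketch below computes the ratio for every dimer present in a sequence; the function name and the toy input are assumptions, and the single-nucleotide probabilities are estimated from the same sequence.

```python
from collections import Counter


def observed_expected_dimers(seq: str) -> dict[str, float]:
    """Observed/expected ratio for each dinucleotide occurring in `seq`.

    The expected frequency of a dimer xy under the multinomial model is
    p(x) * p(y), with p estimated from single-nucleotide frequencies of `seq`.
    """
    seq = seq.upper()
    p = {base: count / len(seq) for base, count in Counter(seq).items()}
    n_dimers = len(seq) - 1
    observed = Counter(seq[i:i + 2] for i in range(n_dimers))
    return {
        dimer: (count / n_dimers) / (p[dimer[0]] * p[dimer[1]])
        for dimer, count in observed.items()
    }


# Toy sequence, not real genomic data.
print(observed_expected_dimers("AACC"))
```

As a rough consistency check on the H. influenzae numbers, taking the sum of the A row of the dinucleotide density table (0.3102) as an estimate of p(A) gives 0.1202 / 0.3102² ≈ 1.249, in line with the AA entry of the ratio table.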
Genome signature: nucleotide motif bias in four genomes
• Online databases
DATABASES
OVERVIEW OF DATABASES
Generalized DNA, protein
and carbohydrate databases
NCBI: National Center for Biotechnology Information
NCBI - GenBank
The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) constitutes
Europe's primary nucleotide sequence resource. Main sources for DNA and RNA
sequences are direct submissions from individual researchers, genome
sequencing projects and patent applications.
EBI: European
Bioinformatics Institute
ExPASy Proteomics Server
(SWISS-PROT)
Generalized DNA, protein
and carbohydrate databases
Carbohydrate databases
3D structure databases
PROTEIN DATA BANK
DATABASE SEARCH
Entrez cross-database search page
• PubMed: biomedical literature citations and abstracts
• PubMed Central: free, full text journal articles
• Site Search: NCBI web and FTP sites
• Books: online books
• OMIM: online Mendelian Inheritance in Man
• OMIA: online Mendelian Inheritance in Animals
• Nucleotide: sequence database (GenBank)
• Protein: sequence database
• Genome: whole genome sequences
• Structure: three-dimensional macromolecular structures
• Taxonomy: organisms in GenBank
• SNP: single nucleotide polymorphism
• Gene: gene-centered information
• HomoloGene: eukaryotic homology groups
• PubChem Compound: unique small molecule chemical structures
• PubChem Substance: deposited chemical substance records
• Genome Project: genome project information
• UniGene: gene-oriented clusters of transcript sequences
• CDD: conserved protein domain database
• 3D Domains: domains from Entrez Structure
• UniSTS: markers and mapping data
• PopSet: population study data sets
• GEO Profiles: expression and molecular abundance profiles
• GEO DataSets: experimental sets of GEO data
• Cancer Chromosomes: cytogenetic databases
• PubChem BioAssay: bioactivity screens of chemical substances
• GENSAT: gene expression atlas of mouse central nervous system
• Probe: sequence-specific reagents
The Assembly Archive, recently created at NCBI, links together trace data and finished sequence, providing
complete information about a genome assembly. The Assembly Archive's first entries are a set of closely related
strains of Bacillus anthracis. The assemblies are available at TraceAssembly.
See more about Bacillus anthracis genome
Bacillus licheniformis ATCC 14580
Release Date: September 15, 2004
Reference: Rey, M.W., et al.
Complete genome sequence of the industrial bacterium Bacillus licheniformis and
comparisons with closely related Bacillus species. Genome Biol. 5, R77 (2004)
Lineage: Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus.
Organism: Bacillus licheniformis ATCC 14580
Genome sequence information
chromosome - CP000002 - NC_006270
Size: 4,222,336 bp Proteins: 4161
Sequence data files submitted to GenBank/EMBL/DDBJ can be found at NCBI FTP:
GenBank or RefSeq Genomes
Bacillus cereus ZK
Release Date: September 15, 2004
Reference: Brettin, T.S., et al. Complete genome sequence of Bacillus cereus ZK
Lineage: Bacteria; Firmicutes; Bacillales; Bacillaceae; Bacillus; Bacillus cereus group.
Organism:
BLAST
NCBI → BLAST
Latest news: 6 December 2005: BLAST 2.2.13 released
The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity
between sequences. The program compares nucleotide or protein sequences to
sequence databases and calculates the statistical significance of matches. BLAST can
be used to infer functional and evolutionary relationships between sequences as well
as help identify members of gene families.
Nucleotide
• Quickly search for highly similar sequences (megablast)
• Quickly search for divergent sequences (discontiguous megablast)
• Nucleotide-nucleotide BLAST (blastn)
• Search for short, nearly exact matches
• Search trace archives with megablast or discontiguous megablast
Protein
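To make "regions of local similarity" concrete, the sketch below scores the best local alignment between two sequences with the Smith-Waterman dynamic programme. This is not BLAST, which uses a faster heuristic and also reports statistical significance; it is only a small, exact illustration of local alignment. The scoring parameters, function name and example sequences are assumptions.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between `a` and `b` (linear gaps)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which lets an alignment start anywhere.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# Hypothetical sequences sharing the local region ACGT.
print(local_alignment_score("TTACGTAA", "GGACGTCC"))
```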
Download Software
Kangaroo
MOTIF-BASED SEARCH
Kangaroo is a program that facilitates searching for gene and protein patterns
and sequences
To use this program, simply enter a sequence of DNA or Amino Acids in the
pattern window, choose the type of search, the taxonomy and submit your
request.
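The general idea of searching a sequence for a pattern can be sketched with Python regular expressions. This does not reproduce how Kangaroo itself works or its query syntax; the function name, the treatment of N as "any nucleotide", and the example sequence are assumptions made for illustration.

```python
import re


def find_motif(pattern: str, sequence: str) -> list[int]:
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # A lookahead lets re.finditer report overlapping occurrences as well.
    regex = re.compile(f"(?=(?:{pattern}))", re.IGNORECASE)
    return [m.start() for m in regex.finditer(sequence)]


# 'N' in the query is treated here as "any nucleotide" (an assumed convention).
query = "TATAN".replace("N", "[ACGT]")
print(find_motif(query, "GGTATAAATATATC"))
```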
ANALYSIS TOOLS
Phylogeny
MISCELLANEOUS
Literature search
Patent search
Bioinformatics centers and servers
Links to other collections of bioinformatics resources
Medical resources
Bioethics
Protocols
Software
(Bio)chemistry
Educational resources
Introduction to Bioinformatics.
END of LECTURE 1