Salale University: Genomics and Bioinformatics (BIO-6324) For Postgraduate in Applied Genetics and Biotechnology
PART I:
GENOMICS
What do you think Genomics is ?
What is DNA?
DNA (deoxyribonucleic acid) carries the genetic information in the body’s cells. DNA is
made up of four similar chemicals (called bases and abbreviated A, T, C, and G) that are
repeated over and over in pairs.
What Is a Gene?
A gene is a distinct portion of a cell’s DNA. Genes are coded instructions for making
everything the body needs, especially proteins.
What Are Proteins?
Proteins are chains of chemical building blocks called amino acids. A protein could
contain just a few amino acids in its chain or it could have several thousands. Proteins
form the basis for most of what the body does, such as digestion, making energy and
growing.
What Are Chromosomes?
Genes are packaged in bundles called chromosomes. Humans have 23 pairs of
chromosomes.
Of those, 1 pair is the sex chromosomes (which determine whether you are male or female,
plus some other body characteristics), and the other 22 pairs are autosomal
chromosomes (which determine the rest of the body’s makeup).
What Is the Human Genome?
The human genome is a complete copy of the entire set of human gene instructions. The
Human Genome Project, completed in 2003, identified all the human genes in DNA and
stored the information in databases so all researchers everywhere could use it.
Central Dogma of Molecular Biology
What is a genome?
Genome:
Is the entirety of an organism's hereditary information
The complete set of genes of an organism
Encoded either in DNA or in RNA
Gene :
o A segment of DNA that carries genetic information
o Codes for a certain type of protein or for an RNA
o A lot of DNA is not part of any gene
o In humans, only about 2% of the DNA is protein-coding gene sequence
o What is the remaining 98%?
Chromosome: a long piece of DNA
Genome:
o Complete set of genetic instructions of an organism
o Contains complete set of genes of an organism
o Life is specified by the genomes of the organisms
o Most genomes are made of DNA, but some viruses have
RNA genomes
o Every organism possesses a genome that contains its
biological information
Each of the approximately 10^13 cells in the adult human
body has its own copy or copies of the genome
The human genome consists of two distinct parts
1. The nuclear genome: comprises approximately 3.2 billion
nucleotides of DNA and about 19,000 protein-coding
genes
2. The mitochondrial genome: a circular DNA molecule of
16,569 nucleotides
The human mitochondrial genome contains 37 genes
Genomics:
Any attempt to analyze or compare the entire genetic
complement of a species
Genome Analysis
Entails the prediction of genes in uncharacterized genomic sequences
Experimental genome analysis is slow and time-consuming.
Computational genome analysis is relatively simple.
Prokaryotes
o Most of the genome is coding, in uninterrupted coding regions
o The small non-coding fraction is mostly regulatory sequence
o Essentially no introns
o Computational analysis or gene prediction is much simpler
Eukaryotes
o Most of the genome is non-coding (about 98% in humans)
o Regulatory sequences
o Introns
o Repetitive DNA
o Computational gene prediction and annotation is challenging
Analysis is required:
o For prediction of uncharacterised genes or proteins
o To obtain the complete sequences
o For Genetic modification / engineering
o To know which sequences code for proteins or structural
RNAs
o To indicate the function of the predicted gene products
o To search for mutations or malfunctions of genes
o Structuromics . . .
Genes can be classified into:
1. protein coding genes (proper exon)
2. pseudogenes
3. Functional RNA genes: tRNA, rRNA and others
• snoRNA
• snRNA
• miRNA
There are several kinds of exons:
o Noncoding
o initial coding exons
o internal exons
o terminal exons
Human genome by functions
Comparative genome sizes of organisms
Organism      Size (bp)     Gene number   Average gene density      Chromosome number
H. sapiens    3.2 billion   ~20,000       1 gene / 100,000 bases    46
M. musculus   2.6 billion   ~20,000       1 gene / 100,000 bases    40
Facts about Genome
Individual genomes show extensive variation.
Not all genes are essential.
A substantial part of most eukaryotic nuclear genomes is
made up of Repetitive DNA.
Repetitive DNA: individual sequence elements that are
repeated many times over, either in tandem arrays or
interspersed throughout the genome.
Single copy DNA: which includes most genes, and is
made up of sequences that are not repeated elsewhere.
Extrachromosomal genes:
The mitochondria and chloroplasts
These genes code for the RNAs and some of the proteins
required in the organelle.
Facts about Genome
sequencing , and
functional analysis of genomics.
The field includes studies of intragenomic phenomena
Genomics uses high-throughput approaches that allow many
analyses in parallel
Genomics depends on computational analysis because of its
large data sets
Types of Genomics
Although the field of genomics has widened considerably, in a
broad sense it can be classified into three types
1. Structural genomics
Deals with DNA sequencing, mapping, sequence
assembly, sequence organization and management
It is the starting stage of genome analysis (construction of
genetic, physical or sequence maps of organisms)
2. Functional genomics
Deals with the functions of all gene sequences and their
expression in organisms
Once a gene and its sequence are known, the next step is
to find out what function it has
3. Comparative genomics
comparison of entire genomes across species,
looking at functions and evolutionary relationships
Includes a comparison of gene number, gene content and
gene location in both prokaryotic and eukaryotic organisms
What is conserved between species?
Genes for basic processes
Understand the uniqueness between different species
What makes closely related species different?
Genome analysis tools
Genome analysis comprises the techniques needed to
determine and compare genetic sequences
This may include:
Hybridization analysis (southern, FISH, microarray)
Mapping of DNA
DNA sequencing
By this type of analysis important genes
influencing metabolism, differentiation and
development, and disease processes can be
identified and manipulated
The analysis aims to identify genes that are
predicted to have a particular biological function, as well as
nonfunctional genes
Common molecular testing methods in genomics
FISH
Microarray
DNA hybridization
Genotyping (SNP detection)
PCR (real-time or
conventional)
Genome sequencing
Hybridization analysis (Southern,
FISH, microarray)
Southern blotting
FISH (Fluorescent In Situ Hybridization)
Microarray
FISH
• A variety of specimen types can be analyzed using FISH.
and quantitation
will hybridize to another nucleic acid on the basis of
base complementarity
Probe can be
Radioactive (32P, 35S, 14C, 3H)
Fluorescent
FISH: fluorescent in situ hybridization
Biotinylated (avidin-streptavidin)
[Figure: direct fluorescent-labeled probe. The fluorophore (F) is attached by a covalent bond to the FISH probe DNA, which base-pairs with the specimen DNA]
modes of FISH Probes
Centromere
Telomere
Whole chromosome paint
locus
Modes of probes
Centromeric (satellite)
probes
Genetic mapping / Linkage mapping…
If two genes are on different chromosomes, they are
unlinked and will sort independently during meiosis
Genes on the same chromosome are physically linked
together, and only a crossover between them during meiosis can separate them
The further apart two genes are, the more often they recombine
By working out the recombination frequency it is therefore
possible to produce a map of the relative locations of the
marker genes
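The link between recombination frequency and map position can be sketched in a few lines of Python. This is only an illustration: the offspring counts below are invented, and the conversion of 1% recombination to 1 centimorgan (cM) is the usual approximation for closely linked markers, not a result taken from these slides.

```python
def recombination_frequency(recombinants: int, total_offspring: int) -> float:
    """Fraction of offspring that carry a recombinant marker combination."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinants <= total_offspring:
        raise ValueError("recombinants must lie between 0 and total_offspring")
    return recombinants / total_offspring


def map_distance_cm(recombinants: int, total_offspring: int) -> float:
    """Approximate genetic distance: 1% recombination is taken as 1 cM."""
    return 100.0 * recombination_frequency(recombinants, total_offspring)


# Hypothetical test-cross counts (recombinants, total offspring) per marker pair.
crosses = {("A", "B"): (50, 1000), ("B", "C"): (120, 1000), ("A", "C"): (170, 1000)}

for (left, right), (rec, total) in crosses.items():
    print(f"{left}-{right}: {map_distance_cm(rec, total):.1f} cM")
```

With these invented counts the A-C distance is roughly the sum of A-B and B-C, which would place B between A and C on the map.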
The genetic map is just like the road map between cities and the physical map is
similar to the map of a city specifying the exact distances between places giving
almost complete information to find that place.
Both mapping tools provide scientists the order and position of the gene in question and
its distance from other genes or markers in a genome, just like the address and map of
the road that leads to a particular house
Genetic mapping limitation
2. Physical mapping
Relies upon observable experimental outcomes
• hybridization
• amplification of certain DNA regions
The unit of distance is the base pair
genes are identified and localized purely on the basis of
their physical positions along the chromosomes
uses molecular biology techniques to examine DNA
molecules directly in order to construct maps
the ultimate physical map is the DNA sequence of the whole
genome
DNA markers for mapping analysis
Many molecular markers have been developed, and
more are still under development
Similarity of DNA microarrays and chips
Type         Preparation                                      Support      Density [probes/cm2]
Microarray   Printing of oligonucleotides or PCR fragments    e.g. glass   up to 10^4
Chip         Direct synthesis on the support                  e.g. glass   up to 2.5 * 10^5
DNA microarrays: Principle
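Extract mRNA from cell type A and cell type B, make labeled cDNA, and hybridize to the
microarray
Each spot shows whether a transcript is more abundant in A, more abundant in B, or equal in A and B
Position on the array or chip (immobilized probe) gives identity; intensity gives amount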
Production of DNA-chips/array
Gene Discovery:
Disease Diagnosis:
Drug Discovery:
Toxicological Research:
DNA sequencing
5’AGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCT 3’
Sequencing: methods for determining the precise order of
the nucleotides in a DNA (or RNA) molecule
Goal:
Find the complete sequence of A, C, G, T’s in DNA
Challenge:
There is no machine that takes long DNA as an input, and gives
the complete sequence as output
Can only sequence ~500 letters at a time
1869 Miescher: Discovers DNA
1944 Avery: Proposes DNA as ‘Genetic Material’
1953 Watson & Crick: Double Helix Structure of DNA
1965 Holley: Sequences Yeast tRNA(Ala)
1970 Wu: Sequences Cohesive End DNA
1977 Sanger: Dideoxy Chain Termination
1977 Gilbert: Chemical Degradation
1980 Messing: M13 Cloning
1986 Hood et al.: First automated sequencing machine
1990 Cycle Sequencing, HGP started
• Improved Sequencing Enzymes
• Improved Fluorescent Detection Schemes
1995 – First bacterial genome – H. influenzae (1.8 Mb)
1998 – First animal genome – C. elegans (97 Mb)
2003 – Completion of HGP (3 Gb)
– 13 years, $2.7 bn
Maxam-Gilbert Sequencing
[Figure: Maxam-Gilbert sequencing, showing base-specific chemical cleavage reactions (including DMS) and the resulting gel lanes]
Incorporation of any ddNTP results in chain termination, since
ddNTPs lack the 3’ OH that the chain needs to elongate
Carrying out four reactions produces four separate sets of
fragments, each terminated at a specific base
The method involves the following major steps
1. Preparation of four reaction tubes (A mix, T mix, C mix, G
mix), each containing:
Multiple copies of ssDNA template
Primer
Four dNTPs
Enzyme (DNA polymerase)
A small amount (usually 1-5 %) of one ddNTP, which causes
chain termination
2. Run the four separate reactions, each with a different ddNTP
3. Fragment separation on a high-resolution polyacrylamide gel
4. Autoradiography and reading of the sequence from bottom to top
The template sequence is complementary to the sequence you read
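The four-tube logic above can be mimicked with a toy simulation. This is a simplified sketch, not a model of real chemistry: it ignores the primer, assumes termination occurs at every possible position, and uses a made-up template written 3'->5'.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def synthesized_strand(template_3to5: str) -> str:
    """New strand (5'->3') obtained by copying a template written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3to5)


def terminated_lengths(new_strand: str) -> dict[str, list[int]]:
    """For each ddNTP tube, the lengths of the fragments ending in that base."""
    tubes: dict[str, list[int]] = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        tubes[base].append(length)
    return tubes


def read_gel(tubes: dict[str, list[int]]) -> str:
    """Read bands from the shortest fragment (bottom) to the longest (top)."""
    bands = sorted((length, base) for base, lengths in tubes.items() for length in lengths)
    return "".join(base for _, base in bands)


template = "TACGGTCA"  # hypothetical template, written 3'->5'
tubes = terminated_lengths(synthesized_strand(template))
read = read_gel(tubes)
print(tubes)
print(read)
```

Reading the simulated gel from bottom to top gives the newly synthesized strand; complementing each base of that read recovers the template.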
Dideoxy (Sanger) Method
ddNTP: 2’,3’-dideoxynucleotide that terminates the chain when incorporated
Add just enough that each ddNTP is incorporated randomly, so that
termination occurs at every position of the corresponding base
Automated DNA sequencing
Developed in 1990
Modification of Sanger method
Sanger sequencing with dye terminators:
uses ddNTPs tagged with different fluorescent dyes
Performed in a single tube
Avoids the hazards of radioactive isotopes
Minimizes the time needed to sequence
The fluorescent tags are attached to the chain-terminating
nucleotides
Sequence data are obtained in real time by detecting the DNA bands
within the gel during the electrophoretic separation
Each of the four dideoxynucleotides carries a spectrally
different fluorophore
The DNA bands are detected by their fluorescence
Second generation sequencing (SGS)
The key feature:
parallelization of a high number of sequencing
reactions
Reduction of the time needed
Reduction of price
Technologies
Illumina sequencing
The most widely used
Relies on sequencing by synthesis
Like automated DNA sequencing, it uses
fluorescently labeled nucleotides
Produces a larger volume of data than the 454
sequencer
Ion Torrent
Developed by Life Technologies
Similar to 454 pyrosequencing, except that, unlike other SGS platforms, it does not use
fluorescently labelled nucleotides
It is based on detection of H+ released during the sequencing
process
Third-generation sequencing
Works by reading the nucleotide sequence at the single-molecule level
The three commercially available third-generation sequencing (TGS) technologies:
1. Pacific Biosciences (PacBio) Single Molecule Real Time (SMRT)
sequencing,
2. Illumina Tru-seq Long-Read technology
3. Oxford Nanopore Technologies
All three technologies can produce long reads averaging
between 5,000 bp and 15,000 bp, with some reads exceeding
100,000 bp.
Animation source: https://dnatech.genomecenter.ucdavis.edu/pacbio-sequencing/
Comparison Among Technologies
What do you think are the advantages of DNA
sequencing?
How does DNA sequencing give an accurate
physical map?
What do you know about the Human Genome
Project consortium?
APPLICATIONS OF GENOMICS:
2. Application in agriculture
3. Application in Industry
4. Application in environment
PART II:
Bioinformatics
PART II: What is Bioinformatics?
Multiple definitions are available in the literature and on the web
Bioinformaticians have their own definition, and experts in
other areas have others
What is Bioinformatics?
Biologists: collect molecular data
(DNA, RNA, protein sequences)
Computer scientists, mathematicians, and statisticians:
develop tools, software, and algorithms
to store and analyze the data
Bioinformaticians: study biological
questions by analyzing molecular data
What is Bioinformatics?
Some people equate bioinformatics with computational biology;
however, computational biology is much broader
[Figure: bioinformatics in relation to genetics, biochemistry, biophysics, evolution, mathematics, statistics, numerical analysis, algorithmics, image analysis, and data management]
Bioinformatics Flow Chart
Sequencing
Gene & Protein expression
Analysis of nucleic acid seq. data
Milestones in the development of bioinformatics (BI)
1978: The term bioinformatics was coined
1979: The first DNA database was established
1980s: GenBank established, fast database searching algorithms
developed
FASTA by William Pearson
BLAST by Stephen Altschul and coworkers
Prediction of RNA secondary structure
Prediction of protein secondary and 3D structures
1990s-2000: Prediction of genes and Studies of complete genome
sequences started, EMBL established (1994), first bacterial genome
completely sequenced (1995), Yeast genome completely sequenced
(1996), HGP (1990-2003)…
2013: The number of sequences increased to more than 185 million
Level of bioinformatics analysis
1. Analysis of a single gene (protein) sequence.
Similarity with other known genes
Identification of well-defined domains in the sequence
Prediction of secondary and tertiary structure
2. Analysis of complete genomes (genomics)
Which gene families are present, which are missing?
Searching the location of genes on the chromosomes
Expansion/duplication of gene families
Identification of "missing" genes and hence missing gene products
3. Sequence structure analysis and prediction
Expression analysis
Proteomics
Goals and scope
Goal: To better understand a living cell and how it functions
at the molecular level
SCOPE: Two major subfields
1. Development of computational tools (software) and
smart databases
2. Application of these tools and databases in generating
meaningful biological information
The tools are used in three areas of genomic and molecular
biological research:
1. sequence analysis,
2. structural analysis,
3. functional analysis.
The genetic code
The information required to make proteins is stored in DNA
• DNA sequence determines amino acid sequence
• Amino acid sequence determines protein structure
• Protein structure determines protein function
Amino acids are coded by codons, triplets of nucleotides,
e.g. |ACG|TAT|….
There are 4^3 = 64 codons for ~20 amino acids, so the code is
degenerate
Deletions or insertions can destroy the message by shifting the reading frame
One of three stop codons (UAA, UAG, UGA) is located
at the end of the protein-coding part of a gene
AUG is the usual start codon; alternative start codons occur in
some prokaryotes and organelles
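The counting argument (4^3 = 64 codons, ~20 amino acids, 3 stop codons) can be checked with a short Python sketch. It assumes the standard genetic code only; organelle and other variant codes are not covered, and the function name `translate` is just an illustrative choice.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence: str) -> str:
    """Translate DNA or RNA in frame 1; '*' marks a stop codon."""
    dna = sequence.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
amino_acids = set(CODON_TABLE.values()) - {"*"}

print(len(CODON_TABLE), "codons,", len(amino_acids), "amino acids, stops:", stop_codons)
print(translate("ACGTAT"))
```

Because 61 sense codons map onto 20 amino acids, several codons must share an amino acid, which is what "degenerate" means here.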
The amino acid shorthand notation
Reading Frames and open reading frames
Reading Frame (RF)
• One way of dividing a DNA sequence into consecutive, non-overlapping codons,
and hence a potential string of amino acids
• A given stretch of DNA has six potential reading frames,
three forward and three backward.
• The different reading frames
give entirely different proteins
ORF: Open Reading Frame
• Any continuous reading frame
that starts with a start codon
and stops with a stop codon.
• A DNA or RNA sequence that could
be translated into a peptide sequence
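The six reading frames and the ORF definition can be illustrated with a minimal Python sketch. It assumes ATG as the only start codon and the three standard stop codons, and the test sequence is made up; real gene finders apply many more criteria (minimum length, codon usage, and so on).

```python
STOPS = {"TAA", "TAG", "TGA"}
START = "ATG"


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5'->3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna: str) -> dict[str, list[str]]:
    """Codons for frames +1..+3 (forward) and -1..-3 (reverse strand)."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = [
                seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)
            ]
    return frames


def find_orfs(dna: str) -> list[tuple[str, str]]:
    """(frame, sequence) for every stretch from a start codon to the next stop."""
    orfs = []
    for label, codons in six_frames(dna).items():
        start = None
        for idx, codon in enumerate(codons):
            if start is None and codon == START:
                start = idx
            elif start is not None and codon in STOPS:
                orfs.append((label, "".join(codons[start:idx + 1])))
                start = None
    return orfs


dna = "GCATGAAACCCTAGTT"  # hypothetical sequence
print(find_orfs(dna))
```

In this toy sequence only one of the six frames contains a start codon followed in frame by a stop codon.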
Single-letter codes for ambiguous bases, used in
synthetic sequence constructs and bioinformatics
Code  Bases     Mnemonic
R     A/G       puRine
Y     C/T       pYrimidine
S     G/C       Strong
W     A/T       Weak
K     G/T       Keto
M     A/C       aMino
B     C/T/G     not-A
D     A/T/G     not-C
H     A/C/T     not-G
V     A/C/G     not-T (not-U in RNA)
N     A/C/G/T   aNy
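One practical use of these codes is matching a degenerate motif against a concrete sequence. The sketch below converts the codes in the table to a regular expression; the helper names `to_regex` and `matches` and the motif `GAYR` are illustrative assumptions.

```python
import re

# Ambiguity codes from the table above, mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def to_regex(pattern: str) -> str:
    """Turn a degenerate DNA pattern into a regular expression string."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


def matches(pattern: str, dna: str) -> bool:
    """True if the whole DNA string is compatible with the degenerate pattern."""
    return re.fullmatch(to_regex(pattern), dna.upper()) is not None


print(to_regex("GAYR"))
print(matches("GAYR", "GATA"), matches("GAYR", "GAAA"))
```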
Mutation and conservation
Mutation:
o Changes in DNA that affect
genetic information
Frameshift mutations: shift the reading frame of the
genetic message
Insertion
THE DOG ATE THE BAT
Deletion
THE DOG ATE THE BAT
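The effect of a single-letter insertion or deletion on the sentence above can be reproduced with a few lines of Python. The inserted letter X and the deleted letter D are arbitrary choices made for illustration.

```python
def read_in_triplets(message: str) -> str:
    """Re-read a message three letters at a time, as a ribosome reads codons."""
    letters = message.replace(" ", "")
    return " ".join(letters[i:i + 3] for i in range(0, len(letters), 3))


original = "THE DOG ATE THE BAT"
letters = original.replace(" ", "")

inserted = letters[:3] + "X" + letters[3:]  # insert one letter after THE
deleted = letters[:3] + letters[4:]         # delete the D of DOG

print(read_in_triplets(original))
print(read_in_triplets(inserted))
print(read_in_triplets(deleted))
```

Every "word" downstream of the change is altered, which is why frameshifts are usually far more disruptive than single-letter substitutions.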
Chromosome Mutations:
chromosomes
Deletion
Duplication
Inversion
Translocation
Significance of Mutations
• Most are neutral
• Eye color
• Intronic changes
• Some are harmful
• Sickle cell anemia
• Down syndrome (trisomy 21, caused by failure of chromosome 21 to separate)
• Some are beneficial
• Carriers of the sickle cell trait are relatively resistant to
malaria
• Resistance to HIV infection
Introduction to Biological Databases
What is a database?
What type of biological databases can we access?
What roles do they play?
What type of information can we get from them?
How do we access this information?
Database:
Organized collection of logically related data
It is structured collection of information
Convenient tools of searching vast amount of information
Why databases ?
Help to handle and share large volumes of biological data
allow large scale analysis of biological data
Disseminate biological data and information
provide biological data in computer-readable form
organize data in simpler and structured way
Make data access easy and updated
The two main goals of biological databases:
• Information retrieval
• Knowledge discovery
Types of biological databases…
Content-based classification: three types
Depending on the content, databases can be:
1. Primary databases
Contain original information (sequences or structures)
"Simple" archives of sequences, structures, images, etc.
Raw data, with minimal annotations, not always well curated
Examples:
Nucleic acids: i. GenBank at NCBI, ii. EMBL, iii. DDBJ
Proteins: TrEMBL (raw sequences of UniProt), PDB, PIR
GenBank, EMBL and DDBJ are the three major public primary
sequence databases that store raw nucleic acid data worldwide
The three databases collaborate and exchange new data daily
They together constitute the International Nucleotide Sequence
Database Collaboration (INSDC)
WormBase
Interconnection between Biological Databases
Major Biological Databases Available Via the WWW
Database        URL
DDBJ            www.ddbj.nig.ac.jp
EMBL            www.ebi.ac.uk/embl/index.html
NCBI            www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
ExPASy          http://us.expasy.org/
FlyBase         http://flybase.bio.indiana.edu/
GenBank         www.ncbi.nlm.nih.gov/Genbank
HIV databases   www.hiv.lanl.gov/content/index
OMIM            www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
SWISS-PROT      www.ebi.ac.uk/swissprot/access.html
Enzymes         www.chem.qmul.ac.uk
KEGG            www.genome.ad.jp
Drawbacks of Biological Databases
Errors in sequence databases
Redundancy in the primary databases:
repeated submission of identical or overlapping sequences
NCBI has created a non-redundant database, called
RefSeq, in which identical sequences from the same
organism and associated sequence fragments are
merged into a single entry
The SWISS-PROT database also has minimal redundancy
for protein sequences
Occasional false annotation of genes
Incompatibility (format, terminology, data types, etc.)
Information Retrieval from NCBI and EBI
Two most popular retrieval systems:
1. Entrez - NCBI
Allows text-based searches
Integrates cross-referenced information
2. SRS - EBI
Both:
Provide access to multiple databases
Allow complex queries
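[Figure: the three primary nucleotide databases and their retrieval systems: NCBI (Entrez), EMBL at EBI (SRS), and DDBJ at CIB (getentry); submissions and updates are exchanged among them]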
Searching information from GenBank
GenBank Sequence Format
contain three sections
Version number and GenInfo identifier (GI) number: identify
the version of the sequence
ORGANISM: the source organism, given by its
scientific name
REFERENCE: provides citations of related publications
Ideal minimal content of an entry in a sequence database
• Sequence
• Accession number (AC)
• Taxonomic data
• References
• Annotation/Curation
• Keywords
• Cross-references
• Documentation
Alternative Sequence Formats
In addition to the GenBank format, there are many other
sequence formats
FASTA:
The simplest and the most popular
Has a single definition line that begins with a greater-than
sign (>) followed by a sequence name
Extra information such as the GI number is
separated from the sequence name by the “|” symbol
Drawback: much annotation information is lost
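Because the format is so simple, a FASTA file can be read with a few lines of Python. The parser below is a minimal sketch that handles only the basic layout described above, and the two records are invented examples; established libraries handle many more edge cases.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (definition line without '>', sequence) for each record."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records, made up for illustration.
example = """>seq1|hypothetical example record
ATGGCC
TAA
>seq2
GGGTTT
"""

records = parse_fasta(example)
for header, sequence in records:
    name = header.split("|")[0]  # fields on the definition line are separated by '|'
    print(name, len(sequence))
```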
FASTA format
SEQUENCE ALIGNMENT
What do you think about sequence alignment?
Why it is needed?
A way of arranging sequences of DNA, RNA, or protein
to identify regions of similarity
Reveals three types of pairs (matches, mismatches and
gaps)
Optimal alignment
Exhibits the most matches and the fewest differences
Has the highest overall score
Why is sequence alignment important?
Provides evolutionary information
Similar sequences may have similar structure and similar function,
and likely originated from a common ancestor
It is unlikely that sequences acquired their similarity by chance
o For DNA, the probability of a chance match over n positions is low: P = 4^-n
o For proteins, the probability is much lower still: P = 20^-n
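To get a feel for how quickly these probabilities shrink, they can be evaluated for a few lengths n. The sketch assumes that all letters are equally frequent and that positions are independent, which is a simplification of real sequences.

```python
def chance_match_probability(alphabet_size: int, length: int) -> float:
    """Probability that two random sequences agree at all `length` positions."""
    return alphabet_size ** -length


for n in (5, 10, 20):
    p_dna = chance_match_probability(4, n)       # 4 nucleotides
    p_protein = chance_match_probability(20, n)  # 20 amino acids
    print(f"n={n:2d}  DNA: {p_dna:.2e}  protein: {p_protein:.2e}")
```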
Local alignment:
looks for best internal matching region
Logical questions
Orthologous sequences:
similar sequences in different organisms that have arisen through speciation
originated from a single ancestral gene in the most recent
common ancestor
Often have similar functions
Paralogous sequences:
similar sequences within a single organism that have arisen
due to a gene duplication
tend to have different functions
Xenologous sequences:
similar sequences that do not share the same evolutionary
origin, but rather have arisen due to horizontal transfer
Sequence similarity:
quantitative measure of alignment between two sequences
direct result of observation from the sequence alignment
Sequence similarity can be quantified using percentages
E.g. 80% similarity, but not 80% homology
Identity vs Similarity
Sequence identity:
• exactly the same amino acids or nucleotides in the same positions
• similarity and identity are used interchangeably for
DNA/RNA
• but they have different meanings for protein sequences
Similarity in protein sequences:
• percentage of aligned residues that have similar
physicochemical characteristics
Identity in protein sequences:
• percentage of matches of the same amino acid residues between
two aligned sequences
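The difference between the two percentages can be shown on a small aligned pair. The sketch rests on two labeled assumptions: the residue groups below are a simplified, hypothetical grouping by physicochemical character (not a standard substitution matrix), and percentages are taken over aligned residue pairs, ignoring gap columns. Both example sequences are invented.

```python
# Hypothetical grouping of amino acids with broadly similar properties.
SIMILAR_GROUPS = ["AVLIM", "FWY", "STNQ", "KRH", "DE", "C", "G", "P"]
GROUP_OF = {aa: idx for idx, group in enumerate(SIMILAR_GROUPS) for aa in group}


def identity_and_similarity(seq_a: str, seq_b: str) -> tuple[float, float]:
    """Percent identity and percent similarity over aligned residue pairs."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no aligned residue pairs")
    identical = sum(a == b for a, b in pairs)
    similar = sum(GROUP_OF[a] == GROUP_OF[b] for a, b in pairs)  # includes identical pairs
    return 100 * identical / len(pairs), 100 * similar / len(pairs)


identity, similarity = identity_and_similarity("MKVLDEGS", "MRILEEPS")
print(f"identity = {identity}%, similarity = {similarity}%")
```

Here K/R, V/I and D/E count as similar but not identical, so similarity is higher than identity.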
Alignment editing and edit operations
The optimal alignment of sequences can be obtained by applying
edit operations and repeatedly editing the alignment
What are edit operations?
Edit distance
Minimal number of 'edit operations' (RID: replace, insert, delete) required to
transform one sequence into the other
For example:
A T G C C A T
A T - T C - -
– Number of match = 3
– Number of mismatch = 1 Score = 3+0+(-1x3) = 0
– Number of gap = 3
Simple/ regular Gap Penalty
Introducing gaps yields many different possible alignments
One way to select the best alignment is to use a score
Example
Align AATCTATA with AAGATA by introducing gaps, and rate
the possible alignments using match = 1, mismatch = 0 and gap =
-1
Many alignments are possible, for example:
1. A A T C T A T A ----score = 3+0x3+(-1x2) = 1
A A G - A T - A
2. A A T C T A T A -----score = 5+0-2 = 3
A A - G - A T A
3. A A T C T A T A -----score = 5+0-2 = 3
A A - - G A T A
Alignments 2 and 3 tie for the best score; which one should we choose?
A different way of penalizing gaps is therefore needed
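The scores above can be reproduced with a short function. This is a minimal sketch of the simple scheme used in the example (match = 1, mismatch = 0, every gap position = -1); the function name is an illustrative choice.

```python
def score_alignment(top: str, bottom: str, match: int = 1,
                    mismatch: int = 0, gap: int = -1) -> int:
    """Score a pairwise alignment column by column with a flat gap penalty."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


candidates = {
    1: ("AATCTATA", "AAG-AT-A"),
    2: ("AATCTATA", "AA-G-ATA"),
    3: ("AATCTATA", "AA--GATA"),
}
scores = {number: score_alignment(*pair) for number, pair in candidates.items()}
print(scores)
```

Alignments 2 and 3 receive the same score because a flat penalty does not distinguish two separate gaps from one longer gap.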
Origination and Length Penalties
Exercise
o For the following alignments, calculate the total score using
GOP = -2, GEP = -1, match = 1 and mismatch = 0
ACGTCTGATACGCCGTATAGTCTATC
ACGTCTGAT-------ATAGTCTATC
ACGTCTGATACGCCGTATAGTCTATC
AC-T-TGA--CG-CGT-TA-TCTATC
ACGTCTGATACGCCGTATAGTCTATCT
|   ||||| |||   || ||| ||||
A--GCTGATTCGC---ATCGTC-ATCT
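A scoring function with separate gap opening (GOP) and gap extension (GEP) penalties can be sketched as follows. The convention is an assumption: the first position of a gap costs GOP and each further position costs GEP, so a gap of length k costs GOP + GEP x (k - 1); other texts charge GOP + GEP x k, so check which convention is intended before comparing answers. The sketch is applied to the earlier AATCTATA example rather than to the exercise alignments.

```python
def affine_score(top: str, bottom: str, match: int = 1, mismatch: int = 0,
                 gop: int = -2, gep: int = -1) -> int:
    """Score an alignment; assumed gap cost is GOP + GEP * (gap length - 1)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_row = None  # row ("top"/"bottom") holding the gap run currently open
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            score += gep if gap_row == row else gop  # extend or open a gap
            gap_row = row
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


two_short_gaps = affine_score("AATCTATA", "AA-G-ATA")
one_long_gap = affine_score("AATCTATA", "AA--GATA")
print(two_short_gaps, one_long_gap)
```

Under this scheme the alignment with a single two-position gap scores higher than the one with two separate gaps, which breaks the tie seen with the flat gap penalty.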