Salale University: Genomics and Bioinformatics (BIO-6324) For Postgraduate in Applied Genetics and Biotechnology

PART I:
GENOMICS
What do you think genomics is?
 What is DNA?
 DNA (deoxyribonucleic acid) carries the genetic information in the body’s cells. DNA is
made up of four similar chemicals (called bases and abbreviated A, T, C, and G) that are
repeated over and over in pairs.
 What Is a Gene?
 A gene is a distinct portion of a cell’s DNA. Genes are coded instructions for making
everything the body needs, especially proteins.
 What Are Proteins?
 Proteins are chains of chemical building blocks called amino acids. A protein could
contain just a few amino acids in its chain or it could have several thousands. Proteins
form the basis for most of what the body does, such as digestion, making energy and
growing.
 What Are Chromosomes?
 Genes are packaged in bundles called chromosomes. Humans have 23 pairs of
chromosomes.
 Of those, 1 pair is the sex chromosomes (which determine whether you are male or female,
plus some other body characteristics), and the other 22 pairs are autosomal
chromosomes (which determine the rest of the body’s makeup).
 What Is the Human Genome?
 The human genome is a complete copy of the entire set of human gene instructions. The
Human Genome Project, completed in 2003, identified all the human genes in DNA and
stored the information in databases so all researchers everywhere could use it.
Central Dogma of Molecular Biology
What is a genome?
 Genome:
 Is the entirety of an organism's hereditary information
 The complete set of genes of an organism
 Encoded either in DNA or, for some viruses, in RNA
 The genome includes both the genes and the non-coding sequences of the DNA/RNA
 Genomes can be:
 Prokaryotic genomes
 Eukaryotic genomes
 Nuclear genomes
 Mitochondrial genomes
 Chloroplast genomes
Gene, Chromosome, and Genome

 Gene :
o A segment of DNA that carries genetic information
o Codes for a certain type of protein or for an RNA
o There is a lot of DNA that is not part of genes
o In humans, only about 2% of the DNA codes for protein
o What is the remaining 98%?
 Chromosome: a long piece of DNA
Genome:
o Complete set of genetic instructions of an organism
o Contains the complete set of genes of an organism
o Life is specified by the genomes of organisms
o Most genomes are made of DNA, but a few viruses have
RNA genomes
o Every organism possesses a genome that contains its
biological information
 Each of the approximately 10^13 cells in the adult human
body has its own copy or copies of the genome
 The human genome consists of two distinct parts
1. The nuclear genome: comprises approximately 3.2 billion
nucleotides of DNA and about 19,000 protein-coding
genes
2. The mitochondrial genome: a circular DNA molecule of
16,569 nucleotides
 The human mitochondrial genome contains 37 genes

 Genomics:
 Any attempt to analyze or compare the entire genetic
complement of a species
Genome Analysis
 Entails the prediction of genes in uncharacterized genomic sequences
 Experimental genome analysis is slow and time consuming.
 Computational genome analysis is relatively simple
 Prokaryotes
o Most of the genome is coding sequence
o The small non-coding fraction is mainly regulatory sequence
o No introns
o Computational analysis or gene prediction is much simpler
 Eukaryotes
o Most of the genome is non-coding (about 98%)
o Regulatory sequences
o Introns
o Repetitive DNA
o Computational gene prediction and annotation is challenging

Why do we use model organisms?
Why is genome analysis required?

 Analysis is required:
o For prediction of uncharacterised genes or proteins
o To obtain the complete sequences
o For genetic modification / engineering
o To know which sequences code for proteins or structural
RNAs
o To indicate the function of the predicted gene products
o To search for any mutation or malfunction of a gene
o Structuromics . . .
 Genes can be classified into:
1. protein coding genes (proper exon)
2. pseudogenes
3. Functional RNA genes: tRNA, rRNA and others
• snoRNA
• snRNA
• miRNA
 There are several kinds of exons:
o Noncoding exons
o Initial coding exons
o Internal exons
o Terminal exons

Human genome by functions
Comparative genome sizes of organisms

Organism          Size (bp)      Gene number   Average gene density      Chromosome number
H. sapiens        3.2 billion    ~20,000       1 gene / 100,000 bases    46
M. musculus       2.6 billion    ~20,000       1 gene / 100,000 bases    40
D. melanogaster   137 million    13,000        1 gene / 9,000 bases      8
A. thaliana       100 million    25,000        1 gene / 4,000 bases      10
C. elegans        97 million     19,000        1 gene / 5,000 bases      12
S. cerevisiae     12.1 million   6,000         1 gene / 2,000 bases      32
E. coli           4.6 million    3,200         1 gene / 1,400 bases      1
H. influenzae     1.8 million    1,700         1 gene / 1,000 bases      1
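The average gene density column can be approximated by dividing genome size by gene number. The short Python sketch below does this for a few rows of the table; the dictionary and function names are only illustrative, and the results will not match the rounded table values exactly (the table figures are order-of-magnitude estimates).

```python
# Genome size (bp) and approximate gene number, copied from the table above.
genomes = {
    "H. sapiens": (3_200_000_000, 20_000),
    "D. melanogaster": (137_000_000, 13_000),
    "S. cerevisiae": (12_100_000, 6_000),
    "E. coli": (4_600_000, 3_200),
}


def bases_per_gene(size_bp, gene_number):
    """Average number of bases per gene (genome size / gene count)."""
    return size_bp / gene_number


density = {name: bases_per_gene(*values) for name, values in genomes.items()}

for name, bases in density.items():
    print(f"{name}: about 1 gene per {bases:,.0f} bases")
```

The smaller the number of bases per gene, the more densely the genome is packed with genes, which is the trend from eukaryotes to prokaryotes shown in the table.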

Facts about Genome
 Individual genomes show extensive variation.
 Not all genes are essential.
 A substantial part of most eukaryotic nuclear genomes is
made up of Repetitive DNA.
 Repetitive DNA: individual sequence elements that are
repeated many times over, either in tandem arrays or
interspersed throughout the genome.
 Single copy DNA: which includes most genes, and is
made up of sequences that are not repeated elsewhere.
 Extrachromosomal genes:
 The mitochondria and chloroplasts

 These genes code for the RNAs and some of the proteins
required in the organelle.
Facts about Genome…

 In prokaryotes, most of the genome (85-90%) is non-repetitive, coding DNA, while non-coding
regions make up only a small part
 Eukaryotes, however, have an exon-intron organization of protein-coding genes
 The major part of mammalian and plant genomes is
composed of repetitive DNA
 Examples: VNTRs and satellites (mini- or microsatellites)
Genes and Proteins & the role of Introns
 Introns:
 Derived from the term "intragenic regions", are non-coding
sections
 Exons:
 Are coding sections that remain in the mRNA sequence.
Introns and Exons
 Introns are common in eukaryotic pre-mRNA and are removed during splicing.
 Unlike introns, exons are coding sections that remain in the
mRNA sequence.
 It is now recognized that introns are "a complex mix of
different DNA, much of which are vital to the life of the
cell”.
What is Genomics?

 Genomics is the subdiscipline of genetics devoted to the
 mapping,
 sequencing, and
 functional analysis of genomes.
 The field includes studies of intragenomic phenomena
 Genomics uses high-throughput approaches that allow many
analyses in parallel
 Genomics depends on computational analysis because of its
large data sets
Types of Genomics
 Though the area of genomics has widened considerably, in a
broad sense it can be classified into three types
1. Structural genomics
 Deals with DNA sequencing, mapping, sequence
assembly, sequence organization and management
 It is starting stage of genome analysis (construction of
genetic, physical or sequence maps of organisms)
2. Functional genomics
 Deals with functions of all gene sequences and their
expression in organisms
 If the gene and its sequence are known, the next step is
to find out what function they have
3. Comparative genomics
 comparison of entire genomes across species,
looking at functions and evolutionary relationships
 Includes a comparison of gene number, gene content and
gene location in both prokaryotes and eukaryotes
 What is conserved between species?
 Genes for basic processes
 Understand the uniqueness between different species
 What makes closely related species different?
Genome analysis tools
 Genome analysis comprises the techniques needed to
determine and compare genetic sequences
 These may include:
 Hybridization analysis (Southern, FISH, microarray)
 Mapping of DNA
 DNA sequencing
 With this type of analysis, important genes
influencing metabolism, differentiation and
development, and disease processes can be
identified and manipulated
 The analysis aims to identify genes that are
predicted to have a particular biological function, as well as
nonfunctional genes
Common molecular testing methods in genomics

 FISH
 Microarray
 DNA hybridization
 Genotyping (SNP detection)
 PCR (real-time or conventional)
 Genome sequencing
Hybridization analysis (Southern, FISH, microarray)
 Southern blotting
 FISH (Fluorescent In Situ Hybridization)
 Microarray
FISH
• A variety of specimen types can be analyzed using FISH.

FISH for Detection of Single to Multiple Genetic Events

Single target: one color
Dual targets: two colors
Multiple targets: multiple colors

Allows one to look at multiple genomic changes within a single cell, without destruction of the cellular morphology.
Probes
 A probe is a nucleic acid that
 can be labeled with a marker which allows identification and quantitation
 will hybridize to another nucleic acid on the basis of
base complementarity

A probe can be
 Radioactive (32P, 35S, 14C, 3H)
 Fluorescent
 FISH: fluorescent in situ hybridization
 Biotinylated (detected with avidin or streptavidin)
• DIRECT FLUORESCENT-LABELED PROBE

[Figure: the FISH probe DNA carries a covalently bound fluorophore (F) and base-pairs with the complementary specimen DNA.]
Modes of FISH probes
 Centromeric (satellite) probes
 Telomere probes
 Whole chromosome painting probes
 Locus-specific probes

Multi Color FISH
 Multicolor FISH can provide “colorized” information relative
to chromosome rearrangements, especially useful in
specimens where chromosome preparations are less than
optimal for standard cytogenetic banding analysis.
FISH Procedure

 Denature the chromosomes


 Denature the probe
 Hybridization
 Fluorescence staining
 Examine slides or store in the dark
Genome Mapping
 Gene or genome mapping (molecular mapping), is
determining the location of genes within a genome
 Identifying relationships between genes on chromosomes
 Two types of genome mapping methods
1. Genetic mapping / Linkage mapping
 calculation of map distance based on recombination
frequencies during meiosis
 based on genetic techniques such as cross-breeding
or pedigrees
 recombination frequency is measured in
centimorgans (cM), where 1cM = 1% recombination =
one map unit
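As a worked illustration of the cM definition, the Python sketch below converts offspring counts from a cross into a recombination frequency and a map distance. The counts are hypothetical, not data from this course, and the 1% = 1 cM conversion is only a good approximation for closely linked markers.

```python
def recombination_frequency(recombinant, total):
    """Percentage of scored offspring that are recombinant for two markers."""
    if total <= 0:
        raise ValueError("total offspring must be positive")
    return 100.0 * recombinant / total


# Hypothetical test cross: 1000 offspring scored, 120 of them recombinant.
rf_percent = recombination_frequency(120, 1000)

# 1% recombination = 1 cM = one map unit (short distances only).
map_distance_cm = rf_percent

print(f"RF = {rf_percent:.1f}% -> about {map_distance_cm:.1f} cM")
```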

Genetic mapping / Linkage mapping…
 If two genes are on different chromosomes, they are
unlinked and will sort independently during meiosis
 Genes on the same chromosome are physically linked
together and are separated only when a crossover occurs between them during meiosis
 The further apart two genes are, the more often a crossover separates them, so by
working out the recombination frequency it is
possible to produce a map of the relative locations of the
marker genes

 The genetic map is just like the road map between cities and the physical map is
similar to the map of a city specifying the exact distances between places giving
almost complete information to find that place.
 Both mapping tools provide scientists the order and position of the gene in question and
its distance from other genes or markers in a genome, just like the address and map of
the road that leads to a particular house

Genetic mapping limitation

 A map generated by genetic techniques is rarely sufficient, for the following reasons:
a. The resolution depends on the number of crossovers that have been scored
b. Crosses take a long time to set up and score
c. Genetic maps have limited accuracy
2. Physical mapping
 Relies upon observable experimental outcomes
• hybridization
• amplification of certain DNA regions
 The unit of distance is the base pair
 Genes are identified and localized purely on the basis of
their physical positions along the chromosomes
 Uses molecular biology techniques to examine DNA
molecules directly in order to construct maps
 The ultimate physical map is the DNA sequence of the whole
genome
DNA markers for mapping analysis
 Many molecular markers have been developed, and
more are still being developed
 Molecular markers can generally be classified into
1. Non-PCR-based
 E.g. RFLP
2. PCR-based
 RAPD, SSR, ISSR, AFLP, SNP…
 Marker polymorphism arises from mutations such as:
 Base pair changes
 Rearrangements
 Indels
 Tandem repeat variation
DNA microarray
 A DNA microarray:
 A DNA microarray (also known as a DNA chip or biochip)
is a collection of microscopic DNA spots attached to a
solid surface
 Each DNA spot contains specific probes
 Probes are immobilized on a solid support (a microscope glass slide, silicon chip or nylon membrane)
 Probes can be DNA, cDNA, or oligonucleotides.
 Used to measure the expression levels of large numbers of genes simultaneously
 One can analyze the expression of many genes in a single reaction
quickly
The probes are initially bound to the matrix robotically
Immobilized phase = multiple probes with known sequences bound
to a support material
Mobile phase = labelled mixture of the nucleic acids (NAs) being analyzed
Similarity of DNA microarrays and chips

             Preparation                                      Support      Density [probes/cm2]
Microarray   Printing of oligonucleotides or PCR fragments    e.g. glass   up to 10^4
Chip         Direct synthesis on the support                  e.g. glass   up to 2.5 × 10^5
DNA microarrays: Principle

Cell type A and cell type B: extract mRNA, make labeled cDNA, hybridize to microarray.
Each spot then reads as more in “A”, more in “B”, or equal in A & B.

Amount of mRNA bound indicates the expression level of various genes

Hybridization
Fluorescently labelled analyzed NAs (mobile phase) hybridize to the array or chip (immobilized probe) and give a fluorescent (RI) signal.
Position of the signal gives the identity of the gene; intensity gives the amount.
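The two-sample comparison above (more in A, more in B, equal in A & B) is commonly summarized as a log2 ratio of the two signal intensities. The Python sketch below is a minimal illustration only: the spot intensities, the pseudocount of 1 and the two-fold threshold are all assumptions chosen for the example, not values from these slides, and real analyses add background correction and normalization.

```python
import math


def log2_ratio(intensity_a, intensity_b, pseudocount=1.0):
    """log2(A/B); the pseudocount avoids division by zero for empty spots."""
    return math.log2((intensity_a + pseudocount) / (intensity_b + pseudocount))


def classify(ratio, threshold=1.0):
    """Call a spot using a two-fold (|log2 ratio| >= 1) cut-off."""
    if ratio >= threshold:
        return "more in A"
    if ratio <= -threshold:
        return "more in B"
    return "equal in A & B"


# Hypothetical (intensity in A, intensity in B) for three spots.
spots = {"gene1": (4095, 1023), "gene2": (500, 520), "gene3": (127, 1023)}

calls = {gene: classify(log2_ratio(a, b)) for gene, (a, b) in spots.items()}
for gene, call in calls.items():
    print(gene, call)
```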
Production of DNA chips/arrays

1. Microspotting: DNA probes are spotted by a robot onto the glass slide via a microcapillary and linked by covalent bonds to the glass slide
2. Microspraying: DNA probes are sprayed (touch-free) onto the slide by inkjet printing
3. In situ arrays: DNA probes are synthesized directly on the glass slide

Typical microspotting

Commercial DNA spotter

Applications of Microarrays

 Gene Discovery:
 Disease Diagnosis:
 Drug Discovery:
 Toxicological Research:
DNA sequencing
5’AGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCT 3’

 Sequencing: methods for determining the precise order of
the nucleotides in a DNA (or RNA)

Goal:
Find the complete sequence of A, C, G, T’s in DNA

Challenge:
There is no machine that takes long DNA as an input, and gives
the complete sequence as output
Can only sequence ~500 letters at a time
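Because only a few hundred letters can be read at a time, a genome has to be covered by many overlapping reads. A common back-of-the-envelope relation is coverage = (number of reads × read length) / genome size. The Python sketch below applies it to the 1.8 Mb H. influenzae genome with 500-letter reads; the 8-fold coverage target is an assumption chosen only for illustration.

```python
import math


def reads_needed(genome_size_bp, read_length_bp, coverage):
    """Number of reads so that reads * read_length = coverage * genome size."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


# 1.8 Mb genome, ~500 bp per read, assumed 8x average coverage.
n_reads = reads_needed(1_800_000, 500, 8)
print(f"about {n_reads:,} reads")
```

This counts reads only; assembling them back into one sequence is a separate computational problem.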
1870 Miescher: Discovers DNA
1944 Avery: Proposes DNA as ‘Genetic Material’
1953 Watson & Crick: Double Helix Structure of DNA
1965 Holley: Sequences Yeast tRNA-Ala
1970 Wu: Sequences λ Cohesive End DNA
1977 Sanger: Dideoxy Chain Termination
     Gilbert: Chemical Degradation
1980 Messing: M13 Cloning
1986 Hood et al.: First automated sequencing machine
1990 • Cycle Sequencing, HGP started
     • Improved Sequencing Enzymes
     • Improved Fluorescent Detection Schemes
1995 First bacterial genome: H. influenzae (1.8 Mb)
1998 First animal genome: C. elegans (97 Mb)
2003 Completion of HGP (3 Gb): 13 years, $2.7 bn
2005 First “next-generation” sequencing instrument
2013 >10,000 genome sequences in NCBI database
DNA sequencing
 Common DNA sequencing techniques
1. First generation sequencing methods (classical)
 Chemical Degradation (Maxam-Gilbert)
 Chain-termination or Sanger Enzymatic method
 Sequences <1000bp
2. Second generation sequencing (SGS)
• E.g. Roche 454, Illumina, Ion Torrent
• Sequences ~100-1000 bp long, depending on
technology
3. Third generation sequencing (TGS)
• E.g. PacBio, Nanopore
• Single-molecule real-time (SMRT) sequencing
• Long reads, typically <27 kbp, sometimes up to 150 kb
Maxam-Gilbert Chemical degradation method
 Designed by Allan Maxam and Walter Gilbert in 1977
 Uses radiolabelled ssDNA or dsDNA
 It is a chemical method of sequencing
 Involves the following steps
1. Label either the 3’ or 5’ end of the DNA with 32P
2. Separate the labelled strands
3. Divide the mixture into four samples and treat each with
a different chemical that modifies and cleaves at specific bases
 A and G with dimethyl sulfate or formic acid
 T and C with hydrazine (under alkaline conditions)
 Only G with dimethyl sulfate and piperidine
 Only C with hydrazine + 1 M NaCl
Maxam-Gilbert Chemical degradation…
4. Electrophoresis of each of the four samples in four different
lanes of the gel
5. Autoradiography

 Hydrazine: under alkaline conditions, attacks the two pyrimidine bases (T and C)
 In the presence of 1 M NaCl, hydrazine attacks only cytosine
 Dimethyl sulfate: attacks A and G
 Without piperidine = A + G
 With piperidine = only G
 Then run electrophoresis and read the bands
Maxam-Gilbert Sequencing

[Figure: autoradiograph of a sequencing gel with four lanes, one per cleavage reaction, labelled DMS + P, DMS, H and H + S.]

Maxam-Gilbert sequencing is performed by chain breakage at specific nucleotides.

Sequencing gels are read from bottom to top (5′ to 3′).

 What do you think are the drawbacks?
Sanger dideoxy (primer extension/chain-termination) method
 Most popular protocol for sequencing
 Very adaptable, scalable to large sequencing projects
 Requires the following components
1) ssDNA template
2) A primer
3) DNA polymerase lacking 5′→3′ exonuclease activity (usually the Klenow
fragment of DNA pol I)
4) dNTPs (dATP, dTTP, dCTP, dGTP), one of them radioactively labeled
5) Dideoxynucleotide triphosphates, ddNTPs (ddATP,
ddTTP, ddCTP, ddGTP)
 The 3′-OH group necessary for formation of the
phosphodiester bond is missing in ddNTPs
 Their incorporation results in chain termination
 With addition of enzyme (DNA polymerase), the primer is
extended until a ddNTP is encountered
 The chain will end with the incorporation of the ddNTP.
 With the proper dNTP:ddNTP ratio, the chain will terminate
throughout the length of the template
 The resulting terminated chains are resolved by
electrophoresis
 Fragments from each of the four tubes are placed in four
separate gel lanes

Incorporation of any of the ddNTPs results in chain termination, since they lack the 3’ OH needed for the chain to elongate.
 By carrying out four reactions, four separate sets of
fragments are formed which are specifically terminated
 involves the following major steps
1. Preparation of four reaction tubes (A mix, T mix, C mix, G
mix), each containing:
 Multiple copies of ssDNA
 Primer
 All four dNTPs
 Enzyme
 A small amount (usually 1-5%) of one ddNTP, which brings about
chain termination
2. Run the four separate reactions, each with a different ddNTP
3. Fragment separation on a high-resolution polyacrylamide gel
4. Autoradiography and reading of the sequence from bottom to top
The sequence you read is that of the newly synthesized strand, so it is complementary to the original template
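The logic of the four-lane gel can be sketched in a few lines of Python. This is an idealized model, not a simulation of the chemistry: it assumes a terminated fragment exists at every position, ignores the primer, and uses a made-up eight-base template. Function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Reverse complement of a DNA string written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def sanger_lanes(template):
    """For each ddNTP lane, list the lengths of the terminated fragments.

    The new strand (5'->3') is the reverse complement of the template.
    A fragment of length n ends in the lane of the n-th base of that strand.
    """
    new_strand = reverse_complement(template)
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read bands from shortest (bottom of gel) to longest (top)."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


template = "AGGCTTAC"            # hypothetical template, 5'->3'
lanes = sanger_lanes(template)
read = read_gel(lanes)           # sequence of the newly synthesized strand
recovered = reverse_complement(read)
print(read, recovered)
```

Reading the gel gives the new strand; taking its reverse complement recovers the template, which is the point of the last line of the list above.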
Dideoxy (Sanger) Method

ddNTP: 2’,3’-dideoxynucleotide that terminates the chain when incorporated

Add enough so that each ddNTP is randomly and completely incorporated at each base
Automated DNA sequencing
 Developed in 1990
 Modification of the Sanger method
 Sanger sequencing: dye-terminator sequencing
 Uses ddNTPs tagged with different fluorescent dyes
 Performed in a single tube
 Avoids the hazards of radioactive isotopes
 Minimizes the time needed to sequence
 The fluorescent tags are attached to the chain-terminating
nucleotides
 Sequence data are obtained in real time by detecting the DNA bands
within the gel during the electrophoretic separation
 Each of the four dideoxynucleotides carries a spectrally
different fluorophore
 The DNA bands are detected by their fluorescence
Second generation sequencing (SGS)
 The key features:
 Parallelization of a large number of sequencing
reactions
 Reduction of the time needed
 Price reduction
 Technologies
 454 pyrosequencing: 2005
 Illumina Solexa: 2006
 Applied Biosystems SOLiD™ System: 2007
 Helicos HeliScope™:
454 sequencing
 It is sequencing by synthesis
 Developed by 454 Life Sciences, which was later
acquired by Roche
 454 sequencing relies on pyrosequencing

Illumina sequencing
 Most dominantly used
 Relies on sequencing by synthesis
 Like automated DNA sequencing it uses
fluorescently labeled nucleotides
 produce larger volume of data, compared to 454
sequencer.
Ion torrent
 Developed by Life Technologies
 Similar to 454 pyrosequencing, except that, unlike other SGS methods, it does not use
fluorescently labelled nucleotides
 It is based on detection of the H+ released during the sequencing
process
Third-generation sequencing
 Works by reading the nucleotide sequence at the single-molecule level
 The three commercially available TGS technologies
1. Pacific Biosciences (PacBio) Single Molecule Real Time (SMRT)
sequencing,
2. Illumina Tru-seq Long-Read technology
3. Oxford Nanopore Technologies
 All three technologies can produce long reads averaging
between 5,000 bp and 15,000 bp, with some reads exceeding
100,000 bp.

[Figure: PacBio RSII and PacBio Sequel instruments]

Third-generation sequencing
 PacBio SMRT: the most established one
 Commercially introduced in 2010
 Sequences DNA using sequencing-by-synthesis, and optically monitors
fluorescently tagged nucleotides as they are incorporated into
individual template molecules
 The PacBio RS II produces read lengths of up to ~100,000 bp with a
sequencing capacity of ~8GB / day
 The more recently released PacBio Sequel instrument increases the capacity of
the PacBio RS II 7-fold
 The main limitations of TGS platforms are sequencing
errors (reaching 10% to 15%) and cost
 However, several algorithmic techniques have been developed that
can improve the per-nucleotide accuracy to over 99.99%
with sufficient coverage
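To see why coverage helps, consider a deliberately simplified model: each read is wrong at a given position independently with the same probability, and the consensus is wrong only if more than half of the reads covering that position are wrong. Real consensus algorithms are more elaborate, and the independence assumption is only an approximation; the Python sketch below just shows how fast the error shrinks as depth grows.

```python
from math import comb


def majority_error(per_read_error, depth):
    """P(more than half of `depth` independent reads are wrong at one position)."""
    e = per_read_error
    return sum(
        comb(depth, k) * e**k * (1 - e) ** (depth - k)
        for k in range(depth // 2 + 1, depth + 1)
    )


# 15% raw error, the upper end of the range quoted above; odd depths avoid ties.
for depth in (1, 3, 11, 31):
    print(depth, majority_error(0.15, depth))
```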
The workflow for library construction steps of PacBio:
 Determine the quality of genomic DNA (gDNA)

 Shear gDNA

 Select size and adjust concentration

 Repair DNA damage and ends of fragmented DNA

 Conduct DNA purification

 Blunt-end ligation using blunt adapters

 Purify template for submission to a sequencer

 The template, called a SMRTbell, is a closed single-stranded circular DNA, which is created by ligating
hairpin adapters to both ends of target double-stranded DNA (dsDNA) molecules.
Animation source

 https://dnatech.genomecenter.ucdavis.edu/pacbio-sequencing/
Comparison among technologies
 What do you think are the advantages of DNA
sequencing?
 How does DNA sequencing give an accurate
physical map?
 What do you know about the Human Genome
Project consortium?
APPLICATIONS OF GENOMICS:

 List applications of genomics in the following sectors

1. Applications in health and forensic sciences
2. Applications in agriculture
3. Applications in industry
4. Applications in the environment
PART II:
Bioinformatics
What is Bioinformatics?
 Multiple definitions are available in the literature and on the web
 Bioinformaticians have their own definition, while experts in
other areas have another

 “Bioinformatics is an integration of mathematical, statistical and computer methods to analyse biological,
biochemical and biophysical data”

 A unified discipline formed from the combination of biology, computer science, mathematics, statistics and
information technology
What is Bioinformatics?
Biologists: collect molecular data (DNA, RNA and protein sequences)
Computer scientists, mathematicians and statisticians: develop tools, software and algorithms to store and analyze the data
Bioinformaticians: study biological questions by analyzing molecular data
What is Bioinformatics?
 Some people equate bioinformatics with computational biology;
however, computational biology is much broader
 Bioinformatics is limited to sequence, structural, and functional
analysis of genes, genomes, proteomes and metabolomes
 Computational biology encompasses all areas of biology that involve
computation, for example:
o mathematical modeling of ecosystems and population
dynamics
o study and application of computing methods for classical
biology
o theoretical biology, rather than work at the cellular or molecular level
o these all employ computational tools, but do not necessarily involve
biological macromolecules
Multidisciplinarity
Bioinformatics draws on: molecular biology, genomics, genetics, mathematics, biochemistry, statistics, biophysics, numerical analysis, algorithmics, evolution, image analysis and data management.
Bioinformatics Flow Chart
Sequencing; gene & protein expression
Analysis of nucleic acid sequence data
Analysis of protein sequences
Molecular structure prediction
Molecular interaction
Drug screening: ab initio drug design OR drug compound screening in a database of molecules
Metabolic and regulatory networks
Genetic variability
Milestone development of BI
 1965: The first major bioinformatics project was undertaken by
Margaret Dayhoff
 Developed the first protein sequence database, the “Atlas of Protein
Sequence and Structure”
 1970: The first sequence alignment algorithm was developed by
Needleman and Wunsch
 Early 1970s: Brookhaven National Laboratory established the
Protein Data Bank
 1972: The first biological database, the Protein Identification Resource
(PIR), was established by Margaret Dayhoff
 1974: The first protein structure prediction algorithm was developed
by Chou and Fasman
 1977: Maxam-Gilbert and Sanger DNA sequencing methods
Milestone development of BI
 1978: The term bioinformatics was coined
 1979: The first DNA database was established
 1980s: GenBank established, fast database searching algorithms
developed
 FASTA by William Pearson
 BLAST by Stephen Altschul and coworkers
 Prediction of RNA secondary structure
 Prediction of protein secondary and 3D structures
 1990s-2000: Prediction of genes and studies of complete genome
sequences started, EMBL-EBI established (1994), first bacterial genome
completely sequenced (1995), yeast genome completely sequenced
(1996), HGP (1990-2003)…
 2013: The number of sequences increased to more than 185 million
Level of bioinformatics analysis
1. Analysis of a single gene (protein) sequence.
 Similarity with other known genes
 Identification of well-defined domains in the sequence
 Prediction of secondary and tertiary structure
2. Analysis of complete genomes…..Genomics
 Which gene families are present, which are missing?
 Searching the location of genes on the chromosomes
 Expansion/duplication of gene families
 Identification of "missing" genes and hence product

Level of bioinformatics analysis
3. Sequence structure analysis and prediction
 Protein sequence structure analysis and prediction
Remember that structure determines function
4. Analysis of genes and genomes with respect to functional data
 Expression analysis
 Proteomics
 Comparison and analysis of biochemical pathways
Goals and scope
 Goal: To better understand a living cell and how it functions
at the molecular level
 SCOPE: Two major subfields
1. Development of computational tools (software) and
smart databases
2. Application of these tools and databases in generating
meaningful biological information
 The tools are used in three areas of genomic and molecular
biological research:
1. sequence analysis,
2. structural analysis,
3. functional analysis.
The genetic code
 The information required to make proteins is stored in DNA
• DNA sequence determines amino acid sequence
• Amino acid sequence determines protein structure
• Protein structure determines protein function
 Amino acids are coded by codons – triplets of nucleotides,
e.g. |ACG|TAT|….
 There are 4^3 = 64 codons for ~20 amino acids, so the code is
degenerate
 Deletions or insertions that are not a multiple of three shift the reading frame and destroy the message
 One of three stop codons (UAA, UAG, UGA) is always located
at the end of the protein-coding part of a gene
 AUG is the start codon in about 90% of genomes, except in
some prokaryotes and organelles
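The codon arithmetic above can be checked directly. The Python sketch below builds the 64-codon table of the standard genetic code from a compact 64-letter string (codons enumerated with bases in the order T, C, A, G; `*` marks a stop) and translates a short made-up sequence. The function name and test sequences are illustrative; alternative codes used by some organelles are not handled.

```python
BASES = "TCAG"
# Amino acids of the standard genetic code, in TCAG x TCAG x TCAG codon order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def translate(sequence):
    """Translate DNA or RNA codon by codon in frame 1; '*' marks a stop."""
    seq = sequence.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


print(len(CODON_TABLE))          # number of codons
print(translate("ACGTAT"))       # the |ACG|TAT| example from the text
print(translate("AUGGCCUAA"))    # start codon, one codon, stop codon
```

Because 64 codons map onto 20 amino acids plus stop, most amino acids are encoded by more than one codon, which is what "degenerate" means here.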
The genetic code

The amino acid shorthand notation

Reading Frames and open reading frames
 Reading Frame (RF)
• Any potential series of codons (and hence string of amino acids) encoded by
DNA
• A given stretch of DNA has six potential reading frames,
three forward and three backward.
• The different reading frames give entirely different proteins
 ORF: Open Reading Frame
• Any continuous reading frame that starts with a start codon and stops with a stop codon.
• A DNA or RNA sequence that could be translated into a peptide sequence
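The definition above translates almost directly into code. The Python sketch below scans all six reading frames (three on the given strand, three on its reverse complement) for stretches that begin with ATG and end at the first in-frame stop codon. It is a teaching sketch: the function names, the `min_codons` parameter and the example sequence are illustrative, and real gene finders use much more evidence than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of a DNA string written 5'->3'."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=2):
    """Return (strand, frame, start, orf) for every ATG...stop stretch found."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            i = frame
            while i + 3 <= len(seq):
                if seq[i:i + 3] == "ATG":
                    # Walk codon by codon until the first in-frame stop.
                    for j in range(i + 3, len(seq) - 2, 3):
                        if seq[j:j + 3] in STOP_CODONS:
                            if (j + 3 - i) // 3 >= min_codons:
                                orfs.append((strand, frame, i, seq[i:j + 3]))
                            i = j  # resume scanning after this stop codon
                            break
                i += 3
    return orfs


orfs = find_orfs("CCATGAAATGACC")  # hypothetical 13-base example
print(orfs)
```

In the example only one of the six frames contains a start codon followed by an in-frame stop, which illustrates why the frames give entirely different readings of the same DNA.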

Single-letter codes for ambiguous bases in
synthetic sequence constructs and bioinformatics
Code   Bases     Mnemonic
R      A/G       puRine
Y      C/T       pYrimidine
S      G/C       Strong
W      A/T       Weak
K      G/T       Keto
M      A/C       aMino
B      C/G/T     not-A
D      A/G/T     not-C
H      A/C/T     not-G
V      A/C/G     not-T (not-U in RNA)
N      A/C/G/T   aNy
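One practical use of these codes is searching a sequence for a degenerate motif. The Python sketch below turns a pattern written with the codes above into a regular expression and reports every match position. The motif `GANTC` and the test sequence are made up for illustration, and the function names are not from any library.

```python
import re

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Expand each ambiguity code into a character class."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


def find_sites(pattern, dna):
    """0-based start positions of all (possibly overlapping) matches."""
    regex = re.compile(f"(?=({iupac_to_regex(pattern)}))")
    return [match.start() for match in regex.finditer(dna.upper())]


print(iupac_to_regex("GANTC"))
print(find_sites("GANTC", "TTGAATCGGACTCA"))
```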
Mutation and conservation
 Mutation:
o Changes in DNA that affect
genetic information

 Point Mutations – changes in one or a few nucleotides


 Substitution
 …CAT… to …RAT…
 Insertion
 …CAT…to…CART…
 Deletion
 …CAT…to…AT…

 Frameshift Mutations – shifts the reading frame of the
genetic message

 Insertion
 THE DOG ATE THE BAT

 THE DOG CAT ETH EBA T

 Deletion
 THE DOG ATE THE BAT

 THE DOG TET HEB AT
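The sentence analogy can be reproduced in Python by removing the spaces and re-reading the letters three at a time, just as the ribosome reads codons. The helper name is illustrative.

```python
def read_in_triplets(message):
    """Drop spaces and regroup the letters into blocks of three."""
    letters = message.replace(" ", "")
    return " ".join(letters[i:i + 3] for i in range(0, len(letters), 3))


original = "THEDOGATETHEBAT"
with_insertion = original[:6] + "C" + original[6:]   # insert one letter after DOG
with_deletion = original[:6] + original[7:]          # delete the A of ATE

print(read_in_triplets(original))
print(read_in_triplets(with_insertion))
print(read_in_triplets(with_deletion))
```

Every triplet downstream of the change is altered, whereas a substitution would have changed only one.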

 Chromosome Mutations:

o Changes in number and structure of entire

chromosomes

 Deletion

 Duplication

 Inversion

 Translocation

Significance of Mutations
• Most are neutral
• Eye color
• Intronic changes
• Some are harmful
• Sickle cell anemia
• Down syndrome (trisomy 21, from failure of chromosome 21 to separate)
• Some are beneficial
• Carriers of the sickle cell allele are more resistant to
malaria
• Resistance to HIV infection
Introduction to Biological Databases

 What is a database?
 What type of biological databases can we access?
 What roles do they play?
 What type of information can we get from them?
 How do we access this information?

Database:
 Organized collection of logically related data
 It is structured collection of information
 Convenient tools of searching vast amount of information
Why databases ?
 Help to handle and share large volumes of biological data
 allow large scale analysis of biological data
 Disseminate biological data and information
 provide biological data in computer-readable form
 organize data in simpler and structured way
 Make data access easy and updated

The two main goals of biological databases:
o Information retrieval
o Knowledge discovery

 What makes a good database?
 Comprehensiveness
 Accuracy
 Up-to-date content
 Minimum redundancy
 Easy retrieval of data
 Good interface
Types of biological databases
 Over 1000 biological databases available
 In broad sense biological database can be :
1. Sequence databases: nucleic acid (NA) and protein
2. Structure databases: protein
3. Pathway databases: e.g., KEGG (Kyoto Encyclopedia
of Genes and Genomes)
4. Literature databases: e.g., PubMed, OMIM…
 Example:
Type of database     No. of records
DNA                  87
RNA                  29
Protein              94
Genomic              58
Mapping              29
Protein structure    18
Literature           43
Types of biological databases…
Content-based classification: three types
 Depending on the content, databases can be:
1. Primary databases
 Contain original information (sequences or structures)
 "Simple" archives of sequences, structures, images, etc.
 Raw data, with minimal annotation, not always well curated
 Examples:
Nucleic acids              Proteins
i. GenBank at NCBI         Raw sequences of UniProt, TrEMBL
ii. EMBL                   PDB
iii. DDBJ                  PIR
 GenBank, EMBL and DDBJ are the three major public primary
sequence databases that store raw NA data worldwide
 The three databases collaborate and exchange new data daily
 Together they constitute the International Nucleotide Sequence
Database Collaboration (INSDC)

[Figure: The International Sequence Database Collaboration]
Secondary databases
 Contain results from the analysis of the sequences in the
primary databases
 Enhanced with more complete annotation
 computationally processed or manually curated
 Examples :
 Functionally annotated, translated protein sequences of
Swiss-Prot and TrEMBL, which were later combined into
UniProt
 PIR
 RefSeq
 Taxon
 OMIM
 Swiss-Prot provides detailed sequence annotation that
includes structure, function, and protein family assignment
Specialized databases
 Focus on a particular research interest or organism
 Usually highly curated
 May contain sequences or other types of information
 Example: databases that specialize in a particular organism
 FlyBase
 WormBase
 HIV sequence database
 Ribosomal Database Project
 GenBank EST database
 Microarray Gene Expression Database at EBI
Interconnection between Biological Databases
 Primary databases are central repositories and distributors of raw sequence and structure data for secondary databases
 Secondary and specialized databases are connected to the primary databases to provide detailed information
 Users have to get information from both primary and secondary databases
Major Biological Databases Available via the WWW
Database         URL
DDBJ             www.ddbj.nig.ac.jp
EMBL             www.ebi.ac.uk/embl/index.html
NCBI             www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
ExPASy           http://us.expasy.org/
FlyBase          http://flybase.bio.indiana.edu/
GenBank          www.ncbi.nlm.nih.gov/Genbank
HIV databases    www.hiv.lanl.gov/content/index
OMIM             www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
Swiss-Prot       www.ebi.ac.uk/swissprot/access.html
Enzymes          www.chem.qmul.ac.uk
KEGG             www.genome.ad.jp
Drawbacks of Biological Databases
 Errors in sequence databases
 Redundancy in the primary databases
 Caused by repeated submission of identical or overlapping sequences
 NCBI has created a non-redundant database, called RefSeq, in which identical sequences from the same organism and associated sequence fragments are merged into a single entry
 The Swiss-Prot database also has minimal redundancy for protein sequences
 Occasional false annotation of genes
 Incompatibility (format, terminology, data types, etc.)
Information Retrieval from NCBI and EBI
 The two most popular retrieval systems:
1. Entrez (NCBI)
 Allows text-based searches
 Integrates cross-referenced information
2. SRS, the Sequence Retrieval System (EBI)
 Offers direct access to certain sequence analysis applications
Both:
 Provide access to multiple databases
 Allow complex queries
[Figure: data exchange among the three primary nucleotide databases. GenBank (NCBI, searched through Entrez), EMBL (EBI, searched through SRS) and DDBJ (CIB, searched through getentry) each receive submissions and updates.]
Searching information from GenBank
 The most complete collection of NA sequences
 Two ways to search for sequences in GenBank:
 Using text-based keywords
 Searching by sequence similarity using BLAST
 The search can be limited to “organism,” “accession number,” “authors,” and “publication date.”
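A keyword search with such limits can be written as an Entrez query string. The sketch below only builds a request URL for the NCBI E-utilities esearch service and does not contact the server. The helper name build_genbank_query, the example search terms and the choice of the nuccore database are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Address of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_genbank_query(keyword, organism=None, author=None, max_hits=20):
    """Return an esearch URL that limits a keyword search with Entrez field tags.

    Hypothetical helper: it only assembles the URL, nothing is downloaded.
    """
    terms = [keyword]
    if organism:
        terms.append(f"{organism}[Organism]")   # limit by organism
    if author:
        terms.append(f"{author}[Author]")       # limit by author
    params = {"db": "nuccore", "term": " AND ".join(terms), "retmax": max_hits}
    return ESEARCH + "?" + urlencode(params)

url = build_genbank_query("lysozyme", organism="Gallus gallus")
print(url)
```

The returned address can be pasted into a browser; the same limits are available from the search forms on the Entrez web pages.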

GenBank Sequence Format
 Contains three sections
1. Header: describes the origin of the sequence, the identification of the organism, and unique identifiers
 LOCUS: contains a unique database identifier for the sequence location in the database
 DEFINITION: provides summary information about the sequence
 Accession number (AC): unique number assigned to each sequence
 This is the number that should be cited in publications
GenBank Sequence Format…
 Version number and GenInfo Identifier (GI) number: identify the version of the sequence
 ORGANISM: the source organism, given with its scientific name
 REFERENCE: provides the citation of a related publication
 Includes author and title information
2. Features: includes annotation of the gene and its product
 Source: provides the length of the sequence, the scientific name, and the taxonomy identification number
 Gene: gives information about the nucleotide coding sequence (CDS) and its name
GenBank Sequence Format…
3. Sequence: starts with the label “ORIGIN”
 Used for both DNA and protein sequences
 Ends with two forward slashes (the “//” symbol)
 A DNA sequence is written in terms of bases
 A protein sequence is written in the single-letter amino acid code
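The three sections can be picked apart with a few lines of code. The sketch below is a simplified illustration, not a full GenBank parser: it assumes every header field fits on one line, and the record itself (identifier XX000001 and the 24-base sequence) is made up for the example.

```python
# Toy record for illustration only: the identifiers and the sequence are made up.
RECORD = """\
LOCUS       XX000001                  24 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Hypothetical example sequence.
ACCESSION   XX000001
VERSION     XX000001.1
FEATURES             Location/Qualifiers
     source          1..24
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""

def parse_genbank(text):
    """Collect a few header fields and the sequence from one flat-file record."""
    entry = {"sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):            # "//" marks the end of the record
            break
        if in_sequence:
            # Sequence lines hold a position number followed by blocks of bases.
            entry["sequence"] += "".join(line.split()[1:])
        elif line.startswith("ORIGIN"):      # the sequence section starts here
            in_sequence = True
        else:
            key = line[:12].strip()          # field labels sit in the first columns
            if key in ("LOCUS", "DEFINITION", "ACCESSION", "VERSION"):
                entry[key] = line[12:].strip()
    return entry

entry = parse_genbank(RECORD)
print(entry["ACCESSION"], entry["VERSION"], len(entry["sequence"]))
```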

Ideal minimal content of an entry in a sequence database

• Sequence
• Accession number (AC)
• Taxonomic data
• References
• Annotation/Curation
• Keywords
• Cross-references
• Documentation

Alternative Sequence Formats
 In addition to the GenBank format, there are many other sequence formats
FASTA:
 The simplest and the most popular
 Has a single definition line that begins with a right angle bracket (>) followed by a sequence name
 Extra information such as the GI number is separated from the sequence name by the “|” symbol
 Drawback: much of the annotation information is lost
 There are also many other, less popular sequence formats
FASTA format
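A minimal sketch of reading FASTA text is given here. It assumes each record has one definition line starting with ">" followed by one or more sequence lines. The record and its identifier fields are made up, and parse_fasta is a hypothetical helper name.

```python
def parse_fasta(text):
    """Return a list of (definition line, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):             # a new definition line
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)              # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Made-up record; the fields separated by "|" are hypothetical.
EXAMPLE = """>gi|0000000|example_1 hypothetical sequence
ATGGCCATTG
TAATGGGCCG
"""

for header, sequence in parse_fasta(EXAMPLE):
    name = header.split()[0]                 # sequence name with identifiers
    fields = name.split("|")                 # extra information separated by "|"
    print(fields, len(sequence))
```

Only the definition line and the sequence survive, which illustrates the drawback of the format: annotation present in a GenBank entry is not carried over.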

SEQUENCE ALIGNMENT
 What do you think sequence alignment is?
 Why is it needed?
 A way of arranging DNA, RNA, or protein sequences to identify regions of similarity
 Reveals three types of aligned pairs (matches, mismatches and gaps)
Optimal alignment
 Exhibits the most matches and the fewest differences
 Has the highest overall score
Why is sequence alignment important?
 Provides evolutionary information
 Similar sequences may have similar structure and similar function, and are likely to have originated from a common ancestor
 It is unlikely that sequences became similar by chance
o For two DNA sequences of length n, the probability of identity by chance is low: P = 4^(-n)
o For proteins the probability is much lower still: P = 20^(-n)
 Sequence alignment provides extra information for:
o Annotation of new sequences
o Modelling of protein structures
o Drug design and vaccine development
o Protein or enzyme engineering
o Disease detection and early treatment
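To get a feeling for how small the chance probabilities mentioned above are, they can be evaluated for a few lengths. The sketch assumes that all residues are equally frequent and that positions are independent, which real sequences only approximate. The function name chance_identity is hypothetical.

```python
def chance_identity(n, alphabet_size):
    """Probability that two random sequences of length n are identical,
    assuming equally likely residues and independent positions."""
    return alphabet_size ** -n

for n in (5, 10, 20):
    # 4 letters for DNA/RNA, 20 letters for proteins
    print(n, chance_identity(n, 4), chance_identity(n, 20))
```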
Types of sequence alignment
 Depending on alignment length
 Local
 Global
 Depending on sequence number
 Pairwise
 Multiple
 Combining the two criteria gives local pairwise, local multiple, global pairwise and global multiple alignment
Global vs Local alignment
 Global alignment:
 Aligns the entire sequences from start to end
 Similar sequences of approximately the same length are suitable candidates
 Uses the Needleman-Wunsch algorithm
 Local alignment:
 Looks for the best internal matching region
 Appropriate for picking out only similar regions (conserved domains), even when the sequence lengths differ
 Uses the Smith-Waterman algorithm
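The difference between the two algorithms can be seen in a compact sketch that returns only the best score, not the alignment itself. Both functions use one fixed penalty per gap position. The scoring values and the test sequences are assumptions chosen for illustration.

```python
def global_score(a, b, match=1, mismatch=0, gap=-1):
    """Needleman-Wunsch: best score for aligning a and b from start to end."""
    prev = [j * gap for j in range(len(b) + 1)]   # first row: leading gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                          # first column: leading gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]                               # score of the full-length alignment

def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of internal regions."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment can start anywhere.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best                                   # best cell anywhere in the table

print(global_score("AATCTATA", "AAGATA"))
print(local_score("TTTTGATTACA", "CCGATTACAGG"))
```

The global score is read from the last cell because the whole of both sequences must be aligned; the local score is the largest cell anywhere, which is why a shared internal region is found even when the flanking parts differ.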
Global vs Local alignment…
Logical questions
 Which alignment method do you think is best for divergent sequences?
 If you want to compare mutations in the active site of enzymes among a family of proteins, which alignment would you choose?
Basic terminologies
 Homologous sequences
 Similar sequences that have been derived from a common ancestor
 Homologs can be either orthologous or paralogous
 Orthologous sequences:
 Similar sequences in different organisms that have arisen through speciation
 Originated from a single ancestral gene in the most recent common ancestor
 Often have similar functions
 Paralogous sequences:
 Similar sequences within a single organism that have arisen through gene duplication
 Tend to have different functions
 Xenologous sequences:
 Similar sequences that have arisen through horizontal transfer between organisms rather than through vertical descent
 Sequence similarity:
 Quantitative measure of the alignment between two sequences
 Direct result of observation from the sequence alignment
 Can be quantified using percentages
 E.g., 80% similarity is meaningful, but "80% homology" is not, because sequences are either homologous or not
Identity vs Similarity
 Sequence identity:
• Exactly the same amino acid or nucleotide at the same position
• Similarity and identity are used interchangeably for DNA/RNA
• But the two have different meanings for protein sequences
 Similarity in protein sequences:
• Percentage of aligned residues that have similar physicochemical characteristics
 Identity in protein sequences:
• Percentage of aligned residues that are the same amino acid in the two sequences
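The two percentages can be computed from an existing alignment as sketched here. The grouping of amino acids by physicochemical character is an assumption made for the example (other groupings, or a substitution matrix, are equally possible), and the two peptides are made up. Identical residues are counted as similar as well, and gapped columns are left out of the denominator; other conventions exist.

```python
# Assumed grouping of amino acids by physicochemical character.
GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}

def identity_and_similarity(seq1, seq2):
    """Percent identity and percent similarity over aligned, ungapped columns."""
    columns = [(x, y) for x, y in zip(seq1, seq2) if x != "-" and y != "-"]
    identical = sum(x == y for x, y in columns)
    similar = sum(GROUP_OF[x] == GROUP_OF[y] for x, y in columns)
    return 100 * identical / len(columns), 100 * similar / len(columns)

# Hypothetical aligned peptides
identity, similarity = identity_and_similarity("MKVLDE", "MRILEE")
print(identity, similarity)
```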

Alignment edition and edit operations
 The optimal alignment of sequences is obtained by applying edit operations and by editing and re-editing the alignment
 What are edit operations?
 Events commonly found in nature that lead to evolution
 The edit operations are replacement, insertion and deletion (RID)
Edit distance
 Minimal number of edit operations (RID) required to change one string into the other
 Nature prefers the minimum edit distance. Why?
 For example:
AG-TCC
CGCTCA     Edit distance = 3
o The minimum number of edit operations is sought, to mimic what nature prefers
o The minimum edit distance corresponds to the maximum possible similarity
 What is the edit distance between the following sequences when they are optimally aligned?
X: A T A T A T A T A T
Y: T A T A T A T A T A
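The edit distance can be computed by dynamic programming instead of by trial and error. The sketch below counts each replacement, insertion and deletion as one operation. The function name edit_distance is hypothetical, and the call uses the first example pair from the slide (written without the gap).

```python
def edit_distance(s, t):
    """Minimum number of replacements, insertions and deletions turning s into t."""
    prev = list(range(len(t) + 1))           # turning "" into t[:j] needs j insertions
    for i in range(1, len(s) + 1):
        curr = [i]                           # turning s[:i] into "" needs i deletions
        for j in range(1, len(t) + 1):
            replace = prev[j - 1] + (s[i - 1] != t[j - 1])   # free if letters agree
            curr.append(min(replace, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]

print(edit_distance("AGTCC", "CGCTCA"))
```

For this pair the function returns 3, the value given on the slide.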
 Depending on how the sequences are penalized when searching for similarity, alignment can be divided into two types
1. Alignment without gaps (gap-free alignment)
 Alignment without allowing any internal gaps
 Only match and mismatch scores are expected
 The optimal alignment is obtained by sliding the shorter sequence along the longer one
 Ignores the impact of nature (RID)
 Cannot penalize mismatches
 Do you think that such similarity can represent events in nature?
Example
S = CGTTAGA
T = CGTAC
 How many possible alignments do we expect without inserting gaps?
 What is the optimal alignment if match score = 1 and mismatch score = 0?
 There are 3 different possible alignments
1. CGTTAGA      score = 3 (optimal alignment)
   CGTAC
2. CGTTAGA      score = 2
    CGTAC
3. CGTTAGA      score = 0
     CGTAC
 Limitation: ignores events that are found in nature
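The sliding procedure of the example can be written out as a short sketch. It considers only placements in which the shorter sequence lies completely inside the longer one, as the example does. The function name ungapped_scores is hypothetical.

```python
def ungapped_scores(long_seq, short_seq, match=1, mismatch=0):
    """Score every gap-free placement of short_seq along long_seq."""
    scores = []
    for offset in range(len(long_seq) - len(short_seq) + 1):
        window = long_seq[offset:offset + len(short_seq)]
        score = sum(match if x == y else mismatch
                    for x, y in zip(window, short_seq))
        scores.append(score)
    return scores

scores = ungapped_scores("CGTTAGA", "CGTAC")
best_offset = scores.index(max(scores))      # placement with the highest score
print(scores, best_offset)
```

The list holds one score per placement, in the order in which the shorter sequence is shifted to the right.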
2. Alignment with gaps
o A gap is a consecutive run of spaces in an alignment
o Gaps are added to make the lengths of the two sequences equal
o More alignments are possible than in gap-free alignment
o Gaps help to create alignments that better conform to the underlying biological models
o A gap penalty is added to the scoring function, which then consists of gap penalty, match score and mismatch score
o The gap penalty is lower than the match and mismatch scores because indels are less common than replacements
o Gap penalties can be divided into two:
 Simple gap penalty
 Origination and length penalty (OLP)
Simple/regular Gap Penalty
 This type of penalty gives the same value to all gaps
 Calculate the score of the following alignment if the gap penalty is -1, the match score is 1, and the mismatch score is 0
A T G C C A T
A T - T C - -
– Number of matches = 3
– Number of mismatches = 1
– Number of gaps = 3
Score = 3 + 0 + (-1 x 3) = 0
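The same count can be done by a small function that walks through the columns of a finished alignment. This is a sketch in which every gap position receives the same penalty. The function name simple_gap_score is hypothetical, and the default values are the ones used in the example.

```python
def simple_gap_score(top, bottom, match=1, mismatch=0, gap=-1):
    """Score an existing alignment, charging the same penalty for every gap position."""
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap          # a gap in either sequence
        elif x == y:
            score += match
        else:
            score += mismatch
    return score

print(simple_gap_score("ATGCCAT", "AT-TC--"))
```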
Simple/regular Gap Penalty…
 Introducing gaps may produce several different possible alignments
 The score is used to select the best alignment
Example
 Align AATCTATA with AAGATA by introducing gaps and rate the possible alignments using match = 1, mismatch = 0 and gap = -1
 Many alignments are possible, for example:
1. A A T C T A T A      score = 3 + (0 x 3) + (-1 x 2) = 1
   A A G - A T - A
2. A A T C T A T A      score = 5 + 0 - 2 = 3
   A A - G - A T A
3. A A T C T A T A      score = 5 + 0 - 2 = 3
   A A - - G A T A
 There are two best alignments (2 and 3); which one should we choose?
 Another way of penalizing gaps is therefore needed
Origination and Length Penalties
 Also called affine gap penalty
 Simple gap penalties may generate many optimal alignments (same score)
 In OLP, initiating a gap is more expensive than extending it
 One long gap is more likely to occur than many isolated gaps
 In OLP, events that are less likely to happen receive larger penalties
 Origination penalty: charged for starting a new series of gaps in one of the sequences
 Length penalty: charged according to the number of consecutive missing characters
Origination and Length Penalties (OLP)
 In OLP, a gap is given two weights
1. The weight to "open the gap" = W_O
2. The weight to "extend the gap" = W_E
 The total penalty for a gap of length L is:
W_Total = W_O + L x W_E
 Take gap opening penalty = -2 and gap extension penalty = -1, and score the previous alignments again
1. A A T C T A T A      score with simple gap penalty = 3 + 0 - 2 = 1
   A A G - A T - A      score with OLP = (-2 x 2) + (-2) + 3 = -3
2. A A T C T A T A      score with simple gap penalty = 5 + 0 - 2 = 3
   A A - G - A T A      score with OLP = (-2 x 2) + (-2) + 5 = -1
3. A A T C T A T A      score with simple gap penalty = 5 + 0 - 2 = 3
   A A - - G A T A      score with OLP = -2 + (-2) + 5 = +1
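The OLP scores above can be checked with a sketch that charges the opening weight once per gap and the extension weight once per gap position, which matches the formula for a gap of length L. The function name affine_score is hypothetical; the default weights are -2 for opening and -1 for extension.

```python
def affine_score(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score an existing alignment with origination and length penalties."""
    score = 0
    previous_gap = None                      # which sequence had a gap in the last column
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            current_gap = "top" if x == "-" else "bottom"
            if current_gap != previous_gap:
                score += gap_open            # a new gap starts: charge the opening once
            score += gap_extend              # every gap position adds the extension
            previous_gap = current_gap
        else:
            score += match if x == y else mismatch
            previous_gap = None
    return score

for bottom in ("AAG-AT-A", "AA-G-ATA", "AA--GATA"):
    print(bottom, affine_score("AATCTATA", bottom))
```

Alignment 3 now scores highest because its two missing characters form a single gap, so the opening weight is charged only once.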
Alignment 1         Alignment 2
T A A G C T A       T A A G C T A
A A - - - G A       A A - G - A -
o Use: match score = 1, mismatch = 0, GOP = -2 and GEP = -1
o What will be the score of each alignment
1. Using simple gap penalty?
2. Using origination and length penalty?
3. Which alignment is more likely to occur in nature? Why?
With different penalty value

Exercise
o For the following alignments, calculate the total score using GOP = -2, GEP = -1, match = 1 and mismatch = 0
1. ACGTCTGATACGCCGTATAGTCTATC
   ACGTCTGAT-------ATAGTCTATC

2. ACGTCTGATACGCCGTATAGTCTATC
   AC-T-TGA--CG-CGT-TA-TCTATC

3. ACGTCTGATACGCCGTATAGTCTATCT
   |   ||||| |||   || ||| ||||
   A--GCTGATTCGC---ATCGTC-ATCT
