Gene Prediction

GENE PREDICTION
Group 4
INTRODUCTION
Computational gene prediction is a prerequisite for detailed functional annotation
of genes and genomes.
Gene prediction as a whole consists of detecting the location of open reading
frames (ORFs) and, if the genes of interest are of eukaryotic origin, delineating
the structures of introns and exons.
Gene prediction is one of the most difficult problems in the field of pattern
recognition. There are two basic problems in gene prediction: prediction of
protein coding regions and prediction of the functional sites of genes.
ORF (OPEN READING FRAMES)
Open reading frames (ORFs) are defined as spans of DNA sequence between the
start and stop codons.
A long open reading frame is often part of a gene.
Alternatively, an ORF can be defined as a sequence that has a length divisible by
three and is bounded by stop codons.
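The start-to-stop definition above can be sketched in a few lines of Python. This is a minimal illustration, not a production gene finder: the function name `find_orfs` and the `min_codons` parameter are hypothetical, only the forward strand is scanned, and only ATG is treated as a start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: only the canonical start codon is used here


def find_orfs(seq, min_codons=1):
    """Return (start, end, frame) for each ATG...stop ORF on the given strand.

    `end` is exclusive and includes the stop codon, so (end - start) is
    always divisible by three.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # open an ORF at the first start codon seen
            elif start is not None and codon in STOP_CODONS:
                # number of codons before the stop codon
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and keep scanning this frame
    return orfs


print(find_orfs("CCATGAAATAGGG"))
```

Raising `min_codons` filters out short ORFs, which are more likely to occur by chance than long ones.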
GENE PREDICTION DIFFERS FOR DIFFERENT ORGANISMS
TYPES OF GENE PREDICTION METHODS
There are mainly two classes of methods for computational gene prediction. One is
based on sequence similarity searches, while the other is based on gene structure
and signal searches and is also referred to as ab initio gene finding.
Consensus-based programs use data derived from both methods.
Ab initio methods
• Use only features unique to genes, detected in the sequence using probabilistic methods (e.g., HMM)
• Rely on gene signals and gene content
Homology-based approach
• Makes predictions based on significant matches of the query sequence with sequences of known genes
• Compares translated DNA or genomic DNA against protein and cDNA sequences
GENE PREDICTION IN PROKARYOTES
Conventional determination of ORFs
• Nucleotide distribution / Gene content
• Gene signals
Using Markov and Hidden Markov models
PROKARYOTIC GENE FEATURES
• Smaller genome size
• Higher gene density
• Continuous ORFs
• Presence of unique features such as
  • Shine-Dalgarno sequence *
  • Conserved consensus motifs
  • Start and stop codons
  • ρ-independent terminator signal
* Consensus motif
GENE SIGNALS
Prokaryotic DNA is first subjected to conceptual translation in all six possible frames,
three frames forward and three frames reverse.
Commonly checked signals:
• Absence of stop codons for more than 30 codons in a frame
• Presence of start codons (ATG, GTG, or TTG)
• Detection of homologs after in silico translation
• Other signals
  • Shine-Dalgarno sequence
  • ρ-independent terminator signal
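The signal checks for prokaryotes can be combined into a rough six-frame scan, sketched below. The helper names (`reverse_complement`, `candidate_orfs`, `six_frame_scan`), the 20-nucleotide upstream window, and the exact-match test for the AGGAGG Shine-Dalgarno consensus are all simplifying assumptions made for illustration; real tools score imperfect matches and use trained models.

```python
STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")
SD_MOTIF = "AGGAGG"   # Shine-Dalgarno consensus; exact match is a simplification
SD_WINDOW = 20        # assumed size of the upstream search window


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def candidate_orfs(strand_seq, min_codons=30):
    """Yield (start, end) of start-to-stop ORFs longer than min_codons codons."""
    for frame in range(3):
        start = None
        for i in range(frame, len(strand_seq) - 2, 3):
            codon = strand_seq[i:i + 3]
            if start is None and codon in STARTS:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 > min_codons:
                    yield start, i + 3
                start = None


def six_frame_scan(seq, min_codons=30):
    """Scan three forward and three reverse frames for candidate genes.

    Coordinates on the '-' strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for start, end in candidate_orfs(s, min_codons):
            upstream = s[max(0, start - SD_WINDOW):start]
            hits.append({"strand": strand, "start": start, "end": end,
                         "sd": SD_MOTIF in upstream})
    return hits


demo = "AGGAGGAAAAAA" + "ATG" + "GCT" * 31 + "TAA"
print(six_frame_scan(demo))
```

The ρ-independent terminator and the homology check are left out of this sketch because they need secondary-structure prediction and a sequence database, respectively.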
GENE CONTENT
Examines the nonrandomness of nucleotide distribution. Examples:
• GC bias: preference for G or C over A or T at the third nucleotide of a codon in a coding sequence
• TESTCODE: exploits the repetition of the third codon nucleotide in a coding sequence
By plotting the repeating patterns of the nucleotides at these positions, coding and
noncoding regions can be differentiated.
Disadvantage:
• These methods identify only typical genes and tend to miss atypical genes in which
the rule of codon bias is not strictly followed.
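As a rough illustration of a gene-content statistic, the sketch below computes the GC fraction at each of the three codon positions of a sequence read in a given frame. The function name `gc_by_codon_position` is hypothetical, and this is not the TESTCODE algorithm itself, only the simpler third-position GC measurement it is related to.

```python
def gc_by_codon_position(seq):
    """Return [GC fraction at position 1, position 2, position 3] of codons.

    The sequence is read in frame 0; trailing bases that do not fill a
    complete codon are ignored.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    fractions = []
    for pos in range(3):
        bases = seq[pos:usable:3]  # every third base starting at this position
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions


print(gc_by_codon_position("ATGGCCAAG"))
```

In a sliding-window analysis, a third-position GC fraction that stands out from the other two positions would hint at a coding region, subject to the caveat about atypical genes noted above.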
WHAT IS A MARKOV MODEL
A Markov model is a stochastic model of randomly changing systems that
possess the Markov property. This means that, at any given time, the next state
depends only on the current state and is independent of anything in the
past.
Two types:
• Markov chains
• Hidden Markov models
ORF DETERMINATION
USING MARKOV AND
HIDDEN MARKOV MODELS
Oligonucleotide distributions in coding regions differ from those in noncoding regions.
This difference can be used to provide a finer statistical description of a gene.
A Markov model describes the probability of the distribution of nucleotides in a DNA sequence, in
which the conditional probability of a particular sequence position depends on the k previous positions. In
this case, k is the order of the Markov model.
• A second-order model, which looks at the two preceding bases, is more characteristic of codons in a coding sequence.
• The higher the order of a Markov model built in sets of three nucleotides, the more accurately it can
predict a gene.
An HMM prediction algorithm is created by combining Markov models of multiple orders that represent
different nucleotide distributions, giving accurate predictions for both typical and atypical genes.
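The idea of a k-th order Markov model can be sketched as follows: estimate the probability of each base given the k preceding bases from coding and from noncoding training sequences, then score a query by its log-odds ratio. The names `train_markov` and `log_odds`, the add-one pseudocount, and the toy training strings are assumptions for illustration. This is a plain fixed-order Markov chain, not the HMM or IMM used by the programs named in these slides, and it assumes sequences contain only A, C, G and T.

```python
import math
from collections import defaultdict


def train_markov(seqs, k, pseudocount=1.0):
    """Return prob(context, base) estimated from k-mer context counts."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        seq = seq.upper()
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + 4 * pseudocount  # four possible bases
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding, noncoding, k):
    """Sum of log(P_coding / P_noncoding) over positions k..len(seq)-1."""
    seq = seq.upper()
    return sum(
        math.log(coding(seq[i - k:i], seq[i]) / noncoding(seq[i - k:i], seq[i]))
        for i in range(k, len(seq))
    )


# Toy training data, invented purely for the example.
coding = train_markov(["ATGGCCGCCGCCGCC"], k=2)
noncoding = train_markov(["ATATATATATATAT"], k=2)
print(log_odds("GCCGCC", coding, noncoding, k=2) > 0)
```

A positive score means the query looks more like the coding training set; a higher order k captures longer dependencies but needs more training data to estimate reliably.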
HMM/IMM-BASED GENE
FINDING PROGRAMS
• GeneMark (based on fifth-order HMMs)
• Glimmer (Gene Locator and Interpolated Markov Modeler), based on the interpolated Markov model (IMM) algorithm
• FGENESB (based on fifth-order HMMs for detecting coding regions)
• RBSfinder, which uses the prediction output from Glimmer and searches for
Shine-Dalgarno sequences in the vicinity of predicted start sites
GENE PREDICTION IN EUKARYOTES
• Eukaryotic nuclear genomes are much larger than prokaryotic ones (10 Mbp to 670 Gbp)
• Low gene density
• A gene is split into pieces (called exons) by intervening noncoding sequences (called introns)
• A eukaryotic gene is modified in different ways before becoming a mature mRNA for protein translation
• These features make gene prediction in eukaryotes more complex and challenging
• Uses both ab initio and homology-based methods
FEATURES IN EUKARYOTIC GENES
• Consensus motif GTAAGT at the 5’ splice junction
• Consensus motif (Py)12NCAG at the 3’ splice junction
• Nucleotide composition and codon bias differ between coding and noncoding regions
• Hexamer frequencies in coding regions are higher than in noncoding regions
• Kozak sequence (CCGCCATGG): conserved sequence flanking the ATG start codon
• CpG island (p refers to the phosphodiester bond connecting the two nucleotides)
• Poly-A signal
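The two splice-junction consensus motifs can be located with regular expressions, as in the sketch below. The function name `splice_site_candidates` is hypothetical, and exact consensus matching is a deliberate simplification: real splice sites deviate from the consensus, so actual programs score sites with weight matrices or similar models rather than exact patterns.

```python
import re

# Lookaheads let overlapping occurrences be reported.
DONOR = re.compile(r"(?=(GTAAGT))")                   # 5' splice junction
ACCEPTOR = re.compile(r"(?=([CT]{12}[ACGT]CAG))")     # 3' junction: (Py)12NCAG


def splice_site_candidates(seq):
    """Return (donor_starts, acceptor_ends) for exact consensus matches.

    A donor position is the index of the G in GTAAGT (first intron base);
    an acceptor position is the index just after the AG (first exon base).
    """
    seq = seq.upper()
    donors = [m.start(1) for m in DONOR.finditer(seq)]
    acceptors = [m.end(1) for m in ACCEPTOR.finditer(seq)]
    return donors, acceptors


demo = "AAAGTAAGTAAA" + "CTCTCTCTCTCT" + "ACAG" + "GG"
print(splice_site_candidates(demo))
```

Pairing each donor with a downstream acceptor gives candidate introns, which a gene finder would then weigh against gene-content evidence.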
GENE PREDICTION PROGRAMS FOR EUKARYOTES
Most of these programs are organism specific because training data sets for obtaining
statistical parameters must be derived from individual organisms.
They fall into all three categories of algorithms:
Ab Initio–Based Programs
• The goal of the ab initio gene prediction programs is to discriminate exons from
noncoding sequences and subsequently join the exons together in the correct order. To predict
exons, the algorithms rely on two features, gene signals and gene content. Tools used are
  • Neural networks
  • Hidden Markov models
  • Discriminant analysis: linear discriminant analysis (LDA) or quadratic discriminant analysis (QDA) is used to improve accuracy
Homology-Based Programs
Consensus-Based Programs
PREDICTION
USING NEURAL
NETWORKS
A neural network is a statistical model
with a special architecture for pattern
recognition and classification composed
of a network of mathematical variables
connected by weighted functions. The
network processes information and
modifies parameters of the weight
functions between variables during the
training stage. Once it is trained, it can
make automatic predictions about the
unknown.
GRAIL (Gene Recognition and Assembly Internet Link) is a web-based program
that is based on a neural network algorithm.
[Figure: Architecture of a sample neural network for eukaryotic gene prediction.]
PREDICTION USING HMM’S
Follows a similar principle to that described for prokaryotic gene prediction while
recognising gene signals and gene content specific to eukaryotes.
GENSCAN (predictions based on fifth-order HMMs) combines hexamer
frequencies with coding signals (initiation codons, TATA box, cap site, poly-A, etc.)
in prediction.
HMMgene (HMM-based web program)
• The unique feature of the program is that it uses a criterion called the conditional maximum
likelihood to discriminate coding from noncoding features.
HOMOLOGY-BASED
PROGRAMS
Exon structures and exon sequences of related species are highly conserved.
When potential coding frames in a query sequence are translated and used to align
with closest protein homologs found in databases, near perfectly matched regions
can be used to reveal the exon boundaries in the query.
Drawback: if homologs are not available in the database, the method cannot
be used, so novel genes in a new species cannot be discovered.
E.g.: GenomeScan, EST2Genome, SGP-1 (Syntenic Gene Prediction), TwinScan

CONSENSUS-BASED
PROGRAMS
Combine the results of multiple programs based on
consensus.
Work by retaining common predictions agreed on by
most programs and removing inconsistent predictions.
Improve specificity by correcting false
positives and the problem of overprediction.
GeneComber combines HMMgene and GENSCAN
prediction results.
DIGIT uses predictions from three ab initio programs:
FGENESH, GENSCAN, and HMMgene.
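The "retain what most programs agree on" idea can be sketched as a nucleotide-level majority vote. This is an illustrative assumption about how such a consensus could be computed, not the actual algorithm of GeneComber or DIGIT; the function name `consensus_exons` and the half-open `(start, end)` interval format are hypothetical.

```python
from collections import Counter


def consensus_exons(predictions, min_votes=None):
    """Keep positions predicted as exonic by at least min_votes programs.

    predictions: one list per program of (start, end) exon intervals,
                 with `end` exclusive.
    min_votes:   defaults to a strict majority of the programs.
    """
    if min_votes is None:
        min_votes = len(predictions) // 2 + 1
    votes = Counter()
    for exons in predictions:
        covered = set()
        for start, end in exons:
            covered.update(range(start, end))
        votes.update(covered)  # each program votes at most once per position
    kept = sorted(pos for pos, n in votes.items() if n >= min_votes)

    # Merge runs of consecutive positions back into intervals.
    intervals = []
    for pos in kept:
        if intervals and pos == intervals[-1][1]:
            intervals[-1][1] = pos + 1
        else:
            intervals.append([pos, pos + 1])
    return [tuple(iv) for iv in intervals]


programs = [
    [(10, 20), (50, 60)],   # hypothetical output of program A
    [(12, 20), (80, 90)],   # program B, with one unsupported exon
    [(10, 18), (50, 60)],   # program C
]
print(consensus_exons(programs))
```

The exon predicted by only one program is dropped, which is how a consensus approach trades some sensitivity for better specificity.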
PERFORMANCE EVALUATION FOR PREDICTION PROGRAMS
Parameters used
• Sensitivity (SN): the ability to include correct predictions.
• Specificity (SP): the ability to exclude incorrect predictions.
Features used to describe SN and SP:
• True positive (TP): a correctly predicted feature
• False positive (FP): an incorrectly predicted feature
• False negative (FN): a missed feature
• True negative (TN): the correctly predicted absence of a feature
FOR EUKARYOTES
The sensitivity and specificity have to be defined on the levels of nucleotides, exons,
and entire genes.
For exons, an average of sensitivity and specificity at the exon
level is used instead of the correlation coefficient (CC).
In addition, the proportion of missed exons and missed genes as well as wrongly
predicted exons and wrong genes, which have no overlaps with true exons or genes,
often must be indicated.
CORRELATION COEFFICIENT
(CC)
Summarises both sensitivity and specificity in a single value.
The value of the CC provides an overall measure of accuracy, which ranges from −1
to +1, with +1 meaning always correct prediction and −1 meaning always incorrect
prediction.
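The slides do not give formulas, so the sketch below uses the definitions commonly applied in gene-prediction benchmarks, which is an assumption here: SN = TP / (TP + FN), SP = TP / (TP + FP), and CC as the Matthews-style correlation coefficient. Note that this SP differs from the TN / (TN + FP) definition used in some other fields. The function name `accuracy_measures` is hypothetical.

```python
import math


def accuracy_measures(tp, fp, fn, tn):
    """Return (sensitivity, specificity, correlation coefficient).

    Assumed definitions (gene-prediction convention):
      SN = TP / (TP + FN)
      SP = TP / (TP + FP)
      CC = (TP*TN - FP*FN) / sqrt((TP+FP)(TN+FN)(TP+FN)(TN+FP))
    Any measure with a zero denominator is returned as NaN.
    """
    nan = float("nan")
    sn = tp / (tp + fn) if tp + fn else nan
    sp = tp / (tp + fp) if tp + fp else nan
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    cc = (tp * tn - fp * fn) / denom if denom else nan
    return sn, sp, cc


# Hypothetical nucleotide-level counts for a 1000-base region.
print(accuracy_measures(tp=80, fp=20, fn=20, tn=880))
```

With these counts SN and SP are both 0.8 and CC is 7/9; a perfect prediction gives CC = +1 and a completely inverted one gives CC = −1, matching the range described above.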
Group members
Nathan Jude Serpes
Anmol Ghale
Tushima Sharma
Pushti Verma
Jeneeta
Kajal
THANK YOU
