Pattern Matching: Rhys Price Jones Anne R. Haake

Pattern Matching

Rhys Price Jones

Anne R. Haake

What is pattern matching?

Pattern matching is the procedure of

scanning a nucleic acid or protein sequence
for matches to short sequence patterns
(Staden 1990).

Why search for patterns?

Usually the sequences of interest (the query
sequences) are known to be indicators of
some important biological function
Search for patterns in nucleotide sequence

Search for patterns in amino acid sequence

Motif: multiple uses of the word
Def: a pattern; typically used to refer to a
short (up to ten bases or residues) repeated
or conserved pattern in nucleic acids or
proteins
Def: a short conserved sequence in a protein;
usually associated with function
In a broader sense, motif is used for all localized
regions of homology, regardless of size

Some examples of patterns in DNA


Restriction sites: recognition sites for the
restriction endonucleases
Intron splice sites
Codons specifying ORFs
DNA binding sites for regulatory proteins

Restriction Sites
Why identify them?
Exact or inexact matches?
Restriction sites

Splice Sites
Splice donor and splice acceptor
are consensus sequences
A statistical determination of the
pattern; approximates the pattern

(C or A)AG/GT(A or G)AGT "donor" splice site

(T or C)n N (C or T)AG/G "acceptor" splice site
Splice site example
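
Because a consensus sequence allows alternatives at some positions, it can be searched for with a regular expression. The following is a minimal sketch, assuming the donor consensus is rewritten with IUPAC codes (M = A or C, R = A or G, Y = C or T, W = A or T, N = any base). The helper names `consensus_to_regex` and `find_sites` and the example sequence are made up for illustration.

```python
import re

# IUPAC nucleotide codes needed for the consensus sequences in these slides.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "W": "[AT]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular-expression string."""
    return "".join(IUPAC[base] for base in consensus)

def find_sites(consensus, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead matches without consuming characters, so overlaps are reported.
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(sequence)]

# Donor consensus (C or A)AG/GT(A or G)AGT written in IUPAC codes.
DONOR = "MAGGTRAGT"

# Hypothetical sequence, not taken from any real gene.
example = "TTCAGGTAAGTCCAAGGTGAGTAA"
print(find_sites(DONOR, example))  # [2, 13]
```

A match only says that the stretch fits the consensus; since the consensus approximates the real sites, true splice sites can deviate from it and matches can occur by chance.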

Splice Sites
Remember that they are consensus sequences
Why are splice sites of interest?
Gene finding
Mutations in the consensus sequence at the splice junctions
are common in many inherited disorders
Ex: thalassemias, muscular dystrophy, Tay-Sachs,
neurofibromatosis, Darier's disease
One of the thalassemias: mutation at splice acceptor
YYYNCAG| normal
YYYNCGG| mutant

Codons Specifying ORFs

ORFs (open reading frames)
Start codon followed by 60-100 or more codons and no stop codon
Prokaryotic start codons: usually ATG, GTG or TTG,
but this is species specific
Eukaryotic start: ATG
Code table
More on this, too, when we discuss gene finding
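
The ORF criterion above can be sketched as a simple scan. This is an illustrative sketch, not a gene finder: it assumes the standard stop codons (TAA, TAG, TGA), looks only at the three forward reading frames, and treats the minimum length `min_codons` as an adjustable parameter. The function name `find_orfs` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code

def find_orfs(sequence, start_codons=("ATG",), min_codons=60):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the 0-based index of the start codon, end is the index just
    past the stop codon, and frame is 0, 1 or 2.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon in start_codons:
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3  # codons before the stop, start included
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # resume looking for the next start codon
    return orfs

# Hypothetical toy sequence: ATG, four GCT codons, then a TAA stop.
toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "G"
print(find_orfs(toy, min_codons=3))  # [(2, 20, 2)]
```

For a prokaryotic sequence, the alternative start codons can be passed in, for example `start_codons=("ATG", "GTG", "TTG")`.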

Prokaryotic promoters: Consensus sequences

Eukaryotic promoters
TATA box at -25 relative to transcriptional start site
consensus is 5'-TATAWAW-3' (W = A or T)

Initiator sequence (Inr)
consensus is 5'-YYCARR-3' (Y is C or T; R is G or A)
the +1 nucleotide (start) is usually the A of the Inr sequence

Bind basal transcription factors

We'll revisit this when we discuss gene finding
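
As a rough illustration of how the two eukaryotic consensus sequences could be used together, the sketch below looks for an Inr match whose A has a TATA box roughly 25 bases upstream. The tolerance window (`lo`, `hi`) and the function name `candidate_core_promoters` are assumptions made for this example, not values from the slides.

```python
import re

# Lookaheads so that overlapping matches are all reported.
TATA = re.compile(r"(?=TATA[AT]A[AT])")      # 5'-TATAWAW-3'
INR = re.compile(r"(?=[CT][CT]CA[AG][AG])")  # 5'-YYCARR-3'

def candidate_core_promoters(sequence, lo=20, hi=35):
    """Pair each Inr match with TATA matches lying lo..hi bases upstream.

    Returns (tata_start, tss) tuples of 0-based indices, where tss is the
    position of the A in YYCARR (the putative +1 nucleotide).
    The lo/hi window is an assumed tolerance around the -25 position.
    """
    tata_starts = [m.start() for m in TATA.finditer(sequence)]
    pairs = []
    for m in INR.finditer(sequence):
        tss = m.start() + 3  # index of the A within YYCARR
        for t in tata_starts:
            if lo <= tss - t <= hi:
                pairs.append((t, tss))
    return pairs

# Hypothetical sequence: TATA box, 20-base spacer, then an Inr element.
toy_promoter = "GG" + "TATAAAA" + "G" * 20 + "TCCAGG"
print(candidate_core_promoters(toy_promoter))  # [(2, 32)]
```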

Transcription Factor Binding Sites

Regulatory transcription factors are
sequence-specific DNA-binding proteins;
sites are often found in or near gene
promoter regions
DNA sequence is called the response element
What are the DNA sequences like?
Response elements

Some examples of patterns in protein sequences (motifs):

Prediction of secondary and tertiary structure


e.g. transcription factors

helix-turn-helix, b-zip, zinc-finger

Presence of active sites of enzymes

Presence of cell localization signals

Exact vs Inexact (Approximate) Pattern Matching

Exact Pattern Matching
Limited use in bioinformatics
Well-known algorithms (last week)
A common use of exact pattern matching is to
compare a sequence against a large number of
possible known patterns such as in the
identification of restriction sites
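
Comparing one sequence against many known patterns can be sketched with plain exact string search. The three enzymes and recognition sites below (EcoRI, BamHI, HindIII) are well-known examples chosen for illustration; the function names and the test sequence are hypothetical, and a real scan would use a full enzyme list.

```python
# A small sample of restriction enzymes and their recognition sites.
RESTRICTION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
}

def find_all(pattern, sequence):
    """Return every 0-based position where pattern occurs exactly in sequence."""
    positions = []
    i = sequence.find(pattern)
    while i != -1:
        positions.append(i)
        i = sequence.find(pattern, i + 1)  # step by one to catch overlaps
    return positions

def scan_restriction_sites(sequence, sites=RESTRICTION_SITES):
    """Map each enzyme name to the positions of its recognition site."""
    return {name: find_all(site, sequence) for name, site in sites.items()}

# Hypothetical test sequence.
dna = "AAGAATTCGGGGATCCTTGAATTC"
print(scan_restriction_sites(dna))
# {'EcoRI': [2, 18], 'BamHI': [10], 'HindIII': []}
```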

Most of the other examples of pattern matching in

Other uses of exact pattern matching?

Check PCR primers?
Annotation? (text matching)

Why search for patterns?

Pattern matching in sequences is also the
basis of searching through a sequence
Sequence alignment

Pairwise Sequence Alignment

An alignment between 2 sequences is a
pairwise match between sequences.
Pairwise sequence comparison is the primary
means of linking biological function to the
genome and of propagating known
information from one genome to another
(Gibas & Jambeck)

Why are inexact pattern matches relevant
in sequence alignments?
Sequencing errors
2 primary types
point mutations (affect a single nucleotide)
segmental mutations (affect a few to hundreds of
adjoining nucleotides)

substitutions (transitions, transversions)

insertions, deletions
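
To make the substitution types concrete: a transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T), and a transversion swaps a purine for a pyrimidine or vice versa. The sketch below counts both in two sequences that are assumed to be already aligned and of equal length (no insertions or deletions). The function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(a, b):
    """Classify one aligned base pair as match, transition or transversion."""
    if a == b:
        return "match"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"    # purine <-> purine or pyrimidine <-> pyrimidine
    return "transversion"      # purine <-> pyrimidine

def count_substitutions(seq1, seq2):
    """Count matches, transitions and transversions between aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"match": 0, "transition": 0, "transversion": 0}
    for a, b in zip(seq1, seq2):
        counts[classify_substitution(a, b)] += 1
    return counts

print(count_substitutions("ACGT", "GCTT"))
# {'match': 2, 'transition': 1, 'transversion': 1}
```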

Point mutations usually arise from a nucleotide
mismatch that becomes fixed during the process of
DNA replication
The mismatch escapes the DNA repair mechanism

Significant when they occur within a coding region and
also cause a change in functionality
Non-synonymous mutation
Synonymous mutation: mutated sequence codes for same
amino acid as before mutation
Allowance for synonymous mutation due to wobble and
degeneracy of the code
Code Table
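
The distinction between synonymous and non-synonymous mutations can be checked directly against the code table. This sketch builds the standard genetic code (DNA codons, `*` for stop) and compares the amino acids encoded before and after a change; the function name `classify_codon_change` is hypothetical, and the example codons are chosen only to illustrate the two cases.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_codon_change(before, after):
    """Return 'synonymous' if both codons encode the same amino acid (or stop)."""
    if CODON_TABLE[before] == CODON_TABLE[after]:
        return "synonymous"
    return "non-synonymous"

# Third-position change: GAA and GAG both code for glutamate (E).
print(classify_codon_change("GAA", "GAG"))  # synonymous
# Second-position change: GAA (E) becomes GTA (V).
print(classify_codon_change("GAA", "GTA"))  # non-synonymous
```

Many third-position changes come out synonymous, which reflects the degeneracy of the code mentioned above.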

Evolutionary Considerations
Through time mutations tend to be preserved
if they are not deleterious
Functionally important sequences tend to be conserved
Non-functional or non-coding sequences
diverge at a high rate

Evolutionary Considerations
The tendency of functionally important
sequences to remain relatively unchanged
over time is the basis for sequence analysis
Allows us to draw evolutionary connections among
genes that are related in sequence
