Pattern Matching: Rhys Price Jones Anne R. Haake


Pattern Matching

Rhys Price Jones


Anne R. Haake

What is pattern matching?

Pattern matching is the procedure of
scanning a nucleic acid or protein sequence
for matches to short sequence patterns
(Staden 1990).

Why search for patterns?


Usually the sequences of interest (the query
sequences) are known to be indicators of
some important biological function
Search for patterns in nucleotide sequence (DNA or RNA)
Search for patterns in amino acid sequence

Motif
Multiple uses of the word
Def: a pattern; typically used to refer to a
short (up to ten bases or residues) repeated
or conserved pattern in nucleic acids or
proteins
Def: a short conserved sequence in a protein;
usually associated with function
In a broader sense, motif is used for all localized
regions of homology, regardless of size

Some examples of patterns in DNA sequence:

Restriction sites: recognition sites for the
restriction endonucleases
Intron splice sites
Codons specifying ORFs
Promoters
DNA binding sites for regulatory proteins

Restriction Sites
Why identify them?
Exact or inexact matches?
Examples:
Restriction sites

Splice Sites
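Splice donor and splice acceptor sites
are consensus sequences
A consensus is a statistical determination of the
pattern; it approximates the pattern

(C or A)AG/GT(A or G)AGT "donor" splice site
(T or C)n N (C or T)AG/G "acceptor" splice site
(/ marks the exon-intron boundary; n = a run of pyrimidines; N = any base)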
Splice site example
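
Because a consensus only approximates real sites, one way to look for candidate donor sites is to slide a window along the sequence and count how many positions depart from the consensus. The sketch below does that for the donor pattern (C or A)AG/GT(A or G)AGT. The names DONOR, mismatches and find_donor_candidates and the max_mismatches threshold are illustrative assumptions, not part of the slides.

```python
# Donor consensus: (C or A)AG / GT(A or G)AGT.
# Each entry lists the bases the consensus allows at that position.
DONOR = ["CA", "A", "G", "G", "T", "AG", "A", "G", "T"]


def mismatches(window, consensus=DONOR):
    """Count positions where the window departs from the consensus."""
    return sum(base not in allowed for base, allowed in zip(window, consensus))


def find_donor_candidates(seq, max_mismatches=1):
    """Return (position, window, mismatch count) for near-consensus windows.

    max_mismatches is an assumed, tunable threshold.
    """
    seq = seq.upper()
    width = len(DONOR)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        m = mismatches(window)
        if m <= max_mismatches:
            hits.append((i, window, m))
    return hits


print(find_donor_candidates("TTCAGGTAAGTCC"))
```

Setting max_mismatches=0 turns this into an exact search. Raising it finds more true sites but also more chance matches.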

Splice Sites
Remember that they are consensus sequences
Why are splice sites of interest?
Gene finding
Mutations in the consensus sequence at the splice junctions
are common in many inherited disorders
Ex: thalassemias, muscular dystrophy, Tay-Sachs,
neurofibromatosis, Darier's disease
One of the thalassemias: mutation at splice acceptor
YYYNCAG| normal
YYYNCGG| mutant
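
The normal and mutant acceptor patterns above use IUPAC ambiguity codes (Y = C or T, N = any base). A minimal sketch of how such a pattern could be turned into a regular expression and tested against both sequences follows. The helper names iupac_to_regex and has_acceptor are hypothetical, and the IUPAC table is only the subset used on these slides.

```python
import re

# Subset of the IUPAC nucleotide codes (only those used on these slides).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern such as 'YYYNCAG' into a compiled regex."""
    return re.compile("".join(IUPAC[ch] for ch in pattern.upper()))


ACCEPTOR = iupac_to_regex("YYYNCAG")


def has_acceptor(seq):
    """True if the sequence contains a match to the acceptor pattern."""
    return ACCEPTOR.search(seq.upper()) is not None


print(has_acceptor("TCTACAG"))  # normal: ends in CAG
print(has_acceptor("TCTACGG"))  # mutant: A -> G in the AG dinucleotide
```

A pattern this short will also match many non-splice positions in a long sequence, so a match is only a candidate site.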

Codons Specifying ORFs


ORFs (open reading frames)
Start codon, then a run of roughly 60-100 or more codons
(amino acids) with no stop codon
Prokaryotic start codons: usually ATG, GTG or TTG,
but this is species-specific
Eukaryotic start: ATG
Code table
More on this, too, when we discuss gene
finding
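
The ORF definition above (a start codon followed by a long run of codons with no stop) can be sketched as a scan over the three forward reading frames. This is a simplified illustration: the function name find_orfs, the default min_codons=60, and the choice to scan only the forward strand are assumptions, and real gene finders do considerably more.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=60, starts=("ATG",)):
    """Return (start, end, frame) for forward-strand ORFs.

    An ORF here runs from the first start codon in a frame to the next
    in-frame stop codon. min_codons counts the start codon but not the stop.
    Pass starts=("ATG", "GTG", "TTG") for a prokaryote-style search.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Toy sequence: ATG + 5 codons + stop, so min_codons is lowered for the demo.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=5))
```

Coordinates are 0-based and end-exclusive, so seq[start:end] includes the stop codon.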

Promoters
Prokaryotic promoters: Consensus sequences
-35 box TTGACA, then a 17 ± 1 bp spacer, then -10 box TATAAT
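
As an illustration, the prokaryotic consensus (TTGACA, a spacer of 17 ± 1 bp, then TATAAT) can be written as a regular expression with a variable-length gap. This sketch matches only the exact consensus boxes, whereas real promoters usually deviate from the consensus at some positions. The names PROMOTER and find_promoters are hypothetical.

```python
import re

# -35 box, a 16-18 bp spacer (17 +/- 1), then the -10 box.
# The lookahead lets overlapping candidates be reported as well.
PROMOTER = re.compile(r"(?=(TTGACA)([ACGT]{16,18})(TATAAT))")


def find_promoters(seq):
    """Return (position of the -35 box, spacer length) for each match."""
    return [(m.start(), len(m.group(2))) for m in PROMOTER.finditer(seq.upper())]


demo = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"
print(find_promoters(demo))
```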

Eukaryotic promoters
TATA box at -25 relative to transcriptional start site
consensus is 5'-TATAWAW-3' (W = A or T)

Initiator sequence (Inr)
consensus is 5'-YYCARR-3' (Y is C or T; R is G or A)
the +1 nucleotide (start) is usually the A of the Inr sequence

Both bind basal transcription factors

We'll revisit this when we discuss gene finding

Transcription Factor Binding Sites


Regulatory transcription factors are
sequence-specific DNA-binding proteins;
sites are often found in or near gene
promoter regions
The DNA sequence bound is called the response
element
What are the DNA sequences like?
Response elements

Some examples of patterns in protein sequences (motifs):

Prediction of secondary and tertiary structure
e.g. transcription factors:
helix-turn-helix, bZIP, zinc-finger
Examples

Presence of active sites of enzymes


Presence of cell localization signals

Exact vs Inexact (Approximate) Pattern Matching
Exact Pattern Matching
Limited use in bioinformatics
Well-known algorithms (last week)
A common use of exact pattern matching is to
compare a sequence against a large number of
possible known patterns such as in the
identification of restriction sites
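
A minimal sketch of that use case: scanning one sequence for exact occurrences of several known recognition sites. The three enzymes listed are common textbook examples chosen for illustration, and the SITES table and find_sites function are hypothetical names. A real scan would use a full enzyme database and a multi-pattern algorithm.

```python
# A few well-known enzymes and their recognition sequences (5'->3').
# All three sites are palindromic, so scanning one strand is enough.
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def find_sites(seq, sites=SITES):
    """Return {enzyme: [0-based positions]} for exact matches of each site."""
    seq = seq.upper()
    hits = {}
    for enzyme, site in sites.items():
        positions = []
        i = seq.find(site)
        while i != -1:
            positions.append(i)
            i = seq.find(site, i + 1)  # step by 1 to allow overlapping hits
        hits[enzyme] = positions
    return hits


print(find_sites("AAGAATTCGGATCCGAATTC"))
```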

Approximate Pattern Matching
Covers most of the other examples of pattern matching in
bioinformatics

Other uses of exact pattern matching?


Check PCR primers?
Annotation? (text matching)

Why search for patterns?


Pattern matching in sequences is also the
basis of searching through a sequence
database
Sequence alignment

Pairwise Sequence Alignment


An alignment between 2 sequences is a
pairwise match between sequences.
"Pairwise sequence comparison is the primary
means of linking biological function to the
genome and of propagating known
information from one genome to another"
(Gibas & Jambeck).
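
To make "pairwise match" concrete, here is a small sketch that scores the best global alignment of two sequences by dynamic programming (the Needleman-Wunsch approach, which these slides do not name). The function name global_alignment_score and the scoring values match=1, mismatch=-1, gap=-1 are arbitrary assumptions for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of a and b.

    Only the score is returned, not the alignment itself, so a single
    row of the dynamic-programming table is kept at a time.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("GATTACA", "GATACA"))
```

The second call aligns a sequence against a copy with one base deleted. The lower score reflects the single gap needed to line the two up.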

Why are inexact pattern matches relevant in sequence alignments?
Sequencing errors
Mutation
2 primary types
point mutations (affect a single nucleotide)
segmental mutations (affect a few to hundreds of
adjoining nucleotides)

substitutions (transitions, transversions)


insertions, deletions
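
For substitutions, the transition/transversion distinction is easy to check in code: a transition swaps a purine for a purine (A/G) or a pyrimidine for a pyrimidine (C/T), and anything else is a transversion. The sketch below compares two equal-length, already aligned sequences. The function names are illustrative, and insertions and deletions are not handled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Classify one aligned base pair as match, transition or transversion."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "match"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_substitutions(seq1, seq2):
    """Tally substitution types between two equal-length aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be the same length (no indels)")
    counts = {"match": 0, "transition": 0, "transversion": 0}
    for a, b in zip(seq1, seq2):
        counts[classify_substitution(a, b)] += 1
    return counts


print(count_substitutions("ACGT", "GCTT"))
```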

Mutations
Point mutations usually arise from a nucleotide
mismatch that becomes fixed during the process of
replication
The mismatch escapes the DNA repair mechanism

Significant when they occur within a coding region and
also cause a change in functionality
Non-synonymous mutation: mutated sequence codes for a
different amino acid
Synonymous mutation: mutated sequence codes for the same
amino acid as before the mutation
Synonymous mutations are possible because of wobble and
the degeneracy of the code
Code Table
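
The synonymous / non-synonymous distinction can be checked directly against the code table. The sketch below builds the standard genetic code from a compact 64-letter string (codons ordered with bases T, C, A, G) and compares the amino acids encoded before and after a mutation. The names CODON_TABLE and classify_point_mutation are hypothetical, and the standard code is assumed. Some organisms and organelles use variant codes.

```python
BASES = "TCAG"
# Standard genetic code, codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
# '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_point_mutation(codon_before, codon_after):
    """Return 'synonymous' if both codons encode the same amino acid."""
    before = CODON_TABLE[codon_before.upper()]
    after = CODON_TABLE[codon_after.upper()]
    return "synonymous" if before == after else "non-synonymous"


print(classify_point_mutation("CTT", "CTC"))  # third-position change, Leu -> Leu
print(classify_point_mutation("GAG", "GTG"))  # Glu -> Val
```

Third-position changes such as CTT to CTC are often synonymous, which is the degeneracy the slide refers to.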

Evolutionary Considerations
Through time mutations tend to be preserved
if they are not deleterious
Functionally important sequences tend to be
conserved
Non-functional or non-coding sequences
diverge at a high rate

Evolutionary Considerations
The tendency of functionally important
sequences to remain relatively unchanged
over time is the basis for sequence analysis
Allows us to draw evolutionary connections among
genes that are related in sequence
