
Bio-informatics

Exam: 4 theory questions and 4 programming exercises

Basic terms
Mapping = determining the position of the reads on a reference genome, where does each read
come from in the genome?
Alignment = arranging sequences to identify regions of similarity/comparing two or more sequences,
to visualize relationships between sequences etc.

Lecture 1: Introduction
Evolution of data
- Protein Atlas: first database of all global proteins
- Alignment algorithms for global alignment (11 years later: local)
- Eventually, too much data: database search tools such as BLAST used
Genome sequencing (Nature vs Science)
- Clone-by-clone: taking a large set of BAC clones that make up the entire genome, sending these
clones to labs all over the world (= divide and conquer) to sequence, then mapping them onto the human
genome (Nature, 25 years)
- Whole genome shotgun sequencing: cutting genome many times at random, performing
paired end seq on these fragments, using database for alignment
o Because of database, this was possible in 9 months (versus 25 years otherwise)
Genome size
- Humans: 3 Gb
o Difference in genome with another human is only 0.1% = 3Mb
o Essentially only need to sequence these 3Mb to characterize a human
- Human, mouse and rat: very similar, little diversity bcs survival of the fittest
Sequencing versus re-sequencing
- re-sequencing to find certain differences (eg. SNPs), Illumina often used (max 100 bp)
Next generation sequencing
- DNA library preparation: genome fragmentation, addition of adapters to DNA fragments
- Amplification of DNA library with PCR
- Sequencing of amplified library: eg sequencing by synthesis
Importance of paired-end reads
- Single read: can map to multiple positions
- Paired-end: beginning and end of fragment are known +
distance between them is known -> can find the exact
location that they map to because both end and
beginning need to match
Epigenomics
- Changes in histones, methylation, differentiation within genome -> can make DNA
unexpressable = 'WEAK CURRENT' (zwakstroom): when are parts of DNA active and when not?
- Ribo-seq: sequencing the RNA that was actively bound to ribosomes and was being translated,
to find out which proteins are actively being made
Third generation sequencing
- Big difference with previous generations: no reactions in solutions! Polymerase is
immobilized. Single molecule (often enzyme), don’t need amplification bcs we look at
polymerase with special optics
- SMRT by Pacific Biosciences: uses zero mode waveguides (wells that are very small
compared to the wavelength of light) to watch a single immobilized polymerase
- Nanopore (Oxford Nanopore): sensing of changes in ionic conductance caused by nucleic acids passing through
a nanometer-size pore
Lecture 2: Where do we find data? How to access? How to store?
Molecular Biology
New version of full genome
- T2T (telomere to telomere): repeats are in lower case but rest is in upper case <-> difference
with normal sequencing: everything in lower case
Structure of DNA
- Only 1 strand is stored in the database: 5’ to 3’, reverse complement never stored
o Consequences: if a gene is on the antisense strand you will not find it by reading the stored strand directly, its ATG will
be present on antisense and not on sense
o Half of genes is actually present on antisense
- RNA is never sequenced but cDNA is! Reverse transcriptase to make cDNA (never uracil, only
thymine found in databases)
Introns and exons
- Caused by splicing: two types of splicing (an intron almost always starts with GT and ends with AG)
- Cis splicing: splicing within the same molecule/same strand
- Trans splicing: exons from two separate RNA molecules (eg sense and antisense both transcribed) are joined = come together as the
RNA that is being translated by the ribosomes
Proteins
- Often difficult to map back to the genome because don’t know where protein came from
- Reasons:
1. A lot of modifications at protein level
2. A lot of things happen that we don’t understand yet eg PROTEIN SPLICING: certain parts
of protein seq taken out or recombined
Flat files ‘sequence’ databases
Flat files
- A text file
- RefSeq, Genbank format
Genbank format
Nucleotide databases (same information but a different format, redundant)
- EMBL Nucleotide Sequence Database
- GenBank
- DDBJ
- Important: when working with GENBANK, be careful, everything is submitted without
quality check -> different usage of ontology/definition of gene or exon
- Data retrieved from these databases: annotated text <-> FASTA: unannotated text
- These databases contain strings = letters, but this is not what we measure!
o Trace files = what we measure
o GenBank = basecalling, done with certain level of confidence (Q value)
o Q value = 40 <-> Probability of incorrect base call is 1:10000
o Major issue: HETEROZYGOUS genes, GenBank does not take alternative form of genes
into account, trace file does
EMBL format

Phred score
- Quality value, low Q value = lower base calling accuracy
- Q = -10*log_10(P)
- Predicts peak location and reads actual peak locations from base trace
- Matches actual location with predicted location
- Why used?
o Trimming (end of sequence often of poor quality): because difficult to resolve larger
fragments
o Weak or variable signal strength to a base
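
The Phred formula above can be turned around to get the error probability from Q. A minimal sketch in Python: the function names and the trimming cutoff of Q20 are assumptions made for illustration, not part of the lecture.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred quality q is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality for an error probability p (0 < p <= 1): Q = -10*log_10(P)."""
    return -10 * math.log10(p)

def trim_low_quality_tail(qualities, min_q=20):
    """Drop trailing bases whose quality is below min_q (hypothetical cutoff)."""
    end = len(qualities)
    while end > 0 and qualities[end - 1] < min_q:
        end -= 1
    return qualities[:end]
```

For example, `phred_to_error_prob(40)` gives 1 in 10000, and `trim_low_quality_tail([30, 35, 40, 18, 10])` keeps only the first three values.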
Phrap score
- Uses Phred scores to determine consensus sequences
- Uses highest scoring sequences (of a series of individual sequences) to extend the consensus
sequence
RefSeq (reference database, not redundant)
- Includes chromosomes, mRNA and proteins
Open reading frames (ORF)
- Every double-stranded DNA sequence has 6 reading frames -> 3 on sense and 3 on antisense

Protein sequence databases (minimal redundancy level)


- SWISS-PROT: best protein database, highly annotated (info about mutation and structure
given)
- UniProt: newer umbrella resource that combines Swiss-Prot and TrEMBL
- TREMBL: ORF predicted from database, contains translated EMBL seq that are not yet
present/integrated in swissprot, contains more noise because not all translated/predicted
ORF are actually coding for an existing protein
- Same format as EMBL
- Minimal redundancy: separate entries/data is merged together, eg summary of gene =
summary of many papers coming from different entries

Structure databases (structure  function)


- Protein data bank (PDB): 3D structure of proteins
- Databases growing but number of new structures not growing?
o Structure of proteins is often very conserved, eg 300 aa -> 20^300 possibilities ->
change into another aa will not necessarily change structure
o When DNA is modified, RNA is modified even more, protein modified the most BUT
doesn’t mean that structure changed

Relational databases
Databases that need a higher level of annotation and relational information, used to answer more
complex questions
Because of explosive growth in biological data it became difficult to use flat files -> relational
databases to the rescue
Relational databases contain a set of tables where a data model is poured into, tables are linked with
each other, to query these tables you need a language (SQL) and a program (RDBMS)
Benefits: minimal redundancy, data sharing, data independency
Disadvantages: size, complexity

RDBMS
- Program used to manage these databases
- SQL often used as language to jump into the data and search the data, can do SELECT, FROM,
WHERE, JOIN, etc
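
A minimal sketch of SELECT, FROM, WHERE and JOIN, using Python's built-in sqlite3 module with an in-memory database. The table names, column names and annotation rows are made up for illustration and do not reflect the real Ensembl schema.

```python
import sqlite3

# Two linked tables: genes and their (toy) gene ontology annotations.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, chromosome TEXT)")
cur.execute("CREATE TABLE go_annotation (gene_id INTEGER, go_term TEXT)")
cur.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [(1, "TP53", "17"), (2, "BRCA1", "17"), (3, "MYC", "8")],
)
cur.executemany(
    "INSERT INTO go_annotation VALUES (?, ?)",
    [(1, "DNA repair"), (2, "DNA repair"), (3, "cell proliferation")],
)

# Which genes on chromosome 17 are annotated with 'DNA repair'?
rows = cur.execute(
    """
    SELECT gene.symbol
    FROM gene
    JOIN go_annotation ON gene.gene_id = go_annotation.gene_id
    WHERE gene.chromosome = '17' AND go_annotation.go_term = 'DNA repair'
    ORDER BY gene.symbol
    """
).fetchall()
dna_repair_on_chr17 = [row[0] for row in rows]
```

The JOIN links the two tables through the shared gene_id field, which is what a flat file cannot do without scanning everything.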
Ensembl/BioMart (eg searching complete human genome, genome browser)
- Datamodel that stores the information from flat files such as GenBank
- Tables have limited amount of fields but unlimited amounts of records (so short and long)
- You can query the database/ask questions with SQL language
o Filters: gene ontology = linking genes with cell process such as cell death or DNA
repair
Lecture 3
Recap
Ensembl = genome browser
- Started after full genome was sequenced in 2002: genome of only 4 organisms was then
available
- Years later: more and more organisms added to the database
- Currently used to answer very specific genomic questions
Definitions
Identity: simple to define, extremely powerful!
- The extent to which two (nt or aa) sequences are invariant/uniform
- Identity of 30%: structure/function are usually similar, even though the sequences are far from identical
Homology
- Concept of similarity, the similarity is attributed to descent from a common ancestor
- Importance: immunotherapy! -> trials in humanized mice: mice given a human immune
system by replacing their genes with the orthologous human genes
- Some genes are really conserved throughout species = all from same ancestor
Orthologous
- Homologous sequences in DIFFERENT species, from a common ancestor during speciation
(no duplication)
- Consequence: sequences that are orthologous could take over each other's function, can +-
replace each other

Paralogous
- Homologous sequences from a SINGLE species, by gene duplication
- The extra gene that was made because of duplication, is a gene that nature can play with and
that can change
Sequence alignment
- Two biological sequences are sufficiently similar (30% identical) -> same function, same
structure
o Sequence determines the function + many positions/AA may be changed without
change in function (redundancy)
- Some counterexamples: tumor genes -> 1 AA difference can have enormous impact on
function
Scoring Matrices
Theoretical
- Aging = linked with changes in immune system, loss of information in the genome = loss of
DNA that is responsible for repair
- Scoring matrix = tool to quantify how well a model was able to align two sequences -> the
result from this matrix is only meaningful in the context of that model
- Choice of matrix -> influences the outcome
Nucleic acid scoring matrices
- Alignment of DNA sequences
- Transition/Transversion matrix
o Purines (A & G) versus pyrimidines (T & C)
o Mutation that changes the ring number: transversions (A -> C or T, less conserved)
o Mutation that keeps the ring number the same: transition (A -> G, more conserved)
▪ Happens way more in nature than transversions (even though more
possibilities for transversions!)
o Reduces the noise in comparing distantly related sequences (so very different
sequences!)
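
A small sketch of how a transition/transversion matrix can be expressed in Python. The score values (+1, -1, -2) are hypothetical and only show that transversions are penalized more than transitions.

```python
PURINES = {"A", "G"}      # two rings
PYRIMIDINES = {"C", "T"}  # one ring

def classify_substitution(a, b):
    """Return 'match', 'transition' (ring number kept) or 'transversion'."""
    if a == b:
        return "match"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def ts_tv_score(a, b, match=1, transition=-1, transversion=-2):
    """Toy scoring: the three score values are assumptions for illustration."""
    kind = classify_substitution(a, b)
    return {"match": match, "transition": transition, "transversion": transversion}[kind]
```

Counting all ordered pairs of different bases gives 4 transitions and 8 transversions, which is the "more possibilities for transversions" remark above.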

Protein scoring matrices: Unitary matrix


- Identity metric used: score 1 for matches, score 0 for mismatches -> unitary matrix
- Works for closely related proteins
Genetic code matrix
- The distance between 2 AA is the minimal number of mutations/base changes required to
transform 1 codon into the other codon
- Improves sensitivity and specificity in comparison with identity matrix
- Proves that evolution has minimized the effects of point mutations on the AA sequence
o Impact of mutation doesn’t necessarily have chemical impact = chemical properties
conserved
o Either multiple codons that are similar for 1 AA or similar codons for AA with same
chemical properties -> chemical similarity matrix: match when two AA have similar
side chain/residue
Amino acid residues
- Hydrophobic AA: non-polar side chain (eg methyl group -CH3)
o Interior of protein because hydrophobic (hates water)
- Aromatic AA: these substitutions are very conserved
- Neutral-polar AA: polar groups and no charge
- Acidic AA: carboxyl side chains (-COO-) and therefore negatively charged
- Basic AA: positively charged
o Can interact with DNA, eg zinc fingers, almost half of proteins are + charged!
- Conformationally important AA: unique AA
o Glycine: no side chain and can therefore take conformation that are for other AA
impossible due to sterical hindrance (eg sharp turn), also used as a spacer to link two
structures to each other without influencing them
o Proline: most rigid
-> Any property of AA could be used for similarity scoring matrices
Empirical: different approach but similar result
Absence of valid model based on chemical principles (which property to choose for scoring matrix?
which property minimizes change of function when AA is changed? -> no consensus), so empirical
approach needed!

PAM (Dayhoff percent accepted mutation)


- Scores amino acid pairs based on the expected frequency of the one aa changing into the
other aa
- Assumption: once the evolutionary relationship of two sequences is established, the
exchanged residues are similar
- Principles
o Model of evolution: proteins evolve through a succession of independent point
mutations
o Evolutionary distance: number of point mutations needed to evolve from one seq
into another seq
▪ Measure of this distance: PAM = percent accepted mutation: 1 accepted
point mutation on the path between two sequences, per 100 residues
▪ Accepted point mutation = mutation that is stably fixed in gene pool
(so a mutation in the germ line)
▪ Higher PAM = you allow higher likelihood of mutations
▪ Choosing PAM value based on level of divergence
o PAM is unit to measure evolutionary distance between two sequences
- Construction of the scoring matrix
1. Finding accepted point mutations: making a phylogenetic tree (families with >85%
identity to avoid ambiguity)
2. Frequencies of occurrence: the statements that we make depend on aa occurrence
frequencies, some aa have multiple codons = higher frequency of occurrence
3. Relative mutabilities: must take into account aa that do not mutate at all, what is chance
that aa will mutate at all = relative mutability m_j = number of changes by the
aa/frequency of occurrence
4. Mutation probability matrix: construction of matrix using the previous values, prob that
aa in row i will replace/change into aa in column j?
A_ij = pair exchange frequency for ij

5. Evolutionary distance: scaling all values of matrix by constant factor, we chose scaling so
that overall number of changes = 1% = distance for 1 PAM (1/100 AA will change)
6. Relatedness odds: prob that aa replaces another in random seq = frequency of aa
occurrence -> P(random, i) = frequency(i)

7. Log-odds matrix: multiplication is often difficult so taking the log for addition = easier,
value of +1 = corresponding pair has been observed 10 times more than expected by
chance -> log_10(10) = 1, 10^1 = 10
- Some remarks: chance that the effect of the mutation is hidden + some aa may mutate more
than once (accepted point mutation much bigger than evolutionary distance!)

- How to observe protein evolution?


As the PAM value increases, it indicates a greater evolutionary distance between the
sequences. Higher PAM values imply a higher likelihood of amino acid substitutions,
insertions, and deletions. Consequently, as sequences diverge over evolutionary time, the
percent identity decreases.
- PAM values and corresponding distances

- Low pam – closely related sequences, high identity, low score for mutations, low distance
o Finds alignments of highly similar sequences
- High pam – distant sequences, low identity, high score for mutations, high distance
o Finds alignments of very different sequences (bcs allows more aa substitutions)

BLOSUM (Block substitution matrix)


- Scores amino acid pairs based on frequency of the one aa changing into the other aa in
aligned sequence motifs (=blocks)
- Principle: using sequence blocks, these blocks are sequence fragments that can be aligned
without introducing gaps = highly conserved regions
- Construction of score matrix:
o LOG(observed pairs/expected pairs)
o Observed pairs (g_ij)
▪ Making relative frequency table: prob of obtaining a substitution pair if
randomly choosing pair from block?
▪ Need to divide observed by total amount of pairs within the sequence
o Expected pairs (e_ij)
▪ Making random relative frequency table: prob of obtaining a substitution pair
if drawn independently from block?
▪ Summation of theoretical expectancy of drawing certain pair
o Summary = log_2(g_ij/e_ij)
- Blosum values
o BLOSUM62: built from blocks in which sequences with more than 62% identity are clustered together
o BLOSUM45: same, with a 45% identity cutoff
- High blosum – closely related sequences
- Low blosum – distant related sequences
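
The log2(observed/expected) construction can be sketched for a toy block as follows. This is a simplified illustration, not the published BLOSUM procedure: it skips the identity clustering and the scaling/rounding of the final scores, and the function name is an assumption.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_log_odds(block):
    """Log-odds scores log2(observed/expected) from one gap-free block.

    block: list of equally long, already aligned sequence fragments.
    """
    # Observed pairs: count every pair of residues within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Frequency of each residue, derived from the pair frequencies.
    freq = Counter()
    for (a, b), q in observed.items():
        if a == b:
            freq[a] += q
        else:
            freq[a] += q / 2
            freq[b] += q / 2

    # Expected pairs: probability of drawing the pair independently.
    scores = {}
    for (a, b), q in observed.items():
        expected = freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]
        scores[(a, b)] = math.log2(q / expected)
    return scores
```

With the toy block `["AS", "AS", "AS", "AT"]` the fully conserved A column gives the pair (A, A) a score of log2(0.5/0.25) = 1.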

In general
When looking for very small differences in similar sequences: low PAM, high BLOSUM (high identity,
low substitution score)
When there are a lot of differences between the sequences: high PAM, low BLOSUM (low identity,
high substitution score)
No single matrix is the complete answer! Use several that complement each other (high blosum with high pam
for example)

Pairwise alignment
Dot-plots
Graphical representation using two axis and dots for regions of similarity
Best way to see the structures in common between two sequences eg repeated structures or
conserved domains
- Dot is plotted when identity is higher than certain threshold in a sliding window
- Sliding window used that slides over the sequence
o Window size depends on goal of the analysis (size of exon, size of gene promoter etc)
o About 20 for distant proteins
o About 12 for nucleic acids
Diagonal run of dots = similarity regions
Vertical or horizontal skip = a gap (no similarity found, eg when seq 1 = RNA and seq 2 = DNA, some
regions spliced out)
Rules of thumb: check seq versus itself, check seq versus another seq, use window size appropriate
for type of analysis
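
A minimal sketch of the sliding-window dot-plot idea in Python. The function name and the default window/threshold values are assumptions, and a dot is placed when the number of identical positions in the window is at least the threshold.

```python
def dot_plot(seq1, seq2, window=3, threshold=3):
    """Return (i, j) positions where the windows starting at seq1[i] and seq2[j]
    share at least `threshold` identical positions."""
    dots = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if matches >= threshold:
                dots.append((i, j))
    return dots
```

For example, `dot_plot("ACGTACGT", "ACGT", window=4, threshold=4)` returns dots at (0, 0) and (4, 0), which shows the repeated ACGT in the first sequence.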
Lecture 4
Recap
No single score matrix is the complete answer. Best to use multiple matrices that complement each other
As more 3D structures of proteins are determined, score matrices derived from structure comparison
will give the most reliable data.

Introduction
Global versus local alignment
- Global: sequences are completely aligned with each other (entire sequence)
o Needleman-Wunsch
- Local: only a subregion of each sequence is aligned with the other
o Smith-Waterman
Multiple alignment
- To characterize protein families: shared regions of homology need to be identified -> can be
done by multiple sequence alignment
- To find conserved regions/To determine the consensus sequence of several sequences
aligned with each other
o Conserved regions are often key functional regions and prime target for drug
development
- To help the protein structure prediction
- To do phylogenetic analysis
o Comparing sequences of different genomes/species eg comparing same protein from
different species
Algorithms
What?
Algorithm
- A method or a process followed to solve a problem (eg a recipe)
- Takes the input to a problem and transforms it to the output
- 1 problem can need many algorithms to solve it
- The output doesn’t change even when input order has been changed (<-> heuristic methods)
o Heuristic methods = approach to problem-solving that uses practical/experience
based strategies and does not follow a theoretical path. Prioritizes speed and
efficiency over precision
- Example: algorithm to find the greatest common divisor
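
As a sketch of such an algorithm, here is Euclid's method for the greatest common divisor in Python (the notes do not say which method was shown in the lecture, so this choice is an assumption):

```python
def gcd(a, b):
    """Greatest common divisor of two non-negative integers (Euclid's algorithm)."""
    while b != 0:
        # Replace (a, b) by (b, a mod b); b gets smaller every step, so the loop terminates.
        a, b = b, a % b
    return a
```

For example, `gcd(48, 18)` goes 48, 18 -> 18, 12 -> 12, 6 -> 6, 0 and returns 6.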
Pseudocode
- A programming code that gives the procedure but is not written in a specific programming
language
Examples
Bubble Sort Algorithm
- List of numbers, comparing 2 neighboring numbers, swap them according to desired order
(increasing or decreasing), loop until no elements need to be exchanged
o List of 20 numbers: may need to loop 19 times -> not most efficient!
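
A minimal Python sketch of bubble sort as described above, sorting in increasing order and looping until no elements need to be exchanged:

```python
def bubble_sort(values):
    """Return a new list sorted in increasing order using bubble sort."""
    items = list(values)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(items) - 1):
            # Compare two neighbouring numbers and swap them if they are out of order.
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
    return items
```

In the worst case (a reversed list) every pass only moves one number into place, which is why the number of comparisons grows with n².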
Properties
- It must be correct
- It must be composed of concrete and finite steps -> it must terminate
- No ambiguity (eg different outputs because of different order of input)

Efficiency of algorithms
- Types of complexity
o Time: total number of operations as a function of input size n
▪ Polynomial algorithm: n² + n + 1, n doubles -> operations roughly quadruple
▪ Big Oh Notation: O(n²)
o Space: memory needed as a function of n, when space increases, often time also increases

Dynamic programming for Pairwise Alignment


Pairwise = aligning 2 sequences

Concept
Best alignment: the one with the maximal score
Alignment score model
- Each column gets value for match (+1) and mismatch (-1) and gaps/indels (-2)
- Total score of alignment = sum of values assigned to the columns
- Gap score: difference between creating a gap or extending a gap!
Algorithm used for alignment = dynamic programming
- Breaks the one big problem down into smaller sub problems
- Simplifies the problem
Dynamic Programming
1. Matches: give value +1
2. Give last row and last column value 0 (except if there is already value 1)
Looking for maximum value to the right in the row/column underneath: add it to the values in the
column above
Maximum value = 5
End result: problem = K can be aligned with two K (same score), choosing first K doesn’t create gap,
but choosing second K creates a gap

[Figure: left = path if we choose the first K, right = path if we choose the second K]

Needleman-Wunsch
Global alignment
Different alignment direction!
Expanding gap by adding -1 to -9
Calculating score matrix value coming from three different directions: a, b and c -> choose max value!
This method allows for very fast back-tracing.

Scores/Values
- Using gap penalties
o Gap > 1: constant gap penalty
o Creating a gap
o Extension of a gap
- Using BLOSUM62: instead of match (+1) and mismatches (-1)
Uses
- Time complexity: for sequence of 10000 aa -> already takes 11 minutes
For 10000 aa -> memory overflows (10000 x 10000 = 10^8 cells), so aligning 4 seq of 100 aa (100^4 = 10^8 cells) is not possible either!
- Generating histogram from aligned random sequences: creates an extreme value distribution
Gives an alignment score to randomly generated sequences
Best alignment path
- On the diagonal of the matrix -> may not need to fill the entire matrix! Only fill band around
the main diagonal
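
A minimal Python sketch of Needleman-Wunsch with a constant gap penalty, using the simple model of match +1, mismatch -1 and gap -2. It fills the whole matrix (no banding) and returns one of the possibly several best alignments; a substitution matrix such as BLOSUM62 or separate gap opening/extension penalties are left out to keep it short.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Edges: an increasing gap penalty.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Each cell: best of the three directions (diagonal, up, left).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

For example, `needleman_wunsch("ACGT", "AGT")` aligns ACGT with A-GT: three matches and one gap give a score of 3 - 2 = 1.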

Sensitivity versus specificity/selectivity


Sensitivity = ability to find true positives = ability to find all related sequences in database
Specificity/selectivity = ability to minimize amount of false positives = ability to select only the seq
that are related to query in the database
-> Compromise between the two
Example: you are a doctor, you have high sensitivity if you are able to tell your patients if they have
cancer or not, however, you only have high specificity if you only tell this to the people that actually
have cancer
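
The two definitions as formulas, in a small Python sketch (tp, fn, tn, fp are counts of true positives, false negatives, true negatives and false positives):

```python
def sensitivity(tp, fn):
    """Fraction of the truly related sequences (or sick patients) that were found."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of the unrelated sequences (or healthy patients) that were correctly rejected."""
    return tn / (tn + fp)
```

A search that reports every sequence in the database has sensitivity 1 but specificity 0, which is why a compromise is needed.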

Smith-Waterman
Local alignment = alignment between parts of the two sequences

2 proteins can have region of high similarity, but can be very different outside that region -> if we
used a global alignment for this then we would have lots of mismatches and gaps outside the region
of similarity -> local alignment is solution

3 differences with global alignment


1. Edges of score matrix are = 0 and not an increasing gap penalty value
2. Minimum score is 0, no score is recorded unless greater than 0 (negative values are replaced by 0)
3. Back-trace starts at the highest score in the matrix (and not at the bottom-right corner as in the Needleman-Wunsch
method) and stops at the first 0

Multiple Alignment
The best alignment = maximum total score

For multiple alignment: amount of sequences > 2

Hyperlattice
- K-dimensional hyperlattice: running time proportional to number of nodes!
o Memory space is no. 1 problem if we want to trace back the alignment: entire lattice
needs to be stored
ClustalW: heuristic method to do multiple alignment
MSA
Most practical multiple sequence alignment method is the extension of pairwise alignment methods!
Method
- First do all pairwise alignments (every sequence with every other sequence, not just 1
sequence with all sequences)
- Combine these alignments to generate an overall alignment
o Perform cluster analysis to generate a hierarchy (binary guide tree)
o First align the most similar pair of sequences (eg seq 2 & 4), then the next most
similar (seq 1 & seq 3) etc.
 Introduce gaps to optimize the alignment
o Align the alignment of 2&4 with alignment of 1&3 – preserve gaps
ClustalW
- Alignment program for DNA and proteins
- msf file = multiple sequence text file = set of aligned sequences
Good alignment?
- Roughly 1/3 of AA are the same (30% identity)
Protein structure prediction
- MSA can be used for prediction of protein structure
o Eg conserved C = cysteine (disulfide) bridges, conserved glycine or proline = beta-turn, position
of indels = surface loops
o Alternating hydrophobic residues (i, i+2, i+4) = surface beta-strands
Alternative methods
SIM
See slides

BLAST ( NCBI )
- To compare two sequences
- To compare sequence to a database

DALI
- Knows secondary structure
- Changes penalty score depending on if it changes the secondary structure or not

Lecture 5: DataBase Searching


Recap
Dynamic programming is an approach that divides a big problem into small subproblems that
are easier to solve and reuses their solutions. It is used for alignment.

Sequence Alignment Tools


Multiple Sequence Alignment: MSA, ClustalW
Short Read Sequence Alignment: BWA, Bowtie
Database Search: BLAST, FASTA

Read length versus resequencing


- Resequencing = trying to detect sequence differences between individual and already
sequenced genome of the species
o Goal: identifying variants such as SNPs and mutations
o Achieving high genome coverage (each base sequenced multiple times/sequencing
depth) is more important than maximizing the individual read length
- Short reads only work if you have a reference genome
- For PCR: primer of about 16-20 bp is long enough to match a specific part in the genome
- T2T of genome: contains repeats and non-repeats, repeats will only give you 70% but if you
select non-repeats, you could get to 90% (never 100%)
- For E. coli: much smaller genome (about 4.6 x 10^6 bp), so a shorter primer is enough
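
A back-of-the-envelope sketch of why about 16 bp is enough for a human-sized genome. It assumes a random sequence with equal base frequencies (real genomes have repeats, so this is a lower bound), and the E. coli size of about 4.6 Mb is an assumed value.

```python
def min_unique_length(genome_size):
    """Smallest k for which there are more possible k-mers (4^k) than genome positions."""
    k = 1
    while 4 ** k <= genome_size:
        k += 1
    return k

human = min_unique_length(3_000_000_000)  # 4^16 is about 4.3 x 10^9 > 3 x 10^9
ecoli = min_unique_length(4_600_000)      # 4^12 is about 1.7 x 10^7 > 4.6 x 10^6
```

Under these assumptions the human genome needs k = 16 and E. coli needs k = 12.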

Short Read Sequence


Basic idea: computing an index and making a database, each time database changes -> compute new
index = creates a register
Bowtie
Memory efficient short-read aligner. Aligns short DNA sequences to human genome.
Burrows-Wheeler Aligner (BWA)
Aligner that implements the bwa-short and BWA-SW algorithms for mapping against a reference genome. Bwa-short works
for short query sequences and BWA-SW works for longer query sequences.
Mapping reads back to genome
- Hash Table/Lookup table: fast
- Array Scanning
- Dynamic Programming (Smith-Waterman, local)
- Burrows-Wheeler Transform: fast

Burrows-Wheeler Transform
Why?
- Very compact, as large as the original text and easy to compress: useful for storage space!
o Gain speed, search faster: look up time ~ size of index/original text
o Linear-time search algorithm: search time is proportional to length of the query [O(m + N)]
- Reversible
- Index

Concept: reversible permutation


1. $ added as a placeholder at the end of the sequence
2. Rotations: write out all cyclic rotations of the sequence (the piece behind the placeholder changes)
3. Sort them alphabetically: $ is lowest character, so put first! = Burrows-Wheeler Matrix (BWM)
4. Last column of the BWM = Burrows-Wheeler Transform (BWT)
Reversible operation! When given the BWT, you can reconstruct the original sequence!
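
The four steps can be sketched directly in Python. This naive version builds every rotation, which is fine for a short example but far too slow and memory-hungry for a genome; real aligners build the transform from a suffix array instead. The function names are assumptions.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations ('$' sorts before letters)."""
    text = text + "$"                                   # step 1: placeholder
    rotations = [text[i:] + text[:i] for i in range(len(text))]  # step 2
    matrix = sorted(rotations)                          # step 3: the BWM
    return "".join(row[-1] for row in matrix)           # step 4: last column

def inverse_bwt_naive(last_column):
    """Rebuild the original text by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]
```

For example, `bwt("abaaba")` gives `abba$aa`, and `inverse_bwt_naive("abba$aa")` gives back `abaaba`.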

Sometimes characters are dependent on each other, they always occur together = character
dependency

T-ranking/ LF mapping
Keeping track of the nucleotides could reveal interesting patterns: in picture you can see that order of
the characters is the same in the first and in the last column
However we rank the occurrences of the nucleotides, the ranks will appear in the same order in the
First column and in the Last column.
- Why does the LF mapping hold?
o Sorted by right-context

B-ranking
Giving arbitrary numbers (for example with ascending ranks), not like they occur in the original
sequence.
Reversing
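
A sketch of reversing with the LF mapping in Python. Ranks are given in order of appearance in the last column (in the spirit of the B-ranking above), so the i-th occurrence of a character in the Last column is the same character as its i-th occurrence in the First column. The function name is an assumption.

```python
from collections import Counter

def reverse_bwt_lf(last):
    """Reconstruct the original text from its BWT (last column) using LF mapping."""
    # Rank every character of the last column by order of appearance.
    seen = Counter()
    ranks = []
    for c in last:
        ranks.append(seen[c])
        seen[c] += 1
    # The first column is sorted, so each character occupies one block of rows.
    first_row = {}
    row = 0
    for c in sorted(seen):
        first_row[c] = row
        row += seen[c]
    # Start in row 0 (the rotation starting with '$') and walk backwards through the text.
    row = 0
    out = []
    c = last[row]
    while c != "$":
        out.append(c)
        row = first_row[c] + ranks[row]  # LF mapping: jump to the same character in F
        c = last[row]
    return "".join(reversed(out))
```

For example, `reverse_bwt_lf("abba$aa")` returns `abaaba`. Only the last column and a few counts are needed, not the full matrix.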

SAM file
Short reads are mapped to reference genome. Alignment is created in SAM format.
8M = 8 aligned positions (match or mismatch), 2I = 2 inserted bases, 1D = 1 deleted base (relative to the reference)
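
A small Python sketch that parses such a CIGAR string. In the SAM format, M, I, S, = and X consume bases of the read, and M, D, N, = and X consume positions of the reference. The helper names are made up for illustration.

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string such as '8M2I1D' into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def read_length(cigar):
    """Number of read bases covered by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")

def reference_span(cigar):
    """Number of reference positions covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")
```

For `8M2I1D` the read is 10 bases long (8 + 2) and the alignment covers 9 reference positions (8 + 1).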

Database searching
Suppose you want to query a sequence in SWISSPROT database -> given size of database, it would take
way too long to find local alignments -> more efficient methods are needed

Heuristic approaches for database searching


- FASTA: not used anymore, uses heuristics to avoid calculating a dynamic programming matrix,
more sensitive
- BLAST: uses rapid word lookup, faster than FASTA and Smith-Waterman, less sensitive than
FASTA
FASTA
Initial problem
- Too many calculations are wasted if you compare regions that have nothing in common
- Solution: similar regions between two sequences probably share short stretches that are
identical
- Any pair of sequences that is likely to be related contains regions that are identical (heuristic method)
Basic method
- Only looking for similar regions near identical stretches
= only computing calculations where we have a chance of a high alignment score
FASTA-stages
1. Finding k-mers in the two sequences, find identical regions between the two sequences using
these k-mers
2. Score and select the top 10 diagonals
3. Rescan this top 10 and re-score
4. Applying a certain joining threshold: score under? -> eliminating the segment, unlikely part of
the alignment
5. Dynamic programming to optimize the alignment
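
Stages 1 and 2 can be sketched in Python as follows. Identical k-mers at position i in the query and j in the target lie on diagonal i - j, so a diagonal with many hits marks a similar region. This is a simplified illustration (it counts hits instead of using FASTA's real diagonal scoring), and the function names are assumptions.

```python
from collections import Counter, defaultdict

def kmer_index(seq, k):
    """Lookup table: k-mer -> list of start positions in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def diagonal_hits(query, target, k=2):
    """Count identical k-mers per diagonal (diagonal = query position - target position)."""
    index = kmer_index(target, k)
    diagonals = Counter()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            diagonals[i - j] += 1
    return diagonals

def top_diagonals(query, target, k=2, n=10):
    """The n diagonals with the most k-mer hits, best first."""
    return diagonal_hits(query, target, k).most_common(n)
```

For example, with query `ACGT` and target `TTACGT` all three 2-mers of the query hit on diagonal -2, so only that diagonal needs to be rescored and optimized.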
E-score
- P value x size of database = number of sequences expected to be found by chance when comparing query to database
- Typical cut off of 10^-5

Which database to use?


- Known protein: query protein sequence in protein database, because if protein is already
known then it will be present in the database
- Unknown protein: query protein sequence in translated DNA database, protein is unknown
so not known in the protein database yet
- Yeast 2 hybrid: measured DNA but we know it is a protein (that is known), looking in the
protein database
- Yeast 2 hybrid: measured DNA but know it’s a protein (but unknown!!!), looking in the DNA
database
- DNA-DNA and protein-protein: scanning library for similar sequences

BLAST – Basic Local Alignment Search Tool


Basic method
- Searches a lot of sequences for hits to a query sequence and returns the alignments (+
scores) for these hits
- Really fast
- Find similar regions/pairs by finding word pairs (with size w) with a score above threshold T
(<-> FASTA: searching in neighborhood where there are exact matches)
o Break query sequence into words, for each word/residue from query sequence,
compile a list of words (try all possibilities) that would give a score above T
= list of high scoring words with length w
typically 50 words per query residue
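
A toy Python sketch of compiling the neighborhood words. To stay self-contained it uses a DNA alphabet with match +1 and mismatch -1 instead of a protein matrix such as BLOSUM62, and it keeps words scoring at least T; all names and values here are assumptions for illustration.

```python
from itertools import product

def word_score(a, b, match=1, mismatch=-1):
    """Score two equally long words position by position (toy scoring)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))

def neighborhood_words(query, w=3, threshold=1, alphabet="ACGT"):
    """For every word of length w in the query, list all words scoring >= threshold."""
    neighborhoods = {}
    for i in range(len(query) - w + 1):
        word = query[i:i + w]
        neighborhoods[i] = [
            "".join(candidate)
            for candidate in product(alphabet, repeat=w)  # try all possibilities
            if word_score(word, candidate) >= threshold
        ]
    return neighborhoods
```

With w = 3 and T = 1 each query word gets 10 neighborhood words (the exact word plus 9 single-mismatch words). Raising T to 3 leaves only the exact word, which shows how T trades sensitivity for speed.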
Blast algorithm
1. Compile list of high scoring words (see above) = neighborhood words
2. Compare this list of words to database and identify exact matches
3. For each hit ( = two non-overlapping pairs within distance A, need 2 words of list to match to
have a hit!!) extend the alignment in both directions to find new alignments (score above
threshold S), stop extending when score under S
a. For low S: tying all the neighborhood matches together = global alignment
b. For high S: only alignments with a high score will be extended = local alignment
Blast parameters: EXAM
- Lowering T : increased noise BUT more sequences that are more distant to query sequences
can be found
- Value of W: small -> many matches, big -> many words generated, compromise!!
- Lowering S : longer extensions for each hit
- Changing E-value: changes threshold for reporting if something is a hit
Examples:
High T, low w = very specific, not sensitive (finds true matches but does not find all the
matches)
Low T, big w = very sensitive, not specific (finds all the matches, but not necessarily true
matches)
