Summary
Basic terms
Mapping = determining the position of the reads on a reference genome, where does each read
come from in the genome?
Alignment = arranging sequences to identify regions of similarity/comparing two or more sequences,
to visualize relationships between sequences etc.
Lecture 1: Introduction
Evolution of data
- Protein Atlas: first database collecting all known protein sequences
- Alignment algorithms for global alignment (11 years later: local)
- Eventually, too much data: heuristic database search tools such as BLAST are used
Genome sequencing (Nature vs Science)
- Clone-by-clone: taking a large set of BAC clones that together make up the entire genome, sending
these clones to labs all over the world (= divide and conquer) to sequence, then mapping them on
the human genome (Nature, 25 years)
- Whole genome shotgun sequencing: cutting the genome many times at random, performing
paired-end sequencing on these fragments, using a database for alignment (Science)
o Because of the database, this was possible in 9 months (versus 25 years otherwise)
Genome size
- Humans: 3 Gb
o Difference in genome with another human is only 0.1% = 3 Mb (0.001 x 3 Gb)
o Essentially only need to sequence these 3 Mb to characterize a human
- Human, mouse and rat: very similar, little diversity because of survival of the fittest
Sequencing versus re-sequencing
- Re-sequencing: to find certain differences (eg SNPs), Illumina often used (max 100 bp)
Next generation sequencing
- DNA library preparation: genome fragmentation, addition of adapters to DNA fragments
- Amplification of DNA library with PCR
- Sequencing of amplified library: eg sequencing by synthesis
Importance of paired-end reads
- Single read: can map to multiple positions
- Paired-end: beginning and end of the fragment are known + the distance between them is known
-> can find the exact location that they map to, because both the end and the beginning need
to match
Epigenomics
- Changes in histones, methylation, differentiation within the genome can make DNA
non-expressible = "weak current" (Dutch: zwakstroom): when are parts of the DNA active and when not?
- Ribo-seq: sequencing the RNA that was actively bound to ribosomes and was being translated,
to find out which proteins are active
Third generation sequencing
- Big difference with previous generations: no reactions in solution! Polymerase is
immobilized. Single molecule (often an enzyme), no amplification needed because we look at
the polymerase with special optics
- SMRT sequencing by Pacific Biosciences: uses zero-mode waveguides (wells that are very small
compared to the wavelength of light) to observe a single immobilized polymerase
- Nanopore sequencing (Oxford Nanopore): senses the change in ionic conductance caused by nucleic
acids passing through a nanometer-size pore
Lecture 2: Where do we find data? How to access? How to store?
Molecular Biology
New version of full genome
- T2T (telomere to telomere): repeats are in lower case but rest is in upper case <-> difference
with normal sequencing: everything in lower case
Structure of DNA
- Only 1 strand is stored in the database: 5’ to 3’, the reverse complement is never stored
o Consequences: if a gene lies on the antisense strand, you will not find it by searching the
stored strand directly; its ATG is present on the antisense strand and not on the sense strand
o About half of the genes are actually on the antisense strand
- RNA is never sequenced but cDNA is! Reverse transcriptase is used to make cDNA (never uracil,
only thymine found in databases)
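To make the stored-strand point concrete, here is a minimal Python sketch. The helper name `reverse_complement` and the 10-nt example sequence are made up for illustration; they are not from the lecture.

```python
# Translation table for complementary bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Return the antisense strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

stored = "GGCATTTACC"                    # made-up strand as stored in the database
antisense = reverse_complement(stored)   # what the database does not store

# The start codon ATG is only visible on the antisense strand;
# on the stored strand it shows up as its reverse complement, CAT.
print("ATG" in stored, "ATG" in antisense, "CAT" in stored)
```

So a search for genes has to scan both the stored sequence and its reverse complement.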
Introns and exons
- Caused by splicing: two types of splicing (the intron almost always runs from GT to AG)
- Cis splicing: splicing within the same molecule/same strand
- Trans splicing: exons from different RNA molecules (eg sense and antisense both transcribed)
come together as the RNA that is being translated by the ribosomes
Proteins
- Often difficult to map back to the genome because we don’t know where the protein came from
- Reasons:
1. A lot of modifications at protein level
2. A lot of things happen that we don’t understand yet eg PROTEIN SPLICING: certain parts
of protein seq taken out or recombined
Flat files ‘sequence’ databases
Flat files
- A text file
- RefSeq, Genbank format
Genbank format
Nucleotide databases (same information but a different format, redundant)
- EMBL Nucleotide Sequence Database
- GenBank
- DDBJ
- Important: be careful when working with GenBank, everything is deposited without checking (no
quality check) -> different usage of ontology/definition of gene or exon
- Data retrieved from these databases: annotated text <-> FASTA: unannotated text
- These databases contain strings = letters, but this is not what we measure!
o Trace files = what we measure
o GenBank = basecalling, done with a certain level of confidence (Q value)
o Q value = 40 <-> probability of an incorrect base call is 1:10000
o Major issue: HETEROZYGOUS genes, GenBank does not take the alternative form of genes
into account, the trace file does
EMBL format
Phred score
- Quality value: low Q value = lower base calling accuracy
- Q = -10*log_10(P), with P = probability that the base call is incorrect
- Predicts peak location and reads actual peak locations from base trace
- Matches actual location with predicted location
- Why used?
o Trimming (end of sequence often of poor quality): because difficult to resolve larger
fragments
o Weak or variable signal strength to a base
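The relation Q = -10*log_10(P) and the trimming idea can be sketched in Python as follows. The function names and the threshold of 20 are assumptions for illustration only; real trimming tools use more refined rules.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def trim_low_quality_tail(quals, threshold=20):
    """Return how many bases to keep after dropping the low-quality 3' tail.

    Simplistic sketch: strip trailing bases while their Q is below threshold.
    """
    end = len(quals)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return end

print(phred_to_error_prob(40))                          # Q40 corresponds to 1 error in 10000
print(trim_low_quality_tail([40, 38, 35, 30, 12, 8]))   # keeps the first 4 bases
```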
Phrap score
- Uses Phred scores to determine consensus sequences
- Uses the highest scoring sequences (of a series of individual sequences) to extend the
consensus sequence
RefSeq (reference database, not redundant)
- Includes chromosomes, mRNA and proteins
Open reading frames (ORF)
- Every double-stranded DNA sequence has 6 reading frames: 3 on sense and 3 on antisense
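As an illustration of the six reading frames, the sketch below splits a sequence into codons for the three offsets on each strand. The function name `six_frames` and the frame labels (+1 .. -3) are my own choices, and the 9-nt input is a made-up example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def six_frames(seq: str) -> dict:
    """Return the codons of all six reading frames, keyed '+1'..'+3' and '-1'..'-3'."""
    antisense = seq.translate(COMPLEMENT)[::-1]   # reverse complement, 5' to 3'
    frames = {}
    for sign, strand in (("+", seq), ("-", antisense)):
        for offset in range(3):
            # Step through the strand in triplets; incomplete trailing codons are dropped.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames

frames = six_frames("ATGAAATAG")
for name, codons in frames.items():
    print(name, codons)
```

Frame +1 reads ATG AAA TAG (start codon to stop codon), while the other five frames read the same nucleotides completely differently.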
Relational databases
Databases that need a higher level of annotation and relational information, used to answer more
complex questions
Because of the explosive growth in biological data it became difficult to use flat files ->
relational databases to the rescue
Relational databases contain a set of tables into which a data model is poured; the tables are
linked with each other, and to query them you need a language (SQL) and a program (RDBMS)
Benefits: minimal redundancy, data sharing, data independence
Disadvantages: size, complexity
RDBMS
- Program used to manage these databases
- SQL often used as language to jump into the data and search the data, can do SELECT, FROM,
WHERE, JOIN, etc
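A small sketch of what such a query looks like, using Python's built-in `sqlite3` module as the RDBMS. The two tables and their contents are a toy example invented for illustration; they are not the Ensembl/BioMart schema.

```python
import sqlite3

# In-memory database with two linked tables (toy schema, not a real genome database).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, chromosome TEXT);
CREATE TABLE go_annotation (gene_id INTEGER REFERENCES gene(gene_id), go_term TEXT);
INSERT INTO gene VALUES (1, 'TP53', '17'), (2, 'BRCA1', '17'), (3, 'MYC', '8');
INSERT INTO go_annotation VALUES (1, 'DNA repair'), (1, 'cell death'), (2, 'DNA repair');
""")

# SELECT ... FROM ... JOIN ... WHERE: which genes are annotated with 'DNA repair'?
rows = conn.execute("""
    SELECT g.symbol
    FROM gene AS g
    JOIN go_annotation AS a ON a.gene_id = g.gene_id
    WHERE a.go_term = 'DNA repair'
    ORDER BY g.symbol
""").fetchall()
symbols = [row[0] for row in rows]
print(symbols)
```

Each fact is stored once (minimal redundancy), and the JOIN recombines the tables at query time.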
Ensembl/BioMart (eg searching complete human genome, genome browser)
- Data model that stores the information from flat files such as GenBank
- Tables have a limited number of fields (columns) but an unlimited number of records (rows), so
they are narrow and long
- You can query the database/ask questions with the SQL language
o Filters: gene ontology = linking genes with cell process such as cell death or DNA
repair
Lecture 3
Recap
Ensembl = genome browser
- Started after full genome was sequenced in 2002: genome of only 4 organisms was then
available
- Years later: more and more organisms added to the database
- Currently used to answer very specific genomic questions
Definitions
Identity: simple to define, extremely powerful!
- The extent to which two (nt or aa) sequences are invariant/uniform
- Identity of 30%: structure/function are likely similar, even though the sequences are far from identical
Homology
- Concept of similarity, the similarity is attributed to descent from a common ancestor
- Importance: immunotherapy! trials in humanized mice: transfection with human immune
system by replacing its genes by human orthologous genes
- Some genes are really conserved throughout species = all from same ancestor
Orthologous
- Homologous sequences in DIFFERENT species, from a common ancestor during speciation
(no duplication)
- Consequence: orthologous sequences could take over each other's function, they can more or
less replace each other
Paralogous
- Homologous sequences from a SINGLE species, by gene duplication
- The extra gene that was made because of duplication, is a gene that nature can play with and
that can change
Sequence alignment
- Two biological sequences are sufficiently similar (30% identical) -> same function, same
structure
o Sequence determines the function + many positions/AA may be changed without
change in function (redundancy)
- Some counterexamples: tumor genes, where 1 AA difference can have an enormous impact on
function
Scoring Matrices
Theoretical
- Aging = linked with changes in immune system, loss of information in the genome = loss of
DNA that is responsible for repair
- Scoring matrix = tool to quantify how well a model was able to align two sequences -> the
result from this matrix is only meaningful in the context of that model
- Choice of matrix influences the outcome
Nucleic acid scoring matrices
- Alignment of DNA sequences
- Transition/Transversion matrix
o Purines (A & G) versus pyrimidines (T & C)
o Mutation that changes the ring number: transversions (A <-> C or T, less conserved)
o Mutation that keeps the ring number the same: transitions (A <-> G, more conserved)
o Transitions happen way more often in nature than transversions (even though there are more
possibilities for transversions!)
o Reduces the noise in comparing distantly related sequences (so very different
sequences!)
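A short sketch of how a transition/transversion matrix could be built in Python. The score values (+1 match, -1 transition, -2 transversion) are hypothetical placeholders chosen only to show that transitions are penalised less; the lecture does not give actual values.

```python
PURINES = {"A", "G"}       # two rings
PYRIMIDINES = {"C", "T"}   # one ring

def classify_substitution(a: str, b: str) -> str:
    """Classify a pair of bases as match, transition or transversion."""
    if a == b:
        return "match"
    same_ring_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_ring_class else "transversion"

def ts_tv_matrix(match=1, transition=-1, transversion=-2):
    """Build a 4x4 scoring matrix as a dict keyed by (base, base). Scores are assumptions."""
    score = {"match": match, "transition": transition, "transversion": transversion}
    return {(a, b): score[classify_substitution(a, b)] for a in "ACGT" for b in "ACGT"}

matrix = ts_tv_matrix()
kinds = [classify_substitution(a, b) for a in "ACGT" for b in "ACGT" if a != b]
print(kinds.count("transition"), kinds.count("transversion"))  # 4 versus 8 possible changes
```

The count at the end shows the point made above: there are twice as many possible transversions as transitions, yet transitions are observed more often.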
5. Evolutionary distance: scaling all values of the matrix by a constant factor, we choose the
scaling so that the overall number of changes = 1% = distance for 1 PAM (1/100 AA will change)
6. Relatedness odds: probability that an aa replaces another in a random sequence = frequency of
occurrence of that aa: P(random, i) = frequency(i)
7. Log-odds matrix: multiplication is often difficult, so taking the log to use addition = easier;
a value of +1 = the corresponding pair has been observed 10 times more often than expected by
chance: log10(10) = 1, 10^1 = 10
- Some remarks: chance that the effect of the mutation is hidden + some aa may mutate more
than once (accepted point mutation much bigger than evolutionary distance!)
- Low PAM: closely related sequences, high identity, low score for mutations, low distance
o Finds alignments of highly similar sequences
- High PAM: distant sequences, low identity, high score for mutations, high distance
o Finds alignments of very different sequences (because it allows more aa substitutions)
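The log-odds step can be illustrated with a few lines of Python. The frequencies below are invented numbers, chosen so that the pair is observed 10 times more often than expected by chance; the function name `log_odds` is also an assumption.

```python
import math

def log_odds(observed_pair_freq: float, freq_i: float, freq_j: float) -> float:
    """log10 of (observed frequency of the pair) / (frequency expected by chance)."""
    expected_by_chance = freq_i * freq_j
    return math.log10(observed_pair_freq / expected_by_chance)

# Made-up frequencies: expected by chance 0.1 * 0.05 = 0.005, observed 0.05.
score = log_odds(0.05, 0.1, 0.05)
print(round(score, 6))   # 1.0, ie observed 10 times more than expected

# Why logs: the odds of independent columns multiply, their log-odds simply add.
odds_a, odds_b = 10.0, 0.5
assert math.isclose(math.log10(odds_a * odds_b), math.log10(odds_a) + math.log10(odds_b))
```

A negative score means the pair is seen less often than chance would predict.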
In general
When looking for very small differences in similar sequences: low PAM, high BLOSUM (high identity,
low substitution score)
When there are a lot of differences between the sequences: high PAM, low BLOSUM (low identity,
high substitution score)
No single matrix is the complete answer! Use both to complement each other (high BLOSUM with
high PAM for example)
Pairwise alignment
Dot-plots
Graphical representation using two axes and dots for regions of similarity
Best way to see the structures that two sequences have in common, eg repeated structures or
conserved domains
- A dot is plotted when the identity within a sliding window is higher than a certain threshold
- The sliding window slides over the sequences; its size:
o Depends on goal of the analysis (size of exon, size of gene promoter etc)
o About 20 for distant proteins
o About 12 for nucleic acids
Diagonal run of dots = similarity regions
Vertical or horizontal skip = a gap (no similarity found, eg when seq 1 = RNA and seq 2 = DNA, some
regions spliced out)
Rules of thumb: check seq versus itself, check seq versus another seq, use window size appropriate
for type of analysis
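The sliding-window rule can be sketched in Python as below. The function names, the tiny window of 4 and the example sequence are assumptions for illustration; real analyses would use the larger window sizes mentioned above.

```python
def dot_plot(seq1: str, seq2: str, window: int = 4, threshold: int = 4) -> set:
    """Return the (i, j) window start positions where at least `threshold`
    of the `window` aligned characters are identical."""
    dots = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            identical = sum(a == b for a, b in zip(seq1[i:i + window], seq2[j:j + window]))
            if identical >= threshold:
                dots.add((i, j))
    return dots

def render(dots: set, rows: int, cols: int) -> str:
    """Draw the plot as text: '*' for a dot, '.' for no similarity."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(cols)) for i in range(rows)
    )

# Rule of thumb from the notes: first check a sequence against itself.
seq = "ACGTACGT"
dots = dot_plot(seq, seq)
print(render(dots, len(seq) - 3, len(seq) - 3))
```

The main diagonal is the sequence matching itself; the two off-diagonal dots reveal the repeated ACGT.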
Lecture 4
Recap
No single score matrix is the complete answer. Best to use multiple matrices that complement each
other
As more 3D structures of proteins are determined, score matrices derived from structure comparison
will give the most reliable data.
Introduction
Global versus local alignment
- Global: sequences are completely aligned with each other (entire sequence)
o Needleman-Wunsch
- Local: only subregions of the sequences are aligned with each other
o Smith-Waterman
Multiple alignment
- To characterize protein families: shared regions of homology need to be identified -> can be
done by multiple sequence alignment
- To find conserved regions/To determine the consensus sequence of several sequences
aligned with each other
o Conserved regions are often key functional regions and prime target for drug
development
- To help the protein structure prediction
- To do phylogenetic analysis
o Comparing sequences of different genomes/species eg comparing same protein from
different species
Algorithms
What?
Algorithm
- A method or a process followed to solve a problem (eg a recipe)
- Takes the input of a problem and transforms it into the output
- 1 problem can need many algorithms to solve it
- The output doesn’t change even when input order has been changed (<-> heuristic methods)
o Heuristic methods = approach to problem-solving that uses practical/experience-based
strategies and does not follow a theoretical path. They prioritize speed and
efficiency over precision
- Example: algorithm to find the greatest common divisor
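One well-known algorithm for the greatest common divisor is Euclid's algorithm; the notes do not say which variant was shown in the lecture, so the Python sketch below is an assumption on that point.

```python
def gcd(a: int, b: int) -> int:
    """Euclid's algorithm for non-negative integers.

    Repeatedly replace (a, b) by (b, a mod b); when b reaches 0, a is the GCD.
    """
    while b != 0:
        a, b = b, a % b
    return a

print(gcd(48, 18))   # 6
print(gcd(18, 48))   # 6, the order of the input does not change the output
```

It has the properties listed for algorithms: concrete steps, guaranteed termination (b strictly decreases), and the same output regardless of input order.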
Pseudocode
- A programming code that gives the procedure but is not written in a specific programming
language
Examples
Bubble Sort Algorithm
- List of numbers, comparing 2 neighboring numbers, swap them according to desired order
(increasing or decreasing), loop until no elements need to be exchanged
o List of 20 numbers: may need to loop 19 times -> not the most efficient!
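A minimal Python sketch of bubble sort as described above. Returning the number of passes that performed at least one swap is my own addition, included to check the "19 loops for 20 numbers" worst case.

```python
def bubble_sort(values, descending=False):
    """Sort a copy of `values`; return (sorted list, number of passes that swapped)."""
    items = list(values)
    swap_passes = 0
    while True:
        swapped = False
        for i in range(len(items) - 1):
            # Compare two neighbours and swap them if they are in the wrong order.
            if descending:
                out_of_order = items[i] < items[i + 1]
            else:
                out_of_order = items[i] > items[i + 1]
            if out_of_order:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:          # loop until no elements need to be exchanged
            return items, swap_passes
        swap_passes += 1

worst_case = list(range(20, 0, -1))      # 20 numbers in exactly the wrong order
sorted_items, passes = bubble_sort(worst_case)
print(passes)                            # 19
```

Each pass touches up to n-1 neighbouring pairs and up to n-1 passes are needed, which is where the O(n²) behaviour comes from.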
Properties
- It must be correct
- It must be composed of concrete and finite steps -> it must terminate
- No ambiguity (Dutch: ‘meerduidigheid’; eg different outputs because of a different order of input)
Efficiency of algorithms
- Types of complexity
o Space: total operations dependent on n, if n increases = more space needed for
algorithm (polynomial algorithm, eg n² + n + 1: when n doubles, the number of operations
roughly quadruples)
o Time: when space increases, often time also increases (Big O notation: O(n²))
Concept
Best alignment: the one with the maximal score
Alignment score model
- Each column gets value for match (+1) and mismatch (-1) and gaps/indels (-2)
- Total score of alignment = sum of values assigned to the columns
- Gap score: difference between creating a gap or extending a gap!
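The column-wise score model can be written down directly. This Python sketch uses the values from the notes (+1, -1, -2) and a single linear gap value, so it does not yet distinguish gap creation from gap extension; the function name and example alignment are made up.

```python
def score_alignment(row1: str, row2: str, match=1, mismatch=-1, gap=-2) -> int:
    """Sum the column scores of a finished alignment ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap          # indel column
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

# Columns: A/A +1, C/C +1, G/C -1, T/T +1, -/G -2, A/A +1
print(score_alignment("ACGT-A", "ACCTGA"))   # 1
```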
Algorithm used for alignment = dynamic programming
- Breaks the one big problem down into local sub problems
- Simplifies the problem
Dynamic Programming
1. Matches: give value +1
2. Give last row and last column value 0 (except if there is already value 1)
Look for the maximum value to the right in the column underneath: add it to all values in the
column above
Maximum value = 5
End result: problem = K can be aligned with two K (same score); choosing the first K doesn’t create
a gap, but choosing the second K creates a gap
Needleman-Wunsch
Global alignment
Different alignment direction!
Expanding gap by adding -1 to -9
Calculating a score matrix value coming from three different directions (a, b and c) -> choose the
max value!
This method allows for very fast traceback.
Scores/Values
- Using gap penalties
o Gap > 1: constant gap penalty
o Creating a gap
o Extension of a gap
- Using BLOSUM62: instead of match (+1) and mismatches (-1)
Uses
- Time complexity: for a sequence of 10000 aa it already takes 11 minutes
For 10000 aa the memory overflows (10000² = 10^8 cells), so aligning 4 seq of 100 aa
(100^4 = 10^8 cells) is not possible either!
- Generating histogram from aligned random sequences: creates an extreme value distribution
Gives an alignment score to randomly generated sequences
Best alignment path
- On the diagonal of the matrix -> may not need to fill the entire matrix! Only fill a band around
the main diagonal
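A compact Python sketch of Needleman-Wunsch global alignment. It assumes the simple +1/-1 match/mismatch scores and a constant gap penalty of -2 per gap position (no separate creation/extension cost, no BLOSUM62), and it fills the whole matrix rather than a band; the names are my own.

```python
def needleman_wunsch(s: str, t: str, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # first column: s aligned against gaps only
        score[i][0] = i * gap
    for j in range(1, m + 1):            # first row: t aligned against gaps only
        score[0][j] = j * gap

    def sub(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Three directions: diagonal (match/mismatch), up (gap in t), left (gap in s).
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback from the bottom-right corner to the top-left corner.
    row_s, row_t, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))

best, top, bottom = needleman_wunsch("ACGT", "AGT")
print(best)
print(top)
print(bottom)
```

The matrix has (n+1) x (m+1) cells, which is the quadratic time and memory cost mentioned above.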
Smith-Waterman
Local alignment = alignment between parts of the two sequences
2 proteins can have a region of high similarity, but can be very different outside that region -> if
we used a global alignment for this, we would have lots of mismatches and gaps outside the region
of similarity -> local alignment is the solution
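A Python sketch of Smith-Waterman local alignment under the same assumed scoring (+1 match, -1 mismatch, -2 per gap position). The two differences from global alignment are that cell values never drop below 0 and that the traceback starts at the highest cell and stops at the first 0. The flanking letters X and Y in the example are placeholders that never match anything.

```python
def smith_waterman(s: str, t: str, match=1, mismatch=-1, gap=-2):
    """Local alignment; returns (score, aligned part of s, aligned part of t)."""
    n, m = len(s), len(t)
    h = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Same three directions as global alignment, plus 0 = "start afresh here".
            h[i][j] = max(0,
                          h[i - 1][j - 1] + sub(i, j),
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Traceback from the maximum cell until a cell with value 0 is reached.
    row_s, row_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        if h[i][j] == h[i - 1][j - 1] + sub(i, j):
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))

result = smith_waterman("XXXXGATTACAXXXX", "YYGATTACAYYY")
print(result)
```

Only the shared region is reported; the dissimilar flanks cost nothing, unlike in a global alignment.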
Multiple Alignment
The best alignment = maximum total score
Hyperlattice
- K-dimensional hyperlattice: running time proportional to number of nodes!
o Memory space is no. 1 problem if we want to trace back the alignment: entire lattice
needs to be stored
ClustalW: heuristic method to do multiple alignment
MSA
Most practical multiple sequence alignment method is the extension of pairwise alignment methods!
Method
- First do all pairwise alignments (every sequence with every other sequence, not just 1
sequence with all sequences)
- Combine these alignments to generate an overall alignment
o Perform cluster analysis to generate a hierarchy (binary guide tree)
o First align the most similar pair of sequences (eg seq 2 & 4), then the next most
similar (seq 1 & seq 3) etc.
Introduce gaps to optimize the alignment
o Align the alignment of 2&4 with alignment of 1&3 – preserve gaps
ClustalW
- Alignment program for DNA and proteins
- msf file = multiple sequence format file = text file with a set of aligned sequences
Good alignment?
- Roughly 1/3 of AA are the same (30% identity)
Protein structure prediction
- MSA can be used for prediction of protein structure
o Eg conserved C = cysteine (disulfide) bridges, conserved glycine or proline = beta-turn,
position of indels = surface loops
o Alternating hydrophobic residues (i, i+2, i+4) = surface beta-strands
Alternative methods
SIM
See slides
BLAST ( NCBI )
- To compare two sequences
- To compare sequence to a database
DALI
- Knows secondary structure
- Changes penalty score depending on if it changes the secondary structure or not
Burrows-Wheeler Transform
Why?
- Very compact, as large as the original text: compression is useful because storage is space!
o Gain speed, search faster: look-up time ~ size of index/original text
o Linear-time search algorithm: search time is proportional to the length of the query
[O(m + N)]
- Reversible
- Index
Sometimes characters are dependent on each other, they always occur together = character
dependency
T-ranking/ LF mapping
Keeping track of the nucleotides could reveal interesting patterns: in picture you can see that order of
the characters is the same in the first and in the last column
However we rank the occurrences of the nucleotides, the ranks will appear in the same order in the
First column and in the Last column.
- Why does the LF mapping hold?
o Sorted by right-context
B-ranking
Giving arbitrary numbers (for example with ascending ranks), not like they occur in the original
sequence.
Reversing
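The transform and its reversal via the LF mapping can be sketched in Python as follows. This is a naive version that sorts all rotations, which is fine for a toy string but not how real indexes are built; the function names and the `$` terminator (assumed to sort before every other character) are my own choices.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Burrows-Wheeler transform: last column of the sorted rotations of text + terminator."""
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last: str, terminator: str = "$") -> str:
    """Rebuild the original text from the last column using the LF mapping."""
    # Rank every character of the Last column by order of occurrence (0, 1, 2, ...).
    seen, ranks = {}, []
    for ch in last:
        ranks.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # Row where each character's block starts in the (sorted) First column.
    first_start, total = {}, 0
    for ch in sorted(seen):
        first_start[ch] = total
        total += seen[ch]
    # Row 0 starts with the terminator; its Last character is the final character
    # of the text. LF mapping: the k-th 'c' in Last is the k-th 'c' in First.
    row, out = 0, []
    while last[row] != terminator:
        ch = last[row]
        out.append(ch)
        row = first_start[ch] + ranks[row]
    return "".join(reversed(out))

transformed = bwt("abaaba")
print(transformed)                 # abba$aa
print(inverse_bwt(transformed))    # abaaba
```

The reversal walks the text backwards, one LF step per character, which is the same walk a backward search for a query would use.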
SAM file
Short reads are mapped to the reference genome. The alignment is created in SAM format.
8M = 8 aligned positions (match or mismatch), 2I = 2 bases inserted in the read, 1D = 1 base
deleted from the read
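The codes above come from the CIGAR string of a SAM record, which can be decoded with a few lines of Python. In SAM, M means an aligned position (match or mismatch), I an insertion relative to the reference and D a deletion from the reference. The helper names below are assumptions, and the CIGAR string is a made-up example.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar: str):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]

def reference_span(cigar: str) -> int:
    """Number of reference bases covered: M, D, N, = and X consume the reference."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")

def read_length(cigar: str) -> int:
    """Number of read bases described: M, I, S, = and X consume the read."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")

cigar = "8M2I4M1D3M"
print(parse_cigar(cigar))
print(reference_span(cigar), read_length(cigar))   # 16 reference bases, 17 read bases
```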
Database searching
Suppose you want to query a sequence in the SWISSPROT database: given the size of the database, it
would take way too long to find local alignments -> more efficient methods are needed