DNA Seq Analysis

DNA Sequencing Analysis
DNA Sequencing Workflow

Introduction to Mapping

Evaluating your raw sequencing data
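• To find the total number of reads in a FASTQ file
• Using Linux commands
 wc -l file_name.fastq
   (each read takes four lines, so divide the line count by 4)
 grep --count "^@HWI" file_name.fastq
   (use a pattern that matches the start of your read names; a bare "^@" can also match quality lines)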
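Each FASTQ record occupies four lines, so a read count is the line count divided by four. A minimal Python sketch of the same idea, assuming an uncompressed file with unwrapped four-line records (the function name count_fastq_reads is made up for illustration):

```python
def count_fastq_reads(path):
    """Return the number of reads in an uncompressed FASTQ file.

    Assumes the usual four-lines-per-record layout (no wrapped sequences).
    """
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

A line count that is not a multiple of four usually points to a truncated or corrupted file, so the sketch raises an error rather than rounding.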

Quality Control & Assurance
• FastQC is a tool that produces a quality analysis report on FASTQ files.
• FastQC report for a good Illumina dataset
• FastQC report for a bad Illumina dataset
• Online documentation for each FastQC report

• The FastQC report gives useful information:
• The Per base sequence quality report, which can help you decide whether sequence trimming is needed
before alignment.
• The Sequence Duplication Levels report, which helps you evaluate library enrichment / complexity.
• The Overrepresented Sequences report, which helps you evaluate adapter contamination.

Running FastQC
• FastQC creates a sub-directory for each analyzed FASTQ file
• FastQC can be run with the following command:
• fastqc read_1.fastq read_2.fastq

• Results are written as an HTML report that can be viewed in a web browser
• Based on this FastQC output, should we trim this data?
• The Per base sequence quality report does not look good. The data should probably be
trimmed (to 40 or 50 bp) before alignment.
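Trimming to a fixed length means cutting the sequence and its quality string at the same position. A minimal sketch, assuming four-line FASTQ records; the helper names crop_record and crop_fastq are hypothetical, and dedicated trimming tools offer quality-aware trimming that this does not attempt:

```python
def crop_record(record, max_len=50):
    """Crop one FASTQ record (name, seq, plus, qual) to at most max_len bases."""
    name, seq, plus, qual = record
    # Sequence and quality string must be cut at the same position.
    return name, seq[:max_len], plus, qual[:max_len]


def crop_fastq(in_path, out_path, max_len=50):
    """Write a copy of in_path with every read cropped to max_len bases."""
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            lines = [src.readline().rstrip("\n") for _ in range(4)]
            if not lines[0]:  # end of file
                break
            dst.write("\n".join(crop_record(lines, max_len)) + "\n")
```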
Adapter trimming
• RNA-seq and other library prep methods that produce very short fragments
can cause problems with moderately long (50-100 bp) reads: the 3' end of
the read can run through into the 3' adapter at a variable position
• Trimmomatic
• Trimmomatic is a fast, multithreaded command line tool that can be used to
trim and crop Illumina (FASTQ) data as well as to remove adapters
• Cutadapt
• The cutadapt program is an excellent tool for removing adapter contamination
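To make the read-through problem concrete, here is a simplified sketch of 3' adapter removal. It is not how Trimmomatic or Cutadapt are implemented: it uses exact matching only, whereas real tools tolerate sequencing errors. The function name, the min_overlap default, and the adapter used in the usage note are illustrative assumptions.

```python
def trim_3prime_adapter(seq, qual, adapter, min_overlap=3):
    """Remove a 3' adapter (or a prefix of it at the read end) from a read.

    Exact matching only; real tools also tolerate sequencing errors.
    """
    # Case 1: the full adapter occurs somewhere inside the read.
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos], qual[:pos]
    # Case 2: the read ends partway through the adapter.
    longest = min(len(adapter) - 1, len(seq))
    for overlap in range(longest, min_overlap - 1, -1):
        if seq.endswith(adapter[:overlap]):
            cut = len(seq) - overlap
            return seq[:cut], qual[:cut]
    # No adapter found: return the read unchanged.
    return seq, qual
```

With a hypothetical adapter AGATCGGAAGAGC, the read ACGTACGTACAGATC would be cut back to ACGTACGTAC, because its last five bases match the start of the adapter.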
Basic steps of mapping reads
1. Read file quality control
2. Build reference sequence index
3. Map DNA sequencing reads

– Exact tool/approach depends on sequencing technology and DNA fragment library
type
4. Convert result to SAM/BAM database
5. Application specific analysis.

– These steps are common to any reference-based (as opposed to de novo) data analysis.
– We will look at variant calling first.
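As one possible concrete instance of steps 1 to 4, the sketch below builds (but does not run) shell commands for FastQC, BWA and SAMtools. The choice of BWA-MEM, the output file names, and the function name mapping_commands are assumptions for illustration; the right tools depend on your sequencing technology and library type.

```python
import shlex


def mapping_commands(reference, reads_1, reads_2, prefix):
    """Return shell command strings for steps 1-4, using FastQC, BWA, SAMtools.

    The commands are only built here, not executed.
    """
    ref, r1, r2 = (shlex.quote(p) for p in (reference, reads_1, reads_2))
    sam = shlex.quote(prefix + ".sam")
    bam = shlex.quote(prefix + ".sorted.bam")
    return [
        f"fastqc {r1} {r2}",                   # 1. read file quality control
        f"bwa index {ref}",                    # 2. build reference index
        f"bwa mem {ref} {r1} {r2} > {sam}",    # 3. map paired-end reads
        f"samtools sort -o {bam} {sam}",       # 4. convert to sorted BAM
        f"samtools index {bam}",               #    index the BAM file
    ]
```

Returning strings instead of executing them keeps the sketch safe to run anywhere and makes the order of the steps easy to inspect.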

Read sequences
FASTQ Format
@HWI-EAS216_91209:1:2:454:192#0/1
GTTGATGAATTTCTCCAGCGCGAATTTGTGGGCT
+HWI-EAS216_91209:1:2:454:192#0/1
B@BBBBBB@BBBBAAAA>@AABA?BBBAAB??>A?
machine_id:lane:flowcell_grid_coordinates end_number:failed_qc:0:barcode
Line 1: @read name

Line 2: called base sequence
Line 3: +read name (optional after +)
Line 4: base quality scores
Deciphering base quality scores
Quality character  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
                   |         |         |         |         |
ASCII value        33        43        53        63        73
Base quality (Q)   0         10        20        30        40
Probability of Error = $10^{-Q/10}$

(This is a Phred score, also used for other types of qualities.)
* Very low quality scores can mean something special:
Illumina Q ≤ 3 means something like: "I'm lost, you might
want to stop believing sequencing cycles from here on out."
* In older FASTQ files, the formula and ASCII offset might differ.
Consult: http://en.wikipedia.org/wiki/FASTQ_format
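The decoding described above (ASCII value minus an offset gives Q, and Q gives an error probability) can be sketched in a few lines of Python. The offset of 33 matches the table above; the function names are illustrative, and older files may need offset=64.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (Q).

    offset=33 matches the table above; some older Illumina files
    use offset=64.
    """
    return [ord(char) - offset for char in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)
```

For example, decode_qualities("!+5?I") gives [0, 10, 20, 30, 40], and a base with Q = 30 has an error probability of about 0.001.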
Example of Illumina data
• Most bases have high qualities (Q>30).

• Overall qualities are well calibrated*.
Read sequences
Garbage in = garbage out
• Contaminated with other sequences?
• Sample barcodes removed?
• Adaptor sequences trimmed?
– RNAseq
• Trim ends of reads with poor quality?
• Know your data
– Paired reads? Relative orientations?
– Technology specific concerns?
• Indels with 454
Types of Illumina fragment libraries
Reference considerations
Is it appropriate to your study?
Close enough to your species?
Complete?
Which version? http://microbialgenomics.energy.gov
Make sure you use an agreed-on standard
Does it contain repeats?
Know this up front or you will be confused
What annotations exist?
References lacking feature annotations are much more
challenging to use
Reference sequences
FASTA Format
>gi|254160123|ref|NC_012967.1| Escherichia coli B str. REL606
agcttttcattctgactgcaacgggcaatatgtctctgtgtggattaaaaaaagagtgtc
tgatagcagcttctgaactggttacctgccgtgagtaaattaaaattttattgacttagg
tcactaaatactttaaccaatataggcatagcgcacagacagataaaaattacagagtac
acaacatccatgaaacgcattagcaccaccattaccaccaccatcaccattaccacaggt
aacggtgcgggctgacgcgtacaggaaacacagaaaaaagcccgcacctgacagtgcggg
cttttttttcgaccaaaggtaacgaggtaacaaccatgcgagtgttgaagttcggcggta
....
Using complex reference sequence names is a
common problem during analysis. You might rename:
>REL606
agcttttcattctgactgcaacgggcaatatgtctctgtgtggattaaaaaaagagtgtc
tgatagcagcttctgaactggttacctgccgtgagtaaattaaaattttattgacttagg
....
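Renaming can be done by hand for a single sequence, or scripted. A minimal sketch, assuming a plain-text FASTA file; the function name rename_fasta_headers and the idea of keying on the first word of the old header are illustrative choices, not a standard tool:

```python
def rename_fasta_headers(in_path, out_path, new_names):
    """Copy a FASTA file, replacing each header with a short name.

    new_names maps the first word of the old header (without '>')
    to the replacement name; unlisted headers are kept unchanged.
    """
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                fields = line[1:].split()
                old_id = fields[0] if fields else ""
                old_header = line[1:].rstrip("\n")
                line = ">" + new_names.get(old_id, old_header) + "\n"
            dst.write(line)
```

For the example above, new_names would be {"gi|254160123|ref|NC_012967.1|": "REL606"}. Whatever short name you choose, use it consistently in every downstream file that refers to the reference.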
Finding a Reference
• Microbes (<10Mb): download a FASTA file containing the
sequence and/or GenBank/EMBL/GFF flatfiles
encapsulating both sequence and features.
• Macrobes (>100Mb): download specific consortium
"build" of reference (Ex: hg19), consisting of FASTA, and
various files used to construct a database of features.
• Non-model organisms: build your own?
See de novo assembly topic.
Mappers/Aligners
• Algorithms
– Spaced-seed indexing
– Burrows-Wheeler transform (BWT)
• Differences
– Input data (read length, colorspace aware/useful)
– Speed and scalability (multithreading, GPUs)
– Memory requirements (RAM, fat nodes)
– Sensitivity: esp. indels (gaps)
– Ease of installation and use. Development phase.
– Uses base qualities? Outputs mapping scores?
– Handling of multiple matches, paired-end matches
– Configurability and transparency of options
Hash table enables lookup of exact matches.
[Figure: hash table with columns "Subsequence" and "Reference Positions"; each row pairs a short subsequence of the reference with the list of positions where it occurs.]
Table is sorted and complete so you can jump immediately to matches.
(But this can take a lot of memory.)
May include N bases, skip positions, etc.
Trapnell, C. & Salzberg, S. L. How to map billions of short
reads onto genomes. Nature Biotech. 27, 455–457 (2009).
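The hash-table idea can be sketched with a Python dictionary that maps every length-k subsequence of the reference to its positions. This is a toy illustration under simplifying assumptions (contiguous seeds, exact matches, 0-based positions, whole reference held in memory), not the spaced-seed scheme of any particular mapper; the function names are made up.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every length-k subsequence of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_positions(read, index, k):
    """Candidate reference start positions for a read, from exact k-mer seeds.

    A seed hit at reference position p for the k-mer at read offset i
    suggests that the read starts at p - i.
    """
    candidates = set()
    for i in range(len(read) - k + 1):
        for p in index.get(read[i:i + k], []):
            candidates.add(p - i)
    return sorted(candidates)
```

A read whose seeds occur in two places yields two candidate positions. Seeds only propose locations; choosing among them is the job of the alignment step.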
From Mapped Read to Alignment
• Mapping determines a position where the read
shares a subsequence with the reference. But, is
this the best match?
• Alignment starts with the seed and determines
how the read is best aligned
on a base-by-base basis.
Seed→Alignment score→Mapping quality

Alignment Score
• Dynamic programming algorithm
(Smith-Waterman | Needleman-Wunsch)
• Alignment score = Σ
• match reward
• base mismatch penalty
• gap open penalty
• gap extension penalty
• rewards and penalties may be adjusted for quality scores
of bases involved
• Important: Local versus global alignment of read
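The sum above can be written out for an alignment that has already been found. The dynamic programming algorithms search for the alignment that maximizes this score; the sketch below only evaluates one. The reward and penalty values are arbitrary illustrative defaults, and the gap convention (the first gap position costs gap_open, later ones gap_extend) is one of several in use.

```python
def alignment_score(aligned_read, aligned_ref, match=1, mismatch=-3,
                    gap_open=-5, gap_extend=-2):
    """Score a gapped pairwise alignment ('-' marks a gap in either string).

    The first position of each gap run costs gap_open; each further
    position costs gap_extend.
    """
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned strings must have equal length")
    score = 0
    prev_gap = None  # which string the previous column had a gap in
    for a, b in zip(aligned_read, aligned_ref):
        if a == "-" or b == "-":
            gap_in = "read" if a == "-" else "ref"
            score += gap_extend if gap_in == prev_gap else gap_open
            prev_gap = gap_in
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score
```

With these defaults, a 12-base read with ten matches and two mismatches scores 10 - 6 = 4.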

Mapping Quality
Mapping quality: what is the probability that the read
is correctly mapped to this location in the reference
genome?
     Read 1                                    Read 2
     ATCGGGAGATCC  or   ATCGGGAGATCC           GCGTAGTCTGCC
     ||||||||||||       ||||||||||||           || |||| ||||
...TAATCGGGAGATCCGC...TTATCGGGAGATCCGC... ...TAGCCTAGTGTGCCGC...
Reference Sequence
High alignment score ≠ high mapping quality.
Phred score: P(mismapped) = $10^{-MQ/10}$
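The Phred relationship for mapping quality can be sketched as a pair of conversions. The cap of 60 is an assumption for illustration (mappers differ in their maximum reported value), and the function names are made up.

```python
import math


def mapq_to_error(mapq):
    """Probability that a read with this mapping quality is mismapped."""
    return 10 ** (-mapq / 10)


def error_to_mapq(p_mismapped, cap=60):
    """Phred-scale a mismapping probability, capped at a maximum value."""
    if p_mismapped <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_mismapped)))
```

A read that fits two locations equally well has a 50% chance of being placed wrongly, so error_to_mapq(0.5) gives 3: a perfect alignment score with a very low mapping quality.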
Paired-end mapping (PEM)
• There is an expected insert size distribution based
on the DNA fragment library.
• Mapping one read anchors its mate to a
specific location, even if the second read alone
maps to multiple places in the reference.
• Only one read in a pair might be mappable.
(singleton/orphan)
• Both reads can map with an unexpected insert size
or orientation (discordant pair)
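The pair categories above can be sketched as a small classifier. This is a deliberately simplified illustration: it ignores read orientation, assumes both reads lie on the same reference sequence, and uses an arbitrary insert size window of 200 to 500 bp; the function name and all defaults are hypothetical.

```python
def classify_pair(pos_1, pos_2, read_len, min_insert=200, max_insert=500):
    """Classify a read pair from the leftmost mapped position of each read.

    pos_1 / pos_2 are 0-based reference coordinates, or None if that read
    did not map. Orientation is ignored in this simplified version.
    """
    if pos_1 is None and pos_2 is None:
        return "unmapped"
    if pos_1 is None or pos_2 is None:
        return "singleton"
    # Insert size spans from the leftmost start to the rightmost read end.
    insert_size = abs(pos_2 - pos_1) + read_len
    if min_insert <= insert_size <= max_insert:
        return "concordant"
    return "discordant"
```

In practice the expected window is estimated from the observed insert size distribution of the library rather than fixed in advance.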
Split-read alignment (SRA)
• Useful for predicting splice variants or structural
variants.
• Not many mappers do this directly, usually happens
in a post-processing step.
Trapnell, C. & Salzberg, S. L. How to map billions of short
reads onto genomes. Nature Biotech. 27, 455–457 (2009).
List of Mappers/Aligners
trie = tree structure for fast text retrieval.
Courtesy Matt Vaughn, TACC

Indexing time
Aligner   Time (mins) to index 3 GB genome
BWA       110.73
Bowtie    220.82
Some Comparisons
[Figure: Mapping Accuracy - Spliced Data to Whole Genome. Bars showing the share of reads mapped correctly, mapped incorrectly, and not mapped for Mapsplice, Tophat, ABI Bioscope, Splicemap, BWA, SOAP and Bowtie. X-axis: Mapping % (0% to 100%).]
Some Comparisons
[Figure: CPU Time - Whole Chromosome. Bars for SOAP, Bowtie, BWA, ABI Bioscope, Mapsplice (Threaded), Tophat, Splicemap and Mapsplice (Not Threaded). X-axis: Time (mins), 0 to 1600.]
Final Words
My personal favorites of the moment...
Bowtie2
BWA
