
Sanger sequencing

● Gold standard
● High quality
● Low throughput
● Large files
○ Trace files
○ How do we read these?
■ CutePeaks
■ SeqTrace

"Low Throughput" Sequence formats

Some of the most frequent file formats

● FASTA
● GB
● MEGA
● ALN
● NEXUS
● PHYLIP

DNA sequence databases

● NCBI (USA)
● ENA (Europe)
● DDBJ (Japan)
○ Data repositories
○ Replicated
○ Queryable

Storage vs alignment

● FASTA files are good for storing data
● Making the data comparable is something else
● For that, we need alignments
○ Each base in our sequences needs to be aligned so that positions are comparable across sequences
● Some formats are designed for aligned data
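To illustrate why alignment makes positions comparable, here is a minimal Python sketch. It assumes two sequences that have already been aligned (same length, gaps written as "-"), and the function name and example sequences are made up for illustration.

```python
def column_identity(aligned_a: str, aligned_b: str) -> float:
    """Fraction of gap-free alignment columns where both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        # Unaligned sequences have no shared coordinate system to compare in.
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == "-" or base_b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if base_a == base_b:
            matches += 1
    return matches / compared if compared else 0.0


# Hypothetical aligned sequences; the gap in seq_a keeps later columns in register.
seq_a = "ATGC-GATAGC"
seq_b = "ATGCTGATTGC"
identity = column_identity(seq_a, seq_b)
```

Without the gap, every base after the fourth position would be compared against the wrong column, which is the problem alignment-aware formats are meant to avoid.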

High throughput sequencing

● What if we need to scale things?
○ (And not spend our entire budget on sequencing)
● Unknown genome regions
○ Non-model organisms

FASTQ Format

● Each chromatogram takes 100 to 200 KB
○ 200M reads × 150 KB = 30 TB
○ Chromatograms just don't scale!
● Each FASTQ sequence is composed of 4 lines:
○ @Sequence_identifier
○ ATGCGATAGCTGACTGACTA
○ + (optionally the seq_id again)
○ !''*(((((******+**,- (one quality character per base)
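The storage estimate and the 4-line record layout can be sketched in Python as follows. This is a minimal sketch, not a production parser: it assumes every record spans exactly 4 lines and that qualities use the Phred+33 ASCII offset, which the slides do not state. The names `chromatogram_storage_tb`, `FastqRecord` and `parse_fastq`, and the example record, are hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple


def chromatogram_storage_tb(reads: int, kb_per_trace: float) -> float:
    """Storage needed for one trace file per read, in decimal terabytes."""
    return reads * kb_per_trace * 1e3 / 1e12


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: List[int]  # decoded Phred scores, one per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming exactly 4 lines per record."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    for i in range(0, len(stripped), 4):
        header, sequence, separator, quality = stripped[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33 encoding, so '!' decodes to a score of 0.
        scores = [ord(char) - 33 for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Made-up record for illustration only.
example = [
    "@Sequence_identifier\n",
    "ATGCGATAGCTGACTGACTA\n",
    "+\n",
    "!''*(((((******+**,-\n",
]
records = list(parse_fastq(example))
estimate_tb = chromatogram_storage_tb(200_000_000, 150)
```

In practice an established library such as Biopython would normally be used instead of a hand-written parser; the sketch only shows how little structure a FASTQ record needs compared with a chromatogram.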

Assemblies

● Sequence assemblies are a huge problem
○ HTS reads come from random genome locations
○ We could use an entire semester to deal with this problem
● Two types of sequence assembly
○ Mapping assemblies (reference available)
○ De novo assemblies (reference unavailable)
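As a toy illustration of the mapping case, the Python sketch below places reads on a reference by exact substring search. This is a deliberate simplification: real mappers tolerate mismatches and use indexed references, and the function name, reference and reads here are all made up.

```python
from typing import Dict, Iterable, Optional


def map_reads(reference: str, reads: Iterable[str]) -> Dict[str, Optional[int]]:
    """Return the 0-based position of each read's first exact match, or None."""
    placements: Dict[str, Optional[int]] = {}
    for read in reads:
        position = reference.find(read)  # -1 when the read does not occur
        placements[read] = position if position >= 0 else None
    return placements


# Hypothetical reference and reads, in no particular order.
reference = "ATGCGATAGCTGACTGACTAGCT"
reads = ["CTGACT", "GATAGC", "TTTTTT"]
placements = map_reads(reference, reads)
```

A de novo assembly has no `reference` to search against, so reads have to be placed relative to each other through their overlaps, which is a much harder problem.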

SAM/BAM Format

The "standard" way to represent assembled data

● Contain the reads and their coordinates relative to a reference / each other
● A BAM file is a binary version of a SAM file
● BAM files can be indexed
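To make "reads and their coordinates" concrete, here is a minimal Python sketch that splits one SAM alignment line into its 11 mandatory tab-separated fields. The class and function names and the example record are hypothetical, and optional tags after the eleventh field are ignored. BAM files are binary and would normally be read with a dedicated library instead.

```python
from typing import NamedTuple


class SamAlignment(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # how the read aligns (matches, insertions, deletions)
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # per-base qualities, ASCII encoded

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)  # 0x4 marks an unmapped read


def parse_sam_line(line: str) -> SamAlignment:
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return SamAlignment(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10],
    )


# Made-up record: a 6-base read aligned at position 5 of "chr1".
example = "read_1\t0\tchr1\t5\t60\t6M\t*\t0\t0\tGATAGC\tIIIIII"
alignment = parse_sam_line(example)
```

Note that SAM positions are 1-based, so this read starts at the fifth base of the reference.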
