
How to start with next generation sequencing (NGS) datasets?
Surendra Vikram
Illumina sequencing workflow
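FASTQ: Sequence with quality

Each FASTQ record consists of four lines:
1. Sequence header (begins with "@")
2. Sequence
3. Quality header (begins with "+")
4. Quality for each base (one ASCII character per base)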
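
The four-line record layout can be illustrated with a minimal Python sketch. It assumes every record occupies exactly four lines (no wrapped sequences); the names `FastqRecord` and `read_fastq` are hypothetical helpers introduced here for illustration, not part of any library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # text after the leading '@'
    sequence: str  # the base calls
    quality: str   # one ASCII character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (a blank line also stops the reader)
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# A made-up single record, used only to exercise the parser.
example = "@read1\nACGTN\n+\nIIII#\n"
records = list(read_fastq(io.StringIO(example)))
print(records[0])
```

For real files, pass an open file handle instead of the `io.StringIO` object.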

How is the quality value interpreted?


Quality value interpretation
• A Phred quality score is a measure of how reliably a nucleobase was identified by
automated DNA sequencing.
• A Phred quality score is assigned to each nucleotide base call during the sequencing process.

Phred quality scores are logarithmically linked to error probabilities:
Q = -10 * log10(P), where P is the probability of an incorrect base call.


Phred quality score    Probability of incorrect base call    Base call accuracy
10                     1 in 10                               90%
20                     1 in 100                              99%
30                     1 in 1000                             99.9%
40                     1 in 10,000                           99.99%
50                     1 in 100,000                          99.999%
60                     1 in 1,000,000                        99.9999%
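
The rows of this table follow from the standard Phred relation Q = -10 * log10(P). As a small sketch, the conversion in both directions could look like this (the function names are illustrative only):

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Reproduce the table: score, error probability, base call accuracy.
for q in (10, 20, 30, 40, 50, 60):
    p = phred_to_error_probability(q)
    print(f"Q{q}: P(error) = {p:g}, accuracy = {(1 - p) * 100:.4f}%")
```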
FASTQ quality value
ASCII code Character ASCII code Character ASCII code Character
33 ! exclamation point 34 " double quotation 35 # number sign
36 $ dollar sign 37 % percent sign 38 & ampersand
39 ' apostrophe 40 ( left parenthesis 41 ) right parenthesis
42 * asterisk 43 + plus sign 44 , comma
45 - hyphen 46 . period 47 / slash
48 0 49 1 50 2
51 3 52 4 53 5
54 6 55 7 56 8
57 9 58 : colon 59 ; semicolon
60 < less-than sign 61 = equals sign 62 > greater-than sign
63 ? question mark 64 @ at sign 65 A uppercase a
66 B uppercase b 67 C uppercase c 68 D uppercase d
69 E uppercase e 70 F uppercase f 71 G uppercase g
72 H uppercase h 73 I uppercase i 74 J uppercase j
75 K uppercase k 76 L uppercase l 77 M uppercase m
78 N uppercase n 79 O uppercase o 80 P uppercase p
81 Q uppercase q 82 R uppercase r 83 S uppercase s
84 T uppercase t 85 U uppercase u 86 V uppercase v
87 W uppercase w 88 X uppercase x 89 Y uppercase y
90 Z uppercase z 91 [ left square bracket 92 \ backslash
93 ] right square bracket 94 ^ caret 95 _ underscore
96 ` grave accent 97 a lowercase a 98 b lowercase b
99 c lowercase c 100 d lowercase d 101 e lowercase e
102 f lowercase f 103 g lowercase g 104 h lowercase h
105 i lowercase i 106 j lowercase j 107 k lowercase k
108 l lowercase l 109 m lowercase m 110 n lowercase n
111 o lowercase o 112 p lowercase p 113 q lowercase q
114 r lowercase r 115 s lowercase s 116 t lowercase t
117 u lowercase u 118 v lowercase v 119 w lowercase w
120 x lowercase x 121 y lowercase y 122 z lowercase z
123 { left curly brace 124 | vertical bar 125 } right curly brace
126 ~ tilde
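
To connect the ASCII table to quality scores: each character in the FASTQ quality line encodes one Phred score as its ASCII code minus an offset. The sketch below assumes the Phred+33 encoding (Sanger and Illumina 1.8 or later); older Illumina files used an offset of 64, so check how your data were produced. The function names are illustrative.

```python
from typing import List

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def decode_quality(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]


def encode_quality(scores: List[int], offset: int = PHRED_OFFSET) -> str:
    """Convert Phred scores back into a FASTQ quality string."""
    return "".join(chr(q + offset) for q in scores)


# '!' is ASCII 33, '+' is 43, '5' is 53, '?' is 63, 'I' is 73.
print(decode_quality("!+5?I"))  # [0, 10, 20, 30, 40]
```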
Paired-end sequencing

Source: Illumina.com
Quality filtering of the NGS data
For practice, download the SRA run SRR3924649 from the ENA website.

Check the quality of the FASTQ files with FastQC
(https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

Install FastQC:
conda install -c bioconda fastqc
or
sudo apt install fastqc

Create an output directory and run FastQC:
mkdir fastqc-raw
fastqc mycoplasma_pneumoniae_F_100K.fastq mycoplasma_pneumoniae_F_500K.fastq -o fastqc-raw

Open the HTML report of each file to see the output.
Depending on the quality of the sequences, we can choose the filtering steps.
Trimmomatic:
We will run the quality filtering step with Trimmomatic
(http://www.usadellab.org/cms/?page=trimmomatic).
Install it with "conda install -c bioconda trimmomatic" or "sudo apt install trimmomatic".
First, create an output directory:
mkdir trimmomatic-out
trimmomatic PE -threads 4 \
  mycoplasma_pneumoniae_F_100K.fastq mycoplasma_pneumoniae_R_100K.fastq \
  ./trimmomatic-out/mycoplasma_pneumoniae_F_paired_100K.fastq \
  ./trimmomatic-out/mycoplasma_pneumoniae_F_unpaired_100K.fastq \
  ./trimmomatic-out/mycoplasma_pneumoniae_R_paired_100K.fastq \
  ./trimmomatic-out/mycoplasma_pneumoniae_R_unpaired_100K.fastq \
  ILLUMINACLIP:all_adapters.fa:2:30:10 LEADING:3 TRAILING:3 \
  SLIDINGWINDOW:4:20 CROP:240 HEADCROP:19 MINLEN:50
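
The paired-end command takes its arguments in a fixed order: two input files, then four output files (forward paired, forward unpaired, reverse paired, reverse unpaired), then the trimming steps. A minimal sketch of assembling the same command from Python follows. The helper `build_trimmomatic_pe` and its file-naming scheme are assumptions made for illustration; only the argument order mirrors the command above.

```python
import shlex
from typing import List


def build_trimmomatic_pe(prefix: str, suffix: str, out_dir: str,
                         steps: List[str], threads: int = 4) -> List[str]:
    """Build the argument list for a paired-end Trimmomatic run.

    Hypothetical naming scheme: <prefix>_<F|R>_<suffix>.fastq for inputs and
    <out_dir>/<prefix>_<F|R>_<paired|unpaired>_<suffix>.fastq for outputs.
    """
    inputs = [f"{prefix}_{mate}_{suffix}.fastq" for mate in ("F", "R")]
    outputs = [
        f"{out_dir}/{prefix}_{mate}_{kind}_{suffix}.fastq"
        for mate in ("F", "R")
        for kind in ("paired", "unpaired")
    ]
    return ["trimmomatic", "PE", "-threads", str(threads),
            *inputs, *outputs, *steps]


steps = ["ILLUMINACLIP:all_adapters.fa:2:30:10", "LEADING:3", "TRAILING:3",
         "SLIDINGWINDOW:4:20", "CROP:240", "HEADCROP:19", "MINLEN:50"]
cmd = build_trimmomatic_pe("mycoplasma_pneumoniae", "100K",
                           "./trimmomatic-out", steps)
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems when file names contain spaces.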
Trimmomatic ships with a set of Illumina adapter sequences that it can remove from our data. I combined
all of these Illumina adapters into one file, "all_adapters.fa", to remove the sequencing adapters used
during the Illumina sequencing.
ILLUMINACLIP:all_adapters.fa:2:30:10: Trimmomatic first looks for seed matches (16 bases), allowing
at most 2 mismatches. These seeds are extended and clipped if a score of 30 is reached for
paired-end reads (about 50 bases), or a score of 10 for single-end reads (about 17 bases).
LEADING:3: remove leading low-quality or N bases (below quality 3).
TRAILING:3: remove trailing low-quality or N bases (below quality 3).
SLIDINGWINDOW:4:20: scan the read with a 4-base-wide sliding window, cutting when the average
quality per base drops below 20.
CROP:240: cut the read to the specified length (here 240) by removing bases from the end.
HEADCROP:19: remove the specified number of bases (here 19) from the start of the read.
MINLEN:50: drop reads that are shorter than 50 bases after these steps.
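
The effect of these trimming steps on a single read can be sketched in Python. This is a simplified approximation of the behaviour described above, applied in the order used in the command, and not Trimmomatic's actual implementation; `trim_read` and its parameters are hypothetical names, and qualities are assumed to be already decoded Phred scores.

```python
from typing import List, Optional, Tuple


def trim_read(seq: str, quals: List[int], leading: int = 3, trailing: int = 3,
              window: int = 4, min_avg: float = 20, crop: int = 240,
              headcrop: int = 19, minlen: int = 50
              ) -> Optional[Tuple[str, List[int]]]:
    """Return the trimmed (sequence, qualities), or None if the read is dropped."""
    # LEADING: remove low-quality or N bases from the start.
    start = 0
    while start < len(seq) and (quals[start] < leading or seq[start] == "N"):
        start += 1
    seq, quals = seq[start:], quals[start:]

    # TRAILING: remove low-quality or N bases from the end.
    end = len(seq)
    while end > 0 and (quals[end - 1] < trailing or seq[end - 1] == "N"):
        end -= 1
    seq, quals = seq[:end], quals[:end]

    # SLIDINGWINDOW: cut at the first window whose mean quality is too low.
    for i in range(max(len(seq) - window + 1, 0)):
        if sum(quals[i:i + window]) / window < min_avg:
            seq, quals = seq[:i], quals[:i]
            break

    # CROP: keep at most `crop` bases, removing the rest from the end.
    seq, quals = seq[:crop], quals[:crop]

    # HEADCROP: remove `headcrop` bases from the start.
    seq, quals = seq[headcrop:], quals[headcrop:]

    # MINLEN: drop reads that ended up too short.
    if len(seq) < minlen:
        return None
    return seq, quals


# Made-up read: 100 good bases (Q40) followed by 20 poor bases (Q2).
kept = trim_read("A" * 120, [40] * 100 + [2] * 20)
print(len(kept[0]) if kept else "dropped")
```

In this example the 20 poor bases are removed by the trailing step and 19 more bases by the head crop, so 81 bases remain and the read passes the minimum length of 50.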
Other NGS-QC tools

PRINSEQ
https://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi

AdapterRemoval
https://github.com/MikkelSchubert/adapterremoval

seqQscorer (RNA-seq, ChIP-seq, and DNase-seq)
https://github.com/salbrec/seqQscorer

Cutadapt
https://cutadapt.readthedocs.io/en/stable/installation.html
