NGS QC

How to start with next-generation sequencing (NGS) datasets?
Surendra Vikram
Illumina sequencing workflow
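FASTQ: Sequence with quality
Each read is stored as a record of four lines:
• Sequence header (starts with "@")
• Sequence
• Quality header (starts with "+")
• Quality for each base (one character per base)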
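The four-line record layout can be checked directly from the shell. The sketch below writes a made-up two-read file (`example.fastq` and its reads are invented for illustration), labels each line by its position within the record, and counts the reads. It assumes every record occupies exactly four lines, i.e. sequences are not wrapped.

```shell
# Create a tiny, made-up FASTQ file (hypothetical reads, for illustration only).
cat > example.fastq <<'EOF'
@read1
ACGTACGT
+
IIIIHHGG
@read2
TTGACCA
+
FFFFF##
EOF

# Label each line by its position within the 4-line record.
label_fastq_lines() {
  awk '{
    pos = NR % 4
    if (pos == 1)      print "header: " $0
    else if (pos == 2) print "sequence: " $0
    else if (pos == 3) print "quality header: " $0
    else               print "quality: " $0
  }' "$1"
}

# Count reads: total number of lines divided by 4.
count_reads() {
  awk 'END { print NR / 4 }' "$1"
}

label_fastq_lines example.fastq
count_reads example.fastq
```

The same `count_reads` function can be pointed at the forward and reverse files of a paired-end run to confirm that both contain the same number of reads.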
How is the quality value interpreted?

Quality value interpretation
• A Phred quality score is a measure of the quality of the nucleobase identifications generated by automated DNA sequencing.
• Phred quality scores are assigned to each nucleotide base call during the sequencing process.
• Phred quality scores are logarithmically linked to error probabilities: Q = -10 log10(P), where P is the probability that the base call is incorrect.

Phred quality score    Probability of incorrect base call    Base call accuracy
10                     1 in 10                               90%
20                     1 in 100                              99%
30                     1 in 1,000                            99.9%
40                     1 in 10,000                           99.99%
50                     1 in 100,000                          99.999%
60                     1 in 1,000,000                        99.9999%
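The rows of this table follow from the relation Q = -10 log10(P). As a minimal sketch, the two helper functions below (their names are chosen here for illustration) convert in both directions using awk.

```shell
# Convert a Phred score Q to the error probability P = 10^(-Q/10).
phred_to_error_prob() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Convert an error probability P back to a Phred score Q = -10 * log10(P).
# awk's log() is the natural logarithm, hence the division by log(10).
error_prob_to_phred() {
  awk -v p="$1" 'BEGIN { printf "%.0f\n", -10 * log(p) / log(10) }'
}

# Reproduce the first rows of the table.
for q in 10 20 30 40; do
  printf 'Q%s -> P(error) = %s\n' "$q" "$(phred_to_error_prob "$q")"
done
```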
FASTQ quality value
ASCII code Character ASCII code Character ASCII code Character
33 ! exclamation point 34 " double quotation 35 # number sign
36 $ dollar sign 37 % percent sign 38 & ampersand
39 ' apostrophe 40 ( left parenthesis 41 ) right parenthesis
42 * asterisk 43 + plus sign 44 , comma
45 - hyphen 46 . period 47 / slash
48 0 49 1 50 2
51 3 52 4 53 5
54 6 55 7 56 8
57 9 58 : colon 59 ; semicolon
60 < less-than sign 61 = equals sign 62 > greater-than sign
63 ? question mark 64 @ at sign 65 A uppercase a
66 B uppercase b 67 C uppercase c 68 D uppercase d
69 E uppercase e 70 F uppercase f 71 G uppercase g
72 H uppercase h 73 I uppercase i 74 J uppercase j
75 K uppercase k 76 L uppercase l 77 M uppercase m
78 N uppercase n 79 O uppercase o 80 P uppercase p
81 Q uppercase q 82 R uppercase r 83 S uppercase s
84 T uppercase t 85 U uppercase u 86 V uppercase v
87 W uppercase w 88 X uppercase x 89 Y uppercase y
90 Z uppercase z 91 [ left square bracket 92 \ backslash
93 ] right square bracket 94 ^ caret 95 _ underscore
96 ` grave accent 97 a lowercase a 98 b lowercase b
99 c lowercase c 100 d lowercase d 101 e lowercase e
102 f lowercase f 103 g lowercase g 104 h lowercase h
105 i lowercase i 106 j lowercase j 107 k lowercase k
108 l lowercase l 109 m lowercase m 110 n lowercase n
111 o lowercase o 112 p lowercase p 113 q lowercase q
114 r lowercase r 115 s lowercase s 116 t lowercase t
117 u lowercase u 118 v lowercase v 119 w lowercase w
120 x lowercase x 121 y lowercase y 122 z lowercase z
123 { left curly brace 124 | vertical bar 125 } right curly brace
126 ~ tilde
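Each character on the quality line encodes one Phred score through its ASCII code. The sketch below assumes the Phred+33 offset (the Sanger convention, also used by current Illumina pipelines), so the score is the ASCII code minus 33; older Illumina files that use a +64 offset would need a different constant. The function names are chosen here for illustration.

```shell
# ASCII code of a single character.
# In POSIX printf, a leading quote in the argument yields the character's code.
ascii_code() {
  printf '%d\n' "'$1"
}

# Phred score of one quality character, assuming the Phred+33 offset.
phred_of_char() {
  echo $(( $(ascii_code "$1") - 33 ))
}

# Decode a whole quality string into space-separated Phred scores.
decode_quality() {
  qual=$1
  out=""
  while [ -n "$qual" ]; do
    rest=${qual#?}          # everything after the first character
    ch=${qual%"$rest"}      # the first character
    out="$out$(phred_of_char "$ch") "
    qual=$rest
  done
  echo "${out% }"
}

decode_quality 'II5!#'
```

In this example `I` (ASCII 73) decodes to 40, `5` (ASCII 53) to 20, `!` (ASCII 33) to 0, and `#` (ASCII 35) to 2.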
Paired-end sequencing
Source: Illumina.com
Quality filtering of the NGS data
For practice, download the SRA dataset SRR3924649 from the ENA website.
FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) checks the quality of a FASTQ file. Install it with
conda install -c bioconda fastqc
or
sudo apt install fastqc
Then create an output directory and run FastQC on the FASTQ files:
mkdir fastqc-raw
fastqc mycoplasma_pneumoniae_F_100K.fastq mycoplasma_pneumoniae_F_500K.fastq -o fastqc-raw
Open the HTML file to see the output.
Depending on the quality of the sequences, we can choose the filtering steps.
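Besides the HTML report, FastQC writes a zip archive per input file that contains a tab-separated `summary.txt` (status, module name, input file name). As a rough sketch, the helper below lists the modules that did not pass, which can guide the choice of filtering steps. The function name and the example summary are invented here so that the snippet runs without FastQC installed.

```shell
# Print the modules that did not PASS, given a FastQC summary.txt
# (tab-separated columns: status, module name, input file name).
flag_fastqc_modules() {
  awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }' "$1"
}

# Hypothetical summary, written here only so the sketch is self-contained.
printf 'PASS\tBasic Statistics\treads.fastq\n'           >  summary_example.txt
printf 'FAIL\tPer base sequence quality\treads.fastq\n'  >> summary_example.txt
printf 'WARN\tAdapter Content\treads.fastq\n'            >> summary_example.txt

flag_fastqc_modules summary_example.txt
```

With real output, the summary would first have to be extracted from the corresponding `_fastqc.zip` archive in `fastqc-raw`.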
Trimmomatic:
We will run the quality filtering step using the Trimmomatic software (http://www.usadellab.org/cms/?page=trimmomatic). Install it with
conda install -c bioconda trimmomatic
or
sudo apt install trimmomatic
First, create an output directory:
mkdir trimmomatic-out
Then run Trimmomatic in paired-end (PE) mode:
trimmomatic PE -threads 4 \
  mycoplasma_pneumoniae_F_100K.fastq mycoplasma_pneumoniae_R_100K.fastq \
  ./trimmomatic-out/mycoplasma_pneumoniae_F_paired_100K.fastq ./trimmomatic-out/mycoplasma_pneumoniae_F_unpaired_100K.fastq \
  ./trimmomatic-out/mycoplasma_pneumoniae_R_paired_100K.fastq ./trimmomatic-out/mycoplasma_pneumoniae_R_unpaired_100K.fastq \
  ILLUMINACLIP:all_adapters.fa:2:30:10 LEADING:3 TRAILING:3 \
  SLIDINGWINDOW:4:20 CROP:240 HEADCROP:19 MINLEN:50
Trimmomatic ships with a set of Illumina adapter sequences. I combined all of them into one file, "all_adapters.fa", which the ILLUMINACLIP step uses to remove the adapters introduced during Illumina sequencing.
ILLUMINACLIP:all_adapters.fa:2:30:10: Trimmomatic first looks for seed matches (16 bases), allowing at most 2 mismatches. These seeds are then extended and clipped if a score of 30 is reached for paired-end reads (about 50 bases), or a score of 10 for single-end reads (about 17 bases).
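The ILLUMINACLIP step packs four colon-separated values: the adapter FASTA file, the number of allowed seed mismatches, the palindrome clip threshold (used for paired-end reads), and the simple clip threshold. As a small sketch, the helper below (its name is chosen here for illustration) splits such a string into labeled fields so the settings are easier to read.

```shell
# Split an ILLUMINACLIP step into its named fields.
# Expected form: ILLUMINACLIP:<adapter fasta>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
describe_illuminaclip() {
  echo "$1" | awk -F ':' '{
    print "adapter file: " $2
    print "seed mismatches: " $3
    print "palindrome clip threshold: " $4
    print "simple clip threshold: " $5
  }'
}

describe_illuminaclip 'ILLUMINACLIP:all_adapters.fa:2:30:10'
```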
LEADING:3: remove leading low-quality or N bases (below quality 3)
TRAILING:3: remove trailing low-quality or N bases (below quality 3)
SLIDINGWINDOW:4:20: scan the read with a 4-base-wide sliding window, cutting when the average quality per base drops below 20
CROP:240: cut the read to the specified length (here 240 bases) by removing bases from the end
HEADCROP:19: cut the specified number of bases (here 19) from the start of the read
MINLEN:50: drop reads that are shorter than 50 bases after these steps
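To make the sliding-window idea concrete, here is a simplified sketch, not Trimmomatic's actual implementation: it scans the Phred scores of one read from the 5' end and cuts at the start of the first window whose mean quality falls below the threshold. The function name and the example scores are invented for illustration.

```shell
# Simplified sliding-window trim: print how many leading bases to keep.
# Arguments: window size, mean-quality threshold, then the Phred scores of one read.
# Assumption: the read is cut at the start of the first failing window.
sliding_window_keep() {
  win=$1; thr=$2; shift 2
  echo "$@" | awk -v w="$win" -v t="$thr" '{
    keep = NF                              # keep everything unless a window fails
    for (i = 1; i + w - 1 <= NF; i++) {
      sum = 0
      for (j = i; j < i + w; j++) sum += $j
      if (sum / w < t) { keep = i - 1; break }
    }
    print keep
  }'
}

# Window of 4 bases, threshold 20, as in SLIDINGWINDOW:4:20.
sliding_window_keep 4 20 40 40 38 36 30 22 12 8 5 2
```

For this made-up read the first window with a mean below 20 starts at base 5 (scores 30, 22, 12, 8; mean 18), so the sketch keeps 4 bases. A read trimmed this short would then be discarded by MINLEN:50.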
Other NGS-QC tools
PRINSEQ
https://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi
AdapterRemoval
https://github.com/MikkelSchubert/adapterremoval
seqQscorer (RNA-seq, ChIP-seq, and DNase-seq)
https://github.com/salbrec/seqQscorer
Cutadapt
https://cutadapt.readthedocs.io/en/stable/installation.html
