NGS QC
[Figure: layout of a FASTQ record, showing the sequence header, the sequence, the quality header, and the quality for each base. Source: Illumina.com]
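The four parts of a FASTQ record named above (sequence header, sequence, quality header, quality for each base) can be illustrated with a minimal Python sketch. It assumes Phred+33 quality encoding, and the record itself is made up for illustration; it is not taken from SRR3924649.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into its parts and decode the qualities."""
    header, sequence, plus, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("sequence header must start with '@'")
    if not plus.startswith("+"):
        raise ValueError("quality header must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings must have equal length")
    # Assumption: Phred+33 encoding, so score = ASCII code of the character - 33
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores


# Hypothetical record, for illustration only
record = ["@read1\n", "GATTACA\n", "+\n", "IIII#5!\n"]
name, seq, scores = parse_fastq_record(record)
print(name, seq, scores)  # read1 GATTACA [40, 40, 40, 40, 2, 20, 0]
```

Each quality character maps to one base, which is why the two strings must have the same length.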
Quality filtering of the NGS data
For practice, download the SRA run SRR3924649 from the ENA website.
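Create an output directory and run FastQC on the raw reads:
mkdir fastqc-raw
fastqc mycoplasma_pneumoniae_F_100K.fastq mycoplasma_pneumoniae_F_500K.fastq -o fastqc-raw
Open the HTML report written to fastqc-raw for each input file to see the output.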
Depending on the quality of the sequences, we can choose which filtering steps to apply.
Trimmomatic:
We will run the quality filtering step with Trimmomatic
(http://www.usadellab.org/cms/?page=trimmomatic). Install it with either of:
conda install -c bioconda trimmomatic
sudo apt install trimmomatic
First, create an output directory:
mkdir trimmomatic-out
trimmomatic PE -threads 4 \
  mycoplasma_pneumoniae_F_100K.fastq mycoplasma_pneumoniae_R_100K.fastq \
  ./trimmomatic-out/mycoplasma_pneumoniae_F_paired_100K.fastq \
  ./trimmomatic-out/mycoplasma_pneumoniae_F_unpaired_100K.fastq \
  ./trimmomatic-out/mycoplasma_pneumoniae_R_paired_100K.fastq \
  ./trimmomatic-out/mycoplasma_pneumoniae_R_unpaired_100K.fastq \
  ILLUMINACLIP:all_adapters.fa:2:30:10 LEADING:3 TRAILING:3 \
  SLIDINGWINDOW:4:20 CROP:240 HEADCROP:19 MINLEN:50
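In paired-end mode the argument order matters: two input files, then the paired and unpaired outputs for the forward read, then the same for the reverse read, then the trimming steps. As a rough sketch, the command above could be assembled in Python like this. The helper `build_trimmomatic_pe` and its `sample`/`size` parameters are hypothetical names introduced here, and the file naming pattern is simply the one used in this tutorial.

```python
import shlex


def build_trimmomatic_pe(sample, size, outdir, steps, threads=4):
    """Return the argv list for a paired-end Trimmomatic run (hypothetical helper)."""
    cmd = ["trimmomatic", "PE", "-threads", str(threads)]
    # Inputs: forward read file first, then reverse
    cmd += [f"{sample}_{read}_{size}.fastq" for read in ("F", "R")]
    # Outputs: F paired, F unpaired, R paired, R unpaired (in that order)
    for read in ("F", "R"):
        for kind in ("paired", "unpaired"):
            cmd.append(f"{outdir}/{sample}_{read}_{kind}_{size}.fastq")
    # Trimming steps are passed through unchanged, in the given order
    cmd += steps
    return cmd


steps = [
    "ILLUMINACLIP:all_adapters.fa:2:30:10", "LEADING:3", "TRAILING:3",
    "SLIDINGWINDOW:4:20", "CROP:240", "HEADCROP:19", "MINLEN:50",
]
cmd = build_trimmomatic_pe("mycoplasma_pneumoniae", "100K", "./trimmomatic-out", steps)
print(shlex.join(cmd))
```

The list could be handed to `subprocess.run(cmd, check=True)` once Trimmomatic is installed; the sketch only prints the command so it can be checked first.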
Trimmomatic ships with a set of Illumina adapter sequence files. I combined all of them
into one file, all_adapters.fa, so that whichever adapters were used during Illumina
sequencing are removed from our data.
ILLUMINACLIP:all_adapters.fa:2:30:10 controls adapter clipping. Trimmomatic first looks
for seed matches (16 bases), allowing at most 2 mismatches. These seeds are extended and
clipped if a score of 30 is reached for paired-end reads (about 50 bases) or a score of
10 for single-end reads (about 17 bases).
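The "about 50 bases" and "about 17 bases" figures follow from how the alignment score grows: according to the Trimmomatic manual, each matching base adds log10(4), roughly 0.6, to the score. The sketch below splits the ILLUMINACLIP step into its fields and does that arithmetic. The function names are hypothetical, only the five-field form used in this tutorial is handled, and mismatches (which lower the score) are ignored.

```python
import math

# Score gained per perfectly matching base, as described in the Trimmomatic manual
SCORE_PER_MATCH = math.log10(4)


def parse_illuminaclip(step):
    """Split 'ILLUMINACLIP:<fasta>:<seed mismatches>:<palindrome>:<simple>' into fields."""
    name, fasta, seed_mismatches, palindrome, simple = step.split(":")
    if name != "ILLUMINACLIP":
        raise ValueError("not an ILLUMINACLIP step")
    return {
        "adapter_fasta": fasta,
        "seed_mismatches": int(seed_mismatches),
        "palindrome_clip_threshold": int(palindrome),  # used for paired-end reads
        "simple_clip_threshold": int(simple),          # used for single-end matching
    }


def bases_needed(threshold):
    """Smallest number of perfectly matching bases whose score reaches the threshold."""
    return math.ceil(threshold / SCORE_PER_MATCH)


params = parse_illuminaclip("ILLUMINACLIP:all_adapters.fa:2:30:10")
print(params)
print(bases_needed(params["palindrome_clip_threshold"]))  # 50
print(bases_needed(params["simple_clip_threshold"]))      # 17
```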
The remaining steps are applied in the order given on the command line:
LEADING:3 removes leading low-quality or N bases (below quality 3).
TRAILING:3 removes trailing low-quality or N bases (below quality 3).
SLIDINGWINDOW:4:20 scans the read with a 4-base-wide sliding window and cuts when the
average quality per base drops below 20.
CROP:240 cuts the read to a length of 240 bases by removing bases from the end.
HEADCROP:19 removes the first 19 bases from the start of each read.
MINLEN:50 drops reads that are shorter than 50 bases after these steps.
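To make the effect of these steps concrete, here is a simplified Python sketch that applies them to one read, in the same order as the command. It is an illustration under stated assumptions, not Trimmomatic's actual implementation: in particular, the sliding-window step here simply cuts at the start of the first window whose mean quality is below the threshold. The function name `trim_read` and the example read are hypothetical.

```python
def trim_read(seq, quals, leading=3, trailing=3, window=4, min_avg=20,
              crop=240, headcrop=19, minlen=50):
    """Apply simplified trimming steps in order; return None if the read is dropped."""
    # LEADING: remove bases from the start while their quality is below the threshold
    start = 0
    while start < len(quals) and quals[start] < leading:
        start += 1
    seq, quals = seq[start:], quals[start:]

    # TRAILING: remove bases from the end while their quality is below the threshold
    end = len(quals)
    while end > 0 and quals[end - 1] < trailing:
        end -= 1
    seq, quals = seq[:end], quals[:end]

    # SLIDINGWINDOW (simplified): cut at the first window with mean quality < min_avg
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < min_avg:
            seq, quals = seq[:i], quals[:i]
            break

    # CROP: keep at most `crop` bases, removing the rest from the end
    seq, quals = seq[:crop], quals[:crop]
    # HEADCROP: remove the first `headcrop` bases
    seq, quals = seq[headcrop:], quals[headcrop:]
    # MINLEN: drop the read if it is now too short
    if len(seq) < minlen:
        return None
    return seq, quals


# Hypothetical 100-base read: 2 poor leading bases, 88 good bases, 10 mediocre bases
quals = [2] * 2 + [35] * 88 + [10] * 10
result = trim_read("A" * 100, quals)
print(len(result[0]))                     # 68
print(trim_read("A" * 60, [35] * 60))     # None (41 bases left after HEADCROP)
```

In the first example, LEADING removes 2 bases, the sliding window cuts the low-quality tail, and HEADCROP removes 19 more, leaving 68 bases, which passes MINLEN:50.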
Other NGS-QC tools
Prinseq
https://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi
AdapterRemoval
https://github.com/MikkelSchubert/adapterremoval
Cutadapt
https://cutadapt.readthedocs.io/en/stable/installation.html