A Guide to Next Generation Sequence Analysis of Leishmania Genomes
Abstract
Next generation sequencing (NGS) technology transformed Leishmania genome studies and became an
indispensable tool for Leishmania researchers. Recent Leishmania genomics analyses facilitated the
discovery of various genetic diversities including single nucleotide polymorphisms (SNPs), copy number
variations (CNVs), somy variations, and structural variations in detail and provided valuable insights
into the complexity of the genome and gene regulation. Many aspects of Leishmania NGS analyses are
similar to those of related pathogens like trypanosomes. However, the analyses of Leishmania genomes
face a unique challenge because of the presence of frequent aneuploidy. This makes characterization and
interpretation of read depth and somy a key part of Leishmania NGS analyses because read depth affects
the accuracy of detection of all genetic variations. However, there are no general guidelines on how to
explore and interpret the impact of aneuploidy, and this has made it difficult for biologists and
bioinformaticians, especially for beginners, to perform their own analyses and interpret results across
different analyses. In this guide we discuss a wide range of topics essential for Leishmania NGS
analyses, ranging from how to set up a computational environment for genome analyses, to how to
characterize genetic variations among Leishmania samples, and we will particularly focus on chromosomal
copy number variation and its impact on genome analyses.
Key words Next generation sequencing, Bioinformatics, Somy variation, SNP calling, Leishmania
1 Introduction
Joachim Clos (ed.), Leishmania: Methods and Protocols, Methods in Molecular Biology, vol. 1971,
https://doi.org/10.1007/978-1-4939-9210-2_3, © Springer Science+Business Media, LLC, part of Springer Nature 2019
70 Hideo Imamura and Jean-Claude Dujardin
it affects all aspects of the NGS analyses [3–7, 9]. Many of the common NGS guidelines and practices
established for other euploid and polyploid organisms are not particularly applicable to genomes with
frequent aneuploidy. First of all, normalization factors must be calculated separately for all
individual chromosomes to reflect aneuploidy. Second, aneuploidy is common in cultured promastigotes
regardless of their cloning status, and the presence of different copy numbers of a given chromosome
in cloned cells (somy mosaicism) is considered to be common [10]. This somy mosaicism makes it
difficult but critical to characterize the somy values of all chromosomes. It is essential to
carefully distinguish technical artifacts from real chromosome copy number variations when somy
mosaicism is also possible. In Leishmania genome analyses, it is imperative to evaluate normalized
read depth with and without somy effects to properly attribute the cause of depth changes to local
copy number variations or somy variations [3, 4, 11], and we will discuss this point in detail.
Frequent aneuploidy is one of the major differences between Leishmania and Trypanosoma genome
analyses [12, 13].
This guide mainly focuses on practical and specific computational aspects of Leishmania NGS
analyses. However, we must emphasize that proper planning and meticulous preparation are crucial,
and before considering bioinformatics, we must thoroughly optimize many experimental details in
genome sequencing, including the experimental setup, number of samples, number of replicates, type
of DNA preparation kits, read length, and insert size [14, 15].
In this guide, we will concentrate mainly on DNA genomics analyses and will also briefly discuss
RNA sequencing. We will first discuss how to set up computer environments and computational tools
for NGS analyses. Then we will discuss key sequence processing steps such as read mapping, reference
genome evaluation, depth characterization, and SNP and indel characterization, which are described
in a schematic diagram (Fig. 1). We describe the details of depth analyses, which are often ignored
or misunderstood, since depth is the key factor in understanding Leishmania NGS results.
2.1 Set Up a Linux Computer System
Most bioinformatic tools are developed for the Linux system. Therefore, it is highly recommended to
use either a Linux or Linux-based system. We briefly discuss the different options below, as well as
some practical solutions for people working with a Windows-based system.
Linux: People performing sequencing analysis, regardless of their previous backgrounds, must get
familiar with a Linux computer system and the essential Linux commands for genome sequence analyses.
Many programs are designed for a Linux or Unix-like environment including Mac OS, and these systems
are most suitable for sequencing analyses that generate a large amount of data.
[Fig. 1 diagram: (1) acquiring sequence reads and a reference genome (base quality check, masking
reference); (2) mapping reads to the reference (read mapping with SMALT, BWA, or bowtie2; sam/bam
alignments handled with samtools and Picard); (3) evaluating mapping; (4) depth evaluation (raw depth
dr, depth per chromosome dch) and SNPs/indels identification (calling SNPs and indels with mpileup,
GATK, or Freebayes; filtering SNPs with GATK)]
Fig. 1 Schematic diagrams for different key sequencing steps: (1) acquiring sequence reads and a reference
genome, (2) mapping reads to the reference, (3) evaluating mapping, and (4) depth evaluation and SNPs/indels
identification. The program names are shown to the left of the process they work on, and some key
processes and characteristics are shown with bullet points
Bio-linux: For beginners, it may be difficult to decide what kind of programs to install, and there
are specialized Linux packages designed for sequence analysis such as Bio-linux
(http://environmentalomics.org/bio-linux/). This package offers a simple solution, but the programs
in the package tend to become outdated quickly. Therefore, it is recommended to check the versions of
its sequencing tools and to install updated versions of these tools individually.
Windows: In Windows, merely inspecting simple results can be daunting, and many essential tools are
not available for this environment. Therefore, it is essential to have access to some form of Linux
computer environment. For Windows users, Linux can be readily installed as a virtual operating system
within Windows (e.g., https://www.virtualbox.org), and recently Windows 10 has started offering a
Linux environment; thus, Windows users are
2.2 Setting Up Sequencing Analysis Tools
Once our computer is ready for sequence analyses, it is time to install various relevant programs,
and we will list a limited number of essential general software packages that will help us to analyze
and appreciate the sequence results. There are also more alternative programs available, but we keep
the list short because once we get familiar with these tools, we will be able to obtain the
additional programs we need. We will discuss some specific sequencing tools in detail in the upcoming
sections.
Software managing programs: Installing programs can be complicated and time-consuming, but several
specialized programs will help to install recent sequencing tools. For example, a software managing
program called “Homebrew” can be installed both in Mac (https://brew.sh) and Linux
(http://linuxbrew.sh) and is a convenient tool to install the most recent sequencing tools including
samtools. When a software managing program cannot update programs anymore, it is recommended to
install a new Linux OS, which makes bioinformatics tasks much easier.
Scripting languages and their scientific packages: Python and Perl are popular versatile scripting
languages, suitable for handling sequencing data, including characters and numbers, and for
processing files. Python has many numerical and scientific packages for sequencing data processing.
A Python package manager called Anaconda (https://www.anaconda.com) helps users to install most
Python scientific packages including matplotlib (https://matplotlib.org), numpy, scipy, and pandas
(https://www.scipy.org) for computation and visualization with ease. Biopython
(https://biopython.org) also provides bioinformatics tools, but its strength leans toward structural
biology. Perl is a flexible and versatile scripting language to handle characters and complex text,
and it is easy to create our own statistical functions; however, it lacks extensive statistical and
visualization modules. The Comprehensive Perl
3.1 Obtain a Reference Genome and Obtain Sequence Reads
Most of the Leishmania and Trypanosoma reference genomes can be obtained from TriTrypDB
(http://tritrypdb.org/common/downloads/) [17]. However, other updated Leishmania reference genomes
are also hosted by Leishmania expression and sequencing projects (http://leish-esp.cbm.uam.es), and
an updated Leishmania donovani LdBPK282 reference can be also
3.2 How to Evaluate Sequence Quality
To evaluate overall sequence quality, we need to measure various sequence features such as base
quality and read depth quality. High quality of bases, read depth, and the reference are equally
critical for accurate sequence analyses. If the read depth fluctuation is too high, or if a
reference is not assembled correctly or contains many repetitive regions, the identification of all
genetic variations is affected. Therefore, to thoroughly evaluate sequence quality, we must
cross-examine all aspects of the data shown in Fig. 1 as a whole because when there are many
technical artifacts, it is difficult to identify the real genetic variations.
The quality of bases can be measured by a read base quality control (QC) program at the beginning,
but it can be more efficiently measured after mapping the reads since low quality bases would be
trimmed off by an aligner or can be easily screened out in a SNP calling process. Using alignment
files, we can start evaluating read depth and at the same time we can evaluate base quality, and it
is far more effective to evaluate sequence quality by inspecting alignments in sequence viewers.
Initial base quality control is more essential for de novo assembly, but it is often
counterproductive to assume that the initial read quality control and trimming reads would guarantee
high-quality genome sequencing results without thoroughly examining other sequencing properties such
as the quality of read depth, mapping, and a reference. So, first, we briefly describe how to check
base quality and then describe how to evaluate read depth and overall quality of sequence data.
Read quality check and base trimming: Read quality control (QC) programs such as FastQC
(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) can be applied to measure various basic
read quality factors, such as base quality, overall GC content, GC content per position, duplicate
level, and length distribution. FastQC produces read quality information in an html file that we can
view in a browser (Fig. 1). Then reads can be trimmed using a program such as trimmomatic [18] to
remove bad-quality bases [19]. If we obtain sequence reads from public Sequence Read Archive (SRA)
databases, are new to sequencing, are testing new methods, are performing de novo assembly, or are
particularly interested in structural variation in detail, it is essential to perform read quality
control by a read quality evaluation program because the quality of reads can vary significantly
depending on
3.3 Mapping Reads
Read mapping algorithms: The first step of sequence analysis is to map read files in FASTQ format
(FASTA sequences with their base quality scores) to a reference genome (Fig. 1). There are two main
read mapping algorithms, and we will briefly describe these two. One is based on hash indexing and
another is based on the Burrows–Wheeler character string transformation. For a hash-based alignment
method, mapping FASTQ reads to a reference involves two steps: indexing a reference and mapping
reads to a reference. Indexing a reference creates an indexed reference database, which makes a
reference genome readily accessible for quick search, instead of performing an intensive base
similarity search all over a reference database. Using the indexed reference database, an aligner
will identify candidate matching positions in a reference by hash search and then perform a more
rigorous search for best matches around these candidate positions in the indexed reference. An
alternative algorithm is based on suffix/prefix digital trees (Burrows–Wheeler transformation) and
can store genetic variations more efficiently [20–22]. The size of Leishmania reference genomes is
around 32 million bases, roughly 100 times smaller than a human genome reference, making it possible
to perform a more thorough and sensitive search for alignment than the default parameters allow.
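The two-step hash-based procedure described above (index the reference, then look up read seeds to
find candidate positions) can be sketched in a few lines of Python. This is only a toy illustration
under our own assumptions: the function names, the k-mer length, and the sequences are hypothetical,
and real aligners such as SMALT use far more elaborate indexing and scoring.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the 0-based positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def candidate_positions(read, index, k):
    """Count how many k-mer seeds of the read support each implied read start."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            votes[hit - offset] += 1
    return dict(votes)


# Toy reference and read (made-up sequences).
reference = "ACGTTGCATGCCGATAGGCTAACGTTGCA"
index = build_kmer_index(reference, k=5)
votes = candidate_positions("GCCGATAGG", index, k=5)

# The start position supported by the most seeds is the best candidate;
# a real aligner would now align the read rigorously around it.
best_start = max(votes, key=votes.get)
```

A k-mer that occurs more than once in the reference (here "ACGTT") is stored with all of its
positions, which is how repeats produce several candidate locations for a single read.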
Read aligners SMALT, BWA, and bowtie2: For read mapping,
SMALT based on a hash algorithm, BWA [20] and bowtie2 [21],
based on Burrows–Wheeler transformation, are often used in
Leishmania sequencing studies. They are all effective aligners, and
we can select one after testing these.
3.4.3 Copy Number Variation
To evaluate copy number variations at gene level, we define an average haploid depth per gene
without its somy impact as dHG and define full cell depth with its somy impact as dFG, and their
relationship is given as dFG = S × dHG, where S is the somy of the chromosome. To evaluate copy
number variations in general, average values of haploid depth dH and full cell depth dF in windows of
1000 or 2000 bases can be used, and if the depth variation is high, the window size must be
increased. Sliding windows of 200 bases were used to measure copy number variations in Imamura 2016
[3]. This method resolved smaller scale CNVs, but it was difficult to identify CNV boundaries to find
commonly shared CNVs among the samples. Therefore, it is more practical to use a wider window size so
that a CNV shared by many strains would have the same CNV boundaries. CNVs can become statistically
significant in two ways: a statistically higher copy number and a statistically longer CNV. The
cutoff values of these statistical significances must be defined for each sample set because these
cutoffs depend on the size of vertical depth fluctuation and the size of horizontal depth fluctuation
of each data set. The z-score can be used to find optimal cutoffs.
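As a rough illustration of the window-based depth values and the z-score mentioned above, the
following Python sketch averages per-base depth in fixed windows, converts it to haploid depth dH and
full cell depth dF using dF = S × dH, and scores each window. The normalization by the chromosome
median, the toy depths, and the z-score cutoff of 1.5 are our own assumptions for illustration, not
values taken from the analyses cited here.

```python
from statistics import mean, median, pstdev


def window_means(depths, window):
    """Average per-base depth in consecutive windows; a trailing partial window is dropped."""
    return [mean(depths[i:i + window])
            for i in range(0, len(depths) - window + 1, window)]


def haploid_and_full_depth(window_depth, chromosome_median_depth, somy):
    """Return (dH, dF): depth without and with the somy impact, using dF = S x dH."""
    d_h = window_depth / chromosome_median_depth  # assumed normalization
    return d_h, somy * d_h


def z_scores(values):
    """Standard score of every window relative to all windows (fails if all are equal)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Toy per-base depths for one disomic chromosome with a locally amplified stretch.
depths = [20] * 8 + [60] * 4 + [20] * 8
windows = window_means(depths, window=4)
chrom_median = median(windows)
pairs = [haploid_and_full_depth(w, chrom_median, somy=2) for w in windows]
d_h = [p[0] for p in pairs]
d_f = [p[1] for p in pairs]
z = z_scores(d_h)
candidate_windows = [i for i, score in enumerate(z) if score > 1.5]  # hypothetical cutoff
```

In real data the window size would be 1000 bases or more, and the cutoff would be tuned to the depth
fluctuation of the sample set.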
We illustrate how to perform a CNV analysis based on the CNV analyses in a previous work [3]. For a
CNV analysis, it is essential to define a baseline haploid depth level using median or average
haploid depth for a number of strains. If wild type strains
[Fig. 2, panel (b): plot of somy (y-axis, 0 to 4) for chromosomes 1 to 36 (x-axis)]
Fig. 2 Chromosome depth and somy: (a) Median chromosome depths are given in dark grey. The median
depth of all chromosome median depths dmch is suitable for somy normalization rather than the average
depth of all chromosome median depths. (b) These chromosome median depths can be converted to somy
values using dmch
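The conversion from chromosome median depths to somy values can be sketched as follows. The sketch
assumes that the majority of chromosomes are disomic, so that the median of all chromosome median
depths (dmch) corresponds to a somy of 2; this baseline, the function name, and the depths are our
assumptions for illustration.

```python
from statistics import mean, median


def somy_from_depth(chrom_median_depth, baseline_somy=2):
    """Convert chromosome median depths to somy values using the median of medians (dmch)."""
    d_mch = median(chrom_median_depth.values())
    return {chrom: baseline_somy * depth / d_mch
            for chrom, depth in chrom_median_depth.items()}


# Made-up median depths: three disomic, one trisomic, and one tetrasomic chromosome.
chrom_median_depth = {"chr1": 40, "chr2": 40, "chr3": 60, "chr4": 40, "chr31": 80}
somy = somy_from_depth(chrom_median_depth)

# Normalizing by the average instead would be pulled upward by the aneuploid
# chromosomes, so the disomic chromosomes would appear to have a somy below 2.
somy_chr1_with_average = 2 * chrom_median_depth["chr1"] / mean(chrom_median_depth.values())
```

The last line shows why the median of the chromosome medians is preferred over their average when
several chromosomes are aneuploid.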
3.4.4 Length Bias of Somy and Local Somy Normalization
In some previous studies, we frequently observed a somy bias associated with chromosome length for
some sequencing data. For example, we often observed the trend that somy values of shorter
chromosomes tend to be smaller in samples sequenced by Illumina Genome Analyzer II [4, 5]. Similar
skewed somy values affected by chromosome length have still been observed in Illumina HiSeq results
[6]. However, it is clear that these somy biases were technical artifacts because when these samples
were sequenced again, none of these depth biases were
[Fig. 3 diagram, panel (b): reads and depth over the reference for a large-scale deletion and
duplication, a 15K amplicon, and a linear episome of up to 300K]
Fig. 3 Somy variation and local copy number variations: (a) Chromosome copy numbers are quite variable in
many chromosomes of Leishmania promastigotes. (b) Large-scale deletions and duplications are also com-
mon. Long and high copy number amplifications containing a few genes spanning about 15,000 bp and long
linear episomes spanning up to 300,000 bp have been observed in Leishmania donovani strains [3]
3.4.5 Somy Estimation for Sequences with Higher Depth Variability
When the depth of samples is not sufficient or the variability of depth is too large, it is
necessary to use other normalization factors, such as various percentile depths, or obtain depth from
single copy genes [30] or from nonrepetitive regions and the latter was
3.4.6 Somy and Alternative Read Allele Frequency
It is possible to estimate somy values from alternative read allele frequency because when there are
sufficient heterozygous SNPs, read allele frequency distribution can reflect somy in most cases [5,
7]. However, read allele frequency can drastically change along a chromosome without any somy or copy
number variation [11, 13, 31, 32]; therefore, we must carefully interpret the relationship between
somy and read allele frequency.
3.4.7 Evaluating Somy Based on Binned Alternative Read Allele Frequency
When the depth of sequence is low and all the chromosomes have similar read depth, we can still
estimate somy from read allele frequency estimated from multiple SNP sites (e.g., 1000 SNP sites)
[13]. This method was applied to provide evidence for viable and stable triploid Trypanosoma
congolense parasites during its life cycle [13]. This method can be used to monitor somy variations
across different environmental conditions when there are a sufficient number of heterozygous SNPs and
sufficient DNA from the parasites at different stages because it requires the comparison of
aggregated read allele frequency between different samples.
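One possible way to aggregate read allele frequency over bins of SNP sites is sketched below. It
pools the read counts of consecutive heterozygous sites and reports the minor allele frequency per
bin, which would be expected near 1/2 for a disomic chromosome and near 1/3 for a trisomic one. Using
the folded (minor) allele frequency is our own simplification and is not necessarily the procedure
used in [13]; at very low depth, taking the minimum per site biases the estimate downward. The
function name and the counts are hypothetical.

```python
def binned_minor_allele_frequency(sites, bin_size):
    """Pooled minor allele frequency for consecutive bins of heterozygous SNP sites.

    sites: list of (reference_read_count, alternative_read_count) in genomic order.
    Bins without any reads are skipped, and a trailing partial bin is dropped.
    """
    frequencies = []
    for start in range(0, len(sites) - bin_size + 1, bin_size):
        chunk = sites[start:start + bin_size]
        minor = sum(min(ref, alt) for ref, alt in chunk)
        total = sum(ref + alt for ref, alt in chunk)
        if total:
            frequencies.append(minor / total)
    return frequencies


# Made-up counts: four sites from a trisomic region, then four from a disomic region.
sites = [(2, 1), (1, 2), (2, 1), (1, 2), (2, 2), (1, 1), (3, 3), (2, 2)]
freqs = binned_minor_allele_frequency(sites, bin_size=4)
```

With real data the bin size would be much larger (e.g., 1000 SNP sites), and the binned values would
be compared between samples rather than interpreted in isolation.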
3.5 Characterizing Base Variations
3.5.1 Individual and Population Variant Calling Modes and Consensus Method
Once alignment files (bam) are created, we can identify genetic variants such as single nucleotide
polymorphisms and small indels with these bam files. We describe three ways to identify genetic
variations. The first is an individual variant calling mode that identifies genetic variations per
sample independently from any other samples, and is the simplest method [4, 11, 23, 32]. The second
is a population variant calling mode that identifies genetic variations from multiple samples
simultaneously [6, 12, 13, 28, 31]. The third is a consensus variant calling that obtains a
consensus of different variant calling methods, and this can be done in individual [3, 24] or
population mode or can be hybrid of both, though these complicated cases will not be discussed here.
Individual variant calling: An individual variant calling method is the one many people are familiar
with and is commonly used for analyzing various Leishmania samples. This method is suitable to
characterize a single sample or several unrelated samples in detail. The advantages of this approach
are its simplicity and the fact that SNP calling parameters can be adjusted to each sample. The main
disadvantage of this method is that it can miss a SNP whose read allele frequency is 0.05 or lower
(e.g., a position with total depth of 20, 19 reference allele bases, and 1 alternative allele base,
giving 1/20 = 0.05).
[Fig. 5 diagram: a hypothetical reference with unique regions and tandem repeats, showing (2) a
duplication and (4) a duplication close to a gap]
Fig. 5 Locations associated with many false positive SNPs in a reference genome: False positive SNPs were
often found in (1) low complexity regions such as homopolymers or short tandem repeat regions, (2) a
duplicated region where tandem repeats are truncated into a shorter unit in the reference, (3) tandem
repeats where mapping quality is zero and these SNPs may not be detected by SNP callers leading to
false negative SNPs, (4) duplications close to a gap, and (5) duplications close to the end and
beginning of a chromosome. The broad light green lines represent unique self-blast hits to themselves
and the dark grey lines represent repetitive matchings. In the figure, “SNP” indicates a real SNP,
“F” indicates a false positive SNP, and “FN” indicates a false negative SNP. The upper left box
highlights the characteristics of high-quality SNPs
3.5.2 Masking a Reference: Reference-Specific False Positives
Our challenge here is to distinguish true SNPs from a flood of false SNPs mainly caused by a
reference itself. The main characteristics of true SNPs are described in the box in Fig. 5, which
include (1) clean neighboring bases without gaps, repeats, or base errors, (2) higher alternative
read allele frequency, (3) high mapping score, and (4) nonrepetitive high complexity regions. In
contrast, we illustrate how a true SNP (SNP), false positive SNPs (F), and false negative SNPs (FN)
can appear in an alignment in Fig. 5. We depicted the various features causing false SNPs in a
hypothetical reference genome including a homopolymer, a duplication, tandem repeats, and
duplications close to a gap and the end of a chromosome. Here a reference genome was self-blasted to
itself, and one-to-one
3.5.3 Sample-Specific Cutoffs
Read depth cutoffs: Duplicated and deleted regions can create false positive SNPs for phylogenetic
analysis; it is therefore common to apply a read depth cutoff. For example, a normalized depth cutoff
of 0.5 < dm < 2 can be used. Ideally, this normalized depth must be defined for each chromosome in
Leishmania since somy values can vary significantly, but it is common to set a general common cutoff
for all chromosomes for individual samples or a total depth for all samples. More practically, raw
depth cutoffs can be used to simplify the computation.
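A per-chromosome version of this read depth cutoff might look like the sketch below, which keeps a
SNP only when its depth, normalized by the median depth of its own chromosome, lies between 0.5 and
2. The function name, the data layout, and the numbers are hypothetical.

```python
def passes_depth_cutoff(depth, chromosome_median_depth, low=0.5, high=2.0):
    """True when the normalized depth dm lies strictly between the two cutoffs."""
    d_m = depth / chromosome_median_depth
    return low < d_m < high


# Made-up chromosome median depths and SNP records (chromosome, position, depth).
chromosome_median_depth = {"chr1": 40, "chr31": 80}
snps = [("chr1", 1500, 38), ("chr1", 9200, 95), ("chr31", 400, 90), ("chr31", 7700, 30)]

kept = [(chrom, pos) for chrom, pos, depth in snps
        if passes_depth_cutoff(depth, chromosome_median_depth[chrom])]
```

Because the cutoff is relative to each chromosome, a depth of 90 is accepted on the tetrasomic chr31
while a depth of 95 is rejected on the disomic chr1.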
SNP clusters: It is common to exclude or mark SNP clusters where there are more than three SNPs
within ten bases of each other because these clusters are often associated with false positive SNPs.
The definition of SNP clusters, however, must be adjusted for each data set, based on the number of
real SNP clusters.
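The SNP cluster rule can be sketched with a sliding window over sorted SNP positions on one
chromosome. Here "within ten bases" is interpreted as all SNPs of the cluster falling inside a
10-base window; this interpretation, like the function name and the positions, is our assumption, and
both parameters are meant to be adjusted per data set.

```python
def flag_snp_clusters(positions, window=10, max_snps=3):
    """Return positions that belong to a group of more than max_snps SNPs inside one window."""
    positions = sorted(positions)
    flagged = set()
    left = 0
    for right in range(len(positions)):
        # Shrink the window from the left until it spans fewer than `window` bases.
        while positions[right] - positions[left] >= window:
            left += 1
        if right - left + 1 > max_snps:
            flagged.update(positions[left:right + 1])
    return flagged


# Made-up SNP positions on one chromosome.
clustered = flag_snp_clusters([100, 103, 105, 108, 250, 400, 402])
```

Flagged positions can then be excluded or simply marked for visual inspection.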
Specifying somy: Various integer somy values can be assigned to given chromosomes in GATK and
Freebayes, which is quite effective for polyploid plant genomes, whose ploidy can be higher than
five. In Leishmania genomes, however, aneuploidy can often be transient, and intermediate somy values
are also quite common [3, 6]. Therefore, it is often difficult to specify proper somy values for each
chromosome in different strains without introducing unintended biases, and it is normally sufficient
to use a default somy setting for SNP calling. For example, we have never observed any SNP deficiency
in chromosome 31, whose somy has always been greater than 3 [3], nor in septasomic chromosomes we
have examined (data not shown).
3.5.4 Annotating Functional Impact of SNPs and Indels
A functional annotation program SnpEff [41] can be used to classify all SNPs and indels based on
their functional impact such as frameshift, nonsynonymous, synonymous change, and intergenic
mutation. SNPs and indels are compiled in a population genetic variation vcf file. From this vcf
file, alternative allele and depth information can be extracted for further analysis. Variants common
to all strains are often uninformative, and they can be excluded from the analysis.
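Extracting allele and depth information from a population vcf file can be sketched in plain Python
as below. The sketch assumes tab-separated data lines whose FORMAT column contains GT and AD fields
and whose QUAL is numeric; the chromosome names, genotypes, and read counts are made up, and a
dedicated VCF library would be a sturdier choice for real files.

```python
def parse_vcf_line(line):
    """Split one VCF data line into site fields and per-sample FORMAT dictionaries."""
    fields = line.rstrip("\n").split("\t")
    keys = fields[8].split(":")
    samples = [dict(zip(keys, value.split(":"))) for value in fields[9:]]
    return {"chrom": fields[0], "pos": int(fields[1]), "ref": fields[3],
            "alt": fields[4], "qual": float(fields[5]), "samples": samples}


def alt_allele_frequency(sample):
    """Alternative read allele frequency from an AD value written as 'ref,alt'."""
    ref_reads, alt_reads = (int(x) for x in sample["AD"].split(",")[:2])
    total = ref_reads + alt_reads
    return alt_reads / total if total else None


def is_informative(record):
    """False when every sample carries the same genotype at this site."""
    return len({s["GT"] for s in record["samples"]}) > 1


# Hypothetical records for three strains.
lines = [
    "\t".join(["Ld01", "1200", ".", "A", "G", "1800", "PASS", ".", "GT:AD:DP",
               "1/1:0,30:30", "1/1:0,28:28", "1/1:1,35:36"]),
    "\t".join(["Ld01", "5400", ".", "C", "T", "2100", "PASS", ".", "GT:AD:DP",
               "0/0:25,0:25", "0/1:14,12:26", "1/1:0,31:31"]),
]
records = [parse_vcf_line(line) for line in lines]
informative = [r for r in records if is_informative(r)]
frequencies = [alt_allele_frequency(s) for s in informative[0]["samples"]]
```

The first record is homozygous for the alternative allele in all three strains and is therefore
dropped as uninformative.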
3.6 Screening SNPs and Indels
Filtering SNPs and indels: Once SNPs and indels are identified, the next step is to apply variant
filtering such as GATK VariantFiltration, which evaluates many SNP quality conditions including SNP
quality per depth, SNP strand bias, root mean square of the mapping quality, and a mapping quality
test between reference and alternative alleles. A caution here is that GATK is updated frequently and
its recommended parameters change occasionally; therefore, it is essential to consult the GATK user
guide to set proper parameters.
SNP quality score: The next step is to test and select a SNP score that matches the sensitivity and
specificity required for the analysis. There is no universal SNP score that can be applied to any
samples since a SNP score itself depends on read depth, read base quality, and other factors, but we
will provide some values that were applied in our previous analyses.
In practice, SNP scores are given in the 6th column with the tag QUAL in vcf files. They are
expressed in terms of Phred scores Q defined to be Q = −10 log10(P), where P is the estimated error
probability for that base-call [42]. In our data sets for L. donovani, other Leishmania, and
Trypanosoma species, we found that SNP scores of 300 for GATK individual calling and 1500 for GATK
population calling are sufficient to remove most of the false positive SNPs. These values must be
adjusted specifically to a given data set because each data set can have different biases. In our
previous analyses, these values were selected after testing GATK using the BPK282 sequence data from
which the reference genome was created. For example, in a sample set containing about 200 samples in
which dozens of samples are affected by base errors, a GATK SNP score cutoff might be elevated to
4000 or higher in a population calling to eliminate low-quality heterozygous false positive SNPs. In
general, a proper SNP cutoff should not change the number of SNPs substantially when the cutoff is
changed slightly, such as from 300 to 500 in a population SNP calling of several dozen samples or
from 50 to 100 in an individual SNP calling. If this happens, the SNP cutoff is too low [3], and it
is often safe to use the cutoff around which the number of SNPs is stable, since the number of true
positive SNPs should not vary much depending on the cutoff. Alternatively, it is better to remove
samples producing excessive SNPs, if that is possible. During selection of a SNP cutoff, it is
essential to inspect SNPs visually in the Integrative Genomics Viewer, Artemis, or samtools tview to
avoid false positives and to mask regions that produce excessive false positive SNPs.
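The idea of choosing a cutoff around which the number of SNPs is stable can be explored with a small
scan over candidate QUAL cutoffs, as sketched below together with the Phred conversion. The QUAL
values, the candidate cutoffs, and the 5% stability tolerance are illustrative assumptions only and
would need to be replaced by values from the data set at hand.

```python
def phred_to_error_probability(q):
    """Invert Q = -10 log10(P) to obtain the error probability P."""
    return 10 ** (-q / 10)


def count_snps_by_cutoff(quals, cutoffs):
    """Number of SNPs whose QUAL is at least each candidate cutoff."""
    return {c: sum(q >= c for q in quals) for c in cutoffs}


def first_stable_cutoff(counts, max_relative_change=0.05):
    """Lowest cutoff after which moving to the next cutoff barely changes the SNP count."""
    cutoffs = sorted(counts)
    for low, high in zip(cutoffs, cutoffs[1:]):
        if counts[low] and (counts[low] - counts[high]) / counts[low] <= max_relative_change:
            return low
    return None


# Made-up QUAL values: ten low-scoring calls and ten high-scoring calls.
quals = [20, 35, 45, 60, 75, 90, 110, 150, 180, 250,
         600, 800, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 5000]
counts = count_snps_by_cutoff(quals, cutoffs=[50, 100, 200, 300, 500])
stable = first_stable_cutoff(counts)
```

In this toy example the count keeps dropping until the cutoff reaches 300 and is unchanged between
300 and 500, so 300 is reported; visual inspection of the retained SNPs remains necessary.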
3.7 Specific Sequencing Analyses
We have discussed the basics of sequence analyses and now we will further discuss more specific
cases that improve sequencing analysis methods and also explore topics that are important but less
frequently discussed.
3.7.1 Mapping Reference Reads to Its Own Reference Genome to Understand the Reference and to Improve CNV Detection and SNP Calling Methods
Mapping reference reads to the genome itself is a good practice to become familiar with a reference
genome and to learn its potential misassemblies. For example, the reference genome of L. major
Friedlin is considered the most accurate among Leishmania genomes. However, the copy number variation
analysis of various Leishmania genomes showed that many of the repetitive genes were often
concatenated, resulting in their higher copy numbers [5, 40]. Therefore, mapping reference reads to
the reference itself is an excellent way to calibrate and improve our own CNV detection and SNP
calling methods. Higher depth in a gene of a reference indicates that some sections were truncated in
the reference, while lower depth around a repetitive region indicates that the region was
overextended in the assembly process, and this was observed in references assembled by PacBio SMRT
Sequencing [6, 31].
3.7.2 Optimizing SNP Calling Using Artificial Mixtures of Two Samples
Optimizing SNP calling can be done by using simulated reads, but it is more relevant to optimize the
method by using alignments of an artificial mixture of two samples that have a sufficient number of
homozygous SNP differences. The main weakness of simulated reads is that they are too clean and make
it too easy to train our methods because real samples produce nearly intractable SNPs that simulated
reads simply cannot emulate. If we do not have two clean samples to train our SNP calling method, it
might be informative to optimize our SNP calling using a dataset that is already published. For
example, the data for in vitro selection of miltefosine resistance in promastigotes of Leishmania
donovani from Nepal [29] may be used. These samples do not contain more than 200 homozygous and
heterozygous SNPs, and the BPK282 lines are closely related to the reference BPK282 strain.
Therefore, they are an ideal data set to calibrate SNP calling. To refine SNP calling parameters
further, the alignment files of BPK282s and BPK275s, which belong to a different genotype, can be
mixed at certain proportions using the “samtools view -s” command. We are able to test the
sensitivity and specificity of the method with this artificially mixed data set based on real
alignments because BPK275 lines have their own unique, mainly homozygous SNPs. By using these data
sets, we will notice that some regions produce disproportionately many SNPs because of their
repetitiveness or some unknown factors, and these regions must be masked for further SNP analyses
[3].
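The mixing step can be prepared with a short Python helper that builds the samtools command lines
without running them. It relies on the "samtools view -s" option, whose value combines a random seed
(integer part) and the fraction of reads to keep (decimal part), followed by "samtools merge" and
"samtools index". The file names, fractions, and seed are hypothetical, and read group handling in
the merged file is not addressed in this sketch.

```python
def mixing_commands(bam_a, bam_b, fraction_a, fraction_b, out_bam, seed=42):
    """Build samtools commands that subsample two alignments and merge the results."""
    def subsample(bam, fraction, out):
        if not 0 < fraction < 1:
            raise ValueError("fraction must be between 0 and 1 (exclusive)")
        # -s takes SEED.FRACTION, e.g. 42.800 keeps about 80% of reads with seed 42.
        return ["samtools", "view", "-b", "-s", f"{seed + fraction:.3f}", "-o", out, bam]

    sub_a, sub_b = "sub_a.bam", "sub_b.bam"  # hypothetical intermediate files
    return [
        subsample(bam_a, fraction_a, sub_a),
        subsample(bam_b, fraction_b, sub_b),
        ["samtools", "merge", out_bam, sub_a, sub_b],
        ["samtools", "index", out_bam],
    ]


# Roughly 80% of the reads of one line mixed with 20% of the reads of the other.
commands = mixing_commands("BPK282.bam", "BPK275.bam", 0.8, 0.2, "mix_80_20.bam")
```

Each command list can be passed to subprocess.run(command, check=True) on a system where samtools is
installed. Note that the final allele proportions also depend on the relative depth of the two
original alignments.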
3.7.3 Genetic Variation Detection Among Samples with Different Phenotypes Including Drug Induction Experiments
Sequencing samples to identify genetic variations that can explain specific phenotypes is common.
One of the most common analyses is a comparison between a drug-sensitive line and an induced
drug-resistant line. In this case, it is essential to identify any genetic variations located in
coding regions or even noncoding regions. Further, we must not discard but explore base variations
that overlap with local copy number variations since it is known that some specific genes such as L.
donovani miltefosine transporter (LdMT) and aquaglyceroporin 1 (AQP1) can contain partial or full
deletions, SNPs, or indels simultaneously in cells [3]. In drug induction experiments, it is
essential to monitor the allele frequency changes at various intermediate stages to identify SNPs
critical to drug resistance. Here applying a population calling method among sensitive,
intermediate, and resistant lines will help to identify critical genetic variations because it
provides all allele frequency information at a position where at least one high score SNP is
present. For example, in the previously mentioned miltefosine drug resistance induction experiment
[29], the critical SNP at LdMT was not detected at the beginning of the induction experiment, but the
SNP allele frequency gradually increased and reached up to 100% in fully miltefosine-resistant
parasites (Fig. 4). To monitor this type of SNP allele transition, it was more effective to use a
population calling method. If an indi-
3.7.5 How to Handle Samples with Biased False SNPs
For a large sequencing project, sequence data can come from various
sources. Sequencing data quality may not be uniform, and some samples
may be of lower quality. Therefore, data from public archives must be
treated with caution, and it is critical to read the original paper in
which the sequence data were generated, because the paper often
describes read quality issues and practical solutions in detail when
the quality of some samples is lower than that of others. For example,
in our previous L. donovani project [4], 18 samples had lower sequence
quality and would have generated ten times more false positive SNPs
than the rest of the samples, but the problem was solved by applying
additional SNP screening conditions to these samples. If we obtain
data from a public database and observe an abnormal number of SNPs in
a limited number of samples, then it is appropriate to remove these
samples from the analyses rather than keep them, provided they are not
essential.
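One simple way to spot samples with an abnormal number of SNPs is to compare each sample with the cohort median. The sketch below assumes that a per-sample SNP count is already available; the sample names, the counts, and the fivefold cutoff are hypothetical and would need to be tuned to the data set.

```python
"""Sketch: flag samples whose SNP count is abnormally high compared with the cohort."""
from statistics import median


def flag_snp_outliers(snp_counts, max_fold=5.0):
    """Return sample names whose SNP count exceeds `max_fold` times the cohort median.

    `snp_counts` maps sample name to its number of called SNPs.
    `max_fold` is a hypothetical cutoff, not a recommended value.
    """
    cohort_median = median(snp_counts.values())
    if cohort_median == 0:
        # Without a usable baseline, report every sample that has any SNP.
        return sorted(name for name, count in snp_counts.items() if count > 0)
    return sorted(
        name for name, count in snp_counts.items()
        if count / cohort_median > max_fold
    )


# Made-up SNP counts per sample.
snp_counts = {
    "sample_A": 1020,
    "sample_B": 980,
    "sample_C": 1100,
    "sample_D": 10500,  # roughly ten times more SNPs than the others
    "sample_E": 950,
}

suspect = flag_snp_outliers(snp_counts)
print("samples to inspect or remove:", suspect)
```

A flagged sample is only a candidate for removal: a genuinely divergent isolate would also carry many SNPs, so read quality should be checked before a sample is discarded.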
3.7.6 Characterizing Depth and SNPs from Repetitive Regions
When a read can be mapped to multiple locations with the same mapping
score, aligners either select one location at random or ignore the
multiply mapped read. Selecting a random position is often more
appropriate for characterizing base variations and copy number
variations than keeping only uniquely mapped reads. When only uniquely
mapped reads are kept, the depth of repetitive regions such as GP63
and HSP70 would be underestimated
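The effect of the two strategies on read depth can be illustrated with a toy simulation. This is a sketch under strong assumptions, not a model of any particular aligner: every read is assumed to match all copies of a repeated gene equally well, and the function name, copy number, and read numbers are hypothetical.

```python
"""Sketch: why discarding multiply mapped reads underestimates depth in repeats."""
import random


def simulate_depth(n_reads_per_copy, n_copies, keep_multi_mapped, seed=0):
    """Return the number of reads assigned to each copy of a repeated gene.

    Every read matches all copies equally well. With `keep_multi_mapped`
    the read is placed on one copy chosen at random; otherwise it is dropped,
    which mimics keeping only uniquely mapped reads.
    """
    rng = random.Random(seed)
    reads_per_copy = [0] * n_copies
    total_reads = n_reads_per_copy * n_copies
    for _ in range(total_reads):
        if keep_multi_mapped:
            reads_per_copy[rng.randrange(n_copies)] += 1
    return reads_per_copy


# Hypothetical array of four identical gene copies, 50 reads sequenced per copy.
random_placement = simulate_depth(50, 4, keep_multi_mapped=True)
unique_only = simulate_depth(50, 4, keep_multi_mapped=False)

print("random placement:", random_placement, "total", sum(random_placement))
print("unique reads only:", unique_only, "total", sum(unique_only))
```

With random placement the total number of reads over the array is preserved, so the summed depth still reflects the copy number, whereas the unique-only strategy leaves the region with no coverage in this idealized case.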
3.7.7 Lower Input DNA Samples
For various technical and biological reasons, the depth of samples can
be lower than one or two reads per base. It is, however, still
possible to estimate somy and perform genotyping from such lower depth
samples. For example, the read depth for Leishmania donovani
amastigotes from a hamster was lower than 0.8 and too low to estimate
somy based on a regular depth method as discussed above. It was,
however, possible to estimate their somy based on the number of reads
per 1000 bp [6]. Genotyping using regular SNP calling methods is not
possible for such low depth samples, but if diagnostic SNP markers are
known from other higher depth samples, direct searching of base motifs
that contain a diagnostic SNP marker can be used for genotyping. For
copy number variation, it is likely possible to identify amplicons
with a high copy number using the depth calculated over a window of a
chosen size. Smaller scale CNVs can be detected by clustering many
lower depth samples together to increase resolution.
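Both ideas for low depth samples can be sketched in a few lines. The code below is an illustration under stated assumptions and not the procedure of [6]: somy is scaled so that the chromosome with the median read density is treated as disomic, and the chromosome names, lengths, read counts, reads, and marker motifs are all hypothetical.

```python
"""Sketch: somy from reads per 1000 bp, and genotyping by diagnostic motif search."""
from collections import Counter
from statistics import median


def reads_per_kb(n_reads, chrom_length):
    """Return the number of reads per 1000 bp for one chromosome."""
    return n_reads / (chrom_length / 1000)


def estimate_somy(read_counts, chrom_lengths, baseline_ploidy=2):
    """Scale read density so that the median chromosome has `baseline_ploidy` copies.

    Assumption: the chromosome with the median read density is disomic.
    """
    density = {c: reads_per_kb(read_counts[c], chrom_lengths[c]) for c in read_counts}
    baseline = median(density.values())
    return {c: baseline_ploidy * d / baseline for c, d in density.items()}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_marker_alleles(reads, motifs):
    """Count reads that contain each allele-specific motif on either strand.

    `motifs` maps an allele label to a short sequence spanning the diagnostic SNP.
    """
    counts = Counter()
    for read in reads:
        for allele, motif in motifs.items():
            if motif in read or reverse_complement(motif) in read:
                counts[allele] += 1
    return counts


# Made-up chromosome lengths (bp) and mapped read counts for a low depth sample.
chrom_lengths = {"chr1": 300_000, "chr2": 400_000, "chr3": 500_000}
read_counts = {"chr1": 1500, "chr2": 2000, "chr3": 5000}
somy = estimate_somy(read_counts, chrom_lengths)
print(somy)

# Made-up motifs that differ only at the diagnostic SNP (C versus T).
motifs = {"ref": "ACGTTGCAAGGCT", "alt": "ACGTTGTAAGGCT"}
reads = [
    "TTACGTTGCAAGGCTAA",                      # carries the reference motif
    reverse_complement("GGACGTTGTAAGGCTCC"),  # carries the alternative motif, reverse strand
    "CCCCCCCCCCCC",                           # unrelated read
]
allele_counts = count_marker_alleles(reads, motifs)
print(dict(allele_counts))
```

The motif must be long enough to be unique in the genome; otherwise reads from unrelated loci would be counted as evidence for an allele.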
3.8 Transcriptomics Analysis
We will briefly summarize the basics of upstream analysis of
transcriptomics sequencing and describe the read count normalization
specific to Leishmania and how to handle repetitive genes. RNA
sequencing analysis generally requires two to six replicates per
sample. When it is not feasible to have replicates in some
experimental and clinical settings, it is common to group samples
together based on phenotypic differences to increase statistical
power. It is more difficult to choose an optimal RNA sequencing
library than a DNA sequencing library because we must optimize many
factors such as the number of replicates, read depth, read length,
insert size, and single or paired reads [43–45].
RNA read mapping and read counting: For RNA analysis, STAR (Spliced
Transcripts Alignment to a Reference) is likely a convenient option in
many cases [46]. It can perform elaborate two-step mapping and read
counting by itself. It can also handle strand-specific RNA sequencing.
Leishmania does not have introns in general, so splicing-aware mapping
may not be needed, but it is an attractive feature. As a cautionary
note for STAR users, a gff annotation file from TriTrypDB must be
converted to a gtf annotation file even though STAR indicates that a
gff file is accepted. In general, STAR often produces empty read
counts when a gff file from TriTrypDB is used. Once read count data
are obtained, they can be analyzed by DESeq2, which provides
normalized read counts, fold changes, and Benjamini–Hochberg-adjusted
p-values. It is common to use a fold change cutoff of 2 and a
Benjamini–Hochberg-adjusted p-value <0.05 to define differentially
expressed genes.
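The two cutoffs can be applied to a results table as in the following sketch. It is an illustration only: in practice the adjusted p-values come from DESeq2, whose independent filtering can give slightly different values, whereas here the Benjamini–Hochberg adjustment is reimplemented in plain Python. The gene names, log2 fold changes, and p-values are made up. A fold change cutoff of 2 corresponds to an absolute log2 fold change of at least 1.

```python
"""Sketch: Benjamini-Hochberg adjustment and a fold change / adjusted p-value filter."""
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def differentially_expressed(genes, fold_cutoff=2.0, padj_cutoff=0.05):
    """Select genes with |log2 fold change| >= log2(fold_cutoff) and padj < padj_cutoff.

    `genes` is a list of (gene_id, log2_fold_change, p_value) tuples.
    """
    padj = benjamini_hochberg([p for _, _, p in genes])
    min_log2 = math.log2(fold_cutoff)
    return [
        (gene_id, lfc, q)
        for (gene_id, lfc, _), q in zip(genes, padj)
        if abs(lfc) >= min_log2 and q < padj_cutoff
    ]


# Made-up results: (gene, log2 fold change, raw p-value).
results = [
    ("gene_1", 2.5, 0.0001),
    ("gene_2", -1.2, 0.004),
    ("gene_3", 0.4, 0.001),   # significant but below the fold change cutoff
    ("gene_4", 1.6, 0.2),     # large change but not significant
    ("gene_5", -0.1, 0.9),
]

hits = differentially_expressed(results)
for gene_id, lfc, q in hits:
    print(f"{gene_id}\tlog2FC={lfc:+.2f}\tpadj={q:.4g}")
```

Both conditions must hold at once, so a gene with a small but highly significant change, or a large but noisy change, is not reported.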
4 Conclusion
Acknowledgments
References
16. Haddock SHD, Dunn CW (2011) Practical computing for biologists. Sinauer, Sunderland
17. Aslett M, Aurrecoechea C, Berriman M et al (2009) TriTrypDB: a functional genomic resource for the Trypanosomatidae. Nucleic Acids Res 38(suppl_1):D457–D462
18. Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15):2114–2120
19. Cuypers B, Domagalska MA, Meysman P et al (2017) Multiplexed spliced-leader sequencing: a high-throughput, selective method for RNA-seq in Trypanosomatids. Sci Rep 7(1):3725
20. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25(14):1754–1760
21. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4):357
22. Li H, Homer N (2010) A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 11(5):473–483
23. Zackay A, Cotton JA, Sanders M et al (2018) Genome wide comparison of Ethiopian Leishmania donovani strains reveals differences potentially related to parasite survival. PLoS Genet 14(1):e1007133
24. Coughlan S, Mulhair P, Sanders M et al (2017) The genome of Leishmania adleri from a mammalian host highlights chromosome fission in Sauroleishmania. Sci Rep 7:43747
25. Coughlan S, Taylor AS, Feane E et al (2018) Leishmania naiffi and Leishmania guyanensis reference genomes highlight genome structure and gene evolution in the Viannia subgenus. R Soc Open Sci 5(4):172212
26. Rastrojo A, García-Hernández R, Vargas P et al (2018) Genomic and transcriptomic alterations in Leishmania donovani lines experimentally resistant to antileishmanial drugs. Int J Parasitol Drugs Drug Resist 8(2):246–264
27. Valdivia HO, Reis-Cunha JL, Rodrigues-Luiz GF et al (2015) Comparative genomic analysis of Leishmania (Viannia) peruviana and Leishmania (Viannia) braziliensis. BMC Genomics 16(1):715
28. Dumetz F, Cuypers B, Imamura H et al (2018) Molecular preadaptation to antimony resistance in Leishmania donovani on the Indian subcontinent. mSphere 3(2):e00548–e00517
29. Shaw CD, Lonchamp J, Downing T et al (2016) In vitro selection of miltefosine resistance in promastigotes of Leishmania donovani from Nepal: genomic and metabolomic characterization. Mol Microbiol 99(6):1134–1148
30. Reis-Cunha JL, Rodrigues-Luiz GF, Valdivia HO et al (2015) Chromosomal copy number variation reveals differential levels of genomic plasticity in distinct Trypanosoma cruzi strains. BMC Genomics 16(1):499
31. Schwabl P, Imamura H, Van den Broeck F et al (2018) Parallel sexual and parasexual population genomic structure in Trypanosoma cruzi. bioRxiv. https://doi.org/10.1101/338277
32. Rogers MB, Downing T, Smith BA et al (2014) Genomic confirmation of hybridisation and recent inbreeding in a vector-isolated Leishmania population. PLoS Genet 10(1):e1004092
33. McKenna A, Hanna M, Banks E et al (2010) The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20(9):1297–1303
34. Marth GT, Korf I, Yandell MD et al (1999) A general approach to single-nucleotide polymorphism discovery. Nat Genet 23(4):452
35. Li H, Handsaker B, Wysoker A et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–2079
36. Iqbal Z, Caccamo M, Turner I et al (2012) De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet 44(2):226
37. Frith MC (2010) A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res 39(4):e23
38. Derrien T, Estellé J, Sola SM et al (2012) Fast computation and applications of genome mappability. PLoS One 7(1):e30377
39. Otto TD, Sanders M, Berriman M, Newbold C (2010) Iterative correction of reference nucleotides (iCORN) using second generation sequencing technology. Bioinformatics 26(14):1704–1707
40. González-de la Fuente S, Peiró-Pastor R et al (2017) Resequencing of the Leishmania infantum (strain JPCM5) genome and de novo assembly into 36 contigs. Sci Rep 7(1):18050
41. Cingolani P, Platts A, Wang LL et al (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6(2):80–92
42. Ewing B, Green P (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8(3):186–194
43. Schurch NJ, Schofield P, Gierliński M et al (2016) How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA 22(6):839–851
44. Fiebig M, Kelly S, Gluenz E (2015) Comparative life cycle transcriptomics revises Leishmania mexicana genome annotation and links a chromosome duplication with parasitism of vertebrates. PLoS Pathog 11(10):e1005186
45. Patino LH, Ramírez JD (2017) RNA-seq in kinetoplastids: A powerful tool for the understanding of the biology and host-pathogen interactions. Infect Genet Evol 49:273–282
46. Dobin A, Davis CA, Schlesinger F et al (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15–21
47. Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15(12):550