
Chapter 3

A Guide to Next Generation Sequence Analysis of Leishmania Genomes
Hideo Imamura and Jean-Claude Dujardin

Abstract
Next generation sequencing (NGS) technology has transformed Leishmania genome studies and has become an
indispensable tool for Leishmania researchers. Recent Leishmania genomics analyses have facilitated the
detailed discovery of various forms of genetic diversity, including single nucleotide polymorphisms (SNPs),
copy number variations (CNVs), somy variations, and structural variations, and have provided valuable
insights into the complexity of the genome and gene regulation. Many aspects of Leishmania NGS analyses
are similar to those of related pathogens like trypanosomes. However, the analyses of Leishmania genomes
face a unique challenge because of the presence of frequent aneuploidy. This makes characterization and
interpretation of read depth and somy a key part of Leishmania NGS analyses, because read depth affects
the accuracy of detection of all genetic variations. However, there are no general guidelines on how to
explore and interpret the impact of aneuploidy, and this has made it difficult for biologists and
bioinformaticians, especially beginners, to perform their own analyses and to interpret results across
different analyses. In this guide we discuss a wide range of topics essential for Leishmania NGS analyses,
ranging from how to set up a computational environment for genome analyses to how to characterize genetic
variations among Leishmania samples, and we focus particularly on chromosomal copy number variation and
its impact on genome analyses.

Key words Next generation sequencing, Bioinformatics, Somy variation, SNP calling, Leishmania

1  Introduction

Next generation sequencing (NGS) technologies have enabled us to study Leishmania genomes in greater
detail over the past 10 years, and NGS has become an indispensable tool in the molecular and evolutionary
biology of the parasite [1–4]. Among the genetic variations of Leishmania, such as SNPs, CNVs, aneuploidy,
and structural variations, aneuploidy is one of the most striking features: it occurs frequently in
cultured promastigotes [3, 5], while disomy is more common in amastigotes [6, 7]. Aneuploidy poses
serious challenges for genetic manipulations [8], and it can complicate the interpretation of
phylogenetic analyses. Therefore, read depth characterization in Leishmania sequencing is critical
because it affects all aspects of the NGS analyses [3–7, 9]. Many of the common NGS guidelines and
practices established for euploid and polyploid organisms are not readily applicable to genomes with
frequent aneuploidy. First, normalization factors must be calculated separately for each chromosome to
reflect aneuploidy. Second, aneuploidy is common in cultured promastigotes regardless of whether they
have been cloned, and the presence of different copy numbers of a given chromosome among cloned cells
(somy mosaicism) is considered to be common [10]. This somy mosaicism makes it difficult, but critical,
to characterize the somy values of all chromosomes. It is essential to carefully distinguish technical
artifacts from real chromosome copy number variations when somy mosaicism is also possible. In
Leishmania genome analyses, it is imperative to evaluate normalized read depth with and without somy
effects in order to properly attribute depth changes to local copy number variations or to somy
variations [3, 4, 11], and we will discuss this point in detail. Frequent aneuploidy is one of the major
differences between Leishmania and Trypanosoma genome analyses [12, 13].
This guide mainly focuses on practical and specific computational aspects of Leishmania NGS
sequencing analyses. However, we must emphasize that proper planning and meticulous preparation are
crucial. Before considering bioinformatics, we must thoroughly optimize many experimental details of
genome sequencing, including the experimental setup, number of samples, number of replicates, type of
DNA preparation kit, read length, and insert size [14, 15].
In this guide, we concentrate mainly on DNA genomics analyses and also briefly discuss RNA
sequencing. We first discuss how to set up computer environments and computational tools for NGS
analyses. Then we discuss key sequence processing steps such as read mapping, reference genome
evaluation, depth characterization, and SNP and indel characterization, which are summarized in a
schematic diagram (Fig. 1). We describe in detail the depth analyses that are often ignored or
misunderstood, since depth is the key factor for understanding Leishmania NGS results.

Joachim Clos (ed.), Leishmania: Methods and Protocols, Methods in Molecular Biology, vol. 1971,
https://doi.org/10.1007/978-1-4939-9210-2_3, © Springer Science+Business Media, LLC, part of Springer Nature 2019

2  Before Performing Genome Sequencing

2.1  Set Up a Linux Computer System

Most bioinformatic tools are developed for the Linux system. Therefore, it is highly recommended
to use either a Linux or Linux-based system. We briefly discuss the different options below, as well
as some practical solutions for people working with a Windows-based system.
Linux: Anyone performing sequencing analysis, regardless of background, must become familiar with
a Linux computer system and the key Linux commands used in genome sequence analyses. Many programs
are designed for Unix-like environments such as Linux and Mac OS, and these systems are most suitable
for sequencing analyses that generate large amounts of data.

[Figure 1: flow diagram of the analysis.
(1) Acquiring sequence reads and a reference genome: base quality check; masking reference.
(2) Mapping reads to the reference: read mapping producing an alignment (sam/bam); programs shown:
samtools, Picard, SMALT, BWA, bowtie2.
(3) Evaluating mapping.
(4) Depth evaluation: raw depth dr; depth per chromosome dch; median depth of all chromosomes dmch;
normalized depth dm = dr/dch; haploid depth dH = 2 dm; somy S = dch/dmch; full cell depth dF = S dH;
length bias correlation.
(4) SNPs/indels identification: calling SNPs and indels (GATK, mpileup, Freebayes); filtering SNPs
(GATK); evaluating impacts of variants (snpEff); individual calling; population calling; read allele
frequency; alternative allele frequency.]

Fig. 1 Schematic diagram of the key sequencing steps: (1) acquiring sequence reads and a reference
genome, (2) mapping reads to the reference, (3) evaluating mapping, and (4) depth evaluation and
SNPs/indels identification. The program names are shown to the left of the process they work on, and
some key processes and characteristics are shown with bullet points

Bio-linux: For beginners, it may be difficult to decide which programs to install, and there are
specialized Linux packages designed for sequence analysis such as Bio-linux
(http://environmentalomics.org/bio-linux/). This package offers a simple solution, but the programs in
the package tend to become outdated quickly. Therefore, it is recommended to check the versions of its
sequencing tools and to install updated versions of these tools individually.
Windows: In Windows, merely inspecting simple results can be daunting, and many essential tools
are not available for this environment. Therefore, it is essential to have access to some form of
Linux computer environment. For Windows users, Linux can be readily installed as a virtual operating
system within Windows (e.g., https://www.virtualbox.org), and Windows 10 has recently started
offering a Linux environment; thus, Windows users are able to install many sequencing tools directly
on this subsystem without third-party software. Alternatively, we can obtain an SSH client, a
terminal interface program, for Windows to connect to other Linux computers or to a larger
supercomputing system; PuTTY (https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html) is a
popular and simple option.
Once a Linux environment is set up, we can work through an introductory book such as Practical
Computing for Biologists [16], which covers topics from basic Linux skills to practical programming
skills for sequence analyses. Many introductory bioinformatics lectures are also available online,
and we can use them to start learning simple but critical skills. For example, we can learn basic
computer skills from a freely available document such as Unix & Perl Primer for Biologists
(http://korflab.ucdavis.edu/Unix_and_Perl/current.pdf), after which we will be ready to handle our
sequence data.

2.2  Setting Up Sequencing Analysis Tools

Once our computer is ready for sequence analyses, it is time to install various relevant programs.
We list a limited number of essential general software packages that will help us to analyze and
appreciate the sequence results. Many alternative programs are available, but we keep the list short
because once we are familiar with these tools, we will be able to obtain the additional programs we
need. We discuss some specific sequencing tools in detail in the upcoming sections.
Software managing programs: Installing programs can be complicated and time-consuming, but
several specialized programs help to install recent sequencing tools. For example, a software
managing program called "Homebrew" can be installed on both Mac (https://brew.sh) and Linux
(http://linuxbrew.sh) and is a convenient way to install the most recent sequencing tools, including
samtools. When a software managing program can no longer update programs, it is recommended to
install a new version of the Linux OS, which makes bioinformatics tasks much easier.
Scripting languages and their scientific packages: Python and Perl are popular, versatile
scripting languages that are well suited to processing sequencing data, including text, numbers, and
files. Python has many numerical and scientific packages for sequencing data processing. A Python
distribution and package manager called Anaconda (https://www.anaconda.com) helps users to install
most Python scientific packages, including matplotlib (https://matplotlib.org), numpy, scipy, and
pandas (https://www.scipy.org), for computation and visualization with ease. Biopython
(https://biopython.org) also provides bioinformatics tools, but its strength leans toward structural
biology. Perl is a flexible and versatile scripting language for handling characters and complex
text, and it is easy to create our own statistical functions; however, it lacks extensive
statistical and visualization modules. The Comprehensive Perl Archive Network (CPAN) offers various
convenient modules (https://www.cpan.org). Bioperl (https://bioperl.org) has many useful file
conversion tools.
R and gnuplot: R (https://www.r-project.org) is a comprehensive statistical tool that offers
many essential sequence tools, and Bioconductor (https://www.bioconductor.org) is specialized for
genomic and sequence data analysis. Gnuplot (http://www.gnuplot.info) is a simple but versatile data
visualization tool that is easy to use and is particularly useful for a quick initial inspection of
sequence depth and allele frequency.
Sequence viewers: The Integrative Genomics Viewer (IGV) is a powerful visualization tool for NGS
data (http://software.broadinstitute.org/software/igv/) and efficiently handles several sequence
alignment files, called bam files, at once. Artemis and Act (Artemis Comparison Tool) are also
genome browser and annotation tools that allow visualization of sequence features
(https://www.sanger.ac.uk/science/tools/artemis). Act is a uniquely convenient tool for comparing
multiple samples, and it shows the BLAST similarity between samples. Many other sequence viewers
exist, but IGV and Artemis are good starting points for visualizing large sequence data efficiently.
Online NGS discussion forums: NGS technologies and bioinformatics tools are rapidly evolving, and
well-maintained online NGS discussion forums such as Biostars (https://www.biostars.org) and
Seqanswers (http://seqanswers.com) are popular places to find the most recent information and
troubleshooting advice about NGS analyses.
Online discussion forum and mailing list for a program: Many commonly used NGS programs maintain
their own online discussion forum or mailing list, and these are good sources of up-to-date
information about the usage of these programs.

3  Sequence Analysis Steps

Leishmania sequencing analysis involves many interconnected components, and here we classify them
into four key parts for this guide: mapping reads to a reference, evaluating a reference,
characterizing read depth, and identifying SNPs and indels. These key parts are illustrated in
Fig. 1, and we will refer to each component as we walk through the different sequence analysis steps.

3.1  Obtain a Reference Genome and Obtain Sequence Reads

Most of the Leishmania and Trypanosoma reference genomes can be obtained from TriTrypDB
(http://tritrypdb.org/common/downloads/) [17]. However, other updated Leishmania reference genomes
are also hosted by Leishmania expression and sequencing projects (http://leish-esp.cbm.uam.es), and
an updated Leishmania donovani LdBPK282 reference can also be obtained from
ftp://ftp.sanger.ac.uk/pub/project/pathogens/Leishmania/donovani/LdBPKPAC2016beta/.
For real analyses or testing, we can start mapping the reads to a
reference if we already sequenced our own samples. Alternatively, we
can also find sequence reads using sample accession numbers
described in publications in public repositories such as EBI European
Nucleotide Archive (https://www.ebi.ac.uk/ena) or NCBI
Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra).

3.2  How to Evaluate Sequence Quality

To evaluate overall sequence quality, we need to measure various sequence features such as base
quality and read depth quality. High quality of bases, of read depth, and of the reference are
equally critical for accurate sequence analyses. If the read depth fluctuation is too high, or if
the reference is not assembled correctly or contains many repetitive regions, the identification of
all genetic variations is affected. Therefore, to thoroughly evaluate sequence quality, we must
cross-examine all aspects of the data shown in Fig. 1 as a whole, because when there are many
technical artifacts, it is difficult to identify the real genetic variations.
The quality of bases can be measured by a read base quality control (QC) program at the beginning,
but it can be measured more efficiently after mapping the reads, since low quality bases are trimmed
off by an aligner or can easily be screened out in the SNP calling process. Using alignment files,
we can evaluate read depth and base quality at the same time, and it is far more effective to
evaluate sequence quality by inspecting alignments in sequence viewers. Initial base quality control
is more essential for de novo assembly, but it is often counterproductive to assume that initial
read quality control and read trimming guarantee high-quality genome sequencing results without
thoroughly examining other sequencing properties such as the quality of read depth, mapping, and the
reference. So, first, we briefly describe how to check base quality, and then we describe how to
evaluate read depth and the overall quality of sequence data.
Read quality check and base trimming: Read quality control (QC) programs such as FastQC
(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) can be applied to measure various basic
read quality factors, such as base quality, overall GC content, GC content per position, duplicate
level, and length distribution. FastQC produces read quality information in an html file that we can
view in a browser (Fig. 1). Reads can then be trimmed using a program such as trimmomatic [18] to
remove low quality bases [19]. It is essential to perform read quality control with a read quality
evaluation program if we obtain sequence reads from public short read archive (SRA) databases, if we
are new to sequencing, if we are testing new methods, if we are performing de novo assembly, or if
we are particularly interested in structural variation in detail, because the quality of reads can
vary significantly depending on the DNA preparation, the sequencing platform, and many other
factors. If we obtain data from public data archives, the sequence providers may not disclose
detailed information on read quality, and such information is often not easy to find in a public
database.
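To make the notion of base quality concrete, the following minimal Python sketch decodes the quality line of a FASTQ record and trims low quality bases from the 3' end. It assumes the common Phred+33 encoding; the function names and the cutoff of 20 are hypothetical choices for illustration, and the simple end trimming shown here is not the algorithm used by FastQC or trimmomatic.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding (offset=33); older data may use another offset.
    """
    return [ord(ch) - offset for ch in quality_line]


def mean_quality(quality_line, offset=33):
    """Average Phred score of one read (0.0 for an empty read)."""
    scores = phred_scores(quality_line, offset)
    return sum(scores) / len(scores) if scores else 0.0


def trim_3prime(sequence, quality_line, min_q=20, offset=33):
    """Remove trailing bases whose Phred score is below min_q.

    This is a deliberately simple illustration, not a published trimming method.
    """
    scores = phred_scores(quality_line, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality_line[:end]
```

For example, trim_3prime("ACGT", "II5#") keeps the first three bases, because only the last base has a score below 20.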
Sequence quality check after mapping: We may skip the initial base QC and start mapping reads to
a reference if we are not performing de novo assembly and not analyzing repetitive regions in
detail. This allows us to focus on evaluating the sequence data as a whole and to screen out low
quality reads and bases if necessary. It is more effective to evaluate sequence quality after
mapping reads because read aligners and SNP callers perform base quality control by themselves and
can handle lower quality bases, and because many reads that produce false positive SNPs do not
necessarily have lower base quality. Many sequence providers already perform their own quality
control and provide an initial sequence read quality report. As for read depth quality, however,
they only provide an expected average read depth, which does not tell us whether the read depth
quality is sufficient for detailed copy number variation analyses. As for the handling of low
quality bases, an aligner such as SMALT (www.sanger.ac.uk/resources/software/smalt/) can trim off
low quality bases during read mapping. After mapping reads, we can examine SNPs, depth variations,
and somy variations in detail to critically test sequence quality and judge whether the sequence
variations observed are real or just technical artifacts. Typical signs of unreliable sequence
quality include higher SNP strand biases, uneven read depth coverage, large-scale depth fluctuation,
uneven coverage over genes and intergenic regions, and uneven chromosome copy numbers. Higher SNP
strand biases were more common in the past, but this has improved in recent years. However,
irregular read depth with higher fluctuation is still common, and it cannot be corrected. This depth
problem makes it hard to interpret copy number variation accurately at gene level and to quantify
read allele frequency at heterozygous SNP sites, and it forces us to lower the resolution of base
and depth analyses.
Impact of sequence library preparation on sequence analyses: In practice, sequence library
preparation comes earlier in a sequencing experiment, but we discuss it here as a potentially
detrimental factor that impairs proper read depth and therefore hinders subsequent sequence
analyses. These negative impacts cannot be detected by the initial read quality controls, only by
systematic depth analysis. Library preparations must be compared and tested thoroughly [14, 15]
before a library is selected. Here we briefly address the sequencing library preparation issues we
have observed. In general, the TruSeq DNA library preparation kit without PCR amplification
consistently produced higher quality depth coverage than the Nextera XT DNA library and Nextera DNA
library preparation kits (Illumina Inc.) in our quality control experiments, in which genomes of
several strains were sequenced using different protocols. The most commonly observed Nextera depth
deviation was low or very little depth in intergenic regions regardless of their repetitiveness.
This was observed in our quality control experiments and in Nextera reads downloaded from a
previously published L. infantum experiment. SNP calling is not severely affected by this for runs
with depth over 30×, but the resolution for detecting small changes in read depth allele frequency
is reduced for samples with depth below 15×. For CNV detection, TruSeq provides more reliable read
depth coverage. The Nextera XT kit performs far better for samples with a limited amount of
DNA [13], but when a sufficient amount of DNA is available, it is wise to use methods that produce
higher quality results even though that might require some extra steps. The TruSeq kit can also
capture more kDNA than the Nextera XT kit. We also found that DNA- and RNA-specific kits work better
than dual-use DNA and RNA kits, even though dual-use kits are more time effective in the
experiments. Read depth with high fluctuation severely reduces the resolution for detecting copy
number. Unfortunately, this excessive depth variability is common in many studies; therefore, it is
essential to test and compare sequencing kits thoroughly.

3.3  Mapping Reads

Read mapping algorithms: The first step of sequence analysis is to map read files that contain
sequences together with their base quality scores (FASTQ) to a reference genome (Fig. 1). There are
two main families of read mapping algorithms, and we briefly describe both. One is based on hash
indexing and the other on the Burrows–Wheeler transformation of character strings. For a hash-based
alignment method, mapping FASTQ reads to a reference involves two steps: indexing the reference and
mapping reads to the reference. Indexing creates an indexed reference database, which makes the
reference genome readily accessible for quick searching, instead of performing an intensive base
similarity search across the whole reference. Using the indexed reference database, an aligner
identifies candidate matching positions in the reference by hash search and then performs a more
rigorous search for the best matches around these candidate positions. The alternative algorithm is
based on suffix/prefix digital trees (Burrows–Wheeler transformation) and can store genetic
variations more efficiently [20–22]. The size of Leishmania reference genomes is around 32 million
bases, roughly 100 times smaller than a human genome reference, which makes it possible to perform a
more thorough and sensitive alignment search than the default parameters provide.
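The hash indexing idea can be illustrated with a toy Python sketch. It is not the implementation of SMALT or any other aligner: it only shows how words of a fixed length, sampled at a fixed step along the reference, can be stored in a hash table and then used to propose candidate mapping positions for a read. The function names and the short example sequence are hypothetical.

```python
from collections import defaultdict


def build_index(reference, word_len=13, step=3):
    """Hash every word of length word_len, sampled every `step` bases, to its positions."""
    index = defaultdict(list)
    for pos in range(0, len(reference) - word_len + 1, step):
        index[reference[pos:pos + word_len]].append(pos)
    return index


def candidate_starts(read, index, word_len=13):
    """Rank candidate reference start positions by the number of supporting words.

    Every word of the read is looked up; a hit at reference position p for the
    word at read offset o votes for the read starting at p - o.
    """
    votes = defaultdict(int)
    for offset in range(len(read) - word_len + 1):
        for ref_pos in index.get(read[offset:offset + word_len], ()):
            votes[ref_pos - offset] += 1
    # Most votes first; ties are broken by position for reproducibility.
    return sorted(votes, key=lambda start: (-votes[start], start))


reference = "ACGTTGCAAGCTTAGGCTAACGGATCCGTA"
index = build_index(reference, word_len=4, step=2)
read = reference[7:19]
best = candidate_starts(read, index, word_len=4)[0]  # position 7 in this toy case
```

A real aligner would follow this candidate search with a rigorous alignment around each candidate position. A shorter word length and a smaller step give more hits per read, which is more sensitive but slower.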
Read aligners SMALT, BWA, and bowtie2: For read mapping,
SMALT based on a hash algorithm, BWA [20] and bowtie2 [21],
based on Burrows–Wheeler transformation, are often used in
Leishmania sequencing studies. They are all effective aligners, and
we can select one after testing these.
SMALT has been used for various Leishmania sequencing analyses [3, 6, 11, 23–25]. We can use
SMALT v0.7.4 with exhaustive searching (-x), a sequence match threshold of 80% (-y 0.8), a reference
hash index word length of 11–13 bases, and a sliding step of 2 or 3. For example, a word length of
11 with a sliding step of 2 performs a more thorough hash-based search, but it is slower than a word
length of 13 with a sliding step of 3. For shorter reads of 50 or 76 bp, a hash index of 11 and a
sliding step of 2 may be suitable, but for reads longer than 100 bp, a hash index of 13 and a
sliding step of 3 are more efficient overall. The main benefits of SMALT are that it can apply an
exhaustive Smith–Waterman search after the initial hash word search for optimal mapping positions,
that it can trim low quality bases, and that it can properly identify small indels without splitting
them into two locations. This last feature performed better than the previous GATK realignment
protocol, which created multiple indels for a single indel. Unfortunately, SMALT is no longer
updated often, whereas BWA and bowtie2 are updated more frequently.
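The SMALT settings described above can be assembled into command lines as in the following Python sketch. It only builds argument lists and does not run anything. The file names, the index name, and the helper names are hypothetical, the rule for reads of exactly 100 bp is our assumption, and the output format is left at the program default; the options should be checked against the manual of the installed SMALT version.

```python
def settings_for_read_length(read_length_bp):
    """Word length and sliding step following the text above.

    Assumption: reads of 100 bp or shorter use 11/2, longer reads use 13/3.
    """
    return (13, 3) if read_length_bp > 100 else (11, 2)


def smalt_commands(reference_fasta, reads_1, reads_2,
                   index_name="ref_index", word_len=13, step=3,
                   min_identity=0.8, out_sam="mapped.sam"):
    """Build SMALT index and map commands as argument lists (nothing is executed)."""
    index_cmd = ["smalt", "index", "-k", str(word_len), "-s", str(step),
                 index_name, reference_fasta]
    # -x requests the exhaustive search, -y sets the minimum fraction of matching bases.
    map_cmd = ["smalt", "map", "-x", "-y", str(min_identity),
               "-o", out_sam, index_name, reads_1, reads_2]
    return index_cmd, map_cmd


word_len, step = settings_for_read_length(76)
index_cmd, map_cmd = smalt_commands("reference.fa", "reads_1.fastq", "reads_2.fastq",
                                    word_len=word_len, step=step)
```

Each list can be passed to subprocess.run(cmd, check=True) once the paths point to real files.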
BWA-mem was used in [7, 19], and bowtie2 was also used in many Leishmania studies [26, 27]. In
many cases, their default parameters are sufficient for Leishmania sequence analyses. With its
default parameters, bowtie2 automatically adjusts the search word length based on read length and
maps reads without trimming, while both SMALT and BWA-mem trim end bases, so bowtie2 might be
beneficial for users who need to focus on reads mapped over their full length. Many other aligners
exist, but it is important to choose properly maintained programs that are used by many users, so
that support and suggestions are available. It is often advisable to avoid the first version of a
new program, since it will inevitably contain bugs, and to avoid programs that are rarely used in
sequence analyses of Leishmania or similar organisms.

3.4  Characterizing Depth: Aneuploidy

3.4.1  Normalized Chromosome Read Depth

Two-loop depth estimation: To estimate a chromosome median depth dch, we first measure the average
and standard deviation of the raw read depth; then, in a second loop, read depth is measured again
using only positions where the depth is within one standard deviation of the median value. This
two-loop method estimates a proper median read depth by removing outliers such as assembly gaps,
spurious high coverage regions, or real copy number variant loci. Positions with zero depth should
be excluded so that large gaps in a chromosome do not skew the depth estimate. We define a
normalized depth dm as the raw depth dr divided by the median depth of its chromosome dch; that is,
dm = dr/dch (Figs. 1 and 2a). The median value of the normalized depth dm is used for characterizing
depth variation among different strains regardless of ploidy differences. The haploid depth can then
be defined as dH = 2 × dm, where the factor 2 reflects that each chromosome has two copies in
disomic cells. The variation of the normalized depth dm can be approximated as
SD(dm) = SD(dr)/dch, where SD represents a standard deviation, and the values of SD(dm) can be used
to measure the depth quality of samples.
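The two-loop estimate and the normalized depth can be sketched in Python as follows. This is a minimal illustration under our own assumptions rather than the exact procedure of the cited studies: the first loop uses the median of the nonzero depths as the center and their population standard deviation as the spread, and the function names are hypothetical.

```python
import statistics


def two_loop_median(depths):
    """Estimate the chromosome median depth dch in two passes.

    Pass 1: drop zero-depth positions, then take the median and SD.
    Pass 2: take the median again over positions within one SD of that median.
    """
    nonzero = [d for d in depths if d > 0]
    if not nonzero:
        return 0.0
    center = statistics.median(nonzero)
    spread = statistics.pstdev(nonzero)
    kept = [d for d in nonzero if abs(d - center) <= spread]
    return float(statistics.median(kept)) if kept else float(center)


def normalized_depth(depths, d_ch):
    """dm = dr / dch for every position."""
    return [d / d_ch for d in depths]


def haploid_depth(d_m):
    """dH = 2 x dm, assuming a disomic base state."""
    return [2 * x for x in d_m]


# Toy per-base depths: two gap positions and one spurious high coverage position.
raw = [0, 0, 30, 31, 29, 30, 32, 28, 30, 300, 31, 29]
d_ch = two_loop_median(raw)
```

In this toy example the gap positions and the position with depth 300 are excluded, so the estimate stays at the depth of the bulk of the chromosome.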

3.4.2  Somy

We need to normalize for the variation in DNA yield between sequence runs to allow proper
interstrain comparison. For this, we express the median of all chromosome median depths dch of a
strain as dmch and use it for interstrain depth normalization. Somy can then be written as
S = 2 × dch/dmch, and the full cell depth dF, which reflects ploidy differences, can be defined as
dF = S × dm = 2 × dch/dmch × dm = 2 × dr/dmch, where × represents multiplication. Note that the
multiplication factor 2 is used because the most frequent base somy is assumed to be disomic; if the
base somy is trisomic, 2 must be replaced by 3 in the formulas. The ranges of monosomy, disomy,
trisomy, tetrasomy, and pentasomy can then be defined by the full cell normalized chromosome depth,
or somy S, as S < 1.5, 1.5 ≤ S < 2.5, 2.5 ≤ S < 3.5, 3.5 ≤ S < 4.5, and 4.5 ≤ S < 5.5, respectively
(Figs. 2b and 3a). It is possible to normalize by the average of the per-chromosome average or
median depths, but it is necessary to use the median of the per-chromosome average or median depths
to obtain somy values that are close to integers; otherwise, the somy values of all chromosomes may
become noninteger intermediate values, which are not convenient for subsequent analyses. Somy
variation can be approximated as S × SD(dm), where SD represents a standard deviation [3–5].
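The somy formula and the somy ranges can be written out as a short Python sketch. The chromosome names and depth values are invented for illustration, the function names are hypothetical, and the label for values of 5.5 or more is our own addition.

```python
import statistics

# Upper bounds of the somy ranges given in the text.
SOMY_CLASSES = [(1.5, "monosomy"), (2.5, "disomy"), (3.5, "trisomy"),
                (4.5, "tetrasomy"), (5.5, "pentasomy")]


def somy_values(chrom_median_depths, base_somy=2):
    """S = base_somy x dch / dmch, with dmch the median of all chromosome medians."""
    d_mch = statistics.median(chrom_median_depths.values())
    return {chrom: base_somy * d_ch / d_mch
            for chrom, d_ch in chrom_median_depths.items()}


def classify_somy(s):
    """Assign a somy value to the range it falls in."""
    for upper, label in SOMY_CLASSES:
        if s < upper:
            return label
    return "above pentasomy"


depths = {"chr1": 40.0, "chr2": 40.0, "chr3": 60.0, "chr4": 40.0, "chr5": 80.0}
somy = somy_values(depths)
labels = {chrom: classify_somy(s) for chrom, s in somy.items()}
```

Because dmch is a median, the two chromosomes with elevated depth do not shift the baseline, and the disomic chromosomes receive a somy value of exactly 2 in this example.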

3.4.3  Copy Number Variation

To evaluate copy number variations at gene level, we define the average haploid depth per gene
without its somy impact as dHG and the full cell depth per gene with its somy impact as dFG; their
relationship is given by dFG = S × dHG. To evaluate copy number variations in general, average
values of the haploid depth dH and the full cell depth dF in windows of 1000 or 2000 bases can be
used, and if the depth variation is high, the window size must be increased. Sliding windows of
200 bases were used to measure copy number variations in Imamura 2016 [3]. This method resolved
smaller scale CNVs, but it was difficult to identify CNV boundaries and thus to find CNVs shared
among the samples. Therefore, it is more practical to use a wider window size so that a CNV shared
by many strains has the same CNV boundaries. CNVs can become statistically significant in two ways:
a statistically higher copy number and a statistically longer CNV. The cutoff values for these
statistical significances must be defined for each sample set because they depend on the size of the
vertical depth fluctuation and the size of the horizontal depth fluctuation of each data set. The
z-score can be used to find optimal cutoffs.
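Window averaging and z-scores can be sketched as follows. This minimal Python example uses non-overlapping windows for simplicity, whereas sliding windows are also mentioned above; the function names and the toy depths are hypothetical.

```python
import statistics


def window_means(depths, window=1000):
    """Average depth in consecutive, non-overlapping windows.

    The last window may be shorter than `window`.
    """
    return [statistics.fmean(depths[i:i + window])
            for i in range(0, len(depths), window)]


def z_scores(values):
    """Standard score of each window relative to all windows."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]


# Toy haploid depths: three windows near 2 and one duplicated window near 6.
toy = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 6, 6, 6, 6]
means = window_means(toy, window=4)
scores = z_scores(means)
```

A cutoff on the z-score then has to be chosen for each data set, since it depends on how strongly the depth of that data set fluctuates.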

[Figure 2: two bar charts. Panel A: raw chromosome depth (median per chromosome, y-axis 0–80) for
chromosomes 1–36, with the average and the median dmch marked. Panel B: somy (y-axis 0–4) for
chromosomes 1–36.]

Fig. 2 Chromosome depth and somy: (a) Median chromosome depths are given in dark grey. The median
depth of all chromosome median depths dmch is more suitable for somy normalization than the average
depth of all chromosome median depths. (b) These chromosome median depths can be converted to somy
values using dmch

We illustrate how to perform CNV analyses based on those in a previous work [3]. For a CNV
analysis, it is essential to define a baseline haploid depth level using the median or average
haploid depth of a number of strains. If wild type strains and strains selected in vitro for drug
resistance are compared, the average haploid depth of the wild type can be used as the baseline
haploid depth level [28, 29]. For CNV detection, we need to consider CNV length along with depth,
because long heterozygous deletions or duplications can be detected at a smaller depth deviation
than shorter variants of similar copy number. Therefore, different depth thresholds can be used for
different lengths. For example, in Imamura et al. [3], for CNVs of 2–5 kb, the threshold was five
times the standard deviation of the chromosomal depth; for CNVs of 5–20 kb, the threshold was three
times this value; for CNVs of >20 kb, the threshold was 1.5 times this value (Fig. 3b).
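The length-dependent thresholds quoted above can be expressed as a small Python sketch. The treatment of the exact boundaries (5 kb and 20 kb) and of CNVs shorter than 2 kb is our assumption, since the text does not specify it, and the function names are hypothetical.

```python
def depth_threshold_factor(cnv_length_bp):
    """Multiple of the chromosomal depth SD required for a CNV of this length.

    Assumed boundaries: 2 kb <= length < 5 kb -> 5, 5 kb <= length <= 20 kb -> 3,
    length > 20 kb -> 1.5. Shorter candidates are not evaluated (None).
    """
    if cnv_length_bp < 2000:
        return None
    if cnv_length_bp < 5000:
        return 5.0
    if cnv_length_bp <= 20000:
        return 3.0
    return 1.5


def is_candidate_cnv(mean_haploid_depth, baseline_depth, chrom_depth_sd, cnv_length_bp):
    """True if the depth deviates from the baseline by more than the length-specific threshold."""
    factor = depth_threshold_factor(cnv_length_bp)
    if factor is None:
        return False
    return abs(mean_haploid_depth - baseline_depth) > factor * chrom_depth_sd
```

With a baseline of 2.0 and a standard deviation of 0.2, a region with a mean haploid depth of 2.8 passes as a 10 kb CNV (threshold 0.6) but not as a 3 kb CNV (threshold 1.0).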

[Figure 3: schematic of reads, depth, and reference tracks. Labels:
whole chromosomal duplication (somy variation); large-scale deletion
and duplication; amplicon 15K; linear episome up to 300K.]

Fig. 3 Somy variation and local copy number variations: (a) Chromosome
copy numbers are quite variable in many chromosomes of Leishmania
promastigotes. (b) Large-scale deletions and duplications are also
common. Long and high copy number amplifications containing a few genes
spanning about 15,000 bp and long linear episomes spanning up to
300,000 bp have been observed in Leishmania donovani strains [3]

3.4.4  Length Bias of Somy and Local Somy Normalization

In some previous studies, we frequently observed a somy bias associated
with chromosome length. For example, somy values of shorter chromosomes
often tended to be smaller in samples sequenced on the Illumina Genome
Analyzer II [4, 5]. Similar length-dependent skews have also been
observed in Illumina HiSeq results [6]. However, it is clear that these
somy biases were technical artifacts, because when the same samples
were sequenced again, none of these depth biases were observed in high
quality sequence runs. Unfortunately, it is still common to interpret
these somy trends as real biological phenomena [5, 27], but it would be
more appropriate to interpret such length trends as technical
artifacts. If they are thought to be real, it is critical to confirm
them using different methods. Fortunately, it is often possible to
reduce this bias by applying a bias correction to the somy estimation
[6]. Instead of calculating a median depth based on all chromosomes, we
can calculate this value based on the median values of neighboring
chromosomes, whose lengths are similar since chromosomes are numbered
according to increasing size. Thus, somy values can be calculated using
a median depth based on the median read depths of 15 nearby
chromosomes: the seven neighboring chromosomes on each side and the
chromosome itself. When a chromosome does not have seven neighboring
chromosomes on one side, mirror-imaged values are used, so that some
depth values are used twice. This correction will not work when
trisomic and tetrasomic chromosomes cluster together among chromosomes
of similar length, because it would wrongly suppress valid higher somy
values. In that case it is necessary to perform local normalization
case by case; when the bias is large and random, it may not be possible
to fix it at all.
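
The local normalization described above can be sketched as follows.
This is a minimal illustration, not the implementation used in [6]: the
function name, the disomic baseline (somy = 2 x depth / local median),
and the choice to replace a missing neighbor at distance k by the
neighbor at distance k on the opposite side are assumptions made for
this example.

```python
from statistics import median


def local_somy(chrom_depths, half_window=7, baseline_ploidy=2):
    """Estimate somy per chromosome from a local median depth.

    chrom_depths: median read depth per chromosome, ordered by
    chromosome number (i.e., by increasing size).
    """
    n = len(chrom_depths)
    somies = []
    for i, depth in enumerate(chrom_depths):
        window = [depth]
        for k in range(1, half_window + 1):
            # Take the neighbor at distance k on each side; if one side
            # runs off the list, mirror the value from the other side.
            for j, mirror in ((i - k, i + k), (i + k, i - k)):
                if 0 <= j < n:
                    window.append(chrom_depths[j])
                elif 0 <= mirror < n:
                    window.append(chrom_depths[mirror])
        somies.append(baseline_ploidy * depth / median(window))
    return somies
```

As the text warns, a run of neighboring trisomic chromosomes would
raise the local median and pull their estimated somy back toward 2, so
the output should be compared with the genome-wide estimate.
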

3.4.5  Somy Estimation for Sequences with Higher Depth Variability

When the depth of samples is not sufficient or the variability of depth
is too large, it is necessary to use other normalization factors, such
as various percentile depths, or to obtain depth from single copy genes
[30] or from nonrepetitive regions; the latter was found to be more
effective in a detailed population analysis of Trypanosoma cruzi [31].
To characterize significant differences, it is beneficial to require
both biological and statistical significance. Previously, we used the
following conditions: S-values should differ by more than 0.5 and shift
from one somy distribution range to another, for instance, from 1.5–2.5
to 2.5–3.5. A t-test statistic can be calculated directly from binned
depth [6] or from the average depth, standard deviation, and number of
data points [28].
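
The two criteria above can be sketched as follows. The text does not
say which form of the t-test was used, so Welch's unequal-variance
statistic computed from summary values is an assumption here, as are
the function names and the rounding rule that maps an S-value to its
somy range.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic from mean, standard deviation, and count."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


def somy_range(s_value):
    """Map an S-value to its range: 1.5-2.5 -> 2, 2.5-3.5 -> 3, ..."""
    return math.floor(s_value + 0.5)


def biologically_different(s_a, s_b, min_difference=0.5):
    """S-values differ by more than 0.5 and fall in different ranges."""
    return (abs(s_a - s_b) > min_difference
            and somy_range(s_a) != somy_range(s_b))
```

For example, S-values of 2.1 and 2.9 satisfy both conditions, while 2.6
and 3.4 differ by 0.8 but stay within the same 2.5–3.5 range and are
therefore not reported.
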

3.4.6  Somy and Alternative Read Allele Frequency

It is possible to estimate somy values from alternative read allele
frequency because, when there are sufficient heterozygous SNPs, the
read allele frequency distribution reflects somy in most cases [5, 7].
However, read allele frequency can drastically change along a
chromosome without any somy or copy number variation [11, 13, 31, 32];
therefore, we must carefully interpret the relationship between somy
and read allele frequency.
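
As a reminder of why allele frequency carries somy information, the
sketch below lists the alternative allele fractions expected at
heterozygous sites for a given integer somy. It assumes that every
chromosome copy contributes reads equally and that a site carries k
alternative copies out of n; the function name is hypothetical.

```python
from fractions import Fraction


def expected_het_fractions(somy):
    """Expected alternative allele fractions at heterozygous sites.

    A site with k alternative copies out of `somy` copies is expected
    at fraction k / somy, for k = 1 .. somy - 1.
    """
    return [Fraction(k, somy) for k in range(1, somy)]
```

A disomic chromosome gives a single peak at 1/2, a trisomic chromosome
gives peaks at 1/3 and 2/3, and a tetrasomic chromosome gives peaks at
1/4, 1/2, and 3/4. Observed distributions are broader than these ideal
values and, as noted above, can shift without any change in somy.
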

3.4.7  Evaluating Somy Based on Binned Alternative Read Allele Frequency

When the depth of sequence is low and all the chromosomes have similar
read depth, we can still estimate somy from read allele frequency
estimated from multiple SNP sites (e.g., 1000 SNP sites) [13]. This
method was applied to establish the evidence for viable and stable
triploid Trypanosoma congolense parasites during the life cycle [13].
This method can be used to monitor somy variations across different
environmental conditions when there are sufficient heterozygous SNPs
and sufficient DNA from parasites at different stages, because it
requires the comparison of aggregated read allele frequencies between
different samples.

3.5  Characterizing Base Variations

3.5.1  Individual and Population Variant Calling Modes and Consensus Method

Once alignment files (bam) are created, we can identify genetic
variants such as single nucleotide polymorphisms and small indels with
these bam files. We describe three ways to identify genetic variations.
The first is an individual variant calling mode that identifies genetic
variations per sample independently from any other samples, and is the
simplest method [4, 11, 23, 32]. The second is a population variant
calling mode that identifies genetic variations from multiple samples
simultaneously [6, 12, 13, 28, 31]. The third is consensus variant
calling, which obtains a consensus of different variant calling
methods; this can be done in individual [3, 24] or population mode or
as a hybrid of both, though these more complicated cases will not be
discussed here.
Individual variant calling: An individual variant calling method is the
one many people are familiar with and is commonly used for analyzing
various Leishmania samples. This method is suitable to characterize a
single sample or several unrelated samples in detail. The advantages of
this approach are its simplicity and the fact that SNP calling
parameters can be adjusted to each sample. The main disadvantage of
this method is that it can miss a SNP whose read allele frequency is
0.05 or lower (e.g., a position with total depth of 20, 19 reference
allele bases, and 1 alternative allele base).
Therefore, individual calling will ignore SNPs with a minor read allele
frequency unless SNP calling parameters are modified. This might be
insignificant in many analyses, but we will illustrate below that these
minor reads can play critical roles in many experiments (Fig. 4).

[Figure 4: line plot. x-axis: Miltefosine (µM) at 0, 3, 6, 12, 35, 49,
61, and 74; y-axis: Read allele frequency (0 to 1); series: Sensitive
and Resistant.]

Fig. 4 Alternative allele frequency at a LdMT SNP site in a miltefosine
induction experiment: It is important to characterize transitional
minor allele frequency to monitor development of drug resistant
parasite cells. Minor read allele frequency 0.0164 would have been
totally ignored if an individual SNP calling was used. The black line
represents read allele frequency associated with miltefosine sensitive
cells and the gray line represents alternative (SNP) read allele
frequency associated with miltefosine-resistant cells. The x-axis
represents miltefosine concentrations

Population variant calling: A population variant calling method is not
commonly used for analyzing Leishmania samples, although it is suitable
to characterize and compare many samples effectively. In most cases, we
perform sequence analysis to compare multiple samples. Therefore, it is
essential to characterize genetic variations among all the samples to
establish the presence or absence of a given genetic variant across all
samples at a given position.
Comparison of individual and population calling methods: To illustrate
an advantage of population calling over individual calling, consider a
SNP position at which, in an individual SNP calling process, we have a
read depth of only 2 for the alternative base and a read depth of 48
for the reference base. This SNP would be ignored because an
alternative read allele needs to have a frequency higher than 15% on a
disomic chromosome. However, in a population calling process such minor
alternative bases can be identified when other samples have a
homozygous SNP or a clear heterozygous SNP where the alternative allele
frequency is higher than 20–30%. Therefore, it is beneficial to apply a
population SNP calling method or a consensus SNP calling method [29].
Subtle, consistent, gradual read allele frequency changes were key to
finding a SNP critical to drug resistance in the miltefosine (MIL) drug
induction experiment. Individual SNP calling could have been used in
this case, but then the method would have had to be performed multiple
times to characterize the read allele frequency of all samples. This
was one of the main reasons we developed the consensus method COCALL in
order not to miss minor read allele SNPs among 206 strains [3] and
among drug induction samples [29], since population methods were still
under development at that time.
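
The difference between the two modes can be illustrated with a toy
model of the example above. It is only a sketch of the reporting logic
and does not reproduce how GATK, Freebayes, or samtools compute
genotype likelihoods; the function names and the use of a single
frequency threshold are assumptions.

```python
def alt_frequency(ref_reads, alt_reads):
    """Alternative read allele frequency at one site in one sample."""
    total = ref_reads + alt_reads
    return alt_reads / total if total else 0.0


def individual_call(ref_reads, alt_reads, min_frequency=0.15):
    """Individual mode: the site is reported only if this sample passes."""
    return alt_frequency(ref_reads, alt_reads) > min_frequency


def population_frequencies(counts, min_frequency=0.15):
    """Population mode: if any sample passes, report every sample.

    counts: dict mapping sample name -> (ref_reads, alt_reads).
    Returns a dict of frequencies, or an empty dict if no sample passes.
    """
    frequencies = {name: alt_frequency(ref, alt)
                   for name, (ref, alt) in counts.items()}
    if any(f > min_frequency for f in frequencies.values()):
        return frequencies
    return {}
```

With 48 reference and 2 alternative reads, `individual_call` rejects
the site, but `population_frequencies` still reports its frequency of
0.04 when another sample at the same position is homozygous for the
alternative allele.
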
SNP calling programs: Genome Analysis Toolkit (GATK) [33], Freebayes
(https://github.com/ekg/freebayes) [34], and samtools mpileup [35] are
popular genome variant callers and are shown in boldface in the
following paragraphs (Fig. 1).
GATK comes with many versatile tools and extensive, well-maintained
user guides that beginners can follow; therefore, it is likely the most
valuable option for many readers. It has several SNP calling tools such
as UnifiedGenotyper and HaplotypeCaller [31]. UnifiedGenotyper has a
more mature population calling algorithm, while HaplotypeCaller can
handle a sample with many indels much better than UnifiedGenotyper
since HaplotypeCaller performs local de novo assembly for indels.
Freebayes is very sensitive but can be difficult to use since it
requires extensive post-SNP screening and a deep understanding of its
parameters. It is often used by advanced users who can adjust various
parameters by themselves, and users can take advantage of a user
guideline to perform postprocessing [3, 11]. If it produces unexpected
SNP calling results, we may need to examine its parameters carefully.
For example, a homozygous SNP can be excluded by a single aberrant read
whose mapping score is zero, because Freebayes excludes alignments from
analysis if their mapping quality is less than 1. Please see the
parameter --min-mapping-quality.
samtools mpileup is also popular and produces relatively few false
positives, even though it may miss some heterozygous SNPs with lower
alternative allele frequency.
COCALL and consensus calling: In a previous analysis we used a
consensus SNP calling method, COCALL [3], which obtained a consensus of
five different SNP calling tools (samtools pileup, samtools mpileup,
Freebayes, GATK, and CORTEX [36]), where each tool was applied to each
sample individually. We found that COCALL outperformed all five methods
in an individual calling mode and also the population calling modes of
GATK, samtools mpileup, and Freebayes [3], and was particularly
suitable for a set of genetically homogeneous samples like the core
group of L. donovani in the Indian subcontinent. However, in the last
several years the situation has changed: improvements in base quality,
read length, and genome analysis tools have made SNP callers more
reliable, and larger numbers of samples need to be processed than
before. A complex consensus approach is therefore slow and difficult to
implement and maintain, because such a method requires extensive
parameter adjustment and needs uniformity in sequence quality,
including read depth and base quality. Consensus SNP calling is still
used for evaluating a genome assembly. For example, samtools pileup and
samtools mpileup were used for the assembly of L. naiffi [24].
Consensus calling may not be suitable for beginners who need a simple
method they can follow and modify, but it would be informative to call
SNPs using a few methods to understand SNP callers and evaluate these
tools to find their strengths and weaknesses.

[Figure 5: schematic alignment of reads against a reference, with
numbered problem locations: 1) a homopolymer, 2) a duplication,
3) tandem repeats, 4) a duplication close to a gap, 5) a duplication
close to the end of a chromosome. Box, "Characteristics of high quality
SNPs": clean neighbouring bases (no gaps/repeats/base errors); higher
alternative read allele frequency; good mapping score (not in repeat
region); high complexity region.]

Fig. 5 Locations associated with many false positive SNPs in a
reference genome: False positive SNPs were often found in (1) low
complexity regions such as homopolymers or short tandem repeat regions,
(2) a duplicated region where tandem repeats are truncated into a
shorter unit in the reference, (3) tandem repeats where mapping quality
is zero and these SNPs may not be detected by SNP callers leading to
false negative SNPs, (4) duplications close to a gap, and
(5) duplications close to the end and beginning of a chromosome. The
broad light green lines represent unique self-blast hits to themselves
and the dark grey lines represent repetitive matchings. In the figure,
"SNP" indicates a real SNP. "F" indicates a false positive SNP and "FN"
indicates a false negative SNP. The left upper box highlights the
characteristics of high-quality SNPs

3.5.2  Masking a Reference: Reference-Specific False Positives

Our challenge here is to distinguish true SNPs from a flood of false
SNPs mainly caused by the reference itself. The main characteristics of
true SNPs are described in the box in Fig. 5, and include (1) clean
neighboring bases without gaps, repeats, or base errors, (2) higher
alternative read allele frequency, (3) high mapping score, and
(4) nonrepetitive high complexity regions. Figure 5 also illustrates
how a true SNP (SNP), false positive SNPs (F), and false negative SNPs
(FN) can appear in an alignment. It depicts the various features
causing false SNPs in a hypothetical reference genome, including a
homopolymer, a duplication, tandem repeats, and duplications close to a
gap and to the end of a chromosome. Here the reference genome was
blasted against itself; one-to-one unique similarity lines are shown in
light green, and repetitive multiple hits are shown in dark grey. Some
regions of a genome produce unexpectedly many false positive SNPs:
lower complexity regions such as homopolymers, misassembled repetitive
regions, bases adjacent to gaps or contig ends [3, 4], and conserved
genome regions that are similar to contaminating host DNA.
Characterizing a reference in detail is a fundamental step that is
often ignored in many analyses, and it is critical because true
positive SNPs might be evenly distributed across a genome, whereas
false positive SNPs are not. Hence, false SNPs caused by technical
biases should not be described as key biological characteristics.
Lower complexity regions and mappability: Lower complexity regions
including homopolymers, simple repeats, or simple tandem repeats can be
identified by the repeat masking program tantan [37], which can mask
such regions to avoid false positives from homologous regions.
Alternatively, it is possible to characterize the proportion of a
genome that can be properly mapped by reads, using a mappability tool
[38]. It is also possible to calculate exactly the k-mer uniqueness of
a Leishmania reference for single and paired reads using a simple k-mer
calculation [4].
Self blast matching: Higher complexity regions, which are not
homopolymers, simple repeats, or simple tandem repeats, can still be
repetitive, and they can be identified by blasting the reference
against itself using a cutoff of 10e–20 [3]. This is an intuitive way
to mask homologous regions.
Gap and contig edges: Gaps are often located around a larger repetitive
region where bases cannot be corrected by computational base correction
tools such as ICORN [4, 39]. Therefore, it is better to exclude the
100 bp adjacent to a gap and the bases close to contig edges from any
genome analysis [24, 25].
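
The gap and contig-edge mask described above can be sketched as
follows. The sketch assumes that gaps are written as runs of N in the
reference sequence and that the same 100 bp margin is used at contig
edges; the text gives 100 bp only for gaps, so the `edge` default and
the function name are assumptions.

```python
import re


def gap_mask_intervals(sequence, flank=100, edge=100):
    """Return merged 0-based, half-open intervals to exclude.

    Masks every run of N plus `flank` bases on each side, and the first
    and last `edge` bases of the contig.
    """
    length = len(sequence)
    if length == 0:
        return []
    intervals = []
    if edge > 0:
        intervals.append((0, min(edge, length)))
        intervals.append((max(length - edge, 0), length))
    for gap in re.finditer(r"[Nn]+", sequence):
        intervals.append((max(gap.start() - flank, 0),
                          min(gap.end() + flank, length)))
    intervals.sort()
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching intervals are fused into one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

The returned intervals can be written to a BED file and combined with
masks from low complexity and self-similarity screens.
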
The conserved genome regions that are similar to contaminating host
DNAs: When amastigote samples are sequenced, it is common to have
higher host DNA contamination, and this can cause unexpected false
positive SNPs in the Leishmania regions that are conserved with respect
to host DNA. These false positives can show up as high SNP score
heterozygous SNPs on truncated reads. These regions can be manually
masked out to avoid false positive SNPs.
Reference with many repetitive regions: Many Leishmania genomes
assembled using next generation sequencing technologies are of
relatively high quality and do not contain chromosomes or contigs where
reads cannot be mapped correctly [4, 24, 25]. Pacific Biosciences
(PacBio) technology further improved Leishmania references [6, 40]. On
the other hand, for extremely repetitive genomes, it is very difficult
to estimate proper depth for chromosomes and to characterize genetic
variations, particularly when aneuploidy is common. For example,
aneuploidy is not uncommon in Trypanosoma cruzi, and this makes it
difficult to estimate proper read depth using a Trypanosoma cruzi
reference [30].

3.5.3  Sample-Specific Cutoffs

Read depth cutoffs: Duplicated and deleted regions can create false
positive SNPs for phylogenetic analysis; it is therefore common to
apply a read depth cutoff. For example, a cutoff on normalized depth,
dm > 0.5 and dm < 2, can be used. Ideally, this normalized depth must
be defined for each chromosome in Leishmania since somy values can vary
significantly, but it is common to set a general common cutoff for all
chromosomes for individual samples or a total depth for all samples.
More practically, raw depth cutoffs can be used to simplify the
computation.
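
A per-chromosome version of this depth cutoff might look like the
sketch below. It assumes that the normalized depth dm is the depth at
the SNP site divided by the median depth of its chromosome, which is
one reasonable reading of the text rather than a stated definition; the
function name is hypothetical.

```python
def filter_by_normalized_depth(site_depths, chrom_median_depth,
                               low=0.5, high=2.0):
    """Keep SNP positions whose normalized depth lies between the cutoffs.

    site_depths: iterable of (position, read_depth) for one chromosome.
    """
    kept = []
    for position, depth in site_depths:
        dm = depth / chrom_median_depth
        if low < dm < high:
            kept.append(position)
    return kept
```

Because the median is taken per chromosome, a trisomic chromosome is
judged against its own depth and is not penalized for its higher somy.
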
SNP clusters: It is common to exclude or mark SNP clusters in which
there are more than three SNPs within ten bases of each other, because
these clusters are often associated with false positive SNPs. The
definition of SNP clusters, however, must be adjusted for each data
set, based on the number of real SNP clusters.
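
A sliding-window version of this cluster rule is sketched below. The
exact window convention is an assumption: here a cluster is any group
of more than `max_snps` SNPs whose first and last positions are fewer
than `window` bases apart. Both values are parameters so that they can
be tuned to the data set.

```python
def flag_snp_clusters(positions, max_snps=3, window=10):
    """Return the sorted SNP positions that belong to a dense cluster."""
    positions = sorted(positions)
    flagged = set()
    start = 0
    for end in range(len(positions)):
        # Shrink the window until it spans fewer than `window` bases.
        while positions[end] - positions[start] >= window:
            start += 1
        if end - start + 1 > max_snps:
            flagged.update(positions[start:end + 1])
    return sorted(flagged)
```

For positions 100, 102, 105, 108, and 500 on one chromosome, the first
four are flagged and the SNP at 500 is left untouched.
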
Specifying somy: Various integer somy values can be assigned to given
chromosomes in GATK and Freebayes, which is quite effective for
polyploid plant genomes, whose ploidy can exceed pentaploidy. In
Leishmania genomes, however, aneuploidy can often be transient, and
intermediate somy values are also quite common [3, 6]. It is therefore
often difficult to specify proper somy values for each chromosome in
different strains without introducing unintended biases, and it is
normally sufficient to use a default somy setting for SNP calling. For
example, we have never observed any SNP deficiency in chromosome 31,
whose somy has always been greater than 3 [3], nor in the septasomic
chromosomes we have examined (data not shown).

3.5.4  Annotating Functional Impact of SNPs and Indels

A functional annotation program, SnpEff [41], can be used to classify
all SNPs and indels based on their functional impact, such as
frameshift, nonsynonymous, or synonymous change and intergenic
mutation. SNPs and indels are compiled in a population genetic
variation vcf file. From this vcf file, alternative allele and depth
information can be extracted for further analysis. Variants common to
all strains are often uninformative, and they can be excluded from the
analysis.

3.6  Screening SNPs and Indels

Filtering SNPs and indels: Once SNPs and indels are identified, the
next step is to apply variant filtering such as GATK
VariantFiltration, which evaluates many SNP quality conditions
including SNP quality per depth, SNP strand bias, root mean square of
the mapping quality, and a mapping quality test between reference and
alternative alleles. A caution here is that GATK is updated frequently
and its recommended parameters change occasionally; therefore, it is
essential to consult the GATK user guide to set proper parameters.
SNP quality score: The next step is to test and select a SNP score that
matches the sensitivity and specificity required for the analysis.
There is no universal SNP score that can be applied to any samples
since a SNP score itself depends on read depth, read base quality, and
other factors, but we will provide some values that were applied in our
previous analyses.
In practice, SNP scores are given in the 6th column with the tag QUAL
in vcf files. They are expressed in terms of Phred scores Q, defined as
Q = −10 log10(P), where P is the estimated error probability for that
base-call [42]. In our data sets for L. donovani, other Leishmania, and
Trypanosoma species, we found that SNP scores of 300 for GATK
individual calling and 1500 for GATK population calling are sufficient
to remove most false positive SNPs. These values must be adjusted
specifically to a given data set because each data set can have
different biases. In our previous analyses, these values were selected
after testing GATK using the BPK282 sequence data from which the
reference genome was created. For example, in a sample set containing
about 200 samples in which dozens of samples are affected by base
errors, a GATK SNP score might be elevated to 4000 or higher in a
population calling to eliminate low-quality heterozygous false positive
SNPs. In general, with a proper SNP cutoff the number of SNPs should
not change substantially when the cutoff is changed slightly (e.g.,
300–500 in a population SNP calling of several dozen samples and
50–100 in an individual SNP calling). If the number does change
substantially, the SNP cutoff is too low [3], and it is often safe to
use the cutoff around which the number of SNPs is stable, since the
number of true positive SNPs should not vary much depending on the
cutoff. Alternatively, it is better to remove samples producing
excessive SNPs, if that is possible. During selection of a SNP cutoff,
it is essential to inspect SNPs visually in the Integrative Genomics
Viewer, Artemis, or samtools tview to avoid false positives and to mask
regions that produce excessive false positive SNPs.
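
The Phred relationship and the cutoff stability check described above
can be sketched as follows. The helper names are hypothetical, and the
QUAL values are assumed to have been read beforehand from the 6th
column of a vcf file.

```python
import math


def phred_to_error_probability(q):
    """Invert Q = -10 log10(P) to recover the error probability P."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score Q = -10 log10(P)."""
    return -10 * math.log10(p)


def snp_counts_by_cutoff(qual_scores, cutoffs):
    """Count SNPs retained at each candidate QUAL cutoff.

    A plateau in these counts suggests a stable cutoff; a steep drop
    between neighboring cutoffs suggests the cutoff is still too low.
    """
    return {cutoff: sum(q >= cutoff for q in qual_scores)
            for cutoff in cutoffs}
```

Scanning a range such as 100, 300, 500, and 1500 and plotting the
counts is a quick way to see where the number of SNPs stops changing,
before confirming a subset of calls by visual inspection.
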

3.7  Specific Sequencing Analyses

We have discussed the basics of sequence analyses and now we will
further discuss more specific cases that improve sequencing analysis
methods and also explore topics that are important but less frequently
discussed.

3.7.1  Mapping Reference Reads to Its Own Reference Genome to Understand the Reference and to Improve CNV Detection and SNP Calling Methods

Mapping reference reads to the genome itself is a good practice to
become familiar with a reference genome and to know its potential
misassemblies. For example, the reference genome of L. major Friedlin
is considered the most accurate among Leishmania genomes. However,
copy number variation analyses of various Leishmania genomes showed
that many of the repetitive genes were often concatenated, resulting in
their higher copy numbers [5, 40]. Therefore, mapping reference reads
to the reference itself is an excellent way to calibrate and improve
our own CNV detection and SNP calling methods. Higher depth in a gene
of a reference indicates that some sections were truncated in the
reference, while lower depth around a repetitive region indicates that
the region was overextended in the assembly process. The latter was
observed in references assembled by PacBio SMRT Sequencing [6, 31].

3.7.2  Optimizing SNP Calling Using Artificial Mixtures of Two Samples

Optimizing SNP calling can be done by using simulated reads, but it is
more relevant to optimize the method by using alignments of artificial
mixtures of two samples that differ by a sufficient number of
homozygous SNPs. The main weakness of simulated reads is that they are
too clean and too easy to train our methods on, because real samples
produce nearly intractable SNPs that simulated reads simply cannot
emulate. If we do not have two clean samples to train our SNP calling
method, it might be informative to optimize our SNP calling using a
data set that is already published. For example, the data for in vitro
selection of miltefosine resistance in promastigotes of Leishmania
donovani from Nepal [29] may be used. These samples do not contain more
than 200 homozygous and heterozygous SNPs, and BPK282 lines are closely
related to the reference BPK282 strain. Therefore, they are an ideal
data set to calibrate SNP calling. To refine SNP calling parameters
further, the alignment files of BPK282s and BPK275s, which belong to a
different genotype, can be mixed at certain proportions using the
"samtools view -s" command. We are able to test the sensitivity and
specificity of the method using this artificially mixed data set built
from real alignments, because BPK275 lines have their own unique,
mainly homozygous, SNPs. By using these data sets, we will notice that
some regions produce disproportionately many SNPs because of their
repetitiveness or some unknown factors, and these regions must be
masked for further SNP analyses [3].
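
The subsampling fractions needed for such a mixture can be worked out
as in the sketch below. It assumes that the mean depths of the two bam
files are already known and relies on the samtools documentation for
`view -s`, where the integer part of the argument seeds the random
generator and the decimal part is the fraction of reads kept; readers
should confirm this against their samtools version. The function names
are hypothetical, and merging the two subsampled files is left to the
reader.

```python
def subsample_fractions(depth_a, depth_b, fraction_b, total_depth):
    """Fractions of reads to keep from samples A and B.

    The mixture has mean depth `total_depth`, of which `fraction_b`
    comes from sample B and the remainder from sample A.
    """
    keep_a = (1.0 - fraction_b) * total_depth / depth_a
    keep_b = fraction_b * total_depth / depth_b
    if keep_a > 1.0 or keep_b > 1.0:
        raise ValueError("requested depth exceeds the available depth")
    return keep_a, keep_b


def samtools_s_argument(seed, fraction):
    """Format SEED.FRACTION for `samtools view -s`, e.g. '42.1250'."""
    if not 0.0 < round(fraction, 4) < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    decimals = f"{fraction:.4f}".split(".")[1]
    return f"{seed}.{decimals}"
```

For a site where sample B is homozygous for the alternative allele and
sample A is homozygous for the reference allele, the expected
alternative allele frequency in the mixture is approximately
`fraction_b`, which gives a known truth value against which SNP calling
sensitivity can be measured.
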

3.7.3  Genetic Variation Detection Among Samples with Different Phenotypes Including Drug Induction Experiments

Sequencing samples to identify genetic variations that can explain
specific phenotypes is common. One of the most common analyses is a
comparison between a drug-sensitive line and an induced drug-resistant
line. In this case, it is essential to identify any genetic variations
located in coding regions or even noncoding regions. Further, we must
not discard but explore base variations that overlap with local copy
number variations, since it is known that some specific genes like the
L. donovani miltefosine transporter (LdMT) and aquaglyceroporin 1
(AQP1) can contain partial or full deletions, SNPs, or indels
simultaneously in cells [3]. In drug induction experiments, it is
essential to monitor allele frequency changes at various intermediate
stages to identify SNPs critical to drug resistance. Here, applying a
population calling method among sensitive, intermediate, and resistant
lines will help to identify critical genetic variations because it
provides all allele frequency information at a position where at least
one high score SNP is present. For example, in the previously mentioned
miltefosine drug resistance induction experiment [29], the critical SNP
at LdMT was not detected at the beginning of the induction experiment,
but the SNP allele gradually increased and reached up to 100% in fully
miltefosine-resistant parasites (Fig. 4). To monitor this type of SNP
allele transition, it was more effective to use a population calling
method. If an individual calling method is used, then all alternative
allele frequencies from all samples must be recalculated from all SNP
sites. Therefore, it is more efficient to use a population calling
method from the start. For example, we used COCALL at that time for
SNP detection and had to reevaluate the allele frequencies of all SNP
sites in all the samples to achieve the same results as a population
calling method and capture rare allele SNPs.

3.7.4  Phylogenetic Analysis

Comprehensive phylogenetic analyses require specialized skills and
methods, and therefore we will not discuss the topic in detail. We
will, however, describe simple practical guides applicable to basic
phylogenetic analyses. In general, SNPs selected for phylogenetic
analysis can be more rigorously filtered to capture the key
phylogenetic relationship. Therefore, it is essential to perform
extensive screening of SNPs to retain high confidence SNPs and to mask
out the problematic regions from a reference to reduce false positive
SNPs. Special care must be taken when combining multiple SNP vcf files
to properly handle reference bases, alternative alleles, and missing
information without unintentionally distorting the base information. It
is common to remove SNPs that appear in only one sample or that appear
in all samples. It is also common to remove linked SNPs and regions
containing strong signatures of recombination, but identifying such
SNPs and regions must be performed with care.

3.7.5  How to Handle Samples with Biased False SNPs

For a large sequencing project, sequence data can come from various
sources. Sequencing data quality may not be uniform and some samples
may be of lower quality. Therefore, the data from public archives must
be treated with caution, and it is critical to read the original paper
in which the sequence data was generated, because the paper often
describes the read quality issues and practical solutions in detail
when the quality of some samples is lower than that of other samples.
For example, in our previous L. donovani project [4], 18 samples had
lower sequence quality and these samples would have generated ten times
more false positive SNPs than the rest of the samples, but the problem
was solved by applying additional SNP screening conditions to these
samples. If we obtain the data from a public database and observe an
abnormal number of SNPs in a limited number of samples, then it is
appropriate to remove these samples from the analyses rather than
keeping them, provided these samples are not essential.

3.7.6  Characterizing Depth and SNPs from Repetitive Regions

When a read can be mapped to multiple locations with the same mapping
score, aligners either select one location randomly or ignore the
multiply mapped read. Selecting a random position is often more
appropriate for characterizing base variations and copy number
variations than keeping only uniquely mapped reads. When only uniquely
mapped reads are kept, the depth of repetitive regions such as GP63 and
HSP70 is underestimated because multiply mapped reads are ignored. This
would consequently lead to skewed CNV results [3, 5]. It is also easy
to exclude multiply mapped reads later, based on the mapping scores of
reads, by using samtools view with a mapping quality cutoff.

3.7.7  Lower Input DNA Samples

For various technical and biological reasons, the depth of samples can
be lower than one or two reads per base. It is, however, still possible
to estimate somy and perform genotyping from such lower depth samples.
For example, the read depth for Leishmania donovani amastigotes from a
hamster was lower than 0.8 and too low to estimate somy based on a
regular depth method as discussed above. It was, however, possible to
estimate their somy based on the number of reads per 1000 bp [6].
Genotyping using regular SNP calling methods is not possible for such
low depth samples, but if diagnostic SNP markers are known from other
higher depth samples, direct searching of base motifs that contain a
diagnostic SNP marker can be used for genotyping. For copy number
variation, it is likely possible to identify amplicons with a high copy
number using depth based on a certain window. Smaller scale CNVs can be
detected by clustering many lower depth samples together to increase
resolution.

3.8  Transcriptomics Analysis

We will briefly summarize the basics of upstream analysis of
transcriptomics sequencing and describe the read count normalization
specific to Leishmania and how to handle repetitive genes. RNA
sequencing analysis generally requires 2 to 6 replicates per sample.
When it is not feasible to have replicates in some experimental and
clinical settings, it is common to group samples together based on
phenotypic differences to increase statistical power. It is more
difficult to choose an optimal RNA sequencing library than a DNA
sequencing library because we must optimize many factors such as number
of replicates, read depth, read length, insert size, and single or
paired reads [43–45].
RNA read mapping and read counting: For RNA analysis, STAR (Spliced
Transcripts Alignment to a Reference) is likely a convenient option in
many cases [46]. It can perform elaborate two-step mapping and read
counting by itself. It can also handle strand-specific RNA sequencing.
Leishmania does not have introns in general, so splicing-aware mapping
may not be needed, but it is an attractive feature. As a cautionary
note for STAR users, when using a gff file from TriTrypDB, the gff
annotation file must be converted to a gtf annotation file even though
STAR indicates that a gff file is accepted. In general, STAR often
produces empty counts when a gff from TriTrypDB is used. Once read
count data are obtained, they can be analyzed by DESeq2, which provides
normalized read counts, fold changes, and Benjamini–Hochberg-adjusted
p-values. It is common to use a fold change cutoff of 2 and a
Benjamini–Hochberg adjusted p value <0.05 to define differentially
expressed genes.
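
The two thresholds named above can be applied as in the following
sketch. DESeq2 computes adjusted p-values itself (with additional steps
such as independent filtering), so this standalone Benjamini–Hochberg
implementation is for illustration only; the function names and the
choice to treat the fold change cutoff as inclusive and symmetric on
the log2 scale are assumptions.

```python
import math


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def differentially_expressed(log2_fold_changes, p_values,
                             min_fold_change=2.0, alpha=0.05):
    """Flag genes passing both the fold change and adjusted p cutoffs."""
    adjusted = benjamini_hochberg(p_values)
    threshold = math.log2(min_fold_change)
    return [abs(lfc) >= threshold and padj < alpha
            for lfc, padj in zip(log2_fold_changes, adjusted)]
```

A gene with a log2 fold change of -2.0 is treated the same way as one
with +2.0, so both up- and down-regulated genes are retained.
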

Customized read count normalization: Readers who are sufficiently
familiar and confident with their data sets can try their own
normalization reflecting Leishmania-specific transcriptional
differences between promastigotes and amastigotes. In Leishmania,
promastigotes tend to have higher expression levels than
amastigotes in general; therefore DESeq2 [47] read count
normalization may slightly suppress promastigote expression
levels [6]. In our previous analysis, we quantified transcripts by
assessing read depth as described in DNA somy estimation above:
we calculated RNA read depth as for DNA sequencing and obtained
the read depth of each gene. For each chromosome, the average
depth of transcripts was used to compute an RNA-based relative
somy value. We then converted these depths into raw read counts
and calculated transcriptional somy. In this way, we found that
DESeq2 reduced the total normalized read count of a promastigote
sample by 4% compared to the normalization based on RNA somy [6].
The difference was only 4% in this specific case, but more skewed
expression levels between promastigotes and amastigotes are
conceivable. For example, when the promastigotes' somy level is
much higher than the amastigotes' somy level, regular DESeq2
normalization may not be optimal, and a customized normalization
based on RNA somy might produce better results.
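The RNA somy calculation can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not the exact procedure used in [6]: it assumes that the median chromosome depth represents the baseline state (set to 2 here, by analogy with DNA somy), and that an average depth can be converted to an approximate read count as depth × gene length / read length. Chromosome names, depths, and function names are hypothetical.

```python
from statistics import mean, median


def rna_somy(gene_depths, baseline_somy=2.0):
    """Estimate an RNA-based relative somy value per chromosome.

    gene_depths: {chromosome: [average read depth of each transcript]}
    Assumption: the median chromosome depth corresponds to baseline_somy.
    """
    chrom_depth = {c: mean(d) for c, d in gene_depths.items()}
    reference = median(chrom_depth.values())
    return {c: baseline_somy * d / reference for c, d in chrom_depth.items()}


def depth_to_read_count(depth, gene_length, read_length):
    """Approximate the raw read count of a gene from its average depth."""
    return round(depth * gene_length / read_length)


# Hypothetical depths for three chromosomes.
depths = {"chr1": [10, 14], "chr2": [20, 28], "chr3": [12, 12]}
print(rna_somy(depths))  # {'chr1': 2.0, 'chr2': 4.0, 'chr3': 2.0}
print(depth_to_read_count(12, gene_length=1500, read_length=100))  # 180
```

Comparing the total read count per sample obtained this way with the DESeq2-normalized total gives a direct view of how much the standard normalization shifts one life stage relative to the other.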
Transcription analysis over repetitive genes: In RNA sequencing
analyses, repetitive regions are often simply excluded because
reads cannot be mapped uniquely to them. There are at least three
ways to handle repetitive genes in RNAseq data: ignoring
multiple-hit reads, selecting one representative homolog among a
group of homologs, and describing these genes separately.
First, if the main aim is to characterize transcription levels in
general, it is common to ignore reads mapping to multiple
locations. This method can still characterize the expression of
repetitive genes as long as some reads can be mapped uniquely.
Second, repetitive genes can be clustered using a program like
cd-hit-est (http://weizhongli-lab.org/cd-hit/). A similarity
cutoff of 90% or 95% can be applied, and a cutoff of 90% should be
sufficient to characterize general transcriptional behavior.
However, for a transcriptional analysis in the context of drug
induction experiments, small base differences, such as a single
frameshifting base difference, can have a significant impact on a
gene. Therefore, the similarity cutoff must be adjusted based on
the aim of the experiment. Third, as described in the last part of
the previous section, RNA reads can be treated like DNA reads to
characterize RNA depth and SNPs. It is then possible to evaluate
RNA somy, RNA gene depth, and RNA SNPs to observe transcriptional
biases, providing additional information that the first two
methods may not provide.
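The first option, ignoring multiple-hit reads, can be illustrated with a short Python sketch that separates read names by the standard SAM `NH` tag (number of reported alignments for the read). It assumes the aligner writes this tag, as STAR does; records without an `NH` tag are treated as non-unique here, which is a conservative choice of this sketch rather than a rule from the text. The read names and the chromosome labels `Ld01` and `Ld08` are hypothetical.

```python
def is_unique(sam_line):
    """True if the SAM record carries NH:i:1, i.e. a single reported hit."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start at the 12th column of a SAM record.
    return "NH:i:1" in fields[11:]


def split_by_uniqueness(sam_lines):
    """Return (unique_read_names, multi_hit_read_names) as sorted lists."""
    unique, multi = set(), set()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        name = line.split("\t")[0]
        (unique if is_unique(line) else multi).add(name)
    return sorted(unique), sorted(multi)


# Hypothetical records: r1 maps once, r2 maps to two copies of a gene.
records = [
    "@HD\tVN:1.6",
    "r1\t0\tLd01\t100\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "r2\t0\tLd08\t500\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "r2\t256\tLd08\t900\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
]
unique_reads, multi_reads = split_by_uniqueness(records)
print(unique_reads)  # ['r1']
print(multi_reads)   # ['r2']
```

Keeping the multi-hit read names, rather than discarding them silently, makes it possible to return to those reads later when repetitive genes are described separately.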
4  Conclusion

We have discussed a broad range of topics that are critical to
Leishmania NGS analyses, with particular emphasis on how to handle
read depth and somy. We have also discussed important but often
neglected topics such as masking a reference, the differences
between individual and population SNP calling methods, and RNA
depth normalization between promastigote and amastigote samples.
We hope this guide will help readers to create their own sequence
analysis tools and to better understand and evaluate the existing
literature on Leishmania NGS.

Acknowledgments

We thank Geraldine De Muylder, Bart Cuypers, and Malgorzata
Domagalska for their comments on the manuscript.

References

1. Leprohon P, Fernandez-Prada C, Gazanion É et al (2015) Drug resistance analysis by next generation sequencing in Leishmania. Int J Parasitol Drugs Drug Resist 5(1):26–35
2. Mardis ER (2017) DNA sequencing technologies: 2006–2016. Nat Protoc 12(2):213
3. Imamura H, Downing T, Van den Broeck F et al (2016) Evolutionary genomics of epidemic visceral leishmaniasis in the Indian subcontinent. Elife 5:e12613
4. Downing T, Imamura H, Decuypere S et al (2011) Whole genome sequencing of multiple Leishmania donovani clinical isolates provides insights into population structure and mechanisms of drug resistance. Genome Res 21(12):2143–2156
5. Rogers MB, Hilley JD, Dickens NJ et al (2011) Chromosome and gene copy number variation allow major structural change between species and strains of Leishmania. Genome Res 21(12):2129–2142
6. Dumetz F, Imamura H, Sanders M et al (2017) Modulation of aneuploidy in Leishmania donovani during adaptation to different in vitro and in vivo environments and its impact on gene expression. MBio 8(3):e00599–e00517
7. Barja PP, Pescher P, Bussotti G et al (2017) Haplotype selection as an adaptive mechanism in the protozoan pathogen Leishmania donovani. Nat Ecol Evol 1(12):1961
8. Jones NG, Catta-Preta CM, Lima AP, Mottram JC (2018) Genetically validated drug targets in Leishmania: current knowledge and future prospects. ACS Infect Dis 4(4):467–477
9. Mannaert A, Downing T, Imamura H, Dujardin JC (2012) Adaptive mechanisms in pathogens: universal aneuploidy in Leishmania. Trends Parasitol 28(9):370–376
10. Sterkers Y, Lachaud L, Bourgeois N et al (2012) Novel insights into genome plasticity in Eukaryotes: mosaic aneuploidy in Leishmania. Mol Microbiol 86(1):15–23
11. Iantorno SA, Durrant C, Khan A et al (2017) Gene expression in Leishmania is regulated predominantly by gene dosage. MBio 8(5):e01393–e01317
12. Tihon E, Imamura H, Dujardin JC et al (2017) Discovery and genomic analyses of hybridization between divergent lineages of Trypanosoma congolense, causative agent of Animal African Trypanosomiasis. Mol Ecol 26(23):6524–6538
13. Tihon E, Imamura H, Dujardin JC, Van Den Abbeele J (2017) Evidence for viable and stable triploid Trypanosoma congolense parasites. Parasit Vectors 10(1):468
14. Head SR, Komori HK, LaMere SA et al (2014) Library construction for next-generation sequencing: overviews and challenges. Biotechniques 56(2):61
15. Vincent AT, Derome N, Boyle B et al (2017) Next-generation sequencing (NGS) in the microbiological world: how to make the most of your money. J Microbiol Methods 138:60–71
16. Haddock SHD, Dunn CW (2011) Practical computing for biologists. Sinauer, Sunderland
17. Aslett M, Aurrecoechea C, Berriman M et al (2009) TriTrypDB: a functional genomic resource for the Trypanosomatidae. Nucleic Acids Res 38(suppl_1):D457–D462
18. Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15):2114–2120
19. Cuypers B, Domagalska MA, Meysman P et al (2017) Multiplexed spliced-leader sequencing: a high-throughput, selective method for RNA-seq in Trypanosomatids. Sci Rep 7(1):3725
20. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25(14):1754–1760
21. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4):357
22. Li H, Homer N (2010) A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 11(5):473–483
23. Zackay A, Cotton JA, Sanders M et al (2018) Genome wide comparison of Ethiopian Leishmania donovani strains reveals differences potentially related to parasite survival. PLoS Genet 14(1):e1007133
24. Coughlan S, Mulhair P, Sanders M et al (2017) The genome of Leishmania adleri from a mammalian host highlights chromosome fission in Sauroleishmania. Sci Rep 7:43747
25. Coughlan S, Taylor AS, Feane E et al (2018) Leishmania naiffi and Leishmania guyanensis reference genomes highlight genome structure and gene evolution in the Viannia subgenus. R Soc Open Sci 5(4):172212
26. Rastrojo A, García-Hernández R, Vargas P et al (2018) Genomic and transcriptomic alterations in Leishmania donovani lines experimentally resistant to antileishmanial drugs. Int J Parasitol Drugs Drug Resist 8(2):246–264
27. Valdivia HO, Reis-Cunha JL, Rodrigues-Luiz GF et al (2015) Comparative genomic analysis of Leishmania (Viannia) peruviana and Leishmania (Viannia) braziliensis. BMC Genomics 16(1):715
28. Dumetz F, Cuypers B, Imamura H et al (2018) Molecular preadaptation to antimony resistance in Leishmania donovani on the Indian subcontinent. mSphere 3(2):e00548–e00517
29. Shaw CD, Lonchamp J, Downing T et al (2016) In vitro selection of miltefosine resistance in promastigotes of Leishmania donovani from Nepal: genomic and metabolomic characterization. Mol Microbiol 99(6):1134–1148
30. Reis-Cunha JL, Rodrigues-Luiz GF, Valdivia HO et al (2015) Chromosomal copy number variation reveals differential levels of genomic plasticity in distinct Trypanosoma cruzi strains. BMC Genomics 16(1):499
31. Schwabl P, Imamura H, Van den Broeck F et al (2018) Parallel sexual and parasexual population genomic structure in Trypanosoma cruzi. bioRxiv. https://doi.org/10.1101/338277
32. Rogers MB, Downing T, Smith BA et al (2014) Genomic confirmation of hybridisation and recent inbreeding in a vector-isolated Leishmania population. PLoS Genet 10(1):e1004092
33. McKenna A, Hanna M, Banks E et al (2010) The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20(9):1297–1303
34. Marth GT, Korf I, Yandell MD et al (1999) A general approach to single-nucleotide polymorphism discovery. Nat Genet 23(4):452
35. Li H, Handsaker B, Wysoker A et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–2079
36. Iqbal Z, Caccamo M, Turner I et al (2012) De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet 44(2):226
37. Frith MC (2010) A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res 39(4):e23
38. Derrien T, Estellé J, Sola SM et al (2012) Fast computation and applications of genome mappability. PLoS One 7(1):e30377
39. Otto TD, Sanders M, Berriman M, Newbold C (2010) Iterative correction of reference nucleotides (iCORN) using second generation sequencing technology. Bioinformatics 26(14):1704–1707
40. González-de la Fuente S, Peiró-Pastor R et al (2017) Resequencing of the Leishmania infantum (strain JPCM5) genome and de novo assembly into 36 contigs. Sci Rep 7(1):18050
41. Cingolani P, Platts A, Wang LL et al (2012) A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6(2):80–92
42. Ewing B, Green P (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8(3):186–194
43. Schurch NJ, Schofield P, Gierliński M et al (2016) How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA 22(6):839–851
44. Fiebig M, Kelly S, Gluenz E (2015) Comparative life cycle transcriptomics revises Leishmania mexicana genome annotation and links a chromosome duplication with parasitism of vertebrates. PLoS Pathog 11(10):e1005186
45. Patino LH, Ramírez JD (2017) RNA-seq in kinetoplastids: A powerful tool for the understanding of the biology and host-pathogen interactions. Infect Genet Evol 49:273–282
46. Dobin A, Davis CA, Schlesinger F et al (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15–21
47. Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15(12):550
