Robust and Exact Structural Variation Detection With Paired-End and Soft-Clipped Alignments: SoftSV Compared With Eight Algorithms


Briefings in Bioinformatics, 17(1), 2016, 51–62

doi: 10.1093/bib/bbv028
Advance Access Publication Date: 20 May 2015
Paper

Robust and exact structural variation detection with paired-end and soft-clipped alignments: SoftSV compared with eight algorithms
Christoph Bartenhagen and Martin Dugas
Corresponding author. Christoph Bartenhagen, Institute of Medical Informatics, University of Münster, Albert-Schweitzer-Campus 1, D-48149 Münster,
Germany. Tel.: +49-(0)251-83-58367; E-mail: christoph.bartenhagen@uni-muenster.de

Abstract
Structural variation (SV) plays an important role in genetic diversity among the population in general and specifically in diseases such as cancer. Modern next-generation sequencing (NGS) technologies provide paired-end sequencing data at high depth with increasing read lengths. This development enabled the analysis of split-reads to detect SV breakpoints with single-nucleotide resolution. But ambiguous mappings and breakpoint sequences with further co-occurring mutations hamper split-read alignments against a reference sequence. The trade-off between high sensitivity and low false-positive rate is problematic and often requires a lot of fine-tuning of the analysis method based on knowledge about its algorithm and the characteristics of the data set. We present SoftSV, a method for exact breakpoint detection for small and large deletions, inversions, tandem duplications and inter-chromosomal translocations, which relies solely on the mutual alignment of soft-clipped reads within the neighborhood of discordantly mapped paired-end reads. Unlike other SV detection algorithms, our approach does not require thresholds regarding sequencing coverage or mapping quality. We evaluate SoftSV together with eight approaches (Breakdancer, Clever, CREST, Delly, GASVPro, Pindel, Socrates and SoftSearch) on simulated and real data sets. Our results show that sensitive and reliable SV detection is subject to many different factors like read length, sequence coverage and SV type. While most programs have their individual drawbacks, our greedy approach turns out to be the most robust and sensitive on many experimental setups. Sensitivities above 85% and positive predictive values between 80 and 100% could be achieved consistently for all SV types on simulated data sets starting at relatively short 75 bp reads and low 10–15× sequence coverage.

Key words: structural variation; paired-end sequencing; split-reads; simulation

Introduction
Structural variations (SVs) such as deletions, inversions, tandem duplications and translocations have gained increasing attention as paired-end sequencing technologies became more popular, robust and affordable. Over the past years, several studies emerged that gave insights into prevalence, type, complexity and formation of SVs. For instance, the 1000 Genomes project [1, 2] showed that SVs (mainly those affecting the copy number) contribute largely to genetic variation among healthy individuals. The role of SVs in tumorigenesis is a focus of ongoing research. For example, Yang et al. [3] described the SV landscape and mechanisms behind SV formation across 10 different tumor types, whereas chromothripsis denotes the co-occurrence of hundreds of SVs within individual cancer cells (see [4] and Supplementary data).
A great contribution was made by the development, improvement and specialization of algorithms to detect SVs from paired-end sequencing data. Initially, SV breakpoint locations were only approximated with paired-end reads [5, 6]. As longer read lengths of 75, 100 or 150 bp became available, split-read

Christoph Bartenhagen is a PhD student at the Institute of Medical Informatics at the University of Münster, Germany. His research interest is the develop-
ment of computational tools for mutation analysis with NGS data, in particular structural variation in cancer data sets.
Martin Dugas is Director of the Institute of Medical Informatics at the University of Münster, Germany. His research is focused on informatics for personal-
ized medicine, in particular NGS data analysis in oncology and medical data models for electronic health records.
Submitted: 5 January 2015; Received (in revised form): 7 April 2015
© The Author 2015. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com


methods, like Pindel [7], CREST [8] or Socrates [9], were able to compute the breakpoint sequence with single-nucleotide precision. The latest developments aim at the integration of two approaches, like paired-end with split-reads (Delly [10], SoftSearch [11]) or read depth information (GASVPro [12]).
But the configuration of SV detection programs is often not trivial, involving numerous parameters, which can influence the number of detections and the method’s sensitivity and false-positive rate. Three very typical parameters and concepts can be found in most of today’s programs:

i. All the SV detection methods mentioned above rely on clusters of reads around or across the breakpoint. The cluster size or number of reads, often described as support, is an important criterion to estimate the probability of an SV prediction to be a true or false positive. Typical default values for the minimum support are two or three reads (Supplementary Table S1). Low-pass sequencing data with a depth around 10× require a low threshold for an adequate sensitivity and often deal with many false-positive predictions. On the other hand, the threshold often needs to be adjusted for sequencing data at high depths like 30× or higher. A high setting, however, may discard some true-positive detections with low support owing to non-uniform sequence coverage or low prevalence of tumor cells showing this aberration, as in heterozygous or mixed populations in tumor genetics. This threshold is mostly set empirically, and the trade-off between sensitivity and false-positive rate often demands manual post-processing of the results.
ii. Another parameter is the minimum mapping quality of a read. It depends on the alignment program and typically reflects the uniqueness of the alignment, the number of mismatches and/or the sequence base qualities. Most SV detection programs filter alignments by mapping quality to lower false-positive predictions due to low quality or ambiguous alignments (Supplementary Table S1). In most experiments, the greatest part of the alignment passes this filter, but low complexity or repetitive regions may be inaccessible for exact breakpoint detection.
iii. Split-read methods expect reads being sequenced across the breakpoint. They perform a partial mapping to locate both sides of the breakpoint precisely. This is usually achieved by a gap-aware alignment or by splitting the reads up into two or more subsequences and aligning the small parts back to the genome. The minimum length of each subsequence affects the number of possible split-read alignments (e.g. a higher limit allows fewer positions for a split) and the ambiguity of the alignment (e.g. short sequences are prone to multiple or wrong alignments). Co-occurring mutations, such as single nucleotide variants and indels, can complicate the split-read alignment around the breakpoint. Hence, an error tolerance for relatively short (sub-)sequences is required as well.

With all the methods published so far and all these factors in mind, it remains difficult to find and configure the most suited method for the analysis of different types or sizes of variations. Studies like [13], for example, suggest combining multiple approaches into a single pipeline to find a most comprehensive solution.
In the following sections, we present and evaluate a method, called SoftSV, which addresses the three problems mentioned above: (1) cluster size, (2) ambiguous mappings and (3) alignment of short subsequences of 10 bp around unclean breakpoints. First, we describe a very greedy approach analyzing any discordantly mapping paired-end read and every possible split-read for breakpoint patterns of large and small SVs respectively. Then, we define a criterion for split-read alignments without a reference genome, which reduces the large set of approximated candidates to a high-quality set of breakpoints with single-nucleotide resolution. The main principle of our approach to breakpoint detection is to be as independent from SV support, mapping quality and reference genome as possible. This sets our method apart from other soft-clip-based approaches like SoftSearch and Socrates, which follow the more common quantitative approach by counting unaligned soft-clips or soft-clip alignments to the reference respectively.
We performed SV simulations with focus on repeat regions of the human genome with varying read lengths and depths to compare SoftSV with eight programs published between 2009 and 2014. We extend the evaluation to a trio data set from the 1000 Genomes Project and one data set from the HapMap Project and analyze the SV size, coverage and breakpoint accuracy and distribution of all methods.

Methods
SoftSV uses a combination of paired-end information and soft-clipping of split-read alignments, which were sequenced and mapped around or across the SV breakpoints. Given a soft-clip-aware paired-end alignment as input, we follow a two-step procedure: (1) approximation of regions that may contain SV breakpoints and (2) analysis of split-reads within these regions to verify the breakpoints and determine their precise location up to a few base pairs. The complete work flow is shown schematically in Figure 1 and described in detail in the following sections and the supplementary data.

Input: Alignment and soft-clipped reads
Instead of performing our own split-read alignment, as many methods do (for example by splitting a read into small subsequences and aligning each one independently), we take advantage of the soft-clipping information already present in the input alignment. A soft-clip is an unmapped subsequence at the 5′ and/or 3′ end of a read, because of low base quality, a mismatch or gap frequency exceeding the aligner’s thresholds or, most important for SV detection, because the read was sequenced and mapped across an SV breakpoint (as shown for deletions in Figure 2C). Hence, the read was only partially mapped and the origin of the soft-clipped part is unknown.
SoftSV expects a paired-end alignment in coordinate-sorted BAM format [14] as input, including information about read pairing and insert size. The aligner has to support soft-clipped alignments and the soft-clip information needs to be written into the CIGAR string, for example, ‘82M18S’ for 82 matched bases and an 18 bp soft-clip at the 3′ end of the read. Since version 0.7.0, the Burrows-Wheeler Aligner package (BWA) [15] includes the BWA-MEM algorithm [16], which performs a local alignment for relatively short paired-end reads (75 bp). We use and recommend BWA-MEM for SV detection with SoftSV on paired-end reads of this length, as it is more sensitive for soft-clipped alignments than an end-to-end alignment with the BWA-backtrack algorithm, which handles soft-clipped alignments only exceptionally for unmapped, but anchored, paired-end reads (see Supplementary Figure S10 and [16]).
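As an illustration of how soft-clip information can be read from a CIGAR string such as ‘82M18S’, the following minimal Python sketch extracts the clip lengths at both ends of an alignment. It is not part of SoftSV: the function names `soft_clips` and `has_usable_soft_clip` are hypothetical, and the 10 bp default is taken from the minimum soft-clip length stated in the Methods. CIGAR operations are stored in reference orientation, so 'left' and 'right' correspond to the 5′ and 3′ end of the read only for alignments on the plus strand.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM specification).
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(cigar):
    """Return (left_clip, right_clip) soft-clip lengths in bp for a CIGAR string."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Hard clips (H) may flank soft clips; they carry no sequence, so skip them.
    core = [(n, op) for n, op in ops if op != "H"]
    left = core[0][0] if core and core[0][1] == "S" else 0
    right = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return left, right


def has_usable_soft_clip(cigar, min_len=10):
    """True if either end carries a soft-clip of at least min_len bp (assumed default)."""
    return max(soft_clips(cigar)) >= min_len


if __name__ == "__main__":
    for cigar in ("82M18S", "18S82M", "100M", "5H10S80M5S"):
        print(cigar, soft_clips(cigar), has_usable_soft_clip(cigar))
```

In practice the CIGAR strings would come from a BAM reader; plain strings are used here to keep the sketch free of external dependencies.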

Figure 1. SoftSV work flow: It combines paired-end reads and soft-clip information of local alignments to compute the breakpoint sequences of small and large dele-
tions, inversions, tandem duplications and translocations.

Approximation of candidate breakpoint regions
This step produces two lists of breakpoint candidates: one for relatively large SVs based on discordantly mapping paired-end reads and a second for smaller SVs derived directly from soft-clipped reads. Some SVs can be detected by both approaches, so the lists may contain overlaps.

Large SVs: Breakpoint approximation with paired-end information
Large SVs are characterized by discordantly mapped paired-end reads. The definition of ‘discordantly’ depends on the type of SV (see also Figure 2A and Supplementary Figure S2):

i. Deletions: A paired-end mapping is discordant if the insert size is larger than expected after the mapping, i.e. larger than the mean insert size plus a certain tolerance. We assume the insert size to be normally distributed and set this threshold to the mean insert size of a large subset of properly mapped reads plus three times its standard deviation.
ii. Inversions: Sequencing at the edge of an inverted DNA segment leads to an inverted mapping orientation for one of the paired reads. Hence, both reads have the same orientation.
iii. Tandem duplications: Paired-end reads at the junction between two duplicate DNA segments have a reversed mapping order.
iv. Inter-chromosomal translocations: These mutations affect two non-homologous chromosomes. Hence, the paired-end reads map to different chromosomes with an undefined insert size.

SoftSV classifies the discordant paired-end reads into one of these four categories. For every SV, we compute two

Figure 2. (A) Detection of large deletions with paired-end and soft-clipped reads (SCs). SCs (highlighted in gray) can be interconnected in a graph structure by sequence
homology and form a maximal clique, which spans both breakpoints (BP1 and BP2). Two reads from different BPs of the same SV are connected if the SC of one
matches the sequence of the other read. For reads at the same BP, the soft-clips of both reads are compared. Outlier SCs distant from the BPs like the read on the far
right are not part of the clique and thus excluded from the detection. (B) Detection of small deletions with soft-clipped reads. Every SC can be the breakpoint of an SV
with the second breakpoint located in its neighborhood. The following graph procedure is equivalent to large deletions in A. (C) Soft-clip alignment in detail for a 10 bp
deletion of Ts. Split-reads that were sequenced across the BP in the target genome are aligned as soft-clipped reads to the reference genome at BP1 and BP2. The soft-
clipped and aligned sequences share some part of the sequence around the BP. Point mutations at the BP (like A to C in this example) do not hamper the detection, as
only the reads are compared with each other without using the reference genome.
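The clique selection outlined in the caption of Figure 2 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the SoftSV implementation: soft-clipped reads are represented by plain string identifiers, the edge weights (1 for a match between SCs at different breakpoints, 0 otherwise) are supplied as a dictionary instead of being computed by alignment, and the names `bron_kerbosch_pivot` and `best_clique` are hypothetical.

```python
def bron_kerbosch_pivot(adj):
    """Yield all maximal cliques of an undirected graph {vertex: set(neighbours)}."""

    def expand(r, p, x):
        if not p and not x:
            yield r  # r cannot be extended any further: it is maximal
            return
        # Pivot on the vertex with most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


def best_clique(weights):
    """Pick the maximal clique with the highest positive sum of edge weights.

    weights: {(u, v): w}, w = 1 for matching SCs at different breakpoints,
    w = 0 for edges between SCs at the same breakpoint (illustrative input).
    Returns None if no clique contains an edge with positive weight.
    """
    adj = {}
    for u, v in weights:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def score(clique):
        return sum(w for (u, v), w in weights.items() if u in clique and v in clique)

    best = max(bron_kerbosch_pivot(adj), key=score, default=None)
    if best is None or score(best) <= 0:
        return None
    return best


if __name__ == "__main__":
    # Two SCs at BP1 (a1, a2), two at BP2 (b1, b2) and one outlier read (o).
    example = {
        ("a1", "a2"): 0, ("b1", "b2"): 0,
        ("a1", "b1"): 1, ("a1", "b2"): 1, ("a2", "b1"): 1, ("a2", "b2"): 1,
        ("b2", "o"): 0,
    }
    print(sorted(best_clique(example)))  # ['a1', 'a2', 'b1', 'b2']
```

The outlier read forms only a zero-weight clique with one neighbour and is therefore not part of the reported set, mirroring the read on the far right in Figure 2A.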

approximate breakpoint regions in direction of the reads: starting at the 5′ end of the read, reaching downstream for alignments on the plus strand and starting at the 3′ end, reaching upstream for reads on the minus strand (see deletion example in Figure 2A). The size of each region is the mean insert size plus three times the standard deviation. If both breakpoint regions overlap between two SVs of the same type, we take their intersection to narrow down the search space for split-read candidates in the next step.
SoftSV has no threshold regarding mapping quality or breakpoint support. A single paired-end read can be sufficient to call the SV.
Paired-end reads have the disadvantage that they can only detect relatively large SVs. The minimum size depends on two characteristics: insert size (deletions), as the breakpoints lie somewhere between both ends, and read length (inversions, tandem duplications), as the reads are required to map within the region affected by the variation.

Small SVs: Breakpoint approximation with soft-clipping information
To detect SVs below this limit, we add the start of every soft-clipped sequence (SC) and its neighborhood to the list of breakpoint regions, as shown in Figure 2B. Every SC can be the start or end of an SV and the other breakpoint could be just a few base pairs away. By default, the neighborhood extends at most four times the standard deviation of the insert size distribution. This way, we can detect deletions as small as 10–20 bp (as long as the aligner marks the missing sequence as soft-clip) and inverted and tandem duplicated sequences below the read length (as long as the read can be mapped partially within the SV region).
Similar to large SVs, we intersect overlapping regions to avoid duplicate calls and get more precise approximations.

Filtering by a control data set
If a control sample is available, SVs with overlapping breakpoint regions between both samples can be discarded during step 1 (see also Supplementary Table S2). Many cancer studies, for example, sequence paired normal/diseased samples. In this setting, putative somatic mutations can be separated from germline mutations.

Exact breakpoint detection with soft-clipping information
For every potential SV in both lists (large and small SVs), SoftSV searches the approximated breakpoint regions for all reads that have a minimum soft-clipped sequence of 10 bp (Figure 2A). To localize the SV precisely, we expect SCs to map in a certain pattern: one cluster of SCs at the upstream breakpoint (BP1) and one at the downstream breakpoint (BP2). Both clusters can be combined in the manner of an assembly without a reference genome to retrieve the breakpoint sequence: the soft-clipped sequence at BP1 matches the non-soft-clipped sequence at BP2 and vice versa (Figure 2B). Further, SCs at the same breakpoint have to share a common soft-clipped sequence. Taken together, all SCs that belong to a true SV can be interconnected by aligning soft-clipped and non-soft-clipped sequences among all SCs.
Next, we use these relationships between SCs to build a weighted, undirected graph as shown in Figure 2A, C and Supplementary Figure S2. Each vertex represents an SC. Two vertices from different BPs are connected by an edge if the soft-clipped and the non-soft-clipped sequence can be matched by a Smith-Waterman alignment. Two vertices from the same BP (±5 bp tolerance) are connected if both soft-clips match. The sequence similarity has to be at least 90% (mismatch, gap, gap extension equally weighted), to account for sequencing errors and to allow at least one mismatch for the minimum SC length of 10 bp.
For every SV with at least one SC at both breakpoints, the graph, or some of its subgraphs, forms a maximal clique, i.e. a set of vertices such that for every pair of vertices there exists an edge in-between and the set cannot be extended by adjacent vertices without violating this criterion.
In case of inversions and translocations, some SCs cannot be matched, although they belong to the same SV and eventually share the same BP, because the SCs are not part of the other read. To avoid multiple maximal cliques for the same SV, those SCs are connected without an alignment if they are within 5 bp of the same BP or separated at both BPs (see dotted lines in Supplementary Figure S2A and C).
Those edges and edges between SCs at the same BPs have a weight of zero, while edges that are based on matching SCs between different BPs have a weight of one. We implemented the Bron-Kerbosch algorithm [17] (extended by a pivoting strategy) to compute all maximal cliques. For the output, we select the clique with the highest positive sum of edge weights, i.e. the clique with the highest number of matching SCs from both breakpoints. If there is no maximal clique with an edge of positive weight, we discard the SV to make sure that at least one valid alignment between SCs at both BPs supports the SV.
Finally, we report the consensus SV interval as the median of all SC positions at BP1 and BP2 respectively.

Results and discussion
In this section, we compare SoftSV with eight SV detection programs. We evaluate the algorithms based on the sensitivity and the positive predictive value (PPV) for SV detection on simulated data, the overlap with published SV lists of two popular data sets from the 1000 Genomes and HapMap Project and the similarities between a child and its parents in a CEU trio data set. Furthermore, we discuss the breakpoint coverage and accuracy, as well as the breakpoint and SV size distribution.

Simulated data sets
We rearranged the first 12 chromosomes of the human reference genome (GRCh37.59) using the program RSVSim [18]. In total, we simulated 1710 SVs: 500 deletions, 500 inversions, 500 tandem duplications and 210 insertions/translocations of genome fragments from one chromosome into another. Except for the translocations, which exchange whole chromosome ends, the SV sizes range from 20 bp to 10 kb and are based on the beta distribution to include more small than large events. We calculated the parameters of the beta distribution from real data from the Database of Genomic Variants [19]. Hence, smaller SVs were more abundant than larger ones (Supplementary Figure S1).
Studies like [3] showed that SV breakpoints are not distributed uniformly across the genome and can often be found within or close to repeat elements and regions of low complexity. In our simulation, we imitated this behavior and correlated up to 80% of the SVs with repeat regions using the default parameters of our SV simulator RSVSim (Supplementary Figure S1).
Because the DNA break is rarely a clean cut and often accompanied by smaller passenger mutations, breakpoint

sequences can harbor additional point and small indel mutations in a 50 bp neighborhood around the breakpoint. For further details about the simulation procedure, see the supplementary data and [18].
Based on the rearranged genome, dwgsim [20] was used to simulate 15 sets of paired-end reads with varying properties: a mean insert size of 250 bp with 50 bp standard deviation and 75, 100 and 150 bp read lengths at 5×, 10×, 15×, 20× and 25× sequence depth each.

Real data sets
We downloaded the CEU population trio NA12877 (father), NA12878 (mother), NA12882 (child) and the Yoruba sample NA18507 from the Sequence Read Archive (Accession numbers ERX069504, ERX012406, ERX069506 and SRX016231 respectively). The paired-end libraries were sequenced on the Illumina Genome Analyzer II (NA12878, NA18507) or HiSeq 2000 (NA12877, NA12882) with 100 bp reads and a mean insert size of 291–318 bp (trio) and 516 bp (NA18507) and standard deviations of 41–70 bp and 84 bp respectively. After removal of duplicate reads, the trio and NA18507 had a mean sequence depth of 12× and 40× respectively. We down-sampled NA18507 randomly to 20× and 10× to evaluate the effect of sequence depth on real data.
The data sets NA12878 and NA18507 were well studied and lists with high-quality candidate SVs were published in various studies and databases. Unfortunately, these lists mainly contain deletions and rarely other events. Hence, we concentrate on deletions reported in the Database of Genomic Variants (release 2014-10-16) and published by Yang et al. in [3].
For NA12878, we further selected validated deletions from the gold standard SV set published by Mills et al. in [1]. To make sure that every algorithm in this evaluation has a chance to detect the deletions, we selected only sets from the six participating groups that used paired-end approaches. If deletion calls were overlapping between the groups, we calculated the union of both regions.
From all SV sets, we selected SVs between 20 bp (deletions) or 50 bp (inversions and duplications) and 10 kb, which is the lower limit for SoftSV and the upper limit for Pindel.

Programs
We compare SoftSV with eight other SV detection programs, each one implementing a different approach:

i. Breakdancer [5] is one of the first approaches for large SV detection, based solely on insert size and read orientation. The quality and reliability of the breakpoint approximation relies on the clustering of paired-end reads.
ii. CREST [8] assembles overlapping soft-clipped sequences and maps the contigs back to the genome to find the opposite SV breakpoint.
iii. Clever [21] is an insert-size-based approach for small and large deletions and insertions, which calculates maximal cliques from an alignment graph of concordant and discordant paired-end reads at breakpoint sites and rates them by the probability that the reads in the clique support the SV.
iv. Delly [10] combines information from paired-end and split-reads to detect SV breakpoints with single-nucleotide resolution. It performs its own split-read alignment of unmapped or partially mapped reads found in previously approximated SV regions. Delly requires clusters of discordant paired-end reads and split-reads and, despite the split-read detection, the minimum detectable SV size is limited by the insert size.
v. GASVPro [12] combines paired-end information with the read depth signal around the breakpoint to approximate its location. Hence, its focus is on large SVs affecting the copy number.
vi. Pindel [7] performs a split-read alignment of unmapped reads around mapped mate reads within a default distance of 10 kb. It is able to detect small SVs down to a few base pairs but no translocations.
vii. Socrates [9] realigns long soft-clipped sequences back to the reference genome and validates the detected candidates by a comparison of short soft-clip sequences at the breakpoint. It is suited for all kinds and sizes of SVs evaluated in our study, but filters candidates by mapping quality and read support.
viii. SoftSearch [11] combines clusters of soft-clipped reads with paired-end information to detect (precise and approximate) SV breakpoints. While the basic idea is similar to SoftSV, SoftSearch focuses on large SVs and only reports the presence of certain arrangements of soft-clipped reads and performs no further alignment between them to reaffirm the breakpoints and compute their sequence.

Because not every program can detect both small and large SVs, we separated our data sets into groups of small SVs between 20 or 50 bp and 200 bp and SVs between 200 bp and 10 kb. To avoid skewed PPVs, we excluded SV detections below 10 bp, because such small SVs cannot be detected with soft-clipped alignments and are not part of our evaluation.
Similar to the evaluation in [22], we count an SV detection as true positive if both intervals, detection and ground truth, have a reciprocal overlap of at least 50%.
Evaluating whole-genome paired-end sequencing data of single samples, we investigate a rather general approach to SV detection. Some other recent programs are not part of the evaluation owing to their special design (see also Supplementary Table S1): BreaKmer [23], which assembles partially or misaligned reads by a k-mer subtraction procedure, was designed specifically for targeted sequencing data sets. SMUFIN [24], designed for cancer data sets, directly compares sequences from paired tumor and normal samples to detect somatic SVs without an initial alignment to a reference genome. Ulysses [25] models the amount of artificial chimeric sequences in mate-pair libraries and determines the statistical significance of its SV calls from aberrant mate-pair mappings. Gustaf [26] recommends the multi-split aligner Stellar from the same group and chains multiple partial alignments of single- and paired-end reads using a weighted split-read graph to resolve complex breakpoints involving more than one SV.

Alignment
We aligned the paired-end sequences with the BWA-backtrack (end-to-end) or BWA-MEM algorithm (local) to the human reference genome (GRCh37.59). SoftSV relies on very sensitive soft-clipped alignments with BWA-MEM. Pindel and Delly implemented their own procedure for split-read mapping, and too many spurious paired-end alignments may hamper detection for paired-end-only approaches like Breakdancer and GASVPro. Hence, we compared sensitivity and PPV for BWA-backtrack and BWA-MEM for a 100 bp simulated data set in Supplementary Figure S10 and chose the aligner delivering the

best results: BWA-MEM for SoftSV, SoftSearch and CREST and BWA-backtrack for the remaining six programs. With up to 10% difference in sensitivity and PPV, SoftSV is the only program that greatly profits from the very sensitive local alignments of the BWA-MEM algorithm.

Parameter tuning with simulated data
All eight methods mentioned above have several parameters to adjust their performance. The most influential threshold is the minimum support for a breakpoint to be reported. For every simulated data set, we tested settings between one and five reads and selected the one having the highest F1 score across all SV types (Figure 3 and Supplementary Figure S6). The F1 score is the harmonic mean of the sensitivity and the PPV. All other parameters kept their default settings.
With some exceptions for Pindel (too low) and SoftSearch (too high), the default support threshold already works well for most methods. For sequencing coverages of 10× or higher, only small improvements by a few percent could be made by changing its value. For values larger than one, the support mainly affects the sensitivity of a method, except for Pindel, whose number of false-positive detections increases relatively strongly for lower thresholds (Supplementary Figures S3–S5). Sometimes, the loss in sensitivity can be compensated by increased PPVs and vice versa. Among all split-read methods, SoftSV, which we designed for single read support, and Socrates, which has a default support of one, are the most robust across all data sets.
Further, we tested the influence of the minimum mapping quality threshold for Breakdancer, Delly, Pindel and SoftSearch by setting it to zero (i.e. allowing ambiguously mapping reads). Surprisingly, the change had only a minor effect on the sensitivity and increased the number of false positives for Delly and SoftSearch (Supplementary Figures S7–S9). Hence, we kept the default setting. Possibly, breakpoints are poorly covered by such low-quality reads and are filtered out by other criteria, or the split-read alignment of Delly and Pindel does not work in these regions.
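The F1-based choice of the minimum support described in this section can be sketched as follows. This is a minimal illustration, not the evaluation code used in the study: the function names `f1_score` and `best_support` are hypothetical, each threshold is assumed to be summarised by a single (sensitivity, PPV) pair aggregated over all SV types, and the numbers in the usage example are invented for demonstration.

```python
def f1_score(sensitivity, ppv):
    """Harmonic mean of sensitivity and positive predictive value."""
    if sensitivity + ppv == 0:
        return 0.0
    return 2 * sensitivity * ppv / (sensitivity + ppv)


def best_support(results):
    """Return the minimum-support setting with the highest F1 score.

    results: {min_support: (sensitivity, ppv)}, assumed to be aggregated
    over all SV types for one simulated data set.
    """
    return max(results, key=lambda support: f1_score(*results[support]))


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    tested = {1: (0.90, 0.60), 2: (0.85, 0.85), 3: (0.70, 0.95)}
    for support, (sens, ppv) in tested.items():
        print(support, round(f1_score(sens, ppv), 3))
    print("selected support:", best_support(tested))
```

Because the harmonic mean is dominated by the smaller of the two values, a setting with balanced sensitivity and PPV is preferred over one that maximises either measure alone.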

Breakpoint detection on simulated data sets


Figure 4 and Supplementary Tables S3 and S4 show the sensitiv
ity and PPVs of the SV detection on the simulated 100 bp paired-
end data set for five coverage settings (5, 10,15, 20 and
25), separated into small and large events. The results for read
Figure 3. The minimum support parameter for each method has been chosen lengths 75 and 150 bp are shown in Supplementary Figure S11.
according to the highest F1 score over all SV types. The default setting is marked
There are two main issues the plots demonstrate very clearly:
by an asterisk. Only the simulated 100 bp/15 data set is shown here, all other
data sets can be seen in Supplementary Figure S6. A colour version of this figure i. The detection of deletions (especially small ones) and inter-
is available at BIB online: http://bib.oxfordjournals.org. chromosomal translocations is the most difficult. Inference

Figure 4. Sensitivities (y axis) and PPVs (x axis) for eight methods on simulated 100 bp paired-end data. The darker the symbol, the higher the sequence coverage (5 to
25). SVs are separated into groups of small (20–200 bp) and large SVs (200–10 000 bp). Not all plots show every method, if it is not designed for this type of SV. A colour
version of this figure is available at BIB online: http://bib.oxfordjournals.org.
58 | Bartenhagen and Dugas

of deletions from aberrant insert sizes above a certain


threshold is subject to experimental design and data quality
(insert size distribution) and more vague criteria than the
change in mapping order for tandem duplications or orien-
tation for inversions. The duplications usually benefit from
a higher coverage at the breakpoints, which leads to the best
results for all nine programs. Translocations often suffer
from wrong alignments all over the genome with false-
positive detections being most problematic. In all those
cases, however, SoftSV is the most sensitive approach, while
keeping the false positives very low (PPV 80–100%).
ii. Every program we compare SoftSV with has its own strengths,
favored SV types and requirements on sequence coverage and
read length, leading to highly variable results. Breakdancer,
Delly and SoftSearch have major shortcomings on short
SVs, while Delly also misses many large deletions and
inversions. The same holds for GASVPro, which performs well for
large deletions and inversions, but does not detect tandem
duplications and determines many translocations incorrectly.
The latest addition, Socrates, is the most precise algorithm,
but the low false-positive rate comes at the cost of
sensitivity. Pindel is either too sensitive (small deletions) or
too restrictive (large deletions), but improves on inversions
and tandem duplications. SoftSV is the only method producing
stable results on all settings for all types of SVs with
high sensitivities above 85% and PPVs between 80 and 100%
for low sequence coverages of 10×. It gives almost the
same results for 75 bp reads as for 150 bp (Supplementary
Figure S11). One reason is its comparison of soft-clipped
reads without an alignment back with a reference genome.
This enables us to work with relatively short soft-clipped
sequences of 10 bp and more split-read alignments.
Accordingly, compared with the other split-read/soft-clip
methods Delly, Pindel and Socrates, true positive breakpoints
are often covered by more split-reads
(Supplementary Figure S12).

Next to the sensitivity and amount of false-positive predictions,
the breakpoint accuracy is another important measure to
define the quality of SV detection. Being able to detect the
breakpoints with single-nucleotide precision can be key to
some applications and facilitate further laboratory validations.
Figure 5 shows the distance between detected breakpoints and
true breakpoints in the simulated 100 bp/15× data set. As expected,
paired-end- and read depth-based methods can only approximate
the true breakpoints up to a few hundred
(Breakdancer) or 10–100 bp (Clever, GASVPro). All split-read-based
methods are able to detect most breakpoints with only
very few bp deviation, with only few exceptions. SoftSearch,
applying a mixture of paired-end and soft-clip detection, reports
both, approximations up to 500 bp and precise locations.

Figure 5. Breakpoint accuracy for the simulated 100 bp/15× data set. For all true
positive breakpoints (i.e. the start and end of an SV), the y axis shows the distance
between the detected and the true coordinates. Some paired-end- or read
depth-based methods can only give approximations, while split-read alignments
can detect the breakpoint with almost single-nucleotide precision.

Breakpoint detection on real data sets

Breakdancer, Clever, Pindel and SoftSV show the highest overlap
with the reference deletions for data sets NA12878 and
NA18507 (Figures 6 and 7 and Supplementary Table S5). But
Pindel, like Clever, deals with a very high increase of breakpoints
toward smaller deletions (Supplementary Figure S14).
The simulations indicated a high amount of false positives for
small events, but this is impossible to resolve for real data sets
without a comprehensive ground truth. The two other methods
based on the concept of soft-clipped reads, Socrates and
SoftSearch, are far less sensitive than SoftSV.

The paired-end-based methods are relatively sensitive on all
data sets, but restricted to certain SV sizes (Breakdancer) or
deletions (Clever) and only applicable to experiments where
breakpoint accuracy is not a primary goal or can be refined in a
post-processing step.

The downsampling of NA18507 from 40× to 10× affirms that
more coverage increases sensitivity for all programs. Similar to
our simulations, SoftSV is more robust than all other split-read
methods when lowering the coverage. The step from 20× to 10×
often means a significant decrease in sensitivity for most
methods, meaning that sequencing above 10× coverage is
recommended for exact breakpoint detection. Breakpoint
approximations, however, are less affected by coverage. Breakdancer
and Clever work for the 10× data set almost as good as for 40×.

For whole genome sequencing experiments, mappings at repetitive,
low complexity regions with high diversity between individual
genomes often appear as a kind of junkyard for
discordantly, partially and often wrongly mapped reads
(Supplementary Figures S16, S18). Some methods like SoftSV,
Delly, Breakdancer and Pindel tend to detect breakpoint hot-spots
at those locations (Supplementary Figure S17). Excluding
only the centromeres from the analysis can decrease the number
of detected SVs by a third in some cases.

Figure 6. Deletion detection on a data set from the 1000 Genomes Project (NA12878) for reference SV sets published in the Database of Genomic Variants [19], by Mills
et al. [1] and by Yang et al. [3] (top row: large SVs; bottom row: small SVs). It shows the overlap in percent, the size of the reference set in the caption and the total number
of SV detections in brackets behind each method. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.

Trio analysis

In a family setting, most SVs found in a child can be expected to
be passed on from its parents [27]. Hence, the amount of exclusive
SVs in the offspring allows conclusions regarding the false-positive
rate. Figure 8 and Supplementary Table S6 show that
large differences in the amount of SV detections in the child
(total and overlap with parents) can be seen for all methods and
SV types. Deletions yield the highest overlap in most cases,
while at least half of the inversions and tandem duplications
were detected exclusively in the child by most methods. SoftSV
detects many (exclusive) tandem duplications, while Socrates,
on the other end, appears to be the most precise but less sensitive.
Concerning the ratio between shared and exclusive SVs,
split-read-based methods mostly perform better on the trio
comparison than Breakdancer and GASVPro.

A high overlap between the individuals either represents
truly inherited variants or reproducible, artificial detections
owing to ambiguities and problems during read mapping. Other
more random effects, for example issues of the sequencing
technique like the generation of chimeric fragments or
sequencing errors, can cause high-quality false-positive predictions
and low overlaps between closely related samples. The
number of SV calls varies strongly between different methods
and even the samples themselves. Without validations and
knowledge about the mechanism behind each SV, the comparison
can only give an estimation of the detection quality of each
method. High overlaps, although lower than in the child/parents
setting, can be seen between unrelated samples as well
(Supplementary Figure S15).

Breakpoint visualization

Visualizations of the alignment, for example, with the
Integrative Genomics Viewer [28] (IGV), can be helpful to resolve
complex breakpoints and uncertainties about an SV detection.
Most methods, however, perform their own split-read alignments
(CREST, Delly, Pindel, Socrates) and apply many filters.
This makes the interpretation of their results in an alignment
browser like the IGV difficult. Although their algorithm has
been described in detail in the publications, their actual processing
of the data often remains a black box. The fact that
SoftSV only analyses sequence similarities between partially
mapped reads already present in the alignment provided by the
user, makes it easier to comprehend how the SV has been
detected just by looking at the mapped reads (see examples
from the IGV in Supplementary Figures S19–S21).

Conclusions

SV breakpoint detection is still far away from a comprehensive
and robust solution, which delivers SV breakpoints of various
kinds of SVs with high sensitivity and precision, no matter if
few base-pairs small or spanning hundreds of base-pairs, lying
in exons or non-coding repetitive regions. Many factors can
influence the outcome, like sequencing coverage, read quality,
mapping ambiguity, read length and breakpoint location.
Furthermore, a research scenario, which concentrates on the
discovery of (new) variations, may sacrifice some of the precision
for a high sensitivity, while a clinical application, where
false-positive predictions are far more critical, requires both
qualities to be as high as possible.

Most programs encounter these problems with high configurability
and many parameters and filters. But this may require a
lot of fine-tuning, high expertise, experience and knowledge
about the algorithms and the properties of each individual data
set. Every parameter and filter comes with a risk of losing
important or gaining wrong information. The trade-off between
sensitivity and false-positive rate is one of the key problems for
SV detection. We tested the influence of the most common
parameters and showed, for example, that some programs are
more robust to different minimum support thresholds than
others. Fortunately, the default settings for them are often well
chosen. Nonetheless, many programs come with a set of individual
parameters specially geared to their algorithm and their
influence remains open.

Our goal was to find a strong criterion for exact breakpoint
detection that demands no extensive configuration. An
approach handling low breakpoint support, while being as
greedy as possible and taking all reads into account that may be
related to an SV, despite their mapping quality or ambiguity.
With SoftSV, we described, implemented and evaluated a
method for exact breakpoint detection based on soft-clip
comparisons, which demands no user input but a soft-clip aware
alignment. For simulations and real data sets, it was the only
method that yielded robust results across various sequencing
depths between 5× and 25× and read lengths between 75 and
150 bp for small and large deletions, inversions, tandem
duplications between 20 and 10 000 bp and inter-chromosomal
translocations.

Figure 7. Similar to Figure 6, it shows the overlap of detected deletions from nine methods for a data set from the HapMap Project (NA18507) with reference sets published
in the Database of Genomic Variants [19] and by Yang et al. [3]. Below the original sequence coverage (40×), it shows two randomly down-sampled data sets (20× and 10×).
A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.

With paired-end reads getting longer and sequencing coverage
becoming higher at lower costs, the chances to sequence
across a breakpoint with sufficient support will increase and
current split-read methods are expected to become more sensitive
and higher thresholds will decrease false-positive detections. But
what remains are the low complexity and repetitive regions,
where mapping is ambiguous and quality is low. To enable sensitive
breakpoint detection outside the comfortable zone of coding
sequences, programs need to loosen their restrictions on mapping
quality and breakpoint coverage and find new criteria to
distinguish a true mutation from artificial noise. Our approach,
presented in this study, is one step in this direction.

Figure 8. Overlap of SV detections between the child and its parents in the CEU trio data set for small and large deletions, inversions and tandem duplications.
Methods without a detection for the SV type and size are not shown. Hot-spots of false-positive breakpoints close to the centromeres (+/-1 Mb) were excluded to avoid
skewed results. For the comparison of two unrelated samples (the parents) see Supplementary Figure S15. A colour version of this figure is available at BIB online:
http://bib.oxfordjournals.org.

Key Points

• Current methods for SV detection use paired-end,
split-reads, read depth or a combination to detect
breakpoints approximately or with single-nucleotide
resolution. Simulations showed that they have individual
strengths and weaknesses regarding the type and
size of an SV, long and short paired-end reads, high
and low sequencing coverage.
• Breakpoints in repeat rich and low-complexity regions,
as well as passenger mutations, pose problems to
split-read alignments and hamper sensitivity of many
current methods in those areas.
• SoftSV is a greedy yet robust approach to detect exact
breakpoints of small and large deletions, inversions,
tandem duplications and inter-chromosomal translocations.
Breakpoints need to be supported by only
one paired-end and one soft-clipped read at every
breakpoint. SoftSV builds an undirected graph from an
assembly of soft-clipped sequences, without aligning
them back to the reference genome. It was the only
method, which detected all four kinds of SVs consistently
over a wide range of simulated data sets without
any user input but an alignment.
• SoftSV is implemented in C++ and available from
http://sourceforge.net/projects/softsv.

Supplementary data

Supplementary data are available online at http://bib.oxfordjournals.org/.

Funding

Funded by Deutsche Krebshilfe (Grant 110495).

References

1. Mills RE, Walter K, Stewart C, et al. Mapping copy number variation by population-scale genome sequencing. Nature 2011;470:59–65.
2. Kidd JM, Cooper GM, Donahue WF, et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008;453:56–64.
3. Yang L, Luquette LJ, Gehlenborg N, et al. Diverse mechanisms of somatic structural variations in human cancer genomes. Cell 2013;153:919–29.
4. Stephens PJ, Greenman CD, Fu B, et al. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell 2011;144:27–40.
5. Chen K, Wallis JW, McLellan MD, et al. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods 2009;6:677–81.
6. Korbel JO, Urban AE, Affourtit JP, et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2011;318:420–6.
7. Ye K, Schulz MH, Long Q, et al. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics 2009;25:2865–71.
8. Wang J, Mullighan CG, Easton J, et al. CREST maps somatic structural variation in cancer genomes with base-pair resolution. Nat Methods 2011;8:652–4.
9. Schroeder J, Hsu A, Boyle SE, et al. Socrates: identification of genomic rearrangements in tumour genomes by re-aligning soft clipped reads. Bioinformatics 2014;30:1064–72.
10. Rausch T, Zichner T, Schlattl A, et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 2012;28:i333–9.
11. Hart SN, Sarangi V, Moore R, et al. SoftSearch: integration of multiple sequence features to identify breakpoints of structural variations. PLoS One 2013;8:e83356.
12. Sindi SS, Onal S, Peng LC, et al. An integrative probabilistic model for identification of structural variation in sequencing data. Genome Biol 2012;13:R22.
13. Wong K, Keane TM, Stalker J, et al. Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly. Genome Biol 2010;11:R128.
14. Li H, Handsaker B, Wysoker A, et al. The sequence alignment/map format and SAMtools. Bioinformatics 2009;25:2078–9.
15. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009;25:1754–60.
16. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv 2013;1303.3997v2.
17. Bron C, Kerbosch J. Algorithm 457: finding all cliques of an undirected graph. Commun ACM 1973;16:575–7.
18. Bartenhagen C, Dugas M. RSVSim: an R/Bioconductor package for the simulation of structural variations. Bioinformatics 2013;29:1679–81.
19. Iafrate AJ, Feuk L, Rivera MN, et al. Detection of large-scale variation in the human genome. Nat Genet 2004;36:949–51.
20. Homer N. Whole Genome Simulation. http://sourceforge.net/projects/dnaa (5 January 2014, date last accessed).
21. Marschall T, Costa IG, Canzar S, et al. CLEVER: clique-enumerating variant finder. Bioinformatics 2012;28:2875–82.
22. Hormozdiari F, Alkan C, Eichler EE, et al. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res 2009;19:1270–8.
23. Abo RP, Ducar M, Garcia EP, et al. BreaKmer: detection of structural variation in targeted massively parallel sequencing data using kmers. Nucleic Acids Res 2014;43:e19.
24. Moncunill V, Gonzalez S, Bea S, et al. Comprehensive characterization of complex structural variations in cancer by directly comparing genome sequence reads. Nat Biotechnol 2014;32:1106–12.
25. Gillet-Markowska A, Richard H, Fischer G, et al. Ulysses: accurate detection of low-frequency structural variations in large insert-size sequencing libraries. Bioinformatics 2014:btu730.
26. Trappe K, Emde AK, Ehrlich HC, et al. Gustaf: detecting and correctly classifying SVs in the NGS twilight zone. Bioinformatics 2014;30:3484–90.
27. Kloosterman WP, Guryev V, van Roosmalen M, et al. Chromothripsis as a mechanism driving complex de novo structural rearrangements in the germline. Hum Mol Genet 2011;20:1916–24.
28. Thorvaldsdottir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in Bioinformatics 2013;14:178–92.