doi: 10.1093/bioinformatics/bty624
Advance Access Publication Date: 13 July 2018
Original Paper
Sequence analysis
Abstract
Motivation: Functional somatic mutations within coding amino acid sequences confer a growth advantage in the pathogenic process. Most existing methods for identifying cancer-related mutations focus on the single amino acid level or the entire gene level. However, gain-of-function mutations often cluster in specific protein regions rather than occurring independently along the amino acid sequence. Several approaches that identify mutation clusters from the mutation density along the amino acid chain have been proposed recently, but their performance in identifying mutation clusters remains to be improved.
Results: Here we present a Data-adaptive Mutation Clustering Method (DMCM), in which a kernel density estimate (KDE) with a data-adaptive bandwidth is applied to estimate the mutation density and to find clusters of variable length on amino acid sequences. We apply this approach to the mutation data of 571 genes in over twenty cancer types from The Cancer Genome Atlas (TCGA). We compare DMCM with M2C, OncodriveCLUST and Pfam domains and find that DMCM tends to identify more significant clusters. The cross-validation analysis shows that DMCM is robust, and the cluster cancer type enrichment analysis shows that specific cancer types are enriched for specific mutation clusters.
Availability and implementation: DMCM is written in Python and analysis methods of DMCM are
written in R. They are all released online, available through https://github.com/XinguoLu/DMCM.
Contact: hnluxinguo@126.com or pengshaoliang@nudt.edu.cn
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Cancer genomes possess a large number of aberrations including somatic mutations and copy number variations (Lu et al., 2017). Genomic aberration associated with cancers is a complicated pathological phenomenon in carcinogenic process. Much of our previous understanding of cancer genetics is grounded on the principle that cancer arises from a clone that has accumulated the requisite somatically-acquired genetic aberrations, leading to malignant transformation (Stratton, 2011). Progression of such a transformed clone to disseminated disease has previously been thought to be a linear process driven by serially acquired new mutations, including base pair substitutions, small insertions and deletions of bases, chromosomal rearrangements and gains and losses in gene copy number. However, recent insights have emerged from comprehensive genomic characterization using next-generation sequencing (NGS) technologies, which has allowed for the sequencing of the coding portion of the genome through hybrid-capture whole-exome sequencing (WES) or nearly all base pairs in a tumor-normal pair by whole-genome sequencing (WGS), revealing unanticipated complexities in the patterns of somatic mutations in cancers (Chin et al., 2011; Meyerson et al., 2010).
© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
390 X.Lu et al.
Cancers are clonal proliferations that arise owing to mutations (i.e. driver mutations) that confer selective growth advantage on cells, but most of somatic mutations in a patient have not been subject to selection (i.e. passenger mutations) (Watson et al., 2013). That is to say, driver mutations in cancer-related genes alter the structure of protein after coding and lead to the cancers. However, the characteristics of driver and passenger mutations in cancer genomes are not currently well distinguished. Thus, how to identify driver mutations accurately and understand the exact consequences of such mutations still

data-adaptive bandwidth to identify cancer-related mutation clusters. And we applied DMCM on over 500 mutated genes in 23 tumor types from TCGA and 1309 mutation clusters were identified. To validate the robustness of DMCM, we split the mutation data into two partitions and used each partition to generate a new KDE model for each gene, we then compared the resulting clusters from each partition. These processes indicated that DMCM is robust. To validate the robustness of clusters, we calculated scores for each cluster found by DMCM and found abundant robust clusters
nonparametric method for estimating the probability density. Given an n-sample $x_1, x_2, \ldots, x_n$ of independent random variables with the common density $f$, the Parzen-Rosenblatt kernel estimate of density $f$ is defined as Equation (1):

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \tag{1}$$

where $\hat{f}(x)$ is estimated probability density for x, n is the number of

Epanechnikov kernel function, Triangular kernel function or Biweight kernel function) are chosen for estimating the mutation frequency and the results show that these estimators produce almost the same density curves (See Supplementary Fig. S1A). So the following KDE will be conducted without consideration of kernel function selection. However, the bandwidth is set to 0.8, 1, 2, 4, 8 respectively with Gaussian kernel function and the result shows that KDE is very sensitive to bandwidth selection. Tremendously divergent density curves are produced by their corresponding bandwidths (See Supplementary Fig. S1B). Hence, this sensitivity of the final density estimate to the selected bandwidth makes this parameter an important area of focus for this work. Fitting a curve with variable bandwidth for different hotspots on the amino acid sequence is critical to explore cancer-related mutation clusters.

To this end, we propose a data-adaptive KDE algorithm, described in Algorithm 1 in Supplementary Files, to optimize the bandwidth at every specific x on the amino acid sequence. Firstly, the optimal fixed bandwidth for a specific amino acid sequence is obtained by minimizing the Mean Integrated Square Error (MISE). MISE is defined with Equation (2).

$$\mathrm{MISE}(\hat{f}(x)) = \int [\mathrm{bias}(\hat{f}(x))]^2\,dx + \int \mathrm{var}\,\hat{f}(x)\,dx = \frac{h^4}{4}\,\mu_2(K)^2 \int f''(x)^2\,dx + \frac{\int K(t)^2\,dt}{nh} + o\left(h^4 + \frac{1}{nh}\right) \tag{2}$$

where $\mathrm{bias}(\hat{f}(x)) = E\hat{f}(x) - f(x)$ is the difference between the expected $\hat{f}(x)$ and observed probability density $f(x)$, $\mathrm{var}\,\hat{f}(x)$ is the variance for $\hat{f}(x)$, $\mu_2 = \int t^2 K(t)\,dt$, and $o$ is the little o notation. So the Asymptotic Mean Integrated Square Error (AMISE), defined with Equation (3), consists of the first two leading items of MISE.

$$\mathrm{AMISE}(\hat{f}(x)) = \frac{h^4}{4}\,\mu_2(K)^2 \int f''(x)^2\,dx + \frac{\int K(t)^2\,dt}{nh} \tag{3}$$

After taking the one-dimensional Gaussian function as kernel function, we obtained the optimal fixed bandwidth shown as Equation (4) by differentiating AMISE and setting the derivative equal to 0,

$$h_{opt} = C\,\sigma^{1/2}\,n^{-1/5} \tag{4}$$

where $h_{opt}$ is the optimal fixed bandwidth, $\sigma = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$, n is the length of amino acid chain to a gene, $x_i$ is the number of mutations on specific amino acid position in a gene, $\bar{x}$ is the mean of the mutations in every amino acid position of a chain, and C = 1.059 is a constant.

We assumed $\mathrm{MSE}(x) = \mathrm{bias}(\hat{f}(x))^2 + \mathrm{var}\,\hat{f}(x)$, then a data-adaptive bandwidth shown as Equation (5) was obtained by differentiating MSE and setting the derivative equal to 0,

$$h(x) = \left(\frac{f(x)\int K(t)^2\,dt}{\mu_2^2\,f''(x)^2}\right)^{1/5} n^{-1/5} \tag{5}$$

according to Equation (1), (5) and (6),

$$\hat{h}(x) = h^{9/5}\,c^{1/5}\,\frac{\hat{f}(x)^{1/5}}{\left(\sum_{i=1}^{n}\frac{1}{nh}\,t^2 K(t) - \hat{f}(x)\right)^{2/5}} \tag{7}$$

where $h = h_{opt}$ is the optimal fixed bandwidth calculated by Equation (4), and $c = \int \hat{f}''(x)^2\,dx = \frac{3}{8}\pi^{-1/2}h^{-5}$.

Finally, the data-adaptive kernel density estimate function is:

$$\hat{f}(x) = \frac{1}{n\hat{h}(x)} \sum_{i=1}^{n} K\left(\frac{x - x_i}{\hat{h}(x)}\right) \tag{8}$$

where $\hat{h}(x)$ is the data-adaptive bandwidth calculated by Equation (7).

2.4 Generating the clusters
Clusters are referred to as amino acid position pairs $(x_i, x_j)$ where $x_i$ is the start position of the cluster and $x_j$ is the end position of the cluster. DMCM identifies cancer-related mutation clusters for a given dataset by four steps.

Step 1: Background noise detection.
We identify the background noise distribution by applying an initial weight to coding-silent mutations. The initial weight of noise distribution, denoted by $U_i$ as Equation (9), is defined as the percentage of coding-silent mutations at ith amino acid unit.

$$U_i = \frac{M^{0}_{i}}{M_{total}}, \quad M_{total} = \sum_{i=0}^{L} mutation_i, \quad (0 \le i \le L) \tag{9}$$

where $M^{0}_{i}$ is the number of coding-silent mutations at ith amino acid unit for a given gene g, L is the length of amino acid sequence of g. This background noise is filtered for generating the clusters.

Step 2: Identify initial seed Gaussian.
We calculate the data-adaptive bandwidth, $\hat{h}(x)$. An initial set of n seed Gaussians is then identified, where n is the number of local maxima in the density estimate. A Gaussian is defined by 2 variables, its mean and its standard deviation denoted by $\mu$ and $\sigma$ respectively.

$$G(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{10}$$

The parameters, $\mu$ and $\sigma$, are initially estimated from kernel density estimate, f(x). The mean of each Gaussian and the standard deviation of each Gaussian, $\mu$ and $\sigma$, are initialized as the location of the local maxima and the distance between the two adjacent local
minima around a given maxima of the KDE result respectively. We define the amino acid location of the ith local maximum as $(lma_i, f(lma_i))$. This maximum will be bordered by two local minima, $(lmi_i, f(lmi_i))$ and $(lmi_{i+1}, f(lmi_{i+1}))$ to the left and right, respectively. In the edge cases where there is no local minima before or after ai, the values (0, f(0)) and (L, f(L)) are used as local minima depending upon the case, where L is the length of the gene in amino acids. Then, the initial parameters, $\mu_i$ and $\sigma_i$, of $G_i$ are given by:

Table 1. Table of variables in Fisher’s Exact Test

                      Cluster A    Other clusters
Cancer type T         N1           N2
Other cancer types    N3           N4

cluster is significantly associated with a specific cancer type. Relevant statistics are given in Table 1.
Fig. 1. Comparison of KDE result generated with the optimal fixed bandwidth (non-data-adaptive result) and data-adaptive bandwidth (DMCM result). A) Comparison between the real density and the non-data-adaptive density. B) Comparison between the real density and the DMCM density. C) Comparison of optimal non-data-adaptive bandwidth (dotted) and the data-adaptive bandwidth (dashed). D) Comparison of two bias curves

spectrums in Figure 2B and C respectively (Fig. 2B is the density result of KDE with fixed bandwidth and Figure 2C is the density result of DMCM). This process results in 3 and 6 effective clusters for KDE with fixed bandwidth and DMCM respectively (clusters identified by KDE with fixed bandwidth and DMCM are shown in Fig. 2D and E); clusters in light grey are removed since the data samples in these clusters are too few to increase the statistical power. The result shows that DMCM focuses on finding narrower clusters and identifies more clusters than KDE with fixed bandwidth, since the data-adaptive bandwidth suits the change of data characteristics better than the fixed bandwidth, as described previously. We also calculate the cover ratio of clusters identified by DMCM and the non-data-adaptive method. The result shows that DMCM clusters cover 100% of clusters identified by the non-data-adaptive method. On the other hand, clusters detected by the non-data-adaptive method cover only 66.7% of DMCM clusters (regions of two clusters of DMCM are overlapped below 50% by non-data-adaptive clusters).

3.2 Experiments on pan-cancer mutation data
3.2.1 Pan-cancer data pre-processing
The raw pan-cancer gene list, including 435 genes, is obtained as mentioned above by combining the results of five methods (MuSiC, OncodriveFM, OncodriveCLUST, ActiveDriver and MutSig) for exploiting all the signals of positive selection. Specifically, each method considers a different signal of positive selection and five potential candidate driver lists are identified. These potential candidate driver lists are then merged into a consensus driver gene list. After associating these consensus driver genes with TCGA somatic mutation data of 23 cancer types, we finally obtain the raw somatic mutation dataset which can be found in the Supplementary Data.
The cancer types we consider are: ACC, BLCA, BRCA, CESC, COAD, GBM, HNSC, KICH, KIRC, KIRP, LAML, LGG, LIHC, LUAD, LUSC, OV, PRAD, READ, SKCM, STAD, THCA, UCEC and UCS. Full name of each cancer type can be found in T1 in Supplementary Tables.

3.2.2 Mutation clusters identified by DMCM
DMCM identified 1309 clusters for 392 genes out of the selected 435 genes. The remaining 43 genes without any clusters have all their mutations attributed to the background noise model by DMCM and are omitted from further analysis. The following results indicate that our method finds variable-length regions of proteins which are enriched for mutations. Clusters span a wide range of lengths: from 1 to 600 amino acid units, and the number of mutations in each cluster ranges from 15 to 338 mutations. Figure 3A shows the comparison of two density curves generated by DMCM and by the non-data-adaptive method with an optimal fixed bandwidth of 4, respectively. Then, two sets of mutation clusters are identified by using these two density curves as shown in Figure 3B and C. We find that the density curve generated by DMCM is more precise than the one generated by the non-data-adaptive method. Clusters identified by the non-data-adaptive method are generally longer than clusters from DMCM.
We compare the clusters found by DMCM to the multiscale mutation clustering (M2C) algorithm (Poole et al., 2017), a method based on Density Based Spatial Clustering of Applications with Noise (OncodriveCLUST) (Tamborero et al., 2013b) and protein domains from Pfam (Finn et al., 2010). M2C uses a multiscale clustering method based on KDE with 23 fixed bandwidths to create 23 kernel densities and finally merges these densities to identify mutation clusters. Our method, based on KDE with a data-adaptive bandwidth, finds more clusters than M2C (M2C found 1255 clusters in 393 genes). OncodriveCLUST’s approach also uses a kernel smoother to create a mutation density, but using only one predefined fixed bandwidth. Although DMCM finds fewer clusters than OncodriveCLUST (which found 5185 clusters in 514 genes), we find that clusters identified by DMCM tend to be larger and have more mutations. These statistics are represented in Figure 4A.
DMCM clusters cover a total of 71% of M2C clusters, 52% more than OncodriveCLUST clusters do (OncodriveCLUST clusters cover only 18% of M2C clusters). In addition, DMCM clusters cover a total of 51% of OncodriveCLUST clusters, which is 6% more than M2C clusters (M2C clusters cover 48% of OncodriveCLUST clusters). Conversely, clusters from M2C and OncodriveCLUST cover 50% and 10% of DMCM clusters, respectively. Finally, we note that DMCM has a total of 34% of its clusters located within or overlapping with protein domains, while M2C and OncodriveCLUST have 31% and 22% of their clusters overlapping protein domains from Pfam, respectively. These statistics are summarized in Figure 4B and C.
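The contrast between a fixed-bandwidth density and a data-adaptive one, on which the cluster comparison above rests, can be illustrated with a short sketch of Equations (1), (7) and (8) using a Gaussian kernel. This is a minimal pure-Python reading of those equations, not the released DMCM implementation. Several choices are assumptions made here for illustration: t is taken to be (x - x_i)/h, the absolute value of the curvature term is used so that the fractional power is defined, the bandwidth is clipped to [h_min, h_max], and the function names and sample positions are hypothetical.

```python
import math


def gaussian_kernel(t):
    """Standard one-dimensional Gaussian kernel K(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def fixed_kde(x, positions, h):
    """Fixed-bandwidth Parzen-Rosenblatt estimate at x, as in Equation (1)."""
    n = len(positions)
    return sum(gaussian_kernel((x - xi) / h) for xi in positions) / (n * h)


def adaptive_bandwidth(x, positions, h, h_min=0.5, h_max=50.0):
    """Data-adaptive bandwidth at x, following the shape of Equation (7).

    Assumptions: t = (x - x_i) / h, abs() of the curvature term, and
    clipping of the result to [h_min, h_max].
    """
    n = len(positions)
    f_hat = fixed_kde(x, positions, h)
    c = 3.0 / (8.0 * math.sqrt(math.pi) * h ** 5)
    weighted = sum(
        ((x - xi) / h) ** 2 * gaussian_kernel((x - xi) / h) for xi in positions
    ) / (n * h)
    curvature = weighted - f_hat
    if f_hat <= 0.0 or curvature == 0.0:
        return h_max  # flat or empty region: fall back to the widest bandwidth
    hx = h ** 1.8 * c ** 0.2 * f_hat ** 0.2 / abs(curvature) ** 0.4
    return min(max(hx, h_min), h_max)


def adaptive_kde(x, positions, h, h_min=0.5, h_max=50.0):
    """Density estimate at x with the local bandwidth, as in Equation (8)."""
    hx = adaptive_bandwidth(x, positions, h, h_min, h_max)
    n = len(positions)
    return sum(gaussian_kernel((x - xi) / hx) for xi in positions) / (n * hx)


if __name__ == "__main__":
    # Hypothetical mutation positions: one dense hotspot and a sparse tail
    positions = [12, 12, 13, 13, 13, 14, 60, 75, 90, 130]
    for x in (13, 75):
        print(x, adaptive_bandwidth(x, positions, h=4.0),
              fixed_kde(x, positions, 4.0), adaptive_kde(x, positions, 4.0))
```

The fixed bandwidth h is passed in rather than computed, so the sketch does not depend on how Equation (4) is evaluated for a given gene; a value of 4 mirrors the fixed bandwidth quoted for Figure 3A.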
Fig. 4. Cluster statistics comparing DMCM clusters with M2C clusters, OncodriveCLUST clusters and Pfam Domains. A) Cluster length histogram. B) Coverage
with competing methods and Pfam histogram
3.2.3 Cross-validation analysis of DMCM
We validate the robustness of DMCM by splitting our dataset into two equally sized partitions (data from one partition is regarded as the training data and data from the other partition is regarded as the validation data) and running the algorithm separately on each partition. A data-adaptive KDE model is then generated for each gene from the two partitions and every KDE model is used to generate a mutation cluster set. First, we calculate the log-likelihoods for KDE models from each partition and assess whether the log-likelihoods are highly correlated by calculating the Spearman Correlation Coefficient. The result showed that the log-likelihoods are significantly correlated (Spearman Correlation Coefficient 0.99 and P-value 0). These results are plotted in Figure 5 and indicate that the statistical model underlying DMCM is robust.
We also calculate the cover ratio of clusters between the two partitions with Equation (13). We define 'covered' to mean that for two clusters in the same gene from different partitions, one of the clusters overlaps the other by at least 50%, as illustrated before. Our validation analysis shows that on average DMCM robustness is about 40%, meaning about 40% of clusters are covered. However, we further note that smaller denser clusters are more highly conserved and overlap by a greater percentage between partitions than large sparse clusters.
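The 50% coverage rule described above can be sketched as follows. Equation (13) is not reproduced in this text, so the formula below is an assumption: a cluster counts as covered when some cluster from the other set overlaps at least half of its length, with clusters given as inclusive (start, end) amino acid positions. The names `overlap_fraction` and `cover_ratio` and the example coordinates are hypothetical.

```python
def overlap_fraction(a, b):
    """Fraction of cluster b overlapped by cluster a.

    Clusters are (start, end) amino acid positions, both ends inclusive.
    """
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    shared = max(0, end - start + 1)
    return shared / (b[1] - b[0] + 1)


def cover_ratio(covering, covered, threshold=0.5):
    """Share of clusters in `covered` that are overlapped by at least
    `threshold` of their length by some cluster in `covering`."""
    if not covered:
        return 0.0
    hits = sum(
        1
        for b in covered
        if any(overlap_fraction(a, b) >= threshold for a in covering)
    )
    return hits / len(covered)


if __name__ == "__main__":
    # Hypothetical cluster sets for one gene from the two partitions
    partition_a = [(10, 20), (50, 60)]
    partition_b = [(12, 18), (55, 70), (100, 110)]
    print(cover_ratio(partition_a, partition_b))
    print(cover_ratio(partition_b, partition_a))
```

The measure is not symmetric: in this toy example the first set covers one of three clusters in the second, while the second covers both clusters in the first, which is why the coverage figures in the text are always reported in both directions.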
DMCM: Data-adaptive Mutation Clustering Method 395
Fig. 6. Selected top-ranking genes from DMCM analysis on two cancer types (Only genes annotated in the Cancer Gene Census are depicted). Summary of the results obtained by three methods (DMCM, OncodriveCLUST and M2C) aimed to find driver genes for the two analyzed cancer types

Opteron, 24 GB of memory capacity and 500 GB of hard disk capacity. We compare the time consumption of DMCM with M2C by using the speed of convergence to evaluate the computational time of these two algorithms, since both algorithms are iterative convergence algorithms. By now, the EM algorithm is one of the best solutions to conquer convergence problems. The mutation cluster generation process of these two algorithms is done by the EM algorithm to optimize the cluster boundary. Hence, our time consumption analysis focuses on comparing the speed of EM convergence. Gene PTEN containing 550 mutations across 21 cancer types is used to perform this comparison. For DMCM, the AMISE converges after 40 iterations while it takes 60 iterations for M2C (See Supplementary Fig. S4A). Moreover, M2C stores all the mutation cluster lists generated from each fixed bandwidth (28 bandwidths in total) in memory and finally merges the 28 cluster lists to generate a final list. Hence, there is potential memory overhead with the increased size of dataset. In contrast, DMCM is more memory efficient as it stores only one mutation cluster list generated by the data-adaptive bandwidth. The comparison of memory usage in gene PTEN is shown in Supplementary Figure S4B which shows that M2C costs 5.5GB of memory space while DMCM only costs 4GB. Taken together, DMCM is more efficient than M2C in both computational time consumption and memory usage.

4 Conclusion
Many previous methods identifying driver mutations focus on the entire gene or the single amino acid level, which may not be appropriate owing to the fact that mutations associated with cancers do not occur uniformly in a gene or at random positions within the coding amino acid sequence. We have proposed a Data-adaptive Mutation Clustering Method (DMCM) which combines these two approaches by creating a data-adaptive bandwidth searching for variable-length regions of interest within individual genes. DMCM represents a data driven approach towards systematically identifying

Acknowledgements
The authors wish to thank all the anonymous reviewers whose constructive comments will be very helpful to strengthen the presentation of this study.

Funding
This work was supported by the Natural Science Foundation of Hunan Province, China (Grant No. 2018JJ2053).

Conflict of Interest: none declared.

References
Benjamini,Y. and Hochberg,Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc., 57, 289–300.
Chin,L. et al. (2011) Making sense of cancer genomic data. Genes Dev., 25, 534–555.
Chwialkowski,K. et al. (2016) A kernel test of goodness of fit. In: 33rd International Conference on Machine Learning, ICML 2016, vol. 6, New York City, NY, pp. 3854–3867.
Dees,N.D. et al. (2012) Music: identifying mutational significance in cancer genomes. Genome Res., 22, 1589–1598.
Ding,J. et al. (2015) Systematic analysis of somatic mutations impacting gene expression in 12 tumour types. Nat. Commun., 6, 8554.
Eynden,J.V.D. et al. (2015) Sominaclust: detection of cancer genes based on somatic mutation patterns of inactivation and clustering. BMC Bioinformatics, 16, 125.
Finn,R.D. et al. (2010) The pfam protein families database. Nucleic Acids Res., 40, 290–301.
Fisher,R.A. (1922) On the interpretation of χ2 from contingency tables, and the calculation of p. J. R. Stat. Soc., 85, 87–94.
Gonzalez-Perez,A. and Lopez-Bigas,N. (2012) Functional impact bias reveals cancer drivers. Nucleic Acids Res., 40, e169.
Lawrence,M.S. et al. (2013) Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature, 499, 214–218.
Lee,C.S. et al. (2014) Recurrent point mutations in the kinetochore gene knstrn in cutaneous squamous cell carcinoma. Nat. Genet., 46, 1060–1062.
Lu,X. et al. (2014) A co-expression modules based gene selection for cancer recognition. J. Theor. Biol., 362, 75–82.
Lu,X. et al. (2017) Driver pattern identification over the gene co-expression of drug response in ovarian cancer by integrating high throughput genomics data. Sci. Rep., 7, 16188.
Lu,X. et al. (2018) The integrative method based on the module-network for identifying driver genes in cancer subtypes. Molecules, 23, 183.
Meyerson,M. et al. (2010) Advances in understanding cancer genomes through second-generation sequencing. Nat. Rev. Genet., 11, 685.
Network,T.C.G.A. (2014) Comprehensive molecular characterization of urothelial bladder carcinoma. Nature, 507, 315.
Poole,W. et al. (2017) Multiscale mutation clustering algorithm identifies pan-cancer mutational clusters associated with pathway-level changes in gene expression. Plos Computat. Biol., 13, e1005347.
Reimand,J. and Bader,G.D. (2014) Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers. Mol. Syst. Biol., 9, 637.
Sabarinathan,R. et al. (2017) The whole-genome panorama of cancer drivers. bioRxiv, 190330. doi: https://doi.org/10.1101/190330.
Stehr,H. et al. (2011) The structural impact of cancer-associated missense mutations in oncogenes and tumor suppressors. Mol. Cancer, 10, 54.
Stratton,M.R. (2011) Exploring the genomes of cancer cells: progress and promise. Science, 331, 1553–1558.
Tamborero,D. et al. (2013a) Comprehensive identification of mutational cancer driver genes across 12 tumor types. Sci. Rep., 3, 2650.
Tamborero,D. et al. (2013b) Oncodriveclust: exploiting the positional clustering of somatic mutations to identify cancer genes. Bioinformatics, 29, 2238.
Watson,I.R. et al. (2013) Emerging patterns of somatic mutations in cancer. Nat. Rev. Genet., 14, 703–718.
Ye,J. et al. (2010) Statistical method on nonrandom clustering with application to somatic mutations in cancer. BMC Bioinformatics, 11, 11.