
Bioinformatics, 35(3), 2019, 389–397

doi: 10.1093/bioinformatics/bty624
Advance Access Publication Date: 13 July 2018
Original Paper

Sequence analysis

DMCM: a Data-adaptive Mutation Clustering Method to identify cancer-related mutation clusters

Downloaded from https://academic.oup.com/bioinformatics/article/35/3/389/5053323 by University of Bologna user on 24 September 2020

Xinguo Lu (1,*), Xin Qian (1), Xing Li (1), Qiumai Miao (1) and Shaoliang Peng (1,2,*)
(1) College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China and (2) School of Computer Science, National University of Defense Technology, Changsha 410073, China
*To whom correspondence should be addressed.
Associate Editor: Bonnie Berger
Received on March 7, 2018; revised on June 12, 2018; editorial decision on July 6, 2018; accepted on July 12, 2018

Abstract
Motivation: Functional somatic mutations within coding amino acid sequences confer growth
advantage in pathogenic process. Most existing methods for identifying cancer-related mutations
focus on the single amino acid or the entire gene level. However, gain-of-function mutations often
cluster in specific protein regions instead of existing independently in the amino acid sequences.
Some approaches for identifying mutation clusters with mutation density on amino acid chain
have been proposed recently. But their performance in identification of mutation clusters remains
to be improved.
Results: Here we present a Data-adaptive Mutation Clustering Method (DMCM), in which kernel
density estimate (KDE) with a data-adaptive bandwidth is applied to estimate the mutation density,
to find variable clusters with different lengths on amino acid sequences. We apply this approach in
the mutation data of 571 genes in over twenty cancer types from The Cancer Genome Atlas
(TCGA). We compare the DMCM with M2C, OncodriveCLUST and Pfam Domain and find that
DMCM tends to identify more significant clusters. The cross-validation analysis shows DMCM is robust
and cluster cancer type enrichment analysis shows that specific cancer types are enriched for
specific mutation clusters.
Availability and implementation: DMCM is written in Python and analysis methods of DMCM are
written in R. They are all released online, available through https://github.com/XinguoLu/DMCM.
Contact: hnluxinguo@126.com or pengshaoliang@nudt.edu.cn
Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction
Cancer genomes possess a large number of aberrations including somatic mutations and copy number variations (Lu et al., 2017). Genomic aberration associated with cancers is a complicated pathological phenomenon in carcinogenic process. Much of our previous understanding of cancer genetics is grounded on the principle that cancer arises from a clone that has accumulated the requisite somatically-acquired genetic aberrations, leading to malignant transformation (Stratton, 2011). Progression of such a transformed clone to disseminated disease has previously been thought to be a linear process driven by serially acquired new mutations, including base pair substitutions, small insertions and deletions of bases, chromosomal rearrangements and gains and losses in gene copy number. However, recent insights have emerged from comprehensive genomic characterization using next-generation sequencing (NGS) technologies, which has allowed for the sequencing of the coding portion of the genome through hybrid-capture whole-exome sequencing (WES) or nearly all base pairs in a tumor-normal pair by whole-genome sequencing (WGS), revealing unanticipated complexities in the patterns of somatic mutations in cancers (Chin et al., 2011; Meyerson et al., 2010).

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com

Cancers are clonal proliferations that arise owing to mutations (i.e. driver mutations) that confer selective growth advantage on cells, but most somatic mutations in a patient have not been subject to selection (i.e. passenger mutations) (Watson et al., 2013). That is to say, driver mutations in cancer-related genes alter the structure of the encoded protein and lead to cancer. However, the characteristics of driver and passenger mutations in cancer genomes are not currently well distinguished. Thus, how to identify driver mutations accurately and understand the exact consequences of such mutations still remains a formidable challenge.

The pathogenesis of cancer through somatic mutations is under constant investigation, helped by the development of various high-throughput analysis techniques. Meanwhile, NGS technologies and large genomic projects like The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) have generated a large quantity of somatic mutation data on different functions in recent years. Many computational methods based on statistical theory have been proposed to identify the driver mutations and understand the functional impact of these mutations. MuSiC (Dees et al., 2012) identified mutational significance in cancer genomes and separated the significant events which are likely drivers for disease from the passenger mutations by using a variety of statistical methods. OncodriveFM (Gonzalez-Perez and Lopez-Bigas, 2012) identified lowly recurrently mutated driver events by applying FM bias to three datasets of tumor somatic variants. ActiveDriver (Reimand and Bader, 2014) provided a systematic analysis of somatic mutations in phosphorylation signaling to predict novel cancer drivers. These analytical approaches are used for gathering more mutations and detecting the cancer-related drivers. However, their assumption of a uniform background noise distribution does not hold for the real characteristics of somatic mutations in cancer. MutSig (Lawrence et al., 2013) detected cancer-related genes and ranked them based on somatic mutation events. SomInaClust (Eynden et al., 2015) identified driver genes based on their mutation pattern across tumor samples and then classified them into oncogenes (OGs) and tumor suppressor genes (TSGs) respectively. Module-Network (Lu et al., 2018) distinguished the subtype-specific drivers from normal genes by applying a module-network analysis. But these methods identify cancer drivers by focusing on the level of the entire gene. In addition, many methods have been proposed to identify cancer mutations at the level of a single amino acid (Lee et al., 2014; Lu et al., 2014; Network, 2014). However, mutations that are associated with cancers do not occur within the coding sequence at random positions. Cancer mutations usually behave as clusters or hotspot regions of non-synonymous mutations (e.g. missense mutations) in a cancer-related gene. More importantly, the cause of altering the functions of cancer-related genes is probably the interaction of different mutation clusters (e.g. PIK3CA). Hence, mutation cluster identification has become a research focus in the pathogenesis of cancer.

Recently, a multiscale mutation clustering algorithm (M2C) (Poole et al., 2017), based on kernel density estimate (KDE), was proposed to identify variable length mutation clusters along the amino acid chain. This approach identifies mutation clusters by using 28 different fixed bandwidths ranging from 2 to 450 (amino acid units) in KDE and finally merges the 28 kernel density models generated by these bandwidths. Yet the result of a kernel density estimate at a fixed bandwidth is usually out of line with the real density. We expect a kernel density model with a data-adaptive bandwidth to accommodate the variation of mutation frequency at every point of the amino acid sequence.

To this end, we propose a Data-adaptive Mutation Clustering Method (DMCM) based on kernel density estimate with a data-adaptive bandwidth to identify cancer-related mutation clusters. We applied DMCM on over 500 mutated genes in 23 tumor types from TCGA and 1309 mutation clusters were identified. To validate the robustness of DMCM, we split the mutation data into two partitions and used each partition to generate a new KDE model for each gene; we then compared the resulting clusters from each partition. These processes indicated that DMCM is robust. To validate the robustness of clusters, we calculated scores for each cluster found by DMCM and found abundant robust clusters with high scores. In order to test whether the clusters are enriched for specific cancer types, we performed cluster cancer type enrichment analysis and calculated a P-value for each cluster by Fisher's Exact Test. The result showed that the percentage of clusters with P-value < 0.05 is 100%, which revealed that the clusters identified by DMCM were significantly enriched for specific cancer types. To give a clear description of this association, we listed several well-known clusters in the cancer literature recovered by DMCM and several novel clusters found by DMCM. Finally, DMCM is applied to identify driver genes. The results are compared with OncodriveCLUST and M2C. The comparison indicates that DMCM is an effective complement to other driver-detection methods.

2 Materials and methods
2.1 Benchmark dataset
In this paper, the benchmark dataset consists of a simulation dataset and a real pan-cancer dataset. The simulation data, containing 100 data samples, is generated by conjoining five normal distributions $G_i$, $i = 1, 2, \ldots, 5$, with means $\mu_i$ set to 0, 1, 1.5, 2, 3 and standard deviations $\sigma_i$ set to 1, 2, 3, 4, 5, respectively. Two percent of samples are randomly selected to generate the background noise distribution. In this simulation data, 10 data points within the simulated sequence are selected to construct the background noise.

The pan-cancer gene list is composed of 291 high-confidence driver genes and 144 candidate driver genes (Tamborero et al., 2013a). This pre-computed and curated gene list is a consensus gene list generated by five methods (MuSiC, OncodriveFM, OncodriveCLUST, ActiveDriver and MutSig) that together consider three mutation signals of positive selection (recurrence, functional impact and clustering) (Sabarinathan et al., 2017; Tamborero et al., 2013a).

2.2 Constructing the synonymous mutations model for background noise
In previous studies, the mutation noise on amino acid sequences is assumed to follow a uniform distribution at every point (Stehr et al., 2011; Ye et al., 2010). However, it is contentious whether functional mutations within the coding sequences (missense mutations, insertion mutations or deletion mutations) and ordinary mutations (nonsense mutations or silent mutations) occur at random positions on the amino acid chain (Poole et al., 2017; Tamborero et al., 2013b). Studies have shown that synonymous mutations, including coding-silent mutations, do not follow a uniform distribution but instead accumulate at some regions in amino acid sequences (references more). Therefore, we constructed the background noise model using the distribution of coding-silent mutations.

2.3 Kernel density estimate approach using a data-adaptive bandwidth
Kernel density estimate (KDE) is applied to detect driver patterns in previous mutation cluster identification methods. KDE is a

nonparametric method for estimating the probability density. Given an n-sample $x_1, x_2, \ldots, x_n$ of independent random variables with the common density $f$, the Parzen-Rosenblatt kernel estimate of density $f$ is defined as Equation (1):

$$\hat{f}(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \tag{1}$$

where $\hat{f}(x)$ is the estimated probability density for $x$, $n$ is the number of samples which represents the length of amino acid sequence in our mutation clustering algorithm, $h$ is the bandwidth of KDE, $d$ is the number of data dimensions and $K(\cdot)$ is the kernel function. $d$ is set to 1 as the somatic mutation events are located on the 1D amino acid sequence.

To generate a KDE for one somatic mutation dataset, the kernel function $K(\cdot)$ and the bandwidth $h$ must be preset as input. The effects of $K(\cdot)$ and $h$ are demonstrated in Supplementary Fig. S1. Different kernel functions (Gaussian kernel function, Epanechnikov kernel function, Triangular kernel function or Biweight kernel function) are chosen for estimating the mutation frequency and the results show that these estimators produce almost the same density curves (See Supplementary Fig. S1A). So the following KDE will be conducted without consideration of kernel function selection. However, when the bandwidth is set to 0.8, 1, 2, 4, 8 respectively with the Gaussian kernel function, the result shows that KDE is very sensitive to bandwidth selection. Tremendously divergent density curves are produced by their corresponding bandwidths (See Supplementary Fig. S1B). Hence, this sensitivity of the final density estimate to the selected bandwidth makes this parameter an important area of focus for this work. Fitting the curve with a variable bandwidth for different hotspots on the amino acid sequence is critical to explore cancer-related mutation clusters.

To this end, we propose a data-adaptive KDE algorithm, described in Algorithm 1 in Supplementary Files, to optimize the bandwidth at every specific $x$ on the amino acid sequence. Firstly, the optimal fixed bandwidth for a specific amino acid sequence is obtained by minimizing the Mean Integral Square Error (MISE). MISE is defined with Equation (2).

$$\mathrm{MISE}(\hat{f}(x)) = \int [\mathrm{bias}(\hat{f}(x))]^2\,dx + \int \mathrm{var}\,\hat{f}(x)\,dx = \frac{h^4}{4}\,\mu_2(K)^2 \int f''(x)^2\,dx + \frac{\int K(t)^2\,dt}{nh} + o\left(h^4 + \frac{1}{nh}\right) \tag{2}$$

where $\mathrm{bias}(\hat{f}(x)) = E\hat{f}(x) - f(x)$ is the difference between the expected $\hat{f}(x)$ and the observed probability density $f(x)$, $\mathrm{var}\,\hat{f}(x)$ is the variance for $\hat{f}(x)$, $\mu_2 = \int t^2 K(t)\,dt$, and $o(\cdot)$ is the little o notation. So the Asymptotic Mean Integrated Square Error (AMISE), defined with Equation (3), consists of the first two leading terms of MISE.

$$\mathrm{AMISE}(\hat{f}(x)) = \frac{h^4}{4}\,\mu_2(K)^2 \int f''(x)^2\,dx + \frac{\int K(t)^2\,dt}{nh} \tag{3}$$

After taking the one-dimensional Gaussian function as the kernel function, we obtained the optimal fixed bandwidth shown as Equation (4) by differentiating AMISE and setting the derivative equal to 0,

$$h_{opt} = C\,\sigma^{1/2}\,n^{-1/5} \tag{4}$$

where $h_{opt}$ is the optimal fixed bandwidth, $\sigma = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})$, $n$ is the length of the amino acid chain of a gene, $x_i$ is the number of mutations on a specific amino acid position in a gene, $\bar{x}$ is the mean of the mutations in every amino acid position of a chain, and $C = 1.059$ is a constant.

We assumed $\mathrm{MSE}(x) = \mathrm{bias}(\hat{f}(x))^2 + \mathrm{var}\,\hat{f}(x)$, then a data-adaptive bandwidth shown as Equation (5) was obtained by differentiating MSE and setting the derivative equal to 0,

$$h(x) = \left(\frac{f(x)\int K(t)^2\,dt}{\mu_2^2\, f''(x)^2}\right)^{1/5} n^{-1/5} \tag{5}$$

However, $f(x)$ is the real distribution, which is not known beforehand. Here we replace $f(x)$ by $\hat{f}(x)$ shown as Equation (1) as an estimate. We then calculate $\hat{f}''(x)$ by applying the Gaussian kernel function to Equations (1) and (5),

$$\hat{f}''(x) = \frac{1}{n h^{3}} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \left(t^{2} e^{-t^{2}/2} - e^{-t^{2}/2}\right) \tag{6}$$

where $t = \frac{x - X_i}{h}$. And then we obtain the data-adaptive bandwidth according to Equations (1), (5) and (6),

$$\hat{h}(x) = h^{9/5}\, c^{1/5}\, \frac{\hat{f}(x)^{1/5}}{\left(\sum_{i=1}^{n} \frac{1}{nh}\, t^{2} K(t) - \hat{f}(x)\right)^{2/5}} \tag{7}$$

where $h = h_{opt}$ is the optimal fixed bandwidth calculated by Equation (4), and $c = \int \hat{f}''(x)^2\,dx = \frac{3}{8}\pi^{-1/2} h^{-5}$.

Finally, the data-adaptive kernel density estimate function is:

$$\hat{f}(x) = \frac{1}{n\hat{h}(x)} \sum_{i=1}^{n} K\left(\frac{x - x_i}{\hat{h}(x)}\right) \tag{8}$$

where $\hat{h}(x)$ is the data-adaptive bandwidth calculated by Equation (7).

2.4 Generating the clusters
Clusters are referred to as amino acid position pairs $(x_i, x_j)$ where $x_i$ is the start position of the cluster and $x_j$ is the end position of the cluster. DMCM identifies cancer-related mutation clusters for a given dataset by four steps.

Step 1: Background noise detection.
We identify the background noise distribution by applying an initial weight to coding-silent mutations. The initial weight of the noise distribution, denoted by $U_i$ as Equation (9), is defined as the percentage of coding-silent mutations at the ith amino acid unit.

$$U_i = \frac{M'_i}{M_{total}}; \quad M_{total} = \sum_{i=0}^{L} \mathrm{mutation}_i, \quad (0 \le i \le L) \tag{9}$$

where $M'_i$ is the number of coding-silent mutations at the ith amino acid unit for a given gene $g$, and $L$ is the length of the amino acid sequence of $g$. This background noise is filtered for generating the clusters.

Step 2: Identify initial seed Gaussian.
We calculate the data-adaptive bandwidth, $\hat{h}(x)$. An initial set of $n$ seed Gaussians is then identified, where $n$ is the number of local maxima in the density estimate. A Gaussian is defined by 2 variables, its mean and its standard deviation, denoted by $\mu$ and $\sigma$ respectively.

$$G(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}} \tag{10}$$

The parameters, $\mu$ and $\sigma$, are initially estimated from the kernel density estimate, $f(x)$. The mean of each Gaussian and the standard deviation of each Gaussian, $\mu$ and $\sigma$, are initialized as the location of the local maxima and the distance between the two adjacent local

minima around a given maximum of the KDE result respectively. We define the amino acid location of the ith local maximum as $(lma_i, f(lma_i))$. This maximum will be bordered by two local minima, $(lmi_i, f(lmi_i))$ and $(lmi_{i+1}, f(lmi_{i+1}))$, to the left and right, respectively. In the edge cases where there is no local minimum before or after $lma_i$, the values $(0, f(0))$ and $(L, f(L))$ are used as local minima depending upon the case, where $L$ is the length of the gene in amino acids. Then, the initial parameters, $\mu_i$ and $\sigma_i$, of $G_i$ are given by:

$$\mu_i = lma_i, \qquad \sigma_i = \frac{lmi_{i+1} - lmi_i}{2} \tag{11}$$

Step 3: Generate initial mutation clusters.
The means and standard deviations are optimized with an Expectation Maximization (EM) algorithm. The EM algorithm consists of 3 steps: E step (calculate the expectation of $\mu$ and $\sigma$), M step (calculate the maximum likelihood estimate of the Gaussian model by these two parameters from the E step) and iteration step (repeat the E step and M step until convergence of these two parameters). This process determines the boundaries of mutation clusters, by which a set of mutation clusters is identified.

Step 4: Remove insufficient mutation.
Clusters which contain fewer than $\alpha$ mutations are removed to ensure that clusters have sufficient mutations for further statistical analysis. The selection of $\alpha$ depends on the richness of the mutation dataset. The parameter $\alpha$ is set to 15, which means clusters that contain fewer than 15 mutations are omitted from further analysis.

The details of the Data-adaptive Mutation Clustering Method (DMCM) are shown with Algorithm 2 in Supplementary Files.

2.5 Performance evaluation metrics
We apply the goodness-of-fit as one of the performance evaluation metrics to validate whether the density curve estimated by DMCM fits the real density curve from the data closely. According to the Nonparametric Goodness-of-fit Test (Chwialkowski et al., 2016), we define the goodness-of-fit, denoted $EA_{\hat{f}(x)}$, of DMCM as the bias between the estimated density curve and the real density curve from the data, which is given by:

$$EA_{\hat{f}(x)} = \frac{1}{n}\sum_{i=1}^{n} \left|\hat{f}(x_i) - f(x_i)\right| \tag{12}$$

where $n$ is the number of data samples, and $\hat{f}(x)$ and $f(x)$ are the estimated density and the real density at data point $x$ respectively. The bias indicates the goodness-of-fit of the estimated density curve to the real density curve. High bias represents poor accuracy of the estimated density result.

To compare DMCM clusters with the clusters of other existing methods, we define a 'cover' as > 50% overlap, which means the region of a cluster overlaps at least 50% of another cluster's region. Then, the cover ratio of DMCM clusters overlapping clusters identified by the reference method is defined as below:

$$\text{Cover Ratio} = \frac{Num_{covered}}{Num_{all}} \tag{13}$$

where $Num_{covered}$ is the number of covered clusters detected by the reference method, and $Num_{all}$ is the number of all clusters of the reference method.

In addition, we validate whether specific cancer types are enriched for specific mutation clusters by using Fisher's Exact Test (Fisher, 1922) to calculate the P-value. Our hypotheses are H0: one specific cluster is independent from a specific cancer type; H1: one specific cluster is significantly associated with a specific cancer type. Relevant statistics are given in Table 1.

Table 1. Table of variables in Fisher's Exact Test

                      Cluster A    Other clusters
Cancer type T         N1           N2
Other cancer types    N3           N4

According to Fisher's Exact Test, we calculate the P-value by:

$$P\text{-value} = \frac{\binom{N_1 + N_3}{N_1}\binom{N_2 + N_4}{N_2}}{\binom{N_1 + N_2 + N_3 + N_4}{N_1 + N_2}} \tag{14}$$

H1 is accepted if the calculated P-value < 0.05.

3 Experimental results and discussion
We conduct experiments on simulation data and pan-cancer data respectively. For the simulation dataset, the data-adaptive bandwidth is calculated to generate the KDE curves, which are compared with the results generated by the non-data-adaptive method. For the real pan-cancer dataset, mutation clusters are identified by DMCM and then compared with M2C (Poole et al., 2017), OncodriveCLUST (Tamborero et al., 2013b) and Pfam (Finn et al., 2010). Also, the cross-validation analysis and the cluster cancer type enrichment analysis are performed.

3.1 Experiments on simulation data

3.1.1 Kernel density estimated by DMCM
The kernel density curve for the simulation data is generated by our proposed data-adaptive method of DMCM. We compare the result with the curve estimated by the non-data-adaptive method. For the non-data-adaptive method, the optimal fixed bandwidth is calculated (h = 2.1) according to Equations (2) and (4). We apply the Gaussian function in the kernel density estimate and the density curve is obtained as shown with the red dotted line in Figure 1A, where the green line is the real density curve for the simulation data. For DMCM, we firstly calculate the optimal fixed bandwidth, $h_{opt}$, by minimizing the MISE. MSE is defined as a transformation of MISE, then the data-adaptive bandwidth is calculated using Equation (5). The data-adaptive density curve is shown with the red dashed line in Figure 1B, in which the green line represents the real density result. From Figure 1A and B, the density curve generated by DMCM fits the real density curve more closely than the one generated by the non-data-adaptive method. In other words, the density curve generated by DMCM with a data-adaptive bandwidth presents the detailed features better and is closer to the actual density. A comparison of the fixed bandwidth with the data-adaptive bandwidth is shown in Figure 1C, from which we notice that the fixed bandwidth is over-averaged. We also compare the goodness of fit between the data-adaptive density result and the non-data-adaptive density result by calculating the bias with Equation (12). The result shows that, compared with the non-data-adaptive density $\hat{f}(x)$, DMCM improves the goodness-of-fit by 73.7% ($EA_{\hat{f}(x)} = 0.00078$ for DMCM and $EA_{\hat{f}(x)} = 0.00296$ for the non-data-adaptive estimate).

3.1.2 Clusters identified by DMCM
To further validate the superiority of DMCM, two sets of clusters are identified by KDE with fixed bandwidth and DMCM respectively (Fig. 2). Density curves are generated by KDE with fixed bandwidth and DMCM and we present these density results with density
spectrums in Figure 2B and C respectively (Fig. 2B is the density result of KDE with fixed bandwidth and Figure 2C is the density result of DMCM). This process results in 3 and 6 effective clusters for KDE with fixed bandwidth and DMCM respectively (clusters identified by KDE with fixed bandwidth and DMCM are shown in Fig. 2D and E); clusters in light grey are removed since the data samples in these clusters are too few to provide statistical power. The result shows that DMCM focuses on finding narrower clusters and identifies more clusters than KDE with fixed bandwidth, since the data-adaptive bandwidth follows the change of data characteristics better than the fixed bandwidth, as described previously. We also calculate the cover ratio of clusters identified by DMCM and the non-data-adaptive method. The result shows that DMCM clusters cover 100% of the clusters identified by the non-data-adaptive method. On the other hand, clusters detected by the non-data-adaptive method cover only 66.7% of DMCM clusters (the regions of two DMCM clusters are overlapped below 50% by non-data-adaptive clusters).

Fig. 1. Comparison of KDE result generated with the optimal fixed bandwidth (non-data-adaptive result) and data-adaptive bandwidth (DMCM result). A) Comparison between the real density and the non-data-adaptive density. B) Comparison between the real density and the DMCM density. C) Comparison of optimal non-data-adaptive bandwidth (dotted) and the data-adaptive bandwidth (dashed). D) Comparison of two bias curves

Fig. 2. Comparison of clusters identified by DMCM and KDE with optimal fixed bandwidth. A) represents the histogram of the raw data distribution. B) and C) show two spectrums of mutation density calculated by KDE with fixed bandwidth and DMCM respectively. D) and E) display clusters identified by KDE with fixed bandwidth and DMCM respectively

3.2 Experiments on pan-cancer mutation data
3.2.1 Pan-cancer data pre-processing
The raw pan-cancer gene list, including 435 genes, is obtained as mentioned above by combining the results of five methods (MuSiC, OncodriveFM, OncodriveCLUST, ActiveDriver and MutSig) to exploit all the signals of positive selection. Specifically, each method considers a different signal of positive selection and five potential candidate driver lists are identified. These potential candidate driver lists are then merged into a consensus driver gene list. After associating these consensus driver genes with TCGA somatic mutation data of 23 cancer types, we finally obtain the raw somatic mutation dataset, which can be found in the Supplementary Data.

The cancer types we consider are: ACC, BLCA, BRCA, CESC, COAD, GBM, HNSC, KICH, KIRC, KIRP, LAML, LGG, LIHC, LUAD, LUSC, OV, PRAD, READ, SKCM, STAD, THCA, UCEC and UCS. The full name of each cancer type can be found in T1 in Supplementary Tables.

3.2.2 Mutation clusters identified by DMCM
DMCM identified 1309 clusters for 392 genes out of the selected 435 genes. The remaining 43 genes without any clusters have all their mutations assigned to the background noise model by DMCM and are omitted from further analysis. The following results indicate that our method finds variable-length regions of proteins which are enriched for mutations. Clusters span a wide range of lengths, from 1 to 600 amino acid units, and the number of mutations in each cluster ranges from 15 to 338. Figure 3A shows the comparison of two density curves generated by DMCM and by the non-data-adaptive method with an optimal fixed bandwidth of 4 respectively. Then, two sets of mutation clusters are identified by using these two density curves, as shown in Figure 3B and C. We find that the density curve generated by DMCM is more precise than the one generated by the non-data-adaptive method. Clusters identified by the non-data-adaptive method are generally longer than clusters from DMCM.

We compare the clusters found by DMCM to the multiscale mutation clustering (M2C) algorithm (Poole et al., 2017), a method based on Density Based Spatial Clustering of Applications with Noise (OncodriveCLUST) (Tamborero et al., 2013b) and protein domains from Pfam (Finn et al., 2010). M2C uses a multiscale clustering method based on KDE with 23 fixed bandwidths to create 23 kernel densities and finally merges these densities to identify mutation clusters. Our method finds more clusters based on KDE with data-adaptive bandwidth than M2C (M2C found 1255 clusters in 393 genes). OncodriveCLUST's approach also uses a kernel smoother to create a mutation density, but using only one predefined fixed bandwidth. Despite finding fewer clusters (OncodriveCLUST found 5185 clusters in 514 genes), we find that clusters identified by DMCM tend to be larger and have more mutations. These statistics are represented in Figure 4A.

DMCM clusters cover a total of 71% of M2C clusters, which is 52% more than OncodriveCLUST clusters (OncodriveCLUST clusters only cover 18% of M2C clusters). In addition, DMCM clusters cover a total of 51% of OncodriveCLUST clusters, which is 6% more than M2C clusters (M2C clusters cover 48% of OncodriveCLUST clusters). For DMCM clusters, clusters from M2C and OncodriveCLUST cover 50% and 10% of DMCM clusters respectively. Finally, we note that DMCM has a total of 34% of clusters located within or overlapping with protein domains, while M2C and OncodriveCLUST have 31% and 22% of clusters overlapping protein domains from Pfam respectively. These statistics are summarized in Figure 4B and C.
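The cover percentages reported above follow the cover ratio of Equation (13). As a minimal sketch, assuming clusters on one gene are stored as inclusive (start, end) amino acid position pairs and that "cover" means at least 50% of the reference cluster's positions are overlapped, the ratio could be computed as below. The names `overlap_fraction`, `cover_ratio` and `Cluster`, and the toy positions, are hypothetical and not taken from the DMCM code base.

```python
from typing import List, Tuple

Cluster = Tuple[int, int]  # hypothetical representation: (start, end), inclusive


def overlap_fraction(target: Cluster, other: Cluster) -> float:
    """Fraction of the positions of `target` that also fall inside `other`."""
    shared_start = max(target[0], other[0])
    shared_end = min(target[1], other[1])
    shared = max(0, shared_end - shared_start + 1)
    return shared / (target[1] - target[0] + 1)


def cover_ratio(covering: List[Cluster], reference: List[Cluster],
                threshold: float = 0.5) -> float:
    """Num_covered / Num_all for the reference clusters of one gene."""
    if not reference:
        return 0.0
    covered = sum(
        1 for ref in reference
        if any(overlap_fraction(ref, c) >= threshold for c in covering)
    )
    return covered / len(reference)


# Toy clusters on a single gene (invented positions, not taken from the paper).
dmcm_clusters = [(10, 20), (100, 130)]
reference_clusters = [(12, 18), (90, 160), (300, 310)]
print(cover_ratio(dmcm_clusters, reference_clusters))
```

The ratio is asymmetric: swapping the two arguments measures how much of the first method's clusters the second method covers, which is why the text reports both directions for each pair of methods.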

Fig. 3. DMCM Illustration on PTEN upon over twenty cancer types. A) Results of kernel density estimate with a self-adaptive bandwidth (in bold line) and a fixed
bandwidth (in ordinary line) were represented to make a comparison. B) The distribution of all the PTEN mutation events on amino acid sequences; mutation
clusters were identified by DMCM and represented in different colors with the amino acid interval on the right side. C) Mutation data of PTEN across all of pan-
cancer dataset and clusters identified by non-data-adaptive method (Color version of this figure is available at Bioinformatics online.)
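The fixed versus data-adaptive comparison illustrated in Fig. 3A can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: it assumes a Gaussian kernel, takes the fixed bandwidth `h` as an input, and assumes Equation (7) reads h(x) = h^(9/5) c^(1/5) f(x)^(1/5) / |sum_i t^2 K(t)/(nh) - f(x)|^(2/5) with c = (3/8) pi^(-1/2) h^(-5). The small constant `eps`, the clipping bounds `h_min` and `h_max`, and the mutation positions are additions made here so the example stays numerically stable and self-contained.

```python
import numpy as np


def gaussian_kernel(t):
    """Standard Gaussian kernel K(t)."""
    return np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)


def fixed_kde(grid, samples, h):
    """Equation (1) with d = 1: fixed-bandwidth estimate evaluated on `grid`."""
    t = (grid[:, None] - samples[None, :]) / h
    return gaussian_kernel(t).sum(axis=1) / (len(samples) * h)


def adaptive_bandwidth(grid, samples, h, eps=1e-12, h_min=0.1, h_max=None):
    """Assumed reading of Equation (7); eps and the clipping are additions."""
    n = len(samples)
    t = (grid[:, None] - samples[None, :]) / h
    k = gaussian_kernel(t)
    f_hat = k.sum(axis=1) / (n * h)
    # sum_i t^2 K(t) / (n h) - f_hat(x): the denominator term of Equation (7)
    curvature = (t ** 2 * k).sum(axis=1) / (n * h) - f_hat
    c = 3.0 / (8.0 * np.sqrt(np.pi)) * h ** -5
    h_x = h ** 1.8 * c ** 0.2 * f_hat ** 0.2 / (np.abs(curvature) + eps) ** 0.4
    return np.clip(h_x, h_min, 10.0 * h if h_max is None else h_max)


def adaptive_kde(grid, samples, h):
    """Equation (8): Gaussian KDE evaluated with the bandwidth h(x)."""
    h_x = adaptive_bandwidth(grid, samples, h)
    t = (grid[:, None] - samples[None, :]) / h_x[:, None]
    return gaussian_kernel(t).sum(axis=1) / (len(samples) * h_x)


# Invented mutation positions on a 200-residue protein, for illustration only.
positions = np.array([10.0, 11.0, 12.0, 12.0, 50.0, 80.0, 81.0, 81.0, 82.0])
grid = np.linspace(-50.0, 250.0, 3001)
fixed = fixed_kde(grid, positions, h=4.0)
adaptive = adaptive_kde(grid, positions, h=4.0)
```

Plotting `fixed` and `adaptive` against `grid` gives curves of the kind contrasted in the figure. Unlike the fixed-bandwidth estimate, a bandwidth that varies with the evaluation point does not in general integrate exactly to one, so a renormalization step may be wanted before comparing densities.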

Fig. 4. Cluster statistics comparing DMCM clusters with M2C clusters, OncodriveCLUST clusters and Pfam Domains. A) Cluster length histogram. B) Coverage
with competing methods and Pfam histogram

3.2.3 Cross-validation analysis of DMCM
We validate the robustness of DMCM by splitting our dataset into two equally sized partitions (data from one partition is regarded as the training data and data from the other partition is regarded as the validation data) and running the algorithm separately on each partition. A data-adaptive KDE model is then generated for each gene from the two partitions and every KDE model is used to generate a mutation cluster set. First, we calculate the log-likelihoods for the KDE models from each partition and assess whether the log-likelihoods are highly correlated by calculating the Spearman Correlation Coefficient. The result showed that the log-likelihoods are significantly correlated (Spearman Correlation Coefficient ≈ 0.99 and P-value ≈ 0). These results are plotted in Figure 5 and indicate that the statistical model underlying DMCM is robust.

We also calculate the cover ratio of clusters between the two partitions with Equation (13). We define 'covered' to mean that for two clusters in the same gene from different partitions, one of the clusters overlaps the other by at least 50%, as defined before. Our validation analysis shows that on average DMCM robustness is about 40%, meaning about 40% of clusters are covered. However, we further note that smaller, denser clusters are more highly conserved and overlap by a greater percentage between partitions than large sparse clusters.
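The split-and-correlate procedure above can be sketched as follows. This is a hedged illustration rather than the DMCM code: it uses a fixed-bandwidth Gaussian KDE as a stand-in for the data-adaptive model, it reads the log-likelihood correlation as being taken across genes between the two cross-partition directions (one plausible reading of Figure 5), and the helper names, the synthetic genes and the bandwidth of 4 are all assumptions introduced here.

```python
import numpy as np


def split_in_half(positions, rng):
    """Randomly split mutation positions into two equally sized partitions."""
    shuffled = rng.permutation(positions)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def kde_log_likelihood(train, test, h):
    """Log-likelihood of `test` under a Gaussian KDE fitted on `train`."""
    t = (test[:, None] - train[None, :]) / h
    norm = len(train) * h * np.sqrt(2.0 * np.pi)
    dens = np.exp(-0.5 * t ** 2).sum(axis=1) / norm
    return float(np.log(dens + 1e-300).sum())  # tiny offset avoids log(0)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    values = np.asarray(values, dtype=float)
    ranks = np.empty(len(values))
    ranks[np.argsort(values, kind="mergesort")] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


rng = np.random.default_rng(0)
# Synthetic genes (invented): each has one hotspot of varying tightness.
genes = {f"gene{i}": rng.normal(100.0, 2.0 + 3.0 * i, size=200) for i in range(8)}
ll_1_to_2, ll_2_to_1 = [], []
for positions in genes.values():
    part1, part2 = split_in_half(positions, rng)
    ll_1_to_2.append(kde_log_likelihood(part1, part2, h=4.0))
    ll_2_to_1.append(kde_log_likelihood(part2, part1, h=4.0))
rho = spearman(ll_1_to_2, ll_2_to_1)
```

A value of `rho` close to 1 would indicate that genes whose partition-1 model explains partition 2 well are also the genes whose partition-2 model explains partition 1 well, which is the sense of robustness the section describes.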

types. Here, we highlight several well-known and novel clusters detected by DMCM:

BRAF V600 mutation is a well-known mutation event which has been shown to be associated with many cancer types. DMCM identified an amino acid cluster at positions 600-601 in BRAF within the cancer types of Melanoma, Rectum Adenocarcinoma (READ), Thyroid Carcinoma (THCA), Lung Adenocarcinoma (LUAD), Breast Invasive Carcinoma (BRCA), Stomach adenocarcinoma (STAD) and Ovarian serous cystadenocarcinoma (OV), etc. Detection of cluster 600-601 mutation in BRAF is also recommended in cases of DNA mismatch repair protein deficiency by the guidelines of the National Comprehensive Cancer Network (NCCN) in the Lynch syndrome detection strategy. In addition, BRAF is the most commonly mutated gene in thyroid papillary carcinoma, where its mutation is associated with an increase in mortality in patients (FDR < 1%). More importantly, experiments of the BRAF
Fig. 5. Cross-validation of DMCM: each circle shows the log-likelihood of the
DMCM model trained from partition 1 to generate the data from partition 2 kinase inhibitor, named vemurafenib, in two phase of clinical trails
for a single gene (red circles). The opposite analysis, using partition 2’s proved the inhibitor response rate of melanoma patients carrying
DMCM model to generate data from partition 1, is also shown (purple x’s) BRAF mutations of cluster 600-601 is over 50% and the survival rate
(Color version of this figure is available at Bioinformatics online.) of 6 months is over 84%. Therefore, the detection of cluster 600-601
mutation in BRAF has become a necessary test for the diagnosis and
In addition, We calculate a score under the background noise distri- treatment of many kinds of tumors.
bution for each mutation cluster found by DMCM. This score, donated CTNNB1 has two mutation clusters from 25-45 and 322-354
as cluster scorec of a cluster c, is defined by the log value of the ratio of which are associated with global changes of gene expression in Liver
the emission probability of the mutations in the cluster with the emis- Hepatocellular Carcinoma (LIHC) at a false discovery rate of 1% and
sion probability of the same mutations based upon the null hypothesis Adrenocortical Carcinoma (ACC) at a false discovery rate of 5%. pre-
of the background noise distribution across the cluster defined below: viously, increased level of CTNNB1 expression caused by the mutation
! ! cluster of amino acid positions 25-45 have been associated with metas-
N1 þ N3 N2 þ N4
 tasis in Uterine Corpus Endometrial Carcinoma (FDR < 1%) (Ding
N1 N2 et al., 2015). However, to our knowledge this somatic mutation region
P  value ¼ ! (15)
N1 þ N2 þ N3 þ N4 have not been extensively studied. We note that clusters 25-45
N1 þ N2 and 322-354 have a much lower global gene expression association
P-values (about 21 and 6 orders of magnitude smaller, respectively) in
M is the set of all n pan-cancer mutations in cluster c, Gðx; lc ; rc Þ Liver Hepatocellular Carcinoma which perhaps reveals the specific
is the normalized Gaussian distribution with mean lc and standard function to the Liver Hepatocellular Carcinoma. Thus, this gene may
deviation rc representing the unweighted component of the mixture be a candidate for further study of its role in Liver Hepatocellular
model corresponding to cluster c. Finally P ¼ L1 is the emission Carcinoma and probably the other tumor types mentioned above.
probability of single mutation by the background noise distribution GPRIN2 has a mutation cluster in Adrenocortical Carcinoma
over the gene containing the clusters of length L (in amino acids). The (ACC) and Skin Cutaneous Melanoma (SKCM): amino acid positions
cluster score provides a good indication of how robust a cluster is like- 39-126 with 23 somatic mutations. This cluster has a much lower glo-
ly to be (See Supplementary Table T2, Supplementary Fig. S3). bal gene expression association P-value (about 13 orders of magni-
Higher scores indicate increased robustness. tude smaller) with Adrenocortical Carcinoma which may reveal the
specific function to this disease (FDR < 1%). We hope our results are
helpful to the research of GPRIN2’s function.
3.2.4 Cluster cancer type enrichment analysis
We calculate an enrichment P-value by using Fisher’s Exact Test to de-
3.2.5 Driver gene prediction by DMCM
termine if specific cancer types are enriched for specific mutation clus-
To distinguish the cancer driver genes from passenger genes, we
ters. A contingency table for each cluster cancer type pair is created
calculate the gene clustering score and sort the gene list with their
across the pan-cancer set of samples based upon the two Boolean varia-
clustering scores. The gene clustering score is generated as in
bles: (i) Is the sample inside the cluster and (ii) Does the sample belong
OncodriveCLUST and the P-value list is acquired. A comprehensive
to the cancer type being analyzed. These results are then tested for sig-
driver gene list is obtained (See T4 in Supplementary Tables). We
nificance together using the Benjamini-Hochberg method (Benjamini
compared the identified driver genes with which are derived by
and Hochberg, 1995) with results at FDR (False Discovery Rates) of 1,
OncodriveCLUST and M2C respectively on CGC gene list. The com-
5, 10 and 25% (See Supplementary Tables T3).
parison is demonstrated in Figure 6. Some previous predicted driver
The result generated by cluster cancer type enrichment analysis
genes, such as JUN, CDH11 and EGFR in SKCM, are identified by
shows that about 57.9% of clusters (P < 0.01) and the percent of
DMCM while they are not predicted by OncodriveCLUST and M2C.
clusters with P < 0.05 identified by DMCM is 100%. These results
The results show that potential driver genes are acquired by DMCM
confirmed that specific cancer types are enriched for specific muta-
and DMCM makes an effective complement to other methods.
tion clusters. The association between mutations of TP53 and 14
cancer types is illustrated (See Supplementary Fig. S3). The cluster
located in the region of (256, 260) is significantly related to BRCA, 3.2.6 Computational complexity analysis of DMCM
LIHC, LUAD, LUSC, READ and UCEC which indicates this muta- The experiments are run on a desktop workstation (Windows 7
tion cluster is identified as a potential driver cluster for these cancer operating system), the computer contains a Dual-Core AMD
Opteron, 24 GB of memory capacity and 500 GB of hard disk capacity. We compare the time consumption of DMCM with that of M2C by using the speed of convergence to evaluate the computational time of the two algorithms, since both are iterative algorithms run to convergence. The EM algorithm is currently one of the best solutions to such convergence problems. In both algorithms, the mutation cluster generation process uses the EM algorithm to optimize the cluster boundaries. Hence, our time consumption analysis focuses on comparing the speed of EM convergence. The gene PTEN, containing 550 mutations across 21 cancer types, is used to perform this comparison. For DMCM, the AMISE converges after 40 iterations while it takes 60 iterations for M2C (See Supplementary Fig. S4A). Moreover, M2C stores all the mutation cluster lists generated from each fixed bandwidth (28 bandwidths in total) in memory and finally merges the 28 cluster lists to generate a final list. Hence, there is potential memory overhead as the size of the dataset increases. In contrast, DMCM is more memory efficient as it stores only one mutation cluster list generated by the data-adaptive bandwidth. The comparison of memory usage for the gene PTEN is shown in Supplementary Figure S4B, which shows that M2C uses 5.5 GB of memory while DMCM uses only 4 GB. Taken together, DMCM is more efficient than M2C in both computational time and memory usage.

Fig. 6. Selected top-ranking genes from DMCM analysis on two cancer types (Only genes annotated in the Cancer Gene Census are depicted). Summary of the results obtained by three methods (DMCM, OncodriveCLUST and M2C) aimed to find driver genes for the two analyzed cancer types

4 Conclusion
Many previous methods for identifying driver mutations focus on the entire gene or the single amino acid level, which may not be appropriate because mutations associated with cancers do not occur uniformly in a gene or at random positions within the coding amino acid sequence. We have proposed a Data-adaptive Mutation Clustering Method (DMCM) which combines these two approaches by creating a data-adaptive bandwidth to search for variable-length regions of interest within individual genes. DMCM represents a data-driven approach towards systematically identifying mutation clusters in genes. Both the cross-validation analysis and the cluster scores indicate the robustness of DMCM. To reveal the significant relationship between specific mutation clusters and specific cancer types, a cluster cancer type enrichment analysis is provided, by which we find that mutation clusters detected by DMCM are significantly cancer-related. However, it is likely that many of our one-dimensional clusters, if mapped onto the 3D structure of a protein, would be more valuable in cancer studies. A future direction is to carry out this mapping and determine more realistic structural mutation clusters, and we suspect that such an approach would further increase statistical power.

Acknowledgements
The authors wish to thank all the anonymous reviewers, whose constructive comments were very helpful in strengthening the presentation of this study.

Funding
This work was supported by the Natural Science Foundation of Hunan Province, China (Grant No. 2018JJ2053).

Conflict of Interest: none declared.

References
Benjamini,Y. and Hochberg,Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc., 57, 289–300.
Chin,L. et al. (2011) Making sense of cancer genomic data. Genes Dev., 25, 534–555.
Chwialkowski,K. et al. (2016) A kernel test of goodness of fit. In: 33rd International Conference on Machine Learning, ICML 2016, vol. 6, New York City, NY, pp. 3854–3867.
Dees,N.D. et al. (2012) Music: identifying mutational significance in cancer genomes. Genome Res., 22, 1589–1598.
Ding,J. et al. (2015) Systematic analysis of somatic mutations impacting gene expression in 12 tumour types. Nat. Commun., 6, 8554.
Eynden,J.V.D. et al. (2015) Sominaclust: detection of cancer genes based on somatic mutation patterns of inactivation and clustering. BMC Bioinformatics, 16, 125.
Finn,R.D. et al. (2010) The pfam protein families database. Nucleic Acids Res., 40, 290–301.
Fisher,R.A. (1922) On the interpretation of χ2 from contingency tables, and the calculation of p. J. R. Stat. Soc., 85, 87–94.
Gonzalez-Perez,A. and Lopez-Bigas,N. (2012) Functional impact bias reveals cancer drivers. Nucleic Acids Res., 40, e169.
Lawrence,M.S. et al. (2013) Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature, 499, 214–218.
Lee,C.S. et al. (2014) Recurrent point mutations in the kinetochore gene knstrn in cutaneous squamous cell carcinoma. Nat. Genet., 46, 1060–1062.
Lu,X. et al. (2014) A co-expression modules based gene selection for cancer recognition. J. Theor. Biol., 362, 75–82.
Lu,X. et al. (2017) Driver pattern identification over the gene co-expression of drug response in ovarian cancer by integrating high throughput genomics data. Sci. Rep., 7, 16188.
Lu,X. et al. (2018) The integrative method based on the module-network for identifying driver genes in cancer subtypes. Molecules, 23, 183.
Meyerson,M. et al. (2010) Advances in understanding cancer genomes through second-generation sequencing. Nat. Rev. Genet., 11, 685.
Network,T.C.G.A. (2014) Comprehensive molecular characterization of urothelial bladder carcinoma. Nature, 507, 315.
Poole,W. et al. (2017) Multiscale mutation clustering algorithm identifies pan-cancer mutational clusters associated with pathway-level changes in gene expression. Plos Computat. Biol., 13, e1005347.
Reimand,J. and Bader,G.D. (2014) Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers. Mol. Syst. Biol., 9, 637.
Sabarinathan,R. et al. (2017) The whole-genome panorama of cancer drivers. bioRxiv, 190330. doi: https://doi.org/10.1101/190330.
Stehr,H. et al. (2011) The structural impact of cancer-associated missense mutations in oncogenes and tumor suppressors. Mol. Cancer, 10, 54.
Stratton,M.R. (2011) Exploring the genomes of cancer cells: progress and promise. Science, 331, 1553–1558.
Tamborero,D. et al. (2013a) Comprehensive identification of mutational cancer driver genes across 12 tumor types. Sci. Rep., 3, 2650.
Tamborero,D. et al. (2013b) Oncodriveclust: exploiting the positional clustering of somatic mutations to identify cancer genes. Bioinformatics, 29, 2238.
Watson,I.R. et al. (2013) Emerging patterns of somatic mutations in cancer. Nat. Rev. Genet., 14, 703–718.
Ye,J. et al. (2010) Statistical method on nonrandom clustering with application to somatic mutations in cancer. BMC Bioinformatics, 11, 11.
