IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 11, NO. 5, SEPTEMBER/OCTOBER 2014 915

Mining Conditional Phosphorylation Motifs


Xiaoqing Liu, Jun Wu, Haipeng Gong, Shengchun Deng, and Zengyou He

Abstract—Phosphorylation motifs represent position-specific amino acid patterns around the phosphorylation sites in the set of
phosphopeptides. Several algorithms have been proposed to uncover phosphorylation motifs, whereas the problem of efficiently
discovering a set of significant motifs with sufficiently high coverage and non-redundancy still remains unsolved. Here we present
a novel notion called conditional phosphorylation motifs. Through this new concept, motifs whose over-expressiveness mainly
stems from their constituent parts can be filtered out effectively. To discover conditional phosphorylation motifs, we propose an
algorithm called C-Motif for a non-redundant identification of significant phosphorylation motifs. C-Motif is implemented under the
Apriori framework, and it tests the statistical significance together with the frequency of candidate motifs in a single stage.
Experiments demonstrate that C-Motif outperforms some current algorithms such as MMFPh and Motif-All in terms of coverage
and non-redundancy of the results and efficiency of the execution. The source code of C-Motif is available at:
https://sourceforge.net/projects/cmotif/.

Index Terms—Phosphorylation motif, protein phosphorylation, frequent pattern, data mining

1 INTRODUCTION

Protein phosphorylation is one of the most frequent post-translational modification events for the
regulation and maintenance of most biological processes. This event plays vital roles in numerous
key cellular processes including metabolism, signal transduction, transcription, translation,
membrane transport, as well as the regulation of cellular activities such as proliferation,
migration, differentiation, and death [1], [2], [3], [4]. The advent of high-throughput methods,
typically tandem mass spectrometry, has greatly enhanced investigations into phosphorylation by
enabling rapid and direct discovery of large-scale phosphorylation sites in a single experiment
[5], [6], [7], [8].

Phosphorylation motifs represent common amino acids aligned upstream and downstream of the
phosphorylation sites. Phosphorylation motif discovery aims at finding a set of motifs that occur
with disproportionate frequency in two coordinate sequence data sets: the phosphorylated peptide
set P and the unphosphorylated peptide set N. In this context, P is regarded as the foreground and
N corresponds to the background. The identified interesting motifs are all “over-expressed” in the
foreground, that is, they appear more frequently in the foreground than in the background [9],
[10]. Such sets of reported phosphorylation motifs can also provide information about the
specificities of the kinases involved, reveal the underlying regulation mechanism and facilitate
the prediction of unknown phosphorylation events. Hence, achieving a rapid phosphorylation motif
search is essential for gaining a comprehensive understanding of the mechanism of protein
phosphorylation.

The problem of discovering phosphorylation motifs has been widely studied and several algorithms
have already been proposed. Motif-X [9] employs a greedy algorithm to identify over-expressed
motifs in an iterative manner. This method has been shown to uncover both known and novel
informative motifs. MoDL [10] attempts to optimize the expressiveness of a set of motifs instead
of quantifying the significance of a single motif. F-Motif [11] borrows an idea from clustering
and combines it with an iterative greedy search derived from Motif-X. In fact, greedy approaches
such as Motif-X and F-Motif may miss some important motifs due to greedy choices and foreground
reduction, and MoDL leaves out a number of significant motifs as well. MMFPh [12] extends Motif-X
by employing a more complete search and can report many more statistically significant and
sufficiently frequent motifs than Motif-X. Motif-All [13] is a two-step procedure: it first
extracts a set of frequent motifs from the phosphorylated peptide data set, and then evaluates the
statistical significance of the frequent motifs using both the phosphorylated and unphosphorylated
peptide data sets. Overall, both MMFPh and Motif-All claim to guarantee the completeness of
significant phosphorylation motifs under their corresponding definitions [14], [15]. But the lack
of consensus on the definition of significant motifs leads to a gap between the results of these
two methods. In general, Motif-All can ensure the maximum level of coverage of the potentially
interesting results whereas both MMFPh and Motif-All involve some redundant motifs.
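
As a rough illustration of the two-step scheme described above for Motif-All (first collect motifs that are frequent in the foreground, then test them against the background), the following Python sketch may help. It is not the Motif-All implementation: the motif encoding (a tuple of `(offset, residue)` pairs relative to the central phosphorylation site), the function names, the toy peptides, the brute-force enumeration and the use of a plain support ratio as the significance score are all assumptions made for illustration.

```python
from itertools import combinations


def matches(peptide, motif):
    """True if every fixed residue of the motif sits at its offset from the central site."""
    center = len(peptide) // 2
    return all(peptide[center + offset] == residue for offset, residue in motif)


def support(motif, peptides):
    """Fraction of peptides that contain the motif."""
    return sum(matches(p, motif) for p in peptides) / len(peptides)


def two_step_motifs(foreground, background, min_support, min_ratio, max_size=2):
    """Hypothetical two-step search: frequency filter first, significance test second."""
    center = len(foreground[0]) // 2
    # All (offset, residue) pairs observed around the central site in the foreground.
    items = sorted({(i - center, p[i]) for p in foreground
                    for i in range(len(p)) if i != center})

    # Step 1: keep motifs that are frequent in the foreground only.
    frequent = []
    for size in range(1, max_size + 1):
        for motif in combinations(items, size):
            offsets = {offset for offset, _ in motif}
            if len(offsets) == size and support(motif, foreground) >= min_support:
                frequent.append(motif)

    # Step 2: score each frequent motif against the background (support ratio).
    results = {}
    for motif in frequent:
        sup_n = support(motif, background)
        ratio = float("inf") if sup_n == 0 else support(motif, foreground) / sup_n
        if ratio >= min_ratio:
            results[motif] = ratio
    return results


# Toy peptides of width 5 with the phosphorylated S in the middle (made-up data).
foreground = ["ARSPK", "GRSPL", "ARSAK", "TKSPD"]
background = ["ARSGK", "GKSAL", "TLSPD", "MRSGE"]
found = two_step_motifs(foreground, background, min_support=0.5, min_ratio=1.2)
```

On this toy data the sketch keeps, for example, the 1-motif with P one position after the site (ratio 3.0). Because step 2 only looks at the whole data sets, nothing here checks whether a larger motif adds anything beyond its parts, which is the redundancy issue discussed in the text.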
• X. Liu, J. Wu, H. Gong, and Z. He are with the School of Software, Dalian University of
  Technology, Dalian, China. E-mail: {eileenwelldone, wujun.myway, haipengxf}@gmail.com,
  zyhe@dlut.edu.cn.
• S. Deng is with the School of Computer Science and Engineering, Harbin Institute of Technology,
  Harbin, China. E-mail: dsc@hit.edu.cn.
Manuscript received 9 Nov. 2013; revised 4 Apr. 2014; accepted 21 Apr. 2014. Date of publication
30 Apr. 2014; date of current version 2 Oct. 2014.
For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org,
and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TCBB.2014.2321400
1545-5963 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE
permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more
information.

Despite the significant progress achieved in the past years, there are still some important issues
that remain unsolved. One challenging problem is how to filter out those biased motifs whose
over-expressiveness mainly comes from their subsets so as to improve the accuracy of the
identified motif set. For example, compared to other algorithms, experimental results demonstrate
that Motif-All performs relatively better by reporting the largest number of
significant motifs. However, no further statistical correction is carried out in Motif-All,
rendering an abundance of motifs whose over-expressiveness mainly originates from some portion of
them in the result set. To remove these redundant motifs, one method is to apply the permutation
test in statistics [16] to post-process the results. However, as pointed out in [17], the standard
permutation test procedure may not fully address this issue. As a result, they utilize the
“sequential permutation test” method [18] to reduce the effect of sub-motifs when calculating the
statistical significance of phosphorylation motifs. However, the permutation test process is very
time-consuming in practice. An alternative strategy is to choose a rigorous measure that can
remove the influence of subsets of motifs. Existing methods such as Motif-X [9] and MMFPh [12] are
examples of this idea. They partially succeed in reducing the redundancy by imposing more
stringent restrictions in their pruning strategies. Nevertheless, this does not fully solve the
problem, and some additional redundant motifs still remain in the results.

Another challenging problem is how to achieve a maximum coverage such that all frequent and
statistically significant motifs are included in the result. Existing methods such as Motif-X,
F-Motif, MoDL and MMFPh will miss some phosphorylation motifs that are both sufficiently large and
significant due to their incomplete search or overstrict requirements on target motifs.

Overall, the redundancy and coverage issues in phosphorylation motif discovery are only partially
solved. This paper addresses this problem and takes a further step in this direction by proposing
a new problem formulation for phosphorylation motif discovery: conditional phosphorylation motif
discovery. Owing to the problem formulation, the motifs reported would be “actual” motifs with
more accurate statistical significance scores. This new problem formulation is a non-trivial
extension and generalization of the motif assessment strategy used in Motif-X and MMFPh. Different
from previous methods, the key feature of our method is that the evaluation of significance with
respect to one motif is only relevant to its own inherent property rather than whether it has
significant constituent parts or not. Hence, we achieve an ideal trade-off between the high
coverage and non-redundancy of the significant motif set together with running efficiency to some
extent.

In the following, we will first illustrate the basic idea of conditional phosphorylation motif and
then show how this concept can remove motifs whose over-expressiveness mainly comes from their
subsets. If a position in one motif is specified by a certain fixed amino acid, then it is a
so-called conserved position. Otherwise, it is a wild position that can match any arbitrary amino
acid. One motif A is said to be a k-motif if it has k conserved positions. If another motif B
contains only a subset of these k amino acids at the corresponding positions in A, then B is a
sub-motif of A. In fact, every peptide that contains the motif must also contain its corresponding
sub-motifs, but the reverse is not true. So the set of peptides that contain one motif must be a
subset of the collection of peptides that contain its sub-motif. Notably there are exactly k
sub-motifs of size $k-1$ for one k-motif. For each sub-motif of size $k-1$, we can generate a set
of peptides in which every peptide contains this sub-motif. On this new data set, we can
re-calculate the statistical significance of the k-motif, which is called the conditional (or
local) significance. Moreover, we define the statistical significance of a motif based on the
original data set according to the traditional definition as the global significance. In this
setting, this k-motif is claimed as a conditional phosphorylation motif if it is not only locally
significant on all k sub-motif induced data sets but also globally significant on the whole data
sets. Hence, there will be two parameters to measure the significance of motifs: the global
significance threshold and the local significance threshold, respectively. As a result, the
effects of its sub-motifs in the evaluation of statistical significance about the
over-expressiveness are reduced.

To address the problem of conditional phosphorylation motif mining, we present a new algorithm
called C-Motif. We implement C-Motif by utilizing a support constraint together with statistical
measures in the same mining process. Here the support is defined as the percentage of sequences
that contain this motif. One motif is said to be frequent if its support is no less than a given
threshold. Experiments on real data sets show that our algorithm is efficient and effective in
conditional phosphorylation motif discovery.

The remainder of this paper is organized as follows: Section 2 presents the details of the C-Motif
algorithm. Section 3 shows the experimental results on real data. Section 4 concludes the paper.

2 METHODS

2.1 Basic Terminology

We describe a motif as a string with a single phosphorylated residue that is denoted with an
underlined character, e.g., S, T or Y. We write the conserved positions of one motif as the
corresponding amino acids directly and its wild positions are represented by ‘x’. For example,
suppose a 2-motif has a fixed ‘P’ one position upstream and a fixed ‘D’ two positions downstream
as well as a wild position next to the centered phosphorylation site; this motif is thus
represented as ‘PSxD’.

For a k-motif m, we use $\mathrm{Sup}(m, P)$ to denote its support (frequency) in the foreground
data set P. Since the goal of phosphorylation motif discovery is to find those motifs that occur
more frequently in the foreground data set against the background data set, we usually only use
the foreground to assess the frequency of a motif.

Definition 1. m is a frequent motif if $\mathrm{Sup}(m, P)$ is no less than the user-specified
threshold $\theta_{sup}$.

The set of candidate frequent motifs of size k is denoted as $S_k$. To discover all frequent
motifs, the level-wise search strategy rooted in the Apriori algorithm [19] is widely used in
practice. It first accumulates the count for each 1-motif and collects those motifs with larger
support than the given threshold $\theta_{sup}$ to form the set of frequent 1-motifs $F_1$.
Subsequently, since a k-motif will not be frequent if one of its sub-motifs of size $k-1$ is
infrequent, $F_{k-1}$ is utilized to generate $S_k$ and those infrequent ones are pruned to
generate $F_k$.

Owing to the fact that m consists of k fixed amino acids, this motif has k sub-motifs of size
$k-1$. These motifs
subsumed by m are denoted as $m_1, m_2, \ldots, m_k$, respectively. The only difference between m
and $m_i$ ($1 \le i \le k$) is that the ith fixed position in m is non-fixed in $m_i$. We denote
the sets of peptides in the foreground data P where these sub-motifs occur as
$P(m_1), P(m_2), \ldots, P(m_k)$, respectively. Similarly, we use $N(m_i)$ to denote the set of
peptides in the background data N that contain $m_i$. We utilize $\mathrm{Sig}(m, P, N)$ to denote
the statistical significance calculation function for each motif m, and it measures the
over-expressiveness of m in P against N. In fact, both the nonparametric measurements such as odds
ratio or relative risk and the binomial probability model have been used in the assessment of
phosphorylation motifs [13], [9], [12]. Note that the use of different evaluation methods will not
change the nature of the problem. Generally, these statistical significance assessment functions
are consistent with each other in practice. For the ease of illustration and a fair comparison of
different methods, here we use relative risk and odds ratio as the representatives of statistical
significance evaluation methods for phosphorylation motifs. Both relative risk and odds ratio
describe a likelihood change of the occurrence of one motif between the foreground and the
background. Relative risk is defined as the ratio of the supports of m in the two data sets. When
using relative risk to measure the statistical significance of m, it reads as:

$$\mathrm{Sig}(m, P, N) = \frac{\mathrm{Sup}(m, P)}{\mathrm{Sup}(m, N)}. \tag{1}$$

A relative risk of 1 indicates that the target motif under study is equally likely to occur in
both data sets. A relative risk greater than 1 means that this motif is more likely to occur in
the foreground. In addition, the odds is the ratio of the probability that the event of interest
does happen to the probability that it does not happen. The odds ratio is defined as the ratio of
the odds of an event occurring in one group to the odds of it occurring in another group. In the
context of phosphorylation motif finding, if we adopt odds ratio to measure the
over-expressiveness, then the significance of m is:

$$\mathrm{Sig}(m, P, N) = \frac{\mathrm{Sup}(m, P) / (1 - \mathrm{Sup}(m, P))}{\mathrm{Sup}(m, N) / (1 - \mathrm{Sup}(m, N))}. \tag{2}$$

Odds ratio has the same characteristics as relative risk: only those motifs whose odds ratio is
greater than 1 have potential to be statistically significant.

Particularly, if we consider $P(m_1), P(m_2), \ldots, P(m_k)$ as the new foreground data set
instead of P and $N(m_1), N(m_2), \ldots, N(m_k)$ as the new background data set instead of N when
estimating the over-expressiveness, there are exactly k different significance values for m, that
are $\mathrm{Sig}(m, P(m_i), N(m_i))$ where $1 \le i \le k$, respectively. Without loss of
generality, we assume that $\mathrm{Sig}(m, P, N)$ has positive correlation with the
over-expressiveness: the bigger $\mathrm{Sig}(m, P, N)$ is, the more significant the motif m is.
Under this setting, we adopt the minimum value of $\mathrm{Sig}(m, P(m_i), N(m_i))$
($1 \le i \le k$) as the local (or conditional) statistical significance in the estimation of
over-expressiveness of each motif. Then, the problem of conditional phosphorylation motif
discovery is to discover all frequent motifs from P with sub-motif derived statistical
significance values passing the given threshold in addition to fulfilling the traditional
definition.

Definition 2. Local statistical significance:

$$\mathrm{Sig}_l(m, P, N) = \min_{1 \le i \le k} \mathrm{Sig}(m, P(m_i), N(m_i)).$$

Definition 3. Global statistical significance:

$$\mathrm{Sig}_g(m, P, N) = \mathrm{Sig}(m, P, N).$$

Overall, to identify all statistically significant, sufficiently frequent conditional
phosphorylation motifs, we assess at least two aspects of each motif: frequency and statistical
significance including both the local significance and the global significance:

• Frequency. We impose the support constraint to reduce the search space and prevent the
  generation of random artifacts.
• Statistical significance. Note that the statistical evaluation of over-expressiveness for a
  motif can be done in various ways. The statistical significance measures such as relative risk
  and odds ratio are available to be utilized interchangeably. The choice of significance
  assessment measure will not change the performance of underlying algorithms.

2.2 Problem Formulation

As shown above, we strengthen and optimize the definition of phosphorylation motif finding and try
to conduct an extensive and non-redundant (NR) discovery. The conditional phosphorylation motifs
are deemed to be true positives with prominent over-expressiveness under no subsets interplays. We
impose a persuasive significance constraint called the local or conditional significance on each
candidate motif that evaluates the statistical significance of a phosphorylation motif with the
sets of sequences induced from its sub-motifs. Furthermore, we also perform the global
significance evaluation using the original data sets in the traditional way. Thus, there are two
parameters to measure the significance of motifs: the global significance threshold
$\theta_{g\_sig}$ and the local significance threshold $\theta_{l\_sig}$, respectively. That is,
to ensure the significance of one motif m over P against N, two criteria must be satisfied
simultaneously: $\mathrm{Sig}_g(m, P, N) \ge \theta_{g\_sig}$ and
$\mathrm{Sig}_l(m, P, N) \ge \theta_{l\_sig}$. Hence, we can guarantee that the conditional
phosphorylation motifs are also significant under the traditional definition.

However, there is one critical issue remaining for this task. That is, whether the effect of the
sub-motifs has really been removed through the use of local statistical significance? To address
this issue, we employ a measure called improvement proposed in [20] for justification. More
precisely, the improvement is defined as the difference between the statistical significance of
one motif and that of its sub-motifs. In general, the positive improvement indicates that the
over-expressiveness of one target motif comes from the combinations of all its constituent amino
acids rather than just one of its subsets. We should prune those redundant motifs that have no
positive improvements: the over-expressiveness of a motif is equal to or less than that
of its sub-motifs.¹ To further clarify this issue, since we utilize the local significance to get
rid of sub-motif interplays, we calculate the difference values between the conditional
phosphorylation k-motif and its corresponding sub-motifs of size $k-1$ with respect to the global
significance. For the simplicity of illustrations and mathematical derivations, here we use
relative risk as the example significance measure in the remainder of this section.

¹ For justification, we also define some motifs as redundant ones if their improvement is very
small.

Lemma 1. Each conditional phosphorylation motif possesses positive improvement if relative risk is
used to measure the statistical significance.

Proof. For a k-motif m in data set P and N, suppose one of its sub-motifs $m_j$ of length $k-1$
has the minimal significance value, i.e.,
$\mathrm{Sig}_l(m, P, N) = \mathrm{Sig}(m, P(m_j), N(m_j))$. We set $\theta_{l\_sig} = t$,
$t > 1$. Then Lemma 1 can be formulated as: if $\mathrm{Sig}_l(m, P, N) \ge t$, then
$\mathrm{Sig}_g(m, P, N) > \mathrm{Sig}_g(m_i, P, N)$ for any $1 \le i \le k$:

$$\mathrm{Sig}(m, P(m_i), N(m_i)) = \frac{\mathrm{Sup}(m, P(m_i))}{\mathrm{Sup}(m, N(m_i))} = \frac{|P(m)|\,|N(m_i)|}{|N(m)|\,|P(m_i)|} \ge \mathrm{Sig}_l(m, P, N) = \mathrm{Sig}(m, P(m_j), N(m_j)) \ge t, \tag{3}$$

$$\mathrm{Sig}_g(m, P, N) = \frac{\mathrm{Sup}(m, P)}{\mathrm{Sup}(m, N)} = \frac{|N|\,|P(m)|}{|P|\,|N(m)|}, \tag{4}$$

$$\mathrm{Sig}_g(m_i, P, N) = \frac{\mathrm{Sup}(m_i, P)}{\mathrm{Sup}(m_i, N)} = \frac{|N|\,|P(m_i)|}{|P|\,|N(m_i)|}. \tag{5}$$

Since t is greater than 1, Equation (3) gives
$\frac{|P(m)|}{|N(m)|} > \frac{|P(m_i)|}{|N(m_i)|}$. As a result, we can get that
$\mathrm{Sig}_g(m, P, N) > \mathrm{Sig}_g(m_i, P, N)$ by subtracting Equation (5) from
Equation (4), which returns a result that is greater than 0. □

To make the following description easier to follow, we provide a precise problem definition of
conditional phosphorylation motif discovery with clearly stated input and output:

• Input. The set of phosphorylated peptides (the foreground data set P) and the set of
  unphosphorylated peptides (the background data set N), the support threshold $\theta_{sup}$,
  the local significance threshold $\theta_{l\_sig}$ and the global significance threshold
  $\theta_{g\_sig}$.
• Output. A set of conditional phosphorylation motifs R, where each motif $m \in R$ satisfies:
  (1) $\mathrm{Sup}(m, P) \ge \theta_{sup}$; (2) $\mathrm{Sig}_l(m, P, N) \ge \theta_{l\_sig}$;
  (3) $\mathrm{Sig}_g(m, P, N) \ge \theta_{g\_sig}$.

2.3 Categorization of Existing Methods under the New Formulation

Notice that Motif-X [9] and MMFPh [12] also measure the over-expressiveness of one motif in a way
that is similar to our definition of local significance. More precisely, although only
$\theta_{l\_sig}$ is used in both Motif-X and MMFPh, they implicitly require that one sub-motif of
the target motif should be locally significant as well. In this section, we first show that all
the motifs reported by Motif-X and MMFPh are also globally significant with high
over-expressiveness as shown in the lemma below.

Lemma 2. For a k-motif m in data set P and N, suppose we use relative risk as the measure and set
both $\theta_{l\_sig}$ and $\theta_{g\_sig}$ to be t, $t \ge 1$. Let $m^{(r)}$ represent one
sub-motif of m of size r. If $\mathrm{Sig}(m^{(r)}, P(m^{(r-1)}), N(m^{(r-1)})) \ge t$, for all
$1 \le r < k$, then $\mathrm{Sig}_g(m, P, N) \ge t$.

Proof.

$$\mathrm{Sig}(m^{(r)}, P(m^{(r-1)}), N(m^{(r-1)})) = \frac{\mathrm{Sup}(m^{(r)}, P(m^{(r-1)}))}{\mathrm{Sup}(m^{(r)}, N(m^{(r-1)}))} = \frac{|P(m^{(r)})|\,|N(m^{(r-1)})|}{|N(m^{(r)})|\,|P(m^{(r-1)})|}, \tag{6}$$

$$\mathrm{Sig}(m^{(r-1)}, P(m^{(r-2)}), N(m^{(r-2)})) = \frac{\mathrm{Sup}(m^{(r-1)}, P(m^{(r-2)}))}{\mathrm{Sup}(m^{(r-1)}, N(m^{(r-2)}))} = \frac{|P(m^{(r-1)})|\,|N(m^{(r-2)})|}{|P(m^{(r-2)})|\,|N(m^{(r-1)})|}, \tag{7}$$

$$\mathrm{Sig}_g(m, P, N) = \frac{\mathrm{Sup}(m, P)}{\mathrm{Sup}(m, N)} = \frac{|P(m)|\,|N|}{|P|\,|N(m)|}. \tag{8}$$

It is easy to see that $\mathrm{Sig}(m^{(r)}, P(m^{(r-2)}), N(m^{(r-2)})) \ge t^2$ by multiplying
Equations (6) and (7). That is, $m^{(r)}$ is statistically significant in the data sets derived
from $m^{(r-2)}$. This rule also applies to motif $m^{(r-2)}$ so that $m^{(r-2)}$ is significant
in the peptide sequences induced by one of its sub-motifs as well. Therefore, we can infer that
$\mathrm{Sig}_g(m, P, N) \ge t^k$ by iterating the multiplication process. Since t is a positive
number that is no less than 1, $\mathrm{Sig}_g(m, P, N)$ must be equal to or greater than t, too.
Thus, m is globally significant as well. □

Lemma 2 shows that Motif-X and MMFPh can find motifs that are globally significant. However, two
implicit issues exist with regard to coverage and redundancy. The first one is that they may miss
some potentially meaningful motifs under the conditional phosphorylation motif definition. This is
because these methods investigate a certain candidate k-motif m on condition that it must contain
at least one sub-motif of size $k-1$ that is both frequent and significant. If none of the
constituent sub-motifs $m_i$ is significant, there is no chance to generate and evaluate this
motif so that Motif-X and MMFPh will not check m. Fig. 1 provides such an example.

For illustration purposes, we adopt relative risk as the significance measure here. Additionally,
we set the significance threshold $\theta_{sig} = \theta_{l\_sig} = \theta_{g\_sig} = 1.2$ and the
support threshold $\theta_{sup} = 0.2$. In this sample data set, we plant one frequent and
significant phosphorylation motif ‘KMS’ with relative risk value larger than 1.2. There may be
some other significant motifs present in this data set, but we will only focus our discussion on
‘KMS’. ‘KMS’ consists of two
sub-motifs, ‘KxS’ and ‘MS’, which are definitely insignificant with significance scores equal to
1.0. Thus, methods like Motif-X and MMFPh have to filter out ‘KMS’ since it lacks frequent and
significant constituent motifs. However, motifs of this type might also have great potential to
play valuable roles in biological research. So expanding the scope of search and treating such
motifs as useful ones is a worthwhile task for phosphorylation motif discovery.

Fig. 1. A sample data set. Both the foreground data set and background data set consist of 10
sequences.

An additional issue is that Motif-X and MMFPh may involve some motifs that are not reported by
C-Motif as well. According to their definitions, one motif is deemed to be significant as long as
it can pass the significance test on the reconstructed data sets induced from at least one
sub-motif. Such a loose restriction makes it possible to report some excess motifs whose
over-expressiveness mainly comes from their subsets. As a result, this kind of motif should be
pruned according to our definition whereas Motif-X and MMFPh will report them.

We will use one potential motif ‘GLxxxxSW’ in Fig. 1 as an example for illustration. With the
significance threshold $\theta_{sig} = \theta_{l\_sig} = \theta_{g\_sig} = 1.2$ and the support
threshold $\theta_{sup} = 0.2$, its three sub-motifs ‘LxxxxSW’, ‘GxxxxxSW’ and ‘GLxxxxS’ are all
frequent and (both locally and globally) significant when relative risk is used for significance
assessment. So it is possible to generate ‘GLxxxxSW’ in three ways in MMFPh and Motif-X. Moreover,
‘GLxxxxSW’ achieves three significance values on the subset induced data sets: one is 1.0 induced
by ‘GLxxxxS’, another is also 1.0 by ‘GxxxxxSW’ and the other is 1.5 by ‘LxxxxSW’. According to
MMFPh and Motif-X, ‘GLxxxxSW’ is a significant motif with significance value 1.5 which is larger
than $\theta_{sig}$. However, ‘GLxxxxSW’ will be discarded by C-Motif since its local significance
value 1.0 is less than the local significance threshold. Further investigation on the relationship
between the target motif and its sub-motifs shows that ‘GLxxxxSW’ has equivalent global
significance with ‘GLxxxxS’ and ‘GxxxxxSW’, rendering little improvement of the
over-expressiveness. As a result, ‘GLxxxxSW’ is a redundant motif whose over-expressiveness
possibly stems mainly from ‘GLxxxxS’ and ‘GxxxxxSW’ rather than from the motif as a whole.

In summary, the algorithms that only adopt the global significance such as Motif-All often return
a set of phosphorylation motifs that contains lots of motifs whose over-expressiveness mainly
benefits from the subsets. Furthermore, MMFPh is able to find many more significant motifs missed
by Motif-X. This method decreases the number of redundant motifs but excludes some motifs with no
“over-expressed” sub-motifs and includes some redundant ones under subsets interplays as well. In
contrast, our formulation and proposed algorithm use the global significance together with the
local significance and can discover as many potential significant motifs as possible with
non-redundancy. The relationship of the motif sets reported by the different algorithms is
provided in Fig. 2.

Fig. 2. The relationship of the sets of significant motifs discovered by different algorithms.

2.4 The C-Motif Algorithm

C-Motif (Algorithm 1) is implemented in a single stage where the frequency and the statistical
significance are tested at the same time. In detail, when discovering the set $F_1$ from $S_1$ in
the first iteration, we calculate the local significance as well as the global significance for
each frequent member of $F_1$. In fact, the global significance of 1-motifs is identical to their
local significance. Since one motif is statistically significant on condition that its global
significance and local significance are not less than the given thresholds $\theta_{g\_sig}$ and
$\theta_{l\_sig}$, respectively, we filter out the insignificant motifs and then add the rest to
the result set. So after investigating every possible 1-motif, all the motifs in the result set
are significant conditional phosphorylation motifs of size 1. With respect to the kth iteration,
we perform the following operations:

1) Generate the set of potential frequent motifs of size k, i.e., $S_k$, by joining $F_{k-1}$
   with itself naturally (Steps 5-7). Let $l_1$ and $l_2$ be members of $F_{k-1}$. They are
   assumed to be joinable if their first $k-2$ amino acids are in common with the same conserved
   positions. To avoid duplications, we rank these two motifs according to the positions of their
   last conserved amino acids. Suppose $l_1$ and $l_2$ are a pair of joinable
motifs, then l1 < l2 if the last conserved amino acid enumerates and checks all frequent motifs in the mining
of l1 lies in the relative left of that of l2 . Line up these process. The second is that our implementation only
identical amino acids and the left in l1 and l2 accord- prunes those insignificant motifs with respect to both
ing to their rank to compose one k-motif m. global significance and local significance. u
t
2) Evaluate the frequency of each candidate in Sk by
Theorem 2. The C-Motif algorithm is correct.
their supports (Steps 8-10). Prune all the infrequent
ones and add all the left to Fk . Proof.. The correctness of the C-Motif algorithm can be
3) Test the local significance and the global significance guaranteed by two facts. First, only frequent motifs are
of each potential motif in Fk (Steps 11-18). For one k- generated. Second, both the local and global significance
motif m, obtain all of its (k1)-sub-motifs. For each values are exactly calculated and every motif with signif-
sub-motif mi of m, construct its matching foreground icance values that are lower than the user-specified
P ðmi Þ and background Nðmi Þ. Calculate the signifi- thresholds will be pruned. u
t
cance value on the new data sets and choose the min-
imum one as the final local significance value. In
addition, we also obtain the global significance value 3 EXPERIMENTAL RESULTS
using the original data sets P and N in consistency In order to demonstrate the efficacy and utility of our algo-
with traditional definition. Filter out the insignificant rithm, we conduct a series of tests with real data. In our
motifs and save the both globally and locally signifi- experiments, we compare our algorithm with the Motif-All
cant ones in R. algorithm and the MMFPh algorithm with respect to effi-
4) Repeat the above steps until no more frequent candi- ciency, coverage and non-redundancy. Note that several
dates can be generated in Sk . Return R as the final motif-discovery methods have been proposed, the reason
result. why we choose Motif-All and MMFPh for comparisons
here is that they are representatives for algorithms that use
only global significance threshold and local significance
threshold, respectively. Furthermore, we use the same sig-
nificance measure in all algorithms so as to make their out-
puts comparable. More precisely, we choose relative risk to
measure the over-expressiveness so as to ensure the motifs
reported by MMFPh2 are also globally significant, which
can facilitate a fair comparison for our experiments.
In the experiments, we apply C-Motif, MMFPh and Motif-All to both non-kinase-specific and kinase-specific phosphorylation data sets with a fixed length of amino acids upstream and downstream of the phosphorylated residues. The details of these data sets are provided in the following sections. In each experiment, we first present a brief description of the data and tune the thresholds so as to clarify the comparison. Subsequently, we perform a general analysis of the motifs discovered and then illustrate the advantages of C-Motif over MMFPh and Motif-All.

2. Note that in the original MMFPh, it reports the maximal phosphorylation motifs that are not subsumed by motifs with more fixed amino acids. In our implementation, we report all motifs that can pass the significance threshold so as to generate more motifs for a fair comparison with respect to the size of motif set.

2.5 Proofs of Completeness and Correctness
Theorem 1. The C-Motif algorithm is complete.
Proof. With respect to the definition of conditional phosphorylation motif, the completeness of C-Motif can be shown by the following two facts. The first is that the Apriori algorithm is complete, that is, it explicitly

3.1 Non-Kinase-Specific Phosphorylation Data

3.1.1 Data Description
We use Phospho.ELM (version 9.0) [21] and Swiss-Prot (release 2011_11) [22] as the data sources to construct the data for this experiment. To generate the set of phosphorylated peptides P, we directly extract the annotated phosphorylated peptides from Phospho.ELM without considering kinase information. To generate the set of unphosphorylated peptides N, we first extract all the proteins annotated as ‘Homo sapiens’, and then follow Musite [23] to partition all the proteins into different groups through BLAST-Clust in BLAST [24] package version 2.2.19 with a sequence identity threshold of 50 percent. We select the protein with the largest number of known phosphorylation sites in each group to

form a non-redundant data set; if there are no phosphorylated proteins in the current group, we select the longest protein as the member of the NR data set. In fact, there are at least five kinds of methods to construct the set of unphosphorylated peptides in different ways [25]. Here we employ the method in [26] to construct the background data. More precisely, we generate the data by sampling 5,000 phosphorylated peptides and 5,000 unphosphorylated peptides from the NR data set. Here all the peptides have the fixed length 13 and they are aligned on the residue that lies in the center position.

In the context of phosphorylation motif discovery, one algorithm can report different result sets of motifs from different test data sets. Since our main objective in this paper is to compare the performance of different algorithms on the same data sets, we will not discuss the effects of the data construction methods on motif discovery and just use some existing methods for constructing both the foreground data and the background data.

3.1.2 Results
In this experiment, we choose a lower value as the support threshold and a smaller value as the significance threshold so as to uncover a relatively larger number of significant motifs for a more remarkable comparison.

For consistency, we set the support threshold θ_sup = 0.005 and the significance thresholds³ (both θ^g_sig and θ^l_sig) to be 1.7 for all the algorithms under discussion. In fact, Motif-All only uses the global significance θ^g_sig and MMFPh only adopts the local significance θ^l_sig in their algorithms. The questions we want to answer in this experiment are: How many phosphorylation motifs of different sizes can be discovered by the algorithms, respectively? Which methods can detect more meaningful phosphorylation motifs? Which methods can filter out more biased motifs with respect to subset interplays?

3. For the simplicity of notation, we use θ_sig to denote both the global significance threshold and local significance threshold when they are the same.

Fig. 3. The number of reported motifs of different sizes on non-kinase-specific phosphorylation data sets. Here the support threshold θ_sup is 0.005 and the significance threshold θ_sig is 1.7.

Fig. 3 summarizes the number of phosphorylation motifs discovered by C-Motif, MMFPh and Motif-All. Several observations can be made from Fig. 3.

First, C-Motif is able to find more motifs than MMFPh in general. Moreover, Motif-All reports many more motifs, and all the reported motifs of C-Motif and MMFPh are included in the result set of Motif-All.

Fig. 4. The relationship of reported motifs of size 2 on non-kinase-specific phosphorylation data sets. Here the rounded rectangle denotes the result set of Motif-All, the circle represents that of C-Motif and the rectangle corresponds to that of MMFPh. Here the support threshold θ_sup is 0.005 and the significance threshold θ_sig is 1.7.

Second, we find that all the algorithms present the same set of interesting 1-motifs and there is a visible difference in the discovery of 2-motifs and 3-motifs. In Fig. 4, we describe the relationship of the 2-motif sets reported by different algorithms. Apparently, Motif-All reports a superset of the other two methods, which can be divided into four parts according to their distributions. The four parts can be analyzed as follows:
1) The motifs in the intersection of the results (the first part with label ‘1’) reported by MMFPh, Motif-All and C-Motif are all conditional phosphorylation motifs.
2) The second part with label ‘2’ consists of motifs discovered only by Motif-All and C-Motif. The common characteristic of these motifs is that all of their sub-motifs are insignificant on the subset induced data sets. Accordingly, MMFPh fails to generate and check these motifs even though they are both globally significant and locally significant.
3) The third part contains only one 2-motif ‘RxxSP’ detected by MMFPh and Motif-All. Since ‘RxxSP’ is composed of two significant sub-motifs (‘RxxS’ and ‘SP’), MMFPh can generate this motif by extending either ‘RxxS’ or ‘SP’. Thus, ‘RxxSP’ can achieve two significance values, 3.43 and 1.54, upon the sub-motif induced data sets. MMFPh chooses the first value and reports this motif as a significant one while C-Motif prunes it since the minimum value is less than the given local significance threshold. In fact, ‘RxxSP’ is quite dependent on ‘SP’ in the sense that its global significance value of 6.6 is relatively close to the value of 5.5 for ‘SP’. This indicates that the discriminative power of ‘RxxSP’ mainly comes from its subset ‘SP’. Regarding such motifs as redundant ones is therefore reasonable and justifiable.

4) The fourth part consists of motifs that are only reported by Motif-All. Motifs in this category generally have at least one very significant sub-motif that actually leads to the statistical significance of its superset. For instance, ‘GxxRxxS’ is composed of two sub-motifs: ‘GxxxxxS’ and ‘RxxS’. ‘GxxxxxS’ is not a significant phosphorylation motif, with both local and global significance values of 0.99, while ‘RxxS’ is a significant one with significance values of 2.48. Particularly, ‘GxxRxxS’ has a global statistical significance value of 3.22, offering little improvement in over-expressiveness compared to that of ‘RxxS’. In this regard, ‘GxxRxxS’ is an insignificant motif according to our definition although it can pass the significance threshold θ_sig. Similar analysis can also be made for the other motifs in this part. Hence, these motifs are redundant in the sense that their discriminative power mainly comes from some sub-motifs that are already claimed to be statistically significant. Filtering out this kind of motif reduces the redundancy in phosphorylation motif discovery.

We have similar observations on the motifs of size 3 reported by different algorithms. In general, Motif-All covers all the potentially significant motifs detected by MMFPh and C-Motif but contains many redundant ones with no significant improvement in statistical significance. The difference set between MMFPh and C-Motif includes several potentially redundant motifs from MMFPh and some interesting ones without any significant sub-motifs discovered by C-Motif.

Note that the largest size of reported phosphorylation motifs is 3 under the setting of θ_sup = 0.005 and θ_sig = 1.7. The difference in the performance of C-Motif, MMFPh and Motif-All is more evident when the size of target motifs is larger than 2. That is, the bigger the size is, the more different the results of the algorithms are. Thus, we can infer that increasing the size of target motifs will highlight the advantage of C-Motif over MMFPh and Motif-All.

To further check if this is true, we also perform phosphorylation motif mining at the support threshold of 0.1 with significance threshold of 4.0. C-Motif obtains an identical set of motifs of size 1 with MMFPh and a slightly different set from Motif-All under this setting. If we change the support threshold to 0.001 with the significance threshold set to 1.5, C-Motif reports 1,530 motifs, whereas MMFPh reports 515 motifs and Motif-All reports 4,236 motifs. Accordingly, there exists a bigger gap between their result sets. On one hand, MMFPh misses a number of interesting motifs and also includes several potentially redundant motifs whose over-expressiveness mainly derives from their sub-motifs. On the other hand, Motif-All also reports many motifs whose sub-motifs are included in the result as well. This demonstrates that C-Motif not only can find more significant motifs than MMFPh but also is more qualified than Motif-All and MMFPh to achieve non-redundancy in a flexible manner.

In conclusion, it has been illustrated that MMFPh is partially effective in presenting meaningful motifs and reduces redundant motifs compared with Motif-All, although it would miss some statistically significant motifs. Furthermore, Motif-All acquires higher coverage at the cost of including many motifs whose over-expressiveness is rooted in sub-motifs. In contrast, C-Motif makes a reasonable trade-off between redundancy and coverage. Hence, the empirical comparison shows that C-Motif outperforms the other methods like MMFPh and Motif-All by discovering more meaningful and non-redundant phosphorylation motifs.

Fig. 5 presents the running time of different algorithms. For this data set, C-Motif and Motif-All spend similar execution time in the mining process under the same parameter setting. Due to the specific motif candidate generation procedure, MMFPh needs more time when the significance threshold is very low and is more efficient when the significance threshold is increased. Overall, C-Motif is relatively efficient and is comparable with other algorithms in terms of running efficiency.

Fig. 5. The running time comparison of C-Motif, MMFPh and Motif-All on non-kinase-specific phosphorylation data.

3.2 CDK-Specific Phosphorylation Data
3.2.1 Data Description
A protein kinase phosphorylates the substrates by transferring phosphate from adenosine triphosphate or guanosine triphosphate to specific amino acids (serine, threonine and

tyrosine). Cyclin-dependent kinases (CDKs) are a major class of enzymes involved in the regulation of the cell cycle. These kinases are activated alternately along with the cell cycle, and phosphorylate the corresponding substrates so as to make the cell cycle proceed in an orderly manner.

Similar to generating the non-kinase-specific phosphorylation data, we first extract the phosphorylated peptides from phosphorylated proteins of the kinase CDK to compose the foreground data set. We use the method in [27] to construct the nonphosphorylation data. There are approximately 200 sequences (i.e., 13-mers) in both the foreground data and background data.

3.2.2 Results
For this data set, we first set the support threshold θ_sup = 0.015 for all the algorithms and the significance threshold θ_sig = 2.0 for Motif-All and MMFPh. Particularly, we set an identical threshold for global significance and local significance for C-Motif with θ_sig = 2.0. That is, both θ^g_sig and θ^l_sig are equal to 2.0. Fig. 6 presents the number of the significant motifs found by C-Motif, Motif-All and MMFPh under this setting. Accordingly, C-Motif presents a set of phosphorylation motifs whose size is less than that of Motif-All while it is greater than that of MMFPh.

Fig. 6. The number of reported motifs of different sizes on CDK-specific phosphorylation data sets. Here the support threshold θ_sup is 0.015 and the significance threshold θ_sig is 2.0.

Motif-All also succeeds in finding all the phosphorylation motifs reported by C-Motif, while including some controversial ones that need to be further discussed. For this reason, we have to re-conduct significance tests for those motifs only found by Motif-All. We first extract them to form a new set, and then check the statistical significance of each motif together with that of its component parts. Most of these motifs have in common that they consist of significant parts and insignificant parts, and their significant parts contribute to the over-expressiveness of the whole motifs. According to our definition, these motifs should be considered as redundant motifs. The result set of Motif-All may also contain some motifs that have similar over-expressiveness compared to their subsets. Therefore, due to the weaker pruning of the global significance in the assessment of over-expressiveness, Motif-All often fails to get rid of subset interplays and returns a larger number of motifs with lots of undesired noise.

Specifically, we find that almost all the motifs found by MMFPh also occur in the result sets reported by the other algorithms, apart from some special cases. The motifs that are reported by C-Motif but missed by MMFPh are both globally and locally significant according to our definition. For instance, ‘PxVxxxSxxK’ is a 3-motif reported by C-Motif and Motif-All but missed by MMFPh. With respect to the global significance, this motif is indeed a significant motif with prominent over-representation in the foreground data set. In contrast, its sub-motifs ‘VxxxSxxK’, ‘PxxxxxSxxK’ and ‘PxVxxxS’ are all insignificant on their sub-motif induced data sets with lower over-expressiveness. As shown in the former section, since there are no frequent and significant sub-motifs for ‘PxVxxxSxxK’, there is no chance to generate and evaluate the target motif in MMFPh, and thus MMFPh prunes it for certain. In this sense, MMFPh fails to discover some underlying interesting motifs because of its restrictive definition. In addition, we also find that most of those motifs missed by MMFPh are of larger sizes.

Particularly, the motifs discovered by MMFPh but filtered out by C-Motif are some exceptions whose over-expressiveness mainly stems from their significant parts. ‘STPxxxxR’ is one such special case with three significant sub-motifs: ‘STP’, ‘STxxxxxR’ and ‘TPxxxxR’. ‘STPxxxxR’ can be extended from any sub-motif so that it has three possible significance values to measure the over-expressiveness, which are 1.5, 2.17 and 1.0, respectively. So the target motif is significant only if we evaluate it relying on its second sub-motif induced data sets. In addition, ‘STPxxxxR’ has an identical global significance value with that of its sub-motif ‘TPxxxxR’. Whether ‘TPxxxxR’ is significant or not greatly affects the significance degree of ‘STPxxxxR’. As a result, motifs like this are biased ones that should be removed according to our definition. In conclusion, MMFPh still includes some redundant motifs even though it has made some contributions to redundancy reduction.

For illustration purposes, we also study the performance of the different algorithms under different statistical significance standards and support levels. We gradually change the significance threshold and the support threshold to obtain various cases. The results are summarized in Table 1. Under any setting, it is easy to see that Motif-All always reports the most motifs and MMFPh returns the fewest. If we lower the support threshold with unchanged significance threshold, or lower the significance threshold with unchanged support threshold, there will be many more motifs of different sizes to be reported. In this situation, more redundant motifs are considered as significant ones by Motif-All while more potential ones are missed by MMFPh. It indicates that the smaller the support threshold and the significance threshold are, the more motifs of the larger sizes are found, and the more obvious the difference between the results of the algorithms is. This is because if one motif has larger size, then this motif contains more sub-motifs, and the subset interplays play a more important role and are more likely to affect the whole combination with respect to over-expressiveness, especially compared to the motif of smaller size. Hence, it increases the probability for both Motif-All and MMFPh to regard the target motif as an

interesting one even if its over-expressiveness is derived from some subsets, or just for MMFPh to filter one motif out if it has no frequent and significant sub-motifs although this motif is globally significant. On the contrary, more motifs will be pruned and fewer ones can be discovered with higher thresholds. This will reduce the performance gap among these algorithms. In this sense, Motif-All and MMFPh may be more appropriate for discovering those motifs of smaller size. To conclude, C-Motif is a better algorithm with respect to coverage as well as non-redundancy in phosphorylation motif discovery.

TABLE 1
The Number of Reported Motifs of Different Sizes on the CDK Phosphorylation Data Sets by Tuning the Thresholds θ_sup and θ_sig Gradually
θ_sig indicates identical threshold for θ^g_sig and θ^l_sig in C-Motif.

Fig. 7 depicts the running time of the algorithms under different parameters. Despite the fluctuation of MMFPh, it is apparent that C-Motif needs similar running time as Motif-All and a little more than MMFPh to finish the mining process. Meanwhile, we would like to point out that the performance gap between our algorithm and other methods is not significant. Therefore, C-Motif is a relatively efficient algorithm for the task of finding phosphorylation motifs.

Fig. 7. The running time comparison of C-Motif, MMFPh and Motif-All on CDK-specific phosphorylation data.

3.3 Protein Kinase A (PKA)-Specific Phosphorylation Data
3.3.1 Data Description
Protein kinase A is a class of cAMP-dependent enzymes. It plays an important role in gene regulatory protein phosphorylation and activating transcription of specific genes. We also use Phospho.ELM and Swiss-Prot databases to generate phosphorylated peptides and unphosphorylated peptides with PKA-specific phosphorylation sites, which is similar to the construction of CDK-specific phosphorylation data. In the generated data, the number of phosphorylated peptides is roughly equivalent to that of unphosphorylated peptides.

3.3.2 Results
In this experiment, we further validate our approach using the PKA-specific phosphorylation peptides with parameters θ_sup = 0.01 and θ_sig = 2.0. For a fair comparison, the significance thresholds for C-Motif are specified as: θ_sig = θ^l_sig = θ^g_sig = 2.0. Under this setting, the results of all the motif extraction algorithms are shown in Fig. 8.

As shown in Fig. 8, in addition to the gap on the total number of motifs, MMFPh fails to find any motifs of size larger than 2 while Motif-All reports motifs of size even up to 9. C-Motif returns a result set of motifs whose sizes are not larger than 3.

To further describe the difference, here we use several identified motifs as examples for illustration. ‘SxxSVT’ is a statistically significant motif with both global significance and local significance values larger than the given threshold θ_sig. However, MMFPh ignores this motif since it lacks frequent and significant sub-motifs. The significance values of its three 2-sub-motifs (‘SxxSV’, ‘SxxSxT’ and ‘SVT’) are all less than θ_sig. On the other hand, ‘GEKxxxS’ reported by Motif-All is composed of a significant part ‘GxKxxxS’ with local significance value of 3.6 and an insignificant part ‘ExxxxS’ with local significance value of 0.5 with the assessment in C-Motif. In addition, ‘GEKxxxS’ has a global significance value of 5.14 that has no significant improvement

over that of ‘GxKxxxS’ (the global significance value of ‘GxKxxxS’ is 4.11). As a result, the significant sub-motif ‘GxKxxxS’ will directly enhance the statistical significance of ‘GEKxxxS’ to some extent. Hence, motifs like this are not meaningful, since their over-expressiveness derives from their component parts rather than from the whole combination. Owing to the less-stringent pruning criterion, Motif-All improperly reports ‘GEKxxxS’ as an interesting one. Similarly, ‘KRxxxxS’, reported by MMFPh and Motif-All, has a global significance value of 5.14, similar to the value of 4.52 for its sub-motif ‘RxxxxS’, so ‘KRxxxxS’ is a redundant motif with little improvement with respect to the over-expressiveness.

Fig. 8. The number of reported motifs of different sizes on PKA-specific phosphorylation data sets. Here the support threshold θ_sup is 0.01 and the significance threshold θ_sig is 2.0.

The detailed investigation on all the motifs reported by C-Motif, MMFPh and Motif-All demonstrates two crucial points. The first point is that the motifs identified by C-Motif generally have no globally significant sub-motifs. Furthermore, these motifs survive after the pruning stage due to their inherent properties of the combination of each part rather than some subsets. The second point is that a proportion of the motifs reported by Motif-All and several ones reported by MMFPh are motifs whose over-expressiveness mainly benefits from their constituent parts. Consequently, though Motif-All has much success in returning the highest coverage of the significant motifs and MMFPh achieves the ability of identifying conditional phosphorylation motifs at the cost of excluding some underlying ones, they suffer from failure to filter out those potential false positives whose over-expressiveness comes from their sub-motifs.

Overall, C-Motif outperforms Motif-All and MMFPh with a trade-off between coverage and non-redundancy of the final significant motif set. Next, we will concentrate our discussion on the efficiency of the motif discovery algorithms. Fig. 9 presents the running time of C-Motif, MMFPh and Motif-All on PKA-specific phosphorylation data sets by changing the parameters gradually. We first fix the support threshold θ_sup = 0.015 and then increase the significance threshold θ_sig from 1.0 to 10.0. Subsequently, we keep the significance threshold θ_sig unchanged at 2.0 and increase the support threshold θ_sup from 0.01 to 0.1. As shown in Fig. 9, our method has comparable running efficiency with the other two algorithms. In addition, we have the following remarks.

Fig. 9. The running time comparison of C-Motif, MMFPh and Motif-All on PKA-specific phosphorylation data.

First, there is an obvious gap between MMFPh and the other two methods in most cases. MMFPh seems like the most ‘inefficient’ approach before a critical point and the most ‘efficient’ approach after that point. This is because both Motif-All and C-Motif conduct the discovery under an Apriori framework, which is very efficient in frequent pattern mining. However, MMFPh generates candidates with an enumeration method and includes lots of duplications, which is relatively time-consuming especially when there are plenty of significant ones to extend. Moreover, MMFPh would ignore some globally significant motifs if the motifs lack significant and frequent sub-motifs, while C-Motif and Motif-All would not. So C-Motif and Motif-All have to spend more time to test those candidate motifs. In addition, C-Motif also provides global significance assessment like Motif-All for fairness. As a result, MMFPh runs more efficiently when the significance threshold is relatively large.

Second, the running time of Motif-All and C-Motif is almost the same. Though there is a little difference between these two methods in some special situations, they are still of the same magnitude and the gap is negligible within the permitted error range. This is because when detecting one potential significant motif, C-Motif has to investigate every peptide in the original data and check if this

peptide contains one of its sub-motifs. Then it generates a new data set instead for calculating the local significance value. In contrast, Motif-All only adopts global significance to evaluate a candidate, which greatly reduces the computation and enumeration needed for generating the new data. However, Motif-All has to spend additional running time on checking more candidates than C-Motif.

Third, considering that C-Motif is better at achieving an overall balance of the coverage and non-redundancy of significant phosphorylation motifs compared to Motif-All and MMFPh, it is reasonable to consider C-Motif as an efficient algorithm for mining phosphorylation motifs.

4 CONCLUSION
This paper formally proposes the notion of conditional phosphorylation motif and presents the problem of conditional phosphorylation motif discovery. Additionally, we also propose a new algorithm called C-Motif for this task and make an attempt to conduct a non-redundant discovery. Experiments on both non-kinase-specific phosphorylation data and kinase-specific phosphorylation data demonstrate that C-Motif is able to uncover more interesting phosphorylation motifs than MMFPh and retains fewer false positives than Motif-All under the same given parameter setting. Moreover, it is very fast so that it is able to find hundreds of significant motifs from large data sets. As a result, C-Motif outperforms the existing phosphorylation motif discovery algorithms with respect to efficiency, coverage as well as non-redundancy.

For the future work, there are several possible directions that need further investigations.

Firstly, although methods such as C-Motif and MMFPh can reduce the number of phosphorylation motifs returned to the biologists, there are still many statistically significant motifs that remain. As a result, it is still not an easy task for people to find really biologically meaningful motifs from this pool. From the computational and statistical perspective, we still need to develop effective algorithms and rigorous statistical testing procedures for further reducing the number of reported motifs. The concept of “maximal motif” in MMFPh and the application of permutation test in [17] are research efforts towards this direction. However, that is still not sufficient and further investigations should be conducted.

Secondly, existing motif discovery algorithms fulfill the mining task merely with the sequence data around the phosphorylation site as input. This may prevent us from finding really biologically interesting motifs to derive useful scientific insights. One possible remedy for this limitation is to conduct motif search on expanded data sets that include supplementary information such as the 3D protein structures.

Finally, it is highly necessary to collect biologically validated motifs to build a public repository as the reference database for performance assessment and comparison.

ACKNOWLEDGMENTS
This work was partially supported by the Natural Science Foundation of China under Grant No. 61003176 and No. 61073051, and the Fundamental Research Funds for the Central Universities of China (DUT14QY07).

REFERENCES
[1] P. Cohen, “The regulation of protein function by multisite phosphorylation-a 25 year update,” Trends Biochem. Sci., vol. 25, no. 12, pp. 596–601, 2000.
[2] G. Manning, G. D. Plowman, T. Hunter, and S. Sudarsanam, “Evolution of protein kinase signaling from yeast to man,” Trends Biochem. Sci., vol. 27, no. 10, pp. 514–520, 2002.
[3] S. B. Ficarro, M. L. McCleland, P. T. Stukenberg, D. J. Burke, M. M. Ross, J. Shabanowitz, D. F. Hunt, and F. M. White, “Phosphoproteome analysis by mass spectrometry and its application to Saccharomyces cerevisiae,” Nature Biotechnol., vol. 20, no. 3, pp. 301–305, 2002.
[4] B. E. Turk, “Understanding and exploiting substrate recognition by protein kinases,” Current Opinion Chem. Biol., vol. 12, no. 1, pp. 4–10, 2008.
[5] R. Amanchy, B. Periaswamy, S. Mathivanan, R. Reddy, S. G. Tattikota, and A. Pandey, “A curated compendium of phosphorylation motifs,” Nature Biotechnol., vol. 25, no. 3, pp. 285–286, 2007.
[6] A. N. Kettenbach and S. A. Gerber, “Rapid and reproducible single-stage phosphopeptide enrichment of complex peptide mixtures: Application to general and phosphotyrosine-specific phosphoproteomics experiments,” Anal. Chem., vol. 83, no. 20, pp. 7635–7644, 2011.
[7] Y. Yu, S.-O. Yoon, G. Poulogiannis, Q. Yang, X. M. Ma, J. Villen, N. Kubica, G. R. Hoffman, L. C. Cantley, S. P. Gygi, and J. Blenis, “Phosphoproteomic analysis identifies Grb10 as an mTORC1 substrate that negatively regulates insulin signaling,” Science, vol. 332, no. 6035, pp. 1322–1326, 2011.
[8] S. Matsuoka, B. A. Ballif, A. Smogorzewska, E. R. McDonald, K. E. Hurov, J. Luo, C. E. Bakalarski, Z. Zhao, N. Solimini, Y. Lerenthal, Y. Shiloh, S. P. Gygi, and S. J. Elledge, “ATM and ATR substrate analysis reveals extensive protein networks responsive to DNA damage,” Science, vol. 316, no. 5828, pp. 1160–1166, 2007.
[9] D. Schwartz and S. P. Gygi, “An iterative statistical approach to the identification of protein phosphorylation motifs from large-scale data sets,” Nature Biotechnol., vol. 23, no. 11, pp. 1391–1398, 2005.
[10] A. Ritz, G. Shakhnarovich, A. R. Salomon, and B. J. Raphael, “Discovery of phosphorylation motif mixtures in phosphoproteomics data,” Bioinformatics, vol. 25, no. 1, pp. 14–21, 2009.
[11] Y.-C. Chen, K. Aguan, C.-W. Yang, Y.-T. Wang, N. R. Pal, and I.-F. Chung, “Discovery of protein phosphorylation motifs through exploratory data analysis,” PloS One, vol. 6, no. 5, p. e20025, 2011.
[12] T. Wang, A. N. Kettenbach, S. A. Gerber, and C. Bailey-Kellogg, “MMFPh: A maximal motif finder for phosphoproteomics datasets,” Bioinformatics, vol. 28, no. 12, pp. 1562–1570, 2012.
[13] Z. He, C. Yang, G. Guo, N. Li, and W. Yu, “Motif-all: Discovering all phosphorylation motifs,” BMC Bioinformat., vol. 12, p. S22, 2011.
[14] Z. He and H. Gong, “Comments on ‘MMFPh: A maximal motif finder for phosphoproteomics datasets’,” Bioinformatics, vol. 28, no. 16, pp. 2211–2212, 2012.
[15] T. Wang, A. N. Kettenbach, S. A. Gerber, and C. Bailey-Kellogg, “Response to ‘Comments on “MMFPh: A maximal motif finder for phosphoproteomics datasets”’,” Bioinformatics, vol. 28, no. 16, pp. 2211–2212, 2012.
[16] P. I. Good, Permutation, Parametric and Bootstrap Tests of Hypotheses, New York, NY, USA: Springer-Verlag, 2005, ch. 9.
[17] H. Gong and Z. He, “Permutation methods for testing the significance of phosphorylation motifs,” Statist. Interface, vol. 5, pp. 61–73, 2012.
[18] L. Ma, T. L. Assimes, N. B. Asadi, C. Iribarren, T. Quertermous, and W. H. Wong, “An almost exhaustive search-based sequential permutation method for detecting epistasis in disease association studies,” Genetic Epidemiol., vol. 34, pp. 434–443, 2010.
[19] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules in large databases,” in Proc. 20th Int. Conf. Very Large Data Bases, 1994, pp. 487–499.
[20] R. J. Bayardo Jr., R. Agrawal, and D. Gunopulos, “Constraint-based rule mining in large, dense databases,” Data Mining Knowl. Discovery, vol. 4, no. 2-3, pp. 217–240, 2000.
[21] H. Dinkel, C. Chica, A. Via, C. M. Gould, L. J. Jensen, T. J. Gibson, and F. Diella, “Phospho.ELM: A database of phosphorylation sites-update 2011,” Nucleic Acids Res., vol. 39, no. 1, pp. D261–D267, 2011.

[22] N. Farriol-Mathis, J. S. Garavelli, B. Boeckmann, S. Duvaud, E. Gasteiger, A. Gateau, A.-L. Veuthey, and A. Bairoch, “Annotation of post-translational modifications in the Swiss-Prot knowledge base,” Proteomics, vol. 4, no. 6, pp. 1537–1550, 2004.
[23] J. Gao, J. J. Thelen, A. K. Dunker, and D. Xu, “Musite, a tool for global prediction of general and kinase-specific phosphorylation sites,” Molecular Cellular Proteomics, vol. 9, no. 12, pp. 2586–2600, 2010.
[24] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: A new generation of protein database search programs,” Nucleic Acids Res., vol. 25, no. 17, pp. 3389–3402, 1997.
[25] H. Gong, X. Liu, J. Wu, and Z. He, “Data construction for phosphorylation site prediction,” Briefings Bioinformat., 2013, doi:10.1093/bib/bbt012.
[26] N. Blom, S. Gammeltoft, and S. Brunak, “Sequence and structure-based prediction of eukaryotic protein phosphorylation sites,” J. Molecular Biol., vol. 294, no. 5, pp. 1351–1362, 1999.
[27] T. H. Dang, K. Van Leemput, A. Verschoren, and K. Laukens, “Prediction of kinase-specific phosphorylation sites using conditional random fields,” Bioinformatics, vol. 24, no. 24, pp. 2857–2864, 2008.

Xiaoqing Liu received the BS degree in software engineering from Dalian University of Technology, China, in 2013. She is currently working toward the MS degree in the School of Software at Dalian University of Technology. Her research interests include bioinformatics and data mining.

Jun Wu received the BS degree in software engineering from Dalian University of Technology, China, in 2013. He is currently working toward the MS degree in the School of Software at Dalian University of Technology. His research interests include bioinformatics and data mining.

Haipeng Gong received the BS degree in software engineering from Dalian University of Technology, China, in 2011. He is currently working toward the MS degree in the School of Software at Dalian University of Technology. His research interests include data mining and computational proteomics.

Shengchun Deng received the PhD degree in computer science from Harbin Institute of Technology in 2002. He is currently a professor at Harbin Institute of Technology. His research interests include database, software engineering, and software interoperability.

Zengyou He received the BS, MS, and PhD degrees in computer science from Harbin Institute of Technology, China, in 2000, 2002, and 2006, respectively. He was a research associate in the Department of Electronic and Computer Engineering at the Hong Kong University of Science and Technology from February 2007 to February 2010. Since March 2010, he has been an associate professor in the School of Software at Dalian University of Technology. His research interests include computational proteomics and biological data mining.
