2014 Mining Conditional Phosphorylation Motifs
Abstract—Phosphorylation motifs represent position-specific amino acid patterns around the phosphorylation sites in the set of
phosphopeptides. Several algorithms have been proposed to uncover phosphorylation motifs, but the problem of efficiently discovering a set of significant motifs with sufficiently high coverage and non-redundancy remains unsolved. Here we present a novel notion called conditional phosphorylation motifs. Through this new concept, motifs whose over-expressiveness mainly derives from their constituent parts can be filtered out effectively. To discover conditional phosphorylation motifs, we propose an algorithm called C-Motif for a non-redundant identification of significant phosphorylation motifs. C-Motif is implemented under the Apriori framework, and it tests the statistical significance together with the frequency of candidate motifs in a single stage. Experiments demonstrate that C-Motif outperforms current algorithms such as MMFPh and Motif-All in terms of coverage and non-redundancy of the results and efficiency of the execution. The source code of C-Motif is available at: https://sourceforge.net/projects/cmotif/.
1 INTRODUCTION
significant motifs. However, no further statistical correction is carried out in Motif-All, rendering an abundance of motifs whose over-expressiveness mainly originates from some portion of them in the result set. To remove these redundant motifs, one method is to apply the permutation test in statistics [16] to post-process the results. However, as pointed out in [17], the standard permutation test procedure may not fully address this issue. As a result, they utilize the “sequential permutation test” method [18] to reduce the effect of sub-motifs when calculating the statistical significance of phosphorylation motifs. However, the permutation test process is very time-consuming in practice. An alternative strategy is to choose a rigorous measure that can remove the influence of subsets of motifs. Existing methods such as Motif-X [9] and MMFPh [12] are examples of this idea. They partially succeed in reducing the redundancy by imposing more stringent restrictions in the pruning strategy. Nevertheless, this does not fully solve the problem, and some redundant motifs still remain in the results.

Another challenging problem is how to achieve a maximum coverage such that all frequent and statistically significant motifs are included in the result. Existing methods such as Motif-X, F-Motif, MoDL and MMFPh will miss some motifs that are both sufficiently large and significant phosphorylation motifs due to their incomplete search or overstrict requirements on target motifs.

Overall, the redundancy and coverage issues in phosphorylation motif discovery are only partially solved. This paper addresses this problem and takes a further step in this direction by proposing a new problem formulation for phosphorylation motif discovery: conditional phosphorylation motif discovery. Owing to the problem formulation, the motifs reported would be “actual” motifs with more accurate statistical significance scores. This new problem formulation is a non-trivial extension and generalization of the motif assessment strategy used in Motif-X and MMFPh. Different from previous methods, the key feature of our method is that the evaluation of significance with respect to one motif is only relevant to its own inherent property rather than whether it has significant constituent parts or not. Hence, we achieve a trade-off between the high coverage and non-redundancy of the significant motif set together with running efficiency to some extent.

In the following, we will first illustrate the basic idea of conditional phosphorylation motif and then show how this concept can remove motifs whose over-expressiveness mainly comes from their subsets. If a position in one motif is specified by a certain fixed amino acid, then it is a so-called conserved position. Otherwise, it is a wild position that can match any arbitrary amino acid. One motif A is said to be a k-motif if it has k conserved positions. If another motif B contains only a subset of these k amino acids at the corresponding positions in A, then B is a sub-motif of A. In fact, every peptide that contains the motif must also contain its corresponding sub-motifs, but the reverse is not true. So the set of peptides that contain one motif must be a subset of the collection of peptides that contain its sub-motif. Notably, there are exactly k sub-motifs of size $k-1$ for one k-motif. For each sub-motif of size $k-1$, we can generate a set of peptides in which every peptide contains this sub-motif. On this new data set, we can re-calculate the statistical significance of the k-motif, which is called the conditional (or local) significance. Moreover, we define the statistical significance of a motif based on the original data set according to the traditional definition as the global significance. In this setting, this k-motif is claimed as a conditional phosphorylation motif if it is not only locally significant on all k sub-motif induced data sets but also globally significant on the whole data sets. Hence, there will be two parameters to measure the significance of motifs: the global significance threshold and the local significance threshold, respectively. As a result, the effects of its sub-motifs in the evaluation of statistical significance about the over-expressiveness are reduced.

To address the problem of conditional phosphorylation motif mining, we present a new algorithm called C-Motif. We implement C-Motif by utilizing a support constraint together with statistical measures in the same mining process. Here the support is defined as the percentage of sequences that contain this motif. One motif is said to be frequent if its support is no less than a given threshold. Experiments on real data sets show that our algorithm is efficient and effective in conditional phosphorylation motif discovery.

The remainder of this paper is organized as follows: Section 2 presents the details of the C-Motif algorithm. Section 3 shows the experimental results on real data. Section 4 concludes the paper.

2 METHODS
2.1 Basic Terminology
We describe a motif as a string with a single phosphorylated residue that is denoted with an underlined character, e.g., S, T or Y. We write the conserved positions of one motif as the corresponding amino acids directly and its wild positions are represented by ‘x’. For example, suppose a 2-motif has a fixed ‘P’ one position downstream and a fixed ‘D’ two positions upstream as well as a wild position next to the centered phosphorylation site, this motif is thus represented as ‘PSxD’.

For a k-motif m, we use $\mathrm{Sup}(m, P)$ to denote its support (frequency) in the foreground data set P. Since the goal of phosphorylation motif discovery is to find those motifs that occur more frequently in the foreground data set against the background data set, we usually only use the foreground to assess the frequency of a motif.

Definition 1. m is a frequent motif if $\mathrm{Sup}(m, P)$ is no less than the user-specified threshold $\theta_{sup}$.

The set of candidate frequent motifs of size k is denoted as $S_k$. To discover all frequent motifs, the level-wise search strategy rooted in the Apriori algorithm [19] is widely used in practice. It first accumulates the count for each 1-motif and collects those motifs whose support is no less than the given threshold $\theta_{sup}$ to form the set of frequent 1-motifs $F_1$. Subsequently, since a k-motif will not be frequent if one of its sub-motifs of size $k-1$ is infrequent, $F_{k-1}$ is utilized to generate $S_k$ and those infrequent ones are pruned to generate $F_k$.

Owing to the fact that m consists of k fixed amino acids, this motif has k sub-motifs of size $k-1$. These motifs
LIU ET AL.: MINING CONDITIONAL PHOSPHORYLATION MOTIFS 917
subsumed by m are denoted as $m_1, m_2, \ldots, m_k$, respectively. The only difference between m and $m_i$ ($1 \le i \le k$) is that the $i$th fixed position in m is non-fixed in $m_i$. We describe the sets of peptides in the foreground data P where these sub-motifs occur as $P(m_1), P(m_2), \ldots, P(m_k)$, respectively. Similarly, we use $N(m_i)$ to denote the set of peptides in the background data N that contain $m_i$. We utilize $\mathrm{Sig}(m, P, N)$ to denote the statistical significance calculation function for each motif m, and it measures the over-expressiveness of m in P against N. In fact, both the nonparametric measurements such as odds ratio or relative risk and the binomial probability model have been used in the assessment of phosphorylation motifs [13], [9], [12]. Note that the use of different evaluation methods will not change the nature of the problem. Generally, these statistical significance assessment functions are consistent with each other in practice. For ease of illustration and a fair comparison of different methods, here we use relative risk and odds ratio as the representatives of statistical significance evaluation methods for phosphorylation motifs. Both relative risk and odds ratio describe a likelihood change of the occurrence of one motif between the foreground and the background. Relative risk is defined as the ratio of the supports of m in the two data sets. When using relative risk to measure the statistical significance of m, it reads as:

$$\mathrm{Sig}(m, P, N) = \frac{\mathrm{Sup}(m, P)}{\mathrm{Sup}(m, N)}. \tag{1}$$

A relative risk of 1 indicates that the target motif under study is equally likely to occur in both data sets. A relative risk greater than 1 means that this motif is more likely to occur in the foreground. In addition, the odds is the ratio of the probability that the event of interest does happen to the probability that it does not happen. The odds ratio is defined as the ratio of the odds of an event occurring in one group to the odds of it occurring in another group. In the context of phosphorylation motif finding, if we adopt odds ratio to measure the over-expressiveness, then the significance of m is:

$$\mathrm{Sig}(m, P, N) = \frac{\mathrm{Sup}(m, P) / (1 - \mathrm{Sup}(m, P))}{\mathrm{Sup}(m, N) / (1 - \mathrm{Sup}(m, N))}. \tag{2}$$

Odds ratio has the same characteristics as relative risk: only those motifs whose odds ratio is greater than 1 have potential to be statistically significant.

Particularly, if we consider $P(m_1), P(m_2), \ldots, P(m_k)$ as the new foreground data set instead of P and $N(m_1), N(m_2), \ldots, N(m_k)$ as the new background data set instead of N when estimating the over-expressiveness, there are exactly k different significance values for m, namely $\mathrm{Sig}(m, P(m_i), N(m_i))$ where $1 \le i \le k$. Without loss of generality, we assume that $\mathrm{Sig}(m, P, N)$ has positive correlation with the over-expressiveness: the bigger $\mathrm{Sig}(m, P, N)$ is, the more significant the motif m is. Under this setting, we adopt the minimum value of $\mathrm{Sig}(m, P(m_i), N(m_i))$ ($1 \le i \le k$) as the local (or conditional) statistical significance in the estimation of over-expressiveness of each motif. Then, the problem of conditional phosphorylation motif discovery is to discover all frequent motifs from P with sub-motif derived statistical significance values passing the given threshold in addition to fulfilling the traditional definition.

Definition 2. Local statistical significance:

$$\mathrm{Sig}_l(m, P, N) = \min_{1 \le i \le k} \mathrm{Sig}(m, P(m_i), N(m_i)).$$

Definition 3. Global statistical significance:

$$\mathrm{Sig}_g(m, P, N) = \mathrm{Sig}(m, P, N).$$

Overall, to identify all statistically significant, sufficiently frequent conditional phosphorylation motifs, we assess at least two aspects of each motif: frequency and statistical significance including both the local significance and the global significance:

• Frequency. We impose the support constraint to reduce the search space and prevent the generation of random artifacts.
• Statistical significance. Note that the statistical evaluation of over-expressiveness for a motif can be done in various ways. The statistical significance measures such as relative risk and odds ratio can be used interchangeably. The choice of significance assessment measure will not change the performance of underlying algorithms.

2.2 Problem Formulation
As shown above, we strengthen and optimize the definition of phosphorylation motif finding and try to conduct an extensive and non-redundant (NR) discovery. The conditional phosphorylation motifs are deemed to be true positives with prominent over-expressiveness that does not depend on the interplay of their subsets. We impose a persuasive significance constraint called the local or conditional significance on each candidate motif that evaluates the statistical significance of a phosphorylation motif with the sets of sequences induced from its sub-motifs. Furthermore, we also perform the global significance evaluation using the original data sets in the traditional way. Thus, there are two parameters to measure the significance of motifs: the global significance threshold $\theta_{g\_sig}$ and the local significance threshold $\theta_{l\_sig}$, respectively. That is, to ensure the significance of one motif m over P against N, two criteria must be satisfied simultaneously: $\mathrm{Sig}_g(m, P, N) \ge \theta_{g\_sig}$ and $\mathrm{Sig}_l(m, P, N) \ge \theta_{l\_sig}$. Hence, we can guarantee that the conditional phosphorylation motifs are also significant under the traditional definition.

However, there is one critical issue remaining for this task: has the effect of the sub-motifs really been removed through the use of local statistical significance? To address this issue, we employ a measure called improvement proposed in [20] for justification. More precisely, the improvement is defined as the difference between the statistical significance of one motif and that of its sub-motifs. In general, a positive improvement indicates that the over-expressiveness of one target motif comes from the combination of all its constituent amino acids rather than just one of its subsets. We should prune those redundant motifs that have no positive improvements: the over-expressiveness of a motif is equal to or less than that
918 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 11, NO. 5, SEPTEMBER/OCTOBER 2014
of its sub-motifs.¹ To further clarify this issue, since we utilize the local significance to get rid of sub-motif interplays, we calculate the difference values between the conditional phosphorylation k-motif and its corresponding sub-motifs of size $k-1$ with respect to the global significance. For simplicity of illustration and mathematical derivation, here we use relative risk as the example significance measure in the remainder of this section.

¹ For justification, we also define some motifs as redundant ones if their improvement is very small.

Lemma 1. Each conditional phosphorylation motif possesses positive improvement if relative risk is used to measure the statistical significance.

Proof. For a k-motif m in data sets P and N, suppose one of its sub-motifs $m_j$ of size $k-1$ has the minimal significance value, i.e., $\mathrm{Sig}_l(m, P, N) = \mathrm{Sig}(m, P(m_j), N(m_j))$. We set $\theta_{l\_sig} = t$, $t > 1$. Then Lemma 1 can be formulated as: if $\mathrm{Sig}_l(m, P, N) \ge t$, then $\mathrm{Sig}_g(m, P, N) > \mathrm{Sig}_g(m_i, P, N)$ for any $1 \le i \le k$:

$$\mathrm{Sig}(m, P(m_i), N(m_i)) = \frac{\mathrm{Sup}(m, P(m_i))}{\mathrm{Sup}(m, N(m_i))} = \frac{|P(m)|\,|N(m_i)|}{|N(m)|\,|P(m_i)|} \ge \mathrm{Sig}_l(m, P, N) = \mathrm{Sig}(m, P(m_j), N(m_j)) \ge t, \tag{3}$$

$$\mathrm{Sig}_g(m, P, N) = \frac{\mathrm{Sup}(m, P)}{\mathrm{Sup}(m, N)} = \frac{|N|\,|P(m)|}{|P|\,|N(m)|}, \tag{4}$$

$$\mathrm{Sig}_g(m_i, P, N) = \frac{\mathrm{Sup}(m_i, P)}{\mathrm{Sup}(m_i, N)} = \frac{|N|\,|P(m_i)|}{|P|\,|N(m_i)|}. \tag{5}$$

Since t is greater than 1, Equation (3) gives $\frac{|P(m)|}{|N(m)|} > \frac{|P(m_i)|}{|N(m_i)|}$. As a result, we get $\mathrm{Sig}_g(m, P, N) > \mathrm{Sig}_g(m_i, P, N)$ by subtracting Equation (5) from Equation (4), which returns a result that is greater than 0. □

To make the following description easier to follow, we provide a precise problem definition of conditional phosphorylation motif discovery with clearly stated input and output:

Input. The set of phosphorylated peptides (the foreground data set P) and the set of unphosphorylated peptides (the background data set N), the support threshold $\theta_{sup}$, the local significance threshold $\theta_{l\_sig}$ and the global significance threshold $\theta_{g\_sig}$.

Output. A set of conditional phosphorylation motifs R, where each motif $m \in R$ satisfies: (1) $\mathrm{Sup}(m, P) \ge \theta_{sup}$; (2) $\mathrm{Sig}_l(m, P, N) \ge \theta_{l\_sig}$; (3) $\mathrm{Sig}_g(m, P, N) \ge \theta_{g\_sig}$.

2.3 Categorization of Existing Methods under the New Formulation
Notice that Motif-X [9] and MMFPh [12] also measure the over-expressiveness of one motif in a way that is similar to our definition of local significance. More precisely, although only $\theta_{l\_sig}$ is used in both Motif-X and MMFPh, they implicitly require that one sub-motif of the target motif should be locally significant as well. In this section, we first show that all the motifs reported by Motif-X and MMFPh are also globally significant with high over-expressiveness, as shown in the lemma below.

Lemma 2. For a k-motif m in data sets P and N, suppose we use relative risk as the measure and set both $\theta_{l\_sig}$ and $\theta_{g\_sig}$ to be t, $t \ge 1$. Let $m^{(r)}$ represent one sub-motif of m of size r. If $\mathrm{Sig}(m^{(r)}, P(m^{(r-1)}), N(m^{(r-1)})) \ge t$ for all $1 \le r < k$, then $\mathrm{Sig}_g(m, P, N) \ge t$.

Proof.

$$\mathrm{Sig}(m^{(r)}, P(m^{(r-1)}), N(m^{(r-1)})) = \frac{\mathrm{Sup}(m^{(r)}, P(m^{(r-1)}))}{\mathrm{Sup}(m^{(r)}, N(m^{(r-1)}))} = \frac{|P(m^{(r)})|\,|N(m^{(r-1)})|}{|N(m^{(r)})|\,|P(m^{(r-1)})|}, \tag{6}$$

$$\mathrm{Sig}(m^{(r-1)}, P(m^{(r-2)}), N(m^{(r-2)})) = \frac{\mathrm{Sup}(m^{(r-1)}, P(m^{(r-2)}))}{\mathrm{Sup}(m^{(r-1)}, N(m^{(r-2)}))} = \frac{|P(m^{(r-1)})|\,|N(m^{(r-2)})|}{|P(m^{(r-2)})|\,|N(m^{(r-1)})|}, \tag{7}$$

$$\mathrm{Sig}_g(m, P, N) = \frac{\mathrm{Sup}(m, P)}{\mathrm{Sup}(m, N)} = \frac{|P(m)|\,|N|}{|P|\,|N(m)|}. \tag{8}$$

It is easy to see that $\mathrm{Sig}(m^{(r)}, P(m^{(r-2)}), N(m^{(r-2)})) \ge t^2$ by multiplying Equations (6) and (7). That is, $m^{(r)}$ is statistically significant in the data sets derived from $m^{(r-2)}$. This rule also applies to motif $m^{(r-2)}$ so that $m^{(r-2)}$ is significant in the peptide sequences induced by one of its sub-motifs as well. Therefore, we can infer that $\mathrm{Sig}_g(m, P, N) \ge t^k$ by iterating the multiplication process. Since t is a positive number that is no less than 1, $\mathrm{Sig}_g(m, P, N)$ must be equal to or greater than t, too. Thus, m is globally significant as well. □

Lemma 2 shows that Motif-X and MMFPh can find motifs that are globally significant. However, two implicit issues exist with regard to coverage and redundancy. The first one is that they may miss some potentially meaningful motifs under the conditional phosphorylation motif definition. This is because these methods investigate a certain candidate k-motif m only on condition that it contains at least one sub-motif of size $k-1$ that is both frequent and significant. If none of the constituent motifs $m_i$ is significant, there is no chance to generate and evaluate this motif, so Motif-X and MMFPh will not check m. Fig. 1 provides such an example.

For illustration purposes, we adopt relative risk as the significance measure here. Additionally, we set the significance threshold $\theta_{sig} = \theta_{l\_sig} = \theta_{g\_sig} = 1.2$ and the support threshold $\theta_{sup} = 0.2$. In this sample data set, we plant one frequent and significant phosphorylation motif ‘KMS’ with relative risk value larger than 1.2. There may be some other significant motifs present in this data set, but we will only focus our discussion on ‘KMS’. ‘KMS’ consists of two
motifs, then $l_1 < l_2$ if the last conserved amino acid of $l_1$ lies to the left of that of $l_2$. Line up these identical amino acids and the remaining ones in $l_1$ and $l_2$ according to their rank to compose one k-motif m.
2) Evaluate the frequency of each candidate in $S_k$ by their supports (Steps 8-10). Prune all the infrequent ones and add all the remaining ones to $F_k$.
3) Test the local significance and the global significance of each potential motif in $F_k$ (Steps 11-18). For one k-motif m, obtain all of its $(k-1)$-sub-motifs. For each sub-motif $m_i$ of m, construct its matching foreground $P(m_i)$ and background $N(m_i)$. Calculate the significance value on the new data sets and choose the minimum one as the final local significance value. In addition, we also obtain the global significance value using the original data sets P and N, consistent with the traditional definition. Filter out the insignificant motifs and save the ones that are both globally and locally significant in R.
4) Repeat the above steps until no more frequent candidates can be generated in $S_k$. Return R as the final result.

enumerates and checks all frequent motifs in the mining process. The second is that our implementation only prunes those insignificant motifs with respect to both global significance and local significance. □

Theorem 2. The C-Motif algorithm is correct.

Proof. The correctness of the C-Motif algorithm can be guaranteed by two facts. First, only frequent motifs are generated. Second, both the local and global significance values are exactly calculated and every motif with significance values that are lower than the user-specified thresholds will be pruned. □

3 EXPERIMENTAL RESULTS
In order to demonstrate the efficacy and utility of our algorithm, we conduct a series of tests with real data. In our experiments, we compare our algorithm with the Motif-All algorithm and the MMFPh algorithm with respect to efficiency, coverage and non-redundancy. Although several motif-discovery methods have been proposed, we choose Motif-All and MMFPh for comparison because they are representatives of algorithms that use only a global significance threshold and only a local significance threshold, respectively. Furthermore, we use the same significance measure in all algorithms so as to make their outputs comparable. More precisely, we choose relative risk to measure the over-expressiveness so as to ensure the motifs reported by MMFPh are also globally significant, which can facilitate a fair comparison for our experiments.

In the experiments, we apply C-Motif, MMFPh and Motif-All to both non-kinase-specific and kinase-specific phosphorylation data sets with a fixed length of amino acids upstream and downstream of the phosphorylated residues. The details of these data sets are provided in the following sections. In each experiment, we first present a brief description of the data and tune the thresholds so as to clarify the comparison. Subsequently, we perform a general analysis of the motifs discovered and then illustrate the superiority of C-Motif against MMFPh and Motif-All.
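Before turning to the individual data sets, the mining procedure of Section 2 (support pruning followed by the global and local significance tests with relative risk) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the representation of a motif as a set of (position, amino acid) pairs, the plain pairwise join used for candidate generation, the handling of a zero background support, and the toy peptides `FG` and `BG` are all assumptions made for illustration. For 1-motifs the only sub-motif is empty, so the sketch treats their local significance as equal to the global one.

```python
from itertools import combinations

# A motif is a frozenset of (position, amino_acid) pairs over aligned,
# fixed-width peptides; positions that are not listed are wild ('x').

def matches(motif, peptide):
    return all(peptide[pos] == aa for pos, aa in motif)

def support(motif, peptides):
    """Fraction of peptides that contain the motif (0.0 for an empty set)."""
    if not peptides:
        return 0.0
    return sum(matches(motif, p) for p in peptides) / len(peptides)

def relative_risk(motif, fg, bg):
    """Equation (1): Sup(m, P) / Sup(m, N). The zero-background case is a
    convention of this sketch, not something stated in the paper."""
    sup_fg, sup_bg = support(motif, fg), support(motif, bg)
    if sup_bg == 0:
        return float("inf") if sup_fg > 0 else 0.0
    return sup_fg / sup_bg

def local_significance(motif, fg, bg):
    """Definition 2: minimum significance over the k sub-motif induced sets."""
    values = []
    for item in motif:
        sub = motif - {item}                      # drop one conserved position
        fg_sub = [p for p in fg if matches(sub, p)]
        bg_sub = [p for p in bg if matches(sub, p)]
        values.append(relative_risk(motif, fg_sub, bg_sub))
    return min(values)

def c_motif_sketch(fg, bg, center, theta_sup, theta_l_sig, theta_g_sig):
    """Level-wise search: keep frequent motifs, report the significant ones."""
    width = len(fg[0])
    alphabet = sorted({aa for p in fg for aa in p})
    level = {frozenset([(pos, aa)])
             for pos in range(width) if pos != center
             for aa in alphabet}
    frequent = {m for m in level if support(m, fg) >= theta_sup}
    results = []
    while frequent:
        for m in frequent:
            if (relative_risk(m, fg, bg) >= theta_g_sig
                    and local_significance(m, fg, bg) >= theta_l_sig):
                results.append(m)
        # Simplified Apriori join: merge two frequent motifs into a candidate
        # one size larger and keep it only if all its sub-motifs are frequent.
        k = len(next(iter(frequent))) + 1
        candidates = set()
        for a, b in combinations(frequent, 2):
            union = a | b
            distinct_positions = len({pos for pos, _ in union}) == k
            if len(union) == k and distinct_positions and all(
                    union - {item} in frequent for item in union):
                candidates.add(union)
        frequent = {m for m in candidates if support(m, fg) >= theta_sup}
    return results

# Hypothetical toy data: width-3 peptides with the phosphorylated S at index 1.
FG = ["PSD", "PSD", "PSA", "ASD"]
BG = ["PSD", "ASA", "ASA", "ASA", "PSA", "ASD", "ASA", "ASA"]
PD = frozenset({(0, "P"), (2, "D")})
```

On this toy data the 2-motif `PD` has a global relative risk of 4.0 but a local relative risk of about 1.33, so `c_motif_sketch(FG, BG, 1, 0.5, 1.5, 1.2)` drops it while a purely global test at the same threshold would keep it, which mirrors the redundancy argument made for conditional motifs.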
Fig. 5. The running time comparison of C-Motif, MMFPh and Motif-All on non-kinase-specific phosphorylation data.
4) The fourth part consists of motifs that are only reported by Motif-All. Motifs in this category generally have at least one very significant sub-motif that actually leads to the statistical significance of its superset. For instance, ‘GxxRxxS’ is composed of two sub-motifs: ‘GxxxxxS’ and ‘RxxS’. ‘GxxxxxS’ is not a significant phosphorylation motif, with both local and global significance values of 0.99, while ‘RxxS’ is a significant one with significance values of 2.48. Particularly, ‘GxxRxxS’ has a global statistical significance value of 3.22, rendering little improvement upon the over-expressiveness compared to that of ‘RxxS’. In this regard, ‘GxxRxxS’ is an insignificant motif according to our definition although it can pass the significance threshold $\theta_{sig}$. Similar analysis can also be made for the other motifs in this part. Hence, these motifs are redundant in the sense that their discriminative power mainly comes from some sub-motifs that are already claimed to be statistically significant. Filtering out this kind of motif reduces the redundancy in phosphorylation motif discovery.

We have similar observations on the motifs of size 3 reported by different algorithms. In general, Motif-All covers all the potentially significant motifs detected by MMFPh and C-Motif but contains many redundant ones with no significant improvements of statistical significance. The difference set between MMFPh and C-Motif includes several potentially redundant motifs from MMFPh and some interesting ones without any significant sub-motifs discovered by C-Motif.

Note that the largest size of reported phosphorylation motifs is 3 under the setting of $\theta_{sup} = 0.005$ and $\theta_{sig} = 1.7$. The difference in the performance of C-Motif, MMFPh and Motif-All is more evident when the size of target motifs is larger than 2. That is, the bigger the size is, the more different the results of the algorithms are. Thus, we can infer that increasing the size of target motifs will highlight the advantage of C-Motif over MMFPh and Motif-All.

To further check if this is true, we also perform phosphorylation motif mining at the support threshold of 0.1 with significance threshold of 4.0. C-Motif obtains an identical set of motifs of size 1 with MMFPh and a slightly different set from Motif-All under this setting. If we change the support threshold to 0.001 with the significance threshold equivalent to 1.5, C-Motif reports 1,530 motifs, whereas MMFPh reports 515 motifs and Motif-All reports 4,236 motifs. Accordingly, there exists a bigger gap between their result sets. On one hand, MMFPh misses a number of interesting motifs and also includes several potentially redundant motifs whose over-expressiveness mainly derives from their sub-motifs. On the other hand, Motif-All also reports many motifs whose sub-motifs are included in the result as well. This demonstrates that C-Motif not only can find more significant motifs than MMFPh but also is better suited than Motif-All and MMFPh to achieve non-redundancy in a flexible manner.

In conclusion, it has been illustrated that MMFPh is more effective than Motif-All in presenting meaningful motifs and reducing redundant motifs, although it would miss some statistically significant motifs. Furthermore, Motif-All acquires higher coverage at the cost of including many motifs whose over-expressiveness is rooted in sub-motifs. In contrast, C-Motif makes a reasonable trade-off between redundancy and coverage. Hence, the empirical comparison shows that C-Motif outperforms the other methods like MMFPh and Motif-All by discovering more meaningful and non-redundant phosphorylation motifs.

Fig. 5 presents the running time of different algorithms. For this data set, C-Motif and Motif-All spend similar execution time in the mining process under the same parameter setting. Due to its specific motif candidate generation procedure, MMFPh needs more time when the significance threshold is very low and is more efficient when the significance threshold is increased. Overall, C-Motif is relatively efficient and is comparable with other algorithms in terms of running efficiency.

3.2 CDK-Specific Phosphorylation Data
3.2.1 Data Description
A protein kinase phosphorylates the substrates by transferring phosphate from adenosine triphosphate or guanosine triphosphate to specific amino acids (serine, threonine and
TABLE 1. The Number of Reported Motifs of Different Sizes on the CDK Phosphorylation Data Sets by Tuning the Thresholds $\theta_{sup}$ and $\theta_{sig}$ Gradually
interesting one even its over-expressiveness is derived from some subsets, or just for MMFPh to filter one motif out if it has no frequent and significant sub-motifs although this motif is globally significant. On the contrary, more motifs will be pruned and fewer can be discovered with higher thresholds. This will reduce the performance gap among these algorithms. In this sense, Motif-All and MMFPh may be more appropriate for discovering those motifs of smaller size. To conclude, C-Motif is a better algorithm with respect to coverage as well as non-redundancy in phosphorylation motif discovery.

Fig. 7 depicts the running time of the algorithms under different parameters. Despite the fluctuation of MMFPh, it is apparent that C-Motif needs a running time similar to that of Motif-All and a little more than that of MMFPh to finish the mining process. Meanwhile, we would like to point out that the performance gap between our algorithm and other methods is not significant. Therefore, C-Motif is a relatively efficient algorithm for the task of finding phosphorylation motifs.

3.3 Protein kinase A (PKA)-Specific Phosphorylation Data
3.3.1 Data Description
Protein kinase A is a class of cAMP-dependent enzymes. It plays an important role in gene regulatory protein phosphorylation and activating transcription of specific genes. We also use the Phospho.ELM and Swiss-Prot databases to generate phosphorylated peptides and unphosphorylated peptides with PKA-specific phosphorylation sites, which is similar to the construction of CDK-specific phosphorylation data. In the generated data, the number of phosphorylated peptides is roughly equivalent to that of unphosphorylated peptides.

3.3.2 Results
In this experiment, we further validate our approach using the PKA-specific phosphorylation peptides with parameters $\theta_{sup} = 0.01$ and $\theta_{sig} = 2.0$. For a fair comparison, the significance thresholds for C-Motif are specified as: $\theta_{sig} = \theta_{l\_sig} = \theta_{g\_sig} = 2.0$. Under this setting, the results of all the motif extraction algorithms are shown in Fig. 8.

As shown in Fig. 8, in addition to the gap in the total number of motifs, MMFPh fails to find any motifs of size larger than 2 while Motif-All reports motifs of size up to 9. C-Motif returns a result set of motifs whose sizes are not larger than 3.

To further describe the difference, here we use several identified motifs as examples for illustration. ‘SxxSVT’ is a statistically significant motif with both global significance and local significance values larger than the given threshold $\theta_{sig}$. However, MMFPh ignores this motif since it lacks frequent and significant sub-motifs. The significance values of its three 2-sub-motifs (‘SxxSV’, ‘SxxSxT’ and ‘SVT’) are all less than $\theta_{sig}$. On the other hand, ‘GEKxxxS’ reported by Motif-All is composed of a significant part ‘GxKxxxS’ with local significance value of 3.6 and an insignificant part ‘ExxxxS’ with local significance value of 0.5 with the assessment in C-Motif. In addition, ‘GEKxxxS’ has a global significance value of 5.14 that has no significant improvement
Fig. 7. The running time comparison of C-Motif, MMFPh and Motif-All on CDK-specific phosphorylation data.
Fig. 9. The running time comparison of C-Motif, MMFPh and Motif-All on PKA-specific phosphorylation data.
[22] N. Farriol-Mathis, J. S. Garavelli, B. Boeckmann, S. Duvaud, E. Gasteiger, A. Gateau, A.-L. Veuthey, and A. Bairoch, “Annotation of post-translational modifications in the Swiss-Prot knowledge base,” Proteomics, vol. 4, no. 6, pp. 1537–1550, 2004.
[23] J. Gao, J. J. Thelen, A. K. Dunker, and D. Xu, “Musite, a tool for global prediction of general and kinase-specific phosphorylation sites,” Molecular Cellular Proteomics, vol. 9, no. 12, pp. 2586–2600, 2010.
[24] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: A new generation of protein database search programs,” Nucleic Acids Res., vol. 25, no. 17, pp. 3389–3402, 1997.
[25] H. Gong, X. Liu, J. Wu, and Z. He, “Data construction for phosphorylation site prediction,” Briefings Bioinformat., 2013, doi:10.1093/bib/bbt012.
[26] N. Blom, S. Gammeltoft, and S. Brunak, “Sequence and structure-based prediction of eukaryotic protein phosphorylation sites,” J. Molecular Biol., vol. 294, no. 5, pp. 1351–1362, 1999.
[27] T. H. Dang, K. Van Leemput, A. Verschoren, and K. Laukens, “Prediction of kinase-specific phosphorylation sites using conditional random fields,” Bioinformatics, vol. 24, no. 24, pp. 2857–2864, 2008.

Haipeng Gong received the BS degree in software engineering from Dalian University of Technology, China, in 2011. He is currently working toward the MS degree in the School of Software at Dalian University of Technology. His research interests include data mining and computational proteomics.

Shengchun Deng received the PhD degree in computer science from Harbin Institute of Technology in 2002. He is currently a professor at Harbin Institute of Technology. His research interests include database, software engineering, and software interoperability.