Threshold Tuning For Improved Classification Association Rule Mining
1 Introduction
An Association Rule (AR) is a way of describing a relationship that can be observed between database attributes [1], of the form "if the set of attribute-values A is found together in a database record, then it is likely that the set B will be present also". A rule of this form, A → B, is of interest only if it meets at least two threshold requirements: support and confidence. The support for the rule is the number of database records within which the association can be observed. The confidence in the rule is the ratio of its support to that of its antecedent. Association Rule Mining (ARM) aims to uncover all such relationships that are present in a database, for specified thresholds of support and confidence.
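To make the two thresholds concrete, the following minimal Python sketch computes the support and confidence of a rule A → B over a toy set of records; the records and the rule are illustrative, not drawn from the paper.

```python
# Minimal sketch: support and confidence of a rule A -> B.
# The toy records and the rule are illustrative only.

def support(itemset, records):
    """Number of records that contain every item in itemset."""
    return sum(itemset <= record for record in records)

records = [
    {"a1", "a2"},
    {"a1", "a2", "a3"},
    {"a2", "a3"},
    {"a1", "a3"},
]
A, B = {"a1"}, {"a3"}

sup = support(A | B, records)                         # support of A -> B
conf = support(A | B, records) / support(A, records)  # sup(A u B) / sup(A)
print(f"support = {sup}, confidence = {conf:.2f}")    # support = 2, confidence = 0.67
```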
One application of ARM is to define rules that will classify database records. A Classification Association Rule (CAR) is a rule of the form X → c, where X is a set of attribute-values, and c is a class to which database records (instances) can be assigned. Mining of CARs usually proceeds in two steps. First, a training set of database records is mined to find all ARs for which one of the target classes is the consequent, and which satisfy specified thresholds of support and confidence. This stage is essentially similar to ARM in the more general case, with the classes c treated as attribute-values, and the restriction that the only rules we need consider are those for which the consequent is one of these. A second stage then sorts and reduces the set of rules found, with the aim of producing a consistent set that will enable efficient and reliable classification of future instances.
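To make this first stage precise, the brute-force Python sketch below enumerates every candidate rule X → c that meets given support and confidence thresholds. Real miners such as Apriori or FP-growth avoid the exponential enumeration; the function name and data layout here are our own illustrative choices.

```python
# Brute-force sketch of stage one of CAR mining: enumerate rules X -> c
# meeting the support and confidence thresholds. Exponential in the number
# of items; shown only to make the stage precise.
from itertools import combinations

def mine_cars(records, classes, min_sup, min_conf):
    """records: (itemset, class_label) pairs; thresholds are fractions."""
    n = len(records)
    items = sorted({i for itemset, _ in records for i in itemset})
    cars = []
    for size in range(1, len(items) + 1):
        for X in map(frozenset, combinations(items, size)):
            matched = [c for itemset, c in records if X <= itemset]
            if len(matched) < min_sup * n:   # antecedent already too rare
                continue
            for c in classes:
                sup = matched.count(c) / n
                conf = matched.count(c) / len(matched)
                if sup >= min_sup and conf >= min_conf:
                    cars.append((X, c, sup, conf))
    return cars
```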
The CBA algorithm described in [6] exemplifies the approach. First, a version
of the well-known Apriori algorithm [2] is used to generate a set of ruleitems
that satisfy a required support threshold, where a ruleitem is a set of items
(attribute-values) associated with a class label, which thus defines a potential
CAR. The rules thus generated are pruned, using the calculated confidence to eliminate those that fail to meet a required confidence threshold or that conflict with higher-confidence rules. Finally, a classifier is built by selecting an ordered
subset of the remaining CARs. This process involves coverage analysis in which
each candidate CAR is examined in turn, to find a set of CARs that cover the
dataset fully. The CMAR algorithm of [7] has a similar general form, using a
version of the FP-growth algorithm [5] to generate the candidate CARs.
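The sketch below gives the flavour of such a coverage analysis; it is a simplification in the spirit of CBA, not the authors' exact procedure. Rules are taken in precedence order, and a rule is kept only if it correctly classifies at least one record not yet covered. Because every candidate may be checked against the whole training set, the cost is roughly the product of dataset size and candidate count, which is exactly the expense discussed in the next paragraph.

```python
# Simplified coverage analysis in the spirit of CBA (not the exact
# published procedure): keep a rule only if it correctly classifies at
# least one training record that no earlier rule has covered.

def select_by_coverage(cars, records):
    """cars: (antecedent, class, sup, conf); records: (itemset, class)."""
    # Common precedence order: higher confidence first, then higher support.
    ordered = sorted(cars, key=lambda r: (-r[3], -r[2]))
    uncovered = list(records)
    classifier = []
    for X, c, sup, conf in ordered:
        if any(X <= itemset and cls == c for itemset, cls in uncovered):
            classifier.append((X, c))
            uncovered = [(i, cls) for i, cls in uncovered if not X <= i]
        if not uncovered:
            break
    return classifier
```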
Results presented in [6] and [7] show that classification using CARs seems to offer greater accuracy, in many cases, than other methods such as C4.5 [8]. The problem with both CBA and CMAR, however, is that the cost of the coverage analysis is essentially a product of the size of the dataset and the number of candidate CARs being considered. The CPAR algorithm [10] takes a different approach to generating rules, using a procedure derived from the FOIL algorithm [9] rather than a classical ARM method. Although this improves performance by generating a smaller set of rules, the performance is still O(nmr), where n is the number of records, m the number of items, and r the number of candidate rules. The RIPPER algorithm [4] incorporates a pruning strategy that can be applied independently of the rule generation strategy used. All these methods follow an overfit-and-prune strategy that will be costly if used to construct classifiers from the very large and wide datasets that are characteristic of classical ARM.
Thus, we identify three general problems of CAR mining. First, the ARM task
is inherently costly because of the exponential complexity of the search space.
Second, it is very likely that this first stage will generate a very large number
of candidate rules, and so the selection of a suitable subset for classification
may also be computationally expensive. Finally, the reliability of the resulting classifier depends to some degree on the rather arbitrary choice of support and confidence thresholds used in the mining process.
In this paper we describe a new approach to the generation of CARs that significantly reduces the cost of mining the training data by using both support and confidence thresholds in the first stage of mining, producing a small set of rules without the need for coverage analysis. We present results to show that this approach achieves a classification accuracy comparable with other methods, but in many cases at much lower cost. We also describe a strategy for tuning the support and confidence thresholds to obtain the best accuracy.
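As a hedged illustration only, the sketch below frames threshold tuning as a search over candidate (support, confidence) pairs, scoring each pair by the accuracy of the classifier it yields; the authors' actual tuning strategy is described in the paper itself, and the grid values and the `evaluate` callback are our own assumptions.

```python
# Hedged illustration of threshold tuning as a simple search over a grid
# of (support, confidence) pairs. `evaluate` is an assumed callback that
# builds a classifier at the given thresholds and returns its accuracy.

def tune_thresholds(train, evaluate, sup_grid, conf_grid):
    best_s, best_c, best_acc = None, None, -1.0
    for s in sup_grid:
        for c in conf_grid:
            acc = evaluate(train, s, c)
            if acc > best_acc:
                best_s, best_c, best_acc = s, c, acc
    return best_s, best_c, best_acc

# e.g. tune_thresholds(train, evaluate, [0.005, 0.01, 0.02], [0.4, 0.5, 0.6])
```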
3 Experimental Results
To investigate the performance of TFPC, we carried out experiments using a number of data sets taken from the UCI Machine Learning Repository. TFPC was implemented as a Java program, and for comparison purposes we used our own (Java) implementations of the published algorithms for CMAR [7] and CPAR [10]. In the first set of experiments, as in [6] and [7], we assumed a support threshold of 1% and a confidence threshold of 50%.
For the implementation of CPAR, we used the same parameters used in [10],
i.e. minimum gain threshold = 0.7, total weight threshold = 0.05, decay factor
= 2/3, and similarity ratio 1 : 0.99. We tried to use the same data sets as those used to analyse CMAR and CPAR. However, in many cases the data sets used in [7] and [10] appear to be no longer available, and others were found not to have parameters identical to those reported.
The data sets chosen therefore comprise a subset of those used in [7] and [10],
augmented by a further selection from the UCI repository. The choice of addi-
tional data sets concentrated on larger/denser data sets (2000+ records) because
the majority of the data sets used in the reported analysis of CMAR and CPAR
were relatively small (fewer than 1000 records). The sets chosen were discretized using the LUCS-KDD DN software¹; where appropriate, continuous attributes were ranged using five sub-ranges. The programs were run on a 1.2 GHz Intel Celeron CPU with 512 MB of RAM, running under Red Hat Linux 7.3.

¹ Available at http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/lucs-kdd_DN.html
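For illustration, the sketch below shows one plausible reading of this preprocessing: equal-width ranging of a continuous attribute into five sub-ranges, each sub-range becoming a distinct binary category; the LUCS-KDD DN software's exact scheme may differ.

```python
# Illustrative equal-width ranging of a continuous attribute into five
# sub-ranges, each treated as a separate binary category (item). This is
# one plausible scheme; the LUCS-KDD DN software's details may differ.

def discretize(values, n_ranges=5):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_ranges or 1.0   # avoid zero width for constant columns
    return [min(int((v - lo) / width), n_ranges - 1) for v in values]

print(discretize([0.1, 2.5, 4.9, 3.3]))  # -> [0, 2, 4, 3]
```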
The results from these experiments are shown in Table 1. The row labels describe the key characteristics of the data, in the form into which it was discretized. For example, the label anneal.D106.N798.C6 denotes the 'anneal' data set, which includes 798 records in 6 classes, with attributes that for our experiments have been discretized into 106 binary categories.
The last three columns in Table 1 tabulate the accuracy of the classification obtained by the three methods. In all cases, the figure shown is the average obtained from a 10-fold cross-validation using the full data set. The accuracy obtained using the TFPC method is in most cases comparable with that obtained from the other methods investigated. Although the average accuracy of the method was slightly lower than that of the others, in 5 cases TFPC gave an accuracy as high as or higher than both other methods, and only in two cases (ionosphere and wine) was it markedly worse than both.
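For clarity, the sketch below spells out the 10-fold cross-validation protocol behind these figures; the interleaved fold split and the stub callbacks `build_classifier` and `accuracy` are our own choices, standing in for any of the methods compared.

```python
# Sketch of 10-fold cross-validation: train on nine folds, test on the
# tenth, and average the accuracies. The interleaved split and the stub
# callbacks are illustrative choices, not the authors' exact harness.

def cross_validate(records, build_classifier, accuracy, k=10):
    folds = [records[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(build_classifier(train), test))
    return sum(scores) / k
```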
The significance of these results is that this level of accuracy was obtained
from a very efficient rule-generation process. The first three columns in Table 1
show the execution times for the three methods. In each case, the time shown is
that obtained for the full experimental evaluation using 10-fold cross-validation.
As can be seen, the performance of TFPC (in comparable implementations) is
markedly superior to that of CMAR (in all cases) and CPAR (in many). The
improvement over CMAR results from the smaller number of rules that tend to
be produced using TFPC and because this approach eliminates the expensive
coverage analysis carried out in CMAR. The columns headed 'Number of rules' show the average number of rules included in the final classifiers generated by the three methods. For comparison, the column headed TFP shows the total number of classification rules generated by the TFP algorithm [3], which defines the total number of candidate classification rules produced by a straightforward ARM algorithm using the given support and confidence thresholds. A comparison of this column with that for TFPC shows the advantage gained during rule generation from the heuristics used in the latter. In most (although not all) cases, TFPC also produces fewer rules than CMAR.
CPAR, conversely, usually produces fewer rules than TFPC, and this is reflected in the faster execution times it achieves in some cases. The advantage of TFPC, however, is that by dispensing with coverage analysis the performance of the method scales better for larger data sets. In all the cases where the data included more than 10000 records (adult, connect4, nursery, pendigits), TFPC is much faster than CPAR, sometimes by an order of magnitude or more. CPAR also performed relatively poorly in the cases (pageblocks, led) in which there were both a moderately large number of records and a relatively large number of classes. In all these cases, TFPC achieves a classification accuracy close to or superior to CPAR at much lower cost.
[Figure: results for six data sets: a) adult.D131.N48842.C2, b) breast.D48.N699.C2, c) horseColic.D95.N300.C3, d) ionosphere.D172.N351.C2, e) led7.D24.N3200.C10, f) nursery.D32.N12960.C5]
5 Conclusions
Previous research has shown that ARM can be an effective route to accurate classification. A drawback of existing methods is that they involve detailed coverage analysis of candidate rules to obtain a final set of classification rules. This procedure is expensive: prohibitively so when very large training sets are involved. We have here introduced an algorithm, TFPC, which obtains a set of classification rules directly from an ARM procedure, without further coverage analysis. Our results show that the classification accuracy obtained is comparable with, or close to, that obtained from established methods. The TFPC method, however, is very much faster than alternatives that involve coverage analysis, especially when dealing with large data sets. We believe this method offers a realistic approach to deriving classifiers from extremely large data sets, for which existing methods would be inapplicable.
References

1. Agrawal, R., Imielinski, T. and Swami, A.: Mining Association Rules between Sets of Items in Large Databases. Proc. ACM SIGMOD Conference on Management of Data, pages 207-216, 1993.
2. Agrawal, R. and Srikant, R.: Fast Algorithms for Mining Association Rules. Proc. of the 20th VLDB Conference, Santiago, Chile, pages 487-499, 1994.
3. Coenen, F., Goulbourne, G. and Leng, P.: Computing Association Rules using Partial Totals. Proc. PKDD 2001, pages 54-66, 2001.
4. Cohen, W.W.: Fast Effective Rule Induction. Proc. of the 12th Int. Conf. on Machine Learning, pages 115-123, 1995.
5. Han, J., Pei, J. and Yin, Y.: Mining Frequent Patterns without Candidate Generation. Proc. of the ACM SIGMOD Conference on Management of Data, Dallas, pages 1-12, 2000.
6. Liu, B., Hsu, W. and Ma, Y.: Integrating Classification and Association Rule Mining. Proc. KDD 1998, pages 80-86, 1998.
7. Li, W., Han, J. and Pei, J.: CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. Proc. ICDM'01, pages 369-376, 2001.
8. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
9. Quinlan, J.R. and Cameron-Jones, R.M.: FOIL: A Midterm Report. Proc. ECML 1993, pages 3-20, 1993.
10. Yin, X. and Han, J.: CPAR: Classification Based on Predictive Association Rules. Proc. SIAM Int. Conf. on Data Mining (SDM'03), pages 331-335, 2003.