Final Paper
step approach: the first step is association rule extraction (e.g. with the Apriori algorithm), and the second step is the evaluation of the rules' interestingness or quality, either by a domain expert or using statistical quality measures. Interestingness measures play an important role in data mining.

that a single measure alone cannot determine the interestingness of a rule.

This paper is divided into three sections. The first section gives the formal definition and some explanation of each measure. The second section gives the calculation of each measure on our sample data, and the last section contains our recommendation on which measure to use for discovering interesting rules.
Since P(Y) appears in the denominator of the interest measure, the interest can be seen as the confidence divided by the baseline frequency of Y. The interest measure is defined over [0, ∞[ and its interpretation is as follows:

If I < 1, then X and Y appear less frequently together in the data than expected under the assumption of independence. X and Y are said to be negatively interdependent.

If I = 1, then X and Y appear as frequently together as expected under the assumption of independence. X and Y are said to be independent of each other.

The second problem is that the interest measure should not be used to compare the interestingness of itemsets of different sizes. Indeed, the interest tends to be higher for large itemsets than for small itemsets.

Chi-square Test for Independence: A natural way to express the dependence between the antecedent and the consequent of an association rule X U Y is the correlation measure based on the Chi-square test for independence [3].
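As a small illustration of the interest measure discussed above, the following Python sketch computes it as the confidence divided by the baseline frequency of Y and maps the value to the interpretation given in the text. The function names and the counts in the usage lines are hypothetical, and the I > 1 branch is the standard complementary case rather than a quotation from the text.

```python
def interest(n_xy, n_x, n_y, n):
    """Interest of X and Y from raw transaction counts.

    n_xy: transactions containing both X and Y
    n_x, n_y: transactions containing X, respectively Y
    n: total number of transactions
    """
    confidence = n_xy / n_x   # P(Y | X)
    baseline = n_y / n        # P(Y), the denominator of the interest measure
    return confidence / baseline


def interpret(i, tol=1e-9):
    """Map an interest value to the interpretation used in the text."""
    if abs(i - 1.0) <= tol:
        return "independent"
    if i < 1.0:
        return "negatively interdependent"
    return "positively interdependent"  # assumed complementary case (I > 1)


# Hypothetical counts: 100 transactions, 40 contain X, 50 contain Y.
for n_xy in (10, 20, 30):
    i = interest(n_xy, n_x=40, n_y=50, n=100)
    print(n_xy, i, interpret(i))
```

Working from counts rather than probabilities keeps the sketch close to how the values would be read off a transaction table.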
distribution. This approximation breaks down when the expected values (Exy) are small.

The Chi-square test should only be used when all cells in the contingency table have expected values greater than 1 and at least 80% of the cells have expected values greater than 5.

The Chi-square test will produce larger values as the data set grows. Therefore, more items will tend to become significantly interdependent if the size of the dataset increases. The reason is that the Chi-square value depends on the total number of transactions, whereas the critical cutoff value depends only on the degrees of freedom (which is equal to 1 for binary variables) and the desired significance level. Therefore, whilst comparison of Chi-squared values within the same data set may be meaningful, it is certainly not advisable to compare Chi-squared values across different data sets.

Correlation Coefficient: The correlation coefficient [7] (also known as the Φ-coefficient) measures the degree of linear interdependency between a pair of random variables. It is defined by the covariance between the two variables divided by their standard deviations:

$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

where ρXY = 0 when X and Y are independent, and ρXY ranges over [-1, +1].

Statistical Correlation: To get association rules with real correlation, this measure [8] puts forward statistical correlation from the viewpoint of statistics to compensate for the deficiency of the support-confidence framework. Statistical correlation is defined as equation , which is

If Scorrelation{X U Y} < 0, the items in the antecedent X and the consequent Y of an association rule are negatively correlated, and the items restrict each other.

If Scorrelation{X U Y} = 0, the items in the antecedent X and the consequent Y of an association rule are independent, and the items do not influence each other.

If Scorrelation{X U Y} > 0, the items in the antecedent X and the consequent Y of an association rule are correlated to some degree, and the correlation becomes stronger as Scorrelation increases.

Advantages
Scorrelation can enhance the correlation degree of items in association rules and cut negatively correlated rules.

Example
The sample data (Table 1) used for the analysis is taken from a store database of customer transactions. There are six different types of items and a total of ten transactions. In each transaction, a 1 represents the presence of an item while a 0 represents the absence of an item from the market basket.

Table 1: Sample Transactions
Tid     A  B  C  D  E  F
1       1  1  0  1  0  1
2       1  0  1  1  0  1
3       1  0  1  1  0  1
4       0  1  1  1  0  0
5       0  1  0  1  1  0
6       1  0  0  0  1  1
7       1  0  1  0  1  1
8       0  0  1  0  0  0
9       0  1  1  1  0  0
10      1  1  0  1  1  0
TOTAL   6  5  6  7  4  5

The frequent itemsets generated from the sample data using the A-priori algorithm [6] are shown in the following Table 2:
itemsets   support
{A,D}      40%
{A,F}      50%
{B,D}      50%
{C,D}      40%

All measures are calculated for each rule in Table 2, which is the output of the A-priori algorithm. The results are shown in Table 3.

Table 3: Calculation of different measures on sample datasets

[Figure: Graph between Different Interestingness Measures. Axes: Rules, Values. Series: support, confidence, Lift, Chi-square test, Correlation.]
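The calculation described above can be sketched in Python on the sample transactions of Table 1. This is a minimal illustration under stated assumptions, not a reproduction of the paper's Table 3: the pairs are found by brute-force counting rather than by the A-priori algorithm, the minimum count of 4 (40% of ten transactions) is inferred from Table 2, each itemset {X,Y} is read as the rule X -> Y, and the Chi-square value is the uncorrected Pearson statistic of the 2x2 table, computed as n times the squared Φ-coefficient. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt

ITEMS = "ABCDEF"
# Rows of Table 1: one tuple per transaction (Tid 1..10), columns A..F.
ROWS = [
    (1, 1, 0, 1, 0, 1),
    (1, 0, 1, 1, 0, 1),
    (1, 0, 1, 1, 0, 1),
    (0, 1, 1, 1, 0, 0),
    (0, 1, 0, 1, 1, 0),
    (1, 0, 0, 0, 1, 1),
    (1, 0, 1, 0, 1, 1),
    (0, 0, 1, 0, 0, 0),
    (0, 1, 1, 1, 0, 0),
    (1, 1, 0, 1, 1, 0),
]


def count(items, rows=ROWS):
    """Number of transactions that contain every item in `items`."""
    cols = [ITEMS.index(it) for it in items]
    return sum(all(r[c] for c in cols) for r in rows)


def frequent_pairs(min_count=4, rows=ROWS):
    """Brute-force 2-itemsets with at least `min_count` transactions (assumed threshold)."""
    return {
        pair: count(pair, rows) / len(rows)
        for pair in combinations(ITEMS, 2)
        if count(pair, rows) >= min_count
    }


def measures(x, y, rows=ROWS):
    """Support, confidence, lift, Chi-square and phi for the rule x -> y (assumed direction)."""
    n = len(rows)
    n_x, n_y, n_xy = count(x, rows), count(y, rows), count(x + y, rows)
    confidence = n_xy / n_x
    # Phi for two binary variables: covariance divided by the standard deviations.
    phi = (n * n_xy - n_x * n_y) / sqrt(n_x * (n - n_x) * n_y * (n - n_y))
    return {
        "support": n_xy / n,
        "confidence": confidence,
        "lift": confidence / (n_y / n),
        "chi2": n * phi ** 2,  # Pearson statistic of the 2x2 table, no continuity correction
        "phi": phi,
    }


for x, y in frequent_pairs():
    m = measures(x, y)
    print(f"{x}->{y}", {k: round(v, 3) for k, v in m.items()})
```

Under these assumptions the sketch recovers the four itemsets of Table 2, with a lift of about 1.67 for A -> F, about 1.43 for B -> D, and about 0.95 for A -> D and C -> D. With only ten transactions the expected cell counts are all below 5, so the Chi-square values fall outside the validity conditions stated earlier and are shown only to illustrate the arithmetic.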
rules of different size since it tends to overestimate the interestingness for large itemsets.
• When association rules need to be compared between data sets of different sizes, the Chi-square test for independence and Correlation analysis are not preferred since they are highly dependent on the dataset size. Both measures tend to overestimate the interestingness of itemsets in large datasets.

References:

[1] C.C. Aggarwal and P.S. Yu. A New Framework for Item Set Generation. In: Proceedings of the ACM PODS Symposium on Principles of Database Systems, Seattle, Washington (USA), 18-24, 1998.

[2] A. Agresti. An Introduction to Categorical Data Analysis. Wiley Series in Probability and Statistics, 1996.

[3] T. Brijs, G. Swinnen, K. Vanhoof and G. Wets. The use of association rules for product assortment decisions: a case study. In: Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, San Diego (USA), August 15-18, 254-260, 1999.

Proceedings of the 1993 ACM SIGMOD Conference, pp. 207–216, 1993.

[5] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 1–12, 2000.

[6] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In: Proc. 20th Int. Conf. on Very Large Data Bases, pp. 487–499, 1994.

[7] Jianhua Liu. A New Interestingness Measure of Association Rules. Second International Conference on Genetic and Evolutionary Computing.

[8] Jian Hu and Xiang Yang-Li. Association Rules Mining Based on Statistical Correlation.

[9] A. Silberschatz and A. Tuzhilin. What Makes Patterns Interesting in Knowledge Discovery Systems. IEEE Transactions on Knowledge and Data Engineering, 8(6), pp. 970–974, 1996.