Final Paper
quality measures. The interestingness measures play an important role in data mining.
contains our recommendation on which measure to use for discovering the interesting rules.
assumption of conditional independence. X and Y are said to be negatively interdependent.

If I = 1, then X and Y appear as frequently together as expected under the assumption of conditional independence. X and Y are said to be independent of each other.

The second problem is that the interest measure should not be used to compare the interestingness of itemsets of different sizes. Indeed, the interest tends to be higher for large itemsets than for small itemsets.

Chi-square Test for Independence: A natural way to express the dependence between the antecedent and the consequent of an association rule X∪Y is the correlation measure based on the Chi-square test for independence [3].

The Chi-square test should only be used when all cells in the contingency table have expected values greater than 1 and at least 80% of the cells have expected values greater than 5.

The Chi-square test will produce larger values as the data set grows. Therefore, more items will tend to become significantly interdependent if the size of the dataset
increases. The reason is that the Chi-square value depends on the total number of transactions, whereas the critical cutoff value only depends on the degrees of freedom (which is equal to 1 for binary variables) and the desired significance level. Therefore, whilst comparison of Chi-squared values within the same data set may be meaningful, it is certainly not advisable to compare Chi-squared values across different data sets.

Correlation Coefficient: The correlation coefficient [7] (also known as the Φ-coefficient) measures the degree of linear interdependency between a pair of random variables. It is defined by the covariance between the two variables divided by their standard deviations:

$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

where ρXY = 0 when X and Y are independent, and ρXY ranges over [-1, +1].

Statistical Correlation: To obtain association rules with real correlation, this measure [8] introduces statistical correlation, from the viewpoint of statistics, to compensate for the deficiency of the support-confidence framework. Statistical correlation, written Scorrelation, is defined in [8]; its sign is interpreted as follows.

If Scorrelation{X∪Y} < 0, the items in the antecedent X and the consequent Y of an association rule are negatively correlated, and the items restrict each other.

If Scorrelation{X∪Y} = 0, the items in the antecedent X and the consequent Y of an association rule are independent, and the items do not influence each other.

If Scorrelation{X∪Y} > 0, the items in the antecedent X and the consequent Y of an association rule have some degree of correlation, and the correlation becomes stronger as Scorrelation increases.

Advantages
Scorrelation can enhance the correlation degree of items in association rules and cut negatively correlated rules.

Example
The sample data (Table 1) for the analysis is taken from a store database of customer transactions. There are six different types of items and a total of ten transactions. In each transaction, a 1 represents the presence of an item while a 0 represents the absence of an item from the market basket.

Table 1: Sample Transactions
Tid     Items
        A  B  C  D  E  F
1       1  1  0  1  0  1
2       1  0  1  1  0  1
3       1  0  1  1  0  1
4       0  1  1  1  0  0
5       0  1  0  1  1  0
6       1  0  0  0  1  1
7       1  0  1  0  1  1
8       0  0  1  0  0  0
9       0  1  1  1  0  0
10      1  1  0  1  1  0
TOTAL   6  5  6  7  4  5

The frequent itemsets generated from the sample data using the A-priori algorithm [6] are shown in Table 2.

Table 2: Frequent Itemsets
Itemset   Support
{A,D}     40%
{A,F}     50%
{B,D}     50%
{C,D}     40%

All measures are calculated for each rule in Table 2, which is the output of the A-priori algorithm. The results are shown in Table 3.

Table 3: Calculation of different measures on the sample dataset
Rule   Support  Conf.  Lift  Chi-square Test  Correlation  Scorrelation
A→D    0.40     0.66   0.95  5.865            -0.089       -0.0408
D→A    0.40     0.57   0.95  5.865            -0.089       -0.0408
A→F    0.50     0.83   1.66  0.915            +0.8179      +0.522
F→A    0.50     1.00   1.66  0.915            +0.8179      +0.522
B→D    0.50     1.00   1.42  1.713            +0.655       +0.315
D→B    0.50     0.71   1.42  1.713            +0.655       +0.315
C→D    0.40     0.66   0.95  8.613            -0.089       -0.0408
D→C    0.40     0.57   0.95  8.613            -0.089       -0.0408

[Figure: Graph between Different Interestingness Measures. X-axis: Rules; Y-axis: Values; series: Support, Confidence, Lift, Chi-square Test, Correlation, Statistical Correlation.]

experts, which leads us to explore the subjective measures of the association rules. The following suggestions can be formulated based on the analysis, with example, of the different interestingness measures discussed previously:

• Confidence is never the preferred method to compare association rules since it does not account for the baseline frequency of the consequent.
• The lift/interest value corrects for this baseline frequency, but when the support threshold is very low, it may be unstable due to sampling variability. However, when the data set is very large, even a low percentage support threshold will yield rather large absolute support values. In that case, we do not need to worry too much about sampling variability. A drawback of the interest measure is that it cannot be used to compare itemsets or rules of different sizes since it tends to overestimate the interestingness of large itemsets.
• When association rules need to be compared between data sets of different sizes, the Chi-square test for independence and correlation analysis are not preferred since they are highly dependent on the dataset size. Both measures tend to overestimate the interestingness of itemsets in large datasets.
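
As a rough cross-check of the example, the measures that have unambiguous textbook definitions (support, confidence, lift, and the Φ-coefficient) can be recomputed from the transactions in Table 1. The Python sketch below is illustrative only: the names ROWS, support, and measures are not from the paper. It assumes lift = P(X∪Y) / (P(X)·P(Y)) and the Pearson Chi-square statistic for a 2×2 table without continuity correction, which equals n·Φ². The paper does not show how its own figures were computed, so the Chi-square and correlation values produced here need not agree with those printed in Table 3. Scorrelation is left out because its defining equation is not reproduced in the text.

```python
from math import sqrt

ITEMS = "ABCDEF"

# Transactions from Table 1: one string per Tid (1..10), one character per item A..F.
ROWS = [
    "110101", "101101", "101101", "011100", "010110",
    "100011", "101011", "001000", "011100", "110110",
]
TRANSACTIONS = [
    {item for item, bit in zip(ITEMS, row) if bit == "1"} for row in ROWS
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(wanted <= t for t in TRANSACTIONS) / len(TRANSACTIONS)


def measures(x, y):
    """Objective measures for the rule x -> y (x and y are single items here)."""
    p_x, p_y = support(x), support(y)
    p_xy = support(set(x) | set(y))
    # Phi-coefficient: covariance of the two 0/1 indicators over their standard deviations.
    phi = (p_xy - p_x * p_y) / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    return {
        "support": p_xy,
        "confidence": p_xy / p_x,
        "lift": p_xy / (p_x * p_y),
        "phi": phi,
        # Assumption: Pearson chi-square for a 2x2 table, no continuity correction.
        "chi_square": len(TRANSACTIONS) * phi ** 2,
    }


for x, y in [("A", "D"), ("A", "F"), ("B", "D"), ("C", "D")]:
    forward, backward = measures(x, y), measures(y, x)
    print(f"{x}->{y}", {k: round(v, 3) for k, v in forward.items()})
    print(f"{y}->{x}", {k: round(v, 3) for k, v in backward.items()})
```

Only confidence depends on the direction of the rule; support, lift, Φ, and Chi-square are symmetric in X and Y, which is why each pair of rows in Table 3 differs only in the confidence column.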
References:

[1] C.C. Aggarwal and P.S. Yu, "A New Framework for Item Set Generation," in: Proceedings of the ACM PODS Symposium on Principles of Database Systems, Seattle, Washington (USA), pp. 18-24, 1998.

[4] R. Agrawal, T. Imielinski, and A. N. Swami, "Mining Association Rules between Sets of Items in Large Databases," in: Proceedings of the 1993 ACM SIGMOD Conference, pp. 207–216, 1993.

[5] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," in: Proceedings of the 2000 ACM-SIGMOD International Conference on Management of Data, pp. 1–12, 2000.

[6] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int. Conf. on Very Large Data Bases, pp. 487–499, 1994.

[7] Jianhua Liu, "A New Interestingness Measure of Association Rules," Second International Conference on Genetic and Evolutionary Computing.

[8] Jian Hu and Xiang Yang-Li, "Association Rules Mining Based on Statistical Correlation."

[9] A. Silberschatz and A. Tuzhilin, "What Makes Patterns Interesting in Knowledge Discovery Systems," IEEE Transactions on Knowledge and Data Engineering, 8(6), pp. 970–974, 1996.

[10] T. Brijs, K. Vanhoof, and G. Wets, "Defining Interestingness for Association Rules," International Journal "Information Theories & Applications", Vol. 10.