
Interestingness Measure: Correlations

min_sup = 30%, min_conf = 60%

• play basketball ⇒ eat cereal [40%, 66.7%] is misleading
  - The overall % of students eating cereal is 75% > 66.7%.
• play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence

              Basketball   Not basketball   Sum (row)
Cereal              2000             1750        3750
Not cereal          1000              250        1250
Sum (col.)          3000             2000        5000
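The bracketed support and confidence figures can be reproduced directly from the contingency table. The following is a minimal Python sketch; the variable and function names are chosen here for illustration and do not come from the slides.

```python
# Counts taken from the basketball/cereal contingency table.
n_total = 5000
n_basketball = 3000
n_cereal = 3750
n_basketball_and_cereal = 2000


def support(n_both: int, n_all: int) -> float:
    """Fraction of all transactions that contain both sides of the rule."""
    return n_both / n_all


def confidence(n_both: int, n_antecedent: int) -> float:
    """Fraction of antecedent transactions that also contain the consequent."""
    return n_both / n_antecedent


sup_bc = support(n_basketball_and_cereal, n_total)           # 0.4
conf_bc = confidence(n_basketball_and_cereal, n_basketball)  # about 0.667
p_cereal = n_cereal / n_total                                # 0.75

# The rule passes both thresholds, yet its confidence is lower than the
# overall share of cereal eaters, which is why it is misleading.
passes_thresholds = sup_bc >= 0.30 and conf_bc >= 0.60
below_base_rate = conf_bc < p_cereal
print(passes_thresholds, below_base_rate)
```

Both flags come out true: the rule satisfies min_sup and min_conf even though playing basketball makes eating cereal less likely than it is for students overall.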
Interestingness Measure: Correlations (Lift)

• Measure of dependent/correlated events: lift

$$\mathrm{lift}(A, B) = \frac{P(A \cup B)}{P(A)\,P(B)}$$

  - Here A ∪ B denotes the itemset {A} ∪ {B} = {A, B}, so P(A ∪ B) is the probability that a transaction contains both A and B.

• For the basketball (B) and cereal (C) data:

$$\mathrm{lift}(B, C) = \frac{2000/5000}{(3000/5000) \times (3750/5000)} = 0.89$$

$$\mathrm{lift}(B, \neg C) = \frac{1000/5000}{(3000/5000) \times (1250/5000)} = 1.33$$
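A lift of 1 means the two events are independent, a value below 1 indicates negative correlation, and a value above 1 indicates positive correlation. As a small sketch of the two calculations above (the function name and argument names are illustrative assumptions, not part of the slides):

```python
def lift(n_ab: int, n_a: int, n_b: int, n_total: int) -> float:
    """lift(A, B) = P(A and B) / (P(A) * P(B)), computed from raw counts."""
    p_ab = n_ab / n_total
    p_a = n_a / n_total
    p_b = n_b / n_total
    return p_ab / (p_a * p_b)


# Basketball with cereal, then basketball with "not cereal".
lift_b_c = lift(n_ab=2000, n_a=3000, n_b=3750, n_total=5000)
lift_b_not_c = lift(n_ab=1000, n_a=3000, n_b=1250, n_total=5000)

print(round(lift_b_c, 2), round(lift_b_not_c, 2))  # 0.89 1.33
```

The value 0.89 confirms that basketball and cereal are negatively correlated, despite the 66.7% confidence of the rule.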
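Interestingness Measure

$$\mathrm{lift} = \frac{P(A \cup B)}{P(A)\,P(B)}$$

$$\chi^2 = \sum \frac{(\mathrm{observed} - \mathrm{expected})^2}{\mathrm{expected}}$$

$$\mathrm{coherence} = \frac{P(A, B)}{P(A) + P(B) - P(A, B)}$$

$$\mathrm{all\_conf} = \frac{\mathrm{sup}(X)}{\mathrm{max\_item\_sup}(X)}$$

Observed counts, with expected counts in parentheses:

          game           game'          sum
video     4000 (4500)    3500 (3000)     7500
video'    2000 (1500)     500 (1000)     2500
sum       6000           4000           10000

P(video) = 0.75, P(game) = 0.6, P(video and game) = 0.4

$$\mathrm{lift} = \frac{0.4}{0.75 \times 0.6} = 0.89$$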
Interestingness Measure (χ²)

$$\chi^2 = \sum \frac{(\mathrm{observed} - \mathrm{expected})^2}{\mathrm{expected}}$$

For the game/video table (expected counts in parentheses):

$$\chi^2 = \frac{(4000 - 4500)^2}{4500} + \frac{(3500 - 3000)^2}{3000} + \frac{(2000 - 1500)^2}{1500} + \frac{(500 - 1000)^2}{1000} = 555.56$$
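The expected count of each cell is its row sum times its column sum divided by the grand total, for example 7500 × 6000 / 10000 = 4500 for video and game. The sketch below recomputes the statistic from the observed counts alone; the function name is an illustrative assumption and the code uses only the Python standard library.

```python
def chi_square(observed: list[list[float]]) -> float:
    """Pearson chi-square statistic for a contingency table of counts."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)

    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_sums[i] * col_sums[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat


# Rows: video, video'; columns: game, game'.
table = [[4000, 3500], [2000, 500]]
chi2 = chi_square(table)
print(round(chi2, 2))  # 555.56
```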
Interestingness Measure (χ² significance)

For the game/video table, χ² = 555.56.

χ² critical values by degrees of freedom (df) and probability (p):

df   0.95   0.90   0.80   0.70   0.50   0.30   0.20   0.10   0.05   0.01  0.001
 1  0.004   0.02   0.06   0.15   0.46   1.07   1.64   2.71   3.84   6.64  10.83
 2   0.10   0.21   0.45   0.71   1.39   2.41   3.22   4.60   5.99   9.21  13.82
 3   0.35   0.58   1.01   1.42   2.37   3.66   4.64   6.25   7.82  11.34  16.27
 4   0.71   1.06   1.65   2.20   3.36   4.88   5.99   7.78   9.49  13.28  18.47
 5   1.14   1.61   2.34   3.00   4.35   6.06   7.29   9.24  11.07  15.09  20.52
 6   1.63   2.20   3.07   3.83   5.35   7.23   8.56  10.64  12.59  16.81  22.46
 7   2.17   2.83   3.82   4.67   6.35   8.38   9.80  12.02  14.07  18.48  24.32
 8   2.73   3.49   4.59   5.53   7.34   9.52  11.03  13.36  15.51  20.09  26.12
 9   3.32   4.17   5.38   6.39   8.34  10.66  12.24  14.68  16.92  21.67  27.88
10   3.94   4.86   6.18   7.27   9.34  11.78  13.44  15.99  18.31  23.21  29.59

Nonsignificant: p from 0.95 to 0.10. Significant: p from 0.05 to 0.001.
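A 2×2 contingency table has (2 − 1) × (2 − 1) = 1 degree of freedom, so the df = 1 row is the relevant one for the game/video data. As a rough sketch of how the lookup works, the snippet below copies the three "significant" critical values for df = 1 from the table and reports the smallest tabulated p that a statistic clears. The dictionary and function names are hypothetical, introduced only for this illustration.

```python
# Critical values for df = 1, copied from the significant columns of the table.
CRITICAL_DF1 = {0.05: 3.84, 0.01: 6.64, 0.001: 10.83}


def smallest_significant_level(chi2: float, critical: dict[float, float]):
    """Smallest tabulated p whose critical value chi2 exceeds, else None."""
    cleared = [p for p, value in critical.items() if chi2 > value]
    return min(cleared) if cleared else None


print(smallest_significant_level(555.56, CRITICAL_DF1))  # 0.001
print(smallest_significant_level(3.0, CRITICAL_DF1))     # None
```

Since 555.56 is far above 10.83, video and game are significantly correlated; because the observed count of 4000 is below the expected 4500, the correlation is negative, in agreement with the lift of 0.89.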
Interestingness Measure (coherence)

$$\mathrm{coherence} = \frac{P(A, B)}{P(A) + P(B) - P(A, B)}$$

For the game/video table, with P(video) = 0.75, P(game) = 0.6, and P(video and game) = 0.4:

$$\mathrm{coherence} = \frac{0.4}{0.75 + 0.6 - 0.4} = 0.42$$
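The denominator P(A) + P(B) − P(A, B) is the probability that a transaction contains A or B, so for two items coherence is the Jaccard coefficient of the two sets of transactions. A consequence is that the total number of transactions cancels out, as the sketch below illustrates (the function and argument names are assumptions made for this example):

```python
def coherence(n_ab: int, n_a: int, n_b: int) -> float:
    """coherence(A, B) = P(A, B) / (P(A) + P(B) - P(A, B)).

    Dividing every count by the total number of transactions would not
    change the ratio, so the three raw counts are sufficient.
    """
    return n_ab / (n_a + n_b - n_ab)


# video: 7500 transactions, game: 6000, both: 4000.
coh = coherence(n_ab=4000, n_a=7500, n_b=6000)
print(round(coh, 2))  # 0.42
```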
Interestingness Measure (all_conf)

$$\mathrm{all\_conf} = \frac{\mathrm{sup}(X)}{\mathrm{max\_item\_sup}(X)}$$

For the game/video table, with P(video) = 0.75, P(game) = 0.6, and P(video and game) = 0.4:

$$\mathrm{all\_conf} = \frac{0.4}{0.75} = 0.53$$
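Dividing by the largest single-item support makes all_conf the smallest confidence among the rules that can be formed from the itemset; for two items it equals min(P(B|A), P(A|B)). The following sketch checks this on the game/video counts, using illustrative names that are not from the slides:

```python
def all_conf(n_itemset: int, item_counts: list[int]) -> float:
    """all_conf(X) = sup(X) / max_item_sup(X), computed from raw counts."""
    return n_itemset / max(item_counts)


n_video, n_game, n_both = 7500, 6000, 4000
ac = all_conf(n_both, [n_video, n_game])

# Confidence of the two possible rules between the items.
conf_video_to_game = n_both / n_video  # about 0.533
conf_game_to_video = n_both / n_game   # about 0.667

print(round(ac, 2))  # 0.53
print(ac == min(conf_video_to_game, conf_game_to_video))  # True
```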
Practice

Compute lift, χ², coherence, and all_conf for the following data:

          coffee   coffee'     sum
tea          100       900    1000
tea'         700      8300    9000
sum          800      9200   10000
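One way to check hand calculations for this exercise is a small helper that derives all four measures from the four cell counts of a 2×2 table. This is only a sketch under the definitions given earlier; the function name `measures` and its argument names are hypothetical.

```python
def measures(n_ab: int, n_a_only: int, n_b_only: int, n_neither: int) -> dict:
    """Return lift, chi-square, coherence and all_conf for a 2x2 table.

    n_ab: both A and B; n_a_only: A without B;
    n_b_only: B without A; n_neither: neither A nor B.
    """
    n = n_ab + n_a_only + n_b_only + n_neither
    n_a = n_ab + n_a_only
    n_b = n_ab + n_b_only

    # Rows: A, not A; columns: B, not B.
    observed = [[n_ab, n_a_only], [n_b_only, n_neither]]
    row_sums = [n_a, n - n_a]
    col_sums = [n_b, n - n_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    return {
        "lift": n_ab * n / (n_a * n_b),
        "chi2": chi2,
        "coherence": n_ab / (n_a + n_b - n_ab),
        "all_conf": n_ab / max(n_a, n_b),
    }


# A = tea, B = coffee.
result = measures(n_ab=100, n_a_only=900, n_b_only=700, n_neither=8300)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Calling `measures(4000, 3500, 2000, 500)` reproduces the game/video values (0.89, 555.56, 0.42, 0.53), which is a quick sanity check on the helper itself.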
Are lift and χ² Good Measures of Correlation?

Notation for the 2×2 contingency table of milk (m) and coffee (c):

              Milk      No Milk    Sum (row)
Coffee        m, c      ~m, c      c
No Coffee     m, ~c     ~m, ~c     ~c
Sum (col.)    m         ~m         Σ

     DB    m, c   ~m, c   m, ~c    ~m, ~c   lift   all-conf    coh     χ²
P    A1    1000     100     100    10,000   9.26       0.91   0.83   9055
N    A2     100    1000    1000   100,000   8.44       0.09   0.05    670
N    A3    1000     100   10000   100,000   9.18       0.09   0.09   8172
I    A4    1000    1000    1000      1000   1          0.5    0.33      0
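The count of transactions containing neither milk nor coffee (~m, ~c) enters lift through the total, but it cancels out of all-conf and coherence. That is why A2 reaches a lift of 8.44 even though milk and coffee rarely occur together there. The sketch below recomputes three columns of the table from the four cell counts; the function and variable names are illustrative assumptions, and χ² is left out to keep the example short.

```python
# Cell counts per dataset: (m,c), (~m,c), (m,~c), (~m,~c), as in the table.
datasets = {
    "A1": (1000, 100, 100, 10_000),
    "A2": (100, 1000, 1000, 100_000),
    "A3": (1000, 100, 10_000, 100_000),
    "A4": (1000, 1000, 1000, 1000),
}


def lift(mc, notm_c, m_notc, neither):
    n = mc + notm_c + m_notc + neither  # the null transactions count here
    return mc * n / ((mc + m_notc) * (mc + notm_c))


def all_conf(mc, notm_c, m_notc):
    return mc / max(mc + m_notc, mc + notm_c)


def coherence(mc, notm_c, m_notc):
    return mc / (mc + notm_c + m_notc)


rows = {}
for name, (mc, notm_c, m_notc, neither) in datasets.items():
    rows[name] = (
        round(lift(mc, notm_c, m_notc, neither), 2),
        round(all_conf(mc, notm_c, m_notc), 2),
        round(coherence(mc, notm_c, m_notc), 2),
    )
    print(name, *rows[name])
```

Note that `all_conf` and `coherence` do not even take the (~m, ~c) count as an argument, so adding more such transactions cannot change them.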
Which Measures Should Be Used?

• Support and confidence are not good indicators of correlation
• Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @KDD'02)
• Which are good ones?
Which Measures Should Be Used?

• lift and χ² are not good measures for correlations in large transactional DBs
• all-conf or coherence could be good measures (Omiecinski @TKDE'03)
• Both all-conf and coherence have the downward closure property
• Efficient algorithms can be derived for mining (Lee et al. @ICDM'03sub)
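Downward closure here means that adding items to an itemset can never increase the measure, so an itemset that falls below a threshold can be pruned together with all its supersets. The sketch below checks this by brute force on a handful of made-up transactions. The data are hypothetical, and the itemset form of coherence used here (transactions containing all items divided by transactions containing any item) is an assumption chosen to agree with the two-item formula.

```python
from itertools import combinations

# Hypothetical toy transactions, made up purely for illustration.
transactions = [
    {"milk", "coffee", "sugar"},
    {"milk", "coffee"},
    {"milk", "sugar"},
    {"coffee"},
    {"milk", "coffee", "sugar"},
    {"sugar"},
]
items = sorted(set().union(*transactions))


def sup(itemset) -> int:
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def all_conf(itemset) -> float:
    return sup(itemset) / max(sup([item]) for item in itemset)


def coherence(itemset) -> float:
    containing_any = sum(1 for t in transactions if t & set(itemset))
    return sup(itemset) / containing_any


# Extending an itemset by one item must not increase either measure.
violations = 0
for size in range(1, len(items)):
    for subset in combinations(items, size):
        for extra in items:
            if extra in subset:
                continue
            superset = subset + (extra,)
            if all_conf(superset) > all_conf(subset):
                violations += 1
            if coherence(superset) > coherence(subset):
                violations += 1

print(violations)  # 0
```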
