
Mining Ratio Rules Via Principal Sparse Non-Negative Matrix Factorization

Chenyong Hu, Benyu Zhang, Shuicheng Yan, Qiang Yang, Jun Yan, Zheng Chen, Wei-Ying Ma

Abstract

Association rules are traditionally designed to capture statistical relationships among itemsets in a given database. To additionally capture the quantitative association knowledge, F. Korn et al. recently proposed a paradigm named Ratio Rules [4] for quantifiable data mining. However, their approach is mainly based on Principal Component Analysis (PCA) and, as a result, it cannot guarantee that the ratio coefficients are non-negative. This may lead to serious problems in the rules' application. In this paper, we propose a new method, called Principal Sparse Non-Negative Matrix Factorization (PSNMF), for learning the associations between itemsets in the form of Ratio Rules. In addition, we provide a support measurement to weigh the importance of each rule for the entire dataset.

1. Introduction

Association rules are one of the major representations of the knowledge discovered from large databases. The problem of association rule mining (ARM) in large transactional databases was introduced in [1, 3]. Its basic idea is to discover important and interesting associations among the data items, in forms such as:

  {bread, milk} => butter (80%)

To find association rules, most prevalent approaches assume that the transactions carry only Boolean information and ignore the valuable knowledge inherent in the quantities of the items. In fact, since the quantities of the items normally contain valuable information, it is necessary to define quantitative association rules for datasets with quantitative attributes. Several efficient algorithms for mining quantitative association rules have been proposed in the past [2, 7]. A notable example is the work of [4], which provides a stronger set of rules called Ratio Rules. A rule under this framework is expressed in the following form:

  bread : milk : butter = a : b : c

(a, b, c are arbitrary numerical values)

This rule states that for each a amount spent on bread, a customer normally spends b amount on milk and c amount on butter.

Principal Component Analysis (PCA) is often used to discover the eigenvectors of a dataset. Ratio Rules [4] represent the quantitative associations between items as the principal eigenvectors, where the values a, b and c in the example above correspond to the projections onto the eigenvector. Because the elements of an eigenvector can be either positive or negative, the ratio coefficients of a Ratio Rule may contain negative values, such as:

  Shoe : Coat : Hat = 1 : -2 : -5

Obviously, such a rule loses the intuitive appeal of associations between items, because a customer's spending should always be positive.

Our method amounts to a novel application of non-negative matrix factorization (NMF) [5]. However, we cannot directly apply NMF for our purpose, because it is still difficult to establish that its latent components represent the latent associations between items in a quantifiable dataset. We need to provide a bridge that brings NMF closer to association rules.

In this work, we propose a novel method called Principal Sparse Non-Negative Matrix Factorization (PSNMF), which adds a sparsity constraint as well as the non-negativity constraint to the standard NMF [5].

The rest of the paper is organized as follows. Section 2 describes the problem and the intuition behind Ratio Rules. Section 3 introduces our new algorithm (PSNMF). Section 4 presents the experimental results. Section 5 concludes the paper. The convergence of the PSNMF learning procedure is proved in the Appendix.

2. Problem Definition

The problem that we tackle is as follows. Given an N × M matrix V (e.g., a market-basket database), where the entry v_ij gives the amount spent by customer i on product j.
[Figure 1: three scatter plots of $ spent on butter vs. $ spent on bread, showing (a) a two-dimensional data matrix, (b) the Ratio Rules identified by PCA, with directions (0.6429, 0.76595) and (-0.76595, 0.6429), and (c) the Ratio Rules identified by PSNMF, with directions (0.20718, 0.79282) and (0.63907, 0.36093).]
Fig 1. A data matrix and latent associations discovered by PCA and PSNMF

The goal is to find all Ratio Rules of the form:

  v_1 : v_2 : v_3 : \ldots : v_M \quad (v_i \ge 0)

This form means that customers who buy the items will spend v_1, v_2, ... respectively on each itemset.

Fig. 1(a) shows the distribution of a matrix V organized with N customers and M (M = 2) products. Here we assume that the dataset consists of two clusters. Our goal is to capture the associations between items. Fig. 1(b) shows two Ratio Rules discovered by PCA [4], one of which contains negative values:

  bread : butter = -0.77 : 0.64

Obviously, the negative association between the items ("bread" and "butter") does not make sense. Furthermore, the Ratio Rules clearly deviate from the latent associations behind the distribution of these points.

In fact, from Fig. 1(a) we find that the latent associations are not mutually orthogonal, while PCA imposes an orthogonality constraint on the ratio rules. Therefore, Ratio Rules based on PCA cannot truly reflect the latent associations among the items. In contrast to Fig. 1(b), Fig. 1(c) illustrates the Ratio Rules captured by our proposed PSNMF: each rule can be treated as the association underlying one of the two clusters.

3. Principal Sparse Non-Negative Matrix Factorization (PSNMF)

Given an M × N non-negative matrix V, denote a set of P (P ≤ M) basis components by an M × P matrix W, where each transaction (column vector) can be represented as a linear combination of the basis components using the approximate factorization:

  V \approx WH    (1)

where H is a P × N coefficient matrix.

3.1 Non-negative Matrix Factorization (NMF)

Because the entries of W and H calculated by PCA may contain negative values, NMF [5] was proposed as a matrix factorization procedure that imposes non-negativity instead of orthogonality constraints. NMF uses the I-divergence of V from Y, defined as

  D(V \| Y) = \sum_{i,j} \Big( v_{ij} \log \frac{v_{ij}}{y_{ij}} - v_{ij} + y_{ij} \Big)    (2)

As the measure of fitness for factorizing V into WH = Y = [y_{ij}], an NMF factorization is defined as

  \min_{W,H} D(V \| WH) \quad \text{s.t.} \quad W, H \ge 0, \ \sum_i w_{ij} = 1 \ \forall j    (3)

The above optimization can be carried out using multiplicative update rules [5].
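As a concrete illustration, the multiplicative updates for objective (3) can be sketched in a few lines of NumPy. This is a minimal re-implementation in the paper's notation (V is M × N, W is M × P, H is P × N), not the authors' code; the function name nmf_idiv, the iteration count, and the small eps guarding divisions are our choices.

    import numpy as np

    def nmf_idiv(V, P, n_iter=200, eps=1e-9, seed=0):
        # Minimal sketch of NMF under the I-divergence objective (2)-(3),
        # following the multiplicative update rules of Lee and Seung [5].
        rng = np.random.default_rng(seed)
        M, N = V.shape
        W = rng.random((M, P)) + eps
        H = rng.random((P, N)) + eps
        for _ in range(n_iter):
            R = V / (W @ H + eps)                              # v_ij / (WH)_ij
            H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)    # H update
            R = V / (W @ H + eps)
            W *= (R @ H.T) / (H.sum(axis=1)[None, :] + eps)    # W update
            scale = W.sum(axis=0) + eps
            W /= scale                                         # enforce sum_i w_ij = 1
            H *= scale[:, None]                                # compensate so WH is unchanged
        return W, H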
3.2 Sparse Non-negative Matrix Factorization (SNMF)

Although NMF is successful in matrix factorization, the NMF model does not impose any sparsity constraint. Therefore, it can hardly yield a factorization that reveals local sparse features in the data V. A related sparse-coding scheme was proposed in [6] for matrix factorization.

Inspired by the original NMF and sparse coding, we propose Sparse Non-negative Matrix Factorization (SNMF), which imposes both the sparsity and the non-negativity constraints. We therefore put forward the following constrained divergence as the objective function:

  D(V \| Y) = \sum_{i,j} \Big( v_{ij} \log \frac{v_{ij}}{y_{ij}} - v_{ij} + y_{ij} \Big) + \lambda \sum_j l_j    (4)

where l_j = \|(h_{1j}, h_{2j}, h_{3j}, \ldots, h_{Pj})^T\|_1 denotes the 1-norm of the j-th column of H, WH = Y = [y_{ij}], and \lambda is a positive constant chosen empirically. As the measure of fitness for factorizing V into WH = Y = [y_{ij}], an SNMF factorization is defined as:

  \min_{W,H} D(V \| WH) \quad \text{s.t.} \quad \forall i,j: w_{ij} \ge 0, \ h_{ij} \ge 0, \ \text{and} \ \forall i: \|w_i\|_1 = 1    (5)
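Before turning to the update rules, it may help to see objective (4) spelled out operationally. The sketch below is our illustration, not the authors' code; the eps term guarding the logarithm at zero entries is an assumption, since the paper does not discuss zero counts. For non-negative H, the penalty \sum_j l_j equals the sum of all entries of H.

    import numpy as np

    def snmf_objective(V, W, H, lam, eps=1e-12):
        # I-divergence of V from Y = WH, eq. (4), plus the L1 penalty on H.
        Y = W @ H
        idiv = np.sum(V * np.log((V + eps) / (Y + eps)) - V + Y)
        return idiv + lam * H.sum()   # sum_j l_j = sum_ij h_ij for H >= 0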


Notice that we have chosen to measure sparseness by a linear activation penalty (i.e., by minimizing the 1-norm of the columns of H). A sparse solution to the above constrained minimization can be found by the following update rules:

  h_{kl} \leftarrow h_{kl} \, \frac{\sum_i w_{ik} \, v_{il} / \big(\sum_{k'} w_{ik'} h_{k'l}\big)}{\sum_i w_{ik} + \lambda}    (6)

  w_{kl} \leftarrow w_{kl} \, \frac{\sum_j v_{kj} \, h_{lj} / \big(\sum_{l'} w_{kl'} h_{l'j}\big)}{\sum_j h_{lj}}    (7)

To make the solution unique, we further require that the 1-norm of each column vector of matrix W be one. Matrix H then needs to be adjusted accordingly:

  w_{kl} \leftarrow w_{kl} \Big/ \sum_{k'} w_{k'l}    (8)

  h_{lj} \leftarrow h_{lj} \sum_k w_{kl}    (9)

It can be proved that the objective function is non-increasing under the above iterative update rules, so the convergence of the iteration is guaranteed (see the Appendix).
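Under our reading of (6)-(9), one round of the SNMF updates transcribes to NumPy as follows; the helper name snmf_step and the eps guards are ours:

    import numpy as np

    def snmf_step(V, W, H, lam, eps=1e-9):
        # eq. (6): H update; the sparsity weight lam enters the denominator.
        R = V / (W @ H + eps)
        H = H * (W.T @ R) / (W.sum(axis=0)[:, None] + lam)
        # eq. (7): W update.
        R = V / (W @ H + eps)
        W = W * (R @ H.T) / (H.sum(axis=1)[None, :] + eps)
        # eqs. (8)-(9): normalize columns of W to unit 1-norm, rescale H rows.
        scale = W.sum(axis=0) + eps
        return W / scale, H * scale[:, None]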
3.3 Principal SNMF

When the dataset V is decomposed into W and H, each column of H represents the projection of the corresponding transaction onto the basis space W. Taken as a whole, the sum of each row vector of H represents the importance of the corresponding basis. Therefore, we define a support measurement after normalizing every column of H:

  h_{kl} \leftarrow h_{kl} \Big/ \sum_{k'} h_{k'l}    (10)

Definition. For every rule (column vector) w_i of W, we define a support measurement:

  \mathrm{support}(w_i) = \sum_j h_{ij} \Big/ \sum_{i,j} h_{ij}    (11)

Consequently, we can measure the importance of each rule for the entire dataset by its support value: the larger the support, the more important the rule is for the whole dataset.

In order to select the principal k rules as Ratio Rules, we first rank all the rules in descending order of support value, and then retain the first k principal rules as Ratio Rules, because they are more important than the others. For the selection of k, a simple criterion is:

  \min k \quad \text{s.t.} \quad \frac{\sum_{i=1}^{k} \mathrm{support}(w_i)}{\sum_{i=1}^{M} \mathrm{support}(w_i)} > \text{threshold}    (12)

By (12), Ratio Rules are obtained such that the sum of the k largest support values covers a given threshold (e.g., 90%) of the grand total support.
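The rule-selection procedure of (10)-(12) amounts to a few array operations. The sketch below is our illustration (the helper name principal_rules is ours); it returns the k principal columns of W together with the ranked support values:

    import numpy as np

    def principal_rules(W, H, threshold=0.90):
        Hn = H / (H.sum(axis=0, keepdims=True) + 1e-12)   # eq. (10): normalize columns of H
        support = Hn.sum(axis=1) / Hn.sum()               # eq. (11): support(w_i)
        order = np.argsort(support)[::-1]                 # rank rules by descending support
        cum = np.cumsum(support[order])
        k = int(np.searchsorted(cum, threshold)) + 1      # eq. (12): smallest k over threshold
        return W[:, order[:k]], support[order]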
4. Experiments

Synthetic dataset:

We applied both PSNMF and PCA to a dataset consisting of two clusters: 25 Gaussian points on the x-y plane (generated with mu = [3; 5], sigma = [1, 1.2; 1.2, 2]) and 50 points on the y-z plane (generated with mu = [3; 5], sigma = [2, 1.6; 1.6, 2]); see Fig. 2.
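For reproducibility, the dataset can be regenerated along the following lines. This is our reading of the setup: each cluster lies in a coordinate plane, so its third coordinate is zero, and we clip to non-negative values because V must be non-negative; the random seed is arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    xy = rng.multivariate_normal([3, 5], [[1, 1.2], [1.2, 2]], size=25)
    yz = rng.multivariate_normal([3, 5], [[2, 1.6], [1.6, 2]], size=50)
    c1 = np.column_stack([xy, np.zeros(25)])    # 25 points on the x-y plane
    c2 = np.column_stack([np.zeros(50), yz])    # 50 points on the y-z plane
    V = np.clip(np.vstack([c1, c2]), 0, None)   # 75 x 3 non-negative data matrix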
[Figure 2: 3-D scatter plot (axes X, Y, Z) of the synthetic dataset.]
Fig 2. Dataset with two clusters

Table 1. Ratio Rules based on PSNMF and PCA

(a) Based on PSNMF
                  RR1      RR2      RR3
  (X)             0.000    0.696    0.020
  (Y)             0.493    0.304    0.980
  (Z)             0.507    0.000    0.000
  Sum(w_i)        49.88    21.64    3.488
  Support(w_i)    0.665    0.289    0.046

(b) Based on PCA
                  RR1      RR2
  (X)             -0.52    0.72
  (Y)             -0.77    -0.15
  (Z)             -0.38    -0.68

Table 1(a) lists all the rules and their corresponding support values. After ranking the rules, two Ratio Rules are obtained, since support(w_1) + support(w_2) = 0.9535 > 90%:

  rule 1: X : Y : Z => 0 : 0.493 : 0.507 (support 0.6650)
  rule 2: X : Y : Z => 0.696 : 0.304 : 0 (support 0.2885)

For example, two thirds of the transactions (the cluster distributed on the y-z plane) depend mostly on rule 1 and the others on rule 2, so the support value of rule 1 (0.665) matches intuition. By contrast, Table 1(b) lists the Ratio Rules found by PCA, whose negative associations are difficult to explain.

Real dataset: NBA (459 × 11)

This dataset comes from basketball statistics of the 1997-98 season, including minutes played, points per game, assists per game, etc. We select this dataset because it gives an intuitive meaning to the latent associations. Table 2 presents the first three Ratio Rules (RR1, RR2, RR3) found by PSNMF. Based on general knowledge of basketball, we conjecture that RR1 represents the agility of a player: the ratio of assists per game to steals is 0.206 : 0.220 ≈ 1 : 1, meaning that an average player who makes one assist per game will also steal the ball once. The same holds for RR2, where rebounds per game to fouls is 0.117 : 0.263 ≈ 1 : 2.25. In this case, traditional methods cannot reveal such information behind the dataset.

Table 2. Ratio Rules by PSNMF from NBA
  field                RR1      RR2      RR3
  Games                                  0.450
  Minutes                                0.013
  Points Per Game                        0.010
  Rebounds Per Game             0.117
  Assists Per Game     0.206
  Steals               0.220
  Fouls                         0.263
  3-Pointers

5. Conclusion

In this work, we proposed Principal Sparse Non-Negative Matrix Factorization (PSNMF) for learning sparse non-negative components in matrix factorization. It aims to learn latent components, called Ratio Rules. Experimental results illustrate that our Ratio Rules are better suited to representing associations between items than those obtained by PCA.

References

[1] Agrawal, R., Imielinski, T. and Swami, A.N. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD, (1993), 207-216.
[2] Aumann, Y. and Lindell, Y. A statistical theory for quantitative association rules. In Proc. of the KDD, (1999).
[3] Han, J. and Fu, Y. Discovery of multiple-level association rules from large databases. In Proc. of the VLDB, (1995), 420-431.
[4] Korn, F., Labrinidis, A., Kotidis, Y. and Faloutsos, C. Ratio rules: A new paradigm for fast, quantifiable data mining. In Proc. of the VLDB, (1998), 582-593.
[5] Lee, D.D. and Seung, H.S. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13, (2001), 556-562.
[6] Olshausen, B.A. and Field, D.J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381 (1996), 607-609.
[7] Srikant, R. and Agrawal, R. Mining quantitative association rules in large relational tables. In Proc. of the ACM SIGMOD, (1996).

Appendix

To prove the convergence of the learning algorithm (6)-(7), an auxiliary function G(Z, Z') is introduced for an objective function L(Z), with the properties G(Z, Z') \ge L(Z) and G(Z, Z) = L(Z). We will show that the multiplicative update rules correspond to setting, at each iteration, the new state vector to the values that minimize the auxiliary function:

  Z^{(t+1)} = \arg\min_Z G(Z, Z^{(t)})    (13)

The objective function L(Z) is then non-increasing when Z is updated using (13), because

  L(Z^{(t+1)}) \le G(Z^{(t+1)}, Z^{(t)}) \le G(Z^{(t)}, Z^{(t)}) = L(Z^{(t)}).

Updating H: with W fixed, H is updated by minimizing L(H) = D(V \| WH). An auxiliary function for L(H) is constructed as:

  G(H, H') = \sum_{i,j} v_{ij} \log v_{ij} - \sum_{i,j,k} v_{ij} \frac{w_{ik} h'_{kj}}{\sum_{k'} w_{ik'} h'_{k'j}} \Big( \log (w_{ik} h_{kj}) - \log \frac{w_{ik} h'_{kj}}{\sum_{k'} w_{ik'} h'_{k'j}} \Big) + \sum_{i,j} y_{ij} - \sum_{i,j} v_{ij} + \lambda \sum_{i,j} h_{ij}

Since it is easy to verify that \sum_j l_j = \sum_{i,j} h_{ij}, it is not difficult to check that G(H, H) = L(H). The following proves G(H, H') \ge L(H). Because -\log(\sum_k w_{ik} h_{kj}) is a convex function, the following holds for all i, j and any \mu_{ijk} with \sum_k \mu_{ijk} = 1:

  -\log \Big( \sum_k w_{ik} h_{kj} \Big) \le - \sum_k \mu_{ijk} \log \frac{w_{ik} h_{kj}}{\mu_{ijk}}, \quad \text{where } \mu_{ijk} = \frac{w_{ik} h'_{kj}}{\sum_{k'} w_{ik'} h'_{k'j}}

thus

  -\log \Big( \sum_k w_{ik} h_{kj} \Big) \le - \sum_k \frac{w_{ik} h'_{kj}}{\sum_{k'} w_{ik'} h'_{k'j}} \Big( \log (w_{ik} h_{kj}) - \log \frac{w_{ik} h'_{kj}}{\sum_{k'} w_{ik'} h'_{k'j}} \Big)

Thus G(H, H') \ge L(H). To minimize L(H), we update H by:

  H^{(t+1)} = \arg\min_H G(H, H^{(t)})    (14)

Setting the partial derivative to zero for all k, l:

  \frac{\partial G(H, H')}{\partial h_{kl}} = - \sum_i v_{il} \frac{w_{ik} h'_{kl}}{\sum_{k'} w_{ik'} h'_{k'l}} \frac{1}{h_{kl}} + \sum_i w_{ik} + \lambda = 0

Solving for H, this gives:

  h_{kl} = \frac{h'_{kl}}{\sum_i w_{ik} + \lambda} \sum_i \frac{v_{il} \, w_{ik}}{\sum_{k'} w_{ik'} h'_{k'l}}

which is the desired update (6) for H.

Updating W: with H fixed, W is updated by minimizing L(W) = D(V \| WH). The auxiliary function is

  G(W, W') = \sum_{i,j} v_{ij} \log v_{ij} - \sum_{i,j,k} v_{ij} \frac{w'_{ik} h_{kj}}{\sum_{k'} w'_{ik'} h_{k'j}} \Big( \log (w_{ik} h_{kj}) - \log \frac{w'_{ik} h_{kj}}{\sum_{k'} w'_{ik'} h_{k'j}} \Big) + \sum_{i,j} y_{ij} - \sum_{i,j} v_{ij} + \lambda \sum_{i,j} h_{ij}

It is easy to prove G(W, W) = L(W) and G(W, W') \ge L(W) likewise, and we obtain:

  w_{kl} = \frac{w'_{kl}}{\sum_j h_{lj}} \sum_j \frac{v_{kj} \, h_{lj}}{\sum_{l'} w'_{kl'} h_{l'j}}

which is the desired update (7). This completes the proof.
