
Market Basket Analysis using Minimum Spanning Trees

Mauricio A. Valle (1), Gonzalo A. Ruz (2), Rodrigo Morrás (2)

(1) Universidad Finis Terrae, Chile
(2) Universidad Adolfo Ibáñez, Chile

52nd Annual Assembly CLADEA 2017


October 17-19, 2017, Riverside-California, USA.

Motivation

The existing data mining tools for market basket analysis (MBA) are association rules (ARs).
- Disadvantage: the Apriori algorithm finds a very large number of rules.
- The goal of this work is to complement ARs in MBA.
There is a need to represent a large transactional database in a simple visual
representation.
- Advantage: easy identification of significant co-occurrences in a market basket.
- Delivery: a visual approach to help retail managers in marketing activities.

Data definition
Let $I = \{i_1, \dots, i_d\}$ be the set of all items.
Let $T = \{t_i\}$ be the set of all transactions.
- Each transaction $t_i = (e_{i1}, \dots, e_{iN})$ is a subset of items chosen from $I$ and
  represents the market basket of the $i$th transaction.
- Each $t_i$ is a binary vector: $e_{ij} = 1$ if the $j$th item is present
  in the market basket, and $e_{ij} = 0$ otherwise.
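
As a minimal sketch of this encoding, the Python snippet below turns a list of baskets into binary vectors. The function name `encode_transactions` and the toy items are illustrative assumptions, not part of the original work.

```python
def encode_transactions(baskets, items):
    """Encode each basket as a binary vector with one entry per item in `items`."""
    index = {item: j for j, item in enumerate(items)}
    matrix = []
    for basket in baskets:
        row = [0] * len(items)
        for item in basket:
            row[index[item]] = 1  # e_ij = 1 when item j is in basket i
        matrix.append(row)
    return matrix

# Hypothetical toy data: three items and three market baskets.
items = ["bread", "milk", "wine"]
baskets = [{"bread", "milk"}, {"wine"}, {"bread", "milk", "wine"}]
T = encode_transactions(baskets, items)
```

Each row of `T` is one transaction and each column is one item, which is the layout assumed by the correlation step later in the slides.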

Figure: An example of T.

The network of products I

A network of products is an undirected weighted graph $G(I) = (V, E)$ in
which $V = \{i_k \mid i_k \in I\}$ and $E = \{(e_i, e_j) \mid e_i, e_j \in t_p,\ i \neq j\}$, where $e_i$ and $e_j$
indicate whether items $i$ and $j$ in $I$ are both in $t_p$.
Note that $G(I)$ could be a complete network, with a few vertices highly connected and
others poorly connected.

Each edge $(u, v) \in E$ has a weight $\rho(u, v)$ indicating a co-occurrence or a dissimilarity
measure.
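
One way to build such a network, assuming the weight is a plain co-occurrence count (the slides do not fix a specific weight here), is sketched below. The name `cooccurrence_edges` is hypothetical.

```python
from itertools import combinations

def cooccurrence_edges(T):
    """Return {(i, j): count} for item pairs i < j that share at least one basket."""
    edges = {}
    for row in T:
        present = [j for j, e in enumerate(row) if e == 1]
        # Every pair of items in the same basket adds one co-occurrence.
        for i, j in combinations(present, 2):
            edges[(i, j)] = edges.get((i, j), 0) + 1
    return edges

# Binary transaction matrix: rows are baskets, columns are items.
T = [[1, 1, 0], [0, 0, 1], [1, 1, 1]]
edges = cooccurrence_edges(T)
```

Pairs that never share a basket get no edge, so the resulting graph is complete only when every pair of items co-occurs at least once.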

The network of products II

Figure: Product network example with 17 product categories. Each color represents one of them. (a) Threshold at 10%. (b) Threshold at 1%.

Minimum spanning trees I

A spanning tree $T = (V, E, w(e))$ of a connected graph $G(I)$ is a subgraph of $G(I)$
in which:
- $T$ contains every vertex of $G(I)$,
- $T$ has no cycles,
- $w(e)$ are the weights of the edges $(u, v)$ of $G(I)$.

The minimum spanning tree (MST) is the spanning tree $T$ that minimizes the
total weight:

$$w(T) = \sum_{e \in T} w(e)$$

The minimization of $w(T)$ can be accomplished with a greedy algorithm such as Prim's algorithm or
Kruskal's algorithm, which takes $O(m \log n)$ time for a graph with $m$ edges and $n$ vertices.
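
For illustration, a compact version of Kruskal's algorithm with a union-find structure could look as follows. This is a generic textbook sketch, not the authors' code, and the edge list is made-up example data.

```python
def kruskal_mst(n, edges):
    """Kruskal's algorithm. `edges` is a list of (weight, u, v) with nodes 0..n-1."""
    parent = list(range(n))

    def find(x):
        # Find the component root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):  # greedy: cheapest edges first
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so it creates no cycle
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

# Hypothetical weighted graph with 4 nodes.
edges = [(1.0, 0, 1), (4.0, 0, 2), (2.0, 1, 2), (5.0, 2, 3), (3.0, 1, 3)]
mst = kruskal_mst(4, edges)
total = sum(w for w, _, _ in mst)
```

Sorting the edges dominates the running time, which is where the $O(m \log n)$ bound comes from.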

Minimum spanning trees II

Figure: (a) Product network for the wines category. (b) The corresponding MST.

Proposed methodology

The proposed market basket analysis methodology is summarized in the
following three steps:
1. Obtain a correlation matrix between products.
   - For products X and Y,

     $$\phi_{XY} = \frac{\langle XY \rangle - \langle X \rangle \langle Y \rangle}{\sqrt{\left(\langle X^2 \rangle - \langle X \rangle^2\right)\left(\langle Y^2 \rangle - \langle Y \rangle^2\right)}}$$

   - The output is a symmetric matrix with elements $\phi_{ij}$.
2. Transform the correlations into a distance metric.
   - $d(i, j) = \sqrt{2(1 - \phi_{ij})}$
   - When the correlation is equal to 1, the distance is zero.
3. Obtain the MST.
   - This was carried out with Prim's algorithm.
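
The three steps can be sketched end to end in plain Python. The helper names (`phi`, `distance_matrix`, `prim_mst`) and the toy columns are assumptions made for this example, and the sketch assumes that no product is present in all or in none of the transactions (otherwise the correlation is undefined).

```python
import heapq
from math import sqrt

def phi(x, y):
    """Correlation between two binary columns, following the formula above."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    mxy = sum(a * b for a, b in zip(x, y)) / n
    # For binary data <X^2> = <X>, so the variance is mx * (1 - mx).
    return (mxy - mx * my) / sqrt(mx * (1 - mx) * my * (1 - my))

def distance_matrix(columns):
    """d(i, j) = sqrt(2 * (1 - phi_ij)), with zeros on the diagonal."""
    k = len(columns)
    return [[0.0 if i == j
             else sqrt(max(0.0, 2 * (1 - phi(columns[i], columns[j]))))
             for j in range(k)] for i in range(k)]

def prim_mst(d):
    """Prim's algorithm on a dense distance matrix; returns (u, v, weight) edges."""
    n = len(d)
    visited = {0}
    heap = [(d[0][j], 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    tree = []
    while heap and len(visited) < n:
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue  # stale entry: v was already reached by a cheaper edge
        visited.add(v)
        tree.append((u, v, w))
        for j in range(n):
            if j not in visited:
                heapq.heappush(heap, (d[v][j], v, j))
    return tree

# Hypothetical data: one column per product, one entry per transaction.
columns = [
    [1, 1, 0, 0, 1, 0],  # product 0
    [1, 1, 0, 0, 1, 1],  # product 1
    [0, 0, 1, 1, 0, 1],  # product 2
]
d = distance_matrix(columns)
mst = prim_mst(d)
```

In this toy example products 0 and 1 are positively correlated and end up adjacent in the tree, while products 0 and 2 never co-occur and sit at the maximum distance of 2.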

Application
A real transactional database was used, containing 1,046,804 transactions across 17
main categories and 220 subcategories, from a typical supermarket over a 15-month
period.

Figure: A sample of 80 subcategory items has been extracted to represent a heatmap of the transactions.

Results
Rule discovery was carried out with the Apriori algorithm.
- Minimum support: 0.001
- Minimum confidence: 0.01
- Result: 179,610 rules were found.
- Number of rules with lift L ≥ 1: 33,759.
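
To show where the minimum support threshold enters, here is a small sketch of the frequent-itemset phase of Apriori. It is a didactic version written for this note (the slides do not say which implementation was used), and it omits the rule-generation phase, where the minimum confidence is applied.

```python
from itertools import combinations

def apriori(baskets, min_support):
    """Level-wise search for itemsets whose support is at least `min_support`."""
    n = len(baskets)

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    items = {i for b in baskets for i in b}
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and support(c) >= min_support}
    return frequent

# Hypothetical toy baskets.
baskets = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
freq = apriori(baskets, min_support=0.5)
```

With a threshold as low as 0.001 on about a million transactions, very many itemsets survive the pruning, which is consistent with the large number of rules reported above.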

Figure: Scatter plot of the 33,759 ARs with L ≥ 1, with support on the horizontal axis and confidence on the vertical axis; the plot labels rules by order, from 2 to 7. The right side indicates the level of L according to the intensity of the color.

Results

Figure: Representation of the complete MST for the transactional database. Each color represents one of the 17 product categories. Note:
Distances are not proportional to the weights.

Results

Figure: The complete MST with nodes labeled by subcategory number and with annotated groups A, B, C, E, F and G. Group G: milks, yogurt, cheeses, cereals, butter.

Importance measure

We define a general measure of the importance of a node. Let $T = (V, E, W)$ be
a spanning tree, where $V$ is the set of $N$ nodes and $E$ is the set of $N - 1$ edges connecting
pairs of nodes $u$ and $v$ in $V$, each with a distance $w_{uv}$ in $W$. The importance of a
node $u$, $I(u)$, is the sum of the inverse distances between $u$ and each node in the set $K_u$
of nodes adjacent to it:

$$I(u) = \sum_{k \in K_u} \frac{1}{w_{uk}} \qquad (1)$$

Thus, importance grows with the node degree, and each adjacent node contributes in inverse
proportion to its distance from $u$: the closer a node is to its neighbours, the higher its
importance value.
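
A direct translation of Equation (1), assuming the tree is given as a list of `(u, v, distance)` edges (a representation chosen for this sketch), might read:

```python
def importance(tree_edges):
    """I(u) = sum over neighbours k of 1 / w_uk, for every node of the tree."""
    scores = {}
    for u, v, w in tree_edges:
        # Each edge contributes its inverse distance to both endpoints.
        scores[u] = scores.get(u, 0.0) + 1.0 / w
        scores[v] = scores.get(v, 0.0) + 1.0 / w
    return scores

# Hypothetical tree with four nodes.
tree = [("A", "B", 0.5), ("A", "C", 1.0), ("C", "D", 2.0)]
scores = importance(tree)
```

Node A has two neighbours, one of them very close, so it receives the highest score in this example; the leaf D, attached by a long edge, receives the lowest.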

Importance measure

Figure: Red edges indicate strong relationships (distances below the 10th percentile, which represents the influence zone), grey edges are weak
relationships (distances above the 90th percentile); the rest are in black. Red nodes have importance equal to or above the 95th percentile
(these are the most important nodes). Yellow nodes have importance equal to or above the 90th percentile but below the 95th. Black nodes have
importance below the 90th percentile. Note: The edge lengths are not proportional to their weights.
Nodes labeled in the figure: 219, 130, 135, 204, 108, 31, 42, 29, 146, 110, 213, 214, 174, 170, 172, 189, 15, 80, 175, 153, 94, 117, 55, 118, 188, 120, 186.
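
The coloring rule in the caption can be sketched as below. The nearest-rank percentile convention and the function names are assumptions of this example; the slides do not state which percentile definition was used.

```python
def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100 (one common convention)."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling of q*n/100
    return ordered[rank - 1]

def color_nodes(importance):
    """Red: >= 95th percentile; yellow: >= 90th but below 95th; black otherwise."""
    p90 = percentile(importance.values(), 90)
    p95 = percentile(importance.values(), 95)
    return {u: "red" if s >= p95 else "yellow" if s >= p90 else "black"
            for u, s in importance.items()}

def color_edges(distance):
    """Red: below the 10th percentile; grey: above the 90th; black otherwise."""
    p10 = percentile(distance.values(), 10)
    p90 = percentile(distance.values(), 90)
    return {e: "red" if w < p10 else "grey" if w > p90 else "black"
            for e, w in distance.items()}

# Hypothetical importance scores and edge distances for 20 nodes and 20 edges.
importance = {u: float(u) for u in range(1, 21)}
node_colors = color_nodes(importance)
distance = {(u, u + 1): float(u) for u in range(1, 21)}
edge_colors = color_edges(distance)
```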

An Example

Table: First 8 rules found with the highest level of lift.

#   Left Hand Side (LHS)    Right Hand Side (RHS)   Support   Confidence   Lift
1   22                      186                     0.0025    0.314        20.018
2   51                      189                     0.0011    0.215        13.627
3   80, 118, 94, 187        21                      0.0010    0.565        13.467
4   118, 94, 187            21                      0.0013    0.552        13.156
5   214, 11, 21, 72         187                     0.0010    0.756        13.148
6   80, 94, 21, 72          187                     0.0011    0.755        13.138
7   214, 80, 7, 21, 72      187                     0.0011    0.752        13.091
8   94, 7, 21, 72           187                     0.0011    0.750        13.042

22: Special hair balsam conditioner, 186: Specific shampoo, 51: Packaged mussels, 189: Fish soup.

An Example

Figure: Zoom of the MST on hygiene subcategory, panels (a) and (b). Edge labels use the notation L (C). The red edge, labeled 20.01 (0.31), represents the rule {22} → {186}.

19: Antidandruff hair balsam, 20: Dry hair balsam conditioner, 21: Traditional balsam conditioner, 22:
Special hair balsam conditioner, 60: Combing Cream, 184: Antidandruff shampoo, 185: Beauty shampoo,
186: Specific shampoo, 187: Familiar size shampoo.

Conclusions

The MST helps to find products with a high propensity to appear together in a
market basket, which is of practical use to a retail manager.
By construction, the MST is a visual representation of rules between pairs of
products with high lift. It offers a more graphical insight into the
association rules found by the Apriori algorithm.
Our approach could help the retail manager guide marketing activities
such as special promotions or product bundling.

Thank you for your attention.

Contact e-mail:
mvalle@uft.cl

Argument I

Figure: Plot of ln(L) of the association rules given by the MST (black points), with log(Lift) on the vertical axis and the rules on the horizontal axis. The red and orange lines represent the 75th and 90th percentiles,
respectively, of the distribution of log(lift) for each set $R_i$ of association rules.

A simulation was run in which, for each MST product $i$, we search for the set $R_i$ of all association rules of the type
$P_i \to P_j$ with $i \neq j$. In our case there are 220 different sets of rules. Then, within set $R_i$, we find
the rule $P_i \to P_m$, where $m$ is the product (node) connected to product $i$ in the MST,
and obtain its lift. We then compare this lift with the mean lift of the
rules in $R_i$. This procedure is carried out for all products.
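
A simplified sketch of this comparison is given below. It assumes that $R_i$ contains every pairwise rule from product $i$ (the original sets come from the Apriori output, so they may be smaller), and the names `pair_lift` and `compare_mst_lift` are hypothetical.

```python
def pair_lift(T, i, j):
    """Lift of the rule P_i -> P_j estimated from a binary transaction matrix T."""
    n = len(T)
    p_i = sum(row[i] for row in T) / n
    p_j = sum(row[j] for row in T) / n
    p_ij = sum(row[i] * row[j] for row in T) / n
    return p_ij / (p_i * p_j)

def compare_mst_lift(T, mst_edges):
    """For each MST pair (i, m), return (lift of P_i -> P_m, mean lift over R_i)."""
    d = len(T[0])
    result = {}
    for i, m in mst_edges:
        r_i = [pair_lift(T, i, j) for j in range(d) if j != i]
        result[(i, m)] = (pair_lift(T, i, m), sum(r_i) / len(r_i))
    return result

# Hypothetical data: 6 transactions over 3 products; MST pairs given as (i, m).
T = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 1]]
out = compare_mst_lift(T, [(0, 1), (2, 1)])
```

In this toy case the lift of each rule along an MST edge is above the mean lift of its set, which is the pattern the figure above examines on the real data.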

Association rules I

Association rules try to find dependencies between the items that make up
consumer market baskets $t_i$ in a transactional database
$T$. They are of the form:

{ spaghetti } → { tomato sauce }

The left term is the antecedent and the right term is the consequent.

Association rules II

Definitions:
- Support: $S(Y) = P(Y)$.
- Confidence: $C(Y \to X) = P(X \mid Y)$.
- Lift: $L(Y \to X) = \dfrac{P(X \mid Y)}{P(X)}$.
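
As a worked illustration, the three measures can be estimated from a list of baskets as follows. The sketch uses the standard convention that the support of a rule is the fraction of baskets containing both sides and that confidence conditions on the antecedent; `rule_metrics` and the toy baskets are made up for the example.

```python
def rule_metrics(baskets, lhs, rhs):
    """Support, confidence and lift of the rule lhs -> rhs over a list of sets."""
    n = len(baskets)
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for b in baskets if lhs <= b)
    n_rhs = sum(1 for b in baskets if rhs <= b)
    n_both = sum(1 for b in baskets if (lhs | rhs) <= b)
    support = n_both / n                # P(lhs and rhs)
    confidence = n_both / n_lhs         # P(rhs | lhs)
    lift = confidence / (n_rhs / n)     # P(rhs | lhs) / P(rhs)
    return support, confidence, lift

# Hypothetical baskets.
baskets = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti"},
    {"tomato sauce", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
]
support, confidence, lift = rule_metrics(baskets, {"spaghetti"}, {"tomato sauce"})
```

A lift above 1 means the two sides appear together more often than independence would predict; a lift below 1, as in this toy data, means they appear together less often.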

Turnover of products

Figure: Frequency of items for the transactional database.

80: Vitaminized pasta, 214: Beaten yogurt, 7: Vegetable oil, 11: Rice grade 2, 28: Familiar
size soda, 23: Familiar size Coca-Cola.

Degree and weight distribution

Figure: On the left, the degree distribution P(K ≥ k) of the entire network of items, on log-log axes. Note the heavy-tailed curve, characteristic of this type of network.
On the right, the weight distribution P(W ≥ w), again heavy-tailed, indicating a few strong relationships between items and a large number of weak
relationships between items.
