Showv3 Enic2017
1 / 23
Motivation
2 / 23
Data definition
Let $I = \{i_1, \ldots, i_d\}$ be the set of all items, and let $T = \{t_i\}$ be the set of all transactions.
- Each transaction $t_i = (e_{i1}, \ldots, e_{iN})$ is a subset of items chosen from $I$ and represents the market basket of the $i$th transaction.
- Each $t_i$ is encoded as a binary vector: $e_{ij} = 1$ if the $j$th item is present in the market basket, and $e_{ij} = 0$ otherwise.
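The binary encoding can be illustrated with a minimal sketch. The item names, the baskets, and the helper `encode_transactions` are made up for illustration; they are not taken from the supermarket database.

```python
import numpy as np

# Hypothetical item set I and market baskets (illustrative only).
items = ["bread", "milk", "wine", "cheese"]
baskets = [
    {"bread", "milk"},
    {"wine", "cheese"},
    {"bread", "milk", "cheese"},
]

def encode_transactions(baskets, items):
    """Return a binary matrix T with T[i, j] = 1 if item j is in basket i."""
    index = {item: j for j, item in enumerate(items)}
    T = np.zeros((len(baskets), len(items)), dtype=int)
    for i, basket in enumerate(baskets):
        for item in basket:
            T[i, index[item]] = 1
    return T

T = encode_transactions(baskets, items)
print(T)
```

Each row of `T` is one transaction and each column is one item, so column sums give the number of baskets that contain each item.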
Figure: An example of T.
3 / 23
The network of products I
4 / 23
The network of products II
Figure: Product network example with 17 product categories. Each color represents one category. (a) Threshold at 10%. (b) Threshold at 1%.
5 / 23
Minimum spanning trees I
The minimum spanning tree (MST) is the spanning tree T that minimizes the
total weight:
$$w(T) = \sum_{e \in T} w(e)$$
6 / 23
Minimum spanning trees II
Figure: (a) Product network for wines category. (b) The corresponding MST.
7 / 23
Proposed methodology
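1 Compute the correlation between each pair of products:
$$\phi_{XY} = \frac{\langle XY \rangle - \langle X \rangle \langle Y \rangle}{\sqrt{\left(\langle X^2 \rangle - \langle X \rangle^2\right)\left(\langle Y^2 \rangle - \langle Y \rangle^2\right)}}$$
- The output is a symmetric matrix with elements $\phi_{ij}$.
2 Transform the correlations into a distance metric.
- When the correlation is equal to 1, the distance is zero.
- $d(i, j) = \sqrt{2(1 - \phi_{ij})}$
3 Obtain the MST.
- This was carried out with Prim's algorithm.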
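The pipeline from correlations to distances to the MST can be sketched as follows. This is a minimal illustration on made-up toy data, not the implementation used for the study; the function names are hypothetical, and the sketch assumes every item appears in some but not all baskets so that the correlation is defined.

```python
import numpy as np

def phi_matrix(T):
    """Pearson (phi) correlation between the binary columns of T."""
    # Undefined for a column with zero variance (item in all or no baskets).
    return np.corrcoef(T, rowvar=False)

def distance_matrix(phi):
    """d(i, j) = sqrt(2 * (1 - phi_ij)); the clip guards against rounding."""
    return np.sqrt(np.clip(2.0 * (1.0 - phi), 0.0, None))

def prim_mst(D):
    """Prim's algorithm on a dense distance matrix; returns edges (i, j, d)."""
    n = D.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = D[0].astype(float)         # cheapest known link from the tree to each node
    parent = np.zeros(n, dtype=int)   # tree node that provides that link
    edges = []
    for _ in range(n - 1):
        candidates = np.where(in_tree, np.inf, best)
        j = int(np.argmin(candidates))
        edges.append((int(parent[j]), j, float(D[parent[j], j])))
        in_tree[j] = True
        closer = (D[j] < best) & ~in_tree
        best[closer] = D[j][closer]
        parent[closer] = j
    return edges

# Toy data (made up): items 0 and 1 tend to co-occur, item 2 is bought separately.
T = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]])
D = distance_matrix(phi_matrix(T))
edges = prim_mst(D)
print(edges)
```

On this toy data, items 0 and 2 are perfectly anticorrelated, so their distance reaches the maximum value of 2 and the MST links them only through item 1.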
8 / 23
Application
The data come from a real transactional database of a typical supermarket:
1,046,804 transactions over a 15-month period, covering 17 main categories and
220 subcategories.
Figure: Heatmap of the transactions for a sample of 80 subcategory items.
9 / 23
Results
Rule discovery was carried out with the Apriori algorithm.
minimum support: 0.001
minimum confidence: 0.01
Result: 179,610 rules were found.
Number of rules with L ≥ 1: 33,759.
Figure: Scatterplot of ARs with L ≥ 1 (labels recovered from the plot: confidence; order 2 to order 7). The right side indicates the level of L according to the intensity of the color.
10 / 23
Results
Figure: Representation of the complete MST for the transactional database. Each color represents one of the 17 main product categories. Note:
Distances are not proportional to the weights.
11 / 23
Results
Figure: MST with numbered product nodes. Labeled regions: A, B, F. Annotation: cheeses, cereals, butter.
12 / 23
Importance measure
Thus, importance grows with node degree, and each incident edge contributes in
inverse proportion to its distance: the closer a node is to an adjacent node,
the higher its importance.
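The exact formula for the importance measure is not reproduced in this text. One reading consistent with the description is the sum of inverse distances over the MST edges incident to a node; the sketch below assumes that form, and both the function name and the edge list are hypothetical.

```python
from collections import defaultdict

def importance(edges):
    """Sum of 1/d over the edges incident to each node (assumed form)."""
    score = defaultdict(float)
    for i, j, d in edges:
        score[i] += 1.0 / d
        score[j] += 1.0 / d
    return dict(score)

# Hypothetical MST edges (node_a, node_b, distance), for illustration only.
edges = [(0, 1, 0.5), (1, 2, 1.0), (1, 3, 2.0)]
scores = importance(edges)
print(scores)
```

Under this assumption, node 1 scores highest because it has the largest degree and one very short edge, which matches the qualitative behaviour described above.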
13 / 23
Importance measure
Figure: Red edges indicate strong relationships (distances below the 10th percentile, which represents the influence zone); grey edges indicate weak
relationships (distances above the 90th percentile); the rest are black. Red nodes have importance at or above the 95th percentile
(these are the most important nodes). Yellow nodes have importance at or above the 90th percentile but below the 95th. Black nodes have
importance below the 90th percentile. Note: Edge distances are not proportional to their weights.
14 / 23
An Example
22: Special hair balsam conditioner, 186: Specific shampoo, 51: Packaged mussels, 189: Fish soup.
15 / 23
An Example
Figure: Zoom of the MST on the hygiene subcategory, panels (a) and (b). Edge labels use the notation L (C); the recoverable labels are 8.43 (0.06), 6.31 (0.10), and L = 3.53. The red edge represents the rule {22} → {186}.
19: Antidandruff hair balsam, 20: Dry hair balsam conditioner, 21: Traditional balsam conditioner, 22:
Special hair balsam conditioner, 60: Combing Cream, 184: Antidandruff shampoo, 185: Beauty shampoo,
186: Specific shampoo, 187: Familiar size shampoo.
16 / 23
Conclusions
17 / 23
Thank you for your attention.
Contact e-mail:
mvalle@uft.cl
18 / 23
Argument I
Figure: Plot of ln(L) of the association rules given by the MST (black points); horizontal axis: rules, vertical axis: log(Lift). The red and orange lines represent the 75th and 90th percentiles,
respectively, of the distribution of log(lift) for each set $R_i$ of association rules.
In this simulation, for each MST product $i$ we search for the set $R_i$ of all association rules of the type
$P_i \to P_j$, where $i \neq j$. In our case this yields 220 different sets of rules. Then, for each set $R_i$, we find
the rule $P_i \to P_m$, where $m$ is the product (node) connected to product $i$ in the MST, and obtain its lift.
We then compare this lift with the mean lift of the rules in $R_i$. This procedure is carried out for all products.
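The comparison can be sketched on toy data. This sketch makes simplifying assumptions that are not in the original: it computes the lift of single-item rules directly from a binary transaction matrix instead of mining them with Apriori, it assumes every item appears in at least one basket, and the function names and the edge list are hypothetical.

```python
import numpy as np

def pairwise_lift(T):
    """Lift of every single-item rule P_i -> P_j from a binary matrix T."""
    T = np.asarray(T, dtype=float)
    n = T.shape[0]
    p = T.mean(axis=0)                # P(i); must be > 0 for every item
    joint = (T.T @ T) / n             # P(i, j)
    return joint / np.outer(p, p)     # P(i, j) / (P(i) P(j))

def compare_with_mst(lift_matrix, mst_edges):
    """For each MST edge (i, m): (i, m, lift of P_i -> P_m, mean lift of R_i)."""
    n = lift_matrix.shape[0]
    out = []
    for i, m in mst_edges:
        others = [j for j in range(n) if j != i]   # consequents of the rules in R_i
        out.append((i, m, lift_matrix[i, m], lift_matrix[i, others].mean()))
    return out

# Toy transactions (made up) and a hypothetical list of MST edges (i, m).
T = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]])
lift_matrix = pairwise_lift(T)
results = compare_with_mst(lift_matrix, [(0, 1), (1, 2)])
for i, m, lift_im, mean_lift in results:
    print(i, m, round(lift_im, 3), round(mean_lift, 3))
```

Replacing the mean with a percentile of `lift_matrix[i, others]` gives the variant shown by the red and orange lines in the figure.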
19 / 23
Association rules I
Association rules seek dependencies between the items that make up consumer
market baskets $t_i$ in a transactional database $T$. They are of the form
$Y \to X$, where the left term is the antecedent and the right term is the consequent.
20 / 23
Association rules II
Definitions:
Support: $S(Y) = P(Y)$.
Confidence: $C(Y \to X) = P(X \mid Y)$.
Lift: $L(Y \to X) = \frac{P(X \mid Y)}{P(X)}$.
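A small worked example may help fix these definitions. It assumes the usual reading in which confidence is the probability of the consequent given the antecedent; the baskets and function names are made up for illustration.

```python
def support(baskets, itemset):
    """S(A): fraction of baskets that contain every item in A."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """C(Y -> X) = S(Y and X together) / S(Y), i.e. P(X | Y)."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

def lift(baskets, antecedent, consequent):
    """L(Y -> X) = C(Y -> X) / S(X)."""
    return confidence(baskets, antecedent, consequent) / support(baskets, consequent)

# Hypothetical baskets (illustrative only).
baskets = [
    {"shampoo", "conditioner"},
    {"shampoo"},
    {"conditioner", "shampoo", "soap"},
    {"soap"},
]
rule_confidence = confidence(baskets, {"conditioner"}, {"shampoo"})
rule_lift = lift(baskets, {"conditioner"}, {"shampoo"})
print(rule_confidence, rule_lift)
```

Here every basket with conditioner also contains shampoo, so the confidence is 1, and dividing by the shampoo support of 3/4 gives a lift above 1, which indicates a positive association.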
21 / 23
Turnover of products
80: Vitaminized pasta, 214: Beaten yogurt, 7: Vegetable oil, 11: Rice grade 2, 28: Familiar
size soda, 23: Familiar size Coca-Cola.
22 / 23
Degree and weight distribution
Figure: On the left, the degree distribution P(K ≥ k) of the entire network of items, on log-log axes. Note the heavy-tailed curve, characteristic of this type of network.
On the right, the weight distribution P(W ≥ w), again heavy-tailed, indicating a few strong relationships between items and a large number of weak ones.
23 / 23