Computational Tools DTU Presentation Week5

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 47

Computational Tools for Data Science

Week 5:

Frequent itemsets

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 1
Motivation: Market-Basket Analysis
• Setting: Supermarket

• Task: Analyse shopping baskets of customers

• Goal: Find sets of items that appear often, i.e. in many baskets

• Why:
– See demand, hence we can adjust the price.
– Find patterns (association rules) in the data and make more
tricky sales.

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 2
Motivation: Market-Basket Analysis
Prime example:

Via data analysis: {Diapers, Beer} is a frequent pair of items

Not expected / interesting

Explanation: new parents rarely go into bars, they drink at home.

Sales trick: make a sale on diapers, but secretly increase beer price

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 3
Motivation: Market-Basket Analysis
Other examples:

• Baskets correspond to patients’s infos (biomarkers, diseases)

 Frequent items with a common disease suggest testing.

• Baskets are sets of sentences, items are (text) documents.

Note: Items need not be `in´ baskets!


Items and baskets are in some many-to-many relation.

Here: item in basket if the document contains the sentence.

Frequent pair  2 documents that share many sentences (plagiarism)


26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 4
Motivation: Market-Basket Analysis
In general:

• there are many (!) items


• there are also many (!) baskets

• baskets correspond to sets of items


• size of a basket is small compared to the number of all items

• our relevant data: sequence / list of baskets


» problem: often too big to fit into main memory

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 5
Motivation: Market-Basket Analysis
In general:

Often only interested in frequent pairs, or triples.

𝑛𝑛 𝑛𝑛(𝑛𝑛−1) 𝑛𝑛2
Given 𝑛𝑛 items, there are 2
= ≈ pairs of items.
2 2

𝑛𝑛 𝑛𝑛! 𝑛𝑛𝑘𝑘
Given 𝑛𝑛 items, there are 𝑘𝑘
= ≈ 𝑘𝑘-element itemsets.
𝑛𝑛−𝑘𝑘 !⋅𝑘𝑘! 𝑘𝑘!

Practice:
• Too much data for considering 𝑘𝑘-element itemsets for big 𝑘𝑘.
• Sales offers: can only handle few frequent pairs.
»There will be even fewer frequent triples, etc.

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 6
Measures: (Support)
Let 𝑰𝑰 be a set of items.

# 𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏 𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 𝐼𝐼
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐼𝐼 =
𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛 𝑜𝑜𝑜𝑜 𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏

Meaning: Probability of finding all items of 𝐼𝐼 in one basket.

Definition: We call an itemset 𝑰𝑰 frequent, if its support is greater than


some initially fixed value 𝒔𝒔 ∈ ℝ.
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐼𝐼 ≥ 𝑠𝑠

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 7
Measures for association rules: (Confidence)
• Let 𝑰𝑰 be a set of items
• Let 𝒋𝒋 be an item that is not in 𝑰𝑰.
• Let 𝑰𝑰 → 𝒋𝒋 be an association rule.

# 𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏 𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 𝐼𝐼 ∪ {𝑗𝑗}


𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐼𝐼 → 𝑗𝑗 =
# 𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏 𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 𝐼𝐼

Meaning: Probability of having all of 𝐼𝐼 ∪ {𝑗𝑗} in one basket, given that


we already have all of 𝐼𝐼 in the basket.

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 8
Motivation: Market-Basket Analysis
Example:

Basket # Items in the baskets


1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 9
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 10
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5


𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 11
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5


𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 0,75

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 12
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5


𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 0,75
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇 = 0,5

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 13
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5


𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 0,75
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇 = 0,5
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 = 0,25
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 14
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5 𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵, 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,375


𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 0,75
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇 = 0,5
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 = 0,25
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 15
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5 𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵, 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,375


𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5 𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀, 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 = 0,25
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 0,75
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇 = 0,5
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 = 0,25
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 16
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5 𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵, 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,375


𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5 𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀, 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 = 0,25
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 0,75 𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀, 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇 = 0,375
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇 = 0,5
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 = 0,25
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 17
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}

𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵} → 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,75

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 18
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}

𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵} → 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,75


𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶} → 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 1

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 19
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}

𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵} → 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,75


𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶} → 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 1
𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇} → 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 0,75

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 20
Measures for association rules: (Lift)
• Let 𝑰𝑰 be a set of items
• Let 𝒋𝒋 be an item that is not in 𝑰𝑰.
• Let 𝑰𝑰 → 𝒋𝒋 be an association rule.

𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶(𝐼𝐼 → 𝑗𝑗)
𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐼𝐼 → 𝑗𝑗 =
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆({𝑗𝑗})

Meaning: Probability of having 𝑗𝑗 in a basket given that we already


have all of 𝐼𝐼 in the basket divided by the probability of having 𝑗𝑗 in a
basket.

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 21
Measures for association rules: (Lift)
• Let 𝑰𝑰 be a set of items
• Let 𝒋𝒋 be an item that is not in 𝑰𝑰.
• Let 𝑰𝑰 → 𝒋𝒋 be an association rule.

𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶(𝐼𝐼 → 𝑗𝑗)
𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐼𝐼 → 𝑗𝑗 =
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆({𝑗𝑗})

Note:
• If 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐼𝐼 → 𝑗𝑗 is close to 1, then “not really interesting”.
• If 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐼𝐼 → 𝑗𝑗 ≫ 1, then 𝐼𝐼 encourages to buy 𝑗𝑗.
• If 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐼𝐼 → 𝑗𝑗 ≪ 1, then 𝐼𝐼 discourages to buy 𝑗𝑗.

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 22
Measures for association rules: (Interest)
• Let 𝑰𝑰 be a set of items
• Let 𝒋𝒋 be an item that is not in 𝑰𝑰.
• Let 𝑰𝑰 → 𝒋𝒋 be an association rule.

𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼 𝐼𝐼 → 𝑗𝑗 = 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐼𝐼 → 𝑗𝑗 − 𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆({𝑗𝑗})

Note:
• If 𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼 𝐼𝐼 → 𝑗𝑗 is close to 0, then “not really interesting”.
• If 𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼 𝐼𝐼 → 𝑗𝑗 is highly positive, then 𝐼𝐼 encourages to buy 𝑗𝑗.
• If 𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼 𝐼𝐼 → 𝑗𝑗 is highly negative, then 𝐼𝐼 discourages to buy 𝑗𝑗.

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 23
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}

𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵} → 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,75


𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵} → 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 1,5

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 24
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}

𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇} → 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 0,75


𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇} → 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 1

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 25
Finding frequent itemsets

Finding / determining interesting association rules is easy compared


to finding frequent itemsets (what we have to do first anyway).

Challenges:
• Huge amount of data.  Not enough main memory.

Focus:
• How to find frequent itemsets (algorithms).
• How handle (main) memory.

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 26
Store itemset count in main memory

For each frequent itemset algorithm we have to store the count for
the itemsets.

(!) Use integers (4 byte) instead of item names / long IDs.


 via a hash table.

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 27
Store itemset count in main memory

For each frequent itemset algorithm we have to store the count for
the itemsets.

Limit on how many items we can deal with without thrashing.

Example: Say we have 𝑛𝑛 items and need to count all pairs of them.
𝑛𝑛2
We must save about integers, each 4 bytes. So 2𝑛𝑛2 bytes.
2

If we have 2 GB main memory, say 231 bytes, then 𝑛𝑛 ≤ 215 ≈ 33000.

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 28
Triangular-Matrix Method
Let 𝑛𝑛 be the number of all items of which we want to count pairs.

• Use one-dimensional array 𝑎𝑎.

• 𝑎𝑎 𝑘𝑘 contains count of the 𝑘𝑘 -th pair of items.

• The ordering of the item pairs is as follows:


1, 2 , 1, 3 , … , 1, 𝑛𝑛 , 2, 3 , 2, 4 , … , 2, 𝑛𝑛 , … , 𝑛𝑛 − 1, 𝑛𝑛

Formally: 𝑎𝑎 𝑘𝑘 contains the count of the pair 𝑖𝑖, 𝑗𝑗 , with 1 ≤ 𝑖𝑖 < 𝑗𝑗 ≤ 𝑛𝑛,
𝑖𝑖
where 𝑘𝑘 = 𝑖𝑖 − 1 𝑛𝑛 − + 𝑗𝑗 − 𝑖𝑖.
2

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 29
Triples Method
Let 𝑛𝑛 be the number of all items of which we want to count pairs.
Let 1 ≤ 𝑖𝑖 < 𝑗𝑗 ≤ 𝑛𝑛.

• Store counts as triples 𝑖𝑖, 𝑗𝑗, 𝑐𝑐 where 𝑐𝑐 equals the count of 𝑖𝑖, 𝑗𝑗 .

Advantage: No need to store pairs with count 𝑐𝑐 equal to 0.

Disadvantage: For each pair we use 3 integers instead of just 1.


1
Hence: If fewer than of all possible pairs appear in some basket,
3
then use the Triples Method.

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 30
Naïve algorithm for counting frequent itemsets
Say our main memory is gigantic and will not limit us.

1. Read the whole data (list of baskets) into main memory.

2a. Use 2 nested loops to generate all pairs of items.


(Use 𝑘𝑘 nested loops to generate and count all 𝑘𝑘-element itemsets.)

2b. Use Triangular-Matrix Method or Triples Method to store counts.

3. Examine which pairs / 𝑘𝑘-tuples have count bigger than the support
threshold 𝑠𝑠.  yields frequent itemsets

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 31
Computational model

Remember: In general our dataset is too big to fit into main memory.

The time for reading data into main memory dominates the time for
generating / counting the relevant pairs or small 𝑘𝑘-element itemsets.

Hence, we measure the cost of computation by the number of


passes an algorithm makes over the data.

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 32
Monotonicity of frequent itemsets

Observation: If an itemset is frequent, then so is every subset of it.

The A-Priori Algorithm utilises this simple observation.

Definition: We call a frequent itemset maximal if no proper superset


of it is frequent.

Saves space later: Only store maximal frequent itemsets.

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 33
A-Priori Algorithm (for item pairs)
Uses 2 passes through the data, but requires less main memory.

Pass 1: Read data (list of baskets) and only store count of each
individual item, i.e. of singleton itemsets {𝑖𝑖}.

In main memory:

• Hash table (item names to integers in range 1 to 𝑛𝑛)


• Count for each item (stored as an array).

 Required space proportional to the number of items.

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 34
A-Priori Algorithm (for item pairs)
Before pass 2:

Say we had found 𝑚𝑚 frequent singleton itemsets after pass 1.


𝑛𝑛
( 𝑚𝑚 ≈ could be reasonable.)
100

Create new array (frequent singletons table), indexed 1 to 𝑛𝑛, where


• non-frequent items are mapped to 0, and
• each frequent item is mapped to a unique integer between 1 and 𝑚𝑚.

In main memory:
• Hash table (item names to integers in range 1 to 𝑛𝑛)
• Frequent singletons table
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 35
A-Priori Algorithm (for item pairs)
Pass 2: Read data again and only store pairs of items both of
which are frequent (utilise frequent singletons table).

By monotonicity: we do not miss any pair of frequent items!

In main memory:

• Hash table (item names to integers)


• Frequent singletons table
• Count for pairs of items (via Triangular-Matrix or Triples Method)

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 36
A-Priori Algorithm (for item pairs)

picture: from MMDS, p. 219

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 37
A-Priori Algorithm (for k-element itemsets)
• Uses 𝑘𝑘 passes through data.

• In pass 𝑖𝑖:
generate set of candidates 𝐶𝐶𝑖𝑖 for frequent 𝑖𝑖-element
itemsets and store counts.

To do this, use:
- frequent (𝑖𝑖 − 1)-element itemset table that lists 𝐿𝐿𝑖𝑖−1 ,
the frequent (𝑖𝑖 − 1)-element itemsets, and
- frequent singletons table 𝐿𝐿1 .

• Between pass 𝑖𝑖 and 𝑖𝑖 + 1:


create frequent 𝑖𝑖-element itemset table.
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 38
A-Priori Algorithm (for k-element itemsets)

picture: from MMDS, p. 220

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 39
PCY-Algorithm (named after Park, Chen and Yu)
• A-Priori works fine unless generating 𝐶𝐶2 causes thrashing.

• Rough idea:
• Utilise unused main memory in pass 1 for additional hash table.
» Further reduce number of item pairs that are created.

• Example / Note:

Say we have 1.000.000 items and several GB of main memory.


Then we need not even 10% of main memory for:
• hash table (item names to integers)
• array for counting (singleton) item frequency
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 40
PCY-Algorithm (for pairs)
• Fill left over main memory with a hash table (integer array).

• The buckets correspond to integers (counts).

• Use as many buckets as possible for the table.

• Hash pairs of items to buckets.

During pass 1 of A-Priori:


• Also initialise hash table / array with all zeros.

• Beside counting all singleton itemsets, also generate all pairs,


BUT instead of storing them: hash them and increase count of
bucket to which they hash by 1.
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 41
PCY-Algorithm (for pairs)

Idea to improve the second pass:

If a bucket is not frequent, then clearly no pair in it can be frequent.

Hence, define candidate set 𝐶𝐶2 of pairs {𝑖𝑖, 𝑗𝑗} via:

(a) {𝑖𝑖} and {𝑗𝑗} are both frequent singleton itemsets

(b) {𝒊𝒊, 𝒋𝒋} hashes to a frequent bucket, i.e. whose


count is bigger than our threshold 𝒔𝒔.

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 42
PCY-Algorithm (for pairs)

Between pass 1 and pass 2:

Summarize hash table (integer counts) via bitmap:


» 1 if bucket is frequent
» 0 if bucket is not frequent.

Hence 4 byte = 32 bits integer counts are replaced by 1 bit each.

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 43
PCY-Algorithm (for pairs)

picture: from MMDS, p. 231

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 44
PCY-Algorithm (for pairs)
Advantages and disadvantages:

(+) We save memory if we have (many) buckets that are infrequent.

(-) We are forced to store counts for pairs via Triples-Method.

 In A-Priori: renumbered frequent singleton itemsets


(form 1 to 𝑚𝑚 instead to 𝑛𝑛) to save space.

 Now pairs in infrequent buckets could be rather randomly


distributed within the lexicographic order.

Hence: PCY only saves memory compared to A-Priori


if 2⁄3 of all candidate pairs can be discarded.
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 45
Refinement of PCY: The Multistage Algorithm

picture: from MMDS, p. 233

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 46
Refinement of PCY: The Multihash Algorithm

picture: from MMDS, p. 235

26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 47

You might also like