Professional Documents
Culture Documents
Computational Tools DTU Presentation Week5
Computational Tools DTU Presentation Week5
Computational Tools DTU Presentation Week5
Week 5:
Frequent itemsets
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 1
Motivation: Market-Basket Analysis
• Setting: Supermarket
• Goal: Find sets of items that appear often, i.e. in many baskets
• Why:
– See demand, hence we can adjust the price.
– Find patterns (association rules) in the data and make more
tricky sales.
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 2
Motivation: Market-Basket Analysis
Prime example:
Sales trick: make a sale on diapers, but secretly increase beer price
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 3
Motivation: Market-Basket Analysis
Other examples:
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 5
Motivation: Market-Basket Analysis
In general:
𝑛𝑛 𝑛𝑛(𝑛𝑛−1) 𝑛𝑛2
Given 𝑛𝑛 items, there are 2
= ≈ pairs of items.
2 2
𝑛𝑛 𝑛𝑛! 𝑛𝑛𝑘𝑘
Given 𝑛𝑛 items, there are 𝑘𝑘
= ≈ 𝑘𝑘-element itemsets.
𝑛𝑛−𝑘𝑘 !⋅𝑘𝑘! 𝑘𝑘!
Practice:
• Too much data for considering 𝑘𝑘-element itemsets for big 𝑘𝑘.
• Sales offers: can only handle few frequent pairs.
»There will be even fewer frequent triples, etc.
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 6
Measures: (Support)
Let 𝑰𝑰 be a set of items.
# 𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏 𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 𝐼𝐼
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐼𝐼 =
𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛 𝑜𝑜𝑜𝑜 𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 7
Measures for association rules: (Confidence)
• Let 𝑰𝑰 be a set of items
• Let 𝒋𝒋 be an item that is not in 𝑰𝑰.
• Let 𝑰𝑰 → 𝒋𝒋 be an association rule.
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 8
Motivation: Market-Basket Analysis
Example:
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 9
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 10
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 11
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 12
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 13
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 18
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 19
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 20
Measures for association rules: (Lift)
• Let 𝑰𝑰 be a set of items
• Let 𝒋𝒋 be an item that is not in 𝑰𝑰.
• Let 𝑰𝑰 → 𝒋𝒋 be an association rule.
𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶(𝐼𝐼 → 𝑗𝑗)
𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐼𝐼 → 𝑗𝑗 =
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆({𝑗𝑗})
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 21
Measures for association rules: (Lift)
• Let 𝑰𝑰 be a set of items
• Let 𝒋𝒋 be an item that is not in 𝑰𝑰.
• Let 𝑰𝑰 → 𝒋𝒋 be an association rule.
𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶(𝐼𝐼 → 𝑗𝑗)
𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐼𝐼 → 𝑗𝑗 =
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆({𝑗𝑗})
Note:
• If 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐼𝐼 → 𝑗𝑗 is close to 1, then “not really interesting”.
• If 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐼𝐼 → 𝑗𝑗 ≫ 1, then 𝐼𝐼 encourages to buy 𝑗𝑗.
• If 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐼𝐼 → 𝑗𝑗 ≪ 1, then 𝐼𝐼 discourages to buy 𝑗𝑗.
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 22
Measures for association rules: (Interest)
• Let 𝑰𝑰 be a set of items
• Let 𝒋𝒋 be an item that is not in 𝑰𝑰.
• Let 𝑰𝑰 → 𝒋𝒋 be an association rule.
Note:
• If 𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼 𝐼𝐼 → 𝑗𝑗 is close to 0, then “not really interesting”.
• If 𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼 𝐼𝐼 → 𝑗𝑗 is highly positive, then 𝐼𝐼 encourages to buy 𝑗𝑗.
• If 𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼 𝐼𝐼 → 𝑗𝑗 is highly negative, then 𝐼𝐼 discourages to buy 𝑗𝑗.
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 23
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 24
Basket # Items in the baskets
1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 25
Finding frequent itemsets
Challenges:
• Huge amount of data. Not enough main memory.
Focus:
• How to find frequent itemsets (algorithms).
• How handle (main) memory.
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 26
Store itemset count in main memory
For each frequent itemset algorithm we have to store the count for
the itemsets.
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 27
Store itemset count in main memory
For each frequent itemset algorithm we have to store the count for
the itemsets.
Example: Say we have 𝑛𝑛 items and need to count all pairs of them.
𝑛𝑛2
We must save about integers, each 4 bytes. So 2𝑛𝑛2 bytes.
2
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 28
Triangular-Matrix Method
Let 𝑛𝑛 be the number of all items of which we want to count pairs.
Formally: 𝑎𝑎 𝑘𝑘 contains the count of the pair 𝑖𝑖, 𝑗𝑗 , with 1 ≤ 𝑖𝑖 < 𝑗𝑗 ≤ 𝑛𝑛,
𝑖𝑖
where 𝑘𝑘 = 𝑖𝑖 − 1 𝑛𝑛 − + 𝑗𝑗 − 𝑖𝑖.
2
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 29
Triples Method
Let 𝑛𝑛 be the number of all items of which we want to count pairs.
Let 1 ≤ 𝑖𝑖 < 𝑗𝑗 ≤ 𝑛𝑛.
• Store counts as triples 𝑖𝑖, 𝑗𝑗, 𝑐𝑐 where 𝑐𝑐 equals the count of 𝑖𝑖, 𝑗𝑗 .
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 30
Naïve algorithm for counting frequent itemsets
Say our main memory is gigantic and will not limit us.
3. Examine which pairs / 𝑘𝑘-tuples have count bigger than the support
threshold 𝑠𝑠. yields frequent itemsets
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 31
Computational model
Remember: In general our dataset is too big to fit into main memory.
The time for reading data into main memory dominates the time for
generating / counting the relevant pairs or small 𝑘𝑘-element itemsets.
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 32
Monotonicity of frequent itemsets
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 33
A-Priori Algorithm (for item pairs)
Uses 2 passes through the data, but requires less main memory.
Pass 1: Read data (list of baskets) and only store count of each
individual item, i.e. of singleton itemsets {𝑖𝑖}.
In main memory:
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 34
A-Priori Algorithm (for item pairs)
Before pass 2:
In main memory:
• Hash table (item names to integers in range 1 to 𝑛𝑛)
• Frequent singletons table
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 35
A-Priori Algorithm (for item pairs)
Pass 2: Read data again and only store pairs of items both of
which are frequent (utilise frequent singletons table).
In main memory:
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 36
A-Priori Algorithm (for item pairs)
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 37
A-Priori Algorithm (for k-element itemsets)
• Uses 𝑘𝑘 passes through data.
• In pass 𝑖𝑖:
generate set of candidates 𝐶𝐶𝑖𝑖 for frequent 𝑖𝑖-element
itemsets and store counts.
To do this, use:
- frequent (𝑖𝑖 − 1)-element itemset table that lists 𝐿𝐿𝑖𝑖−1 ,
the frequent (𝑖𝑖 − 1)-element itemsets, and
- frequent singletons table 𝐿𝐿1 .
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 39
PCY-Algorithm (named after Park, Chen and Yu)
• A-Priori works fine unless generating 𝐶𝐶2 causes thrashing.
• Rough idea:
• Utilise unused main memory in pass 1 for additional hash table.
» Further reduce number of item pairs that are created.
• Example / Note:
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 42
PCY-Algorithm (for pairs)
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 43
PCY-Algorithm (for pairs)
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 44
PCY-Algorithm (for pairs)
Advantages and disadvantages:
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 46
Refinement of PCY: The Multihash Algorithm
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 47