Computational Tools DTU Presentation Week5

Computational Tools for Data Science
Week 5:
Frequent itemsets
26 September 2023 DTU Compute Computational Tools for Data Science - Week 5 - Karl Heuer 1
Motivation: Market-Basket Analysis
• Setting: Supermarket
• Task: Analyse shopping baskets of customers
• Goal: Find sets of items that appear often, i.e. in many baskets
• Why:
– See demand, hence we can adjust the price.
– Find patterns (association rules) in the data and make more
tricky sales.
Prime example:
Via data analysis: {Diapers, Beer} is a frequent pair of items
Not expected / interesting
Explanation: new parents rarely go into bars, they drink at home.
Sales trick: make a sale on diapers, but secretly increase beer price
Other examples:
• Baskets correspond to patients’s infos (biomarkers, diseases)
 Frequent items with a common disease suggest testing.
• Baskets are sets of sentences, items are (text) documents.
Note: Items need not be `in´ baskets!

Items and baskets are in some many-to-many relation.
Here: item in basket if the document contains the sentence.
Frequent pair  2 documents that share many sentences (plagiarism)

In general:
• there are many (!) items

• there are also many (!) baskets
• baskets correspond to sets of items

• size of a basket is small compared to the number of all items
• our relevant data: sequence / list of baskets

» problem: often too big to fit into main memory
In general:
Often only interested in frequent pairs, or triples.
𝑛𝑛 𝑛𝑛(𝑛𝑛−1) 𝑛𝑛2
Given 𝑛𝑛 items, there are 2
= ≈ pairs of items.
2 2
𝑛𝑛 𝑛𝑛! 𝑛𝑛𝑘𝑘
Given 𝑛𝑛 items, there are 𝑘𝑘
= ≈ 𝑘𝑘-element itemsets.
𝑛𝑛−𝑘𝑘 !⋅𝑘𝑘! 𝑘𝑘!
Practice:
• Too much data for considering 𝑘𝑘-element itemsets for big 𝑘𝑘.
• Sales offers: can only handle few frequent pairs.
»There will be even fewer frequent triples, etc.
Measures: (Support)
Let 𝑰𝑰 be a set of items.
# 𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏 𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 𝐼𝐼
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐼𝐼 =
𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛𝑛 𝑜𝑜𝑜𝑜 𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏
Meaning: Probability of finding all items of 𝐼𝐼 in one basket.
Definition: We call an itemset 𝑰𝑰 frequent, if its support is greater than

some initially fixed value 𝒔𝒔 ∈ ℝ.
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐼𝐼 ≥ 𝑠𝑠
Measures for association rules: (Confidence)
• Let 𝑰𝑰 be a set of items
• Let 𝒋𝒋 be an item that is not in 𝑰𝑰.
• Let 𝑰𝑰 → 𝒋𝒋 be an association rule.
# 𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏 𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 𝐼𝐼 ∪ {𝑗𝑗}

𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐼𝐼 → 𝑗𝑗 =
# 𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏 𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 𝐼𝐼
Meaning: Probability of having all of 𝐼𝐼 ∪ {𝑗𝑗} in one basket, given that

we already have all of 𝐼𝐼 in the basket.
Example:
Basket # Items in the baskets

1 {Butter, Bread, Milk, Cola, Cheese, Toothbrush}
2 {Butter, Bread, Milk, Cola}
3 {Bread, Milk, Toothbrush}
4 {Butter, Bread, Onion}
5 {Flour, Milk, Cola, Cheese}
6 {Flour, Butter, Milk, Cola}
7 {Flour, Milk, Toothbrush}
8 {Flour, Onion, Toothbrush}
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 0,75

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇 = 0,5

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 = 0,25
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5 𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵, 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,375


𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5 𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀, 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 = 0,25

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5 𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀, 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 = 0,25
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 0,75 𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀, 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇 = 0,375
𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵} → 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,75

𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶} → 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 1

𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶} → 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 1
𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇} → 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 0,75
Measures for association rules: (Lift)
𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶(𝐼𝐼 → 𝑗𝑗)
𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐼𝐼 → 𝑗𝑗 =
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆({𝑗𝑗})
Meaning: Probability of having 𝑗𝑗 in a basket given that we already

have all of 𝐼𝐼 in the basket divided by the probability of having 𝑗𝑗 in a
basket.
Measures for association rules: (Lift)
𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶(𝐼𝐼 → 𝑗𝑗)
𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐼𝐼 → 𝑗𝑗 =
𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆({𝑗𝑗})
Note:
• If 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐼𝐼 → 𝑗𝑗 is close to 1, then “not really interesting”.
• If 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐼𝐼 → 𝑗𝑗 ≫ 1, then 𝐼𝐼 encourages to buy 𝑗𝑗.
• If 𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐼𝐼 → 𝑗𝑗 ≪ 1, then 𝐼𝐼 discourages to buy 𝑗𝑗.
Measures for association rules: (Interest)
𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼 𝐼𝐼 → 𝑗𝑗 = 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐼𝐼 → 𝑗𝑗 − 𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆({𝑗𝑗})
Note:
• If 𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼 𝐼𝐼 → 𝑗𝑗 is close to 0, then “not really interesting”.
• If 𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼 𝐼𝐼 → 𝑗𝑗 is highly positive, then 𝐼𝐼 encourages to buy 𝑗𝑗.
• If 𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼 𝐼𝐼 → 𝑗𝑗 is highly negative, then 𝐼𝐼 discourages to buy 𝑗𝑗.

𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵} → 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 1,5
𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇} → 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 0,75

𝐿𝐿𝐿𝐿𝐿𝐿𝐿𝐿 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇} → 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 1
Finding frequent itemsets
Finding / determining interesting association rules is easy compared

to finding frequent itemsets (what we have to do first anyway).
Challenges:
• Huge amount of data.  Not enough main memory.
Focus:
• How to find frequent itemsets (algorithms).
• How handle (main) memory.
Store itemset count in main memory
For each frequent itemset algorithm we have to store the count for
the itemsets.
(!) Use integers (4 byte) instead of item names / long IDs.

 via a hash table.
Store itemset count in main memory
For each frequent itemset algorithm we have to store the count for
the itemsets.
Limit on how many items we can deal with without thrashing.
Example: Say we have 𝑛𝑛 items and need to count all pairs of them.
𝑛𝑛2
We must save about integers, each 4 bytes. So 2𝑛𝑛2 bytes.
2
If we have 2 GB main memory, say 231 bytes, then 𝑛𝑛 ≤ 215 ≈ 33000.
Triangular-Matrix Method
Let 𝑛𝑛 be the number of all items of which we want to count pairs.
• Use one-dimensional array 𝑎𝑎.
• 𝑎𝑎 𝑘𝑘 contains count of the 𝑘𝑘 -th pair of items.
• The ordering of the item pairs is as follows:

1, 2 , 1, 3 , … , 1, 𝑛𝑛 , 2, 3 , 2, 4 , … , 2, 𝑛𝑛 , … , 𝑛𝑛 − 1, 𝑛𝑛
Formally: 𝑎𝑎 𝑘𝑘 contains the count of the pair 𝑖𝑖, 𝑗𝑗 , with 1 ≤ 𝑖𝑖 < 𝑗𝑗 ≤ 𝑛𝑛,
𝑖𝑖
where 𝑘𝑘 = 𝑖𝑖 − 1 𝑛𝑛 − + 𝑗𝑗 − 𝑖𝑖.
2
Triples Method
Let 𝑛𝑛 be the number of all items of which we want to count pairs.
Let 1 ≤ 𝑖𝑖 < 𝑗𝑗 ≤ 𝑛𝑛.
• Store counts as triples 𝑖𝑖, 𝑗𝑗, 𝑐𝑐 where 𝑐𝑐 equals the count of 𝑖𝑖, 𝑗𝑗 .
Advantage: No need to store pairs with count 𝑐𝑐 equal to 0.
Disadvantage: For each pair we use 3 integers instead of just 1.

1
Hence: If fewer than of all possible pairs appear in some basket,
3
then use the Triples Method.
Naïve algorithm for counting frequent itemsets
Say our main memory is gigantic and will not limit us.
1. Read the whole data (list of baskets) into main memory.
2a. Use 2 nested loops to generate all pairs of items.

(Use 𝑘𝑘 nested loops to generate and count all 𝑘𝑘-element itemsets.)
2b. Use Triangular-Matrix Method or Triples Method to store counts.
3. Examine which pairs / 𝑘𝑘-tuples have count bigger than the support
threshold 𝑠𝑠.  yields frequent itemsets
Computational model
Remember: In general our dataset is too big to fit into main memory.
The time for reading data into main memory dominates the time for
generating / counting the relevant pairs or small 𝑘𝑘-element itemsets.
Hence, we measure the cost of computation by the number of

passes an algorithm makes over the data.
Monotonicity of frequent itemsets
Observation: If an itemset is frequent, then so is every subset of it.
The A-Priori Algorithm utilises this simple observation.
Definition: We call a frequent itemset maximal if no proper superset

of it is frequent.
Saves space later: Only store maximal frequent itemsets.
A-Priori Algorithm (for item pairs)
Uses 2 passes through the data, but requires less main memory.
Pass 1: Read data (list of baskets) and only store count of each
individual item, i.e. of singleton itemsets {𝑖𝑖}.
In main memory:
• Hash table (item names to integers in range 1 to 𝑛𝑛)

• Count for each item (stored as an array).
 Required space proportional to the number of items.
Before pass 2:
Say we had found 𝑚𝑚 frequent singleton itemsets after pass 1.

𝑛𝑛
( 𝑚𝑚 ≈ could be reasonable.)
100
Create new array (frequent singletons table), indexed 1 to 𝑛𝑛, where

• non-frequent items are mapped to 0, and
• each frequent item is mapped to a unique integer between 1 and 𝑚𝑚.
In main memory:
• Hash table (item names to integers in range 1 to 𝑛𝑛)
• Frequent singletons table
Pass 2: Read data again and only store pairs of items both of
which are frequent (utilise frequent singletons table).
By monotonicity: we do not miss any pair of frequent items!
In main memory:
• Hash table (item names to integers)

• Frequent singletons table
• Count for pairs of items (via Triangular-Matrix or Triples Method)
picture: from MMDS, p. 219
A-Priori Algorithm (for k-element itemsets)
• Uses 𝑘𝑘 passes through data.
• In pass 𝑖𝑖:
generate set of candidates 𝐶𝐶𝑖𝑖 for frequent 𝑖𝑖-element
itemsets and store counts.
To do this, use:
- frequent (𝑖𝑖 − 1)-element itemset table that lists 𝐿𝐿𝑖𝑖−1 ,
the frequent (𝑖𝑖 − 1)-element itemsets, and
- frequent singletons table 𝐿𝐿1 .
• Between pass 𝑖𝑖 and 𝑖𝑖 + 1:

create frequent 𝑖𝑖-element itemset table.
A-Priori Algorithm (for k-element itemsets)
PCY-Algorithm (named after Park, Chen and Yu)
• A-Priori works fine unless generating 𝐶𝐶2 causes thrashing.
• Rough idea:
• Utilise unused main memory in pass 1 for additional hash table.
» Further reduce number of item pairs that are created.
• Example / Note:
Say we have 1.000.000 items and several GB of main memory.

Then we need not even 10% of main memory for:
• hash table (item names to integers)
• array for counting (singleton) item frequency
PCY-Algorithm (for pairs)
• Fill left over main memory with a hash table (integer array).
• The buckets correspond to integers (counts).
• Use as many buckets as possible for the table.
• Hash pairs of items to buckets.
During pass 1 of A-Priori:

• Also initialise hash table / array with all zeros.
• Beside counting all singleton itemsets, also generate all pairs,

BUT instead of storing them: hash them and increase count of
bucket to which they hash by 1.
Idea to improve the second pass:
If a bucket is not frequent, then clearly no pair in it can be frequent.
Hence, define candidate set 𝐶𝐶2 of pairs {𝑖𝑖, 𝑗𝑗} via:
(a) {𝑖𝑖} and {𝑗𝑗} are both frequent singleton itemsets
(b) {𝒊𝒊, 𝒋𝒋} hashes to a frequent bucket, i.e. whose

count is bigger than our threshold 𝒔𝒔.
Between pass 1 and pass 2:
Summarize hash table (integer counts) via bitmap:

» 1 if bucket is frequent
» 0 if bucket is not frequent.
Hence 4 byte = 32 bits integer counts are replaced by 1 bit each.
Advantages and disadvantages:
(+) We save memory if we have (many) buckets that are infrequent.
(-) We are forced to store counts for pairs via Triples-Method.
 In A-Priori: renumbered frequent singleton itemsets

(form 1 to 𝑚𝑚 instead to 𝑛𝑛) to save space.
 Now pairs in infrequent buckets could be rather randomly

distributed within the lexicographic order.
Hence: PCY only saves memory compared to A-Priori

if 2⁄3 of all candidate pairs can be discarded.
Refinement of PCY: The Multistage Algorithm
Refinement of PCY: The Multihash Algorithm

Computational Tools DTU Presentation Week5

Uploaded by

Copyright:

Available Formats

You might also like

Computational Tools DTU Presentation Week5

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Computational Tools DTU Presentation Week5

Uploaded by

Copyright:

Available Formats

Computational Tools for Data Science

• Task: Analyse shopping baskets of customers

Via data analysis: {Diapers, Beer} is a frequent pair of items

Not expected / interesting

Explanation: new parents rarely go into bars, they drink at home.

• Baskets correspond to patients’s infos (biomarkers, diseases)

 Frequent items with a common disease suggest testing.

• Baskets are sets of sentences, items are (text) documents.

Note: Items need not be `in´ baskets!

Here: item in basket if the document contains the sentence.

Frequent pair  2 documents that share many sentences (plagiarism)

• there are many (!) items

• baskets correspond to sets of items

• our relevant data: sequence / list of baskets

Often only interested in frequent pairs, or triples.

Meaning: Probability of finding all items of 𝐼𝐼 in one basket.

Definition: We call an itemset 𝑰𝑰 frequent, if its support is greater than

# 𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏𝑏 𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐 𝐼𝐼 ∪ {𝑗𝑗}

Meaning: Probability of having all of 𝐼𝐼 ∪ {𝑗𝑗} in one basket, given that

Basket # Items in the baskets

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5 𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵, 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,375

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5 𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵, 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,375

𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,5 𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵, 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,375

𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵} → 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,75

𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵} → 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,75

𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵} → 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,75

Meaning: Probability of having 𝑗𝑗 in a basket given that we already

𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼𝐼 𝐼𝐼 → 𝑗𝑗 = 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐼𝐼 → 𝑗𝑗 − 𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆𝑆({𝑗𝑗})

𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵} → 𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵𝐵 = 0,75

𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇𝑇} → 𝑀𝑀𝑀𝑀𝑀𝑀𝑀𝑀 = 0,75

Finding / determining interesting association rules is easy compared

(!) Use integers (4 byte) instead of item names / long IDs.

Limit on how many items we can deal with without thrashing.

If we have 2 GB main memory, say 231 bytes, then 𝑛𝑛 ≤ 215 ≈ 33000.

• Use one-dimensional array 𝑎𝑎.

• 𝑎𝑎 𝑘𝑘 contains count of the 𝑘𝑘 -th pair of items.

• The ordering of the item pairs is as follows:

Advantage: No need to store pairs with count 𝑐𝑐 equal to 0.

Disadvantage: For each pair we use 3 integers instead of just 1.

1. Read the whole data (list of baskets) into main memory.

2a. Use 2 nested loops to generate all pairs of items.

2b. Use Triangular-Matrix Method or Triples Method to store counts.

Hence, we measure the cost of computation by the number of

Observation: If an itemset is frequent, then so is every subset of it.

The A-Priori Algorithm utilises this simple observation.

Definition: We call a frequent itemset maximal if no proper superset

Saves space later: Only store maximal frequent itemsets.

• Hash table (item names to integers in range 1 to 𝑛𝑛)

 Required space proportional to the number of items.

Say we had found 𝑚𝑚 frequent singleton itemsets after pass 1.

Create new array (frequent singletons table), indexed 1 to 𝑛𝑛, where

By monotonicity: we do not miss any pair of frequent items!