Professional Documents
Culture Documents
TM3 ch06 Frequent Itemsets
TM3 ch06 Frequent Itemsets
Frequent Itemsets
(Chapter 6 of MMDS book)
DAMA60 Group
School of Science and Technology
Hellenic Open University
Content
• Preliminaries
• Challenges
• Reminder: Apriori
2
Finding Frequent Itemsets
Itemsets: Computation Model
• Typically, data is kept in flat files Item
rather than in a database system: Item
Item
• Stored on disk Item
• Stored basket-by-basket Item
• Baskets are small but we have Item
many baskets and many items Item
Item
• Expand baskets into pairs, triples, etc.
Item
as you read baskets
Item
• Use k nested loops to generate all
Item
sets of size k
Item
Etc.
Note: We want to find frequent itemsets. To find
them, we have to count them. To count them, we
have to generate them.
Items are positive integers,
and boundaries between
baskets are –1.
4
Computation Model
5
Main-Memory Bottleneck
6
Finding Frequent Pairs
• The hardest problem often turns out to be finding the
frequent pairs of items {i1, i2}
• Why? Freq. pairs are common, freq. triples are rare
• Why? Probability of being frequent drops exponentially
with size; number of sets grows more slowly with size
• The approach:
• We always need to generate all the itemsets.
• But we would only like to count (keep track) of those
itemsets that in the end turn out to be frequent.
7
Naïve Algorithm
• Read file once, counting in main memory
the occurrences of each pair:
• From each basket of n items, generate its n(n-1)/2 pairs by two nested
loops
• Fails if (#items)2 exceeds main memory
• Remember: #items can be 100K (Wal-Mart) or 10B (Web pages)
• Suppose 105 items, counts are 4-byte integers
• Number of pairs of items: 105(105-1)/2 = 5*109
• Therefore, 2*1010 (20 gigabytes) of memory needed
8
Counting Pairs in Memory
Two approaches:
• Approach 1: Count all pairs using an item-item frequency
matrix
• Approach 2: Keep a table of triples [i, j, c] = “the count of the
pair of items {i, j} is c.”
• If integers and item ids are 4 bytes, we need approximately 12 bytes
for pairs with count > 0
• Plus some additional overhead for the hash table
Note:
• Approach 1 requires only 4 bytes per pair
• Approach 2 uses 12 bytes per pair
(but only for pairs with count > 0)
9
Comparing the 2 Approaches
10
Comparing the two approaches
• Approach 1: Triangular Matrix
• n = total number items
• Count pair of items {i, j} only if i < j
• Keep pair counts in lexicographic order:
• {1,2}, {1,3},…, {1,n}, {2,3}, {2,4},…,{2,n}, {3,4},…
𝑖
• Pair {i, j} is at position (𝑖 – 1)(𝑛– 2) + 𝑗 – 𝑖
• Total number of pairs 𝒏(𝒏−𝟏)ൗ𝟐; For large n, we can assume that this
𝟐
number equals to 𝒏 ൗ𝟐
• Triangular Matrix requires 4 bytes per pair, thus the total bytes it
consumers equals to 𝟐𝒏𝟐
11
Comparing the two approaches
12
Reminder: A-Priori Algorithm
A-Priori Algorithm – (1)
14
A-Priori Algorithm – (2)
15
Main-Memory: Picture of A-Priori
Frequent items
Item counts
Main memory
Counts of
pairs of
frequent items
(candidate
pairs)
Pass 1 Pass 2
16
Detail for A-Priori
• You can use the triangular matrix method with the number of frequent
items (m < n).
• May save space compared with storing triples
• Trick: re-number frequent items 1,2,…, m and keep a table relating new
numbers to original item numbers
• The new space required is 𝟐𝒎𝟐
Counts of pairs
Counts
of of
frequent
pairs of
items
frequent items
Pass 1 Pass 2
17
A-Priori Algorithm
• Let k=1 Fk: frequent k-itemsets
Lk: candidate k-itemsets
• Generate F1 = {frequent 1-itemsets}
18
Frequent Triples and more
19
The A-Priori Algorithm — Example
Minimum Support = 2
20
PCY (Park-Chen-Yu) Algorithm
PCY (Park-Chen-Yu) Algorithm
• Observation:
In pass 1 of A-Priori, most memory is not used
• We store only individual item counts
• Can we use the unused memory to reduce memory required in pass
2?
23
PCY Algorithm – First Pass
FOR (each basket) :
FOR (each item in the basket) :
add 1 to item’s count;
FOR (each pair of items) :
New
in hash the pair to a bucket;
PCY
add 1 to the count for that bucket;
24
Observations about Buckets
• If a bucket contains a frequent pair, then the bucket is
surely frequent
• However, even without any frequent pair,
a bucket can still be frequent
• So, we cannot use the hash to eliminate any
member (pair) of a “frequent” bucket
• But, for a bucket with total count less than s,
none of its pairs can be frequent ☺
• Pairs that hash to this bucket can be eliminated as
candidates (even if the pair consists of 2 frequent
items)
• Pass 2:
Only count pairs that hash to frequent buckets
25
PCY Algorithm – Between Passes
• Replace the buckets by a Bitmap:
• 1 means the bucket count exceeded the support s (call it a
frequent bucket);
• 0 means it did not
26
PCY Algorithm – Pass 2
• Count all pairs {i, j} that meet the
conditions for being a candidate pair:
27
Main-Memory: Picture of PCY
Bitmap
Main memory
Hash
Hash table
table Counts of
for pairs candidate
pairs
Pass 1 Pass 2
28
Main-Memory Details
•Buckets require a few bytes each:
• Note: we do not have to count past s
• #buckets is O(main-memory size)
29
Example of PCY
● Here is a collection of twelve {1, 2, 3}
baskets. {1, 3, 5}
{3, 5, 6}
● Each contains three of the six items {2, 3, 4}
{2, 4, 6}
1 through 6. {1, 2, 4}
● Suppose the support threshold is 4. {3, 4, 5}
{1, 3, 4}
● On the first pass of the PCY {2, 3, 5}
Algorithm we use a hash table with {4, 5, 6}
{2, 4, 5}
11 buckets {3, 4, 6}
○ The set {i, j} is hashed to bucket
Item 1 2 3 4 5 6
{1, 2, 3}
support 4 6 8 8 6 4
{1, 3, 5}
{3, 5, 6}
pair {1, 2} {1, 3} {1, 4} {1, 5} {2, 3} {2, 4} {2, 5}
{2, 3, 4}
support 2 3 2 1 3 4 2
{2, 4, 6}
{1, 2, 4}
pair {2, 6} {3, 4} {3, 5} {3, 6} {4, 5} {4, 6} {5, 6}
{3, 4, 5}
support 1 4 4 2 3 3 2
{1, 3, 4}
{2, 3, 5}
{4, 5, 6}
{2, 4, 5}
{3, 4, 6}
31
Support of items and pairs
Item 1 2 3 4 5 6 {1, 2, 3}
Support 4 6 8 8 6 4
{1, 3, 5}
{3, 5, 6}
{2, 3, 4}
Support of pairs j {2, 4, 6}
(i, j) 2 3 4 5 6 {1, 2, 4}
1 2 3 2 1 0 {3, 4, 5}
2 3 4 2 1
{1, 3, 4}
{2, 3, 5}
i 3 4 4 2
{4, 5, 6}
4 3 3 {2, 4, 5}
5 2 {3, 4, 6}
32
Pairs hashing to buckets
{1, 2, 3}
{1, 3, 5}
{3, 5, 6}
pair {1, 2} {1, 3} {1, 4} {1, 5} {1, 6} {2, 3} {2, 4} {2, 5}
{2, 3, 4}
bucket 2 3 4 5 6 6 8 10
{2, 4, 6}
{1, 2, 4}
pair {2, 6} {3, 4} {3, 5} {3, 6} {4, 5} {4, 6} {5, 6}
{3, 4, 5}
bucket 1 1 4 7 9 2 8 {1, 3, 4}
{2, 3, 5}
{4, 5, 6}
{2, 4, 5}
{3, 4, 6}
Support of items and pairs
Support of pairs j
(i, j) 2 3 4 5 6
1 2 3 2 1 0
{1, 2, 3}
{1, 3, 5}
2 3 4 2 1
{3, 5, 6}
i 3 4 4 2 {2, 3, 4}
4 3 3 {2, 4, 6}
5 2 {1, 2, 4}
{3, 4, 5}
j
Bucket of pairs
(i, j)
{1, 3, 4}
2 3 4 5 6 {2, 3, 5}
1
(1x2) mod
11 = 2 3 4 5 6 {4, 5, 6}
2 6 8 10 1 {2, 4, 5}
i 3 1 4 7
{3, 4, 6}
4 9 2
5 8
34
Frequent buckets
(i, j) j (i, j) j
Bucket Support 2 3 4 5 6
{1, 2, 3}
2 3 4 5 6
{1, 3, 5}
1 2 3 4 5 6 1 2 3 2 1 0
{3, 5, 6}
2 3 4 2 1
2 6 8 10 1 {2, 3, 4}
i 3 1 4 7 i 3 4 4 2 {2, 4, 6}
4 9 2 4 3 3 {1, 2, 4}
5 8 5 2 {3, 4, 5}
{1, 3, 4}
{2, 3, 5}
bucket 0 1 2 3 4 5 6 7 8 9 10 {4, 5, 6}
pairs {2, 6}
{3, 4}
{1, 2}
{4, 6}
{1, 3} {1, 4}
{3, 5}
{1, 5} {1, 6}
{2, 3}
{3, 6} {2, 4}
{5, 6}
{4, 5} {2, 5} {2, 4, 5}
support 0 (4+1) 5 3 6 1 3 2 6 3 2
{3, 4, 6}
5
Pairs counted on the second pass of PCY
Support threshold is 4
bucket 0 1 2 3 4 5 6 7 8 9 10
pairs {2, 6} {1, 2} {1, 3} {1, 4} {1, 5} {1, 6} {3, 6} {2, 4} {4, 5} {2, 5}
{3, 4} {4, 6} {3, 5} {2, 3} {5, 6}
support 0 5 5 3 6 1 3 2 6 3 2
Only pairs in frequent (green) buckets will be counted on the second pass of PCY:
{1, 2}, {1, 4}, {2, 4}, {2, 6}, {3, 4}, {3, 5}, {4, 6}, {5, 6}
Pairs counted on the second pass of PCY
Support threshold is 4
bucket 0 1 2 3 4 5 6 7 8 9 10
pairs {2, 6} {1, 2} {1, 3} {1, 4} {1, 5} {1, 6} {3, 6} {2, 4} {4, 5} {2, 5}
{3, 4} {4, 6} {3, 5} {2, 3} {5, 6}
support 0 5 5 3 6 1 3 2 6 3 2
Only pairs in frequent (green) buckets will be counted on the second pass of PCY:
{1, 2}, {1, 4}, {2, 4}, {2, 6}, {3, 4}, {3, 5}, {4, 6}, {5, 6}
support 4 6 8 8 6 4
pairs {2, 6} {1, 2} {1, 3} {1, 4} {1, 5} {1, 6} {3, 6} {2, 4} {4, 5} {2, 5}
{3, 4} {4, 6} {3, 5} {2, 3} {5, 6}
support 0 5 5 3 6 1 3 2 6 3 2
Only pairs in frequent (green) buckets will be counted on the second pass of PCY:
{1, 2}, {1, 4}, {2, 4}, {2, 6}, {3, 4}, {3, 5}, {4, 6}, {5, 6}
support 4 6 8 8 6 4
Pairs whose items are not frequent (1 and 6) will not be counted on the second
pass of PCY:
{2, 4}, {3, 4}, {3, 5}
Refinement: Multistage Algorithm
• Limit the number of candidates to be
counted
• Remember: Memory is the bottleneck
• Still need to generate all the itemsets but we only
want to count/keep track of the ones that are
frequent
• Key idea: After Pass 1 of PCY, rehash only
those pairs that qualify for Pass 2 of PCY
• i and j are frequent, and
• {i, j} hashes to a frequent bucket from Pass 1
• On middle pass, fewer pairs contribute to
buckets, so fewer false positives
• Requires 3 passes over the data
40
Main-Memory: Multistage
Item counts Freq. items Freq. items
Main memory
Bitmap 1 Bitmap 1
First Bitmap 2
hash table
First
Second Counts
hash table Counts of
of
hash table candidate
candidate
pairs
pairs
42
Important Points
1. The two hash functions have to be
independent
43
Refinement: Multihash Algorithm
44
Main-Memory: Multihash
Bitmap 1
Main memory
First
First hash
hash table
table Bitmap 2
Counts
Countsofof
Second
Second candidate
candidate
hash table
hash table pairs
pairs
Pass 1 Pass 2
45
PCY: Summary
• Either multistage or multihash can use more than
two hash functions
46
Limited Pass Algorithms
Frequent Itemsets
in < 2 Passes
Frequent Itemsets in < 2 Passes
48
Simple Randomized Algorithm (1)
•Take a random sample of the market baskets
Main memory
baskets
• Reduce support threshold
proportionally Space
to match the sample size for
counts
49
Simple Randomized Algorithm (2)
50
SON: The Algorithm of
Savasere, Omiecinski, and Navathe (1)
• Repeatedly read small subsets of the baskets into main memory
and run an in-memory algorithm to find all frequent itemsets
• Note: we are not sampling, but processing the entire file in memory-
sized chunks
51
SON Algorithm – Example
• Let the support threshold s=4 and the following set of baskets:
{A, B, C} {A, B, C}
{A, B} {A, B} Chunk #1
{C, D} {C, D} s=2
{B, C} {B, C}
{A, B, D, E} {A, B, D, E}
{A, B, E} {A, B, E} Chunk #2
{A, C, E} {A, C, E} s=2
Union of Candidates:
{A}, {B}, {C}, {E}, {A,B}, {A,E}, {B,C}, {B,E}
SON’s 2nd Pass:
{A}, {B}, {C}, {E}, {A,B}, {A,E}, {B,C}, {B,E}
53
SON: Map/Reduce
● Phase 1: Find candidate itemsets
○ Map :
■ Take the assigned subset of the baskets and find the
itemsets frequent in the subset
■ Lower the support threshold from s to (p * s) if each
Map task gets fraction p of the total input file.
■ Apply sampling algorithm
■ Output: set of key-value pairs (F, 1), F: frequent
itemset from the sample.
○ Reduce :
■ Each Reduce task is assigned a set of keys, which are
itemsets.
■ Produces those keys (itemsets) that appear one or
more times. (candidate itemsets).
54
SON: Map/Reduce
Phase 2: Find true frequent itemsets
Map
● Input:
○ All the output from the first Reduce Function (candidate
itemsets)
○ Portion of the input data file.
56