
Tutorial Meeting #3

Data Science and Machine Learning

Frequent Itemsets
(Chapter 6 of MMDS book)

DAMA60 Group
School of Science and Technology
Hellenic Open University
Content

• Preliminaries
  • Challenges
  • Reminder: Apriori

• Methods and Algorithms
  • PCY: The algorithm of Park, Chen and Yu
  • PCY refinements
  • Random Sampling
  • SON: The Algorithm of Savasere, Omiecinski, and Navathe
Finding Frequent Itemsets
Itemsets: Computation Model
• Typically, data is kept in flat files rather than in a database system:
  • Stored on disk
  • Stored basket-by-basket
  • Baskets are small, but we have many baskets and many items
• Expand baskets into pairs, triples, etc. as you read baskets
• Use k nested loops to generate all sets of size k

[Figure: a flat file depicted as a column of items, one basket after another. Items are positive integers, and boundaries between baskets are –1.]

Note: We want to find frequent itemsets. To find them, we have to count them. To count them, we have to generate them.
Computation Model

• The true cost of mining disk-resident data is usually the number of disk I/Os

• In practice, association-rule mining algorithms read the data in passes – all baskets read in turn

• We measure the cost by the number of passes an algorithm makes over the data
Main-Memory Bottleneck

• For many frequent-itemset algorithms, main memory is the critical resource

• As we read baskets, we need to count something, e.g., occurrences of pairs of items

• The number of different things we can count is limited by main memory

• Moving counts between memory and disk is very inefficient
Finding Frequent Pairs
• The hardest problem often turns out to be finding the
frequent pairs of items {i1, i2}
• Why? Freq. pairs are common, freq. triples are rare
• Why? Probability of being frequent drops exponentially
with size; number of sets grows more slowly with size

• Let’s first concentrate on pairs, then extend to larger sets

• The approach:
  • We always need to generate all the itemsets
  • But we would like to count (keep track of) only those itemsets that in the end turn out to be frequent
Naïve Algorithm
• Read file once, counting in main memory
the occurrences of each pair:
• From each basket of n items, generate its n(n-1)/2 pairs by two nested
loops
• Fails if (#items)^2 exceeds main memory
• Remember: #items can be 100K (Wal-Mart) or 10B (Web pages)
• Suppose 10^5 items, and counts are 4-byte integers
• Number of pairs of items: 10^5 × (10^5 − 1) / 2 ≈ 5 × 10^9
• Therefore, 2 × 10^10 bytes (20 gigabytes) of memory are needed

• Suppose 5 items, {I1, I2, I3, I4, I5}
• Number of pairs of items: n(n−1)/2 = 5 × 4 / 2 = 10
• {I1, I2}, {I1, I3}, {I1, I4}, {I1, I5},
• {I2, I3}, {I2, I4}, {I2, I5},
• {I3, I4}, {I3, I5},
• {I4, I5}
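
As a concrete illustration, here is a minimal Python sketch of the naïve algorithm (the baskets are made up for the example): two nested loops generate each basket's n(n−1)/2 pairs, counted in a main-memory dictionary.

    from collections import defaultdict

    # Toy "file" of baskets; a real run would stream baskets from disk
    baskets = [["I1", "I2", "I3"], ["I1", "I2", "I4"], ["I2", "I3", "I5"]]

    pair_counts = defaultdict(int)
    for basket in baskets:
        items = sorted(basket)
        # Two nested loops generate the n(n-1)/2 pairs of an n-item basket
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                pair_counts[(items[a], items[b])] += 1

    print(pair_counts[("I1", "I2")])  # 2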

Counting Pairs in Memory

Two approaches:
• Approach 1: Count all pairs using an item-item frequency
matrix
• Approach 2: Keep a table of triples [i, j, c] = “the count of the
pair of items {i, j} is c.”
• If integers and item ids are 4 bytes, we need approximately 12 bytes
for pairs with count > 0
• Plus some additional overhead for the hash table

Note:
• Approach 1 requires only 4 bytes per pair
• Approach 2 uses 12 bytes per pair
(but only for pairs with count > 0)

Comparing the 2 Approaches

[Figure: Triangular Matrix (4 bytes per pair) vs. Triples Method (12 bytes per occurring pair)]
Comparing the two approaches
• Approach 1: Triangular Matrix
  • n = total number of items
  • Count pair of items {i, j} only if i < j
  • Keep pair counts in lexicographic order:
    • {1,2}, {1,3}, …, {1,n}, {2,3}, {2,4}, …, {2,n}, {3,4}, …
  • Pair {i, j} is at position (i − 1)(n − i/2) + j − i
  • Total number of pairs is n(n−1)/2; for large n, we can approximate this by n^2/2
  • The Triangular Matrix requires 4 bytes per pair, so the total number of bytes it consumes equals 2n^2

• Approach 2: Triples Method
  • It uses 12 bytes per occurring pair (but only for pairs with count > 0)
  • Beats Approach 1 if fewer than 1/3 of possible pairs actually occur
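
As a sanity check of the position formula, here is a small Python sketch (1-based positions, rewritten with integer arithmetic as (i − 1)(2n − i)/2 + j − i):

    def pair_position(i, j, n):
        """1-based position of pair {i, j}, i < j, in lexicographic order."""
        assert 1 <= i < j <= n
        return (i - 1) * (2 * n - i) // 2 + (j - i)

    n = 5
    counts = [0] * (n * (n - 1) // 2)        # one count slot per pair
    counts[pair_position(2, 3, n) - 1] += 1  # record one occurrence of {2, 3}
    print(pair_position(1, 2, n))            # 1  (the first pair)
    print(pair_position(4, 5, n))            # 10 (the last of the 10 pairs)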

Comparing the two approaches

The problem arises if we have too many items, so the pair counts do not fit into memory.
Can we do better?
Reminder: A-Priori Algorithm
A-Priori Algorithm – (1)

• A two-pass approach called A-Priori reduces the need for large main memory

• Key idea: monotonicity
  • If a set of items I appears at least s times, so does every subset J of I

• Contrapositive for pairs:
  If item i does not appear in s baskets, then no pair that includes i can appear in s baskets

• So, how does A-Priori find freq. pairs?
A-Priori Algorithm – (2)

• Pass 1: Read baskets and count in main memory the occurrences of each individual item
  • Requires only memory proportional to the number of items
  • Items that appear at least s times constitute the frequent items
  • Typical value for s is 1% of the baskets

• Pass 2: Read baskets again and count in main memory only those pairs where both elements are frequent (from Pass 1)
  • Requires memory proportional to the square of the number of frequent items only (for counts)
  • Plus a list of the frequent items, so you know what must be counted (see the sketch below)
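
A minimal sketch of the two passes in Python; the baskets and the threshold s are illustrative, not fixed by the slides:

    from collections import defaultdict
    from itertools import combinations

    baskets = [[1, 2, 3], [1, 2], [2, 3], [1, 2, 4]]
    s = 2

    # Pass 1: count individual items, keep the frequent ones
    item_counts = defaultdict(int)
    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
    frequent = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose elements are both frequent
    pair_counts = defaultdict(int)
    for basket in baskets:
        for i, j in combinations(sorted(basket), 2):
            if i in frequent and j in frequent:
                pair_counts[(i, j)] += 1

    frequent_pairs = {p for p, c in pair_counts.items() if c >= s}
    print(frequent_pairs)  # {(1, 2), (2, 3)}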

Main-Memory: Picture of A-Priori

[Figure: main-memory layout. Pass 1: item counts. Pass 2: frequent items plus counts of pairs of frequent items (candidate pairs).]
Detail for A-Priori
• You can use the triangular matrix method with the number of frequent items (m < n)
  • May save space compared with storing triples
  • Trick: re-number frequent items 1, 2, …, m and keep a table relating new numbers to original item numbers
  • The new space required is 2m^2

[Figure: main-memory layout. Pass 1: item counts. Pass 2: table of frequent items with their old item numbers, plus counts of pairs of frequent items.]
A-Priori Algorithm
• Let k = 1    (Fk: frequent k-itemsets; Lk: candidate k-itemsets)

• Generate F1 = {frequent 1-itemsets}

• Repeat until Fk is empty (the loop is sketched in code below):
  • Candidate Generation: Generate Lk+1 from Fk
  • Candidate Pruning: Prune candidate itemsets in Lk+1 containing subsets of length k that are infrequent
  • Support Counting: Count the support of each candidate in Lk+1 by scanning the database
  • Candidate Elimination: Eliminate candidates in Lk+1 that are infrequent, leaving only those that are frequent ⇒ Fk+1
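
A compact Python sketch of this loop; candidate generation by self-join and the subset-pruning test are simplified for illustration:

    from itertools import combinations

    def apriori(baskets, s):
        item_counts = {}
        for b in baskets:
            for i in b:
                item_counts[i] = item_counts.get(i, 0) + 1
        Fk = {frozenset([i]) for i, c in item_counts.items() if c >= s}  # F1
        all_frequent, k = set(Fk), 1
        while Fk:
            # Candidate Generation: join Fk with itself, keep (k+1)-sets
            Lk1 = {a | b for a in Fk for b in Fk if len(a | b) == k + 1}
            # Candidate Pruning: every k-subset must itself be frequent
            Lk1 = {c for c in Lk1
                   if all(frozenset(sub) in Fk for sub in combinations(c, k))}
            # Support Counting: scan the database
            counts = {c: sum(1 for b in baskets if c <= set(b)) for c in Lk1}
            # Candidate Elimination
            Fk = {c for c, n in counts.items() if n >= s}
            all_frequent |= Fk
            k += 1
        return all_frequent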

Frequent Triples and more

• For each k, we construct two sets of k-itemsets (sets of size k):

  • Ck (candidate k-itemsets): those that might be frequent sets (support ≥ s) based on information from the pass for k–1

  • Lk = the set of truly frequent k-itemsets

[Figure: C1 (all items) → Filter (count the items) → L1 → Construct → C2 (all pairs of items from L1) → Filter (count the pairs) → L2 → Construct → C3 (to be explained) → …]
The A-Priori Algorithm — Example
Minimum Support = 2

PCY (Park-Chen-Yu) Algorithm
PCY (Park-Chen-Yu) Algorithm

• Observation: In pass 1 of A-Priori, most memory is not used
  • We store only individual item counts
  • Can we use the unused memory to reduce the memory required in pass 2?

• Pass 1: In addition to item counts, maintain a hash table with as many buckets as fit in memory
  • Keep a count for each bucket into which pairs of items are hashed
  • For each bucket just keep the count, not the actual pairs that hash to the bucket!
PCY Algorithm – First Pass
FOR (each basket):
    FOR (each item in the basket):
        add 1 to item's count;
    FOR (each pair of items):                // new in PCY
        hash the pair to a bucket;           // new in PCY
        add 1 to the count for that bucket;  // new in PCY

• A few things to note:
  • Pairs of items need to be generated from the input file; they are not present in the file (double loop)
  • We are not just interested in the presence of a pair; we need to see whether it is present at least s (support) times (frequent bucket)
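
A runnable Python version of this first pass, assuming a toy basket list and an 11-bucket hash table (the same hash function used in the worked example later in these slides):

    from collections import defaultdict
    from itertools import combinations

    NUM_BUCKETS = 11
    baskets = [[1, 2, 3], [1, 3, 5], [3, 5, 6], [2, 3, 4]]  # toy input

    item_counts = defaultdict(int)
    bucket_counts = [0] * NUM_BUCKETS

    for basket in baskets:
        for item in basket:
            item_counts[item] += 1
        for i, j in combinations(sorted(basket), 2):  # pairs are generated, not stored
            bucket = (i * j) % NUM_BUCKETS            # hash the pair to a bucket
            bucket_counts[bucket] += 1                # keep only the bucket's count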

Observations about Buckets
• If a bucket contains a frequent pair, then the bucket is
surely frequent
• However, even without any frequent pair, a bucket can still be frequent ☹
  • So, we cannot use the hash to eliminate any member (pair) of a “frequent” bucket
• But, for a bucket with total count less than s, none of its pairs can be frequent ☺
  • Pairs that hash to this bucket can be eliminated as candidates (even if the pair consists of 2 frequent items)

• Pass 2:
Only count pairs that hash to frequent buckets

PCY Algorithm – Between Passes
• Replace the buckets by a Bitmap:
• 1 means the bucket count exceeded the support s (call it a
frequent bucket);
• 0 means it did not

• 4-byte integer counts are replaced by bits, so the bit-vector requires 1/32 of the memory

• Also, decide which items are frequent and list them for the second pass
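
A sketch of this between-pass step, continuing from the first-pass sketch above (bucket_counts is assumed from there; s is an illustrative threshold):

    s = 4
    # One bit per bucket: 1 = frequent bucket (count >= s), 0 = not
    bitmap = [1 if c >= s else 0 for c in bucket_counts]

    # Packed into an int, each 4-byte count shrinks to a single bit
    packed = sum(bit << k for k, bit in enumerate(bitmap))

    def is_frequent_bucket(b):
        return (packed >> b) & 1 == 1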

PCY Algorithm – Pass 2
• Count all pairs {i, j} that meet the conditions for being a candidate pair:

  1. Both i and j are frequent items

  2. The pair {i, j} hashes to a bucket whose bit in the bit vector is 1 (i.e., a frequent bucket)

• Both conditions are necessary for the pair to have a chance of being frequent!
Main-Memory: Picture of PCY

[Figure: main-memory layout of PCY. Pass 1: item counts plus hash table for pairs. Pass 2: frequent items, bitmap, and counts of candidate pairs.]
Main-Memory Details
• Buckets require a few bytes each:
  • Note: we do not have to count past s
  • #buckets is O(main-memory size)

• On the second pass, a table of (item, item, count) triples is essential
  • Thus, the hash table must eliminate approx. 2/3 of the candidate pairs for PCY to beat A-Priori
Example of PCY
● Here is a collection of twelve baskets; each contains three of the six items 1 through 6:
  {1, 2, 3}, {1, 3, 5}, {3, 5, 6}, {2, 3, 4}, {2, 4, 6}, {1, 2, 4},
  {3, 4, 5}, {1, 3, 4}, {2, 3, 5}, {4, 5, 6}, {2, 4, 5}, {3, 4, 6}
● Suppose the support threshold is 4.
● On the first pass of the PCY Algorithm we use a hash table with 11 buckets.
  ○ The pair {i, j} is hashed to a bucket using the formula (i × j) mod 11.
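
The bucket counts of this example can be verified with a few lines of Python:

    from itertools import combinations

    baskets = [[1, 2, 3], [1, 3, 5], [3, 5, 6], [2, 3, 4], [2, 4, 6], [1, 2, 4],
               [3, 4, 5], [1, 3, 4], [2, 3, 5], [4, 5, 6], [2, 4, 5], [3, 4, 6]]

    bucket_counts = [0] * 11
    for basket in baskets:
        for i, j in combinations(basket, 2):
            bucket_counts[(i * j) % 11] += 1

    print(bucket_counts)  # [0, 5, 5, 3, 6, 1, 3, 2, 6, 3, 2]
    print([b for b, c in enumerate(bucket_counts) if c >= 4])  # frequent buckets: [1, 2, 4, 8]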


Support of items and pairs

Item      1   2   3   4   5   6
Support   4   6   8   8   6   4

Pair      {1,2}  {1,3}  {1,4}  {1,5}  {2,3}  {2,4}  {2,5}
Support     2      3      2      1      3      4      2

Pair      {2,6}  {3,4}  {3,5}  {3,6}  {4,5}  {4,6}  {5,6}
Support     1      4      4      2      3      3      2
Support of items and pairs
Item      1   2   3   4   5   6
Support   4   6   8   8   6   4

Support of pairs (i, j):

         j=2  j=3  j=4  j=5  j=6
  i=1     2    3    2    1    0
  i=2          3    4    2    1
  i=3               4    4    2
  i=4                    3    3
  i=5                         2
Pairs hashing to buckets

Pair     {1,2}  {1,3}  {1,4}  {1,5}  {1,6}  {2,3}  {2,4}  {2,5}
Bucket     2      3      4      5      6      6      8     10

Pair     {2,6}  {3,4}  {3,5}  {3,6}  {4,5}  {4,6}  {5,6}
Bucket     1      1      4      7      9      2      8

In matrix form, the bucket of pair (i, j); for example, (1 × 2) mod 11 = 2:

         j=2  j=3  j=4  j=5  j=6
  i=1     2    3    4    5    6
  i=2          6    8   10    1
  i=3               1    4    7
  i=4                    9    2
  i=5                         8
Frequent buckets

Summing the supports of the pairs that hash to each bucket gives the bucket counts:

Bucket   Pairs hashing to it   Count
  0      (none)                0
  1      {2,6}, {3,4}          5  (= 1 + 4)
  2      {1,2}, {4,6}          5
  3      {1,3}                 3
  4      {1,4}, {3,5}          6
  5      {1,5}                 1
  6      {1,6}, {2,3}          3
  7      {3,6}                 2
  8      {2,4}, {5,6}          6
  9      {4,5}                 3
 10      {2,5}                 2
Pairs counted on the second pass of PCY

Support threshold is 4

Buckets 1, 2, 4, and 8 have count ≥ 4, so they are the frequent buckets.
Only pairs that hash to a frequent bucket will be counted on the second pass of PCY:
{1, 2}, {1, 4}, {2, 4}, {2, 6}, {3, 4}, {3, 5}, {4, 6}, {5, 6}

Note: A pair is frequent if both its items are frequent!


Item      1   2   3   4   5   6
Support   4   6   8   8   6   4

All items are frequent (support ≥ 4), so all the selected pairs pass.

What will happen if the support threshold is set to 5?
Pairs counted on the second pass of PCY

Support threshold is 5

The frequent buckets are still 1, 2, 4, and 8 (counts 5, 5, 6, 6 ≥ 5), so the same pairs survive the bucket test:
{1, 2}, {1, 4}, {2, 4}, {2, 6}, {3, 4}, {3, 5}, {4, 6}, {5, 6}

Note: A pair is frequent if both its items are frequent!

Item      1   2   3   4   5   6
Support   4   6   8   8   6   4

Now items 1 and 6 are not frequent (support 4 < 5), so pairs containing them will not be counted on the second pass of PCY. The remaining candidate pairs are:
{2, 4}, {3, 4}, {3, 5}
Refinement: Multistage Algorithm
• Limit the number of candidates to be counted
  • Remember: memory is the bottleneck
  • We still need to generate all the itemsets, but we only want to count/keep track of the ones that are frequent
• Key idea: After Pass 1 of PCY, rehash only those pairs that qualify for Pass 2 of PCY:
  • i and j are frequent, and
  • {i, j} hashes to a frequent bucket from Pass 1
• On the middle pass, fewer pairs contribute to buckets, so there are fewer false positives
• Requires 3 passes over the data
Main-Memory: Multistage
[Figure: main-memory layout of Multistage. Pass 1: item counts plus first hash table (count items; hash pairs {i,j}). Pass 2: frequent items, Bitmap 1, and a second hash table; a pair {i,j} is hashed into the second table iff i and j are frequent and {i,j} hashes to a frequent bucket in Bitmap 1. Pass 3: frequent items, Bitmap 1, Bitmap 2, and counts of candidate pairs.]
Multistage – Pass 3
Count only those pairs {i, j} that satisfy these candidate pair conditions:

1. Both i and j are frequent items

2. Using the first hash function, the pair hashes to a bucket whose bit in the first Bitmap is 1

3. Using the second hash function, the pair hashes to a bucket whose bit in the second Bitmap is 1
Important Points
1. The two hash functions have to be independent

2. We need to check both hashes on the third pass

   • If not, we would end up counting pairs of frequent items that hashed first to an infrequent bucket (1st pass) but happened to hash second to a frequent bucket (2nd pass)
Refinement: Multihash Algorithm

• Key idea: Use several independent hash tables on the first pass
  • Risk: Halving the number of buckets doubles the average count
  • We have to be sure most buckets will still not reach count s

• If so, we can get a benefit like multistage, but in only 2 passes
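
A sketch of the multihash first pass with two hash functions; the functions and table sizes here are illustrative assumptions, the only real requirement being that the two hashes are independent:

    from itertools import combinations

    baskets = [[1, 2, 3], [1, 3, 5], [3, 5, 6], [2, 3, 4]]  # toy input
    HALF = 5                  # each table gets half the buckets of plain PCY

    h1 = lambda i, j: (i * j) % HALF
    h2 = lambda i, j: (i + 31 * j) % HALF   # illustrative second hash

    counts1 = [0] * HALF
    counts2 = [0] * HALF
    for basket in baskets:
        for i, j in combinations(sorted(basket), 2):
            counts1[h1(i, j)] += 1  # a candidate pair must later fall into a
            counts2[h2(i, j)] += 1  # frequent bucket under BOTH hash functions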

Main-Memory: Multihash

[Figure: main-memory layout of Multihash. Pass 1: item counts plus the first and second hash tables. Pass 2: frequent items, Bitmap 1, Bitmap 2, and counts of candidate pairs.]
PCY: Summary
• Either multistage or multihash can use more than two hash functions

• In multistage, there is a point of diminishing returns, since the bit-vectors eventually consume all of main memory

• For multihash, the bit-vectors occupy exactly what one PCY bitmap does, but too many hash functions make all counts > s
Limited Pass Algorithms

Frequent Itemsets
in < 2 Passes
Frequent Itemsets in < 2 Passes

• A-Priori, PCY, etc., take k passes to find frequent itemsets of size k

• Can we use fewer passes?

• Use 2 or fewer passes for all sizes, but we may miss some frequent itemsets
  • Simple Randomized Algorithm
  • SON (Savasere, Omiecinski, and Navathe)
Simple Randomized Algorithm (1)
• Take a random sample of the market baskets

• Run A-Priori or one of its improvements in main memory
  • So we don’t pay for disk I/O each time we increase the size of itemsets

• Reduce the support threshold proportionally to match the sample size

[Figure: main memory holding a copy of the sample baskets plus space for counts]
Simple Randomized Algorithm (2)

• Optionally, verify that the candidate itemsets are truly frequent in the entire data set by a second pass (avoid false positives)

• But you don’t catch sets that are frequent in the whole data but not in the sample
  • A smaller threshold, e.g., s/125 instead of s/100 for a 1% sample, helps catch more truly frequent itemsets
  • But it requires more space
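
A sketch under these assumptions: baskets and s as in the earlier sketches, the apriori function defined earlier, and an illustrative 1% sampling rate:

    import random

    p = 0.01                                  # sample roughly 1% of the baskets
    sample = [b for b in baskets if random.random() < p]
    sample_threshold = max(1, int(p * s))     # or a bit lower, at more space cost
    candidates = apriori(sample, sample_threshold)

    # Optional second full pass to eliminate false positives
    truly_frequent = {c for c in candidates
                      if sum(1 for b in baskets if c <= set(b)) >= s}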

SON: The Algorithm of
Savasere, Omiecinski, and Navathe (1)
• Repeatedly read small subsets of the baskets into main memory and run an in-memory algorithm to find all frequent itemsets
  • Note: we are not sampling, but processing the entire file in memory-sized chunks

• An itemset becomes a candidate if it is found to be frequent in any one or more subsets of the baskets

• On a second pass, count all the candidate itemsets and determine which are frequent in the entire set (see the sketch below)

• Key “monotonicity” idea: an itemset cannot be frequent in the entire set of baskets unless it is frequent in at least one subset
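
A compact sketch of SON's two passes, reusing the apriori sketch from earlier; the chunking is simplified for illustration:

    def son(baskets, s, num_chunks):
        chunk_size = len(baskets) // num_chunks
        candidates = set()
        # Pass 1: frequent in any chunk -> candidate (by monotonicity)
        for c in range(num_chunks):
            chunk = baskets[c * chunk_size:(c + 1) * chunk_size]
            candidates |= apriori(chunk, max(1, s // num_chunks))
        # Pass 2: count every candidate over the entire set of baskets
        return {x for x in candidates
                if sum(1 for b in baskets if x <= set(b)) >= s}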

SON Algorithm – Example
• Let the support threshold be s = 4 and the following set of baskets:

  Chunk #1 (threshold s = 2): {A, B, C}, {A, B}, {C, D}, {B, C}
  Chunk #2 (threshold s = 2): {A, B, D, E}, {A, B, E}, {A, C, E}

• Candidates from Chunk #1 (support ≥ 2): {A}=2, {B}=3, {C}=3, {A,B}=2, {B,C}=2
• Candidates from Chunk #2 (support ≥ 2): {A}=3, {B}=2, {E}=3, {A,B}=2, {A,E}=3, {B,E}=2

• Union of candidates: {A}, {B}, {C}, {E}, {A,B}, {A,E}, {B,C}, {B,E}

• SON’s 2nd pass counts these candidates over all seven baskets. For example, {C} is not frequent in Chunk #2, but it is frequent in Chunk #1 and frequent overall (support 4 ≥ s).
SON – Distributed Version

• SON lends itself to distributed data mining

• Baskets are distributed among many nodes
  • Compute frequent itemsets at each node
  • Distribute candidates to all nodes
  • Accumulate the counts of all candidates
SON: Map/Reduce
● Phase 1: Find candidate itemsets
○ Map:
  ■ Take the assigned subset of the baskets and find the itemsets frequent in the subset
  ■ Lower the support threshold from s to p × s if each Map task gets fraction p of the total input file
  ■ Apply the sampling algorithm
  ■ Output: set of key-value pairs (F, 1), F: frequent itemset from the sample
○ Reduce:
  ■ Each Reduce task is assigned a set of keys, which are itemsets
  ■ It produces those keys (itemsets) that appear one or more times (the candidate itemsets)
SON: Map/Reduce
Phase 2: Find true frequent itemsets
Map
● Input:
  ○ All the output from the first Reduce function (candidate itemsets)
  ○ A portion of the input data file
● Count the number of occurrences of each of the candidate itemsets among the baskets in the portion of the dataset it was assigned
● Output: set of key-value pairs (C, v)
  ○ C: one of the candidate sets
  ○ v: the support for that itemset among the baskets that were input to this Map task
SON: Map/Reduce
Phase 2: Find true frequent itemsets
Reduce:
● Input:
  ○ Itemsets as keys; the Reduce task sums the associated values (partial supports)
● Output:
  ○ The total support for each of the itemsets that the Reduce task was assigned to handle
● Itemsets whose sum of values is at least s are frequent in the whole dataset
  ○ The Reduce task outputs these itemsets with their counts
● Itemsets that do not have total support at least s are not transmitted to the output of the Reduce task
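
A condensed, framework-free sketch of both phases in plain Python (apriori is the earlier sketch; a real deployment would run these functions under a Map/Reduce framework):

    def phase1_map(chunk, s, p):
        # Candidate itemsets frequent in this chunk, at the scaled threshold
        return [(F, 1) for F in apriori(chunk, max(1, int(p * s)))]

    def phase1_reduce(key_value_pairs):
        # Keys that appear one or more times are the candidate itemsets
        return {F for F, _ in key_value_pairs}

    def phase2_map(candidates, chunk):
        # Support of each candidate among the baskets of this chunk
        return [(C, sum(1 for b in chunk if C <= set(b))) for C in candidates]

    def phase2_reduce(key_value_pairs, s):
        totals = {}
        for C, v in key_value_pairs:
            totals[C] = totals.get(C, 0) + v      # sum the partial supports
        return {C: n for C, n in totals.items() if n >= s}  # frequent overall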

