Professional Documents
Culture Documents
Equent Itemsets S6
Equent Itemsets S6
Equent Itemsets S6
Market-basket model:
Applications
• Items = words
Applications: plagiarism
• Items = documents
• baskets = sentences
• An item (document) is ”in” a basket (sentence) if the
sentence is in the document
• Pairs of items that appear together in several baskets
• even one or two sentences in common =⇒ plagiarism.
Finding Frequent Itemsets A-Priori Algorithm
Application: Biomarkers
Frequent Itemsets
Toy example
Singleton
• ”Dog”: support is 7
• ”Cat”: support is 6
• ”And”: support is 5
• ...
• Threshold at s = 3:
Doubletons
Association Rules
support(I ∪ j)
conf (I → j) =
support(I)
Finding Frequent Itemsets A-Priori Algorithm
support(I ∪ j)
Lif t(I → j) =
support(I) × support(j)
• Interesting rules: lif t > 1 or lif t < 1
Finding Frequent Itemsets A-Priori Algorithm
support(I ∪ j)
conf (I → j) =
support(I)
Finding Frequent Itemsets A-Priori Algorithm
Example
Outline
A-Priori Algorithm
Finding Frequent Itemsets A-Priori Algorithm
• Items = integers
• Data is kept in flat files
• Stored basket-by-basket
• Baskets are small but we have many baskets and many items
• E.g. Items are positive integers, and boundaries between
baskets are −1 or
• n ∼ n2 /2 pairs of integers
2
Naive approach
• Two approaches:
• Approach 1: Count all pairs using a matrix
• Approach 2: Keep a table of triples [i, j, c] =”the count of the
pair of items (i, j) is c
• If integers and item ids are 4 bytes, we need approximately 12
bytes for pairs with count > 0
• Approach 1 only requires 4 bytes per pair
• Approach 2 uses 12 bytes per pair (but only for pairs with
count > 0)
Finding Frequent Itemsets A-Priori Algorithm
Can we do better?
Finding Frequent Itemsets A-Priori Algorithm
Outline
A-Priori Algorithm
Finding Frequent Itemsets A-Priori Algorithm
Computation Model
• A two-pass approach
• Limits the need for main memory
• Key idea: monotonicity
• If a set of items I appears at least s times, so does every
subset J of I
• If item i does not appear in s baskets, then no pair
including i can appear in s baskets.
Finding Frequent Itemsets A-Priori Algorithm
References