
MS4252

Big Data Analytics


2024.02.28
Prof. Louie Wong
Association Discovery

MS4252 2023/24 Sem B


Customers Tend to Buy Things Together

Market Basket Analysis (MBA)
• A data-mining technique for discovering association patterns
(sales patterns)
• Uses statistical methods to identify association patterns
• Shows which items customers tend to buy together

Market Basket Analysis (MBA)
• In the context of text mining
• Each document in a corpus is a market basket
• Terms are the items in the market basket

DocID  Document
1      I love iPad
2      iPad is great for kids
3      Kids love to play soccer
4      I play soccer at OSU

Examples of association rules:
{love} → {iPad}
{kids} → {iPad}
{play} → {soccer}
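
A minimal sketch of how the four documents could be turned into baskets of terms. The tokenizer (lowercasing and splitting on non-letters) and the names `docs`, `to_basket`, and `docs_containing` are assumptions made for illustration; a real text-mining pipeline would usually also remove stop words.

```python
import re

# The four example documents from the slide, keyed by DocID.
docs = {
    1: "I love iPad",
    2: "iPad is great for kids",
    3: "Kids love to play soccer",
    4: "I play soccer at OSU",
}

def to_basket(text):
    """Return the set of distinct lowercase terms in a document."""
    return set(re.findall(r"[a-z]+", text.lower()))

# Each document becomes one basket (transaction) of terms.
baskets = {doc_id: to_basket(text) for doc_id, text in docs.items()}

def docs_containing(terms):
    """Return the DocIDs whose basket contains every term in `terms`."""
    return [doc_id for doc_id, basket in baskets.items() if set(terms) <= basket]

print(baskets[1])
print(docs_containing({"play", "soccer"}))
```

Once documents are baskets, the same support, confidence, and lift measures used for shopping transactions apply unchanged.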

Market Basket Analysis (MBA)
Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of association rules:
{Bread} → {Milk}
{Diaper} → {?}
{Milk, Bread} → {?}

Implication means co-occurrence, not causality!
We want to know, for a shopping basket, “what product likely goes with what other product”.

Association Rules Mining
• Also called Affinity Analysis
• We want to know “what goes with what”
• An unsupervised learning method
• First applied successfully to retail business problems, hence commonly called Market Basket Analysis
• Applied in many areas
• Sales transaction
• Credit card transactions
• Banking/Insurance services and products
• Medical records
• …

Applications of MBA
• Retail applications
• Product placement: put birthday cards next to flowers
• Recommendations: since you are browsing HDTVs, you may also want HDMI cables
• Text mining applications
• Analyze themes/trends in customer preferences
• Identify common topics/interests in social media posts
• Discover common issues/complaints in customer service interactions
• Other applications
• Fraud detection (multiple suspicious insurance claims)
• Medical complications (based on combinations of treatments)

What is Association Rules Mining?
Association rule mining (affinity analysis)
– Assume all data are categorical
– Discover significant relationships between data objects
– The uncovered relationships can be represented as association rules or as sets of frequent items. For example, {Bread} → {Butter}

Association Rule Mining Algorithm → association rules of the form X → Y

Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.
Caution: a rule indicates co-occurrence, not causality.

Various Associations
• Between values, e.g. Apple → Coke
• Between categories of values, e.g. Food → Magazine
• Between values of attributes, e.g. Married: yes → OwnHouse: yes
• Over a time period, e.g. year 1: Database → year 2: Data Mining

Some Definitions
• A transactional dataset: a set of transactions
• A transaction: the items purchased in one basket; it may have a transaction ID (TID)
• An item: an item in a transactional dataset, e.g. Bread
• An itemset is a set of items and a k-itemset is an itemset with k items
• One-itemset: {Bread}
• Two-itemset: {Bread, Milk}
• Three-itemset: {Milk, Diapers, Beer}
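
As an illustration of the k-itemset definition, the sketch below enumerates itemsets over the six items that appear in the earlier market-basket example. The helper name `k_itemsets` is hypothetical. It also shows why the number of possible itemsets grows quickly with the number of items.

```python
from itertools import combinations
from math import comb

# The six distinct items from the earlier market-basket transactions.
items = ["Beer", "Bread", "Coke", "Diaper", "Eggs", "Milk"]

def k_itemsets(items, k):
    """Return every itemset that contains exactly k of the given items."""
    return [frozenset(c) for c in combinations(sorted(items), k)]

two_itemsets = k_itemsets(items, 2)
print(len(two_itemsets))  # 15 two-itemsets

# Number of non-empty itemsets of any size: 2^6 - 1.
total = sum(comb(len(items), k) for k in range(1, len(items) + 1))
print(total)  # 63
```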

Some Definitions
• Support count (σ): frequency of occurrence of an itemset
• E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s): fraction of transactions that contain an itemset
• E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent itemset: an itemset whose support is greater than or equal to a minimum support threshold
• Association rule: an implication expression of the form X → Y, where X and Y are disjoint itemsets
• E.g. {Milk, Diaper} → {Beer}
• Rule evaluation metrics
• Support (s): fraction of transactions that contain both X and Y
• Confidence (c): how often items in Y appear in transactions that contain X
• Lift ratio: the confidence of the rule divided by the benchmark confidence
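
The support definitions above can be checked with a short sketch on the five market-basket transactions shown earlier. The function names and the 0.5 minimum support threshold are assumptions chosen for illustration.

```python
# The five market-basket transactions (TID 1 to 5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return support_count(itemset, transactions) / len(transactions)

def is_frequent(itemset, transactions, min_support):
    """True if the itemset's support reaches the minimum support threshold."""
    return support(itemset, transactions) >= min_support

print(support_count({"Milk", "Bread", "Diaper"}, transactions))   # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))         # 0.4
print(is_frequent({"Bread", "Milk"}, transactions, 0.5))          # True
```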

Process of Association Rules Selection

Candidate rules
• The idea behind association rules is to examine all possible rules
between items in an if-then format and select only those that are
most likely to be indicators of true dependence.
If (Antecedent)   Then (Consequent)   Association Rule
{Bread}           {Milk}              {Bread} → {Milk}
{Bread, Milk}     {Eggs}              {Bread, Milk} → {Eggs}

How to generate the candidate rules?

Apriori Algorithm
• The Apriori algorithm is a classical algorithm for generating frequent itemsets, proposed by Agrawal et al. (1993)
• The key idea of this algorithm: first generate frequent itemsets with just one item (one-itemsets), then recursively generate frequent itemsets with two items, then with three items, and so on until frequent itemsets of all sizes have been generated.
Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent. Equivalently, if an itemset is infrequent, every itemset that contains it is also infrequent, which is what allows candidates to be pruned.

Apriori Algorithm
Method:
• Let k = 1
• Generate frequent itemsets of length 1
• Repeat until no new frequent itemsets are identified
  • Generate length (k+1) candidate itemsets from length k frequent itemsets
  • Prune candidate itemsets containing subsets of length k that are infrequent
  • Count the support of each candidate by scanning the dataset
  • Eliminate candidates that are infrequent, leaving only those that are frequent
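
The method above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not an optimized one: candidates are formed by taking unions of pairs of frequent k-itemsets, the function name `apriori` and the parameter `min_count` (a minimum support count) are hypothetical, and the data are the five market-basket transactions from earlier with an assumed threshold of 3 transactions.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # k = 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 1
    while current:
        # Generate (k+1)-candidates from the frequent k-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune candidates that have an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        # Count support by scanning the dataset, then drop infrequent ones.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent = apriori(transactions, min_count=3)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With this threshold the sketch finds four frequent one-itemsets and four frequent two-itemsets; {Bread, Milk, Diaper} survives pruning but is dropped at the counting step because it occurs in only 2 transactions.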

Generating Candidate Rules

Use the Apriori algorithm to generate candidate rules.

Apriori Algorithm — Example
Minimum support = 50% = 2 out of 4 transactions

Transaction data
TID  Items
100  ACD
200  BCE
300  ABCE
400  BE

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
Self-joining Lk * Lk forms Ck+1

C1              L1
itemset  sup.   itemset  sup.
{A}      2      {A}      2
{B}      3      {B}      3
{C}      3      {C}      3
{D}      1      {E}      3
{E}      3

C2              L2
itemset  sup.   itemset  sup.
{A B}    1      {A C}    2
{A C}    2      {B C}    2
{A E}    1      {B E}    3
{B C}    2      {C E}    2
{B E}    3
{C E}    2

C3              L3
itemset  sup.   itemset  sup.
{B C E}  2      {B C E}  2

Pruning: {A B C} is removed because {A B} is not in L2 (i.e. not a frequent itemset).
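
The join and prune steps that lead from L2 to C3 in this example can be traced with a small sketch. It assumes a simple union-based join (any two frequent 2-itemsets whose union has 3 items); the classic Apriori join only merges itemsets that share a common prefix, so it would generate fewer candidates before pruning. The function names are hypothetical.

```python
from itertools import combinations

# L2 from the example: the frequent two-itemsets.
L2 = [frozenset("AC"), frozenset("BC"), frozenset("BE"), frozenset("CE")]

def join(frequent_k):
    """Union pairs of frequent k-itemsets that differ in exactly one item."""
    return {a | b for a in frequent_k for b in frequent_k
            if len(a | b) == len(a) + 1}

def prune(candidates, frequent_k):
    """Split candidates into those kept and those with an infrequent subset."""
    frequent_k = set(frequent_k)
    kept, removed = set(), set()
    for c in candidates:
        subsets = (frozenset(s) for s in combinations(c, len(c) - 1))
        if all(s in frequent_k for s in subsets):
            kept.add(c)
        else:
            removed.add(c)
    return kept, removed

joined = join(L2)
C3, removed = prune(joined, L2)
print(sorted("".join(sorted(c)) for c in joined))   # ['ABC', 'ACE', 'BCE']
print(sorted("".join(sorted(c)) for c in removed))  # ['ABC', 'ACE']
print(sorted("".join(sorted(c)) for c in C3))       # ['BCE']
```

Under this join, {A C E} is pruned for the same reason as {A B C}: its subset {A E} is not in L2.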

Selecting Strong Association Rules
• Three criteria are used to select strong rules:
• Support: larger is better
• Confidence: larger is better
• Lift ratio: should be greater than 1
• Generate rules from the frequent itemsets Lk and use the above criteria to select among them
{Antecedent} → {Consequent}
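
One way to generate candidate rules from a frequent itemset is to split it into every possible non-empty antecedent and consequent. The sketch below does this for the frequent three-itemset {B, C, E} from the Apriori example; the function name `rules_from_itemset` is hypothetical, and the resulting rules would still need to be filtered by support, confidence, and lift.

```python
from itertools import combinations

def rules_from_itemset(itemset):
    """Return all (antecedent, consequent) splits of a frequent itemset."""
    items = sorted(itemset)
    rules = []
    # The antecedent takes 1 .. n-1 items; the consequent takes the rest.
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = [i for i in items if i not in antecedent]
            rules.append((frozenset(antecedent), frozenset(consequent)))
    return rules

rules = rules_from_itemset({"B", "C", "E"})
for antecedent, consequent in rules:
    print(sorted(antecedent), "->", sorted(consequent))
```

A k-itemset yields 2^k - 2 candidate rules, so the three-itemset here gives 6.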

Support
• Measures the degree to which the data “support” the validity of the rule
• It is measured as the number of transactions that include both the antecedent and consequent item sets.
• The support is often expressed as a fraction (or percentage) of the total number of transactions in the dataset:

$$\text{Support} = P(\text{antecedent AND consequent}) = \frac{\#\text{ of transactions containing antecedent AND consequent}}{\#\text{ of transactions in the data set}}$$

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both. Support, s, is the probability that a transaction contains {Beer, Diaper}.]

Confidence
• The ratio of the number of transactions that include both the antecedent and consequent item sets (namely, the support count of the rule) to the number of transactions that include the antecedent item set

$$\text{Confidence} = \frac{\#\text{ of transactions with both antecedent and consequent item sets}}{\#\text{ of transactions with antecedent item set}} = \frac{P(\text{antecedent AND consequent})}{P(\text{antecedent})} = P(\text{consequent} \mid \text{antecedent})$$
• A high value of confidence suggests a strong association rule (in
which we are highly confident)
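
Because confidence is a conditional probability, it depends on the direction of the rule. The sketch below illustrates this on the five market-basket transactions from earlier; the function names are assumptions for illustration.

```python
# The five market-basket transactions (TID 1 to 5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def confidence(antecedent, consequent):
    """P(consequent | antecedent), estimated from the transactions."""
    both = count(set(antecedent) | set(consequent))
    return both / count(antecedent)

print(confidence({"Diaper"}, {"Beer"}))  # 3 of 4 diaper baskets: 0.75
print(confidence({"Beer"}, {"Diaper"}))  # 3 of 3 beer baskets: 1.0
```

The two rules share the same support (3/5) but have different confidence, so the choice of antecedent matters when selecting rules.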

Lift Ratio
• The lift ratio is the confidence of the rule divided by the benchmark confidence.

$$\text{Lift Ratio} = \frac{\text{Confidence}}{\text{Benchmark Confidence}}$$

• Benchmark confidence refers to the confidence value when the antecedent and consequent item sets are independent.

$$\text{Benchmark Confidence} = \frac{P(\text{antecedent AND consequent})}{P(\text{antecedent})} = \frac{P(\text{antecedent}) \cdot P(\text{consequent})}{P(\text{antecedent})} = P(\text{consequent}) = \frac{\#\text{ of transactions with consequent item set}}{\#\text{ of transactions in database}}$$
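
Substituting the definitions of confidence and benchmark confidence gives a compact form of the lift ratio. Here X denotes the antecedent and Y the consequent, following the rule notation X → Y used earlier.

```latex
\[
\text{Lift}(X \to Y)
  = \frac{P(Y \mid X)}{P(Y)}
  = \frac{P(X \text{ AND } Y)}{P(X)\,P(Y)}
\]
```

The last expression is symmetric in X and Y, so the rules X → Y and Y → X always have the same lift even when their confidences differ, and the lift equals 1 exactly when X and Y are independent.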

• A lift ratio > 1 suggests that the rule is useful. In other words, the level of association between the antecedent and consequent item sets is higher than would be expected if they were independent.
• The larger the lift ratio, the greater the strength of the association.

Selecting Association Rules Example
Transaction data
TID  Items
100  ACD
200  BCE
300  ABCE
400  BE

L1              L2              L3
itemset  sup.   itemset  sup.   itemset  sup.
{A}      2      {A C}    2      {B C E}  2
{B}      3      {B C}    2
{C}      3      {B E}    3
{E}      3      {C E}    2

e.g. {B} → {C E}
Support = ?, Confidence = ?, Lift = ?
{B} → {E}
Support = ?, Confidence = ?, Lift = ?

Selecting Association Rules Example

e.g. {B} → {C E}
Support = 2/4 = 0.5, Confidence = 2/3 = 67%, Lift = 67% / (2/4) = 1.33
{B} → {E}
Support = 3/4 = 0.75, Confidence = 3/3 = 100%, Lift = 100% / (3/4) = 1.33
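
The figures above can be reproduced with a short sketch on the four example transactions. The function names `support` and `rule_metrics` are assumptions for illustration.

```python
# The four example transactions (TID 100, 200, 300, 400).
transactions = [set("ACD"), set("BCE"), set("ABCE"), set("BE")]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for antecedent -> consequent."""
    rule_support = support(antecedent | consequent)
    confidence = rule_support / support(antecedent)
    lift = confidence / support(consequent)  # benchmark = P(consequent)
    return rule_support, confidence, lift

for antecedent, consequent in [({"B"}, {"C", "E"}), ({"B"}, {"E"})]:
    s, c, l = rule_metrics(antecedent, consequent)
    print(sorted(antecedent), "->", sorted(consequent),
          f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

Both rules have the same lift of about 1.33, but {B} → {E} has higher support and confidence, so it is the stronger of the two under the three criteria.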

Interpret the Rules
• Use different measures to interpret the results
• The support for the rule indicates its impact in terms of overall size –
what proportion of transactions is affected?
• If only a small number of transactions are affected, the rule may be of little use
(unless the consequent is very valuable and/or the rule is very efficient in finding
it).
• The confidence tells us at what rate the consequents will be found and
is useful in determining the business or operational viability of a rule.
• A rule with low confidence may find consequents at too low a rate to be worth
the costs of promoting the consequent in all the transactions that involve the
antecedent.
• The lift ratio indicates how efficient the rule is in finding the
consequents, compared to random selection.

Actionable Rules
• Actionable Rule – contains high-quality, actionable information
• Beer and diapers
• Put the beer and diapers as close together as possible
• Put the beer and diapers as far apart as possible
• Put high-margin diapers close to beer
• Wal-Mart customers who purchase Barbie dolls have a 60% likelihood of also purchasing one out of three types of candy bar
• Customers must walk through candy aisles
• Advertise both products together
• Put candy bars at eye level for a 5-year-old

Rules are not always useful
• Trivial Rule – information already well-known by those familiar with
the business
• Customers who purchase maintenance agreements are very likely to
purchase large appliances
• People who buy cable service from their local telephone service provider also buy internet access
• Inexplicable Rule – has no clear explanation and does not suggest an action
• When a new clothing store opens, one of the most commonly purchased
items is the store's logo t-shirt.

In-Class Exercise

Use the following grocery transaction data to do the analysis and answer the questions. Put all your answers into a Word document.
TID Bread Jam Fruit Peanuts Soda Milk Chips
1 1 1 1 1 0 1 0
2 0 1 1 0 1 1 1
3 1 1 0 0 1 0 1
4 0 1 1 0 1 1 0
5 1 1 0 0 1 1 1
6 0 0 1 0 1 1 0
7 0 0 1 1 1 1 0
8 0 0 1 1 0 0 0

In-Class Exercise
1. Generate all the one-itemsets and count their support. Identify all the one-itemsets with a support of at least 4 transactions. Then use the Apriori algorithm to generate all the candidate itemsets.
2. List the frequent two-itemsets and three-itemsets.
3. Identify all the possible association rules and calculate the confidence and lift ratio of each.
4. What are the support, confidence, and lift ratio of the association rule {Jam} → {Soda}? Is this a good rule? Should the store put these two items together?
5. What are the support, confidence, and lift ratio of the association rule {Fruit} → {Milk}? Is this a good rule? Should the store put these two items together?
6. What are the support, confidence, and lift ratio of the association rule {Fruit, Milk} → {Soda}? Is this a good rule? Should the store put these three items together?
