ANL303 - Week - 4 - Jan 2023
Study Unit 4
Association and Clustering
Key Learning Objectives for this unit include:
• Possible applications:
– To determine what products are frequently purchased together (market basket analysis)
– Fraud detection: what patterns or incidences in transactions may flag a potential fraud? E.g., a purchase requisition submitted by staff A is approved by staff B, who is a relative of staff A
Example
Terminology:
• Itemset = a set of items
• k-itemset = an itemset that contains k items
[Table: purchase records of customers Amy, Ben, Cindy and David for the items Chips, Bread, Milk and Butter]
Apriori Algorithm for Association Rule Mining
• An association rule takes the following form:
X → Y (rule support, confidence)
– X is the antecedent
– Y is the consequent
ID  Items
1   Bread, Jam
2   Bread, Chips, Biscuits, Ice-cream
3   Jam, Chips, Biscuits, Muffins
4   Bread, Jam, Chips, Biscuits, Muffins
5   Bread, Jam, Chips, Muffins
Apriori Algorithm for Association Rule Mining
Solution
– Phase 1: Find all frequent itemsets with support >= minimum support threshold value
– Phase 2: Based on the frequent itemsets, generate association rules with confidence >= minimum confidence threshold value, where
Confidence of X → Y = Support of {X, Y} (i.e., the rule support) / Support of {X}

Example: If {Apple, Banana} is a frequent itemset, there are two association rules:
Rule 1: {Apple} → {Banana}
Rule 2: {Banana} → {Apple}
For Rule 1, the rule support is 60% and the confidence is 75%.
For Rule 2, the rule support is 60% and the confidence is 100%.
If the min. confidence threshold value is set as 80%, then only Rule 2 will be generated.
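The slide's full transaction table for the Apple/Banana example did not survive extraction, so the sketch below uses a hypothetical five-transaction dataset constructed to reproduce the same numbers (rule support 60%, confidences 75% and 100%):

```python
# Hypothetical transactions chosen so that support({Apple, Banana}) = 60%,
# conf(Apple -> Banana) = 75% and conf(Banana -> Apple) = 100%.
transactions = [
    {"Apple", "Banana"},
    {"Apple", "Banana"},
    {"Apple", "Banana", "Cherry"},
    {"Apple", "Cherry"},
    {"Durian"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(X union Y) / support(X) for the rule X -> Y."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"Apple", "Banana"}, transactions))      # 0.6  (rule support)
print(confidence({"Apple"}, {"Banana"}, transactions))  # 0.75 (Rule 1)
print(confidence({"Banana"}, {"Apple"}, transactions))  # 1.0  (Rule 2)
```

With a minimum confidence threshold of 80%, only the second rule would survive, matching the slide.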
Apriori Algorithm
• Phase 1 is computationally expensive. Using the downward closure principle can speed up the process
– If an itemset is not frequent, then its supersets cannot be frequent
• For example, if {A} is not a frequent itemset, then its supersets such as {A,B}, {A,C}, and {A,B,C} cannot be frequent itemsets either
– Using this principle, the Apriori algorithm can infer infrequent itemsets and prune them immediately
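The pruning step above can be sketched as a candidate-generation routine; the frequent itemsets used here are hypothetical illustrations, not from the slides:

```python
from itertools import combinations

def apriori_gen(frequent_prev, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets.
    A candidate is pruned immediately if any of its (k-1)-subsets is
    not frequent (the downward closure principle)."""
    candidates = set()
    for a in frequent_prev:
        for b in frequent_prev:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent_prev
                for sub in combinations(sorted(union), k - 1)
            ):
                candidates.add(union)
    return candidates

# Hypothetical frequent 2-itemsets; note that {A, D} is absent (infrequent)
frequent_2 = {
    frozenset({"A", "B"}), frozenset({"A", "C"}),
    frozenset({"B", "C"}), frozenset({"B", "D"}), frozenset({"C", "D"}),
}
# {A, B, D} and {A, C, D} are pruned because their subset {A, D} is infrequent
print(apriori_gen(frequent_2, 3))  # only {A,B,C} and {B,C,D} survive
```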
Apriori Algorithm
• Apriori algorithm is designed for categorical variables (e.g., products available in a supermarket)
Question: How do you use Apriori algorithm to analyse a dataset with numeric attributes?
Range Categories
$0 - $10 Low
$10.01 - $20 Medium Low
$20.01 - $30 Medium
$30.01 - $40 Medium High
$40.01 and above High
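A minimal sketch of the discretisation in the table above, assuming amounts exactly on a boundary (e.g., $10) fall into the lower category:

```python
def price_category(amount):
    """Discretise a dollar amount using the ranges from the table."""
    if amount <= 10:
        return "Low"
    elif amount <= 20:
        return "Medium Low"
    elif amount <= 30:
        return "Medium"
    elif amount <= 40:
        return "Medium High"
    else:
        return "High"

print([price_category(x) for x in (5, 15.50, 45)])
# ['Low', 'Medium Low', 'High']
```

Once each numeric attribute is mapped to categories like these, the Apriori algorithm can treat the categories as ordinary items.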
Using Apriori Node in IBM SPSS Modeler
Dataset: BASKETS1N.csv
• The dataset contains 18 fields, with each record representing a basket.
• The 18 fields are as follows:
[Field list shown on the slide]
Using Apriori Node in IBM SPSS Modeler
Set Measurement to “Flag”.
“T” (true value) means that the product is purchased by the customer; “F” (false value) means it is not
Evaluation
• Whether or not an association rule is interesting can be assessed either subjectively or objectively (e.g., based on its support and confidence)
• The user can judge if a given rule is interesting, and this judgement, being subjective, may differ from one user to another
Evaluation
You have been asked to evaluate the following rule:
A→B
• A = a patient tests positive for a blood test ‘A’
• B = the patient may have contracted a deadly disease ‘B’
• Support: 1%
• Rule confidence: 99.9%
Case Study
Association Analysis of Elderly Care Services
Case Study: Association Analysis of Elderly Care Services
Background
• MySUSS Charity is an organisation that provides a variety of services to assist the elderly in managing daily tasks such as eating, toileting, walking and bathing.
• The dataset “ElderlyService.csv” is about services used by clients of MySUSS Charity. There are 786 records, each of which contains information on an individual client as well as the services used by the client.
Case Study: Association Analysis of Elderly Care Services
Business Problem
• Due to the ageing population, the demand for elderly services is increasing. MySUSS Charity has to recruit more frontline staff in order to cope with the increasing demand. Without understanding the patterns of elderly services used by the clients, it is unclear what kind of skillsets the staff should acquire through training.
Case Study: Association Analysis of Elderly Care Services
Your Task
• Load “ElderlyService.csv” into IBM SPSS Modeler
• Set the appropriate measurement and role of the fields
• Add the Apriori Node to your stream. Set the minimum antecedent support and the minimum rule
confidence to 10% and 75%, respectively
Clustering
• Cluster analysis (or simply clustering) discovers natural groupings in data
Clustering
• Possible applications:
– Market segmentation
• Cluster customers into segments so that strategies can be formulated to maximise company revenue
– Fraud detection
• Identify potential fraud cases by reviewing suspicious clusters
• Clustering organises objects into conceptually meaningful groups based on distance (proximity) measures
Distance Measures
• Distance measures are commonly used for computing the dissimilarity of objects described by
numeric attributes
• Suppose that there are n objects, and each object is described by p attributes. The Euclidean distance between objects i and j is defined as

d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}
Distance Measures
What is the Euclidean distance between Customer A and Customer B?
Object Age (Years) Monthly Income ($)
Customer A 18 1000
Customer B 38 5000
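Applying the Euclidean distance formula to the two customers shows that the raw income gap dominates the result; a quick check:

```python
import math

# Customer A: (age 18, income 1000); Customer B: (age 38, income 5000)
a = (18, 1000)
b = (38, 5000)

# Euclidean distance: sqrt((18-38)^2 + (1000-5000)^2)
d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(round(d, 2))  # 4000.05, almost entirely driven by the income gap
```

This is why the next slide normalises the attributes before computing distances.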
Distance Measures
• After normalisation, Distance = √[(0.11 − 0.33)² + (0.05 − 0.45)²] ≈ 0.457
• Now, income is no longer dominant
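The normalised values 0.11, 0.33, 0.05 and 0.45 come from scaling over the full customer dataset, which is not shown on the slide; a min-max scaling sketch with hypothetical data illustrates the idea:

```python
def min_max(values):
    """Scale a list of numbers into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical customer base (the slide's underlying dataset is not shown)
ages = [16, 18, 22, 34, 38, 70]
incomes = [500, 1000, 3000, 5000, 8000, 10500]

norm_ages = min_max(ages)
norm_incomes = min_max(incomes)

# After scaling, both attributes lie in [0, 1], so neither one
# dominates the Euclidean distance.
print(min(norm_ages), max(norm_ages))        # 0.0 1.0
print(min(norm_incomes), max(norm_incomes))  # 0.0 1.0
```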
Clustering Algorithms
Partitional Clustering – K-Means Clustering
• “K” refers to the number of clusters and “means” refers to the cluster centroids (i.e., the centre or
average of all the objects within a cluster)
Steps:
1. Randomly select K observations as initial cluster centroids.
2. Compute the distance of each object to each centroid.
3. Based on the distance computed, each object is assigned to the nearest centroid. Objects assigned to
the same centroid then form a cluster.
4. For each cluster, recompute the centroid using the objects assigned to the cluster. The iteration starts
again from Step 2.
5. The iteration stops when the centroids remain unchanged or a specified number of iterations has been
performed.
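The five steps above can be sketched in a few lines of Python; this is a minimal illustration under simple assumptions (squared Euclidean distance, fixed random seed), not IBM SPSS Modeler's implementation:

```python
import random

def k_means(points, k, max_iter=100, seed=0):
    """Minimal k-means on points given as tuples of numbers."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # Step 1: random initial centroids
    for _ in range(max_iter):                  # Step 5: cap on iterations
        clusters = [[] for _ in range(k)]
        for p in points:                       # Steps 2-3: assign each object
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        new_centroids = [                      # Step 4: recompute centroids
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:         # Step 5: stop when unchanged
            break
        centroids = new_centroids
    return centroids, clusters

# The one-dimension example from the slides that follow
pts = [(1,), (4,), (8,), (3,), (13,), (10,)]
centroids, clusters = k_means(pts, 2)
print(sorted(round(c[0], 1) for c in centroids))  # [2.7, 10.3]
```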
Activity
K-Means Clustering Using Playing Cards
Activity
How to use K-Means to assign the cards to 3 clusters?
➢ Please download “k-means_activity.pptx” from Canvas
Blue cluster: 4; Yellow cluster: 2; Red cluster: 1
Example of K-Means Clustering (with One Dimension)
An example: Given {1,4,8,3,13,10}
[Number line from 0 to 13 showing the six data points and the initial centroids]
[Number line from 0 to 13 showing the data points with the updated centroids 2.7 and 10.3]
Given {1,4,8,3,13,10}, K=2
• Assign 1, 4 and 3 to Cluster 1
– As 1, 4 and 3 are nearest to the centroid {2.7}
– New centroid: (1+3+4) / 3 ≈ 2.7, so no change in centroid
• Assign 10, 8 and 13 to Cluster 2
– As 10, 8 and 13 are nearest to the centroid {10.3}
– New centroid: (10+8+13) / 3 ≈ 10.3, so no change in centroid
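A quick check of the centroid arithmetic, rounded to one decimal place as on the slide:

```python
cluster_1 = [1, 3, 4]
cluster_2 = [10, 8, 13]

# Recomputed centroids: the mean of each cluster's members
print(round(sum(cluster_1) / len(cluster_1), 1))  # 2.7
print(round(sum(cluster_2) / len(cluster_2), 1))  # 10.3
```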
(In the two-dimension example that follows, both rankings are measured from 1 to 10, with 10 being the most desirable score.)
Example of K-Means Clustering (with Two Dimensions)
• Set k = 3
45
Example of K-Means Clustering (with Two Dimensions)
46
Example of K-Means Clustering (with Two Dimensions)
47
Example of K-Means Clustering (with Two Dimensions)
48
Example of K-Means Clustering (with Two Dimensions)
49
Example of K-Means Clustering (with Two Dimensions)
50
Example of K-Means Clustering (with Two Dimensions)
51
Example of K-Means Clustering (with Two Dimensions)
52
Evaluation
• The goal of cluster analysis is to come up with meaningful clusters
• A good clustering solution should be able to allow us to clearly describe the profile of each cluster
[Figure: clusters profiled as high rating, medium rating and low rating]
Evaluation
• Objective measures for evaluating the quality of clustering solutions:
– Cohesion (Compactness): how close the objects within a cluster are to one another
– Separation (Differentiation): how far apart the clusters are from each other
– Parsimony: the minimum number of clusters needed to capture the variations in the data
• Usually, a better clustering solution has a higher average Silhouette coefficient, a measure of cluster quality based on cohesion and separation
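A minimal, hand-rolled silhouette for one-dimensional clusters makes the cohesion/separation trade-off concrete (the general definition, as used by tools such as IBM SPSS Modeler, works on arbitrary distances; the two example solutions are hypothetical):

```python
def mean_dist(p, pts):
    """Mean absolute distance from point p to every point in pts."""
    return sum(abs(p - q) for q in pts) / len(pts)

def avg_silhouette(clusters):
    """Average silhouette coefficient; `clusters` is a list of lists of 1-D values."""
    scores = []
    for i, cluster in enumerate(clusters):
        for j, p in enumerate(cluster):
            if len(cluster) == 1:
                scores.append(0.0)  # common convention for singleton clusters
                continue
            # a = cohesion: mean distance to the other members of p's own cluster
            a = sum(abs(p - q) for k, q in enumerate(cluster) if k != j) / (len(cluster) - 1)
            # b = separation: mean distance to the nearest other cluster
            b = min(mean_dist(p, other) for m, other in enumerate(clusters) if m != i)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

tight = [[1, 3, 4], [8, 10, 13]]   # cohesive and well separated
loose = [[1, 8, 4], [3, 10, 13]]   # members mixed between clusters
print(avg_silhouette(tight) > avg_silhouette(loose))  # True: the well-separated solution scores higher
```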
Issues with K-Means Clustering
• Sensitive to initial centroids
– That is, selection of different centroids may give different clustering results
• Users need to specify the value of K
• Sensitive to outliers
– A small number of outliers can substantially influence the mean value. It is advisable to remove outliers before performing k-means
• Ideally, clustering criteria should be numeric
– Distance between values of categorical attributes is not well defined
• Very small clusters may not be detected
• Mainly spherical clusters may result
– Clusters that are elongated may be broken into smaller clusters
Tips for K-Means Clustering
• To ease visualisation and analysis of the profile of each cluster, clustering solutions should not consist of too many clusters
• The number of clustering criteria should not be excessive, so that the clustering solution can be interpreted with reasonable ease
Case Study
Cluster Analysis in the Telco Industry
Case Study: Cluster Analysis in the Telco Industry
Background
Deregulation of the telco industry has led to widespread competition, and telco service carriers fight hard for customers. The problem is that once a customer is acquired, competitors try to lure the customer away, so retention of customers is very difficult. The phenomenon of a customer switching carriers is referred to as churn.
Case Study: Cluster Analysis in the Telco Industry
Business Problem
Case Study: Cluster Analysis in the Telco Industry
• Assume you have collected a dataset for cluster analysis. Load “Churn.sav” into IBM SPSS Modeler
Case Study: Cluster Analysis in the Telco Industry
[Figure: clustering solution using all variables vs. using continuous variables only; the average Silhouette improves by 0.1]
Case Study: Cluster Analysis in the Telco Industry
• How can we extract the full list of customers from cluster 5?
• Do we need to perform normalisation before clustering?
End of Study Unit 4
See you next week!