
ANL303

Fundamentals of Data Mining


Study Unit 4
Agenda
1. Unit 4 Overview – Association
2. Case Study: Association Analysis of Elderly Care Services
3. Unit 4 Overview – Clustering
4. Activity: K-Means Clustering Using Playing Cards
5. Case Study: Cluster Analysis in the Telco Industry

3
Study Unit 4
Association and Clustering
Key Learning Objectives for this unit include:

• Construct association rule mining models with the use of Apriori algorithms;
• Construct clustering models with the use of K-means algorithms;
• Appraise applications of association analysis and cluster analysis;
• Analyse the results of association analysis and cluster analysis;
• Evaluate the quality of association and clustering solutions.
Association Analysis
Association Analysis
• Find relationships among variables based on co-occurrence

• Possible applications:
– To determine what products are frequently purchased together (market basket analysis)

[Table: purchases of Chips, Bread, Milk and Butter by customers Amy, Ben, Cindy, David, Evan, Flora and Gloria; the tick marks did not survive extraction]

Bread and Butter frequently appear together in the same transaction.

Useful for decisions related to store layout, discounts, product bundling, discount coupons, etc.
7
Association Analysis
• Possible applications:
– To learn what symptoms co-occur frequently
with a confirmed diagnosis
• The discovery of these associations can help
medical professionals to identify high-risk
patients (i.e., patients who possess the
symptoms)

– Fraud detection
• What patterns or incidents in transactions
may flag a potential fraud? E.g., a
purchase requisition submitted by staff A is
approved by staff B, who is a relative of staff
A

8
Example
Terminology

[Table: purchases of Chips, Bread, Milk and Butter by customers Amy, Ben, Cindy, David and Evan; the tick marks did not survive extraction]

• Itemset = a set of items
• k-itemset = an itemset that contains k items
– {Milk} is a 1-itemset; {Bread, Butter} is a 2-itemset
• Support = the occurrence frequency of an itemset
– The support of {Milk} is 2/5 = 40%; the support of {Bread, Butter} is 3/5 = 60%
• Frequent itemset = an itemset with its support >= minimum support threshold value
– If the min. support threshold value is set as 50%, then {Bread, Butter} is a frequent itemset but {Milk} is not.

9
Apriori Algorithm for Association Rule Mining
• An association rule takes the following form:
X → Y (rule support, confidence)
– X is the antecedent
– Y is the consequent

• Unlike usual logical rules, association rules involve some uncertainty

• Two measures of rule interestingness:


– Rule support
• The proportion of transactions that include both the antecedent and consequent itemsets (i.e., the support of {X, Y})
– Confidence
• Measures the strength of the association
• The ratio of the number of transactions that include both the antecedent and consequent itemsets (i.e., the rule
support) to the number of transactions that include the antecedent itemset (i.e., the support of {X})
10
Apriori Algorithm for Association Rule Mining
Exercise

• What is the support for {Bread, Chips}?

ID Item
1 Bread, Jam
2 Bread, Chips, Biscuits, Ice-cream
3 Jam, Chips, Biscuits, Muffins
4 Bread, Jam, Chips, Biscuits, Muffins
5 Bread, Jam, Chips, Muffins
11
Apriori Algorithm for Association Rule Mining
Solution

• What is the support for {Bread, Chips}? 3/5 = 60%

ID Bread Jam Chips Biscuits Ice-cream Muffins


1 1 1 0 0 0 0
2 1 0 1 1 1 0
3 0 1 1 1 0 1
4 1 1 1 1 0 1
5 1 1 1 0 0 1
12
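The support computation above can be sketched in a few lines of Python. This is a minimal illustration using the transactions from the exercise table; the `support` helper is a name chosen here, not part of any toolkit.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Transactions from the exercise table above
transactions = [
    {"Bread", "Jam"},                                  # ID 1
    {"Bread", "Chips", "Biscuits", "Ice-cream"},       # ID 2
    {"Jam", "Chips", "Biscuits", "Muffins"},           # ID 3
    {"Bread", "Jam", "Chips", "Biscuits", "Muffins"},  # ID 4
    {"Bread", "Jam", "Chips", "Muffins"},              # ID 5
]

print(support({"Bread", "Chips"}, transactions))  # 3 of 5 baskets -> 0.6
```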
Apriori Algorithm for Association Rule Mining
Exercise

• What is the confidence for {Bread} →{Chips}?

ID Item
1 Bread, Jam
2 Bread, Chips, Biscuits, Ice-cream
3 Jam, Chips, Biscuits, Muffins
4 Bread, Jam, Chips, Biscuits, Muffins
5 Bread, Jam, Chips, Muffins
13
Apriori Algorithm for Association Rule Mining
Solution

Confidence = support of {X, Y} (i.e., rule support) ÷ support of {X}

• What is the confidence for {Bread} → {Chips}? 3/4 = 75%


ID Bread Jam Chips Biscuits Ice-cream Muffins
1 1 1 0 0 0 0
2 1 0 1 1 1 0
3 0 1 1 1 0 1
4 1 1 1 1 0 1
5 1 1 1 0 0 1
This means that 75% of the transactions that contain Bread also contain Chips.
14
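The confidence computation can be sketched the same way (a minimal illustration; `support` and `confidence` are helper names chosen here):

```python
def support(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of X -> Y: rule support divided by the support of X."""
    rule_support = support(set(antecedent) | set(consequent), transactions)
    return rule_support / support(antecedent, transactions)

transactions = [
    {"Bread", "Jam"},                                  # ID 1
    {"Bread", "Chips", "Biscuits", "Ice-cream"},       # ID 2
    {"Jam", "Chips", "Biscuits", "Muffins"},           # ID 3
    {"Bread", "Jam", "Chips", "Biscuits", "Muffins"},  # ID 4
    {"Bread", "Jam", "Chips", "Muffins"},              # ID 5
]

# 3 of the 4 baskets containing Bread also contain Chips -> 75%
print(round(confidence({"Bread"}, {"Chips"}, transactions), 2))
```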
Example
Apriori Algorithm

TID   Item purchased
001   Chips, Durian
002   Apple, Banana, Chips
003   Apple, Banana, Durian, Egg
004   Apple, Banana, Chips, Durian
005   Apple, Chips, Durian, Egg

• Apriori Algorithm consists of two phases:
– Phase 1: Find all frequent itemsets with support >= minimum support threshold value
– Phase 2: Based on the frequent itemsets, generate association rules with confidence >= minimum confidence threshold value

If {Apple, Banana} is a frequent itemset, there are two association rules:
Rule 1: {Apple} → {Banana}
Rule 2: {Banana} → {Apple}
For Rule 1, the rule support is 60% and the confidence is 75%.
For Rule 2, the rule support is 60% and the confidence is 100%.
If the min. confidence threshold value is set as 80%, then only Rule 2 will be generated.
15
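Phase 2 can be sketched for a single frequent itemset: split it into every antecedent/consequent pair and keep only the rules that meet the confidence threshold. This is a minimal illustration of the example above; `rules_from_itemset` is a name chosen here.

```python
from itertools import combinations

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rules_from_itemset(freq_itemset, transactions, min_confidence):
    """Phase 2 for one frequent itemset: try every antecedent/consequent
    split and keep rules whose confidence meets the threshold."""
    items = set(freq_itemset)
    whole = support(items, transactions)  # the rule support, same for all splits
    rules = []
    for r in range(1, len(items)):
        for ante in combinations(sorted(items), r):
            conf = whole / support(set(ante), transactions)
            if conf >= min_confidence:
                rules.append((set(ante), items - set(ante), whole, conf))
    return rules

transactions = [
    {"Chips", "Durian"},                     # 001
    {"Apple", "Banana", "Chips"},            # 002
    {"Apple", "Banana", "Durian", "Egg"},    # 003
    {"Apple", "Banana", "Chips", "Durian"},  # 004
    {"Apple", "Chips", "Durian", "Egg"},     # 005
]

# With min. confidence 80%, only Rule 2 ({Banana} -> {Apple}) survives
for ante, cons, sup, conf in rules_from_itemset({"Apple", "Banana"}, transactions, 0.8):
    print(ante, "->", cons, f"(support {sup:.0%}, confidence {conf:.0%})")
```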
Apriori Algorithm
• Phase 1 is computationally expensive. Using the downward closure principle can speed up the
process
– If an itemset is not frequent, then its supersets must not be frequent
• For example, if {A} is not a frequent itemset, then its supersets such as {A,B}, {A,C}, and {A,B,C} must not
be frequent itemsets either
– Using this principle, Apriori algorithm can infer infrequent itemsets and prune them immediately

16
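The level-wise search of Phase 1, including the downward-closure pruning, can be sketched as follows. This is a simplified illustration (not an optimised Apriori implementation); `frequent_itemsets` is a name chosen here, and the data is the TID table from the previous slide.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search (Phase 1): (k+1)-candidates are built only from
    frequent k-itemsets, so by downward closure every superset of an
    infrequent itemset is pruned without ever being counted."""
    n = len(transactions)
    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    candidates = list({frozenset([i]) for t in transactions for i in t})
    result = {}
    while candidates:
        frequent = {c for c in candidates if sup(c) >= min_support}
        result.update({c: sup(c) for c in frequent})
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset
        joined = {a | b for a in frequent for b in frequent
                  if len(a | b) == len(a) + 1}
        candidates = [c for c in joined
                      if all(frozenset(s) in frequent
                             for s in combinations(c, len(c) - 1))]
    return result

transactions = [frozenset(t) for t in (
    {"Chips", "Durian"},                     # 001
    {"Apple", "Banana", "Chips"},            # 002
    {"Apple", "Banana", "Durian", "Egg"},    # 003
    {"Apple", "Banana", "Chips", "Durian"},  # 004
    {"Apple", "Chips", "Durian", "Egg"},     # 005
)]

freq = frequent_itemsets(transactions, min_support=0.6)
for itemset in sorted(freq, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), f"{freq[itemset]:.0%}")
```

Note how {Egg} (support 40%) is dropped at level 1, so pairs such as {Banana, Egg} are never even counted.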
Apriori Algorithm
• Apriori algorithm is designed for categorical variables (e.g., products available in a supermarket)

Question: How do you use Apriori algorithm to analyse a dataset with numeric attributes?

17
Apriori Algorithm
• Apriori algorithm is designed for categorical variables (e.g., products available in a supermarket)

Question: How do you use Apriori algorithm to analyse a dataset with numeric attributes?

Range Categories
$0 - $10 Low
$10.01 - $20 Medium Low
$20.01 - $30 Medium
$30.01 - $40 Medium High
$40.01 and above High
18
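As the table above suggests, a numeric attribute is first discretised (binned) into categories so that Apriori can treat it like any other categorical value. A minimal sketch, with the band boundaries taken from the table (`to_category` is a name chosen here):

```python
def to_category(amount):
    """Map a dollar amount to the bands in the table above."""
    if amount <= 10:
        return "Low"
    if amount <= 20:
        return "Medium Low"
    if amount <= 30:
        return "Medium"
    if amount <= 40:
        return "Medium High"
    return "High"

print([to_category(x) for x in (5, 15.5, 29.99, 40.01)])
# ['Low', 'Medium Low', 'Medium', 'High']
```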
Using Apriori Node in IBM SPSS Modeler
Dataset: BASKETS1N.csv
• The dataset contains 18 fields, with each record representing a basket.
• The 18 fields are as follows:

19
Using Apriori Node in IBM SPSS Modeler
Set Measurement to “Flag”.
“T” (true value) means that the product is purchased by the customer; “F” (false value) otherwise.

“Both” means that the field can appear in either the antecedent or consequent part of the rules.
20
Using Apriori Node in IBM SPSS Modeler

Set the Minimum rule confidence to 50%.

We are only interested in rules that describe what products the customer bought (i.e., true value), rather than what they did not buy (i.e., false value).
21
Using Apriori Node in IBM SPSS Modeler

• What can you interpret from the results?

22
Evaluation
• Whether or not an association rule is interesting can be assessed either subjectively or objectively
(e.g., based on the support and confidence)
• The user can judge if a given rule is interesting, and this judgement, being subjective, may differ from
one user to another

23
Evaluation
You have been asked to evaluate the following rule:
A→B
• A = a patient is tested positive for a blood test ‘A’
• B = the patient may have contracted a deadly disease ‘B’
• Support: 1%
• Rule confidence: 99.9%

Question: To what extent is this rule useful to a doctor?

24
Evaluation
You have been asked to evaluate the following rule:
A→ B
• A = a patient is tested positive for a blood test ‘A’
• B = the patient may have contracted a deadly disease ‘B’
• Support: 1%
• Rule confidence: 99.9%

Question: To what extent is this rule useful to a doctor?


Suggested Answer:
• This rule is potentially useful because a doctor can prescribe appropriate treatment(s) to a patient if it is
known that the patient is tested positive for blood test A.
• On the other hand, this rule will be somewhat limited if blood test A is very costly because 99% of the
patients will be tested negative. However, this limitation may be mitigated if the cost of not knowing/treating
the disease is very high.

25
Case Study
Association Analysis of Elderly Care Services
Case Study: Association Analysis of Elderly Care Services
Background

• MySUSS Charity is an organisation that provides a variety of services to assist elderly in managing
daily tasks such as eating, toileting, walking and bathing.

• The dataset “ElderlyService.csv” is about services used by clients of MySUSS Charity. There are 786
records, each of which contains information about an individual client as well as the services used by the
client.

27
Case Study: Association Analysis of Elderly Care Services
Business Problem

• Due to the ageing population, the demand for elderly services is increasing. MySUSS Charity has to recruit
more frontline staff in order to cope with the increasing demand. Without understanding the patterns of
elderly services used by the clients, it is unclear what kind of skillsets the staff should acquire
through training.

Can association analysis be used to solve this problem? If so, how?

28
Case Study: Association Analysis of Elderly Care Services
Your Task
• Load “ElderlyService.csv” into IBM SPSS Modeler
• Set the appropriate measurement and role of the fields
• Add the Apriori Node to your stream. Set the minimum antecedent support and the minimum rule
confidence to 10% and 75%, respectively

Based on the association analysis results in the case study, what suggestions do you have for MySUSS Charity to tackle the business problem?

29
Clustering
Clustering
• Cluster analysis (or simply clustering) discovers
natural groupings in data

• To group similar (homogeneous) objects into the same cluster and dissimilar (heterogeneous) objects into different clusters

• A cluster is a subgroup of data objects such that the objects within a cluster are similar to one another and dissimilar to the objects in other clusters

31
Clustering
• Possible applications:
– Market segmentation
• Cluster customers into segments so that strategies can be formulated to maximise company revenue
– Fraud detection
• Identify potential fraud cases by reviewing suspicious clusters

• Clustering organises objects into conceptually meaningful groups based on distance (proximity) measures

32
Distance Measures
• Distance measures are commonly used for computing the dissimilarity of objects described by
numeric attributes

• One of the widely used distance measures is the Euclidean distance

• Suppose that there are n objects, and each object is described by p attributes. Euclidean distance
between object i and j is defined as

d(i, j) = √[(x_i1 − x_j1)² + (x_i2 − x_j2)² + ⋯ + (x_ip − x_jp)²]

33
d(i, j) = √[(x_i1 − x_j1)² + (x_i2 − x_j2)² + ⋯ + (x_ip − x_jp)²]
Distance Measures
What is the Euclidean distance between Customer A and Customer B?
Object Age (Years) Monthly Income ($)

Customer A 18 1000

Customer B 38 5000

√[(18 − 38)² + (1000 − 5000)²] ≈ 4000.05

Attributes with a larger range (i.e., Monthly Income in this case) outweigh attributes with a smaller range (i.e., Age in this case)

34
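The distance formula and the worked example above can be checked with a short sketch (`euclidean` is a helper name chosen here):

```python
from math import sqrt

def euclidean(x, y):
    """Euclidean distance between two objects described by p numeric attributes."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

customer_a = (18, 1000)   # (Age, Monthly Income)
customer_b = (38, 5000)
print(round(euclidean(customer_a, customer_b), 2))  # 4000.05
```

The income difference (4000) dominates the result; the age difference (20) barely moves it.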
Distance Measures
• After normalisation…. Distance = √[(0.11 − 0.33)² + (0.05 − 0.45)²] ≈ 0.457
Now, income is no longer dominant

35
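A sketch of min-max normalisation and the recomputed distance. The normalised coordinates (0.11, 0.05) and (0.33, 0.45) are taken from the slide; the minimum/maximum values they were scaled against are not reproduced in the deck, so `min_max` below only illustrates the transformation itself.

```python
from math import sqrt

def min_max(value, lo, hi):
    """Min-max normalisation: rescale a value into the range [0, 1]."""
    return (value - lo) / (hi - lo)

# Normalised (Age, Income) coordinates quoted on the slide
a = (0.11, 0.05)
b = (0.33, 0.45)
distance = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(round(distance, 3))  # ~0.457: income no longer dominates
```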
Clustering Algorithms

36
Partitional Clustering – K-Means Clustering
• “K” refers to the number of clusters and “means” refers to the cluster centroids (i.e., the centre or
average of all the objects within a cluster)

Steps:
1. Randomly select K observations as initial cluster centroids.
2. Compute the distance of each object to the centroid.
3. Based on the distance computed, each object is assigned to the nearest centroid. Objects assigned to
the same centroid then form a cluster.
4. For each cluster, recompute the centroid using the objects assigned to the cluster. The iteration starts
again from Step 2.
5. The iteration stops when the centroids remain unchanged or a specified number of iterations has been
performed.

37
Activity
K-Means Clustering Using Playing Cards

38
Activity
How to use K-Means to assign the cards to 3 clusters?
➢ Please download “k-means_activity.pptx” from Canvas
Blue Yellow Red
cluster cluster cluster
4 2 1

40
Example of K-Means Clustering (with One Dimension)
An example: Given {1,4,8,3,13,10}

• Use K-means to create two clusters


– That is, K=2
• Randomly select the following centroids:
– Cluster 1’s centroid is : {2}
– Cluster 2’s centroid is : {7}

[Number line from 0 to 13 showing the points 1, 3, 4, 8, 10 and 13, with the initial centroids at 2 and 7]

41
Example of K-Means Clustering (with One Dimension)
An example: Given {1,4,8,3,13,10}

• Use K-means to create two clusters


– That is, K=2
• After the first assignment, the centroids are recomputed:
– Cluster 1’s centroid is {2.7}, i.e., (1+3+4)/3
– Cluster 2’s centroid is {10.3}, i.e., (8+10+13)/3

[Number line from 0 to 13 showing the points 1, 3, 4, 8, 10 and 13, with the updated centroids at 2.7 and 10.3]

42
Example of K-Means Clustering (with One Dimension)
Given { 1,4,8,3,13,10 }, K=2

• Assign 1, 4 and 3 to Cluster 1
– As 1, 4 and 3 are nearest to the centroid {2.7}
– New centroid is: (1+3+4) / 3 = 2.7 → no change in centroid
• Assign 10, 8 and 13 to Cluster 2
– As 10, 8 and 13 are nearest to the centroid {10.3}
– New centroid is: (10+8+13) / 3 = 10.3 → no change in centroid

• When the centroids stop changing → stop assigning objects to clusters

• Other stopping criteria include the number of iterations
43
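The whole one-dimensional run can be reproduced with a small K-means sketch (plain Python; assumes no cluster ever becomes empty, and `k_means` is a name chosen here):

```python
def k_means(points, centroids, max_iter=100):
    """Assign each point to its nearest centroid, recompute centroids as
    cluster means, and stop once the centroids no longer change."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) for c in clusters]  # assumes no empty cluster
        if new_centroids == centroids:  # stopping criterion: no change
            break
        centroids = new_centroids
    return centroids, clusters

# Initial centroids {2} and {7}, as in the example
centroids, clusters = k_means([1, 4, 8, 3, 13, 10], [2, 7])
print([round(c, 1) for c in centroids], clusters)  # [2.7, 10.3] [[1, 4, 3], [8, 13, 10]]
```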


Example of K-Means Clustering (with Two Dimensions)
• A small dataset
– n = 6 objects (6 premier hotels, labelled A to F)
– p = 2 clustering variables [rankings of their facilities (FACILITY) and locations (LOCATION)]

• Both rankings are measured from 1 to 10, with 10 being the most desirable score

44
Example of K-Means Clustering (with Two Dimensions)
• Set k = 3

• Randomly set 3 centroids

45
Example of K-Means Clustering (with Two Dimensions)

D and F nearest to C3’s centroid

46
Example of K-Means Clustering (with Two Dimensions)

B and E nearest to C1’s centroid

47
Example of K-Means Clustering (with Two Dimensions)

A and C nearest to C2’s centroid

48
Example of K-Means Clustering (with Two Dimensions)

Update C3’s centroid

49
Example of K-Means Clustering (with Two Dimensions)

Update C1’s centroid

50
Example of K-Means Clustering (with Two Dimensions)

Update C2’s centroid

51
Example of K-Means Clustering (with Two Dimensions)

No more change in centroids; three clusters formed

52
Evaluation
• The goal of cluster analysis is to come up with meaningful clusters

• A good clustering solution should allow us to clearly describe the profile of each cluster

[Chart: three cluster profiles, labelled High rating, Medium rating and Low rating]

53
Evaluation
• Objective measures for evaluating the quality of clustering solutions:
– Cohesion (Compactness): How close are the objects in a cluster
– Separation (Differentiation): How far are the clusters from each other
– Parsimony: Minimum number of clusters to capture the variations in the data

• Usually a better clustering solution has a higher Average Silhouette coefficient, a measure of cluster
quality based on cohesion and separation

54
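Cohesion and separation combine into the silhouette coefficient. A minimal one-dimensional sketch (absolute distance; assumes every cluster has at least two points), reusing the clusters from the earlier K-means example — the poor split is a made-up contrast case, and `avg_silhouette` is a name chosen here:

```python
def avg_silhouette(clusters):
    """Average silhouette: for each point, a = mean distance to its own
    cluster (cohesion), b = smallest mean distance to another cluster
    (separation), s = (b - a) / max(a, b). Range [-1, 1]; higher is better."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for j, p in enumerate(cluster):
            own = cluster[:j] + cluster[j + 1:]
            a = sum(abs(p - q) for q in own) / len(own)
            b = min(sum(abs(p - q) for q in other) / len(other)
                    for k, other in enumerate(clusters) if k != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

good = [[1, 3, 4], [8, 10, 13]]   # the clusters K-means found earlier
bad = [[1, 3, 13], [4, 8, 10]]    # a deliberately poor split of the same data
print(round(avg_silhouette(good), 2), round(avg_silhouette(bad), 2))
```

The compact, well-separated solution scores much higher than the arbitrary split, which is the behaviour the Average Silhouette measure rewards.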
Issues with K-Means Clustering
• Sensitive to initial centroids
– That is, selection of different centroids may give different clustering results
• Users need to specify the value of K
• Sensitive to outliers
– A small number of outliers can substantially influence the mean values. It is advisable to remove
outliers before performing K-means
• Ideally, clustering criteria should be numeric
– Distance between categorical attributes is meaningless
• Very small clusters may not be detected
• Mainly spherical clusters may result
– Clusters that are elongated may be broken down into smaller clusters

55
Tips for K-Means Clustering
• To ease visualisation and analysis of the profile of each cluster, clustering solutions should not consist
of too many clusters

• Number of clustering criteria should not be excessive, so that the clustering solution can be interpreted
with reasonable ease

56
Case Study
Cluster Analysis in the Telco Industry
Case Study: Cluster Analysis in the Telco Industry
Background

Deregulation of the telco industry has led to widespread competition, and telco service carriers fight hard for
customers. The problem is that once a customer is acquired, competitors try to lure the customer away, making
retention very difficult. The phenomenon of a customer switching carriers is referred to as churn.

58
Case Study: Cluster Analysis in the Telco Industry
Business Problem

• Customers are switching to another service provider

Can cluster analysis be used to solve this problem? If so, how?

59
Case Study: Cluster Analysis in the Telco Industry
• Assume you have collected a dataset for cluster analysis. Load “Churn.sav” into IBM SPSS Modeler

60
Case Study: Cluster Analysis in the Telco Industry

61
Case Study: Cluster Analysis in the Telco Industry
Using all variables vs. using continuous variables only
[Two model summary screenshots; the Average Silhouette improves by 0.1 when only continuous variables are used]

62
Case Study: Cluster Analysis in the Telco Industry
How to extract the full list of
customers from cluster 5?

Do we need to perform
normalisation before clustering?

63
End of Study Unit 4
See you next week!
