
ANL303

Fundamentals of Data Mining


Study Unit 4
Agenda
1. Unit 4 Overview – Association
2. Case Study: Association Analysis of Elderly Care Services
3. Unit 4 Overview – Clustering
4. Activity: K-Means Clustering Using Playing Cards
5. Case Study: Cluster Analysis in the Telco Industry

3
Study Unit 4
Association and Clustering
Key Learning Objectives for this unit include:

• Construct association rule mining models with the use of Apriori algorithms;
• Construct clustering models with the use of K-means algorithms;
• Appraise applications of association analysis and cluster analysis;
• Analyse the results of association analysis and cluster analysis;
• Evaluate the quality of association and clustering solutions.
Association Analysis
Association Analysis
• Find relationships among variables based on co-occurrence

• Possible applications:
– To determine what products are frequently purchased together (market basket analysis)

[Table: purchases of Chips, Bread, Milk and Butter by customers Amy, Ben, Cindy, David, Evan, Flora and Gloria; the tick marks did not survive extraction]

Bread and Butter frequently appear together in the same transaction.

Useful for decisions related to store layout, discounts, product bundling, discount coupons, etc.
7
Association Analysis
• Possible applications:
– To learn what symptoms co-occur frequently
with a confirmed diagnosis
• The discovery of these associations can help
medical professionals to identify high-risk
patients (i.e., patients who possess the
symptoms)

– Fraud detection
• What patterns or incidents in transactions
may flag a potential fraud? E.g., a
purchase requisition submitted by staff A is
approved by staff B, who is a relative of staff
A

8
Example
Terminology

[Table: purchases of Chips, Bread, Milk and Butter by customers Amy, Ben, Cindy, David and Evan; the tick marks did not survive extraction]

• Itemset = a set of items
• k-itemset = an itemset that contains k items
– {Milk} is a 1-itemset; {Bread, Butter} is a 2-itemset
• Support = the occurrence frequency of an itemset
– The support of {Milk} is 2/5 = 40%; the support of {Bread, Butter} is 3/5 = 60%
• Frequent itemset = an itemset with its support >= minimum support threshold value
– If the min. support threshold value is set as 50%, then {Bread, Butter} is a frequent itemset but {Milk} is not.

9
Apriori Algorithm for Association Rule Mining
• An association rule takes the following form:
X → Y (rule support, confidence)
– X is the antecedent
– Y is the consequent

• Unlike usual logical rules, association rules involve some uncertainty

• Two measures of rule interestingness:


– Rule support
• The proportion of transactions that include both the antecedent and consequent itemsets (i.e., the support of {X, Y})
– Confidence
• Measures the strength of the association
• The ratio of the number of transactions that include both the antecedent and consequent itemsets (i.e., the rule
support) to the number of transactions that include the antecedent itemset (i.e., the support of {X})
10
Apriori Algorithm for Association Rule Mining
Exercise

• What is the support for {Bread, Chips}?

ID Item
1 Bread, Jam
2 Bread, Chips, Biscuits, Ice-cream
3 Jam, Chips, Biscuits, Muffins
4 Bread, Jam, Chips, Biscuits, Muffins
5 Bread, Jam, Chips, Muffins
11
Apriori Algorithm for Association Rule Mining
Solution

• What is the support for {Bread, Chips}? 3/5 = 60%

ID Bread Jam Chips Biscuits Ice-cream Muffins


1 1 1 0 0 0 0
2 1 0 1 1 1 0
3 0 1 1 1 0 1
4 1 1 1 1 0 1
5 1 1 1 0 0 1
12
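The support computation above can be sketched in a few lines of Python. This is a minimal illustration using the transactions from the exercise table; the `support` helper is a name chosen here, not part of any toolkit.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Transactions from the exercise table above
transactions = [
    {"Bread", "Jam"},                                  # ID 1
    {"Bread", "Chips", "Biscuits", "Ice-cream"},       # ID 2
    {"Jam", "Chips", "Biscuits", "Muffins"},           # ID 3
    {"Bread", "Jam", "Chips", "Biscuits", "Muffins"},  # ID 4
    {"Bread", "Jam", "Chips", "Muffins"},              # ID 5
]

print(support({"Bread", "Chips"}, transactions))  # 3 of 5 baskets -> 0.6
```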
Apriori Algorithm for Association Rule Mining
Exercise

• What is the confidence for {Bread} →{Chips}?

ID Item
1 Bread, Jam
2 Bread, Chips, Biscuits, Ice-cream
3 Jam, Chips, Biscuits, Muffins
4 Bread, Jam, Chips, Biscuits, Muffins
5 Bread, Jam, Chips, Muffins
13
Apriori Algorithm for Association Rule Mining
Solution

Confidence = support of {X, Y} (i.e., rule support) ÷ support of {X}

• What is the confidence for {Bread} → {Chips}? 3/4 = 75%


ID Bread Jam Chips Biscuits Ice-cream Muffins
1 1 1 0 0 0 0
2 1 0 1 1 1 0
3 0 1 1 1 0 1
4 1 1 1 1 0 1
5 1 1 1 0 0 1
This means that 75% of the transactions that contain Bread also contain Chips.
14
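The confidence computation can be sketched the same way (a minimal illustration; `support` and `confidence` are helper names chosen here):

```python
def support(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of X -> Y: rule support divided by the support of X."""
    rule_support = support(set(antecedent) | set(consequent), transactions)
    return rule_support / support(antecedent, transactions)

transactions = [
    {"Bread", "Jam"},                                  # ID 1
    {"Bread", "Chips", "Biscuits", "Ice-cream"},       # ID 2
    {"Jam", "Chips", "Biscuits", "Muffins"},           # ID 3
    {"Bread", "Jam", "Chips", "Biscuits", "Muffins"},  # ID 4
    {"Bread", "Jam", "Chips", "Muffins"},              # ID 5
]

# 3 of the 4 baskets containing Bread also contain Chips -> 75%
print(round(confidence({"Bread"}, {"Chips"}, transactions), 2))
```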
Example
Apriori Algorithm

TID   Item purchased
001   Chips, Durian
002   Apple, Banana, Chips
003   Apple, Banana, Durian, Egg
004   Apple, Banana, Chips, Durian
005   Apple, Chips, Durian, Egg

• Apriori Algorithm consists of two phases:
– Phase 1: Find all frequent itemsets with support >= minimum support threshold value
– Phase 2: Based on the frequent itemsets, generate association rules with confidence >= minimum confidence threshold value

If {Apple, Banana} is a frequent itemset, there are two association rules:
Rule 1: {Apple} → {Banana}
Rule 2: {Banana} → {Apple}
For Rule 1, the rule support is 60% and the confidence is 75%.
For Rule 2, the rule support is 60% and the confidence is 100%.
If the min. confidence threshold value is set as 80%, then only Rule 2 will be generated.
15
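Phase 2 can be sketched for a single frequent itemset: split it into every antecedent/consequent pair and keep only the rules that meet the confidence threshold. This is a minimal illustration of the example above; `rules_from_itemset` is a name chosen here.

```python
from itertools import combinations

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rules_from_itemset(freq_itemset, transactions, min_confidence):
    """Phase 2 for one frequent itemset: try every antecedent/consequent
    split and keep rules whose confidence meets the threshold."""
    items = set(freq_itemset)
    whole = support(items, transactions)  # the rule support, same for all splits
    rules = []
    for r in range(1, len(items)):
        for ante in combinations(sorted(items), r):
            conf = whole / support(set(ante), transactions)
            if conf >= min_confidence:
                rules.append((set(ante), items - set(ante), whole, conf))
    return rules

transactions = [
    {"Chips", "Durian"},                     # 001
    {"Apple", "Banana", "Chips"},            # 002
    {"Apple", "Banana", "Durian", "Egg"},    # 003
    {"Apple", "Banana", "Chips", "Durian"},  # 004
    {"Apple", "Chips", "Durian", "Egg"},     # 005
]

# With min. confidence 80%, only Rule 2 ({Banana} -> {Apple}) survives
for ante, cons, sup, conf in rules_from_itemset({"Apple", "Banana"}, transactions, 0.8):
    print(ante, "->", cons, f"(support {sup:.0%}, confidence {conf:.0%})")
```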
Apriori Algorithm
• Phase 1 is computationally expensive. Using the downward closure principle can speed up the
process
– If an itemset is not frequent, then its supersets must not be frequent
• For example, if {A} is not a frequent itemset, then its supersets such as {A,B}, {A,C}, and {A,B,C} must not
be frequent itemsets either
– Using this principle, Apriori algorithm can infer infrequent itemsets and prune them immediately

16
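The level-wise search of Phase 1, including the downward-closure pruning, can be sketched as follows. This is a simplified illustration (not an optimised Apriori implementation); `frequent_itemsets` is a name chosen here, and the data is the TID table from the previous slide.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search (Phase 1): (k+1)-candidates are built only from
    frequent k-itemsets, so by downward closure every superset of an
    infrequent itemset is pruned without ever being counted."""
    n = len(transactions)
    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    candidates = list({frozenset([i]) for t in transactions for i in t})
    result = {}
    while candidates:
        frequent = {c for c in candidates if sup(c) >= min_support}
        result.update({c: sup(c) for c in frequent})
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset
        joined = {a | b for a in frequent for b in frequent
                  if len(a | b) == len(a) + 1}
        candidates = [c for c in joined
                      if all(frozenset(s) in frequent
                             for s in combinations(c, len(c) - 1))]
    return result

transactions = [frozenset(t) for t in (
    {"Chips", "Durian"},                     # 001
    {"Apple", "Banana", "Chips"},            # 002
    {"Apple", "Banana", "Durian", "Egg"},    # 003
    {"Apple", "Banana", "Chips", "Durian"},  # 004
    {"Apple", "Chips", "Durian", "Egg"},     # 005
)]

freq = frequent_itemsets(transactions, min_support=0.6)
for itemset in sorted(freq, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), f"{freq[itemset]:.0%}")
```

Note how {Egg} (support 40%) is dropped at level 1, so pairs such as {Banana, Egg} are never even counted.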
Apriori Algorithm
• Apriori algorithm is designed for categorical variables (e.g., products available in a supermarket)

Question: How do you use Apriori algorithm to analyse a dataset with numeric attributes?

17
Apriori Algorithm
• Apriori algorithm is designed for categorical variables (e.g., products available in a supermarket)

Question: How do you use Apriori algorithm to analyse a dataset with numeric attributes?

Range Categories
$0 - $10 Low
$10.01 - $20 Medium Low
$20.01 - $30 Medium
$30.01 - $40 Medium High
$40.01 and above High
18
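As the table above suggests, a numeric attribute is first discretised (binned) into categories so that Apriori can treat it like any other categorical value. A minimal sketch, with the band boundaries taken from the table (`to_category` is a name chosen here):

```python
def to_category(amount):
    """Map a dollar amount to the bands in the table above."""
    if amount <= 10:
        return "Low"
    if amount <= 20:
        return "Medium Low"
    if amount <= 30:
        return "Medium"
    if amount <= 40:
        return "Medium High"
    return "High"

print([to_category(x) for x in (5, 15.5, 29.99, 40.01)])
# ['Low', 'Medium Low', 'Medium', 'High']
```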
Using Apriori Node in IBM SPSS Modeler
Dataset: BASKETS1N.csv
• The dataset contains 18 fields, with each record representing a basket.
• The 18 fields are as follows:

19
Using Apriori Node in IBM SPSS Modeler
Set Measurement to “Flag”.
“T” (true value) means that the product is purchased by the customer; “F” (false value) otherwise.

“Both” means that the field can appear in either the antecedent or consequent part of the rules.
20
Using Apriori Node in IBM SPSS Modeler

Set the Minimum rule confidence to 50%.

We are only interested in rules that describe what products the customer bought (i.e., true value), rather than what they did not buy (i.e., false value).
21
Using Apriori Node in IBM SPSS Modeler

• What can you interpret from the results?

22
Evaluation
• Whether or not an association rule is interesting can be assessed either subjectively or objectively
(e.g., based on the support and confidence)
• The user can judge if a given rule is interesting, and this judgement, being subjective, may differ from
one user to another

23
Evaluation
You have been asked to evaluate the following rule:
A→B
• A = a patient is tested positive for a blood test ‘A’
• B = the patient may have contracted a deadly disease ‘B’
• Support: 1%
• Rule confidence: 99.9%

Question: To what extent is this rule useful to a doctor?

24
Evaluation
You have been asked to evaluate the following rule:
A→ B
• A = a patient is tested positive for a blood test ‘A’
• B = the patient may have contracted a deadly disease ‘B’
• Support: 1%
• Rule confidence: 99.9%

Question: To what extent is this rule useful to a doctor?


Suggested Answer:
• This rule is potentially useful because a doctor can prescribe appropriate treatment(s) to a patient if it is
known that the patient is tested positive for blood test A.
• On the other hand, this rule will be somewhat limited if blood test A is very costly because 99% of the
patients will be tested negative. However, this limitation may be mitigated if the cost of not knowing/treating
the disease is very high.

25
Case Study
Association Analysis of Elderly Care Services
Case Study: Association Analysis of Elderly Care Services
Background

• MySUSS Charity is an organisation that provides a variety of services to assist elderly in managing
daily tasks such as eating, toileting, walking and bathing.

• The dataset “ElderlyService.csv” is about services used by clients of MySUSS Charity. There are 786
records, each of which contains information about an individual client as well as the services used by the
client.

27
Case Study: Association Analysis of Elderly Care Services
Business Problem

• Due to the ageing population, the demand for elderly services is increasing. MySUSS Charity has to recruit
more frontline staff in order to cope with the increasing demand. Without understanding the patterns of
elderly services used by the clients, it is unclear what kind of skillsets the staff should acquire
through training.

Can association analysis be used to solve this problem? If so, how?

28
Case Study: Association Analysis of Elderly Care Services
Your Task
• Load “ElderlyService.csv” into IBM SPSS Modeler
• Set the appropriate measurement and role of the fields
• Add the Apriori Node to your stream. Set the minimum antecedent support and the minimum rule
confidence to 10% and 75%, respectively

Based on the association analysis results in the case study, what suggestions do you have for MySUSS Charity to tackle the business problem?

29
Clustering
Clustering
• Cluster analysis (or simply clustering) discovers
natural groupings in data

• To group similar (homogeneous) objects into the same cluster and dissimilar (heterogeneous) objects into different clusters

• A cluster is a subgroup of data objects such that the objects within a cluster are similar to one another and dissimilar to the objects in other clusters

31
Clustering
• Possible applications:
– Market segmentation
• Cluster customers into segments so that strategies can be formulated to maximise company revenue
– Fraud detection
• Identify potential fraud cases by reviewing suspicious clusters

• Clustering organises objects into conceptually meaningful groups based on distance (proximity) measures

32
Distance Measures
• Distance measures are commonly used for computing the dissimilarity of objects described by
numeric attributes

• One of the widely used distance measures is the Euclidean distance

• Suppose that there are n objects, and each object is described by p attributes. Euclidean distance
between object i and j is defined as

d(i, j) = √[(x_i1 − x_j1)² + (x_i2 − x_j2)² + ⋯ + (x_ip − x_jp)²]

33
d(i, j) = √[(x_i1 − x_j1)² + (x_i2 − x_j2)² + ⋯ + (x_ip − x_jp)²]
Distance Measures
What is the Euclidean distance between Customer A and Customer B?
Object Age (Years) Monthly Income ($)

Customer A 18 1000

Customer B 38 5000

√[(18 − 38)² + (1000 − 5000)²] ≈ 4000.05

Attributes with a larger range (i.e., Monthly Income in this case) outweigh attributes with a smaller range (i.e., Age in this case)

34
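The distance formula and the worked example above can be checked with a short sketch (`euclidean` is a helper name chosen here):

```python
from math import sqrt

def euclidean(x, y):
    """Euclidean distance between two objects described by p numeric attributes."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

customer_a = (18, 1000)   # (Age, Monthly Income)
customer_b = (38, 5000)
print(round(euclidean(customer_a, customer_b), 2))  # 4000.05
```

The income difference (4000) dominates the result; the age difference (20) barely moves it.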
Distance Measures
• After normalisation…. Distance = √[(0.11 − 0.33)² + (0.05 − 0.45)²] ≈ 0.457
Now, income is no longer dominant

35
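A sketch of min-max normalisation and the recomputed distance. The normalised coordinates (0.11, 0.05) and (0.33, 0.45) are taken from the slide; the minimum/maximum values they were scaled against are not reproduced in the deck, so `min_max` below only illustrates the transformation itself.

```python
from math import sqrt

def min_max(value, lo, hi):
    """Min-max normalisation: rescale a value into the range [0, 1]."""
    return (value - lo) / (hi - lo)

# Normalised (Age, Income) coordinates quoted on the slide
a = (0.11, 0.05)
b = (0.33, 0.45)
distance = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(round(distance, 3))  # ~0.457: income no longer dominates
```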
Clustering Algorithms

36
Partitional Clustering – K-Means Clustering
• “K” refers to the number of clusters and “means” refers to the cluster centroids (i.e., the centre or
average of all the objects within a cluster)

Steps:
1. Randomly select K observations as initial cluster centroids.
2. Compute the distance of each object to the centroid.
3. Based on the distance computed, each object is assigned to the nearest centroid. Objects assigned to
the same centroid then form a cluster.
4. For each cluster, recompute the centroid using the objects assigned to the cluster. The iteration starts
again from Step 2.
5. The iteration stops when the centroids remain unchanged or a specified number of iterations has been
performed.

37
Activity
K-Means Clustering Using Playing Cards

38
Activity
How to use K-Means to assign the cards to 3 clusters?
➢ Please download “k-means_activity.pptx” from Canvas
Blue Yellow Red
cluster cluster cluster
4 2 1

40
Example of K-Means Clustering (with One Dimension)
An example: Given {1,4,8,3,13,10}

• Use K-means to create two clusters


– That is, K=2
• Randomly select the following centroids:
– Cluster 1’s centroid is : {2}
– Cluster 2’s centroid is : {7}

[Number line from 0 to 13 showing the points 1, 3, 4, 8, 10 and 13, with the initial centroids at 2 and 7]

41
Example of K-Means Clustering (with One Dimension)
An example: Given {1,4,8,3,13,10}

• Use K-means to create two clusters


– That is, K=2
• After the first assignment, the centroids are recomputed:
– Cluster 1’s centroid is {2.7}, i.e., (1+3+4)/3
– Cluster 2’s centroid is {10.3}, i.e., (8+10+13)/3

[Number line from 0 to 13 showing the points 1, 3, 4, 8, 10 and 13, with the updated centroids at 2.7 and 10.3]

42
Example of K-Means Clustering (with One Dimension)
Given { 1,4,8,3,13,10 }, K=2

• Assign 1, 4 and 3 to Cluster 1
– As 1, 4 and 3 are nearest to the centroid {2.7}
– New centroid is: (1+3+4) / 3 = 2.7 → no change in centroid
• Assign 10, 8 and 13 to Cluster 2
– As 10, 8 and 13 are nearest to the centroid {10.3}
– New centroid is: (10+8+13) / 3 = 10.3 → no change in centroid

• When the centroids stop changing → stop assigning objects to clusters

• Other stopping criteria include the number of iterations
43
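The whole one-dimensional run can be reproduced with a small K-means sketch (plain Python; assumes no cluster ever becomes empty, and `k_means` is a name chosen here):

```python
def k_means(points, centroids, max_iter=100):
    """Assign each point to its nearest centroid, recompute centroids as
    cluster means, and stop once the centroids no longer change."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [sum(c) / len(c) for c in clusters]  # assumes no empty cluster
        if new_centroids == centroids:  # stopping criterion: no change
            break
        centroids = new_centroids
    return centroids, clusters

# Initial centroids {2} and {7}, as in the example
centroids, clusters = k_means([1, 4, 8, 3, 13, 10], [2, 7])
print([round(c, 1) for c in centroids], clusters)  # [2.7, 10.3] [[1, 4, 3], [8, 13, 10]]
```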


Example of K-Means Clustering (with Two Dimensions)
• A small dataset
– n = 6 objects (6 premier hotels, labelled A to F)
– p = 2 clustering variables [rankings of their facilities (FACILITY) and locations (LOCATION)]

• Both rankings are measured from 1 to 10, with 10 being the most desirable score

44
Example of K-Means Clustering (with Two Dimensions)
• Set k = 3

• Randomly set 3 centroids

45
Example of K-Means Clustering (with Two Dimensions)

D and F nearest to C3’s centroid

46
Example of K-Means Clustering (with Two Dimensions)

B and E nearest to C1’s centroid

47
Example of K-Means Clustering (with Two Dimensions)

A and C nearest to C2’s centroid

48
Example of K-Means Clustering (with Two Dimensions)

Update C3’s centroid

49
Example of K-Means Clustering (with Two Dimensions)

Update C1’s centroid

50
Example of K-Means Clustering (with Two Dimensions)

Update C2’s centroid

51
Example of K-Means Clustering (with Two Dimensions)

No more change in centroids; three clusters formed

52
Evaluation
• The goal of cluster analysis is to come up with meaningful clusters

• A good clustering solution should allow us to clearly describe the profile of each cluster

[Chart: three cluster profiles, labelled High rating, Medium rating and Low rating]

53
Evaluation
• Objective measures for evaluating the quality of clustering solutions:
– Cohesion (Compactness): How close are the objects in a cluster
– Separation (Differentiation): How far are the clusters from each other
– Parsimony: Minimum number of clusters to capture the variations in the data

• Usually a better clustering solution has a higher Average Silhouette coefficient, a measure of cluster
quality based on cohesion and separation

54
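Cohesion and separation combine into the silhouette coefficient. A minimal one-dimensional sketch (absolute distance; assumes every cluster has at least two points), reusing the clusters from the earlier K-means example — the poor split is a made-up contrast case, and `avg_silhouette` is a name chosen here:

```python
def avg_silhouette(clusters):
    """Average silhouette: for each point, a = mean distance to its own
    cluster (cohesion), b = smallest mean distance to another cluster
    (separation), s = (b - a) / max(a, b). Range [-1, 1]; higher is better."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for j, p in enumerate(cluster):
            own = cluster[:j] + cluster[j + 1:]
            a = sum(abs(p - q) for q in own) / len(own)
            b = min(sum(abs(p - q) for q in other) / len(other)
                    for k, other in enumerate(clusters) if k != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

good = [[1, 3, 4], [8, 10, 13]]   # the clusters K-means found earlier
bad = [[1, 3, 13], [4, 8, 10]]    # a deliberately poor split of the same data
print(round(avg_silhouette(good), 2), round(avg_silhouette(bad), 2))
```

The compact, well-separated solution scores much higher than the arbitrary split, which is the behaviour the Average Silhouette measure rewards.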
Issues with K-Means Clustering
• Sensitive to initial centroids
– That is, selection of different centroids may give different clustering results
• Users need to specify the value of K
• Sensitive to outliers
– A small number of outliers can substantially influence the mean values. It is advisable to remove
outliers before performing K-means
• Ideally, clustering criteria should be numeric
– Distance between categorical attributes is meaningless
• Very small clusters may not be detected
• Mainly spherical clusters may result
– Clusters that are elongated may be broken down into smaller clusters

55
Tips for K-Means Clustering
• To ease visualisation and analysis of the profile of each cluster, clustering solutions should not consist
of too many clusters

• Number of clustering criteria should not be excessive, so that the clustering solution can be interpreted
with reasonable ease

56
Case Study
Cluster Analysis in the Telco Industry
Case Study: Cluster Analysis in the Telco Industry
Background

Deregulation of the telco industry has led to widespread competition, and telco service carriers fight hard for
customers. The problem is that once a customer is acquired, competitors try to lure the customer away, making
retention very difficult. The phenomenon of a customer switching carriers is referred to as churn.

58
Case Study: Cluster Analysis in the Telco Industry
Business Problem

• Customers are switching to another service provider

Can cluster analysis be used to solve this problem? If so, how?

59
Case Study: Cluster Analysis in the Telco Industry
• Assume you have collected a dataset for cluster analysis. Load “Churn.sav” into IBM SPSS Modeler

60
Case Study: Cluster Analysis in the Telco Industry

61
Case Study: Cluster Analysis in the Telco Industry
Using all variables vs. using continuous variables only
[Two model summary screenshots; the Average Silhouette improves by 0.1 when only continuous variables are used]

62
Case Study: Cluster Analysis in the Telco Industry
How to extract the full list of
customers from cluster 5?

Do we need to perform
normalisation before clustering?

63
End of Study Unit 4
See you next week!
