
DM 5: Unsupervised Learning

Unsupervised learning = descriptive analytics


= no target (class) attribute
1. Association Rules (”Apriori” in Weka)
2. Clustering (K-means, EM)
1. Association Rules

 = unsupervised, also called ‘market basket analysis’


 With association rules, there is no “class” attribute
 Rules can predict any attribute, or combination of attributes
 Some association rules for the weather.nominal data:
An example: the supermarket
 To illustrate the concepts, we use another small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer} and a small database containing the items (1 codes presence and 0 absence of an item in a transaction) is shown in the table below. An example rule for the supermarket could be {milk, bread} => {butter}, meaning that if milk and bread are bought, customers also buy butter. Here {milk, bread} = X and {butter} = Y, so the rule has the form X => Y.
 To select interesting rules from the set of all possible rules, constraints on various measures of significance
and interest can be used. The best-known constraints are minimum thresholds on support and confidence.
 Support: the support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset:
supp(X) = no. of transactions which contain the itemset X / total no. of transactions

 Confidence: conf(X => Y) = supp(X ∪ Y) / supp(X)
Meaning: how often is the rule correct? (= reliability). Weakness of confidence: if X and/or Y have a high support, the confidence is also high by definition.

 The lift of a rule is defined as: lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y)) = conf(X => Y) / supp(Y)

The lift tells us how much our confidence that Y will be purchased has increased, given that X was purchased. It shows how effective the rule is in finding Y, compared to finding Y at random.
The supermarket example
Transaction ID   milk   bread   butter   beer
      1           1      1       0        0
      2           0      1       1        0
      3           0      0       0        1
      4           1      1       1        0
      5           0      1       0        0
      6           1      0       0        0
      7           0      1       1        1
      8           1      1       1        1
      9           0      1       0        1
     10           1      1       0        0
     11           1      0       0        0
     12           0      0       0        1
     13           1      1       1        0
     14           1      0       1        0
     15           1      1       1        1

In the example database, the itemset {milk, bread, butter} has a support of 4/15 ≈ 0.26, since it occurs in 26% of all transactions.

For the rule {milk, bread} => {butter} we have the following confidence: supp({milk, bread, butter}) / supp({milk, bread}) = 0.26 / 0.4 = 0.65. This means that for 65% of the transactions containing milk and bread the rule is correct.

The rule {milk, bread} => {butter} has the following lift: supp({milk, bread, butter}) / (supp({butter}) × supp({milk, bread})) = 0.26 / (0.46 × 0.4) ≈ 1.4.
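These calculations can be checked with a short script. The following is a minimal sketch (not part of the original slides); the transaction list encodes the table above, and the helper functions support, confidence and lift are named for illustration only.

```python
# Minimal sketch: verify support, confidence and lift for {milk, bread} => {butter}
transactions = [
    {"milk", "bread"},                      # 1
    {"bread", "butter"},                    # 2
    {"beer"},                               # 3
    {"milk", "bread", "butter"},            # 4
    {"bread"},                              # 5
    {"milk"},                               # 6
    {"bread", "butter", "beer"},            # 7
    {"milk", "bread", "butter", "beer"},    # 8
    {"bread", "beer"},                      # 9
    {"milk", "bread"},                      # 10
    {"milk"},                               # 11
    {"beer"},                               # 12
    {"milk", "bread", "butter"},            # 13
    {"milk", "butter"},                     # 14
    {"milk", "bread", "butter", "beer"},    # 15
]

def support(itemset):
    """Proportion of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """supp(X u Y) / supp(X)."""
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    """supp(X u Y) / (supp(X) * supp(Y))."""
    return support(lhs | rhs) / (support(lhs) * support(rhs))

X, Y = {"milk", "bread"}, {"butter"}
print(round(support(X | Y), 2))    # 0.27 (4/15; the slide truncates this to 0.26)
print(round(confidence(X, Y), 2))  # 0.67 (4/6; the slide rounds 0.26/0.4 to 0.65)
print(round(lift(X, Y), 2))        # 1.43 (approx. the 1.4 on the slide)
```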
Association Rules

 Typically, you will have a large number of rules

 Two 'quality' criteria are applied to the weather.nominal rules:
 Support: supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. supp(X) = no. of transactions which contain the itemset X / total no. of transactions
 Confidence: proportion of instances that satisfy the left-hand side for which the right-hand side also holds (reliability of the rule)
Association Rules

 Rules are created from itemsets

 An itemset is actually a set of attribute-value pairs, e.g.:

 There are 7 potential rules from this one itemset:


The Apriori Algorithm

 General Process
 Association rule generation is usually split up into two separate steps:
 First, minimum support is applied to find all frequent itemsets in a database.
 Second, these frequent itemsets and the minimum confidence constraint are used to form
rules.
 So: Generate high-support item sets, get several rules from each
 Strategy: iteratively reduce the minimum support until the required number of
rules is found with a given minimum confidence

Example in weather.nominal:
 Generate item sets with support 14 (none)
 Find rules in these item sets with minimum confidence level, 90% in Weka
 Continue with item sets with support 13 (none)
 And so on, until you have a sufficient number of rules
The Apriori Algorithm

 The Weather data has 336 rules with confidence 100%, but only 8 have support
≥ 3, only 58 have support ≥ 2
 In Weka: specify minimum confidence level (minMetric, default 90%), number of
rules sought (numRules, default 10)
 Apriori makes multiple passes through the data
 It generates 1-item sets, 2-item sets, ... with more than minimum support, turns each one into (many) rules and checks their confidence
 It starts at upperBoundMinSupport (usually left at 100%) and decreases by delta at each
iteration (default 5%). It stops when numRules is reached... or at lowerBoundMinSupport
(default 10%)
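The iteration strategy described above can be sketched in plain Python. This is an illustrative mock-up of the loop (lower the minimum support by delta until numRules confident rules are found), not Weka's actual Apriori code; the brute-force frequent_itemsets and rules_from helpers, and the small toy transaction list, are assumptions made for the example.

```python
from itertools import combinations

# Toy transactions (attribute-value pairs can be encoded as strings in the same way)
transactions = [
    {"milk", "bread"}, {"bread", "butter"}, {"beer"},
    {"milk", "bread", "butter"}, {"bread"}, {"milk"},
    {"bread", "butter", "beer"}, {"milk", "bread", "butter", "beer"},
]
items = sorted(set().union(*transactions))

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(min_support):
    """Brute-force search for all itemsets with support >= min_support."""
    return [set(c) for n in range(1, len(items) + 1)
            for c in combinations(items, n) if support(set(c)) >= min_support]

def rules_from(itemset, min_confidence):
    """Split an itemset into LHS => RHS and keep only the confident rules."""
    found = []
    for n in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), n):
            lhs, rhs = set(lhs), itemset - set(lhs)
            if support(itemset) / support(lhs) >= min_confidence:
                found.append((lhs, rhs))
    return found

# Weka-style loop: lower the support threshold until enough rules are found
num_rules, min_confidence = 10, 0.9
min_support, delta, lower_bound = 1.0, 0.05, 0.1
rules = []
while len(rules) < num_rules and min_support >= lower_bound:
    rules = [r for s in frequent_itemsets(min_support)
             for r in rules_from(s, min_confidence)]
    min_support -= delta

print(f"{len(rules)} rules found at support >= {min_support + delta:.2f}")
```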
The Apriori algorithm applied (weather.nominal)
Market Basket Analysis

 Look at supermarket.arff, collected from an actual New Zealand supermarket

 4627 instances, 217 attributes; approx. 1M attribute values

 Missing values are used to indicate that the basket did not contain that item

 92% of the values are missing

 The average basket contains 217 × 8% ≈ 17 items
 Most popular items: bread-and-cake (3330), vegetables (2961), frozen foods (2717), biscuits (2605)

 [to be completed]
2. Clustering
 With clustering, there is, again, no “class” attribute (so unsupervised learning)
 Try to divide the instances into natural, homogeneous groups, or “clusters”
 It is hard to evaluate the quality of a clustering solution
 in Weka: SimpleKMeans (+XMeans), EM, Cobweb
 Examples of clusters:
 Customer segmentation: divide customers into homogeneous groups, based on numerous attributes (age, degree, social background, average sales, number of visits, types of products bought, region, …)
 Why?
 To have a better understanding,
 To target marketing or promotion actions
 To focus on very ‘good’ or very ‘infrequent’ customers
 ….
 Student clustering
 Clustering of prisoners
 Course clustering
 Clustering of schools, companies, cars, ….
 Based on symptoms, patients can be clustered in order to determine an appropriate treatment

 Sometimes, the target population or observations can be subdivided into groups top-down, by argument, using specific criteria. This is not clustering.
 The goal of clustering is to start from a dataset with numerous attributes and to build up homogeneous groups using similarity measures (such as the Euclidean distance). So it is bottom-up and can be applied to a large dataset with many different attributes.
Clustering
 Automatically dividing data into homogeneous groups
 Intra-cluster distances are minimized; inter-cluster distances are maximized
Notion of a cluster can be ambiguous
 Which clustering is the best? Context-dependent…
[Figure: the same set of points grouped into two, four, or six clusters. How many clusters?]
Application: Customer segmentation

 Divide customer base into segments such that


• homogeneity within a segment is maximised (cohesive)

• heterogeneity between segments is maximised (separated)

 Example business analytics applications


• Understand the customer population, e.g. for targeted marketing and/or advertising (mass customization)
• Efficiently allocate marketing resources

• Differentiate between brands in a portfolio

• Identify most profitable customers

• Identify shopping patterns

• …
Typical features used for customer segmentation

 Demographic
 Age, gender, income, education, marital status, kids, …
 Lifestyle
 Vehicle ownership, Internet use, travel, pets, hobbies, …

 Attitudinal
 Product preferences, price sensitivity, willingness to try other brands, …

 Behavioral
 Products bought, prices paid, use of cash or credit, RFM, …

 Acquisitional
 Marketing channel, promotion type, …

 Social network-based data


Clustering: K-means
 K-Means: Iterative distance-based clustering (disjoint sets), the algorithm:

1. Specify k, the desired number of clusters (often very difficult: how many groups do you want? Not too many, not too few)
2. Choose k points at random as cluster centers
3. Assign all instances to their closest cluster center
4. Calculate the centroid (i.e., mean) of instances in each cluster
5. These centroids are the new cluster centers
6. Re-assign instances to these cluster centers
7. Re-calculate the centroids
8. Continue until the cluster centers don’t change

Minimizes the total squared distance from instances to their cluster centers.
In Weka, the Euclidean distance or the Manhattan distance can be used as the distance function to compute the distance between instances and centers.
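As an illustration of the eight steps above, here is a minimal k-means sketch in Python/numpy. It is not Weka's SimpleKMeans; the function name kmeans and the toy data are assumptions for the example, and only the Euclidean distance is implemented.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) array of numeric instances."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # step 2: random initial centers
    for _ in range(max_iter):
        # steps 3/6: assign each instance to the closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # steps 4/7: recompute the centroid (mean) of each cluster (keep old center if empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):                    # step 8: centers no longer change
            break
        centers = new_centers
    return labels, centers

# Toy usage: two obvious groups of 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [7.8, 8.2], [8.1, 7.9]])
print(kmeans(X, k=2))
```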
Clustering: K-means

• Initial cluster centers?

1) Determine k initial centers → several possibilities: farthest first, random, etc.
(see the parameter ‘initializationMethod’)

2) Determine for each instance the distance to the initial centers, and assign it to the closest center. The Euclidean distance is typically used.
The K-means Algorithm
3) Adjusting clusters?

- Compute new centers = average of all instances within a cluster

- Calculate the distance of each instance to the new centers, and assign the instance to the closest cluster (so instances can change, i.e. be re-assigned to another cluster)

4) Iterative process : repeat the previous step (calculating the center, re-assigning
instances) until instances do not change anymore = convergence
The K-means Algorithm

Very often, the desired result is achieved after a few iterations.
[Figure: starting from a random choice of initial centers, each cluster center moves after one iteration and again after two iterations.]
K-means: calculations

A simple example: suppose we have 6 instances (6 persons, objects, …) and only 2 attributes, X and Y.

    X  Y
A   3  7
B   6  1
C   5  8
D   1  0
E   7  6
F   4  5

[Chart: the six instances plotted in the X-Y plane.]
K-means: calculations

Step 1: Determine k initial centers using the Euclidean distance:

d(p, q) = sqrt( Σ_{i=1..n} (p_i − q_i)² )

For example, the distance between C (5, 8) and D (1, 0) is sqrt((5 − 1)² + (8 − 0)²) = 8.94.

Euclidean distance matrix:
        A      B      C      D      E      F
A   0.000  6.708  2.236  7.280  4.123  2.236
B   6.708  0.000  7.071  5.099  5.099  4.472
C   2.236  7.071  0.000  8.944  2.828  3.162
D   7.280  5.099  8.944  0.000  8.485  5.831
E   4.123  5.099  2.828  8.485  0.000  3.162
F   2.236  4.472  3.162  5.831  3.162  0.000

We opt for the “Farthest First” option and choose the two instances that are farthest apart as initial centers: C and D (distance 8.944).
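The distance matrix can be reproduced with numpy as a quick check. This is a sketch added for illustration (the variable names are arbitrary), not part of the original slides.

```python
import numpy as np

# The six instances from the example (rows = A..F, columns = X, Y)
points = np.array([[3, 7], [6, 1], [5, 8], [1, 0], [7, 6], [4, 5]], dtype=float)
labels = ["A", "B", "C", "D", "E", "F"]

# Pairwise Euclidean distances: d(p, q) = sqrt(sum_i (p_i - q_i)^2)
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

print(" " + "".join(f"{l:>7}" for l in labels))
for l, row in zip(labels, dist):
    print(l + "".join(f"{d:7.3f}" for d in row))
# e.g. d(C, D) = 8.944, the largest distance, so C and D are the farthest-first centers
```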
K-means: calculations

Step 2: For each instance, compute the Euclidean distance to each cluster center (C and D). For example, for A and C: sqrt((3 − 5)² + (7 − 8)²) = sqrt(5) = 2.236.

Distances to the two centers:
      to C    to D
A    2.236   7.280
B    7.071   5.099
C    0.000   8.944
D    8.944   0.000
E    2.828   8.485
F    3.162   5.831
K-means: calculations

Step 3: Assign each instance to the center that is the nearest.
So: A is assigned to center C, B to center D, E to C and F to C.
This gives cluster C1 = {A, C, E, F} (around center C) and cluster C2 = {B, D} (around center D).
K-means: calculations

Step 4: Re-calculate the new centers based on the averages of the instances in each cluster.

New center of cluster 1 (CC1): X = (3 + 5 + 7 + 4) / 4 = 4.75, Y = (7 + 8 + 6 + 5) / 4 = 6.50
New center of cluster 2 (CC2): X = (6 + 1) / 2 = 3.50, Y = (1 + 0) / 2 = 0.50

[Chart: the new centers CC1 (4.75, 6.50) and CC2 (3.50, 0.50) plotted with the six instances.]
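The whole worked example (steps 1 to 4) can be reproduced with a few lines of numpy. This is an illustrative sketch, not Weka output; the variable names are chosen for this example only.

```python
import numpy as np

points = np.array([[3, 7], [6, 1], [5, 8], [1, 0], [7, 6], [4, 5]], dtype=float)  # A..F
names = np.array(["A", "B", "C", "D", "E", "F"])
centers = points[[2, 3]]            # farthest-first initial centers: C and D

# Steps 2-3: distance of every instance to each center, then assign to the nearest one
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
assign = dists.argmin(axis=1)
print("cluster 1:", names[assign == 0])   # ['A' 'C' 'E' 'F']
print("cluster 2:", names[assign == 1])   # ['B' 'D']

# Step 4: new centers = mean of the instances in each cluster
new_centers = np.array([points[assign == j].mean(axis=0) for j in range(2)])
print(new_centers)                        # [[4.75 6.5 ] [3.5  0.5 ]]
```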
K-means

 In sum, K-means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean. This method produces exactly k different clusters of greatest possible distinction. The best number of clusters k, leading to the greatest separation (distance), is not known a priori and must be computed from the data. The objective of K-means clustering is to minimize the total intra-cluster variance, i.e. the squared error function:

J = Σ_{j=1..k} Σ_{x ∈ cluster j} ||x − μ_j||²,  where μ_j is the mean (centroid) of cluster j.
Clustering: K-means
 K-means in Weka:

 Open weather.numeric.arff
 Cluster panel; choose
SimpleKMeans
 Note parameters:
numClusters,
distanceFunction, seed
(default 10)
 Two clusters, 9 and 5
members, total squared
error 16.2
 {1/no, 2/no, 3/yes, 4/yes,
5/yes, 8/no, 9/yes, 10/yes,
13/yes}
{6/no, 7/yes, 11/yes, 12/yes,
14/no}
 Set seed to 11: Two
clusters, 6 and 8 members,
total squared error 13.6
 Set seed to 12: total
squared error 17.3
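The effect of the random seed can also be seen outside Weka. The following sketch uses scikit-learn's KMeans on an arbitrary numeric array (both are assumptions; the slides use Weka's SimpleKMeans on weather.numeric); inertia_ plays the role of Weka's "total squared error".

```python
import numpy as np
from sklearn.cluster import KMeans

# Any numeric data set will do; here 20 random 2-D points stand in for weather.numeric
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

# Different seeds can give different cluster sizes and a different total squared error
for seed in (10, 11, 12):
    km = KMeans(n_clusters=2, n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}: sizes={np.bincount(km.labels_)}, total squared error={km.inertia_:.2f}")
```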
Evaluating Clusters
 Now we know the size and the characteristics of the cluster
 How good is our clustering?
 Visualizing Clusters:
 Open the Iris.arff data, apply SimpleKMeans, specify 3 clusters
 3 clusters with 50 instances each
 Visualize cluster assignments (right-click menu in Result List)
 Plot clusters (x-axis) against the instance numbers: the denser the plot, the more cohesive the cluster and the better the quality
 Which instances does a cluster contain?
 Use the AddCluster unsupervised attribute filter (in the Preprocess tab !)
 Try with SimpleKMeans (within the filter); Apply and click Edit
 What about the class variable?
 Also apply “visualize cluster assignments”, clusters on the X, class variable on the Y. There are yes
and no’s in both clusters: so no perfect match between clusters and class values
 Try the “Ignore attribute” button, ignore the class attribute; run again with 3 clusters, now: 61, 50, 39
instances
Visualizing Clusters:

With all attributes: very dense, balanced clusters. Leaving out the class variable: more distance within the clusters, less balanced (but still acceptable).
Classes-to-clusters evaluation
 In the Iris data: SimpleKMeans,
specify 3 clusters
 Classes to clusters
evaluation = using clustering in
a supervised way…
 Classes are assigned to
clusters; can clusters predict
the class values?
 Now you have a confusion (classification) matrix and an accuracy! (100% − 11% = 89%)
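Classes-to-clusters evaluation can be mimicked outside Weka: cluster without the class attribute, assign each cluster the majority class of its members, and count how many instances are "classified" correctly. A hedged sketch using scikit-learn's iris data and KMeans (both assumptions; the slides use Weka) is shown below.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)                 # the class attribute is kept aside
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Classes-to-clusters: each cluster "predicts" the majority class of its members
correct = 0
for c in np.unique(clusters):
    class_counts = np.bincount(y[clusters == c])  # how the classes are spread over cluster c
    correct += class_counts.max()                 # majority class counts as correctly classified
print(f"accuracy: {correct / len(y):.0%}")        # typically close to the slide's 89%
```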
