Business Intelligence DM5: Unsupervised Learning
Confidence: \( \mathrm{conf}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X) \)
Meaning: how often is the rule correct? (= reliability). Weakness of confidence: if X and/or Y have a high support, the confidence is by definition also high.
The lift tells us how much our confidence that Y will be purchased has increased, given that X was purchased. It shows how effective the rule is at finding Y, compared to finding Y at random:
Lift: \( \mathrm{lift}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y) / (\mathrm{supp}(X) \times \mathrm{supp}(Y)) \)
The supermarket example
Transaction ID  milk  bread  butter  beer
 1               1     1      0       0
 2               0     1      1       0
 3               0     0      0       1
 4               1     1      1       0
 5               0     1      0       0
 6               1     0      0       0
 7               0     1      1       1
 8               1     1      1       1
 9               0     1      0       1
10               1     1      0       0
11               1     0      0       0
12               0     0      0       1
13               1     1      1       0
14               1     0      1       0
15               1     1      1       1

In the example database, the itemset {milk, bread, butter} has a support of 4/15 ≈ 0.27, since it occurs in 27% of all transactions.

For the rule {milk, bread} => {butter} we have the following confidence: supp({milk, bread, butter}) / supp({milk, bread}) = (4/15) / (6/15) = 4/6 ≈ 0.67. This means that for 67% of the transactions containing milk and bread, the rule is correct.

The rule {milk, bread} => {butter} has the following lift: supp({milk, bread, butter}) / (supp({butter}) × supp({milk, bread})) = (4/15) / ((7/15) × (6/15)) ≈ 1.43. (The sketch below reproduces these numbers.)
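To make the definitions concrete, here is a minimal Python sketch (not from the slides) that reproduces the numbers above; the `transactions` list encodes the table row by row:

```python
# Minimal sketch: support, confidence and lift for the supermarket example.
# Each transaction is the set of items bought (one set per table row).
transactions = [
    {"milk", "bread"}, {"bread", "butter"}, {"beer"},
    {"milk", "bread", "butter"}, {"bread"}, {"milk"},
    {"bread", "butter", "beer"}, {"milk", "bread", "butter", "beer"},
    {"bread", "beer"}, {"milk", "bread"}, {"milk"}, {"beer"},
    {"milk", "bread", "butter"}, {"milk", "butter"},
    {"milk", "bread", "butter", "beer"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """How often the rule X => Y is correct among transactions containing X."""
    return support(x | y) / support(x)

def lift(x, y):
    """Confidence of X => Y relative to how often Y is bought at random."""
    return support(x | y) / (support(x) * support(y))

x, y = {"milk", "bread"}, {"butter"}
print(round(support(x | y), 2))    # 0.27
print(round(confidence(x, y), 2))  # 0.67
print(round(lift(x, y), 2))        # 1.43
```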
Association Rules
General Process
Association rule generation is usually split up into two separate steps:
First, minimum support is applied to find all frequent itemsets in the database.
Second, these frequent itemsets and the minimum confidence constraint are used to form rules.
So: generate high-support item sets, and get several rules from each.
Strategy: iteratively reduce the minimum support until the required number of rules is found with a given minimum confidence.
Example in weather.nominal:
Generate item sets with support 14 (none)
Find rules in these item sets with the minimum confidence level (90% in Weka)
Continue with item sets with support 13 (none)
And so on, until you have a sufficient number of rules (a sketch of the two steps follows below)
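A minimal sketch of the two steps in Python, assuming the `transactions` list from the earlier sketch; `frequent_itemsets` and `rules_from` are illustrative names, not Weka's API:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Step 1: level-wise search for all itemsets with support >= min_support."""
    n = len(transactions)
    frequent = {}
    level = {frozenset([item]) for t in transactions for item in t}
    while level:
        # Count each candidate and keep those above the support threshold.
        supports = {c: sum(c <= t for t in transactions) / n for c in level}
        survivors = {c for c, s in supports.items() if s >= min_support}
        frequent.update({c: supports[c] for c in survivors})
        # Join step: build (k+1)-item candidates from frequent k-item sets.
        level = {a | b for a in survivors for b in survivors
                 if len(a | b) == len(a) + 1}
    return frequent

def rules_from(frequent, min_conf):
    """Step 2: split each frequent itemset into LHS => RHS, keep confident rules."""
    rules = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = supp / frequent[lhs]  # supp(X u Y) / supp(X)
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

rules = rules_from(frequent_itemsets(transactions, min_support=0.25), min_conf=0.65)
# `rules` includes ({'milk', 'bread'}, {'butter'}, 0.667)
```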
The Apriori Algorithm
The Weather data has 336 rules with confidence 100%, but only 8 have support ≥ 3 and only 58 have support ≥ 2
In Weka: specify the minimum confidence level (minMetric, default 90%) and the number of rules sought (numRules, default 10)
Apriori makes multiple passes through the data:
It generates 1-item sets, 2-item sets, ... with more than minimum support, turns each one into (many) rules, and checks their confidence
It starts at upperBoundMinSupport (usually left at 100%) and decreases the support threshold by delta at each iteration (default 5%). It stops when numRules is reached... or at lowerBoundMinSupport (default 10%). (A sketch of this outer loop follows below.)
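A sketch of that outer loop, mirroring Weka's parameter names (minMetric, numRules, delta, the support bounds) but not its actual implementation; it reuses `frequent_itemsets` and `rules_from` from the sketch above:

```python
def apriori_outer_loop(transactions, min_conf=0.9, num_rules=10,
                       upper_bound=1.0, delta=0.05, lower_bound=0.1):
    """Lower the support threshold by `delta` per pass until enough
    high-confidence rules are found (or lower_bound is reached)."""
    support, rules = upper_bound, []
    while support >= lower_bound:
        rules = rules_from(frequent_itemsets(transactions, support), min_conf)
        if len(rules) >= num_rules:
            break
        support -= delta
    # Return the best rules found, sorted by confidence.
    return sorted(rules, key=lambda r: -r[2])[:num_rules]
```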
The Apriori algorithm applied (weather.nominal)
Market Basket Analysis
Missing values are used to indicate that the basket did not contain that item
[to be completed]
2. Clustering
With clustering there is, again, no "class" attribute (so this is unsupervised learning)
Try to divide the instances into natural, homogeneous groups, or "clusters"
It is hard to evaluate the quality of a clustering solution
In Weka: SimpleKMeans (+ XMeans), EM, Cobweb
Examples of clusters:
Customer segmentation: divide customers into homogeneous groups, based on numerous attributes (age, degree, social background, average sales, number of visits, types of products bought, region, ...)
Why?
To have a better understanding,
To target marketing or promotion actions
To focus on very ‘good’ or very ‘infrequent’ customers
….
Student clustering
Clustering of prisoners
Course clustering
Clustering of schools, companies, cars, ….
Based on symptoms, patients can be clustered in order to determine an appropriate treatment
Sometimes, the target population or observations can be subdivided into groups top-down, by argument, using specific criteria. This is not clustering.
The goal of clustering is to start from a dataset with numerous attributes and to build up homogeneous groups using similarity measures (such as the Euclidean distance). So it is bottom-up, and it can be applied to a larger dataset, with many different attributes.
Typical features used for customer segmentation
Demographic
Age, gender, income, education, marital status, kids, …
Lifestyle
Vehicle ownership, Internet use, travel, pets, hobbies, …
Attitudinal
Product preferences, price sensitivity, willingness to try other brands, …
Behavioral
Products bought, prices paid, use of cash or credit, RFM, …
Acquisitional
Marketing channel, promotion type, …
1. Specify k, the desired number of clusters (often very difficult: how many groups do you want? Not too many, not too few)
2. Choose k points at random as cluster centers
3. Assign all instances to their closest cluster center
4. Calculate the centroid (i.e., mean) of instances in each cluster
5. These centroids are the new cluster centers
6. Re-assign instances to these cluster centers
7. Re-calculate the centroids
8. Continue until the cluster centers don’t change
K-means minimizes the total squared distance from instances to their cluster centers.
In Weka, the Euclidean distance or the Manhattan distance can be used as the distance function to compute the distance between instances and centers. (A compact sketch of the algorithm follows below.)
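A compact Python sketch of the steps above (a plain k-means, not Weka's SimpleKMeans; the optional `init` parameter is an addition here so that initial centers can be fixed for the worked example later in this section):

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def centroid(cluster):
    """Step 4: the mean of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def kmeans(points, k, init=None, seed=10):
    # Step 2: choose k points at random as centers (unless fixed via `init`).
    centers = init or random.Random(seed).sample(points, k)
    while True:
        # Steps 3/6: assign every instance to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centers[i]))].append(p)
        # Steps 4/5/7: centroids become the new centers (keep the old center
        # if a cluster happens to be empty).
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # Step 8: stop when the cluster centers no longer change.
        if new_centers == centers:
            return clusters, centers
        centers = new_centers
```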
Clustering: K-means
[Chart: two initial cluster centers chosen among the instances]
- Calculate the distance from each instance to the new centers, and assign each instance to the closest cluster (so instances can change, i.e., be re-assigned to another cluster)
4) Iterative process: repeat the previous step (calculating the centers, re-assigning instances) until the instances do not change anymore = convergence
The K-means Algorithm
Case  X  Y
A     3  7
B     6  1
C     5  8
D     1  0
E     7  6
F     4  5

[Chart: the six instances plotted in the XY plane]
K-means: calculations

Euclidean distance: \( d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \)

Case  X  Y
A     3  7
B     6  1
C     5  8
D     1  0
E     7  6
F     4  5

Euclidean distance matrix:

Case      A      B      C      D      E      F
A     0.000  6.708  2.236  7.280  4.123  2.236
B     6.708  0.000  7.071  5.099  5.099  4.472
C     2.236  7.071  0.000  8.944  2.828  3.162
D     7.280  5.099  8.944  0.000  8.485  5.831
E     4.123  5.099  2.828  8.485  0.000  3.162
F     2.236  4.472  3.162  5.831  3.162  0.000

[Chart: the six instances plotted in the XY plane]
We opt for the "Farthest First" option: the two instances that are farthest apart, C and D (distance 8.944), are chosen as the initial cluster centers (see the sketch below).
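The distance matrix and the farthest pair can be checked with a few lines of Python (`points` is a name chosen here; `math.dist` is the standard-library Euclidean distance):

```python
from math import dist  # Euclidean distance (Python 3.8+)

points = {"A": (3, 7), "B": (6, 1), "C": (5, 8),
          "D": (1, 0), "E": (7, 6), "F": (4, 5)}

# Pairwise Euclidean distance matrix, as in the table above.
matrix = {(a, b): dist(p, q)
          for a, p in points.items() for b, q in points.items()}

# "Farthest First" here: the pair of instances that are farthest apart
# become the initial cluster centers.
pair = max(matrix, key=matrix.get)
print(pair, round(matrix[pair], 3))  # ('C', 'D') 8.944
```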
K-means: calculations

Step 2: For each instance, compute the Euclidean distance to each cluster center (C and D). For example, for instance A and center C:
\( d(A, C) = \sqrt{(3-5)^2 + (7-8)^2} = \sqrt{5} = 2.236 \)
(All of these values can be read directly from the distance matrix above: rows C and D.)
K-means: calculations

Step 3: Assign each instance to the closest center: A, E and F are closest to C, and B is closest to D. This gives the clusters C1 = {A, C, E, F} and C2 = {B, D}.

[Chart: the two clusters C1 and C2 in the XY plane]
K-means: calculations

Step 4: Calculate the centroid (mean) of each cluster; these centroids, CC1 and CC2, become the new cluster centers:
CC1 = ((3+5+7+4)/4, (7+8+6+5)/4) = (4.75, 6.5)
CC2 = ((6+1)/2, (1+0)/2) = (3.5, 0.5)
Re-assigning the instances to these new centers leaves the clusters unchanged, so the algorithm has converged (see the sketch below).

[Chart: the instances with the new centroids CC1 and CC2]
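Running the `kmeans` sketch from earlier with C and D fixed as the initial centers reproduces this walkthrough (one re-assignment pass, then convergence):

```python
pts = {"A": (3, 7), "B": (6, 1), "C": (5, 8),
       "D": (1, 0), "E": (7, 6), "F": (4, 5)}
clusters, centers = kmeans(list(pts.values()), k=2,
                           init=[pts["C"], pts["D"]])
print(clusters)  # [[(3, 7), (5, 8), (7, 6), (4, 5)], [(6, 1), (1, 0)]]
print(centers)   # [(4.75, 6.5), (3.5, 0.5)]
```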
Open weather.numeric.arff
Cluster panel; choose SimpleKMeans
Note the parameters: numClusters, distanceFunction, seed (default 10)
Two clusters, 9 and 5 members, total squared error 16.2
{1/no, 2/no, 3/yes, 4/yes, 5/yes, 8/no, 9/yes, 10/yes, 13/yes}
{6/no, 7/yes, 11/yes, 12/yes, 14/no}
Set seed to 11: two clusters, 6 and 8 members, total squared error 13.6
Set seed to 12: total squared error 17.3
Evaluating Clusters
Now we know the size and the characteristics of the clusters
How good is our clustering?
Visualizing clusters:
Open the Iris.arff data, apply SimpleKMeans, specify 3 clusters
3 clusters with 50 instances each
Visualize cluster assignments (right-click menu in the Result List)
Plot Cluster (x-axis) against the instance numbers: the denser the plot, the more cohesive the cluster, and the better the quality
Which instances does a cluster contain?
Use the AddCluster unsupervised attribute filter (in the Preprocess tab!)
Try with SimpleKMeans (within the filter); Apply and click Edit
What about the class variable?
Also apply "Visualize cluster assignments", with the clusters on the X axis and the class variable on the Y axis. There are yes's and no's in both clusters, so there is no perfect match between clusters and class values
Try the "Ignore attributes" button and ignore the class attribute; run again with 3 clusters, now: 61, 50, 39 instances
Visualizing Clusters:
With all attributes: very dense, balanced clusters. Leaving out the class variable: more distance within the clusters, less balanced (but still acceptable)
Classes-to-clusters evaluation
In the Iris data: SimpleKMeans, specify 3 clusters
Classes-to-clusters evaluation = using clustering in a supervised way...
Classes are assigned to clusters; can the clusters predict the class values?
Now you have a confusion (classification) matrix and an accuracy! (100% - 11% = 89%) A simplified sketch of the idea follows below.
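A simplified Python sketch of the idea: map each cluster to its majority class and score the implied predictions (Weka searches for an optimal class-to-cluster assignment, so its numbers can differ slightly; the toy data at the end is hypothetical):

```python
from collections import Counter

def classes_to_clusters_accuracy(cluster_ids, class_labels):
    """Assign each cluster its majority class, then measure how often
    that class matches the instance's true class."""
    majority = {
        c: Counter(l for k, l in zip(cluster_ids, class_labels)
                   if k == c).most_common(1)[0][0]
        for c in set(cluster_ids)
    }
    hits = sum(majority[c] == l for c, l in zip(cluster_ids, class_labels))
    return hits / len(class_labels)

# Hypothetical toy example: 10 instances, 2 clusters.
print(classes_to_clusters_accuracy(
    [0, 0, 0, 1, 1, 1, 0, 1, 0, 1],
    ["a", "a", "b", "b", "b", "b", "a", "b", "a", "a"]))  # 0.8
```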