Business Intelligence 2
• Module 1: SQL
• Basics of SQL
• 1 table
• Where & Order by
• SQL is a non-procedural language, i.e. you specify what you want instead of
how you want to obtain it. The database management system (DBMS) will itself
interpret the SQL instructions and show the results ... SQL is suitable for any
DBMS!
• You need to install LibreOffice on your laptop (macOS, Windows or Linux).
• You should/can also go to w3schools.com – SQL, and try basic SQL exercises
if you are an SQL novice.
SELECT column_list
FROM table_list
[WHERE condition]
[GROUP BY column_list]
[HAVING condition]
[ORDER BY column_list];
• If we have only 1 table, we can omit the table name before the field
names:
SELECT Lastname, Emailaddress
FROM Dimcustomer
WHERE Maritalstatus = 'M';
Selection Conditions (continued)
• Comparison operators:
=   equal to
<>  different from
<   lower than
>   greater than
<=  lower or equal to
>=  greater or equal to
BETWEEN … AND …   between two values
IN (list)   equal to one of the values in the list
IS NULL   equal to the NULL (blank) value
• Example: Show all products with a catalog price (list price) between $30 and
$100
SELECT *
FROM Dimproduct
WHERE Listprice BETWEEN 30 AND 100
In MS Access:
SELECT *
FROM Dimcustomer
WHERE BirthDate in (#1/15/1950#,#1/15/1970#)
In LibreOffice:
SELECT *
FROM Dimcustomer
WHERE Birthdate IN ( '1950-01-15', '1970-01-15' )
SELECT *
FROM Dimemployee
WHERE phone IS NOT NULL
Negations in the Where clause
• By adding the NOT operator to a selection condition we obtain the negation of
the selection condition.
E.g.: Return the name of the workers who are not “marketing manager”
SELECT *
FROM DimEmployee
WHERE NOT (Title = 'Marketing Manager');
The use of Wildcards & Like
• The LIKE operator is used in a WHERE clause to search for a specified pattern in
a column.
• There are two wildcards often used in conjunction with the LIKE operator:
• % - the percent sign represents zero, one, or multiple characters
• _ - the underscore represents a single character
• Note: MS Access uses an asterisk (*) instead of the percent sign (%), and a question mark (?) instead of the underscore (_).
Examples:
SELECT *
FROM Dimcustomer
WHERE Lastname LIKE 'An%';

SELECT *
FROM Dimcustomer
WHERE Lastname LIKE '%an%';
• Sometimes a row must or may meet several selection conditions before it can
be selected.
• If a row must meet several selection conditions then they are linked together
with the AND operator. The composed selection condition is TRUE only if all
individual selection conditions are TRUE.
• If a row may meet several selection conditions then they are linked together
with the OR operator. The composed selection condition is TRUE if at least
one individual selection condition is TRUE.
• All female employees who are married:
SELECT *
FROM Dimemployee
WHERE Gender = 'F'
AND Maritalstatus = 'M'
• All female employees or employees who are married (so no single males):
SELECT *
FROM Dimemployee
WHERE Gender = 'F'
OR Maritalstatus = 'M'
In MS Access:
SELECT *
FROM DimEmployee
WHERE Gender = 'M'
AND (Birthdate > #31-DEC-59# OR Title = 'Chief Financial Officer')
In LibreOffice:
SELECT *
FROM Dimemployee WHERE Gender = 'M' AND (Birthdate > '1959-12-31'
OR Title = 'Chief Financial Officer');
So: the query always concerns male employees; in addition, they must either be CFO or born after 1959.
The Order By clause: Sorting the displayed rows
• The ORDER BY clause:
By adding an ORDER BY clause as the last clause of a query, we can show the
selected rows sorted by the values of a particular column.
For example: show all the employees ordered by their date of birth (oldest to
youngest)
SELECT *
FROM Dimemployee
ORDER BY Birthdate
To reverse the order (youngest to oldest), add DESC:
SELECT *
FROM Dimemployee
ORDER BY Birthdate DESC
Sorting the displayed rows (cont’d)
• Ordering on multiple columns
For example: Show all employees. First show male employees, then the
females. Show by gender ('m' or 'f'), the employees in order of age, from
oldest to youngest
SELECT *
FROM Dimemployee
ORDER BY gender DESC, birthdate
• Queries that need data from more than one table can be solved by using "joins" between tables within SQL instructions.
• When multiple tables are mentioned in the FROM clause, those tables are linked through a JOIN.
The result of combining tables A and B contains the columns of A and the columns of B.
The Cartesian product of two tables includes in its result all possible
combinations of records between the two tables. This means that there are no
conditions for the join: each row (record) of table A is linked to all rows of table B.
• A join is as good as taking a Cartesian product and thereby ensuring that only
the correct records remain (i.e. the right customers with the right sales).
Example: FactInternetSales (foreign key) joined with DimCustomer (primary key).
• Result: a join with a join condition: the value of the primary key needs to
match the value of the foreign key.
Note: both options give the same results, but this format is longer.
We advise using option 1 (previous slide) when writing SQL.
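As an illustration of the two formats (a sketch, assuming the usual AdventureWorks key names):
-- Option 1: join condition in the WHERE clause
SELECT Lastname, Salesamount
FROM Dimcustomer, Factinternetsales
WHERE Dimcustomer.Customerkey = Factinternetsales.Customerkey;
-- Option 2: explicit INNER JOIN (same result, but longer)
SELECT Lastname, Salesamount
FROM Dimcustomer INNER JOIN Factinternetsales
ON Dimcustomer.Customerkey = Factinternetsales.Customerkey;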
• Next to the join condition we can also add additional selection conditions in a
query.
• Example: Show the product (by name), the name of a customer and the
dates of his internet purchases, on the condition that the customer is from
France …
To solve this query we need the following tables:
1. FactInternetSales (the fact table)
2. DimCustomer (dimension table, name of customer)
3. DimGeography (dimension table, country of the customer)
4. DimProduct (dimension table, name of the product)
5. DimDate (date of Internet sale)
So 4 join conditions and an additional condition (the customer's country)
The SQL statement, using DWAdventureworks.mdb (in Base, LibreOffice):
SELECT Dimcustomer.Firstname, Dimcustomer.Lastname,
Dimproduct.Productname, Dimgeography.Countryregionname,
Dimdate.Datekey
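The SELECT list above is only the first part of the statement. A plausible completion, assuming the usual AdventureWorks key names, is:
SELECT Dimcustomer.Firstname, Dimcustomer.Lastname,
       Dimproduct.Productname, Dimgeography.Countryregionname,
       Dimdate.Datekey
FROM Factinternetsales, Dimcustomer, Dimgeography, Dimproduct, Dimdate
WHERE Factinternetsales.Customerkey = Dimcustomer.Customerkey   -- join 1
AND Dimcustomer.Geographykey = Dimgeography.Geographykey        -- join 2
AND Factinternetsales.Productkey = Dimproduct.Productkey        -- join 3
AND Factinternetsales.Datekey = Dimdate.Datekey                 -- join 4
AND Dimgeography.Countryregionname = 'France';                  -- additional condition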
• Choose an alias for your table names and use it throughout your query. An
alias is at least one character, not a number, and contains no spaces.
• Why ? This shortens your query considerably!
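For instance, the query above becomes considerably shorter with aliases (a sketch under the same key-name assumptions):
SELECT c.Firstname, c.Lastname, p.Productname, g.Countryregionname, d.Datekey
FROM Factinternetsales f, Dimcustomer c, Dimgeography g, Dimproduct p, Dimdate d
WHERE f.Customerkey = c.Customerkey
AND c.Geographykey = g.Geographykey
AND f.Productkey = p.Productkey
AND f.Datekey = d.Datekey
AND g.Countryregionname = 'France';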
• So group by "city" (putting identical cities together in one group), and within each
group (each city), count the number of customer keys.
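A minimal sketch of that query, assuming the city attribute lives in Dimgeography and joins to Dimcustomer via Geographykey:
SELECT g.City, COUNT(c.Customerkey) AS COUNTofCustomerKey
FROM Dimcustomer c, Dimgeography g
WHERE c.Geographykey = g.Geographykey
GROUP BY g.City;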
Group By, explained with an example
• Suppose a manager wants to know the sales for each product. Let us assume
that s/he wants a list with products and their total sales. The sales amounts can
be found in the fact table.
The first records of your result then look like this (in total you have 1000 facts):
Presenting a table with 1000 individual sales does not make much sense for the
manager. The business analyst needs to give a subtotal for each product, e.g.
26.97 for AWC Logo Cap. Hence he groups by product name and sums up the
sales amounts.
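In SQL, the analyst's grouping step could look like this (a sketch, assuming the usual fact/dimension key names):
SELECT p.Productname, SUM(f.Salesamount) AS TotalSales
FROM Factinternetsales f, Dimproduct p
WHERE f.Productkey = p.Productkey
GROUP BY p.Productname;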
• In this example: first group by city, then "within each city" group by gender.
• Note: in the SELECT clause we can also add a column heading using AS
(e.g. AS COUNTofCustomerKey). The header name is optional and must be a
single word.
For example, in Seattle: 1 woman and 2 men.
• So: each group (each city) presented in the result should include more than 2
customers, as in the sketch below.
• Note that the count operation does not have to be in the SELECT clause!
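A hedged sketch of such a query, with the count appearing only in the HAVING clause:
SELECT g.City
FROM Dimcustomer c, Dimgeography g
WHERE c.Geographykey = g.Geographykey
GROUP BY g.City
HAVING COUNT(c.Customerkey) > 2;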
• When the same query has both a WHERE and a HAVING clause, the query
processor first applies the WHERE condition to the individual rows, then groups
the remaining rows (GROUP BY), and finally applies the HAVING condition to the
resulting groups.
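For example, in the following sketch the WHERE clause first keeps only married customers; only then are the city groups formed and filtered:
SELECT g.City, COUNT(c.Customerkey)
FROM Dimcustomer c, Dimgeography g
WHERE c.Geographykey = g.Geographykey
AND c.Maritalstatus = 'M'            -- 1. applied to individual rows
GROUP BY g.City                      -- 2. remaining rows are grouped
HAVING COUNT(c.Customerkey) > 2;     -- 3. applied to the groups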
1. Union Queries
2. ‘Top[number]’
3. Subqueries
• Union queries are used to combine data from multiple queries into one result.
• For example: Provide a list of names and telephone numbers of all employees
and all customers with an annual income greater than 25 000. (This does not
necessarily imply a link between customers and employees).
• SQL
select Firstname, Lastname, Phone
from Dimemployee
union
Select Firstname, Lastname, Phone
from Dimcustomer
where Yearlyincome > 25000
• You cannot simply combine any two SQL statements (queries) with "union"
• Conditions:
• Every query must have the same number of columns (fields)
• Corresponding columns must have compatible data types
• Incorrect Example:
• Select Firstname, Lastname, Birthdate
from Dimemployee
union
select Firstname, Lastname, Phone
from Dimcustomer
where Yearlyincome > 25000
1. Union Queries
• Incorrect Example:
• Select Totalchildren, Firstname, Lastname
from Dimcustomer
where Yearlyincome > 25000
union
select Firstname, Lastname, Phone
from Dimemployee
• Correct Example:
• Select Phone, Firstname, Lastname
from Dimcustomer
where Yearlyincome > 25000
union
select Firstname, Lastname, Phone
from Dimemployee
• The order of the fields is different, but because the data types are all text,
most database systems accept the query. Columns are combined positionally:
the first query determines the order in which the attributes are displayed.
2. ‘Top[number]’
• Top[number]: E.g. Top 1, Top 3, ... This instruction can be added immediately after
the word "select". Example:
• Note that the "top x" instruction only ensures that the first x records of a query are
displayed! The youngest, oldest or highest are not automatically chosen. Example:
• Then:
• How do you modify the query to select the customer with the lowest
purchase amount?
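A plausible answer (MS Access TOP syntax; assuming the purchase amounts are summed per customer) is to sort ascending so that the lowest total comes first:
SELECT TOP 1 Customerkey, SUM(Salesamount) AS Total
FROM Factinternetsales
GROUP BY Customerkey
ORDER BY SUM(Salesamount) ASC;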
• For example:
SELECT Customerkey, Lastname FROM Dimcustomer
WHERE Customerkey IN
(SELECT Customerkey
FROM Factinternet
GROUP BY Customerkey
HAVING SUM(Salesamount) < 800)
• The data that we want to select can thus also be queried with an INNER JOIN.
Why then use a subquery? A subquery is generally a lot clearer than a JOIN
and easier to formulate. Some RDBMSs perform more quickly when using a
JOIN.
• Examples:
• Which employees are older than Guy Gilbert?
So you compare employees with each other
• Which customers are located in the same town as the customer Jacquelyn
Suarez?
So you compare with other customers
• The operator "IN" can be used if the second query yields one or more records.
• "IN" means: the defined field of a record from the top query must "appear in" one or more
records from the second query.
• For example:
SELECT Customerkey FROM Dimcustomer WHERE Customerkey IN
(SELECT Customerkey
FROM Factinternet
GROUP BY Customerkey
HAVING SUM(Salesamount) < 800)
• The second query returns the "customerkey" of all customers who bought for less than 800 in total.
• This query can be shorter, but then the town of customers will not be displayed. (You
only use “Geographykey”)
SELECT Firstname, Lastname
FROM Dimcustomer
WHERE Geographykey = (SELECT Geographykey
FROM Dimcustomer
WHERE Lastname = 'Suarez'
AND Firstname = 'Jacquelyn')
4. Rollup and Cube
• Rollup and Cube are two typical OLAP concepts (see below)
• Rollup and Cube can also be used as SQL statements. They calculate
subtotals.
• Neither instruction is provided in MS Access, but they are available in database
systems like MS SQL Server, Oracle and MySQL
SELECT Sort, Place, SUM(Amount) AS Amount
FROM Animal
GROUP BY Sort, Place

Result rows (the cube additionally produces subtotal rows such as the grand total):
Cat    Miami   18
Cat    Naples   9
Dog    Miami   12
Dog    Naples   5
Dog    Tampa   14
Turtle Naples   1
Turtle Tampa    4
NULL   NULL    63
• Note: "rollup" and "cube" calculate subtotals as they are calculated in a Pivot
Table…
Result of the query “with cube”
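The subtotal-producing variants would be written as follows (SQL Server/Oracle syntax; MySQL instead appends WITH ROLLUP to the GROUP BY clause):
-- ROLLUP: subtotals per Sort plus a grand total
SELECT Sort, Place, SUM(Amount) AS Amount
FROM Animal
GROUP BY ROLLUP (Sort, Place);
-- CUBE: subtotals for every combination of Sort and Place
SELECT Sort, Place, SUM(Amount) AS Amount
FROM Animal
GROUP BY CUBE (Sort, Place);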
Data Warehouse vs Data Mart
Content
Data warehousing
1) Introduction
2) OLTP vs. OLAP
3) The multidimensional model
4) Logical data warehouse design
5) Data warehouse architecture
6) Data warehouse usage
1) Introduction
2) OLTP vs OLAP
• OLTP databases
– Designed to support daily operations
– Ensure fast, concurrent access to data
• Transaction processing
• Concurrency control
• Recovery techniques
– Normalization crucial
• Must support heavy transaction loads
• Should prevent update anomalies
– Therefore: poor performance for executing complex queries
(joining many relational tables together)
What is a Data Warehouse?
“A DW is a
– subject-oriented,
– integrated,
– time-varying,
– non-volatile,
collection of data that is used primarily in organizational decision making.”
-- W.H. Inmon, Building the Data Warehouse, 1992
Subject-oriented
Integration
[Figure: data from heterogeneous sources - RDBMSs, legacy systems, scientific databases, digital libraries and the World Wide Web - is integrated into the Data Warehouse]
Nonvolatile
• Once data is recorded it will not be updated anymore
• Historical data will not be deleted
• Only new data is added
• A data warehouse requires two operations in data accessing
– Initial loading of data (ETL)
– Access of data
Data warehouse database vs. OLTP database
Additional differences
3) Multidimensional model
Dimensions, measures, facts and hierarchies
Aggregation
4) Logical Data Warehouse design
4) Logical Data Warehouse design, continued
Relational data warehouse design
Star Schema
[Figure: a central fact table with the measurements, linked to surrounding dimension tables]
6) Data warehouse usage
OLAP operations
Roll-up
Drill-down
Slicing
Dicing
Data warehouse querying
Business Intelligence
Part 2:
Introduction to Data Mining
Using Weka
WEKA!
What's Weka?
– A bird found only in New Zealand?
– A data mining workbench: the Waikato Environment for Knowledge Analysis
Machine learning algorithms for data mining tasks:
– 100+ algorithms for classification
– 75 for data preprocessing
– 25 to assist with feature selection
– 20 for clustering, finding association rules, etc.
• There are three MOOCs (massive open online courses) available. This course is
considerably inspired by them.
• https://www.cs.waikato.ac.nz/ml/weka/courses.html
• Is there “noise” in the data? (errors such as “age = -20” , or “gender= m and name= Marianne”,
etc.)
• Are there "outliers" (extreme or exceptional values) that can influence our results? If this is the
case, you can ultimately drop them…
• What do available attributes look like? What type of data do we have? Is the field textual or numeric?
If numeric, are they continuous variables or discrete (categorical) ones? (For example “gender”
is a discrete variable (m or f), “age” is a continuous variable but can become discrete (for example
subdivision in these categories : “0-17”; “18-30”, “ > 60”, …)).
• How many products can an enterprise (hope to) sell next year ?
• Is the income tax return of a company the same as the one of similar companies (tax
evasion)?
• What is the probability of success of a student given all the information we have about that
student and given a group of students from the past (social background, secondary education,
exam results,…)?
• Some questions can also, to some extent, be answered thanks to “conventional” queries, cross-
tabulations,...
Patient ID | Sore Throat | Fever | Swollen Glands | Congestion | Headache | Diagnosis
11         | No          | No    | Yes            | Yes        | Yes      | ?
12         | Yes         | Yes   | No             | No         | Yes      | ?
13         | No          | No    | No             | No         | Yes      | ?
Source: Roiger, R. J. & M.W. Geatz (2003)
[Figure: decision tree with root node "Swollen Glands" (no/yes); the "no" branch tests "Fever" (no/yes)]
The decision tree (with the "production rules") can now be used to make forecasts for future patients
whose diagnosis is still unknown…
Remark:
- a custodial account is an account type where an institution or guardian manages the account on behalf of a "protected" individual
- a joint account: an account shared by several people
This question becomes more global and can be answered through data mining, and more specifically data
clustering.
We can use "clustering" on the above-mentioned data. The clustering technique determines clusters
(categories/groups) such that the "distance" between the clusters is as great as possible, whereas the
distance between the clients within a cluster is as small as possible.
We could obtain the following “rules” :
If (margin account = yes && Age = 20-29 && Annual Income = 40-59K)
THEN cluster = 1 (accuracy = 0.80; coverage = 0.50)
If (account type = custodial && favorite recreation = skiing && annual income = 80-90K)
THEN cluster = 2 (accuracy = 0.92; coverage = 0.35)
If (account type = joint && Trades/month > 5 && transaction method = online)
THEN cluster = 3 (accuracy = 0.82; coverage = 0.65)
Accuracy & coverage do tell us something about the clustering value (validation); see further.
A clustering will never be perfect !
• Outliers
• Remove or keep?
• E.g. Age = 400 (false observation) vs. income = 10 000 Euro (correct
observation?)
• Missing values:
• How to deal with them? E.g. replace with average value?
• Definition of the target variable = outcome variable (if required, cf. below)
• Credit scoring: What is a bad customer/actor (e.g. 90 days payment arrears
according to the international Basel II guidelines)
• Churn management: What is a churner? (e.g. a customer without any
purchase in the last 4 months)
Data Mining Techniques
A model is learned from training data, evaluated on testing data, and then applied to unseen data. Example training data:
NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes
Unseen instance: (Jeff, Professor, 4) -> Tenured?
Data Mining Strategies: Classification, Estimation, …
Source: adapted from Roiger, R. J. & M.W. Geatz (2003), Data Mining: A Tutorial-Based Primer, Addison-Wesley, 350 p.
Expert Systems (Artificial Intelligence)
[Figure: a knowledge engineer - a person trained to work with a human expert and capture his/her knowledge - encodes that knowledge into an expert system using an expert system building tool]
Example rule:
If Swollen Glands = Yes
Then Diagnosis = Strep Throat
Weather.nominal.arff
• Open iris.arff
• Bring up the Visualize panel
• Bars on the right correspond to attributes: click for x axis; right-click for y axis
• Jitter slider ("Jitter is a random displacement applied to X and Y values to separate points
that lie on top of one another. Without jitter, 1000 instances at the same data point would
look just the same as 1 instance.")
• On classification problems, the output variable must be nominal. For regression problems, the output
variable must be real.
Using Weka
Getting to know WEKA
• Open the dataset
• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
• Color is important !
Training Data (excerpt; note that "Cheat" is the categorical "output" variable):
Tid | Refund | Marital Status | Taxable Income | Cheat
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Model: Decision Tree
Refund = Yes -> NO
Refund = No -> test MarSt: Married -> NO; Single or Divorced -> test TaxInc: < 80K -> NO, > 80K -> YES
There can consequently be more than one decision tree for the same data set!
The tree can then classify an unseen instance such as: Refund = No, Married, 80K -> Cheat = ?
Validation of a decision tree
• Focus on the predictive ability of a model
• Instead of focusing on how much time it takes to make a model, the size of the model, etc.
• Confusion Matrix: how many records were properly classified? How many errors?
PREDICTED CLASS (columns) vs. ACTUAL CLASS (rows):
           Predicted x | Predicted y
Actual x   a (TP)      | b (FN)
Actual y   c (FP)      | d (TN)
a: TP (true positive); b: FN (false negative); c: FP (false positive); d: TN (true negative)

Two example matrices:
           x    y                  x    y
Actual x   150  40      Actual x   250  45
Actual y   60   250     Actual y   5    200
For more than two classes, the confusion matrix generalizes (computed decision C1, C2, C3):
      C1    C2    C3
C1   C11   C12   C13
C2   C21   C22   C23
C3   C31   C32   C33
Weka: J48
• Evolution:
• ID3 (1979)
• C4.5 (1993)
• C4.8 (1996?) => J48 (adapted for Weka)
• C5.0 (commercial)
• Open the configuration panel in Weka (click on the white box under "Classifier")
• Check the More information
• Examine the options
• Look at leaf sizes, Set minNumObj to 15 to avoid small leaves
• Visualize the tree using right‐click menu
Min. leaves 2: accuracy = 66.8%. Min. leaves 15: accuracy = 62.15%.
Output: the J48 output shows general information, the tree size, the accuracy % and the Kappa statistic.
Entropy is a measure of impurity (the counterpart of information gain), so the entropy after a split should be lower than before the split.
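For reference, the standard definitions (for a set S with class proportions p_i, and an attribute A splitting S into subsets S_v):
\text{Entropy}(S) = -\sum_i p_i \log_2 p_i
\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)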
J48: algorithm
Note: information gain is measured in 'bits', as a unit of information.
Weka: J48
• Fewer attributes = sometimes a better classification!
• Open glass.arff to run J48 (with default options...):
• Run J48, with default options first
• Next. Remove Fe, and run J48 again
• Next, Remove all attributes except RI and MG, run J48 again
• Compare the decision trees, and particularly their accuracy %
• (Also use the right‐click menu to visualize decision trees)
Classification Algorithms
• Structure of the slides:
• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
PART
• Rules from partial decision trees: PART
• Theoretically, rules and trees have equivalent “descriptive” or expressive power ... but either
can be more perspicuous (understandable, transparent) than the other
• Create a decision tree: top-down, ”divide-and-conquer”; read rules off the tree
• One rule for each leaf
• Straightforward, but rules contain repeated tests and are overly complex
• Alternative: a covering method: bottom-up, “separate-and-conquer”:
• Take a certain class (value) in turn and seek a way of covering all instances in it. This is called a covering
approach because at each stage you identify a rule that “covers” some of the instances. This approach may
lead to a set of rules, rather than to a decision tree.
• Separate-and-conquer:
• Identify a rule
• Remove the instances it covers
• Continue, creating rules for the remaining instances
PART: Separate-and-Conquer: Example
• Identifying a rule for class a (so not explicitly considering class b):
• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
Baseline Accuracy: ZeroR
• Open file diabetes.arff: 768 instances (500 negative, 268 positive)
• Always guess “negative”: 500/768 : accuracy = 65%
• rules > ZeroR: most likely class!
• Try these classifiers (cf. later for more info):
– trees > J48 74%
– bayes > NaiveBayes 74%
– lazy > IBk 70%
– rules > PART 75%
• So ZeroR is a classifier that uses no attributes. ZeroR simply "predicts" the mean (for a
numeric class (or output variable)) or the mode (for a nominal class).
Baseline Accuracy : ZeroR
• Always try a simple baseline first. Sometimes, the baseline is best!
• Open supermarket.arff and blindly apply the following classifications:
• rules > ZeroR -- accuracy =63.7%
• trees > J48 -- accuracy = 63.7%
• bayes > NaiveBayes – accuracy = 63.7%
• lazy > IBk – accuracy = 37% (!)
• rules > PART – accuracy = 63.7%
OneR: One attribute does all the work
• Learn a 1‐level “decision tree”
• – i.e., rules that all test one particular attribute
• Basic version
• One branch for each value
• Each branch assigns most frequent class
• Error rate: proportion of instances that don’t belong to the majority class of their corresponding
branch
• Choose attribute with smallest error rate
• In Weka:
• Open file weather.nominal.arff
• Choose OneR rule learner (rules>OneR)
• Look at the rule
Dealing with numeric attributes
• Idea: discretize numeric attributes into sub ranges (intervals or partitions)
• How to divide each attribute’s overall range into intervals?
• Sort instances according to attribute’s values
• Place breakpoints where (majority) class changes
• This minimizes the total classification error. In the example below, this yields 8 intervals
(partitions).
• Example: temperature from the weather.numeric data
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No | Yes Yes Yes | No | Yes Yes | No
• However, whenever adjacent partitions have the same majority class, as the two first
partitions above (in both, ”yes” is the majority), they can be merged together, leading to 2
partitions
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes No No Yes Yes Yes | No Yes Yes No
• So the rule: temperature <= 77.5 -> yes
               temperature > 77.5 -> no
Results with overfitting avoidance
• Resulting rule sets for the four attributes in the weather.numeric data, with only two rules for the temperature attribute:
Attribute    Rules                      Errors  Total errors
Outlook      Sunny -> No                2/5     4/14
             Overcast -> Yes            0/4
             Rainy -> Yes               2/5
Temperature  <= 77.5 -> Yes             3/10    5/14
             > 77.5 -> No*              2/4
Humidity     <= 82.5 -> Yes             1/7     3/14
             > 82.5 and <= 95.5 -> No   2/6
             > 95.5 -> Yes              0/1
Windy        False -> Yes               2/8     5/14
             True -> No*                3/6
• In Weka: open weather.numeric; Classify; rules > OneR; in the configuration panel set minBucketSize = 3 (see previous slide); Start; look at the rule…
Classification Algorithms
• Structure of the slides:
• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
The problem of Overfitting
• Any machine learning method may “overfit” the training data …
• … by producing a classifier that fits the training data too tightly
• So the model performs well on the training data but not on independent test data (which is
then reflected in a low accuracy rate)
• Overfitting is a general phenomenon that plagues all ML methods
• This is one reason why you must always evaluate on an independent test set
• However, overfitting can occur more generally: you can have good accuracy rate but the
model performs badly when using new validation data (coming from a different context)
• E.g. You try many ML methods, and choose the best for your data – you cannot expect to get the
same performance on new validation data
Overfitting : an example
• Experiment with the diabetes dataset
• Open file diabetes.arff
• Choose ZeroR rule learner (rules>ZeroR)
• Use cross‐validation: 65.1%
• Choose OneR rule learner (rules>OneR)
• Use cross‐validation: 71.5%
• Look at the rule in the output (plas = plasma glucose concentration)
• In the configuration panel of OneR, change minBucketSize parameter to 1: run again
• Use cross-validation: 57.16%
• Look at the rule again
• So in the 2nd run of OneR, the rule is much more complex, tightly (over)fitted to the training set, but
performing worse on the test set(s) (even worse than ZeroR).
Classification Algorithms
• Structure of the slides:
• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
Quality Evaluation: Accuracy, Precision and Recall
• So far, we mainly focused on Accuracy % = number of correctly classified instances / total number of instances
• Overall a useful indicator, mainly in case of more or less balanced class values
• Easy to understand, evaluates the entire model
• Additional, related measures (between 0 and 1), given per class value:
• TP rate = TP/(TP+FN) = Recall = sensitivity
• Precision = TP/(TP+FP)
• F-measure is the harmonic mean of recall and precision: F = 2 x Precision x Recall / (Precision + Recall)
• Area under the ROC curve (also called the C statistic)
• A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the
discriminatory ability of a classifier.
• The 'area under the curve' (or C statistic) represents the classification quality of the model: 1 is a perfect
model, without any false positives; 0.5 is a worthless model that detects as many true positives as false
positives
Quality Evaluation: Kappa Statistic
• Measures such as the accuracy % are easy to understand, but may give a distorted picture in the case of
imbalanced classes. E.g. we have two classes, say A and B, and A shows up on 5% of the time. Classifying
all as B gives an accuracy of 95% (so ‘excellent’), whereas the minority class is not well predicted. This
problem is even more present in the case of 3 or more classes
• Cohen’s kappa statistic is a very good measure that can handle very well imbalanced class problems.
• Kappa statistic:
• (success rate of actual predictor - success rate of random predictor) / (1 - success rate of random predictor)
• Measures relative improvement on random predictor: 1 means perfect accuracy, 0 means we are doing no better than
random
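In formula form, with p_o the success rate of the actual predictor and p_e that of the random predictor:
\kappa = \frac{p_o - p_e}{1 - p_e}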
• Interpretation, rules of thumb (!):
based on Landis, J.R.; Koch, G.G. (1977). “The measurement of observer agreement for categorical
data”. Biometrics 33 (1): 159–174
• value < = 0 is indicating no improvement over random prediction
• 0–0.20 a slight improvement,
• 0.21–0.40 a fair improvement,
• 0.41–0.60 a moderate improvement,
• 0.61–0.80 a substantial improvement,
• and 0.81–1 an almost perfect, maximal improvement.
• It basically tells you how much better your classifier is performing over the performance of a classifier
that simply guesses at random according to the frequency of each class.
Example
• A Decision Tree on the Iris.arff dataset:
• When opting for separate training and test sets, separate files (.arff) need to be created
first. The model is then built with the training set, and tested separately on the test set.
Training and test files can be created by random selection or stratified (with each file
respecting a comparable proportion of the output (class) variable values). Typically, the
number of instances in the test set is one third of that of the training set, but this is not a
strict requirement.
Accuracy testing: 2. Holdout estimation
• What should we do if we only have a single dataset?
• The holdout method reserves a certain amount for testing and uses the remainder for
training, after shuffling
• Usually: one third for testing, the rest for training; by default 66% training - 34% testing in Weka
• Problem: the samples might not be representative
• Example: a class value might be missing in the test data
• Advanced version uses stratification
• Ensures that each class value is represented with approximately equal proportions in both subsets
• Holdout estimates can be made more reliable by repeating the process with
different subsamples
• In each iteration, a certain proportion is randomly selected for training (possibly with
stratification)
• The error rates on the different iterations are averaged to yield an overall error rate
• This is called the repeated holdout method
• Still not optimum: the different test sets overlap
• Can we prevent overlapping?
See slide 45
3. Repeated holdout: an example
• So quite some variation in the accuracy %.
• Calculate the mean and variance, to get a more reliable outcome (lies between 92.9% and 96.7%)
4.1. k-fold Cross Validation
• 10-fold cross-validation:
• Divide the dataset into 10 parts (folds)
• Hold out each part in turn for testing
• Average the results
• So each data point is used once for testing, 9 times for training
4.1. k-fold Cross Validation (continued)
• With 10-fold cross-validation, Weka invokes the learning algorithm 11 times (once per fold, plus a final run on the full dataset to produce the printed model)
• Practical rule of thumb:
• Lots of data? - use percentage split
• Else stratified 10-fold cross-validation
ML = Machine Learning
More on cross-validation
• Standard method for evaluation: stratified ten-fold cross-validation
• Why ten?
• Extensive experiments have shown that this is the best choice to get an accurate estimate
• There is also some theoretical evidence for this (cf. Witten, et al. 2017)
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
• E.g., ten-fold cross-validation is repeated ten times and results are averaged (reduces the
variance)
Accuracy testing: 4.2. Leave-one-out cross-validation
• Leave-one-out:
a particular form of k-fold cross-validation:
• Set number of folds to the number of training instances
• I.e., for n training instances, build classifier n times
• Makes best use of the data
• Involves no random subsampling
• Very computationally expensive (exception: using lazy classifiers such as the
nearest-neighbor classifier ( cf. later))
• Disadvantage of Leave-one-out CV: stratification is not possible
• It guarantees a non-stratified sample because there is only one instance in the test set!
Classification Algorithms
• Structure of the slides:
• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
Naïve Bayes
• Frequently used in Machine Learning, Naive Bayes is a collection of classification algorithms based on the Bayes
Theorem. The family of algorithms all share a common principle, that every feature being used to classify is independent
of the value of any other feature. So for example, a fruit may be considered to be an apple if it is red, round, and about
3″ in diameter. A Naive Bayes classifier considers each of these “features” (red, round, 3” in diameter) to contribute
independently to the probability that the fruit is an apple, regardless of any correlations between features. Features or
attributes, however, aren’t always independent in reality, which is often seen as a shortcoming of the Naive Bayes
algorithm and this is why it’s called “naive”.
• Although based on a relatively simple idea, Naive Bayes can often outperform other more sophisticated algorithms and
is very useful in common applications like spam detection and document classification.
• In a nutshell, the algorithm aims at predicting a class (outcome variable), given a set of features, using probabilities. So
in a fruit example, we could predict whether a fruit is an apple, orange or banana (class) based on its color, shape, etc.
(features).
• Advantages
• It is relatively simple to understand and build
• It is easily trained, even with a small dataset
• It is not sensitive to irrelevant attributes
• Disadvantages
• It assumes every attribute is independent, which is not always the case
Probabilities using Bayes’s rule
• Famous rule from probability theory thanks to Thomas Bayes
• Probability of an event H given observed evidence E:
P(H | E) = P(E | H)P(H) / P(E)
• H = class variable; E=instance
• A priori probability of H : P(H )
• Probability of event before, prior to, evidence is seen
• E.g. before tossing a coin: the probability of 'heads' is 50%, given that the coin has two similar sides…
We know this a priori.
• A posteriori probability of H : P(H | E)
• Probability of event after evidence is seen
• E.g. tossing a coin 1000 times , with 550 times ‘heads’. So now, with this evidence, the probability of
tossing heads = 550/1000. This is the a posteriori probability.
Naïve Bayes
• So suppose we are presented with new data: we are only given the features of a piece of fruit
and we need to predict the class, i.e. fruit type. If we are told that the additional fruit is Long,
Sweet and Yellow, we can classify it using the following formula and the facts from the table
above.
• P(H|E) = P(E|H)*P(H) / P(E)
• So H = the values of the class = Banana, Orange or Other
• So E = the evidence presented = 3 attributes: Long, Sweet and Yellow
• P(E) is not known, but it is the same for each fruit class, so it can be ignored for the comparison.
Based on the table, we can conclude that the fruit is most likely a Banana (0.252 > 0.01875).
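The frequency table itself did not survive in this text. In the commonly used version of this example the counts give P(Long|Banana) = 0.8, P(Sweet|Banana) = 0.7, P(Yellow|Banana) = 0.9 and P(Banana) = 0.5 (an assumption here, but consistent with the numbers quoted above), so:
P(\text{Banana} \mid E) \propto 0.8 \times 0.7 \times 0.9 \times 0.5 = 0.252
P(\text{Other} \mid E) \propto 0.5 \times 0.75 \times 0.25 \times 0.2 = 0.01875
P(\text{Orange} \mid E) \propto 0 \quad (\text{no Long oranges in the assumed training data})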
Naïve Bayes: Probabilities for the weather.nominal data
Counts and relative frequencies per class (Play = Yes: 9 instances, No: 5 instances):
Outlook:     Sunny 2/9 | 3/5;  Overcast 4/9 | 0/5;  Rainy 3/9 | 2/5
Temperature: Hot 2/9 | 2/5;    Mild 4/9 | 2/5;      Cool 3/9 | 1/5
Humidity:    High 3/9 | 4/5;   Normal 6/9 | 1/5
Windy:       False 6/9 | 2/5;  True 3/9 | 3/5
Play:        Yes 9/14;         No 5/14
Just counting: e.g. "Sunny" occurred 2 times within the 9 "yes" instances (first instances of the dataset: Sunny, Hot, High, False, No; Sunny, Hot, High, True, No).
For a new day E (Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True):
P(yes | E) = (2/9 x 3/9 x 3/9 x 3/9 x 9/14) / P(E)
Naïve Bayes: Probabilities for the weather.nominal data (continued; same frequency table as above)
• What if an attribute value does not occur with every class value?
(e.g., "Outlook = Overcast" for class "no")
• The probability will be zero: e.g. P(Outlook = Overcast | No) = 0
• The a posteriori probability will also be zero: P(No | E) = 0,
regardless of how likely the other values are!
• So a zero frequency dominates the entire probability calculation.
• Remedy: add 1 to the count for every attribute value-class combination
(the Laplace estimator)
• Result: probabilities will never be zero
• Additional advantage: stabilizes probability estimates computed from
small samples of data (where the likelihood of a zero is bigger)
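For instance, with 5 "no" instances and 3 possible Outlook values, the Laplace estimator turns the zero count into:
P(\text{Overcast} \mid \text{no}) = \frac{0 + 1}{5 + 3} = 0.125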
Numeric (Continuous) attributes
• In the previous examples, all attributes were discrete !
• What if certain attributes are numeric (e.g. Temperature in the Weather.numeric data)
• Usual assumption: attributes have a normal or Gaussian probability distribution (given the
class)
• The probability density function for the normal distribution is defined by two parameters:
• Sample mean
• Standard deviation
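Concretely, the density used is the normal density with sample mean \mu and standard deviation \sigma:
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}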
Classifying a new day
The standard output in Weka shows just the frequencies. Do you have to calculate the probabilities yourselves? No: choose "Output predictions" (PlainText) via the "More options…" button in the test panel.
Classification Algorithms
• Structure of the slides:
• Decision Trees (J48)
• PART
• ZeroR
• 1R (’one R’)
• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
• K-nearest neighbours (IBk) classifies an instance by the majority class among its k nearest instances, typically using the Euclidean distance. Note that taking the square root is not required when comparing distances.
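For reference, the Euclidean distance between instances p and q with n attributes is
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
and since the square root is monotonic, comparing the squared sums gives the same nearest neighbour.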
• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
[Figure: a small neural network with input nodes and nodes i, j, k connected by weighted links w1i, w2i, w3j, wjk, wik, …]
• The "nodes" are neurons or perceptrons and the links between them are the "dendrites". The neurons perform a number of operations and pass the result to the next layer.
This calculation is a linear combination. Now what does a value of 8 mean? We need to define a threshold value. The neural network's
output, 0 or 1 (stay home or go to work), is determined by whether the value of the linear combination is greater than the threshold value.
Suppose the threshold value is 5: if the calculation gives you less than 5, you can stay at home, but if it is equal to
or more than 5, you need to go to work.
Source: Towardsdatascience.com
Neural Network
• So using weights, the NN algorithm knows which information will be most important in making its
decision. A higher weight means the neural network considers that input more important compared to
other inputs.
• The weights are set to random values, and then the network adjusts those weights based on the output
errors it made using the previous weights. This is called training the neural network.
• So the NN algorithm takes in inputs and applies a linear combination using a weight vector. The result is
compared to a threshold value, leading to an output, 1 or 0. This predicted output is compared to the
observed (real) output, so the NN can learn:
Error = Observed (actual) - Predicted
1. Each neuron Ni first computes the weighted sum of its inputs:
in_i = sum over k in Inputs(Ni) of w_ki * a_k
where Inputs(Ni) corresponds to all neurons that provide an input to Ni, and a_k is the input value (the activation of neuron Nk).
2. This sum, called the input function, is then passed on to Ni's activation function, giving Ni's activation:
a_i = g_i(in_i)
Typically, the activation functions of all neurons in the NN are the same, so we just write g
instead of g_i. Example of a sigmoid function: g(x) = 1 / (1 + e^(-x))
Thus, every neuron in a NN takes in the activation of all its inputs and provides its own activation as an output.
NN: Backpropagation & Gradient Descent
• It is the job of the training process of a NN to ensure that the weights, wij, given by each neuron to each
of its inputs is set right, so that the entire NN gives an optimal outcome. Backpropagation is one of the
ways to optimize those weights.
• The error function defines how much the output of the model differs from the required (observed)
output. Typically, a mean-squared error function is used for this.
• It is important to know that each training instance will result in a different error value. It is the
backpropagation’s goal to minimize the error for all training instances on average. Backpropagation tries
to minimize or reduce the error function, by going backwards through the network, layer by layer. Then it
uses the gradient descent to optimize or fine-tune the weights.
• So the gradient descent is a technique used to fine tune the weights.
• Based on Chapter 8, 8.8 & 8.9 in Witten et al. (2017, 4th edition)
• Finding the smallest attribute (feature) set that is crucial in predicting the class is an important issue, certainly if
there are many attributes. Although all attributes are considered when building the model, not all are equally
important (i.e. have discriminatory capacity). Some might even distort the quality of the model.
• So, feature or attribute selection is used to set aside a subset of attributes that really matter and have added
value in creating the classifier.
• A typical example is an insurance company that has a huge data warehouse with historical data about its
clients, with many attributes. The company wants to predict the risk of a new client (from its perspective) and
get insight into which attributes really determine the risk. E.g. age, gender, and marital status might be the key
factors to predict the risk of future car accidents of a client, not so much the obtained college degree or financial
status.
• Feature selection is different from dimensionality reduction (e.g. Factor Analysis). Both methods seek to reduce
the number of attributes in the dataset, but a dimensionality reduction method does so by creating new
combinations of attributes, whereas feature selection methods include and exclude attributes present in the data
without changing them.
• Easier to interpret
• Structural knowledge
• Knowing which attributes are important may be inherently important to the application and interpretation of the classifier
• Note that irrelevant attributes can be harmful if they mislead the learning algorithm
• For example, adding a random (i.e., irrelevant) attribute can significantly degrade J48's performance
• A Feature selection approach combines an Attribute Evaluator and a Search Method. Each
section, evaluation and search, has multiple techniques from which to choose.
• The attribute evaluator is the technique by which each attribute in a dataset is evaluated in the
context of the output variable (e.g. the class). The search method is the technique by which to
navigate different combinations of attributes in the dataset in order to arrive on a short list of
chosen features.
• As an example, calculating correlation scores for each attribute (with the class variable), is only
done by the attribute evaluator. Assigning a rank to an attribute and listing the attribute in the
ranked order is done by a search method, enabling the selection of features.
• For some attribute evaluators, only certain search methods are compatible
• E.g. the CorrelationAttributeEval attribute evaluator in Weka can only be used with a Ranker Search Method,
that lists the attributes in a ranked order.
• Using correlations for selecting the most relevant attributes in a dataset is quite popular. The idea is
simple: calculate the correlation between each attribute and the output variable and select only those
attributes that have a moderate-to-high positive or negative correlation (close to -1 or 1) and drop those
attributes with a low correlation (value close to zero). E.g. the CorrelationAttributeEval.
• As mentioned before: the CorrelationAttributeEval technique requires the use of a Ranker search
method.
• Another popular feature selection technique is to calculate the information gain. You can calculate the
information gain (see entropy) for each attribute with respect to the output variable. Values vary from 0 (no
information) to 1 (maximum information). Those attributes that contribute more information will have a
higher information gain value and can be selected, whereas those that do not add much information will
have a lower score and can be removed.
• Try the InfoGainAttributeEval Attribute Evaluator in Weka. Like the correlation technique above, the
Ranker Search Method must be used with this evaluator.
• Open glass.arff, go to “select attributes”, use the CorrelationAttributeEval evaluator (correlation of each
attribute with the class variable). “Ranker” is now mandatory as a search method. Use the full training set
(we are not evaluating a classifier).
• Open glass.arff, use the InfoGainAttributeEval evaluator (information gain of each attribute). “Ranker” is
again mandatory as a search method.
• A different and popular feature selection technique is to use a generic but powerful learning algorithm (e.g.
classifier) and evaluate the performance of the algorithm on the dataset with different subsets of attributes. (So
no real a priori evaluation of the attributes; no filter approach. )
• The subset that results in the best performance is taken as the selected subset. The algorithm used to evaluate
the subsets does not have to be the algorithm that is intended to be used for the classifier, but it should be
generally quick to train and powerful, like a decision tree method.
• So if the target algorithm is Naïve Bayes, a different algorithm could be chosen to select a subset of
attributes..
• This is a scheme-dependent approach because the target scheme, the actual classifier you want to develop, is in
the loop.
• In Weka this type of feature selection is supported by the WrapperSubsetEval technique and must use a
GreedyStepwise or BestFirst Search Method (cf. further). The latter, BestFirst, is preferred but requires more
compute time.
• "Wrap around" the learning algorithm
• Must therefore always evaluate subsets
• Return the best subset of attributes
• Apply for each learning algorithm considered
[Loop: select a subset of attributes -> induce the learning algorithm on this subset -> evaluate the resulting model (e.g., accuracy) -> stop? if no, repeat; if yes, return the best subset]
• The available attributes in a dataset are often referred to as the ‘attribute space’ (in which to find a subset that
best predicts the class)
• When searching an attribute subset, the number of subsets is exponential in the number of attributes
• Common greedy approaches:
• forward selection
• backward elimination
• Recursive feature elimination
A greedy algorithm is any algorithm that simply picks the best choice it sees at the time and takes it.
• Forward Selection: an iterative method that starts with no features in the model. In
each iteration, the method adds the feature that best improves the model, until adding a new
attribute no longer improves the performance of the model (such as a classifier). Because the accuracy of
the classifier is evaluated with all the features in a set, this method will pick out features which work well together
for classification. Features are not assumed to be independent, so advantages may be gained from looking
at their combined effect.
• Backward Elimination: the model starts with all the features and removes, at each iteration, the least
significant feature, as long as this improves the performance of the model. This is repeated until no
improvement is observed from removing features.
• Recursive Feature elimination: It is a greedy optimization algorithm which aims to find the best performing
feature subset. It repeatedly creates models and keeps aside the best or the worst performing feature at each
iteration. It constructs the next model with the remaining features until all the features are exhausted. It then
ranks the features based on the order of their elimination.
• The attribute space can be further traversed by increasing the searchTermination parameter in Weka (if
> 1, the attribute space is investigated further, increasing the likelihood of finding a more global, less
local optimum)
• Example:
• Open glass.arff; choose the attribute evaluator WrapperSubsetEval in ‘Select attributes’, select J48, “use full training set”
• Get the attribute subset: RI, Mg, Al, K, Ba: “merit” 0.73 (=accuracy %)
• But the "merit", the best subset, is the same in all 3 runs… Searching the attribute space further did not lead to a higher optimum.
• Certainly select: RI (appeared in 9 folds), Al (appeared in 10 folds!), Ba and Mg, and maybe also Na
• Typically, you do not know a priori which view, or subset of features, in your data will produce the most
accurate models.
• Therefore, it is a good idea to try a number of different feature selection techniques on your data, to
create several views on your data (several subsets of features)
Filter method
Wrapper method
Confidence(X => Y) = supp(X ∪ Y) / supp(X)
Meaning: how many times is the rule correct? (= reliability). Weakness of confidence: if X and/or Y have a high
support, the confidence is by definition also high.
Lift(X => Y) = supp(X ∪ Y) / (supp(X) x supp(Y))
The Lift tells us how much our confidence has increased that Y will be purchased given that X was purchased. It
shows how effective the rule is in finding Y, as compared to finding Y randomly.
The supermarket example
Transaction ID | milk | bread | butter | beer
1  | 1 | 1 | 0 | 0
2  | 0 | 1 | 1 | 0
3  | 0 | 0 | 0 | 1
4  | 1 | 1 | 1 | 0
5  | 0 | 1 | 0 | 0
6  | 1 | 0 | 0 | 0
7  | 0 | 1 | 1 | 1
8  | 1 | 1 | 1 | 1
9  | 0 | 1 | 0 | 1
10 | 1 | 1 | 0 | 0
11 | 1 | 0 | 0 | 0
12 | 0 | 0 | 0 | 1
13 | 1 | 1 | 1 | 0
14 | 1 | 0 | 1 | 0
15 | 1 | 1 | 1 | 1
In the example database, the itemset {milk, bread, butter} has a support of 4/15 = 0.26, since it occurs in 26% of all transactions.
For the rule {milk, bread} => {butter} we have the following confidence:
supp({milk, bread, butter}) / supp({milk, bread}) = 0.26 / 0.4 = 0.65. This means that for 65% of the transactions containing milk and bread the rule is correct.
The rule {milk, bread} => {butter} has the following lift:
supp({milk, bread, butter}) / (supp({butter}) x supp({milk, bread})) = 0.26 / (0.46 x 0.4) = 1.4
Association Rules
General Process
Association rule generation is usually split up into two separate steps:
First, minimum support is applied to find all frequent itemsets in a database.
Second, these frequent itemsets and the minimum confidence constraint are used to form
rules.
So: Generate high-support item sets, get several rules from each
Strategy: iteratively reduce the minimum support until the required number of
rules is found with a given minimum confidence
Example in weather.nominal:
Generate item sets with support 14 (none)
Find rules in these item sets with minimum confidence level, 90% in Weka
Continue with item sets with support 13 (none)
And so on , until you have a sufficient number of rules
The Apriori Algorithm
The Weather data has 336 rules with confidence 100%, but only 8 have support
≥ 3, only 58 have support ≥ 2
In Weka: specify minimum confidence level (minMetric, default 90%), number of
rules sought (numRules, default 10)
Apriori makes multiple passes through the data
It generates 1-item sets, 2-item sets, ... with more than minimum support– turns each one into
(many) rules and checks their confidence
It starts at upperBoundMinSupport (usually left at 100%) and decreases by delta at each
iteration (default 5%). It stops when numRules is reached... or at lowerBoundMinSupport
(default 10%)
The Apriori algorithm applied (weather.nominal)
Market Basket Analysis
Missing values are used to indicate that the basket did not contain that item
[to be completed]
2. Clustering
With clustering, there is, again, no “class” attribute (so unsupervised learning)
Try to divide the instances into natural, homogenous groups, or “clusters”
It is hard to evaluate the quality of a clustering solution
in Weka: SimpleKMeans (+XMeans), EM, Cobweb
Examples of clusters:
Customer segmentation: divide customers in homogenous groups, based on numerous attributes (age, degree, social background,
average sales, number of visits, types of products bought, region, ….)
Why?
To have a better understanding,
To target marketing or promotion actions
To focus on very ‘good’ or very ‘infrequent’ customers
….
Student clustering
Clustering of prisoners
Course clustering
Clustering of schools, companies, cars, ….
Based on symptoms, patients can be clustered in order to determine an appropriate treatment
Sometimes, the target population or observations can be subdivided in groups, top-down, by argument, using specific criteria.
This is not clustering.
The goal of clustering is to start from a dataset with numerous attributes and to build up homogenous groups using similarity
measures (such as the Euclidean distance). So it is bottom-up, can be applied to a larger dataset , with many different attributes.
Clustering
• …
Typical features used for customer segmentation
Demographic
Age, gender, income, education, marital status, kids, …
Lifestyle
Vehicle ownership, Internet use, travel, pets, hobbies, …
Attitudinal
Product preferences, price sensitivity, willingness to try other brands, …
Behavioral
Products bought, prices paid, use of cash or credit, RFM, …
Acquisitional
Marketing channel, promotion type, …
1. Specify k, the desired number of clusters (often very difficult: how many groups do
you want? Not too many, not too few)
2. Choose k points at random as cluster centers
3. Assign all instances to their closest cluster center
4. Calculate the centroid (i.e., mean) of instances in each cluster
5. These centroids are the new cluster centers
6. Re-assign instances to these cluster centers
7. Re-calculate the centroids
8. Continue until the cluster centers don’t change
Minimizes the total squared distance from instances to their cluster centers.
In Weka, the Euclidean distance or the Manhattan distance can be used as a similarity
function, to compute the distance between instances and centers.
Clustering: K-means
[Figure: two initial cluster centers chosen among the instances]
- Calculate the distance for each instance to the new centers, and assign
the instance to the closest cluster (so instances could change, be re-
assigned to another cluster)
4) Iterative process : repeat the previous step (calculating the center, re-assigning
instances) until instances do not change anymore = convergence
The K-means Algorithm
Point  X  Y
A      3  7
B      6  1
C      5  8
D      1  0
E      7  6
F      4  5
(shown as a scatter chart)
K-means: calculations
The Euclidean distance between two instances p and q:
d(p, q) = sqrt( sum over i = 1..n of (p_i - q_i)² )
Euclidean distances between the six instances:
Case   A      B      C      D      E      F
A      0.000  6.708  2.236  7.280  4.123  2.236
B      6.708  0.000  7.071  5.099  5.099  4.472
C      2.236  7.071  0.000  8.944  2.828  3.162
D      7.280  5.099  8.944  0.000  8.485  5.831
E      4.123  5.099  2.828  8.485  0.000  3.162
F      2.236  4.472  3.162  5.831  3.162  0.000
Step 1: We opt for the "Farthest First" option and choose as initial cluster centers the two instances that are farthest apart: C and D (distance 8.944).
K-means: calculations (continued)
Step 2: For each instance, compute the Euclidean distance to each cluster center (C and D), e.g. for instance A(3, 7) to center C(5, 8):
sqrt( (3 - 5)² + (7 - 8)² ) = sqrt(5) = 2.236
(see the distance matrix above)
Assign each instance to the closest center: cluster C1 (center C) = {A, C, E, F}; cluster C2 (center D) = {B, D}.
Step 3: Calculate the new centroids: CC1 = mean of {A, C, E, F} = (4.75, 6.5); CC2 = mean of {B, D} = (3.5, 0.5). These centroids become the new cluster centers, and instances are re-assigned.
In Weka:
• Open weather.numeric.arff
• Cluster panel; choose SimpleKMeans
• Note the parameters: numClusters, distanceFunction, seed (default 10)
• Result: two clusters, 9 and 5 members, total squared error 16.2
{1/no, 2/no, 3/yes, 4/yes, 5/yes, 8/no, 9/yes, 10/yes, 13/yes}
{6/no, 7/yes, 11/yes, 12/yes, 14/no}
• Set seed to 11: two clusters, 6 and 8 members, total squared error 13.6
• Set seed to 12: total squared error 17.3
Evaluating Clusters
Now we know the size and the characteristics of the clusters
How good is our clustering?
Visualizing Clusters:
Open the Iris.arff data, apply SimpleKMeans, specify 3 clusters
3 clusters with 50 instances each
Visualize cluster assignments (right-click menu in Result List)
Plot clusters (x-axis) against the instance numbers: the denser the clusters, the more cohesive, the
better the quality
Which instances does a cluster contain?
Use the AddCluster unsupervised attribute filter (in the Preprocess tab !)
Try with SimpleKMeans (within the filter); Apply and click Edit
What about the class variable?
Also apply “visualize cluster assignments”, clusters on the X, class variable on the Y. There are yes
and no’s in both clusters: so no perfect match between clusters and class values
Try the “Ignore attribute” button, ignore the class attribute; run again with 3 clusters, now: 61, 50, 39
instances
Visualizing Clusters (continued)
With all attributes: very dense, balanced clusters. Leaving out the class variable: more distance within the clusters, less balanced (but still acceptable).
Classes-to-clusters evaluation
• In the Iris data: SimpleKMeans, specify 3 clusters
• Classes-to-clusters evaluation = using clustering in a supervised way…
• Classes are assigned to clusters; can the clusters predict the class values?
• Now you have a confusion (classification) matrix and an accuracy! (100% - 11% = 89%)