Business Intelligence 2
• Module 1: SQL
• Basics of SQL
• 1 table
• Where & Order by
• SQL is a non-procedural language, i.e. you specify what you want instead of
how you want to obtain it. The database management system (DBMS) will itself
interpret the SQL instructions and show the results ... SQL is suitable for any
DBMS!
• You need to install LibreOffice on your laptop (macOS, Windows or Linux).
• You should/can also go to w3schools.com – SQL, and try basic SQL exercises
if you are an SQL novice.
SELECT column_list
FROM table_list
[WHERE condition]
[GROUP BY column_list]
[HAVING condition]
[ORDER BY column_list];
• If we have only 1 table, we can omit the table name before the field
names:
SELECT Lastname, Emailaddress
FROM Dimcustomer
WHERE Maritalstatus = 'M';
Selection Conditions (continued)
• Comparison operators:
=   equal to
<>  different from
<   lower than
>   greater than
<=  lower or equal to
>=  greater or equal to
BETWEEN … AND …   between two values
IN (list)   equal to one of the values in the list
IS NULL   equal to the NULL (blank) value
• Example: Show all products with a catalog price (list price) between $30 and
$100
SELECT *
FROM Dimproduct
WHERE Listprice BETWEEN 30 AND 100
In MS Access:
SELECT *
FROM Dimcustomer
WHERE BirthDate in (#1/15/1950#,#1/15/1970#)
In LibreOffice:
SELECT *
FROM Dimcustomer
WHERE Birthdate IN ( '1950-01-15', '1970-01-15' )
SELECT *
FROM Dimemployee
WHERE phone IS NOT NULL
Negations in the Where clause
• By adding the NOT operator to a selection condition we obtain the negation of
the selection condition.
E.g.: Return the name of the workers who are not “marketing manager”
SELECT *
FROM DimEmployee
WHERE NOT (Title = 'Marketing Manager');
The use of Wildcards & Like
• The LIKE operator is used in a WHERE clause to search for a specified pattern in
a column.
• There are two wildcards often used in conjunction with the LIKE operator:
• % - the percent sign represents zero, one, or multiple characters
• _ - the underscore represents a single character
• Note: MS Access uses an asterisk (*) instead of the percent sign (%), and a question mark (?) instead of the underscore (_).
Examples:
SELECT *
FROM Dimcustomer
WHERE Lastname LIKE 'An%';

SELECT *
FROM Dimcustomer
WHERE Lastname LIKE '%an%';
• Sometimes a row must or may meet several selection conditions before it can
be selected.
• If a row must meet several selection conditions then they are linked together
with the AND operator. The composed selection condition is TRUE only if all
individual selection conditions are TRUE.
• If a row may meet several selection conditions then they are linked together
with the OR operator. The composed selection condition is TRUE if at least
one individual selection condition is TRUE.
• All female employees who are married:
SELECT *
FROM Dimemployee
WHERE Gender = 'F'
AND Maritalstatus = 'M'
• All female employees or employees who are married (so no single males):
SELECT *
FROM Dimemployee
WHERE Gender = 'F'
OR Maritalstatus = 'M'
In MS Access:
SELECT *
FROM DimEmployee
WHERE Gender = 'M'
AND (Birthdate > #31-DEC-59# OR Title = 'Chief Financial Officer')
In LibreOffice:
SELECT *
FROM Dimemployee WHERE Gender = 'M' AND (Birthdate > '1959-12-31'
OR Title = 'Chief Financial Officer');
So: the query always concerns male employees; in addition, they must either be CFO or born after 1959.
The Order By clause: Sorting the displayed rows
• The ORDER BY clause:
By adding an ORDER BY clause as the last clause of a query, we can show the
selected rows sorted by the values of a particular column.
For example: show all the employees ordered by their date of birth (oldest to
youngest)
SELECT *
FROM Dimemployee
ORDER BY Birthdate
To reverse the order (youngest to oldest), add DESC:
SELECT *
FROM Dimemployee
ORDER BY Birthdate DESC
Sorting the displayed rows (cont’d)
• Ordering on multiple columns
For example: Show all employees. First show male employees, then the
females. Show by gender ('m' or 'f'), the employees in order of age, from
oldest to youngest
SELECT *
FROM Dimemployee
ORDER BY gender DESC, birthdate
• Queries that need data from more than one table can be solved by using "joins" between tables within SQL instructions.
• When multiple tables are mentioned in the FROM clause, those tables are linked through a JOIN.
The result of combining tables A and B contains the columns of A and the columns of B.
The Cartesian product of two tables includes in its result all possible
combinations of records between the two tables. This means that there are no
conditions for the join: each row (record) of table A is linked to all rows of table B.
• A join is as good as taking a Cartesian product and thereby ensuring that only
the correct records remain (i.e. the right customers with the right sales).
Example: FactInternetSales (foreign key) joined with DimCustomer (primary key).
• Result: a join with a join condition: the value of the primary key needs to
match the value of the foreign key.
Note: both options give the same results, but this format is longer.
We advise using option 1 (previous slide) when writing SQL.
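As an illustration of the two formats (a sketch, assuming the usual AdventureWorks key names):
-- Option 1: join condition in the WHERE clause
SELECT Lastname, Salesamount
FROM Dimcustomer, Factinternetsales
WHERE Dimcustomer.Customerkey = Factinternetsales.Customerkey;
-- Option 2: explicit INNER JOIN (same result, but longer)
SELECT Lastname, Salesamount
FROM Dimcustomer INNER JOIN Factinternetsales
ON Dimcustomer.Customerkey = Factinternetsales.Customerkey;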
• Next to the join condition we can also add additional selection conditions in a
query.
• Example: Show the product (by name), the name of a customer and the
dates of his internet purchases, on the condition that the customer is from
France …
To solve this query we need the following tables:
1. FactInternetSales (the fact table)
2. DimCustomer (dimension table, name of customer)
3. DimGeography (dimension table, country of the customer)
4. DimProduct (dimension table, name of the product)
5. DimDate (date of Internet sale)
So 4 join conditions and an additional condition (the customer's country)
The SQL statement, using DWAdventureworks.mdb (in Base, LibreOffice):
SELECT Dimcustomer.Firstname, Dimcustomer.Lastname,
Dimproduct.Productname, Dimgeography.Countryregionname,
Dimdate.Datekey
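The SELECT list above is only the first part of the statement. A plausible completion, assuming the usual AdventureWorks key names, is:
SELECT Dimcustomer.Firstname, Dimcustomer.Lastname,
       Dimproduct.Productname, Dimgeography.Countryregionname,
       Dimdate.Datekey
FROM Factinternetsales, Dimcustomer, Dimgeography, Dimproduct, Dimdate
WHERE Factinternetsales.Customerkey = Dimcustomer.Customerkey   -- join 1
AND Dimcustomer.Geographykey = Dimgeography.Geographykey        -- join 2
AND Factinternetsales.Productkey = Dimproduct.Productkey        -- join 3
AND Factinternetsales.Datekey = Dimdate.Datekey                 -- join 4
AND Dimgeography.Countryregionname = 'France';                  -- additional condition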
• Choose an alias for your table names and use it throughout your query. An
alias is at least one character, not a number, and contains no spaces.
• Why ? This shortens your query considerably!
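For instance, the query above becomes considerably shorter with aliases (a sketch under the same key-name assumptions):
SELECT c.Firstname, c.Lastname, p.Productname, g.Countryregionname, d.Datekey
FROM Factinternetsales f, Dimcustomer c, Dimgeography g, Dimproduct p, Dimdate d
WHERE f.Customerkey = c.Customerkey
AND c.Geographykey = g.Geographykey
AND f.Productkey = p.Productkey
AND f.Datekey = d.Datekey
AND g.Countryregionname = 'France';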
• So group by "city" (putting identical cities together in one group), and within each
group (each city), count the number of customer keys.
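A minimal sketch of that query, assuming the city attribute lives in Dimgeography and joins to Dimcustomer via Geographykey:
SELECT g.City, COUNT(c.Customerkey) AS COUNTofCustomerKey
FROM Dimcustomer c, Dimgeography g
WHERE c.Geographykey = g.Geographykey
GROUP BY g.City;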
Group By, explained with an example
• Suppose a manager wants to know the sales for each product. Let us assume
that s/he wants a list with products and their total sales. The sales amounts can
be found in the fact table.
The first records of your result then look like this (in total you have 1000 facts):
Presenting a table with 1000 individual sales does not make much sense for the
manager. The business analyst needs to give a subtotal for each product, e.g.
26.97 for AWC Logo Cap. Hence he groups by product name and sums up the
sales amounts.
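In SQL, the analyst's grouping step could look like this (a sketch, assuming the usual fact/dimension key names):
SELECT p.Productname, SUM(f.Salesamount) AS TotalSales
FROM Factinternetsales f, Dimproduct p
WHERE f.Productkey = p.Productkey
GROUP BY p.Productname;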
• In this example: first group by city, then "within each city" group by gender.
• Note: in the SELECT clause we can also add a column heading using AS
(e.g. AS COUNTofCustomerKey). The header name is optional and must be a
single word.
For example, in Seattle: 1 woman and 2 men.
• So: each group (each city) presented in the result should include more than 2
customers, as in the sketch below.
• Note that the count operation does not have to be in the SELECT clause!
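A hedged sketch of such a query, with the count appearing only in the HAVING clause:
SELECT g.City
FROM Dimcustomer c, Dimgeography g
WHERE c.Geographykey = g.Geographykey
GROUP BY g.City
HAVING COUNT(c.Customerkey) > 2;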
• When the same query has both a WHERE and a HAVING clause, the query
processor first applies the WHERE condition to the individual rows, then groups
the remaining rows (GROUP BY), and finally applies the HAVING condition to the
resulting groups.
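For example, in the following sketch the WHERE clause first keeps only married customers; only then are the city groups formed and filtered:
SELECT g.City, COUNT(c.Customerkey)
FROM Dimcustomer c, Dimgeography g
WHERE c.Geographykey = g.Geographykey
AND c.Maritalstatus = 'M'            -- 1. applied to individual rows
GROUP BY g.City                      -- 2. remaining rows are grouped
HAVING COUNT(c.Customerkey) > 2;     -- 3. applied to the groups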
1. Union Queries
2. ‘Top[number]’
3. Subqueries
• Union queries are used to combine data from multiple queries into one result.
• For example: Provide a list of names and telephone numbers of all employees
and all customers with an annual income greater than 25 000. (This does not
necessarily imply a link between customers and employees).
• SQL
select Firstname, Lastname, Phone
from Dimemployee
union
Select Firstname, Lastname, Phone
from Dimcustomer
where Yearlyincome > 25000
• You cannot simply combine any two SQL statements (queries) with "union"
• Conditions:
• Every query must have the same number of columns (fields)
• Corresponding columns must have compatible data types
• Incorrect Example:
• Select Firstname, Lastname, Birthdate
from Dimemployee
union
select Firstname, Lastname, Phone
from Dimcustomer
where Yearlyincome > 25000
1. Union Queries
• Incorrect Example:
• Select Totalchildren, Firstname, Lastname
from Dimcustomer
where Yearlyincome > 25000
union
select Firstname, Lastname, Phone
from Dimemployee
• Correct Example:
• Select Phone, Firstname, Lastname
from Dimcustomer
where Yearlyincome > 25000
union
select Firstname, Lastname, Phone
from Dimemployee
• The order of the fields is different, but because the data types are all text,
most database systems accept the query. Columns are combined positionally:
the first query determines the order in which the attributes are displayed.
2. ‘Top[number]’
• Top[number]: E.g. Top 1, Top 3, ... This instruction can be added immediately after
the word "select". Example:
• Note that the "top x" instruction only ensures that the first x records of a query are
displayed! The youngest, oldest or highest are not automatically chosen. Example:
• Then:
• How do you modify the query to select the customer with the lowest
purchase amount?
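A plausible answer (MS Access TOP syntax; assuming the purchase amounts are summed per customer) is to sort ascending so that the lowest total comes first:
SELECT TOP 1 Customerkey, SUM(Salesamount) AS Total
FROM Factinternetsales
GROUP BY Customerkey
ORDER BY SUM(Salesamount) ASC;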
• For example:
SELECT Customerkey, Lastname FROM Dimcustomer
WHERE Customerkey IN
(SELECT Customerkey
FROM Factinternet
GROUP BY Customerkey
HAVING SUM(Salesamount) < 800)
• The data that we want to select can thus also be queried with an INNER JOIN.
Why then use a subquery? A subquery is generally a lot clearer than a JOIN
and easier to formulate. Some RDBMSs perform more quickly when using a
JOIN.
• Examples:
• Which employees are older than Guy Gilbert?
So you compare employees with each other
• Which customers are located in the same town as the customer Jacquelyn
Suarez?
So you compare with other customers
• The operator "IN" can be used if the second query yields one or more records.
• "IN" means: the defined field of a record from the top query must "appear in" one or more
records from the second query.
• For example:
SELECT Customerkey FROM Dimcustomer WHERE Customerkey IN
(SELECT Customerkey
FROM Factinternet
GROUP BY Customerkey
HAVING SUM(Salesamount) < 800)
• The second query returns the "customerkey" of all customers who bought for less than 800 in total.
• This query can be shorter, but then the town of customers will not be displayed. (You
only use “Geographykey”)
SELECT Firstname, Lastname
FROM Dimcustomer
WHERE Geographykey = (SELECT Geographykey
FROM Dimcustomer
WHERE Lastname = 'Suarez'
AND Firstname = 'Jacquelyn')
4. Rollup and Cube
• Rollup and Cube are two typical OLAP concepts (see below)
• Rollup and Cube can also be used as SQL statements. They calculate
subtotals.
• Neither instruction is provided in MS Access, but they are available in database
systems like MS SQL Server, Oracle and MySQL
SELECT Sort, Place, SUM(Amount) AS Amount
FROM Animal
GROUP BY Sort, Place

Result rows (the cube additionally produces subtotal rows such as the grand total):
Cat    Miami   18
Cat    Naples   9
Dog    Miami   12
Dog    Naples   5
Dog    Tampa   14
Turtle Naples   1
Turtle Tampa    4
NULL   NULL    63
• Note: "rollup" and "cube" calculate subtotals as they are calculated in a Pivot
Table…
Result of the query “with cube”
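The subtotal-producing variants would be written as follows (SQL Server/Oracle syntax; MySQL instead appends WITH ROLLUP to the GROUP BY clause):
-- ROLLUP: subtotals per Sort plus a grand total
SELECT Sort, Place, SUM(Amount) AS Amount
FROM Animal
GROUP BY ROLLUP (Sort, Place);
-- CUBE: subtotals for every combination of Sort and Place
SELECT Sort, Place, SUM(Amount) AS Amount
FROM Animal
GROUP BY CUBE (Sort, Place);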
Data Warehouse vs Data Mart
Content
Data warehousing
1) Introduction
2) OLTP vs. OLAP
3) The multidimensional model
4) Logical data warehouse design
5) Data warehouse architecture
6) Data warehouse usage
1) Introduction
2) OLTP vs OLAP
• OLTP databases
– Designed to support daily operations
– Ensure fast, concurrent access to data
• Transaction processing
• Concurrency control
• Recovery techniques
– Normalization crucial
• Must support heavy transaction loads
• Should prevent update anomalies
– Therefore: poor performance for executing complex queries
(joining many relational tables together)
What is a Data Warehouse?
“A DW is a
– subject-oriented,
– integrated,
– time-varying,
– non-volatile,
collection of data that is used primarily in organizational decision making.”
-- W.H. Inmon, Building the Data Warehouse, 1992
Subject-oriented
Integration
[Figure: data from heterogeneous sources - RDBMSs, legacy systems, scientific databases, digital libraries and the World Wide Web - is integrated into the Data Warehouse]
Nonvolatile
• Once data is recorded it will not be updated anymore
• Historical data will not be deleted
• Only new data is added
• A data warehouse requires two operations in data accessing
– Initial loading of data (ETL)
– Access of data
Data warehouse database vs. OLTP database
Additional differences
3) Multidimensional model
Dimensions, measures, facts and hierarchies
Aggregation
4) Logical Data Warehouse design
4) Logical Data Warehouse design, continued
Relational data warehouse design
Star Schema
[Figure: a central fact table with the measurements, linked to surrounding dimension tables]
6) Data warehouse usage
OLAP operations
Roll-up
Drill-down
Slicing
Dicing
Data warehouse querying
Business Intelligence
Part 2:
Introduction to Data Mining
Using Weka
WEKA!
What's Weka?
– A bird found only in New Zealand?
– A data mining workbench: the Waikato Environment for Knowledge Analysis
Machine learning algorithms for data mining tasks:
– 100+ algorithms for classification
– 75 for data preprocessing
– 25 to assist with feature selection
– 20 for clustering, finding association rules, etc.
• There are three MOOCs (massive open online courses) available. This course is
considerably inspired by them.
• https://www.cs.waikato.ac.nz/ml/weka/courses.html
• Is there “noise” in the data? (errors such as “age = -20” , or “gender= m and name= Marianne”,
etc.)
• Are there "outliers" (extreme or exceptional values) that can influence our results? If this is the
case, you can ultimately drop them…
• What do available attributes look like? What type of data do we have? Is the field textual or numeric?
If numeric, are they continuous variables or discrete (categorical) ones? (For example “gender”
is a discrete variable (m or f), “age” is a continuous variable but can become discrete (for example
subdivision in these categories : “0-17”; “18-30”, “ > 60”, …)).
• How many products can an enterprise (hope to) sell next year ?
• Is the income tax return of a company the same as the one of similar companies (tax
evasion)?
• What is the probability of success of a student given all the information we have about that
student and given a group of students from the past (social background, secondary education,
exam results,…)?
• Some questions can also, to some extent, be answered thanks to “conventional” queries, cross-
tabulations,...
Patient ID | Sore Throat | Fever | Swollen Glands | Congestion | Headache | Diagnosis
11         | No          | No    | Yes            | Yes        | Yes      | ?
12         | Yes         | Yes   | No             | No         | Yes      | ?
13         | No          | No    | No             | No         | Yes      | ?
Source: Roiger, R. J. & M.W. Geatz (2003)
[Figure: decision tree with root node "Swollen Glands" (no/yes); the "no" branch tests "Fever" (no/yes)]
The decision tree (with the "production rules") can now be used to make forecasts for future patients
whose diagnosis is still unknown…
Remark:
- a custodial account is an account type where an institution or guardian manages the account on behalf of a "protected" individual
- a joint account: an account shared by several people
This question becomes more global and can be answered through data mining, and more specifically data
clustering.
We can use "clustering" on the above-mentioned data. The clustering technique determines clusters
(categories/groups) such that the "distance" between the clusters is as great as possible, whereas the
distance between the clients within a cluster is as small as possible.
We could obtain the following “rules” :
If (margin account = yes && Age = 20-29 && Annual Income = 40-59K)
THEN cluster = 1 (accuracy = 0.80; coverage = 0.50)
If (account type = custodial && favorite recreation = skiing && annual income = 80-90K)
THEN cluster = 2 (accuracy = 0.92; coverage = 0.35)
If (account type = joint && Trades/month > 5 && transaction method = online)
THEN cluster = 3 (accuracy = 0.82; coverage = 0.65)
Accuracy & coverage do tell us something about the clustering value (validation); see further.
A clustering will never be perfect !
• Outliers
• Remove or keep?
• E.g. Age = 400 (false observation) vs. income = 10 000 Euro (correct
observation?)
• Missing values:
• How to deal with them? E.g. replace with average value?
• Definition of the target variable = outcome variable (if required, cf. below)
• Credit scoring: What is a bad customer/actor (e.g. 90 days payment arrears
according to the international Basel II guidelines)
• Churn management: What is a churner? (e.g. a customer without any
purchase in the last 4 months)
Data Mining Techniques
A model is learned from training data, evaluated on testing data, and then applied to unseen data. Example training data:
NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes
Unseen instance: (Jeff, Professor, 4) -> Tenured?
Data Mining Strategies: Classification, Estimation, …
Source: adapted from Roiger, R. J. & M.W. Geatz (2003), Data Mining: A Tutorial-Based Primer, Addison-Wesley, 350 p.
Expert Systems (Artificial Intelligence)
[Figure: a knowledge engineer - a person trained to work with a human expert and capture his/her knowledge - encodes that knowledge into an expert system using an expert system building tool]
Example rule:
If Swollen Glands = Yes
Then Diagnosis = Strep Throat
Weather.nominal.arff
• Open iris.arff
• Bring up the Visualize panel
• Bars on the right correspond to attributes: click for x axis; right-click for y axis
• Jitter slider ("Jitter is a random displacement applied to X and Y values to separate points
that lie on top of one another. Without jitter, 1000 instances at the same data point would
look just the same as 1 instance.")
• On classification problems, the output variable must be nominal. For regression problems, the output
variable must be real.
Using Weka
Getting to know WEKA
• Open the dataset
• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
• Color is important !
Training Data (excerpt; note that "Cheat" is the categorical "output" variable):
Tid | Refund | Marital Status | Taxable Income | Cheat
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Model: Decision Tree
Refund = Yes -> NO
Refund = No -> test MarSt: Married -> NO; Single or Divorced -> test TaxInc: < 80K -> NO, > 80K -> YES
There can consequently be more than one decision tree for the same data set!
The tree can then classify an unseen instance such as: Refund = No, Married, 80K -> Cheat = ?
Validation of a decision tree
• Focus on the predictive ability of a model
• Instead of focusing on how much time it takes to make a model, the size of the model, etc.
• Confusion Matrix: how many records were properly classified? How many errors?
PREDICTED CLASS (columns) vs. ACTUAL CLASS (rows):
           Predicted x | Predicted y
Actual x   a (TP)      | b (FN)
Actual y   c (FP)      | d (TN)
a: TP (true positive); b: FN (false negative); c: FP (false positive); d: TN (true negative)

Two example matrices:
           x    y                  x    y
Actual x   150  40      Actual x   250  45
Actual y   60   250     Actual y   5    200
For more than two classes, the confusion matrix generalizes (computed decision C1, C2, C3):
      C1    C2    C3
C1   C11   C12   C13
C2   C21   C22   C23
C3   C31   C32   C33
Weka: J48
• Evolution:
• ID3 (1979)
• C4.5 (1993)
• C4.8 (1996?) => J48 (adapted for Weka)
• C5.0 (commercial)
• Open the configuration panel in Weka (click on the white box under "Classifier")
• Check the More information
• Examine the options
• Look at leaf sizes, Set minNumObj to 15 to avoid small leaves
• Visualize the tree using right‐click menu
Min. leaves 2: accuracy = 66.8%. Min. leaves 15: accuracy = 62.15%.
Output: the J48 output shows general information, the tree size, the accuracy % and the Kappa statistic.
Entropy is a measure of impurity (the counterpart of information gain), so the entropy after a split should be lower than before the split.
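For reference, the standard definitions (for a set S with class proportions p_i, and an attribute A splitting S into subsets S_v):
\text{Entropy}(S) = -\sum_i p_i \log_2 p_i
\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)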
J48: algorithm
Note: information gain is measured in 'bits', as a unit of information.
Weka: J48
• Fewer attributes = sometimes a better classification!
• Open glass.arff to run J48 (with default options...):
• Run J48, with default options first
• Next. Remove Fe, and run J48 again
• Next, Remove all attributes except RI and MG, run J48 again
• Compare the decision trees, and particularly their accuracy %
• (Also use the right‐click menu to visualize decision trees)
Classification Algorithms
• Structure of the slides:
• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
PART
• Rules from partial decision trees: PART
• Theoretically, rules and trees have equivalent “descriptive” or expressive power ... but either
can be more perspicuous (understandable, transparent) than the other
• Create a decision tree: top-down, ”divide-and-conquer”; read rules off the tree
• One rule for each leaf
• Straightforward, but rules contain repeated tests and are overly complex
• Alternative: a covering method: bottom-up, “separate-and-conquer”:
• Take a certain class (value) in turn and seek a way of covering all instances in it. This is called a covering
approach because at each stage you identify a rule that “covers” some of the instances. This approach may
lead to a set of rules, rather than to a decision tree.
• Separate-and-conquer:
• Identify a rule
• Remove the instances it covers
• Continue, creating rules for the remaining instances
PART: Separate-and-Conquer: Example
• Identifying a rule for class a (so not explicitly considering class b):
• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
Baseline Accuracy: ZeroR
• Open file diabetes.arff: 768 instances (500 negative, 268 positive)
• Always guess “negative”: 500/768 : accuracy = 65%
• rules > ZeroR: most likely class!
• Try these classifiers (cf. later for more info):
– trees > J48 74%
– bayes > NaiveBayes 74%
– lazy > IBk 70%
– rules > PART 75%
• So ZeroR is a classifier that uses no attributes. ZeroR simply "predicts" the mean (for a
numeric class (or output variable)) or the mode (for a nominal class).
Baseline Accuracy : ZeroR
• Always try a simple baseline first. Sometimes, the baseline is best!
• Open supermarket.arff and blindly apply the following classifications:
• rules > ZeroR -- accuracy =63.7%
• trees > J48 -- accuracy = 63.7%
• bayes > NaiveBayes – accuracy = 63.7%
• lazy > IBk – accuracy = 37% (!)
• rules > PART – accuracy = 63.7%
OneR: One attribute does all the work
• Learn a 1‐level “decision tree”
• – i.e., rules that all test one particular attribute
• Basic version
• One branch for each value
• Each branch assigns most frequent class
• Error rate: proportion of instances that don’t belong to the majority class of their corresponding
branch
• Choose attribute with smallest error rate
• In Weka:
• Open file weather.nominal.arff
• Choose OneR rule learner (rules>OneR)
• Look at the rule
Dealing with numeric attributes
• Idea: discretize numeric attributes into sub ranges (intervals or partitions)
• How to divide each attribute’s overall range into intervals?
• Sort instances according to attribute’s values
• Place breakpoints where (majority) class changes
• This minimizes the total classification error. In the example below, this yields 8 intervals
(partitions).
• Example: temperature from the weather.numeric data
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No | Yes Yes Yes | No | Yes Yes | No
• However, whenever adjacent partitions have the same majority class, as the two first
partitions above (in both, ”yes” is the majority), they can be merged together, leading to 2
partitions
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes No No Yes Yes Yes | No Yes Yes No
• So the rule: temperature <= 77.5 -> yes
               temperature > 77.5 -> no
Results with overfitting avoidance
• Resulting rule sets for the four attributes in the weather.numeric data, with only two rules for the temperature attribute:
Attribute    Rules                      Errors  Total errors
Outlook      Sunny -> No                2/5     4/14
             Overcast -> Yes            0/4
             Rainy -> Yes               2/5
Temperature  <= 77.5 -> Yes             3/10    5/14
             > 77.5 -> No*              2/4
Humidity     <= 82.5 -> Yes             1/7     3/14
             > 82.5 and <= 95.5 -> No   2/6
             > 95.5 -> Yes              0/1
Windy        False -> Yes               2/8     5/14
             True -> No*                3/6
• In Weka: open weather.numeric; Classify; rules > OneR; in the configuration panel set minBucketSize = 3 (see previous slide); Start; look at the rule…
Classification Algorithms
• Structure of the slides:
• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
The problem of Overfitting
• Any machine learning method may “overfit” the training data …
• … by producing a classifier that fits the training data too tightly
• So the model performs well on the training data but not on independent test data (which is
then reflected in a low accuracy rate)
• Overfitting is a general phenomenon that plagues all ML methods
• This is one reason why you must always evaluate on an independent test set
• However, overfitting can occur more generally: you can have good accuracy rate but the
model performs badly when using new validation data (coming from a different context)
• E.g. You try many ML methods, and choose the best for your data – you cannot expect to get the
same performance on new validation data
Overfitting : an example
• Experiment with the diabetes dataset
• Open file diabetes.arff
• Choose ZeroR rule learner (rules>ZeroR)
• Use cross‐validation: 65.1%
• Choose OneR rule learner (rules>OneR)
• Use cross‐validation: 71.5%
• Look at the rule in the output (plas = plasma glucose concentration)
• In the configuration panel of OneR, change minBucketSize parameter to 1: run again
• Use cross-validation: 57.16%
• Look at the rule again
• So in the 2nd run of OneR, the rule is much more complex, tightly (over)fitted to the training set, but
performing worse on the test set(s) (even worse than ZeroR).
Classification Algorithms
• Structure of the slides:
• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
Quality Evaluation: Accuracy, Precision and Recall
• So far, we mainly focused on Accuracy % = number of correctly classified instances / total number of instances
• Overall a useful indicator, mainly in case of more or less balanced class values
• Easy to understand, evaluates the entire model
• Additional, related measures (between 0 and 1), given per class value:
• TP rate = TP/(TP+FN) = Recall = sensitivity
• Precision = TP/(TP+FP)
• F-measure is the harmonic mean of recall and precision: F = 2 x Precision x Recall / (Precision + Recall)
• Area under the ROC curve (also called the C statistic)
• A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the
discriminatory ability of a classifier.
• The 'area under the curve' (or C statistic) represents the classification quality of the model: 1 is a perfect
model, without any false positives; 0.5 is a worthless model that detects as many true positives as false
positives
Quality Evaluation: Kappa Statistic
• Measures such as the accuracy % are easy to understand, but may give a distorted picture in the case of
imbalanced classes. E.g. we have two classes, say A and B, and A shows up on 5% of the time. Classifying
all as B gives an accuracy of 95% (so ‘excellent’), whereas the minority class is not well predicted. This
problem is even more present in the case of 3 or more classes
• Cohen’s kappa statistic is a very good measure that can handle very well imbalanced class problems.
• Kappa statistic:
• (success rate of actual predictor - success rate of random predictor) / (1 - success rate of random predictor)
• Measures relative improvement on random predictor: 1 means perfect accuracy, 0 means we are doing no better than
random
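In formula form, with p_o the success rate of the actual predictor and p_e that of the random predictor:
\kappa = \frac{p_o - p_e}{1 - p_e}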
• Interpretation, rules of thumb (!):
based on Landis, J.R.; Koch, G.G. (1977). “The measurement of observer agreement for categorical
data”. Biometrics 33 (1): 159–174
• value < = 0 is indicating no improvement over random prediction
• 0–0.20 a slight improvement,
• 0.21–0.40 a fair improvement,
• 0.41–0.60 a moderate improvement,
• 0.61–0.80 a substantial improvement,
• and 0.81–1 an almost perfect, maximal improvement.
• It basically tells you how much better your classifier is performing over the performance of a classifier
that simply guesses at random according to the frequency of each class.
Example
• A Decision Tree on the Iris.arff dataset:
• When opting for separate training and test sets, separate files (.arff) need to be created
first. The model is then built with the training set, and tested separately on the test set.
Training and test files can be created by random selection or stratified (with each file
respecting a comparable proportion of the output (class) variable values). Typically, the
number of instances in the test set is one third of that of the training set, but this is not a
strict requirement.
Accuracy testing: 2. Holdout estimation
• What should we do if we only have a single dataset?
• The holdout method reserves a certain amount for testing and uses the remainder for
training, after shuffling
• Usually: one third for testing, the rest for training; by default 66% training - 34% testing in Weka
• Problem: the samples might not be representative
• Example: a class value might be missing in the test data
• Advanced version uses stratification
• Ensures that each class value is represented with approximately equal proportions in both subsets
• Holdout estimates can be made more reliable by repeating the process with
different subsamples
• In each iteration, a certain proportion is randomly selected for training (possibly with
stratification)
• The error rates on the different iterations are averaged to yield an overall error rate
• This is called the repeated holdout method
• Still not optimum: the different test sets overlap
• Can we prevent overlapping?
See slide 45
3. Repeated holdout: an example
• So quite some variation in the accuracy %.
• Calculate the mean and variance, to get a more reliable outcome (lies between 92.9% and 96.7%)
4.1. k-fold Cross Validation
• 10-fold cross-validation:
• Divide the dataset into 10 parts (folds)
• Hold out each part in turn for testing
• Average the results
• So each data point is used once for testing, 9 times for training
4.1. k-fold Cross Validation (continued)
• With 10-fold cross-validation, Weka invokes the learning algorithm 11 times (once per fold, plus a final run on the full dataset to produce the printed model)
• Practical rule of thumb:
• Lots of data? - use percentage split
• Else stratified 10-fold cross-validation
ML = Machine Learning
More on cross-validation
• Standard method for evaluation: stratified ten-fold cross-validation
• Why ten?
• Extensive experiments have shown that this is the best choice to get an accurate estimate
• There is also some theoretical evidence for this (cf. Witten, et al. 2017)
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
• E.g., ten-fold cross-validation is repeated ten times and results are averaged (reduces the
variance)
Accuracy testing: 4.2. Leave-one-out cross-validation
• Leave-one-out:
a particular form of k-fold cross-validation:
• Set number of folds to the number of training instances
• I.e., for n training instances, build classifier n times
• Makes best use of the data
• Involves no random subsampling
• Very computationally expensive (exception: using lazy classifiers such as the
nearest-neighbor classifier ( cf. later))
• Disadvantage of Leave-one-out CV: stratification is not possible
• It guarantees a non-stratified sample because there is only one instance in the test set!
Classification Algorithms
• Structure of the slides:
• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
Naïve Bayes
• Frequently used in Machine Learning, Naive Bayes is a collection of classification algorithms based on the Bayes
Theorem. The family of algorithms all share a common principle, that every feature being used to classify is independent
of the value of any other feature. So for example, a fruit may be considered to be an apple if it is red, round, and about
3″ in diameter. A Naive Bayes classifier considers each of these “features” (red, round, 3” in diameter) to contribute
independently to the probability that the fruit is an apple, regardless of any correlations between features. Features or
attributes, however, aren’t always independent in reality, which is often seen as a shortcoming of the Naive Bayes
algorithm and this is why it’s called “naive”.
• Although based on a relatively simple idea, Naive Bayes can often outperform other more sophisticated algorithms and
is very useful in common applications like spam detection and document classification.
• In a nutshell, the algorithm aims at predicting a class (outcome variable), given a set of features, using probabilities. So
in a fruit example, we could predict whether a fruit is an apple, orange or banana (class) based on its color, shape, etc.
(features).
• Advantages
• It is relatively simple to understand and build
• It is easily trained, even with a small dataset
• It is not sensitive to irrelevant attributes
• Disadvantages
• It assumes every attribute is independent, which is not always the case
Probabilities using Bayes’s rule
• Famous rule from probability theory thanks to Thomas Bayes
• Probability of an event H given observed evidence E:
P(H | E) = P(E | H)P(H) / P(E)
• H = class variable; E=instance
• A priori probability of H : P(H )
• Probability of event before, prior to, evidence is seen
• E.g. before tossing a coin: the probability of 'heads' is 50%, given that the coin has two similar sides…
We know this a priori.
• A posteriori probability of H : P(H | E)
• Probability of event after evidence is seen
• E.g. tossing a coin 1000 times , with 550 times ‘heads’. So now, with this evidence, the probability of
tossing heads = 550/1000. This is the a posteriori probability.
Naïve Bayes
• So suppose we are presented with new data: we are only given the features of a piece of fruit
and we need to predict the class, i.e. fruit type. If we are told that the additional fruit is Long,
Sweet and Yellow, we can classify it using the following formula and the facts from the table
above.
• P(H|E) = P(E|H)*P(H) / P(E)
• So H = the values of the class = Banana, Orange or Other
• So E = the evidence presented = 3 attributes: Long, Sweet and Yellow
• P(E) is not known, but it is the same for each fruit class, so it can be ignored for the comparison.
Based on the table, we can conclude that the fruit is most likely a Banana (0.252 > 0.01875).
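The frequency table itself did not survive in this text. In the commonly used version of this example the counts give P(Long|Banana) = 0.8, P(Sweet|Banana) = 0.7, P(Yellow|Banana) = 0.9 and P(Banana) = 0.5 (an assumption here, but consistent with the numbers quoted above), so:
P(\text{Banana} \mid E) \propto 0.8 \times 0.7 \times 0.9 \times 0.5 = 0.252
P(\text{Other} \mid E) \propto 0.5 \times 0.75 \times 0.25 \times 0.2 = 0.01875
P(\text{Orange} \mid E) \propto 0 \quad (\text{no Long oranges in the assumed training data})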
Naïve Bayes: Probabilities for the weather.nominal data
Counts and relative frequencies per class (Play = Yes: 9 instances, No: 5 instances):
Outlook:     Sunny 2/9 | 3/5;  Overcast 4/9 | 0/5;  Rainy 3/9 | 2/5
Temperature: Hot 2/9 | 2/5;    Mild 4/9 | 2/5;      Cool 3/9 | 1/5
Humidity:    High 3/9 | 4/5;   Normal 6/9 | 1/5
Windy:       False 6/9 | 2/5;  True 3/9 | 3/5
Play:        Yes 9/14;         No 5/14
Just counting: e.g. "Sunny" occurred 2 times within the 9 "yes" instances (first instances of the dataset: Sunny, Hot, High, False, No; Sunny, Hot, High, True, No).
For a new day E (Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True):
P(yes | E) = (2/9 x 3/9 x 3/9 x 3/9 x 9/14) / P(E)
Naïve Bayes: Probabilities for the weather.nominal data (continued; same frequency table as above)
• What if an attribute value does not occur with every class value?
(e.g., "Outlook = Overcast" for class "no")
• The probability will be zero: e.g. P(Outlook = Overcast | No) = 0
• The a posteriori probability will also be zero: P(No | E) = 0,
regardless of how likely the other values are!
• So a zero frequency dominates the entire probability calculation.
• Remedy: add 1 to the count for every attribute value-class combination
(the Laplace estimator)
• Result: probabilities will never be zero
• Additional advantage: stabilizes probability estimates computed from
small samples of data (where the likelihood of a zero is bigger)
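For instance, with 5 "no" instances and 3 possible Outlook values, the Laplace estimator turns the zero count into:
P(\text{Overcast} \mid \text{no}) = \frac{0 + 1}{5 + 3} = 0.125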
Numeric (Continuous) attributes
• In the previous examples, all attributes were discrete !
• What if certain attributes are numeric (e.g. Temperature in the Weather.numeric data)
• Usual assumption: attributes have a normal or Gaussian probability distribution (given the
class)
• The probability density function for the normal distribution is defined by two parameters:
• Sample mean
• Standard deviation
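Concretely, the density used is the normal density with sample mean \mu and standard deviation \sigma:
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}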
Classifying a new day
The standard output in Weka shows just the frequencies. Do you have to calculate the probabilities yourselves? No: choose "Output predictions" (PlainText) via the "More options…" button in the test panel.
Classification Algorithms
• Structure of the slides:
• Decision Trees (J48)
• PART
• ZeroR
• 1R (’one R’)
• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
• K-nearest neighbours (IBk) classifies an instance by the majority class among its k nearest instances, typically using the Euclidean distance. Note that taking the square root is not required when comparing distances.
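For reference, the Euclidean distance between instances p and q with n attributes is
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
and since the square root is monotonic, comparing the squared sums gives the same nearest neighbour.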
• Naïve Bayes
• K-nearest neighbours (Ibk in Weka)
• Neural Network (Multilayer Perceptron in Weka)
[Figure: a small neural network with input nodes and nodes i, j, k connected by weighted links w1i, w2i, w3j, wjk, wik, …]
• The "nodes" are neurons or perceptrons and the links between them are the "dendrites". The neurons perform a number of operations and pass the result to the next layer.
This calculation is a linear combination. Now what does a value of 8 mean? We need to define a threshold value. The neural network's
output, 0 or 1 (stay home or go to work), is determined by whether the value of the linear combination is greater than the threshold value.
Suppose the threshold value is 5: if the calculation gives you less than 5, you can stay at home, but if it is equal to
or more than 5, you need to go to work.
Source: Towardsdatascience.com
Neural Network
• So using weights, the NN algorithm knows which information will be most important in making its
decision. A higher weight means the neural network considers that input more important compared to
other inputs.
• The weights are set to random values, and then the network adjusts those weights based on the output
errors it made using the previous weights. This is called training the neural network.
• So the NN algorithm takes in inputs and applies a linear combination using a weight vector. The result is
compared to a threshold value, leading to an output, 1 or 0. This predicted output is compared to the
observed (real) output, so the NN can learn:
Error = Observed (actual) - Predicted
1. Each neuron Ni first computes the weighted sum of its inputs:
in_i = sum over k in Inputs(Ni) of w_ki * a_k
where Inputs(Ni) corresponds to all neurons that provide an input to Ni, and a_k is the input value (the activation of neuron Nk).
2. This sum, called the input function, is then passed on to Ni's activation function, giving Ni's activation:
a_i = g_i(in_i)
Typically, the activation functions of all neurons in the NN are the same, so we just write g
instead of g_i. Example of a sigmoid function: g(x) = 1 / (1 + e^(-x))
Thus, every neuron in a NN takes in the activation of all its inputs and provides its own activation as an output.
NN: Backpropagation & Gradient Descent
• It is the job of the training process of a NN to ensure that the weights, wij, given by each neuron to each
of its inputs is set right, so that the entire NN gives an optimal outcome. Backpropagation is one of the
ways to optimize those weights.
• The error function defines how much the output of the model differs from the required (observed)
output. Typically, a mean-squared error function is used for this.
• It is important to know that each training instance will result in a different error value. It is the
backpropagation’s goal to minimize the error for all training instances on average. Backpropagation tries
to minimize or reduce the error function, by going backwards through the network, layer by layer. Then it
uses the gradient descent to optimize or fine-tune the weights.
• So the gradient descent is a technique used to fine tune the weights.
• Based on Chapter 8, 8.8 & 8.9 in Witten et al. (2017, 4th edition)
• Finding the smallest attribute (feature) set that is crucial in predicting the class is an important issue, certainly if
there are many attributes. Although all attributes are considered when building the model, not all are equally
important (i.e. have discriminatory capacity). Some might even distort the quality of the model.
• So, feature or attribute selection is used to set aside a subset of attributes that really matter and have added
value in creating the classifier.
• A typical example is an insurance company that has a huge data warehouse with historical data about its
clients, with many attributes. The company wants to predict the risk of a new client (from its perspective) and
get insight into which attributes really determine the risk. E.g. age, gender, and marital status might be the key
factors to predict the risk of future car accidents of a client, not so much the obtained college degree or financial
status.
• Feature selection is different from dimensionality reduction (e.g. Factor Analysis). Both methods seek to reduce
the number of attributes in the dataset, but a dimensionality reduction method does so by creating new
combinations of attributes, whereas feature selection methods include and exclude attributes present in the data
without changing them.
• Easier to interpret
• Structural knowledge
• Knowing which attributes are important may be inherently important to the application and interpretation of the classifier
• Note that irrelevant attributes can be harmful if they mislead the learning algorithm
• For example, adding a random (i.e., irrelevant) attribute can significantly degrade J48's performance
• A Feature selection approach combines an Attribute Evaluator and a Search Method. Each
section, evaluation and search, has multiple techniques from which to choose.
• The attribute evaluator is the technique by which each attribute in a dataset is evaluated in the
context of the output variable (e.g. the class). The search method is the technique by which to
navigate different combinations of attributes in the dataset in order to arrive on a short list of
chosen features.
• As an example, calculating correlation scores for each attribute (with the class variable), is only
done by the attribute evaluator. Assigning a rank to an attribute and listing the attribute in the
ranked order is done by a search method, enabling the selection of features.
• For some attribute evaluators, only certain search methods are compatible
• E.g. the CorrelationAttributeEval attribute evaluator in Weka can only be used with a Ranker Search Method,
that lists the attributes in a ranked order.
• Using correlations for selecting the most relevant attributes in a dataset is quite popular. The idea is
simple: calculate the correlation between each attribute and the output variable and select only those
attributes that have a moderate-to-high positive or negative correlation (close to -1 or 1) and drop those
attributes with a low correlation (value close to zero). E.g. the CorrelationAttributeEval.
• As mentioned before: the CorrelationAttributeEval technique requires the use of a Ranker search
method.
• Another popular feature selection technique is to calculate the information gain. You can calculate the
information gain (see entropy) for each attribute with respect to the output variable. Values vary from 0 (no
information) to 1 (maximum information). Those attributes that contribute more information will have a
higher information gain value and can be selected, whereas those that do not add much information will
have a lower score and can be removed.
• Try the InfoGainAttributeEval Attribute Evaluator in Weka. Like the correlation technique above, the
Ranker Search Method must be used with this evaluator.
• Open glass.arff, go to “select attributes”, use the CorrelationAttributeEval evaluator (correlation of each
attribute with the class variable). “Ranker” is now mandatory as a search method. Use the full training set
(we are not evaluating a classifier).
• Open glass.arff, use the InfoGainAttributeEval evaluator (information gain of each attribute). “Ranker” is
again mandatory as a search method.
• A different and popular feature selection technique is to use a generic but powerful learning algorithm (e.g.
classifier) and evaluate the performance of the algorithm on the dataset with different subsets of attributes. (So
no real a priori evaluation of the attributes; no filter approach. )
• The subset that results in the best performance is taken as the selected subset. The algorithm used to evaluate
the subsets does not have to be the algorithm that is intended to be used for the classifier, but it should be
generally quick to train and powerful, like a decision tree method.
• So if the target algorithm is Naïve Bayes, a different algorithm could be chosen to select a subset of
attributes..
• This is a scheme-dependent approach because the target scheme, the actual classifier you want to develop, is in
the loop.
• In Weka this type of feature selection is supported by the WrapperSubsetEval technique and must use a
GreedyStepwise or BestFirst Search Method (cf. further). The latter, BestFirst, is preferred but requires more
compute time.
• "Wrap around" the learning algorithm
• Must therefore always evaluate subsets
• Return the best subset of attributes
• Apply for each learning algorithm considered
[Loop: select a subset of attributes -> induce the learning algorithm on this subset -> evaluate the resulting model (e.g., accuracy) -> stop? if no, repeat; if yes, return the best subset]
• The available attributes in a dataset are often referred to as the ‘attribute space’ (in which to find a subset that
best predicts the class)
• When searching an attribute subset, the number of subsets is exponential in the number of attributes
• Common greedy approaches:
• forward selection
• backward elimination
• Recursive feature elimination
A greedy algorithm is any algorithm that simply picks the best choice it sees at the time and takes it.
• Forward Selection: an iterative method that starts with no features in the model. In
each iteration, the method adds the feature that best improves the model, until adding a new
attribute no longer improves the performance of the model (such as a classifier). Because the accuracy of
the classifier is evaluated with all the features in a set, this method will pick out features which work well together
for classification. Features are not assumed to be independent, so advantages may be gained from looking
at their combined effect.
• Backward Elimination: the model starts with all the features and removes, at each iteration, the least
significant feature, as long as this improves the performance of the model. This is repeated until no
improvement is observed from removing features.
• Recursive Feature elimination: It is a greedy optimization algorithm which aims to find the best performing
feature subset. It repeatedly creates models and keeps aside the best or the worst performing feature at each
iteration. It constructs the next model with the remaining features until all the features are exhausted. It then
ranks the features based on the order of their elimination.
• The attribute space can be further traversed by increasing the searchTermination parameter in Weka (if
> 1, the attribute space is investigated further, increasing the likelihood of finding a more global, less
local optimum)
• Example:
• Open glass.arff; choose the attribute evaluator WrapperSubsetEval in ‘Select attributes’, select J48, “use full training set”
• Get the attribute subset: RI, Mg, Al, K, Ba: “merit” 0.73 (=accuracy %)
• But the "merit", the best subset, is the same in all 3 runs… Searching the attribute space further did not lead to a higher optimum.
• Certainly select: RI (appeared in 9 folds), Al (appeared in 10 folds!), Ba and Mg, and maybe also Na
• Typically, you do not know a priori which view, or subset of features, in your data will produce the most
accurate models.
• Therefore, it is a good idea to try a number of different feature selection techniques on your data, to
create several views on your data (several subsets of features)
Filter method
Wrapper method
Confidence(X => Y) = supp(X ∪ Y) / supp(X)
Meaning: how many times is the rule correct? (= reliability). Weakness of confidence: if X and/or Y have a high
support, the confidence is by definition also high.
Lift(X => Y) = supp(X ∪ Y) / (supp(X) x supp(Y))
The Lift tells us how much our confidence has increased that Y will be purchased given that X was purchased. It
shows how effective the rule is in finding Y, as compared to finding Y randomly.
The supermarket example
Transaction ID | milk | bread | butter | beer
1  | 1 | 1 | 0 | 0
2  | 0 | 1 | 1 | 0
3  | 0 | 0 | 0 | 1
4  | 1 | 1 | 1 | 0
5  | 0 | 1 | 0 | 0
6  | 1 | 0 | 0 | 0
7  | 0 | 1 | 1 | 1
8  | 1 | 1 | 1 | 1
9  | 0 | 1 | 0 | 1
10 | 1 | 1 | 0 | 0
11 | 1 | 0 | 0 | 0
12 | 0 | 0 | 0 | 1
13 | 1 | 1 | 1 | 0
14 | 1 | 0 | 1 | 0
15 | 1 | 1 | 1 | 1
In the example database, the itemset {milk, bread, butter} has a support of 4/15 = 0.26, since it occurs in 26% of all transactions.
For the rule {milk, bread} => {butter} we have the following confidence:
supp({milk, bread, butter}) / supp({milk, bread}) = 0.26 / 0.4 = 0.65. This means that for 65% of the transactions containing milk and bread the rule is correct.
The rule {milk, bread} => {butter} has the following lift:
supp({milk, bread, butter}) / (supp({butter}) x supp({milk, bread})) = 0.26 / (0.46 x 0.4) = 1.4
Association Rules
General Process
Association rule generation is usually split up into two separate steps:
First, minimum support is applied to find all frequent itemsets in a database.
Second, these frequent itemsets and the minimum confidence constraint are used to form
rules.
So: Generate high-support item sets, get several rules from each
Strategy: iteratively reduce the minimum support until the required number of
rules is found with a given minimum confidence
Example in weather.nominal:
Generate item sets with support 14 (none)
Find rules in these item sets with minimum confidence level, 90% in Weka
Continue with item sets with support 13 (none)
And so on , until you have a sufficient number of rules
The Apriori Algorithm
The Weather data has 336 rules with confidence 100%, but only 8 have support
≥ 3, only 58 have support ≥ 2
In Weka: specify minimum confidence level (minMetric, default 90%), number of
rules sought (numRules, default 10)
Apriori makes multiple passes through the data
It generates 1-item sets, 2-item sets, ... with more than minimum support– turns each one into
(many) rules and checks their confidence
It starts at upperBoundMinSupport (usually left at 100%) and decreases by delta at each
iteration (default 5%). It stops when numRules is reached... or at lowerBoundMinSupport
(default 10%)
The Apriori algorithm applied (weather.nominal)
Market Basket Analysis
Missing values are used to indicate that the basket did not contain that item
[to be completed]
2. Clustering
With clustering, there is, again, no “class” attribute (so unsupervised learning)
Try to divide the instances into natural, homogenous groups, or “clusters”
It is hard to evaluate the quality of a clustering solution
in Weka: SimpleKMeans (+XMeans), EM, Cobweb
Examples of clusters:
Customer segmentation: divide customers in homogenous groups, based on numerous attributes (age, degree, social background,
average sales, number of visits, types of products bought, region, ….)
Why?
To have a better understanding,
To target marketing or promotion actions
To focus on very ‘good’ or very ‘infrequent’ customers
….
Student clustering
Clustering of prisoners
Course clustering
Clustering of schools, companies, cars, ….
Based on symptoms, patients can be clustered in order to determine an appropriate treatment
Sometimes, the target population or observations can be subdivided in groups, top-down, by argument, using specific criteria.
This is not clustering.
The goal of clustering is to start from a dataset with numerous attributes and to build up homogenous groups using similarity
measures (such as the Euclidean distance). So it is bottom-up, can be applied to a larger dataset , with many different attributes.
Clustering
• …
Typical features used for customer segmentation
Demographic
Age, gender, income, education, marital status, kids, …
Lifestyle
Vehicle ownership, Internet use, travel, pets, hobbies, …
Attitudinal
Product preferences, price sensitivity, willingness to try other brands, …
Behavioral
Products bought, prices paid, use of cash or credit, RFM, …
Acquisitional
Marketing channel, promotion type, …
1. Specify k, the desired number of clusters (often very difficult: how many groups do
you want? Not too many, not too few)
2. Choose k points at random as cluster centers
3. Assign all instances to their closest cluster center
4. Calculate the centroid (i.e., mean) of instances in each cluster
5. These centroids are the new cluster centers
6. Re-assign instances to these cluster centers
7. Re-calculate the centroids
8. Continue until the cluster centers don’t change
Minimizes the total squared distance from instances to their cluster centers.
In Weka, the Euclidean distance or the Manhattan distance can be used as a similarity
function, to compute the distance between instances and centers.
Clustering: K-means
[Figure: two initial cluster centers chosen among the instances]
- Calculate the distance for each instance to the new centers, and assign
the instance to the closest cluster (so instances could change, be re-
assigned to another cluster)
4) Iterative process : repeat the previous step (calculating the center, re-assigning
instances) until instances do not change anymore = convergence
The K-means Algorithm
Point  X  Y
A      3  7
B      6  1
C      5  8
D      1  0
E      7  6
F      4  5
(shown as a scatter chart)
K-means: calculations
The Euclidean distance between two instances p and q:
d(p, q) = sqrt( sum over i = 1..n of (p_i - q_i)² )
Euclidean distances between the six instances:
Case   A      B      C      D      E      F
A      0.000  6.708  2.236  7.280  4.123  2.236
B      6.708  0.000  7.071  5.099  5.099  4.472
C      2.236  7.071  0.000  8.944  2.828  3.162
D      7.280  5.099  8.944  0.000  8.485  5.831
E      4.123  5.099  2.828  8.485  0.000  3.162
F      2.236  4.472  3.162  5.831  3.162  0.000
Step 1: We opt for the "Farthest First" option and choose as initial cluster centers the two instances that are farthest apart: C and D (distance 8.944).
K-means: calculations (continued)
Step 2: For each instance, compute the Euclidean distance to each cluster center (C and D), e.g. for instance A(3, 7) to center C(5, 8):
sqrt( (3 - 5)² + (7 - 8)² ) = sqrt(5) = 2.236
(see the distance matrix above)
Assign each instance to the closest center: cluster C1 (center C) = {A, C, E, F}; cluster C2 (center D) = {B, D}.
Step 3: Calculate the new centroids: CC1 = mean of {A, C, E, F} = (4.75, 6.5); CC2 = mean of {B, D} = (3.5, 0.5). These centroids become the new cluster centers, and instances are re-assigned.
In Weka:
• Open weather.numeric.arff
• Cluster panel; choose SimpleKMeans
• Note the parameters: numClusters, distanceFunction, seed (default 10)
• Result: two clusters, 9 and 5 members, total squared error 16.2
{1/no, 2/no, 3/yes, 4/yes, 5/yes, 8/no, 9/yes, 10/yes, 13/yes}
{6/no, 7/yes, 11/yes, 12/yes, 14/no}
• Set seed to 11: two clusters, 6 and 8 members, total squared error 13.6
• Set seed to 12: total squared error 17.3
Evaluating Clusters
Now we know the size and the characteristics of the clusters
How good is our clustering?
Visualizing Clusters:
Open the Iris.arff data, apply SimpleKMeans, specify 3 clusters
3 clusters with 50 instances each
Visualize cluster assignments (right-click menu in Result List)
Plot clusters (x-axis) against the instance numbers: the denser the clusters, the more cohesive, the
better the quality
Which instances does a cluster contain?
Use the AddCluster unsupervised attribute filter (in the Preprocess tab !)
Try with SimpleKMeans (within the filter); Apply and click Edit
What about the class variable?
Also apply “visualize cluster assignments”, clusters on the X, class variable on the Y. There are yes
and no’s in both clusters: so no perfect match between clusters and class values
Try the “Ignore attribute” button, ignore the class attribute; run again with 3 clusters, now: 61, 50, 39
instances
Visualizing Clusters (continued)
With all attributes: very dense, balanced clusters. Leaving out the class variable: more distance within the clusters, less balanced (but still acceptable).
Classes-to-clusters evaluation
• In the Iris data: SimpleKMeans, specify 3 clusters
• Classes-to-clusters evaluation = using clustering in a supervised way…
• Classes are assigned to clusters; can the clusters predict the class values?
• Now you have a confusion (classification) matrix and an accuracy! (100% - 11% = 89%)