Data Mining Part 1: Ing. Ridho Rahmadi, M.SC

Data Mining Part 1
Ing. Ridho Rahmadi, M.Sc
Magister Teknik Informatika

Universitas Islam Indonesia
ridho.rahmadi@uii.ac.id
November 21, 2017
Ing. Ridho Rahmadi, M.Sc SPK-BI November 21, 2017 1 / 39

Acknowledgement
All contents are from materials and book Introduction to Data Mining by
Tan, Steinbach, Kumar.

Classification
Given a collection of records (called the training set):

1 Each record contains a set of attributes
2 One of the attributes is the class
Gender  Height  Temperature  Fever
F       170     35.4         No
M       165     36.3         No
M       175     39.0         Yes
Task: find a model for the class attribute as a function of the values of
the other attributes.

Classification cont’d
Gender  Height  Temperature  Fever
M       180     35.4         ?
F       155     35.5         ?
M       163     37.3         ?
Goal: previously unseen records should be assigned a class as accurately as
possible.


Examples of classification tasks:

1 Predicting whether tumor cells are benign or malignant
2 Classifying credit card transactions as legitimate or fraudulent
3 Categorizing news stories as finance, weather, sports, etc.

Classification techniques:
1 Decision tree based methods
2 Rule based methods
3 Support vector machine
4 and many more

Decision tree

Decision tree cont’d
There can be more than one tree that fits the same data set!
Applying model to the test data







Learn model

Decision tree algorithms
Finding the best decision tree is NP-hard

Greedy strategy: split records based on an attribute test that
optimizes a criterion
Decision tree algorithms:
1 Hunt’s algorithm
2 ID3, C4.5
3 CART
4 SLIQ, SPRINT

Hunt’s algorithm
Let D_t be the set of training records that reach node t

1 If D_t contains records that all belong to the same class y_t,
then t is a leaf node labeled as y_t
2 If D_t contains records belonging to more than one class,
A) use an attribute test to split the data into smaller subsets and
B) recursively apply the procedure to each subset.
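
The two cases above can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure exactly as given in the book: records are assumed to be (attribute dict, label) pairs, the split attribute is simply the first one that separates the records (a real implementation would choose it with an impurity measure), the majority-class fallback is an addition of this sketch, and the names `hunt`, `majority`, and the `HighTemp` attribute are hypothetical.

```python
from collections import Counter, defaultdict

def majority(records):
    """Most common class label among (attributes, label) records."""
    return Counter(label for _, label in records).most_common(1)[0][0]

def hunt(records, attributes):
    """Return a leaf label, or a dict {"attr": name, "children": {value: subtree}}."""
    labels = {label for _, label in records}
    # Case 1: every record in D_t has the same class -> leaf with that class.
    if len(labels) == 1:
        return labels.pop()
    # Case 2: mixed classes -> split on an attribute test and recurse.
    for attr in attributes:
        partitions = defaultdict(list)
        for row, label in records:
            partitions[row[attr]].append((row, label))
        if len(partitions) > 1:  # the test actually separates the records
            remaining = [a for a in attributes if a != attr]
            return {
                "attr": attr,
                "children": {v: hunt(part, remaining) for v, part in partitions.items()},
            }
    # No attribute separates the records: fall back to the majority class
    # (an assumption of this sketch).
    return majority(records)

# Toy records; HighTemp is a hypothetical yes/no version of Temperature.
training = [
    ({"Gender": "F", "HighTemp": "no"}, "No"),
    ({"Gender": "M", "HighTemp": "no"}, "No"),
    ({"Gender": "M", "HighTemp": "yes"}, "Yes"),
]
tree = hunt(training, ["HighTemp", "Gender"])
print(tree)
```

Passing the attributes in a different order can produce a different tree that fits the same records, which is the point made earlier about more than one tree fitting a data set.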

An example

Issues in tree induction
1 Determine how to split the records
    How to specify the attribute test condition?
    How to determine the best attribute to split on?
2 Determine when to stop splitting

How to specify test condition?
1 Depends on attribute types
    Nominal
    Ordinal
    Continuous
2 Depends on number of ways to split
    2-way split
    Multi-way split

Splitting based on Nominal attributes
Multi-way split
Use as many partitions as distinct values.
Binary split
Divides values into two subsets; need to find optimal partitioning.
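
A nominal attribute with k distinct values can be divided into two non-empty subsets in 2^(k-1) − 1 ways, which is why the optimal partitioning has to be searched for. The sketch below enumerates those candidates; the function name `binary_partitions` and the example values are illustrative assumptions.

```python
from itertools import combinations

def binary_partitions(values):
    """Yield each way to divide `values` into two non-empty subsets, once."""
    values = sorted(set(values))
    first, rest = values[0], values[1:]
    # Keep `first` on the left side so mirrored partitions are not repeated.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(values) - left
            yield left, right

splits = list(binary_partitions(["Family", "Sports", "Luxury"]))
for left, right in splits:
    print(sorted(left), "|", sorted(right))
```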

Splitting based on Ordinal attributes
Multi-way split
Use as many partitions as distinct values.
Binary split
Divides values into two subsets, respects the order; need to find optimal
partitioning.
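
Respecting the order leaves only k − 1 candidate binary splits for an ordinal attribute with k values, because each split is a single cut point in the ordered list. A small sketch, assuming the values are passed in their natural order (the name `ordered_binary_splits` and the size values are hypothetical):

```python
def ordered_binary_splits(ordered_values):
    """Yield (low, high) splits that keep the given order intact."""
    for cut in range(1, len(ordered_values)):
        yield ordered_values[:cut], ordered_values[cut:]

sizes = ["Small", "Medium", "Large"]
ordinal_splits = list(ordered_binary_splits(sizes))
print(ordinal_splits)
```

A grouping such as {Small, Large} versus {Medium} never appears, since it would break the order.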

Splitting based on Continuous attributes
1. Discretization to form an ordinal categorical attribute
    Static: discretize once at the beginning
    Dynamic: ranges can be found by equal interval bucketing, equal
    frequency bucketing (percentiles), or clustering
2. Binary decision: (A < v) or (A ≥ v)
3. Consider all possible splits and find the best cut;
this can be computationally intensive
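
Two of the bucketing schemes named above can be sketched in plain Python. This is one simple way to pick cut points, not the only one; the function names are hypothetical, and the sample values are the six temperatures from the earlier tables.

```python
def equal_width_edges(values, bins):
    """Cut points dividing [min, max] into `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]

def equal_frequency_edges(values, bins):
    """Cut points chosen so each bucket holds roughly the same number of records."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bins] for i in range(1, bins)]

temperatures = [35.4, 36.3, 39.0, 35.4, 35.5, 37.3]
width_edges = equal_width_edges(temperatures, 2)
freq_edges = equal_frequency_edges(temperatures, 3)
print(width_edges, freq_edges)
```

Equal interval bucketing looks only at the range of the values, while equal frequency bucketing follows how the records are distributed, so the two can place cut points quite differently on skewed data.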

How to determine the best split
Suppose before splitting: 10 records of class 0, 10 records of class 1.
Which test condition is the best?

How to determine the best split
Greedy approach
Nodes with homogeneous class distribution are preferred
Need a measure of node impurity

Use measures of node impurity
1. GINI index
2. Entropy
3. Misclassification error

Measure of impurity: GINI index
GINI index for a given node t:

\[ GINI(t) = 1 - \sum_{c} [p(c \mid t)]^2 \]

where p(c|t) is the relative frequency of class c at node t.
Maximum (1 − 1/n_c, where n_c is the number of classes) when records are
equally distributed among all classes, implying highest impurity.
Minimum (0) when all records belong to one class, implying lowest
impurity.
Try to compute the GINI indices above...
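
A small helper makes it easy to check such computations. It is a sketch that takes the per-class record counts at a node; the counts used here are hypothetical, since the slide's figure is not reproduced in this text.

```python
def gini(counts):
    """GINI index of a node given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

pure = gini([0, 6])    # all records in one class
mixed = gini([3, 3])   # records evenly spread over two classes
print(pure, mixed)
```

The evenly mixed two-class node reaches the maximum 1 − 1/2, and the pure node reaches the minimum 0.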

Quality of a Split Based on GINI
Used in the decision tree algorithms CART, SLIQ, SPRINT.
When a node p is split into k partitions (children nodes), the quality of
the split is computed as:

\[ GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, GINI(i) \]

where n_i = number of records at child node i, and n = number of records at
node p.

Computing GINI index for binary attribute
\[ GINI(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.41 \]
\[ GINI(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.32 \]
\[ GINI(Children) = 7/12 \times 0.41 + 5/12 \times 0.32 = 0.37 \]
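
The arithmetic above can be reproduced with a short sketch. The class counts (5, 2) for N1 and (1, 4) for N2 are read off the fractions in the computation; the function names are hypothetical.

```python
def gini(counts):
    """GINI index of a node given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted GINI of a split; `children` is a list of per-class count lists."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

n1, n2 = [5, 2], [1, 4]
print(round(gini(n1), 2), round(gini(n2), 2), round(gini_split([n1, n2]), 2))
# prints: 0.41 0.32 0.37
```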

Computing GINI index for categorical attribute
Compute the GINI indices for each of above nodes!

Binary split selection: Continuous attributes
For each attribute a

Sort the values of a occurring in the records at that node
Scan these values, each time updating the count matrix and
computing the GINI index of that split
Choose the split position that has the smallest GINI index
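
The sort-and-scan procedure above can be sketched for a single attribute as follows. Two choices are assumptions of this sketch rather than part of the slide: candidate thresholds are taken as midpoints between adjacent distinct values, and the count matrix is kept as two `Counter` objects. The data are the three training records from the first table (Temperature and Fever).

```python
from collections import Counter

def gini(counts):
    """GINI index of a node given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_split(values, labels):
    """Return (threshold v, weighted GINI) of the best test A < v."""
    pairs = sorted(zip(values, labels))        # sort once
    left, right = Counter(), Counter(labels)   # class counts on each side
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        # Update the count matrix: one record moves from right to left.
        moved = pairs[i - 1][1]
        left[moved] += 1
        right[moved] -= 1
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # equal values cannot be separated by a threshold
        v = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint candidate (assumption)
        score = (i / n) * gini(list(left.values())) \
            + ((n - i) / n) * gini(list(right.values()))
        if score < best[1]:
            best = (v, score)
    return best

threshold, score = best_split([35.4, 36.3, 39.0], ["No", "No", "Yes"])
print(threshold, score)
```

Because the counts are updated incrementally, the scan after sorting touches each record once instead of recounting both sides for every candidate threshold.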

Measure of Impurity: Entropy
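Entropy at a given node t:

\[ Entropy(t) = -\sum_{c} p(c \mid t) \log p(c \mid t) \]

where p(c | t) is the relative frequency of class c at node t.
Measures homogeneity of a node: the lower the entropy, the more
homogeneous the node.

Maximum (log n_c) when records are equally distributed among all classes,
implying highest impurity.
Minimum (0) when all records belong to one class, implying lowest
impurity.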
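
A sketch of the entropy computation from per-class record counts. Base-2 logarithms are an assumption here (the slide writes only "log"), and terms with zero records are skipped, following the usual convention that 0 log 0 = 0. The counts are hypothetical.

```python
import math

def entropy(counts):
    """Entropy (base-2 log) of a node given its per-class record counts."""
    n = sum(counts)
    # Classes with zero records contribute nothing (0 log 0 is taken as 0).
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

pure_entropy = entropy([6, 0])    # all records in one class
mixed_entropy = entropy([3, 3])   # records evenly spread over two classes
print(pure_entropy, mixed_entropy)
```

With base 2, the evenly mixed two-class node gives log 2 = 1, the maximum for two classes.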

Quality of a Split Using Entropy: Information Gain
Information gain:

\[ GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} \, Entropy(i) \]

where parent node p is split into k partitions; n_i is the number of records
in partition i, and n is the number of records at node p.
Measures the reduction in entropy achieved when the split is used to
partition the values of attribute a
Choose the split with maximum GAIN
Used in ID3 and C4.5
Disadvantage: tends to prefer splits that result in large number of
partitions, each being small but pure
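
The bias toward many small partitions can be seen in a short sketch. The parent has 10 records of each class; the two candidate splits and their counts are hypothetical, and base-2 logarithms are assumed.

```python
import math

def entropy(counts):
    """Entropy (base-2 log) of a node given its per-class record counts."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of its children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [10, 10]
two_way = [[8, 2], [2, 8]]                  # two fairly clean partitions
many_way = [[1, 0]] * 10 + [[0, 1]] * 10    # one record per partition, all pure

gain_two = information_gain(parent, two_way)
gain_many = information_gain(parent, many_way)
print(round(gain_two, 3), round(gain_many, 3))
```

The 20-way split scores the maximum possible gain even though partitions of a single record are unlikely to generalize, for example when splitting on an ID-like attribute.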

Measure of Impurity: Classification Error
Classification error at a given node t:

\[ Error(t) = 1 - \max_{c} p(c \mid t) \]

where p(c | t) is the relative frequency of class c at node t.
Measures misclassification error made by a node.

Maximum (1 − 1/n_c) when records are equally distributed among all
classes, implying highest impurity.
Minimum (0) when all records belong to one class, implying lowest
impurity.
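
A sketch of the same computation from per-class record counts; the function name and the counts are hypothetical.

```python
def classification_error(counts):
    """Misclassification error of a node that predicts its majority class."""
    n = sum(counts)
    return 1.0 - max(counts) / n

pure_error = classification_error([6, 0])
mixed_error = classification_error([3, 3])
skewed_error = classification_error([1, 5])
print(pure_error, mixed_error, round(skewed_error, 3))
```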

Examples

Comparison among splitting criteria
For a 2-class problem
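
For two classes, each measure can be written as a function of p, the fraction of records in one class. The sketch below tabulates all three; base-2 logarithms are assumed for entropy, and the function names are hypothetical.

```python
import math

def gini(p):
    return 1 - p ** 2 - (1 - p) ** 2

def entropy(p):
    # Terms with zero probability are skipped (0 log 0 is taken as 0).
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

def error(p):
    return 1 - max(p, 1 - p)

rows = [(p / 10, gini(p / 10), entropy(p / 10), error(p / 10)) for p in range(11)]
print("p    GINI   Entropy  Error")
for p, g, h, e in rows:
    print(f"{p:.1f}  {g:.3f}  {h:.3f}    {e:.3f}")
```

All three measures are 0 at p = 0 and p = 1 and peak at p = 0.5, where entropy reaches 1 and both GINI and classification error reach 0.5.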

Any questions?
