Data Mining Part 1

Ing. Ridho Rahmadi, M.Sc

Magister Teknik Informatika


Universitas Islam Indonesia
ridho.rahmadi@uii.ac.id

November 21, 2017

Ing. Ridho Rahmadi, M.Sc SPK-BI November 21, 2017 1 / 39


Acknowledgement

All content is drawn from the book Introduction to Data Mining by Tan,
Steinbach, and Kumar, and from its accompanying materials.


Classification

Given a collection of records (the training set):

1 each record contains a set of attributes
2 one of the attributes is the class

Gender  Height  Temperature  Fever
F       170     35.4         No
M       165     36.3         No
M       175     39.0         Yes

Task: find a model for the class attribute as a function of the values of the
other attributes.


Classification cont’d

Gender  Height  Temperature  Fever
M       180     35.4         ?
F       155     35.5         ?
M       163     37.3         ?

Goal: previously unseen records should be assigned a class as accurately as
possible.



Classification cont’d

Examples of classification tasks:

1 Predicting whether tumor cells are benign or malignant
2 Classifying credit card transactions as legitimate or fraudulent
3 Categorizing news stories as finance, weather, sports, etc.


Classification cont’d

Classification techniques:
1 Decision tree based methods
2 Rule based methods
3 Support vector machines
4 and many more


Decision tree



Decision tree cont’d

There can be more than one tree that fits the same data set!


Applying model to the test data



Learn model



Decision tree algorithms

Finding the optimal decision tree is NP-hard

Greedy strategy: split the records on the attribute test that optimizes a
chosen criterion

Decision tree algorithms:
1 Hunt’s algorithm
2 ID3, C4.5
3 CART
4 SLIQ, SPRINT


Hunt’s algorithm

Let Dt be the set of training records that reach node t

1 If all records in Dt belong to the same class yt,
  then t is a leaf node labeled yt
2 If Dt contains records belonging to more than one class,
  A) use an attribute test to split the data into smaller subsets and
  B) recursively apply the procedure to each subset.
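
The two cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data, the attribute names, and the dictionary-based tree layout are hypothetical, and the attribute to split on is simply the first one available rather than the one that optimizes an impurity criterion.

```python
from collections import Counter

def hunt(records, attributes, target):
    """Grow a tree with Hunt's procedure.

    records: list of dicts; attributes: names still available for testing;
    target: name of the class attribute.
    """
    classes = [r[target] for r in records]
    majority = Counter(classes).most_common(1)[0][0]

    # Case 1: every record has the same class, so the node is a leaf.
    if len(set(classes)) == 1:
        return {"leaf": classes[0]}

    # Assumption: with no attribute left to test, fall back to the majority class.
    if not attributes:
        return {"leaf": majority}

    # Case 2: split on an attribute and recurse on each subset.
    # Assumption: take the first attribute; a real learner would choose the
    # attribute test that optimizes a criterion such as GINI or entropy.
    attr, rest = attributes[0], attributes[1:]
    children = {}
    for value in sorted({r[attr] for r in records}):
        subset = [r for r in records if r[attr] == value]
        children[value] = hunt(subset, rest, target)
    return {"split_on": attr, "children": children, "default": majority}

def predict(tree, record):
    """Walk the tree from the root until a leaf is reached."""
    while "leaf" not in tree:
        child = tree["children"].get(record[tree["split_on"]])
        if child is None:  # attribute value never seen during training
            return tree["default"]
        tree = child
    return tree["leaf"]

# Hypothetical toy data, not taken from the slides.
train = [
    {"refund": "yes", "married": "no", "cheat": "no"},
    {"refund": "no", "married": "yes", "cheat": "no"},
    {"refund": "no", "married": "no", "cheat": "yes"},
]
tree = hunt(train, ["refund", "married"], "cheat")
label = predict(tree, {"refund": "no", "married": "no"})
```

The recursion stops as soon as a subset is pure, which mirrors step 1; the fallback to the majority class is an added safeguard, not part of the procedure as stated.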



An example



Issues in tree induction

1 Determine how to split the records
    How to specify the attribute test condition?
    How to determine the best attribute to split on?
2 Determine when to stop splitting


How to specify test condition?

1 Depends on attribute types
    Nominal
    Ordinal
    Continuous
2 Depends on number of ways to split
    2-way split
    Multi-way split


Splitting based on Nominal attributes

Multi-way split
Use as many partitions as distinct values.

Binary split
Divides values into two subsets; need to find optimal partitioning.
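
To make the search space concrete, the sketch below enumerates both kinds of split for a nominal attribute. The attribute values are hypothetical. For k distinct values there is one multi-way split and 2^(k−1) − 1 candidate binary splits, which is why the optimal binary partitioning has to be searched for.

```python
from itertools import combinations

def binary_partitions(values):
    """Yield every way to divide distinct nominal values into two non-empty groups."""
    values = sorted(set(values))
    if len(values) < 2:
        return  # nothing to split
    first, rest = values[0], values[1:]
    # Pin the first value to the left group so mirror-image splits are not repeated.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(values) - left
            yield left, right

# Hypothetical attribute values, used only for illustration.
car_types = ["family", "sports", "luxury"]
multiway = [{v} for v in sorted(set(car_types))]   # one partition per distinct value
splits = list(binary_partitions(car_types))        # all candidate binary splits
```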



Splitting based on Ordinal attributes

Multi-way split
Use as many partitions as distinct values.

Binary split
Divides values into two subsets that respect the order; need to find the
optimal partitioning.
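
Because the order must be respected, a binary split of an ordinal attribute is just a cut point in the ordered list of values, so k values give only k − 1 candidates. A minimal sketch, assuming a hypothetical size scale:

```python
def ordinal_binary_splits(ordered_values):
    """Return the binary splits that keep an ordered list of values contiguous."""
    return [
        (ordered_values[:cut], ordered_values[cut:])
        for cut in range(1, len(ordered_values))
    ]

# Hypothetical ordinal scale, listed from smallest to largest.
sizes = ["small", "medium", "large", "extra large"]
splits = ordinal_binary_splits(sizes)
```

A grouping such as {small, large} versus {medium, extra large} never appears in the result, since it would break the order.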



Splitting based on Continuous attributes
1. Discretization to form an ordinal categorical attribute
   Static: discretize once at the beginning
   Dynamic: ranges can be found by equal interval bucketing, equal
   frequency bucketing (percentiles), or clustering
2. Binary decision: (A < v) or (A ≥ v)
3. Consider all possible splits and find the best cut;
   this can be computationally intensive
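
The two bucketing schemes named in item 1 can be sketched as follows. The function names and the sample values are assumptions made for illustration; the equal frequency version picks cut points by position in the sorted values, which is one simple way to approximate percentiles.

```python
def equal_width_edges(values, bins):
    """Cut points that divide [min, max] into `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * i for i in range(1, bins)]

def equal_frequency_edges(values, bins):
    """Cut points chosen so each interval holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // bins] for i in range(1, bins)]

def bucket(value, edges):
    """Index of the interval a value falls into (a value equal to an edge goes right)."""
    return sum(value >= edge for edge in edges)

# Hypothetical values, for illustration only.
heights = [150, 152, 155, 160, 161, 163, 170, 175, 180, 200]
width_edges = equal_width_edges(heights, 2)
freq_edges = equal_frequency_edges(heights, 2)
width_buckets = [bucket(h, width_edges) for h in heights]
freq_buckets = [bucket(h, freq_edges) for h in heights]
```

With skewed data the equal interval scheme can leave some buckets nearly empty, while the equal frequency scheme keeps the bucket sizes balanced.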



How to determine the best split

Suppose before splitting: 10 records of class 0, 10 records of class 1.

Which test condition is the best?



How to determine the best split

Greedy approach
Nodes with homogeneous class distribution are preferred

Need a measure of node impurity



Use measures of node impurity

1. GINI index
2. Entropy
3. Misclassification error



Measure of impurity: GINI index
GINI index for a given node t:

$$\mathrm{GINI}(t) = 1 - \sum_{c} [p(c \mid t)]^2$$

where p(c|t) is the relative frequency of class c at node t.

Maximum (1 − 1/nc), where nc is the number of classes, when records are
equally distributed among all classes, implying highest impurity.
Minimum (0) when all records belong to one class, implying lowest
impurity.
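
As a quick check of the definition and its two extremes, the GINI index of a node can be computed directly from the class labels of its records. The label lists below are hypothetical.

```python
from collections import Counter

def gini(labels):
    """GINI(t) = 1 - sum over classes of p(c|t)^2, for the labels at one node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# Hypothetical nodes.
pure = gini(["no"] * 6)                  # all records in one class
even = gini(["no"] * 3 + ["yes"] * 3)    # two classes, equally distributed
mixed = gini(["no"] * 5 + ["yes"] * 1)   # somewhere in between
```

For two classes the even node reaches the maximum 1 − 1/2 = 0.5, and the pure node gives the minimum 0.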

Try to compute the GINI indices above...


Quality of a Split Based on GINI

Used in the decision tree algorithms CART, SLIQ, SPRINT.

When a node p is split into k partitions (child nodes), the quality of the
split is computed as:

$$\mathrm{GINI}_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, \mathrm{GINI}(i)$$

where ni = number of records at child node i, and n = number of records at
node p.
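
The weighted sum can be sketched from per-child class counts. The counts below are hypothetical splits of a parent holding 10 records of each class, and every child is assumed to be non-empty.

```python
def gini_from_counts(counts):
    """GINI index of a node given its class counts, e.g. [8, 2]."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted GINI of a split; `children` holds one class-count list per child node."""
    n = sum(sum(counts) for counts in children)
    return sum(sum(counts) / n * gini_from_counts(counts) for counts in children)

# Hypothetical splits of a parent with 10 records per class.
parent = gini_from_counts([10, 10])
split_a = gini_split([[8, 2], [2, 8]])   # children are fairly pure
split_b = gini_split([[5, 5], [5, 5]])   # children are as mixed as the parent
```

A lower value means purer children, so split_a would be preferred over split_b here.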



Computing GINI index for binary attribute

$$\mathrm{GINI}(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.41$$

$$\mathrm{GINI}(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.32$$

$$\mathrm{GINI}(\text{Children}) = 7/12 \times 0.41 + 5/12 \times 0.32 = 0.37$$


Computing GINI index for categorical attribute

Compute the GINI index for each of the nodes above!


Binary split selection: Continuous attributes

For each attribute a:
  Sort the values of a occurring in the records at that node
  Scan these values, each time updating the count matrix and
  computing the GINI index of that split
  Choose the split position that has the smallest GINI index
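
The sort-and-scan procedure above might look like the sketch below for a single attribute. Two choices are assumptions rather than something the slides specify: the candidate split position is the midpoint between two adjacent distinct values, and the test has the form value < threshold. The example reuses the temperatures and fever labels from the training table.

```python
from collections import Counter

def gini_from_counts(counts, n):
    """GINI index from a Counter of class counts over n records."""
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_continuous_split(values, labels):
    """Return (threshold, weighted GINI) of the best test `value < threshold`."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left = Counter()                              # class counts for value < threshold
    right = Counter(label for _, label in pairs)  # class counts for value >= threshold
    best = (None, float("inf"))
    for i in range(1, n):
        # Update the count matrix: one record moves from the right side to the left.
        label = pairs[i - 1][1]
        left[label] += 1
        right[label] -= 1
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # no cut between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2   # assumed: midpoint
        score = (i / n) * gini_from_counts(left, i) \
            + ((n - i) / n) * gini_from_counts(right, n - i)
        if score < best[1]:
            best = (threshold, score)
    return best

# Temperature and Fever columns of the three training records.
temperature = [35.4, 36.3, 39.0]
fever = ["No", "No", "Yes"]
threshold, score = best_continuous_split(temperature, fever)
```

Updating the counts incrementally means each candidate costs constant work per class after the initial sort, instead of a fresh pass over all records.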



Measure of Impurity: Entropy

Entropy at a given node t:

$$\mathrm{Entropy}(t) = - \sum_{c} p(c \mid t) \log p(c \mid t)$$

where p(c | t) is the relative frequency of class c at node t.

Measures the impurity of a node: the lower the entropy, the more
homogeneous the node.

Maximum (log nc) when records are equally distributed among all classes,
implying highest impurity.
Minimum (0) when all records belong to one class, implying lowest
impurity.
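
A short sketch of the entropy computation follows. It assumes base-2 logarithms, so the result is in bits; the slide writes log without a base. The term p log2(1/p) is used in place of −p log2 p, which is the same quantity. The label lists are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of the class labels at one node, in bits (base-2 logarithm assumed)."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

# Hypothetical nodes.
pure = entropy(["a"] * 4)                 # one class only
two_even = entropy(["a", "a", "b", "b"])  # two classes, equally distributed
four_even = entropy(["a", "b", "c", "d"]) # four classes, equally distributed
```

With nc equally frequent classes the value equals log2 nc, so the four-class node scores twice as high as the two-class node.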



Quality of a Split Using Entropy: Information Gain

Information gain:

$$\mathrm{GAIN}_{split} = \mathrm{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n} \, \mathrm{Entropy}(i)$$

where parent node p is split into k partitions, ni is the number of records in
partition i, and n is the number of records at node p.

Measures the reduction in entropy achieved by splitting on attribute a
Choose the split with maximum GAIN
Used in ID3 and C4.5
Disadvantage: tends to prefer splits that result in a large number of
partitions, each being small but pure
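
The sketch below computes the gain of a multi-way split and illustrates the disadvantage noted above. The records are hypothetical, and base-2 logarithms are assumed. An attribute that is unique per record, such as an identifier, yields many tiny pure partitions and therefore the largest possible gain, even though it is useless for prediction.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy of the parent minus the weighted entropy of the partitions."""
    partitions = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        partitions[value].append(label)
    n = len(labels)
    weighted = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - weighted

# Hypothetical records.
labels = ["yes", "yes", "no", "no"]
gender = ["M", "F", "M", "F"]      # each partition is as mixed as the parent
record_id = [1, 2, 3, 4]           # one record per partition, all trivially pure

gain_gender = information_gain(gender, labels)
gain_id = information_gain(record_id, labels)
```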



Measure of Impurity: Classification Error

Classification error at a given node t:

$$\mathrm{Error}(t) = 1 - \max_{c} p(c \mid t)$$

where p(c | t) is the relative frequency of class c at node t.

Measures the misclassification error made by a node.

Maximum (1 − 1/nc) when records are equally distributed among all
classes, implying highest impurity.
Minimum (0) when all records belong to one class, implying lowest
impurity.
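
The classification error is the fraction of records a node would get wrong if it predicted its majority class. A minimal sketch with hypothetical label lists:

```python
from collections import Counter

def classification_error(labels):
    """Error(t) = 1 - max over classes of p(c|t), for the labels at one node."""
    n = len(labels)
    return 1.0 - max(Counter(labels).values()) / n

# Hypothetical nodes.
pure = classification_error(["no"] * 6)
even = classification_error(["no"] * 3 + ["yes"] * 3)
mixed = classification_error(["no"] * 5 + ["yes"] * 1)
```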



Examples



Comparison among splitting criteria
For a 2-class problem
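
For a 2-class problem the three measures can be tabulated as a function of p, the fraction of records in one class. The sketch below prints such a table; base-2 logarithms are assumed for the entropy, and the sampled values of p are arbitrary.

```python
from math import log2

def two_class_measures(p):
    """Return (GINI, entropy, classification error) when one class has fraction p."""
    q = 1.0 - p
    gini = 1.0 - p * p - q * q
    # Terms with zero probability contribute nothing to the entropy.
    entropy = sum(x * log2(1 / x) for x in (p, q) if x > 0)
    error = 1.0 - max(p, q)
    return gini, entropy, error

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    g, h, e = two_class_measures(p)
    print(f"p={p:.1f}  gini={g:.3f}  entropy={h:.3f}  error={e:.3f}")
```

All three measures are zero when p is 0 or 1 and reach their maximum at p = 0.5, where GINI and classification error equal 0.5 and entropy equals 1.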



Any questions?
