
Chapter 5:

Classification

TBS 2020-2021

Olfa Dridi & Afef Ben Brahim


Supervised learning
▪ Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
▪ Find a model for the class attribute as a function of the values of the other attributes.
▪ Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model.
  – Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Supervised learning process: two steps
◼ Learning (training): learn a model using the training data
◼ Testing: test the model using unseen test data to assess the model accuracy

    Accuracy = (Number of correct classifications) / (Total number of test cases)
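A minimal Python sketch of the testing step (the labels below are hypothetical, not from the slides) makes the accuracy formula concrete:

    # Accuracy = number of correct classifications / total number of test cases
    def accuracy(y_true, y_pred):
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
        return correct / len(y_true)

    y_true = ["yes", "no", "yes", "yes"]   # true class labels of the test set
    y_pred = ["yes", "no", "no", "yes"]    # labels predicted by the learned model
    print(accuracy(y_true, y_pred))        # 0.75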

What is Classification?
▪ Classification is the process of assigning new objects to predefined categories or classes
  • Given a set of labeled instances S = {(x1, y1), ..., (xn, yn)}: x is an instance and y is its class (a special attribute)
  • xi = (xi1, xi2, ..., xid): the d attributes that characterize the instance xi
  • The problem we want to solve: given a new sample x = u, we want to find the class to which this sample belongs.
Classification Problem
▪ Given a set of example records
  – Each record consists of
    • A set of attributes
    • A class label
▪ Build an accurate model for each class based on the set of attributes
▪ Use the model to classify future data for which the class labels are unknown
▪ Objective: good prediction for new observations given only the features (attributes)
Part I
K Nearest Neighbors

K-Nearest Neighbors (KNN)
▪ The KNN classifier is a simple algorithm that stores all available data and classifies new data into a particular class based on a similarity measure
▪ It is a supervised learning algorithm
▪ It is a non-parametric method used for classification
▪ Prediction for test data is made on the basis of its neighbors
▪ It predicts the target label by finding the nearest neighbor class; the closest class is identified using distance measures such as the Euclidean distance
K-Nearest Neighbor (KNN)
▪ A new observation is classified by a majority vote of its neighbors
▪ If K=1, the observation is simply assigned to the class of its single nearest neighbor
How does the KNN algorithm work?
▪ Requires three things
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
▪ To classify an unknown record:
  – Compute the distance to the other training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
How does the KNN algorithm work?

(a) 1-nearest neighbor    (b) 2-nearest neighbor    (c) 3-nearest neighbor

▪ The k nearest neighbors of a record x are the data points that have the k smallest distances to x
How does the KNN algorithm work?
▪ Compute the distance between two points:
  • Euclidean distance

    d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )

▪ Determine the class from the nearest-neighbor list
  • Take the majority vote of class labels among the k nearest neighbors
  • Weigh the vote according to distance
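A minimal, self-contained Python sketch of this procedure (plain majority vote with Euclidean distance); the small dataset and names below are illustrative assumptions, not from the slides:

    import math
    from collections import Counter

    def euclidean(x, y):
        """Euclidean distance between two numeric feature vectors."""
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def knn_predict(training_set, query, k=3):
        """Classify `query` by a majority vote among its k nearest neighbors.
        `training_set` is a list of (feature_vector, class_label) pairs."""
        # Sort the stored records by distance to the query and keep the k closest
        neighbors = sorted(training_set, key=lambda rec: euclidean(rec[0], query))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    # Illustrative data: two numeric attributes and a class label
    train = [((1.0, 2.0), "A"), ((1.5, 1.8), "A"), ((5.0, 8.0), "B"), ((6.0, 9.0), "B")]
    print(knn_predict(train, (1.2, 1.9), k=3))   # -> "A"

A distance-weighted variant would replace the plain vote count by a sum of weights such as 1/d for each neighbor.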

How many neighbors?
▪ Choosing the optimal value for K is best done by first inspecting the data
▪ In general, a larger K value is more precise as it reduces the overall noise, but there is no guarantee
▪ Historically, the optimal K for most datasets has been between 3 and 10, which produces much better results than 1-NN
Distance measures for continuous variables
Distance measures for categorical variables
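The measure tables for these two slides are not reproduced in this extraction. As a rough sketch of commonly used choices (an assumption, not the slides' exact list), the Manhattan distance is often used for continuous variables and the Hamming (simple matching) distance for categorical variables:

    def manhattan(x, y):
        """Manhattan (city-block) distance for continuous variables."""
        return sum(abs(xi - yi) for xi, yi in zip(x, y))

    def hamming(x, y):
        """Hamming / simple matching distance for categorical variables:
        the number of attributes on which the two records disagree."""
        return sum(1 for xi, yi in zip(x, y) if xi != yi)

    print(manhattan((1.0, 4.0), (3.0, 1.0)))           # 5.0
    print(hamming(("red", "web"), ("red", "mail")))    # 1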

Example
▪ Consider the following data concerning credit default: Age and Loan are two numerical variables (predictors) and Default is the target.
Example
▪ We use the training set to classify an unknown case (Age=48, Loan=$142,000) using the Euclidean distance.
▪ If K=1, the nearest neighbor is the last case in the training set, with Default=Y.
▪ D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 → Default=Y
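A quick check of this distance in Python, using the values from the example above:

    import math

    # Unknown case (Age=48, Loan=142000) vs. its nearest training case (Age=33, Loan=150000)
    d = math.sqrt((48 - 33) ** 2 + (142000 - 150000) ** 2)
    print(round(d, 2))   # 8000.01 -> the nearest neighbor has Default=Y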

Example
▪ If K=3, there are two Default=Y and one Default=N among the three closest neighbors.
▪ Based on these 3 neighbors (2 Default=Y, 1 Default=N), the majority is Default=Y. The prediction for the unknown case is again → Default=Y.
Standardized Distance
▪ One major drawback of calculating distance measures directly from the training set arises when variables have different measurement scales or when there is a mixture of numerical and categorical variables
▪ Example: if we have two variables, one based on annual income in dollars and the other based on age in years, then income will have a much higher influence on the calculated distance
▪ Solution: standardize the training set (a sketch is given below)
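A minimal sketch of one common way to standardize numeric attributes before computing distances; min-max scaling is assumed here (z-score standardization is an equally common alternative), and the numbers are illustrative:

    def min_max_scale(values):
        """Rescale a list of numeric values to the [0, 1] range."""
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    ages = [25, 35, 45, 20, 35, 52, 23, 40, 60, 48]            # hypothetical Age column
    loans = [40000, 60000, 80000, 20000, 120000, 18000,
             95000, 62000, 100000, 142000]                     # hypothetical Loan column
    print(min_max_scale(ages)[:3])    # each attribute now contributes comparably
    print(min_max_scale(loans)[:3])   # to the Euclidean distance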
Example: standardized distance
▪ Using the standardized distance on the same training set, the unknown case returns a different nearest neighbor → Default=N
KNN: advantages and disadvantages
▪ High accuracy, insensitive to outliers, no assumptions about the data
▪ Computationally expensive, high memory requirement
▪ Works with: numeric values, nominal values
Exercise

Consider the following training set for a classification problem where the attribute Risk is the target class:
Exercise

1. Classify the client M_0121 using the Euclidean distance with:
   ▪ 1-KNN
   ▪ 2-KNN
   ▪ 3-KNN
2. Does varying the value of K affect the classification?
3. Which problem occurs when K=2 and how can it be resolved?
4. Do you need data standardization for your training set? Explain.
Part II
Decision Trees

Decision Tree
▪ A Decision Tree (DT) is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
▪ It is one way to display an algorithm and to help identify the strategy most likely to reach a goal, but it is also a popular tool in machine learning.
▪ A DT classifies instances by sorting them down the tree from the root to a leaf node, which provides the classification of the instance.
▪ Each node in the tree specifies a test of some attribute of the instances, and each descending branch corresponds to one of the possible values of this attribute.
Decision Tree
▪ Decision tree learning is one of the most widely used techniques for classification.
  – Its classification accuracy is competitive with other methods
  – It is very efficient
▪ The classification model is a tree, called a decision tree.
A Training set

  Attributes           Class
  Age   Car Type       Risk
  23    Family         High
  17    Sports         High
  43    Sports         High
  68    Family         Low
  32    Truck          Low
  20    Family         High

(The table entries are the values of the attributes.)
Why Decision Tree Model?
▪ Relatively fast compared to other classification models
▪ Obtains similar and sometimes better accuracy compared to other models
▪ Simple and easy to understand
▪ Can be converted into simple and easy-to-understand classification rules
Example 1: Decision tree
Root: attribute Age

  Age < 25 ?
   ├─ yes → High
   └─ no  → Car Type in {Sports} ?
             ├─ yes → High
             └─ no  → Low

Leaf nodes hold the class.
Example 2: Training set

  #    Outlook   Company   Sailboat   Sail? (class)
  1    sunny     big       small      yes
  2    sunny     med       small      yes
  3    sunny     med       big        yes
  4    sunny     no        small      yes
  5    sunny     big       big        yes
  6    rainy     no        small      no
  7    rainy     med       small      yes
  8    rainy     big       big        yes
  9    rainy     no        big        no
  10   rainy     med       big        no
Decision tree induction
How are decision trees used for classification?
▪ The attributes of a tuple are tested against the decision tree
▪ A path is traced from the root to a leaf node, which holds the prediction for that tuple
Example 2: Decision tree
Root: attribute Outlook

  outlook
   ├─ sunny → yes
   └─ rainy → company
               ├─ no  → no
               ├─ big → yes
               └─ med → sailboat
                          ├─ small → yes
                          └─ big   → no

Decision nodes test attributes, branches carry the attribute values, and leaf nodes hold the classes.
Components
▪ Root: tests an attribute
▪ Decision nodes: test attributes
▪ Branches: the values of the attributes
▪ Leaf nodes: the classes
Illustrating Classification Task

Training Set
  Tid   Attrib1   Attrib2   Attrib3   Class
  1     Yes       Large     125K      No
  2     No        Medium    100K      No
  3     No        Small     70K       No
  4     Yes       Medium    120K      No
  5     No        Large     95K       Yes
  6     No        Medium    60K       No
  7     Yes       Large     220K      No
  8     No        Small     85K       Yes
  9     No        Medium    75K       No
  10    No        Small     90K       Yes

Test Set
  Tid   Attrib1   Attrib2   Attrib3   Class
  11    No        Small     55K       ?
  12    Yes       Medium    80K       ?
  13    Yes       Large     110K      ?
  14    No        Small     95K       ?
  15    No        Large     67K       ?

Induction: a learning algorithm builds (learns) a model from the training set.
Deduction: the learned model is applied to the test set to predict the unknown class labels.
Many Algorithms
▪ Top-Down Induction of Decision Trees (TDIDT)
▪ ID3
▪ CART
▪ ASSISTANT
▪ C4.5
▪ J48
▪ …
What is ID3? (Iterative Dichotomiser 3)
▪ A mathematical algorithm for building the decision tree.
▪ Invented by J. Ross Quinlan in 1979.
▪ Uses information theory, introduced by Shannon in 1948.
▪ Builds the tree from the top down, with no backtracking.
▪ Information Gain (IG) is used to select the most useful attribute for classification.
The Process
▪ Classifies data using the attributes
▪ The tree consists of decision nodes and leaf nodes.
▪ Nodes can have two or more branches, which represent the values of the attribute tested.
▪ Leaf nodes produce a homogeneous result (classes).
The algorithm (ID3, C4.5 and CART)
❑ The tree is constructed in a top-down, recursive, divide-and-conquer manner (a code sketch follows below)
❑ Iterations
  ▪ At the start, all the training tuples are at the root
  ▪ Tuples are partitioned recursively based on selected attributes
  ▪ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
❑ Stopping conditions
  ▪ All samples for a given node belong to the same class
  ▪ There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  ▪ There are no samples left
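As an illustration only, here is a minimal recursive sketch of this top-down procedure in Python, using information gain to pick the splitting attribute; the function and variable names are my own, and details such as tie-breaking and pruning are omitted:

    import math
    from collections import Counter

    def entropy(labels):
        """Info(T): entropy of a list of class labels."""
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        """Gain(A) = Info(T) - Info_A(T) for the categorical attribute at index `attr`."""
        total = len(labels)
        remainder = 0.0
        for v in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == v]
            remainder += (len(subset) / total) * entropy(subset)
        return entropy(labels) - remainder

    def build_tree(rows, labels, attributes):
        # Stopping condition: all samples belong to the same class
        if len(set(labels)) == 1:
            return labels[0]
        # Stopping conditions: no attributes (or no samples) left -> majority vote
        if not attributes or not rows:
            return Counter(labels).most_common(1)[0][0]
        # Select the attribute with the highest information gain
        best = max(attributes, key=lambda a: info_gain(rows, labels, a))
        tree = {"attribute": best, "branches": {}}
        for v in set(row[best] for row in rows):
            idx = [i for i, row in enumerate(rows) if row[best] == v]
            tree["branches"][v] = build_tree([rows[i] for i in idx],
                                             [labels[i] for i in idx],
                                             [a for a in attributes if a != best])
        return tree

    # Example 2 training set from the slides: (Outlook, Company, Sailboat) -> Sail?
    rows = [("sunny", "big", "small"), ("sunny", "med", "small"), ("sunny", "med", "big"),
            ("sunny", "no", "small"), ("sunny", "big", "big"), ("rainy", "no", "small"),
            ("rainy", "med", "small"), ("rainy", "big", "big"), ("rainy", "no", "big"),
            ("rainy", "med", "big")]
    labels = ["yes", "yes", "yes", "yes", "yes", "no", "yes", "yes", "no", "no"]
    print(build_tree(rows, labels, attributes=[0, 1, 2]))
    # Splits on Outlook at the root, then Company and Sailboat, reproducing the tree of Example 2.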
Attribute Selection Measures
❑ An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition D
  Ideally
  ▪ Each resulting partition would be pure
  ▪ A pure partition is a partition containing tuples that all belong to the same class
❑ Attribute selection measures (splitting rules)
  ▪ Determine how the tuples at a given node are to be split
  ▪ Provide a ranking for each attribute describing the tuples
  ▪ The attribute with the highest score is chosen
  ▪ Determine a split point or a splitting subset
❑ Methods: Information Gain, Gain Ratio, Gini Index


Information Gain (ID3/C4.5)
▪ Select the attribute with the highest information gain
▪ This attribute:
  o minimizes the information needed to classify the tuples in the resulting partitions
  o reflects the least randomness or “impurity” in these partitions
Information Gain (ID3/C4.5)
▪ Assume there are two classes, P and N: C = {N, P}
  – Let the set of examples T contain p elements of class P and n elements of class N
  – The entropy, i.e., the amount of information needed to decide whether an arbitrary example in T belongs to P or N, is defined as:

    Info(T) = − (p / (p+n)) · log₂(p / (p+n)) − (n / (p+n)) · log₂(n / (p+n))

  or, equivalently,

    Info(T) = − Σ_j ( freq(Cj, T) / |T| ) · log₂( freq(Cj, T) / |T| ),   j = 1, 2
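A small helper (my own sketch, not from the slides) that evaluates this two-class formula; a perfectly balanced set gives 1 bit, a pure set gives 0:

    import math

    def info(p, n):
        """Two-class entropy Info(T) for p examples of class P and n of class N."""
        total = p + n
        result = 0.0
        for count in (p, n):
            if count:                      # 0 * log2(0) is taken as 0
                result -= (count / total) * math.log2(count / total)
        return result

    print(info(5, 5))    # 1.0  (maximum impurity)
    print(info(10, 0))   # 0.0  (pure partition)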
Information Gain in Decision Tree Induction
▪ For each attribute, compute the amount of information needed to arrive at an exact classification after partitioning using that attribute
▪ Assume that using attribute A, a set T will be partitioned into sets {T1, T2, …, Tv}
  – If Ti contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Ti, Ti ∈ Domain(A), is:

    Info_A(T) = Σ_{i ∈ Domain(A)} ( |Ti| / |T| ) · Info(Ti)

▪ The encoding information that would be gained by branching on A:

    Gain(A) = Info(T) − Info_A(T)
Information Gain in Decision Tree Induction
▪ Information gain (based on Shannon's work on information theory) is the expected reduction in the information requirements caused by knowing the value of A
▪ It measures the effective change in entropy after making a decision based on the value of an attribute
▪ The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N
▪ For decision trees, it is ideal to base decisions on the attribute that provides the largest change in entropy, i.e., the attribute with the highest gain
Split Information and Gain Ratio
▪ The split information value represents the potential information generated by splitting the training data set T into v partitions, corresponding to the v outcomes on attribute A:

    Split(T, A) = − Σ_{i ∈ Domain(A)} ( |Ti| / |T| ) · log₂( |Ti| / |T| )

▪ High split info: the partitions have more or less the same size (uniform)
▪ Low split info: a few partitions hold most of the tuples (peaks)
▪ The gain ratio of attribute A:

    Gain_Ratio(T, A) = Gain(A) / Split(T, A)

▪ The attribute with the maximum gain ratio is selected as the splitting attribute
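These two quantities written as code (a sketch; the helper names are mine), assuming the list of partition sizes |Ti| induced by the attribute and an already-computed information gain:

    import math

    def split_info(partition_sizes):
        """Split(T, A) computed from the sizes |Ti| of the partitions induced by A."""
        total = sum(partition_sizes)
        return -sum((s / total) * math.log2(s / total) for s in partition_sizes if s)

    def gain_ratio(gain, partition_sizes):
        """Gain_Ratio(T, A) = Gain(A) / Split(T, A)."""
        return gain / split_info(partition_sizes)

    # With the Connection attribute of the example below (partitions of sizes 4, 4 and 2):
    print(round(split_info([4, 4, 2]), 3))          # 1.522
    print(round(gain_ratio(0.761, [4, 4, 2]), 3))   # 0.5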
Decision Tree: Example

  Attributes                              Class
  Connection   Service   Period          Attack
  High         Web       Working days    Dos
  High         Web       Holidays        SQL injection
  High         Web       Working days    Dos
  High         Mail      Holidays        SQL injection
  Medium       Web       Working days    Dos
  Medium       Web       Holidays        SQL injection
  Medium       Mail      Working days    SQL injection
  Medium       Mail      Holidays        SQL injection
  Small        Mail      Working days    Passive
  Small        Mail      Holidays        Passive

  (Classes: C1 = Dos, C2 = SQL injection, C3 = Passive)
Decision Tree: Example
▪ Entropy of the whole training set (3 Dos, 5 SQL injection, 2 Passive out of 10 records):

    Info(T) = − (3/10)·log₂(3/10) − (5/10)·log₂(5/10) − (2/10)·log₂(2/10) = 1.485
Decision Tree: Using attribute Connection

    Info_Connection(T) = Σ_{i ∈ Domain(Connection)} ( |Ti| / |T| ) · Info(Ti),
    Domain(Connection) = {High, Medium, Small}

    Info(T_High)   = − (2/4)·log₂(2/4) − (2/4)·log₂(2/4) = 1        (2 Dos, 2 SQL injection)
    Info(T_Medium) = − (1/4)·log₂(1/4) − (3/4)·log₂(3/4) = 0.812    (1 Dos, 3 SQL injection)
    Info(T_Small)  = − (2/2)·log₂(2/2) = 0                          (2 Passive)

    Info_Connection(T) = (4/10)·Info(T_High) + (4/10)·Info(T_Medium) + (2/10)·Info(T_Small) = 0.725
Decision Tree: Using attribute Connection

    Gain(T, Connection) = Info(T) − Info_Connection(T) = 0.761
Decision Tree: Using attribute Connection

    SplitInfo(T, Connection) = − Σ_{i ∈ Domain(Connection)} ( |Ti| / |T| ) · log₂( |Ti| / |T| )
                             = − (4/10)·log₂(4/10) − (4/10)·log₂(4/10) − (2/10)·log₂(2/10) = 1.522

    Gain_Ratio(T, Connection) = 0.761 / 1.522 = 0.5
Decision Tree: Using attribute Period

    Info_Period(T) = Σ_{i ∈ Domain(Period)} ( |Ti| / |T| ) · Info(Ti),
    Domain(Period) = {Working days (WD), Holidays}

    Info(T_WD)       = − (3/5)·log₂(3/5) − (1/5)·log₂(1/5) − (1/5)·log₂(1/5) = 1.371   (3 Dos, 1 SQL injection, 1 Passive)
    Info(T_Holidays) = − (4/5)·log₂(4/5) − (1/5)·log₂(1/5) = 0.722                     (4 SQL injection, 1 Passive)

    Info_Period(T) = (5/10)·Info(T_WD) + (5/10)·Info(T_Holidays) = 1.046
Decision Tree: Using attribute Period

    Gain(T, Period) = Info(T) − Info_Period(T) = 1.485 − 1.046 = 0.439

    SplitInfo(T, Period) = − Σ_{i ∈ Domain(Period)} ( |Ti| / |T| ) · log₂( |Ti| / |T| )
                         = − (5/10)·log₂(5/10) − (5/10)·log₂(5/10) = 1

    Gain_Ratio(T, Period) = 0.439 / 1 = 0.439
Decision Tree: Using attribute Service

    Info_Service(T) = Σ_{i ∈ Domain(Service)} ( |Ti| / |T| ) · Info(Ti),
    Domain(Service) = {Web, Mail}

    Info(T_Web)  = − (3/5)·log₂(3/5) − (2/5)·log₂(2/5) = 0.971   (3 Dos, 2 SQL injection)
    Info(T_Mail) = − (3/5)·log₂(3/5) − (2/5)·log₂(2/5) = 0.971   (3 SQL injection, 2 Passive)

    Info_Service(T) = (5/10)·Info(T_Web) + (5/10)·Info(T_Mail) = 0.971
Decision Tree: Using attribute Service

    Gain(T, Service) = Info(T) − Info_Service(T) = 1.485 − 0.971 = 0.514

    SplitInfo(T, Service) = − Σ_{i ∈ Domain(Service)} ( |Ti| / |T| ) · log₂( |Ti| / |T| )
                          = − (5/10)·log₂(5/10) − (5/10)·log₂(5/10) = 1

    Gain_Ratio(T, Service) = 0.514 / 1 = 0.514
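These hand computations can be cross-checked with a short script (a sketch written for this example; the data below are transcribed from the training set above):

    import math
    from collections import Counter

    def info(labels):
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def gain_ratio(rows, labels, attr):
        total = len(labels)
        info_a = split = 0.0
        for v in set(r[attr] for r in rows):
            subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
            w = len(subset) / total
            info_a += w * info(subset)
            split -= w * math.log2(w)
        return (info(labels) - info_a) / split

    # (Connection, Service, Period) -> Attack
    rows = [("High", "Web", "WD"), ("High", "Web", "Hol"), ("High", "Web", "WD"),
            ("High", "Mail", "Hol"), ("Medium", "Web", "WD"), ("Medium", "Web", "Hol"),
            ("Medium", "Mail", "WD"), ("Medium", "Mail", "Hol"),
            ("Small", "Mail", "WD"), ("Small", "Mail", "Hol")]
    labels = ["Dos", "SQL", "Dos", "SQL", "Dos", "SQL", "SQL", "SQL", "Passive", "Passive"]

    print(round(info(labels), 3))                                  # 1.485
    for name, attr in [("Connection", 0), ("Period", 2), ("Service", 1)]:
        print(name, round(gain_ratio(rows, labels, attr), 3))
    # Connection ≈ 0.5, Period ≈ 0.439, Service ≈ 0.515; Service is highest,
    # matching the slide values (0.5, 0.439, 0.514 up to intermediate rounding).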
Decision Tree: Example (Level 1)

    Gain_Ratio(T, Connection) = 0.5
    Gain_Ratio(T, Period)     = 0.439
    Gain_Ratio(T, Service)    = 0.514

▪ Service has the highest gain ratio, so it is chosen as the root. Splitting on Service gives two partitions:

  Partition Service = Web:
  Connection   Service   Period         Attack
  High         Web       Working days   Dos
  High         Web       Holidays       SQL injection
  High         Web       Working days   Dos
  Medium       Web       Working days   Dos
  Medium       Web       Holidays       SQL injection

  Partition Service = Mail:
  Connection   Service   Period         Attack
  High         Mail      Holidays       SQL injection
  Medium       Mail      Working days   SQL injection
  Medium       Mail      Holidays       SQL injection
  Small        Mail      Working days   Passive
  Small        Mail      Holidays       Passive
Decision Tree: Example (Service = Web)

  Connection   Service   Period         Attack
  High         Web       Working days   Dos
  High         Web       Holidays       SQL injection
  High         Web       Working days   Dos
  Medium       Web       Working days   Dos
  Medium       Web       Holidays       SQL injection

    Info(T_Web) = Info(S) = − (3/5)·log₂(3/5) − (2/5)·log₂(2/5) = 0.971
Decision Tree: Example (Service = Web, splitting on Connection)

    Info_Connection(S_High)   = − (2/3)·log₂(2/3) − (1/3)·log₂(1/3) = 0.918
    Info_Connection(S_Medium) = − (1/2)·log₂(1/2) − (1/2)·log₂(1/2) = 1
    Info_Connection(S_Small)  = 0

    Info_Connection(S) = (3/5)·0.918 + (2/5)·1 + 0 = 0.951
Decision Tree: Example (Service = Web, splitting on Connection)

    Gain(S, Connection) = Info(S) − Info_Connection(S) = 0.971 − 0.951 = 0.02

    Split(S, Connection) = − (3/5)·log₂(3/5) − (2/5)·log₂(2/5) − 0 = 0.971

    Gain_Ratio(S, Connection) = 0.02 / 0.971 ≈ 0.02

  The computation continues in the same way for the attribute Period, which obtains the higher gain ratio and is therefore selected for this branch, as shown in the final tree.
Final Decision Tree: Example

  Service
   ├─ Web  → Period
   │          ├─ Working Days (WD) → Dos
   │          └─ Holidays          → SQL Injection
   └─ Mail → Connection
              ├─ High   → SQL Injection
              ├─ Medium → SQL Injection
              └─ Small  → Passive
Decision Tree: Example

  Service
   ├─ Web  → Period
   │          ├─ Working Days → C1 (Dos)
   │          └─ Holidays     → C2 (SQL injection)
   └─ Mail → Connection
              ├─ High   → C2 (SQL injection)
              ├─ Medium → C2 (SQL injection)
              └─ Small  → C3 (Passive)

Classify?

  Connection   Service   Period     Attack
  Medium       Mail      Holidays   ?
Classification rules
Extracting Classification Rules from the Tree
▪ Represent the knowledge in the form of IF-THEN rules
▪ One rule is created for each path from the root to a leaf
▪ Each attribute-value pair along a path forms a conjunction
▪ The leaf node holds the class prediction
▪ Rules are easier for humans to understand
Example (from the final tree above):
If (Service = Web) ∧ (Period = Working days) Then C1 (Dos)
If (Service = Web) ∧ (Period = Holidays) Then C2 (SQL injection)

Number of classification rules? (= the number of leaf nodes)
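As a rough sketch (the nested-dict tree representation and the function name are my own), the rules can be generated mechanically by walking every root-to-leaf path:

    def extract_rules(tree, conditions=()):
        """Yield one IF-THEN rule per root-to-leaf path of a nested-dict tree."""
        if not isinstance(tree, dict):              # leaf node: emit the accumulated rule
            yield "If " + " ^ ".join(conditions) + f" Then {tree}"
            return
        for value, subtree in tree["branches"].items():
            cond = f"({tree['attribute']} = {value})"
            yield from extract_rules(subtree, conditions + (cond,))

    # The final decision tree of the example, written as a nested dict
    tree = {"attribute": "Service", "branches": {
        "Web":  {"attribute": "Period", "branches": {
            "Working days": "Dos", "Holidays": "SQL injection"}},
        "Mail": {"attribute": "Connection", "branches": {
            "High": "SQL injection", "Medium": "SQL injection", "Small": "Passive"}}}}

    for rule in extract_rules(tree):
        print(rule)    # 5 rules, one per leaf node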


Summary
▪ Decision trees have a relatively faster learning speed than other methods
▪ They are convertible into simple and easy-to-understand classification rules
▪ Information Gain, Gain Ratio and the Gini Index are the most common attribute selection measures
