Data Warehousing & Data Mining Chapter 5
Classification
TBS 2020-2021
What is Classification?
▪ Classification is the process of assigning new objects to predefined categories or classes
• Given a set of labeled instances S = {(x₁, y₁), ..., (xₙ, yₙ)}, the goal is to learn a function that maps each new object x to its class label y
Part I
K Nearest Neighbors
K-Nearest Neighbors (KNN)
▪ The KNN classifier is a simple algorithm that stores all available data and classifies new data into a particular class based on a similarity measure
▪ It is a supervised learning algorithm
▪ It is a non-parametric method used for classification
▪ The prediction for a test instance is made on the basis of its neighbors
▪ It predicts the target label by finding the nearest neighbor classes; the closest class is identified using distance measures such as Euclidean distance
K-Nearest Neighbor (KNN)
▪ A new observation is classified by a majority vote of its neighbors
▪ If K=1, the observation is simply assigned to the class of its single nearest neighbor, as sketched below
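As a rough illustration (not from the slides), the whole procedure fits in a few lines of Python; the function name knn_predict and the NumPy-based distance computation are our own choices, and X_train, y_train are assumed to be NumPy arrays.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every stored record
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest records
    nearest = np.argsort(dists)[:k]
    # Majority vote among the neighbors' class labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```

With k=1 this reduces to copying the class of the single nearest neighbor.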
How does the KNN algorithm work?
[Figure: an unknown record to classify among the stored records]
▪ Requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
[Figure: the 1-, 2-, and 3-nearest neighbors of the unknown record]
How does the KNN algorithm work?
▪ Compute distance between two points:
• Euclidean distance
d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}
How many neighbors?
▪ Choosing the optimal value for K is best done by first
inspecting the data
Distance measures for continuous variables
Distance measures for categorical variables
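The slide's own measures are not reproduced in this extract; as one common example (our illustration), a simple matching ("overlap") distance counts the fraction of attributes on which two records disagree:

```python
def overlap_distance(x, y):
    """Simple matching distance for two categorical records:
    the fraction of attributes on which they disagree."""
    return sum(xi != yi for xi, yi in zip(x, y)) / len(x)

# Example: overlap_distance(("sunny", "big"), ("rainy", "big")) -> 0.5
```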
Example
▪ Consider the following data concerning credit default:
Age and Loan are two numerical variables (predictors)
and Default is the target.
Example
▪ We use the training set to classify an unknown case (Age=48, Loan=$142,000) using Euclidean distance.
▪ If K=1, the nearest neighbor is the last case in the training set, with Default=Y:
▪ D = √((48−33)² + (142000−150000)²) = 8000.01 → Default=Y
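A quick check of this computation (the full training table is not shown in this extract; following the slide, the nearest case is taken to be Age=33, Loan=$150,000):

```python
from math import sqrt

# Distance from the unknown case (48, 142000) to the case (33, 150000)
d = sqrt((48 - 33) ** 2 + (142_000 - 150_000) ** 2)
print(round(d, 2))  # 8000.01
```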
Example
▪ If K=3, two of the three closest neighbors have Default=Y and one has Default=N.
▪ By majority vote among the 3 neighbors (2 Default=Y vs. 1 Default=N), the prediction for the unknown case is again → Default=Y.
Standardized Distance
▪ One major drawback of calculating distance measures directly from the training set arises when variables have different measurement scales or there is a mixture of numerical and categorical variables
▪ Example: if we have two variables, one based on annual income in dollars and the other based on age in years, then income will have a much higher influence on the calculated distance
▪ Solution: standardize the training set, as in the sketch below
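One standard way to do this (our illustration; the slides do not give the formula in this extract) is min-max scaling, which maps every variable onto [0, 1]:

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column of X to [0, 1] so that no variable dominates
    the distance purely because of its measurement scale."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

# Example: the Age and Loan columns end up on the same 0-1 scale
# min_max_scale([[25, 40_000], [33, 150_000], [48, 142_000]])
```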
Example: standardized distance
▪ Using the standardized distance on the same training set, the unknown case returns a different nearest neighbor → Default=N
KNN: advantages and disadvantages
Exercise
Part II
Decision Trees
Decision Tree
▪ A Decision Tree (DT) is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
▪ It is one way to display an algorithm and to help identify the strategy most likely to reach a goal, but decision trees are also a popular tool in machine learning.
▪ A DT classifies instances by sorting them down the tree from the root to a leaf node, which provides the classification of the instance.
▪ Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values of this attribute.
Decision Tree
▪ Decision tree learning is one of the most widely used techniques for classification.
– Its classification accuracy is competitive with other methods
– It is very efficient
▪ The classification model is a tree, called a decision tree.
A Training set
[Table: training set; columns grouped into Attributes and a Class]
Example 1: Decision tree
[Figure: decision tree with root attribute Age; e.g. the branch Age < 25 leads to the class High]
Example 2: Training set
#   Outlook  Company  Sailboat  Sail?
1   sunny    big      small     yes
2   sunny    med      small     yes
3   sunny    med      big       yes
4   sunny    no       small     yes
5   sunny    big      big       yes
6   rainy    no       small     no
7   rainy    med      small     yes
8   rainy    big      big       yes
9   rainy    no       big       no
10  rainy    med      big       no
Decision tree induction
How are decision trees used for classification?
Example 2: Decision tree
Root attribute: outlook. Internal (decision) nodes test attributes, branches carry the attribute values, and leaf nodes give the classes.

outlook
├─ sunny → yes
└─ rainy → company
    ├─ no  → no
    ├─ big → yes
    └─ med → sailboat
        ├─ small → yes
        └─ big   → no
Components
Root
▪ Tests the attributes
Branches
▪ The values of the attributes
Illustrating Classification Task
[Figure: a model is induced from a labeled Training Set (columns Tid, Attrib1, Attrib2, Attrib3, Class) and then applied to a Test Set whose class labels are unknown]
Many Algorithms
▪ Top-Down Induction of Decision Trees (TDIDT)
▪ ID3
▪ CART
▪ ASSISTANT
▪ C4.5
▪ J48
▪ …
What is ID3? (Iterative Dichotomiser 3)
▪ ID3 is a decision tree induction algorithm that selects the test attribute at each node using information gain
The Process
The algorithm (ID3, C4.5 and CART)
❑ The tree is constructed in a top-down, recursive, divide-and-conquer manner
❑ Iterations
▪ At the start, all the training tuples are at the root
▪ Tuples are partitioned recursively based on selected attributes
▪ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
❑ Stopping conditions
▪ All samples for a given node belong to the same class
▪ There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
▪ There are no samples left
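A compact sketch of this recursion (our illustration, assuming categorical attributes; the parameter measure stands for any attribute selection heuristic, such as the information gain defined next):

```python
from collections import Counter

def build_tree(rows, attributes, target, measure):
    """ID3-style induction. rows: list of dicts; attributes: candidate
    attribute names; target: class column; measure(rows, attr, target):
    any attribute selection heuristic, e.g. information gain."""
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:           # all samples in one class
        return classes[0]
    if not attributes:                   # no attributes left: majority vote
        return Counter(classes).most_common(1)[0][0]
    # Select the test attribute by the heuristic measure
    best = max(attributes, key=lambda a: measure(rows, a, target))
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = build_tree(
            subset, [a for a in attributes if a != best], target, measure)
    return tree
```

The gain_ratio function sketched later in this chapter can be passed as measure.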
Attribute Selection Measures
❑ An attribute selection measure is a heuristic for selecting
the splitting criterion that “best” separates a given data
partition D
Ideally
▪ Each resulting partition would be pure
▪ A pure partition is a partition containing tuples that all
belong to the same class
Information Gain (ID3/C4.5)
▪ Assume there are two classes, P and N:
C={N,P}
– Let the set of examples T contain p elements
of class P and n elements of class N
– The entropy, i.e. the amount of information needed to decide whether an arbitrary example in T belongs to P or N, is defined as:
Info(T) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}
or
Info(T) = -\sum_{j=1}^{2} \frac{freq(C_j, T)}{|T|}\log_2\frac{freq(C_j, T)}{|T|}
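The same formula in code (our sketch; the function name entropy is ours), using the convention 0·log₂0 = 0:

```python
from math import log2

def entropy(p, n):
    """Info(T) for a set with p examples of class P and n of class N."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:                      # 0 * log2(0) is taken as 0
            result -= count / total * log2(count / total)
    return result

# Example: entropy(2, 2) == 1.0, entropy(2, 0) == 0.0
```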
Information Gain in Decision Tree Induction
Gain(T, A) = Info(T) - Info_A(T), where Info_A(T) = \sum_{i \in Domain(A)} \frac{|T_i|}{|T|} \, Info(T_i) over the partitions T_i induced by the values of attribute A
Decision Tree: Example

Connection  Service  Period        Attack
High        Web      Working days  Dos
High        Web      Holidays      SQL injection
High        Web      Working days  Dos
High        Mail     Holidays      SQL injection
Medium      Web      Working days  Dos
Medium      Web      Holidays      SQL injection
Medium      Mail     Working days  SQL injection
Medium      Mail     Holidays      SQL injection
Small       Mail     Working days  Passive
Small       Mail     Holidays      Passive

With 3 Dos, 5 SQL injection, and 2 Passive examples:
Info(T) = -\frac{3}{10}\log_2\frac{3}{10} - \frac{5}{10}\log_2\frac{5}{10} - \frac{2}{10}\log_2\frac{2}{10} = 1.485
Decision Tree: Using attribute Connection
Info_{Connection}(T) = \sum_{i \in Domain} \frac{|T_i|}{|T|} \, Info(T_i), with Domain(Connection) = \{High, Medium, Small\}

Info(T_{High}) = -\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4} = 1
Info(T_{Medium}) = -\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4} = 0.812
Info(T_{Small}) = -\frac{2}{2}\log_2\frac{2}{2} = 0
Info_{Connection}(T) = \frac{4}{10} Info(T_{High}) + \frac{4}{10} Info(T_{Medium}) + \frac{2}{10} Info(T_{Small}) = 0.725
Decision Tree: Using attribute Connection
Gain(T, Connection) = Info(T) - Info_{Connection}(T) = 1.485 - 0.725 = 0.761
Decision Tree: Using attribute Connection
SplitInfo(T, Connection) = -\sum_{i \in Domain} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}
SplitInfo(T, Connection) = -\frac{4}{10}\log_2\frac{4}{10} - \frac{4}{10}\log_2\frac{4}{10} - \frac{2}{10}\log_2\frac{2}{10} = 1.522
GainRatio(T, Connection) = \frac{0.761}{1.522} = 0.5
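The whole computation for this example can be checked with a short script (our sketch; the helper names are ours, and the dataset is the attack table above with Period abbreviated):

```python
from math import log2
from collections import Counter

def info(labels):
    """Multi-class Info(T): entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attr, target):
    """C4.5-style gain ratio for splitting `rows` (list of dicts) on `attr`."""
    n = len(rows)
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    info_attr = sum(len(g) / n * info(g) for g in groups.values())
    split_info = -sum(len(g) / n * log2(len(g) / n) for g in groups.values())
    gain = info([r[target] for r in rows]) - info_attr
    return gain / split_info

# The attack table above (WD = Working days, H = Holidays)
rows = [
    {"Connection": "High",   "Service": "Web",  "Period": "WD", "Attack": "Dos"},
    {"Connection": "High",   "Service": "Web",  "Period": "H",  "Attack": "SQL injection"},
    {"Connection": "High",   "Service": "Web",  "Period": "WD", "Attack": "Dos"},
    {"Connection": "High",   "Service": "Mail", "Period": "H",  "Attack": "SQL injection"},
    {"Connection": "Medium", "Service": "Web",  "Period": "WD", "Attack": "Dos"},
    {"Connection": "Medium", "Service": "Web",  "Period": "H",  "Attack": "SQL injection"},
    {"Connection": "Medium", "Service": "Mail", "Period": "WD", "Attack": "SQL injection"},
    {"Connection": "Medium", "Service": "Mail", "Period": "H",  "Attack": "SQL injection"},
    {"Connection": "Small",  "Service": "Mail", "Period": "WD", "Attack": "Passive"},
    {"Connection": "Small",  "Service": "Mail", "Period": "H",  "Attack": "Passive"},
]

for attr in ("Connection", "Period", "Service"):
    print(attr, round(gain_ratio(rows, attr, "Attack"), 3))
# Connection 0.5   Period 0.439   Service 0.515
# (the slides get 0.514 for Service by rounding 1.485 - 0.971)
```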
Decision Tree: Using attribute Period
Info_{Period}(T) = \sum_{i \in Domain} \frac{|T_i|}{|T|} \, Info(T_i), with Domain(Period) = \{Working days, Holidays\}

Info(T_{WD}) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{1}{5}\log_2\frac{1}{5} - \frac{1}{5}\log_2\frac{1}{5} = 1.371
Info(T_{Holidays}) = -\frac{4}{5}\log_2\frac{4}{5} - \frac{1}{5}\log_2\frac{1}{5} = 0.722
Info_{Period}(T) = \frac{5}{10} Info(T_{WD}) + \frac{5}{10} Info(T_{Holidays}) = 1.046
Decision Tree: Using attribute Period
Gain(T, Period) = Info(T) - Info_{Period}(T) = 1.485 - 1.046 = 0.439
SplitInfo(T, Period) = -\frac{5}{10}\log_2\frac{5}{10} - \frac{5}{10}\log_2\frac{5}{10} = 1
GainRatio(T, Period) = \frac{0.439}{1} = 0.439
Decision Tree: Using attribute Service
Info_{Service}(T) = \sum_{i \in Domain} \frac{|T_i|}{|T|} \, Info(T_i), with Domain(Service) = \{Web, Mail\}

Info(T_{Web}) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971   (3 Dos, 2 SQL injection)
Info(T_{Mail}) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971   (3 SQL injection, 2 Passive)
Info_{Service}(T) = \frac{5}{10} Info(T_{Web}) + \frac{5}{10} Info(T_{Mail}) = 0.971
Decision Tree: Using attribute Service
Gain(T, Service) = Info(T) - Info_{Service}(T) = 1.485 - 0.971 = 0.514
SplitInfo(T, Service) = -\frac{5}{10}\log_2\frac{5}{10} - \frac{5}{10}\log_2\frac{5}{10} = 1
GainRatio(T, Service) = \frac{0.514}{1} = 0.514
Decision Tree: Example (Level 1)
GainRatio(T, Connection) = 0.5
GainRatio(T, Period) = 0.439
GainRatio(T, Service) = 0.514
Service has the highest gain ratio, so it is chosen as the root, with one branch per value: Web and Mail.
Within the Web branch (S = T_{Web}): Info(S) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} = 0.971
Decision Tree: Example (Web branch, testing attribute Connection)
Info_{Connection}(S_{High}) = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} = 0.918
Info_{Connection}(S_{Medium}) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1
Info_{Connection}(S_{Small}) = 0
Info_{Connection}(S) = \frac{3}{5} \cdot 0.918 + \frac{2}{5} \cdot 1 = 0.951
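Reusing the info helper and the rows list from the sketches above, these branch entropies can be verified (our illustration):

```python
# Restrict to the Web branch, then compute Info per Connection value
web = [r for r in rows if r["Service"] == "Web"]
for value in ("High", "Medium"):
    branch = [r["Attack"] for r in web if r["Connection"] == value]
    print(value, round(info(branch), 3))
# High 0.918   Medium 1.0
```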
Final Decision Tree: Example

Service
├─ Web → Period
│   ├─ Working days → Dos
│   └─ Holidays → SQL injection
└─ Mail → Connection
    ├─ High → SQL injection
    ├─ Medium → SQL injection
    └─ Small → Passive
Decision Tree: Example
[Figure: the same tree with leaves labeled C1 = Dos, C2 = SQL injection, C3 = Passive]
Classify? A new case is classified by sorting it down the tree from the root (Service) to a leaf.
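Finally, classifying a new case with a tree in the nested-dict form produced by the build_tree sketch is a simple walk from the root to a leaf (our illustration):

```python
def classify(tree, instance):
    """Follow the branch matching each tested attribute until a leaf."""
    while isinstance(tree, dict):
        attr = next(iter(tree))            # attribute tested at this node
        tree = tree[attr][instance[attr]]  # descend along the matching branch
    return tree                            # leaf = predicted class

# Example: with the final tree above,
# {"Service": "Web", "Period": "Working days"} would reach the leaf "Dos".
```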