
Chapter 5:

Classification

TBS 2020-2021

Olfa Dridi & Afef Ben Brahim


Supervised learning
▪ Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
▪ Find a model for the class attribute as a function of the values of the other attributes.
▪ Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model.
  – Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Supervised learning process: two steps
◼ Learning (training): learn a model using the training data
◼ Testing: test the model using unseen test data to assess the model accuracy

    Accuracy = (Number of correct classifications) / (Total number of test cases)
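A minimal Python sketch of the testing step (the labels below are hypothetical, not from the slides) makes the accuracy formula concrete:

    # Accuracy = number of correct classifications / total number of test cases
    def accuracy(y_true, y_pred):
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
        return correct / len(y_true)

    y_true = ["yes", "no", "yes", "yes"]   # true class labels of the test set
    y_pred = ["yes", "no", "no", "yes"]    # labels predicted by the learned model
    print(accuracy(y_true, y_pred))        # 0.75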

What is Classification?
▪ Classification is the process of assigning new objects to predefined categories or classes
  • Given a set of labeled instances S = {(x1, y1), ..., (xn, yn)}: x is an instance and y is its class (a special attribute)
  • xi = (xi1, xi2, ..., xid): the d attributes that characterize the instance xi
  • The problem we want to solve: given a new sample x = u, we want to find the class to which this sample belongs.
Classification Problem
▪ Given a set of example records
  – Each record consists of
    • A set of attributes
    • A class label
▪ Build an accurate model for each class based on the set of attributes
▪ Use the model to classify future data for which the class labels are unknown
▪ Objective: good prediction for new observations given only the features (attributes)
Part I
K Nearest Neighbors

K-Nearest Neighbors (KNN)
▪ The KNN classifier is a simple algorithm that stores all available data and classifies new data into a particular class based on a similarity measure
▪ It is a supervised learning algorithm
▪ It is a non-parametric method used for classification
▪ Prediction for test data is made on the basis of its neighbors
▪ It predicts the target label by finding the nearest neighbor class; the closest class is identified using distance measures such as the Euclidean distance
K-Nearest Neighbor (KNN)
▪ A new observation is classified by a majority vote of its neighbors
▪ If K=1, the observation is simply assigned to the class of its single nearest neighbor
How does the KNN algorithm work?
▪ Requires three things
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
▪ To classify an unknown record:
  – Compute the distance to the other training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
How does the KNN algorithm work?

(a) 1-nearest neighbor    (b) 2-nearest neighbor    (c) 3-nearest neighbor

▪ The k nearest neighbors of a record x are the data points that have the k smallest distances to x
How does the KNN algorithm work?
▪ Compute the distance between two points:
  • Euclidean distance

    d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )

▪ Determine the class from the nearest-neighbor list
  • Take the majority vote of class labels among the k nearest neighbors
  • Weigh the vote according to distance
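A minimal, self-contained Python sketch of this procedure (plain majority vote with Euclidean distance); the small dataset and names below are illustrative assumptions, not from the slides:

    import math
    from collections import Counter

    def euclidean(x, y):
        """Euclidean distance between two numeric feature vectors."""
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def knn_predict(training_set, query, k=3):
        """Classify `query` by a majority vote among its k nearest neighbors.
        `training_set` is a list of (feature_vector, class_label) pairs."""
        # Sort the stored records by distance to the query and keep the k closest
        neighbors = sorted(training_set, key=lambda rec: euclidean(rec[0], query))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    # Illustrative data: two numeric attributes and a class label
    train = [((1.0, 2.0), "A"), ((1.5, 1.8), "A"), ((5.0, 8.0), "B"), ((6.0, 9.0), "B")]
    print(knn_predict(train, (1.2, 1.9), k=3))   # -> "A"

A distance-weighted variant would replace the plain vote count by a sum of weights such as 1/d for each neighbor.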

How many neighbors?
▪ Choosing the optimal value for K is best done by first inspecting the data
▪ In general, a larger K value is more precise as it reduces the overall noise, but there is no guarantee
▪ Historically, the optimal K for most datasets has been between 3 and 10, which produces much better results than 1-NN
Distance measures for continuous variables
Distance measures for categorical variables
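The measure tables for these two slides are not reproduced in this extraction. As a rough sketch of commonly used choices (an assumption, not the slides' exact list), the Manhattan distance is often used for continuous variables and the Hamming (simple matching) distance for categorical variables:

    def manhattan(x, y):
        """Manhattan (city-block) distance for continuous variables."""
        return sum(abs(xi - yi) for xi, yi in zip(x, y))

    def hamming(x, y):
        """Hamming / simple matching distance for categorical variables:
        the number of attributes on which the two records disagree."""
        return sum(1 for xi, yi in zip(x, y) if xi != yi)

    print(manhattan((1.0, 4.0), (3.0, 1.0)))           # 5.0
    print(hamming(("red", "web"), ("red", "mail")))    # 1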

Example
▪ Consider the following data concerning credit default: Age and Loan are two numerical variables (predictors) and Default is the target.
Example
▪ We use the training set to classify an unknown case (Age=48, Loan=$142,000) using the Euclidean distance.
▪ If K=1, the nearest neighbor is the last case in the training set, with Default=Y.
▪ D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 → Default=Y
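A quick check of this distance in Python, using the values from the example above:

    import math

    # Unknown case (Age=48, Loan=142000) vs. its nearest training case (Age=33, Loan=150000)
    d = math.sqrt((48 - 33) ** 2 + (142000 - 150000) ** 2)
    print(round(d, 2))   # 8000.01 -> the nearest neighbor has Default=Y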

Example
▪ If K=3, there are two Default=Y and one Default=N among the three closest neighbors.
▪ Based on these 3 neighbors (2 Default=Y, 1 Default=N), the majority is Default=Y. The prediction for the unknown case is again → Default=Y.
Standardized Distance
▪ One major drawback of calculating distance measures directly from the training set arises when variables have different measurement scales or when there is a mixture of numerical and categorical variables
▪ Example: if we have two variables, one based on annual income in dollars and the other based on age in years, then income will have a much higher influence on the calculated distance
▪ Solution: standardize the training set (a sketch is given below)
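A minimal sketch of one common way to standardize numeric attributes before computing distances; min-max scaling is assumed here (z-score standardization is an equally common alternative), and the numbers are illustrative:

    def min_max_scale(values):
        """Rescale a list of numeric values to the [0, 1] range."""
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]

    ages = [25, 35, 45, 20, 35, 52, 23, 40, 60, 48]            # hypothetical Age column
    loans = [40000, 60000, 80000, 20000, 120000, 18000,
             95000, 62000, 100000, 142000]                     # hypothetical Loan column
    print(min_max_scale(ages)[:3])    # each attribute now contributes comparably
    print(min_max_scale(loans)[:3])   # to the Euclidean distance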
Example: standardized distance
▪ Using the standardized distance on the same training set, the unknown case returns a different nearest neighbor → Default=N
KNN: advantages and disadvantages
▪ High accuracy, insensitive to outliers, no assumptions about the data
▪ Computationally expensive, high memory requirement
▪ Works with: numeric values, nominal values
Exercise

Consider the following training set for a classification problem where the attribute Risk is the target class:
Exercise

1. Classify the client M_0121 using the Euclidean distance with:
   ▪ 1-KNN
   ▪ 2-KNN
   ▪ 3-KNN
2. Does varying the value of K affect the classification?
3. Which problem occurs when K=2 and how can it be resolved?
4. Do you need data standardization for your training set? Explain.
Part II
Decision Trees

Decision Tree
▪ A Decision Tree (DT) is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
▪ It is one way to display an algorithm and to help identify the strategy most likely to reach a goal, but it is also a popular tool in machine learning.
▪ A DT classifies instances by sorting them down the tree from the root to a leaf node, which provides the classification of the instance.
▪ Each node in the tree specifies a test of some attribute of the instances, and each descending branch corresponds to one of the possible values of this attribute.
Decision Tree
▪ Decision tree learning is one of the most widely used techniques for classification.
  – Its classification accuracy is competitive with other methods
  – It is very efficient
▪ The classification model is a tree, called a decision tree.
A Training set

  Attributes           Class
  Age   Car Type       Risk
  23    Family         High
  17    Sports         High
  43    Sports         High
  68    Family         Low
  32    Truck          Low
  20    Family         High

(The table entries are the values of the attributes.)
Why Decision Tree Model?
▪ Relatively fast compared to other classification models
▪ Obtains similar and sometimes better accuracy compared to other models
▪ Simple and easy to understand
▪ Can be converted into simple and easy-to-understand classification rules
Example 1: Decision tree
Root: attribute Age

  Age < 25 ?
   ├─ yes → High
   └─ no  → Car Type in {Sports} ?
             ├─ yes → High
             └─ no  → Low

Leaf nodes hold the class.
Example 2: Training set

  #    Outlook   Company   Sailboat   Sail? (class)
  1    sunny     big       small      yes
  2    sunny     med       small      yes
  3    sunny     med       big        yes
  4    sunny     no        small      yes
  5    sunny     big       big        yes
  6    rainy     no        small      no
  7    rainy     med       small      yes
  8    rainy     big       big        yes
  9    rainy     no        big        no
  10   rainy     med       big        no
Decision tree induction
How are decision trees used for classification?
▪ The attributes of a tuple are tested against the decision tree
▪ A path is traced from the root to a leaf node, which holds the prediction for that tuple
Example 2: Decision tree
Root: attribute Outlook

  outlook
   ├─ sunny → yes
   └─ rainy → company
               ├─ no  → no
               ├─ big → yes
               └─ med → sailboat
                          ├─ small → yes
                          └─ big   → no

Decision nodes test attributes, branches carry the attribute values, and leaf nodes hold the classes.
Components
▪ Root: tests an attribute
▪ Decision nodes: test attributes
▪ Branches: the values of the attributes
▪ Leaf nodes: the classes
Illustrating Classification Task

Training Set
  Tid   Attrib1   Attrib2   Attrib3   Class
  1     Yes       Large     125K      No
  2     No        Medium    100K      No
  3     No        Small     70K       No
  4     Yes       Medium    120K      No
  5     No        Large     95K       Yes
  6     No        Medium    60K       No
  7     Yes       Large     220K      No
  8     No        Small     85K       Yes
  9     No        Medium    75K       No
  10    No        Small     90K       Yes

Test Set
  Tid   Attrib1   Attrib2   Attrib3   Class
  11    No        Small     55K       ?
  12    Yes       Medium    80K       ?
  13    Yes       Large     110K      ?
  14    No        Small     95K       ?
  15    No        Large     67K       ?

Induction: a learning algorithm builds (learns) a model from the training set.
Deduction: the learned model is applied to the test set to predict the unknown class labels.
Many Algorithms
▪ Top-Down Induction of Decision Trees (TDIDT)
▪ ID3
▪ CART
▪ ASSISTANT
▪ C4.5
▪ J48
▪ …
What is ID3? (Iterative Dichotomiser 3)
▪ A mathematical algorithm for building the decision tree.
▪ Invented by J. Ross Quinlan in 1979.
▪ Uses information theory, introduced by Shannon in 1948.
▪ Builds the tree from the top down, with no backtracking.
▪ Information Gain (IG) is used to select the most useful attribute for classification.
The Process
▪ Classifies data using the attributes
▪ The tree consists of decision nodes and leaf nodes.
▪ Nodes can have two or more branches, which represent the values of the attribute tested.
▪ Leaf nodes produce a homogeneous result (classes).
The algorithm (ID3, C4.5 and CART)
❑ The tree is constructed in a top-down, recursive, divide-and-conquer manner (a code sketch follows below)
❑ Iterations
  ▪ At the start, all the training tuples are at the root
  ▪ Tuples are partitioned recursively based on selected attributes
  ▪ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
❑ Stopping conditions
  ▪ All samples for a given node belong to the same class
  ▪ There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  ▪ There are no samples left
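As an illustration only, here is a minimal recursive sketch of this top-down procedure in Python, using information gain to pick the splitting attribute; the function and variable names are my own, and details such as tie-breaking and pruning are omitted:

    import math
    from collections import Counter

    def entropy(labels):
        """Info(T): entropy of a list of class labels."""
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        """Gain(A) = Info(T) - Info_A(T) for the categorical attribute at index `attr`."""
        total = len(labels)
        remainder = 0.0
        for v in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == v]
            remainder += (len(subset) / total) * entropy(subset)
        return entropy(labels) - remainder

    def build_tree(rows, labels, attributes):
        # Stopping condition: all samples belong to the same class
        if len(set(labels)) == 1:
            return labels[0]
        # Stopping conditions: no attributes (or no samples) left -> majority vote
        if not attributes or not rows:
            return Counter(labels).most_common(1)[0][0]
        # Select the attribute with the highest information gain
        best = max(attributes, key=lambda a: info_gain(rows, labels, a))
        tree = {"attribute": best, "branches": {}}
        for v in set(row[best] for row in rows):
            idx = [i for i, row in enumerate(rows) if row[best] == v]
            tree["branches"][v] = build_tree([rows[i] for i in idx],
                                             [labels[i] for i in idx],
                                             [a for a in attributes if a != best])
        return tree

    # Example 2 training set from the slides: (Outlook, Company, Sailboat) -> Sail?
    rows = [("sunny", "big", "small"), ("sunny", "med", "small"), ("sunny", "med", "big"),
            ("sunny", "no", "small"), ("sunny", "big", "big"), ("rainy", "no", "small"),
            ("rainy", "med", "small"), ("rainy", "big", "big"), ("rainy", "no", "big"),
            ("rainy", "med", "big")]
    labels = ["yes", "yes", "yes", "yes", "yes", "no", "yes", "yes", "no", "no"]
    print(build_tree(rows, labels, attributes=[0, 1, 2]))
    # Splits on Outlook at the root, then Company and Sailboat, reproducing the tree of Example 2.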
Attribute Selection Measures
❑ An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition D
  Ideally
  ▪ Each resulting partition would be pure
  ▪ A pure partition is a partition containing tuples that all belong to the same class
❑ Attribute selection measures (splitting rules)
  ▪ Determine how the tuples at a given node are to be split
  ▪ Provide a ranking for each attribute describing the tuples
  ▪ The attribute with the highest score is chosen
  ▪ Determine a split point or a splitting subset
❑ Methods: Information Gain, Gain Ratio, Gini Index


Information Gain (ID3/C4.5)
▪ Select the attribute with the highest information gain
▪ This attribute:
  o minimizes the information needed to classify the tuples in the resulting partitions
  o reflects the least randomness or “impurity” in these partitions
Information Gain (ID3/C4.5)
▪ Assume there are two classes, P and N: C = {N, P}
  – Let the set of examples T contain p elements of class P and n elements of class N
  – The entropy, i.e., the amount of information needed to decide whether an arbitrary example in T belongs to P or N, is defined as:

    Info(T) = − (p / (p+n)) · log₂(p / (p+n)) − (n / (p+n)) · log₂(n / (p+n))

  or, equivalently,

    Info(T) = − Σ_j ( freq(Cj, T) / |T| ) · log₂( freq(Cj, T) / |T| ),   j = 1, 2
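A small helper (my own sketch, not from the slides) that evaluates this two-class formula; a perfectly balanced set gives 1 bit, a pure set gives 0:

    import math

    def info(p, n):
        """Two-class entropy Info(T) for p examples of class P and n of class N."""
        total = p + n
        result = 0.0
        for count in (p, n):
            if count:                      # 0 * log2(0) is taken as 0
                result -= (count / total) * math.log2(count / total)
        return result

    print(info(5, 5))    # 1.0  (maximum impurity)
    print(info(10, 0))   # 0.0  (pure partition)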
Information Gain in Decision Tree Induction
▪ For each attribute, compute the amount of information needed to arrive at an exact classification after partitioning using that attribute
▪ Assume that using attribute A, a set T will be partitioned into sets {T1, T2, …, Tv}
  – If Ti contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Ti, Ti ∈ Domain(A), is:

    Info_A(T) = Σ_{i ∈ Domain(A)} ( |Ti| / |T| ) · Info(Ti)

▪ The encoding information that would be gained by branching on A:

    Gain(A) = Info(T) − Info_A(T)
Information Gain in Decision Tree Induction
▪ Information gain (based on Shannon's work on information theory) is the expected reduction in the information requirements caused by knowing the value of A
▪ It measures the effective change in entropy after making a decision based on the value of an attribute
▪ The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N
▪ For decision trees, it is ideal to base decisions on the attribute that provides the largest change in entropy, i.e., the attribute with the highest gain
Split Information and Gain Ratio
▪ The split information value represents the potential information generated by splitting the training data set T into v partitions, corresponding to the v outcomes on attribute A:

    Split(T, A) = − Σ_{i ∈ Domain(A)} ( |Ti| / |T| ) · log₂( |Ti| / |T| )

▪ High split info: the partitions have more or less the same size (uniform)
▪ Low split info: a few partitions hold most of the tuples (peaks)
▪ The gain ratio of attribute A:

    Gain_Ratio(T, A) = Gain(A) / Split(T, A)

▪ The attribute with the maximum gain ratio is selected as the splitting attribute
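These two quantities written as code (a sketch; the helper names are mine), assuming the list of partition sizes |Ti| induced by the attribute and an already-computed information gain:

    import math

    def split_info(partition_sizes):
        """Split(T, A) computed from the sizes |Ti| of the partitions induced by A."""
        total = sum(partition_sizes)
        return -sum((s / total) * math.log2(s / total) for s in partition_sizes if s)

    def gain_ratio(gain, partition_sizes):
        """Gain_Ratio(T, A) = Gain(A) / Split(T, A)."""
        return gain / split_info(partition_sizes)

    # With the Connection attribute of the example below (partitions of sizes 4, 4 and 2):
    print(round(split_info([4, 4, 2]), 3))          # 1.522
    print(round(gain_ratio(0.761, [4, 4, 2]), 3))   # 0.5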
Decision Tree: Example

  Attributes                              Class
  Connection   Service   Period          Attack
  High         Web       Working days    Dos
  High         Web       Holidays        SQL injection
  High         Web       Working days    Dos
  High         Mail      Holidays        SQL injection
  Medium       Web       Working days    Dos
  Medium       Web       Holidays        SQL injection
  Medium       Mail      Working days    SQL injection
  Medium       Mail      Holidays        SQL injection
  Small        Mail      Working days    Passive
  Small        Mail      Holidays        Passive

  (Classes: C1 = Dos, C2 = SQL injection, C3 = Passive)
Decision Tree: Example
▪ Entropy of the whole training set (3 Dos, 5 SQL injection, 2 Passive out of 10 records):

    Info(T) = − (3/10)·log₂(3/10) − (5/10)·log₂(5/10) − (2/10)·log₂(2/10) = 1.485
Decision Tree: Using attribute Connection

    Info_Connection(T) = Σ_{i ∈ Domain(Connection)} ( |Ti| / |T| ) · Info(Ti),
    Domain(Connection) = {High, Medium, Small}

    Info(T_High)   = − (2/4)·log₂(2/4) − (2/4)·log₂(2/4) = 1        (2 Dos, 2 SQL injection)
    Info(T_Medium) = − (1/4)·log₂(1/4) − (3/4)·log₂(3/4) = 0.812    (1 Dos, 3 SQL injection)
    Info(T_Small)  = − (2/2)·log₂(2/2) = 0                          (2 Passive)

    Info_Connection(T) = (4/10)·Info(T_High) + (4/10)·Info(T_Medium) + (2/10)·Info(T_Small) = 0.725
Decision Tree: Using attribute Connection

    Gain(T, Connection) = Info(T) − Info_Connection(T) = 0.761
Decision Tree: Using attribute Connection

    SplitInfo(T, Connection) = − Σ_{i ∈ Domain(Connection)} ( |Ti| / |T| ) · log₂( |Ti| / |T| )
                             = − (4/10)·log₂(4/10) − (4/10)·log₂(4/10) − (2/10)·log₂(2/10) = 1.522

    Gain_Ratio(T, Connection) = 0.761 / 1.522 = 0.5
Decision Tree: Using attribute Period

    Info_Period(T) = Σ_{i ∈ Domain(Period)} ( |Ti| / |T| ) · Info(Ti),
    Domain(Period) = {Working days (WD), Holidays}

    Info(T_WD)       = − (3/5)·log₂(3/5) − (1/5)·log₂(1/5) − (1/5)·log₂(1/5) = 1.371   (3 Dos, 1 SQL injection, 1 Passive)
    Info(T_Holidays) = − (4/5)·log₂(4/5) − (1/5)·log₂(1/5) = 0.722                     (4 SQL injection, 1 Passive)

    Info_Period(T) = (5/10)·Info(T_WD) + (5/10)·Info(T_Holidays) = 1.046
Decision Tree: Using attribute Period

    Gain(T, Period) = Info(T) − Info_Period(T) = 1.485 − 1.046 = 0.439

    SplitInfo(T, Period) = − Σ_{i ∈ Domain(Period)} ( |Ti| / |T| ) · log₂( |Ti| / |T| )
                         = − (5/10)·log₂(5/10) − (5/10)·log₂(5/10) = 1

    Gain_Ratio(T, Period) = 0.439 / 1 = 0.439
Decision Tree: Using attribute Service

    Info_Service(T) = Σ_{i ∈ Domain(Service)} ( |Ti| / |T| ) · Info(Ti),
    Domain(Service) = {Web, Mail}

    Info(T_Web)  = − (3/5)·log₂(3/5) − (2/5)·log₂(2/5) = 0.971   (3 Dos, 2 SQL injection)
    Info(T_Mail) = − (3/5)·log₂(3/5) − (2/5)·log₂(2/5) = 0.971   (3 SQL injection, 2 Passive)

    Info_Service(T) = (5/10)·Info(T_Web) + (5/10)·Info(T_Mail) = 0.971
Decision Tree: Using attribute Service

    Gain(T, Service) = Info(T) − Info_Service(T) = 1.485 − 0.971 = 0.514

    SplitInfo(T, Service) = − Σ_{i ∈ Domain(Service)} ( |Ti| / |T| ) · log₂( |Ti| / |T| )
                          = − (5/10)·log₂(5/10) − (5/10)·log₂(5/10) = 1

    Gain_Ratio(T, Service) = 0.514 / 1 = 0.514
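These hand computations can be cross-checked with a short script (a sketch written for this example; the data below are transcribed from the training set above):

    import math
    from collections import Counter

    def info(labels):
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def gain_ratio(rows, labels, attr):
        total = len(labels)
        info_a = split = 0.0
        for v in set(r[attr] for r in rows):
            subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
            w = len(subset) / total
            info_a += w * info(subset)
            split -= w * math.log2(w)
        return (info(labels) - info_a) / split

    # (Connection, Service, Period) -> Attack
    rows = [("High", "Web", "WD"), ("High", "Web", "Hol"), ("High", "Web", "WD"),
            ("High", "Mail", "Hol"), ("Medium", "Web", "WD"), ("Medium", "Web", "Hol"),
            ("Medium", "Mail", "WD"), ("Medium", "Mail", "Hol"),
            ("Small", "Mail", "WD"), ("Small", "Mail", "Hol")]
    labels = ["Dos", "SQL", "Dos", "SQL", "Dos", "SQL", "SQL", "SQL", "Passive", "Passive"]

    print(round(info(labels), 3))                                  # 1.485
    for name, attr in [("Connection", 0), ("Period", 2), ("Service", 1)]:
        print(name, round(gain_ratio(rows, labels, attr), 3))
    # Connection ≈ 0.5, Period ≈ 0.439, Service ≈ 0.515; Service is highest,
    # matching the slide values (0.5, 0.439, 0.514 up to intermediate rounding).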
Decision Tree: Example (Level 1)

    Gain_Ratio(T, Connection) = 0.5
    Gain_Ratio(T, Period)     = 0.439
    Gain_Ratio(T, Service)    = 0.514

▪ Service has the highest gain ratio, so it is chosen as the root. Splitting on Service gives two partitions:

  Partition Service = Web:
  Connection   Service   Period         Attack
  High         Web       Working days   Dos
  High         Web       Holidays       SQL injection
  High         Web       Working days   Dos
  Medium       Web       Working days   Dos
  Medium       Web       Holidays       SQL injection

  Partition Service = Mail:
  Connection   Service   Period         Attack
  High         Mail      Holidays       SQL injection
  Medium       Mail      Working days   SQL injection
  Medium       Mail      Holidays       SQL injection
  Small        Mail      Working days   Passive
  Small        Mail      Holidays       Passive
Decision Tree: Example (Service = Web)

  Connection   Service   Period         Attack
  High         Web       Working days   Dos
  High         Web       Holidays       SQL injection
  High         Web       Working days   Dos
  Medium       Web       Working days   Dos
  Medium       Web       Holidays       SQL injection

    Info(T_Web) = Info(S) = − (3/5)·log₂(3/5) − (2/5)·log₂(2/5) = 0.971
Decision Tree: Example (Service = Web, splitting on Connection)

    Info_Connection(S_High)   = − (2/3)·log₂(2/3) − (1/3)·log₂(1/3) = 0.918
    Info_Connection(S_Medium) = − (1/2)·log₂(1/2) − (1/2)·log₂(1/2) = 1
    Info_Connection(S_Small)  = 0

    Info_Connection(S) = (3/5)·0.918 + (2/5)·1 + 0 = 0.951
Decision Tree: Example (Service = Web, splitting on Connection)

    Gain(S, Connection) = Info(S) − Info_Connection(S) = 0.971 − 0.951 = 0.02

    Split(S, Connection) = − (3/5)·log₂(3/5) − (2/5)·log₂(2/5) − 0 = 0.971

    Gain_Ratio(S, Connection) = 0.02 / 0.971 ≈ 0.02

  The computation continues in the same way for the attribute Period, which obtains the higher gain ratio and is therefore selected for this branch, as shown in the final tree.
Final Decision Tree: Example

  Service
   ├─ Web  → Period
   │          ├─ Working Days (WD) → Dos
   │          └─ Holidays          → SQL Injection
   └─ Mail → Connection
              ├─ High   → SQL Injection
              ├─ Medium → SQL Injection
              └─ Small  → Passive
Decision Tree: Example

  Service
   ├─ Web  → Period
   │          ├─ Working Days → C1 (Dos)
   │          └─ Holidays     → C2 (SQL injection)
   └─ Mail → Connection
              ├─ High   → C2 (SQL injection)
              ├─ Medium → C2 (SQL injection)
              └─ Small  → C3 (Passive)

Classify?

  Connection   Service   Period     Attack
  Medium       Mail      Holidays   ?
Classification rules
Extracting Classification Rules from the Tree
▪ Represent the knowledge in the form of IF-THEN rules
▪ One rule is created for each path from the root to a leaf
▪ Each attribute-value pair along a path forms a conjunction
▪ The leaf node holds the class prediction
▪ Rules are easier for humans to understand
Example (from the final tree above):
If (Service = Web) ∧ (Period = Working days) Then C1 (Dos)
If (Service = Web) ∧ (Period = Holidays) Then C2 (SQL injection)

Number of classification rules? (= the number of leaf nodes)
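As a rough sketch (the nested-dict tree representation and the function name are my own), the rules can be generated mechanically by walking every root-to-leaf path:

    def extract_rules(tree, conditions=()):
        """Yield one IF-THEN rule per root-to-leaf path of a nested-dict tree."""
        if not isinstance(tree, dict):              # leaf node: emit the accumulated rule
            yield "If " + " ^ ".join(conditions) + f" Then {tree}"
            return
        for value, subtree in tree["branches"].items():
            cond = f"({tree['attribute']} = {value})"
            yield from extract_rules(subtree, conditions + (cond,))

    # The final decision tree of the example, written as a nested dict
    tree = {"attribute": "Service", "branches": {
        "Web":  {"attribute": "Period", "branches": {
            "Working days": "Dos", "Holidays": "SQL injection"}},
        "Mail": {"attribute": "Connection", "branches": {
            "High": "SQL injection", "Medium": "SQL injection", "Small": "Passive"}}}}

    for rule in extract_rules(tree):
        print(rule)    # 5 rules, one per leaf node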


Summary
▪ Decision trees have a relatively faster learning speed than other methods
▪ They are convertible into simple and easy-to-understand classification rules
▪ Information Gain, Gain Ratio and the Gini Index are the most common attribute selection measures
