
CS 352 – Data Analysis and Visualization

Binary Classification with Visualization

Today’s Classification Topics

 What is Classification
 Class Imbalance Problem
 Performance Metrics:
 Confusion Matrix, Accuracy, Precision, Recall, F-Measure, Classification Report
 Rule Based Classifiers
 Decision Tree Classifiers
 Support Vector Machine
 Python Implementation

Classification: Definition

Given a collection of records (training set):
 Each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label
• x: attribute, predictor, independent variable, input
• y: class, response, dependent variable, output

Task:
 Learn a model that maps each attribute set x into one of the predefined class labels y
Examples of Classification Task

Task                         | Attribute set, x                                           | Class label, y
Categorizing email messages  | Features extracted from email message header and content   | spam or non-spam
Identifying tumor cells      | Features extracted from MRI scans                           | malignant or benign cells
Cataloging galaxies          | Features extracted from telescope images                    | elliptical, spiral, or irregular-shaped galaxies
Classification Workflow
General Approach for Building Classification Model

A learning algorithm is applied to the training set to induce a model (induction); the model is then applied to the test set to deduce the unknown class labels (deduction).

Training Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1   | Yes | Large  | 125K | No
2   | No  | Medium | 100K | No
3   | No  | Small  | 70K  | No
4   | Yes | Medium | 120K | No
5   | No  | Large  | 95K  | Yes
6   | No  | Medium | 60K  | No
7   | Yes | Large  | 220K | No
8   | No  | Small  | 85K  | Yes
9   | No  | Medium | 75K  | No
10  | No  | Small  | 90K  | Yes

Test Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No  | Small  | 55K  | ?
12  | Yes | Medium | 80K  | ?
13  | Yes | Large  | 110K | ?
14  | No  | Small  | 95K  | ?
15  | No  | Large  | 67K  | ?
Classification Techniques

Base Classifiers
 Rule-based Methods
 Decision Tree based Methods
 Support Vector Machines
 Nearest-neighbor
 Neural Networks
 Naïve Bayes and Bayesian Belief Networks

Ensemble Classifiers
 Boosting, Bagging, Random Forests
Class Imbalance Problem

Many classification problems have skewed classes (more records from one class than another):
 Credit card fraud
 Intrusion detection
 Defective products in manufacturing assembly
line
 Rare disease positive/negative cases
Challenges

 Evaluation measures such as accuracy are not well suited for imbalanced classes
 Detecting the rare class is like finding a needle in a haystack
Confusion Matrix

 Confusion Matrix:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     a (TP)      b (FN)
CLASS     Class=No      c (FP)      d (TN)

a: TP (true positive) - the model correctly predicts the positive class.


b: FN (false negative) - the model incorrectly predicts the negative class.
c: FP (false positive) - the model incorrectly predicts the positive class.
d: TN (true negative) – the model correctly predicts the negative class.
Accuracy

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     a (TP)      b (FN)
CLASS     Class=No      c (FP)      d (TN)

 Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Problem with Accuracy

Consider a 2-class problem:
 Number of Class NO examples = 990
 Number of Class YES examples = 10

If a model predicts everything to be class NO, accuracy is 990/1000 = 99%
 This is misleading because the model does not detect any class YES example
 Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects)
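To make this concrete, here is a minimal sketch (synthetic labels and scikit-learn metrics, not from the slides) showing that a model that always predicts NO reaches 99% accuracy while detecting none of the YES cases:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 990 negatives (0) and 10 positives (1)
y_true = np.array([0] * 990 + [1] * 10)

# A trivial "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print("Accuracy:", accuracy_score(y_true, y_pred))                  # 0.99
print("Recall (YES):", recall_score(y_true, y_pred, pos_label=1))   # 0.0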
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     a           b
CLASS     Class=No      c           d

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
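These measures (and the classification report listed in today's topics) can be computed with scikit-learn. A minimal sketch, assuming binary label arrays y_test and y_pred produced by any of the classifiers later in these slides:

from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, classification_report)

cm = confusion_matrix(y_test, y_pred)   # for 0/1 labels: [[TN, FP], [FN, TP]]
print("Confusion matrix:\n", cm)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F-measure:", f1_score(y_test, y_pred))

# Per-class precision, recall, and F1 in one report
print(classification_report(y_test, y_pred))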
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     4           6
CLASS     Class=No      21          969

Precision (p) = 4 / (4 + 21) = 0.16
Recall (r) = 4 / (4 + 6) = 0.4
F-measure (F) = (2 × 4) / (2 × 4 + 6 + 21) ≈ 0.23
Accuracy = 973 / 1000 = 0.973
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     10          0
CLASS     Class=No      10          980

Precision (p) = 10 / (10 + 10) = 0.5
Recall (r) = 10 / (10 + 0) = 1
F-measure (F) = (2 × 1 × 0.5) / (1 + 0.5) ≈ 0.67
Accuracy = 990 / 1000 = 0.99
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     1           9
CLASS     Class=No      0           990

Precision (p) = 1 / (1 + 0) = 1
Recall (r) = 1 / (1 + 9) = 0.1
F-measure (F) = (2 × 0.1 × 1) / (1 + 0.1) ≈ 0.18
Accuracy = 991 / 1000 = 0.991

Compared with the previous matrix, precision is now perfect but recall is poor; the F-measure penalizes both situations, while accuracy barely changes.
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     40          10
CLASS     Class=No      10          40

Precision (p) = 0.8
Recall (r) = 0.8
F-measure (F) = 0.8
Accuracy = 0.8
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     40          10
CLASS     Class=No      1000        4000

Precision (p) ≈ 0.04
Recall (r) = 0.8
F-measure (F) ≈ 0.08
Accuracy ≈ 0.8

Accuracy stays close to 0.8, but precision and F-measure expose how poorly the positive class is actually predicted.
Measures of Classification Performance

                        PREDICTED CLASS
                        Yes     No
ACTUAL    Yes           TP      FN
CLASS     No            FP      TN

α (alpha) is the probability that we reject the null hypothesis when it is true. This is a Type I error, or a false positive (FP).

β (beta) is the probability that we accept the null hypothesis when it is false. This is a Type II error, or a false negative (FN).
Rule-Based Classifier

 Classify records by using a collection of "if…then…" rules

 Rule: (Condition) → y
 where
• Condition is a conjunction of attribute tests
• y is the class label
 LHS: rule antecedent or condition
 RHS: rule consequent
 Examples of classification rules:
• (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
• (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No
Rule-based Classifier (Example)

Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier

 A rule R covers an instance x if the attributes of the instance satisfy the condition of the rule

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?

The rule R1 covers a hawk => Bird


The rule R3 covers the grizzly bear => Mammal
Rule Coverage and Accuracy
 Coverage of a rule:
  Fraction of records that satisfy the antecedent of the rule
 Accuracy of a rule:
  Fraction of records that satisfy the antecedent that also satisfy the consequent of the rule

Tid | Refund | Marital Status | Taxable Income | Class
1   | Yes | Single   | 125K | No
2   | No  | Married  | 100K | No
3   | No  | Single   | 70K  | No
4   | Yes | Married  | 120K | No
5   | No  | Divorced | 95K  | Yes
6   | No  | Married  | 60K  | No
7   | Yes | Divorced | 220K | No
8   | No  | Single   | 85K  | Yes
9   | No  | Married  | 75K  | No
10  | No  | Single   | 90K  | Yes

Example: (Status = Single) → No
Coverage = 40%, Accuracy = 50%
How does Rule-based Classifier Work?

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?

A lemur triggers rule R3, so it is classified as a mammal


A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
Characteristics of Rule Sets: Strategy 1

Mutually exclusive rules


 Every record is covered by at most one rule

Exhaustive rules
 Each record is covered by at least one rule
Characteristics of Rule Sets: Strategy 2

Rules are not mutually exclusive


 A record may trigger more than one rule
 Solution?
• Ordered rule set
• Unordered rule set – use voting schemes

Rules are not exhaustive


 A record may not trigger any rules
 Solution?
• Use a default class
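As an illustrative sketch (not from the slides) of an ordered rule set with a default class, applied to records shaped like the vertebrate examples above:

# Ordered rule set: each rule is (condition, class label); rules follow R1–R5 above.
rules = [
    (lambda r: r["Give Birth"] == "no" and r["Can Fly"] == "yes", "Birds"),        # R1
    (lambda r: r["Give Birth"] == "no" and r["Live in Water"] == "yes", "Fishes"), # R2
    (lambda r: r["Give Birth"] == "yes" and r["Blood Type"] == "warm", "Mammals"), # R3
    (lambda r: r["Give Birth"] == "no" and r["Can Fly"] == "no", "Reptiles"),      # R4
    (lambda r: r["Live in Water"] == "sometimes", "Amphibians"),                   # R5
]

def classify(record, rules, default="Unknown"):
    """Return the consequent of the first rule that covers the record; otherwise the default class."""
    for condition, label in rules:
        if condition(record):
            return label
    return default  # no rule triggered: fall back to the default class

hawk = {"Blood Type": "warm", "Give Birth": "no", "Can Fly": "yes", "Live in Water": "no"}
print(classify(hawk, rules))  # Birds (covered by R1)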
Building Classification Rules

Direct Method:
• Extract rules directly from data
• Examples: RIPPER, CN2, Holte’s 1R

Indirect Method:
• Extract rules from other classification models (e.g.
decision trees, neural networks, etc).
• Examples: C4.5rules
Decision Tree Classifier Algorithm

 A decision tree is a flowchart-like tree structure where
  an internal node represents a feature (or attribute),
  a branch represents a decision rule,
  and each leaf node represents the outcome.
 The topmost node in a decision tree is known as the root node.
 The tree learns to partition the data on the basis of attribute values.
 It partitions the data recursively; this is called recursive partitioning.
 This flowchart-like structure helps in decision making.
Example of a Decision Tree

Training Data:
ID | Home Owner | Marital Status | Annual Income | Defaulted Borrower
1  | Yes | Single   | 125K | No
2  | No  | Married  | 100K | No
3  | No  | Single   | 70K  | No
4  | Yes | Married  | 120K | No
5  | No  | Divorced | 95K  | Yes
6  | No  | Married  | 60K  | No
7  | Yes | Divorced | 220K | No
8  | No  | Single   | 85K  | Yes
9  | No  | Married  | 75K  | No
10 | No  | Single   | 90K  | Yes

Model: Decision Tree (splitting attributes: Home Owner, MarSt, Income)
Home Owner?
├─ Yes → NO
└─ No  → MarSt?
         ├─ Married → NO
         └─ Single, Divorced → Income?
                                ├─ < 80K → NO
                                └─ > 80K → YES
Another Example of Decision Tree

The same training data admits a different tree:

MarSt?
├─ Married → NO
└─ Single, Divorced → Home Owner?
                       ├─ Yes → NO
                       └─ No  → Income?
                                 ├─ < 80K → NO
                                 └─ > 80K → YES

There could be more than one tree that fits the same data!
Apply Model to Test Data

Test Data:
Home Owner | Marital Status | Annual Income | Defaulted Borrower
No         | Married        | 80K           | ?

Start from the root of the tree and follow the branch that matches each attribute value:
1. Home Owner = No → follow the "No" branch to the MarSt node.
2. Marital Status = Married → follow the "Married" branch and reach a leaf labeled NO.

Assign Defaulted Borrower = "No".
Decision Tree Classification Task

The same workflow with a decision tree as the model: a tree induction algorithm is applied to the training set (Tid 1–10 above) to learn a decision tree (induction); the tree is then applied to the test set (Tid 11–15) to deduce their class labels (deduction).
Decision Tree Induction

Many Algorithms:
 Hunt’s Algorithm (one of the earliest)
 CART
 ID3, C4.5
 SLIQ,SPRINT
General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t

• General Procedure:
  - If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  - If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

(The training data is the Home Owner / Marital Status / Annual Income / Defaulted Borrower table shown earlier.)
Hunt's Algorithm

Applied to the Defaulted Borrower training data shown earlier (leaf counts are given as (number of No records, number of Yes records)):

(a) Start with a single leaf: Defaulted = No  (7,3)

(b) Split on Home Owner:
    Home Owner?
    ├─ Yes → Defaulted = No  (3,0)
    └─ No  → Defaulted = No  (4,3)

(c) Split the impure node on Marital Status:
    Home Owner?
    ├─ Yes → Defaulted = No  (3,0)
    └─ No  → Marital Status?
             ├─ Married          → Defaulted = No   (3,0)
             └─ Single, Divorced → Defaulted = Yes  (1,3)

(d) Split the remaining impure node on Annual Income:
    Home Owner?
    ├─ Yes → Defaulted = No  (3,0)
    └─ No  → Marital Status?
             ├─ Married          → Defaulted = No  (3,0)
             └─ Single, Divorced → Annual Income?
                                    ├─ < 80K  → Defaulted = No   (1,0)
                                    └─ ≥ 80K  → Defaulted = Yes  (0,3)
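Below is a compact, illustrative sketch (not the textbook's exact algorithm) of the recursive-partitioning idea behind Hunt's algorithm; for simplicity it tries a fixed list of candidate attribute tests instead of searching for the best split:

from collections import Counter

def hunts(records, label, tests):
    """Recursively grow a tree. records: list of dicts; label: class key;
    tests: list of (name, predicate) attribute tests tried in order (simplified)."""
    classes = Counter(r[label] for r in records)
    # Stopping condition: all records in the same class (or no tests left) -> leaf node
    if len(classes) == 1 or not tests:
        return {"leaf": classes.most_common(1)[0][0], "counts": dict(classes)}
    # Use the first test that splits the records into two non-empty subsets
    for i, (name, pred) in enumerate(tests):
        covered = [r for r in records if pred(r)]
        rest = [r for r in records if not pred(r)]
        if covered and rest:
            remaining = tests[:i] + tests[i + 1:]
            return {"test": name,
                    "yes": hunts(covered, label, remaining),
                    "no": hunts(rest, label, remaining)}
    return {"leaf": classes.most_common(1)[0][0], "counts": dict(classes)}

# Example usage on an abbreviated version of the Defaulted Borrower data:
data = [
    {"HomeOwner": "Yes", "MarSt": "Single",   "Income": 125, "Defaulted": "No"},
    {"HomeOwner": "No",  "MarSt": "Married",  "Income": 100, "Defaulted": "No"},
    {"HomeOwner": "No",  "MarSt": "Single",   "Income": 70,  "Defaulted": "No"},
    {"HomeOwner": "No",  "MarSt": "Divorced", "Income": 95,  "Defaulted": "Yes"},
]
tests = [("HomeOwner = Yes", lambda r: r["HomeOwner"] == "Yes"),
         ("MarSt = Married", lambda r: r["MarSt"] == "Married"),
         ("Income < 80K",    lambda r: r["Income"] < 80)]
print(hunts(data, "Defaulted", tests))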
Methods for Expressing Test Conditions

Depends on attribute types


 Binary
 Nominal
 Ordinal
 Continuous

Depends on number of ways to split


 2-way split
 Multi-way split
Test Condition for Nominal Attributes

• Multi-way split: use as many partitions as distinct values.
  Marital Status → {Single}, {Divorced}, {Married}

• Binary split: divides the values into two subsets.
  Marital Status → {Married} vs {Single, Divorced}
              OR   {Single} vs {Married, Divorced}
              OR   {Single, Married} vs {Divorced}
Test Condition for Ordinal Attributes

• Multi-way split: use as many partitions as distinct values.
  Shirt Size → {Small}, {Medium}, {Large}, {Extra Large}

• Binary split: divides the values into two subsets, preserving the order property among attribute values.
  Shirt Size → {Small, Medium} vs {Large, Extra Large}
          OR   {Small} vs {Medium, Large, Extra Large}

  The grouping {Small, Large} vs {Medium, Extra Large} violates the order property.
Test Condition for Continuous Attributes

(i) Binary split:     Annual Income > 80K?  →  Yes / No
(ii) Multi-way split: Annual Income?  →  < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K


Splitting Based on Continuous Attributes

Different ways of handling:
 Discretization to form an ordinal categorical attribute
   Ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
   • Static – discretize once at the beginning
   • Dynamic – repeat at each node

 Binary Decision: (A < v) or (A ≥ v)
   • consider all possible splits and find the best cut
   • can be more compute intensive
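A brief sketch of the two discretization options mentioned above, using pandas on a hypothetical income column:

import pandas as pd

income = pd.Series([125, 100, 70, 120, 95, 60, 220, 85, 75, 90])  # values in K

# Equal-interval bucketing: 3 bins of equal width
equal_width = pd.cut(income, bins=3)

# Equal-frequency bucketing: 3 bins containing roughly the same number of records
equal_freq = pd.qcut(income, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())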
How to determine the Best Split

Before Splitting: 10 records of class C0, 10 records of class C1

Candidate splits:
• Gender:       Yes → C0: 6, C1: 4        No → C0: 4, C1: 6
• Car Type:     Family → C0: 1, C1: 3     Sports → C0: 8, C1: 0     Luxury → C0: 1, C1: 7
• Customer ID:  c1 … c20 → each child holds a single record (C0: 1, C1: 0 or C0: 0, C1: 1)

Which test condition is the best?
How to determine the Best Split

 Greedy approach:
 Nodes with purer class distribution are preferred

 Need a measure of node impurity:

  C0: 5, C1: 5  →  high degree of impurity
  C0: 9, C1: 1  →  low degree of impurity

Measures of Node Impurity

 Gini Index:               GINI(t) = 1 − Σj [p(j | t)]²

 Entropy:                  Entropy(t) = − Σj p(j | t) log₂ p(j | t)

 Misclassification error:  Error(t) = 1 − maxi P(i | t)
Finding the Best Split

1. Compute the impurity measure (P) before splitting
2. Compute the impurity measure (M) after splitting
    Compute the impurity measure of each child node
    M is the weighted impurity of the children
3. Choose the attribute test condition that produces the highest gain

   Gain = P – M

   or, equivalently, the lowest impurity measure after splitting (M)
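A small sketch (illustrative, not from the slides) of steps 1–3, computing the gain of the Gender split from the earlier example using the Gini index:

def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

parent = [10, 10]              # before splitting: 10 records of C0, 10 of C1
children = [[6, 4], [4, 6]]    # Gender split: class counts of the two child nodes

P = gini(parent)
n_total = sum(sum(c) for c in children)
M = sum(sum(c) / n_total * gini(c) for c in children)   # weighted impurity of the children

print("P =", P)                    # 0.5
print("M =", round(M, 2))          # 0.48
print("Gain =", round(P - M, 2))   # 0.02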
Measure of Impurity: GINI
 Gini Index for a given node t:

   GINI(t) = 1 − Σj [p(j | t)]²

   (NOTE: p(j | t) is the relative frequency of class j at node t.)

 For a 2-class problem with class counts x and y out of n records:

   GINI = 1 − (x/n)² − (y/n)²

   C1: 0, C2: 6 → Gini = 0.000
   C1: 1, C2: 5 → Gini = 0.278
   C1: 2, C2: 4 → Gini = 0.444
   C1: 3, C2: 3 → Gini = 0.500
Computing Gini Index of a Single Node
   GINI(t) = 1 − Σj [p(j | t)]²

 C1: 0, C2: 6    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                 Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0

 C1: 1, C2: 5    P(C1) = 1/6,  P(C2) = 5/6
                 Gini = 1 – (1/6)² – (5/6)² = 0.278

 C1: 2, C2: 4    P(C1) = 2/6,  P(C2) = 4/6
                 Gini = 1 – (2/6)² – (4/6)² = 0.444
Computing Entropy of a Single Node
   Entropy(t) = − Σj p(j | t) log₂ p(j | t)

 C1: 0, C2: 6    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                 Entropy = – 0 log₂ 0 – 1 log₂ 1 = – 0 – 0 = 0

 C1: 1, C2: 5    P(C1) = 1/6,  P(C2) = 5/6
                 Entropy = – (1/6) log₂ (1/6) – (5/6) log₂ (5/6) = 0.65

 C1: 2, C2: 4    P(C1) = 2/6,  P(C2) = 4/6
                 Entropy = – (2/6) log₂ (2/6) – (4/6) log₂ (4/6) = 0.92
Computing Error of a Single Node
   Error(t) = 1 − maxi P(i | t)

 C1: 0, C2: 6    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                 Error = 1 – max(0, 1) = 1 – 1 = 0

 C1: 1, C2: 5    P(C1) = 1/6,  P(C2) = 5/6
                 Error = 1 – max(1/6, 5/6) = 1 – 5/6 = 1/6

 C1: 2, C2: 4    P(C1) = 2/6,  P(C2) = 4/6
                 Error = 1 – max(2/6, 4/6) = 1 – 4/6 = 1/3
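For reference, a small self-contained sketch that computes all three impurity measures for the nodes above (the printed values match the slides):

import math

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def error(counts):
    n = sum(counts)
    return 1 - max(counts) / n

for node in ([0, 6], [1, 5], [2, 4]):
    print(node, round(gini(node), 3), round(entropy(node), 2), round(error(node), 3))
# [0, 6] 0.0    0.0   0.0
# [1, 5] 0.278  0.65  0.167
# [2, 4] 0.444  0.92  0.333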
Other Decision Tree Methods

 Pruning is a technique in machine learning and search algorithms that reduces the size of a decision tree by removing sections of the tree that are non-critical or redundant for classifying instances.
Support Vector Machine

• A Support Vector Machine (SVM) performs classification by finding


the hyperplane that maximizes the margin between the two classes.
The vectors (cases) that define the hyperplane are the support
vectors.
• What is the purpose of support vectors? Support vectors enable us to use non-linear classifiers and compute classification hyperplanes in large-dimensional spaces.
Support Vector Machines

• Find a linear hyperplane (decision boundary) that will separate the data.
• Many possible separating hyperplanes exist (e.g., B1 and B2 in the figures). Which one is better, B1 or B2? How do you define "better"?
• Find the hyperplane that maximizes the margin, i.e., the distance between the parallel boundaries (b11/b12 for B1, b21/b22 for B2) that pass through the closest points of each class ⇒ B1 is better than B2.
For the maximum-margin hyperplane B1:

Decision boundary:  w · x + b = 0
Margin boundaries:  w · x + b = +1   and   w · x + b = −1

f(x) = +1 if w · x + b ≥ 1
       −1 if w · x + b ≤ −1

Margin = 2 / ||w||
Linear SVM

 Linear model:

   f(x) = +1 if w · x + b ≥ 1
          −1 if w · x + b ≤ −1

 Learning the model is equivalent to determining the values of w and b
 How do we find w and b from the training data?
Maximum Margin

f(x, w, b) = sign(w · x – b)

The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a linear SVM, or LSVM).

Support vectors are those data points that the margin pushes up against.
SVM Parameter C

• SVM finds the best hyperplane by solving an optimization problem that tries to increase the distance of the hyperplane from the two classes while trying to make sure many training examples are classified properly.
• This tradeoff is controlled by a parameter called C.
• When the value of C is small, a large-margin hyperplane is chosen at the expense of a greater number of misclassifications.
• Conversely, when C is large, a smaller-margin hyperplane is chosen that tries to classify many more examples correctly.

The line corresponding to C = 100 (in the original figure) is not necessarily a good choice, because the lone blue point may be an outlier.
SVM Parameter Gamma (γ)

• What if the data is not separable by a hyperplane?
• In the original figure, the two classes represented by the red and blue dots are not linearly separable; the decision boundary (shown in black) is actually circular.

• In such a case, we use the kernel trick: a new dimension is added to the existing data so that, in the new space, the data is linearly separable.
• In the previous figure, a third dimension (z) is added to the data where, in the standard Gaussian form,

    z = e^(−γ (x² + y²))

• This expression is called a Gaussian Radial Basis Function, or a Radial Basis Function with a Gaussian kernel.
• The new 3-D data is separable by the plane containing the black circle. The parameter γ controls the amount of stretching in the z direction.
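A brief sketch (toy ring-shaped data, parameters chosen only for illustration) of using an RBF kernel and its gamma parameter in scikit-learn:

import numpy as np
from sklearn import svm

# Toy data: class 1 forms an inner cluster, class 0 a surrounding ring,
# so no straight line separates them in 2-D.
rng = np.random.RandomState(0)
inner = rng.normal(0, 0.3, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, 50)
ring = np.c_[2 * np.cos(angles), 2 * np.sin(angles)] + rng.normal(0, 0.1, size=(50, 2))
X = np.vstack([inner, ring])
y = np.array([1] * 50 + [0] * 50)

# RBF-kernel SVM; gamma controls how far the influence of each training point reaches
clf = svm.SVC(kernel='rbf', gamma=0.5, C=1.0)
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))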
Feature Extraction for SVM

• Selecting the most meaningful features is a crucial step in the


process of classification problems
• The selected set of features should be a small set whose values
efficiently discriminate among patterns of different classes, but
are similar for patterns within the same class.
• Features can be classified into two categories:
• Local features, which are usually geometric
• Global features, which are usually topological or statistical.
• https://medium.com/@dataturks/understanding-svms-for-image-classification-cf4f01232700 shows several methods of feature extraction using the OpenCV and Sklearn libraries
SVM for Multi-class Classification

• Support Vector Machine is used mainly for binary classification.
• However, it can be used for multiclass classification by using the One-vs-One technique or the One-vs-Rest technique.

• One-vs-Rest takes one class as positive and all the rest as negative and trains a classifier. So for data having n classes, it trains n classifiers.

• One-vs-One considers each binary pair of classes and trains a classifier on the subset of data containing those classes. So it trains a total of n*(n-1)/2 classifiers.
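A hedged sketch of both strategies using scikit-learn's wrappers (the iris dataset is used purely as a convenient 3-class example; any multi-class data would do):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

ovr = OneVsRestClassifier(SVC(kernel='linear', C=1.0)).fit(X_tr, y_tr)  # trains n = 3 classifiers
ovo = OneVsOneClassifier(SVC(kernel='linear', C=1.0)).fit(X_tr, y_tr)   # trains n*(n-1)/2 = 3 classifiers

print("One-vs-Rest accuracy:", ovr.score(X_te, y_te))
print("One-vs-One accuracy :", ovo.score(X_te, y_te))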
Decision Tree Classification in Python

Importing Required Libraries


Let's first load the required libraries.

# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier        # Decision Tree classifier
from sklearn.model_selection import train_test_split   # train_test_split function
from sklearn import metrics                             # scikit-learn metrics module for accuracy calculation

Decision Tree Classification in Python

#Load Data
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)

#Display headers of the data


pima.head()

#Feature Selection
Here, you need to divide the given columns into two types of variables: the dependent variable (target) and the independent variables (features).

#split dataset in features and target variable


feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable

Decision Tree Classification in Python

Splitting Data
To understand model performance, dividing the dataset into a training set and a test set is a
good strategy.

Let's split the dataset by using the function train_test_split(). You need to pass 3 parameters: features, target, and test set size.

# Split dataset into training set and test set


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# 70% training and 30% test

Decision Tree Classification in Python

Building Decision Tree Model


Let's create a Decision Tree Model using Scikit-learn.

# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree classifier
clf = clf.fit(X_train, y_train)

#Predict the response for test dataset


y_pred = clf.predict(X_test)

Evaluating the Model
Let's estimate how accurately the classifier can predict whether a patient has diabetes (the label column of the Pima Indians dataset).

Accuracy can be computed by comparing actual test set values and predicted values.

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Decision Tree Classification in Python

Visualizing Decision Trees: You can use Scikit-learn's export_graphviz function to display the tree within a Jupyter notebook. For plotting the tree, you also need to install graphviz and pydotplus.
pip install graphviz
pip install pydotplus
pip install six
export_graphviz converts the decision tree classifier into a dot file, and pydotplus converts this dot file to a PNG or to a form displayable in Jupyter.

from sklearn.tree import export_graphviz
from six import StringIO   # older tutorials use sklearn.externals.six, which was removed in newer scikit-learn
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
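Alternatively (a sketch assuming a reasonably recent scikit-learn and matplotlib), the tree can be drawn without graphviz using sklearn.tree.plot_tree:

import matplotlib.pyplot as plt
from sklearn import tree

fig, ax = plt.subplots(figsize=(20, 10))
tree.plot_tree(clf, feature_names=feature_cols, class_names=['0', '1'],
               filled=True, rounded=True, ax=ax)
plt.savefig('diabetes_tree.png')
plt.show()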
SVM Classification in Python

# SVM Algorithm - Training and Prediction


from sklearn import svm
classifier = svm.SVC(kernel='linear', C=1.0)
classifier.fit(X_train, y_train)

#Predict the response for test dataset


y_pred = classifier.predict(X_test)

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

from sklearn.metrics import confusion_matrix,accuracy_score


cm = confusion_matrix(y_test, y_pred)
ac = accuracy_score(y_test,y_pred)

Assignment - 04

 Please optimize the decision tree applied on the diabetes


dataset to achieve more accurate results
 Hint: You can change the parameters below of the DecisionTreeClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), for example via a grid search as sketched after this list:
 criterion
 splitter
 max_depth

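One possible approach (a sketch, not a required solution): search over these parameters with GridSearchCV, reusing X_train, y_train, X_test, and y_test from the earlier code:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': [3, 4, 5, 6, None],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test accuracy  :", search.best_estimator_.score(X_test, y_test))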
Assignment-04

• Please optimize the SVM Classifier applied on the diabetes dataset to


achieve more accurate classification results
• Hint: You can change the kernel parameter of the SVM classifier, for example:
• 'sigmoid'
• 'rbf'
• You can also alter the C=1.0 parameter to get a better accuracy score (one simple exploration is sketched below).
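A simple way to explore kernels and C values (a sketch, reusing the train/test split from the earlier code):

from sklearn import svm, metrics

# Try a few kernels and C values and compare accuracy on the test set
for kernel in ['linear', 'rbf', 'sigmoid']:
    for C in [0.1, 1.0, 10.0]:
        clf = svm.SVC(kernel=kernel, C=C)
        clf.fit(X_train, y_train)
        acc = metrics.accuracy_score(y_test, clf.predict(X_test))
        print(f"kernel={kernel:8s} C={C:<5} accuracy={acc:.3f}")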

• Last date of submission: 24th April 2021


References and Further Reading

 Chapter 08, Data Mining Concepts and Techniques, 3rd Edition
 https://www.datacamp.com/community/tutorials/decision-tree-classification-python
 https://www.kaggle.com/uciml/pima-indians-diabetes-database
 https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
 https://towardsdatascience.com/decision-tree-in-python-b433ae57fb93
 https://www.w3schools.com/python/python_ml_decision_tree.asp
References and Further Reading

 Chapter 08 and 09, Data Mining Concepts and Techniques, 3rd Edition
 https://vitalflux.com/accuracy-precision-recall-f1-score-python-example/
 https://blogs.oracle.com/ai-and-datascience/post/a-simple-guide-to-building-a-confusion-matrix
 https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
 https://datastoriesweb.wordpress.com/2017/06/11/classification-one-vs-rest-and-one-vs-one/
 https://www.learnopencv.com/support-vector-machines-svm
 https://towardsdatascience.com/https-medium-com-pupalerushikesh-svm-f4b42800e989
 https://medium.com/@dataturks/understanding-svms-for-image-classification-cf4f01232700
