
CS 352 – Data Analysis and Visualization

Binary Classification with Visualization

Today’s Classification Topics

 What is Classification
 Class Imbalance Problem
 Performance Metrics:
 Confusion Matrix, Accuracy, Precision, Recall, F-Measure, Classification Report
 Rule Based Classifiers
 Decision Tree Classifiers
 Support Vector Machine
 Python Implementation

Classification: Definition

Given a collection of records (training set):
 Each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label
• x: attribute, predictor, independent variable, input
• y: class, response, dependent variable, output

Task:
 Learn a model that maps each attribute set x into one of the predefined class labels y
Examples of Classification Task

Task                         | Attribute set, x                                           | Class label, y
Categorizing email messages  | Features extracted from email message header and content   | spam or non-spam
Identifying tumor cells      | Features extracted from MRI scans                           | malignant or benign cells
Cataloging galaxies          | Features extracted from telescope images                    | elliptical, spiral, or irregular-shaped galaxies
Classification Workflow
General Approach for Building Classification Model

A learning algorithm is applied to the training set to induce a model (induction); the model is then applied to the test set to deduce the unknown class labels (deduction).

Training Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1   | Yes | Large  | 125K | No
2   | No  | Medium | 100K | No
3   | No  | Small  | 70K  | No
4   | Yes | Medium | 120K | No
5   | No  | Large  | 95K  | Yes
6   | No  | Medium | 60K  | No
7   | Yes | Large  | 220K | No
8   | No  | Small  | 85K  | Yes
9   | No  | Medium | 75K  | No
10  | No  | Small  | 90K  | Yes

Test Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11  | No  | Small  | 55K  | ?
12  | Yes | Medium | 80K  | ?
13  | Yes | Large  | 110K | ?
14  | No  | Small  | 95K  | ?
15  | No  | Large  | 67K  | ?
Classification Techniques

Base Classifiers
 Rule-based Methods
 Decision Tree based Methods
 Support Vector Machines
 Nearest-neighbor
 Neural Networks
 Naïve Bayes and Bayesian Belief Networks

Ensemble Classifiers
 Boosting, Bagging, Random Forests
Class Imbalance Problem

Many classification problems have skewed classes (more records from one class than another):
 Credit card fraud
 Intrusion detection
 Defective products in manufacturing assembly
line
 Rare disease positive/negative cases
Challenges

 Evaluation measures such as accuracy are not well suited for imbalanced classes
 Detecting the rare class is like finding a needle in a haystack
Confusion Matrix

 Confusion Matrix:
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     a (TP)      b (FN)
CLASS     Class=No      c (FP)      d (TN)

a: TP (true positive) - the model correctly predicts the positive class.


b: FN (false negative) - the model incorrectly predicts the negative class.
c: FP (false positive) - the model incorrectly predicts the positive class.
d: TN (true negative) – the model correctly predicts the negative class.
Accuracy

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     a (TP)      b (FN)
CLASS     Class=No      c (FP)      d (TN)

 Most widely-used metric:

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Problem with Accuracy

Consider a 2-class problem:
 Number of Class NO examples = 990
 Number of Class YES examples = 10

If a model predicts everything to be class NO, accuracy is 990/1000 = 99%
 This is misleading because the model does not detect any class YES example
 Detecting the rare class is usually more interesting (e.g., frauds, intrusions, defects)
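To make this concrete, here is a minimal sketch (synthetic labels and scikit-learn metrics, not from the slides) showing that a model that always predicts NO reaches 99% accuracy while detecting none of the YES cases:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 990 negatives (0) and 10 positives (1)
y_true = np.array([0] * 990 + [1] * 10)

# A trivial "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print("Accuracy:", accuracy_score(y_true, y_pred))                  # 0.99
print("Recall (YES):", recall_score(y_true, y_pred, pos_label=1))   # 0.0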
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     a           b
CLASS     Class=No      c           d

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
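These measures (and the classification report listed in today's topics) can be computed with scikit-learn. A minimal sketch, assuming binary label arrays y_test and y_pred produced by any of the classifiers later in these slides:

from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, classification_report)

cm = confusion_matrix(y_test, y_pred)   # for 0/1 labels: [[TN, FP], [FN, TP]]
print("Confusion matrix:\n", cm)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F-measure:", f1_score(y_test, y_pred))

# Per-class precision, recall, and F1 in one report
print(classification_report(y_test, y_pred))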
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     4           6
CLASS     Class=No      21          969

Precision (p) = 4 / (4 + 21) = 0.16
Recall (r) = 4 / (4 + 6) = 0.4
F-measure (F) = (2 × 4) / (2 × 4 + 6 + 21) ≈ 0.23
Accuracy = 973 / 1000 = 0.973
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     10          0
CLASS     Class=No      10          980

Precision (p) = 10 / (10 + 10) = 0.5
Recall (r) = 10 / (10 + 0) = 1
F-measure (F) = (2 × 1 × 0.5) / (1 + 0.5) ≈ 0.67
Accuracy = 990 / 1000 = 0.99
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     1           9
CLASS     Class=No      0           990

Precision (p) = 1 / (1 + 0) = 1
Recall (r) = 1 / (1 + 9) = 0.1
F-measure (F) = (2 × 0.1 × 1) / (1 + 0.1) ≈ 0.18
Accuracy = 991 / 1000 = 0.991

Compared with the previous matrix, precision is now perfect but recall is poor; the F-measure penalizes both situations, while accuracy barely changes.
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     40          10
CLASS     Class=No      10          40

Precision (p) = 0.8
Recall (r) = 0.8
F-measure (F) = 0.8
Accuracy = 0.8
Alternative Measures

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     40          10
CLASS     Class=No      1000        4000

Precision (p) ≈ 0.04
Recall (r) = 0.8
F-measure (F) ≈ 0.08
Accuracy ≈ 0.8

Accuracy stays close to 0.8, but precision and F-measure expose how poorly the positive class is actually predicted.
Measures of Classification Performance

                        PREDICTED CLASS
                        Yes     No
ACTUAL    Yes           TP      FN
CLASS     No            FP      TN

α (alpha) is the probability that we reject the null hypothesis when it is true. This is a Type I error, or a false positive (FP).

β (beta) is the probability that we accept the null hypothesis when it is false. This is a Type II error, or a false negative (FN).
Rule-Based Classifier

 Classify records by using a collection of "if…then…" rules

 Rule: (Condition) → y
 where
• Condition is a conjunction of attribute tests
• y is the class label
 LHS: rule antecedent or condition
 RHS: rule consequent
 Examples of classification rules:
• (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
• (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No
Rule-based Classifier (Example)

Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier

 A rule R covers an instance x if the attributes of the instance satisfy the condition of the rule

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?

The rule R1 covers a hawk => Bird


The rule R3 covers the grizzly bear => Mammal
Rule Coverage and Accuracy
 Coverage of a rule:
  Fraction of records that satisfy the antecedent of the rule
 Accuracy of a rule:
  Fraction of records that satisfy the antecedent that also satisfy the consequent of the rule

Tid | Refund | Marital Status | Taxable Income | Class
1   | Yes | Single   | 125K | No
2   | No  | Married  | 100K | No
3   | No  | Single   | 70K  | No
4   | Yes | Married  | 120K | No
5   | No  | Divorced | 95K  | Yes
6   | No  | Married  | 60K  | No
7   | Yes | Divorced | 220K | No
8   | No  | Single   | 85K  | Yes
9   | No  | Married  | 75K  | No
10  | No  | Single   | 90K  | Yes

Example: (Status = Single) → No
Coverage = 40%, Accuracy = 50%
How does Rule-based Classifier Work?

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?

A lemur triggers rule R3, so it is classified as a mammal


A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
Characteristics of Rule Sets: Strategy 1

Mutually exclusive rules


 Every record is covered by at most one rule

Exhaustive rules
 Each record is covered by at least one rule
Characteristics of Rule Sets: Strategy 2

Rules are not mutually exclusive


 A record may trigger more than one rule
 Solution?
• Ordered rule set
• Unordered rule set – use voting schemes

Rules are not exhaustive


 A record may not trigger any rules
 Solution?
• Use a default class
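As an illustrative sketch (not from the slides) of an ordered rule set with a default class, applied to records shaped like the vertebrate examples above:

# Ordered rule set: each rule is (condition, class label); rules follow R1–R5 above.
rules = [
    (lambda r: r["Give Birth"] == "no" and r["Can Fly"] == "yes", "Birds"),        # R1
    (lambda r: r["Give Birth"] == "no" and r["Live in Water"] == "yes", "Fishes"), # R2
    (lambda r: r["Give Birth"] == "yes" and r["Blood Type"] == "warm", "Mammals"), # R3
    (lambda r: r["Give Birth"] == "no" and r["Can Fly"] == "no", "Reptiles"),      # R4
    (lambda r: r["Live in Water"] == "sometimes", "Amphibians"),                   # R5
]

def classify(record, rules, default="Unknown"):
    """Return the consequent of the first rule that covers the record; otherwise the default class."""
    for condition, label in rules:
        if condition(record):
            return label
    return default  # no rule triggered: fall back to the default class

hawk = {"Blood Type": "warm", "Give Birth": "no", "Can Fly": "yes", "Live in Water": "no"}
print(classify(hawk, rules))  # Birds (covered by R1)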
Building Classification Rules

Direct Method:
• Extract rules directly from data
• Examples: RIPPER, CN2, Holte’s 1R

Indirect Method:
• Extract rules from other classification models (e.g.
decision trees, neural networks, etc).
• Examples: C4.5rules
Decision Tree Classifier Algorithm

 A decision tree is a flowchart-like tree structure where
  an internal node represents a feature (or attribute),
  a branch represents a decision rule,
  and each leaf node represents the outcome.
 The topmost node in a decision tree is known as the root node.
 The tree learns to partition the data on the basis of attribute values.
 It partitions the data recursively; this is called recursive partitioning.
 This flowchart-like structure helps in decision making.
Example of a Decision Tree

Training Data:
ID | Home Owner | Marital Status | Annual Income | Defaulted Borrower
1  | Yes | Single   | 125K | No
2  | No  | Married  | 100K | No
3  | No  | Single   | 70K  | No
4  | Yes | Married  | 120K | No
5  | No  | Divorced | 95K  | Yes
6  | No  | Married  | 60K  | No
7  | Yes | Divorced | 220K | No
8  | No  | Single   | 85K  | Yes
9  | No  | Married  | 75K  | No
10 | No  | Single   | 90K  | Yes

Model: Decision Tree (splitting attributes: Home Owner, MarSt, Income)
Home Owner?
├─ Yes → NO
└─ No  → MarSt?
         ├─ Married → NO
         └─ Single, Divorced → Income?
                                ├─ < 80K → NO
                                └─ > 80K → YES
Another Example of Decision Tree

The same training data admits a different tree:

MarSt?
├─ Married → NO
└─ Single, Divorced → Home Owner?
                       ├─ Yes → NO
                       └─ No  → Income?
                                 ├─ < 80K → NO
                                 └─ > 80K → YES

There could be more than one tree that fits the same data!
Apply Model to Test Data

Test Data:
Home Owner | Marital Status | Annual Income | Defaulted Borrower
No         | Married        | 80K           | ?

Start from the root of the tree and follow the branch that matches each attribute value:
1. Home Owner = No → follow the "No" branch to the MarSt node.
2. Marital Status = Married → follow the "Married" branch and reach a leaf labeled NO.

Assign Defaulted Borrower = "No".
Decision Tree Classification Task

The same workflow with a decision tree as the model: a tree induction algorithm is applied to the training set (Tid 1–10 above) to learn a decision tree (induction); the tree is then applied to the test set (Tid 11–15) to deduce their class labels (deduction).
Decision Tree Induction

Many Algorithms:
 Hunt’s Algorithm (one of the earliest)
 CART
 ID3, C4.5
 SLIQ,SPRINT
General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t

• General Procedure:
  - If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  - If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

(The training data is the Home Owner / Marital Status / Annual Income / Defaulted Borrower table shown earlier.)
Hunt's Algorithm

Applied to the Defaulted Borrower training data shown earlier (leaf counts are given as (number of No records, number of Yes records)):

(a) Start with a single leaf: Defaulted = No  (7,3)

(b) Split on Home Owner:
    Home Owner?
    ├─ Yes → Defaulted = No  (3,0)
    └─ No  → Defaulted = No  (4,3)

(c) Split the impure node on Marital Status:
    Home Owner?
    ├─ Yes → Defaulted = No  (3,0)
    └─ No  → Marital Status?
             ├─ Married          → Defaulted = No   (3,0)
             └─ Single, Divorced → Defaulted = Yes  (1,3)

(d) Split the remaining impure node on Annual Income:
    Home Owner?
    ├─ Yes → Defaulted = No  (3,0)
    └─ No  → Marital Status?
             ├─ Married          → Defaulted = No  (3,0)
             └─ Single, Divorced → Annual Income?
                                    ├─ < 80K  → Defaulted = No   (1,0)
                                    └─ ≥ 80K  → Defaulted = Yes  (0,3)
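Below is a compact, illustrative sketch (not the textbook's exact algorithm) of the recursive-partitioning idea behind Hunt's algorithm; for simplicity it tries a fixed list of candidate attribute tests instead of searching for the best split:

from collections import Counter

def hunts(records, label, tests):
    """Recursively grow a tree. records: list of dicts; label: class key;
    tests: list of (name, predicate) attribute tests tried in order (simplified)."""
    classes = Counter(r[label] for r in records)
    # Stopping condition: all records in the same class (or no tests left) -> leaf node
    if len(classes) == 1 or not tests:
        return {"leaf": classes.most_common(1)[0][0], "counts": dict(classes)}
    # Use the first test that splits the records into two non-empty subsets
    for i, (name, pred) in enumerate(tests):
        covered = [r for r in records if pred(r)]
        rest = [r for r in records if not pred(r)]
        if covered and rest:
            remaining = tests[:i] + tests[i + 1:]
            return {"test": name,
                    "yes": hunts(covered, label, remaining),
                    "no": hunts(rest, label, remaining)}
    return {"leaf": classes.most_common(1)[0][0], "counts": dict(classes)}

# Example usage on an abbreviated version of the Defaulted Borrower data:
data = [
    {"HomeOwner": "Yes", "MarSt": "Single",   "Income": 125, "Defaulted": "No"},
    {"HomeOwner": "No",  "MarSt": "Married",  "Income": 100, "Defaulted": "No"},
    {"HomeOwner": "No",  "MarSt": "Single",   "Income": 70,  "Defaulted": "No"},
    {"HomeOwner": "No",  "MarSt": "Divorced", "Income": 95,  "Defaulted": "Yes"},
]
tests = [("HomeOwner = Yes", lambda r: r["HomeOwner"] == "Yes"),
         ("MarSt = Married", lambda r: r["MarSt"] == "Married"),
         ("Income < 80K",    lambda r: r["Income"] < 80)]
print(hunts(data, "Defaulted", tests))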
Methods for Expressing Test Conditions

Depends on attribute types


 Binary
 Nominal
 Ordinal
 Continuous

Depends on number of ways to split


 2-way split
 Multi-way split
Test Condition for Nominal Attributes

• Multi-way split: use as many partitions as distinct values.
  Marital Status → {Single}, {Divorced}, {Married}

• Binary split: divides the values into two subsets.
  Marital Status → {Married} vs {Single, Divorced}
              OR   {Single} vs {Married, Divorced}
              OR   {Single, Married} vs {Divorced}
Test Condition for Ordinal Attributes

• Multi-way split: use as many partitions as distinct values.
  Shirt Size → {Small}, {Medium}, {Large}, {Extra Large}

• Binary split: divides the values into two subsets, preserving the order property among attribute values.
  Shirt Size → {Small, Medium} vs {Large, Extra Large}
          OR   {Small} vs {Medium, Large, Extra Large}

  The grouping {Small, Large} vs {Medium, Extra Large} violates the order property.
Test Condition for Continuous Attributes

(i) Binary split:     Annual Income > 80K?  →  Yes / No
(ii) Multi-way split: Annual Income?  →  < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K


Splitting Based on Continuous Attributes

Different ways of handling:
 Discretization to form an ordinal categorical attribute
   Ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
   • Static – discretize once at the beginning
   • Dynamic – repeat at each node

 Binary Decision: (A < v) or (A ≥ v)
   • consider all possible splits and find the best cut
   • can be more compute intensive
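A brief sketch of the two discretization options mentioned above, using pandas on a hypothetical income column:

import pandas as pd

income = pd.Series([125, 100, 70, 120, 95, 60, 220, 85, 75, 90])  # values in K

# Equal-interval bucketing: 3 bins of equal width
equal_width = pd.cut(income, bins=3)

# Equal-frequency bucketing: 3 bins containing roughly the same number of records
equal_freq = pd.qcut(income, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())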
How to determine the Best Split

Before Splitting: 10 records of class C0, 10 records of class C1

Candidate splits:
• Gender:       Yes → C0: 6, C1: 4        No → C0: 4, C1: 6
• Car Type:     Family → C0: 1, C1: 3     Sports → C0: 8, C1: 0     Luxury → C0: 1, C1: 7
• Customer ID:  c1 … c20 → each child holds a single record (C0: 1, C1: 0 or C0: 0, C1: 1)

Which test condition is the best?
How to determine the Best Split

 Greedy approach:
 Nodes with purer class distribution are preferred

 Need a measure of node impurity:

  C0: 5, C1: 5  →  high degree of impurity
  C0: 9, C1: 1  →  low degree of impurity

Measures of Node Impurity

 Gini Index:               GINI(t) = 1 − Σj [p(j | t)]²

 Entropy:                  Entropy(t) = − Σj p(j | t) log₂ p(j | t)

 Misclassification error:  Error(t) = 1 − maxi P(i | t)
Finding the Best Split

1. Compute the impurity measure (P) before splitting
2. Compute the impurity measure (M) after splitting
    Compute the impurity measure of each child node
    M is the weighted impurity of the children
3. Choose the attribute test condition that produces the highest gain

   Gain = P – M

   or, equivalently, the lowest impurity measure after splitting (M)
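A small sketch (illustrative, not from the slides) of steps 1–3, computing the gain of the Gender split from the earlier example using the Gini index:

def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

parent = [10, 10]              # before splitting: 10 records of C0, 10 of C1
children = [[6, 4], [4, 6]]    # Gender split: class counts of the two child nodes

P = gini(parent)
n_total = sum(sum(c) for c in children)
M = sum(sum(c) / n_total * gini(c) for c in children)   # weighted impurity of the children

print("P =", P)                    # 0.5
print("M =", round(M, 2))          # 0.48
print("Gain =", round(P - M, 2))   # 0.02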
Measure of Impurity: GINI
 Gini Index for a given node t:

   GINI(t) = 1 − Σj [p(j | t)]²

   (NOTE: p(j | t) is the relative frequency of class j at node t.)

 For a 2-class problem with class counts x and y out of n records:

   GINI = 1 − (x/n)² − (y/n)²

   C1: 0, C2: 6 → Gini = 0.000
   C1: 1, C2: 5 → Gini = 0.278
   C1: 2, C2: 4 → Gini = 0.444
   C1: 3, C2: 3 → Gini = 0.500
Computing Gini Index of a Single Node
   GINI(t) = 1 − Σj [p(j | t)]²

 C1: 0, C2: 6    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                 Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0

 C1: 1, C2: 5    P(C1) = 1/6,  P(C2) = 5/6
                 Gini = 1 – (1/6)² – (5/6)² = 0.278

 C1: 2, C2: 4    P(C1) = 2/6,  P(C2) = 4/6
                 Gini = 1 – (2/6)² – (4/6)² = 0.444
Computing Entropy of a Single Node
   Entropy(t) = − Σj p(j | t) log₂ p(j | t)

 C1: 0, C2: 6    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                 Entropy = – 0 log₂ 0 – 1 log₂ 1 = – 0 – 0 = 0

 C1: 1, C2: 5    P(C1) = 1/6,  P(C2) = 5/6
                 Entropy = – (1/6) log₂ (1/6) – (5/6) log₂ (5/6) = 0.65

 C1: 2, C2: 4    P(C1) = 2/6,  P(C2) = 4/6
                 Entropy = – (2/6) log₂ (2/6) – (4/6) log₂ (4/6) = 0.92
Computing Error of a Single Node
   Error(t) = 1 − maxi P(i | t)

 C1: 0, C2: 6    P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                 Error = 1 – max(0, 1) = 1 – 1 = 0

 C1: 1, C2: 5    P(C1) = 1/6,  P(C2) = 5/6
                 Error = 1 – max(1/6, 5/6) = 1 – 5/6 = 1/6

 C1: 2, C2: 4    P(C1) = 2/6,  P(C2) = 4/6
                 Error = 1 – max(2/6, 4/6) = 1 – 4/6 = 1/3
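For reference, a small self-contained sketch that computes all three impurity measures for the nodes above (the printed values match the slides):

import math

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def error(counts):
    n = sum(counts)
    return 1 - max(counts) / n

for node in ([0, 6], [1, 5], [2, 4]):
    print(node, round(gini(node), 3), round(entropy(node), 2), round(error(node), 3))
# [0, 6] 0.0    0.0   0.0
# [1, 5] 0.278  0.65  0.167
# [2, 4] 0.444  0.92  0.333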
Other Decision Tree Methods

 Pruning is a technique in machine learning and search algorithms that reduces the size of a decision tree by removing sections of the tree that are non-critical or redundant for classifying instances.
Support Vector Machine

• A Support Vector Machine (SVM) performs classification by finding


the hyperplane that maximizes the margin between the two classes.
The vectors (cases) that define the hyperplane are the support
vectors.
• What is the purpose of support vectors? Support vectors enable us to use non-linear classifiers and compute classification hyperplanes in large-dimensional spaces.
Support Vector Machines

• Find a linear hyperplane (decision boundary) that will separate the data.
• Many possible separating hyperplanes exist (e.g., B1 and B2 in the figures). Which one is better, B1 or B2? How do you define "better"?
• Find the hyperplane that maximizes the margin, i.e., the distance between the parallel boundaries (b11/b12 for B1, b21/b22 for B2) that pass through the closest points of each class ⇒ B1 is better than B2.
For the maximum-margin hyperplane B1:

Decision boundary:  w · x + b = 0
Margin boundaries:  w · x + b = +1   and   w · x + b = −1

f(x) = +1 if w · x + b ≥ 1
       −1 if w · x + b ≤ −1

Margin = 2 / ||w||
Linear SVM

 Linear model:

   f(x) = +1 if w · x + b ≥ 1
          −1 if w · x + b ≤ −1

 Learning the model is equivalent to determining the values of w and b
 How do we find w and b from the training data?
Maximum Margin

f(x, w, b) = sign(w · x – b)

The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called a linear SVM, or LSVM).

Support vectors are those data points that the margin pushes up against.
SVM Parameter C

• SVM finds the best hyperplane by solving an optimization problem that tries to increase the distance of the hyperplane from the two classes while trying to make sure many training examples are classified properly.
• This tradeoff is controlled by a parameter called C.
• When the value of C is small, a large-margin hyperplane is chosen at the expense of a greater number of misclassifications.
• Conversely, when C is large, a smaller-margin hyperplane is chosen that tries to classify many more examples correctly.

The line corresponding to C = 100 (in the original figure) is not necessarily a good choice, because the lone blue point may be an outlier.
SVM Parameter Gamma (γ)

• What if the data is not separable by a hyperplane?
• In the original figure, the two classes represented by the red and blue dots are not linearly separable; the decision boundary (shown in black) is actually circular.

• In such a case, we use the kernel trick: a new dimension is added to the existing data so that, in the new space, the data is linearly separable.
• In the previous figure, a third dimension (z) is added to the data where, in the standard Gaussian form,

    z = e^(−γ (x² + y²))

• This expression is called a Gaussian Radial Basis Function, or a Radial Basis Function with a Gaussian kernel.
• The new 3-D data is separable by the plane containing the black circle. The parameter γ controls the amount of stretching in the z direction.
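A brief sketch (toy ring-shaped data, parameters chosen only for illustration) of using an RBF kernel and its gamma parameter in scikit-learn:

import numpy as np
from sklearn import svm

# Toy data: class 1 forms an inner cluster, class 0 a surrounding ring,
# so no straight line separates them in 2-D.
rng = np.random.RandomState(0)
inner = rng.normal(0, 0.3, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, 50)
ring = np.c_[2 * np.cos(angles), 2 * np.sin(angles)] + rng.normal(0, 0.1, size=(50, 2))
X = np.vstack([inner, ring])
y = np.array([1] * 50 + [0] * 50)

# RBF-kernel SVM; gamma controls how far the influence of each training point reaches
clf = svm.SVC(kernel='rbf', gamma=0.5, C=1.0)
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))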
Feature Extraction for SVM

• Selecting the most meaningful features is a crucial step in the


process of classification problems
• The selected set of features should be a small set whose values
efficiently discriminate among patterns of different classes, but
are similar for patterns within the same class.
• Features can be classified into two categories:
• Local features, which are usually geometric
• Global features, which are usually topological or statistical.
• https://medium.com/@dataturks/understanding-svms-for-image-classification-cf4f01232700 shows several methods of feature extraction using the OpenCV and Sklearn libraries
SVM for Multi-class Classification

• Support Vector Machine is used mainly for binary classification.
• However, it can be used for multiclass classification by using the One-vs-One technique or the One-vs-Rest technique.

• One-vs-Rest takes one class as positive and all the rest as negative and trains a classifier. So for data having n classes, it trains n classifiers.

• One-vs-One considers each binary pair of classes and trains a classifier on the subset of data containing those classes. So it trains a total of n*(n-1)/2 classifiers.
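A hedged sketch of both strategies using scikit-learn's wrappers (the iris dataset is used purely as a convenient 3-class example; any multi-class data would do):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

ovr = OneVsRestClassifier(SVC(kernel='linear', C=1.0)).fit(X_tr, y_tr)  # trains n = 3 classifiers
ovo = OneVsOneClassifier(SVC(kernel='linear', C=1.0)).fit(X_tr, y_tr)   # trains n*(n-1)/2 = 3 classifiers

print("One-vs-Rest accuracy:", ovr.score(X_te, y_te))
print("One-vs-One accuracy :", ovo.score(X_te, y_te))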
Decision Tree Classification in Python

Importing Required Libraries


Let's first load the required libraries.

# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier        # Decision Tree classifier
from sklearn.model_selection import train_test_split   # train_test_split function
from sklearn import metrics                             # scikit-learn metrics module for accuracy calculation

Decision Tree Classification in Python

#Load Data
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)

#Display headers of the data


pima.head()

#Feature Selection
Here, you need to divide the given columns into two types of variables: the dependent variable (target) and the independent variables (features).

#split dataset in features and target variable


feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable

Decision Tree Classification in Python

Splitting Data
To understand model performance, dividing the dataset into a training set and a test set is a
good strategy.

Let's split the dataset by using the function train_test_split(). You need to pass 3 parameters: features, target, and test set size.

# Split dataset into training set and test set


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# 70% training and 30% test

Decision Tree Classification in Python

Building Decision Tree Model


Let's create a Decision Tree Model using Scikit-learn.

# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree classifier
clf = clf.fit(X_train, y_train)

#Predict the response for test dataset


y_pred = clf.predict(X_test)

Evaluating the Model
Let's estimate how accurately the classifier can predict whether a patient has diabetes (the label column of the Pima Indians dataset).

Accuracy can be computed by comparing actual test set values and predicted values.

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Decision Tree Classification in Python

Visualizing Decision Trees: You can use Scikit-learn's export_graphviz function to display the tree within a Jupyter notebook. For plotting the tree, you also need to install graphviz and pydotplus.
pip install graphviz
pip install pydotplus
pip install six
export_graphviz converts the decision tree classifier into a dot file, and pydotplus converts this dot file to a PNG or to a form displayable in Jupyter.

from sklearn.tree import export_graphviz
from six import StringIO   # older tutorials use sklearn.externals.six, which was removed in newer scikit-learn
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
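Alternatively (a sketch assuming a reasonably recent scikit-learn and matplotlib), the tree can be drawn without graphviz using sklearn.tree.plot_tree:

import matplotlib.pyplot as plt
from sklearn import tree

fig, ax = plt.subplots(figsize=(20, 10))
tree.plot_tree(clf, feature_names=feature_cols, class_names=['0', '1'],
               filled=True, rounded=True, ax=ax)
plt.savefig('diabetes_tree.png')
plt.show()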
SVM Classification in Python

# SVM Algorithm - Training and Prediction


from sklearn import svm
classifier = svm.SVC(kernel='linear', C=1.0)
classifier.fit(X_train, y_train)

#Predict the response for test dataset


y_pred = classifier.predict(X_test)

# Model Accuracy, how often is the classifier correct?


print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

from sklearn.metrics import confusion_matrix,accuracy_score


cm = confusion_matrix(y_test, y_pred)
ac = accuracy_score(y_test,y_pred)

Assignment - 04

 Please optimize the decision tree applied on the diabetes


dataset to achieve more accurate results
 Hint: You can change the parameters below of the DecisionTreeClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), for example via a grid search as sketched after this list:
 criterion
 splitter
 max_depth

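One possible approach (a sketch, not a required solution): search over these parameters with GridSearchCV, reusing X_train, y_train, X_test, and y_test from the earlier code:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': [3, 4, 5, 6, None],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test accuracy  :", search.best_estimator_.score(X_test, y_test))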
Assignment-04

• Please optimize the SVM Classifier applied on the diabetes dataset to


achieve more accurate classification results
• Hint: You can change the kernel parameter of the SVM classifier, for example:
• 'sigmoid'
• 'rbf'
• You can also alter the C=1.0 parameter to get a better accuracy score (one simple exploration is sketched below).
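A simple way to explore kernels and C values (a sketch, reusing the train/test split from the earlier code):

from sklearn import svm, metrics

# Try a few kernels and C values and compare accuracy on the test set
for kernel in ['linear', 'rbf', 'sigmoid']:
    for C in [0.1, 1.0, 10.0]:
        clf = svm.SVC(kernel=kernel, C=C)
        clf.fit(X_train, y_train)
        acc = metrics.accuracy_score(y_test, clf.predict(X_test))
        print(f"kernel={kernel:8s} C={C:<5} accuracy={acc:.3f}")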

• Last date of submission: 24th April 2021


References and Further Reading

 Chapter 08, Data Mining Concepts and Techniques, 3rd Edition
 https://www.datacamp.com/community/tutorials/decision-tree-classification-python
 https://www.kaggle.com/uciml/pima-indians-diabetes-database
 https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
 https://towardsdatascience.com/decision-tree-in-python-b433ae57fb93
 https://www.w3schools.com/python/python_ml_decision_tree.asp
References and Further Reading

 Chapter 08 and 09, Data Mining Concepts and Techniques, 3rd Edition
 https://vitalflux.com/accuracy-precision-recall-f1-score-python-example/
 https://blogs.oracle.com/ai-and-datascience/post/a-simple-guide-to-building-a-confusion-matrix
 https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
 https://datastoriesweb.wordpress.com/2017/06/11/classification-one-vs-rest-and-one-vs-one/
 https://www.learnopencv.com/support-vector-machines-svm
 https://towardsdatascience.com/https-medium-com-pupalerushikesh-svm-f4b42800e989
 https://medium.com/@dataturks/understanding-svms-for-image-classification-cf4f01232700
