Binary Classification With Visualization: CS 352 - Data Analysis and Visualization
Today’s Classification Topics
What is Classification
Class Imbalance Problem
Performance Metrics:
Confusion Matrix, Accuracy, Precision, Recall, F-Measure, Classification Report
Rule Based Classifiers
Decision Tree Classifiers
Support Vector Machine
Python Implementation
Classification: Definition
Task:
Learn a model that maps each attribute set x
into one of the predefined class labels y
Examples of Classification Task
[Figure: a learning algorithm induces a model from a Training Set of labeled records (Tid, Attrib1, Attrib2, Attrib3, Class); the model is then applied to a Test Set whose Class labels are unknown (e.g., Tid 11: No, Small, 55K, ? and Tid 15: No, Large, 67K, ?).]
Classification Techniques
Base Classifiers
Rule-based Methods
Decision Tree based Methods
Support Vector Machines
Nearest-neighbor
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Ensemble Classifiers
Boosting, Bagging, Random Forests
Class Imbalance Problem
Confusion Matrix:
                     PREDICTED CLASS
                     Class=Yes    Class=No
ACTUAL   Class=Yes   a (TP)       b (FN)
CLASS    Class=No    c (FP)       d (TN)
Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
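To make these definitions concrete, here is a minimal Python sketch using scikit-learn (also used in the implementation section below); the actual and predicted label vectors are made up purely for illustration.

# Compute precision, recall, F-measure, and the classification report
# for made-up actual vs. predicted class labels
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

y_actual = [1, 1, 1, 0, 0, 0, 0, 0]   # actual classes (1 = Yes, 0 = No)
y_pred   = [1, 1, 0, 1, 0, 0, 0, 0]   # predicted classes

print("Precision:", precision_score(y_actual, y_pred))   # a / (a + c)
print("Recall:   ", recall_score(y_actual, y_pred))      # a / (a + b)
print("F-measure:", f1_score(y_actual, y_pred))          # 2rp / (r + p)
print(classification_report(y_actual, y_pred))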
Alternative Measures

                     PREDICTED CLASS
                     Class=Yes    Class=No
ACTUAL   Class=Yes   4            6
CLASS    Class=No    21           969

Precision (p) = 4 / (4 + 21) = 0.16
Recall (r) = 4 / (4 + 6) = 0.4
F-measure (F) = 2·4 / (2·4 + 6 + 21) ≈ 0.23
Accuracy = (4 + 969) / 1000 = 0.973
Alternative Measures

                     PREDICTED CLASS
                     Class=Yes    Class=No
ACTUAL   Class=Yes   10           0
CLASS    Class=No    10           980

Precision (p) = 10 / (10 + 10) = 0.5
Recall (r) = 10 / (10 + 0) = 1
F-measure (F) = 2·1·0.5 / (1 + 0.5) ≈ 0.67
Accuracy = (10 + 980) / 1000 = 0.99

                     PREDICTED CLASS
                     Class=Yes    Class=No
ACTUAL   Class=Yes   1            9
CLASS    Class=No    0            990

Precision (p) = 1 / (1 + 0) = 1
Recall (r) = 1 / (1 + 9) = 0.1
F-measure (F) = 2·0.1·1 / (1 + 0.1) ≈ 0.18
Accuracy = (1 + 990) / 1000 = 0.991
Alternative Measures

                     PREDICTED CLASS
                     Class=Yes    Class=No
ACTUAL   Class=Yes   40           10
CLASS    Class=No    1000         4000

Precision (p) = 40 / (40 + 1000) ≈ 0.04
Recall (r) = 40 / (40 + 10) = 0.8
F-measure (F) = 2·40 / (2·40 + 10 + 1000) ≈ 0.07
Accuracy = (40 + 4000) / 5050 = 0.8
Measures of Classification Performance
                 PREDICTED CLASS
                 Yes     No
ACTUAL   Yes     TP      FN
CLASS    No      FP      TN
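A minimal sketch of building this matrix with scikit-learn; the label vectors are made up, and labels=[1, 0] is passed so the output matches the slide layout with Class=Yes first.

# Confusion matrix in the slide's layout: [[TP, FN], [FP, TN]]
from sklearn.metrics import confusion_matrix

y_actual = [1, 1, 1, 0, 0, 0, 0, 0]   # 1 = Yes, 0 = No
y_pred   = [1, 1, 0, 1, 0, 0, 0, 0]

print(confusion_matrix(y_actual, y_pred, labels=[1, 0]))
# [[2 1]
#  [1 4]]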
Rule: (Condition) → y
where
• Condition is a conjunction of attribute tests
• y is the class label
LHS: rule antecedent or condition
RHS: rule consequent
Examples of classification rules:
• (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
• (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
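As a minimal sketch of how such rules can be applied in code, here is a first-match strategy over (condition, label) pairs; the dict-based record format and the reuse of the two example rules above are assumptions for illustration only.

# A toy first-match rule-based classifier: rules are (condition, label) pairs
rules = [
    (lambda r: r.get("blood_type") == "warm" and r.get("lay_eggs") == "yes", "Birds"),
    (lambda r: r.get("taxable_income", 0) < 50_000 and r.get("refund") == "yes", "Evade=No"),
]

def classify(record, default="unknown"):
    """Return the consequent of the first rule whose antecedent the record satisfies."""
    for condition, label in rules:
        if condition(record):
            return label
    return default

print(classify({"blood_type": "warm", "lay_eggs": "yes"}))  # -> Birds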
Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
Rule Coverage and Accuracy: for example, the rule (Status=Single) → No has Coverage = 40% (the fraction of records that satisfy the rule's antecedent) and Accuracy = 50% (the fraction of covered records whose class matches the rule's consequent).
How Does a Rule-based Classifier Work?
Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?
Exhaustive rules
Each record is covered by at least one rule
Building Classification Rules
Direct Method:
• Extract rules directly from data
• Examples: RIPPER, CN2, Holte's 1R
Indirect Method:
• Extract rules from other classification models (e.g., decision trees, neural networks)
• Examples: C4.5rules
Decision Tree Classifier

ID   Home Owner   Marital Status   Annual Income   Defaulted Borrower
1    Yes          Single           125K            No
2    No           Married          100K            No
3    No           Single           70K             No
4    Yes          Married          120K            No
5    No           Divorced         95K             Yes
6    No           Married          60K             No
7    Yes          Divorced         220K            No
8    No           Single           85K             Yes
9    No           Married          75K             No
10   No           Single           90K             Yes

One possible tree (splitting attributes: Home Owner, then MarSt, then Income):
• Home Owner = Yes → NO
• Home Owner = No, MarSt = Married → NO
• Home Owner = No, MarSt = Single or Divorced, Income < 80K → NO
• Home Owner = No, MarSt = Single or Divorced, Income > 80K → YES

Another tree with MarSt at the root:
• MarSt = Married → NO
• MarSt = Single or Divorced, Home Owner = Yes → NO
• MarSt = Single or Divorced, Home Owner = No, Income < 80K → NO
• MarSt = Single or Divorced, Home Owner = No, Income > 80K → YES

There could be more than one tree that fits the same data!
Apply Model to Test Data

Test record: Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted = ?

Start from the root of the tree and follow the branches that match the test record:
• Home Owner = No → take the "No" branch to the MarSt node
• MarSt = Married → take the "Married" branch to the leaf NO
• Assign Defaulted = "No" to the test record
Decision Tree Classification Task
[Figure: the same induction/deduction workflow, now with a decision tree as the model: a Training Set of labeled records (Tid, Attrib1, Attrib2, Attrib3, Class) is used to induce a Decision Tree, which is then applied to a Test Set whose Class labels are unknown.]
Decision Tree Induction
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
General Structure of Hunt's Algorithm

Let Dt be the set of training records that reach a node t:
• If all records in Dt belong to the same class yt, then t is a leaf node labeled yt.
• Otherwise, use an attribute test condition to split Dt into smaller subsets, and recursively apply the procedure to each subset.

Applied to the loan data above, the tree grows step by step (a minimal sketch in code follows below):
1. Start with a single leaf predicting the majority class, Defaulted = No.
2. Split on Home Owner: the Yes branch is pure, (3,0), and becomes a leaf Defaulted = No; the No branch is split further.
3. Split the Home Owner = No branch on Marital Status: Married becomes a leaf Defaulted = No; Single, Divorced is split further.
4. Split the Single, Divorced branch on Annual Income: < 80K becomes a leaf Defaulted = No; >= 80K becomes a leaf Defaulted = Yes.
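A minimal sketch of the recursion, assuming categorical attributes and a greedy Gini-based choice of splitting attribute; the function names and dict-based record format are invented for illustration, not taken from any library.

# Hunt's algorithm sketch: grow a tree as nested dicts over categorical attributes
from collections import Counter

def gini(labels):
    """GINI(t) = 1 - sum_j p(j|t)^2 for a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def hunt(records, labels, attributes):
    # Leaf: all records belong to one class, or no attributes are left to test
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute with the lowest weighted Gini after splitting
    def split_gini(attr):
        groups = {}
        for rec, lab in zip(records, labels):
            groups.setdefault(rec[attr], []).append(lab)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    best = min(attributes, key=split_gini)
    rest = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in set(rec[best] for rec in records):
        subset = [(r, l) for r, l in zip(records, labels) if r[best] == value]
        sub_records, sub_labels = zip(*subset)
        tree[best][value] = hunt(list(sub_records), list(sub_labels), rest)
    return tree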
Test Conditions for Attributes
• Multi-way split: use as many partitions as there are distinct values, e.g., Shirt Size → Small / Medium / Large / Extra Large.
• Binary split: divide the values into two subsets, e.g., Shirt Size → {Small, Medium} vs {Large, Extra Large}, or {Small} vs {Medium, Large, Extra Large}.
• For a continuous attribute such as Annual Income, the test can be a binary split (Annual Income > 80K? Yes / No) or a multi-way split into disjoint ranges (< 10K, ..., > 80K).
Greedy approach:
Nodes with purer class distribution are preferred. For example, a node with class counts C0: 9, C1: 1 is purer than a node with C0: 5, C1: 5.

Misclassification error:
Error(t) = 1 - max_i P(i | t)
Finding the Best Split
GINI(t) = 1 - Σ_j [p(j | t)]²

C1 = 0, C2 = 6: Gini = 0.000
C1 = 1, C2 = 5: Gini = 0.278
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500
Computing Gini Index of a Single Node
GINI(t) = 1 - Σ_j [p(j | t)]²
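A quick Python check of the Gini values in the table above; the helper name gini_index is invented for illustration.

# Gini index of a node from its class counts: 1 - sum_j p(j|t)^2
def gini_index(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(gini_index(counts), 3))
# Prints 0.0, 0.278, 0.444, 0.5, matching the table above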
Support Vector Machines
[Figure: two candidate linear decision boundaries, B1 and B2, each drawn with its pair of parallel margin hyperplanes (b11, b12 for B1; b21, b22 for B2); B1 has the wider margin and is therefore preferred.]

Decision boundary: w · x + b = 0
Margin hyperplanes: w · x + b = 1 and w · x + b = -1
Margin = 2 / ||w||
Linear SVM
Linear model:
f(x) = 1 if w · x + b ≥ 1; -1 if w · x + b ≤ -1
• The One-vs-Rest approach takes one class as positive and all remaining classes as negative, and trains a classifier on that relabeling. For data with n classes it therefore trains n classifiers (a minimal sketch follows below).
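A minimal scikit-learn sketch of One-vs-Rest, using the built-in iris data purely as a stand-in three-class dataset.

# One-vs-Rest: one binary SVM per class
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)               # 3 classes
ovr = OneVsRestClassifier(SVC(kernel='linear'))
ovr.fit(X, y)
print(len(ovr.estimators_))                     # 3: one classifier per class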
# Load libraries
import pandas as pd
Decision Tree Classification in Python
#Load Data
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
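A hedged sketch of the loading step, assuming the Pima Indians diabetes CSV (linked under Assignment-04) is saved locally as "pima-indians-diabetes.csv" without a header row; adjust the header handling if your copy includes one.

# Load the Pima Indians diabetes dataset into a pandas DataFrame
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)
print(pima.head())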
#Feature Selection
Here, you need to divide the given columns into two types of variables: the dependent (target) variable and the independent (feature) variables.
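A sketch of that split; the feature_cols list defined here is also reused later by export_graphviz.

# Split the dataset into feature variables and the target variable
feature_cols = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age']
X = pima[feature_cols]   # features (independent variables)
y = pima.label           # target (dependent variable)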
Decision Tree Classification in Python
Splitting Data
To understand model performance, dividing the dataset into a training set and a test set is a
good strategy.
Let's split the dataset by using the function train_test_split(). You need to pass 3 parameters: features, target, and test set size.
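For example (the 70/30 split and the random_state value are assumptions):

# Split the dataset into a training set (70%) and a test set (30%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)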
Decision Tree Classification in Python
Evaluating Model
Let's estimate how accurately the classifier can predict whether a patient has diabetes. Accuracy can be computed by comparing the actual test set values with the predicted values.
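A sketch of training the classifier and computing its accuracy (default DecisionTreeClassifier settings assumed); this defines the clf used by the visualization code below.

# Create, train, and evaluate the decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)     # train on the training set
y_pred = clf.predict(X_test)        # predict labels for the test set
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))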
Visualizing Decision Trees: You can use scikit-learn's export_graphviz function to display the tree within a Jupyter notebook. For plotting the tree, you also need to install graphviz and pydotplus.
pip install graphviz
pip install pydotplus
pip install six
The export_graphviz function converts the decision tree classifier into a dot file, and pydotplus converts this dot file to a PNG or a form displayable in Jupyter.
from six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,
                feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
SVM Classification in Python
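A minimal sketch mirroring the decision tree pipeline above, reusing the assumed X_train/X_test split of the Pima data with sklearn.svm.SVC; the linear kernel is an assumption.

# Train and evaluate a support vector machine on the same train/test split
from sklearn.svm import SVC
from sklearn import metrics

svm_clf = SVC(kernel='linear')       # linear kernel; 'rbf' or 'poly' also work
svm_clf.fit(X_train, y_train)        # train on the training set
y_pred = svm_clf.predict(X_test)     # predict labels for the test set
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))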
Assignment-04
https://www.datacamp.com/community/tutorials/decision-tree-classification-python
https://www.kaggle.com/uciml/pima-indians-diabetes-database
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
https://towardsdatascience.com/decision-tree-in-python-b433ae57fb93
https://www.w3schools.com/python/python_ml_decision_tree.asp
References and Further Reading
Chapter 08 and 09, Data Mining Concepts and Techniques, 3rd Edition
https://vitalflux.com/accuracy-precision-recall-f1-score-python-example/
https://blogs.oracle.com/ai-and-datascience/post/a-simple-guide-to-building-a-confusion-matrix
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
https://datastoriesweb.wordpress.com/2017/06/11/classification-one-vs-rest-and-one-vs-one/
https://www.learnopencv.com/support-vector-machines-svm
https://towardsdatascience.com/https-medium-com-pupalerushikesh-svm-f4b42800e989
https://medium.com/@dataturks/understanding-svms-for-image-classification-cf4f01232700