
CS685: Data Mining
Basics of Classification

Arnab Bhattacharya
arnabb@cse.iitk.ac.in
Computer Science and Engineering,
Indian Institute of Technology, Kanpur
http://web.cse.iitk.ac.in/~cs685/

1st semester, 2018-19
Mon, Thu 1030-1145 at RM101

Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classification 2018-19 1 / 14



Classification

A dataset of n objects Oi, i = 1, . . . , n
A total of k classes Cj, j = 1, . . . , k
Each object belongs to a single class
If object Oi belongs to class Cj, then C(Oi) = j
Given a new object Oq, classification is the problem of determining its class, i.e., C(Oq), out of the k possible choices
If, instead of k discrete classes, there is a continuum of values, the problem of determining the value V(Oq) of a new object Oq is called prediction


Sets

Total available data is divided randomly into two parts: the training set and the testing set
The classification algorithm or model is built using only the training set
The testing set should not be used at all during training
The quality of the method is measured using the testing set
Sometimes a validation set is separated from the training set to evaluate the method
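The random division into training and testing sets can be sketched in a few lines of plain Python; the function name and the 30% test fraction are illustrative choices, not from the slides.

```python
import random

def train_test_split(objects, test_fraction=0.3, seed=0):
    """Randomly divide the available data into a training set and a testing set."""
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = objects[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # training set, testing set (disjoint, together covering all objects)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(10)), test_fraction=0.3)
print(len(train), len(test))  # 7 3
```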




Processes

Stratified sampling
Representation of each class in the training set is proportional to the overall class ratios
k-fold cross-validation
Training data is divided into k random parts
k − 1 groups are used as the training set and the k-th group as the validation set
Training is repeated k times with a new validation group each time
Leave-one-out cross-validation (LOOCV): when k = n
Stratified cross-validation
Representation of each class in each of the k random groups is proportional to the overall class ratios
Classification is called supervised learning
The algorithm or model is “supervised” by the class information
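The k-fold procedure above can be sketched as a small generator; this is a minimal illustration (not the course's code), and the round-robin fold assignment is one possible way to form the k random parts.

```python
import random

def k_fold_splits(data, k, seed=0):
    """Divide the data into k random parts; yield (training, validation) pairs
    where each part serves as the validation set exactly once."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # k roughly equal random parts
    for i in range(k):
        # the other k − 1 groups form the training set
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, folds[i]

for training, validation in k_fold_splits(list(range(12)), k=4):
    print(len(training), len(validation))  # 9 3, four times
```

Setting k equal to the number of objects gives leave-one-out cross-validation (LOOCV).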




Over-Fitting and Under-Fitting

Over-fitting
Algorithm or model classifies the training set too well
It is too complex or uses too many parameters
Generally performs poorly on the testing set
Ends up modeling noise rather than data characteristics
Under-fitting
The opposite problem
Algorithm or model does not classify even the training set well
It is too simple or uses too few parameters
Generally performs poorly on the testing set
Ends up modeling overall data characteristics instead of per-class characteristics
Bias-variance tradeoff
Bias measures errors in the learnt model (under-fitting)
Variance measures errors when the training set is perturbed (over-fitting)
Low bias generally implies higher variance and vice versa




Errors

Positives (P): Objects that are “true” answers (for a class)
Negatives (N): Objects that are not answers: N = D − P
For a particular classification algorithm A,
P′: Objects returned as positive by A
N′: Objects not returned as positive by A
True Positives (TP): Answers returned by A: P ∩ P′
True Negatives (TN): Non-answers not returned by A: N ∩ N′
False Positives (FP): Non-answers returned by A: N ∩ P′
False Negatives (FN): Answers not returned by A: P ∩ N′
P = TP ∪ FN, N = TN ∪ FP, P′ = TP ∪ FP, N′ = TN ∪ FN
In statistics,
Type I error: FP
Type II error: FN
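The four error sets are just set intersections, so they can be computed directly from D, P, and the returned set P′; the function name below is illustrative.

```python
def confusion_sets(D, P, P_returned):
    """Given the dataset D, the true answer set P, and the set P' returned as
    positive by algorithm A, compute TP, TN, FP, FN as set intersections."""
    N = D - P                    # negatives: N = D − P
    N_returned = D - P_returned  # objects not returned as positive by A
    return {
        "TP": P & P_returned,    # answers returned by A
        "TN": N & N_returned,    # non-answers not returned by A
        "FP": N & P_returned,    # non-answers returned by A
        "FN": P & N_returned,    # answers not returned by A
    }

c = confusion_sets({1, 2, 3, 4}, P={1, 2}, P_returned={2, 3})
print(c)  # {'TP': {2}, 'TN': {4}, 'FP': {3}, 'FN': {1}}
```

The identities P = TP ∪ FN and P′ = TP ∪ FP hold by construction.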


Confusion Matrix

The confusion matrix visually represents this information
Rows indicate “true” answers: P and N
Columns indicate those returned by A: P′ and N′
Shows which type of error is more frequent

                              Returned by A
True answers                  Positives P′    Negatives N′
              Positives P     TP              FN
              Negatives N     FP              TN


Confusion Matrix for Multiple Classes

The confusion matrix is more useful when extended to multiple classes
Shows which classes are confused more with which other classes

                         Predicted by A
True answers             Class C1′   Class C2′   Class C3′
              Class C1   5           3           0
              Class C2   2           3           1
              Class C3   0           2           9
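A multi-class confusion matrix is built by counting (true class, predicted class) pairs; the sketch below reproduces the counts in the table above from a hypothetical list of labelled objects.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count, for each pair (i, j), how many objects of true class i
    were predicted as class j by the algorithm."""
    matrix = {c: {c2: 0 for c2 in classes} for c in classes}
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

# Hypothetical (true, predicted) pairs matching the table in the slide
pairs = ([("C1", "C1")] * 5 + [("C1", "C2")] * 3 +
         [("C2", "C1")] * 2 + [("C2", "C2")] * 3 + [("C2", "C3")] * 1 +
         [("C3", "C2")] * 2 + [("C3", "C3")] * 9)
m = confusion_matrix([t for t, p in pairs], [p for t, p in pairs],
                     ["C1", "C2", "C3"])
print(m["C1"])  # {'C1': 5, 'C2': 3, 'C3': 0}
```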




Error Parameters or Performance Metrics

Parameter                 Interpretation                            Formula
Precision                 Proportion of positives in those          |TP| / |TP ∪ FP| = |TP| / |P′|
                          returned by A
Recall or Sensitivity or  Proportion of positives                   |TP| / |TP ∪ FN| = |TP| / |P|
True positive rate or     returned by A
Hit rate
Specificity or            Proportion of negatives                   |TN| / |TN ∪ FP| = |TN| / |N|
True negative rate        not returned by A
False positive rate       Proportion of negatives                   |FP| / |TN ∪ FP| = |FP| / |N|
                          returned by A
False negative rate       Proportion of positives                   |FN| / |TP ∪ FN| = |FN| / |P|
                          not returned by A
Accuracy                  Proportion of positives returned and      |TP ∪ TN| / |D|
                          negatives not returned by A
Error rate                Proportion of positives not returned      |FP ∪ FN| / |D|
                          and negatives returned by A
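All seven metrics in the table follow from the four error counts; a minimal sketch (function and key names are illustrative):

```python
def metrics(tp, tn, fp, fn):
    """Compute the performance metrics of the table from the four error counts."""
    p, n = tp + fn, tn + fp              # |P| = |TP ∪ FN|, |N| = |TN ∪ FP|
    return {
        "precision":   tp / (tp + fp),   # |TP| / |P′|
        "recall":      tp / p,           # sensitivity, true positive rate
        "specificity": tn / n,           # true negative rate
        "fpr":         fp / n,           # false positive rate
        "fnr":         fn / p,           # false negative rate
        "accuracy":    (tp + tn) / (p + n),
        "error_rate":  (fp + fn) / (p + n),
    }
```

Note that specificity + false positive rate = 1 and recall + false negative rate = 1.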




Single Measures

Single measures capturing both precision and recall

F-score or F-measure or F1-score is the harmonic mean of precision and recall

F-score = (2 · Precision · Recall) / (Precision + Recall)

In terms of the errors,

F-score = 2·TP / (2·TP + FN + FP)

Similarly, G-measure is the geometric mean of precision and recall

EER or Equal Error Rate is the point where the FP rate is equal to the FN rate
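The two F-score formulas above are equivalent, which a short sketch can check; function names are illustrative.

```python
import math

def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def f_score_from_errors(tp, fp, fn):
    """Equivalent form in terms of the error counts: 2·TP / (2·TP + FN + FP)."""
    return 2 * tp / (2 * tp + fn + fp)

def g_measure(precision, recall):
    """Geometric mean of precision and recall."""
    return math.sqrt(precision * recall)
```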




Weighting Precision versus Recall

Suppose precision and recall are weighted at a ratio α : (1 − α)

F-score is then the weighted harmonic mean

1/F = α · (1/P) + (1 − α) · (1/R)

β² = (1 − α)/α measures the relative importance of recall over precision
α ∈ [0, 1] while β ∈ [0, ∞)
β > 1 emphasizes recall, while β < 1 emphasizes precision
Using β², the weighted F-score is

F = ((β² + 1) · P · R) / (β² · P + R)

When β = 1, precision and recall are equally weighted (α = 1/2), and the F1-score is the harmonic mean
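A one-line sketch of the weighted F-score; the limiting behavior in the comments illustrates why large β emphasizes recall and small β emphasizes precision.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted F-score: F = (β² + 1)·P·R / (β²·P + R).
    As β → ∞, F → recall; as β → 0, F → precision; β = 1 gives the F1-score."""
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)
```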
Example

D = {O1, O2, O3, O4, O5, O6, O7, O8}
Correct answer set = {O1, O5, O7}
Algorithm returns = {O1, O3, O5, O6}
∴ P = {O1, O5, O7}
N = {O2, O3, O4, O6, O8}
TP = {O1, O5}
TN = {O2, O4, O8}
FP = {O3, O6}
FN = {O7}
∴ Recall = Sensitivity = 2/3 = 0.67
Precision = 2/4 = 0.5
Specificity = 3/5 = 0.6
F-score = 4/7 = 0.571
Accuracy = 5/8 = 0.625
Error rate = 3/8 = 0.375
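The worked example can be reproduced mechanically; this sketch represents the objects O1..O8 by their indices and uses exact fractions to match the slide's values.

```python
from fractions import Fraction

D = {1, 2, 3, 4, 5, 6, 7, 8}     # objects O1..O8 by index
P = {1, 5, 7}                    # correct answer set
P_ret = {1, 3, 5, 6}             # set returned by the algorithm
N = D - P
tp, tn = P & P_ret, N - P_ret    # {1, 5} and {2, 4, 8}
fp, fn = P_ret - P, P - P_ret    # {3, 6} and {7}

recall = Fraction(len(tp), len(P))
precision = Fraction(len(tp), len(P_ret))
specificity = Fraction(len(tn), len(N))
f = 2 * precision * recall / (precision + recall)
accuracy = Fraction(len(tp) + len(tn), len(D))
error_rate = Fraction(len(fp) + len(fn), len(D))
print(recall, precision, specificity, f, accuracy, error_rate)
# 2/3 1/2 3/5 4/7 5/8 3/8
```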


ROC Curve

Performance of an algorithm depends on its parameters
To assess it over a range of parameters, the ROC curve is used
1 − Specificity (x-axis) versus Sensitivity (y-axis)
Equivalently, false positive rate (x-axis) versus true positive rate (y-axis)
A random-guess algorithm is a 45° line

[Figure: two ROC plots of true positive rate versus false positive rate. The left plot shows the curve of an algorithm A with its accuracy as the shaded area under the curve; the right plot compares a perfect algorithm, two algorithms 1 and 2, and the random-guess diagonal.]

Area under the ROC curve (AUC or AUROC) measures accuracy (or discrimination)
What AUC is good?
0.9+: excellent; 0.8+: good; 0.7+: fair; 0.6+: poor; below 0.6: fail
EER denotes the point on the ROC curve where the FP rate equals the FN rate
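Given ROC points sampled over a range of parameter settings, the area under the curve can be estimated with the trapezoidal rule; this is a minimal sketch, not a full AUC implementation.

```python
def auc(points):
    """Estimate the area under an ROC curve from (fpr, tpr) points
    using the trapezoidal rule."""
    pts = sorted(points)            # order by false positive rate
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

print(auc([(0, 0), (1, 1)]))              # 0.5 (random-guess diagonal)
print(auc([(0, 0), (0, 1), (1, 1)]))      # 1.0 (perfect algorithm)
```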
Precision-Recall Curve

Precision (y-axis) versus recall (x-axis)

[Figure: a precision-recall curve, with the breakeven point marked where the curve crosses the precision = recall diagonal.]

The breakeven point is where precision is the same as recall
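With the curve sampled at discrete parameter settings, the breakeven point can be approximated as the sampled point where precision and recall are closest; a small illustrative sketch with hypothetical curve values:

```python
def breakeven_point(curve):
    """Given (recall, precision) points along the curve, return the point
    where |precision − recall| is smallest, approximating the breakeven point."""
    return min(curve, key=lambda rp: abs(rp[1] - rp[0]))

# hypothetical (recall, precision) samples along a precision-recall curve
curve = [(0.2, 0.9), (0.4, 0.8), (0.6, 0.6), (0.8, 0.4)]
print(breakeven_point(curve))  # (0.6, 0.6)
```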
