lec06
Basics of Classification
Arnab Bhattacharya
arnabb@cse.iitk.ac.in
A dataset of n objects Oi, i = 1, . . . , n
A total of k classes Cj, j = 1, . . . , k
Each object belongs to a single class
If object Oi belongs to class Cj, then C(Oi) = j
Given a new object Oq, classification is the problem of determining its class C(Oq) out of the k possible choices
If, instead of k discrete classes, there is a continuum of values, the
problem of determining the value V (Oq ) of a new object Oq is called
prediction
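As a minimal sketch of classification, a 1-nearest-neighbour rule assigns C(Oq) by copying the class of the closest training object; the dataset, labels, and distance function below are illustrative assumptions, not part of the lecture.

```python
# Minimal 1-nearest-neighbour classifier over numeric objects.
# Objects, labels, and the (squared Euclidean) distance are illustrative.

def classify_1nn(objects, labels, query):
    """Return C(Oq): the label of the training object nearest to the query."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))  # squared Euclidean
    nearest = min(range(len(objects)), key=lambda i: dist(objects[i], query))
    return labels[nearest]

# n = 4 objects, k = 2 classes
objects = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
labels = [1, 1, 2, 2]
print(classify_1nn(objects, labels, (4.8, 5.1)))  # -> 2
```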
Total available data is divided randomly into two parts: a training set and a testing set
The classification algorithm or model is built using only the training set
The testing set should not be used at all during training
The quality of the method is measured using the testing set
Sometimes a validation set is separated from the training set to evaluate the method
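A random train/test split can be sketched as follows; the 70/30 proportion and the fixed seed are illustrative assumptions.

```python
# Randomly split the available data into a training set and a testing set.
# test_frac and seed are illustrative choices, not prescribed by the lecture.
import random

def split(data, test_frac=0.3, seed=42):
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)  # shuffle indices reproducibly
    cut = int(len(data) * (1 - test_frac))
    train = [data[i] for i in idx[:cut]]
    test = [data[i] for i in idx[cut:]]
    return train, test

data = list(range(10))
train, test = split(data)
print(len(train), len(test))  # -> 7 3
```

A validation set can be carved out of the training set the same way, by splitting `train` again.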
In statistics, a Type I error is a false positive (FP) and a Type II error is a false negative (FN)
                          Returned by A
                       Positives P′   Negatives N′
True     Positives P       TP             FN
answers  Negatives N       FP             TN
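The four cells of the table can be counted directly from true and predicted labels; the label vectors and the convention "1 = positive" below are illustrative assumptions.

```python
# Count TP, FN, FP, TN from binary true/predicted labels (1 = positive).

def confusion_counts(true, pred):
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(true, pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

true = [1, 1, 1, 0, 0, 0]
pred = [1, 0, 1, 1, 0, 0]
print(confusion_counts(true, pred))  # -> (2, 1, 1, 2)
```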
                         Predicted by A
                   Class C1′  Class C2′  Class C3′
True     Class C1      5          3          0
answers  Class C2      2          3          1
         Class C3      0          2          9
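From a multi-class confusion matrix, per-class precision and recall come from the matching column and row: for class c, TP is the diagonal cell, FN is the rest of its true-class row, and FP is the rest of its predicted column. Using the 3-class matrix above:

```python
# Per-class precision and recall from the confusion matrix above
# (rows = true classes, columns = predicted classes).

M = [[5, 3, 0],
     [2, 3, 1],
     [0, 2, 9]]

stats = {}
for c in range(3):
    tp = M[c][c]
    fn = sum(M[c]) - tp                        # rest of the true-class row
    fp = sum(M[r][c] for r in range(3)) - tp   # rest of the predicted column
    stats[c + 1] = (tp / (tp + fp), tp / (tp + fn))  # (precision, recall)

for c, (p, r) in stats.items():
    print(f"C{c}: precision={p:.3f} recall={r:.3f}")
```

For example, class C1 gets precision 5/7 and recall 5/8.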
In terms of errors,

    F-score = (2·TP) / (2·TP + FN + FP)
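The error form of the F-score is equivalent to the harmonic mean of precision and recall, 2·P·R/(P + R); the counts TP = 2, FN = 1, FP = 1 below are illustrative.

```python
# F-score in terms of errors, and the equivalent harmonic-mean form.

def f_score(tp, fn, fp):
    return 2 * tp / (2 * tp + fn + fp)

def f_from_pr(p, r):
    return 2 * p * r / (p + r)  # harmonic mean of precision and recall

tp, fn, fp = 2, 1, 1  # illustrative counts
p, r = tp / (tp + fp), tp / (tp + fn)
print(f_score(tp, fn, fp), f_from_pr(p, r))  # the two values agree
```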
β² = (1 − α)/α, where α is the weight given to precision, measures the relative importance of recall over precision
α ∈ [0, 1] while β ∈ [0, ∞)
β > 1 emphasizes recall, while β < 1 emphasizes precision
Using β², the weighted F-score is

    F = ((β² + 1)·P·R) / (β²·P + R)
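The formula can be checked numerically: as β grows, F moves toward recall, and as β shrinks, toward precision (the P and R values below are illustrative).

```python
# Weighted F-score: F = ((b^2 + 1) * P * R) / (b^2 * P + R).

def f_beta(p, r, beta):
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)

p, r = 0.5, 0.8  # illustrative precision and recall
print(f_beta(p, r, 1.0))   # ordinary F-score (harmonic mean)
print(f_beta(p, r, 10.0))  # large beta: close to recall 0.8
print(f_beta(p, r, 0.1))   # small beta: close to precision 0.5
```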
[Figure: two ROC plots of True Positive Rate vs. False Positive Rate: one comparing the curves of Algorithm 1 and Algorithm 2, one showing Algorithm A against the diagonal of the random-guess algorithm; the area under a curve indicates the algorithm's accuracy.]
Area under the ROC curve (AUC or AUROC) measures accuracy (or discrimination)
What AUC is good enough?
0.9+: excellent; 0.8+: good; 0.7+: fair; 0.6+: poor; below 0.6: fail
The equal error rate (EER) denotes the point on the ROC curve where the FP rate is equal to the FN rate
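Given a sequence of (FPR, TPR) points along an ROC curve, the AUC can be estimated with the trapezoidal rule; the curve points below are illustrative assumptions, sorted by FPR.

```python
# Area under an ROC curve by the trapezoidal rule.
# Points are (false positive rate, true positive rate), sorted by FPR.

def auc(points):
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between two points
    return area

roc = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.8), (1.0, 1.0)]
print(auc(roc))  # 0.8 for this curve; the random-guess diagonal gives 0.5
```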
Arnab Bhattacharya (arnabb@cse.iitk.ac.in) CS685: Classification 2018-19 13 / 14
Precision-Recall Curve
[Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1); the breakeven point is where precision equals recall.]
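Given sampled points along a precision-recall curve, the breakeven point is the one where precision and recall coincide; the (recall, precision) pairs below are illustrative assumptions.

```python
# Find the breakeven point of a precision-recall curve: the sampled
# point where |precision - recall| is smallest (pairs are illustrative).

prs = [(0.1, 0.95), (0.3, 0.85), (0.5, 0.70), (0.6, 0.60), (0.8, 0.40)]

breakeven = min(prs, key=lambda rp: abs(rp[1] - rp[0]))
print(breakeven)  # -> (0.6, 0.6)
```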