Lecture 2 Classifier Performance Metrics

Classification

By: Dr. Kavita Pabreja
Associate Professor, Dept. of Computer Applications, MSI
Confusion matrix
• A confusion matrix is a performance evaluation tool in machine learning,
representing the accuracy of a classification model.
• It displays the number of true positives, true negatives, false positives, and
false negatives. This matrix aids in analyzing model performance,
identifying mis-classifications, and improving predictive accuracy.
• A Confusion matrix is an N x N matrix used for evaluating the performance
of a classification model, where N is the total number of target classes.
• The matrix compares the actual target values with those predicted by the
machine learning model.
• This gives us a holistic view of how well our classification model is
performing and what kinds of errors it is making.
Important Terms in a Confusion Matrix
• True Positive (TP)
• The predicted value matches the actual value, or the predicted class matches the actual class.
• The actual value was positive, and the model predicted a positive value.
• True Negative (TN)
• The predicted value matches the actual value, or the predicted class matches the actual class.
• The actual value was negative, and the model predicted a negative value.
• False Positive (FP) – Type I Error
• The predicted value does not match the actual value.
• The actual value was negative, but the model predicted a positive value.
• Also known as the type I error.
• False Negative (FN) – Type II Error
• The predicted value does not match the actual value.
• The actual value was positive, but the model predicted a negative value.
• Also known as the type II error.
Confusion Matrix

• True Positive (TP) = 560, meaning the model correctly classified 560 positive class data points.
• True Negative (TN) = 330, meaning the model correctly classified 330 negative class data points.
• False Positive (FP) = 60, meaning the model incorrectly classified 60 negative class data points as belonging to the positive class.
• False Negative (FN) = 50, meaning the model incorrectly classified 50 positive class data points as belonging to the negative class.
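• A minimal sketch (using made-up y_true and y_pred label lists, not data from these slides) of how the four counts can be read off with scikit-learn; ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual classes (1 = positive)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # classes predicted by the model
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")   # TP=3, TN=3, FP=1, FN=1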
Classifier Accuracy Measures
The accuracy of a classifier: percentage of test set tuples that are correctly classified by the
classifier.
In the pattern recognition literature, this is also referred to as the overall recognition rate
of the classifier.

[Confusion matrix example (actual vs. predicted class): per-class recognition rates of 6954/7000 = 99.34% and 2588/3000 = 86.27%.]
Classifier Accuracy Measures
• Accuracy = (TP + TN) / (P + N)
• Sensitivity / Recall / TP rate = TP / P
• Specificity = TN / N
• FP rate = FP / N = (1 - specificity)
• FN rate = FN / P
where P = TP + FN (all actual positives) and N = FP + TN (all actual negatives).
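• Applying these formulas to the earlier example (TP = 560, TN = 330, FP = 60, FN = 50), a quick sketch:

TP, TN, FP, FN = 560, 330, 60, 50
P = TP + FN                          # all actual positives = 610
N = FP + TN                          # all actual negatives = 390
accuracy    = (TP + TN) / (P + N)    # 890/1000 = 0.89
sensitivity = TP / P                 # recall / TP rate ≈ 0.918
specificity = TN / N                 # ≈ 0.846
fp_rate     = FP / N                 # 1 - specificity ≈ 0.154
fn_rate     = FN / P                 # ≈ 0.082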
Precision vs. Recall (For a binary-class dataset)

Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
• Precision deals with the Predicted row of the confusion matrix, telling us how accurate the model was
in predicting the positive samples out of all the samples predicted to be positive.
• Recall deals with the Expected (Actual) column of the confusion matrix, telling us how accurately the
model was able to identify the positive samples out of all positive samples that were actually present.
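• Continuing the same example (TP = 560, FP = 60, FN = 50), a short sketch of both formulas:

precision = 560 / (560 + 60)   # ≈ 0.903: of all samples predicted positive, how many were right
recall    = 560 / (560 + 50)   # ≈ 0.918: of all actual positives, how many were found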
Accuracy, Precision, or Recall—When to Use What
• Accuracy is the most commonly used evaluation metric in most data
science projects.
• It tells us how many times our model got its prediction correct as a
ratio of the total times the model was used for predictions.
• However, it only makes sense to use accuracy when the dataset is balanced, i.e., when all classes have roughly the same number of samples. Such a scenario is rare in practice.
• On the other hand, both precision and recall are useful metrics in
cases where the dataset is imbalanced (which is valid for almost all
practical scenarios).
An example where “Recall is a more important
evaluation metric than Precision”
• In the case of COVID-19 detection, we want to avoid false negatives as
much as possible. COVID-19 spreads easily, and thus we want the
patient to take appropriate measures to prevent the spread.
• A false negative case means that a COVID-positive patient is assessed
to not have the disease, which is detrimental.
• In this use case, false positives (a healthy patient diagnosed as COVID-
positive) are not as important as preventing a contagious patient from
spreading the disease.
• In most high-risk disease detection cases (like cancer), recall is a
more important evaluation metric than precision.
An example where “Precision is a more
important evaluation metric than Recall”
• Precision is more useful when we want to affirm the correctness of our model.
• For example, in the case of YouTube recommendations, Positive represents the videos that the user likes.
• Reducing the number of false positives is of utmost importance. False positives
here represent videos that the user does not like, but YouTube is still
recommending them. False negatives are of lesser importance here since the
YouTube recommendations should only contain videos that the user is more likely
to click on.
• If the user sees recommendations that are not of their liking, they will close the
application, which is not what YouTube desires.
• Most automated marketing campaigns require a high precision value to ensure that a large number of potential customers will interact with their survey or be interested in learning more.
Precision vs Recall
• There’s a trade-off between precision and recall, i.e., one comes at
the cost of another. Trying to increase precision lowers recall and vice-
versa.
• With precision, we try to make sure that what we are classifying as
the positive class is a positive class sample indeed, which in turn
reduces recall.
• With recall, we are trying not to miss out on any positive class
samples, which allows many false positives to creep in, thus reducing
precision.
Precision-Recall Trade-Off
• Suppose we have a binary class dataset where the test set consists of four
samples in the “positive” class and six samples in the “negative class.”
• This is represented as scenario (A) in the diagram below. The RIGHT side of
the decision boundary (green line) depicts the positive class, and the LEFT
side depicts the negative class.
• For this case, precision can be calculated by counting the number of positive class samples on the right side divided by the total number of samples on the right side, which comes out to be 3/5 or 60% in this case.
• Recall can be calculated by counting the number of positive class samples
on the right side divided by the total number of positive class samples,
which is 3/4 or 75% in this case.
• Now, to increase precision, we shift the
decision boundary threshold to arrive at
scenario (B).
• Precision = Positive samples on right
side/Total samples on right side = 2/2 =
100%
• Recall = Positive samples on right
side/Total positive samples = 2/4 = 50%.
• Thus, we see that compared to scenario
(A), precision increased, but that also
resulted in a decreased recall.
• Here, we scrutinized the positive samples so hard that we missed some of them while trying to prevent negative samples from getting to the right side.
• From scenario (A), now, if we want to
increase the recall score, we arrive at a
scenario (C) by changing the decision
boundary threshold again. This gives us:
• Precision = Positive samples on right
side/Total samples on right side = 4/8 =
50%
• Recall = Positive samples on right
side/Total positive samples = 4/4 = 100%
• Here, recall jumped to 100%, but at the cost of precision, which is now 50%. In this scenario, while we tried not to miss any of the positive samples, we allowed a lot of negative samples onto the right side (the positive side of the decision boundary), leading to a decrease in precision.
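• The same trade-off can be reproduced numerically by sweeping the threshold over a model's predicted probabilities. A minimal sketch with made-up scores (ten samples, six negatives and four positives, mirroring the scenario above) using scikit-learn's precision_recall_curve:

import numpy as np
from sklearn.metrics import precision_recall_curve

y_true   = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])                         # 6 negatives, 4 positives
y_scores = np.array([0.1, 0.2, 0.3, 0.45, 0.55, 0.7, 0.4, 0.6, 0.8, 0.9])   # hypothetical probabilities

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
# As the threshold rises, precision tends to rise while recall falls, and vice versa.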
Harmonic mean
• The harmonic mean is a numerical average calculated by dividing the number of observations (entries in the series) by the sum of the reciprocals of each number in the series. Thus, the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals.
• For example, the harmonic mean of 1, 4, and 4 is 3 / (1/1 + 1/4 + 1/4) = 3 / 1.5 = 2.
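• A one-line check of this example with Python's standard library:

from statistics import harmonic_mean
print(harmonic_mean([1, 4, 4]))   # 3 / (1/1 + 1/4 + 1/4) = 3 / 1.5 = 2.0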
F1 score
• F1 score blends precision and recall using their harmonic mean.
Maximizing for the F1 score implies simultaneously maximizing for both
precision and recall.
• Unlike the arithmetic mean, the harmonic mean tends to be closer to the
smaller number in a pair. Thus, the F1 score will only be high if both
precision and recall are high, ensuring a good balance of both.
• In a binary classification model, an F1 score close to 1 indicates excellent precision and recall, while a low score indicates poor model performance.
• Interpreting the F1 score depends on the specific problem and context at
hand. In general, a higher F1 score suggests better model performance.
However, what constitutes a “good” or “acceptable” F1 score varies based
on factors such as the domain, application, and consequences of errors.
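• A short sketch of the F1 computation, reusing the precision and recall from the earlier binary example (values assumed from that example, not from new data):

precision, recall = 0.903, 0.918
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.91
# With raw labels available, sklearn.metrics.f1_score(y_true, y_pred) gives the same result.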
Low F1 score
• A low F1 score indicates poor overall performance of a binary classification model and
can be attributed to various factors, including:
• Imbalanced data: In case of an imbalanced dataset, with one class being represented
significantly more frequently than the other, the model may struggle to learn to
distinguish the minority class, resulting in poor performance and a low score.
• Insufficient data: Inadequate dataset size or insufficient representative examples of each
class can hinder the model’s ability to learn a robust representation.
• Inappropriate model selection: Score might be low if the chosen model is not suitable for
the specific task or if it is not properly tuned.
• Inadequate features: If the selected features fail to capture the relevant information for
the task, the model may struggle to learn meaningful patterns from the data.
How to improve F1 score?
• To improve the F1 score, it is necessary to determine the underlying causes of
poor performance and take appropriate steps to address them.
• For example, if the dataset is imbalanced, you can apply oversampling or undersampling to balance the classes (see the sketch after this list).
• If the model is unsuitable or poorly tuned, exploring alternative models or
performing hyperparameter tuning may be beneficial.
• Additionally, inadequate features can be addressed through feature engineering
or selection to identify more relevant features for the task at hand.
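• As one illustration of handling imbalance, many scikit-learn classifiers accept class_weight="balanced", which up-weights the minority class instead of physically oversampling or undersampling. A minimal sketch (X_train and y_train are assumed to exist):

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
# clf.fit(X_train, y_train)   # hypothetical training data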
High F1 score
• A high F1 score indicates the strong overall performance of a binary classification model.
It signifies that the model can effectively identify positive cases while minimizing false
positives and false negatives.
You can achieve a high F1 score using the following techniques:
• High-quality training data: A high-quality dataset that accurately represents the problem
being solved can significantly improve the model’s performance.
• Appropriate model selection: Selecting a model architecture well-suited for the specific
problem can enhance the model’s ability to learn and identify patterns within the data.
• Effective feature engineering: Choosing or creating informative features that capture
relevant information from the data can enhance the model’s learning capabilities and
generalization.
• Hyperparameter tuning: Optimizing the model’s hyperparameters through careful tuning
can improve its performance.
Confusion Matrix for Multiclass Classification

• Which axis contains actual values and which axis contains predicted
values?
• The X-axis contains the predicted values and the Y-axis contains the actual values.
How to find the order of labels in the
confusion matrix?
• For multiclass, to know the order of classes, we can use
model.classes_ to find the order of the classes.
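• A minimal sketch (fitting a classifier on the scikit-learn iris data as a stand-in) showing where this order comes from:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.classes_)   # [0 1 2]; with string targets this would list the class names in sorted order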
What metrics can be derived from the
confusion matrix?
• From the confusion matrix, we
can calculate TP, TN, FP, and FN
for each class.
• By using these values, we can
calculate precision, recall, and
f1-score.
How to visualize the confusion matrix?
• We can visualize the confusion matrix using a heatmap:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# y_test holds the actual labels, pred the model's predictions
plt.figure(figsize=(10, 6))
fx = sns.heatmap(confusion_matrix(y_test, pred), annot=True, fmt="d", cmap="GnBu")
fx.set_title('Confusion Matrix\n')
fx.set_xlabel('\nPredicted Values\n')
fx.set_ylabel('Actual Values\n')
fx.xaxis.set_ticklabels(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
fx.yaxis.set_ticklabels(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
plt.show()
How to find the metrics of one class in a
multiclass confusion matrix?
• True Positive
• The diagonal values are
the respective classes'
TP(True positive) values.
• Calculate TP
• Class: Iris-virginica
• True Positive means both the
actual and predicted values are
positive.
• Here, in this confusion matrix,
True positive for class-Iris-
virginica
• TP=10
• True Negative
• Class: Iris-virginica
• True negative means both the
actual and predicted values are
negative.
• Here, in this confusion matrix,
the True negative for class-Iris-
virginica
• TN = 10+0+0+9= 19
• False Positive
• Class: Iris-virginica
• False positive means the
predicted value is
positive but the actual
value is negative.
• Here, in this confusion matrix, the False positive for class Iris-virginica
• FP = 0 + 1 = 1
• False Negative
• Class: Iris-virginica
• False negative means the
predicted value is negative, but
the actual value is positive.
• Here, in this confusion matrix, the False negative for class Iris-virginica
• FN = 0 + 0 = 0
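• A short sketch of the same per-class bookkeeping with NumPy, using the 3x3 iris matrix discussed above (rows = actual, columns = predicted):

import numpy as np

cm = np.array([[10, 0, 0],
               [ 0, 9, 1],
               [ 0, 0, 10]])
k = 2                              # index of Iris-virginica
TP = cm[k, k]                      # 10
FP = cm[:, k].sum() - TP           # 1  (other classes predicted as virginica)
FN = cm[k, :].sum() - TP           # 0  (virginica predicted as another class)
TN = cm.sum() - TP - FP - FN       # 19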
• Precision
• Precision measures: out of all predicted positives, how many are actually positive.

Precision for class Iris-virginica = 10/(10+1) = 10/11 = 0.909 ≈ 0.91
• Recall
• Recall measures how many
positive records are predicted
correctly.

Recall for class Iris-virginica = 10/(10+0) = 10/10 = 1
Multi-Class F-1 Score Calculation
• F1-score is the harmonic mean of precision and recall.
• For a multi-class classification problem, we first calculate the F1 score per class in a one-vs-rest manner, rating each class's success separately as if there were a distinct classifier for each class. The per-class scores can then be averaged, as shown later.

F1-score for class Iris-virginica = (2 × 0.91 × 1) / (0.91 + 1) = 1.82/1.91 = 0.95
What is a Classification report?
• The classification report displays the performance metrics of the classification model. It shows metrics like precision, recall, f1-score, and support for each class, along with accuracy, macro avg, and weighted avg.
• from sklearn.metrics import classification_report
• print(classification_report(y_test, pred))

Support is the number of actual occurrences of the class in the specified dataset.
Precision, recall, f1-score, support values for
class Iris-Virginica
Accuracy
• Accuracy is calculated by dividing
correct predictions by total
predictions.

Here, correct predictions = 10 + 9 + 10 = 29
Total predictions = 10+0+0+0+9+1+0+0+10 = 30
Accuracy = 29/30 ≈ 0.967


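• The same number falls out of the confusion matrix directly, since correct predictions sit on the diagonal. A quick sketch:

import numpy as np

cm = np.array([[10, 0, 0],
               [ 0, 9, 1],
               [ 0, 0, 10]])
accuracy = np.trace(cm) / cm.sum()   # 29 / 30 ≈ 0.967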
Numerical
• Calculate precision, recall and F1-score for all three classes.
• For the Apple class:
• TP = 7, TN = (2+3+2+1) = 8, FP = (8+9) = 17, FN = (1+3) = 4
• Precision = 7/(7+17) = 0.29
• Recall = 7/(7+4) = 0.64
• F1-score = 0.40
Micro F1 (micro-averaged F1-score)
• It is calculated by considering the
total TP, total FP and total FN of
the model.
• It does not consider each class individually; it calculates the metrics globally.
• Total TP = (7+2+1) = 10
• Total FP = (8+9)+(1+3)+(3+2) = 26
• Total FN = (1+3)+(8+2)+(9+3) = 26
Precision = Recall = Micro F1 = Accuracy
at global level
• Total TP = 10 Total FP = 26 Total FN = 26
• Precision = 10/(10+26) = 0.28
• Recall = 10/(10+26) = 0.28
• Now we can use the regular formula for F1-score and get the Micro
F1-score using the above precision and recall.
• Micro F1 = 0.28
• When we calculate the metrics globally, all the measures become equal.
• Precision = Recall = Micro F1 = Accuracy
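• A tiny sketch (with made-up fruit labels, not the counts above) showing that micro-averaged F1 and accuracy coincide for single-label multiclass predictions:

from sklearn.metrics import accuracy_score, f1_score

y_true = ["apple", "apple", "orange", "mango", "orange", "mango"]
y_pred = ["apple", "orange", "orange", "mango", "mango", "apple"]
print(f1_score(y_true, y_pred, average="micro"))   # 0.5 ...
print(accuracy_score(y_true, y_pred))              # ... the same as plain accuracy, 0.5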
Macro F1
• This is macro-averaged F1-score.
• It calculates the metrics for each class individually and then takes the unweighted mean of the measures.
• Class Apple F1-score = 0.40
• Class Orange F1-score = 0.22
• Class Mango F1-score = 0.11
• Hence,
• Macro F1 = (0.40+0.22+0.11)/3 = 0.24
Weighted F1
• The last one is weighted-averaged F1-score.
• Unlike Macro F1, it takes a weighted mean of the measures.
• The weights for each class are the total number of samples of that
class.
• Since we had 11 Apples, 12 Oranges and 13 Mangoes,
• Weighted F1 = ((0.40*11)+(0.22*12)+(0.11*13))/(11+12+13) = 0.24
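• Re-doing both averages above with plain arithmetic (per-class F1 scores and supports taken from the slides):

f1_apple, f1_orange, f1_mango = 0.40, 0.22, 0.11
n_apple, n_orange, n_mango = 11, 12, 13

macro_f1 = (f1_apple + f1_orange + f1_mango) / 3                                                     # ≈ 0.24
weighted_f1 = (f1_apple*n_apple + f1_orange*n_orange + f1_mango*n_mango) / (n_apple + n_orange + n_mango)   # ≈ 0.24
# sklearn.metrics.f1_score(..., average="macro") and average="weighted" compute these from raw labels.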
Multiclass Classification vs.
Multi-label Classification
• Multiclass classification and multi-label classification are essential
techniques in the world of machine learning, catering to different
types of classification tasks.
• The choice between them depends on the nature of the data and the
problem at hand.
Multiclass Classification
• Multiclass classification, also known as single-label classification, involves
categorizing data into mutually exclusive classes.
• Each data point belongs to one and only one class, making it suitable for
scenarios where items can be distinctly assigned to one category.

• Use Cases:
• Handwritten Digit Recognition: Assigning a digit (0–9) to a given
handwritten image.
• Disease Diagnosis: Identifying the disease category based on patient
symptoms.
Multi-label Classification
• Multi-label classification deals with instances that can be associated
with multiple labels simultaneously.
• This technique is ideal for tasks where data points may belong to
more than one class at the same time.

• Use Cases:
• Scene Classification: Assigning multiple labels (e.g., beach, sunset,
people) to an image.
• Text Categorization: Labeling articles with multiple topics they cover.
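• A minimal sketch (with made-up scene labels) of how multi-label targets are represented: each sample gets a binary vector over all labels rather than a single class.

from sklearn.preprocessing import MultiLabelBinarizer

scene_labels = [{"beach", "sunset"}, {"people"}, {"beach", "people", "sunset"}]
Y = MultiLabelBinarizer().fit_transform(scene_labels)
print(Y)
# columns are the sorted label names: beach, people, sunset
# [[1 0 1]
#  [0 1 0]
#  [1 1 1]]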
ROC curve
(receiver operating characteristic curve)
• We use total area under the ROC curve (AUC) as the evaluation
criteria to find the optimal classifier.
• AUC is a single number that summarizes the model's overall performance across all threshold settings.
• An interesting property of AUC is that it is independent of the class distribution, i.e., whether the class distribution is 10%/90% or 50%/50%, the ranking of models based on AUC would be the same.
ROC curve
(receiver operating characteristic curve)
• ROC curves summarize the trade-off between the true positive rate and the false positive rate for a predictive model at different probability thresholds.
1. True positive rate (TPR) is calculated as the number of true positives divided by the sum of the number of true positives and the number of false negatives. It describes how good the model is at predicting the positive class when the actual outcome is positive. It is also called the hit rate, and is the same as sensitivity/recall.
2. False positive rate (FPR) is calculated as the number of false positives divided by the sum of the number of false positives and the number of true negatives. It is also called the false alarm rate because it summarizes how often a positive class is predicted when the actual outcome is negative.

Specificity=TN/(FP+TN)
FP Rate = FP/(FP+TN) = (1-specificity)
ROC curve
(receiver operating characteristic curve)
• The ROC curve plots TPR vs FPR at different classification thresholds.
• Lowering the classification threshold classifies more items as positive, thereby
increasing both false positives and true positives.
• The ROC curve is a useful tool because:-
• The curves of different models can be compared directly in general or for different thresholds.
• The area under the curve (AUC) can be used as a summary of the model skill.
• The ROC curve is used in conjunction with the AUC, which is the area under the ROC
curve.
• The AUC measures the entire two-dimensional area underneath the ROC curve and ranges from 0 to 1, with 0 being the lowest prediction accuracy and 1 being the highest.
• Therefore, a model whose predictions are 100% wrong has an AUC of 0, while a model whose predictions are 100% correct has an AUC of 1.
ROC curve
(receiver operating characteristic curve)
• A no-skill classifier is one that cannot discriminate between the classes and would predict a random class or a constant class in all cases. Such a model is represented by the point (0.5, 0.5).
• A model with no skill at each threshold is represented by a diagonal line from the bottom left to the top right of the chart and has an AUC of 0.5.
• A model with perfect skill is represented by the point (0, 1). Its curve travels from the bottom left of the plot to the top left and then to the top right, and it has an AUC of 1.
Threshold
• In general, a classification model can predict the probability of being a certain
class for a given record. By comparing the probability value to a threshold value
we set, we can classify the record into a class. In other words, you will need to
define a rule similar to the following:
• If the probability of being positive is greater than or equal to the threshold, then a record is
classified as a positive prediction; otherwise, a negative prediction.
• Example: we can see the probability scores for three records. Using two different
threshold values (0.5 and 0.6), we classified each record into a class. As you can
see, the predicted classes vary depending on the threshold value we choose.
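• A tiny sketch of this rule with three hypothetical probability scores (the actual scores from the slide's table are not reproduced here) and the two thresholds mentioned:

probs = [0.55, 0.72, 0.48]   # made-up probabilities of being positive
for threshold in (0.5, 0.6):
    preds = ["positive" if p >= threshold else "negative" for p in probs]
    print(threshold, preds)
# 0.5 -> ['positive', 'positive', 'negative']
# 0.6 -> ['negative', 'positive', 'negative']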
ROC curve
• When constructing the curve, we
first calculate FPR and TPR across
many threshold values.
• Once we have the FPR and TPR for
the thresholds, we then plot FPR on
the x-axis and TPR on the y-axis to
get a ROC curve.

(In the figure, m and k denote two different thresholds.)
AUC
• Area under a ROC curve ranges
from 0 to 1.
• A completely random model has an area under the ROC curve of 0.5, which is represented by the dashed diagonal line in the plot.
• The further the ROC curve is from
this line, the more predictive the
model is.
How is the ROC curve constructed?
• Step 1: Getting classification
model predictions
• When we train a classification
model, we get the probability
of getting a result. In this
case, our example will be the
likelihood of repaying a loan.
• The probabilities range
between 0 and 1. The higher
the value, the more likely the
person is to repay a loan.
How is the ROC curve constructed?
• Find a threshold to classify the probabilities as “will repay”
or “won’t repay”.
• Here, we've selected a threshold of 0.35:
• All predictions at or above this threshold are classified as "will repay"
• All predictions below this threshold are classified as "won't repay"
• We then look at which of these predictions were correctly classified or misclassified. With this information, we can build a confusion matrix.
• All actual positives, those who did repay, are the blue dots.
• If they were classified as “will repay”, we have a True Positive (TP)
• If they were classified as “won’t repay”, we have a False Negative (FN)
• All actual negatives, those who didn’t repay, are the red dots.
• If they were classified as “won’t repay”, we have a True Negative (TN)
• If they were classified as “will repay”, we have a False Positive (FP)
Step 2: Calculate TPR & FPR

• True positive rate (TPR): of all the people who did repay in the past, what percentage did we classify correctly?
• False positive rate (FPR): of all the people who didn't repay in the past, what percentage did we misclassify?
• Counts by color of circle in the figure: light blue 18, dark blue 2, dark red 4, light red 6.
• At the threshold of 0.35, we
• correctly classified 90% of all positives, those who "paid back" (TPR)
• misclassified 40% of all negatives, those who "didn't pay back" (FPR)
• Overall, we can see this is a trade-off.
• As we increase our threshold, we'll be better at classifying negatives, but this is at the expense of misclassifying more positives.
Step 3: Plot the TPR and FPR for every cut-off
• To plot the ROC curve, calculate the TPR and FPR for many different
thresholds.
• For each threshold, we plot the FPR value in the x-axis and the TPR
value in the y-axis. We then join the dots with a line.
• The area covered below the line is called “Area Under the Curve
(AUC)”. This is used to evaluate the performance of a classification
model. The higher the AUC, the better the model is at distinguishing
between classes.
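• A minimal end-to-end sketch of steps 1 to 3 with scikit-learn on synthetic data (the data and model here are illustrative, not the lecture's loan example):

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: probabilities; Step 2: TPR/FPR at many thresholds; Step 3: plot and AUC
probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)

plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, probs):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="no skill")   # diagonal reference line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()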
Thanks and Happy Learning…….
