
How does classification work?

Training data:
  CS   Result
  5    Fail
  6    Pass
  2    Fail
  7    Pass
  9    Pass

The classification algorithm learns a rule from the training data, for example:
  If CS <= 5 then Result = Fail, else Result = Pass
This rule is the classifier model.

Testing data: CS = 8, Result = ?
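• As a rough illustration (a minimal Python sketch, not part of the original slide), here is the rule-based classifier implied above, assuming the rule CS <= 5 → Fail is what the algorithm learned:

```python
# Minimal sketch of the slide's rule-based classifier (illustrative only).
training_data = [(5, "Fail"), (6, "Pass"), (2, "Fail"), (7, "Pass"), (9, "Pass")]

def classifier_model(cs):
    """Rule learned from the training data: CS <= 5 -> Fail, otherwise Pass."""
    return "Fail" if cs <= 5 else "Pass"

# The rule reproduces every label in the training data.
for cs, actual in training_data:
    assert classifier_model(cs) == actual

# Apply the model to the unseen test case from the slide.
print("CS = 8 ->", classifier_model(8))  # Pass
```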


Confusion matrix
• We evaluated the performance of classification models by computing
the accuracy score. We defined the accuracy score as:
• accuracy = number of correct predictions / total number of
predictions

• So, if I had 100 people, 50 of whom were cheese lovers and 50 of
whom were cheese haters, and I correctly identified 36 of the
cheese lovers and 24 of the cheese haters, my accuracy would be
• accuracy = (36 + 24) / (50 + 50) = 60%.
• If I told you that I had built a model that could predict, with over
99% accuracy, whether you would win the lottery or not, would
you be impressed?
• You shouldn't be! 
• Whilst the accuracy score is fine for balanced classes, like the cheese
lovers/haters example above, it can be very misleading for
unbalanced classes.
• What are balanced and unbalanced classes?
• Balanced classes are those where we have an equal number of individuals in each
class in our sample. Here is a balanced class:
• Cheese lovers: 50 people
• Cheese haters: 50 people
• Unbalanced classes are those where there are unequal numbers of individuals in
each class in our sample. Here is an unbalanced class:
• Lottery winners: 1 person
• Lottery losers: 45 million people
• In the lottery winner prediction example, I can get well over 99% accuracy just by
predicting that everyone will lose!
• accuracy = (45000000 + 0) / (45000000 + 1) = 99.999998%
• So, we need to come up with a better way to measure the performance of our
classification models...
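• A quick sanity check (a hypothetical Python sketch, not from the slides) makes the lottery point concrete: a "model" that simply predicts that everyone loses gets this accuracy.

```python
# The lottery example: a degenerate "classifier" that always predicts "lose".
n_losers, n_winners = 45_000_000, 1

correct_predictions = n_losers            # every loser is predicted correctly
total_predictions = n_losers + n_winners
accuracy = correct_predictions / total_predictions

print(f"accuracy = {accuracy:.6%}")       # ~99.999998%, yet it never finds the winner
```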
Use a Confusion Matrix
• The confusion matrix breaks down the classification results, showing
a summary of the performance of the predictions against the actuals.
• The green cells on the diagonal represent the correct classifications,
and the white cells represent the incorrect classifications. As you can
see, this gives a much more detailed overview of how our model is
performing.
• In the general case of a binary classification,
we use the following terms for the 4 boxes:
• True Positive (TP)
• True Negative (TN)
• False Positive (FP)
• False Negative (FN)
• If we have more than two classes, we can
extend the confusion matrix accordingly:
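• As an illustration (assuming Python and scikit-learn, neither of which is mentioned in the slides), a confusion matrix can be computed directly from the actual and predicted labels:

```python
from sklearn.metrics import confusion_matrix

# Toy binary example with made-up labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes.
# With labels=[1, 0] the layout is [[TP, FN], [FP, TN]].
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)

(tp, fn), (fp, tn) = cm
print("TP, FN, FP, TN =", tp, fn, fp, tn)
```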
Precision and recall
• Imagine you are approached by the owner of an orchard, producing large
quantities of apples for different markets. The orchard has two key
customers.
• One is a supermarket, whose marketing line is “good food at low prices”.
• The other is a premium cider manufacturer, whose marketing line is
“producing the highest-quality cider from only the finest apples”.
• Sometimes bad apples enter the batch, and the orchard has machines with
various sensors to identify certain faults in the apples. Color, firmness,
blemishes, size, and many other attributes contribute to identifying an apple
as bad or good.
• You are hired to create two machine learning models. One will try to identify
apples suitable for the low-cost supermarket. Let’s call this the
SUPERMARKET model. The other will try to identify apples suitable for the
premium cider manufacturer. Let’s call this the CIDER model.
• Look at the following two confusion matrices
produced by two different models.
• Which is the best for the SUPERMARKET model
and which is the best for the CIDER model?
• The answer comes down to which model
carries the most unacceptable "cost"
(negatives) for each scenario.
• The cost of Model A is that we have some false
negatives. In other words, we allow some good
apples to be rejected. The cost of Model B is
that we have some false positives. In other
words, some bad apples slip through into the
batch.
• For our CIDER model, which is worse? That we allow 4 bad apples
to spoil the whole cider batch (Model B) or that we reject 22
good apples (Model A)? Certainly, we don't want to spoil the
cider! So we would go for Model A.
• For our SUPERMARKET model, which is worse? That we allow 4
bad apples onto the supermarket shelf (Model B), or that we
reject 22 good apples (Model A)? It would certainly eat into our
profits if we kept throwing away so many good apples, so Model
B would be better.
Precision
• Precision — also called Positive Predictive Value.
• The ratio of correct positive predictions to the
total predicted positives:
• Precision = TP / (TP + FP)
Recall
• Recall — also called Sensitivity, Probability of
Detection, or True Positive Rate.
• The ratio of correct positive predictions to the
total positive examples:
• Recall = TP / (TP + FN)
• Can you calculate precision and recall for our 2 models? Once you are
done, check your answers below.
• Model A:
Precision = 78 / (78 + 0) = 1
Recall = 78 / (78 + 22) = 0.78
• Model B:
• Precision = 100 / (100 + 4) = 0.96
• Recall = 100 / (100 + 0) = 1
• Model A is a high precision model and Model B is a high recall
model.
• For our SUPERMARKET model, we want to maximize recall. We want
to ensure we throw away as few apples as possible. We would favor
model B.
• For our CIDER model, we want to maximize precision. We want to
ensure we don't spoil the batch by processing bad apples. We would
favor model A.
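• For reference, here is a small Python sketch (not part of the original text) that reproduces these precision and recall figures from the TP/FP/FN counts used above:

```python
# Precision and recall from the counts read off the two confusion matrices above.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Model A: TP = 78, FP = 0, FN = 22
print(precision(78, 0), recall(78, 22))             # 1.0 0.78

# Model B: TP = 100, FP = 4, FN = 0
print(round(precision(100, 4), 2), recall(100, 0))  # 0.96 1.0
```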
F1 score
• Having measures to help us pull our models towards high precision or
high recall is really useful. But what if we want a balanced measure,
one that balances both precision and recall? The F1 score does just
that, and is calculated as follows:
• F1 = 2 * Precision * Recall / (Precision + Recall)
• Calculating F1 for our two models:
• Model A:
• F1 = 2 * (1 * 0.78) / (1 + 0.78) ≈ 0.876
• Model B:
• F1 = 2 * (0.96 * 1) / (0.96 + 1) ≈ 0.98
• If we want a balanced model, we can maximize F1. Be careful,
however. Don't use F1 just because it saves you worrying about the
precision/recall tradeoff. In general, you should be clear about
whether your intention is to maximize precision or recall!
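• A short Python sketch (an illustration, not part of the source) of the F1 calculation for the two models:

```python
# F1 is the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(1.0, 0.78), 3))   # Model A -> 0.876
print(round(f1(0.96, 1.0), 3))   # Model B -> 0.98
```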
Use the Receiver Operating Characteristic (ROC) Curve

• Some classification algorithms predict a hard in/out classification decision.
For example, they may classify an email as spam or not spam.
• Other classification algorithms, such as logistic regression, predict a
probability that a data point belongs to a specific class. For example, they
may say that a particular email is 78% likely to be spam. In this case, we
would want to determine a threshold above which we decide an email is
spam. We may choose 50%, or perhaps 70%. We can adjust the threshold in
order to balance the tradeoff between finding too many false positives or
too many false negatives.
• But how do we do that?   
• A useful evaluation technique would be to examine how a model behaves as
this threshold moves. This will give us an idea of how well the model
separates the classes. We can then compare different models and see how
well each model separates the classes.
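• As a small sketch of the thresholding idea (plain Python/NumPy with made-up probabilities, not from the slides), adjusting the threshold changes which probabilistic predictions become positive classifications:

```python
import numpy as np

# Hypothetical predicted spam probabilities from a probabilistic classifier
# such as logistic regression.
proba = np.array([0.05, 0.40, 0.55, 0.78, 0.91])

for threshold in (0.5, 0.7):
    is_spam = proba >= threshold
    print(threshold, is_spam.tolist())

# Raising the threshold turns some positive predictions into negatives,
# trading false positives for false negatives.
```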
• For binary classification problems, the Receiver
Operating Characteristic curve (ROC curve)
provides this technique.
• Let's take our apple-selection example from
above. We can build a logistic regression model
that computes the probability of a good / bad
classification based on some feature or set of
features (such as crispness of apple). Let's
assume we have just 6 good and 6 bad apples:
• We can set a threshold of 0.5, as shown by the
blue dotted lines:
• This nicely maps our classification
decision onto a confusion matrix, by counting
the good and bad apples separated by the
blue threshold lines:
• If we move the threshold to 0.8, we get a new
confusion matrix:
• We can continue in this way, generating confusion matrices for a bunch of different
thresholds on the same classification model. We can then compute two new
measures from each confusion matrix, the True Positive Rate and False Positive
Rate:
• True Positive Rate (TPR) = TP / (TP + FN)
• False Positive Rate (FPR) = FP / (FP + TN)
• In our example, the TPR is the percentage of good apples found and the FPR is the
percentage of bad apples misclassified as good apples. So if we plot TPR against
FPR we get a nice view on how well the model behaves as we adjust the threshold
and hence how well the model separates the two classes:
• If we had another model that produced the
following ROC curve, we would conclude that
it was less good at separating the good and
bad apples:
• Why is this less good? 
• Because in order to improve the number of
good apples found (i.e. get the dots to move
up) we need to suffer more misclassifications of
bad apples (i.e. the dots will move to the right).
• If, however, we had a model that produced the
following ROC curve, we'd have a really
excellent model:
Why is this so good? Because we can increase the number of
good apples identified (i.e. get the dots to move up) without
suffering many more misclassifications of bad apples (i.e.
without the dots moving too far to the right).
ROC curves that hug the top left corner of the chart represent
models that do a good job of separating the classes.
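• As an aside (a minimal sketch assuming Python and scikit-learn), the (FPR, TPR) points of a ROC curve, and the area under it, can be computed directly from the true labels and the model's scores:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical apple data: 1 = good apple, 0 = bad apple,
# with made-up model scores (predicted probability of "good").
y_true  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.45, 0.55, 0.8, 0.85, 0.9, 0.95]

# One (FPR, TPR) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for t, x, y in zip(thresholds, fpr, tpr):
    print(f"threshold={t:.2f}  FPR={x:.2f}  TPR={y:.2f}")

# The closer the AUC is to 1, the better the model separates the classes.
print("AUC =", roc_auc_score(y_true, y_score))
```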
ROC Curves
• ROC is an abbreviation of Receiver Operating Characteristic; the idea comes from
signal detection theory, developed during World War II for the analysis of
radar images.

• In the context of classification, a ROC plot is a useful tool for studying the behaviour
of a classifier or for comparing two or more classifiers.

• A ROC plot is a two-dimensional graph, where the X-axis represents the false positive
rate (FPR) and the Y-axis represents the true positive rate (TPR).

• Since the values of FPR and TPR both range from 0 to 1 inclusive, the
two axes also range only from 0 to 1.

• Each point (x, y) on the plot indicates that the FPR has value x and the
TPR has value y.
ROC Plot
• A typical ROC plot, with a few points in it, is shown in the following
figure.

• Note that the four corner points are the four extreme cases of classifiers.

Identify the four extreme classifiers.



Interpretation of Different Points in ROC Plot
• Let us interpret the different points in the ROC plot.

• The four points (A, B, C, and D)

– A: TPR = 1, FPR = 0: the ideal model, i.e., the perfect classifier, with no false results

– B: TPR = 0, FPR = 1: the worst classifier, not able to predict a single instance correctly

– C: TPR = 0, FPR = 0: the model predicts every instance to be of the Negative class, i.e., it is an ultra-conservative
classifier

– D: TPR = 1, FPR = 1: the model predicts every instance to be of the Positive class, i.e., it is an ultra-liberal classifier
Interpretation of Different Points in ROC Plot
• Let us interpret the different points in the ROC plot.

• The points on diagonals

– The diagonal line joining points C(0, 0) and D(1, 1) corresponds to random guessing

• Random guessing means that a record is classified as positive (or negative) with a certain probability
• Suppose a test set contains N+ positive and N− negative instances, and the classifier guesses any instance to be
positive with probability p
• Then the random classifier is expected to label p·N+ of the positive instances as positive (correctly) and p·N− of the
negative instances as positive (incorrectly)
• Hence, TPR = p·N+ / N+ = p and FPR = p·N− / N− = p
• Since TPR = FPR, the results of random classifiers lie on the main diagonal
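• A quick simulation (a hypothetical Python sketch, not in the original slides) confirms that random guessing with probability p lands on the diagonal:

```python
import random

# Guess "positive" with probability p for every instance, regardless of its features.
random.seed(0)
p = 0.3
n_pos, n_neg = 100_000, 100_000

tp = sum(random.random() < p for _ in range(n_pos))  # positives guessed as positive
fp = sum(random.random() < p for _ in range(n_neg))  # negatives guessed as positive

print("TPR ~", tp / n_pos)   # close to p
print("FPR ~", fp / n_neg)   # close to p
```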
Interpretation of Different Points in ROC Plot
• Let us interpret the different points in the ROC plot.

• The points on the upper diagonal region

– All points that lie in the upper-diagonal region correspond to "good" classifiers, as
their TPR is higher than their FPR (i.e., FPRs are lower than TPRs)

– Here, X is better than Z, as X has a higher TPR and a lower FPR than Z.

– If we compare X and Y, neither classifier is superior to the other.


Interpretation of Different Points in ROC Plot
• Let us interpret the different points in the ROC plot.

• The points on the lower diagonal region

– The lower-diagonal region corresponds to classifiers that are worse than random classifiers

– Note: for a classifier that is worse than random guessing, we can get a better classifier simply by
reversing its predictions.

• For example, W′(0.2, 0.4) is a better version of W(0.4, 0.2); W′ is the mirror reflection of W about the diagonal



Example to interpret a confusion matrix:
• To simplify the confusion matrix above, I have added all the terms (TP, FP,
etc.) and the row and column totals in the following image:
• Classification Rate/Accuracy:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (100 + 50) / (100 + 5 + 10 + 50) ≈ 0.91

• Recall: recall gives us an idea of how often the model predicts yes when it is
actually yes.
Recall = TP / (TP + FN) = 100 / (100 + 5) ≈ 0.95

• Precision: precision tells us how often the model is correct when it predicts yes.
Precision = TP / (TP + FP) = 100 / (100 + 10) ≈ 0.91
• F-measure:
F-measure = (2 * Recall * Precision) / (Recall + Precision) = (2 * 0.95 * 0.91) /
(0.95 + 0.91) ≈ 0.93
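• These figures can be reproduced with a few lines of Python (an illustrative sketch, not part of the original example):

```python
# Re-computing the worked example from its counts: TP=100, FN=5, FP=10, TN=50.
tp, fn, fp, tn = 100, 5, 10, 50

accuracy  = (tp + tn) / (tp + tn + fp + fn)
recall    = tp / (tp + fn)
precision = tp / (tp + fp)
f_measure = 2 * recall * precision / (recall + precision)

print(round(accuracy, 2), round(recall, 2), round(precision, 2), round(f_measure, 2))
# 0.91 0.95 0.91 0.93
```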
Confusion Matrix:

• A confusion matrix is a summary of prediction results on a
classification problem.
• The number of correct and incorrect predictions are
summarized with count values and broken down by each
class. This is the key to the confusion matrix.
• The confusion matrix shows the ways in which your
classification model is confused when it makes predictions.
• It gives us insight not only into the errors being made by a
classifier but more importantly the types of errors that are
being made.
• Here,
• Class 1 : Positive
• Class 2 : Negative

• Definition of the Terms:


• Positive (P) : Observation is positive (for example: is an apple).
• Negative (N) : Observation is not positive (for example: is not an apple).
• True Positive (TP) : Observation is positive, and is predicted to be
positive.
• False Negative (FN) : Observation is positive, but is predicted negative.
• True Negative (TN) : Observation is negative, and is predicted to be
negative.
• False Positive (FP) : Observation is negative, but is predicted positive.
• Classification Rate/Accuracy:
Classification Rate or Accuracy is given by the relation:
Accuracy = (TP + TN) / (TP + TN + FP + FN)

• There are problems with accuracy. It assumes equal
costs for both kinds of errors. A 99% accuracy can be
excellent, good, mediocre, poor, or terrible depending
upon the problem.
• Recall:
Recall can be defined as the ratio of the total
number of correctly classified positive
examples divided by the total number of
positive examples. High recall indicates the
class is correctly recognized (a small number of
FN).
• Recall is given by the relation:
Recall = TP / (TP + FN)
• Precision:
To get the value of precision, we divide the total number of
correctly classified positive examples by the total number of
predicted positive examples. High precision indicates that an example
labeled as positive is indeed positive (a small number of FP).
Precision is given by the relation:
Precision = TP / (TP + FP)

• High recall, low precision: this means that most of the positive
examples are correctly recognized (low FN), but there are a lot of
false positives.
• Low recall, high precision: this shows that we miss a lot of positive
examples (high FN), but those we predict as positive are indeed
positive (low FP).
• F-measure:
Since we have two measures (Precision and Recall) it helps to have a measurement
that represents both of them. We calculate an F-measure which uses Harmonic
Mean in place of Arithmetic Mean as it punishes the extreme values more.
The F-Measure will always be nearer to the smaller value of Precision or Recall.

• Let's consider an example now, in which we have (effectively) infinitely many data
elements of class B and a single element of class A, and the model predicts class A for
every instance in the test data.
Here,
Precision : 0.0
Recall : 1.0
• Now:
Arithmetic mean: 0.5
Harmonic mean: 0.0
Taking the arithmetic mean, the model would appear to be 50% good, despite this
being close to the worst possible outcome! Taking the harmonic mean, the F-measure is 0.
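• In code (a tiny Python illustration, not from the source), the contrast looks like this:

```python
# Arithmetic vs. harmonic mean for the degenerate case precision = 0.0, recall = 1.0.
precision, recall = 0.0, 1.0

arithmetic_mean = (precision + recall) / 2
harmonic_mean = 2 * precision * recall / (precision + recall)  # this is the F-measure

print(arithmetic_mean, harmonic_mean)  # 0.5 0.0 -> the F-measure rightly reports a useless model
```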
• Recall versus Precision
• Precision is the ratio of True
Positives to all the positives
predicted by the model.
• Low precision: the more False
positives the model predicts, the
lower the precision.

• Recall (Sensitivity) is the ratio
of True Positives to all the positives
in your dataset.
• Low recall: the more False
Negatives the model predicts, the
lower the recall.
• The idea of recall and precision
may seem abstract. Let me
illustrate the difference in three real
cases.
• the result of TP will be that residents
with COVID-19 are correctly diagnosed
with COVID-19.
• the result of TN will be that healthy
residents are correctly identified as healthy.
• the result of FP will be that actually
healthy residents are predicted to be
COVID-19 residents.
• the result of FN will be that actual
COVID-19 residents are predicted to be
healthy residents.
•In case 1, which scenario do you think will
have the highest cost?
•Imagine that if we predict COVID-19
residents as healthy patients and they do
not need to quarantine, there would be a
massive number of COVID-19 infections.
The cost of false negatives is much higher
than the cost of false positives.
•the result of TP will be that spam emails
are placed in the spam folder.
•the result of TN will be that important
emails are received.
•the result of FP will be that important
emails are placed in the spam folder.
•the result of FN will be that spam emails
are received.
In case 2, which scenario do you think will
have the highest cost?
Well, since missing important emails will
clearly be more of a problem than receiving
spam, we can say that in this case, FP will
have a higher cost than FN.
•the result of TP will be that bad loans
are correctly predicted as bad loans.
•the result of TN will be that good
loans are correctly predicted as good
loans.
•the result of FP will be that (actual)
good loans are incorrectly predicted as
bad loans.
•the result of FN will be that (actual)
bad loans are incorrectly predicted as
good loans.
In case 3, which scenario do you think
will have the highest cost?
The banks would lose a large amount
of money if actual bad loans are
predicted as good loans, because those
loans are not repaid. On the other
hand, the banks only miss out on some
revenue if actual good loans are
predicted as bad loans. Therefore, the
cost of False Negatives is much higher
than the cost of False Positives.
In practice, the cost of false negatives is not the same as the cost of false positives;
which one matters more depends on the specific case. It is evident that we should not
only calculate accuracy, but should also evaluate our model using other metrics, for
example, Recall and Precision.
• Combining Precision and Recall — F1 Score
• In the above three cases, we want to maximize
either recall or precision at the expense of the other
metric. For example, in the case of a good or bad
loan classification, we would like to decrease FN to
increase recall. However, in cases where we want to
find an optimal blend of precision and recall, we can
combine the two metrics using the F1 score.
• F-Measure provides a single score that balances both
the concerns of precision and recall in one number. A
good F1 score means that you have low false positives
and low false negatives, so you’re correctly identifying
real threats, and you are not disturbed by false alarms.
An F1 score is considered perfect when it’s 1, while the
model is a total failure when it’s 0.
