

Performance Considerations for Binary Classification Machine Learning Models


Joseph Dixon

Binary Classification (BC) Machine Learning Models¹


BC models are algorithms in executable program form that categorize (new) observations into one of
two classes, after being trained to distinguish between the classes using known observations.
(Karabiber, 2023) Such models predict a condition’s presence or absence for an observation (generally,
positive or negative). In practice, BC models prove effective at such tasks as:

▪ Credit-worthiness assessment
▪ Spam filtering
▪ Image classification

Key Performance Metrics


A BC model’s classification performance is vital to the model’s effectiveness:

▪ for performance comparisons during model development and training;

▪ for performance assessment during use, where an observation’s actual classification is known.

At first glance, measuring performance appears to require only two considered outcomes:

correct classifications / [correct + incorrect classifications]

Full understanding, however, requires nuanced considerations, because –

A. There are actually four outcome possibilities (Irizarry, 2019): true positive (TP), false positive (FP), true negative (TN), and false negative (FN).

B. The classification task’s logical context determines which possible outcome(s) are relevant.
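For illustration (this example is not from the assignment’s figures), the four outcome counts can be tallied with scikit-learn’s confusion_matrix; the label and prediction vectors below are invented toy data:

# Toy illustration of the four outcome possibilities (TN, FP, FN, TP).
# The label vectors here are invented for demonstration only.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # known classifications
y_predicted = [1, 0, 0, 1, 1, 0, 1, 0]   # model's predictions

# scikit-learn orders the 2x2 matrix as [[TN, FP], [FN, TP]] for labels [0, 1]
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")   # TP=3, TN=3, FP=1, FN=1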

Key concepts for measuring classification performance are (Géron, 2017) (Starmer, 2020):

Precision – the model’s ability to predict positive only for actual positive cases – refers to “how precise are the model’s true-positive (i.e., correct positive) predictions relative to all positive predictions it made?” False positives (FP) penalize performance.

¹ For discussion purposes, ‘classification’ and ‘prediction’ are synonyms.



Sensitivity/Recall – the model’s ability to correctly predict actual positives in the set of observations – refers to “how many of the total actual positives does the model actually recall?”; or, “how sensitive are positive predictions to actual (positive) cases?” A negative prediction for an actual positive (false negative, FN) penalizes performance.

Specificity – the model’s ability to correctly predict all actual negative cases (TN) – refers to “how specific is the model in accurately classifying actual negative conditions?” Positive predictions for actual negatives (FP) penalize performance.

Figure 2 shows each measure’s mathematical formulation, while Figure 3 visualizes performance
concepts’ relations to outcome possibilities.
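Figure 2 is not reproduced here; for reference, the standard formulations of these measures in terms of the four outcome counts are:

Precision = TP / (TP + FP)
Sensitivity/Recall = TP / (TP + FN)
Specificity = TN / (TN + FP)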

Explaining to a Lay Audience


As already described, classification performance bears nuances that could challenge lay audiences.

Therefore, explanation to a lay audience likely warrants:

1. Avoiding math-based description, as the audience’s “percentages-challenged” members may be a substantial share. (Siegler, 2017)
2. Emphasizing why and how each measure’s term “makes sense”.
3. Illustrating/reinforcing concepts through (many) common-case examples.

Precision/Recall Trade-Off
Precision and recall (aka sensitivity) can contend: i.e., improving one can degrade the other. Significantly, Figure 2 shows that the precision and recall formulas:

▪ differ only in the denominator, by additive FP and FN terms, respectively.

Figure 4 illustrates how contention can arise between the metrics:

▪ Tactics to increase precision performance require lowering its denominator value, achieved by lowering the FP count;

▪ Lowering FP counts requires increasing the decision-making threshold value;



▪ However, a threshold value increase sets a “lower bar” for predicting negatives, raising the likelihood of negative predictions of actual positives (FNs);

▪ Increased FNs increase recall’s denominator, which decreases recall performance (see the sketch below).
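A minimal sketch of this trade-off, assuming a model that outputs probability scores (the scores and labels below are invented toy values):

# Precision/recall at two decision thresholds (toy data).
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_actual = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])
scores   = np.array([0.9, 0.8, 0.7, 0.65, 0.6, 0.55, 0.5, 0.4, 0.35, 0.2])

for threshold in (0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)   # predict positive at/above the threshold
    p = precision_score(y_actual, y_pred)
    r = recall_score(y_actual, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")

Raising the threshold from 0.5 to 0.7 removes the false positives (precision rises from 0.57 to 1.00) but converts some true positives into false negatives (recall falls from 0.80 to 0.60).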

A Unique Example
Consider Mars robotic mineral sampling at candidate collection locales:

▪ TPs are important because quality sample analysis is a prime exploration objective

▪ Avoiding FPs (higher precision) is important – a sample collected that shouldn’t have been depletes the battery, reducing per-excursion quality yield; likewise, drill-bit wear reduces sampling lifetime.

▪ Avoiding FNs (higher recall) is important – a sample not collected that should have been directly risks failing the exploration objective.

ROC Curve and Use


The ROC (receiver operating characteristic) curve visually characterizes the prediction accuracy trade-off, as Figure 5 illustrates, by:

▪ plotting recall (sensitivity) against [1 – specificity], visually illustrating the trade-off between the two

▪ plotting a random classification scheme’s (RCS) trade-off – a [0, 0] to [1, 1] diagonal – as reference.

Area under the ROC curve (AUC, shaded) provides a calculable performance measure in the range [0, 1] that includes the relevant trade-offs. (Glen, 2019) (Wikipedia, 2023)

Finally, the model developer identifies a desirable trade-off point on the ROC curve, then selects the
point’s corresponding threshold cutoff value to include in the BC decision function.
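A minimal sketch of computing the ROC curve points and AUC with scikit-learn, again on invented scores:

# ROC curve points and AUC from model scores (toy data).
from sklearn.metrics import roc_curve, roc_auc_score

y_actual = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
scores   = [0.9, 0.8, 0.7, 0.65, 0.6, 0.55, 0.5, 0.4, 0.35, 0.2]

# fpr = 1 - specificity (x-axis); tpr = recall/sensitivity (y-axis)
fpr, tpr, thresholds = roc_curve(y_actual, scores)
auc = roc_auc_score(y_actual, scores)

for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}: 1-specificity={f:.2f}, recall={t:.2f}")
print(f"AUC = {auc:.2f}")

Each printed row corresponds to one candidate threshold (cut-off) point on the curve.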



Bibliography
Géron, A. (2017). Hands-On Machine Learning with Scikit-Learn and TensorFlow. O'Reilly Media, Inc.

Glen, S. (2019). ROC Curve Explained in One Picture. Retrieved from Data Science Central:
https://www.datasciencecentral.com/roc-curve-explained-in-one-picture/

Irizarry, R. (2019). Introduction to Data Science: Data Analysis and Prediction Algorithms with R.
Chapman & Hall.

Karabiber, F. (2023). Binary Classification. Retrieved from Learning Data Science: https://www.learndatasci.com/

Siegler, R. (2017, November 28). Fractions: Where It All Goes Wrong. Scientific American.

Starmer, J. (Director). (2020). Machine Learning Fundamentals: Sensitivity and Specificity [Motion
Picture].

Wikipedia. (2023). Receiver operating characteristic. Retrieved from Wikipedia: https://en.wikipedia.org/wiki/Receiver_operating_characteristic

