Model Comparison
• We all know how to compare two linear regression models with the same number of independent variables: look at R².
• When the number of independent variables differs between the two regressions, look at adjusted R².
• What should you do, though, to compare a linear regression model with a nonlinear regression model, which has no directly comparable definition of R²?
• There is a definite need to compare different types of models so that the
better model may be chosen, and that’s the topic of this chapter.
• First, we will examine the case of a continuous dependent variable, which
is rather straightforward.
• Subsequently, we will discuss the binary dependent variable. It permits
many different types of comparisons.
Model comparison with
continuous dependent variable
• Three common measures used to compare predictions of a continuous
variable are the mean square error (MSE), its square root (RMSE), and
the mean absolute error (MAE).
MSE = (1/n) Σ (yᵢ − ŷᵢ)²    RMSE = √MSE    MAE = (1/n) Σ |yᵢ − ŷᵢ|

                MSE          MAE          Correlation
In-Sample       0.00338668   0.04734918   0.68339270
Out-of-Sample   0.00284925   0.04293491   0.75127994

The in-sample and out-of-sample MSE and MAE are quite close, which leads one to think that the model is doing about as well at predicting out-of-sample as it is at predicting in-sample. The correlation confirms this notion.
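For readers who want to compute these measures outside JMP, the sketch below implements the three error measures in plain Python; the return series is made up for illustration.

```python
import math

def mse(actual, predicted):
    # Mean squared error: average squared prediction error
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Root mean squared error: same units as the dependent variable
    return math.sqrt(mse(actual, predicted))

def mae(actual, predicted):
    # Mean absolute error: average absolute prediction error
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# hypothetical actual and predicted returns
y    = [0.02, -0.01, 0.03, 0.00]
yhat = [0.01,  0.00, 0.02, 0.01]
print(mse(y, yhat), rmse(y, yhat), mae(y, yhat))  # every error is 0.01 in size,
                                                  # so MSE = 0.0001, RMSE = MAE = 0.01
```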
Model comparison with
continuous dependent variable
• To gain some insight into this phenomenon, look at a graph of Return on
McDonalds against Predicted Return on McDonalds.
• Select Graph→Scatterplot Matrix (or select Graph→Overlay Plot)
– Select Return on McDonalds and click Y, Columns.
– Then select Predicted Return on McDonalds and click X. Click OK.
• In the data table, select observations 41-48, which will make them appear as bold dots on the graph. (To make them stand out further, while still in the data table, right-click on the selected observations, select Markers, and then choose the plus sign (+).)
• These out-of-sample observations appear to be in agreement with the in-
sample observations, which suggests that the relationship between Y and
X that existed during the in-sample period continued through the out-of-
sample period.
Model comparison with binary
dependent variable
• A natural first measure is accuracy, the proportion of correctly classified observations: (TP + TN) / (TP + FP + FN + TN).
• Accuracy is not a particularly useful measure, however, because it gives equal weight to all components of the confusion matrix.
– Imagine that you are trying to predict a rare event—say, cell phone churn—when only 1% of customers churn. If you simply predict that no customers churn, your accuracy rate will be 99%. (Since you are not predicting any churn, TP = 0 and FP = 0.) Clearly, better statistics that make better use of the elements of the confusion matrix are needed.
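The accuracy paradox can be checked in a few lines of Python; the 1,000-customer split below is hypothetical:

```python
# 1,000 customers, 10 of whom churn (1%); predict "no churn" for everyone
actual    = [1] * 10 + [0] * 990
predicted = [0] * 1000

# with no positive predictions, TP = 0 and FP = 0
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
accuracy = (tp + tn) / len(actual)
print(accuracy)  # 0.99: high accuracy, yet not a single churner is caught
```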
Model comparison with binary
dependent variable
• Confusion table:
Predicted 1 Predicted 0
Actual 1 True positive (TP) False negative (FN)
Actual 0 False positive (FP) True negative (TN)
• One such measure is the sensitivity, or true positive rate (TPR), which is defined as:

TPR = TP / (TP + FN)

• This is also known as recall. It answers the question, “If an event really is positive, what is the probability that the model predicts it as positive?”
Model comparison with binary
dependent variable
• Confusion table:
Predicted 1 Predicted 0
Actual 1 True positive (TP) False negative (FN)
Actual 0 False positive (FP) True negative (TN)
• Similarly, the true negative rate is also called the specificity and is given by:

TNR = TN / (TN + FP)

• It answers the question, “If an event really is negative, what is the probability that the model predicts it as negative?”
Model comparison with binary
dependent variable
• Confusion table:
Predicted 1 Predicted 0
Actual 1 True positive (TP) False negative (FN)
Actual 0 False positive (FP) True negative (TN)
• A third measure is the false positive rate (FPR), given by:

FPR = FP / (FP + TN) = 1 − specificity

• It answers the question, “If an event really is negative, what is the probability that the model mistakenly predicts it as positive?”
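All three rates can be computed directly from the four cells of the confusion table; the counts in the sketch below are hypothetical:

```python
def rates(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    fpr = fp / (fp + tn)           # false positive rate = 1 - specificity
    return sensitivity, specificity, fpr

# hypothetical confusion table: 50 actual positives, 150 actual negatives
sens, spec, fpr = rates(tp=40, fn=10, fp=30, tn=120)
print(sens, spec, fpr)  # 0.8 0.8 0.2
```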
ROC curve
• When the False Positive Rate (FPR) is plotted on the x-axis, and the True Positive
Rate (TPR) is plotted on the y-axis, the resulting graph is called an ROC curve
(“Receiver Operating Characteristic Curve”).
• In order to draw the ROC curve, the classifier has to produce a continuous-valued
output that can be used to sort the observations from most likely to least likely.
• The predicted probabilities from a logistic regression are a good example. In an
ROC graph, the vertical axis shows the proportion of ones that are correctly
identified, and the horizontal axis shows the proportion of zeros that are
misidentified as ones.
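To make the construction concrete, here is a minimal sketch (not JMP's implementation) that sweeps a classification threshold over a classifier's scores and collects one (FPR, TPR) point per threshold; the scores and labels are invented:

```python
def roc_points(scores, labels):
    """Sweep the threshold down through the distinct scores; each
    threshold yields one (FPR, TPR) point on the ROC curve."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # the classifier that never predicts positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,    1,   0,   0  ]
print(roc_points(scores, labels))  # starts at (0, 0), ends at (1, 1)
```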
ROC curve
• To interpret the ROC curve, first note that the
point (0,0) represents a classifier that never
issues a positive classification: its FPR is
zero, which is good. But it never correctly
identifies a positive case, so its TPR is zero,
also, which is bad. The point (0,1) represents
the perfect classifier: it always correctly
identifies positives and never misclassifies a
negative as a positive.
• Remember that each point on an ROC curve corresponds to a particular confusion matrix that, in turn, depends on a specific threshold. This threshold is usually a probability cutoff, e.g., classify the observation as “1” if the probability of its being a “1” is 0.50 or greater.
• Therefore, any ROC curve represents various confusion matrices generated by a classifier as the
threshold is changed.
• Points in the lower left region of the ROC space identify “conservative” classifiers. They require strong
evidence to classify a point as positive. So they have a low false positive rate; necessarily they also
have low true positive rates.
• On the other hand, classifiers in the upper right region can be considered “liberal.” They do not require
much evidence to classify an event as positive. So they have high true positive rates; necessarily, they
also have high false positive rates.
ROC curve
• When two ROC curves cross, neither is unambiguously better
than the other, but it is possible to identify regions where one
classifier is better than the other.
• The figure shows the ROC curve as a dotted line for a classifier
produced by Model 1—say, logistic regression—and the ROC curve
as a solid line for a classifier produced by Model 2—say, a
classification tree.
• Suppose it is important to keep the FPR low, say at 0.2. Then, clearly,
Model 2 would be preferred because when FPR is 0.2, it has a much
higher TPR than Model 1.
• Conversely, if it were important to have a high TPR—say, 0.9—then Model 1 would be preferable to Model 2 because when TPR = 0.9, Model 1 has an FPR of about 0.7 while Model 2 has an FPR of about 0.8.
• The ROC can be used to determine the point with optimal classification accuracy.
• Straight lines with equal classification accuracy can be drawn, and these lines will all be from the
lower left to the upper right.
• The line that is tangent to an ROC curve marks the optimal point on that ROC curve.
• In the figure, the point marked A for Model 2, with an FPR of about 0.1 and a TPR of about 0.45, is an optimal point. This point is optimal assuming that the costs of misclassification are equal, i.e., that a false positive is just as harmful as a false negative.
• This assumption is not always true, as shown by misclassifying the issuance of credit cards. A good customer
might charge $5000 per year and carry a monthly balance of $200, resulting in a net profit of $100 to the credit
card company. A bad customer might run up charges of $1000 before his card is canceled. Clearly, the cost of
refusing credit to a good customer is not the same as the cost of granting credit to a bad customer.
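The unequal-cost idea can be sketched in a few lines. Treating “bad customer” as the positive class, a false positive (refusing a good customer) forgoes the $100 profit, while a false negative (approving a bad customer) loses $1,000; the confusion counts below are invented:

```python
def expected_cost(fp, fn, cost_fp, cost_fn):
    # total misclassification cost when the two errors are penalized unequally
    return fp * cost_fp + fn * cost_fn

# positive = "bad customer"; counts are hypothetical
equal   = expected_cost(fp=20, fn=5, cost_fp=100, cost_fn=100)   # equal penalties
unequal = expected_cost(fp=20, fn=5, cost_fp=100, cost_fn=1000)  # credit-card penalties
print(equal, unequal)  # 2500 7000
```

With unequal penalties, the same confusion matrix becomes far more costly, so the optimal threshold shifts toward avoiding false negatives.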
ROC curve
• A popular method for comparing ROC curves is to calculate the “Area Under the
Curve” (AUC).
• Since both the x-axis and y-axis are from zero to one, and since the perfect classifier passes
through the point (0,1), the largest AUC is one.
• The AUC for the random classifier (the diagonal line) is 0.5. In general, then, an ROC curve with a higher AUC is preferred to one with a lower AUC.
• The AUC has a probabilistic interpretation. It can be shown that AUC = P(random positive
example > random negative example). In other words, this is the probability that the classifier
will assign a higher score to a randomly chosen positive case than to a randomly chosen
negative case.
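The probabilistic interpretation suggests a direct (if inefficient) way to compute the AUC: compare the scores of every positive-negative pair, counting ties as one half. A sketch with invented scores:

```python
def auc_pairwise(scores, labels):
    """AUC = P(score of a random positive > score of a random negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # count each positive/negative pair the classifier orders correctly
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,    1,   0,   0  ]
print(auc_pairwise(scores, labels))  # 13 of 16 pairs ordered correctly: 0.8125
```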
• To illustrate the concepts discussed here, let us examine a pair of examples with real data.
• We will construct two simple models for predicting churn, and then compare them on the basis
of ROC curves.
• Open the churn data set and fit a logistic regression.
• Churn is the dependent variable (make sure that it is classified as nominal), where TRUE
or 1 indicates that the customer switched carriers.
• For simplicity, choose D_VMAIL_PLAN, VMail_Message, Day_Mins, and Day_Charge
as explanatory variables and leave them as continuous. Click Run.
• Under the red triangle for the Nominal Logistic Fit window, click ROC Curve. Since we
are interested in identifying churners, when the pop-up box instructs you to Select
which level is the positive, select 1 and click OK.
ROC curve
• Observe the ROC curve together with the
line of optimal classification; the AUC is
0.65778, the left ROC curve in the Figure.
• The line of optimal classification appears to
be tangent to the ROC at about 0.10 for 1-
Specificity and about 0.45 for Sensitivity.
• At the bottom of the window is a tab for the
ROC Table. Expand it to see various
statistics for the entire data set.
• Imagine a column between Sens-(1-Spec) and True Pos; scroll down until Prob = 0.2284, and you will see an asterisk in the imagined column. This asterisk denotes the row with the highest value of Sensitivity − (1 − Specificity), which is the point of optimal classification accuracy.
• Should you happen to have 200,000 rows, right-click in the ROC Table and select Make into Data Table, which will be easier to manipulate to find the optimal point.
ROC curve
• We want to show how JMP compares models, so
we will go back to the churn data set and use the
same variables to build a classification tree.
• Select Analyze→Modeling→Partition. Use the
same Y and X variables as for the logistic
regression (Churn versus D_VMAIL_PLAN,
VMail_Message, Day_Mins, and Day_Charge).
• Click Split five times so that RSquare equals
0.156.
• Under the red triangle for the Partition window,
click ROC Curve.
• This time you are not asked to select which level is
positive; you are shown two ROC Curves, one for
False and one for True.
– They both have the same AUC because they represent the
same information.
– Observe that one is a reflection of the other.
• Note that the AUC is 0.6920. The partition method
does not produce a line of optimal classification
because it does not produce an ROC Table.
• On the basis of AUC, the classification tree seems
to be marginally better than the logistic regression.
Model comparison using the Lift Chart
• Suppose you intend to send out a direct mail advertisement to all 100,000 of your
customers and, on the basis of experience, you expect 1% of them to respond positively
(e.g., to buy the product).
• Suppose further that each positive response is worth $200 to your company.
• Direct mail is expensive; it will cost $1 to send out each advertisement.
• You expect $200,000 in revenue, and you have $100,000 in costs. Hence you expect to
make a profit of $100,000 for the direct mail campaign.
• Wouldn’t it be nice if you could send out 40,000 advertisements from which you could expect 850 positive responses? You would save $60,000 in mailing costs and forego 150 × $200 = $30,000 in revenue, for a profit of $170,000 − $40,000 = $130,000.
• The key is to send the advertisement only to those customers most
likely to respond positively, and not to send the advertisement to
those customers who are not likely to respond positively.
• A logistic regression, for example, can be used to calculate the probability of a positive
response for each customer. These probabilities can be used to rank the customers from
most likely to least likely to respond positively. The only remaining question is how many of
the most likely customers to target.
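The arithmetic of the campaign can be wrapped in a small helper; the $200 value per response and $1 cost per piece come from the example above:

```python
def campaign_profit(n_mailed, n_responses, value=200, cost=1):
    # revenue from responders minus the cost of every piece mailed
    return n_responses * value - n_mailed * cost

# mail everyone: 100,000 pieces, 1% (1,000) respond
print(campaign_profit(100_000, 1_000))  # 100000

# mail the top 40% ranked by predicted probability; 850 respond
print(campaign_profit(40_000, 850))     # 130000
```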
Model comparison using the Lift Chart
• A standard lift chart is constructed by breaking the population into deciles, and noting the
expected number of positive responses for each decile.
• Continuing with the direct mail analogy, we might see lift values as shown in Table:
[Figure: Initial Lift Curves for Logistic (Left) and Classification Tree (Right)]
Model comparison using the Lift Chart
• Let’s extend the y-axis for both curves to 6.0.
• Right-click Lift Curve for the classification tree and select Size/Scale→Y Axis. Near the top of the pop-up
box, change the Maximum from 3.8 to 6. Click OK.
• Do the same thing for the Logistic Lift Curve.
[Figure: Lift Curves for Logistic (Left) and Classification Tree (Right)]
• Both lift curves show two curves, one for False and one for True. We are obviously concerned with True,
since we are trying to identify churners.
• Suppose we wanted to launch a campaign to contact customers who are likely to churn, and we want to
offer them incentives not to churn.
• Suppose further that, due to budgetary factors, we could contact only 40% of them.
• Clearly we would want to use the classification tree, because the lift is so much greater in the range 0 to
0.40.