
Model comparison

Model comparison
•  We all know how to compare two linear regression models with the same number of independent variables: look at R².
•  When the number of independent variables is different between the two regressions, look at adjusted R².
•  What should you do, though, to compare a linear regression model with a nonlinear regression model, the latter of which really has no directly comparable definition for R²?
•  There is a definite need to compare different types of models so that the
better model may be chosen, and that’s the topic of this chapter.
•  First, we will examine the case of a continuous dependent variable, which
is rather straightforward.
•  Subsequently, we will discuss the binary dependent variable. It permits
many different types of comparisons.
Model comparison with
continuous dependent variable
•  Three common measures used to compare predictions of a continuous
variable are the mean square error (MSE), its square root (RMSE), and
the mean absolute error (MAE).
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}} \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\,y_i - \hat{y}_i\right|$$

•  The last of these, the MAE, is less sensitive to outliers.


•  All three of these get smaller as the quality of the prediction improves.
•  The above performance measures do not consider the level of the
variable that is being predicted.
–  For example, an error of 10 units is treated the same regardless of whether the variable
has a level of 20 or 2000.
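
As an illustrative aside (not from the original slides), a minimal Python sketch of the three measures, using made-up numbers for the actual values y and the predictions y_hat:

```python
import math

def mse(y, y_hat):
    # mean squared error
    return sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    # root mean squared error: the square root of the MSE
    return math.sqrt(mse(y, y_hat))

def mae(y, y_hat):
    # mean absolute error: less sensitive to outliers than MSE/RMSE
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)

y = [20, 25, 30, 2000]        # hypothetical actual values
y_hat = [22, 24, 33, 1990]    # hypothetical predictions
print(mse(y, y_hat), rmse(y, y_hat), mae(y, y_hat))
```

Note that an error of 10 units contributes the same to all three measures whether the actual value is 20 or 2000, which is exactly the limitation the relative measures below address.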
Model comparison with
continuous dependent variable
•  To account for the level of the variable, relative measures can be
employed, such as:
$$\mathrm{RSE} = \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \qquad \mathrm{RAE} = \frac{\sum_{i=1}^{n}\left|\,y_i - \hat{y}_i\right|}{\sum_{i=1}^{n}\left|\,y_i - \bar{y}\right|}$$

•  The relative measures are particularly useful when comparing variables that have different levels.
•  Another performance measure is the correlation between the variable and its prediction. Correlation values are constrained to be between -1 and +1 and to increase in absolute value as the quality of the prediction improves. The correlation measure is defined as:

$$r = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{(n-1)\, s_y\, s_{\hat{y}}}$$

where $s_y$ and $s_{\hat{y}}$ are the standard deviations of $y$ and $\hat{y}$, respectively, and $\bar{\hat{y}}$ is the average of the predicted values.
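
A minimal sketch (illustrative only, reusing the hypothetical y and y_hat from the earlier example) of the relative measures and the correlation as defined here:

```python
import math

def rse(y, y_hat):
    # relative squared error: squared error relative to always predicting the mean
    y_bar = sum(y) / len(y)
    return (sum((a - p) ** 2 for a, p in zip(y, y_hat)) /
            sum((a - y_bar) ** 2 for a in y))

def rae(y, y_hat):
    # relative absolute error: absolute error relative to always predicting the mean
    y_bar = sum(y) / len(y)
    return (sum(abs(a - p) for a, p in zip(y, y_hat)) /
            sum(abs(a - y_bar) for a in y))

def corr(y, y_hat):
    # correlation between the variable and its prediction
    n = len(y)
    y_bar, yh_bar = sum(y) / n, sum(y_hat) / n
    s_y = math.sqrt(sum((a - y_bar) ** 2 for a in y) / (n - 1))
    s_yh = math.sqrt(sum((p - yh_bar) ** 2 for p in y_hat) / (n - 1))
    cov = sum((a - y_bar) * (p - yh_bar) for a, p in zip(y, y_hat)) / (n - 1)
    return cov / (s_y * s_yh)

y = [20, 25, 30, 2000]        # hypothetical actual values
y_hat = [22, 24, 33, 1990]    # hypothetical predictions
print(rse(y, y_hat), rae(y, y_hat), corr(y, y_hat))
```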
Model comparison with
continuous dependent variable
•  Which measure should be used can be determined only by a careful
study of the problem.
–  Does the data set have outliers? If so, then absolute rather than squared error might be
appropriate.
–  If an error of five units when the prediction is 100 is the same as an error of 20 units
when the prediction is 400 (i.e., 5% error), then relative measures might be appropriate.
•  Frequently these measures all give the same answer. In that case, it is obvious that one model is superior to the other.
•  On the other hand, there are cases where the measures contradict, and
then careful thought is necessary to decide which model is superior.
•  To explore the uses of these measures, open the file McDonalds48.jmp,
which gives monthly returns on McDonalds and the S&P 500 from
January 2002 through December 2005.
•  We will run a regression on the first 40 observations and use this
regression to make out-of-sample predictions for the last 8 observations.
These 8 observations can be called a “hold-out sample.”
Model comparison with
continuous dependent variable
•  To exclude the last 8 observations from the regression, select
observations 41-48 (click in row 41, hold down the shift key, and click in
row 48).
•  Then right-click and select Exclude/Unexclude. Each of these rows
should have a red circle with a slash through it. Now run the regression:
•  Select Analyze→Fit Model, click Return on McDonalds, and then click
Y. Select Return on SP500 and click Add. Click Run.
•  To place the predicted values in the data table, click the red triangle, and
select Save Columns→Predicted Values.
•  Notice that JMP has made predictions for observations 41-48, even
though these observations were not used to calculate the regression
estimates.
Model comparison with continuous dependent
variable (using Excel)
•  It is probably easiest to calculate the desired measures using Excel. So either save the datasheet as an
Excel file, and then open it in Excel, or just open Excel and copy the variables Return on McDonalds and
Predicted Return on McDonalds into columns A and B respectively.
•  In Excel, perform the following steps:
–  1. Create the residuals in column C, as Return on McDonalds – Predicted Return on McDonalds.
–  2. Create the squared residuals in column D, by squaring column C.
–  3. Create the absolute residuals in column E, by taking the absolute value of column C.
–  4. Calculate the in-sample MSE by summing the first 40 squared residuals, which will be cells 2-41 in column D, and
then dividing the sum by 40.
–  5. Calculate the in-sample MAE by summing the first 40 absolute residuals, which will be cells 2-41 in column E, and
then dividing the sum by 40.
–  6. Calculate the out-of-sample MSE by summing the last 8 squared residuals, cells 42-49 in column D, and then
dividing the sum by 8.
–  7. Calculate the out-of-sample MAE by summing the last 8 absolute residuals, cells 42-49 in column E, and then
dividing the sum by 8.
–  8. Calculate the in-sample correlation between Return on McDonalds and Predicted Return on McDonalds for the
first 40 observations using the Excel CORREL( ) function.
–  9. Calculate the out-of-sample correlation between Return on McDonalds and Predicted Return on McDonalds for
the last 8 observations using the Excel CORREL( ) function.
–  The results are summarized in the next table.

                 MSE          MAE          Correlation
In-Sample        0.00338668   0.04734918   0.68339270
Out-of-Sample    0.00284925   0.04293491   0.75127994

The in-sample and out-of-sample MSE and MAE are quite close, which leads one to think that the model is doing about as well at predicting out-of-sample as it is at predicting in-sample. The correlation confirms this notion.
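
For readers who prefer code to Excel, here is a sketch of the same in-sample versus out-of-sample calculation, assuming the 48 actual and predicted returns have already been exported from JMP into two Python lists (the variable names below are placeholders):

```python
import math

def summarize(y, y_hat):
    # returns (MSE, MAE, correlation) for one sample
    resid = [a - p for a, p in zip(y, y_hat)]
    mse = sum(r ** 2 for r in resid) / len(resid)
    mae = sum(abs(r) for r in resid) / len(resid)
    n = len(y)
    y_bar, yh_bar = sum(y) / n, sum(y_hat) / n
    cov = sum((a - y_bar) * (p - yh_bar) for a, p in zip(y, y_hat))
    corr = cov / math.sqrt(sum((a - y_bar) ** 2 for a in y) *
                           sum((p - yh_bar) ** 2 for p in y_hat))
    return mse, mae, corr

# actual, predicted = two lists of 48 monthly returns exported from JMP (placeholders)
# print("In-sample:     ", summarize(actual[:40], predicted[:40]))
# print("Out-of-sample: ", summarize(actual[40:], predicted[40:]))
```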
Model comparison with
continuous dependent variable
•  To gain some insight into this phenomenon, look at a graph of Return on
McDonalds against Predicted Return on McDonalds.
•  Select Graph→Scatterplot Matrix (or select Graph→Overlay Plot)
–  Select Return on McDonalds and click Y, Columns.
–  Then select Predicted Return on McDonalds and click X. Click OK.
•  In the data table, select observations 41-48, which will make them appear as bold dots on the graph. (To make them easier to see, while still in the data table, right-click the selected observations, select Markers, and then choose the plus sign (+).)
•  These out-of-sample observations appear to be in agreement with the in-
sample observations, which suggests that the relationship between Y and
X that existed during the in-sample period continued through the out-of-
sample period.
Model comparison with
continuous dependent variable

•  Hence, the in-sample and out-of-sample correlations are approximately the same. If the relationship that existed during the in-sample period had broken down and no longer existed during the out-of-sample period, then the correlations might not be approximately the same, or the out-of-sample points in the scatterplot would not be in agreement with the in-sample points.
Model comparison with binary
dependent variable
•  When actual values are compared against predicted values for a binary
variable, a contingency table is used.
•  This table is often called an “error table” or a “confusion matrix.”
•  It displays correct versus incorrect classifications, where “1” may be
thought of as a “positive/successful” case and “0” may be thought of as a
“negative/failure” case.
•  It is important to understand that the confusion matrix is a function of
some threshold score.
–  Imagine making binary predictions from a logistic regression. The logistic regression
produces a probability.
–  If the threshold is, say, 0.50, then all observations with a score above 0.50 will be
classified as positive, and all observations with a score below 0.50 will be classified as
negative.
–  Obviously, if the threshold changes to, say, 0.55, so will the elements of the confusion
matrix.
Model comparison with binary
dependent variable
•  Confusion table:
Predicted 1 Predicted 0
Actual 1 True positive (TP) False negative (FN)
Actual 0 False positive (FP) True negative (TN)
•  A wide variety of statistics can be calculated from the elements of the confusion
matrix.
•  For example, the overall accuracy of the model is measured by the accuracy:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
•  This is not a particularly useful measure because it gives equal weight to all
components.
–  Imagine that you are trying to predict a rare event—say, cell phone churn—when only 1% of customers churn. If you simply predict that no customers churn, your accuracy rate will be 99%. (Since you are not predicting any churn, TP = 0 and FP = 0, and the true negatives make up 99% of the cases.) Clearly, statistics that make better use of the elements of the confusion matrix are needed.
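
A tiny sketch of that pitfall, using hypothetical counts for 10,000 customers with a 1% churn rate and a model that predicts "no churn" for everyone:

```python
# All predictions are "no churn": no true positives and no false positives,
# all 100 actual churners are missed (FN), all 9,900 non-churners are TN.
TP, FP, FN, TN = 0, 0, 100, 9900

accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy)   # 0.99 -- looks excellent even though no churner is ever caught
```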
Model comparison with binary
dependent variable
•  Confusion table:
Predicted 1 Predicted 0
Actual 1 True positive (TP) False negative (FN)
Actual 0 False positive (FP) True negative (TN)

•  One such measure is the sensitivity, or true positive rate (TPR), which is defined as:

$$\mathrm{Sensitivity} = \mathrm{TPR} = \frac{TP}{TP + FN}$$

•  This is also known as recall. It answers the question, “If the case really is positive, what is the probability that the model will predict it as positive?”
Model comparison with binary
dependent variable
•  Confusion table:
Predicted 1 Predicted 0
Actual 1 True positive (TP) False negative (FN)
Actual 0 False positive (FP) True negative (TN)

•  Similarly, the true negative rate (TNR) is also called the specificity and is given by:

$$\mathrm{Specificity} = \mathrm{TNR} = \frac{TN}{TN + FP}$$

•  It answers the question, “If the case really is negative, what is the probability that the model will predict it as negative?”
Model comparison with binary
dependent variable
• Confusion table:
Predicted 1 Predicted 0
Actual 1 True positive (TP) False negative (FN)
Actual 0 False positive (FP) True negative (TN)

• The false positive rate (FPR) equals 1 − specificity and is given by:

$$\mathrm{FPR} = \frac{FP}{FP + TN} = 1 - \mathrm{Specificity}$$

• It answers the question, “If the case really is negative, what is the probability that the model will mistakenly classify it as positive?”
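
Putting the three rates together, a small illustrative sketch with a hypothetical confusion matrix (the counts are made up):

```python
# Hypothetical confusion-matrix counts
TP, FN = 60, 40    # actual positives: 100
FP, TN = 30, 870   # actual negatives: 900

sensitivity = TP / (TP + FN)   # true positive rate (recall): 0.60
specificity = TN / (TN + FP)   # true negative rate: about 0.967
fpr = FP / (FP + TN)           # false positive rate = 1 - specificity: about 0.033
print(sensitivity, specificity, fpr)
```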
ROC curve
•  When the False Positive Rate (FPR) is plotted on the x-axis, and the True Positive
Rate (TPR) is plotted on the y-axis, the resulting graph is called an ROC curve
(“Receiver Operating Characteristic Curve”).
•  In order to draw the ROC curve, the classifier has to produce a continuous-valued
output that can be used to sort the observations from most likely to least likely.
•  The predicted probabilities from a logistic regression are a good example. In an
ROC graph, the vertical axis shows the proportion of ones that are correctly
identified, and the horizontal axis shows the proportion of zeros that are
misidentified as ones.
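
A sketch of how such a curve can be traced from scored observations (the data here are hypothetical, not the churn data used later): sort the cases from most likely to least likely, lower the threshold one observation at a time, and record the (FPR, TPR) point at each step.

```python
def roc_points(y, scores):
    # sort cases from most likely to least likely positive
    ranked = sorted(zip(scores, y), reverse=True)
    P = sum(y)            # number of actual positives
    N = len(y) - P        # number of actual negatives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, actual in ranked:
        if actual == 1:
            tp += 1       # this threshold now catches one more positive
        else:
            fp += 1       # ... or misclassifies one more negative as positive
        points.append((fp / N, tp / P))   # (FPR, TPR) at this threshold
    return points

y = [1, 0, 1, 1, 0, 0, 1, 0]                        # hypothetical actual labels
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]  # hypothetical predicted probabilities
print(roc_points(y, scores))
```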
ROC curve
•  To interpret the ROC curve, first note that the
point (0,0) represents a classifier that never
issues a positive classification: its FPR is
zero, which is good. But it never correctly
identifies a positive case, so its TPR is zero,
also, which is bad. The point (0,1) represents
the perfect classifier: it always correctly
identifies positives and never misclassifies a
negative as a positive.

•  In order to understand the curve, two extreme cases need to be identified.


•  First is the random classifier that simply guesses at whether a case is 0 or 1. The ROC for such a
classifier is the dotted diagonal line A, from (0, 0) to (1, 1).
•  To see this, suppose that a fair coin is flipped to determine classification. This method will correctly
identify half of the positive cases and half of the negative cases, and corresponds to the point (0.5, 0.5).
•  To understand the point (0.8, 0.8), if the coin is biased so that it comes up heads 80% of the time (let
“heads” signify “positive”), then it will correctly identify 80% of the positives and incorrectly identify
80% of the negatives. Any point beneath this 45° line is worse than random guessing.
ROC curve
•  The second extreme case is the perfect classifier, which
correctly classifies all positive cases and has no false
positives. It is represented by the dot-dash line D, from (0,
0) through (0, 1) to (1, 1).
•  The closer an ROC curve gets to the perfect classifier,
the better it is.
•  Therefore, the classifier represented by the solid line C, is
better than the classifier represented by the dashed line B.
•  Note that the line C is always above the line B; i.e., the
lines do not cross.

•  Remember that each point on an ROC curve corresponds to a particular confusion matrix that, in turn, depends on a specific threshold. This threshold is usually a probability cutoff; e.g., classify the observation as “1” if the probability of its being a “1” is 0.50 or greater.
•  Therefore, any ROC curve represents various confusion matrices generated by a classifier as the
threshold is changed.
•  Points in the lower left region of the ROC space identify “conservative” classifiers. They require strong
evidence to classify a point as positive. So they have a low false positive rate; necessarily they also
have low true positive rates.
•  On the other hand, classifiers in the upper right region can be considered “liberal.” They do not require
much evidence to classify an event as positive. So they have high true positive rates; necessarily, they
also have high false positive rates.
ROC curve
•  When two ROC curves cross, neither is unambiguously better
than the other, but it is possible to identify regions where one
classifier is better than the other.
•  Figure shows the ROC curve as a dotted line for a classifier
produced by Model 1—say, logistic regression—and the ROC curve
as a solid line for a classifier produced by Model 2—say, a
classification tree.
•  Suppose it is important to keep the FPR low at 0.2. Then, clearly,
Model 2 would be preferred because when FPR is 0.2, it has a much
higher TPR than Model 1.
•  Conversely, if it was important to have a high TPR—say, 0.9—then
Model 1 would be preferable to Model 2 because when TPR =0.9,
Model 1 has an FPR of about 0.7 while Model 2 has an FPR of about
0.8.

•  The ROC can be used to determine the point with optimal classification accuracy.
•  Straight lines with equal classification accuracy can be drawn, and these lines will all be from the
lower left to the upper right.
•  The line that is tangent to an ROC curve marks the optimal point on that ROC curve.
•  In the Figure, the point marked A for Model 2, with an FPR of about 0.1 and a TPR of about 0.45, is an
optimal point. This point is optimal assuming that the costs of misclassification are equal, that a false
positive is just as harmful as a false negative.
•  This assumption is not always true, as the example of issuing credit cards shows. A good customer
might charge $5000 per year and carry a monthly balance of $200, resulting in a net profit of $100 to the credit
card company. A bad customer might run up charges of $1000 before his card is canceled. Clearly, the cost of
refusing credit to a good customer is not the same as the cost of granting credit to a bad customer.
ROC curve
•  A popular method for comparing ROC curves is to calculate the “Area Under the
Curve” (AUC).
•  Since both the x-axis and y-axis are from zero to one, and since the perfect classifier passes
through the point (0,1), the largest AUC is one.
•  The AUC for the random classifier (the diagonal line) is 0.5. In general, then, an ROC curve with a higher AUC is preferred to one with a lower AUC.
•  The AUC has a probabilistic interpretation. It can be shown that AUC = P(random positive
example > random negative example). In other words, this is the probability that the classifier
will assign a higher score to a randomly chosen positive case than to a randomly chosen
negative case.
•  To illustrate the concepts discussed here, let us examine a pair of examples with real data.
•  We will construct two simple models for predicting churn, and then compare them on the basis
of ROC curves.
•  Open the churn data set and fit a logistic regression.
•  Churn is the dependent variable (make sure that it is classified as nominal), where TRUE
or 1 indicates that the customer switched carriers.
•  For simplicity, choose D_VMAIL_PLAN, VMail_Message, Day_Mins, and Day_Charge
as explanatory variables and leave them as continuous. Click Run.
•  Under the red triangle for the Nominal Logistic Fit window, click ROC Curve. Since we
are interested in identifying churners, when the pop-up box instructs you to Select
which level is the positive, select 1 and click OK.
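
As an aside, the probabilistic interpretation of the AUC mentioned above can be checked directly with a short sketch (hypothetical data, not JMP output): count, over all positive/negative pairs, how often the positive case receives the higher score.

```python
def auc(y, scores):
    pos = [s for s, a in zip(scores, y) if a == 1]
    neg = [s for s, a in zip(scores, y) if a == 0]
    # a tie counts as half a "win" for the positive case
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [1, 0, 1, 1, 0, 0, 1, 0]                        # hypothetical labels
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]  # hypothetical scores
print(auc(y, scores))   # 1.0 would be a perfect ranking, 0.5 a random one
```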
ROC curve
•  Observe the ROC curve together with the
line of optimal classification; the AUC is
0.65778, the left ROC curve in the Figure.
•  The line of optimal classification appears to
be tangent to the ROC at about 0.10 for 1-
Specificity and about 0.45 for Sensitivity.
•  At the bottom of the window is a tab for the
ROC Table. Expand it to see various
statistics for the entire data set.
•  Imagine a column between Sens-(1-Spec) and True Pos; scroll down until Prob = 0.2284, and you will see an asterisk in the imagined column. This asterisk denotes the row with the highest value of Sens − (1 − Spec), which is the point of optimal classification accuracy.
•  Should you happen to have 200,000 rows, right-click in the ROC Table and select Make into Data Table, which will be easier to manipulate to find the optimal point.
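
If the table were exported, the optimal cutoff could also be located programmatically. The sketch below (with hypothetical labels and scores, not the churn data) simply finds the threshold with the largest value of Sens − (1 − Spec):

```python
def best_threshold(y, scores):
    P = sum(y)                # actual positives
    N = len(y) - P            # actual negatives
    best = None
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(y, scores) if a == 1 and s >= t)
        fp = sum(1 for a, s in zip(y, scores) if a == 0 and s >= t)
        j = tp / P - fp / N              # Sens - (1 - Spec)
        if best is None or j > best[1]:
            best = (t, j)
    return best                          # (threshold, Sens - (1 - Spec))

y = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
print(best_threshold(y, scores))
```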
ROC curve
•  We want to show how JMP compares models, so
we will go back to the churn data set and use the
same variables to build a classification tree.
•  Select Analyze→Modeling→Partition. Use the
same Y and X variables as for the logistic
regression (Churn versus D_VMAIL_PLAN,
VMail_Message, Day_Mins, and Day_Charge).
•  Click Split five times so that RSquare equals
0.156.
•  Under the red triangle for the Partition window,
click ROC Curve.
•  This time you are not asked to select which level is
positive; you are shown two ROC Curves, one for
False and one for True.
–  They both have the same AUC because they represent the
same information.
–  Observe that one is a reflection of the other.
•  Note that the AUC is 0.6920. The partition method
does not produce a line of optimal classification
because it does not produce an ROC Table.
•  On the basis of AUC, the classification tree seems
to be marginally better than the logistic regression.
Model comparison using the Lift Chart
•  Suppose you intend to send out a direct mail advertisement to all 100,000 of your
customers and, on the basis of experience, you expect 1% of them to respond positively
(e.g., to buy the product).
•  Suppose further that each positive response is worth $200 to your company.
•  Direct mail is expensive; it will cost $1 to send out each advertisement.
•  You expect $200,000 in revenue, and you have $100,000 in costs. Hence you expect to
make a profit of $100,000 for the direct mail campaign.
•  Wouldn’t it be nice if you could send out 40,000 advertisements from which you could expect 850 positive responses? You would save $60,000 in mailing costs and forgo 150 × $200 = $30,000 in revenue, for a profit of $170,000 − $40,000 = $130,000.
•  The key is to send the advertisement only to those customers most
likely to respond positively, and not to send the advertisement to
those customers who are not likely to respond positively.
•  A logistic regression, for example, can be used to calculate the probability of a positive
response for each customer. These probabilities can be used to rank the customers from
most likely to least likely to respond positively. The only remaining question is how many of
the most likely customers to target.
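
The arithmetic in the mailing example above, written out as a short sketch:

```python
COST_PER_MAIL = 1          # dollars per advertisement
VALUE_PER_RESPONSE = 200   # dollars per positive response

def profit(mailed, responses):
    return responses * VALUE_PER_RESPONSE - mailed * COST_PER_MAIL

print(profit(100_000, 1_000))  # mail everyone:       200,000 - 100,000 = 100,000
print(profit(40_000, 850))     # mail the top 40,000: 170,000 -  40,000 = 130,000
```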
Model comparison using the Lift Chart
•  A standard lift chart is constructed by breaking the population into deciles, and noting the
expected number of positive responses for each decile.
•  Continuing with the direct mail analogy, we might see lift values as shown in Table:

•  If mailing was random, we would expect to see 100 positive responses in each decile. (The overall probability of “success” is 1,000/100,000 = 1%, and the expected number of successful mailings in a decile is 1% of 10,000 = 100.)
•  However, since the customers were scored (had probabilities of positive response calculated for each of them), we can expect 280 responses from the first 10,000 customers.
•  Compared to the 100 that would be achieved by random mailing, scoring gives a “lift” of 280/100 = 2.8 for the first decile. Similarly, the second decile has a lift of 2.35.
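
A sketch of how decile lifts like these could be computed from model scores (the data here are simulated and illustrative, not the direct-mail figures above):

```python
import random

def decile_lift(y, scores, n_deciles=10):
    # rank customers from highest to lowest score, then compare each decile's
    # response rate with the overall response rate
    ranked = [a for _, a in sorted(zip(scores, y), reverse=True)]
    overall_rate = sum(y) / len(y)
    size = len(y) // n_deciles
    return [(sum(ranked[d * size:(d + 1) * size]) / size) / overall_rate
            for d in range(n_deciles)]

random.seed(1)
scores = [random.random() for _ in range(10_000)]             # hypothetical model scores
y = [1 if random.random() < 0.02 * s else 0 for s in scores]  # ~1% respond, mostly high scorers
print(decile_lift(y, scores))   # the first deciles should show lift well above 1
```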
Lift Curve
Suppose the top-rated 10% of fitted probabilities have a 25% richness of the chosen response, compared with 5% richness over the whole population. The lift curve would then go through the X-coordinate of 0.10 at a Y-coordinate of 25% / 5%, or 5.
Lift Curve

All lift curves reach (1, 1) at the right, as the population as a whole has the general response rate.
Model comparison using the Lift Chart
•  A lift curve does the same thing as the decile-based lift chart, except on a more finely graduated scale: instead of showing the lift for each decile, it shows the lift for each percentile.
•  Necessarily, the lift at the 100th percentile equals one. Consequently, even a poor model's lift is always equal to or greater than one.
•  To create a lift chart, refer back to the previous section in this chapter where we produced a
simple logistic regression and a simple classification tree. This time, instead of selecting ROC
Curve, select Lift Curve.
•  It is difficult to compare graphs when they are not on the same scale, and, further, we cannot
see the top of the Lift Curve for the classification tree.

Initial Lift Curves for Logistic (Left) and Classification Tree (Right)
Model comparison using the Lift Chart
•  Let’s extend the y-axis for both curves to 6.0.
•  Right-click Lift Curve for the classification tree and select Size/Scale→Y Axis. Near the top of the pop-up
box, change the Maximum from 3.8 to 6. Click OK.
•  Do the same thing for the Logistic Lift Curve.
Lift Curves for Logistic (Left) and Classification Tree (Right)

•  Both lift curves show two curves, one for False and one for True. We are obviously concerned with True,
since we are trying to identify churners.
•  Suppose we wanted to launch a campaign to contact customers who are likely to churn, and we want to
offer them incentives not to churn.
•  Suppose further that, due to budgetary factors, we could contact only 40% of them.
•  Clearly we would want to use the classification tree, because the lift is so much greater in the range 0 to
0.40.
