
Prodi S1 Ilmu Aktuaria

Principle of Model Building

Fevi Novkaniza

LINEAR MODEL

October 28, 2021



Why Model Building Is Important

By model building, we mean writing a model that will provide a good fit to a set of data and that will give good estimates of the mean value of y and good predictions of future values of y for given values of the independent variables.

Example

An educational research group issued a report concerning the variables related to academic achievement for a certain type of college student.
The researchers selected a random sample of students and recorded a
measure of academic achievement, y, at the end of the senior year
together with data on an extensive list of independent variables,
x1 , x2 , ..., xk , that they thought were related to y.
Among these independent variables were the student’s IQ, scores on
mathematics and verbal achievement examinations, rank in class, and
so on.

Why Model Building Is Important

They fit the model

E (y) = β0 + β1 x1 + ... + βk xk

to the data, analyzed the results, and reached the conclusion that
none of the independent variables was ‘‘significantly related’’ to y.
The goodness of fit of the model, measured by the coefficient of
determination R², was not particularly good, and t tests on individual
parameters did not lead to rejection of the null hypotheses that these
parameters equaled 0.

Example

If we hold all the other independent variables constant and vary only
x1 , E (y) will increase by the amount β1 for every unit increase in x1 .
A 1-unit change in any of the other independent variables will increase
E(y) by the value of the corresponding β parameter for that variable.

Example

Generally speaking, there will be a positive correlation between entrance achievement test scores and college academic achievement.
So, what went wrong with the educational researchers’ study?
Most likely the difficulties in the results of the educational study were
caused by the use of an improperly constructed model.
For example, the model

E (y) = β0 + β1 x1 + ... + βk xk

assumes that the independent variables x1, x2, ..., xk affect mean achievement E(y) independently of each other.

Example

Do the assumptions implied by the model agree with your knowledge about academic achievement? First, is it reasonable to assume that the effect of time spent on study is independent of native intellectual ability?
No matter how much effort some students invest in a particular subject, their rate of achievement is low; for others, it may be high. Therefore, assuming that these two variables (effort and native intellectual ability) affect E(y) independently of each other is likely to be an erroneous assumption.

Example

Second, suppose that x5 is the amount of time a student devotes to study. Is it reasonable to expect that a 1-unit increase in x5 will always produce the same change β5 in E(y)? The change in E(y) for a 1-unit increase in x5 might depend on the value of x5 (e.g., the law of diminishing returns).
Consequently, it is quite likely that the assumption of a constant rate of change in E(y) for 1-unit increases in the independent variables will not be satisfied.

Example

Clearly, the model

E(y) = β0 + β1x1 + β2x2 + ... + βkxk

was a poor choice in view of the researchers’ prior knowledge of some of the variables involved.
Terms have to be added to the model to account for interrelationships
among the independent variables and for curvature in the response
function.
Failure to include needed terms causes inflated values of SSE,
nonsignificance in statistical tests, and, often, erroneous practical
conclusions.



Two Types of Predictor Variables:
Quantitative and Qualitative
◦ Quantitative data are observations measured on a naturally occurring numerical scale.
◦ Nonnumerical data that can only be classified into one of a group of categories are said to be qualitative data.
◦ The different values of an independent variable used in regression are called its levels.
◦ Example:
We considered the problem of predicting executive salary as a function of several predictor variables.
Consider the following four independent variables that may affect executive salaries:
(a) Years of experience
(b) Gender of the employee
(c) Firm’s net asset value
(d) Rank of the employee
For each of these independent variables, give its type and describe the nature of the levels you would expect to observe.
Model with Single Quantitative
Predictor Variable
◦ Several models can be formed with a single quantitative predictor variable:
◦ First-Order (Straight-Line) Model
◦ A Second-Order (Quadratic) Model
◦ Third-Order Model
◦ A pth-Order Polynomial
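For reference, the standard forms of these polynomial models are:

First-order: E(y) = β0 + β1x
Second-order: E(y) = β0 + β1x + β2x²
Third-order: E(y) = β0 + β1x + β2x² + β3x³
pth-order: E(y) = β0 + β1x + β2x² + ... + βpx^p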
◦ How do you decide the order of the polynomial you should use to model a response if
you have no prior information about the relationship between E(y) and x?
◦ If you have data, construct a scatterplot of the data points, and see whether you can
deduce the nature of a good approximating function
◦ A pth-order polynomial, when graphed, will exhibit at most (p − 1) peaks, troughs, or reversals in direction.
First-Order Models with Two or More
Quantitative Predictor Variables
◦ Like models for a single predictor variable, models with two or more independent
variables are classified as first-order, second-order, and so forth, but it is difficult (most
often impossible) to graph the response because the plot is in a multidimensional
space.
Second-Order Models with Two or
More Quantitative Predictor
Variables
◦ Second-order models with two or more predictor variables permit curvature in the
response surface.
◦ One important type of second-order term accounts for interaction between two
variables.
◦ Consider the two-variable model
E(y) = β0 + β1x1 + β2x2 + β3x1x2
◦ This interaction model traces a ruled surface (twisted plane) in a three-dimensional
space (see Figure 5.10).
◦ The second-order term β3x1x2 is called the interaction term, and it permits the contour lines to be nonparallel.
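As a quick illustration of why the interaction term makes the contour lines nonparallel, here is a minimal sketch (not from the original slides; the coefficient values are arbitrary illustrative choices):

```python
# Contour lines of E(y) for a two-variable model, with and without an
# interaction term; the coefficients below are arbitrary illustrative values.
import numpy as np
import matplotlib.pyplot as plt

x1, x2 = np.meshgrid(np.linspace(0, 10, 100), np.linspace(0, 10, 100))
ey_plane = 1.0 + 2.0 * x1 + 3.0 * x2                  # no interaction: parallel contours
ey_inter = 1.0 + 2.0 * x1 + 3.0 * x2 - 0.5 * x1 * x2  # interaction: nonparallel contours

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, ey, title in [(axes[0], ey_plane, "No interaction"),
                      (axes[1], ey_inter, "With interaction")]:
    ax.contour(x1, x2, ey, levels=10)
    ax.set(title=title, xlabel="x1", ylabel="x2")
plt.show()
```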

◦ To determine whether the quadratic terms are important, we would test H0: β7 = β8 = β9 = 0 using the F-test for comparing nested models outlined in Section 4.13.
Coding Quantitative Predictor
Variables
◦ In fitting higher-order polynomial regression models (e.g., second- or third-order
models), it is often a good practice to code the quantitative independent variables.
◦ In a general sense, coding means transforming a set of independent variables
(qualitative or quantitative) into a new set of independent variables.
◦ There are two related reasons for coding quantitative variables:
◦ To calculate the estimates of the model parameters using the method of least squares, the computer must invert a matrix of numbers, called the coefficient (or information) matrix. Considerable rounding error may occur during the inversion process if the numbers in the coefficient matrix vary greatly in absolute value. This can produce sizable errors in the computed values of the least squares estimates, β̂0, β̂1, β̂2, .... Coding makes it computationally easier for the computer to invert the matrix, thus leading to more accurate estimates.
◦ The second reason is the problem of predictor variables (x's) being intercorrelated (called multicollinearity). When polynomial regression models (e.g., second-order models) are fit, the problem of multicollinearity is unavoidable, especially when higher-order terms are fit; see the sketch after this list.
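A minimal sketch of the idea, using illustrative (not course-specified) data, showing how coding a predictor reduces the correlation between x and x²:

```python
# Coding (here: standardizing) a quantitative predictor before fitting a
# polynomial model; values and variable names are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(40, 60, size=30)     # raw predictor, far from 0
u = (x - x.mean()) / x.std()         # coded predictor

print(np.corrcoef(x, x**2)[0, 1])    # close to 1: strong structural collinearity
print(np.corrcoef(u, u**2)[0, 1])    # much closer to 0 after coding
```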
Prodi S1 Ilmu Aktuaria

Regression with Qualitative Predictors

Fevi Novkaniza

LINEAR MODEL

October 28, 2021

Qualitative Information

Fields where qualitative variables arise: business, economics, the social and biological sciences.
Examples:
gender (M, F)
purchase status: purchase, no purchase
disability status: none, partial, full
rating grade
A way to incorporate qualitative information is to use dummy
variables

Qualitative variables

The independent variables that appear in a linear model can be one of two types: a quantitative variable is one that assumes numerical values corresponding to the points on a line.
An independent variable that is not quantitative, that is, one that is categorical in nature, is called qualitative.
The different values of an independent variable used in regression are
called its levels.

Insurance Innovation Example

A study of innovation in the insurance industry considered:
The speed with which a particular insurance innovation is adopted (Y)
Size of the insurance firm (X1 )
Type of the firm (X2 and X3 )

Qualitative Predictors

Indicator variables: called dummy variables or binary variables
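A minimal sketch of constructing such an indicator. This assumes, as in the classic version of the insurance innovation example, two firm types (mutual and stock) coded as X2 = 1 for stock firms; the data values are made up:

```python
# Building a dummy (indicator) variable for a two-level qualitative predictor.
import pandas as pd

df = pd.DataFrame({
    "months_to_adopt": [17, 26, 21, 30],          # Y: speed of adoption (illustrative)
    "size": [151, 92, 175, 31],                   # X1: firm size (illustrative)
    "firm_type": ["mutual", "stock", "mutual", "stock"],
})
df["X2"] = (df["firm_type"] == "stock").astype(int)   # 1 = stock, 0 = mutual
print(df)
```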

Figure: Illustration of Meaning of Regression Coefficient for Regression Model with Indicator Variable X2


Insurance innovation example

Figure: Regression Results for Fit of Regression Model - Insurance Innovation Example

Figure: Fitted Regression Function for Regression Model Indicator Innovation Example


More than 2 classes

The regression of tool wear (Y) on tool speed X1 and tool model
(qualitative: M1, M2, M3, M4)
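A qualitative predictor with 4 classes needs 4 − 1 = 3 indicator variables. A minimal sketch (data values are made up; M1 is taken as the baseline class):

```python
# Creating 3 dummy variables for the 4-level tool model factor.
import pandas as pd

df = pd.DataFrame({
    "wear": [0.12, 0.15, 0.09, 0.20],
    "speed": [600, 750, 500, 800],
    "model": ["M1", "M2", "M3", "M4"],
})
dummies = pd.get_dummies(df["model"], prefix="model", drop_first=True).astype(int)
df = pd.concat([df, dummies], axis=1)   # columns model_M2, model_M3, model_M4
print(df)
```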

Tool Wear Example


Some Considerations


Interaction Quantitative & Qualitative


Ordinal Interaction

Figure: Another Illustration of Regression Model with Indicator Variable X2 and Interaction Term - Insurance Innovation Example


More complex models

Qualitative predictor variables only:

Yi = β0 + β2 Xi2 + β3 Xi3 + εi

all explanatory variables are qualitative: analysis of variance (ANOVA) models
some quantitative and some qualitative explanatory variables: analysis of covariance (ANCOVA) models


Prodi S1 Ilmu Aktuaria

Example of an interaction model with quantitative predictors

Fevi Novkaniza

LINEAR MODEL

November 4, 2021



Dataset

We will use data from https://regressit.com. There are 392 complete rows of data containing information for makes and models of cars sold in the U.S. between 1970 and 1982.
The objective is to build a model for predicting fuel economy from the
other variables, which include weight, horsepower, displacement,
acceleration, cylinders, and country of origin
The variable called Year70To81 is a dummy variable for years 1970 to
1981 which will be used to define a training set.
The origin variable is a numeric code: 1 = US, 2 = Europe, 3 = Japan. Dummy variables have been created for the 3 values, to allow for country effects that could have an arbitrary pattern.

Dataset

A snapshot of the first 6 data points is as follows:

Using the dataset, we fit a multiple regression model with interaction using
just 2 predictors (MPG and Cylinders), and the fitted line is:

Data Analysis

ŷ = 2.451 + 0.089x1 + 1.203x2 − 0.054x1 x2


where ŷ, x1, and x2 are the predicted GallonsPer100Miles, MPG, and Cylinders, respectively.
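A minimal sketch of how such a fit could be obtained with statsmodels; the CSV file name is hypothetical, and the column names are assumed to match the text:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("auto_mpg.csv")   # hypothetical file with the RegressIt auto data
# MPG * Cylinders expands to MPG + Cylinders + MPG:Cylinders
fit = smf.ols("GallonsPer100Miles ~ MPG * Cylinders", data=df).fit()
print(fit.summary())               # the MPG:Cylinders row holds the interaction estimate
```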
Data Analysis

We can see that the p-value for an interaction term is very small,
< 2 × 10−16 , leading to the rejection of the hypothesis that the
interaction term is ignorable; thus, we conclude that interaction
between MPG and Cylinders is significant and should be included in
the model.
So, how to interpret the result?

Data Analysis

Since there is an interaction term, we cannot interpret the effect of x1 or x2 separately. Instead, consider the following scenario: increase x1 by 1 unit, say from x1 = a to x1 = (a + 1), while x2 is kept constant.

For x1 = a:

ŷ = 2.451 + 0.089a + 1.203x2 − 0.054ax2
  = (2.451 + 0.089a) + (1.203 − 0.054a)x2

For x1 = a + 1:

ŷ = 2.451 + 0.089(a + 1) + 1.203x2 − 0.054(a + 1)x2
  = (2.451 + 0.089(a + 1)) + (1.203 − 0.054(a + 1))x2

The change in ŷ due to a 1-unit increase in x1 is therefore

∆ŷ = 0.089 − 0.054x2,

which depends on the value of x2.
Data Analysis

If there were no interaction, we could say that, on average, GallonsPer100Miles increases by 0.089 unit when MPG is increased by 1 unit, for cars with the same number of Cylinders.
However, since there is an interaction between MPG and Cylinders, this interpretation is no longer suitable.
Given Cylinders (say 4), increasing MPG by 1 unit changes GallonsPer100Miles by −0.127 unit (i.e., 0.089 − 0.054 × 4), a reduction of 0.127.
If Cylinders is 6, the reduction in GallonsPer100Miles is even larger, 0.235 unit (0.089 − 0.054 × 6 = −0.235), for each 1-unit increment in MPG.

Data Analysis

By substituting x2 = 3, 4, ..., 8 (recall that the support for Cylinders ranges from 3 to 8) into the fitted regression line

ŷ = 2.451 + 0.089x1 + 1.203x2 − 0.054x1x2,

we obtain 6 lines representing the relationship between MPG (x) and GallonsPer100Miles (y), where f1(x) = ŷ = 6.06 − 0.073x is the fitted line for cars with Cylinders = 3, and so on, until f6(x) = ŷ = 12.075 − 0.343x is the fitted line when Cylinders = 8.
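A minimal sketch of the substitution (intercepts and slopes follow directly from the fitted coefficients above):

```python
# One line per cylinder count: intercept = 2.451 + 1.203*c, slope = 0.089 - 0.054*c
for c in range(3, 9):
    intercept = 2.451 + 1.203 * c
    slope = 0.089 - 0.054 * c
    print(f"Cylinders={c}: yhat = {intercept:.3f} + ({slope:.3f}) * MPG")
```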

Data Analysis

Figure: Fitted regression lines for GallonsPer100Miles based on MPG for 6 different cylinder counts

Generalization

In general, if there are p quantitative predictors, and all the 2-way interactions between predictors are to be included in the model, then the model would be

E(y) = β0 + β1x1 + ... + βpxp + Σ_{j<k} αjk xj xk,  j, k = 1, ..., p

The number of interaction terms is C(p, 2), the number of ways of choosing 2 predictors from p.
As the number of predictor variables (p) increases, the complexity of the model grows fast, as illustrated in the sketch below.
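The original table is not reproduced here; a minimal sketch that generates the same information:

```python
# Number of two-way interaction terms C(p, 2) as p grows.
from math import comb

for p in range(2, 11):
    print(p, comb(p, 2))   # e.g., p = 10 already adds 45 interaction terms
```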

A quadratic model with quantitative predictor

The general form of the quadratic equation is

y = f(x) = ax² + bx + c, a ≠ 0.

When a > 0 the curve is concave up, and the opposite holds for a < 0. This concept can be used in a linear model when the effect of changes in a predictor variable is not constant.

A quadratic linear model for 1 quantitative predictor variable is

E(y) = β0 + β1x + β2x²

β2 > 0 means the curve is concave up: if y is increasing in x, it increases at an increasing rate; if y is decreasing, it decreases at a decreasing rate (the decline slows).
For example, if x is increased from x = c to x = c + 1, let the change in y be ∆1; the change in y when x is increased from x = c + 1 to x = c + 2, say ∆2, satisfies ∆2 = ∆1 + 2β2 > ∆1. In particular, if ∆1 < 0 and ∆2 < 0, then |∆2| < |∆1|.
The opposite interpretation applies for β2 < 0.
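A short way to see this: the instantaneous rate of change of the quadratic mean function is dE(y)/dx = β1 + 2β2x, which is increasing in x when β2 > 0 and decreasing when β2 < 0, so successive equal increases in x produce successively larger (β2 > 0) or smaller (β2 < 0) changes in E(y).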



What is the interpretation of β1 ?

Consider the following scenario:

x = a: E(y) = β0 + β1a + β2a²
x = a + 1: E(y) = β0 + β1(a + 1) + β2(a + 1)²

The change in E(y) when x is increased from a to (a + 1) is (β1 + β2) + 2β2a.
The change in the response variable thus depends on β1, β2, and the initial value of x (i.e., x = a), so there is no explicit interpretation of β1 alone.

Example

Using the AutoMPG data, if we plot GallonsPer100Miles against MPG, it clearly shows a quadratic trend, as in the following figure. Moreover, if we calculate the ‘linear’ correlation between GallonsPer100Miles and MPG², the resulting correlation is strong, −0.8676, as depicted.

It is reasonable to propose a quadratic model.


Model: y = β0 + β1x + β2x² + ε, ε ∼ NIID(0, σ²), where y and x are GallonsPer100Miles and MPG, respectively.
The result is as follows:

The regression fitted line is ŷ = 13.97 − 0.596·MPG + 0.008·MPG²
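A minimal sketch of this fit with statsmodels (file and column names are assumptions, as in the earlier sketch):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("auto_mpg.csv")
fit = smf.ols("GallonsPer100Miles ~ MPG + I(MPG**2)", data=df).fit()
print(fit.params)     # should be close to 13.97, -0.596, 0.008
print(fit.rsquared)   # should be close to 0.9812
```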



The model is useful, since the F statistic for testing the hypothesis H0: β1 = β2 = 0 is 1.018e+04, which falls in the rejection region, with p-value < 2.2e−16 ≈ 0.
Moreover, the quadratic term is important and should be included in the model, since the partial t-test of H0: β2 = 0 against the alternative hypothesis H1: β2 ≠ 0 gave a p-value < 2e−16 ≈ 0, indicating the significant role of the quadratic term.

How to interpret the result?

Referring to the scatterplot, GallonsPer100Miles decreases as MPG increases, and since β̂2 = 0.008 > 0, the decrease in GallonsPer100Miles gets slower for higher MPG.
For cars with 1 unit higher MPG, the change in GallonsPer100Miles is

(13.97 − 0.596(a+1) + 0.008(a+1)²) − (13.97 − 0.596a + 0.008a²) = −0.596 + 0.008(2a + 1)

where a is the initial value of MPG. The fitted regression also has a high R²: 98.12% of the variability of GallonsPer100Miles can be explained by MPG using the quadratic regression model.
As a visual check of how close the predicted values of GallonsPer100Miles are to the actual values, we produce the following plot.


The result is shown in the figure below: the fitted line (blue) is close enough to most of the data points, confirming the high R².
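A minimal sketch of producing such a plot (file and column names are assumptions):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("auto_mpg.csv")
grid = np.linspace(df["MPG"].min(), df["MPG"].max(), 200)
yhat = 13.97 - 0.596 * grid + 0.008 * grid**2   # fitted quadratic from the text

plt.scatter(df["MPG"], df["GallonsPer100Miles"], s=10)
plt.plot(grid, yhat, color="blue")
plt.xlabel("MPG")
plt.ylabel("GallonsPer100Miles")
plt.show()
```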

A test for comparing nested models

Definition: a simpler Model A is said to be nested within a more complex Model B if all the terms in Model A are included in Model B.
If Model A is of the form E(y) = β0 + β1x1 + ... + βJxJ, then Model B is of the form
E(y) = β0 + β1x1 + ... + βJxJ + α1x1* + ... + αKxK*,  J ≥ 1, K ≥ 1
Model A is called the reduced model, and Model B is called the complete model (relative to Model A).
Consider a second-order model with 2 predictor variables:

E(y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2²    (1)

A test for comparing nested models

Consider also the parsimony principle: a simpler yet accurate model is preferred to a more complex one.
Recall that adding complexity to the model requires an adjustment on the sample size (i.e., increasing the sample size) to maintain accuracy, as represented by Ra²; otherwise, the model is penalized through decrements of Ra².
Thus, we probably want to use the simpler model if the higher-order model does not really contribute to explaining y; or, rather than using all the higher-order terms, include only those that are significant.

A test for comparing nested models

For example, we might prefer a model with interaction only, that is,

E(y) = β0 + β1x1 + β2x2 + β3x1x2    (2)

to the complete second-order model in (1).
We will test the hypothesis H0: β4 = β5 = 0 against the alternative H1: at least one of βj ≠ 0, j = 4, 5.
Similar to the rationale of the global F test, and adapting the formulation, we use the F statistic

Fdat = (∆SSR/df1)/(SSRB/df2)

where ∆SSR = (SSRA − SSRB) is the drop in SSR from Model A to Model B, df1 is the drop in the number of regression coefficients from Model B to Model A (i.e., the number of βj's being tested in H0), and df2 is the error degrees of freedom for Model B (i.e., the complete model with p predictors; n − (p + 1)).
In this example, df1 = 2 and df2 = n − 6, where n is the sample size.
A test for comparing nested models

If the model's assumptions are satisfied, that is, ε ∼ NIID(0, σ²), and H0 is true, then the calculated Fdat follows an F(df1, df2) distribution.
Thus, reject H0 at significance level α if Fdat > F(α; df1, df2), or equivalently if p-value < α, where p-value = Pr(F > Fdat).
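A minimal sketch of carrying out this test with statsmodels' anova_lm, which computes Fdat and its p-value directly (file and column names are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("auto_mpg.csv")
model_a = smf.ols("GallonsPer100Miles ~ MPG + Cylinders + MPG:Cylinders", data=df).fit()
model_b = smf.ols("GallonsPer100Miles ~ MPG + Cylinders + MPG:Cylinders"
                  " + I(MPG**2) + I(Cylinders**2)", data=df).fit()
print(anova_lm(model_a, model_b))   # F statistic and Pr(>F) for H0: beta4 = beta5 = 0
```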

Example

For the AutoMPG data, we want to assess whether the second-order terms, i.e., the interaction and quadratic terms, really improve the model’s performance and should be included in the model for predicting GallonsPer100Miles (GP100M) based on MPG and Cylinders (Cyl).
The complete 2nd-order model is:

E(GP100M) = β0 + β1MPG + β2Cyl + β3MPG·Cyl + β4MPG² + β5Cyl²

and the reduced model, a straight-line model, is

E(GP100M) = β0 + β1MPG + β2Cyl

Fitting the models to data, we obtain the following results.

Example

The hypothesis to be tested is H0: β3 = β4 = β5 = 0, implying there is no need to include any of the 2nd-order terms in the model, against the alternative H1: βj ≠ 0 for at least one j, j = 3, 4, 5, implying that including the interaction term or the quadratic terms (at least one of them) should improve the model’s performance. Some statistics for calculating the test statistic are:

SSRC = sC² × df = 0.1881² × 389 = 13.76345
SSRR = sR² × df = 0.5047² × 386 = 98.32273

So the test statistic is:

Fdat = ((SSRR − SSRC)/3) / (SSRC/386) = ((98.32273 − 13.76345)/3) / (13.76345/386) = 799.845
Example

The critical value at α = 0.05 with 3 and 386 degrees of freedom for the numerator and denominator, respectively, is approximately 2.63 (computed in the sketch below). Since Fdat = 799.845 > F0.05 ≈ 2.63, H0 is rejected.
Thus, we conclude that at least one of the second-order terms should be included in the model.
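A minimal sketch of computing the critical value with scipy rather than an online calculator:

```python
from scipy.stats import f

print(f.ppf(0.95, dfn=3, dfd=386))   # upper 5% F critical value, about 2.63
```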



Prodi S1 Ilmu Aktuaria

Regression Pitfalls

Fevi Novkaniza

LINEAR MODEL

November 10, 2021

Introduction

We’ll look at some of the main things that can go wrong with a
multiple linear regression model
We’ll also consider methods for overcoming some of these pitfalls:
1 Observation vs Experimentation
2 Multicollinearity
3 Outliers & Influential observation
4 Overfitting
5 Excluding important predictor variables
6 Extrapolation
7 Missing data

Observation vs Experimentation

Multicollinearity

Multicollinearity exists when two or more of the predictors in a regression model are moderately or highly correlated with one another.
Why can’t a researcher just collect his data in such a way to ensure
that the predictors aren’t highly correlated? Then, multicollinearity
wouldn’t be a problem, and we wouldn’t have to worry about it.
Unfortunately, researchers often can’t control the predictors. Obvious
examples include a person’s gender, race, grade point average, math
SAT score, IQ, and starting salary.
For each of these predictor examples, the researcher just observes the
values as they occur for the people in the random sample.

Multicollinearity

When multicollinearity exists, any of the following outcomes can be exacerbated:
The estimated regression coefficient of any one variable depends on
which other predictors are included in the model.
The precision of the estimated regression coefficients decreases as
more predictors are added to the model. The marginal contribution of
any one predictor variable in reducing the error sum of squares
depends on which other predictors are already in the model.
Hypothesis tests for βk = 0 may yield different conclusions depending
on which predictors are in the model

Types of Multicollinearity

There are two types of multicollinearity:
Structural multicollinearity is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor x² from the predictor x.
Data-based multicollinearity is a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected.

Example

The following data (bloodpress.txt) were recorded on 20 individuals with high blood pressure:
blood pressure (y = BP, in mm Hg)
age (x1 = Age, in years)
weight (x2 = Weight, in kg)
body surface area (x3 = BSA, in sq m)
duration of hypertension (x4 = Dur, in years)
basal pulse (x5 = Pulse, in beats per minute)
stress index (x6 = Stress)

Example

The researchers were interested in determining if a relationship exists between blood pressure and age, weight, body surface area, duration, pulse rate and/or stress level.
The matrix plot of BP, Age, Weight, and BSA:

Example

Blood pressure appears to be related fairly strongly to Weight and BSA, and hardly related at all to Stress level.
Weight and BSA appear to be strongly related, while Stress and BSA appear to be hardly related at all.
Data Analysis

The following correlation matrix provides further evidence of the above claims.

The high correlation among some of the predictors suggests that data-based multicollinearity exists.

Multicollinearity

What happens if the predictor variables are highly correlated?
Let’s see the relationships among the response y = BP and the predictors x2 = Weight and x3 = BSA:

Incidentally, it shouldn’t be too surprising that a person’s weight and body surface area are highly correlated.
Data Analysis

The regression of the response y = BP on the predictor x2 = Weight:

The estimated coefficient b2 = 1.2009, se(b2) = 0.0930, and SSR(x2) = 505.472.

Seq SS

Seq SS (sequential sums of squares) are measures of variation for different components of the model.
Unlike the adjusted sums of squares, Seq SS depend on the order in which the terms are entered into the model, as the sketch below illustrates.
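A minimal sketch of seeing this order dependence with statsmodels (bloodpress.txt column names are assumed to be BP, Weight, BSA):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("bloodpress.txt", sep=r"\s+")
fit_wb = smf.ols("BP ~ Weight + BSA", data=df).fit()
fit_bw = smf.ols("BP ~ BSA + Weight", data=df).fit()
print(anova_lm(fit_wb, typ=1))   # SSR(Weight), then SSR(BSA | Weight)
print(anova_lm(fit_bw, typ=1))   # SSR(BSA), then SSR(Weight | BSA)
```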

Data Analysis

The regression of the response y = BP on the predictor x3 = BSA:

The estimated coefficient b3 = 34.44, se(b3) = 4.69, and SSR(x3) = 419.858.

Data Analysis

The regression of the response y = BP on x2 = Weight and x3 = BSA (in that order):

The estimated coefficients b2 = 1.039, b3 = 5.83, se(b2) = 0.193, se(b3) = 6.06, and SSR(x3|x2) = 2.814.

Data Analysis

And finally, the regression of the response y = BP on x3 = BSA and x2 = Weight (in that order):

The estimated coefficients b2 = 1.039, b3 = 5.83, se(b2) = 0.193, se(b3) = 6.06, and SSR(x2|x3) = 88.43.

Data Analysis

Compiling the results in a summary table, we obtain:

It appears as if, when predictors are highly correlated, the answers you get
depend on the predictors in the model.

Effect#1

When predictor variables are correlated, the estimated regression coefficient of any one variable depends on which other predictor variables are included in the model.
Here’s the relevant portion of the table:

Note that, depending on which predictors we include in the model, we obtain different estimates of the slope parameter for x3 = BSA.

Effect#1

If x3 = BSA is the only predictor included in our model, we claim that for every additional one square meter increase in body surface area (BSA), blood pressure (BP) increases by 34.4 mm Hg.
On the other hand, if x2 = Weight and x3 = BSA are both included in our model, we claim that for every additional one square meter increase in body surface area (BSA), holding weight constant, blood pressure (BP) increases by only 5.83 mm Hg.

Effect#1

The high correlation among the two predictors is what causes the
large discrepancy
When interpreting b3 = 34.4 in the model that excludes x2 =
Weight, keep in mind that when we increase x3 = BSA then x2 =
Weight also increases and both factors are associated with increased
blood pressure
However, when interpreting b3 = 5.83 in the model that includes x2
= Weight, we keep x2 = Weight fixed, so the resulting increase in
blood pressure is much smaller.

Effect#2

When predictor variables are correlated, the precision of the estimated regression coefficients decreases as more predictor variables are added to the model.
Here’s the relevant portion of the table:

The standard error for the estimated slope b2 obtained from the
model including both x2 = Weight and x3 = BSA is about double the
standard error for the estimated slope b2 obtained from the model
including only x2 = Weight
Effect#2

The standard error for the estimated slope b3 obtained from the
model including both x2 = Weight and x3 = BSA is about 30%
larger than the standard error for the estimated slope b3 obtained
from the model including only x3 = BSA.
What is the major implication of these increased standard errors?

Effect#2

Recall that the standard errors are used in the calculation of the
confidence intervals for the slope parameters.
That is, increased standard errors of the estimated slopes lead to
wider confidence intervals, and hence less precise estimates of the
slope parameters.

Effect#3

When predictor variables are correlated, the marginal contribution of any one predictor variable in reducing the error sum of squares varies depending on which other variables are already in the model.
For example, regressing the response y = BP on the predictor x2 =
Weight, we obtain SSR(x2) = 505.472. But, regressing the response
y = BP on the two predictors x3 = BSA and x2 = Weight (in that
order), we obtain SSR(x2|x3) = 88.43.
The first model suggests that weight reduces the error sum of squares
substantially (by 505.472), but the second model suggests that weight
doesn’t reduce the error sum of squares all that much (by 88.43) once
a person’s body surface area is taken into account.

Effect#3

This should make intuitive sense. In essence, weight appears to explain some of the variation in blood pressure.
However, because weight and body surface area are highly correlated,
most of the variation in blood pressure explained by weight could just
have easily been explained by body surface area.
Therefore, once you take into account a person’s body surface area,
there’s not much variation left in the blood pressure for weight to
explain.

Effect#3

We see a similar phenomenon when we enter the predictors into the model in the reverse order. That is, regressing the response y = BP on the predictor x3 = BSA, we obtain SSR(x3) = 419.858. But, regressing the response y = BP on the two predictors x2 = Weight and x3 = BSA (in that order), we obtain SSR(x3|x2) = 2.814.
The first model suggests that body surface area reduces the error sum
of squares substantially (by 419.858), and the second model suggests
that body surface area doesn’t reduce the error sum of squares all that
much (by only 2.814) once a person’s weight is taken into account

Effect#4

When predictor variables are correlated, hypothesis tests for βk = 0 may yield different conclusions depending on which predictor variables are in the model. (This effect is a direct consequence of the three previous effects.)
To illustrate this effect, let’s focus primarily on the outcome of the t-tests for testing H0: βBSA = 0 and H0: βWeight = 0.
The regression of the response y = BP on the predictor x3 = BSA:

There is sufficient evidence at the 0.05 level to conclude that blood pressure is significantly related to body surface area.
Effect#4

The regression of the response y = BP on the predictor x2 = Weight:

There is sufficient evidence at the 0.05 level to conclude that blood pressure is significantly related to weight.

Effect#4

And, the regression of the response y = BP on the predictors x2 = Weight and x3 = BSA:

There is sufficient evidence at the 0.05 level to conclude that, after taking into account body surface area, blood pressure is significantly related to weight.

Effect#4

However, the regression also indicates that the P-value associated with the t-test for testing H0: βBSA = 0 is 0.350.
There is insufficient evidence at the 0.05 level to conclude that blood pressure is significantly related to body surface area after taking into account weight.
This might sound contradictory to what we claimed earlier (blood pressure is indeed significantly related to body surface area), but once you take into account a person’s weight, body surface area doesn’t explain much of the remaining variability in blood pressure readings.

Effect#5

High multicollinearity among predictor variables does not prevent good, precise predictions of the response within the scope of the model.
The following output illustrates how the predictions don’t change all that much from model to model:

Effect#5

The first output yields a predicted blood pressure of 112.7 mm Hg for a person whose weight is 92 kg based on the regression of blood
pressure on weight
The second output yields a predicted blood pressure of 114.1 mm Hg
for a person whose body surface area is 2 square meters based on the
regression of blood pressure on body surface area
And the last output yields a predicted blood pressure of 112.8 mm Hg
for a person whose body surface area is 2 square meters and whose
weight is 92 kg based on the regression of blood pressure on body
surface area and weight
Reviewing the confidence intervals and prediction intervals, you can
see that they too yield similar results regardless of the model.

Homework

What happens to the model if "Pulse" is also included as a predictor?



Prodi S1 Ilmu Aktuaria

Variance Inflation Factor

Fevi Novkaniza

LINEAR MODEL

November 18, 2021



Detecting Multicollinearity Using VIF

Some of the common methods used for detecting multicollinearity include:
The analysis exhibits the signs of multicollinearity, such as estimates of the coefficients varying excessively from model to model.
The t-tests for each of the individual slopes are non-significant (P > 0.05), but the overall F-test that all of the slopes are simultaneously 0 is significant (P < 0.05).
The correlations among pairs of predictor variables are large.
Looking at correlations only among pairs of predictors is limiting, however: it is possible that the pairwise correlations are small and yet a linear dependence exists among three or even more variables.

What is a VIF?

A variance inflation factor (VIF) quantifies how much the variance is
inflated. But what variance?
Recall that we learned previously that the standard errors — and
hence the variances — of the estimated coefficients are inflated when
multicollinearity exists
A VIF exists for each of the predictors in a multiple regression model.
For example, the variance inflation factor for the estimated regression
coefficient bj (VIFj ) is just the factor by which the variance of bj is
”inflated” by the existence of correlation among the predictor
variables in the model.

What is a VIF?

In particular, the variance inflation factor for the jth predictor is:

VIFj = 1/(1 − Rj²)

where Rj² is the R²-value obtained by regressing the jth predictor on the remaining predictors. How do we interpret the variance inflation factors for a regression model?
A VIF of 1 means that there is no correlation among the jth predictor
and the remaining predictor variables, and hence the variance of bj is
not inflated at all
The general rule of thumb is that VIFs exceeding 4 warrant further
investigation, while VIFs exceeding 10 are signs of serious
multicollinearity requiring correction
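A minimal sketch of computing the VIFs with statsmodels (column names from the blood pressure data are assumed):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("bloodpress.txt", sep=r"\s+")
X = sm.add_constant(df[["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]])
for j, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, j))
```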

What is a VIF?

Let’s return to the blood pressure data (bloodpress.txt), in which researchers observed the following data on 20 individuals with high blood pressure:
blood pressure (y = BP, in mm Hg)
age (x1 = Age, in years)
weight (x2 = Weight, in kg)
body surface area (x3 = BSA, in sq m)
duration of hypertension (x4 = Dur, in years)
basal pulse (x5 = Pulse, in beats per minute)
stress index (x6 = Stress)

Correlation Matrix

Recall the following correlation matrix:

Some of the predictors are at least moderately marginally correlated. For example, body surface area (BSA) and weight are strongly correlated (r = 0.875), and weight and pulse are fairly strongly correlated (r = 0.659).
On the other hand, none of the pairwise correlations among age,
weight, duration and stress are particularly strong (r < 0.40 in each
case).
Model

Regressing y = BP on all six of the predictors, we obtain:

VIF

Three of the variance inflation factors (8.42, 5.33, and 4.41) are fairly large.
The VIF for the predictor Weight, for example, tells us that the
variance of the estimated coefficient of Weight is inflated by a factor
of 8.42 because Weight is highly correlated with at least one of the
other predictors in the model
Let’s verify the calculation of the VIF for the predictor Weight.
Regressing the predictor x2 = Weight on the remaining five predictors:

VIF

R²Weight is 88.12% or, in decimal form, 0.8812.

VIF

Therefore, the variance inflation factor for the estimated coefficient of Weight is, by definition:

VIFWeight = 1/(1 − 0.8812) = 8.42

Again, this variance inflation factor tells us that the variance of the
weight coefficient is inflated by a factor of 8.42 because Weight is
highly correlated with at least one of the other predictors in the
model.
So, what to do? One solution to dealing with multicollinearity is to
remove some of the violating predictors from the model.

VIF

If we review the pairwise correlations again, we see that the predictors Weight and BSA are highly correlated (r = 0.875).
We can choose to remove either predictor from the model. The
decision of which one to remove is often a scientific or practical one
For example, if the researchers here are interested in using their final
model to predict the blood pressure of future individuals, their choice
should be clear.
Which of the two measurements — body surface area or weight — do
you think would be easier to obtain?
If indeed weight is an easier measurement to obtain than body surface
area, then the researchers would be well-advised to remove BSA from
the model and leave Weight in the model

VIF

Reviewing again the above pairwise correlations, we see that the predictor Pulse also appears to exhibit fairly strong marginal correlations with several of the predictors, including Age (r = 0.619), Weight (r = 0.659) and Stress (r = 0.506).
Therefore, the researchers could also consider removing the predictor
Pulse from the model
Let’s see how the researchers would do. Regressing the response y =
BP on the four remaining predictors Age, Weight, Duration, and
Stress, we obtain:

VIF

The remaining variance inflation factors are quite satisfactory.
In terms of the adjusted R²-value, we did not seem to lose much by dropping the two predictors BSA and Pulse from our model: the adjusted R²-value decreased to only 98.97% from the original adjusted R²-value of 99.44%.

Reducing Data-based Multicollinearity

We should care about reducing multicollinearity because it all comes down to drawing conclusions about the population slope parameters.
If the variances of the estimated coefficients are inflated by
multicollinearity, then our confidence intervals for the slope
parameters are wider and therefore less useful.
Eliminating or even reducing the multicollinearity therefore yields
narrower, more useful confidence intervals for the slopes
One way of reducing data-based multicollinearity is to remove one or
more of the violating predictors from the regression model
Another way is to collect additional data under different experimental
or observational conditions

Example

Researchers running the Allen Cognitive Level (ACL) Study were interested in the relationship of ACL test scores to the level of psychopathology.
They therefore collected the following data on a set of 69 patients in
a hospital psychiatry unit:
Response y = ACL test score
X1 = vocabulary (Vocab) score on the Shipley Institute of Living Scale
X2 = abstraction (Abstract) score on the Shipley Institute of Living
Scale
X3 = score on the Symbol-Digit Modalities Test (SDMT)

Example

A very strong relationship (r = 0.99) exists between the two predictors.

Example

Regressing the response y = ACL on the predictors SDMT, Vocab, and Abstract, we obtain:

The VIFs for Vocab and Abstract are very large.

Example

What should we do about this? We could opt to remove one of the two predictors from the model.
Alternatively, if we have a good scientific reason for needing both of the predictors to remain in the model, we could go out and collect more data. Let’s try this second approach here.
Let’s imagine that we went out and collected more data, and in so doing, obtained the actual data collected on all 69 patients enrolled in the Allen Cognitive Level (ACL) Study. A matrix plot of the resulting data set:

Example

Pearson correlation of Vocab and Abstract = 0.698 (it is just a weaker correlation now).

Example

The round data points in blue represent the 23 data points in the original
data set, while the square red data points represent the 46 newly collected
data points.
Example

As you can see from the plot, collecting the additional data has expanded the "base" over which the "best fitting plane" will sit.
The existence of this larger base allows less room for the plane to tilt from sample to sample, and thereby reduces the variance of the estimated slope coefficients.
Let’s see if the addition of the new data helps to reduce the
multicollinearity here
Regressing the response y = ACL on the predictors SDMT, Vocab,
and Abstract:

Example

The researchers could now feel comfortable proceeding with drawing conclusions about the effects of the vocabulary and abstraction scores on the level of psychopathology.

One thing to keep in mind

One thing to keep in mind: in order to reduce the multicollinearity that exists, it is not sufficient to just go out and collect any old data.
The data have to be collected in such a way as to ensure that the correlations among the violating predictors are actually reduced.
That is, collecting more of the same kind of data won’t help to reduce the multicollinearity; the data have to be collected to ensure that the "base" is sufficiently enlarged.
Doing so, of course, changes the characteristics of the studied population, and therefore should be reported accordingly.

Reducing Structural Multicollinearity

Recall that structural multicollinearity is multicollinearity that is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor x² from the predictor x.
Because of this, at the same time that we learn here about reducing structural multicollinearity, we learn more about polynomial regression models.

Example

"How is the amount of immunoglobin in blood (y) related to maximal oxygen uptake (x)?"
Because some researchers were interested in answering this research question, they collected the following data on a sample of 30 individuals:
yi = amount of immunoglobin in blood (mg) of individual i
xi = maximal oxygen uptake (ml/kg) of individual i

Example

The scatter plot of the resulting data suggests that there might be some
curvature to the trend in the data.


Example

If 0 is a possible x value, then b0 is the predicted response when x = 0. Otherwise, the interpretation of b0 is meaningless.
The estimated coefficient b1 is the estimated slope of the tangent line
at x = 0
The estimated coefficient b2 indicates the up/down direction of the
curve. That is:
if b2 < 0, then the curve is concave down
if b2 > 0, then the curve is concave up

Example

If we look at the output we obtain upon regressing the response y = igg on the predictors oxygen and oxygen2:

By the nature of the model, there is "structural multicollinearity."

Example

The neat thing here is that we can reduce the multicollinearity in our
data by doing what is known as ”centering the predictors.”
Centering a predictor merely entails subtracting the mean of the
predictor values in the data set from each predictor value.
For example, the mean of the oxygen values in our data set is 50.64:

Therefore, in order to center the predictor oxygen, we merely subtract 50.64 from each oxygen value in our data set. Doing so, we obtain the centered predictor:
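A minimal sketch of centering and refitting (the file name and the column names igg, oxygen are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("igg.csv")
df["oxcent"] = df["oxygen"] - df["oxygen"].mean()   # subtract the mean (50.64 here)
fit = smf.ols("igg ~ oxcent + I(oxcent**2)", data=df).fit()
print(df["oxcent"].corr(df["oxcent"] ** 2))         # far smaller than for raw oxygen
print(fit.summary())
```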

Example

The correlation has gone from r = 0.995 to a rather low r = 0.219.

Example

Having centered the predictor oxygen, we must reformulate our quadratic polynomial regression model accordingly.
That is, we now formulate our model as:

yi = β0* + β1*(xi − x̄) + β11*(xi − x̄)² + εi

or alternatively:

yi = β0* + β1*xi* + β11*(xi*)² + εi

where:
yi = amount of immunoglobin in blood (mg)
xi* = xi − x̄ denotes the centered predictor
and the error terms εi are independent, normally distributed and have equal variance σ².
Note that we add asterisks to each of the parameters in order to make it clear that the parameters differ from the parameters in the original model we formulated.
Example

Based on our original model, the variance inflation factors for oxygen and oxygensq were 99.9.
Now, regressing y = igg on the centered predictors oxcent and oxcentsq, we see that the VIFs have dropped significantly; now they are 1.05 in each case.

Because we reformulated our model based on the centered predictors, the meaning of the parameters must be changed accordingly. Now, the estimated coefficients tell us:
The estimated coefficient b0 is the predicted response y when the
predictor x equals the sample mean of the predictor values
The estimated coefficient b1 is the estimated slope of the tangent line
at the predictor mean — and, often, it is similar to the estimated
slope in the simple linear regression model
The estimated coefficient b2 indicates the up/down direction of curve.
That is:
if b2 < 0, then the curve is concave down
if b2 > 0, then the curve is concave up


So, here, in this example, the estimated coefficient b0 = 1632.3 tells us that a male whose maximal oxygen uptake is 50.64 ml/kg is predicted to have 1632.3 mg of immunoglobin in his blood.
The estimated coefficient b1 = 34.00 tells us that when an individual’s maximal oxygen uptake is near 50.64 ml/kg, we can expect the individual’s immunoglobin to increase by 34.00 mg for every 1 ml/kg increase in maximal oxygen uptake.


As the following plot of the estimated quadratic function suggests, the reformulated regression function appears to describe the trend in the data well. The adjusted R²-value is still 93.3%.

We shouldn’t be surprised to see that the estimates of the coefficients in our reformulated polynomial regression model are quite similar to the estimates of the coefficients for the simple linear regression model:

The estimated coefficient b1 = 34.00 for the polynomial regression model and b1 = 32.74 for the simple linear regression model.
The estimated coefficient b0 = 1632 for the polynomial regression model and b0 = 1558 for the simple linear regression model.
The similarities in the estimates, of course, arise from the fact that the predictors are nearly uncorrelated, and therefore the estimates of the coefficients don’t change all that much from model to model.


Prodi S1 Ilmu Aktuaria

Residual Analysis

Fevi Novkaniza

LINEAR MODEL

November 25, 2021

Fevi Novkaniza Residual Analysis 1


The basic idea of residual analysis
Prodi S1 Ilmu Aktuaria

Recall that not all of the data points in a sample will fall right on the
least squares regression line
The vertical distance between any one data point yi and its estimated
value ŷi is its observed ”residual”: ei = yi − ŷi
Each observed residual can be thought of as an estimate of the actual
unknown ”true error” term: εi = Yi − E(Yi)
The basic idea of residual analysis, therefore, is to investigate the
observed residuals to see if they behave “properly.”
We analyze the residuals to see if they support the assumptions of
linearity, independence, normality and equal variances.
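A minimal numpy sketch of these definitions, on made-up numbers (not the study data):

import numpy as np

# Toy data (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares fit; np.polyfit returns the slope first, then intercept.
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# Observed residuals e_i = y_i - yhat_i.
e = y - y_hat
print(e)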

Fevi Novkaniza Residual Analysis 2


Residuals vs. Fits Plot
Prodi S1 Ilmu Aktuaria

When conducting a residual analysis, a ”residuals versus fits plot” is
the most frequently created plot.
It is a scatter plot of residuals on the y axis and fitted values
(estimated responses) on the x axis
The plot is used to detect non-linearity, unequal error variances, and
outliers
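Continuing the toy sketch above, the plot itself might be drawn with matplotlib (axis labels are my choice):

import matplotlib.pyplot as plt

# Residuals vs. fits plot, reusing y_hat and e from the sketch above.
plt.scatter(y_hat, e)
plt.axhline(0, linestyle="--")  # the residual = 0 reference line
plt.xlabel("Fitted value")
plt.ylabel("Residual")
plt.show()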

Fevi Novkaniza Residual Analysis 3


Example
Prodi S1 Ilmu Aktuaria

Urbano-Marquez et al. (1989) were interested in determining whether or
not alcohol consumption was linearly related to muscle strength.
In the fitted line plot, the predicted response of the men whose alcohol
consumption is around 40 is about 14.
Fevi Novkaniza Residual Analysis 4
Example
Prodi S1 Ilmu Aktuaria

The plot suggests that there is a decreasing linear relationship
between alcohol and arm strength.
It also suggests that there are no unusual data points in the data set.
It illustrates that the variation around the estimated regression line is
constant suggesting that the assumption of equal error variances is
reasonable
Here is the corresponding residuals versus fits plot for the simple
linear regression model with arm strength as the response and level of
alcohol consumption as the predictor:

Fevi Novkaniza Residual Analysis 5


Prodi S1 Ilmu Aktuaria

The fitted value of the men whose alcohol consumption is around 40 is
about 14, and their deviation from the residual = 0 line shares the same
pattern as their deviation from the estimated regression line.
Any data point that falls directly on the estimated regression line has
a residual of 0. Therefore, the residual = 0 line corresponds to the
estimated regression line
Fevi Novkaniza Residual Analysis 6
Prodi S1 Ilmu Aktuaria

Here are the characteristics of a well-behaved residual vs. fits plot and
what they suggest about the appropriateness of the simple linear
regression model:
The residuals ”bounce randomly” around the 0 line. This suggests
that the assumption that the relationship is linear is reasonable
The residuals roughly form a ”horizontal band” around the 0 line.
This suggests that the variances of the error terms are equal
No one residual ”stands out” from the basic random pattern of
residuals. This suggests that there are no outliers

Fevi Novkaniza Residual Analysis 7


Residuals vs. Predictor Plot
Prodi S1 Ilmu Aktuaria

An alternative to the residuals vs. fits plot is a ”residuals vs. predictor


plot.”
The interpretation of a ”residuals vs. predictor plot” is identical to
that for a ”residuals vs. fits plot.”
That is, a well-behaved plot will bounce randomly and form a roughly
horizontal band around the residual = 0 line. And, no data points will
stand out from the basic random pattern of the other residuals.

Fevi Novkaniza Residual Analysis 8


Prodi S1 Ilmu Aktuaria

The residuals vs. predictor plot for the simple linear regression model with
arm strength as the response and level of alcohol consumption as the
predictor:

The residuals vs. predictor plot is just a mirror image of the residuals vs.
fits plot. The residuals vs. predictor plot offers no new information.
Fevi Novkaniza Residual Analysis 9
Identifying Specific Problems Using Residual Plots
Prodi S1 Ilmu Aktuaria

Specifically, we will investigate:


how a non-linear regression function shows up on a residuals vs. fits
plot
how unequal error variances show up on a residuals vs. fits plot
how an outlier shows up on a residuals vs. fits plot.

Fevi Novkaniza Residual Analysis 10


Prodi S1 Ilmu Aktuaria

”How does a non-linear regression function show up on a residuals vs.
fits plot?”
The residuals depart from 0 in some systematic manner, such as
being positive for small x values, negative for medium x values, and
positive again for large x values
Any systematic (non-random) pattern is sufficient to suggest that the
regression function is not linear.

Fevi Novkaniza Residual Analysis 11


Example
Prodi S1 Ilmu Aktuaria

The fitted line plot of the resulting data suggests that there is a
relationship between groove depth and mileage. The relationship is just
not linear. The corresponding residuals vs. fits plot accentuates this claim:

Fevi Novkaniza Residual Analysis 12


Example
Prodi S1 Ilmu Aktuaria

Note that the residuals depart from 0 in a systematic manner. They are
positive for small x values, negative for medium x values, and positive
again for large x values. Clearly, a non-linear model would better describe
the relationship between the two variables.
Fevi Novkaniza Residual Analysis 13
Prodi S1 Ilmu Aktuaria

Notice that the R² value is very high (95.26%).
This is an excellent example of the caution that ”a large R² value should
not be interpreted as meaning that the estimated regression line fits
the data well.”
The large R² value tells you that if you wanted to predict groove
depth, you’d be better off taking mileage into account than not.
The residuals vs. fits plot tells you, though, that your prediction
would be better if you formulated a non-linear model rather than a
linear one

Fevi Novkaniza Residual Analysis 14


Prodi S1 Ilmu Aktuaria

How does non-constant error variance show up on a residual vs. fits plot?
The Answer: Non-constant error variance shows up on a residuals vs.
fits (or predictor) plot in any of the following ways:
1. The plot has a ”fanning” effect. That is, the residuals are close to 0
for small x values and are more spread out for large x values.
2. The plot has a ”funneling” effect. That is, the residuals are spread out
for small x values and close to 0 for large x values.
3. Or, the spread of the residuals in the residuals vs. fits plot varies in
some complex fashion.
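If a numeric companion to this visual check is wanted, one option (my addition, not from the slides) is the Breusch-Pagan test in statsmodels; a small p-value is evidence against constant error variance. A minimal sketch on the toy data from earlier:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Toy data again (made up); the design matrix includes a constant column.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
results = sm.OLS(y, sm.add_constant(x)).fit()

# Breusch-Pagan test: a small p-value suggests non-constant variance.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(
    results.resid, results.model.exog)
print(lm_pvalue)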

Fevi Novkaniza Residual Analysis 15


Example
Prodi S1 Ilmu Aktuaria

To investigate the relationship between plutonium activity (x, in pCi/g)
and alpha count rate (y, in number per second), a study was conducted on
23 samples of plutonium. The following fitted line plot was obtained on
the resulting data:

Fevi Novkaniza Residual Analysis 16


Prodi S1 Ilmu Aktuaria

The plot suggests that there is a linear relationship between alpha
count rate and plutonium activity.
It also suggests that the error terms vary around the regression line in
a non-constant manner — as the plutonium level increases, not only
does the mean alpha count rate increase, but also the variance
increases
That is, the fitted line plot suggests that the assumption of equal
variances is violated

Fevi Novkaniza Residual Analysis 17


Example
Prodi S1 Ilmu Aktuaria

As is generally the case, the corresponding residuals vs. fits plot
accentuates this claim:

Note that the residuals ”fan out” from left to right rather than exhibiting
a consistent spread around the residual = 0 line. The residual vs. fits plot
suggests that the error variances are not equal.
Fevi Novkaniza Residual Analysis 18
Prodi S1 Ilmu Aktuaria

How does an outlier show up on a residuals vs. fits plot?
The observation’s residual stands apart from the basic random
pattern of the rest of the residuals
The random pattern of the residual plot can even disappear if one
outlier really deviates from the pattern of the rest of the data
An Example: Is there a relationship between tobacco use and alcohol use?
The British government regularly conducts surveys on household spending.
One such survey (Family Expenditure Survey, Department of Employment,
1981) determined the average weekly expenditure on tobacco (x, in British
pounds) and the average weekly expenditure on alcohol (y, in British
pounds) for households in n = 11 different regions in the United Kingdom.

Fevi Novkaniza Residual Analysis 19


Example
Prodi S1 Ilmu Aktuaria

The fitted line plot of the resulting data suggests that there is an
outlier, in the lower right corner of the plot,
which corresponds to the Northern Ireland region. In fact, the outlier is so
far removed from the pattern of the rest of the data that it appears to be
”pulling the line” in its direction.
Fevi Novkaniza Residual Analysis 20
Prodi S1 Ilmu Aktuaria

As is generally the case, the corresponding residuals vs. fits plot
accentuates this claim:

Note that Northern Ireland’s residual stands apart from the basic random
pattern of the rest of the residuals. That is, the residual vs. fits plot
suggests that an outlier exists.
Fevi Novkaniza Residual Analysis 21
Prodi S1 Ilmu Aktuaria

This is an excellent example of the caution that R² can be greatly
affected by just one data point.
Removing that one data point from the data set and refitting the
regression line, we obtain:
The R² value jumps from 5% to 61.5%. One data point can greatly
affect the value of R².
Fevi Novkaniza Residual Analysis 22
Prodi S1 Ilmu Aktuaria

How large does a residual have to be before a data point should be
flagged as an outlier?
We can make the residuals ”unitless” by dividing them by their
standard deviation
In this way we create what are called ”standardized residuals.”
They tell us how many standard deviations above — if positive — or
below — if negative — a data point is from the estimated regression
line
Any observations with a standardized residual greater than 2 or
smaller than -2 might be flagged for further investigation
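A hedged sketch of the computation with statsmodels, reusing the toy fit results from earlier; resid_studentized_internal is the internally studentized residual, which matches what Minitab reports as the standardized residual:

import numpy as np
from statsmodels.stats.outliers_influence import OLSInfluence

# 'results' is the fitted OLS model from the earlier sketch. Internally
# studentized residuals are e_i / (s * sqrt(1 - h_ii)).
std_resid = OLSInfluence(results).resid_studentized_internal

# Flag observations more than 2 standard deviations from the line.
print(np.flatnonzero(np.abs(std_resid) > 2))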

Fevi Novkaniza Residual Analysis 23


Prodi S1 Ilmu Aktuaria

The corresponding standardized residuals vs. fits plot for our expenditure
survey example looks like:

The standardized residual of the suspicious data point is smaller than
−2. The data point lies more than 2 standard deviations below its mean.
Since this is such a small data set, the data point should be flagged for
further investigation!
Fevi Novkaniza Residual Analysis 24
Prodi S1 Ilmu Aktuaria

Most statistical software identifies observations with large standardized
residuals. Here is what a portion of Minitab’s output for our expenditure
survey example looks like:

Minitab labels observations with large standardized residuals with an ”R.”
For our example, Minitab reports that observation 11 (for which tobacco
= 4.56 and alcohol = 4.02) has a large standardized residual (−2.58).
The data point has been flagged for further investigation.

Fevi Novkaniza Residual Analysis 25


Prodi S1 Ilmu Aktuaria

Recommended strategy, once you’ve identified a data point as being
unusual:
Determine whether a simple — and therefore correctable — mistake
was made in recording or entering the data point. Examples include
transcription errors (recording 62.1 instead of 26.1) or data entry
errors (entering 99.1 instead of 9.1). Correct any mistakes you find
Determine if the measurement was made in such a way that keeping
the experimental unit in the study can no longer be justified. Was
some procedure not conducted according to study guidelines? For
example, was a person’s blood pressure measured standing up rather
than sitting down? Was the measurement made on someone not in
the population of interest? For example, was the survey completed by
a man instead of a woman? If it is convincingly justifiable, remove
the data point from the data set.

Fevi Novkaniza Residual Analysis 26


Prodi S1 Ilmu Aktuaria

If the first two steps don’t resolve the problem, consider analyzing the
data twice — once with the data point included and once with the
data point excluded. Report the results of both analyses

Fevi Novkaniza Residual Analysis 27


Residuals vs. Order Plot
Prodi S1 Ilmu Aktuaria

We will learn how to use a ”residuals vs. order plot” as a way of
detecting a particular form of non-independence of the error terms,
namely serial correlation.
If the data are obtained in a time (or space) sequence, a residuals vs.
order plot helps to see if there is any correlation between the error
terms that are near each other in the sequence
The plot is only appropriate if you know the order in which the data
were collected!
What is this residuals vs. order plot all about? It is a scatter plot
with residuals on the y axis and the order in which the data were
collected on the x axis.
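As an illustration (reusing the residual vector e from the earlier toy sketch, and assuming the rows are already in collection order), the plot might be drawn as:

import matplotlib.pyplot as plt

# Residuals vs. order; connecting the dots with a line helps reveal
# serial patterns in the sequence.
order = range(1, len(e) + 1)
plt.plot(order, e, marker="o")
plt.axhline(0, linestyle="--")
plt.xlabel("Order of data collection")
plt.ylabel("Residual")
plt.show()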

Fevi Novkaniza Residual Analysis 28


Prodi S1 Ilmu Aktuaria

Here’s an example of a well-behaved residuals vs. order plot:
The residuals bounce randomly around the residual = 0 line, as we would
hope. In general, residuals exhibiting normal random noise around the
residual = 0 line suggest that there is no serial correlation.

Fevi Novkaniza Residual Analysis 29


Prodi S1 Ilmu Aktuaria

A residuals vs. order plot that exhibits a (positive) trend, as the
following plot does, suggests that some of the variation in the response
is due to time.
Therefore, it might be a good idea to add the predictor ”time” to the
model.
That is, you interpret this plot just as you would interpret any other
residual vs. predictor plot. It’s just that here your predictor is ”time.”
Fevi Novkaniza Residual Analysis 30
Positive serial correlation
Prodi S1 Ilmu Aktuaria

A residuals vs. order plot that looks like the following plot:

suggests that there is ”positive serial correlation” among the error terms.
That is, positive serial correlation exists when residuals tend to be followed,
in time, by residuals of the same sign and about the same magnitude. The
plot suggests that the assumption of independent error terms is violated.
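A common numeric companion to this plot (my addition, not from the slides) is the Durbin-Watson statistic, available in statsmodels:

from statsmodels.stats.stattools import durbin_watson

# 'results' as before, with rows in time order. Values near 2 suggest no
# serial correlation; well below 2, positive; well above 2, negative.
print(durbin_watson(results.resid))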

Fevi Novkaniza Residual Analysis 31


Negative serial correlation
Prodi S1 Ilmu Aktuaria

A residuals vs. order plot that looks like the following plot:

suggests that there is ”negative serial correlation” among the error terms.
Negative serial correlation exists when residuals of one sign tend to be
followed, in time, by residuals of the opposite sign. What? Can’t you see
it? If you connect the dots in order from left to right, you should be able
to see the pattern.
Fevi Novkaniza Residual Analysis 32
Prodi S1 Ilmu Aktuaria

Negative, positive, negative, positive, negative, positive, and so on


The plot suggests that the assumption of independent error terms is
violated
If you obtain a residuals vs. order plot that looks like this, you would
again be advised to move out of the realm of regression analysis and
into that of ”time series modeling.”

Fevi Novkaniza Residual Analysis 33
