
Prodi S1 Ilmu Aktuaria

Principle of Model Building

Fevi Novkaniza

LINEAR MODEL

October 28, 2021



Why Model Building Is Important

By model building, we mean writing a model that will provide a good fit to a set of data and that will give good estimates of the mean value of y and good predictions of future values of y for given values of the independent variables.

Example

An educational research group issued a report concerning the variables related to academic achievement for a certain type of college student.
The researchers selected a random sample of students and recorded a
measure of academic achievement, y, at the end of the senior year
together with data on an extensive list of independent variables,
x1 , x2 , ..., xk , that they thought were related to y.
Among these independent variables were the student’s IQ, scores on
mathematics and verbal achievement examinations, rank in class, and
so on.

Why Model Building Is Important

They fit the model

E (y) = β0 + β1 x1 + ... + βk xk

to the data, analyzed the results, and reached the conclusion that
none of the independent variables was ‘‘significantly related’’ to y.
The goodness of fit of the model, measured by the coefficient of
determination R², was not particularly good, and t tests on individual
parameters did not lead to rejection of the null hypotheses that these
parameters equaled 0.

Example

If we hold all the other independent variables constant and vary only
x1 , E (y) will increase by the amount β1 for every unit increase in x1 .
A 1-unit change in any of the other independent variables will increase
E(y) by the value of the corresponding β parameter for that variable.

Example

Generally speaking, there will be a positive correlation between entrance achievement test scores and college academic achievement.
So, what went wrong with the educational researchers’ study?
Most likely the difficulties in the results of the educational study were
caused by the use of an improperly constructed model.
For example, the model

E (y) = β0 + β1 x1 + ... + βk xk

assumes that the independent variables x1, x2, ..., xk affect mean achievement E(y) independently of each other.

Example

Do the assumptions implied by the model agree with your knowledge about academic achievement? First, is it reasonable to assume that the effect of time spent on study is independent of native intellectual ability?
No matter how much effort some students invest in a particular subject, their rate of achievement is low; for others, it may be high. Therefore, assuming that these two variables (effort and native intellectual ability) affect E(y) independently of each other is likely to be an erroneous assumption.

Example

Second, suppose that x5 is the amount of time a student devotes to study. Is it reasonable to expect that a 1-unit increase in x5 will always produce the same change β5 in E(y)? The change in E(y) for a 1-unit increase in x5 might depend on the value of x5 (e.g., the law of diminishing returns).
Consequently, it is quite likely that the assumption of a constant rate of change in E(y) for 1-unit increases in the independent variables will not be satisfied.

Example

Clearly, the model

E(y) = β0 + β1x1 + β2x2 + ... + βkxk

was a poor choice in view of the researchers’ prior knowledge of some of the variables involved.
Terms have to be added to the model to account for interrelationships
among the independent variables and for curvature in the response
function.
Failure to include needed terms causes inflated values of SSE,
nonsignificance in statistical tests, and, often, erroneous practical
conclusions.



Two Types of Predictor Variables:
Quantitative and Qualitative
◦ Quantitative data are observations measured on a naturally occurring numerical scale.
◦ Nonnumerical data that can only be classified into one of a group of categories are said to be qualitative data.
◦ The different values of an independent variable used in regression are called its levels.
◦ Example:
We considered the problem of predicting executive salary as a function of several predictor variables.
Consider the following four independent variables that may affect executive salaries:
(a) Years of experience
(b) Gender of the employee
(c) Firm’s net asset value
(d) Rank of the employee
For each of these independent variables, give its type and describe the nature of the levels you would expect to observe.
Model with Single Quantitative
Predictor Variable
◦ Several models can be formed with a single quantitative predictor variable:
◦ First-Order (Straight-Line) Model
◦ A Second-Order (Quadratic) Model
◦ Third-Order Model
◦ A pth-Order Polynomial
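For reference, the standard forms of these polynomial models are:

First-order: E(y) = β0 + β1x
Second-order: E(y) = β0 + β1x + β2x²
Third-order: E(y) = β0 + β1x + β2x² + β3x³
pth-order: E(y) = β0 + β1x + β2x² + ... + βpx^p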
◦ How do you decide the order of the polynomial you should use to model a response if
you have no prior information about the relationship between E(y) and x?
◦ If you have data, construct a scatterplot of the data points, and see whether you can
deduce the nature of a good approximating function
◦ A pth-order polynomial, when graphed, will exhibit at most (p − 1) peaks, troughs, or reversals in direction.
First-Order Models with Two or More
Quantitative Predictor Variables
◦ Like models for a single predictor variable, models with two or more independent
variables are classified as first-order, second-order, and so forth, but it is difficult (most
often impossible) to graph the response because the plot is in a multidimensional
space.
Second-Order Models with Two or
More Quantitative Predictor
Variables
◦ Second-order models with two or more predictor variables permit curvature in the
response surface.
◦ One important type of second-order term accounts for interaction between two
variables.
◦ Consider the two-variable model
E(y) = β0 + β1x1 + β2x2 + β3x1x2
◦ This interaction model traces a ruled surface (twisted plane) in a three-dimensional
space (see Figure 5.10).
◦ The second-order term β3x1x2 is called the interaction term, and it permits the contour lines to be nonparallel.
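As a quick illustration of why the interaction term makes the contour lines nonparallel, here is a minimal sketch (not from the original slides; the coefficient values are arbitrary illustrative choices):

```python
# Contour lines of E(y) for a two-variable model, with and without an
# interaction term; the coefficients below are arbitrary illustrative values.
import numpy as np
import matplotlib.pyplot as plt

x1, x2 = np.meshgrid(np.linspace(0, 10, 100), np.linspace(0, 10, 100))
ey_plane = 1.0 + 2.0 * x1 + 3.0 * x2                  # no interaction: parallel contours
ey_inter = 1.0 + 2.0 * x1 + 3.0 * x2 - 0.5 * x1 * x2  # interaction: nonparallel contours

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, ey, title in [(axes[0], ey_plane, "No interaction"),
                      (axes[1], ey_inter, "With interaction")]:
    ax.contour(x1, x2, ey, levels=10)
    ax.set(title=title, xlabel="x1", ylabel="x2")
plt.show()
```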

◦ To determine whether the quadratic terms are important, we would test H0: β7 = β8 = β9 = 0 using the F-test for comparing nested models outlined in Section 4.13.
Coding Quantitative Predictor
Variables
◦ In fitting higher-order polynomial regression models (e.g., second- or third-order
models), it is often a good practice to code the quantitative independent variables.
◦ In a general sense, coding means transforming a set of independent variables
(qualitative or quantitative) into a new set of independent variables.
◦ There are two related reasons for coding quantitative variables:
◦ To calculate the estimates of the model parameters using the method of least squares, the computer must invert a matrix of numbers, called the coefficient (or information) matrix. Considerable rounding error may occur during the inversion process if the numbers in the coefficient matrix vary greatly in absolute value. This can produce sizable errors in the computed values of the least squares estimates, β̂0, β̂1, β̂2, .... Coding makes it computationally easier for the computer to invert the matrix, thus leading to more accurate estimates.
◦ The second reason is the problem of predictor variables (x's) being intercorrelated (called multicollinearity). When polynomial regression models (e.g., second-order models) are fit, the problem of multicollinearity is unavoidable, especially when higher-order terms are fit; see the sketch after this list.
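A minimal sketch of the idea, using illustrative (not course-specified) data, showing how coding a predictor reduces the correlation between x and x²:

```python
# Coding (here: standardizing) a quantitative predictor before fitting a
# polynomial model; values and variable names are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(40, 60, size=30)     # raw predictor, far from 0
u = (x - x.mean()) / x.std()         # coded predictor

print(np.corrcoef(x, x**2)[0, 1])    # close to 1: strong structural collinearity
print(np.corrcoef(u, u**2)[0, 1])    # much closer to 0 after coding
```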
Prodi S1 Ilmu Aktuaria

Regression with Qualitative Predictors

Fevi Novkaniza

LINEAR MODEL

October 28, 2021

Qualitative Information

Fields where qualitative variables arise: business, economics, the social and biological sciences.
Examples:
gender (M, F)
purchase status: purchase, no purchase
disability status: none, partial, full
rating grade
A way to incorporate qualitative information is to use dummy
variables

Qualitative variables

The independent variables that appear in a linear model can be one of two types: a quantitative variable is one that assumes numerical values corresponding to the points on a line.
An independent variable that is not quantitative, that is, one that is categorical in nature, is called qualitative.
The different values of an independent variable used in regression are
called its levels.

Insurance Innovation Example

A study of innovation in the insurance industry considered:
The speed with which a particular insurance innovation is adopted (Y)
Size of the insurance firm (X1 )
Type of the firm (X2 and X3 )

Qualitative Predictors

Indicator variables: called dummy variables or binary variables
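A minimal sketch of constructing such an indicator. This assumes, as in the classic version of the insurance innovation example, two firm types (mutual and stock) coded as X2 = 1 for stock firms; the data values are made up:

```python
# Building a dummy (indicator) variable for a two-level qualitative predictor.
import pandas as pd

df = pd.DataFrame({
    "months_to_adopt": [17, 26, 21, 30],          # Y: speed of adoption (illustrative)
    "size": [151, 92, 175, 31],                   # X1: firm size (illustrative)
    "firm_type": ["mutual", "stock", "mutual", "stock"],
})
df["X2"] = (df["firm_type"] == "stock").astype(int)   # 1 = stock, 0 = mutual
print(df)
```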

Figure: Illustration of Meaning of Regression Coefficient for Regression Model with Indicator Variable X2


Insurance innovation example

Figure: Regression Results for Fit of Regression Model - Insurance Innovation Example

Figure: Fitted Regression Function for Regression Model Indicator Innovation Example


More than 2 classes

The regression of tool wear (Y) on tool speed X1 and tool model
(qualitative: M1, M2, M3, M4)
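A qualitative predictor with 4 classes needs 4 − 1 = 3 indicator variables. A minimal sketch (data values are made up; M1 is taken as the baseline class):

```python
# Creating 3 dummy variables for the 4-level tool model factor.
import pandas as pd

df = pd.DataFrame({
    "wear": [0.12, 0.15, 0.09, 0.20],
    "speed": [600, 750, 500, 800],
    "model": ["M1", "M2", "M3", "M4"],
})
dummies = pd.get_dummies(df["model"], prefix="model", drop_first=True).astype(int)
df = pd.concat([df, dummies], axis=1)   # columns model_M2, model_M3, model_M4
print(df)
```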

Tool Wear Example


Some Considerations


Interaction Quantitative & Qualitative


Ordinal Interaction

Figure: Another Illustration of Regression Model with Indicator Variable X2 and Interaction Term - Insurance Innovation Example


More complex models

Qualitative predictor variables only:

Yi = β0 + β2 Xi2 + β3 Xi3 + εi

all explanatory variables are qualitative: analysis of variance (ANOVA) models
some quantitative and some qualitative explanatory variables: analysis of covariance (ANCOVA) models


Prodi S1 Ilmu Aktuaria

Example of an interaction model with quantitative predictors

Fevi Novkaniza

LINEAR MODEL

November 4, 2021



Dataset

We will use data from https://regressit.com. There are 392 complete rows of data containing information for makes and models of cars sold in the U.S. between 1970 and 1982.
The objective is to build a model for predicting fuel economy from the
other variables, which include weight, horsepower, displacement,
acceleration, cylinders, and country of origin
The variable called Year70To81 is a dummy variable for years 1970 to
1981 which will be used to define a training set.
The origin variable is a numeric code: 1 = US, 2 = Europe, 3 = Japan. Dummy variables have been created for the 3 values, to allow for country effects that could have an arbitrary pattern.

Dataset

A snapshot of the first 6 data points is as follows:

Using the dataset, we fit a multiple regression model with interaction using
just 2 predictors (MPG and Cylinders), and the fitted line is:

Data Analysis

ŷ = 2.451 + 0.089x1 + 1.203x2 − 0.054x1 x2


where ŷ, x1, and x2 are the predicted GallonsPer100Miles, MPG, and Cylinders, respectively.
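A minimal sketch of how such a fit could be obtained with statsmodels; the CSV file name is hypothetical, and the column names are assumed to match the text:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("auto_mpg.csv")   # hypothetical file with the RegressIt auto data
# MPG * Cylinders expands to MPG + Cylinders + MPG:Cylinders
fit = smf.ols("GallonsPer100Miles ~ MPG * Cylinders", data=df).fit()
print(fit.summary())               # the MPG:Cylinders row holds the interaction estimate
```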
Data Analysis

We can see that the p-value for an interaction term is very small,
< 2 × 10−16 , leading to the rejection of the hypothesis that the
interaction term is ignorable; thus, we conclude that interaction
between MPG and Cylinders is significant and should be included in
the model.
So, how to interpret the result?

Data Analysis

Since there is an interaction term, we cannot interpret the effect of x1 or x2 separately. Instead, consider the following scenario: increase x1 by 1 unit, say from x1 = a to x1 = (a + 1), while x2 is kept constant.

For x1 = a:

ŷ = 2.451 + 0.089a + 1.203x2 − 0.054ax2
  = (2.451 + 0.089a) + (1.203 − 0.054a)x2

For x1 = a + 1:

ŷ = 2.451 + 0.089(a + 1) + 1.203x2 − 0.054(a + 1)x2
  = (2.451 + 0.089(a + 1)) + (1.203 − 0.054(a + 1))x2

The change in ŷ due to a 1-unit increase in x1 is therefore

∆ŷ = 0.089 − 0.054x2,

which depends on the value of x2.
Data Analysis

If there were no interaction, we could say that, on average, GallonsPer100Miles increases by 0.089 unit when MPG is increased by 1 unit, for cars with the same number of Cylinders.
However, since there is an interaction between MPG and Cylinders, this interpretation is no longer suitable.
Given Cylinders (say 4), increasing MPG by 1 unit changes GallonsPer100Miles by −0.127 unit (i.e., 0.089 − 0.054 × 4), a reduction of 0.127.
If Cylinders is 6, the reduction in GallonsPer100Miles is even larger, 0.235 unit (0.089 − 0.054 × 6 = −0.235), for each 1-unit increment in MPG.

Data Analysis

By substituting x2 = 3, 4, ..., 8 (recall that the support for Cylinders ranges from 3 to 8) into the fitted regression line

ŷ = 2.451 + 0.089x1 + 1.203x2 − 0.054x1x2,

we obtain 6 lines representing the relationship between MPG (x) and GallonsPer100Miles (y), where f1(x) = ŷ = 6.06 − 0.073x is the fitted line for cars with Cylinders = 3, and so on, until f6(x) = ŷ = 12.075 − 0.343x is the fitted line when Cylinders = 8.
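A minimal sketch of the substitution (intercepts and slopes follow directly from the fitted coefficients above):

```python
# One line per cylinder count: intercept = 2.451 + 1.203*c, slope = 0.089 - 0.054*c
for c in range(3, 9):
    intercept = 2.451 + 1.203 * c
    slope = 0.089 - 0.054 * c
    print(f"Cylinders={c}: yhat = {intercept:.3f} + ({slope:.3f}) * MPG")
```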

Data Analysis

Figure: Fitted regression lines for GallonsPer100Miles based on MPG for 6 different cylinder counts

Generalization

In general, if there are p quantitative predictors, and all the 2-way interactions between predictors are to be included in the model, then the model would be

E(y) = β0 + β1x1 + ... + βpxp + Σ_{j<k} αjk xj xk,  j, k = 1, ..., p

The number of interaction terms is C(p, 2), the number of ways of choosing 2 predictors from p.
As the number of predictor variables (p) increases, the complexity of the model grows fast, as illustrated in the sketch below.
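The original table is not reproduced here; a minimal sketch that generates the same information:

```python
# Number of two-way interaction terms C(p, 2) as p grows.
from math import comb

for p in range(2, 11):
    print(p, comb(p, 2))   # e.g., p = 10 already adds 45 interaction terms
```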

A quadratic model with quantitative predictor

The general form of the quadratic equation is

y = f(x) = ax² + bx + c, a ≠ 0.

When a > 0 the curve is concave up, and the opposite holds for a < 0. This concept can be used in a linear model when the effect of changes in a predictor variable is not constant.

A quadratic linear model for 1 quantitative predictor variable is

E(y) = β0 + β1x + β2x²

β2 > 0 means the curve is concave up: if y is increasing in x, it increases at an increasing rate; if y is decreasing, it decreases at a decreasing rate (the decline slows).
For example, if x is increased from x = c to x = c + 1, let the change in y be ∆1; the change in y when x is increased from x = c + 1 to x = c + 2, say ∆2, satisfies ∆2 = ∆1 + 2β2 > ∆1. In particular, if ∆1 < 0 and ∆2 < 0, then |∆2| < |∆1|.
The opposite interpretation applies for β2 < 0.
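A short way to see this: the instantaneous rate of change of the quadratic mean function is dE(y)/dx = β1 + 2β2x, which is increasing in x when β2 > 0 and decreasing when β2 < 0, so successive equal increases in x produce successively larger (β2 > 0) or smaller (β2 < 0) changes in E(y).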



What is the interpretation of β1 ?

Consider the following scenario:

x = a: E(y) = β0 + β1a + β2a²
x = a + 1: E(y) = β0 + β1(a + 1) + β2(a + 1)²

The change in E(y) when x is increased from a to (a + 1) is (β1 + β2) + 2β2a.
The change in the response variable thus depends on β1, β2, and the initial value of x (i.e., x = a), so there is no explicit interpretation of β1 alone.

Example

Using the AutoMPG data, if we plot GallonsPer100Miles against MPG, it clearly shows a quadratic trend, as in the following figure. Moreover, if we calculate the ‘linear’ correlation between GallonsPer100Miles and MPG², the resulting correlation is strong, −0.8676, as depicted.

It is reasonable to propose a quadratic model.


Model: y = β0 + β1x + β2x² + ε, ε ∼ NIID(0, σ²), where y and x are GallonsPer100Miles and MPG, respectively.
The result is as follows:

The regression fitted line is ŷ = 13.97 − 0.596·MPG + 0.008·MPG²
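A minimal sketch of this fit with statsmodels (file and column names are assumptions, as in the earlier sketch):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("auto_mpg.csv")
fit = smf.ols("GallonsPer100Miles ~ MPG + I(MPG**2)", data=df).fit()
print(fit.params)     # should be close to 13.97, -0.596, 0.008
print(fit.rsquared)   # should be close to 0.9812
```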



The model is useful, since the F statistic for testing the hypothesis H0: β1 = β2 = 0 is 1.018e+04, which falls in the rejection region, with p-value < 2.2e−16 ≈ 0.
Moreover, the quadratic term is important and should be included in the model, since the partial t-test of H0: β2 = 0 against the alternative hypothesis H1: β2 ≠ 0 gave a p-value < 2e−16 ≈ 0, indicating the significant role of the quadratic term.

How to interpret the result?

Referring to the scatterplot, GallonsPer100Miles decreases as MPG increases, and since β̂2 = 0.008 > 0, the decrease in GallonsPer100Miles gets slower for higher MPG.
For cars with 1 unit higher MPG, the change in GallonsPer100Miles is

(13.97 − 0.596(a+1) + 0.008(a+1)²) − (13.97 − 0.596a + 0.008a²) = −0.596 + 0.008(2a + 1)

where a is the initial value of MPG. The fitted regression also has a high R²: 98.12% of the variability of GallonsPer100Miles can be explained by MPG using the quadratic regression model.
As a visual check of how close the predicted values of GallonsPer100Miles are to the actual values, we produce the following plot.


The result is shown in the figure below: the fitted line (blue) is close enough to most of the data points, confirming the high R².
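A minimal sketch of producing such a plot (file and column names are assumptions):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("auto_mpg.csv")
grid = np.linspace(df["MPG"].min(), df["MPG"].max(), 200)
yhat = 13.97 - 0.596 * grid + 0.008 * grid**2   # fitted quadratic from the text

plt.scatter(df["MPG"], df["GallonsPer100Miles"], s=10)
plt.plot(grid, yhat, color="blue")
plt.xlabel("MPG")
plt.ylabel("GallonsPer100Miles")
plt.show()
```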

A test for comparing nested models

Definition: a simpler Model A is said to be nested within a more complex Model B if all the terms in Model A are included in Model B.
If Model A is of the form E(y) = β0 + β1x1 + ... + βJxJ, then Model B is of the form
E(y) = β0 + β1x1 + ... + βJxJ + α1x1* + ... + αKxK*,  J ≥ 1, K ≥ 1
Model A is called the reduced model, and Model B is called the complete model (relative to Model A).
Consider a second-order model with 2 predictor variables:

E(y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2²    (1)

A test for comparing nested models

Consider also the parsimony principle: a simpler yet accurate model is preferred to a more complex one.
Recall that adding complexity to the model requires an adjustment on the sample size (i.e., increasing the sample size) to maintain accuracy, as represented by Ra²; otherwise, the model is penalized through decrements of Ra².
Thus, we probably want to use the simpler model if the higher-order model does not really contribute to explaining y; or, rather than using all the higher-order terms, include only those that are significant.

A test for comparing nested models

For example, we might prefer a model with interaction only, that is,

E(y) = β0 + β1x1 + β2x2 + β3x1x2    (2)

to the complete second-order model in (1).
We will test the hypothesis H0: β4 = β5 = 0 against the alternative H1: at least one of βj ≠ 0, j = 4, 5.
Similar to the rationale of the global F test, and adapting the formulation, we use the F statistic

Fdat = (∆SSR/df1)/(SSRB/df2)

where ∆SSR = (SSRA − SSRB) is the drop in SSR from Model A to Model B, df1 is the drop in the number of regression coefficients from Model B to Model A (i.e., the number of βj's being tested in H0), and df2 is the error degrees of freedom for Model B (i.e., the complete model with p predictors; n − (p + 1)).
In this example, df1 = 2 and df2 = n − 6, where n is the sample size.
A test for comparing nested models

If the model's assumptions are satisfied, that is, ε ∼ NIID(0, σ²), and H0 is true, then the calculated Fdat follows an F(df1, df2) distribution.
Thus, reject H0 at significance level α if Fdat > F(α; df1, df2), or equivalently if p-value < α, where p-value = Pr(F > Fdat).
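A minimal sketch of carrying out this test with statsmodels' anova_lm, which computes Fdat and its p-value directly (file and column names are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("auto_mpg.csv")
model_a = smf.ols("GallonsPer100Miles ~ MPG + Cylinders + MPG:Cylinders", data=df).fit()
model_b = smf.ols("GallonsPer100Miles ~ MPG + Cylinders + MPG:Cylinders"
                  " + I(MPG**2) + I(Cylinders**2)", data=df).fit()
print(anova_lm(model_a, model_b))   # F statistic and Pr(>F) for H0: beta4 = beta5 = 0
```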

Example

For the AutoMPG data, we want to assess whether the second-order terms, i.e., the interaction and quadratic terms, really improve the model’s performance and should be included in the model for predicting GallonsPer100Miles (GP100M) based on MPG and Cylinders (Cyl).
The complete 2nd-order model is:

E(GP100M) = β0 + β1MPG + β2Cyl + β3MPG·Cyl + β4MPG² + β5Cyl²

and the reduced model, a straight-line model, is

E(GP100M) = β0 + β1MPG + β2Cyl

Fitting the models to data, we obtain the following results.

Example

The hypothesis to be tested is H0: β3 = β4 = β5 = 0, implying there is no need to include any of the 2nd-order terms in the model, against the alternative H1: βj ≠ 0 for at least one j, j = 3, 4, 5, implying that including the interaction term or the quadratic terms (at least one of them) should improve the model’s performance. Some statistics for calculating the test statistic are:

SSRC = sC² × df = 0.1881² × 389 = 13.76345
SSRR = sR² × df = 0.5047² × 386 = 98.32273

So the test statistic is:

Fdat = ((SSRR − SSRC)/3) / (SSRC/386) = ((98.32273 − 13.76345)/3) / (13.76345/386) = 799.845
Example

The critical value at α = 0.05 with 3 and 386 degrees of freedom for the numerator and denominator, respectively, is approximately 2.63 (computed in the sketch below). Since Fdat = 799.845 > F0.05 ≈ 2.63, H0 is rejected.
Thus, we conclude that at least one of the second-order terms should be included in the model.
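A minimal sketch of computing the critical value with scipy rather than an online calculator:

```python
from scipy.stats import f

print(f.ppf(0.95, dfn=3, dfd=386))   # upper 5% F critical value, about 2.63
```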



Prodi S1 Ilmu Aktuaria

Regression Pitfalls

Fevi Novkaniza

LINEAR MODEL

November 10, 2021

Introduction

We’ll look at some of the main things that can go wrong with a
multiple linear regression model
We’ll also consider methods for overcoming some of these pitfalls:
1 Observation vs Experimentation
2 Multicollinearity
3 Outliers & Influential observation
4 Overfitting
5 Excluding important predictor variables
6 Extrapolation
7 Missing data

Observation vs Experimentation

Multicollinearity

Multicollinearity exists when two or more of the predictors in a regression model are moderately or highly correlated with one another.
Why can’t a researcher just collect his data in such a way to ensure
that the predictors aren’t highly correlated? Then, multicollinearity
wouldn’t be a problem, and we wouldn’t have to worry about it.
Unfortunately, researchers often can’t control the predictors. Obvious
examples include a person’s gender, race, grade point average, math
SAT score, IQ, and starting salary.
For each of these predictor examples, the researcher just observes the
values as they occur for the people in the random sample.

Multicollinearity

When multicollinearity exists, any of the following outcomes can be exacerbated:
The estimated regression coefficient of any one variable depends on
which other predictors are included in the model.
The precision of the estimated regression coefficients decreases as
more predictors are added to the model. The marginal contribution of
any one predictor variable in reducing the error sum of squares
depends on which other predictors are already in the model.
Hypothesis tests for βk = 0 may yield different conclusions depending
on which predictors are in the model

Types of Multicollinearity

There are two types of multicollinearity:
Structural multicollinearity is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor x² from the predictor x.
Data-based multicollinearity is a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected.

Example

The following data (bloodpress.txt) were recorded on 20 individuals with high blood pressure:
blood pressure (y = BP, in mm Hg)
age (x1 = Age, in years)
weight (x2 = Weight, in kg)
body surface area (x3 = BSA, in sq m)
duration of hypertension (x4 = Dur, in years)
basal pulse (x5 = Pulse, in beats per minute)
stress index (x6 = Stress)

Example

The researchers were interested in determining if a relationship exists between blood pressure and age, weight, body surface area, duration, pulse rate and/or stress level.
The matrix plot of BP, Age, Weight, and BSA:

Example

Blood pressure appears to be related fairly strongly to Weight and BSA, and hardly related at all to Stress level.
Weight and BSA appear to be strongly related, while Stress and BSA appear to be hardly related at all.
Data Analysis

The following correlation matrix provides further evidence of the above claims.

The high correlation among some of the predictors suggests that data-based multicollinearity exists.

Multicollinearity

What happens if the predictor variables are highly correlated?
Let’s see the relationships among the response y = BP and the predictors x2 = Weight and x3 = BSA:

Incidentally, it shouldn’t be too surprising that a person’s weight and body surface area are highly correlated.
Data Analysis

The regression of the response y = BP on the predictor x2 = Weight:

The estimated coefficient b2 = 1.2009, se(b2) = 0.0930, and SSR(x2) = 505.472.

Seq SS

Seq SS (sequential sums of squares) are measures of variation for different components of the model.
Unlike the adjusted sums of squares, Seq SS depend on the order in which the terms are entered into the model, as the sketch below illustrates.
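A minimal sketch of seeing this order dependence with statsmodels (bloodpress.txt column names are assumed to be BP, Weight, BSA):

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("bloodpress.txt", sep=r"\s+")
fit_wb = smf.ols("BP ~ Weight + BSA", data=df).fit()
fit_bw = smf.ols("BP ~ BSA + Weight", data=df).fit()
print(anova_lm(fit_wb, typ=1))   # SSR(Weight), then SSR(BSA | Weight)
print(anova_lm(fit_bw, typ=1))   # SSR(BSA), then SSR(Weight | BSA)
```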

Data Analysis

The regression of the response y = BP on the predictor x3 = BSA:

The estimated coefficient b3 = 34.44, se(b3) = 4.69, and SSR(x3) = 419.858.

Data Analysis

The regression of the response y = BP on x2 = Weight and x3 = BSA (in that order):

The estimated coefficients b2 = 1.039, b3 = 5.83, se(b2) = 0.193, se(b3) = 6.06, and SSR(x3|x2) = 2.814.

Data Analysis

And finally, the regression of the response y = BP on x3 = BSA and x2 = Weight (in that order):

The estimated coefficients b2 = 1.039, b3 = 5.83, se(b2) = 0.193, se(b3) = 6.06, and SSR(x2|x3) = 88.43.

Data Analysis

Compiling the results in a summary table, we obtain:

It appears as if, when predictors are highly correlated, the answers you get
depend on the predictors in the model.

Effect#1

When predictor variables are correlated, the estimated regression coefficient of any one variable depends on which other predictor variables are included in the model.
Here’s the relevant portion of the table:

Note that, depending on which predictors we include in the model, we obtain different estimates of the slope parameter for x3 = BSA.

Effect#1

If x3 = BSA is the only predictor included in our model, we claim that for every additional one square meter increase in body surface area (BSA), blood pressure (BP) increases by 34.4 mm Hg.
On the other hand, if x2 = Weight and x3 = BSA are both included in our model, we claim that for every additional one square meter increase in body surface area (BSA), holding weight constant, blood pressure (BP) increases by only 5.83 mm Hg.

Effect#1

The high correlation among the two predictors is what causes the
large discrepancy
When interpreting b3 = 34.4 in the model that excludes x2 =
Weight, keep in mind that when we increase x3 = BSA then x2 =
Weight also increases and both factors are associated with increased
blood pressure
However, when interpreting b3 = 5.83 in the model that includes x2
= Weight, we keep x2 = Weight fixed, so the resulting increase in
blood pressure is much smaller.

Effect#2

When predictor variables are correlated, the precision of the estimated regression coefficients decreases as more predictor variables are added to the model.
Here’s the relevant portion of the table:

The standard error for the estimated slope b2 obtained from the
model including both x2 = Weight and x3 = BSA is about double the
standard error for the estimated slope b2 obtained from the model
including only x2 = Weight
Effect#2

The standard error for the estimated slope b3 obtained from the
model including both x2 = Weight and x3 = BSA is about 30%
larger than the standard error for the estimated slope b3 obtained
from the model including only x3 = BSA.
What is the major implication of these increased standard errors?

Effect#2

Recall that the standard errors are used in the calculation of the
confidence intervals for the slope parameters.
That is, increased standard errors of the estimated slopes lead to
wider confidence intervals, and hence less precise estimates of the
slope parameters.

Effect#3

When predictor variables are correlated, the marginal contribution of any one predictor variable in reducing the error sum of squares varies depending on which other variables are already in the model.
For example, regressing the response y = BP on the predictor x2 =
Weight, we obtain SSR(x2) = 505.472. But, regressing the response
y = BP on the two predictors x3 = BSA and x2 = Weight (in that
order), we obtain SSR(x2|x3) = 88.43.
The first model suggests that weight reduces the error sum of squares
substantially (by 505.472), but the second model suggests that weight
doesn’t reduce the error sum of squares all that much (by 88.43) once
a person’s body surface area is taken into account.

Effect#3

This should make intuitive sense. In essence, weight appears to explain some of the variation in blood pressure.
However, because weight and body surface area are highly correlated,
most of the variation in blood pressure explained by weight could just
have easily been explained by body surface area.
Therefore, once you take into account a person’s body surface area,
there’s not much variation left in the blood pressure for weight to
explain.

Effect#3

We see a similar phenomenon when we enter the predictors into the model in the reverse order. That is, regressing the response y = BP on the predictor x3 = BSA, we obtain SSR(x3) = 419.858. But, regressing the response y = BP on the two predictors x2 = Weight and x3 = BSA (in that order), we obtain SSR(x3|x2) = 2.814.
The first model suggests that body surface area reduces the error sum
of squares substantially (by 419.858), and the second model suggests
that body surface area doesn’t reduce the error sum of squares all that
much (by only 2.814) once a person’s weight is taken into account

Effect#4

When predictor variables are correlated, hypothesis tests for βk = 0 may yield different conclusions depending on which predictor variables are in the model. (This effect is a direct consequence of the three previous effects.)
To illustrate this effect, let’s focus primarily on the outcome of the t-tests for testing H0: βBSA = 0 and H0: βWeight = 0.
The regression of the response y = BP on the predictor x3 = BSA:

There is sufficient evidence at the 0.05 level to conclude that blood pressure is significantly related to body surface area.
Effect#4

The regression of the response y = BP on the predictor x2 = Weight:

There is sufficient evidence at the 0.05 level to conclude that blood pressure is significantly related to weight.

Effect#4

And, the regression of the response y = BP on the predictors x2 = Weight and x3 = BSA:

There is sufficient evidence at the 0.05 level to conclude that, after taking into account body surface area, blood pressure is significantly related to weight.

Effect#4

However, the regression also indicates that the P-value associated with the t-test for testing H0: βBSA = 0 is 0.350.
There is insufficient evidence at the 0.05 level to conclude that blood pressure is significantly related to body surface area after taking into account weight.
This might sound contradictory to what we claimed earlier (blood pressure is indeed significantly related to body surface area), but once you take into account a person’s weight, body surface area doesn’t explain much of the remaining variability in blood pressure readings.

Effect#5

High multicollinearity among predictor variables does not prevent good, precise predictions of the response within the scope of the model.
The following output illustrates how the predictions don’t change all that much from model to model:

Effect#5

The first output yields a predicted blood pressure of 112.7 mm Hg for a person whose weight is 92 kg based on the regression of blood
pressure on weight
The second output yields a predicted blood pressure of 114.1 mm Hg
for a person whose body surface area is 2 square meters based on the
regression of blood pressure on body surface area
And the last output yields a predicted blood pressure of 112.8 mm Hg
for a person whose body surface area is 2 square meters and whose
weight is 92 kg based on the regression of blood pressure on body
surface area and weight
Reviewing the confidence intervals and prediction intervals, you can
see that they too yield similar results regardless of the model.

Homework

What happens to the model if "Pulse" is also included as a predictor?



Prodi S1 Ilmu Aktuaria

Variance Inflation Factor

Fevi Novkaniza

LINEAR MODEL

November 18, 2021



Detecting Multicollinearity Using VIF

Some of the common methods used for detecting multicollinearity include:
The analysis exhibits the signs of multicollinearity, such as estimates of the coefficients varying excessively from model to model.
The t-tests for each of the individual slopes are non-significant (P > 0.05), but the overall F-test that all of the slopes are simultaneously 0 is significant (P < 0.05).
The correlations among pairs of predictor variables are large.
Looking at correlations only among pairs of predictors is limiting, however: it is possible that the pairwise correlations are small and yet a linear dependence exists among three or even more variables.

What is a VIF?

A variance inflation factor (VIF) quantifies how much the variance is
inflated. But what variance?
Recall that we learned previously that the standard errors — and
hence the variances — of the estimated coefficients are inflated when
multicollinearity exists
A VIF exists for each of the predictors in a multiple regression model.
For example, the variance inflation factor for the estimated regression
coefficient bj (VIFj ) is just the factor by which the variance of bj is
”inflated” by the existence of correlation among the predictor
variables in the model.

What is a VIF?

In particular, the variance inflation factor for the jth predictor is:

VIFj = 1/(1 − Rj²)

where Rj² is the R²-value obtained by regressing the jth predictor on the remaining predictors. How do we interpret the variance inflation factors for a regression model?
A VIF of 1 means that there is no correlation among the jth predictor
and the remaining predictor variables, and hence the variance of bj is
not inflated at all
The general rule of thumb is that VIFs exceeding 4 warrant further
investigation, while VIFs exceeding 10 are signs of serious
multicollinearity requiring correction
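A minimal sketch of computing the VIFs with statsmodels (column names from the blood pressure data are assumed):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("bloodpress.txt", sep=r"\s+")
X = sm.add_constant(df[["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]])
for j, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, j))
```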

What is a VIF?

Let’s return to the blood pressure data (bloodpress.txt), in which researchers observed the following data on 20 individuals with high blood pressure:
blood pressure (y = BP, in mm Hg)
age (x1 = Age, in years)
weight (x2 = Weight, in kg)
body surface area (x3 = BSA, in sq m)
duration of hypertension (x4 = Dur, in years)
basal pulse (x5 = Pulse, in beats per minute)
stress index (x6 = Stress)

Correlation Matrix

Recall the following correlation matrix:

Some of the predictors are at least moderately marginally correlated. For example, body surface area (BSA) and weight are strongly correlated (r = 0.875), and weight and pulse are fairly strongly correlated (r = 0.659).
On the other hand, none of the pairwise correlations among age,
weight, duration and stress are particularly strong (r < 0.40 in each
case).
Model

Regressing y = BP on all six of the predictors, we obtain:

VIF

Three of the variance inflation factors (8.42, 5.33, and 4.41) are fairly large.
The VIF for the predictor Weight, for example, tells us that the
variance of the estimated coefficient of Weight is inflated by a factor
of 8.42 because Weight is highly correlated with at least one of the
other predictors in the model
Let’s verify the calculation of the VIF for the predictor Weight.
Regressing the predictor x2 = Weight on the remaining five predictors:

VIF

R²Weight is 88.12% or, in decimal form, 0.8812.

VIF

Therefore, the variance inflation factor for the estimated coefficient of Weight is, by definition:

VIFWeight = 1/(1 − 0.8812) = 8.42

Again, this variance inflation factor tells us that the variance of the
weight coefficient is inflated by a factor of 8.42 because Weight is
highly correlated with at least one of the other predictors in the
model.
So, what to do? One solution to dealing with multicollinearity is to
remove some of the violating predictors from the model.

VIF

If we review the pairwise correlations again, we see that the predictors Weight and BSA are highly correlated (r = 0.875).
We can choose to remove either predictor from the model. The
decision of which one to remove is often a scientific or practical one
For example, if the researchers here are interested in using their final
model to predict the blood pressure of future individuals, their choice
should be clear.
Which of the two measurements — body surface area or weight — do
you think would be easier to obtain?
If indeed weight is an easier measurement to obtain than body surface
area, then the researchers would be well-advised to remove BSA from
the model and leave Weight in the model

VIF

Reviewing again the above pairwise correlations, we see that the predictor Pulse also appears to exhibit fairly strong marginal correlations with several of the predictors, including Age (r = 0.619), Weight (r = 0.659) and Stress (r = 0.506).
Therefore, the researchers could also consider removing the predictor
Pulse from the model
Let’s see how the researchers would do. Regressing the response y =
BP on the four remaining predictors Age, Weight, Duration, and
Stress, we obtain:

VIF

The remaining variance inflation factors are quite satisfactory.
In terms of the adjusted R²-value, we did not seem to lose much by dropping the two predictors BSA and Pulse from our model: the adjusted R²-value decreased to only 98.97% from the original adjusted R²-value of 99.44%.

Reducing Data-based Multicollinearity

We should care about reducing multicollinearity because it all comes down to drawing conclusions about the population slope parameters.
If the variances of the estimated coefficients are inflated by
multicollinearity, then our confidence intervals for the slope
parameters are wider and therefore less useful.
Eliminating or even reducing the multicollinearity therefore yields
narrower, more useful confidence intervals for the slopes
One way of reducing data-based multicollinearity is to remove one or
more of the violating predictors from the regression model
Another way is to collect additional data under different experimental
or observational conditions

Example

Researchers running the Allen Cognitive Level (ACL) Study were interested in the relationship of ACL test scores to the level of psychopathology.
They therefore collected the following data on a set of 69 patients in
a hospital psychiatry unit:
Response y = ACL test score
X1 = vocabulary (Vocab) score on the Shipley Institute of Living Scale
X2 = abstraction (Abstract) score on the Shipley Institute of Living
Scale
X3 = score on the Symbol-Digit Modalities Test (SDMT)

Example

A very strong relationship (r = 0.99) exists between the two predictors.

Example

Regressing the response y = ACL on the predictors SDMT, Vocab, and Abstract, we obtain:

The VIFs for Vocab and Abstract are very large.

Example

What should we do about this? We could opt to remove one of the two predictors from the model.
Alternatively, if we have a good scientific reason for needing both of the predictors to remain in the model, we could go out and collect more data. Let’s try this second approach here.
Let’s imagine that we went out and collected more data, and in so doing, obtained the actual data collected on all 69 patients enrolled in the Allen Cognitive Level (ACL) Study. A matrix plot of the resulting data set:

Example

Pearson correlation of Vocab and Abstract = 0.698 (it is just a weaker correlation now).

Example

The round data points in blue represent the 23 data points in the original
data set, while the square red data points represent the 46 newly collected
data points.
Example

As you can see from the plot, collecting the additional data has expanded the "base" over which the "best fitting plane" will sit.
The existence of this larger base allows less room for the plane to tilt from sample to sample, and thereby reduces the variance of the estimated slope coefficients.
Let’s see if the addition of the new data helps to reduce the
multicollinearity here
Regressing the response y = ACL on the predictors SDMT, Vocab,
and Abstract:

Example

The researchers could now feel comfortable proceeding with drawing conclusions about the effects of the vocabulary and abstraction scores on the level of psychopathology.

One thing to keep in mind

One thing to keep in mind: in order to reduce the multicollinearity that exists, it is not sufficient to just go out and collect any old data.
The data have to be collected in such a way as to ensure that the correlations among the violating predictors are actually reduced.
That is, collecting more of the same kind of data won’t help to reduce the multicollinearity; the data have to be collected to ensure that the "base" is sufficiently enlarged.
Doing so, of course, changes the characteristics of the studied population, and therefore should be reported accordingly.

Reducing Structural Multicollinearity

Recall that structural multicollinearity is multicollinearity that is a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor x² from the predictor x.
Because of this, at the same time that we learn here about reducing structural multicollinearity, we learn more about polynomial regression models.

Example

"How is the amount of immunoglobin in blood (y) related to maximal oxygen uptake (x)?"
Because some researchers were interested in answering this research question, they collected the following data on a sample of 30 individuals:
yi = amount of immunoglobin in blood (mg) of individual i
xi = maximal oxygen uptake (ml/kg) of individual i

Example

The scatter plot of the resulting data suggests that there might be some
curvature to the trend in the data.


Example

If 0 is a possible x value, then b0 is the predicted response when x = 0. Otherwise, the interpretation of b0 is meaningless.
The estimated coefficient b1 is the estimated slope of the tangent line
at x = 0
The estimated coefficient b2 indicates the up/down direction of the
curve. That is:
if b2 < 0, then the curve is concave down
if b2 > 0, then the curve is concave up

Example

If we look at the output we obtain upon regressing the response y = igg on the predictors oxygen and oxygen2:

By the nature of the model, there is "structural multicollinearity."

Example

The neat thing here is that we can reduce the multicollinearity in our
data by doing what is known as ”centering the predictors.”
Centering a predictor merely entails subtracting the mean of the
predictor values in the data set from each predictor value.
For example, the mean of the oxygen values in our data set is 50.64:

Therefore, in order to center the predictor oxygen, we merely subtract 50.64 from each oxygen value in our data set. Doing so, we obtain the centered predictor:
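A minimal sketch of centering and refitting (the file name and the column names igg, oxygen are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("igg.csv")
df["oxcent"] = df["oxygen"] - df["oxygen"].mean()   # subtract the mean (50.64 here)
fit = smf.ols("igg ~ oxcent + I(oxcent**2)", data=df).fit()
print(df["oxcent"].corr(df["oxcent"] ** 2))         # far smaller than for raw oxygen
print(fit.summary())
```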

Example

The correlation has gone from r = 0.995 to a rather low r = 0.219.

Example

Having centered the predictor oxygen, we must reformulate our quadratic polynomial regression model accordingly.
That is, we now formulate our model as:

yi = β0* + β1*(xi − x̄) + β11*(xi − x̄)² + εi

or alternatively:

yi = β0* + β1*xi* + β11*(xi*)² + εi

where:
yi = amount of immunoglobin in blood (mg)
xi* = xi − x̄ denotes the centered predictor
and the error terms εi are independent, normally distributed and have equal variance σ².
Note that we add asterisks to each of the parameters in order to make it clear that the parameters differ from the parameters in the original model we formulated.
Example

Based on our original model, the variance inflation factors for oxygen and oxygensq were 99.9.
Now, regressing y = igg on the centered predictors oxcent and oxcentsq, we see that the VIFs have dropped significantly; now they are 1.05 in each case.

Because we reformulated our model based on the centered predictors, the meaning of the parameters must be changed accordingly. Now, the estimated coefficients tell us:
The estimated coefficient b0 is the predicted response y when the
predictor x equals the sample mean of the predictor values
The estimated coefficient b1 is the estimated slope of the tangent line
at the predictor mean — and, often, it is similar to the estimated
slope in the simple linear regression model
The estimated coefficient b2 indicates the up/down direction of curve.
That is:
if b2 < 0, then the curve is concave down
if b2 > 0, then the curve is concave up


So, here, in this example, the estimated coefficient b0 = 1632.3 tells us that a male whose maximal oxygen uptake is 50.64 ml/kg is predicted to have 1632.3 mg of immunoglobin in his blood.
The estimated coefficient b1 = 34.00 tells us that when an individual’s maximal oxygen uptake is near 50.64 ml/kg, we can expect the individual’s immunoglobin to increase by 34.00 mg for every 1 ml/kg increase in maximal oxygen uptake.


As the following plot of the estimated quadratic function suggests, the reformulated regression function appears to describe the trend in the data well. The adjusted R²-value is still 93.3%.

We shouldn’t be surprised to see that the estimates of the coefficients in our reformulated polynomial regression model are quite similar to the estimates of the coefficients for the simple linear regression model:

The estimated coefficient b1 = 34.00 for the polynomial regression model and b1 = 32.74 for the simple linear regression model.
The estimated coefficient b0 = 1632 for the polynomial regression model and b0 = 1558 for the simple linear regression model.
The similarities in the estimates, of course, arise from the fact that the predictors are nearly uncorrelated, and therefore the estimates of the coefficients don’t change all that much from model to model.


Prodi S1 Ilmu Aktuaria

Residual Analysis

Fevi Novkaniza

LINEAR MODEL

November 25, 2021

Fevi Novkaniza Residual Analysis 1


The basic idea of residual analysis
Prodi S1 Ilmu Aktuaria

Recall that not all of the data points in a sample will fall right on the
least squares regression line
The vertical distance between any one data point yi and its estimated
value ŷi is its observed ”residual”: ei = yi − ŷi
Each observed residual can be thought of as an estimate of the actual
unknown ”true error” term: εi = Yi − E(Yi)
The basic idea of residual analysis, therefore, is to investigate the
observed residuals to see if they behave “properly.”
We analyze the residuals to see if they support the assumptions of
linearity, independence, normality and equal variances.
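A minimal numpy sketch of these definitions, on made-up numbers (not the study data):

import numpy as np

# Toy data (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares fit; np.polyfit returns the slope first, then intercept.
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# Observed residuals e_i = y_i - yhat_i.
e = y - y_hat
print(e)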

Fevi Novkaniza Residual Analysis 2


Residuals vs. Fits Plot
Prodi S1 Ilmu Aktuaria

When conducting a residual analysis, a ”residuals versus fits plot” is
the most frequently created plot.
It is a scatter plot of residuals on the y axis and fitted values
(estimated responses) on the x axis
The plot is used to detect non-linearity, unequal error variances, and
outliers
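Continuing the toy sketch above, the plot itself might be drawn with matplotlib (axis labels are my choice):

import matplotlib.pyplot as plt

# Residuals vs. fits plot, reusing y_hat and e from the sketch above.
plt.scatter(y_hat, e)
plt.axhline(0, linestyle="--")  # the residual = 0 reference line
plt.xlabel("Fitted value")
plt.ylabel("Residual")
plt.show()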

Fevi Novkaniza Residual Analysis 3


Example
Prodi S1 Ilmu Aktuaria

Urbano-Marquez et al. (1989) were interested in determining whether or
not alcohol consumption was linearly related to muscle strength.
In the fitted line plot, the predicted response of the men whose alcohol
consumption is around 40 is about 14.
Fevi Novkaniza Residual Analysis 4
Example
Prodi S1 Ilmu Aktuaria

The plot suggests that there is a decreasing linear relationship
between alcohol and arm strength.
It also suggests that there are no unusual data points in the data set.
It illustrates that the variation around the estimated regression line is
constant suggesting that the assumption of equal error variances is
reasonable
Here is the corresponding residuals versus fits plot for the simple
linear regression model with arm strength as the response and level of
alcohol consumption as the predictor:

Fevi Novkaniza Residual Analysis 5


Prodi S1 Ilmu Aktuaria

The fitted value of the men whose alcohol consumption is around 40 is
about 14, and their deviation from the residual = 0 line shares the same
pattern as their deviation from the estimated regression line.
Any data point that falls directly on the estimated regression line has
a residual of 0. Therefore, the residual = 0 line corresponds to the
estimated regression line
Fevi Novkaniza Residual Analysis 6
Prodi S1 Ilmu Aktuaria

Here are the characteristics of a well-behaved residual vs. fits plot and
what they suggest about the appropriateness of the simple linear
regression model:
The residuals ”bounce randomly” around the 0 line. This suggests
that the assumption that the relationship is linear is reasonable
The residuals roughly form a ”horizontal band” around the 0 line.
This suggests that the variances of the error terms are equal
No one residual ”stands out” from the basic random pattern of
residuals. This suggests that there are no outliers

Fevi Novkaniza Residual Analysis 7


Residuals vs. Predictor Plot
Prodi S1 Ilmu Aktuaria

An alternative to the residuals vs. fits plot is a ”residuals vs. predictor


plot.”
The interpretation of a ”residuals vs. predictor plot” is identical to
that for a ”residuals vs. fits plot.”
That is, a well-behaved plot will bounce randomly and form a roughly
horizontal band around the residual = 0 line. And, no data points will
stand out from the basic random pattern of the other residuals.

Fevi Novkaniza Residual Analysis 8


Prodi S1 Ilmu Aktuaria

The residuals vs. predictor plot for the simple linear regression model with
arm strength as the response and level of alcohol consumption as the
predictor:

The residuals vs. predictor plot is just a mirror image of the residuals vs.
fits plot. The residuals vs. predictor plot offers no new information.
Fevi Novkaniza Residual Analysis 9
Identifying Specific Problems Using Residual Plots
Prodi S1 Ilmu Aktuaria

Specifically, we will investigate:


how a non-linear regression function shows up on a residuals vs. fits
plot
how unequal error variances show up on a residuals vs. fits plot
how an outlier shows up on a residuals vs. fits plot.

Fevi Novkaniza Residual Analysis 10


Prodi S1 Ilmu Aktuaria

”How does a non-linear regression function show up on a residuals vs.
fits plot?”
The residuals depart from 0 in some systematic manner, such as
being positive for small x values, negative for medium x values, and
positive again for large x values
Any systematic (non-random) pattern is sufficient to suggest that the
regression function is not linear.

Fevi Novkaniza Residual Analysis 11


Example
Prodi S1 Ilmu Aktuaria

The fitted line plot of the resulting data suggests that there is a
relationship between groove depth and mileage. The relationship is just
not linear. The corresponding residuals vs. fits plot accentuates this claim:

Fevi Novkaniza Residual Analysis 12


Example
Prodi S1 Ilmu Aktuaria

Note that the residuals depart from 0 in a systematic manner. They are
positive for small x values, negative for medium x values, and positive
again for large x values. Clearly, a non-linear model would better describe
the relationship between the two variables.
Fevi Novkaniza Residual Analysis 13
Prodi S1 Ilmu Aktuaria

Notice that the R² value is very high (95.26%).
This is an excellent example of the caution that ”a large R² value should
not be interpreted as meaning that the estimated regression line fits
the data well.”
The large R² value tells you that if you wanted to predict groove
depth, you’d be better off taking mileage into account than not.
The residuals vs. fits plot tells you, though, that your prediction
would be better if you formulated a non-linear model rather than a
linear one

Fevi Novkaniza Residual Analysis 14


Prodi S1 Ilmu Aktuaria

How does non-constant error variance show up on a residual vs. fits plot?
The Answer: Non-constant error variance shows up on a residuals vs.
fits (or predictor) plot in any of the following ways:
1. The plot has a ”fanning” effect. That is, the residuals are close to 0
for small x values and are more spread out for large x values.
2. The plot has a ”funneling” effect. That is, the residuals are spread out
for small x values and close to 0 for large x values.
3. Or, the spread of the residuals in the residuals vs. fits plot varies in
some complex fashion.
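If a numeric companion to this visual check is wanted, one option (my addition, not from the slides) is the Breusch-Pagan test in statsmodels; a small p-value is evidence against constant error variance. A minimal sketch on the toy data from earlier:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Toy data again (made up); the design matrix includes a constant column.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
results = sm.OLS(y, sm.add_constant(x)).fit()

# Breusch-Pagan test: a small p-value suggests non-constant variance.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(
    results.resid, results.model.exog)
print(lm_pvalue)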

Fevi Novkaniza Residual Analysis 15


Example
Prodi S1 Ilmu Aktuaria

To investigate the relationship between plutonium activity (x, in pCi/g)
and alpha count rate (y, in number per second), a study was conducted on
23 samples of plutonium. The following fitted line plot was obtained on
the resulting data:

Fevi Novkaniza Residual Analysis 16


Prodi S1 Ilmu Aktuaria

The plot suggests that there is a linear relationship between alpha
count rate and plutonium activity.
It also suggests that the error terms vary around the regression line in
a non-constant manner — as the plutonium level increases, not only
does the mean alpha count rate increase, but also the variance
increases
That is, the fitted line plot suggests that the assumption of equal
variances is violated

Fevi Novkaniza Residual Analysis 17


Example
Prodi S1 Ilmu Aktuaria

As is generally the case, the corresponding residuals vs. fits plot
accentuates this claim:

Note that the residuals ”fan out” from left to right rather than exhibiting
a consistent spread around the residual = 0 line. The residual vs. fits plot
suggests that the error variances are not equal.
Fevi Novkaniza Residual Analysis 18
Prodi S1 Ilmu Aktuaria

How does an outlier show up on a residuals vs. fits plot?
The observation’s residual stands apart from the basic random
pattern of the rest of the residuals
The random pattern of the residual plot can even disappear if one
outlier really deviates from the pattern of the rest of the data
An Example: Is there a relationship between tobacco use and alcohol use?
The British government regularly conducts surveys on household spending.
One such survey (Family Expenditure Survey, Department of Employment,
1981) determined the average weekly expenditure on tobacco (x, in British
pounds) and the average weekly expenditure on alcohol (y, in British
pounds) for households in n = 11 different regions in the United Kingdom.

Fevi Novkaniza Residual Analysis 19


Example
Prodi S1 Ilmu Aktuaria

The fitted line plot of the resulting data suggests that there is an
outlier, in the lower right corner of the plot,
which corresponds to the Northern Ireland region. In fact, the outlier is so
far removed from the pattern of the rest of the data that it appears to be
”pulling the line” in its direction.
Fevi Novkaniza Residual Analysis 20
Prodi S1 Ilmu Aktuaria

As is generally the case, the corresponding residuals vs. fits plot
accentuates this claim:

Note that Northern Ireland’s residual stands apart from the basic random
pattern of the rest of the residuals. That is, the residual vs. fits plot
suggests that an outlier exists.
Fevi Novkaniza Residual Analysis 21
Prodi S1 Ilmu Aktuaria

This is an excellent example of the caution that R² can be greatly
affected by just one data point.
Removing that one data point from the data set and refitting the
regression line, we obtain:
The R² value jumps from 5% to 61.5%. One data point can greatly
affect the value of R².
Fevi Novkaniza Residual Analysis 22
Prodi S1 Ilmu Aktuaria

How large does a residual have to be before a data point should be
flagged as an outlier?
We can make the residuals ”unitless” by dividing them by their
standard deviation
In this way we create what are called ”standardized residuals.”
They tell us how many standard deviations above — if positive — or
below — if negative — a data point is from the estimated regression
line
Any observations with a standardized residual greater than 2 or
smaller than -2 might be flagged for further investigation
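A hedged sketch of the computation with statsmodels, reusing the toy fit results from earlier; resid_studentized_internal is the internally studentized residual, which matches what Minitab reports as the standardized residual:

import numpy as np
from statsmodels.stats.outliers_influence import OLSInfluence

# 'results' is the fitted OLS model from the earlier sketch. Internally
# studentized residuals are e_i / (s * sqrt(1 - h_ii)).
std_resid = OLSInfluence(results).resid_studentized_internal

# Flag observations more than 2 standard deviations from the line.
print(np.flatnonzero(np.abs(std_resid) > 2))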

Fevi Novkaniza Residual Analysis 23


Prodi S1 Ilmu Aktuaria

The corresponding standardized residuals vs. fits plot for our expenditure
survey example looks like:

The standardized residual of the suspicious data point is smaller than
−2. The data point lies more than 2 standard deviations below its mean.
Since this is such a small data set, the data point should be flagged for
further investigation!
Fevi Novkaniza Residual Analysis 24
Prodi S1 Ilmu Aktuaria

Most statistical software identifies observations with large standardized
residuals. Here is what a portion of Minitab’s output for our expenditure
survey example looks like:

Minitab labels observations with large standardized residuals with an ”R.”
For our example, Minitab reports that observation 11 (for which tobacco
= 4.56 and alcohol = 4.02) has a large standardized residual (−2.58).
The data point has been flagged for further investigation.

Fevi Novkaniza Residual Analysis 25


Prodi S1 Ilmu Aktuaria

Recommended strategy, once you’ve identified a data point as being
unusual:
Determine whether a simple — and therefore correctable — mistake
was made in recording or entering the data point. Examples include
transcription errors (recording 62.1 instead of 26.1) or data entry
errors (entering 99.1 instead of 9.1). Correct any mistakes you find
Determine if the measurement was made in such a way that keeping
the experimental unit in the study can no longer be justified. Was
some procedure not conducted according to study guidelines? For
example, was a person’s blood pressure measured standing up rather
than sitting down? Was the measurement made on someone not in
the population of interest? For example, was the survey completed by
a man instead of a woman? If it is convincingly justifiable, remove
the data point from the data set.

Fevi Novkaniza Residual Analysis 26


Prodi S1 Ilmu Aktuaria

If the first two steps don’t resolve the problem, consider analyzing the
data twice — once with the data point included and once with the
data point excluded. Report the results of both analyses

Fevi Novkaniza Residual Analysis 27


Residuals vs. Order Plot
Prodi S1 Ilmu Aktuaria

We will learn how to use a ”residuals vs. order plot” as a way of
detecting a particular form of non-independence of the error terms,
namely serial correlation.
If the data are obtained in a time (or space) sequence, a residuals vs.
order plot helps to see if there is any correlation between the error
terms that are near each other in the sequence
The plot is only appropriate if you know the order in which the data
were collected!
What is this residuals vs. order plot all about? It is a scatter plot
with residuals on the y axis and the order in which the data were
collected on the x axis.
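As an illustration (reusing the residual vector e from the earlier toy sketch, and assuming the rows are already in collection order), the plot might be drawn as:

import matplotlib.pyplot as plt

# Residuals vs. order; connecting the dots with a line helps reveal
# serial patterns in the sequence.
order = range(1, len(e) + 1)
plt.plot(order, e, marker="o")
plt.axhline(0, linestyle="--")
plt.xlabel("Order of data collection")
plt.ylabel("Residual")
plt.show()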

Fevi Novkaniza Residual Analysis 28


Prodi S1 Ilmu Aktuaria

Here’s an example of a well-behaved residuals vs. order plot:
The residuals bounce randomly around the residual = 0 line, as we would
hope. In general, residuals exhibiting normal random noise around the
residual = 0 line suggest that there is no serial correlation.

Fevi Novkaniza Residual Analysis 29


Prodi S1 Ilmu Aktuaria

A residuals vs. order plot that exhibits a (positive) trend, as the
following plot does, suggests that some of the variation in the response
is due to time.
Therefore, it might be a good idea to add the predictor ”time” to the
model.
That is, you interpret this plot just as you would interpret any other
residual vs. predictor plot. It’s just that here your predictor is ”time.”
Fevi Novkaniza Residual Analysis 30
Positive serial correlation
Prodi S1 Ilmu Aktuaria

A residuals vs. order plot that looks like the following plot:

suggests that there is ”positive serial correlation” among the error terms.
That is, positive serial correlation exists when residuals tend to be followed,
in time, by residuals of the same sign and about the same magnitude. The
plot suggests that the assumption of independent error terms is violated.
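A common numeric companion to this plot (my addition, not from the slides) is the Durbin-Watson statistic, available in statsmodels:

from statsmodels.stats.stattools import durbin_watson

# 'results' as before, with rows in time order. Values near 2 suggest no
# serial correlation; well below 2, positive; well above 2, negative.
print(durbin_watson(results.resid))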

Fevi Novkaniza Residual Analysis 31


Negative serial correlation
Prodi S1 Ilmu Aktuaria

A residuals vs. order plot that looks like the following plot:

suggests that there is ”negative serial correlation” among the error terms.
Negative serial correlation exists when residuals of one sign tend to be
followed, in time, by residuals of the opposite sign. What? Can’t you see
it? If you connect the dots in order from left to right, you should be able
to see the pattern.
Fevi Novkaniza Residual Analysis 32
Prodi S1 Ilmu Aktuaria

Negative, positive, negative, positive, negative, positive, and so on


The plot suggests that the assumption of independent error terms is
violated
If you obtain a residuals vs. order plot that looks like this, you would
again be advised to move out of the realm of regression analysis and
into that of ”time series modeling.”

Fevi Novkaniza Residual Analysis 33
