Fevi Novkaniza
LINEAR MODEL
E(y) = β0 + β1x1 + ... + βkxk
to the data, analyzed the results, and reached the conclusion that
none of the independent variables was ‘‘significantly related’’ to y.
The goodness of fit of the model, measured by the coefficient of
determination R 2 , was not particularly good, and t tests on individual
parameters did not lead to rejection of the null hypotheses that these
parameters equaled 0.
If we hold all the other independent variables constant and vary only x1, E(y) will increase by the amount β1 for every unit increase in x1.
A 1-unit change in any of the other independent variables will increase E(y) by the value of the corresponding β parameter for that variable.
E(y) = β0 + β1x1 + β2x2 + ... + βkxk
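This interpretation can be checked numerically. A minimal sketch with two predictors; the coefficient values are hypothetical, chosen only for illustration:

```python
# Mean response for a two-predictor linear model, E(y) = b0 + b1*x1 + b2*x2.
# The coefficients below are hypothetical, chosen only for illustration.
b0, b1, b2 = 2.0, 0.75, -1.5

def mean_response(x1, x2):
    return b0 + b1 * x1 + b2 * x2

# Holding x2 fixed, a 1-unit increase in x1 changes E(y) by exactly b1.
delta = mean_response(x1=4.0, x2=3.0) - mean_response(x1=3.0, x2=3.0)
```

Holding the other predictors fixed, the change in the mean response equals the corresponding β, here 0.75.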
The regression of tool wear (Y) on tool speed X1 and tool model
(qualitative: M1, M2, M3, M4)
Yi = β0 + β1Xi1 + β2Xi2 + β3Xi3 + β4Xi4 + εi, where Xi1 is tool speed and Xi2, Xi3, Xi4 are indicator (dummy) variables for tool models M2, M3, and M4, with M1 as the baseline.
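The four-level qualitative predictor enters the model through three indicator variables, with one level as the baseline. A minimal sketch of this encoding (the function name and the choice of M1 as baseline are assumptions for illustration):

```python
def dummy_code(model):
    """Encode tool model (M1..M4) as three indicator variables for M2, M3, M4.

    M1 is the baseline level, so it maps to all zeros.
    """
    levels = ["M2", "M3", "M4"]
    return [1 if model == lev else 0 for lev in levels]
```

With this coding, each βj for a dummy variable measures the shift in mean tool wear of model Mj relative to the baseline M1.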
November 4, 2021
Using the dataset, we fit a multiple regression model with an interaction term using just two predictors (MPG and Cylinders).
We can see that the p-value for the interaction term is very small, < 2 × 10−16, leading to rejection of the hypothesis that the interaction term is negligible; thus we conclude that the interaction between MPG and Cylinders is significant and should be included in the model.
So, how do we interpret the result?
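With an interaction term, the effect of a 1-unit change in MPG depends on the level of Cylinders: the slope in MPG is β1 + β3·Cyl. A minimal sketch with hypothetical coefficient values (all numbers below are assumptions for illustration, not the fitted values):

```python
# Hypothetical coefficients for E(y) = b0 + b1*MPG + b2*Cyl + b3*MPG*Cyl.
b0, b1, b2, b3 = 10.0, -1.0, 0.5, 0.25

def mean_response(mpg, cyl):
    return b0 + b1 * mpg + b2 * cyl + b3 * mpg * cyl

def slope_in_mpg(cyl):
    # Effect of a 1-unit increase in MPG at a fixed number of cylinders.
    return mean_response(21.0, cyl) - mean_response(20.0, cyl)
```

The MPG effect changes with the number of cylinders (here β1 + β3·Cyl), which is exactly what a significant interaction means: the two predictors cannot be interpreted separately.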
When a > 0 the curve is concave up, and concave down when a < 0. This idea applies to a linear model when the effect of changes in a predictor variable is not constant.
Example of an interaction model with quantitative predictors
Prodi S1 Ilmu Aktuaria
E(y) = β0 + β1x + β2x²
The model is useful, since the F statistic for testing the hypothesis H0 : β1 = β2 = β3 = 0 is 1.018e+04, which falls in the rejection region, with a p-value < 2.2e−16 ≈ 0.
Moreover, the quadratic term is important and should be included in the model, since the partial t-test of H0 : β2 = 0 against the alternative hypothesis H1 : β2 ≠ 0 gave a p-value < 2e−16 ≈ 0, indicating the significant role of the quadratic term.
Here a is the initial value of MPG. The fitted regression also yields a high R²: 98.12% of the variability of GallonsPer100Miles can be explained by MPG using the quadratic regression model.
To visually check how close the predicted values of GallonsPer100Miles are to the actual values, we produce the plot below. As the figure shows, the fitted line (blue) is close to most of the data points, confirming the high R².
E(GP100M) = β0 + β1MPG + β2Cyl + β3MPG·Cyl + β4MPG² + β5Cyl²
The critical value at α = 0.05 with 3 and 386 degrees of freedom for the numerator and denominator, respectively, is approximately 2.63. Since Fdata = 799.845 > F0.05 ≈ 2.63, H0 is rejected.
Thus, we conclude that at least one of the second order terms should
be included in the model.
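This test of the second-order terms is a partial (nested-model) F test, which can be computed from the R² values of the full and reduced models. A sketch with made-up R² values (the actual fitted values are not reproduced here):

```python
def partial_f(r2_full, r2_reduced, q, n, p_full):
    """Partial F statistic for dropping q terms from a model with
    p_full parameters (including the intercept), fit on n observations."""
    num = (r2_full - r2_reduced) / q
    den = (1.0 - r2_full) / (n - p_full)
    return num / den

# Hypothetical values for illustration only.
f_stat = partial_f(r2_full=0.9, r2_reduced=0.6, q=3, n=390, p_full=6)
# Compare f_stat with the F(q, n - p_full) critical value at the chosen alpha.
```

H0 (all q dropped coefficients equal 0) is rejected when the statistic exceeds the F critical value with q and n − p_full degrees of freedom.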
Regression Pitfalls
We’ll look at some of the main things that can go wrong with a multiple linear regression model. We’ll also consider methods for overcoming some of these pitfalls:
1. Observation vs. experimentation
2. Multicollinearity
3. Outliers & influential observations
4. Overfitting
5. Excluding important predictor variables
6. Extrapolation
7. Missing data
It appears that, when predictors are highly correlated, the answers you get depend on which predictors are in the model. The high correlation between the two predictors is what causes the large discrepancy.
When interpreting b3 = 34.4 in the model that excludes x2 = Weight, keep in mind that when x3 = BSA increases, x2 = Weight also increases, and both factors are associated with increased blood pressure.
However, when interpreting b3 = 5.83 in the model that includes x2 = Weight, we hold x2 = Weight fixed, so the resulting increase in blood pressure is much smaller.
The standard error for the estimated slope b2 obtained from the model including both x2 = Weight and x3 = BSA is about double the standard error for the estimated slope b2 obtained from the model including only x2 = Weight.
Effect #2
The standard error for the estimated slope b3 obtained from the
model including both x2 = Weight and x3 = BSA is about 30%
larger than the standard error for the estimated slope b3 obtained
from the model including only x3 = BSA.
What is the major implication of these increased standard errors?
Recall that the standard errors are used in the calculation of the
confidence intervals for the slope parameters.
That is, increased standard errors of the estimated slopes lead to
wider confidence intervals, and hence less precise estimates of the
slope parameters.
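The link between a standard error and the resulting confidence interval is direct: the interval is b ± t*·se, so its width scales linearly with the standard error. A minimal sketch (the multiplier 1.96 is a large-sample approximation, assumed for illustration):

```python
def ci_width(se, t_crit=1.96):
    # Width of the confidence interval b +/- t_crit * se.
    return 2.0 * t_crit * se

# Doubling the standard error doubles the confidence interval width,
# i.e. halves the precision of the slope estimate.
```

This is why the inflated standard errors caused by multicollinearity translate directly into less precise slope estimates.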
In particular, the variance inflation factor for the jth predictor is:
VIFj = 1 / (1 − Rj²)
where Rj² is the R² value obtained by regressing the jth predictor on the remaining predictors. How do we interpret the variance inflation factors for a regression model?
A VIF of 1 means that there is no correlation between the jth predictor and the remaining predictor variables, and hence the variance of bj is not inflated at all.
The general rule of thumb is that VIFs exceeding 4 warrant further
investigation, while VIFs exceeding 10 are signs of serious
multicollinearity requiring correction
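With just two predictors, Rj² is the squared correlation between them, so the VIF has the closed form 1/(1 − r²). A sketch on a small made-up sample of two nearly collinear predictors:

```python
def corr(u, v):
    # Pearson correlation coefficient.
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

# Two nearly collinear predictors (made-up values).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 7.0]

r = corr(x1, x2)
vif = 1.0 / (1.0 - r ** 2)  # far beyond the rule-of-thumb cutoffs of 4 and 10
```

Even this mild departure from exact collinearity produces a VIF of 49, well past both thresholds; as the predictors become exactly collinear the VIF grows without bound.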
Three of the variance inflation factors (8.42, 5.33, and 4.41) are fairly large.
The VIF for the predictor Weight, for example, tells us that the
variance of the estimated coefficient of Weight is inflated by a factor
of 8.42 because Weight is highly correlated with at least one of the
other predictors in the model
Let’s verify the calculation of the VIF for the predictor Weight.
Regressing the predictor x2 = Weight on the remaining five predictors:
R²Weight is 88.12% or, in decimal form, 0.8812.
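Plugging this R² into the VIF formula reproduces the reported value:

```python
r2_weight = 0.8812  # R^2 from regressing Weight on the other five predictors
vif_weight = 1.0 / (1.0 - r2_weight)
# 1 / 0.1188 is approximately 8.42, matching the reported VIF for Weight.
```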
Again, this variance inflation factor tells us that the variance of the
weight coefficient is inflated by a factor of 8.42 because Weight is
highly correlated with at least one of the other predictors in the
model.
So, what to do? One solution to dealing with multicollinearity is to
remove some of the violating predictors from the model.
The round data points in blue represent the 23 data points in the original
data set, while the square red data points represent the 46 newly collected
data points.
As you can see from the plot, collecting the additional data has
expanded the ”base” over which the ”best fitting plane” will sit
The existence of this larger base allows less room for the plane to tilt
from sample to sample, and thereby reduces the variance of the
estimated slope coefficients
Let’s see if the addition of the new data helps to reduce the
multicollinearity here
Regressing the response y = ACL on the predictors SDMT, Vocab,
and Abstract:
The scatter plot of the resulting data suggests that there might be some
curvature to the trend in the data.
The neat thing here is that we can reduce the multicollinearity in our
data by doing what is known as ”centering the predictors.”
Centering a predictor merely entails subtracting the mean of the
predictor values in the data set from each predictor value.
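The effect of centering is easy to demonstrate with a quadratic term: x and x² are strongly correlated, but after centering, (x − x̄) and (x − x̄)² are uncorrelated for a symmetric sample. A sketch on made-up evenly spaced values:

```python
def corr(u, v):
    # Pearson correlation coefficient.
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

x = [float(v) for v in range(1, 11)]  # made-up predictor values
x_sq = [v * v for v in x]

xbar = sum(x) / len(x)
xc = [v - xbar for v in x]            # centered predictor
xc_sq = [v * v for v in xc]

r_raw = corr(x, x_sq)                 # strong correlation between x and x^2
r_centered = corr(xc, xc_sq)          # essentially zero after centering
```

The raw predictor and its square are almost collinear, while the centered versions are uncorrelated here, which is exactly why centering drives the VIFs toward 1.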
For example, the mean of the oxygen values in our data set is 50.64. After centering each predictor, we see that the VIFs have dropped significantly: now they are 1.05 in each case.
Residual Analysis
Recall that not all of the data points in a sample will fall right on the
least squares regression line
The vertical distance between any one data point yi and its estimated
value ŷi is its observed ”residual”: ei = yi − ŷi
Each observed residual can be thought of as an estimate of the actual unknown ”true error” term: εi = Yi − E(Yi)
The basic idea of residual analysis, therefore, is to investigate the
observed residuals to see if they behave “properly.”
We analyze the residuals to see if they support the assumptions of
linearity, independence, normality and equal variances.
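The residuals come straight from the fitted line. A minimal sketch of simple linear regression by least squares on made-up data, illustrating that the observed residuals ei = yi − ŷi always sum to zero when the model includes an intercept:

```python
# Made-up sample.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least squares slope and intercept for the simple linear model.
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
b1 = sxy / sxx
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * a for a in x]
residuals = [b - f for b, f in zip(y, fitted)]  # e_i = y_i - yhat_i
```

Residual analysis then asks whether these ei behave like the unobservable true errors are assumed to: independent, normal, and with constant variance around zero.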
Their fitted value is about 14 and their deviation from the residual=0
line shares the same pattern as their deviation from the estimated
regression line
Any data point that falls directly on the estimated regression line has
a residual of 0. Therefore, the residual = 0 line corresponds to the
estimated regression line
Here are the characteristics of a well-behaved residual vs. fits plot and
what they suggest about the appropriateness of the simple linear
regression model:
The residuals ”bounce randomly” around the 0 line. This suggests
that the assumption that the relationship is linear is reasonable
The residuals roughly form a ”horizontal band” around the 0 line.
This suggests that the variances of the error terms are equal
No one residual ”stands out” from the basic random pattern of
residuals. This suggests that there are no outliers
The residuals vs. predictor plot for the simple linear regression model with
arm strength as the response and level of alcohol consumption as the
predictor:
The residuals vs. predictor plot is just a mirror image of the residuals vs.
fits plot. The residuals vs. predictor plot offers no new information.
Identifying Specific Problems Using Residual Plots
The fitted line plot of the resulting data suggests that there is a
relationship between groove depth and mileage. The relationship is just
not linear. The corresponding residuals vs. fits plot accentuates this claim:
Note that the residuals depart from 0 in a systematic manner. They are
positive for small x values, negative for medium x values, and positive
again for large x values. Clearly, a non-linear model would better describe
the relationship between the two variables.
How does non-constant error variance show up on a residual vs. fits plot?
The Answer: Non-constant error variance shows up on a residuals vs.
fits (or predictor) plot in any of the following ways:
1. The plot has a ”fanning” effect: the residuals are close to 0 for small x values and more spread out for large x values.
2. The plot has a ”funneling” effect: the residuals are spread out for small x values and close to 0 for large x values.
3. Or, the spread of the residuals in the residuals vs. fits plot varies in some complex fashion.
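A crude numeric check for a ”fanning” pattern is to compare the residual spread at small versus large fitted values. A sketch on made-up residuals whose spread grows from left to right (the factor-of-2 cutoff is an arbitrary illustration, not a formal test):

```python
# Residuals ordered by fitted value, with spread increasing to the right
# (made-up values exhibiting a "fanning" pattern).
residuals = [0.1, -0.1, 0.2, -0.2, 1.0, -1.0, 2.0, -2.0]

half = len(residuals) // 2
low, high = residuals[:half], residuals[half:]

def spread(e):
    # Root mean square around zero (these residuals have mean 0 by construction).
    return (sum(v * v for v in e) / len(e)) ** 0.5

# A large growth in spread suggests non-constant error variance.
fanning = spread(high) > 2 * spread(low)
```

Formal alternatives to this eyeball comparison include the Breusch-Pagan and White tests for heteroscedasticity.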
Note that the residuals ”fan out” from left to right rather than exhibiting
a consistent spread around the residual = 0 line. The residual vs. fits plot
suggests that the error variances are not equal.
The plot suggests that there is an outlier, in the lower right corner of the plot, which corresponds to the Northern Ireland region. In fact, the outlier is so far removed from the pattern of the rest of the data that it appears to be ”pulling the line” in its direction.
Note that Northern Ireland’s residual stands apart from the basic random
pattern of the rest of the residuals. That is, the residual vs. fits plot
suggests that an outlier exists.
The R² value has jumped from 5% to 61.5%. One data point can greatly affect the value of R².
The corresponding standardized residuals vs. fits plot for our expenditure
survey example looks like:
If the first two steps don’t resolve the problem, consider analyzing the
data twice — once with the data point included and once with the
data point excluded. Report the results of both analyses
A residuals vs. order plot that looks like the following plot:
suggests that there is ”positive serial correlation” among the error terms.
That is, positive serial correlation exists when residuals tend to be followed,
in time, by residuals of the same sign and about the same magnitude. The
plot suggests that the assumption of independent error terms is violated.
A residuals vs. order plot that looks like the following plot:
suggests that there is ”negative serial correlation” among the error terms.
Negative serial correlation exists when residuals of one sign tend to be
followed, in time, by residuals of the opposite sign. What? Can’t you see
it? If you connect the dots in order from left to right, you should be able
to see the pattern.
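Serial correlation can be quantified with the lag-1 autocorrelation of the residuals: positive for runs of same-signed residuals, negative for alternating signs. A sketch on two made-up residual sequences:

```python
def lag1_autocorr(e):
    """Lag-1 autocorrelation of a zero-mean residual sequence."""
    num = sum(a * b for a, b in zip(e, e[1:]))
    den = sum(v * v for v in e)
    return num / den

# Runs of same-signed residuals: positive serial correlation.
runs = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
# Alternating signs: negative serial correlation.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

r_runs = lag1_autocorr(runs)        # positive
r_alt = lag1_autocorr(alternating)  # negative
```

The Durbin-Watson statistic used in formal tests of independence is built on essentially this quantity.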