Regression Analysis - STAT510
Because the lack of linearity dominates the plot, we have to fix the
non-linearity problem before we can assess the assumption of equal
variances.
The normal Q-Q probability plot suggests that the error terms are
not normal. Most residuals are around the line but there is one
Problems of the linear fit:
- The relationship between tree diameter and volume is not
linear.
- The error terms are not normally distributed.
Transformation:
- First, let’s try only taking the natural logarithm of the tree
diameters to obtain the new predictor x = lnDiam to correct
non-linearity.
- The curvature pattern still exists, i.e., the points do not follow
a linear pattern after log-transforming x.
The residuals vs fit plot also shows that the relationship is still not
linear.
The normal Q-Q probability plot suggests that the non-normality
problem still exists.
Recall that problems of the linear fit regressing Y (Volume) on x
(Diameter) are:
- The relationship between tree diameter and volume is not
linear.
- The error terms are not normally distributed.
Now we know that log-transforming x alone does not work.
- Let’s take the natural logarithm of the tree diameters to obtain
the new predictor x = lnDiam to correct non-linearity.
- Let’s take the natural logarithm of the tree volumes to obtain
the new response Y = lnVol to correct non-normality.
Log-transforming Y may also help with the non-linearity.
The relationship between the natural log of the diameter and the
natural log of the volume looks linear and strong (R² = 97.4%).
The residuals vs fit plot provides yet more evidence of a linear
relationship between lnVol and lnDiam.
Now the relationship appears to be linear and the error terms appear
independent and normally distributed with equal variances.
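The comparison of the three fits (no transformation, log x only, log of both x and Y) can be sketched in R. The shortleaf measurements are not reproduced here, so this sketch uses simulated stand-in data; the diameter range of 4 to 23 inches and the generating coefficients are assumptions chosen to resemble the fit reported in these slides.

```r
# Simulated stand-in for the shortleaf data (NOT the real measurements).
set.seed(510)
n    <- 70
Diam <- runif(n, min = 4, max = 23)                      # assumed diameter range
Vol  <- exp(-2.87 + 2.56 * log(Diam) + rnorm(n, sd = 0.17))

fit_raw    <- lm(Vol ~ Diam)              # no transformation
fit_logx   <- lm(Vol ~ log(Diam))         # log-transform x only
fit_loglog <- lm(log(Vol) ~ log(Diam))    # log-transform both x and Y

# R-squared of each fit. The three values are not strictly comparable,
# because the last model has a different response (ln Vol instead of Vol).
r2 <- c(raw    = summary(fit_raw)$r.squared,
        logx   = summary(fit_logx)$r.squared,
        loglog = summary(fit_loglog)$r.squared)
print(round(r2, 3))
```

Calling plot(fit_loglog, which = 1:2) afterwards draws the residuals vs fit plot and the normal Q-Q plot for the log-log fit.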
Using the Model
Question 1: What is the nature of the association between diameter and
volume of shortleaf pines?
As in the previous examples, we describe the association on the transformed scale:
the natural logarithm of tree volume is positively linearly related to the natural
logarithm of tree diameter.
Question 2: Is there an association between diameter and volume of shortleaf
pines?
As in the previous examples, we simply test the null hypothesis
H0: β1 = 0 using either the F-test or the equivalent t-test.
summary(fit)
##
## Call:
## lm(formula = lny ~ lnx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3323 -0.1131 0.0267 0.1177 0.4280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.8718 0.1216 -23.63 <2e-16 ***
## lnx 2.5644 0.0512 50.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1703 on 68 degrees of freedom
## Multiple R-squared: 0.9736, Adjusted R-squared: 0.9732
## F-statistic: 2509 on 1 and 68 DF, p-value: < 2.2e-16
anova(fit)
Since the p-value is essentially 0 (< 2.2e-16), we reject H0 and conclude that there
is a linear association between the natural logarithm of tree volume and the natural
logarithm of tree diameter at any conventional α level, such as 0.05 or 0.01.
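The equivalence of the F-test and the t-test can be checked directly from the numbers in the summary(fit) output. This is a small sketch that only uses the reported t value and residual degrees of freedom; the variable names are illustrative.

```r
# Values copied from the summary(fit) output.
t_value <- 50.09   # t statistic for the slope (lnx)
df_res  <- 68      # residual degrees of freedom

# In simple linear regression the F statistic equals the squared t statistic.
f_value <- t_value^2
print(f_value)     # close to the reported F-statistic of 2509

# Two-sided t-test p-value and F-test p-value for H0: beta1 = 0.
p_t <- 2 * pt(abs(t_value), df = df_res, lower.tail = FALSE)
p_f <- pf(f_value, df1 = 1, df2 = df_res, lower.tail = FALSE)
print(c(t_test = p_t, F_test = p_f))
```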
How Do We Estimate E(Yh)?
Question 3: What is the "average" volume of all shortleaf pine trees that are 10"
in diameter?
For this question, we need a 95% confidence interval for the average of the natural log
of the volumes of all 10" diameter shortleaf pines.
fit$call
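Exponentiating the endpoints of that interval returns us to the original scale, where
the interval describes the median rather than the mean. Therefore, we can be 95%
confident that the median volume of all shortleaf pines, 10" in diameter, is between
19.9 and 21.6 cubic feet.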
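One way to obtain such an interval in R is predict() with interval = "confidence", followed by exp() to return to cubic feet. The sketch below assumes simulated stand-in data (the real shortleaf measurements are not reproduced here), so its numbers will not match the 19.9 to 21.6 interval exactly; the names new_tree, ci_log and ci_vol are illustrative.

```r
# Simulated stand-in for the shortleaf data (NOT the real measurements).
set.seed(510)
n   <- 70
lnx <- log(runif(n, min = 4, max = 23))           # assumed diameter range
lny <- -2.87 + 2.56 * lnx + rnorm(n, sd = 0.17)   # simulated log-volumes
fit <- lm(lny ~ lnx)

# 95% confidence interval for the mean of ln(volume) at a 10-inch diameter.
new_tree <- data.frame(lnx = log(10))
ci_log   <- predict(fit, newdata = new_tree,
                    interval = "confidence", level = 0.95)

# Exponentiate to return to cubic feet; the result is an interval for the
# median (not the mean) volume on the original scale.
ci_vol <- exp(ci_log)
print(ci_log)
print(ci_vol)
```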
Using the 95% Confidence Interval for β1
##
## Call:
## lm(formula = lny ~ lnx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3323 -0.1131 0.0267 0.1177 0.4280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.8718 0.1216 -23.63 <2e-16 ***
## lnx 2.5644 0.0512 50.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1703 on 68 degrees of freedom
## Multiple R-squared: 0.9736, Adjusted R-squared: 0.9732
## F-statistic: 2509 on 1 and 68 DF, p-value: < 2.2e-16
A 95% confidence interval for β1 is
2.5644 ± t(0.975; 68) × 0.0512 ≈ (2.46, 2.67).
Since 2^2.46 ≈ 5.50 and 2^2.67 ≈ 6.36, we can be 95% confident that the median
volume will increase by a factor between 5.50 and 6.36 for each two-fold increase
in diameter.
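The arithmetic behind this interval can be reproduced from the estimate and standard error in the summary output. This is a minimal sketch using only those reported values; because it carries unrounded endpoints through the calculation, its last digit can differ slightly from the rounded figures quoted in the text.

```r
# Values copied from the summary(fit) output.
b1     <- 2.5644   # slope estimate for lnx
se_b1  <- 0.0512   # standard error of the slope
df_res <- 68       # residual degrees of freedom

# 95% confidence interval for beta1.
t_crit <- qt(0.975, df_res)
ci_b1  <- b1 + c(-1, 1) * t_crit * se_b1
print(round(ci_b1, 2))

# Doubling the diameter multiplies the median volume by 2^beta1,
# so the interval for the multiplicative factor is 2^(interval for beta1).
ci_factor <- 2^ci_b1
print(round(ci_factor, 2))
```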
What to Do When It Is Difficult to Determine Which
Transformation on Y to Use?
Y_i^λ = β0 + β1 x_i + ε_i
- Now let’s use a plasma data set to illustrate how to use the Box-Cox
transformation. This data set was collected on 25 healthy children.
- The predictor is age (x) and the response is plasma level of a
polyamine (Y).
First, let’s try a linear fit without any transformation. Both non-constant
variance and non-linearity are visible. For this fit, R² = 0.7532.
Now let’s check the “residuals vs fit plot”, which confirms the
non-constant variance and non-linearity problems.
The normal Q-Q probability plot has the “classical” pattern seen when there is
an outlier. This is an example of how a poor fit can make a data point appear
to be an outlier.
Now let’s try a log transformation of Y, which is a remedy for the non-constant
variance problem, and regress lnY on x. The fit is better now, with
R² = 0.8535.
The “residuals vs fit plot” also shows some progress, but a slight curvature
pattern remains.
There is still an outlier in the normal Q-Q probability plot and the
normality assumption is not met.
Can We Find a Better Transformation of Y?
Now let’s try the Box-Cox transformation to find the “best” λ and hence the transformed
response Y^λ. We can search for λ on [−1, 1].
library(MASS)  # assumed source of boxcox(); the call signature matches MASS::boxcox
boxcox.trans = boxcox(y ~ x, data = Plasma, lambda = seq(-1, 1, length.out = 10))
[Figure: Box-Cox log-likelihood plotted against λ, with the 95% confidence interval marked.]
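The “best” λ and its approximate 95% interval can also be read off numerically from the object that boxcox() returns, rather than from the plot. The sketch below assumes MASS::boxcox and uses simulated stand-in data, since the plasma measurements are not reproduced here; Plasma_sim, its age design (five children at each of ages 0 to 4) and the generating coefficients are hypothetical.

```r
library(MASS)  # provides boxcox(); MASS ships with standard R installations

# Simulated stand-in for the plasma data (NOT the real measurements):
# built so that y^(-0.5) is roughly linear in age.
set.seed(510)
Plasma_sim   <- data.frame(x = rep(0:4, each = 5))
Plasma_sim$y <- (0.27 + 0.04 * Plasma_sim$x + rnorm(25, sd = 0.02))^(-2)

# Profile log-likelihood on a grid of lambda values, without drawing the plot.
bc <- boxcox(y ~ x, data = Plasma_sim,
             lambda = seq(-1, 1, by = 0.1), plotit = FALSE)

# Lambda that maximizes the log-likelihood over the grid.
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum.
in_ci <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
print(lambda_hat)
print(range(bc$x[in_ci]))
```

In practice a convenient value inside the interval (for example −0.5, 0 or 0.5) is often preferred over the exact maximizer, because it is easier to interpret.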