
Regression Analysis - STAT510

Dr. Xiyue Liao

Department of Mathematics and Statistics

California State University, Long Beach


What Transformation do We Use when “Everything”
Seems Wrong?

Now we will learn how to build and use a model by transforming
both the response Y and the predictor x values.
- To correct non-normality and/or unequal variances, we
  transform the Y values, which may also help with the non-linearity.
- To correct non-linearity, we transform the x values.
- We have a data set of 70 shortleaf pine trees. The goal is to learn about the
  relationship between the volume Y (in cubic feet) and the diameter x (in
  inches) of shortleaf pines.
- Although the $R^2$ value is quite high (89.3%), the fitted line plot suggests that
  the relationship between tree volume and tree diameter is not linear.
The residuals vs fit plot also suggests that the relationship is not linear.

Because the lack of linearity dominates the plot, we have to fix the
non-linearity problem before we can assess the assumption of equal
variances.
The normal Q-Q probability plot suggests that the error terms are
not normal. Most residuals are around the line but there is one
Problems of the linear fit:
- The relationship between tree diameter and volume is not
  linear.
- The error terms are not normally distributed.
Transformation:
- First, let’s try taking only the natural logarithm of the tree
  diameters to obtain the new predictor x = lnDiam to correct
  non-linearity.
- The curvature pattern still exists, i.e., the points do not follow
  a linear pattern after log-transforming x.
The residuals vs fit plot also shows that the relationship is still not
linear.
The normal Q-Q probability plot suggests that the non-normality
problem still exists.
Recall that problems of the linear fit regressing Y (Volume) on x
(Diameter) are:
- The relationship between tree diameter and volume is not
  linear.
- The error terms are not normally distributed.
Now we know that log-transforming only x does not work.
- Let’s take the natural logarithm of the tree diameters to obtain
  the new predictor x = lnDiam to correct non-linearity.
- Let’s take the natural logarithm of the tree volumes to obtain
  the new response Y = lnVol to correct non-normality.
Log-transforming Y may also help with the non-linearity.
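The two transformations above can be sketched in R as follows. This is only an illustration: the pine data are not reproduced here, so `diam` and `vol` are simulated stand-ins (hypothetical names and values), while `lnx`, `lny`, and `fit` follow the names used in the R output later in these notes.

```r
# Hypothetical data standing in for the 70 shortleaf pines:
# volume is a power of diameter with multiplicative error (made-up values).
set.seed(1)
diam <- runif(70, min = 4, max = 24)                         # diameters in inches
vol  <- exp(-2.9 + 2.6 * log(diam) + rnorm(70, sd = 0.17))   # volumes in cubic feet

lnx <- log(diam)   # transformed predictor, lnDiam
lny <- log(vol)    # transformed response, lnVol

fit_raw <- lm(vol ~ diam)   # untransformed fit, kept for comparison
fit     <- lm(lny ~ lnx)    # log-log fit

# R-squared of each fit (measured on different response scales)
summary(fit_raw)$r.squared
summary(fit)$r.squared
```

Calling `plot(fit, which = 1:2)` afterwards draws the residuals vs fit plot and the normal Q-Q plot for the log-log fit.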
The relationship between the natural log of the diameter and the
natural log of the volume looks linear and strong ($R^2$ = 97.4%).
The residuals vs fit plot provides yet more evidence of a linear
relationship between lnVol and lnDiam.

Some “funneling” seems to exist, but it doesn’t appear to
be too severe.
The normal Q-Q probability plot has improved substantially.

Now the relationship appears to be linear and the error terms appear
independent and normally distributed with equal variances.
Using the Model
Question 1: What is the nature of the association between diameter and
volume of shortleaf pines?

As in the previous examples, we describe the association on the transformed scale:
the natural logarithm of tree volume is positively linearly related to the natural
logarithm of tree diameter.
Question 2: Is there an association between diameter and volume of shortleaf
pines?

As in the previous examples, we merely test the null hypothesis
H0: β1 = 0 using either the F-test or the equivalent t-test.
summary(fit)

##
## Call:
## lm(formula = lny ~ lnx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3323 -0.1131 0.0267 0.1177 0.4280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.8718 0.1216 -23.63 <2e-16 ***
## lnx 2.5644 0.0512 50.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1703 on 68 degrees of freedom
## Multiple R-squared: 0.9736, Adjusted R-squared: 0.9732
## F-statistic: 2509 on 1 and 68 DF, p-value: < 2.2e-16
anova(fit)

## Analysis of Variance Table
##
## Response: lny
## Df Sum Sq Mean Sq F value Pr(>F)
## lnx 1 72.734 72.734 2509 < 2.2e-16 ***
## Residuals 68 1.971 0.029
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is very close to 0, we conclude that there is a linear association
between the natural logarithm of tree volume and the natural logarithm of tree
diameter at any conventional significance level α, such as 0.05 or 0.01.
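The equivalence of the t-test and the F-test in simple linear regression can be checked directly from the printed output: the square of the t value for lnx should match the F value up to rounding. A small check, using only the numbers shown above (the variable names are ours):

```r
t_value <- 50.09   # t value for lnx, from summary(fit)
f_value <- 2509    # F value, from anova(fit)

t_value^2          # roughly equal to f_value, up to rounding in the printout
```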
How Do We Estimate $E(Y_h)$?
Question 3: What is the "average" volume of all shortleaf pine trees that are 10"
in diameter?

For this question, we need a 95% confidence interval for the average of the natural log
of the volumes of all 10" diameter shortleaf pines.
fit$call

## lm(formula = lny ~ lnx)


new = data.frame(lnx = log(10))
c.i. = predict(fit, new, interval = 'confidence', level = .95)
c.i.

##        fit      lwr      upr
## 1 3.032996 2.992202 3.073791

Exponentiating both endpoints of the interval, we get

$e^{2.9922} = 19.9$ and $e^{3.0738} = 21.6$

Because exponentiating a mean on the log scale gives a median on the original scale,
we can be 95% confident that the median volume of all shortleaf pines, 10"
in diameter, is between 19.9 and 21.6 cubic feet.
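The back-transformation of the interval can also be done in R. A minimal sketch that re-enters the values printed by predict() by hand (with the fitted model available, `exp(c.i.)` would do the same thing):

```r
# Values copied from the predict() output on the log scale
ci_log <- c(fit = 3.032996, lwr = 2.992202, upr = 3.073791)

# Back-transform to the original scale (cubic feet)
ci_vol <- exp(ci_log)
round(ci_vol, 1)
```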
To Use the 95% Confidence Interval for β1

Question 4: What is the expected change in volume for a
two-fold increase in diameter?
When we log-transform both Y and x, the following fact holds:
- In general, the median of Y changes by a factor of $k^{\beta_1}$ for each
  k-fold increase in the predictor x.
- Therefore, for this question, the median changes by a factor of
  $2^{\beta_1}$ for each two-fold increase in the predictor x.
- As always, we won’t know the slope of the population line, $\beta_1$.
  We have to use $b_1$ to estimate it.
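The factor for a k-fold increase follows from back-transforming the log-log model. A short derivation, assuming the errors on the log scale are symmetric about zero, so that exponentiating the mean of ln Y gives the median of Y:

```latex
\[
\begin{aligned}
\ln \operatorname{median}(Y \mid x) &= \beta_0 + \beta_1 \ln x
  \quad\Longrightarrow\quad
  \operatorname{median}(Y \mid x) = e^{\beta_0}\, x^{\beta_1}, \\[4pt]
\frac{\operatorname{median}(Y \mid kx)}{\operatorname{median}(Y \mid x)}
  &= \frac{e^{\beta_0}\,(kx)^{\beta_1}}{e^{\beta_0}\, x^{\beta_1}}
  = k^{\beta_1}.
\end{aligned}
\]
```

The ratio does not depend on the starting value of x, which is why a single factor describes every k-fold increase.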
summary(fit)

##
## Call:
## lm(formula = lny ~ lnx)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3323 -0.1131 0.0267 0.1177 0.4280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.8718 0.1216 -23.63 <2e-16 ***
## lnx 2.5644 0.0512 50.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1703 on 68 degrees of freedom
## Multiple R-squared: 0.9736, Adjusted R-squared: 0.9732
## F-statistic: 2509 on 1 and 68 DF, p-value: < 2.2e-16
A 95% confidence interval for $\beta_1$ is

2.56442 ± 1.9955(0.05120) = (2.46, 2.67)

where 1.9955 is computed by qt(.975, n − 2) in R. Because

$2^{2.46} = 5.50$ and $2^{2.67} = 6.36$

we can be 95% confident that the median volume will increase by a factor
between 5.50 and 6.36 for each two-fold increase in diameter.
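The interval can be reproduced in R from the estimate and standard error in the summary output. This is a sketch with the numbers typed in by hand; the names `b1`, `se_b1`, and `ci_b1` are ours. With the fitted model available, `confint(fit)` should give the same interval for the slope.

```r
b1    <- 2.56442   # slope estimate for lnx
se_b1 <- 0.05120   # its standard error
n     <- 70        # number of trees

t_crit <- qt(0.975, df = n - 2)            # about 1.9955
ci_b1  <- b1 + c(-1, 1) * t_crit * se_b1   # 95% CI for beta_1

round(ci_b1, 2)   # endpoints on the slope scale
2^ci_b1           # factor for a two-fold increase in diameter
```

Because this version exponentiates the unrounded endpoints, the resulting factors can differ in the second decimal place from those obtained with the rounded endpoints 2.46 and 2.67.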
What to Do When It Is Difficult to Determine Which
Transformation on Y to Use?

- It is often difficult to determine which transformation on Y to use.
  Box-Cox transformations are a family of power transformations on
  Y such that $Y' = Y^{\lambda}$, where λ is a parameter to be determined
  using the data.
- The normal error regression model with a Box-Cox transformation is

  $$Y_i^{\lambda} = \beta_0 + \beta_1 x_i + \varepsilon_i$$

- The method of maximum likelihood can be used to estimate λ, or a
  simple search over a range of candidate values may be performed.
  For example, λ could be searched over a “grid” such as
  λ = −1, −0.9, ..., 0.9, 1.
- For each λ value, the $Y_i^{\lambda}$ observations are standardized so that the
  analysis using the SSEs does not depend on λ. The standardization is

  $$W_i = \begin{cases} K_1 \left(Y_i^{\lambda} - 1\right), & \lambda \neq 0; \\ K_2 \log Y_i, & \lambda = 0, \end{cases}$$

  where $K_2 = \left(\prod_{i=1}^{n} Y_i\right)^{1/n}$ and $K_1 = \dfrac{1}{\lambda K_2^{\lambda - 1}}$. Note that the
  log-transformation on Y is a special case of the Box-Cox transformation,
  for which λ = 0.
- Once the $W_i$ have been calculated for a given λ, they are
  regressed on the $x_i$ and the SSE is retained. The maximum
  likelihood estimate $\hat{\lambda}$ is the value of λ for which the SSE is a
  minimum.
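The grid search described above can be sketched by hand in R. This is an illustration under stated assumptions, not the procedure used for the plots in these notes: the data are simulated (hypothetical `x` and `y`, built so that the square root of `y` is linear in `x`), and `box_cox_sse` is a helper name of our own.

```r
# Hypothetical positive response whose square root is linear in x
set.seed(2)
x <- runif(50, min = 1, max = 10)
y <- (2 + 0.5 * x + rnorm(50, sd = 0.2))^2

# SSE from regressing the standardized W_i on x for a given lambda
box_cox_sse <- function(lambda, y, x) {
  k2 <- exp(mean(log(y)))                   # K2: geometric mean of the Y_i
  if (abs(lambda) < 1e-8) {
    w <- k2 * log(y)                        # lambda = 0 case
  } else {
    k1 <- 1 / (lambda * k2^(lambda - 1))    # K1
    w  <- k1 * (y^lambda - 1)
  }
  sum(resid(lm(w ~ x))^2)
}

lambdas    <- round(seq(-1, 1, by = 0.1), 1)   # grid -1, -0.9, ..., 1
sse        <- sapply(lambdas, box_cox_sse, y = y, x = x)
lambda_hat <- lambdas[which.min(sse)]          # grid value with the smallest SSE
lambda_hat
```

For these simulated data the selected value should land near 0.5, the power used to generate them.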

The boxcox() function in the MASS package computes and optionally plots
profile log-likelihoods for the parameter of the Box-Cox power
transformation.
Plasma Data Example

- Now let’s use a plasma data set to illustrate how to use the Box-Cox
  transformation. This data set was collected on 25 healthy children.
- The predictor is age (x) and the response is plasma level of a
  polyamine (Y).
First, let’s try a linear fit without any transformation. The fit shows
non-constant variance and non-linearity. For this fit, $R^2$ = 0.7532.
Now let’s check the “residuals vs fit plot”, which confirms the
non-constant variance and non-linearity problems.
The normal Q-Q probability plot has the “classical” pattern seen when there is
an outlier. This is an example of how a poor fit can make a data point
appear to be an outlier.
Now let’s try a log-transformation on Y, which is a remedy for the non-constant
variance problem, and regress lnY on x again. The fit is better now, and
$R^2$ = 0.8535.
The “residuals vs fit plot” also shows some progress, but there is still some
slight curvature pattern.
There is still an outlier in the normal Q-Q probability plot and the
normality assumption is not met.
Can We Find a Better Transformation on Y ?
Now let’s try the Box-Cox transformation to find the “best” λ and hence the transformed
response $Y^{\lambda}$. We can try searching λ on [−1, 1].
library(MASS)  # provides boxcox()
boxcox.trans = boxcox(y ~ x, data = Plasma, lambda = seq(-1, 1, length.out = 10))
[Figure: profile log-likelihood plotted against λ over [−1, 1], with the 95% level marked.]

Now let’s regress $Y^{-0.5}$ on x. For this fit, $R^2$ = 0.8665, which is slightly
better than the fit of regressing lnY on x.
The “residuals vs fit plot” has a well-behaved pattern.
The normal Q-Q probability plot now supports the normality assumption
because points are approximately on a line.
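The steps of this example can be sketched end to end in R. The plasma data are not reproduced here, so the data frame `sim` below is a simulated stand-in with made-up values (25 observations, built so that the response raised to the power −0.5 is linear in age); only boxcox() from MASS and base R functions are used.

```r
library(MASS)  # provides boxcox()

# Hypothetical stand-in for the plasma data: y^(-0.5) is linear in x
set.seed(3)
age   <- rep(0:4, each = 5)
level <- (0.27 + 0.04 * age + rnorm(25, sd = 0.02))^(-2)
sim   <- data.frame(x = age, y = level)

# Profile log-likelihood on a grid, without drawing the plot
bc <- boxcox(y ~ x, data = sim, lambda = seq(-1, 1, by = 0.1), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]   # grid value with the largest log-likelihood
lambda_hat

# Refit with the power used in the text, lambda = -0.5
fit_bc <- lm(I(y^(-0.5)) ~ x, data = sim)
summary(fit_bc)$r.squared
```

With only 25 observations the profile log-likelihood tends to be fairly flat, so the grid maximizer can sit some distance from −0.5; a common practice is to pick a convenient value inside the 95% interval rather than the exact maximizer.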
Study Guide

To prepare for the quiz and exam, you will want to
- know and be able to perform diagnostics for linear
  regression
- know and be able to find correct transformations to
  guarantee the “LINE” conditions
