Chapter 2

Inferences in Regression and Correlation Analysis

Applied Linear Regression Models
(Kutner, Nachtsheim, Neter, Li)

Inferences concerning:
the regression parameters β0 and β1
  interval estimation of β0 and β1
  tests about them
interval estimation of E{Y}, the mean of the probability distribution of Y, for given X
prediction intervals for a new observation Y
confidence bands for the regression line
the analysis of variance approach to regression analysis
the general linear test approach
descriptive measures of association
the correlation coefficient

Normal error regression model

Assume that the normal error regression model is applicable:

Yi = β0 + β1 Xi + εi

β0 and β1 are parameters
Xi are known constants
εi are independent N(0, σ²)

Inferences Concerning β1

Inferences about the slope β1 of the regression line

A market research analyst studying the relation between sales (Y) and advertising expenditures (X) may wish:
to obtain an interval estimate of β1, which provides information as to how many additional sales dollars, on average, are generated by an additional dollar of advertising expenditure

to test:
H0: β1 = 0;
Ha: β1 ≠ 0.

Inferences about the slope β1 of the regression line (cont.)

Figure 1 : Regression Model when β1 = 0.

β1 = 0 ⇒ no linear association between Y and X

The regression line is horizontal.
The means of Y: E{Y} = β0.
The probability distributions of Y are identical at all levels of X.

Sampling Distribution of b1

Point estimator b1

$$ b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} $$

The sampling distribution of b1 refers to the different values of b1 that would be obtained with repeated sampling when the levels of the predictor variable X are held constant from sample to sample.

Point estimator b1 (cont.)

Sampling distribution of b1 (requires the properties of ki)

For the normal error regression model:

Mean: E{b1} = β1
Variance:
$$ \sigma^2\{b_1\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2} $$

b1 is a linear combination of the Yi:
$$ b_1 = \sum k_i Y_i, \qquad k_i = \frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2} $$
where each ki is a function of the Xi only.

Point estimator b1 (cont.)

Properties of ki:
$$ \sum k_i = 0, \qquad \sum k_i X_i = 1, \qquad \sum k_i^2 = \frac{1}{\sum (X_i - \bar{X})^2} $$
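
The three properties of the ki, and the representation of b1 as a linear combination of the Yi, can be checked numerically. The following is a minimal sketch in R; the X and Y values are hypothetical numbers chosen only for illustration, and the variable names are not from the slides.

```r
# Hypothetical predictor levels and responses, used only for illustration.
X <- c(20, 30, 50, 70, 90, 110, 120)
Y <- c(10, 14, 25, 30, 41, 52, 55)

Sxx <- sum((X - mean(X))^2)
k   <- (X - mean(X)) / Sxx              # the weights k_i

sum_k  <- sum(k)                        # property 1: sum of k_i is 0
sum_kX <- sum(k * X)                    # property 2: sum of k_i * X_i is 1
sum_k2 <- sum(k^2)                      # property 3: sum of k_i^2 is 1 / Sxx

b1_linear <- sum(k * Y)                 # b1 as a linear combination of the Y_i
b1_lm     <- unname(coef(lm(Y ~ X))[2]) # slope reported by lm()
```

Up to floating-point error, `b1_linear` and `b1_lm` agree, which is the point of writing b1 = Σ ki Yi.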

Normality

Yi: independently, normally distributed

A linear combination of independent normal random variables is normally distributed, so b1 = Σ ki Yi is normally distributed, with

$$ E\{b_1\} = \beta_1, \qquad \sigma^2\{b_1\} = \frac{\sigma^2}{\sum (X_i - \bar{X})^2} $$

Normality (cont.)

The unbiased estimator of σ²:
$$ MSE = \frac{SSE}{n-2} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n-2} $$

Estimated variance of b1:
$$ s^2\{b_1\} = \frac{MSE}{\sum (X_i - \bar{X})^2} $$

The point estimator s²{b1} is an unbiased estimator of σ²{b1}.
The point estimator of σ{b1}: s{b1}
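b1: unbiased linear estimator

Theorem 1
The estimator b1 has minimum variance among all unbiased linear estimators of the form
$$ \hat{\beta}_1 = \sum c_i Y_i $$
ci: arbitrary constants with Σ ci = 0 and Σ ci Xi = 1, so that β̂1 is unbiased: E{β̂1} = β1
Variance:
$$ \sigma^2\{\hat{\beta}_1\} = \sigma^2 \sum c_i^2 $$
Write ci = ki + di, where
$$ k_i = \frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2} $$
and the di are arbitrary constants. Then Σ ki di = 0, so
$$ \sigma^2\{\hat{\beta}_1\} = \sigma^2 \sum k_i^2 + \sigma^2 \sum d_i^2 = \sigma^2\{b_1\} + \sigma^2 \sum d_i^2 $$
The variance of β̂1 is at a minimum when Σ di² = 0 ⇔ all di = 0 ⇔ ci = ki.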
Sampling distribution of (b1 − β1)/s{b1}

Theorem 2

$$ \frac{b_1 - \beta_1}{s\{b_1\}} \sim t(n-2) $$

Compare: if the observations Yi come from the same normal population, (Ȳ − µ)/s{Ȳ} ∼ t(n − 1). Here two parameters (β0, β1) are estimated, so the degrees of freedom are n − 2.

Sampling distribution of (b1 − β1)/s{b1} (cont.)

Proof of (b1 − β1)/s{b1} ∼ t(n − 2)

Theorem 3
For regression model (2.1),

$$ SSE/\sigma^2 \sim \chi^2(n-2), $$

and is independent of b0 and b1.

Theorem 4
t Distribution
Let Z (standard normal) and χ²(ν) (chi-square with ν degrees of freedom) be independent random variables. A t random variable is defined as:
$$ t(\nu) = \frac{Z}{\sqrt{\chi^2(\nu)/\nu}} $$

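Proof (cont.)

1 (b1 − β1)/σ{b1} ∼ Z (standard normal variable)
2 The ratio of estimated to true variance is a scaled chi-square variable:
$$ \frac{s^2\{b_1\}}{\sigma^2\{b_1\}} = \frac{MSE}{\sigma^2} = \frac{SSE/\sigma^2}{n-2} \sim \frac{\chi^2(n-2)}{n-2} $$
3 Therefore:
$$ \frac{b_1 - \beta_1}{s\{b_1\}} = \frac{b_1 - \beta_1}{\sigma\{b_1\}} \div \frac{s\{b_1\}}{\sigma\{b_1\}} \sim \frac{Z}{\sqrt{\chi^2(n-2)/(n-2)}} $$
4 Z is a function of b1
5 b1 is independent of SSE/σ² ∼ χ²(n − 2), so Z and χ² are independent

Hence, by Theorem 4,
$$ \frac{b_1 - \beta_1}{s\{b_1\}} \sim t(n-2) $$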

Confidence interval for β1

$$ \frac{b_1 - \beta_1}{s\{b_1\}} \sim t(n-2) $$

$$ P\{\, t(\alpha/2; n-2) \le (b_1 - \beta_1)/s\{b_1\} \le t(1-\alpha/2; n-2) \,\} = 1 - \alpha $$
$$ \Rightarrow P\{\, b_1 - t(1-\alpha/2; n-2)\, s\{b_1\} \le \beta_1 \le b_1 + t(1-\alpha/2; n-2)\, s\{b_1\} \,\} = 1 - \alpha $$

t(α/2; n − 2): the (α/2)100 percentile of the t distribution with n − 2 d.f.
Confidence interval for β1 (cont.)

Symmetric: t(α/2; n − 2) = −t(1 − α/2; n − 2)

The 1 − α confidence limits for β1 are:

b1 ± t(1 − α/2; n − 2)s{b1 }

Ex: Confidence interval for β1

The Toluca Company: an estimate of β1 with a 95 percent confidence coefficient.

#### Example p46
## method 1
sb1 <- sqrt(MSE / sum((Size - mean(Size))^2))  # s{b1}
## method 2

$$ s^2\{b_1\} = \frac{MSE}{\sum (X_i - \bar{X})^2} = 0.120404 $$

s{b1} = 0.3470
t(0.975; 23) = 2.069
The 95 percent confidence interval:

2.85 ≤ β1 ≤ 4.29
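
The interval can be reproduced from the summary values quoted in this example alone. This is a minimal sketch in R, assuming only n = 25, b1 = 3.5702 and s{b1} = 0.3470 as given on the slides; the variable names are illustrative.

```r
# Summary values quoted in the Toluca Company example.
n     <- 25
b1    <- 3.5702
sb1   <- 0.3470        # s{b1}
alpha <- 0.05

t_crit <- qt(1 - alpha / 2, df = n - 2)   # t(0.975; 23)
ci_b1  <- b1 + c(-1, 1) * t_crit * sb1    # b1 -/+ t * s{b1}

round(t_crit, 3)
round(ci_b1, 2)
```

Rounded to two decimals, the limits match the 2.85 and 4.29 reported above.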

Tests concerning β1

$$ \frac{b_1 - \beta_1}{s\{b_1\}} \sim t(n-2) $$

Two-Sided Test

H0: β1 = 0 vs. Ha: β1 ≠ 0

The test statistic:
$$ t^* = \frac{b_1}{s\{b_1\}} $$

The decision rule (the level of significance α):

If |t*| ≤ t(1 − α/2; n − 2), conclude H0

If |t*| > t(1 − α/2; n − 2), conclude Ha

hsuhl (NUK) LR Chap 2 24 / 118

Ex: Tests concerning β1

α = 0.05; n = 25
t(0.975; 23) = 2.069
b1 = 3.5702
|t*| = |3.5702/0.3470| = 10.29 > 2.069

Conclude Ha: β1 ≠ 0

p-value: P{t(23) > t* = 10.29} < 0.0005 (upper tail); the two-sided p-value is twice this, so it is below 0.001.
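
The test can be carried out from the same summary values. This is a minimal sketch in R, assuming only the numbers quoted on the slides (n = 25, b1 = 3.5702, s{b1} = 0.3470); the variable names are illustrative.

```r
# Two-sided t test of H0: beta1 = 0 from the quoted summary values.
n     <- 25
b1    <- 3.5702
sb1   <- 0.3470
alpha <- 0.05

t_star    <- b1 / sb1                        # test statistic t*
t_crit    <- qt(1 - alpha / 2, df = n - 2)   # t(0.975; 23)
reject_H0 <- abs(t_star) > t_crit            # TRUE means: conclude Ha

p_upper     <- pt(t_star, df = n - 2, lower.tail = FALSE)          # P{t(23) > t*}
p_two_sided <- 2 * pt(abs(t_star), df = n - 2, lower.tail = FALSE) # two-sided p-value
```
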
Tests concerning β1

One-Sided Test

H0: β1 ≤ 0 vs. Ha: β1 > 0

The test statistic:
$$ t^* = \frac{b_1}{s\{b_1\}} $$

The decision rule (the level of significance α):

If t* ≤ t(1 − α; n − 2), conclude H0

If t* > t(1 − α; n − 2), conclude Ha

Tests concerning β1

Two-Sided Test (hypothesized value β10)

H0: β1 = β10 vs. Ha: β1 ≠ β10

The test statistic:
$$ t^* = \frac{b_1 - \beta_{10}}{s\{b_1\}} $$

The decision rule (the level of significance α):

If |t*| ≤ t(1 − α/2; n − 2), conclude H0

If |t*| > t(1 − α/2; n − 2), conclude Ha

Inferences concerning β0

Sampling distribution of b0

The point estimator b0 : b0 = Ȳ − b1 X̄

Theorem 5
For regression model (2.1), the sampling distribution of b0 is normal, with mean and variance:

$$ E\{b_0\} = \beta_0 $$
$$ \sigma^2\{b_0\} = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right] $$

Sampling distribution of b0 (cont.)

An estimator of σ²{b0}:
$$ s^2\{b_0\} = MSE \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right] $$

An estimator of σ{b0}: s{b0}

Theorem 6

$$ \frac{b_0 - \beta_0}{s\{b_0\}} \sim t(n-2) $$

Confidence interval for β0

The 1 − α confidence limits for β0 are:

b0 ± t(1 − α/2; n − 2)s{b0}

## method 1
sb0 <- sqrt(MSE * (1/n + mean(Size)^2 / sum((Size - mean(Size))^2)))  # s{b0}
## method 2
confint(fitreg, level = 0.9)  # 90 percent confidence limits for the coefficients
Example C.I. for β0

Toluca Company
Range of X: [20, 120]
t(0.95; 23) = 1.714
$$ s^2\{b_0\} = MSE \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \right] = 685.34 $$
s{b0} = 26.18
The 90% C.I. for β0:

17.5 ≤ β0 ≤ 107.2
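
The same numbers follow from the summary statistics that appear elsewhere in these slides. This is a minimal sketch in R, assuming b0 = 62.37, MSE = 2,384, X̄ = 70 and Σ(Xi − X̄)² = 19,800 as quoted in the Toluca examples; the variable names are illustrative.

```r
# Summary values quoted in the Toluca Company examples.
n    <- 25
b0   <- 62.37
MSE  <- 2384
Xbar <- 70
Sxx  <- 19800                              # sum of (X_i - Xbar)^2

s2_b0  <- MSE * (1 / n + Xbar^2 / Sxx)     # s^2{b0}
s_b0   <- sqrt(s2_b0)                      # s{b0}
t_crit <- qt(0.95, df = n - 2)             # t(0.95; 23) for a 90 percent interval
ci_b0  <- b0 + c(-1, 1) * t_crit * s_b0

round(s2_b0, 2)
round(ci_b0, 1)
```
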

Example C.I. for β0 (cont.)

This interval does not necessarily provide information about the "setup" cost, since we cannot be certain that a linear regression model is appropriate when the scope of the model is extended to X = 0 (the observed range of X is [20, 120]).

Some considerations on making inferences concerning β0
and β1

Departures from Normality

If the probability distributions of Y are not exactly normal but do not depart seriously from normality, the sampling distributions of b0 and b1 will be approximately normal, and the use of the t distribution will provide approximately the specified confidence coefficient or level of significance.

Departures from Normality (cont.)

Even if the distributions of Y are far from normal, the estimators b0 and b1 generally have the property of asymptotic normality: their distributions approach normality under very general conditions as the sample size increases.

Power of Tests

H0: β1 = β10 vs. Ha: β1 ≠ β10

Test statistic:
$$ t^* = \frac{b_1 - \beta_{10}}{s\{b_1\}} $$

The power of this test at level α is the probability that the decision rule leads to conclusion Ha when Ha in fact holds:
$$ \text{Power} = P\{\, |t^*| > t(1-\alpha/2; n-2) \mid \delta \,\} $$

Power of Tests (cont.)

δ: the noncentrality measure, i.e., a measure of how far the true value of β1 is from β10:

$$ \delta = \frac{|\beta_1 - \beta_{10}|}{\sigma\{b_1\}} $$

(power values: Appendix Table B.5)
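
Instead of a table lookup, the power can be evaluated with the noncentral t distribution, since t* follows a noncentral t distribution with noncentrality δ when Ha holds. The following is a minimal sketch in R; the function name `power_two_sided_t` and the δ values used are hypothetical, chosen only for illustration.

```r
# Power of the two-sided t test as a function of the noncentrality measure delta.
# power_two_sided_t is a hypothetical helper name, not from the slides.
power_two_sided_t <- function(delta, df, alpha = 0.05) {
  t_crit <- qt(1 - alpha / 2, df)
  # P{t* > t_crit} + P{t* < -t_crit} under a noncentral t with ncp = delta
  pt(t_crit, df, ncp = delta, lower.tail = FALSE) +
    pt(-t_crit, df, ncp = delta)
}

power_at_0 <- power_two_sided_t(delta = 0, df = 23)   # equals alpha when H0 holds
power_at_2 <- power_two_sided_t(delta = 2, df = 23)
power_at_4 <- power_two_sided_t(delta = 4, df = 23)
```

The power equals α at δ = 0 and increases as δ grows, that is, as the true slope moves away from β10 relative to σ{b1}.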

Common objective
To estimate the mean for one or more probability distributions of Y.

Example: a study of the relation between level of piecework pay (X) and worker productivity (Y).
The mean productivity at high and medium levels of piecework pay may be of particular interest for purposes of analyzing the benefits obtained from an increase in the pay.

Xh: the level of X for which we wish to estimate the mean response; it may be
a value which occurred in the sample, or
some other value of the predictor variable within the scope of the model
E{Yh}: the mean response when X = Xh
Ŷh: the point estimator of E{Yh}:
Ŷh = b0 + b1 Xh

What is the sampling distribution of Ŷh?

Sampling distribution of Ŷh

For normal error regression model (2.1), the sampling distribution of Ŷh is normal:
$$ E\{\hat{Y}_h\} = E\{Y_h\} $$
$$ \sigma^2\{\hat{Y}_h\} = \sigma^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right] $$

Ŷh is a linear combination of the observations Yi.

Ŷh is an unbiased estimator of E{Yh}.

Sampling distribution of Ŷh (cont.)

Figure 2: Effect on Ŷh of Variation in b1 from Sample to Sample in Two Samples with Same Means Ȳ and X̄

Properties of Ŷh

When Xh = 0:
σ²{Ŷh} = σ²{b0}
s²{Ŷh} = s²{b0}
b1 and Ȳ are uncorrelated: σ{Ȳ, b1} = 0

Properties of Ŷh (cont.)

When MSE is substituted for σ², the estimated variance of Ŷh is

$$ s^2\{\hat{Y}_h\} = MSE \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right] $$

The estimated standard deviation of Ŷh: s{Ŷh}

Interval Estimation of E{Yh }

Sampling Distribution of (Ŷh − E{Yh })/s{Ŷh }

Theorem 7

$$ \frac{\hat{Y}_h - E\{Y_h\}}{s\{\hat{Y}_h\}} \sim t(n-2) $$

The 1 − α confidence limits for E{Yh} are:

Ŷh ± t(1 − α/2; n − 2)s{Ŷh}

Example 1
Toluca Company example
Find a 90 percent confidence interval for E{Yh} when the lot size is Xh = 65 units.
Ŷh = 62.37 + 3.5702(65) = 294.4
$$ s^2\{\hat{Y}_h\} = 2{,}384 \left[ \frac{1}{25} + \frac{(65 - 70)^2}{19{,}800} \right] = 98.37 \;\Rightarrow\; s\{\hat{Y}_h\} = 9.918 $$
t(0.95; 23) = 1.714
277.4 ≤ E{Yh} ≤ 311.4
Conclude: with confidence coefficient 0.90, the mean number of work hours required when lots of 65 units are produced is somewhere between 277.4 and 311.4 hours.
The estimate of the mean number of work hours is moderately precise.

Code for Example 1

#### Example p54

## method 1
sb1 <- sqrt(MSE / sum((Size - mean(Size))^2))  # s{b1}
hYh <- b0 + b1 * Xh                            # point estimate of E{Yh}

## method 2
predXCI <- predict(fitreg, data.frame(Size = c(Xh)),
                   interval = "confidence", se.fit = FALSE, level = 0.9)
plot(Size, Hrs)

# now the confidence interval for X_h = specific level

points(Xh, predXCI[, "fit"], col = "red", pch = 15)
points(Xh, predXCI[, "lwr"], lty = "dotted", col = "red")
points(Xh, predXCI[, "upr"], lty = "dotted", col = "red")

Example 2
Toluca Company example
Estimate E{Yh} for lots with Xh = 100 units with a 90 percent confidence interval.
Ŷh = 62.37 + 3.5702(100) = 419.4
$$ s^2\{\hat{Y}_h\} = 2{,}384 \left[ \frac{1}{25} + \frac{(100 - 70)^2}{19{,}800} \right] = 203.72 $$
s{Ŷh} = 14.27
t(0.95; 23) = 1.714
394.9 ≤ E{Yh} ≤ 443.9
Conclude: the confidence interval is somewhat wider than that for Example 1, since the level Xh = 100 is substantially farther from the mean X̄ = 70 than Xh = 65 is.
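
Both examples can be reproduced from the quoted summary statistics, which also makes the widening of the interval away from X̄ easy to see. This is a minimal sketch in R; the helper name `ci_mean_response` is hypothetical, and the inputs are the values quoted on the slides (b0 = 62.37, b1 = 3.5702, MSE = 2,384, X̄ = 70, Σ(Xi − X̄)² = 19,800).

```r
# Summary values quoted in the Toluca Company examples.
n <- 25; b0 <- 62.37; b1 <- 3.5702
MSE <- 2384; Xbar <- 70; Sxx <- 19800

# Hypothetical helper: confidence limits for the mean response E{Yh} at Xh.
ci_mean_response <- function(Xh, level = 0.90) {
  Yhat   <- b0 + b1 * Xh
  s_Yhat <- sqrt(MSE * (1 / n + (Xh - Xbar)^2 / Sxx))   # s{Yhat_h}
  t_crit <- qt(1 - (1 - level) / 2, df = n - 2)
  c(fit = Yhat, se = s_Yhat,
    lwr = Yhat - t_crit * s_Yhat, upr = Yhat + t_crit * s_Yhat)
}

ci65  <- ci_mean_response(65)    # Example 1
ci100 <- ci_mean_response(100)   # Example 2
```

The standard error at Xh = 100 is larger than at Xh = 65 because (Xh − X̄)² is 900 rather than 25.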

Sampling distribution of Ŷh

The variance of Ŷh is smallest when Xh = X̄ .

In an experiment to estimate the mean response at a
particular level Xh of the predictor variable, the precision of
the estimate will be greatest if the observations on X are
spaced so that X̄ = Xh .
The usual relationship between C.I. and tests applies in
inferences concerning the mean response.
The two-sided confidence limits can be utilized for two-sided
tests concerning the mean response at Xh . Alternatively, a
regular decision rule can be set up.
The confidence limits for a mean response E{Yh } are not
sensitive to moderate departures from the assumption that
the error terms are normally distributed.

Prediction of New Observation

Examples:
Toluca Company: the next lot to be produced consists of 100 units; management wishes to predict the number of work hours for this particular lot.

An analyst has estimated the regression relation between company sales and number of persons 16 or more years old from data for the past 10 years, and wishes to predict next year's company sales.

College admissions example

The relevant parameters of the regression model are assumed known:

β0 = 0.10, β1 = 0.95
E{Y} = 0.10 + 0.95X
σ = 0.12
For an applicant whose high school GPA is Xh = 3.5:

E{Yh} = 0.10 + 0.95(3.5) = 3.425

E{Yh} ± 3σ:
3.425 ± 3(0.12) ⇒ 3.065 ≤ Yh(new) ≤ 3.785

The basic idea of a prediction interval is to choose a range in the distribution of Y wherein most of the observations will fall, and then to declare that the next observation will fall in this range.
When the regression parameters of normal error regression model (2.1) are known, the 1 − α prediction limits for Yh(new) are:
E{Yh} ± z(1 − α/2)σ
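
For the college admissions example, the known-parameter limits can be computed directly. This is a minimal sketch in R using the parameter values quoted above (β0 = 0.10, β1 = 0.95, σ = 0.12, Xh = 3.5); the 95 percent limits are an added illustration, not a figure from the slides.

```r
# Known parameters from the college admissions example.
beta0 <- 0.10
beta1 <- 0.95
sigma <- 0.12
Xh    <- 3.5

EYh <- beta0 + beta1 * Xh                         # mean response at Xh

limits_3sigma   <- EYh + c(-1, 1) * 3 * sigma     # the 3-sigma limits used on the slide
coverage_3sigma <- pnorm(3) - pnorm(-3)           # probability content of +/- 3 sigma

alpha     <- 0.05                                 # illustrative choice
limits_95 <- EYh + c(-1, 1) * qnorm(1 - alpha / 2) * sigma
```

The ±3σ limits correspond to a coverage of about 99.7 percent, so they are wider than the 95 percent limits from z(0.975).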

Prediction Interval for Yh(new) when Parameters Unknown

Figure 3 (Figure 2.5 in the text): Prediction of Yh(new) when Parameters Unknown

Since we cannot be certain of the location of the distribution of Y, prediction limits for Yh(new) clearly must take account of two elements (Figure 2.5):
Variation in possible location of the distribution of Y
Variation within the probability distribution of Y
Prediction limits for a new observation Yh(new) at a given level Xh are obtained from:
Theorem 8

$$ \frac{Y_{h(new)} - \hat{Y}_h}{s\{pred\}} \sim t(n-2) \quad \text{for normal error regression model (2.1)} $$

The 1 − α prediction limits for Yh(new):

Ŷh ± t(1 − α/2; n − 2)s{pred}

The difference Yh(new) − Ŷh may be viewed as the prediction error, with Ŷh serving as the best point estimate of the value of the new observation Yh(new).
σ²{pred}: the variance of the prediction error. Because the new observation is independent of the sample on which Ŷh is based,

$$ \sigma^2\{pred\} = \sigma^2\{Y_{h(new)} - \hat{Y}_h\} = \sigma^2 + \sigma^2\{\hat{Y}_h\} $$

σ²{pred} has two components:

The variance of the distribution of Y at X = Xh: σ²
The variance of the sampling distribution of Ŷh: σ²{Ŷh}
An unbiased estimator of σ²{pred}:

$$ s^2\{pred\} = MSE + s^2\{\hat{Y}_h\} = MSE \left[ 1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right] $$

Example
Toluca Company: Xh = 100
90 percent prediction interval: t(0.95; 23) = 1.714

Ŷh = 419.4, s²{Ŷh} = 203.72, MSE = 2,384

⇒ s²{pred} = 2,384 + 203.72 = 2,587.72
s{pred} = 50.87

The 90 percent prediction interval for Yh(new):

332.2 ≤ Yh(new) ≤ 506.6
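
The arithmetic of this example can be checked in a few lines. This is a minimal sketch in R, assuming only the values quoted above (Ŷh = 419.4, s²{Ŷh} = 203.72, MSE = 2,384, n = 25); the variable names are illustrative.

```r
# Values quoted in the Toluca Company prediction example (Xh = 100).
n        <- 25
MSE      <- 2384
Yhat     <- 419.4
s2_Yhat  <- 203.72

s2_pred <- MSE + s2_Yhat                 # s^2{pred} = MSE + s^2{Yhat_h}
s_pred  <- sqrt(s2_pred)                 # s{pred}
t_crit  <- qt(0.95, df = n - 2)          # t(0.95; 23)
pi_new  <- Yhat + c(-1, 1) * t_crit * s_pred

round(s_pred, 2)
round(pi_new, 1)
```
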

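## Example (p59)
# se.fit = TRUE returns fit, se.fit, df and residual.scale;
# add interval = "prediction" to obtain the prediction limits directly
fitnew <- predict.lm(fitreg, data.frame(Size = c(Xh)), se.fit = TRUE, level = 0.9)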

This prediction interval is rather wide and may not be useful for planning worker requirements for the next lot.
The interval can still be useful for control purposes.
If the actual work hours fall outside the prediction limits, management is alerted that a change in the production process may have occurred.

Comparing Yh(new) and E{Yh}

Toluca Company:
The prediction interval for Yh(new) is wider than the confidence interval for E{Yh}:
in predicting the work hours required for a new lot, we encounter both the variability in Ŷh from sample to sample and the lot-to-lot variation within the probability distribution of Y.
The prediction interval is wider the further Xh is from X̄.

Prediction of Mean of m New Observations for Given Xh

Predict the mean of m new observations on Y for a given Xh.

Ȳh(new): the mean of the m new observations to be predicted; the new Y observations are assumed to be independent.
The appropriate 1 − α prediction limits:

Ŷh ± t(1 − α/2; n − 2)s{predmean}

$$ s^2\{predmean\} = \frac{MSE}{m} + s^2\{\hat{Y}_h\} = MSE \left[ \frac{1}{m} + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2} \right] $$

Two components of s²{predmean}:
the variance of the mean of m observations from the probability distribution of Y at X = Xh
the variance of the sampling distribution of Ŷh

Example
Toluca Company: Xh = 100
90 percent prediction interval for the mean number of work hours Ȳh(new) in three new production runs (m = 3)
Previous results:
Ŷh = 419.4; s²{Ŷh} = 203.72
MSE = 2,384; t(0.95; 23) = 1.714
$$ s^2\{predmean\} = \frac{2{,}384}{3} + 203.72 = 998.4 $$
s{predmean} = 31.60

The prediction interval for Ȳh(new):

365.2 ≤ Ȳh(new) ≤ 473.6

The total number of work hours (3 times the limits above):

1,095.6 ≤ Total work hours ≤ 1,420.8
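
This example can be reproduced from the quoted values as well. The following is a minimal sketch in R, assuming Ŷh = 419.4, s²{Ŷh} = 203.72, MSE = 2,384, n = 25 and m = 3 as given above; the variable names are illustrative.

```r
# Values quoted in the Toluca Company example for the mean of m = 3 new runs.
n       <- 25
m       <- 3
MSE     <- 2384
Yhat    <- 419.4
s2_Yhat <- 203.72

s2_predmean <- MSE / m + s2_Yhat             # s^2{predmean}
s_predmean  <- sqrt(s2_predmean)
t_crit      <- qt(0.95, df = n - 2)          # t(0.95; 23)

pi_mean  <- Yhat + c(-1, 1) * t_crit * s_predmean   # limits for the mean of m runs
pi_total <- m * pi_mean                             # limits for total work hours
```

Totals computed from the unrounded limits differ from the slide's 1,095.6 and 1,420.8 by about 0.1, because the slide multiplies the already rounded limits by 3.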

Analysis of Variance Approach to Regression Analysis

Partition of Total Sum of Squares

The analysis of variance approach is based on the partitioning of sums of squares and degrees of freedom associated with the response variable Y.

The variation is measured in terms of the deviations of the Yi around their mean Ȳ:
Yi − Ȳ

Partition of Total Sum of Squares (cont.)

Total variation (SSTO: total sum of squares):
$$ SSTO = \sum (Y_i - \bar{Y})^2 $$
All Yi are the same ⇒ SSTO = 0
The greater the variation among the Yi, the larger is SSTO.
SSE: error sum of squares
$$ SSE = \sum (Y_i - \hat{Y}_i)^2 $$
All Yi fall on the fitted regression line ⇒ SSE = 0
The greater the variation of the Yi around the fitted regression line, the larger is SSE.

Partition of Total Sum of Squares (cont.)

SSR: regression sum of squares
$$ SSR = \sum (\hat{Y}_i - \bar{Y})^2 $$
The regression line is horizontal ⇒ SSR = 0; otherwise SSR > 0
SSR is a measure of the variability associated with the regression line.
The larger SSR is in relation to SSTO, the greater is the effect of the regression relation in accounting for the total variation in the Yi observations.

Formal Development of Partitioning

The total deviation:

$$ \underbrace{Y_i - \bar{Y}}_{\text{total deviation}} = \underbrace{\hat{Y}_i - \bar{Y}}_{\text{deviation of fitted regression value around mean}} + \underbrace{Y_i - \hat{Y}_i}_{\text{deviation around fitted regression line}} $$

Two components:
The deviation of the fitted value Ŷi around the mean Ȳ.
The deviation of the observation Yi around the fitted regression line.

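Formal Development of Partitioning (cont.)

$$ \sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2, \quad \text{i.e., } SSTO = SSR + SSE $$

The cross-product term vanishes:
$$ 2 \sum (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i) = 0 $$
$$ SSR = b_1^2 \sum (X_i - \bar{X})^2 $$

Degrees of freedom (df):

SS      df       explanation
SSTO    n − 1    (∵ Σ(Yi − Ȳ) = 0)
SSR     1        (∵ two parameters in Ŷi, one constraint Σ(Ŷi − Ȳ) = 0)
SSE     n − 2    (∵ β0, β1 are estimated in obtaining Ŷi)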
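
The partition can be verified numerically on any least squares fit. The following is a minimal sketch in R on synthetic data (simulated here purely for illustration; it is not the Toluca data), checking SSTO = SSR + SSE, the vanishing cross-product, and SSR = b1² Σ(Xi − X̄)².

```r
# Synthetic data, simulated only to illustrate the partition.
set.seed(1)
X <- seq(20, 120, by = 10)
Y <- 60 + 3.5 * X + rnorm(length(X), sd = 50)

fit  <- lm(Y ~ X)
Yhat <- fitted(fit)
b1   <- unname(coef(fit)[2])

SSTO  <- sum((Y - mean(Y))^2)                  # total sum of squares
SSR   <- sum((Yhat - mean(Y))^2)               # regression sum of squares
SSE   <- sum((Y - Yhat)^2)                     # error sum of squares
cross <- sum((Yhat - mean(Y)) * (Y - Yhat))    # cross-product term

SSR_alt <- b1^2 * sum((X - mean(X))^2)         # alternative formula for SSR
```
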

Mean Squares

Mean square (MS): a sum of squares divided by its associated df

SS     df       MS
SSR    1        MSR = SSR/1 (regression mean square)
SSE    n − 2    MSE = SSE/(n − 2) (error mean square)

Mean squares are not additive:

$$ \frac{SSTO}{n-1} \ne MSR + MSE $$

ANOVA table

Table 1: ANOVA Table for Simple Linear Regression

Source of Variation    SS                      df       MS                   E{MS}
Regression             SSR = Σ(Ŷi − Ȳ)²        1        MSR = SSR/1          σ² + β1² Σ(Xi − X̄)²
Error                  SSE = Σ(Yi − Ŷi)²       n − 2    MSE = SSE/(n − 2)    σ²
Total                  SSTO = Σ(Yi − Ȳ)²       n − 1

ANOVA table (cont.)

Modified ANOVA table:

$$ \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - n\bar{Y}^2 $$

SSTOU: total uncorrected sum of squares:
$$ SSTOU = \sum Y_i^2 $$

Correction for the mean sum of squares:
$$ SS(\text{correction for mean}) = n\bar{Y}^2 $$

ANOVA table (cont.)

Table 2: Modified ANOVA Table for Simple Linear Regression

Source of Variation    SS                                 df       MS
Regression             SSR = Σ(Ŷi − Ȳ)²                   1        MSR = SSR/1
Error                  SSE = Σ(Yi − Ŷi)²                  n − 2    MSE = SSE/(n − 2)
Total                  SSTO = Σ(Yi − Ȳ)²                  n − 1
Correction for mean    SS(correction for mean) = nȲ²      1
Total, uncorrected     SSTOU = ΣYi²                       n

Expected Mean Squares

$$ E\{MSE\} = \sigma^2 $$
$$ E\{MSR\} = \sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2 $$

MSE is an unbiased estimator of σ².

The mean of the sampling distribution of MSE is σ²;
The mean of the sampling distribution of MSR is σ² when β1 = 0;
When β1 ≠ 0, E{MSR} > E{MSE} = σ² (∵ β1² Σ(Xi − X̄)² > 0)

F test of β1 = 0 vs. β1 ≠ 0

The analysis of variance approach provides us with a battery of highly useful tests for regression models.
For the simple linear regression case, the ANOVA provides us with a test of:

H0: β1 = 0
Ha: β1 ≠ 0

Test statistic:
$$ F^* = \frac{MSR}{MSE} $$
large values of F* ⇒ Ha;
values of F* near 1 ⇒ H0;

F test of β1 = 0 vs. β1 ≠ 0 (cont.)

Cochran's theorem
If all n observations Yi come from the same normal distribution with mean µ and variance σ², and SSTO is decomposed into k sums of squares SSr, each with degrees of freedom dfr, then the SSr/σ² terms are independent χ² variables with dfr degrees of freedom if
$$ \sum_{r=1}^{k} df_r = n - 1 $$

F test of β1 = 0 vs. β1 ≠ 0 (cont.)

If β1 = 0, so that all Yi have the same mean µ = β0 and the same variance σ², then SSE/σ² and SSR/σ² are independent χ² variables.

When H0 holds:
$$ F^* = \frac{SSR/\sigma^2}{1} \div \frac{SSE/\sigma^2}{n-2} = \frac{MSR}{MSE} \sim \frac{\chi^2(1)}{1} \div \frac{\chi^2(n-2)}{n-2} \sim F(1, n-2) $$
When Ha holds, F* follows the noncentral F distribution.

Construction of Decision Rule

Under H0: F* ∼ F(1, n − 2)
The decision rule (α = risk of a Type I error):

If F* ≤ F(1 − α; 1, n − 2), conclude H0;

If F* > F(1 − α; 1, n − 2), conclude Ha;

where F(1 − α; 1, n − 2) is the (1 − α)100 percentile of the appropriate F distribution.

Example
Toluca Company
Earlier test on β1 (the t test, p46):
Two-Sided Test
H0: β1 = β10 vs. Ha: β1 ≠ β10
$$ t^* = \frac{b_1 - \beta_{10}}{s\{b_1\}} $$
If |t*| ≤ t(1 − α/2; n − 2), conclude H0
If |t*| > t(1 − α/2; n − 2), conclude Ha

Using the F test

H0: β1 = 0
Ha: β1 ≠ 0

α = 0.05; n = 25; F(0.95; 1, 23) = 4.28

If F* ≤ 4.28, conclude H0
If F* > 4.28, conclude Ha

We have
$$ F^* = \frac{MSR}{MSE} = \frac{252{,}378}{2{,}384} = 105.9 $$
What is the conclusion? Since F* = 105.9 > 4.28, conclude Ha: β1 ≠ 0.
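
The critical value and the test statistic can be obtained directly. This is a minimal sketch in R, assuming the quoted values MSR = 252,378 and MSE = 2,384, and n = 25 (so that the denominator degrees of freedom are 23, matching F(0.95; 1, 23)).

```r
# F test of H0: beta1 = 0 from the quoted mean squares.
n     <- 25
MSR   <- 252378
MSE   <- 2384
alpha <- 0.05

F_star    <- MSR / MSE                          # test statistic F*
F_crit    <- qf(1 - alpha, df1 = 1, df2 = n - 2)  # F(0.95; 1, 23)
reject_H0 <- F_star > F_crit                    # TRUE means: conclude Ha

p_value <- pf(F_star, df1 = 1, df2 = n - 2, lower.tail = FALSE)
```
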

Equivalence of F test and t Test

For a given α level, the F test of

H0: β1 = 0, Ha: β1 ≠ 0

is algebraically equivalent to the two-tailed t test:

$$ F^* = \frac{SSR/1}{SSE/(n-2)} = \frac{b_1^2 \sum (X_i - \bar{X})^2}{MSE} = \frac{b_1^2}{s^2\{b_1\}} = \left( \frac{b_1}{s\{b_1\}} \right)^2 = (t^*)^2 $$

$$ [t(1-\alpha/2; n-2)]^2 = F(1-\alpha; 1, n-2) $$

t test: two-tailed; F test: one-tailed (upper tail)
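
Both identities can be checked numerically. This is a minimal sketch in R using the Toluca values quoted on earlier slides (b1 = 3.5702, s{b1} = 0.3470, MSR = 252,378, MSE = 2,384, 23 error degrees of freedom); the small gap between (t*)² and F* comes from rounding in the quoted inputs.

```r
# Critical values: squared two-sided t percentile equals the F percentile.
t_crit <- qt(0.975, df = 23)
F_crit <- qf(0.95, df1 = 1, df2 = 23)
crit_gap <- abs(t_crit^2 - F_crit)     # zero up to floating-point error

# Test statistics from the quoted (rounded) summary values.
t_star <- 3.5702 / 0.3470
F_star <- 252378 / 2384
stat_gap <- abs(t_star^2 - F_star)     # small, due to rounding of the inputs
```
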

General Linear Test Approach

Steps for General Linear Test Approach

1 Fit the full model and obtain SSE(F)
2 Fit the reduced model under H0 and obtain SSE(R)
3 Use the test statistic and decision rule

1. Full Model

The full or unrestricted model:

Yi = β0 + β1 Xi + εi

SSE(F):
$$ SSE(F) = \sum [Y_i - (b_0 + b_1 X_i)]^2 = \sum (Y_i - \hat{Y}_i)^2 = SSE $$

2. Reduced Model

H0: β1 = 0, Ha: β1 ≠ 0

The reduced or restricted model when H0 holds:

Yi = β0 + εi

Under the reduced model the least squares estimate is b0 = Ȳ, so
$$ SSE(R) = \sum (Y_i - b_0)^2 = \sum (Y_i - \bar{Y})^2 = SSTO $$

3. Test Statistic

SSE(F ) ≤ SSE(R)

The more parameters are in the model, the better one can fit
the data and the smaller are the deviations around the fitted
regression function.

3. Test Statistic (cont.)

When SSE(F) is not much less than SSE(R), using the full model does not account for much more of the variability of the Yi than does the reduced model.

⇒ This suggests that the reduced model is adequate, i.e., H0 holds.

Put differently, when SSE(F) is close to SSE(R), the variation of the observations around the fitted regression function for the full model is almost as great as the variation around the fitted regression function for the reduced model.

3. Test Statistic (cont.)

A small difference SSE(R) − SSE(F) suggests that H0 holds.

⇔ A large difference suggests that Ha holds.
Test statistic: a function of SSE(R) − SSE(F):

$$ F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F} \sim F(df_R - df_F,\; df_F) \quad \text{when } H_0 \text{ holds} $$

Decision rule:

If F* ≤ F(1 − α; dfR − dfF, dfF), conclude H0

If F* > F(1 − α; dfR − dfF, dfF), conclude Ha

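3. Test Statistic (cont.)

For testing whether or not β1 = 0, we have

SSE(R) = SSTO, SSE(F) = SSE
dfR = n − 1, dfF = n − 2
$$ \Rightarrow F^* = \frac{SSTO - SSE}{(n-1) - (n-2)} \div \frac{SSE}{n-2} = \frac{SSR}{1} \div \frac{SSE}{n-2} = \frac{MSR}{MSE} $$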
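
The three steps of the general linear test can be sketched in R by fitting the full and reduced models explicitly. The data below are synthetic (simulated only for illustration), and the variable names are not from the slides; `anova()` applied to two nested `lm` fits reports the same F statistic.

```r
# Synthetic data, simulated only to illustrate the general linear test.
set.seed(2)
X <- seq(20, 120, by = 10)
Y <- 60 + 3.5 * X + rnorm(length(X), sd = 50)

full    <- lm(Y ~ X)    # step 1: full model      Y = b0 + b1 X
reduced <- lm(Y ~ 1)    # step 2: reduced model   Y = b0 (H0: beta1 = 0)

SSE_F <- sum(resid(full)^2);    df_F <- df.residual(full)     # n - 2
SSE_R <- sum(resid(reduced)^2); df_R <- df.residual(reduced)  # n - 1

# step 3: test statistic and its comparison with built-in results
F_star  <- ((SSE_R - SSE_F) / (df_R - df_F)) / (SSE_F / df_F)
F_anova <- anova(reduced, full)$F[2]
t_star  <- summary(full)$coefficients["X", "t value"]
```

For this hypothesis the statistic coincides with MSR/MSE and with the square of the slope's t statistic.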

Descriptive Measures of Linear Association between X
and Y


The usefulness of estimates or predictions depends upon the width of the interval and the user's needs for precision, which vary from one application to another.
No single descriptive measure of the "degree of linear association" can capture the essential information as to whether a given regression relation is useful in any particular application.

Coefficient of Determination

SSTO is a measure of the uncertainty in predicting Y when X is not considered.
SSE measures the variation in the Yi when a regression model utilizing the predictor variable X is employed.
A natural measure of the effect of X in reducing the variation in Y is to express the reduction in variation (SSTO − SSE = SSR) as a proportion of the total variation:

$$ R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO} $$

Coefficient of Determination (cont.)

R²: the coefficient of determination
0 ≤ SSE ≤ SSTO
$$ \Rightarrow 0 \le R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO} \le 1 $$

SSE = 0 ⇒ R² = 1 (all Yi = Ŷi)

SSR = 0 ⇒ R² = 0 (all Ŷi = Ȳ, i.e., b1 = 0: no linear relation between X and Y)

R² = 0 ⇒ no linear relation between X and Y:
There is no linear association between X and Y in the sample data, and the predictor variable X is of no help in reducing the variation in Yi with linear regression.
R² → 1 ⇒ the linear relation between the X and Y variables is stronger:
The closer R² is to 1, the greater is said to be the degree of linear association between X and Y.

Coefficient of Determination (cont.)

The Toluca Company: R² = 0.822

The variation in work hours is reduced by 82.2% when lot size is considered.

R² is widely used for describing the usefulness of a regression model.

Serious misunderstandings:
A high R² indicates that useful predictions can be made. (Not necessarily correct. Ex: the prediction interval at Xh = 100 is wide even though R² is high.)
A high R² indicates that the estimated regression line is a good fit. (Not necessarily correct: the true relation may be curvilinear.)
An R² near 0 indicates that X and Y are not related. (Not necessarily correct: the relation may be curvilinear.)
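
The two curvilinear caveats can be illustrated with a deliberately artificial example. This is a minimal sketch in R in which Y is an exact quadratic function of X (hypothetical data, no noise); the helper name `r_squared` is illustrative.

```r
# Hypothetical helper: R^2 of the straight-line fit of Y on X.
r_squared <- function(X, Y) {
  fit <- lm(Y ~ X)
  1 - sum(resid(fit)^2) / sum((Y - mean(Y))^2)
}

# X symmetric about 0: Y is perfectly determined by X, yet R^2 is 0.
X_sym  <- -5:5
r2_sym <- r_squared(X_sym, X_sym^2)

# X all positive: R^2 is high, yet a straight line is the wrong form.
X_pos  <- 1:10
r2_pos <- r_squared(X_pos, X_pos^2)
```

In the first case an R² near 0 coexists with an exact relation; in the second a high R² coexists with a clearly curved relation, so a residual plot would still show a systematic pattern.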

Coefficient of Correlation

A measure of linear association between Y and X when Y and X are random is the coefficient of correlation.

$$ r = \pm\sqrt{R^2} $$

A plus or minus sign is attached to this measure according to whether the slope of the fitted regression line is positive or negative.
−1 ≤ r ≤ 1
The wider the Xi are spaced, the higher R² will tend to be.
SSR: the "explained variation" in Y
SSE: the "unexplained variation"
R² is interpreted as the proportion of the total variation SSTO in Y which has been "explained" by X.

Distinction between Regression and Correlation Model

Normal Correlation Models

The regression model assumes that the X values are known constants:

the confidence coefficients and risks of errors refer to repeated sampling when the X values are kept the same from sample to sample.

Frequently, it may not be appropriate to consider the X values as known constants, for example:
daily temperatures cannot be controlled
"height of person" vs. "weight of person": neither variable is fixed in advance, so correlation is used
In such cases: the normal correlation model

Bivariate Normal Distribution

Two variables Y1, Y2: bivariate normal distribution.

Density Function
The density function of the bivariate normal distribution:
$$ f(Y_1, Y_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho_{12}^2}} \exp\left\{ -\frac{1}{2(1-\rho_{12}^2)} \left[ \left(\frac{Y_1-\mu_1}{\sigma_1}\right)^2 - 2\rho_{12}\left(\frac{Y_1-\mu_1}{\sigma_1}\right)\left(\frac{Y_2-\mu_2}{\sigma_2}\right) + \left(\frac{Y_2-\mu_2}{\sigma_2}\right)^2 \right] \right\} $$

Five parameters: µ1, µ2, σ1, σ2, ρ12

Bivariate Normal Distribution (cont.)

Marginal Distribution

If (Y1, Y2) ∼ N2(µ1, µ2, σ1, σ2, ρ12), then:

$$ Y_1 \sim N(\mu_1, \sigma_1^2): \quad f_1(Y_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left[ -\frac{1}{2} \left( \frac{Y_1 - \mu_1}{\sigma_1} \right)^2 \right] $$

$$ Y_2 \sim N(\mu_2, \sigma_2^2): \quad f_2(Y_2) = \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left[ -\frac{1}{2} \left( \frac{Y_2 - \mu_2}{\sigma_2} \right)^2 \right] $$

The converse is not generally true: normal marginal distributions do not imply that the joint distribution is bivariate normal.

Bivariate Normal Distribution (cont.)

ρ12: the coefficient of correlation between Y1 and Y2

$$ \rho_{12} = \frac{\sigma\{Y_1, Y_2\}}{\sigma\{Y_1\}\,\sigma\{Y_2\}} = \frac{\sigma_{12}}{\sigma_1 \sigma_2} $$

Y1 ⊥ Y2 (independent) ⇒ σ12 = 0 ⇒ ρ12 = 0
If Y1 and Y2 are positively related ⇒ σ12 and ρ12 are positive.
−1 ≤ ρ12 ≤ 1

Conditional Probability Distribution of Y1

(Y1, Y2) ∼ N2(µ1, µ2, σ1, σ2, ρ12), with marginals Y1 ∼ N(µ1, σ1²) and Y2 ∼ N(µ2, σ2²).
The density function of the conditional probability distribution of Y1 for a given value of Y2, where Y1|Y2 ∼ N(α1|2 + β12 Y2, σ²1|2):

$$ f(Y_1 \mid Y_2) = \frac{f(Y_1, Y_2)}{f_2(Y_2)} = \frac{1}{\sqrt{2\pi}\,\sigma_{1|2}} \exp\left[ -\frac{1}{2} \left( \frac{Y_1 - \alpha_{1|2} - \beta_{12} Y_2}{\sigma_{1|2}} \right)^2 \right] $$

$$ \alpha_{1|2} = \mu_1 - \mu_2 \rho_{12} \frac{\sigma_1}{\sigma_2}, \qquad \beta_{12} = \rho_{12} \frac{\sigma_1}{\sigma_2}, \qquad \sigma_{1|2}^2 = \sigma_1^2 (1 - \rho_{12}^2) $$

Conditional Probability Distribution of Y1 (cont.)

α1|2: the intercept of the line of regression of Y1 on Y2
β12: the slope of this line
The conditional distribution of Y1, given Y2, is equivalent to the normal error regression model (1.24).

Three important characteristics of the conditional distributions of Y1:
1 They are normal: slicing a bivariate normal distribution vertically at a given Y2 and rescaling the slice to unit area gives a normal distribution.
2 The means of the conditional probability distributions of Y1 fall on a straight line:

E{Y1|Y2} = α1|2 + β12 Y2

3 All conditional probability distributions have the same standard deviation σ1|2.

Equivalence to Normal Error Regression Model

Can we still use regression model (2.1) if Y1 and Y2 are not bivariate normal? Yes, provided that:

1 The conditional distributions of the Yi, given Xi, are normal and independent, with conditional means β0 + β1 Xi and conditional variance σ².
2 The Xi are independent random variables whose probability distribution does not involve the parameters β0, β1, σ².

Inferences on Correlation Coefficients

To study the relationship between two variables: ρ12

MLE of ρ12:

$$ r_{12} = \frac{\sum (Y_{i1} - \bar{Y}_1)(Y_{i2} - \bar{Y}_2)}{\left[ \sum (Y_{i1} - \bar{Y}_1)^2 \sum (Y_{i2} - \bar{Y}_2)^2 \right]^{1/2}} $$

r12 is a biased estimator of ρ12.

−1 ≤ r12 ≤ 1

Inferences on Correlation Coefficients (cont.)

The population is bivariate normal

H0: ρ12 = 0 vs. Ha: ρ12 ≠ 0

(for a bivariate normal population, ρ12 = 0 ⇒ Y1 ⊥ Y2)

⇐⇒ H0: β12 = 0 vs. Ha: β12 ≠ 0

⇐⇒ H0: β21 = 0 vs. Ha: β21 ≠ 0

Test statistic:

$$ t^* = \frac{r_{12}\sqrt{n-2}}{\sqrt{1 - r_{12}^2}} \sim t(n-2) \quad \text{when } H_0 \text{ holds} $$

Inferences on Correlation Coefficients (cont.)

This test statistic is identical to the regression t* test statistic b1/s{b1}, because

$$ r = \pm\left( \frac{SSR}{SSTO} \right)^{1/2} = b_1 \left( \frac{\sum (X_i - \bar{X})^2}{SSTO} \right)^{1/2} $$

The appropriate decision rule to control the Type I error risk at α:

If |t*| ≤ t(1 − α/2; n − 2), conclude H0

If |t*| > t(1 − α/2; n − 2), conclude Ha

Example: Inferences on Correlation Coefficients

A national oil company: service station gasoline sales vs. sales of auxiliary products
Data from 23 of its service stations
Average monthly sales data: Y1 = gasoline sales, Y2 = sales of auxiliary products and services
r12 = 0.52; α = 0.05
To test whether or not the association is positive:

H0: ρ12 ≤ 0 vs. Ha: ρ12 > 0

t* = 2.79 > t(0.95; 21) = 1.721 (P-value = 0.006) ⇒ conclude Ha
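
The test statistic in this example follows directly from r12 and n. This is a minimal sketch in R, assuming only the quoted values r12 = 0.52, n = 23 and α = 0.05; the variable names are illustrative.

```r
# One-sided test of H0: rho12 <= 0 vs. Ha: rho12 > 0.
r12   <- 0.52
n     <- 23
alpha <- 0.05

t_star    <- r12 * sqrt(n - 2) / sqrt(1 - r12^2)   # test statistic t*
t_crit    <- qt(1 - alpha, df = n - 2)             # t(0.95; 21)
reject_H0 <- t_star > t_crit                       # TRUE means: conclude Ha

p_value <- pt(t_star, df = n - 2, lower.tail = FALSE)   # upper-tail P-value
```
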

Interval Estimation of ρ12

The Fisher z transformation:

$$ z' = \frac{1}{2} \ln\left( \frac{1 + r_{12}}{1 - r_{12}} \right) $$

When n is large (n ≥ 25), the sampling distribution of z' is approximately N(E{z'}, σ²{z'}), with

$$ E\{z'\} = \zeta = \frac{1}{2} \ln\left( \frac{1 + \rho_{12}}{1 - \rho_{12}} \right) \qquad (2.90) $$
$$ \sigma^2\{z'\} = \frac{1}{n-3} \quad \text{(depends only on } n\text{)} \qquad (2.91) $$
(z', ζ): Table B.8

Interval Estimation of ρ12 (cont.)

Interval estimate (n ≥ 25):

$$ \frac{z' - \zeta}{\sigma\{z'\}} \sim N(0, 1) \quad \text{(approximately)} $$
⇒ 1 − α confidence limits for ζ: z' ± z(1 − α/2)σ{z'}

The 1 − α confidence limits for ρ12 are obtained by transforming the limits on ζ back by means of (2.90).
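
In R the Fisher transformation and its inverse are available as `atanh()` and `tanh()`, so the table lookup can be replaced by a short computation. This is a minimal sketch that takes σ{z'} = 1/√(n − 3), the standard large-sample result; the function name `fisher_ci` and the inputs r = 0.61, n = 30 are hypothetical values chosen for illustration.

```r
# Hypothetical helper: approximate confidence limits for rho12 via Fisher's z.
fisher_ci <- function(r, n, level = 0.95) {
  z_prime <- atanh(r)                        # 0.5 * log((1 + r) / (1 - r))
  se      <- 1 / sqrt(n - 3)                 # sigma{z'}, depends only on n
  z_crit  <- qnorm(1 - (1 - level) / 2)
  zeta_limits <- z_prime + c(-1, 1) * z_crit * se   # limits on zeta
  tanh(zeta_limits)                          # transform back to the rho scale
}

ci_rho <- fisher_ci(r = 0.61, n = 30)        # illustrative inputs only
```

The resulting interval is not symmetric about r, because the back-transformation is nonlinear.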

Interval Estimation of ρ12 (cont.)

A C.I. for ρ12 can be employed to test whether or not ρ12 has a specified value (e.g., 0.5).
0 ≤ ρ12² ≤ 1: ρ12² measures the relative reduction in the variability of one variable associated with the use of the other:

$$ \rho_{12}^2 = \frac{\sigma_1^2 - \sigma_{1|2}^2}{\sigma_1^2} $$
$$ \rho_{12}^2 = \frac{\sigma_2^2 - \sigma_{2|1}^2}{\sigma_2^2} $$

Spearman Rank Correlation Coefficient

When no appropriate transformations can be found, a nonparametric rank correlation procedure may be useful for making inferences about the association between Y1 and Y2.
Rank the observations on Y1 and on Y2 separately (ranks Ri1 and Ri2); rs is the ordinary Pearson product-moment correlation coefficient computed on the ranks:

$$ r_s = \frac{\sum (R_{i1} - \bar{R}_1)(R_{i2} - \bar{R}_2)}{\left[ \sum (R_{i1} - \bar{R}_1)^2 \sum (R_{i2} - \bar{R}_2)^2 \right]^{1/2}} $$

Test (two-sided):

H0: There is no association between Y1 and Y2

Ha: There is an association between Y1 and Y2

$$ t^* = \frac{r_s\sqrt{n-2}}{\sqrt{1 - r_s^2}} \sim t(n-2) \quad \text{(approximately, when } H_0 \text{ holds)} $$
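
The definition of rs as a Pearson correlation on ranks can be checked against R's built-in Spearman option. This is a minimal sketch on synthetic data (simulated only for illustration); the variable names are not from the slides.

```r
# Synthetic, non-normal data with a monotone but nonlinear relation.
set.seed(3)
Y1 <- rexp(20)
Y2 <- Y1^2 + rnorm(20, sd = 0.5)

R1 <- rank(Y1)
R2 <- rank(Y2)

# Pearson product-moment formula applied to the ranks
rs_manual <- sum((R1 - mean(R1)) * (R2 - mean(R2))) /
  sqrt(sum((R1 - mean(R1))^2) * sum((R2 - mean(R2))^2))
rs_builtin <- cor(Y1, Y2, method = "spearman")

# Two-sided test based on the t approximation
n       <- length(Y1)
t_star  <- rs_manual * sqrt(n - 2) / sqrt(1 - rs_manual^2)
p_value <- 2 * pt(abs(t_star), df = n - 2, lower.tail = FALSE)
```
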
