Linear Models
The concept of a relation between two variables, such as between family income and family expenditures, is a familiar one. We can distinguish between a functional relation and a statistical relation.

Functional relation between two variables.
A functional relation between two variables is expressed by a mathematical formula

$Y = f(X)$

where $X$ is the independent variable and $Y$ is the dependent variable.
Example: rand sales and number of units sold. If the selling price is R5, the relation is expressed by

$Y = 5X$
Number of units sold and rand sales during three recent periods were as follows:

period   number of units sold   rand sales
  1                5                 25
  2               10                 50
  3               15                 75

These observations are plotted in the figure below.
A statistical relation, unlike a functional relation, is not a perfect one. In general, the observations for a statistical relation do not fall directly on the curve of relationship. For a certain set of observations the following plot is drawn.

However, this relation is not a perfect one. There is a scattering of points suggesting that some of the variation in Y is not accounted for by X. The plotted line indicates the general tendency by which Y varies with changes in X. The scattering of points around the line represents variation in Y which is not associated with X, and which is usually considered to be of a random nature.
We will start with the case where there is only one independent variable and the regression function is linear. In this case we consider the following model:

(2.1)  $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

where:
$Y_i$ is the value of the response in the $i$th trial;
$\beta_0$ and $\beta_1$ are parameters;
$X_i$ is a known constant, namely the value of the independent variable in the $i$th trial;
$\varepsilon_i$ is a random error term with mean $E(\varepsilon_i) = 0$ and variance $\sigma^2(\varepsilon_i) = \sigma^2$;
$\varepsilon_i$ and $\varepsilon_j$ are uncorrelated, so the covariance $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i, j$ such that $i \neq j$.

This model is said to be simple, linear in the parameters, and linear in the independent variable.
Let us notice that:
1. The observed value of Y in the $i$th trial is the sum of two components: the constant term $\beta_0 + \beta_1 X_i$ and the random term $\varepsilon_i$. Hence $Y_i$ is a random variable.
2. Since $E(\varepsilon_i) = 0$, it follows that

(2.2)  $E(Y_i) = E(\beta_0 + \beta_1 X_i + \varepsilon_i) = \beta_0 + \beta_1 X_i + E(\varepsilon_i) = \beta_0 + \beta_1 X_i$

Therefore the regression function for simple linear regression is

(2.3)  $E(Y) = \beta_0 + \beta_1 X$

3. The observed value of Y in the $i$th trial differs from the value of the regression function by the error term amount $\varepsilon_i$.
4. The error terms $\varepsilon_i$ are assumed to have constant variance $\sigma^2$. It therefore follows that the variance of the response $Y_i$ is:

(2.4)  $\sigma^2(Y_i) = \sigma^2$.
The parameters $\beta_0$ and $\beta_1$ in regression model (2.1) are called regression coefficients. $\beta_1$ is the slope of the regression line. It indicates the change in the mean of the probability distribution of Y per unit increase in X. The parameter $\beta_0$ is the Y intercept of the regression line. If the scope of the model includes $X = 0$, $\beta_0$ gives the mean of the probability distribution of Y at $X = 0$. When the scope of the model does not cover $X = 0$, $\beta_0$ does not have any particular meaning as a separate term in the regression model.

We shall denote the (X, Y) observations for the first trial by $(X_1, Y_1)$, for the second trial by $(X_2, Y_2)$, and in general for the $i$th trial by $(X_i, Y_i)$, where $i = 1, 2, \ldots, n$. To find estimators for $\beta_0$ and $\beta_1$ we use the method of least squares. According to the method of least squares, the estimators of $\beta_0$ and $\beta_1$ are those values $b_0$ and $b_1$, respectively, that minimize the criterion Q given by:

(2.8)  $Q = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$

To find these we differentiate Q with respect to $\beta_0$ and $\beta_1$ and solve the system of equations $\partial Q / \partial \beta_0 = 0$ and $\partial Q / \partial \beta_1 = 0$. From these we get the following:

(2.9a)  $\sum Y_i = n b_0 + b_1 \sum X_i$
(2.9b)  $\sum X_i Y_i = b_0 \sum X_i + b_1 \sum X_i^2$

The equations (2.9a) and (2.9b) are called normal equations; $b_0$ and $b_1$ are called point estimators of $\beta_0$ and $\beta_1$, respectively. Solving them one gets the following solutions:

(2.10a)  $b_1 = \dfrac{\sum X_i Y_i - \frac{\sum X_i \sum Y_i}{n}}{\sum X_i^2 - \frac{(\sum X_i)^2}{n}} = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$

(2.10b)  $b_0 = \frac{1}{n}\left(\sum Y_i - b_1 \sum X_i\right) = \bar{Y} - b_1 \bar{X}$

where $\bar{Y}$ and $\bar{X}$ have the usual meaning.
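As a small numerical sketch (not part of the original notes), the estimates (2.10a) and (2.10b) can be computed directly from the data; NumPy and the helper name are illustrative choices only.

import numpy as np

def least_squares(X, Y):
    """Least squares estimates b0, b1 of (2.10a)-(2.10b)."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    Xbar, Ybar = X.mean(), Y.mean()
    # (2.10a): b1 = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
    b1 = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
    # (2.10b): b0 = Ybar - b1 * Xbar
    b0 = Ybar - b1 * Xbar
    return b0, b1

# Data used in a later worked example in these notes:
X = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
Y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
print(least_squares(X, Y))   # (10.0, 2.0), i.e. fitted line Y = 10 + 2X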
Residuals.

3. The sum of the observed values $Y_i$ equals the sum of the fitted values $\hat{Y}_i$. This follows from the first normal equation (2.9a):

$\sum_{i=1}^{n} Y_i = n b_0 + b_1 \sum X_i = \sum (b_0 + b_1 X_i) = \sum \hat{Y}_i$

4. The sum of the weighted residuals is zero when the residual in the $i$th trial is weighted by the level of the independent variable in the $i$th trial:

(2.19)  $\sum_{i=1}^{n} X_i e_i = 0$

This follows from the second normal equation (2.9b):

$\sum_{i=1}^{n} X_i e_i = \sum_{i=1}^{n} X_i (Y_i - b_0 - b_1 X_i) = \sum X_i Y_i - b_0 \sum X_i - b_1 \sum X_i^2 = 0$

5. $\sum_{i=1}^{n} \hat{Y}_i e_i = 0$

6. The regression line always goes through the point $(\bar{X}, \bar{Y})$.
PROBLEMS
Let us notice that the $Y_i$ come from different probability distributions with different means, depending upon the level of $X_i$. To estimate the error variance we start with

(2.21)  $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 = \sum_{i=1}^{n} e_i^2$

where SSE stands for error sum of squares or residual sum of squares.
The sum of squares SSE has $n - 2$ degrees of freedom associated with it (two are lost because both $\beta_0$ and $\beta_1$ had to be estimated in obtaining $\hat{Y}_i$).
Hence the appropriate mean square, denoted by MSE, is

(2.22)  $MSE = \dfrac{SSE}{n-2} = \dfrac{\sum (Y_i - \hat{Y}_i)^2}{n-2} = \dfrac{\sum (Y_i - b_0 - b_1 X_i)^2}{n-2} = \dfrac{\sum e_i^2}{n-2}$

where MSE stands for error mean square or residual mean square.
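Continuing the small numerical sketch from above (again an illustrative addition, not part of the original notes), SSE and MSE follow directly from the residuals:

import numpy as np

def sse_mse(X, Y, b0, b1):
    """Residual sum of squares (2.21) and error mean square (2.22)."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    e = Y - (b0 + b1 * X)        # residuals e_i = Y_i - Yhat_i
    sse = np.sum(e ** 2)         # SSE
    mse = sse / (len(Y) - 2)     # two degrees of freedom lost for b0 and b1
    return sse, mse

print(sse_mse([30, 20, 60, 80, 40, 50, 60, 30, 70, 60],
              [73, 50, 128, 170, 87, 108, 135, 69, 148, 132],
              10.0, 2.0))       # (60.0, 7.5), as in the worked example below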
To show that MSE is an unbiased estimator of $\sigma^2$ we compute $E(SSE)$. Note first that

$E(b_1^2) = \sigma^2(b_1) + [E(b_1)]^2 = \dfrac{\sigma^2}{\sum (X_i - \bar{X})^2} + \beta_1^2$

so that

$E\!\left(b_1^2 \sum (X_i - \bar{X})^2\right) = \sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2$

Also, since $E(Y_i) = \beta_0 + \beta_1 X_i$,

$E\!\left[\sum (Y_i - \bar{Y})^2\right] = (n-1)\sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2$

Writing SSE in the form (2.24b) derived below, $SSE = \sum (Y_i - \bar{Y})^2 - b_1^2 \sum (X_i - \bar{X})^2$, and taking expectations gives

$E(SSE) = (n-1)\sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2 - \sigma^2 - \beta_1^2 \sum (X_i - \bar{X})^2 = (n-2)\sigma^2$

We used the following identities:

$\sum (X_i - \bar{X}) = 0$  and  $\sum X_i^2 - n\bar{X}^2 = \sum (X_i - \bar{X})^2$

Finally

$E(SSE) = (n-2)\sigma^2$,  hence  $E(MSE) = E\!\left(\dfrac{SSE}{n-2}\right) = \dfrac{(n-2)\sigma^2}{n-2} = \sigma^2$
Computational formulas for SSE:

(2.24a)  $SSE = \sum Y_i^2 - b_0 \sum Y_i - b_1 \sum X_i Y_i$

(2.24b)  $SSE = \sum (Y_i - \bar{Y})^2 - \dfrac{\left[\sum (X_i - \bar{X})(Y_i - \bar{Y})\right]^2}{\sum (X_i - \bar{X})^2}$

(2.24c)  $SSE = \sum Y_i^2 - \dfrac{(\sum Y_i)^2}{n} - \dfrac{\left(\sum X_i Y_i - \frac{\sum X_i \sum Y_i}{n}\right)^2}{\sum X_i^2 - \frac{(\sum X_i)^2}{n}}$
PROBLEMS
Derivation of the computational formulas. Starting from

(2.21)  $SSE = \sum (Y_i - b_0 - b_1 X_i)^2$

and substituting (2.10b), $b_0 = \frac{1}{n}(\sum Y_i - b_1 \sum X_i) = \bar{Y} - b_1 \bar{X}$, together with (2.10a),

$b_1 = \dfrac{\sum X_i Y_i - \frac{\sum X_i \sum Y_i}{n}}{\sum X_i^2 - \frac{(\sum X_i)^2}{n}} = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$

we obtain

$SSE = \sum \left[ Y_i - \bar{Y} - \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}\, (X_i - \bar{X}) \right]^2$

Expanding the square,

$SSE = \sum (Y_i - \bar{Y})^2 - 2\,\dfrac{\left[\sum (X_i - \bar{X})(Y_i - \bar{Y})\right]^2}{\sum (X_i - \bar{X})^2} + \dfrac{\left[\sum (X_i - \bar{X})(Y_i - \bar{Y})\right]^2}{\sum (X_i - \bar{X})^2}$

which gives

(2.24b)  $SSE = \sum (Y_i - \bar{Y})^2 - \dfrac{\left[\sum (X_i - \bar{X})(Y_i - \bar{Y})\right]^2}{\sum (X_i - \bar{X})^2}$

Let us notice that

$\sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum (X_i Y_i - X_i \bar{Y} - \bar{X} Y_i + \bar{X}\bar{Y}) = \sum X_i Y_i - \bar{Y}\sum X_i - \bar{X}\sum Y_i + n\bar{X}\bar{Y} = \sum X_i Y_i - \dfrac{\sum X_i \sum Y_i}{n}$

and, as before, $\sum (X_i - \bar{X})^2 = \sum X_i^2 - \frac{(\sum X_i)^2}{n}$ and $\sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n}$. Substituting these into (2.24b) gives

(2.24c)  $SSE = \sum Y_i^2 - \dfrac{(\sum Y_i)^2}{n} - \dfrac{\left(\sum X_i Y_i - \frac{\sum X_i \sum Y_i}{n}\right)^2}{\sum X_i^2 - \frac{(\sum X_i)^2}{n}}$
PROBLEMS
The density of the error term $\varepsilon_i$ in the normal error model is

$f_{\varepsilon_i}(z) = f_{N(0,\sigma^2)}(z) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left(-\dfrac{z^2}{2\sigma^2}\right)$

Since $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, therefore

$f_{Y_i}(y_i) = f_{\varepsilon_i}(y_i - \beta_0 - \beta_1 X_i) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left(-\dfrac{(y_i - \beta_0 - \beta_1 X_i)^2}{2\sigma^2}\right)$

The likelihood function for the normal error model (2.25), given the sample observations $Y_1, \ldots, Y_n$, is:

(2.26)  $L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \dfrac{1}{\sqrt{2\pi\sigma^2}}\, \exp\!\left(-\dfrac{1}{2\sigma^2}(Y_i - \beta_0 - \beta_1 X_i)^2\right)$

The values of $\beta_0$, $\beta_1$ and $\sigma^2$ which maximize this likelihood function are the maximum likelihood estimators. These are:

(2.27)
parameter      maximum likelihood estimator
$\beta_0$      $b_0$, the same as (2.10b)
$\beta_1$      $b_1$, the same as (2.10a)
$\sigma^2$     $\hat{\sigma}^2 = \dfrac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n}$

Thus, the maximum likelihood estimator of $\sigma^2$ is biased, and ordinarily MSE as given in (2.22) is used.
PROBLEMS.
Question 1.
The results of a certain experiment are shown below.
i   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
X_i 5.5 4.8 4.7 3.9 4.5 6.2 6.0 5.2 4.7 4.3 4.9 5.4 5.0 6.3 4.6 4.3 5.0 5.9 4.1 4.7
Y_i 3.1 2.3 3.0 1.9 2.5 3.7 3.4 2.6 2.8 1.6 2.0 2.9 2.3 3.2 1.8 1.4 2.0 3.8 2.2 1.5
Summary calculation results are:
$\sum X_i = 100.0$, $\sum Y_i = 50.0$, $\sum X_i^2 = 509.12$, $\sum Y_i^2 = 134.84$, $\sum X_i Y_i = 257.66$.
a) Obtain the least squares estimates of $\beta_0$ and $\beta_1$, and state the estimated regression function.
b) Obtain a point estimate of the mean of Y when the X score is 5.0.
c) What is the point estimate of the change in the mean response when the X score increases by one?
Question 2
For the following set of data:
X i 30 20 60 80 40 50 60 30 70 60
Y i 73 50 128 170 87 108 135 69 148 132
1) Obtain the estimated regression function. 2) Interpret b o and b 1 .
Question 3.
Prove theorem 2.11.
Question 4.
Prove the following statements:
1) The sum of the observed values $Y_i$ equals the sum of the fitted values $\hat{Y}_i$:

$\sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} \hat{Y}_i$

2) The regression line always goes through the point $(\bar{X}, \bar{Y})$.
Question 5.
Prove the following statements:
1) The sum of the residuals is zero:

$\sum_{i=1}^{n} e_i = 0$

2) The sum of the weighted residuals is zero when the residual in the $i$th trial is weighted by the level of the independent variable in the $i$th trial:

$\sum_{i=1}^{n} X_i e_i = 0$
Throughout the remainder of this course, unless otherwise stated, we assume that the normal error model (2.25) is applicable. The model is

(3.1)  $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

where:
$\beta_0$ and $\beta_1$ are parameters;
$X_i$ are known constants;
$\varepsilon_i$ are independent $N(0, \sigma^2)$.
Inferences concerning $\beta_1$.

Sampling distribution of $b_1$:

$b_1 = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \sum \dfrac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}\, Y_i - \bar{Y}\, \dfrac{\sum (X_i - \bar{X})}{\sum (X_i - \bar{X})^2}$

but $\sum (X_i - \bar{X}) = \sum X_i - n\bar{X} = \sum X_i - \sum X_i = 0$, hence

(3.5)  $b_1 = \sum \dfrac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}\, Y_i$

According to our model the quantities $\dfrac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}$ are constants (since the $X_i$ are fixed), and we know that a linear combination of independent normally distributed random variables is normally distributed. The unbiasedness of the point estimator $b_1$ was stated earlier in the Gauss-Markov theorem. We denote the coefficients in (3.5) by

(3.5a)  $k_i = \dfrac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2}$

The variance of $b_1$ we calculate using the fact that the $Y_i$ are independent random variables, each with variance $\sigma^2$. Hence

$\mathrm{Var}(b_1) = \mathrm{Var}\!\left(\sum k_i Y_i\right) = \sum k_i^2 \,\mathrm{Var}(Y_i) = \sigma^2 \sum k_i^2 = \sigma^2\, \dfrac{\sum (X_i - \bar{X})^2}{\left[\sum (X_i - \bar{X})^2\right]^2} = \dfrac{\sigma^2}{\sum (X_i - \bar{X})^2}$
Since $(b_1 - \beta_1)/s(b_1)$ follows the t distribution with $n-2$ degrees of freedom, we can make the following probability statement:

(3.12)  $P\{t(\alpha/2;\, n-2) \le (b_1 - \beta_1)/s(b_1) \le t(1-\alpha/2;\, n-2)\} = 1 - \alpha$

where $t(\alpha/2;\, n-2)$ denotes the $(\alpha/2)100$ percentile of the t distribution with $n-2$ degrees of freedom. Since the t distribution is symmetric we know that

(3.13)  $t(\alpha/2;\, n-2) = -t(1-\alpha/2;\, n-2)$

Therefore we get the following confidence interval:

(3.14)  $P\{b_1 - t(1-\alpha/2;\, n-2)\, s(b_1) \le \beta_1 \le b_1 + t(1-\alpha/2;\, n-2)\, s(b_1)\} = 1 - \alpha$

or we can rewrite it as

(3.15)  $b_1 \pm t(1-\alpha/2;\, n-2)\, s(b_1)$
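A small sketch (not part of the original notes) of the confidence interval (3.15), using SciPy for the t percentile; the function name is an illustrative choice:

import numpy as np
from scipy import stats

def beta1_confint(X, Y, alpha=0.05):
    """Confidence interval (3.15): b1 +/- t(1-alpha/2; n-2) * s(b1)."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    n = len(Y)
    Sxx = np.sum((X - X.mean()) ** 2)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
    b0 = Y.mean() - b1 * X.mean()
    mse = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)
    s_b1 = np.sqrt(mse / Sxx)                    # s(b1) = sqrt(MSE / sum(Xi - Xbar)^2)
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)     # t(1-alpha/2; n-2)
    return b1 - t * s_b1, b1 + t * s_b1

X = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
Y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
print(beta1_confint(X, Y))   # roughly (1.89, 2.11) for these data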
Tests concerning $\beta_1$.

Sampling distribution of $b_0$.

Let $X_h$ denote the level of X for which we wish to estimate the mean response. $X_h$ may be a value which occurred in the sample, or it may be some other value of the independent variable within the scope of the model. The mean response when $X = X_h$ is denoted by $E(Y_h)$. Formula (2.12) gives us the point estimator $\hat{Y}_h$ of $E(Y_h)$:

(3.27)  $\hat{Y}_h = b_0 + b_1 X_h$

We will consider now the sampling distribution of $\hat{Y}_h$. For model (3.1), the sampling distribution of $\hat{Y}_h$ is normal, with

(3.28a)  $E(\hat{Y}_h) = E(Y_h)$

(3.28b)  $\mathrm{Var}(\hat{Y}_h) = \sigma^2\left[\dfrac{1}{n} + \dfrac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]$

When MSE is substituted for $\sigma^2$ in (3.28b), we obtain $s^2(\hat{Y}_h)$, the estimated variance of $\hat{Y}_h$:

(3.30)  $s^2(\hat{Y}_h) = MSE\left[\dfrac{1}{n} + \dfrac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]$

Statement:

(3.31)  $\dfrac{\hat{Y}_h - E(Y_h)}{s(\hat{Y}_h)}$ is distributed as $t(n-2)$ for model (3.1)

A $1 - \alpha$ confidence interval for $E(Y_h)$ is given by:

(3.32)  $\hat{Y}_h \pm t(1-\alpha/2;\, n-2)\, s(\hat{Y}_h)$
The basic idea of a prediction interval is to choose a range in the distribution of Y in which most of the observations will fall, and to declare that the next observation will fall in this range. In general, when the regression parameters are known, the $1-\alpha$ prediction limits for $Y_{h(\mathrm{new})}$ are:

(3.33)  $E(Y_h) \pm z(1-\alpha/2)\,\sigma$

When the parameters are unknown, the $1-\alpha$ prediction limits are:

(3.35)  $\hat{Y}_h \pm t(1-\alpha/2;\, n-2)\, s(Y_{h(\mathrm{new})})$

where

(3.37)  $s^2(Y_{h(\mathrm{new})}) = s^2(\hat{Y}_h) + MSE$

Using (3.30) one gets

(3.37a)  $s^2(Y_{h(\mathrm{new})}) = MSE\left[1 + \dfrac{1}{n} + \dfrac{(X_h - \bar{X})^2}{\sum (X_i - \bar{X})^2}\right]$
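A small sketch (not part of the original notes) of the prediction limits (3.35) with the estimated variance (3.37a); SciPy's t percentile and the function name are illustrative choices:

import numpy as np
from scipy import stats

def prediction_limits(X, Y, Xh, alpha=0.05):
    """Prediction limits (3.35) for a new observation at X = Xh, using (3.37a)."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    n = len(Y)
    Sxx = np.sum((X - X.mean()) ** 2)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
    b0 = Y.mean() - b1 * X.mean()
    mse = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)
    Yh = b0 + b1 * Xh                                   # point estimate (3.27)
    s_pred = np.sqrt(mse * (1 + 1.0 / n + (Xh - X.mean()) ** 2 / Sxx))   # (3.37a)
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return Yh - t * s_pred, Yh + t * s_pred

X = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
Y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
print(prediction_limits(X, Y, Xh=55))   # 95% prediction limits at X_h = 55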
Basic notations. The analysis of variance is based on the partitioning of sums of squares
and degrees of freedom associated with the response variable Y. To illustrate this approach
we will consider the following data
Run i    X_i    Y_i
1 30 73
2 20 50
3 60 128
4 80 170
5 40 87
6 50 108
7 60 135
8 30 69
9 70 148
10 60 132
The first picture shows $\bar{Y}$ together with the observed $Y_i$.
In our case SSTO = 13 660 and SSE = 60. What accounts for this difference? The difference is another sum of squares,

(3.45)  $SSR = \sum (\hat{Y}_i - \bar{Y})^2$

where SSR stands for regression sum of squares.
These deviations are shown in the next figure. Each deviation is simply the difference between the fitted value on the regression line and the mean of the fitted values $\bar{Y}$ (recall from (2.18) that the mean of the fitted values $\hat{Y}_i$ is $\bar{Y}$).

It is a very interesting property that these sums of squared deviations satisfy the relationship

(3.48)  $\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$

or, using the notation in (3.42), (3.44) and (3.45):

(3.48a)  $SSTO = SSR + SSE$

Proof:

$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y} + Y_i - \hat{Y}_i)^2 = \sum \left[(\hat{Y}_i - \bar{Y})^2 + (Y_i - \hat{Y}_i)^2 + 2(\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i)\right]$
$= \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2 + 2\sum (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i)$

The last term on the right is zero, since

$2\sum (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i) = 2\sum \hat{Y}_i (Y_i - \hat{Y}_i) - 2\bar{Y}\sum (Y_i - \hat{Y}_i)$

and

$2\bar{Y}\sum (Y_i - \hat{Y}_i) = 2\bar{Y}\sum e_i = 0$  (by (2.17))

and

$2\sum \hat{Y}_i (Y_i - \hat{Y}_i) = 2\sum \hat{Y}_i e_i = 2\sum (b_0 + b_1 X_i) e_i = 2 b_0 \sum e_i + 2 b_1 \sum X_i e_i = 0$  (by (2.17) and (2.19))

Hence (3.48) follows.
Computational formulas.

The definitional formulas for SSTO, SSR and SSE presented above are often not convenient for hand computation. Useful formulas for SSTO and SSR are:

(3.49)  $SSTO = \sum Y_i^2 - \dfrac{(\sum Y_i)^2}{n} = \sum Y_i^2 - n\bar{Y}^2$

(3.50a)  $SSR = b_1\left(\sum X_i Y_i - \dfrac{\sum X_i \sum Y_i}{n}\right) = b_1 \sum (X_i - \bar{X})(Y_i - \bar{Y})$

or

(3.50b)  $SSR = b_1^2 \sum (X_i - \bar{X})^2$
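A small sketch (not part of the original notes) of the decomposition SSTO = SSR + SSE using the computational forms (3.49) and (3.50b); the function name is an illustrative choice:

import numpy as np

def anova_table(X, Y):
    """SSTO = SSR + SSE of (3.48a), via (3.49) and (3.50b)."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    n = len(Y)
    Sxx = np.sum((X - X.mean()) ** 2)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
    ssto = np.sum(Y ** 2) - n * Y.mean() ** 2     # (3.49)
    ssr = b1 ** 2 * Sxx                           # (3.50b)
    sse = ssto - ssr
    return {"SSR": ssr, "SSE": sse, "SSTO": ssto,
            "MSR": ssr / 1, "MSE": sse / (n - 2)}

X = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
Y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
print(anova_table(X, Y))   # SSR = 13600, SSE = 60, SSTO = 13660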
PROBLEMS.
Derivation of (3.50b):

$SSR = \sum (\hat{Y}_i - \bar{Y})^2 = \sum (\hat{Y}_i^2 - 2\hat{Y}_i\bar{Y} + \bar{Y}^2) = \sum \hat{Y}_i^2 - 2n\bar{Y}^2 + n\bar{Y}^2 = \sum \hat{Y}_i^2 - n\bar{Y}^2$
$= \sum (b_0 + b_1 X_i)^2 - n\bar{Y}^2 = \sum (b_0^2 + 2 b_0 b_1 X_i + b_1^2 X_i^2) - n\bar{Y}^2$
$= n(\bar{Y} - b_1\bar{X})^2 + 2(\bar{Y} - b_1\bar{X})\, b_1 n\bar{X} + b_1^2 \sum X_i^2 - n\bar{Y}^2$
$= n\bar{Y}^2 - 2n b_1 \bar{X}\bar{Y} + n b_1^2 \bar{X}^2 + 2n b_1 \bar{X}\bar{Y} - 2n b_1^2 \bar{X}^2 + b_1^2 \sum X_i^2 - n\bar{Y}^2$
$= b_1^2 \sum X_i^2 - n b_1^2 \bar{X}^2 = b_1^2\left(\sum X_i^2 - \dfrac{(\sum X_i)^2}{n}\right) = b_1^2 \sum (X_i - \bar{X})^2$

or, more directly,

$SSR = \sum (\hat{Y}_i - \bar{Y})^2 = \sum (b_0 + b_1 X_i - \bar{Y})^2 = \sum (\bar{Y} - b_1\bar{X} + b_1 X_i - \bar{Y})^2 = b_1^2 \sum (X_i - \bar{X})^2$

so we get (3.50b). Then (3.50a) follows, since

$SSR = b_1^2 \sum (X_i - \bar{X})^2 = b_1\left[b_1 \sum (X_i - \bar{X})^2\right] = b_1 \sum (X_i - \bar{X})(Y_i - \bar{Y}) = b_1\left(\sum X_i Y_i - \dfrac{\sum X_i \sum Y_i}{n}\right)$
Degrees of freedom.
Corresponding to the partitioning of the total sum of squares SSTO there is partitioning
of associated degrees of freedom.
SSE has n - 2 degrees of freedom. SSR has one degree of freedom. Therefore SSTO
has n - 1 degrees of freedom.
Mean squares.
Basic table. The breakdowns of the total sum of squares and associated degrees of freedom are displayed in the form of an analysis of variance table (ANOVA table):

source of variation   SS                                       df      MS                   E(MS)
regression            $SSR = \sum (\hat{Y}_i - \bar{Y})^2$      1       $MSR = SSR/1$        $\sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2$
error                 $SSE = \sum (Y_i - \hat{Y}_i)^2$          n - 2   $MSE = SSE/(n-2)$    $\sigma^2$
total                 $SSTO = \sum (Y_i - \bar{Y})^2$           n - 1

In our case:

source of variation   SS        df    MS
regression            13 600     1    13 600
error                     60     8       7.5
total                 13 660     9

The total uncorrected sum of squares is

(3.53)  $SSTOU = \sum Y_i^2$

and the correction for the mean sum of squares, denoted by SS(correction for mean), is defined as $n\bar{Y}^2$, so that $SSTO = SSTOU - n\bar{Y}^2$.
F test of $\beta_1 = 0$ versus $\beta_1 \neq 0$.

The analysis of variance provides us with a battery of highly useful tests for regression models. For the simple regression case considered here, the analysis of variance provides us with a test for:

(3.58)  $H_0: \beta_1 = 0$  versus  $H_a: \beta_1 \neq 0$

Test statistic:

(3.59)  $F^* = \dfrac{MSR}{MSE}$

The decision rule:

(3.61)  If $F^* \le F(1-\alpha;\, 1,\, n-2)$, conclude $H_0$; if $F^* > F(1-\alpha;\, 1,\, n-2)$, conclude $H_a$

where $F(1-\alpha;\, 1,\, n-2)$ is the $(1-\alpha)100$ percentile of the appropriate F distribution.
Equivalence of the F test and the t test.

For a given level $\alpha$, the F test of $\beta_1 = 0$ versus $\beta_1 \neq 0$ is algebraically equivalent to the two-tailed t test. To see this, recall from (3.50b) that $SSR = b_1^2 \sum (X_i - \bar{X})^2$. Thus we can write:

$F^* = \dfrac{MSR}{MSE} = \dfrac{SSR/1}{SSE/(n-2)} = \dfrac{b_1^2 \sum (X_i - \bar{X})^2}{MSE}$

Since $s^2(b_1) = \dfrac{MSE}{\sum (X_i - \bar{X})^2}$, we have

(3.62)  $F^* = \dfrac{b_1^2 \sum (X_i - \bar{X})^2}{MSE} = \dfrac{b_1^2}{s^2(b_1)} = \left(\dfrac{b_1}{s(b_1)}\right)^2 = (t^*)^2$

Corresponding to the relation between $t^*$ and $F^*$, we have the following relation between the required percentiles of the t and F distributions in the tests:

$[t(1-\alpha/2;\, n-2)]^2 = F(1-\alpha;\, 1,\, n-2)$

Thus, at a given level $\alpha$, we can use either the t test or the F test for testing $\beta_1 = 0$ versus $\beta_1 \neq 0$. Whenever one test leads to $H_0$, so will the other, and correspondingly for $H_a$. The t test is more flexible since it can be used for one-sided alternatives involving $\beta_1$ ($\beta_1 > 0$ or $\beta_1 < 0$), while the F test cannot.
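A small numerical check of (3.62) and of the percentile relation (not part of the original notes), on the example data used throughout; SciPy is used for the percentiles:

import numpy as np
from scipy import stats

X = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
Y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)
n, Sxx = len(Y), np.sum((X - X.mean()) ** 2)
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
b0 = Y.mean() - b1 * X.mean()
mse = np.sum((Y - b0 - b1 * X) ** 2) / (n - 2)
ssr = b1 ** 2 * Sxx

t_star = b1 / np.sqrt(mse / Sxx)        # t* = b1 / s(b1)
F_star = (ssr / 1) / mse                # F* = MSR / MSE
print(t_star ** 2, F_star)              # both about 1813.3, illustrating F* = (t*)^2

# The percentiles agree as well: [t(0.975; 8)]^2 equals F(0.95; 1, 8)
print(stats.t.ppf(0.975, 8) ** 2, stats.f.ppf(0.95, 1, 8))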
Descriptive measures of association between X and Y in the regression model.

Coefficient of determination.

We saw earlier that SSTO measures the variation in the observations $Y_i$, or the uncertainty in predicting Y, when the independent variable X is not taken into account. SSE measures the variation in the $Y_i$ when a regression model utilizing the independent variable X is employed. A natural measure of the effect of X in reducing the variation in the $Y_i$ (the uncertainty in predicting Y) is

(3.69)  $r^2 = \dfrac{SSTO - SSE}{SSTO} = \dfrac{SSR}{SSTO} = 1 - \dfrac{SSE}{SSTO}$

Coefficient of correlation.

EXAMPLES

For our data we get the following:
X_i   Y_i   X_iY_i   X_i^2   Yhat_i = b_0 + b_1 X_i   e_i = Y_i - Yhat_i   e_i^2
 30    73    2190      900            70                      3              9
 20    50    1000      400            50                      0              0
 60   128    7680     3600           130                     -2              4
 80   170   13600     6400           170                      0              0
 40    87    3480     1600            90                     -3              9
 50   108    5400     2500           110                     -2              4
 60   135    8100     3600           130                      5             25
 30    69    2070      900            70                     -1              1
 70   148   10360     4900           150                     -2              4
 60   132    7920     3600           130                      2              4
500  1100   61800    28400          1100                      0             60   <- TOTALS

Therefore

$\bar{X} = \frac{1}{n}\sum X_i = \frac{500}{10} = 50$,  $\bar{Y} = \frac{1}{n}\sum Y_i = \frac{1100}{10} = 110$

$b_1 = \dfrac{\sum X_i Y_i - \frac{\sum X_i \sum Y_i}{n}}{\sum X_i^2 - \frac{(\sum X_i)^2}{n}} = \dfrac{61800 - \frac{500 \cdot 1100}{10}}{28400 - \frac{500^2}{10}} = \dfrac{6800}{3400} = 2$

$b_0 = \bar{Y} - b_1\bar{X} = 110 - 2 \cdot 50 = 10$

So

$\hat{Y}_i = 10 + 2 X_i$

In our case

$SSE = \sum (Y_i - \hat{Y}_i)^2 = \sum e_i^2 = 60$,  $MSE = \dfrac{SSE}{n-2} = \dfrac{60}{8} = 7.5$

$s^2(b_1) = \dfrac{MSE}{\sum (X_i - \bar{X})^2} = \dfrac{MSE}{\sum X_i^2 - \frac{(\sum X_i)^2}{n}} = \dfrac{7.5}{28400 - \frac{500^2}{10}} = \dfrac{7.5}{3400} = 0.002206$

The test statistic for $H_0: \beta_1 = 0$ is $t^* = b_1/s(b_1) = 2/\sqrt{0.002206} = 42.58$, and

$P(t(8) > t^* = 42.58) < 0.0005$
The ANOVA table (general form and for our data):

source of variation   SS                                       df      MS                   E(MS)
regression            $SSR = \sum (\hat{Y}_i - \bar{Y})^2$      1       $MSR = SSR/1$        $\sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2$
error                 $SSE = \sum (Y_i - \hat{Y}_i)^2$          n - 2   $MSE = SSE/(n-2)$    $\sigma^2$
total                 $SSTO = \sum (Y_i - \bar{Y})^2$           n - 1

source of variation   SS        df    MS
regression            13 600     1    13 600
error                     60     8       7.5
total                 13 660     9

F test of $\beta_1 = 0$ versus $\beta_1 \neq 0$.

The hypotheses:  $H_0: \beta_1 = 0$  versus  $H_a: \beta_1 \neq 0$.

Test statistic:

$F^* = \dfrac{MSR}{MSE} = \dfrac{13600}{7.5} = 1813.3$

The decision rule: if $F^* \le F(1-\alpha;\, 1,\, n-2)$, conclude $H_0$; if $F^* > F(1-\alpha;\, 1,\, n-2)$, conclude $H_a$.
If $\alpha = 0.05$, then $F(1-\alpha;\, 1,\, n-2) = F(0.95;\, 1,\, 8) = 5.32$.
Since $F^* = 1813.3 > 5.32$ we conclude $H_a$, that is $\beta_1 \neq 0$.

Coefficient of determination:

$r^2 = \dfrac{SSTO - SSE}{SSTO} = \dfrac{SSR}{SSTO} = 1 - \dfrac{SSE}{SSTO} = 1 - \dfrac{60}{13660} = 0.99561$

Coefficient of correlation:

$r = \sqrt{r^2} = \sqrt{0.99561} = 0.9978$

(a plus sign, since the slope of the fitted regression line is positive).
PROBLEMS.
Question 1.
The results of a certain experiment are shown below.
i 1 2 3 4 5 6 7 8 9
Xi 7 6 5 1 5 4 7 3 4
Y i 97 86 78 10 75 62 101 39 53
i 10 11 12 13 14 15 16 17 18
Xi 2 8 5 2 5 7 1 4 5
Y i 33 118 65 25 71 105 17 49 68
6) Find a 90% confidence interval for the mean of the response variable corresponding to the level of the explanatory variable equal to 5.
7) Find 95% prediction limits for a new observation of the response variable corresponding to the level of the explanatory variable equal to 5.
8) Obtain the residuals $e_i$. 9) Estimate $\sigma^2$ and $\sigma$.
Question 2.
The results of a certain experiment are shown below.
i 1 2 3 4 5 6 7 8 9 10
Xi 1 0 2 0 3 1 0 1 2 0
Y i 16 9 17 12 22 13 8 15 19 11
1) Obtain the estimated regression function. 2) Plot the estimated regression function and the data. 3) Interpret $b_0$ and $b_1$. 4) Find the 95% confidence intervals for $\beta_0$ and $\beta_1$, and interpret them. 5) Test $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$ using $t^*$ and ANOVA, with $\alpha = 0.05$. 6) Find the 95% confidence interval for the mean of the response variable corresponding to the level of the explanatory variable equal to 3.
7) Find 90% prediction limits for a new observation of the response variable corresponding to the level of the explanatory variable equal to 3.
8) Obtain the residuals $e_i$. 9) Estimate $\sigma^2$ and $\sigma$. 10) Compute $\sum e_i^2$.
Question 3.
In a test of the alternatives $H_0: \beta_1 \le 0$ versus $H_a: \beta_1 > 0$, a student concluded $H_0$. Does this conclusion imply that there is no linear association between X and Y?
Explain.
Question 4.
Show that b o as defined in (3.19) is an unbiased estimator of o .
Question 5.
Obtain the likelihood function for the sample observations Y 1 , . . . , Y n
given X 1 , . . . , X n if the normal model is assumed to be applicable.
Question 6.
The following data were obtained in the study of solution concentration.
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Xi 9 9 9 7 7 7 5 5 5 3 3 3 1 1 1
Y i 0.07 0.09 0.08 0.16 0.17 0.21 0.49 0.58 0.53 1.22 1.15 1.07 2.84 2.57 3.1
Summary calculational results are: $\sum X_i = 75$, $\sum Y_i = 14.33$, $\sum X_i^2 = 495$, $\sum Y_i^2 = 29.2117$, $\sum X_i Y_i = 32.77$.
1) Fit a linear regression function. 2) Perform an F test to determine whether or not there is lack of fit of a linear regression function. Use $\alpha = 0.05$.
Question 7.
The following data were obtained in a certain study.
i 1 2 3 4 5 6 7 8 9 10 11 12
Xi 1 1 1 2 2 2 2 4 4 4 5 5
Y i 6.2 5.8 6 9.7 9.8 10.3 10.2 17.8 17.9 18.3 21.9 22.1
Summary calculational results are: $\sum X_i = 33$, $\sum Y_i = 156$, $\sum X_i^2 = 117$, $\sum Y_i^2 = 2448.5$, $\sum X_i Y_i = 534$.
1) Fit a linear regression function. 2) Perform an F test to determine whether or not there is lack of fit of a linear regression function. Use $\alpha = 0.05$.
RESIDUALS

A residual $e_i$, as defined in (2.16), is

$e_i = Y_i - \hat{Y}_i$

As such, it may be regarded as the observed error, in distinction to the unknown true error $\varepsilon_i$ in the regression model:

(4.2)  $\varepsilon_i = Y_i - E(Y_i)$

Properties of residuals.

The mean of the n residuals $e_i$ is, by (2.17),

(4.3)  $\bar{e} = \frac{1}{n}\sum e_i = 0$

where $\bar{e}$ denotes the mean of the residuals.
The variance of the n residuals is defined as follows:

(4.4)  $\dfrac{\sum (e_i - \bar{e})^2}{n-2} = \dfrac{\sum e_i^2}{n-2} = \dfrac{SSE}{n-2} = MSE$

If the model is appropriate, MSE is an unbiased estimator of the variance of the error terms, $\sigma^2$.

Standardized residuals.

Since the standard deviation of the error terms $\varepsilon_i$ is $\sigma$, which is estimated by $\sqrt{MSE}$, we shall define the standardized residual as follows:

(4.5)  $\dfrac{e_i - \bar{e}}{\sqrt{MSE}} = \dfrac{e_i}{\sqrt{MSE}}$
Six important departures from model (3.1), the simple linear regression model with normal errors, are as follows:
1) The regression function is not linear.
2) The error terms do not have constant variance.
3) The error terms are not independent.
4) The model fits all but one or a few outlier observations.
5) The error terms are not normally distributed.
6) One or several important independent variables have been omitted from the model.
We take up now some informal ways in which graphs of residuals can be analyzed to provide information on whether any of the six types of departures from the simple linear regression model (3.1) are present.
To study this we plot residuals against explanatory variable X. The following prototype
residual plots are considered:
Figure (a) shows a prototype situation of the residual plot against X if the linear
model is appropriate. The residuals should tend to fall within a horizontal band
centered around 0, displaying no systematic tendencies to be positive and negative.
Figure (b) shows a prototype situation of a departure from the linear regression
model indicating the need for curvilinear regression function. Here the residuals
tend to vary in a systematic fashion between being positive and negative.
A similar interpretation applies if the plot is convex (like $x^2$).
Figure (c) shows a prototype situation in which the error terms do not have constant variance (the error variance increases with X). A similar picture (but oriented differently) arises when the error variance decreases with X.
Whenever data are obtained in a time sequence it is a good idea to plot the
residuals against the time, even though time has not been explicitly incorporated
as a variable in the model. The purpose is to see if there is any correlation between
the error terms over time. Example of time related effect is shown in figure (d).
If errors are time independent we would expect the residuals to follow the figure (a).
Presence of outliers
Outliers are extreme observations. In residual plot, they are points that lie far beyond
the scatter of the remaining residuals, perhaps four or more standard deviations from
zero. The following figure present standardized residuals and contains one outlier,
which is circled
Outliers can create great difficulty. When we encounter one, our first suspicion is that
the observation resulted from a mistake or other extraneous effect, and hence should
be discarded. On the other hand, outliers may convey significant information. A safe
rule is to discard an outlier if there is direct evidence that it represents an error in
recording, a miscalculation, a malfunction of equipment, or a similar type of
circumstances.
The normality of the error terms can be studied informally by examining the residuals in a variety of graphic ways. One can construct a histogram of the residuals and see if gross departures from normality are shown by it. Another possibility is to determine whether, say, about 68 percent of the standardized residuals $e_i/\sqrt{MSE}$ fall between -1 and +1, or about 90 percent fall between -1.64 and +1.64. (If the sample size is small, the corresponding t values would be used.)
Another possibility is to prepare a normal probability plot of the residuals. This is a plot of the ordered residuals against their expected values under normality. To find the expected value of the $i$th smallest residual under normality we use the following expression:

(4.6)  $\sqrt{MSE}\; z\!\left(\dfrac{i - 0.375}{n + 0.25}\right)$

where $z(A)$, as usual, denotes the $A \cdot 100$ percentile of the standard normal distribution.
One method of assessing the linearity of the normal probability plot is to calculate the coefficient of correlation (3.73) relating the residuals $e_i$ to their expected values under normality. A high value of the coefficient of correlation, say 0.9 or more, is indicative of normality.
The following test is used for determining whether or not a specified regression function adequately fits the data. The test assumes that the observations Y for a given X are 1) independent, 2) normally distributed, and 3) that the distributions of Y have the same variance $\sigma^2$. The lack of fit test requires repeat observations at one or more X levels. Repeated trials at the same level of the independent variable of the type described are called replications; the resulting observations are called replicates.

Decomposition of SSE.

Pure error component. The basic idea for the first component of SSE rests on the fact that there are replications at some levels of X. Let us denote the $i$th observation at the $j$th level of X by $Y_{i,j}$, where $i = 1, 2, \ldots, n_j$ ($n_j$ is the number of observations at the $j$th level of X) and $j = 1, 2, \ldots, c$ ($c$ is the number of observed levels of X). Let

$\bar{Y}_j = \dfrac{1}{n_j}\sum_{i=1}^{n_j} Y_{i,j}$  (the sample mean of Y at the $j$th level of X)

The sum of squared deviations of Y at the $j$th level of X is

(4.8)  $\sum_{i=1}^{n_j} (Y_{i,j} - \bar{Y}_j)^2$

Then we add these sums of squares over all levels of X and denote this sum by SSPE:

(4.9)  $SSPE = \sum_{j=1}^{c}\sum_{i=1}^{n_j} (Y_{i,j} - \bar{Y}_j)^2$

SSPE stands for pure error sum of squares. The degrees of freedom associated with SSPE are $n - c$ (where $n = \sum_{j=1}^{c} n_j$ is the number of all observations). The pure error mean square MSPE is given by:

(4.11)  $MSPE = \dfrac{SSPE}{n-c}$

Lack of fit component. The second component of SSE is:

(4.12)  $SSLF = SSE - SSPE$

where SSLF denotes the lack of fit sum of squares. It can be shown that

(4.13)  $SSLF = \sum_{j=1}^{c} n_j (\bar{Y}_j - \hat{Y}_j)^2$

where $\hat{Y}_j$ denotes the fitted value when $X = X_j$. There are $c - 2$ degrees of freedom associated with SSLF. Thus, the lack of fit mean square is

(4.14)  $MSLF = \dfrac{SSLF}{c-2}$
F test.

Test statistic:

(4.15)  $F^* = \dfrac{MSLF}{MSPE}$

The hypotheses:

(4.17)  $H_0: E(Y) = \beta_0 + \beta_1 X$  versus  $H_a: E(Y) \neq \beta_0 + \beta_1 X$

The decision rule:

(4.18)  If $F^* \le F(1-\alpha;\, c-2,\, n-c)$, conclude $H_0$; if $F^* > F(1-\alpha;\, c-2,\, n-c)$, conclude $H_a$.
Example (the data of Question 6).

1) The least squares estimates are

$b_1 = \dfrac{\sum X_i Y_i - \frac{\sum X_i \sum Y_i}{n}}{\sum X_i^2 - \frac{(\sum X_i)^2}{n}} = \dfrac{32.77 - \frac{75 \cdot 14.33}{15}}{495 - \frac{75^2}{15}} = \dfrac{-38.88}{120} = -0.324$

and

$b_0 = \bar{Y} - b_1\bar{X} = \frac{1}{15}(14.33) + 0.324 \cdot \frac{1}{15}(75) = 2.5753$

Therefore

$\hat{Y}_i = 2.5753 - 0.324\, X_i$

2) F test for lack of fit.

We have $c = 5$ levels of X and 3 replicates at each level (hence each $n_j = 3$ and $n = 15$). With $\bar{Y}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} Y_{i,j}$, the sample mean at the $j$th level of X,

$\bar{Y}_1 = \frac{1}{3}(3.1 + 2.57 + 2.84) = 2.8367$ at level $X = 1$
$\bar{Y}_2 = \frac{1}{3}(1.07 + 1.15 + 1.22) = 1.1467$ at level $X = 3$
$\bar{Y}_3 = \frac{1}{3}(0.53 + 0.58 + 0.49) = 0.5333$ at level $X = 5$
$\bar{Y}_4 = \frac{1}{3}(0.21 + 0.17 + 0.16) = 0.18$ at level $X = 7$
$\bar{Y}_5 = \frac{1}{3}(0.08 + 0.09 + 0.07) = 0.08$ at level $X = 9$

$SSPE = \sum_{j=1}^{c}\sum_{i=1}^{n_j}(Y_{i,j} - \bar{Y}_j)^2 = (2.84 - 2.8367)^2 + (2.57 - 2.8367)^2 + (3.1 - 2.8367)^2$
$\;+ (1.22 - 1.1467)^2 + (1.15 - 1.1467)^2 + (1.07 - 1.1467)^2 + (0.49 - 0.5333)^2 + (0.58 - 0.5333)^2 + (0.53 - 0.5333)^2$
$\;+ (0.16 - 0.18)^2 + (0.17 - 0.18)^2 + (0.21 - 0.18)^2 + (0.07 - 0.08)^2 + (0.09 - 0.08)^2 + (0.08 - 0.08)^2 = 0.1574$

$MSPE = \dfrac{SSPE}{n-c} = \dfrac{0.1574}{15-5} = 0.01574$

$SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \sum Y_i^2 - b_0\sum Y_i - b_1\sum X_i Y_i = 29.2117 - 2.5753(14.33) + 0.324(32.77) = 2.9251$

$SSLF = SSE - SSPE = 2.9251 - 0.1574 = 2.7677$

$MSLF = \dfrac{SSLF}{c-2} = \dfrac{2.7677}{5-2} = 0.92257$

The hypotheses:

$H_0: E(Y) = \beta_0 + \beta_1 X$  versus  $H_a: E(Y) \neq \beta_0 + \beta_1 X$

Test statistic:

$F^* = \dfrac{MSLF}{MSPE} = \dfrac{0.92257}{0.01574} = 58.613$

The decision rule: if $F^* \le F(1-\alpha;\, c-2,\, n-c)$, conclude $H_0$; if $F^* > F(1-\alpha;\, c-2,\, n-c)$, conclude $H_a$.

$F(1-\alpha;\, c-2,\, n-c) = F(0.95;\, 3,\, 10) = 3.71$

Since $F^* = 58.613 > 3.71$ we conclude $H_a$.
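A small sketch (not part of the original notes) reproducing this lack-of-fit computation; the function name is an illustrative choice and SciPy is used for the F percentile:

import numpy as np
from scipy import stats

def lack_of_fit_test(X, Y, alpha=0.05):
    """F test for lack of fit, (4.15)-(4.18): F* = MSLF / MSPE."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    n = len(Y)
    Sxx = np.sum((X - X.mean()) ** 2)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / Sxx
    b0 = Y.mean() - b1 * X.mean()
    sse = np.sum((Y - b0 - b1 * X) ** 2)
    levels = np.unique(X)                              # the c distinct X levels
    # Pure error (4.9): deviations of Y around its replicate-group mean
    sspe = sum(np.sum((Y[X == x] - Y[X == x].mean()) ** 2) for x in levels)
    c = len(levels)
    mspe = sspe / (n - c)
    mslf = (sse - sspe) / (c - 2)                      # (4.12) and (4.14)
    return mslf / mspe, stats.f.ppf(1 - alpha, c - 2, n - c)

X = [9, 9, 9, 7, 7, 7, 5, 5, 5, 3, 3, 3, 1, 1, 1]
Y = [0.07, 0.09, 0.08, 0.16, 0.17, 0.21, 0.49, 0.58, 0.53,
     1.22, 1.15, 1.07, 2.84, 2.57, 3.1]
print(lack_of_fit_test(X, Y))   # F* about 58.6 versus F(0.95; 3, 10) = 3.71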
PROBLEMS
Question 1.
Distinguish between:
1) residual and standardized residual
2) E i 0 and e 0
3) error term and residual.
Question 2.
The fitted values and residuals are
i 1 2 3 4 5 6 7 8 9 10
Y i 2.92 2.33 2.25 1.58 2.08 3.51 3.34 2.67 2.25 1.91
e i 0.18 -0.03 0.75 0.32 0.42 0.19 0.06 -0.07 0.55 -0.31
i 11 12 13 14 15 16 17 18 19 20
Y i 2.42 2.84 2.50 3.59 2.16 1.91 2.50 3.26 1.74 2.25
e i -0.42 0.06 -0.20 -0.39 -0.36 -0.51 -0.50 0.54 0.46 -0.75
a) Plot the residuals e i against the fitted values Y i . What departures from
regression model (3.1) can be studied from this plot? What are your findings?
b) Prepare a normal probability plot. Also calculate the coefficient of correlation
between the ordered residuals and their expected values. What is your conclusion?
Question 3.
The following data were obtained in the study of solution concentration.
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Xi 8 8 8 6 6 6 4 4 4 2 2 2 0 0 0
Y i 0.65 0.87 0.79 0.15 0.18 0.22 0.51 0.57 0.55 1.23 1.17 1.09 2.91 2.61 3.11
a) Fit a linear regression function
b) Perform an F test to determine whether or not there is lack of fit
of a linear regression function. Use 0. 025
c) Does the test in part b) indicate what regression function is appropriate when it
leads to conclusion that lack of fit of a linear regression exists? Explain.
Question 4.
The following data were obtained in a study of the relation between diastolic
blood pressure (Y) and age (X) for boys 5 to 13 years old
i 1 2 3 4 5 6 7 8
Xi 5 8 11 7 13 12 12 6
Y i 63 67 74 64 75 69 90 60
a) Assuming regression model (3.1) is appropriate, obtain the estimated
regression function and plot the residuals e i against X i . What does your
residual plot show?
b) Omit observation 7 from the data and obtain the estimated regression
line based on the remaining seven observations. Compare this estimated
regression function to that obtained in part a). What can you conclude
about the effect of the observation 7?
c) Using your fitted regression function in part b) obtain a 99 percent
confidence interval for a new observation Y at X 12. Does observation
Y 7 fall outside this prediction interval? What is the significance of this?
Question 5.
The following data were obtained in a certain study.
i 1 2 3 4 5 6 7 8 9 10 11 12
Xi 1 1 1 2 2 3 3 3 3 5 5 5
Y i 4.8 4.9 5.1 7.9 8.3 10.9 10.8 11.3 11.1 16.5 17.3 17.1
Summary calculational results are: ∑ X i 34, ∑ Y i 126, ∑ X 2i 122
∑ Y 2i 1554. 66, ∑ X i Y i 434.
1) Fit a linear regression function
2) Perform an F test to determine whether or not there is lack of fit
of a linear regression function. Use 0. 05.
General approach.

The least squares criterion (2.8),

$Q = \sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)^2$,

weights each observation equally. There are times when some observations should receive greater weight and others smaller weight. The weighted least squares criterion for simple linear regression is

(5.31)  $Q_w = \sum_{i=1}^{n} w_i (Y_i - \beta_0 - \beta_1 X_i)^2$

where $w_i$ is the given weight of the $i$th observation. Minimizing $Q_w$ with respect to $\beta_0$ and $\beta_1$ leads to the normal equations

(5.32)  $\sum w_i Y_i = b_0 \sum w_i + b_1 \sum w_i X_i$
        $\sum w_i X_i Y_i = b_0 \sum w_i X_i + b_1 \sum w_i X_i^2$

Solving, we get:

(5.33a)  $b_1 = \dfrac{\sum w_i X_i Y_i - \dfrac{\sum w_i X_i \sum w_i Y_i}{\sum w_i}}{\sum w_i X_i^2 - \dfrac{(\sum w_i X_i)^2}{\sum w_i}}$

(5.33b)  $b_0 = \dfrac{\sum w_i Y_i - b_1 \sum w_i X_i}{\sum w_i}$

Note that if all weights are equal, so that $w_i$ is the same for all i, the normal equations (5.32) for weighted least squares reduce to the ones for unweighted least squares (2.10). A very popular choice of weights is

(5.35)  $w_i = \dfrac{1}{X_i^2}$

This choice of weights corresponds to error term variances of the form $\sigma_i^2 = k X_i^2$, a case frequently encountered in business and economics.
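A small sketch (not part of the original notes) of the weighted least squares estimates (5.33a) and (5.33b), applied to the data of Question 1 below with the weights (5.35); the function name is an illustrative choice:

import numpy as np

def weighted_least_squares(X, Y, w):
    """Weighted least squares estimates from (5.33a)-(5.33b)."""
    X, Y, w = (np.asarray(a, dtype=float) for a in (X, Y, w))
    sw = w.sum()
    # (5.33a)
    b1 = (np.sum(w * X * Y) - np.sum(w * X) * np.sum(w * Y) / sw) / \
         (np.sum(w * X ** 2) - np.sum(w * X) ** 2 / sw)
    # (5.33b)
    b0 = (np.sum(w * Y) - b1 * np.sum(w * X)) / sw
    return b0, b1

X = np.array([2.13, 1.21, 11.0, 6.0, 5.6, 6.91, 2.97, 3.35, 10.39, 1.1, 4.36, 8.0])
Y = np.array([15.5, 11.1, 62.6, 35.4, 24.9, 28.1, 15.0, 23.2, 42.0, 10, 20, 47.5])
print(weighted_least_squares(X, Y, w=1.0 / X ** 2))   # weights w_i = 1 / X_i^2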
PROBLEMS
Question 1.
Data from a study of the relation between the size of a bid in million rands (X) and
the cost to the firm of preparing the bid in thousands rands (Y) for 12 recent bids
are presented in table below:
i 1 2 3 4 5 6 7 8 9 10 11 12
X i 2.13 1.21 11.0 6.0 5.6 6.91 2.97 3.35 10.39 1.1 4.36 8.0
Y i 15.5 11.1 62.6 35.4 24.9 28.1 15.0 23.2 42.0 10 20 47.5
The scatter plot strongly suggests that the error variance increases with X.
Fit the weighted least squares regression line using weights $w_i = 1/X_i^2$.
Question 2.
Data from a study of computer-assisted learning by 12 students, showing the total
number of responses in completing a lesson (X) and the cost of computer time
(Y, in cents), follow
i 1 2 3 4 5 6 7 8 9 10 11 12
X i 16 14 22 10 14 17 10 13 19 12 18 11
Y i 77 70 85 50 62 70 52 63 88 57 81 54
Fit the weighted least squares regression line using weights $w_i = 1/X_i^2$.
MATRICES.
The matrix A has elements denoted by a i,k where k refers to the
column and i to the row.
A a i,k
Example:
Column 1 Column 2 Column 3
Row 1 6 5 1
Row 2 13 7 2
Row 3 -1 5 2
A matrix with r rows and k columns will be referred to as an r x k matrix.
Transpose of a matrix A denoted by A ′ is the matrix that is obtained by
interchanging corresponding columns and rows of the matrix A.
For example, let A be the 3 x 2 matrix given by

      2   5
A =   7  10     that is  $a_{1,1}=2,\ a_{1,2}=5,\ a_{2,1}=7,\ a_{2,2}=10,\ a_{3,1}=3,\ a_{3,2}=4$
      3   4

then

       2   7   3
A' =   5  10   4

that is, writing B = A' with elements $b_{i,k}$,

$b_{1,1}=a_{1,1},\ b_{1,2}=a_{2,1},\ b_{1,3}=a_{3,1},\ b_{2,1}=a_{1,2},\ b_{2,2}=a_{2,2},\ b_{2,3}=a_{3,2}$
Two matrices A and B are said to be equal if they have the same dimension and all
corresponding elements are equal.
Let us notice that if Y is n1 matrix then
Y1
Y2
Y and Y ′ Y1 Y2 . . . Yn
:
Yn
Adding or subtracting two matrices requires that they have the same dimension.
The sum of two matrices is another matrix whose elements each consist of the
sum of the corresponding elements of the two matrices e.g.:
1 4 2 3 12 43 3 7
2 5 1 4 21 54 3 9
3 6 1 3 31 63 4 9
The difference of two matrices is another matrix whose elements each consist of the
difference of the corresponding elements of the two matrices e.g.:
1 4 2 3 1−2 4−3 −1 1
2 5 − 1 4 2−1 5−4 1 1
3 6 1 3 3−1 6−3 2 3
Multiplication of a matrix A by a matrix B. The product AB is only defined
when the number of columns in A equals the number of rows in B.
In general if A has dimension rc and B has dimension cs then
a 1,1 a 1,2 . . . a 1,c b 1,1 b 1,2 . . . b 1,s
a 2,1 a 2,2 . . . a 2,c b 2,1 b 2,2 . . . b 2,s
A B
: : : : : : : :
a r,1 a r,2 . . . a r,c b c,1 b c,2 . . . bc r,s
and
c
AB ∑ a i,k b k,j
k1
(we multiply the corresponding elements of i − th row of A by corresponding elements
of j − th column of B and put the sum of the products as i, j − th element
in the resulting matrix)
The determinant of A will be denoted by either |A| or det(A). For 2 x 2 and 3 x 3 matrices,

det ( a  b )  =  ad - cb
    ( c  d )

det ( a  b  c )
    ( d  e  f )  =  aei - afh - bdi + cdh + bfg - ceg
    ( g  h  i )

A formula for det A in the case of higher dimensions is given in Theorem 19.
Special types of matrices
Symmetric matrix ; If AA ′ then A is said to be symmetric matrix.
Diagonal matrix ; A diagonal matrix is a square matrix whose off-diagonal
elements are all zeros e.g:
4 0 0 0
0 1 0 0
C
0 0 −1 0
0 0 0 7
so if Cc i,j then c i,j 0 for i ≠ j.
Identity matrix; The identity matrix denoted by I is a matrix where diagonal
elements are equal to 1 and all the others equal to zero e.g.
1 0 0 0
0 1 0 0
I 44
0 0 1 0
0 0 0 1
Inverse matrix; Let A be a square matrix. If there exists a matrix B such that AB I,
then B is called the inverse of A, denoted by A −1 . Also if AB I, then it can be shown
that BA I. When there exists a matrix B such that AB BA I, the matrix A is
said to be nonsingular; in the contrary case, A is said to be singular.
Theorem 1
If a matrix has an inverse, the inverse is unique.
Theorem 2
If A has an inverse, then A −1 has an inverse and (A −1 −1 A.
Theorem 3
If A and B are nonsingular matrices, then AB has an inverse and (AB) −1 B −1 A −1 .
This can be extended to any finite number of matrices.
Theorem 4
If A is a nonsingular matrix and k is a nonzero constant, then
(kA) −1 1k A −1
Theorem 5
If A and B are m x n matrices, and a, b are scalars, then

(aA)' = aA'

and

(aA + bB)' = aA' + bB'
Theorem 6
If A is any matrix, then
A ′ ′ A
Theorem 7
Let A and B be any matrices such that AB is defined, then
AB ′ B ′ A ′
This can be extended to any finite number of matrices.
Theorem 8
If D is a diagonal matrix, then D D ′
Theorem 9
If A is any matrix, then AA ′ and A ′ A are symmetric.
Theorem 10
If A is a nonsingular matrix, then A ′ and A −1 are nonsingular and
A −1 ′ A ′ −1
The matrices A,B,C discussed in this section are assumed to have size nn.
Theorem 11
detA detA ′
Corollary 12
Any theorem about detA that is true for rows (columns) of a matrix A
is true for columns (rows).
Theorem 13
If two rows (columns) of a matrix are interchanged, the determinant for the matrix
changes sign .
Theorem 14
If each element of the i − th row of an nm matrix A contains a given factor k, then
we may write detA |k| detB, where the rows of B are the same as the rows of A
except that the number k has been factored from each element of the i − th row of A.
Theorem 15
If each element of a row of the matrix A is zero, then detA 0
Theorem 16
If two rows of a matrix A are identical, then detA 0
Theorem 17
The determinant of a matrix is not changed if the elements of the i − th row are
multiplied by a scalar k and the results are added to the corresponding elements of the
h − th row, h ≠ i.
Theorem 18
If A and B are nn matrices, then
detAB detAdetB
this result can be extended to any finite number of matrices.
Let A be any mn matrix. From this matrix, if one deletes any set r m rows and
any set s n columns, the matrix of the remaining elements is a submatrix of A.
If A is an nn matrix and if the i − th row and j − th column are deleted, the determinant
of the remaining matrix, denoted by m ij , is called the minor of a ij . We call A ij the
cofactor of the element a ij , where
A ij −1 ij m i,j
Theorem 19
Let $A_{ij}$ be the cofactor of $a_{ij}$; then

$\det A = \sum_{j=1}^{n} a_{ij} A_{ij}$

for any i.
Theorem 20
Let A be nonsingular (this is equivalent to det A != 0). Then

$A^{-1} = \dfrac{1}{\det A}\,(A_{ij})'$

where $(A_{ij})$ is the matrix of cofactors.
Example:

Let

      1  0  1
A =   2  1  2
      0  4  6

Then det(A) = 6, and the matrix of cofactors is

            -2  -12   8
(A_ij) =     4    6  -4
            -1    0   1

Hence

                                    -2    4   -1        -1/3   2/3  -1/6
A^{-1} = (1/det A) (A_ij)' = (1/6)  -12   6    0    =     -2     1     0
                                     8   -4    1         4/3  -2/3   1/6
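A small sketch (not part of the original notes) of the cofactor formula of Theorem 20, checked against NumPy's built-in inverse; the function name is an illustrative choice:

import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [2.0, 1.0, 2.0],
              [0.0, 4.0, 6.0]])

def cofactor_inverse(A):
    """A^{-1} = (1/det A) * (A_ij)', the transpose of the matrix of cofactors."""
    n = A.shape[0]
    C = np.empty_like(A)
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)   # cofactor A_ij
    return C.T / np.linalg.det(A)

print(cofactor_inverse(A))
print(np.linalg.inv(A))   # same result: [[-1/3, 2/3, -1/6], [-2, 1, 0], [4/3, -2/3, 1/6]]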
RANK OF MATRICES
An nm matrix A is said to be of the rank r if the size of the largest nonsingular
square submatrix of A is r. This definition is rather difficult to apply directly to find out
the rank of any given matrix.
Each of the following operations is called an elementary transformation of a matrix A;
1) The interchange of two rows (or two columns) of A.
2) The multiplication of the elements of a row (or a column) of A by the same
nonzero scalar k.
3) The addition of the elements of a row (or a column) of A, after they have been
multiplied by the scalar k, to the corresponding elements of another row (or another
column) of A.
We define the inverse of an elementary transformation as the transformation that restores
the resulting matrix to the original form of A.
Theorem 21
The inverse of an elementary transformation of a matrix is an elementary transformation
of the same type.
Theorem 22
The size and rank of a matrix are not altered by an elementary transformation of a matrix.
Two matrices that have the same size and rank are said to be equivalent
Theorem 23
Every elementary matrix has an inverse of the same type.
Theorem 24
Any nonsingular matrix can be written as the product of elementary matrices.
Theorem 25
The size and rank of the matrix A are not altered by multiplying and postmultiplying
A by elementary matrices.
Theorem 26
If matrices A and B are nonsingular, then for any matrix C the following
matrices AC, CB, and ACB all have the same rank (provided that all multiplication
are defined)
Theorem 27
If A is an mn matrix of rank r, then there exist nonsingular matrices
P and Q such that PAQ is equal to
I I 0
I, I 0 , ,
0 0 0
depending on n,m,r.
Theorem 28
Two matrices A and B of the same size are equivalent if and only if B can be
obtained from A by multiplying and postmultiplying by finite number of elementary
matrices.
Theorem 29
The rank of the product of matrices A and B cannot exceed the rank of either A or B.
PROBLEMS
Question 1.
1 −1 0
a) Find the determinant of A 2 1 1
1 0 3
b) Find A −1
Question 2
Suppose that you want to perform the following operations on 33 matrix A
1) Interchange the first and third row
2) Interchange the first and third columns
3) Multiply the first row by -2 and add the result to the third row
4) Multiply the first column by -2 and add the result to the third column
5) Multiply the second row by -2 and add the result to the third row
6) Multiply the second column by -2 and add the result to the third column
7) Multiply the second row by 1/2
8) Multiply the third column by 7
0 4 2
If A 4 2 0 find the resulting matrix after performing the eight transformations.
2 0 1
Question 3
For the matrices A and X
1 2 −1 1 3
2 1 0 3 4
A X
0 −3 2 1 1
−3 0 −1 −5 1
find AX, X ′ A.
REGRESSION EXAMPLE.
In regression analysis, one basic matrix is the vector Y (n1), consisting of the n
observations on the dependent variable
Y1
Y2
(6.4) Y
:
Yn
Note that the transpose Y ′ is the row vector;
(6.5) Y′ Y1 Y2 . . . Yn
Another basic matrix in regression analysis is the X (n2) matrix, which is defined
as follows for simple regression analysis:
1 X1
1 X2
(6.6) X
: :
1 Xn
The matrix X consists of column of 1’s and a column containing the n values of
the independent variable X. Note that the transpose of X is
1 1 ... 1
(6.7) X′
X1 X2 . . . Xn
The regression model;
Y i EY i i i 1, 2, . . . , n
can be written compactly in matrix notation.
First, let us define the vector (n1) of mean responses:
EY 1
EY 2
(6.9) EY
:
EY n
and the vector of error terms
1
2
(6.10)
:
n
Using Y (6.4) we can write
Y E(Y)
because
Y1 EY 1 1 EY 1 1
Y2 EY 2 2 EY 2 2
: : : :
Yn EY n n EY n n
Thus, the observations vector Y equals to the sum of two vectors, a vector
containing the expected values and another containing the error terms.
Let us define vector of the regression coefficients as follows:
o
(6.13)
1
Then the product X, where X is defined in (6.60 is
1 X1 o 1X1
1 X2 o o 1X2
(6.14) X
: : 1 :
1 Xn o 1Xn
Since o 1 X i EY i we see that X is the vector of expected values EY i
for simple linear regression model i.e. EY X where EY is defined in (6.9).
Another product frequently needed is Y ′ Y,
Y1
Y2
(6.15) Y′Y Y1 Y2 . . . Yn ∑ Y 2i
:
Yn
′
Note that Y Y is a 11 matrix,or a scalar. In a compact way we can write the sum of
squares Y ′ Y ∑ Y 2i .
We also will need X ′ X
1 X1
1 1 ... 1 1 X2 n ∑ Xi
(6.16) X′X
X1 X2 . . . Xn : : ∑ X i ∑ X 2i
1 Xn
and X ′ Y
Y1
1 1 ... 1 Y2 ∑ Yi
(6.17) X′Y
X1 X2 . . . Xn : ∑ XiYi
Yn
The principal inverse matrix in regression analysis is the inverse of
the matrix X ′ X
n ∑ Xi ∑ X i 2
det n ∑ X 2i − ∑ X i 2 n∑ X 2i − n n ∑X i − X 2
∑ Xi ∑ Xi 2
Hence
−1
∑ X 2i ∑ Xi
−
A random vector or a random matrix contains elements which are random variables.
Expectation of a random vector or matrix.
Suppose we have the following observation vector
Y1
Y Y2
Y3
The expected value of Y is a vector, denoted by E(Y), which is defined as follows:
EY 1
E(Y) EY 2
EY 3
In general we can use the following notation
(6.41) E(Y) EY i i 1, 2, . . , n
and for random matrix Y with dimension np, the expectation is
(6.42) E(Y) EY i,k i 1, 2, . . . , n k 1, 2, . . . , p
Basic theorems
PROBLEMS
Question 1.
For the matrices below obtain AB,A-B,AC,AB ′ ,B ′ A
1 4 1 3
3 8 1
A 2 6 B 1 4 C
5 4 0
3 8 2 5
Question 2.
For the matrices below obtain AC,A-C,B ′ A,AC ′
2 1 6 3 8
3 5 9 8 6
A B C
5 7 3 5 1
4 8 1 2 4
State the dimension of each resulting matrix.
Question 3.
Find the inverse of each of the following matrices
4 3 2
2 4
A B 6 5 10
3 1
10 1 6
Question 4.
Consider the following functions of the random variables
Y 1 , Y 2 and Y 3 :
W1 Y1 Y2 Y3
W2 Y1 − Y2
W3 Y1 − Y2 − Y3
a) State above in matrix notation
W1
b) Find the expectation of the random vector W W2
W3
c) Find the variance-covariance matrix of W.
the error terms have constant variance 2 2 , and are uncorrelated so
i , j 0 for i ≠ j. We can then write the variance-covariance matrix for the
random vector as follows (using the fact that E i 0 i 1, 2, 3)
1 11 12 13
2 E 2 1 2 3 E 21 22 23
3 31 32 33
E 1 1 E 1 2 E 1 3 2 0 0
E 2 1 E 2 2 E 2 3 0 2 0 2I
E 3 1 E 3 2 E 3 3 0 0 2
b) Let us consider W AY where
W1 1 −1 Y1
W 2 1 A 2 2 Y
W2 1 1 Y2
then
W1 1 −1 Y1 Y1 − Y2
W2 1 1 Y2 Y1 Y2
1 −1 EY 1 EY 1 − EY 2
E(W)
1 1 EY 2 EY 1 EY 2
and
1 −1 2 Y 1 Y 1 , Y 2 1 1
2 W)
1 1 Y 2 , Y 1 Y 2
2
−1 1
2 Y 1 2 Y 2 − 2 2 Y 1 , Y 2 2 Y 1 − 2 Y 2
2 Y 1 − 2 Y 2 2 Y 1 2 Y 2 2 2 Y 1 , Y 2
variables. The condition E i 0 in matrix terms is:
1 E 1 0
2 E 2 0
(6.54) E E 0
: : :
n E n 0
The condition that the error terms have constant variance and that
all i , j 0 for i ≠ j (since independence) is expressed in matrix
terms through the variance-covariance matrix:
2 0 . . . 0 1 0 ... 0
0 2 . . . 0 0 1 ... 0
(6.55) 2 2 2I
: : ... : : : ... :
0 0 . . . 2 0 0 ... 1
Thus the normal error model (3.1) in matrix terms is:

(6.56)  Y = X beta + epsilon
        (n x 1) = (n x 2)(2 x 1) + (n x 1)

that is,

( Y_1 )   ( 1  X_1 )                 ( e_1 )
( Y_2 ) = ( 1  X_2 ) ( beta_0 )   +  ( e_2 )
(  :  )   ( :   :  ) ( beta_1 )      (  :  )
( Y_n )   ( 1  X_n )                 ( e_n )

where:
beta is the vector of parameters;
X is the matrix of known constants, namely the values of the independent variable;
epsilon is a vector of independent normal random variables with E(epsilon) = 0 and sigma^2(epsilon) = sigma^2 I.
Recall that

X'Y = ( sum Y_i     )
      ( sum X_i Y_i )

Using this result, the normal equations X'X b = X'Y are

( n         sum X_i   ) ( b_0 )   ( sum Y_i     )
( sum X_i   sum X_i^2 ) ( b_1 ) = ( sum X_i Y_i )

or

n b_0 + b_1 sum X_i = sum Y_i
b_0 sum X_i + b_1 sum X_i^2 = sum X_i Y_i
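A small sketch (not part of the original notes) of the matrix form of the normal equations for the example data, solved as b = (X'X)^{-1} X'Y:

import numpy as np

x = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
Y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)
X = np.column_stack([np.ones_like(x), x])   # design matrix (6.6): column of 1's and X

XtX = X.T @ X                    # [[n, sum Xi], [sum Xi, sum Xi^2]]
XtY = X.T @ Y                    # [sum Yi, sum Xi Yi]
b = np.linalg.solve(XtX, XtY)    # same as (X'X)^{-1} X'Y
print(XtX)                       # [[10, 500], [500, 28400]]
print(XtY)                       # [1100, 61800]
print(b)                         # [10., 2.]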
To evaluate $(X'X)^{-1}$ for the example data we need

$n\sum (X_i - \bar{X})^2 = n\left(\sum X_i^2 - \frac{(\sum X_i)^2}{n}\right) = 10\left(28400 - \frac{500^2}{10}\right) = 34\,000$

Therefore the elements of $(X'X)^{-1}$ follow from the general expression for the inverse of X'X given above.
The residuals $e_i$ in matrix notation form the vector

(6.65)  e = ( e_1, e_2, ..., e_n )'   (n x 1)

In matrix notation we have

(6.66)  Yhat = X b

because

( Yhat_1 )   ( 1  X_1 )              ( b_0 + b_1 X_1 )
( Yhat_2 ) = ( 1  X_2 ) ( b_0 )   =  ( b_0 + b_1 X_2 )
(   :    )   ( :   :  ) ( b_1 )      (       :       )
( Yhat_n )   ( 1  X_n )              ( b_0 + b_1 X_n )

Similarly:

(6.67)  e = Y - Yhat = Y - X b

Sums of squares. Since $\bar{Y} = \frac{1}{n}\mathbf{1}'Y$, the correction for the mean can be written $n\bar{Y}^2 = \frac{1}{n} Y'\mathbf{1}\mathbf{1}'Y$, and hence

(6.70a)  SSTO = Y'Y - (1/n) Y'11'Y

In the same way as for $\sum Y_i^2$, we obtain that SSE = $\sum e_i^2$ in matrix terms is:

(6.70b)  SSE = e'e = (Y - Xb)'(Y - Xb)

which can be shown to equal:

(6.70c)  SSE = Y'Y - b'X'Y

For SSR = SSTO - SSE in matrix terms we have:

(6.70d)  SSR = b'X'Y - (1/n) Y'11'Y
Example.
73
50
128
170
87
1′Y 1 1 1 1 1 1 1 1 1 1 1100
108
135
69
148
132
′ ′
Y 11 Y
1
n 1100 1100 121 000
1
10
and finally:
SSR b ′ X ′ Y − 1n Y ′ 11 ′ Y 134 600 − 121 000 13 600
(6.76c) I − XX ′ X −1 X ′
are symmetric. Hence, SSTO, SSR, and SSE are quadratic forms, with the matrices
of quadratic forms given in (6.76)
Regression coefficients
The variance-covariance matrix of b:
bo 2 b o b o , b 1
(6.77) 2 b 2
b1 b 1 , b o 2 b 1
is
(6.78) 2 b 2 X ′ X −1
22
or, using (6.27)
∑ X 2i
2
X 2
−
n ∑X i −X 2 ∑X i −X 2
(6.78a) 2 b
X 2 2
−
∑X i −X 2 ∑X i −X 2
When MSE is substituted for 2 in (6.78a) we have
MSE ∑ X 2i
− XMSE
coefficients in (6.79).
The estimated variance s 2 Y hnew given earlier in (3.37), is in matrix notation:
(6.85) s 2 Y hnew MSE s 2 Y h MSE X ′h s 2 bX h
MSE1 X ′h X ′ X −1 X h
Example:
1) For data (examples previously considered)
−1
1 30
1 20
1 60
1 80
1 1 1 1 1 1 1 1 1 1 1 40
X ′ X −1
30 20 60 80 40 50 60 30 70 60 1 50
1 60
1 30
1 70
1 60
−1
10 500
71
85
− 681 . 835 29 −0. 1470 588
500 28 400 − 68
1 1
3400
−0. 1470 588 0. 002 941 2
We found earlier that MSE 7. 5
Hence
s 2 b MSEX ′ X −1
. 835 29 −0. 1470 588 6. 264 706 −0. 1102941
7. 5
−0. 1470 588 0. 002 941 2 −0. 1102941 0. 0022059
Thus
s 2 b o 6. 264706 and s 2 b
1 0. 002206
For the same data we will find s Y h when X h 55.
2
The regression results for the weighted least squares can be stated in matrix algebra
using the following notation:
w1 0 . . . 0
0 wi . . . 0
(6.88) W
nn : : : :
0 0 . . . wn
is the diagonal matrix containing the weights w i .
The weighted normal equations (5.32) can be rewritten as
(6.89) X ′ WXb X ′ WY
and the weighted least squares estimators are:
(6.90) b X ′ WX −1 X ′ WY
21
The variance-covariance matrix of the weighted least squares estimators is:
(6.91) 2 b 2 X ′ WX −1
22
and the estimated variance-covariance matrix is:
(6.92) s 2 b MSE w X ′ WX −1
22
where MSE w is based on the weighted
squared deviations:
∑ w i Y i −Y i 2
(6.92a) MSE w n−2
Residuals

For the analysis of residuals it is useful to recognize that each residual $e_i$ can be expressed as a linear combination of the observations $Y_i$. It can be shown that e, defined in (6.65), equals:

(6.93)  e = (I - H) Y
        (n x 1) = (n x n)(n x 1)

where

(6.93a)  H = X(X'X)^{-1}X'   (n x n)

Note from (6.76c) that the matrix I - H is the matrix of the quadratic form for SSE = $\sum e_i^2$. The square n x n matrix H is called the hat matrix and plays an important role in regression analysis.
The variance-covariance matrix of e can be derived by means of (6.49):

sigma^2(W) = sigma^2(AY) = A sigma^2(Y) A'

Since e = (I - H)Y,

sigma^2(e) = (I - H) sigma^2(Y) (I - H)'

Now sigma^2(Y) = sigma^2 I for the normal error model. The matrix I - H has a special property: it is symmetric and idempotent, that is (I - H)(I - H) = I - H. Hence

sigma^2(e) = sigma^2 (I - H) I (I - H) = sigma^2 (I - H).
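A small sketch (not part of the original notes) of the hat matrix (6.93a) and the resulting residual variance-covariance matrix, on the example data:

import numpy as np

x = np.array([30, 20, 60, 80, 40, 50, 60, 30, 70, 60], dtype=float)
Y = np.array([73, 50, 128, 170, 87, 108, 135, 69, 148, 132], dtype=float)
X = np.column_stack([np.ones_like(x), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix H = X (X'X)^{-1} X'
e = (np.eye(len(Y)) - H) @ Y                  # residuals e = (I - H) Y, same as Y - Xb
mse = e @ e / (len(Y) - 2)
print(np.allclose(H @ H, H), np.allclose(H, H.T))   # H is idempotent and symmetric
print(np.round(e, 6))                                # 3, 0, -2, 0, -3, -2, 5, -1, -2, 2
print(mse * (np.eye(len(Y)) - H))                    # estimated s^2(e) = MSE (I - H)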
PROBLEMS.
Question 1.
Assume that the normal regression model is applicable.
For the following data given by:
i 1 2 3 4 5
Xi 8 4 0 -4 -8
Y i 7.8 9 10.2 11 11.7
using matrix method find:
1) Y ′ Y 2) X ′ X 3) X ′ Y
4) b 5) ANOVA table
6) covariance-variance matrix s 2 b
Question 2.
Find the matrix A of the quadratic form:
3Y 21 10Y 1 Y 2 6Y 22
Question 3.
Assume that the normal regression model is applicable.
For the following data given by:
i 1 2 3 4 5 6
Xi 4 1 2 3 3 4
Y i 16 5 10 15 13 22
using matrix method find:
1) Y ′ Y 2) X ′ X 3) X ′ Y 4) b 5) ANOVA table
6) covariance-variance matrix s 2 b
7) Y 8) s 2 Y hnew when X h 3. 5
Question 4.
For the matrix
1 0 4
A 0 3 0
4 0 9
find the quadratic form of the observations Y 1 , Y 2 and Y 3 .
Question 5.
The results of a certain experiments are shown below
i 1 2 3 4 5 6 7 8 9
Xi 7 6 5 1 5 4 7 3 4
Y i 97 86 78 10 75 62 101 39 53
i 10 11 12 13 14 15 16 17 18
Xi 2 8 5 2 5 7 1 4 5
Y i 33 118 65 25 71 105 17 49 68
Multiple regression analysis is one of the most widely used of all statistical
tools. In many practical situations a number of key independent variables
affects the response variable in important and distinctive way. Furthermore,
in such case one will find that predictions of the response variable based on
the model containing only a single independent variable are to imprecise to be
useful. A more complex model, containing additional independent variables,
is more helpful in providing sufficiently precise predictions of the response
variable.
Note that a point on the response plane corresponds to the mean response EY
at the given combination of levels of X 1 and X 2 . Figure above also shows
a series of observations Y i corresponding to given levels of the two independent
variables X i,1 , X i,2 . Note that each vertical rule in picture above represents the
difference between Y i and the mean EY i . Hence, the vertical distance from Y i to
the response plane represents the error term i Y i − EY i . Frequently the regression
function in multiple regression is called a regression surface or a response surface.
Meaning of regression coefficients.
We consider now the case where there are p − 1 independent variables X 1 , . . . X p−1 .
The model
(7.5) Y i o 1 X i,1 2 X i,2 . . . p−1 X i,p−1 i
is called a first-order model with p − 1 independent variables. It can also be written:
p−1
(7.5a) Y i o ∑ k X i,k i
k1
Assuming that E i 0, the response function model for model (7.5) is:
(7.6) EY i o 1 X i,1 2 X i,2 . . . p−1 X i,p−1
This response function is a hyperplane, which is a plane in more than two
dimensions.
It is a remarkable property of matrix algebra that the results for the general
linear regression model (7.7) appear exactly the same in matrix notation
as those for the simple linear regression model (6.56).
To express the general linear regression model in matrix terms, we need to
define the following matrices:
Y1
Y2
(7.17a) Y
n1 :
Yn
1 X 1,1 X 1,2 . . . X 1,p−1
1 X 2,1 X 2,2 . . . X 2,p−1
(7.17b) X
np : : : ... :
1 X n,1 X n,2 . . . X n,p−1
o
1
(7.17c)
p1 :
p−1
1
2
(7.17d)
n1 :
n
Note that Y and vectors are the same as for simple regression. The vector
contains additional regression parameters, and X matrix contains a column of 1’s
as well as a column of the n values for the each of the p − 1 X variables in the
regression model. The row subscript for each element X i,k in the X matrix identifies
the trial, and the column subscript identifies the X variable.
In matrix terms, the general linear regression model (7.7) is:
(7.18) YX
n1 npp1 n1
where:
Y is a vector of observations
is a vector of parameters
X is a matrix of constants
is a vector of independent normal random variables with expectation
E 0 and variance-covariance matrix 2 2 I
Consequently, the random vector Y has expectation
(7.18a) EY X
and the variance-covariance matrix of Y is:
(7.18b) 2 Y 2 I
The least squares normal equations for the general linear regression model (7.18) are:

(7.20)  X'X b = X'Y

and the least squares estimators are:

(7.21)  b = (X'X)^{-1} X'Y
        (p x 1)

For model (7.18), these least squares estimators are also maximum likelihood estimators and have all the properties stated before: they are unbiased, minimum variance unbiased, and sufficient.
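A small sketch (not part of the original notes) of (7.21) with two independent variables, using the data of Example 1 below; the printed values match the coefficients and MSE reported later in that example:

import numpy as np

X1 = np.array([274, 180, 375, 205, 86, 265, 98, 330, 195, 53, 430, 372, 236, 157, 370], dtype=float)
X2 = np.array([2450, 3254, 3802, 2838, 2347, 3782, 3008, 2450, 2137, 2560, 4020, 4427, 2660, 2088, 2605], dtype=float)
Y  = np.array([162, 120, 223, 131, 67, 169, 81, 192, 116, 55, 252, 232, 144, 103, 212], dtype=float)

X = np.column_stack([np.ones_like(X1), X1, X2])   # n x p design matrix (7.17b)
b = np.linalg.solve(X.T @ X, X.T @ Y)             # least squares estimates (7.21)
e = Y - X @ b
mse = e @ e / (len(Y) - X.shape[1])               # MSE = SSE / (n - p)
print(b)      # approximately [3.4526, 0.4960, 0.0092]
print(mse)    # approximately 4.74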
Source of variation SS df MS
Regression SSR b ′ X ′ Y − 1n Y ′ 11 ′ Y p − 1 MSR SSR
p−1
is given by:
(7.37) 2 b 2 X ′ X −1
The estimated variance-covariance matrix s 2 b :
s 2 b 0 sb 0 , b 1 . . . sb 0 , b p−1
sb 1 , b 0 s 2 b 1 . . . sb 1 , b p−1
(7.38) s b
2
: : ... :
sb p−1 , b 0 sb p−1 , b 1 . . . s 2 b p−1
is given by
(7.39) s 2 b MSEX ′ X −1
From s 2 b, one can obtain s 2 b 0 , s 2 b 1 or whatever other variance is needed, or any
needed covariance.
Interval estimation of k
Tests for k
Joint inferences
The boundary of the joint confidence region for all p of the k regression parameters
(k 0, 1, . . . , p − 1) with confidence coefficient 1 − is:
b− ′ X ′ Xb−
(7.43) pMSE
F1 − , p, n − p
The region defined by this boundary is generally difficult to obtain and interpret.
The Bonferroni joint confidence intervals, are easy to obtain and interpret.
If g parameters are to be estimated jointly (where g p), the confidence
limits with family confidence coefficient 1 − are:
(7.44) b k Bsb k
where
(7.44a) B t1 − /2g, n − p
For given values of X 1 , X 2 , . . . , X p denoted by X h,1 , X h,2 , . . . , X h,p , the mean
response is denoted by EY h . We define the vector X h :
1
X h,1
(7.45) Xh X h,2
:
X h,p−1
so the mean response to be estimated is:
(7.46) EY h X ′h
The estimated
mean response corresponding to X h denoted by Y h , is:
(7.47) Y h X ′h b
This estimator is unbiased:
(7.48) EY h EX ′h b X ′h Eb X ′h EY h
and its variance is: 2 ′ ′ −1
(7.49) Y h X h X X X h X ′h 2 bX h
2
Note that the variance 2 Y h is a function of the variances 2 b k of the regression
coefficients and of the covariances b k , b j between pairs of the regression
coefficients, just
as in simple regression. The estimated variance s 2
Y h is given by
(7.50) s 2 Y h MSEX ′h X ′ X −1 X h X ′h s 2 bX h
The 1 − confidence
h are:
limits for EY
(7.51) Y h t1 − /2; n − psY h
The hypothesis
H o: EY o 1 X 1 . . . p−1 X p−1
H a: EY ≠ o 1 X 1 . . . p−1 X p−1
The decision rule
If F ∗ F1 − ; c − p, n − c conclude H o
If F ∗ F1 − ; c − p, n − c conclude H a
MSE
m X ′h s 2 bX h MSE m1 X ′h X ′ X −1 X h
So
∂Q
∂b o
0 2 ∑ i −Y i b o b 1 X i,1 b 2 X i,2 0
Therefore first normal equation is
nb o b 1 ∑ X i,1 b 2 ∑ X i,2 ∑ Y i
∂Q
∂b 1
∂b∂ 1 ∑ i Y i − b o − b 1 X i,1 − b 2 X i,2 2 2 ∑ i −X i,1 Y i X i,1 b o b 1 X 2i,1 X i,1 b 2 X i,2
Next
∂Q
∂b 1
0 ∑ i −X i,1 Y i X i,1 b o b 1 X 2i,1 X i,1 b 2 X i,2 0
Hence the second normal equation is
b o ∑ X i,1 b 1 ∑ X 2i,1 b 2 ∑ X i,1 X i,2 ∑ X i,1 Y i
Finally
∂Q
∂b 2
∂b∂ 2 ∑ i Y i − b o − b 1 X i,1 − b 2 X i,2 2 2 ∑ i −X i,2 Y i X i,2 b o X i,2 b 1 X i,1 b 2 X 2i,2
So
∂Q
∂b 2
0 ∑ i −X i,2 Y i X i,2 b o X i,2 b 1 X i,1 b 2 X 2i,2 0
and the third normal equation is
b o ∑ X i,2 b 1 ∑ X i,1 X i,2 b 2 ∑ X 2i,2 ∑ X i,1 Y i
EXAMPLE 1:
Basic calculations
162 1 274 2450
120 1 180 3254
223 1 375 3802
131 1 205 2838
67 1 86 2347
169 1 265 3782
81 1 98 3008
Y 192 X 1 330 2450
116 1 195 2137
55 1 53 2560
252 1 430 4020
232 1 372 4427
144 1 236 2660
103 1 157 2088
212 1 370 2605
′
1 274 2450 1 274 2450
1 180 3254 1 180 3254
1 375 3802 1 375 3802
1 205 2838 1 205 2838
1 86 2347 1 86 2347
1 265 3782 1 265 3782
1 98 3008 1 98 3008
′
1) X X 1 330 2450 1 330 2450
1 195 2137 1 195 2137
1 53 2560 1 53 2560
1 430 4020 1 430 4020
1 372 4427 1 372 4427
1 236 2660 1 236 2660
1 157 2088 1 157 2088
1 370 2605 1 370 2605
15 3626 44 428
3626 1067 614 11 419 181
44 428 11 419 181 139 063 428
Let us consider the following data:
61
′
1 274 2450 162
1 180 3254 120
1 375 3802 223
1 205 2838 131
1 86 2347 67
1 265 3782 169
1 98 3008 81 2259
′
2) X Y 1 330 2450 192 647 107
1 195 2137 116 7096 619
1 53 2560 55
1 430 4020 252
1 372 4427 232
1 236 2660 144
1 157 2088 103
1 370 2605 212
−1
15 3626 44 428
3) X ′ X −1 3626 1067 614 11 419 181
44 428 11 419 181 139 063 428
1. 246 348 416 2. 129 664 176 10 −4 −4. 156 712 541 10 −4
2. 129 664 176 10 −4 7. 732 903 033 10 −6 −7. 030 251 792 10 −7
−4. 156 712 541 10 −4 −7. 030 251 792 10 −7 1. 977 185 133 10 −7
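For readers who want to verify these matrices, here is a small NumPy sketch (ours, not part of the notes) that reproduces X'X, X'Y and (X'X)⁻¹ from the listed data, up to rounding:

    import numpy as np

    # Data of EXAMPLE 1, exactly as listed above.
    Y  = np.array([162, 120, 223, 131, 67, 169, 81, 192, 116, 55, 252, 232, 144, 103, 212], float)
    X1 = np.array([274, 180, 375, 205, 86, 265, 98, 330, 195, 53, 430, 372, 236, 157, 370], float)
    X2 = np.array([2450, 3254, 3802, 2838, 2347, 3782, 3008, 2450, 2137, 2560,
                   4020, 4427, 2660, 2088, 2605], float)

    X = np.column_stack((np.ones_like(Y), X1, X2))
    XtX = X.T @ X                    # reproduces the matrix in 1)
    XtY = X.T @ Y                    # reproduces the vector in 2)
    XtX_inv = np.linalg.inv(XtX)     # reproduces 3), up to rounding
    b = XtX_inv @ XtY                # estimated regression coefficients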
Algebraic equivalents
Note that X'X is the product of the 3×n matrix X' and the n×3 matrix X:

             n           ∑ X_i,1          ∑ X_i,2
(7.63) X'X = ∑ X_i,1     ∑ X²_i,1         ∑ X_i,1 X_i,2
             ∑ X_i,2     ∑ X_i,1 X_i,2    ∑ X²_i,2

So in our case

   15        3626          44 428             n          ∑ X_i,1         ∑ X_i,2
   3626      1 067 614     11 419 181    =    ∑ X_i,1    ∑ X²_i,1        ∑ X_i,1 X_i,2
   44 428    11 419 181    139 063 428        ∑ X_i,2    ∑ X_i,1 X_i,2   ∑ X²_i,2
Also, X'Y is the product of X' and the n×1 vector Y:

             ∑ Y_i
(7.64) X'Y = ∑ Y_i X_i,1
             ∑ Y_i X_i,2

hence in our case

   2259            ∑ Y_i
   647 107    =    ∑ Y_i X_i,1
   7 096 619       ∑ Y_i X_i,2
Sums of squares
Since 1'Y = ∑ Y_i = 2259, where 1 denotes the 15×1 column vector of ones,
(1/n) Y'11'Y = (1/15)(2259)² = 340 205.4
Thus
SSTO = Y'Y − (1/n) Y'11'Y = 394 107 − 340 205.4 = 53 901.6
and
SSE = Y'Y − b'X'Y ≈ 56.884
where b' = (3.452 611 738, 0.496 004 976, 9.199 080 488 × 10⁻³) is the vector of estimated
regression coefficients b = (X'X)⁻¹X'Y and X'Y is the vector obtained in 2).
Finally we have
SSR = SSTO − SSE = 53 901.6 − 56.884 = 53 844.716
The ANOVA table

Source of variation   SS                 df           MS
Regression            SSR = 53 844.716   p − 1 = 2    MSR = SSR/(p − 1) = 26 922.358
Error                 SSE = 56.884       n − p = 12   MSE = SSE/(n − p) = 4.740
Total                 SSTO = 53 901.6    n − 1 = 14
The hypothesis is
H_o: β_1 = β_2 = 0
H_a: not both β_1 and β_2 are equal to zero
We use the test statistic
F* = MSR/MSE = 26 922.358 / 4.740 ≈ 5679
The decision rule is
If F* ≤ F(1 − α; p − 1, n − p), conclude H_o
If F* > F(1 − α; p − 1, n − p), conclude H_a
Assuming that α = 0.05, from the table we get
F(1 − α; p − 1, n − p) = F(0.95; 2, 12) = 3.89
Since F* ≈ 5679 > F(0.95; 2, 12) = 3.89,
we conclude H_a, that is, Y is related to X_1 and X_2.
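The whole ANOVA and F test can be reproduced numerically; the sketch below (ours, not from the notes) takes the X and Y arrays from the earlier sketch and returns F*, the critical value and the conclusion:

    import numpy as np
    from scipy import stats

    def regression_anova(X, Y, alpha=0.05):
        # ANOVA decomposition and overall F test; X must include a leading column of ones.
        X, Y = np.asarray(X, float), np.asarray(Y, float)
        n, p = X.shape
        b = np.linalg.solve(X.T @ X, X.T @ Y)
        ssto = Y @ Y - Y.sum() ** 2 / n        # SSTO = Y'Y - (1/n) Y'11'Y
        sse = Y @ Y - b @ (X.T @ Y)            # SSE  = Y'Y - b'X'Y
        ssr = ssto - sse
        msr, mse = ssr / (p - 1), sse / (n - p)
        f_star = msr / mse
        f_crit = stats.f.ppf(1 - alpha, p - 1, n - p)
        return f_star, f_crit, f_star > f_crit  # True in the last slot -> conclude H_a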
Hence, by (7.55a), the estimated variance of the prediction error for a new observation at X_A is
s²{Y_A(new)} = MSE + s²{Ŷ_A} = 4.7403 + 0.46638 = 5.2067
or
s{Y_A(new)} = 2.28182
For the second point we take
       1
X_B =  375
       3500
The point estimate of the mean response is, by (7.47),
                              3.452 611 738
Ŷ_B = X_B'b = (1  375  3500)  0.496 004 976          = 221.65
                              9.199 080 488 × 10⁻³
and
s²{Ŷ_B} = X_B' s²{b} X_B = 0.760 26
where s²{b} = MSE (X'X)⁻¹ is the matrix

   5.907 462          0.001 009 477 8      −0.001 970 275 8
   0.001 009 477 8    3.665 394 6 × 10⁻⁵   −3.332 362 2 × 10⁻⁶
  −0.001 970 275 8   −3.332 362 2 × 10⁻⁶    9.371 928 × 10⁻⁷

Hence
s²{Y_B(new)} = MSE + s²{Ŷ_B} = 4.7403 + 0.760 26 = 5.5006
and
s{Y_B(new)} = √5.5006 = 2.3453
We found before that B = 2.179. The simultaneous Bonferroni prediction intervals
with family confidence coefficient 0.90 are
Ŷ_h ± B s{Y_h(new)}, so
135.57 − 2.179(2.28182) ≤ Y_A(new) ≤ 135.57 + 2.179(2.28182)
221.65 − 2.179(2.34536) ≤ Y_B(new) ≤ 221.65 + 2.179(2.34536)
or
130.6 ≤ Y_A(new) ≤ 140.5
216.5 ≤ Y_B(new) ≤ 226.8
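A hedged sketch (ours) of the simultaneous prediction limits. With two X_h vectors supplied (g = 2, so B = t(0.975; 12) ≈ 2.179), the second being X_B = (1, 375, 3500)', it reproduces the interval for Y_B(new) above, up to rounding:

    import numpy as np
    from scipy import stats

    def bonferroni_prediction_limits(Xh_list, X, Y, alpha=0.10):
        # Limits Yhat_h +/- B s{Y_h(new)}, with s^2{Y_h(new)} = MSE (1 + X_h'(X'X)^(-1) X_h)
        # and B = t(1 - alpha/(2g); n - p) for g new observations predicted jointly.
        X, Y = np.asarray(X, float), np.asarray(Y, float)
        n, p = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        b = XtX_inv @ (X.T @ Y)
        mse = (Y @ Y - b @ (X.T @ Y)) / (n - p)
        g = len(Xh_list)
        B = stats.t.ppf(1 - alpha / (2 * g), n - p)
        limits = []
        for Xh in Xh_list:
            Xh = np.asarray(Xh, float)
            y_hat = Xh @ b
            s_new = np.sqrt(mse * (1.0 + Xh @ XtX_inv @ Xh))
            limits.append((y_hat - B * s_new, y_hat + B * s_new))
        return limits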
Standardized regression coefficients, also called beta coefficients,
are defined as follows:

(7.69) B_k = b_k [ ∑(X_i,k − X̄_k)² / ∑(Y_i − Ȳ)² ]^(1/2) = b_k (s_k / s_y)

where s_k and s_y are the standard deviations of the X_k and Y observations,
respectively.
For our example
B_1 = 0.496 (191 089 / 53 902)^(1/2) = 0.933 89
B_2 = 0.009 20 (7 473 616 / 53 902)^(1/2) = 0.108 33
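A minimal sketch of (7.69), assuming `slopes` holds only the estimated b_k (intercept excluded) and the matching predictor columns are supplied separately (names are ours):

    import numpy as np

    def beta_coefficients(slopes, X_predictors, Y):
        # Standardized (beta) coefficients: B_k = b_k s_k / s_y.
        X_predictors = np.asarray(X_predictors, float)
        s_k = X_predictors.std(axis=0, ddof=1)
        s_y = np.asarray(Y, float).std(ddof=1)
        return np.asarray(slopes, float) * s_k / s_y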
PROBLEMS
Question 1.
The following data were obtained in a certain experiment:
i Y i X i,1 X i,2
1 64 4 2
2 73 4 4
3 61 4 2
4 76 4 4
5 72 6 2
6 80 6 4
7 71 6 2
8 83 6 4
9 83 8 2
10 89 8 4
11 86 8 2
12 93 8 4
13 88 10 2
14 95 10 4
15 94 10 2
16 100 10 4
Assume that regression model (7.1) with independent normal errors is appropriate.
1) Find the estimated regression coefficients.
2) Test whether there is a regression relation, using α = 0.01.
3) Estimate β_1 and β_2 jointly by the Bonferroni procedure, using a
99 percent family confidence coefficient.
4) Obtain an interval estimate of E{Y_h} when X_h,1 = 5 and X_h,2 = 4.
Use a 99 percent confidence coefficient.
5) Obtain an ANOVA table and use it to test whether there is a regression
relation, using α = 0.01.
6) Obtain the residuals.
Question 2.
The following data were obtained in a certain experiment:
i Yi X i,1 X i,2 i Yi X i,1 X i,2
1 58 7 5.11 11 121 17 11.02
2 152 18 16.72 12 112 12 9.51
3 41 5 3.20 13 50 6 3.79
4 93 14 7.03 14 82 12 6.45
5 101 11 10.98 15 48 8 4.60
6 38 5 4.04 16 127 15 13.86
7 203 23 22.07 17 140 17 13.03
8 78 9 7.03 18 155 21 15.21
9 117 16 10.62 19 39 6 3.64
10 44 5 4.76 20 90 11 9.57
1) Find the estimated regression coefficients.
2) Test whether there is a regression relation, using α = 0.01.
3) Estimate β_1 and β_2 jointly by the Bonferroni procedure, using a
99 percent family confidence coefficient.
4) Obtain an interval estimate of E{Y_h} when X_h,1 = 5 and X_h,2 = 3.2.
Use a 99 percent confidence coefficient.
5) Obtain an ANOVA table and use it to test whether there is a regression
relation, using α = 0.01.
6) Obtain the residuals.
7) Calculate the coefficient of multiple determination R².
8) Obtain the simultaneous interval estimates for five levels of X:

  level   1     2     3     4      5
  X_1     5     6     10    14     20
  X_2     3.2   4.8   7.0   10.0   18.0

using a 95 percent confidence coefficient.
Question 3.
Consider the multiple regression model:
Y_i = β_1 X_i,1 + β_2 X_i,2 + ε_i
where the ε_i are independent normally distributed random errors,
ε_i ~ N(0, σ²).
1) Derive the least squares estimators for β_1 and β_2.
2) Obtain the maximum likelihood estimators for β_1 and β_2.
Question 4.
A pharmaceutical company testing a new pain-killing drug tries the drug on 20 people
suffering from arthritis. The time elapsed, in minutes, from taking the drug until noticeable
relief of pain is detected is to be predicted from the dosage (in grams) and the age
of the patient (in years). The results are given below.
i Time (Y i ) Dosage (X i,1 ) Age (X i,2 )
1 11 2 59
2 3 2 57
3 20 2 22
4 25 2 12
5 27 2 18
6 15 5 40
7 10 5 64
8 34 5 27
9 14 5 54
10 34 5 22
11 35 7 33
12 28 7 49
13 23 7 29
14 21 7 32
15 33 7 20
16 27 10 43
17 8 10 61
18 3 10 69
19 12 10 62
20 14 10 61
1) Find the estimated regression coefficients.
2) Test whether there is a regression relation, using α = 0.01.
3) Estimate β_1 and β_2 jointly by the Bonferroni procedure, using a
99 percent family confidence coefficient.
4) Obtain an interval estimate of E{Y_h} when X_h,1 = 6 and X_h,2 = 45.
Use a 99 percent confidence coefficient.
5) Obtain an ANOVA table and use it to test whether there is a regression
relation, using α = 0.01.
6) Obtain the residuals.
7) Calculate the coefficient of multiple determination R².
8) Obtain the simultaneous interval estimates for five levels of X:

  level   1    2    3    4    5
  X_1     5    6    2    4    5
  X_2     41   39   52   61   75

using a 95 percent confidence coefficient.
Question 5.
A large discount department store chain advertises on television (X_1),
on the radio (X_2), and in newspapers (X_3). A sample of 12 of its stores in
a certain area showed the following advertising expenditures and revenues
during a given month. (All figures are in thousands of rands.)
i Revenues (Y i ) X i,1 X i,2 X i,3
1 84 13 5 2
2 84 13 7 1
3 80 8 6 3
4 50 9 5 3
5 20 9 3 1
6 68 13 5 1
7 34 12 7 2
8 30 10 3 2
9 54 8 5 2
10 40 10 5 3
11 57 5 6 2
12 46 5 7 2
1) Find the estimated regression coefficients.
2) Test whether there is a regression relation, using α = 0.01.
3) Estimate β_1, β_2 and β_3 jointly by the Bonferroni procedure, using a
99 percent family confidence coefficient.
4) Obtain an interval estimate of E{Y_h} when X_h,1 = 11, X_h,2 = 6 and X_h,3 = 2.
Use a 99 percent confidence coefficient.
5) Obtain an ANOVA table and use it to test whether there is a regression
relation, using α = 0.01.
6) Obtain the residuals.
7) Calculate the coefficient of multiple determination R².
8) Obtain the simultaneous interval estimates for two levels of X:

  level   1    2
  X_1     11   15
  X_2     7    9
  X_3     2    3

using a 95 percent confidence coefficient.
Question 6.
Assume that the normal regression model is applicable.
For the following data:

  i     1     2    3      4     5
  X_i   8     4    0     −4    −8
  Y_i   7.8   9    10.2   11    11.7

use the matrix method to find:
1) Y'Y
2) X'X
3) X'Y
4) b
5) Test H_o: β_1 = 0 versus H_a: β_1 ≠ 0 using ANOVA, with α = 0.05.
6) The variance-covariance matrix s²{b}.
MULTICOLLINEARITY AND ITS EFFECTS
In multiple regression analysis the relation between the independent variables and
the dependent one is of prime interest. Questions that are frequently asked include:
1. What is the relative importance of the effects of the different independent variables ?
2. What is the magnitude of the effect of a given independent variable on the dependent
one?
3. Can any independent variable be dropped from the model because it has little or no
effect on the dependent one ?
4. Should any independent variables not yet included in the model be considered
for possible inclusion ?
If the independent variables included in the model are uncorrelated among themselves
and uncorrelated with any other independent variables that are related to the dependent
variable but omitted from the model, relatively simple answers can be given.
Unfortunately, in many situations the independent variables tend to be correlated
among themselves and with other variables that are related to the dependent variable but
are not included in the model. When the independent variables are correlated among
themselves, intercorrelation or multicollinearity among them is said to exist.
The table below contains data for a small-scale experiment on the effect of crew size X_1
and level of bonus pay X_2 on crew productivity score Y. It is easy to show that X_1
and X_2 are uncorrelated here, that is r²_12 = 0, where r²_12 denotes the coefficient of
simple determination between X_1 and X_2. We will use the notation SSR(X_1, X_2)
and SSE(X_1, X_2) to indicate explicitly the two independent variables in the model,
SSR(X_1) and SSE(X_1) to show that only the one independent variable X_1 is in the model
(the case of simple linear regression), and SSR(X_2) and SSE(X_2) for the case of X_2 only.
Trial i X i,1 X i,2 Y i
1 4 2 42
2 4 2 39
3 4 3 48
4 4 3 51
5 6 2 49
6 6 2 53
7 6 3 61
8 6 3 60
ANOVA tables I
a) Regression of Y on X_1 and X_2
Ŷ = 0.375 + 5.375 X_1 + 9.250 X_2

Source of variation   SS                        df   MS
Regression            SSR(X_1, X_2) = 402.250   2    MSR(X_1, X_2) = 201.125
Error                 SSE(X_1, X_2) = 17.625    5    MSE(X_1, X_2) = 3.525
Total                 SSTO = 419.875            7

b) Regression of Y on X_1
Ŷ = 23.500 + 5.375 X_1

Source of variation   SS                  df   MS
Regression            SSR(X_1) = 231.125  1    MSR(X_1) = 231.125
Error                 SSE(X_1) = 188.750  6    MSE(X_1) = 31.458
Total                 SSTO = 419.875      7

c) Regression of Y on X_2
Ŷ = 27.250 + 9.250 X_2

Source of variation   SS                  df   MS
Regression            SSR(X_2) = 171.125  1    MSR(X_2) = 171.125
Error                 SSE(X_2) = 248.750  6    MSE(X_2) = 41.458
Total                 SSTO = 419.875      7
The table III.1 below contains data for a study of the relation of body fat Y to triceps
skinfold thickness X_1 and thigh circumference X_2, based on a sample of 20 healthy
females.
Table III.1
X i,1 X i,2 Yi
19.50 43.10 11.90
24.70 48.80 22.80
30.70 51.90 18.70
29.80 54.30 20.10
19.10 42.20 12.90
25.60 53.90 21.70
31.40 58.50 27.10
27.90 52.10 25.40
22.10 49.90 21.30
25.50 53.50 19.30
31.10 56.60 25.40
30.40 56.70 27.20
18.70 46.50 11.70
19.70 44.20 17.80
14.60 42.70 12.80
29.50 54.40 23.90
27.70 53.30 22.60
30.20 58.60 25.40
22.70 48.20 14.80
25.20 51.00 21.10
The triceps skinfold thickness (X_1) and thigh circumference (X_2) are highly
correlated: the coefficient of simple correlation between these two variables
is equal to 0.92.
ANOVA tables II
a) Regression of Y on X_1 and X_2
Ŷ = −19.174 + 0.224 X_1 + 0.6594 X_2

Source of variation   SS                       df    MS
Regression            SSR(X_1, X_2) = 385.44   2     MSR(X_1, X_2) = 192.72
Error                 SSE(X_1, X_2) = 109.95   17    MSE(X_1, X_2) = 6.47
Total                 SSTO = 495.39            19

b) Regression of Y on X_1
Ŷ = −1.496 + 0.8572 X_1

Source of variation   SS                 df    MS
Regression            SSR(X_1) = 352.27  1     MSR(X_1) = 352.27
Error                 SSE(X_1) = 143.12  18    MSE(X_1) = 7.95
Total                 SSTO = 495.39      19

c) Regression of Y on X_2
Ŷ = −23.634 + 0.8566 X_2

Source of variation   SS                 df    MS
Regression            SSR(X_2) = 381.97  1     MSR(X_2) = 381.97
Error                 SSE(X_2) = 113.42  18    MSE(X_2) = 6.30
Total                 SSTO = 495.39      19
Note first that the regression coefficient for X_1 is not the same in a) and b).
Thus, the effect ascribed to X_1 by the fitted response function varies here, depending
on whether only X_1 or both X_1 and X_2 are included in the model.
The reason for this is the correlation between X_1 and X_2.
Note from table II a) that the error sum of squares when both X_1 and X_2
are included in the model is SSE(X_1, X_2) = 109.95. When only X_2 is included
in the model, the error sum of squares is SSE(X_2) = 113.42, as seen from table II c).
Using (8.2) we obtain
SSR(X_1 | X_2) = SSE(X_2) − SSE(X_1, X_2) = 113.42 − 109.95 = 3.47
When we fit a regression function containing only X_1, we also obtain a measure of the
reduction in the variation of Y associated with X_1, namely SSR(X_1). For our example,
table II b) indicates that SSR(X_1) = 352.27, which is not the same as
SSR(X_1 | X_2) = 3.47. The reason for the large difference is the high positive
correlation between X_1 and X_2.
The story is the same for the other independent variable.
The important conclusion is: when the independent (explanatory) variables are
correlated, there is no unique sum of squares which can be ascribed to an independent
variable as reflecting its effect in reducing the total variation in Y. The reduction
in the total variation ascribed to an independent variable must be viewed in the
context of the other independent variables included in the model whenever the
independent variables are correlated.
The terms SSR(X_1 | X_2) and SSR(X_2 | X_1) are called extra sums of squares,
since they indicate the additional or extra reduction in the error sum of squares
achieved by introducing an additional independent variable into the model.
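Extra sums of squares can be obtained by fitting the reduced and full models and differencing their error sums of squares. The sketch below (ours, with hypothetical array names x1 and x2) shows one way to do this with NumPy:

    import numpy as np

    def error_sum_of_squares(Y, *predictors):
        # SSE of the least squares regression of Y on the given predictor columns
        # (an intercept column of ones is always included).
        Y = np.asarray(Y, float)
        cols = [np.ones(len(Y))] + [np.asarray(p, float) for p in predictors]
        X = np.column_stack(cols)
        b = np.linalg.lstsq(X, Y, rcond=None)[0]
        resid = Y - X @ b
        return resid @ resid

    # Extra sum of squares, e.g. SSR(X1 | X2) = SSE(X2) - SSE(X1, X2):
    # ssr_x1_given_x2 = error_sum_of_squares(Y, x2) - error_sum_of_squares(Y, x1, x2)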
Let us consider the first-order regression model with two independent variables:
(8.4) Y_i = β_o + β_1 X_i,1 + β_2 X_i,2 + ε_i     (full model)
If the test on β_1 indicates that it is zero, the regression model (8.4) would become:
Y_i = β_o + β_2 X_i,2 + ε_i
If the test on β_2 indicates that it is zero, the regression model (8.4) would become:
Y_i = β_o + β_1 X_i,1 + ε_i
However, if the separate tests indicate that β_1 = 0 and β_2 = 0, that does not
necessarily imply that:
Y_i = β_o + ε_i
since neither of the tests considers the alternative that not both β_1 and β_2 equal zero.
The proper test for the existence of a regression relation:
H_o: β_1 = β_2 = 0
H_a: not both β_1 and β_2 equal to zero
is the F test of (7.30):
F* = MSR/MSE
Decomposition of SSR
SSE(X_1, X_2) = SSR(X_3 | X_1, X_2) + SSE(X_1, X_2, X_3)
and hence
(8.10b) SSTO = SSR(X_1) + SSR(X_2 | X_1) + SSR(X_3 | X_1, X_2) + SSE(X_1, X_2, X_3)
For multiple regression with three independent variables, using our notation, we
can write the equivalent of (8.10):
(8.11) SSTO = SSR(X_1, X_2, X_3) + SSE(X_1, X_2, X_3)
Hence
(8.12) SSR(X_1, X_2, X_3) = SSR(X_1) + SSR(X_2 | X_1) + SSR(X_3 | X_1, X_2)
Thus, the regression sum of squares has been decomposed into marginal components,
each associated with one degree of freedom. Of course, the order of the independent
variables is arbitrary. For instance:
(8.13) SSR(X_1, X_2, X_3) = SSR(X_3) + SSR(X_1 | X_3) + SSR(X_2 | X_1, X_3)
We can define extra sums of squares for two or more independent variables at a time
and obtain still other decompositions. For example, we can define
(8.14) SSR(X_2, X_3 | X_1) = SSE(X_1) − SSE(X_1, X_2, X_3)
Thus, SSR(X_2, X_3 | X_1) represents the reduction in the error sum of
squares which is achieved by introducing X_2 and X_3 into the regression model already
containing X_1. There are two degrees of freedom associated with SSR(X_2, X_3 | X_1),
and also
(8.14a) SSR(X_2, X_3 | X_1) = SSR(X_2 | X_1) + SSR(X_3 | X_1, X_2)
With SSR(X_2, X_3 | X_1) we can make use of the decomposition
(8.15) SSR(X_1, X_2, X_3) = SSR(X_1) + SSR(X_2, X_3 | X_1).
Recall that
r² = (SSTO − SSE)/SSTO = SSR/SSTO = 1 − SSE/SSTO
Let us consider a first-order multiple regression model with two independent variables:
Y_i = β_o + β_1 X_i,1 + β_2 X_i,2 + ε_i.
SSE(X_2) measures the variation in Y when X_2 is included in the model. SSE(X_1, X_2)
measures the variation in Y when both X_1 and X_2 are included in the model. Hence the
relative marginal reduction in the variation in Y associated with X_1, when X_2 is already
in the model, is:
[SSE(X_2) − SSE(X_1, X_2)] / SSE(X_2)
This measure is the coefficient of partial determination between Y and X_1, given
that X_2 is in the model. We denote it by r²_Y1.2:
(8.16) r²_Y1.2 = [SSE(X_2) − SSE(X_1, X_2)] / SSE(X_2) = 1 − SSE(X_1, X_2)/SSE(X_2)
Using (8.7a) we get
(8.16a) r²_Y1.2 = SSR(X_1 | X_2) / SSE(X_2)
The coefficient of partial determination between Y and X_2, given that X_1 is in the
model, is defined by
(8.17) r²_Y2.1 = [SSE(X_1) − SSE(X_1, X_2)] / SSE(X_1) = 1 − SSE(X_1, X_2)/SSE(X_1) = SSR(X_2 | X_1)/SSE(X_1)
For our example given in table III.1 we have:
r²_Y1.2 = (113.42 − 109.95)/113.42 = 0.030 59
and
r²_Y2.1 = (143.12 − 109.95)/143.12 = 0.231 76
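The same arithmetic in a few lines of Python (values taken from the table III.1 example):

    # Coefficients of partial determination from the error sums of squares of the
    # reduced and full models.
    sse_x1, sse_x2, sse_x1x2 = 143.12, 113.42, 109.95
    r2_Y1_2 = (sse_x2 - sse_x1x2) / sse_x2   # about 0.031
    r2_Y2_1 = (sse_x1 - sse_x1x2) / sse_x1   # about 0.232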
General case
We have already discussed how to conduct two types of tests concerning the
regression coefficients in a multiple regression model. We now summarize these
tests and then consider some additional types of tests.
To test H_o: β_q = β_q+1 = . . . = β_p−1 = 0 against H_a: not all of these β_k equal zero,
the test statistic is:

(8.32) F* = [SSE(X_1, . . . , X_q−1) − SSE(X_1, . . . , X_p−1)] / [(n − q) − (n − p)]
            ÷ [SSE(X_1, . . . , X_p−1) / (n − p)]

or equivalently

(8.32a) F* = [SSR(X_q, X_q+1, . . . , X_p−1 | X_1, . . . , X_q−1) / (p − q)] ÷ MSE(X_1, . . . , X_p−1)

Decision rule:
If F* ≤ F(1 − α; p − q, n − p), conclude H_o
If F* > F(1 − α; p − q, n − p), conclude H_a
Correlation transformation.
(11.8a) β_1 = (s_y / s_1) β'_1
(11.8b) β_2 = (s_y / s_2) β'_2
(11.8c) β_o = Ȳ − β_1 X̄_1 − β_2 X̄_2
Thus, the new regression coefficients β'_1 and β'_2 and the original regression
coefficients β_1 and β_2 are related by simple scaling factors involving ratios
of standard deviations.
For the transformed variables, X'X is the product

              X'_1,1  X'_2,1  . . .  X'_n,1       X'_1,1  X'_1,2
(11.9) X'X =                                      X'_2,1  X'_2,2
              X'_1,2  X'_2,2  . . .  X'_n,2         :       :
                                                  X'_n,1  X'_n,2

For the correlation transformation this product works out to the matrix of pairwise
correlation coefficients of the X variables.
Here r_1,2 denotes the coefficient of simple correlation between X_1 and X_2, and r_Y,1 and
r_Y,2 the coefficients of simple correlation between Y and X_1 and between Y and X_2,
respectively. Hence, the estimated regression coefficients for the reparameterized
model (11.7) are

(11.13) b'_1 = (r_Y,1 − r_1,2 r_Y,2) / (1 − r²_1,2)
        b'_2 = (r_Y,2 − r_1,2 r_Y,1) / (1 − r²_1,2)

The return to the estimated regression coefficients for the original model
is accomplished by employing the relations in (11.8):
(11.14a) b_1 = (s_y / s_1) b'_1
(11.14b) b_2 = (s_y / s_2) b'_2
(11.14c) b_o = Ȳ − b_1 X̄_1 − b_2 X̄_2
Informal methods.
The variance inflation factor (VIF)_k = 1/(1 − R²_k) is equal to 1 when R²_k = 0,
i.e. when X_k is not linearly related to the other X variables; here R²_k denotes the
coefficient of multiple determination obtained when X_k is regressed on the other X
variables. When R²_k ≠ 0, then (VIF)_k is greater than 1, indicating an inflated
variance for b'_k.
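A hedged sketch (ours, not from the notes) of how the variance inflation factors can be computed from the definition above; X_predictors holds the predictor columns only, without the column of ones:

    import numpy as np

    def variance_inflation_factors(X_predictors):
        # (VIF)_k = 1 / (1 - R^2_k), R^2_k from regressing X_k on the remaining predictors.
        X_predictors = np.asarray(X_predictors, float)
        n, k = X_predictors.shape
        vifs = []
        for j in range(k):
            y = X_predictors[:, j]
            Z = np.column_stack((np.ones(n), np.delete(X_predictors, j, axis=1)))
            fitted = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
            r2 = 1.0 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()
            vifs.append(1.0 / (1.0 - r2))
        return np.array(vifs)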
We know from (6.93) that the least squares residuals can be expressed as a linear
combination of the observations Y_i by means of the hat matrix:
(11.44) e = (I − H)Y
The hat matrix H is given by (6.93a):
(11.45) H = X(X'X)⁻¹X'
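In NumPy, (11.44)-(11.45) amount to a couple of lines (an illustrative sketch, not the notes' own code):

    import numpy as np

    def hat_matrix(X):
        # H = X (X'X)^(-1) X' of (11.45); the residuals of (11.44) are e = Y - H @ Y,
        # and the diagonal elements h_ii are the leverages used below.
        X = np.asarray(X, float)
        return X @ np.linalg.inv(X.T @ X) @ X.T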
Let Ŷ_i(i) denote the fitted value for the ith case when the regression is fitted using all the
observations except the ith one (this observation is deleted). The residual
(11.56) d_i = Y_i − Ŷ_i(i)
is called a deleted residual.
The studentized deleted residual, denoted by d*_i, is:
(11.57) d*_i = d_i / s{d_i}
Fortunately, the studentized deleted residuals d*_i can be calculated without having
to fit the regression function with the ith observation omitted. It can be shown that:
(11.58) d*_i = e_i [ (n − p − 1) / (SSE(1 − h_i,i) − e²_i) ]^(1/2)
and they follow the t distribution with n − p − 1 degrees of freedom.
To identify outlying Y observations, we examine the studentized deleted
residuals for large absolute values and use the appropriate t distribution to
ascertain how far in the tails such outlying values fall.
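Formula (11.58) is easy to apply once the ordinary residuals and the hat-matrix diagonal are available; the following is a minimal sketch (ours):

    import numpy as np

    def studentized_deleted_residuals(e, h, sse, n, p):
        # (11.58): d*_i = e_i * sqrt((n - p - 1) / (SSE (1 - h_ii) - e_i^2)), where e are the
        # ordinary residuals and h the diagonal elements of the hat matrix.
        e, h = np.asarray(e, float), np.asarray(h, float)
        return e * np.sqrt((n - p - 1) / (sse * (1.0 - h) - e ** 2))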
PROBLEMS
Question 1
The following data were obtained in a certain experiment:
i Y i X i,1 X i,2
1 64 4 2
2 73 4 4
3 61 4 2
4 76 4 4
5 72 6 2
6 80 6 4
7 71 6 2
8 83 6 4
9 83 8 2
10 89 8 4
11 86 8 2
12 93 8 4
13 88 10 2
14 95 10 4
15 94 10 2
16 100 10 4
a) Fit the first-order simple linear regression model relating Y to X_1.
State the fitted regression function.
b) Find the estimated regression coefficients for the full model (Y on X_1 and X_2).
c) Does SSR(X_1) equal SSR(X_1 | X_2) here? If not, is the difference substantial?
d) Calculate the coefficient of simple correlation between X_1 and X_2.
The diagonal elements of the hat matrix are:
i 1 2 3 4 5 6 7 8
h i,i 0.237 0.237 0.237 0.237 0.137 0.137 0.137 0.137
i 9 10 11 12 13 14 15 16
h i,i 0.137 0.137 0.137 0.137 0.237 0.237 0.237 0.237
e) Identify any outlying X observations using the hat matrix method.
f) Obtain the studentized deleted residuals and identify any outlying Y observations.
Question 2.
The following data were obtained in a certain experiment:
i Y i X i,1 X i,2 i Y i X i,1 X i,2
1 58 7 5.11 11 121 17 11.02
2 152 18 16.72 12 112 12 9.51
3 41 5 3.20 13 50 6 3.79
4 93 14 7.03 14 82 12 6.45
5 101 11 10.98 15 48 8 4.60
6 38 5 4.04 16 127 15 13.86
7 203 23 22.07 17 140 17 13.03
8 78 9 7.03 18 155 21 15.21
9 117 16 10.62 19 39 6 3.64
10 44 5 4.76 20 90 11 9.57
1) Find the estimated regression coefficients for the full model (Y on X_1 and X_2).
2) Fit the first-order simple linear regression model relating Y to X_2.
3) Does SSR(X_2) equal SSR(X_2 | X_1) here? If not, is the difference substantial?
4) Calculate the coefficient of simple correlation between X_1 and X_2.
The diagonal elements of the hat matrix are:
i      1      2      3      4      5      6      7      8      9      10
h_i,i  0.091  0.194  0.131  0.268  0.149  0.141  0.429  0.067  0.135  0.165
i      11     12     13     14     15     16     17     18     19     20
h_i,i  0.179  0.059  0.110  0.156  0.095  0.128  0.097  0.230  0.112  0.073
5) Identify any outlying X observations using the hat matrix method.
6) Obtain the studentized deleted residuals and identify any outlying Y observations.
Question 3.
For a certain experiment the first-order regression model with two independent
variables was used. The calculated diagonal elements of the hat matrix are:
i 1 2 3 4 5 6 7 8
h i,i 0.237 0.237 0.237 0.237 0.137 0.137 0.137 0.137
i 9 10 11 12 13 14 15 16
h i,i 0.137 0.137 0.137 0.137 0.237 0.237 0.237 0.237
1) Describe the use of the hat matrix for identifying outlying X observations.
2) Identify any outlying X observations using the hat matrix method.
Subsets of the potential X variables are compared according to some criterion. The following
criteria can be used:

R²_p criterion
The R²_p criterion calls for an examination of the coefficient of multiple determination R²
in order to select one or several subsets of X variables. We show the number of parameters
in the regression model as a subscript of R². Thus, R²_p indicates that there are p parameters,
or p − 1 predictor variables, in the regression equation on which R²_p is based.
Since R²_p is the ratio of sums of squares:
(12.1) R²_p = SSR_p / SSTO = 1 − SSE_p / SSTO
and the denominator is constant for all possible regressions, R²_p varies inversely with the
error sum of squares SSE_p. But we know that SSE_p can never increase as additional independent
variables are included in the model. Thus, R²_p will be a maximum when all p − 1 potential X
variables are included in the regression model. The reason for using the R²_p criterion with the
all-possible-regressions approach therefore cannot be to maximize R²_p. Rather, the intent
is to find the point where adding more X variables is not worthwhile because it leads to
a very small increase in R²_p. Often, this point is reached when only a limited number of X
variables are included in the regression model. Clearly, the determination of where
diminishing returns set in is a judgmental one.
Example 12.1. The following table contains the R²_p values for all possible regression models
for a certain data set (54 observations) with 4 potential X variables:
X variables      p   df   SSE_p    R²_p    MSE_p    C_p
none 1 53 3.9728 0 0.0750 1721.6
X1 2 52 3.4960 0.120 0.0672 1510.7
X2 2 52 2.5762 0.352 0.0495 1100.1
X3 2 52 2.2154 0.442 0.0426 939.0
X4 2 52 1.8777 0.527 0.0361 788.3
X1, X2 3 51 2.2324 0.438 0.0438 948.6
X1, X3 3 51 1.4073 0.646 0.0276 580.3
X1, X4 3 51 1.8759 0.528 0.0368 789.5
X2, X3 3 51 0.7431 0.813 0.0146 283.7
X2, X4 3 51 1.3922 0.650 0.0273 573.5
X3, X4 3 51 1.2455 0.687 0.0244 508.0
X1, X2, X3 4 50 0.1099 0.972 0.0022 3.1
X1, X2, X4 4 50 1.3905 0.650 0.0278 574.8
X1, X3, X4 4 50 1.1157 0.719 0.0223 452.1
X2, X3, X4 4 50 0.4653 0.883 0.0093 161.7
X 1 , X 2 , X 3 , X 4 5 49 0.1098 0.972 0.0022 5.0
Using the R²_p criterion, one can notice that the use of the subset X_1, X_2, X_3 appears
to be reasonable, since there is little increase in R²_p after these three X variables are
included in the model.
MSE_p or R²_a criterion
Since R²_p does not take account of the number of parameters in the model, and since
max R²_p can never decrease as p increases, the use of the adjusted coefficient of multiple
determination R²_a:
(12.2) R²_a = 1 − [(n − 1)/(n − p)] (SSE_p / SSTO) = 1 − MSE_p / [SSTO/(n − 1)]
has been suggested as a criterion which takes the number of parameters in the model into
account through the degrees of freedom. It can be seen from (12.2) that R²_a increases if and
only if MSE_p decreases, since SSTO/(n − 1) is fixed for the given Y observations. Hence, R²_a
and MSE_p are equivalent criteria. We shall consider here the criterion MSE_p. MSE_p
can, indeed, increase as p increases, when the reduction in SSE_p becomes so small that it is
not sufficient to offset the loss of an additional degree of freedom. Users of the MSE_p
criterion either seek to find the subset of X variables that minimizes MSE_p, or one or several
subsets for which MSE_p is so close to the minimum that adding more variables is not
worthwhile. Using the table from example 12.1 one can see that the subset X_1, X_2, X_3
appears to be best using the MSE_p (R²_a) criterion too.
C_p criterion
This criterion is concerned with the total mean squared error of the n fitted values for each
of the various subset regression models. The mean squared error concept involves a bias
component and a random error component. Here, the mean squared error pertains to the fitted
values for the regression model employed. The bias component for the ith fitted value is:
(12.3) E{Ŷ_i} − E{Y_i}
where E{Ŷ_i} is the expectation of the ith fitted value for the given regression model and
E{Y_i} is the true mean response. The random error component for Ŷ_i is simply σ²{Ŷ_i}, its
variance. The mean squared error for Ŷ_i is then the sum of the squared bias and the variance:
(12.4) (E{Ŷ_i} − E{Y_i})² + σ²{Ŷ_i}
The total mean squared error for all n fitted values is the sum of the n individual mean
squared errors:
(12.5) ∑_{i=1}^{n} (E{Ŷ_i} − E{Y_i})² + ∑_{i=1}^{n} σ²{Ŷ_i}
The criterion measure, denoted by Γ_p, is simply the total mean squared error divided by σ²,
the true error variance:
(12.6) Γ_p = (1/σ²) [ ∑_{i=1}^{n} (E{Ŷ_i} − E{Y_i})² + ∑_{i=1}^{n} σ²{Ŷ_i} ]
The model which includes all p − 1 potential X variables is assumed to have been carefully
chosen so that MSE(X_1, . . . , X_p−1) is an unbiased estimator of σ². It can then be shown
that an estimator of Γ_p is C_p:
(12.7) C_p = SSE_p / MSE(X_1, . . . , X_p−1) − (n − 2p)
where SSE_p is the error sum of squares for the fitted subset regression model with p
parameters (i.e., with p − 1 predictor variables).
When there is no bias in the regression model with p − 1 predictor variables, so that
E{Ŷ_i} ≡ E{Y_i}, the expected value of C_p is approximately p:
(12.8) E{C_p | E{Ŷ_i} ≡ E{Y_i}} ≈ p
Thus, when the C_p values for all possible regression models are plotted against p, those
models with little bias will tend to fall near the line C_p = p. Models with substantial bias
will tend to fall considerably above this line.
In using the C_p criterion, one seeks to identify subsets of X variables for which
(1) the C_p value is small and
(2) the C_p value is near p.
Sets of X variables with small C_p values have a small total mean squared error, and when the
C_p value is also near p, the bias of the regression model is small. The regression model
based on the subset of X variables with the smallest C_p value may still involve substantial
bias. In that case, one may at times prefer a regression model based on a somewhat larger
subset of X variables for which the C_p value is slightly larger but which does not involve a
substantial bias component.
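Computing (12.7) for a given subset is a one-line affair; a short sketch (ours):

    def mallows_cp(sse_p, mse_full, n, p):
        # Mallows' C_p of (12.7); mse_full is the MSE of the model containing all
        # p - 1 potential predictor variables.
        return sse_p / mse_full - (n - 2 * p)

    # Example 12.1: with full-model MSE = 0.1098 / 49 and n = 54, the subset (X1, X2, X3)
    # gives mallows_cp(0.1099, 0.1098 / 49, 54, 4), roughly 3, in line with the table.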
Identification of "best" subsets by use of an algorithm
In those occasional cases when the pool of potential X variables contains many variables,
an automatic search procedure that develops sequentially the subset of X variables
to be included in the regression model may be helpful. Such procedures were developed
to economize on computational effort, as compared with the all-possible-regressions approach,
while arriving at a reasonably good subset of independent variables.
Stepwise Regression
Stepwise regression uses t statistics (and related prob-values) to determine the importance
(or significance) of the independent (explanatory) variables in various regression models.
In this context the t statistic indicates that an independent variable is significant at the
α level if and only if the related p-value is less than α. This implies that we
can reject H_o: β_j = 0 in favor of H_a: β_j ≠ 0 with the probability of a Type I error equal
to α. Before beginning the stepwise procedure we choose a value α_entry, which we call
"the probability of a Type I error related to entering an independent variable into the
regression model". We also choose α_stay, which we call "the probability of a Type I error
related to retaining an independent variable that was previously entered into the model".
The SAS default values are α_entry = 0.5 and α_stay = 0.15.
The stepwise regression is then performed as follows:
Step 1. The stepwise procedure considers all possible one-independent-variable regression
models of the form
y = β_o + β_1 x_j
Each different model includes a different potential independent variable. For each model
the t statistic (and p-value) related to testing H_o: β_1 = 0 versus H_a: β_1 ≠ 0
is calculated. If the largest absolute value of the t statistics is significant (that is, the
corresponding smallest p-value < α_entry), the corresponding variable, say X_1, is included
in the model; we then consider the model
y = β_o + β_1 x_1
and the step-one report shows this variable as entered.
If the t statistic does not indicate that X_1 is significant at the α_entry level, then
the stepwise procedure terminates by choosing the model
y = β_o
Step 2. The stepwise procedure considers all possible two-independent-variable models
(with the one variable already selected in the first step). For each model the t statistic
related to testing H_o: β_2 = 0 versus H_a: β_2 ≠ 0 is calculated. The variable with the
largest absolute t value is included if it is significant (the corresponding p-value < α_entry).
Further steps. The stepwise procedure continues by adding independent variables one at
a time; a variable is added if and only if it has the largest t statistic in absolute value and
that t statistic is significant (the corresponding p-value < α_entry). After adding an
independent variable the stepwise procedure checks all the independent variables already
included in the model and removes any of them that is not significant at the α_stay level.
The stepwise procedure terminates when all independent variables not in the model are
insignificant at the α_entry level or when the variable to be added to the model is the one
just removed from it.
Notice: in some packages, instead of the t statistic, the F statistic (= t²) is used in reports.
MAXR SAS PROCEDURE
This method does not settle on a single model. Instead, it looks for the "best"
one-variable model, the "best" two-variable model, and so forth.
The MAXR method begins by finding the one-variable model that produces
the highest R². Then another variable, the one that would yield the greatest increase in
R², is added. Once the two-variable model is obtained, each variable in the model is compared
to each variable not in the model. For each comparison, MAXR determines whether removing
one variable and replacing it with the other would increase R². After comparing all possible
switches, the one that produces the largest increase in R² is made.
Comparisons begin again, and the procedure continues until no switch can increase R².
The two-variable model thus achieved is considered the "best" two-variable model the
technique can find. Another variable is then added to the model, and the comparing and
switching process is repeated to find the "best" three-variable model, and so forth.
The difference between the stepwise technique and the maximum R² improvement method is
that all switches are evaluated before any switch is made in the MAXR method. In the stepwise
method, the "worst" variable may be removed without considering what adding the "best"
remaining variable might accomplish.
The basic regression models considered so far have assumed that the random
error terms ε_i are either uncorrelated random variables or independent normal
random variables. Many regression applications involve time series data. For such
data, the assumption of uncorrelated or independent error terms is often not
appropriate. Error terms correlated over time are said to be autocorrelated or
serially correlated.
Multiple regression
The multiple regression model with the random errors following a first-order
autoregressive process is:
(13.2) Y_t = β_o + β_1 X_t,1 + . . . + β_p−1 X_t,p−1 + ε_t
       ε_t = ρ ε_t−1 + u_t
where:
ρ is a parameter such that |ρ| < 1
u_t are independent N(0, σ²)
The Durbin-Watson test assumes the first-order autoregressive error model (13.1)
or (13.2). Because correlated errors in applications tend to show positive serial
correlation, the usual test alternatives are:
(13.3) H_o: ρ = 0
       H_a: ρ > 0
The test statistic is
(13.4) D = ∑_{t=2}^{n} (e_t − e_t−1)² / ∑_{t=1}^{n} e_t²
where n is the number of observations.
The critical values d_L and d_U are given in tables, with decision rule:
(13.5) If D > d_U, conclude H_o
       If D < d_L, conclude H_a
       If d_L ≤ D ≤ d_U, the test is inconclusive
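The statistic (13.4) is straightforward to compute from the residuals; a minimal sketch (ours):

    import numpy as np

    def durbin_watson(residuals):
        # Durbin-Watson statistic (13.4): D = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_t e_t^2.
        e = np.asarray(residuals, float)
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)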
PROBLEMS
Question 1
For the following data:
t 1 2 3 4 5 6 7 8
X t 2.052 2.026 2.002 1.949 1.942 1.887 1.986 2.053
Y t 102.9 101.5 100.8 98.0 97.3 93.5 97.5 102.2
t 9 10 11 12 13 14 15 16
X t 2.102 2.113 2.058 2.060 2.035 2.080 2.102 2.150
Y t 105.0 107.2 105.1 103.9 103.0 104.8 105.0 107.2
1) Fit a simple linear regression line.
2) Conduct a formal test for positive autocorrelation, using α = 0.05.
Question 2.
The fitted values and residuals of a regression analysis are given below:
i      1      2      3      4      5      6      7
Ŷ_i    2.92   2.33   2.25   1.58   2.08   3.51   3.34
e_i    0.18  -0.03   0.75   0.32   0.42   0.19   0.06
i      8      9      10     11     12     13     14     15
Ŷ_i    2.42   2.84   2.50   3.59   2.16   1.91   2.50   3.26
e_i   -0.42   0.06  -0.20  -0.39  -0.36  -0.51  -0.50   0.54
Assume that the simple linear regression model with the random error
terms following a first-order autoregressive process is appropriate.
Conduct a formal test for positive autocorrelation, using α = 0.05.
Question 3
The following data were obtained in a certain experiment:
i Yi X i,1 X i,2
1 64 4 2
2 73 4 4
3 61 4 2
4 76 4 4
5 72 6 2
6 80 6 4
7 71 6 2
8 83 6 4
9 83 8 2
10 89 8 4
11 86 8 2
12 93 8 4
13 88 10 2
14 95 10 4
15 94 10 2
16 100 10 4
The data summary is given below in matrix form:

        16    112   48                   99/80    −7/80    −3/16
X'X =   112   864   336     (X'X)⁻¹ =   −7/80      1/80      0
        48    336   160                 −3/16       0        1/16

        1308
X'Y =   9510          Y'Y = 108 896
        3994
Assume that the first-order regression model with independent normal errors is appropriate.
1) Find the estimated regression coefficients.
2) Obtain an ANOVA table and use it to test whether there is a regression
relation, using α = 0.05.
3) Estimate β_1 and β_2 jointly by the Bonferroni procedure, using an
80 percent family confidence coefficient.
Useful values of the standard normal distribution:

∫_{−1}^{1} (1/√(2π)) exp(−x²/2) dx = 0.682 69
∫_{−2}^{2} (1/√(2π)) exp(−x²/2) dx = 0.954 5
∫_{−3}^{3} (1/√(2π)) exp(−x²/2) dx = 0.997 3
∫_{−4}^{4} (1/√(2π)) exp(−x²/2) dx = 0.999 94