BS Classes V2
The Model
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, 2, \dots, n \tag{1}$$
Remark
The word regression is curious: it means going backwards. In this context we can interpret it as "returning to origins," in the sense that $x$ is called a regressor because it somehow gives origin to $y$.
Parameter Estimation
Then the least squares method is to find $\hat{\beta}_0$ and $\hat{\beta}_1$ which minimize the overall sum of squared differences between observed and predicted values. So, taking partial derivatives, setting them equal to zero, and solving:

$$\hat{\beta}_1 = \frac{n\bar{x}\,\bar{y} - \sum_{i=1}^n x_i y_i}{n\bar{x}^2 - \sum_{i=1}^n x_i^2} \qquad\text{and}\qquad \hat{\beta}_0 = \bar{y} - \bar{x}\,\hat{\beta}_1$$
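As a quick numerical check, these closed-form expressions can be computed directly and compared with `lm()`. This is only a sketch; the vectors `x` and `y` below are hypothetical data chosen for illustration:

# Hypothetical data for illustration
x <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3)
y <- c(2.1, 4.0, 5.9, 8.2, 9.1, 11.8)
n <- length(x)

# Slope and intercept from the closed-form expressions above
b1 <- (n * mean(x) * mean(y) - sum(x * y)) / (n * mean(x)^2 - sum(x^2))
b0 <- mean(y) - mean(x) * b1

c(b0, b1)
coef(lm(y ~ x))   # should agree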
Thus, $\hat{\beta}_0$ and $\hat{\beta}_1$ obtained by the least squares method are unbiased estimators of $\beta_0$ and $\beta_1$, respectively.

Exercise 1
Calculate $E[\hat{\beta}_0]$, $E[\hat{\beta}_1]$, and, if you're brave, the variances.
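Before attempting the proof, a small Monte Carlo sketch suggests what to expect. The true values $\beta_0 = 1$, $\beta_1 = 2$, $\sigma = 1$ below are hypothetical choices:

set.seed(1)
x <- seq(1, 10, length.out = 20)
est <- replicate(5000, {
  y <- 1 + 2 * x + rnorm(20)   # true beta0 = 1, beta1 = 2, sigma = 1
  coef(lm(y ~ x))
})
rowMeans(est)                  # close to c(1, 2), consistent with unbiasedness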
$$t = \frac{\hat{\beta}_1}{s\big/\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}} \sim t_{n-2,\,\delta}$$
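This statistic can be assembled by hand and compared with the `t value` column of `summary()`. A sketch reusing the hypothetical `x` and `y` from the earlier example, where `s` is the residual standard error with $n-2$ degrees of freedom:

x <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3)
y <- c(2.1, 4.0, 5.9, 8.2, 9.1, 11.8)
mod <- lm(y ~ x)
s   <- summary(mod)$sigma                 # residual standard error, df = n - 2
t1  <- coef(mod)[2] / (s / sqrt(sum((x - mean(x))^2)))
t1                                        # t statistic for beta1
summary(mod)$coefficients[2, "t value"]   # should match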
Coefficient of Determination $r^2$
Then
$$\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n (y_i - \hat{y}_i)^2 \tag{2}$$
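The identity (2), and the standard definition $r^2 = \text{SSR}/\text{SST}$, are easy to verify numerically. A sketch with the same hypothetical data as before:

x <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3)
y <- c(2.1, 4.0, 5.9, 8.2, 9.1, 11.8)
yhat <- fitted(lm(y ~ x))

SST <- sum((y - mean(y))^2)
SSR <- sum((yhat - mean(y))^2)
SSE <- sum((y - yhat)^2)

all.equal(SST, SSR + SSE)   # TRUE
SSR / SST                   # equals summary(lm(y ~ x))$r.squared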
MLRM: Formulation
$$y = X\beta + \epsilon \tag{3}$$
Where
$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \quad
\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$

Of course, $X$ could also be written as $X = \begin{bmatrix} j & x_1 & x_2 & \cdots & x_k \end{bmatrix}$, where $j = \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix}'$ and $x_i = \begin{bmatrix} x_{1i} & x_{2i} & \cdots & x_{ni} \end{bmatrix}'$ for $i = 1, 2, \dots, k$.
$$\hat{\beta} = (X'X)^{-1}X'y \qquad\text{and}\qquad E[\hat{\beta}] = \beta, \quad V[\hat{\beta}] = \sigma^2 (X'X)^{-1}$$
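These matrix formulas translate directly into R. A sketch, where the model matrix `X` (with a leading column of ones) and the response `y` are hypothetical:

set.seed(2)
n <- 30; k <- 2
X <- cbind(1, matrix(rnorm(n * k), n, k))       # n x (k+1) model matrix
y <- X %*% c(1, 2, -1) + rnorm(n)               # hypothetical true beta

beta_hat <- solve(t(X) %*% X, t(X) %*% y)       # (X'X)^{-1} X'y
beta_hat
s2 <- sum((y - X %*% beta_hat)^2) / (n - k - 1) # unbiased estimate of sigma^2
s2 * solve(t(X) %*% X)                          # estimated V[beta_hat]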
Normal model
[Figure 1: bivariate normal density $N_2(\mu = 0, \sigma^2 = 1)$, plotted over the $X$-$Y$ plane.]
$$\ell = \log[L(\beta, \sigma^2)] = -\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2} - \log[(2\pi\sigma^2)^{n/2}]$$

Now we can find the maximum-likelihood estimators of the parameters by differentiating $\ell$ with respect to $\beta$ and $\sigma^2$, setting the derivatives equal to zero, and solving. So

$$\hat{\beta} = (X'X)^{-1}X'y, \qquad \hat{\sigma}^2 = \frac{(y - X\hat{\beta})'(y - X\hat{\beta})}{n}$$
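Note that $\hat{\sigma}^2$ divides by $n$ and is therefore biased; the unbiased version $s^2$ divides by $n-k-1$. A quick sketch of the difference using R's built-in cars data ($k = 1$):

mod <- lm(dist ~ speed, data = cars)
n   <- nrow(cars)
rss <- sum(residuals(mod)^2)   # (y - X beta_hat)'(y - X beta_hat)
rss / n                        # maximum-likelihood estimate of sigma^2
rss / (n - 2)                  # s^2; equals summary(mod)$sigma^2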
Under the normal model, the following distributional results hold:
1. $\hat{\beta} \sim N(\beta, \sigma^2(X'X)^{-1})$.
2. $n\hat{\sigma}^2/\sigma^2 \sim \chi^2_{(n-k-1)}$, or equivalently $(n-k-1)s^2/\sigma^2 \sim \chi^2_{(n-k-1)}$.
3. $\hat{\beta}$ and $\hat{\sigma}^2$ (or $s^2$) are independent.
Sums of Squares
$$\bar{y} = (1/n)\sum_{i=1}^n y_i = (1/n)j'y$$
$$n\bar{y}^2 = n[(1/n)j'y]^2 = n[(1/n)j'y][(1/n)j'y]$$
but $j'y = y'j$, so
$$n\bar{y}^2 = (1/n)y'jj'y = (1/n)y'Jy$$
Then
$$\sum_{i=1}^n (y_i - \bar{y})^2 = y'y - (1/n)y'Jy = y'[I - (1/n)J]y$$
Then:
$$\text{SST} = y'[I - (1/n)J]y$$
$$\text{SSE} = y'[I - H]y$$
$$\text{SSR} = y'[H - (1/n)J]y$$
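These quadratic forms can be verified numerically. A sketch where `X` and `y` are hypothetical and `J` is the $n \times n$ matrix of ones:

set.seed(3)
n <- 25
X <- cbind(1, rnorm(n), rnorm(n))
y <- X %*% c(1, 0.5, -0.5) + rnorm(n)

J <- matrix(1, n, n)
H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
I <- diag(n)

SST <- drop(t(y) %*% (I - J/n) %*% y)
SSE <- drop(t(y) %*% (I - H) %*% y)
SSR <- drop(t(y) %*% (H - J/n) %*% y)
all.equal(SST, SSE + SSR)               # TRUE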
Coefficient of Determination $R^2$
$$Q^{-1}y = Q^{-1}X\beta + Q^{-1}\epsilon$$
$$z = W\beta + \delta \tag{5}$$
$$\hat{\beta} = (W'W)^{-1}W'z$$
$$V[\hat{\beta}] = \sigma^2(W'W)^{-1}$$
$$s^2 = \frac{z'[I - H_W]z}{n-k-1}, \quad\text{with}\quad H_W = W(W'W)^{-1}W'$$

Exercise 4
Prove that
$$\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y$$
$$V[\hat{\beta}] = \sigma^2(X'V^{-1}X)^{-1}$$
$$s^2 = \frac{y'[V^{-1} - V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}]y}{n-k-1}$$
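A sketch of the transformed fit. The covariance structure `V` below is a hypothetical AR(1)-type pattern assumed known, and $V = QQ'$ is obtained from its Cholesky factor:

set.seed(4)
n <- 20
X <- cbind(1, rnorm(n))
V <- 0.8^abs(outer(1:n, 1:n, "-"))    # hypothetical V, assumed known
Q <- t(chol(V))                        # lower-triangular Q with V = QQ'
y <- X %*% c(1, 2) + Q %*% rnorm(n)    # errors with covariance sigma^2 V

z <- solve(Q) %*% y                    # z = Q^{-1} y
W <- solve(Q) %*% X                    # W = Q^{-1} X
solve(t(W) %*% W, t(W) %*% z)          # beta_hat from the transformed model

# Equivalent closed form from Exercise 4:
solve(t(X) %*% solve(V) %*% X, t(X) %*% solve(V) %*% y)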
If $u \sim \chi^2_p$ and $v \sim \chi^2_q$ are independent, then
$$w = \frac{u/p}{v/q} \sim F_{(p,q)}$$
and if instead $u$ is noncentral $\chi^2$ with noncentrality parameter $\lambda$,
$$w = \frac{u/p}{v/q} \sim F_{(p,q,\lambda)}$$
If $z \sim N(0,1)$ and $u \sim \chi^2_p$ are independent, then
$$t = \frac{z}{\sqrt{u/p}} \sim t_{(p)}$$
and if instead $y \sim N(\mu, 1)$, or more generally $y \sim N(\mu, \sigma^2)$,
$$t = \frac{y}{\sqrt{u/p}} \sim t_{(p,\mu)} \qquad\text{and}\qquad t = \frac{y/\sigma}{\sqrt{u/p}} \sim t_{(p,\mu/\sigma)}$$
$$y = X\beta + \epsilon \quad\text{subject to}\quad C\beta = 0$$
$$\tilde{\beta} = (X'X)^{-1}X'y - (X'X)^{-1}C'[C(X'X)^{-1}C']^{-1}C(X'X)^{-1}X'y$$

Now, do not forget that we are looking for a way to test the null hypothesis $H_0: C\beta = 0$, and we must find a test statistic that serves this purpose. With the theory we have developed so far, it is logical to think about two independent $\chi^2$ random variables. But note that:

$$y'H^{*}y = y'X(X'X)^{-1}C'[C(X'X)^{-1}C']^{-1}C(X'X)^{-1}X'y = (C\hat{\beta})'[C(X'X)^{-1}C']^{-1}C\hat{\beta}$$
$$E[C\hat{\beta}] = C\,E[\hat{\beta}] = C\beta$$
$$V[C\hat{\beta}] = C\,V[\hat{\beta}]\,C' = \sigma^2 C(X'X)^{-1}C'$$
Finally
$$F = \frac{(y'H^{*}y/\sigma^2)/q}{(y'(I - H)y/\sigma^2)/(n-k-1)} = \frac{y'H^{*}y/q}{y'(I - H)y/(n-k-1)} \sim F_{q,(n-k-1),\lambda}$$
with $\lambda = [1/(2\sigma^2)](C\beta)'[C(X'X)^{-1}C']^{-1}C\beta$.
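A sketch of the whole test, built from the second expression for $y'H^{*}y$. The data and the constraint matrix `C` (here testing $\beta_1 = \beta_2 = 0$) are hypothetical:

set.seed(5)
n <- 40; k <- 2
X <- cbind(1, rnorm(n), rnorm(n))
y <- X %*% c(1, 0.5, 0) + rnorm(n)

XtXi <- solve(t(X) %*% X)
bhat <- XtXi %*% t(X) %*% y
C <- rbind(c(0, 1, 0),
           c(0, 0, 1))                # H0: beta1 = beta2 = 0
q <- nrow(C)

num <- t(C %*% bhat) %*% solve(C %*% XtXi %*% t(C)) %*% (C %*% bhat)
s2  <- sum((y - X %*% bhat)^2) / (n - k - 1)
Fstat <- drop(num) / (q * s2)
Fstat
qf(0.95, q, n - k - 1)                # reject H0 at alpha = 0.05 if Fstat exceeds this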
$$F = \frac{(a'\hat{\beta})'[a'(X'X)^{-1}a]^{-1}(a'\hat{\beta})}{s^2} = \frac{(a'\hat{\beta})^2}{s^2[a'(X'X)^{-1}a]} \sim F_{1,(n-k-1),\lambda}$$
with
$$\lambda = \frac{(a'\beta)^2}{2\sigma^2\, a'(X'X)^{-1}a}$$
Because $q = 1$, $a'\hat{\beta}$, $a'\beta$, and $a'(X'X)^{-1}a$ are scalars. Then we reject $H_0$ if $F \geq F_{\alpha,1,(n-k-1)}$.
$$F = \frac{\hat{\beta}_j^2}{s^2 g_{jj}} \sim F_{1,(n-k-1),\lambda} \quad\text{with}\quad \lambda = \frac{\beta_j^2}{2\sigma^2 g_{jj}}$$
$$t = \frac{\hat{\beta}_j}{s\sqrt{g_{jj}}} \sim t_{(n-k-1),\lambda}$$
$$t = \frac{\hat{\beta}_j - \beta_j}{s\sqrt{g_{jj}}} \sim t_{(n-k-1)}$$
$$P[|t| \geq t_{\alpha/2,(n-k-1)}] = \alpha$$
$$P[|t| < t_{\alpha/2,(n-k-1)}] = 1 - \alpha$$
$$P[-t_{\alpha/2,(n-k-1)} < t < t_{\alpha/2,(n-k-1)}] = 1 - \alpha$$
$$P\!\left[-t_{\alpha/2,(n-k-1)} < \frac{\hat{\beta}_j - \beta_j}{s\sqrt{g_{jj}}} < t_{\alpha/2,(n-k-1)}\right]$$
$$= P[-t_{\alpha/2,(n-k-1)}\, s\sqrt{g_{jj}} < \hat{\beta}_j - \beta_j < t_{\alpha/2,(n-k-1)}\, s\sqrt{g_{jj}}]$$
$$= P[-\hat{\beta}_j - t_{\alpha/2,(n-k-1)}\, s\sqrt{g_{jj}} < -\beta_j < -\hat{\beta}_j + t_{\alpha/2,(n-k-1)}\, s\sqrt{g_{jj}}]$$
$$= P[\hat{\beta}_j + t_{\alpha/2,(n-k-1)}\, s\sqrt{g_{jj}} > \beta_j > \hat{\beta}_j - t_{\alpha/2,(n-k-1)}\, s\sqrt{g_{jj}}]$$
$$= P[\hat{\beta}_j - t_{\alpha/2,(n-k-1)}\, s\sqrt{g_{jj}} < \beta_j < \hat{\beta}_j + t_{\alpha/2,(n-k-1)}\, s\sqrt{g_{jj}}]$$
$$= 1 - \alpha$$

Then a $100(1-\alpha)\%$ CI for $\beta_j$ is $\hat{\beta}_j \pm t_{\alpha/2,(n-k-1)}\, s\sqrt{g_{jj}}$.
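In R this interval reproduces `confint()`. A sketch using the built-in cars data, where `g[j]` is the $j$-th diagonal element of $(X'X)^{-1}$:

mod <- lm(dist ~ speed, data = cars)
X   <- model.matrix(mod)
s   <- summary(mod)$sigma
g   <- diag(solve(t(X) %*% X))
j   <- 2                              # coefficient of speed
tc  <- qt(0.975, df.residual(mod))    # t_{alpha/2} for alpha = 0.05

coef(mod)[j] + c(-1, 1) * tc * s * sqrt(g[j])
confint(mod)[j, ]                     # should match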
So:
$$P\!\left[\frac{(n-k-1)s^2}{\chi^2_{(\alpha/2),(n-k-1)}} \leq \sigma^2 \leq \frac{(n-k-1)s^2}{\chi^2_{(1-\alpha/2),(n-k-1)}}\right] = 1 - \alpha$$
$$\frac{(n-k-1)s^2}{\chi^2_{(\alpha/2),(n-k-1)}} \leq \sigma^2 \leq \frac{(n-k-1)s^2}{\chi^2_{(1-\alpha/2),(n-k-1)}}$$
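A matching sketch for the variance interval with the same cars fit; recall that $\chi^2_{(\alpha/2)}$ denotes the upper $\alpha/2$ critical value:

mod <- lm(dist ~ speed, data = cars)
df  <- df.residual(mod)               # n - k - 1
s2  <- summary(mod)$sigma^2

c(lower = df * s2 / qchisq(0.975, df),
  upper = df * s2 / qchisq(0.025, df))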
$$HX = X$$
$$H\begin{bmatrix} j & x_1 & \cdots & x_k \end{bmatrix} = \begin{bmatrix} j & x_1 & \cdots & x_k \end{bmatrix} \;\Rightarrow\; Hj = j$$
$$E[b] = 0 \tag{6}$$
$$V[b] = \sigma^2(I - H) \tag{7}$$
$$\operatorname{Cov}[b, y] = \sigma^2(I - H) \tag{8}$$
$$\operatorname{Cov}[b, \hat{y}] = 0 \tag{9}$$
$$\bar{b} = b'j/n = 0 \tag{10}$$
$$b'y = y'(I - H)y \tag{11}$$
$$b'\hat{y} = 0 \tag{12}$$
$$b'X = 0' \tag{13}$$
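Properties (10), (12), and (13) hold exactly for any least-squares fit and are easy to confirm numerically. A sketch with the cars data:

mod <- lm(dist ~ speed, data = cars)
b   <- residuals(mod)
X   <- model.matrix(mod)

mean(b)                 # (10): zero up to rounding error
sum(b * fitted(mod))    # (12): zero up to rounding error
drop(t(b) %*% X)        # (13): zero vector up to rounding error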
due to (10) and (12). And due to (10) and (13) this numerator is 0 for $b$ and every column of the matrix $X$. Then
$$r_{b,\hat{y}} = 0 \qquad\text{and}\qquad r_{b,x_i} = 0, \quad i = 1, \dots, k$$
Exercise 5
Using the Hematology Data in Table 10.1 of Rencher (p. 253), described in Example 10.3, postulate a "purely additive" linear model for all explanatory variables and study the residuals.
$$(1/n) \leq h_{ii} \leq 1$$
$$-0.5 \leq h_{ij} \leq 0.5, \quad \forall\, i \neq j$$
$$\operatorname{tr}(H) = \sum_{i=1}^n h_{ii} = k + 1$$
We need to scale the variance, computing $b_i/[\sigma\sqrt{1 - h_{ii}}]$ (the standardized residuals) or the studentized residuals
$$r_i = \frac{b_i}{s\sqrt{1 - h_{ii}}}$$
Our approach to checking for outliers is to plot the studentized residuals versus $\hat{y}_i$ or versus $i$, the observation number. There are other approaches to this analysis, but they are beyond the scope of this course.
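In R the studentized residuals defined above correspond to `rstandard()` (sometimes called internally studentized residuals). A sketch of the suggested plot, again with the cars data:

mod <- lm(dist ~ speed, data = cars)
ri  <- rstandard(mod)   # b_i / (s * sqrt(1 - h_ii))

plot(fitted(mod), ri,
     xlab = "fitted values", ylab = "studentized residuals")
abline(h = c(-2, 2), lty = 2)   # rough bounds for flagging potential outliers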
[Scatter plot of $y$ versus $x_4$ with the fitted least-squares line; two unusual observations are circled, as produced by R code 1 below.]
R code 1
library(ggplot2)   # needed for ggplot() and the geom_* functions

# Scatter plot x4 vs y, with the two unusual points circled
ggplot(dat, aes(x = x4, y = y)) +
  geom_point(colour = 'darkblue', shape = 21, size = 2) +
  geom_smooth(method = 'lm', colour = 'red', se = TRUE) +
  geom_point(aes(x = 42, y = 34), data = dat, size = 10,
             shape = 1, color = 'green') +
  geom_segment(aes(x = 37, y = 34, xend = 41, yend = 34),
               lwd = 0.2, col = 'darkgreen',
               arrow = arrow(length = unit(0.03, "npc"))) +
  geom_point(aes(x = 17, y = 61), data = dat, size = 10,
             shape = 1, color = 'green') +
  geom_segment(aes(x = 17, y = 53, xend = 17, yend = 59),
               lwd = 0.2, col = 'darkgreen',
               arrow = arrow(length = unit(0.03, "npc"))) +
  ylab('y') +
  xlab('x4')
$$D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})' X'X (\hat{\beta}_{(i)} - \hat{\beta})}{(k+1)s^2}
= \frac{(X\hat{\beta}_{(i)} - X\hat{\beta})'(X\hat{\beta}_{(i)} - X\hat{\beta})}{(k+1)s^2}
= \frac{(\hat{y}_{(i)} - \hat{y})'(\hat{y}_{(i)} - \hat{y})}{(k+1)s^2}
= \frac{r_i^2}{k+1}\left(\frac{h_{ii}}{1 - h_{ii}}\right)$$
R code 2
hv <- diag(H); hv        # leverages h_ii from the hat matrix
hatvalues(mod1)          # the same leverages, from the fitted model

# Rough leverage cutoff (assumes mod1 was fitted with lm(..., x = TRUE))
lmax <- 2 * (ncol(mod1$x) + 1) / nrow(mod1$x); lmax
cd <- round(cooks.distance(mod1), 6); cd   # Cook's distances D_i

# Observations ordered by x4, with their y, leverage, and Cook's distance
sx4 <- as.data.frame(sort(dat$x4, decreasing = TRUE, index.return = TRUE))
sx4_y <- cbind(sx4, y = dat$y[sx4$ix], h = hv[sx4$ix], c = cd[sx4$ix]); sx4_y

# Observations ordered by y, with their x4, leverage, and Cook's distance
sy <- as.data.frame(sort(dat$y, decreasing = TRUE, index.return = TRUE))
sy_x4 <- cbind(sy, x4 = dat$x4[sy$ix], h = hv[sy$ix], c = cd[sy$ix]); sy_x4