
MIXED

Overview
This document summarizes the computational algorithms discussed in Wolfinger,
Tobias and Sall (1994).

Notation
The following notation is used throughout this chapter unless otherwise stated:

θ Overall covariance parameter vector.

θG A vector of covariance parameters associated with random effects.

θk A vector of covariance parameters associated with the k th random effect.

θR A vector of covariance parameters associated with the residual term.

K Number of random effects.

S_R Number of repeated subjects.

S_k Number of subjects in the k th random effect.

V(θ) The n × n covariance matrix for y.

V̇_s(θ) First derivative of V(θ) with respect to the s th parameter in θ.

V̈_st(θ) Second derivative of V(θ) with respect to the s th and t th parameters in θ.

R(θ_R) The n × n covariance matrix for ε.

Ṙ_s(θ_R) First derivative of R(θ_R) with respect to the s th parameter in θ_R.

R̈_st(θ_R) Second derivative of R(θ_R) with respect to the s th and t th parameters in θ_R.

G(θ_G) The covariance matrix of random effects.

Ġ_s(θ_G) First derivative of G(θ_G) with respect to the s th parameter in θ_G.

G̈_st(θ_G) Second derivative of G(θ_G) with respect to the s th and t th parameters in θ_G.

V_k(θ_k) The covariance matrix of the k th random effect for one random subject.

V̇_k,s(θ_k) First derivative of V_k(θ_k) with respect to the s th parameter in θ_k.

V̈_k,st(θ_k) Second derivative of V_k(θ_k) with respect to the s th and t th parameters in θ_k.

y n × 1 vector of observed values of the dependent variable.

X n × p design matrix of fixed effects.

Z n × q design matrix of random effects.

r n × 1 vector of residuals.

β p × 1 vector of fixed-effects parameters.

γ q × 1 vector of random-effects parameters.

ε n × 1 vector of residual error terms.

W_c n × n diagonal matrix of case weights.

W_rw n × n diagonal matrix of regression weights.

Model

In this document, we assume a mixed effects model of the form

y = Xβ + Zγ + ε                                                                (1)

In this model, we assume that ε is distributed as N[0, R(θ_R)] and that γ is independently distributed as N[0, G(θ_G)]. Therefore y is distributed as N[Xβ, V(θ)], where V(θ) = Z G(θ_G) Z^T + R(θ_R). The unknown parameters include the regression parameters in β and the covariance parameters in θ. Estimation of these model parameters relies on the use of a Newton-Raphson or scoring algorithm. When we use either algorithm to find ML or REML solutions, we need to compute V^{-1}(θ) and its derivatives with respect to θ, which are computationally infeasible for large n. Wolfinger et al. (1994) discussed methods that avoid direct computation of V^{-1}(θ). They tackled the problem by using the SWEEP algorithm and by exploiting the block diagonal structure of G(θ_G) and R(θ_R). In the first half of this document, we detail the algorithm for a mixed model without subject blocking. In the second half of the document we refine the algorithm to exploit the structure of G(θ_G); this refined algorithm is the actual implementation of the algorithm.
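To make the model concrete, the following NumPy sketch (toy dimensions, a variance-components G and a scaled-identity R, all hypothetical) assembles V(θ) = Z G(θ_G) Z^T + R(θ_R) for a small simulated data set and evaluates the marginal −2 log-likelihood by brute force; the algorithms described below exist precisely to avoid forming and inverting V directly for realistic n.

import numpy as np

rng = np.random.default_rng(0)

n, p, q = 12, 2, 4                                     # toy sizes (hypothetical)
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed-effects design
Z = rng.integers(0, 2, size=(n, q)).astype(float)      # random-effects design
G = 0.5 * np.eye(q)                                    # G(theta_G): variance components
R = 1.0 * np.eye(n)                                    # R(theta_R): scaled identity

beta = np.array([1.0, -0.5])
V = Z @ G @ Z.T + R                                    # marginal covariance of y
y = rng.multivariate_normal(X @ beta, V)

# Brute-force -2 log-likelihood of (beta, theta); feasible only for small n.
resid = y - X @ beta
Vinv = np.linalg.inv(V)
neg2ll = np.linalg.slogdet(V)[1] + resid @ Vinv @ resid + n * np.log(2 * np.pi)
print(neg2ll)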

If there are regression weights, the covariance matrix R(θ_R) will be replaced by

R*(θ_R) = W_rw^{-1/2} R(θ_R) W_rw^{-1/2}.

To simplify notation, we will assume that the weights are already included in the matrix R(θ_R), and they will not be displayed in the remainder of this document. When case weights are specified, they will be rounded to the nearest integer and each case will be entered into the analysis multiple times, depending on the rounded case weight. Since replicating a case will lead to duplicate repeated measures (note that repeated measures are unique within a repeated subject), non-unity case weights will only be allowed for R(θ_R) with scaled identity structure. In MIXED, only cases with positive case weight and regression weight will be included in the analysis.

Fixed Effects Parameterization


The parameterization of fixed effects is the same as in the GLM procedure.

Random Effects Parameterization

If there are K random effects and S_k random subjects in the k th random effect, the design matrix Z will be partitioned as

Z = [Z_1  Z_2  ⋯  Z_K],

where Z_k is the design matrix of the k th random effect. Each Z_k can be partitioned further by random subjects as shown below:

Z_k = [Z_k1  Z_k2  ⋯  Z_kS_k],   k = 1, …, K.

The number of columns in the design matrix Z kj (the j th random subject of the
k th random effect) is equal to the number of levels of the k th random effect
variable.
Under this partition, G(θ_G) will be a block diagonal matrix which can be expressed as

G(θ_G) = ⊕_{k=1}^{K} [ I_{S_k} ⊗ V_k(θ_k) ].

It should also be noted that each random effect has its own parameter vector θ_k, k = 1, …, K, and there are no functional constraints between the elements of these parameter vectors. Thus θ_G = (θ_1, …, θ_K).
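As an illustration of this block structure, the following sketch (hypothetical V_k blocks and subject counts, using NumPy and SciPy) assembles G(θ_G) as the direct sum over the K random effects of I_{S_k} ⊗ V_k(θ_k).

import numpy as np
from scipy.linalg import block_diag

# Hypothetical covariance structures for K = 2 random effects.
V1 = np.array([[2.0, 0.5],
               [0.5, 1.0]])        # V_1(theta_1): 2 x 2 unstructured block
V2 = np.array([[3.0]])             # V_2(theta_2): a single variance component
S = [3, 4]                         # S_1 = 3 and S_2 = 4 random subjects

# G(theta_G) = direct sum over k of ( I_{S_k} kron V_k(theta_k) )
G = block_diag(*[np.kron(np.eye(S_k), V_k) for S_k, V_k in zip(S, [V1, V2])])
print(G.shape)                     # (3*2 + 4*1, 3*2 + 4*1) = (10, 10)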
Repeated Subjects:
When the REPEATED subcommand is used, R(θ_R) will be a block diagonal matrix whose i th block is R_i(θ_R), i = 1, …, S_R. That is,

R(θ_R) = ⊕_{i=1}^{S_R} R_i(θ_R)

The dimension of R_i(θ_R) will be equal to the number of cases in one repeated subject, but all R_i(θ_R) share the same parameter vector θ_R.

Likelihood Functions

Recall that the –2 log-likelihood using maximum likelihood estimation (ML) is

−2 l_MLE(β, θ) = log|V(θ)| + r(θ)^T V^{-1}(θ) r(θ) + n log 2π                  (2)

and the –2 log-likelihood using restricted maximum likelihood estimation (REML)


is

−2 l_REML(θ) = log|V(θ)| + r(θ)^T V^{-1}(θ) r(θ) + log|X^T V^{-1}(θ) X| + (n − p) log 2π      (3)

where n is the number of observations and p is the rank of the fixed-effects design matrix. From (2) and (3), we can see that the key components of the likelihood functions are

l_1(θ) = log|V(θ)|
l_2(θ) = r(θ)^T V^{-1}(θ) r(θ)                                                 (4)
l_3(θ) = log|X^T V^{-1}(θ) X|.
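For small problems these three components can be evaluated by direct linear algebra; the sketch below is a brute-force reference implementation (it assumes V is already formed, which the SWEEP-based computation described later avoids), with r(θ) taken as the generalized least squares residual y − X b_0.

import numpy as np

def likelihood_components(X, y, V):
    """Return l1, l2, l3 of (4) by direct (small-n) linear algebra."""
    Vinv = np.linalg.inv(V)
    XtVinv = X.T @ Vinv
    XtVinvX = XtVinv @ X
    b0 = np.linalg.pinv(XtVinvX) @ (XtVinv @ y)     # GLS estimate of beta
    r = y - X @ b0                                  # r(theta)
    l1 = np.linalg.slogdet(V)[1]                    # log |V(theta)|
    l2 = r @ Vinv @ r                               # r' V^-1 r
    l3 = np.linalg.slogdet(XtVinvX)[1]              # log |X' V^-1 X|
    return l1, l2, l3

# -2 l_MLE = l1 + l2 + n log(2 pi);  -2 l_REML = l1 + l2 + l3 + (n - p) log(2 pi)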

Therefore, in each estimation iteration, we need to compute l_1(θ), l_2(θ) and l_3(θ), as well as their first and second derivatives with respect to θ.

Newton & Scoring Algorithms


Covariance parameters in θ can be found by maximizing (2) or (3); however, there
are no closed form solutions in general. Therefore Newton and scoring algorithms
are used to find the solution numerically, as outlined below:

1. Compute the starting parameter values and initial log-likelihood (REML or


ML).
2. Compute the gradient vector g and Hessian matrix H of the log-likelihood
function using the previous iteration’s estimate θ i -1 . (See later section for
computation of g and H )
3. Compute the new step d = - H -1 g .
4. Let S = 1.
5. Compute the parameter estimates of the i th iteration: θ_i = θ_{i−1} + S d.
6. Check whether θ_i generates valid covariance matrices and improves the likelihood. If not, reduce S by half and repeat step (5). If this process is repeated a pre-specified number of times and the stated conditions are still not satisfied, stop.
7. Check for convergence of the parameter estimates. If convergence criteria are
met, then stop. Otherwise, go back to step (2).

Newton’s algorithm performs well if the starting value is close to the solution. In
order to improve the algorithm’s robustness to bad starting values, the scoring
algorithm is used in the first few iterations. This can be done easily by applying
different formulae for the Hessian matrix at each iteration. Apart from improved
robustness, the scoring algorithm is faster due to the simpler form of the Hessian
matrix.
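The following sketch shows the overall shape of this iteration with step halving. The callables neg2ll, grad, hess and is_valid are hypothetical placeholders for routines that evaluate the −2 log-likelihood, its gradient and Hessian (or the scoring approximation), and the validity of a candidate θ; they are not part of the procedure itself.

import numpy as np

def newton_scoring(theta0, neg2ll, grad, hess, is_valid,
                   max_iter=100, max_halvings=10, eps=1e-8):
    # grad and hess are understood as derivatives of the -2 log-likelihood,
    # so a successful step must decrease neg2ll.
    theta = np.asarray(theta0, dtype=float)
    f_old = neg2ll(theta)
    for _ in range(max_iter):
        g = grad(theta)                        # step 2
        H = hess(theta)
        d = -np.linalg.solve(H, g)             # step 3: d = -H^-1 g
        step = 1.0                             # step 4
        accepted = False
        for _ in range(max_halvings):          # steps 5-6: step halving
            theta_new = theta + step * d
            if is_valid(theta_new):
                f_new = neg2ll(theta_new)
                if f_new < f_old:
                    accepted = True
                    break
            step *= 0.5
        if not accepted:                       # halving limit reached: stop
            return theta
        if np.max(np.abs(theta_new - theta)) < eps:   # step 7 (absolute
            return theta_new                           # parameter criterion)
        theta, f_old = theta_new, f_new
    return theta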

Convergence Criteria
There are three types of convergence criteria: parameter convergence, log-
likelihood convergence and Hessian convergence. Parameter and log-likelihood
convergence are further subdivided into absolute and relative. If we let ε be some
given tolerance level and

θ s,i be the s th parameter in the i th iteration,


l_i be the log-likelihood in the i th iteration,
g_i be the gradient vector in the i th iteration,
and H_i be the Hessian matrix in the i th iteration,
then the criteria can be written as follows,

Absolute parameter convergence: max_s |θ_{s,i} − θ_{s,i−1}| < ε

Relative parameter convergence: max_s |θ_{s,i} − θ_{s,i−1}| < ε |θ_{s,i−1}|

Absolute log-likelihood convergence: |l_i − l_{i−1}| < ε

Relative log-likelihood convergence: |l_i − l_{i−1}| < ε |l_{i−1}|

Absolute Hessian convergence: g_i^T H_i^{-1} g_i < ε

Relative Hessian convergence: g_i^T H_i^{-1} g_i < ε |l_i|
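A minimal sketch of these checks is given below; the function name and signature are hypothetical, and a real implementation would apply only the criterion chosen by the user.

import numpy as np

def converged(theta, theta_prev, ll, ll_prev, g, H, eps, kind):
    if kind == "absolute parameter":
        return np.max(np.abs(theta - theta_prev)) < eps
    if kind == "relative parameter":
        return np.all(np.abs(theta - theta_prev) < eps * np.abs(theta_prev))
    if kind == "absolute likelihood":
        return abs(ll - ll_prev) < eps
    if kind == "relative likelihood":
        return abs(ll - ll_prev) < eps * abs(ll_prev)
    hess_stat = g @ np.linalg.solve(H, g)        # g' H^-1 g
    if kind == "absolute Hessian":
        return hess_stat < eps
    return hess_stat < eps * abs(ll)             # relative Hessian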

Starting value of Newton’s Algorithm


If no prior information is available, we can choose the initial values of G and R to be the identity matrix. However, it is highly desirable to estimate the scale of the variance parameters. By ignoring the random effects and assuming the residual errors are i.i.d. with variance σ², we can fit a GLM and estimate σ² by the residual sum of squares σ̂². Then we choose the starting values of Newton's algorithm to be

G_k = σ̂² / (K + 1)  and  R = σ̂² / (K + 1).

Confidence Intervals of Covariance Parameters

The estimate θ̂ (ML or REML) is asymptotically normally distributed. Its variance-covariance matrix can be approximated by 2H^{-1}, where H is the Hessian matrix of the −2 log-likelihood evaluated at θ̂. A simple Wald-type confidence interval for any covariance parameter can be obtained by using the asymptotic normality of the parameter estimates; however, this is not very appropriate for variance parameters and correlation parameters, which have ranges [0, ∞) and [−1, 1] respectively. Therefore these parameters are transformed to parameters with range (−∞, ∞). Using the uniform delta method (van der Vaart, 1998), these transformed estimates still have asymptotic normal distributions.

Suppose we are estimating a variance parameter σ² by σ̂_n², which is asymptotically distributed as N[σ², Var(σ̂_n²)]. The transformation we use is log(σ²), which can correct the skewness of σ̂_n²; moreover, log(σ̂_n²) has range (−∞, ∞), which matches that of the normal distribution. Using the delta method, one can show that the asymptotic distribution of log(σ̂_n²) is N[log(σ²), σ⁻⁴ Var(σ̂_n²)]. Thus a (1 − α)100% confidence interval for log(σ²) is given by

[log(σ̂_n²) − z_{1−α/2} σ̂_n^{-2} √Var(σ̂_n²),  log(σ̂_n²) + z_{1−α/2} σ̂_n^{-2} √Var(σ̂_n²)]

where z_{1−α/2} is the upper (1 − α/2) percentage point of the standard normal distribution. By inverting the log transformation of this confidence interval, a (1 − α)100% confidence interval for σ² is given by

[exp(log(σ̂_n²) − z_{1−α/2} σ̂_n^{-2} √Var(σ̂_n²)),  exp(log(σ̂_n²) + z_{1−α/2} σ̂_n^{-2} √Var(σ̂_n²))]

When we need a confidence interval for a correlation parameter ρ, a possible transformation is its generalized logit arctanh(ρ) = 0.5 log[(1 + ρ)/(1 − ρ)]. The resulting confidence interval for ρ is

[tanh(arctanh(ρ̂) − z_{1−α/2} (1 − ρ̂²)^{-1} √Var(ρ̂)),  tanh(arctanh(ρ̂) + z_{1−α/2} (1 − ρ̂²)^{-1} √Var(ρ̂))]
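Both back-transformed intervals can be computed as in the sketch below (hypothetical helper functions; the point estimates and their asymptotic variances are assumed to come from the Hessian-based approximation described above).

import numpy as np
from scipy.stats import norm

def variance_ci(var_hat, var_of_var_hat, alpha=0.05):
    # Wald interval built on the log scale, then exponentiated.
    z = norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(var_of_var_hat) / var_hat      # z * sigma_hat^-2 * sqrt(Var)
    return np.exp(np.log(var_hat) - half), np.exp(np.log(var_hat) + half)

def correlation_ci(rho_hat, var_of_rho_hat, alpha=0.05):
    # Wald interval built on the arctanh scale, then mapped back with tanh.
    z = norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(var_of_rho_hat) / (1 - rho_hat ** 2)
    zeta = np.arctanh(rho_hat)
    return np.tanh(zeta - half), np.tanh(zeta + half)

print(variance_ci(2.0, 0.5))        # hypothetical estimate and asymptotic variance
print(correlation_ci(0.3, 0.01))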

Fixed and Random Effect Parameters


Estimation and prediction
After we obtain an estimate of θ, the best linear unbiased estimator (BLUE) of β and the best linear unbiased predictor (BLUP) of γ can be found by solving the mixed model equations (Henderson, 1984):

[ X^T R̂^{-1} X    X^T R̂^{-1} Z          ] [ β̂ ]   [ X^T R̂^{-1} y ]
[ Z^T R̂^{-1} X    Z^T R̂^{-1} Z + Ĝ^{-1} ] [ γ̂ ] = [ Z^T R̂^{-1} y ]             (5)

The solution of (5) can be expressed as

β̂ = (X^T V̂^{-1} X)^− X^T V̂^{-1} y
γ̂ = Ĝ Z^T V̂^{-1} (y − X β̂)                                                     (6)
  = Ĝ [Z^T V̂^{-1} y − Z^T V̂^{-1} X β̂]

The covariance matrix C of β̂ and γ̂ is given by

C = Cov(β̂, γ̂)

  = [ X^T R̂^{-1} X    X^T R̂^{-1} Z          ]^−
    [ Z^T R̂^{-1} X    Z^T R̂^{-1} Z + Ĝ^{-1} ]                                   (7)

  = [ Ĉ_11   Ĉ_21^T ]
    [ Ĉ_21   Ĉ_22   ]

where

Ĉ_11 = (X^T V̂^{-1} X)^−

Ĉ_21 = −Ĝ Z^T V̂^{-1} X Ĉ_11

Ĉ_22 = (Z^T R̂^{-1} Z + Ĝ^{-1})^{-1} − Ĉ_21 X^T V̂^{-1} Z Ĝ
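The sketch below illustrates (5)-(7) on a toy problem: it forms the coefficient matrix of the mixed model equations for hypothetical estimates Ĝ and R̂, takes its (generalized) inverse as C, and reads off β̂ and γ̂. This is a didactic shortcut rather than the procedure's actual computational path.

import numpy as np

rng = np.random.default_rng(1)
n, p, q = 20, 2, 3                                     # hypothetical toy sizes
X = np.column_stack([np.ones(n), rng.normal(size=n)])
group = rng.integers(0, q, size=n)
Z = (group[:, None] == np.arange(q)).astype(float)     # one-hot random-effect design
G_hat = 0.8 * np.eye(q)                                # estimated G
R_hat = 1.2 * np.eye(n)                                # estimated R
y = rng.normal(size=n)

Rinv = np.linalg.inv(R_hat)
Ginv = np.linalg.inv(G_hat)

# Coefficient matrix and right-hand side of the mixed model equations (5).
coef = np.block([[X.T @ Rinv @ X, X.T @ Rinv @ Z],
                 [Z.T @ Rinv @ X, Z.T @ Rinv @ Z + Ginv]])
rhs = np.concatenate([X.T @ Rinv @ y, Z.T @ Rinv @ y])

C = np.linalg.pinv(coef)     # covariance matrix C of (beta-hat, gamma-hat), as in (7)
sol = C @ rhs                # solution of (5)
beta_hat, gamma_hat = sol[:p], sol[p:]
print(beta_hat, gamma_hat)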

Custom Hypotheses

In general, one can construct estimators or predictors for

L b = [L_0   L_1] [ β ]
                  [ γ ]                                                         (8)

for some hypothesis matrix L. Estimators or predictors of Lb can easily be constructed by substituting β̂ and γ̂ into (8), and its variance-covariance matrix can be approximated by L C L^T. If L_1 is zero and L_0 β is estimable, L b̂ is called the best linear unbiased estimator of L_0 β. If L_1 is nonzero and L_0 β is estimable, L b̂ is called the best linear unbiased predictor of Lb.

To test the hypothesis H_0: Lb = a for a given vector a, we can use the statistic

F = (L b̂ − a)^T (L Ĉ L^T)^{-1} (L b̂ − a) / q                                   (9)

where q is the rank of the matrix L. The statistic in (9) has an approximate F distribution. The numerator degrees of freedom is q, and the denominator degrees of freedom can be obtained by the Satterthwaite (1946) approximation. The method outlined below is similar to Giesbrecht and Burns (1985), McLean and Sanders (1988), and Fai and Cornelius (1996).

Satterthwaite’s Approximation
To find the denominator degrees of freedom of (9), we first perform the spectral decomposition L Ĉ L^T = Γ^T D Γ, where Γ is an orthogonal matrix of eigenvectors and D is a diagonal matrix of eigenvalues. Let l_m be the m th row of Γ L, let d_m be the m th eigenvalue, and let

ν_m = 2 d_m² / (g_m^T Σ(θ̂)^{-1} g_m)

where g_m = ∂(l_m C l_m^T)/∂θ evaluated at θ = θ̂, and Σ(θ̂)^{-1} is the covariance matrix of the estimated covariance parameters. If we let

E = Σ_{m=1}^{q} [ν_m / (ν_m − 2)] I(ν_m > 2),

then the denominator degrees of freedom is given by

ν = 2E / (E − q).

Note that the degrees of freedom can only be computed when E > q.
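A sketch of this recipe is given below. It assumes the caller has already computed L Ĉ L^T and, for each eigenvalue, the quadratic form appearing in the denominator of ν_m (obtaining the g_m requires differentiating l_m C l_m^T with respect to θ, which is omitted here); the helper name and its inputs are hypothetical.

import numpy as np

def satterthwaite_df(LCL, quad_forms):
    # LCL is L C-hat L'; quad_forms[m] is the denominator quadratic form for nu_m.
    d = np.linalg.eigvalsh(LCL)                        # eigenvalues of L C-hat L'
    nu = 2.0 * d ** 2 / np.asarray(quad_forms, float)  # nu_m = 2 d_m^2 / (quad form)
    q = len(d)
    keep = nu > 2
    E = np.sum(nu[keep] / (nu[keep] - 2))              # E = sum nu_m/(nu_m - 2) I(nu_m > 2)
    return 2 * E / (E - q) if E > q else float("nan")  # defined only when E > q

# Hypothetical inputs: a 2 x 2 L C-hat L' and the two quadratic forms.
print(satterthwaite_df(np.array([[0.5, 0.1],
                                 [0.1, 0.3]]), [0.02, 0.01]))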

Type I & III Statistics

Type I or III test statistics are special cases of custom hypothesis tests.

Saved Values

If predicted values are requested, they are computed by

ŷ = X β̂ + Z γ̂                                                                  (10)

using the estimates given in (6).

If fixed predicted values are requested, they are computed by

ŷ_F = X β̂                                                                      (11)

If residuals are requested, they are computed by

r = ŷ − y                                                                       (12)

Information Criteria

Information criteria are used for model comparison, and the following criteria are given in “smaller is better” form. If we let l be the log-likelihood (REML or ML), n be the total number of cases (or the total of the case weights, if used) and d be the number of model parameters, the formulae for the various criteria are given below:

• Akaike information criterion (AIC), Akaike (1974):

−2l + 2d

• Finite sample corrected AIC (AICC), Hurvich and Tsai (1989):

−2l + 2d·n / (n − d − 1)

• Bayesian information criterion (BIC), Schwarz (1978):

−2l + d·log(n)

• Consistent AIC (CAIC), Bozdogan (1987):

−2l + d·(log(n) + 1)

For REML, the value of n is chosen to be the total number of cases minus the number of fixed-effect parameters, and d is the number of covariance parameters. For ML, the value of n is the total number of cases, and d is the number of fixed-effect parameters plus the number of covariance parameters.
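A direct transcription of these formulas, together with the ML/REML conventions for n and d, might look as follows (hypothetical helper; the numeric arguments in the example call are arbitrary).

import numpy as np

def information_criteria(log_lik, n_cases, p_fixed, n_cov, reml=False):
    if reml:
        n = n_cases - p_fixed        # total cases minus fixed-effect parameters
        d = n_cov                    # covariance parameters only
    else:
        n = n_cases
        d = p_fixed + n_cov          # fixed-effect plus covariance parameters
    m2l = -2.0 * log_lik
    return {
        "AIC":  m2l + 2 * d,
        "AICC": m2l + 2 * d * n / (n - d - 1),
        "BIC":  m2l + d * np.log(n),
        "CAIC": m2l + d * (np.log(n) + 1),
    }

print(information_criteria(-150.0, n_cases=100, p_fixed=3, n_cov=2, reml=True))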

1st and 2nd Derivatives of l_k(θ)

In each Newton or scoring iteration we need to compute the first and second derivatives of the components of the log-likelihood, l_k(θ), k = 1, 2, 3. Here we let g_k = ∂l_k(θ)/∂θ and H_k = ∂²l_k(θ)/∂θ ∂θ^T, k = 1, 2, 3. The first derivatives with respect to the s th parameter in θ are given by

[g_1]_s = tr(V^{-1} V̇_s),
[g_2]_s = −r^T V^{-1} V̇_s V^{-1} r,                                            (13)
[g_3]_s = −tr(X̃^T V^{-1} V̇_s V^{-1} X̃)

and the second derivatives with respect to the s th and t th parameters are given by

[H_1]_st = −tr(V^{-1} V̇_s V^{-1} V̇_t) + tr(V^{-1} V̈_st),

[H_2]_st = 2 r^T V^{-1} V̇_s V^{-1} V̇_t V^{-1} r
           − 2 r^T V^{-1} V̇_s X̃ X̃^T V^{-1} V̇_t V^{-1} r                        (14)
           − r^T V^{-1} V̈_st V^{-1} r,

[H_3]_st = 2 tr(X̃^T V^{-1} V̇_s V^{-1} V̇_t V^{-1} X̃)
           − tr(X̃^T V^{-1} V̇_s V^{-1} X̃ X̃^T V^{-1} V̇_t V^{-1} X̃)
           − tr(X̃^T V^{-1} V̈_st V^{-1} X̃)

where X̃ = XC for a matrix C satisfying C C^T = (X^T V^{-1} X)^− = P, and
r = [I − X(X^T V^{-1} X)^− X^T V^{-1}] y = y − X β̂.
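As a quick sanity check of the first formula in (13), the sketch below compares tr(V^{-1} V̇_s) with a central finite difference of log|V(θ)| for a one-parameter toy structure V(θ) = θ Z Z^T + I, an assumption made only for this example.

import numpy as np

rng = np.random.default_rng(2)
n, q = 10, 3
Z = rng.normal(size=(n, q))

def V_of(theta):
    # One-parameter toy structure: V(theta) = theta * Z Z' + I_n (hypothetical).
    return theta * Z @ Z.T + np.eye(n)

theta, h = 0.7, 1e-6
V = V_of(theta)
Vdot = Z @ Z.T                                            # dV/dtheta for this structure

analytic = np.trace(np.linalg.solve(V, Vdot))             # [g1]_s = tr(V^-1 Vdot_s)
numeric = (np.linalg.slogdet(V_of(theta + h))[1]
           - np.linalg.slogdet(V_of(theta - h))[1]) / (2 * h)
print(analytic, numeric)                                  # agree to within ~1e-6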

Derivatives w.r.t. Parameters in G

Derivatives with respect to parameters in G can be constructed from the entries of

                     [ W_1(X,X)  W_1(X,Z)  W_1(X,r) ]
     W_1(X; r; Z)  = [ W_1(Z,X)  W_1(Z,Z)  W_1(Z,r) ]
                     [ W_1(r,X)  W_1(r,Z)  W_1(r,r) ]
                                                                                (15)
                     [ X^T V^{-1} X   X^T V^{-1} Z   X^T V^{-1} r ]
                   = [ Z^T V^{-1} X   Z^T V^{-1} Z   Z^T V^{-1} r ]
                     [ r^T V^{-1} X   r^T V^{-1} Z   r^T V^{-1} r ]

The matrix W_1(X; r; Z) can be computed from W_1(X; y; Z), given in (27), by using the relationship

r = y − X b_0

where b_0 is the current estimate of β.

Using the above formula, we can obtain the following expressions:

r^T V^{-1} r = y^T V^{-1} y − y^T V^{-1} X b_0 = W_1(y, y) − W_1(y, X) b_0      (16)

X^T V^{-1} r = X^T V^{-1} y − X^T V^{-1} X b_0 = W_1(X, y) − W_1(X, X) b_0      (17)

Z^T V^{-1} r = Z^T V^{-1} y − Z^T V^{-1} X b_0 = W_1(Z, y) − W_1(Z, X) b_0      (18)

In terms of the elements of the W_1(X; r; Z) matrix, we can write down the first derivatives of l_1, l_2 and l_3 with respect to a parameter θ_s of the G matrix:

[g_1]_{G,s} = tr(W_1(Z,Z) Ġ_s),

[g_2]_{G,s} = −W_1(Z,r)^T Ġ_s W_1(Z,r),                                         (19)

[g_3]_{G,s} = −tr(W_1(X,Z) Ġ_s W_1(Z,X) P).

For the second derivatives, we first define the following simplification factors:

H_st^G1 = −W_1(Z,Z) Ġ_s W_1(Z,Z) Ġ_t + W_1(Z,Z) G̈_st

H_s^G2 = W_1(X,Z) Ġ_s W_1(Z,r)

H_st^G2 = 2 W_1(r,Z) Ġ_s W_1(Z,Z) Ġ_t W_1(Z,r) − W_1(r,Z) G̈_st W_1(Z,r)

H_s^G3 = W_1(X,Z) Ġ_s W_1(Z,X)

H_st^G3 = 2 W_1(X,Z) Ġ_s W_1(Z,Z) Ġ_t W_1(Z,X) − W_1(X,Z) G̈_st W_1(Z,X)

Then the second derivatives of l_1, l_2 and l_3 with respect to θ_s and θ_t (in G) are given by

[H_1]_{G,st} = tr(H_st^G1)

[H_2]_{G,st} = H_st^G2 − 2 (H_s^G2)^T P H_t^G2                                  (20)

[H_3]_{G,st} = tr(H_st^G3 P) − tr(H_s^G3 P H_t^G3 P)

Derivatives w.r.t. Parameters in R

To compute the R derivatives, we need to introduce the matrices

W_0^{(1)s} = −∂W_0 / ∂θ_s

and

W_0^{(2)st} = −∂²W_0 / ∂θ_s ∂θ_t

where θ_s and θ_t are the s th and t th parameters of R. Therefore,

W_0(A, B) = A^T R^{-1} B

W_0^{(1)s}(A, B) = A^T R^{-1} Ṙ_s R^{-1} B
                 = −A^T [∂R^{-1}(θ)/∂θ_s] B

W_0^{(2)st}(A, B) = A^T [R^{-1} R̈_st R^{-1} − R^{-1} Ṙ_s R^{-1} Ṙ_t R^{-1} − R^{-1} Ṙ_t R^{-1} Ṙ_s R^{-1}] B
                  = −A^T [∂²R^{-1}(θ)/∂θ_s ∂θ_t] B

The matrices A and B can be X, Z, Z̃ or r, where

Z̃ = ZM = Z (G^{-1} + Z^T R^{-1} Z)^{-1}, and

r = [I − X(X^T V^{-1} X)^− X^T V^{-1}] y = y − X b_0

Remark: The matrix (G^{-1} + Z^T R^{-1} Z)^{-1} involved in Z̃ can be obtained by pre- and post-multiplying (I + L^T Z^T R^{-1} Z L)^− by L and L^T.

Using these notations, the first derivatives of l_k(θ) with respect to a parameter in R are as follows:

[g_1]_{R,s} = tr(R^{-1} Ṙ_s) − tr(W_0^{(1)s}(Z,Z) M)

[g_2]_{R,s} = −W_0^{(1)s}(r,r) + 2 W_0(r,Z̃) W_0^{(1)s}(Z,r)
              − W_0(r,Z̃) W_0^{(1)s}(Z,Z) W_0(Z̃,r)                              (21)

[g_3]_{R,s} = −tr(H_s^R3 P)

To compute the second derivatives with respect to θ_s and θ_t (of R), we need to consider the following simplification factors:

H_st^R1 = −R^{-1} Ṙ_s R^{-1} Ṙ_t + R^{-1} R̈_st
          − W_0^{(2)st}(Z,Z) M − W_0^{(1)s}(Z,Z) M W_0^{(1)t}(Z,Z) M

H_s^R2 = W_0^{(1)s}(X,r) + W_0(X,Z̃) W_0^{(1)s}(Z,Z) W_0(Z̃,r)
         − W_0(X,Z̃) W_0^{(1)s}(Z,r) − W_0^{(1)s}(X,Z̃) W_0(Z,r)

H_st^R2 = −W_0^{(2)st}(r,r)
          − W_0(r,Z̃) W_0^{(2)st}(Z,Z) W_0(Z̃,r)
          + 2 W_0(r,Z̃) W_0^{(2)st}(Z,r)
          − 2 [W_0(r,Z̃) W_0^{(1)s}(Z,Z) − W_0^{(1)s}(r,Z)] M
              × [W_0^{(1)t}(Z,Z) W_0(Z̃,r) − W_0^{(1)t}(Z,r)]

H_s^R3 = W_0^{(1)s}(X,X) − W_0(X,Z̃) W_0^{(1)s}(Z,X)
         − W_0^{(1)s}(X,Z̃) W_0(Z,X)
         + W_0(X,Z̃) W_0^{(1)s}(Z,Z) W_0(Z̃,X)

H_st^R3 = −W_0^{(2)st}(X,X)
          − W_0(X,Z̃) W_0^{(2)st}(Z,Z) W_0(Z̃,X)
          + 2 W_0(X,Z̃) W_0^{(2)st}(Z,X)
          − 2 [W_0(X,Z̃) W_0^{(1)s}(Z,Z) − W_0^{(1)s}(X,Z)] M
              × [W_0^{(1)t}(Z,Z) W_0(Z̃,X) − W_0^{(1)t}(Z,X)]

Based on these simplification terms, the entries of the Hessian matrices are given by

[H_1]_{R,st} = tr(H_st^R1)

[H_2]_{R,st} = H_st^R2 − 2 (H_s^R2)^T P H_t^R2                                  (22)

[H_3]_{R,st} = tr(H_st^R3 P − H_s^R3 P H_t^R3 P)

G & R Cross Derivatives

This section gives expressions for the second derivatives of l_1, l_2 and l_3 with respect to one parameter in G and one parameter in R. First, we introduce the following simplification terms:

H_st^GR1 = −W_0^{(1)s}(Z,Z) Ġ_t + 2 W_0^{(1)s}(Z̃,Z) Ġ_t W_0(Z,Z)
           − W_0^{(1)s}(Z̃,Z) W_0(Z̃,Z) Ġ_t W_0(Z,Z)

H_st^GR2 = 2 [W_0^{(1)s}(r,Z) − W_0(r,Z) M W_0^{(1)s}(Z,Z)]
             × [M W_0(Z,Z) − I] Ġ_t [W_0(Z,Z) M − I]
             × W_0(Z,r)

H_st^GR3 = 2 [W_0^{(1)s}(X,Z) − W_0(X,Z) M W_0^{(1)s}(Z,Z)]
             × [M W_0(Z,Z) − I] Ġ_t [W_0(Z,Z) M − I]
             × W_0(Z,X)

Based on these simplification terms, the second derivatives are given by

[H_1]_{GR,st} = tr(H_st^GR1)

[H_2]_{GR,st} = H_st^GR2 − 2 (H_s^G2)^T P H_t^R2                                (23)

[H_3]_{GR,st} = tr(H_st^GR3 P − H_s^G3 P H_t^R3 P)

Gradient & Hessian of REML

The restricted log-likelihood is given by

−2 l_REML(θ | y) = log|V(θ)| + r(θ)^T V^{-1}(θ) r(θ) + log|X^T V^{-1}(θ) X| + (n − p) log 2π

where p is equal to the rank of X.

Therefore the s th element of the gradient vector is given by

[g]_s = [g_1]_s + [g_2]_s + [g_3]_s

and the (s,t) th element of the Hessian matrix is given by

[H]_st = [H_1]_st + [H_2]_st + [H_3]_st

If the scoring algorithm is used, the Hessian can be simplified to

[H]_st = −[H_1]_st + [H_3]_st.

Gradient & Hessian of MLE

The log-likelihood is given by

−2 l_MLE(θ | y) = log|V(θ)| + r(θ)^T V^{-1}(θ) r(θ) + n log 2π

Therefore the s th element of the gradient vector is given by

[g]_s = [g_1]_s + [g_2]_s

and the (s,t) th element of the Hessian matrix is given by

[H]_st = [H_1]_st + [H_2]_st.

If the scoring algorithm is used, the Hessian can be simplified to

[H]_st = −[H_1]_st.

It should be noted that the Hessian matrices for the scoring algorithm in both ML and REML are not exact. In order to speed up the calculation, some second-derivative terms are dropped. Therefore, they are used only in intermediate steps of the optimization, not for standard error calculations.

Cross Product matrices

During the estimation we need to construct several cross-product matrices in each iteration, namely W_0(X; y; Z), W_1(X; y; Z), W_0^A(X; y; Z), W_1^A(X; y; Z), W_b0(X; y), and W_b1(X; y). The SWEEP operator (see, for example, Goodnight (1979)) is used in constructing these matrices. Basically, the SWEEP operator performs the following transformation:

[ A     B ]       [  A^−          A^− B        ]
[ B^T   C ]  ⇒    [ −B^T A^−      C − B^T A^− B ].
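A minimal NumPy implementation of this single-pivot transformation is sketched below, together with a check that sweeping the pivots of the upper-left block reproduces the block formula above, and that accumulating the logs of the pivots just before each sweep yields log|A| (the device used in Step 3 to obtain l_1(θ)). The function name is hypothetical.

import numpy as np

def sweep(W, k):
    # Sweep the symmetric matrix W on pivot k, following the transformation above.
    W = np.array(W, dtype=float)
    d = W[k, k]
    row = W[k, :] / d
    col = W[:, k] / d
    W -= np.outer(W[:, k], W[k, :]) / d    # C - B' A^- B part
    W[k, :] = row                          # A^- B part
    W[:, k] = -col                         # -B' A^- part
    W[k, k] = 1.0 / d                      # A^- part
    return W

# Check against the block formula on a random symmetric positive definite matrix.
rng = np.random.default_rng(3)
M = rng.normal(size=(5, 5))
W = M @ M.T + 5 * np.eye(5)
A, B, C = W[:2, :2], W[:2, 2:], W[2:, 2:]

swept, logdet = W, 0.0
for k in range(2):                         # sweep the two pivots of the upper-left block
    logdet += np.log(swept[k, k])          # accumulate log of pivot before sweeping
    swept = sweep(swept, k)

print(np.allclose(swept[:2, :2], np.linalg.inv(A)))                   # A^-
print(np.allclose(swept[2:, 2:], C - B.T @ np.linalg.inv(A) @ B))     # C - B' A^- B
print(np.isclose(logdet, np.linalg.slogdet(A)[1]))                    # log |A|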

The steps needed to construct these matrices are outlined below,

STEP 1:
Construct

                  [ X^T R^{-1} X   X^T R^{-1} Z   X^T R^{-1} y ]
W_0(X; y; Z)  =   [ Z^T R^{-1} X   Z^T R^{-1} Z   Z^T R^{-1} y ]                (24)
                  [ y^T R^{-1} X   y^T R^{-1} Z   y^T R^{-1} y ]

STEP 2:

Construct W_0^A(X; y; Z), which is an augmented version of W_0(X; y; Z). It is given by the following expression:

W_0^A(X; y; Z) = [ I + L^T Z^T R^{-1} Z L    L^T W_0(Z, ·) ]
                 [ W_0(·, Z) L               W_0           ]                    (25)

where L is the lower-triangular Cholesky root of G, that is, G = L L^T, and W_0(Z, ·) denotes the rows of W_0 corresponding to Z.

STEP 3:

Sweeping W_0^A(X; y; Z) by pivoting on the diagonal elements of its upper-left partition gives us the matrix W_1^A(X; y; Z), which is shown below:

W_1^A(X; y; Z) = [ W_1^A(1,1)                   W_1^A(1,1) L^T W_0(Z, ·) ]
                 [ −W_0(·, Z) L W_1^A(1,1)      W_1(X; y; Z)             ]      (26)

where

                 [ X^T V^{-1} X   X^T V^{-1} Z   X^T V^{-1} y ]
W_1(X; y; Z)  =  [ Z^T V^{-1} X   Z^T V^{-1} Z   Z^T V^{-1} y ]                 (27)
                 [ y^T V^{-1} X   y^T V^{-1} Z   y^T V^{-1} y ]

and

W_1^A(1,1) = (I + L^T Z^T R^{-1} Z L)^−.

During the sweeping, if we accumulate the log of the i th diagonal element just before the i th sweep, we obtain

log|I + L^T Z^T R^{-1} Z L| = log|V| − log|R|

as a by-product. Thus, adding log|R| to this quantity gives us l_1(θ).

STEP 4:

Consider the following submatrix W_b0(X; y) of W_1(X; y; Z):

W_b0(X; y) = [ X^T V^{-1} X   X^T V^{-1} y ]
             [ y^T V^{-1} X   y^T V^{-1} y ].                                   (28)

Sweeping W_b0(X; y) by pivoting on the diagonal elements of X^T V^{-1} X gives us

W_b1(X; y) = [ (X^T V^{-1} X)^−   b_0    ]
             [ b_0^T              l_2(θ) ]                                      (29)

where b_0 is the estimate of β in the current iteration. After this step, we obtain l_2(θ) and, by again accumulating the logs of the pivots during the sweep, l_3(θ) = log|X^T V^{-1} X|.

References

Akaike, H. (1974), A New Look at the Statistical Model Identification, IEEE Transactions on Automatic Control, AC-19, 716-723.

Bozdogan, H. (1987), Model Selection and Akaike's Information Criterion (AIC): The General Theory and Its Analytical Extensions, Psychometrika, 52, 345-370.

Fai, A.H.T. and Cornelius, P.L. (1996), Approximate F-tests of Multiple Degree of Freedom Hypotheses in Generalized Least Squares Analyses of Unbalanced Split-plot Experiments, Journal of Statistical Computation and Simulation, 54, 363-378.

Giesbrecht, F.G. and Burns, J.C. (1985), Two-Stage Analysis Based on a Mixed Model: Large-Sample Asymptotic Theory and Small-Sample Simulation Results, Biometrics, 41, 477-486.

Goodnight, J.H. (1979), A Tutorial on the SWEEP Operator, The American Statistician, 33(3), 149-158.

Henderson, C.R. (1984), Applications of Linear Models in Animal Breeding, University of Guelph.

Hurvich, C.M. and Tsai, C.-L. (1989), Regression and Time Series Model Selection in Small Samples, Biometrika, 76, 297-307.

McLean, R.A. and Sanders, W.L. (1988), Approximating Degrees of Freedom for Standard Errors in Mixed Linear Models, Proceedings of the Statistical Computing Section, American Statistical Association, New Orleans, 50-59.

Schwarz, G. (1978), Estimating the Dimension of a Model, Annals of Statistics, 6, 461-464.

Wolfinger, R., Tobias, R., and Sall, J. (1994), Computing Gaussian Likelihoods and Their Derivatives for General Linear Mixed Models, SIAM Journal on Scientific Computing, 15(6), 1294-1310.
