Linear Models and the Relevant Distributions and Matrix Algebra
CHAPMAN & HALL/CRC
Texts in Statistical Science Series
Series Editors
Joseph K. Blitzstein, Harvard University, USA
Julian J. Faraway, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada
Mathematical Statistics: Basic Ideas and Selected Topics, Volume I, Second Edition
P. J. Bickel and K. A. Doksum
Mathematical Statistics: Basic Ideas and Selected Topics, Volume II
P. J. Bickel and K. A. Doksum
Analysis of Categorical Data with R
C. R. Bilder and T. M. Loughin
Statistical Methods for SPC and TQM
D. Bissell
Introduction to Probability
J. K. Blitzstein and J. Hwang
Bayesian Methods for Data Analysis, Third Edition
B. P. Carlin and T. A. Louis
Statistics in Research and Development, Second Edition
R. Caulcutt
The Analysis of Time Series: An Introduction, Sixth Edition
C. Chatfield
Introduction to Multivariate Analysis
C. Chatfield and A. J. Collins
Modelling Survival Data in Medical Research, Third Edition
D. Collett
Introduction to Statistical Methods for Clinical Trials
T. D. Cook and D. L. DeMets
Applied Statistics: Principles and Examples
D. R. Cox and E. J. Snell
Multivariate Survival Analysis and Competing Risks
M. Crowder
Statistical Analysis of Reliability Data
M. J. Crowder, A. C. Kimber, T. J. Sweeting, and R. L. Smith
An Introduction to Generalized Linear Models, Third Edition
A. J. Dobson and A. G. Barnett
Nonlinear Time Series: Theory, Methods, and Applications with R Examples
R. Douc, E. Moulines, and D. S. Stoffer
Introduction to Optimization Methods and Their Applications in Statistics
B. S. Everitt
Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models, Second Edition
J. J. Faraway
Linear Models with R, Second Edition
J. J. Faraway
A Course in Large Sample Theory
T. S. Ferguson
Multivariate Statistics: A Practical Approach
B. Flury and H. Riedwyl
Readings in Decision Analysis
S. French
Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data
M. Friendly and D. Meyer
Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Second Edition
D. Gamerman and H. F. Lopes
Mathematical Statistics
K. Knight
Introduction to Functional Data Analysis
P. Kokoszka and M. Reimherr
Introduction to Multivariate Analysis: Linear and Nonlinear Modeling
S. Konishi
Nonparametric Methods in Statistics with SAS Applications
O. Korosteleva
Modeling and Analysis of Stochastic Systems, Third Edition
V. G. Kulkarni
Exercises and Solutions in Biostatistical Theory
L. L. Kupper, B. H. Neelon, and S. M. O’Brien
Exercises and Solutions in Statistical Theory
L. L. Kupper, B. H. Neelon, and S. M. O’Brien
Design and Analysis of Experiments with R
J. Lawson
Design and Analysis of Experiments with SAS
J. Lawson
David A. Harville
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity
of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized
in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying,
microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Preface ix
1 Introduction 1
1.1 Linear Statistical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Classificatory Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Hierarchical Models and Random-Effects Models . . . . . . . . . . . . . . . . . 7
1.5 Statistical Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6 An Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2 Matrix Algebra: A Primer 23
2.1 The Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Partitioned Matrices and Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Trace of a (Square) Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Linear Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5 Inverse Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.6 Ranks and Inverses of Partitioned Matrices . . . . . . . . . . . . . . . . . . . . . 44
2.7 Orthogonal Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.8 Idempotent Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.9 Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.10 Generalized Inverses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.11 Linear Systems Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.12 Projection Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.13 Quadratic Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.14 Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Bibliographic and Supplementary Notes . . . . . . . . . . . . . . . . . . . . . . 85
3 Random Vectors and Matrices 87
3.1 Expected Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.2 Variances, Covariances, and Correlations . . . . . . . . . . . . . . . . . . . . . . 89
3.3 Standardized Version of a Random Variable . . . . . . . . . . . . . . . . . . . . 97
3.4 Conditional Expected Values and Conditional Variances and Covariances . . . . 100
3.5 Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Bibliographic and Supplementary Notes . . . . . . . . . . . . . . . . . . . . . . 122
4 The General Linear Model 123
4.1 Some Basic Types of Linear Models . . . . . . . . . . . . . . . . . . . . . . . . 124
4.2 Some Specific Types of Gauss–Markov Models (with Examples) . . . . . . . . . 129
4.3 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
4.4 Heteroscedastic and Correlated Residual Effects . . . . . . . . . . . . . . . . . . 136
4.5 Multivariate Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Bibliographic and Supplementary Notes . . . . . . . . . . . . . . . . . . . . . . 162
5 Estimation and Prediction: Classical Approach 165
5.1 Linearity and Unbiasedness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
5.2 Translation Equivariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.3 Estimability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
5.4 The Method of Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
5.5 Best Linear Unbiased or Translation-Equivariant Estimation of Estimable Functions
(under the G–M Model) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
5.6 Simultaneous Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
5.7 Estimation of Variability and Covariability . . . . . . . . . . . . . . . . . . . . . 198
5.8 Best (Minimum-Variance) Unbiased Estimation . . . . . . . . . . . . . . . . . . 211
5.9 Likelihood-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
5.10 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
Bibliographic and Supplementary Notes . . . . . . . . . . . . . . . . . . . . . . 252
6 Some Relevant Distributions and Their Properties 253
6.1 Chi-Square, Gamma, Beta, and Dirichlet Distributions . . . . . . . . . . . . . . 253
6.2 Noncentral Chi-Square Distribution . . . . . . . . . . . . . . . . . . . . . . . . 267
6.3 Central and Noncentral F Distributions . . . . . . . . . . . . . . . . . . . . . . 281
6.4 Central, Noncentral, and Multivariate t Distributions . . . . . . . . . . . . . . . 290
6.5 Moment Generating Function of the Distribution of One or More Quadratic Forms
or Second-Degree Polynomials (in a Normally Distributed Random Vector) . . . 303
6.6 Distribution of Quadratic Forms or Second-Degree Polynomials (in a Normally
Distributed Random Vector): Chi-Squareness . . . . . . . . . . . . . . . . . . . 308
6.7 The Spectral Decomposition, with Application to the Distribution of Quadratic
Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
6.8 More on the Distribution of Quadratic Forms or Second-Degree Polynomials (in a
Normally Distributed Random Vector) . . . . . . . . . . . . . . . . . . . . . . . 326
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
Bibliographic and Supplementary Notes . . . . . . . . . . . . . . . . . . . . . . 349
7 Confidence Intervals (or Sets) and Tests of Hypotheses 351
7.1 “Setting the Stage”: Response Surfaces in the Context of a Specific Application and
in General . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
7.2 Augmented G–M Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
7.3 The F Test (and Corresponding Confidence Set) and a Generalized S Method . . 364
7.4 Some Optimality Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
7.5 One-Sided t Tests and the Corresponding Confidence Bounds . . . . . . . . . . . 421
7.6 The Residual Variance σ²: Confidence Intervals and Tests of Hypotheses . . . . 430
7.7 Multiple Comparisons and Simultaneous Confidence Intervals: Some Enhance-
ments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
7.8 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
Bibliographic and Supplementary Notes . . . . . . . . . . . . . . . . . . . . . . 502
References 505
Index 513
Preface
Linear statistical models provide the theoretical underpinnings for many of the statistical procedures
in common use. In deciding on the suitability of one of those procedures for use in a potential
application, it would seem to be important to know the assumptions embodied in the underlying
model and the theoretical properties of the procedure as determined on the basis of that model. In
fact, the value of such knowledge is not limited to its value in deciding whether or not to use the
procedure. When (as is frequently the case) one or more of the assumptions appear to be unrealistic,
such knowledge can be very helpful in devising a suitably modified procedure—a situation of this
kind is illustrated in Section 7.7f.
Knowledge of matrix algebra has in effect become a prerequisite for reading much of the literature
pertaining to linear statistical models. The use of matrix algebra in this literature started to become
commonplace in the mid 1900s. Among the early adopters were Scheffé (1959), Graybill (1961), Rao
(1965), and Searle (1971). When it comes to clarity and succinctness of exposition, the introduction
of matrix algebra represented a great advance. However, those without an adequate knowledge of
matrix algebra were left at a considerable disadvantage.
Among the procedures for making statistical inferences are ones that are based on an assumption
that the data vector is the realization of a random vector, say y, that follows a linear statistical model.
The present volume discusses procedures of that kind and the properties of those procedures. Included
in the coverage are various results from matrix algebra needed to effect an efficient presentation of the
procedures and their properties. Also included in the coverage are the relevant statistical distributions.
Some of the supporting material on matrix algebra and statistical distributions is interspersed with
the discussion of the inferential procedures and their properties.
Two classical procedures are the least squares estimator (of an estimable function) and the F test.
The least squares estimator is optimal in the sense described in a result known as the Gauss–Markov
theorem. The Gauss–Markov theorem has a relatively simple proof. Results on the optimality of the
F test are stated and proved herein (in Chapter 7); the proofs of these results are relatively difficult
and less “accessible”—reference is sometimes made to Wolfowitz’s (1949) proofs of results on the
optimality of the F test, which are (at best) extremely terse.
The F test is valid under an assumption that the distribution of the observable random vector y is
multivariate normal. However, that assumption is stronger than necessary. As can be discerned from
results like those discussed by Fang, Kotz, and Ng (1990), as has been pointed out by Ravishanker
and Dey (2002, sec. 5.5), and is shown herein, the F test and various related procedures depend
on y only through a (possibly vector-valued) function of y whose distribution is the same for every
distribution of y that is “elliptically symmetric,” so that those procedures are valid not only when the
distribution of y is multivariate normal but more generally when the distribution of y is elliptically
symmetric.
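As a concrete check of that invariance, the sketch below (an illustration under assumed settings, not code from the book) simulates the null distribution of the usual F statistic under normal errors and under multivariate-t errors, a spherically (hence elliptically) symmetric alternative; the Monte Carlo quantiles essentially coincide with each other and with the exact F quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
N = 20
# Arbitrary model matrix: intercept plus two explanatory variables
X = np.column_stack([np.ones(N), rng.standard_normal((N, 2))])

def f_stat(y):
    # F statistic for H0: the two non-intercept coefficients are zero
    r_full = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    r_red = y - y.mean()
    num = (r_red @ r_red - r_full @ r_full) / 2    # q = 2 constraints
    den = (r_full @ r_full) / (N - 3)              # N - rank(X) error df
    return num / den

n_rep = 20_000
f_norm = np.array([f_stat(rng.standard_normal(N)) for _ in range(n_rep)])
# Multivariate t_3 errors: one normal vector scaled by one chi-square-based
# factor per replication; the scaling cancels out of the F statistic
f_t = np.array([f_stat(rng.standard_normal(N) * np.sqrt(3 / rng.chisquare(3)))
                for _ in range(n_rep)])

print(np.quantile(f_norm, [0.50, 0.95]))
print(np.quantile(f_t, [0.50, 0.95]))        # approximately the same
print(stats.f.ppf([0.50, 0.95], 2, N - 3))   # exact F(2, 17) quantiles
```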
The present volume includes considerable discussion of multiple comparisons and simultaneous
confidence intervals. At one time, the use of these kinds of procedures was confined to situations where
the requisite percentage points were those of a distribution (like the distribution of the Studentized
range) that was sufficiently tractable that the percentage points could be computed by numerical
means. The percentage points could then be tabulated or could be recomputed on an “as needed”
basis. An alternative whose use is not limited by considerations of “numerical tractability” is to
determine the percentage points by Monte Carlo methods in the manner described by Edwards and
Berry (1987).
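For instance, a simultaneous critical value can be approximated by simulation in a few lines; the following sketch (a simplified, hypothetical setup with known variance, offered in the spirit of the Monte Carlo approach rather than as the Edwards and Berry procedure itself) approximates the upper 5% point of the range of K independent standard-normal group means, scaled as a pairwise difference.

```python
import numpy as np

rng = np.random.default_rng(5)
K, n_rep = 5, 100_000

# Each row is one simulated set of K standardized group means
z = rng.standard_normal((n_rep, K))

# Largest standardized pairwise difference within each simulated set
max_diff = (z.max(axis=1) - z.min(axis=1)) / np.sqrt(2)

# Monte Carlo 95% point: no tables and no "numerical tractability" required
print(np.quantile(max_diff, 0.95))
```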
The discussion herein of multiple comparisons is not confined to the traditional methods, which
serve to control the FWER (familywise error rate). It includes discussion of less conservative methods
of the kinds proposed by Benjamini and Hochberg (1995) and by Lehmann and Romano (2005a).
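For reference, a minimal sketch of the Benjamini and Hochberg (1995) step-up procedure, which controls the false discovery rate at a level q, is given below; the p-values are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean array marking the hypotheses rejected at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Step-up comparison: p_(i) <= q * i / m for the ordered p-values
    below = p[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

# Hypothetical p-values: the first two are rejected at q = 0.05
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.27, 0.60]))
```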
Prerequisites. The reader is assumed to have had at least some exposure to the basic concepts of
probability “theory” and to the basic principles of statistical inference. This exposure is assumed to
have been of the kind that could have been gained through an introductory course at a level equal to
(or exceeding) that of Casella and Berger (2002) or Bickel and Doksum (2001).
The coverage of matrix algebra provided herein is more-or-less self-contained. Nevertheless,
some previous exposure of the kind that might have been gained through an introductory course
on linear algebra is likely to be helpful. That would be so even if the introductory course were
such that the level of abstractness or generality were quite high or (at the other extreme) were
such that computations were emphasized at the expense of fundamental concepts, in which case the
connections to what is covered herein would be less direct and less obvious.
Potential uses. The book could be used as a reference. Such use has been facilitated by the inclusion
of a very extensive and detailed index and by arranging the covered material in a way that allows (to
the greatest extent feasible) the various parts of the book to be read more-or-less independently.
Or the book could serve as the text for a graduate-level course on linear statistical models with
a secondary purpose of providing instruction in matrix algebra. Knowledge of matrix algebra is
critical not only in the study of linear statistical models but also in the study of various other areas of
statistics including multivariate analysis. The integration of the instruction in matrix algebra with the
coverage of linear statistical models could have a symbiotic effect on the study of both subjects. If
desired, topics not covered in the book (either additional topics pertaining to linear statistical models
or topics pertaining to some other area such as multivariate analysis) could be included in the course
by introducing material from a secondary source.
Alternatively, the book could be used selectively in a graduate-level course on linear statistical
models to provide coverage of certain topics that may be covered in less depth (or not covered at all) in
another source. It could also be used selectively in a graduate-level course in mathematical statistics
to provide in-depth illustrations of various concepts and principles in the context of a relatively
important and complex setting.
To facilitate the use of the book as a text, a large number of exercises have been included. A
solutions manual is accessible to instructors who have adopted the book at https://www.crcpress.com/9781138578333.
An underlying perspective. A basic problem in statistics (perhaps, the basic problem) is that of making inferences about the realizations of some number (assumed for the sake of simplicity to be finite) of unobservable random variables, say w_1, w_2, ..., w_M, based on the value of an observable random vector y. Let w = (w_1, w_2, ..., w_M)′. A statistical model might be taken to mean a specification of the joint distribution of w and y up to the value of a vector, say θ, of unknown parameters. This definition is sufficiently broad to include the case where w = θ—when w = θ, the joint distribution of w and y is “degenerate.”
In this setting, statistical inference might take the form of a “point” estimate or prediction for the realization of w or of a set of M-dimensional vectors and might be based on the statistical model (in what might be deemed model-based inference). Depending on the nature of w_1, w_2, ..., w_M, this activity might be referred to as parametric inference or alternatively as predictive inference.
Let w̃(y) represent a point estimator or predictor, and let A(y) represent a set of M-dimensional vectors that varies with the value of y. And consider the use of w̃(y) and A(y) in model-based (parametric or predictive) inference. If E[w̃(y)] = E(w), w̃(y) is said to be an unbiased estimator or predictor. And if Pr[w ∈ A(y)] = 1 − P for some prespecified constant P (and for “every” value of θ), A(y) is what might be deemed a 100(1 − P)% “confidence” set—depending on the model, such a set might or might not exist.
In the special case where θ is “degenerate” (i.e., where the joint distribution of w and y is known), w̃(y) could be taken to be E(w | y) (the so-called posterior mean), in which case w̃(y) would be unbiased. And among the choices for the set A(y) in that special case are choices for which Pr[w ∈ A(y) | y] = 1 − P [so-called 100(1 − P)% credible sets].
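To make the posterior mean and credible sets concrete, here is a minimal numerical sketch under an assumed Gaussian setup (not taken from the book): w and the noise in y = w + u are independent standard normal, so the joint distribution of w and y is completely specified, E(w | y) = y/2, and var(w | y) = 1/2.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rep = 200_000
w = rng.standard_normal(n_rep)          # unobservable quantity of interest
y = w + rng.standard_normal(n_rep)      # observable value

# The posterior mean is unbiased: E[E(w | y)] = E(w)
w_tilde = y / 2
print(w_tilde.mean(), w.mean())         # both approximately 0

# A 95% credible interval, E(w | y) +/- 1.96 * sqrt(1/2); its unconditional
# coverage is also approximately 95%
half_width = 1.96 * np.sqrt(0.5)
print(np.mean(np.abs(w - w_tilde) <= half_width))   # approximately 0.95
```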
Other models can be generated from the original model by regarding θ as a random vector whose distribution is specified up to the value of some parameter vector (of smaller dimension than θ) and by regarding the joint distribution of w and y specified by the original model as the conditional distribution of w and y given θ. The resultant (hierarchical) models are more parsimonious than the original model, but this (reduction in the number of parameters) comes at the expense of additional assumptions. In the special case where that parameter vector is “degenerate” (i.e., where θ is regarded as a random vector whose distribution is completely specified and represents what in a Bayesian framework is referred to as the prior distribution), the resultant models are ones in which the joint distribution of w and y is completely specified.
As discussed in a 2014 paper (Harville 2014), I regard the division of statistical inference along
Bayesian-frequentist lines as unnecessary and undesirable. What in a Bayesian approach is referred
to as the prior distribution can simply be regarded as part of a hierarchical model. In combination with
the original model, it leads to a new model (in which the joint distribution of w and y is completely
specified).
In that 2014 paper, it is also maintained that there are many instances (especially in the case of
predictive inference) where any particular application of the inferential procedures is one in a se-
quence of “repeated” applications. In such instances, the “performance” of the procedures in repeated
application would seem to be an important consideration. Performance in repeated application can be
assessed on the basis of empirical evidence or on the basis of a “model”—for some discussion of per-
formance in repeated application within a rather specific Bayesian framework, refer to Dawid (1982).
As famously stated by George Box, “all models are wrong, but some are useful” (e.g., Box and
Draper 1987, p. 424). In fact, a model may be useful for some purposes but not for others. How
useful any particular model might be in providing a basis for statistical inference would seem to
depend at least in part on the extent to which the relationship between w and y implicit in the model
is consistent with the “actual relationship”—the more “elaborate” the model, the more opportunities
there are for discrepancies. In principle, it would seem that the inferences should be based on a
model that reflects all relevant prior information [i.e., the joint distribution of w and y should be the
conditional (on the prior information) joint distribution]. In practice, it may be difficult to formally
account for certain kinds of prior information in a way that seems altogether satisfactory; it may be
preferable to account for those kinds of prior information through informal “posterior adjustments.”
In devising a model, there is a potential pitfall. It is implicitly assumed that the specification of the
joint distribution of w and y is not influenced by the observed value of y. Yet, in practice, the model
may not be decided upon until after the data become available and/or may undergo modification
subsequent to that time. Allowing the observed value of y to influence the choice of model could
introduce subtle biases and distortions into the inferences.
Format. The book is divided into seven numbered chapters, the chapters into numbered sections, and (in
some cases) the sections into lettered subsections. Sections are identified by two numbers (chapter
and section within chapter) separated by a decimal point—thus, the fifth section of Chapter 3 is
referred to as Section 3.5. Within a section, a subsection is referred to by letter alone. A subsection
in a different chapter or in a different section of the same chapter is referred to by referring to the
section and by appending a letter to the section number—for example, in Section 6.2, Subsection
c of Section 6.1 is referred to as Section 6.1c. An exercise in a different chapter is referred to by
the number obtained by inserting the chapter number (and a decimal point) in front of the exercise
number.
Some of the subsections are divided into parts. Each such subsection includes two or more parts
that begin with a heading and may or may not include an introductory part (with no heading). On the
relatively small number of occasions on which reference is made to one or more of the individual
parts, the parts that begin with headings are identified as though they had been numbered 1; 2; : : : in
order of appearance.
Some of the displayed “equations” are numbered. An equation number consists of two parts
(corresponding to section within chapter and equation within section) separated by a decimal point
(and is enclosed in parentheses). An equation in a different chapter is referred to by the “num-
ber” obtained by starting with the chapter number and appending a decimal point and the equation
number—for example, in Chapter 6, result (5.11) of Chapter 3 is referred to as result (3.5.11). For
purposes of numbering (and referring to) equations in the exercises, the exercises in each chapter are
to be regarded as forming Section E of that chapter.
Notational conventions and issues. The broad coverage of the manuscript (which includes coverage
of the statistical distributions and matrix algebra applicable to discussions of linear models) has led to
challenges and issues in devising suitable notation. It has sometimes proved necessary to use similar
(or even identical) symbols for more than one purpose. In some cases, notational conventions that
are typically followed in the treatment of one of the covered topics may conflict with those typically
followed in another of the covered topics; such conflicts have added to the difficulties in devising
suitable notation.
For example, in discussions of matrix algebra, it is customary (at least among statisticians) to use
boldface capital letters to represent matrices, to use boldface lowercase letters to represent vectors, and
to use ordinary lowercase letters to represent scalars. And in discussions of statistical distributions and
their characteristics, it is customary to distinguish the realization of a random variable or vector from
the random variable or vector itself by using a capital letter, say X , to represent the random variable or
a boldface capital letter, say X, to represent the random vector and to use the corresponding lowercase
letter x or boldface lowercase letter x to represent its realization. In such a case, the approach taken herein is to use some other device, namely an underline, to differentiate between the random variable or vector and its realization. Accordingly, an underlined x (or an underlined boldface x or X) might be used to represent a random variable, vector, or matrix and the same symbol without the underline to represent its realization. Alternatively, in cases where the intended usage is clear from the context, the same symbol may be used for both.
Credentials. I have brought to the writing of this book an extensive background in the subject matter.
On numerous occasions, I have taught graduate-level courses on linear statistical models. Moreover,
linear statistical models and their use as a basis for statistical inference has been my primary research
interest. My research in that area includes both work that is relatively theoretical in nature and work
in which the focus is on applications (including applications in sports and in animal breeding). I
am the author of two previous books, both of which pertain to matrix algebra: Matrix Algebra from
a Statistician’s Perspective, which provides coverage of matrix algebra of a kind that would seem
to be well-suited for those with interests in statistics and related disciplines, and Matrix Algebra:
Exercises and Solutions, which provides the solutions to the exercises in Matrix Algebra from a
Statistician’s Perspective.
In the writing of Matrix Algebra from a Statistician’s Perspective, I adopted the philosophy that (to
the greatest extent feasible) the discourse should include the theoretical underpinnings of essentially
every result. In the writing of the present volume, I have adopted much the same philosophy. Of
course, doing so has a limiting effect on the number of topics and the number of results that can be
covered.
Acknowledgments. In the writing of this volume, I have been influenced greatly (either consciously
or subconsciously) by insights acquired from others through direct contact or indirectly through
exposure to presentations they have given or to documents they have written. Among those from
whom I have acquired insights are: Frank Graybill—his 1961 book was an early influence; Justus
Seely (through access to some unpublished class notes from a course he had taught at Oregon State
University, as well as through the reading of a number of his published papers); C. R. Henderson (who
was my major professor and a source of inspiration and ideas); Oscar Kempthorne (through access
to his class notes and through thought-provoking conversations during the time he was a colleague);
and Shayle Searle (who was very supportive of my efforts and who was a major contributor to the
literature on linear statistical models and the associated matrix algebra). And I am indebted to John
Kimmel, who (in his capacity as an executive editor at Chapman and Hall/CRC) has been a source
of encouragement, support, and guidance.
David A. Harville
harville@iastate.edu
1 Introduction
This book is about linear statistical models and about the statistical procedures derived on the basis
of those models. These statistical procedures include the various procedures that make up a linear
regression analysis or an analysis of variance, as well as many other well-known procedures. They
have been applied on many occasions and with great success to a wide variety of experimental and
observational data.
In agriculture, data on the milk production of dairy cattle are used to make inferences about the
“breeding values” of various cows and bulls and ultimately to select breeding stock (e.g., Henderson
1984). These inferences (and the resultant selections) are made on the basis of a linear statistical
model. The adoption of this approach to the selection of breeding stock has significantly increased
the rate of genetic progress in the affected populations.
In education, student test scores are used in assessing the effectiveness of teachers, schools, and
school districts. In the Tennessee value-added assessment system (TVAAS), the assessments are in
terms of statistical inferences made on the basis of a linear statistical model (e.g., Sanders and Horn
1994). This approach compares favorably with the more traditional ways of using student test scores
to assess effectiveness. Accordingly, its use has been mandated in a number of regions.
In sports such as football and basketball, the outcomes of past and present games can be used to
predict the outcomes of future games and to rank or rate the various teams. Very accurate results can
be obtained by basing the predictions and the rankings or ratings on a linear statistical model (e.g.,
Harville 1980, 2003b, 2014). The predictions obtained in this way are nearly as accurate as those
implicit in the betting line. And (in the case of college basketball) they are considerably more accurate
than predictions based on the RPI (Ratings Percentage Index), which is a statistical instrument used
by the NCAA (National Collegiate Athletic Association) to rank teams.
The scope of statistical procedures developed on the basis of linear statistical models can be (and
has been) extended. Extensions to various kinds of nonlinear statistical models have been considered
by Bates and Watts (1988), Gallant (1987), and Pinheiro and Bates (2000). Extensions to the kinds
of statistical models that have come to be known as generalized linear models have been considered
by McCullagh and Nelder (1989), Agresti (2013), and McCulloch, Searle, and Neuhaus (2008).
may constitute some or all of the concomitant information. The various assumptions made about the
distribution of y are referred to collectively as a statistical model or simply as a model.
We shall be concerned herein with what are known as linear (statistical) models. These models
are relatively tractable and provide the theoretical underpinnings for a broad class of statistical
procedures.
What constitutes a linear model? In a linear model, the expected values of y_1, y_2, ..., y_N are taken to be linear combinations of some number, say P, of generally unknown parameters β_1, β_2, ..., β_P. That is, there exist numbers x_i1, x_i2, ..., x_iP (assumed known) such that

E(y_i) = ∑_{j=1}^{P} x_ij β_j   (i = 1, 2, ..., N)   (1.1)

or, equivalently, such that

y_i = ∑_{j=1}^{P} x_ij β_j + e_i   (i = 1, 2, ..., N),   (1.2)
where e_1, e_2, ..., e_N are random variables, each of which has an expected value of 0. Under condition (1.2), we have that

e_i = y_i − ∑_{j=1}^{P} x_ij β_j = y_i − E(y_i)   (i = 1, 2, ..., N).   (1.3)

Aside from the two trivial cases var(e_i) = 0 and x_i1 = x_i2 = ⋯ = x_iP = 0 and a case where β_1, β_2, ..., β_P are subject to restrictions under which ∑_{j=1}^{P} x_ij β_j is known, e_i is unobservable. The random variables e_1, e_2, ..., e_N are sometimes referred to as residual effects or as errors.
In working with linear models, the use of matrix notation is extremely convenient. Let y represent the N-dimensional column vector with elements y_1, y_2, ..., y_N, let X represent the N × P matrix with ij-th element x_ij, and let β represent the P-dimensional column vector with elements β_1, β_2, ..., β_P. Then, condition (1.2) can be reexpressed in the form

y = Xβ + e,   (1.5)

where e is a random column vector (the elements of which are e_1, e_2, ..., e_N) with E(e) = 0. Further, in matrix notation, result (1.3) becomes

e = y − Xβ = y − E(y).
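As a quick numerical illustration of the matrix form (1.5), the sketch below (with arbitrary, assumed values of X and β and, for definiteness, independent standard-normal residual effects) simulates replications of y = Xβ + e and checks that the average of y recovers E(y) = Xβ.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(5), np.arange(5.0)])   # N x P, here 5 x 2
beta = np.array([2.0, 0.5])

n_rep = 100_000
e = rng.standard_normal((n_rep, 5))   # replications of e, with E(e) = 0
y = X @ beta + e                      # replications of y = X beta + e

print(y.mean(axis=0))                 # approximately E(y) = X beta
print(X @ beta)
```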
For a model to qualify as a linear model, we require something more than condition (1.1) or (1.2). Namely, we require that the variance-covariance matrix of y, or equivalently of e, not depend on the elements β_1, β_2, ..., β_P of β—the diagonal elements of the variance-covariance matrix of y are the variances of the elements y_1, y_2, ..., y_N of y, and the off-diagonal elements are the covariances. This matrix may depend (and typically does depend) on various unknown parameters other than β_1, β_2, ..., β_P.
For a model to be useful in making inferences about the unobservable quantities of interest, it must be possible to express those quantities in a relevant way. Consider a linear model, in which E(y_1), E(y_2), ..., E(y_N) are expressible in the form (1.1) or, equivalently, in which y is expressible in the form (1.5). This model could be useful in making inferences about a quantity that is expressible as a linear combination, say ∑_{j=1}^{P} λ_j β_j, of the elements β_1, β_2, ..., β_P of β—how useful would depend on X, on the coefficients λ_1, λ_2, ..., λ_P, and perhaps on various characteristics of the distribution of e. More generally, this model could be useful in making inferences about an unobservable random variable w for which E(w) = ∑_{j=1}^{P} λ_j β_j and for which var(w) and cov(w, y) do not depend on β.
1.2 Regression Models

Suppose that associated with the ith data point are the observed values u_i1, u_i2, ..., u_iC of C explanatory variables, and consider the model given by

y_i = α_0 + ∑_{j=1}^{C} u_ij α_j + e_i   (i = 1, 2, ..., N),   (2.1)

where α_0, α_1, ..., α_C are unrestricted parameters (of unknown value) and where e_1, e_2, ..., e_N are uncorrelated, unobservable random variables, each with mean 0 and (for a strictly positive parameter σ of unknown value) variance σ². Models of the form (2.1) are referred to as simple or multiple (depending on whether C = 1 or C ≥ 2) linear regression models.
As suggested by the name, a linear regression model qualifies as a linear model. Under the linear regression model (2.1),

E(y_i) = α_0 + ∑_{j=1}^{C} u_ij α_j   (i = 1, 2, ..., N).   (2.2)

The expected values (2.2) are of the form (1.1), and the expressions (2.1) are of the form (1.2); set P = C + 1, β_1 = α_0, and (for j = 1, 2, ..., C) β_{j+1} = α_j, and take x_11 = x_21 = ⋯ = x_N1 = 1 and (for i = 1, 2, ..., N and j = 1, 2, ..., C) x_{i,j+1} = u_ij. Moreover, the linear regression model is such that the variance-covariance matrix of e_1, e_2, ..., e_N does not depend on the β_j's; it depends only on the parameter σ.
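The correspondence just described is easy to carry out numerically; the sketch below (with made-up u_ij and α_j values) builds the implied model matrix X by prepending a column of ones to the N × C matrix of explanatory-variable values, so that x_i1 = 1 and x_{i,j+1} = u_ij.

```python
import numpy as np

U = np.array([[0.2, 1.0, 3.1],
              [0.5, 0.8, 2.9],
              [0.9, 1.3, 3.5],
              [1.4, 0.7, 2.2]])          # N x C matrix of u_ij values
N, C = U.shape
X = np.column_stack([np.ones(N), U])     # N x (C + 1) model matrix

# beta_1 = alpha_0 and beta_{j+1} = alpha_j (j = 1, 2, ..., C)
alpha = np.array([1.0, 0.3, -0.2, 0.1])  # (alpha_0, alpha_1, ..., alpha_C)
print(X @ alpha)                         # E(y_i) = alpha_0 + sum_j u_ij alpha_j
```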
In an application of the multiple linear regression model, we might wish to make inferences about some or all of the individual parameters α_0, α_1, ..., α_C, and σ. Or we might wish to make inferences about the quantity α_0 + ∑_{j=1}^{C} u_j α_j for various values of the explanatory variables u_1, u_2, ..., u_C. This quantity could be thought of as representing the “average” value of an infinitely large number of future data points, all of which correspond to the same u_1, u_2, ..., u_C values. Also of potential interest are quantities of the form α_0 + ∑_{j=1}^{C} u_j α_j + d, where d is an unobservable random variable that is uncorrelated with e_1, e_2, ..., e_N and that has mean 0 and variance σ². A quantity of this form is a random variable, the value of which can be thought of as representing an individual future data point.
There are potential pitfalls in making predictive inferences on the basis of a statistical model,
both in general and in the case of a multiple linear regression model. It is essential that the relevant
characteristics of the setting in which the predictive inferences are to be applied be consistent with
those of the setting that gives rise to the data. For example, in making predictive inferences about the
relationship between a cow’s milk production and her age and her body weight, it would be essential
that there be consistency with regard to breed and perhaps with regard to various management
practices. The use of data collected on a random sample of the population that is the “target” of the
predictive inferences can be regarded as an attempt to achieve the desired consistency. In making
predictive inferences on the basis of a multiple linear regression model, it is also essential that
the model “accurately reflect” the underlying relationships and that it do so over all values of the
explanatory variables for which predictive inferences are sought (as well as over all values for which
there are data).
For some applications of the multiple linear regression model, the assumption that e_1, e_2, ..., e_N (and d) are uncorrelated with each other may be overly simplistic. Consider, for example, an application in which each of the data points represents the amount of milk produced by a cow. If some of the cows are genetically related to others, then we may wish to modify the model accordingly. Any two of the residual effects e_1, e_2, ..., e_N (and d) that correspond to cows that are genetically related may be positively correlated (to an extent that depends on the closeness of the relationship).
Moreover, the data are likely to come from more than one herd of cows. Cows that belong to the
same herd share a common environment, and tend to be more alike than cows that belong to different
herds. One way to account for their alikeness is through the introduction of a positive covariance (of
unknown value).
In making inferences on the basis of a multiple linear regression model, a possible objective
is that of obtaining relevant input to some sort of decision-making process. In particular, when
inferences are made about future data points, it may be done with the intent of judging the effects of
changes in the values of any of the explanatory variables u_1, u_2, ..., u_C that are subject to control.
Considerable caution needs to be exercised in making such judgments. There may be variables that are
not accounted for in the model but whose values may have “influenced” the values of y1 ; y2 ; : : : ; yN
and may influence future data points. If the values of any of the excluded variables are related
(either positively or negatively) to any of the variables for which changes are contemplated, then the
model-based inferences may create a misleading impression of the effects of the changes.
1.3 Classificatory Models

In many applications, the data points can be partitioned into subsets or groups in each of several ways, each of which is based on a different criterion or “factor.” The subsets or groups formed on the basis of any particular factor are sometimes referred to as the “levels” of the factor.
A factor can be converted into an explanatory variable by assigning each of its levels a distinct
number. In some cases, the assignment can be done in such a way that the explanatory variable might
be suitable for inclusion in a multiple linear regression model. Consider, for example, the case of
data on individual animals that have been partitioned into groups on the basis of age or body weight.
In a case of this kind, the factor might be referred to as a “quantitative” factor.
There is another kind of situation; one where the data points are partitioned into groups on the
basis of a “qualitative” factor and where (regardless of the method of assignment) the numbers
assigned to the groups or levels are meaningful only for purposes of identification. For example, in
an application where each data point consists of the amount of milk produced by a different one of
N dairy cows, the data points might be partitioned into groups, each of which consists of the data
points from those cows that are the daughters of a different one of K bulls. The K bulls constitute the
levels of a qualitative factor. For purposes of identification, the bulls could be numbered 1, 2, ..., K
in whatever order might be convenient.
In a situation where the N data points have been partitioned into groups on the basis of each of
one or more qualitative factors, the data are sometimes referred to as “classificatory data.” Among the
models that could be applied to classificatory data are what might be called “classificatory models.”
Suppose (for the sake of simplicity) that there is a single qualitative factor, and that it has K levels numbered 1, 2, ..., K. And (for k = 1, 2, ..., K) denote by N_k the number of data points associated with level k—clearly, ∑_{k=1}^{K} N_k = N.
In this setting, it is convenient to use two subscripts, rather than one, in distinguishing among the random variables y_1, y_2, ..., y_N (and among related quantities). The first subscript identifies the level, and the second allows us to distinguish among entities associated with the same level. Accordingly, we write y_{k1}, y_{k2}, ..., y_{kN_k} for those of the random variables y_1, y_2, ..., y_N associated with the kth level (k = 1, 2, ..., K).
As a possible model, we have the classificatory model obtained by taking

y_ks = μ + α_k + e_ks   (k = 1, 2, ..., K; s = 1, 2, ..., N_k),   (3.1)

where μ, α_1, α_2, ..., α_K are unknown parameters and where the e_ks's are uncorrelated, unobservable random variables, each with mean 0 and (for a strictly positive parameter σ of unknown value) variance σ². The parameters α_1, α_2, ..., α_K are sometimes referred to as effects. And the model itself is sometimes referred to as the one-way-classification model or (to distinguish it from a variation to be discussed subsequently) the one-way-classification fixed-effects model. The parameters μ, α_1, α_2, ..., α_K are generally taken to be unrestricted, though sometimes they are required to satisfy the restriction

∑_{k=1}^{K} α_k = 0   (3.2)

or some other restriction (such as ∑_{k=1}^{K} N_k α_k = 0, μ = 0, or α_K = 0).
Is the one-way-classification model a linear model? The answer is yes, though this may be less obvious than in the case (considered in Section 1.2) of a multiple linear regression model. That the one-way-classification model is a linear model becomes more transparent upon observing that the defining relation (3.1) can be reexpressed in the form

y_ks = ∑_{j=1}^{K+1} x_ksj β_j + e_ks   (k = 1, 2, ..., K; s = 1, 2, ..., N_k),   (3.3)

where β_1 = μ and β_j = α_{j−1} (j = 2, 3, ..., K+1) and where (for k = 1, 2, ..., K; s = 1, 2, ..., N_k; j = 1, 2, ..., K+1)

x_ksj = 1, if j = 1 or j = k + 1; x_ksj = 0, otherwise.
In that regard, it may be helpful (i.e., provide even more in the way of transparency) to observe that result (3.3) is equivalent to the result

y_i = ∑_{j=1}^{K+1} x_ij β_j + e_i   (i = 1, 2, ..., N),

where x_i1 = 1 and (for j = 2, 3, ..., K+1) x_ij = 1 or x_ij = 0 depending on whether or not the ith data point is a member of the (j−1)th group (i = 1, 2, ..., N) and where (as in Section 1.2) e_1, e_2, ..., e_N are uncorrelated, unobservable random variables, each with mean 0 and variance σ².
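The indicator structure of the x_ij's can be made concrete as follows; the group memberships and parameter values in this sketch are arbitrary assumptions.

```python
import numpy as np

levels = np.array([1, 1, 2, 2, 2, 3])   # group (level) of each of N = 6 points
N, K = levels.size, levels.max()

X = np.zeros((N, K + 1))
X[:, 0] = 1.0                            # column for mu
X[np.arange(N), levels] = 1.0            # indicator columns for alpha_1..alpha_K

# beta = (mu, alpha_1, ..., alpha_K)'; E(y_ks) = mu + alpha_k as in (3.1)
beta = np.array([10.0, -1.0, 0.0, 1.0])
print(X @ beta)                          # [ 9.  9. 10. 10. 10. 11.]
```

Note that the K + 1 columns of X are linearly dependent (the last K columns sum to the first), a fact that underlies the use of restrictions such as (3.2) and the notion of estimability taken up in Chapter 5.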
In an application of the one-way-classification model, we might wish to make inferences about μ + α_1, μ + α_2, ..., μ + α_K. For k = 1, 2, ..., K (and “all” s),

E(y_ks) = μ + α_k.

We might also wish to make inferences about one or more linear combinations of these quantities, say

∑_{k=1}^{K} c_k (μ + α_k).   (3.4)

When the coefficients c_1, c_2, ..., c_K in the linear combination (3.4) are such that ∑_{k=1}^{K} c_k = 0, the linear combination is reexpressible as ∑_{k=1}^{K} c_k α_k and is referred to as a contrast. Perhaps the simplest kind of contrast is a difference: α_{k′} − α_k = (μ + α_{k′}) − (μ + α_k) (where k′ ≠ k). Still another possibility is that we may wish to make inferences about the quantity μ + α_k + d, where 1 ≤ k ≤ K and where d is an unobservable random variable that (for k′ = 1, 2, ..., K and s = 1, 2, ..., N_{k′}) is uncorrelated with e_{k′s} and that has mean 0 and variance σ². This quantity can be thought of as representing an individual future data point belonging to the kth group.
As a variation on model (3.1), we have the model

y_ks = μ_k + e_ks   (k = 1, 2, ..., K; s = 1, 2, ..., N_k),   (3.5)

where μ_1, μ_2, ..., μ_K are unknown parameters and where the e_ks's are as defined earlier [i.e., in connection with model (3.1)]. Model (3.5), like model (3.1), is a linear model. It is a simple example of what is called a means model or a cell-means model; let us refer to it as the one-way-classification cell-means model. Clearly, μ_k = E(y_ks) (k = 1, 2, ..., K; s = 1, 2, ..., N_k), so that (for k = 1, 2, ..., K) μ_k is interpretable as the expected value or the “mean” of an arbitrary one of the random variables y_{k1}, y_{k2}, ..., y_{kN_k}, whose observed values comprise the kth group or “cell.”
In making statistical inferences, it matters not whether the inferences are based on model (3.5) or
on model (3.1). Nor does it matter whether the restriction (3.2), or a “similar” restriction, is imposed
on the parameters of model (3.1). For purposes of making inferences about the relevant quantities,
model (3.5) and the restricted and unrestricted versions of model (3.1) are “interchangeable.”
The number of applications for which the one-way-classification model provides a completely
satisfactory basis for the statistical inferences is relatively small. Even in those applications where
interest centers on a particular factor, the relevant concomitant information is typically not limited
to the information associated with that factor. To insure that the inferences obtained in such a cir-
cumstance are meaningful, they may need to be based on a model that accounts for the additional
information.
Suppose, for example, that each data point consists of the amount of milk produced during the
first lactation of a different one of N dairy cows. And suppose that N_k of the cows are the daughters of the kth of K bulls (where N_1, N_2, ..., N_K are positive integers that sum to N). Interest might
center on differences among the respective “breeding values” of the K bulls, that is, on differences
in the “average” amounts of milk produced by infinitely large numbers of future daughters under
circumstances that are similar from bull to bull. Any inferences about these differences that are based
on a one-way-classification model (in which the factor is that whose levels correspond to the bulls)
are likely to be at least somewhat misleading. There are factors of known importance that are not
accounted for by this model. These include a factor for the time period during which the lactation
was initiated and a factor for the herd to which the cow belongs. The importance of these factors
is due to the presence of seasonal differences, environmental and genetic trends, and environmental
and genetic differences among herds.
The one-way-classification model may be unsuitable as a basis for making inferences from the
milk-production data not only because of the omission of important factors, but also because the
assumption that the eks ’s are uncorrelated may not be altogether realistic. Typically, some of the
cows will have ancestors in common on the female side of the pedigree, in which case the eks ’s for
those cows may be positively correlated.
The negative consequences of not having accounted for a factor that has been omitted from a
classificatory model may be exacerbated in a situation in which there is a tendency for the levels
of the omitted factor to be “confounded” with those of an included factor. In the case of the milk-
production data, there may (in the presence of a positive genetic trend) be a tendency for the better
bulls to be associated with the more recent time periods. Moreover, there may be a tendency for
an exceptionally large proportion of the daughters of some bulls to be located in “above-average”
herds, and for an exceptionally large proportion of the daughters of some other bulls to be located in
“below-average” herds.
A failure to account for an important factor may occur for any of several reasons. The factor may
have been mistakenly judged to be irrelevant or at least unimportant. Or the requisite information
about the factor (i.e., knowledge of which data points correspond to which levels) may be unavailable
and may not even have been ascertained (possibly for reasons of cost). Or the factor may be a “hidden”
factor.
In some cases, the data from which the inferences are to be made are those from a designed
experiment. The incorporation of randomization into the design of the experiment serves to limit the
extent of the kind of problematic confounding that may be occasioned by a failure to account for an
important factor. This kind of problematic confounding can still occur, but only to the extent that it
is introduced by chance during the randomization.
1.4 Hierarchical Models and Random-Effects Models

Suppose that the model for y comprises some or all of the assumptions made about the joint distribution of the observable and unobservable random variables (the unobservable random variables about which the inferences are to be made). It might be assumed that the joint distribution of these random variables is known up to the value of a parameter vector, say θ, or it might be only the expected values and the variances and covariances of these random variables that are assumed to be known up to the value of θ.
The model consisting of assumptions that define the distribution of y (or various characteristics of the distribution of y) up to the value of the parameter vector θ can be subjected to a “hierarchical” approach. This approach gives rise to various alternative models. In the hierarchical approach, θ is regarded as random, and the assumptions that comprise the original model are reinterpreted as assumptions about the conditional distribution of y given θ. Further, the distribution of θ or various characteristics of the distribution of θ are assumed to be known, or at least to be known up to the value of a column vector, say τ, of unknown parameters. These additional assumptions can be thought of as comprising a model for θ.
As an alternative model for y, we have the model obtained by combining the assumptions comprising the original model for y with the assumptions about the distribution of θ (or about its characteristics). This model is referred to as a hierarchical model. In some cases, it can be readily reexpressed in nonhierarchical terms; that is, in terms that do not involve θ. This process is facilitated by the application of some basic results on conditional expectations and on conditional variances and covariances.
For “any” random variable x,

E(x) = E[E(x | θ)],   (4.1)

and

var(x) = E[var(x | θ)] + var[E(x | θ)].   (4.2)

And, for “any” two random variables x and w,

cov(x, w) = E[cov(x, w | θ)] + cov[E(x | θ), E(w | θ)].   (4.3)

The unconditional expected values in expressions (4.1), (4.2), and (4.3) and the unconditional variance and covariance in expressions (4.2) and (4.3) are those defined with respect to the (marginal) distribution of θ. In general, the expressions for E(x), var(x), and cov(x, w) given by formulas (4.1), (4.2), and (4.3) depend on τ. Formulas (4.1), (4.2), and (4.3) are obtainable from results presented in Chapter 3.
Formulas (4.1) and (4.2) can be used in particular to obtain expressions for the unconditional expected values and variances of y_1, y_2, ..., y_N in terms of their conditional expected values and variances—take x = y_i (1 ≤ i ≤ N). Similarly, formula (4.3) can be used to obtain an expression for the unconditional covariance of any two of the random variables y_1, y_2, ..., y_N—take x = y_i and w = y_j (1 ≤ i < j ≤ N). Moreover, if the conditional distribution of y given θ has a probability density function, say f(y | θ), then, upon applying formula (4.1) with x = f(y | θ), we obtain an expression for the probability density function of the unconditional distribution of y.
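A small Monte Carlo check of formulas (4.1), (4.2), and (4.3), in a hypothetical setup where θ is a scalar standard-normal random variable, runs as follows.

```python
import numpy as np

rng = np.random.default_rng(2)
n_rep = 1_000_000
theta = rng.standard_normal(n_rep)             # theta ~ N(0, 1)
x = theta + 2.0 * rng.standard_normal(n_rep)   # x | theta ~ N(theta, 4)
w = theta + rng.standard_normal(n_rep)         # w | theta ~ N(theta, 1)

# (4.1): E(x) = E[E(x | theta)] = E(theta) = 0
print(x.mean())             # approximately 0
# (4.2): var(x) = E[var(x | theta)] + var[E(x | theta)] = 4 + 1 = 5
print(x.var())              # approximately 5
# (4.3): cov(x, w) = E[cov(x, w | theta)] + cov[E(x | theta), E(w | theta)]
#                  = 0 + var(theta) = 1
print(np.cov(x, w)[0, 1])   # approximately 1
```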
In a typical implementation of the hierarchical approach, the dimension of the vector τ is significantly smaller than that of the vector θ. The most extreme case is that where the various assumptions about the distribution of θ do not involve unknown parameters; in that case, τ can be regarded as “degenerate” (i.e., of dimension 0). The effects of basing the inferences on the hierarchical model, rather than on the original model, can be either positive or negative. If the additional assumptions (i.e., the assumptions about the distribution of θ or about its characteristics) are at least somewhat reflective of an “underlying reality,” the effects are likely to be “beneficial.” If the additional assumptions are not sufficiently in conformance with “reality,” their inclusion in the model may be “counterproductive.”
The hierarchical model itself can be subjected to a hierarchical approach. In this continuation of the hierarchical approach, τ is regarded as random, and the assumptions that comprise the hierarchical model are reinterpreted as assumptions about the conditional distributions of y given θ and τ and of θ given τ or simply about the conditional distribution of y given τ. And the distribution of τ or various characteristics of the distribution of τ are assumed to be known or at least to be known up
to the value of a vector of unknown parameters. In general, further continuations of the hierarchical
approach are possible. Assuming that each continuation results in a reduction in the number of
unknown parameters (as would be the case in a typical implementation), the hierarchical approach
eventually (after some number of continuations) results in a model that does not involve any unknown
parameters.
In general, a model obtained via the hierarchical approach (like any other model) may not in and
of itself provide an adequate basis for the statistical inferences. Instead of applying the approach
just to the assumptions (about the observable random variables y1 ; y2 ; : : : ; yN ) that comprise the
model, the application of the approach may need to be extended to cover any further assumptions
included among those made about the joint distribution of the observable random variables and the
unobservable random variables (the unobservable random variables about which the inferences are
to be made).
Let us now consider the hierarchical approach in the special case where y follows a linear model. In this special case, there exist (known) numbers x_i1, x_i2, ..., x_iP such that

E(y_i) = ∑_{j=1}^{P} x_ij β_j   (i = 1, 2, ..., N).   (4.4)

In the hierarchical approach, β_1, β_2, ..., β_P are regarded as random variables, and it is supposed that there exist (known) numbers z_jk such that

E(β_j) = ∑_{k=1}^{P′} z_jk τ_k   (j = 1, 2, ..., P),   (4.5)

where τ_1, τ_2, ..., τ_{P′} are unknown parameters. Denote by σ_ii′ the ii′-th element of the conditional variance-covariance matrix, say Σ, of y given β (assumed not to depend on β), and denote by ω_jj′ the jj′-th element of the variance-covariance matrix of β_1, β_2, ..., β_P. Then, under the hierarchical model, E(y_i) = ∑_{j=1}^{P} ∑_{k=1}^{P′} x_ij z_jk τ_k,

var(y_i) = E(σ_ii) + var(∑_{j=1}^{P} x_ij β_j) = σ_ii + ∑_{j=1}^{P} ∑_{j′=1}^{P} x_ij x_ij′ ω_jj′   (4.6)

(i = 1, 2, ..., N), and

cov(y_i, y_i′) = E(σ_ii′) + cov(∑_{j=1}^{P} x_ij β_j, ∑_{j′=1}^{P} x_i′j′ β_j′) = σ_ii′ + ∑_{j=1}^{P} ∑_{j′=1}^{P} x_ij x_i′j′ ω_jj′   (4.7)

(i, i′ = 1, 2, ..., N). It follows from results (4.6) and (4.7) that if Σ does not depend on τ_1, τ_2, ..., τ_{P′}, then the hierarchical model, like the original model, is a linear model.
For any integer j between 1 and P, inclusive, such that var(β_j) = 0, we have that β_j = E(β_j) (with probability 1). Thus, for any such integer j, the assumption that E(β_j) = ∑_{k=1}^{P′} z_jk τ_k simplifies in effect to an assumption that β_j = ∑_{k=1}^{P′} z_jk τ_k. In the special case where E(β_j) = τ_{k′} for some integer k′ (1 ≤ k′ ≤ P′), there is a further simplification to β_j = τ_{k′}. Thus, the hierarchical approach is sufficiently flexible that some of the parameters β_1, β_2, ..., β_P can in effect be retained and included among the parameters τ_1, τ_2, ..., τ_{P′}.
As indicated earlier (in Section 1.1), it is extremely convenient (in working with linear models) to adopt matrix notation. Let β represent the P-dimensional column vector with elements β_1, β_2, ..., β_P, respectively, and τ the P′-dimensional column vector with elements τ_1, τ_2, ..., τ_{P′}, respectively. Then, in matrix notation, equality (4.4) becomes (in the context of the hierarchical approach)

E(y | β) = Xβ,   (4.8)

where X is the N × P matrix with ij-th element x_ij (i = 1, 2, ..., N; j = 1, 2, ..., P). And equality (4.5) becomes

E(β) = Zτ,   (4.9)

where Z is the P × P′ matrix with jk-th element z_jk (j = 1, 2, ..., P; k = 1, 2, ..., P′). Further, results (4.6) and (4.7) can be recast in matrix notation as

E(y) = XZτ   (4.10)

and

var(y) = Σ + XΩX′,   (4.11)

where Ω is the P × P matrix with jj′-th element ω_jj′.
and (for i D 1; 2; : : : ; N )
P
X P
X
cov.yi ; w/ D E.i / C cov xij ˇj ; j 0 ˇj 0
j D1 j 0 D1
P X
X P
D i C xij j 0 jj 0 : (4.14)
j D1 j 0 D1
Clearly, if and do not depend on 1 ; 2 ; : : : ; P 0 , then neither do expressions (4.13) and (4.14).
As in the case of results (4.6) and (4.7), results (4.12), (4.13), and (4.14) can be recast in matrix no-
tation. Denote by the P -dimensional column vector with elements 1 ; 2 ; : : : ; P , respectively—
the linear combination jPD1 j ˇj is reexpressible as jPD1 j ˇj D 0 ˇ. Under the hierarchical
P P
model,
The hierarchical approach is not the only way of arriving at a model characterized by expected
values and variances and covariances of the form (4.6) and (4.7) or, equivalently, of the form (4.10)
and (4.11). Under the original model, the distribution of y1 ; y2 ; : : : ; yN , and w is such that
y_i = Σ_{j=1}^{P} x_{ij} β_j + e_i   (i = 1, 2, ..., N)   (4.18)

and

w = Σ_{j=1}^{P} λ_j β_j + d,   (4.19)

where e_1, e_2, ..., e_N, and d are random variables, each with expected value 0. Or, equivalently, the distribution of y and w is such that

y = Xβ + e   (4.20)

and

w = λ′β + d,   (4.21)

where e is the N-dimensional random (column) vector with elements e_1, e_2, ..., e_N and hence with E(e) = 0. Moreover, under the original model, var(e) = var(y) = Σ, var(d) = var(w) = φ, and cov(e, d) = cov(y, w) = ω.
Now, suppose that instead of taking β_1, β_2, ..., β_P to be parameters, they are (as in the hierarchical approach) taken to be random variables with expected values of the form (4.5)—in which case, β is a random vector with an expected value of the form (4.9)—and with variance-covariance matrix Γ. Suppose further that each of the random variables β_1, β_2, ..., β_P is uncorrelated with each of the random variables e_1, e_2, ..., e_N, and d [or, equivalently, that cov(β, e) = 0 and cov(β, d) = 0] or, perhaps more generally, suppose that each of the quantities Σ_{j=1}^{P} x_{1j} β_j, Σ_{j=1}^{P} x_{2j} β_j, ..., Σ_{j=1}^{P} x_{Nj} β_j is uncorrelated with each of the random variables e_1, e_2, ..., e_N, and d [or, equivalently, that cov(Xβ, e) = 0 and cov(Xβ, d) = 0]. And consider the effect of these suppositions about β_1, β_2, ..., β_P on the distribution of the N + 1 random variables (4.18) and (4.19) (specifically the effect on their expected values and their variances and covariances).
Observe that β_1, β_2, ..., β_P are reexpressible as

β_j = Σ_{k=1}^{P′} z_{jk} τ_k + δ_j   (j = 1, 2, ..., P),   (4.22)

where δ_1, δ_2, ..., δ_P are random variables with expected values of 0 and variance-covariance matrix Γ and with the property that each of the linear combinations Σ_{j=1}^{P} x_{1j} δ_j, Σ_{j=1}^{P} x_{2j} δ_j, ..., Σ_{j=1}^{P} x_{Nj} δ_j is uncorrelated with each of the random variables e_1, e_2, ..., e_N, and d. Or, in matrix notation,

β = Zτ + δ,   (4.23)

where δ is the P-dimensional random (column) vector with elements δ_1, δ_2, ..., δ_P and hence with E(δ) = 0, var(δ) = Γ, cov(Xδ, e) = 0, and cov(Xδ, d) = 0.
Upon replacing β_1, β_2, ..., β_P in expressions (4.18) and (4.19) with the expressions for β_1, β_2, ..., β_P given by result (4.22), we obtain the expressions

y_i = Σ_{k=1}^{P′} (Σ_{j=1}^{P} x_{ij} z_{jk}) τ_k + f_i   (i = 1, 2, ..., N),   (4.24)

where (for i = 1, 2, ..., N) f_i = e_i + Σ_{j=1}^{P} x_{ij} δ_j, and the expression

w = Σ_{k=1}^{P′} (Σ_{j=1}^{P} λ_j z_{jk}) τ_k + g,   (4.25)

where g = d + Σ_{j=1}^{P} λ_j δ_j. Results (4.24) and (4.25) can be restated in matrix notation as

y = XZτ + f,   (4.26)

where f is the N-dimensional random (column) vector with elements f_1, f_2, ..., f_N and hence where f = e + Xδ, and

w = λ′Zτ + g,   (4.27)

where g = d + λ′δ. Alternatively, expressions (4.26) and (4.27) are obtainable by replacing β in expressions (4.20) and (4.21) with expression (4.23).
Clearly, E(f_i) = 0 (i = 1, 2, ..., N), or equivalently E(f) = 0, and E(g) = 0. Further, by making use of some basic results on the variances and covariances of linear combinations of random variables [in essentially the same way as in the derivation of results (4.7), (4.13), and (4.14)], we find that

cov(f_i, f_{i′}) = σ_{ii′} + Σ_{j=1}^{P} Σ_{j′=1}^{P} x_{ij} x_{i′j′} γ_{jj′}   (i, i′ = 1, 2, ..., N),

var(g) = φ + Σ_{j=1}^{P} Σ_{j′=1}^{P} λ_j λ_{j′} γ_{jj′},

and

cov(f_i, g) = ω_i + Σ_{j=1}^{P} Σ_{j′=1}^{P} x_{ij} λ_{j′} γ_{jj′}   (i = 1, 2, ..., N),

or, equivalently, that

var(f) = Σ + XΓX′,

var(g) = φ + λ′Γλ,

and

cov(f, g) = ω + XΓλ.
These results imply that, as in the case of the hierarchical approach, the expected values and the variances and covariances of the random variables y_1, y_2, ..., y_N, and w are of the form (4.6), (4.12), (4.7), (4.13), and (4.14), or, equivalently, that E(y), E(w), var(y), var(w), and cov(y, w) are of the form (4.10), (4.15), (4.11), (4.16), and (4.17). Let us refer to this alternative way of arriving at a model characterized by expected values and variances and covariances of the form (4.6) and (4.7), or equivalently (4.10) and (4.11), as the random-effects approach.
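As a numerical companion to the random-effects approach, the following sketch (Python with NumPy; the dimensions and the particular X, Z, τ, Γ, and Σ are arbitrary illustrative choices, not values from the text) simulates β = Zτ + δ and y = Xβ + e and checks that the sample moments of y agree with the forms (4.10) and (4.11).

import numpy as np

rng = np.random.default_rng(0)
N, P, Pp = 5, 3, 2

X = rng.normal(size=(N, P))                      # illustrative model matrices
Z = rng.normal(size=(P, Pp))
tau = np.array([1.0, -0.5])
A = rng.normal(size=(P, P)); Gamma = A @ A.T     # var(beta) = var(delta)
B = rng.normal(size=(N, N)); Sigma = B @ B.T     # var(e)

reps = 200_000
delta = rng.multivariate_normal(np.zeros(P), Gamma, size=reps)
e = rng.multivariate_normal(np.zeros(N), Sigma, size=reps)
y = (Z @ tau + delta) @ X.T + e                  # y = X(Z tau + delta) + e

# E(y) = XZtau and var(y) = Sigma + X Gamma X', per (4.10) and (4.11).
print(np.abs(y.mean(axis=0) - X @ Z @ tau).max())                         # near 0
print(np.abs(np.cov(y, rowvar=False) - (Sigma + X @ Gamma @ X.T)).max())  # near 0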
The assumptions comprising the original model are such that (for i = 1, 2, ..., N) E(y_i) = Σ_{j=1}^{P} x_{ij} β_j and are such that the variance-covariance matrix of y_1, y_2, ..., y_N equals Σ (where Σ does not vary with β_1, β_2, ..., β_P). In the hierarchical approach, these assumptions are regarded as applying to the conditional distribution of y_1, y_2, ..., y_N given β_1, β_2, ..., β_P. Thus, in the hierarchical approach, the random vector e in decomposition (4.20) is such that (with probability 1) E(e_i | β_1, β_2, ..., β_P) = 0 (i = 1, 2, ..., N) and cov(e_i, e_{i′} | β_1, β_2, ..., β_P) = σ_{ii′} (i, i′ = 1, 2, ..., N) or, equivalently, E(e | β) = 0 and var(e | β) = Σ.
The random-effects approach results in the same alternative model as the hierarchical approach and does so under less stringent assumptions. In the random-effects approach, it is assumed that the (unconditional) distribution of e is such that (for i = 1, 2, ..., N) E(e_i) = 0 or, equivalently, E(e) = 0. It is also assumed that the joint distribution of e and of the random vector δ in decomposition (4.23) is such that (for i, i′ = 1, 2, ..., N) cov(Σ_{j=1}^{P} x_{ij} δ_j, e_{i′}) = 0 or, equivalently, cov(Xδ, e) = 0. By making use of formulas (4.1) and (4.3), these assumptions can be restated as follows: (for i = 1, 2, ..., N) E[E(e_i | β_1, β_2, ..., β_P)] = 0 and (for i, i′ = 1, 2, ..., N) cov[Σ_{j=1}^{P} x_{ij} δ_j, E(e_{i′} | β_1, β_2, ..., β_P)] = 0 or, equivalently, E[E(e | β)] = 0 and cov[Xδ, E(e | β)] = 0. And the assumption that the variance-covariance matrix of e equals Σ can be restated as

σ_{ii′} = cov(e_i, e_{i′}) = E[cov(e_i, e_{i′} | β_1, β_2, ..., β_P)]
                          + cov[E(e_i | β_1, β_2, ..., β_P), E(e_{i′} | β_1, β_2, ..., β_P)]

(i, i′ = 1, 2, ..., N) or, equivalently,

Σ = var(e) = E[var(e | β)] + var[E(e | β)].
Let us now specialize even further by considering an application of the hierarchical approach or random-effects approach in a setting where the N data points have been partitioned into K groups, corresponding to the first through Kth levels of a single qualitative factor. As in our previous discussion of this setting (in Section 1.3), let us write y_{k1}, y_{k2}, ..., y_{kN_k} for those of the random variables y_1, y_2, ..., y_N associated with the kth level (k = 1, 2, ..., K).
A possible model is the one-way-classification cell-means model, in which

y_{ks} = μ_k + e_{ks}   (k = 1, 2, ..., K; s = 1, 2, ..., N_k),   (4.28)

where μ_1, μ_2, ..., μ_K are unknown parameters and where the e_{ks}'s are uncorrelated, unobservable random variables, each with mean 0 and (for a strictly positive parameter σ of unknown value) variance σ². As previously indicated (in Section 1.3), this model qualifies as a linear model.
Let us apply the hierarchical approach or random-effects approach to the one-way-classification cell-means model. Suppose that μ_1, μ_2, ..., μ_K (but not σ) are regarded as random. Suppose further that μ_1, μ_2, ..., μ_K are uncorrelated and that they have a common, unknown mean, say μ, and (for a nonnegative parameter σ_α of unknown value) a common variance σ_α². Or, equivalently, suppose that

μ_k = μ + α_k   (k = 1, 2, ..., K),   (4.29)

where α_1, α_2, ..., α_K are uncorrelated random variables having mean 0 and a common variance σ_α². (And assume that σ_α and μ are not functionally dependent on σ.)
Under the original model (i.e., the one-way-classification cell-means model), we have that (for k = 1, 2, ..., K and s = 1, 2, ..., N_k)

E(y_{ks}) = μ_k   (4.30)

and that (for k, k′ = 1, 2, ..., K, s = 1, 2, ..., N_k, and s′ = 1, 2, ..., N_{k′})

cov(y_{ks}, y_{k′s′}) = σ² if k′ = k and s′ = s, and cov(y_{ks}, y_{k′s′}) = 0 otherwise.   (4.31)

In the hierarchical approach, the expected value (4.30) and the covariance (4.31) are regarded as a conditional (on μ_1, μ_2, ..., μ_K) expected value and a conditional covariance.
The same model that would be obtained by applying the hierarchical approach (the so-called hierarchical model) is obtainable via the random-effects approach. Accordingly, assume that each of the random variables α_1, α_2, ..., α_K [in representation (4.29)] is uncorrelated with each of the N random variables e_{ks} (k = 1, 2, ..., K; s = 1, 2, ..., N_k). In this setting, the random-effects approach can be implemented by replacing μ_1, μ_2, ..., μ_K in representation (4.28) with the expressions for μ_1, μ_2, ..., μ_K comprising representation (4.29). This operation gives

y_{ks} = μ + α_k + e_{ks} = μ + f_{ks}   (k = 1, 2, ..., K; s = 1, 2, ..., N_k),   (4.32)

where f_{ks} = α_k + e_{ks}. Thus, under the alternative model, all of the y_{ks}'s have the same expected value, and those of the y_{ks}'s that are associated with the same level may be positively correlated. Under the original model, the expected values of the y_{ks}'s may vary with the level, and none of the y_{ks}'s are correlated.
Representation (4.32) is of the same form as representation (3.1), which is identified with the one-way-classification fixed-effects model. However, the quantities α_1, α_2, ..., α_K that appear in representation (4.32) are random variables and are referred to as random effects, whereas the quantities α_1, α_2, ..., α_K that appear in representation (3.1) are (unknown) parameters that are referred to as fixed effects. Accordingly, the model obtained from the cell-means model via the random-effects approach (or the corresponding hierarchical approach) is referred to as the one-way-classification random-effects model.
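To make the contrast concrete, here is a minimal simulation sketch of the one-way-classification random-effects model (4.32) (Python with NumPy; K, the common group size, μ, σ_α, and σ are arbitrary illustrative values). It displays the two features just noted: a common expected value for all the y_{ks}'s and a positive within-level correlation, which equals σ_α²/(σ_α² + σ²).

import numpy as np

rng = np.random.default_rng(1)
K, Nk = 2000, 5                   # K levels, Nk observations per level
mu, sig_a, sig = 10.0, 2.0, 1.0   # illustrative parameter values

alpha = rng.normal(0.0, sig_a, size=K)    # random effects alpha_k
e = rng.normal(0.0, sig, size=(K, Nk))    # residual effects e_ks
y = mu + alpha[:, None] + e               # representation (4.32)

print(y.mean())   # common expected value, close to mu = 10
# Correlation between two observations sharing a level:
print(np.corrcoef(y[:, 0], y[:, 1])[0, 1])   # near the intraclass correlation
print(sig_a**2 / (sig_a**2 + sig**2))        # 0.8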
The one-way-classification random-effects model can be obtained not only via an application
of the random-effects approach or hierarchical approach to the one-way-classification cell-means
model, but also via an application of the random-effects approach or hierarchical approach to the
one-way-classification fixed-effects model. In the case of the random-effects approach, simply regard
the effects α_1, α_2, ..., α_K of the fixed-effects model as random variables rather than (unknown) parameters, assume that α_1, α_2, ..., α_K are uncorrelated, each with mean 0 and variance σ_α², and assume that each of the α_k's is uncorrelated with each of the e_{ks}'s. Then, by proceeding in much
the same way as in the application of the random-effects approach to the one-way-classification
cell-means model, we once again arrive at the one-way-classification random-effects model.
We have established that the one-way-classification random-effects model can be obtained by
adding (in the context of a random-effects approach or hierarchical approach) to the assumptions
that comprise the one-way-classification fixed-effects model or the one-way-classification cell-means
model. Under what circumstances are the additional assumptions likely to reflect an underlying reality
and hence to be beneficial? The additional assumptions would seem to be warranted in a circumstance
where the K levels of the factor can reasonably be envisioned as a random sample from an infinitely
large “population” of levels. Or, relatedly, they might be warranted in a circumstance where it is
possible to conceive of K infinitely large sets of data points, each of which corresponds to a different
one of the K levels, and where the average values of the data points in the K sets can reasonably be
regarded as a random sample from an infinitely large population of averages.
Suppose, for example, that each data point consists of the amount of milk produced by a different one of N dairy cows and that N_k of the cows are the daughters of the kth of K bulls (k = 1, 2, ..., K).
Then, as discussed in Section 1.3, interest might center on the differences among the breeding
values of the bulls. Among the models that could conceivably serve as a basis for inferences about
those differences is the one-way-classification fixed-effects model (in which the factor is that whose
levels correspond to the bulls) or the one-way-classification cell-means model. However, under
some circumstances, better results are likely to be obtained by basing the inferences on the one-way-
classification random-effects model. Those circumstances include ones where the underlying reality
is at least reasonably consistent with what might be expected if the K bulls were a random sample
from an infinitely large population of bulls.
As a practical matter, the circumstances are likely to be such that the one-way-classification
random-effects model is too simplistic to provide a satisfactory basis for the inferences. The circum-
stances in which the one-way-classification random-effects model is likely to be inadequate include
those discussed earlier (in Section 1.3) in which inferences based on a one-way-classification fixed-
effects (or cell-means) model are likely to be misleading. They also include other circumstances.
Some of the K bulls may have one or more ancestors in common with some of the other bulls.
Depending on the extent and closeness of the resultant genetic relationships, it may be important
to take those relationships into account. This can be done within the context of the hierarchical or
random-effects approach. Instead of taking the random effects to be uncorrelated, they can be taken
to be correlated in a way and to an extent that reflects the underlying relationships.
There may exist other information about the K bulls that (like the ancestral information) is
“external” to the information provided by the N data points and that is at odds with various of
the assumptions of the one-way-classification random-effects model. For example, the information
might take the form of results from a statistical analysis of some earlier data. We may wish to
base our inferences on a model that accounts for this information. As in the case of the ancestral
information, such a model can (at least in principle) be devised within the context of the hierar-
chical approach or random-effects approach. In our application of the random-effects approach to
the one-way-classification cell-means model, we may wish to modify our assumption that the cell
means 1 ; 2 ; : : : ; K have a common mean as well as our assumption that the random effects
˛1 ; ˛2 ; : : : ; ˛K are uncorrelated with a common variance ˛2 .
In making inferences on the basis of the one-way-classification cell-means model (4.28), the quantities of interest are typically ones that are expressible as a linear combination of the cell means μ_1, μ_2, ..., μ_K, say a linear combination Σ_{k=1}^{K} c_k μ_k, or, more generally, quantities that are expressible as a random variable w of the form

w = Σ_{k=1}^{K} c_k μ_k + d,   (4.34)

where d is a random variable for which E(d) = 0 [and for which var(d) and the covariance of d with each of the e_{ks}'s do not depend on μ_1, μ_2, ..., μ_K]. Let us consider how w is affected by the
application of the random-effects approach (to the one-way-classification cell-means model). In that approach, μ_1, μ_2, ..., μ_K are regarded as random variables that are expressible in the form (4.29). And it is supposed that each of the random effects α_1, α_2, ..., α_K is uncorrelated with d (as well as with each of the e_{ks}'s).
The random variable w can be reexpressed in terms of the parameter μ and the random effects α_1, α_2, ..., α_K. Upon replacing μ_1, μ_2, ..., μ_K in expression (4.34) with the expressions for μ_1, μ_2, ..., μ_K given by representation (4.29), we find that

w = Σ_{k=1}^{K} c_k μ + Σ_{k=1}^{K} c_k α_k + d   (4.35)
  = (Σ_{k=1}^{K} c_k) μ + g,   (4.36)

where g = d + Σ_{k=1}^{K} c_k α_k.
Expression (4.36) gives w in terms of the parameter μ and a random variable g. Clearly, E(g) = 0. And it follows from a basic formula on the variance of a linear combination of uncorrelated random variables that

var(g) = var(d) + σ_α² Σ_{k=1}^{K} c_k².   (4.37)
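Formula (4.37) is easy to spot-check by simulation. In the sketch below (Python with NumPy; K, the c_k's, σ_α, and var(d) are arbitrary illustrative choices), g = d + Σ_k c_k α_k is drawn repeatedly and its sample variance is compared with var(d) + σ_α² Σ_k c_k².

import numpy as np

rng = np.random.default_rng(2)
K, sig_a, sd_d = 4, 1.5, 0.7             # illustrative values
c = np.array([0.5, -1.0, 2.0, 0.25])     # coefficients c_1, ..., c_K

reps = 500_000
alpha = rng.normal(0.0, sig_a, size=(reps, K))
d = rng.normal(0.0, sd_d, size=reps)     # d uncorrelated with the alpha_k's
g = d + alpha @ c                        # g = d + sum_k c_k alpha_k

print(g.var())                           # sample variance of g
print(sd_d**2 + sig_a**2 * np.sum(c**2)) # formula (4.37)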
Recall that (for k = 1, 2, ..., K and s = 1, 2, ..., N_k) f_{ks} = α_k + e_{ks}. Under the original (one-way-classification cell-means) model, cov(y_{ks}, w) = cov(e_{ks}, d), while under the alternative (one-way-classification random-effects) model, cov(y_{ks}, w) = cov(f_{ks}, g). Even if e_{ks} and d are uncorrelated, f_{ks} and g may be (and in many cases are) correlated. In fact,

cov(f_{ks}, g) = cov(e_{ks}, d) + cov(e_{ks}, Σ_{k′=1}^{K} c_{k′} α_{k′}) + cov(α_k, d) + cov(α_k, Σ_{k′=1}^{K} c_{k′} α_{k′})
              = cov(e_{ks}, d) + 0 + 0 + cov(α_k, Σ_{k′=1}^{K} c_{k′} α_{k′})
              = cov(e_{ks}, d) + cov(α_k, Σ_{k′=1}^{K} c_{k′} α_{k′}),   (4.38)

as can be readily verified by applying a basic formula for a covariance between sums of random variables. And, in light of the assumption that α_1, α_2, ..., α_K are uncorrelated, each with variance σ_α², result (4.38) simplifies to

cov(f_{ks}, g) = cov(e_{ks}, d) + c_k σ_α².
As noted earlier in this section, the one-way-classification random-effects model can be obtained by applying a hierarchical approach or random-effects approach to the one-way-classification fixed-effects model (as well as by application to the one-way-classification cell-means model). A related observation (pertaining to inference about a random variable w) is that the application of the hierarchical or random-effects approach to the one-way-classification fixed-effects model has the same effect on the random variable w defined by w = Σ_{k=1}^{K} c_k (μ + α_k) + d as the application to the one-way-classification cell-means model has on the random variable w = Σ_{k=1}^{K} c_k μ_k + d.
It is worth noting that results pertaining to inference about a random variable w based on the one-way-classification random-effects model can be readily extended from a random variable w that is expressible in the form (4.35) to one that is expressible in the more general form

w = c_0 μ + Σ_{k=1}^{K′} c_k α_k + d,
If t(y) is an unbiased estimator or predictor, then E{[t(y) − w]²} = var[t(y) − w]; that is, the MSE of t(y) equals the variance of its estimation or prediction error. In the special case where w is a parametric function, the variance var[t(y) − w] of the estimation or prediction error equals the variance var[t(y)] of the estimator or predictor; however, in general, var[t(y) − w] is not necessarily equal to var[t(y)]. In the special case where w is a parametric function and where t(y) is an unbiased estimator or predictor, the root MSE of t(y) equals √var[t(y)] and may be referred to as the standard error (SE) of t(y). In the point estimation or prediction of w, it is desirable to include an estimate of the SE or root MSE of the estimator or predictor.
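As a simple numerical illustration (Python with NumPy; the distribution and sample size are arbitrary illustrative choices), the sample mean of n i.i.d. normal draws is an unbiased estimator of the population mean, its SE is σ/√n, and its Monte Carlo MSE agrees with the variance of its estimation error.

import numpy as np

rng = np.random.default_rng(3)
n, mu, sigma = 25, 3.0, 2.0

reps = 100_000
means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

errors = means - mu
print(np.mean(errors**2))        # Monte Carlo MSE of the sample mean
print(errors.var())              # variance of the estimation error (equal: unbiased)
print((sigma / np.sqrt(n))**2)   # theoretical value sigma^2/n = 0.16, SE = 0.4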
In interval (or set) estimation or prediction, inferences about w are in the form of one or more
intervals or, more generally, one or more (1-dimensional) sets. The end points of each interval are
functions of y; more generally, the membership of each set varies with y. Associated with each
interval or set is a numerical measure that may be helpful in assessing the “chances” of the interval
or set including or “covering” w.
Let S(y) represent an arbitrary (1-dimensional) set (the membership of which varies with y). The probability Pr[w ∈ S(y)] is referred to as the probability of coverage. It is interpretable as the (theoretical) frequency of the event w ∈ S(y) in an infinitely long sequence of repeated applications (involving repeated draws of both w- and y-values). Clearly, Pr[w ∈ S(y)] = 1 − Pr[w ∉ S(y)].
In general, Pr[w ∉ S(y)] may depend on θ and/or on characteristics of the joint distribution of w and y not covered by the assumptions that comprise the statistical model. If S(y) is such that Pr[w ∉ S(y)] = α (uniformly for all distributions that conform to the underlying assumptions), then S(y) is said to be a 100(1 − α)% confidence set for w [or, if appropriate, a 100(1 − α)% confidence interval for w]. More generally, S(y) is said to be a 100(1 − α)% confidence set or interval for w if the supremum of Pr[w ∉ S(y)] (over all distributions that conform to the underlying assumptions) equals α. The infimum of the probability of coverage of a 100(1 − α)% confidence interval or set equals 1 − α; this number is referred to as the confidence coefficient or confidence level. When w represents a future quantity, the confidence interval or set might be referred to as a prediction interval or set.
In making statistical inferences about w, it may be instructive to form a 100(1 − α)% confidence interval or set for each of several different values of α. Models that provide an adequate basis for point estimation may not provide an adequate basis for obtaining confidence intervals or sets. Rather, we may require a more elaborate model that is rather specific about the form of the distribution of y (or, in general, the joint distribution of w and y). In the case of a linear model, it is common practice to take the form of this distribution to be multivariate normal.
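As an illustration of the coverage idea, the following sketch (Python with NumPy/SciPy; the sample size, parameter values, and nominal level are arbitrary illustrative choices) repeatedly draws normal data, forms the usual t-based 95% confidence interval for the mean, and records the frequency with which the interval covers the true mean.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, mu, sigma, alpha = 20, 5.0, 2.0, 0.05
tcrit = stats.t.ppf(1 - alpha / 2, df=n - 1)

covered = 0
reps = 20_000
for _ in range(reps):
    y = rng.normal(mu, sigma, size=n)
    half = tcrit * y.std(ddof=1) / np.sqrt(n)
    covered += (y.mean() - half <= mu <= y.mean() + half)

print(covered / reps)   # close to the confidence coefficient 0.95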
In making inferences about a parametric function, there are circumstances that would seem to call for the inclusion of hypothesis testing. Suppose that w represents a parametric function, that is, a function of θ. Suppose further that, for some value w^(0) of this function, interest centers on the question of whether the hypothesis H_0: w = w^(0) (called the null hypothesis) is “consistent with the data.” Corresponding to a test of the null hypothesis H_0 is a set C of N-dimensional (column) vectors that is referred to as the critical region (or the rejection region) and the complementary set A (consisting of those N-dimensional vectors not contained in C), which is referred to as the acceptance region. The test consists of accepting H_0, if y ∈ A, and of rejecting H_0, if y ∈ C.
If (for some number α between 0 and 1) the probability Pr(y ∈ C) of rejecting H_0 equals α whenever w = w^(0) (i.e., whenever H_0 is true), then the test is said to be a size-α test. More generally, the probability Pr(y ∈ C) of rejecting H_0 when H_0 is true may depend on θ and/or on characteristics of the distribution of y not covered by the assumptions that comprise the model, in which case the test is said to be a size-α test if (under H_0) the supremum of Pr(y ∈ C) equals α. Under either the null hypothesis H_0 or the hypothesis H_1: w ≠ w^(0) (called the alternative hypothesis), Pr(y ∈ C) is interpretable as the (theoretical) frequency with which H_0 is rejected in an infinitely long sequence of repeated applications (involving repeated draws from the distribution of y).
Corresponding to any 100(1 − α)% confidence interval or set that may exist for the parametric function w is a size-α test of H_0. If S(y) is a 100(1 − α)% confidence interval or set for w, then a size-α test of H_0 is obtained by taking the acceptance region A to be the set of y-values for which w^(0) ∈ S(y) (e.g., Casella and Berger 2002, sec. 9.2). If the model is to provide an adequate basis for hypothesis testing, it may (as in the case of interval or set estimation) be necessary to include rather specific assumptions about the form of the distribution of y.
There may be a size-α test of H_0 for every α between 0 and 1. Let (for 0 < α < 1) C(α) represent the critical region of the size-α test. And suppose (as would often be the case) that the critical regions are nested in the sense that, for 0 < α_1 < α_2 < 1, C(α_1) is a proper subset of C(α_2). Then, the infimum of the set {α : y ∈ C(α)} (i.e., the infimum of those α-values for which H_0 is rejected by the size-α test) is referred to as the p-value. Instead of reporting the results of the size-α test for one or more values of α, it may be preferable to report the p-value—clearly, the p-value conveys more information.
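The definition of the p-value as an infimum over nested critical regions can be made concrete with a simple example. In the sketch below (Python with NumPy/SciPy; the data and hypothesized mean are arbitrary illustrative choices), the size-α critical regions of the usual two-sided one-sample t test are nested, and the smallest α on a fine grid at which H_0 is rejected agrees with the p-value the test reports directly.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
y = rng.normal(5.4, 2.0, size=25)   # data; hypothesized mean below
w0 = 5.0                            # H0: mean = 5.0

t, p = stats.ttest_1samp(y, popmean=w0)

# Reject at size alpha iff |t| exceeds the t critical value; scan a grid
# of alphas and take the smallest one at which H0 is rejected.
alphas = np.linspace(1e-4, 0.9999, 10_000)
rejected = np.abs(t) > stats.t.ppf(1 - alphas / 2, df=len(y) - 1)
print(alphas[rejected].min())   # infimum over the grid
print(p)                        # agrees (up to grid resolution)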
In practice, there will typically be more than one unobservable quantity of interest. Suppose that inferences are to be made about M unobservable quantities, and that these quantities are to be regarded as the respective values of M random variables w_1, w_2, ..., w_M (whose joint distribution may depend on θ)—this formulation includes the case where some or all of the unobservable quantities are functions of θ. One approach is to deal with the M unobservable quantities separately, that is, to make inferences about each of these quantities as though that quantity were the only quantity about which inferences were being made. That approach lends itself to misinterpretation of results (more so in the case of interval or set estimation or prediction and hypothesis testing than in the case of point estimation or prediction) and to unwarranted conclusions.
The potential pitfalls can be circumvented by adopting an alternative approach in which the M unobservable quantities of interest are dealt with simultaneously. Let w represent the M-dimensional unobservable random (column) vector whose elements are w_1, w_2, ..., w_M, respectively. Inference about the value of the vector w can be regarded as synonymous with simultaneous inference about w_1, w_2, ..., w_M. Many of the concepts introduced earlier (in connection with inference about a single unobservable random variable w) can be readily extended to inference about w.
Let t(y) represent a vector-valued function (in the form of an M-dimensional column vector) of y, the realized value of which is to be regarded as a (point) estimate of the value of the vector w. This function is referred to as a point estimator (or predictor), and the vector t(y) − w is referred to as the error of estimation or prediction. The estimator or predictor t(y) is said to be unbiased if E[t(y) − w] = 0 or, equivalently, if E[t(y)] = E(w).
The elements, say t_1(y), t_2(y), ..., t_M(y), of t(y) can be regarded as estimators or predictors of w_1, w_2, ..., w_M, respectively. Clearly, t(y) is an unbiased estimator or predictor of the vector w if and only if, for i = 1, 2, ..., M, t_i(y) is an unbiased estimator or predictor of w_i. The M × M matrix E{[t(y) − w][t(y) − w]′} is referred to as the mean-squared-error (MSE) matrix of t(y). Its ijth element equals E{[t_i(y) − w_i][t_j(y) − w_j]} (i, j = 1, 2, ..., M), implying in particular that the diagonal elements of the MSE matrix are the MSEs of t_1(y), t_2(y), ..., t_M(y), respectively. If t(y) is an unbiased estimator or predictor of w, then its MSE matrix equals the variance-covariance matrix of the vector t(y) − w.
Turning now to the set estimation or prediction of the vector w, let S(y) represent an arbitrary set of M-dimensional (column) vectors (the membership of which varies with y). The terminology introduced earlier (in connection with the interval or set estimation or prediction of an unobservable random variable w) extends in a straightforward way to the set estimation or prediction of w. In particular, Pr[w ∈ S(y)] is referred to as the probability of coverage. As in the one-dimensional case, Pr[w ∈ S(y)] = 1 − Pr[w ∉ S(y)]. Further, the set S(y) is said to be a 100(1 − α)% confidence set if the supremum of Pr[w ∉ S(y)] (over all distributions of w and y that conform to the underlying assumptions) equals α. And, accordingly, the infimum of the probability of coverage of a 100(1 − α)% confidence set equals 1 − α, a number that is referred to as the confidence coefficient or confidence level.
Inference about the vector w can also take the form of M one-dimensional sets, one set for each of its elements. For j = 1, 2, ..., M, let S_j(y) represent an interval (whose end points depend on y) or, more generally, a one-dimensional set (the membership of which varies with y) of w_j-values. A possible criterion for evaluating these sets (as confidence sets for w_1, w_2, ..., w_M, respectively) is the probability (for some relatively small integer k ≥ 1) that w_j ∈ S_j(y) for at least M − k + 1 values of j. In the special case where k = 1, this probability is referred to as the probability of simultaneous coverage. It is worth noting that the probability of simultaneous coverage can be drastically smaller than any of the M “individual” probabilities of coverage Pr[w_1 ∈ S_1(y)], Pr[w_2 ∈ S_2(y)], ..., Pr[w_M ∈ S_M(y)].
Closely related to the sets S_1(y), S_2(y), ..., S_M(y) is the set S(y) of M-dimensional (column) vectors defined as follows:

S(y) = {w : w_j ∈ S_j(y) (j = 1, 2, ..., M)}.   (5.1)

When each of the sets S_1(y), S_2(y), ..., S_M(y) is an interval, the geometrical form of the set S(y) is that of an “M-dimensional rectangle.” Clearly, Pr[w ∈ S(y)] equals the probability of simultaneous coverage of the individual sets S_1(y), S_2(y), ..., S_M(y). Accordingly, the set S(y) is a 100(1 − α)% confidence set for w if (and only if) the probability of simultaneous coverage of S_1(y), S_2(y), ..., S_M(y) equals 1 − α.
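The gap between individual and simultaneous coverage is easy to exhibit numerically. The sketch below (Python with NumPy/SciPy; M, the sample size, and the parameter values are arbitrary illustrative choices) forms M individual 95% t intervals for M independent means and estimates the probability that all M intervals cover simultaneously, which falls well below 0.95 and near the (0.95)^M one would expect under independence.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
M, n, mu = 10, 15, 0.0
tcrit = stats.t.ppf(0.975, df=n - 1)

reps = 10_000
all_cover = 0
for _ in range(reps):
    y = rng.normal(mu, 1.0, size=(M, n))           # M independent samples
    half = tcrit * y.std(axis=1, ddof=1) / np.sqrt(n)
    cover = np.abs(y.mean(axis=1) - mu) <= half    # coverage, interval by interval
    all_cover += cover.all()

print(all_cover / reps)   # simultaneous coverage
print(0.95 ** M)          # roughly 0.60 for M = 10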
Let us now revisit hypothesis testing. Suppose that every one of the M elements w_1, w_2, ..., w_M of w is a function of θ. Suppose further that, for some values w_1^(0), w_2^(0), ..., w_M^(0), interest centers on the question of whether the hypothesis H_0: w_i = w_i^(0) (i = 1, 2, ..., M) (the so-called null hypothesis) is consistent with the data. The null hypothesis can be restated (in matrix notation) as H_0: w = w^(0), where w^(0) is the M-dimensional (column) vector with elements w_1^(0), w_2^(0), ..., w_M^(0).
Various concepts introduced earlier in connection with the testing of the null hypothesis in the case of a single parametric function extend in an altogether straightforward way to the case of an M-dimensional vector of parametric functions. These concepts include those of an acceptance region, of a critical or rejection region, of a size-α test, and of a p-value. Moreover, the relationship between confidence sets and hypothesis tests discussed earlier (in connection with inference about a single parametric function) extends to inference about the M-dimensional vector w (of parametric functions): if S(y) is a 100(1 − α)% confidence set for w, then a size-α test of H_0: w = w^(0) is obtained by taking the acceptance region to be the set of y-values for which w^(0) ∈ S(y).
The null hypothesis H_0: w = w^(0) can be regarded as a “composite” of the M “individual” hypotheses H_0^(1): w_1 = w_1^(0), H_0^(2): w_2 = w_2^(0), ..., H_0^(M): w_M = w_M^(0). Clearly, the composite null hypothesis H_0 is true if and only if all M of the individual null hypotheses H_0^(1), H_0^(2), ..., H_0^(M) are true and is false if and only if one or more of the individual null hypotheses are false. As a variation on the problem of testing H_0, there is the problem of testing H_0^(1), H_0^(2), ..., H_0^(M) individually. The latter problem is known as the problem of multiple comparisons—in some applications, w_1, w_2, ..., w_M have interpretations relating them to comparisons among various entities. The classical approach to this problem is to restrict attention to tests of H_0^(1), H_0^(2), ..., H_0^(M) such that the probability of one or more false rejections does not exceed some relatively low level α, for example, α = 0.05. For even moderately large values of M, this approach can be quite “conservative.” A less conservative approach is obtainable by considering all test procedures for which (for some positive integer k > 1) the probability of k or more false rejections does not exceed α (e.g., Lehmann and Romano 2005a). Another such approach is that of Benjamini and Hochberg (1995); it takes the form of controlling the false discovery rate, which by definition is the expected value of the ratio of the number of false rejections to the total number of rejections.
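For concreteness, here is a minimal sketch (Python with NumPy; the p-values are arbitrary illustrative inputs, not from the text) of the Benjamini–Hochberg step-up rule: order the M p-values, find the largest i with p_(i) ≤ (i/M)α, and reject the corresponding i individual null hypotheses.

import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected by the Benjamini-Hochberg
    step-up procedure at false-discovery-rate level alpha."""
    p = np.asarray(pvals, dtype=float)
    M = p.size
    order = np.argsort(p)                     # indices that sort the p-values
    thresh = alpha * np.arange(1, M + 1) / M  # thresholds (i/M) * alpha
    below = p[order] <= thresh
    reject = np.zeros(M, dtype=bool)
    if below.any():
        cutoff = np.nonzero(below)[0].max()   # largest i with p_(i) <= (i/M) alpha
        reject[order[:cutoff + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.74]
print(benjamini_hochberg(pvals, alpha=0.05))  # rejects the two smallest here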
In the testing of a hypothesis about a parametric function or functions and in the point or set
estimation or prediction of an unobservable random variable or vector w, the statistical properties of
the test or of the estimator or predictor depend on various characteristics of the distribution of y or,
more generally (in the case of the estimator or predictor), on the joint distribution of w and y. These
properties include the probability of acceptance or rejection of a hypothesis (by a hypothesis test),
the probability of coverage (by a confidence set), and the unbiasedness and MSE or MSE matrix of
a point estimator or predictor. Some or all of the relevant characteristics of the distribution of y or of
the joint distribution of w and y are determined by the assumptions (about the distribution of y) that
comprise the statistical model and by any further assumptions pertaining to the joint distribution of
w and y. By definition, these assumptions pertain to the unconditional distribution of y and to the
unconditional joint distribution of w and y.
It can be informative to determine the properties of a hypothesis test or of a point or set estimator or predictor under more than one model (or under more than one set of assumptions about the joint
distribution of w and y) and/or to determine the properties of the test or the estimator or predictor
conditionally on the values of various random variables (e.g., conditionally on the values of various
functions of y or even on the value of y itself). The appeal of a test or estimation or prediction pro-
cedure whose properties have been evaluated unconditionally under a particular model (or particular
set of assumptions about the joint distribution of w and y) can be either enhanced or diminished
by evaluating its properties conditionally and/or under an alternative model (or alternative set of
assumptions). The relative appeal of alternative procedures may be a matter of emphasis; which
procedure has the more favorable properties may depend on whether the properties are evaluated
conditionally or unconditionally and under which model. In such a case, it may be instructive to
analyze the data in accordance with each of multiple procedures.
1.6 An Overview
This volume provides coverage of linear statistical models and of various statistical procedures that
are based on those models. The emphasis is on the underlying theory; however, some discussion of
applications and some attempts at illustration are included among the content.
In-depth coverage is provided for a broad class of linear statistical models consisting of what are
referred to herein as Gauss–Markov models. Results obtained on the basis of Gauss–Markov models
can be extended in a relatively straightforward way to a somewhat broader class of linear statistical
models consisting of what are referred to herein as Aitken models. Results on a few selected topics
are obtained for what are referred to herein as general linear models, which form a very broad class
of linear statistical models and include the Gauss–Markov and Aitken models as special cases.
The models underlying (simple and multiple) linear regression procedures are Gauss–Markov
models, and results obtained on the basis of Gauss–Markov models apply more-or-less directly to
those procedures. Moreover, many of the procedures that are commonly used to analyze experimen-
tal data (and in some cases observational data) are based on classificatory (fixed-effects) models.
Like regression models, these models are Gauss–Markov models. However, some of the procedures
that are commonly used to analyze classificatory data (such as the analysis of variance) are rather
“specialized.” For the most part, those kinds of specialized procedures are outside the scope of what
is covered herein. They constitute (along with various results that would expand on the coverage of
general linear models) potential subjects for a possible future volume.
The organization of the present volume is such that the results that are directly applicable to
linear models are presented in Chapters 5 and 7. Chapter 5 provides coverage of (point) estimation
and prediction, and Chapter 7 provides coverage of topics related to the construction of confidence
intervals and sets and to the testing of hypotheses. Chapters 2, 3, 4, and 6 present results on matrix
algebra and the relevant underlying statistical distributions (as well as other supportive material);
many of these results are of importance in their own right. Some additional results on matrix algebra
and statistical distributions are introduced in Chapters 5 and 7 (as the need for them arises).
2
Matrix Algebra: A Primer
Knowledge of matrix algebra is essential in working with linear models. Chapter 2 provides a limited
coverage of matrix algebra, with an emphasis on concepts and results that are highly relevant and
that are more-or-less elementary in nature. It forms a core body of knowledge, and, as such, provides
a solid foundation for the developments that follow. Derivations or proofs are included for most
results. In subsequent chapters, the coverage of matrix algebra is extended (as the need arises) to
additional concepts and results.
a. Matrix operations
A matrix can be transformed or can be combined with various other matrices in accordance with
operations called scalar multiplication, matrix addition and subtraction, matrix multiplication, and
transposition.
Scalar multiplication. The term scalar is to be used synonymously with real number. Scalar multiplication is defined for an arbitrary scalar k and an arbitrary M × N matrix A = {a_ij}. The product of k and A is written as kA (or, much less commonly, as Ak), and is defined to be the M × N matrix whose ijth element is ka_ij. The matrix kA is said to be a scalar multiple of the matrix A. Clearly, for any scalars c and k and any matrix A,

c(kA) = (ck)A.   (1.1)

It is customary to refer to the product (−1)A of −1 and A as the negative of A and to abbreviate (−1)A to −A.
Matrix addition and subtraction. Matrix addition and subtraction are defined for any two matrices A = {a_ij} and B = {b_ij} that have the same number of rows, say M, and the same number of columns, say N. The sum of the two M × N matrices A and B is denoted by the symbol A + B and is defined to be the M × N matrix whose ijth element is a_ij + b_ij.
Matrix addition is commutative, that is,

A + B = B + A;   (1.2)

and matrix addition is associative, that is,

A + (B + C) = (A + B) + C.   (1.3)

The symbol A + B + C is used to represent the common value of the left and right sides of equality (1.3), and that value is referred to as the sum of A, B, and C. This notation and terminology extend in an obvious way to any finite number of M × N matrices.
Clearly, for any scalar k,

k(A + B) = kA + kB,   (1.4)

and, for any scalars c and k,

(c + k)A = cA + kA.   (1.5)

Let us write A − B for the sum A + (−B) or, equivalently, for the M × N matrix whose ijth element is a_ij − b_ij, and refer to this matrix as the difference between A and B.
Matrices having the same number of rows and the same number of columns are said to be conformal for addition (and subtraction).
Matrix multiplication. Turning now to matrix multiplication (i.e., the multiplication of one matrix by another), let A = {a_ij} represent an M × N matrix and B = {b_ij} a P × Q matrix. When N = P (i.e., when A has the same number of columns as B has rows), the matrix product AB is defined to be the M × Q matrix whose ijth element is

Σ_{k=1}^{N} a_ik b_kj = a_i1 b_1j + a_i2 b_2j + ⋯ + a_iN b_Nj.

Matrix multiplication is associative, that is,

A(BC) = (AB)C,   (1.6)

provided that N = P and that C has Q rows (so that all relevant matrix products are defined). The symbol ABC is used to represent the common value of the left and right sides of equality (1.6), and that value is referred to as the product of A, B, and C. This notation and terminology extend in an obvious way to any finite number of matrices.
Matrix multiplication is distributive with respect to addition, that is,

A(B + C) = AB + AC   (1.7)

and

(B + C)A = BA + CA,   (1.8)

where, in each equality, it is assumed that the dimensions of A, B, and C are such that all multiplications and additions are defined. Results (1.7) and (1.8) extend in an obvious way to the postmultiplication or premultiplication of a matrix A or C by the sum of any finite number of matrices.
In general, matrix multiplication is not commutative. That is, AB is not necessarily identical to BA. In fact, when N = P but M ≠ Q or when M = Q but N ≠ P, one of the matrix products AB and BA is defined, while the other is undefined. When N = P and M = Q, AB and BA are both defined, but the dimensions (M × M) of AB are the same as the dimensions (N × N) of BA only if M = N. Even if N = P = M = Q, in which case A and B are both N × N matrices and the two matrix products AB and BA are both defined and of the same dimensions, it is not necessarily the case that AB = BA.
Two N × N matrices A and B are said to commute if AB = BA. More generally, a collection of N × N matrices A_1, A_2, ..., A_K is said to commute in pairs if A_i A_j = A_j A_i for j > i = 1, 2, ..., K.
For any scalar c, M × N matrix A, and N × P matrix B, it is customary to write cAB for the scalar product c(AB) of c and the matrix product AB. Note that

cAB = (cA)B = A(cB)   (1.9)

(as is evident from the very definitions of scalar and matrix multiplication). This notation (for a scalar multiple of a product of two matrices) and result (1.9) extend in an obvious way to a scalar multiple of a product of any finite number of matrices.
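A tiny numerical illustration (Python with NumPy; the matrices are arbitrary illustrative choices) of the points just made: AB and BA need not agree even when both are defined and of the same dimensions, while associativity (1.6) and the scalar rule (1.9) always hold.

import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[0., 1.], [1., 0.]])
C = np.array([[2., 0.], [0., 3.]])

print(np.array_equal(A @ B, B @ A))            # False: A and B do not commute
print(np.allclose(A @ (B @ C), (A @ B) @ C))   # True: associativity (1.6)
print(np.allclose(5 * (A @ B), (5 * A) @ B))   # True: result (1.9)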
Transposition. Corresponding to any M × N matrix A = {a_ij} is the N × M matrix obtained by rewriting the columns of A as rows or the rows of A as columns. This matrix, the ijth element of which is a_ji, is called the transpose of A and is to be denoted herein by the symbol A′.
For any matrix A,

(A′)′ = A;   (1.10)

for any scalar k and any matrix A,

(kA)′ = kA′;   (1.11)

and for any two matrices A and B (that are conformal for addition),

(A + B)′ = A′ + B′   (1.12)

—these three results are easily verified. Further, for any two matrices A and B (for which the product AB is defined),

(AB)′ = B′A′,   (1.13)

as can be verified by comparing the ijth element of B′A′ with that of (AB)′. More generally,

(A_1 A_2 ⋯ A_K)′ = A_K′ ⋯ A_2′ A_1′.   (1.14)
b. Types of matrices
There are several types of matrices that are worthy of mention.
Square matrices. A matrix having the same number of rows as columns, say N rows and N columns, is referred to as a square matrix and is said to be of order N. The N elements of a square matrix of order N that lie on an imaginary line (called the diagonal) extending from the upper left corner of the matrix to the lower right corner are called the diagonal elements; the other N(N − 1) elements of the matrix (those elements that lie above and to the right or below and to the left of the diagonal) are called the off-diagonal elements. Thus, the diagonal elements of a square matrix A = {a_ij} of order N are a_ii (i = 1, 2, ..., N), and the off-diagonal elements are a_ij (j ≠ i = 1, 2, ..., N).
Symmetric matrices. A matrix A is said to be symmetric if A′ = A. Thus, a matrix is symmetric if it is square and if (for all i and j ≠ i) its jith element equals its ijth element.
Diagonal matrices. A diagonal matrix is a square matrix whose off-diagonal elements are all equal to 0. Thus, a square matrix A = {a_ij} of order N is a diagonal matrix if a_ij = 0 for j ≠ i = 1, 2, ..., N. The notation D = {d_i} is sometimes used to introduce a diagonal matrix, the ith diagonal element of which is d_i. Also, we may write diag(d_1, d_2, ..., d_N) for such a matrix (where N is the order of the matrix).
Identity matrices. A diagonal matrix diag(1, 1, ..., 1) whose diagonal elements are all equal to 1 is called an identity matrix. The symbol I_N is used to represent an identity matrix of order N. In cases where the order is clear from the context, I_N may be abbreviated to I.
Triangular matrices. If all of the elements of a square matrix that are located below and to the left of the diagonal are 0, the matrix is said to be upper triangular. Similarly, if all of the elements that are located above and to the right of the diagonal are 0, the matrix is said to be lower triangular. More formally, a square matrix A = {a_ij} of order N is upper triangular if a_ij = 0 for j < i = 1, ..., N and is lower triangular if a_ij = 0 for j > i = 1, ..., N. By a triangular matrix, we mean a (square) matrix that is upper triangular or lower triangular. An (upper or lower) triangular matrix is called a unit (upper or lower) triangular matrix if all of its diagonal elements equal 1.
Row and column vectors. A matrix that has only one row, that is, a matrix of the form (a_1, a_2, ..., a_N), is called a row vector. Similarly, a matrix that has only one column is called a column vector. A row or column vector having N elements may be referred to as an N-dimensional row or column vector. Clearly, the transpose of an N-dimensional column vector is an N-dimensional row vector, and vice versa.
Lowercase boldface letters (e.g., a) are used herein to represent column vectors. This notation is helpful in distinguishing column vectors from matrices that may have more than one column. No further notation is introduced for row vectors. Instead, row vectors are represented as the transposes of column vectors. For example, a′ represents the row vector whose transpose is the column vector a. The notation a = {a_i} or a′ = {a_i} is used in introducing a column or row vector whose ith element is a_i.
Note that each column of an M × N matrix A = {a_ij} is an M-dimensional column vector, and that each row of A is an N-dimensional row vector. Specifically, the jth column of A is the M-dimensional column vector (a_1j, a_2j, ..., a_Mj)′ (j = 1, ..., N), and the ith row of A is the N-dimensional row vector (a_i1, a_i2, ..., a_iN) (i = 1, ..., M).
Null matrices. A matrix all of whose elements are 0 is called a null matrix—a matrix having one or more nonzero elements is said to be nonnull. A null matrix is denoted by the symbol 0—this notation is reserved for use in situations where the dimensions of the null matrix can be ascertained from the context. A null matrix that has one row or one column may be referred to as a null vector.
Matrices of 1's. The symbol 1_N is used to represent an N-dimensional column vector all of whose elements equal 1. In a situation where the dimensions of a column vector of 1's are clear from the context or are to be left unspecified, we may simply write 1 for such a vector. Note that 1′_N is an N-dimensional row vector, all of whose elements equal 1, and that 1_M 1′_N is an M × N matrix, all of whose elements equal 1.
out, and vice versa). The R × R (principal) submatrix of an N × N matrix obtained by striking out the last N − R rows and columns is referred to as a leading principal submatrix (R = 1, ..., N). A principal submatrix of a symmetric matrix is symmetric, a principal submatrix of a diagonal matrix is diagonal, and a principal submatrix of an upper or lower triangular matrix is respectively upper or lower triangular, as is easily verified.
and the partitioning of A and B is said to be conformal (for the premultiplication of B by A). In the special case where R = C = U = V = 2, that is, where

A = [ A_11  A_12 ]   and   B = [ B_11  B_12 ]
    [ A_21  A_22 ]             [ B_21  B_22 ],

the product AB takes the form

AB = [ A_11 B_11 + A_12 B_21   A_11 B_12 + A_12 B_22 ]
     [ A_21 B_11 + A_22 B_21   A_21 B_12 + A_22 B_22 ].

If R = C and A_ij = 0 for j ≠ i = 1, ..., R, then A is said to be block-diagonal, and diag(A_11, A_22, ..., A_RR) is sometimes written for A. If A_ij = 0 for j < i = 1, ..., R, that is, if

A = [ A_11  A_12  ...  A_1R ]
    [ 0     A_22  ...  A_2R ]
    [ ...              ...  ]
    [ 0     0     ...  A_RR ],

then A is called an upper block-triangular matrix. Similarly, if A_ij = 0 for j > i = 1, ..., R, that is, if

A = [ A_11  0     ...  0    ]
    [ A_21  A_22  ...  0    ]
    [ ...              ...  ]
    [ A_R1  A_R2  ...  A_RR ],

then A is called a lower block-triangular matrix. To indicate that A is upper or lower block-triangular (without being more specific), A is referred to simply as block-triangular.
For an M × N matrix A = (a_1, a_2, ..., a_N), with columns a_1, a_2, ..., a_N, and an N-dimensional column vector x = (x_1, x_2, ..., x_N)′, with elements x_1, x_2, ..., x_N,

Ax = x_1 a_1 + x_2 a_2 + ⋯ + x_N a_N.   (2.8)

Similarly, for an M × N matrix A with rows b′_1, b′_2, ..., b′_M and an M-dimensional row vector x′ = (x_1, x_2, ..., x_M), with elements x_1, x_2, ..., x_M,

x′A = x_1 b′_1 + x_2 b′_2 + ⋯ + x_M b′_M.   (2.9)
Note also that an unpartitioned matrix can be regarded as a “partitioned” matrix comprising a single row and a single column of blocks. Thus, letting A represent an M × N matrix and taking X = (x_1, x_2, ..., x_Q) to be an N × Q matrix with columns x_1, x_2, ..., x_Q and Y to be a P × M matrix with rows y′_1, y′_2, ..., y′_P, result (2.5) implies that

AX = (Ax_1, Ax_2, ..., Ax_Q),   (2.10)

YA = [ y′_1 A ]
     [ y′_2 A ]
     [  ...   ]
     [ y′_P A ],   (2.11)

and

YAX = [ y′_1 Ax_1   y′_1 Ax_2   ...   y′_1 Ax_Q ]
      [ y′_2 Ax_1   y′_2 Ax_2   ...   y′_2 Ax_Q ]
      [  ...                           ...      ]
      [ y′_P Ax_1   y′_P Ax_2   ...   y′_P Ax_Q ].   (2.12)

That is, AX is an M × Q matrix whose jth column is Ax_j (j = 1, 2, ..., Q); YA is a P × N matrix whose ith row is y′_i A (i = 1, 2, ..., P); and YAX is a P × Q matrix whose ijth element is y′_i Ax_j (i = 1, 2, ..., P; j = 1, 2, ..., Q).
Representation (2.9) is helpful in establishing the elementary results expressed in the following
two lemmas—refer, for instance, to Harville (1997, sec. 2.3) for detailed derivations.
Lemma 2.2.1. For any column vector y and nonnull column vector x, there exists a matrix A
such that y D Ax.
Lemma 2.2.2. For any two M N matrices A and B, A D B if and only if Ax D Bx for every
N -dimensional column vector x.
Note that Lemma 2.2.2 implies in particular that A D 0 if and only if Ax D 0 for every x.
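These representations are easy to check numerically. The sketch below (Python with NumPy; the matrices are arbitrary illustrative choices) verifies that Ax is the linear combination (2.8) of the columns of A and that the jth column of AX is Ax_j.

import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(3, 4))
x = rng.normal(size=4)
X = rng.normal(size=(4, 2))

# (2.8): Ax = x_1 a_1 + ... + x_N a_N, a linear combination of the columns of A.
combo = sum(x[j] * A[:, j] for j in range(A.shape[1]))
print(np.allclose(A @ x, combo))

# (2.10): the j-th column of AX is A x_j.
print(np.allclose((A @ X)[:, 1], A @ X[:, 1]))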
a. Basic properties
Recall that the trace of a square matrix A = {a_ij} of order N, denoted by tr(A) or tr A, is the sum a_11 + a_22 + ⋯ + a_NN of its diagonal elements. Clearly, for any scalar k and any N × N matrices A and B,

tr(kA) = k tr(A)   (3.1)

and

tr(A + B) = tr(A) + tr(B);   (3.2)

and, more generally, for any scalars c_1, c_2, ..., c_K and any N × N matrices A_1, A_2, ..., A_K,

tr(c_1 A_1 + c_2 A_2 + ⋯ + c_K A_K) = c_1 tr(A_1) + c_2 tr(A_2) + ⋯ + c_K tr(A_K),

as can be readily verified by, for example, the repeated application of results (3.1) and (3.2). And for a square matrix A that has been partitioned as

A = [ A_11  A_12  ...  A_1R ]
    [ A_21  A_22  ...  A_2R ]
    [  ...              ... ]
    [ A_R1  A_R2  ...  A_RR ]

in such a way that the diagonal blocks A_11, A_22, ..., A_RR are square,

tr(A) = tr(A_11) + tr(A_22) + ⋯ + tr(A_RR).
b. Trace of a product
Let A = {a_ij} represent an M × N matrix and B = {b_ji} an N × M matrix. Then,

tr(AB) = Σ_{i=1}^{M} Σ_{j=1}^{N} a_ij b_ji,   (3.6)

as is evident upon observing that the ith diagonal element of AB is Σ_{j=1}^{N} a_ij b_ji. Thus, since the jith element of B is the ijth element of B′, the trace of the matrix product AB can be formed by multiplying the ijth element of A by the corresponding (ijth) element of B′ and by then summing (over i and j).
A simple (but very important) result on the trace of a product of two matrices is expressed in the
following lemma.
Lemma 2.3.1. For any M × N matrix A and N × M matrix B,

tr(AB) = tr(BA).   (3.7)

Proof. Let a_ij represent the ijth element of A and b_ji the jith element of B, and observe that the jth diagonal element of BA is Σ_{i=1}^{M} b_ji a_ij. Thus, making use of result (3.6), we find that

tr(AB) = Σ_{i=1}^{M} Σ_{j=1}^{N} a_ij b_ji = Σ_{j=1}^{N} Σ_{i=1}^{M} a_ij b_ji = Σ_{j=1}^{N} Σ_{i=1}^{M} b_ji a_ij = tr(BA).   Q.E.D.
Note [in light of results (3.7) and (3.6)] that for any M × N matrix A = {a_ij},

tr(A′A) = tr(AA′) = Σ_{i=1}^{M} Σ_{j=1}^{N} a_ij²   (3.8)
                  ≥ 0.   (3.9)

That is, both tr(A′A) and tr(AA′) equal the sum of squares of the MN elements of A, and both are inherently nonnegative.
and it follows from Corollary 2.3.3 that AB − AC = 0 or, equivalently, that AB = AC.
(2) To establish Part (2), simply take the transpose of each side of the two equivalent equalities AB′ = AC′ and A′AB′ = A′AC′. [The equivalence of these two equalities follows from Part (1).] Q.E.D.
Note that as a special case of Part (1) of Corollary 2.3.4 (the special case where C = 0), we have that AB = 0 if and only if A′AB = 0, and as a special case of Part (2), we have that BA′ = 0 if and only if BA′A = 0.
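A quick numerical check (Python with NumPy; the matrices are arbitrary illustrative choices) of Lemma 2.3.1 and of result (3.8): tr(AB) = tr(BA) even though AB and BA have different dimensions, and tr(A′A) equals the sum of squares of the elements of A.

import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(3, 5))
B = rng.normal(size=(5, 3))

# Lemma 2.3.1: tr(AB) = tr(BA); here AB is 3x3 while BA is 5x5.
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))

# Result (3.8): tr(A'A) = tr(AA') = sum of squares of the elements of A.
print(np.isclose(np.trace(A.T @ A), np.sum(A**2)))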
contains the null matrix 0 (of appropriate dimensions), and that the set f0g, whose only member is
a null matrix, is a linear space. Note also that if a linear space contains a nonnull matrix, then it
contains an infinite number of nonnull matrices.
A linear combination of matrices A_1, A_2, ..., A_K (of the same dimensions) is an expression of the general form

x_1 A_1 + x_2 A_2 + ⋯ + x_K A_K,

where x_1, x_2, ..., x_K are scalars (which are referred to as the coefficients of the linear combination). If A_1, A_2, ..., A_K are matrices in a linear space V, then every linear combination of A_1, A_2, ..., A_K is also in V.
Corresponding to any finite set of M × N matrices is the span of the set. By definition, the span of a nonempty finite set {A_1, A_2, ..., A_K} of M × N matrices is the set consisting of all matrices that are expressible as linear combinations of A_1, A_2, ..., A_K. (By convention, the span of the empty set of M × N matrices is the set {0}, whose only member is the M × N null matrix.) The span of a finite set S is denoted herein by the symbol sp(S); sp({A_1, A_2, ..., A_K}), which represents the span of the set comprising the matrices A_1, A_2, ..., A_K, is typically abbreviated to sp(A_1, A_2, ..., A_K). Clearly, the span of any finite set of M × N matrices is a linear space.
A finite set S of matrices in a linear space V is said to span V if sp(S) = V. Or, equivalently [since sp(S) ⊂ V], S spans V if V ⊂ sp(S).
b. Subspaces
A subset U of a linear space V (of M × N matrices) is said to be a subspace of V if U is itself a linear space. Trivial examples of a subspace of a linear space V are: (1) the set {0}, whose only member is the null matrix, and (2) the entire set V. The column space C(A) of an M × N matrix A is a subspace of R^M (when R^M is interpreted as the set of all M-dimensional column vectors), and R(A) is a subspace of R^N (when R^N is interpreted as the set of all N-dimensional row vectors).
We require some additional terminology and notation. Suppose that S and T are subspaces of the linear space of all M × N matrices or, more generally, that S and T are subsets of a given set. If every member of S is a member of T, then S is said to be contained in T (or T is said to contain S), and we write S ⊂ T (or T ⊃ S). Note that if S ⊂ T and T ⊂ S, then S = T, that is, the two subsets S and T are identical.
Some basic results on row and column spaces are expressed in the following lemmas and corol-
laries, proofs of which are given by Harville (1997, sec. 4.2).
Lemma 2.4.2. Let A represent an M × N matrix. Then, for any subspace U of R^M, C(A) ⊂ U if and only if every column of A belongs to U. Similarly, for any subspace V of R^N, R(A) ⊂ V if and only if every row of A belongs to V.
Lemma 2.4.3. For any M × N matrix A and M × P matrix B, C(B) ⊂ C(A) if and only if there exists an N × P matrix F such that B = AF. Similarly, for any M × N matrix A and Q × N matrix C, R(C) ⊂ R(A) if and only if there exists a Q × M matrix L such that C = LA.
Corollary 2.4.4. For any M × N matrix A and N × P matrix F, C(AF) ⊂ C(A). Similarly, for any M × N matrix A and Q × M matrix L, R(LA) ⊂ R(A).
Corollary 2.4.5. Let A represent an M × N matrix, E an N × K matrix, F an N × P matrix, L a Q × M matrix, and T an S × M matrix.
(1) If C(E) ⊂ C(F), then C(AE) ⊂ C(AF); and if C(E) = C(F), then C(AE) = C(AF).
(2) If R(L) ⊂ R(T), then R(LA) ⊂ R(TA); and if R(L) = R(T), then R(LA) = R(TA).
Lemma 2.4.6. Let A represent an M × N matrix and B an M × P matrix. Then, (1) C(A) ⊂ C(B) if and only if R(A′) ⊂ R(B′), and (2) C(A) = C(B) if and only if R(A′) = R(B′).
c. Linear dependence and independence
A nonempty finite set {A_1, A_2, ..., A_K} of M × N matrices is said to be linearly dependent if there exist scalars x_1, x_2, ..., x_K, not all 0, such that

x_1 A_1 + x_2 A_2 + ⋯ + x_K A_K = 0.

If no such scalars exist, the set is said to be linearly independent. The empty set is considered to be linearly independent. Note that if any subset of a finite set of M × N matrices is linearly dependent, then the set itself is linearly dependent. Note also that if the set {A_1, A_2, ..., A_K} is linearly dependent, then some member of the set, say the sth member A_s, can be expressed as a linear combination of the other K − 1 members A_1, A_2, ..., A_{s−1}, A_{s+1}, ..., A_K; that is, A_s = Σ_{i≠s} y_i A_i for some scalars y_1, y_2, ..., y_{s−1}, y_{s+1}, ..., y_K.
While technically linear dependence and independence are properties of sets of matrices, it is customary to speak of “a set of linearly dependent (or independent) matrices” or simply of “linearly dependent (or independent) matrices” instead of “a linearly dependent (or independent) set of matrices.” In particular, in the case of row or column vectors, it is customary to speak of “linearly dependent (or independent) vectors.”
d. Bases
A basis for a linear space V of M N matrices is a linearly independent set of matrices in V
that spans V. The empty set is the (unique) basis for the linear space f0g (whose only member
is the null matrix). The set whose members are the M columns .1; 0; : : : ; 0/0 ; : : : ; .0; : : : ; 0; 1/0
of the M M identity matrix IM is a basis for the linear space RM of all M -dimensional col-
umn vectors. Similarly, the set whose members are the N rows .1; 0; : : : ; 0/; : : : ; .0; : : : ; 0; 1/ of
Linear Spaces 35
the N × N identity matrix I_N is a basis for the linear space ℝ^N of all N-dimensional row vectors. More generally, letting U_ij represent the M × N matrix whose ijth element equals 1 and whose remaining (MN − 1) elements equal 0, the set whose members are the MN matrices U_11, U_21, …, U_M1, U_12, U_22, …, U_M2, …, U_1N, U_2N, …, U_MN is a basis (the so-called natural basis) for the linear space of all M × N matrices (as can be readily verified).
Now, consider the column space C(A) and row space R(A) of an M × N matrix A. By definition, C(A) is spanned by the set whose members are the columns of A. If this set is linearly independent, it is a basis for C(A); otherwise, it is not. Similarly, if the set whose members are the rows of A is linearly independent, it is a basis for R(A).
Two fundamentally important properties of linear spaces in general and row and column spaces
in particular are described in the following two theorems.
Theorem 2.4.7. Every linear space (of M × N matrices) has a basis.
Theorem 2.4.8. Any two bases for a linear space (of M × N matrices) contain the same number of matrices.
The number of matrices in a basis for a linear space V (of M × N matrices) is referred to as the dimension of V and is denoted by the symbol dim V or dim(V). Note that the term dimension is used not only in reference to the number of matrices in a basis, but also in reference to the number of rows or columns in a matrix—which usage is intended is determinable from the context.
Some basic results related to the dimension of a linear space or subspace are presented in the
following two theorems.
Theorem 2.4.9. If a linear space V (of M × N matrices) is spanned by a set of R matrices, then dim V ≤ R, and if there is a set of K linearly independent matrices in V, then dim V ≥ K.
Theorem 2.4.10. Let U and V represent linear spaces of M × N matrices. If U ⊂ V (i.e., if U is a subspace of V), then dim U ≤ dim V. Moreover, if U ⊂ V and if in addition dim U = dim V, then U = V.
Two key results pertaining to bases are as follows.
Theorem 2.4.11. Any set of R linearly independent matrices in an R-dimensional linear space V (of M × N matrices) is a basis for V.
Theorem 2.4.12. A matrix A in a linear space V (of M × N matrices) has a unique representation in terms of any particular basis {A1, A2, …, AR}; that is, the coefficients x1, x2, …, xR in the linear combination
A = x1A1 + x2A2 + ⋯ + xRAR
are uniquely determined.
For proofs of the results set forth in Theorems 2.4.7 through 2.4.12, refer to Harville (1997, sec.
4.3).
The dimension of the column space C(A) of a matrix A is referred to as the column rank of A, and the dimension of the row space R(A) as the row rank of A.
Theorem 2.4.13. The row rank of any matrix A equals the column rank of A.
Theorem 2.4.14. Let A represent an M × N matrix of column rank C and row rank R. Then, there exist an M × C matrix B and a C × N matrix L such that A = BL; similarly, there exist an M × R matrix K and an R × N matrix T such that A = KT.
Proof (of Theorem 2.4.14). Take B to be an M × C matrix whose columns form a basis for C(A). Then, C(A) = C(B), and consequently it follows from Lemma 2.4.3 that there exists a C × N matrix L such that A = BL. The existence of an M × R matrix K and an R × N matrix T such that A = KT can be established via a similar argument. Q.E.D.
In light of Theorem 2.4.13, it is not necessary to distinguish between the row and column ranks of a matrix A. Their common value is called the rank of A and is denoted by the symbol rank A or rank(A).
Various of the results pertaining to the dimensions of linear spaces and subspaces can be specialized to row and column spaces and restated in terms of ranks. Since the column space of an M × N matrix A is a subspace of ℝ^M and the row space of A is a subspace of ℝ^N, the following lemma is an immediate consequence of Theorem 2.4.10.
Lemma 2.4.15. For any M × N matrix A, rank(A) ≤ M and rank(A) ≤ N.
A further implication of Theorem 2.4.10 is as follows.
Theorem 2.4.16. Let A represent an M × N matrix, B an M × P matrix, and C a Q × N matrix. If C(B) ⊂ C(A), then rank(B) ≤ rank(A); if C(B) ⊂ C(A) and if in addition rank(B) = rank(A), then C(B) = C(A). Similarly, if R(C) ⊂ R(A), then rank(C) ≤ rank(A); if R(C) ⊂ R(A) and if in addition rank(C) = rank(A), then R(C) = R(A).
In light of Corollary 2.4.4, we have the following corollary of Theorem 2.4.16.
Corollary 2.4.17. Let A represent an M × N matrix and F an N × P matrix. Then, rank(AF) ≤ rank(A) and rank(AF) ≤ rank(F). Moreover, if rank(AF) = rank(A), then C(AF) = C(A); similarly, if rank(AF) = rank(F), then R(AF) = R(F).
The rank of an M × N matrix cannot exceed min(M, N), as is evident from Lemma 2.4.15. An M × N matrix A is said to have full row rank if rank(A) = M, that is, if its rank equals the number of rows, and to have full column rank if rank(A) = N. Clearly, an M × N matrix can have full row rank only if M ≤ N, that is, only if the number of rows does not exceed the number of columns, and can have full column rank only if N ≤ M.
A matrix is said to be nonsingular if it has both full row rank and full column rank. Clearly, any nonsingular matrix is square. By definition, an N × N matrix A is nonsingular if and only if rank(A) = N. An N × N matrix of rank less than N is said to be singular.
Any M × N matrix can be expressed as the product of a matrix having full column rank and a matrix having full row rank, as indicated by the following theorem.
Theorem 2.4.18. Let A represent an M × N nonnull matrix of rank R. Then, there exist an M × R matrix B and an R × N matrix T such that A = BT. Moreover, for any M × R matrix B and R × N matrix T such that A = BT, rank(B) = rank(T) = R, that is, B has full column rank and T has full row rank.
Proof. The existence of an M × R matrix B and an R × N matrix T such that A = BT follows from Theorem 2.4.14. And, letting B represent any M × R matrix and T any R × N matrix such that A = BT, we find that rank(B) ≥ R and rank(T) ≥ R (as is evident from Corollary 2.4.17) and that rank(B) ≤ R and rank(T) ≤ R (as is evident from Lemma 2.4.15), and as a consequence, we have that rank(B) = R and rank(T) = R. Q.E.D.
The following theorem, a proof of which is given by Harville (1997, sec. 4.4), characterizes the
rank of a matrix in terms of the ranks of its submatrices.
Theorem 2.4.19. Let A represent an M × N matrix of rank R. Then, A contains R linearly independent rows and R linearly independent columns. And for any R linearly independent rows and R linearly independent columns of A, the R × R submatrix, obtained by striking out the other M − R rows and N − R columns, is nonsingular. Moreover, any set of more than R rows or more than R columns (of A) is linearly dependent, and there exists no submatrix of A whose rank exceeds R.
It is a simple exercise to verify that the function defined by expression (4.4) has the four properties required of an inner product. In the special case of a linear space V of M-dimensional column vectors, the value assigned by the usual inner product to each pair of vectors x = {x_i} and y = {y_i} in V is expressible as
$$x \cdot y = \operatorname{tr}(xy') = \operatorname{tr}(y'x) = y'x = x'y = \sum_{i=1}^{M} x_i y_i. \tag{4.5}$$
And in the special case of a linear space V of N-dimensional row vectors, the value assigned by the usual inner product to each pair of vectors x′ = {x_j} and y′ = {y_j} in V is expressible as
$$x' \cdot y' = \operatorname{tr}[x'(y')'] = \operatorname{tr}(x'y) = x'y = \sum_{j=1}^{N} x_j y_j. \tag{4.6}$$
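Results (4.5) and (4.6) are easy to check numerically. The following sketch (which assumes numpy; the matrices and vectors are arbitrary choices, not from the text) verifies that tr(AB′) coincides with the elementwise sum of products and that tr(xy′) = x′y:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

# Usual inner product of M x N matrices: A . B = tr(AB') = sum of elementwise products
assert np.isclose(np.trace(A @ B.T), np.sum(A * B))

# Special case (4.5) for M-dimensional column vectors: x . y = tr(xy') = x'y
x = rng.standard_normal((5, 1))
y = rng.standard_normal((5, 1))
assert np.isclose(np.trace(x @ y.T), (x.T @ y).item())
```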
The four basic properties of an inner product for a linear space V (of M × N matrices) imply various additional properties. We find, in particular, that (for any matrix A in V)
0 · A = 0, (4.7)
as is evident from Property (3) upon observing that
0 · A = (0A) · A = 0(A · A) = 0.
And by making repeated use of Properties (3) and (4), we find that (for any matrices A1, A2, …, AK, and B in V and any scalars x1, x2, …, xK),
(x1A1 + x2A2 + ⋯ + xKAK) · B = x1(A1 · B) + x2(A2 · B) + ⋯ + xK(AK · B). (4.8)
Corresponding to an arbitrary matrix, say A, in the linear space V of M × N matrices is the scalar (A · A)^{1/2}. This scalar is called the norm of A and is denoted by the symbol ‖A‖. The norm depends on the choice of inner product; when the inner product is taken to be the usual inner product, the norm is referred to as the usual norm.
An important and famous inequality, known as the Schwarz inequality or Cauchy–Schwarz inequality, is set forth in the following theorem, a proof of which is given by Harville (1997, sec. 6.3).
Theorem 2.4.21 (Cauchy–Schwarz inequality). For any two matrices A and B in a linear space V,
|A · B| ≤ ‖A‖‖B‖, (4.9)
with equality holding if and only if B = 0 or A = kB for some scalar k.
As a special case of Theorem 2.4.21, we have that for any two M-dimensional column vectors x and y,
|x′y| ≤ (x′x)^{1/2}(y′y)^{1/2}, (4.10)
with equality holding if and only if y = 0 or x = ky for some scalar k.
Two vectors x and y in a linear space V of M-dimensional column vectors are said to be orthogonal to each other if x · y = 0. More generally, two matrices A and B in a linear space V are said to be orthogonal to each other if A · B = 0. The statement that two matrices A and B are orthogonal to each other is sometimes abbreviated to A ⊥ B. Whether two matrices are orthogonal to each other depends on the choice of inner product; two matrices that are orthogonal (to each other) with respect to one inner product may not be orthogonal (to each other) with respect to another inner product.
A finite set of matrices in a linear space V of M × N matrices is said to be orthogonal if every matrix in the set is orthogonal to every other matrix in the set. Thus, the empty set and any set containing only one matrix are orthogonal sets. And a finite set {A1, A2, …, AK} of two or more matrices in V is an orthogonal set if Ai · Aj = 0 for j ≠ i = 1, 2, …, K. A finite set of matrices in V is said to be orthonormal if the set is orthogonal and if the norm of every matrix in the set equals 1. In the special case of a set of (row or column) vectors, the expression “set of orthogonal (or orthonormal) vectors,” or simply “orthogonal (or orthonormal) vectors,” is often used in lieu of the technically more correct expression “orthogonal (or orthonormal) set of vectors.”
The following lemma establishes a connection between orthogonality and linear independence.
Lemma 2.4.22. Any orthogonal set of nonnull matrices in a linear space V (of M × N matrices) is linearly independent.
Proof. Let {A1, A2, …, AK} represent an orthogonal set of nonnull matrices in V, and let x1, x2, …, xK represent any scalars such that x1A1 + x2A2 + ⋯ + xKAK = 0. Then, for i = 1, 2, …, K,
0 = 0 · Ai = (x1A1 + x2A2 + ⋯ + xKAK) · Ai
= x1(A1 · Ai) + x2(A2 · Ai) + ⋯ + xK(AK · Ai)
= xi(Ai · Ai),
implying (since Ai is nonnull) that xi = 0. We conclude that the set {A1, A2, …, AK} is linearly independent. Q.E.D.
Note that Lemma 2.4.22 implies in particular that any orthonormal set of matrices is linearly independent. Note also that the converse of Lemma 2.4.22 is not necessarily true; that is, a linearly independent set is not necessarily orthogonal. For example, the set consisting of the two 2-dimensional row vectors (1, 0) and (1, 1) is linearly independent but is not orthogonal (with respect to the usual inner product).
Suppose now that A1, A2, …, AK are linearly independent matrices in a linear space V of M × N matrices. There exists a recursive procedure, known as Gram–Schmidt orthogonalization, that, when applied to A1, A2, …, AK, generates an orthonormal set of M × N matrices B1, B2, …, BK (the jth of which is a linear combination of A1, A2, …, Aj)—refer, for example, to Harville (1997, sec. 6.4) for a discussion of Gram–Schmidt orthogonalization. In combination with Theorems 2.4.7 and 2.4.11 and Lemma 2.4.22, the existence of such a procedure leads to the following conclusion.
Theorem 2.4.23. Every linear space (of M N matrices) has an orthonormal basis.
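As an illustration of the recursion, here is a minimal sketch of Gram–Schmidt orthogonalization for column vectors under the usual inner product (the function name gram_schmidt and the use of numpy are assumptions made for this example; the matrix case works the same way with A · B = tr(AB′)):

```python
import numpy as np

def gram_schmidt(vectors):
    """From linearly independent a_1, ..., a_K, build an orthonormal set
    b_1, ..., b_K, the j-th of which is a linear combination of a_1, ..., a_j."""
    basis = []
    for a in vectors:
        # Remove the components of a along the b's constructed so far.
        v = a - sum((b @ a) * b for b in basis)
        basis.append(v / np.linalg.norm(v))
    return basis

vecs = [np.array([1.0, 0.0, 0.0]), np.array([1.0, 1.0, 0.0]), np.array([1.0, 1.0, 1.0])]
B = gram_schmidt(vecs)
print(np.round(np.array([[bi @ bj for bj in B] for bi in B]), 12))  # identity matrix
```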
g. Some results on the rank of a matrix partitioned into blocks of rows or columns
and on the rank and row or column space of a sum of matrices
A basic result on the rank of a matrix that has been partitioned into two blocks of rows or columns
is as follows.
Lemma 2.4.24. For any M × N matrix A, M × P matrix B, and Q × N matrix C,
rank(A, B) ≤ rank(A) + rank(B) (4.11)
and
$$\operatorname{rank}\begin{pmatrix} A \\ C \end{pmatrix} \le \operatorname{rank}(A) + \operatorname{rank}(C). \tag{4.12}$$
Proof. Every column of (A, B) is a column of A or of B, so that C(A, B) is spanned by the union of a basis for C(A) and a basis for C(B) and hence is of dimension no greater than rank(A) + rank(B), which establishes inequality (4.11). Inequality (4.12) can be established via an analogous argument. Q.E.D.
Upon the repeated application of result (4.11), we obtain the more general result that for any matrices A1, A2, …, AK having M rows,
rank(A1, A2, …, AK) ≤ rank(A1) + rank(A2) + ⋯ + rank(AK). (4.13)
And, similarly, upon the repeated application of result (4.12), we find that for any matrices A1, A2, …, AK having N columns,
$$\operatorname{rank}\begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_K \end{pmatrix} \le \operatorname{rank}(A_1) + \operatorname{rank}(A_2) + \dots + \operatorname{rank}(A_K). \tag{4.14}$$
Every linear space of M × N matrices contains the M × N null matrix 0. When the intersection U ∩ V of two linear spaces U and V of M × N matrices contains no matrices other than the M × N null matrix, U and V are said to be essentially disjoint. The following theorem gives a necessary and sufficient condition for equality to hold in inequality (4.11) or (4.12) of Lemma 2.4.24.
Theorem 2.4.25. Let A represent an M × N matrix, B an M × P matrix, and C a Q × N matrix. Then,
rank(A, B) = rank(A) + rank(B) (4.15)
if and only if C(A) and C(B) are essentially disjoint, and, similarly,
$$\operatorname{rank}\begin{pmatrix} A \\ C \end{pmatrix} = \operatorname{rank}(A) + \operatorname{rank}(C) \tag{4.16}$$
if and only if R(A) and R(C) are essentially disjoint.
Proof. Let R = rank A and S = rank B. And take a1, a2, …, aR to be any R linearly independent columns of A and b1, b2, …, bS any S linearly independent columns of B—their existence follows from Theorem 2.4.19. Clearly,
C(A, B) = sp(a1, a2, …, aR, b1, b2, …, bS).
Thus, to establish the first part of Theorem 2.4.25, it suffices to show that the set {a1, a2, …, aR, b1, b2, …, bS} is linearly independent if and only if C(A) and C(B) are essentially disjoint.
Accordingly, suppose that C(A) and C(B) are essentially disjoint. Then, for any scalars c1, c2, …, cR and k1, k2, …, kS such that Σ_{i=1}^R c_i a_i + Σ_{j=1}^S k_j b_j = 0, we have that Σ_{i=1}^R c_i a_i = −Σ_{j=1}^S k_j b_j, implying (in light of the essential disjointness) that Σ_{i=1}^R c_i a_i = 0 and Σ_{j=1}^S k_j b_j = 0 and hence that c1 = c2 = ⋯ = cR = 0 and k1 = k2 = ⋯ = kS = 0. And we conclude that the set {a1, a2, …, aR, b1, b2, …, bS} is linearly independent.
Conversely, if C(A) and C(B) were not essentially disjoint, there would exist scalars c1, c2, …, cR and k1, k2, …, kS, not all of which are 0, such that Σ_{i=1}^R c_i a_i = Σ_{j=1}^S k_j b_j or, equivalently, such that Σ_{i=1}^R c_i a_i + Σ_{j=1}^S (−k_j) b_j = 0, in which case the set {a1, a2, …, aR, b1, b2, …, bS} would be linearly dependent.
The second part of Theorem 2.4.25 can be proved in similar fashion. Q.E.D.
Suppose that the M × N matrix A and the M × P matrix B are such that A′B = 0. Then, for any N × 1 vector k and any P × 1 vector ℓ such that Ak = Bℓ,
A′Ak = A′Bℓ = 0,
implying (in light of Corollary 2.3.4) that Ak = 0 and leading to the conclusion that C(A) and C(B) are essentially disjoint. Similarly, if AC′ = 0, then R(A) and R(C) are essentially disjoint. Thus, we have the following corollary of Theorem 2.4.25.
Corollary 2.4.26. Let A represent an M × N matrix. Then, for any M × P matrix B such that A′B = 0,
rank(A, B) = rank(A) + rank(B).
And for any Q × N matrix C such that AC′ = 0,
$$\operatorname{rank}\begin{pmatrix} A \\ C \end{pmatrix} = \operatorname{rank}(A) + \operatorname{rank}(C).$$
Upon the repeated application of results (4.17), (4.18), and (4.19), we obtain the more general results that for any K M × N matrices A1, A2, …, AK,
$$\operatorname{rank}(A_1 + A_2 + \dots + A_K) \le \operatorname{rank}\begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_K \end{pmatrix} \tag{4.23}$$
and
rank(A1 + A2 + ⋯ + AK) ≤ rank(A1) + rank(A2) + ⋯ + rank(AK). (4.24)
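The rank inequalities above can be probed numerically; a small sketch (numpy assumed, with the matrices contrived so the ranks are easy to see) checks (4.13) and (4.24):

```python
import numpy as np

rng = np.random.default_rng(1)
A1 = np.outer(rng.standard_normal(4), rng.standard_normal(3))  # rank 1
A2 = np.outer(rng.standard_normal(4), rng.standard_normal(3))  # rank 1
rank = np.linalg.matrix_rank

# (4.13): rank(A1, A2) <= rank(A1) + rank(A2)
assert rank(np.hstack([A1, A2])) <= rank(A1) + rank(A2)
# (4.24): rank(A1 + A2) <= rank(A1) + rank(A2)
assert rank(A1 + A2) <= rank(A1) + rank(A2)
```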
Lemma 2.5.1. An M × N matrix A has a right inverse if and only if rank(A) = M (i.e., if and only if A has full row rank) and has a left inverse if and only if rank(A) = N (i.e., if and only if A has full column rank).
Proof. If rank(A) = M, then C(A) = C(I_M) [as is evident from Theorem 2.4.16 upon observing that C(A) ⊂ ℝ^M = C(I_M)], implying (in light of Lemma 2.4.3) that there exists a matrix R such that AR = I_M (i.e., that A has a right inverse). Conversely, if there exists a matrix R such that AR = I_M, then
rank(A) ≥ rank(AR) = rank(I_M) = M,
implying [since, according to Lemma 2.4.15, rank(A) ≤ M] that rank(A) = M. That A has a left inverse if and only if rank(A) = N is evident upon observing that A has a left inverse if and only if A′ has a right inverse [and recalling that rank(A′) = rank(A)]. Q.E.D.
As an almost immediate consequence of Lemma 2.5.1, we have the following corollary.
Corollary 2.5.2. A matrix A has both a right inverse and a left inverse if and only if A is a
(square) nonsingular matrix.
If there exists a matrix B that is both a right and left inverse of a matrix A (so that AB = I and BA = I), then A is said to be invertible and B is referred to as an inverse of A. Only a (square) nonsingular matrix can be invertible, as is evident from Corollary 2.5.2.
The following lemma and theorem include some basic results on the existence and uniqueness
of inverse matrices.
Lemma 2.5.3. If a square matrix A has a right or left inverse B, then A is nonsingular and B is
an inverse of A.
Proof. Suppose that A has a right inverse R. Then, it follows from Lemma 2.5.1 that A is nonsingular and further that A has a left inverse L. Observing that
L = LI = LAR = IR = R,
we find that R is a left as well as a right inverse of A and hence that R is an inverse of A. The case where A has a left inverse can be handled in similar fashion. Q.E.D.
Theorem 2.5.4. An invertible matrix A has a unique inverse, say B; moreover, A is nonsingular and has no right or left inverse other than B.
Proof. Let B represent an inverse of A, and let C represent any left inverse of A (the inverse B itself being one such matrix). Then,
C = CI = CAB = IB = B,
implying that A has a unique inverse and further—in light of Lemma 2.5.3—that A has no right or left inverse other than B.
That any invertible matrix is nonsingular is (as noted earlier) evident from Corollary 2.5.2. Q.E.D.
The symbol A⁻¹ is used to denote the inverse of a nonsingular matrix A. By definition,
AA⁻¹ = A⁻¹A = I.
A 1 × 1 matrix A = (a11) is invertible if and only if its element a11 is nonzero, in which case
A⁻¹ = (1/a11).
For a 2 × 2 matrix
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix},$$
we find that
$$AB = kI, \quad \text{where } B = \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix} \text{ and } k = a_{11}a_{22} - a_{12}a_{21}.$$
If k = 0, then AB = 0, implying that the columns of A are linearly dependent, in which case A is singular and hence not invertible. If k ≠ 0, then A[(1/k)B] = I, in which case A is invertible and
A⁻¹ = (1/k)B. (5.1)
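A quick numerical check of formula (5.1) (numpy assumed; the entries are arbitrary):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [4.0, 2.0]])
k = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]          # k = a11 a22 - a12 a21
B = np.array([[A[1, 1], -A[0, 1]],
              [-A[1, 0], A[0, 0]]])
assert k != 0
assert np.allclose((1 / k) * B, np.linalg.inv(A))  # formula (5.1)
```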
For any nonsingular matrix A, the transpose A′ is nonsingular, and
(A′)⁻¹ = (A⁻¹)′. (5.4)
And in the special case of a nonsingular symmetric matrix A, the inverse A⁻¹ is symmetric, that is,
A⁻¹ = (A⁻¹)′. (5.5)
For any two N × N nonsingular matrices A and B,
rank(AB) = N, (5.8)
that is, AB is nonsingular, and
(AB)⁻¹ = B⁻¹A⁻¹. (5.9)
Results (5.8) and (5.9) can be easily verified by observing that ABB⁻¹A⁻¹ = I (so that B⁻¹A⁻¹ is a right inverse of AB) and applying Lemma 2.5.3. (If either or both of two N × N matrices A and B are singular, then their product AB is singular, as is evident from Corollary 2.4.17.) Repeated application of results (5.8) and (5.9) leads to the conclusion that, for any K nonsingular matrices A1, A2, …, AK of order N,
rank(A1A2 ⋯ AK) = N (5.10)
and
(A1A2 ⋯ AK)⁻¹ = AK⁻¹ ⋯ A2⁻¹A1⁻¹. (5.11)
b. Some results on the ranks and row and column spaces of matrix products
The following lemma gives some basic results on the effects of premultiplication or postmultiplication
by a matrix of full row or column rank.
Lemma 2.5.5. Let A represent an M × N matrix and B an N × P matrix. If A has full column rank, then
R(AB) = R(B) and rank(AB) = rank(B).
Similarly, if B has full row rank, then
C(AB) = C(A) and rank(AB) = rank(A).
Proof. It is clear from Corollary 2.4.4 that R(AB) ⊂ R(B) and C(AB) ⊂ C(A). If A has full column rank, then (according to Lemma 2.5.1) it has a left inverse L, implying that R(B) = R(LAB) ⊂ R(AB) and hence that R(AB) = R(B) [which implies, in turn, that rank(AB) = rank(B)]. Similarly, if B has full row rank, then it has a right inverse R, implying that C(A) = C(ABR) ⊂ C(AB) and hence that C(AB) = C(A) [and rank(AB) = rank(A)]. Q.E.D.
As an immediate consequence of Lemma 2.5.5, we have the following corollary.
Corollary 2.5.6. If A is an N × N nonsingular matrix, then, for any N × P matrix B, R(AB) = R(B) and rank(AB) = rank(B); similarly, if C is an N × N nonsingular matrix, then, for any P × N matrix B, C(BC) = C(B) and rank(BC) = rank(B).

2.6 Ranks and Inverses of Partitioned Matrices

Lemma 2.6.1. For any M × N matrix A and P × Q matrix B,
rank[diag(A, B)] = rank(A) + rank(B). (6.1)
Proof. Let R = rank A and S = rank B. And suppose that R > 0 and S > 0—if R or S equals 0 (in which case A = 0 or B = 0), equality (6.1) is clearly valid. Then, according to Theorem 2.4.18, there exist an M × R matrix A∗ and an R × N matrix E such that A = A∗E, and, similarly, there exist a P × S matrix B∗ and an S × Q matrix F such that B = B∗F. Moreover, rank A∗ = rank E = R and rank B∗ = rank F = S.
We have that
$$\begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix} = \begin{pmatrix} A_* & 0 \\ 0 & B_* \end{pmatrix}\begin{pmatrix} E & 0 \\ 0 & F \end{pmatrix}.$$
Further, the columns of diag(A∗, B∗) are linearly independent, as is evident upon observing that, for any R-dimensional column vector c and S-dimensional column vector d such that
$$\begin{pmatrix} A_* & 0 \\ 0 & B_* \end{pmatrix}\begin{pmatrix} c \\ d \end{pmatrix} = 0,$$
A∗c = 0 and B∗d = 0, implying (since the columns of A∗ are linearly independent) that c = 0 and likewise that d = 0. Similarly, the rows of diag(E, F) are linearly independent. Thus, diag(A∗, B∗) has full column rank and diag(E, F) has full row rank. And, recalling Lemma 2.5.5, we conclude that
$$\operatorname{rank}\begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix} = \operatorname{rank}\begin{pmatrix} E & 0 \\ 0 & F \end{pmatrix} = R + S. \quad \text{Q.E.D.}$$
Repeated application of result (6.1) gives the following formula for the rank of a block-diagonal matrix with diagonal blocks A1, A2, …, AK:
rank[diag(A1, A2, …, AK)] = rank(A1) + rank(A2) + ⋯ + rank(AK). (6.2)
Note that result (6.2) implies in particular that the rank of a diagonal matrix D equals the number of nonzero diagonal elements in D.
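Result (6.2) in numerical form (a sketch assuming numpy and scipy's block_diag helper; the blocks are contrived to have known ranks):

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(2)
A1 = np.outer(rng.standard_normal(3), rng.standard_normal(2))  # rank 1
A2 = rng.standard_normal((2, 2))                               # rank 2 (generically)
A3 = np.zeros((2, 3))                                          # rank 0
rank = np.linalg.matrix_rank
assert rank(block_diag(A1, A2, A3)) == rank(A1) + rank(A2) + rank(A3)
```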
Let T represent an M × M matrix and W an N × N matrix. Then, the (M + N) × (M + N) block-diagonal matrix diag(T, W) is nonsingular if and only if both T and W are nonsingular, as is evident from Lemma 2.6.1. Moreover, if both T and W are nonsingular, then
$$\begin{pmatrix} T & 0 \\ 0 & W \end{pmatrix}^{-1} = \begin{pmatrix} T^{-1} & 0 \\ 0 & W^{-1} \end{pmatrix}, \tag{6.3}$$
as is easily verified.
More generally, for any square matrices A1, A2, …, AK, the block-diagonal matrix diag(A1, A2, …, AK) is nonsingular if and only if A1, A2, …, AK are all nonsingular [as is evident from result (6.2)], in which case
[diag(A1, A2, …, AK)]⁻¹ = diag(A1⁻¹, A2⁻¹, …, AK⁻¹). (6.4)
As what is essentially a special case of this result, we have that an N × N diagonal matrix diag(d1, d2, …, dN) is nonsingular if and only if its diagonal elements d1, d2, …, dN are all nonzero, in which case
[diag(d1, d2, …, dN)]⁻¹ = diag(1/d1, 1/d2, …, 1/dN). (6.5)
Formula (6.1) for the rank of a block-diagonal matrix can be extended to certain block-triangular
matrices, as indicated by the following lemma.
Lemma 2.6.3. Let T represent an M × P matrix, V an N × P matrix, and W an N × Q matrix. If T has full column rank or W has full row rank, that is, if rank(T) = P or rank(W) = N, then
$$\operatorname{rank}\begin{pmatrix} T & 0 \\ V & W \end{pmatrix} = \operatorname{rank}\begin{pmatrix} W & V \\ 0 & T \end{pmatrix} = \operatorname{rank}(T) + \operatorname{rank}(W). \tag{6.6}$$
Proof. Suppose that rank(T) = P. Then, according to Lemma 2.5.1, there exists a matrix L that is a left inverse of T, in which case
$$\begin{pmatrix} I & -VL \\ 0 & I \end{pmatrix}\begin{pmatrix} W & V \\ 0 & T \end{pmatrix} = \begin{pmatrix} W & 0 \\ 0 & T \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} I & 0 \\ -VL & I \end{pmatrix}\begin{pmatrix} T & 0 \\ V & W \end{pmatrix} = \begin{pmatrix} T & 0 \\ 0 & W \end{pmatrix}.$$
Since (according to Lemma 2.6.2) $\begin{pmatrix} I & -VL \\ 0 & I \end{pmatrix}$ and $\begin{pmatrix} I & 0 \\ -VL & I \end{pmatrix}$ are nonsingular, we conclude (on the basis of Corollary 2.5.6) that
$$\operatorname{rank}\begin{pmatrix} W & V \\ 0 & T \end{pmatrix} = \operatorname{rank}\begin{pmatrix} W & 0 \\ 0 & T \end{pmatrix} \quad \text{and} \quad \operatorname{rank}\begin{pmatrix} T & 0 \\ V & W \end{pmatrix} = \operatorname{rank}\begin{pmatrix} T & 0 \\ 0 & W \end{pmatrix},$$
and hence [in light of result (6.1)] that each of these ranks equals rank(T) + rank(W). That result (6.6) holds if rank(W) = N can be established via an analogous argument. Q.E.D.
The results of Lemma 2.6.2 can be extended to additional block-triangular matrices, as detailed
in the following lemma.
Lemma 2.6.4. Let T represent an M × M matrix, V an N × M matrix, and W an N × N matrix.
(1) The (M + N) × (M + N) partitioned matrix $\begin{pmatrix} T & 0 \\ V & W \end{pmatrix}$ is nonsingular if and only if both T and W are nonsingular, in which case
$$\begin{pmatrix} T & 0 \\ V & W \end{pmatrix}^{-1} = \begin{pmatrix} T^{-1} & 0 \\ -W^{-1}VT^{-1} & W^{-1} \end{pmatrix}. \tag{6.7}$$
(2) The (M + N) × (M + N) partitioned matrix $\begin{pmatrix} W & V \\ 0 & T \end{pmatrix}$ is nonsingular if and only if both T and W are nonsingular, in which case
$$\begin{pmatrix} W & V \\ 0 & T \end{pmatrix}^{-1} = \begin{pmatrix} W^{-1} & -W^{-1}VT^{-1} \\ 0 & T^{-1} \end{pmatrix}. \tag{6.8}$$
Proof. Suppose that $\begin{pmatrix} T & 0 \\ V & W \end{pmatrix}$ is nonsingular (in which case its rows are linearly independent). Then, for any M-dimensional column vector c such that c′T = 0, we find that
$$(c', 0)\begin{pmatrix} T & 0 \\ V & W \end{pmatrix} = (c'T, 0) = 0$$
and hence that c = 0—if c were nonnull, the rows of $\begin{pmatrix} T & 0 \\ V & W \end{pmatrix}$ would be linearly dependent. Thus, the rows of T are linearly independent, which implies that T is nonsingular. That W is nonsingular can be established in similar fashion.
Conversely, suppose that both T and W are nonsingular. Then,
$$\begin{pmatrix} T & 0 \\ V & W \end{pmatrix} = \begin{pmatrix} I_M & 0 \\ 0 & W \end{pmatrix}\begin{pmatrix} I_M & 0 \\ W^{-1}VT^{-1} & I_N \end{pmatrix}\begin{pmatrix} T & 0 \\ 0 & I_N \end{pmatrix},$$
as is easily verified, and (in light of Lemmas 2.6.1 and 2.6.2 or Lemma 2.6.3) $\begin{pmatrix} I & 0 \\ 0 & W \end{pmatrix}$, $\begin{pmatrix} I & 0 \\ W^{-1}VT^{-1} & I \end{pmatrix}$, and $\begin{pmatrix} T & 0 \\ 0 & I \end{pmatrix}$ are nonsingular. Further, it follows from result (5.10) (and also from Lemma 2.6.3) that $\begin{pmatrix} T & 0 \\ V & W \end{pmatrix}$ is nonsingular and from results (5.11) and (6.3) and Lemma 2.6.2 that
$$\begin{pmatrix} T & 0 \\ V & W \end{pmatrix}^{-1} = \begin{pmatrix} T & 0 \\ 0 & I \end{pmatrix}^{-1}\begin{pmatrix} I & 0 \\ W^{-1}VT^{-1} & I \end{pmatrix}^{-1}\begin{pmatrix} I & 0 \\ 0 & W \end{pmatrix}^{-1} = \begin{pmatrix} T^{-1} & 0 \\ 0 & I \end{pmatrix}\begin{pmatrix} I & 0 \\ -W^{-1}VT^{-1} & I \end{pmatrix}\begin{pmatrix} I & 0 \\ 0 & W^{-1} \end{pmatrix} = \begin{pmatrix} T^{-1} & 0 \\ -W^{-1}VT^{-1} & W^{-1} \end{pmatrix}.$$
The proof of Part (1) is now complete. Part (2) can be proved via an analogous argument. Q.E.D.
The results of Lemma 2.6.4 can be extended (by repeated application) to block-triangular matrices having more than two rows and columns of blocks. Let A1, A2, …, AR represent square matrices, and take A to be an (upper or lower) block-triangular matrix whose diagonal blocks are respectively A1, A2, …, AR. Then, A is nonsingular if and only if A1, A2, …, AR are all nonsingular (and, as what is essentially a special case, a triangular matrix is nonsingular if and only if its diagonal elements are all nonzero). Further, A⁻¹ is block-triangular (lower block-triangular if A is lower block-triangular and upper block-triangular if A is upper block-triangular). The diagonal blocks of A⁻¹ are A1⁻¹, A2⁻¹, …, AR⁻¹, respectively, and the off-diagonal blocks of A⁻¹ are expressible in terms of recursive formulas given by, for example, Harville (1997, sec. 8.5).
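Formula (6.7) is easily confirmed numerically; a minimal sketch (numpy assumed; the diagonal shifts are just a convenient way to keep the diagonal blocks nonsingular):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 3, 2
T = rng.standard_normal((M, M)) + 5 * np.eye(M)
W = rng.standard_normal((N, N)) + 5 * np.eye(N)
V = rng.standard_normal((N, M))

A = np.block([[T, np.zeros((M, N))], [V, W]])
Ti, Wi = np.linalg.inv(T), np.linalg.inv(W)
A_inv = np.block([[Ti, np.zeros((M, N))], [-Wi @ V @ Ti, Wi]])  # formula (6.7)
assert np.allclose(A @ A_inv, np.eye(M + N))
```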
c. General case
The following theorem can (when applicable) be used to express the rank of a partitioned matrix in
terms of the rank of a matrix of smaller dimensions.
Theorem 2.6.5. Let T represent an M × M matrix, U an M × Q matrix, V an N × M matrix, and W an N × Q matrix. If rank(T) = M, that is, if T is nonsingular, then
$$\operatorname{rank}\begin{pmatrix} T & U \\ V & W \end{pmatrix} = \operatorname{rank}\begin{pmatrix} W & V \\ U & T \end{pmatrix} = M + \operatorname{rank}(W - VT^{-1}U). \tag{6.9}$$
Results (6.7) and (6.8) on the inverse of a block-triangular matrix can be extended to a more
general class of partitioned matrices, as described in the following theorem.
Theorem 2.6.6. Let T represent an M × M matrix, U an M × N matrix, V an N × M matrix, and W an N × N matrix. Suppose that T is nonsingular, and define
Q = W − VT⁻¹U.
Then, the partitioned matrix $\begin{pmatrix} T & U \\ V & W \end{pmatrix}$ is nonsingular if and only if Q is nonsingular, in which case
$$\begin{pmatrix} T & U \\ V & W \end{pmatrix}^{-1} = \begin{pmatrix} T^{-1} + T^{-1}UQ^{-1}VT^{-1} & -T^{-1}UQ^{-1} \\ -Q^{-1}VT^{-1} & Q^{-1} \end{pmatrix}. \tag{6.10}$$
Similarly, the partitioned matrix $\begin{pmatrix} W & V \\ U & T \end{pmatrix}$ is nonsingular if and only if Q is nonsingular, in which case
$$\begin{pmatrix} W & V \\ U & T \end{pmatrix}^{-1} = \begin{pmatrix} Q^{-1} & -Q^{-1}VT^{-1} \\ -T^{-1}UQ^{-1} & T^{-1} + T^{-1}UQ^{-1}VT^{-1} \end{pmatrix}. \tag{6.11}$$
Proof. That $\begin{pmatrix} T & U \\ V & W \end{pmatrix}$ is nonsingular if and only if Q is nonsingular and that $\begin{pmatrix} W & V \\ U & T \end{pmatrix}$ is nonsingular if and only if Q is nonsingular are immediate consequences of Theorem 2.6.5.
Suppose now that Q is nonsingular, and observe that
$$\begin{pmatrix} T & 0 \\ V & Q \end{pmatrix} = \begin{pmatrix} T & U \\ V & W \end{pmatrix}\begin{pmatrix} I & -T^{-1}U \\ 0 & I \end{pmatrix}. \tag{6.12}$$
Then, in light of Corollary 2.5.6 and Lemma 2.6.2, $\begin{pmatrix} T & 0 \\ V & Q \end{pmatrix}$, like $\begin{pmatrix} T & U \\ V & W \end{pmatrix}$, is nonsingular. Premultiplying both sides of equality (6.12) by $\begin{pmatrix} T & U \\ V & W \end{pmatrix}^{-1}$ and postmultiplying both sides by $\begin{pmatrix} T & 0 \\ V & Q \end{pmatrix}^{-1}$, we find that
$$\begin{pmatrix} T & U \\ V & W \end{pmatrix}^{-1} = \begin{pmatrix} I & -T^{-1}U \\ 0 & I \end{pmatrix}\begin{pmatrix} T & 0 \\ V & Q \end{pmatrix}^{-1},$$
which [upon applying formula (6.7) and multiplying out] leads to formula (6.10). Formula (6.11) can be derived in similar fashion. Q.E.D.
2.7 Orthogonal Matrices

An N × N matrix A is said to be orthogonal if
AA′ = A′A = I,
that is, if A is invertible and its inverse is its transpose. Letting a1, a2, …, aN represent the first, second, …, Nth columns of A, the condition A′A = I is equivalent to the condition that a_i′a_i = 1 for i = 1, 2, …, N and a_i′a_j = 0 for j ≠ i (as is evident upon observing that a_i′a_j equals the ijth element of A′A). Thus, a square matrix is orthogonal if and only if its columns form an orthonormal (with respect to the usual inner product) set of vectors. Similarly, a square matrix is orthogonal if and only if its rows form an orthonormal set of vectors.
Note that if A is an orthogonal matrix, then its transpose A′ is also orthogonal. Note also that in using the term orthogonal in connection with one or more matrices, say A1, A2, …, AK, care must be exercised to avoid confusion. Under a strict interpretation, saying that A1, A2, …, AK are orthogonal matrices has an entirely different meaning than saying that the set {A1, A2, …, AK}, whose members are A1, A2, …, AK, is an orthogonal set.
If P and Q are both N × N orthogonal matrices, then [in light of result (1.13)]
(PQ)′PQ = Q′P′PQ = Q′IQ = Q′Q = I.
Thus, the product of two N × N orthogonal matrices is another (N × N) orthogonal matrix. The repeated application of this result leads to the following lemma.
Lemma 2.7.1. If each of the matrices Q1, Q2, …, QK is an N × N orthogonal matrix, then the product Q1Q2 ⋯ QK is an (N × N) orthogonal matrix.
The rank and trace of a square matrix A such that A² = kA for some scalar k are related as follows: tr(A) = k rank(A).
Proof. Let us restrict attention to the case where A is nonnull. [The case where A = 0 is trivial—if A = 0, then tr(A) = 0 = k rank(A).]
Let N denote the order of A, and let R = rank(A). Then, according to Theorem 2.4.18, there exist an N × R matrix B and an R × N matrix T such that A = BT. Moreover, rank(B) = rank(T) = R, implying (in light of Lemma 2.5.1) the existence of a matrix L such that LB = I_R and a matrix H such that TH = I_R—L is a left inverse of B and H a right inverse of T.
We have that
BTBT = A² = kA = kBT = B(kI_R)T.
Thus,
TB = I_R TB I_R = LBTBTH = L(BTBT)H = LB(kI_R)TH = I_R(kI_R)I_R = kI_R,
and consequently tr(A) = tr(BT) = tr(TB) = tr(kI_R) = kR = k rank(A). Q.E.D.
By making use of Lemma 2.8.1 and Corollary 2.8.3, we find that for any N × N idempotent matrix A,
rank(I − A) = tr(I − A) = tr(I_N) − tr(A) = N − rank(A),
thereby establishing the following, additional result.
Lemma 2.8.4. For any N × N idempotent matrix A,
rank(I − A) = N − rank(A).
2.9 Linear Systems

Consider a set of M linear equations in N unknowns:
$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \dots + a_{1N}x_N &= b_1, \\ a_{21}x_1 + a_{22}x_2 + \dots + a_{2N}x_N &= b_2, \\ &\;\;\vdots \\ a_{M1}x_1 + a_{M2}x_2 + \dots + a_{MN}x_N &= b_M, \end{aligned}$$
where a11, a12, …, a1N, a21, a22, …, a2N, …, aM1, aM2, …, aMN and b1, b2, …, bM represent “fixed” scalars and x1, x2, …, xN are scalar-valued unknowns or variables. The left side of each of these equations is a linear combination of the unknowns x1, x2, …, xN. Collectively, these equations are called a system of linear equations (in unknowns x1, x2, …, xN) or simply a linear system (in x1, x2, …, xN).
The linear system can be rewritten in matrix form as
Ax = b, (9.1)
where A = {a_ij} is the M × N matrix whose ijth element is a_ij, b = {b_i} is the M-dimensional column vector whose ith element is b_i, and x = {x_j} is the N-dimensional column vector whose jth element is x_j. More generally, consider a collection of P such systems, each having the same coefficient matrix A; upon arranging the P right sides and the P vectors of unknowns into an M × P matrix B and an N × P matrix X, respectively, the P systems can be rewritten collectively as
AX = B. (9.3)
As in the special case (9.1), where P = 1, AX = B is called a linear system (in X), A and B are called the coefficient matrix and the right side, respectively, and any value of X that satisfies AX = B is called a solution.
In the special case AX = 0, where the right side B of linear system (9.3) is a null matrix, the linear system is said to be homogeneous. If B is nonnull, linear system (9.3) is said to be nonhomogeneous.
a. Consistency
A linear system is said to be consistent if it has one or more solutions. If no solution exists, the linear
system is said to be inconsistent.
Every homogeneous linear system is consistent—one solution to a homogeneous linear system
is the null matrix (of appropriate dimensions). A nonhomogeneous linear system may be either
consistent or inconsistent.
Let us determine the characteristics that distinguish the coefficient matrix and right side of a consistent linear system from those of an inconsistent linear system. In doing so, let us adopt an abbreviated notation for the column space of a partitioned matrix of the form (A, B) by writing C(A, B) for C[(A, B)]. And as a preliminary step, let us establish the following lemma.
Lemma 2.9.1. Let A represent an M × N matrix and B an M × P matrix. Then,
C(A, B) = C(A) if and only if C(B) ⊂ C(A), (9.4)
and, similarly,
rank(A, B) = rank(A) if and only if C(B) ⊂ C(A). (9.5)
Proof. We have that C(A) ⊂ C(A, B) and C(B) ⊂ C(A, B), as is evident from Corollary 2.4.4 upon observing that
$$A = (A, B)\begin{pmatrix} I \\ 0 \end{pmatrix} \quad \text{and} \quad B = (A, B)\begin{pmatrix} 0 \\ I \end{pmatrix}.$$
Now, suppose that C(B) ⊂ C(A). Then, according to Lemma 2.4.3, there exists a matrix F such that B = AF and hence such that (A, B) = A(I, F). Thus, C(A, B) ⊂ C(A), implying [since C(A) ⊂ C(A, B)] that C(A, B) = C(A).
Conversely, suppose that C(A, B) = C(A). Then, since C(B) ⊂ C(A, B), C(B) ⊂ C(A), and the proof of result (9.4) is complete.
To prove result (9.5), it suffices [having established result (9.4)] to show that
rank(A, B) = rank(A) if and only if C(A, B) = C(A).
That C(A, B) = C(A) ⇒ rank(A, B) = rank(A) is clear. And since C(A) ⊂ C(A, B), it follows from Theorem 2.4.16 that rank(A, B) = rank(A) ⇒ C(A, B) = C(A). Q.E.D.
We are now in a position to establish the following result on the consistency of a linear system.
Theorem 2.9.2. Each of the following conditions is necessary and sufficient for a linear system AX = B (in X) to be consistent:
(1) C(B) ⊂ C(A);
(2) every column of B belongs to C(A);
(3) C(A, B) = C(A);
(4) rank(A, B) = rank(A).
Proof. That Condition (1) is necessary and sufficient for the consistency of AX = B is an immediate consequence of Lemma 2.4.3. Further, it follows from Lemma 2.4.2 that Condition (2) is equivalent to Condition (1), and from Lemma 2.9.1 that each of Conditions (3) and (4) is equivalent to Condition (1). Thus, like Condition (1), each of Conditions (2) through (4) is necessary and sufficient for the consistency of AX = B. Q.E.D.
A sufficient (but in general not a necessary) condition for the consistency of a linear system is
given by the following theorem.
Theorem 2.9.3. If the coefficient matrix A of a linear system AX = B (in X) has full row rank, then AX = B is consistent.
Proof. Suppose that A has full row rank. Then, it follows from Lemma 2.5.1 that there exists a matrix R (a right inverse of A) such that AR = I and hence such that ARB = B. Thus, setting X = RB gives a solution to the linear system AX = B, and we conclude that AX = B is consistent. Q.E.D.
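Condition (4) of Theorem 2.9.2 translates directly into a numerical consistency test; a small sketch (numpy assumed, with a rank-1 coefficient matrix chosen so that one right side is consistent and the other is not):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])                  # rank 1
b_in = np.array([1.0, 2.0])                 # lies in C(A):  consistent
b_out = np.array([1.0, 0.0])                # outside C(A):  inconsistent

rank = np.linalg.matrix_rank
for b in (b_in, b_out):
    consistent = rank(np.column_stack([A, b])) == rank(A)   # condition (4)
    print(consistent)                       # True, then False
```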
b. Solution set
The collection of all solutions to a linear system AX = B (in X) is called the solution set of the linear system. Clearly, a linear system is consistent if and only if its solution set is nonempty.
Is the solution set of a linear system AX = B (in X) a linear space? The answer depends on whether the linear system is homogeneous or nonhomogeneous, that is, on whether the right side B is null or nonnull.
Consider first the solution set of a homogeneous linear system AX = 0. A homogeneous linear system is consistent, and hence its solution set is nonempty—its solution set includes the null matrix 0. Furthermore, if X1 and X2 are solutions to AX = 0 and k is a scalar, then A(X1 + X2) = AX1 + AX2 = 0 and A(kX1) = k(AX1) = 0, so that X1 + X2 and kX1 are solutions to AX = 0. Thus, the solution set of a homogeneous linear system is a linear space. Accordingly, the solution set of a homogeneous linear system AX = 0 may be called the solution space of AX = 0.
The solution space of a homogeneous linear system Ax = 0 (in a column vector x) is called the null space of the matrix A and is denoted by the symbol N(A). Thus, for any M × N matrix A,
N(A) = {x ∈ ℝ^N : Ax = 0}.
The solution set of a nonhomogeneous linear system is not a linear space (as can be easily seen
by, e.g., observing that the solution set does not contain the null matrix).
2.10 Generalized Inverses

A generalized inverse of an M × N matrix A is any N × M matrix G such that AGA = A.
Theorem 2.10.5. Let B represent an M × K matrix of full column rank and T a K × N matrix of full row rank. Then, B has a left inverse, say L, and T has a right inverse, say R; and RL is a generalized inverse of BT.
Proof. That B has a left inverse L and T a right inverse R is an immediate consequence of Lemma 2.5.1. Moreover,
BT(RL)BT = B(TR)(LB)T = BIIT = BT.
Thus, RL is a generalized inverse of BT. Q.E.D.
Now, consider an arbitrary M × N matrix A. If A = 0, then clearly any N × M matrix is a generalized inverse of A. If A ≠ 0, then (according to Theorem 2.4.18) there exist a matrix B of full column rank and a matrix T of full row rank such that A = BT, and hence (in light of Theorem 2.10.5) A has a generalized inverse. Thus, we arrive at the following conclusion.
Corollary 2.10.6. Every matrix has at least one generalized inverse.
The symbol A⁻ is used to denote an arbitrary generalized inverse of an M × N matrix A. By definition,
AA⁻A = A.
Note that
A′(A⁻)′A′ = (AA⁻A)′ = A′,
so that (A⁻)′ is a generalized inverse of the transpose A′.
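In numerical work, one readily available generalized inverse is the Moore–Penrose inverse computed by numpy's pinv; a quick sketch verifying AA⁻A = A and the transpose property above:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])             # a rank-1 matrix
G = np.linalg.pinv(A)                       # one particular generalized inverse of A
assert np.allclose(A @ G @ A, A)            # A G A = A
assert np.allclose(A.T @ G.T @ A.T, A.T)    # G' is a generalized inverse of A'
```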
Lemma 2.10.13. For any matrix A, C(AA⁻) = C(A), R(A⁻A) = R(A), and rank(AA⁻) = rank(A⁻A) = rank(A).
Proof. It follows from Corollary 2.4.4 that C(AA⁻) ⊂ C(A) and also, since A = AA⁻A = (AA⁻)A, that C(A) ⊂ C(AA⁻). Thus, C(AA⁻) = C(A). That R(A⁻A) = R(A) follows from an analogous argument. And since AA⁻ has the same column space as A and A⁻A the same row space as A, AA⁻ and A⁻A have the same rank as A. Q.E.D.
It follows from Corollary 2.4.17 and Lemma 2.10.13 that, for any matrix A,
rank(A) = rank(AA⁻) ≤ rank(A⁻).
Thus, we have the following lemma, which can be regarded as an extension of result (5.6).
Lemma 2.10.14. For any matrix A, rank(A⁻) ≥ rank(A).
The following lemma extends to generalized inverses some results on the inverses of products of
matrices—refer to results (5.9) and (5.11).
Lemma 2.10.15. Let B represent an M × N matrix and G an N × M matrix. Then, for any M × M nonsingular matrix A and N × N nonsingular matrix C, (1) G is a generalized inverse of AB if and only if G = HA⁻¹ for some generalized inverse H of B, (2) G is a generalized inverse of BC if and only if G = C⁻¹H for some generalized inverse H of B, and (3) G is a generalized inverse of ABC if and only if G = C⁻¹HA⁻¹ for some generalized inverse H of B.
Proof. Parts (1) and (2) are special cases of Part (3) (those where C = I and A = I, respectively). Thus, it suffices to prove Part (3).
By definition, G is a generalized inverse of ABC if and only if
ABCGABC = ABC. (10.4)
Upon premultiplying both sides of equality (10.4) by A⁻¹ and postmultiplying both sides by C⁻¹, we obtain the equivalent equality
BCGAB = B.
Thus, G is a generalized inverse of ABC if and only if CGA = H for some generalized inverse H of B or, equivalently, if and only if G = C⁻¹HA⁻¹ for some generalized inverse H of B. Q.E.D.
For any solution X to a homogeneous linear system AX = 0 (in X),
X = X − A⁻0 = X − A⁻AX = (I − A⁻A)X,
so that every solution to AX = 0 is expressible in the form (I − A⁻A)Y. In particular, any solution x to a homogeneous linear system Ax = 0 (in a column vector x) is expressible as
x = (I − A⁻A)y
for some column vector y. Thus, we have the following corollary of Theorem 2.11.3.
Corollary 2.11.4. For any matrix A,
N(A) = C(I − A⁻A).
How “large” is the solution space of a homogeneous linear system Ax = 0 (in an N-dimensional column vector x)? The answer is given by the following lemma.
the consistency of which was established earlier (in Subsection a). Taking A to be the coefficient matrix and b the right side of linear system (11.3), and choosing
$$A^- = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 3 \\ 0 & 0 & 0 \\ 1 & 0 & 2 \end{pmatrix},$$
we find that the members of the solution set of linear system (11.3) consist of all vectors of the general form (11.4).
A possibly nonhomogeneous linear system AX = B (in an N × P matrix X) has 0 solutions, 1 solution, or an infinite number of solutions. If AX = B is inconsistent, then, by definition, it has 0 solutions. If AX = B is consistent and A is of full column rank (i.e., of rank N), then I − A⁻A = 0 (as is evident from Lemma 2.10.3), and it follows from Theorem 2.11.7 that AX = B has 1 solution. If AX = B is consistent and rank(A) < N, then I − A⁻A ≠ 0 (since otherwise we would arrive at a contradiction of Lemma 2.5.1), and it follows from Theorem 2.11.7 that AX = B has an infinite number of solutions.
In the special case of a linear system with a nonsingular coefficient matrix, we obtain the follow-
ing, additional result (that can be regarded as a consequence of Theorems 2.9.3 and 2.11.7 or that
can be verified directly).
Theorem 2.11.8. If the coefficient matrix A of a linear system AX = B (in X) is nonsingular, then AX = B has a unique solution, and that solution equals A⁻¹B.
2.12 Projection Matrices

Let X represent an arbitrary matrix. By the very definition of a generalized inverse (X′X)⁻ of X′X,
X′X(X′X)⁻X′X = X′X.
And, upon applying Part (1) of Corollary 2.3.4 [with A = X, B = (X′X)⁻X′X, and C = I], we find that
X(X′X)⁻X′X = X. (12.1)
Similarly, applying Part (2) of Corollary 2.3.4 [with A = X, B = X′X(X′X)⁻, and C = I], we find that
X′X(X′X)⁻X′ = X′. (12.2)
In light of results (12.1) and (12.2), Corollary 2.4.4 implies that R(X) ⊂ R(X′X) and that C(X′) ⊂ C(X′X). Since Corollary 2.4.4 also implies that R(X′X) ⊂ R(X) and C(X′X) ⊂ C(X′), we conclude that
R(X′X) = R(X) and C(X′X) = C(X′).
Moreover, matrices having the same row space have the same rank, so that
rank(X′X) = rank(X).
Now, let P_X represent the (square) matrix X(X′X)⁻X′. Then, results (12.1) and (12.2) can be restated succinctly as
P_X X = X (12.3)
and
X′P_X = X′. (12.4)
For any generalized inverses G1 and G2 of X′X, we have [in light of result (12.1)] that
XG1X′X = X = XG2X′X.
And, upon applying Part (2) of Corollary 2.3.4 (with A = X, B = XG1, and C = XG2), we find that
XG1X′ = XG2X′.
Thus, P_X is invariant to the choice of the generalized inverse (X′X)⁻.
There is a stronger version of this invariance property. Consider the linear system X′XB = X′ (in B). A solution to this linear system can be obtained by taking B = (X′X)⁻X′ [as is evident from result (12.2)], and, for B = (X′X)⁻X′, P_X = XB. Moreover, for any two solutions B1 and B2 to X′XB = X′, we have that X′XB1 = X′ = X′XB2, and, upon applying Part (1) of Corollary 2.3.4 (with A = X, B = B1, and C = B2), we find that XB1 = XB2. Thus, P_X = XB for every solution to X′XB = X′.
Note that P_X′ = X[(X′X)⁻]′X′. According to Corollary 2.10.11, [(X′X)⁻]′, like (X′X)⁻ itself, is a generalized inverse of X′X, and, since P_X is invariant to the choice of the generalized inverse (X′X)⁻, it follows that P_X = X[(X′X)⁻]′X′. Thus, P_X′ = P_X; that is, P_X is symmetric. And P_X is idempotent, as is evident upon observing [in light of result (12.3)] that
P_X P_X = (P_X X)(X′X)⁻X′ = X(X′X)⁻X′ = P_X.
Subsequently, we continue to use (for any matrix X) the symbol P_X to represent the matrix X(X′X)⁻X′. For reasons that will eventually become apparent, a matrix of the general form P_X is referred to as a projection matrix.
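A numerical sketch of P_X and of its three basic properties (numpy assumed; pinv supplies the generalized inverse, and a dependent third column makes X′X singular on purpose):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((6, 3))
X[:, 2] = X[:, 0] + X[:, 1]                 # force X'X to be singular
P = X @ np.linalg.pinv(X.T @ X) @ X.T       # P_X = X (X'X)^- X'

assert np.allclose(P @ X, X)                # (12.3): P_X X = X
assert np.allclose(P, P.T)                  # P_X is symmetric
assert np.allclose(P @ P, P)                # P_X is idempotent
```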
Upon applying Corollary 2.4.5, we obtain the following generalization of Lemma 2.12.1.
Lemma 2.12.3. For any M × N matrix X and for any P × N matrix S and N × Q matrix T,
C(SX′X) = C(SX′) and rank(SX′X) = rank(SX′),
and
R(X′XT) = R(XT) and rank(X′XT) = rank(XT).
2.13 Quadratic Forms

Let x = {x_i} represent an N-dimensional column vector, and let A = {a_ij} represent an N × N matrix. A function of x that is expressible in the form x′Ax is called a quadratic form (in x). It is customary to refer to A as the matrix of the quadratic form x′Ax.
Let B = {b_ij} represent a second N × N matrix. Under what circumstances are the two quadratic forms x′Ax and x′Bx identically equal (i.e., equal for every value of x)? Clearly, a sufficient condition for them to be identically equal is that A = B. However, except in the special case where N = 1, A = B is not a necessary condition.
For purposes of establishing a necessary condition, suppose that x′Ax and x′Bx are identically equal. Then, setting x equal to the ith column of I_N, we find that
a_ii = x′Ax = x′Bx = b_ii (i = 1, 2, …, N). (13.1)
That is, the diagonal elements of A are the same as those of B. Consider now the off-diagonal elements of A and B. Setting x equal to the N-dimensional column vector whose ith and jth elements equal 1 and whose remaining elements equal 0, we find that
a_ii + a_ij + a_ji + a_jj = x′Ax = x′Bx = b_ii + b_ij + b_ji + b_jj (j ≠ i = 1, 2, …, N). (13.2)
Together, results (13.1) and (13.2) imply that
a_ii = b_ii and a_ij + a_ji = b_ij + b_ji (j ≠ i = 1, 2, …, N)
or, equivalently, that
A + A′ = B + B′. (13.3)
Thus, condition (13.3) is a necessary condition for x′Ax and x′Bx to be identically equal. It is also a sufficient condition. To see this, observe that (since a 1 × 1 matrix is symmetric) condition (13.3) implies that
x′Ax = (1/2)[x′Ax + (x′Ax)′] = (1/2)x′(A + A′)x = (1/2)x′(B + B′)x = (1/2)[x′Bx + (x′Bx)′] = x′Bx.
In summary, we have the following lemma.
Lemma 2.13.1. Let A = {a_ij} and B = {b_ij} represent arbitrary N × N matrices. The two quadratic forms x′Ax and x′Bx (in x) are identically equal if and only if, for j ≠ i = 1, 2, …, N, a_ii = b_ii and a_ij + a_ji = b_ij + b_ji or, equivalently, if and only if A + A′ = B + B′.
Note that Lemma 2.13.1 implies in particular that the quadratic form x′A′x (in x) is identically equal to the quadratic form x′Ax (in x). When B is symmetric, the condition A + A′ = B + B′ is equivalent to the condition B = (1/2)(A + A′), and when both A and B are symmetric, the condition A + A′ = B + B′ is equivalent to the condition A = B. Thus, we have the following two corollaries of Lemma 2.13.1.
Corollary 2.13.2. Corresponding to any quadratic form x′Ax (in x), there is a unique symmetric matrix B such that x′Bx = x′Ax for all x, namely, the matrix B = (1/2)(A + A′).
Corollary 2.13.3. For any pair of N × N symmetric matrices A and B, the two quadratic forms x′Ax and x′Bx (in x) are identically equal (i.e., x′Ax = x′Bx for all x) if and only if A = B.
As a special case of Corollary 2.13.3 (that where B = 0), we have the following additional corollary.
Corollary 2.13.4. Let A represent an N × N symmetric matrix. If x′Ax = 0 for every (N × 1) vector x, then A = 0.
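A numerical sketch of Lemma 2.13.1 and Corollary 2.13.2 (numpy assumed): the quadratic forms of a nonsymmetric A and of its symmetric part (1/2)(A + A′) agree for every x:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 4))              # not symmetric in general
B = 0.5 * (A + A.T)                          # the unique symmetric choice
for _ in range(5):
    x = rng.standard_normal(4)
    assert np.isclose(x @ A @ x, x @ B @ x)  # x'Ax = x'Bx for all x
```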
A diagonal element of a nonnegative definite matrix can equal 0 only if the other elements in the ith row and the ith column of that matrix satisfy the conditions set forth in the following lemma and corollary.
Lemma 2.13.17. Let A = {a_ij} represent an N × N nonnegative definite matrix. If a_ii = 0, then, for j = 1, 2, …, N, a_ij = −a_ji; that is, if the ith diagonal element of A equals 0, then (a_i1, a_i2, …, a_iN), which is the ith row of A, equals −(a_1i, a_2i, …, a_Ni), which is −1 times the transpose of the ith column of A (i = 1, 2, …, N).
Proof. Suppose that a_ii = 0, and take x = {x_k} to be an N-dimensional column vector such that x_i < −a_jj, x_j = a_ij + a_ji, and x_k = 0 for k other than k = i and k = j (where j ≠ i). Then,
x′Ax = a_ii x_i² + (a_ij + a_ji)x_i x_j + a_jj x_j²
= (a_ij + a_ji)²(x_i + a_jj)
≤ 0, (13.5)
with equality holding only if a_ij + a_ji = 0 or, equivalently, only if a_ij = −a_ji. Moreover, since A is nonnegative definite, x′Ax ≥ 0, which—together with inequality (13.5)—implies that x′Ax = 0. We conclude that a_ij = −a_ji. Q.E.D.
If the nonnegative definite matrix A in Lemma 2.13.17 is symmetric, then a_ij = −a_ji ⟺ 2a_ij = 0 ⟺ a_ij = 0. Thus, we have the following corollary.
Corollary 2.13.18. Let A = {a_ij} represent an N × N symmetric nonnegative definite matrix. If a_ii = 0, then, for j = 1, 2, …, N, a_ji = a_ij = 0; that is, if the ith diagonal element of A equals zero, then the ith column (a_1i, a_2i, …, a_Ni)′ of A and the ith row (a_i1, a_i2, …, a_iN) of A are null. Moreover, if all N diagonal elements a_11, a_22, …, a_NN of A equal 0, then A = 0.
to consider the case where a11 > 0 separately from that where a11 = 0—since A is nonnegative definite, a11 ≥ 0.
Case (1): a11 > 0. According to Lemma 2.13.19, there exists a unit upper triangular matrix U such that U′AU = diag(b11, B22) for some scalar b11 and some (N − 1) × (N − 1) matrix B22. Moreover, B22 is symmetric and nonnegative definite (as is evident upon observing that U′AU is symmetric and nonnegative definite). Thus, by supposition, there exists a unit upper triangular matrix Q∗ such that Q∗′B22Q∗ is a diagonal matrix. Take Q = U diag(1, Q∗). Then, Q is a unit upper triangular matrix, as is evident upon observing that it is the product of two unit upper triangular matrices—a product of 2 unit upper triangular matrices is itself a unit upper triangular matrix, as can be readily verified by, e.g., making use of result (2.5). And
Q′AQ = diag(1, Q∗)′ diag(b11, B22) diag(1, Q∗) = diag(b11, Q∗′B22Q∗),
which (like Q∗′B22Q∗) is a diagonal matrix.
Case (2): a11 = 0. The submatrix A22 is an (N − 1) × (N − 1) symmetric nonnegative definite matrix. Thus, by supposition, there exists a unit upper triangular matrix Q∗ such that Q∗′A22Q∗ is a diagonal matrix. Take Q = diag(1, Q∗). Then, Q is a unit upper triangular matrix. And, upon observing (in light of Corollary 2.13.18) that a11 = 0 ⟹ a = 0, we find that Q′AQ = diag(a11, Q∗′A22Q∗), which (like Q∗′A22Q∗) is a diagonal matrix. Q.E.D.
Observe (in light of the results of Section 2.6b) that a unit upper triangular matrix is nonsingular and that its inverse is a unit upper triangular matrix. And note (in connection with Theorem 2.13.20) that if Q is a unit upper triangular matrix such that Q′AQ = D for some diagonal matrix D, then
A = (Q⁻¹)′Q′AQQ⁻¹ = (Q⁻¹)′DQ⁻¹.
Thus, we have the following corollary.
Corollary 2.13.21. Corresponding to any N × N symmetric nonnegative definite matrix A, there exist a unit upper triangular matrix U and a diagonal matrix D such that A = U′DU.
Suppose that A is an N × N symmetric nonnegative definite matrix, and take U to be a unit upper triangular matrix and D a diagonal matrix such that A = U′DU (the existence of such matrices is established in Corollary 2.13.21). Further, let R = rank A, and denote by d1, d2, …, dN the diagonal elements of D. Then, d_i ≥ 0 (i = 1, 2, …, N), as is evident from Corollary 2.13.16. And A is expressible as A = T′T, where T = diag(√d1, √d2, …, √dN) U. Moreover, rank D = R, so that R of the diagonal elements of D are strictly positive and the others are equal to 0. Thus, as an additional corollary of Theorem 2.13.20, we have the following result.
Corollary 2.13.22. Let A represent an N × N symmetric nonnegative definite matrix. And let R = rank A. Then, there exists an upper triangular matrix T with R (strictly) positive diagonal elements and with N − R null rows such that
A = T′T. (13.6)
Let (for i, j = 1, 2, …, N) t_ij represent the ijth element of the upper triangular matrix T in decomposition (13.6) (and observe that t_ij = 0 for i > j). Equality (13.6) implies that
$$t_{11} = \sqrt{a_{11}}, \tag{13.7}$$
$$t_{1j} = \begin{cases} a_{1j}/t_{11}, & \text{if } t_{11} > 0, \\ 0, & \text{if } t_{11} = 0, \end{cases} \tag{13.8}$$
$$t_{ij} = \begin{cases} \bigl(a_{ij} - \sum_{k=1}^{i-1} t_{ki}t_{kj}\bigr)\big/t_{ii}, & \text{if } t_{ii} > 0, \\ 0, & \text{if } t_{ii} = 0 \end{cases} \tag{13.9}$$
(i = 2, 3, …, j − 1), and
$$t_{jj} = \Bigl(a_{jj} - \sum_{k=1}^{j-1} t_{kj}^2\Bigr)^{1/2} \tag{13.10}$$
(j = 2, 3, …, N).
It follows from equalities (13.7), (13.8), (13.9), and (13.10) that decomposition (13.6) is unique.
This decomposition is known as the Cholesky decomposition. It can be computed by a method that is
sometimes referred to as the square root method. In the square root method, formulas (13.7), (13.8),
(13.9), and (13.10) are used to construct the matrix T in N steps, one row or one column per step.
Refer, for instance, to Harville (1997, sec. 14.5) for more details and for an illustrative example.
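As a sketch of the square root method, the following function (numpy assumed; the name cholesky_sqrt_method is an invention for this example) builds T one column per step from formulas (13.7)–(13.10), with zero pivots handled as in (13.8) and (13.9):

```python
import numpy as np

def cholesky_sqrt_method(A):
    """Upper triangular T with A = T'T, for A symmetric nonnegative definite,
    computed via formulas (13.7)-(13.10); a zero pivot t_ii yields t_ij = 0."""
    N = A.shape[0]
    T = np.zeros((N, N))
    for j in range(N):
        for i in range(j):
            if T[i, i] > 0:
                T[i, j] = (A[i, j] - T[:i, i] @ T[:i, j]) / T[i, i]
        T[j, j] = np.sqrt(max(A[j, j] - T[:j, j] @ T[:j, j], 0.0))
    return T

A = np.array([[4.0, 2.0, 2.0],
              [2.0, 2.0, 1.0],
              [2.0, 1.0, 2.0]])
T = cholesky_sqrt_method(A)
assert np.allclose(T.T @ T, A)              # recovers decomposition (13.6)
```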
As a variation on decomposition (13.6), we have the decomposition
A = T∗′T∗, (13.11)
where T∗ is the R × N matrix whose rows are the nonnull rows of the upper triangular matrix T. Among the implications of result (13.11) is the following result, which can be regarded as an additional corollary of Theorem 2.13.20.
Corollary 2.13.23. Corresponding to any N × N (nonnull) symmetric nonnegative definite matrix A of rank R, there exists an R × N matrix P such that A = P′P (and any such R × N matrix P is of full row rank R).
The following corollary can be regarded as a generalization of Corollary 2.13.23.
Corollary 2.13.24. Let A represent an N × N matrix of rank R, and take M to be any positive integer greater than or equal to R. If A is symmetric and nonnegative definite, then there exists an M × N matrix P such that A = P′P (and any such M × N matrix P is of rank R).
Proof. Suppose that A is symmetric and nonnegative definite. And assume that R > 0—when R = 0, A = 0, in which case A = P′P for the M × N null matrix P = 0. According to Corollary 2.13.23, there exists an R × N matrix P1 such that A = P1′P1. Take P to be the M × N matrix of the form
$$P = \begin{pmatrix} P_1 \\ 0 \end{pmatrix}.$$
Then, clearly, A = P′P. Moreover, it is clear from Lemma 2.12.1 that rank(P) = R for any M × N matrix P such that A = P′P. Q.E.D.
In light of Corollary 2.13.15, Corollary 2.13.24 has the following implication.
Corollary 2.13.25. An N × N matrix A is a symmetric nonnegative definite matrix if and only if there exists a matrix P (having N columns) such that A = P′P.
Further results on symmetric nonnegative definite matrices are given by the following three
corollaries.
Corollary 2.13.26. Let A represent an N × N nonnegative definite matrix, and let R = rank(A + A′). Then, assuming that R > 0, there exists an R × N matrix P (of full row rank R) such that the quadratic form x′Ax (in an N-dimensional vector x) is expressible as the sum Σ_{i=1}^R y_i² of the squares of the elements y1, y2, …, yR of the R-dimensional vector y = Px.
Proof. According to Corollary 2.13.2, there is a unique symmetric matrix B such that x′Ax = x′Bx for all x, namely, the matrix B = (1/2)(A + A′). Moreover, B is nonnegative definite, and (assuming that R > 0) it follows from Corollary 2.13.23 that there exists an R × N matrix P (of full row rank R) such that B = P′P. Thus, letting y1, y2, …, yR represent the elements of the R-dimensional vector y = Px, we find that
$$x'Ax = x'P'Px = (Px)'Px = \sum_{i=1}^{R} y_i^2. \quad \text{Q.E.D.}$$
Corollary 2.13.27. For any N × M matrix X and any N × N symmetric nonnegative definite matrix A, AX = 0 if and only if X′AX = 0.
Proof. According to Corollary 2.13.25, there exists a matrix P such that A = P′P and hence such that X′AX = (PX)′PX. Thus, if X′AX = 0, then (in light of Corollary 2.3.3) PX = 0, implying that AX = P′PX = P′0 = 0. That X′AX = 0 if AX = 0 is obvious. Q.E.D.
Corollary 2.13.28. A symmetric nonnegative definite matrix is positive definite if and only if it is nonsingular (or, equivalently, is positive semidefinite if and only if it is singular).
Proof. Let A represent an N × N symmetric nonnegative definite matrix. If A is positive definite, then we have, as an immediate consequence of Lemma 2.13.9, that A is nonsingular.
Suppose now that the symmetric nonnegative definite matrix A is nonsingular, and consider the quadratic form x′Ax (in x). If x′Ax = 0, then, according to Corollary 2.13.27, Ax = 0, and consequently x = A⁻¹Ax = A⁻¹0 = 0. Thus, the quadratic form x′Ax is positive definite and hence the matrix A is positive definite. Q.E.D.
When specialized to positive definite matrices, Corollary 2.13.23, in combination with Lemma 2.13.9 and Corollary 2.13.15, yields the following result.
Corollary 2.13.29. An N × N matrix A is a symmetric positive definite matrix if and only if there exists a nonsingular matrix P such that A = P′P.
In the special case of a positive definite matrix, Corollary 2.13.26 can (in light of Corollary 2.13.2 and Lemma 2.13.9) be restated as follows—Corollary 2.13.2 implies that if A is positive definite, then so is (1/2)(A + A′).
Corollary 2.13.30. Let A represent an N × N positive definite matrix. Then, there exists a nonsingular matrix P such that the quadratic form x′Ax (in an N-dimensional vector x) is expressible as the sum Σ_{i=1}^N y_i² of the squares of the elements y1, y2, …, yN of the transformed vector y = Px.
For an N × N block-diagonal matrix A = diag(A1, A2, …, AK) and a conformally partitioned vector x = (x1′, x2′, …, xK′)′,
x′Ax = x1′A1x1 + x2′A2x2 + ⋯ + xK′AKxK.
If A is positive semidefinite, it follows that not all of the matrices A1, A2, …, AK are positive definite and hence (since they are nonnegative definite) that at least one of them is positive semidefinite. Q.E.D.
The following theorem relates the positive definiteness or semidefiniteness of a symmetric matrix
to the positive definiteness or semidefiniteness of the Schur complement of a positive definite principal
submatrix.
Theorem 2.13.32. Let T represent an M × M symmetric matrix, W an N × N symmetric matrix, and U an M × N matrix. Suppose that T is positive definite, and define
Q = W − U′T⁻¹U.
Then, the partitioned (symmetric) matrix $\begin{pmatrix} T & U \\ U' & W \end{pmatrix}$ is positive definite if and only if Q is positive definite and is positive semidefinite if and only if Q is positive semidefinite. Similarly, the partitioned (symmetric) matrix $\begin{pmatrix} W & U' \\ U & T \end{pmatrix}$ is positive definite if and only if Q is positive definite and is positive semidefinite if and only if Q is positive semidefinite.
Proof. Let
$$A = \begin{pmatrix} T & U \\ U' & W \end{pmatrix}, \quad \text{and define} \quad P = \begin{pmatrix} I_M & -T^{-1}U \\ 0 & I_N \end{pmatrix}.$$
According to Lemma 2.6.2, P is nonsingular. Moreover,
P′AP = diag(T, Q),
as is easily verified. And, upon observing that
(P⁻¹)′ diag(T, Q) P⁻¹ = A,
it follows from Corollary 2.13.11 that A is positive definite if and only if diag(T, Q) is positive definite and is positive semidefinite if and only if diag(T, Q) is positive semidefinite. In light of Lemma 2.13.31, we conclude that A is positive definite if and only if Q is positive definite and is positive semidefinite if and only if Q is positive semidefinite, thereby completing the proof of the first part of Theorem 2.13.32. The second part of Theorem 2.13.32 can be proved in similar fashion. Q.E.D.
As a corollary of Theorem 2.13.32, we have the following result on the positive definiteness of a symmetric matrix.
Corollary 2.13.33. Suppose that a symmetric matrix A is partitioned as
$$A = \begin{pmatrix} T & U \\ U' & W \end{pmatrix}$$
(where T and W are square). Then, A is positive definite if and only if T and the Schur complement W − U′T⁻¹U of T are both positive definite. Similarly, A is positive definite if and only if W and the Schur complement T − UW⁻¹U′ of W are both positive definite.
Proof. If T is positive definite (in which case T is nonsingular) and W − U′T⁻¹U is positive definite, then it follows from the first part of Theorem 2.13.32 that A is positive definite. Similarly, if W is positive definite (in which case W is nonsingular) and T − UW⁻¹U′ is positive definite, then it follows from the second part of Theorem 2.13.32 that A is positive definite.
Conversely, suppose that A is positive definite. Then, it follows from Corollary 2.13.13 that T is positive definite and also that W is positive definite. And, based on the first part of Theorem 2.13.32, we conclude that W − U′T⁻¹U is positive definite; similarly, based on the second part of that theorem, we conclude that T − UW⁻¹U′ is positive definite. Q.E.D.
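A numerical companion to Corollary 2.13.33 (numpy assumed; eigenvalues serve as the positive-definiteness test, and the example matrix is constructed to be symmetric positive definite):

```python
import numpy as np

rng = np.random.default_rng(7)
M, N = 3, 2
R = rng.standard_normal((M + N, M + N))
A = R.T @ R + 0.1 * np.eye(M + N)             # symmetric positive definite
T, U, W = A[:M, :M], A[:M, M:], A[M:, M:]

def is_pd(S):
    return np.all(np.linalg.eigvalsh(S) > 0)  # all eigenvalues positive

schur = W - U.T @ np.linalg.inv(T) @ U        # Schur complement of T
assert is_pd(A) and is_pd(T) and is_pd(schur)
```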
2.14 Determinants
Associated with any square matrix is a scalar that is known as the determinant of the matrix. As a
preliminary to defining the determinant, it is convenient to introduce a convention for classifying
various pairs of matrix elements as either positive or negative.
Let A = {a_ij} represent an arbitrary N × N matrix. Consider any pair of elements of A that do not lie either in the same row or the same column, say a_ij and a_{i′j′} (where i′ ≠ i and j′ ≠ j). The pair is said to be a negative pair if one of the elements is located above and to the right of the other, or, equivalently, if either i′ > i and j′ < j or i′ < i and j′ > j. Otherwise (if one of the elements is located above and to the left of the other, or, equivalently, if either i′ > i and j′ > j or i′ < i and j′ < j), the pair is said to be a positive pair. Thus, the pair a_ij and a_{i′j′} is classified as positive or negative in accordance with the following two-way table:

           i′ > i    i′ < i
  j′ > j     +         −
  j′ < j     −         +

For example (supposing that N ≥ 4), the pair a_34 and a_22 is positive, while the pair a_34 and a_41 is negative. Note that whether the pair a_ij and a_{i′j′} is positive or negative is completely determined by the relative locations of a_ij and a_{i′j′} and has nothing to do with whether a_ij and a_{i′j′} are positive or negative numbers.
Now, consider N elements of A, no two of which lie either in the same row or the same column, say the i1j1, i2j2, …, iNjNth elements (where both i1, i2, …, iN and j1, j2, …, jN are permutations of the first N positive integers). A total of N(N − 1)/2 pairs can be formed from these N elements. The determinant of A, denoted by |A| (or by det A), is defined by
$$|A| = \sum (-1)^{\phi_N(1,j_1;\,2,j_2;\,\dots;\,N,j_N)} a_{1j_1}a_{2j_2}\cdots a_{Nj_N} \tag{14.2}$$
[where φ_N(1, j1; 2, j2; …; N, jN) denotes the number of negative pairs among the pairs that can be formed from the 1j1, 2j2, …, NjNth elements] or, equivalently, by
$$|A| = \sum (-1)^{\phi_N(j_1,j_2,\dots,j_N)} a_{1j_1}a_{2j_2}\cdots a_{Nj_N}, \tag{14.2'}$$
where $j_1, j_2, \dots, j_N$ is a permutation of the first $N$ positive integers and the summation is over all such permutations.
Thus, the determinant of an $N \times N$ matrix $A$ can (at least in principle) be obtained via the following process:
(1) Form all possible products, each of $N$ factors, that can be obtained by picking one and only one element from each row and column of $A$.
(2) In each product, count the number of negative pairs among the $\binom{N}{2}$ pairs of elements that can be generated from the $N$ elements that contribute to this particular product. If the number of negative pairs is an even number, attach a plus sign to the product; if it is an odd number, attach a minus sign.
(3) Sum the signed products (as illustrated in the sketch below).
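The following Python sketch (ours, not from the text; the sample matrix is an arbitrary illustrative choice) renders this three-step process literally, counting negative pairs for each product; it is exponential in $N$ and is meant only to illuminate the definition:

    import numpy as np
    from itertools import permutations

    def det_by_definition(A):
        # Sum of signed products over all ways of picking one element from
        # each row and column; the sign is (-1)^(number of negative pairs).
        N = A.shape[0]
        total = 0.0
        for js in permutations(range(N)):
            # A pair of chosen elements in rows i < k is negative iff js[k] < js[i].
            neg = sum(1 for i in range(N) for k in range(i + 1, N) if js[k] < js[i])
            prod = 1.0
            for i in range(N):
                prod *= A[i, js[i]]
            total += (-1) ** neg * prod
        return total

    A = np.array([[1.0, 2.0, 3.0],
                  [0.0, 4.0, 5.0],
                  [1.0, 0.0, 6.0]])
    assert np.isclose(det_by_definition(A), np.linalg.det(A))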
In particular, the determinant of a $1 \times 1$ matrix $A = (a_{11})$ is
$$|A| = a_{11}; \tag{14.3}$$
the determinant of a $2 \times 2$ matrix $A = \{a_{ij}\}$ is
$$|A| = (-1)^0a_{11}a_{22} + (-1)^1a_{12}a_{21} = a_{11}a_{22} - a_{12}a_{21}; \tag{14.4}$$
and the determinant of a $3 \times 3$ matrix $A = \{a_{ij}\}$ is
$$|A| = (-1)^0a_{11}a_{22}a_{33} + (-1)^1a_{11}a_{23}a_{32} + (-1)^1a_{12}a_{21}a_{33} + (-1)^2a_{12}a_{23}a_{31} + (-1)^2a_{13}a_{21}a_{32} + (-1)^3a_{13}a_{22}a_{31}$$
$$= a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{21}a_{32} - a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33} - a_{13}a_{22}a_{31}. \tag{14.5}$$
An alternative definition of the determinant of an $N \times N$ matrix $A$ is
$$|A| = \sum (-1)^{\phi_N(i_1, 1;\, i_2, 2;\, \dots;\, i_N, N)}\, a_{i_11}a_{i_22}\cdots a_{i_NN} \tag{14.6}$$
$$= \sum (-1)^{\phi_N(i_1, i_2, \dots, i_N)}\, a_{i_11}a_{i_22}\cdots a_{i_NN}, \tag{14.6'}$$
where $i_1, i_2, \dots, i_N$ is a permutation of the first $N$ positive integers and the summation is over all such permutations.
Definition (14.6) is equivalent to definition (14.2). To see this, observe that the product $a_{1j_1}a_{2j_2}\cdots a_{Nj_N}$, which appears in definition (14.2), can be reexpressed by permuting the $N$ factors $a_{1j_1}, a_{2j_2}, \dots, a_{Nj_N}$ so that they are ordered by column number, giving
$$a_{1j_1}a_{2j_2}\cdots a_{Nj_N} = a_{i_11}a_{i_22}\cdots a_{i_NN},$$
where $i_1, i_2, \dots, i_N$ is a permutation of the first $N$ positive integers that is defined uniquely by
$$j_{i_1} = 1,\; j_{i_2} = 2,\; \dots,\; j_{i_N} = N.$$
Further,
$$\phi_N(1, j_1;\, 2, j_2;\, \dots;\, N, j_N) = \phi_N(i_1, j_{i_1};\, i_2, j_{i_2};\, \dots;\, i_N, j_{i_N}) = \phi_N(i_1, 1;\, i_2, 2;\, \dots;\, i_N, N),$$
so that
$$(-1)^{\phi_N(1, j_1;\, 2, j_2;\, \dots;\, N, j_N)}\, a_{1j_1}a_{2j_2}\cdots a_{Nj_N} = (-1)^{\phi_N(i_1, 1;\, i_2, 2;\, \dots;\, i_N, N)}\, a_{i_11}a_{i_22}\cdots a_{i_NN}.$$
Thus, we can establish a one-to-one correspondence between the terms of the sum (14.6) and the terms of the sum (14.2) such that the corresponding terms are equal. We conclude that the two sums are themselves equal.
In considering the determinant of a partitioned matrix, say $\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$, it is customary to abbreviate $\left|\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}\right|$ to $\begin{vmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{vmatrix}$.
Lemma 2.14.1. For an $N \times N$ (upper or lower) triangular matrix $A = \{a_{ij}\}$,
$$|A| = a_{11}a_{22}\cdots a_{NN}; \tag{14.7}$$
that is, the determinant of a triangular matrix equals the product of its diagonal elements.
Proof. Consider the case where $A$ is lower triangular, that is, of the form
$$A = \begin{pmatrix} a_{11} & 0 & \dots & 0 \\ a_{21} & a_{22} & & 0 \\ \vdots & \vdots & \ddots & \\ a_{N1} & a_{N2} & \dots & a_{NN} \end{pmatrix}.$$
That $|A| = a_{11}a_{22}\cdots a_{NN}$ is clear upon observing that the only term in the sum (14.2) or (14.2') that can be nonzero is that corresponding to the permutation $j_1 = 1, j_2 = 2, \dots, j_N = N$ [and upon observing that $\phi_N(1, 2, \dots, N) = 0$]. (To verify formally that this is the only term that can be nonzero, let $j_1, j_2, \dots, j_N$ represent an arbitrary permutation of the first $N$ positive integers, and suppose that $a_{1j_1}a_{2j_2}\cdots a_{Nj_N} \ne 0$ or, equivalently, that $a_{ij_i} \ne 0$ for $i = 1, 2, \dots, N$. Then, it is clear that $j_1 = 1$ and that if $j_1 = 1, j_2 = 2, \dots, j_{i-1} = i-1$, then $j_i = i$. We conclude, on the basis of mathematical induction, that $j_1 = 1, j_2 = 2, \dots, j_N = N$.)
The validity of formula (14.7) as applied to an upper triangular matrix follows from a similar argument. Q.E.D.
Note that Lemma 2.14.1 implies in particular that the determinant of a unit (upper or lower) triangular matrix equals 1. And, as a further implication of Lemma 2.14.1, we have the following corollary.
Corollary 2.14.2. The determinant of a diagonal matrix equals the product of its diagonal elements.
As obvious special cases of Corollary 2.14.2, we have that
$$|0| = 0, \tag{14.8}$$
$$|I| = 1. \tag{14.9}$$
Lemma 2.14.3. For any $N \times N$ matrix $A$,
$$|A'| = |A|. \tag{14.10}$$
Proof. Let $a_{ij}$ and $b_{ij}$ represent the $ij$th elements of $A$ and $A'$, respectively ($i, j = 1, 2, \dots, N$). Then, in light of the equivalence of definitions (14.2') and (14.6'),
$$|A'| = \sum (-1)^{\phi_N(j_1, j_2, \dots, j_N)}\, b_{1j_1}b_{2j_2}\cdots b_{Nj_N} = \sum (-1)^{\phi_N(j_1, j_2, \dots, j_N)}\, a_{j_11}a_{j_22}\cdots a_{j_NN} = |A|,$$
where $j_1, j_2, \dots, j_N$ is a permutation of the first $N$ positive integers and the summations are over all such permutations. Q.E.D.
As an immediate consequence of the definition of a determinant, we have the following lemma.
Lemma 2.14.4. If an $N \times N$ matrix $B$ is formed from an $N \times N$ matrix $A$ by multiplying all of the elements of one row or one column of $A$ by the same scalar $k$ (and leaving the elements of the other $N-1$ rows or columns unchanged), then
$$|B| = k|A|.$$
As a corollary of Lemma 2.14.4, we obtain the following result on the determinant of a matrix having a null row or a null column.
Corollary 2.14.5. If one or more rows (or columns) of an $N \times N$ matrix $A$ are null, then $|A| = 0$.
Proof. Suppose that the $i$th row of $A$ is null, and let $B$ represent an $N \times N$ matrix formed from $A$ by multiplying every element of the $i$th row of $A$ by 0. Then, $A = B$, and we find that $|A| = |B| = 0\cdot|A| = 0$. Q.E.D.
The following corollary (of Lemma 2.14.4) relates the determinant of a scalar multiple of a matrix $A$ to that of $A$ itself.
Corollary 2.14.6. For any $N \times N$ matrix $A$ and any scalar $k$,
$$|kA| = k^N|A|. \tag{14.11}$$
Proof. This result follows from Lemma 2.14.4 upon observing that $kA$ can be formed from $A$ by successively multiplying the $N$ rows of $A$ by $k$. Q.E.D.
As a special case of Corollary 2.14.6, we have the following, additional corollary.
Corollary 2.14.7. For any $N \times N$ matrix $A$,
$$|-A| = (-1)^N|A|. \tag{14.12}$$
The following two theorems describe how the determinant of a matrix is affected by permuting its rows or columns in certain ways.
Theorem 2.14.8. If an $N \times N$ matrix $B = \{b_{ij}\}$ is formed from an $N \times N$ matrix $A = \{a_{ij}\}$ by interchanging two rows or two columns of $A$, then
$$|B| = -|A|.$$
Proof. Consider first the case where $B$ is formed from $A$ by interchanging two adjacent rows, say the $i$th and $(i+1)$th rows. Then,
$$|B| = \sum (-1)^{\phi_N(j_1, j_2, \dots, j_N)}\, b_{1j_1}b_{2j_2}\cdots b_{i-1,j_{i-1}}b_{ij_i}b_{i+1,j_{i+1}}b_{i+2,j_{i+2}}\cdots b_{Nj_N}$$
$$= \sum (-1)^{\phi_N(j_1, j_2, \dots, j_N)}\, a_{1j_1}a_{2j_2}\cdots a_{i-1,j_{i-1}}a_{i+1,j_i}a_{i,j_{i+1}}a_{i+2,j_{i+2}}\cdots a_{Nj_N}$$
$$= -\sum (-1)^{\phi_N(j_1, j_2, \dots, j_{i-1}, j_{i+1}, j_i, j_{i+2}, \dots, j_N)}\, a_{1j_1}a_{2j_2}\cdots a_{i-1,j_{i-1}}a_{i,j_{i+1}}a_{i+1,j_i}a_{i+2,j_{i+2}}\cdots a_{Nj_N}$$
[since $\phi_N(j_1, j_2, \dots, j_{i-1}, j_{i+1}, j_i, j_{i+2}, \dots, j_N)$ equals $\phi_N(j_1, j_2, \dots, j_N) + 1$ if $j_{i+1} > j_i$ and equals $\phi_N(j_1, j_2, \dots, j_N) - 1$ if $j_{i+1} < j_i$]
$$= -|A|,$$
Lemma 2.14.11. If one row (or column) of a (square) matrix $A$ is a scalar multiple of another row (or column), then $|A| = 0$.
Proof. Suppose that the $s$th row $a_s'$ of $A$ is a scalar multiple $k\,a_i'$ of the $i$th row $a_i'$ of $A$ (where $i \ne s$), and let $B$ represent a matrix formed from $A$ by multiplying the $i$th row of $A$ by the scalar $k$. Then, according to Lemmas 2.14.4 and 2.14.10,
$$k|A| = |B| = 0. \tag{14.15}$$
If $k \ne 0$, then it follows from equality (14.15) that $|A| = 0$. If $k = 0$, then $a_s' = 0$, and it follows from Corollary 2.14.5 that $|A| = 0$. Thus, in either case, $|A| = 0$.
An analogous argument shows that if one column of a (square) matrix is a scalar multiple of another, then again the determinant of the matrix equals zero. Q.E.D.
The transposition of a (square) matrix does not (according to Lemma 2.14.3) affect its determinant. Other operations that do not affect the determinant of a matrix are described in the following two theorems.
Theorem 2.14.12. Let $B$ represent a matrix formed from an $N \times N$ matrix $A$ by adding, to any one row or column of $A$, scalar multiples of one or more other rows or columns. Then, $|B| = |A|$.
Proof. Let $a_i' = (a_{i1}, a_{i2}, \dots, a_{iN})$ and $b_i' = (b_{i1}, b_{i2}, \dots, b_{iN})$ represent the $i$th rows of $A$ and $B$, respectively ($i = 1, 2, \dots, N$). And suppose that for some integer $s$ ($1 \le s \le N$) and some scalars $k_1, k_2, \dots, k_{s-1}, k_{s+1}, \dots, k_N$,
$$b_s' = a_s' + \sum_{i \ne s} k_ia_i' \quad\text{and}\quad b_i' = a_i' \;(i \ne s).$$
Then,
$$|B| = \sum (-1)^{\phi_N(j_1, j_2, \dots, j_N)}\, b_{1j_1}b_{2j_2}\cdots b_{Nj_N}$$
$$= \sum (-1)^{\phi_N(j_1, j_2, \dots, j_N)}\, a_{1j_1}a_{2j_2}\cdots a_{s-1,j_{s-1}}\Bigl(a_{sj_s} + \sum_{i \ne s} k_ia_{ij_s}\Bigr)a_{s+1,j_{s+1}}\cdots a_{Nj_N}$$
$$= |A| + \sum_{i \ne s}\sum (-1)^{\phi_N(j_1, j_2, \dots, j_N)}\, a_{1j_1}a_{2j_2}\cdots a_{s-1,j_{s-1}}(k_ia_{ij_s})a_{s+1,j_{s+1}}\cdots a_{Nj_N}$$
$$= |A| + \sum_{i \ne s} |B_i|,$$
where $B_i$ is a matrix formed from $A$ by replacing the $s$th row of $A$ with $k_ia_i'$ and where $j_1, j_2, \dots, j_N$ is a permutation of the first $N$ positive integers and the (unlabeled) summations are over all such permutations. Since (according to Lemma 2.14.11) $|B_i| = 0$ ($i \ne s$), we conclude that $|B| = |A|$.
An analogous argument shows that $|B| = |A|$ when $B$ is formed from $A$ by adding, to a column of $A$, scalar multiples of other columns. Q.E.D.
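A quick numerical illustration (ours, with an arbitrary random matrix) of Theorems 2.14.8 and 2.14.12: interchanging two rows reverses the sign of the determinant, while adding a multiple of one row to another leaves it unchanged:

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 4))

    B = A.copy()
    B[[0, 2]] = B[[2, 0]]          # interchange rows 1 and 3
    C = A.copy()
    C[1] += 2.5 * C[3]             # add a multiple of row 4 to row 2

    assert np.isclose(np.linalg.det(B), -np.linalg.det(A))   # Theorem 2.14.8
    assert np.isclose(np.linalg.det(C), np.linalg.det(A))    # Theorem 2.14.12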
Theorem 2.14.13. For any $N \times N$ matrix $A$ and any $N \times N$ unit (upper or lower) triangular matrix $T$,
$$|AT| = |TA| = |A|. \tag{14.16}$$
Proof. Consider the case where $A$ is postmultiplied by $T$ and $T$ is unit lower triangular. Define $T_i$ to be a matrix formed from $I_N$ by replacing the $i$th column of $I_N$ with the $i$th column of $T$ ($i = 1, 2, \dots, N$). Then, $T = T_1T_2\cdots T_N$ (as is easily verified), and consequently
$$AT = AT_1T_2\cdots T_N.$$
Theorem 2.14.14. Let $T$ represent an $M \times M$ matrix, $V$ an $N \times M$ matrix, and $W$ an $N \times N$ matrix. Then,
$$\begin{vmatrix} T & 0 \\ V & W \end{vmatrix} = \begin{vmatrix} W & V \\ 0 & T \end{vmatrix} = |T||W|.$$
Proof. Let
$$A = \begin{pmatrix} T & 0 \\ V & W \end{pmatrix},$$
and let $a_{ij}$ represent the $ij$th element of $A$ ($i, j = 1, 2, \dots, M+N$). Further, denote by $t_{ij}$ the $ij$th element of $T$ ($i, j = 1, 2, \dots, M$) and by $w_{ij}$ the $ij$th element of $W$ ($i, j = 1, 2, \dots, N$).
By definition,
$$|A| = \sum (-1)^{\phi_{M+N}(j_1, \dots, j_M, j_{M+1}, \dots, j_{M+N})}\, a_{1j_1}\cdots a_{Mj_M}a_{M+1,j_{M+1}}\cdots a_{M+N,j_{M+N}}, \tag{14.18}$$
where $j_1, \dots, j_M, j_{M+1}, \dots, j_{M+N}$ is a permutation of the first $M+N$ positive integers and the summation is over all such permutations. Clearly, the only terms of the sum (14.18) that can be nonzero are those for which $j_1, \dots, j_M$ constitutes a permutation of the first $M$ positive integers and thus for which $j_{M+1}, \dots, j_{M+N}$ constitutes a permutation of the integers $M+1, \dots, M+N$.
For any such permutation, we have that
$$a_{1j_1}\cdots a_{Mj_M}a_{M+1,j_{M+1}}\cdots a_{M+N,j_{M+N}} = t_{1j_1}\cdots t_{Mj_M}w_{1,j_{M+1}-M}\cdots w_{N,j_{M+N}-M} = t_{1j_1}\cdots t_{Mj_M}w_{1k_1}\cdots w_{Nk_N},$$
where $k_1 = j_{M+1}-M, \dots, k_N = j_{M+N}-M$, and we also have that
$$\phi_{M+N}(j_1, \dots, j_M, j_{M+1}, \dots, j_{M+N}) = \phi_M(j_1, \dots, j_M) + \phi_N(j_{M+1}, \dots, j_{M+N}) = \phi_M(j_1, \dots, j_M) + \phi_N(j_{M+1}-M, \dots, j_{M+N}-M) = \phi_M(j_1, \dots, j_M) + \phi_N(k_1, \dots, k_N).$$
Thus,
$$|A| = \sum\sum (-1)^{\phi_M(j_1, \dots, j_M)+\phi_N(k_1, \dots, k_N)}\, t_{1j_1}\cdots t_{Mj_M}w_{1k_1}\cdots w_{Nk_N} = \Bigl[\sum (-1)^{\phi_M(j_1, \dots, j_M)}\, t_{1j_1}\cdots t_{Mj_M}\Bigr]\Bigl[\sum (-1)^{\phi_N(k_1, \dots, k_N)}\, w_{1k_1}\cdots w_{Nk_N}\Bigr] = |T||W|,$$
where $j_1, \dots, j_M$ is a permutation of the first $M$ positive integers and $k_1, \dots, k_N$ a permutation of the first $N$ positive integers and where the respective summations are over all such permutations.
That $\begin{vmatrix} W & V \\ 0 & T \end{vmatrix} = |T||W|$ can be established via a similar argument. Q.E.D.
The repeated application of Theorem 2.14.14 leads to the following formulas for the determinant of an arbitrary (square) upper or lower block-triangular matrix (with square diagonal blocks):
$$\begin{vmatrix} A_{11} & A_{12} & \dots & A_{1R} \\ 0 & A_{22} & \dots & A_{2R} \\ \vdots & & \ddots & \vdots \\ 0 & 0 & & A_{RR} \end{vmatrix} = |A_{11}||A_{22}|\cdots|A_{RR}|; \tag{14.19}$$
$$\begin{vmatrix} B_{11} & 0 & \dots & 0 \\ B_{21} & B_{22} & & 0 \\ \vdots & & \ddots & \vdots \\ B_{R1} & B_{R2} & \dots & B_{RR} \end{vmatrix} = |B_{11}||B_{22}|\cdots|B_{RR}|. \tag{14.20}$$
Formulas (14.19), (14.20), and (14.21) generalize the results of Lemma 2.14.1 and Corollary 2.14.2
on the determinants of triangular and diagonal matrices.
As an immediate consequence of Theorem 2.14.9, we have the following corollary of Theorem 2.14.14.
Corollary 2.14.15. Let $T$ represent an $M \times M$ matrix, $V$ an $N \times M$ matrix, and $W$ an $N \times N$ matrix. Then,
$$\begin{vmatrix} 0 & T \\ W & V \end{vmatrix} = \begin{vmatrix} V & W \\ T & 0 \end{vmatrix} = (-1)^{MN}|T||W|. \tag{14.22}$$
The following corollary gives a simplified version of formula (14.22) for the special case where $M = N$ and $T = -I_N$.
Corollary 2.14.16. For $N \times N$ matrices $W$ and $V$,
$$\begin{vmatrix} 0 & -I_N \\ W & V \end{vmatrix} = \begin{vmatrix} V & W \\ -I_N & 0 \end{vmatrix} = |W|. \tag{14.23}$$
Proof (of Corollary 2.14.16). Corollary 2.14.16 can be derived from the special case of Corollary 2.14.15 where $M = N$ and $T = -I_N$ by observing that
$$(-1)^{N\cdot N}\,|-I_N|\,|W| = (-1)^{N^2}(-1)^N|W| = (-1)^{N(N+1)}|W|$$
and that either $N$ or $N+1$ is an even number and consequently $N(N+1)$ is an even number. Q.E.D.
Theorem 2.14.17. For $N \times N$ matrices $A$ and $B$,
$$|AB| = |A||B|.$$
The repeated application of Theorem 2.14.17 leads to the following formula for the determinant of the product of an arbitrary number of $N \times N$ matrices $A_1, A_2, \dots, A_K$:
$$|A_1A_2\cdots A_K| = |A_1||A_2|\cdots|A_K|.$$
As a special case of this formula, we obtain the following formula for the determinant of the $k$th power of an $N \times N$ matrix $A$:
$$|A^k| = |A|^k \tag{14.26}$$
($k = 2, 3, \dots$).
In light of Lemma 2.14.3, we have the following corollary of Theorem 2.14.17.
Corollary 2.14.18. For any $N \times N$ matrix $A$,
$$|A'A| = |A|^2. \tag{14.27}$$
Corollary 2.14.18 gives rise to the following result on the determinant of an orthogonal matrix.
Corollary 2.14.19. For any orthogonal matrix $P$, $|P| = 1$ or $|P| = -1$.
Proof (of Corollary 2.14.19). Using Corollary 2.14.18 [and result (14.9)], we find that
$$|P|^2 = |P'P| = |I| = 1. \quad\text{Q.E.D.}$$
Having established Theorem 2.14.17, we are now in a position to prove the following result on
the nonsingularity of a matrix and on the determinant of an inverse matrix.
Theorem 2.14.20. Let $A$ represent an $N \times N$ matrix. Then, $A$ is nonsingular (or, equivalently, $A$ is invertible) if and only if $|A| \ne 0$, in which case
$$|A^{-1}| = 1/|A|. \tag{14.28}$$
Proof. It suffices to show that if $A$ is nonsingular, then $|A| \ne 0$ and $|A^{-1}| = 1/|A|$ and that if $A$ is singular, then $|A| = 0$.
Suppose that $A$ is nonsingular. Then, according to Theorem 2.14.17 [and result (14.9)],
$$|A^{-1}||A| = |A^{-1}A| = |I| = 1,$$
implying that $|A| \ne 0$ and further that $|A^{-1}| = 1/|A|$.
Alternatively, suppose that $A$ is singular. Then, some column of $A$, say the $s$th column $a_s$, can be expressed as a linear combination of the other $N-1$ columns $a_1, a_2, \dots, a_{s-1}, a_{s+1}, \dots, a_N$; that is, $a_s = \sum_{i \ne s} k_ia_i$ for some scalars $k_1, k_2, \dots, k_{s-1}, k_{s+1}, \dots, k_N$. Now, let $B$ represent a matrix formed from $A$ by adding the vector $-\sum_{i \ne s} k_ia_i$ to the $s$th column of $A$. Clearly, the $s$th column of $B$ is null, and it follows from Corollary 2.14.5 that $|B| = 0$. And it follows from Theorem 2.14.12 that $|A| = |B|$. Thus, $|A| = 0$. Q.E.D.
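The content of Theorem 2.14.20 is easy to check numerically; a small sketch (ours, with arbitrary illustrative matrices):

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])               # nonsingular: |A| = 5
    assert np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / np.linalg.det(A))  # (14.28)

    S = np.array([[1.0, 2.0],
                  [2.0, 4.0]])               # second row = 2 x first row
    assert np.isclose(np.linalg.det(S), 0.0) # singular, so |S| = 0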
Let $A$ represent an $N \times N$ symmetric positive definite matrix. Then, according to Corollary 2.13.29, there exists a nonsingular matrix $P$ such that $A = P'P$. Thus, making use of Corollary 2.14.18 and observing (in light of Theorem 2.14.20) that $|P| \ne 0$, we find that
$$|A| = |P'P| = |P|^2 > 0.$$
Moreover, the determinant of a symmetric positive semidefinite matrix equals 0, as is evident from Theorem 2.14.20 upon recalling (from Corollary 2.13.28) that a symmetric positive semidefinite matrix is singular. Accordingly, we have the following lemma.
Lemma 2.14.21. The determinant of a symmetric positive definite matrix is positive; the determinant of a symmetric positive semidefinite matrix equals 0.
Applying Theorems 2.14.17 and 2.14.14 and result (14.9), we find that
$$\begin{vmatrix} T & U \\ V & W \end{vmatrix} = |T||W - VT^{-1}U|.$$
That $\begin{vmatrix} W & V \\ U & T \end{vmatrix} = |T||W - VT^{-1}U|$ can be proved in similar fashion. Q.E.D.
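A numerical check (ours; the blocks are arbitrary, with $T$ kept comfortably nonsingular) of the partitioned-determinant formula just established:

    import numpy as np

    rng = np.random.default_rng(2)
    T = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)
    U = rng.standard_normal((3, 2))
    V = rng.standard_normal((2, 3))
    W = rng.standard_normal((2, 2))

    A = np.block([[T, U], [V, W]])
    schur = W - V @ np.linalg.solve(T, U)    # W - V T^{-1} U
    assert np.isclose(np.linalg.det(A), np.linalg.det(T) * np.linalg.det(schur))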
Exercises
Exercise 1. Let $A$ represent an $M \times N$ matrix and $B$ an $N \times M$ matrix. Can the value of $A + B'$ be determined from the value of $A' + B$ (in the absence of any other information about $A$ and $B$)? Describe your reasoning.
Exercise 2. Show that for any $M \times N$ matrix $A = \{a_{ij}\}$ and $N \times P$ matrix $B = \{b_{ij}\}$, $(AB)' = B'A'$ [thereby verifying result (1.13)].
Exercise 3. Let $A = \{a_{ij}\}$ and $B = \{b_{ij}\}$ represent $N \times N$ symmetric matrices.
(a) Show that in the special case where $N = 2$, $AB$ is symmetric if and only if $b_{12}(a_{11} - a_{22}) = a_{12}(b_{11} - b_{22})$.
(b) Give a numerical example where $AB$ is nonsymmetric.
(c) Show that $A$ and $B$ commute if and only if $AB$ is symmetric.
Exercise 8. Which of the following sets are linear spaces: (1) the set of all $N \times N$ diagonal matrices, (2) the set of all $N \times N$ upper triangular matrices, and (3) the set of all $N \times N$ nonsymmetric matrices?
Exercise 9. Define
$$A = \begin{pmatrix} 1 & 2 & -1 & 0 \\ 2 & 1 & 1 & 1 \\ 1 & -1 & 2 & 1 \end{pmatrix},$$
and (for $i = 1, 2, 3$) let $a_i'$ represent the $i$th row of $A$.
(a) Show that the set $\{a_1', a_2'\}$ is a basis for $R(A)$.
Exercise 10. Let $A_1, A_2, \dots, A_K$ represent matrices in a linear space $\mathcal{V}$, and let $\mathcal{U}$ represent a subspace of $\mathcal{V}$. Show that $\operatorname{sp}(A_1, A_2, \dots, A_K) \subset \mathcal{U}$ if and only if $A_1, A_2, \dots, A_K$ are contained in $\mathcal{U}$ (thereby establishing what is essentially a generalization of Lemma 2.4.2).
Exercise 11. Let $\mathcal{V}$ represent a $K$-dimensional linear space of $M \times N$ matrices (where $K \ge 1$). Further, let $\{A_1, A_2, \dots, A_K\}$ represent a basis for $\mathcal{V}$, and, for arbitrary scalars $x_1, x_2, \dots, x_K$ and $y_1, y_2, \dots, y_K$, define $A = \sum_{i=1}^K x_iA_i$ and $B = \sum_{j=1}^K y_jA_j$. Show that
$$A\cdot B = \sum_{i=1}^K x_iy_i$$
for all choices of $x_1, x_2, \dots, x_K$ and $y_1, y_2, \dots, y_K$ if and only if the basis $\{A_1, A_2, \dots, A_K\}$ is orthonormal.
Exercise 12. An $N \times N$ matrix $A$ is said to be involutory if $A^2 = I$, that is, if $A$ is invertible and is its own inverse.
(a) Show that an $N \times N$ matrix $A$ is involutory if and only if $(I - A)(I + A) = 0$.
(b) Show that a $2 \times 2$ matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ is involutory if and only if (1) $a^2 + bc = 1$ and $d = -a$ or (2) $b = c = 0$ and $d = a = \pm 1$.
Exercise 19. Let $A$ represent a $4 \times N$ matrix of rank 2, and take $b = \{b_i\}$ to be a 4-dimensional column vector. Suppose that $b_1 = 1$ and $b_2 = 0$ and that two of the $N$ columns of $A$ are the vectors $a_1 = (5, 4, 3, 1)'$ and $a_2 = (1, 2, 0, 1)'$. Determine for which values of $b_3$ and $b_4$ the linear system $Ax = b$ (in $x$) is consistent.
Exercise 20. Let $A$ represent an $M \times N$ matrix. Show that for any generalized inverses $G_1$ and $G_2$ of $A$ and for any scalars $w_1$ and $w_2$ such that $w_1 + w_2 = 1$, the linear combination $w_1G_1 + w_2G_2$ is a generalized inverse of $A$.
Exercise 21. Let $A$ represent an $N \times N$ matrix.
(a) Using the result of Exercise 20 in combination with Corollary 2.10.11 (or otherwise), show that if $A$ is symmetric, then $A$ has a symmetric generalized inverse.
(b) Show that if $A$ is singular (i.e., of rank less than $N$) and if $N > 1$, then (even if $A$ is symmetric) $A$ has a nonsymmetric generalized inverse. (Hint. Make use of the second part of Theorem 2.10.7.)
Exercise 22. Let $A$ represent an $M \times N$ matrix of rank $N-1$. And let $x$ represent any nonnull vector in $N(A)$, that is, any $N$-dimensional nonnull column vector such that $Ax = 0$. Show that a matrix $Z$ is a solution to the homogeneous linear system $AZ = 0$ (in an $N \times P$ matrix $Z$) if and only if $Z = xk'$ for some $P$-dimensional row vector $k'$.
Exercise 23. Suppose that $AX = B$ is a consistent linear system (in an $N \times P$ matrix $X$).
(a) Show that if $\operatorname{rank}(A) = N$ or $\operatorname{rank}(B) = P$, then, corresponding to any solution $X$ to $AX = B$, there is a generalized inverse $G$ of $A$ such that $X = GB$.
(b) Show that if $\operatorname{rank}(A) < N$ and $\operatorname{rank}(B) < P$, then there exists a solution $X$ to $AX = B$ such that there is no generalized inverse $G$ of $A$ for which $X = GB$.
Exercise 24. Show that a matrix $A$ is symmetric and idempotent if and only if there exists a matrix $X$ such that $A = P_X$.
Exercise 25. Show that corresponding to any quadratic form $x'Ax$ (in an $N$-dimensional vector $x$), there exists a unique lower triangular matrix $B$ such that $x'Ax$ and $x'Bx$ are identically equal, and express the elements of $B$ in terms of the elements of $A$.
Exercise 26. Show, via an example, that the sum of two positive semidefinite matrices can be positive definite.
Exercise 27. Let $A$ represent an $N \times N$ symmetric nonnegative definite matrix (where $N \ge 2$). Define $A_0 = A$, and, for $k = 1, 2, \dots, N-1$, take $Q_k$ to be an $(N-k+1) \times (N-k+1)$ unit upper triangular matrix, $A_k$ an $(N-k) \times (N-k)$ matrix, and $d_k$ a scalar that satisfy the recursive relationship
$$Q_k'A_{k-1}Q_k = \operatorname{diag}(d_k,\, A_k) \tag{E.1}$$
—$Q_k$, $A_k$, and $d_k$ can be constructed by making use of Lemma 2.13.19 and by proceeding as in the proof of Theorem 2.13.20.
(a) Indicate how $Q_1, Q_2, \dots, Q_{N-1}$, $A_1, A_2, \dots, A_{N-1}$, and $d_1, d_2, \dots, d_{N-1}$ could be used to form an $N \times N$ unit upper triangular matrix $Q$ and a diagonal matrix $D$ such that $Q'AQ = D$.
(b) Taking
$$A = \begin{pmatrix} 2 & 0 & 0 & 0 \\ 0 & 4 & 2 & 4 \\ 0 & 2 & 1 & 2 \\ 0 & 4 & 2 & 7 \end{pmatrix}$$
(which is a symmetric nonnegative definite matrix), determine unit upper triangular matrices $Q_1$, $Q_2$, and $Q_3$, matrices $A_1$, $A_2$, and $A_3$, and scalars $d_1$, $d_2$, and $d_3$ that satisfy the recursive relationship (E.1), and illustrate the procedure devised in response to Part (a) by using it to find a $4 \times 4$ unit upper triangular matrix $Q$ and a diagonal matrix $D$ such that $Q'AQ = D$.
Exercise 28. Let $A = \{a_{ij}\}$ represent an $N \times N$ symmetric positive definite matrix, and let $B = \{b_{ij}\} = A^{-1}$. Show that, for $i = 1, 2, \dots, N$,
$$b_{ii} \ge 1/a_{ii},$$
with equality holding if and only if $a_{ij} = 0$ for all $j \ne i$.
Exercise 29. Let
$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{pmatrix}.$$
(a) Write out all of the pairs that can be formed from the four "boxed" elements of $A$.
(b) Indicate which of the pairs from Part (a) are positive and which are negative.
(c) Use formula (14.1) to compute the number of pairs from Part (a) that are negative, and check that the result of this computation is consistent with your answer to Part (b).
Exercise 30. Obtain (in as simple form as possible) an expression for the determinant of each of the following two matrices: (1) an $N \times N$ matrix $A = \{a_{ij}\}$ of the general form
$$A = \begin{pmatrix} 0 & \dots & 0 & 0 & a_{1N} \\ 0 & \dots & 0 & a_{2,N-1} & a_{2N} \\ 0 & & a_{3,N-2} & a_{3,N-1} & a_{3N} \\ \vdots & & \vdots & \vdots & \vdots \\ a_{N1} & \dots & a_{N,N-2} & a_{N,N-1} & a_{NN} \end{pmatrix}$$
(b) Extend the result of Part (a) by showing that in the general case where $A$ is not necessarily symmetric (i.e., where possibly $c \ne b$), $A$ is nonnegative definite if and only if $a \ge 0$, $d \ge 0$, and $|b+c|/2 \le \sqrt{ad}$ and is positive definite if and only if $a > 0$, $d > 0$, and $|b+c|/2 < \sqrt{ad}$. [Hint. Take advantage of the result of Part (a).]
Exercise 34. Let $A = \{a_{ij}\}$ represent an $N \times N$ symmetric matrix. And suppose that $A$ is nonnegative definite (in which case its diagonal elements are nonnegative). By, for example, making use of the result of Part (a) of Exercise 33, show that, for $j \ne i = 1, 2, \dots, N$,
$$|a_{ij}| \le \sqrt{a_{ii}a_{jj}} \le \max(a_{ii},\, a_{jj}),$$
with $|a_{ij}| < \sqrt{a_{ii}a_{jj}}$ if $A$ is positive definite.
Exercise 35. Let $A = \{a_{ij}\}$ represent an $N \times N$ symmetric positive definite matrix. Show that
$$\det A \le \prod_{i=1}^N a_{ii},$$
with equality holding if and only if $A$ is diagonal.
In working with linear models, knowledge of basic results on the distribution of random variables is
essential. Of particular relevance are various results on expected values and on variances and covari-
ances. Also of relevance are results that pertain to conditional distributions and to the multivariate
normal distribution.
In working with a large number (or even a modest number) of random variables, the use of matrix
notation can be extremely helpful. In particular, formulas for the expected values and the variances
and covariances of linear combinations of random variables can be expressed very concisely in
matrix notation. The use of matrix notation is facilitated by the arrangement of random variables in
the form of a vector or a matrix. A random (row or column) vector is a (row or column) vector whose
elements are (jointly distributed) random variables. More generally, a random matrix is a matrix
whose elements are (jointly distributed) random variables.
A random variable $x$ (or its distribution) is said to be absolutely continuous if there exists a nonnegative function $f(x)$ of $x$, called a probability density function, such that, for an "arbitrary" set $A$ of real numbers, $\Pr(x \in A) = \int_A f(s)\,ds$, in which case
$$E[g(x)] = \int_{-\infty}^{\infty} g(s)f(s)\,ds$$
for "any" function $g(x)$ of $x$. More generally, an $N$-dimensional random vector $x$ (or its distribution) is said to be absolutely continuous if there exists a nonnegative function $f(x)$ of $x$, called a probability density function (pdf), such that, for an "arbitrary" subset $A$ of $\mathbb{R}^N$, $\Pr(x \in A) = \int_A f(s)\,ds$, in which case
$$E[g(x)] = \int_{\mathbb{R}^N} g(s)f(s)\,ds \tag{1.2}$$
for "any" function $g(x)$ of $x$.
If $x$ is a random vector and $g(x)$ "any" function of $x$ that is nonnegative [in the sense that $g(x) \ge 0$ for every value of $x$] or is nonnegative with probability 1 [in the sense that for some set $A$ of $x$-values for which $\Pr(x \in A) = 1$, $g(x) \ge 0$ for every value of $x$ in $A$], then $E[g(x)]$ is regarded as existing, its value possibly being $+\infty$.
By definition, two random vectors, say $x$ and $y$, are statistically independent if for "every" set $A$ (of $x$-values) and "every" set $B$ (of $y$-values),
$$\Pr(x \in A,\; y \in B) = \Pr(x \in A)\Pr(y \in B).$$
If $x$ and $y$ are statistically independent, then for "any" function $f(x)$ of $x$ and "any" function $g(y)$ of $y$ [for which $E[f(x)]$ and $E[g(y)]$ exist],
$$E[f(x)g(y)] = E[f(x)]\,E[g(y)] \tag{1.4}$$
(e.g., Casella and Berger 2002, sec. 4.2; Parzen 1960, p. 361).
The expected value of an $N$-dimensional random row or column vector is the $N$-dimensional (respectively row or column) vector whose $i$th element is the expected value of the $i$th element of the random vector ($i = 1, 2, \dots, N$). More generally, the expected value of an $M \times N$ random matrix is the $M \times N$ matrix whose $ij$th element is the expected value of the $ij$th element of the random matrix ($i = 1, 2, \dots, M$; $j = 1, 2, \dots, N$). The expected value of a random matrix $X$ is denoted by the symbol $E(X)$ (and is said to exist if the expected value of every element of $X$ exists). Thus, for an $M \times N$ random matrix $X$ with $ij$th element $x_{ij}$ ($i = 1, 2, \dots, M$; $j = 1, 2, \dots, N$),
$$E(X) = \begin{pmatrix} E(x_{11}) & E(x_{12}) & \dots & E(x_{1N}) \\ E(x_{21}) & E(x_{22}) & \dots & E(x_{2N}) \\ \vdots & \vdots & & \vdots \\ E(x_{M1}) & E(x_{M2}) & \dots & E(x_{MN}) \end{pmatrix}.$$
The expected value of a random variable x is referred to as the mean of x (or of the distribution
of x). And, similarly, the expected value of a random vector or matrix X is referred to as the mean
(or, if applicable, mean vector) of X (or of the distribution of X).
It follows from elementary properties of the expected values of random variables that for a finite number of random variables $x_1, x_2, \dots, x_N$ and for nonrandom scalars $c, a_1, a_2, \dots, a_N$,
$$E\Bigl(c + \sum_{j=1}^N a_jx_j\Bigr) = c + \sum_{j=1}^N a_jE(x_j). \tag{1.5}$$
Letting $x = (x_1, x_2, \dots, x_N)'$ and $a = (a_1, a_2, \dots, a_N)'$, this equality can be reexpressed in matrix notation as
$$E(c + a'x) = c + a'E(x). \tag{1.6}$$
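Equality (1.6) in action, in a Monte Carlo sketch of ours (the exponential distribution and the coefficients are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(3)
    a = np.array([1.0, -2.0, 0.5])
    c = 3.0
    x = rng.exponential(scale=2.0, size=(1_000_000, 3))   # each E(x_j) = 2

    print(np.mean(c + x @ a))        # Monte Carlo estimate of E(c + a'x)
    print(c + a @ np.full(3, 2.0))   # exact value c + a'E(x) = 2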
where $A = (a_1I, a_2I, \dots, a_NI)$, $X' = (X_1', X_2', \dots, X_N')$, and $K = I$, equality (1.10) can be derived from equality (1.9).
Variance (and standard deviation) of a random variable. The variance of a random variable $x$ (whose expected value exists) is (by definition) the expected value $E\{[x - E(x)]^2\}$ of the square of the difference between $x$ and its expected value. The variance of $x$ is denoted by the symbol $\operatorname{var} x$ or $\operatorname{var}(x)$. The positive square root $\sqrt{\operatorname{var}(x)}$ of the variance of $x$ is referred to as the standard deviation of $x$.
If a random variable $x$ is such that $E(x^2)$ exists [i.e., such that $E(x^2) < \infty$], then $E(x)$ exists and $\operatorname{var}(x)$ also exists (i.e., is finite). That the existence of $E(x^2)$ implies the existence of $E(x)$ can be readily verified by making use of the inequality $|x| < 1 + x^2$. That it also implies the existence (finiteness) of $\operatorname{var}(x)$ becomes clear upon observing that
$$[x - E(x)]^2 = x^2 - 2x\,E(x) + [E(x)]^2. \tag{2.1}$$
The existence of $E(x^2)$ is a necessary as well as a sufficient condition for the existence of $E(x)$ and $\operatorname{var}(x)$, as is evident upon reexpressing equality (2.1) as
$$x^2 = [x - E(x)]^2 + 2x\,E(x) - [E(x)]^2. \tag{2.2}$$
Further,
$$\operatorname{var}(x) = E(x^2) - [E(x)]^2, \tag{2.3}$$
as can be readily verified by using formula (1.5) to evaluate expression (2.1). Also, it is worth noting that
$$\operatorname{var}(x) = 0 \iff x = E(x) \text{ with probability 1.} \tag{2.4}$$
Covariance of two random variables. The covariance of two random variables $x$ and $y$ (whose expected values exist) is (by definition) $E\{[x - E(x)][y - E(y)]\}$. The covariance of $x$ and $y$ is denoted by the symbol $\operatorname{cov}(x, y)$. We have that
$$\operatorname{cov}(y, x) = \operatorname{cov}(x, y) \tag{2.5}$$
and that
$$\operatorname{cov}(x, y) = E(xy) - E(x)E(y). \tag{2.7}$$
Note that in the special case where $y = x$, formula (2.7) reduces to formula (2.3).
Some fundamental results bearing on the covariance of two random variables $x$ and $y$ (whose expected values exist) and on the relationship of the covariance to the variances of $x$ and $y$ are as follows. The covariance of $x$ and $y$ exists if the variances of $x$ and $y$ both exist, in which case
$$|\operatorname{cov}(x, y)| \le \sqrt{\operatorname{var}(x)}\sqrt{\operatorname{var}(y)} \tag{2.9}$$
or, equivalently,
$$[\operatorname{cov}(x, y)]^2 \le \operatorname{var}(x)\operatorname{var}(y) \tag{2.10}$$
or (also equivalently)
$$-\sqrt{\operatorname{var}(x)}\sqrt{\operatorname{var}(y)} \le \operatorname{cov}(x, y) \le \sqrt{\operatorname{var}(x)}\sqrt{\operatorname{var}(y)}. \tag{2.11}$$
Further,
$$\operatorname{var}(x) = 0 \text{ or } \operatorname{var}(y) = 0 \;\Rightarrow\; \operatorname{cov}(x, y) = 0, \tag{2.12}$$
so that when $\operatorname{var}(x) = 0$ or $\operatorname{var}(y) = 0$ or, equivalently, when $x = E(x)$ with probability 1 or $y = E(y)$ with probability 1, inequality (2.9) holds as an equality, both sides of which equal 0. And when
$\operatorname{var}(x) > 0$ and $\operatorname{var}(y) > 0$, inequality (2.9) holds as the equality $\operatorname{cov}(x, y) = \sqrt{\operatorname{var}(x)}\sqrt{\operatorname{var}(y)}$ if and only if
$$\frac{y - E(y)}{\sqrt{\operatorname{var}(y)}} = \frac{x - E(x)}{\sqrt{\operatorname{var}(x)}} \quad\text{with probability 1,}$$
and holds as the equality $\operatorname{cov}(x, y) = -\sqrt{\operatorname{var}(x)}\sqrt{\operatorname{var}(y)}$ if and only if
$$\frac{y - E(y)}{\sqrt{\operatorname{var}(y)}} = -\frac{x - E(x)}{\sqrt{\operatorname{var}(x)}} \quad\text{with probability 1.}$$
These results (on the covariance of the random variables $x$ and $y$) can be inferred from the following results on the expected value of the product of two random variables, say $w$ and $z$—they are obtained from the results on $E(wz)$ by setting $w = x - E(x)$ and $z = y - E(y)$. The expected value $E(wz)$ of $wz$ exists if the expected values $E(w^2)$ and $E(z^2)$ of $w^2$ and $z^2$ both exist, in which case
$$|E(wz)| \le \sqrt{E(w^2)}\sqrt{E(z^2)} \tag{2.13}$$
or, equivalently,
$$[E(wz)]^2 \le E(w^2)E(z^2) \tag{2.14}$$
or (also equivalently)
$$-\sqrt{E(w^2)}\sqrt{E(z^2)} \le E(wz) \le \sqrt{E(w^2)}\sqrt{E(z^2)}. \tag{2.15}$$
Further,
$$E(w^2) = 0 \text{ or } E(z^2) = 0 \;\Rightarrow\; E(wz) = 0, \tag{2.16}$$
so that when $E(w^2) = 0$ or $E(z^2) = 0$ or, equivalently, when $w = 0$ with probability 1 or $z = 0$ with probability 1, inequality (2.13) holds as an equality, both sides of which equal 0. And when $E(w^2) > 0$ and $E(z^2) > 0$, inequality (2.13) holds as the equality $E(wz) = \sqrt{E(w^2)}\sqrt{E(z^2)}$ if and only if $z/\sqrt{E(z^2)} = w/\sqrt{E(w^2)}$ with probability 1, and holds as the equality $E(wz) = -\sqrt{E(w^2)}\sqrt{E(z^2)}$ if and only if $z/\sqrt{E(z^2)} = -w/\sqrt{E(w^2)}$ with probability 1. A verification of these results on $E(wz)$ is given later in the present subsection.
Correlation of two random variables. The correlation of two random variables $x$ and $y$ (whose expected values exist and whose variances also exist and are strictly positive) is (by definition)
$$\frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)}\sqrt{\operatorname{var}(y)}},$$
and is denoted by the symbol $\operatorname{corr}(x, y)$. From result (2.5), it is clear that $\operatorname{corr}(y, x) = \operatorname{corr}(x, y)$.
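The sample analogues of these quantities are immediate to compute; a NumPy sketch of ours (the construction of $y$ is an arbitrary choice that makes the true correlation 0.6):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.standard_normal(100_000)
    y = 0.6 * x + 0.8 * rng.standard_normal(100_000)   # var(y) = 1, corr(x, y) = 0.6

    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    corr_xy = cov_xy / (x.std() * y.std())
    print(corr_xy)                    # close to 0.6
    print(np.corrcoef(x, y)[0, 1])    # NumPy's built-in estimate agrees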
Verification of results on the expected value of the product of two random variables. Let us verify the results (given earlier in the present subsection) on the expected value $E(wz)$ of the product of two random variables $w$ and $z$. Suppose that $E(w^2)$ and $E(z^2)$ both exist. Then, what we wish to establish are the existence of $E(wz)$ and the validity of inequality (2.13) and of the conditions under which equality is attained in inequality (2.13).
Let us begin by observing that, for arbitrary scalars $a$ and $b$,
$$\tfrac{1}{2}(a^2 + b^2) - ab = \tfrac{1}{2}(a - b)^2 \ge 0$$
and
$$\tfrac{1}{2}(a^2 + b^2) + ab = \tfrac{1}{2}(a + b)^2 \ge 0,$$
implying in particular that
$$-\tfrac{1}{2}(a^2 + b^2) \le ab \le \tfrac{1}{2}(a^2 + b^2) \tag{2.22}$$
or, equivalently, that
$$|ab| \le \tfrac{1}{2}(a^2 + b^2). \tag{2.23}$$
Upon setting $a = w$ and $b = z$ in inequality (2.23), we obtain the inequality
$$|wz| \le \tfrac{1}{2}(w^2 + z^2). \tag{2.24}$$
The expected value of the right side of inequality (2.24) exists, implying the existence of the expected value of the left side of inequality (2.24) and hence the existence of $E(wz)$.
Now, consider inequality (2.13). When $E(w^2) = 0$ or $E(z^2) = 0$ or, equivalently, when $w = 0$ with probability 1 or $z = 0$ with probability 1, $wz = 0$ with probability 1 and hence inequality (2.13) holds as an equality, both sides of which equal 0.
Alternatively, suppose that $E(w^2) > 0$ and $E(z^2) > 0$. And take $a = w/\sqrt{E(w^2)}$ and $b = z/\sqrt{E(z^2)}$. In light of result (2.22), we have that
$$E\bigl[\tfrac{1}{2}(a^2 + b^2) - ab\bigr] \ge 0 \quad\text{and}\quad E\bigl[\tfrac{1}{2}(a^2 + b^2) + ab\bigr] \ge 0. \tag{2.25}$$
Moreover,
$$E\bigl[\tfrac{1}{2}(a^2 + b^2)\bigr] = 1 \quad\text{and}\quad E(ab) = \frac{E(wz)}{\sqrt{E(w^2)}\sqrt{E(z^2)}}. \tag{2.26}$$
Together, results (2.25) and (2.26) imply that
$$-1 \le \frac{E(wz)}{\sqrt{E(w^2)}\sqrt{E(z^2)}} \le 1,$$
which is equivalent to result (2.15) and hence to inequality (2.13). Further, inequality (2.13) holds as the equality $E(wz) = \sqrt{E(w^2)}\sqrt{E(z^2)}$ if and only if $E\bigl[\tfrac{1}{2}(a^2 + b^2) - ab\bigr] = 0$ [as is evident from result (2.26)], or equivalently if and only if $E\bigl[\tfrac{1}{2}(a - b)^2\bigr] = 0$, and hence if and only if $b = a$ with probability 1. And, similarly, inequality (2.13) holds as the equality $E(wz) = -\sqrt{E(w^2)}\sqrt{E(z^2)}$ if and only if $E\bigl[\tfrac{1}{2}(a^2 + b^2) + ab\bigr] = 0$, or equivalently if and only if $E\bigl[\tfrac{1}{2}(a + b)^2\bigr] = 0$, and hence if and only if $b = -a$ with probability 1.
For an $N$-dimensional random column vector $x$ and a $T$-dimensional random column vector $y$,
$$\operatorname{cov}(x, y) = E\{[x - E(x)][y - E(y)]'\}, \tag{2.27}$$
$$\operatorname{cov}(x, y) = E(xy') - E(x)[E(y)]', \tag{2.28}$$
and
$$\operatorname{cov}(y, x) = [\operatorname{cov}(x, y)]'. \tag{2.29}$$
Equality (2.27) can be regarded as a multivariate extension of the formula $\operatorname{cov}(x, y) = E\{[x - E(x)][y - E(y)]\}$ for the covariance of two random variables $x$ and $y$. And equality (2.28) can be regarded as a multivariate extension of equality (2.7), and equality (2.29) as a multivariate extension of equality (2.5). Each of equalities (2.27), (2.28), and (2.29) can be readily verified by comparing each element of the left side with the corresponding element of the right side.
Clearly, for an $N$-dimensional random column vector $x$,
$$\operatorname{var}(x) = \operatorname{cov}(x, x). \tag{2.30}$$
Thus, as special cases of equalities (2.27), (2.28), and (2.29), we have that
$$\operatorname{var}(x) = E\{[x - E(x)][x - E(x)]'\}, \tag{2.31}$$
$$\operatorname{var}(x) = E(xx') - E(x)[E(x)]', \tag{2.32}$$
and
$$\operatorname{var}(x) = [\operatorname{var}(x)]'. \tag{2.33}$$
Equality (2.31) can be regarded as a multivariate extension of the formula $\operatorname{var}(x) = E\{[x - E(x)]^2\}$ for the variance of a random variable $x$, and equality (2.32) can be regarded as a multivariate extension of equality (2.3). Equality (2.33) indicates that a variance-covariance matrix is symmetric.
For an $N$-dimensional random column vector $x = (x_1, x_2, \dots, x_N)'$,
$$\Pr[x \ne E(x)] \le \sum_{i=1}^N \Pr[x_i \ne E(x_i)], \tag{2.34}$$
as is evident upon observing that $\{x : x \ne E(x)\} = \cup_{i=1}^N \{x : x_i \ne E(x_i)\}$. Moreover, according to result (2.4), $\operatorname{var}(x_i) = 0 \Rightarrow \Pr[x_i \ne E(x_i)] = 0$ ($i = 1, 2, \dots, N$), implying [in combination with inequality (2.34)] that if $\operatorname{var}(x_i) = 0$ for $i = 1, 2, \dots, N$, then $\Pr[x \ne E(x)] = 0$ or equivalently $\Pr[x = E(x)] = 1$. Thus, as a generalization of result (2.4), we have [since (for $i = 1, 2, \dots, N$) $\Pr[x = E(x)] = 1 \Rightarrow \Pr[x_i = E(x_i)] = 1 \Rightarrow \operatorname{var}(x_i) = 0$] that [for an $N$-dimensional random column vector $x = (x_1, x_2, \dots, x_N)'$]
$$\operatorname{var}(x_i) = 0 \text{ for } i = 1, 2, \dots, N \iff x = E(x) \text{ with probability 1.} \tag{2.35}$$
Alternatively, result (2.35) can be established by observing that
$$\operatorname{var}(x_i) = 0 \text{ for } i = 1, 2, \dots, N \;\Rightarrow\; \sum_{i=1}^N \operatorname{var}(x_i) = 0 \;\Rightarrow\; E\{[x - E(x)]'[x - E(x)]\} = 0 \;\Rightarrow\; \Pr\{[x - E(x)]'[x - E(x)] = 0\} = 1$$
and that $[x - E(x)]'[x - E(x)] = 0 \iff x - E(x) = 0$.
In connection with result (2.35) (and otherwise), it is worth noting that, for an $N$-dimensional random column vector $x = (x_1, x_2, \dots, x_N)'$,
$$\operatorname{var}(x_i) = 0 \text{ for } i = 1, 2, \dots, N \iff \operatorname{var}(x) = 0 \tag{2.36}$$
—result (2.36) is a consequence of result (2.12).
For random column vectors $x$ and $y$,
$$\operatorname{var}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \operatorname{var}(x) & \operatorname{cov}(x, y) \\ \operatorname{cov}(y, x) & \operatorname{var}(y) \end{pmatrix}, \tag{2.37}$$
as is evident from the very definition of a variance-covariance matrix (and from the definition of the covariance of two random vectors). More generally, for a random column vector $x$ that has been partitioned into subvectors $x_1, x_2, \dots, x_R$ [so that $x' = (x_1', x_2', \dots, x_R')$],
$$\operatorname{var}(x) = \begin{pmatrix} \operatorname{var}(x_1) & \operatorname{cov}(x_1, x_2) & \dots & \operatorname{cov}(x_1, x_R) \\ \operatorname{cov}(x_2, x_1) & \operatorname{var}(x_2) & & \operatorname{cov}(x_2, x_R) \\ \vdots & & \ddots & \vdots \\ \operatorname{cov}(x_R, x_1) & \operatorname{cov}(x_R, x_2) & & \operatorname{var}(x_R) \end{pmatrix}. \tag{2.38}$$
as can be readily verified by making use of result (1.5). As a special case of equality (2.39), we have that
$$\operatorname{var}\Bigl(c + \sum_{i=1}^N a_ix_i\Bigr) = \sum_{i=1}^N\sum_{j=1}^N a_ia_j\operatorname{cov}(x_i, x_j) \tag{2.40}$$
$$= \sum_{i=1}^N a_i^2\operatorname{var}(x_i) + 2\sum_{i=1}^{N-1}\sum_{j=i+1}^N a_ia_j\operatorname{cov}(x_i, x_j). \tag{2.41}$$
As in the case of equality (1.5), equalities (2.39) and (2.40) are reexpressible in matrix notation. Upon letting $x = (x_1, x_2, \dots, x_N)'$, $a = (a_1, a_2, \dots, a_N)'$, $y = (y_1, y_2, \dots, y_T)'$, and
Results (2.42), (2.43), and (2.44) can be generalized. Let $c$ represent an $M$-dimensional nonrandom column vector, $A$ an $M \times N$ nonrandom matrix, and $x$ an $N$-dimensional random column vector (whose expected value exists). Similarly, let $k$ represent an $S$-dimensional nonrandom column vector, $B$ an $S \times T$ nonrandom matrix, and $y$ a $T$-dimensional random column vector (whose expected value exists). Then,
$$\operatorname{cov}(c + Ax,\; k + By) = A\operatorname{cov}(x, y)B', \tag{2.45}$$
which is a generalization of result (2.42) and which in the special case where $y = x$ (and $T = N$) yields the following generalization of result (2.44):
$$\operatorname{cov}(c + Ax,\; k + Bx) = A\operatorname{var}(x)B'. \tag{2.46}$$
When $k = c$ and $B = A$, result (2.46) simplifies to the following generalization of result (2.43):
$$\operatorname{var}(c + Ax) = A\operatorname{var}(x)A'. \tag{2.47}$$
Equality (2.45) can be readily verified by comparing each element of the left side with the corresponding element of the right side and by applying result (2.42).
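Formula (2.47) is the workhorse identity for linear transformations of a random vector; a Monte Carlo sketch of ours (the particular Sigma, A, and c are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(5)
    Sigma = np.array([[2.0, 0.5, 0.0],
                      [0.5, 1.0, 0.3],
                      [0.0, 0.3, 1.5]])
    A = np.array([[1.0, -1.0, 0.0],
                  [0.0,  2.0, 1.0]])
    c = np.array([1.0, -2.0])

    x = rng.multivariate_normal(np.zeros(3), Sigma, size=200_000)
    y = c + x @ A.T                   # each row is a draw of c + Ax

    print(np.cov(y, rowvar=False))    # approximately A Sigma A'
    print(A @ Sigma @ A.T)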
Another sort of generalization is possible. Let $x_1, x_2, \dots, x_N$ represent $M$-dimensional random column vectors (whose expected values exist), $c$ an $M$-dimensional nonrandom column vector, and $a_1, a_2, \dots, a_N$ nonrandom scalars. Similarly, let $y_1, y_2, \dots, y_T$ represent $S$-dimensional random column vectors (whose expected values exist), $k$ an $S$-dimensional nonrandom column vector, and $b_1, b_2, \dots, b_T$ nonrandom scalars. Then,
$$\operatorname{cov}\Bigl(c + \sum_{i=1}^N a_ix_i,\; k + \sum_{j=1}^T b_jy_j\Bigr) = \sum_{i=1}^N\sum_{j=1}^T a_ib_j\operatorname{cov}(x_i, y_j), \tag{2.48}$$
which is a generalization of result (2.39). As a special case of equality (2.48) [that obtained by setting $T = N$ and (for $j = 1, 2, \dots, T$) $y_j = x_j$], we have that
$$\operatorname{cov}\Bigl(c + \sum_{i=1}^N a_ix_i,\; k + \sum_{j=1}^N b_jx_j\Bigr) = \sum_{i=1}^N\sum_{j=1}^N a_ib_j\operatorname{cov}(x_i, x_j). \tag{2.49}$$
And as a further special case [that obtained by setting $k = c$ and (for $j = 1, 2, \dots, N$) $b_j = a_j$], we have that
$$\operatorname{var}\Bigl(c + \sum_{i=1}^N a_ix_i\Bigr) = \sum_{i=1}^N a_i^2\operatorname{var}(x_i) + \sum_{i=1}^N\sum_{\substack{j=1\\ j \ne i}}^N a_ia_j\operatorname{cov}(x_i, x_j). \tag{2.50}$$
Equality (2.48) can be verified by comparing each element of the left side with the corresponding element of the right side and by applying result (2.39). Alternatively, equality (2.48) can be derived by observing that $\sum_{i=1}^N a_ix_i = Ax$, where $A = (a_1I, a_2I, \dots, a_NI)$ and $x' = (x_1', x_2', \dots, x_N')$, and that $\sum_{j=1}^T b_jy_j = By$, where $B = (b_1I, b_2I, \dots, b_TI)$ and $y' = (y_1', y_2', \dots, y_T')$, and by applying equality (2.45).
The null space $N(V)$ of $V$ is (as discussed in Section 2.9b) a linear space and (according to Lemma 2.11.5) is of dimension $N - \operatorname{rank}(V)$. When $V$ is positive definite, $V$ is nonsingular, so that $\dim[N(V)] = 0$. When $V$ is positive semidefinite, $V$ is singular, so that $\dim[N(V)] \ge 1$.
In light of Lemma 2.14.21, we have that $|V| \ge 0$ and, more specifically, that
$$|V| > 0 \text{ if } V \text{ is positive definite} \tag{2.52}$$
and
$$|V| = 0 \text{ if } V \text{ is positive semidefinite.} \tag{2.53}$$
An inequality revisited. Consider the implications of results (2.52) and (2.53) in the special case where $N = 2$. Accordingly, suppose that $V$ is the variance-covariance matrix of a vector of two random variables, say $x$ and $y$. Then, in light of result (2.14.4), $|V| = \operatorname{var}(x)\operatorname{var}(y) - [\operatorname{cov}(x, y)]^2$, so that
$$|\operatorname{cov}(x, y)| \le \sqrt{\operatorname{var}(x)}\sqrt{\operatorname{var}(y)}, \tag{2.55}$$
with equality holding if and only if $|V| = 0$ and hence if and only if $V$ is positive semidefinite (or, equivalently, if and only if $V$ is singular or, also equivalently, if and only if $\dim[N(V)] \ge 1$).
Note that inequality (2.55) is identical to inequality (2.9), the validity of which was established (in the final part of Subsection a) via a different approach.
Let $x$ represent a random variable, and take $z$ to be the transformed random variable defined by $z = (x - c)/a$, where $c$ is a nonrandom scalar and $a$ a nonzero nonrandom scalar. Then, $x$ is expressible as
$$x = c + az,$$
and the distribution of $x$ is determinable from $c$ and $a$ and from the distribution of the transformed random variable $z$.
Clearly,
$$E(z) = \frac{E(x) - c}{a}.$$
And
$$\operatorname{var}(z) = \frac{\operatorname{var}(x)}{a^2},$$
or more generally, taking $y$ to be a random variable and $w$ to be the transformed random variable defined by $w = (y - k)/b$ (where $k$ is a nonrandom scalar and $b$ a nonzero nonrandom scalar),
$$\operatorname{cov}(z, w) = \frac{\operatorname{cov}(x, y)}{ab}.$$
In the special case where $c = E(x)$ and $a = \sqrt{\operatorname{var} x}$, the random variable $z$ is referred to as the standardized version of the random variable $x$ (with the use of this term being restricted to situations where the expected value of $x$ exists and where the variance of $x$ exists and is strictly positive). When $c = E(x)$ and $a = \sqrt{\operatorname{var} x}$,
$$E(z) = 0 \quad\text{and}\quad \operatorname{var}(z) = 1.$$
Further, if $z$ is the standardized version of $x$ and $w$ the standardized version of a random variable $y$, then
$$\operatorname{cov}(z, w) = \operatorname{corr}(x, y).$$
For an $N$-dimensional random column vector $x = (x_1, x_2, \dots, x_N)'$ (whose elements have expected values that exist and variances that exist and are strictly positive), define
$$z = S^{-1}[x - E(x)],$$
where $S = \operatorname{diag}(\sqrt{\operatorname{var} x_1}, \sqrt{\operatorname{var} x_2}, \dots, \sqrt{\operatorname{var} x_N})$. Then, $x$ is expressible as
$$x = E(x) + Sz,$$
and the distribution of $x$ is determinable from its mean vector $E(x)$ and the vector $(\sqrt{\operatorname{var} x_1}, \sqrt{\operatorname{var} x_2}, \dots, \sqrt{\operatorname{var} x_N})$ of standard deviations and from the distribution of the transformed random vector $z$.
Clearly,
$$E(z) = 0.$$
Further, $\operatorname{var}(z)$ equals the correlation matrix of $x$; or more generally, taking $y$ to be a $T$-dimensional random column vector with first through $T$th elements $y_1, y_2, \dots, y_T$ (whose expected values exist and whose variances exist and are strictly positive) and taking $w$ to be the $T$-dimensional random column vector whose $j$th element is the standardized version of $y_j$, $\operatorname{cov}(z, w)$ equals the $N \times T$ matrix whose $ij$th element is $\operatorname{corr}(x_i, y_j)$.
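A short sketch of ours of this multivariate standardization (the two-dimensional Sigma and the means are arbitrary illustrative choices), confirming that the sample variance-covariance matrix of $z$ approximates the correlation matrix of $x$:

    import numpy as np

    rng = np.random.default_rng(6)
    Sigma = np.array([[4.0, 1.2],
                      [1.2, 9.0]])
    x = rng.multivariate_normal([10.0, -5.0], Sigma, size=100_000)

    S_inv = np.diag(1.0 / x.std(axis=0))   # S^{-1}, from the sample standard deviations
    z = (x - x.mean(axis=0)) @ S_inv       # each row is a standardized draw

    print(np.cov(z, rowvar=False))         # approximately the correlation matrix of x
    print(np.corrcoef(x, rowvar=False))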
$$(I - T'R')d = (I - T'R')T'Th = T'[I - (TR)']Th = T'(I - I')Th = 0,$$
so that
$$d = T'R'd \in C(T') = C(V).$$
Thus, the third term of the sum (4.4) equals 0. That the fourth term equals 0 can be demonstrated in similar fashion.
The definition of the conditional expected value of a random variable can be readily extended to a random row or column vector or more generally to a random matrix. The expected value of an $M \times N$ random matrix $Y = \{y_{ij}\}$ conditional on a random matrix $X$ is defined to be the $M \times N$ matrix whose $ij$th element is the conditional expected value $E(y_{ij} \mid X)$ of the $ij$th element of $Y$. It is to be denoted by the symbol $E(Y \mid X)$. As a straightforward extension of the property (4.1), we have that
$$E(Y) = E[E(Y \mid X)]. \tag{4.5}$$
The definition of the conditional variance of a random variable and the definition of a conditional covariance of two random variables can also be readily extended. The variance-covariance matrix of an $M$-dimensional random column vector $y = \{y_i\}$ (or its transpose $y'$) conditional on a random matrix $X$ is defined to be the $M \times M$ matrix whose $ij$th element is the conditional covariance $\operatorname{cov}(y_i, y_j \mid X)$ of the $i$th and $j$th elements of $y$ or $y'$. It is to be denoted by the symbol $\operatorname{var}(y \mid X)$ or $\operatorname{var}(y' \mid X)$. Note that the diagonal elements of this matrix are the $M$ conditional variances $\operatorname{var}(y_1 \mid X), \operatorname{var}(y_2 \mid X), \dots, \operatorname{var}(y_M \mid X)$. Further, the covariance of an $M$-dimensional random column vector $y = \{y_i\}$ (or its transpose $y'$) and an $N$-dimensional random column vector $w = \{w_j\}$ (or its transpose $w'$) conditional on a random matrix $X$ is defined to be the $M \times N$ matrix whose $ij$th element is the conditional covariance $\operatorname{cov}(y_i, w_j \mid X)$ of the $i$th element of $y$ or $y'$ and the $j$th element of $w$ or $w'$. It is to be denoted by the symbol $\operatorname{cov}(y, w \mid X)$, $\operatorname{cov}(y', w \mid X)$, $\operatorname{cov}(y, w' \mid X)$, or $\operatorname{cov}(y', w' \mid X)$.
As generalizations of equalities (4.2) and (4.3), we have that
$$\operatorname{var}(y) = E[\operatorname{var}(y \mid X)] + \operatorname{var}[E(y \mid X)] \tag{4.6}$$
and
$$\operatorname{cov}(y, w) = E[\operatorname{cov}(y, w \mid X)] + \operatorname{cov}[E(y \mid X),\, E(w \mid X)]. \tag{4.7}$$
The validity of equalities (4.6) and (4.7) is evident upon observing that equality (4.6) can be regarded as a special case of equality (4.7) and that equality (4.3) implies that each element of the left side of equality (4.7) equals the corresponding element of the right side.
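Equality (4.6) is the familiar decomposition of total variance. A Monte Carlo sketch of ours for a simple (assumed) hierarchy, $x \sim N(0, 1)$ and $y \mid x \sim N(2x, 1)$, in which case $\operatorname{var}(y) = E[\operatorname{var}(y \mid x)] + \operatorname{var}[E(y \mid x)] = 1 + 4 = 5$:

    import numpy as np

    rng = np.random.default_rng(7)
    n = 500_000
    x = rng.standard_normal(n)
    y = 2.0 * x + rng.standard_normal(n)   # conditionally, y | x ~ N(2x, 1)

    print(y.var())                          # approximately 5 = 1 + var(2x)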
Moreover,
$$\int_0^\infty e^{-z^2/2}\,dz = \sqrt{\frac{\pi}{2}}, \tag{5.3}$$
as is well-known and as can be verified by observing that
$$\Bigl(\int_0^\infty e^{-z^2/2}\,dz\Bigr)^2 = \int_0^\infty e^{-z^2/2}\,dz\int_0^\infty e^{-y^2/2}\,dy = \int_0^\infty\!\!\int_0^\infty e^{-(z^2+y^2)/2}\,dz\,dy \tag{5.4}$$
and by evaluating the double integral (5.4) by converting to polar coordinates—refer, e.g., to Casella and Berger (2002, sec. 3.3) for the details.
Together, results (5.2) and (5.3) imply that
$$\int_{-\infty}^{\infty} f(z)\,dz = 1.$$
And, upon observing that $f(z) \ge 0$ for all $z$, we conclude that the function $f(\cdot)$ can serve as a probability density function. The probability distribution determined by this probability density function is referred to as the standard normal (or standard Gaussian) distribution.
b. Gamma function
To obtain convenient expressions for the moments of the standard normal distribution, it is helpful to recall (e.g., from Parzen 1960, pp. 161–163, or Casella and Berger 2002, sec. 3.3) the definition and some basic properties of the gamma function. The gamma function is the function $\Gamma(\cdot)$ defined by
$$\Gamma(x) = \int_0^\infty w^{x-1}e^{-w}\,dw \quad (x > 0). \tag{5.5}$$
By integrating by parts, it can be shown that
$$\Gamma(x+1) = x\,\Gamma(x). \tag{5.6}$$
It is a simple exercise to show that
$$\Gamma(1) = 1. \tag{5.7}$$
And, by repeated application of the recursive formula (5.6), result (5.7) can be generalized to
$$\Gamma(n+1) = n! = n(n-1)(n-2)\cdots 1 \quad (n = 0, 1, 2, \dots). \tag{5.8}$$
(By definition, $0! = 1$.)
By making the change of variable $z = \sqrt{2w}$ in integral (5.5), we find that, for $r > -1$,
$$\Gamma\Bigl(\frac{r+1}{2}\Bigr) = \frac{1}{2^{(r-1)/2}}\int_0^\infty z^re^{-z^2/2}\,dz, \tag{5.9}$$
thereby obtaining an alternative representation for the gamma function. And, upon applying result (5.9) in the special case where $r = 0$ and upon recalling result (5.3), we obtain the formula
$$\Gamma\bigl(\tfrac{1}{2}\bigr) = \sqrt{\pi}. \tag{5.10}$$
This result is extended to $\Gamma(n + \tfrac{1}{2})$ by the formula
$$\Gamma\Bigl(n + \frac{1}{2}\Bigr) = \frac{(2n)!}{4^n\,n!}\sqrt{\pi} = \frac{1\cdot 3\cdot 5\cdot 7\cdots(2n-1)}{2^n}\sqrt{\pi} \quad (n = 0, 1, 2, \dots), \tag{5.11}$$
the validity of which can be established by making use of result (5.6) and employing mathematical induction.
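Results (5.8), (5.10), and (5.11) are easy to confirm numerically with Python's standard library (a sketch of ours):

    import math

    assert math.isclose(math.gamma(0.5), math.sqrt(math.pi))              # (5.10)
    for n in range(6):
        assert math.isclose(math.gamma(n + 1), math.factorial(n))         # (5.8)
        lhs = math.gamma(n + 0.5)
        rhs = math.factorial(2 * n) / (4 ** n * math.factorial(n)) * math.sqrt(math.pi)
        assert math.isclose(lhs, rhs)                                     # (5.11)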
[In result (5.12), the symbol z is used to represent a variable of integration as well as a random
variable. In circumstances where this kind of dual usage might result in confusion or ambiguity,
either altogether different symbols are to be used for a random variable (or random vector or random
matrix) and for a related quantity (such as a variable of integration or a value of the random variable),
or the related quantity is to be distinguished from the random quantity simply by underlining whatever
symbol is used for the random quantity.]
We have that, for $r = 1, 2, 3, \dots$,
$$\int_{-\infty}^0 z^rf(z)\,dz = (-1)^r\int_0^\infty z^rf(z)\,dz \tag{5.13}$$
[as can be readily verified by making the change of variable $y = -z$ and recalling result (5.1)] and that, for $r > -1$,
$$\int_0^\infty z^rf(z)\,dz = \frac{1}{\sqrt{2\pi}}\int_0^\infty z^re^{-z^2/2}\,dz = \frac{2^{(r/2)-1}}{\sqrt{\pi}}\,\Gamma\Bigl(\frac{r+1}{2}\Bigr) \tag{5.14}$$
[as is evident from result (5.9)].
Now, starting with expression (5.12) and making use of results (5.13), (5.14), (5.8), and (5.11), we find that, for $r = 1, 2, 3, \dots$,
$$E(|z|^r) = 2\int_0^\infty z^rf(z)\,dz = \sqrt{\frac{2^r}{\pi}}\,\Gamma\Bigl(\frac{r+1}{2}\Bigr) \tag{5.15}$$
$$= \begin{cases} \sqrt{2^r/\pi}\,\bigl[(r-1)/2\bigr]!, & \text{if } r \text{ is odd,} \\ (r-1)(r-3)(r-5)\cdots 7\cdot 5\cdot 3\cdot 1, & \text{if } r \text{ is even.} \end{cases} \tag{5.16}$$
Accordingly, the $r$th moment of the standard normal distribution exists for $r = 1, 2, 3, \dots$. For $r = 1, 3, 5, \dots$, we find [in light of result (5.13)] that
$$E(z^r) = \int_0^\infty z^rf(z)\,dz + (-1)^r\int_0^\infty z^rf(z)\,dz = 0. \tag{5.17}$$
Thus, the odd moments of the standard normal distribution equal 0, while the even moments are given by formula (5.18).
In particular, we have that
$$E(z) = 0 \quad\text{and}\quad \operatorname{var}(z) = E(z^2) = 1.$$
That is, the standard normal distribution has a mean of 0 and a variance of 1. Further, the third and fourth moments of the standard normal distribution are
$$E(z^3) = 0 \quad\text{and}\quad E(z^4) = 3.$$
In the degenerate case, the random variable $x$ [defined by equality (5.21)] satisfies
$$\Pr(x = \mu) = 1, \tag{5.24}$$
so that the distribution of $x$ is completely concentrated at a single value, namely $\mu$. Note that the distribution of $x$ depends on $\sigma$ only through the value of $\sigma^2$.
Let us refer to an absolutely continuous distribution having a probability density function of the form (5.23) and also to a "degenerate" distribution of the form (5.24) as a normal (or Gaussian) distribution. Accordingly, there is a family of normal distributions, the members of which are indexed by the mean and the variance of the distribution. The symbol $N(\mu, \sigma^2)$ is used to denote a normal distribution with mean $\mu$ and variance $\sigma^2$. Note that the $N(0, 1)$ distribution is identical to the standard normal distribution.
The $r$th central moment of the random variable $x$ defined by equality (5.21) is expressible as $E[(x-\mu)^r] = \sigma^rE(z^r)$. We find in particular [upon applying result (5.25) in the special case where $r = 3$ and result (5.26) in the special case where $r = 4$] that the third and fourth central moments of the $N(\mu, \sigma^2)$ distribution are
$$E[(x-\mu)^3] = 0 \quad\text{and}\quad E[(x-\mu)^4] = 3\sigma^4 \tag{5.27}$$
—in the special case where $r = 2$, result (5.26) simplifies to $\operatorname{var}(x) = \sigma^2$.
The form of the probability density function of a (nondegenerate) normal distribution is illustrated in Figure 3.1.
[Figure 3.1 appears here in the original: three superimposed density curves, for $\sigma = 0.625$, $\sigma = 1$, and $\sigma = 1.6$, plotted against $x - \mu$ over the range $-4$ to $4$.]
FIGURE 3.1. The probability density function $f(\cdot)$ of a (nondegenerate) $N(\mu, \sigma^2)$ distribution: plot of $f(x)$ against $x$ for each of 3 values of $\sigma$.
e. Multivariate extension
Let us now extend the approach taken in Subsection d [in defining the (univariate) normal distribution] to the multivariate case.
Let us begin by considering the distribution of an $M$-dimensional random column vector, say $z$, whose elements are statistically independent and individually have standard normal distributions. This distribution is referred to as the $M$-variate (or multivariate) standard normal (or standard Gaussian) distribution. It has the probability density function $f(\cdot)$ defined by
$$f(z) = \frac{1}{(2\pi)^{M/2}}\exp\bigl(-\tfrac{1}{2}z'z\bigr) \quad (z \in \mathbb{R}^M). \tag{5.28}$$
Its mean vector and variance-covariance matrix are
$$E(z) = 0 \quad\text{and}\quad \operatorname{var}(z) = I_M.$$
Probability density function: existence, derivation, and geometrical form. Let us consider the distribution of the random vector $x$ [defined by equality (5.30)]. If $\operatorname{rank}(\Gamma) < M$ or, equivalently, if $\operatorname{rank}(\Sigma) < M$ (in which case $\Sigma$ is positive semidefinite), then the distribution of $x$ has no probability density function. Suppose now that $\operatorname{rank}(\Gamma) = M$ or, equivalently, that $\operatorname{rank}(\Sigma) = M$ (in which case $\Sigma$ is positive definite and $N \ge M$). Then, the distribution of $x$ has a probability density function, which we now proceed to derive.
Take $\Lambda$ to be an $N \times (N-M)$ matrix whose columns form an orthonormal (with respect to the usual inner product) basis for $N(\Gamma')$—according to Lemma 2.11.5, $\dim[N(\Gamma')] = N - M$. Then, observing that $\Lambda'\Lambda = I$ and $\Gamma'\Lambda = 0$ and making use of Lemmas 2.12.1 and 2.6.1, we find that
$$h(x, w) = \frac{1}{(2\pi)^{N/2}|\Sigma|^{1/2}}\exp\Bigl[-\frac{1}{2}\begin{pmatrix} x - \mu \\ w \end{pmatrix}'\begin{pmatrix} \Sigma^{-1} & 0 \\ 0 & I \end{pmatrix}\begin{pmatrix} x - \mu \\ w \end{pmatrix}\Bigr] = \frac{1}{(2\pi)^{M/2}|\Sigma|^{1/2}}\exp\bigl[-\tfrac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\bigr]\cdot\frac{1}{(2\pi)^{(N-M)/2}}\exp\bigl(-\tfrac{1}{2}w'w\bigr).$$
Thus, the distribution of $x$ has the probability density function $f(\cdot)$ given by
$$f(x) = \int_{-\infty}^\infty\!\!\cdots\!\int_{-\infty}^\infty h(x, w)\,dw = \frac{1}{(2\pi)^{M/2}|\Sigma|^{1/2}}\exp\bigl[-\tfrac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\bigr]. \tag{5.32}$$
Each of the contour lines or surfaces of $f(\cdot)$ consists of the points in a set of the form
$$\{x : (x - \mu)'\Sigma^{-1}(x - \mu) = c\},$$
where $c$ is a nonnegative scalar—$f(\cdot)$ has the same value for every point in the set. When $M = 2$, each of these lines or surfaces is an ellipse. More generally (when $M \ge 2$), each is an $M$-dimensional ellipsoid. In the special case where $\Sigma$ (and hence $\Sigma^{-1}$) is a scalar multiple of $I_M$, each of the contour lines or surfaces is (when $M = 2$) a circle or (when $M \ge 2$) an $M$-dimensional sphere.
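Density (5.32) translates directly into code; a sketch of ours (assuming SciPy is available for the cross-check, and with an arbitrary mean vector and positive definite $\Sigma$):

    import numpy as np
    from scipy.stats import multivariate_normal

    def mvn_pdf(x, mu, Sigma):
        # Density (5.32): (2*pi)^(-M/2) |Sigma|^(-1/2) exp[-(x-mu)' Sigma^{-1} (x-mu) / 2]
        M = len(mu)
        d = x - mu
        quad = d @ np.linalg.solve(Sigma, d)
        return np.exp(-0.5 * quad) / ((2.0 * np.pi) ** (M / 2) * np.sqrt(np.linalg.det(Sigma)))

    mu = np.array([1.0, -1.0])
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])
    x = np.array([0.5, 0.0])
    assert np.isclose(mvn_pdf(x, mu, Sigma), multivariate_normal(mu, Sigma).pdf(x))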
Uniqueness property. The matrix $\Sigma = \Gamma'\Gamma$ has the same value for various choices of the $N \times M$ matrix $\Gamma$ that differ with regard to their respective entries and/or with regard to the value of $N$. However, the distribution of the random vector $x = \mu + \Gamma'z$ is the same for all such choices—it depends on $\Gamma$ only through the value of $\Sigma = \Gamma'\Gamma$. That this is the case when $\Sigma$ is positive definite is evident from result (5.32). That it is the case in general (i.e., even if $\Sigma$ is positive semidefinite) is established in Subsection f.
Definition and notation. Let us refer to the distribution of the random vector $x = \mu + \Gamma'z$ as an $M$-variate (or multivariate) normal (or Gaussian) distribution. Accordingly, there is a family of $M$-variate normal distributions, the members of which are indexed by the mean vector and the variance-covariance matrix of the distribution. For every $M$-dimensional column vector $\mu$ and every $M \times M$ symmetric nonnegative definite matrix $\Sigma$, there is an $M$-variate normal distribution having $\mu$ as its mean vector and $\Sigma$ as its variance-covariance matrix (as is evident upon recalling, from Corollary 2.13.25, that every symmetric nonnegative definite matrix is expressible in the form $\Gamma'\Gamma$). The symbol $N(\mu, \Sigma)$ is used to denote an MVN (multivariate normal) distribution with mean vector $\mu$ and variance-covariance matrix $\Sigma$. Note that the $N(0, I_M)$ distribution is identical to the $M$-variate standard normal distribution.
We conclude that $w = \mu + \Lambda'y$ has the same probability distribution as $x = \mu + \Gamma'z$.
$$f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\Bigl\{-\frac{1}{2(1-\rho^2)}\Bigl[\Bigl(\frac{x-\mu_1}{\sigma_1}\Bigr)^2 - 2\rho\,\frac{x-\mu_1}{\sigma_1}\cdot\frac{y-\mu_2}{\sigma_2} + \Bigl(\frac{y-\mu_2}{\sigma_2}\Bigr)^2\Bigr]\Bigr\} \tag{5.36}$$
($-\infty < x < \infty$, $-\infty < y < \infty$).
The form of the probability density function (5.36) and the effect on the probability density function of changes in $\rho$ and $\sigma_2/\sigma_1$ are illustrated in Figure 3.2.
[Figure 3.2 appears here in the original: six contour maps, with panels labeled by the values of $\sigma_2/\sigma_1$ and $\rho$ (e.g., $\sigma_2/\sigma_1 = 1$, $\rho = 0$; $\sigma_2/\sigma_1 = 1.6$, $\rho = 0.8$) and with axes $(x - \mu_1)/\sigma_1$ and $(y - \mu_2)/\sigma_1$.]
FIGURE 3.2. Contour maps of the probability density function $f(\cdot, \cdot)$ of the distribution of random variables $x$ and $y$ that are jointly normal with $E(x) = \mu_1$, $E(y) = \mu_2$, $\operatorname{var} x = \sigma_1^2$ ($\sigma_1 > 0$), $\operatorname{var} y = \sigma_2^2$ ($\sigma_2 > 0$), and $\operatorname{corr}(x, y) = \rho$. The 6 maps are arranged in 3 rows, corresponding to values of $\sigma_2/\sigma_1$ of 0.625, 1, and 1.6, respectively, and in 2 columns, corresponding to values of $\rho$ of 0 and 0.8. The coordinates of the points of each contour line are the values of $(x - \mu_1)/\sigma_1$ and $(y - \mu_2)/\sigma_1$ at which $f(x, y) = k/(2\pi\sigma_1^2)$, where $k = 0.05, 0.3, 0.55, 0.8$, or 1.05. Contour maps corresponding to $\rho = -0.8$ could be obtained by forming the mirror images of those corresponding to $\rho = 0.8$.
j. Marginal distributions
Let $x$ represent an $M$-dimensional random column vector whose distribution is $N(\mu, \Sigma)$, and consider the distribution of a subvector of $x$, say the $M_*$-dimensional subvector $x_*$ obtained by striking out all of the elements of $x$ except the $j_1, j_2, \dots, j_{M_*}$th elements. Clearly,
$$x_* = Ax,$$
where $A$ is the $M_* \times M$ submatrix of $I_M$ obtained by striking out all of the rows of $I_M$ except the $j_1, j_2, \dots, j_{M_*}$th rows—if the elements of $x_*$ are the first $M_*$ elements of $x$, then $A = (I, 0)$. Accordingly, it follows from Theorem 3.5.1 that
$$x_* \sim N(A\mu,\; A\Sigma A').$$
k. Statistical independence
Let $x_1, x_2, \dots, x_P$ represent random column vectors having expected values $\mu_i = E(x_i)$ ($i = 1, 2, \dots, P$) and covariances $\Sigma_{ij} = \operatorname{cov}(x_i, x_j)$ ($i, j = 1, 2, \dots, P$). And for $i = 1, 2, \dots, P$, denote by $M_i$ the number of elements in $x_i$.
Let $x_{is}$ represent the $s$th element of $x_i$ ($i = 1, 2, \dots, P$; $s = 1, 2, \dots, M_i$). If $x_i$ and $x_j$ are statistically independent, in which case $x_{is}$ and $x_{jt}$ are statistically independent for every $s$ and every $t$, then (according to Lemma 3.2.2) $\Sigma_{ij} = 0$ ($j \ne i = 1, 2, \dots, P$). In general, the converse is not true (as is well-known and as could be surmised from the discussion of Section 3.2c). That is, $x_i$ and $x_j$ being uncorrelated does not necessarily imply their statistical independence. However, when their joint distribution is MVN, $x_i$ and $x_j$ being uncorrelated does imply their statistical independence. More generally, when the joint distribution of $x_1, x_2, \dots, x_P$ is MVN, $\Sigma_{ij} = 0$ (i.e., $x_i$ and $x_j$ being uncorrelated) for $j \ne i = 1, 2, \dots, P$ implies that $x_1, x_2, \dots, x_P$ are mutually (jointly) independent.
To see this, let
$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_P \end{pmatrix},$$
and observe that $E(x) = \mu$ and $\operatorname{var}(x) = \Sigma$, where
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_P \end{pmatrix} \quad\text{and}\quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \dots & \Sigma_{1P} \\ \Sigma_{21} & \Sigma_{22} & \dots & \Sigma_{2P} \\ \vdots & \vdots & & \vdots \\ \Sigma_{P1} & \Sigma_{P2} & \dots & \Sigma_{PP} \end{pmatrix}.$$
And suppose that the distribution of $x$ is MVN and that
$$\Sigma_{ij} = 0 \quad (j \ne i = 1, 2, \dots, P).$$
Further, define
$$\Gamma = \operatorname{diag}(\Gamma_1, \Gamma_2, \dots, \Gamma_P),$$
where (for $i = 1, 2, \dots, P$) $\Gamma_i$ is any matrix such that $\Sigma_{ii} = \Gamma_i'\Gamma_i$. Then, since $z_1, z_2, \dots, z_P$ are statistically independent,
so are the vector-valued functions $\mu_1 + \Gamma_1'z_1, \mu_2 + \Gamma_2'z_2, \dots, \mu_P + \Gamma_P'z_P$ and hence so are $x_1, x_2, \dots, x_P$—vector-valued functions of statistically independent random vectors are statistically independent, as is evident, for example, from the discussion of Casella and Berger (2002, sec. 4.6) or Bickel and Doksum (2001, app. A).
In summary, we have the following theorem.
Theorem 3.5.4. Let $x_1, x_2, \dots, x_P$ represent random column vectors whose joint distribution is MVN. Then, $x_1, x_2, \dots, x_P$ are distributed independently if (and only if)
$$\operatorname{cov}(x_i, x_j) = 0 \quad (j > i = 1, 2, \dots, P).$$
Note that the coverage of Theorem 3.5.4 includes the case where each of the random vectors $x_1, x_2, \dots, x_P$ is of dimension 1 and hence is in effect a random variable. In the special case where $P = 2$, Theorem 3.5.4 can be restated in the form of the following corollary.
Corollary 3.5.5. Let $x$ and $y$ represent random column vectors whose joint distribution is MVN. Then, $x$ and $y$ are statistically independent if (and only if) $\operatorname{cov}(x, y) = 0$.
As an additional corollary of Theorem 3.5.4, we have the following result.
Corollary 3.5.6. Let $x$ represent an $N$-dimensional random column vector whose distribution is MVN; and, for $i = 1, 2, \dots, P$, let $y_i = c_i + A_ix$, where $c_i$ is an $M_i$-dimensional nonrandom column vector and $A_i$ is an $M_i \times N$ nonrandom matrix. Then, $y_1, y_2, \dots, y_P$ are distributed independently if (and only if)
$$\operatorname{cov}(y_i, y_j) = 0 \quad (j > i = 1, 2, \dots, P).$$
Proof. Let 0 1 0 1 0 1
y1 c1 A1
B y2 C B c2 C B A2 C
y D B : C; c D B : C; and A D B : C:
B C B C B C
@ :: A @ :: A @ :: A
yP cP AP
Then,
y D c C Ax;
implying (in light of Theorem 3.5.1) that the joint distribution of y1; y2 ; : : : ; yP is MVN. Accordingly,
it follows from Theorem 3.5.4 that y1 ; y2 ; : : : ; yP are distributed independently if (and only if)
cov.yi ; yj / D 0 .j > i D 1; 2; : : : ; P /: Q.E.D.
If each of two or more independently distributed random vectors has an MVN distribution, then,
as indicated by the following theorem, their joint distribution is MVN.
Theorem 3.5.7. For i = 1, 2, ..., P, let xi represent an Mi-dimensional random column vector whose distribution is N(μi, Σi). If x1, x2, ..., xP are mutually independent, then the distribution of the random vector x defined by x′ = (x1′, x2′, ..., xP′) is N(μ, Σ), where μ′ = (μ1′, μ2′, ..., μP′) and Σ = diag(Σ1, Σ2, ..., ΣP).

Proof. For i = 1, 2, ..., P, take Γi to be a matrix (having Mi columns) such that Σi = Γi′Γi—the existence of such a matrix follows from Corollary 2.13.25—and denote by Ni the number of rows in Γi. And let N = Σ_{i=1}^P Ni. Further, let Γ = diag(Γ1, Γ2, ..., ΓP), and define z = (z1′, z2′, ..., zP′)′, where z ∼ N(0, IN) and where (for i = 1, 2, ..., P) zi is of dimension Ni. Since z1, z2, ..., zP are statistically independent and (for i = 1, 2, ..., P) μi + Γi′zi ∼ N(μi, Σi), and upon observing that

$$\mu + \Gamma' z = \begin{pmatrix} \mu_1 + \Gamma_1' z_1 \\ \mu_2 + \Gamma_2' z_2 \\ \vdots \\ \mu_P + \Gamma_P' z_P \end{pmatrix},$$

it follows that μ + Γ′z has the same distribution as x. Clearly, z ∼ N(0, IN) and Γ′Γ = Σ. Thus, μ + Γ′z ∼ N(μ, Σ) and hence x ∼ N(μ, Σ). Q.E.D.
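A quick Monte Carlo sketch of Theorem 3.5.7 (Python with NumPy and SciPy's block_diag; the block dimensions and parameter values are arbitrary choices for illustration): stacking draws from independent MVN blocks produces sample moments close to μ and diag(Σ1, Σ2).

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(0)

# Two independent blocks (illustrative values)
mu1, Sigma1 = np.array([0.0, 1.0]), np.array([[1.0, 0.3], [0.3, 2.0]])
mu2, Sigma2 = np.array([5.0]), np.array([[0.5]])

n = 200_000
x1 = rng.multivariate_normal(mu1, Sigma1, size=n)
x2 = rng.multivariate_normal(mu2, Sigma2, size=n)
x = np.hstack([x1, x2])              # x' = (x1', x2')

mu = np.concatenate([mu1, mu2])
Sigma = block_diag(Sigma1, Sigma2)   # Sigma = diag(Sigma1, Sigma2)

# Sample moments should be close to (mu, Sigma)
print(np.round(x.mean(axis=0) - mu, 3))
print(np.round(np.cov(x, rowvar=False) - Sigma, 3))
```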
Making use of Theorems 2.14.22 and 2.6.6, we find that the conditional distribution of y given x is the distribution with probability density function f(· | ·) given by

$$f(y \mid x) = \frac{h(x, y)}{h_1(x)} = \frac{1}{(2\pi)^{M_2/2}\, c^{1/2}} \exp\!\big[-\tfrac{1}{2}\, q(x, y)\big],$$

where

c = |Σ|/|Σ11| = |Σ11||V|/|Σ11| = |V|

and

$$\begin{aligned} q(x, y) &= \begin{pmatrix} x - \mu_1 \\ y - \mu_2 \end{pmatrix}' \begin{pmatrix} \Sigma_{11} & \Sigma_{21}' \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} \begin{pmatrix} x - \mu_1 \\ y - \mu_2 \end{pmatrix} - (x - \mu_1)' \Sigma_{11}^{-1} (x - \mu_1) \\[4pt] &= \begin{pmatrix} x - \mu_1 \\ y - \mu_2 \end{pmatrix}' \begin{pmatrix} \Sigma_{11}^{-1} + \Sigma_{11}^{-1}\Sigma_{21}' V^{-1} \Sigma_{21}\Sigma_{11}^{-1} & -\Sigma_{11}^{-1}\Sigma_{21}' V^{-1} \\ -V^{-1}\Sigma_{21}\Sigma_{11}^{-1} & V^{-1} \end{pmatrix} \begin{pmatrix} x - \mu_1 \\ y - \mu_2 \end{pmatrix} - (x - \mu_1)' \Sigma_{11}^{-1} (x - \mu_1) \\[4pt] &= [\,y - \mu(x)\,]' V^{-1} [\,y - \mu(x)\,]. \end{aligned}$$

The probability density function of the conditional distribution of y given x is seen to be that of the MVN distribution with mean vector μ(x) and variance-covariance matrix V. Thus, we have the following theorem.
Theorem 3.5.8. Let x and y represent random column vectors whose joint distribution is MVN, and let μ1 = E(x), μ2 = E(y), Σ11 = var(x), Σ22 = var(y), and Σ21 = cov(y, x). Then, under the supposition that

$$\operatorname{var}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{21}' \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$

is positive definite, the conditional distribution of y given x is N[μ(x), V], where

μ(x) = μ2 + Σ21Σ11⁻¹(x − μ1)  and  V = Σ22 − Σ21Σ11⁻¹Σ21′.
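Before continuing the derivation, here is a minimal sketch of the computations prescribed by Theorem 3.5.8 (Python with NumPy; the partitioned moments are arbitrary illustrative values chosen so that Σ is positive definite):

```python
import numpy as np

# Partitioned moments of (x', y')' (illustrative values only)
mu1 = np.array([0.0, 0.0])                   # E(x)
mu2 = np.array([1.0])                        # E(y)
S11 = np.array([[2.0, 0.5], [0.5, 1.0]])     # var(x)
S21 = np.array([[0.8, 0.3]])                 # cov(y, x)
S22 = np.array([[1.5]])                      # var(y)

def conditional(x):
    """Mean vector and variance-covariance matrix of y given x (Theorem 3.5.8)."""
    mean = mu2 + S21 @ np.linalg.solve(S11, x - mu1)   # mu(x) = mu2 + S21 S11^{-1}(x - mu1)
    V = S22 - S21 @ np.linalg.solve(S11, S21.T)        # V = S22 - S21 S11^{-1} S21'
    return mean, V

m, V = conditional(np.array([1.0, -0.5]))
```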
var[μ(x)] = Σ21Σ11⁻¹Σ11(Σ21Σ11⁻¹)′ = Σ21(Σ21Σ11⁻¹)′ = Σ21Σ11⁻¹Σ21′

and

cov[μ(x), x] = cov[μ(x), Ix] = Σ21Σ11⁻¹Σ11 I′ = Σ21.

And, upon recalling results (2.48) and (2.50), it follows that E(e) = 0, var(e) = V, and cov(e, x) = 0.
(3) Suppose that the joint distribution of x and y is MVN. Then, upon observing that e and x are expressible as linear (affine) transformations of the vector (x′, y′)′, it follows from Theorem 3.5.1 that the joint distribution of e and x is MVN. Since [according to Part (2)] cov(e, x) = 0, we conclude (on the basis of Corollary 3.5.5) that e and x are statistically independent. To establish that the distribution of e is N(0, V), it suffices [since it has already been established in Part (2) that E(e) = 0 and var(e) = V] to observe (e.g., on the basis of Theorem 3.5.3) that the distribution of e is MVN. Q.E.D.
and define

e = y − μ(x).

Observe (in light of Theorem 2.13.25) that Σ = Γ′Γ for some matrix Γ. Accordingly, Σ11 = Γ1′Γ1 and Σ21 = Γ2′Γ1 for a suitable partitioning Γ = (Γ1, Γ2). And making use of Theorem 2.12.2 [and equating (Γ1′Γ1)⁻ and Σ11⁻], we find that

Σ21Σ11⁻Σ11 = Γ2′Γ1(Γ1′Γ1)⁻Γ1′Γ1 = Γ2′Γ1 = Σ21. (5.38)

By taking advantage of equality (5.38) and by proceeding in the same fashion as in proving Parts (1) and (2) of Theorem 3.5.9, we obtain the following results:

E[μ(x)] = μ2 and var[μ(x)] = cov[μ(x), y] = Σ21Σ11⁻Σ21′;
E(e) = 0, var(e) = V, and cov(e, x) = 0. (5.39)

Before proceeding, let us consider the extent to which μ(x) and Σ21Σ11⁻Σ21′ (and hence V) are invariant to the choice of the generalized inverse Σ11⁻. Recalling result (5.37) [and equating (Γ1′Γ1)⁻ and Σ11⁻], we find that

Σ21Σ11⁻Σ21′ = Γ2′Γ1(Γ1′Γ1)⁻(Γ2′Γ1)′ = Γ2′[Γ1(Γ1′Γ1)⁻Γ1′]Γ2.

And, based on Theorem 2.12.2, we conclude that Σ21Σ11⁻Σ21′ and V are invariant to the choice of Σ11⁻.

With regard to μ(x), the situation is a bit more complicated. If x is such that x − μ1 ∈ C(Σ11), then there exists a column vector t such that

x − μ1 = Σ11 t,

in which case Σ21Σ11⁻(x − μ1) = Σ21Σ11⁻Σ11 t = Σ21 t [in light of equality (5.38)], and hence Σ21Σ11⁻(x − μ1) is invariant to the choice of Σ11⁻. Thus, μ(x) is invariant to the choice of Σ11⁻ for every x such that x − μ1 ∈ C(Σ11). Moreover, x − μ1 ∈ C(Σ11) is an event of probability one. To see this, observe (in light of Lemmas 2.4.2 and 2.11.2) that the random vector (I − Σ11Σ11⁻)(x − μ1) has mean 0 and variance-covariance matrix (I − Σ11Σ11⁻)Σ11(I − Σ11Σ11⁻)′ = 0 and hence equals 0 with probability one, so that (with probability one) x − μ1 = Σ11Σ11⁻(x − μ1) ∈ C(Σ11).

Thus, we have the following theorem.
Theorem 3.5.11. Let x and y represent random column vectors, and let μ1 = E(x), μ2 = E(y), Σ11 = var(x), Σ22 = var(y), and Σ21 = cov(y, x). Further, let

μ(x) = μ2 + Σ21Σ11⁻(x − μ1)  and  V = Σ22 − Σ21Σ11⁻Σ21′,

and define

e = y − μ(x).

Then,

(0) Σ21Σ11⁻Σ21′ and V are invariant to the choice of Σ11⁻, and for x such that x − μ1 ∈ C(Σ11), μ(x) is invariant to the choice of Σ11⁻; moreover, x − μ1 ∈ C(Σ11) is an event of probability one;

(1) E[μ(x)] = μ2, and var[μ(x)] = cov[μ(x), y] = Σ21Σ11⁻Σ21′;

(2) E(e) = 0, var(e) = V, and cov(e, x) = 0; and

(3) under the assumption that the joint distribution of x and y is MVN, the distribution of e is N(0, V) and e and x are statistically independent.
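A small numerical sketch of the quantities in Theorem 3.5.11 (Python with NumPy; the singular Σ11 and the other values are arbitrary illustrations): np.linalg.pinv supplies one particular choice of generalized inverse, and confining x − μ1 to C(Σ11) keeps μ(x) well-defined, as asserted in Part (0).

```python
import numpy as np

# Singular var(x): the 2nd element of x duplicates the 1st (illustrative)
mu1 = np.array([0.0, 0.0])
S11 = np.array([[1.0, 1.0], [1.0, 1.0]])     # rank 1, so S11 is singular
S21 = np.array([[0.6, 0.6]])                 # cov(y, x), consistent with C(S11)
S22 = np.array([[1.0]])

G = np.linalg.pinv(S11)                      # one choice of generalized inverse S11^-
V = S22 - S21 @ G @ S21.T                    # invariant to the choice of S11^-

# For x with x - mu1 in the column space of S11, mu(x) is also invariant:
x = mu1 + S11 @ np.array([0.7, -0.2])        # guarantees x - mu1 in C(S11)
mu_x = np.array([1.0]) + S21 @ G @ (x - mu1)
```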
$$E\big(z_1^{k_1} z_2^{k_2} \cdots z_N^{k_N}\big) = E\big(z_1^{k_1}\big)\,E\big(z_2^{k_2}\big) \cdots E\big(z_N^{k_N}\big). \tag{5.40}$$

In particular, since the odd moments of a standard normal random variable equal zero,

$$E\big(z_1^{k_1} z_2^{k_2} \cdots z_N^{k_N}\big) = 0 \quad \text{for } k_1, k_2, \ldots, k_N \text{ such that } \textstyle\sum_i k_i = 1, 3, 5, 7, \ldots. \tag{5.42}$$
Now, consider the third-order central moments of the MVN distribution. Making use of result (5.42), we find that, for i, j, s = 1, 2, ..., M,

$$E[(x_i - \mu_i)(x_j - \mu_j)(x_s - \mu_s)] = E\Big[\Big(\sum_{i'} \gamma_{i'i}\, z_{i'}\Big)\Big(\sum_{j'} \gamma_{j'j}\, z_{j'}\Big)\Big(\sum_{s'} \gamma_{s's}\, z_{s'}\Big)\Big] = \sum_{i',\, j',\, s'} \gamma_{i'i}\,\gamma_{j'j}\,\gamma_{s's}\, E(z_{i'} z_{j'} z_{s'}) = 0. \tag{5.43}$$

That is, all of the third-order central moments of the MVN distribution equal zero. In fact, by proceeding in similar fashion, we can establish the more general result that, for every odd positive integer r (i.e., for r = 1, 3, 5, 7, ...), all of the rth-order central moments of the MVN distribution equal zero.
Turning to the fourth-order central moments of the MVN distribution, we find [in light of results (5.40), (5.19), and (5.20)] that, for i, j, s, t = 1, 2, ..., M,

E[(xi − μi)(xj − μj)(xs − μs)(xt − μt)] = σij σst + σis σjt + σit σjs;

in particular,

E[(xi − μi)⁴] = 3σii²,

in agreement with the expression given earlier [in result (5.27)] for the fourth central moment of the univariate normal distribution.
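These moment results are easy to check by simulation. A hedged sketch (Python with NumPy; the mean vector and variance-covariance matrix are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])   # illustrative values

x = rng.multivariate_normal(mu, Sigma, size=2_000_000)
d = x - mu

# All third-order central moments are (approximately) zero, per (5.43):
print(np.mean(d[:, 0]**2 * d[:, 1]))             # ~ 0
print(np.mean(d[:, 0] * d[:, 1]**2))             # ~ 0

# E[(x_i - mu_i)^4] = 3 * sigma_ii^2:
print(np.mean(d[:, 0]**4), 3 * Sigma[0, 0]**2)   # ~ 12 vs 12
```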
$$t'\Gamma' z - \tfrac{1}{2}\, z'z = -\tfrac{1}{2}\,(z - \Gamma t)'(z - \Gamma t) + \tfrac{1}{2}\, t'\Sigma t. \tag{5.46}$$

Moreover,

$$(2\pi)^{-N/2} \int_{\mathbb{R}^N} \exp\!\big[-\tfrac{1}{2}(z - \Gamma t)'(z - \Gamma t)\big]\, dz = 1,$$

as is evident upon making a change of variables from z to y = z − Γt or upon observing that the integrand is a probability density function [that of the N(Γt, I) distribution].
Thus, the moment generating function m(·) of the M-variate normal distribution with mean vector μ and variance-covariance matrix Σ is given by

m(t) = exp(t′μ + ½ t′Σt). (5.47)

Corresponding to the moment generating function m(·) is the cumulant generating function, say c(·), of this distribution, which is given by

c(t) = log m(t) = t′μ + ½ t′Σt.

Conversely, suppose that x is an M-dimensional random column vector such that, for every M-dimensional nonrandom column vector a, the distribution of a′x is (univariate) normal. Let μ = E(x) and Σ = var(x), and observe that (for every a) E(a′x) = a′μ and var(a′x) = a′Σa. Then, recalling the results of Subsection o and denoting by m*(·; a) the moment generating function of the N(a′μ, a′Σa) distribution, we find that (for every a) the moment generating function of a′x is m*(·; a). And we conclude that the distribution of x has a moment generating function, say m(·), and that

m(t) = m*(1; t) = exp(t′μ + ½ t′Σt). (5.50)

A comparison of expression (5.50) with expression (5.47) reveals that the moment generating function of the distribution of x is the same as that of the N(μ, Σ) distribution. If two distributions have the same moment generating function, they are identical (e.g., Casella and Berger 2002, p. 65; Bickel and Doksum 2001, pp. 460 and 505). Consequently, the distribution of x is MVN.

In summary, we have the following characterization of multivariate normality.

Theorem 3.5.12. The distribution of the M-dimensional random column vector x = {xi} is MVN if and only if, for every M-dimensional nonrandom column vector a = {ai}, the distribution of the linear combination a′x = Σ_{i=1}^M ai xi is (univariate) normal.
Exercises
Exercise 1. Provide detailed verifications for (1) equality (1.7), (2) equality (1.8), (3) equality (1.10),
and (4) equality (1.9).
Exercise 2.
(a) Let w and z represent random variables [such that E(w²) < ∞ and E(z²) < ∞]. Show that

|E(wz)| ≤ E(|wz|) ≤ √E(w²) √E(z²); (E.1)
and determine the conditions under which the first inequality holds as an equality, the conditions
under which the second inequality holds as an equality, and the conditions under which both
inequalities hold as equalities.
(b) Let x and y represent random variables [such that E(x²) < ∞ and E(y²) < ∞]. Using Part (a) (or otherwise), show that

|cov(x, y)| ≤ E[|x − E(x)||y − E(y)|] ≤ √var(x) √var(y); (E.2)
and determine the conditions under which the first inequality holds as an equality, the conditions
under which the second inequality holds as an equality, and the conditions under which both
inequalities hold as equalities.
Exercise 3. Let x represent an N-dimensional random column vector and y a T-dimensional random column vector. And define x* to be an R-dimensional subvector of x and y* an S-dimensional subvector of y (where 1 ≤ R ≤ N and 1 ≤ S ≤ T). Relate E(x*) to E(x), var(x*) to var(x), and cov(x*, y*) and cov(y*, x*) to cov(x, y).
Exercise 4. Let x represent a random variable that is distributed symmetrically about 0 (so that −x ∼ x); and suppose that the distribution of x is "nondegenerate" in the sense that there exists a nonnegative constant c such that 0 < Pr(x > c) < ½ [and assume that E(x²) < ∞]. Further, define y = |x|.
Exercise 5. Provide detailed verifications for (1) equality (2.39), (2) equality (2.45), and (3) equality
(2.48).
Exercise 6.
(a) Let x and y represent random variables. Show that cov(x, y) can be determined from knowledge of var(x), var(y), and var(x + y), and give a formula for doing so.
(b) Let x = (x1, x2)′ and y = (y1, y2)′ represent 2-dimensional random column vectors. Can cov(x, y) be determined from knowledge of var(x), var(y), and var(x + y)? Why or why not?
Exercise 7. Let x represent an M-dimensional random column vector with mean vector μ and variance-covariance matrix Σ. Show that there exist M(M+1)/2 linear combinations of the M elements of x such that μ can be determined from knowledge of the expected values of M of these linear combinations and Σ can be determined from knowledge of the [M(M+1)/2] variances of these linear combinations.
Exercise 8. Let x and y represent random variables (whose expected values and variances exist), let V represent the variance-covariance matrix of the random vector (x, y)′, and suppose that var(x) > 0 and var(y) > 0.
(a) Show that if |V| = 0, then, for scalars a and b,

(a, b)′ ∈ N(V) ⟺ b √(var y) = c a √(var x),

where c = +1 when cov(x, y) < 0 and c = −1 when cov(x, y) > 0.
(b) Use the result of Part (a) and the results of Section 3.2e to devise an alternative proof of the result (established in Section 3.2a) that cov(x, y) = √(var x) √(var y) if and only if [y − E(y)]/√(var y) = [x − E(x)]/√(var x) with probability 1 and that cov(x, y) = −√(var x) √(var y) if and only if [y − E(y)]/√(var y) = −[x − E(x)]/√(var x) with probability 1.
Exercise 9. Let x represent an N-dimensional random column vector (with elements whose expected values and variances exist). Show that (regardless of the rank of var x) there exist a nonrandom column vector c and an N × N nonsingular nonrandom matrix A such that the random vector w, defined implicitly by x = c + A′w, has mean 0 and a variance-covariance matrix of the form diag(I, 0) [where diag(I, 0) is to be regarded as including 0 and I as special cases].
Exercise 10. Establish the validity of result (5.11).
Exercise 11. Let w = |z|, where z is a standard normal random variable.
(a) Find a probability density function for the distribution of w.
(b) Use the expression obtained in Part (a) (for the probability density function of the distribution of w) to derive formula (5.15) for E(w^r)—in Section 3.5c, this formula is derived from the probability density function of the distribution of z.
(c) Find E(w) and var(w).
Exercise 12. Let x represent a random variable having mean μ and variance σ². Then, E(x²) = μ² + σ² [as is evident from result (2.3)]. Thus, the second moment of x depends on the distribution of x only through μ and σ². If the distribution of x is normal, then the third and higher moments of x also depend only on μ and σ². Taking the distribution of x to be normal, obtain explicit expressions for E(x³), E(x⁴), and, more generally, E(x^r) (where r is an arbitrary positive integer).
Exercise 13. Let x represent an N-dimensional random column vector whose distribution is N(μ, Σ). Further, let R = rank(Σ), and assume that Σ is nonnull (so that R ≥ 1). Show that there exist an R-dimensional nonrandom column vector c and an R × N nonrandom matrix A such that c + Ax ∼ N(0, I) (i.e., such that the distribution of c + Ax is R-variate standard normal).
Exercise 14. Let x and y represent random variables, and suppose that x + y and x − y are independently and normally distributed and have the same mean, say μ, and the same variance, say σ². Show that x and y are statistically independent, and determine the distribution of x and the distribution of y.
Exercise 15. Suppose that two or more random column vectors x1, x2, ..., xP are pairwise independent (i.e., xi and xj are statistically independent for j > i = 1, 2, ..., P) and that the joint distribution of x1, x2, ..., xP is MVN. Is it necessarily the case that x1, x2, ..., xP are mutually independent? Why or why not?
Exercise 16. Let x represent a random variable whose distribution is N(0, 1), and define y = ux, where u is a discrete random variable that is distributed independently of x with Pr(u = 1) = Pr(u = −1) = ½.
(a) Show that y ∼ N(0, 1).
(b) Show that cov(x, y) = 0.
(c) Show that x and y are statistically dependent.
(d) Is the joint distribution of x and y bivariate normal? Why or why not?
Exercise 17. Let x1, x2, ..., xK represent N-dimensional random column vectors, and suppose that x1, x2, ..., xK are mutually independent and that (for i = 1, 2, ..., K) xi ∼ N(μi, Σi). Derive (for arbitrary scalars a1, a2, ..., aK) the distribution of the linear combination Σ_{i=1}^K ai xi.
Exercise 18. Let x and y represent random variables whose joint distribution is bivariate normal. Further, let μ1 = E(x), μ2 = E(y), σ1² = var x, σ2² = var y, and ρ = corr(x, y) (where σ1 ≥ 0 and σ2 ≥ 0). Assuming that σ1 > 0, σ2 > 0, and −1 < ρ < 1, show that the conditional distribution of y given x is N[μ2 + ρσ2(x − μ1)/σ1, σ2²(1 − ρ²)].
Exercise 19. Let x and y represent random variables whose joint distribution is bivariate normal. Further, let σ1² = var x, σ2² = var y, and σ12 = cov(x, y) (where σ1 ≥ 0 and σ2 ≥ 0). Describe (in as simple terms as possible) the marginal distributions of x and y and the conditional distributions of y given x and of x given y. Do so for each of the following two "degenerate" cases: (1) σ1² = 0; and (2) σ1² > 0, σ2² > 0, and |σ12| = σ1σ2.
Exercise 20. Let x represent an N-dimensional random column vector, and take y to be the M-dimensional random column vector defined by y = c + Ax, where c is an M-dimensional nonrandom column vector and A an M × N nonrandom matrix.
(a) Express the moment generating function of the distribution of y in terms of the moment generating function of the distribution of x.
(b) Use the result of Part (a) to show that if the distribution of x is N(μ, Σ), then the moment generating function of the distribution of y is the same as that of the N(c + Aμ, AΣA′) distribution, thereby (since distributions having the same moment generating function are identical) providing an alternative way of arriving at Theorem 3.5.1.
The General Linear Model
The first two sections of Chapter 1 provide an introduction to linear statistical models in general
and to linear regression models in particular. Let us now expand on that introduction, doing so in a
way that facilitates the presentation (in subsequent chapters) of the results on statistical theory and
methodology that constitute the primary subject matter of the book.
The setting is one in which some number, say N, of data points are (for purposes of making statistical inferences about various quantities of interest) to be regarded as the respective values of observable random variables y1, y2, ..., yN. Define y = (y1, y2, ..., yN)′. It is supposed that (for i = 1, 2, ..., N) the ith datum (the observed value of yi) is accompanied by the corresponding value ui of a column vector u = (u1, u2, ..., uC)′ of C "explanatory" variables u1, u2, ..., uC. The observable random vector y is to be modeled by specifying a "family," say Δ, of functions of u, and by assuming that for some member of Δ (of unknown identity), say δ(·), the random deviations yi − δ(ui) (i = 1, 2, ..., N) have ("conditionally" on u1, u2, ..., uN) a joint distribution with certain specified characteristics. In particular, it might be assumed that these random deviations have a common mean of 0 and a common variance σ² (of unknown value), that they are uncorrelated, and possibly that they are jointly normal.
The emphasis herein is on models in which Δ consists of some or all of those functions (of u) that are expressible as linear combinations of P specified functions δ1(·), δ2(·), ..., δP(·). When Δ is of that form, the assumption that, for some function δ(·), the joint distribution of yi − δ(ui) (i = 1, 2, ..., N) has certain specified characteristics can be replaced by the assumption that, for some linear combination in Δ (of unknown identity), say one with coefficients β1, β2, ..., βP, the joint distribution of yi − Σ_{j=1}^P βj δj(ui) (i = 1, 2, ..., N) has the specified characteristics. Corresponding to this linear combination is the P-dimensional parameter vector β = (β1, β2, ..., βP)′ of coefficients—in general, this vector is of unknown value.
As what can be regarded as a very special case, we have the kind of situation where (with probability 1) yi = δ(ui) (i = 1, 2, ..., N) for some function δ(·) whose identity is known. In that special case, Δ has only one member, and (with probability 1) all N of the "random" deviations yi − δ(ui) (i = 1, 2, ..., N) equal 0. An example of that kind of situation is provided by the ideal-gas law of physics:

p = rnt/v.

Here, p is the pressure within a container of gas, v is the volume of the container, t is the absolute temperature of the gas, n is the number of moles of gas present in the container, and r is the universal gas constant. It has been found that, under laboratory conditions and for any of a wide variety of gases, the pressure readings obtained for any of a broad range of values of n, t, and v conform almost perfectly to the ideal-gas law. Note that (by taking logarithms) the ideal-gas law can be reexpressed in the form of the linear equation

log p = log r + log n + log t − log v.
In general, Δ consists of a possibly infinite number of functions. And the relationship between y1, y2, ..., yN and the corresponding values u1, u2, ..., uN of the vector u of explanatory variables is typically imperfect.
At times (when the intended usage would seem to be clear from the context), resort is made herein to the convenient practice of using the same symbol for a realization of a random variable (or of a random vector or random matrix) as for the random quantity itself. At other times, the realization might be distinguished from the random quantity by means of an underline or by the use of an altogether different symbol. Thus, depending on the context, the N data points might be represented by either y1, y2, ..., yN or y̲1, y̲2, ..., y̲N, and the N-dimensional column vector comprising these points might be represented by either y or y̲. Similarly, depending on the context, an arbitrary member of Δ might be denoted either by δ(·) (the same symbol used to denote the member having the specified characteristics) or by d(·). And either β1, β2, ..., βP or b1, b2, ..., bP might be used to represent the coefficients of an arbitrary linear combination of the functions δ1(·), δ2(·), ..., δP(·), and either β or b might be used to represent the P-dimensional column vector comprising these coefficients.

The family Δ of functions of u, in combination with whatever assumptions are made about the joint distribution of the N random deviations yi − δ(ui) (i = 1, 2, ..., N), determines the statistical model. Accordingly, it can play a critical role in establishing a basis for the use of the data in making statistical inferences. Moreover, corresponding to an arbitrary member δ(·) of Δ is the approximation to the data vector y̲ = (y̲1, y̲2, ..., y̲N)′ provided by the vector [δ(u1), δ(u2), ..., δ(uN)]′. It may be of direct or indirect interest to determine the member of Δ for which this approximation is the best [best in the sense that the norm of the N-dimensional vector with elements y̲1 − δ(u1), y̲2 − δ(u2), ..., y̲N − δ(uN) is minimized for δ(·) ∈ Δ]. Note that the solution to this optimization problem may be well-defined even in the absence of any assumptions of a statistical nature.
The data could be either univariate or multivariate. If some of the N data points were “altogether
different” in character than some of the others, the data would be regarded as multivariate. Such
would be the case if, for example, part of the data consisted of height measurements and part of
weight measurements. In some cases, the distinction (between univariate data and multivariate data)
might be less than clear-cut. For example, if the data consisted entirely of measurements of the
level of a pollutant but some measurements were obtained by different means, at a different time, or
under different conditions than others, the data might be regarded as univariate or, alternatively, as
multivariate.
Condition (1.1) can be reexpressed as

E(y) = δ, (1.3)

and condition (1.2) as

E(y − δ) = 0 (1.4)

(where δ denotes the N-dimensional vector with ith element δ(ui)). Here, the distributions of y and y − δ [and hence the expected values in conditions (1.1), (1.2), (1.3), and (1.4)] are regarded as "conditional" on u1, u2, ..., uN. Further, it is assumed that the distribution of y − δ does not depend on δ(·) or, less stringently, that var(y − δ) (= var y) does not depend on δ(·).

As is evident from the simple identity

y = δ + (y − δ),

y is expressible in the form

y = δ + e, (1.5)

where e is an N-dimensional random column vector with E(e) = 0. Under condition (1.5)—aside from a trivial case where δ has the same value for all δ(·) ∈ Δ (and that value is known) and/or where e = 0 (with probability 1)—e is unobservable. Moreover, under condition (1.5), the assumption that the distribution of y − δ does not depend on δ(·) is equivalent to an assumption that the distribution of e does not depend on δ(·), and, similarly, the less stringent assumption that var(y − δ) or var(y) does not depend on δ(·) is equivalent to an assumption that var(e) does not depend on δ(·).
Suppose now that Δ consists of some or all of those functions (of u) that are expressible as linear combinations of the P specified functions δ1(·), δ2(·), ..., δP(·). Then, by definition, a function δ(·) (of u) is a member of Δ only if (for "all" u) δ(u) is expressible in the form δ(u) = Σ_{j=1}^P βj δj(u) for some coefficients β1, β2, ..., βP. In matrix notation, the model can then be expressed as

y = Xβ + e, (1.14)

where X is the N × P matrix whose ijth element is δj(ui). The elements e1, e2, ..., eN of the vector e constitute what are sometimes (including herein) referred to as residual effects and sometimes referred to as errors.
When assumption (1.14) or (1.15) is coupled with an assumption that

var(ei) = σ² and cov(ei, es) = 0 (s > i = 1, 2, ..., N), (1.16)

or equivalently that

var(e) = σ²I, (1.17)

we arrive at a model that is to be called the Gauss–Markov model or (for short) the G–M model. Here, σ is a (strictly) positive parameter that is functionally unrelated to β.

More generally, it might be assumed that

var(e) = V(θ), (1.20)

where V(θ) is an N × N symmetric nonnegative definite matrix with ijth element vij(θ) that (for i, j = 1, 2, ..., N) is functionally dependent on a T-dimensional (column) vector θ = (θ1, θ2, ..., θT)′ of unknown parameters. Here, θ belongs to a specified subset, say Θ, of R^T. In the special case where T = 1, Θ = {θ : θ1 > 0}, and V(θ) = θ1²H, assumption (1.20) reduces to what is essentially assumption (1.19) (with θ1 in place of σ)—when H = I, there is a further reduction to what is essentially assumption (1.17). Accordingly, when the assumption that the observable random column vector y is such that y = Xβ + e is coupled with assumption (1.20), we obtain what is essentially a further generalization of the Aitken generalization of the G–M model. This generalization, whose parameter space is implicitly taken to be

{(β, θ) : β ∈ R^P, θ ∈ Θ}, (1.21)

is referred to herein as the general linear model.
Under the G–M model, E(y) and var(y) are expressible as

E(y) = Xβ and var(y) = σ²I; (1.22)

under the Aitken generalization,

E(y) = Xβ and var(y) = σ²H; (1.23)

and still more generally (in the case of the general linear model),

E(y) = Xβ and var(y) = V(θ). (1.24)

Unlike the G–M model, the Aitken and general linear models are able to allow for the possibility that y1, y2, ..., yN are correlated and/or have nonhomogeneous variances; however (as in the special case of the G–M model), the variances and covariances of y1, y2, ..., yN are functionally unrelated to the expected values of y1, y2, ..., yN.
Instead of parameterizing the G–M, Aitken, and general linear models in terms of the P-dimensional vector β = (β1, β2, ..., βP)′ of coefficients of the P specified functions δ1(·), δ2(·), ..., δP(·), they can be parameterized in terms of an N-dimensional (column) vector μ = (μ1, μ2, ..., μN)′ that "corresponds" to the vector Xβ. In the alternative parameterization, the model equation (1.14) becomes

y = μ + e. (1.25)
And the parameter space for the G–M model and its Aitken generalization becomes

{(μ, σ) : μ ∈ C(X), σ > 0}. (1.26)

Further, specifying (in the case of the G–M model or its Aitken generalization) that E(y) and var(y) are expressible in the form (1.22) or (1.23) [subject to the restriction that β and σ are confined to the space (1.18)] is equivalent to specifying that they are expressible in the form

E(y) = μ and var(y) = σ²I (1.28)

or

E(y) = μ and var(y) = σ²H, (1.29)

respectively [subject to the restriction that μ and σ are confined to the space (1.26)]. Similarly, specifying (in the case of the general linear model) that E(y) and var(y) are expressible in the form (1.24) [subject to the restriction that β and θ are confined to the space (1.21)] is equivalent to specifying that they are expressible in the form

E(y) = μ and var(y) = V(θ)

[subject to the restriction that μ and θ are confined to the space (1.27)].
The equation (1.14) is sometimes referred to as the model equation. And the (N × P) matrix X, which appears in that equation, is referred to as the model matrix. What distinguishes one G–M, Aitken, or general linear model from another is the choice of the model matrix X and (in the case of the Aitken or general linear model) the choice of the matrix H or the choices made in the specification (up to the value of a vector of parameters) of the variance-covariance matrix V(θ). Different choices for X are associated with different choices for the functions δ1(·), δ2(·), ..., δP(·). The number (N) of rows of X is fixed; it must equal the number of observations. However, the number (P) of columns of X (and accordingly the dimension of the parameter vector β) may vary from one choice of X to another.
The only assumptions inherent in the G–M, Aitken, or general linear model about the distribution of the vector y are those reflected in result (1.22), (1.23), or (1.24), which pertain to the first- and second-order moments. Stronger versions of these models can be obtained by making additional assumptions. One highly convenient and frequently adopted additional assumption is that of taking the distribution of the vector e to be MVN. In combination with assumption (1.17), (1.19), or (1.20), this assumption implies that e ∼ N(0, σ²I), e ∼ N(0, σ²H), or e ∼ N[0, V(θ)]. Further, taking the distribution of e to be MVN implies (in light of Theorem 3.5.1) that the distribution of y is also MVN; specifically, it implies that y ∼ N(Xβ, σ²I), y ∼ N(Xβ, σ²H), or y ∼ N[Xβ, V(θ)].

The credibility of the normality assumption varies with the nature of the application. Its credibility would seem to be greatest under circumstances in which each of the quantities e1, e2, ..., eN can reasonably be regarded as the sum of a large number of deviations of similar magnitude. Under such circumstances, the central limit theorem may be "operative." The credibility of the basic assumptions inherent in the G–M, Aitken, and general linear models is affected by the choice of the functions δ1(·), δ2(·), ..., δP(·). The credibility of these assumptions, as well as that of the normality assumption, can sometimes be enhanced via a transformation of the data.
The coverage in subsequent chapters (and in subsequent sections of the present chapter) includes numerous results on the G–M, Aitken, and general linear models and much discussion pertaining to those results and to various aspects of the models themselves. Except where otherwise indicated, the notation employed in that coverage is implicitly taken to be that employed in the present section (in the introduction of the models). And in connection with the G–M, Aitken, and general linear models, δ(·) is subsequently taken to be the function (of u) defined by δ(u) = Σ_{j=1}^P βj δj(u).
In making statistical inferences on the basis of a statistical model, it is necessary to express the quantities of interest in terms related to the model. A G–M, Aitken, or general linear model is often used as a basis for making inferences about quantities that are expressible as linear combinations of β1, β2, ..., βP, that is, ones that are expressible in the form λ′β, where λ = (λ1, λ2, ..., λP)′ is a P-dimensional column vector of scalars, or equivalently ones that are expressible in the form Σ_{j=1}^P λj βj. More generally, a G–M, Aitken, or general linear model is often used as a basis for making inferences about quantities that are expressible as random variables, each of which has an expected value that is a linear combination of β1, β2, ..., βP; these quantities may or may not be correlated with e1, e2, ..., eN and/or with each other. And a general linear model is sometimes used as a basis for making inferences about quantities that are expressible as functions of θ (or, in the special case of a G–M or Aitken model, for making inferences about σ² or a function of σ²).
$$\delta^{(k-1)}(a) = (k-1)!\,\beta_k \tag{2.2}$$

[assuming that a is an interior point of the domain of δ(·)] and, more generally,

$$\delta^{(k-1)}(u) = (k-1)!\,\beta_k + \sum_{j=k+1}^{P} (j-1)(j-2)\cdots(j-k+1)\,(u-a)^{j-k}\,\beta_j, \tag{2.3}$$

as is easily verified. Thus, the derivatives of δ(·) at the point a are scalar multiples of the parameters β2, β3, ..., βP—the parameter β1 is the value δ(a) of δ(u) at u = a. And the (k−1)th derivative of δ(·) at an arbitrary interior point u is a linear combination of βk, βk+1, ..., βP.
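Equalities (2.2) and (2.3) are easy to check numerically. A minimal sketch (Python with NumPy; the coefficients, the point a, and the order k are arbitrary illustrative choices):

```python
import numpy as np
from math import factorial

# delta(u) = beta_1 + beta_2 (u - a) + ... + beta_P (u - a)^{P-1}; illustrative values
beta = np.array([2.0, -1.0, 0.5, 3.0])       # P = 4
a, k = 1.5, 3

# Work in the variable t = u - a, so delta is a polynomial in t:
p = np.polynomial.Polynomial(beta)
dp = p.deriv(k - 1)                          # the (k-1)th derivative

# Equality (2.2): the (k-1)th derivative at u = a (i.e., t = 0) is (k-1)! * beta_k
assert np.isclose(dp(0.0), factorial(k - 1) * beta[k - 1])
```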
TABLE 4.1. Lethal doses of ouabain in cats for each of four rates of injection.
δ(u) = β1 + β2u. (2.6)
There are cases where δ(u) is not of the form (2.1) but where δ(u) can be reexpressed in that form by introducing a transformation of u and by redefining u accordingly. Suppose, for example, that u is strictly positive and that

δ(u) = β1 + β2 log u (2.7)

(where β1 and β2 are parameters). Clearly, expression (2.7) is not a polynomial in u. However, upon introducing a transformation from u to the variable u* defined by u* = log u, δ(u) can be reexpressed as a function of u*; specifically, it can be reexpressed in terms of the function δ*(·) defined by

δ*(u*) = β1 + β2u*.

Thus, when u is redefined to be u*, we can take δ(u) to be of the form

δ(u) = β1 + β2u.
to different versions of the model, more than one of which may be worthy of consideration—the
design of the study that gave rise to these data suggests that a choice for P of no more than 4 would
have been regarded as adequate. The choice of the value a in expression (2.1) is more or less a matter
of convenience.
Now, let us consider (for purposes of illustration) the matrix representation y = Xβ + e of the observable random column vector y in an application of the G–M model in which the observed value of y comprises the lethal doses, in which C = 1, in which u is the rate of injection, and in which δ(u) is the special case of the (P−1)-degree polynomial (2.1) (in u − a) obtained by setting P = 5 and a = 0. There are N = 41 data points. Suppose that the data points are numbered 1, 2, ..., 41 by proceeding row by row in Table 4.1 from the top to the bottom and by proceeding from left to right within each row. Then, y′ = (y1′, y2′, y3′, y4′), where (for k = 1, 2, 3, 4) yk is the subvector of y comprising the lethal doses corresponding to the kth rate of injection. Further, β′ = (β1, β2, β3, β4, β5). Because the data are arranged in groups, with each group having a common value of u (corresponding to a common rate of injection), the model matrix X has a succinct representation. Specifically, this matrix, whose ijth element xij corresponds to the ith datum and to the jth element βj of β, is given by

$$X = \begin{pmatrix} \mathbf{1}_{12} & \mathbf{1}_{12} & \mathbf{1}_{12} & \mathbf{1}_{12} & \mathbf{1}_{12} \\ \mathbf{1}_{11} & 2\,\mathbf{1}_{11} & 4\,\mathbf{1}_{11} & 8\,\mathbf{1}_{11} & 16\,\mathbf{1}_{11} \\ \mathbf{1}_{9} & 4\,\mathbf{1}_{9} & 16\,\mathbf{1}_{9} & 64\,\mathbf{1}_{9} & 256\,\mathbf{1}_{9} \\ \mathbf{1}_{9} & 8\,\mathbf{1}_{9} & 64\,\mathbf{1}_{9} & 512\,\mathbf{1}_{9} & 4096\,\mathbf{1}_{9} \end{pmatrix},$$

where 1_n denotes the n-dimensional column vector of ones.
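The structure of X makes its construction mechanical. A minimal sketch (Python with NumPy; this reproduces the 41 × 5 matrix displayed above from the four rates and the four group sizes):

```python
import numpy as np

# Model matrix for the ouabain example: a quartic in the rate of injection
# (P = 5, a = 0), with group sizes 12, 11, 9, and 9.
rates = [1, 2, 4, 8]
counts = [12, 11, 9, 9]

u = np.repeat(rates, counts)                 # 41 rates, one per data point
X = np.vander(u, 5, increasing=True)         # columns 1, u, u^2, u^3, u^4

assert X.shape == (41, 5)
assert list(X[12]) == [1, 2, 4, 8, 16]       # first row of the second group
```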
Another special case worthy of mention is that where P = C + 1 + [C(C+1)/2], where (as in the previous special case) k11 = k12 = ⋯ = k1C = 0 and, for j = 2, 3, ..., C+1, kj1, kj2, ..., kjC are given by expression (2.9), and where, for j = C+2, C+3, ..., C+1+[C(C+1)/2], kj1, kj2, ..., kjC are such that Σ_{t=1}^C kjt = 2. In that special case, δ(u) is a polynomial of degree 2. That polynomial is obtainable from the degree-1 polynomial (2.10) by adding C(C+1)/2 terms of degree 2. Each of these degree-2 terms is expressible in the form

βj (u_t − a_t)(u_{t′} − a_{t′}), (2.12)

where j is an integer between C+2 and C+1+[C(C+1)/2], inclusive, and where t is an integer between 1 and C, inclusive, and t′ an integer between t and C, inclusive—j, t, and t′ are such that either kjt = 2 and t′ = t or kjt = kjt′ = 1 and t′ > t. It is convenient to express the degree-2 terms in the form (2.12) and to adopt a modified notation for the coefficients βC+2, βC+3, ..., βC+1+[C(C+1)/2] in which the coefficient in term (2.12) is identified by the values of t and t′, that is, in which β_{tt′} is written in place of βj. Accordingly,

$$\delta(u) = \beta_1 + \sum_{j=1}^{C} \beta_{j+1}(u_j - a_j) + \sum_{t=1}^{C} \sum_{t'=t}^{C} \beta_{tt'}(u_t - a_t)(u_{t'} - a_{t'}). \tag{2.13}$$

When δ(u) is of the form (2.13), the model is sometimes referred to as second-order. When a1 = a2 = ⋯ = aC = 0, expression (2.13) reduces to

$$\delta(u) = \beta_1 + \sum_{j=1}^{C} \beta_{j+1}\, u_j + \sum_{t=1}^{C} \sum_{t'=t}^{C} \beta_{tt'}\, u_t u_{t'}. \tag{2.14}$$
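Building the model matrix for a second-order model is likewise mechanical: one column of ones, the C linear terms, and the C(C+1)/2 distinct products. A sketch (Python with NumPy; the function name and the illustrative values of U are my own, not from the text):

```python
import numpy as np
from itertools import combinations_with_replacement

def second_order_design(U, a=None):
    """Rows of the model matrix for the second-order model (2.13)/(2.14):
    an intercept, the C linear terms (u_t - a_t), and the C(C+1)/2 degree-2
    terms (u_t - a_t)(u_t' - a_t') with t' >= t."""
    U = np.asarray(U, dtype=float)
    if a is not None:
        U = U - np.asarray(a, dtype=float)
    N, C = U.shape
    quad = [U[:, t] * U[:, tp]
            for t, tp in combinations_with_replacement(range(C), 2)]
    return np.column_stack([np.ones(N), U] + quad)

# With C = 3 explanatory variables, P = 1 + 3 + 6 = 10 columns:
U = np.array([[1.0, -1.0, 0.5], [0.0, 2.0, 1.0]])
X = second_order_design(U)
assert X.shape == (2, 10)
```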
The study examined the effects on the yield of lettuce of three trace minerals: copper (Cu), molybdenum (Mo), and iron (Fe). The plants were grown in a medium in containers of 3 plants each. Each container was assigned one of 5 levels of each of the 3 trace minerals; for Fe, the lowest and highest levels were 0.0025 ppm and 25 ppm, and for both Cu and Mo, the lowest and highest levels were 0.0002 ppm and 2 ppm. The 5 levels of each mineral were reexpressed on a transformed scale: the ppm were replaced by a linear function of the logarithm of the ppm chosen so that the transformed values of the highest and lowest levels were ±⁴√8. Yield was recorded as grams of dry weight. The results of the experimental study are reproduced in Table 4.3.
A G–M model could be applied to these data. We could take N = 20, take the values of y1, y2, ..., yN to be the yields, take C = 3, and take u1, u2, and u3 to be the transformed amounts of Cu, Mo, and Fe, respectively. And we could consider taking δ(u) to be a polynomial (in u1, u2, and u3). The levels of Cu, Mo, and Fe represented in the study covered what was considered to be a rather wide range. Accordingly, the first-order model, in which δ(u) is taken to be the degree-1 polynomial (2.10) or (2.11), is not suitable. The second-order model, in which δ(u) is taken to be the degree-2 polynomial (2.13) or (2.14), would seem to be a much better choice.
4.3 Regression
Suppose there are N data points y1, y2, ..., yN and that (for i = 1, 2, ..., N) yi is accompanied by the corresponding value ui of a vector u = (u1, u2, ..., uC)′ of C explanatory variables. Further, regard y1, y2, ..., yN as N values of a variable y. In some applications, both u and y can reasonably be regarded as random, that is, the (C+1)-dimensional column vector (y, u′)′ can reasonably be regarded as a random vector; and the (C+1)-dimensional vectors (y1, u1′)′, (y2, u2′)′, ..., (yN, uN′)′ can reasonably be regarded as a random sample of size N from the distribution of (y, u′)′. Assume that (y, u′)′ and (y1, u1′)′, (y2, u2′)′, ..., (yN, uN′)′ can be so regarded. Then, in effect, (y1, u1′)′, (y2, u2′)′, ..., (yN, uN′)′ are distributed independently, each with the same distribution as (y, u′)′.
TABLE 4.3. Yields of lettuce from pots containing various amounts of Cu, Mo, and Fe (Hader, Harward, Mason, and Moore 1957, p. 63; Moore, Harward, Mason, Hader, Lott, and Jackson 1957, p. 67).

[Table 4.3 lists, for each of the 20 containers, the yield in grams of dry weight together with the transformed amounts of Cu, Mo, and Fe: 8 factorial points (transformed Cu and Mo of magnitude 1; transformed Fe of magnitude 0.4965 or 1), 6 axial points (one mineral at a transformed amount of magnitude ⁴√8, the other two at 0), and 6 center points (all three minerals at 0). The yields, in the order listed, are 21.42, 15.92, 22.81, 14.90, 14.95, 7.83, 19.90, 4.68, 0.20, 17.65, 18.16, 25.39, 11.99, 7.37, 22.22, 19.49, 22.76, 24.27, 27.88, and 27.53; the signs of the nonzero transformed amounts did not survive reproduction.]
v(u) = σ², (3.5)

where

σ² = var(y) − cov(y, u)(var u)⁻¹ cov(u, y).
We conclude that when the distribution of the vector (y, u′)′ is MVN, the joint distribution of y1, y2, ..., yN obtained by regarding (y1, u1′)′, (y2, u2′)′, ..., (yN, uN′)′ as a random sample from the distribution of (y, u′)′ and by conditioning on u1, u2, ..., uN is identical to that obtained by adopting a first-order G–M model and by taking the joint distribution of the residual effects e1, e2, ..., eN of the G–M model to be MVN. Moreover, when the distribution of (y, u′)′ is MVN, the parameters β1, β2, ..., βC+1 and σ of that first-order G–M model are expressible in terms of the mean vector and the variance-covariance matrix of (y, u′)′.

There are other distributions of (y, u′)′ (besides the MVN) for which the joint distribution of y1, y2, ..., yN obtained by regarding (y1, u1′)′, (y2, u2′)′, ..., (yN, uN′)′ as a random sample from the distribution of (y, u′)′ and by conditioning on u1, u2, ..., uN is consistent with the adoption of a first-order G–M model. Whether or not the joint distribution obtained in that way is consistent with the adoption of a first-order G–M model depends on the nature of the distribution of (y, u′)′ solely through the nature of the conditional distribution of y given u; in fact, it depends solely on the nature of the mean and variance of the conditional distribution of y given u. The nature of the marginal distribution of u is without relevance (to that particular issue).
It is instructive to consider the implications of the expression for E(y | u) given by equation (3.2). Suppose that C = 1. Then, writing u for u1, equation (3.2) can be reexpressed in the form

$$\frac{E(y \mid u) - E(y)}{\sqrt{\operatorname{var} y}} = \operatorname{corr}(y, u)\, \frac{u - E(u)}{\sqrt{\operatorname{var} u}}. \tag{3.6}$$

Excluding the limiting case where |corr(y, u)| = 1, there is an implication that, for any particular value of u, E(y | u) is less extreme than u in the sense that (in units of standard deviations) it is closer to E(y) than u is to E(u).
Roughly speaking, observations on y corresponding to any particular value of u are on average
less extreme than the value of u. This phenomenon was recognized early on by Sir Francis Galton,
who determined (from data on a human population) that the heights of the offspring of very tall
(or very short) parents, while typically above (or below) average, tend to be less extreme than the
heights of the parents. It is a phenomenon that has come to be known as “regression to the mean” or
simply as regression. This term evolved from the term “regression (or reversion) towards mediocrity”
introduced by Galton.
Some authors reserve the use of the term regression for situations (like that under consideration
in the present section) where the explanatory variables can reasonably be regarded as realizations of
random variables (e.g., Graybill 1976; Rao 1973). This would seem to be more or less in keeping
with the original meaning of the term. However, over time, the term regression has come to be used
much more broadly. In particular, it has become common to use the term linear regression almost
synonymously with what is being referred to herein as the G–M model, with the possible proviso
that the explanatory variables be continuous. This broader usage is inclusive enough to cover a study
(like that which produced the lettuce data) where the values of the explanatory variables have been
determined systematically as part of a designed experiment.
exercise to show that, for k = 1, 2, ..., K, E(ȳk) = Σ_{j=1}^P x_{i_k j} βj and var(ȳk) = σ²/Nk and that, for k′ ≠ k = 1, 2, ..., K, cov(ȳk, ȳk′) = 0. Thus, ȳ1, ȳ2, ..., ȳK follow an Aitken generalization of a G–M model in which, taking y to be the K-dimensional vector (ȳ1, ȳ2, ..., ȳK)′, the model matrix is the K × P matrix whose kth row is the ikth row of the original model matrix, and the matrix H is the diagonal matrix diag(1/N1, 1/N2, ..., 1/NK)—the parameters β1, β2, ..., βP and σ of this model are "identical" to those of the original (G–M) model (i.e., the model for the individual observations y1, y2, ..., yN).
As a simple example, consider the ouabain data of Section 4.2b. Suppose that the lethal doses for the 41 cats follow a G–M model in which the rate of injection is the sole explanatory variable and in which δ(·) is a polynomial. Then, the model matrix has 4 distinct rows, corresponding to the 4 rates of injection: 1, 2, 4, and 8. The numbers of cats injected at those 4 rates were 12, 11, 9, and 9, respectively. The average lethal doses for the 4 rates would follow an Aitken model, with a model matrix that has 4 rows (which are the distinct rows of the original model matrix and whose lengths and entries depend on the choice of polynomial) and with H = diag(1/12, 1/11, 1/9, 1/9).
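A minimal sketch of the averaged-data setup (Python with NumPy; the group sizes are those of the ouabain data, while the "data" y are placeholder draws used only to show the computation):

```python
import numpy as np

# Sketch for the averaged-data (Aitken) model: K = 4 groups of sizes
# 12, 11, 9, and 9, as for the ouabain data.
counts = np.array([12, 11, 9, 9])
groups = np.repeat(np.arange(4), counts)

y = np.random.default_rng(2).normal(size=41)             # placeholder "data"
ybar = np.array([y[groups == k].mean() for k in range(4)])

# The averaged data follow y_bar = X* beta + e* with var(e*) = sigma^2 H,
# where X* consists of the distinct rows of the original model matrix and
H = np.diag(1.0 / counts)                                # H = diag(1/N_1, ..., 1/N_K)
```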
Within-group homoscedasticity. There are situations where the residual effects e1, e2, ..., eN cannot reasonably be regarded as homoscedastic, but where they can be partitioned into some (hopefully modest) number of mutually exclusive and exhaustive subsets or "groups," each of which consists of residual effects that are thought to be homoscedastic. Suppose there are K such groups and that (for purposes of identification) they are numbered 1, 2, ..., K. And (for k = 1, 2, ..., K) denote by Ik the subset of the integers 1, 2, ..., N defined by: i ∈ Ik if the ith residual effect ei is a member of the kth group.
Let us assume the existence of a function, say g(u), of u (the vector of explanatory variables) whose value g(ui) for u = ui is as follows: g(ui) = k if i ∈ Ik. This assumption entails essentially no loss of generality. If necessary, it can be satisfied by introducing an additional explanatory variable. In particular, it can be satisfied by including an explanatory variable whose ith value (the value of the explanatory variable corresponding to yi) equals k for every i ∈ Ik (k = 1, 2, ..., K). It is worth noting that there is nothing in the formulation of the G–M model (or its Aitken generalization) or in the general linear model requiring that δ(u) (whose values for u = u1, u2, ..., uN are the expected values of y1, y2, ..., yN) depend nontrivially on every component of u.

Consider, for example, the case of the ouabain data. For those data, we could conceivably define 4 groups of residual effects, corresponding to the 4 rates of injection, and assume that the residual effects are homoscedastic within a group but (contrary to what is inherent in the G–M model) not necessarily homoscedastic across groups. Then, assuming (as before) that the rate of injection is the sole explanatory variable u1 (and writing u for u1), we could choose the function g(·) so that g(1) = 1, g(2) = 2, g(4) = 3, and g(8) = 4.
The situation is one in which the residual effects in the kth group have a common variance, say σk². More generally, suppose that var(ei) = v(ui) (i = 1, 2, ..., N) for some (nonnegatively valued) function v(·) of u, in which case

var(e) = diag[v(u1), v(u2), ..., v(uN)]. (4.1)
In general, the function v(·) is known only up to the (unknown) values of one or more parameters—the dependence on the parameters is suppressed in the notation. It is implicitly assumed that these parameters are functionally unrelated to the parameters β1, β2, ..., βP, whose values determine the expected values of y1, y2, ..., yN. Thus, when they are regarded as the elements of the vector θ, var(e) is of the form V(θ) of var(e) in the general linear model.

In what is a relatively simple special case, v(u) is of the form

v(u) = σ²h(u), (4.2)

where σ is a strictly positive parameter (of unknown value) and h(u) is a known (nonnegatively valued) function of u. In that special case, formula (4.1) is expressible as

var(e) = σ² diag[h(u1), h(u2), ..., h(uN)]. (4.3)

This expression is of the form σ²H of var(e) in the Aitken generalization of the G–M model.
This expression is of the form 2 H of var.e/ in the Aitken generalization of the G–M model.
Let us consider some of the more common choices for the function v(u). For the sake of simplicity, let us do so for the special case where v(u) depends on u only through a single one of its C components. Further, for convenience, let us write u for this component (dropping the subscript) and, with some abuse of notation, write v(u) for the resulting function of the scalar u [thereby regarding v(·) as a function solely of u]. In the case of the ouabain data, we could take u to be the rate of injection, or, alternatively, we could regard some strictly monotonic function of the rate of injection as an explanatory variable and take it to be u.

One very simple choice for v(u) is

v(u) = σ²|u| (4.4)

(where σ is a strictly positive parameter of unknown value). More generally, we could take v(u) to be of the form

v(u) = σ²|u|^{2α}, (4.5)

where α is a strictly positive scalar or possibly (if the domain of u does not include the value 0) an unrestricted scalar. And, still more generally, we could take v(u) to be of the form

v(u) = σ²(γ + |u|^α)², (4.6)

where γ is a nonnegative scalar. While expressions (4.4), (4.5), and (4.6) depend on u only through its absolute value and consequently are well-defined for both positive and negative values of u, choices for v(u) of the form (4.4), (4.5), or (4.6) would seem to be best-suited for use in situations where u is either strictly positive or strictly negative.

Note that taking v(u) to be of the form (4.4), (4.5), or (4.6) is equivalent to taking √v(u) to be σ√|u|, σ|u|^α, or σ(γ + |u|^α), respectively. Note also that expression (4.4) is of the form (4.2);
and recall that when v(u) is of the form (4.2), var(e) is of the form σ²H associated with the Aitken generalization of the G–M model. More generally, if α is known, then expression (4.5) is of the form (4.2); and if both γ and α are known, expression (4.6) is of the form (4.2). However, if α or if γ and/or α are (like σ) regarded as unknown parameters, then expression (4.5) or (4.6), respectively, is not of the form (4.2), and, while var(e) is of the form associated with the general linear model, it is not of the form associated with the Aitken generalization of the G–M model.

As an alternative to taking v(u) to be of the form (4.4), (4.5), or (4.6), we could take it to be of the form

v(u) = σ²e^{2αu}, (4.7)

where α is an unrestricted scalar (and σ is a strictly positive parameter of unknown value). Note that taking v(u) to be of the form (4.7) is equivalent to taking √v(u) to be of the form √v(u) = σe^{αu}, and is also equivalent to taking log √v(u) [which equals ½ log v(u)] to be of the form

log √v(u) = log σ + αu.

Note also that if the scalar α in expression (4.7) is known, then that expression is of the form (4.2), in which case var(e) is of the form associated with the Aitken generalization of the G–M model. Alternatively, if α is an unknown parameter, then expression (4.7) is not of the form (4.2) and var(e) is not of the form associated with the Aitken generalization of the G–M model [though var(e) is of the form associated with the general linear model].
In some applications, there may not be any choice for v(u) of a relatively simple form [like (4.6) or (4.7)] for which it is realistic to assume that var(ei) = v(ui) for i = 1, 2, ..., N. However, in some such applications, it may be possible to partition e1, e2, ..., eN into mutually exclusive and exhaustive subsets (perhaps on the basis of the values u1, u2, ..., uN of the vector u of explanatory variables) in such a way that, specific to each subset, there is a choice for v(u) (of a relatively simple form) for which it may be realistic to assume that var(ei) = v(ui) for those i corresponding to the members of that subset. While one subset may require a different choice for v(u) than another, the various choices (corresponding to the various subsets) may all be of the same general form; for example, they could all be of the form (4.7) (but with possibly different values of σ and/or α).
Form of the variance-covariance matrix. Let us now return to the main development. Upon applying Lemma 4.4.1 with a = 1 − ρk and b = ρk, we find that (when Nk ≥ 2) the correlation matrix Rk of the vector ek is nonnegative definite if and only if 1 − ρk ≥ 0 and 1 + (Nk − 1)ρk ≥ 0, or equivalently if and only if

−1/(Nk − 1) ≤ ρk ≤ 1, (4.15)

and similarly that Rk is positive definite if and only if

−1/(Nk − 1) < ρk < 1. (4.16)

The permissible values of ρk are those in interval (4.15).

The variance-covariance matrix of the vector ek is such that all of its diagonal elements (which are the variances) equal each other and all of its off-diagonal elements (which are the covariances) also equal each other. Such a variance-covariance matrix is said to be compound symmetric (e.g., Milliken and Johnson 2009, p. 536).

The variance-covariance matrix of the vector e of residual effects is positive definite if and only if all K of the matrices R1, R2, ..., RK are positive definite—refer to result (4.13) and "recall" Lemma 2.13.31. Thus, var(e) is positive definite if, for every k for which Nk ≥ 2, ρk is in the interior (4.16) of interval (4.15). [If, for every k for which Nk ≥ 2, ρk is in interval (4.15) but, for at least one such k, ρk equals an end point of interval (4.15), var(e) is positive semidefinite.]

There are applications in which it may reasonably be assumed that the intraclass correlations are nonnegative (i.e., that ρk ≥ 0 for every k for which Nk ≥ 2). In some such applications, it is further assumed that all of the intraclass correlations are equal. Together, these assumptions are equivalent to the assumption that, for some scalar ρ in the interval 0 ≤ ρ ≤ 1,

ρk = ρ (for every k for which Nk ≥ 2). (4.17)
Now, suppose that assumption (4.17) is adopted. If ρ were taken to be 0, var(e) would be of the form considered in Part 2 of Subsection a. When ρ as well as σ1, σ2, ..., σK are regarded as unknown parameters, var(e) is of the form V(θ) of var(e) in the general linear model—take θ to be a (K+1)-dimensional (column) vector whose elements are σ1, σ2, ..., σK, and ρ.

In some applications, there may be a willingness to augment assumption (4.17) with the additional assumption that the residual effects are completely homoscedastic, that is, with the additional assumption that, for some strictly positive scalar σ,

σ1 = σ2 = ⋯ = σK = σ. (4.18)

Then,

var(e) = σ² diag(R1, R2, ..., RK), (4.19)

and when σ is regarded as an unknown parameter and ρ is taken to be known, var(e) is of the form σ²H of var(e) in the Aitken generalization of the G–M model. When both σ and ρ are regarded as unknown parameters, var(e) is of the form V(θ) of var(e) in the general linear model—take θ to be the 2-dimensional (column) vector whose elements are σ and ρ. Assumption (4.18) leads to a model that is less flexible but more parsimonious than that obtained by allowing the variances of the residual effects to differ from class to class.
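A sketch of the construction of var(e) under assumptions (4.17) and (4.18) (Python with NumPy and SciPy's block_diag; the group sizes and parameter values are arbitrary illustrations):

```python
import numpy as np
from scipy.linalg import block_diag

def compound_symmetric_R(n, rho):
    """n x n correlation matrix with all off-diagonal elements equal to rho."""
    return (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))

# var(e) = sigma^2 diag(R_1, ..., R_K)  [form (4.19)]; illustrative values
sizes, rho, sigma = [3, 3, 4], 0.25, 1.5
V = sigma**2 * block_diag(*[compound_symmetric_R(n, rho) for n in sizes])

# rho must lie in [-1/(N_k - 1), 1] for every k with N_k >= 2 [interval (4.15)]
assert all(-1.0 / (n - 1) <= rho <= 1.0 for n in sizes if n >= 2)
```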
Decomposition of the residual effects. A supposition that an intraclass correlation is nonnegative, but strictly less than 1, is the equivalent of a supposition that each residual effect can be regarded as the sum of two uncorrelated components, one of which is specific to that residual effect and the other of which is shared by all of the residual effects in the same class. Let us consider this equivalence in some detail. Accordingly, take ak (k = 1, 2, ..., K) and rks (k = 1, 2, ..., K; s = 1, 2, ..., Nk) to be uncorrelated random variables, each with mean 0, such that var(ak) = τk² for some nonnegative scalar τk and var(rks) = φk² (s = 1, 2, ..., Nk) for some strictly positive scalar φk. And suppose that, for k = 1, 2, ..., K and s = 1, 2, ..., Nk,

eks = ak + rks. (4.20)

Here, ak is the component of the sth residual effect in the kth class that is shared by all of the residual effects in the kth class and rks is the component that is specific to eks.

We find that the variance σk² of the residual effects in the kth class is expressible as

σk² = τk² + φk². (4.21)

Further, upon observing that (for t ≠ s = 1, 2, ..., Nk) cov(eks, ekt) = τk², we find that the correlation of any two residual effects in the kth class (the intraclass correlation) is expressible as

ρk = τk²/σk² = τk²/(τk² + φk²) = (τk/φk)²/[1 + (τk/φk)²]. (4.22)

Note that result (4.22) implies that the intraclass correlation is nonnegative, but strictly less than 1. Note also that an assumption that ρk does not vary with k can be restated as an assumption that τk²/σk² does not vary with k and also as an assumption that τk/φk (or τk²/φk²) does not vary with k.
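The equivalence between decomposition (4.20) and a nonnegative intraclass correlation can be checked by simulation. A hedged sketch (Python with NumPy; tau and phi denote the standard deviations of the shared and specific components, written τk and φk above, with arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(3)
tau, phi, N_k, n = 1.0, 2.0, 5, 200_000    # illustrative component standard deviations

a = rng.normal(0.0, tau, size=(n, 1))      # shared components a_k
r = rng.normal(0.0, phi, size=(n, N_k))    # specific components r_ks
e = a + r                                  # e_ks = a_k + r_ks  [decomposition (4.20)]

rho_hat = np.corrcoef(e[:, 0], e[:, 1])[0, 1]
rho = tau**2 / (tau**2 + phi**2)           # intraclass correlation (4.22)
print(rho_hat, rho)                        # both ~ 0.2
```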
The effects of intraclass competition. In some applications, the residual effects in each class may be those for data on entities among which there is competition. For example, the entities might consist of individual animals that are kept in the same pen and that may compete for space and for feed. Or they might consist of individual plants that are in close proximity and that may compete for water, nutrients, and light. In the presence of such competition, the residual effects in each class may tend to be less alike than would otherwise be the case.

The decomposition of the residual effects considered in the previous part can be modified so as to reflect the effects of competition. Let us suppose that the Nk residual effects in the kth class are those for data on Nk of a possibly larger number Nk* of entities among which there is competition. Define ak (k = 1, 2, ..., K) and rks (k = 1, 2, ..., K; s = 1, 2, ..., Nk*) as in the previous part. And take dks (k = 1, 2, ..., K; s = 1, 2, ..., Nk*) to be random variables, each with mean 0, that are uncorrelated with the ak's and the rks's, and suppose that cov(dks, djt) = 0 for j ≠ k = 1, 2, ..., K (and for all s and t). Suppose further that (for k = 1, 2, ..., K) Σ_{s=1}^{Nk*} dks = 0, that var(dks) = ωk² (s = 1, 2, ..., Nk*) for some nonnegative scalar ωk, and that cov(dks, dkt) has the same value for all t ≠ s = 1, 2, ..., Nk*. Finally, suppose that, for k = 1, 2, ..., K and s = 1, 2, ..., Nk,

eks = ak + dks + rks. (4.26)

Decomposition (4.26) is such that the variance σk² of the residual effects in the kth class is expressible as

σk² = τk² + φk² + ωk². (4.30)

Further, for t ≠ s = 1, 2, ..., Nk,

cov(eks, ekt) = τk² + cov(dks, dkt) = τk² − ωk²/(Nk* − 1). (4.31)

Thus, the correlation of any two residual effects in the kth class (the intraclass correlation) is expressible as

ρk = [τk² − ωk²/(Nk* − 1)]/σk² = [τk² − ωk²/(Nk* − 1)]/(τk² + φk² + ωk²) = [(τk/φk)² − (ωk/φk)²/(Nk* − 1)]/[1 + (τk/φk)² + (ωk/φk)²]. (4.32)

In light of expression (4.32), the permissible values of ρk are those in the interval

−1/(Nk* − 1) < ρk < 1. (4.33)

The intraclass correlation ρk approaches the upper end point of this interval as τk/φk → ∞ (for fixed ωk/φk) and approaches the lower end point as ωk/φk → ∞ (for fixed τk/φk).

Expression (4.32) "simplifies" to a considerable extent if neither τk/φk nor ωk/φk varies with k (or, equivalently, if neither τk/σk nor ωk/σk varies with k). However, even then, ρk may depend on Nk* and hence may vary with k.
30-kilogram “batches” of corn. Each batch was tempered so that its moisture content conformed to
a specified setting (one of three equally spaced settings selected for inclusion in the study).
Following its preparation, each batch was split into three equal (10 kg) parts. And for each part
of each batch, settings were specified for the roll gap, screen size, and roller speed, the grinding mill
was configured (to conform to the specified settings), the processing of the corn was undertaken, and
the amount of grits obtained from a one-minute run was determined. The moisture content, roll gap,
screen size, and roller speed were recorded on a transformed scale chosen so that the values of the
high, medium, and low settings were $+1$, $0$, and $-1$, respectively. The results of the 30 experimental
runs are reproduced in Table 4.4.
The corn-milling experiment was conducted in accordance with what is known as a split-plot
design (e.g., Hinkelmann and Kempthorne 2008, chap. 13). The ten batches are the so-called whole
plots, and the three moisture-content settings constitute the so-called whole-plot treatments. Further,
the 30 parts (obtained by splitting the 10 batches into 3 parts each) are the so-called split plots, and
the various combinations of settings for roll gap, screen size, and roller speed constitute the so-called
split-plot treatments—21 of a possible 27 combinations were included in the experiment.
The data from the corn-milling experiment might be suitable for the application of a general linear
model. We could take $N = 30$, take the observed value of $y_i$ to be the amount of grits obtained on the $i$th experimental run ($i = 1, 2, \ldots, 30$), take $C = 4$, and take $u_1$, $u_2$, $u_3$, and $u_4$ to be the moisture content, roll gap, screen size, and roller speed, respectively (with each being expressed on the transformed scale). And we could consider taking $\delta(u)$ to be a polynomial of degree 2 in $u_1$, $u_2$, $u_3$, and $u_4$. The nature of the application is such that $\delta(u)$ defines what is commonly referred to as a “response surface.”
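For concreteness, a degree-2 polynomial in the four coded factors corresponds to a model matrix with 1 + 4 + 4 + 6 = 15 columns (intercept, linear terms, squares, and cross products). The following is a minimal sketch of its construction; the coded rows shown are hypothetical, not the settings of Table 4.4:

    import numpy as np
    from itertools import combinations

    # Hypothetical rows of coded settings (u1, u2, u3, u4), each +1, 0, or -1.
    U = np.array([[ 1.0,  0.0, -1.0,  0.0],
                  [ 0.0,  1.0,  1.0, -1.0],
                  [-1.0, -1.0,  0.0,  1.0]])

    cols = [np.ones(len(U))]                                # intercept
    cols += [U[:, j] for j in range(4)]                     # linear terms
    cols += [U[:, j] ** 2 for j in range(4)]                # quadratic terms
    cols += [U[:, j] * U[:, l] for j, l in combinations(range(4), 2)]  # cross products
    X = np.column_stack(cols)
    print(X.shape)                                          # (3, 15)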
The batches in the corn-milling experiment define classes within which the residual effects are
likely to be correlated. Moreover, this correlation is likely to be compound symmetric, that is, to
be the same for every two residual effects in the same class. In the simplest case, the intraclass
correlation would be regarded as having the same value, say $\rho$, for every class. Then, assuming that the data points (and the corresponding residual effects) are numbered $1, 2, \ldots, 30$ in the order in which the data points are listed in Table 4.4 and that the residual effects have a common variance $\sigma^2$, the variance-covariance matrix of the vector $e$ of residual effects would be

$\operatorname{var}(e) = \sigma^2\,\operatorname{diag}(R_1, R_2, \ldots, R_{10})$, where (for $k = 1, 2, \ldots, 10$) $R_k = (1-\rho)\,I_3 + \rho\,1_3 1_3'$.
Temp.   Time                                     Date (month/day)
(°F)    (sec.)   07/11  07/16  07/20  08/07  08/08  08/14  08/20  08/22  09/11  09/24  10/03  10/10
375 30 1,226 1,075 1,172 1,213 1,282 1,142 1,281 1,305 1,091 1,281 1,305 1,207
400 30 1,898 1,790 1,804 1,961 1,940 1,699 1,833 1,774 1,588 1,992 2,011 1,742
450 30 2,142 1,843 2,061 2,184 2,095 1,935 2,116 2,133 1,913 2,213 2,192 1,995
375 35 1,472 1,121 1,506 1,606 1,572 1,608 1,502 1,580 1,343 1,691 1,584 1,486
400 35 2,010 2,175 2,279 2,450 2,291 2,374 2,417 2,393 2,205 2,142 2,052 2,339
400 35 1,882 2,355 2,268 2,032
400 35 1,915 2,420 2,103 2,190
400 35 2,106 2,240
450 35 2,352 2,274 2,168 2,298 2,147 2,413 2,430 2,440 2,093 2,208 2,201 2,216
375 40 1,491 1,691 1,707 1,882 1,741 1,846 1,645 1,688 1,582 1,692 1,744 1,751
400 40 2,078 2,513 2,392 2,531 2,366 2,392 2,392 2,413 2,392 2,488 2,392 2,390
450 40 2,531 2,588 2,617 2,609 2,431 2,408 2,517 2,604 2,477 2,601 2,588 2,572
We could take $u_1$ and $u_2$ to be the temperature and curing time (employed in the manufacture of the adhesive), and take $\delta(u)$ to be a polynomial of degree 2 in $u_1$ and $u_2$. As in the previous example (that of Subsection c), the nature of the application is such that $\delta(u)$ defines a “response surface.”
Steel aliquots chosen at random from those on hand on any particular date may tend to resemble
each other more closely than ones chosen at random from those on hand on different dates. Accord-
ingly, it may be prudent to regard the 12 blocks as “classes” and to allow for the possibility of an
intraclass correlation. If (in doing so) it is assumed that the intraclass correlation and the variance of the residual effects have values, say $\rho$ and $\sigma^2$, respectively, that do not vary from block to block, then the variance-covariance matrix of the vector $e$ of residual effects is
$\operatorname{var}(e) = \sigma^2\,\operatorname{diag}(R_1, R_2, \ldots, R_{12})$,    (4.34)

where (for $k = 1, 2, \ldots, 12$)

$R_k = (1-\rho)\,I_{N_k} + \rho\,1_{N_k} 1_{N_k}'$

with

$N_k = \begin{cases} 9 & \text{if } k = 2, 3, 5, 6, 7, 8, 10, \text{ or } 12, \\ 11 & \text{if } k = 9 \text{ or } 11, \\ 12 & \text{if } k = 1 \text{ or } 4. \end{cases}$
In arriving at expression (4.34), it was implicitly assumed that the residual effects associated with
the data points in any one block are uncorrelated with those associated with the data points in any
other block. It is conceivable that steel aliquots chosen at random from those on hand on different
dates may tend to be more alike when the intervening time (between dates) is short than when it is
long. If we wished to account for any such tendency, we would need to allow for the possibility that
the residual effects associated with the data points in different blocks may be correlated to an extent
that diminishes with the separation (in time) between the blocks. That would seem to call for taking $\operatorname{var}(e)$ to be of a different and more complex form than the block-diagonal form (4.34).
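The block-diagonal matrix (4.34) is easy to assemble numerically. A minimal sketch, assuming arbitrary values of $\rho$ and $\sigma^2$ and the block sizes given above:

    import numpy as np

    rho, sigma2 = 0.3, 1.0                              # arbitrary illustrative values
    sizes = [12, 9, 9, 12, 9, 9, 9, 9, 11, 9, 11, 9]    # N_1, N_2, ..., N_12

    # var(e) = sigma^2 diag(R_1, ..., R_12), with R_k = (1 - rho) I + rho 1 1'
    V = np.zeros((sum(sizes), sum(sizes)))
    i = 0
    for n in sizes:
        R_k = (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))
        V[i:i + n, i:i + n] = sigma2 * R_k
        i += n
    print(V.shape)                                      # (118, 118)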
e. Longitudinal data
There are situations where the data are obtained by recording the value of what is essentially the same
variate for each of a number of “observational units” on each of a number of occasions (corresponding
to different points in time). The observational units might be people, animals, plants, laboratory
specimens, households, experimental plots (of land), or other such entities. For example, in a clinical
trial of drugs for lowering blood pressure, each drug might be administered to a different group of
people, with a placebo being administered to an additional group, and each person’s blood pressure
might be recorded periodically, including at least once prior to the administration of the drug or
placebo—in this example, each person constitutes an observational unit. Data of this kind are referred
to as longitudinal data.
Suppose that the observed values of the random variables $y_1, y_2, \ldots, y_N$ in the general linear model are longitudinal data. Further, let $t$ represent time, and denote by $t_1, t_2, \ldots, t_N$ the values of $t$ corresponding to $y_1, y_2, \ldots, y_N$, respectively. And assume (as can be done essentially without loss of generality) that the vector $u$ of explanatory variables $u_1, u_2, \ldots, u_C$ is such that $t$ is one of the explanatory variables or, more generally, is expressible as a function of $u$, so that (for $i = 1, 2, \ldots, N$) the $i$th value $t_i$ of $t$ is determinable from the corresponding ($i$th) value $u_i$ of $u$.
Denote by $K$ the number of observational units represented in the data, suppose that the observational units are numbered $1, 2, \ldots, K$, and define $N_k$ to be the number of data points pertaining to the $k$th observational unit (so that $\sum_{k=1}^K N_k = N$). Assume that the numbering (from 1 through $N$) of the $N$ data points is such that they are ordered by observational unit and by time within observational unit (so that if the $i$th data point pertains to the $k$th observational unit and the $i'$th data point to the $k'$th observational unit, where $i' > i$, then either $k' > k$ or $k' = k$ and $t_{i'} \ge t_i$); it is always possible to number the data points in such a way. The setting is one in which it is customary and convenient to use two subscripts, rather than one, to identify the random variables $y_1, y_2, \ldots, y_N$, residual effects $e_1, e_2, \ldots, e_N$, and times $t_1, t_2, \ldots, t_N$. Accordingly, let us write $e_{ks}$ for the residual effect corresponding to the $s$th of those data points that pertain to the $k$th observational unit, and $t_{ks}$ for the time corresponding to that data point.
It is often possible to account for the more “systematic” effects of time through the choice of the form of the function $\delta(u)$. However, even then, it is seldom appropriate to assume (as in the G–M model) that all $N$ of the residual effects are uncorrelated with each other.
The vector $e$ of residual effects is such that

$e' = (e_1', e_2', \ldots, e_K')$,

where (for $k = 1, 2, \ldots, K$) $e_k = (e_{k1}, e_{k2}, \ldots, e_{kN_k})'$. It is assumed that $\operatorname{cov}(e_k, e_j) = 0$ for $j \ne k = 1, 2, \ldots, K$, so that

$\operatorname{var}(e) = \operatorname{diag}[\operatorname{var}(e_1), \operatorname{var}(e_2), \ldots, \operatorname{var}(e_K)]$.    (4.35)
For $k = 1, 2, \ldots, K$, $\operatorname{var}(e_k)$ is the $N_k \times N_k$ matrix with $r$th diagonal element $\operatorname{var}(e_{kr})$ and $rs$th (where $s \ne r$) off-diagonal element $\operatorname{cov}(e_{kr}, e_{ks})$. It is to be expected that $e_{kr}$ and $e_{ks}$ will be positively correlated, with the extent of their correlation depending on $|t_{ks} - t_{kr}|$; typically, the correlation of $e_{kr}$ and $e_{ks}$ is a strictly decreasing function of $|t_{kr} - t_{ks}|$. There are various kinds of stochastic processes that exhibit that kind of correlation structure. Among the simplest and most prominent of them is the following.
Stationary first-order autoregressive processes. Let $x_1$ represent a random variable having mean 0 (and a strictly positive variance), and let $x_2, x_3, \ldots$ represent a possibly infinite sequence of random variables generated successively (starting with $x_1$) in accordance with the following relationship:

$x_{i+1} = \rho x_i + d_{i+1}$.    (4.36)

Here, $\rho$ is a (nonrandom) scalar in the interval $0 < \rho < 1$, and $d_2, d_3, \ldots$ are random variables, each with mean 0 (and a finite variance), that are uncorrelated with $x_1$ and with each other. The sequence of random variables $x_1, x_2, x_3, \ldots$ represents a stochastic process that is characterized as first-order autoregressive.
Note that

$E(x_i) = 0$ (for all $i$).    (4.37)

Note also that for $r = 1, 2, \ldots, i$, $\operatorname{cov}(x_r, d_{i+1}) = 0$, as can be readily verified by mathematical induction; because $x_r = \rho x_{r-1} + d_r$, $\operatorname{cov}(x_{r-1}, d_{i+1}) = 0$ implies that $\operatorname{cov}(x_r, d_{i+1}) = 0$. In particular, $\operatorname{cov}(x_i, d_{i+1}) = 0$. Thus,

$\operatorname{var}(x_{i+1}) = \rho^2 \operatorname{var}(x_i) + \operatorname{var}(d_{i+1})$.    (4.38)

Let us determine the conditions under which the sequence of random variables $x_1, x_2, x_3, \ldots$ is stationary in the sense that

$\operatorname{var}(x_1) = \operatorname{var}(x_2) = \operatorname{var}(x_3) = \cdots$.    (4.39)
In light of equality (4.38), condition (4.39) is satisfied if and only if

$\operatorname{var}(d_{i+1}) = (1 - \rho^2)\operatorname{var}(x_1)$ (for all $i$).    (4.40)
By making repeated use of the defining relationship (4.36), we find that, for an arbitrary positive integer $s$,

$x_{i+s} = \rho^s x_i + \sum_{j=0}^{s-1} \rho^j d_{i+s-j}$.    (4.41)

And it follows that

$\operatorname{cov}(x_i, x_{i+s}) = \operatorname{cov}(x_i, \rho^s x_i) = \rho^s \operatorname{var}(x_i)$.    (4.42)

Thus, in the special case where the sequence $x_1, x_2, x_3, \ldots$ satisfies condition (4.39), we have that

$\operatorname{corr}(x_i, x_{i+s}) = \rho^s$    (4.43)

or, equivalently, that (for $r \ne i = 1, 2, 3, \ldots$)

$\operatorname{corr}(x_i, x_r) = \rho^{|r-i|}$.    (4.44)
In summary, we have that when $d_2, d_3, d_4, \ldots$ satisfy condition (4.40), the sequence of random variables $x_1, x_2, x_3, \ldots$ satisfies condition (4.39) and condition (4.43) or (4.44). Accordingly, when $d_2, d_3, d_4, \ldots$ satisfy condition (4.40), the sequence $x_1, x_2, x_3, \ldots$ is of a kind that is sometimes referred to as stationary in the wide sense (e.g., Parzen 1960, chap. 10).
The entries in the sequence $x_1, x_2, x_3, \ldots$ may represent the state of some phenomenon at a succession of times, say times $t_1, t_2, t_3, \ldots$. The coefficient of $x_i$ in expression (4.36) is $\rho$, which does not vary with $i$ and hence does not vary with the “elapsed times” $t_2 - t_1, t_3 - t_2, t_4 - t_3, \ldots$. So, for the sequence $x_1, x_2, x_3, \ldots$ to be a suitable reflection of the evolution of the phenomenon over time, it would seem to be necessary that $t_1, t_2, t_3, \ldots$ be equally spaced.
A sequence that may be suitable even if $t_1, t_2, t_3, \ldots$ are not equally spaced can be achieved by introducing a modified version of the defining relationship (4.36). The requisite modification can be discerned from result (4.41) by thinking of the implications of that result as applied to a situation where the successive differences in time ($t_2 - t_1, t_3 - t_2, t_4 - t_3, \ldots$) are equal and arbitrarily small. Specifically, what is needed is to replace relationship (4.36) with the relationship

$x_{i+1} = \rho^{\,t_{i+1} - t_i} x_i + d_{i+1}$,    (4.45)

where $\rho$ is a (nonrandom) scalar in the interval $0 < \rho < 1$.
Suppose that this replacement is made. Then, in lieu of result (4.38), we have that

$\operatorname{var}(x_{i+1}) = \rho^{\,2(t_{i+1} - t_i)} \operatorname{var}(x_i) + \operatorname{var}(d_{i+1})$.
And by employing essentially the same reasoning as before, we find that the sequence $x_1, x_2, x_3, \ldots$ satisfies condition (4.39) [i.e., the condition that the sequence is stationary in the sense that $\operatorname{var}(x_1) = \operatorname{var}(x_2) = \operatorname{var}(x_3) = \cdots$] if (and only if)

$\operatorname{var}(d_{i+1}) = [1 - \rho^{\,2(t_{i+1} - t_i)}]\operatorname{var}(x_1)$ (for all $i$).    (4.46)

Further, in lieu of result (4.42), we have that

$\operatorname{cov}(x_i, x_{i+s}) = \rho^{\,t_{i+s} - t_i} \operatorname{var}(x_i)$.

Thus, in the special case where the sequence $x_1, x_2, x_3, \ldots$ satisfies condition (4.39), we have that

$\operatorname{corr}(x_i, x_{i+s}) = \rho^{\,t_{i+s} - t_i}$;    (4.47)

these two results take the place of results (4.43) and (4.44), respectively.
In connection with result (4.47), it is worth noting that [in the special case where the sequence $x_1, x_2, x_3, \ldots$ satisfies condition (4.39)] $\operatorname{corr}(x_i, x_r) \to 1$ as $|t_r - t_i| \to 0$ and $\operatorname{corr}(x_i, x_r) \to 0$ as $|t_r - t_i| \to \infty$. And (in that special case) the quantity $\rho$ represents the correlation of any two of the $x_i$'s that are separated from each other by a single unit of time.
In what follows, it is the stochastic process defined by relationship (4.45) that is referred to
as a first-order autoregressive process. Moreover, in the special case where the stochastic process
defined by relationship (4.45) satisfies condition (4.39), it is referred to as a stationary first-order
autoregressive process.
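For a fixed set of (possibly unequally spaced) times, the correlation matrix of a stationary first-order autoregressive process has $(r, s)$th element $\rho^{|t_r - t_s|}$. A minimal numerical sketch (the times and the value of $\rho$ are arbitrary):

    import numpy as np

    rho = 0.8                                   # arbitrary value in (0, 1)
    t = np.array([0.0, 1.0, 1.5, 4.0])          # arbitrary, unequally spaced times

    # corr(x_r, x_s) = rho^{|t_r - t_s|}; the correlation is 1 on the diagonal
    # and decays toward 0 as the elapsed time grows.
    R = rho ** np.abs(np.subtract.outer(t, t))
    print(np.round(R, 3))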
Variance-covariance matrix of a subvector of residual effects. Let us now return to the main development, and consider further the form of the matrices $\operatorname{var}(e_1), \operatorname{var}(e_2), \ldots, \operatorname{var}(e_K)$. We could consider taking (for an “arbitrary” $k$) $\operatorname{var}(e_k)$ to be of the form that would result from regarding $e_{k1}, e_{k2}, \ldots, e_{kN_k}$ as $N_k$ successive members of a stationary first-order autoregressive process, in which case we would have (based on the results of Part 1) that (for some strictly positive scalar $\sigma$ and some scalar $\rho$ in the interval $0 < \rho < 1$) $\operatorname{var}(e_{k1}) = \operatorname{var}(e_{k2}) = \cdots = \operatorname{var}(e_{kN_k}) = \sigma^2$ and $\operatorname{corr}(e_{ks}, e_{kj}) = \rho^{\,|t_{kj} - t_{ks}|}$ ($j \ne s = 1, 2, \ldots, N_k$). However, for most applications, taking $\operatorname{var}(e_k)$ to be of that form would be inappropriate. It would imply that residual effects corresponding
to “replicate” data points, that is, data points pertaining to the same observational unit and to the
same time, are perfectly correlated. It would also imply that residual effects corresponding to data
points that pertain to the same observational unit, but that are widely separated in time, are essentially
uncorrelated. Neither characteristic conforms to what is found in many applications. There may be
“measurement error,” in which case the residual effects corresponding to replicate data points are
expected to differ from one another. And all of the data points that pertain to the same observational
unit (including ones that are widely separated in time) may be subject to some common influences,
in which case every two of the residual effects corresponding to those data points may be correlated
to at least some minimal extent.
Results that are better suited for most applications can be obtained by adopting a somewhat
more elaborate approach. This approach builds on the approach introduced in Part 3 of Subsection
b in connection with compound symmetry. Take $a_k$ ($k = 1, 2, \ldots, K$) and $r_{ks}$ ($k = 1, 2, \ldots, K$; $s = 1, 2, \ldots, N_k$) to be uncorrelated random variables, each with mean 0, such that $\operatorname{var}(a_k) = \gamma_k^2$ for some nonnegative scalar $\gamma_k$ and $\operatorname{var}(r_{ks}) = \tau_k^2$ ($s = 1, 2, \ldots, N_k$) for some strictly positive scalar $\tau_k$. Further, take $f_{ks}$ ($k = 1, 2, \ldots, K$; $s = 1, 2, \ldots, N_k$) to be random variables, each with mean 0, that are uncorrelated with the $a_k$'s and the $r_{ks}$'s; take the elements of each of the $K$ sets $\{f_{k1}, f_{k2}, \ldots, f_{kN_k}\}$ ($k = 1, 2, \ldots, K$) to be uncorrelated with those of each of the others; and take the variances and covariances of $f_{k1}, f_{k2}, \ldots, f_{kN_k}$ to be those obtained by regarding these
random variables as $N_k$ successive members of a stationary first-order autoregressive process, so that (for some strictly positive scalar $\phi_k$ and some scalar $\rho_k$ in the interval $0 < \rho_k < 1$) $\operatorname{var}(f_{ks}) = \phi_k^2$ and $\operatorname{corr}(f_{ks}, f_{kj}) = \rho_k^{\,|t_{kj} - t_{ks}|}$. Finally, suppose that, for $k = 1, 2, \ldots, K$ and $s = 1, 2, \ldots, N_k$,

$e_{ks} = a_k + f_{ks} + r_{ks}$.    (4.48)

Then,

$\operatorname{var}(e_{ks}) = \gamma_k^2 + \phi_k^2 + \tau_k^2$    (4.49)

and, for $j \ne s = 1, 2, \ldots, N_k$,

$\operatorname{cov}(e_{ks}, e_{kj}) = \gamma_k^2 + \phi_k^2\,\rho_k^{\,|t_{kj} - t_{ks}|}$.    (4.50)
The correlation $\operatorname{corr}(e_{ks}, e_{kj})$ can be regarded as a function of the elapsed time $|t_{kj} - t_{ks}|$. Clearly, if the $\gamma_k$'s, $\tau_k$'s, $\phi_k$'s, and $\rho_k$'s are such that $\gamma_k/\tau_k$, $\phi_k/\tau_k$, and $\rho_k$ do not vary with $k$, then this function does not vary with $k$.
In connection with supposition (4.48), it may be advisable to extend the parameter space of $\phi_k$ by appending the value 0, which corresponds to allowing for the possibility that $\operatorname{var}(f_{k1}) = \operatorname{var}(f_{k2}) = \cdots = \operatorname{var}(f_{kN_k}) = 0$ or, equivalently, that $f_{k1} = f_{k2} = \cdots = f_{kN_k} = 0$ with probability 1. When $\phi_k = 0$, $\operatorname{var}(e_k)$ is of the compound-symmetric form considered in Subsection b.
In practice, it is common to make the simplifying assumption that the $\gamma_k$'s, $\tau_k$'s, $\phi_k$'s, and $\rho_k$'s do not vary with $k$, so that $\gamma_1 = \gamma_2 = \cdots = \gamma_K = \gamma$, $\tau_1 = \tau_2 = \cdots = \tau_K = \tau$, $\phi_1 = \phi_2 = \cdots = \phi_K = \phi$, and $\rho_1 = \rho_2 = \cdots = \rho_K = \rho$ for some nonnegative scalar $\gamma$, for some strictly positive scalar $\tau$, for some strictly positive (or alternatively nonnegative) scalar $\phi$, and for some scalar $\rho$ in the interval $0 < \rho < 1$. Under that assumption, neither $\operatorname{var}(e_{ks})$ nor $\operatorname{corr}(e_{ks}, e_{kj})$ varies with $k$. Moreover, if $\rho$ and the ratios $\gamma/\tau$ and $\phi/\tau$ are known, then $\operatorname{var}(e)$ is of the form $\sigma^2 H$ of $\operatorname{var}(e)$ in the Aitken generalization of the G–M model. When $\gamma$, $\tau$, $\phi$, and $\rho$ are regarded as unknown parameters and are taken to be the elements of the vector $\theta$, $\operatorname{var}(e)$ is of the form $V(\theta)$ of $\operatorname{var}(e)$ in the general linear model.
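By way of illustration (a sketch under arbitrary parameter values, with hypothetical observation times), the $k$th diagonal block of $\operatorname{var}(e)$ implied by expressions (4.49) and (4.50) is the sum of three simple matrices:

    import numpy as np

    gamma2, tau2, phi2, rho = 0.5, 0.3, 0.7, 0.9   # arbitrary illustrative values
    t_k = np.array([8.0, 10.0, 12.0, 14.0])        # hypothetical times for unit k

    # Diagonal elements gamma^2 + phi^2 + tau^2, per (4.49); off-diagonal
    # elements gamma^2 + phi^2 rho^{|t_kj - t_ks|}, per (4.50).
    AR = rho ** np.abs(np.subtract.outer(t_k, t_k))
    V_k = gamma2 * np.ones((4, 4)) + phi2 * AR + tau2 * np.eye(4)
    print(np.round(V_k, 3))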
                 Girls                                           Boys
             Age (in years)                                  Age (in years)
Youngster    8     10     12     14         Youngster    8     10     12     14
    1       21    20     21.5   23             12       26    25     29     31
    2       21    21.5   24     25.5           13       21.5  22.5   23     26.5
    3       20.5  24     24.5   26             14       23    22.5   24     27.5
    4       23.5  24.5   25     26.5           15       25.5  27.5   26.5   27
    5       21.5  23     22.5   23.5           16       20    23.5   22.5   26
    6       20    21     21     22.5           17       24.5  25.5   27     28.5
    7       21.5  22.5   23     25             18       22    22     24.5   26.5
    8       23    23     23.5   24             19       24    21.5   24.5   25.5
    9       20    21     22     21.5           20       23    20.5   31     26
   10       16.5  19     19     19.5           21       27.5  28     31     31.5
   11       24.5  25     28     28             22       23    23     23.5   25
                                               23       21.5  23.5   24     28
                                               24       17    24.5   26     29.5
                                               25       22.5  25.5   25.5   26
                                               26       23    24.5   26     30
                                               27       22    21.5   23.5   25
Taking $\delta(u)$ to be of the form (4.51) allows for the possibility that the relationship between distance and age may be markedly different for boys than for girls.
The distance measurements can be regarded as longitudinal data. The 27 youngsters (11 girls and 16 boys) form $K = 27$ observational units, each of which contributes 4 data points. And age plays the role of time.
Partition the (column) vector $e$ of residual effects (in the general linear model) into 4-dimensional subvectors $e_1, e_2, \ldots, e_{27}$ [in such a way that $e' = (e_1', e_2', \ldots, e_{27}')$]. (This partitioning corresponds to the partitioning of $y$ into the subvectors $y_1, y_2, \ldots, y_{27}$.) And (for $k = 1, 2, \ldots, 27$) denote by $e_{k1}$, $e_{k2}$, $e_{k3}$, and $e_{k4}$ the elements of the subvector $e_k$; these are the residual effects that correspond to the distance measurements on the $k$th youngster at ages 8, 10, 12, and 14. Then, proceeding as in Subsection e, we could take $\operatorname{var}(e)$ to be of the form

$\operatorname{var}(e) = \operatorname{diag}[\operatorname{var}(e_1), \operatorname{var}(e_2), \ldots, \operatorname{var}(e_{27})]$.
Further, for $k = 1, 2, \ldots, 27$, we could take $\operatorname{var}(e_k)$ to be of the form associated with the supposition that (for $s = 1, 2, 3, 4$) $e_{ks}$ is expressible in the form of the decomposition (4.48). That is, we could take the diagonal elements of $\operatorname{var}(e_k)$ to be of the form (4.49), and the off-diagonal elements to be of the form (4.50). Expressions (4.49) and (4.50) determine $\operatorname{var}(e_k)$ up to the values of the four parameters $\gamma_k$, $\tau_k$, $\phi_k$, and $\rho_k$. It is assumed that, for $k = 1, 2, \ldots, 11$ (corresponding to the 11 girls), the values of $\gamma_k$, $\tau_k$, $\phi_k$, and $\rho_k$ do not vary with $k$, and similarly that, for $k = 12, 13, \ldots, 27$ (corresponding to the 16 boys), the values of $\gamma_k$, $\tau_k$, $\phi_k$, and $\rho_k$ do not vary with $k$. Under that assumption, $\operatorname{var}(e)$ would depend on 8 parameters. Presumably, those parameters would be unknown, in which case the vector $\theta$ of unknown parameters in the variance-covariance matrix $V(\theta)$ (of the vector $e$ of residual effects in the general linear model) would be of dimension 8.
g. Spatial data
There are situations where each of the N data points is associated with a specific location (in 1-,
2-, or 3-dimensional space). For example, the data points might be measurements of the hardness
of samples of water obtained from different wells. Data of this kind are among those referred to as
spatial data.
Suppose that the observed value of each of the random variables $y_1, y_2, \ldots, y_N$ in the general linear model is associated with a specific location in $D$-dimensional space. Further, let us represent an arbitrary location in $D$-dimensional space by a $D$-dimensional column vector $s$ of “coordinates,” denote by $s_1, s_2, \ldots, s_N$ the values of $s$ corresponding to $y_1, y_2, \ldots, y_N$, respectively, and take $S$ to be a finite or infinite set of values of $s$ that includes $s_1, s_2, \ldots, s_N$ (and perhaps other values of $s$ that may be of interest). Assume (as can be done essentially without loss of generality) that the vector $u$ of explanatory variables includes the vector $s$ as a subvector or, more generally, that the elements of $s$ are expressible as functions of $u$, so that (for $i = 1, 2, \ldots, N$) the $i$th value $s_i$ of $s$ is determinable from the $i$th value $u_i$ of $u$.
Typically, data points associated with locations that are in close proximity tend to be more alike than those associated with locations that are farther apart. This phenomenon may be due in part to “systematic forces” that manifest themselves as “trends” or “gradients” in the surface defined by the function $\delta(u)$; whatever part may be due to these systematic forces is sometimes referred to as large-scale variation. However, typically not all of this phenomenon is attributable to systematic forces or is reflected in the surface defined by $\delta(u)$. There is generally a nonsystematic component. The nonsystematic component takes the form of what is sometimes called small-scale variation and is reflected in the correlation matrix of the residual effects $e_1, e_2, \ldots, e_N$. The residual effects may be positively correlated, with the correlation being greatest among residual effects corresponding to locations that are in close proximity.
It is supposed that the residual effects $e_1, e_2, \ldots, e_N$ are expressible as follows:

$e_i = a_i + r_i$  ($i = 1, 2, \ldots, N$).    (4.54)
Further, the correlation between the $i$th and $j$th residual effects is of the form

$\operatorname{corr}(e_i, e_j) = \dfrac{\gamma^2}{\sigma^2}\,K(s_i - s_j) = \dfrac{\gamma^2}{\gamma^2 + \tau^2}\,K(s_i - s_j) = \dfrac{(\gamma/\tau)^2}{1 + (\gamma/\tau)^2}\,K(s_i - s_j)$.
Accordingly, the distribution of the residual effects is weakly stationary in the sense that all of the
residual effects have the same variance and in the sense that the covariance and correlation between
any two residual effects depend on the corresponding locations only through the difference between
the locations.
The variance $\sigma^2$ of the residual effects is the sum of two components: (1) the variance $\gamma^2$ of the $a_i$'s and (2) the variance $\tau^2$ of the $r_i$'s. The first component $\gamma^2$ accounts for a part of whatever variability is spatial in origin, that is, a part of whatever variability is related to differences in location; it accounts for the so-called small-scale variation. The second component $\tau^2$ accounts for the remaining variability, including any variability that may be attributable to measurement error.
Depending on the application, some of the $N$ locations $s_1, s_2, \ldots, s_N$ may be identical. For example, the $N$ data points may represent the scores achieved on a standardized test by $N$ different students, and (for $i = 1, 2, \ldots, N$) $s_i$ may represent the location of the school system in which the $i$th student is enrolled. In such a circumstance, the variation accounted for by $\tau^2$ would include the variation among those residual effects for which the corresponding locations are the same. If there were $L$ distinct values of $s$ represented among the $N$ locations $s_1, s_2, \ldots, s_N$ and the residual
effects $e_1, e_2, \ldots, e_N$ were divided into $L$ classes in such a way that the locations corresponding to the residual effects in any particular class were identical, then the ratio $\gamma^2/(\gamma^2 + \tau^2)$ could be regarded as the intraclass correlation; refer to Subsection b.
There may be spatial variation that is so localized in nature that it does not contribute to the covariances among the residual effects. This kind of spatial variation is sometimes referred to as microscale variation (e.g., Cressie 1993). The range of its influence is less than the distance between any two of the locations $s_1, s_2, \ldots, s_N$. Its contribution, which has come to be known as the “nugget effect,” is reflected in the component $\tau^2$.
To complete the specification of the form (4.58) of the covariances of the residual effects $e_1, e_2, \ldots, e_N$, it remains to specify the form of the function $K(\cdot)$. In that regard, it suffices to take [for an arbitrary $D$-dimensional column vector $h$ in the domain $H$ of $K(\cdot)$]

$K(h) = E(\cos h'w)$,    (4.59)

where $w$ is a $D$-dimensional random column vector (the distribution of which may depend on unknown parameters).
Let us confirm this claim; that is, let us confirm that when $K(\cdot)$ is taken to be of the form (4.59), it has (even if $S$ and hence $H$ comprise all of $R^D$) the three properties required of an autocorrelation function. Recall that $\cos 0 = 1$; that for any real number $x$, $\cos(-x) = \cos x$; and that for any real numbers $x$ and $z$, $\cos(x - z) = (\cos x)(\cos z) + (\sin x)(\sin z)$. When $K(\cdot)$ is taken to be of the form (4.59), we have (in light of the properties of the cosine operator) that (1)

$K(0) = E(\cos 0'w) = E(\cos 0) = E(1) = 1$;

that (2) for $h \in H$,

$K(-h) = E[\cos(-h)'w] = E[\cos(-h'w)] = E(\cos h'w) = K(h)$;
and that (3) for every positive integer $M$, any $M$ vectors $t_1, t_2, \ldots, t_M$ in $S$, and any $M$ scalars $x_1, x_2, \ldots, x_M$,

$\sum_{i=1}^M \sum_{j=1}^M x_i x_j K(t_i - t_j) = \sum_{i=1}^M \sum_{j=1}^M x_i x_j E[\cos(t_i'w - t_j'w)]$
$\quad = E\big[\sum_{i=1}^M \sum_{j=1}^M x_i x_j \cos(t_i'w - t_j'w)\big]$
$\quad = E\big[\sum_{i=1}^M \sum_{j=1}^M x_i x_j (\cos t_i'w)(\cos t_j'w) + \sum_{i=1}^M \sum_{j=1}^M x_i x_j (\sin t_i'w)(\sin t_j'w)\big]$
$\quad = E\big[\big(\sum_{i=1}^M x_i \cos t_i'w\big)^2 + \big(\sum_{i=1}^M x_i \sin t_i'w\big)^2\big]$
$\quad \ge 0$.
Thus, when $K(\cdot)$ is taken to be of the form (4.59), it has (even if $S$ and $H$ comprise all of $R^D$) the requisite three properties.
Expression (4.59) depends on the distribution of $w$. The evaluation of this expression for a distribution of any particular form is closely related to the evaluation of the characteristic function of a distribution of that form. The characteristic function of the distribution of $w$ is the function $c(\cdot)$ defined (for an arbitrary $D$-dimensional column vector $h$) by $c(h) = E(e^{\mathrm{i}\,h'w})$ (where $\mathrm{i} = \sqrt{-1}$). The characteristic function of the distribution of $w$ can be expressed in the form

$c(h) = E(\cos h'w) + \mathrm{i}\,E(\sin h'w)$,    (4.60)
the real component of which is identical to expression (4.59). It is worth mentioning that if the distribution of $w$ has a moment generating function, say $m(\cdot)$, then

$c(h) = m(\mathrm{i}h)$    (4.61)

(e.g., Grimmett and Welsh 1986, p. 117).
In the special case where $w$ is distributed symmetrically about 0 (i.e., where $-w$ and $w$ have the same distribution), expression (4.60) simplifies to

$c(h) = E(\cos h'w)$,

as is evident upon recalling that (for any real number $x$) $\sin(-x) = -\sin x$ and then observing that $E(\sin h'w) = E[\sin h'(-w)] = E[\sin(-h'w)] = -E(\sin h'w)$ [which implies that $E(\sin h'w) = 0$]. Thus, in the special case where $w$ is distributed symmetrically about 0, taking $K(\cdot)$ to be of the form (4.59) is equivalent to taking $K(\cdot)$ to be of the form

$K(h) = c(h)$.
Suppose, for example, that $w \sim N(0, \Gamma)$, where $\Gamma = \{\gamma_{ij}\}$ is a symmetric nonnegative definite matrix. Then, it follows from result (3.5.47) that the moment generating function $m(\cdot)$ of the distribution of $w$ is

$m(h) = \exp(\tfrac{1}{2}\,h'\Gamma h)$,

implying [in light of result (4.61)] that the characteristic function $c(\cdot)$ of the distribution of $w$ is

$c(h) = \exp(-\tfrac{1}{2}\,h'\Gamma h)$.

Thus, the choices for the form of the function $K(\cdot)$ include

$K(h) = \exp(-\tfrac{1}{2}\,h'\Gamma h)$.    (4.62)
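A minimal sketch (with arbitrary parameter values and hypothetical 2-dimensional locations) of a spatial variance-covariance matrix built from the Gaussian autocorrelation function (4.62), restricted to the isotropic case $\Gamma = \lambda I$; the nugget component $\tau^2$ enters only on the diagonal:

    import numpy as np

    gamma2, tau2, lam = 1.0, 0.25, 0.1          # arbitrary illustrative values
    S = np.array([[0.0, 0.0],
                  [1.0, 2.0],
                  [3.0, 1.0]])                  # hypothetical locations

    # K(h) = exp(-0.5 h'(lam I)h) = exp(-0.5 lam ||h||^2), per (4.62)
    diff = S[:, None, :] - S[None, :, :]
    K = np.exp(-0.5 * lam * (diff ** 2).sum(axis=-1))

    # cov(e_i, e_j) = gamma^2 K(s_i - s_j) for i != j; var(e_i) = gamma^2 + tau^2
    V = gamma2 * K + tau2 * np.eye(len(S))
    print(np.round(V, 3))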
TABLE 4.7. The diameters and heights and the coordinates of the locations of 101 trees in a circular region of radius 40 m; the locations are relative to the center of the region (Zhang, Bi, Cheng, and Davis 2004).

Tree  Diam. (cm)  Hgt. (m)  Coord. 1st (m)  Coord. 2nd (m)    Tree  Diam. (cm)  Hgt. (m)  Coord. 1st (m)  Coord. 2nd (m)
  1      17.5      14.0         0.73            1.31            52     43.7      31.5        32.13           18.55
  2      18.3      15.1         1.99            0.24            53     23.5      22.0        31.55           18.95
  3      10.1       7.3         2.96            3.65            54     53.1      25.1        19.79            4.93
  4      49.0      26.5        32.90           15.34            55     18.4      12.1        21.30            6.92
  5      32.9      22.9        18.28           11.87            56     38.8      24.7        24.06            7.82
  6      50.6      16.3        22.31            1.95            57     66.7      35.3        32.98           12.00
  7      66.2      26.7        27.69            3.40            58     61.7      31.5        36.32            9.73
  8      73.5      31.0        29.30            0.51            59     15.9      11.2        35.84            9.60
  9      39.7      26.5        37.65            1.97            60     14.5      11.5        22.11            4.70
 10      69.8      32.5        39.75            3.47            61     18.8      14.2         1.98            0.31
 11      67.4      28.3        15.78            0.83            62     19.4      14.8         2.26            0.44
 12      68.6      31.1        33.45            9.60            63     16.2      12.7         8.25            0.87
 13      27.0      15.2        34.43           10.53            64     55.9      26.0        13.10            2.31
 14      44.8      25.2        36.40           12.54            65     55.9      25.6        13.10            2.31
 15      44.4      29.2        32.00           15.61            66     39.6      24.3        28.44            2.99
 16      44.5      32.4        29.53           17.05            67     12.7      11.4        29.43            2.06
 17      72.0      28.8         9.92            6.20            68     10.2      10.6        31.08            2.72
 18      94.2      33.0        12.70            9.57            69     10.2       9.2        31.18            2.73
 19      63.8      29.0        29.39           22.97            70     10.3       7.2        31.88            2.79
 20      12.7      12.9        20.61           22.89            71     15.3      13.2        33.65            1.77
 21      40.6      23.9        14.65           22.56            72     13.5       8.4        36.08            1.26
 22      31.5      13.2        13.90           23.14            73     10.8       8.3        34.33           11.16
 23      40.2      26.7        16.37           26.21            74     55.8      27.9        28.09           11.35
 24      38.9      22.6        16.70           34.25            75     55.8      27.5        28.09           11.35
 25      20.0      15.0        16.14           34.62            76     41.4      25.5        25.72           14.85
 26      17.7       9.1        10.60           29.13            77     70.4      25.0        18.45           11.99
 27      24.3      18.0        11.18           29.13            78     79.0      27.5        25.29           19.77
 28      10.4      10.1         0.34            3.89            79     12.0      10.6         8.19           13.63
 29      52.2      28.9         0.71           20.39            80     14.7      14.5        28.34           27.37
 30      17.7      13.1         0.00           21.90            81     12.4      10.2        16.44           24.38
 31      19.9      13.6         0.75           21.49            82     19.4      16.8        18.33           29.34
 32      56.4      29.0         1.68           24.04            83    120.0      34.0         2.41           17.13
 33      27.0      16.6         2.55           36.51            84     10.2      10.0        13.75           30.88
 34      37.0      28.5         2.73           19.41            85     28.7      21.3        14.37           33.88
 35      55.2      24.1         3.43           32.62            86     36.5      23.9        14.49           34.15
 36      81.2      30.0         4.59           32.68            87     30.4      21.0         9.79           32.04
 37      41.3      27.4         4.37            8.21            88     18.8      12.1         9.76           31.94
 38      16.9      13.8        19.75           34.21            89     59.0      26.6         6.62           24.73
 39      15.2      12.3        19.61           33.94            90     38.7      19.7         5.88           23.58
 40      11.6      12.4        17.09           17.69            91     15.0      14.1         8.03           34.79
 41      12.0      10.5        19.64           21.06            92     58.0      30.8         3.24           13.00
 42      11.6      13.6        22.89           32.68            93     34.7      26.6         3.54           33.71
 43      10.6      10.9        23.09           18.04            94     38.3      29.2         3.65           34.71
 44      36.7      20.0        26.04           22.63            95     44.4      29.2         5.51           25.92
 45      31.5      15.9        25.49           22.95            96     48.3      25.7         6.36           25.52
 46      65.8      33.0        10.57            6.86            97     32.8      15.7         7.73           23.78
 47      51.1      27.4        14.41            6.72            98     83.7      26.7        14.39           15.43
 48      19.4      10.5        14.74            7.19            99     39.0      25.1        19.46           16.32
 49      30.0      21.9        18.61            9.07           100     49.0      25.4        28.67           21.60
 50      40.2      23.3        18.71            9.53           101     14.4      11.4         3.61           29.38
 51     136.0      32.5        27.23           13.28
suckers rather than seed), and double leaders occurred with some frequency. The diameters and
heights of the trees are listed in Table 4.7, along with the location of each tree. Only those trees with
a diameter greater than 10 cm were considered; there were 101 such trees.
Interest centered on ascertaining how tree height relates to tree diameter. Refer to Zhang et al.
(2004) for some informative graphical displays that bear on this relationship and that indicate how
the trees are distributed within the plot.
The setting is one that might be suitable for the application of a general linear model. We could take $N = 101$, take the observed value of $y_i$ to be the logarithm of the height of the $i$th tree ($i = 1, 2, \ldots, 101$), take $C = 3$, take $u_1$ to be the diameter of the tree, and take $u_2$ and $u_3$ to be the first and second coordinates of the location of the tree. And we could consider taking $\delta(u)$ to be of the simple form

$\delta(u) = \beta_1 + \beta_2 \log u_1$,    (4.63)

where $\beta_1$ and $\beta_2$ are unknown parameters; expression (4.63) is a first-degree polynomial in $\log u_1$ with coefficients $\beta_1$ and $\beta_2$. Or, following Zhang et al. (2004), we could consider taking $\delta(u)$ to be a variation on the first-degree polynomial (4.63) in which the coefficients of the polynomial are allowed to differ from location to location (i.e., allowed to vary with $u_2$ and $u_3$).
The logarithms of the heights of the 101 trees constitute spatial data. The logarithm of the
height of each tree is associated with a specific location in 2-dimensional space; this location is that
represented by the value of the 2-dimensional vector s whose elements (in this particular setting) are
$u_2$ and $u_3$. Accordingly, the residual effects $e_1, e_2, \ldots, e_N$ are likely to be correlated to an extent
that depends on the relative locations of the trees with which they are associated. Trees that are in
close proximity may tend to be subject to similar conditions, and consequently the residual effects
identified with those trees may tend to be similar. It is worth noting, however, that the influence on the
residual effects of any such similarity in conditions could be at least partially offset by the influence
of competition for resources among neighboring trees.
In light of the spatial nature of the data, we might wish to take the variance-covariance matrix $V(\theta)$ of the vector $e$ of residual effects in the general linear model to be that whose diagonal and off-diagonal elements are of the form (4.56) and (4.58). Among the choices for the form of the autocorrelation function $K(\cdot)$ is that specified by expression (4.62). It might or might not be realistic to restrict the matrix $\Gamma$ in expression (4.62) to be of the form $\Gamma = \lambda I$ (where $\lambda$ is a nonnegative scalar). When $\Gamma$ is restricted in that way, the autocorrelation function $K(\cdot)$ is isotropic, and the vector $\theta$ could be taken to be the 3-dimensional (column) vector with elements $\gamma$, $\tau$, and $\lambda$ (assuming that each of these 3 scalars is to be regarded as an unknown parameter).
of the different points in time as defining different response variables. Nevertheless, there is a very
meaningful distinction between longitudinal data and the sort of “unstructured” multivariate data
that is the subject of the present section. Longitudinal data exhibit a structure that can be exploited
for modeling purposes. Models (like those considered in Section 4.4e) that exploit that structure are
not suitable for use with unstructured multivariate data.
where (for $s = 1, 2, \ldots, S$) $y_s$ is the $R_s$-dimensional column vector, the $k$th element of which is the random variable whose observed value is the value obtained for the $s$th response variable on the $r_{sk}$th observational unit ($k = 1, 2, \ldots, R_s$). The setting is such that it is convenient to use two subscripts instead of one to distinguish among $y_1, y_2, \ldots, y_N$ and also to distinguish among the residual effects $e_1, e_2, \ldots, e_N$. Let us use the first subscript to identify the response variable and the second to identify the observational unit. Accordingly, let us write $y_{s r_{sk}}$ for the $k$th element of $y_s$, so that, by definition,

$y_s = (y_{s r_{s1}}, y_{s r_{s2}}, \ldots, y_{s r_{sR_s}})'$.

And let us write $e_{s r_{sk}}$ for the residual effect corresponding to $y_{s r_{sk}}$.
In a typical application, the model is taken to be such that the model matrix is of the block-diagonal form

$X = \begin{pmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & X_S \end{pmatrix}$,    (5.1)

where (for $s = 1, 2, \ldots, S$) $X_s$ is of dimensions $R_s \times P_s$ (and where $P_1, P_2, \ldots, P_S$ are positive
integers that sum to $P$). Now, suppose that $X$ is of the form (5.1), and partition the parameter vector $\beta$ into $S$ subvectors $\beta_1, \beta_2, \ldots, \beta_S$ of dimensions $P_1, P_2, \ldots, P_S$, respectively, so that

$\beta' = (\beta_1', \beta_2', \ldots, \beta_S')$.

Further, partition the vector $e$ of residual effects into $S$ subvectors $e_1, e_2, \ldots, e_S$ of dimensions $R_1, R_2, \ldots, R_S$, respectively, so that

$e' = (e_1', e_2', \ldots, e_S')$,

where (for $s = 1, 2, \ldots, S$) $e_s = (e_{s r_{s1}}, e_{s r_{s2}}, \ldots, e_{s r_{sR_s}})'$; the partitioning of $e$ is conformal to the partitioning of $y$. Then, the model equation $y = X\beta + e$ is reexpressible as

$y_s = X_s \beta_s + e_s$  ($s = 1, 2, \ldots, S$).    (5.2)
Note that the model equation for the vector $y_s$ of observations on the $s$th response variable depends
on only $P_s$ of the elements of $\beta$, namely, those $P_s$ elements that are members of the subvector $\beta_s$. In practice, it is often the case that $P_1 = P_2 = \cdots = P_S = P^*$ (where $P^* = P/S$) and that there is a matrix $X^*$ of dimensions $R \times P^*$ such that (for $s = 1, 2, \ldots, S$) the (first through $R_s$th) rows of $X_s$ are respectively the $r_{s1}, r_{s2}, \ldots, r_{sR_s}$th rows of $X^*$.
An important special case is that where there is complete information on every observational unit; that is, where every one of the $S$ response variables is observed on every one of the $R$ observational units. Then, $R_1 = R_2 = \cdots = R_S = R$. And, commonly, $P_1 = P_2 = \cdots = P_S = P^*$ ($= P/S$) and $X_1 = X_2 = \cdots = X_S = X^*$ (for some $R \times P^*$ matrix $X^*$). Under those conditions, it is possible to reexpress the model equation $y = X\beta + e$ as

$Y = X^* B + E$,    (5.3)

where $Y = (y_1, y_2, \ldots, y_S)$, $B = (\beta_1, \beta_2, \ldots, \beta_S)$, and $E = (e_1, e_2, \ldots, e_S)$.
The $N$ residual effects $e_{s r_{sk}}$ ($s = 1, 2, \ldots, S$; $k = 1, 2, \ldots, R_s$) can be regarded as a subset of a set of $RS$ random variables $e_{sr}$ ($s = 1, 2, \ldots, S$; $r = 1, 2, \ldots, R$) having expected values of 0; think of these $RS$ random variables as the residual effects for the special case where there is complete information on every observational unit. It is assumed that the distribution of the random variables $e_{sr}$ ($s = 1, 2, \ldots, S$; $r = 1, 2, \ldots, R$) is such that the $R$ vectors $(e_{1r}, e_{2r}, \ldots, e_{Sr})'$ ($r = 1, 2, \ldots, R$) are uncorrelated and each has the same variance-covariance matrix $\Sigma = \{\sigma_{ij}\}$. Then, in the special case where there is complete information on every observational unit, the variance-covariance matrix of the vector $e$ of residual effects is
$\begin{pmatrix} \sigma_{11} I_R & \sigma_{12} I_R & \cdots & \sigma_{1S} I_R \\ \sigma_{12} I_R & \sigma_{22} I_R & \cdots & \sigma_{2S} I_R \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1S} I_R & \sigma_{2S} I_R & \cdots & \sigma_{SS} I_R \end{pmatrix}$.    (5.4)
Moreover, in the general case (where the information on some or all of the observational units is incomplete), the variance-covariance matrix of $e$ is a submatrix of the matrix (5.4). Specifically, it is the submatrix obtained by replacing the $st$th block of matrix (5.4) with the $R_s \times R_t$ submatrix formed from that block by striking out all of the rows and columns of the block save the $r_{s1}, r_{s2}, \ldots, r_{sR_s}$th rows and the $r_{t1}, r_{t2}, \ldots, r_{tR_t}$th columns ($s, t = 1, 2, \ldots, S$).
Typically, the matrix $\Sigma$ (which is inherently symmetric and nonnegative definite) is assumed to be positive definite, and its $S(S+1)/2$ distinct elements, say $\sigma_{ij}$ ($j \le i = 1, 2, \ldots, S$), are regarded as unknown parameters. The situation is such that (even in the absence of the assumption that $\Sigma$ is positive definite) $\operatorname{var}(e)$ is of the form $V(\theta)$ of $\operatorname{var}(e)$ in the general linear model; the parameter vector $\theta$ can be taken to be the $[S(S+1)/2]$-dimensional (column) vector with elements $\sigma_{ij}$ ($j \le i = 1, 2, \ldots, S$).
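Matrix (5.4) is simply the Kronecker product $\Sigma \otimes I_R$, so the complete-information case can be assembled in one line. A minimal sketch (the sizes and the matrix $\Sigma$ are arbitrary):

    import numpy as np

    R, S = 3, 2                                  # arbitrary sizes
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])               # an arbitrary S x S matrix

    V = np.kron(Sigma, np.eye(R))                # matrix (5.4): block (s, t) is sigma_st I_R
    print(V.shape)                               # (6, 6)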
averaged. Accordingly, each trial resulted in four data points, one for each response variable. The
data from the 13 trials are reproduced in Table 4.8.
These data are multivariate in nature. The trials constitute the observational units; there are $R = 13$ of them. And there are $S = 4$ response variables: hardness, cohesiveness, springiness, and compressible H₂O. Moreover, there is complete information on every observational unit; every one of the four response variables was observed on every one of the 13 trials.
The setting is one that might be suitable for the application of a general linear model. More specifically, it might be suitable for the application of a general linear model of the form considered in Subsection a. In such an application, we might take $C = 3$, take $u_1$ to be the level of cysteine, take $u_2$ to be the level of CaCl₂, and take the value of $u_3$ to be 1, 2, 3, or 4 depending on whether the response variable is hardness, cohesiveness, springiness, or compressible H₂O.
Further, following Schmidt et al. (1979) and adopting the notation of Subsection a, we might take (for $s = 1, 2, 3, 4$ and $r = 1, 2, \ldots, 13$)

$y_{sr} = \beta_{s1} + \beta_{s2} u_{1r} + \beta_{s3} u_{2r} + \beta_{s4} u_{1r}^2 + \beta_{s5} u_{2r}^2 + \beta_{s6} u_{1r} u_{2r} + e_{sr}$,    (5.5)

where $u_{1r}$ and $u_{2r}$ are the values of $u_1$ and $u_2$ for the $r$th observational unit and where $\beta_{s1}, \beta_{s2}, \ldots, \beta_{s6}$ are unknown parameters. Taking (for $s = 1, 2, 3, 4$ and $r = 1, 2, \ldots, 13$) $y_{sr}$ to be of the form (5.5) is equivalent to taking (for $s = 1, 2, 3, 4$) the vector $y_s$ (with elements $y_{s1}, y_{s2}, \ldots, y_{sR}$) to be of the form (5.2) and to taking $X_1 = X_2 = X_3 = X_4 = X^*$, where $X^*$ is the $13 \times 6$ matrix whose $r$th row is $(1, u_{1r}, u_{2r}, u_{1r}^2, u_{2r}^2, u_{1r} u_{2r})$.
Exercises
Exercise 1. Verify formula (2.3).
Exercise 2. Write out the elements of the vector $\beta$, of the observed value of the vector $y$, and of the matrix $X$ (in the model equation $y = X\beta + e$) in an application of the G–M model to the cement
data of Section 4.2d. In doing so, regard the measurements of the heat that evolves during hardening as the data points, take $C = 4$, take $u_1$, $u_2$, $u_3$, and $u_4$ to be the respective amounts of tricalcium aluminate, tricalcium silicate, tetracalcium aluminoferrite, and $\beta$-dicalcium silicate, and take $\delta(u)$ to be of the form (2.11).
Exercise 3. Write out the elements of the vector $\beta$, of the observed value of the vector $y$, and of the matrix $X$ (in the model equation $y = X\beta + e$) in an application of the G–M model to the lettuce data of Section 4.2e. In doing so, regard the yields of lettuce as the data points, take $C = 3$, take $u_1$, $u_2$, and $u_3$ to be the transformed amounts of Cu, Mo, and Fe, respectively, and take $\delta(u)$ to be of the form (2.14).
Exercise 4. Let $y$ represent a random variable and $u$ a $C$-dimensional random column vector such that the joint distribution of $y$ and $u$ is MVN (with a nonsingular variance-covariance matrix). And take $z = \{z_j\}$ to be a transformation (of $u$) of the form

$z = R'[u - E(u)]$,

where $R$ is a nonsingular (nonrandom) matrix such that $\operatorname{var}(z) = I$; the existence of such a matrix follows from the results of Section 3.3b. Show that

$\dfrac{E(y \mid u) - E(y)}{\sqrt{\operatorname{var} y}} = \sum_{j=1}^C \operatorname{corr}(y, z_j)\, z_j$.
in agreement with the results obtained in Section 4.3 (under the assumption that the joint distribution of $y$ and $u$ is MVN).
(b) Show that

$E[\operatorname{var}(y \mid u)] = \operatorname{var}(y) - \operatorname{cov}(y, u)(\operatorname{var} u)^{-1} \operatorname{cov}(u, y)$.
Exercise 6. Suppose that (in conformance with the development in Section 4.4b) the residual effects in the general linear model have been partitioned into $K$ mutually exclusive and exhaustive subsets or classes numbered $1, 2, \ldots, K$. And for $k = 1, 2, \ldots, K$, write $e_{k1}, e_{k2}, \ldots, e_{kN_k}$ for the residual effects in the $k$th class. Take $a_k^*$ ($k = 1, 2, \ldots, K$) and $r_{ks}^*$ ($k = 1, 2, \ldots, K$; $s = 1, 2, \ldots, N_k$) to be uncorrelated random variables, each with mean 0, such that $\operatorname{var}(a_k^*) = \gamma_k^{*2}$ for some nonnegative scalar $\gamma_k^*$ and $\operatorname{var}(r_{ks}^*) = \tau_k^{*2}$ ($s = 1, 2, \ldots, N_k$) for some strictly positive scalar $\tau_k^*$. Consider the effect of taking the residual effects to be of the form

$e_{ks} = a_k^* + r_{ks}^*$,    (E.1)

rather than of the form (4.26). Are there values of $\gamma_k^{*2}$ and $\tau_k^{*2}$ for which the value of $\operatorname{var}(e)$ is the same when the residual effects are taken to be of the form (E.1) as when they are taken to be of the form (4.26)? If so, what are those values; if not, why not?
Exercise 7. Develop a correlation structure for the residual effects in the general linear model that,
in the application of the model to the shear-strength data (of Section 4.4 d), would allow for the
possibility that steel aliquots chosen at random from those on hand on different dates may tend to
be more alike when the intervening time is short than when it is long. Do so by making use of the
results (in Section 4.4 e) on stationary first-order autoregressive processes.
Exercise 8. Suppose (as in Section 4.4g) that the residual effects $e_1, e_2, \ldots, e_N$ in the general linear model correspond to locations in $D$-dimensional space, that these locations are represented by $D$-dimensional column vectors $s_1, s_2, \ldots, s_N$ of coordinates, and that $S$ is a finite or infinite set of $D$-dimensional column vectors that includes $s_1, s_2, \ldots, s_N$. Suppose further that $e_1, e_2, \ldots, e_N$ are expressible in the form (4.54) and that conditions (4.55) and (4.57) are applicable. And take $\psi(\cdot)$ to be the function defined on the set $H = \{h \in R^D : h = s - t \text{ for } s, t \in S\}$ by

$\psi(h) = \tau^2 + \gamma^2[1 - K(h)]$.
Exercise 9. Suppose that the general linear model is applied to the example of Section 4.5b (in the way described in Section 4.5b). What is the form of the function $\delta(u)$?
§4b. In some presentations, the intraclass correlation is taken to be the same for every class, and the permissible values of the intraclass correlation are taken to be all of the values for which the variance-covariance matrix of the residual effects is nonnegative definite. Then, assuming that there are $K$ classes of sizes $N_1, N_2, \ldots, N_K$ and denoting the intraclass correlation by $\rho$, the permissible values would be those in the interval

$-\dfrac{1}{\min_k N_k - 1} \le \rho \le 1$.

One could question whether the intraclass correlation's being the same for every class is compatible with its being negative. Assuming that a negative intraclass correlation is indicative of competition among some number ($\ge$ the class size) of entities, it would seem that the correlation would depend on the number of entities; presumably, the pairwise competition would be less intense and the correlation less affected if the number of entities were relatively large.
§4e. The development in Part 2 is based on taking the correlation structure of the sequence $f_{k1}, f_{k2}, \ldots, f_{kN_k}$ to be that of a stationary first-order autoregressive process. There are other possible choices for this correlation structure; see, for example, Diggle, Heagerty, Liang, and Zeger (2002, secs. 4.2.2 and 5.2) and Laird (2004, sec. 1.3).
§4g. For extensive (book-length) treatises on spatial statistics, refer, e.g., to Cressie (1993) and Schaben-
berger and Gotway (2005). Gaussian autocorrelation functions may be regarded as “artificial” and their use
discouraged; they have certain characteristics that are considered by Schabenberger and Gotway—refer to their
Section 4.3—and by many others to be inconsistent with the characteristics of real physical and biological
processes.
§5a. By transposing both of its sides, model equation (5.3) can be reexpressed in the form of the equation

$Y' = B' X^{*\prime} + E'$,

each side of which is an $S \times R$ matrix whose rows correspond to the response variables and whose columns correspond to the observational units. In many publications, the model equation is presented in this alternative form rather than in the form (5.3). As pointed out, for example, by Arnold (1981, p. 348), the form (5.3) has the appealing property that, in the special case of univariate data (i.e., the special case where $S = 1$), each side of the equation reduces to a column vector (rather than a row vector), in conformance with the usual representation for that case.
5
Estimation and Prediction: Classical Approach
Models of the form of the general linear model, and in particular those of the form of the Gauss–Markov or Aitken model, are often used to obtain point estimates of the unobservable quantities represented by various parametric functions. In many cases, the parametric functions are ones that are expressible in the form $\lambda'\beta$, where $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_P)'$ is a $P$-dimensional column vector of constants, or equivalently ones that are expressible in the form $\sum_{j=1}^P \lambda_j \beta_j$. Models of the form of the G–M, Aitken, or general linear model may also be used to obtain predictions for future quantities; these would be future quantities that are represented by unobservable random variables with expected values of the form $\lambda'\beta$. The emphasis in this chapter is on the G–M model (in which the only parameter other than $\beta_1, \beta_2, \ldots, \beta_P$ is the standard deviation $\sigma$) and on what might be regarded as a classical approach to estimation and prediction.
Attention is sometimes restricted to estimators that are unbiased. By definition, an estimator $t(y)$ of $\lambda'\beta$ is unbiased if $E[t(y)] = \lambda'\beta$. If $t(y)$ is an unbiased estimator of $\lambda'\beta$, then

$E\{[t(y) - \lambda'\beta]^2\} = \operatorname{var}[t(y)]$,    (1.1)

that is, its MSE equals its variance.
In the case of a linear estimator $c + a'y$, the expected value of the estimator is

$E(c + a'y) = c + a'E(y) = c + a'X\beta$.    (1.2)

Accordingly, $c + a'y$ is an unbiased estimator of $\lambda'\beta$ if and only if, for every $P$-dimensional column vector $\beta$,

$c + a'X\beta = \lambda'\beta$.    (1.3)

Clearly, a sufficient condition for the unbiasedness of the linear estimator $c + a'y$ is

$c = 0$ and $a'X = \lambda'$    (1.4)

or, equivalently,

$c = 0$ and $X'a = \lambda$.    (1.5)

This condition is also a necessary condition for the unbiasedness of $c + a'y$, as is evident upon observing that if equality (1.3) holds for every column vector $\beta$ in $R^P$, then it holds in particular when $\beta$ is taken to be the $P \times 1$ null vector $0$ (so that $c = 0$) and when (for each integer $j$ between 1 and $P$, inclusive) $\beta$ is taken to be the $j$th column of $I_P$ (so that the $j$th element of $a'X$ equals the $j$th element of $\lambda'$).
In the special case of a linear unbiased estimator $a'y$, expression (1.1) for the MSE of an unbiased estimator of $\lambda'\beta$ simplifies to

$E[(a'y - \lambda'\beta)^2] = a'\operatorname{var}(y)\,a$.    (1.6)
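Conditions (1.4) and (1.5) are easy to verify numerically: with $c = 0$, the linear estimator $a'y$ is unbiased precisely for the parametric function $\lambda'\beta$ with $\lambda = X'a$. A minimal sketch (the model matrix and coefficient vector are arbitrary):

    import numpy as np

    X = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0]])                   # an arbitrary model matrix
    a = np.array([-0.5, 0.0, 0.5])               # an arbitrary coefficient vector

    lam = X.T @ a                                # per (1.5), a'y is unbiased for lam'beta
    print(lam)                                   # [0. 1.]; here a'y estimates beta_2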
$c + a'y + \lambda'k = c + a'(y + Xk)$,
$a'Xk = \lambda'k$.    (2.3)
The estimator $t(y)$ is said to be translation equivariant if it is such that condition (2.2) is satisfied for every $k \in R^P$ (and for every value of $y$). Accordingly, the linear estimator $c + a'y$ is translation equivariant if and only if condition (2.3) is satisfied for every $k \in R^P$ or, equivalently, if and only if

$a'X = \lambda'$.    (2.4)

Observe (in light of the results of Section 5.1) that condition (2.4) is identical to one of the conditions needed for unbiasedness; for unbiasedness, we also need the condition $c = 0$. Thus, the motivation for requiring that the coefficient vector $a'$ in the linear estimator $c + a'y$ satisfy the condition $a'X = \lambda'$ can come from a desire to achieve unbiasedness or translation equivariance or both.
5.3 Estimability
Suppose (as in Sections 5.1 and 5.2) that $y$ is an $N \times 1$ observable random vector that follows the G–M, Aitken, or general linear model, and consider the estimation of a parametric function that is expressible in the form $\lambda'\beta$ or $\sum_{j=1}^P \lambda_j \beta_j$, where $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_P)'$ is a $P \times 1$ vector of coefficients. If there exists a linear unbiased estimator of $\lambda'\beta$ [i.e., if there exists a constant $c$ and an $N \times 1$ vector of constants $a$ such that $E(c + a'y) = \lambda'\beta$], then $\lambda'\beta$ is said to be estimable. Otherwise (if no such estimator exists), $\lambda'\beta$ is said to be nonestimable.
If $\lambda'\beta$ is estimable, then the data provide at least some information about $\lambda'\beta$. Estimability can be of critical importance in the design of an experiment. If the data from the experiment are to be regarded as having originated from a G–M, Aitken, or general linear model and if the quantities of interest are to be formulated as parametric functions of the form $\lambda'\beta$ (as is common practice), then it is imperative that every one of the relevant functions be estimable.
It follows immediately from the results of Section 5.1 that $\lambda'\beta$ is estimable if and only if there exists an $N \times 1$ vector $a$ such that

$\lambda' = a'X$    (3.1)

or, equivalently, such that

$\lambda = X'a$.    (3.2)

Thus, for $\lambda'\beta$ to be estimable (under the G–M, Aitken, or general linear model), it is necessary and sufficient that

$\lambda' \in \mathcal{R}(X)$    (3.3)

or, equivalently, that

$\lambda \in \mathcal{C}(X')$.    (3.4)

Note that it follows from the very definition of estimability [as well as from condition (3.1)] that if $\lambda'\beta$ is estimable, then there exists an $N \times 1$ vector $a$ such that
choices of $a$ and, consequently, it may have multiple interpretations in terms of the expected values of $y_1, y_2, \ldots, y_N$.
Two basic and readily verifiable observations about linear combinations of parametric functions of the form $\lambda'\beta$ are as follows:
(1) linear combinations of estimable functions are estimable; and
(2) linear combinations of nonestimable functions are not necessarily nonestimable.
How many “essentially different” estimable functions are there? Let $\lambda_1'\beta, \lambda_2'\beta, \ldots, \lambda_K'\beta$ represent $K$ (where $K$ is an arbitrary positive integer) linear combinations of the elements of $\beta$. These linear combinations are said to be linearly independent if their coefficient vectors $\lambda_1, \lambda_2, \ldots, \lambda_K$ are linearly independent vectors. A question as to whether $\lambda_1'\beta, \lambda_2'\beta, \ldots, \lambda_K'\beta$ are essentially different can be made precise by taking essentially different to mean linearly independent.
Letting $R = \operatorname{rank}(X)$, some basic and readily verifiable observations about linearly independent parametric functions of the form $\lambda'\beta$ and about their estimability or nonestimability are as follows:
(1) there exists a set of $R$ linearly independent estimable functions;
(2) no set of estimable functions contains more than $R$ linearly independent estimable functions; and
(3) if the model is not of full rank (i.e., if $R < P$), then at least one and, in fact, at least $P - R$ of the individual parameters $\beta_1, \beta_2, \ldots, \beta_P$ are nonestimable.
When the model matrix $X$ has full column rank $P$, the model is said to be of full rank. In the special case of a full-rank model, $\mathcal{R}(X) = R^P$, and every parametric function of the form $\lambda'\beta$ is estimable.
Note that the existence of an $N \times 1$ vector $a$ that satisfies equality (3.2) is equivalent to the consistency of a linear system (in an $N \times 1$ vector $a$ of unknowns), namely, the linear system with coefficient matrix $X'$ (which is of dimensions $P \times N$) and with right side $\lambda$. The significance of this equivalence is that any result on the consistency of a linear system can be readily translated into a result on the estimability of the parametric function $\lambda'\beta$. Consider, in particular, Theorem 2.11.1. Upon applying this theorem [and observing that $(X^-)'$ is a generalized inverse of $X'$], we find that for $\lambda'\beta$ to be estimable, it is necessary and sufficient that

$\lambda' X^- X = \lambda'$    (3.6)

or, equivalently, that

$\lambda'(I - X^- X) = 0$.    (3.7)

If $\operatorname{rank}(X) = P$, then (in light of Lemma 2.10.3) $X^- X = I$. Thus, in the special case of a full-rank model, conditions (3.6) and (3.7) are vacuous.
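Condition (3.6) is directly checkable, with the Moore-Penrose inverse standing in for the generalized inverse $X^-$. A minimal sketch, using an arbitrary rank-deficient model matrix:

    import numpy as np

    # A rank-2 model matrix (the third column is the sum of the first two).
    X = np.array([[1.0, 1.0, 2.0],
                  [1.0, 0.0, 1.0],
                  [1.0, 1.0, 2.0],
                  [1.0, 0.0, 1.0]])

    def estimable(lam):
        # lam'beta is estimable iff lam' X^- X = lam', per (3.6)
        return np.allclose(lam @ np.linalg.pinv(X) @ X, lam)

    print(estimable(np.array([1.0, 0.0, 1.0])))   # True: a row of X
    print(estimable(np.array([0.0, 1.0, 0.0])))   # False: beta_2 alone is nonestimable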
$k'A = 0 \;\Leftrightarrow\; A'k = 0$
$\;\Leftrightarrow\; k \in \mathcal{N}(A')$
$\;\Leftrightarrow\; k \in \mathcal{C}[I - (A^-)'A']$
$\;\Leftrightarrow\; k = [I - (A^-)'A']\,r$ for some $M \times 1$ vector $r$
$\;\Leftrightarrow\; k' = r'(I - AA^-)$ for some $M \times 1$ vector $r$.

And based on Theorem 2.11.1, we conclude that the linear system $AX = B$ is consistent if and only if $k'B = 0$ for every $k$ such that $k'A = 0$. Q.E.D.
Theorem 5.3.1 establishes that the consistency of the linear system $AX = B$ is equivalent to a condition that is sometimes referred to as compatibility; the linear system $AX = B$ is said to be compatible if every linear relationship that exists among the rows of the coefficient matrix $A$ also exists among the rows of the right side $B$ (in the sense that $k'A = 0 \Rightarrow k'B = 0$). The proof presented herein differs from that presented in Matrix Algebra from a Statistician's Perspective (Harville 1997, sec. 7.3); it makes use of results on generalized inverses.
$\delta(u) = \beta_1 + \beta_2 u + \beta_3 u^2 + \cdots + \beta_P u^{P-1}$.    (3.11)

Under what circumstances are all $P$ of the coefficients $\beta_1, \beta_2, \ldots, \beta_P$ estimable? Or, equivalently, under what circumstances is the model of full rank? The answer to this question can be established with the help of a result on a kind of matrix known as a Vandermonde matrix.
Vandermonde matrices. A Vandermonde matrix is a square matrix $A$ of the general form

$A = \begin{pmatrix} 1 & t_1 & t_1^2 & \cdots & t_1^{K-1} \\ 1 & t_2 & t_2^2 & \cdots & t_2^{K-1} \\ 1 & t_3 & t_3^2 & \cdots & t_3^{K-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & t_K & t_K^2 & \cdots & t_K^{K-1} \end{pmatrix}$,
Rank of the model matrix. We are now in a position to determine the rank of the model matrix $X$ (of a G–M, Aitken, or general linear model) in the special case where $C = 1$ and where $\delta(u)$ is the polynomial (3.11) (which is of degree $P - 1$ in the lone explanatory variable $u$). Denote by $D$ the number of distinct values of $u$ represented among the $N$ values of $u$ corresponding to the $N$ observable random variables $y_1, y_2, \ldots, y_N$. And take $i_1, i_2, \ldots, i_D$ ($i_1 < i_2 < \cdots < i_D$) to be integers between 1 and $N$, inclusive, such that the values of $u$ corresponding to $y_{i_1}, y_{i_2}, \ldots, y_{i_D}$ are distinct. Each of the $N$ rows of $X$ is either among its $i_1, i_2, \ldots, i_D$th rows or is a duplicate of one of those rows. Thus, $\mathcal{R}(X)$ is spanned by the $i_1, i_2, \ldots, i_D$th rows of $X$, and it follows that $\operatorname{rank}(X) \le D$ and hence [since $\operatorname{rank}(X) \le P$] that $\operatorname{rank}(X) \le M$, where $M = \min(D, P)$. Moreover, it follows from result (3.13) that the $M \times M$ submatrix of $X$ formed from its $i_1, i_2, \ldots, i_M$th rows and its first $M$ columns is nonsingular, implying (in light of Theorem 2.4.19) that $\operatorname{rank}(X) \ge M$. And we conclude that

$\operatorname{rank}(X) = \min(D, P)$.    (3.14)

In light of result (3.14), it is evident that [in the special case where $\delta(u)$ is the $(P-1)$-degree polynomial (3.11)] the model is of full rank if and only if $D \ge P$, that is, if and only if at least $P$ of the $N$ values of $u$ (the $N$ values of $u$ corresponding to $y_1, y_2, \ldots, y_N$) are distinct. When the model is of full rank, all $P$ of the coefficients $\beta_1, \beta_2, \ldots, \beta_P$ [in the $(P-1)$-degree polynomial] are estimable.
In the application to the ouabain data (of Section 4.2b), there are 4 distinct values of the ex-
planatory variable, representing the 4 different rates of injection (or perhaps the logarithms of the
4 different rates). Accordingly, if ı.u/ were taken to be a polynomial of the form (3.11), the model
would be of full rank if and only if the degree of the polynomial were taken to be 3 or less.
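Result (3.14) can be confirmed numerically: with $D$ distinct $u$-values among the $N$ data points, the model matrix of the polynomial (3.11) has rank $\min(D, P)$. A minimal sketch (the $u$-values are arbitrary):

    import numpy as np

    P = 4
    u = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])    # N = 6 values, D = 3 distinct

    X = np.vander(u, P, increasing=True)            # columns 1, u, u^2, u^3, per (3.11)
    D = len(np.unique(u))
    print(np.linalg.matrix_rank(X), min(D, P))      # 3 3, in agreement with (3.14)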
u_1    u_2    u_3    u_4
0.8    0.2    0      0
0.2    0      0.8    0
0.4    0      0      0.6
0.5    0.1    0.4    0
0.6    0.1    0      0.3
0.3    0      0.4    0.3
What parametric functions of the form $\lambda'\beta$ are estimable? In light of equality (3.18), $rank(X) \le C$, and it follows from the results of Subsection b that for $\lambda'\beta$ to be estimable, it is necessary that

$\lambda' \begin{pmatrix} 1 \\ -1_C \end{pmatrix} = 0$  (3.19)

or, equivalently, that

$\sum_{j=2}^{C+1} \lambda_j = \lambda_1.$  (3.20)

Is this condition sufficient as well as necessary? The answer to this question depends on $rank(X)$. It follows from the results of Subsection b that if $rank(X) = C$, then condition (3.20) is sufficient (as well as necessary) for the estimability of $\lambda'\beta$; however, if $rank(X) < C$, then condition (3.20) is not (in and of itself) sufficient.
In the case of the 6 blends of the 4 juices,

$rank(X) = 3 = C - 1 < C.$

To see this, observe that the last 3 columns of $X$ (which contain the values of $u_2$, $u_3$, and $u_4$, respectively) are linearly independent, so that $rank(X) \ge 3$. Observe also that each of the 6 blends of the 4 juices is such that

$u_4 = 1.5u_1 - 6u_2 - 0.375u_3,$

so that (in the case of the 6 blends of the 4 juices) the column of $X$ containing the values of $u_4$ satisfies

$x_5 = 1.5x_2 - 6x_3 - 0.375x_4,$  (3.21)

which [together with result (3.17)] implies that the first and last columns of $X$ are expressible as linear combinations of the other 3 columns and hence that $rank(X) \le 3$. Thus, $rank(X) = 3$.
To obtain conditions that are both necessary and sufficient for the estimability of $\lambda'\beta$ (from the information provided by the data on the 6 blends of the 4 juices), observe that equality (3.21) can be reexpressed in the form

$X \begin{pmatrix} 0 \\ 1.5 \\ -6 \\ -0.375 \\ -1 \end{pmatrix} = 0.$

And it follows from the results of Subsection b that for $\lambda'\beta$ to be estimable, it is necessary that

$\lambda' \begin{pmatrix} 0 \\ 1.5 \\ -6 \\ -0.375 \\ -1 \end{pmatrix} = 0$  (3.22)

or, equivalently, that

$\lambda_5 = 1.5\lambda_2 - 6\lambda_3 - 0.375\lambda_4.$  (3.23)
Moreover, together, the two conditions (3.19) and (3.22) or, equivalently, the two conditions (3.20) and (3.23) are sufficient (as well as necessary) for the estimability of $\lambda'\beta$.
The example provided by the 6 blends of the 4 juices is one in which $rank(X) = C - 1$ ($= P - 2$). It is easy to construct examples in which $rank(X) = C$ ($= P - 1$) and to do so for any $N$ ($\ge C$)—in light of result (3.17) or (3.18), $rank(X)$ cannot be larger than $C$. In what is perhaps the simplest way to construct such an example, we can take any $C$ of the blends corresponding to the $N$ data points to be the $C$ pure blends, each of which consists entirely of one ingredient. This approach results in a model matrix $X$ whose $N$ rows include the vectors $(1, 1, 0, 0, \ldots, 0, 0)$, $(1, 0, 1, 0, \ldots, 0, 0)$, $\ldots$, $(1, 0, 0, 0, \ldots, 0, 1)$. Clearly, these $C$ vectors are linearly independent, implying that $rank(X) \ge C$ and hence [since $rank(X) \le C$] that $rank(X) = C$.

When $rank(X) = C$, the condition $\lambda' \begin{pmatrix} 1 \\ -1_C \end{pmatrix} = 0$, or equivalently the condition $\sum_{j=2}^{C+1} \lambda_j = \lambda_1$, is sufficient (as well as necessary) for the estimability of $\lambda'\beta$. It is worth noting that this condition is not satisfied by any of the individual parameters $\beta_1, \beta_2, \ldots, \beta_{C+1}$ and, consequently, none of these parameters is estimable. Thus, we have established (by means of an example) that not only is it possible for all $P$ of the individual parameters $\beta_1, \beta_2, \ldots, \beta_P$ of a G–M, Aitken, or general linear model to be nonestimable, but it is possible even if $rank(X) = P - 1$.
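The rank and estimability claims for the juice-blend example can be checked directly. In the sketch below (mine; the generalized inverse is taken to be the Moore–Penrose inverse, one admissible choice), a small function tests the condition $\lambda'(X'X)^- X'X = \lambda'$.

```python
import numpy as np

U = np.array([[0.8, 0.2, 0.0, 0.0],
              [0.2, 0.0, 0.8, 0.0],
              [0.4, 0.0, 0.0, 0.6],
              [0.5, 0.1, 0.4, 0.0],
              [0.6, 0.1, 0.0, 0.3],
              [0.3, 0.0, 0.4, 0.3]])
X = np.column_stack([np.ones(6), U])       # model matrix (1_N, U), P = 5
print(np.linalg.matrix_rank(X))            # 3 = C - 1, as claimed

XtX = X.T @ X
G = np.linalg.pinv(XtX)                    # one choice of (X'X)^-
def estimable(lam):
    return np.allclose(lam @ G @ XtX, lam)

print(estimable(X[0]))                     # True: delta(u) at an observed blend
print(estimable(np.eye(5)[1]))             # False: beta_2 alone is not estimable
```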
(or, perhaps, a nondegenerate subset of that set). Geometrically, the form of the set (3.27) is that of a $(C-1)$-dimensional simplex. Regardless of the number of data points and regardless of which values of $u$ correspond to the data points, the model matrix $X$ would be such that $X \begin{pmatrix} 1 \\ -1_C \end{pmatrix} = 0$ and, consequently, condition (3.25), or equivalently condition (3.26), would be a necessary condition for the estimability of $\lambda'\beta$. Accordingly, condition (3.26) constitutes an inherent restriction on the estimability of $\lambda'\beta$.

As discussed in Subsection e, the parametric functions for which condition (3.26) is not satisfied include all $C + 1$ of the individual parameters $\beta_1, \beta_2, \ldots, \beta_{C+1}$. If the explanatory variables $u_1, u_2, \ldots, u_C$ were unrestricted, the individual parameters would have meaningful interpretations, emanating from the observation that $\beta_1 = \delta(0)$ and that (for $j = 1, 2, \ldots, C$) $\beta_{j+1}$ equals the change in $\delta(u)$ effected by a unit change in the $j$th explanatory variable $u_j$ when the other $C - 1$ explanatory variables are held constant. However, those interpretations are rendered meaningless by the restriction of the vector $u$ of explanatory variables to the set (3.27). The interpretation of $\beta_1$ emanating from the observation that $\beta_1 = \delta(0)$ is meaningless because there is no mixture for which $u = 0$. The interpretations of $\beta_2, \beta_3, \ldots, \beta_{C+1}$ in terms of the change in $\delta(u)$ effected by changing one of the explanatory variables while holding the others constant are also meaningless. By their very nature, the explanatory variables $u_1, u_2, \ldots, u_C$ are such that $\sum_{j=1}^C u_j = 1$, making it impossible to change one of the explanatory variables without changing any of the others.
Subsection e includes an example in which $N = 6$ and $C = 4$ and in which the mixtures consist of blends of juices. In that example, $X(0, 1.5, -6, -0.375, -1)' = 0$, so that for a parametric function of the form $\lambda'\beta$ to be estimable, it is necessary that

$\lambda_5 = 1.5\lambda_2 - 6\lambda_3 - 0.375\lambda_4$

[condition (3.23)], a restriction that is not inherent. If it had been the case that the only restriction on estimability were that determined by the inherent restriction (3.26), then the value of $\delta(u)$ would have been estimable for every mixture [i.e., for every $u$ in the set (3.27)].
Noninherent restrictions on the estimability of parametric functions of the form $\lambda'\beta$ may be
encountered in cases where the data are “observational” in nature. They may also be encountered
in cases where the data come from a designed experiment or a sample survey. The extent to which
their presence is of concern would seem to depend on which parametric functions are rendered
nonestimable and on the extent to which those functions are of interest.
In the case of data from a designed experiment or a sample survey, the presence of noninherent
restrictions may be either inadvertent or by intent. Their presence may be attributable to problems in
execution or design. Or when the affected parametric functions are ones that are considered to be of
little importance and/or of negligible size, the presence of noninherent restrictions may be viewed
as an acceptable consequence of an attempt to make the best possible use of limited resources.
where $b_1, b_2, \ldots, b_P$ are arbitrary scalars. When $\delta(\cdot)$ is such that $\delta(u)$ is expressible in the form (4.1),

$\sum_{i=1}^N [\underline{y}_i - \delta(u_i)]^2 = \sum_{i=1}^N \Big(\underline{y}_i - \sum_{j=1}^P x_{ij} b_j\Big)^2,$

where (for $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, P$) $x_{ij} = \delta_j(u_i)$. Accordingly, in the special case under consideration, the minimization of $\sum_{i=1}^N [\underline{y}_i - \delta(u_i)]^2$ with respect to $\delta(\cdot)$ [for $\delta(\cdot)$ in the linear span of $\delta_1(\cdot), \delta_2(\cdot), \ldots, \delta_P(\cdot)$] is equivalent to the minimization of $\sum_{i=1}^N \big(\underline{y}_i - \sum_{j=1}^P x_{ij} b_j\big)^2$ with respect to $b_1, b_2, \ldots, b_P$. Moreover, upon letting $\underline{y} = (\underline{y}_1, \underline{y}_2, \ldots, \underline{y}_N)'$ and $b = (b_1, b_2, \ldots, b_P)'$ and taking $X$ to be the $N \times P$ matrix with $ij$th element $x_{ij}$,

$\sum_{i=1}^N \Big(\underline{y}_i - \sum_{j=1}^P x_{ij} b_j\Big)^2 = (\underline{y} - Xb)'(\underline{y} - Xb),$  (4.2)
so that the minimization of $\sum_{i=1}^N \big(\underline{y}_i - \sum_{j=1}^P x_{ij} b_j\big)^2$ with respect to $b_1, b_2, \ldots, b_P$ is equivalent to the minimization of $(\underline{y} - Xb)'(\underline{y} - Xb)$ with respect to the $P$-dimensional vector $b$ (where $b$ is an arbitrary member of $R^P$).

We wish to obtain a solution to the problem of minimizing the quantity $\sum_{i=1}^N \big(\underline{y}_i - \sum_{j=1}^P x_{ij} b_j\big)^2$ with respect to $b_1, b_2, \ldots, b_P$. Conditions that are necessary for this quantity to attain a minimum value can be obtained by differentiating with respect to $b_1, b_2, \ldots, b_P$ and by equating the resultant partial derivatives to 0. Or in what can be regarded as an appealing variation on this approach, we can reformulate the minimization problem in matrix notation [as the problem of minimizing $(\underline{y} - Xb)'(\underline{y} - Xb)$ with respect to $b$] and take advantage of some basic results on vector differentiation.
is the $j$th element of $A'x$). Further, in light of result (4.6), the Hessian matrix of $x'Ax$ is

$\dfrac{\partial^2 x'Ax}{\partial x\,\partial x'} = A + A'.$  (4.9)

In the special case where $A$ is symmetric, results (4.8) and (4.9) simplify to

$\dfrac{\partial\,x'Ax}{\partial x} = 2Ax$  (4.10)

and

$\dfrac{\partial^2 x'Ax}{\partial x\,\partial x'} = 2A.$  (4.11)
The value of the vector $X\tilde{b}$ is the same for any solution $\tilde{b}$ to the normal equations or, equivalently, for any vector $\tilde{b}$ at which $q(b)$ attains its minimum value. That is, for any two solutions $\tilde{b}_1$ and $\tilde{b}_2$ to the normal equations [or any two vectors $\tilde{b}_1$ and $\tilde{b}_2$ at which $q(b)$ attains its minimum value],

$X\tilde{b}_1 = X\tilde{b}_2.$  (4.18)

Equality (4.18) can be established by applying result (4.17) or, alternatively, by observing that $X'X\tilde{b}_1 = X'y = X'X\tilde{b}_2$ and then observing (in light of Corollary 2.3.4) that $X'X\tilde{b}_1 = X'X\tilde{b}_2 \Rightarrow X\tilde{b}_1 = X\tilde{b}_2$. The vector $(X'X)^- X'y$ is a solution to the normal equations. Thus, as a variation on result (4.18), we have that for any solution $\tilde{b}$ to the normal equations or, equivalently, for any vector $\tilde{b}$ at which $q(b)$ attains its minimum value,

$X\tilde{b} = X(X'X)^- X'y = P_X y$  (4.19)

[where $P_X = X(X'X)^- X'$].

Let $\tilde{b}$ represent any solution to the normal equations. Then, the minimum value of $q(b)$ is

$q(\tilde{b}) = (y - X\tilde{b})'(y - X\tilde{b}) = y'(I - P_X)y.$  (4.22)

Note that expression (4.22) is a quadratic form (in $y$), the matrix of which is $I - P_X$.
In summary, we have established the following:

(1) The function $q(b) = (y - Xb)'(y - Xb)$ attains a minimum value at a point $\tilde{b}$ if and only if $\tilde{b}$ is a solution to the linear system $X'Xb = X'y$ (comprising the normal equations).

(2) The linear system $X'Xb = X'y$ is consistent.

(3) $X\tilde{b}_1 = X\tilde{b}_2$ for any two solutions $\tilde{b}_1$ and $\tilde{b}_2$ to the linear system $X'Xb = X'y$ or, equivalently, for any two vectors $\tilde{b}_1$ and $\tilde{b}_2$ at which $q(b)$ attains its minimum value.

(3′) $X\tilde{b} = P_X y$ for any solution $\tilde{b}$ to the linear system $X'Xb = X'y$ or, equivalently, for any vector $\tilde{b}$ at which $q(b)$ attains its minimum value.

(4) For any solution $\tilde{b}$ to the linear system $X'Xb = X'y$,

$\min_b q(b) = q(\tilde{b}) = y'(y - X\tilde{b}) = y'y - \tilde{b}'X'y = y'(I - P_X)y.$
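A small numeric check of results (4.18) and (4.19), using an artificial rank-deficient $X$ of my own construction (numpy's pinv supplies one generalized inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
X = np.column_stack([X, X[:, 0] + X[:, 1]])   # 4th column dependent: rank 3
y = rng.standard_normal(8)

b1 = np.linalg.pinv(X.T @ X) @ X.T @ y        # one solution of X'Xb = X'y
n = np.array([1.0, 1.0, 0.0, -1.0])           # Xn = 0, so b1 + n is another
b2 = b1 + n
print(np.allclose(X.T @ X @ b2, X.T @ y))     # True: b2 also solves them
print(np.allclose(X @ b1, X @ b2))            # True: Xb is invariant, (4.18)

P = X @ np.linalg.pinv(X.T @ X) @ X.T         # P_X = X(X'X)^- X'
print(np.allclose(X @ b1, P @ y))             # True: Xb = P_X y, (4.19)
```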
Further, let $y = (y_1, y_2, \ldots, y_N)'$, and continue to take $X$ to be the $N \times P$ matrix with $ij$th element $x_{ij} = \delta_j(u_i)$.

By definition, the least squares estimator of an estimable function $\lambda'\beta$ is the function, say $\ell(y)$, of $y$ whose value at $y = \underline{y}$ (an arbitrary $N \times 1$ vector) is taken to be $\lambda'\tilde{b}$, where $\tilde{b}$ is any solution to the linear system

$X'Xb = X'\underline{y}$  (4.23)

(in the $P \times 1$ vector $b$), comprising the normal equations. Unless $rank(X) = P$, there are an infinite number of solutions to the normal equations and hence an infinite number of choices for $\tilde{b}$. Nevertheless, $\lambda'\tilde{b}$ is uniquely defined; that is, $\lambda'\tilde{b}$ is invariant to the choice of $\tilde{b}$. To see this, let $\tilde{b}_1$ and $\tilde{b}_2$ represent any two solutions to linear system (4.23), and observe (in light of the results of Subsection b and Section 5.3) that $X\tilde{b}_1 = X\tilde{b}_2$ and that (because of the estimability of $\lambda'\beta$) $\lambda' = a'X$ for some $N \times 1$ vector $a$, so that

$\lambda'\tilde{b}_1 = a'X\tilde{b}_1 = a'X\tilde{b}_2 = \lambda'\tilde{b}_2.$

The solutions to linear system (4.23) include the vector $(X'X)^- X'\underline{y}$. Thus, among the representations for the least squares estimator $\ell(y)$ of an estimable function $\lambda'\beta$ is the representation

$\ell(y) = \lambda'(X'X)^- X'y,$  (4.24)

so the least squares estimator is a linear estimator. In the special case where $X$ is of full column rank $P$, linear system (4.23) has the unique solution $(X'X)^{-1} X'\underline{y}$. And in that special case, expression (4.24) becomes

$\ell(y) = \lambda'(X'X)^{-1} X'y.$
Some further results on estimability. Let us continue to suppose that $y$ is a column vector of observable random variables $y_1, y_2, \ldots, y_N$ that follow a G–M, Aitken, or general linear model. And let us consider further the subject of Section 5.3, namely, the estimability of a parametric function of the form $\lambda'\beta$ ($= \sum_{j=1}^P \lambda_j \beta_j$).

In Section 5.3, the estimability of $\lambda'\beta$ was related to various characteristics of the model matrix $X$. A number of conditions were set forth, each of which is necessary and sufficient for estimability. Those conditions can be restated in terms of various characteristics of the $P \times P$ matrix $X'X$, which is the coefficient matrix of the normal equations. Their restatement is based on the following results:

$k'X'X = 0 \;\Leftrightarrow\; k'X' = 0,$  (4.28)

or, equivalently,

$X'Xk = 0 \;\Leftrightarrow\; Xk = 0,$

or, also equivalently,

$N(X'X) = N(X)$

—results (4.25), (4.26), and (4.27) were established in Section 2.12, and result (4.28) is a consequence of Corollary 2.3.4.
In light of results (4.25), (4.26), (4.27), and (4.28), it follows from the results of Section 5.3 that each of the following conditions is necessary and sufficient for the estimability of $\lambda'\beta$:

(1) $\lambda' \in R(X'X)$;

(2) $\lambda' = r'X'X$ for some $P \times 1$ vector $r$;

(3) $\lambda'(X'X)^- X'X = \lambda'$ or, equivalently,

(3′) $\lambda'[I - (X'X)^- X'X] = 0$.

Moreover, for any $P \times 1$ vector $r$,

$E(r'X'y) = r'X'X\beta.$

And upon observing that the least squares estimator of $r'X'X\beta$ is $r'X'y$, it follows that any linear combination of the elements of the vector $X'y$ (defined by the right side of the normal equations) is the least squares estimator of its expected value.
Conjugate normal equations. Let us resume our discussion of least squares estimation, taking the setting to be that in which $N$ data points are regarded as the respective values of the elements of an $N \times 1$ observable random vector $y$ that follows a G–M, Aitken, or general linear model. Corresponding to any linear combination $\lambda'\beta$ (of the elements of the vector $\beta$) is the linear system

$X'Xr = \lambda$  (4.31)

(in the $P \times 1$ vector $r$). Its coefficient matrix is the same as that of the linear system

$X'Xb = X'\underline{y},$  (4.32)

comprising the normal equations; however, the right side of linear system (4.31) is the coefficient vector $\lambda$, while that of linear system (4.32) is $X'\underline{y}$. The $P$ equations that form linear system (4.31) are sometimes referred to (collectively) as the conjugate normal equations.

It follows from the results of Part 1 of the present subsection that $\lambda'\beta$ is estimable if and only if the conjugate normal equations are consistent. Now, suppose that $\lambda'\beta$ is estimable, and consider the least squares estimator $\ell(y)$ of $\lambda'\beta$. The value $\ell(\underline{y})$ of $\ell(y)$ at $y = \underline{y}$ is expressible in terms of any solution to the normal equations; for any solution $\tilde{b}$ to linear system (4.32),

$\ell(\underline{y}) = \lambda'\tilde{b}.$  (4.33)

The value of $\ell(\underline{y})$ is also expressible in terms of any solution to the conjugate normal equations. For any solution $\tilde{r}$ to linear system (4.31) [and any solution $\tilde{b}$ to linear system (4.32)], we find that

$\ell(\underline{y}) = \lambda'\tilde{b} = (X'X\tilde{r})'\tilde{b} = \tilde{r}'X'X\tilde{b} = \tilde{r}'X'\underline{y}.$  (4.34)

The upshot of result (4.34) is that the roles of the normal equations and the conjugate normal equations are (in a certain sense) interchangeable. The least squares estimate $\ell(\underline{y})$ can be obtained by forming the (usual) inner product $\lambda'\tilde{b}$ of a solution $\tilde{b}$ to the normal equations and of the right side $\lambda$ of the conjugate normal equations. Or, alternatively, it can be obtained by forming the inner product $\tilde{r}'X'\underline{y}$ of a solution $\tilde{r}$ to the conjugate normal equations and of the right side $X'\underline{y}$ of the normal equations.
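The interchangeability asserted by result (4.34) is easy to verify numerically; the following sketch (an illustration of mine, with an estimable $\lambda$ constructed as $X'a$) computes the same least squares estimate both ways.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 4))
X = np.column_stack([X, X @ np.array([1.0, 2.0, 0.0, 0.0])])  # rank-deficient
y = rng.standard_normal(10)
lam = X.T @ rng.standard_normal(10)        # lam = X'a, so lam'beta is estimable

b = np.linalg.pinv(X.T @ X) @ (X.T @ y)    # a solution of the normal equations
r = np.linalg.pinv(X.T @ X) @ lam          # a solution of X'Xr = lam
print(np.isclose(lam @ b, r @ (X.T @ y)))  # True: lam'b = r'X'y, result (4.34)
```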
The general form and expected values, variances, and covariances of least squares estimators. Let us continue to take $y$ to be an $N$-dimensional observable random (column) vector that follows a G–M, Aitken, or general linear model. And let us consider the general form and expected values, variances, and covariances of least squares estimators of estimable linear combinations of the elements of $\beta$.

Suppose that $\lambda'\beta$ is an estimable linear combination. Then, in light of the results of Part 2 of the present subsection, the least squares estimator $\ell(y)$ of $\lambda'\beta$ is expressible in the form

$\ell(y) = \tilde{r}'X'y,$  (4.35)

where $\tilde{r}$ is any solution to the conjugate normal equations $X'Xr = \lambda$. It follows immediately that the least squares estimator is a linear estimator, in confirmation of what was established earlier (in the introductory part of the present subsection) via a different approach. Moreover,

$E[\ell(y)] = \tilde{r}'X'X\beta = \lambda'\beta.$  (4.36)

Under the general linear model, the variance of the least squares estimator of $\lambda'\beta$ is [in light of result (4.37) and the equality $\tilde{r}'X' = (X\tilde{r})'$] expressible as

$var[\ell(y)] = var(\tilde{r}'X'y) = \tilde{r}'X'V(\theta)X\tilde{r}.$  (4.38)

Result (4.38) can be extended. Suppose that $\lambda_1'\beta$ and $\lambda_2'\beta$ are two estimable linear combinations of the elements of $\beta$. Then, the least squares estimator of $\lambda_1'\beta$ equals $\tilde{r}_1'X'y$ and that of $\lambda_2'\beta$ equals $\tilde{r}_2'X'y$. Here, $\tilde{r}_1$ is any solution to the linear system $X'Xr_1 = \lambda_1$ (in $r_1$) and $\tilde{r}_2$ any solution to the linear system $X'Xr_2 = \lambda_2$ (in $r_2$). And under the general linear model,

$cov(\tilde{r}_1'X'y, \tilde{r}_2'X'y) = \tilde{r}_1'X'V(\theta)X\tilde{r}_2.$  (4.39)

Under the G–M model, considerable further simplification is possible, and various additional representations are obtainable. Specifically, we find that (under the G–M model)

$var(\tilde{r}'X'y) = \sigma^2\tilde{r}'X'X\tilde{r} = \sigma^2\tilde{r}'\lambda = \sigma^2\lambda'\tilde{r}$  (4.42)
$= \sigma^2\tilde{r}'X'X(X'X)^- X'X\tilde{r} = \sigma^2\lambda'(X'X)^-\lambda,$  (4.43)

and, similarly,

$cov(\tilde{r}_1'X'y, \tilde{r}_2'X'y) = \sigma^2\tilde{r}_1'X'X\tilde{r}_2 = \sigma^2\tilde{r}_1'\lambda_2 = \sigma^2\lambda_1'\tilde{r}_2$  (4.44)
$= \sigma^2\tilde{r}_1'X'X(X'X)^- X'X\tilde{r}_2 = \sigma^2\lambda_1'(X'X)^-\lambda_2.$  (4.45)
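Under the G–M model the representations in (4.42)–(4.43) can be compared directly; a sketch (mine, with $\sigma^2$ scaled out):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((12, 3))
X = np.column_stack([X, X[:, 0] - X[:, 2]])      # rank-deficient model matrix
lam = X.T @ rng.standard_normal(12)              # an estimable lam'beta
G = np.linalg.pinv(X.T @ X)
r = G @ lam                                      # solves X'Xr = lam
print(np.isclose(r @ X.T @ X @ r, lam @ r),      # r'X'Xr = lam'r
      np.isclose(lam @ r, lam @ G @ lam))        # ... = lam'(X'X)^- lam
```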
$\|x\| = (x \cdot x)^{1/2}.$

The angle between two nonnull $M$-dimensional column vectors $x = \{x_i\}$ and $y = \{y_i\}$ is defined indirectly in terms of its cosine. Specifically, the angle between $x$ and $y$ is the angle $\theta$ ($0 \le \theta \le \pi$) defined by

$\cos\theta = \dfrac{x \cdot y}{\|x\|\,\|y\|}$  (4.49)

—it follows from Theorem 2.4.21 (the Cauchy–Schwarz inequality) that $-1 \le \dfrac{x \cdot y}{\|x\|\,\|y\|} \le 1$. In the case of the usual inner product (and usual norm), equality (4.49) can be reexpressed in the form

$\cos\theta = \dfrac{x'y}{(x'x)^{1/2}(y'y)^{1/2}} = \dfrac{\sum_{i=1}^M x_i y_i}{\big(\sum_{i=1}^M x_i^2\big)^{1/2}\big(\sum_{i=1}^M y_i^2\big)^{1/2}}.$  (4.50)

By definition, two $M$-dimensional column vectors $x = \{x_i\}$ and $y = \{y_i\}$ are orthogonal (or perpendicular) to each other if $x \cdot y = 0$. Thus, when the inner product is taken to be the usual inner product, $x$ and $y$ are orthogonal to each other if $x'y = 0$ or, equivalently, if $\sum_{i=1}^M x_i y_i = 0$. The statement that $x$ and $y$ are orthogonal to each other is sometimes abbreviated to the statement that $x$ and $y$ are orthogonal. Clearly, two nonnull vectors are orthogonal if and only if the angle between them is $\pi/2$ ($90^\circ$) or, equivalently, the cosine of that angle is 0.

If an $M$-dimensional column vector $x$ is orthogonal to every vector in a subspace $U$ of $M$-dimensional column vectors, $x$ is said to be orthogonal to $U$. The set consisting of all $M$-dimensional column vectors that are orthogonal to the subspace $U$ is called the orthogonal complement of $U$ and is denoted by the symbol $U^\perp$. The set $U^\perp$ is a linear space (as can be readily verified). When $U = C(X)$ (where $X$ is a matrix), we may write $C^\perp(X)$ for $U^\perp$.
Least squares revisited: the projection and decomposition of the data vector. Denote by $y = (y_1, y_2, \ldots, y_N)'$ an $N$-dimensional column vector of data points—this notation differs somewhat from that employed earlier in the section (which included an underline). Further, suppose that $y_1, y_2, \ldots, y_N$ are accompanied by the corresponding values $u_1, u_2, \ldots, u_N$ of a $C$-dimensional column vector $u$ of explanatory variables. Let us consider the approximation of $y_1, y_2, \ldots, y_N$ by $\delta(u_1), \delta(u_2), \ldots, \delta(u_N)$, where $\delta(u)$ is a function of $u$. Which of the possible choices for the function $\delta(\cdot)$ results in the "best" approximation (and in what sense)? In particular, which results in the best approximation when the choice for $\delta(\cdot)$ is restricted to functions (of $u$) that are expressible as linear combinations of $P$ specified functions $\delta_1(\cdot), \delta_2(\cdot), \ldots, \delta_P(\cdot)$; that is, when the choice is restricted to those functions that are expressible in the form

$\delta(u) = b_1\delta_1(u) + b_2\delta_2(u) + \cdots + b_P\delta_P(u),$

where $b = (b_1, b_2, \ldots, b_P)'$. Note [in connection with result (4.52)] that

$(y - Xb)'(y - Xb) = \Big(y - \sum_{j=1}^P b_j x_j\Big)'\Big(y - \sum_{j=1}^P b_j x_j\Big).$

In light of result (4.52), the minimization problem that gives rise to the method of least squares can be regarded as that of minimizing $(y - Xb)'(y - Xb)$ [or, equivalently, that of minimizing the (usual) norm of $y - Xb$] with respect to the $P \times 1$ vector $b$. As previously indicated (in Subsection
b), $(y - Xb)'(y - Xb)$ attains a minimum value at a $P \times 1$ vector $\tilde{b}$ if and only if $\tilde{b}$ is a solution to the normal equations $X'Xb = X'y$.

Some further insights into the method of least squares can be obtained by transforming the underlying minimization problem into a more geometrically meaningful form. Let $U = C(X)$, and observe that an $N$-dimensional column vector $w$ is a member of $U$ if and only if $w = Xb$ for some $b$, in which case the elements of $b$ are the "coordinates" of $w$ with respect to the spanning set $\{x_1, x_2, \ldots, x_P\}$. Accordingly, the problem of minimizing $(y - Xb)'(y - Xb)$ with respect to $b$ can be reformulated as the "coordinate-free" problem of minimizing $(y - w)'(y - w)$ with respect to $w$, where $w$ is an arbitrary member of the linear space $U$. The latter problem depends on the matrix $X$ only through its column space. From a geometrical perspective, the problem is that of finding the vector in the subspace $U$ (of $R^N$) that is "closest" to the data vector $y$.
It follows from the results of Subsection b that $(y - w)'(y - w)$ attains a minimum value over the subspace $U$, doing so at a unique point $z$ that is expressible as

$z = X\tilde{b},$  (4.53)

where $\tilde{b}$ is any solution to the normal equations, or, equivalently, as

$z = P_X y.$  (4.54)

Taking (here and in the remainder of the present subsection) the inner product to be the usual inner product, the vector $z$ is such that $y - z \in U^\perp$; that is, the difference between $y$ and $z$ is orthogonal to every vector in $U$. To see this, let $a$ represent an arbitrary member of $U$ [$= C(X)$], and observe that $a = Xr$ for some $P \times 1$ vector $r$ and hence that

$a'(y - z) = r'X'(y - X\tilde{b}) = r'(X'y - X'X\tilde{b}) = r'(X'y - X'y) = r'0 = 0.$

Moreover, there is no member $w$ of $U$ other than $z$ for which $y - w \in U^\perp$, as is evident upon observing that if $w \in U$ and $y - w \in U^\perp$, then $w - z \in U$ and

$w - z = (y - z) - (y - w) \in U^\perp$
$\Rightarrow\; (w - z)'(w - z) = 0$
$\Rightarrow\; w = z.$
In summary, there is a unique vector $w$ in $U$ such that $y - w \in U^\perp$, namely, the vector $z$. This vector is referred to as the orthogonal projection of $y$ on $U$ or simply as the projection of $y$ on $U$. As previously indicated, the matrix $P_X$ is referred to as a projection matrix; the reason why is apparent from expression (4.54).

Conceptually, the point $z$ in $R^N$ at which $(y - w)'(y - w)$ attains its minimum value for $w \in U$ is obtainable by "projecting" the point $y$ onto the surface $U$. The point in $R^N$ located by this operation is such that the "line" formed by joining that point with the point $y$ is orthogonal (perpendicular) to the surface $U$.

Corresponding to the projection $z$ of $y$ on $U$ is the decomposition

$y = z + d,$  (4.55)

where $d = y - z$. The first component of this decomposition is a member of the linear space $U$ [$= C(X)$], and the second component is a member of the orthogonal complement $U^\perp$ [$= C^\perp(X)$] of $U$. In this context, the linear space $U$ is sometimes referred to as the estimation space—logically, it could also be referred to as the approximation space—and the linear space $U^\perp$ is sometimes referred to as the error space. Decomposition (4.55) is unique; if $y$ is expressed as the sum of two components, the first of which is in the estimation space $U$ and the second of which is in the error space $U^\perp$, then necessarily the first component equals $z$ and the second equals $d$ ($= y - z$).

FIGURE 5.1. The projection $z$ of the 2-dimensional data vector $y = (4, 8)'$ on the 1-dimensional linear space $U = C(X)$, where $X = x = (3, 1)'$.
Example: $N = 2$. Suppose that $N = 2$, that $y = (4, 8)'$, and that $X = x$, where $x$ is the 2-dimensional column vector $x = (3, 1)'$ (in which case $P = 1$). Then, the linear system $X'Xb = X'y$ becomes $(10)b = (20)$, which has the unique solution $\tilde{b} = (2)$. Thus, the projection of $y$ on the linear space $U$ [$= C(X)$] is the vector

$z = \begin{pmatrix} 3 \\ 1 \end{pmatrix}(2) = \begin{pmatrix} 6 \\ 2 \end{pmatrix},$

as depicted in Figure 5.1.
FIGURE 5.2. The projection $z$ of the 3-dimensional data vector $y = (-3, 38/5, -74/5)'$ on the 2-dimensional linear space $U = C(X)$, where $X = (x_1, x_2, x_3)$, with $x_1 = (0, 3, 6)'$, $x_2 = (-2, 2, 4)'$, and $x_3 = (-2, 1, 2)'$.

Example: $N = 3$. Suppose that $N = 3$, that $y = (-3, 38/5, -74/5)'$, and that $X = (x_1, x_2, x_3)$, where $x_1 = (0, 3, 6)'$, $x_2 = (-2, 2, 4)'$, and $x_3 = (-2, 1, 2)'$ (in which case $P = 3$). Here $x_3 = x_2 - (1/3)x_1$, so that $U = C(X)$ is 2-dimensional and the normal equations $X'Xb = X'y$ have infinitely many solutions. One solution to these equations is the vector $(-32/15, 1/2, 1)'$. Thus, the projection of $y$ on the linear space $U$ [$= C(X)$] is

$z = \begin{pmatrix} 0 & -2 & -2 \\ 3 & 2 & 1 \\ 6 & 4 & 2 \end{pmatrix}\begin{pmatrix} -32/15 \\ 1/2 \\ 1 \end{pmatrix} = \begin{pmatrix} -3 \\ -22/5 \\ -44/5 \end{pmatrix},$

as depicted in Figure 5.2.
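Both examples can be reproduced mechanically via $z = P_X y$; a sketch (mine) for the $N = 3$ example:

```python
import numpy as np

X = np.array([[0., -2., -2.],
              [3.,  2.,  1.],
              [6.,  4.,  2.]])
y = np.array([-3., 38/5, -74/5])

P = X @ np.linalg.pinv(X.T @ X) @ X.T     # P_X = X(X'X)^- X'
z = P @ y
print(z)                                  # [-3.  -4.4 -8.8] = (-3, -22/5, -44/5)'
print(np.allclose(X.T @ (y - z), 0.0))    # True: y - z lies in the error space
```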
$A = QR,$  (4.57)

where $Q = (Q_1, Q_2)$ is an $M \times M$ orthogonal matrix and where $R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix}$. The columns of the $M \times (M - N)$ submatrix $Q_2$ are any $M - N$ $M$-dimensional column vectors that together with the $N$ columns of $Q_1$ form an orthonormal basis for $R^M$.

Either of the two decompositions (4.56) and (4.57) might be referred to as a QR decomposition.
QR decomposition as a basis for least squares computations. Let us now resume our discussion of the computational aspects of the minimization of $(y - Xb)'(y - Xb)$. Assume that the $N \times P$ matrix $X$ is of full column rank $P$—discussion of the general case where $rank(X)$ may be less than $P$ is deferred until the final part of the present subsection. Consider the QR decomposition of $X$. That is, consider a decomposition of $X$ of the form

$X = Q_1 R_1,$  (4.58)

where $Q_1$ is an $N \times P$ matrix with orthonormal columns and $R_1$ is an upper triangular matrix with (strictly) positive diagonal elements, or of the related form

$X = QR,$  (4.59)

where $Q = (Q_1, Q_2)$ is an $N \times N$ orthogonal matrix and where $R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix}$.

Let $z = Q'y$, and partition $z$ as $z = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}$, where $z_1 = Q_1'y$ and $z_2 = Q_2'y$. Then,

$y - Xb = Q(z - Rb) = Q_1(z_1 - R_1 b) + Q_2 z_2.$  (4.60)

And

$(y - Xb)'(y - Xb) = (z - Rb)'Q'Q(z - Rb) = (z - Rb)'(z - Rb) = (z_1 - R_1 b)'(z_1 - R_1 b) + z_2'z_2.$  (4.61)
Since $R_1$ is nonsingular, it follows that $(y - Xb)'(y - Xb)$ attains a minimum value of $z_2'z_2$, doing so at the unique value $\tilde{b}$ of $b$ obtained by solving the linear system

$R_1 b = z_1,$  (4.62)

and that

$y - X\tilde{b} = Q_2 z_2.$  (4.63)

These results serve as the basis for an alternative approach to the least squares computations (not requiring the formation of the normal equations). In the alternative approach, the formation of the matrix $R_1$ and the vector $z_1$ is at the heart of the computations. Their formation can be accomplished through the use of Householder transformations (reflections) or Givens transformations (rotations) or through the use of a modified Gram–Schmidt procedure—refer, for example, to Golub and Van Loan (2013, chap. 5) for a detailed discussion. The value $\tilde{b}$ at which $(y - Xb)'(y - Xb)$ attains its minimum value is determined from $R_1$ and $z_1$ by solving linear system (4.62), doing so in a way that exploits the triangularity of $R_1$—refer, for example, to Harville (1997, sec. 11.8) for a discussion of the solution of a linear system with a triangular coefficient matrix.
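A minimal sketch (mine) of the QR route with numpy, for a full-column-rank $X$: form $R_1$ and $z_1 = Q_1'y$, then solve the triangular system (4.62).

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 4))
y = rng.standard_normal(20)

Q1, R1 = np.linalg.qr(X)              # "reduced" decomposition X = Q1 R1
z1 = Q1.T @ y
b = np.linalg.solve(R1, z1)           # solves R1 b = z1, i.e. system (4.62)
print(np.allclose(b, np.linalg.lstsq(X, y, rcond=None)[0]))  # agrees with lstsq
```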
Our results on the alternative approach to the least squares computations can be extended to the
general case where the matrix X is not necessarily of full column rank. The extension requires some
familiarity with a type of matrix called a permutation matrix.
Permutation matrices. A permutation matrix is a square matrix whose columns can be obtained by permuting (rearranging) the columns of an identity matrix. Thus, letting $u_1, u_2, \ldots, u_N$ represent the first, second, $\ldots$, $N$th columns, respectively, of $I_N$, an $N \times N$ permutation matrix is a matrix of the general form $(u_{k_1}, u_{k_2}, \ldots, u_{k_N})$, where $k_1, k_2, \ldots, k_N$ is an arbitrary permutation of the first $N$ positive integers $1, 2, \ldots, N$. For example, one permutation matrix of order $N = 3$ is the $3 \times 3$ matrix

$(u_3, u_1, u_2) = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix},$

whose columns are the third, first, and second columns, respectively, of $I_3$. Clearly, the columns of any permutation matrix form an orthonormal (with respect to the usual inner product) set, and hence any permutation matrix is an orthogonal matrix.

The $j$th element of the $k_j$th row of the $N \times N$ permutation matrix $(u_{k_1}, u_{k_2}, \ldots, u_{k_N})$ is 1, and its other $N - 1$ elements are 0. That is, the $j$th row $u_j'$ of $I_N$ is the $k_j$th row of $(u_{k_1}, u_{k_2}, \ldots, u_{k_N})$ or, equivalently, the $j$th column $u_j$ of $I_N$ is the $k_j$th column of $(u_{k_1}, u_{k_2}, \ldots, u_{k_N})'$. Thus, the transpose of any permutation matrix is itself a permutation matrix. Further, the rows of any permutation matrix are a permutation of the rows of an identity matrix and, conversely, any matrix whose rows can be obtained by permuting the rows of an identity matrix is a permutation matrix.

The effect of postmultiplying an $M \times N$ matrix $A$ by an $N \times N$ permutation matrix $P$ is to permute the columns of $A$ in the same way that the columns of $I_N$ were permuted in forming $P$. Thus, if $a_1, a_2, \ldots, a_N$ are the first, second, $\ldots$, $N$th columns of $A$, the first, second, $\ldots$, $N$th columns of the product $A(u_{k_1}, u_{k_2}, \ldots, u_{k_N})$ of $A$ and the $N \times N$ permutation matrix $(u_{k_1}, u_{k_2}, \ldots, u_{k_N})$ are $a_{k_1}, a_{k_2}, \ldots, a_{k_N}$, respectively. Further, the first, second, $\ldots$, $N$th columns $a_1, a_2, \ldots, a_N$ of $A$ are the $k_1, k_2, \ldots, k_N$th columns, respectively, of the product $A(u_{k_1}, u_{k_2}, \ldots, u_{k_N})'$ of $A$ and the permutation matrix $(u_{k_1}, u_{k_2}, \ldots, u_{k_N})'$. When $N = 3$, we have, for example, that
$A(u_3, u_1, u_2) = (a_1, a_2, a_3)\begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} = (a_3, a_1, a_2)$

and

$A(u_3, u_1, u_2)' = (a_1, a_2, a_3)\begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} = (a_2, a_3, a_1).$

Similarly, the effect of premultiplying an $N \times M$ matrix $A$ by an $N \times N$ permutation matrix is to permute the rows of $A$. If the first, second, $\ldots$, $N$th rows of $A$ are $a_1', a_2', \ldots, a_N'$, respectively, then the first, second, $\ldots$, $N$th rows of the product $(u_{k_1}, u_{k_2}, \ldots, u_{k_N})'A$ of the permutation matrix $(u_{k_1}, u_{k_2}, \ldots, u_{k_N})'$ and $A$ are $a_{k_1}', a_{k_2}', \ldots, a_{k_N}'$, respectively, and $a_1', a_2', \ldots, a_N'$ are the $k_1, k_2, \ldots, k_N$th rows, respectively, of $(u_{k_1}, u_{k_2}, \ldots, u_{k_N})A$. When $N = 3$, we have, for example, that

$(u_3, u_1, u_2)'A = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}\begin{pmatrix} a_1' \\ a_2' \\ a_3' \end{pmatrix} = \begin{pmatrix} a_3' \\ a_1' \\ a_2' \end{pmatrix}$

and

$(u_3, u_1, u_2)A = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}\begin{pmatrix} a_1' \\ a_2' \\ a_3' \end{pmatrix} = \begin{pmatrix} a_2' \\ a_3' \\ a_1' \end{pmatrix}.$
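These facts are conveniently stated in numpy terms (an illustration of mine):

```python
import numpy as np

I3 = np.eye(3)
P = I3[:, [2, 0, 1]]                          # P = (u3, u1, u2)
A = np.arange(9.0).reshape(3, 3)
print(np.allclose(A @ P, A[:, [2, 0, 1]]))    # postmultiplying permutes columns
print(np.allclose(P.T @ A, A[[2, 0, 1], :]))  # premultiplying by P' permutes rows
print(np.allclose(P.T @ P, I3))               # P is orthogonal
```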
Alternative approach to least squares computations: general case. Let us now extend our initial results on the alternative approach to the least squares computations. Accordingly, suppose that we wish to minimize the quantity $(y - Xb)'(y - Xb)$ and that $rank(X) = K$, where $K$ is possibly less than $P$—our initial results (the results of Part 2) were obtained under the simplifying assumption that the $N \times P$ matrix $X$ is of full column rank $P$.

Let $L$ represent any $P \times P$ permutation matrix such that the first $K$ columns of the $N \times P$ matrix $XL$ are linearly independent, and partition $L$ as $L = (L_1, L_2)$, where $L_1$ is of dimensions $P \times K$. Then,

$XL = (XL_1, XL_2),$

and $XL_1$ is of full column rank $K$. Decompose $XL_1$ as

$XL_1 = Q_1 R_1,$  (4.64)

where $Q_1$ is an $N \times K$ matrix with orthonormal columns and $R_1$ is an upper triangular matrix with (strictly) positive diagonal elements. And observe that the columns of $Q_1$ form a basis for $C(XL)$ [$= C(XL_1)$], so that

$XL_2 = Q_1 R_2$  (4.65)

for some matrix $R_2$. Together, results (4.64) and (4.65) imply that

$XL = Q_1(R_1, R_2)$

and also that

$XL = QR,$

where $Q = (Q_1, Q_2)$ is an $N \times N$ orthogonal matrix and where $R = \begin{pmatrix} R_1 & R_2 \\ 0 & 0 \end{pmatrix}$. Or, equivalently,

$X = Q_1(R_1, R_2)L' = Q_1 R_1 L_1' + Q_1 R_2 L_2'$

and

$X = QRL'.$  (4.66)

Let $h = L'b$, and partition $h$ as $h = \begin{pmatrix} h_1 \\ h_2 \end{pmatrix}$, where $h_1 = L_1'b$ and $h_2 = L_2'b$. Further, let
$z = Q'y$, and partition $z$ as $z = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}$, where $z_1 = Q_1'y$ and $z_2 = Q_2'y$. Then,

$y - Xb = Q(z - Rh) = Q_1(z_1 - R_1 h_1 - R_2 h_2) + Q_2 z_2,$  (4.67)

and hence

$(y - Xb)'(y - Xb) = (z_1 - R_1 h_1 - R_2 h_2)'(z_1 - R_1 h_1 - R_2 h_2) + z_2'z_2.$

For any fixed value $\tilde{h}_2$ of $h_2$, the first of these two terms vanishes when $h_1$ satisfies the linear system

$R_1 h_1 = z_1 - R_2\tilde{h}_2$  (4.69)

(in the vector $h_1$). Thus, an arbitrary one of the values of $h$ at which $(y - Xb)'(y - Xb)$ attains a minimum value is obtained by assigning $h_2$ an arbitrary value $\tilde{h}_2$ and by then taking the value $\tilde{h}_1$ of $h_1$ to be the solution to linear system (4.69)—the matrix $R_1$ is nonsingular, so that $\tilde{h}_1$ is uniquely determined by $\tilde{h}_2$. In particular, we could take the value of $h_2$ to be 0, and take the value of $h_1$ to be the (unique) solution to the linear system $R_1 h_1 = z_1$.

We conclude that $(y - Xb)'(y - Xb)$ attains a minimum value of $z_2'z_2$ and that it does so at a value $\tilde{b}$ of $b$ if and only if for some $(P - K) \times 1$ vector $\tilde{h}_2$, $\tilde{b} = L_1\tilde{h}_1 + L_2\tilde{h}_2$, where $\tilde{h}_1$ is the solution to linear system (4.69). Note [in light of result (4.67)] that for any such (minimizing) value $\tilde{b}$ of $b$,

$y - X\tilde{b} = Q_2 z_2.$  (4.70)

These results generalize the results obtained earlier (in Part 2) for the special case where the rank $K$ of the $N \times P$ matrix $X$ equals $P$. They provide a basis for extending the alternative approach to the least squares computations to the general case (where $K$ may be less than $P$). As in the special case, the formation of the matrix $R_1$ and the vector $z_1$ is at the heart of the computations. (And, as in the special case, the formation of $R_1$ and $z_1$ can be accomplished via any of several procedures devised for that purpose.) In the general case, there is also a need to determine the permutation matrix $L$ (i.e., to identify $K$ linearly independent columns of $X$) and possibly the matrix $R_2$—if the value of $h_2$ is taken to be 0, then $R_2$ is not needed. A value $\tilde{b}$ at which $(y - Xb)'(y - Xb)$ attains its minimum value is determined from $R_1$, $z_1$, $L$, and possibly $R_2$ by taking $\tilde{h}_2$ to be any $(P - K) \times 1$ vector, by computing the solution $\tilde{h}_1$ to linear system (4.69), and by setting $\tilde{b} = L_1\tilde{h}_1 + L_2\tilde{h}_2$.
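One practical way to obtain the permutation $L$ is QR with column pivoting; a sketch (mine, using scipy and taking $\tilde{h}_2 = 0$ so that $R_2$ is not needed):

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(4)
X = rng.standard_normal((15, 3))
X = np.column_stack([X, X[:, 0] + X[:, 1], X[:, 1] - X[:, 2]])  # K = 3, P = 5
y = rng.standard_normal(15)

Q, R, piv = qr(X, mode='economic', pivoting=True)   # X[:, piv] = Q R
K = np.linalg.matrix_rank(X)
h1 = solve_triangular(R[:K, :K], (Q.T @ y)[:K])     # R1 h1 = z1 (h2 = 0)
b = np.zeros(X.shape[1]); b[piv[:K]] = h1           # b = L1 h1
fit = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(X @ b, X @ fit))                  # the minimizing fit agrees
```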
the elements of the parametric vector $\beta$. The least squares estimator is a linear estimator, as was demonstrated in Section 5.4c—refer to representation (4.24) or (4.35). Moreover, the least squares estimator is an unbiased estimator. Its unbiasedness can be established directly by verifying that $E[\ell(y)] = \lambda'\beta$, as was done in Section 5.4c—refer to result (4.36). Alternatively, its unbiasedness can be established by applying the following result (from Section 5.1) on the unbiasedness of linear estimators: for an estimator of the form $c + a'y$ to be an unbiased estimator of $\lambda'\beta$, it is necessary and sufficient that

$c = 0$ and $X'a = \lambda.$  (5.1)

Upon observing [in light of result (4.35)] that $\ell(y) = (X\tilde{r})'y$, where $\tilde{r}$ is any solution to the conjugate normal equations $X'Xr = \lambda$, it follows immediately from the sufficiency of condition (5.1) that $\ell(y)$ is an unbiased estimator of $\lambda'\beta$.

The least squares estimator of $\lambda'\beta$ is translation equivariant as well as unbiased. To see this, recall (from Section 5.2) that for an estimator of the form $c + a'y$ to be a translation-equivariant estimator of $\lambda'\beta$, it is necessary and sufficient that $a'X = \lambda'$ or, equivalently, that $X'a = \lambda$. The translation equivariance of the least squares estimator $\ell(y)$ [which is expressible in the form $\ell(y) = (X\tilde{r})'y$] follows from the sufficiency of the condition $X'a = \lambda$ in much the same way that its unbiasedness follows from the sufficiency of condition (5.1).
When $y$ follows a G–M model, the least squares estimator of $\lambda'\beta$ is superior to other linear unbiased or translation-equivariant estimators in a sense that is to be discussed in Subsections a and b. More generally (when $y$ follows an Aitken or general linear model), this superiority is confined to special cases. These special cases include, of course, G–M models, but also a limited number of other models.
a. Gauss–Markov theorem
In the special case where $y$ is an $N \times 1$ observable random vector that follows a G–M model, the least squares estimator of an estimable linear combination $\lambda'\beta$ of the elements of the parametric vector $\beta$ is the best linear unbiased estimator in the sense described in the following theorem.

Theorem 5.5.1 (Gauss–Markov theorem). Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M, Aitken, or general linear model, and suppose that $\lambda'\beta$ is an estimable linear combination of the elements of the parametric vector $\beta$. Then, the least squares estimator of $\lambda'\beta$ is a linear unbiased estimator. Moreover, in the special case where $y$ follows a G–M model, the variance (and hence the mean squared error) of the least squares estimator is uniformly smaller than that of any other linear unbiased estimator.
Proof. That the least squares estimator is a linear unbiased estimator was established earlier (in the introductory part of the present section). Now, take $c + a'y$ to be an arbitrary linear unbiased estimator of the estimable linear combination $\lambda'\beta$, in which case

$c = 0$ and $X'a = \lambda$

(as noted earlier). And recall that the least squares estimator of $\lambda'\beta$ is expressible in the form $(X\tilde{r})'y$, where $\tilde{r}$ is any solution to the conjugate normal equations $X'Xr = \lambda$.

In the special case where $y$ follows a G–M model, we find that

$var(c + a'y) = var[(X\tilde{r})'y] + var[c + a'y - (X\tilde{r})'y] + 2\,cov[(X\tilde{r})'y,\; c + a'y - (X\tilde{r})'y]$
$= var[(X\tilde{r})'y] + var[c + a'y - (X\tilde{r})'y]$
$\ge var[(X\tilde{r})'y],$  (5.2)

with equality holding if and only if $var[c + a'y - (X\tilde{r})'y] = 0$ [the covariance term vanishes, since under the G–M model $cov[(X\tilde{r})'y, c + a'y - (X\tilde{r})'y] = \sigma^2\tilde{r}'X'(a - X\tilde{r}) = \sigma^2(\tilde{r}'\lambda - \tilde{r}'\lambda) = 0$]. Moreover, in the special case of the G–M model,

$var[c + a'y - (X\tilde{r})'y] = (a - X\tilde{r})'(\sigma^2 I)(a - X\tilde{r}) = \sigma^2(a - X\tilde{r})'(a - X\tilde{r}),$

so that equality holds in inequality (5.2) if and only if $a - X\tilde{r} = 0$ or, equivalently, if and only if $a = X\tilde{r}$. We conclude that in the special case of the G–M model, the variance of the least squares estimator is uniformly smaller than that of any other linear unbiased estimator. Q.E.D.
Theorem 5.5.1 (in one form or another) has come to be known as the Gauss–Markov theorem
(in honor of the contributions of Carl Friedrich Gauss and Andrei Andreevich Markov). It is one of
the most famous theoretical results in all of statistics. Seal (1967, sec. 3) considered this result from
a historical perspective. That Gauss’s name has come to be attached to the result of Theorem 5.5.1
seems altogether appropriate. The case for the attachment of Markov’s name appears to be much
weaker.
It is customary (both in the present setting and in general) to refer to a linear unbiased estimator that has minimum variance among all linear unbiased estimators as a BLUE (an acronym for best linear unbiased estimator or estimation). If $y$ is an $N \times 1$ observable random vector that follows a G–M model, then (according to the Gauss–Markov theorem) the least squares estimator of an estimable linear combination $\lambda'\beta$ of the elements of the parametric vector $\beta$ is the unique BLUE of $\lambda'\beta$. Albert (1972, sec. 6.1), in a comment he characterized as jocular, suggested that the least squares estimator of an estimable linear combination could be referred to as a TRUE (an acronym for tiniest residual unbiased estimator). Accordingly, when the least squares estimator is a BLUE, it could be referred to as a TRUE-BLUE—someone who is unswervingly loyal or faithful is said to be true-blue.
b. A corollary
Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M, Aitken, or general linear model, and take $\lambda'\beta$ to be an estimable linear combination of the elements of the parametric vector $\beta$. Further, let $\ell(y) = (X\tilde{r})'y$, where $\tilde{r}$ is any solution to the conjugate normal equations $X'Xr = \lambda$ [so that $\ell(y)$ is the least squares estimator of $\lambda'\beta$]. And let $c + a'y$ represent an arbitrary linear translation-equivariant estimator of $\lambda'\beta$ or, equivalently, any estimator of the form $c + a'y$ that satisfies the condition $a'X = \lambda'$; and recall (from the introductory part of the present section) that the least squares estimator is a linear translation-equivariant estimator.

Clearly, $E(a'y) = \lambda'\beta$, that is, $a'y$ is an unbiased estimator of $\lambda'\beta$. And, as a consequence, the MSE (mean squared error) of $c + a'y$ is

$E[(c + a'y - \lambda'\beta)^2] = c^2 + E[(a'y - \lambda'\beta)^2] + 2c\,E(a'y - \lambda'\beta)$
$= c^2 + var(a'y)$
$\ge var(a'y),$

with equality holding if and only if $c = 0$ and hence if and only if $c + a'y = a'y$. Moreover, in the special case where $y$ follows a G–M model, it follows from the Gauss–Markov theorem that

$var(a'y) \ge var[\ell(y)],$

with equality holding if and only if $a'y = \ell(y)$, that is, if and only if $a'y$ is the least squares estimator. Accordingly, in that special case, the MSE of the least squares estimator is uniformly smaller than the MSE of any other linear translation-equivariant estimator.

In summary, we have the following result, the main part of which can be regarded as a corollary of the Gauss–Markov theorem.

Corollary 5.5.2. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M, Aitken, or general linear model, and suppose that $\lambda'\beta$ is an estimable linear combination of the elements of the parametric vector $\beta$. Then, the least squares estimator of $\lambda'\beta$ is a linear translation-equivariant estimator. Moreover, in the special case where $y$ follows a G–M model, the mean squared error of the least squares estimator is uniformly smaller than that of any other linear translation-equivariant estimator.
are estimable. Upon letting $k = (k_1, k_2, \ldots, k_M)'$, this property can be restated (in matrix notation) as follows: the least squares estimator of $k'\Lambda'\beta$ [$= (\Lambda k)'\beta$] is $k'\ell(y)$. In light of results (4.24) and (6.1), this property can be readily verified by observing that the least squares estimator of $(\Lambda k)'\beta$ is

$(\Lambda k)'(X'X)^- X'y = k'\Lambda'(X'X)^- X'y = k'\ell(y).$

Alternatively, in light of results (4.35) and (6.2), it can be verified by observing that (for any solution $\tilde{R}$ to $X'XR = \Lambda$) $\tilde{R}k$ is a solution to the linear system $X'Xr = \Lambda k$ and hence that the least squares estimator of $(\Lambda k)'\beta$ is

$(\tilde{R}k)'X'y = k'\tilde{R}'X'y = k'\ell(y).$

Under the general linear model, the variance-covariance matrix of the least squares estimator of $\Lambda'\beta$ is expressible as

$var[\ell(y)] = \tilde{R}'X'V(\theta)X\tilde{R} = \Lambda'(X'X)^- X'V(\theta)X(X'X)^-\Lambda$  (6.4)

(where $\tilde{R}$ is any solution to $X'XR = \Lambda$). Result (6.4) can be deduced from result (4.39): start with the expressions for $cov[\ell_i(y), \ell_j(y)]$ obtained by applying result (4.39), and then observe that these expressions are essentially the same as the $ij$th elements of the expressions for $var[\ell(y)]$ given by result (6.4) ($i, j = 1, 2, \ldots, M$). In the special case of the Aitken model, result (6.4) "simplifies" to

$var[\ell(y)] = \sigma^2\tilde{R}'X'HX\tilde{R} = \sigma^2\Lambda'(X'X)^- X'HX(X'X)^-\Lambda.$  (6.5)

And in the further special case of the G–M model, we find that

$var[\ell(y)] = \sigma^2\tilde{R}'X'X\tilde{R} = \sigma^2\tilde{R}'\Lambda = \sigma^2\Lambda'\tilde{R}$  (6.6)
$= \sigma^2\tilde{R}'X'X(X'X)^- X'X\tilde{R} = \sigma^2\Lambda'(X'X)^-\Lambda.$  (6.7)
$c + A'y$ such that $c = 0$ and $A'X = \Lambda'$). Then, (the least squares estimator) $\ell(y)$ is a linear unbiased estimator. Moreover, in the special case where $y$ follows a G–M model, $var(c + A'y) - var[\ell(y)]$ is a nonnegative definite matrix, and $var(c + A'y) - var[\ell(y)] = 0$ or, equivalently, $var(c + A'y) = var[\ell(y)]$ if and only if $c + A'y = \ell(y)$.

Proof. That $\ell(y)$ is a linear unbiased estimator of $\Lambda'\beta$ follows from what was established in Section 5.5 (as was noted previously). Now, suppose that $y$ follows a G–M model, and consider the quadratic form

$k'\{var(c + A'y) - var[\ell(y)]\}k$  (6.9)

(in an $M$-dimensional column vector $k$). Clearly, the quadratic form (6.9) is reexpressible as

$k'\{var(c + A'y) - var[\ell(y)]\}k = var[k'c + (Ak)'y] - var[k'\ell(y)].$  (6.10)

Moreover, $k'c + (Ak)'y$ is a linear unbiased estimator of $(\Lambda k)'\beta$, the unbiasedness of which can be verified simply by observing that $E[k'c + (Ak)'y] = k'E(c + A'y) = k'\Lambda'\beta = (\Lambda k)'\beta$ or, alternatively, by observing [in light of the sufficiency of condition (1.4)] that $k'c = k'0 = 0$ and that $(Ak)'X = k'A'X = k'\Lambda' = (\Lambda k)'$. And as discussed in the introductory part of the present section, $k'\ell(y)$ is the least squares estimator of $(\Lambda k)'\beta$. Thus, it follows from the Gauss–Markov theorem that $var[k'\ell(y)] \le var[k'c + (Ak)'y]$ or, equivalently, that

$var[k'c + (Ak)'y] - var[k'\ell(y)] \ge 0.$  (6.11)

Together, results (6.10) and (6.11) imply that the quadratic form $k'\{var(c + A'y) - var[\ell(y)]\}k$ is nonnegative (for every $k$) and hence that the matrix $var(c + A'y) - var[\ell(y)]$ is nonnegative definite.

As a further implication of the Gauss–Markov theorem, we have that $var[k'c + (Ak)'y] = var[k'\ell(y)]$ or, equivalently, that equality holds in inequality (6.11) if and only if $k'c + (Ak)'y = k'\ell(y)$, leading [in light of equality (6.10) and Corollary 2.13.4] to the conclusion that $var(c + A'y) - var[\ell(y)] = 0$ if and only if $k'c + (Ak)'y = k'\ell(y)$ for every $k$ and hence if and only if $c + A'y = \ell(y)$. Q.E.D.
Suppose (in connection with Theorem 5.6.1) that $y$ follows a G–M model, in which case $var(c + A'y) - var[\ell(y)]$ is nonnegative definite. Then, $var(c + A'y) - var[\ell(y)] = R'R$ for some matrix $R$ (as is evident from Corollary 2.13.25). And upon recalling Lemma 2.3.2 and observing that $R'R$ equals 0 if and only if all $M$ of its diagonal elements equal 0 and upon letting (for $j = 1, 2, \ldots, M$) $c_j$ represent the $j$th element of $c$, $a_j$ the $j$th column of $A$, and $\ell_j(y)$ the $j$th element of $\ell(y)$, it follows that

$var(c + A'y) - var[\ell(y)] = 0$
$\Leftrightarrow\; var(c_j + a_j'y) - var[\ell_j(y)] = 0 \;(j = 1, 2, \ldots, M)$
$\Leftrightarrow\; tr\{var(c + A'y) - var[\ell(y)]\} = 0$

or, equivalently, that

$var(c + A'y) = var[\ell(y)] \;\Leftrightarrow\; var(c_j + a_j'y) = var[\ell_j(y)] \;(j = 1, 2, \ldots, M)$
$\Leftrightarrow\; tr[var(c + A'y)] = tr\{var[\ell(y)]\}.$

Because the diagonal elements of a nonnegative definite matrix are inherently nonnegative (as evidenced by Corollary 2.13.14), the following result can be regarded as a corollary of Theorem 5.6.1.
Corollary 5.6.2. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M model, take $\Lambda'\beta$ to be any $M \times 1$ vector of estimable linear combinations of the elements of the parametric vector $\beta$, denote by $\ell(y)$ the least squares estimator of $\Lambda'\beta$, and let $c + A'y$ represent an arbitrary linear unbiased estimator of $\Lambda'\beta$. Then,

$tr\{var[\ell(y)]\} \le tr[var(c + A'y)],$

with equality holding if and only if $c + A'y = \ell(y)$.

Moreover [taking $c + A'y$ to be an arbitrary linear translation-equivariant estimator of $\Lambda'\beta$], the difference (6.14) between the mean-squared-error matrices of $c + A'y$ and $\ell(y)$ is a nonnegative definite matrix, and this difference equals 0 if and only if $c + A'y = \ell(y)$. The nonnegative definiteness of the matrix (6.14) and the condition [$c + A'y = \ell(y)$] under which it equals 0 follow from Theorem 5.6.1 in much the same way that the main part of Corollary 5.5.2 follows from the Gauss–Markov theorem.
diagonal]. Accordingly, in rearranging the elements of $A$ in the form of a vector (as in forming the vec), we may wish to exclude the $N(N-1)/2$ "duplicate" elements. Thus, as an alternative to the vec of $A$, we may wish to consider the $N(N+1)/2$-dimensional column vector

$\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_N \end{pmatrix},$  (7.2)

where (for $i = 1, 2, \ldots, N$) $a_i = (a_{ii}, a_{i+1,i}, \ldots, a_{Ni})'$ is the subvector of the $i$th column of $A$ obtained by striking out its first $i - 1$ elements. The vector (7.2) is referred to as the vech of $A$ and is denoted by the symbol vech($A$) or vech $A$. For $N = 1$, $N = 2$, and $N = 3$,

$\text{vech}\,A = (a_{11}), \quad \text{vech}\,A = \begin{pmatrix} a_{11} \\ a_{21} \\ a_{22} \end{pmatrix}, \quad \text{and} \quad \text{vech}\,A = \begin{pmatrix} a_{11} \\ a_{21} \\ a_{31} \\ a_{22} \\ a_{32} \\ a_{33} \end{pmatrix}, \quad \text{respectively.}$

Every element of $A$, and hence every element of vec $A$, is either an element of vech $A$ or a "duplicate" of an element of vech $A$. Thus, there exists a unique $[N^2 \times N(N+1)/2]$-dimensional matrix, to be denoted by the symbol $G_N$, such that (for every $N \times N$ symmetric matrix $A$)

$\text{vec}\,A = G_N\,\text{vech}\,A.$
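A direct construction of $G_N$ (my own sketch; vec and vech both stack columns):

```python
import numpy as np

def duplication(N):
    G = np.zeros((N * N, N * (N + 1) // 2))
    col = 0
    for j in range(N):
        for i in range(j, N):            # vech keeps a_ij with i >= j
            G[j * N + i, col] = 1.0      # position of a_ij in vec A
            G[i * N + j, col] = 1.0      # position of its duplicate a_ji
            col += 1
    return G

A = np.array([[1., 2., 3.], [2., 4., 5.], [3., 5., 6.]])
vecA = A.flatten(order='F')                         # stack the columns
vechA = np.concatenate([A[j:, j] for j in range(3)])
print(np.allclose(vecA, duplication(3) @ vechA))    # True: vec A = G_N vech A
```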
Among the various properties of the Kronecker product operation is the following: for any matrices $A$ and $B$,

$(A \otimes B)' = A' \otimes B'$  (7.4)

—for a verification of equality (7.4), refer, e.g., to Harville (1997, sec. 16.1).

Two formulas. There are two formulas that will be convenient to have at our disposal; one of these is for the vec of a product of three matrices and the other is for the trace of the product of four matrices. The two formulas are as follows. For any $M \times N$ matrix $A$, $N \times P$ matrix $B$, and $P \times Q$ matrix $C$,

$\text{vec}(ABC) = (C' \otimes A)\,\text{vec}\,B;$  (7.5)

and for any $M \times N$ matrix $A$, $N \times P$ matrix $B$, $P \times Q$ matrix $C$, and $Q \times M$ matrix $D$,

$tr(ABCD) = [\text{vec}(A')]'(D' \otimes B)\,\text{vec}\,C.$  (7.6)

For a derivation of formulas (7.5) and (7.6), refer, for example, to Harville (1997, sec. 16.2).
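Both formulas admit a quick numeric check (a sketch of mine; the column-stacking vec is `flatten(order='F')` in numpy):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))
D = rng.standard_normal((5, 2))

vec = lambda M: M.flatten(order='F')
print(np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B)))   # formula (7.5)
print(np.isclose(np.trace(A @ B @ C @ D),
                 vec(A.T) @ np.kron(D.T, B) @ vec(C)))         # formula (7.6)
```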
b. Expected values and variances of quadratic forms (and their covariances with each other and with linear forms)

Suppose that $x$ is an $N$-dimensional random column vector. Then, it is customary to refer to a linear combination, say $a'x$, of the elements of $x$ (where $a$ is an $N \times 1$ vector of constants) as a linear form (in $x$).

Formulas for the expected values and the variances and covariances of linear forms are available from the results of Sections 3.1 and 3.2. If the random vector $x$ has a mean vector $\mu$, then the expected value of a linear form $a'x$ (in $x$) is expressible as

$E(a'x) = a'\mu.$  (7.7)

And if, in addition, $x$ has a variance-covariance matrix $\Sigma$, then the variance of $a'x$ is expressible as

$var(a'x) = a'\Sigma a,$  (7.8)

and, more generally, the covariance of $a'x$ and $b'x$ (where $b'x$ is a second linear form in $x$) is expressible as

$cov(a'x, b'x) = a'\Sigma b.$  (7.9)

In what follows, these results are extended by obtaining formulas for the expected values and variances of quadratic forms (in a random column vector) and formulas for the covariances of the quadratic forms with each other and with linear forms.
Main results. The main results are presented in a series of three theorems.

Theorem 5.7.1. Let $x$ represent an $N$-dimensional random column vector having mean vector $\mu = \{\mu_i\}$ and variance-covariance matrix $\Sigma = \{\sigma_{ij}\}$, and take $A = \{a_{ij}\}$ to be an $N \times N$ matrix of constants. Then,

$E(x'Ax) = \sum_{i,j} a_{ij}(\sigma_{ij} + \mu_i\mu_j)$  (7.10)
$= tr(A\Sigma) + \mu'A\mu.$  (7.11)

Theorem 5.7.2. Let $x$ represent an $N$-dimensional random column vector having mean vector $\mu = \{\mu_i\}$, variance-covariance matrix $\Sigma = \{\sigma_{ij}\}$, and third central moments $\lambda_{ijk} = E[(x_i - \mu_i)(x_j - \mu_j)(x_k - \mu_k)]$ ($i, j, k = 1, 2, \ldots, N$); and take $A = \{a_{ij}\}$ to be an $N \times N$ symmetric matrix of constants and $b$ an $N \times 1$ vector of constants. Then,

$cov(b'x, x'Ax) = b'\Lambda\,\text{vec}\,A + 2b'\Sigma A\mu,$  (7.13)

where $\Lambda$ is an $N \times N^2$ matrix whose entry for the $i$th row and $jk$th column [i.e., column $(j-1)N + k$] is $\lambda_{ijk}$.
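Formula (7.11) can be checked by simulation (a sketch of mine, with arbitrary $\mu$, $\Sigma$, and $A$):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 3
mu = np.array([1.0, -2.0, 0.5])
L = rng.standard_normal((N, N)); Sigma = L @ L.T
A = rng.standard_normal((N, N))

x = rng.multivariate_normal(mu, Sigma, size=200_000)
sim = np.mean(np.einsum('ni,ij,nj->n', x, A, x))   # Monte Carlo E(x'Ax)
exact = np.trace(A @ Sigma) + mu @ A @ mu
print(sim, exact)                                  # agree to simulation error
```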
Proof. Letting $z = \{z_i\} = x - \mu$ (in which case $x = z + \mu$) and using Theorem 5.7.1 [and observing that $z'A\mu = (z'A\mu)' = \mu'Az$ and that $b'z = z'b$], we find that

$cov(b'x, x'Ax) = cov(b'z, x'Ax) = E[(b'z)\,x'Ax]$
$= E[(b'z)(z'Az + 2\mu'Az + \mu'A\mu)]$
$= E\big[\sum_i b_i z_i \sum_{j,k} a_{jk} z_j z_k\big] + 2\,E[z'b\,\mu'Az] + 0$
$= E\big[\sum_{i,j,k} b_i a_{jk} z_i z_j z_k\big] + 2\sum_{i,k}\sum_j b_i\mu_j a_{jk}\sigma_{ik}$
$= \sum_{i,j,k} b_i a_{jk}(\lambda_{ijk} + 2\mu_j\sigma_{ik})$
$= \sum_{j,k}\sum_i b_i\lambda_{ijk} a_{kj} + 2\sum_k\sum_i b_i\sigma_{ik}\sum_j a_{kj}\mu_j$
$= b'\Lambda\,\text{vec}\,A + 2b'\Sigma A\mu.$
Q.E.D.
If the distribution of $x$ is MVN, then $\Lambda = 0$ (as is evident from the results of Section 3.5n). More generally, if the distribution of $x$ is symmetric [in the sense that $-(x - \mu) \sim x - \mu$], then $\Lambda = 0$ [as is evident upon observing that if the distribution of $x$ is symmetric, then (for all $i$, $j$, and $k$) $\lambda_{ijk} = -\lambda_{ijk}$]. When $\Lambda = 0$, formula (7.13) simplifies to

$cov(b'x, x'Ax) = 2b'\Sigma A\mu.$

Thus, if the distribution of $x$ is symmetric and its mean vector is null, then any linear form in $x$ is uncorrelated with any quadratic form.
Theorem 5.7.3. Let $x$ represent an $N$-dimensional random column vector having mean vector $\mu = \{\mu_i\}$, variance-covariance matrix $\Sigma = \{\sigma_{ij}\}$, third central moments $\lambda_{ijk} = E[(x_i - \mu_i)(x_j - \mu_j)(x_k - \mu_k)]$ ($i, j, k = 1, 2, \ldots, N$), and fourth central moments $\gamma_{ijkm} = E[(x_i - \mu_i)(x_j - \mu_j)(x_k - \mu_k)(x_m - \mu_m)]$ ($i, j, k, m = 1, 2, \ldots, N$); and take $A = \{a_{ij}\}$ and $H = \{h_{ij}\}$ to be $N \times N$ symmetric matrices of constants. Then,

$cov(x'Ax, x'Hx)$
$= \sum_{i,j,k,m} a_{ij}h_{km}[(\gamma_{ijkm} - \sigma_{ij}\sigma_{km} - \sigma_{ik}\sigma_{jm} - \sigma_{im}\sigma_{jk}) + 2\mu_k\lambda_{ijm} + 2\mu_i\lambda_{jkm} + 2\sigma_{ik}\sigma_{jm} + 4\mu_j\mu_k\sigma_{im}]$  (7.15)
$= (\text{vec}\,A)'\Gamma\,\text{vec}\,H + 2\mu'H\Lambda\,\text{vec}\,A + 2\mu'A\Lambda\,\text{vec}\,H + 2\,tr(A\Sigma H\Sigma) + 4\mu'A\Sigma H\mu,$  (7.16)

where $\Gamma$ is an $N^2 \times N^2$ matrix whose entry for the $ij$th row [row $(i-1)N + j$] and $km$th column [column $(k-1)N + m$] is $\gamma_{ijkm} - \sigma_{ij}\sigma_{km} - \sigma_{ik}\sigma_{jm} - \sigma_{im}\sigma_{jk}$ and where $\Lambda$ is an $N \times N^2$ matrix whose entry for the $j$th row and $km$th column [column $(k-1)N + m$] is $\lambda_{jkm}$.
Proof. Letting $z = \{z_i\} = x - \mu$ (in which case $x = z + \mu$) and using Theorems 5.7.1 and 5.7.2 [and observing that $z'A\mu = (z'A\mu)' = \mu'Az$ and similarly that $z'H\mu = \mu'Hz$], we find that

$cov(x'Ax, x'Hx) = E[(x'Ax)(x'Hx)] - E(x'Ax)\,E(x'Hx)$
$= E[(z + \mu)'A(z + \mu)(z + \mu)'H(z + \mu)] - [E(z'Az) + \mu'A\mu][E(z'Hz) + \mu'H\mu]$
$= E[(z'Az)(z'Hz)] + 2\,E[(z'Az)(\mu'Hz)] + 2\,E[(\mu'Az)(z'Hz)] + 4\,E[(z'A\mu)(\mu'Hz)] + 2\,E[(z'A\mu)(\mu'H\mu)] + 2\,E[(\mu'A\mu)(\mu'Hz)] - E(z'Az)\,E(z'Hz)$
$= E[(z'Az)(z'Hz)] + 2\,cov(\mu'Hz, z'Az) + 2\,cov(\mu'Az, z'Hz) + 4\,E(z'A\mu\mu'Hz) + 0 + 0 - E(z'Az)\,E(z'Hz)$
$= E\Big[\sum_{i,j,k,m} a_{ij}h_{km} z_i z_j z_k z_m\Big] + 2\sum_{i,j,m} a_{ij}\Big(\sum_k \mu_k h_{km}\Big)\lambda_{ijm} + 2\sum_{j,k,m} h_{km}\Big(\sum_i \mu_i a_{ij}\Big)\lambda_{jkm}$
$\quad + 4\sum_{i,m}\Big(\sum_j a_{ij}\mu_j\Big)\Big(\sum_k h_{km}\mu_k\Big)\sigma_{im} - \sum_{i,j} a_{ij}\sigma_{ij}\sum_{k,m} h_{km}\sigma_{km}$
$= \sum_{i,j,k,m} a_{ij}h_{km}[\gamma_{ijkm} + 2\mu_k\lambda_{ijm} + 2\mu_i\lambda_{jkm} + 4\mu_j\mu_k\sigma_{im} - \sigma_{ij}\sigma_{km}]$
$= \sum_{i,j,k,m} a_{ij}h_{km}[(\gamma_{ijkm} - \sigma_{ij}\sigma_{km} - \sigma_{ik}\sigma_{jm} - \sigma_{im}\sigma_{jk}) + 2\mu_k\lambda_{ijm} + 2\mu_i\lambda_{jkm} + 2\sigma_{ik}\sigma_{jm} + 4\mu_j\mu_k\sigma_{im}]$

[the last step uses the symmetry of $A$ and $H$, which implies that $\sum a_{ij}h_{km}\sigma_{im}\sigma_{jk} = \sum a_{ij}h_{km}\sigma_{ik}\sigma_{jm}$], and regrouping the sums in matrix form then gives

$cov(x'Ax, x'Hx) = (\text{vec}\,A)'\Gamma\,\text{vec}\,H + 2\mu'H\Lambda\,\text{vec}\,A + 2\mu'A\Lambda\,\text{vec}\,H + 2\,tr(A\Sigma H\Sigma) + 4\mu'A\Sigma H\mu.$
Q.E.D.
The formulas of Theorems 5.7.2 and 5.7.3 were derived under the assumption that the matrix $A$ of the quadratic form $x'Ax$ is symmetric and (in the case of Theorem 5.7.3) the assumption that the matrix $H$ of the quadratic form $x'Hx$ is symmetric. Note that whether or not $A$ and/or $H$ are symmetric, it would be the case that $x'Ax = \frac{1}{2}x'(A + A')x$ and that $x'Hx = \frac{1}{2}x'(H + H')x$. Thus, the formulas of Theorems 5.7.2 and 5.7.3 could be extended to the case where the matrices of the quadratic forms are possibly nonsymmetric simply by substituting $\frac{1}{2}(A + A')$ for $A$ and (in the case of Theorem 5.7.3) $\frac{1}{2}(H + H')$ for $H$.
Some alternative representations. By making use of the vec and vech operations, the expressions
provided by the formulas of Theorems 5.7.1, 5.7.2, and 5.7.3 can be recast in ways that are informative
about the nature of the dependence of the expressions on the elements of the matrices of the quadratic
forms.
An alternative to the matrix expression (7.11) provided by Theorem 5.7.1 for the expected value of the quadratic form $x'Ax$ is as follows:

$E(x'Ax) = [\text{vec}(\Sigma + \mu\mu')]'\,\text{vec}\,A$  (7.21)

[as can be readily verified from expression (7.10)]. Expression (7.21) is a linear form in vec $A$; that is, it is a linear combination of the elements of vec $A$ (which are the elements of $A$). If $A$ is symmetric, then expression (7.21) can be restated as follows:

$E(x'Ax) = [\text{vec}(\Sigma + \mu\mu')]'G_N\,\text{vech}\,A$  (7.22)

(where $G_N$ is the duplication matrix). Expression (7.22) is a linear form in vech $A$, the elements of which are $N(N+1)/2$ nonredundant elements of $A$—if $A$ is symmetric, $N(N-1)/2$ of its elements are redundant.

Now, consider the matrix expression (7.13) provided by Theorem 5.7.2 for the covariance of the linear form $b'x$ and the quadratic form $x'Ax$. Making use of result (7.6), we find that the second term of expression (7.13) can be reexpressed as follows:

$2b'\Sigma A\mu = 2\,tr(b'\Sigma A\mu) = 2\,tr[b'\Sigma A(\mu')'] = 2(\text{vec}\,b)'(\mu' \otimes \Sigma)\,\text{vec}\,A = 2b'(\mu' \otimes \Sigma)\,\text{vec}\,A,$

so that (for symmetric $A$) formula (7.13) can be restated as

$cov(b'x, x'Ax) = b'[\Lambda + 2(\mu' \otimes \Sigma)]G_N\,\text{vech}\,A.$  (7.23)

Expression (7.23) is a bilinear form in the $N$-dimensional column vector $b$ and the $N(N+1)/2$-dimensional column vector vech $A$; that is, for any particular value of $b$, it is a linear form in vech $A$, and for any particular value of vech $A$, it is a linear form in $b$.

Further, consider the matrix expression (7.16) provided by Theorem 5.7.3 for the covariance of the two quadratic forms $x'Ax$ and $x'Hx$. Making use of results (7.5) and (7.4), we find that the second and third terms of expression (7.16) can be reexpressed as follows:

$2\mu'H\Lambda\,\text{vec}\,A = 2(\text{vec}\,A)'(\mu \otimes \Lambda)'\,\text{vec}\,H$  (7.24)
$2\mu'A\Lambda\,\text{vec}\,H = 2(\text{vec}\,A)'(\mu \otimes \Lambda)\,\text{vec}\,H.$  (7.25)

Moreover, making use of result (7.6) (and Lemma 2.3.1), the fourth and fifth terms of expression (7.16) are reexpressible as

$2\,tr(A\Sigma H\Sigma) = 2(\text{vec}\,A)'(\Sigma \otimes \Sigma)\,\text{vec}\,H$  (7.26)
$4\mu'A\Sigma H\mu = 4(\text{vec}\,A)'[(\mu\mu') \otimes \Sigma]\,\text{vec}\,H.$  (7.27)

And based on results (7.24), (7.25), (7.26), and (7.27), formula (7.16) can be restated as follows:

$cov(x'Ax, x'Hx) = (\text{vec}\,A)'\{\Gamma + 2(\mu \otimes \Lambda)' + 2(\mu \otimes \Lambda) + 2(\Sigma \otimes \Sigma) + 4[(\mu\mu') \otimes \Sigma]\}\,\text{vec}\,H$  (7.28)
$= (\text{vech}\,A)'G_N'\{\Gamma + 2(\mu \otimes \Lambda)' + 2(\mu \otimes \Lambda) + 2(\Sigma \otimes \Sigma) + 4[(\mu\mu') \otimes \Sigma]\}G_N\,\text{vech}\,H.$  (7.29)

And upon setting $H = A$ in (7.29), we obtain a representation

$var(x'Ax) = (\text{vech}\,A)'G_N'\{\Gamma + 2(\mu \otimes \Lambda)' + 2(\mu \otimes \Lambda) + 2(\Sigma \otimes \Sigma) + 4[(\mu\mu') \otimes \Sigma]\}G_N\,\text{vech}\,A,$

in which $var(x'Ax)$ is expressed as a quadratic form in the $N(N+1)/2$-dimensional column vector vech $A$.
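For MVN $x$, the matrices $\Lambda$ and $\Gamma$ both vanish, and (7.16) reduces to $cov(x'Ax, x'Hx) = 2\,tr(A\Sigma H\Sigma) + 4\mu'A\Sigma H\mu$; a Monte Carlo check of that special case (a sketch of mine):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 3
mu = np.array([0.5, 1.0, -1.0])
L = rng.standard_normal((N, N)); Sigma = L @ L.T
A = rng.standard_normal((N, N)); A = (A + A.T) / 2     # symmetric
H = rng.standard_normal((N, N)); H = (H + H.T) / 2

x = rng.multivariate_normal(mu, Sigma, size=500_000)
qA = np.einsum('ni,ij,nj->n', x, A, x)
qH = np.einsum('ni,ij,nj->n', x, H, x)
sim = np.mean(qA * qH) - qA.mean() * qH.mean()
exact = 2 * np.trace(A @ Sigma @ H @ Sigma) + 4 * mu @ A @ Sigma @ H @ mu
print(sim, exact)                                      # agree to simulation error
```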
$\tilde{e} = y - P_X y \;[= (I - P_X)y].$

Corresponding to the vector $\tilde{e}$ is the quantity $\tilde{e}'\tilde{e}$, which is customarily referred to as the residual sum of squares. It follows from the results of Section 5.4b (on least squares minimization) that, for every value of $y$,

$\tilde{e}'\tilde{e} = \min_b\,(y - Xb)'(y - Xb).$  (7.33)

Moreover,

$\tilde{e}'\tilde{e} = y'(I - P_X)y,$  (7.34)

as is evident from the results of Section 5.4b or upon observing (in light of the symmetry and idempotency of $P_X$) that $(I - P_X)'(I - P_X) = I - P_X$.

In what follows (i.e., in the remainder of Subsection c), it is supposed that $y$ follows a G–M model, and the emphasis is on the estimation of the parameter $\sigma^2$.
An unbiased estimator. The expected value of the residual sum of squares can be derived by applying formula (7.11) (for the expected value of a quadratic form) to expression (7.34) (which is a quadratic form in the random vector $y$). Recalling that $P_X X = X$, we find that

$E(\tilde{e}'\tilde{e}) = E[y'(I - P_X)y]$
$= tr[(I - P_X)(\sigma^2 I)] + (X\beta)'(I - P_X)X\beta$
$= \sigma^2\,tr(I - P_X) + 0$
$= \sigma^2[N - tr(P_X)]$
$= \sigma^2(N - rank\,X)$  (7.36)

[since $tr(P_X) = rank(P_X) = rank\,X$].

Assume that the rank of the model matrix $X$ is (strictly) less than $N$. Then, upon dividing the residual sum of squares by $N - rank\,X$, we obtain the quantity

$\hat{\sigma}^2 = \dfrac{\tilde{e}'\tilde{e}}{N - rank\,X}.$

Clearly,

$E(\hat{\sigma}^2) = \sigma^2,$  (7.37)

that is, the quantity $\hat{\sigma}^2$ obtained by dividing the residual sum of squares by $N - rank\,X$ is an unbiased estimator of the parameter $\sigma^2$.
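The unbiasedness of $\hat{\sigma}^2$ shows up immediately in simulation (a sketch of mine, under normal errors):

```python
import numpy as np

rng = np.random.default_rng(8)
N, sigma2 = 50, 2.0
X = np.column_stack([np.ones(N), rng.standard_normal((N, 2))])
beta = np.array([1.0, -0.5, 2.0])
P = X @ np.linalg.pinv(X.T @ X) @ X.T
dof = N - np.linalg.matrix_rank(X)

est = []
for _ in range(2000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=N)
    e = y - P @ y                       # residual vector (I - P_X)y
    est.append(e @ e / dof)             # sigma2-hat
print(np.mean(est))                     # close to sigma2 = 2.0
```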
Let us find the variance of the estimator $\hat{\sigma}^2$. Suppose that the fourth-order moments of the distribution of the vector $e = (e_1, e_2, \ldots, e_N)'$ of residual effects are such that (for $i, j, k, m = 1, 2, \ldots, N$)

$E(e_i e_j e_k e_m) = \begin{cases} 3\sigma^4 & \text{if } m = k = j = i, \\ \sigma^4 & \text{if } j = i \text{ and } m = k \ne i, \text{ if } k = i \text{ and } m = j \ne i, \text{ or if } m = i \text{ and } k = j \ne i, \\ 0 & \text{otherwise} \end{cases}$  (7.38)

(as would be the case if the distribution of $e$ were MVN). Further, let $\Lambda$ represent the $N \times N^2$ matrix whose entry for the $j$th row and $km$th column [column $(k-1)N + m$] is $E(e_j e_k e_m)$. Then, upon applying formula (7.17) and once again making use of the properties of the $P_X$ matrix (set forth in Theorem 2.12.2), we find that $var(\tilde{e}'\tilde{e}) = 2\sigma^4(N - rank\,X)$ and hence that

$var(\hat{\sigma}^2) = \dfrac{var(\tilde{e}'\tilde{e})}{(N - rank\,X)^2} = \dfrac{2\sigma^4}{N - rank\,X}.$  (7.40)
The estimator $\hat{\sigma}^2$ is a member of the class of estimators of $\sigma^2$ of the form

$\dfrac{\tilde{e}'\tilde{e}}{k},$  (7.41)

where $k$ is a (strictly) positive constant. It is the estimator of the form (7.41) obtained by taking $k = N - rank\,X$. Taking $k = N - rank\,X$ achieves unbiasedness. Nevertheless, it can be of interest to consider other choices for $k$.

Let us derive the MSE (mean squared error) of the estimator (7.41). And, in doing so, let us continue to suppose that the fourth-order moments of the distribution of the vector $e = (e_1, e_2, \ldots, e_N)'$ of residual effects are such that (for $i, j, k, m = 1, 2, \ldots, N$) $E(e_i e_j e_k e_m)$ satisfies condition (7.38) (as would be the case if the distribution of $e$ were MVN).
The MSE of the estimator (7.41) can be regarded as a function, say m.k/, of the scalar k. Making
Estimation of Variability and Covariability 207
For what choice of k does m.k/ attain its minimum value? Upon differentiating m.k/ and
engaging in some algebraic simplification, we find that
eQ 0 eQ
(7.44)
N rank.X/ C 2
has minimum MSE. The estimator (7.44) is sometimes referred to as the Hodges–Lehmann estimator.
In light of results (7.36) and (7.42), it has a bias of
eQ 0 eQ N rank X
E 2 D 2 1
N rank.X/ C 2 N rank.X/ C 2
2 2
D (7.45)
N rank.X/ C 2
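A small Monte Carlo sketch (ours, with an arbitrary straight-line model and normal errors assumed for illustration) makes the MSE comparison among the three divisors concrete:

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma2, reps = 30, 4.0, 100_000
X = np.column_stack([np.ones(N), np.arange(N, dtype=float)])
beta = np.array([2.0, 0.3])
nu = N - np.linalg.matrix_rank(X)                    # N - rank X
M = np.eye(N) - X @ np.linalg.pinv(X.T @ X) @ X.T    # I - P_X

Y = (X @ beta)[:, None] + np.sqrt(sigma2) * rng.standard_normal((N, reps))
rss = np.einsum('ir,ir->r', Y, M @ Y)                # residual sums of squares

for k in (nu, nu + 2, N):                            # unbiased, Hodges-Lehmann, ML
    print(f"k = {k}: Monte Carlo MSE = {np.mean((rss / k - sigma2) ** 2):.4f}")
# Per the formula for m(k), the divisor k = nu + 2 should yield the smallest MSE.
```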
$$\mathrm{cov}(R'X'y,\;\tilde{e}) = 0. \quad (7.48)$$

Thus, the least squares estimator $r'X'y$ and the residual vector $\tilde{e}$ are uncorrelated. And, more generally, the vector $R'X'y$ of least squares estimators and the residual vector $\tilde{e}$ are uncorrelated.

Is the least squares estimator $r'X'y$ uncorrelated with the residual sum of squares $\tilde{e}'\tilde{e}$? Or, equivalently, is $r'X'y$ uncorrelated with an estimator of $\sigma^2$ of the form (7.41), including the unbiased estimator $\hat{\sigma}^2$ (and the Hodges–Lehmann estimator)? Assuming the model is such that the distribution of the vector $e = (e_1, e_2, \dots, e_N)'$ of residual effects has third-order moments $\lambda_{ijk} = E(e_i e_j e_k)$ ($i, j, k = 1, 2, \dots, N$) and making use of formula (7.13), we find that

$$\mathrm{cov}(r'X'y,\;\tilde{e}'\tilde{e}) = r'X'\Lambda\,\mathrm{vec}(I - P_X) + 2\sigma^2 r'X'(I - P_X)X\beta, \quad (7.49)$$

where $\Lambda$ is an $N \times N^2$ matrix whose entry for the $i$th row and $jk$th column [column $(j-1)N + k$] is $\lambda_{ijk}$. The second term of expression (7.49) equals 0, as is evident upon recalling that $P_X X = X$, and the first term equals 0 if $\Lambda = 0$, as would be the case if the distribution of $e$ were MVN or, more generally, if the distribution of $e$ were symmetric. Thus, if the distribution of $e$ is symmetric, then

$$\mathrm{cov}(r'X'y,\;\tilde{e}'\tilde{e}) = 0 \quad (7.50)$$

and, more generally,

$$\mathrm{cov}(R'X'y,\;\tilde{e}'\tilde{e}) = 0. \quad (7.51)$$

Accordingly, if the distribution of $e$ is symmetric, $r'X'y$ and $R'X'y$ are uncorrelated with any estimator of $\sigma^2$ of the form (7.41), including the unbiased estimator $\hat{\sigma}^2$ (and the Hodges–Lehmann estimator).
Are the vector $R'X'y$ of least squares estimators and the residual vector $\tilde{e}$ statistically independent (as well as uncorrelated)? If the model is such that the distribution of $e$ is MVN (in which case the distribution of $y$ is also MVN), then it follows from Corollary 3.5.6 that the answer is yes. That is, if the model is such that the distribution of $e$ is MVN, then $\tilde{e}$ is distributed independently of $R'X'y$ (and, in particular, $\tilde{e}$ is distributed independently of $r'X'y$). Moreover, $\tilde{e}$ being distributed independently of $R'X'y$ implies that "any" function of $\tilde{e}$ is distributed independently of $R'X'y$—refer, e.g., to Casella and Berger (2002, theorem 4.6.12). Accordingly, if the distribution of $e$ is MVN, then the residual sum of squares $\tilde{e}'\tilde{e}$ is distributed independently of $R'X'y$, and any estimator of $\sigma^2$ of the form (7.41) (including the unbiased estimator $\hat{\sigma}^2$ and the Hodges–Lehmann estimator) is distributed independently of $R'X'y$.
d. Translation invariance

Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M, Aitken, or general linear model. And suppose that we wish to make inferences about $\sigma^2$ (in the case of a G–M or Aitken model) or $\theta$ (in the case of a general linear model) or about various functions of $\sigma^2$ or $\theta$. In making such inferences, it is common practice to restrict attention to procedures that depend on the value of $y$ only through the value of a (possibly vector-valued) statistic having a property known as translation invariance (or location invariance).

Proceeding as in Section 5.2 (in discussing the translation-equivariant estimation of a parametric function of the form $\lambda'\beta$), let $k$ represent a $P$-dimensional column vector of known constants, and define $z = y + Xk$. Then, $z = X\tau + e$, where $\tau = \beta + k$. And $z$ can be regarded as an $N \times 1$ observable random vector that follows a G–M, Aitken, or general linear model that is identical in all respects to the model followed by $y$, except that the role of the parametric vector $\beta$ is played by a vector (represented by $\tau$) having a different interpretation. It can be argued that inferences about $\sigma^2$ or $\theta$, or about functions of $\sigma^2$ or $\theta$, should be made on the basis of a statistical procedure that depends on the value of $y$ only through the value of a (possibly vector-valued) statistic $h(y)$ that, for every $k \in \mathbb{R}^P$ (and for every value of $y$), satisfies the condition $h(y) = h(z)$ or, equivalently, the condition

$$h(y) = h(y + Xk). \quad (7.52)$$

Any statistic $h(y)$ that satisfies condition (7.52) and that does so for every $k \in \mathbb{R}^P$ (and for every value of $y$) is said to be translation invariant.
If the statistic $h(y)$ is translation invariant, then

$$h(y) = h[y + X(-\beta)] = h(y - X\beta) = h(e). \quad (7.53)$$

Thus, the statistical properties of a statistical procedure that depends on the value of $y$ only through the value of a translation-invariant statistic $h(y)$ are completely determined by the distribution of the vector $e$ of residual effects; they do not depend on the vector $\beta$.
Let us now consider condition (7.52) in the special case where $h(y)$ is a scalar-valued statistic of the form $h(y) = y'Ay$, where $A$ is a symmetric matrix of constants. In this special case,

$$h(y) = h(y + Xk) \;\Leftrightarrow\; y'AXk + k'X'Ay + k'X'AXk = 0 \;\Leftrightarrow\; 2y'AXk = -k'X'AXk. \quad (7.54)$$

For condition (7.54) to be satisfied for every $k \in \mathbb{R}^P$ (and for every value of $y$), it is sufficient that $AX = 0$. It is also necessary. To see this, suppose that condition (7.54) is satisfied for every $k \in \mathbb{R}^P$ (and for every value of $y$). Then, upon setting $y = 0$ in condition (7.54), we find that $k'X'AXk = 0$ for every $k \in \mathbb{R}^P$, implying (in light of Corollary 2.13.4) that $X'AX = 0$. Thus, $y'AXk = 0$ for every $k \in \mathbb{R}^P$ (and every value of $y$), implying that every element of $AX$ equals 0 and hence that $AX = 0$.

In summary, we have established that the quadratic form $y'Ay$ (where $A$ is a symmetric matrix of constants) is a translation-invariant statistic if and only if the matrix $A$ of the quadratic form satisfies the condition

$$AX = 0. \quad (7.55)$$
Adopting the same notation and terminology as in Subsection c, consider the concept of translation invariance as applied to the residual vector $\tilde{e}$ and to the residual sum of squares $\tilde{e}'\tilde{e}$. Recall that $\tilde{e}$ is expressible as $\tilde{e} = (I - P_X)y$ and $\tilde{e}'\tilde{e}$ as $\tilde{e}'\tilde{e} = y'(I - P_X)y$. Recall also that $P_X X = X$ and hence that $(I - P_X)X = 0$. Thus, for any $P \times 1$ vector $k$ (and for any value of $y$),

$$(I - P_X)(y + Xk) = (I - P_X)y.$$

And it follows that $\tilde{e}$ is translation invariant. Moreover, $\tilde{e}'\tilde{e}$ is also translation invariant, as is evident upon observing that it depends on $y$ only through the value of $\tilde{e}$ or, alternatively, upon applying condition (7.55) (with $A = I - P_X$); that condition (7.55) is applicable is evident upon recalling that $P_X$ is symmetric and hence that the matrix $I - P_X$ of the quadratic form $y'(I - P_X)y$ is symmetric.
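The invariance of $\tilde{e}$ and $\tilde{e}'\tilde{e}$ is easily checked numerically; here is a brief sketch of ours (an illustration with arbitrary simulated data, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 20, 3
X = rng.standard_normal((N, P))
y = rng.standard_normal(N)
k = rng.standard_normal(P)                 # arbitrary translation y -> y + Xk

M = np.eye(N) - X @ np.linalg.pinv(X.T @ X) @ X.T   # I - P_X, so that (I - P_X)X = 0
assert np.allclose(M @ X, 0)               # condition (7.55) with A = I - P_X

e1, e2 = M @ y, M @ (y + X @ k)
assert np.allclose(e1, e2)                 # the residual vector is translation invariant
assert np.isclose(e1 @ e1, e2 @ e2)        # and so is the residual sum of squares
print("translation invariance verified")
```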
Let us now specialize by supposing that $y$ follows a G–M model, and let us add to the results obtained in Subsection c (on the estimation of $\sigma^2$) by obtaining some results on translation-invariant estimation. Since the residual sum of squares $\tilde{e}'\tilde{e}$ is translation invariant, any estimator of $\sigma^2$ of the form (7.41) is translation invariant. In particular, the unbiased estimator $\hat{\sigma}^2$ is translation invariant (and the Hodges–Lehmann estimator is translation invariant).

A quadratic form $y'Ay$ in the observable random vector $y$ (where $A$ is a symmetric matrix of constants) is an unbiased estimator of $\sigma^2$ and is translation invariant if and only if

$$E(y'Ay) = \sigma^2 \quad\text{and}\quad AX = 0 \quad (7.56)$$
(in which case the quadratic form is referred to as a quadratic unbiased translation-invariant estimator). As an application of formula (7.11), we have that

$$E(y'Ay) = \mathrm{tr}[A(\sigma^2 I)] + \beta'X'AX\beta = \sigma^2\,\mathrm{tr}(A) + \beta'X'AX\beta. \quad (7.57)$$

In light of result (7.57), condition (7.56) is equivalent to the condition

$$\mathrm{tr}(A) = 1 \quad\text{and}\quad AX = 0. \quad (7.58)$$

Thus, the quadratic form $y'Ay$ is a quadratic unbiased translation-invariant estimator of $\sigma^2$ if and only if the matrix $A$ of the quadratic form satisfies condition (7.58).
Clearly, the estimator $\hat{\sigma}^2$ [which is expressible in the form $\hat{\sigma}^2 = y'(I - P_X)y/(N - \mathrm{rank}\,X)$] is a quadratic unbiased translation-invariant estimator of $\sigma^2$. In fact, if the fourth-order moments of the distribution of the vector $e = (e_1, e_2, \dots, e_N)'$ of residual effects are such that (for $i, j, k, m = 1, 2, \dots, N$) $E(e_i e_j e_k e_m)$ satisfies condition (7.38) (as would be the case if the distribution of $e$ were MVN), then the estimator $\hat{\sigma}^2$ has minimum variance (and hence minimum MSE) among all quadratic unbiased translation-invariant estimators of $\sigma^2$, as we now proceed to show.

Suppose that the fourth-order moments of the distribution of the vector $e = (e_1, e_2, \dots, e_N)'$ are such that (for $i, j, k, m = 1, 2, \dots, N$) $E(e_i e_j e_k e_m)$ satisfies condition (7.38). And denote by $\Lambda$ the $N \times N^2$ matrix whose entry for the $j$th row and $km$th column [column $(k-1)N + m$] is $E(e_j e_k e_m)$. Then, for any quadratic unbiased translation-invariant estimator $y'Ay$ of $\sigma^2$ (where $A$ is symmetric), we find [upon applying formula (7.17) and observing that $AX = 0$] that

$$\mathrm{var}(y'Ay) = 4\beta'(AX)'\Lambda\,\mathrm{vec}\,A + 2\sigma^4\,\mathrm{tr}(A^2) + 4\sigma^2\beta'X'AAX\beta = 0 + 2\sigma^4\,\mathrm{tr}(A^2) + 0 = 2\sigma^4\,\mathrm{tr}(A^2). \quad (7.59)$$

Let $R = A - \dfrac{1}{N - \mathrm{rank}\,X}(I - P_X)$, so that

$$A = \frac{1}{N - \mathrm{rank}\,X}(I - P_X) + R. \quad (7.60)$$

Further, observe that (since $P_X$ is symmetric) $R' = R$, that (since $AX = 0$ and $P_X X = X$)

$$RX = AX - \frac{1}{N - \mathrm{rank}\,X}(I - P_X)X = 0 - 0 = 0,$$

and that

$$X'R = X'R' = (RX)' = 0' = 0.$$

Accordingly, upon substituting expression (7.60) for $A$ (and recalling that $P_X$ is idempotent), we find that

$$A^2 = \frac{1}{(N - \mathrm{rank}\,X)^2}(I - P_X) + \frac{2}{N - \mathrm{rank}\,X}\,R + R'R. \quad (7.61)$$

Moreover, because $\mathrm{tr}(A) = 1$, we have [in light of result (7.35)] that

$$\mathrm{tr}(R) = \mathrm{tr}(A) - \frac{1}{N - \mathrm{rank}\,X}\,\mathrm{tr}(I - P_X) = 1 - 1 = 0. \quad (7.62)$$

And upon substituting expression (7.61) for $A^2$ in expression (7.59) and making use of results (7.35) and (7.62), we find that

$$\mathrm{var}(y'Ay) = 2\sigma^4\Big[\frac{1}{(N - \mathrm{rank}\,X)^2}\,\mathrm{tr}(I - P_X) + \frac{2}{N - \mathrm{rank}\,X}\,\mathrm{tr}(R) + \mathrm{tr}(R'R)\Big] = 2\sigma^4\Big[\frac{1}{N - \mathrm{rank}\,X} + \mathrm{tr}(R'R)\Big]. \quad (7.63)$$

Finally, upon observing that $\mathrm{tr}(R'R) = \sum_{i,j} r_{ij}^2$, where (for $i, j = 1, 2, \dots, N$) $r_{ij}$ is the $ij$th element of $R$, we conclude that $\mathrm{var}(y'Ay)$ attains a minimum value of $2\sigma^4/(N - \mathrm{rank}\,X)$ and does so uniquely when $R = 0$ or, equivalently, when $A = \dfrac{1}{N - \mathrm{rank}\,X}(I - P_X)$ (i.e., when $y'Ay = \hat{\sigma}^2$).
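The key relation (7.59) and the resulting lower bound can be checked numerically. The following sketch is ours (the construction $A = MSM$, with $M = I - P_X$ and $S$ an arbitrary symmetric matrix, is just one convenient way to generate a matrix satisfying condition (7.58)):

```python
import numpy as np

rng = np.random.default_rng(3)
N, P, sigma2 = 15, 2, 1.5
X = rng.standard_normal((N, P))
beta = rng.standard_normal(P)
M = np.eye(N) - X @ np.linalg.pinv(X.T @ X) @ X.T    # I - P_X
nu = N - np.linalg.matrix_rank(X)

# A generic quadratic unbiased translation-invariant estimator: A = M S M is
# symmetric with AX = 0; rescale so that tr(A) = 1, per condition (7.58).
S = rng.standard_normal((N, N))
S = S + S.T + N * np.eye(N)
A = M @ S @ M
A = A / np.trace(A)
assert np.allclose(A @ X, 0) and np.isclose(np.trace(A), 1.0)

Y = (X @ beta)[:, None] + np.sqrt(sigma2) * rng.standard_normal((N, 200_000))
var_mc = np.var(np.einsum('ir,ir->r', Y, A @ Y))
print("Monte Carlo var(y'Ay)  :", var_mc)
print("formula 2 s^4 tr(A^2)  :", 2 * sigma2**2 * np.trace(A @ A))   # result (7.59)
print("lower bound 2 s^4 / nu :", 2 * sigma2**2 / nu, "(attained at A = (I-P_X)/nu)")
```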
Now, suppose that $W'y$ and $y'y$ form a complete sufficient statistic. Then, it follows from result (8.2) that $X'y$ and $y'(I - P_X)y$ form a sufficient statistic. Moreover, if $E\{g[X'y,\,y'(I - P_X)y]\} = 0$, then $E[g^*(W'y,\,y'y)] = 0$, implying that $\Pr[g^*(W'y,\,y'y) = 0] = 1$ and hence that $\Pr\{g[X'y,\,y'(I - P_X)y] = 0\} = 1$. Thus, $X'y$ and $y'(I - P_X)y$ form a complete statistic.

Conversely, suppose that $X'y$ and $y'(I - P_X)y$ form a complete sufficient statistic. Then, it follows from result (8.1) that $W'y$ and $y'y$ form a sufficient statistic. Moreover, if $E[h(W'y,\,y'y)] = 0$, then $E\{h^*[X'y,\,y'(I - P_X)y]\} = 0$, implying that $\Pr\{h^*[X'y,\,y'(I - P_X)y] = 0\} = 1$ and hence that $\Pr[h(W'y,\,y'y) = 0] = 1$. Thus, $W'y$ and $y'y$ form a complete statistic.
And the log-likelihood function, say $\ell(\beta, \sigma; y)$ [which, by definition, is the function obtained by equating $\ell(\beta, \sigma; y)$ to the logarithm of the likelihood function, i.e., to $\log L(\beta, \sigma; y)$], is expressible as

$$\ell(\beta, \sigma; y) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta). \quad (9.2)$$

Now, consider the maximization of the likelihood function $L(\beta, \sigma; y)$ or, equivalently, of the log-likelihood function $\ell(\beta, \sigma; y)$. Irrespective of the value of $\sigma$, $\ell(\beta, \sigma; y)$ attains its maximum value with respect to $\beta$ at any value of $\beta$ that minimizes $(y - X\beta)'(y - X\beta)$. Thus, in light of the results of Section 5.4b (on least squares minimization), $\ell(\beta, \sigma; y)$ attains its maximum value with respect to $\beta$ at a point $\tilde{\beta}$ if and only if

$$X'X\tilde{\beta} = X'y, \quad (9.3)$$

that is, if and only if $\tilde{\beta}$ is a solution to the normal equations.

Letting $\tilde{\beta}$ represent any $P \times 1$ vector that satisfies condition (9.3), it remains to consider the maximization of $\ell(\tilde{\beta}, \sigma; y)$ with respect to $\sigma$. In that regard, take $g(\sigma)$ to be a function of $\sigma$ of the form

$$g(\sigma) = a - \frac{K}{2}\log\sigma^2 - \frac{c}{2\sigma^2}, \quad (9.4)$$

where $a$ is a constant, $c$ is a (strictly) positive constant, and $K$ is a (strictly) positive integer. And observe that, unless $y - X\tilde{\beta} = 0$ (which is an event of probability 0), $\ell(\tilde{\beta}, \sigma; y)$ is of the form (9.4); in the special case where $a = -(N/2)\log(2\pi)$, $c = (y - X\tilde{\beta})'(y - X\tilde{\beta})$, and $K = N$, $g(\sigma) = \ell(\tilde{\beta}, \sigma; y)$. Clearly,

$$\frac{dg(\sigma)}{d\sigma} = -\frac{K}{\sigma} + \frac{c}{\sigma^3} = -\frac{K}{\sigma^3}\Big(\sigma^2 - \frac{c}{K}\Big).$$

Thus, $dg(\sigma)/d\sigma > 0$ if $\sigma^2 < c/K$, $dg(\sigma)/d\sigma = 0$ if $\sigma^2 = c/K$, and $dg(\sigma)/d\sigma < 0$ if $\sigma^2 > c/K$, so that $g(\sigma)$ is an increasing function of $\sigma$ for $\sigma < \sqrt{c/K}$, is a decreasing function for $\sigma > \sqrt{c/K}$, and attains its maximum value at $\sigma = \sqrt{c/K}$.

Unless the model is of full rank (i.e., unless $\mathrm{rank}\,X = P$), there are an infinite number of solutions to the normal equations and hence an infinite number of values of $\beta$ that maximize $\ell(\beta, \sigma; y)$. However, the value of an estimable linear combination $\lambda'\beta$ of the elements of $\beta$ is the same for every value of $\beta$ that maximizes $\ell(\beta, \sigma; y)$—recall (from the results of Section 5.4 on the method of least squares) that $\lambda'\tilde{b}$ has the same value for every solution $\tilde{b}$ to the normal equations.

In effect, we have established that the least squares estimator of any estimable linear combination of the elements of $\beta$ is also the ML estimator. Moreover, since condition (9.3) can be satisfied by taking $\tilde{\beta} = (X'X)^- X'y$, the ML estimator of $\sigma^2$ (the square root of which is the ML estimator of $\sigma$) is the estimator

$$\frac{\tilde{e}'\tilde{e}}{N}, \quad (9.5)$$

where $\tilde{e} = y - P_X y$. Like the unbiased estimator $\tilde{e}'\tilde{e}/(N - \mathrm{rank}\,X)$ and the Hodges–Lehmann estimator $\tilde{e}'\tilde{e}/[N - \mathrm{rank}(X) + 2]$, the ML estimator of $\sigma^2$ is of the form (7.41).
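The agreement between the closed-form ML estimator (9.5) and a direct numerical maximization of the log-likelihood (9.2) is easy to confirm; the following sketch is ours (SciPy's one-dimensional optimizer and the simulated data are assumptions of the example):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
N, P = 40, 3
X = rng.standard_normal((N, P))
y = X @ rng.standard_normal(P) + 1.3 * rng.standard_normal(N)

beta_t = np.linalg.pinv(X.T @ X) @ (X.T @ y)   # a solution to the normal equations (9.3)
rss = np.sum((y - X @ beta_t) ** 2)

def neg_loglik(log_s2):                         # minus (9.2), profiled over beta
    s2 = np.exp(log_s2)
    return 0.5 * (N * np.log(2 * np.pi) + N * np.log(s2) + rss / s2)

opt = minimize_scalar(neg_loglik)               # numerical maximization over sigma^2
print("numerical ML of sigma^2:", np.exp(opt.x))
print("closed form rss / N    :", rss / N)      # estimator (9.5)
```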
A result on minimization and some results on matrices. As a preliminary to considering ML estimation as applied to a general linear model (or an Aitken model), it is convenient to establish the following result on minimization.

Theorem 5.9.1. Let $b$ represent a $P \times 1$ vector of (unconstrained) variables, and define $f(b) = (y - Xb)'W(y - Xb)$, where $W$ is an $N \times N$ symmetric nonnegative definite matrix, $X$ is an $N \times P$ matrix, and $y$ is an $N \times 1$ vector. Then, the linear system $X'WXb = X'Wy$ (in $b$) is consistent. Further, $f(b)$ attains its minimum value at a point $\tilde{b}$ if and only if $\tilde{b}$ is a solution to $X'WXb = X'Wy$, in which case $f(\tilde{b}) = y'Wy - \tilde{b}'X'Wy$.
Proof. Let $R$ represent a matrix such that $W = R'R$—the existence of such a matrix is guaranteed by Corollary 2.13.25. Then, upon letting $t = Ry$ and $U = RX$, $f(b)$ is expressible as $f(b) = (t - Ub)'(t - Ub)$. Moreover, it follows from the results of Section 5.4b (on least squares minimization) that the linear system $U'Ub = U't$ (in $b$) is consistent and that $(t - Ub)'(t - Ub)$ attains its minimum value at a point $\tilde{b}$ if and only if $\tilde{b}$ is a solution to $U'Ub = U't$, in which case

$$(t - U\tilde{b})'(t - U\tilde{b}) = t't - \tilde{b}'U't.$$

It remains only to observe that $U'U = X'WX$, that $U't = X'Wy$, and that $t't = y'Wy$. Q.E.D.
In addition to Theorem 5.9.1, it is convenient to have at our disposal the following lemma, which can be regarded as a generalization of Lemma 2.12.1.

Lemma 5.9.2. For any $N \times P$ matrix $X$ and any $N \times N$ symmetric nonnegative definite matrix $W$,

$$R(X'WX) = R(WX), \quad C(X'WX) = C(X'W), \quad\text{and}\quad \mathrm{rank}(X'WX) = \mathrm{rank}(WX).$$

Proof. In light of Corollary 2.13.25, $W = R'R$ for some matrix $R$. And upon observing that $X'WX = (RX)'RX$ and making use of Corollary 2.4.4 and Lemma 2.12.1, we find that

$$R(WX) = R(R'RX) \subset R(RX) = R[(RX)'RX] = R(X'WX) \subset R(WX)$$

and hence that $R(X'WX) = R(WX)$. Moreover, that $R(X'WX) = R(WX)$ implies that $\mathrm{rank}(X'WX) = \mathrm{rank}(WX)$ and, in light of Lemma 2.4.6, that $C(X'WX) = C(X'W)$. Q.E.D.

In the special case of Lemma 5.9.2 where $W$ is a (symmetric) positive definite matrix (and hence is nonsingular), it follows from Corollary 2.5.6 that $R(WX) = R(X)$, $C(X'W) = C(X')$, and $\mathrm{rank}(WX) = \mathrm{rank}(X)$. Thus, we have the following corollary, which (like Lemma 5.9.2 itself) can be regarded as a generalization of Lemma 2.12.1.

Corollary 5.9.3. For any $N \times P$ matrix $X$ and any $N \times N$ symmetric positive definite matrix $W$,

$$R(X'WX) = R(X), \quad C(X'WX) = C(X'), \quad\text{and}\quad \mathrm{rank}(X'WX) = \mathrm{rank}(X).$$

As an additional corollary of Lemma 5.9.2, we have the following result.

Corollary 5.9.4. For any $N \times P$ matrix $X$ and any $N \times N$ symmetric nonnegative definite matrix $W$,

$$WX(X'WX)^- X'WX = WX \quad\text{and}\quad X'WX(X'WX)^- X'W = X'W.$$

Proof. In light of Lemmas 5.9.2 and 2.4.3, $WX = L'X'WX$ for some $P \times N$ matrix $L$. Thus,

$$WX(X'WX)^- X'WX = L'X'WX(X'WX)^- X'WX = L'X'WX = WX$$

and [since $X'W = (WX)' = (L'X'WX)' = X'WXL$]

$$X'WX(X'WX)^- X'W = X'WX(X'WX)^- X'WXL = X'WXL = X'W.$$

Q.E.D.
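Theorem 5.9.1's proof device—factoring $W = R'R$ so that weighted minimization reduces to ordinary least squares—is also a practical computational recipe. Here is a minimal Python sketch of ours (the Cholesky factorization is one convenient choice of $R$ when $W$ is positive definite; the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
N, P = 25, 3
X = rng.standard_normal((N, P))
y = rng.standard_normal(N)
G = rng.standard_normal((N, N))
W = G.T @ G + np.eye(N)                    # a symmetric positive definite weight matrix

# Direct solution of X'WXb = X'Wy
b1 = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Theorem 5.9.1's device: factor W = R'R, set t = Ry, U = RX, solve OLS U'Ub = U't
R = np.linalg.cholesky(W).T                # upper-triangular R with W = R'R
t, U = R @ y, R @ X
b2, *_ = np.linalg.lstsq(U, t, rcond=None)
assert np.allclose(b1, b2)

# Minimum value: f(b~) = y'Wy - b~'X'Wy
f_min = (y - X @ b1) @ W @ (y - X @ b1)
assert np.isclose(f_min, y @ W @ y - b1 @ (X.T @ W @ y))
print("GLS solution:", b1)
```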
General linear model. Suppose that $y$ is an $N \times 1$ observable random vector that follows a general linear model. Suppose further that the distribution of the vector $e$ of residual effects is MVN, so that $y \sim N[X\beta, V(\theta)]$. And suppose that $V(\theta)$ is of rank $N$ (for every $\theta \in \Theta$).

Let us consider the ML estimation of functions of the model's parameters (which consist of the elements $\beta_1, \beta_2, \dots, \beta_P$ of the vector $\beta$ and the elements $\theta_1, \theta_2, \dots, \theta_T$ of the vector $\theta$). Let $f(\cdot\,; \beta, \theta)$ represent the pdf of the distribution of $y$, and denote by $y$ the observed value of $y$. Then, the likelihood function is the function, say $L(\beta, \theta; y)$, of $\beta$ and $\theta$ defined (for $\beta \in \mathbb{R}^P$ and $\theta \in \Theta$) by $L(\beta, \theta; y) = f(y; \beta, \theta)$. Accordingly,

$$L(\beta, \theta; y) = \frac{1}{(2\pi)^{N/2}\,|V(\theta)|^{1/2}}\exp\Big\{-\frac{1}{2}(y - X\beta)'[V(\theta)]^{-1}(y - X\beta)\Big\}. \quad (9.6)$$

And the log-likelihood function, say $\ell(\beta, \theta; y)$, is expressible as
$$\ell(\beta, \theta; y) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\log|V(\theta)| - \frac{1}{2}(y - X\beta)'[V(\theta)]^{-1}(y - X\beta). \quad (9.7)$$
Maximum likelihood estimates are obtained by maximizing $L(\beta, \theta; y)$ or, equivalently, $\ell(\beta, \theta; y)$ with respect to $\beta$ and $\theta$: if $L(\beta, \theta; y)$ or $\ell(\beta, \theta; y)$ attains its maximum value at values $\tilde{\beta}$ and $\tilde{\theta}$ (of $\beta$ and $\theta$, respectively), then an ML estimate of a function, say $h(\beta, \theta)$, of $\beta$ and/or $\theta$ is provided by the quantity $h(\tilde{\beta}, \tilde{\theta})$ obtained by substituting $\tilde{\beta}$ and $\tilde{\theta}$ for $\beta$ and $\theta$. In considering the maximization of the likelihood or log-likelihood function, it is helpful to begin by regarding the value of $\theta$ as "fixed" and considering the maximization of the likelihood or log-likelihood function with respect to $\beta$ alone.

Observe [in light of result (4.5.5) and Corollary 2.13.12] that (regardless of the value of $\theta$) $[V(\theta)]^{-1}$ is a symmetric positive definite matrix. Accordingly, it follows from Theorem 5.9.1 that for any particular value of $\theta$, the linear system

$$X'[V(\theta)]^{-1}Xb = X'[V(\theta)]^{-1}y \quad (9.8)$$

(in the $P \times 1$ vector $b$) is consistent. Further, $(y - X\beta)'[V(\theta)]^{-1}(y - X\beta)$ attains its minimum value, or equivalently $-(1/2)(y - X\beta)'[V(\theta)]^{-1}(y - X\beta)$ attains its maximum value, at a value $\tilde{\beta}(\theta)$ of $\beta$ if and only if $\tilde{\beta}(\theta)$ is a solution to linear system (9.8), that is, if and only if

$$X'[V(\theta)]^{-1}X\tilde{\beta}(\theta) = X'[V(\theta)]^{-1}y, \quad (9.9)$$

in which case

$$\max_{\beta \in \mathbb{R}^P}\,-\frac{1}{2}(y - X\beta)'[V(\theta)]^{-1}(y - X\beta) = -\frac{1}{2}[y - X\tilde{\beta}(\theta)]'[V(\theta)]^{-1}[y - X\tilde{\beta}(\theta)] = -\frac{1}{2}\big\{y'[V(\theta)]^{-1}y - [\tilde{\beta}(\theta)]'X'[V(\theta)]^{-1}y\big\}. \quad (9.10)$$
Now, suppose that (for $\theta \in \Theta$) $\tilde{\beta}(\theta)$ satisfies condition (9.9). Then, for any matrix $A$ such that $R(A) \subset R(X)$, the value of $A\tilde{\beta}(\theta)$ (at any particular value of $\theta$) does not depend on the choice of $\tilde{\beta}(\theta)$, as is evident upon observing (in light of Corollary 5.9.3) that $A = T(\theta)X'[V(\theta)]^{-1}X$ for some matrix-valued function $T(\theta)$ of $\theta$ and hence that

$$A\tilde{\beta}(\theta) = T(\theta)X'[V(\theta)]^{-1}X\tilde{\beta}(\theta) = T(\theta)X'[V(\theta)]^{-1}y.$$

Thus, $X\tilde{\beta}(\theta)$ does not depend on the choice of $\tilde{\beta}(\theta)$, and for any estimable linear combination $\lambda'\beta$ of the elements of $\beta$, $\lambda'\tilde{\beta}(\theta)$ does not depend on the choice of $\tilde{\beta}(\theta)$. Among the possible choices for $\tilde{\beta}(\theta)$ are the vector $\{X'[V(\theta)]^{-1}X\}^- X'[V(\theta)]^{-1}y$ and the vector $(\{X'[V(\theta)]^{-1}X\}^-)'X'[V(\theta)]^{-1}y$.

Define

$$L^*(\theta; y) = L[\tilde{\beta}(\theta), \theta; y] \quad\text{and}\quad \ell^*(\theta; y) = \ell[\tilde{\beta}(\theta), \theta; y]\;[= \log L^*(\theta; y)]. \quad (9.11)$$

Then,

$$L^*(\theta; y) = \max_{\beta \in \mathbb{R}^P} L(\beta, \theta; y) \quad\text{and}\quad \ell^*(\theta; y) = \max_{\beta \in \mathbb{R}^P} \ell(\beta, \theta; y), \quad (9.12)$$

so that $L^*(\theta; y)$ is a profile likelihood function and $\ell^*(\theta; y)$ is a profile log-likelihood function—refer, e.g., to Severini (2000, sec. 4.6) for the definition of a profile likelihood or profile log-likelihood function. Moreover,

$$\ell^*(\theta; y) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\log|V(\theta)| - \frac{1}{2}[y - X\tilde{\beta}(\theta)]'[V(\theta)]^{-1}[y - X\tilde{\beta}(\theta)] \quad (9.13)$$

$$= -\frac{N}{2}\log(2\pi) - \frac{1}{2}\log|V(\theta)| - \frac{1}{2}\big\{y'[V(\theta)]^{-1}y - [\tilde{\beta}(\theta)]'X'[V(\theta)]^{-1}y\big\} \quad (9.14)$$

$$= -\frac{N}{2}\log(2\pi) - \frac{1}{2}\log|V(\theta)| - \frac{1}{2}y'\big\{[V(\theta)]^{-1} - [V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^- X'[V(\theta)]^{-1}\big\}y. \quad (9.15)$$
Result (9.12) is significant from a computational standpoint. It "reduces" the problem of maximizing $L(\beta, \theta; y)$ or $\ell(\beta, \theta; y)$ with respect to $\beta$ and $\theta$ to that of maximizing $L^*(\theta; y)$ or $\ell^*(\theta; y)$ with respect to $\theta$ alone. Values of $\beta$ and $\theta$ at which $L(\beta, \theta; y)$ or $\ell(\beta, \theta; y)$ attains its maximum value can be obtained by taking the value of $\theta$ to be a value, say $\tilde{\theta}$, at which $L^*(\theta; y)$ or $\ell^*(\theta; y)$ attains its maximum value and by then taking the value of $\beta$ to be a solution $\tilde{\beta}(\tilde{\theta})$ to the linear system

$$X'[V(\tilde{\theta})]^{-1}Xb = X'[V(\tilde{\theta})]^{-1}y$$

(in the $P \times 1$ vector $b$).
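The reduction to a $T$-dimensional search is easy to see in code. In the following sketch of ours, the one-parameter covariance $V(\theta) = I + \theta ZZ'$ and the grid search are arbitrary illustrative choices, not the book's prescriptions:

```python
import numpy as np

rng = np.random.default_rng(6)
N, P = 30, 2
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
Z = rng.standard_normal((N, 4))
y = X @ np.array([1.0, 2.0]) + Z @ rng.standard_normal(4) + rng.standard_normal(N)

def profile_loglik(theta):
    """Profile log-likelihood (9.13) for the illustrative model V(theta) = I + theta ZZ'."""
    V = np.eye(N) + theta * Z @ Z.T
    Vi = np.linalg.inv(V)
    beta_t = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)   # a solution of (9.8)
    r = y - X @ beta_t
    return -0.5 * (N * np.log(2 * np.pi) + np.linalg.slogdet(V)[1] + r @ Vi @ r)

grid = np.linspace(0.01, 5.0, 500)                          # crude one-dimensional search
vals = [profile_loglik(t) for t in grid]
print("ML estimate of theta ~", grid[int(np.argmax(vals))])
```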
In general, a solution to the problem of maximizing $\ell^*(\theta; y)$ is not obtainable in "closed form"; rather, the maximization must be accomplished numerically via an iterative procedure—the discussion of such procedures is deferred until later in the book. Nevertheless, there are special cases where the maximization of $\ell^*(\theta; y)$, and hence that of $\ell(\beta, \theta; y)$, can be accomplished without resort to indirect (iterative) numerical methods. Indirect numerical methods are not needed in the special case where $y$ follows a G–M model; that special case was discussed in Part 1 of the present subsection. More generally, indirect numerical methods are not needed in the special case where $y$ follows an Aitken model, as is to be demonstrated in what follows.
Aitken model. Suppose that $y$ follows an Aitken model (and that $H$ is nonsingular and that the distribution of $e$ is MVN). And regard the Aitken model as the special case of the general linear model where $T = 1$ (i.e., where $\theta$ has only 1 element), where $\theta_1 = \sigma$, and where $V(\theta) = \sigma^2 H$. In that special case, linear system (9.8) is equivalent to the linear system

$$X'H^{-1}Xb = X'H^{-1}y \quad (9.16)$$

—the equivalence is in the sense that both linear systems have the same set of solutions. The equations comprising linear system (9.16) are known as the Aitken equations. When $H = I$ (i.e., when the model is a G–M model), the linear system (9.16) of Aitken equations simplifies to the linear system $X'Xb = X'y$ of normal equations.

In this setting, we are free to choose the vector $\tilde{\beta}(\theta)$ in such a way that it has the same value for every value of $\theta$. Accordingly, for every value of $\theta$, take $\tilde{\beta}(\theta)$ to be $\tilde{\beta}$, where $\tilde{\beta}$ is any solution to the Aitken equations. Then, writing $\sigma$ for $\theta$, the profile log-likelihood function $\ell^*(\sigma; y)$ is expressible as

$$\ell^*(\sigma; y) = \ell(\tilde{\beta}, \sigma; y) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2}\log|H| - \frac{1}{2\sigma^2}(y - X\tilde{\beta})'H^{-1}(y - X\tilde{\beta}).$$

Unless $y - X\tilde{\beta} = 0$ (which is an event of probability 0), $\ell^*(\sigma; y)$ is of the form of the function $g(\sigma)$ defined (in Part 1 of the present subsection) by equality (9.4); upon setting $a = -(N/2)\log(2\pi) - (1/2)\log|H|$, $c = (y - X\tilde{\beta})'H^{-1}(y - X\tilde{\beta})$, and $K = N$, $g(\sigma) = \ell^*(\sigma; y)$. Thus, it follows from the results of Part 1 that $\ell^*(\sigma; y)$ attains its maximum value when $\sigma^2$ equals

$$\frac{(y - X\tilde{\beta})'H^{-1}(y - X\tilde{\beta})}{N}. \quad (9.17)$$

And we conclude that $\ell(\beta, \sigma; y)$ attains its maximum value when $\beta$ equals $\tilde{\beta}$ and when $\sigma^2$ equals the quantity (9.17). This conclusion serves to generalize the conclusion reached in Part 1, where it was determined that in the special case of the G–M model, the log-likelihood function attains its maximum value when $\beta$ equals a solution, say $\tilde{\beta}$, to the normal equations (i.e., to the linear system $X'Xb = X'y$) and when $\sigma^2$ equals $(y - X\tilde{\beta})'(y - X\tilde{\beta})/N$.
And let $\ell(\beta, \theta; y)$ represent the log-likelihood function [where $y$ is the observed value of $y$ and where $V(\theta)$ is assumed to be of rank $N$ (for every $\theta \in \Theta$)]. This function has the representation (9.7).

Suppose that $\tilde{\beta}$ and $\tilde{\theta}$ are values of $\beta$ and $\theta$ at which $\ell(\beta, \theta; y)$ attains its maximum value. And observe that $\tilde{\theta}$ is a value of $\theta$ at which $\ell(\tilde{\beta}, \theta; y)$ attains its maximum value. There is an implication that $\tilde{\theta}$ is identical to the value of $\theta$ that would be obtained from maximizing the likelihood function under a supposition that $\beta$ is a known ($P \times 1$) vector (rather than a vector of unknown parameters) and under the further supposition that $\beta$ equals $\tilde{\beta}$ (or, perhaps more precisely, $X\beta$ equals $X\tilde{\beta}$). Thus, in a certain sense, maximum likelihood estimators of functions of $\theta$ fail to account for the estimation of $\beta$. This failure can be disconcerting and can have undesirable consequences.

It is informative to consider the manifestation of this phenomenon in the relatively simple special case of a G–M model. In that special case, the use of maximum likelihood estimation results in $\sigma^2$ being estimated by the quantity (9.5), in which the residual sum of squares is divided by $N$ rather than by $N - \mathrm{rank}\,X$ as in the case of the unbiased estimator [or by $N - \mathrm{rank}(X) + 2$ as in the case of the Hodges–Lehmann estimator].
The failure of ML estimators of functions of $\theta$ to account for the estimation of $\beta$ has led to the widespread use of a variant of maximum likelihood that has come to be known by the acronym REML (which is regarded by some as standing for restricted maximum likelihood and by others as standing for residual maximum likelihood). In REML, inferences about functions of $\theta$ are based on the likelihood function associated with a vector of what are sometimes called error contrasts.

An error contrast is a linear unbiased estimator of 0, that is, a linear combination, say $r'y$, of the elements of $y$ such that $E(r'y) = 0$ or, equivalently, such that $X'r = 0$. Thus, $r'y$ is an error contrast if and only if $r \in N(X')$. Moreover, in light of Lemma 2.11.5,

$$\dim[N(X')] = N - \mathrm{rank}(X') = N - \mathrm{rank}\,X.$$

And it follows that there exists a set of $N - \mathrm{rank}\,X$ linearly independent error contrasts and that no set of error contrasts contains more than $N - \mathrm{rank}\,X$ linearly independent error contrasts.
Accordingly, let $R$ represent an $N \times (N - \mathrm{rank}\,X)$ matrix (of constants) of full column rank $N - \mathrm{rank}\,X$ such that $X'R = 0$ [or, equivalently, an $N \times (N - \mathrm{rank}\,X)$ matrix whose columns are linearly independent members of the null space $N(X')$ of $X'$]. And take $z$ to be the $(N - \mathrm{rank}\,X) \times 1$ vector defined by $z = R'y$ (so that the elements of $z$ are $N - \mathrm{rank}\,X$ linearly independent error contrasts). Then, $z \sim N[0, R'V(\theta)R]$, and [in light of the assumption that $V(\theta)$ is nonsingular and in light of Theorem 2.13.10] $R'V(\theta)R$ is nonsingular. Further, let $f(\cdot\,; \theta)$ represent the pdf of the distribution of $z$, and take $L(\theta; R'y)$ to be the function of $\theta$ defined (for $\theta \in \Theta$) by $L(\theta; R'y) = f(R'y; \theta)$. The function $L(\theta; R'y)$ is a likelihood function; it is the likelihood function obtained by regarding the observed value of $z$ as the data vector. In REML, the inferences about functions of $\theta$ are based on the likelihood function $L(\theta; R'y)$ [or on a likelihood function that is equivalent to $L(\theta; R'y)$ in the sense that it differs from $L(\theta; R'y)$ by no more than a multiplicative constant].

It is worth noting that the use of REML results in the same inferences regardless of the choice of the matrix $R$. To see that REML has this property, let $R_1$ and $R_2$ represent any two choices for $R$, that is, take $R_1$ and $R_2$ to be any two $N \times (N - \mathrm{rank}\,X)$ matrices of full column rank such that $X'R_1 = X'R_2 = 0$. Further, define $z_1 = R_1'y$ and $z_2 = R_2'y$. And let $f_1(\cdot\,; \theta)$ represent the pdf of the distribution of $z_1$ and $f_2(\cdot\,; \theta)$ the pdf of the distribution of $z_2$; and take $L_1(\theta; R_1'y)$ and $L_2(\theta; R_2'y)$ to be the functions of $\theta$ defined by $L_1(\theta; R_1'y) = f_1(R_1'y; \theta)$ and $L_2(\theta; R_2'y) = f_2(R_2'y; \theta)$.

There exists an $(N - \mathrm{rank}\,X) \times (N - \mathrm{rank}\,X)$ matrix $A$ such that $R_2 = R_1 A$, as is evident upon observing that the columns of each of the two matrices $R_1$ and $R_2$ form a basis for the $(N - \mathrm{rank}\,X)$-dimensional linear space $N(X')$; necessarily, $A$ is nonsingular. Moreover, the pdf's of the distributions of $z_1$ and $z_2$ are such that (for every value of $z_1$)

$$f_1(z_1) = |\det A|\,f_2(A'z_1)$$
—this relationship can be verified directly from formula (3.5.32) for the pdf of an MVN distribution or simply by observing that $z_2 = A'z_1$ and making use of standard results (e.g., Bickel and Doksum 2001, sec. B.2) on a change of variables. Thus, for every $\theta \in \Theta$,

$$L_1(\theta; R_1'y) = |\det A|\,L_2(\theta; R_2'y).$$

We conclude that the two likelihood functions $L_1(\theta; R_1'y)$ and $L_2(\theta; R_2'y)$ differ from each other by no more than a multiplicative constant and hence that they are equivalent.
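This invariance is simple to observe numerically. In the sketch below (ours; the covariance family, data, and use of SciPy's null_space and multivariate_normal are assumptions of the illustration), two different choices of $R$ give REML log-likelihoods that differ by the same constant at every value of $\theta$:

```python
import numpy as np
from scipy.linalg import null_space
from scipy.stats import multivariate_normal

rng = np.random.default_rng(7)
N, P = 12, 3
X = rng.standard_normal((N, P))
y = rng.standard_normal(N)

B = null_space(X.T)                        # orthonormal basis of N(X'): N x (N - rank X)
R1 = B                                     # one legitimate choice of R
R2 = B @ rng.standard_normal((B.shape[1], B.shape[1]))   # another; R2 = R1 A

def reml_loglik(R, V):
    z = R.T @ y                            # error contrasts, z ~ N(0, R'V(theta)R)
    return multivariate_normal(np.zeros(len(z)), R.T @ V @ R).logpdf(z)

Z = rng.standard_normal((N, 2))
for theta in (0.5, 2.0):
    V = np.eye(N) + theta * Z @ Z.T
    print(reml_loglik(R1, V) - reml_loglik(R2, V))   # same constant (log|det A|) for each theta
```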
The $(N - \mathrm{rank}\,X)$-dimensional vector $z = R'y$ of error contrasts is translation invariant, as is evident upon observing that for every $P \times 1$ vector $k$ (and every value of $y$),

$$R'(y + Xk) = R'y + (X'R)'k = R'y.$$

In fact, $z$ is a maximal invariant: in the present context, a (possibly vector-valued) statistic $h(y)$ is said to be a maximal invariant if it is invariant and if corresponding to each pair of values $y_1$ and $y_2$ of $y$ such that $h(y_2) = h(y_1)$, there exists a $P \times 1$ vector $k$ such that $y_2 = y_1 + Xk$—refer, e.g., to Lehmann and Romano (2005b, sec. 6.2) for a general definition (of a maximal invariant).

To confirm that $z$ is a maximal invariant, take $y_1$ and $y_2$ to be any pair of values of $y$ such that $R'y_2 = R'y_1$. And observe that $y_2 = y_1 + (y_2 - y_1)$ and that $y_2 - y_1 \in N(R')$. Observe also (in light of Lemma 2.11.5) that $\dim[N(R')] = \mathrm{rank}\,X$. Moreover, $R'X = (X'R)' = 0$, implying (in light of Lemma 2.4.2) that $C(X) \subset N(R')$ and hence (in light of Theorem 2.4.10) that $C(X) = N(R')$. Thus, the linear space $N(R')$ is spanned by the columns of $X$, leading to the conclusion that there exists a $P \times 1$ vector $k$ such that $y_2 - y_1 = Xk$ and hence such that $y_2 = y_1 + Xk$.
That $z = R'y$ is a maximal invariant is of interest because any maximal invariant, say $h(y)$, has (in the present context) the following property: a (possibly vector-valued) statistic, say $g(y)$, is translation invariant if and only if $g(y)$ depends on the value of $y$ only through $h(y)$, that is, if and only if there exists a function $s(\cdot)$ such that $g(y) = s[h(y)]$ (for every value of $y$). To see that $h(y)$ has this property, observe that if [for some function $s(\cdot)$] $g(y) = s[h(y)]$ (for every value of $y$), then (for every $P \times 1$ vector $k$)

$$g(y + Xk) = s[h(y + Xk)] = s[h(y)] = g(y),$$

so that $g(y)$ is translation invariant. Conversely, if $g(y)$ is translation invariant and if $y_1$ and $y_2$ are any pair of values of $y$ such that $h(y_2) = h(y_1)$, then $y_2 = y_1 + Xk$ for some vector $k$ and, consequently, $g(y_2) = g(y_1 + Xk) = g(y_1)$.
The vector $z$ consists of $N - \mathrm{rank}\,X$ linearly independent linear combinations of the elements of the $N \times 1$ vector $y$. Suppose that we introduce an additional $\mathrm{rank}\,X$ linear combinations in the form of the $(\mathrm{rank}\,X) \times 1$ vector $u$ defined by $u = X_*'y$, where $X_*$ is any $N \times (\mathrm{rank}\,X)$ matrix (of constants) whose columns are linearly independent columns of $X$ or, more generally, whose columns form a basis for $C(X)$. Then,

$$\begin{pmatrix} u \\ z \end{pmatrix} = (X_*, R)'y.$$

And (since $X_* = XA$ for some matrix $A$) the columns of $X_*$ form a basis for $C(X)$ and the columns of $R$ a basis for $N(X')$, so that the $N \times N$ matrix $(X_*, R)$ is nonsingular. Accordingly, the likelihood function that would result from regarding the observed value $(X_*, R)'y$ of $\begin{pmatrix} u \\ z \end{pmatrix}$ as the data vector differs by no more than a multiplicative constant from that obtained by regarding the observed value $y$ of $y$ as the data vector (as can be readily verified). When viewed in this context, the likelihood function that is employed in REML can be regarded as what is known as a marginal likelihood—refer, e.g., to Pawitan (2001, sec. 10.3) for the definition of a marginal likelihood.
The vector $\tilde{e} = (I - P_X)y$ [where $P_X = X(X'X)^- X'$] is the vector of (least squares) residuals. Observe [in light of Theorem 2.12.2 and Lemma 2.8.4] that $X'(I - P_X) = 0$ and that

$$\mathrm{rank}(I - P_X) = N - \mathrm{rank}\,P_X = N - \mathrm{rank}\,X. \quad (9.19)$$

Thus, among the choices for the $N \times (N - \mathrm{rank}\,X)$ matrix $R$ (of full column rank $N - \mathrm{rank}\,X$ such that $X'R = 0$) is any $N \times (N - \mathrm{rank}\,X)$ matrix whose columns are a linearly independent subset of the columns of the (symmetric) matrix $I - P_X$. For any such choice of $R$, the elements of the $(N - \mathrm{rank}\,X) \times 1$ vector $z = R'y$ consist of linearly independent (least squares) residuals.

The letters R and E in the acronym REML can be regarded as representing either restricted or residual. REML is restricted ML in the sense that in the formation of the likelihood function, the data are restricted to those inherent in the values of the $N - \mathrm{rank}\,X$ linearly independent error contrasts. REML is residual ML in the sense that the $N - \mathrm{rank}\,X$ linearly independent error contrasts can be taken to be (least squares) residuals.

It might seem as though the use of REML would result in the loss of some information about functions of $\theta$. However, in at least one regard, there is no loss of information. Consider the profile likelihood function $L^*(\theta; y)$ or profile log-likelihood function $\ell^*(\theta; y)$ of definition (9.11)—the (ordinary) ML estimate of a function of $\theta$ is obtained from a value of $\theta$ at which $L^*(\theta; y)$ or $\ell^*(\theta; y)$ attains its maximum value. The identity of the function $L^*(\theta; y)$ or, equivalently, that of the function $\ell^*(\theta; y)$ can be determined solely from knowledge of the observed value $R'y$ of the vector $z$ of error contrasts; complete knowledge of the observed value $y$ of $y$ is not required. Thus, the (ordinary) ML estimator of a function of $\theta$ (like the REML estimator) depends on the value of $y$ only through the value of the vector of error contrasts.

Let us verify that the identity of the function $\ell^*(\theta; y)$ is determinable solely from knowledge of $R'y$. Let $\tilde{e} = (I - P_X)y$, and observe (in light of Theorem 2.12.2) that $X'(I - P_X)' = 0$, implying [since the columns of $R$ form a basis for $N(X')$] that $(I - P_X)' = RK$ for some matrix $K$ and hence that

$$\tilde{e} = (RK)'y = K'R'y \quad (9.20)$$
—$\tilde{e}$ is the observed value of the vector $\tilde{e} = (I - P_X)y$. Moreover, upon observing [in light of result (2.5.5) and Corollary 2.13.12] that $[V(\theta)]^{-1}$ is a symmetric positive definite matrix, it follows from Corollary 5.9.4 that

$$\big\{[V(\theta)]^{-1} - [V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^- X'[V(\theta)]^{-1}\big\}X = 0$$

and that

$$X'\big\{[V(\theta)]^{-1} - [V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^- X'[V(\theta)]^{-1}\big\} = 0.$$

Conversely, suppose that $Q$ is any $N \times (N - R)$ matrix such that $I - P_X = QQ'$. Then, according to Theorem 5.9.5, $Q'Q = I$. Moreover, making use of Theorem 2.12.2, we find that

$$X'QQ' = X'(I - P_X) = 0,$$

implying (in light of Corollary 2.3.4) that $X'Q = 0$. Q.E.D.
Lemma 5.9.7. Let $X$ represent an $N \times P$ matrix of rank $R$ ($< N$). Then, for any $N \times (N - R)$ matrix $Q$, $X'Q = 0$ and $Q'Q = I$ if and only if the columns of $Q$ form an orthonormal basis for $N(X')$.

Proof. If the columns of $Q$ form an orthonormal basis for $N(X')$, then clearly $X'Q = 0$ and $Q'Q = I$. Conversely, suppose that $X'Q = 0$ and $Q'Q = I$. Then, clearly, the $N - R$ columns of $Q$ are orthonormal, and each of them is contained in $N(X')$. And since orthonormal vectors are linearly independent (as is evident from Lemma 2.4.22) and since (according to Lemma 2.11.5) $\dim[N(X')] = N - R$, it follows from Theorem 2.4.11 that the columns of $Q$ form a basis for $N(X')$. Q.E.D.
REML in the special case of a G–M model. Let us consider REML in the special case where the $N \times 1$ observable random vector $y$ follows a G–M model. And in doing so, let us continue to suppose that the distribution of the vector $e$ of residual effects is MVN. Then, $y \sim N(X\beta, \sigma^2 I)$.

What is the REML estimator of $\sigma^2$, and how does it compare with other estimators of $\sigma^2$, including the (ordinary) ML estimator (which was derived in Subsection a)? These questions can be readily answered by making a judicious choice for the $N \times (N - \mathrm{rank}\,X)$ matrix $R$ (of full column rank $N - \mathrm{rank}\,X$) such that $X'R = 0$.

Let $Q$ represent an $N \times (N - \mathrm{rank}\,X)$ matrix whose columns form an orthonormal basis for $N(X')$. Or, equivalently (in light of Lemma 5.9.7), take $Q$ to be an $N \times (N - \mathrm{rank}\,X)$ matrix such that $X'Q = 0$ and $Q'Q = I$. And observe (in light of Theorem 5.9.6) that

$$I - P_X = QQ'$$

(and that $Q$ is of full column rank).

Suppose that in implementing REML, we set $R = Q$—clearly, that is a legitimate choice for $R$. Then, $z = Q'y \sim N(0, \sigma^2 I)$. And, letting $y$ represent the observed value of $y$, the log-likelihood function that results from regarding the observed value $Q'y$ of $z$ as the data vector is the function $\ell(\sigma; Q'y)$ of $\sigma$ given by

$$\ell(\sigma; Q'y) = -\frac{N - \mathrm{rank}\,X}{2}\log(2\pi) - \frac{1}{2}\log\big|\sigma^2 I_{N - \mathrm{rank}\,X}\big| - \frac{1}{2}y'Q(\sigma^2 I)^{-1}Q'y$$

$$= -\frac{N - \mathrm{rank}\,X}{2}\log(2\pi) - \frac{N - \mathrm{rank}\,X}{2}\log\sigma^2 - \frac{1}{2\sigma^2}y'(I - P_X)y$$

$$= -\frac{N - \mathrm{rank}\,X}{2}\log(2\pi) - \frac{N - \mathrm{rank}\,X}{2}\log\sigma^2 - \frac{1}{2\sigma^2}[(I - P_X)y]'(I - P_X)y. \quad (9.23)$$
Unless $(I - P_X)y = 0$ (which is an event of probability 0), $\ell(\sigma; Q'y)$ is of the form of the function $g(\sigma)$ defined by equality (9.4); upon setting $a = -[(N - \mathrm{rank}\,X)/2]\log(2\pi)$, $c = [(I - P_X)y]'(I - P_X)y$, and $K = N - \mathrm{rank}\,X$, $g(\sigma) = \ell(\sigma; Q'y)$. Accordingly, it follows from the results of Part 1 of Subsection a that $\ell(\sigma; Q'y)$ attains its maximum value when $\sigma^2$ equals

$$\frac{[(I - P_X)y]'(I - P_X)y}{N - \mathrm{rank}\,X}.$$

Thus, the REML estimator of $\sigma^2$ is the estimator

$$\frac{\tilde{e}'\tilde{e}}{N - \mathrm{rank}\,X}, \quad (9.24)$$

where $\tilde{e} = y - P_X y$.

The REML estimator (9.24) is of the form (7.41) considered in Section 5.7c; it is the estimator of the form (7.41) that is unbiased. Unlike the (ordinary) ML estimator $\tilde{e}'\tilde{e}/N$ [which was derived in Part 1 of Subsection a and is also of the form (7.41)], it "accounts for the estimation of $\beta$"; in the REML estimation of $\sigma^2$, the residual sum of squares $\tilde{e}'\tilde{e}$ is divided by $N - \mathrm{rank}\,X$ rather than by $N$.
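The derivation via $Q'y$ can be traced in code. The following sketch of ours (with arbitrary simulated data, and SciPy's null_space supplying an orthonormal $Q$) confirms both $I - P_X = QQ'$ and the agreement of the maximizer of (9.23) with (9.24):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(8)
N, P = 20, 3
X = rng.standard_normal((N, P))
y = X @ rng.standard_normal(P) + 2.0 * rng.standard_normal(N)

Q = null_space(X.T)                        # orthonormal basis of N(X'); Q'Q = I, X'Q = 0
PX = X @ np.linalg.pinv(X.T @ X) @ X.T
assert np.allclose(Q @ Q.T, np.eye(N) - PX)   # Theorem 5.9.6: I - P_X = QQ'

z = Q.T @ y                                # error contrasts, z ~ N(0, sigma^2 I)
reml = (z @ z) / Q.shape[1]                # maximizer of (9.23): z'z / (N - rank X)

e = y - PX @ y
print("REML:", reml, "= e~'e~/(N - rank X):", e @ e / (N - np.linalg.matrix_rank(X)))
print("ML  :", e @ e / N)
```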
A matrix lemma. Preliminary to the further discussion of REML, it is convenient to establish the following lemma.

Lemma 5.9.8. Let $A$ represent a $Q \times S$ matrix. Then, for any $K \times Q$ matrix $C$ of full column rank $Q$ and any $S \times T$ matrix $B$ of full row rank $S$, $B(CAB)^- C$ is a generalized inverse of $A$.

Proof. In light of Lemma 2.5.1, $C$ has a left inverse, say $L$, and $B$ has a right inverse, say $R$. And it follows that

$$A\,B(CAB)^- C\,A = IA\,B(CAB)^- C\,AI = L\,CAB(CAB)^- CAB\,R = LCABR = IAI = A.$$

Q.E.D.
An informative and computationally useful expression for the REML log-likelihood function. Suppose that $y$ is an $N \times 1$ observable random vector that follows a general linear model. Suppose further that the distribution of the vector $e$ of residual effects is MVN and that the variance-covariance matrix $V(\theta)$ of $e$ is nonsingular (for every $\theta \in \Theta$). And let $z = R'y$, where $R$ is an $N \times (N - \mathrm{rank}\,X)$ matrix of full column rank $N - \mathrm{rank}\,X$ such that $X'R = 0$, and denote by $y$ the observed value of $y$.

In REML, inferences about functions of $\theta$ are based on the likelihood function $L(\theta; R'y)$ obtained by regarding the observed value $R'y$ of $z$ as the data vector. Corresponding to $L(\theta; R'y)$ is the log-likelihood function $\ell(\theta; R'y) = \log L(\theta; R'y)$. We have that $z \sim N[0, R'V(\theta)R]$, and it follows that

$$\ell(\theta; R'y) = -\frac{N - \mathrm{rank}\,X}{2}\log(2\pi) - \frac{1}{2}\log|R'V(\theta)R| - \frac{1}{2}y'R[R'V(\theta)R]^{-1}R'y \quad (9.26)$$

—recall that $R'V(\theta)R$ is nonsingular.
REML estimates of functions of $\theta$ are obtained from a value, say $\hat{\theta}$, of $\theta$ at which $L(\theta; R'y)$ or, equivalently, $\ell(\theta; R'y)$ attains its maximum value. By way of comparison, (ordinary) ML estimates of such functions are obtained from a value, say $\tilde{\theta}$, at which the profile likelihood function $L^*(\theta; y)$ or profile log-likelihood function $\ell^*(\theta; y)$ attains its maximum value; the (ordinary) ML estimate of a function $h(\theta)$ of $\theta$ is $h(\tilde{\theta})$, whereas the REML estimate is $h(\hat{\theta})$. It is of potential interest to compare $\ell(\theta; R'y)$ with $\ell^*(\theta; y)$. Expressions for $\ell^*(\theta; y)$ are given by results (9.13), (9.14), and (9.15). However, expression (9.26) [for $\ell(\theta; R'y)$] is not of a form that facilitates meaningful comparisons with any of those expressions. Moreover, depending on the nature of the variance-covariance matrix $V(\theta)$ (and on the choice of the matrix $R$), expression (9.26) may not be well-suited for computational purposes [such as in computing the values of $\ell(\theta; R'y)$ corresponding to various values of $\theta$].
For purposes of obtaining a more useful expression for $\ell(\theta; R'y)$, take $S$ to be any matrix (with $N$ rows) whose columns span $C(X)$, that is, any matrix such that $C(S) = C(X)$ (in which case, $S = XA$ for some matrix $A$). And, temporarily (for the sake of simplicity) writing $V$ for $V(\theta)$, observe that

$$(V^{-1}S, R)'\,V\,(V^{-1}S, R) = \mathrm{diag}(S'V^{-1}S,\;R'VR) \quad (9.27)$$

and [in light of result (2.5.5), Corollary 2.13.12, and Corollary 5.9.3] that

$$\mathrm{rank}[(V^{-1}S, R)'V(V^{-1}S, R)] = \mathrm{rank}[\mathrm{diag}(S'V^{-1}S,\;R'VR)] = \mathrm{rank}(S'V^{-1}S) + \mathrm{rank}(R'VR) = \mathrm{rank}(S) + \mathrm{rank}(R) = \mathrm{rank}(X) + N - \mathrm{rank}(X) = N. \quad (9.28)$$

And it follows that

$$V^{-1} = (V^{-1}S, R)\,\mathrm{diag}[(S'V^{-1}S)^-,\;(R'VR)^{-1}]\,(V^{-1}S, R)' = V^{-1}S(S'V^{-1}S)^- S'V^{-1} + R(R'VR)^{-1}R'. \quad (9.30)$$
Now, consider the quantity $|R'VR|$ [which appears in the 2nd term of expression (9.26)]. Take $X_*$ to be any $N \times (\mathrm{rank}\,X)$ matrix whose columns are linearly independent columns of $X$ or, more generally, whose columns form a basis for $C(X)$ (in which case, $X_* = XA$ for some matrix $A$). Observing that

$$(X_*, R)'(X_*, R) = \mathrm{diag}(X_*'X_*,\;R'R)$$

and making use of basic properties of determinants, we find that

$$|(X_*, R)'V(X_*, R)| = [\det(X_*, R)]^2\,|V| = |X_*'X_*|\,|R'R|\,|V|. \quad (9.33)$$

And making use of formula (2.14.29) for the determinant of a partitioned matrix, we find that

$$|(X_*, R)'V(X_*, R)| = \begin{vmatrix} X_*'VX_* & X_*'VR \\ R'VX_* & R'VR \end{vmatrix} = |R'VR|\,\big|X_*'VX_* - X_*'VR(R'VR)^{-1}R'VX_*\big|.$$

Moreover, as a special case of equality (9.30) (that where $S = X_*$), we have (since, in light of Corollary 5.9.3, $X_*'V^{-1}X_*$ is nonsingular) that

$$V^{-1} = V^{-1}X_*(X_*'V^{-1}X_*)^{-1}X_*'V^{-1} + R(R'VR)^{-1}R',$$

from which it follows that

$$V - VR(R'VR)^{-1}R'V = X_*(X_*'V^{-1}X_*)^{-1}X_*'. \quad (9.35)$$

And, upon observing [in light of result (9.35)] that $X_*'VX_* - X_*'VR(R'VR)^{-1}R'VX_* = X_*'X_*(X_*'V^{-1}X_*)^{-1}X_*'X_*$, we obtain

$$|(X_*, R)'V(X_*, R)| = |R'VR|\,|X_*'X_*|^2\,/\,|X_*'V^{-1}X_*|. \quad (9.36)$$

It remains to equate expressions (9.33) and (9.36); doing so leads to the following expression for $|R'VR|$:

$$|R'VR| = |R'R|\,|V|\,|X_*'V^{-1}X_*|\,/\,|X_*'X_*|. \quad (9.37)$$
Upon substituting expressions (9.32) and (9.37) [for $R(R'VR)^{-1}R'$ and $|R'VR|$] into expression (9.26), we find that the REML log-likelihood function $\ell(\theta; R'y)$ is reexpressible as follows:

$$\ell(\theta; R'y) = -\frac{N - \mathrm{rank}\,X}{2}\log(2\pi) - \frac{1}{2}\log|R'R| + \frac{1}{2}\log|X_*'X_*| - \frac{1}{2}\log|V(\theta)| - \frac{1}{2}\log\big|X_*'[V(\theta)]^{-1}X_*\big| - \frac{1}{2}y'\big\{[V(\theta)]^{-1} - [V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^- X'[V(\theta)]^{-1}\big\}y. \quad (9.38)$$
If $R$ is taken to be a matrix whose columns form an orthonormal basis for $N(X')$, then the second term of expression (9.38) equals 0; similarly, if $X_*$ is taken to be a matrix whose columns form an orthonormal basis for $C(X)$, then the third term of expression (9.38) equals 0. However, what is more important is that the choice of $R$ affects expression (9.38) only through its second term, which is a constant (i.e., does not involve $\theta$). And for any two choices of $X_*$, say $X_{*1}$ and $X_{*2}$, $X_{*2} = X_{*1}B$ for some matrix $B$ (which is necessarily nonsingular), implying that

$$\tfrac{1}{2}\log\big|X_{*2}'[V(\theta)]^{-1}X_{*2}\big| = \tfrac{1}{2}\log\big|B'X_{*1}'[V(\theta)]^{-1}X_{*1}B\big| = \log|\det B| + \tfrac{1}{2}\log\big|X_{*1}'[V(\theta)]^{-1}X_{*1}\big|$$

and, similarly, that

$$\tfrac{1}{2}\log|X_{*2}'X_{*2}| = \log|\det B| + \tfrac{1}{2}\log|X_{*1}'X_{*1}|,$$

so that the only effect on expression (9.38) of a change in the choice of $X_*$ from $X_{*1}$ to $X_{*2}$ is to add a constant to the third term and to subtract the same constant from the fifth term. Thus, the choice of $R$ and the choice of $X_*$ are immaterial.
The last term of expression (9.38) can be reexpressed in terms of an arbitrary solution, say $\tilde{\beta}(\theta)$, to the linear system

$$X'[V(\theta)]^{-1}Xb = X'[V(\theta)]^{-1}y \quad (9.39)$$

(in the $P \times 1$ vector $b$)—recall (from Subsection a) that this linear system is consistent, that $X\tilde{\beta}(\theta)$ does not depend on the choice of $\tilde{\beta}(\theta)$, and that the choices for $\tilde{\beta}(\theta)$ include the vector $(\{X'[V(\theta)]^{-1}X\}^-)'X'[V(\theta)]^{-1}y$. We find that

$$y'\big\{[V(\theta)]^{-1} - [V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^- X'[V(\theta)]^{-1}\big\}y = y'[V(\theta)]^{-1}y - [\tilde{\beta}(\theta)]'X'[V(\theta)]^{-1}y \quad (9.40)$$

$$= [y - X\tilde{\beta}(\theta)]'[V(\theta)]^{-1}[y - X\tilde{\beta}(\theta)]. \quad (9.41)$$
It is informative to compare expression (9.38) for $\ell(\theta; R'y)$ with expression (9.15) for the profile log-likelihood function $\ell^*(\theta; y)$. Aside from the terms that do not depend on $\theta$ [the 1st term of expression (9.15) and the first 3 terms of expression (9.38)], the only difference between the two expressions is the inclusion in expression (9.38) of the term $-\frac{1}{2}\log|X_*'[V(\theta)]^{-1}X_*|$. This term depends on $\theta$, but not on $y$. Its inclusion serves to adjust the profile log-likelihood function $\ell^*(\theta; y)$ so as to compensate for the failure of ordinary ML (in estimating functions of $\theta$) to account for the estimation of $\beta$. Unlike the profile log-likelihood function, $\ell(\theta; R'y)$ is the logarithm of an actual likelihood function and, consequently, has the properties thereof—it is the logarithm of the likelihood function $L(\theta; R'y)$ obtained by regarding the observed value $R'y$ of $z$ as the data vector.

If the form of the $N \times N$ matrix $V(\theta)$ is such that $V(\theta)$ is relatively easy to invert (as is often the case in practice), then expression (9.38) for $\ell(\theta; R'y)$ is likely to be much more useful for computational purposes than expression (9.26). Expression (9.38) [along with expression (9.40) or (9.41)] serves to relate the numerical evaluation of $\ell(\theta; R'y)$ for any particular value of $\theta$ to the solution of the linear system (9.39), comprising $P$ equations in $P$ "unknowns."
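The agreement of expressions (9.26) and (9.38) can be checked numerically. In the sketch below (ours; orthonormal choices of $R$ and $X_*$ are used so that the second and third terms of (9.38) vanish, and the covariance family is an arbitrary illustration):

```python
import numpy as np
from scipy.linalg import null_space, orth

rng = np.random.default_rng(9)
N, P = 15, 3
X = rng.standard_normal((N, P))
y = rng.standard_normal(N)
Z = rng.standard_normal((N, 2))
nu = N - np.linalg.matrix_rank(X)

R = null_space(X.T)          # orthonormal, so the |R'R| term vanishes
Xs = orth(X)                 # orthonormal basis for C(X), so the |X*'X*| term vanishes

def reml_26(V):              # direct form (9.26)
    z = R.T @ y
    RVR = R.T @ V @ R
    return -0.5 * (nu * np.log(2 * np.pi) + np.linalg.slogdet(RVR)[1]
                   + z @ np.linalg.solve(RVR, z))

def reml_38(V):              # computationally convenient form (9.38), using (9.41)
    Vi = np.linalg.inv(V)
    bt = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)   # a solution of (9.39)
    r = y - X @ bt
    return -0.5 * (nu * np.log(2 * np.pi) + np.linalg.slogdet(V)[1]
                   + np.linalg.slogdet(Xs.T @ Vi @ Xs)[1] + r @ Vi @ r)

for theta in (0.3, 1.0, 2.5):
    V = np.eye(N) + theta * Z @ Z.T        # an arbitrary illustrative V(theta)
    print(theta, reml_26(V), reml_38(V))   # the two expressions agree
```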
Special case: Aitken model. Let us now specialize to the case where $y$ follows an Aitken model (and where $H$ is nonsingular). As in Subsection a, this case is to be regarded as the special case of the general linear model where $T = 1$, where $\theta = (\sigma)$, and where $V(\theta) = \sigma^2 H$. In this special case, linear system (9.39) is equivalent to (i.e., has the same solutions as) the linear system

$$X'H^{-1}Xb = X'H^{-1}y, \quad (9.42)$$

comprising the Aitken equations. And taking $\tilde{\beta}$ to be any solution to linear system (9.42), we find [in light of results (9.38) and (9.41)] that the log-likelihood function $\ell(\sigma; R'y)$ is expressible as

$$\ell(\sigma; R'y) = -\frac{N - \mathrm{rank}\,X}{2}\log(2\pi) - \frac{1}{2}\log|R'R| + \frac{1}{2}\log|X_*'X_*| - \frac{1}{2}\log|H| - \frac{1}{2}\log|X_*'H^{-1}X_*| - \frac{N - \mathrm{rank}\,X}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\tilde{\beta})'H^{-1}(y - X\tilde{\beta}). \quad (9.43)$$

Unless $y - X\tilde{\beta} = 0$ (which is an event of probability 0), $\ell(\sigma; R'y)$ is of the form of the function $g(\sigma)$ defined (in Part 1 of Subsection a) by equality (9.4); upon setting $a = -[(N - \mathrm{rank}\,X)/2]\log(2\pi) - (1/2)\log|R'R| + (1/2)\log|X_*'X_*| - (1/2)\log|H| - (1/2)\log|X_*'H^{-1}X_*|$, $c = (y - X\tilde{\beta})'H^{-1}(y - X\tilde{\beta})$, and $K = N - \mathrm{rank}\,X$, $g(\sigma) = \ell(\sigma; R'y)$. Accordingly, it follows from the results of Part 1 of Subsection a that $\ell(\sigma; R'y)$ attains its maximum value when $\sigma^2$ equals

$$\frac{(y - X\tilde{\beta})'H^{-1}(y - X\tilde{\beta})}{N - \mathrm{rank}\,X}. \quad (9.44)$$
The quantity (9.44) is the REML estimate of $\sigma^2$; it is the estimate obtained by dividing $(y - X\tilde{\beta})'H^{-1}(y - X\tilde{\beta})$ by $N - \mathrm{rank}\,X$. It differs from the (ordinary) ML estimate of $\sigma^2$, which (as is evident from the results of Subsection a) is obtained by dividing $(y - X\tilde{\beta})'H^{-1}(y - X\tilde{\beta})$ by $N$. Note that in the further special case of the G–M model (i.e., the further special case where $H = I$), the Aitken equations simplify to the normal equations $X'Xb = X'y$ and expression (9.44) (for the REML estimate) is [upon setting $\tilde{\beta} = (X'X)^- X'y$] reexpressible as $[(I - P_X)y]'(I - P_X)y/(N - \mathrm{rank}\,X)$, in agreement with the expression for the REML estimator [expression (9.24)] derived in a previous part of the present subsection.
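As a brief numerical illustration of ours (with an arbitrary diagonal $H$ standing in for a known heteroscedastic covariance structure), the REML and ML estimates under an Aitken model differ only in the divisor:

```python
import numpy as np

rng = np.random.default_rng(10)
N, P, sigma = 25, 3, 1.7
X = rng.standard_normal((N, P))
H = np.diag(np.linspace(0.5, 3.0, N))      # a known, nonsingular H
L = np.linalg.cholesky(H)
y = X @ rng.standard_normal(P) + sigma * (L @ rng.standard_normal(N))

Hi = np.linalg.inv(H)
beta_t = np.linalg.solve(X.T @ Hi @ X, X.T @ Hi @ y)   # solution of the Aitken equations (9.42)
q = (y - X @ beta_t) @ Hi @ (y - X @ beta_t)

nu = N - np.linalg.matrix_rank(X)
print("REML estimate of sigma^2 (9.44):", q / nu)
print("ML estimate of sigma^2   (9.17):", q / N)
```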
c. Elliptical distributions
The results of Subsections a and b (on the ML and REML estimation of functions of the parameters
of a G–M, Aitken, or general linear model) were obtained under the assumption that the distribution
of the vector e of residual effects is MVN. Some of the properties of the MVN distribution extend (in
a relatively straightforward way) to a broader class of distributions called elliptical distributions (or
elliptically contoured or elliptically symmetric distributions). Elliptical distributions are introduced
(and some of their basic properties described) in the present subsection—this follows the presentation
(in Part 1 of the present subsection) of a useful result on orthogonal matrices. Then, in Subsection
d, the results of Subsections a and b are revisited with the intent of obtaining extensions suitable for
G–M, Aitken, or general linear models when the form of the distribution of the vector e of residual
effects is taken to be that of an elliptical distribution other than a multivariate normal distribution.
A matrix lemma.

Lemma 5.9.9. For any two $M$-dimensional column vectors $x_1$ and $x_2$, $x_2'x_2 = x_1'x_1$ if and only if there exists an $M \times M$ orthogonal matrix $O$ such that $x_2 = Ox_1$.

Proof. If there exists an orthogonal matrix $O$ such that $x_2 = Ox_1$, then, clearly,

$$x_2'x_2 = x_1'O'Ox_1 = x_1'x_1.$$

For purposes of establishing the converse, take $u = (1, 0, 0, \dots, 0)'$ to be the first column of $I_M$, and assume that both $x_1$ and $x_2$ are nonnull—if $x_2'x_2 = x_1'x_1$ and either $x_1$ or $x_2$ is null, then both $x_1$ and $x_2$ are null, in which case $x_2 = Ox_1$ for any $M \times M$ orthogonal matrix $O$. And for $i = 1, 2$, define

$$P_i = I - 2(v_i'v_i)^{-1}v_i v_i',$$

where $v_i = x_i - (x_i'x_i)^{1/2}u$—if $v_i = 0$, take $P_i = I$. The two matrices $P_1$ and $P_2$ are Householder matrices; they are symmetric and orthogonal and are such that, for $i = 1, 2$, $P_i x_i = (x_i'x_i)^{1/2}u$—refer, e.g., to Golub and Van Loan (2013, sec. 5.1.2). Thus, if $x_2'x_2 = x_1'x_1$, then

$$P_2 x_2 = (x_2'x_2)^{1/2}u = (x_1'x_1)^{1/2}u = P_1 x_1,$$

so that $x_2 = P_2^{-1}P_1 x_1 = P_2 P_1 x_1$; and since the matrix $P_2 P_1$ is orthogonal, the proof is complete. Q.E.D.
And upon observing [in light of the fundamental theorem of (integral) calculus (e.g., Billingsley 1995)] that $u \sim z$ if and only if $h(Oz) = f(Oz)$ (with probability 1), it follows that $u \sim z$ if and only if

$$f(Oz) = f(z) \text{ (with probability 1).} \quad (9.50)$$

In effect, we have established that $z$ has a spherical distribution if and only if, for every orthogonal matrix $O$, the pdf $f(\cdot)$ satisfies condition (9.50). Now, suppose that $f(z)$ depends on the value of $z$ only through $z'z$ or, equivalently, that there exists a (nonnegative) function $g(\cdot)$ (of a single nonnegative variable) such that

$$f(z) = g(z'z) \text{ (for every value of } z\text{).} \quad (9.51)$$

Clearly, if $f(\cdot)$ is of the form (9.51), then, for every orthogonal matrix $O$, $f(\cdot)$ satisfies condition (9.50) and, in fact, satisfies the more stringent condition that $f(Oz) = f(z)$ for every value of $z$. Conversely, suppose that $g(\cdot)$ is a (nonnegative) function (of a single nonnegative variable) such that $0 < \int_{\mathbb{R}^M} g(z'z)\,dz < \infty$, and define $f(z) = c^{-1}g(z'z)$, where $c = \int_{\mathbb{R}^M} g(z'z)\,dz$ [and observe that $\int_{\mathbb{R}^M} f(z)\,dz = 1$]. Then, there is an absolutely continuous distribution (of an $M \times 1$ random vector) having $f(\cdot)$ as a pdf, and [since $f(\cdot)$ is of the form (9.51)] that distribution is spherical.
in which case the constant c in expression (9.54) is expressible in the form (9.57).
Moment generating function of a spherical distribution. Spherical distributions can be characterized in terms of their moment generating functions (or, more generally, their characteristic functions). To see this, let $O$ represent an arbitrary $M \times M$ orthogonal matrix, and observe that (for any $M \times 1$ vector $t$)

$$\phi(Ot) = E\big[e^{(Ot)'z}\big] = E\big[e^{t'(O'z)}\big]$$

and hence that $\phi(Ot) = \phi(t)$ (for every $M \times 1$ vector $t$ in a neighborhood of 0) if and only if $\phi(\cdot)$ is the moment generating function of the distribution of $O'z$ (as well as that of the distribution of $z$), or equivalently—refer, e.g., to Casella and Berger (2002, p. 65)—if and only if $O'z$ and $z$ have the same distribution.

For the distribution of $z$ to be spherical, it is necessary and sufficient that $\phi(t)$ depend on the $M \times 1$ vector $t$ only through the value of $t't$ or, equivalently, that there exist a function $\psi(\cdot)$ (of a single nonnegative variable) such that

$$\phi(t) = \psi(t't) \text{ (for every } M \times 1 \text{ vector } t \text{ in a neighborhood of } 0\text{).} \quad (9.59)$$

Let us verify the necessity and sufficiency of the existence of a function $\psi(\cdot)$ that satisfies condition (9.59). If there exists a function $\psi(\cdot)$ that satisfies condition (9.59), then for every $M \times M$ orthogonal matrix $O$ (and for every $M \times 1$ vector $t$ in a neighborhood of 0),

$$\phi(Ot) = \psi[(Ot)'(Ot)] = \psi(t't) = \phi(t).$$
Suppose now that $x$ is an $M \times 1$ random vector of the form

$$x = \mu + \Gamma'z \quad (9.60)$$

[where $\mu$ is an $M \times 1$ (nonrandom) vector, $\Gamma$ an $N \times M$ (nonrandom) matrix, and $z$ an $N \times 1$ spherically distributed random vector]. If $E(z)$ exists, then $E(x)$ exists and [in light of result (9.46)]

$$E(x) = \mu. \quad (9.61)$$

And if the second-order moments of the distribution of $z$ exist, then so do those of the distribution of $x$ and [in light of result (9.47)]

$$\mathrm{var}(x) = c\Sigma, \quad (9.62)$$

where $\Sigma = \Gamma'\Gamma$ and where $c$ is the variance of any element of $z$—every element of $z$ has the same variance.

If the distribution of $z$ has a moment generating function, say $\omega(\cdot)$, then there exists a (nonnegative) function $\psi(\cdot)$ (of a single nonnegative variable) such that (for every $N \times 1$ vector $s$ in a neighborhood of 0) $\omega(s) = \psi(s's)$, and the distribution of $x$ has the moment generating function $\phi(\cdot)$, where (for every $M \times 1$ vector $t$ in a neighborhood of 0)

$$\phi(t) = E\big(e^{t'x}\big) = E\big[e^{t'(\mu + \Gamma'z)}\big] = e^{t'\mu}E\big[e^{(\Gamma t)'z}\big] = e^{t'\mu}\omega(\Gamma t) = e^{t'\mu}\psi(t'\Sigma t). \quad (9.63)$$

Note that the moment generating function of the distribution of $x$ and hence the distribution itself depend on the value of the $N \times M$ matrix $\Gamma$ only through the value of the $M \times M$ matrix $\Sigma = \Gamma'\Gamma$.
Marginal distributions (of spherically distributed random vectors). Let $z$ represent an $N$-dimensional spherically distributed random column vector. And take $z_*$ to be an $M$-dimensional subvector of $z$ (where $M < N$), say the subvector obtained by striking out all of the elements of $z$ save the $i_1, i_2, \dots, i_M$th elements.

Suppose that the distribution of $z$ has a moment generating function, say $\phi(\cdot)$. Then, necessarily, there exists a (nonnegative) function $\psi(\cdot)$ (of a single nonnegative variable) such that $\phi(s) = \psi(s's)$ (for every $N \times 1$ vector $s$ in a neighborhood of 0). Clearly, the subvector $z_*$ can be regarded as a special case of the random column vector $x$ defined by expression (9.60); it is the special case obtained by setting $\mu = 0$ and taking $\Gamma$ to be the $N \times M$ matrix whose first, second, …, $M$th columns are, respectively, the $i_1, i_2, \dots, i_M$th columns of $I_N$. And (in light of the results of the preceding part of the present subsection) it follows that the distribution of $z_*$ has a moment generating function, say $\phi_*(\cdot)$, and that (for every $M \times 1$ vector $t$ in some neighborhood of 0)

$$\phi_*(t) = \psi(t't).$$

Thus, the moment generating function of the distribution of the subvector $z_*$ is characterized by the same function $\psi(\cdot)$ as that of the distribution of $z$ itself.

Suppose now that $u$ is an $M$-dimensional random column vector whose distribution has a moment generating function, say $\omega(\cdot)$, and that (for every $M \times 1$ vector $t$ in a neighborhood of 0) $\omega(t) = \psi(t't)$. Then, the distribution of $u$ is spherical. Moreover, it has the same moment generating function as the distribution of $z_*$ (and, consequently, $u \sim z_*$). There is an implication that the elements of $u$, like those of $z_*$, have the same variance as the elements of $z$.

The moment generating function of a marginal distribution of $z$ (i.e., of the distribution of a subvector of $z$) is characterized by the same function $\psi(\cdot)$ as that of the distribution of $z$ itself. In the case of pdf's, the relationship is more complex.
Suppose that the distribution of the $N$-dimensional spherically distributed random column vector $z$ is an absolutely continuous spherical distribution. Then, the distribution of $z$ has a pdf $f(\cdot)$, where $f(z) = g(z'z)$ for some (nonnegative) function $g(\cdot)$ of a single nonnegative variable (and for every value of $z$). Accordingly, the distribution of the $M$-dimensional subvector $z_*$ is the absolutely continuous distribution with pdf $f_*(\cdot)$ defined (for every value of $z_*$) by

$$f_*(z_*) = \int_{\mathbb{R}^{N-M}} g(z_*'z_* + \bar{z}'\bar{z})\,d\bar{z},$$

where $\bar{z}$ represents the vector formed from the remaining $N - M$ elements. And upon regarding $g(z_*'z_* + w)$ as a function of a nonnegative variable $w$ and applying result (9.57), we find that (for every value of $z_*$)

$$f_*(z_*) = \frac{2\pi^{(N-M)/2}}{\Gamma[(N-M)/2]}\int_0^\infty s^{N-M-1}\,g(z_*'z_* + s^2)\,ds. \quad (9.65)$$

Clearly, $f_*(z_*)$ depends on the value of $z_*$ only through $z_*'z_*$, so that (as could have been anticipated from our results on the moment generating function of the distribution of a subvector of a spherically distributed random vector) the distribution of $z_*$ is spherical. Further, upon introducing the changes of variable $w = s^2$ and $u = z_*'z_* + w$, we obtain the following variations on expression (9.65):

$$f_*(z_*) = \frac{\pi^{(N-M)/2}}{\Gamma[(N-M)/2]}\int_0^\infty w^{[(N-M)/2]-1}\,g(z_*'z_* + w)\,dw \quad (9.66)$$

$$= \frac{\pi^{(N-M)/2}}{\Gamma[(N-M)/2]}\int_{z_*'z_*}^\infty (u - z_*'z_*)^{[(N-M)/2]-1}\,g(u)\,du. \quad (9.67)$$
Elliptical distributions: definition. The distribution of a random column vector of the form of the vector $x$ of equality (9.60) is said to be elliptical. And a random column vector whose distribution is that of the vector $x$ of equality (9.60) may be referred to as being distributed elliptically about $\mu$ or, in the special case where $\Gamma = I$ (or where $\Gamma$ is orthogonal), as being distributed spherically about $\mu$. Clearly, a random column vector $x$ is distributed elliptically about $\mu$ if and only if $x - \mu$ is distributed elliptically about 0 and is distributed spherically about $\mu$ if and only if $x - \mu$ is distributed spherically about 0. Let us consider the definition of an elliptical distribution as applied to distributions whose second-order moments exist.

For any $M \times 1$ vector $\mu$ and any $M \times M$ nonnegative definite matrix $\Sigma$, an $M \times 1$ random vector $x$ has an elliptical distribution with mean vector $\mu$ and variance-covariance matrix $\Sigma$ if and only if

$$x \sim \mu + \Gamma'z \quad (9.68)$$

for some matrix $\Gamma$ such that $\Sigma = \Gamma'\Gamma$ and some random (column) vector $z$ (of compatible dimension) having a spherical distribution with variance-covariance matrix $I$—recall that if a random column vector $z$ has a spherical distribution with a variance-covariance matrix that is a nonzero scalar multiple $cI$ of $I$, then the rescaled vector $c^{-1/2}z$ has a spherical distribution with variance-covariance matrix $I$. In connection with condition (9.68), define $K = \mathrm{rank}\,\Sigma$, and denote by $N$ the number of rows in the matrix $\Gamma$ (or, equivalently, the number of elements in $z$)—necessarily, $N \ge K$. For any particular $N$, the distribution of $\mu + \Gamma'z$ does not depend on the choice of $\Gamma$ [as is evident (for the case where the distribution of $z$ has a moment generating function) from result (9.63)]; rather, it depends only on $\mu$, $\Sigma$, and the distribution of $z$.

Now, consider the distribution of $\mu + \Gamma'z$ for different choices of $N$. Assume that $K \ge 1$—if $K = 0$, then (for any choice of $N$) $\Gamma = 0$ and hence $\mu + \Gamma'z = \mu$. And take $\Gamma_*$ to be a $K \times M$ matrix such that $\Sigma = \Gamma_*'\Gamma_*$, and take $z_*$ to be a $K \times 1$ random vector having a spherical distribution with variance-covariance matrix $I_K$.
Suppose that the distribution of $z_*$ has a moment generating function, say $\omega_*(\cdot)$. Then, because the distribution of $z_*$ is spherical, there exists a (nonnegative) function $\psi(\cdot)$ (of a single nonnegative variable) such that (for every $K \times 1$ vector $t_*$ in a neighborhood of $0$) $\omega_*(t_*) = \psi(t_*'t_*)$. Take $\omega(t)$ to be the function of an $N \times 1$ vector $t$ defined (for every value of $t$ in some neighborhood of $0$) by $\omega(t) = \psi(t't)$. There may or may not exist an ($N$-dimensional) distribution having $\omega(\cdot)$ as a moment generating function. If such a distribution exists, then that distribution is spherical, and for any random vector, say $w$, having that distribution, the distribution of $z_*$ is a marginal distribution of $w$ and $\operatorname{var}(w) = I_N$. Accordingly, if there exists a distribution having $\omega(\cdot)$ as a moment generating function, then the distribution of the random vector $z$ [in expression (9.68)] could be taken to be that distribution, in which case the distribution of $\mu + \Gamma'z$ would have the same moment generating function as the distribution of $\mu + \Gamma_*'z_*$ [as is evident from result (9.63)] and it would follow that $\mu + \Gamma'z \sim \mu + \Gamma_*'z_*$.

Thus, as long as there exists an ($N$-dimensional) distribution having $\omega(\cdot)$ as a moment generating function [where $\omega(t) = \psi(t't)$] and as long as the distribution of $z$ is taken to be that distribution, the distribution of $\mu + \Gamma'z$ is invariant to the choice of $N$. This invariance extends to every $N$ for which there exists an ($N$-dimensional) distribution having $\omega(\cdot)$ as a moment generating function.
Let us refer to the function $\psi(\cdot)$ as the mgf generator of the distribution of the $M$-dimensional random vector $\mu + \Gamma_*'z_*$ (with mgf being regarded as an acronym for moment generating function). The moment generating function of the distribution of $\mu + \Gamma_*'z_*$ is the function $\phi(\cdot)$ defined (for every $M \times 1$ vector $t$ in a neighborhood of $0$) by
$$\phi(t) = e^{t'\mu}\,\psi(t'\Sigma t) \tag{9.69}$$
[as is evident from result (9.63)]. The distribution of $\mu + \Gamma_*'z_*$ is completely determined by the mean vector $\mu$, the variance-covariance matrix $\Sigma$, and the mgf generator $\psi(\cdot)$. Accordingly, we may refer to this distribution as an ($M$-dimensional) elliptical distribution with mean $\mu$, variance-covariance matrix $\Sigma$, and mgf generator $\psi(\cdot)$. The mgf generator $\psi(\cdot)$ serves to identify the applicable distribution of $z_*$; alternatively, some other characteristic of the distribution of $z_*$ could be used for that purpose (e.g., the pdf). Note that the $N(\mu, \Sigma)$ distribution is an elliptical distribution with mean $\mu$, variance-covariance matrix $\Sigma$, and mgf generator $\psi(\cdot)$, where (for every nonnegative scalar $u$) $\psi(u) = \exp(u/2)$.
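For the multivariate normal special case just noted, relation (9.69) is easy to check by simulation. The following sketch (Python; the parameter values are arbitrary choices of ours, not taken from the text) compares a Monte Carlo estimate of the mgf of $N(\mu, \Sigma)$ with $e^{t'\mu}\psi(t'\Sigma t)$, where $\psi(u) = e^{u/2}$.

```python
# Monte Carlo sanity check of (9.69) for the multivariate normal, which
# is elliptical with mgf generator psi(u) = exp(u/2). Sample size and
# parameter values are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
x = rng.multivariate_normal(mu, Sigma, size=200_000)

t = np.array([0.3, -0.2])
mc_mgf = np.exp(x @ t).mean()                    # E[exp(t'x)], estimated
psi = lambda u: np.exp(u / 2)
formula = np.exp(t @ mu) * psi(t @ Sigma @ t)    # e^{t'mu} psi(t'Sigma t)
print(mc_mgf, formula)                           # should agree closely
```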
Pdf of an elliptical distribution. Let $x = (x_1, x_2, \ldots, x_M)'$ represent an $M \times 1$ random vector, and suppose that for some $M \times 1$ (nonrandom) vector $\mu$ and some $M \times M$ (nonrandom) positive definite matrix $\Sigma$,
$$x = \mu + \Gamma'z,$$
where $\Gamma$ is an $M \times M$ (nonsingular) matrix such that $\Sigma = \Gamma'\Gamma$ and where $z = (z_1, z_2, \ldots, z_M)'$ is an $M \times 1$ spherically distributed random vector with variance-covariance matrix $I$. Then, $x$ has an elliptical distribution with mean vector $\mu$ and variance-covariance matrix $\Sigma$.

Now, suppose that the distribution of $z$ is an absolutely continuous spherical distribution. Then, the distribution of $z$ is absolutely continuous with a pdf $h(\cdot)$ defined as follows in terms of some (nonnegative) function $g(\cdot)$ (of a single nonnegative variable) for which $\int_0^\infty s^{M-1}g(s^2)\,ds < \infty$:
$$h(z) = c^{-1}\,g(z'z),$$
where $c = [2\pi^{M/2}/\Gamma(M/2)]\int_0^\infty s^{M-1}g(s^2)\,ds$. And the distribution of $x$ is absolutely continuous with a pdf, say $f(\cdot)$, that is derivable from the pdf of the distribution of $z$.

Let us derive an expression for $f(x)$. Clearly, $z = (\Gamma')^{-1}(x - \mu)$, and the $M \times M$ matrix with $ij$th element $\partial z_i/\partial x_j$ equals $(\Gamma')^{-1}$. Moreover,
$$|\det(\Gamma')^{-1}| = |\det\Gamma'|^{-1} = [(\det\Gamma')^2]^{-1/2} = [(\det\Gamma')\det\Gamma]^{-1/2} = [\det(\Gamma'\Gamma)]^{-1/2} = |\Sigma|^{-1/2}.$$
Thus, upon applying the usual change-of-variables formula, we find that (for every value of $x$)
$$f(x) = c^{-1}\,|\Sigma|^{-1/2}\,g[(x-\mu)'\Sigma^{-1}(x-\mu)]. \tag{9.70}$$

Linear transformations (of elliptically distributed random vectors). Now, let $x$ represent an $N \times 1$ random vector that has an ($N$-dimensional) elliptical distribution with mean $\mu$, variance-covariance matrix $\Sigma$, and mgf generator $\psi(\cdot)$, and consider the random vector
$$y = c + Ax, \tag{9.71}$$
where $c$ is an $M \times 1$ (nonrandom) vector and $A$ an $M \times N$ (nonrandom) matrix. Then, $y$ has an ($M$-dimensional) elliptical distribution with mean $c + A\mu$, variance-covariance matrix $A\Sigma A'$, and (if $A\Sigma A' \neq 0$) mgf generator $\psi(\cdot)$ (identical to the mgf generator of the distribution of $x$).

Let us verify that $y$ has this distribution. Define $K = \operatorname{rank}\Sigma$. And suppose that $K > 0$ (or, equivalently, that $\Sigma \neq 0$), in which case
$$x \sim \mu + \Gamma'z,$$
where $\Gamma$ is any $K \times N$ (nonrandom) matrix such that $\Sigma = \Gamma'\Gamma$ and where $z$ is a $K \times 1$ random vector having a spherical distribution with a moment generating function $\omega(\cdot)$ defined (for every $K \times 1$ vector $s$ in a neighborhood of $0$) by $\omega(s) = \psi(s's)$. Then,
$$y \sim c + A(\mu + \Gamma'z) = c + A\mu + (\Gamma A')'z. \tag{9.72}$$
Now, let $K_* = \operatorname{rank}(A\Sigma A')$, and observe that $K_* \le K$ and that $A\Sigma A' = (\Gamma A')'\Gamma A'$. Further, suppose that $K_* > 0$ (or, equivalently, that $A\Sigma A' \neq 0$), take $\Gamma_*$ to be any $K_* \times M$ (nonrandom) matrix such that $A\Sigma A' = \Gamma_*'\Gamma_*$, and take $z_*$ to be a $K_* \times 1$ random vector having a distribution that is a marginal distribution of $z$ and that, consequently, has a moment generating function $\omega_*(\cdot)$ defined (for every $K_* \times 1$ vector $s_*$ in a neighborhood of $0$) by $\omega_*(s_*) = \psi(s_*'s_*)$. Then, it follows from what was established earlier (in defining elliptical distributions) that
$$c + A\mu + (\Gamma A')'z \sim c + A\mu + \Gamma_*'z_*,$$
which [in combination with result (9.72)] implies that $y$ has an elliptical distribution with mean $c + A\mu$, variance-covariance matrix $A\Sigma A'$, and mgf generator $\psi(\cdot)$. It remains only to observe that even in the "degenerate" case where $\Sigma = 0$ or, more generally, where $A\Sigma A' = 0$, $E(y) = c + A\mu$ and $\operatorname{var}(y) = A\Sigma A'$ (and to observe that the distribution of a random vector whose variance-covariance matrix equals a null matrix qualifies as an elliptical distribution).
Marginal distributions (of elliptically distributed random vectors). Let $x$ represent an $N \times 1$ random vector that has an ($N$-dimensional) elliptical distribution with mean $\mu$, nonnull variance-covariance matrix $\Sigma$, and mgf generator $\psi(\cdot)$. And take $x_*$ to be an $M$-dimensional subvector of $x$ (where $M < N$), say the subvector obtained by striking out all of the elements of $x$ save the $i_1, i_2, \ldots, i_M$th elements. Further, take $\mu_*$ to be the $M$-dimensional subvector of $\mu$ obtained by striking out all of the elements of $\mu$ save the $i_1, i_2, \ldots, i_M$th elements and $\Sigma_*$ to be the $M \times M$ submatrix of $\Sigma$ obtained by striking out all of the rows and columns of $\Sigma$ save the $i_1, i_2, \ldots, i_M$th rows and columns.

Consider the distribution of $x_*$. Clearly, $x_* = Ax$, where $A$ is the $M \times N$ matrix whose first, second, $\ldots$, $M$th rows are, respectively, the $i_1, i_2, \ldots, i_M$th rows of $I_N$. Thus, upon observing that $A\mu = \mu_*$ and that $A\Sigma A' = \Sigma_*$, it follows from the result of the preceding subsection (the subsection pertaining to linear transformation of elliptically distributed random vectors) that $x_*$ has an elliptical distribution with mean $\mu_*$, variance-covariance matrix $\Sigma_*$, and (if $\Sigma_* \neq 0$) mgf generator $\psi(\cdot)$ (identical to the mgf generator of $x$).
Suppose now that the $N \times 1$ observable random vector $y$ follows a general linear model and that
$$y = X\beta + e, \quad\text{where } e = [\Gamma(\theta)]'u, \tag{9.73}$$
for some $N \times N$ matrix $\Gamma(\theta)$ such that $V(\theta) = [\Gamma(\theta)]'\Gamma(\theta)$ and some $N \times 1$ random vector $u$ having an absolutely continuous spherical distribution with variance-covariance matrix $I$. The distribution of $u$ has a pdf $h(\cdot)$, where (for every value of $u$)
$$h(u) = c^{-1}\,g(u'u).$$
Here, $g(\cdot)$ is a nonnegative function (of a single nonnegative variable) such that $\int_0^\infty s^{N-1}g(s^2)\,ds < \infty$, and $c = [2\pi^{N/2}/\Gamma(N/2)]\int_0^\infty s^{N-1}g(s^2)\,ds$. As a consequence of supposition (9.73), $y$ has an elliptical distribution.

Let us consider the ML estimation of functions of the parameters of the general linear model (i.e., functions of the elements $\beta_1, \beta_2, \ldots, \beta_P$ of the vector $\beta$ and the elements $\theta_1, \theta_2, \ldots, \theta_T$ of the vector $\theta$). That topic was considered earlier (in Subsection a) in the special case where $e \sim N[0, V(\theta)]$—the special case in which $g(s^2) = \exp(-s^2/2)$ and $h(u) = (2\pi)^{-N/2}\exp(-\tfrac{1}{2}u'u)$, the pdf of the $N(0, I_N)$ distribution.
Let $f(\cdot\,; \beta, \theta)$ represent the pdf of the distribution of $y$, and denote by $y$ the observed value of $y$. Then, the likelihood function is the function, say $L(\beta, \theta; y)$, of $\beta$ and $\theta$ defined by $L(\beta, \theta; y) = f(y; \beta, \theta)$. Accordingly, it follows from result (9.70) that
$$L(\beta, \theta; y) = c^{-1}\,|V(\theta)|^{-1/2}\,g\{(y - X\beta)'[V(\theta)]^{-1}(y - X\beta)\}. \tag{9.74}$$
Maximum likelihood estimates of functions of $\beta$ and/or $\theta$ are obtained from values, say $\tilde\beta$ and $\tilde\theta$, of $\beta$ and $\theta$ at which $L(\beta, \theta; y)$ attains its maximum value: a maximum likelihood estimate of a function, say $r(\beta, \theta)$, of $\beta$ and/or $\theta$ is provided by the quantity $r(\tilde\beta, \tilde\theta)$ obtained by substituting $\tilde\beta$ and $\tilde\theta$ for $\beta$ and $\theta$.
Profile likelihood function. Now, suppose that the function $g(\cdot)$ is a strictly decreasing function (as in the special case where the distribution of $e$ is MVN). Then, for any particular value of $\theta$, the maximization of $L(\beta, \theta; y)$ with respect to $\beta$ is equivalent to the minimization of $(y - X\beta)'[V(\theta)]^{-1}(y - X\beta)$ with respect to $\beta$. Thus, upon regarding the value of $\theta$ as "fixed," upon recalling (from Part 3 of Subsection a) that the linear system
$$X'[V(\theta)]^{-1}Xb = X'[V(\theta)]^{-1}y \tag{9.75}$$
(in the $P \times 1$ vector $b$) is consistent, and upon employing the same line of reasoning as in Part 3 of Subsection a, we find that $L(\beta, \theta; y)$ attains its maximum value at a value $\tilde\beta(\theta)$ of $\beta$ if and only if $\tilde\beta(\theta)$ is a solution to linear system (9.75) or, equivalently, if and only if
$$X'[V(\theta)]^{-1}X\tilde\beta(\theta) = X'[V(\theta)]^{-1}y,$$
in which case
$$L[\tilde\beta(\theta), \theta; y] = c^{-1}\,|V(\theta)|^{-1/2}\,g\bigl\{[y - X\tilde\beta(\theta)]'[V(\theta)]^{-1}[y - X\tilde\beta(\theta)]\bigr\} \tag{9.76}$$
$$= c^{-1}\,|V(\theta)|^{-1/2}\,g\bigl\{y'[V(\theta)]^{-1}y - [\tilde\beta(\theta)]'X'[V(\theta)]^{-1}y\bigr\} \tag{9.77}$$
$$= c^{-1}\,|V(\theta)|^{-1/2}\,g\bigl(y'\bigl\{[V(\theta)]^{-1} - [V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^{-}X'[V(\theta)]^{-1}\bigr\}y\bigr). \tag{9.78}$$
Accordingly, the function $L_*(\theta; y)$ of $\theta$ defined by $L_*(\theta; y) = L[\tilde\beta(\theta), \theta; y]$ is a profile likelihood function.
Values, say $\tilde\beta$ and $\tilde\theta$ (of $\beta$ and $\theta$, respectively), at which $L(\beta, \theta; y)$ attains its maximum value can be obtained by taking $\tilde\theta$ to be a value at which the profile likelihood function $L_*(\theta; y)$ attains its maximum value and by then taking $\tilde\beta$ to be a solution to the linear system
$$X'[V(\tilde\theta)]^{-1}Xb = X'[V(\tilde\theta)]^{-1}y.$$
Except for relatively simple special cases, the maximization of $L_*(\theta; y)$ must be accomplished numerically via an iterative procedure.
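By way of illustration, the following sketch shows one way such an iterative maximization might be organized in the Gaussian special case $g(u) \propto e^{-u/2}$. The variance structure $V(\theta) = \theta_1 I + \theta_2 ZZ'$ and the simulated data are our own illustrative assumptions, not part of the text.

```python
# A minimal sketch of numerical profile-likelihood maximization in the
# Gaussian special case, assuming V(theta) = theta1*I + theta2*Z Z'.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N = 40
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Z = rng.normal(size=(N, 3))
y = X @ np.array([1.0, 2.0]) + Z @ rng.normal(size=3) + rng.normal(size=N)

def neg_profile_loglik(log_theta):
    t1, t2 = np.exp(log_theta)                # enforce positivity
    V = t1 * np.eye(N) + t2 * (Z @ Z.T)
    Vinv = np.linalg.inv(V)
    # beta_tilde(theta): solution of system (9.75)
    beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
    r = y - X @ beta
    _, logdet = np.linalg.slogdet(V)
    return 0.5 * (logdet + r @ Vinv @ r)      # -log L*(theta; y) + const

res = minimize(neg_profile_loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
print(np.exp(res.x))                          # ML estimate theta_tilde
```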
REML variant. REML is a variant of ML in which inferences about functions of $\theta$ are based on the likelihood function associated with a vector of so-called error contrasts. REML was introduced and discussed in an earlier subsection (Subsection b) under the assumption that the distribution of $e$ is MVN. Let us consider REML in the present, more general context (where the distribution of $e$ is taken to be elliptical).

Let $R$ represent an $N \times (N - \operatorname{rank} X)$ matrix (of constants) of full column rank $N - \operatorname{rank} X$ such that $X'R = 0$, and take $z$ to be the $(N - \operatorname{rank} X) \times 1$ vector defined by $z = R'y$. Note that $z = R'e$ and hence that the distribution of $z$ does not depend on $\beta$. Further, let $k(\cdot\,; \theta)$ represent the pdf of the distribution of $z$, and take $L(\theta; R'y)$ to be the function of $\theta$ defined (for $\theta \in \Theta$) by $L(\theta; R'y) = k(R'y; \theta)$. The function $L(\theta; R'y)$ is a likelihood function; it is the likelihood function obtained by regarding the value of $z$ as the data vector.
Now, suppose that the ($N$-dimensional spherical) distribution of the random vector $u$ [in expression (9.73)] has a moment generating function, say $\omega(\cdot)$. Then, necessarily, there exists a (nonnegative) function $\psi(\cdot)$ (of a single nonnegative variable) such that (for every $N \times 1$ vector $t$ in a neighborhood of $0$) $\omega(t) = \psi(t't)$. And in light of the results of Subsection c, it follows that $z$ has an [$(N - \operatorname{rank} X)$-dimensional] elliptical distribution with mean $0$, variance-covariance matrix $R'V(\theta)R$, and mgf generator $\psi(\cdot)$. Further,
$$z \sim [\Gamma_*(\theta)]'u_*,$$
where $\Gamma_*(\theta)$ is any $(N - \operatorname{rank} X) \times (N - \operatorname{rank} X)$ matrix such that $R'V(\theta)R = [\Gamma_*(\theta)]'\Gamma_*(\theta)$ and where $u_*$ is an $(N - \operatorname{rank} X) \times 1$ random vector whose distribution is spherical with variance-covariance matrix $I$ and with moment generating function $\omega_*(\cdot)$ defined [for every $(N - \operatorname{rank} X) \times 1$ vector $t_*$ in a neighborhood of $0$] by $\omega_*(t_*) = \psi(t_*'t_*)$—the distribution of $u_*$ is a marginal distribution of $u$.

The distribution of $u_*$ is absolutely continuous with a pdf $h_*(\cdot)$ that (at least in principle) is determinable from the pdf of the distribution of $u$ and that is expressible in the form
$$h_*(u_*) = c_*^{-1}\,g_*(u_*'u_*),$$
where $g_*(\cdot)$ is a nonnegative function (of a single nonnegative variable) such that $\int_0^\infty s^{N-\operatorname{rank}(X)-1}g_*(s^2)\,ds < \infty$ and where $c_*$ is a strictly positive constant. Necessarily,
$$c_* = \frac{2\pi^{(N-\operatorname{rank} X)/2}}{\Gamma[(N-\operatorname{rank} X)/2]}\int_0^\infty s^{N-\operatorname{rank}(X)-1}g_*(s^2)\,ds.$$
Thus, in light of result (9.70), the distribution of $z$ is absolutely continuous with a pdf $k(\cdot\,; \theta)$ that is expressible as
$$k(z; \theta) = c_*^{-1}\,|R'V(\theta)R|^{-1/2}\,g_*\{z'[R'V(\theta)R]^{-1}z\}.$$
And it follows that the REML likelihood function is expressible as
$$L(\theta; R'y) = c_*^{-1}\,|R'V(\theta)R|^{-1/2}\,g_*\{y'R[R'V(\theta)R]^{-1}R'y\}. \tag{9.79}$$
As in the special case where the distribution of $e$ is MVN, an alternative expression for $L(\theta; R'y)$ can be obtained by taking advantage of identities (9.32) and (9.37). Taking $X_*$ to be any $N \times (\operatorname{rank} X)$ matrix whose columns form a basis for $\mathcal{C}(X)$, we find that
$$L(\theta; R'y) = c_*^{-1}\,|R'R|^{-1/2}\,|X_*'X_*|^{1/2}\,|V(\theta)|^{-1/2}\,|X_*'[V(\theta)]^{-1}X_*|^{-1/2}\, g_*\bigl(y'\bigl\{[V(\theta)]^{-1} - [V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^{-}X'[V(\theta)]^{-1}\bigr\}y\bigr).$$
Alternative versions of this expression can be obtained by replacing the argument of the function $g_*(\cdot)$ with expression (9.40) or expression (9.41).

As in the special case where the distribution of $e$ is MVN, $L(\theta; R'y)$ depends on the choice of the matrix $R$ only through the multiplicative constant $|R'R|^{-1/2}$. In some special cases, including that where the distribution of $e$ is MVN, the function $g_*(\cdot)$ differs from the function $g(\cdot)$ by no more than a multiplicative constant. However, in general, the relationship between $g(\cdot)$ and $g_*(\cdot)$ is more complex.
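A corresponding numerical sketch of the REML criterion (9.79) in the Gaussian special case [$g_*(u) \propto e^{-u/2}$] is given below. The choice of $R$ as an orthonormal basis for the null space of $X'$ and the variance structure $V(\theta) = \theta_1 I + \theta_2 ZZ'$ are our own illustrative assumptions.

```python
# A minimal sketch of REML estimation via (9.79) in the Gaussian case.
import numpy as np
from scipy.linalg import null_space
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N = 40
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Z = rng.normal(size=(N, 3))
y = X @ np.array([1.0, 2.0]) + Z @ rng.normal(size=3) + rng.normal(size=N)

def neg_reml_loglik(log_theta):
    t1, t2 = np.exp(log_theta)
    V = t1 * np.eye(N) + t2 * (Z @ Z.T)
    R = null_space(X.T)                        # N x (N - rank X), X'R = 0
    W = R.T @ V @ R                            # R'V(theta)R
    _, logdet = np.linalg.slogdet(W)
    q = y @ R @ np.linalg.solve(W, R.T @ y)    # y'R [R'V(theta)R]^{-1} R'y
    return 0.5 * (logdet + q)                  # -log L(theta; R'y) + const

res = minimize(neg_reml_loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
print(np.exp(res.x))                           # REML estimate of theta
```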
5.10 Prediction
a. Some general results
Let $y$ represent an $N \times 1$ observable random vector. And consider the use of $y$ in predicting an unobservable random variable or, more generally, an unobservable random vector, say an $M \times 1$ unobservable random vector $w = (w_1, w_2, \ldots, w_M)'$. That is, consider the use of the observed value of $y$ (the so-called data vector) in making inferences about an unobservable quantity that can be regarded as a realization (i.e., sample value) of $w$. Here, an unobservable quantity is a quantity that is unobservable at the time the inferences are to be made; it may become observable at some future time (as suggested by the use of the word prediction). In the present section, the focus is on obtaining a point estimate of the unobservable quantity; that is, on what might be deemed a point prediction.

Suppose that the second-order moments of the joint distribution of $w$ and $y$ exist. And adopt the following notation: $\mu_y = E(y)$, $\mu_w = E(w)$, $V_y = \operatorname{var}(y)$, $V_{yw} = \operatorname{cov}(y, w)$, and $V_w = \operatorname{var}(w)$. Further, in the special case $M = 1$ (where $w$ is scalar-valued), let us write $v_{yw}$ and $v_w$ for $V_{yw}$ and $V_w$ (which are then an $N \times 1$ vector and a scalar, respectively).

It is informative to consider the prediction of $w$ under each of the following states of knowledge: (1) the joint distribution of $y$ and $w$ is known; (2) only $\mu_y$, $\mu_w$, $V_y$, $V_{yw}$, and $V_w$ are known; and (3) only $V_y$, $V_{yw}$, and $V_w$ are known.
Let $\tilde w(y)$ represent an ($M \times 1$)-dimensional vector-valued function of $y$ that qualifies as a (point) predictor of $w$—in the special case where $M = 1$, the predictor is scalar-valued. That $\tilde w(y)$ qualifies as a predictor implies that the vector-valued function $\tilde w(\cdot)$ depends on the joint distribution of $y$ and $w$ (if at all) only through characteristics of the joint distribution that are known.

The difference $\tilde w(y) - w$ is referred to as the prediction error. The predictor $\tilde w(y)$ is said to be unbiased if $E[\tilde w(y) - w] = 0$, that is, if the expected value of the prediction error equals $0$, or, equivalently, if $E[\tilde w(y)] = \mu_w$, that is, if the expected value of the predictor is the same as that of the random vector $w$ whose realization is being predicted.
Attention is sometimes restricted to linear predictors. An ($M \times 1$)-dimensional vector-valued function $t(y)$ of $y$ is said to be linear if it is expressible in the form
$$t(y) = c + A'y, \tag{10.1}$$
where $c$ is an $M \times 1$ vector of constants and $A$ is an $N \times M$ matrix of constants. A vector-valued function $t(y)$ that is expressible in the form (10.1) is regarded as linear even if the vector $c$ and the matrix $A$ depend on the joint distribution of $y$ and $w$—the linearity reflects the nature of the dependence on the value of $y$, not the nature of any dependence on the joint distribution. And it qualifies as a predictor if any dependence on the joint distribution of $y$ and $w$ is confined to characteristics of the joint distribution that are known.

The $M \times M$ matrix $E\{[\tilde w(y) - w][\tilde w(y) - w]'\}$ is referred to as the mean-squared-error (MSE) matrix of the predictor $\tilde w(y)$. If $\tilde w(y)$ is an unbiased predictor (of $w$), then
$$E\{[\tilde w(y) - w][\tilde w(y) - w]'\} = \operatorname{var}[\tilde w(y) - w].$$
That is, the MSE matrix of an unbiased predictor equals the variance-covariance matrix of its prediction error (not the variance-covariance matrix of the predictor itself). Note that in the special case where $M = 1$, the MSE matrix has only one element, which is expressible as $E\{[\tilde w(y) - w]^2\}$ and which is referred to simply as the mean squared error (MSE) of the (scalar-valued) predictor $\tilde w(y)$.
State (1): joint distribution known. Suppose that the joint distribution of $y$ and $w$ is known or that, at the very least, enough is known about the joint distribution to determine the conditional expected value $E(w \mid y)$ of $w$ given $y$.

Now, let $t(y)$ represent "any" ($M \times 1$)-dimensional vector-valued function of $y$. Then, upon observing that
$$[t(y) - w][t(y) - w]' = \{t(y) - E(w \mid y) + [E(w \mid y) - w]\}\{t(y) - E(w \mid y) + [E(w \mid y) - w]\}'$$
$$= [t(y) - E(w \mid y)][t(y) - E(w \mid y)]' + [E(w \mid y) - w][E(w \mid y) - w]'$$
$$\quad + [t(y) - E(w \mid y)][E(w \mid y) - w]' + \{[t(y) - E(w \mid y)][E(w \mid y) - w]'\}'$$
and that the conditional (given $y$) expected value of each of the two cross-product terms equals $0$ (with probability 1), we find that
$$E\{[t(y) - w][t(y) - w]' \mid y\} = [t(y) - E(w \mid y)][t(y) - E(w \mid y)]' + E\{[E(w \mid y) - w][E(w \mid y) - w]' \mid y\} \tag{10.5}$$
(with probability 1).

Result (10.5) implies that $E(w \mid y)$ is an optimal predictor of $w$. It is optimal in the sense that the difference $E\{[t(y) - w][t(y) - w]' \mid y\} - E\{[E(w \mid y) - w][E(w \mid y) - w]' \mid y\}$ between the conditional (given $y$) MSE matrix of an arbitrary predictor $t(y)$ and that of $E(w \mid y)$ equals (with probability 1) the matrix $[t(y) - E(w \mid y)][t(y) - E(w \mid y)]'$, which is nonnegative definite and which equals $0$ if and only if $t(y) - E(w \mid y) = 0$ or, equivalently, if and only if $t(y) = E(w \mid y)$. In the special case where $M = 1$, we have that [for an arbitrary predictor $t(y)$]
$$E\{[t(y) - w]^2 \mid y\} - E\{[E(w \mid y) - w]^2 \mid y\} = [t(y) - E(w \mid y)]^2 \ge 0 \quad\text{(with probability 1).}$$

Clearly, $E(w \mid y)$ is an unbiased predictor; in fact, the expected value of its prediction error equals $0$ conditionally on $y$ (albeit with probability 1) as well as unconditionally [as is evident from result (10.2)]. Whether or not $E(w \mid y)$ is a linear predictor (or, more generally, equal to a linear predictor with probability 1) depends on the form of the joint distribution of $y$ and $w$; a sufficient (but not a necessary) condition for $E(w \mid y)$ to be linear (or, at least, "linear with probability 1") is that the joint distribution of $y$ and $w$ be MVN.
State (2): only the means and the variances and covariances are known. Suppose that $\mu_y$, $\mu_w$, $V_y$, $V_{yw}$, and $V_w$ are known, but that nothing else is known about the joint distribution of $y$ and $w$. Then, $E(w \mid y)$ is not determinable from what is known, forcing us to look elsewhere for a predictor of $w$.

Assume (for the sake of simplicity) that $V_y$ is nonsingular. And consider the predictor
$$\mu(y) = \mu_w + V_{yw}'V_y^{-1}(y - \mu_y) = \delta + V_{yw}'V_y^{-1}y,$$
where $\delta = \mu_w - V_{yw}'V_y^{-1}\mu_y$. Clearly, $\mu(y)$ is linear; it is also unbiased. Now, consider its MSE matrix $E\{[\mu(y) - w][\mu(y) - w]'\}$ or, equivalently, the variance-covariance matrix $\operatorname{var}[\mu(y) - w]$ of its prediction error. Let us compare the MSE matrix of $\mu(y)$ with the MSE matrices of other linear predictors.

Let $t(y)$ represent an ($M \times 1$)-dimensional vector-valued function of $y$ of the form $t(y) = c + A'y$, where $c$ is an $M \times 1$ vector of constants and $A$ an $N \times M$ matrix of constants. Further, decompose the difference between $t(y)$ and $w$ into two components as follows:
$$t(y) - w = [t(y) - \mu(y)] + [\mu(y) - w]. \tag{10.6}$$
And observe that
$$\operatorname{cov}[y, \mu(y) - w] = \operatorname{cov}(y,\; V_{yw}'V_y^{-1}y - w) = V_y[V_{yw}'V_y^{-1}]' - V_{yw} = 0. \tag{10.7}$$
Consider the decomposition
$$\mu(y) - w = [\mu(y) - E(w \mid y)] + [E(w \mid y) - w]. \tag{10.11}$$
Since the prediction error $\mu(y) - w$ and the component $E(w \mid y) - w$ each have an expected value of $0$, the component $\mu(y) - E(w \mid y)$ also has an expected value of $0$. Moreover, it follows from result (10.3) that
$$E\{[\mu(y) - E(w \mid y)][E(w \mid y) - w]' \mid y\} = 0 \quad\text{(with probability 1),} \tag{10.12}$$
implying that
$$E\{[\mu(y) - E(w \mid y)][E(w \mid y) - w]'\} = 0 \tag{10.13}$$
and hence that the two components $\mu(y) - E(w \mid y)$ and $E(w \mid y) - w$ of decomposition (10.11) are uncorrelated. And upon applying result (10.5) [with $t(y) = \mu(y)$], we find that
$$E\{[\mu(y) - w][\mu(y) - w]' \mid y\} = [\mu(y) - E(w \mid y)][\mu(y) - E(w \mid y)]' + E\{[E(w \mid y) - w][E(w \mid y) - w]' \mid y\} \tag{10.14}$$
(with probability 1).
Equality (10.14) serves to decompose the conditional (on $y$) MSE matrix of the best linear predictor $\mu(y)$ into two components, corresponding to the two components of the decomposition (10.11) of the prediction error of $\mu(y)$. The (unconditional) MSE matrix of $\mu(y)$ or, equivalently, the (unconditional) variance-covariance matrix of $\mu(y) - w$ lends itself to a similar decomposition. We find that
$$\operatorname{var}[\mu(y) - w] = \operatorname{var}[\mu(y) - E(w \mid y)] + \operatorname{var}[E(w \mid y) - w]$$
$$= \operatorname{var}[\mu(y) - E(w \mid y)] + E[\operatorname{var}(w \mid y)]. \tag{10.15}$$
Of the two components of the prediction error of $\mu(y)$, the second component $E(w \mid y) - w$ can be regarded as an "inherent" component. It is inherent in the sense that it is an error that would be incurred even if enough were known about the joint distribution of $y$ and $w$ that $E(w \mid y)$ were determinable and were employed as the predictor. The first component $\mu(y) - E(w \mid y)$ of the prediction error can be regarded as a "nonlinearity" component; it equals $0$ if and only if $E(w \mid y) = c + A'y$ for some vector $c$ of constants and some matrix $A$ of constants.

The variance-covariance matrix of the prediction error of $\mu(y)$ is expressible as
$$\operatorname{var}[\mu(y) - w] = \operatorname{var}(V_{yw}'V_y^{-1}y - w)$$
$$= V_w + V_{yw}'V_y^{-1}V_y(V_{yw}'V_y^{-1})' - V_{yw}'V_y^{-1}V_{yw} - [V_{yw}'V_y^{-1}V_{yw}]'$$
$$= V_w - V_{yw}'V_y^{-1}V_{yw}. \tag{10.16}$$
It differs from the variance-covariance matrix of $\mu(y)$; the latter variance-covariance matrix is expressible as
$$\operatorname{var}[\mu(y)] = \operatorname{var}(V_{yw}'V_y^{-1}y) = V_{yw}'V_y^{-1}V_y[V_{yw}'V_y^{-1}]' = V_{yw}'V_y^{-1}V_{yw}.$$
In fact, $\operatorname{var}[\mu(y) - w]$ and $\operatorname{var}[\mu(y)]$ are the first and second components in the following decomposition of $\operatorname{var}(w)$:
$$\operatorname{var}(w) = \operatorname{var}[\mu(y) - w] + \operatorname{var}[\mu(y)].$$
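These identities are easily verified numerically. The sketch below (Python; the covariance matrices are arbitrary illustrative choices of ours) computes $\operatorname{var}[\mu(y) - w]$ from formula (10.16) and $\operatorname{var}[\mu(y)]$ from its companion, and confirms that they sum to $\operatorname{var}(w)$.

```python
# Numeric illustration of (10.16) and of the decomposition
# var(w) = var[mu(y) - w] + var[mu(y)]; the joint moments are arbitrary.
import numpy as np

Vy  = np.array([[2.0, 0.5, 0.2], [0.5, 1.5, 0.3], [0.2, 0.3, 1.0]])
Vyw = np.array([[0.7, 0.1], [0.2, 0.4], [0.0, 0.3]])
Vw  = np.array([[1.0, 0.2], [0.2, 0.8]])

mse_pred = Vw - Vyw.T @ np.linalg.solve(Vy, Vyw)   # var[mu(y) - w], (10.16)
var_pred = Vyw.T @ np.linalg.solve(Vy, Vyw)        # var[mu(y)]
print(np.allclose(Vw, mse_pred + var_pred))        # True: decomposition of var(w)
```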
The best linear predictor $\mu(y)$ can be regarded as an approximation to $E(w \mid y)$. The expected value $E[\mu(y) - E(w \mid y)]$ of the error of this approximation equals $0$. Note that $\operatorname{var}[\mu(y) - E(w \mid y)] = E\{[\mu(y) - E(w \mid y)][\mu(y) - E(w \mid y)]'\}$. Further, $\mu(y)$ is the best linear approximation to $E(w \mid y)$ in the sense that, for any ($M \times 1$)-dimensional vector-valued function $t(y)$ of the form $t(y) = c + A'y$, the difference between the matrix $E\{[t(y) - E(w \mid y)][t(y) - E(w \mid y)]'\}$ and the matrix $\operatorname{var}[\mu(y) - E(w \mid y)]$ equals the matrix $E\{[t(y) - \mu(y)][t(y) - \mu(y)]'\}$, which is nonnegative definite and which equals $0$ if and only if $t(y) = \mu(y)$. This result follows from what has already been established (in regard to the best linear prediction of $w$) upon observing [in light of result (10.5)] that
$$E\{[t(y) - w][t(y) - w]'\} = E\{[t(y) - E(w \mid y)][t(y) - E(w \mid y)]'\} + E[\operatorname{var}(w \mid y)],$$
which in combination with result (10.15) implies that the difference between the two matrices $E\{[t(y) - E(w \mid y)][t(y) - E(w \mid y)]'\}$ and $\operatorname{var}[\mu(y) - E(w \mid y)]$ is the same as that between the two matrices $E\{[t(y) - w][t(y) - w]'\}$ and $\operatorname{var}[\mu(y) - w]$. In the special case where $M = 1$, we have [for any function $t(y)$ of $y$ of the form $t(y) = c + a'y$ (where $c$ is a constant and $a$ an $N \times 1$ vector of constants)] that
$$E\{[t(y) - E(w \mid y)]^2\} \ge \operatorname{var}[\mu(y) - E(w \mid y)] = E\{[\mu(y) - E(w \mid y)]^2\}, \tag{10.17}$$
with equality holding if and only if $t(y) = \mu(y)$.
Let us now specialize to linear predictors. Let us write $\tilde w_L(y)$ for a linear predictor of $w$ and $\tilde\delta_L(y)$ for the corresponding estimator of $\delta$ [which, like $\tilde w_L(y)$, is linear]. Then, in light of results (10.22) and (10.8),
$$E\{[\tilde\delta_L(y) - \delta][\mu(y) - w]'\} = E\{[\tilde w_L(y) - \mu(y)][\mu(y) - w]'\} = 0. \tag{10.24}$$
And making use of results (10.23) and (10.16), it follows that
$$E\{[\tilde w_L(y) - w][\tilde w_L(y) - w]'\} = E\{[\tilde\delta_L(y) - \delta][\tilde\delta_L(y) - \delta]'\} + V_w - V_{yw}'V_y^{-1}V_{yw}. \tag{10.25}$$
Let $\mathcal{L}_p$ represent a collection of linear predictors of $w$. And let $\mathcal{L}_e$ represent the collection of (linear) estimators of $\delta$ that correspond to the predictors in $\mathcal{L}_p$. Then, for a predictor, say $\hat w_L(y)$, in the collection $\mathcal{L}_p$ to be best in the sense that, for every predictor $\tilde w_L(y)$ in $\mathcal{L}_p$, the matrix $E\{[\tilde w_L(y) - w][\tilde w_L(y) - w]'\} - E\{[\hat w_L(y) - w][\hat w_L(y) - w]'\}$ is nonnegative definite, it is necessary and sufficient that
$$\hat w_L(y) = \hat\delta_L(y) + V_{yw}'V_y^{-1}y$$
for some estimator $\hat\delta_L(y)$ in $\mathcal{L}_e$ that is best in the sense that, for every estimator $\tilde\delta_L(y)$ in $\mathcal{L}_e$, the matrix $E\{[\tilde\delta_L(y) - \delta][\tilde\delta_L(y) - \delta]'\} - E\{[\hat\delta_L(y) - \delta][\hat\delta_L(y) - \delta]'\}$ is nonnegative definite. In general, there may or may not be an estimator that is best in such a sense; the existence of such an estimator depends on the nature of the collection $\mathcal{L}_e$ and on any assumptions that may be made about $\mu_y$ and $\mu_w$.

If $\mathcal{L}_p$ is the collection of all linear unbiased predictors of $w$, then $\mathcal{L}_e$ is the collection of all linear unbiased estimators of $\delta$. As previously indicated (in Section 5.5a), it is customary to refer to an estimator that is best among linear unbiased estimators as a BLUE (an acronym for best linear unbiased estimator or estimation). Similarly, a predictor that is best among linear unbiased predictors is customarily referred to as a BLUP (an acronym for best linear unbiased predictor or prediction).
The prediction error of the predictor $\tilde w(y)$ can be decomposed into three components by starting with decomposition (10.23) and by expanding the component $\mu(y) - w$ into two components on the basis of decomposition (10.11). As specialized to the linear predictor $\tilde w_L(y)$, the resultant decomposition is
$$\tilde w_L(y) - w = [\tilde\delta_L(y) - \delta] + [\mu(y) - E(w \mid y)] + [E(w \mid y) - w]. \tag{10.26}$$
Recall (from the preceding part of the present subsection) that $\mu(y) - E(w \mid y)$ and $E(w \mid y) - w$ [which are the 2nd and 3rd components of decomposition (10.26)] are uncorrelated and that each has an expected value of $0$. Moreover, it follows from result (10.3) that $\tilde\delta_L(y) - \delta$ is uncorrelated with $E(w \mid y) - w$ and from result (10.7) that it is uncorrelated with $\mu(y) - w$ and hence uncorrelated with $\mu(y) - E(w \mid y)$ [which is expressible as the difference between $\mu(y) - w$ and $E(w \mid y) - w$]. Thus, all three components of decomposition (10.26) are uncorrelated. Expanding on the terminology introduced in the preceding part of the present subsection, the first, second, and third components of decomposition (10.26) can be regarded, respectively, as an "unknown-means" component, a "nonlinearity" component, and an "inherent" component.

Corresponding to decomposition (10.26) of the prediction error of $\tilde w_L(y)$ is the following decomposition of the MSE matrix of $\tilde w_L(y)$:
$$E\{[\tilde w_L(y) - w][\tilde w_L(y) - w]'\} = E\{[\tilde\delta_L(y) - \delta][\tilde\delta_L(y) - \delta]'\} + \operatorname{var}[\mu(y) - E(w \mid y)] + \operatorname{var}[E(w \mid y) - w]. \tag{10.27}$$
In the special case where $M = 1$, this decomposition can be reexpressed as follows:
$$E\{[\tilde w_L(y) - w]^2\} = E\{[\tilde\delta_L(y) - \delta]^2\} + \operatorname{var}[\mu(y) - E(w \mid y)] + \operatorname{var}[E(w \mid y) - w].$$
In taking $\tilde\delta(y)$ to be an estimator of $\delta$ and regarding $\tilde w(y)$ as a predictor of $w$, it is implicitly assumed that the functions $\tilde\delta(\cdot)$ and $\tilde w(\cdot)$ depend on the joint distribution of $y$ and $w$ only through $V_y$, $V_{yw}$, and $V_w$. In practice, the dependence may only be through the elements of the matrix $V_{yw}'V_y^{-1}$ and through various functions of the elements of $V_y$, in which case $\tilde\delta(y)$ may qualify as an estimator and $\tilde w(y)$ as a predictor even in the absence of complete knowledge of $V_y$, $V_{yw}$, and $V_w$.

As defined and discussed in Sections 5.2 and 5.6b, translation equivariance is a criterion that is applicable to estimators of a linear combination of the elements of $\beta$ or, more generally, to estimators of a vector of such linear combinations. This criterion can also be applied to predictors. A predictor $\tilde w(y)$ of the random vector $w$ (the expected value of which is $\Lambda'\beta$) is said to be translation equivariant if $\tilde w(y + Xk) = \tilde w(y) + \Lambda'k$ for every $P \times 1$ vector $k$ (and for every value of $y$). Clearly, $\tilde w(y)$ is a translation-equivariant predictor of $w$ if and only if it is a translation-equivariant estimator of the expected value $\Lambda'\beta$ of $w$.
Special case: Aitken and G–M models. Let us now specialize to the case where $y$ follows an Aitken model. Under the Aitken model, $\operatorname{var}(y)$ is an unknown scalar multiple $\sigma^2$ of a known (nonnegative definite) matrix $H$. It is convenient (and potentially useful) to consider the prediction of $w$ under the assumption that $\operatorname{var}\binom{y}{w}$ is also an unknown scalar multiple of a known (nonnegative definite) matrix. Accordingly, it is supposed that $\operatorname{cov}(y, w)$ and $\operatorname{var}(w)$ are of the form
$$\operatorname{cov}(y, w) = \sigma^2 H_{yw} \quad\text{and}\quad \operatorname{var}(w) = \sigma^2 H_w,$$
where $H_{yw}$ and $H_w$ are known matrices. Thus, writing $H_y$ for $H$, the setup is such that
$$\operatorname{var}\begin{pmatrix} y \\ w \end{pmatrix} = \sigma^2\begin{pmatrix} H_y & H_{yw} \\ H_{yw}' & H_w \end{pmatrix}.$$
As in the general case, it is supposed that
$$E(w) = \Lambda'\beta$$
(where $\Lambda$ is a known matrix).

The setup can be regarded as a special case of the more general setup where $y$ follows a general linear model and where $E(w)$ is of the form (10.28) and $\operatorname{cov}(y, w)$ and $\operatorname{var}(w)$ of the form (10.29). Specifically, it can be regarded as the special case where $\theta$ is the one-dimensional vector whose only element is $\theta_1 = \sigma^2$, where $\Theta = \{\theta : \theta_1 > 0\}$, and where $V_y(\theta) = \theta_1 H_y$, $V_{yw}(\theta) = \theta_1 H_{yw}$, and $V_w(\theta) = \theta_1 H_w$. Clearly, in this special case, $\operatorname{var}\binom{y}{w}$ is known up to the value of the unknown scalar multiple $\theta_1 = \sigma^2$.

In the further special case where $y$ follows a G–M model [i.e., where $\operatorname{var}(y) = \sigma^2 I$], $H_y = I$. When $H_y = I$, the case where $H_{yw} = 0$ and $H_w = I$ is often singled out for special attention. The case where $H_y = I$, $H_{yw} = 0$, and $H_w = I$ is encountered in applications where the realization of $w$ corresponds to a vector of future data points and where the augmented vector $\binom{y}{w}$ is assumed to follow a G–M model, the model matrix of which is $\binom{X}{\Lambda'}$.
Best linear unbiased prediction (under a G–M model). Suppose that the $N \times 1$ observable random vector $y$ follows a G–M model, in which case $E(y) = X\beta$ and $\operatorname{var}(y) = \sigma^2 I$. And consider the prediction of the $M \times 1$ unobservable random vector $w$ whose expected value is of the form
$$E(w) = \Lambda'\beta \tag{10.32}$$
(where $\Lambda$ is a matrix of known constants). Assume that $\operatorname{cov}(y, w)$ and $\operatorname{var}(w)$ are of the form
$$\operatorname{cov}(y, w) = \sigma^2 H_{yw} \quad\text{and}\quad \operatorname{var}(w) = \sigma^2 H_w \tag{10.33}$$
(where $H_{yw}$ and $H_w$ are known matrices). Assume also that $w$ is predictable or, equivalently, that $\Lambda'\beta$ is estimable.

For purposes of applying the results of the final part of the preceding subsection (Subsection a), take $\delta$ to be the $M \times 1$ vector (of linear combinations of the elements of $\beta$) defined as follows:
$$\delta = (\Lambda' - H_{yw}'X)\beta. \tag{10.34}$$
Clearly, $\delta$ is estimable, and its least squares estimator is the vector $\hat\delta_L(y)$ defined as follows:
$$\hat\delta_L(y) = (\Lambda' - H_{yw}'X)(X'X)^{-}X'y = \Lambda'(X'X)^{-}X'y - H_{yw}'P_X y \tag{10.35}$$
[where $P_X = X(X'X)^{-}X'$]. Moreover, according to Theorem 5.6.1, $\hat\delta_L(y)$ is a linear unbiased estimator of $\delta$ and, in fact, is the BLUE (best linear unbiased estimator) of $\delta$. It is the BLUE in the sense that the difference between the MSE matrix $E\{[\tilde\delta_L(y) - \delta][\tilde\delta_L(y) - \delta]'\} = \operatorname{var}[\tilde\delta_L(y)]$ of an arbitrary linear unbiased estimator $\tilde\delta_L(y)$ of $\delta$ and the MSE matrix $E\{[\hat\delta_L(y) - \delta][\hat\delta_L(y) - \delta]'\} = \operatorname{var}[\hat\delta_L(y)]$ of the least squares estimator $\hat\delta_L(y)$ is nonnegative definite [and is equal to $0$ if and only if $\tilde\delta_L(y) = \hat\delta_L(y)$].

Now, let
$$\hat w_L(y) = \hat\delta_L(y) + [\operatorname{cov}(y, w)]'[\operatorname{var}(y)]^{-1}y = \Lambda'(X'X)^{-}X'y + H_{yw}'(I - P_X)y. \tag{10.36}$$
Then, it follows from the results of the final part of Subsection a that $\hat w_L(y)$ is a linear unbiased predictor of $w$ and, in fact, is the BLUP (best linear unbiased predictor) of $w$. It is the BLUP in the sense that the difference between the MSE matrix $E\{[\tilde w_L(y) - w][\tilde w_L(y) - w]'\} = \operatorname{var}[\tilde w_L(y) - w]$ of an arbitrary linear unbiased predictor $\tilde w_L(y)$ of $w$ and the MSE matrix $E\{[\hat w_L(y) - w][\hat w_L(y) - w]'\} = \operatorname{var}[\hat w_L(y) - w]$ of $\hat w_L(y)$ is nonnegative definite [and is equal to $0$ if and only if $\tilde w_L(y) = \hat w_L(y)$]. In the special case where $M = 1$, the sense in which $\hat w_L(y)$ is the BLUP can be restated as follows: the MSE of $\hat w_L(y)$ [or, equivalently, the variance of the prediction error of $\hat w_L(y)$] is smaller than that of any other linear unbiased predictor of $w$.

In light of result (6.7), the variance-covariance matrix of the least squares estimator of $\delta$ is
$$\operatorname{var}[\hat\delta_L(y)] = \sigma^2(\Lambda' - H_{yw}'X)(X'X)^{-}(\Lambda - X'H_{yw}). \tag{10.37}$$
Accordingly, it follows from result (10.25) that the MSE matrix of the BLUP of $w$ or, equivalently, the variance-covariance matrix of the prediction error of the BLUP is
$$\operatorname{var}[\hat w_L(y) - w] = \sigma^2(\Lambda' - H_{yw}'X)(X'X)^{-}(\Lambda - X'H_{yw}) + \sigma^2(H_w - H_{yw}'H_{yw}). \tag{10.38}$$
In the special case where $H_{yw} = 0$, we find that $\delta = \Lambda'\beta$, $\hat w_L(y) = \hat\delta_L(y)$, $\operatorname{var}[\hat\delta_L(y)] = \sigma^2\Lambda'(X'X)^{-}\Lambda$, and
$$\operatorname{var}[\hat w_L(y) - w] = \sigma^2\Lambda'(X'X)^{-}\Lambda + \sigma^2 H_w.$$
Note that even in this special case [where the BLUP of $w$ equals the BLUE of $\delta$ and where $\delta = E(w)$], the MSE matrix of the BLUP typically differs from that of the BLUE. The difference between the two MSE matrices [$\sigma^2 H_w$ in the special case and $\sigma^2(H_w - H_{yw}'H_{yw})$ in the general case] is nonnegative definite. This difference is attributable to the variability of $w$, which contributes to the variability of the prediction error $\hat w(y) - w$ but not to the variability of $\hat\delta(y) - \delta$.
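As a computational illustration, the following sketch evaluates the BLUP (10.36) and its prediction-error covariance matrix (10.38). The matrices $X$, $\Lambda$, $H_{yw}$, and $H_w$ and the value of $\sigma^2$ below are illustrative stand-ins, not values from the text.

```python
# A minimal sketch of the BLUP computation (10.36) and its error
# covariance (10.38) under a G-M model; all inputs are illustrative.
import numpy as np

rng = np.random.default_rng(3)
N, P, M = 30, 3, 2
X = rng.normal(size=(N, P))
Lam = rng.normal(size=(P, M))          # Lambda, with E(w) = Lambda' beta
Hyw = 0.1 * rng.normal(size=(N, M))
Hw = np.eye(M)
sigma2 = 1.0
y = rng.normal(size=N)

XtX_inv = np.linalg.pinv(X.T @ X)      # a generalized inverse (X'X)^-
PX = X @ XtX_inv @ X.T                 # projection matrix P_X
w_blup = Lam.T @ XtX_inv @ X.T @ y + Hyw.T @ (np.eye(N) - PX) @ y   # (10.36)
D = Lam.T - Hyw.T @ X
mse = sigma2 * D @ XtX_inv @ D.T + sigma2 * (Hw - Hyw.T @ Hyw)      # (10.38)
print(w_blup, np.diag(mse))
```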
Best linear translation-equivariant prediction (under a G–M model). Let us continue to consider the prediction of the $M \times 1$ unobservable random vector $w$ on the basis of the $N \times 1$ observable random vector $y$, doing so under the same conditions as in the preceding part of the present subsection. Thus, it is supposed that $y$ follows a G–M model, that $E(w)$ is of the form (10.32), that $\operatorname{cov}(y, w)$ and $\operatorname{var}(w)$ are of the form (10.33), and that $w$ is predictable. Further, define $\delta$, $\hat\delta_L(y)$, and $\hat w_L(y)$ as in equations (10.34), (10.35), and (10.36) [so that $\hat\delta_L(y)$ is the BLUE of $\delta$ and $\hat w_L(y)$ the BLUP of $w$].

Let us consider the translation-equivariant prediction of $w$. Denote by $\tilde w(y)$ an arbitrary predictor of $w$, and take
$$\tilde\delta(y) = \tilde w(y) - [\operatorname{cov}(y, w)]'[\operatorname{var}(y)]^{-1}y = \tilde w(y) - H_{yw}'y$$
to be the corresponding estimator of $\delta$—refer to the final part of Subsection a. Then, $\tilde w(y)$ is a translation-equivariant predictor (of $w$) if and only if $\tilde\delta(y)$ is a translation-equivariant estimator (of $\delta$), as can be readily verified. Further, $\tilde w(y)$ is a linear translation-equivariant predictor (of $w$) if and only if $\tilde\delta(y)$ is a linear translation-equivariant estimator (of $\delta$).

In light of Corollary 5.6.4, the estimator $\hat\delta_L(y)$ is a linear translation-equivariant estimator of $\delta$ and, in fact, is the best linear translation-equivariant estimator of $\delta$. It is the best linear translation-equivariant estimator in the sense that the difference between the MSE matrix $E\{[\tilde\delta_L(y) - \delta][\tilde\delta_L(y) - \delta]'\}$ of an arbitrary linear translation-equivariant estimator $\tilde\delta_L(y)$ of $\delta$ and the MSE matrix $E\{[\hat\delta_L(y) - \delta][\hat\delta_L(y) - \delta]'\}$ of $\hat\delta_L(y)$ is nonnegative definite [and is equal to $0$ if and only if $\tilde\delta_L(y) = \hat\delta_L(y)$]. And upon recalling the results of the final part of Subsection a, it follows that the predictor $\hat w_L(y)$ is a linear translation-equivariant predictor of $w$ and, in fact, is the best linear translation-equivariant predictor of $w$. It is the best linear translation-equivariant predictor in the sense that the difference between the MSE matrix $E\{[\tilde w_L(y) - w][\tilde w_L(y) - w]'\}$ of an arbitrary linear translation-equivariant predictor $\tilde w_L(y)$ of $w$ and the MSE matrix $E\{[\hat w_L(y) - w][\hat w_L(y) - w]'\} = \operatorname{var}[\hat w_L(y) - w]$ of $\hat w_L(y)$ is nonnegative definite [and is equal to $0$ if and only if $\tilde w_L(y) = \hat w_L(y)$]. In the special case where $M = 1$, the sense in which $\hat w_L(y)$ is the best linear translation-equivariant predictor can be restated as follows: the MSE of $\hat w_L(y)$ is smaller than that of any other linear translation-equivariant predictor of $w$.
Exercises
Exercise 1. Take the context to be that of estimating parametric functions of the form $\lambda'\beta$ from an $N \times 1$ observable random vector $y$ that follows a G–M, Aitken, or general linear model. Verify (1) that linear combinations of estimable functions are estimable and (2) that linear combinations of nonestimable functions are not necessarily nonestimable.

Exercise 2. Take the context to be that of estimating parametric functions of the form $\lambda'\beta$ from an $N \times 1$ observable random vector $y$ that follows a G–M, Aitken, or general linear model. And let $R = \operatorname{rank}(X)$.
(a) Verify (1) that there exists a set of $R$ linearly independent estimable functions; (2) that no set of estimable functions contains more than $R$ linearly independent estimable functions; and (3) that if the model is not of full rank (i.e., if $R < P$), then at least one and, in fact, at least $P - R$ of the individual parameters $\beta_1, \beta_2, \ldots, \beta_P$ are nonestimable.
(b) Show that the $j$th of the individual parameters $\beta_1, \beta_2, \ldots, \beta_P$ is estimable if and only if the $j$th element of every vector in $\mathcal{N}(X)$ equals $0$ ($j = 1, 2, \ldots, P$).
Exercise 3. Show that for a parametric function of the form $\lambda'\beta$ to be estimable from an $N \times 1$ observable random vector $y$ that follows a G–M, Aitken, or general linear model, it is necessary and sufficient that
$$\operatorname{rank}(X',\, \lambda) = \operatorname{rank}(X).$$

Exercise 4. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M, Aitken, or general linear model. Further, take $y$ to be any value of $y$, and consider the quantity $\lambda'\tilde b$, where $\lambda$ is an arbitrary $P \times 1$ vector of constants and $\tilde b$ is any solution to the linear system $X'Xb = X'y$ (in the $P \times 1$ vector $b$). Show that if $\lambda'\tilde b$ is invariant to the choice of the solution $\tilde b$, then $\lambda'\beta$ is an estimable function. And discuss the implications of this result.
Exercise 5. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M, Aitken, or general linear model. And let $a$ represent an arbitrary $N \times 1$ vector of constants. Show that $a'y$ is the least squares estimator of its expected value $E(a'y)$ (i.e., of the parametric function $a'X\beta$) if and only if $a \in \mathcal{C}(X)$.

Exercise 6. Let $\mathcal{U}$ represent a subspace of the linear space $\mathbb{R}^M$ of all $M$-dimensional column vectors. Verify that the set $\mathcal{U}^{\perp}$ (comprising all $M$-dimensional column vectors that are orthogonal to $\mathcal{U}$) is a linear space.

Exercise 7. Let $X$ represent an $N \times P$ matrix. A $P \times N$ matrix $G$ is said to be a least squares generalized inverse of $X$ if it is a generalized inverse of $X$ (i.e., if $XGX = X$) and if, in addition, $(XG)' = XG$ (i.e., $XG$ is symmetric).
(a) Show that $G$ is a least squares generalized inverse of $X$ if and only if $X'XG = X'$.
(b) Using Part (a) (or otherwise), establish the existence of a least squares generalized inverse of $X$.
(c) Show that if $G$ is a least squares generalized inverse of $X$, then, for any $N \times Q$ matrix $Y$, the matrix $GY$ is a solution to the linear system $X'XB = X'Y$ (in the $P \times Q$ matrix $B$).
Exercise 8. Let $A$ represent an $M \times N$ matrix. An $N \times M$ matrix $H$ is said to be a minimum norm generalized inverse of $A$ if it is a generalized inverse of $A$ (i.e., if $AHA = A$) and if, in addition, $(HA)' = HA$.
(a) Show that $H$ is a minimum norm generalized inverse of $A$ if and only if $H'$ is a least squares generalized inverse of $A'$ (where least squares generalized inverse is as defined in Exercise 7).
(b) Using the results of Exercise 7 (or otherwise), establish the existence of a minimum norm generalized inverse of $A$.
(c) Show that if $H$ is a minimum norm generalized inverse of $A$, then, for any vector $b \in \mathcal{C}(A)$, $\|x\|$ attains its minimum value over the set $\{x : Ax = b\}$ [comprising all solutions to the linear system $Ax = b$ (in $x$)] uniquely at $x = Hb$ (where $\|\cdot\|$ denotes the usual norm).
Exercise 9. Let $X$ represent an $N \times P$ matrix, and let $G$ represent a $P \times N$ matrix that is subject to the following four conditions: (1) $XGX = X$; (2) $GXG = G$; (3) $(XG)' = XG$; and (4) $(GX)' = GX$.
(a) Show that if a $P \times P$ matrix $H$ is a minimum norm generalized inverse of $X'X$, then conditions (1)–(4) can be satisfied by taking $G = HX'$.
(b) Use Part (a) and the result of Part (b) of Exercise 8 (or other means) to establish the existence of a $P \times N$ matrix $G$ that satisfies conditions (1)–(4), and show that there is only one such matrix.
(c) Let $X^{+}$ represent the unique $P \times N$ matrix $G$ that satisfies conditions (1)–(4)—this matrix is customarily referred to as the Moore–Penrose inverse, and conditions (1)–(4) are customarily referred to as the Moore–Penrose conditions. Using Parts (a) and (b) and the results of Part (c) of Exercise 7 and Part (c) of Exercise 8 (or otherwise), show that $X^{+}y$ is a solution to the linear system $X'Xb = X'y$ (in $b$) and that $\|b\|$ attains its minimum value over the set $\{b : X'Xb = X'y\}$ (comprising all solutions to the linear system) uniquely at $b = X^{+}y$ (where $\|\cdot\|$ denotes the usual norm).
Exercise 10. Consider further the alternative approach to the least squares computations, taking the formulation and the notation to be those of the final part of Section 5.4e.
(a) Let $\tilde b = L_1\tilde h_1 + L_2\tilde h_2$, where $\tilde h_2$ is an arbitrary $(P - K)$-dimensional column vector and $\tilde h_1$ is the solution to the linear system $R_1 h_1 = z_1 - R_2\tilde h_2$. Show that $\|\tilde b\|$ is minimized by taking
$$\tilde h_2 = [I + (R_1^{-1}R_2)'R_1^{-1}R_2]^{-1}(R_1^{-1}R_2)'R_1^{-1}z_1.$$
(4) Show that $(y - Xb)'(y - Xb)$ attains a minimum value of $z_2'z_2$ and that it does so at a value $\tilde b$ of $b$ if and only if $\tilde b = L\binom{\tilde d_1}{\tilde d_2}$ for some $(P - K) \times 1$ vector $\tilde d_2$ [where $\tilde d_1$ is the solution to the linear system $T_1 d_1 = z_1$].
(5) Letting $\tilde b$ represent an arbitrary one of the values of $b$ at which $(y - Xb)'(y - Xb)$ attains a minimum value [and, as in Part (4), taking $\tilde d_1$ to be the solution to $T_1 d_1 = z_1$], show that $\|\tilde b\|^2$ (where $\|\cdot\|$ denotes the usual norm) attains a minimum value of $\tilde d_1'\tilde d_1$ and that it does so uniquely at $\tilde b = L\binom{\tilde d_1}{0}$.

Exercise 11. Verify that the difference (6.14) is a nonnegative definite matrix and that it equals $0$ if and only if $c + A'y = \ell(y)$.
Exercise 12. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M, Aitken, or general linear model. And let $s(y)$ represent any particular translation-equivariant estimator of an estimable linear combination $\lambda'\beta$ of the elements of the parametric vector $\beta$—e.g., $s(y)$ could be the least squares estimator of $\lambda'\beta$. Show that an estimator $t(y)$ of $\lambda'\beta$ is translation equivariant if and only if
$$t(y) = s(y) + d(y)$$
for some translation-invariant statistic $d(y)$.

Exercise 13. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M model. And let $y'Ay$ represent a quadratic unbiased nonnegative-definite estimator of $\sigma^2$, that is, a quadratic form in $y$ whose matrix $A$ is a symmetric nonnegative definite matrix of constants and whose expected value is $\sigma^2$.
(a) Show that $y'Ay$ is translation invariant.
(b) Suppose that the fourth-order moments of the distribution of the vector $e = (e_1, e_2, \ldots, e_N)'$ are such that (for $i, j, k, m = 1, 2, \ldots, N$) $E(e_i e_j e_k e_m)$ satisfies condition (7.38). For what choice of $A$ is the variance of the quadratic unbiased nonnegative-definite estimator $y'Ay$ a minimum? Describe your reasoning.
Exercise 14. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M model. Suppose further that the distribution of the vector $e = (e_1, e_2, \ldots, e_N)'$ has third-order moments $\mu_{jkm} = E(e_j e_k e_m)$ ($j, k, m = 1, 2, \ldots, N$) and fourth-order moments $\mu_{ijkm} = E(e_i e_j e_k e_m)$ ($i, j, k, m = 1, 2, \ldots, N$). And let $A = \{a_{ij}\}$ represent an $N \times N$ symmetric matrix of constants.
(a) Show that in the special case where the elements $e_1, e_2, \ldots, e_N$ of $e$ are statistically independent,
$$\operatorname{var}(y'Ay) = a'\Delta a + 4\beta'X'A\Lambda a + 2\sigma^4\operatorname{tr}(A^2) + 4\sigma^2\beta'X'A^2X\beta, \tag{E.1}$$
where $a = (a_{11}, a_{22}, \ldots, a_{NN})'$, $\Delta = \operatorname{diag}(\mu_{1111} - 3\sigma^4, \ldots, \mu_{NNNN} - 3\sigma^4)$, and $\Lambda = \operatorname{diag}(\mu_{111}, \ldots, \mu_{NNN})$.
Exercise 15. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M model, and assume that the distribution of the vector $e$ of residual effects is MVN.
(a) Letting $\lambda'\beta$ represent an estimable linear combination of the elements of the parametric vector $\beta$, find a minimum-variance unbiased estimator of $(\lambda'\beta)^2$.

Exercise 16. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M model, and assume that the distribution of the vector $e$ of residual effects is MVN. Show that if $\sigma^2$ were known, $X'y$ would be a complete sufficient statistic.
Exercise 17. Suppose that $y$ is an $N \times 1$ observable random vector that follows a general linear model. Suppose further that the distribution of the vector $e$ of residual effects is MVN or, more generally, that the distribution of $e$ is known up to the value of the vector $\theta$. And take $h(y)$ to be any (possibly vector-valued) translation-invariant statistic.
(a) Show that if $\theta$ were known, $h(y)$ would be an ancillary statistic—for a definition of ancillarity, refer, e.g., to Casella and Berger (2002, def. 6.2.16) or to Lehmann and Casella (1998, p. 41).
(b) Suppose that $X'y$ would be a complete sufficient statistic if $\theta$ were known. Show (1) that the least squares estimator of any estimable linear combination $\lambda'\beta$ of the elements of the parametric vector $\beta$ has minimum variance among all unbiased estimators, (2) that any vector of least squares estimators of estimable linear combinations (of the elements of $\beta$) is distributed independently of $h(y)$, and (3) (using the result of Exercise 12 or otherwise) that the least squares estimator of any estimable linear combination $\lambda'\beta$ has minimum mean squared error among all translation-equivariant estimators. {Hint [for Part (2)]. Make use of Basu's theorem—refer, e.g., to Lehmann and Casella (1998, p. 42) for a statement of Basu's theorem.}
Exercise 18. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M model. Suppose further that the distribution of the vector $e$ of residual effects is MVN. And, letting $\tilde e = y - P_X y$, take $\tilde\sigma^2 = \tilde e'\tilde e/N$ to be the ML estimator of $\sigma^2$ and $\hat\sigma^2 = \tilde e'\tilde e/(N - \operatorname{rank} X)$ to be the unbiased estimator.
(a) Find the bias and the MSE of the ML estimator $\tilde\sigma^2$.
(b) Compare the MSE of the ML estimator $\tilde\sigma^2$ with that of the unbiased estimator $\hat\sigma^2$: for which values of $N$ and of $\operatorname{rank} X$ is the MSE of the ML estimator smaller than that of the unbiased estimator, and for which values is it larger?
Exercise 19. Suppose that $y$ is an $N \times 1$ observable random vector that follows a general linear model, that the distribution of the vector $e$ of residual effects is MVN, and that the variance-covariance matrix $V(\theta)$ of $e$ is nonsingular (for all $\theta \in \Theta$). And, letting $K = N - \operatorname{rank} X$, take $R$ to be any $N \times K$ matrix (of constants) of full column rank $K$ such that $X'R = 0$, and (as in Section 5.9b) define $z = R'y$. Further, let $w = s(z)$, where $s(\cdot)$ is a $K \times 1$ vector of real-valued functions that defines a one-to-one mapping of $\mathbb{R}^K$ onto some set $\mathcal{W}$.
(a) Show that $w$ is a maximal invariant.
(b) Let $f_1(\cdot\,; \theta)$ represent the pdf of the distribution of $z$, and assume that $s(\cdot)$ is such that the distribution of $w$ has a pdf, say $f_2(\cdot\,; \theta)$, that is obtainable from $f_1(\cdot\,; \theta)$ via an application of the basic formula (e.g., Bickel and Doksum 2001, sec. B.2) for a change of variables. And, taking $L_1(\theta; R'y)$ and $L_2[\theta; s(R'y)]$ (where $y$ denotes the observed value of $y$) to be the likelihood functions defined by $L_1(\theta; R'y) = f_1(R'y; \theta)$ and $L_2[\theta; s(R'y)] = f_2[s(R'y); \theta]$, show that $L_1(\theta; R'y)$ and $L_2[\theta; s(R'y)]$ differ from each other by no more than a multiplicative constant.
Exercise 20. Suppose that $y$ is an $N \times 1$ observable random vector that follows a general linear model, that the distribution of the vector $e$ of residual effects is MVN, and that the variance-covariance matrix $V(\theta)$ of $e$ is nonsingular (for all $\theta \in \Theta$). Further, let $z = R'y$, where $R$ is any $N \times (N - \operatorname{rank} X)$ matrix (of constants) of full column rank $N - \operatorname{rank} X$ such that $X'R = 0$; and let $u = X_*'y$, where $X_*$ is any $N \times (\operatorname{rank} X)$ matrix (of constants) whose columns form a basis for $\mathcal{C}(X)$. And denote by $y$ the observed value of $y$.
(a) Verify that the likelihood function that would result from regarding the observed value $(X_*, R)'y$ of $\binom{u}{z}$ as the data vector differs by no more than a multiplicative constant from that obtained by regarding the observed value $y$ of $y$ as the data vector.
(b) Let $f_0(\cdot \mid \cdot\,; \beta, \theta)$ represent the pdf of the conditional distribution of $u$ given $z$. And take $L_0[\beta, \theta; (X_*, R)'y]$ to be the function of $\beta$ and $\theta$ defined by $L_0[\beta, \theta; (X_*, R)'y] = f_0(X_*'y \mid R'y; \beta, \theta)$. Show that
$$L_0[\beta, \theta; (X_*, R)'y] = (2\pi)^{-(\operatorname{rank} X)/2}\,|X_*'X_*|^{-1}\,|X_*'[V(\theta)]^{-1}X_*|^{1/2}\, \exp\bigl\{-\tfrac{1}{2}[\tilde\beta(\theta) - \beta]'X'[V(\theta)]^{-1}X[\tilde\beta(\theta) - \beta]\bigr\},$$
where $\tilde\beta(\theta)$ is any solution to the linear system $X'[V(\theta)]^{-1}Xb = X'[V(\theta)]^{-1}y$ (in the $P \times 1$ vector $b$).
(c) In connection with Part (b), show (1) that
$$[\tilde\beta(\theta) - \beta]'X'[V(\theta)]^{-1}X[\tilde\beta(\theta) - \beta] = (y - X\beta)'[V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^{-}X'[V(\theta)]^{-1}(y - X\beta)$$
and (2) that the distribution of the random variable $s$ defined by
$$s = (y - X\beta)'[V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^{-}X'[V(\theta)]^{-1}(y - X\beta)$$
is $\chi^2(\operatorname{rank} X)$.
Exercise 21. Suppose that $z$ is an $S \times 1$ observable random vector and that $z \sim N(0, \sigma^2 I)$, where $\sigma$ is a (strictly) positive unknown parameter.
(a) Show that $z'z$ is a complete sufficient statistic.
(b) Take $w(z)$ to be the $S$-dimensional vector-valued statistic defined by $w(z) = (z'z)^{-1/2}z$—$w(z)$ is defined for $z \neq 0$ and hence with probability $1$. Show that $z'z$ and $w(z)$ are statistically independent. (Hint. Make use of Basu's theorem.)
(c) Show that any estimator of $\sigma^2$ of the form $z'z/k$ (where $k$ is a nonzero constant) is scale equivariant—an estimator, say $t(z)$, of $\sigma^2$ is to be regarded as scale equivariant if for every (strictly) positive scalar $c$ (and for every nonnull value of $z$) $t(cz) = c^2 t(z)$.
(d) Let $t_0(z)$ represent any particular scale-equivariant estimator of $\sigma^2$ such that $t_0(z) \neq 0$ for $z \neq 0$. Show that an estimator $t(z)$ of $\sigma^2$ is scale equivariant if and only if, for some function $u(\cdot)$ such that $u(cz) = u(z)$ (for every strictly positive constant $c$ and every nonnull value of $z$),
$$t(z) = u(z)\,t_0(z) \quad\text{for } z \neq 0. \tag{E.2}$$
(e) Show that a function $u(z)$ of $z$ is such that $u(cz) = u(z)$ (for every strictly positive constant $c$ and every nonnull value of $z$) if and only if $u(z)$ depends on the value of $z$ only through $w(z)$ [where $w(z)$ is as defined in Part (b)].
(f) Show that the estimator $z'z/(S + 2)$ has minimum MSE among all scale-equivariant estimators of $\sigma^2$.
Exercise 22. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M model and that the distribution of the vector $e$ of residual effects is MVN. Using the result of Part (f) of Exercise 21 (or otherwise), show that the Hodges–Lehmann estimator $y'(I - P_X)y/[N - \operatorname{rank}(X) + 2]$ has minimum MSE among all translation-invariant estimators of $\sigma^2$ that are scale equivariant—scale equivariance being defined here in the same way as in Part (c) of Exercise 21.
Exercise 25. Let $y$ represent an $N \times 1$ random vector and $w$ an $M \times 1$ random vector. Suppose that the second-order moments of the joint distribution of $y$ and $w$ exist, and adopt the following notation: $\mu_y = E(y)$, $\mu_w = E(w)$, $V_y = \operatorname{var}(y)$, $V_{yw} = \operatorname{cov}(y, w)$, and $V_w = \operatorname{var}(w)$. Further, assume that $V_y$ is nonsingular.
(a) Show that the matrix $V_w - V_{yw}'V_y^{-1}V_{yw} - E[\operatorname{var}(w \mid y)]$ is nonnegative definite and that it equals $0$ if and only if (for some nonrandom vector $c$ and some nonrandom matrix $A$) $E(w \mid y) = c + A'y$ (with probability $1$).
(b) Show that the matrix $\operatorname{var}[E(w \mid y)] - V_{yw}'V_y^{-1}V_{yw}$ is nonnegative definite and that it equals $0$ if and only if (for some nonrandom vector $c$ and some nonrandom matrix $A$) $E(w \mid y) = c + A'y$ (with probability $1$).
Exercise 26. Let $y$ represent an $N \times 1$ observable random vector and $w$ an $M \times 1$ unobservable random vector. Suppose that the second-order moments of the joint distribution of $y$ and $w$ exist, and adopt the following notation: $\mu_y = E(y)$, $\mu_w = E(w)$, $V_y = \operatorname{var}(y)$, $V_{yw} = \operatorname{cov}(y, w)$, and $V_w = \operatorname{var}(w)$. Assume that $\mu_y$, $\mu_w$, $V_y$, $V_{yw}$, and $V_w$ are known. Further, define $\mu(y) = \mu_w + V_{yw}'V_y^{-}(y - \mu_y)$, and take $t(y)$ to be an ($M \times 1$)-dimensional vector-valued function of the form $t(y) = c + A'y$, where $c$ is a vector of constants and $A$ is an $N \times M$ matrix of constants. Extend various of the results of Section 5.10a (to the case where $V_y$ may be singular) by using Theorem 3.5.11 to show (1) that $\mu(y)$ is the best linear predictor of $w$ in the sense that the difference between the matrix $E\{[t(y) - w][t(y) - w]'\}$ and the matrix $\operatorname{var}[\mu(y) - w]$ [which is the MSE matrix of $\mu(y)$] equals the matrix $E\{[t(y) - \mu(y)][t(y) - \mu(y)]'\}$, which is nonnegative definite and which equals $0$ if and only if $t(y) = \mu(y)$ for every value of $y$ such that $y - \mu_y \in \mathcal{C}(V_y)$, (2) that $\Pr[y - \mu_y \in \mathcal{C}(V_y)] = 1$, and (3) that $\operatorname{var}[\mu(y) - w] = V_w - V_{yw}'V_y^{-}V_{yw}$.
Exercise 27. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M model, and take $w$ to be an $M \times 1$ unobservable random vector whose value is to be predicted. Suppose further that $E(w)$ is of the form $E(w) = \Lambda'\beta$ (where $\Lambda$ is a matrix of known constants) and that $\operatorname{cov}(y, w)$ is of the form $\operatorname{cov}(y, w) = \sigma^2 H_{yw}$ (where $H_{yw}$ is a known matrix). Let $\delta = (\Lambda' - H_{yw}'X)\beta$, denote by $\tilde w(y)$ an arbitrary predictor (of $w$), and define $\tilde\delta(y) = \tilde w(y) - H_{yw}'y$. Verify that $\tilde w(y)$ is a translation-equivariant predictor (of $w$) if and only if $\tilde\delta(y)$ is a translation-equivariant estimator of $\delta$.
The multivariate normal distribution was introduced and was discussed extensively in Section 3.5. A
broader class of multivariate distributions, comprising so-called elliptical distributions, was consid-
ered in Section 5.9c. Numerous results on the first- and second-order moments of linear and quadratic
forms (in random vectors) were presented in Chapter 3 and in Section 5.7b.
Knowledge of the multivariate normal distribution and of other elliptical distributions and knowl-
edge of results on the first- and second-order moments of linear and quadratic forms provide a more-
or-less adequate background for the discussion of the classical approach to point estimation and
prediction, which was the subject of Chapter 5. However, when it comes to extending the results
on point estimation and prediction to the construction and evaluation of confidence regions and of
test procedures, this knowledge, while still relevant, is far from adequate. It needs to be augmented
with a knowledge of the distributions of certain functions of normally distributed random vectors
and a knowledge of various related distributions and with a knowledge of the properties of such
distributions. It is these distributions and their properties that form the subject matter of the present
chapter.
6.1 Chi-Square, Gamma, Beta, and Dirichlet Distributions

a. Gamma distribution

For strictly positive scalars $\alpha$ and $\beta$, let $f(\cdot)$ represent the function defined (on the real line) by
$$f(x) = \begin{cases} \dfrac{1}{\Gamma(\alpha)\beta^{\alpha}}\,x^{\alpha-1}e^{-x/\beta}, & \text{for } 0 < x < \infty, \\[4pt] 0, & \text{elsewhere.} \end{cases} \tag{1.1}$$
Clearly, $f(x) \ge 0$ for $-\infty < x < \infty$. And $\int_{-\infty}^{\infty} f(x)\,dx = 1$, as is evident upon introducing the change of variable $w = x/\beta$ and recalling the definition—refer to expression (3.5.5)—of the gamma function. Thus, the function $f(\cdot)$ qualifies as a pdf (probability density function). The distribution determined by this pdf is known as the gamma distribution (with parameters $\alpha$ and $\beta$). Let us use the symbol $\mathrm{Ga}(\alpha, \beta)$ to denote this distribution.

If a random variable $x$ has a $\mathrm{Ga}(\alpha, \beta)$ distribution, then for any (strictly) positive constant $c$,
$$cx \sim \mathrm{Ga}(\alpha, c\beta), \tag{1.2}$$
as can be readily verified.
Suppose that two random variables $w_1$ and $w_2$ are distributed independently as $\mathrm{Ga}(\alpha_1, \beta)$ and $\mathrm{Ga}(\alpha_2, \beta)$, respectively. Let
$$w = w_1 + w_2 \quad\text{and}\quad s = \frac{w_1}{w_1 + w_2}. \tag{1.3}$$
And note that equalities (1.3) define a one-to-one transformation from the rectangular region $\{w_1, w_2 : w_1 > 0,\, w_2 > 0\}$ onto the rectangular region
$$\{w, s : w > 0,\ 0 < s < 1\}; \tag{1.4}$$
the inverse transformation is given by
$$w_1 = sw \quad\text{and}\quad w_2 = (1 - s)w.$$
We find that
$$\det\begin{pmatrix} \partial w_1/\partial w & \partial w_1/\partial s \\ \partial w_2/\partial w & \partial w_2/\partial s \end{pmatrix} = \det\begin{pmatrix} s & w \\ 1-s & -w \end{pmatrix} = -w.$$
Let $f(\cdot\,, \cdot)$ represent the pdf of the joint distribution of $w$ and $s$, and $f_1(\cdot)$ and $f_2(\cdot)$ the pdfs of the distributions of $w_1$ and $w_2$, respectively. Then, for values of $w$ and $s$ in the region (1.4),
$$f(w, s) = f_1(sw)\,f_2[(1-s)w]\,w = g(w)\,h(s), \tag{1.5}$$
where
$$g(w) = \begin{cases} \dfrac{1}{\Gamma(\alpha_1+\alpha_2)\beta^{\alpha_1+\alpha_2}}\,w^{\alpha_1+\alpha_2-1}e^{-w/\beta}, & \text{for } 0 < w < \infty, \\[4pt] 0, & \text{elsewhere,} \end{cases} \tag{1.6}$$
and
$$h(s) = \begin{cases} \dfrac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\,s^{\alpha_1-1}(1-s)^{\alpha_2-1}, & \text{for } 0 < s < 1, \\[4pt] 0, & \text{elsewhere.} \end{cases} \tag{1.7}$$
The function $g(\cdot)$ is seen to be the pdf of the $\mathrm{Ga}(\alpha_1+\alpha_2, \beta)$ distribution. And because $h(s) \ge 0$ (for every value of $s$) and because
$$\int_{-\infty}^{\infty} h(s)\,ds = \int_{-\infty}^{\infty} h(s)\,ds \int_{-\infty}^{\infty} g(w)\,dw = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} f(w, s)\,dw\,ds = 1, \tag{1.8}$$
$h(\cdot)$, like $g(\cdot)$, is a pdf; it is the pdf of the distribution of the random variable $s = w_1/(w_1 + w_2)$. Moreover, the random variables $w$ and $s$ are distributed independently.

Based on what has been established, we have the following result.
Theorem 6.1.1. If two random variables $w_1$ and $w_2$ are distributed independently as $\mathrm{Ga}(\alpha_1, \beta)$ and $\mathrm{Ga}(\alpha_2, \beta)$, respectively, then (1) $w_1 + w_2$ is distributed as $\mathrm{Ga}(\alpha_1 + \alpha_2, \beta)$ and (2) $w_1 + w_2$ is distributed independently of $w_1/(w_1 + w_2)$.

By employing a simple induction argument, we can establish the following generalization of the first part of Theorem 6.1.1.

Theorem 6.1.2. If $N$ random variables $w_1, w_2, \ldots, w_N$ are distributed independently as $\mathrm{Ga}(\alpha_1, \beta), \mathrm{Ga}(\alpha_2, \beta), \ldots, \mathrm{Ga}(\alpha_N, \beta)$, respectively, then $w_1 + w_2 + \cdots + w_N$ is distributed as $\mathrm{Ga}(\alpha_1 + \alpha_2 + \cdots + \alpha_N, \beta)$.
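Theorem 6.1.1 lends itself to a quick Monte Carlo check. The sketch below assumes SciPy's parameterization of the gamma distribution (shape $\alpha$, scale $\beta$); it compares the simulated distribution of $w_1 + w_2$ with $\mathrm{Ga}(\alpha_1 + \alpha_2, \beta)$ and checks that the sum and the ratio are empirically uncorrelated (a necessary consequence of, though weaker than, independence). The parameter values are arbitrary.

```python
# Monte Carlo sanity check of Theorem 6.1.1 using SciPy's gamma
# parameterization Ga(alpha, beta) = gamma(a=alpha, scale=beta).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a1, a2, beta = 2.0, 3.5, 1.7
w1 = rng.gamma(shape=a1, scale=beta, size=100_000)
w2 = rng.gamma(shape=a2, scale=beta, size=100_000)
w, s = w1 + w2, w1 / (w1 + w2)

print(stats.kstest(w, stats.gamma(a=a1 + a2, scale=beta).cdf).pvalue)  # large p
print(np.corrcoef(w, s)[0, 1])                                         # near 0
```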
For strictly positive scalars $y$ and $z$, let
$$B(y, z) = \int_0^1 s^{y-1}(1-s)^{z-1}\,ds. \tag{1.9}$$
This function is known as the beta function. The beta function is expressible in terms of the gamma function:
$$B(y, z) = \frac{\Gamma(y)\Gamma(z)}{\Gamma(y+z)}, \tag{1.10}$$
as is evident from result (1.8).

Note that the pdf $h(\cdot)$ of the $\mathrm{Be}(\alpha_1, \alpha_2)$ distribution can be reexpressed in terms of the beta function. We have that
$$h(s) = \begin{cases} \dfrac{1}{B(\alpha_1, \alpha_2)}\,s^{\alpha_1-1}(1-s)^{\alpha_2-1}, & \text{for } 0 < s < 1, \\[4pt] 0, & \text{elsewhere.} \end{cases} \tag{1.11}$$
Denoting by $I_x(y, z) = [1/B(y, z)]\int_0^x s^{y-1}(1-s)^{z-1}\,ds$ (for $0 \le x \le 1$) the incomplete beta function ratio, we have that
$$I_x(y, z) = 1 - I_{1-x}(z, y). \tag{1.14}$$
Let us write $\chi^2(N)$ for a chi-square distribution with $N$ degrees of freedom. Here, $N$ is assumed to be an integer. Reference is sometimes made to a chi-square distribution with noninteger (but strictly positive) degrees of freedom $N$. Unless otherwise indicated, such a reference is to be interpreted as a reference to the $\mathrm{Ga}\bigl(\tfrac{N}{2}, 2\bigr)$ distribution; this interpretation is that obtained when the relationship between the $\chi^2(N)$ and $\mathrm{Ga}\bigl(\tfrac{N}{2}, 2\bigr)$ distributions is regarded as extending to noninteger values of $N$.

In light of the relationship of the chi-square distribution to the gamma distribution, various results on the gamma distribution can be translated into results on the chi-square distribution. In particular, if a random variable $x$ has a $\chi^2(N)$ distribution and if $c$ is a (strictly) positive constant, then it follows from result (1.2) that
$$cx \sim \mathrm{Ga}\bigl(\tfrac{N}{2},\, 2c\bigr). \tag{1.17}$$
d. Moment generating function, moments, and cumulants of the gamma and chi-square distributions

Let $w$ represent a random variable whose distribution is $\mathrm{Ga}(\alpha, \beta)$. And denote by $f(\cdot)$ the pdf of the $\mathrm{Ga}(\alpha, \beta)$ distribution. Further, let $u$ represent a random variable whose distribution is $\chi^2(N)$.

For $t < 1/\beta$,
$$E(e^{tw}) = \int_0^\infty e^{tx}f(x)\,dx = \int_0^\infty \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\,x^{\alpha-1}e^{-x(1-\beta t)/\beta}\,dx = \frac{1}{(1-\beta t)^{\alpha}}\int_0^\infty \frac{1}{\Gamma(\alpha)\gamma^{\alpha}}\,x^{\alpha-1}e^{-x/\gamma}\,dx, \tag{1.18}$$
where $\gamma = \beta/(1-\beta t)$. The integrand of the integral in expression (1.18) equals $g(x)$, where $g(\cdot)$ is the pdf of the $\mathrm{Ga}(\alpha, \gamma)$ distribution, so that the integral equals $1$. Thus, the mgf (moment generating function), say $m(\cdot)$, of the $\mathrm{Ga}(\alpha, \beta)$ distribution is
$$m(t) = (1-\beta t)^{-\alpha} \quad (t < 1/\beta). \tag{1.19}$$
As a special case of result (1.19), we have that the mgf, say $m(\cdot)$, of the $\chi^2(N)$ distribution is
$$m(t) = (1-2t)^{-N/2} \quad (t < 1/2). \tag{1.20}$$
For r > ˛,
Z 1
r
E.w / D x rf .x/ dx
0
1
1
Z
D ˛
x ˛Cr 1 e x=ˇ dx
0 .˛/ˇ
ˇ r .˛ C r/ 1 1
Z
D ˛Cr
x ˛Cr 1
e x=ˇ
dx: (1.21)
.˛/ 0 .˛ C r/ˇ
The integrand of the integral in expression (1.21) equals g.x/, where g./ is the pdf of the Ga.˛Cr; ˇ/
distribution, so that the integral equals 1. Thus, for r > ˛,
.˛ C r/
E.w r / D ˇ r : (1.22)
.˛/
The gamma function ./ is such that, for x > 0 and for any positive integer r,
.x C r/ D .x C r 1/ .x C 2/.x C 1/x .x/; (1.23)
as is evident upon the repeated application of result (3.5.6). In light of result (1.23), it follows from
formula (1.22) that the rth (positive, integer-valued) moment of the Ga.˛; ˇ/ distribution is
E.w r / D ˇ r ˛.˛ C 1/.˛ C 2/ .˛ C r 1/: (1.24)
Thus, the mean and variance of the Ga.˛; ˇ/ distribution are
E.w/ D ˛ˇ (1.25)
258 Some Relevant Distributions and Their Properties
and
var.w/ D ˇ 2 ˛.˛ C 1/ .˛ˇ/2 D ˛ˇ 2: (1.26)
Upon setting ˛ D N=2 and ˇ D 2 in expression (1.22), we find that (for r > N=2)
Œ.N=2/ C r
E.ur / D 2r : (1.27)
.N=2/
E.u/ D N (1.29)
and
var.u/ D 2N: (1.30)
Upon applying formula (1.27) [and making use of result (1.23)], we find that the rth moment of the
reciprocal of a chi-square random variable (with N degrees of freedom) is (for r D 1; 2; : : : < N=2)
r Œ.N=2/ r
r
E.u /D2
.N=2/
r 1
D f2 Œ.N=2/ 1Œ.N=2/ 2 Œ.N=2/ rg
D Œ.N 2/.N 4/ .N 2r/ 1: (1.31)
Upon taking the logarithm of the mgf m./ of the Ga.˛; ˇ/ distribution, we obtain the cumulant
generating function, say c./, of this distribution—refer, e.g., to Bickel and Doksum (2001, sec. A.12)
for an introduction to cumulants and cumulant generating functions. In light of result (1.19),
N 2r 1
.r 1/Š : (1.35)
Chi-Square, Gamma, Beta, and Dirichlet Distributions 259
e. Dirichlet distribution
Let w1 ; w2 ; : : : ; wK , and wKC1 represent K C 1 random variables that are distributed independently
as Ga.˛1 ; ˇ/; Ga.˛2 ; ˇ/; : : : ; Ga.˛K ; ˇ/, and Ga.˛KC1 ; ˇ/, respectively. And consider the joint
distribution of the K C 1 random variables w and s1 ; s2 ; : : : ; sK defined as follows:
KC1
X wk
wD wk and sk D PKC1 .k D 1; 2; : : : ; K/: (1.36)
kD1 k 0 D1 wk 0
A derivation of the pdf of the joint distribution of w and s1 ; s2 ; : : : ; sK was presented for the
special case where K D 1 in Subsection a. Let us extend that derivation to the general case (where
K is an arbitrary positive integer).
The K C 1 equalities (1.36) define a one-to-one transformation from the rectangular region
fw1 ; w2 ; : : : ; wK ; wKC1 W wk > 0 .k D 1; 2; : : : ; K; K C1/g onto the region
w; s1 ; s2 ; : : : ; sK W w > 0; sk > 0 .k D 1; 2; : : : ; K/; K (1.37)
˚ P
kD1 sk < 1 :
The inverse transformation is defined by
PK
and
wk D sk w .k D 1; 2; : : : ; K/ wKC1 D 1 kD1 sk w:
For j; k D 1; 2; : : : ; K,
(
@wj w; if k D j ,
D
@sk 0; if k ¤ j .
@wj @wKC1 @wKC1
Further, D sj (j D 1; 2; : : : ; K), D w (k D 1; 2; : : : ; K), and D 1
PK @w @s k @w
0
kD1 sk . Thus, letting s D .s1 ; s2 ; : : : ; sK / and making use of Theorem 2.14.9, formula (2.14.29),
and Corollary 2.14.2, we find that
ˇ ˇ
ˇ @w1 @w1 @w1 @w1 ˇˇ
ˇ
ˇ @w : : :
ˇ @s1 @s2 @sK ˇˇ
ˇ @w2 @w2 @w2 @w2 ˇˇ
ˇ
ˇ @w :::
@s1 @s2 @sK ˇˇ ˇˇ s wI ˇˇ
ˇ
ˇ :: :: :: :
ˇ
:: :: ˇ D ˇ PK
ˇ ˇ
ˇ : : : :
w10 ˇ
ˇ
ˇ ˇ1
kD1 sk
ˇ
ˇ @wK @w K @w K @w K ˇ
ˇ
ˇ @w ::: ˇ
ˇ @s1 @s2 @sK ˇˇ
ˇ @wKC1 @wKC1 @wKC1 @wKC1 ˇˇ
ˇ
ˇ @w :::
@s1 @s2 @sK ˇ
ˇ ˇ
ˇ wI s ˇ
Kˇ
D . 1/ ˇ
ˇ
ˇ w10 1 K
P ˇ
kD1 sk
ˇ
K K
D . 1/K w K .1
P P
kD1 sk C kD1 sk/
D . w/K:
Now, let f . ; ; ; : : : ; / represent the pdf of the joint distribution of w and s1 ; s2 ; : : : ; sK , define
PKC1
˛ D kD1 ˛k , and, for k D 1; 2; : : : ; K, denote by fk ./ the pdf of the distribution of wk . Then,
for w; s1 ; s2 ; : : : ; sK in the region (1.37),
f .w; s1 ; s2 ; : : : ; sK /
PK
w j. w/K j
D f1 .s1 w/f2 .s2 w/ fK .sK w/fKC1 1 kD1 sk
1
D
.˛1 /.˛2 / .˛K /.˛KC1 /ˇ ˛
1 ˛1 1 ˛2 1 ˛K 1 PK ˛KC1 1
w˛ s1 s2 sK 1 kD1 sk e w=ˇ
260 Some Relevant Distributions and Their Properties
The function g./ is seen to be the pdf of the Ga. KC1 kD1 ˛k ; ˇ/ distribution. And because
P
h.s1 ; s2 ; : : : ; sK / 0 for all s1 ; s2 ; : : : ; sK and because
Z 1Z 1 Z 1
h.s1 ; s2 ; : : : ; sK / ds1 ds2 : : : dsK
1 1
Z 1 Z11 Z 1 Z 1
D h.s1 ; s2 ; : : : ; sK / ds1 ds2 : : : dsK g.w/ dw
1 1
Z 1 Z 1 Z 1 1Z 1 1
h. ; ; : : : ; /, like g./, is a pdf; it is the pdf of the joint distribution of the K random variables
sk D wk = KC1 k 0 D1 wk 0 (k D 1; 2; : : : ; K). Moreover, w is distributed independently of s1; s2 ; : : : ; sK .
P
Based on what has been established, we have the following generalization of Theorem 6.1.1.
Theorem 6.1.4. If K C 1 random variables w1 ; w2 ; : : : ; wK ; wKC1 are distributed indepen-
dently as Ga.˛1 ; ˇ/; Ga.˛2 ; ˇ/; : : : ; Ga.˛K ; ˇ/; Ga.˛KC1 ; ˇ/, respectively, then (1) KC1
P
kD1 wk
PKC1 PKC1
is distributed as Ga. kD1 ˛k ; ˇ/ and (2) kD1 wk is distributed independently of
w1 = KC1
PKC1 PKC1
k.
P
kD1 wk ; w2 = kD1 wk ; : : : ; wK = kD1 w
Note that Part (1) of Theorem 6.1.4 is essentially a restatement of a result established earlier (in
the form of Theorem 6.1.2) via a mathematical induction argument.
The joint distribution of the K random variables s1 ; s2 ; : : : ; sK , the pdf of which is the
function h. ; ; : : : ; / defined by expression (1.38), is known as the Dirichlet distribution
(with parameters ˛1 ; ˛2 ; : : : ; ˛K , and ˛KC1 ). Let us denote this distribution by the symbol
Di.˛1 ; ˛2 ; : : : ; ˛K ; ˛KC1 I K/. The beta distribution is a special case of the Dirichlet distribution;
specifically, the Be.˛1 ; ˛2 / distribution is identical to the Di.˛1 ; ˛2 I 2/ distribution.
Some results on the Dirichlet distribution. Some results on the Dirichlet distribution are stated in
the form of the following theorem.
Theorem 6.1.5. Take s1 ; s2 ; : : : ; sK to be K random variables whose joint distribution is
PK
Di.˛1 ; ˛2 ; : : : ; ˛K ; ˛KC1 I K/, and define sKC1 D 1 kD1 sk . Further, partition the inte-
gers 1; : : : ; K; K C 1 into I C 1 (nonempty) mutually exclusive and exhaustive subsets, say
B1 ; : : : ; BI ; BI C1 , of sizes K1 C1; : : : ; KI C1; KI C1 C1, respectively, and denote by fi1; i2 ; : : : ; iP g
the subset of f1; : : : ; I; I C 1g consisting of every integer i between 1 and I C 1, inclusive, for
Chi-Square, Gamma, Beta, and Dirichlet Distributions 261
which Ki 1. And for i D 1; : : : ; I; I C 1, let si D k2Bi sk and ˛i D k2Bi ˛k ; and for
P P
P , let up represent the Kip 1 vector whose elements are the first Kip of the Kip C 1
p D 1; 2; : : : ; P
quantities sk = k 0 2Bi sk 0 (k 2 Bip ). Then,
p
(1) the P C 1 random vectors u1 ; u2 ; : : : ; uP , and .s1 ; : : : ; sI ; sIC1 /0 are statistically independent;
(2) the joint distribution of s1 ; s2 ; : : : ; sI is Di.˛1 ; ˛2 ; : : : ; ˛I ; ˛IC1 I I /; and
(3) for p D 1; 2; : : : ; P , the joint distribution of the Kip elements of up is Dirichlet with parameters
˛k (k 2 Bip ).
Proof. Let wij (i D 1; : : : ; I; I C 1; j D 1; : : : ; Ki ; Ki C 1) represent statistically independent
random variables, and suppose that (for all i and j ) wij Ga.˛ij ; ˇ/, where ˛ij is the j th of the
Ki C 1 parameters ˛k (k 2 Bi ). Further, let wi D .wi1 ; : : : ; wiKi ; wi;Ki C1 /0. And observe that in
light of the very definition of the Dirichlet distribution, it suffices (for purposes of the proof) to set the
PKi 0 C1
Ki C 1 random variables sk (k 2 Bi ) equal to wij = Ii 0C1 j 0 D1 wi j (j D 1; : : : ; Ki ; Ki C 1),
P
0 0
D1
respectively (i D 1; : : : ; I; I C1). Upon doing so, we find that
PKi C1 ıPI C1 PKi 0 C1
si D j D1 wij i 0 D1 j 0 D1 wi 0j 0 .i D 1; : : : ; I; I C1/ (1.39)
and that (for p D 1; 2; : : : ; P ) the Kip elements of the vector up are up1 ; up2 , : : : ; upKip , where
(for j D 1; 2; : : : ; Kip )
ıPI C1 PKi 0 C1
wip j i 0 D1 j 0 D1 wi j
0 0 wip j
upj D PK C1 D PK C1 : (1.40)
ip
ıP I C1 PKi 0 C1 ip
j 0 D1 wip j j 0 D1 wi j j 0 D1 wip j
0 0 0 0
i 0 D1
Part (3) of Theorem 6.1.5 follows immediately from result (1.40). And upon observing that the
PKi C1
IC1 sums j D1 wij (i D 1; : : : ; I; IC1) are statistically independent and observing also [in light of
PKi C1
Theorem 6.1.2 or Part (1) of Theorem 6.1.4] that (for i D 1; : : : ; I; IC1) j D1 wij Ga.˛i; ˇ/,
Part (2) follows from result (1.39).
It remains to verify Part (1). Let y D .y1 ; : : : ; yI ; yI C1 /0, where (for i D 1, : : : ; I; I C 1)
PKi C1
yi D j D1 wij , and consider a change of variables from the elements of the I C 1 (statisti-
cally independent) random vectors w1 ; : : : ; wI ; wI C1 to the elements of the P C1 random vectors
u1 ; u2 ; : : : ; uP , and y. We find that the pdf, say f . ; ; : : : ; ; / of u1; u2 ; : : : ; uP , and y is expressible
as C1
IY P
Y
f .u1 ; u2 ; : : : ; uP ; y/ D gi .yi / hp .up /; (1.41)
i D1 pD1
where gi ./ is the pdf of the Ga.˛i; ˇ/ distribution and hp ./ is the pdf of the
Di.˛ip 1 ; : : : ; ˛ip Kip ; ˛ip ;Kip C1 I Kip / distribution. And upon observing that s1 , : : : ; sI , sIC1 are ex-
pressible as functions of y, it follows from result (1.41) that u1 ; u2 , : : : ; uP , and .s1 ; : : : ; sI ; sIC1 /0
are statistically independent. Q.E.D.
Marginal distributions. Define s1 ; s2 ; : : : ; sK , and sKC1 as in Theorem 6.1.5; that is, take
s1 ; s2 ; : : : ; sK to be K random variables whose joint distribution is Di.˛1 ; ˛2 , : : : ; ˛K ; ˛KC1 I K/,
PK
and let sKC1 D 1 kD1 sk . And, taking I to be an arbitrary integer between 1 and K, inclusive,
consider the joint distribution of any I elements of the set fs1; : : : ; sK ; sKC1 g, say the k1 ; k2 ; : : : ; kI th
elements sk1 ; sk2 ; : : : ; skI .
The joint distribution of sk1 ; sk2 ; : : : ; skI can be readily determined by applying Part (2) of
Theorem 6.1.5 (in the special case where B1 D fk1 g, B2 D fk2 g, : : : ; BI D fkI g and where BI C1
is the (K C1 I )-dimensional subset of f1; : : : ; K; K C1g obtained by striking out k1 ; k2 ; : : : ; kI ).
The joint distribution is Di.˛k1 ; ˛k2 ; : : : ; ˛kI ; ˛IC1 I I /, where ˛IC1 D k2BI C1 ˛k . In particular,
P
for an arbitrary one of the integers 1; : : : ; K; K C1, say the integer k, the (marginal) distribution of
sk is Di.˛k ; KC1
PKC1
k 0 D1 .k 0 ¤k/ ˛k 0 I 1/ or, equivalently, Be.˛k ;
P
k 0 D1 .k 0 ¤k/ ˛k 0 /.
262 Some Relevant Distributions and Their Properties
distribution.
Consider the distribution of the K C1 random variables z12 = KC1 2 2
PKC1 2
kD1 zk , : : : ; zK = z , and
P
PKC1 2 PKC1 2 kD1 k
kD1 zk . As a special case of Part (2) of Theorem 6.1.4, we have that kD1 zk is distributed
PKC1 2
independently of z12 = kD1 2
zk , : : : ; zK = KC1 2
. And as a special case of Theorem 6.1.3 [or
P
z
kD1 k
of Part (1) of Theorem 6.1.4], we have that kD1 zk .K C 1/. Moreover, z12 = KC1
PKC1 2 2 2
kD1 zk ,
P
2
= KC1 2 1 1 1
kD1 zk have a Di 2 ; : : : ; 2 ; 2 I K distribution, and, more generally, any K (where
0
P
: : : ; zK
0 2
PKC1 2 2
PKC1 2 2 PKC1 2
1 K K) of the random variables z1 = kD1 zk , : : : ; zK = kD1 zk , zKC1 = kD1 zk have a
0
Di 21 ; : : : ; 12 ; KC12 K I K 0 distribution.
Now, let
KC1
X zk
uD zk2 and (for k D 1; : : : ; K; K C 1) yk D P ;
KC1 2 1=2
kD1 j D1 zj
and consider the joint distribution of u and any K 0 (where 1 K 0 K) of the random variables
y1 ; : : : ; yK ; yKC1 (which for notational convenience and without any essential loss of generality are
taken to be the first K 0 of these random variables). Let us reexpress u and y1 ; y2 ; : : : ; yK 0 as
K 0
X zk
uDvC zk2 and (for k D 1; 2; : : : ; K 0 ) yk D PK 0 ; (1.42)
2 1=2
kD1 vC j D1 zj
where v D KC1 2 2 0
kDK 0 C1 zk . Clearly, v is distributed independently of z1 ; z2 ; : : : ; zK 0 as .K K C1/.
P
0 0
Define y D .y1 ; y2 ; : : : ; yK 0 / . And observe that the K C1 equalities (1.42) define a one-to-one
transformation from the (K 0 C1)-dimensional region defined by the K 0 C1 inequalities 0 < v < 1
and 1 < zk < 1 (k D 1; 2; : : : ; K 0 ) onto the region
fu; y W 0 < u < 1; y 2 D g;
PK 0
where D D fy W 2
kD1 yk < 1g. Observe also that the inverse of this transformation is the
transformation defined by the K 0 C1 equalities
PK 0
yk2 and (for k D 1; 2; : : : ; K 0 ) zk D u1=2 yk :
vDu 1 kD1
Further, letting A represent the .K 0C1/ .K 0C1/ matrix whose ij th element is the partial derivative
of the i th element of the vector .v; z1 ; z2 ; : : : ; zK 0 / with respect to the j th element of the vector
.u; y1 ; y2 ; : : : ; yK 0 / and recalling Theorem 2.14.22, we find that
ˇ 1 PK 0 y 2
ˇ ˇ
0ˇ
j D1 j 2uy ˇ 0
jAj D ˇ ˇ D uK =2:
ˇ
ˇ.1=2/u 1=2 y 1=2 ˇ
u I
Thus, denoting by d./ the pdf of the 2 .K K 0C1/ distribution and by b./ the pdf of the N.0; IK 0 /
distribution and making use of standard results on a change of variables, the joint distribution of
u; y1 ; y2 ; : : : ; yK 0 has as a pdf the function q. ; ; ; : : : ; / (of K 0 variables) obtained by taking (for
Chi-Square, Gamma, Beta, and Dirichlet Distributions 263
q.u; y1 ; y2 ; : : : ; yK 0 /
PK 0 2 1=2
0
y uK =2
Dd u 1 kD1 yk b u
1
D uŒ.KC1/=2 1
2.KC1/=2 Œ.K K 0 C1/=2 K 0 =2
PK 0 2 Œ.K K 0C1/=2 1 u=2
1 kD1 yk e
1
D uŒ.KC1/=2 1
e u=2
2.KC1/=2 Œ.K C1/=2
Œ.K C1/=2 PK 0 2 Œ.K K 0C1/=2 1
0 K 0 =2 1 kD1 yk (1.43)
Œ.K K C1/=2
—for u and y such that 1 < u 0 or y … D , q.u; y1 ; y2 ; : : : ; yK 0 / D 0. Accordingly, we
conclude that (for all u and y1 ; y2 ; : : : ; yK 0 )
where r./ is the pdf of the 2 .KC1/ distribution and h . ; ; : : : ; / is the function (of K 0 variables)
defined (for all y1 ; y2 ; : : : ; yK 0 ) as follows:
h .y1 ; y2 ; : : : ; yK 0 /
8̂
Œ.KC1/=2 PK 0 2 Œ.K K 0C1/=2 1 PK 0
<
0 K 0 =2 1 kD1 yk ; if kD1 yk2 < 1,
D Œ.K K C1/=2 (1.45)
:̂ 0; otherwise.
In effect, we have established that y1 ; y2 ; : : : ; yK 0 are statistically independent of KC1 kD1 zk and
2
P
that the distribution of y1 ; y2 ; : : : ; yK 0 has as a pdf the function h . ; ; : : : ; / defined by expression
(1.45). In the special case where K 0 D K, the function h. ; ; : : : ; / is the pdf of the joint distribution
of y1 ; y2 ; : : : ; yK . When K 0 D K, h . ; ; : : : ; / is reexpressible as follows: for all y1 ; y2 ; : : : ; yK ,
8̂
< Œ.KC1/=2 1
PK 1=2 PK
.KC1/=2 kD1 yk2 ; if kD1 yk2 < 1,
h .y1 ;y2 ; : : : ;yK / D (1.46)
:̂ 0; otherwise.
Clearly, PK 1=2
yKC1 D iKC1 1 kD1 yk2 ;
where iKC1 D i.zKC1 /. Moreover, Pr.iKC1 D 0/ D 0. And the joint distribution of
z1 ; z2 ; : : : ; zK ; zKC1 is the same as that of z1 , z2 , : : : ; zK , zKC1 , implying that the joint dis-
tribution of u; y1 ; y2 ; : : : ; yK ; iKC1 is the same as that of u; y1 ; y2 ; : : : ; yK ; iKC1 and hence that
Pr.iKC1 D 1/ D Pr.iKC1 D 1/ D 12 , both unconditionally and conditionally on u; y1 ; y2 ; : : : ; yK
or y1 ; y2 ; : : : ; yK . Thus, conditionally on u; y1 ; y2 ; : : : ; yK or y1 ; y2 ; : : : ; yK ,
8
2 1=2
PK
; with probability 12 ,
< 1
kD1 yk
yKC1 D
: 1 PK y 2 1=2; with probability 1 .
kD1 k 2
264 Some Relevant Distributions and Their Properties
Random variables, say x1 ; : : : ; xK ; xKC1 , whose joint distribution is that of the random variables
y1 ; : : : ; yK ; yKC1 are said to be distributed uniformly on the surface of a .K C1/-dimensional unit
ball—refer, e.g., to definition 1.1 of Gupta and Song (1997). More generally, random variables
x1 ; : : : ; xK ; xKC1 whose joint distribution is that of the random variables ry1 ; : : : ; ryK ; ryKC1 are
said to be distributed uniformly on the surface of a .KC1/-dimensional ball of radius r. Note that if
x1 ; : : : ; xK ; xKC1 are distributed uniformly on the surface of a (K C1)-dimensional unit ball, then
x12 ; : : : ; xK
2
have a Di 12 ; : : : ; 12 ; 12 I K distribution.
The (K C1)-dimensional random vector z [which has a (K C1)-dimensional standard normal
distribution] is expressible in terms of the random variable u [which has a 2 .K C1/ distribution]
and the K C1 random variables y1 ; : : : ; yK ; yKC1 [which are distributed uniformly on the surface
of a (K C1)-dimensional unit ball independently of u]. Clearly,
p
z D u y; (1.47)
where y D .y1 ; : : : ; yK ; yKC1 /0.
The distribution of the (positive) square root of a chi-square random variable, say a chi-square
random variable with N degrees of freedom, is sometimes referred to as a chi distribution (with N
degrees of freedom). This distribution has a pdf b./ that is expressible as
8̂
1 2
<
.N=2/ 1
x N 1 e x =2; for 0 < x < 1,
b.x/ D .N=2/ 2 (1.48)
:̂ 0; elsewhere,
p
as can be readily verified. Accordingly, the random variable u, which appears in expression (1.47),
has a chi distribution with KC1 degrees of freedom, the pdf of which is obtainable from expression
(1.48) (upon setting N D K C1).
f .z1 ; z2 ; : : : ; zM / D g M 2
(for all values of z1 ; z2 ; : : : ; zM ). (1.49)
P
i D1 zi
where g./ is a (nonnegative) function of a single nonnegative variable (in which case the joint
distribution of z1 ; : : : ; zK ; zKC1 is spherical). Further, let
KC1
X zk
uD zk2 and (for k D 1; : : : ; K; K C 1) yk D P :
KC1 2 1=2
kD1 j D1 zj
266 Some Relevant Distributions and Their Properties
f .v; z1 ; z2 ; : : : ; zK / D 2 g vC K 2 1=2
D g vC K 2
P P 1=2
kD1 zk .1=2/v kD1 zk v
—for u and y such that 1 < u 0 or y … D , take q.u; y1 ; y2 ; : : : ; yK / D 0. Thus, for all u
and y1 ; y2 ; : : : ; yK ,
q.u; y1 ; y2 ; : : : ; yK / D r.u/ h.y1 ; y2 ; : : : ; yK /; (1.55)
where r./ is the function (of a single variable) defined (for all u) and h . ; ; : : : ; / the function (of
K variables) defined (for all y1 ; y2 ; : : : ; yK ) as follows:
.KC1/=2
8̂
< uŒ.KC1/=2 1 g.u/; for 0 < u < 1,
r.u/ D Œ.K C 1/=2 (1.56)
0; for 1 < u 0,
:̂
and
8̂
< Œ.KC1/=2 1
PK 1=2 PK
.KC1/=2 kD1 yk2 ; if kD1 yk2 < 1,
h .y1 ;y2 ; : : : ;yK / D (1.57)
:̂ 0; otherwise.
As is evident from the results of Part 1 of the present subsection, the function r./ is a pdf; it
PKC1 2
is the pdf of the distribution of kD1 zk . Further, y1 ; y2 ; : : : ; yK are statistically independent of
PKC1 2
kD1 zk , and the distribution of y1 ; y2 ; : : : ; yK has as a pdf the function h . ; ; : : : ; / defined by
expression (1.57). And, conditionally on u; y1 ; y2 ; : : : ; yK or y1 ; y2 , : : : ; yK ,
8
2 1=2
PK
; with probability 12 ,
< 1
kD1 yk
yKC1 D
: 1 PK y 2 1=2; with probability 1 ,
kD1 k 2
as can be established in the same way as in Subsection f [where it was assumed that the joint dis-
tribution of z1 ; : : : ; zK ; zKC1 is N.0; I/]. Accordingly, y1 ; : : : ; yK ; yKC1 are distributed uniformly
on the surface of a (K C1)-dimensional unit ball.
Noncentral Chi-Square Distribution 267
Let z D .z1 ; : : : ; zK ; zKC1 /0, and consider the decomposition of the vector z defined by the
identity p
z D u y; (1.58)
where y D .y1 ; : : : ; yK ; yKC1 /0. This decomposition was considered previously (in Subsection f) in
the special case where z N.0; I/. As in the special case, y is distributed
puniformly on the surface of
a (KC1)-dimensional unit ball (and is distributed independently of u or u). In the present (general)
case of an arbitrary absolutely continuous spherical distribution [i.e., where the distribution of z is
any absolutely continuous distribution with a pdf of the form (1.54)], the distribution of u is the
distribution with the pdf r./ given bypexpression (1.56) and (recalling the results of Part 1 of the
present subsection) the distribution of u is the distribution with the pdf b./ given by the expression
8̂
.KC1/=2
< 2 x K g.x 2 /; for 0 < x < 1,
b.x/ D Œ.K C 1/=2 (1.59)
:̂ 0; elsewhere.
p
In the special case where z N.0; I/, u 2 .K C1/, and u has a chi distribution (with K C1
degrees of freedom).
a. Helmert matrix
Let a D .a1 ; a2 ; : : : ; aN /0 represent an N -dimensional nonnull (column) vector. Does there exist an
N N orthogonal matrix, one of whose rows, say the first row, is proportional to a0 ? Or, equivalently,
does there exist an orthonormal basis for RN that includes the vector a? In what follows, the answer
is shown to be yes. The approach taken is to describe a particular N N orthogonal matrix whose
first row is proportional to a0. Other approaches are possible—refer, e.g., to Harville (1997, sec 6.4).
Let us begin by considering a special case. Suppose that ai ¤ 0 for i D 1, 2, : : : ; N . And
consider the N N matrix P , whose first through N th rows, say p01 ; p02 ; : : : ; p0N , are each of norm
1 and are further defined as follows: take p01 proportional to a0, take p02 proportional to
.a1 ; a12 =a2 ; 0; 0; : : : ; 0/I
take p03 proportional to
Œa1 ; a2 ; .a12 Ca22 /=a3 ; 0; 0; : : : ; 0I
and, more generally, take the second through N th rows proportional to
Pk 1 2
.a1 ; a2 ; : : : ; ak 1 ; i D1 ai =ak ; 0; 0; : : : ; 0/ .k D 2; 3; : : : ; N /; (2.1)
respectively.
Clearly, the N 1 vectors (2.1) are orthogonal to each other and to the vector a0. Thus, P is an
orthogonal matrix. Moreover, upon “normalizing” a0 and the vectors (2.1), we find that
2 1=2
PN
p01 D (2.2)
i D1 ai .a1 ; a2 ; : : : ; aN /
and that, for k D 2; 3; : : : ; N ,
268 Some Relevant Distributions and Their Properties
" #1=2
ak2 Pk 1
p0k D Pk 1 Pk .a1 ; a2 ; : : : ; ak 1; i D1 ai2 =ak ; 0; 0; : : : ; 0/: (2.3)
2 2
i D1 ai i D1 ai
Thus,
p
8̂
<p 1 e .uC/=2
cosh u ; for u > 0,
h.u/ D 2u (2.6)
:̂ 0; for u 0.
The pdf h./ can be further reexpressed by making use of the power-series representation
1
X x 2r
cosh.x/ D . 1 < x < 1/
rD0
.2r/Š
for the hyperbolic cosine function and by making use of result (3.5.11). We find that, for u > 0,
1
1 .uC/=2
X .u/r
h.u/ D p e 1
2u .2/ 1=2 22rC.1=2/ rŠ r C
rD0 2
1
X .=2/r e =2
1
D uŒ.2rC1/=2 1
e u=2
: (2.7)
rD0
rŠ Œ.2r C 1/=2 2.2rC1/=2
And letting (for j D 1; 2; 3; : : : ) gj ./ represent the pdf of a central chi-square distribution with j
degrees of freedom, it follows that (for all u)
270 Some Relevant Distributions and Their Properties
1
X .=2/r e =2
h.u/ D g2rC1 .u/: (2.8)
rD0
rŠ
The distribution of w, like that of w1 , is a mixture distribution. As in the case of the pdf of
the distribution of w1 , the pdf of the distribution of w is a weighted average of the pdfs of central
chi-square distributions. Moreover, the sequence p0 ; p1 ; p2 ; : : : of weights is the same in the case
of the pdf of the distribution of w as in the case of the pdf of the distribution of w1 . And the central
chi-square distributions represented in one of these weighted averages are related in a simple way to
those represented in the other; each of the central chi-square distributions represented in the weighted
average (2.12) have an additional K degrees of freedom.
In light of results (2.5) and (2.8), the pdf of the 2 .N; / distribution is obtainable as a special case
of the pdf (2.12). Specifically, upon setting K D N 1 and setting (for r D 0; 1; 2; : : :) Nr D 2r C1
and pr D p.r/ [where, as in Part 1 of the present subsection, p.r/ D .=2/r e =2=rŠ], we obtain
the pdf of the 2 .N; / distribution as a special case of the pdf (2.12). Accordingly, the pdf of the
2 .N; / distribution is the function q./ that is expressible as follows:
1
X
q.w/ D p.r/ g2rCN .w/ (2.13)
rD0
1
X .=2/r e =2
D g2rCN .w/ (2.14)
rD0
rŠ
8̂ 1
X .=2/r e =2
1
< w Œ.2rCN /=2 1
e w=2
; for w > 0,
D rD0
rŠ Œ.2r C N /=2 2.2rCN /=2 (2.15)
0; for w 0.
:̂
1
where h t ./ is the pdf of the N Œ=.1 2t/; .1 2t/ distribution. Thus, for t < 1=2,
t2
2
E et z 2t/ 1=2 exp
D .1 :
1 2t
And it follows that the moment generating function, say m./, of the 2 .1; / distribution is
t
m.t/ D .1 2t/ 1=2 exp .t < 1=2/: (2.17)
1 2t
To obtain an expression for the moment generating function of the 2 .N; / distribution (where
N is any strictly positive integer), it suffices (in light of Theorem 6.2.1) to find the moment generating
function of the distribution of the sum w D w1 C w2 of two random variables w1 and w2 that are
distributed independently as 2.N 1/ and 2.1; /, respectively. Letting m1./ represent the moment
generating function of the 2 .N 1/ distribution and m2 ./ the moment generating function of the
2 .1; / distribution and making use of results (1.20) and (2.17), we find that, for t < 1=2,
E e t w D E e t w1 E e t w2 D m1 .t/ m2 .t/
t
D .1 2t/ .N 1/=2 .1 2t/ 1=2 exp
1 2t
N=2 t
D .1 2t/ exp :
1 2t
Thus, the moment generating function, say m./, of the 2 .N; / distribution is
N=2 t
m.t/ D .1 2t/ exp .t < 1=2/: (2.18)
1 2t
Or, upon reexpressing expŒt=.1 2t/ as
t =2
exp D exp. =2/ exp
1 2t 1 2t
Noncentral Chi-Square Distribution 273
Alternatively, expression (2.20) for the moment generating function of the 2 .N; / distribution
can be derived from the pdf (2.14). Letting (as in Subsection c) gj ./ represent (for an arbitrary
strictly positive integer j ) the pdf of a 2 .j / distribution, the alternative approach gives
1 1
.=2/r e =2
Z X
m.t/ D et w g2rCN .w/ dw
0 rD0
rŠ
1
.=2/r e =2 1 t w
X Z
D e g2rCN .w/ dw: (2.21)
rD0
rŠ 0
R1
If we use formula (1.20) to evaluate (for each r) the integral 0 e t w g2rCN .w/ dw, we arrive
immediately at expression (2.20).
The cumulant generating function, say c./, of the 2 .N; / distribution is [in light of result
(2.18)]
c.t/ D log m.t/ D .N=2/ log.1 2t/ C t.1 2t/ 1 .t < 1=2/: (2.22)
Upon expanding c.t/ in a power series (about 0), we find that (for 1=2 < t < 1=2)
1
X 1
X
c.t/ D .N=2/ .2t/r=r C t .2t/r 1
rD1 rD1
1
X
D 2r 1 r
t .N C r/=r
rD1
X1
D .N C r/ 2r 1
.r 1/Š t r=rŠ: (2.23)
rD1
.N C r/ 2r 1
.r 1/Š: (2.24)
f. Moments
2
Mean and variance. Let w represent a random variable whose
p distribution is .N; /. And [for
2
purposes of determining E.w/, var.w/, and E.w /] let xpD C z, where z is a random variable
that has a standard normal distribution, so that x N ; 1 and hence x 2 2 .1; /. Then, in
light of Theorem 6.2.1,
w x 2 C u;
where u is a random variable that is distributed independently of z (and hence distributed indepen-
dently of x and x 2 ) as 2 .N 1/.
274 Some Relevant Distributions and Their Properties
Clearly,
E.x 2 / D var.x/ C ŒE.x/2 D 1 C : (2.25)
And, making use of results (3.5.19) and (3.5.20), we find that
p 4
E.x 4 / D E Cz
p
D E z 4 C 4z 3 C 6z 2 C 4z3=2 C 2
D 3 C 0 C 6 C 0 C 2 D 2 C 6 C 3; (2.26)
Higher-order moments. Let (as in Part 1) w represent a random variable whose distribution is
2 .N; /. Further, take (for an arbitrary strictly positive integer k) gk ./ to be the pdf of a 2 .k/
distribution, and recall [from result (2.14)] that the pdf, say q./, of the 2 .N; / distribution is
expressible (for all w) as 1
X .=2/j e =2
q.w/ D g2j CN .w/:
jŠ
j D0
Then, using result (1.27), we find that (for r > N=2)
Z 1
E.w r / D w r q.w/ dw
0
1 1
.=2/j e =2
X Z
D w r g2j CN .w/ dw
jŠ 0
j D0
1
X .=2/j e =2
Œ.N=2/ C j C r
D 2r
jŠ Œ.N=2/ C j
j D0
1
X .=2/j Œ.N=2/ C j C r
D 2r e =2
: (2.31)
j Š Œ.N=2/ C j
j D0
Now, define mr D E.w r /, and regard mr as a function of the noncentrality parameter . And
observe that (for r > N=2)
1
d mr X j .=2/j 1 Œ.N=2/ C j C r
D .1=2/ mr C 2r 1
e =2
d j Š Œ.N=2/ C j
j D0
Verification of formula (2.35). The verification of formula (2.35) is by mathematical induction. The
formula is valid for r D 0—according to the formula, E.w 0 / D 1. Now, suppose that the formula is
valid for r D k (where k is an arbitrary nonnegative integer), that is, suppose that
k
!
X k .=2/j
mk D 2k Œ.N=2/ C k (2.37)
j Œ.N=2/ C j
j D0
j
—as before, mj D E.w / (for j > N=2). We wish to show that formula (2.35) is valid for
r D k C 1, that is, to show that
kC1
!
kC1
X k C1 .=2/j
mkC1 D 2 Œ.N=2/ C k C 1 : (2.38)
j Œ.N=2/ C j
j D0
276 Some Relevant Distributions and Their Properties
and that
k
!
d mk X k j .=2/j
2 D 2kC1 Œ.N=2/ C k : (2.41)
d j Œ.N=2/ C j
j D1
Upon starting with expression (2.39) and substituting expressions (2.37), (2.40), and (2.41) for mk ,
mk , and 2.d mk =d/, respectively, we find that
k
!
.=2/j
kC1
X k
mkC1 D 2 Œ.N=2/ C k Œ.N=2/ C k
j Œ.N=2/ C j
j D0
kC1
!
X k Œ.N=2/ C j 1.=2/j
C
j 1 Œ.N=2/ C j
j D1
k
!
j .=2/j
X k
C : (2.42)
j Œ.N=2/ C j
j D1
Expressions (2.38) and (2.42) are both polynomials of degree k C1 in =2. Thus, to establish
equality (2.38), it suffices to establish that the coefficient of .=2/j is the same for each of these two
polynomials (j D 0; 1; : : : ; k C1). In the case of the polynomial (2.42), the coefficient of .=2/0 is
Clearly, the coefficients (2.43), (2.44), and (2.45) are identical to the coefficients of the polyno-
mial (2.38). Accordingly, equality (2.38 is established, and the mathematical-induction argument is
complete.
The cumulants of the Ga.˛; ˇ; ı/ distribution can be determined from expression (2.48) in essentially
the same way that the cumulants of the 2.N; / distribution were determined from expression (2.18):
for r D 1; 2; 3; : : : , the rth cumulant is
Theorem 6.2.1 (on the distribution of a sum of noncentral chi-square random variables) can
be generalized. Suppose that K random variables w1 ; w2 ; : : : ; wK are distributed independently
as Ga.˛1 ; ˇ; ı1 /, Ga.˛2 ; ˇ; ı2 /, : : : ; Ga.˛K ; ˇ; ıK /, respectively. Then, the moment generating
function, say m./, of the sum K kD1 wk of w1 ; w2 ; : : : ; wK is given by the formula
P
.˛1 C˛2 CC˛K / .ı1 C ı2 C C ıK /ˇt
m.t/ D .1 ˇt/ exp .t < 1=ˇ/;
1 ˇt
as is evident from result (2.48) upon observing that
PK K
Y K
Y
E et wk
DE e t wk D E e t wk :
kD1
kD1 kD1
Formula (2.55) can be verified via a mathematical-induction argument akin to that employed in Part
3 of Subsection f, and, in what is essentially a generalization of result (2.36), is [in light of result
(1.23)] reexpressible in the “simplified” form
r 1
!
r r r
X r j
E.w / D ˇ Œı C .˛Cj /.˛Cj C1/.˛Cj C2/ .˛Cr 1/ ı : (2.56)
j
j D0
established earlier (in Subsection b) that in the special case where z N.0; I/ and hence where
x N.; I/, the distribution of x0 x depends on the value of only through or, equivalently, only
through —in that special case, the distribution of x0 x is, by definition, the noncentral chi-square
distribution with parameters N and . In fact, the distribution of x0 x has this property (i.e., the
property of depending on the value of only through or, equivalently, only through ) not only in
the special case where z N.0; I/, but also in the general case (where the distribution of z is any
particular spherical distribution).
To see this, take (as in Subsection b) A to be any N N orthogonal matrix whose first column is
.1=/—if D 0 (or, equivalently, if D 0), take A to be an arbitrary N N orthogonal matrix.
And observe that
x0 x D x0 Ix D x0 AA0 x D .A0 x/0 A0 x (2.57)
and that
A0 x D A0 C w; (2.58)
where w D A0 z. Observe also that
A0 D ; (2.59)
0
that
w z; (2.60)
w1
and, upon partitioning w as w D (where w1 is a scalar), that
w2
x0 x D .A0 C w/0 .A0 C w/ D . C w1 /2 C w20 w2 D C 2w1 C w 0 w: (2.61)
It is now clear [in light of results (2.60) and (2.61)] that the distribution of x0 x depends on the value
of only through or, equivalently, only through .
Consider now the special case of an absolutely continuous spherical distribution where the dis-
tribution of z is absolutely continuous with a pdf f ./ such that
f .z/ D g.z0 z/ .z 2 RN / (2.62)
0
for some (nonnegative) function g./ (of a single nonnegative variable). Letting u D w w and
v D w1 =.w 0 w/1=2, we find [in light of result (2.61)] that (for w ¤ 0) x0 x is expressible in the form
x0 x D C 2vu1=2 C u: (2.63)
0
Moreover, in light of result (2.60), the joint distribution of u and v is the same as that of z z and
z1 =.z0 z/1=2, implying (in light of the results of Section 6.1g) that u and v are distributed independently,
that the distribution of u is the distribution with pdf r./, where
280 Some Relevant Distributions and Their Properties
8̂ N=2
< u.N=2/ 1
g.u/; for 0 < u < 1,
r.u/ D .N=2/ (2.64)
0; for 1 < u 0,
:̂
and that the distribution of v is the distribution with pdf h ./, where
8̂
.N=2/
< .1 v 2 /.N 3/=2; if 1 < v < 1,
h .v/ D Œ.N 1/=2 1=2
(2.65)
:̂ 0; otherwise.
Define y D x0 x. In the special case where z N.0; I/ [and hence where x N.; I/], y
has (by definition) a 2 .N; / distribution, the pdf of which is a function q./ that is expressible in
the form (2.15). Let us obtain an expression for the pdf of the distribution of y in the general case
[where the distribution of z is any (spherical) distribution that is absolutely continuous with a pdf
f ./ of the form (2.62)].
Denote by d. ; / the pdf of the joint distribution of u and v, so that (for all u and v)
d.u; v/ D r.u/ h .v/:
Now, introduce a change of variables from u and v to the random variables y and s, where
C w1 C vu1=2
sD D
y 1=2 . C 2vu1=2 C u/1=2
as can be readily verified. Observe also that vu1=2 D sy 1=2 and hence that
[and by taking, for 1 < y 0, q.y/ D 0]. In the special case where D 0 (and hence where
D 0), expression (2.67) simplifies to
N=2 .N=2/ 1
q.y/ D y g.y/;
.N=2/
in agreement with the pdf (1.56) derived earlier (in Section 6.1g) for that special case.
a. (Central) F distribution
Relationship to the beta distribution. Let u and v represent random variables that are distributed
independently as 2 .M / and 2 .N /, respectively. Further, define
u=M u
wD and xD :
v=N uCv
Then,
Nx Mw .M=N /w
wD and xD D ; (3.3)
M.1 x/ N C Mw 1 C .M=N /w
as can be readily verified.
282 Some Relevant Distributions and Their Properties
By definition, w SF .M; N /. And in light of the discussion of Sections 6.1a, 6.1b, and 6.1c,
x Be.M=2; N=2/. In effect, we have established the following result: if x is a random variable
that is distributed as Be.M=2; N=2/, then
Nx
SF .M; N /I (3.4)
M.1 x/
and if w is a random variable that is distributed as SF .M; N /, then
Mw
Be.M=2; N=2/: (3.5)
N C Mw
The cdf (cumulative distribution function) of an F distribution can be reexpressed in terms
of an incomplete beta function ratio (which coincides with the cdf of a beta distribution)—the
incomplete beta function ratio was introduced and discussed in Section 6.1b. Denote by F ./ the
cdf of the SF .M; N / distribution. Then, letting x represent a random variable that is distributed as
Be.M=2; N=2/, we find [in light of result (3.4)] that (for any nonnegative scalar c)
Nx Mc
F .c/ D Pr c D Pr x D IM c=.NCM c/.M=2; N=2/ (3.6)
M.1 x/ N C Mc
—for c < 0, F .c/ D 0. Moreover, in light of result (1.14) on the incomplete beta function ratio, the
cdf of the SF .M; N / distribution can also be expressed (for c 0) as
F .c/ D 1 IN=.NCM c/ .N=2; M=2/: (3.7)
Distribution of the reciprocal. Let w represent a random variable that has an SF .M; N / distribution.
Then, clearly,
1
SF .N; M /: (3.8)
w
Now, let F ./ represent the cdf (cumulative distribution function) of the SF .M; N / distribution
and G./ the cdf of the SF .N; M / distribution. Then, for any strictly positive scalar c,
Joint distribution. As in Part 1 of the present subsection, take u and v to be statistically in-
dependent random variables that are distributed as 2 .M / and 2 .N /, respectively, and define
w D .u=M /=.v=N / and x D u=.u C v/. Let us consider the joint distribution of w and the random
variable y defined as follows:
y D u C v:
Central and Noncentral F Distributions 283
Œ.M CN /=2
D .M=N /M=2 w .M=2/ 1 Œ1 C .M=N /w .M CN /=2 (3.12)
.M=2/.N=2/
Œ.M=2/Cr Œ.N=2/ r
E.w r / D .N=M /r E.ur / E.v r
/ D .N=M /r : (3.13)
.M=2/ .N=2/
Further, it follows from results (1.28) and (1.31) that the rth (integer) moment of the SF .M; N /
distribution is expressible as
M.M C 2/.M C 4/ ŒM C 2.r 1/
E.w r / D .N=M /r (3.14)
.N 2/.N 4/ .N 2r/
(r D 1; 2; : : : < N=2). For r N=2, the rth moment of the SF .M; N / distribution does not exist.
(And, as a consequence, the F distribution does not have a moment generating function.)
The mean of the SF .M; N / distribution is (if N > 2)
E.w/ D N=.N 2/: (3.15)
284 Some Relevant Distributions and Their Properties
And the second moment and the variance are (if N > 4)
M.M C 2/ M C2 N2
E.w 2 / D .N=M /2 D (3.16)
.N 2/.N 4/ M .N 2/.N 4/
and 2N 2 .M C N 2/
var.w/ D E.w 2 / ŒE.w/2 D : (3.17)
M.N 2/2 .N 4/
Noninteger degrees of freedom. The definition of the F distribution can be extended to noninteger
degrees of freedom in much the same way as the definition of the chi-square distribution. For
arbitrary strictly positive numbers M and N, take u and v to be random variables that are distributed
independently as Ga M and N
; 2 , respectively, and define
2
; 2 Ga 2
u=M
wD : (3.18)
v=N
Let us regard the distribution of the random variable w as an F distribution with M (possibly
noninteger) numerator degrees of freedom and N (possibly noninteger) denominator degrees of
freedom. In the special case where M and N are (strictly positive) integers, the Ga M 2 ; 2 distribution
is identical to the 2 .M / distribution and the Ga N2 ; 2 distribution is identical to the 2 .N /
distribution (as is evident from the results of Section 6.1c), so that this usage of the term F distribution
is consistent with our previous usage of this term.
Note that in the definition (3.18) of the random variable w, we could have taken u and v to
be random variables that are distributed independently as Ga M 2 ; ˇ and Ga N
2 ; ˇ , respectively
(where ˇ is an arbitrary strictly positive scalar). The distribution of w is unaffected by the choice
of ˇ. To see this, observe (as in Part 1 of the present subsection) that w D N x=ŒM.1 x/, where
x D u=.u C v/, and that (irrespective of the value of ˇ) u=.u C v/ Be M N
2 ; 2 .
A function of random variables that are distributed independently and identically as N.0; 1/.
Let z D .z1 ; z2 ; : : : ; zN /0 (where N 2) represent an (N -dimensional) random (column) vector
that has an N -variate standard normal distribution or, equivalently, whose elements are distributed
independently and identically as N.0; 1/. And let K represent any integer between 1 and N 1,
inclusive. Then, PK 2
i D1 zi =K
PN SF .K; N K/; (3.19)
2
i DKC1 zi =.N K/
For j D 1; 2; : : : ; J , let uj
sj D PJ :
vC j 0 D1 uj 0
PJ PJ
Then, 1 and, consequently,
ı
j 0 D1 sj 0 Dv vC j 0 D1 uj 0
N sj
wj D PJ :
Mj 1 j 0 D1 sj 0
distribution.
More specifically, suppose that z1 ; z2 ; : : : ; zJ ; zJ C1 are random column vectors of dimensions
PJ C1
N1 ; N2 ; : : : ; NJ ; NJ C1 , respectively, the joint distribution of which is j D1 Nj -variate standard
normal N.0; I/. And, for j D 1; 2; : : : ; J; J C 1, denote by zjk the kth element of zj . Further, define
PNj 2
kD1 zjk =Nj
wj D PN .j D 1; 2; : : : ; J /;
J C1 2
kD1 zJC1; k =N J C1
PNJC1 2
and observe that the JC1 sums of squares kD1 z1k ; kD1 z2k ; : : : ; N
PN1 2 PN2 2 P J 2
z , and kD1
kD1 J k
zJC1; k
2 2 2 2
are distributed independently as .N1 /; .N2 /; : : : ; .NJ /, and .NJC1 /, respectively. Then,
for j D 1; 2; : : : ; J , the (marginal) distribution of wj is SF .Nj ; NJC1 / and is related to a beta
distribution; for j D 1; 2; : : : ; J ,
NJC1 xj
wj D ;
Nj .1 xj /
PNj 2 ı PNj 2 PNJC1 2 Nj NJC1
where xj D kD1 . The joint distribution of
zjk kD1 zjk C kD1 zJC1; k Be 2 ; 2
w1 ; w2 ; : : : ; wJ is related to a Dirichlet distribution; clearly,
PNj 2
NJC1 sj kD1 zjk
wj D PJ ; where sj D PJC1 PN 0 .j D 1; 2; : : : ; J /;
Nj 1 2
j 0 D1 sj 0
j
0
j D1 z
kD1 j k 0
N1 N2 NJ NJC1
and s1 ; s2 ; : : : ; sJ are jointly distributed as Di 2 ; 2 ; : : : ; 2 ; 2 I J .
1) zk zk2
2
yk D P and w k D y k D KC1 2
;
KC1 2 1=2
P
j D1 zj j D1 zj
PK
in which case wKC1 D 1 kD1 wk . As is evident from the results of Part 1 of Section 6.1g,
w1 ; w2 ; : : : ; wK are statistically independent of w0 and have a Di 12 ; : : : ; 12 ; 12 I K distribution; and
as is evident from the results of Part 2, y1 , : : : ; yK , yKC1 are statistically independent of w0 and
are distributed uniformly on the surface of a (K C1)-dimensional unit ball. There is an implication
that the distribution of w1 ; : : : ; wK ; wKC1 and the distribution of y1 ; : : : ; yK ; yKC1 are the same in
the general case (of an arbitrary absolutely continuous spherical distribution) as in the special case
where z N.0; IKC1 /. Thus, we have the following theorem.
Theorem 6.3.1. Let z D .z1 ; : : : ; zK ; zKC1 /0 represent any (K C1)-dimensional random (col-
umn) vector having an absolutely continuous spherical distribution. And define w0 D KC1 2
kD1 zk and
P
(for k D 1; : : : ; K; K C1)
286 Some Relevant Distributions and Their Properties
zk zk2
yk D P and wk D yk2 D PKC1 :
KC1 2 1=2 zj2
j D1 zj j D1
(1) For “any” function g./ (defined on RKC1 ) such that g.z/ depends on the value of z only through
w1 ; : : : ; wK ; wKC1 or, more generally, only through y1 , : : : ; yK , yKC1 , the random variable g.z/
is statistically independent of w0 and has the same distribution in the general case (of an arbitrary
absolutely continuous spherical distribution) as in the special case where z N.0; IKC1 /. (2)
For “any” P functions g1 ./; g2 ./; : : : ; gP ./ (defined on RKC1 ) such that (for j D 1; 2; : : : ; P )
gj .z/ depends on the value of z only through w1 ; : : : ; wK ; wKC1 or, more generally, only through
y1 ; : : : ; yK ; yKC1 , the random variables g1 .z/; g2 .z/; : : : ; gP .z/ are statistically independent of w0
and have the same (joint) distribution in the general case (of an arbitrary absolutely continuous
spherical distribution) as in the special case where z N.0; IKC1 /.
Application to the F distribution. Let z D .z1 ; z2 ; : : : ; zN /0 (where N 2) represent an N -di-
mensional random (column) vector. In the next-to-last part of Subsection a, it was established that if
z N.0; IN /, then (for any integer K between 1 and N 1, inclusive)
PK 2
i D1 zi =K
PN SF .K; N K/: (3.20)
2
i DKC1 zi =.N K/
Clearly, PK 2
PK
i D1 zi =K i D1 wi =K
PN D PN ; (3.21)
2
i DKC1 zi =.N K/ i DKC1 wi =.N K/
2 N 2
where (for i D 1; 2; : : : ; N ) wi D zi i 0 D1 zi 0 . And we conclude [on the basis of Part (1)
ı P
P C1
Clearly, s1 ; s2 ; : : : ; sJ and hence w1 ; w2 ; : : : ; wJ depend on the values of the jJD1 Nj quanti-
2
ıPJC1 PNj 0 2
ties zjk j 0 D1 k 0 D1 zj 0 k 0 (j D 1; 2; : : : ; JC1; k D 1; 2, : : : ; Nj ). Thus, it follows from Theorem
6.3.1 that if the joint distribution of z1 , z2 , : : : ; zJ , zJ C1 is an absolutely continuous spherical dis-
PNj 2
tribution, then s1 , s2 , : : : ; sJ and w1 ; w2 ; : : : ; wJ are distributed independently of jJC1 z ,
P
D1 kD1 jk
and the joint distribution of s1 ; s2 ; : : : ; sJ and the joint distribution of w1 ; w2 ; : : : ; wJ are the same
as in the special case where the joint distribution of z1 ; z2 ; : : : ; zJ ; zJ C1 is N.0; I/. And in light of
the results obtained earlier (in the last part of Subsection a) for that special case, we are able to infer
that if the joint distribution of z1 , z2 , : : : ; zJ , zJ C1 is an absolutely continuous spherical distribution,
N
then the joint distribution of s1 ; s2 ; : : : ; sJ is Di N21 ; N22 ; : : : ; N2J ; JC1 2 I J . [The same inference
could be made “directly” by using the results of Section 6.1g to establish that if the joint distribution
Central and Noncentral F Distributions 287
c. Noncentral F distribution
A related distribution: the noncentral beta distribution. Let u and v represent random variables
that are distributed independently as 2 .M; / and 2 .N /, respectively. Further, define
u=M u
wD and xD :
v=N u C v
Then, as in the case of result (3.3),
Nx Mw .M=N /w
wD and xD D : (3.22)
M.1 x/ N C Mw 1 C .M=N /w
Let y D u C v. Then, in light of results (2.11) and (2.13), the joint distribution of y and x has
as a pdf the function f . ; / obtained by taking (for all y and x)
1
X
f .y; x/ D p.r/ gM CN C2r .y/ d.M C2r/=2; N=2 .x/; (3.23)
rD0
where (for r D 0; 1; 2; : : : ) p.r/ D .=2/r e =2=rŠ , where (for any strictly positive integer j )
gj ./ denotes the pdf of a 2 .j / distribution, and where (for any values of the parameters ˛1 and
˛2 ) d˛1;˛2 ./ denotes the pdf of a Be.˛1 ; ˛2 / distribution. Thus, a pdf, say h./, of the (marginal)
distribution of x is obtained by taking (for all x)
Z 1 X1 Z 1
h.x/ D f .y; x/ dy D p.r/ d.M C2r/=2; N=2 .x/ gM CN C2r .y/ dy
0 rD0 0
X1
D p.r/ d.M C2r/=2; N=2 .x/: (3.24)
rD0
An extension. The definition of the noncentral beta distribution can be extended. Take x D u=.u C
v/, where u and v are random variables that are distributed independently as Ga.˛1 ; ˇ; ı/ and
Ga.˛2 ; ˇ/, respectively—here, ˛1, ˛2 , and ˇ are arbitrary strictly positive scalars and ı is an arbitrary
nonnegative scalar. Letting y D u Cv and proceeding in essentially the same way as in the derivation
of result (2.11), we find that the joint distribution of y and x has as a pdf the function f . ; / obtained
by taking (for all y and x)
1
X ı re ı
f .y; x/ D gr .y/ d˛1 Cr; ˛2 .x/; (3.25)
rD0
rŠ
where (for r D 0; 1; 2; : : : ) gr ./ represents the pdf of a Ga.˛1 C ˛2 C r; ˇ/ distribution and
d˛1 Cr; ˛2 ./ represents the pdf of a Be.˛1Cr; ˛2 / distribution. Accordingly, the pdf of the (marginal)
distribution of x is the function h./ obtained by taking (for all x)
Z 1 1
X ı re ı
h.x/ D f .y; x/ dy D d˛1 Cr; ˛2 .x/: (3.26)
0 rD0
rŠ
288 Some Relevant Distributions and Their Properties
Formulas (3.25) and (3.26) can be regarded as extensions of formulas (3.23) and (3.24); formulas
(3.23) and (3.24) are for the special case where ˛1 D M=2, ˇ D 2, and ˛2 D N=2 (and where ı is
expressed in the form =2).
Take the noncentral beta distribution with parameters ˛1 , ˛2 , and ı (where ˛1 > 0, ˛2 > 0, and
ı 0) to be the distribution of the random variable x or, equivalently, the distribution with pdf h./
given by expression (3.26)—the distribution of x does not depend on the parameter ˇ. And denote
this distribution by the symbol Be.˛1 ; ˛2 ; ı/.
Probability density function (of the noncentral F distribution). Earlier (in Subsection a), the pdf
(3.12) of the (central) F distribution was derived from the pdf of a beta distribution. By taking a
similar approach, the pdf of the noncentral F distribution can be derived from the pdf of a noncentral
beta distribution.
Let x represent a random variable that has a noncentral beta distribution with parameters M=2,
N=2, and =2 (where M and N are arbitrary strictly positive integers and is an arbitrary nonnegative
scalar). And define
Nx
wD : (3.27)
M.1 x/
Then, in light of what was established earlier (in Part 1 of the present subsection),
w SF .M; N; /:
Moreover, equality (3.27) defines a one-to-one transformation from the interval 0 < x < 1 onto the
interval 0 < w < 1; the inverse transformation is that defined by the equality
x D .M=N /wŒ1 C .M=N /w 1:
Thus, a pdf, say f ./, of the SF .M; N; / distribution is obtainable from a pdf of the distribution of
x.
For r D 0; 1; 2; : : : , let p.r/ D .=2/r e =2=rŠ , and, for arbitrary strictly positive scalars ˛1
and ˛2 , denote by d˛1; ˛2 ./ the pdf of a Be.˛1 ; ˛2 / distribution. Then, according to result (3.24), a
pdf, say h./, of the distribution of x is obtained by taking (for all x)
X1
h.x/ D p.r/ d.M C2r/=2; N=2 .x/:
rD0
And upon observing that
dx 2 1
D .M=N /Œ1 C .M=N /w and 1 x D Œ1 C .M=N /w
dw
and making use of standard results on a change of variable, we find that, for 0 < w < 1,
1
X Œ.M CN C2r/=2
D p.r/ .M=N /.M C2r/=2
rD0
Œ.M C2r/=2.N=2/
w Œ.M C2r/=2 1 Œ1 C .M=N /w .M CN C2r/=2
1
X
D p.r/ŒM=.M C2r/ gr fŒM=.M C2r/wg; (3.28)
rD0
where (for r D 0; 1; 2; : : : ) gr ./ denotes the pdf of the SF .M C2r; N / distribution—for 1 <
w 0, f .w/ D 0.
Moments. Let w represent a random variable that has an SF .M; N; / distribution. Then, by def-
inition, w .N=M /.u=v/, where u and v are random variables that are distributed independently
Central and Noncentral F Distributions 289
as 2 .M; / and 2 .N /, respectively. And making use of results (1.27) and (2.31), we find that, for
M=2 < r < N=2,
For r N=2, the rth moment of the SF .M; N; / distribution, like that of the central F
distribution, does not exist. And (as in the case of the central F distribution) the noncentral F
distribution does not have a moment generating function.
Upon recalling results (2.28) and (2.30) [or (2.33) and (2.34)] and result (1.23) and applying
formula (3.29), we find that the mean of the SF .M; N; / distribution is (if N > 2)
N
E.w/ D 1C (3.31)
N 2 M
and the second moment and variance are (if N > 4)
.M C2/.M C2/ C 2
E.w 2 / D .N=M /2
.N 2/.N 4/
2
2
N 2 2
D 1C C 1C (3.32)
.N 2/.N 4/ M M M
and
2N 2 M Œ1 C .=M /2 C .N 2/Œ1 C 2.=M /
˚
2 2
var.w/ D E.w / ŒE.w/ D : (3.33)
M .N 2/2 .N 4/
Moreover, results (3.31) and (3.32) can be regarded as special cases of a more general result: in
light of results (1.23) and (2.36), it follows from result (3.29) that (for r D 1; 2; : : : < N=2) the rth
moment of the SF .M; N; / distribution is
r
Nr
E.w r / D
.N 2/.N 4/ .N 2r/ M
r 1
!
X .MC 2j /ŒMC 2.j C1/ŒMC 2.j C2/ ŒMC 2.r 1/ r j
C : (3.34)
Mr j j M
j D0
Noninteger degrees of freedom. The definition of the noncentral F distribution can be extended to
noninteger degrees of freedom in much the same way as the definitions of the (central) F distribution
and the central and noncentral chi-square distributions. For arbitrary strictly positive numbers M and
N (and an arbitrary nonnegative number ), take u and v to be random variables that are distributed
independently as Ga.M=2; N=2; =2/ and Ga.N=2; 2/, respectively. Further, define
u=M
wD :
v=N
Let us regard the distribution of the random variable w as a noncentral F distribution with M
(possibly noninteger) numerator degrees of freedom and N (possibly noninteger) denominator de-
grees of freedom (and with noncentrality parameter ). When M is an integer, the Ga.M=2; 2; =2/
290 Some Relevant Distributions and Their Properties
distribution is identical to the 2 .M; / distribution, and when N is an integer, the Ga.N=2; 2/ dis-
tribution is identical to the 2 .N / distribution, so that this usage of the term noncentral F distribution
is consistent with our previous usage.
Let x D u=.u C v/. As in the special case where M and N are integers, w and x are related
to each other as follows:
Nx .M=N /w
wD and xD :
M.1 x/ 1 C .M=N /w
By definition, x has a Be.M=2; N=2; =2/ distribution—refer to Part 2 of the present subsection.
Accordingly, the distribution of w is related to the Be.M=2, N=2, =2/ distribution in the same
way as in the special case where M and N are integers. Further, the distribution of x and hence that
of w would be unaffected if the distributions of the (statistically independent) random variables u
and v were taken to be Ga.M=2; ˇ; =2/ and Ga.N=2; ˇ/, respectively, where ˇ is an arbitrary
strictly positive number (not necessarily equal to 2).
is referred to as the M -variate t distribution (or when the dimension of t is unspecified or is clear
from the context) as the multivariate t distribution. The parameters of this distribution consist of
the degrees of freedom N and the M.M 1/=2 correlations rij (i > j D 1; 2; : : : ; M )—the
diagonal elements of R equal 1 and (because R is symmetric) only M.M 1/=2 of its off-diagonal
elements are distinct. Let us denote the multivariate t distribution with N degrees of freedom and
with correlation matrix R by the symbol MV t.N; R/—the number M of variables is discernible
from the dimensions of R. The ordinary (univariate) t distribution S t.N / can be regarded as a special
case of the multivariate t distribution; it is the special case MV t.N; 1/ where the correlation matrix
is the 1 1 matrix whose only element equals 1.
a. (Central) t distribution
Related distributions. The t distribution is closely related to the F distribution (as is apparent from
the very definitions of the t and F distributions). More specifically, the S t.N / distribution is related
to the SF .1; N / distribution.
Let t represent a random variable that is distributed as S t.N /, and F a random variable that is
distributed as SF .1; N /. Then,
t 2 F; (4.2)
or, equivalently, p
jtj F : (4.3)
The S t.N / distribution (i.e., the t distribution with N degrees of freedom) is also closely related
to the distribution of the random variable y defined as follows:
z
yD ;
.v C z 2 /1=2
where z and v are as defined in the introduction to the present section [i.e., where z and v are
2
random
ıpvariables that are distributed independently as N.0; 1/ and .N /, respectively]. Now, let
t D z v=N , in which case t S t.N / (as is evident from the very definition of the t distribution).
Then, t and y are related as follows:
p
Ny t
tD 2 1=2
and yD : (4.4)
.1 y / .N C t 2 /1=2
Probability density function (pdf). Let us continue to take z and v to be random variables that are
distributed independently as N.0; 1/ and 2 .N /, respectively, and to take y D z=.v C z 2 /1=2 and
ıp
t D z v=N . Further, define u D v C z 2.
Let us determine the pdf of the joint distribution of u and y and the pdf of the (marginal)
distribution of y—in light of the relationships (4.4), the pdf of the distribution of t is determinable
from the pdf of the distribution of y. The equalities u D v C z 2 and y D z=.v C z 2 /1=2 define a one-
to-one transformation from the region defined by the inequalities 0 < v < 1 and 1 < z < 1
onto the region defined by the inequalities 0 < u < 1 and 1 < y < 1. The inverse of this
transformation is the transformation defined by the equalities
v D u.1 y2/ and z D u1=2y:
Further,
ˇ@v=@u @v=@y ˇ ˇ 1 y 2
ˇ ˇ ˇ ˇ
2uy ˇˇ
ˇ@z=@u @z=@y ˇ D ˇ.1=2/u 1=2y
ˇ ˇ ˇ D u1=2:
u1=2 ˇ
Thus, denoting by d./ the pdf of the 2 .N / distribution and by b./ the pdf of the N.0; 1/ distribution
292 Some Relevant Distributions and Their Properties
and making use of standard results on a change of variables, the joint distribution of u and y has as
a pdf the function q. ; / (of 2 variables) obtained by taking (for 0 < u < 1 and 1 < y < 1)
1 Œ.N C1/=2
D uŒ.N C1/=2 1 e u=2
.1 y 2 /.N=2/ 1
(4.5)
Œ.N C1/=2 2.N C1/=2 .N=2/ 1=2
—for u and y such that 1 < u 0 or 1 jyj < 1, q.u; y/ D 0. The derivation of expression
(4.5) is more or less the same as the derivation (in Section 6.1f) of expression (1.43).
The quantity q.u; y/ is reexpressible (for all u and y) in the form
q.u; y/ D g.u/ h .y/; (4.6)
where g./ is the pdf of the 2 .NC1/ distribution and where h ./ is the function (of a single variable)
defined as follows:
8̂
< Œ.N C1/=2 .1 y 2 /.N=2/ 1; for 1 < y < 1,
h .y/ D .N=2/ 1=2
(4.7)
:̂ 0; elsewhere.
Accordingly, we conclude that h ./ is a pdf; it is the pdf of the distribution of y. Moreover, y is
distributed independently of u.
Now, upon making a change of variable from y to t [based on the relationships (4.4)] and
observing that
d t.N C t 2 / 1=2
D N.N C t 2 / 3=2;
dt
we find that the distribution of t is the distribution with pdf f ./ defined (for all t) by
Œ.N C1/=2 t 2 .N C1/=2
f .t/ D h Œ t.N C t 2 / 1=2 N.N C t 2 / 3=2 D N 1=2
1 C : (4.8)
.N=2/ 1=2 N
And t (like y) is distributed independently of u (i.e., independently of v C z 2 ).
In the special case where N D 1 (i.e., in the special case of a t distribution with 1 degree of
freedom), expression (4.8) simplifies to
1 1
f .t/ D : (4.9)
1 C t2
The distribution with pdf (4.9) is known as the (standard) Cauchy distribution. Thus, the t distribution
with 1 degree of freedom is identical to the Cauchy distribution.
As N ! 1, the pdf of the S t.N / distribution converges (on all of R1 ) to the pdf of the N.0; 1/
distribution—refer, e.g., to Casella and Berger (2002, exercise 5.18). And, accordingly, t converges
in distribution to z.
The pdf of the S t.5/ distribution is displayed in Figure 6.1 along with the pdf of the S t.1/
(Cauchy) distribution and the pdf of the N.0; 1/ (standard normal) distribution.
Symmetry and (absolute, odd, and even) moments. The absolute moments of the t distribution are
determinable from the results of Section 6.3a on the F distribution. Let t represent a random variable
that is distributed as S t.N / and w a random variable that is distributed as SF .1; N /. Then, as an
implication of the relationship (4.3), we have (for an arbitrary scalar r) that E.jtjr / exists if and only
if E.w r=2 / exists, in which case
E.jtjr / D E.w r=2 /: (4.10)
Central, Noncentral, and Multivariate t Distributions 293
N(0, 1) pdf
0.4
St(5) pdf
St(1) pdf
0.2
0.0
−4 −2 0 2 4
FIGURE 6.1. The probability density functions of the N.0; 1/ (standard normal), St.5/, and St.1/ (Cauchy)
distributions.
And upon applying result (3.13), we find that, for 1 < r < N ,
Œ.r C1/=2 Œ.N r/=2
E.jtjr / D N r=2 : (4.11)
.1=2/ .N=2/
For any even positive integer r, jtjr D t r. Thus, upon applying result (3.14), we find [in light of
result (4.10)] that, for r D 2; 4; 6; : : : < N , the rth moment of the S t.N / distribution exists and is
expressible as
.r 1/.r 3/ .3/.1/
E.t r / D N r=2 : (4.12)
.N 2/.N 4/ .N r/
For r N , the rth moment of the S t.N / distribution does not exist (and, as a consequence, the t
distribution, like the F distribution, does not have a moment generating function).
The S t.N / distribution is symmetric (about 0), that is,
t t (4.13)
(as is evident from the very definition of the t distribution). And upon observing that, for any odd
positive integer r, t r D . t/r, we find that, for r D 1; 3; 5; : : : < N,
E.t r / D E. t r / D EŒ. t/r D E.t r /
and hence that (for r D 1; 3; 5; : : : < N )
E.t r / D 0: (4.14)
Thus, those odd moments of the S t.N / distribution that exist (which are those of order less than N )
are all equal to 0. In particular, for N > 1,
E.t/ D 0: (4.15)
Note that none of the moments of the St(1) (Cauchy) distribution exist, not even the mean. And the St(2) distribution has a mean (which equals 0), but does not have a second moment (or any other moments of order greater than 1) and hence does not have a variance. For N > 2, we have [upon applying result (4.12)] that

var(t) = E(t²) = N / (N − 2).  (4.16)
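Results (4.12), (4.14), and (4.16) can be illustrated numerically; the following sketch (an addition for illustration only, assuming SciPy) checks the second, third, and fourth moments of the St(7) distribution:

```python
# Illustrative check (assumes NumPy and SciPy) of the moment formulas for St(N).
import numpy as np
from scipy import stats

N = 7
assert np.isclose(stats.t(N).moment(2), N / (N - 2))                     # (4.16)
assert np.isclose(stats.t(N).moment(3), 0.0)                             # (4.14)
assert np.isclose(stats.t(N).moment(4), N**2 * 3 / ((N - 2) * (N - 4)))  # (4.12), r = 4
```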
FIGURE 6.2. The probability density functions of the N(0, 1) (standard normal) distribution and of the distributions of the standardized versions of a random variable having an St(5) distribution and a random variable having an St(3) distribution.
Pr[t ≤ −t̄₁₋α(N)] = Pr[t > t̄₁₋α(N)] = 1 − α = 1 − Pr[t > t̄_α(N)] = Pr[t ≤ t̄_α(N)],

implying that

t̄_α(N) = −t̄₁₋α(N).  (4.24)

And in light of result (4.23), we find that

Pr[|t| > t̄_(α/2)(N)] = 2 Pr[t > t̄_(α/2)(N)] = 2(α/2) = α,  (4.25)

so that the upper 100α% point of the distribution of |t| equals the upper 100(α/2)% point t̄_(α/2)(N) of the distribution of t [i.e., of the St(N) distribution]. Moreover, in light of relationship (4.3),

t̄_(α/2)(N) = [F̄_α(1, N)]^(1/2),  (4.26)

where F̄_α(1, N) is the upper 100α% point of the SF(1, N) distribution.
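The quantile relationships (4.24) and (4.26) are easy to confirm numerically; a minimal sketch (an illustration, not part of the original text, assuming SciPy) follows:

```python
# Illustrative check (assumes NumPy and SciPy) of results (4.24) and (4.26).
import numpy as np
from scipy import stats

N, alpha = 12, 0.05
t_bar = stats.t.ppf(1 - alpha / 2, N)          # upper 100(alpha/2)% point of St(N)
assert np.isclose(t_bar, -stats.t.ppf(alpha / 2, N))             # (4.24)
assert np.isclose(t_bar, np.sqrt(stats.f.ppf(1 - alpha, 1, N)))  # (4.26)
```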
The t distribution as the distribution of a function of a random vector having a multivariate standard normal distribution or a spherical distribution. Let z = (z₁, z₂, …, z_(N+1))′ represent an (N+1)-dimensional random (column) vector. And let

t = z_(N+1) / [ (1/N) Σᵢ₌₁ᴺ zᵢ² ]^(1/2).

Suppose that z ∼ N(0, I_(N+1)). Then,

t ∼ St(N).  (4.27)

Moreover, it follows from the results of Part 2 (of the present subsection) that t is distributed independently of Σᵢ₌₁ᴺ⁺¹ zᵢ².
More generally, suppose that z has an absolutely continuous spherical distribution. And [recalling result (4.4)] observe that

t = N^(1/2) y / (1 − y²)^(1/2), where y = z_(N+1) / ( Σᵢ₌₁ᴺ⁺¹ zᵢ² )^(1/2).

Then, it follows from Theorem 6.3.1 that [as in the special case where z ∼ N(0, I)]

t ∼ St(N),  (4.28)

and t is distributed independently of Σᵢ₌₁ᴺ⁺¹ zᵢ².
b. Noncentral t distribution
Related distributions. Let x and v represent random variables that are distributed independently as N(δ, 1) and χ²(N), respectively. And observe that (by definition) x/(v/N)^(1/2) ∼ St(N, δ). Observe also that x² is distributed independently of v as χ²(1, δ²) and hence that x²/(v/N) ∼ SF(1, N, δ²). Thus, if t is a random variable that is distributed as St(N, δ), then

t² ∼ SF(1, N, δ²)  (4.29)

or, equivalently,

|t| ∼ F^(1/2),  (4.30)

where F is a random variable that is distributed as SF(1, N, δ²).

Now, let t = x/(v/N)^(1/2), in which case t ∼ St(N, δ), and define

y = x / (v + x²)^(1/2).

Then, as in the case of result (4.4), t and y are related as follows:

t = N^(1/2) y / (1 − y²)^(1/2) and y = t / (N + t²)^(1/2).  (4.31)
Probability density function (pdf). Let us derive an expression for the pdf of the noncentral t distribution. Let us do so by following an approach analogous to the one taken in Subsection a in deriving expression (4.8) for the pdf of the central t distribution.

Take x and v to be random variables that are distributed independently as N(δ, 1) and χ²(N), respectively. Further, take t = x/(v/N)^(1/2), in which case t ∼ St(N, δ), and define u = v + x² and y = x/(v + x²)^(1/2), in which case t and y are related by equalities (4.31). Then, denoting by d(·) the pdf of the χ²(N) distribution and by b(·) the pdf of the N(δ, 1) distribution and proceeding in the same way as in arriving at expression (4.5), we find that the joint distribution of u and y has as a pdf the function q(·,·) obtained by taking (for 0 < u < ∞ and −1 < y < 1)

q(u, y) = d[u(1 − y²)] b(u^(1/2) y) u^(1/2),

and, upon integrating out u, that the pdf of the distribution of y is the function h*(·) defined as follows:

h*(y) = { Γ[(N+1)/2] / [Γ(N/2) π^(1/2)] } (1 − y²)^((N/2)−1) e^(−δ²/2) Σ_(r=0)^∞ { Γ[(N+r+1)/2] / Γ[(N+1)/2] } (2^(1/2) δ y)^r / r!, for −1 < y < 1,
h*(y) = 0, elsewhere.
Finally, making a change of variable from y to t [based on the relationship (4.31)] and proceeding in the same way as in arriving at expression (4.8), we find that the distribution of t has as a pdf the function f(·) obtained by taking (for all t)

f(t) = { Γ[(N+1)/2] / [Γ(N/2) (Nπ)^(1/2)] } e^(−δ²/2) Σ_(r=0)^∞ { Γ[(N+r+1)/2] / Γ[(N+1)/2] } [ (2^(1/2) δ t)^r / (r! N^(r/2)) ] (1 + t²/N)^(−(N+r+1)/2).  (4.34)

In the special case where δ = 0, expression (4.34) simplifies to expression (4.8) for the pdf of the central t distribution St(N).
Moments: relationship to the moments of the N(δ, 1) distribution. Let t represent a random variable that has an St(N, δ) distribution. Then, by definition, t ∼ x/(v/N)^(1/2), where x and v are random variables that are distributed independently as N(δ, 1) and χ²(N), respectively. And, for r = 1, 2, … < N, the rth moment of the St(N, δ) distribution exists and is expressible as

E(t^r) = (N/2)^(r/2) { Γ[(N−r)/2] / Γ(N/2) } E(x^r),  (4.35)

where, in particular,

E(x) = δ  (4.36)

and

E(x²) = 1 + δ².  (4.37)

Like the St(N) distribution, the St(N, δ) distribution does not have moments of order N or greater and, accordingly, does not have a moment generating function.

In light of results (4.36) and (4.37), we find that the mean of the St(N, δ) distribution is (if N > 1)

E(t) = (N/2)^(1/2) { Γ[(N−1)/2] / Γ(N/2) } δ  (4.38)
and the second moment is (if N > 2)

E(t²) = [N / (N − 2)] (1 + δ²).  (4.39)
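Formulas (4.38) and (4.39) agree with the moments of SciPy's noncentral t distribution; a minimal sketch (an added illustration assuming NumPy and SciPy):

```python
# Illustrative check (assumes NumPy and SciPy) of results (4.38) and (4.39).
import numpy as np
from scipy import stats
from scipy.special import gammaln

N, delta = 9, 2.0
mean = np.sqrt(N / 2) * np.exp(gammaln((N - 1) / 2) - gammaln(N / 2)) * delta  # (4.38)
assert np.isclose(stats.nct(N, delta).mean(), mean)
assert np.isclose(stats.nct(N, delta).moment(2), N / (N - 2) * (1 + delta**2)) # (4.39)
```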
Noninteger degrees of freedom. The definition of the noncentral t distribution can be extended to noninteger degrees of freedom in essentially the same way as the definition of the central t distribution. For any (strictly) positive scalar N (and any scalar δ), take the noncentral t distribution with N degrees of freedom and noncentrality parameter δ to be the distribution of the random variable t = x/(v/N)^(1/2), where x and v are random variables that are statistically independent with x ∼ N(δ, 1) and v ∼ Ga(N/2, 2). In the special case where N is a (strictly positive) integer, the Ga(N/2, 2) distribution is identical to the χ²(N) distribution, so that this usage of the term noncentral t distribution is consistent with our previous usage of this term.
Some relationships. Let t* represent a random variable that has an St(N, −δ) distribution and t a random variable that has an St(N, δ) distribution. Then, clearly,

−t ∼ t*.  (4.40)

And for any (nonrandom) scalar c, we have [as a generalization of result (4.22)] that
Proof. Suppose that R and T are nonsingular. Then, making use of Theorem 2.14.22, we find that

| R   −S  |
| U   T⁻¹ | = |T⁻¹| |R − (−S)(T⁻¹)⁻¹U| = |T⁻¹| |R + STU|

and also that

| R   −S  |
| U   T⁻¹ | = |R| |T⁻¹ − UR⁻¹(−S)| = |R| |T⁻¹ + UR⁻¹S|.

Thus,

|T⁻¹| |R + STU| = |R| |T⁻¹ + UR⁻¹S|

or, equivalently (since |T⁻¹| = 1/|T|),

|R + STU| = |R| |T| |T⁻¹ + UR⁻¹S|.

Q.E.D.
In the special case where R = I_N and T = I_M, Theorem 6.4.1 simplifies to the following result.

Corollary 6.4.2. For any N × M matrix S and any M × N matrix U,

|I_N + SU| = |I_M + US|.

In the special case where M = 1, Corollary 6.4.2 can be restated as the following corollary.

Corollary 6.4.3. For any N-dimensional column vectors s = {sᵢ} and u = {uᵢ},

|I_N + su′| = 1 + u′s = 1 + s′u = 1 + Σᵢ₌₁ᴺ sᵢuᵢ.  (4.47)
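Result (4.47)—and, more generally, Corollary 6.4.2—is easily checked numerically; a minimal sketch (an added illustration assuming NumPy):

```python
# Illustrative check (assumes NumPy) of result (4.47) and Corollary 6.4.2.
import numpy as np

rng = np.random.default_rng(1)
s, u = rng.standard_normal(5), rng.standard_normal(5)
assert np.isclose(np.linalg.det(np.eye(5) + np.outer(s, u)), 1.0 + s @ u)  # (4.47)
S, U = rng.standard_normal((5, 3)), rng.standard_normal((3, 5))
assert np.isclose(np.linalg.det(np.eye(5) + S @ U),
                  np.linalg.det(np.eye(3) + U @ S))                        # Cor. 6.4.2
```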
d. Multivariate t distribution
Related distributions. Let t = (t₁, t₂, …, t_M)′ represent a random (column) vector that has an MVt(N, R) distribution, that is, an M-variate t distribution with N degrees of freedom and correlation matrix R. If t* is any subvector of t, say the M*-dimensional subvector (t_(i₁), t_(i₂), …, t_(i_M*))′ consisting of the i₁, i₂, …, i_M*th elements, then, clearly,

t* ∼ MVt(N, R*),  (4.48)

where R* is the M* × M* submatrix of R formed by striking out all of the rows and columns of R save its i₁, i₂, …, i_M*th rows and columns. And, for i = 1, 2, …, M,

tᵢ ∼ St(N),  (4.49)

that is, the (marginal) distribution of each of the elements of t is a (univariate) t distribution with the same number of degrees of freedom (N) as the distribution of t.
In the special case where R = I,

M⁻¹ t′t ∼ SF(M, N).  (4.50)

More generally, if t is partitioned into some number of subvectors, say K subvectors t₁, t₂, …, t_K of dimensions M₁, M₂, …, M_K, respectively, then, letting u₁, u₂, …, u_K, and v represent random variables that are distributed independently as χ²(M₁), χ²(M₂), …, χ²(M_K), and χ²(N), respectively, we find that, in the special case where R = I, the joint distribution of the K quantities M₁⁻¹t₁′t₁, M₂⁻¹t₂′t₂, …, M_K⁻¹t_K′t_K is identical to the joint distribution of the K ratios (u₁/M₁)/(v/N), (u₂/M₂)/(v/N), …, (u_K/M_K)/(v/N) [the marginal distributions of which are SF(M₁, N), SF(M₂, N), …, SF(M_K, N), respectively, and which have a common denominator v/N].
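Result (4.50) can be illustrated by simulation; the following sketch (an added illustration assuming NumPy and SciPy) generates MVt(N, I_M) vectors from their defining representation and compares M⁻¹t′t with the SF(M, N) distribution:

```python
# Illustrative simulation (assumes NumPy and SciPy) of result (4.50).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
M, N, n = 3, 10, 200_000
z = rng.standard_normal((n, M))            # rows ~ N(0, I_M)
v = rng.chisquare(N, size=n)               # independent chi-square(N)
t = z / np.sqrt(v / N)[:, None]            # rows ~ MVt(N, I_M)
w = (t**2).sum(axis=1) / M                 # M^{-1} t't
grid = np.array([0.5, 1.0, 2.0, 4.0])
assert np.allclose((w[:, None] <= grid).mean(axis=0),
                   stats.f.cdf(grid, M, N), atol=0.01)
```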
The multivariate t distribution is related asymptotically to the multivariate normal distribution. It can be shown that as N → ∞, the MVt(N, R) distribution converges to the N(0, R) distribution.

Now, take z to be an M-dimensional random column vector and v a random variable that are statistically independent with z ∼ N(0, R) and v ∼ χ²(N). And take t = (v/N)^(−1/2) z, in which case t ∼ MVt(N, R), and define y = (v + z′R⁻¹z)^(−1/2) z. Then, t and y are related as follows:

t = [N / (1 − y′R⁻¹y)]^(1/2) y and y = (N + t′R⁻¹t)^(−1/2) t.  (4.51)
Probability density function (pdf). Let us continue to take z to be an M-dimensional random column vector and v a random variable that are statistically independent with z ∼ N(0, R) and v ∼ χ²(N). And let us continue to take t = (v/N)^(−1/2) z and to define y = (v + z′R⁻¹z)^(−1/2) z. Further, define u = v + z′R⁻¹z.

Consider the joint distribution of the random variable u and the random vector y. The equalities u = v + z′R⁻¹z and y = (v + z′R⁻¹z)^(−1/2) z define a one-to-one transformation from the region {v, z : 0 < v < ∞, z ∈ R^M} onto the region {u, y : 0 < u < ∞, y′R⁻¹y < 1}. The inverse of this transformation is the transformation defined by the equalities

v = u(1 − y′R⁻¹y) and z = u^(1/2) y.

Thus, denoting by d(·) the pdf of the χ²(N) distribution and by b(·) the pdf of the N(0, R) distribution and making use of standard results on a change of variables, the joint distribution of u and y has as a pdf the function q(·,·) obtained by taking (for u and y such that 0 < u < ∞ and y′R⁻¹y < 1)

q(u, y) = d[u(1 − y′R⁻¹y)] b(u^(1/2) y) u^(M/2)
        = { 1 / [Γ(N/2) 2^((N+M)/2) π^(M/2) |R|^(1/2)] } u^([(N+M)/2]−1) e^(−u/2) (1 − y′R⁻¹y)^((N/2)−1)
        = { 1 / [Γ[(N+M)/2] 2^((N+M)/2)] } u^([(N+M)/2]−1) e^(−u/2)
          × { Γ[(N+M)/2] / [Γ(N/2) π^(M/2) |R|^(1/2)] } (1 − y′R⁻¹y)^((N/2)−1),  (4.52)

the first factor of which is the pdf g(·) of the χ²(N+M) distribution and the second factor of which defines a function h*(·) of y [taken to be 0 for y such that y′R⁻¹y ≥ 1]. Accordingly, we conclude that h*(·) is a pdf; it is the pdf of the distribution of y. Moreover, y is distributed independently of u.
Now, for j = 1, 2, …, M, let yⱼ represent the jth element of y, tⱼ the jth element of t, and eⱼ the jth column of I_M, and observe [in light of relationship (4.51) and result (5.4.10)] that

∂y/∂t′ = (N + t′R⁻¹t)^(−1/2) I_M − (N + t′R⁻¹t)^(−3/2) t t′R⁻¹,

implying [in light of Lemma 2.14.3 and Corollaries 2.14.6 and 6.4.3] that

|∂y/∂t′| = (N + t′R⁻¹t)^(−M/2) [1 − (N + t′R⁻¹t)⁻¹ t′R⁻¹t] = N (N + t′R⁻¹t)^(−(M/2)−1).
Thus, upon making a change of variables from the elements of y to the elements of t, we find that the distribution of t (which is the M-variate t distribution with N degrees of freedom) has as a pdf the function f(·) obtained by taking (for all t)

f(t) = { Γ[(N+M)/2] / [Γ(N/2) N^(M/2) π^(M/2) |R|^(1/2)] } (1 + N⁻¹ t′R⁻¹t)^(−(N+M)/2).
Alternatively, if k is an even number (smaller than N), then [in light of result (1.31)]

E(t₁^(k₁) t₂^(k₂) ⋯ t_M^(k_M)) = { N^(k/2) / [(N−2)(N−4)⋯(N−k)] } E(z₁^(k₁) z₂^(k₂) ⋯ z_M^(k_M)).  (4.58)
We conclude that (for k < N) each kth-order moment of the MVt(N, R) distribution is either 0 or is obtainable from the corresponding kth-order moment of the N(0, R) distribution (depending on whether k is odd or even). In particular, for i = 1, 2, …, M, we find that (if N > 1)

E(tᵢ) = 0  (4.59)

and that (if N > 2)

var(tᵢ) = E(tᵢ²) = [N/(N−2)] rᵢᵢ = N/(N−2),  (4.60)

in agreement with results (4.15) and (4.16). And, for j ≠ i = 1, 2, …, M, we find that (if N > 2)

cov(tᵢ, tⱼ) = E(tᵢtⱼ) = [N/(N−2)] rᵢⱼ  (4.61)

and that (if N > 2)

corr(tᵢ, tⱼ) = rᵢⱼ [= corr(zᵢ, zⱼ)].  (4.62)

In matrix notation, we have that (if N > 1)

E(t) = 0  (4.63)

and that (if N > 2)

var(t) = [N/(N−2)] R.  (4.64)
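Results (4.63) and (4.64) can be checked by simulation; a minimal sketch (an added illustration assuming NumPy), with R a 2 × 2 correlation matrix:

```python
# Illustrative simulation (assumes NumPy) of results (4.63) and (4.64).
import numpy as np

rng = np.random.default_rng(3)
N, n = 8, 400_000
R = np.array([[1.0, 0.5],
              [0.5, 1.0]])
z = rng.standard_normal((n, 2)) @ np.linalg.cholesky(R).T   # rows ~ N(0, R)
v = rng.chisquare(N, size=n)
t = z / np.sqrt(v / N)[:, None]                             # rows ~ MVt(N, R)
assert np.allclose(t.mean(axis=0), 0.0, atol=0.02)          # (4.63)
assert np.allclose(np.cov(t.T), N / (N - 2) * R, atol=0.05) # (4.64)
```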
Noninteger degrees of freedom. The definition of the multivariate t distribution can be extended to noninteger degrees of freedom in essentially the same way as the definition of the (univariate) t distribution. For any (strictly) positive number N, take the M-variate t distribution with degrees of freedom N and correlation matrix R to be the distribution of the M-variate random (column) vector t = (v/N)^(−1/2) z, where z is an M-dimensional random column vector and v a random variable that are statistically independent with z ∼ N(0, R) and v ∼ Ga(N/2, 2). In the special case where N is a (strictly positive) integer, the Ga(N/2, 2) distribution is identical to the χ²(N) distribution, so that this usage of the term M-variate t distribution is consistent with our previous usage of this term.
Sphericity and ellipticity. The MVt(N, I_M) distribution is spherical. To see this, take z to be an M-dimensional random column vector and v a random variable that are statistically independent with z ∼ N(0, I_M) and v ∼ χ²(N), and take O to be any M × M orthogonal matrix of constants. Further, let t = (v/N)^(−1/2) z, in which case t ∼ MVt(N, I_M), and observe that the M × 1 vector Oz, like z itself, is distributed independently of v as N(0, I_M). Thus,

Ot = (v/N)^(−1/2) (Oz) ∼ (v/N)^(−1/2) z = t.

Now, suppose that R = S′S for some M × M matrix S. Then,

S′t = (v/N)^(−1/2) (S′z),

and S′z is distributed independently of v as N(0, R), so that S′t ∼ MVt(N, R). Thus, the MVt(N, R) distribution is elliptical.
The MVt(N, I_M) distribution as the distribution of a vector-valued function of a random vector having a standard normal distribution or a spherical distribution. Let z = (z₁, z₂, …, z_N, z_(N+1), z_(N+2), …, z_(N+M))′ represent an (N+M)-dimensional random (column) vector. Further, let

t = [ (1/N) Σᵢ₌₁ᴺ zᵢ² ]^(−1/2) z*, where z* = (z_(N+1), z_(N+2), …, z_(N+M))′.

Suppose that z ∼ N(0, I_(N+M)). Then, z* and Σᵢ₌₁ᴺ zᵢ² are distributed independently as N(0, I_M) and χ²(N), respectively. Thus,

t ∼ MVt(N, I_M).  (4.66)

Moreover, it follows from the results of Part 2 (of the present subsection) that t is distributed independently of Σᵢ₌₁ᴺ⁺ᴹ zᵢ².

More generally, suppose that z has an absolutely continuous spherical distribution. And [recalling result (4.51)] observe that

t = [N / (1 − y′y)]^(1/2) y, where y = ( Σᵢ₌₁ᴺ⁺ᴹ zᵢ² )^(−1/2) z*.

Then, it follows from Theorem 6.3.1 that [as in the special case where z ∼ N(0, I)]

t ∼ MVt(N, I_M),  (4.67)

and t is statistically independent of Σᵢ₌₁ᴺ⁺ᴹ zᵢ².
[The scalars d₀ and d₁ are eigenvalues of the matrix A; in fact, they are respectively the smallest and largest eigenvalues of A, as can be ascertained from results to be presented subsequently (in Section 6.7a).] And observe that I_M − tA is positive definite if and only if 1 − td₀ > 0 and 1 − td₁ > 0. Thus,

S = (1/d₀, 1/d₁), if d₀ < 0 and d₁ > 0,
S = (1/d₀, ∞),   if d₀ < 0 and d₁ ≤ 0,
S = (−∞, 1/d₁),  if d₀ ≥ 0 and d₁ > 0,
S = (−∞, ∞),     if d₀ = d₁ = 0.  (5.1)
Consider now the extension of the conditions under which a matrix of the form I_M − tA is positive definite to matrices of the more general form V − tA. According to Corollary 2.13.29, there exists an M × M nonsingular matrix Q such that V = Q′Q. And upon observing that

V − tA = Q′[I − t(Q⁻¹)′AQ⁻¹]Q and I − t(Q⁻¹)′AQ⁻¹ = (Q⁻¹)′(V − tA)Q⁻¹,

it becomes clear (in light of Corollary 2.13.11) that V − tA is positive definite if and only if the matrix I_M − t(Q⁻¹)′AQ⁻¹ is positive definite. Thus, the applicability of the conditions under which a matrix of the form I_M − tA is positive definite can be readily extended to a matrix of the more general form V − tA; it is a simple matter of applying those conditions with (Q⁻¹)′AQ⁻¹ in place of A. More generally, conditions under which a matrix of the form I_M − Σᵢ₌₁ᴷ tᵢAᵢ (where A₁, A₂, …, A_K are M × M symmetric matrices and t₁, t₂, …, t_K arbitrary scalars) is positive definite can be translated into conditions under which a matrix of the form V − Σᵢ₌₁ᴷ tᵢAᵢ is positive definite by replacing A₁, A₂, …, A_K with (Q⁻¹)′A₁Q⁻¹, (Q⁻¹)′A₂Q⁻¹, …, (Q⁻¹)′A_KQ⁻¹, respectively.

Note that conditions under which I − tA or I − Σᵢ₌₁ᴷ tᵢAᵢ (or V − tA or V − Σᵢ₌₁ᴷ tᵢAᵢ) is positive definite can be easily translated into conditions under which I + tA or I + Σᵢ₌₁ᴷ tᵢAᵢ (or V + tA or V + Σᵢ₌₁ᴷ tᵢAᵢ) is positive definite.
b. Main results
Let us derive the moment generating function of the distribution of a quadratic form x′Ax, or more generally of the distribution of a second-degree polynomial c + b′x + x′Ax, in a random column vector x, where x ∼ N(μ, Σ). And let us derive the moment generating function of the joint distribution of two or more quadratic forms or second-degree polynomials. Let us do so by establishing and exploiting the following theorem.

Theorem 6.5.3. Let z represent an M-dimensional random column vector that has an M-variate standard normal distribution N(0, I_M). Then, for any constant c and any M-dimensional column vector b (of constants) and for any M × M symmetric matrix A (of constants) such that I − 2A is positive definite,

E(e^(c + b′z + z′Az)) = |I − 2A|^(−1/2) e^(c + (1/2) b′(I−2A)⁻¹b).  (5.2)

Proof. Let f(·) represent the pdf of the N(0, I_M) distribution and g(·) the pdf of the N[(I−2A)⁻¹b, (I−2A)⁻¹] distribution. Then, for all z,

e^(c + b′z + z′Az) f(z) = |I − 2A|^(−1/2) e^(c + (1/2) b′(I−2A)⁻¹b) g(z),

as can be readily verified. And it follows that

E(e^(c + b′z + z′Az)) = ∫_(R^M) e^(c + b′z + z′Az) f(z) dz
                      = |I − 2A|^(−1/2) e^(c + (1/2) b′(I−2A)⁻¹b) ∫_(R^M) g(z) dz
                      = |I − 2A|^(−1/2) e^(c + (1/2) b′(I−2A)⁻¹b).

Q.E.D.
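Result (5.2) lends itself to a Monte Carlo check; the sketch below (an added illustration assuming NumPy) compares the sample mean of e^(c + b′z + z′Az) with the closed-form expression:

```python
# Illustrative Monte Carlo check (assumes NumPy) of result (5.2).
import numpy as np

rng = np.random.default_rng(4)
M = 3
A = np.array([[0.10, 0.02, 0.00],
              [0.02, 0.05, 0.01],
              [0.00, 0.01, 0.08]])   # symmetric, with I - 2A positive definite
b = np.array([0.3, -0.2, 0.1])
c = 0.25
z = rng.standard_normal((500_000, M))
mc = np.exp(c + z @ b + np.einsum('ij,jk,ik->i', z, A, z)).mean()
IA = np.eye(M) - 2 * A
closed = np.linalg.det(IA)**-0.5 * np.exp(c + 0.5 * b @ np.linalg.solve(IA, b))
assert np.isclose(mc, closed, rtol=0.02)
```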
Moment generating function of the distribution of a single quadratic form or second-degree polynomial. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution (where the rank of Σ is possibly less than M). Further, take Γ to be any matrix such that Σ = Γ′Γ—the existence of such a matrix follows from Corollary 2.13.25—and denote by R the number of rows in Γ. And observe that

x ∼ μ + Γ′z,  (5.3)

where z is an R-dimensional random column vector that has an N(0, I_R) distribution, so that

q = c + b′x + x′Ax ∼ c + b′μ + μ′Aμ + (b + 2Aμ)′Γ′z + z′ΓAΓ′z.

Denote by m(·) the moment generating function of the distribution of q, and take S to be the set of values of the scalar t for which I − 2tΓAΓ′ is positive definite; with d₀ and d₁ now taken to be the smallest and largest eigenvalues of ΓAΓ′,

S = (1/d₀, 1/d₁), if d₀ < 0 and d₁ > 0,
S = (1/d₀, ∞),   if d₀ < 0 and d₁ ≤ 0,
S = (−∞, 1/d₁),  if d₀ ≥ 0 and d₁ > 0,
S = (−∞, ∞),     if d₀ = d₁ = 0.

Then, upon applying result (5.2), we find that, for t ∈ S,

m(t) = E(e^(tq)) = |I − 2tΓAΓ′|^(−1/2) exp[t(c + b′μ + μ′Aμ)]
       × exp[(1/2) t² (b + 2Aμ)′Γ′(I − 2tΓAΓ′)⁻¹Γ(b + 2Aμ)].  (5.5)
The dependence of expression (5.5) on the variance-covariance matrix Σ is through the “intermediary” Γ. The moment generating function can be reexpressed in terms of Σ itself. In light of Corollary 6.4.2,

|I − 2tΓAΓ′| = |I − Γ(2tAΓ′)| = |I − 2tAΓ′Γ| = |I − 2tAΣ|,  (5.6)

implying that

|I − 2tAΣ| > 0 for t ∈ S  (5.7)

and hence that I − 2tAΣ is nonsingular for t ∈ S. Moreover,

(I − 2tΓAΓ′)⁻¹ Γ = Γ (I − 2tAΣ)⁻¹ for t ∈ S,  (5.8)

as is evident upon observing that

Γ(I − 2tAΣ) = (I − 2tΓAΓ′)Γ

and upon premultiplying both sides of this equality by (I − 2tΓAΓ′)⁻¹ and postmultiplying both sides by (I − 2tAΣ)⁻¹.
Results (5.6) and (5.8) can be used to reexpress expression (5.5) (for the moment generating function) as follows: for t ∈ S,

m(t) = |I − 2tAΣ|^(−1/2) exp[t(c + b′μ + μ′Aμ)]
       × exp[(1/2) t² (b + 2Aμ)′ Σ (I − 2tAΣ)⁻¹ (b + 2Aμ)].  (5.9)

In the special case where c = 0 and b = 0 [i.e., where m(·) is the moment generating function of the quadratic form x′Ax], expression (5.9) simplifies as follows: for t ∈ S,

m(t) = |I − 2tAΣ|^(−1/2) exp[t μ′Aμ + 2t² μ′AΣ(I − 2tAΣ)⁻¹Aμ].  (5.10)

And in the further special case where (in addition to c = 0 and b = 0) Σ is nonsingular, the moment generating function (of the distribution of x′Ax) is also expressible as follows: for t ∈ S,

m(t) = |I − 2tAΣ|^(−1/2) exp{ −(1/2) μ′[I − (I − 2tAΣ)⁻¹] Σ⁻¹ μ },  (5.11)

as is evident upon observing that

t μ′Aμ + 2t² μ′AΣ(I − 2tAΣ)⁻¹Aμ = t μ′(I − 2tAΣ)⁻¹Aμ
= −(1/2) μ′(I − 2tAΣ)⁻¹(−2tAΣ)Σ⁻¹μ = −(1/2) μ′[I − (I − 2tAΣ)⁻¹]Σ⁻¹μ.
Moment generating function of the joint distribution of multiple quadratic forms or second-degree polynomials. Let us continue to take x to be an M-dimensional random column vector that has an N(μ, Σ) distribution, to take Γ to be any matrix such that Σ = Γ′Γ, to denote by R the number of rows in Γ, and to take z to be an R-dimensional random column vector that has an N(0, I_R) distribution.

For i = 1, 2, …, K (where K is a strictly positive integer), let cᵢ represent a constant, bᵢ an M-dimensional column vector of constants, and Aᵢ an M × M symmetric matrix of constants. And denote by m(·) the moment generating function of the distribution of the K-dimensional random column vector whose ith element is the second-degree polynomial cᵢ + bᵢ′x + x′Aᵢx (in the random vector x), and let t = (t₁, t₂, …, t_K)′ represent an arbitrary K-dimensional column vector. Further, take S to be the subset of R^K defined as follows:

S = { (t₁, t₂, …, t_K)′ : I − 2 Σᵢ₌₁ᴷ tᵢ ΓAᵢΓ′ is positive definite }.
And observe that cᵢ + bᵢ′x + x′Aᵢx ∼ cᵢ + bᵢ′μ + μ′Aᵢμ + (bᵢ + 2Aᵢμ)′Γ′z + z′ΓAᵢΓ′z (i = 1, 2, …, K). And upon applying result (5.2), we obtain the following generalization of result (5.5): for t ∈ S,

m(t) = E[ e^( Σᵢ tᵢ(cᵢ + bᵢ′x + x′Aᵢx) ) ]
     = E[ e^( Σᵢ tᵢ(cᵢ + bᵢ′μ + μ′Aᵢμ) + [Σᵢ tᵢ Γ(bᵢ + 2Aᵢμ)]′z + z′(Σᵢ tᵢ ΓAᵢΓ′)z ) ]
     = |I − 2 Σᵢ tᵢ ΓAᵢΓ′|^(−1/2) exp[ Σᵢ tᵢ(cᵢ + bᵢ′μ + μ′Aᵢμ) ]
       × exp{ (1/2) [Σᵢ tᵢ(bᵢ + 2Aᵢμ)]′ Γ′( I − 2 Σᵢ tᵢ ΓAᵢΓ′ )⁻¹ Γ [Σᵢ tᵢ(bᵢ + 2Aᵢμ)] }.  (5.12)
Moreover, in light of Corollary 6.4.2 [as a generalization of result (5.6)],

|I − 2 Σᵢ₌₁ᴷ tᵢ ΓAᵢΓ′| = |I − 2 Σᵢ₌₁ᴷ tᵢ AᵢΣ|,  (5.13)

implying that

|I − 2 Σᵢ₌₁ᴷ tᵢ AᵢΣ| > 0 for t ∈ S  (5.14)

and hence that I − 2 Σᵢ₌₁ᴷ tᵢAᵢΣ is nonsingular for t ∈ S. And as a straightforward generalization of result (5.8), we have that

( I − 2 Σᵢ₌₁ᴷ tᵢ ΓAᵢΓ′ )⁻¹ Γ = Γ ( I − 2 Σᵢ₌₁ᴷ tᵢ AᵢΣ )⁻¹ for t ∈ S.  (5.15)
Based on results (5.13) and (5.15), we obtain, as a variation on expression (5.12) for the moment generating function, the following generalization of expression (5.9): for t ∈ S,

m(t) = |I − 2 Σᵢ tᵢAᵢΣ|^(−1/2) exp[ Σᵢ tᵢ(cᵢ + bᵢ′μ + μ′Aᵢμ) ]
       × exp{ (1/2) [Σᵢ tᵢ(bᵢ + 2Aᵢμ)]′ Σ ( I − 2 Σᵢ tᵢAᵢΣ )⁻¹ [Σᵢ tᵢ(bᵢ + 2Aᵢμ)] }.  (5.16)
In connection with Theorem 6.6.1, it is worth noting that if A, b, and c satisfy conditions (6.2) and (6.3), then the second-degree polynomial q is reexpressible as a quadratic form

q = (z + ½b)′ A (z + ½b)  (6.4)

[in the vector z + ½b, the distribution of which is N(½b, I_M)]. Moreover, if A, b, and c satisfy all three of conditions (6.1), (6.2), and (6.3), then q is reexpressible as a sum of squares

q = (Az + ½b)′ (Az + ½b).  (6.5)
Theorem 6.6.1 asserts that conditions (6.1), (6.2), and (6.3) are necessary and sufficient for the
second-degree polynomial q to have a noncentral chi-square distribution. In proving Theorem 6.6.1,
it is convenient to devote our initial efforts to establishing the sufficiency of these conditions. The
proof of sufficiency is considerably simpler than that of necessity. And, perhaps fortuitously, it is
the sufficiency that is of the most importance; it is typically the sufficiency of the conditions that is
invoked in an application of the theorem rather than their necessity.
Proof (of Theorem 6.6.1): sufficiency. Suppose that the symmetric matrix A, the column vector b, and the scalar c satisfy conditions (6.1), (6.2), and (6.3). Then, in conformance with the earlier observation (6.4),

q = (z + ½b)′ A (z + ½b).

Moreover, it follows from Theorem 5.9.5 that there exists a matrix O of dimensions M × R, where R = rank A, such that A = OO′ and that necessarily this matrix is such that O′O = I_R. Thus,

q = (z + ½b)′ OO′ (z + ½b) = x′x,

where x = O′(z + ½b). And upon observing that x ∼ N(½O′b, I_R), we conclude (on the basis of the very definition of the noncentral chi-square distribution) that

q ∼ χ²[ R, (½O′b)′(½O′b) ].
and

c + b′μ + μ′Aμ = ¼ (b + 2Aμ)′ Σ (b + 2Aμ),  (6.8)

then q ∼ χ²(R, c + b′μ + μ′Aμ), where R = rank(ΣAΣ) = tr(AΣ). Conversely, if q ∼ χ²(R, λ) (for some strictly positive integer R), then A, b, and c (and Σ and μ) satisfy conditions (6.6), (6.7), and (6.8), R = rank(ΣAΣ) = tr(AΣ), and λ = c + b′μ + μ′Aμ.
Proof. Let d = b + 2Aμ, take Γ to be any matrix (with M columns) such that Σ = Γ′Γ (the existence of which follows from Corollary 2.13.25), and denote by P the number of rows in Γ. Further, take z to be a P-dimensional random column vector that is distributed as N(0, I_P). And observe that x ∼ μ + Γ′z and hence that

q ∼ c + b′(μ + Γ′z) + (μ + Γ′z)′A(μ + Γ′z) = c + b′μ + μ′Aμ + (Γd)′z + z′ΓAΓ′z.

Observe also (in light of Corollary 2.3.4) that

ΣAΣ ≠ 0 ⇔ ΣAΓ′ ≠ 0 ⇔ ΓAΓ′ ≠ 0.

Accordingly, it follows from Theorem 6.6.1 that if

ΓAΓ′ ΓAΓ′ = ΓAΓ′,  (6.9)
Γd = ΓAΓ′ Γd,  (6.10)
and
c + b′μ + μ′Aμ = ¼ d′Γ′Γd,  (6.11)

then q ∼ χ²(R, c + b′μ + μ′Aμ), where R = rank(ΓAΓ′) = tr(ΓAΓ′); and, conversely, if q ∼ χ²(R, λ) (for some strictly positive integer R), then A, b, and c (and Σ and μ) satisfy conditions (6.9), (6.10), and (6.11), R = rank(ΓAΓ′) = tr(ΓAΓ′), and λ = c + b′μ + μ′Aμ. Moreover, in light of Lemma 2.12.3,

rank(ΓAΓ′) = rank(ΣAΓ′) = rank(ΣAΣ),

and in light of Lemma 2.3.1, tr(ΓAΓ′) = tr(AΣ). Since d′Γ′Γd = (b + 2Aμ)′Σ(b + 2Aμ), it remains only to observe (in light of Corollary 2.3.4) that

ΓAΓ′ΓAΓ′ = ΓAΓ′ ⇔ Γ′ΓAΓ′ΓAΓ′ = Γ′ΓAΓ′ ⇔ Γ′ΓAΓ′ΓAΓ′Γ = Γ′ΓAΓ′Γ

[so that conditions (6.6) and (6.9) are equivalent] and that

Γd = ΓAΓ′Γd ⇔ Γ′Γd = Γ′ΓAΓ′Γd

[so that conditions (6.7) and (6.10) are equivalent]. Q.E.D.
Note that condition (6.6) is satisfied if A is such that

(AΣ)² = AΣ [or, equivalently, (ΣA)² = ΣA],

that is, if AΣ is idempotent (or, equivalently, ΣA is idempotent), in which case

tr(AΣ) = rank(AΣ) [= rank(ΣAΣ)].

And note that conditions (6.6) and (6.7) are both satisfied if A and b are such that

(AΣ)² = AΣ and b ∈ C(A).

Note also that all three of conditions (6.6), (6.7), and (6.8) are satisfied if A, b, and c are such that

AΣA = A, b ∈ C(A), and c = ¼ b′Σb.  (6.12)
Finally, note that (by definition) AΣA = A if and only if Σ is a generalized inverse of A.

In the special case where Σ is nonsingular, condition (6.12) is a necessary condition for A, b, and c to satisfy all three of conditions (6.6), (6.7), and (6.8) (as well as a sufficient condition), as can be readily verified. Moreover, if Σ is nonsingular, then rank(ΣAΣ) = rank(A). Thus, as a corollary of Theorem 6.6.2, we have the following result.
Corollary 6.6.3. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution, where Σ is nonsingular. And take q = c + b′x + x′Ax, where c is a constant, b an M-dimensional column vector of constants, and A an M × M (nonnull) symmetric matrix of constants. If

AΣA = A, b ∈ C(A), and c = ¼ b′Σb,  (6.13)

then q ∼ χ²(rank A, c + b′μ + μ′Aμ). Conversely, if q ∼ χ²(R, λ) (for some strictly positive integer R), then A, b, and c (and Σ) satisfy condition (6.13), R = rank A, and λ = c + b′μ + μ′Aμ.

In the special case where q is a quadratic form (i.e., where c = 0 and b = 0), Corollary 6.6.3 simplifies to the following result.

Corollary 6.6.4. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution, where Σ is nonsingular. And take A to be an M × M (nonnull) symmetric matrix of constants. If AΣA = A, then x′Ax ∼ χ²(rank A, μ′Aμ). Conversely, if x′Ax ∼ χ²(R, λ) (for some strictly positive integer R), then AΣA = A, R = rank A, and λ = μ′Aμ.
In connection with Corollaries 6.6.3 and 6.6.4, note that if Σ is nonsingular, then the condition AΣA = A is equivalent to the condition that AΣ be idempotent [i.e., to the condition (AΣ)² = AΣ]. Moreover, upon taking k to be an M-dimensional column vector of constants and upon applying Corollary 6.6.3 (with A = Σ⁻¹, b = −2Σ⁻¹k, and c = k′Σ⁻¹k), we find that [for an M-dimensional random column vector x that has an N(μ, Σ) distribution, where Σ is nonsingular]

(x − k)′ Σ⁻¹ (x − k) ∼ χ²[M, (μ − k)′Σ⁻¹(μ − k)].  (6.14)

In the special case where k = 0, result (6.14) simplifies to the following result:

x′ Σ⁻¹ x ∼ χ²(M, μ′Σ⁻¹μ).  (6.15)

Alternatively, result (6.15) is obtainable as an application of Corollary 6.6.4 (that where A = Σ⁻¹).
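Result (6.14) is readily illustrated by simulation; the sketch below (an added illustration assuming NumPy and SciPy) takes k = 0, so that it checks result (6.15) against SciPy's noncentral chi-square distribution:

```python
# Illustrative simulation (assumes NumPy and SciPy) of result (6.15).
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
M, n = 3, 300_000
mu = np.array([1.0, -0.5, 2.0])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])
x = rng.multivariate_normal(mu, Sigma, size=n)
q = np.einsum('ij,jk,ik->i', x, np.linalg.inv(Sigma), x)   # x' Sigma^{-1} x
lam = mu @ np.linalg.solve(Sigma, mu)                      # noncentrality
grid = np.array([2.0, 5.0, 10.0, 20.0])
assert np.allclose((q[:, None] <= grid).mean(axis=0),
                   stats.ncx2.cdf(grid, M, lam), atol=0.01)
```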
c. Some results on linear spaces (of M-dimensional row or column vectors or, more generally, of M × N matrices)

At this point in the discussion of the distribution of quadratic forms, it is helpful to introduce some additional results on linear spaces. According to Theorem 2.4.7, every linear space (of M × N matrices) has a basis. And according to Theorem 2.4.11, any set of R linearly independent matrices in an R-dimensional linear space V (of M × N matrices) is a basis for V. A useful generalization of these results is provided by the following theorem.

Theorem 6.6.5. For any set S of R linearly independent matrices in a K-dimensional linear space V (of M × N matrices), there exists a basis for V that includes all R of the matrices in S (and K − R additional matrices).

For a proof of the result set forth in Theorem 6.6.5, refer, for example, to Harville (1997, sec. 4.3g).

Not only does every linear space (of M × N matrices) have a basis (as asserted by Theorem 2.4.7), but (according to Theorem 2.4.23) every linear space (of M × N matrices) has an orthonormal basis. A useful generalization of this result is provided by the following variation on Theorem 6.6.5.

Theorem 6.6.6. For any orthonormal set S of R matrices in a K-dimensional linear space V (of M × N matrices), there exists an orthonormal basis for V that includes all R of the matrices in S (and K − R additional matrices).

Theorem 6.6.6 can be derived from Theorem 6.6.5 in much the same way that Theorem 2.4.23 can be derived from Theorem 2.4.7—refer, e.g., to Harville (1997, sec. 6.4c) for some specifics.
z′Az / z′z = z′Q₁Q₁′z / z′QQ′z = y₁′y₁ / y′y,

and that the elements of y₁ are the first R elements of y, we conclude that

z′Az / z′z ∼ ( Σᵢ₌₁ᴿ yᵢ² ) / ( Σᵢ₌₁ᴹ yᵢ² ).
In summary, we have the following theorem, which generalizes Theorem 6.6.7 and which relates to Theorem 6.6.2 in the same way that Theorem 6.6.7 relates to Theorem 6.6.1.

Theorem 6.6.8. Let x represent an M-dimensional random column vector that has an N(0, Σ) distribution, let P = rank Σ, suppose that P > 0, take y₁, y₂, …, y_P to be statistically independent random variables that are distributed identically as N(0, 1), and denote by A an M × M symmetric matrix of constants (such that ΣAΣ ≠ 0). If ΣAΣAΣ = ΣAΣ, then

x′Ax / x′Σ⁻x ∼ ( Σᵢ₌₁ᴿ yᵢ² ) / ( Σᵢ₌₁ᴾ yᵢ² ),

where R = rank(ΣAΣ) = tr(AΣ); and, conversely, if x′Ax/x′Σ⁻x ∼ ( Σᵢ₌₁ᴿ yᵢ² ) / ( Σᵢ₌₁ᴾ yᵢ² ) for some integer R between 1 and P, inclusive, then ΣAΣAΣ = ΣAΣ and R = rank(ΣAΣ) = tr(AΣ).

In connection with Theorem 6.6.8, note that (for 1 ≤ R ≤ P − 1) the condition x′Ax/x′Σ⁻x ∼ ( Σᵢ₌₁ᴿ yᵢ² ) / ( Σᵢ₌₁ᴾ yᵢ² ) is equivalent to the condition

x′Ax / x′Σ⁻x ∼ Be( R/2, (P − R)/2 ).  (6.18)
Note also that the condition ΣAΣAΣ = ΣAΣ is satisfied if, in particular, (AΣ)² = AΣ, in which case

tr(AΣ) = rank(AΣ) [= rank(ΣAΣ)].

Finally, note that if Σ is nonsingular, then the condition ΣAΣAΣ = ΣAΣ is equivalent to the condition AΣA = A, and rank(ΣAΣ) = rank A.
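The beta-distribution form (6.18) can be illustrated by simulation; in the sketch below (an added illustration assuming NumPy and SciPy), Σ = I_P and A is an orthogonal projection matrix of rank R, so that the hypotheses of Theorem 6.6.8 are satisfied:

```python
# Illustrative simulation (assumes NumPy and SciPy) of result (6.18).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
P, R, n = 6, 2, 300_000
x = rng.standard_normal((n, P))                          # x ~ N(0, I_P)
# A projects onto the first R coordinates: (A)^2 = A, rank A = R.
ratio = (x[:, :R]**2).sum(axis=1) / (x**2).sum(axis=1)   # x'Ax / x'x
grid = np.array([0.1, 0.3, 0.5, 0.8])
assert np.allclose((ratio[:, None] <= grid).mean(axis=0),
                   stats.beta.cdf(grid, R / 2, (P - R) / 2), atol=0.01)
```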
x′Ax / x′Σ⁻x ∼ z′ΓAΓ′z / z′z,

that

z′ΓAΓ′z / z′z = [ (z′z)^(−1/2) z ]′ ΓAΓ′ [ (z′z)^(−1/2) z ],

and that the normalized vector (z′z)^(−1/2) z has the same distribution as in the special case where z ∼ N(0, I_P). Accordingly, the distribution of x′Ax/x′Σ⁻x is the same in the general case where (for a P × M matrix Γ such that Σ = Γ′Γ and a P-dimensional random column vector z that has an absolutely continuous spherical distribution) x ∼ Γ′z as in the special case where x ∼ N(0, Σ).
Moreover, if A is symmetric, then x₀′Ax₀/x₀′x₀ and x₁′Ax₁/x₁′x₁ are eigenvalues of A—they are respectively the smallest and largest eigenvalues of A—and x₀ and x₁ are eigenvectors corresponding to x₀′Ax₀/x₀′x₀ and x₁′Ax₁/x₁′x₁, respectively.

Proof. Let x represent an N-dimensional column vector of (unconstrained) variables, and take f(·) to be the function defined (on R^N) as follows: f(x) = x′Ax. Further, define S = {x : x′x = 1}. And observe that the function f(·) is continuous and that the set S is closed and bounded. Then, upon recalling (as in Section 6.5a) that a continuous function attains a minimum value and a maximum value over any closed and bounded set, it follows that S contains vectors x₀ and x₁ such that, for x ∈ S,

x₀′Ax₀ ≤ x′Ax ≤ x₁′Ax₁.

Thus, for x ≠ 0,

x₀′Ax₀/x₀′x₀ = x₀′Ax₀ ≤ x′Ax/x′x ≤ x₁′Ax₁ = x₁′Ax₁/x₁′x₁.
Now, suppose that A is symmetric, and take x₀ and x₁ to be any N-dimensional nonnull column vectors such that

x₀′Ax₀/x₀′x₀ ≤ x′Ax/x′x ≤ x₁′Ax₁/x₁′x₁

for every nonnull column vector x in R^N. And observe that, for x ≠ 0,

(1/x′x) x′[A − (x₀′Ax₀/x₀′x₀) I_N] x ≥ 0

or, equivalently,

x′[A − (x₀′Ax₀/x₀′x₀) I_N] x ≥ 0.

Thus, A − (x₀′Ax₀/x₀′x₀) I_N is a symmetric nonnegative definite matrix.
A = QDQ′.  (7.2)

Since QDQ′ is symmetric, it is clear from equality (7.2) that a necessary condition for an N × N matrix A to be orthogonally diagonalizable is that A be symmetric—certain nonsymmetric matrices are diagonalizable, but they are not orthogonally diagonalizable. This condition is also sufficient, as indicated by the following theorem.
Theorem 6.7.4. Every symmetric matrix is orthogonally diagonalizable.

Proof. The proof is by mathematical induction. Clearly, every 1 × 1 matrix is orthogonally diagonalizable. Now, suppose that every (N−1) × (N−1) symmetric matrix is orthogonally diagonalizable (where N ≥ 2). Then, it suffices to show that every N × N symmetric matrix is orthogonally diagonalizable.

Let A represent an N × N symmetric matrix. And let λ represent an eigenvalue of A (the existence of which is guaranteed by Corollary 6.7.2), and take u to be an eigenvector (of A) with (usual) norm 1 that corresponds to λ. Further, take V to be any N × (N−1) matrix such that the N vectors consisting of u and the N−1 columns of V form an orthonormal basis for R^N—the existence of such a matrix follows from Theorem 6.6.6—or, equivalently, such that (u, V) is an N × N orthogonal matrix. Then, Au = λu, u′u = 1, and V′u = 0, and, consequently,

(u, V)′ A (u, V) = | u′Au   (V′Au)′ | = | λ   0′   |
                   | V′Au   V′AV    |   | 0   V′AV |.

Clearly, V′AV is a symmetric matrix of order N−1, so that (by supposition) there exists an (N−1) × (N−1) orthogonal matrix R such that R′V′AV R = F for some diagonal matrix F (of order N−1). Define S = diag(1, R), and let P = (u, V)S. Then,

S′S = diag(1, R′R) = diag(1, I_(N−1)) = I_N,

so that S is orthogonal and hence (according to Lemma 2.7.1) P is orthogonal. Further,
and hence

p(λ) = (−1)^N Πᵢ₌₁ᴺ (λ − dᵢ)  (7.6)
     = (−1)^N Πⱼ₌₁ᴷ (λ − λⱼ)^(Nⱼ),  (7.7)

where {λ₁, λ₂, …, λ_K} is a set whose elements consist of the distinct values represented among the N scalars d₁, d₂, …, d_N and where (for j = 1, 2, …, K) Nⱼ represents the number of values of the integer i (between 1 and N, inclusive) for which dᵢ = λⱼ.

In light of expression (7.6) or (7.7), it is clear that a scalar λ is an eigenvalue of A if and only if λ = dᵢ for some integer i (between 1 and N, inclusive) or, equivalently, if and only if λ is contained in the set {λ₁, λ₂, …, λ_K}. Accordingly, λ₁, λ₂, …, λ_K may be referred to as the distinct eigenvalues of A. And, collectively, d₁, d₂, …, d_N may be referred to as the not-necessarily-distinct eigenvalues of A. Further, the set {λ₁, λ₂, …, λ_K} is sometimes referred to as the spectrum of A, and (for j = 1, 2, …, K) Nⱼ is sometimes referred to as the multiplicity of λⱼ.
Clearly,

AQ = QD,  (7.8)

or, equivalently,

Aqᵢ = dᵢqᵢ (i = 1, 2, …, N).  (7.9)

Thus, the N orthonormal vectors q₁, q₂, …, q_N are eigenvectors of A, the ith of which corresponds to the eigenvalue dᵢ. Note that result (7.8) or (7.9) is also expressible in the form

AQⱼ = λⱼQⱼ (j = 1, 2, …, K),  (7.10)

where Qⱼ is the N × Nⱼ matrix whose columns consist of those of the eigenvectors q₁, q₂, …, q_N for which the corresponding eigenvalue equals λⱼ. Note also that the K equalities in the collection (7.10) are reexpressible as

(A − λⱼI_N)Qⱼ = 0 (j = 1, 2, …, K).  (7.11)

It is clear from result (7.11) that the columns of Qⱼ are members of N(A − λⱼI_N). In fact, they form a basis (an orthonormal basis) for N(A − λⱼI_N), as is evident upon observing (in light of Lemma 2.11.5) that

dim[N(A − λⱼI_N)] = N − rank(A − λⱼI_N)
                  = N − rank[Q′(A − λⱼI_N)Q]
                  = N − rank(D − λⱼI_N) = N − (N − Nⱼ) = Nⱼ.  (7.12)
In general, a distinction needs to be made between the algebraic multiplicity and the geometric multiplicity of an eigenvalue λⱼ; algebraic multiplicity refers to the multiplicity of λⱼ as a root of the characteristic polynomial p(·), whereas geometric multiplicity refers to the dimension of the linear space N(A − λⱼI_N). However, in the present context (where the eigenvalue is that of a symmetric matrix A), the algebraic and geometric multiplicities are equal, so that no distinction is necessary.
To what extent is the spectral decomposition of A unique? The distinct eigenvalues λ₁, λ₂, …, λ_K are (aside from order) unique, and their multiplicities N₁, N₂, …, N_K are unique, as is evident from result (7.7). Moreover, for j (an integer between 1 and K, inclusive) such that Nⱼ = 1, Qⱼ is unique, as is evident from result (7.11) upon observing [in light of result (7.12)] that (if Nⱼ = 1) dim[N(A − λⱼI_N)] = 1. For j such that Nⱼ > 1, Qⱼ is not uniquely determined; Qⱼ can be taken to be any N × Nⱼ matrix whose columns are orthonormal eigenvectors (of A) corresponding to the eigenvalue λⱼ or, equivalently, whose columns form an orthonormal basis for N(A − λⱼI_N)—refer to Lemma 6.7.3. However, even for j such that Nⱼ > 1, QⱼQⱼ′ is uniquely determined—refer, e.g., to Harville (1997, sec. 21.5) for a proof. Accordingly, a decomposition of A that is unique (aside from the order of the terms) is obtained upon reexpressing decomposition (7.5) in the form

A = Σⱼ₌₁ᴷ λⱼEⱼ,  (7.13)

where Eⱼ = QⱼQⱼ′. Decomposition (7.13), like decompositions (7.4) and (7.5), is sometimes referred to as the spectral decomposition.
Rank, trace, and determinant of a symmetric matrix. The following theorem provides expressions for the rank, trace, and determinant of a symmetric matrix (in terms of its eigenvalues).

Theorem 6.7.5. Let A represent an N × N symmetric matrix with not-necessarily-distinct eigenvalues d₁, d₂, …, d_N and with distinct eigenvalues λ₁, λ₂, …, λ_K of multiplicities N₁, N₂, …, N_K, respectively. Then,

(1) rank A = N − N₀, where N₀ = Nⱼ if λⱼ = 0 (1 ≤ j ≤ K) and where N₀ = 0 if 0 ∉ {λ₁, λ₂, …, λ_K} (i.e., where N₀ equals the multiplicity of the eigenvalue 0 if 0 is an eigenvalue of A and equals 0 otherwise);

(2) tr(A) = Σᵢ₌₁ᴺ dᵢ = Σⱼ₌₁ᴷ Nⱼλⱼ; and

(3) det(A) = Πᵢ₌₁ᴺ dᵢ = Πⱼ₌₁ᴷ λⱼ^(Nⱼ).

Proof. Let Q represent an N × N orthogonal matrix such that A = QDQ′, where D = diag(d₁, d₂, …, d_N)—the existence of such a matrix follows from the results of the preceding part of the present subsection (i.e., the part pertaining to the spectral decomposition).

(1) Clearly, rank A equals rank D, and rank D equals the number of diagonal elements of D that are nonzero. Thus, rank A = N − N₀.

(2) Making use of Lemma 2.3.1, we find that

tr(A) = tr(QDQ′) = tr(DQ′Q) = tr(DI) = tr(D) = Σᵢ₌₁ᴺ dᵢ = Σⱼ₌₁ᴷ Nⱼλⱼ.

(3) Making use of result (2.14.25), Lemma 2.14.3, and Corollary 2.14.19, we find that

|A| = |QDQ′| = |Q||D||Q′| = |Q|²|D| = |D| = Πᵢ₌₁ᴺ dᵢ = Πⱼ₌₁ᴷ λⱼ^(Nⱼ).

Q.E.D.
When is a symmetric matrix nonnegative definite, positive definite, or positive semidefinite? Let A represent an N × N symmetric matrix. And take Q to be an N × N orthogonal matrix and D an N × N diagonal matrix such that A = QDQ′—the existence of such an orthogonal matrix and such a diagonal matrix follows from Theorem 6.7.4. Then, upon recalling (from the discussion of the spectral decomposition) that the diagonal elements of D constitute the (not-necessarily-distinct) eigenvalues of A and upon applying Corollary 2.13.16, we arrive at the following result.

Theorem 6.7.6. Let A represent an N × N symmetric matrix with not-necessarily-distinct eigenvalues d₁, d₂, …, d_N. Then, (1) A is nonnegative definite if and only if d₁, d₂, …, d_N are nonnegative; (2) A is positive definite if and only if d₁, d₂, …, d_N are (strictly) positive; and (3) A is positive semidefinite if and only if dᵢ ≥ 0 for i = 1, 2, …, N with equality holding for one or more values of i.
When is a symmetric matrix idempotent? The following theorem characterizes the idempotency of a symmetric matrix in terms of its eigenvalues.

Theorem 6.7.7. An N × N symmetric matrix is idempotent if (and only if) it has no eigenvalues other than 0 or 1.

Proof. Let A represent an N × N symmetric matrix, and denote by d₁, d₂, …, d_N its not-necessarily-distinct eigenvalues. And observe (in light of the discussion of the spectral decomposition) that there exists an N × N orthogonal matrix Q such that A = QDQ′, where D = diag(d₁, d₂, …, d_N). Observe also that

A² = QDQ′QDQ′ = QD²Q′.

Thus,

A² = A ⇔ D² = D ⇔ dᵢ² = dᵢ (i = 1, 2, …, N).

Moreover, dᵢ² = dᵢ if and only if either dᵢ = 0 or dᵢ = 1. It is now clear that A is idempotent if (and only if) it has no eigenvalues other than 0 or 1. Q.E.D.
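Theorem 6.7.7 can be illustrated with an orthogonal projection matrix, which is symmetric and idempotent; a minimal sketch (an added illustration assuming NumPy):

```python
# Illustrative check (assumes NumPy) of Theorem 6.7.7: a symmetric idempotent
# matrix has no eigenvalues other than 0 and 1.
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((6, 3))             # full column rank (a.s.)
P = X @ np.linalg.solve(X.T @ X, X.T)       # projection onto C(X), rank 3
assert np.allclose(P @ P, P)                # idempotent
assert np.allclose(np.linalg.eigvalsh(P), [0, 0, 0, 1, 1, 1], atol=1e-8)
```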
—the existence of such an orthogonal matrix and such a diagonal matrix follows from Theorem 6.7.4. As previously indicated (in Subsection a), the representation (7.14) is sometimes referred to as the spectral decomposition or representation (of the matrix A).

The second-degree polynomial q can be reexpressed in terms related to the representation (7.14). Let r = O′b and u = O′z. Then, clearly,

q = c + r′u + u′Du = c + Σᵢ₌₁ᴺ (rᵢuᵢ + dᵢuᵢ²),  (7.15)
Then, upon regarding q as a second-degree polynomial in the random vector u (rather than in the random vector z) and applying formula (5.9), we find that, for t ∈ S,

m(t) = |I − 2tD|^(−1/2) exp[ tc + (1/2) t² Σᵢ₌₁ᴺ rᵢ²/(1 − 2tdᵢ) ].

In the special case where c = 0 and b = 0, that is, the special case where q = z′Az, we have that, for t ∈ S,

m(t) = |I − 2tD|^(−1/2) = Πᵢ₌₁ᴺ (1 − 2tdᵢ)^(−1/2).  (7.21)
—in the “degenerate” case where K = 1, z′Az/z′z = λ₁.
Result (7.24) can be generalized. Suppose that x is an N-dimensional random column vector that is distributed as N(0, Σ) (where Σ ≠ 0), let P = rank Σ, and take A to be an N × N symmetric matrix of constants. Further, take Γ to be a matrix of dimensions P × N such that Σ = Γ′Γ (the existence of which follows from Corollary 2.13.23), and take z to be a P-dimensional random column vector that has an N(0, I_P) distribution. Then, as in the case of result (6.17), we have that

x′Ax / x′Σ⁻x ∼ z′ΓAΓ′z / z′z.

And upon applying result (7.24) (with the P × P matrix ΓAΓ′ in place of the N × N matrix A), we find that

x′Ax / x′Σ⁻x ∼ Σⱼ₌₁ᴷ λⱼwⱼ,  (7.25)

where λ₁, λ₂, …, λ_K are the distinct eigenvalues of ΓAΓ′ with multiplicities N₁, N₂, …, N_K, respectively, where w₁, w₂, …, w_(K−1) are random variables that are jointly distributed as Di(N₁/2, N₂/2, …, N_(K−1)/2; N_K/2), and where w_K = 1 − Σⱼ₌₁ᴷ⁻¹ wⱼ—in the “degenerate” case where K = 1, w₁ = 1, so that x′Ax/x′Σ⁻x = λ₁.
Let x represent a real variable. Then, a function of x, say p(x), that is expressible in the form

p(x) = a₀ + a₁x + a₂x² + ⋯ + a_N x^N,

where N is a nonnegative integer and where the coefficients a₀, a₁, a₂, …, a_N are real numbers, is referred to as a polynomial (in x). The polynomial p(x) is said to be nonzero if one or more of the coefficients a₀, a₁, a₂, …, a_N are nonzero, in which case the largest nonnegative integer k such that a_k ≠ 0 is referred to as the degree of p(x) and is denoted by the symbol deg[p(x)]. When it causes no confusion, p(x) may be abbreviated to p {and deg[p(x)] to deg(p)}.

A polynomial q is said to be a factor of a polynomial p if there exists a polynomial r such that p ≡ qr. And a real number c is said to be a root (or a zero) of a polynomial p if p(c) = 0.
A basic property of polynomials is as follows.

Theorem 6.7.8. Let p(x) and q(x) represent polynomials (in a variable x). And suppose that p(x) = q(x) for all x in some nondegenerate interval. Or, more generally, taking N to be a nonnegative integer such that N > deg(p) (if p is nonzero) and N > deg(q) (if q is nonzero), suppose that p(x) = q(x) for N distinct values of x. Then, p(x) = q(x) for all x.

Proof. Suppose that N > 0, and take x₁, x₂, …, x_N to be N distinct values of x such that p(xᵢ) = q(xᵢ) for i = 1, 2, …, N—if N = 0, then neither p nor q is nonzero, in which case p(x) = 0 = q(x) for all x. And observe that there exist real numbers a₀, a₁, a₂, …, a_(N−1) and b₀, b₁, b₂, …, b_(N−1) such that

p(x) = a₀ + a₁x + a₂x² + ⋯ + a_(N−1)x^(N−1)

and

q(x) = b₀ + b₁x + b₂x² + ⋯ + b_(N−1)x^(N−1).

Further, let p = [p(x₁), p(x₂), …, p(x_N)]′ and q = [q(x₁), q(x₂), …, q(x_N)]′, and define a = (a₀, a₁, a₂, …, a_(N−1))′ and b = (b₀, b₁, b₂, …, b_(N−1))′.
Refer, for example, to Beaumont and Pierce (1963) for proofs of Theorems 6.7.9, 6.7.10, and 6.7.11, which are equivalent to their theorems 9-3.3, 9-3.5, and 9-7.5, and to Harville (1997, appendix to chap. 21) for a proof of Theorem 6.7.12.

The various basic properties of polynomials can be used to establish the following result.

Theorem 6.7.13. Let r₁(x), s₁(x), and s₂(x) represent polynomials in a real variable x. And take r₂(x) to be a function of x defined as follows:
Thus, R = K [since, otherwise, the polynomials forming the left and right sides of equality (7.29) would be of different degrees]. And, for i = 1, 2, …, K, dᵢ = 1 [since the left side of equality (7.29) has a root at t = 1/(2dᵢ), while, if dᵢ ≠ 1, the right side does not]. We conclude (on the basis of Theorem 6.7.7) that A² = A and (in light of Corollary 2.8.3) that K = tr(A).

It remains to show that b = Ab and that λ = c = ¼b′b. Since R = K and d₁ = d₂ = ⋯ = d_K = 1, it follows from result (7.28) that (for t ∈ I)

2tc + t² [ Σᵢ₌₁ᴷ uᵢ²/(1 − 2t) + Σᵢ₌ₖ₊₁ᴹ uᵢ² ] − 2λt/(1 − 2t) = 0.  (7.30)

And upon multiplying both sides of equality (7.30) by 1 − 2t, we obtain the equality

2(c − λ)t − 4[ c − ¼ Σᵢ₌₁ᴹ uᵢ² ] t² − 2t³ Σᵢ₌ₖ₊₁ᴹ uᵢ² = 0

[which, since both sides are polynomials in t, holds for all t]. Thus,

c − λ = c − ¼ Σᵢ₌₁ᴹ uᵢ² = Σᵢ₌ₖ₊₁ᴹ uᵢ² = 0.

Moreover,

Σᵢ₌₁ᴹ uᵢ² = u′u = (O′b)′O′b = b′b,

and

Σᵢ₌ₖ₊₁ᴹ uᵢ² = (u_(K+1), u_(K+2), …, u_M)(u_(K+1), u_(K+2), …, u_M)′ = (O₂′b)′O₂′b.

We conclude that

λ = c = ¼ b′b

and that O₂′b = 0 and hence that b = Ab. Q.E.D.
Corollary 6.8.2. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution, where Σ is nonsingular. And, for i = 1, 2, …, K, take qᵢ = cᵢ + bᵢ′x + x′Aᵢx, where cᵢ is a constant, bᵢ an M-dimensional column vector of constants, and Aᵢ an M × M symmetric matrix of constants. Then, q₁, q₂, …, q_K are distributed independently if and only if, for j ≠ i = 1, 2, …, K,

AᵢΣAⱼ = 0,  (8.4)
AᵢΣbⱼ = 0,  (8.5)
and
bᵢ′Σbⱼ = 0.  (8.6)
Note that in the special case of Corollary 6.8.2 where q₁, q₂, …, q_K are quadratic forms (i.e., the special case where c₁ = c₂ = ⋯ = c_K = 0 and b₁ = b₂ = ⋯ = b_K = 0), conditions (8.5) and (8.6) are vacuous; in this special case, only condition (8.4) is “operative.” Note also that in the special case of Corollary 6.8.2 where Σ = I, Corollary 6.8.2 reduces to the following result.
Theorem 6.8.3. Let x represent an M-dimensional random column vector that has an N(μ, I_M) distribution. And, for i = 1, 2, …, K, take qᵢ = cᵢ + bᵢ′x + x′Aᵢx, where cᵢ is a constant, bᵢ an M-dimensional column vector of constants, and Aᵢ an M × M symmetric matrix of constants. Then, q₁, q₂, …, q_K are distributed independently if and only if, for j ≠ i = 1, 2, …, K,

AᵢAⱼ = 0, Aᵢbⱼ = 0, and bᵢ′bⱼ = 0.  (8.7)
Verification of theorems. To prove Theorem 6.8.1, it suffices to prove Theorem 6.8.3. In fact, it suffices to prove the special case of Theorem 6.8.3 where μ = 0. To see this, define x and q₁, q₂, …, q_K as in Theorem 6.8.1. That is, take x to be an M-dimensional random column vector that has an N(μ, Σ) distribution. And, for i = 1, 2, …, K, take qᵢ = cᵢ + bᵢ′x + x′Aᵢx, where cᵢ is a constant, bᵢ an M-dimensional column vector of constants, and Aᵢ an M × M symmetric matrix of constants. Further, take Γ to be any matrix (with M columns) such that Σ = Γ′Γ, denote by P the number of rows in Γ, and define z to be a P-dimensional random column vector that has an N(0, I_P) distribution. Then, x ∼ μ + Γ′z, and hence the joint distribution of q₁, q₂, …, q_K is identical to that of q₁*, q₂*, …, q_K*, where (for i = 1, 2, …, K)

qᵢ* = cᵢ + bᵢ′(μ + Γ′z) + (μ + Γ′z)′Aᵢ(μ + Γ′z)
    = cᵢ + bᵢ′μ + μ′Aᵢμ + (bᵢ + 2Aᵢμ)′Γ′z + z′ΓAᵢΓ′z,

which is a second-degree polynomial in z. Thus, it follows from Theorem 6.8.3 (upon applying the theorem to q₁*, q₂*, …, q_K*) that q₁*, q₂*, …, q_K* (and hence q₁, q₂, …, q_K) are statistically independent if and only if, for j ≠ i = 1, 2, …, K,

ΓAᵢΓ′ ΓAⱼΓ′ = 0,  (8.8)
ΓAᵢΓ′ Γ(bⱼ + 2Aⱼμ) = 0,  (8.9)
and
(bᵢ + 2Aᵢμ)′Γ′ Γ(bⱼ + 2Aⱼμ) = 0.  (8.10)

Moreover, in light of Corollary 2.3.4,

ΓAᵢΓ′ΓAⱼΓ′ = 0 ⇔ Γ′ΓAᵢΓ′ΓAⱼΓ′ = 0 ⇔ Γ′ΓAᵢΓ′ΓAⱼΓ′Γ = 0

and

ΓAᵢΓ′Γ(bⱼ + 2Aⱼμ) = 0 ⇔ Γ′ΓAᵢΓ′Γ(bⱼ + 2Aⱼμ) = 0.

And we conclude that the conditions [conditions (8.8), (8.9), and (8.10)] derived from the application of Theorem 6.8.3 to q₁*, q₂*, …, q_K* are equivalent to conditions (8.1), (8.2), and (8.3) (of Theorem 6.8.1).
Theorem 6.8.3, like Theorem 6.6.1, pertains to second-degree polynomials in a normally dis-
tributed random column vector. Theorem 6.8.3 gives conditions that are necessary and sufficient
for such second-degree polynomials to be statistically independent, whereas Theorem 6.6.1 gives
conditions that are necessary and sufficient for such a second-degree polynomial to have a noncentral
chi-square distribution. In the case of Theorem 6.8.3, as in the case of Theorem 6.6.1, it is much
easier to prove sufficiency than necessity, and the sufficiency is more important than the necessity
(in the sense that it is typically the sufficiency that is invoked in an application). Accordingly, the
following proof of Theorem 6.8.3 is a proof of sufficiency; the proof of necessity is deferred until a
subsequent subsection (Subsection e).
Proof (of Theorem 6.8.3): sufficiency. For i = 1, 2, …, K, qᵢ is reexpressible in the form

qᵢ = cᵢ + bᵢ′x + (Aᵢx)′Aᵢ⁻(Aᵢx),

where Aᵢ⁻ is a generalized inverse of Aᵢ, and hence qᵢ depends on the value of x only through the (M+1)-dimensional column vector (bᵢ, Aᵢ)′x. Moreover, for j ≠ i = 1, 2, …, K,

cov[(bᵢ, Aᵢ)′x, (bⱼ, Aⱼ)′x] = (bᵢ, Aᵢ)′(bⱼ, Aⱼ) = | bᵢ′bⱼ   (Aⱼbᵢ)′ |
                                                    | Aᵢbⱼ    AᵢAⱼ   |.  (8.11)

And the joint distribution of the K vectors (b₁, A₁)′x, (b₂, A₂)′x, …, (b_K, A_K)′x is multivariate normal.

Now, suppose that, for j ≠ i = 1, 2, …, K, condition (8.7) is satisfied. Then, in light of result (8.11), (b₁, A₁)′x, (b₂, A₂)′x, …, (b_K, A_K)′x are uncorrelated and hence (since their joint distribution is multivariate normal) statistically independent, leading to the conclusion that the second-degree polynomials q₁, q₂, …, q_K [each of which depends on a different one of the vectors (b₁, A₁)′x, (b₂, A₂)′x, …, (b_K, A_K)′x] are statistically independent. Q.E.D.
An extension. The coverage of Theorem 6.8.1 includes the special case where one or more of the quantities q₁, q₂, …, q_K (whose statistical independence is in question) are linear forms. In the following generalization, the coverage is extended to include vectors of linear forms.

Theorem 6.8.4. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution. And, for i = 1, 2, …, K, take qᵢ = cᵢ + bᵢ′x + x′Aᵢx, where cᵢ is a constant, bᵢ an M-dimensional column vector of constants, and Aᵢ an M × M symmetric matrix of constants. Further, for s = 1, 2, …, R, denote by d_s an N_s-dimensional column vector of constants and by L_s an M × N_s matrix of constants. Then, q₁, q₂, …, q_K, d₁ + L₁′x, d₂ + L₂′x, …, d_R + L_R′x are distributed independently if and only if, for j ≠ i = 1, 2, …, K,

ΣAᵢΣAⱼΣ = 0,  (8.12)
ΣAᵢΣ(bⱼ + 2Aⱼμ) = 0,  (8.13)
and
(bᵢ + 2Aᵢμ)′Σ(bⱼ + 2Aⱼμ) = 0,  (8.14)

for i = 1, 2, …, K and s = 1, 2, …, R,

ΣAᵢΣL_s = 0  (8.15)
and
(bᵢ + 2Aᵢμ)′ΣL_s = 0,  (8.16)

and, for t ≠ s = 1, 2, …, R,

L_t′ΣL_s = 0.  (8.17)

Note that in the special case where Σ is nonsingular, conditions (8.15) and (8.16) are (collectively) equivalent to the condition

AᵢΣL_s = 0 and bᵢ′ΣL_s = 0.  (8.18)

Accordingly, in the further special case where Σ = I, the result of Theorem 6.8.4 can be restated in the form of the following generalization of Theorem 6.8.3.

Theorem 6.8.5. Let x represent an M-dimensional random column vector that has an N(μ, I_M) distribution. And, for i = 1, 2, …, K, take qᵢ = cᵢ + bᵢ′x + x′Aᵢx, where cᵢ is a constant, bᵢ an M-dimensional column vector of constants, and Aᵢ an M × M symmetric matrix of constants. Further, for s = 1, 2, …, R, denote by d_s an N_s-dimensional column vector of constants and by L_s an M × N_s matrix of constants. Then, q₁, q₂, …, q_K, d₁ + L₁′x, d₂ + L₂′x, …, d_R + L_R′x are distributed independently if and only if, for j ≠ i = 1, 2, …, K,

AᵢAⱼ = 0, Aᵢbⱼ = 0, and bᵢ′bⱼ = 0,  (8.19)

for i = 1, 2, …, K and s = 1, 2, …, R,

AᵢL_s = 0 and bᵢ′L_s = 0,  (8.20)

and, for t ≠ s = 1, 2, …, R,

L_t′L_s = 0.  (8.21)
To prove Theorem 6.8.4, it suffices to prove Theorem 6.8.5, as can be established via a straightforward extension of the argument used in establishing that to prove Theorem 6.8.1, it suffices to prove Theorem 6.8.3. Moreover, the “sufficiency part” of Theorem 6.8.5 can be established via a straightforward extension of the argument used to establish the sufficiency part of Theorem 6.8.3.

Now, consider the “necessity part” of Theorem 6.8.5. Suppose that q₁, q₂, …, q_K, d₁ + L₁′x, d₂ + L₂′x, …, d_R + L_R′x are statistically independent. Then, for arbitrary column vectors h₁, h₂, …, h_R
Statistical independence versus zero correlation. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution. For an M × N₁ matrix of constants L₁ and an M × N₂ matrix of constants L₂,

cov(L₁′x, L₂′x) = L₁′ΣL₂.

Thus, two vectors of linear forms in a normally distributed random column vector are statistically independent if and only if they are uncorrelated.

For an M × N matrix of constants L and an M × M symmetric matrix of constants A,

cov(L′x, x′Ax) = 2L′ΣAμ

—refer to result (5.7.14)—so that the vector L′x of linear forms and the quadratic form x′Ax (in the normally distributed random vector x) are uncorrelated if and only if L′ΣAμ = 0, but are statistically independent if and only if ΣAΣL = 0 and μ′AΣL = 0 (or, equivalently, if and only if L′ΣAΣ = 0 and L′ΣAμ = 0). And, for two M × M symmetric matrices of constants A₁ and A₂,

cov(x′A₁x, x′A₂x) = 2 tr(A₁ΣA₂Σ) + 4μ′A₁ΣA₂μ

—refer to result (5.7.19)—so that the two quadratic forms x′A₁x and x′A₂x (in the normally distributed random vector x) are uncorrelated if and only if

tr(A₁ΣA₂Σ) + 2μ′A₁ΣA₂μ = 0,  (8.24)

but are statistically independent if and only if

ΣA₁ΣA₂Σ = 0, ΣA₁ΣA₂μ = 0, ΣA₂ΣA₁μ = 0, and μ′A₁ΣA₂μ = 0.
b. Cochran’s theorem

Theorem 6.8.1 can be used to determine whether two or more second-degree polynomials (in a normally distributed random vector) are statistically independent. The following theorem can be used to determine whether the second-degree polynomials are not only statistically independent but whether, in addition, they have noncentral chi-square distributions.

Theorem 6.8.6. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution. And, for i = 1, 2, …, K, take qᵢ = cᵢ + bᵢ′x + x′Aᵢx, where cᵢ is a constant, bᵢ an M-dimensional column vector of constants, and Aᵢ an M × M symmetric matrix of constants (such that ΣAᵢΣ ≠ 0). Further, define A = A₁ + A₂ + ⋯ + A_K. If

ΣAΣAΣ = ΣAΣ,  (8.25)
rank(ΣA₁Σ) + rank(ΣA₂Σ) + ⋯ + rank(ΣA_KΣ) = rank(ΣAΣ),  (8.26)
Σ(bᵢ + 2Aᵢμ) = ΣAΣ(bᵢ + 2Aᵢμ) (i = 1, 2, …, K),  (8.27)
and
cᵢ + bᵢ′μ + μ′Aᵢμ = ¼(bᵢ + 2Aᵢμ)′Σ(bᵢ + 2Aᵢμ) (i = 1, 2, …, K),  (8.28)

then q₁, q₂, …, q_K are statistically independent and (for i = 1, 2, …, K) qᵢ ∼ χ²(Rᵢ, cᵢ + bᵢ′μ + μ′Aᵢμ), where Rᵢ = rank(ΣAᵢΣ) = tr(AᵢΣ). Conversely, if q₁, q₂, …, q_K are statistically independent and (for i = 1, 2, …, K) qᵢ ∼ χ²(Rᵢ, λᵢ) (where Rᵢ is a strictly positive integer), then conditions (8.25), (8.26), (8.27), and (8.28) are satisfied and (for i = 1, 2, …, K) Rᵢ = rank(ΣAᵢΣ) = tr(AᵢΣ) and λᵢ = cᵢ + bᵢ′μ + μ′Aᵢμ.
Some results on matrices. The proof of Theorem 6.8.6 makes use of certain properties of matrices. These properties are presented in the form of a generalization of the following theorem.

Theorem 6.8.7. Let A₁, A₂, …, A_K represent N × N matrices, and define A = A₁ + A₂ + ⋯ + A_K. Suppose that A is idempotent. Then, each of the following conditions implies the other two:

(1) AᵢAⱼ = 0 for j ≠ i = 1, 2, …, K;
(2) A₁, A₂, …, A_K are idempotent;
(3) rank(A₁) + rank(A₂) + ⋯ + rank(A_K) = rank(A).
(2) ⇒ (3). Suppose that Condition (2) is satisfied. Then, making use of Corollary 2.8.3, we find that

Σᵢ₌₁ᴷ rank(Aᵢ) = Σᵢ₌₁ᴷ tr(Aᵢ) = tr( Σᵢ₌₁ᴷ Aᵢ ) = tr(A) = rank(A).

(3) ⇒ (1). Suppose that Condition (3) is satisfied. And define A₀ = I_N − A. Then, Σᵢ₌₀ᴷ Aᵢ = I. Moreover, it follows from Lemma 2.8.4 (and also from Lemma 6.8.8) that rank(A₀) = N − rank(A) and hence that Σᵢ₌₀ᴷ rank(Aᵢ) = N.

Now, making use of inequality (2.4.24), we find (for an arbitrary integer i between 1 and K, inclusive) that

rank(I − Aᵢ) = rank( Σ_(s=0, s≠i)ᴷ A_s ) ≤ Σ_(s=0, s≠i)ᴷ rank(A_s) = N − rank(Aᵢ)
and, similarly (for arbitrary distinct integers i and j between 1 and K, inclusive), that rank(I − Aᵢ − Aⱼ) ≤ N − rank(Aᵢ + Aⱼ),
Proof of Theorem 6.8.6. Let us prove Theorem 6.8.6, doing so by taking advantage of Theorem 6.8.10. Suppose that conditions (8.25), (8.26), (8.27), and (8.28) are satisfied. In light of Theorem 6.8.10, it follows from conditions (8.25) and (8.26) that

ΣAᵢΣAᵢΣ = ΣAᵢΣ (i = 1, 2, …, K) and ΣAᵢΣAⱼΣ = 0 (j ≠ i = 1, 2, …, K).

And upon applying Theorems 6.8.1 and 6.6.2, we conclude that q₁, q₂, …, q_K are statistically independent and that (for i = 1, 2, …, K) qᵢ ∼ χ²(Rᵢ, cᵢ + bᵢ′μ + μ′Aᵢμ), where Rᵢ = rank(ΣAᵢΣ) = tr(AᵢΣ).

Conversely, suppose that q₁, q₂, …, q_K are statistically independent and that (for i = 1, 2, …, K) qᵢ ∼ χ²(Rᵢ, λᵢ) (where Rᵢ is a strictly positive integer). Then, from Theorem 6.6.2, we have that conditions (8.27) and (8.28) are satisfied, that (for i = 1, 2, …, K) Rᵢ = rank(ΣAᵢΣ) = tr(AᵢΣ) and λᵢ = cᵢ + bᵢ′μ + μ′Aᵢμ, and that

ΣAᵢΣAᵢΣ = ΣAᵢΣ (i = 1, 2, …, K).  (8.31)

And from Theorem 6.8.1, we have that

ΣAᵢΣAⱼΣ = 0 (j ≠ i = 1, 2, …, K).  (8.32)

Together, results (8.31) and (8.32) imply that condition (8.25) is satisfied. That condition (8.26) is also satisfied can be inferred from Theorem 6.8.10. Q.E.D.
Corollaries of Theorem 6.8.6. In light of Theorem 6.8.10, alternative versions of Theorem 6.8.6 can be obtained by replacing condition (8.26) with the condition

ΣAᵢΣAⱼΣ = 0 (j ≠ i = 1, 2, ..., K)

or with the condition

ΣAᵢΣAᵢΣ = ΣAᵢΣ (i = 1, 2, ..., K).

In either case, the replacement results in what can be regarded as a corollary of Theorem 6.8.6. The following result can also be regarded as a corollary of Theorem 6.8.6.
Corollary 6.8.11. Let x represent an M-dimensional random column vector that has an N(μ, I_M) distribution. And take A₁, A₂, ..., A_K to be M × M (nonnull) symmetric matrices of constants, and define A = A₁ + A₂ + ⋯ + A_K. If A is idempotent and

rank(A₁) + rank(A₂) + ⋯ + rank(A_K) = rank(A), (8.33)

then the quadratic forms x′A₁x, x′A₂x, ..., x′A_Kx are statistically independent and (for i = 1, 2, ..., K) x′Aᵢx ~ χ²(Rᵢ, μ′Aᵢμ), where Rᵢ = rank(Aᵢ) = tr(Aᵢ). Conversely, if x′A₁x, x′A₂x, ..., x′A_Kx are statistically independent and if (for i = 1, 2, ..., K) x′Aᵢx ~ χ²(Rᵢ, λᵢ) (where Rᵢ is a strictly positive integer), then A is idempotent, condition (8.33) is satisfied, and (for i = 1, 2, ..., K) Rᵢ = rank(Aᵢ) = tr(Aᵢ) and λᵢ = μ′Aᵢμ.
Corollary 6.8.11 can be deduced from Theorem 6.8.6 by making use of Theorem 6.8.7. The special case of Corollary 6.8.11 where μ = 0 and where A₁, A₂, ..., A_K are such that A = I was formulated and proved by Cochran (1934) and is known as Cochran's theorem. Cochran's theorem is one of the most famous theoretical results in all of statistics. Note that, in light of Theorem 6.8.7, alternative versions of Corollary 6.8.11 can be obtained by replacing condition (8.33) with the condition

AᵢAⱼ = 0 (j ≠ i = 1, 2, ..., K)

or with the condition

Aᵢ² = Aᵢ (i = 1, 2, ..., K).
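The content of Theorem 6.8.7 and of Cochran's theorem can be made concrete with a small numerical sketch. The following Python fragment (illustrative only; the balanced one-way ANOVA decomposition used here is a standard example, not taken from the text) verifies idempotency, pairwise orthogonality, and rank additivity for the three projection matrices into which I_n decomposes.

```python
# Sketch: the decomposition I = (1/n)J + (P_X - (1/n)J) + (I - P_X) for a
# balanced one-way ANOVA design; the summands are symmetric idempotent with
# ranks that add to n, as Cochran's theorem requires.
import numpy as np

n, g = 12, 3                                    # 12 observations in 3 groups of 4
X = np.kron(np.eye(g), np.ones((n // g, 1)))    # one-way ANOVA design matrix
J = np.ones((n, n)) / n
PX = X @ np.linalg.solve(X.T @ X, X.T)          # projection onto C(X)
A = [J, PX - J, np.eye(n) - PX]                 # mean, treatment, residual pieces

for Ai in A:
    assert np.allclose(Ai @ Ai, Ai)             # each A_i is idempotent
for i in range(3):
    for j in range(3):
        if i != j:
            assert np.allclose(A[i] @ A[j], 0)  # A_i A_j = 0 for i != j

ranks = [np.linalg.matrix_rank(Ai) for Ai in A]
print(ranks, sum(ranks))                        # [1, 2, 9], summing to n = 12
```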
Another result that can be regarded as a corollary of Theorem 6.8.6 is as follows.
Corollary 6.8.12. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution. And, for i = 1, 2, ..., K, take Aᵢ to be an M × M symmetric matrix of constants (such that AᵢΣ ≠ 0). Further, define A = A₁ + A₂ + ⋯ + A_K. If

AΣA = A (8.34)

and

AᵢΣAᵢ = Aᵢ (i = 1, 2, ..., K), (8.35)

then the quadratic forms x′A₁x, x′A₂x, ..., x′A_Kx are statistically independent and (for i = 1, 2, ..., K) x′Aᵢx ~ χ²(Rᵢ, μ′Aᵢμ), where Rᵢ = rank(AᵢΣ) = tr(AᵢΣ). Conversely, if x′A₁x, x′A₂x, ..., x′A_Kx are statistically independent and if (for i = 1, 2, ..., K) x′Aᵢx ~ χ²(Rᵢ, λᵢ) (where Rᵢ is a strictly positive integer), then, in the special case where Σ is nonsingular, conditions (8.34) and (8.35) are satisfied and (for i = 1, 2, ..., K) Rᵢ = rank(AᵢΣ) = tr(AᵢΣ) and λᵢ = μ′Aᵢμ.
Proof. Condition (8.34) implies condition (8.25), and in the special case where Σ is nonsingular, conditions (8.34) and (8.25) are equivalent. And as noted earlier, condition (8.26) of Theorem 6.8.6 can be replaced by the condition

ΣAᵢΣAᵢΣ = ΣAᵢΣ (i = 1, 2, ..., K). (8.36)

Moreover, condition (8.35) implies condition (8.36) and that (for i = 1, 2, ..., K) ΣAᵢμ = ΣAᵢΣAᵢμ and μ′Aᵢμ = μ′AᵢΣAᵢμ; in the special case where Σ is nonsingular, conditions (8.35) and (8.36) are equivalent. Condition (8.35) also implies that (for i = 1, 2, ..., K) AᵢΣ is idempotent and hence that rank(AᵢΣ) = tr(AᵢΣ). Accordingly, Corollary 6.8.12 is obtainable as an "application" of Theorem 6.8.6. Q.E.D.
If, in Corollary 6.8.12, we substitute (1/γᵢ)Aᵢ for Aᵢ, where γᵢ is a nonzero scalar, we obtain an additional corollary as follows.
Corollary 6.8.13. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution. And, for i = 1, 2, ..., K, take Aᵢ to be an M × M symmetric matrix of constants (such that AᵢΣ ≠ 0), and take γᵢ to be a nonzero constant. Further, define B = (1/γ₁)A₁ + (1/γ₂)A₂ + ⋯ + (1/γ_K)A_K. If

BΣB = B (8.37)

and
Conversely, suppose that z′A₁z/z′z, z′A₂z/z′z, ..., z′A_Kz/z′z have a Di(R₁/2, R₂/2, ..., R_K/2, [M − (R₁ + ⋯ + R_K)]/2; K) distribution (where R₁, R₂, ..., R_K are strictly positive integers such that R₁ + ⋯ + R_K < M). And partition z into subvectors z₁, z₂, ..., z_K, z_{K+1} of dimensions R₁, R₂, ..., R_K, M − (R₁ + ⋯ + R_K), respectively, so that z′ = (z₁′, z₂′, ..., z_K′, z_{K+1}′). Then, the (joint) distribution of z′A₁z/z′z, z′A₂z/z′z, ..., z′A_Kz/z′z is identical to that of the quantities z₁′z₁/z′z, z₂′z₂/z′z, ..., z_K′z_K/z′z. Moreover, the quantities z′A₁z/z′z, z′A₂z/z′z, ..., z′A_Kz/z′z and the quantities z₁′z₁/z′z, z₂′z₂/z′z, ..., z_K′z_K/z′z are both distributed independently of z′z, as is evident from the results of Section 6.1f upon observing that (for i = 1, 2, ..., K) z′Aᵢz/z′z and zᵢ′zᵢ/z′z depend on the value of z only through (z′z)^{−1/2} z. Thus,

(z′A₁z, z′A₂z, ..., z′A_Kz)′ = z′z (z′A₁z/z′z, z′A₂z/z′z, ..., z′A_Kz/z′z)′
  ~ z′z (z₁′z₁/z′z, z₂′z₂/z′z, ..., z_K′z_K/z′z)′ = (z₁′z₁, z₂′z₂, ..., z_K′z_K)′,

which implies that the quadratic forms z′A₁z, z′A₂z, ..., z′A_Kz are statistically independent and that (for i = 1, 2, ..., K) z′Aᵢz ~ χ²(Rᵢ). Upon applying Corollary 6.8.11, it follows that A is idempotent, that rank(A₁) + ⋯ + rank(A_K) = rank(A), and that (for i = 1, 2, ..., K) Rᵢ = rank(Aᵢ) = tr(Aᵢ) [in which case rank(A₁) + ⋯ + rank(A_K) = R₁ + ⋯ + R_K < M]. Q.E.D.
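For the case K = 1, Theorem 6.8.15 reduces to the assertion that z′Az/z′z has a Be(R/2, (M − R)/2) distribution. The following simulation sketch (illustrative only; numpy assumed, with an arbitrarily constructed idempotent A) checks the first two moments of the ratio against the beta moments.

```python
# Sketch: for z ~ N(0, I_M) and a symmetric idempotent A of rank R < M,
# z'Az / z'z should follow a Be(R/2, (M-R)/2) distribution.
import numpy as np

rng = np.random.default_rng(1)
M, R = 6, 2
Q, _ = np.linalg.qr(rng.standard_normal((M, M)))
A = Q[:, :R] @ Q[:, :R].T                    # symmetric idempotent, rank R

z = rng.standard_normal((500_000, M))
ratio = np.einsum("ij,jk,ik->i", z, A, z) / np.einsum("ij,ij->i", z, z)

a, b = R / 2, (M - R) / 2                    # here Be(1, 2)
print(ratio.mean(), a / (a + b))             # both approximately 1/3
print(ratio.var(), a * b / ((a + b) ** 2 * (a + b + 1)))  # both approximately 1/18
```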
The result of Theorem 6.8.15 can be generalized in much the same fashion as Theorem 6.6.7. Take x to be an M-dimensional random column vector that is distributed as N(0, Σ), let P = rank Σ, suppose that P > 1, and take A₁, A₂, ..., A_K to be M × M symmetric matrices of constants (such that ΣA₁Σ, ΣA₂Σ, ..., ΣA_KΣ are nonnull). Further, take Γ to be a matrix of dimensions P × M such that Σ = Γ′Γ (the existence of which follows from Corollary 2.13.23), take z to be a P-dimensional random column vector that has an N(0, I_P) distribution, and define A = A₁ + A₂ + ⋯ + A_K. And observe that x ~ Γ′z, that (for i = 1, 2, ..., K) (Γ′z)′Aᵢ Γ′z = z′ΓAᵢΓ′z, and (in light of the discourse in Section 6.6d) that (Γ′z)′Σ⁻ Γ′z = z′z. Observe also that ΓA₁Γ′, ΓA₂Γ′, ..., ΓA_KΓ′ are nonnull (as can be readily verified by making use of Corollary 2.3.4).
Now, suppose that ΓAΓ′ is idempotent and that

rank(ΓA₁Γ′) + rank(ΓA₂Γ′) + ⋯ + rank(ΓA_KΓ′) = rank(ΓAΓ′) < P. (8.44)

Then, upon applying Theorem 6.8.15 (with ΓA₁Γ′, ΓA₂Γ′, ..., ΓA_KΓ′ in place of A₁, A₂, ..., A_K), we find that z′ΓA₁Γ′z/z′z, z′ΓA₂Γ′z/z′z, ..., z′ΓA_KΓ′z/z′z have a Di(R₁/2, R₂/2, ..., R_K/2, [P − (R₁ + ⋯ + R_K)]/2; K) distribution, where (for i = 1, 2, ..., K) Rᵢ = rank(ΓAᵢΓ′) = tr(ΓAᵢΓ′). And, conversely, if z′ΓA₁Γ′z/z′z, z′ΓA₂Γ′z/z′z, ..., z′ΓA_KΓ′z/z′z have a Di(R₁/2, R₂/2, ..., R_K/2, [P − (R₁ + ⋯ + R_K)]/2; K) distribution (where R₁, R₂, ..., R_K are strictly positive integers such that R₁ + ⋯ + R_K < P), then ΓAΓ′ is idempotent, condition (8.44) is satisfied, and (for i = 1, 2, ..., K) Rᵢ = rank(ΓAᵢΓ′) = tr(ΓAᵢΓ′).
Clearly, x′A₁x/x′Σ⁻x, x′A₂x/x′Σ⁻x, ..., x′A_Kx/x′Σ⁻x have the same distribution as z′ΓA₁Γ′z/z′z, z′ΓA₂Γ′z/z′z, ..., z′ΓA_KΓ′z/z′z. And by employing the same line of reasoning, the counterpart result for x (Theorem 6.8.16) can be obtained; that result was derived under the supposition that the distribution of the random column vector x is N(0, Σ). The validity of these results can be extended to a broader class of distributions by employing an approach analogous to that described in Section 6.6e for extending the results of Theorems 6.6.7 and 6.6.8.
Specifically, the validity of the results of Theorem 6.8.15 extends to the case where the distribution of the M-dimensional random column vector z is an arbitrary absolutely continuous spherical distribution. Similarly, the validity of the results of Theorem 6.8.16 extends to the case where the distribution of the M-dimensional random column vector x is that of the vector Γ′z, where (with P = rank Σ) Γ is a P × M matrix such that Σ = Γ′Γ and where z is a P-dimensional random column vector that has an absolutely continuous spherical distribution (i.e., the case where x is distributed elliptically about 0).
Clearly,

I − tA = O(I − tD)O′ = O diag(1 − td₁, 1 − td₂, ..., 1 − td_N) O′. (8.51)

And in light of result (2.14.25), Lemma 2.14.3, and Corollaries 2.14.19 and 2.14.2, it follows that

|I − tA| = ∏_{i=1}^{N} (1 − tdᵢ). (8.52)

Now, suppose that the N × N symmetric nonnegative definite matrix V is positive definite, and take R to be a nonsingular matrix (of order N) such that V = R′R. Then, upon observing that

V − tA = R′[I − t(R⁻¹)′AR⁻¹]R

{so that (V − tA)⁻¹ = R⁻¹[I − t(R⁻¹)′AR⁻¹]⁻¹(R⁻¹)′} and applying results (8.55) and (8.56), we find that

∂(V − tA)⁻¹/∂t = R⁻¹[I − t(R⁻¹)′AR⁻¹]⁻¹(R⁻¹)′A R⁻¹[I − t(R⁻¹)′AR⁻¹]⁻¹(R⁻¹)′
  = (V − tA)⁻¹A(V − tA)⁻¹ (8.58)

and that

∂²(V − tA)⁻¹/∂t² = 2(V − tA)⁻¹A(V − tA)⁻¹A(V − tA)⁻¹. (8.59)

Formulas (8.58) and (8.59) are valid for any t for which V − tA is nonsingular.
Some results on determinants. We are now in a position to establish the following result.
Theorem 6.8.17. Let A and B represent N × N symmetric matrices, and let c and d represent (strictly) positive scalars. Then, a necessary and sufficient condition for

|I − tA − uB| = |I − tA| |I − uB| (8.60)

for all (scalars) t and u satisfying |t| < c and |u| < d (or, equivalently, for all t and u) is that AB = 0.
Proof. The sufficiency of the condition AB = 0 is evident upon observing that (for all t and u)

|I − tA| |I − uB| = |(I − tA)(I − uB)| = |I − tA − uB + tuAB|.

Now, for purposes of establishing the necessity of this condition, suppose that equality (8.60) holds for all t and u satisfying |t| < c and |u| < d. And let c* (≤ c) represent a (strictly) positive scalar such that I − tA is positive definite whenever |t| < c*—the existence of such a scalar is guaranteed by Lemma 6.5.1—and (for t satisfying |t| < c*) let

H(t) = (I − tA)⁻¹.

If |t| < c*, then |I − uB| = |I − uBH(t)| and |I − (−u)B| = |I − (−u)BH(t)|, implying that

|I − u²B²| = |I − u²[BH(t)]²|. (8.61)

Since each side of equality (8.61) is (for fixed t) a polynomial in u², we have that

|I − rB²| = |I − r[BH(t)]²|

for every scalar r (and for t such that |t| < c*)—refer to Theorem 6.7.8.
Upon observing that [BH(t)]² = [BH(t)B]H(t), that B² and BH(t)B are symmetric, and that H(t) is symmetric and nonnegative definite, and upon applying results (8.54) and (8.57), we find that, for every scalar t such that |t| < c*,

tr(B²) = −∂|I − rB²|/∂r |_{r=0} = −∂|I − r[BH(t)]²|/∂r |_{r=0} = tr{[BH(t)]²}.

Thus,

∂² tr{[BH(t)]²}/∂t² = ∂² tr(B²)/∂t² = 0 (8.62)

(for t such that |t| < c*). Moreover,

∂ tr{[BH(t)]²}/∂t = tr( ∂[BH(t)]²/∂t ) = tr( B [∂H(t)/∂t] BH(t) + BH(t)B [∂H(t)/∂t] )
  = 2 tr( BH(t)B [∂H(t)/∂t] ),
and hence

∂² tr{[BH(t)]²}/∂t² = 2 tr( B [∂H(t)/∂t] B [∂H(t)/∂t] + BH(t)B [∂²H(t)/∂t²] )
  = 2 tr[ BH(t)AH(t)BH(t)AH(t) + 2BH(t)BH(t)AH(t)AH(t) ]. (8.63)

And upon setting t = 0 in results (8.62) and (8.63) [and observing that H(0) = I], we find that

0 = tr[(BA)²] + 2 tr(B²A²)
  = tr[(BA)² + B²A²] + tr(B²A²)
  = tr[B(AB + BA)A] + tr(B²A²)
  = (1/2) tr[B(AB + BA)A] + (1/2) tr{[B(AB + BA)A]′} + tr(BA²B)
  = (1/2) tr[(AB + BA)AB] + (1/2) tr[A(AB + BA)′B] + tr(BA²B)
  = (1/2) tr[(AB + BA)′AB] + (1/2) tr[(AB + BA)′BA] + tr(BA²B)
  = (1/2) tr[(AB + BA)′(AB + BA)] + tr[(AB)′AB]. (8.64)

Both terms of expression (8.64) are nonnegative and hence equal to 0. Moreover, tr[(AB)′AB] = 0 implies that AB = 0—refer to Lemma 2.3.2. Q.E.D.
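A numerical spot-check of the sufficiency half of Theorem 6.8.17 is easy to carry out. In the following sketch (illustrative only; numpy assumed), A and B are constructed with orthogonal column spaces, so that AB = 0, and equality (8.60) is confirmed at a few values of t and u.

```python
# Sketch: when AB = 0, |I - tA - uB| factors as |I - tA| |I - uB|.
import numpy as np

rng = np.random.default_rng(2)
N = 5
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
# Symmetric A and B built on orthogonal column spaces, forcing AB = 0.
A = Q[:, :2] @ np.diag([1.3, -0.7]) @ Q[:, :2].T
B = Q[:, 2:4] @ np.diag([0.5, 2.0]) @ Q[:, 2:4].T
assert np.allclose(A @ B, 0)

I = np.eye(N)
for t, u in [(0.3, -0.8), (-1.1, 0.4), (2.0, 2.0)]:
    lhs = np.linalg.det(I - t * A - u * B)
    rhs = np.linalg.det(I - t * A) * np.linalg.det(I - u * B)
    print(np.isclose(lhs, rhs))     # True in every case
```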
In light of Lemma 6.5.2, we have the following variation on Theorem 6.8.17.
Corollary 6.8.18. Let A and B represent N × N symmetric matrices. Then, there exist (strictly) positive scalars c and d such that I − tA, I − uB, and I − tA − uB are positive definite for all (scalars) t and u satisfying |t| < c and |u| < d. And a necessary and sufficient condition for

log [ |I − tA − uB| / (|I − tA| |I − uB|) ] = 0

for all t and u satisfying |t| < c and |u| < d is that AB = 0.
The following theorem can be regarded as a generalization of Corollary 6.8.18.
Theorem 6.8.19. Let A and B represent N × N symmetric matrices. Then, there exist (strictly) positive scalars c and d such that I − tA, I − uB, and I − tA − uB are positive definite for all (scalars) t and u satisfying |t| < c and |u| < d. And, letting h(t, u) represent a polynomial (in t and u), necessary and sufficient conditions for

log [ |I − tA − uB| / (|I − tA| |I − uB|) ] = h(t, u) / [ |I − tA| |I − uB| |I − tA − uB| ] (8.65)

for all t and u satisfying |t| < c and |u| < d are that AB = 0 and that, for all t and u satisfying |t| < c and |u| < d (or, equivalently, for all t and u), h(t, u) = 0.
Proof (of Theorem 6.8.19). The sufficiency of these conditions is an immediate consequence of Corollary 6.8.18 [as is the existence of positive scalars c and d such that I − tA, I − uB, and I − tA − uB are positive definite for all t and u satisfying |t| < c and |u| < d].
For purposes of establishing their necessity, take u to be an arbitrary scalar satisfying |u| < d, and observe (in light of Corollaries 2.13.12 and 2.13.29) that there exists an N × N nonsingular matrix S such that (I − uB)⁻¹ = S′S. Observe also (in light of Theorem 6.7.4) that there exist N × N matrices P and Q such that A = P diag(d₁, d₂, ..., d_N)P′ and SAS′ = Q diag(f₁, f₂, ..., f_N)Q′ for some scalars d₁, d₂, ..., d_N and f₁, f₂, ..., f_N—d₁, d₂, ..., d_N are the not-necessarily-distinct eigenvalues of A and f₁, f₂, ..., f_N the not-necessarily-distinct eigenvalues of SAS′. Moreover, letting R = rank A, it follows from Theorem 6.7.5 that exactly R of the scalars d₁, d₂, ..., d_N and exactly R of the scalars f₁, f₂, ..., f_N are nonzero; assume (without any essential loss of generality) that it is the first R of the scalars d₁, d₂, ..., d_N and the first R of the scalars f₁, f₂, ..., f_N that are nonzero. Then, in light of results (2.14.25) and (2.14.10) and Corollary 2.14.19, we find that

|I − tA| = |P[I − diag(td₁, ..., td_{R−1}, td_R, 0, 0, ..., 0)]P′|
  = |diag(1 − td₁, ..., 1 − td_{R−1}, 1 − td_R, 1, 1, ..., 1)|
  = ∏_{i=1}^{R} (1 − tdᵢ) = ∏_{i=1}^{R} (−dᵢ)(t − dᵢ⁻¹)

and that

|I − tA − uB| = |S⁻¹(I − tSAS′)(S′)⁻¹|
  = |S|⁻² |I − tSAS′|
  = |S|⁻² |Q[I − diag(tf₁, ..., tf_{R−1}, tf_R, 0, 0, ..., 0)]Q′|
  = |S|⁻² |diag(1 − tf₁, ..., 1 − tf_{R−1}, 1 − tf_R, 1, 1, ..., 1)|
  = |S|⁻² ∏_{i=1}^{R} (1 − tfᵢ) = |S|⁻² ∏_{i=1}^{R} (−fᵢ)(t − fᵢ⁻¹),

so that (for fixed u) |I − tA − uB| and |I − tA| |I − uB| are polynomials in t. And

|I − tA| |I − uB| |I − tA − uB| = |I − uB| |S|⁻² ∏_{i=1}^{R} dᵢ fᵢ (t − dᵢ⁻¹)(t − fᵢ⁻¹), (8.66)

which (for fixed u) is a polynomial in t of degree 2R with roots d₁⁻¹, d₂⁻¹, ..., d_R⁻¹, f₁⁻¹, f₂⁻¹, ..., f_R⁻¹.
Now, regarding u as fixed, suppose that equality (8.65) holds for all t satisfying |t| < c. Then, in light of equality (8.66), it follows from Theorem 6.7.13 that there exists a real number α(u) such that, for all t,

h(t, u) = α(u) |I − tA| |I − uB| |I − tA − uB|

and

|I − tA − uB| = e^{α(u)} |I − tA| |I − uB|. (8.67)

[In applying Theorem 6.7.13, take x = t, s₁(t) = |I − tA − uB|, s₂(t) = |I − tA| |I − uB|, r₁(t) = h(t, u), and r₂(t) = |I − tA| |I − uB| |I − tA − uB|.] Moreover, upon setting t = 0 in equality (8.67), we find that

|I − uB| = e^{α(u)} |I − uB|,

implying that e^{α(u)} = 1 or, equivalently, that α(u) = 0. Thus, for all t,

h(t, u) = 0 and |I − tA − uB| = |I − tA| |I − uB|.

We conclude that if equality (8.65) holds for all t and u satisfying |t| < c and |u| < d, then h(t, u) = 0 and |I − tA − uB| = |I − tA| |I − uB| for all t and u satisfying |u| < d, implying (in light of Theorem 6.7.8) that h(t, u) = 0 for all t and u and (in light of Theorem 6.8.17) that AB = 0. Q.E.D.
The cofactors of a (square) matrix. Let A = {aᵢⱼ} represent an N × N matrix. And (for i, j = 1, 2, ..., N) let Aᵢⱼ represent the (N−1) × (N−1) submatrix of A obtained by striking out the row and column that contain the element aᵢⱼ, that is, by striking out the ith row and the jth column. The determinant |Aᵢⱼ| of this submatrix is called the minor of the element aᵢⱼ; the "signed" minor (−1)^{i+j}|Aᵢⱼ| is called the cofactor of aᵢⱼ.
The determinant of an N × N matrix A can be expanded in terms of the cofactors of the N elements of any particular row or column of A, as described in the following theorem.
Theorem 6.8.20. Let A = {aᵢⱼ} represent an N × N matrix, and (for i, j = 1, 2, ..., N) let αᵢⱼ represent the cofactor of aᵢⱼ. Then, for any i = 1, 2, ..., N, |A| = ∑_{j=1}^{N} aᵢⱼαᵢⱼ; and, for any j = 1, 2, ..., N, |A| = ∑_{i=1}^{N} aᵢⱼαᵢⱼ.
For a proof of Theorem 6.8.20, refer, for example, to Harville (1997, sec. 13.5). The following theorem adds to the results of Theorem 6.8.20.
Theorem 6.8.21. Let A represent an N × N matrix. And (for i, j = 1, 2, ..., N) let aᵢⱼ represent the ijth element of A and let αᵢⱼ represent the cofactor of aᵢⱼ. Then, for i′ ≠ i = 1, ..., N,

∑_{j=1}^{N} aᵢⱼαᵢ′ⱼ = aᵢ₁αᵢ′₁ + aᵢ₂αᵢ′₂ + ⋯ + a_{iN}α_{i′N} = 0, (8.70)

and

∑_{j=1}^{N} aⱼᵢαⱼᵢ′ = a₁ᵢα₁ᵢ′ + a₂ᵢα₂ᵢ′ + ⋯ + a_{Ni}α_{Ni′} = 0. (8.71)

Proof (of Theorem 6.8.21). Consider result (8.70). Let B represent a matrix whose i′th row equals the ith row of A and whose first, second, ..., (i′−1)th, (i′+1)th, ..., (N−1)th, Nth rows are identical to those of A (where i′ ≠ i). Observe that the i′th row of B is a duplicate of its ith row and hence (in light of Lemma 2.14.10) that |B| = 0.
Let b_{kj} represent the kjth element of B (k, j = 1, 2, ..., N). Clearly, the cofactor of b_{i′j} is the same as that of a_{i′j} (j = 1, 2, ..., N). Thus, making use of Theorem 6.8.20, we find that

∑_{j=1}^{N} aᵢⱼαᵢ′ⱼ = ∑_{j=1}^{N} b_{i′j}αᵢ′ⱼ = |B| = 0,

which establishes result (8.70). Result (8.71) can be proved via an analogous argument. Q.E.D.
For any N × N matrix A = {aᵢⱼ}, the N × N matrix whose ijth element is the cofactor αᵢⱼ of aᵢⱼ is called the matrix of cofactors (or cofactor matrix) of A. The transpose of this matrix is called the adjoint or adjoint matrix of A and is denoted by the symbol adj A or adj(A).
There is a close relationship between the adjoint of a nonsingular matrix A and the inverse of A, as is evident from the following theorem and as is made explicit in the corollary of this theorem.
Theorem 6.8.22. For any N × N matrix A,

A adj(A) = (adj A)A = |A| I_N.

Proof. Let aᵢⱼ represent the ijth element of A and let αᵢⱼ represent the cofactor of aᵢⱼ (i, j = 1, 2, ..., N). Then, the ii′th element of the matrix product A adj(A) is ∑_{j=1}^{N} aᵢⱼαᵢ′ⱼ (i, i′ = 1, 2, ..., N); and, in light of Theorems 6.8.20 and 6.8.21, this sum equals |A| when i′ = i and equals 0 when i′ ≠ i. Thus, A adj(A) = |A|I. That (adj A)A = |A|I can be established via a similar argument. Q.E.D.
Corollary 6.8.23. If A is an N × N nonsingular matrix, then

adj A = |A| A⁻¹ (8.72)

or, equivalently,

A⁻¹ = (1/|A|) adj(A). (8.73)
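The definitions of the cofactor and the adjoint, together with results (8.72) and (8.73), translate directly into code. The following Python sketch (purely illustrative; computing inverses via adjoints is not numerically advisable, and the example matrix is made up) carries out that translation.

```python
# Sketch: cofactor matrix, adjoint, and the identities of Theorem 6.8.22
# and Corollary 6.8.23, for a small nonsingular example matrix.
import numpy as np

def cofactor_matrix(A):
    """Matrix whose (i, j)th element is the cofactor (-1)^(i+j) |A_ij|."""
    N = A.shape[0]
    C = np.empty_like(A, dtype=float)
    for i in range(N):
        for j in range(N):
            sub = np.delete(np.delete(A, i, axis=0), j, axis=1)  # strike row i, col j
            C[i, j] = (-1) ** (i + j) * np.linalg.det(sub)
    return C

A = np.array([[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
adjA = cofactor_matrix(A).T                    # adj A = transpose of cofactor matrix

# Theorem 6.8.22: A adj(A) = (adj A) A = |A| I
print(np.allclose(A @ adjA, np.linalg.det(A) * np.eye(3)))       # True
# Corollary 6.8.23: A^{-1} = (1/|A|) adj(A)
print(np.allclose(np.linalg.inv(A), adjA / np.linalg.det(A)))    # True
```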
Observe also (in light of the statistical independence of q₁, q₂, ..., q_K) that, for scalars t and u such that (t, u)′ ∈ N_{ij},

m_{ij}(t, u) = m_{ij}(t, 0) m_{ij}(0, u). (8.75)

Upon squaring both sides of equality (8.75) and making use of formula (8.74), we find that, for t and u such that (t, u)′ ∈ N_{ij},

log [ |I − 2tAᵢ − 2uAⱼ| / (|I − 2tAᵢ| |I − 2uAⱼ|) ] = r_{ij}(t, u), (8.76)

where

r_{ij}(t, u) = (tdᵢ + udⱼ)′(I − 2tAᵢ − 2uAⱼ)⁻¹(tdᵢ + udⱼ)
  − t²dᵢ′(I − 2tAᵢ)⁻¹dᵢ − u²dⱼ′(I − 2uAⱼ)⁻¹dⱼ. (8.77)

And upon observing that the partial derivatives of r_{ij}(t, u), evaluated at t = u = 0, equal 0, it follows from result (8.79) that dᵢ′dⱼ = 0 and from result (8.80) that Aⱼdᵢ = 0 and Aᵢdⱼ = 0. Thus,

Aᵢbⱼ = Aᵢ(dⱼ − 2Aⱼμ) = Aᵢdⱼ − 2AᵢAⱼμ = 0

and

bᵢ′bⱼ = (dᵢ − 2Aᵢμ)′(dⱼ − 2Aⱼμ) = dᵢ′dⱼ − 2μ′Aᵢdⱼ − 2(Aⱼdᵢ)′μ + 4μ′AᵢAⱼμ = 0.
Exercises
Exercise 1. Let x represent a random variable whose distribution is Ga(α, β), and let c represent a (strictly) positive constant. Show that cx ~ Ga(α, cβ) [thereby verifying result (1.2)].
Exercise 2. Let w represent a random variable whose distribution is Ga(α, β), where α is a (strictly positive) integer. Show that (for any strictly positive scalar t)

Pr(w ≤ t) = Pr(u ≥ α),

where u is a random variable whose distribution is Poisson with parameter t/β [so that Pr(u = s) = e^{−t/β}(t/β)^s/s! for s = 0, 1, 2, ...].
Exercise 3. Let u and w represent random variables that are distributed independently as Be(α, δ) and Be(α + δ, γ), respectively. Show that uw ~ Be(α, γ + δ).
Exercise 4. Let x represent a random variable whose distribution is Be(α, γ).
(a) Show that, for r > −α,

E(x^r) = Γ(α + r)Γ(α + γ) / [Γ(α)Γ(α + γ + r)].

(b) Show that

E(x) = α/(α + γ) and var(x) = αγ / [(α + γ)²(α + γ + 1)].
Exercise 5. Let x₁, x₂, ..., x_K represent random variables whose joint distribution is Di(α₁, α₂, ..., α_K; α_{K+1}; K), let x_{K+1} = 1 − (x₁ + ⋯ + x_K), and define α = α₁ + α₂ + ⋯ + α_{K+1}.
(a) Generalize the results of Part (a) of Exercise 4 by showing that, for r₁ > −α₁, ..., r_K > −α_K, and r_{K+1} > −α_{K+1},

E( x₁^{r₁} ⋯ x_K^{r_K} x_{K+1}^{r_{K+1}} ) = [ Γ(α) / Γ(α + ∑_{k=1}^{K+1} r_k) ] ∏_{k=1}^{K+1} [ Γ(α_k + r_k) / Γ(α_k) ].

(b) Generalize the results of Part (b) of Exercise 4 by showing that (for an arbitrary integer k between 1 and K + 1, inclusive)

E(x_k) = α_k/α and var(x_k) = α_k(α − α_k) / [α²(α + 1)]

and that (for any 2 distinct integers j and k between 1 and K + 1, inclusive)

cov(x_j, x_k) = −α_j α_k / [α²(α + 1)].
Exercise 6. Verify that the function b(·) defined by expression (1.48) is a pdf of the chi distribution with N degrees of freedom.
Exercise 7. For strictly positive integers J and K, let s₁, ..., s_J, s_{J+1}, ..., s_{J+K} represent J + K random variables whose joint distribution is Di(α₁, ..., α_J, α_{J+1}, ..., α_{J+K}; α_{J+K+1}; J + K). Further, for k = 1, 2, ..., K, let x_k = s_{J+k} / (1 − ∑_{j=1}^{J} s_j). Show that the conditional distribution ...
Exercise 8. ... a nonnegative function of a single nonnegative variable]. Verify that the function b(·) defined by expression (1.53) is a pdf of the distribution of the random variable (∑_{i=1}^{M} zᵢ²)^{1/2}. (Note. This exercise can be regarded as a more general version of Exercise 6.)
Exercise 9. Use the procedure described in Section 6.2a to construct a 6 × 6 orthogonal matrix whose first row is proportional to the vector (0, 3, 4, 2, 0, 1).
Exercise 10. Let x₁ and x₂ represent M-dimensional column vectors.
(a) Use the results of Section 6.2a (pertaining to Helmert matrices) to show that if x₂′x₂ = x₁′x₁, then there exist orthogonal matrices O₁ and O₂ such that O₂x₂ = O₁x₁.
(b) Use the result of Part (a) to devise an alternative proof of the “only if” part of Lemma 5.9.9.
Exercise 11. Let w represent a random variable whose distribution is χ²(N, λ). Verify that the expressions for E(w) and E(w²) provided by formula (2.36) are in agreement with those provided by results (2.33) and (2.34) [or, equivalently, by results (2.28) and (2.30)].
Exercise 12. Let w₁ and w₂ represent random variables that are distributed independently as Ga(α₁, β, δ₁) and Ga(α₂, β, δ₂), respectively, and define w = w₁ + w₂. Derive the pdf of the distribution of w by starting with the pdf of the joint distribution of w₁ and w₂ and introducing a suitable change of variables. [Note. This derivation serves the purpose of verifying that w ~ Ga(α₁ + α₂, β, δ₁ + δ₂) and (when coupled with a mathematical-induction argument) represents an alternative way of establishing Theorem 6.2.2 (and Theorem 6.2.1).]
Exercise 13. Let x = μ + z, where z is an N-dimensional random column vector that has an absolutely continuous spherical distribution and where μ is an N-dimensional nonrandom column vector. Verify that in the special case where z ~ N(0, I), the pdf q(·) derived in Section 6.2h for the distribution of x′x "simplifies to" (i.e., is reexpressible in the form of) the expression (2.15) given in Section 6.2c for the pdf of the noncentral chi-square distribution [with N degrees of freedom and with noncentrality parameter λ (= μ′μ)].
Exercise 14. Let u and v represent random variables that are distributed independently as χ²(M) and χ²(N), respectively. And define w = (u/M)/(v/N). Devise an alternative derivation of the pdf of the SF(M, N) distribution by (1) deriving the pdf of the joint distribution of w and v and by (2) determining the pdf of the marginal distribution of w from the pdf of the joint distribution of w and v.
Exercise 15. Let t = z/√(v/N), where z and v are random variables that are statistically independent
Exercise 16. Let t = (x₁ + x₂)/|x₁ − x₂|, where x₁ and x₂ are random variables that are distributed independently and identically as N(μ, σ²) (with μ > 0). Show that t has a noncentral t distribution,
and determine the values of the parameters (the degrees of freedom and the noncentrality parameter)
of this distribution.
Exercise 17. Let t represent a random variable that has an St(N, λ) distribution. And take r to be an arbitrary one of the integers 1, 2, ... (< N). Generalize expressions (4.38) and (4.39) [for E(t) and E(t²), respectively] by obtaining an expression for E(t^r) (in terms of λ). (Note. This exercise is
closely related to Exercise 3.12.)
Exercise 18. Let t represent an M-dimensional random column vector that has an MVt(N, I_M) distribution. And let w = t′t. Derive the pdf of the distribution of w in each of the following two
ways: (1) as a special case of the pdf (1.51) and (2) by making use of the relationship (4.50).
Exercise 19. Let x represent an M-dimensional random column vector whose distribution has as a pdf a function f(·) that is expressible in the following form: for all x,

f(x) = ∫₀^∞ h(x | u) g(u) du,

where g(·) is the pdf of the distribution of a strictly positive random variable u and where (for every u) h(· | u) is the pdf of the N(0, u⁻¹I_M) distribution.
(a) Show that the distribution of x is spherical.
(b) Show that the distribution of u can be chosen in such a way that f(·) is the pdf of the MVt(N, I_M) distribution.
Exercise 20. Show that if condition (6.7) of Theorem 6.6.2 is replaced by the condition

Σ(b + 2Aμ) ∈ C(ΣAΣ),

the theorem is still valid.
Exercise 21. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution (where Σ ≠ 0), and take G to be a symmetric generalized inverse of Σ. Show that

x′Gx ~ χ²(rank Σ, μ′Gμ)

if μ ∈ C(Σ) or GΣG = G. [Note. A symmetric generalized inverse G is obtainable from a possibly nonsymmetric generalized inverse, say H, by taking G = (1/2)H + (1/2)H′; the condition GΣG = G is the second of the so-called Moore–Penrose conditions—refer, e.g., to Harville (1997, chap. 20) for a discussion of the Moore–Penrose conditions.]
Exercise 22. Let z represent an N-dimensional random column vector. And suppose that the distribution of z is an absolutely continuous spherical distribution, so that the distribution of z has as a pdf a function f(·) such that (for all z) f(z) = g(z′z), where g(·) is a (nonnegative) function of a single nonnegative variable. Further, take z₁ to be an M-dimensional subvector of z (where M < N), and let v = z₁′z₁.
(a) Show that the distribution of v has as a pdf the function h(·) defined as follows: for v > 0,

h(v) = { π^{N/2} / [ Γ(M/2) Γ((N−M)/2) ] } v^{(M/2)−1} ∫₀^∞ w^{[(N−M)/2]−1} g(v + w) dw;

for v ≤ 0, h(v) = 0.
(b) Verify that in the special case where z ~ N(0, I_N), h(·) simplifies to the pdf of the χ²(M) distribution.
Exercise 23. Let z = (z₁, z₂, ..., z_M)′ represent an M-dimensional random (column) vector that has a spherical distribution. And take A to be an M × M symmetric idempotent matrix of rank R (where R ≥ 1).
(a) Starting from first principles (i.e., from the definition of a spherical distribution), use the results of Theorems 5.9.5 and 6.6.6 to show that (1) z′Az ~ z₁² + ⋯ + z_R² and [assuming that Pr(z ≠ 0) = 1] that (2) z′Az/z′z ~ (z₁² + ⋯ + z_R²)/(z₁² + ⋯ + z_M²).
(b) Provide an alternative "derivation" of results (1) and (2) of Part (a); do so by showing that (when z has an absolutely continuous spherical distribution) these two results can be obtained by applying Theorem 6.6.7 (and by making use of the results of Sections 6.1f and 6.1g).
Exercise 31. Let x represent an M-dimensional random column vector that has an N(0, Σ) distribution. And take A₁ and A₂ to be M × M symmetric nonnegative definite matrices of constants. Show that the two quadratic forms x′A₁x and x′A₂x are statistically independent if and only if they are uncorrelated.
Exercise 32. Let x represent an M-dimensional random column vector that has an N(μ, I_M) distribution. Show (by producing an example) that there exist quadratic forms x′A₁x and x′A₂x (where A₁ and A₂ are M × M symmetric matrices of constants) that are uncorrelated for every μ ∈ R^M but that are not statistically independent for any μ ∈ R^M.
§7a. The polynomial p(λ) and the equation p(λ) = 0 referred to herein as the characteristic polynomial and characteristic equation differ by a factor of (−1)^N from what some authors refer to as the characteristic polynomial and equation; those authors refer to the polynomial q(λ) obtained by taking (for all λ) q(λ) = |λI_N − A| as the characteristic polynomial and/or to the equation q(λ) = 0 as the characteristic equation.
Theorem 6.7.1 is essentially the same as Theorem 21.5.6 of Harville (1997), and the approach taken in devising
a proof is a variation on the approach taken by Harville.
§7c. Theorem 6.7.13 can be regarded as a variant of a “lemma” on polynomials in 2 variables that was stated
(without proof) by Laha (1956) and that has come to be identified (at least among statisticians) with Laha’s
name—essentially the same lemma appears (along with a proof) in Ogawa’s (1950) paper. Various approaches
to the proof of Laha’s lemma are discussed by Driscoll and Gundberg (1986, sec. 3)—refer also to Driscoll and
Krasnicka (1995, sec. 4).
§7d. The proof (of the “necessity part” of Theorem 6.6.1) presented in Section 6.7d makes use of Theorem
6.7.13 (on polynomials). Driscoll (1999) and Khatri (1999) introduced (in the context of Corollary 6.6.4) an
alternative proof of necessity—refer also to Ravishanker and Dey (2002, sec. 5.4) and to Khuri (2010, sec. 5.2).
§8a (and §8e). The result presented herein as Corollary 6.8.2 includes as a special case a result that has
come to be widely known as Craig’s theorem (in recognition of the contributions of A. T. Craig) and that (to
acknowledge the relevance of the work of H. Sakamoto and/or K. Matusita) is also sometimes referred to as
the Craig–Sakamoto or Craig–Sakamoto–Matusita theorem. Craig’s theorem has a long and tortuous history
that includes numerous attempts at proofs of necessity, many of which have been judged to be incomplete or
otherwise flawed or deficient. Accounts of this history are provided by, for example, Driscoll and Gundberg
(1986), Reid and Driscoll (1988), and Driscoll and Krasnicka (1995) and, more recently, Ogawa and Olkin
(2008).
§8c. For more on the kind of variations on Cochran’s theorem that are the subject of this subsection, refer,
e.g., to Anderson and Fang (1987).
§8d. The proof of Theorem 6.8.17 is based on a proof presented by Rao and Mitra (1971, pp. 170–171).
For some discussion (of a historical nature and also of a more general nature) pertaining to alternative proofs
of the result of Theorem 6.8.17, refer to Ogawa and Olkin (2008).
§8e. The proof (of the “necessity part” of Theorem 6.8.3) presented in Section 6.8e is based on Theorem
6.8.19, which was proved (in Section 6.8d) by making use of Theorem 6.7.13 (on polynomials). Reid and
Driscoll (1988) and Driscoll and Krasnicka (1995) introduced an alternative proof of the necessity of the
conditions under which two or more quadratic forms (in a normally distributed random vector) are distributed
independently—refer also to Khuri (2010, sec. 5.3).
Exercise 7. The result of Exercise 7 is essentially the same as Theorem 1.6 of Fang, Kotz, and Ng (1990).
Exercise 26. Conditions (1) and (3) of Exercise 26 correspond to conditions given by Shanbhag (1968).
7
Confidence Intervals (or Sets) and Tests of Hypotheses
Suppose that y is an N × 1 observable random vector that follows a G–M model. And suppose that we wish to make inferences about a parametric function of the form λ′β (where λ is a P × 1 vector of constants) or, more generally, about a vector of such parametric functions. Or suppose that we wish to make inferences about the realization of an unobservable random variable whose expected value is of the form λ′β or about the realization of a vector of such random variables. Inferences that
take the form of point estimation or prediction were considered in Chapter 5. The present chapter is
devoted to inferences that take the form of an interval or set of values. More specifically, it is devoted
to confidence intervals and sets (and to the corresponding tests of hypotheses).
δ(u) = β₁ + β₂u₁ + β₃u₂ + β₄u₃, (1.1)

δ(u) = β₁ + β₂u₁ + β₃u₂ + β₄u₃
  + β₁₁u₁² + β₁₂u₁u₂ + β₁₃u₁u₃ + β₂₂u₂² + β₂₃u₂u₃ + β₃₃u₃², (1.2)

or

δ(u) = β₁ + β₂u₁ + β₃u₂ + β₄u₃
  + β₁₁u₁² + β₁₂u₁u₂ + β₁₃u₁u₃ + β₂₂u₂² + β₂₃u₂u₃ + β₃₃u₃²
  + β₁₁₁u₁³ + β₁₁₂u₁²u₂ + β₁₁₃u₁²u₃ + β₁₂₂u₁u₂² + β₁₂₃u₁u₂u₃
  + β₁₃₃u₁u₃² + β₂₂₂u₂³ + β₂₂₃u₂²u₃ + β₂₃₃u₂u₃² + β₃₃₃u₃³, (1.3)
in which case

δ(u₀) = β₁ + a′u₀ + (Au₀)′u₀ = β₁ + (1/2)a′u₀. (1.7)

The linear system (1.6) is consistent (i.e., has a solution) if the matrix A is nonsingular, in which case the linear system has a unique solution u₀ that is expressible as

u₀ = −(1/2)A⁻¹a. (1.8)

More generally, linear system (1.6) is consistent if (and only if) a ∈ C(A).
In light of result (5.4.11), the Hessian matrix of δ(u) is expressible as follows:

∂²δ(u)/∂u∂u′ = 2A. (1.9)

If the matrix −A is nonnegative definite (i.e., if A is nonpositive definite), then the stationary points of δ(u) are points at which δ(u) attains a maximum value—refer, e.g., to Harville (1997, sec. 19.1).
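As a purely illustrative aside (the numbers below are made up and are not the lettuce-yield estimates), the computations surrounding results (1.7)–(1.9) can be sketched in a few lines of Python (numpy assumed):

```python
import numpy as np

# delta(u) = beta1 + a'u + u'Au, with A symmetric; values are hypothetical.
beta1 = 20.0
a = np.array([1.0, -0.5, 0.8])
A = np.array([[-1.0,  0.2,  0.0],
              [ 0.2, -0.8,  0.1],
              [ 0.0,  0.1, -1.2]])

# Unique stationary point, eq. (1.8); available here because A is nonsingular.
u0 = -0.5 * np.linalg.solve(A, a)

# Value at the stationary point two ways: directly and via eq. (1.7).
print(beta1 + a @ u0 + u0 @ A @ u0)
print(beta1 + 0.5 * a @ u0)            # agrees with the line above

# The Hessian is 2A, eq. (1.9); here 2A is negative definite, so u0 is a maximum.
assert np.all(np.linalg.eigvalsh(2 * A) < 0)
```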
Let us consider the use of the data on lettuce yields in making inferences (on the basis of the second-order model) about the response surface and various of its characteristics. Specifically, let us consider the use of these data in making inferences about the value of the second-order polynomial δ(u) for each value of u in the relevant region. And let us consider the use of these data in making inferences about the parameters of the second-order model [and hence about the values of the first-order derivatives (at u = 0) and second-order derivatives of the second-order polynomial δ(u)] and in making inferences about the location of the stationary points of δ(u).
The (20 × 10) model matrix X = (X₁, X₂) is of full column rank 10—the model matrix is sufficiently simple that its rank can be determined without resort to numerical means. Thus, all ten of the parameters that form the elements of β are estimable (and every linear combination of these parameters is estimable).
Let β̂ = (β̂₁, β̂₂, β̂₃, β̂₄, β̂₁₁, β̂₁₂, β̂₁₃, β̂₂₂, β̂₂₃, β̂₃₃)′ represent the least squares estimator of β, that is, the 10 × 1 vector whose elements are the least squares estimators of the corresponding elements of β; let σ̂² represent the usual unbiased estimator of σ²; and denote by y the data vector (the elements of which are listed in column 1 of Table 4.3). The least squares estimate of β (i.e., the value of β̂ computed from the data vector) is obtainable as the (unique) solution, say b̃, to the linear system X′Xb = X′y (in the vector b) comprising the so-called normal equations. The residual sum of squares equals 108.9407; dividing this quantity by N − P = 10 (to obtain the value of σ̂²) gives 10.89 as an estimate of σ². Upon taking the square root of this estimate of σ², we obtain 3.30 as an estimate of σ. The variance–covariance matrix of β̂ is expressible as var(β̂) = σ²(X′X)⁻¹ and is estimated unbiasedly by the matrix σ̂²(X′X)⁻¹. The standard errors of the elements of β̂ are given by the square roots of the diagonal elements of σ²(X′X)⁻¹, and the estimated standard errors (those corresponding to the estimator σ̂²) are given by the square roots of the values of the diagonal elements of σ̂²(X′X)⁻¹.
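The quantities just described—the least squares estimate, σ̂², the estimated standard errors, and the correlation matrix S⁻¹(X′X)⁻¹S⁻¹ discussed below—can be computed along the following lines. This is a generic Python sketch (numpy assumed); the lettuce-yield data themselves are not reproduced here, so simulated data of the same shape stand in for them.

```python
import numpy as np

def ls_summary(X, y):
    # Least squares estimate via the normal equations X'X b = X'y
    # (X is assumed to have full column rank, as in the example above).
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    N, P = X.shape
    sigma2_hat = resid @ resid / (N - P)          # usual unbiased estimator
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))   # estimated standard errors
    d = 1.0 / np.sqrt(np.diag(XtX_inv))
    corr = XtX_inv * np.outer(d, d)               # S^{-1}(X'X)^{-1}S^{-1}
    return beta_hat, sigma2_hat, se, corr

# usage with simulated data of the same shape as the example (N = 20, P = 10):
rng = np.random.default_rng(3)
X = rng.standard_normal((20, 10))
y = X @ rng.standard_normal(10) + rng.standard_normal(20)
beta_hat, sigma2_hat, se, corr = ls_summary(X, y)
```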
The least squares estimates of the elements of β (i.e., of the parameters β₁, β₂, β₃, β₄, β₁₁, β₁₂, β₁₃, β₂₂, β₂₃, and β₃₃) are presented in Table 7.1 along with their standard errors and estimated standard errors. And letting S represent the diagonal matrix of order P = 10 whose first through Pth diagonal elements are respectively the square roots of the first through Pth diagonal elements of the P × P matrix (X′X)⁻¹, the correlation matrix of the vector β̂ of least squares estimators of the elements of β is

S⁻¹(X′X)⁻¹S⁻¹ =

[ 1.00  0.00  0.00  0.09  0.57  0.00  0.00  0.57  0.00  0.52 ]
[ 0.00  1.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00 ]
[ 0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00  0.24  0.00 ]
[ 0.09  0.00  0.00  1.00  0.10  0.00  0.00  0.10  0.00  0.22 ]
[ 0.57  0.00  0.00  0.10  1.00  0.00  0.00  0.13  0.00  0.19 ]
[ 0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00 ]
[ 0.00  0.24  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00 ]
[ 0.57  0.00  0.00  0.10  0.13  0.00  0.00  1.00  0.00  0.19 ]
[ 0.00  0.00  0.24  0.00  0.00  0.00  0.00  0.00  1.00  0.00 ]
[ 0.52  0.00  0.00  0.22  0.19  0.00  0.00  0.19  0.00  1.00 ]
TABLE 7.1. Least squares estimates (with standard errors and estimated standard errors) obtained from the lettuce-yield data for the regression coefficients in a second-order G–M model.
The model matrix X has a relatively simple structure, as might be expected in the case of a designed experiment. That structure is reflected in the correlation matrix of β̂ and in the various other quantities that depend on the model matrix through the matrix X′X. Of course, that structure would have been even simpler had not the level of Fe in the four containers where it was supposed to have been −1 (on the transformed scale) been taken instead to be −0.4965.
Assuming that the distribution of the vector e of residual effects in the G–M model is N(0, σ²I) or, more generally, that the fourth-order moments of its distribution are identical to those of the N(0, σ²I) distribution,

var(σ̂²) = 2σ⁴/[N − rank X] = σ⁴/5 (1.10)

—refer to result (5.7.40). And (under the same assumption)

E(σ̂⁴) = var(σ̂²) + [E(σ̂²)]² = {[N − rank(X) + 2]/[N − rank(X)]} σ⁴,

so that an unbiased estimator of var(σ̂²) is provided by

{2/[N − rank(X) + 2]} σ̂⁴. (1.11)

Corresponding to expression (1.10) for var(σ̂²) is the expression

√{2/[N − rank(X)]} σ² (1.12)

for the standard error of σ̂², and corresponding to the estimator (1.11) of var(σ̂²) is the estimator

√{2/[N − rank(X) + 2]} σ̂² (1.13)

of the standard error of σ̂². The estimated standard error of σ̂² [i.e., the value of the estimator (1.13)] is

√(2/12) × 10.89407 = 4.45.
FIGURE 7.1. Contour plots of the estimated response surface obtained from the lettuce-yield data (on the basis of a second-order model). Each plot serves to relate the yield of lettuce plants to the levels of 2 trace minerals (Cu and Fe) at one of 4 levels of a third trace mineral (Mo). [Four panels: Mo = −1, Mo = −1/3, Mo = 1/3, and Mo = 1; Cu is plotted on the vertical axis, the contours are labeled 3, 8, 13, 18, and 23, and the values 24.91 and 25.32 are marked in the first two panels.]
Let

â = (β̂₂, β̂₃, β̂₄)′ and Â = [ β̂₁₁       (1/2)β̂₁₂  (1/2)β̂₁₃ ]
                              [ (1/2)β̂₁₂  β̂₂₂       (1/2)β̂₂₃ ]
                              [ (1/2)β̂₁₃  (1/2)β̂₂₃  β̂₃₃      ].

For any particular value of the vector u = (u₁, u₂, u₃)′ (the elements of which represent the transformed levels of Cu, Mo, and Fe), the value of δ(u) is a linear combination of the ten regression coefficients β₁, β₂, β₃, β₄, β₁₁, β₁₂, β₁₃, β₂₂, β₂₃, and β₃₃. And upon taking the same linear combination of the least squares estimators of the regression coefficients, we obtain the least squares estimator of the value of δ(u). Accordingly, the least squares estimator of the value of δ(u) is the value of the function δ̂(u) (of u) defined as follows:

δ̂(u) = β̂₁ + â′u + u′Âu. (1.14)

The estimated response surface defined by this function is depicted in Figure 7.1 in the form of four contour plots; each of these plots corresponds to a different one of four levels of Mo.
For any linear combination λ′β of the regression coefficients β₁, β₂, β₃, β₄, β₁₁, β₁₂, β₁₃, β₂₂, β₂₃, and β₃₃,

var(λ′β̂) = σ²λ′(X′X)⁻¹λ. (1.15)

More generally, for any two linear combinations λ′β and ℓ′β,

cov(λ′β̂, ℓ′β̂) = σ²λ′(X′X)⁻¹ℓ. (1.16)

And upon replacing σ² in expression (1.15) or (1.16) with the unbiased estimator σ̂², we obtain an unbiased estimator of var(λ′β̂) or cov(λ′β̂, ℓ′β̂). Further, upon taking the square root of var(λ′β̂) and of its unbiased estimator, we obtain σ[λ′(X′X)⁻¹λ]^{1/2} as an expression for the standard error of λ′β̂ and σ̂[λ′(X′X)⁻¹λ]^{1/2} as the corresponding estimated standard error.
Suppose that the second-order regression coefficients β₁₁, β₁₂, β₁₃, β₂₂, β₂₃, and β₃₃ are such that the matrix A is nonsingular, in which case the function δ(u) has a unique stationary point u₀ = −(1/2)A⁻¹a. Then, it seems natural to regard û₀ = −(1/2)Â⁻¹â as an estimator of u₀ (as suggested by the notation).
A more formal justification for regarding û₀ as an estimator of u₀ is possible. Suppose that the distribution of the vector e of residual effects in the second-order (G–M) model is MVN. Then, it follows from the results of Section 5.9a that the elements of the matrix Â and the vector â are the ML estimators of the corresponding elements of the matrix A and the vector a (and are the ML estimators even when the values of the second-order regression coefficients β₁₁, β₁₂, β₁₃, β₂₂, β₂₃, and β₃₃ are restricted to those values for which A is nonsingular). And upon applying a well-known general result on the ML estimation of parametric functions [which is Theorem 5.1.1 of Zacks (1971)], we conclude that û₀ is the ML estimator of u₀.
In making inferences from the results of the experimental study (of the yield of lettuce plants), there could be interest in predictive inferences as well as in inferences about the function δ(u) (and about various characteristics of this function). Specifically, there could be interest in making inferences about the yield to be obtained in the future from a container of lettuce plants, based on regarding the future yield as a realization of the random variable δ(u) + d, where d is a random variable that has an expected value of 0 and a variance of σ² (and that is uncorrelated with the vector e). Then, with regard to point prediction, the BLUP of this quantity equals the least squares estimator δ̂(u) of δ(u), and the variance of the prediction error (which equals the mean squared error of the BLUP) is σ² + var[δ̂(u)]. And the covariance of the prediction errors of the BLUPs of two future yields, one of which is modeled as the realization of δ(u) + d and the other as the realization of δ(v) + h (where h is a random variable that has an expected value of 0 and a variance of σ² and that is uncorrelated with e and with d), equals cov[δ̂(u), δ̂(v)].
Point estimation (or prediction) can be quite informative, especially when accompanied by stan-
dard errors or other quantities that reflect the magnitude of the underlying variability. However, it is
generally desirable to augment any such inferences with inferences that take the form of intervals or
sets. Following the presentation (in Section 7.2) of some results on “multi-part” G–M models, the
emphasis (beginning in Section 7.3) in the present chapter is on confidence intervals and sets (and
on the closely related topic of tests of hypotheses).
a. General results
Let Z = {z_{ij}} represent a matrix with N rows, and denote by Q the number of columns in Z. And as an alternative to the original G–M model with model matrix X, consider the following G–M model with model matrix (X, Z):

y = (X, Z)(β′, τ′)′ + e, (2.1)

where τ = (τ₁, τ₂, ..., τ_Q)′ is a Q-dimensional (column) vector of additional parameters [and where the parameter space for the augmented parameter vector (β′, τ′)′ is R^{P+Q}]. To distinguish this model from the original G–M model, let us refer to it as the augmented G–M model. Note that the model equation (2.1) for the augmented G–M model is reexpressible in the form

y = Xβ + Zτ + e. (2.2)
Let Λ represent a matrix (with P rows) such that R(Λ′) = R(X), and denote by M the number of columns in Λ (in which case M ≥ rank Λ = rank X). Then, under the original G–M model y = Xβ + e, the M elements of the vector Λ′β are estimable linear combinations of the elements of β. Moreover, these estimable linear combinations include rank(X) linearly independent estimable linear combinations [and no set of linearly independent estimable linear combinations can include more than rank(X) linear combinations].
Under the original G–M model, the least squares estimators of the elements of the vector Λ′β are the elements of the vector

R̃′X′y, (2.3)

where R̃ is any solution to the linear system

X′XR = Λ (2.4)

(in the P × M matrix R)—refer to Section 5.4 or 5.6. Note that the matrix XR̃ and hence its transpose R̃′X′ do not vary with the choice of solution to linear system (2.4) (as is evident from, e.g., Corollary 2.3.4).
Under the augmented G–M model,

E(R̃′X′y) = R̃′X′(Xβ + Zτ) = Λ′β + R̃′X′Zτ. (2.5)

Thus, while R̃′X′y is an unbiased estimator of Λ′β under the original G–M model (in fact, it is the best linear unbiased estimator of Λ′β in the sense described in Section 5.6), it is not (in general) an unbiased estimator of Λ′β under the augmented G–M model. In fact, under the augmented G–M model, the elements of R̃′X′y are the least squares estimators of the elements of Λ′β + R̃′X′Zτ, as is evident upon observing that

(X, Z)′(X, Z) = [ X′X  X′Z ]
                [ Z′X  Z′Z ],

Λ′β + R̃′X′Zτ = (Λ′, R̃′X′Z)(β′, τ′)′, (2.6)

and

[ X′X  X′Z ] [ R̃ ]   [  Λ   ]
[ Z′X  Z′Z ] [ 0 ]  = [ Z′XR̃ ],

and as is also evident upon observing (in light of the results of Section 5.4) that any linear combination of the elements of the vector (X, Z)′y is the least squares estimator of its expected value.
Note that while in general E(R̃′X′y) is affected by the augmentation of the G–M model, var(R̃′X′y) is unaffected (in the sense that the same expressions continue to apply).
In regard to the coefficient matrix (Λ′, R̃′X′Z) of expression (2.6) for the vector Λ′β + R̃′X′Zτ of linear combinations of the elements of the parametric vector (β′, τ′)′, note that

rank(Λ′, R̃′X′Z) = rank Λ (= rank X), (2.7)

as is evident upon observing that Λ′ = R̃′X′X and

(Λ′, R̃′X′Z) = R̃′X′(X, Z)

and hence (in light of Lemma 2.12.3 and Corollary 2.4.17) that

rank(Λ′) = rank(R̃′X′) ≥ rank(Λ′, R̃′X′Z) ≥ rank(Λ′).

Further, for any M-dimensional column vector ℓ, we find [upon observing that ℓ′Λ′ = (R̃ℓ)′X′X and making use of Corollary 2.3.4] that

ℓ′Λ′ = 0 ⇔ (R̃ℓ)′X′ = 0 ⇒ (R̃ℓ)′X′Z = 0 ⇔ ℓ′R̃′X′Z = 0 (2.8)

and that

ℓ′Λ′ = 0 ⇔ ℓ′(Λ′, R̃′X′Z) = 0. (2.9)

Thus, if any row or linear combination of rows of the matrix R̃′X′Z is nonnull, then the corresponding row or linear combination of rows of the matrix Λ′ is also nonnull. And a subset of the rows of the matrix (Λ′, R̃′X′Z) is linearly independent if and only if the corresponding subset of the rows of the matrix Λ′ is linearly independent.
Let S represent a matrix (with N rows) such that C(S) = N(X′) (i.e., such that the columns of S span the null space of X′) or, equivalently {since (according to Lemma 2.11.5) dim[N(X′)] = N − rank(X)}, such that X′S = 0 and rank(S) = N − rank(X). Further, denote by N* the number of columns in S—necessarily, N* ≥ N − rank(X). And consider the N*-dimensional column vector S′Zτ, the elements of which are linear combinations of the elements of the vector τ.
Under the augmented G–M model, the elements of the vector S′Zτ (like those of the vector Λ′β + R̃′X′Zτ) are estimable linear combinations of the elements of the vector (β′, τ′)′, as is evident upon observing that

S′X = (X′S)′ = 0

and hence that

S′Zτ = (0, S′Z)(β′, τ′)′ = S′(X, Z)(β′, τ′)′.
Clearly,

[ Λ′β + R̃′X′Zτ ]   [ Λ′  R̃′X′Z ] [ β ]
[     S′Zτ     ] = [ 0    S′Z  ] [ τ ].

And upon observing that

[ Λ′  R̃′X′Z ]   [ R̃′X′ ]
[ 0    S′Z  ] = [  S′  ] (X, Z)

and (in light of Lemmas 2.12.1, 2.6.1, and 2.12.3) that

rank [ R̃′X′ ] = rank ( [ R̃′X′ ] (XR̃, S) ) = rank [ (XR̃)′XR̃   0  ]
     [  S′  ]          ( [  S′  ]         )        [    0     S′S ]
  = rank[(XR̃)′XR̃] + rank(S′S)
  = rank(XR̃) + rank(S)
  = rank(Λ) + rank(S)
  = rank(X) + N − rank(X) = N,

it follows from Lemma 2.5.5 that

R [ Λ′  R̃′X′Z ] = R[(X, Z)] and rank [ Λ′  R̃′X′Z ] = rank(X, Z). (2.10)
  [ 0    S′Z  ]                       [ 0    S′Z  ]
Moreover,

[ Λ′  R̃′X′Z ]   [ Λ′  Λ′L ]
[ 0    S′Z  ] = [ 0   S′Z ] (2.11)

for some matrix L [as is evident from Lemma 2.4.3 upon observing (in light of Corollary 2.4.4 and Lemma 2.12.3) that C(R̃′X′Z) ⊂ C(R̃′X′) = C(R̃′X′X) = C(Λ′)], implying that

[ Λ′  R̃′X′Z ]   [ Λ′   0  ] [ I_P  L  ]
[ 0    S′Z  ] = [ 0   S′Z ] [ 0   I_Q ]

and hence {since (according to Lemma 2.6.2) [ I  L; 0  I ] is nonsingular} that

rank [ Λ′  R̃′X′Z ] = rank(Λ) + rank(S′Z) = rank(X) + rank(S′Z) (2.12)
     [ 0    S′Z  ]
(as is evident from Corollary 2.5.6 and Lemma 2.6.1). Together with result (2.10), result (2.12) implies that

rank(S′Z) = rank(X, Z) − rank(X). (2.13)

For any M-dimensional column vector ℓ₁ and any N*-dimensional column vector ℓ₂,

(ℓ₁′, ℓ₂′) [ Λ′  R̃′X′Z ] = 0 ⇔ ℓ₁′(Λ′, R̃′X′Z) = 0 and ℓ₂′S′Z = 0, (2.14)
           [ 0    S′Z  ]

as is evident from result (2.9). Thus, a subset of size rank(X, Z) of the M + N* rows of the matrix [ Λ′, R̃′X′Z; 0, S′Z ] is linearly independent if and only if the subset consists of rank(X) linearly independent rows of (Λ′, R̃′X′Z) and rank(X, Z) − rank(X) linearly independent rows of (0, S′Z). [In light of result (2.10), there exists a linearly independent subset of size rank(X, Z) of the rows of [ Λ′, R̃′X′Z; 0, S′Z ] and no linearly independent subset of a size larger than rank(X, Z).]
Under the augmented G–M model, a linear combination, say λ₁′β + λ₂′τ, of the elements of β and τ is [in light of result (2.10)] estimable if and only if

λ₁′ = ℓ₁′Λ′ and λ₂′ = ℓ₁′R̃′X′Z + ℓ₂′S′Z (2.15)

for some M-dimensional column vector ℓ₁ and some N*-dimensional column vector ℓ₂. Moreover, in the special case where λ₁ = 0 (i.e., in the special case of a linear combination of the elements of τ), this result can [in light of result (2.8)] be simplified as follows: under the augmented G–M model, λ₂′τ is estimable if and only if

λ₂′ = ℓ₂′S′Z (2.16)

for some N*-dimensional column vector ℓ₂.
Among the choices for the N × N* matrix S is the N × N matrix I − P_X. Since (according to Theorem 2.12.2) X′P_X = X′, P_X² = P_X, and rank(P_X) = rank(X),

X′(I − P_X) = 0

and (in light of Lemma 2.8.4)

rank(I − P_X) = N − rank(X).

Clearly, when S = I − P_X, N* = N.
Let T̃ represent any solution to the following linear system [in the (P+Q) × N* matrix T]:

(X, Z)′(X, Z)T = (0, S′Z)′ (2.17)

—the existence of a solution follows (in light of the results of Section 5.4c) from the estimability (under the augmented G–M model) of the elements of the vector (0, S′Z)(β′, τ′)′. Further, partition T̃ as T̃ = (T̃₁′, T̃₂′)′ (where T̃₁ has P rows), and observe that

X′XT̃₁ = −X′ZT̃₂

and hence (in light of Theorem 2.12.2) that

XT̃₁ = P_X XT̃₁ = X(X′X)⁻X′XT̃₁ = −X(X′X)⁻X′ZT̃₂ = −P_X ZT̃₂. (2.18)

Then, under the augmented G–M model, the least squares estimators of the elements of the vector S′Zτ are [in light of result (5.4.35)] the (corresponding) elements of a vector that is expressible as follows:

T̃′(X, Z)′y = T̃₁′X′y + T̃₂′Z′y = (−P_X ZT̃₂)′y + T̃₂′Z′y = T̃₂′Z′(I − P_X)y. (2.19)

And (under the augmented G–M model)

cov[T̃₂′Z′(I − P_X)y, R̃′X′y] = σ²T̃₂′Z′(I − P_X)XR̃ = 0 (2.20)

(i.e., the least squares estimators of the elements of S′Zτ are uncorrelated with those of the elements of Λ′β + R̃′X′Zτ) and [in light of result (5.6.6)]

var[T̃₂′Z′(I − P_X)y] = σ²T̃′(0, S′Z)′ = σ²T̃₂′Z′S. (2.21)

In connection with linear system (2.17), note [in light of result (2.18)] that

Z′(I − P_X)ZT̃₂ = Z′S [= Z′(I − P_X)S].

Conversely, if T̃₂ is a solution to the linear system

Z′(I − P_X)ZT₂ = Z′S (2.22)

(in T₂) and T̃₁ a solution to the linear system X′XT₁ = −X′ZT̃₂ (in T₁), then (T̃₁′, T̃₂′)′ is a solution to linear system (2.17).
In what follows, it is convenient to decompose the matrix (X, Z) as

(X, Z) = OU, (2.23)

where O is an N × rank(X, Z) matrix with orthonormal columns and where, for some rank(X) × P matrix U₁₁ (of full row rank), some rank(X) × Q matrix U₁₂, and some [rank(X, Z) − rank(X)] × Q matrix U₂₂ (of full row rank),

U = [ U₁₁  U₁₂ ]
    [  0   U₂₂ ].

A decomposition of the form (2.23) can be constructed by, for example, applying Gram–Schmidt orthogonalization, or (in what would be preferable for numerical purposes) modified Gram–Schmidt orthogonalization, to the columns of the matrix (X, Z). In fact, when this method of construction is employed, U is a submatrix of a (P+Q) × (P+Q) upper triangular matrix having rank(X, Z) positive diagonal elements and P+Q − rank(X, Z) null rows; it is the submatrix obtained by striking out the null rows—refer, e.g., to Harville (1997, chap. 6). When U is of this form, the decomposition (2.23) is what is known (at least in the special case of the decomposition of a matrix having full column rank) as the QR decomposition (or the "skinny" QR decomposition). The QR decomposition was encountered earlier; Section 5.4e included a discussion of the use of the QR decomposition in the computation of least squares estimates.
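In the full-column-rank case, the blocks U₁₁, U₁₂, and U₂₂ of decomposition (2.23) can be read directly off the "skinny" QR decomposition of (X, Z), as the following sketch illustrates (numpy assumed; X and Z are simulated and taken to be of full column rank, so rank X = P).

```python
# Sketch: obtain U11, U12, U22 of (X, Z) = O U via the reduced QR decomposition.
import numpy as np

rng = np.random.default_rng(4)
N, P, Q_ = 20, 4, 3
X = rng.standard_normal((N, P))
Z = rng.standard_normal((N, Q_))

O, U = np.linalg.qr(np.hstack([X, Z]))   # (X, Z) = O U, with O'O = I
U11, U12 = U[:P, :P], U[:P, P:]
U22 = U[P:, P:]

# checks corresponding to result (2.24) and to X'X = U11'U11, X'Z = U11'U12:
assert np.allclose(X, O[:, :P] @ U11)
assert np.allclose(X.T @ X, U11.T @ U11)
assert np.allclose(X.T @ Z, U11.T @ U12)
```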
Partition the matrix O (conformally to the partitioning of U) as O = (O₁, O₂), where O₁ has rank(X) columns. And observe that

X = O₁U₁₁ and Z = O₂U₂₂ + O₁U₁₂. (2.24)

Then, among the choices for Λ′ is that obtained by taking Λ′ = U₁₁ (as is evident from Corollary 2.4.17). Further, the choices for S include the matrix (O₂, O₃), where O₃ is an N × [N − rank(X, Z)] matrix whose columns form an orthonormal basis for N[(X, Z)′] or, equivalently (in light of Corollary 2.4.17), N[(O₁, O₂)′]—the N − rank(X) columns of (O₂, O₃) form an orthonormal basis for N(X′).
In light of result (2.24), we have that

X′X = U₁₁′U₁₁ and X′Z = U₁₁′U₁₂.

Moreover, when Λ′ = U₁₁, we find that

X′XR = Λ ⇔ U₁₁′U₁₁R = U₁₁′ ⇔ U₁₁R = I

—that U₁₁′U₁₁R = U₁₁′ ⇒ U₁₁R = I is clear upon, e.g., observing that U₁₁U₁₁′ is nonsingular and premultiplying both sides of the equation U₁₁′U₁₁R = U₁₁′ by (U₁₁U₁₁′)⁻¹U₁₁. Thus, when Λ′ = U₁₁, the expected value Λ′β + R̃′X′Zτ (under the augmented G–M model) of the estimator R̃′X′y (where, as in Subsection a, R̃ represents an arbitrary solution to X′XR = Λ) is reexpressible as

Λ′β + R̃′X′Zτ = U₁₁β + R̃′U₁₁′U₁₂τ = U₁₁β + U₁₂τ, (2.25)

and

var(R̃′X′y) = σ²R̃′X′XR̃ = σ²(U₁₁R̃)′U₁₁R̃ = σ²I. (2.26)

Further,

Z′S = (Z′O₂, Z′O₃) = (U₂₂′, 0),

so that

S′Z = [ U₂₂ ]
      [  0  ].

And observe (in light of Lemma 2.6.3) that

rank U = rank(X, Z)

and hence that

rank(UU′) = rank(X, Z). (2.27)

Observe also that (O, O₃) is an (N × N) orthogonal matrix and hence that
c. An illustration
Let us illustrate the results of Subsections a and b by using them to add to the results obtained earlier (in Section 7.1) for the lettuce-yield data. Accordingly, let us take y to be the 20 × 1 random vector whose observed value is the vector of lettuce yields. Further, let us adopt the terminology and notation introduced in Section 7.1 (along with those introduced in Subsections a and b of the present section).
Suppose that the original G–M model is the second-order G–M model, which is the model that was adopted in the analyses carried out in Section 7.1. And suppose that the augmented G–M model is the third-order G–M model. Then, X = (X₁, X₂) and Z = X₃ (where X₁, X₂, and X₃ are as defined in Section 7.1). Further, β = (β₁, β₂, β₃, β₄, β₁₁, β₁₂, β₁₃, β₂₂, β₂₃, β₃₃)′ and τ = (β₁₁₁, β₁₁₂, β₁₁₃, β₁₂₂, β₁₂₃, β₁₃₃, β₂₂₂, β₂₂₃, β₂₃₃, β₃₃₃)′.
Upon applying result (2.5) with R̃ = (X′X)⁻¹ (corresponding to Λ = I), we find that, under the augmented (third-order) model,

E(β̂) = β + (X′X)⁻¹X′Zτ.

All ten of the least squares estimators β̂₁, β̂₂, β̂₃, β̂₄, β̂₁₁, β̂₁₂, β̂₁₃, β̂₂₂, β̂₂₃, and β̂₃₃ of the elements of β are at least somewhat susceptible to biases occasioned by the exclusion from the model of third-order terms. The exposure to such biases appears to be greatest in the case of the estimators β̂₂, β̂₃, and β̂₄ of the first-order regression coefficients β₂, β₃, and β₄. In fact, if the level of Fe in the first, second, third, and fifth containers had been −1 (on the transformed scale), which was the intended level, instead of −0.4965, the expected values of the other seven estimators (β̂₁, β̂₁₁, β̂₁₂, β̂₁₃, β̂₂₂, β̂₂₃, and β̂₃₃) would have been the same under the third-order model as under the second-order model (i.e., would have equalled β₁, β₁₁, β₁₂, β₁₃, β₂₂, β₂₃, and β₃₃, respectively, under both models)—if the level of Fe in those containers had been −1, the expected values of β̂₂, β̂₃, and β̂₄ under the third-order model would have been

E(β̂₂) = β₂ + 1.757β₁₁₁ + 0.586β₁₂₂ + 0.586β₁₃₃,
E(β̂₃) = β₃ + 0.586β₁₁₂ + 1.757β₂₂₂ + 0.586β₂₃₃, and
E(β̂₄) = β₄ + 0.586β₁₁₃ + 0.586β₂₂₃ + 1.757β₃₃₃.
The estimators $\hat\beta_1$, $\hat\beta_2$, $\hat\beta_3$, $\hat\beta_4$, $\hat\beta_{11}$, $\hat\beta_{12}$, $\hat\beta_{13}$, $\hat\beta_{22}$, $\hat\beta_{23}$, and $\hat\beta_{33}$ (which are the least squares estimators of the regression coefficients in the second-order model) have the "same" standard errors under the augmented (third-order) model as under the original (second-order) model; they are the same in the sense that the expressions given in Table 7.1 for the standard errors are still applicable. However, the "interpretation" of the parameter $\sigma$ (which appears as a multiplicative factor in those expressions) differs. This difference manifests itself in the estimation of $\sigma$ and $\sigma^2$. In the case of the augmented G–M model, the usual estimator $\hat\sigma^2$ of $\sigma^2$ is that represented by the quadratic form
$$y'[I - P_{(X,Z)}]\,y \,\big/\, [N - \operatorname{rank}(X, Z)], \tag{2.34}$$
rather than (as in the case of the original G–M model) that represented by the quadratic form $y'(I - P_X)y \,\big/\, [N - \operatorname{rank}(X)]$.
For the lettuce-yield data, $y'[I - P_{(X,Z)}]y = 52.6306$ and $\operatorname{rank}(X, Z) = 15$, so that when the estimator (2.34) is applied to the lettuce-yield data, we obtain $\hat\sigma^2 = 10.53$. Upon taking the square root of this value, we obtain $\hat\sigma = 3.24$, which is 1.70% smaller than the value (3.30) obtained for $\hat\sigma$ when the model was taken to be the original (second-order) model. Accordingly, when the model for the lettuce-yield data is taken to be the augmented (third-order) model, the estimated standard errors obtained for $\hat\beta_1$, $\hat\beta_2$, $\hat\beta_3$, $\hat\beta_4$, $\hat\beta_{11}$, $\hat\beta_{12}$, $\hat\beta_{13}$, $\hat\beta_{22}$, $\hat\beta_{23}$, and $\hat\beta_{33}$ are 1.70% smaller than those given in Table 7.1 (which were obtained on the basis of the second-order model).
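As a computational aside (not part of the original analysis), the estimator (2.34) amounts to dividing the residual sum of squares from the augmented fit by $N - \operatorname{rank}(X, Z)$. A minimal sketch, with randomly generated stand-ins for the augmented model matrix and the response:

    import numpy as np

    rng = np.random.default_rng(1)
    N = 20
    W = rng.standard_normal((N, 15))     # stand-in for the augmented matrix (X, Z)
    y = rng.standard_normal(N)           # stand-in for the vector of lettuce yields

    r = np.linalg.matrix_rank(W)
    Q, _ = np.linalg.qr(W)               # columns of Q span C(X, Z) when W has full column rank
    resid = y - Q @ (Q.T @ y)            # (I - P_(X,Z)) y
    sigma2_hat = resid @ resid / (N - r) # estimator (2.34)
    print(sigma2_hat, np.sqrt(sigma2_hat))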
Under the augmented G–M model, the $\operatorname{rank}(X, Z) - \operatorname{rank}(X)$ elements of the vector $U_{22}\gamma$ [where $U_{22}$ is defined in terms of the decomposition (2.23)] are linearly independent estimable linear combinations of the elements of $\gamma$ [and every estimable linear combination of the elements of $\gamma$ is expressible in terms of these $\operatorname{rank}(X, Z) - \operatorname{rank}(X)$ linear combinations]. In the lettuce-yield application [where the augmented G–M model is the third-order model and where $\operatorname{rank}(X, Z) - \operatorname{rank}(X) = 5$], the elements of $U_{22}\gamma$ [those obtained when the decomposition (2.23) is taken to be the QR decomposition] are the following linear combinations of the third-order regression coefficients:
$$3.253\,\beta_{111} - 1.779\,\beta_{122} - 0.883\,\beta_{133},$$
$$1.779\,\beta_{112} - 3.253\,\beta_{222} + 0.883\,\beta_{233},$$
$$1.554\,\beta_{113} + 1.554\,\beta_{223} - 3.168\,\beta_{333},$$
$$2.116\,\beta_{123},$$
and
$$0.471\,\beta_{333}.$$
The least squares estimates of these five linear combinations are $-1.97$, $4.22$, $5.11$, $2.05$, and $2.07$, respectively; the least squares estimators are uncorrelated, and each of them has a standard error of $\sigma$ and an estimated standard error of 3.24.
$\tau_1, \tau_2, \ldots, \tau_M$ and possibly about some or all linear combinations of $\tau_1, \tau_2, \ldots, \tau_M$. These inferences may take the form of a confidence set for the vector $\tau$. Or they may take the form of individual confidence intervals (or sets) for $\tau_1, \tau_2, \ldots, \tau_M$ (and possibly for linear combinations of $\tau_1, \tau_2, \ldots, \tau_M$) that have a specified probability of simultaneous coverage. Alternatively, these inferences may take the form of a test of hypothesis. More specifically, they may take the form of a test (of a specified size) of the null hypothesis $H_0\colon \tau = 0$ versus the alternative hypothesis $H_1\colon \tau \ne 0$ or, more generally, of $H_0\colon \tau = \tau^{(0)}$ [where $\tau^{(0)} = (\tau_1^{(0)}, \tau_2^{(0)}, \ldots, \tau_M^{(0)})'$ is any vector of hypothesized values] versus $H_1\colon \tau \ne \tau^{(0)}$. They may also take a form that consists of testing whether or not each of the $M$ quantities $\tau_1, \tau_2, \ldots, \tau_M$ (and possibly each of various linear combinations of these $M$ quantities) equals a hypothesized value (subject to some restriction on the probability of an "excessive overall number" of false rejections).
In testing $H_0\colon \tau = \tau^{(0)}$ (versus $H_1\colon \tau \ne \tau^{(0)}$) attention is restricted to what are called testable hypotheses. The null hypothesis $H_0\colon \tau = \tau^{(0)}$ is said to be testable if (in addition to $\tau_1, \tau_2, \ldots, \tau_M$ being estimable and $\Lambda$ being nonnull) $\tau^{(0)} \in \mathcal{C}(\Lambda')$, that is, if $\tau^{(0)} = \Lambda'\beta^{(0)}$ for some $P \times 1$ vector $\beta^{(0)}$. Note that if $\tau^{(0)} \notin \mathcal{C}(\Lambda')$, there would not exist any values of $\beta$ for which $\Lambda'\beta$ equals the hypothesized value $\tau^{(0)}$ and, consequently, $H_0$ would be inherently false. It is worth noting that while the definition of testability adopted herein rules out the existence of any contradictions among the $M$ equalities $\tau_i = \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$) that define $H_0$, it is sufficiently flexible to accommodate redundancies among these equalities; a more restrictive definition (one adopted by many authors) would be to require that $\operatorname{rank}(\Lambda) = M$.
Most of the results on confidence sets for $\tau$ or its individual elements or for testing whether or not $\tau$ or its individual elements equal hypothesized values are obtained under an assumption that the vector $e$ of the residual effects in the G–M model has an $N(0, \sigma^2 I)$ distribution. However, for some of these results, this assumption is stronger than necessary; it suffices to assume that the distribution of $e$ is spherically symmetric.
Note that the matrix $C$ is symmetric and nonnegative definite and (in light of Lemmas 2.12.1 and 2.12.3) that
$$\operatorname{rank} C = \operatorname{rank}(X\tilde R) = \operatorname{rank}(X'X\tilde R) = \operatorname{rank}\Lambda = M^*, \tag{3.1}$$
implying in particular (in light of Corollary 2.13.23) that
$$C = T'T$$
for some $M^* \times M$ matrix $T$ of full row rank $M^*$. Now, take $S$ to be any right inverse of $T$ (that $T$ has a right inverse is evident from Lemma 2.5.1) or, more generally, take $S$ to be any $M \times M^*$ matrix such that $TS$ is orthogonal, in which case
$$S'CS = (TS)'TS = I. \tag{3.2}$$
Further, let
$$\alpha = S'\tau = S'\Lambda'\beta = (\Lambda S)'\beta,$$
so that $\alpha$ is an $M^* \times 1$ vector whose elements are expressible as linearly independent linear combinations of the elements of either $\tau$ or $\beta$. And let $\hat\alpha$ represent the least squares estimator of $\alpha$. Then, clearly,
$$\hat\alpha = (\tilde RS)'X'y = S'\hat\tau$$
and
$$\operatorname{var}(\hat\alpha) = \sigma^2 I.$$
Inverse relationship. The transformation from the $M$-dimensional vector $\tau$ to the $M^*$-dimensional vector $\alpha$ is invertible. In light of result (3.3), the $M^*$ columns of the matrix $\Lambda S$ form a basis for $\mathcal{C}(\Lambda)$, and, consequently, there exists a unique $M^* \times M$ matrix $W$ (of full row rank $M^*$) such that
$$\Lambda SW = \Lambda. \tag{3.4}$$
And upon premultiplying both sides of equality (3.4) by $S'\Lambda'(X'X)^-$ [and making use of result (3.2)], we find that
$$W = (TS)'T \tag{3.5}$$
(in the special case where $S$ is a right inverse of $T$, $W = T$). Note that
$$\mathcal{C}(W') = \mathcal{C}(\Lambda') \quad\text{and}\quad WS = I, \tag{3.6}$$
as is evident upon, for example, observing that $\mathcal{C}(\Lambda') = \mathcal{C}[(\Lambda SW)'] \subset \mathcal{C}(W')$ (and invoking Theorem 2.4.16) and upon observing [in light of result (3.5)] that $WS = (TS)'TS$ [and applying result (3.2)]. Note also that
$$C = \Lambda'(X'X)^-\Lambda = (\Lambda SW)'(X'X)^-\Lambda SW = W'S'CSW = W'W. \tag{3.7}$$
Clearly,
$$\tau = \Lambda'\beta = (\Lambda SW)'\beta = W'S'\tau = W'\alpha. \tag{3.8}$$
Similarly,
$$\hat\tau = \Lambda'(X'X)^-X'y = (\Lambda SW)'(X'X)^-X'y = W'S'\hat\tau = W'\hat\alpha. \tag{3.9}$$
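The algebra connecting $C$, $T$, $S$, and $W$ can be verified directly. A small numerical sketch (hypothetical inputs; $\Lambda$ is taken to be of full column rank so that $M^* = M$ and $C$ is nonsingular, and $T$ is obtained from a Cholesky factorization):

    import numpy as np

    rng = np.random.default_rng(2)
    N, P, M = 30, 6, 3
    X = rng.standard_normal((N, P))           # full column rank with probability 1
    Lam = rng.standard_normal((P, M))         # stand-in for Lambda (full column rank, M* = M)

    C = Lam.T @ np.linalg.solve(X.T @ X, Lam) # C = Lambda'(X'X)^- Lambda (here an ordinary inverse)
    T = np.linalg.cholesky(C).T               # upper-triangular T with C = T'T
    S = np.linalg.inv(T)                      # a right inverse of T (so T S = I, which is orthogonal)

    assert np.allclose(S.T @ C @ S, np.eye(M))  # result (3.2)
    W = (T @ S).T @ T                           # result (3.5); here W = T since T S = I
    assert np.allclose(W, T)
    assert np.allclose(W @ S, np.eye(M))        # result (3.6)
    assert np.allclose(W.T @ W, C)              # result (3.7)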
upon observing that $C_*$ is an $M^* \times M^*$ symmetric nonnegative definite matrix of rank $M^*$). Accordingly, let $T_*$ represent an $M^* \times M^*$ nonsingular matrix such that $C_* = T_*'T_*$; the existence of such a matrix is evident from Corollary 2.13.29. Further, let $\tilde T = T_*K_*$, and define $\tilde S$ to be the $M \times M^*$ matrix whose $j_1, j_2, \ldots, j_{M^*}$th rows are respectively the first, second, …, $M^*$th rows of $T_*^{-1}$ and whose remaining ($j_{M^*+1}, j_{M^*+2}, \ldots, j_M$th) rows are null vectors. Then,
$$C = \tilde T'\tilde T.$$
And
$$\tilde T\tilde S = T_*T_*^{-1} = I,$$
so that $\tilde S$ is a right inverse of $\tilde T$. Thus, among the choices for a matrix $T$ such that $C = T'T$ and for a matrix $S$ such that $S'CS = I$ are $T = \tilde T$ and $S = \tilde S$; and when $T = \tilde T$ and $S = \tilde S$,
$$W = \tilde T = T_*K_*,$$
in which case $\tau = K_*'T_*'\alpha$ and $\hat\tau = K_*'T_*'\hat\alpha$ or, equivalently,
$$\tau_* = T_*'\alpha \quad\text{and}\quad \tau_{j_i} = k_{j_i}'\tau_* \;\; (i = M^*{+}1, M^*{+}2, \ldots, M),$$
and
$$\hat\tau_* = T_*'\hat\alpha \quad\text{and}\quad \hat\tau_{j_i} = k_{j_i}'\hat\tau_* \;\; (i = M^*{+}1, M^*{+}2, \ldots, M).$$
An equivalent null hypothesis. Now, consider the problem of testing the null hypothesis $H_0\colon \tau = \tau^{(0)}$ versus the alternative hypothesis $H_1\colon \tau \ne \tau^{(0)}$. This problem can be reformulated in terms of the vector $\alpha$. Assume that $H_0$ is testable, in which case $\tau^{(0)} = \Lambda'\beta^{(0)}$ for some $P \times 1$ vector $\beta^{(0)}$, let $\alpha^{(0)} = S'\tau^{(0)}$, and consider the problem of testing the null hypothesis $\tilde H_0\colon \alpha = \alpha^{(0)}$ versus the alternative hypothesis $\tilde H_1\colon \alpha \ne \alpha^{(0)}$.
The problem of testing $\tilde H_0$ versus $\tilde H_1$ is equivalent to that of testing $H_0$ versus $H_1$; they are equivalent in the sense that any value of $\beta$ that satisfies $\tilde H_0$ satisfies $H_0$ and vice versa. To see this, observe that a value of $\beta$ satisfies $\tilde H_0$ if and only if it satisfies the equality $S'\Lambda'(\beta - \beta^{(0)}) = 0$, and it satisfies $H_0$ if and only if it satisfies the equality $\Lambda'(\beta - \beta^{(0)}) = 0$. Observe also that $\mathcal{N}(S'\Lambda') \supset \mathcal{N}(\Lambda')$ and [in light of Lemma 2.11.5 and result (3.3)] that
$$\dim[\mathcal{N}(S'\Lambda')] = P - \operatorname{rank}(S'\Lambda') = P - \operatorname{rank}(\Lambda S) = P - M^* = \dim[\mathcal{N}(\Lambda')]$$
and hence (recalling Theorem 2.4.10) that $\mathcal{N}(S'\Lambda') = \mathcal{N}(\Lambda')$.
A vector of error contrasts. Let $L$ represent an $N \times (N - P^*)$ matrix whose columns form an orthonormal basis for $\mathcal{N}(X')$; the existence of an orthonormal basis follows from Theorem 2.4.23, and, as previously indicated (in Section 5.9b) and as is evident from Lemma 2.11.5, $\dim[\mathcal{N}(X')] = N - P^*$. Further, let $d = L'y$. And observe that
$$\operatorname{rank} L = N - P^*, \quad X'L = 0, \quad L'X = 0, \quad\text{and}\quad L'L = I, \tag{3.10}$$
that
$$E(d) = 0 \quad\text{and}\quad \operatorname{var}(d) = \sigma^2 I,$$
and that $\operatorname{cov}(d, X'y) = 0$ and hence that
$$\operatorname{cov}(d, \hat\alpha) = 0.$$
The $N - P^*$ elements of the vector $d$ are error contrasts; error contrasts were discussed earlier (in Section 5.9b).
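A matrix $L$ of the kind just described can be obtained numerically from an orthonormal basis for $\mathcal{N}(X')$, e.g., via scipy.linalg.null_space. The following sketch (with a randomly generated stand-in for $X$) forms $d = L'y$ and checks the properties (3.10), along with the identity $d'd = y'(I - P_X)y$ that is recorded later as result (3.21).

    import numpy as np
    from scipy.linalg import null_space

    rng = np.random.default_rng(3)
    N, P = 20, 6
    X = rng.standard_normal((N, P))      # stand-in model matrix, full column rank (P* = P)
    y = rng.standard_normal(N)

    L = null_space(X.T)                  # N x (N - P*) orthonormal basis for N(X')
    d = L.T @ y                          # vector of error contrasts

    assert L.shape == (N, N - P)
    assert np.allclose(X.T @ L, 0)                 # X'L = 0 (and L'X = 0)
    assert np.allclose(L.T @ L, np.eye(N - P))     # L'L = I

    Q, _ = np.linalg.qr(X)
    rss = y @ y - (Q.T @ y) @ (Q.T @ y)  # y'(I - P_X)y
    assert np.allclose(d @ d, rss)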
Complementary parametric functions and their least squares estimators. The least squares estimator $\hat\alpha$ of the vector $\alpha$ and the vector $d$ of error contrasts can be combined into a single vector and expressed as follows:
$$\begin{pmatrix} \hat\alpha \\ d \end{pmatrix} = \begin{pmatrix} S'\tilde R'X'y \\ L'y \end{pmatrix} = (X\tilde RS, \; L)'y.$$
The columns of the $N \times (N - P^* + M^*)$ matrix $(X\tilde RS, L)$ are orthonormal. And (in light of Lemma 2.11.5)
$$\dim\{\mathcal{N}[(X\tilde RS, L)']\} = N - (N - P^* + M^*) = P^* - M^*. \tag{3.11}$$
Further,
$$\mathcal{N}[(X\tilde RS, L)'] \subset \mathcal{N}(L') = \mathcal{C}(X) \tag{3.12}$$
(that $\mathcal{N}(L') = \mathcal{C}(X)$ follows from Theorem 2.4.10 upon observing [in light of result (3.10) and Lemma 2.4.2] that $\mathcal{C}(X) \subset \mathcal{N}(L')$ and (in light of Lemma 2.11.5) that $\dim[\mathcal{N}(L')] = N - (N - P^*) = P^* = \dim[\mathcal{C}(X)]$).
Let $U$ represent a matrix whose columns form an orthonormal basis for $\mathcal{N}[(X\tilde RS, L)']$; the existence of such a matrix follows from Theorem 2.4.23. And observe [in light of result (3.11)] that $U$ is of dimensions $N \times (P^* - M^*)$ and [in light of result (3.12)] that $U = XK$ for some matrix $K$ [of dimensions $P \times (P^* - M^*)$]. Observe also that
$$X'XK = X'U = (U'X)' \tag{3.13}$$
and (in light of Lemma 2.12.3) that
$$\operatorname{rank}(U'X) = \operatorname{rank}(X'XK) = \operatorname{rank}(XK) = \operatorname{rank}(U) = P^* - M^*. \tag{3.14}$$
Now, let $\xi = U'X\beta$ and $\hat\xi = U'y$ $(= K'X'y)$. Then, in light of results (3.14) and (3.13), $\xi$ is a vector of $P^* - M^*$ linearly independent estimable functions, and $\hat\xi$ is the least squares estimator of $\xi$. Further,
$$\operatorname{rank} U = P^* - M^*, \quad U'L = 0, \quad U'X\tilde RS = 0, \quad\text{and}\quad U'U = I,$$
implying in particular that
$$\operatorname{cov}(\hat\xi, d) = 0, \quad \operatorname{cov}(\hat\xi, \hat\alpha) = 0, \quad\text{and}\quad \operatorname{var}(\hat\xi) = \sigma^2 I.$$
$$z = O'X\beta + O'e. \tag{3.18}$$
Thus,
$$z = \begin{pmatrix} I_{M^*} & 0 \\ 0 & I_{P^*-M^*} \\ 0 & 0 \end{pmatrix}\begin{pmatrix} \alpha \\ \xi \end{pmatrix} + e^*,$$
where $e^* = O'e$. Clearly,
$$E(e^*) = 0 \quad\text{and}\quad \operatorname{var}(e^*) = \sigma^2 I.$$
Like $y$, $z$ is an observable random vector. Also like $y$, it follows a G–M model. Specifically, it follows a G–M model in which the role of the $N \times P$ model matrix $X$ is played by the $N \times P^*$ matrix
$$\begin{pmatrix} I_{M^*} & 0 \\ 0 & I_{P^*-M^*} \\ 0 & 0 \end{pmatrix}$$
and the role of the $P \times 1$ parameter vector $\beta$ is played by the $P^* \times 1$ vector $(\alpha', \xi')'$ and in which the $M^*$-dimensional vector $\alpha$ and the $(P^* - M^*)$-dimensional vector $\xi$ are regarded as vectors of unknown (unconstrained) parameters rather than as vectors of parametric functions. This model is referred to as the canonical form of the (G–M) model.
Suppose that the vector $z$ is assumed to follow the canonical form of the G–M model (with parameterization $\alpha$, $\xi$, and $\sigma^2$). Suppose further that the distribution of the vector of residual effects in the canonical form is taken to be the same as that of the vector $e$ of residual effects in the original form $y = X\beta + e$. And suppose that $O'e \sim e$, as would be the case if $e \sim N(0, \sigma^2 I)$ or, more generally, if the distribution of $e$ is spherically symmetric. Then, the distribution of the vector $Oz$ obtained (on the basis of the canonical form) upon setting the parameter vector $\alpha$ equal to $S'\Lambda'\beta$ and the parameter vector $\xi$ equal to $U'X\beta$ is identical to the distribution of $y$ (i.e., to the distribution of $y$ obtained from a direct application of the model $y = X\beta + e$). Note that (when it comes to making inferences about $\Lambda'\beta$) the elements of the parameter vector $\xi$ of the canonical form can be viewed as "nuisance parameters."
Sufficiency and completeness. Suppose that $y \sim N(X\beta, \sigma^2 I)$, as is the case when the distribution of the vector $e$ of residual effects in the G–M model is taken to be $N(0, \sigma^2 I)$ (which is a case where $O'e \sim e$). Then, as was established earlier (in Section 5.8) by working directly with the pdf of the distribution of $y$, $X'y$ and $y'(I - P_X)y$ form a complete sufficient statistic. In establishing results of this kind, it can be advantageous to adopt an alternative approach that makes use of the canonical form of the model.
Suppose that the transformed vector $z$ $(= O'y)$ follows the canonical form of the G–M model. Suppose also that the distribution of the vector of residual effects in the canonical form is $N(0, \sigma^2 I)$, in which case $z \sim N[(\alpha', \xi', 0')', \; \sigma^2 I]$. Further, denote by $f(\cdot)$ the pdf of the distribution of $z$. Then,
$$f(z) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Big[-\frac{1}{2\sigma^2}\Big((z_1 - \alpha)'(z_1 - \alpha) + (z_2 - \xi)'(z_2 - \xi) + z_3'z_3\Big)\Big]$$
$$= \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Big\{-\frac{1}{2\sigma^2}\Big[(\hat\alpha - \alpha)'(\hat\alpha - \alpha) + (\hat\xi - \xi)'(\hat\xi - \xi) + d'd\Big]\Big\}$$
$$= \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Big[-\frac{1}{2\sigma^2}(\alpha'\alpha + \xi'\xi)\Big]\exp\Big[-\frac{1}{2\sigma^2}\big(d'd + \hat\alpha'\hat\alpha + \hat\xi'\hat\xi\big) + \frac{1}{\sigma^2}\alpha'\hat\alpha + \frac{1}{\sigma^2}\xi'\hat\xi\Big]. \tag{3.19}$$
And it follows from a standard result (e.g., Schervish 1995, theorem 2.74) on exponential families of distributions that $d'd + \hat\alpha'\hat\alpha + \hat\xi'\hat\xi$, $\hat\alpha$, and $\hat\xi$ form a complete sufficient statistic.
The complete sufficient statistic can be reexpressed in various alternative forms. Upon observing that
$$d'd + \hat\alpha'\hat\alpha + \hat\xi'\hat\xi = z'z = (Oz)'(Oz) = y'y, \tag{3.20}$$
[in light of Theorem 5.9.6 and result (3.10)] that
$$d'd = y'LL'y = y'(I - P_X)y, \tag{3.21}$$
and that $\hat\alpha$ [$= (\tilde RS)'X'y$] and $\hat\xi$ [$= K'X'y$] are expressible in terms of $X'y$, it can be seen that $X'y$ and $y'(I - P_X)y$ form a complete sufficient statistic, in agreement with the conclusion reached earlier.
Suppose now that the distribution of the vector of residual effects is taken to be spherically symmetric (rather than MVN). Upon proceeding in essentially the same way as in arriving at expression (3.19) and upon applying the factorization theorem (e.g., Casella and Berger 2002, theorem 6.2.6), we arrive at the conclusion that the same quantities that form a complete sufficient statistic in the special case where $e$ has an $N(0, \sigma^2 I)$ distribution form a sufficient (though not necessarily complete) statistic.
b. The test and confidence set and their basic properties
Let us continue to take $y$ to be an $N \times 1$ observable random vector that follows the G–M model. And taking the results of Subsection a to be our starting point (and adopting the notation employed therein), let us consider further inferences about the vector $\tau$ $(= \Lambda'\beta)$.
Suppose that the distribution of the vector $e$ of residual effects is MVN. Then, $y \sim N(X\beta, \sigma^2 I)$. And $z$ $(= O'y)$ $\sim N[(\alpha', \xi', 0')', \; \sigma^2 I]$.
F statistic and pivotal quantity. Letting $\dot\alpha$ represent an arbitrary value of $\alpha$ (i.e., an arbitrary $M^* \times 1$ vector), consider the random variable $\tilde F(\dot\alpha)$ defined as follows:
$$\tilde F(\dot\alpha) = \frac{(\hat\alpha - \dot\alpha)'(\hat\alpha - \dot\alpha)/M^*}{d'd/(N - P^*)}.$$
For $\dot\alpha = S'\dot\tau$ [where $\dot\tau \in \mathcal{C}(\Lambda')$],
$$(\hat\alpha - \dot\alpha)'(\hat\alpha - \dot\alpha)/M^* = (\hat\tau - \dot\tau)'C^-(\hat\tau - \dot\tau)/M^*. \tag{3.28}$$
To see this, observe that
$$(\hat\alpha - S'\dot\tau)'(\hat\alpha - S'\dot\tau) = (\hat\tau - \dot\tau)'SS'(\hat\tau - \dot\tau)$$
and that
$$CSS'C = T'TS(TS)'T = T'IT = T'T = C \tag{3.29}$$
(so that $SS'$ is a generalized inverse of $C$). Recalling that $\hat\tau = \Lambda'(X'X)^-X'y$, observe also that
$$\hat\tau - \dot\tau \in \mathcal{C}(\Lambda') \tag{3.30}$$
and [in light of Corollary 2.4.17 and result (3.1)] that
$$\mathcal{C}(\Lambda') = \mathcal{C}(C), \tag{3.31}$$
implying that $\hat\tau - \dot\tau = Cr$ for some vector $r$ and hence that
$$(\hat\tau - \dot\tau)'C^-(\hat\tau - \dot\tau) = r'CC^-Cr = r'Cr$$
and leading to the conclusion that $(\hat\tau - \dot\tau)'C^-(\hat\tau - \dot\tau)$ is invariant to the choice of the generalized inverse $C^-$.
In light of result (3.28), $\tilde F(\dot\alpha)$ is reexpressible in terms of the quantity $F(\dot\tau)$ defined as follows:
$$F(\dot\tau) = \frac{(\hat\tau - \dot\tau)'C^-(\hat\tau - \dot\tau)/M^*}{y'(I - P_X)y/(N - P^*)} = (\hat\tau - \dot\tau)'C^-(\hat\tau - \dot\tau)\big/(M^*\hat\sigma^2).$$
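Computationally, $F(\dot\tau)$ requires only $\hat\tau$, a generalized inverse $C^-$, $\hat\sigma^2$, and $M^*$. A sketch (the numbers are the ones used for the interaction example in Subsection d, so the printed value should be roughly 0.72; pinv supplies one choice of $C^-$, to which the quadratic form is invariant):

    import numpy as np

    def f_statistic(tau_hat, tau_dot, C, sigma2_hat, M_star):
        """F(tau_dot) = (tau_hat - tau_dot)' C^- (tau_hat - tau_dot) / (M* sigma2_hat)."""
        diff = tau_hat - tau_dot
        Cg = np.linalg.pinv(C)          # one choice of generalized inverse C^-
        return diff @ Cg @ diff / (M_star * sigma2_hat)

    C = np.diag([0.125, 0.213, 0.213])
    tau_hat = np.array([-1.31, -1.29, 0.66])
    print(f_statistic(tau_hat, np.zeros(3), C, 3.30**2, 3))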
An expression for the quantity $(1/\sigma^2)(\dot\alpha - \alpha)'(\dot\alpha - \alpha)$ (this quantity appears in result (3.24)) can be obtained that is analogous to expression (3.28) for $(\hat\alpha - \dot\alpha)'(\hat\alpha - \dot\alpha)/M^*$ and that can be verified in essentially the same way. For $\dot\alpha = S'\dot\tau$ [where $\dot\tau \in \mathcal{C}(\Lambda')$], we find that
$$(1/\sigma^2)(\dot\alpha - \alpha)'(\dot\alpha - \alpha) = (1/\sigma^2)(\dot\tau - \tau)'C^-(\dot\tau - \tau). \tag{3.34}$$
A characterization of the members of $\mathcal{C}(\Lambda')$. Let $\dot\tau = (\dot\tau_1, \dot\tau_2, \ldots, \dot\tau_M)'$ represent an arbitrary $M \times 1$ vector. And (defining $j_1, j_2, \ldots, j_M$ as in Subsection a) let $\dot\tau_* = (\dot\tau_{j_1}, \dot\tau_{j_2}, \ldots, \dot\tau_{j_{M^*}})'$. Then, $\dot\tau \in \mathcal{C}(\Lambda')$ if and only if
$$\dot\tau_{j_i} = k_{j_i}'\dot\tau_* \quad (i = M^*{+}1, M^*{+}2, \ldots, M)$$
(where $k_{j_{M^*+1}}, k_{j_{M^*+2}}, \ldots, k_{j_M}$ are as defined in Subsection a), as becomes evident upon observing that $\dot\tau \in \mathcal{C}(\Lambda')$ if and only if $\dot\tau = \Lambda'\dot\beta$ for some $P \times 1$ vector $\dot\beta$ and that $\Lambda' = K_*'\Lambda_*'$ (where $K_*$ and $\Lambda_*$ are as defined in Subsection a).
A particular choice for the generalized inverse of $C$. Among the choices for the generalized inverse $C^-$ is
$$C^- = \tilde S\tilde S' \tag{3.36}$$
(where $\tilde S$ is as defined in Subsection a). For $i, i' = 1, 2, \ldots, M^*$, the $j_ij_{i'}$th element of this particular generalized inverse equals the $ii'$th element of the ordinary inverse $C_*^{-1}$ of the $M^* \times M^*$ nonsingular submatrix $C_*$ of $C$ (the submatrix obtained upon striking out the $j_{M^*+1}, j_{M^*+2}, \ldots, j_M$th rows and columns of $C$); the remaining elements of this particular generalized inverse equal 0. Thus, upon setting $C^-$ equal to this particular generalized inverse, we find that $(\hat\tau - \dot\tau)'C^-(\hat\tau - \dot\tau)$ is expressible [for $\dot\tau \in \mathcal{C}(\Lambda')$] as follows:
$$(\hat\tau - \dot\tau)'C^-(\hat\tau - \dot\tau) = (\hat\tau_* - \dot\tau_*)'C_*^{-1}(\hat\tau_* - \dot\tau_*) \tag{3.37}$$
(where, as in Subsection a, $\hat\tau_*$ and $\dot\tau_*$ represent the subvectors of $\hat\tau$ and $\dot\tau$, respectively, obtained by striking out their $j_{M^*+1}, j_{M^*+2}, \ldots, j_M$th elements).
Confidence set. Denote by $\tilde A_F$ a set of $\alpha$-values defined as follows:
$$\tilde A_F = \{\dot\alpha : \tilde F(\dot\alpha) \le \bar F_{\bar P}(M^*, N - P^*)\},$$
where $\bar F_{\bar P}(M^*, N - P^*)$ is the upper $100\bar P\%$ point of the $SF(M^*, N - P^*)$ distribution and $\bar P$ is a scalar between 0 and 1. Since $\tilde F(\dot\alpha)$ varies with $z$, the set $\tilde A_F$ also varies with $z$. For purposes of making explicit the dependence of $\tilde A_F$ on $z$, let us write $\tilde A_F(z)$, or alternatively $\tilde A_F(\hat\alpha, \hat\xi, d)$, for $\tilde A_F$.
On the basis of result (3.25), we have that
$$\Pr(\alpha \in \tilde A_F) = \Pr[\tilde F(\alpha) \le \bar F_{\bar P}(M^*, N - P^*)] = 1 - \bar P.$$
Thus, the set $\tilde A_F$ constitutes a $100(1 - \bar P)\%$ confidence set for $\alpha$. In light of result (3.27),
$$\tilde A_F = \{\dot\alpha : (\dot\alpha - \hat\alpha)'(\dot\alpha - \hat\alpha) \le M^*\hat\sigma^2\bar F_{\bar P}(M^*, N - P^*)\}. \tag{3.38}$$
The geometrical form of the set $\tilde A_F$ is that of an $M^*$-dimensional closed ball centered at the point $\hat\alpha$ and with radius $[M^*\hat\sigma^2\bar F_{\bar P}(M^*, N - P^*)]^{1/2}$.
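The radius $[M^*\hat\sigma^2\bar F_{\bar P}(M^*, N - P^*)]^{1/2}$ of this ball is readily evaluated, e.g., with SciPy's F distribution (a sketch with illustrative values of $M^*$, $N - P^*$, $\hat\sigma^2$, and $\bar P$):

    import numpy as np
    from scipy.stats import f

    M_star, df2 = 3, 10                 # illustrative degrees of freedom
    sigma2_hat, P_bar = 10.89, 0.10     # illustrative sigma^2-hat and level

    F_upper = f.isf(P_bar, M_star, df2) # upper 100*P_bar% point of SF(M*, N - P*)
    radius = np.sqrt(M_star * sigma2_hat * F_upper)
    print(F_upper, radius)              # about 2.728 and 9.44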
By exploiting the relationship $\tau = W'\alpha$, a confidence set for $\tau$ can be obtained from that for $\alpha$. Define a set $A_F$ (of $\tau$-values) as follows:
$$A_F = \{\dot\tau : \dot\tau = W'\dot\alpha, \; \dot\alpha \in \tilde A_F\}. \tag{3.39}$$
Since $\tilde A_F$ depends on $z$ and hence (since $z = O'y$) on $y$, the set $A_F$ depends on $y$. For purposes of making this dependence explicit, let us write $A_F(y)$ for $A_F$. Clearly, in the special case where $M^* = M$ (so that $C$ is nonsingular),
$$A_F = \{\dot\tau : (\dot\tau - \hat\tau)'C^{-1}(\dot\tau - \hat\tau) \le M\hat\sigma^2\bar F_{\bar P}(M, N - P^*)\}. \tag{3.41}$$
More generally,
$$A_F = \{\dot\tau : (\dot\tau_* - \hat\tau_*)'C_*^{-1}(\dot\tau_* - \hat\tau_*) \le M^*\hat\sigma^2\bar F_{\bar P}(M^*, N - P^*), \;\; \dot\tau_{j_i} = k_{j_i}'\dot\tau_* \; (i = M^*{+}1, M^*{+}2, \ldots, M)\} \tag{3.42}$$
[where $\dot\tau_* = (\dot\tau_{j_1}, \dot\tau_{j_2}, \ldots, \dot\tau_{j_{M^*}})'$, where $\dot\tau_1, \dot\tau_2, \ldots, \dot\tau_M$ represent the first, second, …, $M$th elements of $\dot\tau$, and where $k_{j_{M^*+1}}, k_{j_{M^*+2}}, \ldots, k_{j_M}$ and $j_1, j_2, \ldots, j_{M^*}$ are as defined in Subsection a]. Geometrically, the set $\{\dot\tau_* : (\dot\tau_* - \hat\tau_*)'C_*^{-1}(\dot\tau_* - \hat\tau_*) \le M^*\hat\sigma^2\bar F_{\bar P}(M^*, N - P^*)\}$ is represented by the points in $M^*$-dimensional space enclosed by a surface that is "elliptical in nature."
F test. Define a set $\tilde C_F$ of $z$-values as follows:
$$\tilde C_F = \{z : \tilde F(\alpha^{(0)}) > \bar F_{\bar P}(M^*, N - P^*)\} = \Big\{z : \frac{(\hat\alpha - \alpha^{(0)})'(\hat\alpha - \alpha^{(0)})/M^*}{d'd/(N - P^*)} > \bar F_{\bar P}(M^*, N - P^*)\Big\}.$$
Thus, as a size-$\bar P$ test of $\tilde H_0$ or $H_0$, we have the test with critical (rejection) region $\tilde C_F$ and critical (test) function $\tilde\phi_F(z)$, that is, the test that rejects $\tilde H_0$ or $H_0$ if $z \in \tilde C_F$ or $\tilde\phi_F(z) = 1$ and accepts $\tilde H_0$ or $H_0$ otherwise. This test is referred to as the size-$\bar P$ F test.
The critical region and critical function of the size-$\bar P$ F test can be reexpressed in terms of $y$. In light of result (3.33),
$$z \in \tilde C_F \Leftrightarrow y \in C_F \quad\text{and}\quad \tilde\phi_F(z) = 1 \Leftrightarrow \phi_F(y) = 1, \tag{3.43}$$
where
$$C_F = \{y : F(\tau^{(0)}) > \bar F_{\bar P}(M^*, N - P^*)\} = \Big\{y : \frac{(\hat\tau - \tau^{(0)})'C^-(\hat\tau - \tau^{(0)})/M^*}{y'(I - P_X)y/(N - P^*)} > \bar F_{\bar P}(M^*, N - P^*)\Big\}$$
and
$$\phi_F(y) = \begin{cases} 1, & \text{if } y \in C_F, \\ 0, & \text{if } y \notin C_F. \end{cases}$$
In connection with result (3.43), it is worth noting that [in light of result (3.37)] the quadratic form $(\hat\tau - \tau^{(0)})'C^-(\hat\tau - \tau^{(0)})$ appearing in $C_F$ is computable as $(\hat\tau_* - \tau^{(0)}_*)'C_*^{-1}(\hat\tau_* - \tau^{(0)}_*)$. Thus, the confidence set $\tilde A_F$ for $\alpha$ consists of those values of $\alpha^{(0)}$ for which the null hypothesis $\tilde H_0\colon \alpha = \alpha^{(0)}$ is accepted, and the confidence set $A_F$ for $\tau$ consists of those values of $\tau^{(0)}$ [$\in \mathcal{C}(\Lambda')$] for which the equivalent null hypothesis $H_0\colon \tau = \tau^{(0)}$ is accepted.
Power function and probability of false coverage. The probability of $\tilde H_0$ or $H_0$ being rejected by the F test (or any other test) depends on the model's parameters; when regarded as a function of the model's parameters, this probability is referred to as the power function of the test. The power function of the size-$\bar P$ F test of $\tilde H_0$ or $H_0$ is expressible in terms of $\alpha$, $\xi$, and $\sigma$; specifically, it is expressible as the function $\tilde\pi_F(\alpha, \xi, \sigma)$ defined as follows:
$$\tilde\pi_F(\alpha, \xi, \sigma) = \Pr[\tilde F(\alpha^{(0)}) > \bar F_{\bar P}(M^*, N - P^*)].$$
Thus, $\tilde\pi_F(\alpha, \xi, \sigma)$ does not depend on $\xi$, and it depends on $\alpha$ and $\sigma$ only through the quantity
$$(1/\sigma^2)(\alpha - \alpha^{(0)})'(\alpha - \alpha^{(0)}) = [(1/\sigma)(\alpha - \alpha^{(0)})]'[(1/\sigma)(\alpha - \alpha^{(0)})]. \tag{3.47}$$
This quantity is interpretable as the squared distance (in units of $\sigma$) between the true and hypothesized values of $\alpha$. When $\alpha = \alpha^{(0)}$, $\tilde\pi_F(\alpha, \xi, \sigma) = \bar P$.
The power function can be reexpressed as a function, say $\pi_F(\tau, \xi, \sigma)$, of $\tau$, $\xi$, and $\sigma$. Clearly, $\pi_F(\tau, \xi, \sigma) = \tilde\pi_F(S'\tau, \xi, \sigma)$. For $\alpha \ne \alpha^{(0)}$ or equivalently $\tau \ne \tau^{(0)}$, $\tilde\pi_F(\alpha, \xi, \sigma)$ or $\pi_F(\tau, \xi, \sigma)$ represents the power of the size-$\bar P$ F test, that is, the probability of rejecting $\tilde H_0$ or $H_0$ when $\tilde H_0$ or $H_0$ is false. The power of a size-$\bar P$ test is a widely adopted criterion for assessing the test's effectiveness.
In the case of a $100(1 - \bar P)\%$ confidence region for $\alpha$ or $\tau$, the assessment of its effectiveness might be based on the probability of false coverage, which (by definition) is the probability that the region will cover (i.e., include) a vector $\alpha^{(0)}$ when $\alpha^{(0)} \ne \alpha$ or a vector $\tau^{(0)}$ when $\tau^{(0)} \ne \tau$. In light of the relationships (3.43) and (3.45), the probability $\Pr(\tau^{(0)} \in A_F)$ of $A_F$ covering $\tau^{(0)}$ [where $\tau^{(0)} \in \mathcal{C}(\Lambda')$] equals the probability $\Pr(\alpha^{(0)} \in \tilde A_F)$ of $\tilde A_F$ covering $\alpha^{(0)}$ ($= S'\tau^{(0)}$), and their probability of coverage equals $1 - \pi_F(\tau, \xi, \sigma)$ or (in terms of $\alpha$ and $\xi$) $1 - \tilde\pi_F(\alpha, \xi, \sigma)$.
A property of the noncentral F distribution. The power function $\tilde\pi_F(\alpha, \xi, \sigma)$ or $\pi_F(\tau, \xi, \sigma)$ of the F test of $\tilde H_0$ or $H_0$ depends on the model's parameters only through the noncentrality parameter $\lambda$ of a noncentral F distribution with numerator degrees of freedom $M^*$ and denominator degrees of freedom $N - P^*$. An important characteristic of this dependence is discernible from the following lemma.
Lemma 7.3.1. Let $w$ represent a random variable that has an $SF(r, s, \lambda)$ distribution (where $0 < r < \infty$, $0 < s < \infty$, and $0 \le \lambda < \infty$). Then, for any (strictly) positive constant $c$, $\Pr(w > c)$ is a strictly increasing function of $\lambda$.
Preliminary to proving Lemma 7.3.1, it is convenient to establish (in the form of the following 2 lemmas) some results on central or noncentral chi-square distributions.
Lemma 7.3.2. Let $u$ represent a random variable that has a $\chi^2(r)$ distribution (where $0 < r < \infty$). Then, for any (strictly) positive constant $c$, $\Pr(u > c)$ is a strictly increasing function of $r$.
Proof (of Lemma 7.3.2). Let $v$ represent a random variable that is distributed independently of $u$ as $\chi^2(s)$ (where $0 < s < \infty$). Then, $u + v \sim \chi^2(r + s)$, and it suffices to observe that
$$\Pr(u + v > c) = \Pr(u > c) + \Pr(u \le c, \, v > c - u) > \Pr(u > c). \quad\text{Q.E.D.}$$
Lemma 7.3.3. Let $u$ represent a random variable that has a $\chi^2(r, \lambda)$ distribution (where $0 < r < \infty$ and $0 \le \lambda < \infty$). Then, for any (strictly) positive constant $c$, $\Pr(u > c)$ is a strictly increasing function of $\lambda$.
Proof (of Lemma 7.3.3). Let $h(\cdot)$ represent the pdf of the $\chi^2(r, \lambda)$ distribution. Further, for $j = 1, 2, 3, \ldots$, let $g_j(\cdot)$ represent the pdf of the $\chi^2(j)$ distribution, and let $v_j$ represent a random variable that has a $\chi^2(j)$ distribution. Then, making use of expression (6.2.14), we find that (for $\lambda > 0$)
$$\frac{d\Pr(u > c)}{d\lambda} = \int_c^\infty \frac{\partial h(u)}{\partial\lambda}\,du = \int_c^\infty \sum_{k=0}^\infty \frac{d\,[(\lambda/2)^k e^{-\lambda/2}/k!]}{d\lambda}\,g_{2k+r}(u)\,du$$
$$= \int_c^\infty \Big\{-\tfrac12 e^{-\lambda/2}g_r(u) + \tfrac12\sum_{k=1}^\infty\big[k(\lambda/2)^{k-1}e^{-\lambda/2}/k! - (\lambda/2)^k e^{-\lambda/2}/k!\big]g_{2k+r}(u)\Big\}\,du$$
$$= \int_c^\infty \tfrac12\Big\{\sum_{j=1}^\infty\big[(\lambda/2)^{j-1}e^{-\lambda/2}/(j-1)!\big]g_{2j+r}(u) - \sum_{k=0}^\infty\big[(\lambda/2)^k e^{-\lambda/2}/k!\big]g_{2k+r}(u)\Big\}\,du$$
$$= \int_c^\infty \tfrac12\Big\{\sum_{k=0}^\infty\big[(\lambda/2)^k e^{-\lambda/2}/k!\big]g_{2k+r+2}(u) - \sum_{k=0}^\infty\big[(\lambda/2)^k e^{-\lambda/2}/k!\big]g_{2k+r}(u)\Big\}\,du$$
$$= \tfrac12\sum_{k=0}^\infty\big[(\lambda/2)^k e^{-\lambda/2}/k!\big]\big[\Pr(v_{2k+r+2} > c) - \Pr(v_{2k+r} > c)\big].$$
And in light of Lemma 7.3.2, it follows that $\dfrac{d\Pr(u > c)}{d\lambda} > 0$ and hence that $\Pr(u > c)$ is a strictly increasing function of $\lambda$. Q.E.D.
Proof (of Lemma 7.3.1). Let $u$ and $v$ represent random variables that are distributed independently as $\chi^2(r, \lambda)$ and $\chi^2(s)$, respectively. Then, observing that $w \sim \dfrac{u/r}{v/s}$ and denoting by $g(\cdot)$ the pdf of the distribution of $v$, we find that $\Pr(w > c) = \int_0^\infty \Pr(u > crv/s)\,g(v)\,dv$, which (in light of Lemma 7.3.3) is a strictly increasing function of $\lambda$. Q.E.D.
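Lemma 7.3.3 can be checked against SciPy's noncentral chi-square distribution: for fixed $r$ and $c$, the upper tail probability increases strictly with $\lambda$. A sketch with arbitrary illustrative values:

    import numpy as np
    from scipy.stats import ncx2

    r, c = 5.0, 9.0                         # illustrative degrees of freedom and cutoff
    lams = np.array([0.5, 1.0, 2.0, 5.0, 10.0])
    tails = ncx2.sf(c, r, lams)             # Pr(u > c) for each noncentrality lambda
    print(tails)                            # strictly increasing in lambda
    assert np.all(np.diff(tails) > 0)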
in which case $\phi(y)$ is the critical function for a size-$\bar P$ test of the null hypothesis $H_0\colon \tau = \tau^{(0)}$ (versus the alternative hypothesis $H_1\colon \tau \ne \tau^{(0)}$) or, equivalently, of $\tilde H_0\colon \alpha = \alpha^{(0)}$ (versus $\tilde H_1\colon \alpha \ne \alpha^{(0)}$) and $\tilde\phi(z) = \phi(Oz)$. Or, more generally, define $\phi(y)$ to be the critical function for an arbitrary size-$\bar P$ test of $H_0$ or $\tilde H_0$ (versus $H_1$ or $\tilde H_1$), and define $\tilde\phi(z) = \phi(Oz)$.
Taking $u$ to be an $N \times 1$ unobservable random vector that has an $N(0, I)$ distribution or some other absolutely continuous spherical distribution with mean 0 and variance-covariance matrix $I$ and assuming that $y \sim X\beta + \sigma u$ and hence that $z \sim (\alpha', \xi', 0')' + \sigma u$, let $\tilde T(\cdot)$ represent a one-to-one (linear) transformation from $\mathbb{R}^N$ onto $\mathbb{R}^N$ for which there exist corresponding one-to-one transformations $\tilde F_1(\cdot)$ from $\mathbb{R}^{M^*}$ onto $\mathbb{R}^{M^*}$, $\tilde F_2(\cdot)$ from $\mathbb{R}^{P^*-M^*}$ onto $\mathbb{R}^{P^*-M^*}$, and $\tilde F_3(\cdot)$ from the interval $(0, \infty)$ onto $(0, \infty)$ [where $\tilde T(\cdot)$, $\tilde F_1(\cdot)$, $\tilde F_2(\cdot)$, and $\tilde F_3(\cdot)$ do not depend on $\beta$ or $\sigma$] such that
$$\tilde T(z) \sim \begin{pmatrix} \tilde F_1(\alpha) \\ \tilde F_2(\xi) \\ 0 \end{pmatrix} + \tilde F_3(\sigma)\,u, \tag{3.57}$$
so that the problem of making inferences about the vector $\tilde F_1(\alpha)$ on the basis of the transformed vector $\tilde T(z)$ is of the same general form as that of making inferences about $\alpha$ on the basis of $z$. Further, let $\tilde G$ represent a group of such transformations; refer, e.g., to Casella and Berger (2002, sec. 6.4) for the definition of a group. And let us write $\tilde T(z_1, z_2, z_3)$ for $\tilde T(z)$ whenever it is convenient to do so.
In making a choice for the $100(1 - \bar P)\%$ confidence set $A$ (for $\tau$), there would seem to be some appeal in restricting attention to choices for which the corresponding $100(1 - \bar P)\%$ confidence set $\tilde A$ (for $\alpha$) is such that, for every value of $z$ and for every transformation $\tilde T(\cdot)$ in $\tilde G$,
$$\tilde A[\tilde T(z)] = \{\ddot\alpha : \ddot\alpha = \tilde F_1(\dot\alpha), \; \dot\alpha \in \tilde A(z)\}. \tag{3.58}$$
A choice for $\tilde A$ having this property is said to be invariant or equivariant with respect to $\tilde G$, with the term invariant being reserved for the special case where
$$\tilde F_1(\dot\alpha) = \dot\alpha \;\text{ for every } \dot\alpha \in \mathbb{R}^{M^*} \text{ [and every } \tilde T(\cdot) \in \tilde G]; \tag{3.59}$$
in that special case, condition (3.58) simplifies to $\tilde A[\tilde T(z)] = \tilde A(z)$.
Clearly,
$$\alpha = \alpha^{(0)} \;\Leftrightarrow\; \tilde F_1(\alpha) = \tilde F_1(\alpha^{(0)}). \tag{3.60}$$
Suppose that
$$\tilde F_1(\alpha^{(0)}) = \alpha^{(0)} \;\text{ [for every } \tilde T(\cdot) \in \tilde G]. \tag{3.61}$$
Then, in making a choice for a size-$\bar P$ test of the null hypothesis $\tilde H_0\colon \alpha = \alpha^{(0)}$ (versus the alternative hypothesis $\tilde H_1\colon \alpha \ne \alpha^{(0)}$), there would seem to be some appeal in restricting attention to choices for which the critical function $\tilde\phi(\cdot)$ is such that, for every value of $z$ and for every transformation $\tilde T(\cdot)$ in $\tilde G$,
$$\tilde\phi[\tilde T(z)] = \tilde\phi(z); \tag{3.62}$$
the appeal may be enhanced if condition (3.59) [which is more restrictive than condition (3.61)] is satisfied. A choice for a size-$\bar P$ test (of $\tilde H_0$ versus $\tilde H_1$) having this property is said to be invariant with respect to $\tilde G$ (as is the critical function itself). In the special case where condition (3.59) is satisfied, a size-$\bar P$ test with critical function of the form (3.56) is invariant with respect to $\tilde G$ for every hypothesized value $\alpha^{(0)} \in \mathbb{R}^{M^*}$ if and only if the corresponding $100(1 - \bar P)\%$ confidence set $\tilde A$ is invariant with respect to $\tilde G$.
The characterization of equivariance and invariance can be recast in terms of $\tau$ and/or in terms of transformations of $y$. Corresponding to the one-to-one transformation $\tilde F_1(\cdot)$ (from $\mathbb{R}^{M^*}$ onto $\mathbb{R}^{M^*}$) is the one-to-one transformation $F_1(\cdot)$ from $\mathcal{C}(\Lambda')$ onto $\mathcal{C}(\Lambda')$ defined as follows:
$$F_1(\dot\tau) = W'\tilde F_1(S'\dot\tau) \quad\text{[for every } \dot\tau \in \mathcal{C}(\Lambda')] \tag{3.63}$$
or, equivalently,
$$\tilde F_1(\dot\alpha) = S'F_1(W'\dot\alpha) \quad\text{(for every } \dot\alpha \in \mathbb{R}^{M^*}). \tag{3.64}$$
And corresponding to the one-to-one transformation $\tilde T(\cdot)$ (from $\mathbb{R}^N$ onto $\mathbb{R}^N$) is the one-to-one transformation $T(\cdot)$ (also from $\mathbb{R}^N$ onto $\mathbb{R}^N$) defined as follows:
$$T(y) = O\tilde T(O'y) \quad\text{or, equivalently,}\quad \tilde T(z) = O'T(Oz). \tag{3.65}$$
Clearly,
$$T(y) = O\tilde T(z) \sim O\begin{pmatrix} \tilde F_1(\alpha) \\ \tilde F_2(\xi) \\ 0 \end{pmatrix} + \tilde F_3(\sigma)\,u. \tag{3.66}$$
Moreover,
$$O\begin{pmatrix} \tilde F_1(\alpha) \\ \tilde F_2(\xi) \\ 0 \end{pmatrix} = X\tilde RSS'F_1(\tau) + U\tilde F_2(\xi) = X\tilde RC^-F_1(\tau) + U\tilde F_2(\xi), \tag{3.67}$$
as can be readily verified.
Now, let $G$ represent the group of transformations of $y$ obtained upon reexpressing each of the transformations (of $z$) in the group $\tilde G$ in terms of $y$, so that $T(\cdot) \in G$ if and only if $\tilde T(\cdot) \in \tilde G$, where $\tilde T(\cdot)$ is the unique transformation (of $z$) that corresponds to the transformation $T(\cdot)$ (of $y$) in the sense determined by relationship (3.65). Then, condition (3.59) is satisfied if and only if
$$F_1(\dot\tau) = \dot\tau \;\text{ for every } \dot\tau \in \mathcal{C}(\Lambda') \text{ [and every } T(\cdot) \in G], \tag{3.68}$$
and condition (3.61) is satisfied if and only if
$$F_1(\tau^{(0)}) = \tau^{(0)} \;\text{ [for every } T(\cdot) \in G]; \tag{3.69}$$
the equivalence of conditions (3.59) and (3.68) and the equivalence of conditions (3.61) and (3.69) can be readily verified.
Further, the $100(1 - \bar P)\%$ confidence set $\tilde A(z)$ for $\alpha$ is equivariant or invariant with respect to $\tilde G$ if and only if the corresponding $100(1 - \bar P)\%$ confidence set $A(y)$ for $\tau$ is respectively equivariant or, in the special case where condition (3.68) is satisfied, invariant in the sense that, for every value of $y$ and for every transformation $T(\cdot)$ in $G$,
$$A[T(y)] = \{\ddot\tau : \ddot\tau = F_1(\dot\tau), \; \dot\tau \in A(y)\} \tag{3.70}$$
(in the special case where condition (3.68) is satisfied, condition (3.70) simplifies to $A[T(y)] = A(y)$). To see this, observe that [for $\dot\tau$ and $\ddot\tau$ in $\mathcal{C}(\Lambda')$]
$$\dot\tau \in A(y) \;\Leftrightarrow\; \dot\tau \in A(Oz) \;\Leftrightarrow\; S'\dot\tau \in \tilde A(z),$$
that
$$\ddot\tau = F_1(\dot\tau) \;\Leftrightarrow\; \ddot\tau = W'\tilde F_1(S'\dot\tau) \;\Leftrightarrow\; S'\ddot\tau = \tilde F_1(S'\dot\tau),$$
and that
$$\ddot\tau \in A[T(y)] \;\Leftrightarrow\; S'\ddot\tau \in \tilde A[\tilde T(z)].$$
Now, consider the size-$\bar P$ test of $H_0$ or $\tilde H_0$ (versus $H_1$ or $\tilde H_1$) with critical function $\tilde\phi(z)$. This test is identical to that with critical function $\phi(y)$ [$= \tilde\phi(O'y)$]. And assuming that condition (3.61) or equivalently condition (3.69) is satisfied, the test and the critical function $\tilde\phi(z)$ are invariant with respect to $\tilde G$ [in the sense (3.62)] if and only if the test and the critical function $\phi(y)$ are invariant with respect to $G$ in the sense that [for every value of $y$ and for every transformation $T(\cdot)$ in $G$]
$$\phi[T(y)] = \phi(y), \tag{3.71}$$
as is evident upon observing that $\phi(y) = \tilde\phi(z)$ and that $\phi[T(y)] = \tilde\phi[O'T(y)] = \tilde\phi[\tilde T(z)]$.
where $k$ is a strictly positive scalar. Note that in the special case where $\alpha^{(0)} = 0$, equality (3.74) simplifies to
$$\tilde T_2(z; 0) = kz. \tag{3.75}$$
The transformation $\tilde T_2(z; \alpha^{(0)})$ is a special case of the transformation $\tilde T(z)$. In this special case,
$$\tilde F_1(\alpha) = \alpha^{(0)} + k(\alpha - \alpha^{(0)}), \quad \tilde F_2(\xi) = k\xi, \quad\text{and}\quad \tilde F_3(\sigma) = k\sigma.$$
Denote by $\tilde G_2(\alpha^{(0)})$ the group of transformations (of $z$) of the form (3.74), so that a transformation (of $z$) of the general form $\tilde T(z)$ is contained in $\tilde G_2(\alpha^{(0)})$ if and only if, for some $k$ ($> 0$), $\tilde T(\cdot) = \tilde T_2(\cdot\,; \alpha^{(0)})$. Further, for $\tau^{(0)} \in \mathcal{C}(\Lambda')$, take $T_2(y; \tau^{(0)})$ to be the transformation of $y$ determined from the transformation $\tilde T_2(z; S'\tau^{(0)})$ in accordance with relationship (3.65); and take $G_2(\tau^{(0)})$ to be the group of all such transformations, so that $T_2(\cdot\,; \tau^{(0)}) \in G_2(\tau^{(0)})$ if and only if, for some transformation $\tilde T_2(\cdot\,; S'\tau^{(0)}) \in \tilde G_2(S'\tau^{(0)})$, $T_2(y; \tau^{(0)}) = O\tilde T_2(O'y; S'\tau^{(0)})$. The group $G_2(\tau^{(0)})$ does not vary with the choice of the matrices $S$, $U$, and $L$, as can be demonstrated via a relatively straightforward exercise. In the particularly simple special case where $\tau^{(0)} = 0$, we have that
$$T_2(y; 0) = ky. \tag{3.76}$$
Clearly, the F test (of $\tilde H_0$ or $H_0$) is invariant with respect to the group $\tilde G_2(\alpha^{(0)})$ or $G_2(\tau^{(0)})$ of transformations of the form $\tilde T_2(\cdot\,; \alpha^{(0)})$ or $T_2(\cdot\,; \tau^{(0)})$. And the corresponding confidence sets $\tilde A_F(z)$ and $A_F(y)$ are equivariant with respect to $\tilde G_2(\alpha^{(0)})$ or $G_2(\tau^{(0)})$.
Invariance/equivariance with respect to groups of orthogonal transformations. The F test and the corresponding confidence sets are invariant with respect to various groups of orthogonal transformations (of the vector $z$). For $\alpha^{(0)} \in \mathbb{R}^{M^*}$, let $\tilde T_3(z; \alpha^{(0)})$ represent a one-to-one transformation from $\mathbb{R}^N$ onto $\mathbb{R}^N$ of the following form:
$$\tilde T_3(z; \alpha^{(0)}) = \begin{pmatrix} \alpha^{(0)} + P'(z_1 - \alpha^{(0)}) \\ z_2 \\ z_3 \end{pmatrix} \tag{3.77}$$
(where $P$ is an $M^* \times M^*$ orthogonal matrix).
Denote by $\tilde G_3(\alpha^{(0)})$ the group of transformations (of $z$) of the form (3.77) and by $\tilde G_4$ the group of those of the form (3.79), so that a transformation (of $z$) of the general form $\tilde T(z)$ is contained in $\tilde G_3(\alpha^{(0)})$ if and only if, for some ($M^* \times M^*$ orthogonal matrix) $P$, $\tilde T(\cdot) = \tilde T_3(\cdot\,; \alpha^{(0)})$ and is contained in $\tilde G_4$ if and only if, for some [$(N - P^*) \times (N - P^*)$ orthogonal matrix] $B$, $\tilde T(\cdot) = \tilde T_4(\cdot)$. Further, take $T_4(y)$ to be the transformation of $y$ determined from the transformation $\tilde T_4(z)$ in accordance with relationship (3.65), and [for $\tau^{(0)} \in \mathcal{C}(\Lambda')$] take $T_3(y; \tau^{(0)})$ to be that determined from the transformation $\tilde T_3(z; S'\tau^{(0)})$. And take $G_4$ and $G_3(\tau^{(0)})$ to be the respective groups of all such transformations, so that $T_4(\cdot) \in G_4$ if and only if, for some transformation $\tilde T_4(\cdot) \in \tilde G_4$, $T_4(y) = O\tilde T_4(O'y)$ and $T_3(\cdot\,; \tau^{(0)}) \in G_3(\tau^{(0)})$ if and only if, for some transformation $\tilde T_3(\cdot\,; S'\tau^{(0)}) \in \tilde G_3(S'\tau^{(0)})$, $T_3(y; \tau^{(0)}) = O\tilde T_3(O'y; S'\tau^{(0)})$. The groups $G_3(\tau^{(0)})$ and $G_4$, like the groups $G_0$, $G_1$, $G_{01}$, and $G_2(\tau^{(0)})$, do not vary with the choice of the matrices $S$, $U$, and $L$.
Clearly, the F test (of $\tilde H_0$ or $H_0$) is invariant with respect to both the group $\tilde G_3(\alpha^{(0)})$ or $G_3(\tau^{(0)})$ of transformations of the form $\tilde T_3(\cdot\,; \alpha^{(0)})$ or $T_3(\cdot\,; \tau^{(0)})$ and the group $\tilde G_4$ or $G_4$ of transformations of the form $\tilde T_4(\cdot)$ or $T_4(\cdot)$. And the corresponding confidence sets $\tilde A_F(z)$ and $A_F(y)$ are equivariant with respect to $\tilde G_3(\alpha^{(0)})$ or $G_3(\tau^{(0)})$ and are invariant with respect to $\tilde G_4$ or $G_4$.
In regard to requirement (3.81) and in what follows, it is assumed that there exists an $M \times 1$ vector $\tau^{(0)} \in \mathcal{C}(\Lambda')$ such that $\theta_\delta^{(0)} = \delta'\tau^{(0)}$ (for every $\delta \in \Delta$), so that the collection $H_0^{(\delta)}$ ($\delta \in \Delta$) of null hypotheses is "internally consistent." Corresponding to a collection $A_\delta$ ($\delta \in \Delta$) of confidence intervals or sets that satisfies requirement (3.80) (i.e., for which the probability of simultaneous coverage equals $1 - \bar P$) is a collection of tests of $H_0^{(\delta)}$ versus $H_1^{(\delta)}$ ($\delta \in \Delta$) with critical regions $C^{(\delta)}$ ($\delta \in \Delta$) defined (implicitly) as follows:
$$y \in C^{(\delta)} \;\Leftrightarrow\; \theta_\delta^{(0)} \notin A_\delta(y). \tag{3.82}$$
As can be readily verified (via an argument that will subsequently be demonstrated), this collection of tests satisfies condition (3.81).
The null hypothesis $H_0^{(\delta)}$ can be thought of as representing a comparison between the "actual" or "true" value of some entity (e.g., a difference in effect between two "treatments") and a hypothesized value (e.g., 0). Accordingly, the null hypotheses forming the collection $H_0^{(\delta)}$ ($\delta \in \Delta$) may be referred to as multiple comparisons, and procedures for testing these null hypotheses in a way that accounts for the multiplicity may be referred to as multiple-comparison procedures.
A reformulation. The problem of making inferences about the linear combinations (of the elements $\tau_1, \tau_2, \ldots, \tau_M$ of $\tau$) forming the collection $\delta'\tau$ ($\delta \in \Delta$) is reexpressible in terms associated with the canonical form of the G–M model. Adopting the notation and recalling the results of Subsection a, the linear combination $\theta = \delta'\tau$ (of $\tau_1, \tau_2, \ldots, \tau_M$) is reexpressible as a linear combination of the $M^*$ elements of the vector $\alpha$ ($= S'\tau = S'\Lambda'\beta$). Making use of result (3.8), we find that
$$\theta = \delta'W'\alpha = (W\delta)'\alpha. \tag{3.83}$$
Moreover, expression (3.83) is unique; that is, if $\tilde\delta$ is an $M^* \times 1$ vector of constants such that (for every value of $\beta$) $\theta = \tilde\delta'\alpha$, then
$$\tilde\delta = W\delta. \tag{3.84}$$
To see this, observe that if $\tilde\delta'\alpha = \delta'\tau$ (for every value of $\beta$), then $(\tilde\delta - W\delta)'\alpha = 0$ for every value of $\alpha$ (as $\beta$ ranges over $\mathbb{R}^P$, $\alpha$ ranges over all of $\mathbb{R}^{M^*}$) and hence $\tilde\delta = W\delta$.
Simultaneous confidence intervals: a general approach. Suppose that the distribution of the vector $e$ of residual effects in the G–M model or, more generally, the distribution of the vector $\begin{pmatrix} \hat\alpha - \alpha \\ d \end{pmatrix}$ is MVN or is some other absolutely continuous spherical distribution (with mean 0 and variance-covariance matrix $\sigma^2 I$). Then, making use of result (6.4.67) and letting $d_1, d_2, \ldots, d_{N-P^*}$ represent the elements of $d$, we find that $(1/\hat\sigma)(\hat\alpha - \alpha) \sim MVt(N - P^*, I_{M^*})$ [where $\hat\sigma^2 = \sum_i d_i^2/(N - P^*)$]. And letting $t$ represent an $M^* \times 1$ random vector that has an $MVt(N - P^*, I_{M^*})$ distribution, it follows that
$$\max_{\{\tilde\delta\in\tilde\Delta \,:\, \tilde\delta\ne 0\}} \frac{|\tilde\delta'(\hat\alpha - \alpha)|}{(\tilde\delta'\tilde\delta)^{1/2}\hat\sigma} \;\sim\; \max_{\{\tilde\delta\in\tilde\Delta \,:\, \tilde\delta\ne 0\}} \frac{|\tilde\delta' t|}{(\tilde\delta'\tilde\delta)^{1/2}}. \tag{3.87}$$
Thus, letting (for any scalar $\bar P$ such that $0 < \bar P < 1$) $\bar c_{\bar P}$ represent the upper $100\bar P\%$ point of the distribution of the random variable $\max_{\{\tilde\delta\in\tilde\Delta : \tilde\delta\ne 0\}} |\tilde\delta' t|/(\tilde\delta'\tilde\delta)^{1/2}$, we find that
$$\Pr\big[\,|\tilde\delta'(\hat\alpha - \alpha)| \le (\tilde\delta'\tilde\delta)^{1/2}\hat\sigma\,\bar c_{\bar P} \text{ for every } \tilde\delta \in \tilde\Delta\,\big] = 1 - \bar P.$$
For $\tilde\delta \in \tilde\Delta$, denote by $\tilde A_{\tilde\delta}(z)$ or simply by $\tilde A_{\tilde\delta}$ a set of $\theta$-values (i.e., a set of scalars), the contents of which may depend on the value of the vector $z$ (defined in Subsection a). Further, suppose that
$$\tilde A_{\tilde\delta} = \{\dot\theta \in \mathbb{R}^1 : \dot\theta = \tilde\delta'\dot\alpha, \; |\tilde\delta'(\hat\alpha - \dot\alpha)| \le (\tilde\delta'\tilde\delta)^{1/2}\hat\sigma\,\bar c_{\bar P}\}. \tag{3.89}$$
Clearly, the set (3.89) is reexpressible in the form
$$\tilde A_{\tilde\delta} = \{\dot\theta \in \mathbb{R}^1 : \tilde\delta'\hat\alpha - (\tilde\delta'\tilde\delta)^{1/2}\hat\sigma\,\bar c_{\bar P} \le \dot\theta \le \tilde\delta'\hat\alpha + (\tilde\delta'\tilde\delta)^{1/2}\hat\sigma\,\bar c_{\bar P}\}. \tag{3.90}$$
Clearly,
$$\Pr\Big(z \in \bigcup_{\{\tilde\delta\in\tilde\Delta : \tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}\}} \tilde C^{(\tilde\delta)}\Big) = 1 - \Pr\Big(z \notin \bigcup_{\{\tilde\delta\in\tilde\Delta : \tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}\}} \tilde C^{(\tilde\delta)}\Big)$$
$$= 1 - \Pr[\tilde\delta'\alpha \in \tilde A_{\tilde\delta}(z) \text{ for every } \tilde\delta \in \tilde\Delta \text{ such that } \tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}]$$
$$\le 1 - \Pr[\tilde\delta'\alpha \in \tilde A_{\tilde\delta}(z) \text{ for every } \tilde\delta \in \tilde\Delta],$$
with equality holding when $\tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}$ for every $\tilde\delta \in \tilde\Delta$. Thus, if the probability $\Pr[\tilde\delta'\alpha \in \tilde A_{\tilde\delta}(z) \text{ for every } \tilde\delta \in \tilde\Delta]$ of simultaneous coverage of the sets $\tilde A_{\tilde\delta}(z)$ ($\tilde\delta \in \tilde\Delta$) is greater than or equal to $1 - \bar P$, then
$$\Pr\Big(z \in \bigcup_{\{\tilde\delta\in\tilde\Delta : \tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}\}} \tilde C^{(\tilde\delta)}\Big) \le \bar P. \tag{3.98}$$
Moreover, if the probability of simultaneous coverage of the sets $\tilde A_{\tilde\delta}(z)$ ($\tilde\delta \in \tilde\Delta$) equals $1 - \bar P$, then equality is attained in inequality (3.98) when $\tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}$ for every $\tilde\delta \in \tilde\Delta$.
Now, suppose that $\alpha^{(0)}$ is the unique value of $\alpha$ that satisfies condition (3.85) (i.e., the condition $\tau^{(0)} = W'\alpha^{(0)}$). And observe that (by definition) $\tilde\delta \in \tilde\Delta$ if and only if $\tilde\delta = W\delta$ for some $\delta \in \Delta$. Observe also [in light of result (3.8)] that for $\delta \in \Delta$ and $\tilde\delta = W\delta$,
$$\delta'\tau = \delta'W'\alpha = \tilde\delta'\alpha \quad\text{and}\quad \delta'\tau^{(0)} = \delta'W'\alpha^{(0)} = \tilde\delta'\alpha^{(0)},$$
in which case $H_0^{(\delta)}$ is equivalent to $\tilde H_0^{(\tilde\delta)}$ and $H_1^{(\delta)}$ to $\tilde H_1^{(\tilde\delta)}$.
The test of $H_0^{(\delta)}$ (versus $H_1^{(\delta)}$) with critical region $C^{(\delta)}$ [defined (implicitly) by relationship (3.82)] is related to the test of $\tilde H_0^{(W\delta)}$ (versus $\tilde H_1^{(W\delta)}$) with critical region $\tilde C^{(W\delta)}$ [defined by expression (3.97)]. For $\delta \in \Delta$,
$$C^{(\delta)} = \{y : \delta'\tau^{(0)} \notin A_\delta(y)\} = \{y : \delta'W'\alpha^{(0)} \notin \tilde A_{W\delta}(O'y)\} = \{y : O'y \in \tilde C^{(W\delta)}\}. \tag{3.99}$$
And the probability of one or more false rejections is expressible as
$$\Pr\Big(y \in \bigcup_{\{\delta\in\Delta : \delta'\tau = \delta'\tau^{(0)}\}} C^{(\delta)}\Big) = \Pr\Big(z \in \bigcup_{\{\tilde\delta\in\tilde\Delta : \tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}\}} \tilde C^{(\tilde\delta)}\Big). \tag{3.100}$$
TABLE 7.2. Value of $[M\bar F_{\bar P}(M, N - P^*)]^{1/2}$ for selected values of $M$, $N - P^*$, and $\bar P$.

             N-P* = 10                N-P* = 25                N-P* = infinity
  M    P=.01  P=.10  P=.50     P=.01  P=.10  P=.50     P=.01  P=.10  P=.50
  1     3.17   1.81   0.70      2.79   1.71   0.68      2.58   1.64   0.67
  2     3.89   2.42   1.22      3.34   2.25   1.19      3.03   2.15   1.18
  3     4.43   2.86   1.59      3.75   2.64   1.56      3.37   2.50   1.54
  4     4.90   3.23   1.90      4.09   2.96   1.86      3.64   2.79   1.83
  5     5.31   3.55   2.16      4.39   3.23   2.11      3.88   3.04   2.09
 10     6.96   4.82   3.16      5.59   4.32   3.10      4.82   4.00   3.06
 20     9.39   6.63   4.55      7.35   5.86   4.46      6.13   5.33   4.40
 40    12.91   9.23   6.49      9.91   8.07   6.36      7.98   7.20   6.27
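The tabulated factor is straightforward to reproduce (a sketch using SciPy; f.isf gives the upper $100\bar P\%$ point of the F distribution, and the $N - P^* = \infty$ column is approximated by a very large denominator degrees of freedom):

    import numpy as np
    from scipy.stats import f

    for M in (1, 2, 3, 4, 5, 10, 20, 40):
        row = [np.sqrt(M * f.isf(p, M, df2))
               for df2 in (10, 25, 1e9)      # 1e9 approximates N - P* = infinity
               for p in (0.01, 0.10, 0.50)]
        print(M, np.round(row, 2))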
$$\Pr\Big(y \in \bigcup_{\{\delta\in\mathbb{R}^M : \delta'\tau = \delta'\tau^{(0)}\}} C^{(\delta)}\Big) \le \bar P, \tag{3.111}$$
where (for $\delta \in \mathbb{R}^M$)
$$C^{(\delta)} = \{y : |\delta'(\hat\tau - \tau^{(0)})| > (\delta'C\delta)^{1/2}\hat\sigma\,[M\bar F_{\bar P}(M, N - P^*)]^{1/2}\}, \tag{3.112}$$
and that equality holds in inequality (3.111) when $\delta'\tau = \delta'\tau^{(0)}$ for every $\delta \in \mathbb{R}^M$.
The use of the interval (3.110) as a means for obtaining a confidence set for $\delta'\tau$ (for every $\delta \in \mathbb{R}^M$) and the use of the critical region (3.112) as a means for obtaining a test of $H_0^{(\delta)}$ (versus $H_1^{(\delta)}$) (for every $\delta \in \mathbb{R}^M$) are known as Scheffé's method or simply as the S method. The S method was proposed by Scheffé (1953, 1959).
The interval (3.110) (with end points $\delta'\hat\tau \pm (\delta'C\delta)^{1/2}\hat\sigma\,[M\bar F_{\bar P}(M, N - P^*)]^{1/2}$) is of length $2(\delta'C\delta)^{1/2}\hat\sigma\,[M\bar F_{\bar P}(M, N - P^*)]^{1/2}$, which is proportional to $[M\bar F_{\bar P}(M, N - P^*)]^{1/2}$ and depends on $M$, $N - P^*$, and $\bar P$; more generally, the interval (3.94) (for a linear combination $\delta'\tau$ such that $\delta \in \Delta$ and with end points $\delta'\hat\tau \pm (\delta'C\delta)^{1/2}\hat\sigma\,\bar c_{\bar P}$) is of length $2(\delta'C\delta)^{1/2}\hat\sigma\,\bar c_{\bar P}$. Note that when $M = 1$, the interval (3.110) is identical to the $100(1 - \bar P)\%$ confidence interval for $\delta'\tau$ [$= (\Lambda\delta)'\beta$] obtained via a one-at-a-time approach by applying formula (3.55). Table 7.2 gives the value of the factor $[M\bar F_{\bar P}(M, N - P^*)]^{1/2}$ for selected values of $M$, $N - P^*$, and $\bar P$. As is evident from the tabulated values, this factor increases rapidly as $M$ increases.
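For concreteness, the end points of interval (3.110) might be computed as in the following sketch (a hypothetical helper; the inputs are those of the interaction example in Subsection d, so the first interval printed should be roughly $(-4.65, 2.02)$):

    import numpy as np
    from scipy.stats import f

    def scheffe_interval(delta, tau_hat, C, sigma_hat, M, df2, P_bar):
        """End points of the S-method interval (3.110) for delta' tau."""
        half = np.sqrt(delta @ C @ delta) * sigma_hat * np.sqrt(M * f.isf(P_bar, M, df2))
        center = delta @ tau_hat
        return center - half, center + half

    C = np.diag([0.125, 0.213, 0.213])
    tau_hat = np.array([-1.31, -1.29, 0.66])
    print(scheffe_interval(np.array([1.0, 0.0, 0.0]), tau_hat, C, 3.30, 3, 10, 0.10))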
A connection. For $\delta \in \mathbb{R}^M$, let $A_\delta$ or $A_\delta(y)$ represent the interval (3.110) (of $\delta'\tau$-values) associated with the S method for obtaining confidence intervals [for $\delta'\tau$ ($\delta \in \mathbb{R}^M$)] having a probability of simultaneous coverage equal to $1 - \bar P$. And denote by $A$ or $A(y)$ the set (3.38) (of $\tau$-values), which as discussed earlier (in Part b) is a $100(1 - \bar P)\%$ confidence set for the vector $\tau$.
The sets $A_\delta$ ($\delta \in \mathbb{R}^M$) are related to the set $A$; their relationship is as follows: for every value of $y$,
$$A = \{\dot\tau \in \mathcal{C}(\Lambda') : \delta'\dot\tau \in A_\delta \text{ for every } \delta \in \mathbb{R}^M\}, \tag{3.113}$$
so that [for $\dot\tau \in \mathcal{C}(\Lambda')$]
$$\Pr[\delta'\dot\tau \in A_\delta(y) \text{ for every } \delta \in \mathbb{R}^M] = \Pr[\dot\tau \in A(y)]. \tag{3.114}$$
Moreover, relationship (3.113) implies that the sets $C^{(\delta)}$ ($\delta \in \mathbb{R}^M$), where $C^{(\delta)}$ is the critical region (3.112) (for testing the null hypothesis $H_0^{(\delta)}\colon \delta'\tau = \delta'\tau^{(0)}$ versus the alternative hypothesis $H_1^{(\delta)}\colon \delta'\tau \ne \delta'\tau^{(0)}$) associated with the S method of multiple comparisons, are related to the critical region $C_F$ associated with the F test of the null hypothesis $H_0\colon \tau = \tau^{(0)}$ versus the alternative hypothesis $H_1\colon \tau \ne \tau^{(0)}$; their relationship is as follows:
$$C_F = \{\dot y \in \mathbb{R}^N : \dot y \in C^{(\delta)} \text{ for some } \delta \in \mathbb{R}^M\}, \tag{3.115}$$
so that
$$\Pr[y \in C^{(\delta)} \text{ for some } \delta \in \mathbb{R}^M] = \Pr(y \in C_F). \tag{3.116}$$
Let us verify relationship (3.113). Taking (for $\tilde\delta \in \mathbb{R}^{M^*}$)
$$\tilde A_{\tilde\delta} = \{\dot\theta \in \mathbb{R}^1 : \tilde\delta'\hat\alpha - (\tilde\delta'\tilde\delta)^{1/2}\hat\sigma[M^*\bar F_{\bar P}(M^*, N - P^*)]^{1/2} \le \dot\theta \le \tilde\delta'\hat\alpha + (\tilde\delta'\tilde\delta)^{1/2}\hat\sigma[M^*\bar F_{\bar P}(M^*, N - P^*)]^{1/2}\} \tag{3.117}$$
and denoting by $\tilde A$ the set (3.38), relationship (3.113) is [in light of results (3.39) and (3.92)] equivalent to the following relationship: for every value of $z$,
$$\tilde A = \{\dot\alpha \in \mathbb{R}^{M^*} : \tilde\delta'\dot\alpha \in \tilde A_{\tilde\delta} \text{ for every } \tilde\delta \in \mathbb{R}^{M^*}\}. \tag{3.118}$$
Thus, it suffices to verify relationship (3.118).
Let $r = [M^*\hat\sigma^2\bar F_{\bar P}(M^*, N - P^*)]^{1/2}$. And (letting $\dot\alpha$ represent an arbitrary value of $\alpha$) observe that
$$\dot\alpha \in \tilde A \;\Leftrightarrow\; [(\dot\alpha - \hat\alpha)'(\dot\alpha - \hat\alpha)]^{1/2} \le r \tag{3.119}$$
and that (for $\tilde\delta \in \mathbb{R}^{M^*}$)
$$\tilde\delta'\dot\alpha \in \tilde A_{\tilde\delta} \;\Leftrightarrow\; |\tilde\delta'(\dot\alpha - \hat\alpha)| \le (\tilde\delta'\tilde\delta)^{1/2}r. \tag{3.120}$$
Now, suppose that $\dot\alpha \in \tilde A$. Then, upon applying result (2.4.10) (which is a special case of the Cauchy–Schwarz inequality) and result (3.119), we find (for every $M^* \times 1$ vector $\tilde\delta$) that
$$|\tilde\delta'(\dot\alpha - \hat\alpha)| \le (\tilde\delta'\tilde\delta)^{1/2}[(\dot\alpha - \hat\alpha)'(\dot\alpha - \hat\alpha)]^{1/2} \le (\tilde\delta'\tilde\delta)^{1/2}r,$$
so that [in light of result (3.120)] $\tilde\delta'\dot\alpha \in \tilde A_{\tilde\delta}$ for every $\tilde\delta \in \mathbb{R}^{M^*}$. Conversely, suppose that $\dot\alpha \notin \tilde A$, and take $\tilde\delta = \dot\alpha - \hat\alpha$. Then,
$$|\tilde\delta'(\dot\alpha - \hat\alpha)| = (\tilde\delta'\tilde\delta)^{1/2}[(\dot\alpha - \hat\alpha)'(\dot\alpha - \hat\alpha)]^{1/2} > (\tilde\delta'\tilde\delta)^{1/2}r,$$
thereby establishing [in light of result (3.120)] the existence of an $M^* \times 1$ vector $\tilde\delta$ such that $\tilde\delta'\dot\alpha \notin \tilde A_{\tilde\delta}$ and completing the verification of relationship (3.118).
In connection with relationship (3.113), it is worth noting that if the condition $\delta'\dot\tau \in A_\delta$ is satisfied by any particular nonnull vector $\delta$ in $\mathbb{R}^M$, then it is also satisfied by any vector $\dot\delta$ in $\mathbb{R}^M$ such that $\Lambda\dot\delta = \Lambda\delta$ or such that $\dot\delta \propto \delta$.
An extension. Result (3.118) relates $\tilde A_{\tilde\delta}$ ($\tilde\delta \in \mathbb{R}^{M^*}$) to $\tilde A$, where $\tilde A_{\tilde\delta}$ is the set (3.117) (of $\tilde\delta'\alpha$-values) and $\tilde A$ is the set (3.38) (of $\alpha$-values). This relationship can be extended to a broader class of sets.
Suppose that the set $\Delta$ of $M \times 1$ vectors is such that the set $\tilde\Delta = \{\tilde\delta : \tilde\delta = W\delta, \; \delta \in \Delta\}$ contains $M^*$ linearly independent ($M^* \times 1$) vectors, say $\tilde\delta_1, \tilde\delta_2, \ldots, \tilde\delta_{M^*}$. And for $\tilde\delta \in \tilde\Delta$, take $\tilde A_{\tilde\delta}$ to be interval (3.90) [which, when $\tilde\Delta = \mathbb{R}^{M^*}$, is identical to interval (3.117)], and take $\tilde A$ to be the set of $\alpha$-values defined as follows:
$$\tilde A = \Big\{\dot\alpha : \max_{\{\tilde\delta\in\tilde\Delta : \tilde\delta\ne 0\}} \frac{|\tilde\delta'(\hat\alpha - \dot\alpha)|}{(\tilde\delta'\tilde\delta)^{1/2}\hat\sigma} \le \bar c_{\bar P}\Big\} \tag{3.121}$$
or, equivalently,
$$\tilde A = \{\dot\alpha : \tilde\delta'\hat\alpha - (\tilde\delta'\tilde\delta)^{1/2}\hat\sigma\,\bar c_{\bar P} \le \tilde\delta'\dot\alpha \le \tilde\delta'\hat\alpha + (\tilde\delta'\tilde\delta)^{1/2}\hat\sigma\,\bar c_{\bar P} \text{ for every } \tilde\delta \in \tilde\Delta\} \tag{3.122}$$
(in the special case where $\tilde\Delta = \mathbb{R}^{M^*}$, this set is identical to the set (3.38), as is evident from result (3.118)). Further, for $\tilde\delta \notin \tilde\Delta$, take
$$\tilde A_{\tilde\delta} = \{\dot\theta \in \mathbb{R}^1 : \dot\theta = \tilde\delta'\dot\alpha, \; \dot\alpha \in \tilde A\}, \tag{3.123}$$
thereby extending the definition of $\tilde A_{\tilde\delta}$ to every $\tilde\delta \in \mathbb{R}^{M^*}$. Then, clearly, for every value of $z$, $\tilde A = \{\dot\alpha \in \mathbb{R}^{M^*} : \tilde\delta'\dot\alpha \in \tilde A_{\tilde\delta} \text{ for every } \tilde\delta \in \mathbb{R}^{M^*}\}$.
Let us now specialize to the case where (for $M^*$ linearly independent $M^* \times 1$ vectors $\tilde\delta_1, \tilde\delta_2, \ldots, \tilde\delta_{M^*}$) $\tilde\Delta = \{\tilde\delta_1, \tilde\delta_2, \ldots, \tilde\delta_{M^*}\}$. For $i = 1, 2, \ldots, M^*$, let $\theta_i = \tilde\delta_i'\alpha$ and $\hat\theta_i = \tilde\delta_i'\hat\alpha$. And observe that corresponding to any $M^* \times 1$ vector $\tilde\delta$, there exist (unique) scalars $k_1, k_2, \ldots, k_{M^*}$ such that $\tilde\delta = \sum_{i=1}^{M^*} k_i\tilde\delta_i$ and hence such that
$$\tilde\delta'\alpha = \sum_{i=1}^{M^*} k_i\theta_i \quad\text{and}\quad \tilde\delta'\hat\alpha = \sum_{i=1}^{M^*} k_i\hat\theta_i.$$
Observe also that the set $\tilde A$ [defined by expression (3.121) or (3.122)] is reexpressible as
$$\tilde A = \{\dot\alpha : \hat\theta_i - (\tilde\delta_i'\tilde\delta_i)^{1/2}\hat\sigma\,\bar c_{\bar P} \le \tilde\delta_i'\dot\alpha \le \hat\theta_i + (\tilde\delta_i'\tilde\delta_i)^{1/2}\hat\sigma\,\bar c_{\bar P} \;\; (i = 1, 2, \ldots, M^*)\}. \tag{3.128}$$
Moreover, for $k_i \ne 0$,
$$\hat\theta_i - (\tilde\delta_i'\tilde\delta_i)^{1/2}\hat\sigma\,\bar c_{\bar P} \le \tilde\delta_i'\dot\alpha \le \hat\theta_i + (\tilde\delta_i'\tilde\delta_i)^{1/2}\hat\sigma\,\bar c_{\bar P}$$
$$\Leftrightarrow\;\; k_i\hat\theta_i - |k_i|(\tilde\delta_i'\tilde\delta_i)^{1/2}\hat\sigma\,\bar c_{\bar P} \le k_i\tilde\delta_i'\dot\alpha \le k_i\hat\theta_i + |k_i|(\tilde\delta_i'\tilde\delta_i)^{1/2}\hat\sigma\,\bar c_{\bar P}$$
($i = 1, 2, \ldots, M^*$), so that the set $\tilde A_{\tilde\delta}$ [defined for $\tilde\delta \in \tilde\Delta$ by expression (3.90) and for $\tilde\delta \notin \tilde\Delta$ by expression (3.123)] is expressible (for every $\tilde\delta \in \mathbb{R}^{M^*}$) as
$$\tilde A_{\tilde\delta} = \Big\{\dot\theta \in \mathbb{R}^1 : \sum_{i=1}^{M^*} k_i\hat\theta_i - \sum_{i=1}^{M^*} |k_i|(\tilde\delta_i'\tilde\delta_i)^{1/2}\hat\sigma\,\bar c_{\bar P} \le \dot\theta \le \sum_{i=1}^{M^*} k_i\hat\theta_i + \sum_{i=1}^{M^*} |k_i|(\tilde\delta_i'\tilde\delta_i)^{1/2}\hat\sigma\,\bar c_{\bar P}\Big\}. \tag{3.129}$$
And the probability of simultaneous coverage of these sets is expressible as
$$\Pr\Big[\sum_{i=1}^{M^*} k_i\hat\theta_i - \sum_{i=1}^{M^*} |k_i|(\tilde\delta_i'\tilde\delta_i)^{1/2}\hat\sigma\,\bar c_{\bar P} \le \sum_{i=1}^{M^*} k_i\theta_i \le \sum_{i=1}^{M^*} k_i\hat\theta_i + \sum_{i=1}^{M^*} |k_i|(\tilde\delta_i'\tilde\delta_i)^{1/2}\hat\sigma\,\bar c_{\bar P}$$
$$\text{for all scalars } k_1, k_2, \ldots, k_{M^*}\Big] = 1 - \bar P. \tag{3.130}$$
Suppose (for purposes of illustration) that $M^* = 2$ and that $\tilde\delta_1 = (1, 0)'$ and $\tilde\delta_2 = (0, 1)'$. Then, letting $\hat\alpha_1$ and $\hat\alpha_2$ represent the elements of $\hat\alpha$, expression (3.128) for the set $\tilde A$ [defined by expression (3.121) or (3.122)] is reexpressible as
$$\tilde A = \{\dot\alpha = (\dot\alpha_1, \dot\alpha_2)' : \hat\alpha_i - \hat\sigma\,\bar c_{\bar P} \le \dot\alpha_i \le \hat\alpha_i + \hat\sigma\,\bar c_{\bar P} \;\; (i = 1, 2)\}. \tag{3.131}$$
FIGURE 7.2. Display of the sets (3.131) and (3.133) [in terms of the transformed coordinates $(\dot\alpha_1 - \hat\alpha_1)/\hat\sigma$ and $(\dot\alpha_2 - \hat\alpha_2)/\hat\sigma$] for the case where $\bar P = 0.10$ and $N - P^* = 10$; the set (3.131) is represented by the rectangular region and the set (3.133) by the circular region.
And expression (3.129) for the set $\tilde A_{\tilde\delta}$ [defined for $\tilde\delta \in \tilde\Delta$ by expression (3.90) and for $\tilde\delta \notin \tilde\Delta$ by expression (3.123)] is reexpressible as
$$\tilde A_{\tilde\delta} = \Big\{\dot\theta \in \mathbb{R}^1 : \sum_{i=1}^2 k_i\hat\alpha_i - \hat\sigma\,\bar c_{\bar P}\sum_{i=1}^2 |k_i| \le \dot\theta \le \sum_{i=1}^2 k_i\hat\alpha_i + \hat\sigma\,\bar c_{\bar P}\sum_{i=1}^2 |k_i|\Big\} \tag{3.132}$$
[where $\tilde\delta = (k_1, k_2)'$]. By way of comparison, consider the set $\tilde A_F$ (of $\alpha$-values) given by expression (3.38), which (in the case under consideration) is reexpressible as
$$\{\dot\alpha = (\dot\alpha_1, \dot\alpha_2)' : \textstyle\sum_{i=1}^2 (\dot\alpha_i - \hat\alpha_i)^2 \le 2\hat\sigma^2\bar F_{\bar P}(2, N - P^*)\}, \tag{3.133}$$
and the set (of $\tilde\delta'\alpha$-values) given by expression (3.117), which (in the present context) is reexpressible as
$$\tilde A_{\tilde\delta} = \Big\{\dot\theta \in \mathbb{R}^1 : \sum_{i=1}^2 k_i\hat\alpha_i - \hat\sigma[2\bar F_{\bar P}(2, N - P^*)]^{1/2}(k_1^2 + k_2^2)^{1/2} \le \dot\theta \le \sum_{i=1}^2 k_i\hat\alpha_i + \hat\sigma[2\bar F_{\bar P}(2, N - P^*)]^{1/2}(k_1^2 + k_2^2)^{1/2}\Big\}. \tag{3.134}$$
The two sets (3.131) and (3.133) of $\alpha$-values are displayed in Figure 7.2 [in terms of the transformed coordinates $(\dot\alpha_1 - \hat\alpha_1)/\hat\sigma$ and $(\dot\alpha_2 - \hat\alpha_2)/\hat\sigma$] for the case where $\bar P = 0.10$ and $N - P^* = 10$; in this case $\bar c_{\bar P} = 2.193$ and $\bar F_{\bar P}(2, N - P^*) = 2.924$. The set (3.131) is represented by the rectangular region, and the set (3.133) by the circular region. For each of these two sets, the probability of coverage is $1 - \bar P = 0.90$.
Interval (3.132) is of length $2\hat\sigma\,\bar c_{\bar P}(|k_1| + |k_2|)$, and interval (3.134) is of length $2\hat\sigma[2\bar F_{\bar P}(2, N - P^*)]^{1/2}(k_1^2 + k_2^2)^{1/2}$. Suppose that $k_1$ or $k_2$ is nonzero, in which case the length of both intervals is strictly positive (if both $k_1$ and $k_2$ were 0, both intervals would be of length 0). Further, let $v$ represent the ratio of the length of interval (3.132) to the length of interval (3.134), and let $u = k_1^2/(k_1^2 + k_2^2)$. And observe that
$$v = \frac{\bar c_{\bar P}\,(|k_1| + |k_2|)}{[2\bar F_{\bar P}(2, N - P^*)]^{1/2}(k_1^2 + k_2^2)^{1/2}} = \frac{\bar c_{\bar P}}{[2\bar F_{\bar P}(2, N - P^*)]^{1/2}}\,[u^{1/2} + (1 - u)^{1/2}],$$
so that $v$ can be regarded as a function of $u$. Observe also that $0 \le u \le 1$, that as $u$ increases from 0 to $\tfrac12$, $u^{1/2} + (1 - u)^{1/2}$ increases monotonically from 1 to $\sqrt2$, and that as $u$ increases from $\tfrac12$ to 1, $u^{1/2} + (1 - u)^{1/2}$ decreases monotonically from $\sqrt2$ to 1.
In Figure 7.3, $v$ is plotted as a function of $u$ for the case where $\bar P = 0.10$ and $N - P^* = 10$. When $\bar P = 0.10$ and $N - P^* = 10$, $v = 0.907[u^{1/2} + (1 - u)^{1/2}]$, and we find that $v > 1$ if $0.012 < u < 0.988$, that $v = 1$ if $u = 0.012$ or $u = 0.988$, and that $v < 1$ if $u < 0.012$ or $u > 0.988$; $u = 0.012$ when $|k_1/k_2| = 0.109$, and $u = 0.988$ when $|k_2/k_1| = 0.109$ (or, equivalently, when $|k_1/k_2| = 9.145$).

FIGURE 7.3. Plot (represented by the solid line) of $v = \bar c_{\bar P}[2\bar F_{\bar P}(2, N - P^*)]^{-1/2}[u^{1/2} + (1 - u)^{1/2}]$ as a function of $u = k_1^2/(k_1^2 + k_2^2)$ for the case where $\bar P = 0.10$ and $N - P^* = 10$.
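The crossover values of $u$ can be recovered by solving $v(u) = 1$, which reduces to a quadratic in $u$. A sketch using the values $\bar c_{\bar P} = 2.193$ and $\bar F_{\bar P}(2, 10) = 2.924$ quoted above:

    import numpy as np

    c_bar, F_upper = 2.193, 2.924      # values quoted for P_bar = 0.10, N - P* = 10
    v = lambda u: c_bar / np.sqrt(2 * F_upper) * (np.sqrt(u) + np.sqrt(1 - u))

    # v(u) = 1  <=>  sqrt(u) + sqrt(1-u) = s, with s = sqrt(2*F_upper)/c_bar;
    # squaring gives u(1-u) = ((s^2 - 1)/2)^2, a quadratic in u
    s = np.sqrt(2 * F_upper) / c_bar
    root = np.sqrt(max(0.0, 1 - (s * s - 1) ** 2)) / 2
    print(0.5 - root, 0.5 + root, v(0.5))   # about 0.012, 0.988, and 1.28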
Computational issues. Consider the set $\tilde A_{\tilde\delta}$ given for $\tilde\delta \in \tilde\Delta$ by expression (3.90) and extended to $\tilde\delta \notin \tilde\Delta$ by expression (3.123). These expressions involve (either explicitly or implicitly) the upper $100\bar P\%$ point $\bar c_{\bar P}$ of the distribution of the random variable $\max_{\{\tilde\delta\in\tilde\Delta : \tilde\delta\ne 0\}} |\tilde\delta' t|/(\tilde\delta'\tilde\delta)^{1/2}$. Let us consider the computation of $\bar c_{\bar P}$.
In certain special cases, the distribution of $\max_{\{\tilde\delta\in\tilde\Delta : \tilde\delta\ne 0\}} |\tilde\delta' t|/(\tilde\delta'\tilde\delta)^{1/2}$ is sufficiently simple that the computation of $\bar c_{\bar P}$ is relatively tractable. One such special case has already been considered. If $\tilde\Delta = \mathbb{R}^{M^*}$ or, more generally, if $\tilde\Delta$ is such that condition (3.107) is satisfied, then
$$\bar c_{\bar P} = [M^*\bar F_{\bar P}(M^*, N - P^*)]^{1/2}.$$
A second special case where the computation of $\bar c_{\bar P}$ is relatively tractable is that where for some nonnull $M^* \times 1$ orthogonal vectors $\tilde\delta_1, \tilde\delta_2, \ldots, \tilde\delta_{M^*}$,
$$\tilde\Delta = \{\tilde\delta_1, \tilde\delta_2, \ldots, \tilde\delta_{M^*}\}. \tag{3.135}$$
In that special case,
$$\max_{\{\tilde\delta\in\tilde\Delta : \tilde\delta\ne 0\}} \frac{|\tilde\delta' t|}{(\tilde\delta'\tilde\delta)^{1/2}} \;\sim\; \max(|t_1|, |t_2|, \ldots, |t_{M^*}|),$$
where $t_1, t_2, \ldots, t_{M^*}$ are the elements of the $M^*$-dimensional random vector $t$ [whose distribution is $MVt(N - P^*, I_{M^*})$]. The distribution of $\max(|t_1|, |t_2|, \ldots, |t_{M^*}|)$ is referred to as a Studentized maximum modulus distribution.
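Percentage points of the Studentized maximum modulus distribution are not built into SciPy, but they can be approximated by simulation, exploiting the representation of an $MVt(\nu, I)$ vector as a normal vector divided by an independent chi scale. A sketch (for $M^* = 2$, $\nu = 10$, $\bar P = 0.10$ the estimate should be near the value 2.193 used earlier):

    import numpy as np

    def smm_upper_point(M_star, nu, P_bar, n_draws=200_000, seed=0):
        """Monte Carlo estimate of the upper 100*P_bar% point of
        max(|t_1|, ..., |t_M*|), where t ~ MVt(nu, I)."""
        rng = np.random.default_rng(seed)
        zs = rng.standard_normal((n_draws, M_star))
        scales = np.sqrt(rng.chisquare(nu, size=n_draws) / nu)  # shared chi scale
        m = np.abs(zs / scales[:, None]).max(axis=1)
        return np.quantile(m, 1.0 - P_bar)

    print(smm_upper_point(2, 10, 0.10))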
In the special case where $\tilde\Delta$ is of the form (3.135) (and where $M^* \ge 2$),
$$\bar c_{\bar P} > [\bar F_{\bar P}(M^*, N - P^*)]^{1/2}, \tag{3.136}$$
as is evident upon observing that
$$\max(|t_1|, |t_2|, \ldots, |t_{M^*}|) = [\max(t_1^2, t_2^2, \ldots, t_{M^*}^2)]^{1/2} > \Big[\textstyle\sum_{i=1}^{M^*} t_i^2/M^*\Big]^{1/2} \quad\text{(with probability 1)}$$
and that $\sum_{i=1}^{M^*} t_i^2/M^* \sim SF(M^*, N - P^*)$.
A third special case where the computation of $\bar c_{\bar P}$ is relatively tractable is that where for some nonzero scalar $a$ and some $M^* \times 1$ orthonormal vectors $\dot\delta_1, \dot\delta_2, \ldots, \dot\delta_{M^*}$,
$$\tilde\Delta = \{a(\dot\delta_i - \dot\delta_j) \;\; (j \ne i = 1, 2, \ldots, M^*)\}. \tag{3.137}$$
$$-(\tilde\delta_{M^*+j}'\tilde\delta_{M^*+j})^{1/2}\,\hat\sigma\,\bar c_{\bar P} \le \sum_{i=1}^{M^*} a_{ji}\,(\tilde\delta_i'\tilde\delta_i)^{1/2}\,x_i \le (\tilde\delta_{M^*+j}'\tilde\delta_{M^*+j})^{1/2}\,\hat\sigma\,\bar c_{\bar P} \quad (j = 1, 2, \ldots, L - M^*). \tag{3.141}$$
We conclude that the problem of determining $\dot\theta_{\max}$ and $\dot\theta_{\min}$ is essentially that of determining the maximum and minimum values of the quantity (3.138) with respect to $x_1, x_2, \ldots, x_{M^*}$ subject to the constraints (3.140) and (3.141). The problem of maximizing or minimizing this quantity subject to these constraints can be formulated as a linear programming problem, and its solution can be effected by employing an algorithm for solving linear programming problems; refer, e.g., to Nocedal and Wright (2006, chaps. 13 & 14).
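Such a linear program can be handed to an off-the-shelf solver. The sketch below is schematic only (the objective and constraint arrays are hypothetical placeholders, not the quantities (3.138), (3.140), and (3.141)); it shows how two-sided constraints are encoded for scipy.optimize.linprog and how both the maximum and the minimum are obtained:

    import numpy as np
    from scipy.optimize import linprog

    # hypothetical problem data: maximize/minimize obj @ x subject to lb <= A @ x <= ub
    obj = np.array([1.0, 2.0, -1.0])
    A = np.array([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
    lb, ub = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
    bounds = [(-5.0, 5.0)] * 3                  # box constraints on x itself

    A_ub = np.vstack([A, -A])                   # encode lb <= A x <= ub as A_ub x <= b_ub
    b_ub = np.concatenate([ub, -lb])

    theta_max = -linprog(-obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds).fun
    theta_min = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds).fun
    print(theta_min, theta_max)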
d. An illustration
Let us illustrate various of the results of Subsections a, b, and c by using them to add to the results obtained earlier (in Sections 7.1 and 7.2c) for the lettuce-yield data. Accordingly, let us take $y$ to be the $20 \times 1$ random vector whose observed value is the vector of lettuce yields. Further, let us adopt the terminology and notation introduced in Section 7.1 along with those introduced in the present section. And let us restrict attention to the case where $y$ is assumed to follow either the second-order or third-order model, that is, the G–M model obtained upon taking the function $\delta(u)$ (that defines the response surface) to be either the second-order polynomial (1.2) or the third-order polynomial (1.3) (and taking $u$ to be the 3-dimensional column vector whose elements $u_1$, $u_2$, and $u_3$ represent the transformed amounts of Cu, Mo, and Fe). In what follows, the distribution of the vector $e$ of residual effects (in the second- or third-order model) is taken to be $N(0, \sigma^2 I)$.
Second-order model versus the third-order model. The second-order model has considerable appeal; it is relatively simple and relatively tractable. However, there may be a question as to whether the second-order polynomial (1.2) provides an "adequate" approximation to the response surface over the region of interest. A common way of addressing this question is to take the model to be the third-order model and to attempt to determine whether the data are consistent with the hypothesis that the coefficients of the third-order terms [i.e., the terms that appear in the third-order polynomial (1.3) but not the second-order polynomial (1.2)] equal 0.
Accordingly, suppose that $y$ follows the third-order model (in which case $P = 20$, $P^* = 15$, and $N - P^* = 5$). There are 10 third-order terms, the coefficients of which are $\beta_{111}, \beta_{112}, \beta_{113}, \beta_{122}, \beta_{123}, \beta_{133}, \beta_{222}, \beta_{223}, \beta_{233}$, and $\beta_{333}$. Not all of these coefficients are estimable from the lettuce-yield data; only certain linear combinations are estimable. In fact, as discussed in Section 7.2c, a linear combination of the coefficients of the third-order terms is estimable (from these data) if and only if it is expressible as a linear combination of 5 linearly independent estimable linear combinations, and among the choices for the 5 linearly independent estimable linear combinations are the linear combinations $3.253\beta_{111} - 1.779\beta_{122} - 0.883\beta_{133}$, $1.779\beta_{112} - 3.253\beta_{222} + 0.883\beta_{233}$, $1.554\beta_{113} + 1.554\beta_{223} - 3.168\beta_{333}$, $2.116\beta_{123}$, and $0.471\beta_{333}$.
Let $\tau$ represent the 5-dimensional column vector with elements $\tau_1 = 3.253\beta_{111} - 1.779\beta_{122} - 0.883\beta_{133}$, $\tau_2 = 1.779\beta_{112} - 3.253\beta_{222} + 0.883\beta_{233}$, $\tau_3 = 1.554\beta_{113} + 1.554\beta_{223} - 3.168\beta_{333}$, $\tau_4 = 2.116\beta_{123}$, and $\tau_5 = 0.471\beta_{333}$. And consider the null hypothesis $H_0\colon \tau = \tau^{(0)}$, where $\tau^{(0)} = 0$ (and where the alternative hypothesis is $H_1\colon \tau \ne \tau^{(0)}$). Clearly, $H_0$ is testable.
In light of the results of Section 7.2c, the least squares estimator $\hat\tau$ of $\tau$ equals $(-1.97, 4.22, 5.11, 2.05, 2.07)'$, and $\operatorname{var}(\hat\tau) = \sigma^2 I$; by construction, the linear combinations that form the elements of $\tau$ are such that their least squares estimators are uncorrelated and have standard errors equal to $\sigma$. Further, $\hat\sigma^2 = 10.53$ (so that each element of $\hat\tau$ has an estimated standard error of $\hat\sigma = 3.24$). And
$$F(0) = \frac{(\hat\tau - 0)'C^-(\hat\tau - 0)}{M^*\hat\sigma^2} = \frac{\hat\tau'\hat\tau}{5\hat\sigma^2} = 1.07.$$
The size-$\bar P$ F test of $H_0\colon \tau = 0$ (versus $H_1\colon \tau \ne 0$) consists of rejecting $H_0$ if $F(0) > \bar F_{\bar P}(5, 5)$, and accepting $H_0$ otherwise. The p-value of the F test of $H_0$ (versus $H_1$), which (by definition) is the value of $\bar P$ such that $F(0) = \bar F_{\bar P}(5, 5)$, equals 0.471. Thus, the size-$\bar P$ F test rejects $H_0$ for values of $\bar P$ larger than 0.471 and accepts $H_0$ for values less than or equal to 0.471. This result is more-or-less consistent with a hypothesis that the coefficients of the 10 third-order terms (of the third-order model) equal 0. However, there is a caveat: the power of the test depends on the values of the coefficients of the 10 third-order terms only through the values of the 5 linear combinations $\tau_1$, $\tau_2$, $\tau_3$, $\tau_4$, and $\tau_5$. The distribution of $F(0)$ (under both $H_0$ and $H_1$), from which the power function of the size-$\bar P$ F test of $H_0$ is determined, is $SF\big(5, 5, \sum_{i=1}^5 \tau_i^2/\sigma^2\big)$.
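The quoted p-value can be reproduced from the upper tail of the $SF(5, 5)$ distribution (a one-line check with SciPy):

    from scipy.stats import f

    F0 = 1.07                     # observed value of F(0)
    p_value = f.sf(F0, 5, 5)      # Pr{SF(5,5) > F0}
    print(round(p_value, 3))      # about 0.471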
Presence or absence of interactions. Among the stated objectives of the experimental study of
lettuce yield was that of “determining the importance of interactions among Cu, Mo, and Fe.” That
is, to what extent (if any) does the change in yield effected by a change in the level of one of these
three variables vary with the levels of the other two?
Suppose that y follows the second-order model, in which ˇ D .ˇ1 ; ˇ2 ; ˇ3 ; ˇ4 , ˇ11 ,
ˇ12 ; ˇ13 ; ˇ22 ; ˇ23 ; ˇ33 /0, P D P D 10, and N P D 10. And take to be the 3-dimensional
column vector with elements 1 D ˇ12 , 2 D ˇ13 , and 3 D ˇ23 . Then, M D M D 3, and
D ƒ0ˇ, where ƒ is the 103 matrix whose columns are the 6th, 7th, and 9 th columns of the 1010
identity matrix.
Consider the problem of obtaining a 100.1 P /% confidence set for the vector and that of
obtaining confidence intervals for ˇ12 , ˇ13 , and ˇ23 (and possibly for linear combinations of ˇ12 ,
ˇ13 , and ˇ23 ) for which the probability of simultaneous coverage is 1 P . Consider also the problem
of obtaining a size- P test of the null hypothesis H0 W D 0 (versus the alternative hypothesis
H1 W ¤ 0) and that of testing whether each of the quantities ˇ12 , ˇ13 , and ˇ23 (and possibly each
of their linear combinations) equals 0 (and of doing so in such a way that the probability of one or
more false rejections equals P ).
Letting ˇO12 , ˇO13 , and ˇO23 represent the least squares estimators of ˇ12 , ˇ13 , and ˇ23 , respectively,
we find that ˇO12 D 1:31, ˇO13 D 1:29, and ˇO23 D 0:66, so that the least squares estimator O of
equals . 1:31; 1:29; 0:66/0 —refer to Table 7.1. Further, var./ O D 2 C, and
C D ƒ0.X0 X/ ƒ D diag.0:125; 0:213; 0:213/:
And for the $M_* \times M_*$ matrix $T$ of full row rank such that $C = T'T$ and for the $M_* \times M_*$ matrix $S$ such that $TS$ is orthogonal, take
$$T = \operatorname{diag}(0.354,\ 0.462,\ 0.462) \quad\text{and}\quad S = T^{-1} = \operatorname{diag}(2.83,\ 2.17,\ 2.17),$$
in which case
$$\alpha = S'\lambda = (2.83\beta_{12},\ 2.17\beta_{13},\ 2.17\beta_{23})' \quad\text{and}\quad \hat\alpha = S'\hat\lambda = (-3.72, -2.80, 1.43)'$$
(and the $M_* \times M_*$ matrix $W$ such that $\Lambda S W = \Lambda$ equals $T$).

The "F statistic" $F(0)$ for testing the null hypothesis $H_0 : \lambda = 0$ (versus the alternative hypothesis $H_1 : \lambda \ne 0$) is expressible as
$$F(0) = \frac{(\hat\beta_{12}, \hat\beta_{13}, \hat\beta_{23})\, C^{-1} (\hat\beta_{12}, \hat\beta_{13}, \hat\beta_{23})'}{3\,\hat\sigma^2};$$
its value is 0.726, which is "quite small." Thus, if there is any "nonadditivity" in the effects of Cu, Mo, and Fe on lettuce yield, it is not detectable from the results obtained by carrying out an F test (on these data). Corresponding to the size-$\dot\gamma$ F test is the $100(1-\dot\gamma)\%$ ellipsoidal confidence set $A_F$ (for the vector $\lambda$) given by expression (3.41); it consists of the values of $\lambda = (\beta_{12}, \beta_{13}, \beta_{23})'$ such that
$$0.244(\beta_{12} + 1.31)^2 + 0.143(\beta_{13} + 1.29)^2 + 0.143(\beta_{23} - 0.66)^2 \le \bar F_{\dot\gamma}(3, 10). \qquad (3.142)$$
For $\dot\gamma$ equal to .01, .10, and .50, the values of $\bar F_{\dot\gamma}(3, 10)$ are 6.552, 2.728, and 0.845, respectively.
Confidence intervals for which the probability of simultaneous coverage is $1-\dot\gamma$ can be obtained for $\beta_{12}$, $\beta_{13}$, and $\beta_{23}$ and all linear combinations of $\beta_{12}$, $\beta_{13}$, and $\beta_{23}$ by applying the S method. In the special case where $\dot\gamma = .10$, the intervals obtained for $\beta_{12}$, $\beta_{13}$, and $\beta_{23}$ via the S method [using formula (3.110)] are:
$$-4.65 \le \beta_{12} \le 2.02, \qquad -5.66 \le \beta_{13} \le 3.07, \qquad\text{and}\qquad -3.70 \le \beta_{23} \le 5.02.$$
By way of comparison, the 90% one-at-a-time confidence intervals obtained for $\beta_{12}$, $\beta_{13}$, and $\beta_{23}$ [upon applying formula (3.55)] are:
$$-3.43 \le \beta_{12} \le 0.80, \qquad -4.06 \le \beta_{13} \le 1.47, \qquad\text{and}\qquad -2.10 \le \beta_{23} \le 3.42.$$
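Both sets of half-widths can be recomputed directly from $\hat\lambda$, the diagonal of $C$, and $\hat\sigma$. A minimal Python sketch (taking $\hat\sigma = 3.30$, the value implied by the interval lengths quoted in this subsection):

    import numpy as np
    from scipy import stats

    beta_hat = np.array([-1.31, -1.29, 0.66])    # beta12, beta13, beta23
    C_diag   = np.array([0.125, 0.213, 0.213])   # diagonal of C
    sigma_hat, m_star, resid_df, gamma = 3.30, 3, 10, 0.10

    se = np.sqrt(C_diag) * sigma_hat
    half_S = np.sqrt(m_star * stats.f.isf(gamma, m_star, resid_df)) * se  # S method
    half_t = stats.t.isf(gamma / 2, resid_df) * se                        # one-at-a-time
    print(np.c_[beta_hat - half_S, beta_hat + half_S])
    print(np.c_[beta_hat - half_t, beta_hat + half_t])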
Corresponding to the S method for obtaining (for all linear combinations of $\beta_{12}$, $\beta_{13}$, and $\beta_{23}$, including $\beta_{12}$, $\beta_{13}$, and $\beta_{23}$ themselves) confidence intervals for which the probability of simultaneous coverage is $1-\dot\gamma$ is the S method for obtaining for every linear combination of $\beta_{12}$, $\beta_{13}$, and $\beta_{23}$ a test of the null hypothesis that the linear combination equals 0 versus the alternative hypothesis that it does not equal 0. The null hypothesis is accepted or rejected according to whether or not 0 is a member of the confidence interval for that linear combination. The tests are such that the probability of one or more false rejections is less than or equal to $\dot\gamma$.

The confidence intervals obtained for $\beta_{12}$, $\beta_{13}$, and $\beta_{23}$ via the S method [using formula (3.110)] are "conservative." The probability of simultaneous coverage for these three intervals is greater than $1-\dot\gamma$, not equal to $1-\dot\gamma$—there are values of $y$ for which coverage is achieved by these three intervals but for which coverage is not achieved by the intervals obtained [using formula (3.110)] for some linear combinations of $\beta_{12}$, $\beta_{13}$, and $\beta_{23}$.
Confidence intervals can be obtained for $\beta_{12}$, $\beta_{13}$, and $\beta_{23}$ for which the probability of simultaneous coverage equals $1-\dot\gamma$. Letting $\delta_1$, $\delta_2$, and $\delta_3$ represent the columns of the $3 \times 3$ identity matrix, take $\Delta = \{\delta_1, \delta_2, \delta_3\}$, in which case $\tilde\Delta = \{\tilde\delta_1, \tilde\delta_2, \tilde\delta_3\}$, where $\tilde\delta_1$, $\tilde\delta_2$, and $\tilde\delta_3$ are the columns of the matrix $W$ [which equals $\operatorname{diag}(0.354, 0.462, 0.462)$]. Then, intervals can be obtained for $\beta_{12}$, $\beta_{13}$, and $\beta_{23}$ for which the probability of simultaneous coverage equals $1-\dot\gamma$ by using formula (3.90) or (3.94). When $\tilde\Delta = \{\tilde\delta_1, \tilde\delta_2, \tilde\delta_3\}$, $\tilde\Delta$ is of the form (3.135) and, consequently, $c_{\dot\gamma}$ is the upper $100\dot\gamma\%$ point of a Studentized maximum modulus distribution; specifically, it is the upper $100\dot\gamma\%$ point of the distribution of $\max(|t_1|, |t_2|, |t_3|)$, where $t_1$, $t_2$, and $t_3$ are the elements of a 3-dimensional random column vector whose distribution is $MVt(10, I_3)$. The value of $c_{.10}$ is 2.410—refer, e.g., to Graybill (1976, p. 656). And [as obtained from formula (3.94)] confidence intervals for $\beta_{12}$, $\beta_{13}$, and $\beta_{23}$ with a probability of simultaneous coverage equal to .90 are:
$$-4.13 \le \beta_{12} \le 1.50, \qquad -4.97 \le \beta_{13} \le 2.38, \qquad\text{and}\qquad -3.01 \le \beta_{23} \le 4.33. \qquad (3.143)$$
The values of $\lambda$ whose elements ($\beta_{12}$, $\beta_{13}$, and $\beta_{23}$) satisfy the three inequalities (3.143) form a 3-dimensional rectangular set $A$. The set $A$ is a 90% confidence set for $\lambda$. It can be regarded as a "competitor" to the 90% ellipsoidal confidence set for $\lambda$ defined (upon setting $\dot\gamma = .10$) by inequality (3.142).
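The constant $c_{.10} = 2.410$ can be reproduced, to Monte Carlo accuracy, from the description just given: it is the upper 10% point of $\max(|t_1|, |t_2|, |t_3|)$ for an $MVt(10, I_3)$ vector. A sketch assuming only numpy:

    import numpy as np
    rng = np.random.default_rng(0)

    # Draws from MVt(10, I3): a standard normal 3-vector divided by an
    # independent sqrt(chi2(10)/10) scale.
    n_draws, resid_df = 1_000_000, 10
    z = rng.standard_normal((n_draws, 3))
    scale = np.sqrt(rng.chisquare(resid_df, n_draws) / resid_df)
    m = np.abs(z / scale[:, None]).max(axis=1)
    print(np.quantile(m, 0.90))   # roughly 2.41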
Starting with the confidence intervals (3.143) for $\beta_{12}$, $\beta_{13}$, and $\beta_{23}$, confidence intervals (with the same probability of simultaneous coverage) can be obtained for every linear combination of $\beta_{12}$, $\beta_{13}$, and $\beta_{23}$. Let $\delta$ represent an arbitrary $3 \times 1$ vector, and denote by $k_1$, $k_2$, and $k_3$ the elements of $\delta$, so that $\delta'\lambda = k_1\beta_{12} + k_2\beta_{13} + k_3\beta_{23}$. Further, take $A_\delta = \tilde A_{W\delta}$ [where $\tilde A_{\tilde\delta}$ is the set defined for $\tilde\delta \in \tilde\Delta$ by expression (3.90) and for $\tilde\delta \notin \tilde\Delta$ by expression (3.123)]. Then,
$$\Pr[\delta'\lambda \in A_\delta(y) \ \text{for every} \ \delta \in \mathbb{R}^3] = 1-\dot\gamma.$$

And upon observing that $W\delta = k_1\tilde\delta_1 + k_2\tilde\delta_2 + k_3\tilde\delta_3$ and making use of formula (3.129), we find that $A_\delta$ is the interval with end points
$$k_1\hat\beta_{12} + k_2\hat\beta_{13} + k_3\hat\beta_{23} \ \pm\ \sum_{i=1}^3 |k_i|\,(\delta_i'C\delta_i)^{1/2}\,\hat\sigma\, c_{\dot\gamma}. \qquad (3.144)$$
When $\dot\gamma = .10$,
$$\sum_{i=1}^3 |k_i|\,(\delta_i'C\delta_i)^{1/2}\,\hat\sigma\, c_{\dot\gamma} = 7.954\,(0.354|k_1| + 0.462|k_2| + 0.462|k_3|).$$
The intervals obtained for all linear combinations of $\beta_{12}$, $\beta_{13}$, and $\beta_{23}$ by taking the end points of each interval to be those given by expression (3.144) can be regarded as competitors to those obtained by applying the S method. When only one of the 3 coefficients $k_1$, $k_2$, and $k_3$ of the linear combination $k_1\beta_{12} + k_2\beta_{13} + k_3\beta_{23}$ is nonzero, the interval with end points (3.144) is shorter than the interval obtained by applying the S method.

Suppose, however, that $k_1$, $k_2$, and $k_3$ are such that for some nonzero scalar $k$, $k_i = k(\tilde\delta_i'\tilde\delta_i)^{-1/2}$ or, equivalently, $k_i = k(\delta_i'C\delta_i)^{-1/2}$ ($i = 1, 2, 3$). Then, the length of the interval with end points (3.144) is
$$2\sum_{i=1}^3 |k_i|\,(\delta_i'C\delta_i)^{1/2}\,\hat\sigma\, c_{\dot\gamma} = 6\,|k|\,\hat\sigma\, c_{\dot\gamma} = 19.80\,|k|\, c_{\dot\gamma}.$$
Consider the minimization (with respect to $\tilde\delta$ and $\phi$) of the sum of squares
$$(\dot t - \phi\tilde\delta)'(\dot t - \phi\tilde\delta) \quad (= \dot t'\dot t - 2\phi\,\tilde\delta'\dot t + \phi^2\,\tilde\delta'\tilde\delta) \qquad (3.146)$$
subject to the constraints $\tilde\delta \in \tilde\Delta$ and $\phi \ge 0$. Suppose that $\dot t$ is such that for some vector $\ddot\delta \in \tilde\Delta$, $\ddot\delta'\dot t > 0$—when no such vector exists, $\max_{\{\tilde\delta\in\tilde\Delta:\,\tilde\delta\ne0\}} \tilde\delta'\dot t/(\tilde\delta'\tilde\delta)^{1/2} \le 0 \le \max_{\{\tilde\delta\in\tilde\Delta:\,\tilde\delta\ne0\}} \tilde\delta'(-\dot t)/(\tilde\delta'\tilde\delta)^{1/2}$, and (subject to the constraints $\tilde\delta \in \tilde\Delta$ and $\phi \ge 0$) the minimum value of the sum of squares (3.146) is the value $\dot t'\dot t$ attained at $\phi = 0$. And let $\dot\delta$ and $\dot\phi$ represent any values of $\tilde\delta$ and $\phi$ at which the sum of squares (3.146) attains its minimum value (for $\tilde\delta \in \tilde\Delta$ and $\phi \ge 0$). Then,
$$\dot\phi = \frac{\dot\delta'\dot t}{\dot\delta'\dot\delta} > 0 \qquad\text{and}\qquad (\dot t - \dot\phi\dot\delta)'(\dot t - \dot\phi\dot\delta) = \dot t'\dot t - \frac{(\dot\delta'\dot t)^2}{\dot\delta'\dot\delta} < \dot t'\dot t. \qquad (3.147)$$
Further,
$$\max_{\{\tilde\delta\in\tilde\Delta:\,\tilde\delta\ne0\}} \frac{\tilde\delta'\dot t}{(\tilde\delta'\tilde\delta)^{1/2}} = \frac{\dot\delta'\dot t}{(\dot\delta'\dot\delta)^{1/2}} > 0. \qquad (3.148)$$
Let us verify results (3.147) and (3.148). For any vector $\ddot\delta$ in $\tilde\Delta$ such that $\ddot\delta'\dot t \ne 0$ and for $\ddot\phi = \ddot\delta'\dot t/\ddot\delta'\ddot\delta$,
$$(\dot t - \ddot\phi\ddot\delta)'(\dot t - \ddot\phi\ddot\delta) = \dot t'\dot t - \frac{(\ddot\delta'\dot t)^2}{\ddot\delta'\ddot\delta} < \dot t'\dot t. \qquad (3.149)$$
Thus, $\min_{\{\tilde\delta,\,\phi\,:\,\tilde\delta\in\tilde\Delta,\ \phi\ge0\}} (\dot t - \phi\tilde\delta)'(\dot t - \phi\tilde\delta)$ is less than $\dot t'\dot t$, which is the value of $(\dot t - \phi\tilde\delta)'(\dot t - \phi\tilde\delta)$ when $\phi = 0$ or $\tilde\delta = 0$. And it follows that $\dot\phi > 0$ and $\dot\delta \ne 0$.

To establish that $\max_{\{\tilde\delta\in\tilde\Delta:\,\tilde\delta\ne0\}} \tilde\delta'\dot t/(\tilde\delta'\tilde\delta)^{1/2} = \dot\delta'\dot t/(\dot\delta'\dot\delta)^{1/2}$, assume (for purposes of establishing a contradiction) the contrary, that is, assume that there exists a nonnull vector $\ddot\delta \in \tilde\Delta$ such that
$$\frac{\ddot\delta'\dot t}{(\ddot\delta'\ddot\delta)^{1/2}} > \frac{\dot\delta'\dot t}{(\dot\delta'\dot\delta)^{1/2}}.$$
Then, letting $\ddot\phi = \dot\phi\,(\dot\delta'\dot\delta/\ddot\delta'\ddot\delta)^{1/2}$, we find that
$$\ddot\phi\,\ddot\delta'\dot t > \dot\phi\,\dot\delta'\dot t \qquad\text{and}\qquad \ddot\phi^2\,\ddot\delta'\ddot\delta = \dot\phi^2\,\dot\delta'\dot\delta,$$
which implies that
$$(\dot t - \ddot\phi\ddot\delta)'(\dot t - \ddot\phi\ddot\delta) < (\dot t - \dot\phi\dot\delta)'(\dot t - \dot\phi\dot\delta),$$
thereby establishing the desired contradiction.

The verification of result (3.148) is complete upon observing that (since, by supposition, there exists a vector $\ddot\delta \in \tilde\Delta$ such that $\ddot\delta'\dot t > 0$) $\max_{\{\tilde\delta\in\tilde\Delta:\,\tilde\delta\ne0\}} \tilde\delta'\dot t/(\tilde\delta'\tilde\delta)^{1/2} > 0$. Further, turning to result (3.147), $\dot\phi = \dot\delta'\dot t/(\dot\delta'\dot\delta)$, as is evident upon letting $\ddot\phi = \dot\delta'\dot t/(\dot\delta'\dot\delta)$ and upon observing that
$$(\dot t - \dot\phi\dot\delta)'(\dot t - \dot\phi\dot\delta) = (\dot t - \ddot\phi\dot\delta)'(\dot t - \ddot\phi\dot\delta) + (\dot\phi - \ddot\phi)^2\,\dot\delta'\dot\delta$$
and [in light of result (3.148)] that $\ddot\phi > 0$. To complete the verification of result (3.147), it remains only to observe that $\dot\delta'\dot t \ne 0$ and to apply the special case of result (3.149) obtained upon setting $\ddot\delta = \dot\delta$ (and implicitly $\ddot\phi = \dot\phi$).
The constrained nonlinear least squares problem can be reformulated. Consider the minimization (with respect to $\delta$ and $\phi$) of the sum of squares
$$[\dot t - \phi W\delta]'[\dot t - \phi W\delta] \qquad (3.150)$$
subject to the constraints $\delta \in \Delta$ and $\phi \ge 0$. And let $\ddot\delta$ and $\ddot\phi$ represent any solution to this constrained nonlinear least squares problem, that is, any values of $\delta$ and $\phi$ at which the sum of squares (3.150) attains its minimum value (for $\delta \in \Delta$ and $\phi \ge 0$). Then, a solution $\dot\delta$ and $\dot\phi$ to the original constrained nonlinear least squares problem, that is, values of $\tilde\delta$ and $\phi$ that minimize the sum of squares (3.146) subject to the constraints $\tilde\delta \in \tilde\Delta$ and $\phi \ge 0$, can be obtained by taking $\dot\delta = W\ddot\delta$ and $\dot\phi = \ddot\phi$.

Thus, $\max_{\{\tilde\delta\in\tilde\Delta:\,\tilde\delta\ne0\}} |\tilde\delta'\dot t|/(\tilde\delta'\tilde\delta)^{1/2}$ can be computed as the square root of the difference between the total sum of squares $\dot t'\dot t$ and the residual sum of squares $(\dot t - \dot\phi\dot\delta)'(\dot t - \dot\phi\dot\delta)$. Further, $(\dot t - \dot\phi\dot\delta)'(\dot t - \dot\phi\dot\delta) = [\dot t - \ddot\phi W\ddot\delta]'[\dot t - \ddot\phi W\ddot\delta]$, so that the residual sum of squares obtained by minimizing $(\dot t - \phi\tilde\delta)'(\dot t - \phi\tilde\delta)$ with respect to $\tilde\delta$ and $\phi$ (subject to the constraint $\tilde\delta \in \tilde\Delta$) is identical to that obtained by minimizing $[\dot t - \phi W\delta]'[\dot t - \phi W\delta]$ with respect to $\delta$ and $\phi$ (subject to the constraint $\delta \in \Delta$).
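The reformulated problem (3.150) is an ordinary bound-constrained least squares problem, and the maximum in question is then recovered as the square root of the total-minus-residual sum of squares. A sketch under illustrative assumptions ($W$, $\dot t$, and the parametrization of $\Delta$ below are hypothetical stand-ins; in general a global search over starting points may be needed):

    import numpy as np
    from scipy.optimize import minimize

    W = np.diag([0.354, 0.462, 0.462])        # hypothetical W
    t_dot = np.array([1.0, -0.5, 0.8])        # hypothetical value of t

    def rss(params):                          # the sum of squares (3.150)
        phi, u = params[0], params[1:]
        r = t_dot - phi * (W @ u)             # here delta = u with Delta = [-1, 1]^3
        return r @ r

    fit = minimize(rss, x0=np.r_[1.0, 0.0, 0.0, 0.0],
                   bounds=[(0, None)] + [(-1, 1)] * 3)
    max_ratio = np.sqrt(t_dot @ t_dot - fit.fun)   # sqrt of the SS difference
    print(max_ratio)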
Constant-width simultaneous confidence intervals. For $\delta \in \Delta$, $A_\delta$ is (as defined in Part c) the confidence interval (for $\delta'\lambda$) with end points $\delta'\hat\lambda \pm (\delta'C\delta)^{1/2}\hat\sigma c_{\dot\gamma}$. The intervals $A_\delta$ ($\delta \in \Delta$) were constructed in such a way that their probability of simultaneous coverage equals $1-\dot\gamma$. Even when the intervals corresponding to values of $\delta$ for which $\Lambda\delta = 0$ (i.e., values for which $\delta'\lambda = 0$) are excluded, these intervals are (aside from various special cases) not all of the same width. Clearly, the width of the interval $A_\delta$ is proportional to the standard error $(\delta'C\delta)^{1/2}\sigma$ or estimated standard error $(\delta'C\delta)^{1/2}\hat\sigma$ of the least squares estimator $\delta'\hat\lambda$ of $\delta'\lambda$.

Suppose that $\Delta$ is such that $\Lambda\delta \ne 0$ for every $\delta \in \Delta$, and suppose that $\tilde\Delta$ (which is the set $\{\tilde\delta : \tilde\delta = W\delta,\ \delta \in \Delta\}$) is such that for every $M_* \times 1$ vector $\dot t$, $\max_{\{\tilde\delta\in\tilde\Delta\}} |\tilde\delta'\dot t|$ exists. Then, confidence intervals can be obtained for the linear combinations $\delta'\lambda$ ($\delta \in \Delta$) that have a probability of simultaneous coverage equal to $1-\dot\gamma$ and that are all of the same width.

Analogous to result (3.87), which underlies the generalized S method, we find that
$$\max_{\{\tilde\delta\in\tilde\Delta\}} \hat\sigma^{-1}\,|\tilde\delta'(\hat\alpha - \alpha)| \ \sim\ \max_{\{\tilde\delta\in\tilde\Delta\}} |\tilde\delta' t|, \qquad (3.156)$$
where $t \sim MVt(N-P, I_{M_*})$. This result can be used to devise (for each $\tilde\delta \in \tilde\Delta$) an interval, say $\tilde A^*_{\tilde\delta}$, of $\tilde\delta'\alpha$-values such that the probability of simultaneous coverage of the intervals $\tilde A^*_{\tilde\delta}$ ($\tilde\delta \in \tilde\Delta$), like that of the intervals $\tilde A_{\tilde\delta}$ ($\tilde\delta \in \tilde\Delta$), equals $1-\dot\gamma$. Letting $c^*_{\dot\gamma}$ represent the upper $100\dot\gamma\%$ point of the distribution of $\max_{\{\tilde\delta\in\tilde\Delta\}} |\tilde\delta' t|$, this interval is as follows:
$$\tilde A^*_{\tilde\delta} = \{\dot\lambda \in \mathbb{R}^1 : \tilde\delta'\hat\alpha - \hat\sigma c^*_{\dot\gamma} \le \dot\lambda \le \tilde\delta'\hat\alpha + \hat\sigma c^*_{\dot\gamma}\}. \qquad (3.157)$$
Now, for $\delta \in \Delta$, let $A^*_\delta = \tilde A^*_{W\delta}$, so that $A^*_\delta$ is an interval of $\delta'\lambda$-values expressible as follows:
$$A^*_\delta = \{\dot\lambda \in \mathbb{R}^1 : \delta'\hat\lambda - \hat\sigma c^*_{\dot\gamma} \le \dot\lambda \le \delta'\hat\lambda + \hat\sigma c^*_{\dot\gamma}\}. \qquad (3.158)$$
Like the intervals $A_\delta$ ($\delta \in \Delta$) associated with the generalized S method, the probability of simultaneous coverage of the intervals $A^*_\delta$ ($\delta \in \Delta$) equals $1-\dot\gamma$. Unlike the intervals $A_\delta$ ($\delta \in \Delta$), the intervals $A^*_\delta$ ($\delta \in \Delta$) are all of the same width—each of them is of width $2\hat\sigma c^*_{\dot\gamma}$.

In practice, $c^*_{\dot\gamma}$ may have to be approximated by numerical means—refer, e.g., to Liu (2011, chaps. 3 & 7). If necessary, $c^*_{\dot\gamma}$ can be approximated by adopting a Monte Carlo approach in which repeated draws are made from the distribution of the random variable $\max_{\{\tilde\delta\in\tilde\Delta\}} |\tilde\delta' t|$. Note that
$$\max_{\{\tilde\delta\in\tilde\Delta\}} |\tilde\delta' t| = \max\Bigl(\max_{\{\tilde\delta\in\tilde\Delta\}} \tilde\delta' t,\ \max_{\{\tilde\delta\in\tilde\Delta\}} \tilde\delta'(-t)\Bigr). \qquad (3.159)$$
Proof. Denote by $R'$ the unique (with probability 1) integer such that $X$ ranks $R'$th (in magnitude) among the $K+1$ random variables $X_1, X_2, \ldots, X_K, X$ (so that $X < X_{[1]}$, $X_{[R'-1]} < X < X_{[R']}$, or $X > X_{[K]}$, depending on whether $R' = 1$, $2 \le R' \le K$, or $R' = K+1$). Then, upon observing that $\Pr(R' = k) = 1/(K+1)$ for $k = 1, 2, \ldots, K+1$, we find that
$$\Pr(X > X_{[R]}) = \Pr(R' > R) = \sum_{k=R+1}^{K+1} \Pr(R' = k) = \frac{K - R + 1}{K + 1} = \alpha. \qquad\text{Q.E.D.}$$

Lemma 7.3.5. Let $X_1, X_2, \ldots, X_K$ represent $K$ statistically independent random variables. Further, suppose that $X_k \sim X$ ($k = 1, 2, \ldots, K$) for some random variable $X$ whose distribution is absolutely continuous with a cdf $G(\cdot)$ that is strictly increasing over some finite or infinite interval $I$ for which $\Pr(X \in I) = 1$. And denote by $X_{[1]}, X_{[2]}, \ldots, X_{[K]}$ the first through $K$th order statistics of $X_1, X_2, \ldots, X_K$. Then, for any integer $R$ between 1 and $K$, inclusive,
$$G(X_{[R]}) \sim \operatorname{Be}(R,\ K - R + 1).$$
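Lemma 7.3.5 is easy to corroborate by simulation; the following is a sketch with $X$ standard normal (so that $G = \Phi$):

    import numpy as np
    from scipy import stats
    rng = np.random.default_rng(1)

    # G(X_[R]) should behave as a Be(R, K - R + 1) random variable.
    K, R, n_rep = 99, 90, 20_000
    draws = rng.standard_normal((n_rep, K))
    g_of_order_stat = stats.norm.cdf(np.sort(draws, axis=1)[:, R - 1])

    q = np.linspace(0.1, 0.9, 5)
    print(np.quantile(g_of_order_stat, q))      # empirical quantiles
    print(stats.beta.ppf(q, R, K - R + 1))      # Be(R, K-R+1) quantiles: close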
The number $K$ of draws from the distribution of $X$ can be chosen so that for some specified "tolerance" $\epsilon > 0$ and some specified probability $\omega$,
$$\Pr\bigl(\,\bigl|G(X_{[R]}) - (1-\dot\gamma)\bigr| \le \epsilon\,\bigr) \ge \omega, \qquad (3.165)$$
as can be readily verified—refer to Exercise 6.4. Edwards and Berry (1987, p. 915) proposed (for purposes of deciding on a value for $K$ and implicitly for $R$) an implementation of the criterion (3.165) in which the distribution of $G(X_{[R]}) - (1-\dot\gamma)$ is approximated by an $N[0,\ \dot\gamma(1-\dot\gamma)/(K+2)]$ distribution—as $K \to \infty$, the pdf of the standardized random variable $[G(X_{[R]}) - (1-\dot\gamma)]/\sqrt{\dot\gamma(1-\dot\gamma)/(K+2)}$ tends to the pdf of the $N(0, 1)$ distribution (Johnson, Kotz, and Balakrishnan 1995, p. 240). When this implementation is adopted, the criterion (3.165) is replaced by the much simpler criterion
$$K \ge \dot\gamma(1-\dot\gamma)\,(z_{(1-\omega)/2}/\epsilon)^2 - 2, \qquad (3.166)$$
where (for any scalar $\alpha$ between 0 and 1) $z_\alpha$ is the upper $100\alpha\%$ point of the $N(0, 1)$ distribution. And the number $K$ of draws is chosen to be such that inequality (3.166) is satisfied and such that $(K+1)(1-\dot\gamma)$ is an integer.

Table 1 of Edwards and Berry gives choices for $K$ that would be suitable if $\dot\gamma$ were taken to be .10, .05, or .01, $\epsilon$ were taken to be .01, .005, .002, or .001, and $\omega$ were taken to be .99—the entries in their Table 1 are for $K+1$. For example, if $\dot\gamma$ were taken to be .10, $\epsilon$ to be .001, and $\omega$ to be .99 [in which case $z_{(1-\omega)/2} = 2.5758$ and the right side of inequality (3.166) equals 597125], we could take $K = 599999$ [in which case $R = 600000 \times .90 = 540000$].
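Criterion (3.166), together with the requirement that $(K+1)(1-\dot\gamma)$ be an integer, is easy to automate. A sketch (the helper name edwards_berry_K is ours, not Edwards and Berry's):

    import math
    from scipy import stats

    def edwards_berry_K(gamma, eps, omega):
        # Smallest K satisfying (3.166) with (K+1)(1-gamma) an integer.
        z = stats.norm.isf((1 - omega) / 2)
        K = math.ceil(gamma * (1 - gamma) * (z / eps) ** 2 - 2)
        while abs((K + 1) * (1 - gamma) - round((K + 1) * (1 - gamma))) > 1e-9:
            K += 1
        return K

    # gamma = .10, eps = .001, omega = .99: the bound is about 597125, and the
    # smallest admissible K is 597129; the text's more convenient K = 599999
    # also satisfies both requirements.
    print(edwards_berry_K(0.10, 0.001, 0.99))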
The width of the confidence band defined by the surfaces formed by the end points of interval $I_u^{(3)}(y)$ [or by the end points of interval $I_u^{(1)}(y)$ or $I_u^{(2)}(y)$] varies from one point in the space $U$ to another. Suppose that $U$ is such that for every $P \times 1$ vector $\dot t$, $\max_{\{u \in U\}} |[x(u)]'W'\dot t|$ exists [as would be the case if the functions $x_1(u), x_2(u), \ldots, x_P(u)$ are continuous and the set $U$ is closed and bounded]. Then, as an alternative to interval $I_u^{(3)}(y)$ with end points (3.171), we have the interval, say interval $I_u^{(4)}(y)$, with end points
$$\hat\delta(u) \pm \hat\sigma\, c^*_{\dot\gamma}, \qquad (3.172)$$
where $c^*_{\dot\gamma}$ is the upper $100\dot\gamma\%$ point of the distribution of the random variable $\max_{\{u\in U\}} |[x(u)]'W' t|$. Like the end points of interval $I_u^{(3)}(y)$, the end points of interval $I_u^{(4)}(y)$ form surfaces that define a confidence band having a probability of coverage equal to $1-\dot\gamma$. Unlike the confidence band formed by the end points of interval $I_u^{(3)}(y)$, this confidence band is of uniform width (equal to $2\hat\sigma c^*_{\dot\gamma}$).

The end points (3.171) of interval $I_u^{(3)}(y)$ depend on $c_{\dot\gamma}$, and the end points (3.172) of interval $I_u^{(4)}(y)$ depend on $c^*_{\dot\gamma}$. Except for relatively simple special cases, $c_{\dot\gamma}$ and $c^*_{\dot\gamma}$ have to be replaced by approximations obtained via Monte Carlo methods. The computation of these approximations requires (in the case of $c_{\dot\gamma}$) the computation of $\max_{\{u\in U\}} |[x(u)]'W' t|/\{[x(u)]'(X'X)^{-1}x(u)\}^{1/2}$ and (in the case of $c^*_{\dot\gamma}$) the computation of $\max_{\{u\in U\}} |[x(u)]'W' t|$ for each of a large number of values of $t$.

In light of results (3.154) and (3.155), we find that for any value, say $\dot t$, of $t$, $\max_{\{u\in U\}} |[x(u)]'W'\dot t|/\{[x(u)]'(X'X)^{-1}x(u)\}^{1/2}$ can be determined by finding values, say $\dot\phi$ and $\dot u$, of the scalar $\phi$ and the vector $u$ that minimize the sum of squares
$$[\dot t - \phi W x(u)]'[\dot t - \phi W x(u)],$$
subject to the constraint $u \in U$, and by then observing that
$$\max_{\{u\in U\}} \frac{|[x(u)]'W'\dot t|}{\{[x(u)]'(X'X)^{-1}x(u)\}^{1/2}} = \{\dot t'\dot t - [\dot t - \dot\phi W x(\dot u)]'[\dot t - \dot\phi W x(\dot u)]\}^{1/2}. \qquad (3.173)$$
And $\max_{\{u\in U\}} |[x(u)]'W'\dot t|$ can be determined by finding the maximum values of $[x(u)]'W'\dot t$ and $[x(u)]'W'(-\dot t)$ with respect to $u$ (subject to the constraint $u \in U$) and by then observing that $\max_{\{u\in U\}} |[x(u)]'W'\dot t|$ equals the larger of these two values.
An illustration. Let us illustrate the four alternative procedures for constructing confidence bands by applying them to the lettuce-yield data. In this application (which adds to the results obtained for these data in Sections 7.1 and 7.2c and in Subsection d of the present section), $N = 20$, $C = 3$, and $u_1$, $u_2$, and $u_3$ represent transformed amounts of Cu, Mo, and Fe, respectively. Further, let us take $\delta(u)$ to be the second-order polynomial (1.2), in which case $\operatorname{rank} X = P = 10$ (and $N - \operatorname{rank} X = N - P = 10$). And let us take $\dot\gamma = .10$ and take $U$ to be the rectangular region defined by imposing on $u_1$, $u_2$, and $u_3$ upper and lower bounds as follows: $-1 \le u_i \le 1$ ($i = 1, 2, 3$)—the determination of the constants $c_{\dot\gamma}$ and $c^*_{\dot\gamma}$ needed to construct confidence bands $I_u^{(3)}(y)$ and $I_u^{(4)}(y)$ is considerably more straightforward when $U$ is rectangular than when, e.g., it is spherical.

The values of the constants $\bar t_{.05}(10)$ and $[10\,\bar F_{.10}(10, 10)]^{1/2}$ needed to construct confidence bands $I_u^{(1)}(y)$ and $I_u^{(2)}(y)$ can be determined via well-known numerical methods and are readily available from multiple sources. They are as follows: $\bar t_{.05}(10) = 1.812461$ and $[10\,\bar F_{.10}(10, 10)]^{1/2} = 4.819340$.

In constructing confidence bands $I_u^{(3)}(y)$ and $I_u^{(4)}(y)$, resort was made to the approximate versions of those bands obtained upon replacing $c_{\dot\gamma}$ and $c^*_{\dot\gamma}$ with Monte Carlo approximations. The Monte Carlo approximations were determined from $K = 599999$ draws with the following results: $c_{.10} \doteq 3.448802$ and $c^*_{.10} \doteq 2.776452$—refer to the final 3 paragraphs of Subsection e for some discussion relevant to the "accuracy" of these approximations. Note that the approximation to $c_{.10}$ is considerably greater than $\bar t_{.05}(10)$ and significantly smaller than $[10\,\bar F_{.10}(10, 10)]^{1/2}$. If $U$ had been taken to be $U = \mathbb{R}^3$ rather than the rectangular region $U = \{u : |u_i| \le 1\ (i = 1, 2, 3)\}$, the Monte Carlo approximation to $c_{.10}$ would have been $c_{.10} \doteq 3.520382$ (rather than $c_{.10} \doteq 3.448802$). The difference between the two Monte Carlo approximations to $c_{.10}$ is relatively small, suggesting that most of the difference between $c_{.10}$ and $[10\,\bar F_{.10}(10, 10)]^{1/2}$ is accounted for by the restriction of $\delta(u)$ to the form of the second-order polynomial (1.2) rather than the restriction of $u$ to the region $\{u : |u_i| \le 1\ (i = 1, 2, 3)\}$.

[Figure 7.4. Confidence bands $I^{(1)}$, $I^{(2)}$, $I^{(3)}$, $I^{(4)}$ and the estimate $\hat\delta$, plotted against $u_1$: upper panel for $u_2 = u_3 = 1$, lower panel for $u_2 = u_3 = 0$.]
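The Monte Carlo determination of $c^*_{\dot\gamma}$ can be sketched as follows. The design below is a random stand-in (reproducing the constants quoted above would require the actual lettuce-yield design matrix), $W$ is taken to be any matrix with $W'W = (X'X)^{-1}$, and the maximization over $U$ is done crudely on a grid:

    import numpy as np
    rng = np.random.default_rng(2)

    def x_of(u):                       # second-order model in u1, u2, u3
        u1, u2, u3 = u
        return np.array([1, u1, u2, u3, u1*u1, u1*u2, u1*u3, u2*u2, u2*u3, u3*u3])

    # Hypothetical 20-run design on [-1, 1]^3.
    X = np.array([x_of(u) for u in rng.uniform(-1, 1, (20, 3))])
    W = np.linalg.cholesky(np.linalg.inv(X.T @ X)).T     # W'W = (X'X)^{-1}

    # Grid over the rectangular region U, and draws of t ~ MVt(10, I_10).
    grid = np.array(np.meshgrid(*[np.linspace(-1, 1, 9)] * 3)).reshape(3, -1).T
    XG = np.array([x_of(u) for u in grid]) @ W.T
    K, df = 19999, 10
    t = rng.standard_normal((K, 10)) / np.sqrt(rng.chisquare(df, (K, 1)) / df)
    max_abs = np.abs(t @ XG.T).max(axis=1)
    print(np.quantile(max_abs, 0.90))   # Monte Carlo approximation to c*_.10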
Segments of the various confidence bands [and of the estimated response surface $\hat\delta(u)$] are depicted in Figure 7.4; the segments depicted in the first plot are those for $u$-values such that $u_2 = u_3 = 1$, and the segments depicted in the second plot are those for $u$-values such that $u_2 = u_3 = 0$.
(3) The function $\tilde\phi(z)$ is invariant with respect to the group $\tilde G_2(\alpha^{(0)})$ of transformations of the form $\tilde T_2(z; \alpha^{(0)})$ as well as the groups $\tilde G_1$ and $\tilde G_3(\alpha^{(0)})$ of transformations of the form $\tilde T_1(z)$ and $\tilde T_3(z; \alpha^{(0)})$ only if there exists a function $\tilde\phi_{132}(\cdot)$ [of an $(N-P)$-dimensional vector] such that
$$\tilde\phi(z) = \tilde\phi_{132}\{[(z_1 - \alpha^{(0)})'(z_1 - \alpha^{(0)}) + z_3'z_3]^{-1/2}\, z_3\}$$
for those values of $z$ for which $z_3 \ne 0$. Moreover, $\Pr(z_3 \ne 0) = 1$, and the distribution of $[(z_1 - \alpha^{(0)})'(z_1 - \alpha^{(0)}) + z_3'z_3]^{-1/2} z_3$ depends on $\alpha$, $\xi$, and $\sigma$ only through the value of $(\alpha - \alpha^{(0)})'(\alpha - \alpha^{(0)})/\sigma^2$.

(4) The function $\tilde\phi(z)$ is invariant with respect to the group $\tilde G_4$ of transformations of the form $\tilde T_4(z)$ as well as the groups $\tilde G_1$, $\tilde G_3(\alpha^{(0)})$, and $\tilde G_2(\alpha^{(0)})$ of transformations of the form $\tilde T_1(z)$, $\tilde T_3(z; \alpha^{(0)})$, and $\tilde T_2(z; \alpha^{(0)})$ only if there exists a function $\tilde\phi_{1324}(\cdot)$ (of a single variable) such that
$$\tilde\phi(z) = \tilde\phi_{1324}\{z_3'z_3/[(z_1 - \alpha^{(0)})'(z_1 - \alpha^{(0)}) + z_3'z_3]\}$$
for those values of $z$ for which $z_3 \ne 0$. Moreover, $\Pr(z_3 \ne 0) = 1$, and the distribution of $z_3'z_3/[(z_1 - \alpha^{(0)})'(z_1 - \alpha^{(0)}) + z_3'z_3]$ depends on $\alpha$, $\xi$, and $\sigma$ only through the value of $(\alpha - \alpha^{(0)})'(\alpha - \alpha^{(0)})/\sigma^2$.

Verification of Proposition (1). If $\tilde\phi(z) = \tilde\phi_1(z_1, z_3)$ for some function $\tilde\phi_1(\cdot, \cdot)$, then it is clear that [for every transformation $\tilde T_1(\cdot)$ in the group $\tilde G_1$] $\tilde\phi[\tilde T_1(z)] = \tilde\phi_1(z_1, z_3) = \tilde\phi(z)$ and hence that the function $\tilde\phi(z)$ is invariant with respect to the group $\tilde G_1$. Conversely, suppose that $\tilde\phi(z)$ is invariant with respect to the group $\tilde G_1$. Then, for every choice of the vector $c$,
$$\tilde\phi(z_1, z_2 + c, z_3) = \tilde\phi[\tilde T_1(z)] = \tilde\phi(z_1, z_2, z_3).$$
And upon setting $c = -z_2$, we find that $\tilde\phi(z) = \tilde\phi_1(z_1, z_3)$, where $\tilde\phi_1(z_1, z_3) = \tilde\phi(z_1, 0, z_3)$. Moreover, the joint distribution of $z_1$ and $z_3$ does not depend on $\xi$, as is evident upon observing that
$$\begin{pmatrix} z_1 \\ z_3 \end{pmatrix} \sim \begin{pmatrix} \alpha + \sigma u_1 \\ \sigma u_3 \end{pmatrix}.$$

Verification of Proposition (2). Suppose that $\tilde\phi(z) = \tilde\phi_{13}[(z_1 - \alpha^{(0)})'(z_1 - \alpha^{(0)}) + z_3'z_3,\ z_3]$ for some function $\tilde\phi_{13}(\cdot, \cdot)$. Then, $\tilde\phi(z)$ is invariant with respect to the group $\tilde G_1$ of transformations, as is evident from Proposition (1). And it is also invariant with respect to the group $\tilde G_3(\alpha^{(0)})$, as is evident upon observing that
$$\tilde\phi[\tilde T_3(z; \alpha^{(0)})] = \tilde\phi_{13}[(z_1 - \alpha^{(0)})'PP'(z_1 - \alpha^{(0)}) + z_3'z_3,\ z_3] = \tilde\phi_{13}[(z_1 - \alpha^{(0)})'(z_1 - \alpha^{(0)}) + z_3'z_3,\ z_3] = \tilde\phi(z).$$
Conversely, suppose that $\tilde\phi(z)$ is invariant with respect to the group $\tilde G_3(\alpha^{(0)})$ as well as the group $\tilde G_1$. Then, in light of Proposition (1), $\tilde\phi(z) = \tilde\phi_1(z_1, z_3)$ for some function $\tilde\phi_1(\cdot, \cdot)$. And to establish that $\tilde\phi(z) = \tilde\phi_{13}[(z_1 - \alpha^{(0)})'(z_1 - \alpha^{(0)}) + z_3'z_3,\ z_3]$ for some function $\tilde\phi_{13}(\cdot, \cdot)$, it suffices to observe that corresponding to any two values $\dot z_1$ and $\ddot z_1$ of $z_1$ that satisfy the equality $(\ddot z_1 - \alpha^{(0)})'(\ddot z_1 - \alpha^{(0)}) = (\dot z_1 - \alpha^{(0)})'(\dot z_1 - \alpha^{(0)})$, there is an orthogonal matrix $P$ such that $\ddot z_1 = \alpha^{(0)} + P'(\dot z_1 - \alpha^{(0)})$ (the existence of which follows from Lemma 5.9.9) and hence such that $\tilde\phi_1(\ddot z_1, z_3) = \tilde\phi_1[\alpha^{(0)} + P'(\dot z_1 - \alpha^{(0)}),\ z_3]$.

It remains to verify that the joint distribution of $(z_1 - \alpha^{(0)})'(z_1 - \alpha^{(0)}) + z_3'z_3$ and $z_3$ depends on $\alpha$, $\xi$, and $\sigma$ only through the values of $(\alpha - \alpha^{(0)})'(\alpha - \alpha^{(0)})$ and $\sigma$. Denote by $O$ an $M_* \times M_*$ orthogonal matrix defined as follows: if $\alpha = \alpha^{(0)}$, take $O$ to be $I_{M_*}$ or any other $M_* \times M_*$ orthogonal matrix; if $\alpha \ne \alpha^{(0)}$, take $O$ to be the Helmert matrix whose first row is proportional to the vector $(\alpha - \alpha^{(0)})'$. Further, let $\tilde u_1 = O u_1$, and take $\tilde u$ to be the $N$-dimensional column vector whose transpose is $\tilde u' = (\tilde u_1', u_2', u_3')$. Then, upon observing that the joint distribution of $(z_1 - \alpha^{(0)})'(z_1 - \alpha^{(0)}) + z_3'z_3$ and $z_3$ is identical to that of the random variable $(\alpha - \alpha^{(0)} + \sigma u_1)'(\alpha - \alpha^{(0)} + \sigma u_1) + \sigma^2 u_3'u_3$ and the random vector $\sigma u_3$, observing that
Verification of Proposition (4). Suppose that $\tilde\phi(z)$ is invariant with respect to the group $\tilde G_4$ as well as the groups $\tilde G_1$, $\tilde G_3(\alpha^{(0)})$, and $\tilde G_2(\alpha^{(0)})$. Then, in light of Proposition (3), there exists a function $\tilde\phi_{132}(\cdot)$ such that $\tilde\phi(z) = \tilde\phi_{132}\{[(z_1-\alpha^{(0)})'(z_1-\alpha^{(0)}) + z_3'z_3]^{-1/2} z_3\}$ for those values of $z$ for which $z_3 \ne 0$. Moreover, there exists a function $\tilde\phi_{1324}(\cdot)$ such that for every value of $z_1$ and every nonnull value of $z_3$,
$$\tilde\phi_{132}\{[(z_1-\alpha^{(0)})'(z_1-\alpha^{(0)}) + z_3'z_3]^{-1/2} z_3\} = \tilde\phi_{1324}\{z_3'z_3/[(z_1-\alpha^{(0)})'(z_1-\alpha^{(0)}) + z_3'z_3]\}.$$
To confirm this, it suffices to take $\dot z_1$ and $\ddot z_1$ to be values of $z_1$ and $\dot z_3$ and $\ddot z_3$ nonnull values of $z_3$ such that
$$\frac{\ddot z_3'\ddot z_3}{(\ddot z_1-\alpha^{(0)})'(\ddot z_1-\alpha^{(0)}) + \ddot z_3'\ddot z_3} = \frac{\dot z_3'\dot z_3}{(\dot z_1-\alpha^{(0)})'(\dot z_1-\alpha^{(0)}) + \dot z_3'\dot z_3} \qquad (4.1)$$
and to observe that equality (4.1) is reexpressible as the equality

Verification of the proposition. Suppose that $\tilde\phi(z) = \tilde\phi_4(z_1, z_2, z_3'z_3)$ for some function $\tilde\phi_4(\cdot, \cdot, \cdot)$. Then,
$$\tilde\phi[\tilde T_4(z)] = \tilde\phi_4[z_1, z_2, (B'z_3)'B'z_3] = \tilde\phi_4(z_1, z_2, z_3'z_3) = \tilde\phi(z).$$
Conversely, suppose that $\tilde\phi(z)$ is invariant with respect to the group $\tilde G_4$. Then, to establish that $\tilde\phi(z) = \tilde\phi_4(z_1, z_2, z_3'z_3)$ for some function $\tilde\phi_4(\cdot, \cdot, \cdot)$, it suffices to observe that corresponding to any two values $\dot z_3$ and $\ddot z_3$ of $z_3$ that satisfy the equality $\ddot z_3'\ddot z_3 = \dot z_3'\dot z_3$, there is an orthogonal matrix $B$ such that $\ddot z_3 = B'\dot z_3$ (the existence of which follows from Lemma 5.9.9).
The result of Theorem 7.4.1 constitutes part of a version of what is known as the Neyman–Pearson fundamental lemma or simply as the Neyman–Pearson lemma. For a proof of this result, refer, for example, to Casella and Berger (2002, sec. 8.3).

In regard to Theorem 7.4.1, the test with critical region (set) $C$ can be identified by the indicator function of $C$ rather than by $C$ itself; this function is the so-called critical (test) function. The definition of a test can be extended to include "randomized" tests; this can be done by extending the definition of a critical function to include any function $\phi(\cdot)$ such that $0 \le \phi(x) \le 1$ for every scalar $x$—when $\phi(\cdot)$ is an indicator function of a set $C$, $\phi(x)$ equals either 0 or 1. Under the extended definition, the test with critical function $\phi(\cdot)$ consists of rejecting the null hypothesis $\theta = \theta^{(0)}$ with probability $\phi(x)$, where $x$ is the observed value of $X$. This test has a power function, say $\pi(\theta; \phi)$, that is expressible as $\pi(\theta; \phi) = E[\phi(X)]$. The coverage of Theorem 7.4.1 can be extended: it can be shown that among tests (randomized as well as nonrandomized) whose size does not exceed $\dot\gamma$, the power function attains its maximum value for any specified value of $\theta$ (other than $\theta^{(0)}$) when the test is taken to be a nonrandomized test with critical region $C$ that satisfies conditions (4.2) and (4.3).

In the context of Theorem 7.4.1, the (nonrandomized) test with critical region $C$ that satisfies conditions (4.2) and (4.3) is optimal (in the sense that the value attained by its power function at the specified value of $\theta$ is a maximum among all tests whose size does not exceed $\dot\gamma$). In general, this test varies with $\theta$. Suppose, however, that the set $\mathcal{X} = \{x : f(x; \theta) > 0\}$ does not vary with the value of $\theta$ and that for every value of $\theta$ in $\Theta$ and for $x \in \mathcal{X}$, the "likelihood ratio" $f(x; \theta)/f(x; \theta^{(0)})$ is a nondecreasing function of $x$ or, alternatively, is a nonincreasing function of $x$. Then, there is a critical region $C$ that satisfies conditions (4.2) and (4.3) and that does not vary with $\theta$: depending on whether the ratio $f(x; \theta)/f(x; \theta^{(0)})$ is a nondecreasing function of $x$ or a nonincreasing function of $x$, we can take $C = \{x : x > k'\}$, where $k'$ is the upper $100\dot\gamma\%$ point of the distribution with pdf $f(\cdot\,; \theta^{(0)})$, or take $C = \{x : x < k'\}$, where $k'$ is the lower $100\dot\gamma\%$ point of the distribution with pdf $f(\cdot\,; \theta^{(0)})$. In either case, the test with critical region $C$ constitutes what is referred to as a UMP (uniformly most powerful) test.
And let $\tau = (\alpha - \alpha^{(0)})'(\alpha - \alpha^{(0)})/\sigma^2$. When regarded as a function of $z$, $X$ is invariant with respect to the groups $\tilde G_1$, $\tilde G_3(\alpha^{(0)})$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$ (as can be readily verified). Thus, if $\tilde\phi(z)$ depends on $z$ only through the value of $X$, then it is invariant with respect to the groups $\tilde G_1$, $\tilde G_3(\alpha^{(0)})$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$. Conversely, if $\tilde\phi(z)$ is invariant with respect to the groups $\tilde G_1$, $\tilde G_3(\alpha^{(0)})$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$, then [in light of equality (4.4)] there exists a function $\tilde\phi_{1324}(\cdot)$ such that
$$\tilde\phi(z) = \tilde\phi_{1324}(X) \qquad (4.6)$$
for those values of $z$ for which $z_3 \ne 0$, in which case [since $\Pr(z_3 \ne 0) = 1$] $\tilde\phi(z) = \tilde\phi_{1324}(X)$ with probability 1. Moreover, the distribution of $X$ or any function of $X$ depends on $\alpha$, $\xi$, and $\sigma$ only through the value of the nonnegative parametric function $\tau$. In fact,
$$X \sim \frac{\tau + 2(\sqrt{\tau}, 0, 0, \ldots, 0)\,u_1 + u_1'u_1}{\tau + 2(\sqrt{\tau}, 0, 0, \ldots, 0)\,u_1 + u_1'u_1 + u_3'u_3}, \qquad (4.7)$$
as can be readily verified via a development similar to that employed in Subsection a in the verification of Proposition (2).

Applicability of the Neyman–Pearson lemma. As is evident from the preceding part of the present subsection, a test of $\tilde H_0$ versus $\tilde H_1$ with critical function $\tilde\phi(z)$ is invariant with respect to the groups $\tilde G_1$, $\tilde G_3(\alpha^{(0)})$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$ if and only if there exists a test with critical function $\tilde\phi_{1324}(X)$ such that $\tilde\phi(z) = \tilde\phi_{1324}(X)$ with probability 1, in which case
$$E[\tilde\phi(z)] = E[\tilde\phi_{1324}(X)], \qquad (4.8)$$
that is, the two tests have the same power function. The upshot of this remark is that the Neyman–Pearson lemma can be used to address the problem of finding a test of $\tilde H_0$ versus $\tilde H_1$ that is "optimal" among tests that are invariant with respect to the groups $\tilde G_1$, $\tilde G_3(\alpha^{(0)})$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$ and whose size does not exceed $\dot\gamma$. Among such tests, the power function of the test attains its maximum value for values of $\alpha$, $\xi$, and $\sigma$ such that $\tau$ equals some specified value $\dot\tau$ when the critical region of the test is a critical region obtained upon applying the Neyman–Pearson lemma, taking $X$ to be the observable random variable defined by expression (4.5) and taking $\theta = \tau$, $\Theta = [0, \infty)$, $\theta^{(0)} = 0$, and $\dot\theta = \dot\tau$.
Special case: $e \sim N(0, \sigma^2 I)$. Suppose that the distribution of the vector $u = \sigma^{-1} e$ is $N(0, I)$, in which case $z \sim N\!\left[\begin{pmatrix}\alpha \\ \xi \\ 0\end{pmatrix},\ \sigma^2 I\right]$. Then,
$$X = \frac{(z_1-\alpha^{(0)})'(z_1-\alpha^{(0)})}{(z_1-\alpha^{(0)})'(z_1-\alpha^{(0)}) + z_3'z_3} \sim \operatorname{Be}\!\Bigl[\frac{M_*}{2},\ \frac{N-P}{2};\ \frac{\tau}{2}\Bigr], \qquad (4.9)$$
as is evident upon observing that $z_3'z_3/\sigma^2 \sim \chi^2(N-P)$ and that $(z_1-\alpha^{(0)})'(z_1-\alpha^{(0)})/\sigma^2$ is statistically independent of $z_3'z_3/\sigma^2$ and has a $\chi^2(M_*, \tau)$ distribution. Further, let (for $\tau \ge 0$) $f(\cdot\,; \tau)$ represent the pdf of the $\operatorname{Be}[M_*/2, (N-P)/2; \tau/2]$ distribution, and observe [in light of result (6.3.24)] that (for $\tau > 0$) the ratio $f(x; \tau)/f(x; 0)$ is a strictly increasing function of $x$ (over the interval $0 < x < 1$). And consider the test (of the null hypothesis $\tilde H_0 : \alpha = \alpha^{(0)}$ versus the alternative hypothesis $\tilde H_1 : \alpha \ne \alpha^{(0)}$) that rejects $\tilde H_0$ if and only if
$$X > \bar B_{\dot\gamma}[M_*/2,\ (N-P)/2], \qquad (4.10)$$
where $\bar B_{\dot\gamma}[M_*/2, (N-P)/2]$ is the upper $100\dot\gamma\%$ point of the $\operatorname{Be}[M_*/2, (N-P)/2]$ distribution. This test is of size $\dot\gamma$, is invariant with respect to the groups $\tilde G_1$, $\tilde G_3(\alpha^{(0)})$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$, and, among tests whose size does not exceed $\dot\gamma$ and that are invariant with respect to the groups $\tilde G_1$, $\tilde G_3(\alpha^{(0)})$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$, is UMP (as is evident from the discussion in Subsection c and from the discussion in the preceding part of the present subsection).
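For the normal special case, the critical point in (4.10) and the usual F critical point determine one another through the monotone relation (4.11) below. A quick numerical check (with $M_* = 3$ and $N - P = 10$ for illustration):

    from scipy import stats

    # X > B_bar if and only if F > F_bar, where X = (mF/n)/(1 + mF/n).
    m_star, resid_df, gamma = 3, 10, 0.10
    B_bar = stats.beta.isf(gamma, m_star / 2, resid_df / 2)
    F_bar = stats.f.isf(gamma, m_star, resid_df)
    ratio = m_star * F_bar / resid_df
    print(B_bar, ratio / (1 + ratio))   # the two printed values agree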
The F test. Note that (for $z_3 \ne 0$)
$$\frac{(z_1-\alpha^{(0)})'(z_1-\alpha^{(0)})}{(z_1-\alpha^{(0)})'(z_1-\alpha^{(0)}) + z_3'z_3} = \frac{[M_*/(N-P)]\,\tilde F(\alpha^{(0)})}{1 + [M_*/(N-P)]\,\tilde F(\alpha^{(0)})}, \qquad (4.11)$$
where (for an arbitrary $M_* \times 1$ vector $\alpha$)
Let us continue to take $X$ to be the observable random variable defined by equality (4.5), and let us take $f(\cdot\,; \tau)$ to be the pdf of the distribution of $X$ [as determined from the distribution of $u$ on the basis of expression (4.7)]. When $\tau = 0$, $X \sim \operatorname{Be}[M_*/2, (N-P)/2]$, so that $f(\cdot\,; 0)$ is the same in the general case as in the special case where $u \sim N(0, I)$—refer, e.g., to Part (1) of Theorem 6.3.1. However, when $\tau > 0$, the distribution of $X$ varies with the distribution of $u$—in the special case where $u \sim N(0, I)$, this distribution is $\operatorname{Be}[M_*/2, (N-P)/2; \tau/2]$ (a noncentral beta distribution). Further, let $C$ represent an arbitrary critical region for testing (on the basis of $X$) the null hypothesis that $\tau = 0$ versus the alternative hypothesis that $\tau > 0$. And denote by $\pi(\cdot\,; C)$ the power function of the test with critical region $C$, so that (for $\tau \ge 0$) $\pi(\tau; C) = \int_C f(x; \tau)\,dx$.

Now, consider a critical region $C_*$ such that, subject to the constraint $\pi(0; C) \le \dot\gamma$, $\pi(\dot\tau; C)$ attains its maximum value for any particular (strictly positive) value $\dot\tau$ of $\tau$ when $C = C_*$. At least in principle, such a critical region can be determined by applying the Neyman–Pearson lemma. The critical region $C_*$ defines a set of $z$-values that form the critical region of a size-$\dot\gamma$ test of $\tilde H_0$ versus $\tilde H_1$ that is invariant with respect to the groups $\tilde G_1$, $\tilde G_3(\alpha^{(0)})$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$ of transformations. Moreover, upon recalling (from the final part of Section 7.3a) that [as in the special case where $u \sim N(0, I)$] $z_1$, $z_2$, and $z_3'z_3$ form a (vector-valued) sufficient statistic and upon employing essentially the same reasoning as in the preceding part of the present subsection, we find that the value attained for $\tau = \dot\tau$ by the power function of that test is greater than or equal to that attained for $\tau = \dot\tau$ by the power function of any test of $\tilde H_0$ versus $\tilde H_1$ that is invariant with respect to the groups $\tilde G_1$, $\tilde G_3(\alpha^{(0)})$, and $\tilde G_2(\alpha^{(0)})$ and whose size does not exceed $\dot\gamma$.

If [as in the special case where $u \sim N(0, I)$] the ratio $f(x; \dot\tau)/f(x; 0)$ is a nondecreasing function of $x$, then $C_*$ can be taken to be the critical region defined by inequality (4.10) or (aside from a set of $z$-values of probability 0) by the inequality $\tilde F(\alpha^{(0)}) > \bar F_{\dot\gamma}(M_*, N-P)$. And if [as in the special case where $u \sim N(0, I)$] the ratio $f(x; \tau)/f(x; 0)$ is a nondecreasing function of $x$ for every choice of $\tau$, then the size-$\dot\gamma$ test (of $\tilde H_0$ versus $\tilde H_1$) that is invariant with respect to the groups $\tilde G_1$, $\tilde G_3(\alpha^{(0)})$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$ and whose critical region is defined by inequality (4.10) or (for values of $z$ such that $z_3 \ne 0$) by the inequality $\tilde F(\alpha^{(0)}) > \bar F_{\dot\gamma}(M_*, N-P)$ is UMP among all tests that are invariant with respect to the groups $\tilde G_1$, $\tilde G_3(\alpha^{(0)})$, and $\tilde G_2(\alpha^{(0)})$ and whose size does not exceed $\dot\gamma$. However, in general, $C_*$ may differ nontrivially (i.e., for a set of $z$-values having nonzero probability) from the critical region defined by inequality (4.10) or by the inequality $\tilde F(\alpha^{(0)}) > \bar F_{\dot\gamma}(M_*, N-P)$ and from one choice of $\dot\tau$ to another, and there may not be any test that is UMP among all tests that are invariant with respect to the groups $\tilde G_1$, $\tilde G_3(\alpha^{(0)})$, and $\tilde G_2(\alpha^{(0)})$ and whose size does not exceed $\dot\gamma$.
Restatement of results. The results of the preceding parts of the present subsection are stated in terms of the transformed observable random vector $z$ ($= O'y$) and in terms of the transformed parametric vector $\alpha$ ($= S'\lambda$) associated with the canonical form of the G–M model. These results can be restated in terms of $y$ and $\lambda$.

The null hypothesis $\tilde H_0 : \alpha = \alpha^{(0)}$ is equivalent to the null hypothesis $H_0 : \lambda = \lambda^{(0)}$—the parameter vector $\beta$ satisfies the condition $\lambda = \lambda^{(0)}$ if and only if it satisfies the condition $\alpha = \alpha^{(0)}$—and the alternative hypothesis $\tilde H_1 : \alpha \ne \alpha^{(0)}$ is equivalent to the alternative hypothesis $H_1 : \lambda \ne \lambda^{(0)}$. Moreover, $\tau = (\alpha - \alpha^{(0)})'(\alpha - \alpha^{(0)})/\sigma^2$ is reexpressible as
$$\tau = (\lambda - \lambda^{(0)})'\,C^{-1}\,(\lambda - \lambda^{(0)})/\sigma^2 \qquad (4.12)$$
—refer to result (3.34)—and $z_3'z_3$ and $(z_1 - \alpha^{(0)})'(z_1 - \alpha^{(0)})$ are reexpressible as
$$z_3'z_3 = d'd = y'(I - P_X)y \qquad (4.13)$$
and
$$(z_1 - \alpha^{(0)})'(z_1 - \alpha^{(0)}) = (\hat\alpha - \alpha^{(0)})'(\hat\alpha - \alpha^{(0)}) = (\hat\lambda - \lambda^{(0)})'\,C^{-1}\,(\hat\lambda - \lambda^{(0)}) \qquad (4.14)$$
—refer to result (3.21) or (3.26) and to result (3.28). Results (4.13) and (4.14) can be used to reexpress (in terms of $y$) the observable random variable $X$ defined by expression (4.5) and, as noted earlier (in Section 7.3b), to reexpress $\tilde F(\alpha^{(0)})$.
A function $\tilde\phi(z)$ of $z$ that is the critical function of a test of $\tilde H_0$ or $H_0$ versus $\tilde H_1$ or $H_1$ is reexpressible as a function $\phi(y)$ [$= \tilde\phi(O'y)$] of $y$. And corresponding to any one-to-one transformation $\tilde T(z)$ of $z$ (from $\mathbb{R}^N$ onto $\mathbb{R}^N$) is a one-to-one transformation $T(y)$ [$= O\tilde T(O'y)$] of $y$ (from $\mathbb{R}^N$ onto $\mathbb{R}^N$), and corresponding to any group $\tilde G$ of such transformations of $z$ is a group $G$ consisting of the corresponding transformations of $y$. Further, a test (of $\tilde H_0$ or $H_0$ versus $\tilde H_1$ or $H_1$) is invariant with respect to the group $\tilde G$ if and only if it is invariant with respect to the group $G$—if $\tilde\phi(z)$ is the critical function expressed as a function of $z$ and $\phi(y)$ the critical function expressed as a function of $y$, then $\tilde\phi[\tilde T(z)] = \tilde\phi(z)$ for every transformation $\tilde T(z)$ in $\tilde G$ if and only if $\phi[T(y)] = \phi(y)$ for every transformation $T(y)$ in $G$. Refer to the final two parts of Section 7.3b for some specifics as they pertain to the groups $\tilde G_1$, $\tilde G_2(\alpha^{(0)})$, $\tilde G_3(\alpha^{(0)})$, and $\tilde G_4$.
Confidence sets. Take (for each value $\dot\alpha$ of $\alpha$) $\tilde X(\dot\alpha)$ to be the random variable defined as follows:
$$\tilde X(\dot\alpha) = \begin{cases} \dfrac{(z_1 - \dot\alpha)'(z_1 - \dot\alpha)}{(z_1 - \dot\alpha)'(z_1 - \dot\alpha) + z_3'z_3}, & \text{if } z_1 \ne \dot\alpha \text{ or } z_3 \ne 0, \\[1ex] 0, & \text{if } z_1 = \dot\alpha \text{ and } z_3 = 0. \end{cases}$$
And let $\tilde G_2$ and $\tilde G_3$ represent the groups consisting, respectively, of the totality of the groups $\tilde G_2(\alpha^{(0)})$ and the totality of the groups $\tilde G_3(\alpha^{(0)})$ ($\alpha^{(0)} \in \mathbb{R}^{M_*}$). Then, the $100(1-\dot\gamma)\%$ confidence set $\tilde A_F(z)$ is invariant with respect to the groups $\tilde G_1$ and $\tilde G_4$ and is equivariant with respect to the groups $\tilde G_2$ and $\tilde G_3$—refer to the discussion in the final 3 parts of Section 7.3b. And for purposes of comparison, let $\tilde A(z)$ represent any confidence set for $\alpha$ whose probability of coverage equals or exceeds $1-\dot\gamma$ and that is invariant with respect to the group $\tilde G_1$ and equivariant with respect to the groups $\tilde G_2$ and $\tilde G_3$. Further, for $\alpha^{(0)} \in \mathbb{R}^{M_*}$, take $\delta(\alpha^{(0)}; \alpha)$ to be the probability $\Pr[\alpha^{(0)} \in \tilde A(z)]$ of $\alpha^{(0)}$ being "covered" by the confidence set $\tilde A(z)$—$\delta(\alpha; \alpha) \ge 1-\dot\gamma$, and for $\alpha^{(0)} \ne \alpha$, $\delta(\alpha^{(0)}; \alpha)$ is referred to as the probability of false coverage.
Corresponding to $\tilde A(z)$ is the test of the null hypothesis $\tilde H_0 : \alpha = \alpha^{(0)}$ versus the alternative hypothesis $\tilde H_1 : \alpha \ne \alpha^{(0)}$ with critical function $\tilde\phi(\cdot\,; \alpha^{(0)})$ defined as follows:
$$\tilde\phi(z; \alpha^{(0)}) = \begin{cases} 1, & \text{if } \alpha^{(0)} \notin \tilde A(z), \\ 0, & \text{if } \alpha^{(0)} \in \tilde A(z). \end{cases}$$
And in light of the invariance of $\tilde A(z)$ with respect to the group $\tilde G_1$ and the equivariance of $\tilde A(z)$ with respect to the groups $\tilde G_2$ and $\tilde G_3$,
$$\tilde\phi[\tilde T_1(z); \alpha^{(0)}] = \tilde\phi[\tilde T_2(z); \alpha^{(0)}] = \tilde\phi[\tilde T_3(z); \alpha^{(0)}] = \tilde\phi(z; \alpha^{(0)}), \qquad (4.15)$$
as can be readily verified. Further, let $\pi(\alpha; \alpha^{(0)}) = E[\tilde\phi(z; \alpha^{(0)})]$, and observe that
$$\pi(\alpha; \alpha^{(0)}) = 1 - \delta(\alpha^{(0)}; \alpha), \qquad (4.16)$$
implying in particular that
$$\pi(\alpha; \alpha) = 1 - \delta(\alpha; \alpha) \le \dot\gamma. \qquad (4.17)$$
Now, suppose that the distribution of the random vector $u = \sigma^{-1} e$ is $N(0, I)$ or, more generally, that the pdf $f(\cdot\,; \tau)$ of the distribution of the random variable (4.7) is such that for every $\tau > 0$ the ratio $f(x; \tau)/f(x; 0)$ is (for $0 < x < 1$) a nondecreasing function of $x$. Then, in light of equalities (4.15) and (4.17), it follows from our previous results (on the optimality of the F test) that (for every value of $\alpha$ and regardless of the choice for $\alpha^{(0)}$) $\pi(\alpha; \alpha^{(0)})$ attains its maximum value when $\tilde A(z)$ is taken to be the set $\tilde A_F(z)$ [in which case $\tilde\phi(z; \alpha^{(0)})$ is the critical function of the test that rejects $\tilde H_0$ if $\tilde X(\alpha^{(0)}) > \bar B_{\dot\gamma}[M_*/2, (N-P)/2]$ or (for values of $z$ such that $z_3 \ne 0$) the critical function of the F test (of $\tilde H_0$ versus $\tilde H_1$)]. And in light of equality (4.16), we conclude that among confidence sets for $\alpha$ whose probability of coverage equals or exceeds $1-\dot\gamma$ and that are invariant with respect to the group $\tilde G_1$ and equivariant with respect to the groups $\tilde G_2$ and $\tilde G_3$, the confidence set $\tilde A_F(z)$ is UMA (uniformly most accurate) in the sense that (for every false value of $\alpha$) its probability of false coverage is a minimum.
This result on the optimality of the $100(1-\dot\gamma)\%$ confidence set $\tilde A_F(z)$ (for the vector $\alpha$) can be translated into a result on the optimality of the following $100(1-\dot\gamma)\%$ confidence set for the vector $\lambda$:
$$A_F(y) = \{\dot\lambda : \dot\lambda = W'\dot\alpha,\ \dot\alpha \in \tilde A_F(O'y)\} = \{\dot\lambda \in \mathcal{C}(\Lambda') : X(\dot\lambda) \le \bar B_{\dot\gamma}[M_*/2, (N-P)/2]\},$$
where (for $\dot\lambda \in \mathcal{C}(\Lambda')$)
$$X(\dot\lambda) = \begin{cases} \dfrac{(\hat\lambda - \dot\lambda)'\,C^{-1}\,(\hat\lambda - \dot\lambda)}{(\hat\lambda - \dot\lambda)'\,C^{-1}\,(\hat\lambda - \dot\lambda) + y'(I - P_X)y}, & \text{if } \hat\lambda \ne \dot\lambda \text{ or } y'(I - P_X)y > 0, \\[1ex] 0, & \text{if } \hat\lambda = \dot\lambda \text{ and } y'(I - P_X)y = 0. \end{cases}$$
The set $A_F(y)$ is invariant with respect to the group $G_1$ and equivariant with respect to the groups $G_2$ and $G_3$ consisting respectively of the totality of the groups $G_2(\lambda^{(0)})$ [$\lambda^{(0)} \in \mathcal{C}(\Lambda')$] and the totality of the groups $G_3(\lambda^{(0)})$ [$\lambda^{(0)} \in \mathcal{C}(\Lambda')$]. And the probability $\Pr[\lambda^{(0)} \in A_F(y)]$ of the set $A_F(y)$ covering any particular vector $\lambda^{(0)}$ in $\mathcal{C}(\Lambda')$ equals the probability $\Pr[S'\lambda^{(0)} \in \tilde A_F(z)]$ of the set $\tilde A_F(z)$ covering the vector $S'\lambda^{(0)}$.
Corresponding to any confidence set $A(y)$ (for $\lambda$) whose probability of coverage equals or exceeds $1-\dot\gamma$ and that is invariant with respect to the group $G_1$ and equivariant with respect to the groups $G_2$ and $G_3$ is a confidence set $\tilde A(z)$ (for $\alpha$) defined as follows:
$$\tilde A(z) = \{\dot\alpha : \dot\alpha = S'\dot\lambda,\ \dot\lambda \in A(Oz)\}.$$
This set is such that
$$A(y) = \{\dot\lambda : \dot\lambda = W'\dot\alpha,\ \dot\alpha \in \tilde A(O'y)\},$$
and it is invariant with respect to the group $\tilde G_1$ and equivariant with respect to the groups $\tilde G_2$ and $\tilde G_3$. Moreover, for any vector $\lambda^{(0)} \in \mathcal{C}(\Lambda')$,
$$\Pr[\lambda^{(0)} \in A(y)] = \Pr[S'\lambda^{(0)} \in \tilde A(z)]$$
—in particular, $\Pr[\lambda \in A(y)] = \Pr[\alpha \in \tilde A(z)]$. Since the confidence set $\tilde A_F(z)$ is UMA among confidence sets for $\alpha$ whose probability of coverage equals or exceeds $1-\dot\gamma$ and that are invariant with respect to the group $\tilde G_1$ and equivariant with respect to the groups $\tilde G_2$ and $\tilde G_3$, we conclude that the confidence set $A_F(y)$ is UMA among confidence sets for $\lambda$ whose probability of coverage equals or exceeds $1-\dot\gamma$ and that are invariant with respect to the group $G_1$ and equivariant with respect to the groups $G_2$ and $G_3$.
equal to 2. And consider the integration of a function $g(s)$ of an $N \times 1$ vector $s$ over a set $S_N(s^{(0)}, \rho)$ defined for $s^{(0)} \in \mathbb{R}^N$ and $\rho > 0$ as follows:
$$S_N(s^{(0)}, \rho) = \{s \in \mathbb{R}^N : (s - s^{(0)})'(s - s^{(0)}) = \rho^2\}.$$
When $N = 2$, $S_N(s^{(0)}, \rho)$ is a circle, and when $N = 3$, it is a sphere. More generally, $S_N(s^{(0)}, \rho)$ is referred to as a hypersphere (of dimension $N - 1$). This circle, sphere, or hypersphere is centered at the point $s^{(0)}$ and is of radius $\rho$.

Integration over the set $S_N(s^{(0)}, \rho)$ is related to integration over a set $B_N$ defined as follows:
$$B_N = \{x \in \mathbb{R}^N : x'x \le 1\}.$$
This set is a closed ball; it is centered at the origin 0 and is of radius 1.

Let us write $\int_{S_N(s^{(0)}, \rho)} g(s)\,ds$ for the integral of the function $g(\cdot)$ over the hypersphere $S_N(s^{(0)}, \rho)$ centered at $s^{(0)}$ and of radius $\rho$. In the special case of a hypersphere centered at the origin and of radius 1,
$$\int_{S_N(0, 1)} g(s)\,ds = N \int_{B_N} g[(x'x)^{-1/2} x]\,dx \qquad (4.18)$$
—for $x = 0$, define $(x'x)^{-1/2}x = 0$. More generally (in the special case of a hypersphere centered at the origin and of radius $\rho$),
$$\int_{S_N(0, \rho)} g(s)\,ds = N\rho^{N-1} \int_{B_N} g[\rho\,(x'x)^{-1/2} x]\,dx. \qquad (4.19)$$
And still more generally,
$$\int_{S_N(s^{(0)}, \rho)} g(s)\,ds = \int_{S_N(0, \rho)} g(s^{(0)} + \tilde s)\,d\tilde s \qquad (4.20)$$
$$= N\rho^{N-1} \int_{B_N} g[s^{(0)} + \rho\,(x'x)^{-1/2} x]\,dx. \qquad (4.21)$$

Following Baker (1997), let us regard equality (4.18) or, more generally, equality (4.19) or (4.21) as a definition—Baker indicated that he suspects equality (4.18) "is folkloric, and likely has appeared as a theorem rather than a definition." Various basic results on integration over a hypersphere follow readily from equality (4.18), (4.19), or (4.21). In particular, for any two constants $a$ and $b$ and for "any" two functions $g_1(s)$ and $g_2(s)$ of an $N \times 1$ vector $s$,
$$\int_{S_N(s^{(0)}, \rho)} [a g_1(s) + b g_2(s)]\,ds = a \int_{S_N(s^{(0)}, \rho)} g_1(s)\,ds + b \int_{S_N(s^{(0)}, \rho)} g_2(s)\,ds, \qquad (4.22)$$
and for any $N \times N$ orthogonal matrix $O$ [and any function $g(s)$ of an $N \times 1$ vector $s$],
$$\int_{S_N(0, \rho)} g(Os)\,ds = \int_{S_N(0, \rho)} g(s)\,ds. \qquad (4.23)$$

In the special case where $g(s) = 1$ (for every $N \times 1$ vector $s$), $\int_{S_N(s^{(0)}, \rho)} g(s)\,ds$ represents the "surface area" of the $(N-1)$-dimensional hypersphere $S_N(s^{(0)}, \rho)$. As demonstrated by Baker (1997),
$$\int_{S_N(s^{(0)}, \rho)} ds = \frac{2\pi^{N/2}\rho^{N-1}}{\Gamma(N/2)}. \qquad (4.24)$$
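Formula (4.24) reduces to the familiar circumference $2\pi\rho$ for $N = 2$ and surface area $4\pi\rho^2$ for $N = 3$, as a quick check confirms:

    import math

    def hypersphere_area(N, rho):
        # Surface area (4.24) of the (N-1)-dimensional hypersphere of radius rho.
        return 2 * math.pi ** (N / 2) * rho ** (N - 1) / math.gamma(N / 2)

    print(hypersphere_area(2, 1.0), 2 * math.pi)        # circle: 2*pi*rho
    print(hypersphere_area(3, 2.0), 4 * math.pi * 4.0)  # sphere: 4*pi*rho^2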
The integration of a function over a hypersphere is defined for hyperspheres of dimension 1 or more by equality (4.21) or, in special cases, by equality (4.18) or (4.19). It is convenient to also define the integration of a function over a hypersphere for a "hypersphere" $S_1(s^{(0)}, \rho)$ of dimension 0 (centered at $s^{(0)}$ and of radius $\rho$), that is, for the "hypersphere"
$$S_1(s^{(0)}, \rho) = \{s \in \mathbb{R}^1 : (s - s^{(0)})^2 = \rho^2\} = \{s \in \mathbb{R}^1 : s = s^{(0)} \pm \rho\}. \qquad (4.25)$$
Let us write $\int_{S_1(s^{(0)}, \rho)} g(s)\,ds$ for the integral of a function $g(\cdot)$ over the set $S_1(s^{(0)}, \rho)$, and define
$$\int_{S_1(s^{(0)}, \rho)} g(s)\,ds = g(s^{(0)} + \rho) + g(s^{(0)} - \rho). \qquad (4.26)$$
It is also convenient to extend the definition of the integral $\int_{S_N(s^{(0)}, \rho)} g(s)\,ds$ to "hyperspheres" of radius 0. Note that when the definition of the integral of a function $g(\cdot)$ over the set $S_N(s^{(0)}, \rho)$ is extended to $N = 1$ and/or $\rho = 0$ via equalities (4.26) and/or (4.27), properties (4.22) and (4.23) continue to apply. Note also that
$$\int_{S_N(s^{(0)}, 0)} ds = 1 \qquad (4.28)$$
and that (for $\rho > 0$)
$$\int_{S_1(s^{(0)}, \rho)} ds = 2. \qquad (4.29)$$
An optimality criterion: average power. Let us now resume discussion of the problem of testing the null hypothesis $H_0 : \lambda = \lambda^{(0)}$ or $\tilde H_0 : \alpha = \alpha^{(0)}$ versus the alternative hypothesis $H_1 : \lambda \ne \lambda^{(0)}$ or $\tilde H_1 : \alpha \ne \alpha^{(0)}$. In doing so, let us take the context to be that of Section 7.3, and let us adopt the notation and terminology introduced therein. Thus, $y$ is taken to be an $N \times 1$ observable random vector that follows the G–M model, and $z = O'y = (z_1', z_2', z_3')' = (\hat\alpha', \hat\xi', d')'$. Moreover, in what follows, it is assumed that the distribution of the vector $e$ of residual effects in the G–M model is $N(0, \sigma^2 I)$, in which case $y \sim N(X\beta, \sigma^2 I)$ and $z \sim N[(\alpha', \xi', 0')',\ \sigma^2 I]$.

Let $\tilde\phi(z)$ represent (in terms of the transformed vector $z$) the critical function of an arbitrary (possibly randomized) test of the null hypothesis $\tilde H_0 : \alpha = \alpha^{(0)}$ [so that $0 \le \tilde\phi(z) \le 1$ for every value of $z$]. And let $\tilde\pi(\alpha, \xi, \sigma)$ represent the power function of this test. By definition,
$$\tilde\pi(\alpha, \xi, \sigma) = E[\tilde\phi(z)].$$
Further, let $\tilde\alpha = \alpha - \alpha^{(0)}$, and take $\bar\pi(\rho, \xi, \sigma)$ to be the function of $\rho$, $\xi$, and $\sigma$ defined (for $\rho \ge 0$) as
$$\bar\pi(\rho, \xi, \sigma) = \frac{\int_{S(\alpha^{(0)}, \rho)} \tilde\pi(\alpha, \xi, \sigma)\,d\alpha}{\int_{S(\alpha^{(0)}, \rho)} d\alpha},$$
and where $\kappa(\cdot\,; \rho, \sigma)$ is a function whose value is defined for every $M_* \times 1$ vector $x$ as follows:
$$\kappa(x; \rho, \sigma) = \frac{\int_{S(0, \rho)} e^{x'\tilde\alpha/\sigma^2}\,d\tilde\alpha}{\int_{S(0, \rho)} d\tilde\alpha}.$$
Note that
$$f_1(\hat\alpha; 0, \sigma) = (2\pi\sigma^2)^{-M_*/2}\, e^{-(\hat\alpha - \alpha^{(0)})'(\hat\alpha - \alpha^{(0)})/(2\sigma^2)}, \qquad (4.32)$$
the pdf of the $N(\alpha^{(0)}, \sigma^2 I)$ distribution evaluated at $\hat\alpha$.

The function $\kappa(\cdot\,; \rho, \sigma)$ is such that for every $M_* \times M_*$ orthogonal matrix $O$ and every $M_* \times 1$ vector $x$,
$$\kappa(Ox; \rho, \sigma) = \kappa(x; \rho, \sigma),$$
as is evident from result (4.23). Moreover, corresponding to any $M_* \times 1$ vectors $x_1$ and $x_2$ such that $x_2'x_2 = x_1'x_1$, there exists an orthogonal matrix $O$ such that $x_2 = Ox_1$, as is evident from Lemma 5.9.9. Thus, $\kappa(x; \rho, \sigma)$ depends on $x$ only through the value of $x'x$. That is,
$$\kappa(x; \rho, \sigma) = \tilde\kappa(x'x; \rho, \sigma)$$
for some function $\tilde\kappa(\cdot\,; \rho, \sigma)$ of a single (nonnegative) variable. And $f_1(\hat\alpha; \rho, \sigma)$ is reexpressible as
$$f_1(\hat\alpha; \rho, \sigma) = (2\pi\sigma^2)^{-M_*/2}\, e^{-(\hat\alpha - \alpha^{(0)})'(\hat\alpha - \alpha^{(0)})/(2\sigma^2)}\, e^{-\rho^2/(2\sigma^2)}\, \tilde\kappa[(\hat\alpha - \alpha^{(0)})'(\hat\alpha - \alpha^{(0)});\ \rho, \sigma]. \qquad (4.33)$$

Letting $x_*$ represent any $M_* \times 1$ vector such that $x_*'x_* = 1$, we find that for any nonnegative scalar $t$,
$$\tilde\kappa(t; \rho, \sigma) = \tilde\kappa(t\,x_*'x_*; \rho, \sigma) = \tfrac{1}{2}\bigl[\kappa(\sqrt{t}\,x_*; \rho, \sigma) + \kappa(-\sqrt{t}\,x_*; \rho, \sigma)\bigr].$$
Thus,
$$\tilde\kappa(0; \rho, \sigma) = 1. \qquad (4.34)$$
Moreover, for $t > 0$,
$$\frac{d\,\tilde\kappa(t; \rho, \sigma)}{dt} = \frac{\int_{S(0, \rho)} \frac{d}{dt}\Bigl[\tfrac{1}{2}\bigl(e^{\sqrt{t}\,x_*'\tilde\alpha/\sigma^2} + e^{-\sqrt{t}\,x_*'\tilde\alpha/\sigma^2}\bigr)\Bigr]\,d\tilde\alpha}{\int_{S(0, \rho)} d\tilde\alpha};$$
and upon observing that (for $t > 0$)
$$\frac{d}{dt}\Bigl[\tfrac{1}{2}\bigl(e^{\sqrt{t}\,x_*'\tilde\alpha/\sigma^2} + e^{-\sqrt{t}\,x_*'\tilde\alpha/\sigma^2}\bigr)\Bigr] = \tfrac{1}{4}\,t^{-1/2}\,(x_*'\tilde\alpha/\sigma^2)\bigl(e^{\sqrt{t}\,x_*'\tilde\alpha/\sigma^2} - e^{-\sqrt{t}\,x_*'\tilde\alpha/\sigma^2}\bigr) > 0 \quad (\text{unless } x_*'\tilde\alpha = 0),$$
it follows that (for $t > 0$ and $\rho > 0$)
$$\frac{d\,\tilde\kappa(t; \rho, \sigma)}{dt} > 0 \qquad (4.35)$$
and hence that (unless $\rho = 0$) $\tilde\kappa(\cdot\,; \rho, \sigma)$ is a strictly increasing function.
An equivalence. Let $\mathcal{T}$ represent a collection of (possibly randomized) tests of $H_0$ or $\tilde H_0$ (versus $H_1$ or $\tilde H_1$). And consider the problem of identifying a test in $\mathcal{T}$ for which the value of $\bar\pi(\rho, \xi, \sigma)$ (at any particular values of $\rho$, $\xi$, and $\sigma$) is greater than or equal to its value for every other test in $\mathcal{T}$. This problem can be transformed into a problem that is equivalent to the original but that is more directly amenable to solution by conventional means.

Together, results (4.30) and (4.31) imply that
$$\bar\pi(\rho, \xi, \sigma) = \int_{\mathbb{R}^N} \tilde\phi(z)\, f_*(z; \rho, \xi, \sigma)\,dz, \qquad (4.36)$$
where $f_*(z; \rho, \xi, \sigma) = f_1(\hat\alpha; \rho, \sigma)\, f_2(\hat\xi; \xi, \sigma)\, f_3(d; \sigma)$. Moreover, $f_1(\hat\alpha; \rho, \sigma) \ge 0$ (for every value of $\hat\alpha$) and $\int_{\mathbb{R}^{M_*}} f_1(\hat\alpha; \rho, \sigma)\,d\hat\alpha = 1$, so that $f_1(\cdot\,; \rho, \sigma)$ can serve as the pdf of an absolutely continuous distribution. And it follows that
$$\bar\pi(\rho, \xi, \sigma) = E_*[\tilde\phi(z)], \qquad (4.37)$$
where the symbol $E_*$ denotes an expected value obtained when the underlying distribution of $z$ is taken to be that with pdf $f_*(\cdot\,; \rho, \xi, \sigma)$ rather than that with pdf $f(\cdot\,; \alpha, \xi, \sigma)$.

In the transformed version of the problem of identifying a test of $H_0$ or $\tilde H_0$ that maximizes the value of $\bar\pi(\rho, \xi, \sigma)$, the distribution of $\hat\alpha$ is taken to be that with pdf $f_1(\cdot\,; \rho, \sigma)$ and the distribution of the observable random vector $z = (\hat\alpha', \hat\xi', d')'$ is taken to be that with pdf $f_*(\cdot\,; \rho, \xi, \sigma)$. Further, the nonnegative scalar $\rho$ (as well as $\sigma$ and each element of $\xi$) is regarded as an unknown parameter. And the test with critical function $\tilde\phi(\cdot)$ is regarded as a test of the null hypothesis $H_0^* : \rho = 0$ versus the alternative hypothesis $H_1^* : \rho > 0$ (and $\mathcal{T}$ is regarded as a collection of tests of $H_0^*$). Note [in light of result (4.37)] that in this context, $\bar\pi(\rho, \xi, \sigma)$ is interpretable as the power function of the test with critical function $\tilde\phi(\cdot)$.

The transformed version of the problem consists of identifying a test in $\mathcal{T}$ for which the value of the power function (at the particular values of $\rho$, $\xi$, and $\sigma$) is greater than or equal to the value of the power function for every other test in $\mathcal{T}$. The transformed version is equivalent to the original version in that a test of $H_0$ or $\tilde H_0$ is a solution to the original version if and only if its critical function is the critical function of a test of $H_0^*$ that is a solution to the transformed version.

In what follows, $\mathcal{T}$ is taken to be the collection of all size-$\dot\gamma$ similar tests of $H_0$ or $\tilde H_0$. And a solution to the problem of identifying a test in $\mathcal{T}$ for which the value of $\bar\pi(\rho, \xi, \sigma)$ is greater than or equal to its value for every other test in $\mathcal{T}$ is effected by solving the transformed version of this problem. Note that in the context of the transformed version, $\mathcal{T}$ represents the collection of all size-$\dot\gamma$ similar tests of $H_0^*$, as is evident upon observing [in light of result (4.32)] that a test of $H_0^*$ and a test of $H_0$ or $\tilde H_0$ that have the same critical function have the same Type-I error.
A sufficient statistic. Suppose that the distribution of $\hat\alpha$ is taken to be that with pdf $f_1(\cdot\,; \rho, \sigma)$ and the distribution of $z = (\hat\alpha', \hat\xi', d')'$ to be that with pdf $f_*(\cdot\,; \rho, \xi, \sigma)$. Further, let
$$u = (\hat\alpha - \alpha^{(0)})'(\hat\alpha - \alpha^{(0)}) \qquad\text{and}\qquad v = d'd.$$
And define
$$s = \frac{u}{u + v} \qquad\text{and}\qquad w = u + v.$$
Then, clearly, $u$, $v$, and $\hat\xi$ form a sufficient statistic—recall result (4.33)—and $s$, $w$, and $\hat\xi$ also form a sufficient statistic.
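Under $\rho = 0$ and normal errors, $u/\sigma^2 \sim \chi^2(M_*)$ and $v/\sigma^2 \sim \chi^2(N-P)$ independently, so that $s \sim \operatorname{Be}[M_*/2, (N-P)/2]$ independently of $w$—the fact exploited below in constructing similar tests. A simulation sketch:

    import numpy as np
    from scipy import stats
    rng = np.random.default_rng(3)

    m_star, resid_df, sigma2, n_rep = 3, 10, 2.0, 100_000
    u = sigma2 * rng.chisquare(m_star, n_rep)
    v = sigma2 * rng.chisquare(resid_df, n_rep)
    s, w = u / (u + v), u + v
    print(stats.kstest(s, stats.beta(m_star / 2, resid_df / 2).cdf).pvalue)
    print(np.corrcoef(s, w)[0, 1])   # near zero, consistent with independence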
The random vectors $\hat\alpha$, $\hat\xi$, and $d$ are distributed independently and, consequently, $u$, $v$, and $\hat\xi$ are distributed independently. And the random variable $u$ has an absolutely continuous distribution with a pdf $g_1(\cdot\,; \rho, \sigma)$ that is derivable from the pdf $f_1(\cdot\,; \rho, \sigma)$ of $\hat\alpha$ by introducing successive changes in variables as follows: from $\hat\alpha$ to the vector $x = (x_1, x_2, \ldots, x_{M_*})' = \hat\alpha - \alpha^{(0)}$, from $x$ to the $M_* \times 1$ vector $t$ whose $i$th element is $t_i = x_i^2$, from $t$ to the $M_* \times 1$ vector $y_*$ whose first $M_* - 1$ elements are $y_i = t_i$ ($i = 1, 2, \ldots, M_* - 1$) and whose $M_*$th element is $y_{M_*} = \sum_{i=1}^{M_*} t_i$, and finally from $y_*$ to the vector whose first $M_* - 1$ elements are $y_i/y_{M_*}$ ($i = 1, 2, \ldots, M_* - 1$) and whose $M_*$th element is $y_{M_*}$—refer to Section 6.1g for the details. Upon introducing these changes of variables and upon observing that $u = y_{M_*}$, we find that (for $u > 0$)
$$g_1(u; \rho, \sigma) \propto u^{(M_*/2)-1}\, e^{-u/(2\sigma^2)}\, \tilde\kappa(u; \rho, \sigma). \qquad (4.38)$$
Moreover, the random variable $v$ has an absolutely continuous distribution with a pdf $g_2(\cdot\,; \sigma)$ such that (for $v > 0$)
$$g_2(v; \sigma) = \frac{1}{\Gamma[(N-P)/2]\,(2\sigma^2)^{(N-P)/2}}\, v^{[(N-P)/2]-1}\, e^{-v/(2\sigma^2)}, \qquad (4.39)$$
as is evident upon observing that $v/\sigma^2$ has a chi-square distribution with $N - P$ degrees of freedom.

The random variables $s$ and $w$ have a joint distribution that is absolutely continuous with a pdf $h(\cdot\,, \cdot\,; \rho, \sigma)$ that is determinable (via a change of variables) from the pdfs of the distributions of $u$ and $v$. For $0 < s < 1$ and $w > 0$, we find that
$$h(s, w; \rho, \sigma) = g_1(sw; \rho, \sigma)\, g_2[(1-s)w; \sigma]\, w \ \propto\ s^{(M_*/2)-1}\,(1-s)^{[(N-P)/2]-1}\, w^{[(N-P+M_*)/2]-1}\, e^{-w/(2\sigma^2)}\, \tilde\kappa(sw; \rho, \sigma). \qquad (4.40)$$
Corresponding to a test of $H_0^*$ versus $H_1^*$ with critical function $\tilde\phi(z)$ is the test with the critical function $E[\tilde\phi(z) \mid s, w, \hat\xi]$ obtained upon taking the expected value of $\tilde\phi(z)$ conditional on $s$, $w$, and $\hat\xi$. Moreover,
$$E\{E[\tilde\phi(z) \mid s, w, \hat\xi]\} = E[\tilde\phi(z)], \qquad (4.41)$$
so that the test with critical function $E[\tilde\phi(z) \mid s, w, \hat\xi]$ has the same power function as the test with critical function $\tilde\phi(z)$. Thus, for purposes of identifying a size-$\dot\gamma$ similar test of $H_0^*$ for which the value of the power function (at particular values of $\rho$, $\xi$, and $\sigma$) is greater than or equal to the value of the power function for every other size-$\dot\gamma$ similar test of $H_0^*$, it suffices to restrict attention to tests that depend on $z = (\hat\alpha', \hat\xi', d')'$ only through the values of $s$, $w$, and $\hat\xi$.

Conditional Type-I error and conditional power. Let $\tilde\phi(s, w, \hat\xi)$ represent the critical function of a (possibly randomized) test (of $H_0^*$ versus $H_1^*$) that depends on $z$ only through the value of the sufficient statistic formed by $s$, $w$, and $\hat\xi$. And let $\pi_*(\rho, \xi, \sigma)$ represent the power function of the test. Then,
$$\pi_*(\rho, \xi, \sigma) = E_*[\tilde\phi(s, w, \hat\xi)]. \qquad (4.42)$$
And the test is a size-$\dot\gamma$ similar test if and only if
$$\pi_*(0, \xi, \sigma) = \dot\gamma \qquad (4.43)$$
for all values of $\xi$ and $\sigma$.

The conditional expected value $E_*[\tilde\phi(s, w, \hat\xi) \mid w, \hat\xi]$ represents the conditional (on $w$ and $\hat\xi$) probability of the test with critical function $\tilde\phi(s, w, \hat\xi)$ rejecting the null hypothesis $H_0^*$. The power function $\pi_*(\rho, \xi, \sigma)$ of this test can be expressed in terms of $E_*[\tilde\phi(s, w, \hat\xi) \mid w, \hat\xi]$. Upon reexpressing the right side of equality (4.42) in terms of this conditional expected value, we find that
$$\pi_*(\rho, \xi, \sigma) = E_*\{E_*[\tilde\phi(s, w, \hat\xi) \mid w, \hat\xi]\}. \qquad (4.44)$$

Let us write $E_0[\tilde\phi(s, w, \hat\xi) \mid w, \hat\xi]$ for $E_*[\tilde\phi(s, w, \hat\xi) \mid w, \hat\xi]$ in the special case where $\rho = 0$, so that $E_0[\tilde\phi(s, w, \hat\xi) \mid w, \hat\xi]$ represents the conditional (on $w$ and $\hat\xi$) probability of a Type-I error. When $\rho = 0$, $s$ is distributed independently of $w$ and $\hat\xi$ as a $\operatorname{Be}[M_*/2, (N-P)/2]$ random variable, so that (under $H_0^*$) the conditional distribution of $s$ given $w$ and $\hat\xi$ is an absolutely continuous distribution with a pdf $h_0(s)$ that is expressible (for $0 < s < 1$) as
$$h_0(s) = \frac{1}{B[M_*/2, (N-P)/2]}\, s^{(M_*/2)-1}\,(1-s)^{[(N-P)/2]-1}$$
—refer to result (4.40) and note that $\tilde\kappa(sw; 0, \sigma) = 1$. And (for every value of $w$ and every value of $\hat\xi$)
$$E_0[\tilde\phi(s, w, \hat\xi) \mid w, \hat\xi] = \int_0^1 \tilde\phi(s, w, \hat\xi)\, h_0(s)\,ds.$$

Under $H_0^*$ (i.e., when $\rho = 0$), $w$ and $\hat\xi$ form a complete sufficient statistic [for distributions of $z$ with a pdf of the form $f_*(z; 0, \xi, \sigma)$]—refer to Section 5.8. And in light of result (4.44), it follows that the test with critical function $\tilde\phi(s, w, \hat\xi)$ is a size-$\dot\gamma$ similar test if and only if
$$E_0[\tilde\phi(s, w, \hat\xi) \mid w, \hat\xi] = \dot\gamma \quad \text{(wp1)}. \qquad (4.45)$$
Thus, a size-$\dot\gamma$ similar test of $H_0^*$ for which the value of $\pi_*(\rho, \xi, \sigma)$ (at any particular values of $\rho$, $\xi$, and $\sigma$) is greater than or equal to its value for any other size-$\dot\gamma$ similar test can be obtained by taking the critical function $\tilde\phi(s, w, \hat\xi)$ of the test to be that derived by regarding (for each value of $w$ and each value of $\hat\xi$) $\tilde\phi(s, w, \hat\xi)$ as a function of $s$ alone and by maximizing the value of $E_*[\tilde\phi(s, w, \hat\xi) \mid w, \hat\xi]$ (with respect to the choice of that function) subject to the constraint $E_0[\tilde\phi(s, w, \hat\xi) \mid w, \hat\xi] = \dot\gamma$.

When $\rho > 0$ (as when $\rho = 0$), $\hat\xi$ is distributed independently of $s$ and $w$. However, when $\rho > 0$ (unlike when $\rho = 0$), $s$ and $w$ are statistically dependent. And in general (i.e., for $\rho \ge 0$), the conditional distribution of $s$ given $w$ and $\hat\xi$ is an absolutely continuous distribution with a pdf $h_+(s \mid w)$ such that
$$h_+(s \mid w) \propto h(s, w; \rho, \sigma) \qquad (4.46)$$
—this distribution varies with the values of $\rho$ and $\sigma$ as well as the value of $w$.
Application of the Neyman–Pearson lemma. Observe [in light of results (4.46) and (4.40)] that (for $0 < s < 1$)
$$\frac{h_+(s \mid w)}{h_0(s)} \propto \tilde\kappa(sw; \rho, \sigma).$$
Observe also that (when $w > 0$ and $\rho > 0$) $\tilde\kappa(sw; \rho, \sigma)$ is strictly increasing in $s$ [as is evident upon recalling that (when $\rho > 0$) $\tilde\kappa(\cdot\,; \rho, \sigma)$ is a strictly increasing function]. Further, let $\tilde\phi^*(s)$ represent the critical function of a (size-$\dot\gamma$) test (of $H_0^*$) that depends on $z$ only through the value of $s$ and that is defined as follows:
$$\tilde\phi^*(s) = \begin{cases} 1, & \text{if } s > k, \\ 0, & \text{if } s \le k, \end{cases}$$
where $\int_k^1 h_0(s)\,ds = \dot\gamma$. Then, it follows from an extended (to cover randomized tests) version of Theorem 7.4.1 (the Neyman–Pearson lemma) that (for every strictly positive value of $\rho$ and for all values of $\xi$ and $\sigma$) the "conditional power" $E_*[\tilde\phi(s, w, \hat\xi) \mid w, \hat\xi]$ of the test with critical function $\tilde\phi(s, w, \hat\xi)$ attains its maximum value, subject to the constraint $E_0[\tilde\phi(s, w, \hat\xi) \mid w, \hat\xi] = \dot\gamma$, when $\tilde\phi(s, w, \hat\xi)$ is taken to be the function $\tilde\phi^*(s)$. And (in light of the results of the preceding two parts of the present subsection) we conclude that the test of $H_0^*$ with critical function $\tilde\phi^*(s)$ is UMP among all size-$\dot\gamma$ similar tests.
Main result. Based on relationship (4.37) and equality (4.32), the result on the optimality of the test of $H_0^*$ with critical function $\tilde\phi^*(s)$ can be reexpressed as a result on the optimality of the test of $H_0$ or $\tilde H_0$ with the same critical function. Moreover, the test of $H_0$ or $\tilde H_0$ with critical function $\tilde\phi^*(s)$ is equivalent to the size-$\dot\gamma$ F test of $H_0$ or $\tilde H_0$. Thus, among size-$\dot\gamma$ similar tests of $H_0 : \lambda = \lambda^{(0)}$ or $\tilde H_0 : \alpha = \alpha^{(0)}$, the average value $\bar\pi(\rho, \xi, \sigma)$ of the power function $\tilde\pi(\alpha, \xi, \sigma)$ over those $\alpha$-values located on the sphere $S(\alpha^{(0)}, \rho)$, centered at $\alpha^{(0)}$ and of radius $\rho$, is maximized (for every $\rho > 0$ and for all $\xi$ and $\sigma$) by taking the test of $H_0$ or $\tilde H_0$ to be the size-$\dot\gamma$ F test.

A corollary. Suppose that a test of $H_0 : \lambda = \lambda^{(0)}$ or $\tilde H_0 : \alpha = \alpha^{(0)}$ versus $H_1 : \lambda \ne \lambda^{(0)}$ or $\tilde H_1 : \alpha \ne \alpha^{(0)}$ is such that the power function $\tilde\pi(\alpha, \xi, \sigma)$ depends on $\alpha$, $\xi$, and $\sigma$ only through the value of $(\alpha - \alpha^{(0)})'(\alpha - \alpha^{(0)})/\sigma^2$, so that
$$\tilde\pi(\alpha, \xi, \sigma) = \ddot\pi[(\alpha - \alpha^{(0)})'(\alpha - \alpha^{(0)})/\sigma^2] \qquad (4.47)$$
for some function $\ddot\pi(\cdot)$. Then, for all $\xi$ and $\sigma$,
$$\tilde\pi(\alpha^{(0)}, \xi, \sigma) = \ddot\pi(0). \qquad (4.48)$$
And for $\rho \ge 0$ and for all $\xi$ and $\sigma$,
$$\bar\pi(\rho, \xi, \sigma) = \ddot\pi(\rho^2/\sigma^2). \qquad (4.49)$$
As noted in Section 7.3b, the size-$\dot\gamma$ F test is among those tests of $H_0$ or $\tilde H_0$ for which the power function $\tilde\pi(\alpha, \xi, \sigma)$ is of the form (4.47). Moreover, any size-$\dot\gamma$ test of $H_0$ or $\tilde H_0$ for which $\tilde\pi(\alpha, \xi, \sigma)$ is of the form (4.47) is a size-$\dot\gamma$ similar test, as is evident from result (4.48). And, together, results (4.47) and (4.49) imply that (for all $\alpha$, $\xi$, and $\sigma$)
$$\tilde\pi(\alpha, \xi, \sigma) = \bar\pi\{[(\alpha - \alpha^{(0)})'(\alpha - \alpha^{(0)})]^{1/2},\ \xi,\ \sigma\}.$$
Thus, it follows from what has already been proven that among size-$\dot\gamma$ tests of $H_0$ or $\tilde H_0$ (versus $H_1$ or $\tilde H_1$) with a power function of the form (4.47), the size-$\dot\gamma$ F test is a UMP test.

The power function of a test of $H_0$ or $\tilde H_0$ versus $H_1$ or $\tilde H_1$ can be expressed either as a function $\tilde\pi(\alpha, \xi, \sigma)$ of $\alpha$, $\xi$, and $\sigma$ or as a function $\pi(\lambda, \xi, \sigma) = \tilde\pi(S'\lambda, \xi, \sigma)$ of $\lambda$, $\xi$, and $\sigma$. Note [in light of result (3.48)] that for $\tilde\pi(\alpha, \xi, \sigma)$ to depend on $\alpha$, $\xi$, and $\sigma$ only through the value of $(\alpha - \alpha^{(0)})'(\alpha - \alpha^{(0)})/\sigma^2$, it is necessary and sufficient that $\pi(\lambda, \xi, \sigma)$ depend on $\lambda$, $\xi$, and $\sigma$ only through the value of $(\lambda - \lambda^{(0)})'\,C^{-1}\,(\lambda - \lambda^{(0)})/\sigma^2$.
One-Sided t Tests and the Corresponding Confidence Bounds 421
Special case: M D M D 1. Suppose that M D M D 1. And recall (from Section 7.3b) that in
this special case, the size- P F test of H0 W D .0/ or HQ 0 W ˛ D ˛.0/ versus H1 W ¤ .0/ or
HQ 1 W ˛ ¤ ˛.0/ simplifies to the (size- P ) two-sided t test. Further, let us write ˛ and for ˛ and
and ˛ .0/ and .0/ for ˛.0/ and .0/.
For > 0, S.˛ .0/; / D f˛ W j˛ ˛ .0/j D g, so that S.˛ .0/; / consists of the two points ˛ .0/ ˙ .
And we find that among size- P similar tests of H0 or HQ 0 versus H1 or HQ 1 , the average power at any
two points that are equidistant from ˛ .0/ or .0/ is maximized by taking the test to be the (size- P )
two-sided t test. Moreover, among all size- P tests of H0 or HQ 0 whose power functions depend on ˛,
, and or , , and only through the value of j˛ ˛ .0/j= or j .0/j=, the (size- P ) two-sided
t test is a UMP test.
Since the size- P two-sided t test is equivalent to the size- P F test and since (as discussed in
Section 7.3b) the size- P F test is strictly unbiased, the size- P two-sided t test is a strictly unbiased
test. Moreover, it can be shown that among all level- P unbiased tests of H0 or HQ 0 versus H1 or HQ 1 ,
the size- P two-sided t test is a UMP test.
Q ˛;
Confidence sets. Let A. O d/ or simply AQ represent an arbitrary confidence set for ˛ with con-
O ;
fidence coefficient 1 P . And let AQF .˛; O d/ or simply AQF represent the 100.1 P /% confidence
O ;
set
O 0 .˛P ˛/
f˛P W .˛P ˛/ O M O 2 FN P .M ; N P /g:
Our results on the optimality of the size- P F test of HQ 0 W ˛ D ˛.0/ can be reexpressed as results
on the optimality of AQF . When AQ is required to be such that Pr.˛ 2 A/ Q D 1 P for all ˛, , and
Q Q
, the optimal choice for A is AF ; this choice is optimal in the sense that (for > 0) it minimizes
the average value (with respect to ˛.0/ ) of Pr.˛.0/ 2 A/ Q (the probability of false coverage) over the
Q
sphere S.˛; / centered at ˛ with radius . And AF is UMA (uniformly most accurate) when the
choice of the set AQ is restricted to a set for which the probability Pr.˛.0/ 2 A/Q of AQ covering a vector
.0/ .0/ 0 .0/ 2
˛ depends on ˛, , and only through the value .˛ ˛/ .˛ ˛/= ; it is UMA in the sense
that the probability of false coverage is minimized for every ˛.0/ and for all ˛, , and . Moreover,
among those 100.1 P / confidence sets for which the probability of covering a vector .0/ [in C.ƒ0 /]
depends on , , and only through the value of . .0/ /0 C . .0/ /= 2, the UMA set is the set
AF D fP 2 C.ƒ0 / W .P /
O 0 C .P /
O M O 2 FN P .M ; N P /g:
the null hypothesis HQ 0 W ˛ ˛ .0/ versus the alternative hypothesis HQ 1 W ˛ < ˛ .0/, and its critical
region C can be reexpressed in the form of the set CQ D f z W .˛O ˛ .0/ /=O < tNP .N P /g.
Invariance. Both of the two size- P one-sided t tests are invariant with respect to groups of transfor-
mations (of z) of the form TQ1 .z/, of the form TQ2 .zI ˛.0/ /, and of the form TQ4 .z/—transformations
of these forms are among those discussed in Section 7.3b. And the 100.1 P /% confidence intervals
(5.2) and (5.3) are invariant with respect to groups of transformations of the form TQ1 .z/ and of the
form TQ4 .z/ and are equivariant with respect to groups of transformations of the form TQ2 .zI ˛.0/ /.
c. Invariant tests
Let us add to the brief “discussion” (in the final part of Subsection a) of the invariance of one-sided
t tests by establishing some properties possessed by the critical functions of tests that are invariant
with respect to various groups of transformations but not by the critical functions of other tests.
Two propositions. The four propositions introduced previously (in Section 7.4a) serve to characterize
functions of z that are invariant with respect to the four groups GQ 1 , GQ 3 .˛ .0/ /, GQ 2 .˛ .0/ /, and GQ 4 .
In the present context, the group GQ 3 .˛ .0/ / is irrelevant. The following two propositions [in which
Q
.z/ represents an arbitrary function of z and in which z1 represents the lone element of the vector
z1 ] take the place of Propositions (2), (3), and (4) of Section 7.4a and provide the characterization
needed for present purposes:
(2 0 ) The function .z/Q is invariant with respect to the group GQ 2 .˛.0/ / of transformations of the form
T2 .zI ˛ / as well as the group GQ 1 of transformations of the form TQ1 .z/ only if there exists a
Q .0/
Verification of Proposition (2 0 ). Suppose that .z/ Q is invariant with respect to the group GQ 2 .˛.0/ /
as well as the group GQ 1 . Then, in light of Proposition (1) (of Section 7.4a), .z/ Q D Q 1 .z1 ; z3 /
Q
for some function 1 . ; /. Thus, Q Q
to establish
.0/
the existence of a function 12 ./ such that .z/ D
z ˛
Q12 Œ.z1 ˛ .0/ /2 C z03 z3 1=2 1 for those values of z for which z3 ¤ 0, it suffices to take
z3
zP1 and zR1 to be values of z1 and zP 3 and zR 3 nonnull values of z3 such that
R1 ˛ .0/ P1 ˛ .0/
.0/ 2 0 1=2 z .0/ 2 0 1=2 z
Œ.Rz1 ˛ / C zR 3 zR 3 D Œ.Pz1 ˛ / C zP 3 zP 3
zR 3 zP 3
and to observe that
zR 3 D k zP 3 and zR1 D ˛ .0/ C k.Pz1 ˛ .0/ /;
where k D Œ.Rz1 ˛ .0/ /2 C zR 03 zR 3 1=2=Œ.Pz1 ˛ .0/ /2 C zP 03 zP 3 1=2.
And upon letting u1 represent a random
u
variable and u3 an .N P / 1 random vector such that 1 N.0; I/, it remains only to observe
u3
that
z ˛ .0/
Œ.z1 ˛ .0/ /2 C z03 z3 1=2 1
z3
˛ ˛ .0/ Cu1
Œ.˛ ˛ .0/ Cu1 /2 C 2 u03 u3 1=2
u3
1
.˛ ˛ .0/ /Cu1
D fŒ 1.˛ ˛ .0/ /Cu1 2 C u03 u3 g 1=2 :
u3
Q
Verification of Proposition (3 0 ). Suppose that .z/ is invariant with respect to the group GQ 4 as well
as the groups G1 and G2 .˛ /. Then, in light of Proposition (2 0 ), there exists a function Q12 ./ such
Q Q .0/
z ˛ .0/
Q
that .z/ D Q12 Œ.z1 ˛ .0/ /2 C z03 z3 1=2 1 for those values of z for which z3 ¤ 0.
z3
Moreover, there also exists a function Q124 ./ such that
z ˛ .0/
Q12 Œ.z1 ˛ .0/ /2 C z03 z3 1=2 1 D Q124fŒ.z1 ˛ .0/ /2 C z03 z3 1=2 .z1 ˛ .0/ /g
z3
(for those values of z for which z3 ¤ 0). To confirm this, it suffices to take zP1 and zR1 to be values of
z1 and zP 3 and zR 3 nonnull values of z3 such that
Œ.Rz1 ˛ .0/ /2 C zR 03 zR 3 1=2
.Rz1 ˛ .0/ / D Œ.Pz1 ˛ .0/ /2 C zP 03 zP 3 1=2
.Pz1 ˛ .0/ / (5.5)
and to observe that equality (5.5) implies that
and [in light of results (6.4.32) and (6.4.7)] that (for some strictly positive scalar c that does not
depend on u, y, or and for 0 < u < 1 and 1 < y < 1)
(versus the alternative hypothesis ¤ 0) that are of level P and that are invariant with respect to the
groups GQ 1 , GQ 2 .˛ .0/ /, and GQ 4 , the power of the test at the point D attains its maximum value
(for every choice of the strictly positive scalar ) when the critical region of the test is taken to be
the region CQ C. Moreover, tests of the null hypothesis D 0 that are of level P and that are invariant
with respect to the groups GQ 1 , GQ 2 .˛ .0/ /, and GQ 4 include as a subset tests of the null hypothesis 0
that are of level P and that are invariant with respect to those groups. And we conclude that among
all tests of the null hypothesis H0C W .0/ or HQ 0C W ˛ ˛.0/ (versus the alternative hypothesis
H1C W > .0/ or HQ 1C W ˛ > ˛.0/ ) that are of level P and that are invariant with respect to the groups
GQ 1 , GQ 2 .˛ .0/ /, and GQ 4 , the size- P one-sided t test of H0C is a UMP test.
A stronger result. As is evident from the next-to-last part of Section 7.3a, z1 , z2 , and z03 z3 form a
(vector-valued) sufficient statistic. And in light of the proposition of Section 7.4b, the critical function
of a test or H0C or HQ 0C is expressible as a function of this statistic if and only if the test is invariant
with respect to the group GQ 4 of transformations (of z) of the form TQ4 .z/. Thus, a test of H0C or
HQ 0C is invariant with respect to the groups GQ 1 , GQ 2 .˛ .0/ /, and GQ 4 of transformations if and only if
it is invariant with respect to the groups GQ 1 and GQ 2 .˛ .0/ / and, in addition, its critical function is
expressible as a function of z1 , z2 , and z03 z3 .
Q
Let .z/ represent a function of z that is the critical function of a (possibly randomized) test
of H0C or HQ 0C. Then, corresponding to .z/ Q is a critical function .z/ N Q j z ; z ; z0 z
D EŒ.z/ 1 2 3 3
that depends on z only through the value of the sufficient statistic. Moreover, the test with critical
N
function .z/ has the same power function as that with critical function .z/; Q and if the test with
Q
critical function .z/ is invariant with respect to the groups GQ 1 and GQ 2 .˛ .0/ / of transformations,
N
then so is the test with critical function .z/—refer to the fifth part of Section 7.4d.
Thus, the results (obtained in the preceding part of the present subsection) on the optimality of
the size- P one-sided t test can be strengthened. We find that among all tests of the null hypothesis
H0C W .0/ or HQ 0C W ˛ ˛.0/ (versus the alternative hypothesis H1C W > .0/ or HQ 1C W ˛ > ˛.0/ )
that are of level P and that are invariant with respect to the groups GQ 1 and GQ 2 .˛ .0/ /, the size- P one-
sided t test of H0C is a UMP test. The restriction to tests that are invariant with respect to the group
GQ 4 is unnecessary.
Best invariant test of H0 or HQ 0 (versus H1 or HQ 1 ): an analogous result. By proceeding in much
the same fashion as in arriving at the result (on the optimality of the size- P one-sided t test of the
null hypothesis H0C ) presented in the preceding part of the present subsection, one can establish the
following result on the optimality of the size- P one-sided t test of the null hypothesis H0 : Among
all tests of the null hypothesis H0 W .0/ or HQ 0 W ˛ ˛.0/ (versus the alternative hypothesis
One-Sided t Tests and the Corresponding Confidence Bounds 427
H1 W < .0/ or HQ 1 W ˛ < ˛.0/ ) that are of level P and that are invariant with respect to the groups
GQ 1 and GQ 2 .˛ .0/ /, the size- P one-sided t test of H0 is a UMP test.
Confidence intervals. By proceeding in much the same fashion as in the final part of Section 7.4d (in
translating results on the optimality of the F test into results on the optimality of the corresponding
confidence set), the results of the preceding parts of the present subsection (of Section 7.5) on the
optimality of the one-sided t tests can be translated into results on the optimality of the corresponding
confidence intervals.
Clearly, the 100.1 P /% confidence intervals (5.2) and (5.3) (for ) and the corresponding intervals
ŒO O tNP .N P /; 1/ and . 1; O C O tNP .N P / for ˛ are equivariant with respect to the group
GQ 2 of transformations [the group formed by the totality of the groups GQ 2 .˛ .0/ / (˛ .0/ 2 R1 )], and
they are invariant with respect to the group GQ 1 (and also with respect to the group GQ 4 ). Now, let
Q represent any confidence set for ˛ whose probability of coverage equals or exceeds 1 P and
A.z/
that is invariant with respect to the group GQ 1 and equivariant with respect to the group GQ 2 . Further,
for ˛ .0/ 2 R1, denote by ı.˛ .0/ I ˛/ the probability PrŒ˛ .0/ 2 A.z/Q of ˛ .0/ being covered by the
confidence set A.z/Q [and note that ı.˛I ˛/ 1 P ]. Then, the interval ŒO O tNP .N P /; 1/ is
the optimal choice for A.z/Q in the sense that for every scalar ˛ .0/ such that ˛ .0/ < ˛, it is the choice
that minimizes ı.˛ .0/ I ˛/, that is, the choice that minimizes the probability of the confidence set
covering any value of ˛ smaller than the true value.
This result on the optimality of the 100.1 P /% confidence interval ŒO O tNP .N P /; 1/ (for
˛) can be reexpressed as a result on the optimality of the 100.1 P /% confidence interval (5.2) (for
), and/or (as discussed in Section 7.3b) conditions (pertaining to the invariance or equivariance of
confidence sets) that are expressed in terms of groups of transformations of z can be reexpressed in
terms of the corresponding groups of transformations of y. Corresponding to the groups GQ 1 and GQ 4
and the group GQ 2 of transformations of z are the groups G1 and G4 and the group G2 [consisting of
the totality of the groups G2 . .0/ / ( .0/ 2 R1 )] of transformations of y.
Let A.y/ represent any confidence set for for which PrŒ 2 A.y/ 1 P and that is invariant
with respect to the group G1 and equivariant with respect to the group G2 . Then, the interval (5.2) is
the optimal choice for A.y/ in the sense that for every scalar .0/ < , it is the choice that minimizes
PrŒ .0/ 2 A.y/; that is, the choice that minimizes the probability of the confidence set covering
any value of smaller than the true value. Moreover, the probability of the interval (5.2) covering
a scalar .0/ is less than, equal to, or greater than 1 P depending on whether .0/ < , .0/ D ,
or .0/ > . While the interval (5.2) is among those choices for the confidence set A.y/ that are
invariant with respect to the group G4 of transformations (of y) and is also among those choices for
which PrŒ .0/ 2 A.y/ > 1 P for every scalar .0/ > , the optimality of the interval (5.2) is not
limited to choices for A.y/ that have either or both of those properties.
If the objective is to minimize the probability of A.y/ covering values of that are larger than
the true value rather than smaller, the optimal choice for A.y/ is interval (5.3) rather than interval
(5.2). For every scalar .0/ > , interval (5.3) is the choice that minimizes PrŒ .0/ 2 A.y/. And the
probability of the interval (5.3) covering a scalar .0/ is greater than, equal to, or less than 1 P
depending on whether .0/ < , .0/ D , or .0/ > .
Best unbiased tests. It can be shown that among all tests of the null hypothesis H0C W .0/ or
HQ 0C W ˛ ˛.0/ (versus the alternative hypothesis H1C W > .0/ or HQ 1C W ˛ > ˛.0/ ) that are of
level P and that are unbiased, the size- P one-sided t test of H0C is a UMP test. Similarly, among
all tests of the null hypothesis H0 W .0/ or HQ 0 W ˛ ˛.0/ (versus the alternative hypothesis
H1 W < .0/ or HQ 1 W ˛ < ˛.0/ ) that are of level P and that are unbiased, the size- P one-sided t
test of H0 is a UMP test.
428 Confidence Intervals (or Sets) and Tests of Hypotheses
e. Simultaneous inference
Suppose that we wish to obtain either a lower confidence bound or an upper confidence bound for
each of a number of estimable linear combinations of the elements of ˇ. And suppose that we wish
for these confidence bounds to be such that the probability of simultaneous coverage equals 1 P —
simultaneous coverage occurs when for every one of the linear combinations, the true value of the
linear combination is covered by the interval formed (in the case of a lower confidence bound) by
the scalars greater than or equal to the confidence bound or alternatively (in the case of an upper
confidence bound) by the scalars less than or equal to the confidence bound. Or suppose that for each
of a number of estimable linear combinations, we wish to test the null hypothesis that the true value
of the linear combination is less than or equal to some hypothesized value (versus the alternative that
it exceeds the hypothesized value) or the null hypothesis that the true value is greater than or equal
to some hypothesized value (versus the alternative that it is less than the hypothesized value), and
suppose that we wish to do so in such a way that the probability of falsely rejecting one or more of
the null hypotheses is less than or equal to P .
Let 1 ; 2 ; : : : ; M represent estimable linear combinations of the elements of ˇ, and let D
.1 ; 2 ; : : : ; M /0. And suppose that the linear combinations (of the elements of ˇ) for which we wish
to obtain confidence bounds or to subject to hypothesis tests are expressible in the form D ı 0 ,
where ı is an arbitrary member of a specified collection of M 1 vectors—it is assumed that
ƒı ¤ 0 for some ı 2 . Further, in connection with the hypothesis tests, denote by ı.0/ the
.ı/ .ı/
hypothesized value of D ı 0 and by H0 and H1 the null and alternative hypotheses, so that
either H0.ı/ W ı.0/ and H1.ı/ W > ı.0/ or H0.ı/ W ı.0/ and H1.ı/ W < ı.0/. It is assumed that
.0/
the various hypothesized values ı (ı 2 ) are simultaneously achievable in the sense that there
.0/
exists an M 1 vector .0/ 2 C.ƒ0 / such that (for every ı 2 ) ı D ı 0 .0/.
Note that for any scalar P ,
ı 0 P , . ı/0 P and ı 0 > P , . ı/0 < P
:
0
Thus, for purposes of obtaining for every one of the linear combinations ı (ı 2 ) either a lower
or an upper confidence bound (and for doing so in such a way that the probability of simultaneous
coverage equals 1 P ), there is no real loss of generality in restricting attention to the case where all
of the confidence bounds are lower bounds. Similarly, for purposes of obtaining for every one of the
linear combinations ı 0 (ı 2 ) a test of the null hypothesis H0.ı/ versus the alternative hypothesis
H1.ı/ (and for doing so in such a way that the probability of falsely rejecting one or more of the null
hypotheses is less than or equal to P ), there is no real loss of generality in restricting attention to the
case where for every ı 2 , H0.ı/ and H1.ı/ are H0.ı/ W ı.0/ and H1.ı/ W > ı.0/.
Simultaneous confidence bounds. Confidence bounds with a specified probability of simultaneous
coverage can be obtained by adopting an approach similar to that employed in Sections 7.3c and 7.3e
in obtaining confidence intervals (each of which has end points that are equidistant from the least
Q represent
squares estimator) with a specified probability of simultaneous coverage. As before, let
the set of M 1 vectors defined as follows:
Q D fıQ W ıDW
Q ı; ı2g:
Further, letting t represent an M 1 random vector that has an MV t.N P ; IM / distribution,
denote by a P the upper 100 P % point of the distribution of the random variable
ıQ 0 t
max : (5.7)
Q
fı2 Q
Q W ı¤0g .ıQ 0 ı/
Q 1=2
And (for ı 2 ) take Aı .y/ to be the set
fP 2 R1 W ı 0O .ı 0 Cı/1=2 a
O P P < 1g
One-Sided t Tests and the Corresponding Confidence Bounds 429
The tests of the null hypotheses H0.ı/ (ı 2 ) with critical regions C.ı/ (ı 2 ) are such that
the probability of one or more false rejections is less than or equal to P , with equality holding when
.0/
ı 0 D ı for every ı 2 . To see this, let (for ı 2 )
C 0.ı/ D fy W ı 0 … Aı .y/g:
.0/
Then, for ı 2 such that ı 0 ı , C.ı/ C 0.ı/. Thus,
430 Confidence Intervals (or Sets) and Tests of Hypotheses
Pr y 2 [ C.ı/ Pr y 2 [ C 0.ı/ Pr y 2 [ C 0.ı/ D P;
fı2 W ı0 ı.0/ g fı2 W ı0 ı.0/ g fı2g
so that
Pr y 2 [ C.ı/ P;
.0/
fı2 W ı0 ı g
f. Nonnormality
In Subsections a and b, it was established that each of the confidence sets (5.2), (5.3), and more
generally (5.4) has a probability of coverage equal to 1 P . It was also established that the test of
the null hypothesis H0C or alternatively H0 with critical region C C or C , respectively, is such that
the probability of falsely rejecting H0C or H0 is equal to P . And in Subsection e, it was established
that the probability of simultaneous coverage of the confidence intervals Aı .y/ (ı 2 ) and the
probability of simultaneous coverage of the confidence intervals Aı .y/ (ı 2 ) are both equal to
1 P . In addition, it was established that the tests of the null hypotheses H0.ı/ (ı 2 ) with critical
regions C.ı/ (ı 2 ) and the tests of H0.ı/ (ı 2 ) with critical regions C .ı/ (ı 2 ) are both such
that the probability of falsely rejecting one or more of the null hypotheses is less than or equal to P .
A supposition of normality (made at the beginning of Section 7.5) underlies those results. How-
ever, this supposition is stronger than necessary. It suffices to take the distribution of the vector e of
residual effects in the G–M model to be an absolutely continuous spherical distribution. In fact, as
is evident upon observing that
O 1
.˛O ˛/ D Œd0 d=.N P / .˛O ˛/ 1=2
˛O ˛
and recalling result (6.4.67), it suffices to take the distribution of the vector to be an absolutely
d
continuous spherical distribution. A supposition of normality is also stronger than what is needed
(in Subsection c) in establishing the results of Propositions (2 0 ) and (3 0 ).
a. The basics
Let S represent the sum of squares eQ 0 eQ D y 0 .I PX /y of the elements of the vector eQ D .I PX /y
of least squares residuals. And letting (as in Section 7.3a) d D L0 y [where L is an N .N P /
matrix of rank N P such that X0 L D 0 and L0 L D I] and recalling result (3.21), observe that
S D d0 d:
Moreover,
d N.0; 2 I/
and hence .1=/d N.0; I/, so that
S=. 2 / D Œ.1=/d0 .1=/d 2 .N P /: (6.1)
Confidence intervals/bounds. For 0 < ˛ < 1, denote by N 2˛ .N P / or simply by N˛2 the upper
100 ˛% point of the 2 .N P / distribution. Then, upon applying result (6.1), we obtain as a
100.1 P /% “one-sided” confidence interval for , the interval
s
S
< 1; (6.2)
N 2P
q
so that S=N 2P is a 100.1 P /% lower “confidence bound” for . Similarly, the one-sided interval
s
S
0< 2
(6.3)
N 1 P
q
constitutes a 100.1 P /% confidence interval for , and S=N 21 P is interpretable as a 100.1 P /%
upper confidence bound. And upon letting P1 and P2 represent any two strictly positive scalars such
that P1 C P2 D P , we obtain as a 100.1 P /% “two-sided” confidence interval for , the interval
s s
S S
: (6.4)
N 2P1 N12 P2
Tests of hypotheses. Let 0 represent a hypothesized value of (where 0 < 0 < 1). Further, let
T D S=.02 /. Then, corresponding to the confidence interval (6.4) is a size- P (nonrandomized) test
of the null hypothesis H0 W D 0 versus the alternative hypothesis H1 W ¤ 0 with critical region
C D y W T < N12 P2 or T > N 2P1 (6.5)
˚
p p
consisting of all values of y for which 0 is larger than S=N12 P2 or smaller than S=N 2P1 . And
corresponding to the confidence interval (6.2) is a (nonrandomized) test of the null hypothesis
H0C W 0 versus the alternative hypothesis H1C W > 0 with critical region
C C D y W T > N 2P (6.6)
˚
p
consisting of all values of y for which the lower confidence bound S=N 2P exceeds 0 . Similarly,
corresponding to the confidence interval (6.3) is a (nonrandomized) test of the null hypothesis
H0 W 0 versus the alternative hypothesis H1 W < 0 with critical region
C D y W T < N12 P (6.7)
˚
p
consisting of all values of y for which 0 exceeds the upper confidence bound S=N12 P .
The tests of H0C and H0 with critical regions C C and C , respectively, like the test of H0 with
critical region C, are of size P . To see this, it suffices to observe that
Pr T > N 2P D Pr S=. 2 / > .0 =/2 N 2P (6.8)
432 Confidence Intervals (or Sets) and Tests of Hypotheses
and
Pr T < N12 P D Pr S=. 2 / < .0 =/2 N12 P ; (6.9)
which implies that Pr T > N 2P is greater than, equal to, or less than P and Pr T < N12 less than,
P
equal to, or greater than P depending on whether > 0 , D 0 , or < 0 .
Translation invariance. The confidence intervals (6.2), (6.3), and (6.4) and the tests of H0 , H0C, and
H0 with critical regions C, C C, and C are translation invariant. That is, the results produced by
these procedures are unaffected when for any P 1 vector k, the value of the vector y is replaced
by the value of the vector y CXk. To see this, it suffices to observe that the procedures depend on y
only through the value of the vector eQ D .I PX /y of least squares residuals and that
.I PX /.y C Xk/ D y:
Unbiasedness. The test of H0C versus H1C with critical region C C and the test of H0 versus H1
with critical region C are both unbiased, as is evident from results (6.8) and (6.9). In fact, they are
both strictly unbiased. In contrast, the test of H0 versus H1 with critical region C is unbiased only
if P1 (and P2 D P P1 ) are chosen judiciously. Let us consider how to choose P1 so as to achieve
unbiasedness—it is not a simple matter of setting P1 D P=2.
Let ./ represent the power function of a (possibly randomized) size- P test of H0 versus H1 .
And suppose that (as in the case of the test with critical region C ), the test is among those size- P tests
with a critical function that is expressible as a function, say .T /, of T. Then, ./ D EŒ.T /, and
Z 1
1 ./ D EŒ1 .T / D Œ1 .t/ h.t/ dt; (6.10)
0
where h./ is the pdf of the distribution of T.
The pdf of the distribution of T is derivable from the pdf of the 2 .N P / distribution. Let
U D S=. 2 /, so that U 2 .N P /. And observe that U D .0 =/2 T. Further, let g./ represent
the pdf of the 2 .N P / distribution. Then, upon recalling result (6.1.16), we find that for t > 0,
h.t/ D .0 =/2 gŒ.0 =/2 t
1 2
D .N P /=2
.0 =/N P t Œ.N P /=2 1 e .0 =/ t =2 (6.11)
Œ.N P /=2 2
—for t 0, h.t/ D 0.
A necessary and sufficient condition for the test to be unbiased is that ./ attain its minimum
value (with respect to ) at D 0 or, equivalently, that 1 ./ attain its maximum value at D 0 —
for the test to be strictly unbiased, it is necessary and sufficient that 1 ./ attain its maximum
value at D 0 and at no other value of . And if 1 ./ attains its maximum value at D 0 , then
ˇ
d Œ1 ./ ˇˇ
D 0: (6.12)
d ˇ
D0
Now, suppose that the test is such that lim!0 1 ./ and lim!1 1 ./ are both equal to
0 (as in the case of the test with critical region C ) or, more generally, that both of these limits are
smaller than 1 P . Suppose also that the test is such that condition (6.12) is satisfied and such that
d Œ1 ./
¤ 0 for ¤ 0 . Then, 1 ./ attains its maximum value at D 0 and at no other
d
value of , and hence the test is unbiased and, in fact, is strictly unbiased.
The derivative of 1 ./ is expressible as follows:
Z 1
d Œ1 ./ dh.t/
D Œ1 .t/ dt
d 0 Z 1d 2
0 t
D 1 .N P / Œ1 .t/ 1 h.t/ dt: (6.13)
0 N P
The Residual Variance 2 : Confidence Intervals and Tests 433
Further, upon recalling the relationship U D .0 =/2 T between the random variable U [D S=. 2 /]
that has a 2 .N P / distribution with pdf g./ and the random variable T [D S=.02 /] and upon
introducing a change of variable, we find that 1 ./ and its derivative are reexpressible in the form
Z 1
1 ./ D f1 Œ.=0 /2 ug g.u/ du; (6.14)
0
and 1
d Œ1 ./ u
Z
1 2
D .N P / f1 Œ.=0 / ug 1 g.u/ du: (6.15)
d 0 N P
And in the special case of the test with critical region C,
Z N 2 .0 =/2
P1
1 ./ D g.u/ du; (6.16)
N 2
1
.0 =/2
P C P1
and N 2P .0 =/2
d Œ1 ./ u
Z
1 1
D .N P / 1 g.u/ du: (6.17)
d N 2
1 P C P1
.0 =/2 N P
In the further special case where D 0 , result (6.17) is reexpressible as
ˇ Z N 2
d Œ1 ./ ˇˇ 1
P1 u
D 0 .N P / 1 g.u/ du: (6.18)
d ˇ
D0 N 2
1 PC P
N P
1
For what value or values of P1 (in the interval 0 < P1 < P ) is expression (6.18) equal to 0?
Note that the integrand of the integral in expression (6.18) is greater than 0 for u > N P and
less than 0 for u < N P [and that N P D E.U /]. And assume that P is sufficiently small that
N 2P > N P —the median of a chi-square distribution is smaller than the mean (e.g., Sen 1989), so
that N 2P > N P ) N 21 P < N P . Then, the value or values of P1 for which expression (6.18)
equals 0 are those for which
Z N 2 Z N P
P1 u u
1 g.u/ du D 1 g.u/ du: (6.19)
N P N P N 2
1 PC P
N P
1
Both the left and right sides of equation (6.19) are strictly positive.
Z 1 As P1 increases
from 0 to
u
P (its upper limit), the left side of equation (6.19) decreases from 1 g.u/ du to
N P N P
Z N 2 Z N P
P u u
1 g.u/ du and the right side increases from 1 g.u/ du to
N P N P N 2
1 P
N P
Z N P
u
1 g.u/ du. Assume that P is such that
0 N P
Z 1 Z N P
u u
1 g.u/ du 1 g.u/ du
N P N P N 2
1 P
N P
and is also such that
Z N P Z N 2
u P u
1 g.u/ du 1 g.u/ du
0 N P N P N P
—otherwise, there would not exist any solution (for P1 ) to equation (6.19). Then, there exists a unique
value, say P1, of P1 that is a solution to equation (6.19).
Suppose the test (of H0 versus H1 ) is that with critical region C and that P1 D P1. That is,
suppose the test is that with critical region
C D y W T < N12 P or T > N 2P ; (6.20)
˚
2 1
where P2 D P P1. Then,
434 Confidence Intervals (or Sets) and Tests of Hypotheses
N 2 .0 =/2
d Œ1 ./ u
Z
P
1 1
D .N P / 1 g.u/ du:
d N 2 .0 =/2 N P
1 PC P
1
and ˇ
d Œ1 ./ ˇˇ
D 0:
d ˇ
D0
To conclude that the test is unbiased (and, in fact, strictly unbiased), it remains only to show that
d Œ1 ./
¤ 0 for ¤ 0 or, equivalently, that
d
Z N 2 .0 =/2
P1 u
1 g.u/ du D 0 (6.21)
N 2 .0 =/
2 N P
1 P C P1
implies that D 0 .
Suppose that satisfies condition (6.21). Then,
N 2P .0 =/2 > N P > N 21 ; P C P1 .0 =/
2
1
u
since otherwise 1 would either be less than 0 for all values of u between N 21 P C P1
.0 =/2
N P
and N 2P .0 =/2 or greater than 0 for all such values. Thus, is such that
1
N 2 .0 =/2 N P
u u
Z Z
P1
1 g.u/ du D 1 g.u/ du:
N P N P N 2 .0 =/2 N P
1 PC P
1
Moreover, if < 0 (in which case 0 = > 1), then
Z N 2 .0 =/2 Z N 2
P
1
u P
1
u
1 g.u/ du > 1 g.u/ du
N P N P N P N P
Z N P
u
D 1 g.u/ du
N 2 N P
1 PC P
1
N P
u
Z
> 1 g.u/ du:
N 2 .0 =/
2 N P
1 P C P1
where G./ is the cdf of the 2 .N P / distribution and G ./ is the cdf of the 2 .N P C 2/
distribution. It is also worth noting that equation (6.19) does not involve 0 and hence that P1 does
not vary with the choice of 0 .
Corresponding to the size- P strictly unbiased test of H0 versus H1 with critical region C is the
following 100.1 P /% confidence interval for :
s s
S S
: (6.23)
N 2P N12 P
1 2
This interval is the special case of the 100.1 P /% confidence interval (6.4) obtained upon setting
P1 D P1 and P2 D P2 (D P P1 ). As is evident from the (strict) unbiasedness of the corresponding
test, the 100.1 P /% confidence interval (6.23) is strictly unbiased in the sense that the probability
1 P of its covering the true value of is greater than the probability of its covering any value other
than the true value. q
Like the 100.1 P /% confidence interval (6.23), the 100.1 P /% lower confidence bound S=N 2P
q
and the 100.1 P /% upper confidence bound S=N 21 P (for ) are strictly unbiased. The strict
unbiasedness of the lower confidence bound follows from the strict unbiasedness
q of the
size- P test
C C C 2
of H0 versus H1 with critical region C and is in the sense that Pr S=N P 0 < 1 P for
any positive scalar 0 such that 0 < . Similarly, the strict unbiasedness of the upper confidence
bound follows from the strict q
unbiasedness of the
size- P test of H0 versus H1 with critical region
C and is in the sense that Pr S=N 21 P 0 < 1 P for any scalar 0 such that 0 > .
b. An illustration
Let us illustrate various of the results of Subsection a by using them to add to the results obtained
earlier (in Sections 7.1, 7.2c, and 7.3d and in the final part of Section 7.3f) for the lettuce-yield
data. Accordingly, let us take y to be the 20 1 random vector whose observed value is the vector
of lettuce yields. And suppose that y follows the G–M model obtained upon taking the function
ı.u/ (that defines the response surface) to be the second-order polynomial (1.2) (where u is the
3-dimensional column vector whose elements represent transformed amounts of Cu, Mo, and Fe).
Suppose further that the distribution of the vector e of residual effects is N.0; 2 I/.
Then, as is evident from the results of Section 7.1, S (the residual sum of squares) equals
108:9407. And N P D N P D 10. Further, the usual (unbiased) point estimator O 2 D S=.N P /
of 2 equals 10:89, and upon taking the square root of this value, we obtain 3:30 as an estimate of .
When P D 0:10, the value P1 of P1 that is a solution to equation (6.19) is found to be 0:03495,
and the corresponding value P2 of P2 is P2 D P P1 D 0:06505. And N 2:03495 D 19:446, and
N12 :06505 D N 2:93495 D 4:258. Thus, upon setting S D 108:9407, P1 D 0:03495, and P 2 D 0:06505
in the interval (6.23), we obtain as a 90% strictly unbiased confidence interval for the interval
2:37 5:06:
By way of comparison, the 90% confidence interval for obtained upon setting S D 108:9407 and
P1 D P2 D 0:5 in the interval (6.4) is
2:44 5:26
—this interval is not unbiased.
If (instead of obtaining a two-sided confidence interval for ) we had chosen to obtain [as an
application of interval (6.2)] a 90% “lower confidence bound,” we would have obtained
2:61 < 1:
Similarly, if we had chosen to obtain [as an application of interval (6.3)] a 90% “upper confidence
bound,” we would have obtained
0 < 4:73:
436 Confidence Intervals (or Sets) and Tests of Hypotheses
c. Optimality
Are the tests of H0C, H0 , and H0 (versus H1C, H1 , and H1 ) with critical regions C C, C , and C
and the corresponding confidence intervals (6.2), (6.3), and (6.23) optimal and if so, in what sense?
These questions are addressed in what follows. In the initial treatment, attention is restricted to
translation-invariant procedures. Then, the results obtained in that context are extended to a broader
class of procedures.
Translation-invariant procedures. As noted earlier (in Section 7.3a) the vector d D L0 y [where L
is an N .N P / matrix of full column rank such that X0 L D 0 and L0 L D I] is an (N P )-
dimensional vector of linearly independent error contrasts—an error contrast is a linear combination
(of the elements of y) with an expected value equal to 0. And as is evident from the discussion of
error contrasts in Section 5.9b, a (possibly randomized) test of H0C, H0 , or H0 (versus H1C, H1 , or
H1 ) is translation invariant if and only if its critical function is expressible as a function, say .d/,
of d. Moreover, when the observed value of d (rather than that of y) is regarded as the data vector,
S D d0 d or, alternatively, T D S=02 is a sufficient statistic—refer to the next-to-last part of Section
7.3a for some relevant discussion. Thus, corresponding to the test with critical function .d/ is a
(possibly randomized) test with critical function EŒ.d/ j S or EŒ.d/ j T that depends on d only
through the value of S or T and that has the same power function.
Now, consider the size- P translation-invariant test of the null hypothesis H0C W 0 (versus
C C
the alternative hypothesis H1 W > 0 ) with critical region C D y W T > N 2P . Further, let
˚
represent any particular value of greater than 0 , let h0 ./ represent the pdf of the 2 .N P /
distribution (which is the distribution of T when D 0 ), let h ./ represent the pdf of the distribution
of T when D , and observe [in light of result (6.11)] that (for t > 0)
N P
h .t/ 0 2
D e Œ1 .0 = / t =2: (6.24)
h0 .t/
Then, upon applying Theorem 7.4.1 (the Neyman–Pearson lemma) with X D T , D , ‚ D
Œ0 ; 1/, .0/ D 0 , and D , we find that conditions (4.2) and (4.3) (of Theorem 7.4.1) are
satisfied when the critical region is taken to be the set consisting of all values of T for which T > N 2P .
And upon observing that the test of H0C with critical region C C is such that Pr.y 2 C C / P for
< 0 as well as for D 0 and upon recalling the discussion following Theorem 7.4.1, we
find that the test of H0C with critical region C C is UMP among all (possibly randomized) level- P
translation-invariant tests—note that the set consisting of all such tests is a subset of the set consisting
of all (possibly randomized) translation-invariant tests for which the probability of rejecting H0C is
less than or equal to P when D 0 .
By employing a similar argument, it can be shown that the size- P translation-invariant test of the
null hypothesis H0 W 0 (versus the alternative hypothesis H1 W < 0 ) with critical region
C D y W T < N12 P is UMP among all (possibly randomized) level- P translation-invariant tests
˚
(of H0 versus H1 ).
It remains to consider the size- P translation-invariant test of the null hypothesis H0 W D 0
(versus the alternative hypothesis H1 W ¤ 0 ) with critical region C. In the special case where
C D C (i.e., the special case where P1 D P1 and P2 D P2 D P P1 ), this test is (strictly)
unbiased. In fact, the test of H0 with critical region C is optimal in the sense that it is UMP among
all (possibly randomized) level- P translation-invariant unbiased tests of H0 (versus H1 ).
Let us verify the optimality of this test. Accordingly, take .T / to be a function of T that
represents the critical function of a (translation-invariant possibly randomized) test of H0 (versus
H1 ). Further, denote by ./ the power function of the test with critical function .T /. And observe
that (by definition) this test is of level P if .0 / P or, equivalently, if
Z 1
.t/ h0 .t/ dt P (6.25)
0
The Residual Variance 2 : Confidence Intervals and Tests 437
where k1 D .0 = / .N P / .k1 k2 /, k2 D .0 = / .N P / k2 =.N P /, and b D .1=2/Œ1
.0 = /2 . Moreover, among the choices for the critical function ./ are choices that satisfy condi-
tions (6.26) and (6.27) and for which . / P , as is evident upon observing that one such choice is
that obtained upon setting .t/ D P (for all t). Thus, if the critical function ./ satisfies conditions
(6.26) and (6.27) and is such that corresponding to every choice of there exist constants k1 and
k2 that satisfy condition (6.29) or, equivalently, condition (6.30), then the test with critical function
./ is a size- P translation-invariant unbiased test and [since the tests for which the critical function
./ is such that the test is of level P and is unbiased constitute a subset of those tests for which ./
satisfies conditions (6.26) and (6.27)] is UMP among all level- P translation-invariant unbiased tests.
Suppose that the choice ./ for ./ is as follows:
8
< 1; when t < N 2 or t > N 2 ,
1 P2 P1
.t/ D (6.31)
: 0; when N 2 t N 2 .
1 P2 P
1
And observe that for this choice of ./, the test with critical function ./ is identical to the (non-
randomized) test of H0 with critical region C . By construction, this test is such that ./ satisfies
conditions (6.26) and (6.27). Thus, to verify that the test of H0 with critical region C is UMP
among all (possibly randomized) level- P translation-invariant unbiased tests, it suffices to show that
(corresponding to every choice of ) there exist constants k1 and k2 such that ./ is expressible
in the form (6.30).
438 Confidence Intervals (or Sets) and Tests of Hypotheses
Accordingly, suppose that k1 and k2 are the constants defined (implicitly) by taking k1 and k2
to be such that
k1 C k2 c0 D e b c0 and k1 C k2 c1 D e b c1;
2 2
where c0 D N1 P and c1 D N P . Further, let u.t/ D e bt
.k1 C k2 t/ (a function of t with domain
2 1
0 < t < 1), and observe that u.c1 / D u.c0 / D 0. Observe also that
du.t/ d 2 u.t/
D be bt k2 and D b 2 e bt > 0;
dt dt 2
so that u./ is a strictly convex function and its derivative is a strictly increasing function. Then,
clearly,
u.t/ < 0 for c0 < t < c1 . (6.32)
du.t/ du.t/
And < 0 for t c0 and > 0 for t c1 , which implies that
dt dt
u.t/ > 0 for t < c0 and t > c1
and hence in combination with result (6.32) implies that .t/ is expressible in the form (6.30) and
which in doing so completes the verification that the test of H0 (versus H1 ) with critical region C
is UMP among all level- P translation-invariant unbiased tests.
Corresponding to the test of H0 with critical region C is the 100.1 P /% translation-invariant
strictly unbiased confidence interval (6.23). Among translation-invariant confidence sets (for ) that
have a probability of coverage greater than or equal to 1 P and that are unbiased (in the sense that
the probability of covering the true value of is greater than or equal to the probability of covering
any value other than the true value), the confidence interval (6.23) is optimal; it is optimal in the
sense that the probability of covering any value of other than the true value is minimized.
The 100.1 P /% translation-invariant confidence interval (6.2) is optimal in a different sense;
among all translation-invariant confidence sets (for ) that have a probability of coverage greater than
or equal to 1 P , it is optimal in the sense that it is the confidence set that minimizes the probability
of covering values of smaller than the true value. Analogously, among all translation-invariant
confidence sets that have a probability of coverage greater than or equal to 1 P , the 100.1 P /%
translation-invariant confidence interval (6.3) is optimal in the sense that it is the confidence set that
minimizes the probability of covering values of larger than the true value.
Optimality in the absence of a restriction to translation-invariant procedures. Let z D O 0y rep-
resent an observable N -dimensional random column vector that follows the canonical form of the
G–M model (as defined in Section 7.3a) in the special case where M D P . Then, ˛ and its least
squares estimator ˛O are P -dimensional, and z D .˛O 0; d0 /0, where (as before) d D L0 y, so that the
critical function of any (possibly randomized) test of H0C, H0 , or H0 is expressible as a function of
˛O and d. Moreover, ˛O and T D S=02 D d0d=02 D y 0.I PX /y=02 form a sufficient statistic, as is
evident upon recalling the results of the next-to-last part of Section 7.3a. And corresponding to any
(possibly randomized) test of H0C, H0 , or H0 , say one with critical function . Q ˛;
O d/, there is a test
with a critical function, say .T; ˛/,O that depends on d only through the value of T and that has the
same power function—take .T; ˛/ O D EŒ .Q ˛; O Thus, for present purposes, it suffices to
O d/ j T; ˛.
restrict attention to tests with critical functions that are expressible in the form .T; ˛/.O
Suppose that .T; ˛/ O is the critical function of a level- P test of the null hypothesis H0C W 0
versus the alternative hypothesis H1C W > 0 , and consider the choice of the function .T; ˛/. O Fur-
ther, let .; ˛/ represent the power function of the test. Then, by definition, .; ˛/ D EŒ.T; ˛/, O
and, in particular, .0 ; ˛/ D E 0Œ .T; ˛/, O where E 0 represents the expectation operator in the
special case where D 0 . Since the test is of level P, .0 ; ˛/ P (for all ˛).
Now, suppose that the level- P test with critical function .T; ˛/ O and power function .; ˛/
is unbiased. Then, upon observing that .; / is a continuous function, it follows—refer, e.g., to
Lehmann and Romano (2005b, sec. 4.1)—that
.0 ; ˛/ D P (for all ˛). (6.33)
The Residual Variance 2 : Confidence Intervals and Tests 439
Clearly,
.; ˛/ D EfEŒ.T; ˛/
O j ˛g:
O (6.34)
In particular, .0 ; ˛/ D E 0fE 0Œ.T; ˛/ O so that result (6.33) can be restated as
O j ˛g,
E 0fE 0Œ.T; ˛/ O D P (for all ˛).
O j ˛g (6.35)
Moreover, with fixed (at 0 ), ˛O is a complete sufficient statistic—refer to the next-to-last part of
Section 7.3a. Thus, E 0Œ.T; ˛/
O j ˛O does not depend on ˛, and condition (6.35) is equivalent to the
condition
E 0Œ.T; ˛/ O D P (wp1).
O j ˛ (6.36)
Let represent any particular value of greater than 0 , let ˛ represent any particular value of
˛, and let E represent the expectation operator in the special case where D and ˛ D ˛ . Then, in
light of result (6.34), the choice of the critical function .T; ˛/
O that maximizes . ; ˛ / subject to
the constraint (6.36), and hence subject to the constraint (6.33), is that obtained by choosing (for each
value of ˛)
O .; ˛/O so as to maximize E Œ.T; ˛/ O subject to the constraint E 0Œ.T; ˛/
O j ˛ O D P.
O j ˛
Moreover, upon observing that T is distributed independently of ˛O and hence that the distribution of
T conditional on ˛O is the same as the unconditional distribution of T and upon proceeding as in the
preceding part of the present subsection (in determining the optimal translation-invariant test), we
find that (for every choice of and ˛ and for every value of ˛) O E Œ.T; ˛/ O can be maximized
O j ˛
subject to the constraint E 0Œ.T; ˛/ O D P by taking
O j ˛
(
1; when t > N 2P ,
O D
.t; ˛/ (6.37)
0; when t N 2P .
And it follows that the test with critical function (6.37) is UMP among all tests of H0C (versus H1C )
with a critical function that satisfies condition (6.33).
Clearly, the test with critical function (6.37) is identical to the test with critical region C C, which
is the UMP level- P translation-invariant test. And upon recalling (from Subsection a) that the test
with critical region C C is unbiased and upon observing that those tests with a critical function for
which the test is of level- P and is unbiased is a subset of those tests with a critical function that
satisfies condition (6.33), we conclude that the test with critical region C C is UMP among all level- P
unbiased tests of H0C (versus H1C ).
It can be shown in similar fashion that the size- P translation-invariant unbiased test of H0 versus
H1 with critical region C is UMP among all level- P unbiased tests of H0 versus H1 . However,
as pointed out by Lehmann and Romano (2005b, sec. 3.9.1), the result on the optimality of the test
of H0C with critical region C C can be strengthened in a way that does not extend to the test of H0
with critical region C . In the case of the test of H0C, the restriction to unbiased tests is unnecessary.
It can be shown that the test of H0C versus H1C with critical region C C is UMP among all level- P
tests, not just among those level- P tests that are unbiased.
It remains to consider the optimality of the test of the null hypothesis H0 W D 0 (versus the
alternative hypothesis H1 W ¤ 0 ) with critical region C . Accordingly, suppose that .T; ˛/ O is the
critical function of a (possibly randomized) test of H0 (versus H1 ) with power function .; ˛/.
If the test is of level P and is unbiased, then [in light of the continuity of the function .; /]
.0 ; ˛/ D P (for all ˛) or, equivalently,
Z Z 1
O ˛/ dt d ˛O D P (for all ˛);
O h0 .t/f0 .˛I
.t; ˛/ (6.38)
RP 0
where f0 . I ˛/ represents the pdf of the N.˛; 02 I/ distribution (which is the distribution of ˛O when
D 0 ) and where (as before) h0 ./ represents the pdf of the 2 .N P / distribution (which is the
distribution of T when D 0 )—condition (6.38) is analogous to condition (6.26). Moreover,ˇ if the
d .; ˛/ ˇˇ
test is such that condition (6.38) is satisfied and if the test is unbiased, then D 0 (for
d ˇ D0
all ˛) or, equivalently,
440 Confidence Intervals (or Sets) and Tests of Hypotheses
1
t
Z Z
O
.t; ˛/ O ˛/ dt d ˛O D 0 (for all ˛);
1 h0 .t/f0 .˛I (6.39)
RP 0 N P
analogous to condition (6.27)—the equivalence of condition (6.39) can be verified via a relatively
straightforward exercise.
As in the case of testing the null hypothesis H0C,
.; ˛/ D EfEŒ.T; ˛/ O j ˛g:
O (6.40)
Moreover, condition (6.38) is equivalent to the condition
Z 1
O h0 .t/ dt D P (wp1);
.t; ˛/ (6.41)
0
and condition (6.39) is equivalent to the condition
Z 1
t
O
.t; ˛/ 1 h0 .t/ dt D 0 (wp1); (6.42)
0 N P
as is evident upon recalling that with fixed (at 0 ), ˛O is a complete sufficient statistic.
Denote by ˛ any particular value of ˛ and by any particular value of other than 0 , and
(as before) let h ./ represent the pdf of the distribution of T in the special case where D . And
observe (in light of the statistical independenceZof T and ˛) O that when D ,
1
EŒ.T; ˛/
O j ˛
O D O h .t/ dt:
.t; ˛/ (6.43)
0
Observe also [in light of result (6.43) along with result (6.40) and in light of the equivalence of
conditions (6.41) and (6.42) to conditions (6.38) and (6.39)] that to maximize . ; ˛ / [with respect
to the choice of the critical function .; /] subject to the constraint that .; / satisfy conditions
(6.38) and (6.39), it suffices to take (for each value of ˛) O .; ˛/O to be the critical function that
maximizes Z 1
O h .t/ dt
.t; ˛/ (6.44)
0
subject to the constraints imposed by the conditions
Z 1 Z 1
t
O h0 .t/ dt D P and
.t; ˛/ O
.t; ˛/ 1 h0 .t/ dt D 0: (6.45)
0 0 N P
A solution for .; ˛/O to the latter constrained maximization problem can be obtained by applying
the results obtained earlier (in the first part of the present subsection) in choosing a translation-
invariant critical function ./ so as to maximize the quantity (6.28) subject to the constraints imposed
by conditions (6.26) and (6.27). Upon doing so, we find that among those choices for the critical
function .T; ˛/ O that satisfy conditions (6.38) and (6.39), . ; ˛ / can be maximized (for every
choice of and ˛ ) by taking
8
< 1; when t < N 2 or t > N 2 ,
1 P2 P1
O D
.t; ˛/ (6.46)
: 0; when N 2 t N 2 ,
1 P P
2 1
which is the critical function of the size- P translation-invariant unbiased test of H0 (versus H1 ) with
critical region C . Since the set consisting of all level- P unbiased tests of H0 versus H1 is a subset
of the set consisting of all tests with a critical function that satisfies conditions (6.38) and (6.39), it
follows that the size- P translation-invariant unbiased test with critical region C is UMP among all
level- P unbiased tests.
The optimality properties of the various tests can be reexpressed as optimality properties of
the corresponding confidence intervals. Each of the confidence intervals (6.2) and (6.23) is optimal
in essentially the same sense as when attention is restricted to translation-invariant procedures. The
confidence interval (6.3) is optimal in the sense that among all confidence sets for with a probability
of coverage greater than or equal to 1 P and that are unbiased (in the sense that the probability of
covering the true value of is greater than or equal to the probability of covering any value larger
than the true value), the probability of covering any value larger than the true value is minimized.
Multiple Comparisons and Simultaneous Confidence Intervals: Some Enhancements 441
can be obtained by imposing a restriction on the false rejection of multiple null hypotheses less
severe than that imposed by a requirement that the FWER be less than or equal to P. And much
shorter confidence intervals for 1 ; 2 ; : : : ; M can be obtained by adopting a criterion less stringent
than that inherent in a requirement that the probability of simultaneous coverage be greater than or
equal to 1 P. Moreover, as is discussed in Subsection b, improvements can be effected in the various
test procedures (at the expense of additional complexity and additional computational demands) by
employing “step-down” methods.
In some applications of the testing of the M null hypotheses, the linear combinations
1 ; 2 ; : : : ; M may represent the “effects” of “genes” or other such entities, and the object may
be to “discover” or “detect” those entities whose effects are nonnegligible and that should be sub-
jected to further evaluation and/or future investigation. In such applications, M can be very large.
And limiting the number of rejections of true null hypotheses may be less of a point of emphasis than
442 Confidence Intervals (or Sets) and Tests of Hypotheses
rejecting a high proportion of the false null hypotheses. In Subsection c, an example is presented of
an application where M is in the thousands. Methods that are well suited for such applications are
discussed in Subsections d and e.
Some preliminaries. Let us incorporate the notation introduced in Section 7.3a (in connection with the
canonical form of the G–M model) and take advantage of the results introduced therein. Accordingly,
O D ƒ0.X0 X/ X0 y is the least squares estimator of the vector . And var./ O D 2 C, where
0 0
C D ƒ .X X/ ƒ.
Corresponding to is the transformed vector ˛ D S0, where S is an M M matrix such
that S0 CS D I. The least squares estimator of ˛ is ˛O D S0. O And ˛O N.˛; 2 I/. Further,
0 0 0
D W ˛, O D W ˛, O and C D W W . where W is the unique M M matrix that satisfies the
equality ƒSW D ƒ; and as an estimator of , we have the (positive) square root O of O 2 D
d0d=.N P / D y 0 .I PX /y=.N P /.
O so that (for i D 1; 2; : : : ; M ) Oi D 0i .X0 X/ X0 y
Let O1 ; O2 ; : : : ; OM represent the elements of ,
is the least squares estimator of i . And observe that i and its least squares estimator are reexpressible
as i D wi0 ˛ and Oi D wi0 ˛, O where wi represents the i th column of W. Moreover, (in light of the
assumption that no column of ƒ is null or is a scalar multiple of another column of ƒ) no column of
W is null or is a scalar multiple of another column of W ; and upon observing (in light of Theorem
2.4.21) that (for i ¤ j D 1; 2; : : : ; M )
jwi0wj j
jcorr.Oi ; Oj /j D < 1; (7.1)
.wi0wi /1=2 .wj0wj /1=2
it follows that (for i ¤ j D 1; 2; : : : ; M and for any constants Pi and Pj and any nonzero constants
ai and aj )
ai .Oi Pi / ¤ aj .Oj Pj / (wp1): (7.2)
For i D 1; 2; : : : ; M, define
Oi i .0/ Oi i.0/
ti D 0
and ti D : (7.3)
0
Œi .X X/ i 1=2 O Œ0i .X0 X/ i 1=2 O
And observe that ti and ti.0/ are reexpressible as
wi0 .˛O ˛/ wi0 .˛O ˛.0/ /
ti D and ti.0/ D ; (7.4)
.wi0wi /1=2 O .wi0wi /1=2 O
where ˛.0/ D S0 .0/ D .ƒS/0 ˇ .0/. Further, let t D .t1 ; t2 ; : : : ; tM /0, and observe that
t D D 1 W 0 ŒO 1
.˛O ˛/; (7.5)
Multiple comparisons. Among the procedures for testing each of the M null hypotheses
H1.0/; H2.0/; : : : ; HM
.0/
(and of doing so in a way that accounts for the multiplicity of tests) is that
provided by the generalized S method—refer to Section 7.3. The generalized S method controls the
FWER. That control comes at the expense of the power of the tests, which for even moderately large
values of M can be quite low.
A less conservative approach (i.e., one that strikes a better balance between the probability of
false rejections and the power of the tests) can be achieved by adopting a criterion that is based on
controlling what has been referred to by Lehmann and Romano (2005a) as the k-FWER (where k
Multiple Comparisons and Simultaneous Confidence Intervals: Enhancements 443
is a positive integer). In the present context, the k-FWER is the probability of falsely rejecting k or
more of the null hypotheses H1.0/; H2.0/; : : : ; HM .0/
, that is, the probability of rejecting Hi.0/ for k or
more of the values of i (between 1 and M, inclusive) for which i D i.0/. A procedure for testing
.0/ .0/ .0/
H1 ; H2 ; : : : ; HM is said to control the k-FWER at level P (where 0 < P < 1) if k-FWER P .
Clearly, the FWER is a special case of the k-FWER; it is the special case where k D 1. And (for any
k) FWER P ) k-FWER P (so that k-FWER P is a less stringent criterion than FWER P );
more generally, for any k 0 < k, k 0 -FWER P ) k-FWER P (so that k-FWER P is a less
stringent criterion than k 0 -FWER P ).
.0/ .0/ .0/
For purposes of devising a procedure for testing H1 ; H2 ; : : : ; HM that controls the k-FWER
at level P , let i1 ; i2 ; : : : ; iM represent a permutation of the first M positive integers 1; 2; : : : ; M such
that
jti1 j jti2 j jtiM j (7.7)
—as is evident from result (7.2), this permutation is unique (wp1). And (for j D 1; 2; : : : ; M ) define
t.j / D t ij . Further, denote by c P .j / the upper 100 P % point of the distribution of jt.j / j.
Now, consider the procedure for testing H1.0/; H2.0/; : : : ; HM
.0/
that (for i D 1; 2; : : : ; M ) rejects
.0/
Hi if and only if y 2 Ci , where the critical region Ci is defined as follows:
.0/
Ci D fy W jti j > c P .k/g: (7.8)
Clearly,
.0/
Pr y 2 Ci for k or more values of i with i D i
D Pr jti j > c P .k/ for k or more values of i with i D i.0/
Thus, the procedure that tests each of the null hypotheses H1.0/; H2.0/; : : : ; HM
.0/
on the basis of the
corresponding one of the critical regions C1 ; C2 ; : : : ; CM controls the k-FWER at level P ; its k-
FWER is less than or equal to P . In the special case where k D 1, this procedure is identical to that
obtained via the generalized S method, which was discussed earlier (in Section 7.3c).
Simultaneous confidence intervals. Corresponding to the test of Hi.0/ with critical region Ci is the
confidence interval, say Ai .y/, with end points
Oi ˙ Œ0i .X0 X/ i 1=2 O c P .k/ (7.10)
(i D 1; 2; : : : ; M ). The correspondence is that implicit in the following relationship:
i.0/ 2 Ai .y/ , y … Ci : (7.11)
The confidence intervals A1.y/; A2.y/; : : : ; AM .y/ are [in light of result (7.9)] such that
PrŒi 2 Ai .y/ for at least M kC1 values of i
D PrŒjti j c P .k/ for at least M kC1 values of i
D PrŒjti j > c P .k/ for no more than k 1 values of i
D1 PrŒjti j > c P .k/ for k or more values of i
D1 P:
In the special case where k D 1, the confidence intervals A1.y/; A2.y/; : : : ; AM .y/ are identical
(when is taken to be the set whose members are the columns of IM ) to the confidence intervals
Aı .y/ (ı 2 ) of Section 7.3c—refer to the representation (3.94)—and (in that special case) the
probability of simultaneous coverage by all M of the intervals is equal to 1 P .
Computations/approximations. To implement the test and/or interval procedures, we require the
444 Confidence Intervals (or Sets) and Tests of Hypotheses
upper 100 P % point c P .k/ of the distribution of jt.k/ j. As discussed in Section 7.3c in the special
case of the computation of c P —when k D 1, c P .k/ D c P —Monte Carlo methods can (at least in
principle) be used to compute c P .k/. Whether the use of Monte Carlo methods is feasible depends
on the feasibility of making a large number of draws from the distribution of jt.k/ j. The process of
making a large number of such draws can be facilitated by taking advantage of results (7.5) and (7.6).
And by employing methods like those discussed by Edwards and Berry (1987), the resultant draws
can be used to approximate c P .k/ to a high degree of accuracy.
If the use of Monte Carlo methods is judged to be infeasible, overly burdensome, or aesthetically
unacceptable, there remains the option of replacing c P .k/ with an upper bound. In that regard, it can
be shown that for any M random variables x1 ; x2 ; : : : ; xM and any constant c,
Pr. xi > c for k or more values of i / .1=k/ M i D1 Pr. xi > c/: (7.12)
P
And upon applying inequality (7.12) in the special case where xi D jti j (i D 1; 2; : : : ; M ) and
where c D tNk P =.2M / .N P / and upon observing that (for i D 1; 2; : : : ; M ) ti S t.N P / and
hence that
in the third part of Section 7.3e and the second part of Section 7.5e, versions of the confidence
intervals and confidence bounds can be obtained such that all M intervals are of equal length and
such that all M bounds are equidistant from the least squares estimates.
Nonnormality. The assumption that the vector e of residual effects in the G–M model has an MVN
distribution is stronger than necessary. To insure that the probability of the tests falsely rejecting k or
more of the M null hypotheses does not exceed P and to insure that the probability of the confidence
intervals or bounds covering at least M kC1 of the M linear combinations 1 ; 2 ; : : : ; M is equal to
1 P, it is sufficient
that e have an absolutely continuous spherical distribution. In fact, it is sufficient
˛O ˛
that the vector have an absolutely continuous spherical distribution.
d
An alternative procedure for testing the null hypotheses H1.0/; H2.0/; : : : ; HM .0/
in a way that con-
trols the FWER or, more generally, the k-FWER: definition, characteristics, terminology, and
.0/ .0/ .0/
properties. The null hypotheses H1 ; H2 ; : : : ; HM can be tested in a way that controls the FWER
or, more generally, the k-FWER by adopting the procedure with critical regions C1 ; C2 ; : : : ; CM . For
purposes of defining an alternative to this procedure, let (for j D k; k C1; : : : ; M ) Q kIj represent
a collection of subsets of I D f1; 2; : : : ; M g consisting of every S I for which MS k and for
Q C represent a collection (of subsets of I ) consisting of those subsets
which iQk.S / D iQj . And let kIj
in Q C is the
Q kIj whose elements include all M j of the integers iQj C1 ; iQj C2 ; : : : ; iQM , that is,
kIj
collection of those subsets of I whose elements consist of k 1 of the integers iQ1 ; iQ2 ; : : : ; iQj 1 and
all M j C1 of the integers iQj ; iQj C1 ; : : : ; iQM . By definition,
Multiple Comparisons and Simultaneous Confidence Intervals: Enhancements 447
QC
Q .j D k; k C1; : : : ; M / and Q C D fI g:
(7.20)
kIj kIj kIk
Moreover,
Q C C QC
S 2 kIj ) S S for some S 2 kIj ; (7.21)
and (for j > k)
Q kIj ) S S C for some S C 2
S 2 Q kIj 1 (7.22)
C C
[where in result (7.22), S is such that S is a proper subset of S ].
Now, for j D k; k C1; : : : ; M, define
˛j D max c P .kI S /: (7.23)
Q kIj
S2
And [recalling result (7.19)] note [in light of results (7.20) and (7.21)] that ˛j is reexpressible as
˛j D max c P .kI S / (7.24)
QC
S2 kIj
—in the special case where k D 1, Q C contains a single set, say SQj (the elements of which are
kIj
iQj ; iQj C1 ; : : : ; iQM ), and in that special case, equality (7.24) simplifies to ˛j D c P .1I Sj /. Note also
that
˛k D c P .kI I / D c P .k/: (7.25)
Moreover, for k j 0 < j M,
˛j 0 > ˛j ; (7.26)
as is evident from result (7.22) [upon once again recalling result (7.19)].
Let us extend the definition (7.23) of the ˛j ’s by taking
˛1 D ˛2 D D ˛k 1 D ˛k ; (7.27)
so that [in light of inequality (7.26)]
˛1 D D ˛k 1 D ˛k > ˛kC1 > ˛kC2 > > ˛M : (7.28)
.0/
Further, take J to be an integer (between 0 and M, inclusive) such that jt.j / j > ˛j for j D 1; 2; : : : ; J
.0/ .0/ .0/
and jt.J C1/
j ˛J C1 —if jt.j /
j > ˛j for j D 1; 2; : : : ; M, take J D M ; and if jt.1/ j ˛1 , take
J D 0. And consider the following multiple-comparison procedure for testing the null hypotheses
H1.0/; H2.0/; : : : ; HM
.0/
: when J 1, HQ.0/; HQ.0/; : : : ; HQ.0/ are rejected; when J D 0, none of the null
i1 i2 iJ
hypotheses are rejected. This procedure can be regarded as a stepwise procedure, one of a kind known
as a step-down procedure. Specifically, the procedure can be regarded as one in which (starting with
.0/ .0/ .0/ .0/
HQ ) the null hypotheses are tested sequentially in the order HQ ; HQ ; : : : ; HQ by comparing the
i1 i1 i2 iM
.0/
jt.j /
.0/
j’s with the ˛j ’s; the testing ceases upon encountering a null hypothesis HQ.0/ for which jt.j /
j
ij
does not exceed ˛j .
.0/ .0/ .0/
The step-down procedure for testing the null hypotheses H1 ; H2 ; : : : ; HM can be character-
ized in terms of its critical regions. The critical regions of this procedure, say C1; C2; : : : ; CM
, are
expressible as follows:
CiD fy W J 1 and i D iQj 0 for some integer j 0 between 1 and J, inclusiveg (7.29)
(i D 1; 2; : : : ; M ). Alternatively, Ci is expressible in the form
M n o
.0/
[
Ci D y W i D iQj 0 and for every j j 0, t.j /
> ˛j (7.30)
j 0 D1
.0/
—for any particular value of y, t.j / > ˛j for every j j 0 if and only if j 0 J.
It can be shown (and subsequently will be shown) that the step-down procedure for testing the
null hypotheses H1.0/; H2.0/; : : : ; HM
.0/
, like the test procedure considered in Subsection a, controls the
k-FWER at level P (and in the special case where k D 1, controls the FWER at level P ). Moreover, in
448 Confidence Intervals (or Sets) and Tests of Hypotheses
light of result (7.25) and definition (7.27), the critical regions C1 ; C2 ; : : : ; CM of the test procedure
considered in Subsection a are reexpressible in the form
M
.0/
[
y W i D iQj 0 and t.j (7.31)
˚
Ci D 0 / > ˛1
j 0 D1
(i D 1; 2; : : : ; M ). And upon comparing expression (7.31) with expression (7.30) and upon observing
.0/ .0/ 0
[in light of the relationships (7.28)] that t.j 0 / > ˛1 implies that t.j / > ˛j for every j j , we find that
Ci Ci .i D 1; 2; : : : ; M /: (7.32)
That is, the critical regions C1 ; C2 ; : : : ; CM of the test procedure considered in Subsection a are
subsets of the corresponding critical regions C1; C2; : : : ; CM
of the step-down procedure. In fact,
Ci is a proper subset of Ci (i D 1; 2; : : : ; M ).
We conclude that while both the step-down procedure and the procedure with critical regions
C1 ; C2 ; : : : ; CM control the k-FWER (or when k D 1, the FWER), the step-down procedure is more
powerful in that its adoption can result in additional rejections. However, at the same time, it is
worth noting that the increased power comes at the expense of some increase in complexity and
computational intensity.
Verification that the step-down procedure for testing the null hypotheses H1.0/; H2.0/; : : : ; HM
.0/
controls the k-FWER (at level P ). Suppose (for purposes of verifying that the step-down procedure
controls the k-FWER) that MT k—if MT < k, then fewer than k of the null hypotheses are
true and hence at most there can be k 1 false rejections. Then, there exists an integer j 0 (where
k j 0 M MT C k) such that iQj 0 D iQk.T / —j 0 D k when T is the k-dimensional set whose
elements are iQ1 ; iQ2 ; : : : ; iQk , and j 0 D M MT C k when T is the set whose MT elements are
iQM MT C1 ; iQM MT C2 ; : : : ; iQM .
The step-down procedure results in k or more false rejections if and only if
.0/ .0/ .0/ .0/
jt.1/ j > ˛1 ; jt.2/ j > ˛2 ; : : : ; jt.j 0 1/
j > ˛j 0 1; and jt.j 0 / j > ˛j 0 : (7.33)
Thus, the step-down procedure is such that
.0/
Pr.k or more false rejections/ Pr jt.j 0 / j > ˛j 0 : (7.34)
Moreover, .0/ .0/
t.j 0 / D tQ D tQ.0/
.0/
D tkIT D tkIT ; (7.35)
ij 0 ik .T /
as is evident upon recalling result (7.18), and
˛j 0 D max c P .kI S / c P .kI T /: (7.36)
Q kIj 0
S2
—when Q kIj \ D ¿ (the empty set), set ˛j D 1. And in the generalization, continue [as in
definition (7.27)] to set
˛1 D ˛2 D D ˛k 1 D ˛k :
Under the extended definition (7.40), it is no longer necessarily the case that ˛j 0 > ˛j for every
j and j 0 for which k j 0 < j M and hence no longer necessarily the case that the sequence
˛1 ; ˛2 ; : : : ; ˛M is nonincreasing. Nor is it necessarily the case that ˛j is reexpressible (for k j
M ) as ˛j D maxS2.Q C \ / c P .kI S /, contrary to what might have been conjectured on the basis
kIj
of result (7.24).
Like the original version of the step-down procedure, the generalized (to account for the constraint
T 2 ) version controls the k-FWER or (in the special case where k D 1) the FWER (at level P ).
That the generalized version has that property can be verified by proceeding in essentially the same
way as in the verification (in a preceding part of the present subsection) that the original version has
that property. In that regard, it is worth noting that for T 2 , Q kIj 0 \ is nonempty and hence
˛j 0 is finite. And in the extension of the verification to the generalized version, the maximization
(with respect to S ) in result (7.36) is over the intersection of Q kIj 0 with rather than over Q kIj 0
itself.
When D , the generalized version of the step-down procedure is identical to the original
version. When is “smaller” than (as when 1 ; 2 ; : : : ; M are linearly dependent and D
), some of the ˛j ’s employed in the generalized version are smaller than (and the rest equal to)
those employed in the original version. Thus, when is smaller than ,the generalized version is
more powerful than the original (in that its use can result in additional rejections).
It is informative to consider the generalized version of the step-down procedure in the afore-
mentioned simple special case where the only members of are I and the empty set. In that
special case, the generalized version is such that ˛k D c P .kI I / D c P .k/—refer to result (7.25)—
and ˛kC1 D ˛kC2 D D ˛M D 1. Thus, in that special case, the generalized version of
the step-down procedure rejects none of the null hypotheses H1.0/; H2.0/; : : : ; HM .0/ .0/
if jt.1/ j c P .k/,
.0/ .0/ .0/ .0/ 0 .0/
rejects HQ ; HQ ; : : : ; HQ if jt.j / j > c P .k/ (j D 1; 2; : : : ; j ) and jt.j 0 C1/ j c P .k/ for some
i1 i2 ij 0
.0/
integer j 0 between 1 and k 1, inclusive, and rejects all M of the null hypotheses if jt.j / j > c P .k/
(j D 1; 2; : : : ; k).
An illustrative example. For purposes of illustration, consider a setting where M D P D 3, with
10 D .1; 1; 0/, 20 D .1; 0; 1/, and 30 D .0; 1; 1/. This setting is of a kind that is encountered
in applications where pairwise comparisons are to be made among some number of “treatments” (3
in this case).
Clearly, any two of the three vectors 1 , 2 , and 3 are linearly independent. Moreover, each
of these three vectors is expressible as a difference between the other two (e.g., 3 D 2 1 ),
implying in particular that M D 2 and that 1 , 2 , and 3 are linearly dependent. And the 2M D 8
members of are ¿ (the empty set), f1g, f2g, f3g, f1; 2g, f1; 3g, f2; 3g, and I D f1; 2; 3g; and the
members of are ¿, f1g, f2g, f3g, and f1; 2; 3g.
Now, suppose that k D 1. Then, we find that (for j D 1; 2; 3) the members of the collections
Q , Q C , and
Q \ are
kIj kIj kIj
Q W fiQ1 g; fiQ1 ; iQ2 g; fiQ1 ; iQ3 g; and fiQ1 ; iQ2 ; iQ3 g;
1I1
Q C W fiQ1 ; iQ2 ; iQ3 g;
1I1
Q \ W fiQ1 g and fiQ1 ; iQ2 ; iQ3 g;
1I1
Q W fiQ2 g and fiQ2 ; iQ3 g;
1I2
Q C W fiQ2 ; iQ3 g;
1I2
Q \ W fiQ2 g;
1I2
Q W fiQ3 g;
1I3
Multiple Comparisons and Simultaneous Confidence Intervals: Enhancements 451
Q C W fiQ3 g;
1I3
Q \ W fiQ3 g:
1I3
Alternatively, suppose that k D 2. Then, we find that (for j D 2; 3) the members of the collections
Q , Q C , and Q
kIj kIj kIj \ are
Q W
fiQ1 ; iQ2 g and fiQ1 ; iQ2 ; iQ3 g;
2W2
Q C
2W2 W fiQ1 ; iQ2 ; iQ3 g;
Q \ W
fiQ1 ; iQ2 ; iQ3 g;
2W2
Q W
fiQ1 ; iQ3 g and fiQ2 ; iQ3 g;
2W3
QC W
fiQ1 ; iQ3 g and fiQ2 ; iQ3 g;
2W3
Q \ W
¿ (the empty set):
2W3
Q C , MS D k 1CM j C1 D M Ck j ) that
and hence (since for S 2 kIj
write s1 for the first element of ˇs and s2 for the second element. The quantities of interest are
represented by the M D 6033 linear combinations 1 ; 2 ; : : : ; M defined as follows:
s D s2 s1 .s D 1; 2; : : : ; 6033/:
And, conceivably, the problem of discovering which genes are worthy of further study (i.e., might
be associated with prostate cancer development) could be formulated as one of testing the 6033 null
hypotheses Hs.0/ W s D 0 (s D 1; 2; : : : ; 6033) versus the alternative hypotheses Hs.1/ W s ¤ 0
(s D 1; 2; : : : ; 6033). In such an approach, the genes corresponding to whichever null hypotheses
are rejected would be deemed to be the ones of interest.
Note that the model for the prostate data is such that the variance-covariance matrix of the residual
effects is of the form (4.5.4). And recall that the variance-covariance matrix of the residual effects in
the G–M model is of the form (4.1.17). Thus, to obtain a model for the prostate data that is a G–M
model, we would need to introduce the simplifying assumptions that ss 0 D 0 for s 0 ¤ s D 1; 2; : : : ; S
and that 11 D 22 D D SS .
Alternative (to the FWER or k-FWER) criteria for devising and evaluating multiple-comparison
procedures: the false discovery proportion and false discovery rate. In applications of multiple-
comparison procedures where the procedure is to be used as a screening device, the number M of null
hypotheses is generally large and sometimes (as in the example) very large. Multiple-comparison
procedures (like those described and discussed in Section 7.3c) that restrict the FWER (familywise
error rate) to the customary levels (such as 0:01 or 0:05) are not well suited for such applications.
Those procedures are such that among the linear combinations i (i 2 F ), only those for which the
true value differs from the hypothesized value by a very large margin have a reasonable chance of
being rejected (discovered). It would seem that the situation could be improved to at least some extent
by taking the level to which the FWER is restricted to be much higher than what is customary. Or one
could turn to procedures [like those considered in Subsections a and b (of the present section)] that
restrict the k-FWER to a specified level, taking the value of k to be larger (and perhaps much larger)
than 1 and perhaps taking the level to which the k-FWER is restricted to be higher or much higher
than the customary levels. An alternative (to be considered in what follows) is to adopt a procedure
like that devised by Benjamini and Hochberg (1995) on the basis of a criterion that more directly
reflects the objectives underlying the use of the procedure as a screening device.
The criterion employed by Benjamini and Hochberg is defined in terms of the false discovery
proportion as is a related, but somewhat different, criterion considered by Lehmann and Romano
(2005a). By definition, the false discovery proportion is the number of rejected null hypotheses that
are true (number of false discoveries) divided by the total number of rejections—when the total
number of rejections equals 0, the value of the false discovery proportion is by convention (e.g.,
Benjamini and Hochberg 1995; Lehmann and Romano 2005a) taken to be 0. Let us write FDP for
the false discovery proportion. And let us consider the problem of devising multiple-comparison
procedures for which that proportion is likely to be small.
In what follows (in Subsections e and d), two approaches to this problem are described, one of
which is that of Benjamini and Hochberg (1995) (and of Benjamini and Yekutieli 2001) and the other
of which is that of Lehmann and Romano (2005a). The difference between the two approaches is
attributable to a difference in the criterion adopted as a basis for exercising control over the FDP. In
the Benjamini and Hochberg approach, that control takes the form of a requirement that for some
specified constant ı (0 < ı < 1),
E.FDP/ ı: (7.45)
As has become customary, let us refer to the expected value E.FDP/ of the false discovery proportion
as the false discovery rate and denote it by the symbol FDR. In the Lehmann and Romano approach,
control over the FDP takes a different form; it takes the form of a requirement that for some constants
and [in the interval .0; 1/],
Pr.FDP > / (7.46)
—typically, and are chosen to be much closer to 0 than to 1, so that the effect of the requirement
Multiple Comparisons and Simultaneous Confidence Intervals: Enhancements 455
(7.46) is to impose on FDP a rather stringent upper bound that is violated only infrequently. As
is to be demonstrated herein, multiple-comparison procedures that satisfy requirement (7.45) or
requirement (7.46) can be devised; these procedures are stepwise in nature.
The significance of result (7.49) is that it can be exploited for purposes of obtaining relatively
tractable conditions that when satisfied by the step-down procedure insure that Pr.FDP > / .
Suppose [for purposes of exploiting result (7.49) for such purposes] that j 0 1. Then, as can with
some effort be verified (and as is to be verified in a subsequent part of the present subsection),
Rj 0 D 0 (7.50)
and 0 C 0
j Rj 0 D Œj C 1; (7.51)
where (for any real number x) Œx denotes the largest integer that is less than or equal to x.
Let us denote by k 0 the random variable defined (in terms of j 0 ) as follows: k 0 D Œj 0 C 1 for
values of j 0 1 and k 0 D 0 for j 0 D 0. And observe [in light of result (7.51)] that (for j 0 1)
MF RjC0 D j 0 k 0: (7.52)
Observe also (in light of the equality MT D M MF ) that
MT M .j 0 k0/ D M C k0 j 0: (7.53)
0 Q Q Q
Further, let us (for j 1) denote by k the number of members in the set fi1 ; i2 ; : : : ; ij 0 g that
are members of T, so that k of the null hypotheses HQ.0/ ; HQ.0/ ; : : : ; HQ.0/ are true (and the other
i1 i2 ij 0
0 0 .0/
j k are false). And suppose (in addition to j 1) that jtQ j > ˛j .j D 1; 2; : : : ; j 0 /. Then, clearly,
ij
RjC0 j 0 k (or, equivalently, k j 0 RjC0 ), so that [in light of result (7.51)]
k0 k j 0: (7.54)
0
Thus, for some strictly positive integer s j ,
iQk0 .T / D iQs :
And [upon observing that iQk0 .T / D ik0 .T / and that iQs 2 T ] it follows that
jtk 0 IT j D jti .T / j D jtQ j D jtQ.0/ j > ˛s ˛j 0 : (7.55)
k0 is is
Upon applying result (7.55) [and realizing that result (7.54) implies that k 0 MT ], we obtain
the inequality
PrŒj 0 1 and jtQ.0/ j > ˛j .j D 1; 2; : : : ; j 0 / Pr.j 0 1; k 0 MT ; and jtk 0 IT j > ˛j 0 /: (7.56)
ij
The relevance of inequality (7.56) is the implication that to obtain a step-down procedure for which
.0/
PrŒj 0 1 and jtQi j > ˛j .j D 1; 2; : : : ; j 0 / and ultimately [in light of relationship (7.49)] one
j
for which Pr.FDP > / , it suffices to obtain a procedure for which
Pr.j 0 1; k 0 MT ; and jtk 0 IT j > ˛j 0 / : (7.57)
For purposes of obtaining a more tractable sufficient condition than condition (7.57), observe
that (for j 0 1)
k 0 1 j 0 < k 0 (7.58)
and that
1 k 0 ŒM C 1: (7.59)
Accordingly, the nonzero values of the random variable j 0 can be partitioned into mutually exclusive
categories based on the value of k 0 : for u D 1; 2; : : : ; ŒM C 1, let
Multiple Comparisons and Simultaneous Confidence Intervals: Enhancements 457
Let K D min.ŒM C 1; MT /. Then, based on the partitioning of the nonzero values of j 0 into
the mutually exclusive categories defined by relationship (7.60), we find that
Pr.j 0 1; k 0 MT ; and jtk 0 IT j > ˛j 0 /
D K 0
uD1 Pr.jtuIT j > ˛j 0 and j 2 Iu /
P
K 0
uD1 Pr.jtuIT j > ˛u and j 2 Iu /
P
PK
uD1 Pr.jtrIT j > ˛r for 1 or more values of r 2 f1; 2; : : : ; Kg and j 0 2 Iu /
Pr.jtrIT j > ˛r for 1 or more values of r 2 f1; 2; : : : ; Kg/: (7.61)
Thus, to obtain a step-down procedure for which Pr.FDP > / , it suffices [in light of the
sufficiency of condition (7.57)] to obtain a procedure for which
Pr.jtrIT j > ˛r for 1 or more values of r 2 f1; 2; : : : ; Kg/ : (7.62)
Verification of results (7.50) and (7.51). Let us verify result (7.50). Suppose that j 0 2—if j 0 D 1,
then j 0 RjC0 D 1 Rj 0 and j 0 D , so that 1 Rj 0 > , which implies that Rj 0 < 1 and
hence (since 1 < 1 and Rj 0 is a nonnegative integer) that Rj 0 D 0. And observe that (since
j 0 RjC0 > j 0 and since RjC0 D RjC0 1 C Rj 0 )
j0 1 RjC0 1 .Rj 0 1/ > .j 0 1/ C
and hence that
j0 1 RjC0 1 > .j 0 1/ C Rj 0 1 C :
Thus, Rj 0 D 0, since otherwise (i.e., if Rj 0 1), it would be the case that Rj 0 1 C > 0 and
hence that
j 0 1 RjC0 1 > .j 0 1/;
contrary to the definition of j 0 as the smallest integer j 2 I for which j RjC > j .
Turning now to the verification of result (7.51), we find that
j RjC Œj C 1
for every integer j 2 I for which j RjC > j , as is evident upon observing that (for j 2 I ) j RjC
is an integer and hence that j RjC > j implies that either j RjC D Œj C1 or j RjC > Œj C1
depending on whether j j RjC 1 or j < j RjC 1. Thus,
j 0 RjC0 Œj 0 C 1: (7.63)
Moreover, if j 0 RjC0 > Œj 0 C 1, it would (since Rj 0 D 0) be the case that
j 0 1 RjC0 1 > Œj 0 (7.64)
and hence [since both sides of inequality (7.64) are integers] that
j0 1 RjC0 1 Œj 0 C 1 > Œj 0 C 1 > j 0 D .j 0 1/;
contrary to the definition of j 0 as the smallest integer j 2 I for which j RjC > j . And it follows
that inequality (7.63) holds as an equality, that is,
458 Confidence Intervals (or Sets) and Tests of Hypotheses
j0 RjC0 D Œj 0 C 1:
Step-down procedures of a particular kind. If and when the first j 1 of the (ordered) null hypotheses
.0/ .0/ .0/ .0/
HQ ; HQ ; : : : ; HQ are rejected by the step-down procedure, the decision as to whether or not HQ
i1 i2 iM ij
is rejected is determined by whether or not jtQ.0/ j > ˛j and hence depends on the choice of ˛j . If
ij
.0/ .0/ .0/ .0/
(following the rejection of HQ ; HQ ; : : : ; HQ ) HQ is rejected, then (as of the completion of the
i1 i2 ij 1 ij
j th step of the step-down procedure) there are j total rejections (total discoveries) and the proportion
of those that are false rejections (false discoveries) equals k.j /=j, where k.j / represents the number
of null hypotheses among the j rejected null hypotheses that are true. Whether k.j /=j > or
k.j /=j [i.e., whether or not k.j /=j exceeds the prescribed upper bound] depends on whether
k.j / Œj C 1 or k.j / Œj . As discussed by Lehmann and Romano (2005a, p. 1147), that
suggests taking the step-down procedure to be of the form of a step-down procedure for controlling
the k-FWER at some specified level P and of taking k D Œj C 1 (in which case k varies with j ).
That line of reasoning leads (upon recalling the results of Subsection b) to taking ˛j to be of the
form
˛j D tN.ŒjC1/ P =f2.M CŒjC1 j /g .N P /: (7.65)
Taking (for j D 1; 2; : : : ; M ) ˛j to be of the form (7.65) reduces the task of choosing the ˛j ’s to
one of choosing P.
Note that if (for j D 1; 2; : : : ; M ) ˛j is taken to be of the form (7.65), then (for j 0 1)
˛j 0 D tNk 0 P =Œ2.M Ck 0 j 0 / .N P / tNk 0 P =.2MT / .N P /; (7.66)
as is evident upon recalling result (7.53).
Control of the probability of the FDP exceeding : a special case. Let † represent the correlation
matrix of the least squares estimators Oi (i 2 T ) of the MT estimable linear combinations i (i 2 T )
of the elements of ˇ. And suppose that i (i 2 T ) are linearly independent (in which case † is
nonsingular). Suppose further that † D I [in which case Oi (i 2 T ) are (in light of the normality
assumption) statistically independent] or, more generally, that there exists a diagonal matrix D
with diagonal elements of ˙1 for which all of the off-diagonal elements of the matrix D† 1D
are nonnegative. Then, the version of the step-down procedure obtained upon taking (for j D
1; 2; : : : ; M ) ˛j to be of the form (7.65) and upon setting P D is such that Pr.FDP > / .
Let us verify that this is the case. The verification makes use of an inequality known as the Simes
inequality. There are multiple versions of this inequality. The version best suited for present purposes
is expressible in the following form:
Pr.X.r/ > ar for 1 or more values of r 2 f1; 2; : : : ; ng/ .1=n/ nrD1 Pr.Xr > an /; (7.67)
P
where X1 ; X2 ; : : : ; Xn are absolutely continuous random variables whose joint distribution satisfies
the so-called PDS condition, where X.1/ ; X.2/ ; : : : ; X.n/ are the random variables whose values
are those obtained by ordering the values of X1 ; X2 ; : : : ; Xn from largest to smallest, and where
a1 ; a2 ; : : : ; an are any constants for which a1 a2 an 0 and for which .1=r/ Pr.Xj > ar /
is nondecreasing in r (for r D 1; 2; : : : ; n and for every j 2 f1; 2; : : : ; ng)—refer, e.g., to Sarkar
(2008, sec. 1).
Now, suppose that (for j D 1; 2; : : : ; M ) ˛j is of the form (7.65). To verify that the version of
the step-down procedure with ˛j ’s of this form is such that Pr.FDP > / when P D , it suffices
to verify that (for ˛j ’s of this form) condition (7.62) can be satisfied by taking P D . For purposes
of doing so, observe [in light of inequality (7.66)] that
Pr.jtrIT j > ˛r for 1 or more values of r 2 f1; 2; : : : ; Kg/
PrŒjtrIT j > tNr P =.2MT / .N P / for 1 or more values of r 2 f1; 2; : : : ; Kg
PrŒjtrIT j > tNr P =.2MT / .N P / for 1 or more values of r 2 f1; 2; : : : ; MT g: (7.68)
Multiple Comparisons and Simultaneous Confidence Intervals: Enhancements 459
Next, apply the Simes inequality (7.67) to the “rightmost” side of inequality (7.68), taking
n D MT and (for r D 1; 2; : : : ; MT ) taking ar D tNr P =.2MT / .N P / and taking Xr to be the rth
of the MT random variables jti j (i 2 T ). That the distributional assumptions underlying the Simes
inequality are satisfied in this application follows from Theorem 3.1 of Sarkar (2008). Moreover,
the assumption that .1=r/ Pr.Xj > ar / is nondecreasing in r is also satisfied, as is evident upon
observing that PrŒjtj j > tNr P =.2MT / .N P / D r P =MT . Thus, having justified the application of the
Simes inequality, we find that
PrŒjtrIT j > tNr P =.2MT / .N P / for 1 or more values of r 2 f1; 2; : : : ; MT g
.1=MT / i 2T PrŒjti j > tNMT P =.2MT / .N P / D MT P =MT D P : (7.69)
P
so that P P / D , cP . P / D 0;
`. (7.72)
where cP . P / is the upper 100 % point of the distribution of the random variable
maxr2f1;2;:::;ŒM C1g .jt.r/ j ˛r / when ˛1 D ˛P 1 . P /; ˛2 D ˛P 2 . P /; : : : ; ˛M D ˛P M . P /. Clearly,
cP . P / is an increasing function (of P ). Moreover, by making use of Monte Carlo methods of the kind
discussed by Edwards and Berry (1987), the draws from the distribution of t can be used to approx-
imate the values of cP . P / corresponding to the various values of P . Thus, by solving the equation
obtained from the equation cP . P / D 0 upon replacing the values of cP . P / with their approximations,
we can [in light of result (7.72)] obtain an approximation to the solution P to the equation `. P P / D .
where t.r/ D tir —unlike condition (7.76), this condition does not involve T. When the ˛j ’s satisfy
condition (7.76) or (7.77), the step-down procedure is such that Pr.FDP > / .
The analogue of taking ˛j to be of the form (7.65) is to take ˛j to be of the form
˛j D tN.ŒjC1/ P =.M CŒjC1 j / .N P /; (7.78)
where 0 < P < 1=2. When (for j D 1; 2; : : : ; M ) ˛j is taken to be of the form (7.78), it is the case
that (for j 0 1)
˛j 0 D tNk 0 P =.M Ck 0 j 0 / .N P / tNk 0 P =MT .N P / (7.79)
—this result is the analogue of result (7.66). And as before, results on controlling Pr.FDP > / at a
specified level are obtainable (under certain conditions) by making use of the Simes inequality.
Let † represent the correlation matrix of the least squares estimators Oi (i 2 T ) of the MT
estimable linear combinations i (i 2 T ) of the elements of ˇ. And suppose that i (i 2 T ) are linearly
independent (in which case † is nonsingular). Then, in the special case where (for j D 1; 2; : : : ; M )
˛j is taken to be of the form (7.78) and where P D < 1=2, the step-down procedure is such
that Pr.FDP > / provided that the off-diagonal elements of the correlation matrix † are
nonnegative—it follows from Theorem 3.1 of Sarkar (2008) that the distributional assumptions
needed to justify the application of the Simes inequality are satisfied when the off-diagonal elements
of † are nonnegative.
More generally, when (for j D 1; 2; : : : ; M ) ˛j is taken to be of the form (7.78), the value of P
needed to achieve control of Pr.FDP > / at a specified level can be determined “numerically” via
an approach similar to that described in a preceding part of the present subsection. Instead of taking
(for j D 1; 2; : : : ; M ) ˛Pj . P / to be the function of P whose values are those of expression (7.65), take
it to be the function whose values are those of expression (7.78). And take `. P P / to be the function of P
whose values are those of the left side of inequality (7.77) when (for j D 1; 2; : : : ; M ) ˛j D ˛Pj . P /.
Further, take P to be the solution (for P ) to the equation `. P P / D (if is sufficiently small that
a solution exists) or, more generally, take P to be a value of P small enough that `. P P / . Then,
upon taking (for j D 1; 2; : : : ; M ) ˛j D ˛Pj . P /, we obtain a version of the step-down procedure
for which Pr.FDP > / . It is worth noting that an “improvement” in the resultant procedure can
be achieved by introducing a modification of the ˛j ’s analogous to modification (7.75).
Nonnormality. The approach taken herein [in insuring that the step-down procedure is such that
Pr.FDP > / ] is based on insuring that the ˛j ’s satisfy condition (7.62) or condition (7.70). The
assumption (introduced at the beginning of Section 7.7) that the distribution of the vector e (of the
residual effects in the G–M model) is MVN can be relaxed without invalidating that approach.
Whether or not condition (7.62) is satisfied by any particular choice for the ˛j ’s is completely
determined by the distribution of the MT random variables ti (i 2 T ), and whether or not condition
(7.70) is satisfied is completely determined by the distribution of the M random variables ti (i D
1; 2; : : : ; M ). And each of the ti ’s is expressible as a linear combination of the elements of the vector
O 1 .˛O ˛/, as is evident from result (7.5). Moreover, O 1 .˛O ˛/ MV t.N P ; I/—refer to
˛O ˛
result (7.6)—not only when the distribution of the vector is MVN (as is the case when the
d
˛O ˛
distribution of e is MVN) but, more generally, when has an absolutely continuous spherical
d
distribution (as is the case when e has an absolutely continuous spherical distribution)—refer to
result (6.4.67). Thus, if the ˛j ’s satisfy condition (7.62) or condition (7.70) when the distribution of
e is MVN, then Pr.FDP > / not only in the case where the distribution of e is MVN but also
in various other cases.
multiple-comparison procedure that controls the FDR [i.e., controls E.FDP/] at a specified level.
And in doing so, let us make further use of various of the notation and terminology employed in the
preceding parts of the present section.
A step-up procedure: general form. Let ˛1 ; ˛2 ; : : : ; ˛M represent a nonincreasing sequence of
M strictly positive scalars (so that ˛1 ˛2 ˛M > 0). Then, corresponding to this
sequence, there is (as discussed in Subsection d) a step-down procedure for testing the M null
hypotheses Hi.0/ W i D i.0/ (i D 1; 2; : : : ; M ) [versus the alternative hypotheses Hi.1/ W i ¤ i.0/
(i D 1; 2; : : : ; M )]. There is also a step-up procedure for testing these M null hypotheses. Like the
step-down procedure, the step-up procedure rejects (for some integer J between 0 and M, inclusive)
the first J of the null hypotheses in the sequence HQ.0/ ; HQ.0/ ; : : : ; HQ.0/ [where iQ1 ; iQ2 ; : : : ; iQM is a
i1 i2 iM
.0/ .0/
permutation of the integers 1; 2; : : : ; M defined implicitly (wp1) by the inequalities jtQi j jtQi j
1 2
jtQ.0/ j]. Where it differs from the step-down procedure is in the choice of J . In the step-
iM
up procedure, J is taken to be the largest value of j for which jtQ.0/ j > ˛j —if jtQ.0/ j ˛j for
ij ij
j D 1; 2; : : : ; M, set J D 0. The step-up procedure can be visualized as a procedure in which the null
hypotheses are tested sequentially in the order HQ.0/ ; HQ.0/ ; : : : ; HQ.0/ ; by definition, jtQ.0/ j ˛M ;
iM iM 1 i1 iM
jtQ.0/ j ˛M .0/
1 ; : : : ; jtQ j ˛J C1 , and jtQ.0/ j > ˛J .
iM 1 iJ C1 iJ
.0/
Note that in contrast to the step-down procedure (where the choice of J is such that jtQi j > ˛j
j
for j D 1; 2; : : : ; J ), the choice of J in the step-up procedure is such that jtQ.0/ j does not necessarily
ij
exceed ˛j for every value of j J (though jtQ.0/ j does necessarily exceed ˛J for every value of
ij
j J ). Note also that (when the ˛j ’s are the same in both cases) the number of rejections produced
by the step-up procedure is at least as great as the number produced by the step-down procedure.
For j D 1; 2; : : : ; M, let
˛j0 D Pr.jtj > ˛j /; (7.80)
where t S t.N P /. In their investigation of the FDR as a criterion for evaluating and devising
multiple-comparison procedures, Benjamini and Hochberg (1995) proposed a step-up procedure in
which (as applied to the present setting) ˛1 ; ˛2 ; : : : ; ˛M are of the form defined implicitly by equality
(7.80) upon taking (for j D 1; 2; : : : ; M ) ˛j0 to be of the form
˛j0 D j P =M; (7.81)
where 0 < P < 1. Clearly, taking the ˛j ’s to be of that form is equivalent to taking them to be of the
form
˛j D tNj P =.2M / .N P / (7.82)
(j D 1; 2; : : : ; M ).
The FDR of a step-up procedure. Let Bi represent the event consisting of those values of the
vector t .0/ (with elements t1.0/ ; t2.0/ ; : : : ; tM
.0/
) for which Hi.0/ is rejected by the step-up procedure
(i D 1; 2; : : : ; M ). Further, let Xi represent (a random variable defined as follows:
1; if t .0/ 2 Bi ,
Xi D
0; if t .0/ … Bi .
Then, i 2T Xi equals the number of falsely rejected null hypotheses, and j 2I Xj equals the total
P P
number of rejections—recall that I D f1; 2; : : : ; M g and that T is the subset of I consisting of those
.0/
values of i 2 I for which Hi is true. And FDP > 0 only if j 2I Xj > 0, in which case
P
P
Xi
FDP D Pi 2T : (7.83)
j 2I Xj
Multiple Comparisons and Simultaneous Confidence Intervals: Enhancements 463
Observe [in light of expression (7.83)] that the false discovery rate is expressible as
Xi
where FDPi D P if Xj > 0 and FDPi D 0 if j 2I Xj D 0. Observe also that
P P
j 2I
j 2I Xj
.0/
FDPi > 0 only if t 2 Bi , in which case FDPi D 1= j 2I Xj , and that (for k D 1; 2; : : : ; M )
P
.0/
P
j 2I Xj D k , t 2 Ak ;
where Ak is the event consisting of those values of t .0/ for which exactly k of the M null hypotheses
H1.0/ ; H2.0/ ; : : : ; HM
.0/
are rejected by the step-up procedure. Accordingly,
FDR D i 2T M kD1 .1=k/ Pr t
.0/
(7.85)
P P
2 Bi \Ak :
Ak D ft .0/
W jtQ.0/ j ˛j .j D M; M 1; : : : ; k C1/I jtQ.0/ j > ˛k g: (7.86)
ij ik
Or, equivalently,
Ak D ft .0/ W max j D kg (7.87)
.0/
j 2I W jt Q j>˛j
ij
(k D M; M 1; : : : ; 1).
Now, for purposes of obtaining an expression for the FDR that is “more useful” than expres-
sion (7.85), let Si represent the (M 1)-dimensional subset of the set I D f1; 2; : : : ; M g obtained
upon deleting the integer i , and denote by t . i / the (M 1)-dimensional subvector of t .0/ ob-
tained upon striking out the i th element ti.0/. And recall that for any (nonempty) subset S of I,
iQ1 .S /; iQ2 .S /; : : : ; iQM
.S / is a permutation of the elements of S such that jtQ.0/
j jtQ.0/
j
S i1 .S/ i2 .S/
.0/
jtQ j; and observe that for i 2 I and for j 0 such that iQj 0 D i ,
iM .S/
S
8̂
< iQj .Si /;
for j D 1; 2; : : : ; j 0 1,
iQj D i; for j D j 0, (7.88)
:̂ Q
ij 1 .Si /; for j D j 0 C1; j 0 C2; : : : ; M,
and, conversely, (
iQj 1 ; for j D 2; 3; : : : ; j 0,
iQj 1 .Si / D (7.89)
iQj ; for j D j 0 C1; j 0 C2; : : : ; M.
. i/ .0/
Further, for i D 1; 2; : : : ; M, define AM I i D ft W jtQi j > ˛M g,
M 1 .Si /
AkI i D ft . i/
W jtQ.0/
j ˛j for j D M; M 1; : : : ; k C1I jtQ.0/
j > ˛k g
ij 1 .Si / ik 1 .Si /
(k D M 1; M 2; : : : ; 2), and
A1I i D ft . i/
W jtQ.0/
j ˛j for j D M; M 1; : : : ; 2g:
ij 1
.Si /
For i; k D 1; 2; : : : ; M, AkI i is interpretable in terms of the results that would be obtained if the
step-up procedure were applied to the M 1 null hypotheses H1.0/; H2.0/; : : : ; Hi.0/1 ; HiC1 .0/
; : : : ; HM.0/
(rather than to all M of the null hypotheses) with the role of ˛1 ; ˛2 ; : : : ; ˛M being assumed by
˛2 ; ˛3 ; : : : ; ˛M ; if t . i / 2 AkI i , then exactly k 1 of these M 1 null hypotheses would be rejected.
It can be shown and (in the next part of the present subsection) will be shown that for i; k D
1; 2; : : : ; M, .0/
t .0/ 2 Bi \Ak , ti 2 BkI i and t
. i/
2 AkI i ; (7.90)
where BkI i
D fti.0/ W jti.0/ j > ˛k g. And upon recalling result (7.85) and making use of relationship
464 Confidence Intervals (or Sets) and Tests of Hypotheses
[where ˛k0
is as defined by expression (7.80) and ti as defined by expression (7.3)].
Note that (for k D M; M 1; : : : ; 2)
AkI i D ft . i/
W max j D kg; (7.92)
.0/
j 2fM;M 1;:::;2g W jt Q j>˛j
i .S /
j 1 i
analogous to result (7.87). Note also that the sets A1I i ; A2I i ; : : : ; AM
I i are mutually disjoint and that
SM M 1
kD1 AkI i D R (7.93)
(i D 1; 2; : : : ; M ).
Verification of result (7.90). Suppose that t .0/ 2 Bi \Ak . Then,
jtQ.0/
j D jtQ.0/ j ˛j : (7.97)
ij 1
.Si / ij
contrary to the supposition (which implies that jti.0/ j > ˛k ). Then, making use of result (7.88), we
.0/ .0/ .0/ .0/
find that tQ D ti or tQ D tQ (depending on whether j 0 D k or j 0 < k) and that in either
ik ik ik 1 .Si /
case
jtQ.0/ j > ˛k ;
ik
and we also find that for j > k ( j 0 )
jtQ.0/ j D jtQ.0/
j ˛j ;
ij ij 1 .Si /
Multiple Comparisons and Simultaneous Confidence Intervals: Enhancements 465
leading to the conclusion that t .0/ 2 Ak and (since i D iQj 0 and j 0 k) to the further conclusion that
t .0/ 2 Bi and ultimately to the conclusion that t .0/ 2 Bi \ Ak .
Control of the FDR in a special case. For i D 1; 2; : : : ; M, let
Oi i.0/ Oi i
zi.0/ D and zi D :
Œ0i .X0 X/ i 1=2 0 0
Œi .X X/ i 1=2
And define z.0/ D .z1.0/; z2.0/; : : : ; zM .0/ 0
/ , and observe that zi.0/ D zi for i 2 T. Further, for i D
1; 2; : : : ; M, denote by z. i / the (M 1)-dimensional subvector of z.0/ obtained upon deleting the i th
element; and let u D =. O Then, ti.0/ D zi.0/=u, ti D zi =u, t .0/ D u 1 z.0/ , and t . i / D u 1 z. i / ;
u and z.0/ are statistically independent; and .N P /u2 2 .N P / and z.0/ N.; †/, where
is the M 1 vector with i th element i D Œ 2 0i .X0 X/ i 1=2 .i i.0/ / and † is the M M
matrix with ij th element 0i .X0 X/ j
ij D 0 0
Œi .X X/ i 1=2 Œ0j .X0 X/ j 1=2
—recall the results of Section 7.3a, including results (3.9) and (3.21).
Let f ./ represent the pdf of the N.0; 1/ distribution, and denote by h./ the pdf of the distribution
of the random variable u. Then, the conditional probability Pr t . i / 2 AkI i j jti j > ˛k , which appears
And upon substituting expression (7.100) in formula (7.99), we find that (in the special case where
ij D 0 for all j ¤ i )
Pr t . i / 2 AkI i j jti j > ˛k D E Pr u 1 z. i / 2 AkI i j u D Pr t . i / 2 AkI i : (7.101)
Now, suppose that ij D 0 for all i 2 T and j 2 I such that j ¤ i. Then, in light of result (7.101),
formula (7.91) for the FDR can be reexpressed as follows:
FDR D i 2T M 0
kD1 .˛k =k/ Pr t
. i/
2 AkI i : (7.102)
P P
Moreover, in the special case where (for j D 1; 2; : : : ; M ) ˛j is of the form (7.82) [and hence where
˛j0 is of the form (7.81)], expression (7.102) can be simplified. In that special case, we find (upon
recalling that the sets A1I i ; A2I i ; : : : ; AM
I i are mutually disjoint and that their union is R
M 1
) that
PM
FDR D i 2T . P =M / kD1 Pr t . i / 2 AkI i
P
D .MT =M / P Pr t . i / 2 RM 1
D .MT =M / P P : (7.103)
Based on inequality (7.103), we conclude that if ij D 0 for j ¤ i D 1; 2; : : : ; M (so that ij D 0
for all i 2 T and j 2 I such that j ¤ i regardless of the unknown identity of the set T ), then the FDR
can be controlled at level ı (in the sense that FDR ı) by taking (for j D 1; 2; : : : ; M ) ˛j to be of
466 Confidence Intervals (or Sets) and Tests of Hypotheses
the form (7.82) and by setting P D ı. The resultant step-up procedure is that proposed by Benjamini
and Hochberg (1995).
Note that when (for j D 1; 2; : : : ; M ) ˛j is taken to be of the form (7.82), the FDR could be
reduced by decreasing the value of P , but that the reduction in FDR would come at the expense of a
potential reduction in the number of rejections (discoveries)—there would be a potential reduction in
the number of rejections of false null hypotheses (number of true discoveries) as well as in the number
of true null hypotheses (number of false discoveries). Note also that the validity of results (7.102)
and (7.103) depends only on various characteristics of the distribution of the random variables zj
(j 2 T ) and u and on those random variables being distributed independently of the random variables
zj (j 2 F ); the (marginal) distribution of the random variables zj (j 2 F ) is “irrelevant.”
An extension. The step-up procedure and the various results on its FDR can be readily extended to
.0/ .0/ .1/ .0/
the case where the null and alternative hypotheses are either Hi W i i and Hi W i > i
.0/ .0/ .1/ .0/
(i D 1; 2; : : : ; M ) or Hi W i i and Hi W i < i (i D 1; 2; : : : ; M ) rather than
Hi.0/ W i D i.0/ and Hi.1/ W i ¤ i.0/ (i D 1; 2; : : : ; M ). Suppose, in particular, that the null and
alternative hypotheses are Hi.0/ W i i.0/ and Hi.1/ W i > i.0/ (i D 1; 2; : : : ; M ), so that the set T
of values of i 2 I for which Hi.0/ is true and the set F for which it is false are T D fi 2 I W i i.0/ g
and F D fi 2 I W i > i.0/ g. And consider the extension of the step-up procedure and of the various
results on its FDR to this case.
In regard to the procedure itself, it suffices to redefine the permutation iQ1 ; iQ2 ; : : : ; iQM and the
integer J : take iQ1 ; iQ2 ; : : : ; iQM to be the permutation (of the integers 1; 2; : : : ; M ) defined implicitly
(wp1) by the inequalities tQ.0/ tQ.0/ tQ.0/ , and take J to be the largest value of j for which
i1 i2 iM
tQ.0/ > ˛j —if tQ.0/ ˛j for j D 1; 2; : : : ; M, set J D 0. As before, ˛1 ; ˛2 ; : : : ; ˛M represents a
ij ij
nonincreasing sequence of scalars (so that ˛1 ˛2 ˛M ); however, unlike before, some or
all of the ˛j ’s can be negative.
In regard to the FDR of the step-up procedure, we find by proceeding in the same fashion
as in arriving at expression (7.91) and by redefining (for an arbitrary nonempty subset S of I )
iQ1 .S /; iQ2 .S /; : : : ; iQM
S
.S / to be a permutation of the elements of S for which
.0/ .0/ .0/
tQi .S/ tQi .S/ tQi ; (7.104)
1 2 MS .S/
for k D M 1; M 2; : : : ; 2 as
AkI i Dft . i/
W tQ.0/
˛j for j D M; M 1; : : : ; kC1I tQ.0/
> ˛k g; (7.106)
ij .S /
1 i
ik 1 .Si /
and for k D 1 as
A1I i D ft . i/
W tQ.0/
˛j for j D M; M 1; : : : ; 2g; (7.107)
ij 1
.Si /
by redefining BkI i (for k D 1; 2; : : : ; M ) as
.0/
BkI i D fti W ti.0/ > ˛k g; (7.108)
.0/
D i 2T M kD1 .1=k/ Pr ti > ˛k Pr t
. i/
2 AkI i j ti.0/ > ˛k
P P
i 2T M 0
kD1 .˛k =k/ Pr t
. i/
2 AkI i j ti.0/ > ˛k : (7.110)
P P
analogous to expression (7.92). Note also that subsequent to the redefinition of the sets
SM
A1I i ; A2I i ; : : : ; AM
I i , it is still the case that they are mutually disjoint and that
kD1 AkI i D R
M 1
.
Take u, zi , z. i /, f ./, and h./ to be as defined in the preceding part of the present subsection,
and take AkI i to be as redefined by expression (7.105), (7.106), or (7.107). Then, analogous to result
(7.99), we find that
Pr t . i / 2 AkI i j ti.0/ > ˛k
D Pr u 1 z. i / 2 AkI i j zi > u ˛k i
Z 1Z 1
Pr u 1 z. i / 2 AkI i j zi D z i ; u D u
D
0 u˛k i
f .z i /
dz h.u/ d u: (7.111)
Pr.zi > u ˛k i / i
Now, suppose that ij D 0 for all i 2 T and j 2 I such that j ¤ i. Then, by proceeding in
much the same fashion as in arriving at result (7.103), we find that in the special case where (for
j D 1; 2; : : : ; M ) ˛j0 [as redefined by expression (7.109)] is taken to be of the form (7.81) and hence
where (for j D 1; 2; : : : ; M ) ˛j D tNj P =M .N P /,
FDR .MT =M / P P:
Thus, as in the case of the step-up procedure for testing the null hypotheses Hi.0/ W i D i.0/
(i D 1; 2; : : : ; M ) when (for i D 1; 2; : : : ; M ) ˛j D tNj ı=.2M / .N P /, we find that if ij D 0 for
j ¤ i D 1; 2; : : : ; M , then in the special case where (for j D 1; 2; : : : ; M ) ˛j D tNj ı=M .N P /,
the step-up procedure for testing the null hypotheses Hi.0/ W i i.0/ (i D 1; 2; : : : ; M ) controls
the FDR at level ı (in the sense that FDR ı) and does so regardless of the unknown identity of the
set T.
Nonindependence. Let us consider further the step-up procedures for testing the null hypotheses
Hi.0/ W i D i.0/ (i D 1; 2; : : : ; M ) and for testing the null hypotheses Hi.0/ W i i.0/ (i D 1; 2;
: : : ; M ). Suppose that the ˛j ’s are those that result from taking (for j D 1; 2; : : : ; M ) ˛j0 to be of the
form ˛j0 D j P =M and from setting P D ı. If ij D 0 for j ¤ i D 1; 2; : : : ; M, then (as indicated in
the preceding 2 parts of the present subsection) the step-up procedures control the FDR at level ı.
To what extent does this property (i.e., control of the FDR at level ı) extend to cases where ij ¤ 0
for some or all j ¤ i D 1; 2; : : : ; M ?
Suppose that 1 ; 2 ; : : : ; M are linearly independent and that † is nonsingular (as would neces-
sarily be the case if ij D 0 for j ¤ i D 1; 2; : : : ; M ). Then, it can be shown (and subsequently will
be shown) that in the case of the step-up procedure for testing the null hypotheses Hi.0/ W i i.0/
(i D 1; 2; : : : ; M ) with ˛j0 D j P =M or, equivalently, ˛j D tNj P =M .N P / (for j D 1; 2; : : : ; M ),
ij 0 for all i 2 T and j 2 I such that j ¤ i ) FDR .MT =M / P : (7.112)
Thus, when ˛j D tNj P =M .N P / for j D 1; 2; : : : ; M and when P D ı, the step-up procedure for
.0/ .0/
testing the null hypotheses Hi W i i (i D 1; 2; : : : ; M ) controls the FDR at level ı (regardless
of the unknown identity of T ) provided that ij 0 for j ¤ i D 1; 2; : : : ; M.
468 Confidence Intervals (or Sets) and Tests of Hypotheses
Turning now to the case of the step-up procedure for testing the null hypotheses Hi.0/ W i D i.0/
(i D 1; 2; : : : ; M ), let †T represent the MT MT submatrix of † obtained upon striking out the
i th row and i th column for every i 2 F, and suppose that (for j D 1; 2; : : : ; M ) ˛j0 D j P =M or,
equivalently, ˛j D tNj P =.2M / .N P /. Then, it can be shown (and will be shown) that in this case,
the existence of an MT MT diagonal matrix DT with diagonal
elements of ˙1 for which all of the off-diagonal elements of the
matrix DT †T 1 DT are nonnegative, together with the condi-
tion ij D 0 for all i 2 T and j 2 F
) FDR .MT =M / P : (7.113)
Relationship (7.113) serves to define (for every T ) a collection of values of † for which FDR
.MT =M / P ; this collection may include values of † in addition to those for which ij D 0 for all
i 2 T and j 2 I such that j ¤ i. Note, however, that when T contains only a single member, say the
i th member, of the set f1; 2; : : : ; M g, this collection consists of those values of † for which ij D 0
for every j ¤ i . Thus, relationship (7.113) does not provide a basis for adding to the collection
of values of † for which the step-up procedure [for testing the null hypotheses Hi.0/ W i D i.0/
(i D 1; 2; : : : ; M ) with (for j D 1; 2; : : : ; M ) ˛j D tNj P =.2M / .N P / and with P D ı] controls the
FDR at level ı (regardless of the unknown value of T ).
In the case of the step-up procedure for testing the null hypotheses Hi.0/ W i D i.0/ (i D 1; 2;
.0/ .0/
: : : ; M ) or the null hypotheses Hi W i i (i D 1; 2; : : : ; M ) with the ˛j ’s chosen so that (for
0 0
j D 1; 2; : : : ; M ) ˛j is of the form ˛j D j P =M , control of the FDR at level ı can be achieved by
setting P D ı only when † satisfies certain relatively restrictive conditions. However, such control
can be achieved regardless of the value of † by setting P equal to a value that is sufficiently smaller
than ı. Refer to Exercise 31 for some specifics.
Verification of results (7.112) and (7.113). Suppose [for purposes of verifying result (7.112)] that
ij 0 for all i 2 T and j 2 I such that j ¤ i. And consider the function g.z. i / I k 0; u/ of z. i /
defined for each strictly positive scalar u and for each integer k 0 between 1 and M , inclusive, as
follows:
(
1; if uz. i / 2 kk 0 AkI i ,
S
. i/ 0
g.z I k ; u/ D
0; otherwise,
. i/
where AkI i is the set of t -values defined by expression (7.105), (7.106), or (7.107).
. i/
Clearly, the function g. I k 0; u/ is a nonincreasing function [in the sense that g.z2 I k 0; u/
. i/ . i/ . i/ . i/
g.z1 I k 0; u/ for any 2 values z1 and z2 of z. i / for which the elements of z2 are greater
than or equal to the corresponding elements of z.1 i / ]. And the distribution of z. i / conditional on
zi D z i is MVN with
0
E z. i / j zi D z i D . i / C . i / .z i i / and var z. i / j zi D z i D † . i / . i /. i / ;
And regard q.z i I k 0; u/ as a function of z i , and observe (in light of the statistical independence of u
and z.0/ ) that
q.z i I k 0; u/ D Pr u 1 z. i / 2 kk 0 AkI i j zi D z i D E g.z. i / I k 0; u/ j zi D z i :
S
Then, based on a property of the MVN distribution that is embodied in Theorem 5 of Müller (2001),
it can be deduced that q. I k 0; u/ is a nonincreasing function. Moreover, for “any” nonincreasing
function, say q.z i /, of z i (and for k 0 D 1; 2; : : : ; M 1),
Multiple Comparisons and Simultaneous Confidence Intervals: Enhancements 469
Z 1 Z 1
f .z i / f .z i /
q.z i / dz i q.z i / dz ; (7.114)
u˛k 0C1 i Pr.zi > u ˛k 0C1 i / u˛k 0 i Pr.zi > u ˛k 0 i / i
as can be readily verified. Thus,
Z 1
f .z i /
Pr u 1 z. i / 2 Ak 0C1I i j zi D z i ; u D u
dz
u˛k 0C1 i Pr.zi > u ˛k 0C1 i / i
Z 1 [ f .z i /
D Pr u 1 z. i / 2 AkI i j zi D z i ; u D u dz
u˛k 0C1 i 0C1
Pr.zi > u ˛k 0C1 i / i
Z 1 kk
[ f .z i /
Pr u 1 z. i / 2 AkI i j zi D z i ; u D u dz
u˛k 0C1 i Pr.zi > u ˛k 0C1 i / i
kk 0
1
f .z i /
Z [
Pr u 1 z. i / 2 AkI i j zi D z i ; u D u dz
u˛k 0C1 i 0
Pr.z i > u ˛k 0C1 i / i
Z 1 kk C1
[ f .z i /
Pr u z 1 . i/
2 AkI i j zi D z i ; u D u dz : (7.115)
u˛k 0 i 0
Pr.z i > u ˛k 0 i / i
kk
Finally, upon summing both “sides” of inequality (7.115) over k 0 (from 1 to M 1), we find that
M Z 1
X f .z i /
Pr u 1 z. i / 2 AkI i j zi D z i ; u D u
dz
u˛k i Pr.zi > u ˛k i / i
kD1
1
f .z i /
Z
Pr u 1 . i/
2 A1I i j zi D z i ; u D u
D z dz
u˛1 i Pr.zi > u ˛1 i / i
M
X1 Z 1 f .z i /
Pr u 1 . i/
2 Ak 0C1I i j zi D z i ; u D u
C z dz
u˛k 0C1 i Pr.zi > u ˛k 0C1 i / i
k 0 D1
1
f .z i /
Z [
Pr u 1 . i/
z 2 AkI i j zi D z i ; u D u dz
u˛M i Pr.zi > u ˛M i / i
kM
1
f .z i /
Z
D 1 dz D 1I
u˛M i Pr.zi > u ˛M i / i
so that to complete the verification of result (7.112), it remains only to observe [in light of expressions
(7.110) and (7.111)] that
k P =M
FDR i 2T M Pr t . i / 2 AkI i j ti.0/ > ˛k
P P
kD1
k
D . P =M / i 2T kD1 Pr t . i / 2 AkI i j ti.0/ > ˛k
P PM
P R1
. P =M / i 2T 0 1 h.u/ d u
D .MT =M / P :
Turning to the verification of result (7.113), suppose that there exists an MT MT diagonal
matrix DT with diagonal elements of ˙1 for which all of the off-diagonal elements of the matrix
DT †T 1 DT are nonnegative, and suppose in addition that ij D 0 for all i 2 T and j 2 F. Then,
Pr t . i / 2 AkI i j jti j > ˛k
Frequency
300
250
200
150
100
50
FIGURE 7.5. A display (in the form of a histogram with intervals of width 0:025) of the frequencies (among
the 6033 genes) of the various values of the Os ’s.
where z.0/
F is the MF 1 random vector whose elements are zj
.0/
(j 2 F ), where p./ is the pdf of the
.0/
distribution of zF , and where f ./ D 2f ./ is the pdf of the distribution of the absolute value of a
random variable that has an N.0; 1/ distribution. And whether or not u 1 z. i / 2 AkI i when z.0/
F D zF
and u D u is determined by the absolute values jzj j (j 2 T; j ¤ i ) of the MT 1 random variables zj
(j 2 T; j ¤ i ). Moreover, it follows from the results of Karlin and Rinott (1980, theorem 4.1; 1981,
theorem 3.1) that for “any” nonincreasing function gŒjzj j .j 2 T; j ¤ i / of the absolute values of
zj (j 2 T; j ¤ i /, the conditional expected value of gŒjzj j .j 2 T; j ¤ i / given that jzi j D z i is a
nonincreasing function of z i . Accordingly, result (7.113) can be verified by proceeding in much the
same way as in the verification of result (7.112).
f. An illustration
Let us use the example from Part 1 of Subsection c to illustrate various of the alternative multiple-
comparison procedures. In that example, the data consist of the expression levels obtained for 6033
genes on 102 men, 50 of whom were normal (control) subjects and 52 of whom were prostate cancer
patients. And the objective was presumed to be that of testing each of the 6033 null hypotheses
Hs.0/ W s D 0 (s D 1; 2; : : : ; 6033) versus the corresponding one of the alternative hypotheses
Hs.1/ W s ¤ 0 (s D 1; 2; : : : ; 6033), where s D s2 s1 represents the expected difference
(between the cancer patients and the normal subjects) in the expression level of the sth gene.
Assume that the subjects have been numbered in such a way that the first through 50th subjects
are the normal (control) subjects and the 51st through 102nd subjects are the cancer patients. And
for s D 1; 2; : : : ; 6033 and j D 1; 2; : : : ; 102, denote by ysj the random variable whose value is the
value obtained for the expression level of the sth gene on Pthe j th subject. Then, the least
P squares
estimator of s is Os D O s2 O s1 , where O s1 D .1=50/ j50D1 ysj and O s2 D .1=52/ j102 D51 ysj .
The values of O1 ; O2 ; : : : ; O6033 are displayed (in the form of a histogram) in Figure 7.5.
Multiple Comparisons and Simultaneous Confidence Intervals: Enhancements 471
Scaled
relative
frequency
2.5
2.0
1.5
1.0
0.5
FIGURE 7.6. A display (in the form of a histogram with intervals of width 0:04 that has been rescaled so that it
encloses an area equal to 1 and that has been overlaid with a plot of the pdf of the Nf 0:127; 1=50g
distribution) of the relative frequencies (among the 6033 genes) of the various values of the
log Os2 ’s.
Assume that ss 0 D 0 for s 0 ¤ s D 1; 2; : : : ; S, as would be the case if the results obtained
for each of the S genes are “unrelated” to those obtained for each of the others—this assumption is
consistent with Efron’s (2010) assumptions about these data. Further, for s D 1; 2; : : : ; S, write s2
for ss ; and take Os2 to be the unbiased estimator of s2 defined as
PN1 P 1 CN2
Os2 D O s1 /2 C jNDN .ysj O s2 /2 =.N1 CN2 2/;
j D1 .ysj 1 C1
where N1 D 50 and N2 D 52.
The results of Lehmann (1986, sec. 7.3), which are asymptotic in nature, suggest that (at least
in the case where the joint distribution of ys1 ; ys2 ; : : : ; ys;102 is MVN) the distribution of log Os2
2
can be approximated by the N log s ; 1=50 ŒD 2=.N1 C N2 2/ distribution. The values of
˚
log O12 ; log O22 ; : : : ; log O6033
2
are displayed in Figure 7.6 in the form of a histogram that has been
rescaled so that it encloses P area equal
an to 1 and that has been overlaid with a plot of the pdf of the
Nf 0:127 ŒD .1=6033/ 6033 sD1 log O
2
s ; 1=50g distribution. As is readily apparent from Figure 7.6,
it would be highly unrealistic to assume that 12 D 22 D S2 , that is, to assume that the variability
of the expression levels (from one normal subject to another or one cancer patient to another) is
the same for all S genes. The inappropriateness of any such assumption is reflected in the results
obtained upon applying various of the many procedures proposed for testing for the homogeneity of
472 Confidence Intervals (or Sets) and Tests of Hypotheses
^2 ’ s
σ s
2.5
2.0
1.5
1.0
0.5
variances, including that proposed by Hartley (1950) as well as that proposed by Lehmann (1986,
sec. 7.3).
Not only does the variance $\sigma_s^2$ (among the normal subjects or the cancer patients) of the expression levels of the $s$th gene appear to depend on $s$, but there appears to be a strong tendency for $\sigma_s^2$ to increase with the mean ($\mu_{s1}$ in the case of the normal subjects and $\mu_{s2}$ in the case of the cancer patients). Let $\hat\mu_s = (\hat\mu_{s1} + \hat\mu_{s2})/2$. Then, the tendency for $\sigma_s^2$ to increase with $\mu_{s1}$ or $\mu_{s2}$ is clearly evident in Figure 7.7, in which the values of the $\hat\sigma_s^2$'s are plotted against the values of the corresponding $\hat\mu_s$'s. In Figure 7.8, the values of the $\hat\sigma_s^2$'s are plotted against the values of the corresponding $\hat\delta_s$'s. This figure suggests that while $\sigma_s^2$ may vary to a rather considerable extent with $\mu_{s1}$ and $\mu_{s2}$ individually and with their average (and while small values of $\sigma_s^2$ may be somewhat more likely when $|\delta_s|$ is small), any tendency for $\sigma_s^2$ to vary with $\delta_s$ is relatively inconsequential.
Consider (in the context of the present application) the quantities $t_1, t_2, \ldots, t_M$ and $t_1^{(0)}, t_2^{(0)}, \ldots, t_M^{(0)}$ defined by equalities (7.3)—in the present application, $M = S$. The various multiple-comparison procedures described and discussed in Subsections a, b, d, and e depend on the data only through the (absolute) values of $t_1^{(0)}, t_2^{(0)}, \ldots, t_M^{(0)}$. And the justification for those procedures is based on their ability to satisfy criteria defined in terms of the distribution of the random vector $t = (t_1, t_2, \ldots, t_M)'$. Moreover, the vector $t$ is expressible in the form (7.5), so that its distribution is determined by the distribution of the random vector $\hat\sigma^{-1}(\hat\alpha - \alpha)$. When the observable random vector $y$ follows the G–M model and when in addition the distribution of the vector $e$ of residual effects is MVN (or, more generally, is any absolutely continuous spherical distribution), the distribution of $\hat\sigma^{-1}(\hat\alpha - \alpha)$ is $MV\text{-}t(N - P, I)$—refer to result (7.6).
FIGURE 7.8. [The values of the $\hat\sigma_s^2$'s plotted against the values of the corresponding $\hat\delta_s$'s.]
In the present application, the assumption (inherent in the G–M model) of the homogeneity of the variances of the residual effects appears to be highly unrealistic, and consequently the distribution of the vector $\hat\sigma^{-1}(\hat\alpha - \alpha)$ may differ appreciably from the $MV\text{-}t(N - P, I)$ distribution. Allowance can be made for the heterogeneity of the variances of the residual effects by redefining the $t_i$'s and the $t_i^{(0)}$'s (and by modifying the various multiple-comparison procedures accordingly).
For $s = 1, 2, \ldots, S$, redefine $t_s$ and $t_s^{(0)}$ as
$$t_s = \frac{\hat\delta_s - \delta_s}{[(1/50) + (1/52)]^{1/2}\,\hat\sigma_s} \quad\text{and}\quad t_s^{(0)} = \frac{\hat\delta_s}{[(1/50) + (1/52)]^{1/2}\,\hat\sigma_s} \tag{7.116}$$
(where $\hat\sigma_s$ represents the positive square root of $\hat\sigma_s^2$). Further, let $y_{s1} = (y_{s1}, y_{s2}, \ldots, y_{s,50})'$ and $y_{s2} = (y_{s,51}, y_{s,52}, \ldots, y_{s,102})'$, take $L_1$ to be any $50 \times 49$ matrix whose columns form an orthonormal basis for $N(1_{50}')$ and $L_2$ to be any $52 \times 51$ matrix whose columns form an orthonormal basis for $N(1_{52}')$, and assume that the 6033 101-dimensional vectors
$$\begin{pmatrix} (\hat\delta_s - \delta_s)/[(1/50) + (1/52)]^{1/2} \\ L_1' y_{s1} \\ L_2' y_{s2} \end{pmatrix} \qquad (s = 1, 2, \ldots, 6033)$$
are distributed independently and that each of them has an $N(0, \sigma_s^2 I_{101})$ distribution or, more generally, has an absolutely continuous spherical distribution with variance-covariance matrix $\sigma_s^2 I_{101}$. And observe that under that assumption, the random variables $t_1, t_2, \ldots, t_{6033}$ are statistically independent and each of them has an $S\text{-}t[100\ (= N_1 + N_2 - 2)]$ distribution—refer to the final part of Section 6.6.4d.
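For readers who wish to reproduce such statistics numerically, the following is a minimal sketch (in Python with NumPy; the array name Y and the function name are hypothetical, and the ordering of the columns—controls first—is an assumption of the example) of the computation of the $\hat\mu_{s1}$'s, $\hat\mu_{s2}$'s, $\hat\sigma_s^2$'s, and redefined $t_s^{(0)}$'s of (7.116).

    import numpy as np

    def t0_statistics(Y, n1=50, n2=52):
        # Y: S x (n1+n2) array; columns 0..n1-1 are assumed to hold the
        # normal (control) subjects, columns n1 onward the cancer patients.
        mu1 = Y[:, :n1].mean(axis=1)                  # muhat_{s1}
        mu2 = Y[:, n1:].mean(axis=1)                  # muhat_{s2}
        delta_hat = mu2 - mu1                         # deltahat_s
        ss = (((Y[:, :n1] - mu1[:, None]) ** 2).sum(axis=1)
              + ((Y[:, n1:] - mu2[:, None]) ** 2).sum(axis=1))
        sigma2_hat = ss / (n1 + n2 - 2)               # unbiased sigmahat_s^2
        se = np.sqrt((1 / n1 + 1 / n2) * sigma2_hat)
        return delta_hat, sigma2_hat, delta_hat / se  # last entry: t_s^(0)

Under the stated assumptions, the $t_s^{(0)}$'s of the "null" genes would resemble draws from the $S\text{-}t(100)$ distribution, as in Figure 7.9.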
FIGURE 7.9. A display (in the form of a histogram with intervals of width 0.143 that has been rescaled so that it encloses an area equal to 1 and that has been overlaid with a plot of the pdf of the $S\text{-}t(100)$ distribution) of the relative frequencies (among the 6033 genes) of the various values of the $t_s^{(0)}$'s.
In what follows, modified versions of the various multiple-comparison procedures [in which $t_s$ and $t_s^{(0)}$ have (for $s = 1, 2, \ldots, 6033$) been redefined by equalities (7.116)] are described and applied. It is worth noting that the resultant procedures do not take advantage of any "relationships" among the $\mu_{s1}$'s, $\mu_{s2}$'s, and $\sigma_s^2$'s of the kind reflected in Figures 7.5, 7.6, 7.7, and (to a lesser extent) 7.8. At least in principle, a more sophisticated model that reflects those relationships could be devised and could serve as a basis for constructing improved procedures—refer, e.g., to Efron (2010). And/or one could seek to transform the data in such a way that the assumptions underlying the various unmodified multiple-comparison procedures (including that of the homogeneity of the residual variances) are applicable—refer, e.g., to Durbin et al. (2002).
The values of the redefined $t_s^{(0)}$'s are displayed in Figure 7.9 in the form of a histogram that has been rescaled so that it encloses an area equal to 1 and that has been overlaid with a plot of the pdf of the $S\text{-}t(100)$ distribution. The genes with the most extreme $t^{(0)}$-values are listed in Table 7.4.
FWER and k-FWER. Let us consider the modification of the multiple-comparison procedures described in Subsections a and b for controlling the FWER or k-FWER. Among the quantities affected by the redefinition of the $t_i$'s (either directly or indirectly) are the following: the permutations $i_1, i_2, \ldots, i_M$ and [for any (nonempty) subset $S$ of $I$] $i_1(S), i_2(S), \ldots, i_{M_S}(S)$; the quantities $t_{(j)} = t_{i_j}$ ($j = 1, 2, \ldots, M$) and $t_{j;S} = t_{i_j(S)}$ ($j = 1, 2, \ldots, M_S$); the upper $100\dot\gamma\%$ point $c_{\dot\gamma}(j)$ of the distribution of $|t_{(j)}|$ and the upper $100\dot\gamma\%$ point $c_{\dot\gamma}(j; S)$ of the distribution of $|t_{j;S}|$; and (for
TABLE 7.4. The 100 most extreme among the values of the $t_s^{(0)}$'s obtained for the 6033 genes represented in the prostate data.
$$M_S = M - j + 1 + k - 1 = M + k - j.$$
Thus, expression (7.117) simplifies to the following expression:
$$\alpha_j = c_{\dot\gamma}(k; S) \quad\text{for any } S \subset I \text{ such that } M_S = M + k - j. \tag{7.118}$$
And in lieu of inequalities (7.14) and (7.43), we have (subsequent to the redefinition of the $t_i$'s) the inequalities
$$c_{\dot\gamma}(k) \le \bar t_{k\dot\gamma/(2M)}(100) \tag{7.119}$$
and (for $j \ge k$)
$$\alpha_j \le \bar t_{k\dot\gamma/[2(M+k-j)]}(100). \tag{7.120}$$
Based on these results, we can obtain suitably modified versions of the multiple-comparison procedures described in Subsections a and b for controlling the FWER or k-FWER. As a multiple-comparison procedure [for testing $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$] that controls the k-FWER at level $\dot\gamma$, we have the procedure that rejects $H_i^{(0)}$ if and only if $y \in C_i$, where
$$C_i = \{y : |t_i^{(0)}| > c_{\dot\gamma}(k)\} \tag{7.121}$$
(and where the definition of $c_{\dot\gamma}(k)$ is in terms of the distribution of the redefined $t_i$'s). And as a less computationally intensive but more "conservative" variation on this procedure, we have the procedure obtained by replacing $c_{\dot\gamma}(k)$ with the upper bound $\bar t_{k\dot\gamma/(2M)}(100)$, that is, the procedure obtained by taking
$$C_i = \{y : |t_i^{(0)}| > \bar t_{k\dot\gamma/(2M)}(100)\} \tag{7.122}$$
rather than taking $C_i$ to be the set (7.121). Further, suitably modified versions of the step-down procedures (described in Subsection b) for controlling the k-FWER at level $\dot\gamma$ are obtained by setting $\alpha_1 = \alpha_2 = \cdots = \alpha_k$ and (for $j = k, k+1, \ldots, M$) taking $\alpha_j$ to be as in equality (7.118) [where the definition of $c_{\dot\gamma}(k; S)$ is in terms of the redefined $t_i$'s] and taking the replacement for $\alpha_j$ (in the more conservative of the step-down procedures) to be $\bar t_{k\dot\gamma/[2(M+k-j)]}(100)$. As before, the computation of $c_{\dot\gamma}(k)$ and $c_{\dot\gamma}(k; S)$ is amenable to the use of Monte Carlo methods.
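As an illustration of such a Monte Carlo computation, the sketch below (a minimal Python implementation written for this setting under the stated independence assumption; the function name and default draw count are illustrative, not from the text) approximates $c_{\dot\gamma}(k)$, the upper $100\dot\gamma\%$ point of the distribution of the $k$th largest of $|t_1|, \ldots, |t_M|$, with the $t_i$'s taken to be independent $S\text{-}t(100)$ variables.

    import numpy as np

    def c_gamma_k(M, k, gammadot, df=100, ndraws=20_000, rng=None):
        # Upper 100*gammadot% point of the kth largest of |t_1|,...,|t_M|,
        # the t_i's being independent Student-t(df) random variables.
        rng = np.random.default_rng(rng)
        kth = np.empty(ndraws)
        for i in range(ndraws):
            abs_t = np.abs(rng.standard_t(df, size=M))
            kth[i] = np.partition(abs_t, M - k)[M - k]   # kth largest value
        return np.quantile(kth, 1.0 - gammadot)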
The modified versions of the various procedures for controlling the k-FWER were applied to the prostate data. Results were obtained for $k = 1, 2, \ldots, 20$ and for three different choices for $\dot\gamma$ ($\dot\gamma = .05$, $.10$, and $.20$). The requisite values of $c_{\dot\gamma}(k)$ and $c_{\dot\gamma}(k; S)$ were determined by Monte Carlo methods from 149,999 draws (from the joint distribution of the redefined $t_i$'s). The number of rejected null hypotheses (discoveries), along with the values of $c_{\dot\gamma}(k)$ and $\bar t_{k\dot\gamma/(2M)}(100)$, is listed (for each of the various combinations of $k$- and $\dot\gamma$-values) in Table 7.5.
The results clearly indicate that (at least in this kind of application and at least for k > 1) the
adoption of the more conservative versions of the procedures for controlling the k-FWER can result
in a drastic reduction in the total number of rejections (discoveries). Also, while the total number of
rejections (discoveries) increases with the value of k, it appears to do so at a mostly decreasing rate.
Controlling the probability of the FDP exceeding a specified level. Consider the modification of the step-down procedure described in Subsection d for controlling $\Pr(\mathrm{FDP} > \Delta)$ (where $\Delta$ is a
TABLE 7.6. The number of rejected null hypotheses (discoveries) obtained (for each of 4 values of $\Delta$ and each of 3 values of $\ddot\gamma$) upon applying the step-down multiple-comparison procedure for controlling $\Pr(\mathrm{FDP} > \Delta)$ at level $\ddot\gamma$ (along with the value of $\dot\gamma^*$) when (for $j = 1, 2, \ldots, M$) $\alpha_j$ is taken to be of the form (7.123) and when $\dot\gamma$ is set equal to $\dot\gamma^*$ and the value of $\dot\gamma^*$ taken to be an approximate value determined by Monte Carlo methods.
specified constant) at a specified level $\ddot\gamma$. And for purposes of doing so, continue to take $t_s$ and $t_s^{(0)}$ (for $s = 1, 2, \ldots, S$) to be as redefined by equalities (7.116). By employing essentially the same reasoning as before (i.e., in Subsection d), it can be shown that the step-down procedure controls $\Pr(\mathrm{FDP} > \Delta)$ at level $\ddot\gamma$ if the $\alpha_j$'s satisfy condition (7.62), where now the random variable $t_{r;S}$ is (for any $S \subset I$, including $S = T$) defined in terms of the redefined $t_s$'s. Further, the same line of reasoning that led before to taking $\alpha_j$ to be of the form (7.65) leads now to taking $\alpha_j$ to be of the form
$$\alpha_j = \bar t_{([\Delta j]+1)\dot\gamma/\{2(M + [\Delta j] + 1 - j)\}}(100) \tag{7.123}$$
($j = 1, 2, \ldots, M$), following which the problem of choosing the $\alpha_j$'s so as to satisfy condition (7.62) is reduced to that of choosing $\dot\gamma$. And to insure that $\alpha_j$'s of the form (7.123) satisfy condition (7.62) and hence that the step-down procedure controls $\Pr(\mathrm{FDP} > \Delta)$ at level $\ddot\gamma$, it suffices to take $\dot\gamma = \dot\gamma^*$, where $\dot\gamma^*$ is the solution to the equation $\ddot c(\dot\gamma) = 0$ and where $\ddot c(\cdot)$ is a modified (to account for the redefinition of the $t_s$'s) version of the function $\ddot c(\cdot)$ defined in Subsection d.
The step-down procedure [in which (for $j = 1, 2, \ldots, M$) $\alpha_j$ was taken to be of the form (7.123) and $\dot\gamma$ was set equal to $\dot\gamma^*$] was applied to the prostate data. Results were obtained for four different values of $\Delta$ ($\Delta = .05$, $.10$, $.20$, and $.50$) and three different values of $\ddot\gamma$ ($\ddot\gamma = .05$, $.10$, and $.20$). The value of $\dot\gamma^*$ was taken to be that of the approximation provided by the solution to the equation $\check c(\dot\gamma) = 0$, where $\check c(\cdot)$ is a function whose values are approximations to the values of $\ddot c(\cdot)$ determined by Monte Carlo methods from 149,999 draws. For each of the various combinations of $\Delta$- and $\ddot\gamma$-values, the number of rejected null hypotheses (discoveries) is reported in Table 7.6 along with the value of $\dot\gamma^*$. It is worth noting that the potential improvement that comes from resetting various of the $\alpha_j$'s [so as to achieve conformance with condition (7.74)] did not result in any additional rejections in any of these cases.
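The mechanics of the procedure can be made concrete with a short sketch (Python; a generic step-down scheme written under the stated assumptions—the text itself supplies no code, and the function names are hypothetical): the $\alpha_j$'s of (7.123) are computed as upper t quantiles, and hypotheses are rejected in order of decreasing $|t^{(0)}|$ until the first failure.

    import numpy as np
    from scipy.stats import t as student_t

    def alphas_7_123(M, Delta, gammadot, df=100):
        # alpha_j of the form (7.123), with k_j = [Delta*j] + 1.
        j = np.arange(1, M + 1)
        kj = np.floor(Delta * j).astype(int) + 1
        tail = kj * gammadot / (2.0 * (M + kj - j))
        return student_t.ppf(1.0 - tail, df)          # upper 100*tail% points

    def step_down(abs_t0, alphas):
        # Reject hypotheses in order of decreasing |t^(0)| for as long as
        # each ordered statistic exceeds the corresponding alpha_j.
        order = np.argsort(abs_t0)[::-1]
        r = 0
        for j in range(len(order)):
            if abs_t0[order[j]] > alphas[j]:
                r = j + 1
            else:
                break
        return order[:r]                              # indices of rejections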
An improvement. In the present setting [where (for $s = 1, 2, \ldots, S$) $t_s$ and $t_s^{(0)}$ have been redefined as in equalities (7.116)], the $t_s$'s are statistically independent and the $t_s^{(0)}$'s are statistically independent. By taking advantage of the statistical independence, we can attempt to devise improved versions of the step-down procedure for controlling $\Pr(\mathrm{FDP} > \Delta)$. In that regard, recall (from Subsection d) that
$$\Pr(\mathbf{j}' \ge 1,\ \mathbf{k}' \le M_T,\ \text{and } |t_{\mathbf{k}';T}| > \alpha_{\mathbf{j}'}) \le \ddot\gamma \ \Rightarrow\ \Pr(\mathrm{FDP} > \Delta) \le \ddot\gamma \tag{7.124}$$
and that $\mathbf{k}' = [\Delta\mathbf{j}'] + 1$ for values of $\mathbf{j}' \ge 1$. And observe that
$$\begin{aligned}
\Pr(\mathbf{j}' \ge 1,\ \mathbf{k}' \le M_T,\ \text{and } |t_{\mathbf{k}';T}| > \alpha_{\mathbf{j}'})
&= \Pr(\mathbf{j}' \ge 1 \text{ and } \mathbf{k}' \le M_T)\,\Pr(|t_{\mathbf{k}';T}| > \alpha_{\mathbf{j}'} \mid \mathbf{j}' \ge 1 \text{ and } \mathbf{k}' \le M_T) \\
&\le \Pr(|t_{\mathbf{k}';T}| > \alpha_{\mathbf{j}'} \mid \mathbf{j}' \ge 1 \text{ and } \mathbf{k}' \le M_T) \\
&= \sum \Pr(|t_{k';T}| > \alpha_{j'} \mid \mathbf{j}' = j')\,\Pr(\mathbf{j}' = j' \mid \mathbf{j}' \ge 1 \text{ and } \mathbf{k}' \le M_T), \tag{7.125}
\end{aligned}$$
TABLE 7.7. The number of rejected null hypotheses (discoveries) obtained (for each of 4 values of $\Delta$ and each of 3 values of $\ddot\gamma$) upon applying the step-down multiple-comparison procedure for controlling $\Pr(\mathrm{FDP} > \Delta)$ at level $\ddot\gamma$ when (for $j = 1, 2, \ldots, M$) $\alpha_j$ is as specified by equality (7.129) or, alternatively, as specified by equality (7.130).
where the summation is with respect to $j'$ and is over the set $\{j' : j' \ge 1 \text{ and } [\Delta j'] + 1 \le M_T\}$.
For $s \in T$, $t_s = t_s^{(0)}$; and (by definition) the values of the random variables $\mathbf{j}'$ and $\mathbf{k}'$ are completely determined by the values of the $M_F$ random variables $t_s^{(0)}$ ($s \in F$). Thus, in the present setting (where the $t_s^{(0)}$'s are statistically independent) we find that (for $j'$ such that $j' \ge 1$ and $[\Delta j'] + 1 \le M_T$)
$$\Pr(|t_{k';T}| > \alpha_{j'} \mid \mathbf{j}' = j') = \Pr(|t_{k';T}| > \alpha_{j'}), \tag{7.126}$$
where $k' = [\Delta j'] + 1$. Moreover,
$$\Pr[|t_{k';T}| > c_{\ddot\gamma}(k'; S)] \le \ddot\gamma \tag{7.127}$$
for any $S \subset I$ such that $T \subset S$ and hence (when, as in the present setting, the $t_s$'s are statistically independent) for any $S \subset I$ such that $M_S \ge M_T$.
Now, recall [from result (7.53)] that
$$M + k' - j' \ge M_T. \tag{7.128}$$
And observe that, together, results (7.125), (7.126), (7.127), and (7.128) imply that the condition $\Pr(\mathbf{j}' \ge 1,\ \mathbf{k}' \le M_T,\ \text{and } |t_{\mathbf{k}';T}| > \alpha_{\mathbf{j}'}) \le \ddot\gamma$ and hence [in light of result (7.124)] the condition $\Pr(\mathrm{FDP} > \Delta) \le \ddot\gamma$ can be satisfied by taking (for $j = 1, 2, \ldots, M$)
$$\alpha_j = c_{\ddot\gamma}(k_j; S) \quad\text{for any } S \subset I \text{ such that } M_S = M + k_j - j, \tag{7.129}$$
where $k_j = [\Delta j] + 1$.
Resort to Monte Carlo methods may be needed to effect the numerical evaluation of expression (7.129). A more conservative but less computationally intensive version of the step-down procedure for controlling $\Pr(\mathrm{FDP} > \Delta)$ at level $\ddot\gamma$ can be effected [on the basis of inequality (7.120)] by taking (for $j = 1, 2, \ldots, M$)
$$\alpha_j = \bar t_{k_j\ddot\gamma/[2(M + k_j - j)]}(100). \tag{7.130}$$
The number of rejections (discoveries) resulting from the application of the step-down procedure (to the prostate data) was determined for the case where (for $j = 1, 2, \ldots, M$) $\alpha_j$ is as specified by equality (7.129) and also for the case where $\alpha_j$ is as specified by equality (7.130). The values of the $c_{\ddot\gamma}(k_j; S)$'s (required in the first of the 2 cases) were taken to be approximate values determined by Monte Carlo methods from 149,999 draws. The number of rejections was determined for each of four values of $\Delta$ ($\Delta = .05$, $.10$, $.20$, and $.50$) and for each of three values of $\ddot\gamma$ ($\ddot\gamma = .05$, $.10$, and $.20$). The results are presented in Table 7.7.
Both the results presented in Table 7.6 and those presented in the right half of Table 7.7 are for cases where (for $j = 1, 2, \ldots, M$) $\alpha_j$ is of the form (7.123). The latter results are those obtained when $\dot\gamma = \ddot\gamma$, and the former those obtained when $\dot\gamma = \dot\gamma^*$—$\dot\gamma^*$ depends on $\Delta$ as well as on $\ddot\gamma$. For $\dot\gamma^* < \ddot\gamma$, the number of rejections (discoveries) obtained when $\dot\gamma = \ddot\gamma$ is at least as great as the number obtained when $\dot\gamma = \dot\gamma^*$. In the application to the prostate data, there are three combinations of $\Delta$- and
$\ddot\gamma$-values for which $\ddot\gamma$ exceeds $\dot\gamma^*$ by a substantial amount (that where $\Delta = .05$ and $\ddot\gamma = .10$, that where $\Delta = .05$ and $\ddot\gamma = .20$, and that where $\Delta = .10$ and $\ddot\gamma = .20$). When $\Delta = .05$ and $\ddot\gamma = .20$, $\dot\gamma^* = .05$, and setting $\dot\gamma = \ddot\gamma$ rather than $\dot\gamma = \dot\gamma^*$ results in an additional 7 rejections (discoveries); similarly, when $\Delta = .10$ and $\ddot\gamma = .20$, $\dot\gamma^* = .10$, and setting $\dot\gamma = \ddot\gamma$ rather than $\dot\gamma = \dot\gamma^*$ results in an additional 6 rejections.
Based on the results presented in Table 7.7, it appears that the difference between the number of rejections produced by the step-down procedure in the case where (for $j = 1, 2, \ldots, M$) $\alpha_j$ is as specified by equality (7.129) and the number produced in the case where $\alpha_j$ is as specified by equality (7.130) can be either negligible or extremely large. This difference tends to be larger for the larger values of $\Delta$ and also tends to be larger for the larger values of $\ddot\gamma$.
Controlling the FDR. Consider the multiple-comparison procedure obtained upon modifying the step-up procedure described in Subsection e (for controlling the FDR) so as to reflect the redefinition of the $t_j$'s and $t_j^{(0)}$'s. The requisite modifications include that of taking the distribution of the random variable $t$ in definition (7.80) (of $\alpha_j^0$) to be $S\text{-}t(100)$ rather than $S\text{-}t(N - P)$. As a consequence of this modification, taking $\alpha_j^0$ to be of the form (7.81) leads to taking $\alpha_j$ to be of the form
$$\alpha_j = \bar t_{j\dot\gamma/(2M)}(100), \tag{7.131}$$
rather than of the form (7.82).
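For concreteness, here is a hedged sketch (Python; the function name is illustrative, and this is a generic step-up scheme rather than a transcription of anything in the text) of a step-up procedure with $\alpha_j$'s of the form (7.131): the $j_*$ hypotheses with the most extreme $|t^{(0)}|$'s are rejected, $j_*$ being the largest $j$ whose $j$th most extreme $|t^{(0)}|$ exceeds $\alpha_j$.

    import numpy as np
    from scipy.stats import t as student_t

    def step_up(abs_t0, gammadot, df=100):
        # Step-up procedure with alpha_j as in (7.131): reject the j* most
        # extreme hypotheses, j* being the LARGEST j whose jth most extreme
        # |t^(0)| exceeds alpha_j = tbar_{j*gammadot/(2M)}(df).
        M = len(abs_t0)
        j = np.arange(1, M + 1)
        alphas = student_t.ppf(1.0 - j * gammadot / (2.0 * M), df)
        order = np.argsort(abs_t0)[::-1]
        hits = np.nonzero(abs_t0[order] > alphas)[0]
        jstar = hits[-1] + 1 if hits.size > 0 else 0
        return order[:jstar]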
The FDR of the modified step-up procedure is given by expression (7.91) (suitably reinterpreted to reflect the redefinition of the $t_j$'s and $t_j^{(0)}$'s). Moreover, subsequent to the redefinition of the $t_j$'s and $t_j^{(0)}$'s, the $t_j^{(0)}$'s are statistically independent and hence (since $t_j^{(0)} = t_j$ for $j \in T$)
$$\Pr\big[t^{(-i)} \in A_{k;-i} \,\big|\, |t_i| > \alpha_k\big] = \Pr\big[t^{(-i)} \in A_{k;-i}\big].$$
It is of interest to compare these results with the results presented in the preceding part of the present subsection, which are those obtained from the application of the step-down procedure for controlling $\Pr(\mathrm{FDP} > \Delta)$ at level $\ddot\gamma$. In that regard, it can be shown that (for $j = 1, 2, \ldots, M$ and $0 < \Delta < 1$)
$$\bar t_{j\dot\gamma/(2M)}(100) \le \bar t_{k_j\dot\gamma/[2(M + k_j - j)]}(100), \tag{7.132}$$
where $k_j = [\Delta j] + 1$—refer to Exercise 32. Thus, the application of a step-up procedure in which (for $j = 1, 2, \ldots, M$) $\alpha_j$ is of the form (7.131) and $\dot\gamma = \ddot\gamma$ produces at least as many rejections (discoveries) as the application of a step-down procedure in which $\alpha_j$ is as specified by equality (7.130)—even if the $\alpha_j$'s were the same in both cases, the application of the step-up procedure would result in at least as many rejections as the application of the step-down procedure.
In the application of these two stepwise procedures to the prostate data, the step-up procedure produced substantially more rejections than the step-down procedure and did so even for $\Delta$ as large as $.50$—refer to the entries in the right half of Table 7.7. It is worth noting that (in the case of the prostate data) the step-up procedure produced no more rejections than would have been produced
by a step-down procedure with the same $\alpha_j$'s, so that (in this case) the difference in the number of rejections produced by the two stepwise procedures is due entirely to the differences between the $\alpha_j$'s.
The number of rejections (discoveries) produced by the step-up procedure [that in which $\alpha_j$ is of the form (7.131) and $\dot\gamma = \ddot\gamma$] may or may not exceed the number produced by the step-down procedure in which $\alpha_j$ is as specified by equality (7.129) rather than by equality (7.130). When (in the application to the prostate data) $\Delta = .50$ or when $\Delta = .10$ and $\ddot\gamma = .20$, this step-down procedure [which, like that in which $\alpha_j$ is specified by equality (7.130), controls $\Pr(\mathrm{FDP} > \Delta)$ at level $\ddot\gamma$] produced substantially more rejections than the step-up procedure (which controls the FDR at level $\dot\gamma$)—refer to the entries in the left half of Table 7.7.
7.8 Prediction
The prediction of the realization of an unobservable random variable or vector by a single observable
“point” was discussed in Section 5.10, both in general terms and in the special case where the relevant
information is the information provided by an observable random vector that follows a G–M, Aitken
or general linear model. Let us extend the discussion of prediction to include predictions that take
the form of intervals or sets and to include tests of hypotheses about the realizations of unobservable
random variables.
Specifically, suppose that (while insufficient to determine the conditional distribution of $w$ given $y$) knowledge about the joint distribution is sufficient to form an unbiased (point) predictor $\tilde w(y)$ (of the realization of $w$) with a prediction error, say $e = \tilde w(y) - w$, whose (unconditional) distribution is known—clearly, $E(e) = 0$. For example, depending on the state of knowledge, $\tilde w(y)$ might be
$$\tilde w(y) = E(w \mid y), \tag{8.3}$$
$$\tilde w(y) = \mu + V_{yw}' V_y^{-1} y, \tag{8.4}$$
where $\mu = \mu_w - V_{yw}' V_y^{-1}\mu_y$, or
$$\tilde w(y) = \tilde\mu(y) + V_{yw}' V_y^{-1} y, \tag{8.5}$$
where $\tilde\mu(y)$ is a function of $y$ with an expected value of $\mu$—refer to Section 5.10. Then, the knowledge about the joint distribution of $w$ and $y$ may be insufficient to obtain a prediction set that satisfies condition (8.1). However, a prediction set $S(y)$ that satisfies condition (8.2) can be obtained by taking
$$S(y) = \{w : w = \tilde w(y) - e,\ e \in S_*\}, \tag{8.6}$$
where $S_*$ is any set of $M \times 1$ vectors for which
$$\Pr[\tilde w(y) - w \in S_*] = 1 - \dot\gamma. \tag{8.7}$$
Let us refer to any prediction set that satisfies condition (8.2) as a $100(1-\dot\gamma)\%$ prediction set. In repeated sampling from the joint distribution of $w$ and $y$, the value of $w$ would be included in a $100(1-\dot\gamma)\%$ prediction set $100(1-\dot\gamma)$ percent of the time.
HPD prediction sets. Suppose that (for "every" value of $y$) the conditional distribution of $w$ given $y$ is known and is absolutely continuous with pdf $f(\cdot \mid y)$. Among the various choices for a set $S(y)$ that satisfies condition (8.1) is a set of the form
$$S(y) = \{w : f(w \mid y) \ge k\}, \tag{8.8}$$
where $k$ is a (strictly) positive constant. In a Bayesian framework, a set of the form (8.8) that satisfies condition (8.1) would be referred to as a $100(1-\dot\gamma)\%$ HPD (highest posterior density) credible set. In the present setting, let us refer to such a set as a $100(1-\dot\gamma)\%$ HPD prediction set.
A $100(1-\dot\gamma)\%$ HPD prediction set has the following property: among all choices for the set $S(y)$ that satisfy the condition
$$\Pr[w \in S(y) \mid y] \ge 1 - \dot\gamma, \tag{8.9}$$
it is the smallest, that is, it minimizes the quantity
$$\int_{S(y)} dw. \tag{8.10}$$
Let us verify that a $100(1-\dot\gamma)\%$ HPD prediction set has this property. For purposes of doing so, take $\phi(\cdot)$ to be a function defined (for $M \times 1$ vectors $w$) as follows:
$$\phi(w) = \begin{cases} 1, & \text{if } w \in S(y), \\ 0, & \text{if } w \notin S(y). \end{cases}$$
In the special case where $M = 1$, the $100(1-\dot\gamma)\%$ HPD prediction set is reexpressible as the interval
$$\{w : \mu(y) - \bar z_{\dot\gamma/2}(v_w - v_{yw}' V_y^{-1} v_{yw})^{1/2} \le w \le \mu(y) + \bar z_{\dot\gamma/2}(v_w - v_{yw}' V_y^{-1} v_{yw})^{1/2}\}, \tag{8.14}$$
where (in this special case) $\mu(y) = \mu + v_{yw}' V_y^{-1} y$ (with $\mu = \mu_w - v_{yw}' V_y^{-1}\mu_y$) and where (for $0 < \alpha < 1$) $\bar z_\alpha$ represents the upper $100\alpha\%$ point of the $N(0,1)$ distribution. The prediction interval (8.14) satisfies condition (8.1); among all prediction intervals that satisfy condition (8.1), it is the shortest. Among the other prediction intervals (in the special case where $M = 1$) that satisfy condition (8.1) are
$$\{w : -\infty < w \le \mu(y) + \bar z_{\dot\gamma}(v_w - v_{yw}' V_y^{-1} v_{yw})^{1/2}\} \tag{8.15}$$
and
$$\{w : \mu(y) - \bar z_{\dot\gamma}(v_w - v_{yw}' V_y^{-1} v_{yw})^{1/2} \le w < \infty\}. \tag{8.16}$$
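As a numerical illustration of interval (8.14), the following sketch (Python; every moment value below is made up solely for the example) computes $\mu(y)$ and the end points for a case with $N = 2$ and $M = 1$.

    import numpy as np
    from scipy.stats import norm

    # Illustrative (made-up) joint-moment specification:
    mu_y = np.array([1.0, 2.0]); mu_w = 0.5            # assumed means
    V_y  = np.array([[2.0, 0.5], [0.5, 1.0]])          # var(y)
    v_yw = np.array([0.8, 0.3])                        # cov(y, w)
    v_w  = 1.5                                         # var(w)

    y = np.array([1.4, 1.7])                           # observed y
    gdot = 0.10                                        # gammadot

    a = np.linalg.solve(V_y, v_yw)                     # V_y^{-1} v_yw
    mu_of_y = mu_w + a @ (y - mu_y)                    # mu(y)
    sd = np.sqrt(v_w - v_yw @ a)                       # (v_w - v_yw' V_y^{-1} v_yw)^{1/2}
    z = norm.ppf(1 - gdot / 2)                         # zbar_{gammadot/2}
    print(mu_of_y - z * sd, mu_of_y + z * sd)          # end points of (8.14)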
Alternatively, suppose that knowledge about the joint distribution of $w$ and $y$ is limited to that obtained from knowing the values of the vector $\mu = \mu_w - V_{yw}' V_y^{-1}\mu_y$ and the matrix $V_{yw}' V_y^{-1}$ and from knowing the distribution of the random vector $e = \tilde w(y) - w$, where $\tilde w(y) = \mu + V_{yw}' V_y^{-1} y$. And observe that regardless of the form of the distribution of $e$, it is the case that
$$\operatorname{cov}(y, e) = 0 \quad\text{and}\quad \operatorname{var}(e) = V_w - V_{yw}' V_y^{-1} V_{yw}$$
[and that $E(e) = 0$]. Further, suppose that the (unconditional) distribution of $e$ is MVN, which (in light of the supposition that the distribution of $e$ is known) implies that the matrix $V_w - V_{yw}' V_y^{-1} V_{yw}$ is known.
If the joint distribution of $w$ and $y$ were MVN, then $e$ would be distributed independently of $y$ [and hence independently of $\tilde w(y)$], in which case the conditional distribution of $\tilde w(y) - e$ given $y$ [or given $\tilde w(y)$] would be identical to the conditional distribution of $w$ given $y$—both conditional distributions would be $N[\tilde w(y),\ V_w - V_{yw}' V_y^{-1} V_{yw}]$—and any prediction set of the form (8.6) would satisfy condition (8.1) as well as condition (8.2). More generally, a prediction set of the form (8.6) would satisfy condition (8.2) [and hence would qualify as a $100(1-\dot\gamma)\%$ prediction set], but would not necessarily satisfy condition (8.1). In particular, the prediction set (8.13) and (in the special case where $M = 1$) prediction intervals (8.14), (8.15), and (8.16) would satisfy condition (8.2); however, aside from special cases (like that where the joint distribution of $w$ and $y$ is MVN), they would not necessarily satisfy condition (8.1).
Simultaneous prediction intervals. Let us consider further the case where, conditionally on $y$, the distribution of the vector $w = (w_1, w_2, \ldots, w_M)'$ is $N[\mu(y),\ V_w - V_{yw}' V_y^{-1} V_{yw}]$, with $\mu(y) = \mu + V_{yw}' V_y^{-1} y$ and with $\mu = \mu_w - V_{yw}' V_y^{-1}\mu_y$, and where the value of $\mu$ and of the matrices $V_{yw}' V_y^{-1}$ and $V_w - V_{yw}' V_y^{-1} V_{yw}$ are known. Interest in this case might include the prediction of the realizations of some or all of the random variables $w_1, w_2, \ldots, w_M$ or, more generally, the realizations of some or all linear combinations of these random variables.
Predictive inference for the realization of any particular linear combination of the random variables $w_1, w_2, \ldots, w_M$, say for that of the linear combination $\delta'w$ (where $\delta \ne 0$), might take the form of one of the following intervals:
$$\{w : \delta'\mu(y) - \bar z_{\dot\gamma/2}[\delta'(V_w - V_{yw}' V_y^{-1} V_{yw})\delta]^{1/2} \le \delta'w \le \delta'\mu(y) + \bar z_{\dot\gamma/2}[\delta'(V_w - V_{yw}' V_y^{-1} V_{yw})\delta]^{1/2}\}, \tag{8.17}$$
$$\{w : -\infty < \delta'w \le \delta'\mu(y) + \bar z_{\dot\gamma}[\delta'(V_w - V_{yw}' V_y^{-1} V_{yw})\delta]^{1/2}\}, \tag{8.18}$$
or
$$\{w : \delta'\mu(y) - \bar z_{\dot\gamma}[\delta'(V_w - V_{yw}' V_y^{-1} V_{yw})\delta]^{1/2} \le \delta'w < \infty\}. \tag{8.19}$$
Each of these intervals has a probability of coverage equal to $1 - \dot\gamma$. However, when prediction intervals are obtained for each of a number of linear combinations,
it is often the case that those linear combinations identified with the “most extreme” intervals receive
the most attention. “One-at-a-time” prediction intervals like intervals (8.17), (8.18), and (8.19) do
not account for any such identification and, as a consequence, their application can sometimes lead
to erroneous conclusions. In the case of intervals (8.17), (8.18), and (8.19), this potential pitfall can
be avoided by introducing modifications that convert these one-at-a-time intervals into prediction
intervals that provide for control of the probability of simultaneous coverage.
Accordingly, let $\mathcal{D}$ represent a finite or infinite collection of (nonnull) $M$-dimensional column vectors. And suppose that we wish to obtain a prediction interval for the realization of each of the linear combinations $\delta'w$ ($\delta \in \mathcal{D}$). Suppose further that we wish for the intervals to be such that the probability of simultaneous coverage equals $1 - \dot\gamma$. Such intervals can be obtained by taking (for each $\delta \in \mathcal{D}$) the interval for the realization of $\delta'w$ to be a modified version of interval (8.17), (8.18), or (8.19); the requisite modification consists of introducing a suitable replacement for $\bar z_{\dot\gamma/2}$ or $\bar z_{\dot\gamma}$.
Let $R$ represent an $M \times M$ nonsingular matrix such that $V_w - V_{yw}' V_y^{-1} V_{yw} = R'R$—upon observing [in light of Corollary 2.13.33 and result (2.5.5)] that $V_w - V_{yw}' V_y^{-1} V_{yw}$ is a symmetric positive definite matrix, the existence of the matrix $R$ follows from Corollary 2.13.29. Further, let $z = (R^{-1})'[\mu(y) - w]$, so that $z \sim N(0, I)$ (both conditionally on $y$ and unconditionally) and (for every nonnull $M \times 1$ vector $\delta$)
$$\frac{\delta'\mu(y) - \delta'w}{[\delta'(V_w - V_{yw}' V_y^{-1} V_{yw})\delta]^{1/2}} = \frac{(R\delta)'z}{[(R\delta)'R\delta]^{1/2}} \sim N(0, 1). \tag{8.20}$$
And take the replacement for $\bar z_{\dot\gamma/2}$ in interval (8.17) to be the upper $100\dot\gamma\%$ point of the distribution of the random variable
$$\max_{\delta \in \mathcal{D}} \frac{|(R\delta)'z|}{[(R\delta)'R\delta]^{1/2}}. \tag{8.21}$$
Similarly, take the replacement for $\bar z_{\dot\gamma}$ in intervals (8.18) and (8.19) to be the upper $100\dot\gamma\%$ point of the distribution of the random variable
$$\max_{\delta \in \mathcal{D}} \frac{(R\delta)'z}{[(R\delta)'R\delta]^{1/2}}. \tag{8.22}$$
Then, as is evident from result (8.20), the prediction intervals obtained for the realizations of the linear combinations $\delta'w$ ($\delta \in \mathcal{D}$) upon the application of the modified version of interval (8.17), (8.18), or (8.19) are such that the probability of simultaneous coverage equals $1 - \dot\gamma$ (both conditionally on $y$ and unconditionally). In fact, when the unconditional distribution of $\mu(y) - w$ is MVN, the unconditional probability of simultaneous coverage of the prediction intervals obtained upon the application of the modified version of interval (8.17), (8.18), or (8.19) would equal $1 - \dot\gamma$ even if the conditional distribution of $w$ given $y$ differed from the $N[\mu(y),\ V_w - V_{yw}' V_y^{-1} V_{yw}]$ distribution.
When $\mathcal{D} = \{\delta \in R^M : \delta \ne 0\}$, the upper $100\dot\gamma\%$ point of the distribution of the random variable (8.21) equals $[\bar\chi^2_{\dot\gamma}(M)]^{1/2}$, and $\delta'w$ is contained in the modified version of interval (8.17) for every $\delta \in \mathcal{D}$ if and only if $w$ is contained in the set (8.13), as can be verified by proceeding in much the same way as in Section 7.3c in the verification of some similar results. When the members of $\mathcal{D}$ consist of the columns of the $M \times M$ identity matrix, the linear combinations $\delta'w$ ($\delta \in \mathcal{D}$) consist of the $M$ random variables $w_1, w_2, \ldots, w_M$.
For even moderately large values of $M$, a requirement that the prediction intervals achieve simultaneous coverage with a high probability can be quite severe. A less stringent alternative would be to require (for some integer $k$ greater than 1) that, with a high probability, no more than $k$ of the intervals fail to cover. Thus, in modifying interval (8.17) for use in obtaining prediction intervals for all of the linear combinations $\delta'w$ ($\delta \in \mathcal{D}$), we could replace $\bar z_{\dot\gamma/2}$ with the upper $100\dot\gamma\%$ point of the distribution of the $k$th largest of the random variables $|(R\delta)'z|/[(R\delta)'R\delta]^{1/2}$ ($\delta \in \mathcal{D}$), rather than with the upper $100\dot\gamma\%$ point of the distribution of the largest of these random variables. Similarly, in modifying interval (8.18) or (8.19), we could replace $\bar z_{\dot\gamma}$ with the upper $100\dot\gamma\%$ point of the distribution of the $k$th largest (rather than the largest) of the random variables $(R\delta)'z/[(R\delta)'R\delta]^{1/2}$ ($\delta \in \mathcal{D}$). In either case [and in the case of the distribution of the random variable (8.21) or (8.22)], the upper $100\dot\gamma\%$ point of the relevant distribution could be determined numerically via Monte Carlo methods.
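When $\mathcal{D}$ is finite, the requisite percentage point—of the largest ($k = 1$) or the $k$th largest of the variables in question—can be approximated as in the following sketch (Python; the function name, draw count, and the convention that the rows of the array Deltas hold the $\delta$'s are assumptions of the example).

    import numpy as np

    def simult_z_constant(R, Deltas, gammadot, k=1, ndraws=20_000, rng=None):
        # Upper 100*gammadot% point of the kth largest (k = 1: the largest)
        # of |(R d)'z| / [(R d)'(R d)]^{1/2} over the rows d of Deltas,
        # with z ~ N(0, I); k = 1 gives the replacement constant for (8.21).
        rng = np.random.default_rng(rng)
        U = Deltas @ R.T                               # row i holds (R d_i)'
        U /= np.linalg.norm(U, axis=1, keepdims=True)  # scale by ||R d_i||
        n = U.shape[0]
        stats = np.empty(ndraws)
        for i in range(ndraws):
            vals = np.abs(U @ rng.standard_normal(R.shape[0]))
            stats[i] = np.partition(vals, n - k)[n - k]
        return np.quantile(stats, 1.0 - gammadot)

Dropping the absolute value gives the one-sided constant corresponding to (8.22).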
Hypothesis tests. Let $S_0$ represent a (nonempty but proper) subset of the set $R^M$ of all $M$-dimensional column vectors, and let $S_1$ represent the complement of $S_0$, that is, the subset of $R^M$ consisting of all $M$-dimensional column vectors other than those in $S_0$ (so that $S_0 \cap S_1$ is the empty set, and $S_0 \cup S_1 = R^M$). And consider the problem of testing the null hypothesis $H_0 : w \in S_0$ versus the alternative hypothesis $H_1 : w \in S_1$—note that $w \in S_1 \Leftrightarrow w \notin S_0$.
Let $\phi(\cdot)$ represent the critical function of a (nonrandomized) test of $H_0$ versus $H_1$, in which case $\phi(\cdot)$ is of the form
$$\phi(y) = \begin{cases} 1, & \text{if } y \in A, \\ 0, & \text{if } y \notin A, \end{cases}$$
where $A$ is a (nonempty but proper) subset of $R^N$ known as the critical region, and $H_0$ is rejected if and only if $\phi(y) = 1$. Or, more generally, let $\phi(\cdot)$ represent the critical function of a possibly randomized test, in which case $0 \le \phi(y) \le 1$ (for $y \in R^N$) and $H_0$ is rejected with probability $\phi(y)$ when the observed value of the data vector is $y$. Further, let
$$\pi_0 = \Pr(w \in S_0) \quad\text{and}\quad \pi_1 = \Pr(w \in S_1)\ (= 1 - \pi_0),$$
and take $p_0(\cdot)$ and $p_1(\cdot)$ to be functions defined (on $R^N$) as follows:
$$p_0(y) = \Pr(w \in S_0 \mid y) \quad\text{and}\quad p_1(y) = \Pr(w \in S_1 \mid y)\ [= 1 - p_0(y)].$$
Now, suppose that the joint distribution of $w$ and $y$ is such that the (marginal) distribution of $y$ is known and is absolutely continuous with pdf $f(\cdot)$ and is such that $p_0(\cdot)$ is known [in which case $\pi_0$, $\pi_1$, and $p_1(\cdot)$ would also be known, and $\pi_0$ and $\pi_1$ would constitute what in a Bayesian framework would be referred to as prior probabilities and $p_0(y)$ and $p_1(y)$ what would be referred to as posterior probabilities]. Suppose further that $0 < \pi_0 < 1$ (in which case $0 < \pi_1 < 1$)—if $\pi_0 = 0$ or $\pi_0 = 1$, then a test for which the probability of an error of the first kind (i.e., of falsely rejecting $H_0$) and the probability of an error of the second kind (i.e., of falsely accepting $H_0$) are both 0 could be achieved by taking $\phi(y) = 1$ for every value of $y$ or by taking $\phi(y) = 0$ for every value of $y$. Note that when the distribution of $w$ is absolutely continuous, the supposition that $\pi_0 > 0$ rules out the case where $S_0$ consists of a single point—refer, e.g., to Berger (2002, sec. 4.3.3) for some related discussion. Further, take $f_0(\cdot)$ and $f_1(\cdot)$ to be functions defined (on $R^N$) as follows: for $y \in R^N$,
$$f_0(y) = p_0(y)f(y)/\pi_0 \quad\text{and}\quad f_1(y) = p_1(y)f(y)/\pi_1$$
—when (as is being implicitly assumed herein) the function $p_0(\cdot)$ is "sufficiently well-behaved," the conditional distribution of $y$ given that $w \in S_0$ is absolutely continuous with pdf $f_0(\cdot)$ and the conditional distribution of $y$ given that $w \in S_1$ is absolutely continuous with pdf $f_1(\cdot)$. And observe that the probability of rejecting the null hypothesis $H_0$ when $H_0$ is true is expressible as
$$E[\phi(y) \mid w \in S_0] = \int_{R^N} \phi(y)\,f_0(y)\,dy. \tag{8.23}$$
Upon applying a version of the Neyman–Pearson lemma stated by Lehmann and Romano (2005b, sec. 3.2) in the form of their theorem 3.2.1, we find that there exists a critical function $\phi_*(\cdot)$ defined by taking
$$\phi_*(y) = \begin{cases} 1, & \text{when } f_1(y) > k f_0(y), \\ c, & \text{when } f_1(y) = k f_0(y), \\ 0, & \text{when } f_1(y) < k f_0(y), \end{cases} \tag{8.25}$$
and by taking $c$ and $k$ ($0 \le c \le 1$, $0 \le k < \infty$) to be constants for which
larger) than the specified level. Let us consider an alternative approach in which the probability of falsely rejecting one or more of the null hypotheses (the FWER) or, more generally, the probability of falsely rejecting $k$ or more of the null hypotheses (the k-FWER) is controlled at a specified level, say $\dot\gamma$. Such an approach can be devised by making use of the results of the preceding part of the present subsection and by invoking the so-called closure principle (e.g., Bretz, Hothorn, and Westfall 2011, sec. 2.2.3; Efron 2010, p. 38), as is to be demonstrated in what follows.
Let $z = z(y)$ represent an $N_*$-dimensional column vector whose elements are (known) functions of $y$—e.g., $z(y) = y$ or (at the other extreme) $z(y) = \tilde w(y)$, where $\tilde w(y)$ is an unbiased (point) predictor of the realization of the vector $w = (w_1, w_2, \ldots, w_M)'$. And suppose that the (marginal) distribution of $z$ is known and is absolutely continuous with pdf $f(\cdot)$.
Now, let $\mathcal{C}_k$ represent the collection of all subsets of $I = \{1, 2, \ldots, M\}$ of size $k$ or greater, and let $\mathcal{C}_{kj}$ represent the collection consisting of those subsets of $I$ of size $k$ or greater that include the integer $j$. Further, for $I_* \in \mathcal{C}_k$, let
$$S(I_*) = \{w = (w_1, w_2, \ldots, w_M)' \in R^M : w_j \in S_j^{(0)} \text{ for } j \in I_*\};$$
and suppose that (for $I_* \in \mathcal{C}_k$) the joint distribution of $w$ and $z$ is such that $\Pr[w \in S(I_*) \mid z]$ is a known function of the value of $z$—in which case $\Pr[w \in S(I_*)]$ would be known—and is such that $0 < \Pr[w \in S(I_*)] < 1$. And (for $I_* \in \mathcal{C}_k$) take $\phi(\cdot\,; I_*)$ to be the critical function (defined on $R^{N_*}$) of the most-powerful $\dot\gamma$-level procedure for testing the null hypothesis $H_0(I_*) : w \in S(I_*)$ [versus the alternative hypothesis $H_1(I_*) : w \notin S(I_*)$] on the basis of $z$; this procedure is that obtained upon applying [with $S_0 = S(I_*)$] the generalized version of the result of the preceding part of the present subsection (i.e., the version of the result in which the choice of a test procedure is restricted to procedures that depend on $y$ only through the value of $z$). Then, the k-FWER (and, in the special case where $k = 1$, the FWER) can be controlled at level $\dot\gamma$ by employing a (multiple-comparison) procedure in which (for $j = 1, 2, \ldots, M$) $H_j^{(0)} : w_j \in S_j^{(0)}$ is rejected (in favor of $H_j^{(1)} : w_j \notin S_j^{(0)}$) if and only if for every $I_* \in \mathcal{C}_{kj}$ the null hypothesis $H_0(I_*) : w \in S(I_*)$ is rejected by the $\dot\gamma$-level test with critical function $\phi(\cdot\,; I_*)$.
Let us verify that this procedure controls the k-FWER at level $\dot\gamma$. Define
$$T = \{j \in I : H_j^{(0)} \text{ is true}\} = \{j \in I : w_j \in S_j^{(0)}\}.$$
Further, denote by $R_T$ the subset of $T$ defined as follows: for $j \in T$, $j \in R_T$ if $H_j^{(0)}$ is among the null hypotheses rejected by the multiple-comparison procedure. And (denoting by $M_S$ the size of an arbitrary set $S$) suppose that $M_{R_T} \ge k$, in which case $M_T \ge k$ and it follows from the very definition of the multiple-comparison procedure that the null hypothesis $H_0(T)$ is rejected by the $\dot\gamma$-level test with critical function $\phi(\cdot\,; T)$. Thus,
$$\Pr(M_{R_T} \ge k) \le \dot\gamma$$
(so that the multiple-comparison procedure controls the k-FWER at level $\dot\gamma$, as was to be verified).
For some discussion (in the context of the special case where $k = 1$) of shortcut procedures for achieving an efficient implementation of this kind of multiple-comparison procedure, refer, for example, to Bretz, Hothorn, and Westfall (2011).
False discovery proportion: control of the probability of its exceeding a specified constant. Let us consider further the multiple-comparison procedure (described in the preceding part of the present subsection) for testing the null hypotheses $H_j^{(0)}$ ($j = 1, 2, \ldots, M$); that procedure controls the k-FWER at level $\dot\gamma$. Denote by $R_k$ the total number of null hypotheses rejected by the multiple-comparison procedure—previously (in Section 7.7d) that symbol was used to denote something else—and denote by $R_{Tk}$ (rather than simply by $R_T$, as in the preceding part of the present subsection) the number of true null hypotheses rejected by the procedure. Further, define
$$\mathrm{FDP}_k = \begin{cases} R_{Tk}/R_k, & \text{if } R_k > 0, \\ 0, & \text{if } R_k = 0; \end{cases}$$
this quantity represents what (in the present context) constitutes the false discovery proportion—refer to Section 7.7c.
For any scalar $\Delta$ in the interval $(0, 1)$, the multiple-comparison procedure can be used to control the probability $\Pr(\mathrm{FDP}_k > \Delta)$ at level $\dot\gamma$. As is to be shown in what follows, such control can be achieved by making a judicious choice for $k$ (a choice based on the value of the observable random vector $z$).
Upon observing that $R_k \ge R_{Tk}$, we find that
$$\mathrm{FDP}_k > \Delta \ \Leftrightarrow\ R_{Tk} > \Delta R_k \ \Leftrightarrow\ R_{Tk} \ge [\Delta R_k] + 1 \tag{8.30}$$
(where for any real number $x$, $[x]$ denotes the largest integer that is less than or equal to $x$). Moreover, for any $k$ for which $k \le [\Delta R_k] + 1$,
$$R_{Tk} \ge [\Delta R_k] + 1 \ \Rightarrow\ \text{the null hypothesis } H_0(T) \text{ is rejected by the } \dot\gamma\text{-level test with critical function } \phi(\cdot\,; T). \tag{8.31}$$
The quantity $[\Delta R_k] + 1$ is nondecreasing in $k$ and is bounded from below by 1 and from above by $[\Delta M] + 1$. Let $K = K(z)$ represent the largest value of $k$ for which $k \le [\Delta R_k] + 1$ (or, equivalently, the largest value for which $k = [\Delta R_k] + 1$). Then, it follows from results (8.30) and (8.31) that
$$\Pr[\mathrm{FDP}_{K(z)} > \Delta] \le \dot\gamma. \tag{8.32}$$
Thus, by taking $k = K$, the multiple-comparison procedure for controlling the k-FWER at level $\dot\gamma$ can be configured to control (at the same level $\dot\gamma$) the probability of the false discovery proportion exceeding $\Delta$. Moreover, it is also the case—refer to Exercise 28—that when $k = K$, the false discovery rate $E(\mathrm{FDP}_{K(z)})$ of this procedure is controlled at level $\delta = \dot\gamma + (1 - \dot\gamma)\Delta$.
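The choice of $K$ reduces to a few lines of code; in the sketch below (Python; num_rejections is a hypothetical function returning $R_k$ for the k-FWER-controlling procedure, assumed to be available), $K$ is the largest $k$ with $k \le [\Delta R_k] + 1$.

    def choose_K(num_rejections, M, Delta):
        # Largest k in {1,...,M} with k <= [Delta * R_k] + 1, where
        # R_k = num_rejections(k) is the (assumed available) number of
        # rejections made by the k-FWER-controlling procedure.
        K = 1
        for k in range(1, M + 1):
            if k <= int(Delta * num_rejections(k)) + 1:
                K = k
        return K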
$$\{w : -\infty < \delta'w \le \delta'\tilde w(y) + \bar t_{\dot\gamma}\,(\delta' H_*\delta)^{1/2}\hat\sigma\}, \tag{8.53}$$
or
$$\{w : \delta'\tilde w(y) - \bar t_{\dot\gamma}\,(\delta' H_*\delta)^{1/2}\hat\sigma \le \delta'w < \infty\}. \tag{8.54}$$
It follows from the results of the preceding part of the present subsection that when considered in "isolation," interval (8.52), (8.53), or (8.54) has a probability of coverage equal to $1 - \dot\gamma$. However, when such an interval is obtained for the realization of each of a number of linear combinations, the probability of all of the intervals covering (or even that of all but a "small number" of the intervals covering) can be much less than $1 - \dot\gamma$.
Accordingly, let $\mathcal{D}$ represent a finite or infinite collection of (nonnull) $M$-dimensional column vectors. And suppose that we wish to obtain a prediction interval for the realization of each of the linear combinations $\delta'w$ ($\delta \in \mathcal{D}$) and that we wish to do so in such a way that the probability of simultaneous coverage equals $1 - \dot\gamma$ or, more generally, that the probability of coverage by all but at most some number $k$ of the intervals equals $1 - \dot\gamma$. Such intervals can be obtained by taking (for each $\delta \in \mathcal{D}$) the interval for the realization of $\delta'w$ to be a modified version of interval (8.52), (8.53), or (8.54) in which the constant $\bar t_{\dot\gamma/2}$ or $\bar t_{\dot\gamma}$ is replaced by a larger constant.
Let $R_*$ represent an $M \times M$ nonsingular matrix such that $H_* = R_*'R_*$. Further, let
$$z = (1/\sigma)(R_*^{-1})'[\tilde w(y) - w] \quad\text{and}\quad v = (1/\sigma^2)\,y'(I - P_X)y,$$
so that $z \sim N(0, I)$, $v \sim \chi^2(N - P)$, $z$ and $v$ are statistically independent, and (for every nonnull $M \times 1$ vector $\delta$)
$$\frac{\delta'\tilde w(y) - \delta'w}{(\delta' H_*\delta)^{1/2}\,\hat\sigma} = \frac{(R_*\delta)'z}{[(R_*\delta)'R_*\delta]^{1/2}\sqrt{v/(N-P)}} \sim S\text{-}t(N - P) \tag{8.55}$$
—result (8.55) is analogous to result (8.20). Then, prediction intervals for the realizations of the random variables $\delta'w$ ($\delta \in \mathcal{D}$) having a probability of simultaneous coverage equal to $1 - \dot\gamma$ can be obtained from the application of a modified version of interval (8.52) in which $\bar t_{\dot\gamma/2}$ is replaced by the upper $100\dot\gamma\%$ point of the distribution of the random variable
$$\max_{\delta \in \mathcal{D}} \frac{|(R_*\delta)'z|}{[(R_*\delta)'R_*\delta]^{1/2}\sqrt{v/(N-P)}} \tag{8.56}$$
or from the application of a modified version of interval (8.53) or (8.54) in which $\bar t_{\dot\gamma}$ is replaced by the upper $100\dot\gamma\%$ point of the distribution of the random variable
$$\max_{\delta \in \mathcal{D}} \frac{(R_*\delta)'z}{[(R_*\delta)'R_*\delta]^{1/2}\sqrt{v/(N-P)}}. \tag{8.57}$$
More generally, for $k \ge 1$, prediction intervals for which the probability of coverage by all but at most $k$ of the intervals equals $1 - \dot\gamma$ can be obtained from the application of a modified version of interval (8.52) in which $\bar t_{\dot\gamma/2}$ is replaced by the upper $100\dot\gamma\%$ point of the distribution of the $k$th largest of the random variables $|(R_*\delta)'z|\big/\{[(R_*\delta)'R_*\delta]^{1/2}\sqrt{v/(N-P)}\}$ ($\delta \in \mathcal{D}$) or from the application of a modified version of interval (8.53) or (8.54) in which $\bar t_{\dot\gamma}$ is replaced by the upper $100\dot\gamma\%$ point of the distribution of the $k$th largest of the random variables $(R_*\delta)'z\big/\{[(R_*\delta)'R_*\delta]^{1/2}\sqrt{v/(N-P)}\}$ ($\delta \in \mathcal{D}$). The resultant intervals are analogous to those devised in Part 5 of Subsection a by employing a similar approach. And as in the case of the latter intervals, the requisite percentage points could be determined via Monte Carlo methods.
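For a finite collection $\mathcal{D}$, a Monte Carlo approximation to the percentage point appearing in (8.56) (or in its $k$th-largest variant) might look as follows (Python; a sketch under the stated distributional assumptions, with illustrative names; it parallels the earlier sketch for (8.21), with the extra $\sqrt{v/(N-P)}$ divisor).

    import numpy as np

    def simult_t_constant(Rstar, Deltas, gammadot, dof, k=1, ndraws=20_000, rng=None):
        # Upper 100*gammadot% point of the kth largest of
        # |(R* d)'z| / (||R* d|| * sqrt(v/dof)), z ~ N(0, I), v ~ chi2(dof);
        # k = 1 gives the constant replacing tbar_{gammadot/2} in (8.56).
        rng = np.random.default_rng(rng)
        U = Deltas @ Rstar.T
        U /= np.linalg.norm(U, axis=1, keepdims=True)
        n = U.shape[0]
        stats = np.empty(ndraws)
        for i in range(ndraws):
            z = rng.standard_normal(Rstar.shape[0])
            scale = np.sqrt(rng.chisquare(dof) / dof)
            vals = np.abs(U @ z) / scale
            stats[i] = np.partition(vals, n - k)[n - k]
        return np.quantile(stats, 1.0 - gammadot)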
When D fı 2 RM W ı ¤ 0g, the upper 100 P % point of the distribution of the random variable
(8.56) equals ŒM FN P .M; N P /1=2, and ı 0 w is contained in the modified version of interval (8.52)
[in which tNP =2 has been replaced by the upper 100 P % point of the distribution of the random variable
(8.56)] if and only if w is contained in the set (8.47)—refer to results (3.106) and (3.113) and to the
ensuing discussion. When the members of consist of the columns of IM , the linear combinations
ı 0 w (ı 2 ) consist of the M random variables w1 ; w2 ; : : : ; wM .
Extensions and limitations. Underlying the results of the preceding four parts of the present sub-
section is a supposition that the joint distribution of y and w is MVN. This supposition is stronger
than necessary.
As in Section 7.3, let $L$ represent an $N \times (N-P)$ matrix whose columns form an orthonormal basis for $N(X')$. And let $x = (1/\sigma)L'y$, and continue to define $z = (1/\sigma)(R_*^{-1})'[\tilde w(y) - w]$. Further, let $t = (1/\hat\sigma)(R_*^{-1})'[\tilde w(y) - w]$; and recalling result (3.21), observe that $x'x = (1/\sigma^2)\,y'(I - P_X)y$ and hence that
$$t = [x'x/(N-P)]^{-1/2}\,z. \tag{8.58}$$
Observe also that
$$(1/\hat\sigma)[\tilde w(y) - w] = R_*'\,t, \tag{8.59}$$
so that the distribution of the vector $(1/\hat\sigma)[\tilde w(y) - w]$ is determined by the distribution of the vector $t$.
When the joint distribution of $y$ and $w$ is MVN, then
$$\begin{pmatrix} z \\ x \end{pmatrix} \sim N(0, I), \tag{8.60}$$
as is evident upon observing that $L'L = I$ and upon recalling expression (8.35) and recalling [from result (3.10)] that $L'X = 0$. And when $(z', x')' \sim N(0, I)$,
$$t \sim MV\text{-}t(N-P, I), \tag{8.61}$$
as is evident from expression (8.58). More generally, $t \sim MV\text{-}t(N-P, I)$ when the vector
$$\begin{pmatrix} (R_*^{-1})'[\tilde w(y) - w] \\ L'y \end{pmatrix}$$
has an absolutely continuous spherical distribution—refer to result (6.4.67).
Thus, the supposition (underlying the results of the preceding four parts of the present subsection) that the joint distribution of $y$ and $w$ is MVN could be replaced by the weaker supposition that the vector
$$\begin{pmatrix} (R_*^{-1})'[\tilde w(y) - w] \\ L'y \end{pmatrix}$$
has an MVN distribution or, more generally, that it has an absolutely continuous spherical distribution. In fact, ultimately, it could be replaced by a supposition that the vector $t = (1/\hat\sigma)(R_*^{-1})'[\tilde w(y) - w]$ has an $MV\text{-}t(N-P, I)$ distribution.
The results of the preceding four parts of the present subsection (which are for the case where $y$ follows a G–M model) can be readily extended to the more general case where $y$ follows an Aitken model. Let $N_* = \operatorname{rank} H$. And let $Q$ represent an $N \times N_*$ matrix such that $Q'HQ = I_{N_*}$—the existence of such a matrix follows, e.g., from Corollary 2.13.23, which implies the existence of an $N_* \times N$ matrix $P_*$ (of full row rank) for which $H = P_*'P_*$, and from Lemma 2.5.1, which implies the existence of a right inverse of $P_*$. Then,
$$\operatorname{var}\begin{pmatrix} Q'y \\ w \end{pmatrix} = \sigma^2\begin{pmatrix} I_{N_*} & Q'H_{yw} \\ (Q'H_{yw})' & H_w \end{pmatrix}.$$
Thus, by applying the results of the preceding four parts of the present subsection with $Q'y$ in place of $y$ and with $Q'H_{yw}$ in place of $H_{yw}$ (and with $N_*$ in place of $N$), those results can be extended to the case where $y$ follows an Aitken model.
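A matrix $Q$ with $Q'HQ = I_{N_*}$ can be computed from a spectral decomposition of $H$; the following sketch (Python; one construction among many, assuming $H$ is supplied as a symmetric nonnegative definite array) scales the eigenvectors associated with the positive eigenvalues.

    import numpy as np

    def whitening_matrix(H, tol=1e-10):
        # Return Q (N x N*) with Q' H Q = I, where N* = rank(H).
        # H is assumed symmetric nonnegative definite; the columns of Q are
        # eigenvectors of H scaled by the reciprocal square roots of the
        # positive eigenvalues.
        evals, evecs = np.linalg.eigh(H)
        keep = evals > tol * evals.max()
        return evecs[:, keep] / np.sqrt(evals[keep])

    # check: Q = whitening_matrix(H); Q.T @ H @ Q is (numerically) an
    # identity matrix of order rank(H)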
In the case of the general linear model, the existence of a pivotal quantity for the realization of $w$ is restricted to special cases. These special cases are ones where (as in the special case of the G–M or Aitken model) the dependence of the values of the matrices $V_y(\theta)$, $V_{yw}(\theta)$, and $V_w(\theta)$ on the elements of the vector $\theta$ is of a relatively simple form.
Exercises
Exercise 1. Take the context to be that of Section 7.1 (where a second-order G–M model is applied to the results of an experimental study of the yield of lettuce plants for purposes of making inferences about the response surface and various of its characteristics). Assume that the second-order regression coefficients $\beta_{11}$, $\beta_{12}$, $\beta_{13}$, $\beta_{22}$, $\beta_{23}$, and $\beta_{33}$ are such that the matrix $A$ is nonsingular. Assume also that the distribution of the vector $e$ of residual effects in the G–M model is MVN (in which case the matrix $\hat A$ is nonsingular with probability 1). Show that the large-sample distribution of the estimator $\hat u_0$ of the stationary point $u_0$ of the response surface is MVN with mean vector $u_0$ and variance-covariance matrix
$$(1/4)\,A^{-1}\operatorname{var}(\hat a + 2A\hat u_0)\,A^{-1}. \tag{E.1}$$
Do so by applying standard results on multi-parameter maximum likelihood estimation—refer, e.g., to McCulloch, Searle, and Neuhaus (2008, sec. S.4) and Zacks (1971, chap. 5).
Exercise 2. Taking the context to be that of Section 7.2a (and adopting the same notation and terminology as in Section 7.2a), show that for $\tilde R'X'y$ to have the same expected value under the augmented G–M model as under the original G–M model, it is necessary (as well as sufficient) that $X'Z = 0$.
Exercise 3. Adopting the same notation and terminology as in Section 7.2, consider the expected value of the usual estimator $y'(I - P_X)y/[N - \operatorname{rank}(X)]$ of the variance of the residual effects of the (original) G–M model $y = X\beta + e$. How is the expected value of this estimator affected when the model equation is augmented via the inclusion of the additional "term" $Z\tau$? That is, what is the expected value of this estimator when its expected value is determined under the augmented G–M model (rather than under the original G–M model)?
Exercise 4. Adopting the same notation and terminology as in Sections 7.1 and 7.2, regard the lettuce yields as the observed values of the $N$ (= 20) elements of the random column vector $y$, and take the model to be the "reduced" model derived from the second-order G–M model (in the 3 variables Cu, Mo, and Fe) by deleting the four terms involving the variable Mo—such a model would be consistent with an assumption that Mo is "more-or-less inert," i.e., has no discernible effect on the yield of lettuce.
(a) Compute the values of the least squares estimators of the regression coefficients ($\beta_1$, $\beta_2$, $\beta_4$, $\beta_{11}$, $\beta_{13}$, and $\beta_{33}$) of the reduced model, and determine the standard errors, estimated standard errors, and correlation matrix of these estimators.
(b) Determine the expected values of the least squares estimators (of $\beta_1$, $\beta_2$, $\beta_4$, $\beta_{11}$, $\beta_{13}$, and $\beta_{33}$) from Part (a) under the complete second-order G–M model (i.e., the model that includes the 4 terms involving the variable Mo), and determine (on the basis of the complete model) the estimated standard errors of these estimators.
(c) Find four linearly independent linear combinations of the four deleted regression coefficients ($\beta_3$, $\beta_{12}$, $\beta_{22}$, and $\beta_{23}$) that, under the complete second-order G–M model, would be estimable and whose least squares estimators would be uncorrelated, each with a standard error of $\sigma$; and compute the values of the least squares estimators of these linearly independent linear combinations, and determine the estimated standard errors of the least squares estimators.
Exercise 5. Suppose that $y$ is an $N \times 1$ observable random vector that follows the G–M model. Show that
$$E(y) = X\tilde R S\alpha + U\lambda,$$
where $\tilde R$, $S$, $U$, $\alpha$, and $\lambda$ are as defined in Section 7.3a.
Exercise 6. Taking the context to be that of Section 7.3, adopting the notation employed therein, supposing that the distribution of the vector $e$ of residual effects (in the G–M model) is MVN, and assuming that $N > P + 2$, show that
$$E[\tilde F(\alpha^{(0)})] = \frac{N-P}{N-P-2}\left[1 + \frac{(\alpha - \alpha^{(0)})'(\alpha - \alpha^{(0)})}{M\sigma^2}\right] = \frac{N-P}{N-P-2}\left[1 + \frac{(\lambda - \lambda^{(0)})'C^-(\lambda - \lambda^{(0)})}{M\sigma^2}\right].$$
Exercise 7. Take the context to be that of Section 7.3, adopt the notation employed therein, and suppose that the distribution of the vector $e$ of residual effects (in the G–M model) is MVN. For $\dot\lambda \in C(\Lambda')$, the distribution of $F(\dot\lambda)$ is obtainable (upon setting $\dot\alpha = S'\dot\lambda$) from that of $\tilde F(\dot\alpha)$: in light of the relationship (3.32) and results (3.24) and (3.34),
$$F(\dot\lambda) \sim SF[M, N-P;\ (1/\sigma^2)(\lambda - \dot\lambda)'C^-(\lambda - \dot\lambda)].$$
Provide an alternative derivation of the distribution of $F(\dot\lambda)$ by (1) taking $b$ to be a $P \times 1$ vector such that $\dot\lambda = \Lambda'b$ and establishing that $F(\dot\lambda)$ is expressible in the form
$$F(\dot\lambda) = \frac{(1/\sigma^2)(y - Xb)'P_{X\tilde R}(y - Xb)/M}{(1/\sigma^2)(y - Xb)'(I - P_X)(y - Xb)/(N-P)}$$
and by (2) regarding $(1/\sigma^2)(y - Xb)'P_{X\tilde R}(y - Xb)$ and $(1/\sigma^2)(y - Xb)'(I - P_X)(y - Xb)$ as quadratic forms (in $y - Xb$) and making use of Corollaries 6.6.4 and 6.8.2.
Exercise 8. Take the context to be that of Section 7.3, and adopt the notation employed therein. Taking the model to be the canonical form of the G–M model and taking the distribution of the vector of residual effects to be $N(0, \sigma^2 I)$, derive (in terms of the transformed vector $z$) the size-$\dot\gamma$ likelihood ratio test of the null hypothesis $\tilde H_0 : \alpha = \alpha^{(0)}$ (versus the alternative hypothesis $\tilde H_1 : \alpha \ne \alpha^{(0)}$)—refer, e.g., to Casella and Berger (2002, sec. 8.2) for a discussion of likelihood ratio tests. Show that the size-$\dot\gamma$ likelihood ratio test is identical to the size-$\dot\gamma$ F test.
Exercise 9. Verify result (3.67).
Exercise 10. Verify the equivalence of conditions (3.59) and (3.68) and the equivalence of conditions
(3.61) and (3.69).
Exercise 11. Taking the context to be that of Section 7.3 and adopting the notation employed therein, show that, corresponding to any two choices $S_1$ and $S_2$ for the matrix $S$ (i.e., any two $M \times M$ matrices $S_1$ and $S_2$ such that $TS_1$ and $TS_2$ are orthogonal), there exists a unique $M \times M$ matrix $Q$ such that
$$X\tilde R S_2 = X\tilde R S_1 Q,$$
and show that this matrix is orthogonal.
Exercise 12. Taking the context to be that of Section 7.3, adopting the notation employed therein, and making use of the results of Exercise 11 (or otherwise), verify that none of the groups $G_0$, $G_1$, $G_{01}$, $G_2(\lambda^{(0)})$, $G_3(\lambda^{(0)})$, and $G_4$ (of transformations of $y$) introduced in the final three parts of Subsection b of Section 7.3 vary with the choice of the matrices $S$, $U$, and $L$.
Exercise 13. Consider the set $\tilde A_{\tilde\delta}$ (of $\tilde\delta'\alpha$-values) defined (for $\tilde\delta \in \tilde\Delta$) by expression (3.89) or (3.90). Underlying this definition is an implicit assumption that (for any $M \times 1$ vector $\dot t$) the function $f(\tilde\delta) = |\tilde\delta'\dot t|/(\tilde\delta'\tilde\delta)^{1/2}$, with domain $\{\tilde\delta \in \tilde\Delta : \tilde\delta \ne 0\}$, attains a maximum value. Show (1) that this function has a supremum and (2) that if the set
$$\breve\Delta = \{\breve\delta \in R^M : \exists \text{ a nonnull vector } \tilde\delta \text{ in } \tilde\Delta \text{ such that } \breve\delta = (\tilde\delta'\tilde\delta)^{-1/2}\tilde\delta\}$$
is closed, then there exists a nonnull vector in $\tilde\Delta$ at which this function attains a maximum value.
Exercise 14. Take the context to be that of Section 7.3, and adopt the notation employed therein. Further, let $r = \hat\sigma c_{\dot\gamma}$, and for $\tilde\delta \in R^M$, let
$$\dot A_{\tilde\delta} = \{\dot\lambda \in R^1 : \dot\lambda = \tilde\delta'\dot\alpha,\ \dot\alpha \in \tilde A\},$$
where $\tilde A$ is the set (3.121) or (3.122) and is expressible in the form
$$\tilde A = \left\{\dot\alpha \in R^M : \frac{|\dot\delta'(\dot\alpha - \hat\alpha)|}{(\dot\delta'\dot\delta)^{1/2}} \le r \text{ for every nonnull } \dot\delta \in \tilde\Delta\right\}.$$
For $\tilde\delta \notin \tilde\Delta$, $\dot A_{\tilde\delta}$ is identical to the set $A_{\tilde\delta}$ defined by expression (3.123). Show that for $\tilde\delta \in \tilde\Delta$, $\dot A_{\tilde\delta}$ is identical to the set $\tilde A_{\tilde\delta}$ defined by expression (3.89) or (3.90) or, equivalently, by the expression
$$\tilde A_{\tilde\delta} = \{\dot\lambda \in R^1 : |\dot\lambda - \tilde\delta'\hat\alpha| \le (\tilde\delta'\tilde\delta)^{1/2}\,r\}. \tag{E.2}$$
Exercise 15. Taking the sets $\Delta$ and $\tilde\Delta$, the matrix $C$, and the random vector $t$ to be as defined in Section 7.3, supposing that the set $\{\delta \in \Delta : \delta'C\delta \ne 0\}$ consists of a finite number of vectors $\delta_1, \delta_2, \ldots, \delta_Q$, and letting $K$ represent a $Q \times Q$ (correlation) matrix with $ij$th element $(\delta_i'C\delta_i)^{-1/2}(\delta_j'C\delta_j)^{-1/2}\,\delta_i'C\delta_j$, show that
$$\max_{\{\tilde\delta \in \tilde\Delta : \tilde\delta \ne 0\}} \frac{|\tilde\delta' t|}{(\tilde\delta'\tilde\delta)^{1/2}} = \max(|u_1|, |u_2|, \ldots, |u_Q|),$$
where $u_1, u_2, \ldots, u_Q$ are the elements of a random vector $u$ that has an $MV\text{-}t(N-P, K)$ distribution.
Exercise 16. Define $\tilde\Delta$, $c_{\dot\gamma}$, and $t$ as in Section 7.3c [so that $t$ is an $M \times 1$ random vector that has an $MV\text{-}t(N-P, I_M)$ distribution]. Show that
$$\Pr\left[\max_{\{\tilde\delta \in \tilde\Delta : \tilde\delta \ne 0\}} \frac{\tilde\delta' t}{(\tilde\delta'\tilde\delta)^{1/2}} > c_{\dot\gamma}\right] \ge \dot\gamma/2,$$
with equality holding if and only if there exists a nonnull $M \times 1$ vector $\breve\delta$ (of norm 1) such that $(\tilde\delta'\tilde\delta)^{-1/2}\tilde\delta = \breve\delta$ for every nonnull vector $\tilde\delta$ in $\tilde\Delta$.
Exercise 17.
(a) Letting $E_1, E_2, \ldots, E_L$ represent any events in a probability space and (for any event $E$) denoting by $\bar E$ the complement of $E$, verify the following (Bonferroni) inequality:
$$\Pr(E_1 \cap E_2 \cap \cdots \cap E_L) \ge 1 - \sum_{i=1}^{L}\Pr(\bar E_i).$$
(b) Take the context to be that of Section 7.3c, where $y$ is an $N \times 1$ observable random vector that follows a G–M model with $N \times P$ model matrix $X$ of rank $P$ and where $\lambda = \Lambda'\beta$ is an $M \times 1$ vector of estimable linear combinations of the elements of $\beta$ (such that $\Lambda \ne 0$). Further, suppose that the distribution of the vector $e$ of residual effects is $N(0, \sigma^2 I)$ (or is some other spherically symmetric distribution with mean vector $0$ and variance-covariance matrix $\sigma^2 I$), let $\hat\lambda$ represent the least squares estimator of $\lambda$, let $C = \Lambda'(X'X)^-\Lambda$, let $\delta_1, \delta_2, \ldots, \delta_L$ represent $M \times 1$ vectors of constants such that (for $i = 1, 2, \ldots, L$) $\delta_i'C\delta_i > 0$, let $\hat\sigma$ represent the positive square root of the usual estimator of $\sigma^2$ (i.e., the estimator obtained upon dividing the residual sum of squares by $N - P$), and let $\dot\gamma_1, \dot\gamma_2, \ldots, \dot\gamma_L$ represent positive scalars such that $\sum_{i=1}^{L}\dot\gamma_i = \dot\gamma$. And (for $i = 1, 2, \ldots, L$) denote by $A_i(y)$ a confidence interval for $\delta_i'\lambda$ with end points
$$\delta_i'\hat\lambda \pm (\delta_i'C\delta_i)^{1/2}\,\hat\sigma\,\bar t_{\dot\gamma_i/2}(N-P).$$
Use the result of Part (a) to show that
$$\Pr[\delta_i'\lambda \in A_i(y)\ (i = 1, 2, \ldots, L)] \ge 1 - \dot\gamma$$
and hence that the intervals $A_1(y), A_2(y), \ldots, A_L(y)$ are conservative in the sense that their probability of simultaneous coverage is greater than or equal to $1 - \dot\gamma$—when $\dot\gamma_1 = \dot\gamma_2 = \cdots = \dot\gamma_L$, the end points of interval $A_i(y)$ become
$$\delta_i'\hat\lambda \pm (\delta_i'C\delta_i)^{1/2}\,\hat\sigma\,\bar t_{\dot\gamma/(2L)}(N-P),$$
and the intervals $A_1(y), A_2(y), \ldots, A_L(y)$ are referred to as Bonferroni t-intervals.
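A sketch of the Bonferroni t-intervals of Part (b) (Python; the argument names are illustrative, and the inputs $\hat\lambda$, $C$, and $\hat\sigma$ are assumed to have been computed from the data beforehand):

    import numpy as np
    from scipy.stats import t as student_t

    def bonferroni_t_intervals(lam_hat, C, sigma_hat, deltas, gammadot, df):
        # End points  delta_i' lam_hat +/- (delta_i' C delta_i)^{1/2}
        #             * sigma_hat * tbar_{gammadot/(2L)}(df)
        # for each delta_i in the list `deltas` (equal split of gammadot).
        L = len(deltas)
        tcrit = student_t.ppf(1.0 - gammadot / (2.0 * L), df)
        out = []
        for d in deltas:
            center = d @ lam_hat
            half = np.sqrt(d @ C @ d) * sigma_hat * tcrit
            out.append((center - half, center + half))
        return out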
Exercise 18. Suppose that the data (of Section 4.2b) on the lethal dose of ouabain in cats are regarded as the observed values of the elements $y_1, y_2, \ldots, y_N$ of an $N(=41)$-dimensional observable random vector $y$ that follows a G–M model. Suppose further that (for $i = 1, 2, \ldots, 41$) $E(y_i) = \delta(u_i)$, where $u_1, u_2, \ldots, u_{41}$ are the values of the rate $u$ of injection and where $\delta(u)$ is the third-degree polynomial
$$\delta(u) = \beta_1 + \beta_2 u + \beta_3 u^2 + \beta_4 u^3.$$
And suppose that the distribution of the vector $e$ of residual effects is $N(0, \sigma^2 I)$ (or is some other spherically symmetric distribution with mean vector $0$ and variance-covariance matrix $\sigma^2 I$).
(a) Compute the values of the least squares estimators $\hat\beta_1, \hat\beta_2, \hat\beta_3$, and $\hat\beta_4$ of $\beta_1, \beta_2, \beta_3$, and $\beta_4$, respectively, and the value of the positive square root $\hat\sigma$ of the usual unbiased estimator of $\sigma^2$—it follows from the results of Section 5.3d that $P^*\ (= \operatorname{rank} X) = P = 4$, in which case $N - P^* = N - P = 37$, and that $\beta_1, \beta_2, \beta_3$, and $\beta_4$ are estimable.
(b) Find the values of $\bar t_{.05}(37)$ and $[4\bar F_{.10}(4, 37)]^{1/2}$, which would be needed if interval $I_u^{(1)}(y)$ with end points (3.169) and interval $I_u^{(2)}(y)$ with end points (3.170) (where in both cases $\dot\gamma$ is taken to be $.10$) were used to construct confidence bands for the response surface $\delta(u)$.
(c) By (for example) making use of the results in Liu's (2011) Appendix E, compute Monte Carlo approximations to the constants $c_{.10}$ and $c^*_{.10}$ that would be needed if interval $I_u^{(3)}(y)$ with end points (3.171) and interval $I_u^{(4)}(y)$ with end points (3.172) were used to construct confidence bands for $\delta(u)$; compute the approximations for the case where $u$ is restricted to the interval $1 \le u \le 8$, and (for purposes of comparison) also compute $c_{.10}$ for the case where $u$ is unrestricted.
(d) Plot (as a function of $u$) the value of the least squares estimator $\hat\delta(u) = \hat\beta_1 + \hat\beta_2 u + \hat\beta_3 u^2 + \hat\beta_4 u^3$ and (taking $\dot\gamma = .10$) the values of the end points (3.169) and (3.170) of intervals $I_u^{(1)}(y)$ and $I_u^{(2)}(y)$ and the values of the approximations to the end points (3.171) and (3.172) of intervals $I_u^{(3)}(y)$ and $I_u^{(4)}(y)$ obtained upon replacing $c_{.10}$ and $c^*_{.10}$ with their Monte Carlo approximations—assume (for purposes of creating the plot and for approximating $c_{.10}$ and $c^*_{.10}$) that $u$ is restricted to the interval $1 \le u \le 8$.
Exercise 19. Taking the setting to be that of the final four parts of Section 7.3b (and adopting the notation and terminology employed therein) and taking $\tilde G_2$ to be the group of transformations consisting of the totality of the groups $\tilde G_2(\alpha^{(0)})$ ($\alpha^{(0)} \in R^M$) and $\tilde G_3$ the group consisting of the totality of the groups $\tilde G_3(\alpha^{(0)})$ ($\alpha^{(0)} \in R^M$), show that (1) if a confidence set $\tilde A(z)$ for $\alpha$ is equivariant with respect to the groups $\tilde G_0$ and $\tilde G_2(0)$, then it is equivariant with respect to the group $\tilde G_2$ and (2) if a confidence set $\tilde A(z)$ for $\alpha$ is equivariant with respect to the groups $\tilde G_0$ and $\tilde G_3(0)$, then it is equivariant with respect to the group $\tilde G_3$.
Exercise 20. Taking the setting to be that of Section 7.4e (and adopting the assumption of normality and the notation and terminology employed therein), suppose that $M = 1$, and write $\hat\alpha$, $\alpha$, and $\alpha^{(0)}$ for the sole elements of the $1 \times 1$ vectors $\hat\alpha$, $\alpha$, and $\alpha^{(0)}$, respectively. Further, let $\tilde\phi(\hat\alpha, \hat\sigma, d)$ represent the critical function of an arbitrary (possibly randomized) level-$\dot\gamma$ test of the null hypothesis $\tilde H_0 : \alpha = \alpha^{(0)}$ versus the alternative hypothesis $\tilde H_1 : \alpha \ne \alpha^{(0)}$, and let $\tilde\pi(\alpha, \lambda, \sigma)$ represent its power function [so that $\tilde\pi(\alpha, \lambda, \sigma) = E[\tilde\phi(\hat\alpha, \hat\sigma, d)]$]. And define $s = (\hat\alpha - \alpha^{(0)})/[(\hat\alpha - \alpha^{(0)})^2 + d'd]^{1/2}$ and $w = (\hat\alpha - \alpha^{(0)})^2 + d'd$, denote by $\check\phi(s, w, \hat\sigma)$ the critical function of a level-$\dot\gamma$ test (of $\tilde H_0$ versus $\tilde H_1$) that depends on $\hat\alpha$, $\hat\sigma$, and $d$ only through the values of $s$, $w$, and $\hat\sigma$, and write $E_0$ for the expectation operator $E$ in the special case where $\alpha = \alpha^{(0)}$.
(a) Show that if the level-$\dot\gamma$ test with critical function $\tilde\phi(\hat\alpha, \hat\sigma, d)$ is an unbiased test, then
$$\tilde\pi(\alpha^{(0)}, \lambda, \sigma) = \dot\gamma \quad\text{for all } \lambda \text{ and } \sigma \tag{E.3}$$
and
$$\left.\frac{\partial\tilde\pi(\alpha, \lambda, \sigma)}{\partial\alpha}\right|_{\alpha = \alpha^{(0)}} = 0 \quad\text{for all } \lambda \text{ and } \sigma. \tag{E.4}$$
(f) Using the generalized Neyman–Pearson lemma (Lehmann and Romano 2005b, sec. 3.6; Shao 2010, sec. 6.1.1), show that among critical functions of the form $\ddot\phi(s, w, \hat\sigma)$ that satisfy (for any particular values of $w$ and $\hat\sigma$) the conditions
$$E_0[\ddot\phi(s, w, \hat\sigma) \mid w, \hat\sigma] = \dot\gamma \quad\text{and}\quad E_0[s w^{1/2}\,\ddot\phi(s, w, \hat\sigma) \mid w, \hat\sigma] = 0, \tag{E.8}$$
the value of $E[\ddot\phi(s, w, \hat\sigma) \mid w, \hat\sigma]$ (at those particular values of $w$ and $\hat\sigma$) is maximized [for any particular value of $\alpha$ ($\ne \alpha^{(0)}$) and any particular values of $\psi$ and $\sigma$] when the critical function is taken to be the critical function $\ddot\phi^*(s, w, \hat\sigma)$ defined (for all $s$, $w$, and $\hat\sigma$) as follows:
$$\ddot\phi^*(s, w, \hat\sigma) = \begin{cases} 1, & \text{if } s < -c \text{ or } s > c, \\ 0, & \text{if } -c \le s \le c, \end{cases}$$
where $c$ is the upper $100(\dot\gamma/2)\%$ point of the distribution with pdf $h(\cdot)$ [given by result (6.4.7)].
(g) Use the results of the preceding parts to conclude that among all level-$\dot\gamma$ tests of $\tilde H_0$ versus $\tilde H_1$ that are unbiased, the size-$\dot\gamma$ two-sided t test is a UMP test.
Exercise 21. Taking the setting to be that of Section 7.4e and adopting the assumption of normality and the notation and terminology employed therein, let $\tilde\pi(\boldsymbol{\alpha}, \psi, \sigma)$ represent the power function of a size-$\dot\gamma$ similar test of $H_0\colon \boldsymbol{\lambda} = \boldsymbol{\lambda}^{(0)}$ or $\tilde H_0\colon \boldsymbol{\alpha} = \boldsymbol{\alpha}^{(0)}$ versus $H_1\colon \boldsymbol{\lambda} \ne \boldsymbol{\lambda}^{(0)}$ or $\tilde H_1\colon \boldsymbol{\alpha} \ne \boldsymbol{\alpha}^{(0)}$. Show that $\min_{\boldsymbol{\alpha} \in S(\boldsymbol{\alpha}^{(0)},\,\rho)} \tilde\pi(\boldsymbol{\alpha}, \psi, \sigma)$ attains its maximum value when the size-$\dot\gamma$ similar test is taken to be the size-$\dot\gamma$ F test.
Exercise 22. Taking the setting to be that of Section 7.4e and adopting the assumption of normality and the notation and terminology employed therein, let $\tilde\phi(\hat{\boldsymbol{\alpha}}, \hat\sigma, \mathbf{d})$ represent the critical function of an arbitrary size-$\dot\gamma$ test of the null hypothesis $\tilde H_0\colon \boldsymbol{\alpha} = \boldsymbol{\alpha}^{(0)}$ versus the alternative hypothesis $\tilde H_1\colon \boldsymbol{\alpha} \ne \boldsymbol{\alpha}^{(0)}$. Further, let $\tilde\pi(\boldsymbol{\alpha}, \psi, \sigma; \tilde\phi)$ represent the power function of the test with critical function $\tilde\phi(\cdot\,, \cdot\,, \cdot)$, so that $\tilde\pi(\boldsymbol{\alpha}, \psi, \sigma; \tilde\phi) = E[\tilde\phi(\hat{\boldsymbol{\alpha}}, \hat\sigma, \mathbf{d})]$. And take $\tilde\pi^*(\boldsymbol{\alpha}, \psi, \sigma)$ to be the function defined as follows:
$$\tilde\pi^*(\boldsymbol{\alpha}, \psi, \sigma) = \sup_{\tilde\phi} \tilde\pi(\boldsymbol{\alpha}, \psi, \sigma; \tilde\phi).$$
This function is called the envelope power function.
(a) Show that $\tilde\pi^*(\boldsymbol{\alpha}, \psi, \sigma)$ depends on $\boldsymbol{\alpha}$ only through the value of $(\boldsymbol{\alpha} - \boldsymbol{\alpha}^{(0)})'(\boldsymbol{\alpha} - \boldsymbol{\alpha}^{(0)})$.
(b) Let $\tilde\phi_F(\hat{\boldsymbol{\alpha}}, \hat\sigma, \mathbf{d})$ represent the critical function of the size-$\dot\gamma$ F test of $\tilde H_0$ versus $\tilde H_1$. And as a basis for evaluating the test with critical function $\tilde\phi(\cdot\,, \cdot\,, \cdot)$, consider the use of the criterion
$$\max_{\boldsymbol{\alpha} \in S(\boldsymbol{\alpha}^{(0)},\,\rho)} [\tilde\pi^*(\boldsymbol{\alpha}, \psi, \sigma) - \tilde\pi(\boldsymbol{\alpha}, \psi, \sigma; \tilde\phi)], \tag{E.9}$$
which reflects [for $\boldsymbol{\alpha} \in S(\boldsymbol{\alpha}^{(0)}, \rho)$] the extent to which the power function of the test deviates from the envelope power function. Using the result of Exercise 21 (or otherwise), show that the size-$\dot\gamma$ F test is the "most stringent" size-$\dot\gamma$ similar test in the sense that (for "every" value of $\rho$) the value attained by the quantity (E.9) when $\tilde\phi = \tilde\phi_F$ is a minimum among those attained when $\tilde\phi$ is the critical function of some (size-$\dot\gamma$) similar test.
Exercise 23. Take the setting to be that of Section 7.5a (and adopt the assumption of normality and the notation and terminology employed therein). Show that among all tests of the null hypothesis $H_0^+\colon \lambda \le \lambda^{(0)}$ or $\tilde H_0^+\colon \alpha \le \alpha^{(0)}$ (versus the alternative hypothesis $H_1^+\colon \lambda > \lambda^{(0)}$ or $\tilde H_1^+\colon \alpha > \alpha^{(0)}$) that are of level $\dot\gamma$ and that are unbiased, the size-$\dot\gamma$ one-sided t test is a UMP test. (Hint: Proceed stepwise as in Exercise 20.)
Exercise 24.
(a) Let (for an arbitrary positive integer $M$) $f_M(\cdot)$ represent the pdf of a $\chi^2(M)$ distribution. Show that (for $0 < x < \infty$)
$$x f_M(x) = M f_{M+2}(x).$$
(b) Verify [by using Part (a) or by other means] result (6.22).
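The identity in Part (a) is easy to check numerically; the following sketch compares the two sides on a grid using SciPy's chi-square pdf.

```python
# Numerical check of x f_M(x) = M f_{M+2}(x) for chi-square pdfs.
import numpy as np
from scipy import stats

M = 7
x = np.linspace(0.1, 30.0, 200)
lhs = x * stats.chi2.pdf(x, df=M)
rhs = M * stats.chi2.pdf(x, df=M + 2)
assert np.allclose(lhs, rhs)   # agreement to floating-point tolerance
```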
Exercise 25. This exercise is to be regarded as a continuation of Exercise 18. Suppose (as in Exercise 18) that the data (of Section 4.2b) on the lethal dose of ouabain in cats are regarded as the observed values of the elements $y_1, y_2, \ldots, y_N$ of an $N(=41)$-dimensional observable random vector $\mathbf{y}$ that follows a G–M model. Suppose further that (for $i = 1, 2, \ldots, 41$) $E(y_i) = \delta(u_i)$, where $u_1, u_2, \ldots, u_{41}$ are the values of the rate $u$ of injection and where $\delta(u)$ is the third-degree polynomial
$$\delta(u) = \beta_1 + \beta_2 u + \beta_3 u^2 + \beta_4 u^3.$$
And suppose that the distribution of the vector $\mathbf{e}$ of residual effects is $N(\mathbf{0}, \sigma^2\mathbf{I})$.
(a) Determine for $\dot\gamma = 0.10$ and also for $\dot\gamma = 0.05$ (1) the value of the $100(1-\dot\gamma)\%$ lower confidence bound for $\sigma$ provided by the left end point of interval (6.2) and (2) the value of the $100(1-\dot\gamma)\%$ upper confidence bound for $\sigma$ provided by the right end point of interval (6.3).
(b) Obtain [via an implementation of interval (6.23)] a 90% two-sided strictly unbiased confidence interval for $\sigma$.
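A sketch of the corresponding computation, under the assumption that intervals (6.2) and (6.3) are the classical chi-square bounds for $\sigma$ [so that, e.g., the lower bound is $\{(N-P)\hat\sigma^2/\bar\chi^2_{\dot\gamma}(N-P)\}^{1/2}$]; the degrees of freedom, 37, match Exercise 18, and the value of $\hat\sigma$ used below is a hypothetical stand-in.

```python
# One-sided chi-square confidence bounds for sigma, assuming the
# classical form for intervals (6.2) and (6.3).
import numpy as np
from scipy import stats

def sigma_bounds(sigma_hat, df, gamma_dot):
    ss = df * sigma_hat**2                                   # (N-P) sigma-hat^2
    lower = np.sqrt(ss / stats.chi2.ppf(1 - gamma_dot, df))  # 100(1-gd)% lower bound
    upper = np.sqrt(ss / stats.chi2.ppf(gamma_dot, df))      # 100(1-gd)% upper bound
    return lower, upper

for gd in (0.10, 0.05):
    print(gd, sigma_bounds(sigma_hat=3.0, df=37, gamma_dot=gd))  # sigma_hat hypothetical
```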
Exercise 26. Take the setting to be that of the final part of Section 7.6c, and adopt the notation and terminology employed therein. In particular, take the canonical form of the G–M model to be that identified with the special case where $M = P$, so that $\boldsymbol{\alpha}$ and $\hat{\boldsymbol{\alpha}}$ are $P$-dimensional. Show that the (size-$\dot\gamma$) test of the null hypothesis $H_0^+\colon \sigma \le \sigma_0$ (versus the alternative hypothesis $H_1^+\colon \sigma > \sigma_0$) with critical region $C^+$ is UMP among all level-$\dot\gamma$ tests. Do so by carrying out the following steps.
(a) Let $\phi(T, \hat{\boldsymbol{\alpha}})$ represent the critical function of a level-$\dot\gamma$ test of $H_0^+$ versus $H_1^+$ [that depends on the vector $\mathbf{d}$ only through the value of $T$ ($= \mathbf{d}'\mathbf{d}/\sigma_0^2$)]. And let $\pi(\sigma, \boldsymbol{\alpha})$ represent the power function of the test with critical function $\phi(T, \hat{\boldsymbol{\alpha}})$. Further, let $\sigma^*$ represent any particular value of $\sigma$ greater than $\sigma_0$, let $\boldsymbol{\alpha}^*$ represent any particular value of $\boldsymbol{\alpha}$, and denote by $h(\cdot\,; \sigma)$ the pdf of the distribution of $T$, by $f(\cdot\,; \boldsymbol{\alpha}, \sigma)$ the pdf of the distribution of $\hat{\boldsymbol{\alpha}}$, and by $s(\cdot)$ the pdf of the $N[\boldsymbol{\alpha}^*, (\sigma^{*2} - \sigma_0^2)\mathbf{I}]$ distribution. Show (1) that
$$\int_{\mathbb{R}^P} \pi(\sigma_0, \boldsymbol{\alpha})\, s(\boldsymbol{\alpha})\, d\boldsymbol{\alpha} \le \dot\gamma \tag{E.10}$$
and (2) that
$$\int_{\mathbb{R}^P} \pi(\sigma_0, \boldsymbol{\alpha})\, s(\boldsymbol{\alpha})\, d\boldsymbol{\alpha} = \int_{\mathbb{R}^P}\!\int_0^\infty \phi(t, \hat{\boldsymbol{\alpha}})\, h(t; \sigma_0)\, f(\hat{\boldsymbol{\alpha}}; \boldsymbol{\alpha}^*, \sigma^*)\, dt\, d\hat{\boldsymbol{\alpha}}. \tag{E.11}$$
(b) By, for example, using a version of the Neyman–Pearson lemma like that stated by Casella and Berger (2002) in the form of their Theorem 8.3.12, show that among those choices for the critical function $\phi(T, \hat{\boldsymbol{\alpha}})$ for which the power function $\pi(\cdot\,, \cdot)$ satisfies condition (E.10), $\pi(\sigma^*, \boldsymbol{\alpha}^*)$ can be maximized by taking $\phi(T, \hat{\boldsymbol{\alpha}})$ to be the critical function $\phi^*(T, \hat{\boldsymbol{\alpha}})$ defined as follows:
$$\phi^*(t, \hat{\boldsymbol{\alpha}}) = \begin{cases} 1, & \text{when } t > \bar\chi^2_{\dot\gamma}, \\ 0, & \text{when } t \le \bar\chi^2_{\dot\gamma}. \end{cases}$$
(c) Use the results of Parts (a) and (b) to reach the desired conclusion, that is, to show that the test of $H_0^+$ (versus $H_1^+$) with critical region $C^+$ is UMP among all level-$\dot\gamma$ tests.
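As a numerical companion to this exercise: if the critical region $C^+$ rejects when $T = \mathbf{d}'\mathbf{d}/\sigma_0^2$ exceeds the upper $100\dot\gamma\%$ point of $\chi^2(N-P)$, which is the form Part (b) suggests, the power at any $\sigma > \sigma_0$ follows from the scaling $\mathbf{d}'\mathbf{d}/\sigma^2 \sim \chi^2(N-P)$. A sketch under that assumption:

```python
# Power of the test rejecting when T = d'd/sigma0^2 exceeds the upper
# chi-square point; under sigma, T ~ (sigma/sigma0)^2 * chi2(df).
from scipy import stats

def chi2_test_power(sigma, sigma0, df, gamma_dot):
    crit = stats.chi2.ppf(1 - gamma_dot, df)       # upper 100*gamma_dot% point
    return stats.chi2.sf(crit * (sigma0 / sigma)**2, df)

print(chi2_test_power(sigma=1.5, sigma0=1.0, df=37, gamma_dot=0.05))
```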
Exercise 27. Take the context to be that of Section 7.7a, and adopt the notation employed therein. Using Markov's inequality (e.g., Casella and Berger 2002, lemma 3.8.3; Bickel and Doksum 2001, sec. A.15) or otherwise, verify inequality (7.12), that is, the inequality
$$\Pr(x_i > c \text{ for } k \text{ or more values of } i) \le (1/k)\textstyle\sum_{i=1}^M \Pr(x_i > c).$$
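Inequality (7.12) is the Markov bound applied to the count of exceedances; a quick Monte Carlo illustration, with hypothetical normal $x_i$, is below.

```python
# Monte Carlo illustration of (7.12):
# Pr(x_i > c for k or more i) <= (1/k) sum_i Pr(x_i > c).
import numpy as np

rng = np.random.default_rng(1)
M, k, c = 10, 3, 1.5
x = rng.normal(size=(100_000, M))          # hypothetical x_1, ..., x_M
count = (x > c).sum(axis=1)                # number of exceedances per replicate
lhs = (count >= k).mean()
rhs = (x > c).mean(axis=0).sum() / k
print(lhs, rhs)                            # lhs should not exceed rhs
```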
Exercise 28.
(a) Letting $X$ represent any random variable whose values are confined to the interval $[0, 1]$ and letting $\gamma$ ($0 < \gamma < 1$) represent a constant, show (1) that
$$E(X) \le \gamma \Pr(X \le \gamma) + \Pr(X > \gamma) \tag{E.12}$$
and then use inequality (E.12) along with Markov's inequality (e.g., Casella and Berger 2002, sec. 3.8) to (2) show that
$$\frac{E(X) - \gamma}{1 - \gamma} \le \Pr(X > \gamma) \le \frac{E(X)}{\gamma}. \tag{E.13}$$
(b) Show that the requirement that the false discovery rate (FDR) satisfy condition (7.45) and the requirement that the false discovery proportion (FDP) satisfy condition (7.46) are related as follows:
(1) if $\mathrm{FDR} \le \delta$, then $\Pr(\mathrm{FDP} > \gamma) \le \delta/\gamma$; and
(2) if $\Pr(\mathrm{FDP} > \gamma) \le \varepsilon$, then $\mathrm{FDR} \le \gamma + (1 - \gamma)\varepsilon$.
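The two-sided bound (E.13) can be sanity-checked by simulation; here $X$ is a Beta random variable, used purely as a stand-in for a $[0, 1]$-valued variable such as the FDP.

```python
# Numerical check of (E.13): (E(X) - g)/(1 - g) <= Pr(X > g) <= E(X)/g.
import numpy as np

rng = np.random.default_rng(2)
X = rng.beta(2, 5, size=200_000)           # stand-in for a [0,1]-valued X
g = 0.3
p = (X > g).mean()
print((X.mean() - g) / (1 - g), p, X.mean() / g)   # should be increasing
```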
Exercise 29. Taking the setting to be that of Section 7.7 and adopting the terminology and notation employed therein, consider the use of a multiple-comparison procedure in testing (for every $i \in I = \{1, 2, \ldots, M\}$) the null hypothesis $H_i^{(0)}\colon \lambda_i = \lambda_i^{(0)}$ versus the alternative hypothesis $H_i^{(1)}\colon \lambda_i \ne \lambda_i^{(0)}$ (or $H_i^{(0)}\colon \lambda_i \le \lambda_i^{(0)}$ versus $H_i^{(1)}\colon \lambda_i > \lambda_i^{(0)}$). And denote by $T$ the set of values of $i \in I$ for which $H_i^{(0)}$ is true and by $F$ the set for which $H_i^{(0)}$ is false. Further, denote by $M_T$ the size of the set $T$ and by $X_T$ the number of values of $i \in T$ for which $H_i^{(0)}$ is rejected. Similarly, denote by $M_F$ the size of the set $F$ and by $X_F$ the number of values of $i \in F$ for which $H_i^{(0)}$ is rejected. Show that
(a) in the special case where $M_T = 0$, $\mathrm{FWER} = \mathrm{FDR} = 0$;
(b) in the special case where $M_T = M$, $\mathrm{FWER} = \mathrm{FDR}$; and
(c) in the special case where $0 < M_T < M$, $\mathrm{FWER} \ge \mathrm{FDR}$, with equality holding if and only if $\Pr(X_T > 0 \text{ and } X_F > 0) = 0$.
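Part (c)'s inequality $\mathrm{FWER} \ge \mathrm{FDR}$ follows from the pointwise bound $\mathrm{FDP} \le 1\{X_T > 0\}$; the simulation sketch below (a hypothetical setup of 20 z-tests with 10 true nulls) illustrates it.

```python
# Simulation sketch: with 0 < M_T < M, FDR = E(FDP) <= FWER, since
# FDP <= 1{X_T > 0} pointwise. The setup is hypothetical.
import numpy as np

rng = np.random.default_rng(3)
M, M_T, reps, c = 20, 10, 50_000, 2.0
shift = np.r_[np.zeros(M_T), np.full(M - M_T, 3.0)]   # true nulls first
z = rng.normal(size=(reps, M)) + shift
reject = np.abs(z) > c
X_T = reject[:, :M_T].sum(axis=1)                     # false rejections
X_all = reject.sum(axis=1)
fwer = (X_T > 0).mean()
fdr = np.where(X_all > 0, X_T / np.maximum(X_all, 1), 0.0).mean()
print(fdr, fwer)                                      # fdr <= fwer
```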
Exercise 30.
(a) Let $\hat p_1, \hat p_2, \ldots, \hat p_t$ represent p-values [so that $\Pr(\hat p_i \le u) \le u$ for $i = 1, 2, \ldots, t$ and for every $u \in (0, 1)$]. Further, let $\hat p_{(j)} = \hat p_{i_j}$ ($j = 1, 2, \ldots, t$), where $i_1, i_2, \ldots, i_t$ is a permutation of the first $t$ positive integers $1, 2, \ldots, t$ such that $\hat p_{i_1} \le \hat p_{i_2} \le \cdots \le \hat p_{i_t}$. And let $s$ represent a positive
(b) Take the setting to be that of Section 7.7d, and adopt the notation and terminology employed therein. And suppose that the $\alpha_j$'s of the step-down multiple-comparison procedure for testing the null hypotheses $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ are of the form
$$\alpha_j = \bar t_{([\gamma j]+1)\dot\gamma/\{2(M + [\gamma j] + 1 - j)\}}(N - P) \quad (j = 1, 2, \ldots, M) \tag{E.15}$$
[where $\dot\gamma \in (0, 1)$].
(1) Show that if
$$\Pr\big(|t_{u:T}| > \bar t_{u\dot\gamma/(2M_T)}(N - P) \text{ for 1 or more values of } u \in \{1, 2, \ldots, K\}\big) \le \varepsilon, \tag{E.16}$$
then the step-down procedure [with $\alpha_j$'s of the form (E.15)] is such that $\Pr(\mathrm{FDP} > \gamma) \le \varepsilon$.
(2) Reexpress the left side of inequality (E.16) in terms of the left side of inequality (E.14).
(3) Use Part (a) to show that
$$\Pr\big(|t_{u:T}| > \bar t_{u\dot\gamma/(2M_T)}(N - P) \text{ for 1 or more values of } u \in \{1, 2, \ldots, K\}\big) \le \dot\gamma \sum_{u=1}^{[\gamma M]+1} 1/u. \tag{E.17}$$
(4) Show that the version of the step-down procedure [with $\alpha_j$'s of the form (E.15)] obtained upon setting $\dot\gamma = \varepsilon\big/\sum_{u=1}^{[\gamma M]+1} 1/u$ is such that $\Pr(\mathrm{FDP} > \gamma) \le \varepsilon$.
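The critical values (E.15) and the calibrated $\dot\gamma$ of part (4) are straightforward to compute; a sketch, with hypothetical values of $M$, the degrees of freedom, $\gamma$, and $\varepsilon$, follows.

```python
# Step-down critical values alpha_j of the form (E.15), plus the
# calibrated gamma_dot of part (4); all inputs are hypothetical.
import numpy as np
from scipy import stats

def stepdown_alphas(M, df, gamma_dot, gamma):
    j = np.arange(1, M + 1)
    k = np.floor(gamma * j).astype(int) + 1            # [gamma j] + 1
    tail = k * gamma_dot / (2 * (M + k - j))
    return stats.t.ppf(1 - tail, df)                   # t-bar quantiles

M, gamma, eps, df = 100, 0.10, 0.05, 37
K = int(np.floor(gamma * M)) + 1
gamma_dot = eps / np.sum(1.0 / np.arange(1, K + 1))    # part (4) calibration
alphas = stepdown_alphas(M, df, gamma_dot, gamma)
print(gamma_dot, alphas[:5])
```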
Exercise 31. Take the setting to be that of Section 7.7e, and adopt the notation and terminology employed therein. And take $\alpha_1^*, \alpha_2^*, \ldots, \alpha_M^*$ to be scalars defined implicitly (in terms of $\alpha_1, \alpha_2, \ldots, \alpha_M$) by the equalities
$$\alpha_k' = \sum_{j=1}^k \alpha_j^* \quad (k = 1, 2, \ldots, M) \tag{E.18}$$
or explicitly as
$$\alpha_j^* = \begin{cases} \Pr(\alpha_{j-1} \ge |t| > \alpha_j), & \text{for } j = 2, 3, \ldots, M, \\ \Pr(|t| > \alpha_j), & \text{for } j = 1, \end{cases}$$
where $t \sim St(N - P)$.
(a) Show that the step-up procedure for testing the null hypotheses $H_i^{(0)}\colon \lambda_i = \lambda_i^{(0)}$ ($i = 1, 2, \ldots, M$) is such that (1) the FDR is less than or equal to $M_T \sum_{j=1}^M \alpha_j^*/j$; (2) when $M \sum_{j=1}^M \alpha_j^*/j < 1$, the FDR is controlled at level $M \sum_{j=1}^M \alpha_j^*/j$ (regardless of the identity of the set $T$); and (3) in the special case where (for $j = 1, 2, \ldots, M$) $\alpha_j'$ is of the form $\alpha_j' = j\dot\gamma/M$, the FDR is less than or equal to $\dot\gamma\,(M_T/M)\sum_{j=1}^M 1/j$ and can be controlled at level $\delta$ by taking $\dot\gamma = \delta\big/\sum_{j=1}^M 1/j$.
(b) The sum $\sum_{j=1}^M 1/j$ is "tightly" bounded from above by the quantity
$$\gamma_E + \log(M + 0.5) + [24(M + 0.5)^2]^{-1}, \tag{E.19}$$
where $\gamma_E$ is the Euler–Mascheroni constant (e.g., Chen 2010); to 10 significant digits, $\gamma_E = 0.5772156649$. Determine the value of $\sum_{j=1}^M 1/j$ and the amount by which this value is exceeded by the value of expression (E.19); a computational sketch follows this exercise. Do so for each of the following values of $M$: 5, 10, 50, 100, 500, 1,000, 5,000, 10,000, 20,000, and 50,000.
(c) What modifications are needed to extend the results encapsulated in Part (a) to the step-up procedure for testing the null hypotheses $H_i^{(0)}\colon \lambda_i \le \lambda_i^{(0)}$ ($i = 1, 2, \ldots, M$)?
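A sketch for Part (b); the Euler–Mascheroni value is hard-coded to the 10 digits quoted above.

```python
# Harmonic sums versus the Chen (2010) bound (E.19).
import numpy as np

GAMMA_E = 0.5772156649
for M in (5, 10, 50, 100, 500, 1_000, 5_000, 10_000, 20_000, 50_000):
    h = np.sum(1.0 / np.arange(1, M + 1))
    bound = GAMMA_E + np.log(M + 0.5) + 1.0 / (24 * (M + 0.5)**2)
    print(f"M={M:6d}  sum={h:.10f}  excess={bound - h:.3e}")
```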
Exercise 32. Take the setting to be that of Section 7.7, and adopt the terminology and notation employed therein. Further, for $j = 1, 2, \ldots, M$, let
$$\dot\alpha_j = \bar t_{k_j\dot\gamma/[2(M + k_j - j)]}(N - P),$$
where [for some scalar $\gamma$ ($0 < \gamma < 1$)] $k_j = [\gamma j] + 1$, and let
$$\ddot\alpha_j = \bar t_{j\dot\gamma/(2M)}(N - P).$$
And consider two stepwise multiple-comparison procedures for testing the null hypotheses $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$: a stepwise procedure for which $\alpha_j$ is taken to be of the form $\alpha_j = \dot\alpha_j$ [as in Section 7.7d in devising a step-down procedure for controlling $\Pr(\mathrm{FDP} > \gamma)$] and a stepwise procedure for which $\alpha_j$ is taken to be of the form $\alpha_j = \ddot\alpha_j$ (as in Section 7.7e in devising a step-up procedure for controlling the FDR). Show that (for $j = 1, 2, \ldots, M$) $\dot\alpha_j \ge \ddot\alpha_j$, with equality holding if and only if $j \le 1/(1 - \gamma)$ or $j = M$.
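The claimed ordering and equality condition can be checked numerically; the sketch below uses hypothetical values $M = 25$, 37 degrees of freedom, $\dot\gamma = .05$, and $\gamma = .10$.

```python
# Check that alpha-dot_j >= alpha-ddot_j, with equality exactly when
# j <= 1/(1 - gamma) or j = M (hypothetical M, df, gamma_dot, gamma).
import numpy as np
from scipy import stats

M, df, gamma_dot, gamma = 25, 37, 0.05, 0.10
j = np.arange(1, M + 1)
k = np.floor(gamma * j).astype(int) + 1
a_dot = stats.t.ppf(1 - k * gamma_dot / (2 * (M + k - j)), df)
a_ddot = stats.t.ppf(1 - j * gamma_dot / (2 * M), df)
print(np.all(a_dot >= a_ddot - 1e-12))
print(np.flatnonzero(np.isclose(a_dot, a_ddot)) + 1)   # expect j = 1 and j = M
```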
Exercise 33. Take the setting to be that of Section 7.7f, and adopt the terminology and notation employed therein. And consider a multiple-comparison procedure in which (for $i = 1, 2, \ldots, M$) the $i$th of the $M$ null hypotheses $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ is rejected if $|t_i^{(0)}| > c$, where $c$ is a strictly positive constant. Further, recall that $T$ is the subset of the set $I = \{1, 2, \ldots, M\}$ such that $i \in T$ if $H_i^{(0)}$ is true, denote by $R$ the subset of $I$ such that $i \in R$ if $H_i^{(0)}$ is rejected, and (for $i = 1, 2, \ldots, M$) take $X_i$ to be a random variable defined as follows:
$$X_i = \begin{cases} 1, & \text{if } |t_i^{(0)}| > c, \\ 0, & \text{if } |t_i^{(0)}| \le c. \end{cases}$$
(a) Show that
$$E\Big[(1/M)\sum_{i \in T} X_i\Big] = (M_T/M)\Pr(|t| > c) \le \Pr(|t| > c),$$
where $t \sim St(100)$.
(b) Based on the observation that [when $(1/M)\sum_{i=1}^M X_i > 0$]
$$\mathrm{FDP} = \frac{(1/M)\sum_{i \in T} X_i}{(1/M)\sum_{i=1}^M X_i},$$
on the reasoning that for large $M$ the quantity $(1/M)\sum_{i=1}^M X_i$ can be regarded as a (strictly positive) constant, and on the result of Part (a), the quantity $M_T \Pr(|t| > c)/M_R$ (with $M_R$ denoting the size of the set $R$) can be regarded as an "estimator" of the FDR [$= E(\mathrm{FDP})$] and $M \Pr(|t| > c)/M_R$ can be regarded as an estimator of $\max_T \mathrm{FDR}$ (Efron 2010, chap. 2); if $M_R = 0$, take the estimate of the FDR or of $\max_T \mathrm{FDR}$ to be 0. Consider the application to the prostate data of the multiple-comparison procedure in the case where $c = c_{\dot\gamma}(k)$ and also in the case where $c = \bar t_{k\dot\gamma/(2M)}(100)$. Use the information provided by the entries in Table 7.5 to obtain an estimate of $\max_T \mathrm{FDR}$ for each of these two cases. Do so for $\dot\gamma = .05, .10$, and $.20$ and for $k = 1, 5, 10$, and $20$.
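The plug-in estimate of Part (b) is a one-line computation; in the sketch below the rejection count $M_R$ is a hypothetical stand-in, since the Table 7.5 entries are not reproduced here.

```python
# Plug-in estimate of max_T FDR: M * Pr(|t| > c) / M_R, with t ~ St(100).
# The rejection count M_R is a hypothetical stand-in for a Table 7.5 entry.
from scipy import stats

def max_fdr_estimate(M, M_R, c, df=100):
    if M_R == 0:
        return 0.0
    return M * 2 * stats.t.sf(c, df) / M_R      # Pr(|t| > c) = 2 Pr(t > c)

M, k, gamma_dot = 6033, 10, 0.10
c = stats.t.ppf(1 - k * gamma_dot / (2 * M), 100)   # t-bar_{k*gd/(2M)}(100)
print(max_fdr_estimate(M, M_R=60, c=c))
```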
Exercise 34. Take the setting to be that of Part 6 of Section 7.8a (pertaining to the testing of $H_0\colon \mathbf{w} \in S_0$ versus $H_1\colon \mathbf{w} \in S_1$), and adopt the notation and terminology employed therein.
(a) Write $p_0$ for the random variable $p_0(\mathbf{y})$, and denote by $G_0(\cdot)$ the cdf of the conditional distribution of $p_0$ given that $\mathbf{w} \in S_0$. Further, take $k$ and $c$ to be the constants that appear in the definition of the critical function $\phi(\cdot)$, take $k'' = [1 + (\pi_1/\pi_0)k]^{-1}$, and take $\phi^*(\cdot)$ to be a critical function defined as follows:
$$\phi^*(\mathbf{y}) = \begin{cases} 1, & \text{when } p_0(\mathbf{y}) < k'', \\ c, & \text{when } p_0(\mathbf{y}) = k'', \\ 0, & \text{when } p_0(\mathbf{y}) > k''. \end{cases}$$
Show that (1) $\phi^*(\mathbf{y}) = \phi(\mathbf{y})$ when $f(\mathbf{y}) > 0$, (2) that $k''$ equals the smallest scalar $p'$ for
(b) Show that if the joint distribution of $\mathbf{w}$ and $\mathbf{y}$ is MVN, then there exists a version of the critical function $\phi(\cdot)$ defined by equalities (8.25) and (8.26) for which $\phi(\mathbf{y})$ depends on the value of $\mathbf{y}$ only through the value of $\tilde w(\mathbf{y}) = \boldsymbol{\mu}^* + \mathbf{V}_{yw}'\mathbf{V}_y^{-1}\mathbf{y}$ (where $\boldsymbol{\mu}^* = \boldsymbol{\mu}_w - \mathbf{V}_{yw}'\mathbf{V}_y^{-1}\boldsymbol{\mu}_y$).
(c) Suppose that $M = 1$ and that $S_0 = \{w\colon \ell \le w \le u\}$, where $\ell$ and $u$ are (known) constants (with $\ell < u$). Suppose also that the joint distribution of $w$ and $\mathbf{y}$ is MVN and that $\mathbf{v}_{yw} \ne \mathbf{0}$. And letting $\tilde w = \tilde w(\mathbf{y}) = \mu^* + \mathbf{v}_{yw}'\mathbf{V}_y^{-1}\mathbf{y}$ (with $\mu^* = \mu_w - \mathbf{v}_{yw}'\mathbf{V}_y^{-1}\boldsymbol{\mu}_y$) and $\tilde v = v_w - \mathbf{v}_{yw}'\mathbf{V}_y^{-1}\mathbf{v}_{yw}$, define
$$d = d(\mathbf{y}) = F\{[u - \tilde w(\mathbf{y})]/\tilde v^{1/2}\} - F\{[\ell - \tilde w(\mathbf{y})]/\tilde v^{1/2}\},$$
where $F(\cdot)$ is the cdf of the $N(0, 1)$ distribution. Further, let
$$C_0 = \{\mathbf{y} \in \mathbb{R}^N\colon d(\mathbf{y}) < \ddot d\},$$
where $\ddot d$ is the lower $100\dot\gamma\%$ point of the distribution of the random variable $d$. Show that among all $\dot\gamma$-level tests of the null hypothesis $H_0$, the nonrandomized $\dot\gamma$-level test with critical region $C_0$ achieves maximum power.
the maximum value of each of the quantities $[\mathbf{x}(\mathbf{u})]'\mathbf{W}'\mathbf{t}$ and $[\mathbf{x}(\mathbf{u})]'\mathbf{W}'(-\mathbf{t})$ (subject to the constraint $\mathbf{u} \in U$) was based on the observation that either the maximum value is attained at one of the 8 values of $\mathbf{u}$ such that $u_i = \pm 1$ ($i = 1, 2, 3$) or it is attained at a value of $\mathbf{u}$ such that (1) one, two, or three of the components of $\mathbf{u}$ are less than 1 in absolute value, (2) the first-order partial derivatives of $[\mathbf{x}(\mathbf{u})]'\mathbf{W}'\mathbf{t}$ or $[\mathbf{x}(\mathbf{u})]'\mathbf{W}'(-\mathbf{t})$ with respect to these components equal 0, and (3) the matrix of second-order partial derivatives with respect to these components is negative definite; a square matrix $\mathbf{A}$ is said to be negative definite if $-\mathbf{A}$ is positive definite.
§4e. Refer to Chen, Hung, and Chen (2007) for some general discussion of the use of maximum average-power as a criterion for evaluating hypothesis tests.
§4e and Exercise 20. The result that the size-$\dot\gamma$ two-sided t test is a UMP level-$\dot\gamma$ unbiased test (of $H_0$ or $\tilde H_0$ versus $H_1$ or $\tilde H_1$) can be regarded as a special case of results on UMP tests for exponential families like those discussed by Lehmann and Romano (2005b) in their chapters 4 and 5.
§7a, §7b, and Exercise 27. The procedures proposed by Lehmann and Romano (2005a) for the control of the k-FWER served as "inspiration" for the results of Subsection a, for Exercise 27, and for some aspects of what is presented in Subsection b. Approaches similar to the approach proposed by Lehmann and Romano for controlling the rate of multiple false rejections were considered by Victor (1982) and by Hommel and Hoffmann (1988).
§7b. Several of the results presented in this subsection are closely related to the results of Westfall and Tobias (2007).
§7c. Benjamini and Hochberg's (1995) paper has achieved landmark status. It has had a considerable impact on statistical practice and has inspired a great deal of further research into multiple-comparison methods of a kind better suited for applications to microarray data (and other large-scale applications) than the more traditional kind of methods. The newer methods have proved popular, and their use in large-scale applications has proved considerably more effective than that of the more traditional methods. Nevertheless, it could be argued that even better results could be achieved by regarding and addressing the problem of screening or discovery in a way that is more in tune with the true nature of the problem (rather than simply treating it as one of multiple comparisons).
§7c and §7f. The microarray data introduced in Section 7.7c and used for illustrative purposes in Section 7.7f are the data referred to by Efron (2010, app. B) as the prostate data and are among the data made available by him on his website. Those data were obtained by preprocessing the "raw" data from a study by Singh et al. (2002). Direct information about the nature of the preprocessing does not seem to be available; presumably, it was similar in nature to that described by Dettling (2004) and applied by him to the results of the same study. Like Efron, Dettling used the preprocessed data for illustrative purposes (and made them available via the internet). In both cases, the preprocessed data are such that the data for each of the 50 normal subjects and each of the 52 cancer patients have been centered and rescaled so that the average value is 0 and the sample variance equals 1. Additionally, it could be argued that (in formulating the objectives underlying the analysis of the prostate data in terms of multiple comparisons) it would be more realistic to take the null hypotheses to be $H_s^{(0)}\colon |\lambda_s| \le \lambda_s^{(0)}$ ($s = 1, 2, \ldots, 6033$) and the alternative hypotheses to be $H_s^{(1)}\colon |\lambda_s| > \lambda_s^{(0)}$ ($s = 1, 2, \ldots, 6033$), where $\lambda_s^{(0)}$ is a "threshold" such that (absolute) values of $\lambda_s$ smaller than $\lambda_s^{(0)}$ are regarded as "unimportant"; this presumes the existence of enough knowledge about the underlying processes that a suitable threshold is identifiable.
§7d. As discussed in this section, a step-down procedure for controlling $\Pr(\mathrm{FDP} > \gamma)$ at level $\varepsilon$ can be obtained by taking (for $j = 1, 2, \ldots, M$) $\alpha_j$ to be of the form (7.65) (in which case $\alpha_j$ can be regarded as a function of $\dot\gamma$) and by taking the value of $\dot\gamma$ to be the largest value that satisfies condition (7.70). Instead of taking $\alpha_j$ to be of the form (7.65), we could take it to be of the form $\alpha_j = \max_{S \in \tilde Q^C_{[\gamma j]+1:j}} c_{\dot\gamma}([\gamma j]+1; S)$ (where $\tilde Q^C_{k:j}$ is as defined in Section 7.7b). This change could result in additional rejections (discoveries), though any such gains would come at the expense of greatly increased computational demands.
§7d and Exercises 28 and 30. The content of this section and these exercises is based to a considerable
extent on the results of Lehmann and Romano (2005a, sec. 3).
§7e and Exercises 29 and 31. The content of this section and these exercises is based to a considerable
extent on the results of Benjamini and Yekutieli (2001).
§7f. The results (on the total number of rejections or discoveries) reported in Table 7.5 for the conservative counterpart of the k-FWER multiple-comparison procedure (in the case where $\dot\gamma = .05$) differ somewhat from the results reported by Efron (2010, fig. 3.3). It is worth noting that the latter results are those for the case where
the tests of the M null hypotheses are one-sided tests rather than those for the case where the tests are two-sided
tests.
§8a. The approach taken (in the last two parts of Section 7.8a) to devising (in the context of prediction) multiple-comparison procedures is based on the use of the closure principle (as generalized from control of the FWER to control of the k-FWER). It would seem that this use of the closure principle (as so generalized) in devising multiple-comparison procedures could be extended to other settings.
Exercise 17(b). The Bonferroni t-intervals [i.e., the intervals $A_1(\mathbf{y}), A_2(\mathbf{y}), \ldots, A_L(\mathbf{y})$ in the special case where $\dot\gamma_1 = \dot\gamma_2 = \cdots = \dot\gamma_L$] can be regarded as having been obtained from interval (3.94) (where $\boldsymbol{\delta} \in \Delta$ with $\Delta = \{\boldsymbol{\delta}_1, \boldsymbol{\delta}_2, \ldots, \boldsymbol{\delta}_L\}$) by replacing $c_{\dot\gamma}$ with $\bar t_{\dot\gamma/(2L)}(N-P)$, which constitutes an upper bound for $c_{\dot\gamma}$. As discussed by Fuchs and Sampson (1987), intervals for $\boldsymbol{\delta}_1'\boldsymbol{\beta}, \boldsymbol{\delta}_2'\boldsymbol{\beta}, \ldots, \boldsymbol{\delta}_L'\boldsymbol{\beta}$ that are superior to the Bonferroni t-intervals are obtainable from interval (3.94) by replacing $c_{\dot\gamma}$ with $\bar t_{[1-(1-\dot\gamma)^{1/L}]/2}(N-P)$, which is a tighter upper bound for $c_{\dot\gamma}$ than $\bar t_{\dot\gamma/(2L)}(N-P)$. And an even greater improvement can be effected by replacing $c_{\dot\gamma}$ with the upper $100\dot\gamma\%$ point, say $\bar t_{\dot\gamma}(L, N-P)$, of the distribution of the random variable $\max(|t_1|, |t_2|, \ldots, |t_L|)$, where $t_1, t_2, \ldots, t_L$ are the elements of an $L$-dimensional random vector that has an $MVt(N-P, \mathbf{I}_L)$ distribution; the distribution of $\max(|t_1|, |t_2|, \ldots, |t_L|)$ is that of the Studentized maximum modulus, and its upper $100\dot\gamma\%$ point is an even tighter upper bound (for $c_{\dot\gamma}$) than $\bar t_{[1-(1-\dot\gamma)^{1/L}]/2}(N-P)$. These alternatives to the Bonferroni t-intervals are based on the results of Šidák (1967, 1968); refer, for example, to Khuri (2010, sec. 7.5.4) and to Graybill (1976, sec. 6.6) for some relevant details. As a simple example, consider the confidence intervals (3.143), for which the probability of simultaneous coverage is $.90$. In this example, $\dot\gamma = .10$, $L = 3$, $N - P = 10$, $\bar t_{\dot\gamma}(L, N-P) = c_{\dot\gamma} = 2.410$, $\bar t_{[1-(1-\dot\gamma)^{1/L}]/2}(N-P) = 2.446$, and $\bar t_{\dot\gamma/(2L)}(N-P) = 2.466$.
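The three constants compared in this example can be reproduced, the first two exactly and the third by simulation; this sketch is one way to do it.

```python
# Bonferroni and Sidak t points, plus a Monte Carlo approximation to the
# Studentized-maximum-modulus point, for gamma_dot = .10, L = 3, df = 10.
import numpy as np
from scipy import stats

gamma_dot, L, df = 0.10, 3, 10
bonf = stats.t.ppf(1 - gamma_dot / (2 * L), df)                  # ~2.466
sidak = stats.t.ppf(1 - (1 - (1 - gamma_dot)**(1 / L)) / 2, df)  # ~2.446

rng = np.random.default_rng(4)
reps = 400_000
z = rng.normal(size=(reps, L))                       # independent N(0,1)'s
s = np.sqrt(rng.chisquare(df, size=reps) / df)       # shared denominator
smm = np.quantile(np.abs(z / s[:, None]).max(axis=1), 1 - gamma_dot)  # ~2.410
print(bonf, sidak, smm)
```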
Exercises 21 and 22(b). The size-$\dot\gamma$ F test is optimal in the senses described in Exercises 21 and 22(b) not only among size-$\dot\gamma$ similar tests, but among all tests whose size does not exceed $\dot\gamma$; refer, e.g., to Lehmann and Romano (2005b, chap. 8). The restriction to size-$\dot\gamma$ similar tests serves to facilitate the solution of these exercises.
References
Agresti, A. (2013), Categorical Data Analysis (3rd ed.), New York: Wiley.
Albert, A. (1972), Regression and the Moore–Penrose Pseudoinverse, New York: Academic Press.
Albert, A. (1976), “When Is a Sum of Squares an Analysis of Variance?,” The Annals of Statistics,
4, 775–778.
Anderson, T. W., and Fang, K.-T. (1987), “Cochran’s Theorem for Elliptically Contoured Distribu-
tions,” Sankhyā, Series A, 49, 305–315.
Arnold, S. F. (1981), The Theory of Linear Models and Multivariate Analysis, New York: Wiley.
Atiqullah, M. (1962), “The Estimation of Residual Variance in Quadratically Balanced Least-Squares
Problems and the Robustness of the F-Test,” Biometrika, 49, 83–91.
Baker, J. A. (1997), “Integration over Spheres and the Divergence Theorem for Balls,” The American
Mathematical Monthly, 104, 36–47.
Bartle, R. G. (1976), The Elements of Real Analysis (2nd ed.), New York: Wiley.
Bartle, R. G., and Sherbert, D. R. (2011), Introduction to Real Analysis (4th ed.), Hoboken, NJ:
Wiley.
Bates, D. M., and Watts, D. G. (1988), Nonlinear Regression Analysis and Its Applications, New
York: Wiley.
Beaumont, R. A., and Pierce, R. S. (1963), The Algebraic Foundations of Mathematics, Reading,
MA: Addison-Wesley.
Benjamini, Y., and Hochberg, Y. (1995), “Controlling the False Discovery Rate: a Practical and
Powerful Approach to Multiple Testing,” Journal of the Royal Statistical Society, Series B, 57,
289–300.
Benjamini, Y., and Yekutieli, D. (2001), “The Control of the False Discovery Rate in Multiple Testing
Under Dependency,” The Annals of Statistics, 29, 1165–1188.
Bennett, J. H., ed. (1990), Statistical Inference and Analysis: Selected Correspondence of R. A.
Fisher, Oxford, U.K.: Clarendon Press.
Berger, J. O. (1985), Statistical Decision Theory and Bayesian Analysis (2nd ed.), New York:
Springer-Verlag.
Bickel, P. J., and Doksum, K. A. (2001), Mathematical Statistics: Basic Ideas and Selected Topics
(Vol. I, 2nd ed.), Upper Saddle River, NJ: Prentice-Hall.
Billingsley, P. (1995), Probability and Measure (3rd ed.), New York: Wiley.
Box, G. E. P., and Draper, N. R. (1987), Empirical Model-Building and Response Surfaces, New
York: Wiley.
Bretz, F., Hothorn, T., and Westfall, P. (2011), Multiple Comparisons Using R, Boca Raton, FL:
Chapman & Hall/CRC.
Cacoullos, T., and Koutras, M. (1984), "Quadratic Forms in Spherical Random Variables: Generalized
Noncentral χ² Distribution," Naval Research Logistics Quarterly, 31, 447–461.
Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York:
Chapman & Hall.
Casella, G., and Berger, R. L. (2002), Statistical Inference (2nd ed.), Pacific Grove, CA: Duxbury.
Chen, C.-P. (2010), “Inequalities for the Euler–Mascheroni constant,” Applied Mathematics Letters,
23, 161–164.
Chen, L.-A., Hung, H.-N., and Chen, C.-R. (2007), “Maximum Average-Power (MAP) Tests,” Com-
munications in Statistics—Theory and Methods, 36, 2237–2249.
Cochran, W. G. (1934), “The Distribution of Quadratic Forms in a Normal System, with Applications
to the Analysis of Covariance,” Proceedings of the Cambridge Philosophical Society, 30, 178–191.
Cornell, J. A. (2002), Experiments with Mixtures: Designs, Models, and the Analysis of Mixture Data
(3rd ed.), New York: Wiley.
Cressie, N. A. C. (1993), Statistics for Spatial Data (rev. ed.), New York: Wiley.
David, H. A. (2009), “A Historical Note on Zero Correlation and Independence,” The American
Statistician, 63, 185–186.
Davidian, M., and Giltinan, D. M. (1995), Nonlinear Models for Repeated Measurement Data,
London: Chapman & Hall.
Dawid, A. P. (1982), “The Well-Calibrated Bayesian” (with discussion), Journal of the American
Statistical Association, 77, 605–613.
Dettling, M. (2004), “BagBoosting for Tumor Classification with Gene Expression Data,” Bioinfor-
matics, 20, 3583–3593.
Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data (2nd
ed.), Oxford, U.K.: Oxford University Press.
Driscoll, M. F. (1999), “An Improved Result Relating Quadratic Forms and Chi-Square Distribu-
tions,” The American Statistician, 53, 273–275.
Driscoll, M. F., and Gundberg, W. R., Jr. (1986), “A History of the Development of Craig’s Theorem,”
The American Statistician, 40, 65–70.
Driscoll, M. F., and Krasnicka, B. (1995), “An Accessible Proof of Craig’s Theorem in the General
Case,” The American Statistician, 49, 59–62.
Durbin, B. P., Hardin, J. S., Hawkins, D. M., and Rocke, D. M. (2002), “A Variance-Stabilizing
Transformation for Gene-Expression Microarray Data,” Bioinformatics, 18, S105–S110.
Edwards, D., and Berry, J. J. (1987), “The Efficiency of Simulation-Based Multiple Comparisons,”
Biometrics, 43, 913–928.
Efron, B. (2010), Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and
Prediction, New York: Cambridge University Press.
Fang, K.-T., Kotz, S., and Ng, K.-W. (1990), Symmetric Multivariate and Related Distributions,
London: Chapman & Hall.
Feller, W. (1971), An Introduction to Probability Theory and Its Applications (Vol. II, 2nd ed.), New
York: Wiley.
Fuchs, C., and Sampson, A. R. (1987), “Simultaneous Confidence Intervals for the General Linear
Model,” Biometrics, 43, 457–469.
Gallant, A. R. (1987), Nonlinear Statistical Models, New York: Wiley.
Gentle, J. E. (1998), Numerical Linear Algebra for Applications in Statistics, New York: Springer-
Verlag.
Golub, G. H., and Van Loan, C. F. (2013), Matrix Computations (4th ed.), Baltimore: The Johns
Hopkins University Press.
Graybill, F. A. (1961), An Introduction to Linear Statistical Models (Vol. I), New York: McGraw-Hill.
Graybill, F. A. (1976), Theory and Application of the Linear Model, North Scituate, MA: Duxbury.
Grimmett, G., and Welsh, D. (1986), Probability: An Introduction, Oxford, U.K.: Oxford University
Press.
Gupta, A. K., and Song, D. (1997), “Lp-Norm Spherical Distribution,” Journal of Statistical Planning
and Inference, 60, 241–260.
Hader, R. J., Harward, M. E., Mason, D. D., and Moore, D. P. (1957), “An Investigation of Some of
the Relationships Between Copper, Iron, and Molybdenum in the Growth and Nutrition of Lettuce:
I. Experimental Design and Statistical Methods for Characterizing the Response Surface,” Soil
Science Society of America Proceedings, 21, 59–64.
Hald, A. (1952), Statistical Theory with Engineering Applications, New York: Wiley.
Halmos, P. R. (1958), Finite-Dimensional Vector Spaces (2nd ed.), Princeton, NJ: Van Nostrand.
Hartigan, J. A. (1969), “Linear Bayesian Methods,” Journal of the Royal Statistical Society, Series
B, 31, 446–454.
Hartley, H. O. (1950), "The Maximum F-Ratio as a Short-Cut Test for Heterogeneity of Variance,"
Biometrika, 37, 308–312.
Harville, D. A. (1980), “Predictions for National Football League Games Via Linear-Model Method-
ology,” Journal of the American Statistical Association, 75, 516–524.
Harville, D. A. (1985), “Decomposition of Prediction Error,” Journal of the American Statistical
Association, 80, 132–138.
Harville, D. A. (1997), Matrix Algebra from a Statistician’s Perspective, New York: Springer-Verlag.
Harville, D. A. (2003a), “The Expected Value of a Conditional Variance: an Upper Bound,” Journal
of Statistical Computation and Simulation, 73, 609–612.
Harville, D. A. (2003b), “The Selection or Seeding of College Basketball or Football Teams for
Postseason Competition,” Journal of the American Statistical Association, 98, 17–27.
Harville, D. A. (2014), “The Need for More Emphasis on Prediction: a ‘Nondenominational’ Model-
Based Approach” (with discussion), The American Statistician, 68, 71–92.
Harville, D. A., and Kempthorne, O. (1997), “An Alternative Way to Establish the Necessity Part of
the Classical Result on the Statistical Independence of Quadratic Forms,” Linear Algebra and Its
Applications, 264 (Sixth Special Issue on Linear Algebra and Statistics), 205–215.
Henderson, C. R. (1984), Applications of Linear Models in Animal Breeding, Guelph, ON: Univer-
sity of Guelph.
Hinkelmann, K., and Kempthorne, O. (2008), Design and Analysis of Experiments, Volume I: Intro-
duction to Experimental Design (2nd ed.), Hoboken, NJ: Wiley.
Hodges, J. L., Jr., and Lehmann, E. L. (1951), “Some Applications of the Cramér–Rao Inequality,”
in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability,
ed. J. Neyman, Berkeley and Los Angeles: University of California Press, pp. 13–22.
Holm, S. (1979), “A Simple Sequentially Rejective Multiple Test Procedure,” Scandinavian Journal
of Statistics, 6, 65–70.
Hommel, G., and Hoffmann, T. (1988), “Controlled Uncertainty,” in Multiple Hypotheses Testing,
eds. P. Bauer, G. Hommel, and E. Sonnemann, Heidelberg: Springer, pp. 154–161.
Hsu, J. C., and Nelson, B. L. (1990), “Control Variates for Quantile Estimation,” Management
Science, 36, 835–851.
Hsu, J. C., and Nelson, B. (1998), “Multiple Comparisons in the General Linear Model,” Journal of
Computational and Graphical Statistics, 7, 23–41.
Jensen, D. R. (1985), “Multivariate Distributions,” in Encyclopedia of Statistical Sciences (Vol. 6),
eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 43–55.
Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995), Continuous Univariate Distributions (Vol. 2,
2nd ed.), New York: Wiley.
Karlin, S., and Rinott, Y. (1980), “Classes of Orderings of Measures and Related Correlation In-
equalities. I. Multivariate Totally Positive Distributions,” Journal of Multivariate Analysis, 10,
467–498.
Karlin, S., and Rinott, Y. (1981), “Total Positivity Properties of Absolute Value Multinormal Variables
with Applications to Confidence Interval Estimates and Related Probabilistic Inequalities,” The
Annals of Statistics, 9, 1035–1049.
Kempthorne, O. (1980), “The Term Design Matrix” (letter to the editor), The American Statistician,
34, 249.
Khuri, A. I. (1992), “Response Surface Models with Random Block Effects,” Technometrics, 34,
26–37.
Khuri, A. I. (1999), “A Necessary Condition for a Quadratic Form to Have a Chi-Squared Distribution:
an Accessible Proof,” International Journal of Mathematical Education in Science and Technology,
30, 335–339.
Khuri, A. I. (2010), Linear Model Methodology, Boca Raton, FL: Chapman & Hall/CRC.
Kollo, T., and von Rosen, D. (2005), Advanced Multivariate Statistics with Matrices, Dordrecht, The
Netherlands: Springer.
Laha, R. G. (1956), “On the Stochastic Independence of Two Second-Degree Polynomial Statistics
in Normally Distributed Variates,” The Annals of Mathematical Statistics, 27, 790–796.
Laird, N. (2004), Analysis of Longitudinal and Cluster-Correlated Data—Volume 8 in the NSF-
CBMS Regional Conference Series in Probability and Statistics, Beachwood, OH: Institute of
Mathematical Statistics.
LaMotte, L. R. (2007), “A Direct Derivation of the REML Likelihood Function,” Statistical Papers,
48, 321–327.
Lehmann, E. L. (1986), Testing Statistical Hypotheses (2nd ed.), New York: Wiley.
Lehmann, E. L., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer-
Verlag.
Lehmann, E. L., and Romano, J. P. (2005a), “Generalizations of the Familywise Error Rate,” The
Annals of Statistics, 33, 1138–1154.
Lehmann, E. L., and Romano, J. P. (2005b), Testing Statistical Hypotheses (3rd ed.), New York:
Springer.
Littell, R. C., Milliken, G. A., Stroup, W. W., Wolfinger, R. D., and Schabenberger, O. (2006), SAS®
System for Mixed Models (2nd ed.), Cary, NC: SAS Institute Inc.
Liu, W. (2011), Simultaneous Inference in Regression, Boca Raton, FL: Chapman & Hall/CRC.
Luenberger, D. G., and Ye, Y. (2016), Linear and Nonlinear Programming (4th ed.), New York:
Springer.
McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), London: Chapman
& Hall.
McCulloch, C. E., Searle, S. R., and Neuhaus, J. M. (2008), Generalized, Linear, and Mixed Models
(2nd ed.), Hoboken, NJ: Wiley.
Milliken, G. A., and Johnson, D. E. (2009), Analysis of Messy Data, Volume I: Designed Experiments
(2nd ed.), Boca Raton, FL: Chapman & Hall/CRC.
Moore, D. P., Harward, M. E., Mason, D. D., Hader, R. J., Lott, W. L., and Jackson, W. A. (1957),
“An Investigation of Some of the Relationships Between Copper, Iron, and Molybdenum in the
Growth and Nutrition of Lettuce: II. Response Surfaces of Growth and Accumulations of Cu and
Fe,” Soil Science Society of America Proceedings, 21, 65–74.
Müller, A. (2001), “Stochastic Ordering of Multivariate Normal Distributions,” Annals of the Institute
of Statistical Mathematics, 53, 567–575.
Myers, R. H., Montgomery, D. C., and Anderson-Cook, C. M. (2016), Response Surface Method-
ology: Process and Product Optimization Using Designed Experiments (4th ed.), Hoboken, NJ:
Wiley.
Nash, J. C. (1990), Compact Numerical Methods for Computers: Linear Algebra and Function
Minimisation (2nd ed.), Bristol, England: Adam Hilger/Institute of Physics Publications.
Nocedal, J., and Wright, S. J. (2006), Numerical Optimization (2nd ed.), New York: Springer.
Ogawa, J. (1950), “On the Independence of Quadratic Forms in a Non-Central Normal System,”
Osaka Mathematical Journal, 2, 151–159.
Ogawa, J., and Olkin, I. (2008), “A Tale of Two Countries: the Craig–Sakamoto–Matusita Theorem,”
Journal of Statistical Planning and Inference, 138, 3419–3428.
Parzen, E. (1960), Modern Probability Theory and Its Applications, New York: Wiley.
Patterson, H. D., and Thompson, R. (1971), “Recovery of Inter-Block Information When Block Sizes
Are Unequal,” Biometrika, 58, 545–554.
Pawitan, Y. (2001), In All Likelihood: Statistical Modelling and Inference Using Likelihood, New
York: Oxford University Press.
Pinheiro, J. C., and Bates, D. M. (2000), Mixed-Effects Models in S and S-PLUS, New York: Springer-
Verlag.
Plackett, R. L. (1972), “Studies in the History of Probability and Statistics. XXIX: The Discovery
of the Method of Least Squares,” Biometrika, 59, 239–251.
Potthoff, R. F., and Roy, S. N. (1964), “A Generalized Multivariate Analysis of Variance Model
Useful Especially for Growth Curve Problems,” Biometrika, 51, 313–326.
Rao, C. R. (1965), Linear Statistical Inference and Its Applications, New York: Wiley.
Rao, C. R. (1973), Linear Statistical Inference and Its Applications (2nd ed.), New York: Wiley.
Rao, C. R., and Mitra, S. K. (1971), Generalized Inverse of Matrices and Its Applications, New York:
Wiley.
Ravishanker, N., and Dey, D. K. (2002), A First Course in Linear Model Theory, Boca Raton, FL:
Chapman & Hall/CRC.
Reid, J. G., and Driscoll, M. F. (1988), “An Accessible Proof of Craig’s Theorem in the Noncentral
Case,” The American Statistician, 42, 139–142.
Sanders, W. L., and Horn, S. P. (1994), “The Tennessee Value-Added Assessment System (TVAAS):
Mixed-Model Methodology in Educational Assessment,” Journal of Personnel Evaluation in Ed-
ucation, 8, 299–311.
Sarkar, S. K. (2008), "On the Simes Inequality and Its Generalization," in Beyond Parametrics in
Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen, eds. N. Balakrishnan,
E. A. Peña, and M. J. Silvapulle, Beachwood, OH: Institute of Mathematical Statistics, pp. 231–242.
Schabenberger, O., and Gotway, C. A. (2005), Statistical Methods for Spatial Data Analysis, Boca
Raton, FL: Chapman & Hall/CRC.
Scheffé, H. (1953), “A Method for Judging All Contrasts in the Analysis of Variance,” Biometrika,
40, 87–104.
Scheffé, H. (1959), The Analysis of Variance, New York: Wiley.
Schervish, M. J. (1995), Theory of Statistics, New York: Springer-Verlag.
Schmidt, R. H., Illingworth, B. L., Deng, J. C., and Cornell, J. A. (1979), “Multiple Regression and
Response Surface Analysis of the Effects of Calcium Chloride and Cysteine on Heat-Induced
Whey Protein Gelation,” Journal of Agricultural and Food Chemistry, 27, 529–532.
Seal, H. L. (1967), “The Historical Development of the Gauss Linear Model,” Biometrika, 54, 1–24.
Searle, S. R. (1971), Linear Models, New York: Wiley.
Sen, P. K. (1989), "The Mean-Median-Mode Inequality and Noncentral Chi Square Distributions,"
Sankhyā, Series A, 51, 106–114.
Severini, T. A. (2000), Likelihood Methods in Statistics, New York: Oxford University Press.
Shanbhag, D. N. (1968), “Some Remarks Concerning Khatri’s Result on Quadratic Forms,”
Biometrika, 55, 593–595.
Shao, J. (2010), Mathematical Statistics (2nd ed.), New York: Springer-Verlag.
Šidák, Z. (1967), “Rectangular Confidence Regions for the Means of Multivariate Normal Distribu-
tions,” Journal of the American Statistical Association, 62, 626–633.
Šidák, Z. (1968), “On Multivariate Normal Probabilities of Rectangles: Their Dependence on Cor-
relations,” The Annals of Mathematical Statistics, 39, 1425–1434.
Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A.,
D’Amico, A. V., Richie, J. P., Lander, E. S., Loda, M., Kantoff, P. W., Golub, T. R., and Sellers,
W. R. (2002), “Gene Expression Correlates of Clinical Prostate Cancer Behavior,” Cancer Cell,
1, 203–209.
Snedecor, G. W., and Cochran, W. G. (1989), Statistical Methods (8th ed.), Ames, IA: Iowa State
University Press.
Sprott, D. A. (1975), “Marginal and Conditional Sufficiency,” Biometrika, 62, 599–605.
Stigler, S. M. (1986), The History of Statistics: The Measurement of Uncertainty Before 1900,
Cambridge, MA: Belknap Press of Harvard University Press.
Stigler, S. M. (1999), Statistics on the Table: The History of Statistical Concepts and Methods,
Cambridge, MA: Harvard University Press.
Student (1908), “The Probable Error of a Mean,” Biometrika, 6, 1–25.
Thompson, W. A., Jr. (1962), “The Problem of Negative Estimates of Variance Components,” The
Annals of Mathematical Statistics, 33, 273–289.
Trefethen, L. N., and Bau, D., III (1997), Numerical Linear Algebra, Philadelphia: Society for
Industrial and Applied Mathematics.
Verbyla, A. P. (1990), “A Conditional Derivation of Residual Maximum Likelihood,” Australian
Journal of Statistics, 32, 227–230.
Victor, N. (1982), “Exploratory Data Analysis and Clinical Research,” Methods of Information in
Medicine, 21, 53–54.
Westfall, P. H., and Tobias, R. D. (2007), “Multiple Testing of General Contrasts: Truncated Closure
and the Extended Shaffer–Royen Method,” Journal of the American Statistical Association, 102,
487–494.
Wolfowitz, J. (1949), “The Power of the Classical Tests Associated with the Normal Distribution,”
The Annals of Mathematical Statistics, 20, 540–551.
Woods, H., Steinour, H. H., and Starke, H. R. (1932), “Effect of Composition of Portland Cement
on Heat Evolved During Hardening,” Industrial and Engineering Chemistry, 24, 1207–1214.
Zabell, S. L. (2008), “On Student’s 1908 Article ‘The Probable Error of a Mean’ ” (with discussion),
Journal of the American Statistical Association, 103, 1–20.
Zacks, S. (1971), The Theory of Statistical Inference, New York: Wiley.
Zhang, L., Bi, H., Cheng, P., and Davis, C. J. (2004), “Modeling Spatial Variation in Tree Diameter-
Height Relationships,” Forest Ecology and Management, 189, 317–329.
Index
  for the general linear model [in the case where $\mathbf{y}$ is distributed elliptically about $\mathbf{X}\boldsymbol{\beta}$], 233–234
    maximizing values of $\boldsymbol{\beta}$ and $\boldsymbol{\theta}$, 234
    profile likelihood function (for $\boldsymbol{\theta}$), 234
likelihood or log-likelihood function (REML)
  for the Aitken model [in the special case where $\mathbf{y} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{H})$], 224–225
  for the general linear model [in the case where $\mathbf{y}$ is distributed elliptically about $\mathbf{X}\boldsymbol{\beta}$], 235
  for the general linear model [in the special case where $\mathbf{y} \sim N(\mathbf{X}\boldsymbol{\beta}, \mathbf{V}(\boldsymbol{\theta}))$], 216–219, 222–224, 249
    interpretation of the REML likelihood function as a marginal likelihood, 218–219
    relationship of the REML likelihood function to the profile likelihood function (for $\boldsymbol{\theta}$), 219, 222–224
  for the G–M model [in the special case where $\mathbf{y} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$], 221
linear dependence and independence, 34, 38–39
linear expectation of one random variable or vector given another, 240
linear space(s), 32–33, 81, 85
  basis for, 34–35, 82, 311
  dimension of, 35
  essentially disjoint, 40–41
  orthonormal basis for, 39, 82, 311
  row and column spaces, 33–36
    of a partitioned matrix, 41, 52
    of a product of matrices, 44
    of a sum of matrices, 41
    of $\mathbf{X}'\mathbf{W}\mathbf{X}$ (where $\mathbf{W}$ is symmetric and nonnegative definite), 214
    of $\mathbf{X}'\mathbf{X}$, 61–62
  subspaces of, 33–34, 82
linear system, 51–52
  coefficient matrix of, 51, 52
  consistency of, 52–53, 57–58, 83, 168–169
  homogeneous, 52, 58–59, 83
  solution to, 51, 52, 83
    general form of, 58–60, 83
    minimum norm, 247
    solution set, 53, 58–60
    uniqueness of, 59–60
linear variance or variance-covariance matrix of one random variable or vector given another, 240

M
Markov's inequality, 499
matrices, types of
  diagonal, 26
  full row or column rank, 36, 44
  identity, 26
  invertible, 42–43
  involutory, 82
  matrices of 1's, 26
  nonnegative definite, see nonnegative definite matrix (or matrices)
  nonsingular, 36, 42–44, 46–49
  null, 26
  orthogonal, see orthogonal matrix (or matrices)
  positive definite, see positive definite matrix (or matrices)
  row and column vectors, 26
  singular, 36
  square, 25
  symmetric, 25, 81
  triangular, 26, 82
matrix operations
  addition and subtraction, 24
  matrix multiplication, 24–25, 81
  scalar multiplication, 23–24
  transposition, 25, 81
matrix, definition of, 23
maximum likelihood (ML) estimator of a function of the parameter vector in the general linear model
  when [in the special case where the model is the Aitken model and $\boldsymbol{\theta} = (\sigma^2)$] $\mathbf{y} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{H})$, 216
  when $\mathbf{y} \sim N[\mathbf{X}\boldsymbol{\beta}, \mathbf{V}(\boldsymbol{\theta})]$, 215–216
  when $\mathbf{y}$ is distributed elliptically about $\mathbf{X}\boldsymbol{\beta}$, 234
maximum likelihood (ML) estimator of an estimable function (of $\boldsymbol{\beta}$)
  under the Aitken model when $\mathbf{y} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{H})$, 216
  under the G–M model when $\mathbf{y} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$, 213, 356–357, 492–493
  under the general linear model
    when $\mathbf{y} \sim N[\mathbf{X}\boldsymbol{\beta}, \mathbf{V}(\boldsymbol{\theta})]$, 215
    when $\mathbf{y}$ is distributed elliptically about $\mathbf{X}\boldsymbol{\beta}$, 234
mean or mean vector, definition of, 88
mean squared error (MSE) or MSE matrix of a (point) predictor, definition of, 236
mean squared error (MSE) and root MSE (of an estimator of $\boldsymbol{\lambda}'\boldsymbol{\beta}$), 165–166
minimum norm generalized inverse, see under generalized inverse matrix
mixture data, 171–174
model(s) (statistical)
  assumption
    of ellipticity, 233–234
    of multivariate normality, 18–21, 128, 133–136
    of the linearity of $E(\mathbf{y} \mid \mathbf{u})$, 161
  classificatory models, 4–7