Linear Models and the Relevant Distributions and Matrix Algebra
CHAPMAN & HALL/CRC
Texts in Statistical Science Series
Series Editors
Joseph K. Blitzstein, Harvard University, USA
Julian J. Faraway, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada

Texts in Statistical Science

Linear Models and the Relevant Distributions and Matrix Algebra

David A. Harville
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20180131

International Standard Book Number-13: 978-1-138-57833-3 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity
of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized
in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying,
microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Names: Harville, David A., author.


Title: Linear models and the relevant distributions and matrix algebra /
David A. Harville.
Description: Boca Raton : CRC Press, 2018. | Includes bibliographical
references and index.
Identifiers: LCCN 2017046289 | ISBN 9781138578333 (hardback : alk. paper)
Subjects: LCSH: Matrices--Problems, exercises, etc. | Mathematical
statistics--Problems, exercises, etc.
Classification: LCC QA188 .H3798 2018 | DDC 512.9/434--dc23
LC record available at https://lccn.loc.gov/2017046289

Visit the e-resources at: https://www.crcpress.com/9781138578333


Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface ix
1 Introduction 1
1.1 Linear Statistical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Regression Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Classificatory Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Hierarchical Models and Random-Effects Models . . . . . . . . . . . . . . . . . 7
1.5 Statistical Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6 An Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2 Matrix Algebra: A Primer 23
2.1 The Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Partitioned Matrices and Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Trace of a (Square) Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Linear Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5 Inverse Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.6 Ranks and Inverses of Partitioned Matrices . . . . . . . . . . . . . . . . . . . . . 44
2.7 Orthogonal Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.8 Idempotent Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.9 Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.10 Generalized Inverses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.11 Linear Systems Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.12 Projection Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.13 Quadratic Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.14 Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Bibliographic and Supplementary Notes . . . . . . . . . . . . . . . . . . . . . . 85
3 Random Vectors and Matrices 87
3.1 Expected Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.2 Variances, Covariances, and Correlations . . . . . . . . . . . . . . . . . . . . . . 89
3.3 Standardized Version of a Random Variable . . . . . . . . . . . . . . . . . . . . 97
3.4 Conditional Expected Values and Conditional Variances and Covariances . . . . 100
3.5 Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Bibliographic and Supplementary Notes . . . . . . . . . . . . . . . . . . . . . . 122
4 The General Linear Model 123
4.1 Some Basic Types of Linear Models . . . . . . . . . . . . . . . . . . . . . . . . 124
4.2 Some Specific Types of Gauss–Markov Models (with Examples) . . . . . . . . . 129
4.3 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
4.4 Heteroscedastic and Correlated Residual Effects . . . . . . . . . . . . . . . . . . 136
4.5 Multivariate Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Bibliographic and Supplementary Notes . . . . . . . . . . . . . . . . . . . . . . 162
5 Estimation and Prediction: Classical Approach 165
5.1 Linearity and Unbiasedness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
5.2 Translation Equivariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.3 Estimability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
5.4 The Method of Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
5.5 Best Linear Unbiased or Translation-Equivariant Estimation of Estimable Functions
(under the G–M Model) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
5.6 Simultaneous Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
5.7 Estimation of Variability and Covariability . . . . . . . . . . . . . . . . . . . . . 198
5.8 Best (Minimum-Variance) Unbiased Estimation . . . . . . . . . . . . . . . . . . 211
5.9 Likelihood-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
5.10 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
Bibliographic and Supplementary Notes . . . . . . . . . . . . . . . . . . . . . . 252
6 Some Relevant Distributions and Their Properties 253
6.1 Chi-Square, Gamma, Beta, and Dirichlet Distributions . . . . . . . . . . . . . . 253
6.2 Noncentral Chi-Square Distribution . . . . . . . . . . . . . . . . . . . . . . . . 267
6.3 Central and Noncentral F Distributions . . . . . . . . . . . . . . . . . . . . . . 281
6.4 Central, Noncentral, and Multivariate t Distributions . . . . . . . . . . . . . . . 290
6.5 Moment Generating Function of the Distribution of One or More Quadratic Forms
or Second-Degree Polynomials (in a Normally Distributed Random Vector) . . . 303
6.6 Distribution of Quadratic Forms or Second-Degree Polynomials (in a Normally
Distributed Random Vector): Chi-Squareness . . . . . . . . . . . . . . . . . . . 308
6.7 The Spectral Decomposition, with Application to the Distribution of Quadratic
Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
6.8 More on the Distribution of Quadratic Forms or Second-Degree Polynomials (in a
Normally Distributed Random Vector) . . . . . . . . . . . . . . . . . . . . . . . 326
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
Bibliographic and Supplementary Notes . . . . . . . . . . . . . . . . . . . . . . 349
7 Confidence Intervals (or Sets) and Tests of Hypotheses 351
7.1 “Setting the Stage”: Response Surfaces in the Context of a Specific Application and
in General . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
7.2 Augmented G–M Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
7.3 The F Test (and Corresponding Confidence Set) and a Generalized S Method . . 364
7.4 Some Optimality Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
7.5 One-Sided t Tests and the Corresponding Confidence Bounds . . . . . . . . . . . 421
7.6 The Residual Variance $\sigma^2$: Confidence Intervals and Tests of Hypotheses . . . . 430
7.7 Multiple Comparisons and Simultaneous Confidence Intervals: Some Enhance-
ments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
7.8 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
Bibliographic and Supplementary Notes . . . . . . . . . . . . . . . . . . . . . . 502
References 505
Index 513
Preface

Linear statistical models provide the theoretical underpinnings for many of the statistical procedures
in common use. In deciding on the suitability of one of those procedures for use in a potential
application, it would seem to be important to know the assumptions embodied in the underlying
model and the theoretical properties of the procedure as determined on the basis of that model. In
fact, the value of such knowledge is not limited to its value in deciding whether or not to use the
procedure. When (as is frequently the case) one or more of the assumptions appear to be unrealistic,
such knowledge can be very helpful in devising a suitably modified procedure—a situation of this
kind is illustrated in Section 7.7f.
Knowledge of matrix algebra has in effect become a prerequisite for reading much of the literature
pertaining to linear statistical models. The use of matrix algebra in this literature started to become
commonplace in the mid 1900s. Among the early adopters were Scheffé (1959), Graybill (1961), Rao
(1965), and Searle (1971). When it comes to clarity and succinctness of exposition, the introduction
of matrix algebra represented a great advance. However, those without an adequate knowledge of
matrix algebra were left at a considerable disadvantage.
Among the procedures for making statistical inferences are ones that are based on an assumption
that the data vector is the realization of a random vector, say y, that follows a linear statistical model.
The present volume discusses procedures of that kind and the properties of those procedures. Included
in the coverage are various results from matrix algebra needed to effect an efficient presentation of the
procedures and their properties. Also included in the coverage are the relevant statistical distributions.
Some of the supporting material on matrix algebra and statistical distributions is interspersed with
the discussion of the inferential procedures and their properties.
Two classical procedures are the least squares estimator (of an estimable function) and the F test.
The least squares estimator is optimal in the sense described in a result known as the Gauss–Markov
theorem. The Gauss–Markov theorem has a relatively simple proof. Results on the optimality of the
F test are stated and proved herein (in Chapter 7); the proofs of these results are relatively difficult
and less “accessible”—reference is sometimes made to Wolfowitz’s (1949) proofs of results on the
optimality of the F test, which are (at best) extremely terse.
The F test is valid under an assumption that the distribution of the observable random vector y is
multivariate normal. However, that assumption is stronger than necessary. As can be discerned from
results like those discussed by Fang, Kotz, and Ng (1990), as has been pointed out by Ravishanker
and Dey (2002, sec. 5.5), and is shown herein, the F test and various related procedures depend
on y only through a (possibly vector-valued) function of y whose distribution is the same for every
distribution of y that is “elliptically symmetric,” so that those procedures are valid not only when the
distribution of y is multivariate normal but more generally when the distribution of y is elliptically
symmetric.
The present volume includes considerable discussion of multiple comparisons and simultaneous
confidence intervals. At one time, the use of these kinds of procedures was confined to situations where
the requisite percentage points were those of a distribution (like the distribution of the Studentized
range) that was sufficiently tractable that the percentage points could be computed by numerical
means. The percentage points could then be tabulated or could be recomputed on an “as needed”
basis. An alternative whose use is not limited by considerations of “numerical tractability” is to
determine the percentage points by Monte Carlo methods in the manner described by Edwards and
Berry (1987).
The discussion herein of multiple comparisons is not confined to the traditional methods, which
serve to control the FWER (familywise error rate). It includes discussion of less conservative methods
of the kinds proposed by Benjamini and Hochberg (1995) and by Lehmann and Romano (2005a).
Prerequisites. The reader is assumed to have had at least some exposure to the basic concepts of
probability “theory” and to the basic principles of statistical inference. This exposure is assumed to
have been of the kind that could have been gained through an introductory course at a level equal to
(or exceeding) that of Casella and Berger (2002) or Bickel and Doksum (2001).
The coverage of matrix algebra provided herein is more-or-less self-contained. Nevertheless,
some previous exposure of the kind that might have been gained through an introductory course
on linear algebra is likely to be helpful. That would be so even if the introductory course were
such that the level of abstractness or generality were quite high or (at the other extreme) were
such that computations were emphasized at the expense of fundamental concepts, in which case the
connections to what is covered herein would be less direct and less obvious.
Potential uses. The book could be used as a reference. Such use has been facilitated by the inclusion
of a very extensive and detailed index and by arranging the covered material in a way that allows (to
the greatest extent feasible) the various parts of the book to be read more-or-less independently.
Or the book could serve as the text for a graduate-level course on linear statistical models with
a secondary purpose of providing instruction in matrix algebra. Knowledge of matrix algebra is
critical not only in the study of linear statistical models but also in the study of various other areas of
statistics including multivariate analysis. The integration of the instruction in matrix algebra with the
coverage of linear statistical models could have a symbiotic effect on the study of both subjects. If
desired, topics not covered in the book (either additional topics pertaining to linear statistical models
or topics pertaining to some other area such as multivariate analysis) could be included in the course
by introducing material from a secondary source.
Alternatively, the book could be used selectively in a graduate-level course on linear statistical
models to provide coverage of certain topics that may be covered in less depth (or not covered at all) in
another source. It could also be used selectively in a graduate-level course in mathematical statistics
to provide in-depth illustrations of various concepts and principles in the context of a relatively
important and complex setting.
To facilitate the use of the book as a text, a large number of exercises have been included. A
solutions manual is accessible to instructors who have adopted the book at https://www.crcpress.com/
9781138578333.
An underlying perspective. A basic problem in statistics (perhaps, the basic problem) is that of making
inferences about the realizations of some number (assumed for the sake of simplicity to be finite) of
unobservable random variables, say $w_1, w_2, \ldots, w_M$, based on the value of an observable random
vector $y$. Let $w = (w_1, w_2, \ldots, w_M)'$. A statistical model might be taken to mean a specification
of the joint distribution of $w$ and $y$ up to the value of a vector, say $\theta$, of unknown parameters. This
definition is sufficiently broad to include the case where $w = \theta$—when $w = \theta$, the joint distribution
of $w$ and $y$ is "degenerate."
In this setting, statistical inference might take the form of a "point" estimate or prediction for the
realization of $w$ or of a set of $M$-dimensional vectors and might be based on the statistical model (in
what might be deemed model-based inference). Depending on the nature of $w_1, w_2, \ldots, w_M$, this
activity might be referred to as parametric inference or alternatively as predictive inference.
Let $\tilde{w}(y)$ represent a point estimator or predictor, and let $A(y)$ represent a set of $M$-dimensional
vectors that varies with the value of $y$. And consider the use of $\tilde{w}(y)$ and $A(y)$ in model-based
(parametric or predictive) inference. If $E[\tilde{w}(y)] = E(w)$, $\tilde{w}(y)$ is said to be an unbiased estimator
or predictor. And if $\Pr[w \in A(y)] = 1 - P$ for some prespecified constant $P$ (and for "every" value of
$\theta$), $A(y)$ is what might be deemed a $100(1-P)\%$ "confidence" set—depending on the model, such
a set might or might not exist.
In the special case where $\theta$ is "degenerate" (i.e., where the joint distribution of $w$ and $y$ is
known), $\tilde{w}(y)$ could be taken to be $E(w \mid y)$ (the so-called posterior mean), in which case $\tilde{w}(y)$
would be unbiased. And among the choices for the set $A(y)$ in that special case are choices for which
$\Pr[w \in A(y) \mid y] = 1 - P$ [so-called $100(1-P)\%$ credible sets].
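As a small illustration of that last point, the following simulation sketch (with invented numbers, and assuming for concreteness that $M = 1$ and that $w$ and $y$ are jointly normal with a completely known distribution) checks numerically that the posterior mean $E(w \mid y)$ is unbiased:

```python
import numpy as np

# Simulation sketch (invented bivariate-normal example) of the point just made:
# when the joint distribution of w and y is completely known, the posterior mean
# E(w | y) has expectation E(w), i.e., it is unbiased as a predictor of w.
rng = np.random.default_rng(2)

mean = np.array([1.0, 2.0])                    # E(w) = 1, E(y) = 2
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])                   # known joint covariance of (w, y)

draws = rng.multivariate_normal(mean, cov, size=200_000)
w, y = draws[:, 0], draws[:, 1]

# For the bivariate normal, E(w | y) = E(w) + cov(w, y)/var(y) * (y - E(y)).
posterior_mean = mean[0] + (cov[0, 1] / cov[1, 1]) * (y - mean[1])

print(posterior_mean.mean())   # approximately E(w) = 1, illustrating unbiasedness
print(w.mean())                # also approximately 1
```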
Other models can be generated from the original model by regarding $\theta$ as a random vector whose
distribution is specified up to the value of some parameter vector of smaller dimension than $\theta$
and by regarding the joint distribution of $w$ and $y$ specified by the original model as the conditional
distribution of $w$ and $y$ given $\theta$. The resultant (hierarchical) models are more parsimonious than the
original model, but this (reduction in the number of parameters) comes at the expense of additional
assumptions. In the special case where that smaller parameter vector is "degenerate" (i.e., where $\theta$ is regarded as a random
vector whose distribution is completely specified and represents what in a Bayesian framework is
referred to as the prior distribution), the resultant models are ones in which the joint distribution of
$w$ and $y$ is completely specified.
As discussed in a 2014 paper (Harville 2014), I regard the division of statistical inference along
Bayesian-frequentist lines as unnecessary and undesirable. What in a Bayesian approach is referred
to as the prior distribution can simply be regarded as part of a hierarchical model. In combination with
the original model, it leads to a new model (in which the joint distribution of w and y is completely
specified).
In that 2014 paper, it is also maintained that there are many instances (especially in the case of
predictive inference) where any particular application of the inferential procedures is one in a se-
quence of “repeated” applications. In such instances, the “performance” of the procedures in repeated
application would seem to be an important consideration. Performance in repeated application can be
assessed on the basis of empirical evidence or on the basis of a “model”—for some discussion of per-
formance in repeated application within a rather specific Bayesian framework, refer to Dawid (1982).
As famously stated by George Box, “all models are wrong, but some are useful” (e.g., Box and
Draper 1987, p. 424). In fact, a model may be useful for some purposes but not for others. How
useful any particular model might be in providing a basis for statistical inference would seem to
depend at least in part on the extent to which the relationship between w and y implicit in the model
is consistent with the “actual relationship”—the more “elaborate” the model, the more opportunities
there are for discrepancies. In principle, it would seem that the inferences should be based on a
model that reflects all relevant prior information [i.e., the joint distribution of w and y should be the
conditional (on the prior information) joint distribution]. In practice, it may be difficult to formally
account for certain kinds of prior information in a way that seems altogether satisfactory; it may be
preferable to account for those kinds of prior information through informal “posterior adjustments.”
In devising a model, there is a potential pitfall. It is implicitly assumed that the specification of the
joint distribution of w and y is not influenced by the observed value of y. Yet, in practice, the model
may not be decided upon until after the data become available and/or may undergo modification
subsequent to that time. Allowing the observed value of y to influence the choice of model could
introduce subtle biases and distortions into the inferences.
Format. The book is divided into (7) numbered chapters, the chapters into numbered sections, and (in
some cases) the sections into lettered subsections. Sections are identified by two numbers (chapter
and section within chapter) separated by a decimal point—thus, the fifth section of Chapter 3 is
referred to as Section 3.5. Within a section, a subsection is referred to by letter alone. A subsection
in a different chapter or in a different section of the same chapter is referred to by referring to the
section and by appending a letter to the section number—for example, in Section 6.2, Subsection
c of Section 6.1 is referred to as Section 6.1c. An exercise in a different chapter is referred to by
the number obtained by inserting the chapter number (and a decimal point) in front of the exercise
number.
Some of the subsections are divided into parts. Each such subsection includes two or more parts
that begin with a heading and may or may not include an introductory part (with no heading). On the
relatively small number of occasions on which reference is made to one or more of the individual
parts, the parts that begin with headings are identified as though they had been numbered 1; 2; : : : in
order of appearance.
Some of the displayed “equations” are numbered. An equation number consists of two parts
(corresponding to section within chapter and equation within section) separated by a decimal point
(and is enclosed in parentheses). An equation in a different chapter is referred to by the “num-
ber” obtained by starting with the chapter number and appending a decimal point and the equation
number—for example, in Chapter 6, result (5.11) of Chapter 3 is referred to as result (3.5.11). For
purposes of numbering (and referring to) equations in the exercises, the exercises in each chapter are
to be regarded as forming Section E of that chapter.
Notational conventions and issues. The broad coverage of the manuscript (which includes coverage
of the statistical distributions and matrix algebra applicable to discussions of linear models) has led to
challenges and issues in devising suitable notation. It has sometimes proved necessary to use similar
(or even identical) symbols for more than one purpose. In some cases, notational conventions that
are typically followed in the treatment of one of the covered topics may conflict with those typically
followed in another of the covered topics; such conflicts have added to the difficulties in devising
suitable notation.
For example, in discussions of matrix algebra, it is customary (at least among statisticians) to use
boldface capital letters to represent matrices, to use boldface lowercase letters to represent vectors, and
to use ordinary lowercase letters to represent scalars. And in discussions of statistical distributions and
their characteristics, it is customary to distinguish the realization of a random variable or vector from
the random variable or vector itself by using a capital letter, say X , to represent the random variable or
a boldface capital letter, say X, to represent the random vector and to use the corresponding lowercase
letter x or boldface lowercase letter x to represent its realization. In such a case, the approach taken
herein is to use some other device such as an underline to differentiate between the random variable
or vector and its realization. Accordingly, an underlined $x$, $\mathbf{x}$, or $\mathbf{X}$ might be used to represent a random variable,
vector, or matrix and the plain $x$, $\mathbf{x}$, or $\mathbf{X}$ to represent its realization. Alternatively, in cases
where the intended usage is clear from the context, the same symbol may be used for both.
Credentials. I have brought to the writing of this book an extensive background in the subject matter.
On numerous occasions, I have taught graduate-level courses on linear statistical models. Moreover,
linear statistical models and their use as a basis for statistical inference has been my primary research
interest. My research in that area includes both work that is relatively theoretical in nature and work
in which the focus is on applications (including applications in sports and in animal breeding). I
am the author of two previous books, both of which pertain to matrix algebra: Matrix Algebra from
a Statistician’s Perspective, which provides coverage of matrix algebra of a kind that would seem
to be well-suited for those with interests in statistics and related disciplines, and Matrix Algebra:
Exercises and Solutions, which provides the solutions to the exercises in Matrix Algebra from a
Statistician’s Perspective.
In the writing of Matrix Algebra from a Statistician’s Perspective, I adopted the philosophy that (to
the greatest extent feasible) the discourse should include the theoretical underpinnings of essentially
every result. In the writing of the present volume, I have adopted much the same philosophy. Of
course, doing so has a limiting effect on the number of topics and the number of results that can be
covered.
Acknowledgments. In the writing of this volume, I have been influenced greatly (either consciously
or subconsciously) by insights acquired from others through direct contact or indirectly through
exposure to presentations they have given or to documents they have written. Among those from
whom I have acquired insights are: Frank Graybill—his 1961 book was an early influence; Justus
Seely (through access to some unpublished class notes from a course he had taught at Oregon State
University, as well as through the reading of a number of his published papers); C. R. Henderson (who
was my major professor and a source of inspiration and ideas); Oscar Kempthorne (through access
to his class notes and through thought-provoking conversations during the time he was a colleague);
and Shayle Searle (who was very supportive of my efforts and who was a major contributor to the
literature on linear statistical models and the associated matrix algebra). And I am indebted to John
Kimmel, who (in his capacity as an executive editor at Chapman and Hall/CRC) has been a source
of encouragement, support, and guidance.

David A. Harville
harville@iastate.edu
1 Introduction

This book is about linear statistical models and about the statistical procedures derived on the basis
of those models. These statistical procedures include the various procedures that make up a linear
regression analysis or an analysis of variance, as well as many other well-known procedures. They
have been applied on many occasions and with great success to a wide variety of experimental and
observational data.
In agriculture, data on the milk production of dairy cattle are used to make inferences about the
“breeding values” of various cows and bulls and ultimately to select breeding stock (e.g., Henderson
1984). These inferences (and the resultant selections) are made on the basis of a linear statistical
model. The adoption of this approach to the selection of breeding stock has significantly increased
the rate of genetic progress in the affected populations.
In education, student test scores are used in assessing the effectiveness of teachers, schools, and
school districts. In the Tennessee value-added assessment system (TVAAS), the assessments are in
terms of statistical inferences made on the basis of a linear statistical model (e.g., Sanders and Horn
1994). This approach compares favorably with the more traditional ways of using student test scores
to assess effectiveness. Accordingly, its use has been mandated in a number of regions.
In sports such as football and basketball, the outcomes of past and present games can be used to
predict the outcomes of future games and to rank or rate the various teams. Very accurate results can
be obtained by basing the predictions and the rankings or ratings on a linear statistical model (e.g.,
Harville 1980, 2003b, 2014). The predictions obtained in this way are nearly as accurate as those
implicit in the betting line. And (in the case of college basketball) they are considerably more accurate
than predictions based on the RPI (Ratings Percentage Index), which is a statistical instrument used
by the NCAA (National Collegiate Athletic Association) to rank teams.
The scope of statistical procedures developed on the basis of linear statistical models can be (and
has been) extended. Extensions to various kinds of nonlinear statistical models have been considered
by Bates and Watts (1988), Gallant (1987), and Pinheiro and Bates (2000). Extensions to the kinds
of statistical models that have come to be known as generalized linear models have been considered
by McCullagh and Nelder (1989), Agresti (2013), and McCulloch, Searle, and Neuhaus (2008).

1.1 Linear Statistical Models


A central (perhaps the central) problem in statistics is that of using N data points that are to be regarded
as the respective values of observable random variables $y_1, y_2, \ldots, y_N$ to make inferences about
various future quantities and/or various other quantities that are deemed unobservable. Inference
about future quantities is sometimes referred to as predictive inference.
Denote by $y$ the $N$-dimensional random (column) vector whose elements are $y_1, y_2, \ldots, y_N$,
respectively. In devising or evaluating an inferential procedure, whatever assumptions are made about
the distribution of $y$ play a critical role. This distribution is generally taken to be "conditional" on
various kinds of concomitant information. Corresponding to the observed value of $y_i$ may be known
values $u_{i1}, u_{i2}, \ldots, u_{iC}$ of $C$ explanatory variables $u_1, u_2, \ldots, u_C$ ($1 \le i \le N$); these $NC$ values
may constitute some or all of the concomitant information. The various assumptions made about the
distribution of y are referred to collectively as a statistical model or simply as a model.
We shall be concerned herein with what are known as linear (statistical) models. These models
are relatively tractable and provide the theoretical underpinnings for a broad class of statistical
procedures.
What constitutes a linear model? In a linear model, the expected values of $y_1, y_2, \ldots, y_N$ are taken
to be linear combinations of some number, say $P$, of generally unknown parameters $\beta_1, \beta_2, \ldots, \beta_P$.
That is, there exist numbers $x_{i1}, x_{i2}, \ldots, x_{iP}$ (assumed known) such that

$$E(y_i) = \sum_{j=1}^{P} x_{ij}\,\beta_j \qquad (i = 1, 2, \ldots, N). \tag{1.1}$$

The parameters $\beta_1, \beta_2, \ldots, \beta_P$ may be unrestricted, or (more generally) may be subject to "linear constraints."
For $i = 1, 2, \ldots, N$, the random variable $y_i$ can be reexpressed as

$$y_i = \sum_{j=1}^{P} x_{ij}\,\beta_j + \Bigl(y_i - \sum_{j=1}^{P} x_{ij}\,\beta_j\Bigr).$$

Accordingly, condition (1.1) is equivalent to the condition

$$y_i = \sum_{j=1}^{P} x_{ij}\,\beta_j + e_i \qquad (i = 1, 2, \ldots, N), \tag{1.2}$$

where $e_1, e_2, \ldots, e_N$ are random variables, each of which has an expected value of 0. Under condition (1.2), we have that

$$e_i = y_i - \sum_{j=1}^{P} x_{ij}\,\beta_j = y_i - E(y_i) \qquad (i = 1, 2, \ldots, N). \tag{1.3}$$

Aside from the two trivial cases $\operatorname{var}(e_i) = 0$ and $x_{i1} = x_{i2} = \cdots = x_{iP} = 0$ and a case where
$\beta_1, \beta_2, \ldots, \beta_P$ are subject to restrictions under which $\sum_{j=1}^{P} x_{ij}\,\beta_j$ is known, $e_i$ is unobservable.
The random variables $e_1, e_2, \ldots, e_N$ are sometimes referred to as residual effects or as errors.
In working with linear models, the use of matrix notation is extremely convenient. Note that
condition (1.1) can be reexpressed in the form

$$E(y) = X\beta, \tag{1.4}$$

where $X$ is the $N \times P$ matrix whose $ij$th element is $x_{ij}$ ($i = 1, 2, \ldots, N$; $j = 1, 2, \ldots, P$) and
$\beta$ is the $P$-dimensional column vector with elements $\beta_1, \beta_2, \ldots, \beta_P$. And condition (1.2), which is
equivalent to condition (1.4), can be restated as

$$y = X\beta + e, \tag{1.5}$$

where $e$ is a random column vector (the elements of which are $e_1, e_2, \ldots, e_N$) with $E(e) = 0$.
Further, in matrix notation, result (1.3) becomes

$$e = y - X\beta = y - E(y). \tag{1.6}$$

For a model to qualify as a linear model, we require something more than condition (1.1) or (1.2).
Namely, we require that the variance-covariance matrix of y, or equivalently of e, not depend on the
elements $\beta_1, \beta_2, \ldots, \beta_P$ of $\beta$—the diagonal elements of the variance-covariance matrix of $y$ are the
variances of the elements $y_1, y_2, \ldots, y_N$ of $y$, and the off-diagonal elements are the covariances.
This matrix may depend (and typically does depend) on various unknown parameters other than
$\beta_1, \beta_2, \ldots, \beta_P$.
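As a concrete illustration of the matrix form, here is a minimal numerical sketch; the values of $X$ and $\beta$ and the choice of a normal distribution for the residual effects are invented purely for illustration and are not taken from the text.

```python
import numpy as np

# A minimal numerical sketch of the matrix form (1.5), y = X beta + e, with
# invented values (N = 4 data points, P = 2 parameters).
rng = np.random.default_rng(0)

X = np.array([[1.0, 2.0],
              [1.0, 5.0],
              [1.0, 7.0],
              [1.0, 9.0]])            # N x P matrix of known values x_ij
beta = np.array([3.0, 0.5])           # beta_1, ..., beta_P

e = rng.normal(loc=0.0, scale=1.0, size=4)   # residual effects with E(e) = 0
y = X @ beta + e                      # one realization of the observable vector y

print(X @ beta)                       # E(y) = X beta, as in (1.4)
print(y - X @ beta)                   # recovers e = y - X beta, as in (1.6)
```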
For a model to be useful in making inferences about the unobservable quantities of interest, it
must be possible to express those quantities in a relevant way. Consider a linear model, in which
$E(y_1), E(y_2), \ldots, E(y_N)$ are expressible in the form (1.1) or, equivalently, in which $y$ is expressible in
the form (1.5). This model could be useful in making inferences about a quantity that is expressible as a
linear combination, say $\sum_{j=1}^{P} \lambda_j\beta_j$, of the elements $\beta_1, \beta_2, \ldots, \beta_P$ of $\beta$—how useful would depend
on $X$, on the coefficients $\lambda_1, \lambda_2, \ldots, \lambda_P$, and perhaps on various characteristics of the distribution of
$e$. More generally, this model could be useful in making inferences about an unobservable random
variable $w$ for which $E(w) = \sum_{j=1}^{P} \lambda_j\beta_j$ and for which $\operatorname{var}(w)$ and $\operatorname{cov}(w, y)$ do not depend on $\beta$
or, equivalently, an unobservable random variable $w$ of the form $w = \sum_{j=1}^{P} \lambda_j\beta_j + d$, where $d$ is
a random variable for which $E(d) = 0$ and for which $\operatorname{var}(d)$ and $\operatorname{cov}(d, e)$ do not depend on $\beta$—
$\operatorname{cov}(w, y)$ and $\operatorname{cov}(d, e)$ are $N$-dimensional row vectors, the elements of which are the covariances
between $w$ and $y_1, y_2, \ldots, y_N$ and the covariances between $d$ and $e_1, e_2, \ldots, e_N$. Strictly speaking,
inferences are made about the "realization" of a random variable, not the random variable itself.
The model could also be useful in making inferences about a quantity that is expressible in terms of
whatever parameters may characterize the distribution of $e$.

1.2 Regression Models


Suppose that (as in Section 1.1) there are N data points that (for purposes of making inferences about
various unobservable quantities) are to be regarded as the respective values of observable random
variables $y_1, y_2, \ldots, y_N$. Suppose further that corresponding to the observed value of $y_i$ are known
values $u_{i1}, u_{i2}, \ldots, u_{iC}$ of $C$ explanatory variables $u_1, u_2, \ldots, u_C$ ($1 \le i \le N$). For example, in
an observational study of how the amount of milk produced by a dairy cow during her first lactation
varies with her age and her body weight (recorded at the beginning of her initial pregnancy), $y_i$ might
correspond to the amount of milk produced by the $i$th cow and (taking $C = 2$) $u_{i1}$ and $u_{i2}$ might
represent her age and her body weight.
A possible model is that obtained by taking

$$y_i = \alpha_0 + \sum_{j=1}^{C} u_{ij}\,\alpha_j + e_i \qquad (i = 1, 2, \ldots, N), \tag{2.1}$$

where $\alpha_0, \alpha_1, \ldots, \alpha_C$ are unrestricted parameters (of unknown value) and where $e_1, e_2, \ldots, e_N$ are
uncorrelated, unobservable random variables, each with mean 0 and (for a strictly positive parameter
$\sigma$ of unknown value) variance $\sigma^2$. Models of the form (2.1) are referred to as simple or multiple
(depending on whether $C = 1$ or $C \ge 2$) linear regression models.
As suggested by the name, a linear regression model qualifies as a linear model. Under the linear
regression model (2.1),
$$E(y_i) = \alpha_0 + \sum_{j=1}^{C} u_{ij}\,\alpha_j \qquad (i = 1, 2, \ldots, N). \tag{2.2}$$

The expected values (2.2) are of the form (1.1), and the expressions (2.1) are of the form (1.2); set
$P = C + 1$, $\beta_1 = \alpha_0$, and (for $j = 1, 2, \ldots, C$) $\beta_{j+1} = \alpha_j$, and take $x_{11} = x_{21} = \cdots = x_{N1} = 1$
and (for $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, C$) $x_{i,j+1} = u_{ij}$. Moreover, the linear regression model
is such that the variance-covariance matrix of $e_1, e_2, \ldots, e_N$ does not depend on the $\beta_j$'s; it depends
only on the parameter $\sigma$.
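In computational terms, the correspondence just described amounts to prepending a column of ones to the matrix of explanatory-variable values, as in the following sketch (the $u$ values and parameter values are invented):

```python
import numpy as np

# Sketch of the correspondence just described: prepending a column of ones to the
# matrix of explanatory-variable values turns the regression model (2.1) into the
# general form (1.5). Invented data: N = 4 cows, C = 2 variables (age, body weight).
U = np.array([[30.0, 420.0],
              [28.0, 390.0],
              [35.0, 450.0],
              [33.0, 430.0]])                 # row i holds u_i1, ..., u_iC
N, C = U.shape

X = np.column_stack([np.ones(N), U])          # x_i1 = 1 and x_{i,j+1} = u_ij
alpha = np.array([5.0, 0.2, 0.01])            # alpha_0, alpha_1, ..., alpha_C
beta = alpha                                  # beta_1 = alpha_0, beta_{j+1} = alpha_j

print(X @ beta)   # E(y_i) = alpha_0 + sum_j u_ij alpha_j, as in (2.2)
```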

In an application of the multiple linear regression model, we might wish to make inferences about
some or all of the individual parameters $\alpha_0, \alpha_1, \ldots, \alpha_C$, and $\sigma$. Or we might wish to make inferences
about the quantity $\alpha_0 + \sum_{j=1}^{C} u_j\alpha_j$ for various values of the explanatory variables $u_1, u_2, \ldots, u_C$.
This quantity could be thought of as representing the "average" value of an infinitely large number
of future data points, all of which correspond to the same $u_1, u_2, \ldots, u_C$ values. Also of potential
interest are quantities of the form $\alpha_0 + \sum_{j=1}^{C} u_j\alpha_j + d$, where $d$ is an unobservable random variable
that is uncorrelated with $e_1, e_2, \ldots, e_N$ and that has mean 0 and variance $\sigma^2$. A quantity of this form is
a random variable, the value of which can be thought of as representing an individual future data point.
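The distinction between these two inferential targets can be illustrated numerically; in the sketch below, all parameter values, the value of $\sigma$, and the chosen $u$ values are invented, and the individual future points are simply simulated under the stated assumptions.

```python
import numpy as np

# Illustrative sketch (all values invented) of the two targets of inference just
# described: the "average" value alpha_0 + sum_j u_j alpha_j at a chosen setting of
# the explanatory variables, and an individual future data point, which adds an
# unobservable d with mean 0 and variance sigma^2.
rng = np.random.default_rng(1)

alpha0 = 5.0
alpha = np.array([0.2, 0.01])                 # alpha_1, ..., alpha_C
sigma = 1.5
u_new = np.array([32.0, 400.0])               # chosen values of u_1, ..., u_C

mean_value = alpha0 + u_new @ alpha           # average of infinitely many future points
future_points = mean_value + rng.normal(0.0, sigma, size=5)   # five simulated individual points

print(mean_value)
print(future_points)
```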
There are potential pitfalls in making predictive inferences on the basis of a statistical model,
both in general and in the case of a multiple linear regression model. It is essential that the relevant
characteristics of the setting in which the predictive inferences are to be applied be consistent with
those of the setting that gives rise to the data. For example, in making predictive inferences about the
relationship between a cow’s milk production and her age and her body weight, it would be essential
that there be consistency with regard to breed and perhaps with regard to various management
practices. The use of data collected on a random sample of the population that is the “target” of the
predictive inferences can be regarded as an attempt to achieve the desired consistency. In making
predictive inferences on the basis of a multiple linear regression model, it is also essential that
the model “accurately reflect” the underlying relationships and that it do so over all values of the
explanatory variables for which predictive inferences are sought (as well as over all values for which
there are data).
For some applications of the multiple linear regression model, the assumption that $e_1, e_2, \ldots, e_N$
(and $d$) are uncorrelated with each other may be overly simplistic. Consider, for example, an appli-
cation in which each of the data points represents the amount of milk produced by a cow. If some
of the cows are genetically related to others, then we may wish to modify the model accordingly.
Any two of the residual effects $e_1, e_2, \ldots, e_N$ (and $d$) that correspond to cows that are genetically
related may be positively correlated (to an extent that depends on the closeness of the relationship).
Moreover, the data are likely to come from more than one herd of cows. Cows that belong to the
same herd share a common environment, and tend to be more alike than cows that belong to different
herds. One way to account for their alikeness is through the introduction of a positive covariance (of
unknown value).
In making inferences on the basis of a multiple linear regression model, a possible objective
is that of obtaining relevant input to some sort of decision-making process. In particular, when
inferences are made about future data points, it may be done with the intent of judging the effects of
changes in the values of any of the explanatory variables $u_1, u_2, \ldots, u_C$ that are subject to control.
Considerable caution needs to be exercised in making such judgments. There may be variables that are
not accounted for in the model but whose values may have "influenced" the values of $y_1, y_2, \ldots, y_N$
and may influence future data points. If the values of any of the excluded variables are related
(either positively or negatively) to any of the variables for which changes are contemplated, then the
model-based inferences may create a misleading impression of the effects of the changes.

1.3 Classificatory Models


Let us consider further the use (in making statistical inferences) of N data points that are to be regarded
as the values of observable random variables $y_1, y_2, \ldots, y_N$. In many applications, the $N$ data points
can be partitioned (in a meaningful way) into a number of mutually exclusive and exhaustive subsets
or “groups.” In fact, the N data points may lend themselves to several such partitionings, each of
which is based on a different criterion or “factor.” The subsets or groups formed on the basis of any
particular factor are sometimes referred to as the “levels” of the factor.
A factor can be converted into an explanatory variable by assigning each of its levels a distinct
number. In some cases, the assignment can be done in such a way that the explanatory variable might
be suitable for inclusion in a multiple linear regression model. Consider, for example, the case of
data on individual animals that have been partitioned into groups on the basis of age or body weight.
In a case of this kind, the factor might be referred to as a “quantitative” factor.
There is another kind of situation; one where the data points are partitioned into groups on the
basis of a “qualitative” factor and where (regardless of the method of assignment) the numbers
assigned to the groups or levels are meaningful only for purposes of identification. For example, in
an application where each data point consists of the amount of milk produced by a different one of
N dairy cows, the data points might be partitioned into groups, each of which consists of the data
points from those cows that are the daughters of a different one of K bulls. The K bulls constitute the
levels of a qualitative factor. For purposes of identification, the bulls could be numbered $1, 2, \ldots, K$
in whatever order might be convenient.
In a situation where the N data points have been partitioned into groups on the basis of each of
one or more qualitative factors, the data are sometimes referred to as “classificatory data.” Among the
models that could be applied to classificatory data are what might be called “classificatory models.”
Suppose (for the sake of simplicity) that there is a single qualitative factor, and that it has K levels
numbered $1, 2, \ldots, K$. And (for $k = 1, 2, \ldots, K$) denote by $N_k$ the number of data points associated
with level $k$—clearly, $\sum_{k=1}^{K} N_k = N$.
In this setting, it is convenient to use two subscripts, rather than one, in distinguishing among
the random variables $y_1, y_2, \ldots, y_N$ (and among related quantities). The first subscript identifies the
level, and the second allows us to distinguish among entities associated with the same level. Accordingly,
we write $y_{k1}, y_{k2}, \ldots, y_{kN_k}$ for those of the random variables $y_1, y_2, \ldots, y_N$ associated with
the $k$th level ($k = 1, 2, \ldots, K$).
As a possible model, we have the classificatory model obtained by taking
$$y_{ks} = \mu + \alpha_k + e_{ks} \qquad (k = 1, 2, \ldots, K;\ s = 1, 2, \ldots, N_k), \tag{3.1}$$

where $\mu, \alpha_1, \alpha_2, \ldots, \alpha_K$ are unknown parameters and where the $e_{ks}$'s are uncorrelated, unobservable
random variables, each with mean 0 and (for a strictly positive parameter $\sigma$ of unknown value)
variance $\sigma^2$. The parameters $\alpha_1, \alpha_2, \ldots, \alpha_K$ are sometimes referred to as effects. And the model
itself is sometimes referred to as the one-way-classification model or (to distinguish it from a variation
to be discussed subsequently) the one-way-classification fixed-effects model. The parameters
$\mu, \alpha_1, \alpha_2, \ldots, \alpha_K$ are generally taken to be unrestricted, though sometimes they are required to
satisfy the restriction

$$\sum_{k=1}^{K} \alpha_k = 0 \tag{3.2}$$

or some other restriction (such as $\sum_{k=1}^{K} N_k\alpha_k = 0$, $\mu = 0$, or $\alpha_K = 0$).
Is the one-way-classification model a linear model? The answer is yes, though this may be less
obvious than in the case (considered in Section 1.2) of a multiple linear regression model. That the
one-way-classification model is a linear model becomes more transparent upon observing that the
defining relation (3.1) can be reexpressed in the form
$$y_{ks} = \sum_{j=1}^{K+1} x_{ksj}\,\beta_j + e_{ks} \qquad (k = 1, 2, \ldots, K;\ s = 1, 2, \ldots, N_k), \tag{3.3}$$

where $\beta_1 = \mu$ and $\beta_j = \alpha_{j-1}$ ($j = 2, 3, \ldots, K+1$) and where (for $k = 1, 2, \ldots, K$; $s = 1, 2, \ldots, N_k$; $j = 1, 2, \ldots, K+1$)

$$x_{ksj} = \begin{cases} 1, & \text{if } j = 1 \text{ or } j = k+1, \\ 0, & \text{otherwise.} \end{cases}$$

In that regard, it may be helpful (i.e., provide even more in the way of transparency) to observe that
result (3.3) is equivalent to the result
$$y_i = \sum_{j=1}^{K+1} x_{ij}\,\beta_j + e_i \qquad (i = 1, 2, \ldots, N),$$

where $x_{i1} = 1$ and (for $j = 2, 3, \ldots, K+1$) $x_{ij} = 1$ or $x_{ij} = 0$ depending on whether or not the
$i$th data point is a member of the $(j-1)$th group ($i = 1, 2, \ldots, N$) and where (as in Section 1.2)
$e_1, e_2, \ldots, e_N$ are uncorrelated, unobservable random variables, each with mean 0 and variance $\sigma^2$.
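For concreteness, the zero-one design matrix just described can be built directly from the group labels, as in the following sketch (the number of levels and the group sizes are invented):

```python
import numpy as np

# Sketch (invented group sizes) of the 0-1 design matrix described above for the
# one-way-classification model: column 1 is all ones, and column j + 1 indicates
# membership in group j, so that beta_1 = mu and beta_{j+1} = alpha_j.
K = 3                                            # number of levels (groups)
N_k = [2, 3, 2]                                  # group sizes N_1, ..., N_K
groups = np.repeat(np.arange(1, K + 1), N_k)     # group label of each of the N data points
N = groups.size

X = np.zeros((N, K + 1))
X[:, 0] = 1.0                                    # x_{i1} = 1 for every i
for j in range(1, K + 1):
    X[:, j] = (groups == j).astype(float)        # x_{i,j+1} = 1 iff point i is in group j

print(X)
```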
In an application of the one-way-classification model, we might wish to make inferences about
$\mu + \alpha_1, \mu + \alpha_2, \ldots, \mu + \alpha_K$. For $k = 1, 2, \ldots, K$ (and "all" $s$)

$$E(y_{ks}) = \mu + \alpha_k.$$

Accordingly, $\mu + \alpha_k$ can be thought of as representing the average of an infinitely large number of
data points, each of which belongs to the $k$th group.
We might also wish to make inferences about various linear combinations of the quantities
$\mu + \alpha_1, \mu + \alpha_2, \ldots, \mu + \alpha_K$, that is, about various quantities of the form

$$\sum_{k=1}^{K} c_k(\mu + \alpha_k). \tag{3.4}$$

When the coefficients $c_1, c_2, \ldots, c_K$ in the linear combination (3.4) are such that $\sum_{k=1}^{K} c_k = 0$,
the linear combination is reexpressible as $\sum_{k=1}^{K} c_k\alpha_k$ and is referred to as a contrast. Perhaps the
simplest kind of contrast is a difference: $\alpha_{k'} - \alpha_k = (\mu + \alpha_{k'}) - (\mu + \alpha_k)$ (where $k' \ne k$). Still another
possibility is that we may wish to make inferences about the quantity $\mu + \alpha_k + d$, where $1 \le k \le K$
and where $d$ is an unobservable random variable that (for $k' = 1, 2, \ldots, K$ and $s = 1, 2, \ldots, N_{k'}$)
is uncorrelated with $e_{k's}$ and that has mean 0 and variance $\sigma^2$. This quantity can be thought of as
representing an individual future data point belonging to the $k$th group.
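For instance (a small made-up case rather than one taken from the text), with $K = 3$ and coefficients $c_1 = 1$, $c_2 = -1$, $c_3 = 0$, which sum to zero,
$$\sum_{k=1}^{3} c_k(\mu + \alpha_k) = (\mu + \alpha_1) - (\mu + \alpha_2) = \alpha_1 - \alpha_2,$$
so $\mu$ cancels and the linear combination is a difference contrast; by comparison, $c_1 = c_2 = c_3 = 1/3$ gives $\mu + \tfrac{1}{3}(\alpha_1 + \alpha_2 + \alpha_3)$, whose coefficients sum to one and which therefore is not a contrast.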
As a variation on model (3.1), we have the model

$$y_{ks} = \mu_k + e_{ks} \qquad (k = 1, 2, \ldots, K;\ s = 1, 2, \ldots, N_k), \tag{3.5}$$

where $\mu_1, \mu_2, \ldots, \mu_K$ are unknown parameters and where the $e_{ks}$'s are as defined earlier [i.e.,
in connection with model (3.1)]. Model (3.5), like model (3.1), is a linear model. It is a simple
example of what is called a means model or a cell-means model; let us refer to it as the one-way-
classification cell-means model. Clearly, $\mu_k = E(y_{ks})$ ($k = 1, 2, \ldots, K$; $s = 1, 2, \ldots, N_k$), so that
(for $k = 1, 2, \ldots, K$) $\mu_k$ is interpretable as the expected value or the "mean" of an arbitrary one of
the random variables $y_{k1}, y_{k2}, \ldots, y_{kN_k}$, whose observed values comprise the $k$th group or "cell."
In making statistical inferences, it matters not whether the inferences are based on model (3.5) or
on model (3.1). Nor does it matter whether the restriction (3.2), or a “similar” restriction, is imposed
on the parameters of model (3.1). For purposes of making inferences about the relevant quantities,
model (3.5) and the restricted and unrestricted versions of model (3.1) are “interchangeable.”
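The "interchangeability" of the cell-means and effects parameterizations can be made concrete with a small sketch (the numerical cell means below are made up): any set of cell means μ_1, ..., μ_K corresponds to an effects representation μ + α_k satisfying restriction (3.2), and vice versa.

```python
import numpy as np

# Hypothetical cell means mu_1, mu_2, mu_3 (arbitrary values).
mu_cells = np.array([10.0, 12.5, 9.0])

# Pass to the effects parameterization mu + alpha_k under the
# sum-to-zero restriction (3.2): take mu to be the average cell mean.
mu = mu_cells.mean()
alpha = mu_cells - mu

print(mu, alpha)
print(np.isclose(alpha.sum(), 0.0))          # restriction (3.2) holds
print(np.allclose(mu + alpha, mu_cells))     # same expected values either way
```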
The number of applications for which the one-way-classification model provides a completely
satisfactory basis for the statistical inferences is relatively small. Even in those applications where
interest centers on a particular factor, the relevant concomitant information is typically not limited
to the information associated with that factor. To insure that the inferences obtained in such a cir-
cumstance are meaningful, they may need to be based on a model that accounts for the additional
information.
Suppose, for example, that each data point consists of the amount of milk produced during the
first lactation of a different one of N dairy cows. And suppose that Nk of the cows are the daughters
of the kth of K bulls (where N1 ; N2 ; : : : ; NK are positive integers that sum to N ). Interest might
center on differences among the respective “breeding values” of the K bulls, that is, on differences
in the “average” amounts of milk produced by infinitely large numbers of future daughters under
circumstances that are similar from bull to bull. Any inferences about these differences that are based
on a one-way-classification model (in which the factor is that whose levels correspond to the bulls)
are likely to be at least somewhat misleading. There are factors of known importance that are not
accounted for by this model. These include a factor for the time period during which the lactation
was initiated and a factor for the herd to which the cow belongs. The importance of these factors
is due to the presence of seasonal differences, environmental and genetic trends, and environmental
and genetic differences among herds.
The one-way-classification model may be unsuitable as a basis for making inferences from the
milk-production data not only because of the omission of important factors, but also because the
assumption that the eks ’s are uncorrelated may not be altogether realistic. Typically, some of the
cows will have ancestors in common on the female side of the pedigree, in which case the eks ’s for
those cows may be positively correlated.
The negative consequences of not having accounted for a factor that has been omitted from a
classificatory model may be exacerbated in a situation in which there is a tendency for the levels
of the omitted factor to be “confounded” with those of an included factor. In the case of the milk-
production data, there may (in the presence of a positive genetic trend) be a tendency for the better
bulls to be associated with the more recent time periods. Moreover, there may be a tendency for
an exceptionally large proportion of the daughters of some bulls to be located in “above-average”
herds, and for an exceptionally large proportion of the daughters of some other bulls to be located in
“below-average” herds.
A failure to account for an important factor may occur for any of several reasons. The factor may
have been mistakenly judged to be irrelevant or at least unimportant. Or the requisite information
about the factor (i.e., knowledge of which data points correspond to which levels) may be unavailable
and may not even have been ascertained (possibly for reasons of cost). Or the factor may be a “hidden”
factor.
In some cases, the data from which the inferences are to be made are those from a designed
experiment. The incorporation of randomization into the design of the experiment serves to limit the
extent of the kind of problematic confounding that may be occasioned by a failure to account for an
important factor. This kind of problematic confounding can still occur, but only to the extent that it
is introduced by chance during the randomization.

1.4 Hierarchical Models and Random-Effects Models


Let us continue to consider the use (in making statistical inferences) of N data points that are to be regarded as the values of observable random variables y_1, y_2, ..., y_N. And let us denote by y the N-dimensional random (column) vector whose elements are y_1, y_2, ..., y_N, respectively. As before, it is supposed that the inferences are to be based on various assumptions about the distribution of y that are referred to collectively as a statistical model. It might be assumed that the distribution of y is known up to the value of a column vector, say θ, of unknown parameters. Or it might simply be assumed that the expected value and the variance-covariance matrix of y are known up to the value of θ. In either case, it is supposed that the quantities about which the inferences are to be made consist of various functions of θ or, more generally, various unobservable random variables—a function of θ can be regarded as a "degenerate" random variable.
The assumptions about the distribution of the observable random variables y_1, y_2, ..., y_N (the assumptions that comprise the statistical model) may not in and of themselves provide an adequate basis for the inferences. In general, these assumptions need to be extended to the joint distribution
of the observable and unobservable random variables (the unobservable random variables about which the inferences are to be made). It might be assumed that the joint distribution of these random variables is known up to the value of θ, or it might be only the expected values and the variances and covariances of these random variables that are assumed to be known up to the value of θ.
The model consisting of assumptions that define the distribution of y (or various characteristics of the distribution of y) up to the value of the parameter vector θ can be subjected to a "hierarchical" approach. This approach gives rise to various alternative models. In the hierarchical approach, θ is regarded as random, and the assumptions that comprise the original model are reinterpreted as assumptions about the conditional distribution of y given θ. Further, the distribution of θ or various characteristics of the distribution of θ are assumed to be known, or at least to be known up to the value of a column vector, say τ, of unknown parameters. These additional assumptions can be thought of as comprising a model for θ.
As an alternative model for y, we have the model obtained by combining the assumptions comprising the original model for y with the assumptions about the distribution of θ (or about its characteristics). This model is referred to as a hierarchical model. In some cases, it can be readily reexpressed in nonhierarchical terms; that is, in terms that do not involve θ. This process is facilitated by the application of some basic results on conditional expectations and on conditional variances and covariances.
For "any" random variable x,
\[
E(x) = E[E(x \mid \theta)], \qquad (4.1)
\]
and
\[
var(x) = E[var(x \mid \theta)] + var[E(x \mid \theta)]. \qquad (4.2)
\]
And, for "any" two random variables x and w,
\[
cov(x, w) = E[cov(x, w \mid \theta)] + cov[E(x \mid \theta),\ E(w \mid \theta)]. \qquad (4.3)
\]
The unconditional expected values in expressions (4.1), (4.2), and (4.3) and the unconditional variance and covariance in expressions (4.2) and (4.3) are those defined with respect to the (marginal) distribution of θ. In general, the expressions for E(x), var(x), and cov(x, w) given by formulas (4.1), (4.2), and (4.3) depend on τ. Formulas (4.1), (4.2), and (4.3) are obtainable from results presented in Chapter 3.
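A small Monte Carlo check can make formulas (4.1)-(4.3) concrete. The two-stage setup below (a scalar θ and, given θ, two conditionally independent normal variables) is an arbitrary choice made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Stage 1: draw theta; stage 2: given theta, x and w are conditionally
# independent with E(x | theta) = theta, E(w | theta) = 2*theta, and
# conditional variances equal to 1.
theta = rng.standard_normal(n)
x = theta + rng.standard_normal(n)
w = 2.0 * theta + rng.standard_normal(n)

# (4.1): E(x) = E[E(x | theta)] = 0
# (4.2): var(x) = E[var(x | theta)] + var[E(x | theta)] = 1 + 1 = 2
# (4.3): cov(x, w) = 0 + cov(theta, 2*theta) = 2
print(round(x.mean(), 3), round(x.var(), 3), round(np.cov(x, w)[0, 1], 3))
```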
Formulas (4.1) and (4.2) can be used in particular to obtain expressions for the unconditional expected values and variances of y_1, y_2, ..., y_N in terms of their conditional expected values and variances—take x = y_i (1 ≤ i ≤ N). Similarly, formula (4.3) can be used to obtain an expression for the unconditional covariance of any two of the random variables y_1, y_2, ..., y_N—take x = y_i and w = y_j (1 ≤ i < j ≤ N). Moreover, if the conditional distribution of y given θ has a probability density function, say f(y | θ), then, upon applying formula (4.1) with x = f(y | θ), we obtain an expression for the probability density function of the unconditional distribution of y.
In a typical implementation of the hierarchical approach, the dimension of the vector τ is significantly smaller than that of the vector θ. The most extreme case is that where the various assumptions about the distribution of θ do not involve unknown parameters; in that case, τ can be regarded as "degenerate" (i.e., of dimension 0). The effects of basing the inferences on the hierarchical model, rather than on the original model, can be either positive or negative. If the additional assumptions (i.e., the assumptions about the distribution of θ or about its characteristics) are at least somewhat reflective of an "underlying reality," the effects are likely to be "beneficial." If the additional assumptions are not sufficiently in conformance with "reality," their inclusion in the model may be "counterproductive."
The hierarchical model itself can be subjected to a hierarchical approach. In this continuation of the hierarchical approach, τ is regarded as random, and the assumptions that comprise the hierarchical model are reinterpreted as assumptions about the conditional distributions of y given θ and τ and of θ given τ or simply about the conditional distribution of y given τ. And the distribution of τ or various characteristics of the distribution of τ are assumed to be known or at least to be known up
to the value of a vector of unknown parameters. In general, further continuations of the hierarchical
approach are possible. Assuming that each continuation results in a reduction in the number of
unknown parameters (as would be the case in a typical implementation), the hierarchical approach
eventually (after some number of continuations) results in a model that does not involve any unknown
parameters.
In general, a model obtained via the hierarchical approach (like any other model) may not in and
of itself provide an adequate basis for the statistical inferences. Instead of applying the approach
just to the assumptions (about the observable random variables y1 ; y2 ; : : : ; yN ) that comprise the
model, the application of the approach may need to be extended to cover any further assumptions
included among those made about the joint distribution of the observable random variables and the
unobservable random variables (the unobservable random variables about which the inferences are
to be made).
Let us now consider the hierarchical approach in the special case where y follows a linear model.
In this special case, there exist (known) numbers x_i1, x_i2, ..., x_iP such that
\[
E(y_i) = \sum_{j=1}^{P} x_{ij}\,\beta_j \qquad (i = 1, 2, \ldots, N) \qquad (4.4)
\]
for (unknown) parameters β_1, β_2, ..., β_P. And the variance-covariance matrix of y_1, y_2, ..., y_N is an N × N matrix, say a matrix Σ, with ijth element σ_ij, that does not depend on β_1, β_2, ..., β_P (though it may depend on unknown parameters other than β_1, β_2, ..., β_P).
Consider an implementation of the hierarchical approach in which only β_1, β_2, ..., β_P are regarded as random. [Think of an implementation of the hierarchical approach in which only some of the unknown parameters are regarded as random as one in which any other unknown parameters are regarded as random variables whose joint distribution is degenerate at (i.e., assigns probability 1 to) a single (unknown) point.] The assumptions about the expected values and variance-covariance matrix of y_1, y_2, ..., y_N are now to be interpreted as applying to the conditional expected values and conditional variance-covariance matrix given β_1, β_2, ..., β_P.
Suppose that the assumptions about the distribution of β_1, β_2, ..., β_P are of the same general form as those about the distribution of y_1, y_2, ..., y_N. More specifically, suppose that the expected values of β_1, β_2, ..., β_P are linear combinations of unknown parameters τ_1, τ_2, ..., τ_{P'}, so that there exist numbers z_j1, z_j2, ..., z_jP' (assumed known) such that
\[
E(\beta_j) = \sum_{k=1}^{P'} z_{jk}\,\tau_k \qquad (j = 1, 2, \ldots, P). \qquad (4.5)
\]
And suppose that the variance-covariance matrix of β_1, β_2, ..., β_P is a P × P matrix, say a matrix Γ with jsth element γ_js, that does not depend on τ_1, τ_2, ..., τ_{P'}—it may depend on various other unknown parameters, some of which may be among those on which the matrix Σ depends.
Making use of formulas (4.1) and (4.3) along with some basic results on the expected values
and the variance and covariances of linear combinations of random variables, we find that, under the
hierarchical model,
\[
E(y_i) = E\Bigl(\sum_{j=1}^{P} x_{ij}\beta_j\Bigr) = \sum_{j=1}^{P} x_{ij}E(\beta_j) = \sum_{j=1}^{P} x_{ij}\sum_{k=1}^{P'} z_{jk}\tau_k
= \sum_{k=1}^{P'}\Bigl(\sum_{j=1}^{P} x_{ij}z_{jk}\Bigr)\tau_k \qquad (4.6)
\]
(i = 1, 2, ..., N) and
\[
cov(y_i, y_{i'}) = E(\sigma_{ii'}) + cov\Bigl(\sum_{j=1}^{P} x_{ij}\beta_j,\ \sum_{j'=1}^{P} x_{i'j'}\beta_{j'}\Bigr)
= \sigma_{ii'} + \sum_{j=1}^{P}\sum_{j'=1}^{P} x_{ij}x_{i'j'}\gamma_{jj'} \qquad (4.7)
\]
(i, i' = 1, 2, ..., N). It follows from results (4.6) and (4.7) that if Σ does not depend on τ_1, τ_2, ..., τ_{P'}, then the hierarchical model, like the original model, is a linear model.
For any integer j between 1 and P, inclusive, such that var(β_j) = 0, we have that β_j = E(β_j) (with probability 1). Thus, for any such integer j, the assumption that E(β_j) = Σ_{k=1}^{P'} z_jk τ_k simplifies in effect to an assumption that β_j = Σ_{k=1}^{P'} z_jk τ_k. In the special case where E(β_j) = τ_{k'} for some integer k' (1 ≤ k' ≤ P'), there is a further simplification to β_j = τ_{k'}. Thus, the hierarchical approach is sufficiently flexible that some of the parameters β_1, β_2, ..., β_P can in effect be retained and included among the parameters τ_1, τ_2, ..., τ_{P'}.
As indicated earlier (in Section 1.1), it is extremely convenient (in working with linear models) to adopt matrix notation. Let β represent the P-dimensional column vector with elements β_1, β_2, ..., β_P, respectively, and τ the P'-dimensional column vector with elements τ_1, τ_2, ..., τ_{P'}, respectively. Then, in matrix notation, equality (4.4) becomes (in the context of the hierarchical approach)
\[
E(y \mid \beta) = X\beta, \qquad (4.8)
\]
where X is the N × P matrix with ijth element x_ij (i = 1, 2, ..., N; j = 1, 2, ..., P). And equality (4.5) becomes
\[
E(\beta) = Z\tau, \qquad (4.9)
\]
where Z is the P × P' matrix with jkth element z_jk (j = 1, 2, ..., P; k = 1, 2, ..., P'). Further, results (4.6) and (4.7) can be recast in matrix notation as
\[
E(y) = XZ\tau \qquad (4.10)
\]
and
\[
var(y) = \Sigma + X\Gamma X' \qquad (4.11)
\]
(where X' denotes the transpose of X, i.e., the P × N matrix with ijth element x_ji). Alternatively, by making use of the more general (matrix) versions of formulas (4.1) and (4.2) presented in Chapter 3, results (4.10) and (4.11) can be derived directly from equalities (4.8) and (4.9).
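The matrix forms (4.10) and (4.11) lend themselves to a quick simulation check; the dimensions, design matrices, and covariance matrices below are arbitrary, and normal distributions are used only for convenience.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, P2 = 4, 3, 2                      # P2 plays the role of P'

X = rng.standard_normal((N, P))
Z = rng.standard_normal((P, P2))
tau = np.array([1.0, -0.5])             # the second-stage parameter vector

A = rng.standard_normal((N, N)); Sigma = A @ A.T + N * np.eye(N)
B = rng.standard_normal((P, P)); Gamma = B @ B.T + P * np.eye(P)

reps = 200_000
beta = rng.multivariate_normal(Z @ tau, Gamma, size=reps)   # E(beta) = Z tau, var(beta) = Gamma
e = rng.multivariate_normal(np.zeros(N), Sigma, size=reps)  # E(e | beta) = 0, var(e | beta) = Sigma
y = beta @ X.T + e                                          # y = X beta + e

print(np.abs(y.mean(axis=0) - X @ Z @ tau).max())             # near 0, as in (4.10)
print(np.abs(np.cov(y.T) - (Sigma + X @ Gamma @ X.T)).max())  # near 0, as in (4.11)
```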
The original (linear) model provides a basis for making inferences about an unobservable quantity that is expressible as a linear combination, say Σ_{j=1}^{P} λ_j β_j, of the parameters β_1, β_2, ..., β_P. Or, more generally, it provides a basis for making inferences about an unobservable quantity that is expressible as a random variable w for which E(w) = Σ_{j=1}^{P} λ_j β_j and for which var(w) = ω and cov(y, w) = ρ for some number ω and (column) vector ρ that do not depend on β_1, β_2, ..., β_P. In the context of the hierarchical approach, Σ_{j=1}^{P} λ_j β_j = E(w | β_1, β_2, ..., β_P), ω = var(w | β_1, β_2, ..., β_P), and ρ = cov(y, w | β_1, β_2, ..., β_P).


Let ρ_i represent the ith element of ρ (i = 1, 2, ..., N). Then, making use of formulas (4.1), (4.2), and (4.3) and proceeding in much the same way as in the derivation of results (4.6) and (4.7), we find that, under the hierarchical model,
\[
E(w) = E\Bigl(\sum_{j=1}^{P} \lambda_j\beta_j\Bigr) = \sum_{j=1}^{P} \lambda_j E(\beta_j) = \sum_{j=1}^{P} \lambda_j \sum_{k=1}^{P'} z_{jk}\tau_k
= \sum_{k=1}^{P'}\Bigl(\sum_{j=1}^{P} \lambda_j z_{jk}\Bigr)\tau_k, \qquad (4.12)
\]
\[
var(w) = E(\omega) + var\Bigl(\sum_{j=1}^{P} \lambda_j\beta_j\Bigr) = \omega + \sum_{j=1}^{P}\sum_{j'=1}^{P} \lambda_j\lambda_{j'}\gamma_{jj'}, \qquad (4.13)
\]
and (for i = 1, 2, ..., N)
\[
cov(y_i, w) = E(\rho_i) + cov\Bigl(\sum_{j=1}^{P} x_{ij}\beta_j,\ \sum_{j'=1}^{P} \lambda_{j'}\beta_{j'}\Bigr)
= \rho_i + \sum_{j=1}^{P}\sum_{j'=1}^{P} x_{ij}\lambda_{j'}\gamma_{jj'}. \qquad (4.14)
\]

Clearly, if ω and ρ do not depend on τ_1, τ_2, ..., τ_{P'}, then neither do expressions (4.13) and (4.14).
As in the case of results (4.6) and (4.7), results (4.12), (4.13), and (4.14) can be recast in matrix notation. Denote by λ the P-dimensional column vector with elements λ_1, λ_2, ..., λ_P, respectively—the linear combination Σ_{j=1}^{P} λ_j β_j is reexpressible as Σ_{j=1}^{P} λ_j β_j = λ'β. Under the hierarchical model,
\[
E(w) = \lambda' Z\tau, \qquad (4.15)
\]
\[
var(w) = \omega + \lambda'\Gamma\lambda, \qquad (4.16)
\]
and
\[
cov(y, w) = \rho + X\Gamma\lambda. \qquad (4.17)
\]

The hierarchical approach is not the only way of arriving at a model characterized by expected
values and variances and covariances of the form (4.6) and (4.7) or, equivalently, of the form (4.10)
and (4.11). Under the original model, the distribution of y_1, y_2, ..., y_N, and w is such that
\[
y_i = \sum_{j=1}^{P} x_{ij}\beta_j + e_i \qquad (i = 1, 2, \ldots, N) \qquad (4.18)
\]
and
\[
w = \sum_{j=1}^{P} \lambda_j\beta_j + d, \qquad (4.19)
\]
where e_1, e_2, ..., e_N, and d are random variables, each with expected value 0. Or, equivalently, the distribution of y and w is such that
\[
y = X\beta + e \qquad (4.20)
\]
and
\[
w = \lambda'\beta + d, \qquad (4.21)
\]
where e is the N-dimensional random (column) vector with elements e_1, e_2, ..., e_N and hence with E(e) = 0. Moreover, under the original model, var(e) = var(y) = Σ, var(d) = var(w) = ω, and cov(e, d) = cov(y, w) = ρ.
Now, suppose that instead of taking β_1, β_2, ..., β_P to be parameters, they are (as in the hierarchical approach) taken to be random variables with expected values of the form (4.5)—in which case, β is a random vector with an expected value of the form (4.9)—and with variance-covariance matrix Γ. Suppose further that each of the random variables β_1, β_2, ..., β_P is uncorrelated with each of the random variables e_1, e_2, ..., e_N, and d [or, equivalently, that cov(β, e) = 0 and cov(β, d) = 0] or, perhaps more generally, suppose that each of the quantities Σ_{j=1}^{P} x_1j β_j, Σ_{j=1}^{P} x_2j β_j, ..., Σ_{j=1}^{P} x_Nj β_j is uncorrelated with each of the random variables e_1, e_2, ..., e_N, and d [or, equivalently, that cov(Xβ, e) = 0 and cov(Xβ, d) = 0]. And consider the effect of these suppositions about β_1, β_2, ..., β_P on the distribution of the N + 1 random variables (4.18) and (4.19) (specifically the effect on their expected values and their variances and covariances).
The suppositions about β_1, β_2, ..., β_P are equivalent to the supposition that
\[
\beta_j = \sum_{k=1}^{P'} z_{jk}\tau_k + \delta_j \qquad (j = 1, 2, \ldots, P), \qquad (4.22)
\]
where δ_1, δ_2, ..., δ_P are random variables with expected values of 0 and variance-covariance matrix Γ and with the property that each of the linear combinations Σ_{j=1}^{P} x_1j δ_j, Σ_{j=1}^{P} x_2j δ_j, ..., Σ_{j=1}^{P} x_Nj δ_j is uncorrelated with each of the random variables e_1, e_2, ..., e_N, and d. Or, in matrix notation,
\[
\beta = Z\tau + \delta, \qquad (4.23)
\]
where δ is the P-dimensional random (column) vector with elements δ_1, δ_2, ..., δ_P and hence with E(δ) = 0, var(δ) = Γ, cov(Xδ, e) = 0, and cov(Xδ, d) = 0.
Upon replacing β_1, β_2, ..., β_P in expressions (4.18) and (4.19) with the expressions for β_1, β_2, ..., β_P given by result (4.22), we obtain the expressions
\[
y_i = \sum_{k=1}^{P'}\Bigl(\sum_{j=1}^{P} x_{ij}z_{jk}\Bigr)\tau_k + f_i \qquad (i = 1, 2, \ldots, N), \qquad (4.24)
\]
where (for i = 1, 2, ..., N) f_i = e_i + Σ_{j=1}^{P} x_ij δ_j, and the expression
\[
w = \sum_{k=1}^{P'}\Bigl(\sum_{j=1}^{P} \lambda_j z_{jk}\Bigr)\tau_k + g, \qquad (4.25)
\]
where g = d + Σ_{j=1}^{P} λ_j δ_j. Results (4.24) and (4.25) can be restated in matrix notation as
\[
y = XZ\tau + f, \qquad (4.26)
\]
where f is the N-dimensional random (column) vector with elements f_1, f_2, ..., f_N and hence where f = e + Xδ, and
\[
w = \lambda' Z\tau + g, \qquad (4.27)
\]
where g = d + λ'δ. Alternatively, expressions (4.26) and (4.27) are obtainable by replacing β in expressions (4.20) and (4.21) with expression (4.23).
Clearly, E(f_i) = 0 (i = 1, 2, ..., N), or equivalently E(f) = 0, and E(g) = 0. Further, by making use of some basic results on the variances and covariances of linear combinations of random variables [in essentially the same way as in the derivation of results (4.7), (4.13), and (4.14)], we find that
\[
cov(f_i, f_{i'}) = \sigma_{ii'} + \sum_{j=1}^{P}\sum_{j'=1}^{P} x_{ij}x_{i'j'}\gamma_{jj'} \qquad (i, i' = 1, 2, \ldots, N),
\]
\[
var(g) = \omega + \sum_{j=1}^{P}\sum_{j'=1}^{P} \lambda_j\lambda_{j'}\gamma_{jj'},
\]
and
\[
cov(f_i, g) = \rho_i + \sum_{j=1}^{P}\sum_{j'=1}^{P} x_{ij}\lambda_{j'}\gamma_{jj'} \qquad (i = 1, 2, \ldots, N),
\]
or, equivalently, that
\[
var(f) = \Sigma + X\Gamma X', \qquad var(g) = \omega + \lambda'\Gamma\lambda,
\]
and
\[
cov(f, g) = \rho + X\Gamma\lambda.
\]
These results imply that, as in the case of the hierarchical approach, the expected values and the
variances and covariances of the random variables y1 ; y2 ; : : : ; yN , and w are of the form (4.6),
(4.12), (4.7), (4.13), and (4.14), or, equivalently, that E.y/, E.w/, var.y/, var.w/, and cov.y; w/ are
of the form (4.10), (4.15), (4.11), (4.16), and (4.17). Let us refer to this alternative way of arriving at
a model characterized by expected values and variances and covariances of the form (4.6) and (4.7),
or equivalently (4.10) and (4.11), as the random-effects approach.
The assumptions comprising the original model are such that (for i = 1, 2, ..., N) E(y_i) = Σ_{j=1}^{P} x_ij β_j and are such that the variance-covariance matrix of y_1, y_2, ..., y_N equals Σ (where Σ does not vary with β_1, β_2, ..., β_P). In the hierarchical approach, these assumptions are regarded as applying to the conditional distribution of y_1, y_2, ..., y_N given β_1, β_2, ..., β_P. Thus, in the hierarchical approach, the random vector e in decomposition (4.20) is such that (with probability 1) E(e_i | β_1, β_2, ..., β_P) = 0 (i = 1, 2, ..., N) and cov(e_i, e_{i'} | β_1, β_2, ..., β_P) = σ_ii' (i, i' = 1, 2, ..., N) or, equivalently, E(e | β) = 0 and var(e | β) = Σ.
The random-effects approach results in the same alternative model as the hierarchical approach and does so under less stringent assumptions. In the random-effects approach, it is assumed that the (unconditional) distribution of e is such that (for i = 1, 2, ..., N) E(e_i) = 0 or, equivalently, E(e) = 0. It is also assumed that the joint distribution of e and of the random vector δ in decomposition (4.23) is such that (for i, i' = 1, 2, ..., N) cov(Σ_{j=1}^{P} x_ij δ_j, e_{i'}) = 0 or, equivalently, cov(Xδ, e) = 0. By making use of formulas (4.1) and (4.3), these assumptions can be restated as follows: (for i = 1, 2, ..., N) E[E(e_i | β_1, β_2, ..., β_P)] = 0 and (for i, i' = 1, 2, ..., N) cov[Σ_{j=1}^{P} x_ij δ_j, E(e_{i'} | β_1, β_2, ..., β_P)] = 0 or, equivalently, E[E(e | β)] = 0 and cov[Xδ, E(e | β)] = 0. Moreover, in the random-effects approach,
\[
\sigma_{ii'} = cov(e_i, e_{i'}) = E[cov(e_i, e_{i'} \mid \beta_1, \beta_2, \ldots, \beta_P)] + cov[E(e_i \mid \beta_1, \beta_2, \ldots, \beta_P),\ E(e_{i'} \mid \beta_1, \beta_2, \ldots, \beta_P)]
\]
(i, i' = 1, 2, ..., N) or, equivalently,
\[
\Sigma = var(e) = E[var(e \mid \beta)] + var[E(e \mid \beta)].
\]

Let us now specialize even further by considering an application of the hierarchical approach
or random-effects approach in a setting where the N data points have been partitioned into K
groups, corresponding to the first through Kth levels of a single qualitative factor. As in our previous
discussion of this setting (in Section 1.3), let us write yk1 ; yk2 ; : : : ; ykNk for those of the random
variables y1 ; y2 ; : : : ; yN associated with the kth level (k D 1; 2; : : : ; K).
A possible model is the one-way-classification cell-means model, in which

\[
y_{ks} = \mu_k + e_{ks} \qquad (k = 1, 2, \ldots, K;\ s = 1, 2, \ldots, N_k), \qquad (4.28)
\]
where μ_1, μ_2, ..., μ_K are unknown parameters and where the e_{ks}'s are uncorrelated, unobservable random variables, each with mean 0 and (for a strictly positive parameter σ of unknown value) variance σ². As previously indicated (in Section 1.3), this model qualifies as a linear model.
Let us apply the hierarchical approach or random-effects approach to the one-way-classification cell-means model. Suppose that μ_1, μ_2, ..., μ_K (but not σ) are regarded as random. Suppose further that μ_1, μ_2, ..., μ_K are uncorrelated and that they have a common, unknown mean, say μ, and (for a nonnegative parameter σ_α of unknown value) a common variance σ_α². Or, equivalently, suppose that
\[
\mu_k = \mu + \alpha_k \qquad (k = 1, 2, \ldots, K), \qquad (4.29)
\]
where α_1, α_2, ..., α_K are uncorrelated random variables having mean 0 and a common variance σ_α². (And assume that σ_α and μ are not functionally dependent on σ.)
Under the original model (i.e., the 1-way-classification cell-means model), we have that (for k = 1, 2, ..., K and s = 1, 2, ..., N_k)
\[
E(y_{ks}) = \mu_k \qquad (4.30)
\]
and that (for k, k' = 1, 2, ..., K; s = 1, 2, ..., N_k; and s' = 1, 2, ..., N_{k'})
\[
cov(y_{ks}, y_{k's'}) =
\begin{cases}
\sigma^2, & \text{if } k' = k \text{ and } s' = s,\\
0, & \text{otherwise.}
\end{cases}
\qquad (4.31)
\]
In the hierarchical approach, the expected value (4.30) and the covariance (4.31) are regarded as a conditional (on μ_1, μ_2, ..., μ_K) expected value and a conditional covariance.
The same model that would be obtained by applying the hierarchical approach (the so-called hierarchical model) is obtainable via the random-effects approach. Accordingly, assume that each of the random variables α_1, α_2, ..., α_K [in representation (4.29)] is uncorrelated with each of the N random variables e_ks (k = 1, 2, ..., K; s = 1, 2, ..., N_k). In this setting, the random-effects approach can be implemented by replacing μ_1, μ_2, ..., μ_K in representation (4.28) with the expressions for μ_1, μ_2, ..., μ_K comprising representation (4.29). This operation gives
\[
y_{ks} = \mu + \alpha_k + e_{ks} \qquad (k = 1, 2, \ldots, K;\ s = 1, 2, \ldots, N_k) \qquad (4.32)
\]
or, upon letting f_ks = α_k + e_ks,
\[
y_{ks} = \mu + f_{ks} \qquad (k = 1, 2, \ldots, K;\ s = 1, 2, \ldots, N_k). \qquad (4.33)
\]
Result (4.32) or (4.33) defines an alternative to the one-way-classification cell-means model.


Under the alternative model, we have that (for k = 1, 2, ..., K and s = 1, 2, ..., N_k) E(f_ks) = 0 and hence
\[
E(y_{ks}) = \mu
\]
and that (for k, k' = 1, 2, ..., K; s = 1, 2, ..., N_k; and s' = 1, 2, ..., N_{k'})
\[
cov(y_{ks}, y_{k's'}) = cov(f_{ks}, f_{k's'}) =
\begin{cases}
\sigma^2 + \sigma_\alpha^2, & \text{if } k' = k \text{ and } s' = s,\\
\sigma_\alpha^2, & \text{if } k' = k \text{ and } s' \ne s,\\
0, & \text{if } k' \ne k.
\end{cases}
\]
Thus, under the alternative model, all of the y_ks's have the same expected value, and those of the y_ks's that are associated with the same level may be positively correlated. Under the original model, the expected values of the y_ks's may vary with the level, and none of the y_ks's are correlated.
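To make the covariance structure of the alternative model concrete, the sketch below builds var(y) for made-up values of σ², σ_α², and the group sizes; it is an illustration only.

```python
import numpy as np

sigma2, sigma2_alpha = 2.0, 0.5       # hypothetical variance components
group_sizes = [2, 3, 2]
N = sum(group_sizes)

# Under the random-effects model: var(y_ks) = sigma2 + sigma2_alpha,
# cov within a group = sigma2_alpha, cov across groups = 0, so var(y)
# is block-diagonal with one block per group.
V = np.zeros((N, N))
row = 0
for n_k in group_sizes:
    V[row:row + n_k, row:row + n_k] = sigma2 * np.eye(n_k) + sigma2_alpha * np.ones((n_k, n_k))
    row += n_k
print(V)

# Under the original cell-means model, var(y) would instead be sigma2 * I.
```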
Representation (4.32) is of the same form as representation (3.1), which is identified with the
one-way-classification fixed-effects model. However, the quantities α_1, α_2, ..., α_K that appear in representation (4.32) are random variables and are referred to as random effects, whereas the quantities α_1, α_2, ..., α_K that appear in representation (3.1) are (unknown) parameters that are referred to
as fixed effects. Accordingly, the model obtained from the cell-means model via the random-effects
approach (or the corresponding hierarchical approach) is referred to as the one-way-classification
random-effects model.
The one-way-classification random-effects model can be obtained not only via an application
of the random-effects approach or hierarchical approach to the one-way-classification cell-means
model, but also via an application of the random-effects approach or hierarchical approach to the
one-way-classification fixed-effects model. In the case of the random-effects approach, simply regard
the effects α_1, α_2, ..., α_K of the fixed-effects model as random variables rather than (unknown) parameters, assume that α_1, α_2, ..., α_K are uncorrelated, each with mean 0 and variance σ_α², and assume that each of the α_k's is uncorrelated with each of the e_ks's. Then, by proceeding in much
the same way as in the application of the random-effects approach to the one-way-classification
cell-means model, we once again arrive at the one-way-classification random-effects model.
We have established that the one-way-classification random-effects model can be obtained by
adding (in the context of a random-effects approach or hierarchical approach) to the assumptions
that comprise the one-way-classification fixed-effects model or the one-way-classification cell-means
model. Under what circumstances are the additional assumptions likely to reflect an underlying reality
and hence to be beneficial? The additional assumptions would seem to be warranted in a circumstance
where the K levels of the factor can reasonably be envisioned as a random sample from an infinitely
large “population” of levels. Or, relatedly, they might be warranted in a circumstance where it is
possible to conceive of K infinitely large sets of data points, each of which corresponds to a different
one of the K levels, and where the average values of the data points in the K sets can reasonably be
regarded as a random sample from an infinitely large population of averages.
Suppose, for example, that each data point consists of the amount of milk produced by a different
one of N dairy cows and that Nk of the cows are the daughters of the kth of K bulls (k D 1; 2; : : : ; K).
Then, as discussed in Section 1.3, interest might center on the differences among the breeding
values of the bulls. Among the models that could conceivably serve as a basis for inferences about
those differences is the one-way-classification fixed-effects model (in which the factor is that whose
levels correspond to the bulls) or the one-way-classification cell-means model. However, under
some circumstances, better results are likely to be obtained by basing the inferences on the one-way-
classification random-effects model. Those circumstances include ones where the underlying reality
is at least reasonably consistent with what might be expected if the K bulls were a random sample
from an infinitely large population of bulls.
As a practical matter, the circumstances are likely to be such that the one-way-classification
random-effects model is too simplistic to provide a satisfactory basis for the inferences. The circum-
stances in which the one-way-classification random-effects model is likely to be inadequate include
those discussed earlier (in Section 1.3) in which inferences based on a one-way-classification fixed-
effects (or cell-means) model are likely to be misleading. They also include other circumstances.
Some of the K bulls may have one or more ancestors in common with some of the other bulls.
Depending on the extent and closeness of the resultant genetic relationships, it may be important
to take those relationships into account. This can be done within the context of the hierarchical or
random-effects approach. Instead of taking the random effects to be uncorrelated, they can be taken
to be correlated in a way and to an extent that reflects the underlying relationships.
There may exist other information about the K bulls that (like the ancestral information) is
“external” to the information provided by the N data points and that is at odds with various of
the assumptions of the one-way-classification random-effects model. For example, the information
might take the form of results from a statistical analysis of some earlier data. We may wish to
base our inferences on a model that accounts for this information. As in the case of the ancestral
information, such a model can (at least in principle) be devised within the context of the hierar-
chical approach or random-effects approach. In our application of the random-effects approach to
the one-way-classification cell-means model, we may wish to modify our assumption that the cell means μ_1, μ_2, ..., μ_K have a common mean μ as well as our assumption that the random effects α_1, α_2, ..., α_K are uncorrelated with a common variance σ_α².
In making inferences on the basis of the one-way-classification cell-means model (4.28), the
quantities of interest are typically ones that are expressible as a linear combination of the cell means μ_1, μ_2, ..., μ_K, say a linear combination Σ_{k=1}^{K} c_k μ_k, or, more generally, quantities that are expressible as a random variable w of the form
\[
w = \sum_{k=1}^{K} c_k\mu_k + d, \qquad (4.34)
\]
where d is a random variable for which E(d) = 0 [and for which var(d) and the covariance of d with each of the e_ks's do not depend on μ_1, μ_2, ..., μ_K]. Let us consider how w is affected by the
application of the random-effects approach (to the one-way-classification cell-means model). In that
approach, μ_1, μ_2, ..., μ_K are regarded as random variables that are expressible in the form (4.29). And it is supposed that each of the random effects α_1, α_2, ..., α_K is uncorrelated with d (as well as with each of the e_ks's).
The random variable w can be reexpressed in terms of the parameter μ and the random effects α_1, α_2, ..., α_K. Upon replacing μ_1, μ_2, ..., μ_K in expression (4.34) with the expressions for μ_1, μ_2, ..., μ_K given by representation (4.29), we find that
\[
w = \Bigl(\sum_{k=1}^{K} c_k\Bigr)\mu + \sum_{k=1}^{K} c_k\alpha_k + d \qquad (4.35)
\]
\[
\phantom{w} = \Bigl(\sum_{k=1}^{K} c_k\Bigr)\mu + g, \qquad (4.36)
\]
where g = d + Σ_{k=1}^{K} c_k α_k.
Expression (4.36) gives w in terms of the parameter μ and a random variable g. Clearly, E(g) = 0. And it follows from a basic formula on the variance of a linear combination of uncorrelated random variables that
\[
var(g) = var(d) + \sigma_\alpha^2 \sum_{k=1}^{K} c_k^2. \qquad (4.37)
\]
Recall that (for k = 1, 2, ..., K and s = 1, 2, ..., N_k) f_ks = α_k + e_ks. Under the original (1-way-classification cell-means) model, cov(y_ks, w) = cov(e_ks, d), while under the alternative (1-way-classification random-effects) model, cov(y_ks, w) = cov(f_ks, g). Even if e_ks and d are uncorrelated, f_ks and g may be (and in many cases are) correlated. In fact,
\[
\begin{aligned}
cov(f_{ks}, g) &= cov(e_{ks}, d) + cov\Bigl(e_{ks}, \sum_{k'=1}^{K} c_{k'}\alpha_{k'}\Bigr) + cov(\alpha_k, d) + cov\Bigl(\alpha_k, \sum_{k'=1}^{K} c_{k'}\alpha_{k'}\Bigr)\\
&= cov(e_{ks}, d) + 0 + 0 + cov\Bigl(\alpha_k, \sum_{k'=1}^{K} c_{k'}\alpha_{k'}\Bigr)\\
&= cov(e_{ks}, d) + cov\Bigl(\alpha_k, \sum_{k'=1}^{K} c_{k'}\alpha_{k'}\Bigr),
\end{aligned}
\qquad (4.38)
\]
as can be readily verified by applying a basic formula for a covariance between sums of random variables. And, in light of the assumption that α_1, α_2, ..., α_K are uncorrelated, each with variance σ_α², result (4.38) simplifies to
\[
cov(f_{ks}, g) = cov(e_{ks}, d) + c_k\sigma_\alpha^2. \qquad (4.39)
\]

As noted earlier in this section, the one-way-classification random-effects model can be obtained
by applying a hierarchical approach or random-effects approach to the one-way-classification fixed-
effects model (as well as by application to the 1-way-classification cell-means model). A related observation (pertaining to inference about a random variable w) is that the application of the hierarchical or random-effects approach to the one-way-classification fixed-effects model has the same effect on the random variable w defined by w = Σ_{k=1}^{K} c_k(μ + α_k) + d as the application to the one-way-classification cell-means model has on the random variable w = Σ_{k=1}^{K} c_k μ_k + d.
It is worth noting that results pertaining to inference about a random variable w based on the
one-way-classification random-effects model can be readily extended from a random variable w that
is expressible in the form (4.35) to one that is expressible in the more general form
\[
w = c_0\mu + \sum_{k=1}^{K'} c_k\alpha_k + d,
\]
where K' ≥ K, where α_{K+1}, α_{K+2}, ..., α_{K'} (like α_1, α_2, ..., α_K) are random variables having expected values of 0, and where c_0 and c_{K+1}, c_{K+2}, ..., c_{K'} (like c_1, c_2, ..., c_K) represent arbitrary numbers. Here, α_{K+1}, α_{K+2}, ..., α_{K'} are interpretable as random effects that correspond to levels of the factor that are not among those K levels represented in the data. For example, in the case of the milk-production data, the bulls about whose breeding values we might wish to make inferences might include some bulls that do not have any daughters (or at least none who contributed to the data). In the simplest case, the assumption that the α_k's are uncorrelated, each with variance σ_α², is not restricted to α_1, α_2, ..., α_K, but rather is taken to apply to all K' of the random variables α_1, α_2, ..., α_{K'}. In some settings, it might be appropriate (and fruitful) to allow α_{K+1}, α_{K+2}, ..., α_{K'} to be correlated with α_1, α_2, ..., α_K (e.g., some or all of the K' − K additional bulls may be related to some of the first K bulls).

1.5 Statistical Inference


Let us continue to consider the use, in making inferences about an unobservable quantity, of N data
points that are to be regarded as the respective values of observable random variables y_1, y_2, ..., y_N. It is supposed that the inferences are to be based on a statistical model, in which the distribution of the N-dimensional random (column) vector y whose elements are y_1, y_2, ..., y_N, respectively, is specified up to the value of a (column) vector θ of unknown parameters or in which various characteristics of the distribution of y (such as its first and second moments) are specified up to the value of θ. It is supposed further that the quantity of interest is a function, say ψ(·), of θ or, more generally, is an unobservable random variable w whose distribution may depend on θ. (For convenience, we refer to the random variable w as the quantity of interest and speak of inference about w; we do so even though, in any particular application, it is the realization of w that corresponds to the quantity of interest and that is the subject of the inferences.)
Inference about w might consist of point estimation (or prediction), interval (or set) estimation (or prediction), or hypothesis testing. Let t(y) represent a function of y, the realized value of which [i.e., the value of t(y) corresponding to the observed value of y] is to be regarded as an estimate (a so-called point estimate) of the value of w. This function is referred to as a point estimator. Or if w represents a future quantity, t(y) might be referred to as a point predictor.
The difference t(y) − w between the estimator or predictor and the random variable w is referred to as the error of estimation or prediction. The performance of the estimator or predictor t(y) in repeated applications might be assessed on the basis of various characteristics of the distribution of the estimation or prediction error. In the special case where w is the parametric function ψ(θ), the distribution of the estimation or prediction error is determined by the distribution of y, and repeated applications refers to repeated draws of y-values. However, in general, the distribution of the estimation or prediction error is that determined by the joint distribution of w and y, and repeated application refers to repeated draws of both w- and y-values.
The function t(y) is said to be an unbiased estimator or predictor (of w) if E[t(y) − w] = 0 or, equivalently, if E[t(y)] = E(w). An unbiased estimator or predictor is well-calibrated in the sense that, over an infinitely long sequence of repeated applications, the average value of the estimator or predictor would (in theory) be the same as the average value of the quantity being estimated or predicted.
The second moment E{[t(y) − w]²} of the distribution of the estimation or prediction error is referred to as the mean squared error (MSE) of t(y). And the square root √E{[t(y) − w]²} of the MSE of t(y) is referred to as the root mean squared error (root MSE) of t(y). The root MSE can be regarded as a (theoretical) measure of the magnitude of the estimation or prediction errors that would be incurred in an infinitely long sequence of repeated applications. Note that if t(y) is an unbiased
estimator or predictor, then E{[t(y) − w]²} = var[t(y) − w], that is, the MSE of t(y) equals the variance of its estimation or prediction error. In the special case where w is a parametric function, the variance var[t(y) − w] of the estimation or prediction error equals the variance var[t(y)] of the estimator or predictor; however, in general, var[t(y) − w] is not necessarily equal to var[t(y)]. In the special case where w is a parametric function and where t(y) is an unbiased estimator or predictor, the root MSE of t(y) equals √var[t(y)] and may be referred to as the standard error (SE) of t(y). In the point estimation or prediction of w, it is desirable to include an estimate of the SE or root MSE of the estimator or predictor.
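These definitions are easy to illustrate by simulation. In the sketch below (an assumption-laden toy example, not from the text), the sample mean of n normal observations is used as an estimator of the parametric function w = μ, and its bias, root MSE, and theoretical standard error are compared.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 5.0, 2.0, 10, 100_000   # hypothetical values

# Estimation errors t(y) - w over repeated draws of y, with t(y) the sample mean.
errors = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1) - mu

print(errors.mean())                   # near 0: the estimator is unbiased
print(np.sqrt((errors ** 2).mean()))   # root MSE
print(sigma / np.sqrt(n))              # the SE sigma / sqrt(n), which the root MSE approximates
```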
In interval (or set) estimation or prediction, inferences about w are in the form of one or more
intervals or, more generally, one or more (1-dimensional) sets. The end points of each interval are
functions of y; more generally, the membership of each set varies with y. Associated with each
interval or set is a numerical measure that may be helpful in assessing the “chances” of the interval
or set including or “covering” w.
Let S(y) represent an arbitrary (1-dimensional) set (the membership of which varies with y). The probability Pr[w ∈ S(y)] is referred to as the probability of coverage. It is interpretable as the (theoretical) frequency of the event w ∈ S(y) in an infinitely long sequence of repeated applications (involving repeated draws of both w- and y-values). Clearly, Pr[w ∈ S(y)] = 1 − Pr[w ∉ S(y)].
In general, Pr[w ∉ S(y)] may depend on θ and/or on characteristics of the joint distribution of w and y not covered by the assumptions that comprise the statistical model. If S(y) is such that Pr[w ∉ S(y)] = α (uniformly for all distributions that conform to the underlying assumptions), then S(y) is said to be a 100(1 − α)% confidence set for w [or, if appropriate, a 100(1 − α)% confidence interval for w]. More generally, S(y) is said to be a 100(1 − α)% confidence set or interval for w if the supremum of Pr[w ∉ S(y)] (over all distributions that conform to the underlying assumptions) equals α. The infimum of the probability of coverage of a 100(1 − α)% confidence interval or set equals 1 − α; this number is referred to as the confidence coefficient or confidence level. When w represents a future quantity, the confidence interval or set might be referred to as a prediction interval or set.
In making statistical inferences about w, it may be instructive to form a 100(1 − α)% confidence interval or set for each of several different values of α. Models that provide an adequate basis for point estimation may not provide an adequate basis for obtaining confidence intervals or sets. Rather, we may require a more elaborate model that is rather specific about the form of the distribution of y (or, in general, the joint distribution of w and y). In the case of a linear model, it is common practice to take the form of this distribution to be multivariate normal.
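The frequency interpretation of the probability of coverage can be checked by simulation. The sketch below assumes normally distributed data and uses the usual t-interval for a normal mean; neither choice is dictated by the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, sigma, n, alpha, reps = 5.0, 2.0, 10, 0.05, 50_000   # hypothetical values

y = rng.normal(mu, sigma, size=(reps, n))
ybar = y.mean(axis=1)
s = y.std(axis=1, ddof=1)
half = stats.t.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)

covered = (ybar - half <= mu) & (mu <= ybar + half)
print(covered.mean())   # close to the confidence coefficient 1 - alpha = 0.95
```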
In making inferences about a parametric function, there are circumstances that would seem
to call for the inclusion of hypothesis testing. Suppose that w represents the parametric function ψ(θ). Suppose further that, for some value w^(0) of this function, interest centers on the question of whether the hypothesis H_0: w = w^(0) (called the null hypothesis) is "consistent with the data." Corresponding to a test of the null hypothesis H_0 is a set C of N-dimensional (column) vectors that is referred to as the critical region (or the rejection region) and the complementary set A (consisting of those N-dimensional vectors not contained in C), which is referred to as the acceptance region. The test consists of accepting H_0, if y ∈ A, and of rejecting H_0, if y ∈ C.
If (for some number α between 0 and 1) the probability Pr(y ∈ C) of rejecting H_0 equals α whenever ψ(θ) = w^(0) (i.e., whenever H_0 is true), then the test is said to be a size-α test. More generally, the probability Pr(y ∈ C) of rejecting H_0 when H_0 is true may depend on θ and/or on characteristics of the distribution of y not covered by the assumptions that comprise the model, in which case the test is said to be a size-α test if (under H_0) the supremum of Pr(y ∈ C) equals α. Under either the null hypothesis H_0 or the hypothesis H_1: w ≠ w^(0) (called the alternative hypothesis), Pr(y ∈ C) is interpretable as the (theoretical) frequency with which H_0 is rejected in an infinitely long sequence of repeated applications (involving repeated draws from the distribution of y).
Corresponding to any 100(1 − α)% confidence interval or set that may exist for the parametric function w = ψ(θ) is a size-α test of H_0. If S(y) is a 100(1 − α)% confidence interval or set for w = ψ(θ), then a size-α test of H_0 is obtained by taking the acceptance region A to be the set of y-values for which w^(0) ∈ S(y) (e.g., Casella and Berger 2002, sec. 9.2). If the model is to provide an adequate basis for hypothesis testing, it may (as in the case of interval or set estimation) be necessary to include rather specific assumptions about the form of the distribution of y.
There may be a size-α test of H_0 for every α between 0 and 1. Let (for 0 < α < 1) C(α) represent the critical region of the size-α test. And suppose (as would often be the case) that the critical regions are nested in the sense that, for 0 < α_1 < α_2 < 1, C(α_1) is a proper subset of C(α_2). Then, the infimum of the set {α : y ∈ C(α)} (i.e., the infimum of those α-values for which H_0 is rejected by the size-α test) is referred to as the p-value. Instead of reporting the results of the size-α test for one or more values of α, it may be preferable to report the p-value—clearly, the p-value conveys more information.
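As a sketch of the p-value definition (again with made-up numbers), consider a two-sided z-test of H_0: μ = μ_0 with known σ, whose size-α critical regions |z| > z_{1−α/2} are nested in α; the infimum of the rejecting α-values coincides with the familiar tail-probability formula.

```python
import numpy as np
from scipy import stats

mu0, sigma, n, ybar = 0.0, 1.0, 25, 0.42     # hypothetical values
z = (ybar - mu0) / (sigma / np.sqrt(n))

# Closed form: the p-value of the two-sided z-test.
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

# Definition: the infimum of the alpha for which the size-alpha test rejects.
alphas = np.linspace(1e-4, 0.9999, 20_000)
rejects = abs(z) > stats.norm.ppf(1 - alphas / 2)
print(p_value, alphas[rejects].min())        # the two values agree, up to the grid spacing
```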
In practice, there will typically be more than one unobservable quantity of interest. Suppose that inferences are to be made about M unobservable quantities, and that these quantities are to be regarded as the respective values of M random variables w_1, w_2, ..., w_M (whose joint distribution may depend on θ)—this formulation includes the case where some or all of the unobservable quantities are functions of θ. One approach is to deal with the M unobservable quantities separately, that is, to
make inferences about each of these quantities as though that quantity were the only quantity about
which inferences were being made. That approach lends itself to misinterpretation of results (more
so in the case of interval or set estimation or prediction and hypothesis testing than in the case of
point estimation or prediction) and to unwarranted conclusions.
The potential pitfalls can be circumvented by adopting an alternative approach in which the M
unobservable quantities of interest are dealt with simultaneously. Let w represent the M -dimensional
unobservable random (column) vector whose elements are w1 ; w2 ; : : : ; wM , respectively. Inference
about the value of the vector w can be regarded as synonymous with simultaneous inference about
w1 ; w2 ; : : : ; wM . Many of the concepts introduced earlier (in connection with inference about a
single unobservable random variable w) can be readily extended to inference about w.
Let t(y) represent a vector-valued function (in the form of an M-dimensional column vector) of y, the realized value of which is to be regarded as a (point) estimate of the value of the vector w. This function is referred to as a point estimator (or predictor), and the vector t(y) − w is referred to as the error of estimation or prediction. The estimator or predictor t(y) is said to be unbiased if E[t(y) − w] = 0 or, equivalently, if E[t(y)] = E(w).
The elements, say t_1(y), t_2(y), ..., t_M(y), of t(y) can be regarded as estimators or predictors of w_1, w_2, ..., w_M, respectively. Clearly, t(y) is an unbiased estimator or predictor of the vector w if and only if, for i = 1, 2, ..., M, t_i(y) is an unbiased estimator or predictor of w_i. The M × M matrix E{[t(y) − w][t(y) − w]'} is referred to as the mean-squared-error (MSE) matrix of t(y). Its ijth element equals E{[t_i(y) − w_i][t_j(y) − w_j]} (i, j = 1, 2, ..., M), implying in particular that the diagonal elements of the MSE matrix are the MSEs of t_1(y), t_2(y), ..., t_M(y), respectively. If t(y) is an unbiased estimator or predictor of w, then its MSE matrix equals the variance-covariance matrix of the vector t(y) − w.
Turning now to the set estimation or prediction of the vector w, let S(y) represent an arbitrary set of M-dimensional (column) vectors (the membership of which varies with y). The terminology introduced earlier (in connection with the interval or set estimation or prediction of an unobservable random variable w) extends in a straightforward way to the set estimation or prediction of w. In particular, Pr[w ∈ S(y)] is referred to as the probability of coverage. As in the one-dimensional case, Pr[w ∈ S(y)] = 1 − Pr[w ∉ S(y)]. Further, the set S(y) is said to be a 100(1 − α)% confidence set if the supremum of Pr[w ∉ S(y)] (over all distributions of w and y that conform to the underlying assumptions) equals α. And, accordingly, the infimum of the probability of coverage of a 100(1 − α)% confidence set equals 1 − α, a number that is referred to as the confidence coefficient or confidence level.
Inference about the vector w can also take the form of M one-dimensional sets, one set for each of its elements. For j = 1, 2, ..., M, let S_j(y) represent an interval (whose end points depend on y) or, more generally, a one-dimensional set (the membership of which varies with y) of w_j-values. A possible criterion for evaluating these sets (as confidence sets for w_1, w_2, ..., w_M, respectively) is the probability (for some relatively small integer k ≥ 1) that w_j ∈ S_j(y) for at least M − k + 1 values of j. In the special case where k = 1, this probability is referred to as the probability of simultaneous coverage. It is worth noting that the probability of simultaneous coverage can be drastically smaller than any of the M "individual" probabilities of coverage Pr[w_1 ∈ S_1(y)], Pr[w_2 ∈ S_2(y)], ..., Pr[w_M ∈ S_M(y)].
Closely related to the sets S_1(y), S_2(y), ..., S_M(y) is the set S(y) of M-dimensional (column) vectors defined as follows:
\[
S(y) = \{\,w : w_j \in S_j(y)\ (j = 1, 2, \ldots, M)\,\}. \qquad (5.1)
\]
When each of the sets S_1(y), S_2(y), ..., S_M(y) is an interval, the geometrical form of the set S(y) is that of an "M-dimensional rectangle." Clearly, Pr[w ∈ S(y)] equals the probability of simultaneous coverage of the individual sets S_1(y), S_2(y), ..., S_M(y). Accordingly, the set S(y) is a 100(1 − α)% confidence set for w if (and only if) the probability of simultaneous coverage of S_1(y), S_2(y), ..., S_M(y) equals 1 − α.
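A simulation makes the warning about simultaneous coverage concrete. Below, M independent quantities each get their own 95% t-interval (a made-up setting); each individual coverage is near 0.95, while the probability that all M intervals cover simultaneously is far smaller (about 0.95^M).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
M, n, alpha, reps = 10, 10, 0.05, 20_000     # hypothetical values
mu, sigma = 0.0, 1.0

y = rng.normal(mu, sigma, size=(reps, M, n))
ybar = y.mean(axis=2)
s = y.std(axis=2, ddof=1)
half = stats.t.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)
covered = (ybar - half <= mu) & (mu <= ybar + half)   # reps x M coverage indicators

print(covered.mean(axis=0))        # individual coverages, each close to 0.95
print(covered.all(axis=1).mean())  # simultaneous coverage, roughly 0.95 ** M (about 0.60)
```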
Let us now revisit hypothesis testing. Suppose that every one of the M elements w_1, w_2, ..., w_M of w is a function of θ. Suppose further that, for some values w_1^(0), w_2^(0), ..., w_M^(0), interest centers on the question of whether the hypothesis H_0: w_i = w_i^(0) (i = 1, 2, ..., M) (the so-called null hypothesis) is consistent with the data. The null hypothesis can be restated (in matrix notation) as H_0: w = w^(0), where w^(0) is the M-dimensional (column) vector with elements w_1^(0), w_2^(0), ..., w_M^(0).
Various concepts introduced earlier in connection with the testing of the null hypothesis in the case of a single parametric function extend in an altogether straightforward way to the case of an M-dimensional vector of parametric functions. These concepts include those of an acceptance region, of a critical or rejection region, of a size-α test, and of a p-value. Moreover, the relationship between confidence sets and hypothesis tests discussed earlier (in connection with inference about a single parametric function) extends to inference about the M-dimensional vector w (of parametric functions): if S(y) is a 100(1 − α)% confidence set for w, then a size-α test of H_0: w = w^(0) is obtained by taking the acceptance region to be the set of y-values for which w^(0) ∈ S(y).
The null hypothesis H_0: w = w^(0) can be regarded as a "composite" of the M "individual" hypotheses H_0^(1): w_1 = w_1^(0), H_0^(2): w_2 = w_2^(0), ..., H_0^(M): w_M = w_M^(0). Clearly, the composite null hypothesis H_0 is true if and only if all M of the individual null hypotheses H_0^(1), H_0^(2), ..., H_0^(M) are true and is false if and only if one or more of the individual null hypotheses are false. As a variation on the problem of testing H_0, there is the problem of testing H_0^(1), H_0^(2), ..., H_0^(M) individually. The latter problem is known as the problem of multiple comparisons—in some applications, w_1, w_2, ..., w_M have interpretations relating them to comparisons among various entities. The classical approach to this problem is to restrict attention to tests of H_0^(1), H_0^(2), ..., H_0^(M) such that the probability of one or more false rejections does not exceed some relatively low level α, for example, α = 0.05. For even moderately large values of M, this approach can be quite "conservative." A less conservative approach is obtainable by considering all test procedures for which (for some positive integer k > 1) the probability of k or more false rejections does not exceed α (e.g., Lehmann and Romano 2005a). Another such approach is that of Benjamini and Hochberg (1995); it takes the form of controlling the false discovery rate, which by definition is the expected value of the ratio of the number of false rejections to the total number of rejections.
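A minimal sketch of the Benjamini-Hochberg step-up rule described above is given below; the p-values are invented, and the function is an illustration rather than a production implementation.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean rejection indicators under the Benjamini-Hochberg step-up rule,
    which controls the false discovery rate at level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = below.nonzero()[0].max()          # largest index meeting its threshold
        reject[order[:k + 1]] = True          # reject it and every smaller p-value
    return reject

# Made-up p-values for M = 8 individual null hypotheses.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.30, 0.90]
print(benjamini_hochberg(p, alpha=0.05))      # rejects the two smallest p-values here
```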
In the testing of a hypothesis about a parametric function or functions and in the point or set
estimation or prediction of an unobservable random variable or vector w, the statistical properties of
the test or of the estimator or predictor depend on various characteristics of the distribution of y or,
more generally (in the case of the estimator or predictor), on the joint distribution of w and y. These
properties include the probability of acceptance or rejection of a hypothesis (by a hypothesis test),
the probability of coverage (by a confidence set), and the unbiasedness and MSE or MSE matrix of
a point estimator or predictor. Some or all of the relevant characteristics of the distribution of y or of
the joint distribution of w and y are determined by the assumptions (about the distribution of y) that
comprise the statistical model and by any further assumptions pertaining to the joint distribution of
w and y. By definition, these assumptions pertain to the unconditional distribution of y and to the
unconditional joint distribution of w and y.
It can be informative to determine the properties of a hypothesis test or of a point or set estimator
or predictor under more than one model (or under more than 1 set of assumptions about the joint
distribution of w and y) and/or to determine the properties of the test or the estimator or predictor
conditionally on the values of various random variables (e.g., conditionally on the values of various
functions of y or even on the value of y itself). The appeal of a test or estimation or prediction pro-
cedure whose properties have been evaluated unconditionally under a particular model (or particular
set of assumptions about the joint distribution of w and y) can be either enhanced or diminished
by evaluating its properties conditionally and/or under an alternative model (or alternative set of
assumptions). The relative appeal of alternative procedures may be a matter of emphasis; which
procedure has the more favorable properties may depend on whether the properties are evaluated
conditionally or unconditionally and under which model. In such a case, it may be instructive to
analyze the data in accordance with each of multiple procedures.

1.6 An Overview
This volume provides coverage of linear statistical models and of various statistical procedures that
are based on those models. The emphasis is on the underlying theory; however, some discussion of
applications and some attempts at illustration are included among the content.
In-depth coverage is provided for a broad class of linear statistical models consisting of what are
referred to herein as Gauss–Markov models. Results obtained on the basis of Gauss–Markov models
can be extended in a relatively straightforward way to a somewhat broader class of linear statistical
models consisting of what are referred to herein as Aitken models. Results on a few selected topics
are obtained for what are referred to herein as general linear models, which form a very broad class
of linear statistical models and include the Gauss–Markov and Aitken models as special cases.
The models underlying (simple and multiple) linear regression procedures are Gauss–Markov
models, and results obtained on the basis of Gauss–Markov models apply more-or-less directly to
those procedures. Moreover, many of the procedures that are commonly used to analyze experimen-
tal data (and in some cases observational data) are based on classificatory (fixed-effects) models.
Like regression models, these models are Gauss–Markov models. However, some of the procedures
that are commonly used to analyze classificatory data (such as the analysis of variance) are rather
“specialized.” For the most part, those kinds of specialized procedures are outside the scope of what
is covered herein. They constitute (along with various results that would expand on the coverage of
general linear models) potential subjects for a possible future volume.
The organization of the present volume is such that the results that are directly applicable to
linear models are presented in Chapters 5 and 7. Chapter 5 provides coverage of (point) estimation
and prediction, and Chapter 7 provides coverage of topics related to the construction of confidence
intervals and sets and to the testing of hypotheses. Chapters 2, 3, 4, and 6 present results on matrix
algebra and the relevant underlying statistical distributions (as well as other supportive material);
many of these results are of importance in their own right. Some additional results on matrix algebra
and statistical distributions are introduced in Chapters 5 and 7 (as the need for them arises).
2 Matrix Algebra: A Primer
Knowledge of matrix algebra is essential in working with linear models. Chapter 2 provides a limited
coverage of matrix algebra, with an emphasis on concepts and results that are highly relevant and
that are more-or-less elementary in nature. It forms a core body of knowledge, and, as such, provides
a solid foundation for the developments that follow. Derivations or proofs are included for most
results. In subsequent chapters, the coverage of matrix algebra is extended (as the need arises) to
additional concepts and results.

2.1 The Basics


A matrix is a rectangular array of numbers, that is, a collection of numbers, say $a_{11}, a_{12}, \ldots, a_{1N}$; $a_{21}, a_{22}, \ldots, a_{2N}$; $\ldots$; $a_{M1}, a_{M2}, \ldots, a_{MN}$, arranged in rows and columns as follows:
$$
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1N} \\
a_{21} & a_{22} & \cdots & a_{2N} \\
\vdots & \vdots &        & \vdots \\
a_{M1} & a_{M2} & \cdots & a_{MN}
\end{pmatrix}.
$$
The use of the term matrix is restricted herein to real matrices, that is, to rectangular arrays of real numbers. A matrix having $M$ rows and $N$ columns is referred to as an $M \times N$ matrix, and $M$ and $N$ are called the dimensions of the matrix. The number located at the intersection of the $i$th row and the $j$th column of a matrix is called the $ij$th element or entry of the matrix.
Boldface capital letters (e.g., $A$) are used to represent matrices. The notation $A = \{a_{ij}\}$ is used in introducing a matrix, the $ij$th element of which is $a_{ij}$. Two matrices $A$ and $B$ of the same dimensions are said to be equal if each element of $A$ equals the corresponding element of $B$, in which case we write $A = B$ (and are said to be unequal otherwise, i.e., if any element of $A$ differs from the corresponding element of $B$, in which case we write $A \neq B$).

a. Matrix operations
A matrix can be transformed or can be combined with various other matrices in accordance with
operations called scalar multiplication, matrix addition and subtraction, matrix multiplication, and
transposition.
Scalar multiplication. The term scalar is to be used synonymously with real number. Scalar multiplication is defined for an arbitrary scalar $k$ and an arbitrary $M \times N$ matrix $A = \{a_{ij}\}$. The product of $k$ and $A$ is written as $kA$ (or, much less commonly, as $Ak$), and is defined to be the $M \times N$ matrix whose $ij$th element is $k a_{ij}$. The matrix $kA$ is said to be a scalar multiple of the matrix $A$. Clearly, for any scalars $c$ and $k$ and any matrix $A$,
$$c(kA) = (ck)A = (kc)A = k(cA). \tag{1.1}$$
It is customary to refer to the product $(-1)A$ of $-1$ and $A$ as the negative of $A$ and to abbreviate $(-1)A$ to $-A$.
Matrix addition and subtraction. Matrix addition and subtraction are defined for any two matrices $A = \{a_{ij}\}$ and $B = \{b_{ij}\}$ that have the same number of rows, say $M$, and the same number of columns, say $N$. The sum of the two $M \times N$ matrices $A$ and $B$ is denoted by the symbol $A + B$ and is defined to be the $M \times N$ matrix whose $ij$th element is $a_{ij} + b_{ij}$.
Matrix addition is commutative, that is,
$$A + B = B + A. \tag{1.2}$$
Matrix addition is also associative, that is, taking $C$ to be a third $M \times N$ matrix,
$$A + (B + C) = (A + B) + C. \tag{1.3}$$
The symbol $A + B + C$ is used to represent the common value of the left and right sides of equality (1.3), and that value is referred to as the sum of $A$, $B$, and $C$. This notation and terminology extend in an obvious way to any finite number of $M \times N$ matrices.
Clearly, for any scalar $k$,
$$k(A + B) = kA + kB, \tag{1.4}$$
and, for any scalars $c$ and $k$,
$$(c + k)A = cA + kA. \tag{1.5}$$
Let us write $A - B$ for the sum $A + (-B)$ or, equivalently, for the $M \times N$ matrix whose $ij$th element is $a_{ij} - b_{ij}$, and refer to this matrix as the difference between $A$ and $B$.
Matrices having the same number of rows and the same number of columns are said to be
conformal for addition (and subtraction).
Matrix multiplication. Turning now to matrix multiplication (i.e., the multiplication of one matrix by another), let $A = \{a_{ij}\}$ represent an $M \times N$ matrix and $B = \{b_{ij}\}$ a $P \times Q$ matrix. When $N = P$ (i.e., when $A$ has the same number of columns as $B$ has rows), the matrix product $AB$ is defined to be the $M \times Q$ matrix whose $ij$th element is
$$\sum_{k=1}^{N} a_{ik} b_{kj} = a_{i1} b_{1j} + a_{i2} b_{2j} + \cdots + a_{iN} b_{Nj}.$$
The formation of the matrix product $AB$ is referred to as the premultiplication of $B$ by $A$ or the postmultiplication of $A$ by $B$. When $N \neq P$, the matrix product $AB$ is undefined.
Matrix multiplication is associative. Thus, introducing a third matrix $C$,
$$A(BC) = (AB)C, \tag{1.6}$$
provided that $N = P$ and that $C$ has $Q$ rows (so that all relevant matrix products are defined). The symbol $ABC$ is used to represent the common value of the left and right sides of equality (1.6), and that value is referred to as the product of $A$, $B$, and $C$. This notation and terminology extend in an obvious way to any finite number of matrices.
Matrix multiplication is distributive with respect to addition, that is,
$$A(B + C) = AB + AC, \tag{1.7}$$
$$(A + B)C = AC + BC, \tag{1.8}$$
where, in each equality, it is assumed that the dimensions of $A$, $B$, and $C$ are such that all multiplications and additions are defined. Results (1.7) and (1.8) extend in an obvious way to the postmultiplication or premultiplication of a matrix $A$ or $C$ by the sum of any finite number of matrices.
In general, matrix multiplication is not commutative. That is, $AB$ is not necessarily identical to $BA$. In fact, when $N = P$ but $M \neq Q$ or when $M = Q$ but $N \neq P$, one of the matrix products $AB$ and $BA$ is defined, while the other is undefined. When $N = P$ and $M = Q$, $AB$ and $BA$ are both defined, but the dimensions ($M \times M$) of $AB$ are the same as those of $BA$ only if $M = N$. Even if $N = P = M = Q$, in which case $A$ and $B$ are both $N \times N$ matrices and the two matrix products $AB$ and $BA$ are both defined and of the same dimensions, it is not necessarily the case that $AB = BA$.
Two $N \times N$ matrices $A$ and $B$ are said to commute if $AB = BA$. More generally, a collection of $N \times N$ matrices $A_1, A_2, \ldots, A_K$ is said to commute in pairs if $A_i A_j = A_j A_i$ for $j > i = 1, 2, \ldots, K$.
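As a concrete illustration (added here; not part of the original text), the following NumPy sketch checks a few of the identities above for small, arbitrarily chosen matrices and exhibits a pair of square matrices that do not commute.

    import numpy as np

    A = np.array([[1., 2.], [3., 4.]])
    B = np.array([[0., 1.], [1., 0.]])
    C = np.array([[2., 0.], [0., 3.]])

    # Commutativity and associativity of addition: results (1.2) and (1.3)
    assert np.allclose(A + B, B + A)
    assert np.allclose(A + (B + C), (A + B) + C)

    # Associativity and distributivity of multiplication: results (1.6)-(1.8)
    assert np.allclose(A @ (B @ C), (A @ B) @ C)
    assert np.allclose(A @ (B + C), A @ B + A @ C)

    # Matrix multiplication is not commutative: here AB differs from BA.
    print(A @ B)   # [[2. 1.] [4. 3.]]
    print(B @ A)   # [[3. 4.] [1. 2.]]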
For any scalar $c$, $M \times N$ matrix $A$, and $N \times P$ matrix $B$, it is customary to write $cAB$ for the scalar product $c(AB)$ of $c$ and the matrix product $AB$. Note that
$$cAB = (cA)B = A(cB) \tag{1.9}$$
(as is evident from the very definitions of scalar and matrix multiplication). This notation (for a scalar multiple of a product of two matrices) and result (1.9) extend in an obvious way to a scalar multiple of a product of any finite number of matrices.
Transposition. Corresponding to any $M \times N$ matrix $A = \{a_{ij}\}$ is the $N \times M$ matrix obtained by rewriting the columns of $A$ as rows or the rows of $A$ as columns. This matrix, the $ij$th element of which is $a_{ji}$, is called the transpose of $A$ and is to be denoted herein by the symbol $A'$.
For any matrix $A$,
$$(A')' = A; \tag{1.10}$$
for any scalar $k$ and any matrix $A$,
$$(kA)' = kA'; \tag{1.11}$$
and for any two matrices $A$ and $B$ (that are conformal for addition),
$$(A + B)' = A' + B' \tag{1.12}$$
—these three results are easily verified. Further, for any two matrices $A$ and $B$ (for which the product $AB$ is defined),
$$(AB)' = B'A', \tag{1.13}$$
as can be verified by comparing the $ij$th element of $B'A'$ with that of $(AB)'$. More generally,
$$(A_1 A_2 \cdots A_K)' = A_K' \cdots A_2' A_1' \tag{1.14}$$
for any $K$ matrices $A_1, A_2, \ldots, A_K$ of appropriate dimensions.

b. Types of matrices
There are several types of matrices that are worthy of mention.
Square matrices. A matrix having the same number of rows as columns, say $N$ rows and $N$ columns, is referred to as a square matrix and is said to be of order $N$. The $N$ elements of a square matrix of order $N$ that lie on an imaginary line (called the diagonal) extending from the upper left corner of the matrix to the lower right corner are called the diagonal elements; the other $N(N-1)$ elements of the matrix (those elements that lie above and to the right or below and to the left of the diagonal) are called the off-diagonal elements. Thus, the diagonal elements of a square matrix $A = \{a_{ij}\}$ of order $N$ are $a_{ii}$ ($i = 1, 2, \ldots, N$), and the off-diagonal elements are $a_{ij}$ ($j \neq i = 1, 2, \ldots, N$).
Symmetric matrices. A matrix $A$ is said to be symmetric if $A' = A$. Thus, a matrix is symmetric if it is square and if (for all $i$ and $j \neq i$) its $ji$th element equals its $ij$th element.
Diagonal matrices. A diagonal matrix is a square matrix whose off-diagonal elements are all equal to 0. Thus, a square matrix $A = \{a_{ij}\}$ of order $N$ is a diagonal matrix if $a_{ij} = 0$ for $j \neq i = 1, 2, \ldots, N$. The notation $D = \{d_i\}$ is sometimes used to introduce a diagonal matrix, the $i$th diagonal element of which is $d_i$. Also, we may write $\mathrm{diag}(d_1, d_2, \ldots, d_N)$ for such a matrix (where $N$ is the order of the matrix).
Identity matrices. A diagonal matrix $\mathrm{diag}(1, 1, \ldots, 1)$ whose diagonal elements are all equal to 1 is called an identity matrix. The symbol $I_N$ is used to represent an identity matrix of order $N$. In cases where the order is clear from the context, $I_N$ may be abbreviated to $I$.
Triangular matrices. If all of the elements of a square matrix that are located below and to the left of the diagonal are 0, the matrix is said to be upper triangular. Similarly, if all of the elements that are located above and to the right of the diagonal are 0, the matrix is said to be lower triangular. More formally, a square matrix $A = \{a_{ij}\}$ of order $N$ is upper triangular if $a_{ij} = 0$ for $j < i = 1, \ldots, N$ and is lower triangular if $a_{ij} = 0$ for $j > i = 1, \ldots, N$. By a triangular matrix, we mean a (square) matrix that is upper triangular or lower triangular. An (upper or lower) triangular matrix is called a unit (upper or lower) triangular matrix if all of its diagonal elements equal 1.
Row and column vectors. A matrix that has only one row, that is, a matrix of the form $(a_1, a_2, \ldots, a_N)$, is called a row vector. Similarly, a matrix that has only one column is called a column vector. A row or column vector having $N$ elements may be referred to as an $N$-dimensional row or column vector. Clearly, the transpose of an $N$-dimensional column vector is an $N$-dimensional row vector, and vice versa.
Lowercase boldface letters (e.g., $a$) are used herein to represent column vectors. This notation is helpful in distinguishing column vectors from matrices that may have more than one column. No further notation is introduced for row vectors. Instead, row vectors are represented as the transposes of column vectors. For example, $a'$ represents the row vector whose transpose is the column vector $a$. The notation $a = \{a_i\}$ or $a' = \{a_i\}$ is used in introducing a column or row vector whose $i$th element is $a_i$.
Note that each column of an $M \times N$ matrix $A = \{a_{ij}\}$ is an $M$-dimensional column vector, and that each row of $A$ is an $N$-dimensional row vector. Specifically, the $j$th column of $A$ is the $M$-dimensional column vector $(a_{1j}, a_{2j}, \ldots, a_{Mj})'$ ($j = 1, \ldots, N$), and the $i$th row of $A$ is the $N$-dimensional row vector $(a_{i1}, a_{i2}, \ldots, a_{iN})$ ($i = 1, \ldots, M$).
Null matrices. A matrix all of whose elements are 0 is called a null matrix—a matrix having one or
more nonzero elements is said to be nonnull. A null matrix is denoted by the symbol 0—this notation
is reserved for use in situations where the dimensions of the null matrix can be ascertained from the
context. A null matrix that has one row or one column may be referred to as a null vector.
Matrices of 1's. The symbol $1_N$ is used to represent an $N$-dimensional column vector all of whose elements equal 1. In a situation where the dimensions of a column vector of 1's are clear from the context or are to be left unspecified, we may simply write $1$ for such a vector. Note that $1_N'$ is an $N$-dimensional row vector, all of whose elements equal 1, and that $1_M 1_N'$ is an $M \times N$ matrix, all of whose elements equal 1.

c. Submatrices and subvectors


A submatrix of a matrix $A$ is a matrix that can be obtained by striking out rows and/or columns of $A$. Strictly speaking, a matrix is a submatrix of itself; it is the submatrix obtained by striking out zero rows and zero columns. Submatrices of a row or column vector, that is, of a matrix having one row or one column, are themselves row or column vectors and are customarily referred to as subvectors.
A submatrix of a square matrix is called a principal submatrix if it can be obtained by striking out the same rows as columns (so that the $i$th row is struck out whenever the $i$th column is struck out, and vice versa). The $R \times R$ (principal) submatrix of an $N \times N$ matrix obtained by striking out the last $N - R$ rows and columns is referred to as a leading principal submatrix ($R = 1, \ldots, N$). A principal submatrix of a symmetric matrix is symmetric, a principal submatrix of a diagonal matrix is diagonal, and a principal submatrix of an upper or lower triangular matrix is respectively upper or lower triangular, as is easily verified.
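As a small computational aside (an illustration added here, not in the original), submatrices and principal submatrices can be formed in NumPy by index selection; the particular matrix and the retained indices below are arbitrary choices.

    import numpy as np

    A = np.arange(16, dtype=float).reshape(4, 4)

    # Principal submatrix: strike out row 2 and column 2 (the same rows as columns).
    keep = [0, 1, 3]
    A_principal = A[np.ix_(keep, keep)]

    # Leading principal submatrix of order 2: strike out the last N - R rows and columns.
    A_leading = A[:2, :2]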

2.2 Partitioned Matrices and Vectors


A matrix can be divided or partitioned into submatrices by drawing horizontal or vertical lines between various of its rows or columns, in which case the matrix is called a partitioned matrix and the submatrices are sometimes referred to as blocks (as in blocks of elements). Thus, a partitioned matrix is a matrix, say an $M \times N$ matrix $A$, that has been expressed in the form
$$
A =
\begin{pmatrix}
A_{11} & A_{12} & \cdots & A_{1C} \\
A_{21} & A_{22} & \cdots & A_{2C} \\
\vdots & \vdots &        & \vdots \\
A_{R1} & A_{R2} & \cdots & A_{RC}
\end{pmatrix}. \tag{2.1}
$$
Here, $R$ and $C$ are positive integers, and $A_{ij}$ is an $M_i \times N_j$ matrix ($i = 1, 2, \ldots, R$; $j = 1, 2, \ldots, C$), where $M_1, M_2, \ldots, M_R$ and $N_1, N_2, \ldots, N_C$ are positive integers such that $M_1 + M_2 + \cdots + M_R = M$ and $N_1 + N_2 + \cdots + N_C = N$. Specifically, $A_{ij}$ is the $M_i \times N_j$ submatrix of $A$ obtained by striking out all of the rows and columns of $A$ save the $(M_1 + M_2 + \cdots + M_{i-1} + 1)$th, $(M_1 + M_2 + \cdots + M_{i-1} + 2)$th, $\ldots$, $(M_1 + M_2 + \cdots + M_i)$th rows and the $(N_1 + N_2 + \cdots + N_{j-1} + 1)$th, $(N_1 + N_2 + \cdots + N_{j-1} + 2)$th, $\ldots$, $(N_1 + N_2 + \cdots + N_j)$th columns. (When $i = 1$ or $j = 1$, interpret the degenerate sum $M_1 + M_2 + \cdots + M_{i-1}$ or $N_1 + N_2 + \cdots + N_{j-1}$ as 0.) Think of a partitioned matrix as an array or “matrix” of matrices.
Note that, by definition, each of the submatrices $A_{i1}, A_{i2}, \ldots, A_{iC}$ in the $i$th “row” of blocks of the partitioned matrix (2.1) has the same number of rows and that each of the submatrices $A_{1j}, A_{2j}, \ldots, A_{Rj}$ in the $j$th “column” of blocks has the same number of columns. It is customary to identify each of the blocks in a partitioned matrix by referring to the row of blocks and the column of blocks in which it appears. Accordingly, when a matrix $A$ is expressed in the partitioned form (2.1), the submatrix $A_{ij}$ is referred to as the $ij$th block of $A$.
Partitioned matrices having one row or one column are customarily referred to as partitioned (row or column) vectors. Thus, a partitioned column vector is a (column) vector, say an $M$-dimensional column vector $a$, that has been expressed in the form
$$
a =
\begin{pmatrix}
a_1 \\ a_2 \\ \vdots \\ a_R
\end{pmatrix}.
$$
Here, $R$ is a positive integer, and $a_i$ is an $M_i$-dimensional column vector ($i = 1, 2, \ldots, R$), where $M_1, M_2, \ldots, M_R$ are positive integers such that $M_1 + M_2 + \cdots + M_R = M$. Specifically, $a_i$ is the subvector of $a$ obtained by striking out all of the elements of $a$ save the $(M_1 + M_2 + \cdots + M_{i-1} + 1)$th, $(M_1 + M_2 + \cdots + M_{i-1} + 2)$th, $\ldots$, $(M_1 + M_2 + \cdots + M_i)$th elements. Similarly, a partitioned row vector is a (row) vector, say the $M$-dimensional row vector $a'$, that has been expressed in the form $a' = (a_1', a_2', \ldots, a_R')$.
a. Matrix operations (as applied to partitioned matrices)


For partitioned matrices, the various matrix operations can be carried out “blockwise” instead of “elementwise.” Take $A$ to be an $M \times N$ matrix that has been expressed in the form (2.1), that is, has been partitioned into $R$ rows and $C$ columns of blocks, the $ij$th of which is the $M_i \times N_j$ submatrix $A_{ij}$. Then, clearly, for any scalar $k$,
$$
kA =
\begin{pmatrix}
kA_{11} & kA_{12} & \cdots & kA_{1C} \\
kA_{21} & kA_{22} & \cdots & kA_{2C} \\
\vdots  & \vdots  &        & \vdots  \\
kA_{R1} & kA_{R2} & \cdots & kA_{RC}
\end{pmatrix}. \tag{2.2}
$$
Further, it is a simple exercise to show that
$$
A' =
\begin{pmatrix}
A_{11}' & A_{21}' & \cdots & A_{R1}' \\
A_{12}' & A_{22}' & \cdots & A_{R2}' \\
\vdots  & \vdots  &        & \vdots  \\
A_{1C}' & A_{2C}' & \cdots & A_{RC}'
\end{pmatrix}, \tag{2.3}
$$
that is, $A'$ is expressible as a partitioned matrix, comprising $C$ rows and $R$ columns of blocks, the $ij$th of which is the transpose $A_{ji}'$ of the $ji$th block $A_{ji}$ of $A$.
Now, let us consider the sum and the product of the $M \times N$ partitioned matrix $A$ and a $P \times Q$ partitioned matrix
$$
B =
\begin{pmatrix}
B_{11} & B_{12} & \cdots & B_{1V} \\
B_{21} & B_{22} & \cdots & B_{2V} \\
\vdots & \vdots &        & \vdots \\
B_{U1} & B_{U2} & \cdots & B_{UV}
\end{pmatrix},
$$
whose $ij$th block $B_{ij}$ is of dimensions $P_i \times Q_j$. The matrices $A$ and $B$ are conformal for addition provided that $P = M$ and $Q = N$. If $U = R$, $V = C$, $P_i = M_i$ ($i = 1, 2, \ldots, R$), and $Q_j = N_j$ ($j = 1, 2, \ldots, C$), that is, if (besides $A$ and $B$ being conformal for addition) the rows and columns of $B$ are partitioned in the same way as those of $A$, then
$$
A + B =
\begin{pmatrix}
A_{11} + B_{11} & A_{12} + B_{12} & \cdots & A_{1C} + B_{1C} \\
A_{21} + B_{21} & A_{22} + B_{22} & \cdots & A_{2C} + B_{2C} \\
\vdots          & \vdots          &        & \vdots          \\
A_{R1} + B_{R1} & A_{R2} + B_{R2} & \cdots & A_{RC} + B_{RC}
\end{pmatrix}, \tag{2.4}
$$
and the partitioning of $A$ and $B$ is said to be conformal (for addition). This result and terminology extend in an obvious way to the addition of any finite number of partitioned matrices (and can be readily modified to obtain counterparts for matrix subtraction).
The matrix product $AB$ is defined provided that $P = N$. If $U = C$ and $P_k = N_k$ ($k = 1, 2, \ldots, C$) (in which case all of the products $A_{ik} B_{kj}$ ($i = 1, 2, \ldots, R$; $j = 1, 2, \ldots, V$; $k = 1, 2, \ldots, C$), as well as the product $AB$, exist), then
$$
AB =
\begin{pmatrix}
F_{11} & F_{12} & \cdots & F_{1V} \\
F_{21} & F_{22} & \cdots & F_{2V} \\
\vdots & \vdots &        & \vdots \\
F_{R1} & F_{R2} & \cdots & F_{RV}
\end{pmatrix}, \tag{2.5}
$$
where
$$
F_{ij} = \sum_{k=1}^{C} A_{ik} B_{kj} = A_{i1} B_{1j} + A_{i2} B_{2j} + \cdots + A_{iC} B_{Cj},
$$
and the partitioning of $A$ and $B$ is said to be conformal (for the premultiplication of $B$ by $A$). In the special case where $R = C = U = V = 2$, that is, where
$$
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\quad \text{and} \quad
B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix},
$$
result (2.5) simplifies to
$$
AB = \begin{pmatrix}
A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\
A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22}
\end{pmatrix}. \tag{2.6}
$$
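As an illustration (added here; not part of the original text), the following NumPy sketch forms the $2 \times 2$ blockwise product (2.6) and checks it against the ordinary product; the matrix sizes and block sizes are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # Partition A (3x5) and B (5x4) conformally: M1, M2 = 2, 1; N1, N2 = 3, 2; Q1, Q2 = 1, 3.
    A = rng.standard_normal((3, 5))
    B = rng.standard_normal((5, 4))
    A11, A12 = A[:2, :3], A[:2, 3:]
    A21, A22 = A[2:, :3], A[2:, 3:]
    B11, B12 = B[:3, :1], B[:3, 1:]
    B21, B22 = B[3:, :1], B[3:, 1:]

    blockwise = np.block([
        [A11 @ B11 + A12 @ B21, A11 @ B12 + A12 @ B22],
        [A21 @ B11 + A22 @ B21, A21 @ B12 + A22 @ B22],
    ])
    assert np.allclose(blockwise, A @ B)   # agrees with result (2.6)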

b. Block-diagonal and block-triangular matrices


In the special case of a partitioned $M \times N$ matrix $A$ of the form
$$
A =
\begin{pmatrix}
A_{11} & A_{12} & \cdots & A_{1R} \\
A_{21} & A_{22} & \cdots & A_{2R} \\
\vdots & \vdots &        & \vdots \\
A_{R1} & A_{R2} & \cdots & A_{RR}
\end{pmatrix} \tag{2.7}
$$
(for which the number of rows of blocks equals the number of columns of blocks), the $ij$th block $A_{ij}$ of $A$ is called a diagonal block if $j = i$ and an off-diagonal block if $j \neq i$. If every off-diagonal block of the partitioned matrix (2.7) is a null matrix, that is, if
$$
A =
\begin{pmatrix}
A_{11} & 0      & \cdots & 0      \\
0      & A_{22} &        & 0      \\
\vdots &        & \ddots &        \\
0      & 0      &        & A_{RR}
\end{pmatrix},
$$
then $A$ is said to be block-diagonal, and $\mathrm{diag}(A_{11}, A_{22}, \ldots, A_{RR})$ is sometimes written for $A$. If $A_{ij} = 0$ for $j < i = 1, \ldots, R$, that is, if
$$
A =
\begin{pmatrix}
A_{11} & A_{12} & \cdots & A_{1R} \\
0      & A_{22} & \cdots & A_{2R} \\
\vdots &        & \ddots & \vdots \\
0      & 0      &        & A_{RR}
\end{pmatrix},
$$
then $A$ is called an upper block-triangular matrix. Similarly, if $A_{ij} = 0$ for $j > i = 1, \ldots, R$, that is, if
$$
A =
\begin{pmatrix}
A_{11} & 0      & \cdots & 0      \\
A_{21} & A_{22} &        & 0      \\
\vdots & \vdots & \ddots &        \\
A_{R1} & A_{R2} & \cdots & A_{RR}
\end{pmatrix},
$$
then $A$ is called a lower block-triangular matrix. To indicate that $A$ is upper or lower block-triangular (without being more specific), $A$ is referred to simply as block-triangular.

c. Matrices partitioned into individual rows or columns


Note that a matrix can be partitioned into its individual rows or its individual columns, and that a (row or column) vector can be partitioned into “subvectors” of one element each. Thus, for an $M \times N$ matrix $A = (a_1, a_2, \ldots, a_N)$, with columns $a_1, a_2, \ldots, a_N$, and an $N$-dimensional column vector $x = (x_1, x_2, \ldots, x_N)'$, with elements $x_1, x_2, \ldots, x_N$, we have, as a special case of result (2.5), that
$$Ax = x_1 a_1 + x_2 a_2 + \cdots + x_N a_N. \tag{2.8}$$
Similarly, for an $M \times N$ matrix
$$A = \begin{pmatrix} b_1' \\ b_2' \\ \vdots \\ b_M' \end{pmatrix},$$
with rows $b_1', b_2', \ldots, b_M'$, and an $M$-dimensional row vector $x' = (x_1, x_2, \ldots, x_M)$, with elements $x_1, x_2, \ldots, x_M$,
$$x'A = x_1 b_1' + x_2 b_2' + \cdots + x_M b_M'. \tag{2.9}$$

Note also that an unpartitioned matrix can be regarded as a “partitioned” matrix comprising a single row and a single column of blocks. Thus, letting $A$ represent an $M \times N$ matrix and taking $X = (x_1, x_2, \ldots, x_Q)$ to be an $N \times Q$ matrix with columns $x_1, x_2, \ldots, x_Q$ and
$$Y = \begin{pmatrix} y_1' \\ y_2' \\ \vdots \\ y_P' \end{pmatrix}$$
to be a $P \times M$ matrix with rows $y_1', y_2', \ldots, y_P'$, result (2.5) implies that
$$AX = (Ax_1, Ax_2, \ldots, Ax_Q), \tag{2.10}$$
$$YA = \begin{pmatrix} y_1' A \\ y_2' A \\ \vdots \\ y_P' A \end{pmatrix}, \tag{2.11}$$
and
$$
YAX =
\begin{pmatrix}
y_1' A x_1 & y_1' A x_2 & \cdots & y_1' A x_Q \\
y_2' A x_1 & y_2' A x_2 & \cdots & y_2' A x_Q \\
\vdots     & \vdots     &        & \vdots     \\
y_P' A x_1 & y_P' A x_2 & \cdots & y_P' A x_Q
\end{pmatrix}. \tag{2.12}
$$
That is, $AX$ is an $M \times Q$ matrix whose $j$th column is $Ax_j$ ($j = 1, 2, \ldots, Q$); $YA$ is a $P \times N$ matrix whose $i$th row is $y_i' A$ ($i = 1, 2, \ldots, P$); and $YAX$ is a $P \times Q$ matrix whose $ij$th element is $y_i' A x_j$ ($i = 1, 2, \ldots, P$; $j = 1, 2, \ldots, Q$).
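The following NumPy sketch (an illustration added here, not in the original) checks representation (2.8)—the product $Ax$ as a linear combination of the columns of $A$—and the column-by-column description of $AX$ in (2.10); all matrices are arbitrary small examples.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 3))   # M x N
    x = rng.standard_normal(3)        # N-dimensional
    X = rng.standard_normal((3, 2))   # N x Q

    # (2.8): Ax is the linear combination x1*a1 + x2*a2 + ... + xN*aN of the columns of A.
    assert np.allclose(A @ x, sum(x[j] * A[:, j] for j in range(3)))

    # (2.10): the j-th column of AX is A times the j-th column of X.
    assert np.allclose((A @ X)[:, 1], A @ X[:, 1])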
Representation (2.9) is helpful in establishing the elementary results expressed in the following two lemmas—refer, for instance, to Harville (1997, sec. 2.3) for detailed derivations.
Lemma 2.2.1. For any column vector $y$ and nonnull column vector $x$, there exists a matrix $A$ such that $y = Ax$.
Lemma 2.2.2. For any two $M \times N$ matrices $A$ and $B$, $A = B$ if and only if $Ax = Bx$ for every $N$-dimensional column vector $x$.
Note that Lemma 2.2.2 implies in particular that $A = 0$ if and only if $Ax = 0$ for every $x$.
2.3 Trace of a (Square) Matrix


The trace of a square matrix $A = \{a_{ij}\}$ of order $N$ is defined to be the sum of the $N$ diagonal elements of $A$ and is to be denoted by the symbol $\mathrm{tr}(A)$. Thus,
$$\mathrm{tr}(A) = a_{11} + a_{22} + \cdots + a_{NN}.$$

a. Basic properties
Clearly, for any scalar $k$ and any $N \times N$ matrices $A$ and $B$,
$$\mathrm{tr}(kA) = k\,\mathrm{tr}(A), \tag{3.1}$$
$$\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B), \tag{3.2}$$
$$\mathrm{tr}(A') = \mathrm{tr}(A). \tag{3.3}$$
Further, for any $R$ scalars $k_1, k_2, \ldots, k_R$ and for any $R$ matrices $A_1, A_2, \ldots, A_R$ of dimensions $N \times N$,
$$\mathrm{tr}\Bigl(\sum_{i=1}^{R} k_i A_i\Bigr) = \sum_{i=1}^{R} k_i\,\mathrm{tr}(A_i), \tag{3.4}$$

as can be readily verified by, for example, the repeated application of results (3.1) and (3.2). And for a square matrix $A$ that has been partitioned as
$$
A =
\begin{pmatrix}
A_{11} & A_{12} & \cdots & A_{1R} \\
A_{21} & A_{22} & \cdots & A_{2R} \\
\vdots & \vdots &        & \vdots \\
A_{R1} & A_{R2} & \cdots & A_{RR}
\end{pmatrix}
$$
in such a way that the diagonal blocks $A_{11}, A_{22}, \ldots, A_{RR}$ are square,
$$\mathrm{tr}(A) = \mathrm{tr}(A_{11}) + \mathrm{tr}(A_{22}) + \cdots + \mathrm{tr}(A_{RR}). \tag{3.5}$$

b. Trace of a product
Let $A = \{a_{ij}\}$ represent an $M \times N$ matrix and $B = \{b_{ji}\}$ an $N \times M$ matrix. Then,
$$\mathrm{tr}(AB) = \sum_{i=1}^{M} \sum_{j=1}^{N} a_{ij} b_{ji}, \tag{3.6}$$
as is evident upon observing that the $i$th diagonal element of $AB$ is $\sum_{j=1}^{N} a_{ij} b_{ji}$. Thus, since the $ji$th element of $B$ is the $ij$th element of $B'$, the trace of the matrix product $AB$ can be formed by multiplying the $ij$th element of $A$ by the corresponding ($ij$th) element of $B'$ and by then summing (over $i$ and $j$).
A simple (but very important) result on the trace of a product of two matrices is expressed in the following lemma.
Lemma 2.3.1. For any $M \times N$ matrix $A$ and $N \times M$ matrix $B$,
$$\mathrm{tr}(AB) = \mathrm{tr}(BA). \tag{3.7}$$


Proof. Let $a_{ij}$ represent the $ij$th element of $A$ and $b_{ji}$ the $ji$th element of $B$, and observe that the $j$th diagonal element of $BA$ is $\sum_{i=1}^{M} b_{ji} a_{ij}$. Thus, making use of result (3.6), we find that
$$\mathrm{tr}(AB) = \sum_{i=1}^{M} \sum_{j=1}^{N} a_{ij} b_{ji} = \sum_{j=1}^{N} \sum_{i=1}^{M} a_{ij} b_{ji} = \sum_{j=1}^{N} \sum_{i=1}^{M} b_{ji} a_{ij} = \mathrm{tr}(BA). \quad \text{Q.E.D.}$$

Note [in light of results (3.7) and (3.6)] that for any $M \times N$ matrix $A = \{a_{ij}\}$,
$$\mathrm{tr}(A'A) = \mathrm{tr}(AA') = \sum_{i=1}^{M} \sum_{j=1}^{N} a_{ij}^2 \tag{3.8}$$
$$\geq 0. \tag{3.9}$$
That is, both $\mathrm{tr}(A'A)$ and $\mathrm{tr}(AA')$ equal the sum of squares of the $MN$ elements of $A$, and both are inherently nonnegative.
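As a quick numerical check (added here for illustration; not part of the original text), the sketch below verifies Lemma 2.3.1 and results (3.8)–(3.9) for an arbitrary pair of matrices.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((3, 5))   # M x N
    B = rng.standard_normal((5, 3))   # N x M

    # Lemma 2.3.1: tr(AB) = tr(BA), even though AB is 3x3 and BA is 5x5.
    assert np.isclose(np.trace(A @ B), np.trace(B @ A))

    # Results (3.8)-(3.9): tr(A'A) equals the (nonnegative) sum of squares of the elements of A.
    assert np.isclose(np.trace(A.T @ A), np.sum(A**2))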

c. Some equivalent conditions


Note that equality is attained in inequality (3.9) if and only if $A = 0$ and that as a consequence, we have the following lemma.
Lemma 2.3.2. For any $M \times N$ matrix $A$, $A = 0$ if and only if $\mathrm{tr}(A'A) = 0$.
As an essentially immediate consequence of Lemma 2.3.2, we have the following corollary.
Corollary 2.3.3. For any $M \times N$ matrix $A$, $A = 0$ if and only if $A'A = 0$.
The following corollary provides a very useful generalization of Corollary 2.3.3.
Corollary 2.3.4. (1) For any $M \times N$ matrix $A$ and $N \times S$ matrices $B$ and $C$, $AB = AC$ if and only if $A'AB = A'AC$. (2) Similarly, for any $M \times N$ matrix $A$ and $S \times N$ matrices $B$ and $C$, $BA' = CA'$ if and only if $BA'A = CA'A$.
Proof (of Corollary 2.3.4). (1) If $AB = AC$, then obviously $A'AB = A'AC$. Conversely, if $A'AB = A'AC$, then
$$(AB - AC)'(AB - AC) = (B' - C')(A'AB - A'AC) = 0,$$
and it follows from Corollary 2.3.3 that $AB - AC = 0$ or, equivalently, that $AB = AC$.
(2) To establish Part (2), simply take the transpose of each side of the two equivalent equalities $AB' = AC'$ and $A'AB' = A'AC'$. [The equivalence of these two equalities follows from Part (1).] Q.E.D.
Note that as a special case of Part (1) of Corollary 2.3.4 (the special case where $C = 0$), we have that $AB = 0$ if and only if $A'AB = 0$, and as a special case of Part (2), we have that $BA' = 0$ if and only if $BA'A = 0$.

2.4 Linear Spaces


A nonempty set, say $\mathcal{V}$, of matrices (all of which have the same dimensions) is called a linear space if (1) for every matrix $A$ in $\mathcal{V}$ and every matrix $B$ in $\mathcal{V}$, the sum $A + B$ is in $\mathcal{V}$, and (2) for every matrix $A$ in $\mathcal{V}$ and every scalar $k$, the product $kA$ is in $\mathcal{V}$. For example, the set consisting of all $M \times N$ matrices is a linear space; and since sums and scalar multiples of symmetric matrices are symmetric, the set of all $N \times N$ symmetric matrices is a linear space. Note that every linear space contains the null matrix $0$ (of appropriate dimensions), and that the set $\{0\}$, whose only member is a null matrix, is a linear space. Note also that if a linear space contains a nonnull matrix, then it contains an infinite number of nonnull matrices.
A linear combination of matrices $A_1, A_2, \ldots, A_K$ (of the same dimensions) is an expression of the general form
$$x_1 A_1 + x_2 A_2 + \cdots + x_K A_K,$$
where $x_1, x_2, \ldots, x_K$ are scalars (which are referred to as the coefficients of the linear combination). If $A_1, A_2, \ldots, A_K$ are matrices in a linear space $\mathcal{V}$, then every linear combination of $A_1, A_2, \ldots, A_K$ is also in $\mathcal{V}$.
Corresponding to any finite set of $M \times N$ matrices is the span of the set. By definition, the span of a nonempty finite set $\{A_1, A_2, \ldots, A_K\}$ of $M \times N$ matrices is the set consisting of all matrices that are expressible as linear combinations of $A_1, A_2, \ldots, A_K$. (By convention, the span of the empty set of $M \times N$ matrices is the set $\{0\}$, whose only member is the $M \times N$ null matrix.) The span of a finite set $S$ is denoted herein by the symbol $\mathrm{sp}(S)$; $\mathrm{sp}(\{A_1, A_2, \ldots, A_K\})$, which represents the span of the set comprising the matrices $A_1, A_2, \ldots, A_K$, is typically abbreviated to $\mathrm{sp}(A_1, A_2, \ldots, A_K)$. Clearly, the span of any finite set of $M \times N$ matrices is a linear space.
A finite set $S$ of matrices in a linear space $\mathcal{V}$ is said to span $\mathcal{V}$ if $\mathrm{sp}(S) = \mathcal{V}$. Or equivalently [since $\mathrm{sp}(S) \subset \mathcal{V}$], $S$ spans $\mathcal{V}$ if $\mathcal{V} \subset \mathrm{sp}(S)$.

a. Row and column spaces


Corresponding to any $M \times N$ matrix $A$ are two linear spaces of fundamental importance. The column space of $A$ is the span of the set whose members are the columns of $A$; that is, the column space of $A$ is the set consisting of all $M$-dimensional column vectors that are expressible as linear combinations of the $N$ columns of $A$. Similarly, the row space of $A$ is the span of the set whose members are the rows of $A$; that is, the row space of $A$ is the set consisting of all $N$-dimensional row vectors that are expressible as linear combinations of the $M$ rows of $A$.
The column space of a matrix $A$ is to be denoted by the symbol $\mathcal{C}(A)$ and the row space by the symbol $\mathcal{R}(A)$. The symbol $\mathbb{R}^N$ will be used to denote the set of all $N$-dimensional column vectors or (depending on the context) the set of all $N$-dimensional row vectors. Note that $\mathcal{C}(I_N) = \mathbb{R}^N$ (where $\mathbb{R}^N$ is the set of all $N$-dimensional column vectors), and $\mathcal{R}(I_N) = \mathbb{R}^N$ (where $\mathbb{R}^N$ is the set of all $N$-dimensional row vectors).
In light of result (2.8), it is apparent that an $M$-dimensional column vector $y$ is a member of the column space $\mathcal{C}(A)$ of an $M \times N$ matrix $A$ if and only if there exists an $N$-dimensional column vector $x$ for which $y = Ax$. And in light of result (2.9), it is apparent that an $N$-dimensional row vector $y'$ is a member of $\mathcal{R}(A)$ if and only if there exists an $M$-dimensional row vector $x'$ for which $y' = x'A$.
The following lemma relates the column space of a matrix to the row space of its transpose.
Lemma 2.4.1. For any matrix $A$, $y \in \mathcal{C}(A)$ if and only if $y' \in \mathcal{R}(A')$.
Proof. If $y \in \mathcal{C}(A)$, then $y = Ax$ for some column vector $x$, implying that $y' = (Ax)' = x'A'$ and hence that $y' \in \mathcal{R}(A')$. The converse [that $y' \in \mathcal{R}(A') \Rightarrow y \in \mathcal{C}(A)$] can be established in similar fashion. Q.E.D.
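As a computational aside (added here; not in the original), membership of a vector $y$ in $\mathcal{C}(A)$ can be checked numerically by solving the least squares problem $\min_x \|Ax - y\|$ and testing whether the residual is (numerically) zero. The function name in_column_space and the tolerance below are ad hoc choices for this sketch.

    import numpy as np

    def in_column_space(A, y, tol=1e-10):
        """Return True if y is (numerically) a linear combination of the columns of A."""
        x, *_ = np.linalg.lstsq(A, y, rcond=None)
        return np.linalg.norm(A @ x - y) <= tol * max(1.0, np.linalg.norm(y))

    A = np.array([[1., 0.], [0., 1.], [1., 1.]])
    print(in_column_space(A, np.array([2., 3., 5.])))   # True: y = 2*a1 + 3*a2
    print(in_column_space(A, np.array([0., 0., 1.])))   # False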

b. Subspaces
A subset $\mathcal{U}$ of a linear space $\mathcal{V}$ (of $M \times N$ matrices) is said to be a subspace of $\mathcal{V}$ if $\mathcal{U}$ is itself a linear space. Trivial examples of a subspace of a linear space $\mathcal{V}$ are: (1) the set $\{0\}$, whose only member is the null matrix, and (2) the entire set $\mathcal{V}$. The column space $\mathcal{C}(A)$ of an $M \times N$ matrix $A$ is a subspace of $\mathbb{R}^M$ (when $\mathbb{R}^M$ is interpreted as the set of all $M$-dimensional column vectors), and $\mathcal{R}(A)$ is a subspace of $\mathbb{R}^N$ (when $\mathbb{R}^N$ is interpreted as the set of all $N$-dimensional row vectors).
We require some additional terminology and notation. Suppose that $S$ and $T$ are subspaces of the linear space of all $M \times N$ matrices or, more generally, that $S$ and $T$ are subsets of a given set. If every member of $S$ is a member of $T$, then $S$ is said to be contained in $T$ (or $T$ is said to contain $S$), and we write $S \subset T$ (or $T \supset S$). Note that if $S \subset T$ and $T \subset S$, then $S = T$, that is, the two subsets $S$ and $T$ are identical.
Some basic results on row and column spaces are expressed in the following lemmas and corollaries, proofs of which are given by Harville (1997, sec. 4.2).
Lemma 2.4.2. Let $A$ represent an $M \times N$ matrix. Then, for any subspace $\mathcal{U}$ of $\mathbb{R}^M$, $\mathcal{C}(A) \subset \mathcal{U}$ if and only if every column of $A$ belongs to $\mathcal{U}$. Similarly, for any subspace $\mathcal{V}$ of $\mathbb{R}^N$, $\mathcal{R}(A) \subset \mathcal{V}$ if and only if every row of $A$ belongs to $\mathcal{V}$.
Lemma 2.4.3. For any $M \times N$ matrix $A$ and $M \times P$ matrix $B$, $\mathcal{C}(B) \subset \mathcal{C}(A)$ if and only if there exists an $N \times P$ matrix $F$ such that $B = AF$. Similarly, for any $M \times N$ matrix $A$ and $Q \times N$ matrix $C$, $\mathcal{R}(C) \subset \mathcal{R}(A)$ if and only if there exists a $Q \times M$ matrix $L$ such that $C = LA$.
Corollary 2.4.4. For any $M \times N$ matrix $A$ and $N \times P$ matrix $F$, $\mathcal{C}(AF) \subset \mathcal{C}(A)$. Similarly, for any $M \times N$ matrix $A$ and $Q \times M$ matrix $L$, $\mathcal{R}(LA) \subset \mathcal{R}(A)$.
Corollary 2.4.5. Let $A$ represent an $M \times N$ matrix, $E$ an $N \times K$ matrix, $F$ an $N \times P$ matrix, $L$ a $Q \times M$ matrix, and $T$ an $S \times M$ matrix.
(1) If $\mathcal{C}(E) \subset \mathcal{C}(F)$, then $\mathcal{C}(AE) \subset \mathcal{C}(AF)$; and if $\mathcal{C}(E) = \mathcal{C}(F)$, then $\mathcal{C}(AE) = \mathcal{C}(AF)$.
(2) If $\mathcal{R}(L) \subset \mathcal{R}(T)$, then $\mathcal{R}(LA) \subset \mathcal{R}(TA)$; and if $\mathcal{R}(L) = \mathcal{R}(T)$, then $\mathcal{R}(LA) = \mathcal{R}(TA)$.
Lemma 2.4.6. Let $A$ represent an $M \times N$ matrix and $B$ an $M \times P$ matrix. Then, (1) $\mathcal{C}(A) \subset \mathcal{C}(B)$ if and only if $\mathcal{R}(A') \subset \mathcal{R}(B')$, and (2) $\mathcal{C}(A) = \mathcal{C}(B)$ if and only if $\mathcal{R}(A') = \mathcal{R}(B')$.

c. Linear dependence and independence


Any finite set of row or column vectors, or more generally any finite set of $M \times N$ matrices, is either linearly dependent or linearly independent. A nonempty finite set $\{A_1, A_2, \ldots, A_K\}$ of $M \times N$ matrices is said to be linearly dependent if there exist scalars $x_1, x_2, \ldots, x_K$, not all zero, such that
$$x_1 A_1 + x_2 A_2 + \cdots + x_K A_K = 0.$$
If no such scalars exist, the set is said to be linearly independent. The empty set is considered to be linearly independent. Note that if any subset of a finite set of $M \times N$ matrices is linearly dependent, then the set itself is linearly dependent. Note also that if the set $\{A_1, A_2, \ldots, A_K\}$ is linearly dependent, then some member of the set, say the $s$th member $A_s$, can be expressed as a linear combination of the other $K - 1$ members $A_1, A_2, \ldots, A_{s-1}, A_{s+1}, \ldots, A_K$; that is, $A_s = \sum_{i \neq s} y_i A_i$ for some scalars $y_1, y_2, \ldots, y_{s-1}, y_{s+1}, \ldots, y_K$.
While technically linear dependence and independence are properties of sets of matrices, it is
customary to speak of “a set of linearly dependent (or independent) matrices” or simply of “lin-
early dependent (or independent) matrices” instead of “a linearly dependent (or independent) set of
matrices.” In particular, in the case of row or column vectors, it is customary to speak of “linearly
dependent (or independent) vectors.”

d. Bases
A basis for a linear space $\mathcal{V}$ of $M \times N$ matrices is a linearly independent set of matrices in $\mathcal{V}$ that spans $\mathcal{V}$. The empty set is the (unique) basis for the linear space $\{0\}$ (whose only member is the null matrix). The set whose members are the $M$ columns $(1, 0, \ldots, 0)', \ldots, (0, \ldots, 0, 1)'$ of the $M \times M$ identity matrix $I_M$ is a basis for the linear space $\mathbb{R}^M$ of all $M$-dimensional column vectors. Similarly, the set whose members are the $N$ rows $(1, 0, \ldots, 0), \ldots, (0, \ldots, 0, 1)$ of the $N \times N$ identity matrix $I_N$ is a basis for the linear space $\mathbb{R}^N$ of all $N$-dimensional row vectors. More generally, letting $U_{ij}$ represent the $M \times N$ matrix whose $ij$th element equals 1 and whose remaining ($MN - 1$) elements equal 0, the set whose members are the $MN$ matrices $U_{11}, U_{21}, \ldots, U_{M1}, U_{12}, U_{22}, \ldots, U_{M2}, \ldots, U_{1N}, U_{2N}, \ldots, U_{MN}$ is a basis (the so-called natural basis) for the linear space of all $M \times N$ matrices (as can be readily verified).
Now, consider the column space $\mathcal{C}(A)$ and row space $\mathcal{R}(A)$ of an $M \times N$ matrix $A$. By definition, $\mathcal{C}(A)$ is spanned by the set whose members are the columns of $A$. If this set is linearly independent, it is a basis for $\mathcal{C}(A)$; otherwise, it is not. Similarly, if the set whose members are the rows of $A$ is linearly independent, it is a basis for $\mathcal{R}(A)$.
Two fundamentally important properties of linear spaces in general and row and column spaces in particular are described in the following two theorems.
Theorem 2.4.7. Every linear space (of $M \times N$ matrices) has a basis.
Theorem 2.4.8. Any two bases for a linear space (of $M \times N$ matrices) contain the same number of matrices.
The number of matrices in a basis for a linear space $\mathcal{V}$ (of $M \times N$ matrices) is referred to as the dimension of $\mathcal{V}$ and is denoted by the symbol $\dim \mathcal{V}$ or $\dim(\mathcal{V})$. Note that the term dimension is used not only in reference to the number of matrices in a basis, but also in reference to the number of rows or columns in a matrix—which usage is intended is determinable from the context.
Some basic results related to the dimension of a linear space or subspace are presented in the following two theorems.
Theorem 2.4.9. If a linear space $\mathcal{V}$ (of $M \times N$ matrices) is spanned by a set of $R$ matrices, then $\dim \mathcal{V} \le R$, and if there is a set of $K$ linearly independent matrices in $\mathcal{V}$, then $\dim \mathcal{V} \ge K$.
Theorem 2.4.10. Let $\mathcal{U}$ and $\mathcal{V}$ represent linear spaces of $M \times N$ matrices. If $\mathcal{U} \subset \mathcal{V}$ (i.e., if $\mathcal{U}$ is a subspace of $\mathcal{V}$), then $\dim \mathcal{U} \le \dim \mathcal{V}$. Moreover, if $\mathcal{U} \subset \mathcal{V}$ and if in addition $\dim \mathcal{U} = \dim \mathcal{V}$, then $\mathcal{U} = \mathcal{V}$.
Two key results pertaining to bases are as follows.
Theorem 2.4.11. Any set of $R$ linearly independent matrices in an $R$-dimensional linear space $\mathcal{V}$ (of $M \times N$ matrices) is a basis for $\mathcal{V}$.
Theorem 2.4.12. A matrix $A$ in a linear space $\mathcal{V}$ (of $M \times N$ matrices) has a unique representation in terms of any particular basis $\{A_1, A_2, \ldots, A_R\}$; that is, the coefficients $x_1, x_2, \ldots, x_R$ in the linear combination
$$A = x_1 A_1 + x_2 A_2 + \cdots + x_R A_R$$
are uniquely determined.
For proofs of the results set forth in Theorems 2.4.7 through 2.4.12, refer to Harville (1997, sec. 4.3).

e. Rank (of a matrix)


The row rank of a matrix $A$ is defined to be the dimension of the row space of $A$, and the column rank of $A$ is defined to be the dimension of the column space of $A$. A fundamental result on row and column spaces is given by the following theorem.
Theorem 2.4.13. The row rank of any matrix $A$ equals the column rank of $A$.
Refer to Harville (1997, sec. 4.4) for a proof of Theorem 2.4.13. That proof is based on the following result, which is of some interest in its own right.
Theorem 2.4.14. Let $A$ represent an $M \times N$ nonnull matrix of row rank $R$ and column rank $C$. Then, there exist an $M \times C$ matrix $B$ and a $C \times N$ matrix $L$ such that $A = BL$. Similarly, there exist an $M \times R$ matrix $K$ and an $R \times N$ matrix $T$ such that $A = KT$.
Proof (of Theorem 2.4.14). Take $B$ to be an $M \times C$ matrix whose columns form a basis for $\mathcal{C}(A)$. Then, $\mathcal{C}(A) = \mathcal{C}(B)$, and consequently it follows from Lemma 2.4.3 that there exists a $C \times N$ matrix $L$ such that $A = BL$. The existence of an $M \times R$ matrix $K$ and an $R \times N$ matrix $T$ such that $A = KT$ can be established via a similar argument. Q.E.D.
In light of Theorem 2.4.13, it is not necessary to distinguish between the row and column ranks of a matrix $A$. Their common value is called the rank of $A$ and is denoted by the symbol $\mathrm{rank}\,A$ or $\mathrm{rank}(A)$.
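As a small numerical illustration (added here; not part of the original text), the sketch below builds a matrix of known rank and checks that the row rank and the column rank reported by NumPy agree, in line with Theorem 2.4.13; the sizes are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(3)
    # Build a 5x4 matrix of rank 2 as a product of a full-column-rank and a full-row-rank factor.
    B = rng.standard_normal((5, 2))
    T = rng.standard_normal((2, 4))
    A = B @ T

    # Theorem 2.4.13: the row rank and the column rank coincide.
    print(np.linalg.matrix_rank(A))      # 2
    print(np.linalg.matrix_rank(A.T))    # 2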
Various of the results pertaining to the dimensions of linear spaces and subspaces can be specialized to row and column spaces and restated in terms of ranks. Since the column space of an $M \times N$ matrix $A$ is a subspace of $\mathbb{R}^M$ and the row space of $A$ is a subspace of $\mathbb{R}^N$, the following lemma is an immediate consequence of Theorem 2.4.10.
Lemma 2.4.15. For any $M \times N$ matrix $A$, $\mathrm{rank}(A) \le M$ and $\mathrm{rank}(A) \le N$.
A further implication of Theorem 2.4.10 is as follows.
Theorem 2.4.16. Let $A$ represent an $M \times N$ matrix, $B$ an $M \times P$ matrix, and $C$ a $Q \times N$ matrix. If $\mathcal{C}(B) \subset \mathcal{C}(A)$, then $\mathrm{rank}(B) \le \mathrm{rank}(A)$; if $\mathcal{C}(B) \subset \mathcal{C}(A)$ and if in addition $\mathrm{rank}(B) = \mathrm{rank}(A)$, then $\mathcal{C}(B) = \mathcal{C}(A)$. Similarly, if $\mathcal{R}(C) \subset \mathcal{R}(A)$, then $\mathrm{rank}(C) \le \mathrm{rank}(A)$; if $\mathcal{R}(C) \subset \mathcal{R}(A)$ and if in addition $\mathrm{rank}(C) = \mathrm{rank}(A)$, then $\mathcal{R}(C) = \mathcal{R}(A)$.
In light of Corollary 2.4.4, we have the following corollary of Theorem 2.4.16.
Corollary 2.4.17. Let $A$ represent an $M \times N$ matrix and $F$ an $N \times P$ matrix. Then, $\mathrm{rank}(AF) \le \mathrm{rank}(A)$ and $\mathrm{rank}(AF) \le \mathrm{rank}(F)$. Moreover, if $\mathrm{rank}(AF) = \mathrm{rank}(A)$, then $\mathcal{C}(AF) = \mathcal{C}(A)$; similarly, if $\mathrm{rank}(AF) = \mathrm{rank}(F)$, then $\mathcal{R}(AF) = \mathcal{R}(F)$.
The rank of an $M \times N$ matrix cannot exceed $\min(M, N)$, as is evident from Lemma 2.4.15. An $M \times N$ matrix $A$ is said to have full row rank if $\mathrm{rank}(A) = M$, that is, if its rank equals the number of rows, and to have full column rank if $\mathrm{rank}(A) = N$. Clearly, an $M \times N$ matrix can have full row rank only if $M \le N$, that is, only if the number of rows does not exceed the number of columns, and can have full column rank only if $N \le M$.
A matrix is said to be nonsingular if it has both full row rank and full column rank. Clearly, any nonsingular matrix is square. By definition, an $N \times N$ matrix $A$ is nonsingular if and only if $\mathrm{rank}(A) = N$. An $N \times N$ matrix of rank less than $N$ is said to be singular.
Any $M \times N$ matrix can be expressed as the product of a matrix having full column rank and a matrix having full row rank, as indicated by the following theorem.
Theorem 2.4.18. Let $A$ represent an $M \times N$ nonnull matrix of rank $R$. Then, there exist an $M \times R$ matrix $B$ and an $R \times N$ matrix $T$ such that $A = BT$. Moreover, for any $M \times R$ matrix $B$ and $R \times N$ matrix $T$ such that $A = BT$, $\mathrm{rank}(B) = \mathrm{rank}(T) = R$, that is, $B$ has full column rank and $T$ has full row rank.
Proof. The existence of an $M \times R$ matrix $B$ and an $R \times N$ matrix $T$ such that $A = BT$ follows from Theorem 2.4.14. And, letting $B$ represent any $M \times R$ matrix and $T$ any $R \times N$ matrix such that $A = BT$, we find that $\mathrm{rank}(B) \ge R$ and $\mathrm{rank}(T) \ge R$ (as is evident from Corollary 2.4.17) and that $\mathrm{rank}(B) \le R$ and $\mathrm{rank}(T) \le R$ (as is evident from Lemma 2.4.15), and as a consequence, we have that $\mathrm{rank}(B) = R$ and $\mathrm{rank}(T) = R$. Q.E.D.
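One convenient numerical route to such a full-rank factorization (added here as an illustration; it is not the construction used in the proof above) is a truncated singular value decomposition. The function name and tolerance below are ad hoc choices for this sketch.

    import numpy as np

    def full_rank_factorization(A, tol=1e-12):
        """Return B (full column rank) and T (full row rank) with A = B @ T,
        via a truncated SVD; one of several ways to realize Theorem 2.4.18."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        r = int(np.sum(s > tol * s[0]))          # numerical rank R (A assumed nonnull)
        B = U[:, :r] * s[:r]                     # M x R, full column rank
        T = Vt[:r, :]                            # R x N, full row rank
        return B, T

    A = np.array([[1., 2., 3.], [2., 4., 6.], [1., 0., 1.]])   # rank 2
    B, T = full_rank_factorization(A)
    assert np.allclose(B @ T, A)
    assert np.linalg.matrix_rank(B) == np.linalg.matrix_rank(T) == 2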
The following theorem, a proof of which is given by Harville (1997, sec. 4.4), characterizes the rank of a matrix in terms of the ranks of its submatrices.
Theorem 2.4.19. Let $A$ represent an $M \times N$ matrix of rank $R$. Then, $A$ contains $R$ linearly independent rows and $R$ linearly independent columns. And for any $R$ linearly independent rows and $R$ linearly independent columns of $A$, the $R \times R$ submatrix, obtained by striking out the other $M - R$ rows and $N - R$ columns, is nonsingular. Moreover, any set of more than $R$ rows or more than $R$ columns (of $A$) is linearly dependent, and there exists no submatrix of $A$ whose rank exceeds $R$.
As applied to symmetric matrices, Theorem 2.4.19 has the following implication.
Corollary 2.4.20. Any symmetric matrix of rank $R$ contains an $R \times R$ nonsingular principal submatrix.
The rank of a matrix was defined in terms of the dimension of its row and column spaces. Theorem
2.4.19 suggests some equivalent definitions. The rank of a matrix A is interpretable as the size of the
largest linearly independent set that can be formed from the rows of A. Similarly, it is interpretable
as the size of the largest linearly independent set that can be formed from the columns of A. The
rank of A is also interpretable as the size (number of rows or columns) of the largest nonsingular
(square) submatrix of A.
Clearly, an $M \times N$ matrix has full row rank if and only if all $M$ of its rows are linearly independent, and has full column rank if and only if all $N$ of its columns are linearly independent. An $N \times N$ matrix is nonsingular if and only if all of its rows are linearly independent; similarly, it is nonsingular if and only if all of its columns are linearly independent.
It is a simple exercise to show that, for any matrix $A$,
$$\mathrm{rank}(A') = \mathrm{rank}(A) \tag{4.1}$$
and that, for any matrix $A$ and nonzero scalar $k$,
$$\mathrm{rank}(kA) = \mathrm{rank}(A). \tag{4.2}$$
As a special case of result (4.2), we have that
$$\mathrm{rank}(-A) = \mathrm{rank}(A). \tag{4.3}$$

f. Orthogonal and orthonormal sets


Corresponding to an arbitrary pair of matrices, say $A$ and $B$, in a linear space $\mathcal{V}$ of $M \times N$ matrices is a scalar that is denoted by the symbol $A \cdot B$ and that is referred to as the inner product (or dot product) of $A$ and $B$. The inner product of $A$ and $B$ can be regarded as the value assigned to $A$ and $B$ by a function whose domain consists of all (ordered) pairs of matrices in $\mathcal{V}$. This function is required to have the following four properties (but is otherwise subject to choice):
(1) $A \cdot B = B \cdot A$;
(2) $A \cdot A \ge 0$, with equality holding if and only if $A = 0$;
(3) $(kA) \cdot B = k(A \cdot B)$;
(4) $(A + B) \cdot C = (A \cdot C) + (B \cdot C)$
(where $A$, $B$, and $C$ represent arbitrary matrices in $\mathcal{V}$ and $k$ represents an arbitrary scalar).
The term inner product is used not only in referring to the values assigned by the function to the various pairs of matrices, but also in referring to the function itself. The usual inner product for a linear space $\mathcal{V}$ of $M \times N$ matrices assigns to each pair of matrices $A$ and $B$ in $\mathcal{V}$ the value
$$A \cdot B = \mathrm{tr}(AB') = \sum_{i=1}^{M} \sum_{j=1}^{N} a_{ij} b_{ij}. \tag{4.4}$$
It is a simple exercise to verify that the function defined by expression (4.4) has the four properties required of an inner product. In the special case of a linear space $\mathcal{V}$ of $M$-dimensional column vectors, the value assigned by the usual inner product to each pair of vectors $x = \{x_i\}$ and $y = \{y_i\}$ in $\mathcal{V}$ is expressible as
$$x \cdot y = \mathrm{tr}(xy') = \mathrm{tr}(y'x) = y'x = x'y = \sum_{i=1}^{M} x_i y_i. \tag{4.5}$$
And in the special case of a linear space $\mathcal{V}$ of $N$-dimensional row vectors, the value assigned by the usual inner product to each pair of vectors $x' = \{x_j\}$ and $y' = \{y_j\}$ in $\mathcal{V}$ is expressible as
$$x' \cdot y' = \mathrm{tr}[x'(y')'] = \mathrm{tr}(x'y) = x'y = \sum_{j=1}^{N} x_j y_j. \tag{4.6}$$
The four basic properties of an inner product for a linear space $\mathcal{V}$ (of $M \times N$ matrices) imply various additional properties. We find, in particular, that (for any matrix $A$ in $\mathcal{V}$)
$$0 \cdot A = 0, \tag{4.7}$$
as is evident from Property (3) upon observing that
$$0 \cdot A = (0A) \cdot A = 0(A \cdot A) = 0.$$
And by making repeated use of Properties (3) and (4), we find that (for any matrices $A_1, A_2, \ldots, A_K$, and $B$ in $\mathcal{V}$ and any scalars $x_1, x_2, \ldots, x_K$),
$$(x_1 A_1 + x_2 A_2 + \cdots + x_K A_K) \cdot B = x_1 (A_1 \cdot B) + x_2 (A_2 \cdot B) + \cdots + x_K (A_K \cdot B). \tag{4.8}$$
Corresponding to an arbitrary matrix, say $A$, in the linear space $\mathcal{V}$ of $M \times N$ matrices is the scalar $(A \cdot A)^{1/2}$. This scalar is called the norm of $A$ and is denoted by the symbol $\|A\|$. The norm depends on the choice of inner product; when the inner product is taken to be the usual inner product, the norm is referred to as the usual norm.
An important and famous inequality, known as the Schwarz inequality or Cauchy–Schwarz inequality, is set forth in the following theorem, a proof of which is given by Harville (1997, sec. 6.3).
Theorem 2.4.21 (Cauchy–Schwarz inequality). For any two matrices $A$ and $B$ in a linear space $\mathcal{V}$,
$$|A \cdot B| \le \|A\| \, \|B\|, \tag{4.9}$$
with equality holding if and only if $B = 0$ or $A = kB$ for some scalar $k$.
As a special case of Theorem 2.4.21, we have that for any two $M$-dimensional column vectors $x$ and $y$,
$$|x'y| \le (x'x)^{1/2} (y'y)^{1/2}, \tag{4.10}$$
with equality holding if and only if $y = 0$ or $x = ky$ for some scalar $k$.
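The sketch below (an added illustration, not part of the original text) checks inequality (4.9) numerically for the usual inner product (4.4), together with the equality case in which one matrix is a scalar multiple of the other; the matrices are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((2, 3))
    B = rng.standard_normal((2, 3))

    inner = np.trace(A @ B.T)                    # usual inner product (4.4)
    norm_A = np.sqrt(np.trace(A @ A.T))          # usual norm
    norm_B = np.sqrt(np.trace(B @ B.T))

    # (4.9): |A . B| <= ||A|| ||B||.
    assert abs(inner) <= norm_A * norm_B + 1e-12

    # Equality case A = 2B: |A . B| equals ||A|| ||B|| = 2 ||B||^2.
    assert np.isclose(abs(np.trace((2 * B) @ B.T)), 2 * norm_B**2)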
Two vectors $x$ and $y$ in a linear space $\mathcal{V}$ of $M$-dimensional column vectors are said to be orthogonal to each other if $x \cdot y = 0$. More generally, two matrices $A$ and $B$ in a linear space $\mathcal{V}$ are said to be orthogonal to each other if $A \cdot B = 0$. The statement that two matrices $A$ and $B$ are orthogonal to each other is sometimes abbreviated to $A \perp B$. Whether two matrices are orthogonal to each other depends on the choice of inner product; two matrices that are orthogonal (to each other) with respect to one inner product may not be orthogonal (to each other) with respect to another inner product.
A finite set of matrices in a linear space $\mathcal{V}$ of $M \times N$ matrices is said to be orthogonal if every matrix in the set is orthogonal to every other matrix in the set. Thus, the empty set and any set containing only one matrix are orthogonal sets. And a finite set $\{A_1, A_2, \ldots, A_K\}$ of two or more matrices in $\mathcal{V}$ is an orthogonal set if $A_i \cdot A_j = 0$ for $j \neq i = 1, 2, \ldots, K$. A finite set of matrices in $\mathcal{V}$ is said to be orthonormal if the set is orthogonal and if the norm of every matrix in the set equals 1. In the special case of a set of (row or column) vectors, the expression “set of orthogonal (or orthonormal) vectors,” or simply “orthogonal (or orthonormal) vectors,” is often used in lieu of the technically more correct expression “orthogonal (or orthonormal) set of vectors.”
The following lemma establishes a connection between orthogonality and linear independence.
Lemma 2.4.22. An orthogonal set of nonnull matrices is linearly independent.
Proof. If the orthogonal set is the empty set, then the result is clearly true (since, by convention, the empty set is linearly independent). Suppose then that $\{A_1, A_2, \ldots, A_K\}$ is any nonempty orthogonal set of nonnull matrices. And let $x_1, x_2, \ldots, x_K$ represent arbitrary scalars such that $x_1 A_1 + x_2 A_2 + \cdots + x_K A_K = 0$. For $i = 1, 2, \ldots, K$, we find [in light of results (4.7) and (4.8)] that
$$
0 = 0 \cdot A_i = (x_1 A_1 + x_2 A_2 + \cdots + x_K A_K) \cdot A_i
= x_1 (A_1 \cdot A_i) + x_2 (A_2 \cdot A_i) + \cdots + x_K (A_K \cdot A_i)
= x_i (A_i \cdot A_i),
$$
implying (since $A_i$ is nonnull) that $x_i = 0$. We conclude that the set $\{A_1, A_2, \ldots, A_K\}$ is linearly independent. Q.E.D.
Note that Lemma 2.4.22 implies in particular that any orthonormal set of matrices is linearly independent. Note also that the converse of Lemma 2.4.22 is not necessarily true; that is, a linearly independent set is not necessarily orthogonal. For example, the set consisting of the two 2-dimensional row vectors $(1, 0)$ and $(1, 1)$ is linearly independent but is not orthogonal (with respect to the usual inner product).
Suppose now that $A_1, A_2, \ldots, A_K$ are linearly independent matrices in a linear space $\mathcal{V}$ of $M \times N$ matrices. There exists a recursive procedure, known as Gram–Schmidt orthogonalization, that, when applied to $A_1, A_2, \ldots, A_K$, generates an orthonormal set of $M \times N$ matrices $B_1, B_2, \ldots, B_K$ (the $j$th of which is a linear combination of $A_1, A_2, \ldots, A_j$)—refer, for example, to Harville (1997, sec. 6.4) for a discussion of Gram–Schmidt orthogonalization. In combination with Theorems 2.4.7 and 2.4.11 and Lemma 2.4.22, the existence of such a procedure leads to the following conclusion.
Theorem 2.4.23. Every linear space (of $M \times N$ matrices) has an orthonormal basis.
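A minimal sketch of the Gram–Schmidt procedure for column vectors under the usual inner product is given below; it is added here only as an illustration (the procedure itself is developed in Harville 1997, sec. 6.4), and the function name is an ad hoc choice.

    import numpy as np

    def gram_schmidt(vectors):
        """Turn linearly independent vectors a1, ..., aK into an orthonormal list
        b1, ..., bK (usual inner product), where each bj is a linear combination
        of a1, ..., aj."""
        basis = []
        for a in vectors:
            b = a.astype(float).copy()
            for q in basis:
                b -= (q @ b) * q            # subtract the component of b along q
            basis.append(b / np.linalg.norm(b))
        return basis

    B = gram_schmidt([np.array([1., 0., 1.]), np.array([1., 1., 0.])])
    print(np.round(B[0] @ B[1], 12))                     # 0.0: orthogonal
    print(np.linalg.norm(B[0]), np.linalg.norm(B[1]))    # 1.0 1.0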

g. Some results on the rank of a matrix partitioned into blocks of rows or columns
and on the rank and row or column space of a sum of matrices
A basic result on the rank of a matrix that has been partitioned into two blocks of rows or columns
is as follows.
Lemma 2.4.24. For any M  N matrix A, M  P matrix B, and Q  N matrix C,

rank .A; B/  rank.A/ C rank.B/ (4.11)


and  
A
rank  rank.A/ C rank.C/: (4.12)
C
Proof. Let R D rank.A/ and S D rank.B/. Then, there exist R M -dimensional column
vectors, say x1 ; x2 ; : : : ; xR , that form a basis for C.A/ and S M -dimensional column vectors, say
y1 ; y2 ; : : : ; yS that form a basis for C.B/. Clearly, any vector in the column space of the partitioned
matrix .A; B/ is expressible in the form A`1 C B`2 for some N -dimensional column vector `1 and
some P -dimensional column vector `2 . Moreover, A`1 is expressible as a linear combination of
x1 ; x2 ; : : : ; xR and B`2 as a linear combination of y1 ; y2 ; : : : ; yS , so that A`1 C B`2 is expressible
as a linear combination of x1 ; x2 ; : : : ; xR ; y1 ; y2 ; : : : ; yS . Thus, adopting an abbreviated notation in
which C.A; B/ is written for the column space CŒ.A; B/ of the partitioned matrix .A; B/, C.A; B/
is spanned by the set fx1 ; x2 ; : : : ; xR ; y1 ; y2 , : : : ; yS g, implying (in light of Theorem 2.4.9) that

rank .A; B/ D dim C.A; B/  R C S;

which establishes inequality (4.11). Inequality (4.12) can be established via an analogous argu-
ment. Q.E.D.
Upon the repeated application of result (4.11), we obtain the more general result that for any matrices $A_1, A_2, \ldots, A_K$ having $M$ rows,
$$\mathrm{rank}(A_1, A_2, \ldots, A_K) \le \mathrm{rank}(A_1) + \mathrm{rank}(A_2) + \cdots + \mathrm{rank}(A_K). \tag{4.13}$$
And, similarly, upon the repeated application of result (4.12), we find that for any matrices $A_1, A_2, \ldots, A_K$ having $N$ columns,
$$\mathrm{rank}\begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_K \end{pmatrix} \le \mathrm{rank}(A_1) + \mathrm{rank}(A_2) + \cdots + \mathrm{rank}(A_K). \tag{4.14}$$

Every linear space of $M \times N$ matrices contains the $M \times N$ null matrix $0$. When the intersection $\mathcal{U} \cap \mathcal{V}$ of two linear spaces $\mathcal{U}$ and $\mathcal{V}$ of $M \times N$ matrices contains no matrices other than the $M \times N$ null matrix, $\mathcal{U}$ and $\mathcal{V}$ are said to be essentially disjoint. The following theorem gives a necessary and sufficient condition for equality to hold in inequality (4.11) or (4.12) of Lemma 2.4.24.
Theorem 2.4.25. Let $A$ represent an $M \times N$ matrix, $B$ an $M \times P$ matrix, and $C$ a $Q \times N$ matrix. Then,
$$\mathrm{rank}(A, B) = \mathrm{rank}(A) + \mathrm{rank}(B) \tag{4.15}$$
if and only if $\mathcal{C}(A)$ and $\mathcal{C}(B)$ are essentially disjoint, and, similarly,
$$\mathrm{rank}\begin{pmatrix} A \\ C \end{pmatrix} = \mathrm{rank}(A) + \mathrm{rank}(C) \tag{4.16}$$
if and only if $\mathcal{R}(A)$ and $\mathcal{R}(C)$ are essentially disjoint.
Proof. Let $R = \mathrm{rank}\,A$ and $S = \mathrm{rank}\,B$. And take $a_1, a_2, \ldots, a_R$ to be any $R$ linearly independent columns of $A$ and $b_1, b_2, \ldots, b_S$ any $S$ linearly independent columns of $B$—their existence follows from Theorem 2.4.19. Clearly,
$$\mathcal{C}(A, B) = \mathrm{sp}(a_1, a_2, \ldots, a_R, b_1, b_2, \ldots, b_S).$$
Thus, to establish the first part of Theorem 2.4.25, it suffices to show that the set $\{a_1, a_2, \ldots, a_R, b_1, b_2, \ldots, b_S\}$ is linearly independent if and only if $\mathcal{C}(A)$ and $\mathcal{C}(B)$ are essentially disjoint.
Accordingly, suppose that $\mathcal{C}(A)$ and $\mathcal{C}(B)$ are essentially disjoint. Then, for any scalars $c_1, c_2, \ldots, c_R$ and $k_1, k_2, \ldots, k_S$ such that $\sum_{i=1}^{R} c_i a_i + \sum_{j=1}^{S} k_j b_j = 0$, we have that $\sum_{i=1}^{R} c_i a_i = -\sum_{j=1}^{S} k_j b_j$, implying (in light of the essential disjointness) that $\sum_{i=1}^{R} c_i a_i = 0$ and $\sum_{j=1}^{S} k_j b_j = 0$ and hence that $c_1 = c_2 = \cdots = c_R = 0$ and $k_1 = k_2 = \cdots = k_S = 0$. And we conclude that the set $\{a_1, a_2, \ldots, a_R, b_1, b_2, \ldots, b_S\}$ is linearly independent.
Conversely, if $\mathcal{C}(A)$ and $\mathcal{C}(B)$ were not essentially disjoint, there would exist scalars $c_1, c_2, \ldots, c_R$ and $k_1, k_2, \ldots, k_S$, not all of which are 0, such that $\sum_{i=1}^{R} c_i a_i = \sum_{j=1}^{S} k_j b_j$ or, equivalently, such that $\sum_{i=1}^{R} c_i a_i + \sum_{j=1}^{S} (-k_j) b_j = 0$, in which case the set $\{a_1, a_2, \ldots, a_R, b_1, b_2, \ldots, b_S\}$ would be linearly dependent.
The second part of Theorem 2.4.25 can be proved in similar fashion. Q.E.D.
Suppose that the $M \times N$ matrix $A$ and the $M \times P$ matrix $B$ are such that $A'B = 0$. Then, for any $N \times 1$ vector $k$ and any $P \times 1$ vector $\ell$ such that $Ak = B\ell$,
$$A'Ak = A'B\ell = 0,$$
implying (in light of Corollary 2.3.4) that $Ak = 0$ and leading to the conclusion that $\mathcal{C}(A)$ and $\mathcal{C}(B)$ are essentially disjoint. Similarly, if $AC' = 0$, then $\mathcal{R}(A)$ and $\mathcal{R}(C)$ are essentially disjoint. Thus, we have the following corollary of Theorem 2.4.25.
Corollary 2.4.26. Let $A$ represent an $M \times N$ matrix. Then, for any $M \times P$ matrix $B$ such that $A'B = 0$,
$$\mathrm{rank}(A, B) = \mathrm{rank}(A) + \mathrm{rank}(B).$$
And for any $Q \times N$ matrix $C$ such that $AC' = 0$,
$$\mathrm{rank}\begin{pmatrix} A \\ C \end{pmatrix} = \mathrm{rank}(A) + \mathrm{rank}(C).$$

For any two $M \times N$ matrices $A$ and $B$,
$$A + B = (A, B) \begin{pmatrix} I \\ I \end{pmatrix}.$$
Thus, in light of Corollaries 2.4.4 and 2.4.17 and Lemma 2.4.24, we have the following lemma and corollary.
Lemma 2.4.27. For any two $M \times N$ matrices $A$ and $B$,
$$\mathcal{C}(A + B) \subset \mathcal{C}(A, B), \qquad \mathrm{rank}(A + B) \le \mathrm{rank}(A, B), \tag{4.17}$$
$$\mathcal{R}(A + B) \subset \mathcal{R}\begin{pmatrix} A \\ B \end{pmatrix}, \qquad \mathrm{rank}(A + B) \le \mathrm{rank}\begin{pmatrix} A \\ B \end{pmatrix}. \tag{4.18}$$
Corollary 2.4.28. For any two $M \times N$ matrices $A$ and $B$,
$$\mathrm{rank}(A + B) \le \mathrm{rank}(A) + \mathrm{rank}(B). \tag{4.19}$$

Upon the repeated application of results (4.17), (4.18), and (4.19), we obtain the more general results that for any $K$ $M \times N$ matrices $A_1, A_2, \ldots, A_K$,
$$\mathcal{C}(A_1 + A_2 + \cdots + A_K) \subset \mathcal{C}(A_1, A_2, \ldots, A_K), \tag{4.20}$$
$$\mathrm{rank}(A_1 + A_2 + \cdots + A_K) \le \mathrm{rank}(A_1, A_2, \ldots, A_K), \tag{4.21}$$
$$\mathcal{R}(A_1 + A_2 + \cdots + A_K) \subset \mathcal{R}\begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_K \end{pmatrix}, \tag{4.22}$$
$$\mathrm{rank}(A_1 + A_2 + \cdots + A_K) \le \mathrm{rank}\begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_K \end{pmatrix}, \tag{4.23}$$
and
$$\mathrm{rank}(A_1 + A_2 + \cdots + A_K) \le \mathrm{rank}(A_1) + \mathrm{rank}(A_2) + \cdots + \mathrm{rank}(A_K). \tag{4.24}$$

2.5 Inverse Matrices


A right inverse of an $M \times N$ matrix $A$ is an $N \times M$ matrix $R$ such that $AR = I_M$. Similarly, a left inverse of an $M \times N$ matrix $A$ is an $N \times M$ matrix $L$ such that $LA = I_N$ (or, equivalently, such that $A'L' = I_N$). A matrix may or may not have a right or left inverse, as indicated by the following lemma.
Lemma 2.5.1. An $M \times N$ matrix $A$ has a right inverse if and only if $\mathrm{rank}(A) = M$ (i.e., if and only if $A$ has full row rank) and has a left inverse if and only if $\mathrm{rank}(A) = N$ (i.e., if and only if $A$ has full column rank).
Proof. If $\mathrm{rank}(A) = M$, then $\mathcal{C}(A) = \mathcal{C}(I_M)$ [as is evident from Theorem 2.4.16 upon observing that $\mathcal{C}(A) \subset \mathbb{R}^M = \mathcal{C}(I_M)$], implying (in light of Lemma 2.4.3) that there exists a matrix $R$ such that $AR = I_M$ (i.e., that $A$ has a right inverse). Conversely, if there exists a matrix $R$ such that $AR = I_M$, then
$$\mathrm{rank}(A) \ge \mathrm{rank}(AR) = \mathrm{rank}(I_M) = M,$$
implying [since, according to Lemma 2.4.15, $\mathrm{rank}(A) \le M$] that $\mathrm{rank}(A) = M$. That $A$ has a left inverse if and only if $\mathrm{rank}(A) = N$ is evident upon observing that $A$ has a left inverse if and only if $A'$ has a right inverse [and recalling that $\mathrm{rank}(A') = \mathrm{rank}(A)$]. Q.E.D.
As an almost immediate consequence of Lemma 2.5.1, we have the following corollary.
Corollary 2.5.2. A matrix $A$ has both a right inverse and a left inverse if and only if $A$ is a (square) nonsingular matrix.
If there exists a matrix $B$ that is both a right and left inverse of a matrix $A$ (so that $AB = I$ and $BA = I$), then $A$ is said to be invertible and $B$ is referred to as an inverse of $A$. Only a (square) nonsingular matrix can be invertible, as is evident from Corollary 2.5.2.
The following lemma and theorem include some basic results on the existence and uniqueness of inverse matrices.
Lemma 2.5.3. If a square matrix $A$ has a right or left inverse $B$, then $A$ is nonsingular and $B$ is an inverse of $A$.
Proof. Suppose that $A$ has a right inverse $R$. Then, it follows from Lemma 2.5.1 that $A$ is nonsingular and further that $A$ has a left inverse $L$. Observing that
$$L = LI = LAR = IR = R$$
and hence that $RA = LA = I$, we conclude that $R$ is an inverse of $A$.
A similar argument can be used to show that if $A$ has a left inverse $L$, then $A$ is nonsingular and $L$ is an inverse of $A$. Q.E.D.
Theorem 2.5.4. A matrix is invertible if and only if it is a (square) nonsingular matrix. Further, any nonsingular matrix has a unique inverse $B$ and has no right or left inverse other than $B$.
Proof. Suppose that $A$ is a nonsingular matrix. Then, it follows from Lemma 2.5.1 that $A$ has a right inverse $B$ and from Lemma 2.5.3 that $B$ is an inverse of $A$. Thus, $A$ is invertible. Moreover, for any inverse $C$ of $A$, we find that
$$C = CI = CAB = IB = B,$$
implying that $A$ has a unique inverse and further—in light of Lemma 2.5.3—that $A$ has no right or left inverse other than $B$.
That any invertible matrix is nonsingular is (as noted earlier) evident from Corollary 2.5.2. Q.E.D.
The symbol A 1 is used to denote the inverse of a nonsingular matrix A. By definition,
1
AA D A 1 A D I:

A 1 × 1 matrix A = (a₁₁) is invertible if and only if its element a₁₁ is nonzero, in which case
\[
A^{-1} = (1/a_{11}).
\]
For a 2 × 2 matrix A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, we find that
\[
AB = kI,
\]
where B = \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix} and k = a_{11}a_{22} - a_{12}a_{21}. If k = 0, then AB = 0, implying that the
columns of A are linearly dependent, in which case A is singular and hence not invertible. If k ≠ 0,
then A[(1/k)B] = I, in which case A is invertible and
\[
A^{-1} = (1/k)B. \tag{5.1}
\]
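As an added illustration, the following Python/NumPy sketch implements formula (5.1) and compares the result with numpy.linalg.inv; the particular matrix is an arbitrary example.

    import numpy as np

    def inv_2x2(A):
        """Invert a 2 x 2 matrix via formula (5.1): A^{-1} = (1/k) B, where B is
        the adjugate of A and k = a11*a22 - a12*a21."""
        (a11, a12), (a21, a22) = A
        k = a11 * a22 - a12 * a21
        if k == 0:
            raise ValueError("matrix is singular")
        B = np.array([[a22, -a12], [-a21, a11]], dtype=float)
        return B / k

    A = np.array([[3.0, 1.0], [4.0, 2.0]])
    print(inv_2x2(A))
    print(np.linalg.inv(A))   # agrees with the formula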

a. Basic results on inverses and invertibility


For any nonsingular matrix A and any nonzero scalar k, kA is nonsingular and
\[
(kA)^{-1} = (1/k)A^{-1}, \tag{5.2}
\]
as is easily verified. In the special case k = −1, equality (5.2) reduces to
\[
(-A)^{-1} = -A^{-1}. \tag{5.3}
\]
It is easy to show that, for any nonsingular matrix A, A′ is nonsingular, and
\[
(A')^{-1} = (A^{-1})'. \tag{5.4}
\]
In the special case of a symmetric matrix A, equality (5.4) reduces to
\[
A^{-1} = (A^{-1})'. \tag{5.5}
\]
Thus, the inverse of any nonsingular symmetric matrix is symmetric.


The inverse A⁻¹ of an N × N nonsingular matrix A is invertible, or equivalently (in light of
Theorem 2.5.4)
\[
rank(A^{-1}) = N, \tag{5.6}
\]
and
\[
(A^{-1})^{-1} = A, \tag{5.7}
\]
that is, the inverse of A⁻¹ is A (as is evident from the very definition of A⁻¹).
For any two N × N nonsingular matrices A and B,
\[
rank(AB) = N, \tag{5.8}
\]
that is, AB is nonsingular, and
\[
(AB)^{-1} = B^{-1}A^{-1}. \tag{5.9}
\]
Results (5.8) and (5.9) can be easily verified by observing that ABB⁻¹A⁻¹ = I (so that B⁻¹A⁻¹
is a right inverse of AB) and applying Lemma 2.5.3. (If either or both of two N × N matrices A
and B are singular, then their product AB is singular, as is evident from Corollary 2.4.17.) Repeated
application of results (5.8) and (5.9) leads to the conclusion that, for any K nonsingular matrices
A₁, A₂, ..., A_K of order N,
\[
rank(A_1 A_2 \cdots A_K) = N \tag{5.10}
\]
and
\[
(A_1 A_2 \cdots A_K)^{-1} = A_K^{-1} \cdots A_2^{-1} A_1^{-1}. \tag{5.11}
\]
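As an added illustration, the following Python/NumPy sketch checks results (5.9) and (5.11) on randomly generated matrices (which are nonsingular with probability one); the sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 4))
    B = rng.standard_normal((4, 4))
    C = rng.standard_normal((4, 4))

    inv = np.linalg.inv
    # (AB)^{-1} = B^{-1} A^{-1}, result (5.9)
    print(np.allclose(inv(A @ B), inv(B) @ inv(A)))              # True
    # (A1 A2 A3)^{-1} = A3^{-1} A2^{-1} A1^{-1}, result (5.11)
    print(np.allclose(inv(A @ B @ C), inv(C) @ inv(B) @ inv(A))) # True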

b. Some results on the ranks and row and column spaces of matrix products
The following lemma gives some basic results on the effects of premultiplication or postmultiplication
by a matrix of full row or column rank.
Lemma 2.5.5. Let A represent an M  N matrix and B an N  P matrix. If A has full column
rank, then
R.AB/ D R.B/ and rank.AB/ D rank.B/:
Similarly, if B has full row rank, then

C.AB/ D C.A/ and rank.AB/ D rank.A/:

Proof. It is clear from Corollary 2.4.4 that R.AB/  R.B/ and C.AB/  C.A/. If A has full
column rank, then (according to Lemma 2.5.1) it has a left inverse L, implying that

R.B/ D R.IB/ D R.LAB/  R.AB/

and hence that R.AB/ D R.B/ [which implies, in turn, that rank.AB/ D rank.B/]. Similarly, if B
has full row rank, then it has a right inverse R, implying that C.A/ D C.ABR/  C.AB/ and hence
that C.AB/ D C.A/ [and rank.AB/ D rank.A/]. Q.E.D.
As an immediate consequence of Lemma 2.5.5, we have the following corollary.
Corollary 2.5.6. If A is an N  N nonsingular matrix, then for any N  P matrix B,

R.AB/ D R.B/ and rank.AB/ D rank.B/:

Similarly, if B is an N  N nonsingular matrix, then for any M  N matrix A,

C.AB/ D C.A/ and rank.AB/ D rank.A/:

2.6 Ranks and Inverses of Partitioned Matrices


Expressions can be obtained for the ranks and inverses of partitioned matrices in terms of their con-
stituent blocks. In the special case of block-diagonal and block-triangular matrices, these expressions
are relatively simple.

a. Special case: block-diagonal matrices


The following lemma relates the rank of a block-diagonal matrix to the ranks of its diagonal blocks.
Lemma 2.6.1. For any M × N matrix A and P × Q matrix B,
\[
rank\begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix} = rank(A) + rank(B). \tag{6.1}
\]
Proof. Let R = rank A and S = rank B. And suppose that R > 0 and S > 0—if R or S equals
0 (in which case A = 0 or B = 0), equality (6.1) is clearly valid. Then, according to Theorem 2.4.18,
there exist an M × R matrix A_* and an R × N matrix E such that A = A_*E, and, similarly, there exist
a P × S matrix B_* and an S × Q matrix F such that B = B_*F. Moreover, rank A_* = rank E = R
and rank B_* = rank F = S.
We have that
\[
\begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}
= \begin{pmatrix} A_* & 0 \\ 0 & B_* \end{pmatrix}\begin{pmatrix} E & 0 \\ 0 & F \end{pmatrix}.
\]
Further, the columns of diag(A_*, B_*) are linearly independent, as is evident upon observing that,
for any R-dimensional column vector c and S-dimensional column vector d such that
\[
\begin{pmatrix} A_* & 0 \\ 0 & B_* \end{pmatrix}\begin{pmatrix} c \\ d \end{pmatrix} = 0,
\]
A_*c = 0 and B_*d = 0, implying (since the columns of A_* are linearly independent) that c = 0 and
likewise that d = 0. Similarly, the rows of diag(E, F) are linearly independent. Thus, diag(A_*, B_*)
has full column rank and diag(E, F) has full row rank. And, recalling Lemma 2.5.5, we conclude
that
\[
rank\begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}
= rank\begin{pmatrix} A_* & 0 \\ 0 & B_* \end{pmatrix} = R + S. \qquad Q.E.D.
\]
Repeated application of result (6.1) gives the following formula for the rank of a block-diagonal
matrix with diagonal blocks A1 ; A2 ; : : : ; AK :

rankŒdiag.A1 ; A2 ; : : : ; AK / D rank.A1 / C rank.A2 / C    C rank.AK /: (6.2)

Note that result (6.2) implies in particular that the rank of a diagonal matrix D equals the number of
nonzero diagonal elements in D.
Let T represent an M × M matrix and W an N × N matrix. Then, the (M + N) × (M + N)
block-diagonal matrix diag(T, W) is nonsingular if and only if both T and W are nonsingular, as
is evident from Lemma 2.6.1. Moreover, if both T and W are nonsingular, then
\[
\begin{pmatrix} T & 0 \\ 0 & W \end{pmatrix}^{-1}
= \begin{pmatrix} T^{-1} & 0 \\ 0 & W^{-1} \end{pmatrix}, \tag{6.3}
\]
as is easily verified.
More generally, for any square matrices A₁, A₂, ..., A_K, the block-diagonal matrix
diag(A₁, A₂, ..., A_K) is nonsingular if and only if A₁, A₂, ..., A_K are all nonsingular [as is evident
from result (6.2)], in which case
\[
[diag(A_1, A_2, \ldots, A_K)]^{-1} = diag(A_1^{-1}, A_2^{-1}, \ldots, A_K^{-1}). \tag{6.4}
\]
As what is essentially a special case of this result, we have that an N × N diagonal matrix
diag(d₁, d₂, ..., d_N) is nonsingular if and only if its diagonal elements d₁, d₂, ..., d_N are all
nonzero, in which case
\[
[diag(d_1, d_2, \ldots, d_N)]^{-1} = diag(1/d_1, 1/d_2, \ldots, 1/d_N). \tag{6.5}
\]
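As an added numerical check of result (6.4), the following Python/NumPy sketch assembles a small block-diagonal matrix, inverts it directly, and compares the result with the matrix built from the blockwise inverses; the particular blocks are arbitrary.

    import numpy as np

    A1 = np.array([[2.0, 1.0],
                   [1.0, 1.0]])
    A2 = np.array([[3.0]])

    Z12 = np.zeros((2, 1))
    Z21 = np.zeros((1, 2))
    M = np.block([[A1, Z12], [Z21, A2]])                       # diag(A1, A2)
    M_inv_blockwise = np.block([[np.linalg.inv(A1), Z12],
                                [Z21, np.linalg.inv(A2)]])      # diag(A1^{-1}, A2^{-1})

    print(np.allclose(np.linalg.inv(M), M_inv_blockwise))       # True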

b. Special case: block-triangular matrices

Consider a block-triangular matrix of the simple form
\[
\begin{pmatrix} I_M & 0 \\ V & I_N \end{pmatrix}
\quad\text{or}\quad
\begin{pmatrix} I_N & V \\ 0 & I_M \end{pmatrix},
\]
where V is an N × M matrix. Upon recalling (from Theorem 2.5.4) that an invertible matrix is
nonsingular and observing that
\[
\begin{pmatrix} I & 0 \\ -V & I \end{pmatrix}\begin{pmatrix} I & 0 \\ V & I \end{pmatrix}
= \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix}
\quad\text{and}\quad
\begin{pmatrix} I & V \\ 0 & I \end{pmatrix}\begin{pmatrix} I & -V \\ 0 & I \end{pmatrix}
= \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix},
\]
we obtain the following result.
Lemma 2.6.2. For any N × M matrix V, the (M + N) × (M + N) partitioned matrices
\[
\begin{pmatrix} I_M & 0 \\ V & I_N \end{pmatrix}
\quad\text{and}\quad
\begin{pmatrix} I_N & V \\ 0 & I_M \end{pmatrix}
\]
are nonsingular, and
\[
\begin{pmatrix} I_M & 0 \\ V & I_N \end{pmatrix}^{-1} = \begin{pmatrix} I_M & 0 \\ -V & I_N \end{pmatrix}
\quad\text{and}\quad
\begin{pmatrix} I_N & V \\ 0 & I_M \end{pmatrix}^{-1} = \begin{pmatrix} I_N & -V \\ 0 & I_M \end{pmatrix}.
\]

Formula (6.1) for the rank of a block-diagonal matrix can be extended to certain block-triangular
matrices, as indicated by the following lemma.
Lemma 2.6.3. Let T represent an M × P matrix, V an N × P matrix, and W an N × Q matrix.
If T has full column rank or W has full row rank, that is, if rank(T) = P or rank(W) = N, then
\[
rank\begin{pmatrix} T & 0 \\ V & W \end{pmatrix}
= rank\begin{pmatrix} W & V \\ 0 & T \end{pmatrix}
= rank(T) + rank(W). \tag{6.6}
\]
Proof. Suppose that rank(T) = P. Then, according to Lemma 2.5.1, there exists a matrix L
that is a left inverse of T, in which case
\[
\begin{pmatrix} I & -VL \\ 0 & I \end{pmatrix}\begin{pmatrix} W & V \\ 0 & T \end{pmatrix}
= \begin{pmatrix} W & 0 \\ 0 & T \end{pmatrix}
\quad\text{and}\quad
\begin{pmatrix} I & 0 \\ -VL & I \end{pmatrix}\begin{pmatrix} T & 0 \\ V & W \end{pmatrix}
= \begin{pmatrix} T & 0 \\ 0 & W \end{pmatrix}.
\]
Since (according to Lemma 2.6.2) \begin{pmatrix} I & -VL \\ 0 & I \end{pmatrix} and
\begin{pmatrix} I & 0 \\ -VL & I \end{pmatrix} are nonsingular, we conclude (on
the basis of Corollary 2.5.6) that
\[
rank\begin{pmatrix} W & V \\ 0 & T \end{pmatrix} = rank\begin{pmatrix} W & 0 \\ 0 & T \end{pmatrix}
\quad\text{and}\quad
rank\begin{pmatrix} T & 0 \\ V & W \end{pmatrix} = rank\begin{pmatrix} T & 0 \\ 0 & W \end{pmatrix}
\]
and hence (in light of Lemma 2.6.1) that
\[
rank\begin{pmatrix} T & 0 \\ V & W \end{pmatrix}
= rank\begin{pmatrix} W & V \\ 0 & T \end{pmatrix}
= rank(T) + rank(W).
\]
That result (6.6) holds if rank(W) = N can be established via an analogous argument. Q.E.D.
The results of Lemma 2.6.2 can be extended to additional block-triangular matrices, as detailed
in the following lemma.
Lemma 2.6.4. Let T represent an M × M matrix, V an N × M matrix, and W an N × N
matrix.
(1) The (M + N) × (M + N) partitioned matrix \begin{pmatrix} T & 0 \\ V & W \end{pmatrix} is nonsingular if and only if both T
and W are nonsingular, in which case
\[
\begin{pmatrix} T & 0 \\ V & W \end{pmatrix}^{-1}
= \begin{pmatrix} T^{-1} & 0 \\ -W^{-1}VT^{-1} & W^{-1} \end{pmatrix}. \tag{6.7}
\]
(2) The (M + N) × (M + N) partitioned matrix \begin{pmatrix} W & V \\ 0 & T \end{pmatrix} is nonsingular if and only if both T
and W are nonsingular, in which case
\[
\begin{pmatrix} W & V \\ 0 & T \end{pmatrix}^{-1}
= \begin{pmatrix} W^{-1} & -W^{-1}VT^{-1} \\ 0 & T^{-1} \end{pmatrix}. \tag{6.8}
\]
Proof. Suppose that \begin{pmatrix} T & 0 \\ V & W \end{pmatrix} is nonsingular (in which case its rows are linearly independent).
Then, for any M-dimensional column vector c such that c′T = 0, we find that
\[
(c', 0)\begin{pmatrix} T & 0 \\ V & W \end{pmatrix} = (c'T, 0) = 0
\]
and hence that c = 0—if c were nonnull, the rows of \begin{pmatrix} T & 0 \\ V & W \end{pmatrix} would be linearly dependent. Thus,
the rows of T are linearly independent, which implies that T is nonsingular. That W is nonsingular
can be established in similar fashion.
Conversely, suppose that both T and W are nonsingular. Then,
\[
\begin{pmatrix} T & 0 \\ V & W \end{pmatrix}
= \begin{pmatrix} I_M & 0 \\ 0 & W \end{pmatrix}
\begin{pmatrix} I_M & 0 \\ W^{-1}VT^{-1} & I_N \end{pmatrix}
\begin{pmatrix} T & 0 \\ 0 & I_N \end{pmatrix},
\]
as is easily verified, and (in light of Lemmas 2.6.1 and 2.6.2 or Lemma 2.6.3)
\begin{pmatrix} I & 0 \\ 0 & W \end{pmatrix}, \begin{pmatrix} I & 0 \\ W^{-1}VT^{-1} & I \end{pmatrix}, and \begin{pmatrix} T & 0 \\ 0 & I \end{pmatrix} are nonsingular. Further, it follows from result (5.10) (and also
from Lemma 2.6.3) that \begin{pmatrix} T & 0 \\ V & W \end{pmatrix} is nonsingular and from results (5.11) and (6.3) and Lemma 2.6.2
that
\[
\begin{pmatrix} T & 0 \\ V & W \end{pmatrix}^{-1}
= \begin{pmatrix} T & 0 \\ 0 & I \end{pmatrix}^{-1}
\begin{pmatrix} I & 0 \\ W^{-1}VT^{-1} & I \end{pmatrix}^{-1}
\begin{pmatrix} I & 0 \\ 0 & W \end{pmatrix}^{-1}
= \begin{pmatrix} T^{-1} & 0 \\ 0 & I \end{pmatrix}
\begin{pmatrix} I & 0 \\ -W^{-1}VT^{-1} & I \end{pmatrix}
\begin{pmatrix} I & 0 \\ 0 & W^{-1} \end{pmatrix}
= \begin{pmatrix} T^{-1} & 0 \\ -W^{-1}VT^{-1} & W^{-1} \end{pmatrix}.
\]
The proof of Part (1) is now complete. Part (2) can be proved via an analogous argument. Q.E.D.
The results of Lemma 2.6.4 can be extended (by repeated application) to block-triangular matrices
having more than two rows and columns of blocks. Let A₁, A₂, ..., A_R represent square matrices,
and take A to be an (upper or lower) block-triangular matrix whose diagonal blocks are respectively
A₁, A₂, ..., A_R. Then, A is nonsingular if and only if A₁, A₂, ..., A_R are all nonsingular (and,
as what is essentially a special case, a triangular matrix is nonsingular if and only if its diagonal
elements are all nonzero). Further, A⁻¹ is block-triangular (lower block-triangular if A is lower
block-triangular and upper block-triangular if A is upper block-triangular). The diagonal blocks of
A⁻¹ are A₁⁻¹, A₂⁻¹, ..., A_R⁻¹, respectively, and the off-diagonal blocks of A⁻¹ are expressible in terms
of recursive formulas given by, for example, Harville (1997, sec. 8.5).

c. General case
The following theorem can (when applicable) be used to express the rank of a partitioned matrix in
terms of the rank of a matrix of smaller dimensions.
Theorem 2.6.5. Let T represent an M × M matrix, U an M × Q matrix, V an N × M matrix,
and W an N × Q matrix. If rank(T) = M, that is, if T is nonsingular, then
\[
rank\begin{pmatrix} T & U \\ V & W \end{pmatrix}
= rank\begin{pmatrix} W & V \\ U & T \end{pmatrix}
= M + rank(W - VT^{-1}U). \tag{6.9}
\]
Proof. Suppose that rank(T) = M. Then,
\[
\begin{pmatrix} I_M & 0 \\ -VT^{-1} & I_N \end{pmatrix}
\begin{pmatrix} T & U \\ V & W \end{pmatrix}
= \begin{pmatrix} T & U \\ 0 & W - VT^{-1}U \end{pmatrix},
\]
as is easily verified. Thus, observing that \begin{pmatrix} I & 0 \\ -VT^{-1} & I \end{pmatrix} is nonsingular (as is evident from Lemma
2.6.2) and making use of Corollary 2.5.6 and Lemma 2.6.3, we find that
\[
rank\begin{pmatrix} T & U \\ V & W \end{pmatrix}
= rank\begin{pmatrix} T & U \\ 0 & W - VT^{-1}U \end{pmatrix}
= rank(T) + rank(W - VT^{-1}U)
= M + rank(W - VT^{-1}U).
\]
And it can be shown in similar fashion that
\[
rank\begin{pmatrix} W & V \\ U & T \end{pmatrix} = M + rank(W - VT^{-1}U). \qquad Q.E.D.
\]

Results (6.7) and (6.8) on the inverse of a block-triangular matrix can be extended to a more
general class of partitioned matrices, as described in the following theorem.
Theorem 2.6.6. Let T represent an M × M matrix, U an M × N matrix, V an N × M matrix,
and W an N × N matrix. Suppose that T is nonsingular, and define
\[
Q = W - VT^{-1}U.
\]
Then, the partitioned matrix \begin{pmatrix} T & U \\ V & W \end{pmatrix} is nonsingular if and only if Q is nonsingular, in which case
\[
\begin{pmatrix} T & U \\ V & W \end{pmatrix}^{-1}
= \begin{pmatrix} T^{-1} + T^{-1}UQ^{-1}VT^{-1} & -T^{-1}UQ^{-1} \\ -Q^{-1}VT^{-1} & Q^{-1} \end{pmatrix}. \tag{6.10}
\]
Similarly, the partitioned matrix \begin{pmatrix} W & V \\ U & T \end{pmatrix} is nonsingular if and only if Q is nonsingular, in which
case
\[
\begin{pmatrix} W & V \\ U & T \end{pmatrix}^{-1}
= \begin{pmatrix} Q^{-1} & -Q^{-1}VT^{-1} \\ -T^{-1}UQ^{-1} & T^{-1} + T^{-1}UQ^{-1}VT^{-1} \end{pmatrix}. \tag{6.11}
\]
Proof. That \begin{pmatrix} T & U \\ V & W \end{pmatrix} is nonsingular if and only if Q is nonsingular and that \begin{pmatrix} W & V \\ U & T \end{pmatrix} is
nonsingular if and only if Q is nonsingular are immediate consequences of Theorem 2.6.5.
Suppose now that Q is nonsingular, and observe that
\[
\begin{pmatrix} T & 0 \\ V & Q \end{pmatrix}
= \begin{pmatrix} T & U \\ V & W \end{pmatrix}
\begin{pmatrix} I & -T^{-1}U \\ 0 & I \end{pmatrix}. \tag{6.12}
\]
Then, in light of Corollary 2.5.6 and Lemma 2.6.2, \begin{pmatrix} T & 0 \\ V & Q \end{pmatrix}, like \begin{pmatrix} T & U \\ V & W \end{pmatrix}, is nonsingular. Premul-
tiplying both sides of equality (6.12) by \begin{pmatrix} T & U \\ V & W \end{pmatrix}^{-1} and postmultiplying both sides by \begin{pmatrix} T & 0 \\ V & Q \end{pmatrix}^{-1}
and making use of Lemma 2.6.4, we find that
\[
\begin{pmatrix} T & U \\ V & W \end{pmatrix}^{-1}
= \begin{pmatrix} I & -T^{-1}U \\ 0 & I \end{pmatrix}
\begin{pmatrix} T & 0 \\ V & Q \end{pmatrix}^{-1}
= \begin{pmatrix} I & -T^{-1}U \\ 0 & I \end{pmatrix}
\begin{pmatrix} T^{-1} & 0 \\ -Q^{-1}VT^{-1} & Q^{-1} \end{pmatrix}
= \begin{pmatrix} T^{-1} + T^{-1}UQ^{-1}VT^{-1} & -T^{-1}UQ^{-1} \\ -Q^{-1}VT^{-1} & Q^{-1} \end{pmatrix},
\]
which establishes formula (6.10). Formula (6.11) can be derived in similar fashion. Q.E.D.
In proving Theorem 2.6.6, the approach taken to the verification of equality (6.10) was to relate
the inverse of \begin{pmatrix} T & U \\ V & W \end{pmatrix} to that of \begin{pmatrix} T & 0 \\ V & Q \end{pmatrix} and to then apply formula (6.7) for the inverse of a
lower block-triangular matrix. Alternatively, equality (6.10) could be verified by premultiplying or
postmultiplying the right side of the equality by \begin{pmatrix} T & U \\ V & W \end{pmatrix} and by confirming that the resultant
product equals I_{M+N}. And equality (6.11) could be verified in much the same way.
When T is nonsingular, the matrix Q = W − VT⁻¹U, which appears in expressions (6.10)
and (6.11) for the inverse of a partitioned matrix, is called the Schur complement of T in \begin{pmatrix} T & U \\ V & W \end{pmatrix}
or the Schur complement of T in \begin{pmatrix} W & V \\ U & T \end{pmatrix}. Moreover, when the context is clear, it is sometimes
referred to simply as the Schur complement of T or even more simply as the Schur complement.
Note that if \begin{pmatrix} T & U \\ V & W \end{pmatrix} or \begin{pmatrix} W & V \\ U & T \end{pmatrix} is symmetric (in which case T′ = T, W′ = W, and V = U′),
then the Schur complement of T is also symmetric.
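As an added numerical check of formula (6.10), the following Python/NumPy sketch assembles the blockwise inverse from T, U, V, W and the Schur complement Q and compares it with a direct inverse of the full partitioned matrix; the block sizes and the diagonal shifts used to keep T and Q safely nonsingular are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(2)
    M, N = 3, 2
    T = rng.standard_normal((M, M)) + 3 * np.eye(M)
    U = rng.standard_normal((M, N))
    V = rng.standard_normal((N, M))
    W = rng.standard_normal((N, N)) + 3 * np.eye(N)

    Ti = np.linalg.inv(T)
    Q = W - V @ Ti @ U                      # Schur complement of T
    Qi = np.linalg.inv(Q)

    # Blockwise inverse from formula (6.10)
    top = np.hstack([Ti + Ti @ U @ Qi @ V @ Ti, -Ti @ U @ Qi])
    bottom = np.hstack([-Qi @ V @ Ti, Qi])
    blockwise = np.vstack([top, bottom])

    full = np.linalg.inv(np.block([[T, U], [V, W]]))
    print(np.allclose(blockwise, full))     # True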

2.7 Orthogonal Matrices


A (square) matrix A is said to be orthogonal if

AA0 D A0 A D I;

or, equivalently, if A is nonsingular and A 1 D A0 . To show that a (square) matrix A is orthogonal,


it suffices (in light of Lemma 2.5.3) to demonstrate that A0 A D I or, alternatively, that AA0 D I.
For any N × N matrix A, A′A = I if and only if the columns a₁, a₂, ..., a_N of A are such that
\[
a_i' a_j =
\begin{cases}
1, & \text{for } j = i = 1, 2, \ldots, N, \\
0, & \text{for } j \ne i = 1, 2, \ldots, N
\end{cases} \tag{7.1}
\]

(as is evident upon observing that a0i aj equals the ij th element of A0 A). Thus, a square matrix is
orthogonal if and only if its columns form an orthonormal (with respect to the usual inner product)
set of vectors. Similarly, a square matrix is orthogonal if and only if its rows form an orthonormal
set of vectors.
Note that if A is an orthogonal matrix, then its transpose A0 is also orthogonal. Note also that
in using the term orthogonal in connection with one or more matrices, say A1 ; A2 ; : : : ; AK , care
must be exercised to avoid confusion. Under a strict interpretation, saying that A1 ; A2 ; : : : ; AK are
orthogonal matrices has an entirely different meaning than saying that the set fA1 ; A2 ; : : : ; AK g,
whose members are A1 ; A2 ; : : : ; AK , is an orthogonal set.

If P and Q are both N  N orthogonal matrices, then [in light of result (1.13)]

.P Q/0 P Q D Q0 P 0 P Q D Q0 IQ D Q0 Q D I:

Thus, the product of two N  N orthogonal matrices is another (N  N ) orthogonal matrix. The
repeated application of this result leads to the following lemma.
Lemma 2.7.1. If each of the matrices Q1 ; Q2 ; : : : ; QK is an N  N orthogonal matrix, then
the product Q1 Q2    QK is an (N  N ) orthogonal matrix.
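As an added illustration, the following Python/NumPy sketch verifies that a 2 × 2 rotation matrix is orthogonal and that the product of two orthogonal matrices is again orthogonal (Lemma 2.7.1); the rotation angles are arbitrary.

    import numpy as np

    def rotation(theta):
        """2 x 2 rotation matrix; its columns are orthonormal, so it is orthogonal."""
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s], [s, c]])

    P = rotation(0.3)
    Q = rotation(1.1)
    I = np.eye(2)

    print(np.allclose(P.T @ P, I), np.allclose(P @ P.T, I))   # True True
    # The product of orthogonal matrices is orthogonal (Lemma 2.7.1).
    R = P @ Q
    print(np.allclose(R.T @ R, I))                            # True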

2.8 Idempotent Matrices


A (square) matrix A is said to be idempotent if A2 D A—for any (square) matrix A, A2 represents the
product AA (and, for k D 3; 4; : : : , Ak represents the product defined recursively by Ak D AAk 1 ).
Examples of N  N idempotent matrices are the identity matrix IN , the N  N null matrix 0, and
the matrix .1=N /1N 10N , each element of which equals 1=N .
If a square matrix A is idempotent, then
\[
(A')^2 = (AA)' = A'
\]
and
\[
(I - A)^2 = I - 2A + A^2 = I - 2A + A = I - A.
\]
Thus, upon observing that A = (A′)′ and A = I − (I − A), we have the following lemma.
Lemma 2.8.1. Let A represent a square matrix. Then, (1) A0 is idempotent if and only if A is
idempotent, and (2) I A is idempotent if and only if A is idempotent.
Now, suppose that A is a square matrix such that
A2 D kA
for some nonzero scalar k. Or, equivalently, suppose that
Œ.1=k/A2 D .1=k/A;
that is, suppose that .1=k/A is an idempotent matrix (so that, depending on whether k D 1 or k ¤ 1,
A is either an idempotent matrix or a scalar multiple of an idempotent matrix). Then, in light of the
following theorem,
rank.A/ D .1=k/ tr.A/; (8.1)
and, consequently, the rank of A is determinable from the trace of A.
Theorem 2.8.2. For any square matrix A such that A2 D kA for some scalar k,
tr.A/ D k rank.A/: (8.2)

Proof. Let us restrict attention to the case where A is nonnull. [The case where A D 0 is
trivial—if A D 0, then tr.A/ D 0 D k rank.A/.]
Let N denote the order of A, and let R D rank.A/. Then, according to Theorem 2.4.18, there exist
an N  R matrix B and an R  N matrix T such that A D BT. Moreover, rank.B/ D rank.T / D R,
implying (in light of Lemma 2.5.1) the existence of a matrix L such that LB D IR and a matrix H
such that T H D IR —L is a left inverse of B and H a right inverse of T.
We have that
BT BT D A2 D kA D kBT D B.kIR /T:
Thus,
T B D IR T BIR D LBT BT H D L.BT BT /H D LB.kIR /T H D IR .kIR /IR D kIR :

And making use of Lemma 2.3.1, we find that

tr.A/ D tr.BT / D tr.T B/ D tr.kIR / D k tr.IR / D kR: Q.E.D.

In the special case where k D 1, Theorem 2.8.2 can be restated as follows.


Corollary 2.8.3. For any idempotent matrix A,
rank.A/ D tr.A/: (8.3)

By making use of Lemma 2.8.1 and Corollary 2.8.3, we find that for any N × N idempotent
matrix A,
\[
rank(I - A) = tr(I - A) = tr(I_N) - tr(A) = N - rank(A),
\]
thereby establishing the following, additional result.
Lemma 2.8.4. For any N × N idempotent matrix A,
\[
rank(I - A) = tr(I - A) = N - rank(A). \tag{8.4}
\]
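As an added illustration of Corollary 2.8.3 and Lemma 2.8.4, the following Python/NumPy sketch builds an idempotent matrix of the projection form discussed in Section 2.12 and recovers its rank from its trace; the dimensions are arbitrary.

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.standard_normal((6, 2))

    # P = X (X'X)^{-1} X' is idempotent (it is the projection matrix of Section 2.12).
    P = X @ np.linalg.inv(X.T @ X) @ X.T
    I = np.eye(6)

    print(np.allclose(P @ P, P))                           # True: P is idempotent
    print(np.linalg.matrix_rank(P), np.trace(P))           # both equal 2 (trace up to rounding)
    print(np.linalg.matrix_rank(I - P), np.trace(I - P))   # both equal 6 - rank(P) = 4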

2.9 Linear Systems


Consider a set of M equations of the general form

\[
\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1N}x_N &= b_1 \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2N}x_N &= b_2 \\
&\;\;\vdots \\
a_{M1}x_1 + a_{M2}x_2 + \cdots + a_{MN}x_N &= b_M,
\end{aligned}
\]
where a₁₁, a₁₂, ..., a_{1N}, a₂₁, a₂₂, ..., a_{2N}, ..., a_{M1}, a_{M2}, ..., a_{MN} and b₁, b₂, ..., b_M represent
“fixed” scalars and x1 ; x2 ; : : : ; xN are scalar-valued unknowns or variables. The left side of each of
these equations is a linear combination of the unknowns x1 ; x2 ; : : : ; xN . Collectively, these equations
are called a system of linear equations (in unknowns x1 ; x2 ; : : : ; xN ) or simply a linear system (in
x1 ; x2 ; : : : ; xN ).
The linear system can be rewritten in matrix form as

Ax D b; (9.1)

where A is the M  N matrix whose ij th element is aij (i D 1; 2; : : : ; M ; j D 1; 2; : : : ; N ),


b D .b1 ; b2 ; : : : ; bM /0 , and x D .x1 ; x2 ; : : : ; xN /0 . The matrix A is called the coefficient matrix of
the linear system, and b is called the right side (or right-hand side). Any value of the vector x of
unknowns that satisfies Ax D b is called a solution to the linear system, and the process of finding
a solution (when one exists) is called solving the linear system.
There may be occasion to solve the linear system for more than one right side, that is, to solve
each of P linear systems
Axk D bk .k D 1; 2; : : : ; P / (9.2)
(in vectors x1 ; x2 ; : : : ; xP , respectively, of unknowns) that have the same coefficient matrix A but
right sides b1 ; b2 ; : : : ; bP that may differ. By forming an N  P matrix X whose first, second, …,
P th columns are x1 ; x2 ; : : : ; xP , respectively, and an M  P matrix B whose first, second, …, P th
columns are b1 ; b2 ; : : : ; bP , respectively, the P linear systems (9.2) can be rewritten collectively as

AX D B: (9.3)

As in the special case (9.1) where P D 1, AX D B is called a linear system (in X), A and B are
called the coefficient matrix and the right side, respectively, and any value of X that satisfies AX D B
is called a solution.
In the special case AX D 0, where the right side B of linear system (9.3) is a null matrix, the linear
system is said to be homogeneous. If B is nonnull, linear system (9.3) is said to be nonhomogeneous.

a. Consistency
A linear system is said to be consistent if it has one or more solutions. If no solution exists, the linear
system is said to be inconsistent.
Every homogeneous linear system is consistent—one solution to a homogeneous linear system
is the null matrix (of appropriate dimensions). A nonhomogeneous linear system may be either
consistent or inconsistent.
Let us determine the characteristics that distinguish the coefficient matrix and right side of a
consistent linear system from those of an inconsistent linear system. In doing so, let us adopt an
abbreviated notation for the column space of a partitioned matrix of the form .A; B/ by writing
C.A; B/ for CŒ.A; B/. And as a preliminary step, let us establish the following lemma.
Lemma 2.9.1. Let A represent an M  N matrix and B an M  P matrix. Then,

C.A; B/ D C.A/ , C.B/  C.A/I (9.4)


rank.A; B/ D rank.A/ , C.B/  C.A/: (9.5)

Proof. We have that C.A/   C.A; B/ and C.B/   C.A; B/, as is evident from Corollary 2.4.4
I 0
upon observing that A D .A; B/ and B D .A; B/ .
0 I
Now, suppose that C.B/  C.A/. Then, according to Lemma 2.4.3, there exists a matrix F such
that B D AF and hence such that .A; B/ D A.I; F/. Thus, C.A; B/  C.A/, implying [since
C.A/  C.A; B/] that C.A; B/ D C.A/.
Conversely, suppose that C.A; B/ D C.A/. Then, since C.B/  C.A; B/, C.B/  C.A/, and
the proof of result (9.4) is complete.
To prove result (9.5), it suffices [having established result (9.4)] to show that

rank.A; B/ D rank.A/ , C.A; B/ D C.A/:

That C.A; B/ D C.A/ ) rank.A; B/ D rank.A/ is clear. And since C.A/  C.A; B/, it follows
from Theorem 2.4.16 that rank.A; B/ D rank.A/ ) C.A; B/ D C.A/. Q.E.D.
We are now in a position to establish the following result on the consistency of a linear system.
Theorem 2.9.2. Each of the following conditions is necessary and sufficient for a linear system
AX D B (in X) to be consistent:
(1) C.B/  C.A/;
(2) every column of B belongs to C.A/;
(3) C.A; B/ D C.A/;
(4) rank.A; B/ D rank.A/.
Proof. That Condition (1) is necessary and sufficient for the consistency of AX D B is an
immediate consequence of Lemma 2.4.3. Further, it follows from Lemma 2.4.2 that Condition (2) is
equivalent to Condition (1), and from Lemma 2.9.1 that each of Conditions (3) and (4) is equivalent to
Condition (1). Thus, like Condition (1), each of Conditions (2) through (4) is necessary and sufficient
for the consistency of AX D B. Q.E.D.
A sufficient (but in general not a necessary) condition for the consistency of a linear system is
given by the following theorem.

Theorem 2.9.3. If the coefficient matrix A of a linear system AX D B (in X) has full row rank,
then AX D B is consistent.
Proof. Suppose that A has full row rank. Then, it follows from Lemma 2.5.1 that there exists
a matrix R (a right inverse of A) such that AR D I and hence such that ARB D B. Thus, setting
X D RB gives a solution to the linear system AX D B, and we conclude that AX D B is
consistent. Q.E.D.
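As an added illustration, condition (4) of Theorem 2.9.2 gives a simple numerical consistency check; the following Python/NumPy sketch applies it to a small rank-deficient coefficient matrix (an arbitrary example).

    import numpy as np

    def is_consistent(A, b):
        """Check consistency of A x = b via condition (4) of Theorem 2.9.2:
        rank(A, b) = rank(A)."""
        return np.linalg.matrix_rank(np.column_stack([A, b])) == np.linalg.matrix_rank(A)

    A = np.array([[1.0, 2.0],
                  [2.0, 4.0]])                       # rank 1
    print(is_consistent(A, np.array([1.0, 2.0])))    # True:  b lies in C(A)
    print(is_consistent(A, np.array([1.0, 0.0])))    # False: b does not lie in C(A)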

b. Solution set
The collection of all solutions to a linear system AX D B (in X) is called the solution set of the
linear system. Clearly, a linear system is consistent if and only if its solution set is nonempty.
Is the solution set of a linear system AX D B (in X) a linear space? The answer depends on
whether the linear system is homogeneous or nonhomogeneous, that is, on whether the right side B
is null or nonnull.
Consider first the solution set of a homogeneous linear system AX D 0. A homogeneous linear
system is consistent, and hence its solution set is nonempty—its solution set includes the null matrix
0. Furthermore, if X1 and X2 are solutions to AX D 0 and k is a scalar, then A.X1 C X2 / D
AX1 C AX2 D 0 and A.kX1 / D k.AX1 / D 0, so that X1 C X2 and kX1 are solutions to AX D 0.
Thus, the solution set of a homogeneous linear system is a linear space. Accordingly, the solution
set of a homogeneous linear system AX D 0 may be called the solution space of AX D 0.
The solution space of a homogeneous linear system Ax D 0 (in a column vector x) is called the
null space of the matrix A and is denoted by the symbol N.A/. Thus, for any M  N matrix A,
N.A/ D fx 2 RN W Ax D 0g:

The solution set of a nonhomogeneous linear system is not a linear space (as can be easily seen
by, e.g., observing that the solution set does not contain the null matrix).

2.10 Generalized Inverses


There is an intimate relationship between the inverse A 1 of a nonsingular matrix A and the solution
of linear systems whose coefficient matrix is A. This relationship is described in the following
theorem.
Theorem 2.10.1. Let A represent any N  N nonsingular matrix, G any N  N matrix, and
P any positive integer. Then, GB is a solution to a linear system AX D B (in X) for every N  P
matrix B if and only if G D A 1 .
Is there a matrix that relates to the solution of linear systems whose coefficient matrix is an
M  N matrix A of arbitrary rank and that does so in the same way that A 1 relates to their solution
in the special case where A is nonsingular? The following theorem serves to characterize any such
matrix.
Theorem 2.10.2. Let A represent any M  N matrix, G any N  M matrix, and P any positive
integer. Then, GB is a solution to a linear system AX D B (in X) for every M  P matrix B for
which the linear system is consistent if and only if AGA D A.
Note that if the matrix A in Theorem 2.10.2 is nonsingular, then the linear system AX = B is
consistent for every M × P matrix B (as is evident from Theorem 2.9.3) and AGA = A ⇔ G = A⁻¹
(as is evident upon observing that if A is nonsingular, then AA⁻¹A = AI = A and AGA = A ⇒
A⁻¹AGAA⁻¹ = A⁻¹AA⁻¹ ⇒ G = A⁻¹). Thus, Theorem 2.10.2 can be regarded as a generalization
of Theorem 2.10.1.
Proof (of Theorem 2.10.2). Suppose that AGA D A. And let B represent any M  P matrix
for which AX D B is consistent, and take X to be any solution to AX D B. Then,
A.GB/ D .AG/B D AGAX D AX D B;
so that GB is a solution to AX D B.
Conversely, suppose that GB is a solution to AX D B (i.e., that AGB D B) for every M  P
matrix B for which AX D B is consistent. Letting ai represent the i th column of A, observe that
AX D B is consistent in particular for B D .ai ; 0; : : : ; 0/—for this B, one solution to AX D B is
the matrix .ui ; 0; : : : ; 0/, where ui is the i th column of IN (i D 1; 2; : : : ; N ). It follows that
AG.ai ; 0; : : : ; 0/ D .ai ; 0; : : : ; 0/
and hence that AGai D ai (i D 1; 2; : : : ; N ). Thus,
AGA D AG.a1 ; : : : ; aN / D .AGa1 ; : : : ; AGaN / D .a1 ; : : : ; aN / D A: Q.E.D.
An N  M matrix G is said to be a generalized inverse of an M  N matrix A if it satisfies the
condition
AGA D A:
For example, each of the two 3 × 2 matrices
\[
\begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}
\quad\text{and}\quad
\begin{pmatrix} -42 & -1 \\ 5 & 3 \\ 2 & 2 \end{pmatrix}
\]
is a generalized inverse of the 2 × 3 matrix
\[
\begin{pmatrix} 1 & 3 & 2 \\ 2 & 6 & 4 \end{pmatrix}.
\]
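As an added check, the following Python/NumPy sketch verifies the defining condition AGA = A for the first of the two displayed matrices and for the Moore–Penrose pseudoinverse, which is one particular generalized inverse (not the book's example).

    import numpy as np

    A = np.array([[1.0, 3.0, 2.0],
                  [2.0, 6.0, 4.0]])

    G1 = np.array([[1.0, 0.0],
                   [0.0, 0.0],
                   [0.0, 0.0]])
    print(np.allclose(A @ G1 @ A, A))   # True: G1 satisfies A G A = A

    # The Moore-Penrose pseudoinverse is another particular generalized inverse.
    G2 = np.linalg.pinv(A)
    print(np.allclose(A @ G2 @ A, A))   # True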
The following lemma and corollary pertain to generalized inverses of matrices of full row or
column rank and to generalized inverses of nonsingular matrices.
Lemma 2.10.3. Let A represent a matrix of full column rank and B a matrix of full row rank.
Then, (1) a matrix G is a generalized inverse of A if and only if G is a left inverse of A. And (2) a
matrix G is a generalized inverse of B if and only if G is a right inverse of B.
Proof. Let L represent a left inverse of A—that A has a left inverse is a consequence of Lemma
2.5.1. Then, ALA D AI D A, so that L is a generalized inverse of A. And the proof of Part (1) of
the lemma is complete upon observing that if G is a generalized inverse of A, then
GA D IGA D LAGA D LA D I;
so that G is a left inverse of A. The validity of Part (2) can be established via an analogous argu-
ment. Q.E.D.
Corollary 2.10.4. The “ordinary” inverse A 1 of a nonsingular matrix A is a generalized inverse
of A. And a nonsingular matrix A has no generalized inverse other than A 1.
Corollary 2.10.4 follows from either part of Lemma 2.10.3. That it follows in particular from
Part (1) of Lemma 2.10.3 is evident upon observing that (by definition) A 1 is a left inverse of a
nonsingular matrix A and upon recalling (from Theorem 2.5.4) that a nonsingular matrix A has no
left inverse other than A 1.
Let us now consider the existence of generalized inverses. Together, Lemmas 2.10.3 and 2.5.1
imply that matrices of full row rank or full column rank have generalized inverses. Does every matrix
have at least one generalized inverse? The answer to that question is yes, as can be shown by making
use of the following theorem, which is of some interest in its own right.

Theorem 2.10.5. Let B represent an M  K matrix of full column rank and T a K  N matrix
of full row rank. Then, B has a left inverse, say L, and T has a right inverse, say R; and RL is a
generalized inverse of BT.
Proof. That B has a left inverse L and T a right inverse R is an immediate consequence of
Lemma 2.5.1. Moreover,
BT .RL/BT D B.T R/.LB/T D BIIT D BT:
Thus, RL is a generalized inverse of BT. Q.E.D.
Now, consider an arbitrary M  N matrix A. If A D 0, then clearly any N  M matrix is a
generalized inverse of A. If A ¤ 0, then (according to Theorem 2.4.18) there exist a matrix B of full
column rank and a matrix T of full row rank such that A D BT , and hence (in light of Theorem
2.10.5) A has a generalized inverse. Thus, we arrive at the following conclusion.
Corollary 2.10.6. Every matrix has at least one generalized inverse.
The symbol A is used to denote an arbitrary generalized inverse of an M  N matrix A. By
definition,
AA A D A:

a. General form and nonuniqueness of generalized inverses


A general expression can be obtained for a generalized inverse of an M  N matrix A in terms of
any particular generalized inverse of A, as described in the following theorem.
Theorem 2.10.7. Let A represent an M × N matrix, and G any particular generalized inverse
of A. Then, an N × M matrix G* is a generalized inverse of A if and only if
\[
G^* = G + Z - GAZAG \tag{10.1}
\]
for some N × M matrix Z. Also, G* is a generalized inverse of A if and only if
\[
G^* = G + (I - GA)T + S(I - AG) \tag{10.2}
\]
for some N × M matrices T and S.
Proof. It is a simple exercise to verify that any matrix G* that is expressible in the form (10.1)
or (10.2) is a generalized inverse of A. Conversely, if G* is a generalized inverse of A, then
\[
G^* = G + (G^* - G) - GA(G^* - G)AG = G + Z - GAZAG,
\]
where Z = G* − G, and
\[
G^* = G + (I - GA)G^*AG + (G^* - G)(I - AG) = G + (I - GA)T + S(I - AG),
\]
where T = G*AG and S = G* − G. Q.E.D.
All generalized inverses of the M  N matrix A can be generated from expression (10.1) by
letting Z range over all N M matrices. Alternatively, all generalized inverses of A can be generated
from expression (10.2) by letting both T and S range over all N  M matrices. Note that distinct
choices for Z or for T and/or S may or may not result in distinct generalized inverses.
How many generalized inverses does an M × N matrix A possess? If A is nonsingular, it has
a unique generalized inverse, namely A⁻¹. Now, suppose that A is not nonsingular (i.e., that A is
either not square or is square but singular). Then, rank(A) < M (in which case A does not have
a right inverse) and/or rank(A) < N (in which case A does not have a left inverse). Thus, for any
generalized inverse G of A, I − AG ≠ 0 and/or I − GA ≠ 0. And, based on the second part of
Theorem 2.10.7, we conclude that A has an infinite number of generalized inverses.
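As an added illustration of expression (10.1), the following Python/NumPy sketch generates several generalized inverses of the 2 × 3 matrix from the earlier example by varying Z, starting from the pseudoinverse as the particular generalized inverse G; the random choices of Z are arbitrary.

    import numpy as np

    A = np.array([[1.0, 3.0, 2.0],
                  [2.0, 6.0, 4.0]])
    G = np.linalg.pinv(A)                     # one particular generalized inverse

    rng = np.random.default_rng(4)
    for _ in range(3):
        Z = rng.standard_normal((3, 2))
        G_star = G + Z - G @ A @ Z @ A @ G    # expression (10.1)
        print(np.allclose(A @ G_star @ A, A)) # True each time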

b. Some basic results on generalized inverses


Let us consider the extent to which various basic results on “ordinary” inverses extend to generalized
inverses. It is easy to verify the following lemma, which is a generalization of result (5.2).
Lemma 2.10.8. For any matrix A and any nonzero scalar k, (1/k)A⁻ is a generalized inverse
of kA.
Upon setting k = −1 in Lemma 2.10.8, we obtain the following corollary, which is a general-
ization of result (5.3).
Corollary 2.10.9. For any matrix A, −A⁻ is a generalized inverse of −A.
For any matrix A, we find that
\[
A'(A^-)'A' = (AA^-A)' = A'.
\]
Thus, we have the following lemma, which is a generalization of result (5.4).
Lemma 2.10.10. For any matrix A, (A⁻)′ is a generalized inverse of A′.
While [according to result (5.5)] the inverse of a nonsingular symmetric matrix is symmetric, a
generalized inverse of a singular symmetric matrix (of order greater than 1) need not be symmetric.
For example, the matrix
\[
\begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}
\]
is a generalized inverse of the matrix
\[
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}.
\]
However, Lemma 2.10.10 implies that a generalized inverse of a singular symmetric matrix has the following, weaker
property.
Corollary 2.10.11. For any symmetric matrix A, .A /0 is a generalized inverse of A.
For any M  N matrix A, the M  M matrix AA and N  N matrix A A are as described in
the following two lemmas—in the special case where A is nonsingular, AA 1 D A 1A D I.
Lemma 2.10.12. Let A represent an M  N matrix. Then, the M  M matrix AA and the
N  N matrix A A are both idempotent.
Proof. Clearly, AA AA D AA and A AA A D A A. Q.E.D.
Lemma 2.10.13. For any matrix A, C.AA / D C.A/, R.A A/ D R.A/, and

rank.AA / D rank.A A/ D rank.A/: (10.3)

Proof. It follows from Corollary 2.4.4 that C.AA /  C.A/ and also, since A D AA A D
.AA /A, that C.A/  C.AA /. Thus, C.AA / D C.A/. That R.A A/ D R.A/ follows from an
analogous argument. And since AA has the same column space as A and A A the same row space
as A, AA and A A have the same rank as A. Q.E.D.
It follows from Corollary 2.4.17 and Lemma 2.10.13 that, for any matrix A,

rank.A /  rank.AA / D rank.A/:

Thus, we have the following lemma, which can be regarded as an extension of result (5.6).
Lemma 2.10.14. For any matrix A, rank.A /  rank.A/.
The following lemma extends to generalized inverses some results on the inverses of products of
matrices—refer to results (5.9) and (5.11).
Lemma 2.10.15. Let B represent an M  N matrix and G an N  M matrix. Then, for any
M  M nonsingular matrix A and N  N nonsingular matrix C, (1) G is a generalized inverse of
AB if and only if G D HA 1 for some generalized inverse H of B, (2) G is a generalized inverse
of BC if and only if G D C 1 H for some generalized inverse H of B, and (3) G is a generalized
inverse of ABC if and only if G D C 1 HA 1 for some generalized inverse H of B.

Proof. Parts (1) and (2) are special cases of Part (3) (those where C D I and A D I, respectively).
Thus, it suffices to prove Part (3).
By definition, G is a generalized inverse of ABC if and only if

ABCGABC D ABC: (10.4)

Upon premultiplying both sides of equality (10.4) by A 1 and postmultiplying both sides by C 1,
we obtain the equivalent equality
BCGAB D B:
Thus, G is a generalized inverse of ABC if and only if CGA D H for some generalized inverse H
of B or, equivalently, if and only if G D C 1 HA 1 for some generalized inverse H of B. Q.E.D.

2.11 Linear Systems Revisited


Having introduced (in Section 2.10) the concept of a generalized inverse, we are now in a position to
add to the results obtained earlier (in Section 2.9) on the consistency and the solution set of a linear
system.

a. More on the consistency of a linear system


Each of the four conditions of Theorem 2.9.2 is necessary and sufficient for a linear system AX D B
(in X) to be consistent. The following theorem describes another such condition.
Theorem 2.11.1. A linear system AX D B (in X) is consistent if and only if AA B D B or,
equivalently, if and only if .I AA /B D 0.
With the establishment of the following lemma, Theorem 2.11.1 becomes an immediate conse-
quence of Theorem 2.9.2.
Lemma 2.11.2. Let A represent an M N matrix. Then, for any M P matrix B, C.B/  C.A/
if and only if B D AA B or, equivalently, if and only if .I AA /B D 0. And, for any Q N matrix
C, R.C/  R.A/ if and only if C D CA A or, equivalently, if and only if C.I A A/ D 0.
Proof (of Lemma 2.11.2). If B D AA B, then it follows immediately from Corollary 2.4.4 that
C.B/  C.A/. Conversely, if C.B/  C.A/, then, according to Lemma 2.4.3, there exists a matrix
F such that B D AF, implying that
B D AA AF D AA B:
Thus, C.B/  C.A/ if and only if B D AA B. That R.C/  R.A/ if and only if C D CA A
follows from an analogous argument. Q.E.D.
According to Theorem 2.11.1, either of the two matrices AA or I AA can be used to determine
whether a linear system having A as a coefficient matrix is consistent or inconsistent. If the right side
of the linear system is unaffected by premultiplication by AA , then the linear system is consistent;
otherwise, it is inconsistent. Similarly, if the premultiplication of the right side by I AA produces
a null matrix, then the linear system is consistent; otherwise, it is inconsistent.
Consider, for example, the linear system Ax = b (in x), where
\[
A = \begin{pmatrix} -6 & 2 & -2 & -3 \\ 3 & -1 & 5 & 2 \\ -3 & 1 & 3 & -1 \end{pmatrix}.
\]
One generalized inverse of A is
\[
G = \begin{pmatrix} 0 & 0 & 0 \\ -1 & 0 & 3 \\ 0 & 0 & 0 \\ -1 & 0 & 2 \end{pmatrix},
\]
as can be easily verified. Clearly,
\[
AG = \begin{pmatrix} 1 & 0 & 0 \\ -1 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}.
\]
If b D .3; 2; 5/0 , then
AGb D .3; 2; 5/0 D b;
in which case the linear system Ax D b is consistent. However, if b D .1; 2; 1/0 , then

AGb D .1; 0; 1/0 ¤ b;

in which case Ax D b is inconsistent.
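As an added numerical check of this example (using the matrices as printed above), the following Python/NumPy sketch applies the criterion of Theorem 2.11.1 to both right sides.

    import numpy as np

    A = np.array([[-6.0,  2.0, -2.0, -3.0],
                  [ 3.0, -1.0,  5.0,  2.0],
                  [-3.0,  1.0,  3.0, -1.0]])
    G = np.array([[ 0.0, 0.0, 0.0],
                  [-1.0, 0.0, 3.0],
                  [ 0.0, 0.0, 0.0],
                  [-1.0, 0.0, 2.0]])

    print(np.allclose(A @ G @ A, A))              # True: G is a generalized inverse of A

    for b in (np.array([3.0, 2.0, 5.0]), np.array([1.0, 2.0, 1.0])):
        consistent = np.allclose(A @ (G @ b), b)  # Theorem 2.11.1: does A A^- b equal b?
        print(b, "consistent" if consistent else "inconsistent")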

b. General form of a solution to a linear system


The following theorem gives an expression for the general form of a solution to a homogeneous
linear system in terms of any particular generalized inverse of the coefficient matrix.
Theorem 2.11.3. A matrix X is a solution to a homogeneous linear system AX D 0 (in X) if
and only if
X D .I A A/Y
for some matrix Y .
Proof. If X D .I A A/Y for some matrix Y , then

AX D .A AA A/Y D .A A/Y D 0;

so that X is a solution to AX D 0. Conversely, if X is a solution to AX D 0, then

X D X A 0 D X A AX D .I A A/X;

so that X D .I A A/Y for Y D X. Q.E.D.


According to Theorem 2.11.3, all solutions to the homogeneous linear system AX D 0 can be
generated by setting
X D .I A A/Y
and allowing Y to range over all matrices (of the appropriate dimensions).
As a special case of Theorem 2.11.3, we have that a column vector x is a solution to a homo-
geneous linear system Ax D 0 (in a column vector x) if and only if

x D .I A A/y

for some column vector y. Thus, we have the following corollary of Theorem 2.11.3.
Corollary 2.11.4. For any matrix A,

N.A/ D C.I A A/:

How “large” is the solution space of a homogeneous linear system Ax D 0 (in an N -dimensional
column vector x)? The answer is given by the following lemma.

Lemma 2.11.5. Let A represent an M  N matrix. Then,


dimŒN.A/ D N rank.A/:
That is, the dimension of the solution space of the homogeneous linear system Ax D 0 (in an
N -dimensional column vector x) equals N rank.A/.
Proof. Recalling (from Lemma 2.10.12) that A A is idempotent and making use of Corollary
2.11.4 and Lemmas 2.8.4 and 2.10.13, we find that
dimŒN.A/ D dimŒC.IN A A/ D rank.IN A A/ D N rank.A A/ D N rank.A/:
Q.E.D.
The solution space of a homogeneous linear system Ax D 0 (in an N -dimensional column vector
x) is a subspace of the linear space RN of all N -dimensional column vectors. According to Lemma
2.11.5, the dimension of this subspace equals N rank.A/. Thus, if rank.A/ D N, that is, if A is of
full column rank, then the homogeneous linear system Ax D 0 has a unique solution, namely, the
null vector 0. And if rank.A/ < N , then Ax D 0 has an infinite number of solutions.
The following theorem relates the solutions of an arbitrary (consistent) linear system AX D B
(in X) to those of the homogeneous linear system AZ D 0 (in Z). (To avoid confusion, the matrix
of unknowns of the homogeneous linear system is being denoted by a different symbol than that of
the linear system AX D B.)
Theorem 2.11.6. Let X0 represent any particular solution to a consistent linear system AX D B
(in X). Then, a matrix X is a solution to AX D B if and only if
X D X0 C Z
for some solution Z to the homogeneous linear system AZ D 0 (in Z).
Proof. If X D X0 C Z for some solution Z to AZ D 0, then
AX D AX0 C AZ D B C 0 D B;
so that X is a solution to AX D B. Conversely, if X is a solution to AX D B, then, defining
Z D X X0 , we find that X D X0 C Z and that
AZ D AX AX0 D B BD0

(so that Z is a solution to AZ D 0). Q.E.D.
The upshot of Theorem 2.11.6 is that all of the matrices in the solution set of a consistent linear
system AX D B (in X) can be generated from any particular solution X0 by setting
X D X0 C Z
and allowing Z to range over all of the matrices in the solution space of the homogeneous linear
system AZ D 0 (in Z).
It follows from Theorem 2.10.2 that one solution to a consistent linear system AX D B (in X)
is the matrix A B. Thus, in light of Theorem 2.11.6, we have the following extension of Theorem
2.11.3.
Theorem 2.11.7. A matrix X is a solution to a consistent linear system AX D B (in X) if and
only if
X D A B C .I A A/Y (11.1)
for some matrix Y .
As a special case of Theorem 2.11.7, we have that a column vector x is a solution to a consistent
linear system Ax D b (in a column vector x) if and only if
x D A b C .I A A/y (11.2)
for some column vector y.

Consider, for example, expression (11.2) as applied to the linear system


\[
\begin{pmatrix} -6 & 2 & -2 & -3 \\ 3 & -1 & 5 & 2 \\ -3 & 1 & 3 & -1 \end{pmatrix} x
= \begin{pmatrix} 3 \\ 2 \\ 5 \end{pmatrix}, \tag{11.3}
\]
the consistency of which was established earlier (in Subsection a). Taking A to be the coefficient
matrix and b the right side of linear system (11.3), choosing
\[
A^- = \begin{pmatrix} 0 & 0 & 0 \\ -1 & 0 & 3 \\ 0 & 0 & 0 \\ -1 & 0 & 2 \end{pmatrix},
\]
and denoting the elements of y by y₁, y₂, y₃, and y₄, respectively, we find that
\[
A^- b + (I - A^- A)y
= \begin{pmatrix} y_1 \\ 12 + 3y_1 - 11y_3 \\ y_3 \\ 7 - 8y_3 \end{pmatrix}. \tag{11.4}
\]

Thus, the members of the solution set of linear system (11.3) consist of all vectors of the general
form (11.4).
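As an added check, the following Python/NumPy sketch generates solutions of linear system (11.3) from expression (11.2), using the generalized inverse displayed above and random choices of y, and verifies that each one satisfies Ax = b.

    import numpy as np

    A = np.array([[-6.0,  2.0, -2.0, -3.0],
                  [ 3.0, -1.0,  5.0,  2.0],
                  [-3.0,  1.0,  3.0, -1.0]])
    G = np.array([[ 0.0, 0.0, 0.0],
                  [-1.0, 0.0, 3.0],
                  [ 0.0, 0.0, 0.0],
                  [-1.0, 0.0, 2.0]])
    b = np.array([3.0, 2.0, 5.0])

    rng = np.random.default_rng(5)
    I = np.eye(4)
    for _ in range(3):
        y = rng.standard_normal(4)
        x = G @ b + (I - G @ A) @ y      # expression (11.2)
        print(np.allclose(A @ x, b))     # True: every such x solves A x = b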
A possibly nonhomogeneous linear system AX = B (in an N × P matrix X) has 0 solutions, 1
solution, or an infinite number of solutions. If AX = B is inconsistent, then, by definition, it has 0
solutions. If AX = B is consistent and A is of full column rank (i.e., of rank N), then I − A⁻A = 0
(as is evident from Lemma 2.10.3), and it follows from Theorem 2.11.7 that AX = B has 1 solution.
If AX = B is consistent and rank(A) < N, then I − A⁻A ≠ 0 (since otherwise we would arrive at
a contradiction of Lemma 2.5.1), and it follows from Theorem 2.11.7 that AX = B has an infinite
number of solutions.
In the special case of a linear system with a nonsingular coefficient matrix, we obtain the follow-
ing, additional result (that can be regarded as a consequence of Theorems 2.9.3 and 2.11.7 or that
can be verified directly).
Theorem 2.11.8. If the coefficient matrix A of a linear system AX D B (in X) is nonsingular,
then AX D B has a unique solution and that solution equals A 1 B.

2.12 Projection Matrices


Let X represent an arbitrary matrix. It follows from the very definition of a generalized inverse that

X0 X.X0 X/ X0 X D X0 X:

And, upon applying Part (1) of Corollary 2.3.4 [with A D X, B D .X0 X/ X0 X, and C D I], we
find that
X.X0 X/ X0 X D X: (12.1)
Similarly, applying Part (2) of Corollary 2.3.4 [with A D X, B D X0 X.X0 X/ , and C D I], we
find that
X0 X.X0 X/ X0 D X0 : (12.2)

In light of results (12.1) and (12.2), Corollary 2.4.4 implies that R.X/  R.X0 X/ and that
C.X0 /  C.X0 X/. Since Corollary 2.4.4 also implies that R.X0 X/  R.X/ and C.X0 X/  C.X0 /,
we conclude that
R.X0 X/ D R.X/ and C.X0 X/ D C.X0 /:
Moreover, matrices having the same row space have the same rank, so that
rank.X0 X/ D rank.X/:

Now, let PX represent the (square) matrix X.X0 X/ X0 . Then, results (12.1) and (12.2) can be
restated succinctly as
PX X D X (12.3)
and
X0 PX D X0 : (12.4)
For any generalized inverses G1 and G2 of X0 X, we have [in light of result (12.1)] that

XG1 X0 X D X D XG2 X0 X:

And, upon applying Part (2) of Corollary 2.3.4 (with A D X, B D XG1 , and C D XG2 ), we find
that
XG1 X0 D XG2 X0 :
Thus, PX is invariant to the choice of the generalized inverse .X0 X/ .
There is a stronger version of this invariance property. Consider the linear system X0 XB D X0
(in B). A solution to this linear system can be obtained by taking B D .X0 X/ X0 [as is evident from
result (12.2)], and, for B D .X0 X/ X0 , PX D XB. Moreover, for any two solutions B1 and B2 to
X0 XB D X0 , we have that X0 XB1 D X0 D X0 XB2 , and, upon applying Part (1) of Corollary 2.3.4
(with A D X, B D B1 , and C D B2 ), we find that XB1 D XB2 . Thus, PX D XB for every solution
to X0 XB D X0 .
Note that PX0 D XŒ.X0 X/ 0 X0 . According to Corollary 2.10.11, Œ.X0 X/ 0 , like .X0 X/ itself,
is a generalized inverse of X0 X, and, since PX is invariant to the choice of the generalized inverse
.X0 X/ , it follows that PX D XŒ.X0 X/ 0 X0 . Thus, PX0 D PX ; that is, PX is symmetric. And PX is
idempotent, as is evident upon observing [in light of result (12.3)] that

PX2 D PX X.X0 X/ X0 D X.X0 X/ X0 D PX :


Moreover, rank.PX / D rank.X/, as is evident upon observing [in light of Corollary 2.4.17 and result
(12.3)] that
rank PX  rank.X/ D rank.PX X/  rank.PX /:
Summarizing, we have the following lemma and theorem.
Lemma 2.12.1. For any matrix X,

R.X0 X/ D R.X/; C.X0 X/ D C.X0 /; and rank.X0 X/ D rank.X/:

Theorem 2.12.2. Take X to be an arbitrary matrix, and let PX D X.X0 X/ X0 . Then,


(1) PX X D X, that is, X.X0 X/ X0 X D X;
(2) X0 PX D X0 , that is, X0 X.X0 X/ X0 D X0 ;
(3) PX is invariant to the choice of the generalized inverse .X0 X/ ;
(30 ) PX D XB for any solution B to the (consistent) linear system X0 XB D X0 (in B);
(4) PX0 D PX , that is, PX is symmetric;
(5) PX2 D PX , that is, PX is idempotent;
(6) rank.PX / D rank.X/.
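As an added illustration, the following Python/NumPy sketch computes P_X for a matrix X of full column rank (so that the ordinary inverse (X′X)⁻¹ can serve as the generalized inverse (X′X)⁻) and checks properties (1), (4), (5), and (6); the dimensions are arbitrary.

    import numpy as np

    rng = np.random.default_rng(6)
    X = rng.standard_normal((8, 3))

    # P_X = X (X'X)^- X'; here X has full column rank, so (X'X)^{-1} is used.
    P = X @ np.linalg.inv(X.T @ X) @ X.T

    print(np.allclose(P @ X, X))        # (1)  P_X X = X
    print(np.allclose(P, P.T))          # (4)  symmetric
    print(np.allclose(P @ P, P))        # (5)  idempotent
    print(np.linalg.matrix_rank(P))     # (6)  rank(P_X) = rank(X) = 3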

Subsequently, we continue to use (for any matrix X) the symbol PX to represent the matrix
X.X0 X/ X0 . For reasons that will eventually become apparent, a matrix of the general form PX is
referred to as a projection matrix.
Upon applying Corollary 2.4.5, we obtain the following generalization of Lemma 2.12.1.
Lemma 2.12.3. For any M  N matrix X and for any P  N matrix S and N  Q matrix T,
C.SX0 X/ D C.SX0 / and rank.SX0 X/ D rank.SX0 /
and
R.X0 XT / D R.XT / and rank.X0 XT / D rank.XT /:

2.13 Quadratic Forms


Let A = {a_ij} represent an arbitrary N × N matrix, and consider the function that assigns to each
N-dimensional column vector x = (x₁, x₂, ..., x_N)′ in R^N the value
\[
x'Ax = \sum_{i,j} a_{ij}x_i x_j
     = \sum_i a_{ii}x_i^2 + \sum_{i,\,j \ne i} a_{ij}x_i x_j.
\]

A function of x that is expressible in the form x0 Ax is called a quadratic form (in x). It is customary
to refer to A as the matrix of the quadratic form x0 Ax.
Let B D fbij g represent a second N N matrix. Under what circumstances are the two quadratic
forms x0 Ax and x0 Bx identically equal (i.e., equal for every value of x)? Clearly, a sufficient condition
for them to be identically equal is that A D B. However, except in the special case where N D 1,
A D B is not a necessary condition.
For purposes of establishing a necessary condition, suppose that x0 Ax and x0 Bx are identically
equal. Then, setting x equal to the i th column of IN , we find that

ai i D x0 Ax D x0 Bx D bi i .i D 1; 2; : : : ; N /: (13.1)

That is, the diagonal elements of A are the same as those of B. Consider now the off-diagonal elements
of A and B. Setting x equal to the N -dimensional column vector whose i th and j th elements equal
1 and whose remaining elements equal 0, we find that
ai i C aij C aj i C ajj D x0 Ax D x0 Bx D bi i C bij C bj i C bjj .j ¤ i D 1; 2; : : : ; N /: (13.2)
Together, results (13.1) and (13.2) imply that
ai i D bi i and aij C aj i D bij C bj i .j ¤ i D 1; 2; : : : ; N /
or, equivalently, that
A C A0 D B C B0 : (13.3)
Thus, condition (13.3) is a necessary condition for x0 Ax and x0 Bx to be identically equal. It is
also a sufficient condition. To see this, observe that (since a 1  1 matrix is symmetric) condition
(13.3) implies that

x0 Ax D .1=2/Œx0 Ax C .x0 Ax/0 


D .1=2/.x0Ax C x0 A0 x/
D .1=2/x0.A C A0 /x
D .1=2/x0.B C B0 /x D .1=2/Œx0 Bx C .x0 Bx/0  D x0 Bx:

In summary, we have the following lemma.



Lemma 2.13.1. Let A D faij g and B D fbij g represent arbitrary N  N matrices. The two
quadratic forms x0 Ax and x0 Bx (in x) are identically equal if and only if, for j ¤ i D 1; 2; : : : ; N ,
ai i D bi i and aij C aj i D bij C bj i or, equivalently, if and only if A C A0 D B C B0 .
Note that Lemma 2.13.1 implies in particular that the quadratic form x0 A0 x (in x) is identically
equal to the quadratic form x0 Ax (in x). When B is symmetric, the condition A C A0 D B C B0 is
equivalent to the condition B D .1=2/.ACA0 /, and when both A and B are symmetric, the condition
A C A0 D B C B0 is equivalent to the condition A D B. Thus, we have the following two corollaries
of Lemma 2.13.1.
Corollary 2.13.2. Corresponding to any quadratic form x0 Ax (in x), there is a unique symmetric
matrix B such that x0 Bx D x0 Ax for all x, namely, the matrix B D .1=2/.A C A0 /.
Corollary 2.13.3. For any pair of N  N symmetric matrices A and B, the two quadratic forms
x0 Ax and x0 Bx (in x) are identically equal (i.e., x0 Ax D x0 Bx for all x) if and only if A D B.
As a special case of Corollary 2.13.3 (that where B D 0), we have the following additional
corollary.
Corollary 2.13.4. Let A represent an N  N symmetric matrix. If x0 Ax D 0 for every (N  1)
vector x, then A D 0.
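As an added illustration of Corollary 2.13.2, the following Python/NumPy sketch replaces an arbitrary square matrix A by its symmetric part B = (1/2)(A + A′) and confirms that the two quadratic forms agree at a randomly chosen x.

    import numpy as np

    rng = np.random.default_rng(7)
    A = rng.standard_normal((4, 4))           # not symmetric in general
    B = (A + A.T) / 2                         # the unique symmetric matrix of Corollary 2.13.2

    x = rng.standard_normal(4)
    print(np.isclose(x @ A @ x, x @ B @ x))   # True: the quadratic forms agree for this x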

a. Nonnegative definiteness and positive definiteness or semidefiniteness


A quadratic form x0 Ax [in an N -dimensional column vector x D .x1 ; x2 ; : : : ; xN /0 ] is said to be
nonnegative definite if x0 Ax  0 for every x in RN. Note that x0 Ax D 0 for at least one value of
x, namely, x D 0. If x0 Ax is nonnegative definite and if, in addition, the null vector 0 is the only
value of x for which x0 Ax D 0, then x0 Ax is said to be positive definite. That is, x0 Ax is positive
definite if x0 Ax > 0 for every x except x D 0. A quadratic form that is nonnegative definite, but not
positive definite, is called positive semidefinite. Thus, x0 Ax is positive semidefinite if x0 Ax  0 for
every x 2 RN and x0 Ax D 0 for some nonnull x.
Consider, for example, the two quadratic forms x′I_N x = x₁² + x₂² + ··· + x_N² and x′1_N 1_N′ x =
(1_N′ x)′1_N′ x = (x₁ + x₂ + ··· + x_N)². Clearly, x′I_N x and x′1_N 1_N′ x are both nonnegative definite.
Moreover, x′I_N x > 0 for all nonnull x, while (assuming that N ≥ 2) x′1_N 1_N′ x = 0 for the nonnull
vector x = (1 − N, 1, 1, ..., 1)′. Accordingly, x′I_N x is positive definite, and x′1_N 1_N′ x is positive
semidefinite.
The terms nonnegative definite, positive definite, and positive semidefinite are applied to matrices
as well as to quadratic forms. An N  N matrix A is said to be nonnegative definite, positive definite,
or positive semidefinite if the quadratic form x0 Ax (in x) is nonnegative definite, positive definite,
or positive semidefinite, respectively.
It is instructive to consider the following lemma, which characterizes the concepts of nonnegative
definiteness, positive definiteness, and positive semidefiniteness as applied to diagonal matrices and
which is easy to verify.
Lemma 2.13.5. Let D D fdi g represent an N  N diagonal matrix. Then, (1) D is nonnegative
definite if and only if d1 ; d2 ; : : : ; dN are nonnegative; (2) D is positive definite if and only if
d1 ; d2 ; : : : ; dN are (strictly) positive; and (3) D is positive semidefinite if and only if di  0 for
i D 1; 2; : : : ; N with equality holding for one or more values of i .
The following two lemmas give some basic results on scalar multiples and sums of nonnegative
definite matrices.
Lemma 2.13.6. Let k (> 0) represent a (strictly) positive scalar, and A an N  N matrix. If A
is positive definite, then kA is also positive definite. Similarly, if A is positive semidefinite, then kA
is also positive semidefinite.
Proof. Consider the two quadratic forms x0 Ax and x0 .kA/x (in x). Clearly, x0 .kA/x D kx0 Ax.

Thus, if x0 Ax is positive definite, then x0 .kA/x is positive definite; similarly, if x0 Ax is positive


semidefinite, then x0 .kA/x is positive semidefinite. Or, equivalently, if A is positive definite, then
kA is positive definite; and if A is positive semidefinite, then kA is positive semidefinite. Q.E.D.
Lemma 2.13.7. Let A and B represent N N matrices. If A and B are both nonnegative definite,
then A C B is nonnegative definite. Moreover, if either A or B is positive definite and the other is
nonnegative definite (i.e., either positive definite or positive semidefinite), then A C B is positive
definite.
Proof. Suppose that one of the matrices, say A, is positive definite and that the other (B) is
nonnegative definite. Then, for every nonnull vector x in RN , x0 Ax > 0 and x0 Bx  0, and hence
x0 .A C B/x D x0 Ax C x0 Bx > 0:
Thus, A C B is positive definite. A similar argument shows that if A and B are both nonnegative
definite, then A C B is nonnegative definite. Q.E.D.
The repeated application of Lemma 2.13.7 leads to the following generalization.
Lemma 2.13.8. Let A1 ; A2 ; : : : ; AK represent N  N matrices. If A1 , A2 , : : : ; AK are all
nonnegative definite, then their sum A1 C A2 C    C AK is also nonnegative definite. Moreover,
if one or more of the matrices A1 ; A2 ; : : : ; AK are positive definite and the others are nonnegative
definite, then A1 C A2 C    C AK is positive definite.
A basic property of positive definite matrices is described in the following lemma.
Lemma 2.13.9. Any positive definite matrix is nonsingular.
Proof. Let A represent an N  N positive definite matrix. For purposes of establishing a con-
tradiction, suppose that A is singular or, equivalently, that rank.A/ < N . Then, the columns of A
are linearly dependent, and hence there exists a nonnull vector x such that Ax D 0. We find that
x0 Ax D x0 .Ax / D x0 0 D 0;
which (since A is positive definite) establishes the desired contradiction. Q.E.D.
Additional basic properties of nonnegative definite matrices are established in the following
theorem and corollaries.
Theorem 2.13.10. Let A represent an N  N matrix, and P an N  M matrix.
(1) If A is nonnegative definite, then P 0 AP is nonnegative definite.
(2) If A is nonnegative definite and rank.P / < M , then P 0 AP is positive semidefinite.
(3) If A is positive definite and rank.P / D M , then P 0 AP is positive definite.
Proof. Suppose that A is nonnegative definite (either positive definite or positive semidefinite).
Then, y 0 Ay  0 for every y in RN and in particular for every y that is expressible in the form
y D P x. Thus, for every M -dimensional (column) vector x,
x0 .P 0 AP /x D .P x/0 AP x  0; (13.4)
0
which establishes that P AP is nonnegative definite, thereby completing the proof of Part (1).
If rank.P / < M , then
rank.P 0 AP /  rank.P / < M;
which (in light of Lemma 2.13.9) establishes that P 0 AP is not positive definite and hence (since
P 0 AP is nonnegative definite) that P 0 AP is positive semidefinite, thereby completing the proof of
Part (2).
If A is positive definite, then equality is attained in inequality (13.4) only when P x D 0.
Moreover, if rank.P / D M (so that the columns of P are linearly independent), then P x D 0
implies x D 0. Thus, if A is positive definite and rank.P / D M, then equality is attained in
inequality (13.4) only when x D 0, implying (since P 0 AP is nonnegative definite) that P 0 AP is
positive definite. Q.E.D.
Corollary 2.13.11. Let A represent an N  N matrix and P an N  N nonsingular matrix.

(1) If A is positive definite, then P 0AP is positive definite.


(2) If A is positive semidefinite, then P 0AP is positive semidefinite.
Proof. (1) That P 0AP is positive definite if A is positive definite is a direct consequence of Part
(3) of Theorem 2.13.10.
(2) Suppose that A is positive semidefinite. Then, according to Part (1) of Theorem 2.13.10,
P 0AP is nonnegative definite. Moreover, there exists a nonnull vector y such that y 0Ay D 0. Now,
let x D P 1 y. Accordingly, y D P x, and we find that x ¤ 0 (since, otherwise, we would have that
y D 0) and that
x0 .P 0AP /x D .P x/0 AP x D y 0Ay D 0:
We conclude that P 0AP is positive semidefinite. Q.E.D.
Corollary 2.13.12. A positive definite matrix is invertible, and its inverse is positive definite.
Proof. Let A represent a positive definite matrix. Then, according to Lemma 2.13.9, A is non-
singular and hence (according to Theorem 2.5.4) invertible. Further, since .A 1 /0 D .A 1 /0 AA 1 , it
follows from Part (1) of Corollary 2.13.11 [together with result (5.6)] that .A 1 /0 is positive definite.
And, upon observing that the two quadratic forms x0 A 1 x and x0 .A 1 /0 x (in x) are identically equal
(as is evident from Lemma 2.13.1), we conclude that A 1 is positive definite. Q.E.D.
Corollary 2.13.13. Any principal submatrix of a positive definite matrix is positive definite; any
principal submatrix of a positive semidefinite matrix is nonnegative definite.
Proof. Let A represent an N  N matrix, and consider the principal submatrix of A obtained
by striking out all of its rows and columns except its i1 , i2 , : : : ; iM th rows and columns (where
i1 < i2 <    < iM ). This submatrix is expressible as P 0AP , where P is the N  M matrix whose
columns are the i1 , i2 , : : : ; iM th columns of IN . Since rank.P / D M, it follows from Part (3) of
Theorem 2.13.10 that P 0AP is positive definite if A is positive definite. Further, it follows from
Part (1) of Theorem 2.13.10 that P 0AP is nonnegative definite if A is nonnegative definite (and, in
particular, if A is positive semidefinite). Q.E.D.
Corollary 2.13.14. The diagonal elements of a positive definite matrix are positive; the diagonal
elements of a positive semidefinite matrix are nonnegative.
Proof. The corollary follows immediately from Corollary 2.13.13 upon observing (1) that the i th
diagonal element of a (square) matrix A is the element of a 1  1 principal submatrix (that obtained
by striking out all of the rows and columns of A except the i th row and column) and (2) that the
element of a 1  1 positive definite matrix is positive and the element of a 1  1 nonnegative definite
matrix is nonnegative. Q.E.D.
Corollary 2.13.15. Let P represent an arbitrary N  M matrix. The M  M matrix P 0 P is
nonnegative definite. If rank.P / D M, P 0 P is positive definite; otherwise (if rank.P / < M ), P 0 P
is positive semidefinite.
Proof. The corollary follows from Theorem 2.13.10 upon observing that P 0 P D P 0 IP and that
(as demonstrated earlier) I is positive definite. Q.E.D.
Corollary 2.13.16. Let P represent an N N nonsingular matrix and D D fdi g an N N diag-
onal matrix. Then, (1) P 0 DP is nonnegative definite if and only if d1 ; d2 ; : : : ; dN are nonnegative;
(2) P 0 DP is positive definite if and only if d1 ; d2 ; : : : ; dN are (strictly) positive; and (3) P 0 DP is
positive semidefinite if and only if di  0 for i D 1; 2; : : : ; N with equality holding for one or more
values of i .
Proof. Let A D P 0 DP . Then, clearly, D D .P 1 /0AP 1 . Accordingly, it follows from Part (1)
of Theorem 2.13.10 that A is nonnegative definite if and only if D is nonnegative definite; and, it
follows, respectively, from Parts (1) and (2) of Corollary 2.13.11 that A is positive definite if and only
if D is positive definite and that A is positive semidefinite if and only if D is positive semidefinite.
In light of Lemma 2.13.5, the proof is complete. Q.E.D.
A diagonal element, say the i th diagonal element, of a nonnegative definite matrix can equal 0
only if the other elements in the i th row and the i th column of that matrix satisfy the conditions set
forth in the following lemma and corollary.
Lemma 2.13.17. Let A = {aij} represent an N × N nonnegative definite matrix. If a_ii = 0, then, for j = 1, 2, ..., N, a_ij = −a_ji; that is, if the ith diagonal element of A equals 0, then (a_i1, a_i2, ..., a_iN), which is the ith row of A, equals −(a_1i, a_2i, ..., a_Ni), which is −1 times the transpose of the ith column of A (i = 1, 2, ..., N).
Proof. Suppose that a_ii = 0, and take x = {x_k} to be an N-dimensional column vector such that x_i < −a_jj, x_j = a_ij + a_ji, and x_k = 0 for k other than k = i and k = j (where j ≠ i). Then,
x′Ax = a_ii x_i² + (a_ij + a_ji)x_i x_j + a_jj x_j² = (a_ij + a_ji)²(x_i + a_jj) ≤ 0,    (13.5)
with equality holding only if a_ij + a_ji = 0 or, equivalently, only if a_ij = −a_ji. Moreover, since A is nonnegative definite, x′Ax ≥ 0, which, together with inequality (13.5), implies that x′Ax = 0. We conclude that a_ij = −a_ji. Q.E.D.
If the nonnegative definite matrix A in Lemma 2.13.17 is symmetric, then a_ij = −a_ji ⇔ 2a_ij = 0 ⇔ a_ij = 0. Thus, we have the following corollary.
Corollary 2.13.18. Let A D faij g represent an N  N symmetric nonnegative definite matrix.
If ai i D 0, then, for j D 1; 2; : : : ; N , aj i D aij D 0; that is, if the i th diagonal element of A equals
zero, then the i th column .a1i ; a2i ; : : : ; aN i /0 of A and the i th row .ai1 ; ai 2 : : : ; aiN / of A are null.
Moreover, if all N diagonal elements a11 ; a22 ; : : : ; aNN of A equal 0, then A D 0.

b. General form of symmetric nonnegative definite matrices
According to Corollary 2.13.15, every matrix A that is expressible in the form A D P 0 P is a
(symmetric) nonnegative definite matrix. Is the converse true? That is, is every symmetric nonnegative
definite matrix A expressible in the form A D P 0 P ? The answer is yes. In fact, corresponding to
any symmetric nonnegative definite matrix A is an upper triangular matrix P such that A D P 0 P ,
as we now proceed to show, beginning with the establishment of the following lemma.
Lemma 2.13.19. Let A = {aij} represent an N × N symmetric matrix (where N ≥ 2). Define B = {bij} = U′AU, where U is an N × N unit upper triangular matrix of the form U = [1 u′; 0 I_{N−1}]. Partition B and A as B = [b11 b′; b B22] and A = [a11 a′; a A22]. Suppose that a11 ≠ 0. Then, the (N−1)-dimensional vector u can be chosen so that b = 0; this can be done by taking u = −(1/a11)a.
Proof. We find that
b = (u, I)A(1, 0)′ = a11 u + a,
which, for u = −(1/a11)a, gives b = −a + a = 0. Q.E.D.
We are now in a position to establish the following theorem.
Theorem 2.13.20. Corresponding to any N × N symmetric nonnegative definite matrix A, there exists a unit upper triangular matrix Q such that Q′AQ is a diagonal matrix.
Proof. The proof is by mathematical induction. The theorem is clearly valid for any 1 × 1 (symmetric) nonnegative definite matrix. Suppose now that it is valid for any (N−1) × (N−1) symmetric nonnegative definite matrix, and consider an arbitrary N × N symmetric nonnegative definite matrix A = {aij}. For purposes of establishing the existence of a unit upper triangular matrix Q such that Q′AQ is diagonal, it is convenient to partition A as A = [a11 a′; a A22] and to consider the case where a11 > 0 separately from that where a11 = 0 (since A is nonnegative definite, a11 ≥ 0).
Case (1): a11 > 0. According to Lemma 2.13.19, there exists a unit upper triangular matrix U such that U′AU = diag(b11, B22) for some scalar b11 and some (N−1) × (N−1) matrix B22. Moreover, B22 is symmetric and nonnegative definite (as is evident upon observing that U′AU is symmetric and nonnegative definite). Thus, by supposition, there exists a unit upper triangular matrix Q* such that Q*′B22Q* is a diagonal matrix. Take Q = U diag(1, Q*). Then, Q is a unit upper triangular matrix, as is evident upon observing that it is the product of two unit upper triangular matrices (a product of 2 unit upper triangular matrices is itself a unit upper triangular matrix, as can be readily verified by, e.g., making use of result (2.5)). And
Q′AQ = diag(1, Q*′) diag(b11, B22) diag(1, Q*) = diag(b11, Q*′B22Q*),
which (like Q*′B22Q*) is a diagonal matrix.
Case (2): a11 = 0. The submatrix A22 is an (N−1) × (N−1) symmetric nonnegative definite matrix. Thus, by supposition, there exists a unit upper triangular matrix Q* such that Q*′A22Q* is a diagonal matrix. Take Q = diag(1, Q*). Then, Q is a unit upper triangular matrix. And, upon observing (in light of Corollary 2.13.18) that a11 = 0 implies a = 0, we find that Q′AQ = diag(a11, Q*′A22Q*), which (like Q*′A22Q*) is a diagonal matrix. Q.E.D.
Observe (in light of the results of Section 2.6 b) that a unit upper triangular matrix is nonsingular
and that its inverse is a unit upper triangular matrix. And note (in connection with Theorem 2.13.20)
that if Q is a unit upper triangular matrix such that Q′AQ = D for some diagonal matrix D, then
A = (Q⁻¹)′Q′AQQ⁻¹ = (Q⁻¹)′DQ⁻¹.
Thus, we have the following corollary.
Corollary 2.13.21. Corresponding to any N  N symmetric nonnegative definite matrix A,
there exist a unit upper triangular matrix U and a diagonal matrix D such that A D U 0 DU.
Suppose that A is an N × N symmetric nonnegative definite matrix, and take U to be a unit upper triangular matrix and D a diagonal matrix such that A = U′DU (the existence of such matrices is established in Corollary 2.13.21). Further, let R = rank A, and denote by d1, d2, ..., dN the diagonal elements of D. Then, di ≥ 0 (i = 1, 2, ..., N), as is evident from Corollary 2.13.16. And A is expressible as A = T′T, where T = diag(√d1, √d2, ..., √dN) U. Moreover, rank D = R, so that R of the diagonal elements of D are strictly positive and the others are equal to 0. Thus, as an additional corollary of Theorem 2.13.20, we have the following result.
Corollary 2.13.22. Let A represent an N × N symmetric nonnegative definite matrix. And let R = rank A. Then, there exists an upper triangular matrix T with R (strictly) positive diagonal elements and with N − R null rows such that
A = T′T.    (13.6)

Let (for i, j = 1, 2, ..., N) t_ij represent the ijth element of the upper triangular matrix T in decomposition (13.6) (and observe that t_ij = 0 for i > j). Equality (13.6) implies that
t_11 = √a_11,    (13.7)
t_1j = a_1j / t_11 if t_11 > 0, and t_1j = 0 if t_11 = 0,    (13.8)
t_ij = (a_ij − Σ_{k=1}^{i−1} t_ki t_kj) / t_ii if t_ii > 0, and t_ij = 0 if t_ii = 0    (13.9)
(i = 2, 3, ..., j−1),
t_jj = (a_jj − Σ_{k=1}^{j−1} t_kj²)^{1/2}    (13.10)
(j = 2, 3, ..., N).
It follows from equalities (13.7), (13.8), (13.9), and (13.10) that decomposition (13.6) is unique.
This decomposition is known as the Cholesky decomposition. It can be computed by a method that is
sometimes referred to as the square root method. In the square root method, formulas (13.7), (13.8),
(13.9), and (13.10) are used to construct the matrix T in N steps, one row or one column per step.
Refer, for instance, to Harville (1997, sec. 14.5) for more details and for an illustrative example.
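To make the square root method concrete, here is a minimal Python sketch (our own illustration, not from the text; the function name and the use of NumPy are assumptions) that builds T one column at a time directly from formulas (13.7)–(13.10):

```python
import numpy as np

def cholesky_square_root(A):
    """Upper triangular T with T'T = A, built column by column from
    formulas (13.7)-(13.10); A is assumed symmetric nonnegative definite."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    T = np.zeros((N, N))
    for j in range(N):
        # above-diagonal elements of column j: formulas (13.8) and (13.9)
        for i in range(j):
            if T[i, i] > 0:
                T[i, j] = (A[i, j] - T[:i, i] @ T[:i, j]) / T[i, i]
            else:
                T[i, j] = 0.0
        # diagonal element: formula (13.7) for j = 0 and (13.10) otherwise
        d = A[j, j] - T[:j, j] @ T[:j, j]
        T[j, j] = np.sqrt(max(d, 0.0))   # guard against tiny negative round-off
    return T

# small check on a rank-2 nonnegative definite matrix
P = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]])
A = P.T @ P                      # symmetric nonnegative definite, rank 2
T = cholesky_square_root(A)
print(np.allclose(T.T @ T, A))   # True
```

In this small example the input has rank 2, so the returned T has one null row, in agreement with Corollary 2.13.22.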
As a variation on decomposition (13.6), we have the decomposition
A = T*′T*,    (13.11)
where T* is the R × N matrix whose rows are the nonnull rows of the upper triangular matrix T. Among the implications of result (13.11) is the following result, which can be regarded as an additional corollary of Theorem 2.13.20.
Corollary 2.13.23. Corresponding to any N × N (nonnull) symmetric nonnegative definite matrix A of rank R, there exists an R × N matrix P such that A = P′P (and any such R × N matrix P is of full row rank R).
The following corollary can be regarded as a generalization of Corollary 2.13.23.
Corollary 2.13.24. Let A represent an N  N matrix of rank R, and take M to be any positive
integer greater than or equal to R. If A is symmetric and nonnegative definite, then there exists an
M  N matrix P such that A D P 0 P (and any such M  N matrix P is of rank R).
Proof. Suppose that A is symmetric and nonnegative definite. And assume that R > 0 (when R = 0, A = 0 = 0′0). According to Corollary 2.13.23, there exists an R × N matrix P1 such that A = P1′P1. Take P to be the M × N matrix of the form P = [P1; 0]. Then, clearly, A = P′P. Moreover, it is clear from Lemma 2.12.1 that rank(P) = R for any M × N matrix P such that A = P′P. Q.E.D.
In light of Corollary 2.13.15, Corollary 2.13.24 has the following implication.
Corollary 2.13.25. An N  N matrix A is a symmetric nonnegative definite matrix if and only
if there exists a matrix P (having N columns) such that A D P 0 P .
Further results on symmetric nonnegative definite matrices are given by the following three
corollaries.
Corollary 2.13.26. Let A represent an N × N nonnegative definite matrix, and let R = rank(A + A′). Then, assuming that R > 0, there exists an R × N matrix P (of full row rank R) such that the quadratic form x′Ax (in an N-dimensional vector x) is expressible as the sum y1² + y2² + ··· + yR² of the squares of the elements y1, y2, ..., yR of the R-dimensional vector y = Px.
Proof. According to Corollary 2.13.2, there is a unique symmetric matrix B such that x′Ax = x′Bx for all x, namely, the matrix B = (1/2)(A + A′). Moreover, B is nonnegative definite, and (assuming that R > 0) it follows from Corollary 2.13.23 that there exists an R × N matrix P (of full row rank R) such that B = P′P. Thus, letting y1, y2, ..., yR represent the elements of the R-dimensional vector y = Px, we find that
x′Ax = x′P′Px = (Px)′Px = y1² + y2² + ··· + yR².    Q.E.D.

Corollary 2.13.27. For any N  M matrix X and any N  N symmetric nonnegative definite
matrix A, AX D 0 if and only if X0 AX D 0.
Proof. According to Corollary 2.13.25, there exists a matrix P such that A D P 0 P and hence
such that X0 AX D .P X/0 P X. Thus, if X0 AX D 0, then (in light of Corollary 2.3.3) P X D 0,
implying that AX D P 0 P X D P 0 0 D 0. That X0 AX D 0 if AX D 0 is obvious. Q.E.D.
Corollary 2.13.28. A symmetric nonnegative definite matrix is positive definite if and only if it
is nonsingular (or, equivalently, is positive semidefinite if and only if it is singular).
Proof. Let A represent an N N symmetric nonnegative definite matrix. If A is positive definite,
then we have, as an immediate consequence of Lemma 2.13.9, that A is nonsingular.
Suppose now that the symmetric nonnegative definite matrix A is nonsingular, and consider the
quadratic form x0 Ax (in x). If x0 Ax D 0, then, according to Corollary 2.13.27, Ax D 0, and
consequently x D A 1 Ax D A 1 0 D 0. Thus, the quadratic form x0 Ax is positive definite and
hence the matrix A is positive definite. Q.E.D.
When specialized to positive definite matrices, Corollary 2.13.23, in combination with Lemma
2.13.9 and Corollary 2.13.15, yields the following result.
Corollary 2.13.29. An N  N matrix A is a symmetric positive definite matrix if and only if
there exists a nonsingular matrix P such that A D P 0 P.
In the special case of a positive definite matrix, Corollary 2.13.26 can (in light of Corollary 2.13.2 and Lemma 2.13.9) be restated as follows; note that Corollary 2.13.2 implies that if A is positive definite, then so is (1/2)(A + A′).
Corollary 2.13.30. Let A represent an N × N positive definite matrix. Then, there exists a nonsingular matrix P such that the quadratic form x′Ax (in an N-dimensional vector x) is expressible as the sum y1² + y2² + ··· + yN² of the squares of the elements y1, y2, ..., yN of the transformed vector y = Px.
c. Positive definiteness or semidefiniteness of partitioned matrices
Lemma 2.13.5 characterizes the concepts of nonnegative definiteness, positive definiteness, and
positive semidefiniteness as applied to diagonal matrices. The following lemma extends the results
of Lemma 2.13.5 to block-diagonal matrices.
Lemma 2.13.31. Let Ai represent an Ni × Ni matrix (i = 1, 2, ..., K), let N = N1 + N2 + ··· + NK, and define A to be the N × N block-diagonal matrix A = diag(A1, A2, ..., AK). Then, (1) A is nonnegative definite if and only if A1, A2, ..., AK are nonnegative definite; (2) A is positive definite if and only if A1, A2, ..., AK are positive definite; and (3) A is positive semidefinite if and only if the diagonal blocks A1, A2, ..., AK are nonnegative definite with at least one of the diagonal blocks being positive semidefinite.
Proof. Consider the quadratic form x′Ax (in an N-dimensional column vector x) whose matrix is A. Partition x′ as x′ = (x1′, x2′, ..., xK′), where (for i = 1, 2, ..., K) xi is an Ni-dimensional (column) vector. Then, clearly,
x′Ax = x1′A1x1 + x2′A2x2 + ··· + xK′AKxK.

(1) If A1, A2, ..., AK are nonnegative definite, then, by definition, xi′Aixi ≥ 0 for every xi (i = 1, 2, ..., K), implying that x′Ax ≥ 0 for every x and hence that A is nonnegative definite. Conversely, if A is nonnegative definite, then it follows from Corollary 2.13.13 that A1, A2, ..., AK are nonnegative definite.
(2) If A1, A2, ..., AK are positive definite, then, by definition, xi′Aixi > 0 for every nonnull xi (i = 1, 2, ..., K), implying that x′Ax > 0 for every nonnull x and hence that A is positive definite. Conversely, if A is positive definite, then it follows from Corollary 2.13.13 that A1, A2, ..., AK are positive definite.
(3) Suppose that A1, A2, ..., AK are nonnegative definite and that for some i, say i = i*, A_{i*} is positive semidefinite. Then, by definition, xi′Aixi ≥ 0 for every xi (i = 1, 2, ..., K), and there exists some nonnull value of x_{i*}, say x_{i*} = x̃_{i*}, for which x̃_{i*}′A_{i*}x̃_{i*} = 0. It follows that x′Ax ≥ 0 for every x, with equality holding for x′ = (0′, ..., 0′, x̃_{i*}′, 0′, ..., 0′). Thus, A is positive semidefinite.
Conversely, suppose that A is positive semidefinite. Then, it follows from Part (1) that A1, A2, ..., AK are nonnegative definite. Moreover, it follows from Part (2) that not all of the matrices A1, A2, ..., AK are positive definite and hence (since they are nonnegative definite) that at least one of them is positive semidefinite. Q.E.D.
The following theorem relates the positive definiteness or semidefiniteness of a symmetric matrix
to the positive definiteness or semidefiniteness of the Schur complement of a positive definite principal
submatrix.
Theorem 2.13.32. Let T represent an M × M symmetric matrix, W an N × N symmetric matrix, and U an M × N matrix. Suppose that T is positive definite, and define
Q = W − U′T⁻¹U.
Then, the partitioned (symmetric) matrix [T U; U′ W] is positive definite if and only if Q is positive definite and is positive semidefinite if and only if Q is positive semidefinite. Similarly, the partitioned (symmetric) matrix [W U′; U T] is positive definite if and only if Q is positive definite and is positive semidefinite if and only if Q is positive semidefinite.
Proof. Let A = [T U; U′ W], and define P = [I_M −T⁻¹U; 0 I_N]. According to Lemma 2.6.2, P is nonsingular. Moreover,
P′AP = diag(T, Q),
as is easily verified. And, upon observing that
(P⁻¹)′ diag(T, Q) P⁻¹ = A,
it follows from Corollary 2.13.11 that A is positive definite if and only if diag(T, Q) is positive definite and is positive semidefinite if and only if diag(T, Q) is positive semidefinite. In light of Lemma 2.13.31, we conclude that A is positive definite if and only if Q is positive definite and is positive semidefinite if and only if Q is positive semidefinite, thereby completing the proof of the first part of Theorem 2.13.32. The second part of Theorem 2.13.32 can be proved in similar fashion. Q.E.D.
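As a small numerical illustration of the first part of Theorem 2.13.32 (our own example, not from the text; NumPy is assumed), one can compare the positive definiteness of the partitioned matrix with that of the Schur complement Q:

```python
import numpy as np

# T symmetric positive definite, U arbitrary, W symmetric
T = np.array([[2.0, 1.0], [1.0, 2.0]])
U = np.array([[1.0], [0.0]])
W = np.array([[1.0]])

A = np.block([[T, U], [U.T, W]])
Q = W - U.T @ np.linalg.inv(T) @ U          # Schur complement of T

# Theorem 2.13.32: A is positive definite exactly when Q is
print(np.linalg.eigvalsh(A).min() > 0)      # True for this choice of blocks
print(np.linalg.eigvalsh(Q).min() > 0)      # True as well
```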
As a corollary of Theorem 2.13.32, we have the following result on the positive definiteness of
a symmetric matrix.
Corollary 2.13.33. Suppose that a symmetric matrix A is partitioned as
A = [T U; U′ W]
(where T and W are square). Then, A is positive definite if and only if T and the Schur complement W − U′T⁻¹U of T are both positive definite. Similarly, A is positive definite if and only if W and the Schur complement T − UW⁻¹U′ of W are both positive definite.
Proof. If T is positive definite (in which case T is nonsingular) and W − U′T⁻¹U is positive definite, then it follows from the first part of Theorem 2.13.32 that A is positive definite. Similarly, if W is positive definite (in which case W is nonsingular) and T − UW⁻¹U′ is positive definite, then it follows from the second part of Theorem 2.13.32 that A is positive definite.
Conversely, suppose that A is positive definite. Then, it follows from Corollary 2.13.13 that T is positive definite and also that W is positive definite. And, based on the first part of Theorem 2.13.32, we conclude that W − U′T⁻¹U is positive definite; similarly, based on the second part of that theorem, we conclude that T − UW⁻¹U′ is positive definite. Q.E.D.
2.14 Determinants
Associated with any square matrix is a scalar that is known as the determinant of the matrix. As a
preliminary to defining the determinant, it is convenient to introduce a convention for classifying
various pairs of matrix elements as either positive or negative.
Let A = {aij} represent an arbitrary N × N matrix. Consider any pair of elements of A that do not lie either in the same row or the same column, say a_ij and a_i′j′ (where i′ ≠ i and j′ ≠ j). The pair is said to be a negative pair if one of the elements is located above and to the right of the other, or, equivalently, if either i′ > i and j′ < j or i′ < i and j′ > j. Otherwise (if one of the elements is located above and to the left of the other, or, equivalently, if either i′ > i and j′ > j or i′ < i and j′ < j), the pair is said to be a positive pair. Thus, the pair a_ij and a_i′j′ is classified as positive or negative in accordance with the following two-way table:

            i′ > i    i′ < i
  j′ > j      +         −
  j′ < j      −         +

For example (supposing that N ≥ 4), the pair a34 and a22 is positive, while the pair a34 and a41 is negative. Note that whether the pair a_ij and a_i′j′ is positive or negative is completely determined by the relative locations of a_ij and a_i′j′ and has nothing to do with whether a_ij and a_i′j′ are positive or negative numbers.
Now, consider N elements of A, no two of which lie either in the same row or the same column, say the i1j1, i2j2, ..., iNjNth elements (where both i1, i2, ..., iN and j1, j2, ..., jN are permutations of the first N positive integers). A total of N(N−1)/2 pairs can be formed from these N elements. The symbol φ_N(i1, j1; i2, j2; ...; iN, jN) is to be used to represent the number of these N(N−1)/2 pairs that are negative pairs.
Observe that φ_N(i1, j1; i2, j2; ...; iN, jN) has the following two properties:
(1) the value of φ_N(i1, j1; i2, j2; ...; iN, jN) is not affected by permuting its N pairs of arguments; in particular, it is not affected if the N pairs are permuted so that they are ordered by row number or by column number [e.g., φ3(2, 3; 1, 2; 3, 1) = φ3(1, 2; 2, 3; 3, 1) = φ3(3, 1; 1, 2; 2, 3)];
(2) φ_N(i1, j1; i2, j2; ...; iN, jN) = φ_N(j1, i1; j2, i2; ...; jN, iN)
[e.g., φ3(2, 3; 1, 2; 3, 1) = φ3(3, 2; 2, 1; 1, 3)].
For any sequence of N distinct integers i1, i2, ..., iN, define
φ_N(i1, i2, ..., iN) = p1 + p2 + ··· + p_{N−1},
where (for k = 1, 2, ..., N−1) p_k represents the number of integers in the subsequence i_{k+1}, i_{k+2}, ..., iN that are smaller than i_k. For example,
φ5(3, 7, 2, 1, 4) = 2 + 3 + 1 + 0 = 6.
Then, clearly, for any permutation i1, i2, ..., iN of the first N positive integers,
φ_N(1, i1; 2, i2; ...; N, iN) = φ_N(i1, 1; i2, 2; ...; iN, N) = φ_N(i1, i2, ..., iN).    (14.1)
The determinant of an N × N matrix A = {aij}, to be denoted by |A| or (to avoid confusion with the absolute value of a scalar) by det A or det(A), is defined by
|A| = Σ (−1)^{φ_N(1,j1; 2,j2; ...; N,jN)} a_{1j1} a_{2j2} ··· a_{NjN}    (14.2)
or, equivalently, by
|A| = Σ (−1)^{φ_N(j1, j2, ..., jN)} a_{1j1} a_{2j2} ··· a_{NjN},    (14.2′)
where j1 ; j2 ; : : : ; jN is a permutation of the first N positive integers and the summation is over all
such permutations.
Thus, the determinant of an N  N matrix A can (at least in principle) be obtained via the
following process:
(1) Form all possible products, each of N factors, that can be obtained by picking one and only one
element from each row and column of A.
(2) In each product, count the number of negative pairs among the N(N−1)/2 pairs of elements that can be generated from the N elements that contribute to this particular product. If the number of
negative pairs is an even number, attach a plus sign to the product; if it is an odd number, attach
a minus sign.
(3) Sum the signed products.
In particular, the determinant of a 1 × 1 matrix A = (a11) is
|A| = a11;    (14.3)
the determinant of a 2 × 2 matrix A = {aij} is
|A| = (−1)^0 a11a22 + (−1)^1 a12a21 = a11a22 − a12a21;    (14.4)
and the determinant of a 3 × 3 matrix A = {aij} is
|A| = (−1)^0 a11a22a33 + (−1)^1 a11a23a32 + (−1)^1 a12a21a33 + (−1)^2 a12a23a31 + (−1)^2 a13a21a32 + (−1)^3 a13a22a31
    = a11a22a33 + a12a23a31 + a13a21a32 − a11a23a32 − a12a21a33 − a13a22a31.    (14.5)
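For small N, the three-step process above can be carried out directly in code. The following Python sketch (our own illustration, not from the text; NumPy and the standard library's itertools are assumed, and the helper name phi is ours) sums the signed products over all permutations, exactly as in definition (14.2′):

```python
import numpy as np
from itertools import permutations

def phi(j):
    """Number of negative pairs, i.e. pairs (k, l) with k < l and j[k] > j[l]."""
    return sum(1 for k in range(len(j))
                 for l in range(k + 1, len(j)) if j[k] > j[l])

def det_by_expansion(A):
    """Determinant via definition (14.2'): sum of signed products over all
    permutations of the column indices (0-based indices are used here)."""
    N = A.shape[0]
    total = 0.0
    for j in permutations(range(N)):
        term = 1.0
        for i in range(N):
            term *= A[i, j[i]]
        total += (-1) ** phi(j) * term
    return total

A = np.array([[1.0, 2.0, 3.0], [0.0, 4.0, 5.0], [1.0, 0.0, 6.0]])
print(det_by_expansion(A), np.linalg.det(A))   # both give 22.0 (up to rounding)
```

This expansion involves N! terms, so it is useful only as an illustration of the definition; practical computation relies on the factorizations discussed elsewhere in this chapter.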
An alternative definition of the determinant of an N × N matrix A is
|A| = Σ (−1)^{φ_N(i1,1; i2,2; ...; iN,N)} a_{i1 1} a_{i2 2} ··· a_{iN N}    (14.6)
    = Σ (−1)^{φ_N(i1, i2, ..., iN)} a_{i1 1} a_{i2 2} ··· a_{iN N},    (14.6′)
where i1, i2, ..., iN is a permutation of the first N positive integers and the summation is over all such permutations.
Definition (14.6) is equivalent to definition (14.2). To see this, observe that the product a_{1j1} a_{2j2} ··· a_{NjN}, which appears in definition (14.2), can be reexpressed by permuting the N factors a_{1j1}, a_{2j2}, ..., a_{NjN} so that they are ordered by column number, giving
a_{1j1} a_{2j2} ··· a_{NjN} = a_{i1 1} a_{i2 2} ··· a_{iN N},
where i1, i2, ..., iN is a permutation of the first N positive integers that is defined uniquely by
j_{i1} = 1, j_{i2} = 2, ..., j_{iN} = N.
Further,
φ_N(1, j1; 2, j2; ...; N, jN) = φ_N(i1, j_{i1}; i2, j_{i2}; ...; iN, j_{iN}) = φ_N(i1, 1; i2, 2; ...; iN, N),
so that
(−1)^{φ_N(1,j1; 2,j2; ...; N,jN)} a_{1j1} a_{2j2} ··· a_{NjN} = (−1)^{φ_N(i1,1; i2,2; ...; iN,N)} a_{i1 1} a_{i2 2} ··· a_{iN N}.
Thus, we can establish a one-to-one correspondence between the terms of the sum (14.6) and the terms of the sum (14.2) such that the corresponding terms are equal. We conclude that the two sums are themselves equal.
In considering the determinant of a partitioned matrix, say [A11 A12; A21 A22], it is customary to abbreviate the notation by writing the determinant bars directly around the array of blocks, that is, to write |A11 A12; A21 A22| for the determinant of that partitioned matrix.
a. Determinants of triangular and diagonal matrices
In the special case of a triangular matrix, the formula for the determinant simplifies greatly, as
described in the following lemma.
Lemma 2.14.1. If an N × N matrix A = {aij} is (upper or lower) triangular, then
|A| = a11 a22 ··· aNN,    (14.7)
that is, the determinant of a triangular matrix equals the product of its diagonal elements.
Proof. Consider the case where A is lower triangular, that is, of the form
A = [a11 0 ··· 0; a21 a22 ··· 0; ... ; aN1 aN2 ··· aNN].
That |A| = a11 a22 ··· aNN is clear upon observing that the only term in the sum (14.2) or (14.2′) that can be nonzero is that corresponding to the permutation j1 = 1, j2 = 2, ..., jN = N [and upon observing that φ_N(1, 2, ..., N) = 0]. (To verify formally that this is the only term that can be nonzero, let j1, j2, ..., jN represent an arbitrary permutation of the first N positive integers, and suppose that a_{1j1} a_{2j2} ··· a_{NjN} ≠ 0 or, equivalently, that a_{i j_i} ≠ 0 for i = 1, 2, ..., N. Then, it is clear that j1 = 1 and that if j1 = 1, j2 = 2, ..., j_{i−1} = i−1, then j_i = i. We conclude, on the basis of mathematical induction, that j1 = 1, j2 = 2, ..., jN = N.)
The validity of formula (14.7) as applied to an upper triangular matrix follows from a similar argument. Q.E.D.
Note that Lemma 2.14.1 implies in particular that the determinant of a unit (upper or lower)
triangular matrix equals 1. And, as a further implication of Lemma 2.14.1, we have the following
corollary.
Corollary 2.14.2. The determinant of a diagonal matrix equals the product of its diagonal
elements.
As obvious special cases of Corollary 2.14.2, we have that

|0| = 0,    (14.8)
|I| = 1.    (14.9)

b. Some basic results on determinants
The following lemma relates the determinant of a matrix to the determinant of its transpose.
Lemma 2.14.3. For any N × N matrix A,
|A′| = |A|.    (14.10)
Proof. Let aij and bij represent the ijth elements of A and A′, respectively (i, j = 1, 2, ..., N). Then, in light of the equivalence of definitions (14.2′) and (14.6′),
|A′| = Σ (−1)^{φ_N(j1, j2, ..., jN)} b_{1j1} b_{2j2} ··· b_{NjN} = Σ (−1)^{φ_N(j1, j2, ..., jN)} a_{j1 1} a_{j2 2} ··· a_{jN N} = |A|,
where j1 ; j2 ; : : : ; jN is a permutation of the first N positive integers and the summations are over
all such permutations. Q.E.D.
As an immediate consequence of the definition of a determinant, we have the following lemma.
Lemma 2.14.4. If an N  N matrix B is formed from an N  N matrix A by multiplying all of
the elements of one row or one column of A by the same scalar k (and leaving the elements of the
other N 1 rows or columns unchanged), then
jBj D kjAj:

As a corollary of Lemma 2.14.4, we obtain the following result on the determinant of a matrix
having a null row or a null column.
Corollary 2.14.5. If one or more rows (or columns) of an N N matrix A are null, then jAj D 0.
Proof. Suppose that the i th row of A is null, and let B represent an N  N matrix formed from
A by multiplying every element of the i th row of A by 0. Then, A D B, and we find that
jAj D jBj D 0jAj D 0: Q.E.D.

The following corollary (of Lemma 2.14.4) relates the determinant of a scalar multiple of a matrix
A to that of A itself.
Corollary 2.14.6. For any N  N matrix A and any scalar k,
jkAj D k N jAj: (14.11)

Proof. This result follows from Lemma 2.14.4 upon observing that kA can be formed from A
by successively multiplying the N rows of A by k. Q.E.D.
As a special case of Corollary 2.14.6, we have the following, additional corollary.
Corollary 2.14.7. For any N × N matrix A,
|−A| = (−1)^N |A|.    (14.12)

The following two theorems describe how the determinant of a matrix is affected by permuting
its rows or columns in certain ways.
Theorem 2.14.8. If an N × N matrix B = {bij} is formed from an N × N matrix A = {aij} by interchanging two rows or two columns of A, then
|B| = −|A|.

Proof. Consider first the case where B is formed from A by interchanging two adjacent rows, say the ith and (i+1)th rows. Then,
|B| = Σ (−1)^{φ_N(j1, j2, ..., jN)} b_{1j1} b_{2j2} ··· b_{i−1,j_{i−1}} b_{i,j_i} b_{i+1,j_{i+1}} b_{i+2,j_{i+2}} ··· b_{N,jN}
    = Σ (−1)^{φ_N(j1, j2, ..., jN)} a_{1j1} a_{2j2} ··· a_{i−1,j_{i−1}} a_{i+1,j_i} a_{i,j_{i+1}} a_{i+2,j_{i+2}} ··· a_{N,jN}
    = −Σ (−1)^{φ_N(j1, j2, ..., j_{i−1}, j_{i+1}, j_i, j_{i+2}, ..., jN)} a_{1j1} a_{2j2} ··· a_{i−1,j_{i−1}} a_{i,j_{i+1}} a_{i+1,j_i} a_{i+2,j_{i+2}} ··· a_{N,jN}
[since φ_N(j1, j2, ..., j_{i−1}, j_{i+1}, j_i, j_{i+2}, ..., jN) equals φ_N(j1, j2, ..., jN) + 1 if j_{i+1} > j_i and equals φ_N(j1, j2, ..., jN) − 1 if j_{i+1} < j_i]
    = −|A|,
where j1 ; j2 ; : : : ; jN (and hence j1 ; j2 ; : : : ; ji 1 ; ji C1 ; ji ; ji C2 ; : : : ; jN ) is a permutation of the


first N positive integers and the summation is over all such permutations.
Consider now the case where B is formed from A by interchanging two not-necessarily-adjacent
rows, say the i th and kth rows where k > i . Suppose that we successively interchange the kth
row of A with the k i rows immediately preceding it, putting the N rows of A in the order
1; 2; : : : ; i 1; k; i; i C 1; : : : ; k 1; k C 1; : : : ; N. Suppose that we then further reorder the rows of
A by successively interchanging what was originally the i th row with the k i 1 rows immediately
succeeding it, putting the N rows in the order 1; 2; : : : ; i 1; k; i C1; : : : ; k 1; i; k C1; : : : ; N. Thus,
by executing 2.k i / 1 successive interchanges of adjacent rows, we have in effect interchanged the
i th and kth rows of A. Since each interchange of adjacent rows changes the sign of the determinant,
we conclude that
|B| = (−1)^{2(k−i)−1}|A| = −|A|.
By employing an analogous argument, we find that the interchange of any two columns of A
likewise changes the sign of the determinant. Q.E.D.
Theorem 2.14.9. If B is an N × P matrix (where P < N) and C an N × Q matrix (where Q = N − P), then
|B, C| = (−1)^{PQ}|C, B|.    (14.13)
Similarly, if B is a P × N matrix and C a Q × N matrix (with [B; C] denoting the partitioned matrix having B as its upper block and C as its lower block), then
det[B; C] = (−1)^{PQ} det[C; B].    (14.14)

Proof. Consider the case where B is N × P and C is N × Q. Let b1, b2, ..., bP represent the columns of B and c1, c2, ..., cQ the columns of C. Then, (C, B) = (c1, c2, ..., cQ, b1, b2, ..., bP).
Suppose that in the matrix .C; B/, we successively interchange the column b1 with the columns
cQ ; : : : ; c2 ; c1 , producing the matrix .b1 ; c1 ; c2 ; : : : ; cQ , b2 ; : : : ; bP /. And suppose that in the latter
matrix, we successively interchange the column b2 with the columns cQ ; : : : ; c2 ; c1 , producing the
matrix .b1 ; b2 , c1 ; c2 ; : : : ; cQ , b3 ; : : : ; bP /. Continuing in this fashion, we produce (after P steps)
the matrix .b1 ; b2 ; : : : ; bP ; c1 ; c2 ; : : : ; cQ / D .B; C/.
It is now clear that we can obtain the matrix .B; C/ from the matrix .C; B/ via a total of PQ
successive interchanges of columns. Thus, it follows from Theorem 2.14.8 that
jB; Cj D . 1/PQ jC; Bj:
Result (14.14) can be derived via an analogous approach. Q.E.D.
A (square) matrix that has one or more null rows or columns has (according to Corollary 2.14.5)
a zero determinant. Other matrices whose determinants are zero are described in the following two
lemmas.
Lemma 2.14.10. If two rows or two columns of an N  N matrix A are identical, then jAj D 0.
Proof. Suppose that two rows of A are identical, say the ith and kth rows, and let B represent a matrix formed from A by interchanging its ith and kth rows. Obviously, B = A and hence |B| = |A|. Moreover, according to Theorem 2.14.8, |B| = −|A|. Thus, |A| = |B| = −|A|, implying that |A| = 0.
That the determinant of a (square) matrix having two identical columns equals zero can be proved
via an analogous argument. Q.E.D.
Lemma 2.14.11. If a row or column of an N  N matrix A is a scalar multiple of another row
or column, then jAj D 0.
Proof. Let a01 ; a02 ; : : : ; a0N represent the rows of A. Suppose that one row is a scalar multiple
of another, that is, suppose that a0s D ka0i for some s and i (with s ¤ i ) and some scalar k. Let B
represent a matrix formed from A by multiplying the i th row of A by the scalar k. Then, according
to Lemmas 2.14.4 and 2.14.10,
kjAj D jBj D 0: (14.15)
If k ¤ 0, then it follows from equality (14.15) that jAj D 0. If k D 0, then a0s D 0, and it follows
from Corollary 2.14.5 that jAj D 0. Thus, in either case, jAj D 0.
An analogous argument shows that if one column of a (square) matrix is a scalar multiple of
another, then again the determinant of the matrix equals zero. Q.E.D.
The transposition of a (square) matrix does not (according to Lemma 2.14.3) affect its determi-
nant. Other operations that do not affect the determinant of a matrix are described in the following
two theorems.
Theorem 2.14.12. Let B represent a matrix formed from an N  N matrix A by adding, to any
one row or column of A, scalar multiples of one or more other rows or columns. Then, jBj D jAj.
Proof. Let a0i D .ai1 ; ai 2 ; : : : ; aiN / and b0i D .bi1 ; bi 2 ; : : : ; biN / represent the i th rows of A
and B, respectively (i D 1; 2; : : : ; N ). And suppose that for some integer s (1  s  N ) and some
scalars k1 ; k2 ; : : : ; ks 1 ; ksC1 ; : : : ; kN ,
X
b0s D a0s C ki a0i and b0i D a0i .i ¤ s/:
Then, i ¤s
X
jBj D . 1/N .j1 ;j2 ;:::;jN / b1j1 b2j2    bNjN
X  X 
D . 1/N .j1 ;j2 ;:::;jN / a1j1 a2j2    as 1;js 1
asjs C ki aijs asC1;jsC1    aNjN
XX i ¤s
D jAj C . 1/N .j1 ;j2 ;:::;jN / a1j1 a2j2    as 1;js 1
.ki aijs / asC1;jsC1    aNjN
i ¤s
X
D jAj C jBi j;
i ¤s

where Bi is a matrix formed from A by replacing the sth row of A with ki a0i and where j1 ; j2 ; : : : ; jN
is a permutation of the first N positive integers and the (unlabeled) summations are over all such
permutations. Since (according to Lemma 2.14.11) jBi j D 0 (i ¤ s), we conclude that jBj D jAj.
An analogous argument shows that jBj D jAj when B is formed from A by adding, to a column
of A, scalar multiples of other columns. Q.E.D.
Theorem 2.14.13. For any N  N matrix A and any N  N unit (upper or lower) triangular
matrix T ,
jAT j D jT Aj D jAj: (14.16)
Proof. Consider the case where A is postmultiplied by T and T is unit lower triangular. Define
Ti to be a matrix formed from IN by replacing the i th column of IN with the i th column of T
(i D 1; 2; : : : ; N ). Then, T D T1 T2    TN (as is easily verified), and consequently

AT D AT1 T2    TN :

Now, define B0 D A, and Bi D AT1 T2    Ti (i D 1; 2; : : : ; N 1). Clearly, to show that


jAT j D jAj, it suffices to show that, for i D 1; 2; : : : ; N , the postmultiplication of Bi 1 by Ti
does not alter the determinant of Bi 1 . Observe that the columns of Bi 1 Ti are the same as those
of Bi 1 , except for the i th column of Bi 1 Ti , which consists of the i th column of Bi 1 plus scalar
multiples of the .i C 1/, : : : ; N th columns of Bi 1 . Thus, it follows from Theorem 2.14.12 that
jBi 1 Ti j D jBi 1 j. We conclude that jAT j D jAj.
The validity of the parts of result (14.16) that pertain to the postmultiplication of A by a unit
upper triangular matrix and the premultiplication of A by a unit upper or lower triangular matrix can
be established via similar arguments. Q.E.D.
c. Determinants of block-triangular matrices


Formula (14.7) for the determinant of a triangular matrix can be extended to a block-triangular matrix
based on the following theorem.
Theorem 2.14.14. Let T represent an M × M matrix, V an N × M matrix, and W an N × N matrix. Then,
det[T 0; V W] = det[W V; 0 T] = |T||W|.    (14.17)
Proof. Let
A = [T 0; V W],
and let aij represent the ijth element of A (i, j = 1, 2, ..., M+N). Further, denote by tij the ijth element of T (i, j = 1, 2, ..., M) and by wij the ijth element of W (i, j = 1, 2, ..., N).
By definition,
|A| = Σ (−1)^{φ_{M+N}(j1, ..., jM, j_{M+1}, ..., j_{M+N})} a_{1j1} ··· a_{M,jM} a_{M+1,j_{M+1}} ··· a_{M+N,j_{M+N}},    (14.18)
where j1, ..., jM, j_{M+1}, ..., j_{M+N} is a permutation of the first M+N positive integers and the summation is over all such permutations. Clearly, the only terms of the sum (14.18) that can be nonzero are those for which j1, ..., jM constitutes a permutation of the first M positive integers and thus for which j_{M+1}, ..., j_{M+N} constitutes a permutation of the integers M+1, ..., M+N. For any such permutation, we have that
a_{1j1} ··· a_{M,jM} a_{M+1,j_{M+1}} ··· a_{M+N,j_{M+N}} = t_{1j1} ··· t_{M,jM} w_{1, j_{M+1}−M} ··· w_{N, j_{M+N}−M} = t_{1j1} ··· t_{M,jM} w_{1k1} ··· w_{NkN},
where k1 = j_{M+1} − M, ..., kN = j_{M+N} − M, and we also have that
φ_{M+N}(j1, ..., jM, j_{M+1}, ..., j_{M+N}) = φ_M(j1, ..., jM) + φ_N(j_{M+1}, ..., j_{M+N}) = φ_M(j1, ..., jM) + φ_N(j_{M+1}−M, ..., j_{M+N}−M) = φ_M(j1, ..., jM) + φ_N(k1, ..., kN).
Thus,
|A| = Σ Σ (−1)^{φ_M(j1,...,jM) + φ_N(k1,...,kN)} t_{1j1} ··· t_{M,jM} w_{1k1} ··· w_{NkN}
    = [Σ (−1)^{φ_M(j1,...,jM)} t_{1j1} ··· t_{M,jM}] [Σ (−1)^{φ_N(k1,...,kN)} w_{1k1} ··· w_{NkN}]
    = |T||W|,
where j1, ..., jM is a permutation of the first M positive integers and k1, ..., kN a permutation of the first N positive integers and where the respective summations are over all such permutations.
That det[W V; 0 T] = |T||W| can be established via a similar argument. Q.E.D.
The repeated application of Theorem 2.14.14 leads to the following formulas for the determinant of an arbitrary (square) upper or lower block-triangular matrix (with square diagonal blocks):
det[A11 A12 ... A1R; 0 A22 ... A2R; ... ; 0 0 ... ARR] = |A11||A22| ··· |ARR|;    (14.19)
det[B11 0 ... 0; B21 B22 ... 0; ... ; BR1 BR2 ... BRR] = |B11||B22| ··· |BRR|.    (14.20)
In the special case of a block-diagonal matrix, formula (14.19) becomes
|diag(A11, A22, ..., ARR)| = |A11||A22| ··· |ARR|.    (14.21)

Formulas (14.19), (14.20), and (14.21) generalize the results of Lemma 2.14.1 and Corollary 2.14.2
on the determinants of triangular and diagonal matrices.
As an immediate consequence of Theorem 2.14.9, we have the following corollary of Theorem
2.14.14.
Corollary 2.14.15. Let T represent an M × M matrix, V an N × M matrix, and W an N × N matrix. Then,
det[0 T; W V] = det[V W; T 0] = (−1)^{MN}|T||W|.    (14.22)

The following corollary gives a simplified version of formula (14.22) for the special case where M = N and T = −I_N.
Corollary 2.14.16. For N × N matrices W and V,
det[0 −I_N; W V] = det[V W; −I_N 0] = |W|.    (14.23)
Proof (of Corollary 2.14.16). Corollary 2.14.16 can be derived from the special case of Corollary 2.14.15 where M = N and T = −I_N by observing that
(−1)^{NN}|−I_N||W| = (−1)^{NN}(−1)^N|W| = (−1)^{N(N+1)}|W|
and that either N or N+1 is an even number and consequently N(N+1) is an even number. Q.E.D.

d. Determinants of matrix products and inverses


By using Theorems 2.14.14 and 2.14.13 and Corollary 2.14.16, we find that for N × N matrices A and B,
|A||B| = det[A 0; −I B] = det([A 0; −I B][I B; 0 I]) = det[A AB; −I 0] = |AB|,
thereby establishing the following, very important result.
Theorem 2.14.17. For N  N matrices A and B,

jABj D jAjjBj: (14.24)

The repeated application of Theorem 2.14.17 leads to the following formula for the determinant
of the product of an arbitrary number of N  N matrices A1 ; A2 ; : : : ; AK :

jA1 A2    AK j D jA1 jjA2 j    jAK j: (14.25)

As a special case of this formula, we obtain the following formula for the determinant of the kth
power of an N  N matrix A:
jAk j D jAj k (14.26)
(k D 2; 3; : : : ).
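A quick numerical check of results (14.24) and (14.26) (our own illustration, with arbitrary matrices and NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# (14.24): the determinant of a product is the product of the determinants
print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))
# (14.26) with k = 3: the determinant of a power is the power of the determinant
print(np.isclose(np.linalg.det(A @ A @ A), np.linalg.det(A) ** 3))
```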
In light of Lemma 2.14.3, we have the following corollary of Theorem 2.14.17.
Corollary 2.14.18. For any N  N matrix A,
jA0 Aj D jAj2: (14.27)

Corollary 2.14.18 gives rise to the following result on the determinant of an orthogonal matrix.
Corollary 2.14.19. For any orthogonal matrix P ,


jP j D ˙1:

Proof (of Corollary 2.14.19). Using Corollary 2.14.18 [and result (14.9)], we find that

jP j2 D jP 0 P j D jIj D 1: Q.E.D.

Having established Theorem 2.14.17, we are now in a position to prove the following result on
the nonsingularity of a matrix and on the determinant of an inverse matrix.
Theorem 2.14.20. Let A represent an N  N matrix. Then, A is nonsingular (or, equivalently,
A is invertible) if and only if jAj ¤ 0, in which case
jA 1j D 1=jAj: (14.28)

Proof. It suffices to show that if A is nonsingular, then jAj ¤ 0 and jA 1j D 1=jAj and that if
A is singular, then jAj D 0.
Suppose that A is nonsingular. Then, according to Theorem 2.14.17 [and result (14.9)],
jA 1jjAj D jA 1Aj D jIj D 1;
implying that jAj ¤ 0 and further that jA 1j D 1=jAj.
Alternatively, suppose that A is singular. Then, some column of A, say the sth column a_s, can be expressed as a linear combination of the other N−1 columns a_1, a_2, ..., a_{s−1}, a_{s+1}, ..., a_N; that is, a_s = Σ_{i≠s} k_i a_i for some scalars k_1, k_2, ..., k_{s−1}, k_{s+1}, ..., k_N. Now, let B represent a matrix formed from A by adding the vector −Σ_{i≠s} k_i a_i to the sth column of A. Clearly, the sth column of B is null, and it follows from Corollary 2.14.5 that |B| = 0. And it follows from Theorem 2.14.12 that |A| = |B|. Thus, |A| = 0. Q.E.D.
Let A represent an N  N symmetric positive definite matrix. Then, according to Corollary
2.13.29, there exists a nonsingular matrix P such that A D P 0 P . Thus, making use of Corollary
2.14.18 and observing (in light of Theorem 2.14.20) that jP j ¤ 0, we find that
jAj D jP 0 P j D jP j2 > 0:
Moreover, the determinant of a symmetric positive semidefinite matrix equals 0, as is evident from
Theorem 2.14.20 upon recalling (from Corollary 2.13.28) that a symmetric positive semidefinite
matrix is singular. Accordingly, we have the following lemma.
Lemma 2.14.21. The determinant of a symmetric positive definite matrix is positive; the deter-
minant of a symmetric positive semidefinite matrix equals 0.

e. Determinants of partitioned matrices


The following theorem gives formulas for the determinant of a partitioned matrix that are analogous
to formulas (6.10) and (6.11) for the inverse of a partitioned matrix.
Theorem 2.14.22. Let T represent an M × M matrix, U an M × N matrix, V an N × M matrix, and W an N × N matrix. If T is nonsingular, then
det[T U; V W] = det[W V; U T] = |T||W − VT⁻¹U|.    (14.29)
Proof. Suppose that T is nonsingular. Then,
[T U; V W] = [I 0; VT⁻¹ W − VT⁻¹U] [T U; 0 I].
Applying Theorems 2.14.17 and 2.14.14 and result (14.9), we find that
det[T U; V W] = |T||W − VT⁻¹U|.
That det[W V; U T] = |T||W − VT⁻¹U| can be proved in similar fashion. Q.E.D.
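Formula (14.29) is easy to verify numerically; the following sketch (our own, with randomly generated blocks and NumPy assumed) compares the two sides for a partitioned matrix whose leading block T is nonsingular:

```python
import numpy as np

rng = np.random.default_rng(1)
T = rng.standard_normal((3, 3))          # a random 3 x 3 block; nonsingular in practice
U = rng.standard_normal((3, 2))
V = rng.standard_normal((2, 3))
W = rng.standard_normal((2, 2))

A = np.block([[T, U], [V, W]])
lhs = np.linalg.det(A)
rhs = np.linalg.det(T) * np.linalg.det(W - V @ np.linalg.inv(T) @ U)
print(np.isclose(lhs, rhs))              # True
```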

f. A necessary and sufficient condition for the positive definiteness of a symmetric matrix
Whether or not a symmetric matrix is positive definite can be ascertained from the determinants of
its leading principal submatrices. The following theorem provides the basis for doing so.
Theorem 2.14.23. Let A = {aij} represent an N × N symmetric matrix, and, for k = 1, 2, ..., N, let A_k represent the leading principal submatrix of A of order k (i.e., the principal submatrix obtained by striking out the last N−k rows and columns). Then, A is positive definite if and only if, for k = 1, 2, ..., N, det(A_k) > 0, that is, if and only if the determinants of all N of the leading principal submatrices A_1, A_2, ..., A_N of A are positive.
In proving Theorem 2.14.23, it is convenient to make use of the following result, which is of
some interest in its own right.
Lemma 2.14.24. Let A represent an N × N symmetric matrix (where N ≥ 2), and partition A as
A = [A* a; a′ c],
where the dimensions of A* are (N−1) × (N−1). Then, A is positive definite if and only if A* is positive definite and |A| > 0.
Proof (of Lemma 2.14.24). If A is positive definite, then it is clear from Corollary 2.13.13 that A* is positive definite and from Lemma 2.14.21 that |A| > 0.
Conversely, suppose that A* is positive definite (and hence nonsingular) and that |A| > 0. Then, according to Theorem 2.14.22,
|A| = |A*|(c − a′A*⁻¹a).
Since (according to Lemma 2.14.21) |A*| > 0 (and since |A| > 0), we conclude that the Schur complement c − a′A*⁻¹a of A* (like A* itself) is positive definite and hence (in light of Corollary 2.13.33) that A is positive definite. Q.E.D.
Proof (of Theorem 2.14.23). That the determinants of A_1, A_2, ..., A_N are positive if A is positive definite is an immediate consequence of Corollary 2.13.13 and Lemma 2.14.21.
For purposes of proving the converse, suppose that the determinants of A_1, A_2, ..., A_N are positive. The proof consists of establishing, via a mathematical induction argument, that A_1, A_2, ..., A_N are positive definite, which (since A = A_N) implies in particular that A is positive definite.
Clearly, A_1 is positive definite. Suppose now that A_{k−1} is positive definite (where 2 ≤ k ≤ N), and partition A_k as
A_k = [A_{k−1} a_k; a_k′ a_kk],
where a_k = (a_{1k}, a_{2k}, ..., a_{k−1,k})′. Since A_{k−1} is (by supposition) positive definite (and since |A_k| > 0), it follows from Lemma 2.14.24 that A_k is positive definite.
We conclude on the basis of the induction argument that A_1, A_2, ..., A_N are positive definite and that A in particular is positive definite. Q.E.D.
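Theorem 2.14.23 translates directly into a simple computational test for the positive definiteness of a symmetric matrix. The sketch below (our own illustration; NumPy is assumed, and the function name is ours) checks that every leading principal minor is positive:

```python
import numpy as np

def is_positive_definite(A):
    """Check det(A_k) > 0 for every leading principal submatrix A_k
    (Theorem 2.14.23); A is assumed to be symmetric."""
    A = np.asarray(A, dtype=float)
    N = A.shape[0]
    return all(np.linalg.det(A[:k, :k]) > 0 for k in range(1, N + 1))

A = np.array([[2.0, 1.0, 1.0],
              [1.0, 2.0, 0.0],
              [1.0, 0.0, 1.0]])
print(is_positive_definite(A))   # True: the leading minors are 2, 3, and 1
```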
Exercises
Exercise 1. Let A represent an M  N matrix and B an N  M matrix. Can the value of A C B0
be determined from the value of A0 C B (in the absence of any other information about A and B)?
Describe your reasoning.
Exercise 2. Show that for any M N matrix A D faij g and N P matrix B D fbij g, .AB/0 D B0 A0
[thereby verifying result (1.13)].
Exercise 3. Let A D faij g and B D fbij g represent N  N symmetric matrices.
(a) Show that in the special case where N = 2, AB is symmetric if and only if b12(a11 − a22) = a12(b11 − b22).
(b) Give a numerical example where AB is nonsymmetric.
(c) Show that A and B commute if and only if AB is symmetric.

Exercise 4. Let A represent an M  N partitioned matrix comprising R rows and U columns


of blocks, the ij th of which is an Mi  Nj matrix Aij that (for some scalar cij ) is expressible as
Aij D cij 1Mi 10Nj (a scalar multiple of an Mi Nj matrix of 1’s). Similarly, let B represent an N Q
partitioned matrix comprising U rows and V columns of blocks, the ij th of which is an Ni  Qj
matrix Bij that (for some scalar dij ) is expressible as Bij D dij 1Ni 10Qj . Obtain (in as simple form as
possible) the conditions that must be satisfied by the scalars cij (i D 1; 2; : : : ; R; j D 1; 2; : : : ; U )
and dij (i D 1; 2; : : : ; U ; j D 1; 2; : : : ; V ) in order for AB to equal a null matrix.
Exercise 5. Show that for any M  N matrix A and N  M matrix B,
tr.AB/ D tr.A0 B0 /:

Exercise 6. Show that for any M  N matrix A, N  P matrix B, and P  M matrix C,


tr.ABC/ D tr.CAB/ D tr.BCA/
(i.e., the cyclic permutation of the 3 matrices in the product ABC does not affect the trace of the
product).
Exercise 7. Let A, B, and C represent square matrices of order N.
(a) Using the result of Exercise 5 (or otherwise), show that if A, B, and C are symmetric, then
tr.ABC/ D tr.BAC/.
(b) Show that [aside from special cases like that considered in Part (a)] tr.BAC/ is not necessarily
equal to tr.ABC/.

Exercise 8. Which of the following sets are linear spaces: (1) the set of all N  N diagonal matrices,
(2) the set of all N  N upper triangular matrices, and (3) the set of all N  N nonsymmetric
matrices?
Exercise 9. Define 0 1
1 2 1 0
A D @2 1 1 1A;
1 1 2 1
and (for i D 1; 2; 3) let a0i represent the i th row of A.
(a) Show that the set fa01 ; a02 g is a basis for R.A/.
(b) Find rank.A/.


(c) Making use of the answer to Part (b) (or otherwise), find a basis for C.A/.

Exercise 10. Let A1 ; A2 ; : : : ; AK represent matrices in a linear space V, and let U represent a
subspace of V. Show that sp.A1 ; A2 ; : : : ; AK /  U if and only if A1 ; A2 ; : : : ; AK are contained in
U (thereby establishing what is essentially a generalization of Lemma 2.4.2).
Exercise 11. Let V represent a K-dimensional linear space of M × N matrices (where K ≥ 1). Further, let {A1, A2, ..., AK} represent a basis for V, and, for arbitrary scalars x1, x2, ..., xK and y1, y2, ..., yK, define A = Σ_{i=1}^{K} x_i A_i and B = Σ_{j=1}^{K} y_j A_j. Show that
A · B = Σ_{i=1}^{K} x_i y_i
for all choices of x1, x2, ..., xK and y1, y2, ..., yK if and only if the basis {A1, A2, ..., AK} is orthonormal.
Exercise 12. An N  N matrix A is said to be involutory if A2 D I, that is, if A is invertible and is
its own inverse.
(a) Show that an N × N matrix A is involutory if and only if (I − A)(I + A) = 0.
(b) Show that a 2 × 2 matrix A = [a b; c d] is involutory if and only if (1) a² + bc = 1 and d = −a or (2) b = c = 0 and d = a = ±1.

Exercise 13. Let A D faij g represent an M  N matrix of full row rank.


(a) Show that in the special case M D 1 (i.e., in the special case where A is an N -dimensional row
vector), there exists an N -dimensional column vector b, N 1 elements of which are 0, that is
a right inverse of A.
(b) Generalize from Part (a) (to an arbitrary value of M ) by showing that there exists an N  M
matrix B, N M rows of which are null vectors, that is a right inverse of A.

Exercise 14. Provide an alternative verification of equality (6.10) by premultiplying or postmultiplying the right side of the equality by [T U; V W] and by confirming that the resultant product equals I_{M+N}.
Exercise 15. Let A = [2 0 4; 3 5 6; 4 2 12]. Use the results of Section 2.6 to show that A is nonsingular and to obtain A⁻¹. (Hint. Partition A as A = [T U; V W], where T is a square matrix of order 2.)
Exercise 16. Let T D ftij g represent an N  N triangular matrix. Show that if T is orthogonal, then
T is diagonal. If T is orthogonal, what can be inferred about the values of the diagonal elements
t11 ; t22 ; : : : ; tNN of T ?
Exercise 17. Let A represent an N  N matrix. Show that for any N  N nonsingular matrix B,
B 1AB is idempotent if and only if A is idempotent.
Exercise 18. Let x = {xi} and y = {yi} represent nonnull N-dimensional column vectors. Show that xy′ is a scalar multiple of an idempotent matrix (i.e., that xy′ = cA for some scalar c and some idempotent matrix A) if and only if Σ_{i=1}^{N} x_i y_i ≠ 0 (i.e., if and only if x and y are not orthogonal with respect to the usual inner product).
Exercise 19. Let A represent a 4  N matrix of rank 2, and take b D fbi g to be a 4-dimensional
column vector. Suppose that b1 D 1 and b2 D 0 and that two of the N columns of A are the vectors
a1 D .5; 4; 3; 1/0 and a2 D .1; 2; 0; 1/0 . Determine for which values of b3 and b4 the linear
system Ax D b (in x) is consistent.
Exercise 20. Let A represent an M  N matrix. Show that for any generalized inverses G1 and G2
of A and for any scalars w1 and w2 such that w1 C w2 D 1, the linear combination w1 G1 C w2 G2
is a generalized inverse of A.
Exercise 21. Let A represent an N  N matrix.
(a) Using the result of Exercise 20 in combination with Corollary 2.10.11 (or otherwise), show that
if A is symmetric, then A has a symmetric generalized inverse.
(b) Show that if A is singular (i.e., of rank less than N ) and if N > 1, then (even if A is symmetric) A
has a nonsymmetric generalized inverse. (Hint. Make use of the second part of Theorem 2.10.7.)

Exercise 22. Let A represent an M  N matrix of rank N 1. And let x represent any nonnull vector
in N.A/, that is, any N -dimensional nonnull column vector such that Ax D 0. Show that a matrix
Z is a solution to the homogeneous linear system AZ D 0 (in an N  P matrix Z) if and only if
Z D xk0 for some P -dimensional row vector k0.
Exercise 23. Suppose that AX D B is a consistent linear system (in an N  P matrix X).
(a) Show that if rank.A/ D N or rank.B/ D P , then, corresponding to any solution X to AX D B,
there is a generalized inverse G of A such that X D GB.
(b) Show that if rank.A/ < N and rank.B/ < P , then there exists a solution X to AX D B such
that there is no generalized inverse G of A for which X D GB.

Exercise 24. Show that a matrix A is symmetric and idempotent if and only if there exists a matrix
X such that A D PX .
Exercise 25. Show that corresponding to any quadratic form x0 Ax (in an N -dimensional vector x),
there exists a unique lower triangular matrix B such that x0 Ax and x0 Bx are identically equal, and
express the elements of B in terms of the elements of A.
Exercise 26. Show, via an example, that the sum of two positive semidefinite matrices can be positive
definite.
Exercise 27. Let A represent an N  N symmetric nonnegative definite matrix (where N  2).
Define A0 D A, and, for k D 1; 2; : : : ; N 1, take Qk to be an .N k C 1/  .N k C 1/ unit
upper triangular matrix, Ak an .N k/  .N k/ matrix, and dk a scalar that satisfy the recursive
relationship
Q0k Ak 1 Qk D diag.dk ; Ak / (E.1)
—Qk , Ak , and dk can be constructed by making use of Lemma 2.13.19 and by proceeding as in the
proof of Theorem 2.13.20.
(a) Indicate how Q1 ; Q2 ; : : : ; QN 1 ; A1 ; A2 ; : : : ; AN 1 , and d1 ; d2 ; : : : ; dN 1 could be used to
form an N  N unit upper triangular matrix Q and a diagonal matrix D such that Q0 AQ D D.
0 1
2 0 0 0
B0 4 2 4C
(b) Taking A D B @0
C (which is a symmetric nonnegative definite matrix), determine
2 1 2A
0 4 2 7
unit upper triangular matrices Q1 , Q2 , and Q3 , matrices A1 , A2 , and A3 , and scalars d1 , d2 , and
d3 that satisfy the recursive relationship (E.1), and illustrate the procedure devised in response
to Part (a) by using it to find a 4  4 unit upper triangular matrix Q and a diagonal matrix D such
that Q0 AQ D D.

Exercise 28. Let A = {aij} represent an N × N symmetric positive definite matrix, and let B = {bij} = A⁻¹. Show that, for i = 1, 2, ..., N,
b_ii ≥ 1/a_ii,
with equality holding if and only if a_ij = 0 for all j ≠ i.
Exercise 29. Let 0 1
a11 a12 a13 a14
B a21 a22 a23 a24 C
ADB C:
B C
@ a31 a32 a33 a34 A
a41 a42 a43 a44
(a) Write out all of the pairs that can be formed from the four “boxed” elements of A.
(b) Indicate which of the pairs from Part (a) are positive and which are negative.
(c) Use formula (14.1) to compute the number of pairs from Part (a) that are negative, and check
that the result of this computation is consistent with your answer to Part (b).

Exercise 30. Obtain (in as simple form as possible) an expression for the determinant of each of the
following two matrices: (1) an N  N matrix A D faij g of the general form
0 1
0 ::: 0 0 a1N
B 0 ::: 0 a2;N 1 a2N C
B C
B 0 a a3;N 1 a3N C
ADB 3;N 2
:: :: :: C
C
B
@ : : : A
aN1 : : : aN;N 2 aN;N 1 aNN

(where aij D 0 for j D 1; 2; : : : ; N i ; i D 1; 2; : : : ; N 1); (2) an N  N matrix B D fbij g of


the general form 0 1
0 1 0 ::: 0
B
B 0 0 1 0 C
C
BDB
B :: :: :: C
B : : : C
C
@ 0 0 0 1 A
k0 k1 k2 : : : kN 1

—a matrix of this general form is called a companion matrix.


Exercise 31. Verify the part of result (14.16) that pertains to the postmultiplication of a matrix by
a unit upper triangular matrix by showing that for any N  N matrix A and any N  N unit upper
triangular matrix T , jAT j D jAj.
Exercise 32. Show that for any N  N matrix A and any N  N nonsingular matrix C,
1
jC ACj D jAj:
 
Exercise 33. Let A = [a b; c d], where a, b, c, and d are scalars.
(a) Show that in the special case where A is symmetric (i.e., where c = b), A is nonnegative definite if and only if a ≥ 0, d ≥ 0, and |b| ≤ √(ad) and is positive definite if and only if a > 0, d > 0, and |b| < √(ad).
(b) Extend the result of Part (a) by showing that in the general case where A is not necessarily symmetric (i.e., where possibly c ≠ b), A is nonnegative definite if and only if a ≥ 0, d ≥ 0, and |b + c|/2 ≤ √(ad) and is positive definite if and only if a > 0, d > 0, and |b + c|/2 < √(ad). [Hint. Take advantage of the result of Part (a).]

Exercise 34. Let A = {a_ij} represent an N × N symmetric matrix. And suppose that A is nonnegative definite (in which case its diagonal elements are nonnegative). By, for example, making use of the result of Part (a) of Exercise 33, show that, for j ≠ i = 1, 2, ..., N,
\[ |a_{ij}| \le \sqrt{a_{ii}a_{jj}} \le \max(a_{ii}, a_{jj}), \]
with |a_ij| < √(a_ii a_jj) if A is positive definite.
Exercise 35. Let A = {a_ij} represent an N × N symmetric positive definite matrix. Show that
\[ \det A \le \prod_{i=1}^{N} a_{ii}, \]
with equality holding if and only if A is diagonal.
Bibliographic and Supplementary Notes


Much of what is presented in Chapter 2 is taken from Chapters 1–14 in Harville’s (1997) book, Matrix Algebra
from a Statistician’s Perspective, which provides more extensive coverage of the same topics.
§4. What is referred to herein as a linear space is a special case of what is known as a finite-dimensional
vector space. A classical reference on that topic is Halmos’s (1958) book, Finite-Dimensional Vector Spaces.
The term linear space is used in lieu of the somewhat more common term vector space. At least in the present
setting, this usage is advantageous (especially for less mathematically sophisticated readers). It avoids a dual
use of the term vector, in which that term is used at times to refer to a member of a linear space of M × N
matrices and at times to specify a matrix having a single row or column.
§9 and §11. No attempt is made to discuss the computational aspects of solving a linear system. A classical
reference on that topic (and on related topics) is Golub and Van Loan’s (2013) book, Matrix Computations.
Another highly regarded source of information on computational issues is Trefethen and Bau’s (1997) book,
Numerical Linear Algebra. A source that emphasizes those computational issues that are highly relevant to
statistical applications is Gentle’s (1998) book, Numerical Linear Algebra for Applications in Statistics.
§13. The usage herein of the terms nonnegative definite, positive definite, and positive semidefinite differs
somewhat from that employed in various other presentations. In particular, these terms are applied to both sym-
metric and nonsymmetric matrices, whereas in many other presentations their application to matrices is confined
to symmetric matrices. Moreover, the term positive semidefinite is used in a way that, while not uncommon, is
at odds with its use in some other presentations. In some presentations, the term positive semidefinite is used in
the same way that nonnegative definite is used herein.
3
Random Vectors and Matrices

In working with linear models, knowledge of basic results on the distribution of random variables is
essential. Of particular relevance are various results on expected values and on variances and covari-
ances. Also of relevance are results that pertain to conditional distributions and to the multivariate
normal distribution.
In working with a large number (or even a modest number) of random variables, the use of matrix
notation can be extremely helpful. In particular, formulas for the expected values and the variances
and covariances of linear combinations of random variables can be expressed very concisely in
matrix notation. The use of matrix notation is facilitated by the arrangement of random variables in
the form of a vector or a matrix. A random (row or column) vector is a (row or column) vector whose
elements are (jointly distributed) random variables. More generally, a random matrix is a matrix
whose elements are (jointly distributed) random variables.

3.1 Expected Values


The expected value of a random variable x is denoted by the symbol E(x). The expected value of a nonnegative random variable is well-defined, but not necessarily finite. The expected value E(x) of a nonnegative random variable x is said to exist (or x to be integrable) if E(x) < ∞. More generally, the expected value E(x) of an arbitrary random variable x is said to exist (or x to be integrable) if E(|x|) < ∞, in which case E(x) is well-defined and finite. Unless otherwise indicated, results involving the expected values of random variables are to be regarded as including an implicit assumption and/or (depending on the context) claim that the expected values exist.
The expected value E(x) of a random variable x and the existence or nonexistence of E(x) are characteristics of the distribution of x. Accordingly, if two random variables x and y have the same distribution, then either E(x) and E(y) both exist and are equal or neither E(x) nor E(y) exists. In that regard, it is worth noting that if two random variables x and y (defined on the same probability space) are equal with probability 1 (i.e., are equal except possibly on a set of probability 0), then they have the same distribution.
A random variable x (or its distribution) is said to be discrete if there exists a finite or countably infinite set of distinct values x_1, x_2, x_3, ... of x such that Σ_i Pr(x = x_i) = 1, in which case
\[ E[g(x)] = \sum_i g(x_i)\,\Pr(x = x_i) \]
for “any” function g(x) of x. More generally, a random vector x (or its distribution) is said to be discrete if there exists a finite or countably infinite set of distinct values x_1, x_2, x_3, ... of x such that Σ_i Pr(x = x_i) = 1, in which case
\[ E[g(x)] = \sum_i g(x_i)\,\Pr(x = x_i) \tag{1.1} \]
for “any” function g(x) of x.
A random variable x (or its distribution) is said to be absolutely continuous if there exists a nonnegative function f(x) of x, called a probability density function, such that, for an “arbitrary” set A of real numbers, Pr(x ∈ A) = ∫_A f(s) ds, in which case
\[ E[g(x)] = \int_{-\infty}^{\infty} g(s) f(s)\, ds \]
for “any” function g(x) of x. More generally, an N-dimensional random vector x (or its distribution) is said to be absolutely continuous if there exists a nonnegative function f(x) of x, called a probability density function (pdf), such that, for an “arbitrary” subset A of R^N, Pr(x ∈ A) = ∫_A f(s) ds, in which case
\[ E[g(x)] = \int_{\mathbb{R}^N} g(s) f(s)\, ds \tag{1.2} \]
for “any” function g(x) of x.
If x is a random vector and g(x) “any” function of x that is nonnegative [in the sense that g(x) ≥ 0 for every value of x] or is nonnegative with probability 1 [in the sense that for some set A of x-values for which Pr(x ∈ A) = 1, g(x) ≥ 0 for every value of x in A], then
\[ E[g(x)] = 0 \;\Leftrightarrow\; g(x) = 0 \text{ with probability 1.} \tag{1.3} \]
By definition, two random vectors, say x and y, are statistically independent if for “every” set A (of x-values) and “every” set B (of y-values),
\[ \Pr(x \in A,\, y \in B) = \Pr(x \in A)\,\Pr(y \in B). \]
If x and y are statistically independent, then for “any” function f(x) of x and “any” function g(y) of y [for which E[f(x)] and E[g(y)] exist],
\[ E[f(x)g(y)] = E[f(x)]\,E[g(y)] \tag{1.4} \]
(e.g., Casella and Berger 2002, sec. 4.2; Parzen 1960, p. 361).
The expected value of an N-dimensional random row or column vector is the N-dimensional (respectively row or column) vector whose ith element is the expected value of the ith element of the random vector (i = 1, 2, ..., N). More generally, the expected value of an M × N random matrix is the M × N matrix whose ijth element is the expected value of the ijth element of the random matrix (i = 1, 2, ..., M; j = 1, 2, ..., N). The expected value of a random matrix X is denoted by the symbol E(X) (and is said to exist if the expected value of every element of X exists). Thus, for an M × N random matrix X with ijth element x_ij (i = 1, 2, ..., M; j = 1, 2, ..., N),
\[ E(X) = \begin{pmatrix} E(x_{11}) & E(x_{12}) & \cdots & E(x_{1N}) \\ E(x_{21}) & E(x_{22}) & \cdots & E(x_{2N}) \\ \vdots & \vdots & & \vdots \\ E(x_{M1}) & E(x_{M2}) & \cdots & E(x_{MN}) \end{pmatrix}. \]

The expected value of a random variable x is referred to as the mean of x (or of the distribution
of x). And, similarly, the expected value of a random vector or matrix X is referred to as the mean
(or, if applicable, mean vector) of X (or of the distribution of X).
It follows from elementary properties of the expected values of random variables that for a finite number of random variables x_1, x_2, ..., x_N and for nonrandom scalars c, a_1, a_2, ..., a_N,
\[ E\Bigl(c + \sum_{j=1}^{N} a_j x_j\Bigr) = c + \sum_{j=1}^{N} a_j\,E(x_j). \tag{1.5} \]
Letting x = (x_1, x_2, ..., x_N)′ and a = (a_1, a_2, ..., a_N)′, this equality can be reexpressed in matrix notation as
\[ E(c + a'x) = c + a'E(x). \tag{1.6} \]
As a generalization of equality (1.5) or (1.6), we have that
\[ E(c + Ax) = c + A\,E(x), \tag{1.7} \]
where c is an M-dimensional nonrandom column vector, A an M × N nonrandom matrix, and x an N-dimensional random column vector. Equality (1.5) can also be generalized as follows:
\[ E\Bigl(c + \sum_{j=1}^{N} a_j x_j\Bigr) = c + \sum_{j=1}^{N} a_j\,E(x_j), \tag{1.8} \]
where x_1, x_2, ..., x_N are M-dimensional random column vectors, c is an M-dimensional nonrandom column vector, and a_1, a_2, ..., a_N are nonrandom scalars.
Equality (1.7) can be readily verified by using equality (1.5) [or equality (1.6)] to show that each element of the left side of equality (1.7) equals the corresponding element of the right side. A similar approach can be used to verify equality (1.8). Or equality (1.8) can be derived by observing that Σ_{j=1}^{N} a_j x_j = Ax, where A = (a_1 I, a_2 I, ..., a_N I) and x′ = (x_1′, x_2′, ..., x_N′), and by applying equality (1.7).
Further generalizations are possible. We have that
\[ E(C + AXK) = C + A\,E(X)\,K, \tag{1.9} \]
where C is an M × Q nonrandom matrix, A an M × N nonrandom matrix, K a P × Q nonrandom matrix, and X an N × P random matrix, and that
\[ E\Bigl(C + \sum_{j=1}^{N} a_j X_j\Bigr) = C + \sum_{j=1}^{N} a_j\,E(X_j), \tag{1.10} \]
where X_1, X_2, ..., X_N are M × P random matrices, C is an M × P nonrandom matrix, and a_1, a_2, ..., a_N are nonrandom scalars. Equalities (1.9) and (1.10) can be verified by, for instance, using equality (1.7) and/or equality (1.8) to show that each column of the left side of equality (1.9) or (1.10) equals the corresponding column of the right side. Or, upon observing that Σ_{j=1}^{N} a_j X_j = AXK, where A = (a_1 I, a_2 I, ..., a_N I), X′ = (X_1′, X_2′, ..., X_N′), and K = I, equality (1.10) can be derived from equality (1.9).
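A brief numerical illustration may help fix ideas. The following Python (NumPy) sketch estimates E(c + Ax) by simulation and compares it with c + A E(x), as in equality (1.7); the dimensions, the particular matrices, the simulated distribution of x, and the use of NumPy are assumptions made purely for the example.

```python
import numpy as np

# A minimal simulation check of equality (1.7): E(c + Ax) = c + A E(x).
# All numerical choices below are arbitrary and illustrative.
rng = np.random.default_rng(0)

M, N, reps = 3, 4, 200_000
A = rng.normal(size=(M, N))          # nonrandom M x N matrix
c = rng.normal(size=M)               # nonrandom M-dimensional vector
mean_x = rng.normal(size=N)          # E(x)

# random vectors whose mean is mean_x (centered exponentials shifted by mean_x)
x = rng.exponential(scale=1.0, size=(reps, N)) - 1.0 + mean_x
y = c + x @ A.T                      # each row is a realization of c + Ax

print(np.allclose(y.mean(axis=0), c + A @ mean_x, atol=0.02))  # approximately True
```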

3.2 Variances, Covariances, and Correlations


a. The basics: (univariate) dispersion and pairwise (statistical) dependence

Variance (and standard deviation) of a random variable. The variance of a random variable x (whose expected value exists) is (by definition) the expected value E{[x − E(x)]²} of the square of the difference between x and its expected value. The variance of x is denoted by the symbol var x or var(x). The positive square root √var(x) of the variance of x is referred to as the standard deviation of x.
If a random variable x is such that E(x²) exists [i.e., such that E(x²) < ∞], then E(x) exists and var(x) also exists (i.e., is finite). That the existence of E(x²) implies the existence of E(x) can be readily verified by making use of the inequality |x| < 1 + x². That it also implies the existence (finiteness) of var(x) becomes clear upon observing that
\[ [x - E(x)]^2 = x^2 - 2x\,E(x) + [E(x)]^2. \tag{2.1} \]
The existence of E(x²) is a necessary as well as a sufficient condition for the existence of E(x) and var(x), as is evident upon reexpressing equality (2.1) as
\[ x^2 = [x - E(x)]^2 + 2x\,E(x) - [E(x)]^2. \]
In summary, we have that
\[ E(x^2) \text{ exists} \;\Leftrightarrow\; E(x) \text{ exists and } \operatorname{var}(x) \text{ also exists.} \tag{2.2} \]
Further,
\[ \operatorname{var}(x) = E(x^2) - [E(x)]^2, \tag{2.3} \]
as can be readily verified by using formula (1.5) to evaluate expression (2.1). Also, it is worth noting that
\[ \operatorname{var}(x) = 0 \;\Leftrightarrow\; x = E(x) \text{ with probability 1.} \tag{2.4} \]
Covariance of two random variables. The covariance of two random variables x and y (whose expected values exist) is (by definition) E{[x − E(x)][y − E(y)]}. The covariance of x and y is denoted by the symbol cov(x, y). We have that
\[ \operatorname{cov}(y, x) = \operatorname{cov}(x, y) \tag{2.5} \]
and
\[ \operatorname{var}(x) = \operatorname{cov}(x, x), \tag{2.6} \]
as is evident from the very definitions of a variance and a covariance (and from an elementary property of the expected-value operator).
If two random variables x and y whose expected values exist are such that the expected value of xy also exists, then the covariance of x and y exists and
\[ \operatorname{cov}(x, y) = E(xy) - E(x)E(y), \tag{2.7} \]
as becomes clear upon observing that
\[ [x - E(x)][y - E(y)] = xy - x\,E(y) - y\,E(x) + E(x)E(y) \tag{2.8} \]
and applying formula (1.5). The existence of the expected value of xy is necessary as well as sufficient for the existence of the covariance of x and y, as is evident upon reexpressing equality (2.8) as
\[ xy = [x - E(x)][y - E(y)] + x\,E(y) + y\,E(x) - E(x)E(y). \]
Note that in the special case where y = x, formula (2.7) reduces to formula (2.3).
Some fundamental results bearing on the covariance of two random variables x and y (whose expected values exist) and on the relationship of the covariance to the variances of x and y are as follows. The covariance of x and y exists if the variances of x and y both exist, in which case
\[ |\operatorname{cov}(x, y)| \le \sqrt{\operatorname{var}(x)}\,\sqrt{\operatorname{var}(y)} \tag{2.9} \]
or, equivalently,
\[ [\operatorname{cov}(x, y)]^2 \le \operatorname{var}(x)\,\operatorname{var}(y) \tag{2.10} \]
or (also equivalently)
\[ -\sqrt{\operatorname{var}(x)}\,\sqrt{\operatorname{var}(y)} \le \operatorname{cov}(x, y) \le \sqrt{\operatorname{var}(x)}\,\sqrt{\operatorname{var}(y)}. \tag{2.11} \]
Further,
\[ \operatorname{var}(x) = 0 \text{ or } \operatorname{var}(y) = 0 \;\Rightarrow\; \operatorname{cov}(x, y) = 0, \tag{2.12} \]
so that when var(x) = 0 or var(y) = 0 or, equivalently, when x = E(x) with probability 1 or y = E(y) with probability 1, inequality (2.9) holds as an equality, both sides of which equal 0. And when var(x) > 0 and var(y) > 0, inequality (2.9) holds as the equality cov(x, y) = √var(x) √var(y) if and only if
\[ \frac{y - E(y)}{\sqrt{\operatorname{var}(y)}} = \frac{x - E(x)}{\sqrt{\operatorname{var}(x)}} \text{ with probability 1,} \]
and holds as the equality cov(x, y) = −√var(x) √var(y) if and only if
\[ \frac{y - E(y)}{\sqrt{\operatorname{var}(y)}} = -\frac{x - E(x)}{\sqrt{\operatorname{var}(x)}} \text{ with probability 1.} \]

These results (on the covariance of the random variables x and y) can be inferred from the following results on the expected value of the product of two random variables, say w and z—they are obtained from the results on E(wz) by setting w = x − E(x) and z = y − E(y). The expected value E(wz) of wz exists if the expected values E(w²) and E(z²) of w² and z² both exist, in which case
\[ |E(wz)| \le \sqrt{E(w^2)}\,\sqrt{E(z^2)} \tag{2.13} \]
or, equivalently,
\[ [E(wz)]^2 \le E(w^2)\,E(z^2) \tag{2.14} \]
or (also equivalently)
\[ -\sqrt{E(w^2)}\,\sqrt{E(z^2)} \le E(wz) \le \sqrt{E(w^2)}\,\sqrt{E(z^2)}. \tag{2.15} \]
Further,
\[ E(w^2) = 0 \text{ or } E(z^2) = 0 \;\Rightarrow\; E(wz) = 0, \tag{2.16} \]
so that when E(w²) = 0 or E(z²) = 0 or, equivalently, when w = 0 with probability 1 or z = 0 with probability 1, inequality (2.13) holds as an equality, both sides of which equal 0. And when E(w²) > 0 and E(z²) > 0, inequality (2.13) holds as the equality E(wz) = √E(w²) √E(z²) if and only if z/√E(z²) = w/√E(w²) with probability 1, and holds as the equality E(wz) = −√E(w²) √E(z²) if and only if z/√E(z²) = −w/√E(w²) with probability 1. A verification of these results on E(wz) is provided subsequently (in the final part of the present subsection).
Note that inequality (2.9) implies that
\[ |\operatorname{cov}(x, y)| \le \max(\operatorname{var} x,\, \operatorname{var} y). \tag{2.17} \]

Correlation of two random variables. The correlation of two random variables x and y (whose expected values exist and whose variances also exist and are strictly positive) is (by definition)
\[ \frac{\operatorname{cov}(x, y)}{\sqrt{\operatorname{var}(x)}\,\sqrt{\operatorname{var}(y)}}, \]
and is denoted by the symbol corr(x, y). From result (2.5), it is clear that
\[ \operatorname{corr}(y, x) = \operatorname{corr}(x, y). \tag{2.18} \]
And, as a consequence of result (2.9), we have that
\[ |\operatorname{corr}(x, y)| \le 1, \tag{2.19} \]
which is equivalent to
\[ [\operatorname{corr}(x, y)]^2 \le 1 \tag{2.20} \]
and also to
\[ -1 \le \operatorname{corr}(x, y) \le 1. \tag{2.21} \]
Further, inequality (2.19) holds as the equality corr(x, y) = 1 if and only if
\[ \frac{y - E(y)}{\sqrt{\operatorname{var}(y)}} = \frac{x - E(x)}{\sqrt{\operatorname{var}(x)}} \text{ with probability 1,} \]
and holds as the equality corr(x, y) = −1 if and only if
\[ \frac{y - E(y)}{\sqrt{\operatorname{var}(y)}} = -\frac{x - E(x)}{\sqrt{\operatorname{var}(x)}} \text{ with probability 1.} \]
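As an informal numerical companion to formulas (2.3), (2.7), and the bound (2.19), the following Python (NumPy) sketch computes sample analogues of the variance, covariance, and correlation of two simulated random variables; the particular construction of y from x is an arbitrary illustrative assumption.

```python
import numpy as np

# Sample-moment check of (2.3), (2.7), and (2.19); the distribution is illustrative.
rng = np.random.default_rng(1)
n = 500_000

x = rng.normal(size=n)
y = 0.6 * x + 0.8 * rng.normal(size=n)       # correlated with x by construction

var_x = np.mean(x**2) - np.mean(x)**2        # sample analogue of (2.3)
var_y = np.mean(y**2) - np.mean(y)**2
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)   # sample analogue of (2.7)
corr_xy = cov_xy / np.sqrt(var_x * var_y)

print(round(corr_xy, 3))      # close to 0.6 for this construction
print(abs(corr_xy) <= 1.0)    # True, consistent with (2.19)
```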

Verification of results on the expected value of the product of two random variables. Let us verify the results (given earlier in the present subsection) on the expected value E(wz) of the product of two random variables w and z. Suppose that E(w²) and E(z²) both exist. Then, what we wish to establish are the existence of E(wz) and the validity of inequality (2.13) and of the conditions under which equality is attained in inequality (2.13).
Let us begin by observing that, for arbitrary scalars a and b,
\[ \tfrac{1}{2}(a^2 + b^2) - ab = \tfrac{1}{2}(a - b)^2 \ge 0 \quad\text{and}\quad \tfrac{1}{2}(a^2 + b^2) + ab = \tfrac{1}{2}(a + b)^2 \ge 0, \]
implying in particular that
\[ -\tfrac{1}{2}(a^2 + b^2) \le ab \le \tfrac{1}{2}(a^2 + b^2) \tag{2.22} \]
or, equivalently, that
\[ |ab| \le \tfrac{1}{2}(a^2 + b^2). \tag{2.23} \]
Upon setting a = w and b = z in inequality (2.23), we obtain the inequality
\[ |wz| \le \tfrac{1}{2}(w^2 + z^2). \tag{2.24} \]
The expected value of the right side of inequality (2.24) exists, implying the existence of the expected value of the left side of inequality (2.24) and hence the existence of E(wz).
Now, consider inequality (2.13). When E(w²) = 0 or E(z²) = 0 or, equivalently, when w = 0 with probability 1 or z = 0 with probability 1, wz = 0 with probability 1 and hence inequality (2.13) holds as an equality, both sides of which equal 0.
Alternatively, suppose that E(w²) > 0 and E(z²) > 0. And take a = w/√E(w²) and b = z/√E(z²). In light of result (2.22), we have that
\[ -E\bigl[\tfrac{1}{2}(a^2 + b^2)\bigr] \le E(ab) \le E\bigl[\tfrac{1}{2}(a^2 + b^2)\bigr]. \tag{2.25} \]
Moreover,
\[ E\bigl[\tfrac{1}{2}(a^2 + b^2)\bigr] = 1 \quad\text{and}\quad E(ab) = \frac{E(wz)}{\sqrt{E(w^2)}\,\sqrt{E(z^2)}}. \tag{2.26} \]
Together, results (2.25) and (2.26) imply that
\[ -1 \le \frac{E(wz)}{\sqrt{E(w^2)}\,\sqrt{E(z^2)}} \le 1, \]
which is equivalent to result (2.15) and hence to inequality (2.13). Further, inequality (2.13) holds as the equality E(wz) = √E(w²) √E(z²) if and only if E[½(a² + b²) − ab] = 0 [as is evident from result (2.26)], or equivalently if and only if E[½(a − b)²] = 0, and hence if and only if b = a with probability 1. And, similarly, inequality (2.13) holds as the equality E(wz) = −√E(w²) √E(z²) if and only if E[½(a² + b²) + ab] = 0, or equivalently if and only if E[½(a + b)²] = 0, and hence if and only if b = −a with probability 1.
b. Variance-covariance matrices and covariances of random vectors


As multivariate extensions of the variance of a random variable and the covariance of two random
variables, we have the variance-covariance matrix of a random vector and the covariance of two
random vectors. The variance-covariance matrix of an N -dimensional random (row or column)
vector with first through N th elements x1; x2 ; : : : ; xN (whose expected values exist) is (by definition)
the N  N matrix whose ij th element is cov.xi ; xj / (i; j D 1; 2; : : : ; N ). Note that the diagonal
elements of this matrix equal the variances of x1 ; x2 ; : : : ; xN . The covariance (or covariance matrix)
of an N -dimensional random (row or column) vector with first through N th elements x1 ; x2 ; : : : ; xN
(whose expected values exist) and a T -dimensional random (row or column) vector with first through
T th elements y1 ; y2 ; : : : ; yT (whose expected values exist) is (by definition) the N T matrix whose
ij th element is cov.xi ; yj / (i D 1; 2; : : : ; N ; j D 1; 2; : : : ; T ). The variance-covariance matrix of
a random vector is sometimes referred to simply as the variance matrix or the covariance matrix of
the vector or, even more simply, as the variance or the covariance of the vector.
Denote by var x or var.x/ or by var.x0 / the variance-covariance matrix of a random column
vector x or its transpose x0. Similarly, denote by cov.x; y/, cov.x0; y/, cov.x; y 0 /, or cov.x0; y 0 / the
covariance of x or x0 and a random column vector y or its transpose y 0. Thus, for an N -dimensional
column vector x D .x1 ; x2 ; : : : ; xN /0,
var.x1 / cov.x1 ; x2 / : : : cov.x1 ; xN /
0 1
B cov.x2 ; x1 / var.x2 / cov.x2 ; xN /C
var.x/ D var.x0 / D B :: CI
B C
: :
@ : : A
cov.xN ; x1 / cov.xN ; x2 / var.xN /
and for an N -dimensional column vector x D .x1 ; x2 ; : : : ; xN /0 and a T -dimensional column vector
y D .y1 ; y2 ; : : : ; yT /0,
cov.x; y/ D cov.x0; y/ D cov.x; y 0 / D cov.x0; y 0 /
cov.x1 ; y1 / cov.x1 ; y2 / : : : cov.x1 ; yT /
0 1
B cov.x2 ; y1 / cov.x2 ; y2 / : : : cov.x2 ; yT / C
DB :: :: :: C:
B C
@ : : : A
cov.xN ; y1 / cov.xN ; y2 / : : : cov.xN ; yT /

For an N-dimensional random column vector x and a T-dimensional random column vector y,
\[ \operatorname{cov}(x, y) = E\{[x - E(x)][y - E(y)]'\}, \tag{2.27} \]
\[ \operatorname{cov}(x, y) = E(xy') - E(x)[E(y)]', \tag{2.28} \]
and
\[ \operatorname{cov}(y, x) = [\operatorname{cov}(x, y)]'. \tag{2.29} \]
Equality (2.27) can be regarded as a multivariate extension of the formula cov(x, y) = E{[x − E(x)][y − E(y)]} for the covariance of two random variables x and y. And equality (2.28) can be regarded as a multivariate extension of equality (2.7), and equality (2.29) as a multivariate extension of equality (2.5). Each of equalities (2.27), (2.28), and (2.29) can be readily verified by comparing each element of the left side with the corresponding element of the right side.
Clearly, for an N-dimensional random column vector x,
\[ \operatorname{var}(x) = \operatorname{cov}(x, x). \tag{2.30} \]
Thus, as special cases of equalities (2.27), (2.28), and (2.29), we have that
\[ \operatorname{var}(x) = E\{[x - E(x)][x - E(x)]'\}, \tag{2.31} \]
\[ \operatorname{var}(x) = E(xx') - E(x)[E(x)]', \tag{2.32} \]
and
\[ \operatorname{var}(x) = [\operatorname{var}(x)]'. \tag{2.33} \]
Equality (2.31) can be regarded as a multivariate extension of the formula var(x) = E{[x − E(x)]²} for the variance of a random variable x, and equality (2.32) can be regarded as a multivariate extension of equality (2.3). Equality (2.33) indicates that a variance-covariance matrix is symmetric.
For an N-dimensional random column vector x = (x_1, x_2, ..., x_N)′,
\[ \Pr[x \ne E(x)] \le \sum_{i=1}^{N} \Pr[x_i \ne E(x_i)], \tag{2.34} \]
as is evident upon observing that {x : x ≠ E(x)} = ∪_{i=1}^{N} {x : x_i ≠ E(x_i)}. Moreover, according to result (2.4), var(x_i) = 0 ⇒ Pr[x_i ≠ E(x_i)] = 0 (i = 1, 2, ..., N), implying [in combination with inequality (2.34)] that if var(x_i) = 0 for i = 1, 2, ..., N, then Pr[x ≠ E(x)] = 0 or equivalently Pr[x = E(x)] = 1. Thus, as a generalization of result (2.4), we have [since (for i = 1, 2, ..., N) Pr[x = E(x)] = 1 ⇒ Pr[x_i = E(x_i)] = 1 ⇒ var(x_i) = 0] that [for an N-dimensional random column vector x = (x_1, x_2, ..., x_N)′]
\[ \operatorname{var}(x_i) = 0 \text{ for } i = 1, 2, \ldots, N \;\Leftrightarrow\; x = E(x) \text{ with probability 1.} \tag{2.35} \]
Alternatively, result (2.35) can be established by observing that
\[ \operatorname{var}(x_i) = 0 \text{ for } i = 1, 2, \ldots, N \;\Rightarrow\; \sum_{i=1}^{N} \operatorname{var}(x_i) = 0 \;\Rightarrow\; E\{[x - E(x)]'[x - E(x)]\} = 0 \;\Rightarrow\; \Pr\{[x - E(x)]'[x - E(x)] = 0\} = 1 \]
and that [x − E(x)]′[x − E(x)] = 0 ⇔ x − E(x) = 0.
In connection with result (2.35) (and otherwise), it is worth noting that, for an N-dimensional random column vector x = (x_1, x_2, ..., x_N)′,
\[ \operatorname{var}(x_i) = 0 \text{ for } i = 1, 2, \ldots, N \;\Leftrightarrow\; \operatorname{var}(x) = 0 \tag{2.36} \]
—result (2.36) is a consequence of result (2.12).
For random column vectors x and y,
\[ \operatorname{var}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \operatorname{var}(x) & \operatorname{cov}(x, y) \\ \operatorname{cov}(y, x) & \operatorname{var}(y) \end{pmatrix}, \tag{2.37} \]
as is evident from the very definition of a variance-covariance matrix (and from the definition of the covariance of two random vectors). More generally, for a random column vector x that has been partitioned into subvectors x_1, x_2, ..., x_R [so that x′ = (x_1′, x_2′, ..., x_R′)],
\[ \operatorname{var}(x) = \begin{pmatrix} \operatorname{var}(x_1) & \operatorname{cov}(x_1, x_2) & \cdots & \operatorname{cov}(x_1, x_R) \\ \operatorname{cov}(x_2, x_1) & \operatorname{var}(x_2) & & \operatorname{cov}(x_2, x_R) \\ \vdots & & \ddots & \vdots \\ \operatorname{cov}(x_R, x_1) & \operatorname{cov}(x_R, x_2) & \cdots & \operatorname{var}(x_R) \end{pmatrix}. \tag{2.38} \]

Corresponding to the variance-covariance matrix of an N-dimensional random column vector x (or row vector x′) with first through Nth elements x_1, x_2, ..., x_N (whose expected values exist and whose variances exist and are strictly positive) is the N × N matrix whose ijth element is corr(x_i, x_j). This matrix is referred to as the correlation matrix of x (or x′). It equals
\[ S^{-1}\,\operatorname{var}(x)\,S^{-1}, \]
where S = diag(√var x_1, √var x_2, ..., √var x_N). The correlation matrix is symmetric, and each of its diagonal elements equals 1.
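The relationship between a variance-covariance matrix and the corresponding correlation matrix is easy to exhibit numerically. The following Python (NumPy) sketch forms S⁻¹ var(x) S⁻¹ for an arbitrary (assumed) symmetric positive definite matrix V playing the role of var(x).

```python
import numpy as np

# Correlation matrix R = S^{-1} V S^{-1}, with S = diag(sqrt(var x_1), ..., sqrt(var x_N)).
# The matrix V below is an arbitrary illustrative choice of a variance-covariance matrix.
V = np.array([[4.0, 2.0, 0.5],
              [2.0, 9.0, 1.5],
              [0.5, 1.5, 1.0]])

S_inv = np.diag(1.0 / np.sqrt(np.diag(V)))
R = S_inv @ V @ S_inv

print(np.round(R, 3))
print(np.allclose(np.diag(R), 1.0), np.allclose(R, R.T))   # unit diagonal, symmetric
```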
c. Uncorrelated random variables or vectors


Two random variables x and y are said to be uncorrelated (or one of the random variables x and y is
said to be uncorrelated with the other) if cov.x; y/ D 0. Two or more random variables x1 ; x2 , : : : ;
xP are said to be pairwise uncorrelated or simply uncorrelated if every two of them are uncorrelated,
that is, if cov.xi ; xj / D 0 for j > i D 1; 2; : : : ; P. Accordingly, x1 ; x2 , : : : ; xP are uncorrelated
if the variance-covariance matrix of the P -dimensional random vector whose elements are x1 ; x2 ;
: : : ; xP is a diagonal matrix.
Two random vectors x (or x0 ) and y (or y 0 ) are said to be uncorrelated if cov.x; y/ D 0, that
is, if every element of x is uncorrelated with every element of y. Two or more random vectors
x1 ; x2 ; : : : ; xP are said to be pairwise uncorrelated or simply uncorrelated if every two of them are
uncorrelated, that is, if cov.xi ; xj / D 0 for j > i D 1; 2; : : : ; P. Accordingly, x1; x2 , : : : ; xP are un-
correlated if the variance-covariance matrix of the random column vector x (or the random row vector
x0 ) defined by x0 D .x01 ; x02 ; : : : ; xP
0
/ is of the block-diagonal form diag.var x1 ; var x2 ; : : : ; var xP /.
For statistically independent random variables x and y (whose expected values exist), we find
[upon recalling result (1.4)] that

cov.x; y/ D EfŒx E.x/Œy E.y/g D EŒx E.x/ EŒy E.y/ D 0:

This result can be stated in the form of the following lemma.


Lemma 3.2.1. If two random variables (whose expected values exist) are statistically indepen-
dent, they are uncorrelated.
In general, the converse of Lemma 3.2.1 is not true. That is, uncorrelated random variables are
not necessarily statistically independent.
The repeated application of Lemma 3.2.1 gives rise to the following extension.
Lemma 3.2.2. Let x represent an N-dimensional random column vector with elements x_1, x_2, ..., x_N (whose expected values exist) and y a T-dimensional random column vector with elements y_1, y_2, ..., y_T (whose expected values exist). And suppose that for i = 1, 2, ..., N and j = 1, 2, ..., T, x_i and y_j are statistically independent. Then, cov(x, y) = 0, that is, x and y are uncorrelated.

d. Variances and covariances of linear combinations of random variables or vectors


Consider now the covariance of c + Σ_{i=1}^{N} a_i x_i and k + Σ_{j=1}^{T} b_j y_j, where x_1, x_2, ..., x_N, y_1, y_2, ..., y_T are random variables (whose expected values exist) and where c, a_1, a_2, ..., a_N, k, b_1, b_2, ..., b_T are nonrandom scalars. This covariance is expressible as
\[ \operatorname{cov}\Bigl(c + \sum_{i=1}^{N} a_i x_i,\; k + \sum_{j=1}^{T} b_j y_j\Bigr) = \sum_{i=1}^{N}\sum_{j=1}^{T} a_i b_j\,\operatorname{cov}(x_i, y_j), \tag{2.39} \]
as can be readily verified by making use of result (1.5). As a special case of equality (2.39), we have that
\[ \operatorname{var}\Bigl(c + \sum_{i=1}^{N} a_i x_i\Bigr) = \sum_{i=1}^{N}\sum_{j=1}^{N} a_i a_j\,\operatorname{cov}(x_i, x_j) \tag{2.40} \]
\[ = \sum_{i=1}^{N} a_i^2\,\operatorname{var}(x_i) + 2\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} a_i a_j\,\operatorname{cov}(x_i, x_j). \tag{2.41} \]

As in the case of equality (1.5), equalities (2.39) and (2.40) are reexpressible in matrix notation. Upon letting x = (x_1, x_2, ..., x_N)′, a = (a_1, a_2, ..., a_N)′, y = (y_1, y_2, ..., y_T)′, and b = (b_1, b_2, ..., b_T)′, equality (2.39) is reexpressible as
\[ \operatorname{cov}(c + a'x,\; k + b'y) = a'\,\operatorname{cov}(x, y)\,b, \tag{2.42} \]
and equality (2.40) as
\[ \operatorname{var}(c + a'x) = a'\,\operatorname{var}(x)\,a. \tag{2.43} \]
Note that in the special case where y = x (and T = N), equality (2.42) simplifies to
\[ \operatorname{cov}(c + a'x,\; k + b'x) = a'\,\operatorname{var}(x)\,b. \tag{2.44} \]
Results (2.42), (2.43), and (2.44) can be generalized. Let c represent an M-dimensional nonrandom column vector, A an M × N nonrandom matrix, and x an N-dimensional random column vector (whose expected value exists). Similarly, let k represent an S-dimensional nonrandom column vector, B an S × T nonrandom matrix, and y a T-dimensional random column vector (whose expected value exists). Then,
\[ \operatorname{cov}(c + Ax,\; k + By) = A\,\operatorname{cov}(x, y)\,B', \tag{2.45} \]
which is a generalization of result (2.42) and which in the special case where y = x (and T = N) yields the following generalization of result (2.44):
\[ \operatorname{cov}(c + Ax,\; k + Bx) = A\,\operatorname{var}(x)\,B'. \tag{2.46} \]
When k = c and B = A, result (2.46) simplifies to the following generalization of result (2.43):
\[ \operatorname{var}(c + Ax) = A\,\operatorname{var}(x)\,A'. \tag{2.47} \]
Equality (2.45) can be readily verified by comparing each element of the left side with the corresponding element of the right side and by applying result (2.42).
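Equality (2.47) lends itself to a quick Monte Carlo check. In the following Python (NumPy) sketch, the matrix V playing the role of var(x), the matrix A, the vector c, and the simulated (normal) distribution of x are all illustrative assumptions.

```python
import numpy as np

# Monte Carlo check of equality (2.47): var(c + Ax) = A var(x) A'.
rng = np.random.default_rng(2)

V = np.array([[2.0, 0.7, 0.3],
              [0.7, 1.0, 0.2],
              [0.3, 0.2, 1.5]])          # var(x), chosen positive definite
A = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, 2.0]])
c = np.array([10.0, -3.0])

L = np.linalg.cholesky(V)
x = rng.normal(size=(300_000, 3)) @ L.T  # rows with variance-covariance matrix V
y = c + x @ A.T                          # rows of c + Ax

print(np.round(np.cov(y, rowvar=False), 2))
print(np.round(A @ V @ A.T, 2))          # the two matrices should nearly agree
```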
Another sort of generalization is possible. Let x_1, x_2, ..., x_N represent M-dimensional random column vectors (whose expected values exist), c an M-dimensional nonrandom column vector, and a_1, a_2, ..., a_N nonrandom scalars. Similarly, let y_1, y_2, ..., y_T represent S-dimensional random column vectors (whose expected values exist), k an S-dimensional nonrandom column vector, and b_1, b_2, ..., b_T nonrandom scalars. Then,
\[ \operatorname{cov}\Bigl(c + \sum_{i=1}^{N} a_i x_i,\; k + \sum_{j=1}^{T} b_j y_j\Bigr) = \sum_{i=1}^{N}\sum_{j=1}^{T} a_i b_j\,\operatorname{cov}(x_i, y_j), \tag{2.48} \]
which is a generalization of result (2.39). As a special case of equality (2.48) [that obtained by setting T = N and (for j = 1, 2, ..., T) y_j = x_j], we have that
\[ \operatorname{cov}\Bigl(c + \sum_{i=1}^{N} a_i x_i,\; k + \sum_{j=1}^{N} b_j x_j\Bigr) = \sum_{i=1}^{N}\sum_{j=1}^{N} a_i b_j\,\operatorname{cov}(x_i, x_j). \tag{2.49} \]
And as a further special case [that obtained by setting k = c and (for j = 1, 2, ..., N) b_j = a_j], we have that
\[ \operatorname{var}\Bigl(c + \sum_{i=1}^{N} a_i x_i\Bigr) = \sum_{i=1}^{N} a_i^2\,\operatorname{var}(x_i) + \sum_{i=1}^{N}\sum_{\substack{j=1 \\ j \ne i}}^{N} a_i a_j\,\operatorname{cov}(x_i, x_j). \tag{2.50} \]
Equality (2.48) can be verified by comparing each element of the left side with the corresponding element of the right side and by applying result (2.39). Alternatively, equality (2.48) can be derived by observing that Σ_{i=1}^{N} a_i x_i = Ax, where A = (a_1 I, a_2 I, ..., a_N I) and x′ = (x_1′, x_2′, ..., x_N′), and that Σ_{j=1}^{T} b_j y_j = By, where B = (b_1 I, b_2 I, ..., b_T I) and y′ = (y_1′, y_2′, ..., y_T′), and by applying equality (2.45).
e. Nonnegative definiteness of a variance-covariance matrix


Let x represent an N-dimensional random column vector (with elements whose expected values and variances exist). Because a variance is inherently nonnegative, it follows immediately from result (2.43) that a′ var(x) a ≥ 0 for every N-dimensional nonrandom column vector a. Thus, recalling result (2.33), we have the following theorem.
Theorem 3.2.3. The variance-covariance matrix of a random vector is nonnegative definite (and symmetric).
Positive definite vs. positive semidefinite variance-covariance matrices. Let V represent the variance-covariance matrix of an N-dimensional random column vector x. Theorem 3.2.3 implies that V is either positive definite or positive semidefinite. Moreover, for an arbitrary N-dimensional nonrandom column vector a, we have [from result (2.43)] that var(a′x) = a′Va. Accordingly, if V is positive semidefinite, there exist nonnull values of the vector a for which var(a′x) = 0 or, equivalently, for which a′x = E(a′x) with probability 1. Alternatively, if V is positive definite, no such values exist; that is, if V is positive definite, then var(a′x) = 0 ⇔ a = 0.
In fact, in light of Corollary 2.13.27 (and the symmetry of V),
\[ \operatorname{var}(a'x) = 0 \;\Leftrightarrow\; Va = 0 \;\Leftrightarrow\; a \in \mathcal{N}(V). \tag{2.51} \]
The null space N(V) of V is (as discussed in Section 2.9b) a linear space and (according to Lemma 2.11.5) is of dimension N − rank(V). When V is positive definite, V is nonsingular, so that dim[N(V)] = 0. When V is positive semidefinite, V is singular, so that dim[N(V)] ≥ 1.
In light of Lemma 2.14.21, we have that |V| ≥ 0 and, more specifically, that
\[ |V| > 0 \;\Leftrightarrow\; V \text{ is positive definite} \tag{2.52} \]
and
\[ |V| = 0 \;\Leftrightarrow\; V \text{ is positive semidefinite.} \tag{2.53} \]

An inequality revisited. Consider the implications of results (2.52) and (2.53) in the special case where N = 2. Accordingly, suppose that V is the variance-covariance matrix of a vector of two random variables, say x and y. Then, in light of result (2.14.4),
\[ |V| = \operatorname{var}(x)\,\operatorname{var}(y) - [\operatorname{cov}(x, y)]^2. \tag{2.54} \]
Thus, it follows from results (2.52) and (2.53) that
\[ |\operatorname{cov}(x, y)| \le \sqrt{\operatorname{var}(x)}\,\sqrt{\operatorname{var}(y)}, \tag{2.55} \]
with equality holding if and only if |V| = 0 and hence if and only if V is positive semidefinite (or, equivalently, if and only if V is singular or, also equivalently, if and only if dim[N(V)] ≥ 1).
Note that inequality (2.55) is identical to inequality (2.9), the validity of which was established (in the final part of Subsection a) via a different approach.
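The nonnegative definiteness of a variance-covariance matrix and the two-dimensional identity (2.54) are both easy to confirm numerically. The following Python (NumPy) sketch does so for an arbitrary (assumed) 2 × 2 covariance matrix V.

```python
import numpy as np

# Numerical illustration of Theorem 3.2.3 and of equality (2.54) when N = 2.
# The matrix V below is an arbitrary illustrative covariance matrix.
V = np.array([[3.0, 1.2],
              [1.2, 2.0]])

eigvals = np.linalg.eigvalsh(V)
print(np.all(eigvals >= 0))                                   # True: V is nonnegative definite

det_V = np.linalg.det(V)
print(np.isclose(det_V, V[0, 0] * V[1, 1] - V[0, 1] ** 2))    # equality (2.54)
print(abs(V[0, 1]) <= np.sqrt(V[0, 0] * V[1, 1]))             # inequality (2.55)
```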

3.3 Standardized Version of a Random Variable


Let x represent a random variable, and for an arbitrary nonrandom scalar c and an arbitrary nonzero nonrandom scalar a, define
\[ z = \frac{x - c}{a}. \]
Then, x is expressible as
\[ x = c + az, \]
and the distribution of x is determinable from c and a and from the distribution of the transformed random variable z.
Clearly,
\[ E(z) = \frac{E(x) - c}{a}. \]
And
\[ \operatorname{var}(z) = \frac{\operatorname{var}(x)}{a^2}, \]
or more generally, taking y to be a random variable and w to be the transformed random variable defined by w = (y − k)/b (where k is a nonrandom scalar and b a nonzero nonrandom scalar),
\[ \operatorname{cov}(z, w) = \frac{\operatorname{cov}(x, y)}{ab}. \]
In the special case where c = E(x) and a = √var x, the random variable z is referred to as the standardized version of the random variable x (with the use of this term being restricted to situations where the expected value of x exists and where the variance of x exists and is strictly positive). When c = E(x) and a = √var x,
\[ E(z) = 0 \quad\text{and}\quad \operatorname{var}(z) = 1. \]
Further, if z is the standardized version of x and w the standardized version of a random variable y, then
\[ \operatorname{cov}(z, w) = \operatorname{corr}(x, y). \]

a. Transformation of a random vector to a vector of standardized versions


Let x represent an N-dimensional random column vector with first through Nth elements x_1, x_2, ..., x_N (whose expected values exist and whose variances exist and are strictly positive). And take z to be the N-dimensional random column vector whose ith element is the standardized version z_i = [x_i − E(x_i)]/√var x_i of x_i (i = 1, 2, ..., N); or, equivalently, take
\[ z = S^{-1}[x - E(x)], \]
where S = diag(√var x_1, √var x_2, ..., √var x_N). Then, x is expressible as
\[ x = E(x) + Sz, \]
and the distribution of x is determinable from its mean vector E(x) and the vector (√var x_1, √var x_2, ..., √var x_N) of standard deviations and from the distribution of the transformed random vector z.
Clearly,
\[ E(z) = 0. \]
Further, var(z) equals the correlation matrix of x, or more generally, taking y to be a T-dimensional random column vector with first through Tth elements y_1, y_2, ..., y_T (whose expected values exist and whose variances exist and are strictly positive) and taking w to be the T-dimensional random column vector whose jth element is the standardized version of y_j, cov(z, w) equals the N × T matrix whose ijth element is corr(x_i, y_j).
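A small simulation makes the transformation concrete: for data generated with a known covariance matrix, the sample variance-covariance matrix of z = S⁻¹[x − E(x)] should be close to the correlation matrix of x. The covariance matrix, mean vector, and simulated distribution in the following Python (NumPy) sketch are illustrative assumptions.

```python
import numpy as np

# Standardization z = S^{-1}[x - E(x)]; var(z) should approximate the correlation matrix of x.
rng = np.random.default_rng(3)

V = np.array([[4.0, 1.0],
              [1.0, 2.25]])
mu = np.array([5.0, -1.0])

L = np.linalg.cholesky(V)
x = mu + rng.normal(size=(200_000, 2)) @ L.T     # rows with mean mu and covariance V

S_inv = np.diag(1.0 / np.sqrt(np.diag(V)))
z = (x - mu) @ S_inv                              # S^{-1} is diagonal, hence symmetric

print(np.round(np.cov(z, rowvar=False), 2))       # close to the correlation matrix of x
print(np.round(S_inv @ V @ S_inv, 2))
```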
b. Transformation of a random vector to a vector of uncorrelated random variables having mean 0 and variance 1
Let x represent an N-dimensional random column vector (with elements whose expected values and variances exist). Further, let V = var(x) and K = rank(V). And suppose that V is nonnull (so that K ≥ 1).
Now, take T to be any K × N nonrandom matrix such that V = T′T, and observe that rank(T) = K—that such a matrix exists and is of rank K follows from Corollary 2.13.23. And define
\[ z = R'[x - E(x)], \]
where R is any right inverse of T—the existence of a right inverse follows from Lemma 2.5.1. Then, clearly,
\[ E(z) = 0, \]
and
\[ \operatorname{var}(z) = R'VR = (TR)'TR = I'I = I. \]
Thus, the elements of z are uncorrelated, and each has a mean of 0 and a variance of 1.
Is the vector x expressible in terms of the vector z? Consider the vector E(x) + T′z. We find that
\[ E(x) + T'z = E(x) + T'R'[x - E(x)] \]
and, accordingly, that
\[ x - [E(x) + T'z] = (I - T'R')[x - E(x)]. \tag{3.1} \]
In the special case where K = N, R = T⁻¹ (as is evident from Lemma 2.5.3), so that T′R′ = (RT)′ = I. Thus, in that special case, x = E(x) + T′z.
More generally (when K is possibly less than N), x = E(x) + T′z for those values of x for which x − E(x) ∈ C(V) (but not for the other values of x). To see this, observe [in light of equality (3.1)] that the condition x = E(x) + T′z is equivalent to the condition (I − T′R′)[x − E(x)] = 0. And let d represent the value of x − E(x) corresponding to an arbitrary value of x. If d ∈ C(V), then d = Vh for some column vector h, and we find that
\[ (I - T'R')d = (I - T'R')T'Th = T'[I - (TR)']Th = T'(I - I')Th = 0. \]
Conversely, if (I − T′R′)d = 0, then (making use of Lemma 2.12.1), we find that
\[ d = T'R'd \in \mathcal{C}(T') = \mathcal{C}(V). \]
Thus, (I − T′R′)d = 0 if and only if d ∈ C(V).
While (in general) x is not necessarily equal to E(x) + T′z for every value of x, it is the case that
\[ x = E(x) + T'z \text{ with probability 1,} \]
as is evident upon observing [in light of result (3.1)] that
\[ E\{x - [E(x) + T'z]\} = (I - T'R')[E(x) - E(x)] = 0 \]
and
\[ \operatorname{var}\{x - [E(x) + T'z]\} = (I - T'R')V(I - T'R')' = (I - T'R')T'(I - TR)T = (I - T'R')T'(I - I)T = 0 \]
and hence that x − [E(x) + T′z] = 0 with probability 1.
Note that in the “degenerate” special case where V = 0, x = E(x) with probability 1.
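For the nonsingular case K = N, the construction just described is easy to carry out explicitly: one valid choice of T is the transpose of a Cholesky factor of V (so that V = T′T), and then R = T⁻¹ is a right inverse. The following Python (NumPy) sketch uses that choice; the matrix V, the mean vector, and the simulated distribution are illustrative assumptions.

```python
import numpy as np

# Sketch of z = R'[x - E(x)] with V = T'T (T upper triangular from a Cholesky factor)
# and R = T^{-1}; here K = N, so x = E(x) + T'z exactly.
rng = np.random.default_rng(4)

V = np.array([[2.0, 0.8, 0.3],
              [0.8, 1.5, 0.4],
              [0.3, 0.4, 1.0]])
mu = np.array([1.0, 2.0, 3.0])

T = np.linalg.cholesky(V).T            # V = T'T, rank(T) = N
R = np.linalg.inv(T)                   # a right inverse of T (here the ordinary inverse)

x = mu + rng.normal(size=(200_000, 3)) @ np.linalg.cholesky(V).T
z = (x - mu) @ R                       # rows of z = R'[x - E(x)]

print(np.round(np.cov(z, rowvar=False), 2))   # approximately the identity matrix
print(np.allclose(mu + z @ T, x))             # x = E(x) + T'z (exact, since K = N)
```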
3.4 Conditional Expected Values and Conditional Variances and Covariances of Random Variables or Vectors
For a random variable y [whose expected value E(y) exists] and for a random variable, random vector, or more generally random matrix X, let us write E(y | X) for the expected value of y conditional on X—refer, e.g., to Bickel and Doksum (2001, app. B.1) or (at a more advanced level) to Feller (1971, chap. V) or Billingsley (1995, sec. 34) for the definition of a conditional expected value. The conditional expected value E(y | X) can be regarded as a function of the random matrix X and as such has the following basic property:
\[ E(y) = E[E(y \mid X)] \tag{4.1} \]
(e.g., Casella and Berger 2002, thm. 4.4.3; Bickel and Doksum 2001, eq. B.1.20; Billingsley 1995, eq. 34.6).
The variance of y conditional on X is the quantity var(y | X) defined as follows:
\[ \operatorname{var}(y \mid X) = E\{[y - E(y \mid X)]^2 \mid X\}. \]
Further, the covariance of y and a random variable w conditional on X is the quantity cov(y, w | X) defined as
\[ \operatorname{cov}(y, w \mid X) = E\{[y - E(y \mid X)][w - E(w \mid X)] \mid X\}. \]
The following identity relates the (unconditional) variance of the random variable y to its conditional mean and variance:
\[ \operatorname{var}(y) = E[\operatorname{var}(y \mid X)] + \operatorname{var}[E(y \mid X)]. \tag{4.2} \]
Similarly,
\[ \operatorname{cov}(y, w) = E[\operatorname{cov}(y, w \mid X)] + \operatorname{cov}[E(y \mid X), E(w \mid X)]. \tag{4.3} \]
Let us verify equality (4.3)—equality (4.2) can be regarded as a special case of equality (4.3) (that where w = y). Starting with the very definition of a covariance between two random variables, we find that
\[ \begin{aligned} \operatorname{cov}(y, w) &= E\{[y - E(y)][w - E(w)]\} \\ &= E\{[y - E(y \mid X) + E(y \mid X) - E(y)][w - E(w \mid X) + E(w \mid X) - E(w)]\} \\ &= E\{[y - E(y \mid X)][w - E(w \mid X)]\} \\ &\quad + E\{[E(y \mid X) - E(y)][E(w \mid X) - E(w)]\} \\ &\quad + E\{[E(y \mid X) - E(y)][w - E(w \mid X)]\} \\ &\quad + E\{[y - E(y \mid X)][E(w \mid X) - E(w)]\}. \end{aligned} \tag{4.4} \]
The first term of the sum (4.4) is expressible as
\[ E\bigl(E\{[y - E(y \mid X)][w - E(w \mid X)] \mid X\}\bigr) = E[\operatorname{cov}(y, w \mid X)], \]
and, since E(y) = E[E(y | X)] and E(w) = E[E(w | X)], the second term equals cov[E(y | X), E(w | X)]. It remains to show that the third and fourth terms of the sum (4.4) equal 0. Using basic properties of conditional expected values (e.g., Bickel and Doksum 2001, app. B.1.3; Billingsley 1995, sec. 34), we find that
\[ \begin{aligned} E\{[E(y \mid X) - E(y)][w - E(w \mid X)]\} &= E\bigl(E\{[E(y \mid X) - E(y)][w - E(w \mid X)] \mid X\}\bigr) \\ &= E\bigl([E(y \mid X) - E(y)]\,E\{[w - E(w \mid X)] \mid X\}\bigr) \\ &= E\{[E(y \mid X) - E(y)][E(w \mid X) - E(w \mid X)]\} \\ &= 0. \end{aligned} \]
Thus, the third term of the sum (4.4) equals 0. That the fourth term equals 0 can be demonstrated in similar fashion.
The definition of the conditional expected value of a random variable can be readily extended to a random row or column vector or more generally to a random matrix. The expected value of an M × N random matrix Y = {y_ij} conditional on a random matrix X is defined to be the M × N matrix whose ijth element is the conditional expected value E(y_ij | X) of the ijth element of Y. It is to be denoted by the symbol E(Y | X). As a straightforward extension of the property (4.1), we have that
\[ E(Y) = E[E(Y \mid X)]. \tag{4.5} \]
The definition of the conditional variance of a random variable and the definition of a conditional covariance of two random variables can also be readily extended. The variance-covariance matrix of an M-dimensional random column vector y = {y_i} (or its transpose y′) conditional on a random matrix X is defined to be the M × M matrix whose ijth element is the conditional covariance cov(y_i, y_j | X) of the ith and jth elements of y or y′. It is to be denoted by the symbol var(y | X) or var(y′ | X). Note that the diagonal elements of this matrix are the M conditional variances var(y_1 | X), var(y_2 | X), ..., var(y_M | X). Further, the covariance of an M-dimensional random column vector y = {y_i} (or its transpose y′) and an N-dimensional random column vector w = {w_j} (or its transpose w′) conditional on a random matrix X is defined to be the M × N matrix whose ijth element is the conditional covariance cov(y_i, w_j | X) of the ith element of y or y′ and the jth element of w or w′. It is to be denoted by the symbol cov(y, w | X), cov(y′, w | X), cov(y, w′ | X), or cov(y′, w′ | X).
As generalizations of equalities (4.2) and (4.3), we have that
\[ \operatorname{var}(y) = E[\operatorname{var}(y \mid X)] + \operatorname{var}[E(y \mid X)] \tag{4.6} \]
and
\[ \operatorname{cov}(y, w) = E[\operatorname{cov}(y, w \mid X)] + \operatorname{cov}[E(y \mid X), E(w \mid X)]. \tag{4.7} \]
The validity of equalities (4.6) and (4.7) is evident upon observing that equality (4.6) can be regarded as a special case of equality (4.7) and that equality (4.3) implies that each element of the left side of equality (4.7) equals the corresponding element of the right side.
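Decomposition (4.2) can be checked by simulation for any model in which the conditional mean and conditional variance are known in closed form. The hierarchical construction in the following Python (NumPy) sketch—y given X normal with mean 2X and variance 1 + X²—is purely an illustrative assumption.

```python
import numpy as np

# Simulation check of (4.2): var(y) = E[var(y|X)] + var[E(y|X)].
rng = np.random.default_rng(5)
n = 1_000_000

X = rng.uniform(-1.0, 1.0, size=n)
cond_mean = 2.0 * X                    # E(y | X) in this illustrative model
cond_var = 1.0 + X**2                  # var(y | X) in this illustrative model
y = cond_mean + np.sqrt(cond_var) * rng.normal(size=n)

lhs = y.var()
rhs = cond_var.mean() + cond_mean.var()
print(round(lhs, 3), round(rhs, 3))    # the two values should nearly coincide (about 2.667)
```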

3.5 Multivariate Normal Distribution


The multivariate normal distribution provides the theoretical underpinnings for many of the proce-
dures devised for making inferences on the basis of a linear statistical model.

a. Standard (univariate) normal distribution


Consider the function f(·) defined by
\[ f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} \qquad (-\infty < z < \infty). \]
Clearly,
\[ f(z) = f(-z) \tag{5.1} \]
for all z, that is, f(·) is symmetric about 0. Accordingly, ∫_{−∞}^{0} f(z) dz = ∫_{0}^{∞} f(z) dz (as can be formally verified by making the change of variable y = −z), so that
\[ \int_{-\infty}^{\infty} f(z)\, dz = \int_{0}^{\infty} f(z)\, dz + \int_{-\infty}^{0} f(z)\, dz = 2\int_{0}^{\infty} f(z)\, dz. \tag{5.2} \]
Moreover,
\[ \int_{0}^{\infty} e^{-z^2/2}\, dz = \sqrt{\frac{\pi}{2}}, \tag{5.3} \]
as is well-known and as can be verified by observing that
\[ \Bigl(\int_{0}^{\infty} e^{-z^2/2}\, dz\Bigr)^2 = \Bigl(\int_{0}^{\infty} e^{-z^2/2}\, dz\Bigr)\Bigl(\int_{0}^{\infty} e^{-y^2/2}\, dy\Bigr) = \int_{0}^{\infty}\!\!\int_{0}^{\infty} e^{-(z^2+y^2)/2}\, dz\, dy \tag{5.4} \]
and by evaluating the double integral (5.4) by converting to polar coordinates—refer, e.g., to Casella and Berger (2002, sec. 3.3) for the details.
Together, results (5.2) and (5.3) imply that
\[ \int_{-\infty}^{\infty} f(z)\, dz = 1. \]
And, upon observing that f(z) ≥ 0 for all z, we conclude that the function f(·) can serve as a probability density function. The probability distribution determined by this probability density function is referred to as the standard normal (or standard Gaussian) distribution.

b. Gamma function
To obtain convenient expressions for the moments of the standard normal distribution, it is helpful to recall (e.g., from Parzen 1960, pp. 161–163, or Casella and Berger 2002, sec. 3.3) the definition and some basic properties of the gamma function. The gamma function is the function Γ(·) defined by
\[ \Gamma(x) = \int_{0}^{\infty} w^{x-1} e^{-w}\, dw \qquad (x > 0). \tag{5.5} \]
By integrating by parts, it can be shown that
\[ \Gamma(x + 1) = x\,\Gamma(x). \tag{5.6} \]
It is a simple exercise to show that
\[ \Gamma(1) = 1. \tag{5.7} \]
And, by repeated application of the recursive formula (5.6), result (5.7) can be generalized to
\[ \Gamma(n + 1) = n! = n(n-1)(n-2)\cdots 1 \qquad (n = 0, 1, 2, \ldots). \tag{5.8} \]
(By definition, 0! = 1.)
By making the change of variable z = √(2w) in integral (5.5), we find that, for r > −1,
\[ \Gamma\Bigl(\frac{r+1}{2}\Bigr) = \Bigl(\frac{1}{2}\Bigr)^{(r-1)/2} \int_{0}^{\infty} z^{r} e^{-z^2/2}\, dz, \tag{5.9} \]
thereby obtaining an alternative representation for the gamma function. And, upon applying result (5.9) in the special case where r = 0 and upon recalling result (5.3), we obtain the formula
\[ \Gamma\bigl(\tfrac{1}{2}\bigr) = \sqrt{\pi}. \tag{5.10} \]
This result is extended to Γ(n + ½) by the formula
\[ \Gamma\bigl(n + \tfrac{1}{2}\bigr) = \frac{(2n)!}{4^{n}\, n!}\sqrt{\pi} = \frac{1\cdot 3\cdot 5\cdot 7\cdots(2n-1)}{2^{n}}\sqrt{\pi} \qquad (n = 0, 1, 2, \ldots), \tag{5.11} \]
the validity of which can be established by making use of result (5.6) and employing mathematical induction.
c. Moments of the standard normal distribution


Denote by f(·) the probability density function of the standard normal distribution, and let z represent a random variable whose distribution is standard normal. For r = 1, 2, 3, ..., the rth absolute moment of the standard normal distribution is
\[ E(|z|^r) = \int_{-\infty}^{\infty} |z|^r f(z)\, dz = \int_{0}^{\infty} z^r f(z)\, dz + (-1)^r \int_{-\infty}^{0} z^r f(z)\, dz. \tag{5.12} \]
[In result (5.12), the symbol z is used to represent a variable of integration as well as a random variable. In circumstances where this kind of dual usage might result in confusion or ambiguity, either altogether different symbols are to be used for a random variable (or random vector or random matrix) and for a related quantity (such as a variable of integration or a value of the random variable), or the related quantity is to be distinguished from the random quantity simply by underlining whatever symbol is used for the random quantity.]
We have that, for r = 1, 2, 3, ...,
\[ \int_{-\infty}^{0} z^r f(z)\, dz = (-1)^r \int_{0}^{\infty} z^r f(z)\, dz \tag{5.13} \]
[as can be readily verified by making the change of variable y = −z and recalling result (5.1)] and that, for r > −1,
\[ \int_{0}^{\infty} z^r f(z)\, dz = \frac{1}{\sqrt{2\pi}}\int_{0}^{\infty} z^r e^{-z^2/2}\, dz = \frac{2^{(r/2)-1}}{\sqrt{\pi}}\,\Gamma\Bigl(\frac{r+1}{2}\Bigr) \tag{5.14} \]
[as is evident from result (5.9)].
Now, starting with expression (5.12) and making use of results (5.13), (5.14), (5.8), and (5.11), we find that, for r = 1, 2, 3, ...,
\[ E(|z|^r) = 2\int_{0}^{\infty} z^r f(z)\, dz = \sqrt{\frac{2^r}{\pi}}\,\Gamma\Bigl(\frac{r+1}{2}\Bigr) \tag{5.15} \]
\[ = \begin{cases} \sqrt{\dfrac{2^r}{\pi}}\,\bigl[(r-1)/2\bigr]! & \text{if } r \text{ is odd,} \\[1.2ex] (r-1)(r-3)(r-5)\cdots 7\cdot 5\cdot 3\cdot 1 & \text{if } r \text{ is even.} \end{cases} \tag{5.16} \]
Accordingly, the rth moment of the standard normal distribution exists for r = 1, 2, 3, .... For r = 1, 3, 5, ..., we find [in light of result (5.13)] that
\[ E(z^r) = \int_{0}^{\infty} z^r f(z)\, dz + (-1)^r \int_{0}^{\infty} z^r f(z)\, dz = 0. \tag{5.17} \]
And, for r = 2, 4, 6, ..., we have that
\[ E(z^r) = E(|z|^r) = (r-1)(r-3)(r-5)\cdots 7\cdot 5\cdot 3\cdot 1. \tag{5.18} \]
Thus, the odd moments of the standard normal distribution equal 0, while the even moments are given by formula (5.18).
In particular, we have that
\[ E(z) = 0 \quad\text{and}\quad \operatorname{var}(z) = E(z^2) = 1. \tag{5.19} \]
That is, the standard normal distribution has a mean of 0 and a variance of 1. Further, the third and fourth moments of the standard normal distribution are
\[ E(z^3) = 0 \quad\text{and}\quad E(z^4) = 3. \tag{5.20} \]
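The moment formulas (5.15), (5.16), and (5.18) are easily compared with one another and with Monte Carlo estimates. The following Python sketch does so for r = 1, ..., 6; the sample size and random seed are arbitrary choices.

```python
import math
import numpy as np

# Compare the gamma-function expression (5.15) with the odd/even simplifications
# in (5.16)/(5.18), and with a Monte Carlo estimate of E(|z|^r).
rng = np.random.default_rng(6)
z = rng.normal(size=2_000_000)

for r in range(1, 7):
    via_gamma = math.sqrt(2.0**r / math.pi) * math.gamma((r + 1) / 2)      # (5.15)
    if r % 2 == 1:
        simplified = math.sqrt(2.0**r / math.pi) * math.factorial((r - 1) // 2)
    else:
        simplified = math.prod(range(r - 1, 0, -2))    # (r-1)(r-3)...3*1, as in (5.18)
    monte_carlo = np.mean(np.abs(z)**r)
    print(r, round(via_gamma, 4), round(simplified, 4), round(monte_carlo, 4))
```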
d. Normal distribution (univariate)


Define
\[ x = \mu + \sigma z, \tag{5.21} \]
where μ and σ are arbitrary nonrandom scalars and z is a random variable whose distribution is standard normal. Applying formulas (1.5) and (2.41) and making use of result (5.19), we find that
\[ E(x) = \mu \quad\text{and}\quad \operatorname{var}(x) = \sigma^2. \tag{5.22} \]
The standard deviation of x is |σ| = √σ².
Denote by h(·) the probability density function of the standard normal distribution. If σ² > 0, then the distribution of x is the absolutely continuous distribution with probability density function f(·) defined by
\[ f(x) = \Bigl|\frac{1}{\sigma}\Bigr|\, h\Bigl(\frac{x - \mu}{\sigma}\Bigr) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}. \tag{5.23} \]
If σ² = 0, then the distribution of x is not continuous. Rather,
\[ \Pr(x = \mu) = 1, \tag{5.24} \]
so that the distribution of x is completely concentrated at a single value, namely μ. Note that the distribution of x depends on σ only through the value of σ².
Let us refer to an absolutely continuous distribution having a probability density function of the form (5.23) and also to a “degenerate” distribution of the form (5.24) as a normal (or Gaussian) distribution. Accordingly, there is a family of normal distributions, the members of which are indexed by the mean and the variance of the distribution. The symbol N(μ, σ²) is used to denote a normal distribution with mean μ and variance σ². Note that the N(0, 1) distribution is identical to the standard normal distribution.
The rth central moment of the random variable x defined by equality (5.21) is expressible as
\[ E[(x - \mu)^r] = E[(\sigma z)^r] = \sigma^r E(z^r). \]
Accordingly, it follows from results (5.17) and (5.18) that, for r = 1, 3, 5, ...,
\[ E[(x - \mu)^r] = 0 \tag{5.25} \]
and that, for r = 2, 4, 6, ...,
\[ E[(x - \mu)^r] = \sigma^r (r-1)(r-3)(r-5)\cdots 7\cdot 5\cdot 3\cdot 1. \tag{5.26} \]
We find in particular [upon applying result (5.25) in the special case where r = 3 and result (5.26) in the special case where r = 4] that the third and fourth central moments of the N(μ, σ²) distribution are
\[ E[(x - \mu)^3] = 0 \quad\text{and}\quad E[(x - \mu)^4] = 3\sigma^4 \tag{5.27} \]
—in the special case where r = 2, result (5.26) simplifies to var(x) = σ².
The form of the probability density function of a (nondegenerate) normal distribution is illustrated in Figure 3.1.
[Figure 3.1 appears here.]
FIGURE 3.1. The probability density function f(·) of a (nondegenerate) N(μ, σ²) distribution: plot of f(x) against x − μ for each of 3 values of σ (σ = 0.625, σ = 1, and σ = 1.6).

e. Multivariate extension
Let us now extend the approach taken in Subsection d [in defining the (univariate) normal distribution] to the multivariate case.
Let us begin by considering the distribution of an M-dimensional random column vector, say z, whose elements are statistically independent and individually have standard normal distributions. This distribution is referred to as the M-variate (or multivariate) standard normal (or standard Gaussian) distribution. It has the probability density function f(·) defined by
\[ f(z) = \frac{1}{(2\pi)^{M/2}}\,\exp\bigl(-\tfrac{1}{2} z'z\bigr) \qquad (z \in \mathbb{R}^M). \tag{5.28} \]
Its mean vector and variance-covariance matrix are
\[ E(z) = 0 \quad\text{and}\quad \operatorname{var}(z) = I. \tag{5.29} \]
Now, let M and N represent arbitrary positive integers. And define
\[ x = \mu + \Gamma'z, \tag{5.30} \]
where μ is an arbitrary M-dimensional nonrandom column vector, Γ is an arbitrary N × M nonrandom matrix, and z is an N-dimensional random column vector whose distribution is N-variate standard normal. Further, let
\[ \Sigma = \Gamma'\Gamma. \]
Then, upon applying formulas (1.7) and (2.47), we find [in light of result (5.29)] that
\[ E(x) = \mu \quad\text{and}\quad \operatorname{var}(x) = \Sigma. \tag{5.31} \]
Probability density function: existence, derivation, and geometrical form. Let us consider the distribution of the random vector x [defined by equality (5.30)]. If rank(Γ) < M or, equivalently, if rank(Σ) < M (in which case Σ is positive semidefinite), then the distribution of x has no probability density function. Suppose now that rank(Γ) = M or, equivalently, that rank(Σ) = M (in which case Σ is positive definite and N ≥ M). Then, the distribution of x has a probability density function, which we now proceed to derive.
Take Λ to be an N × (N − M) matrix whose columns form an orthonormal (with respect to the usual inner product) basis for N(Γ′)—according to Lemma 2.11.5, dim[N(Γ′)] = N − M. Then, observing that Λ′Λ = I and Γ′Λ = 0 and making use of Lemmas 2.12.1 and 2.6.1, we find that
\[ \operatorname{rank}(\Gamma, \Lambda) = \operatorname{rank}[(\Gamma, \Lambda)'(\Gamma, \Lambda)] = \operatorname{rank}\begin{pmatrix} \Sigma & 0 \\ 0 & I_{N-M} \end{pmatrix} = \operatorname{rank}(\Sigma) + N - M = N. \]
Thus, the N × N matrix (Γ, Λ) is nonsingular.
Define w = Λ′z, and denote by g(·) the probability density function of the distribution of z. Because
\[ z = \begin{pmatrix} \Gamma' \\ \Lambda' \end{pmatrix}^{-1}\begin{pmatrix} x - \mu \\ w \end{pmatrix}, \]
the joint distribution of x and w has the probability density function h(·, ·) given by
\[ h(x, w) = \Bigl|\det\begin{pmatrix} \Gamma' \\ \Lambda' \end{pmatrix}^{-1}\Bigr|\; g\biggl(\begin{pmatrix} \Gamma' \\ \Lambda' \end{pmatrix}^{-1}\begin{pmatrix} x - \mu \\ w \end{pmatrix}\biggr). \]
And, upon observing that
\[ \Bigl|\det\begin{pmatrix} \Gamma' \\ \Lambda' \end{pmatrix}\Bigr| = \Bigl[\det\begin{pmatrix} \Gamma' \\ \Lambda' \end{pmatrix}\det(\Gamma, \Lambda)\Bigr]^{1/2} = \Bigl[\det\begin{pmatrix} \Sigma & 0 \\ 0 & I \end{pmatrix}\Bigr]^{1/2} = [\det(\Sigma)]^{1/2}, \]
we find that
\[ \begin{aligned} h(x, w) &= \frac{1}{(2\pi)^{N/2}|\Sigma|^{1/2}}\exp\biggl[-\frac{1}{2}\begin{pmatrix} x - \mu \\ w \end{pmatrix}'\begin{pmatrix} \Sigma^{-1} & 0 \\ 0 & I \end{pmatrix}\begin{pmatrix} x - \mu \\ w \end{pmatrix}\biggr] \\ &= \frac{1}{(2\pi)^{M/2}|\Sigma|^{1/2}}\exp\bigl[-\tfrac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\bigr]\cdot\frac{1}{(2\pi)^{(N-M)/2}}\exp\bigl[-\tfrac{1}{2}w'w\bigr]. \end{aligned} \]
Thus, the distribution of x has the probability density function f(·) given by
\[ f(x) = \int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty} h(x, w)\, dw = \frac{1}{(2\pi)^{M/2}|\Sigma|^{1/2}}\exp\bigl[-\tfrac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\bigr]. \tag{5.32} \]
Each of the contour lines or surfaces of f(·) consists of the points in a set of the form
\[ \{x : (x - \mu)'\Sigma^{-1}(x - \mu) = c\}, \]
where c is a nonnegative scalar—f(·) has the same value for every point in the set. When M = 2, each of these lines or surfaces is an ellipse. More generally (when M ≥ 2), each is an M-dimensional ellipsoid. In the special case where Σ (and hence Σ⁻¹) is a scalar multiple of I_M, each of the contour lines or surfaces is (when M = 2) a circle or (when M ≥ 2) an M-dimensional sphere.
Uniqueness property. The matrix Σ = Γ′Γ has the same value for various choices of the N × M matrix Γ that differ with regard to their respective entries and/or with regard to the value of N. However, the distribution of the random vector x = μ + Γ′z is the same for all such choices—it depends on Γ only through the value of Σ = Γ′Γ. That this is the case when Σ is positive definite is evident from result (5.32). That it is the case in general (i.e., even if Σ is positive semidefinite) is established in Subsection f.
Definition and notation. Let us refer to the distribution of the random vector x = μ + Γ′z as an M-variate (or multivariate) normal (or Gaussian) distribution. Accordingly, there is a family of M-variate normal distributions, the members of which are indexed by the mean vector and the variance-covariance matrix of the distribution. For every M-dimensional column vector μ and every M × M symmetric nonnegative definite matrix Σ, there is an M-variate normal distribution having μ as its mean vector and Σ as its variance-covariance matrix (as is evident upon recalling, from Corollary 2.13.25, that every symmetric nonnegative definite matrix is expressible in the form Γ′Γ). The symbol N(μ, Σ) is used to denote an MVN (multivariate normal) distribution with mean vector μ and variance-covariance matrix Σ. Note that the N(0, I_M) distribution is identical to the M-variate standard normal distribution.

f. Verification of uniqueness property: general case


Let M represent an arbitrary positive integer. And take μ to be an arbitrary M-dimensional (nonrandom) column vector, and Σ to be an arbitrary M × M (nonrandom) symmetric nonnegative definite matrix. Further, denote by Γ and Λ (nonrandom) matrices such that
\[ \Sigma = \Gamma'\Gamma = \Lambda'\Lambda, \]
let N represent the number of rows in Γ and S the number of rows in Λ, and let R = rank(Σ). Finally, define
\[ x = \mu + \Gamma'z \quad\text{and}\quad w = \mu + \Lambda'y, \]
where z is an N-dimensional random column vector whose distribution is (N-variate) standard normal and y is an S-dimensional random column vector whose distribution is (S-variate) standard normal.
Let us verify that w ∼ x, thereby validating the assertion made in the next-to-last paragraph of Subsection e—depending on the context, the symbol ∼ means “is (or be) distributed as” or “has the same distribution as.” In doing so, let us partition Σ as
\[ \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \]
where the dimensions of Σ_{11} are R × R. And let us assume that Σ_{11} is nonsingular. This assumption is convenient and can be made without loss of generality—the matrix Σ contains an R × R nonsingular principal submatrix, which can be relocated (should it be necessary to satisfy the assumption) by reordering the matrix’s rows and columns.
Partition Γ and Λ as
\[ \Gamma = (\Gamma_1, \Gamma_2) \quad\text{and}\quad \Lambda = (\Lambda_1, \Lambda_2), \]
where the dimensions of Γ_1 are N × R and those of Λ_1 are S × R. Then,
\[ \Gamma_1'\Gamma_1 = \Sigma_{11} \quad\text{and}\quad \Gamma_2'\Gamma_1 = \Sigma_{21}. \]
Similarly,
\[ \Lambda_1'\Lambda_1 = \Sigma_{11} \quad\text{and}\quad \Lambda_2'\Lambda_1 = \Sigma_{21}. \]
Moreover, in light of Lemma 2.12.1, we have that
\[ \operatorname{rank}(\Gamma_1) = R = \operatorname{rank}(\Gamma) \]
and, similarly, that
\[ \operatorname{rank}(\Lambda_1) = R = \operatorname{rank}(\Lambda). \]
Thus, the columns of Γ_1 form a basis for C(Γ), and the columns of Λ_1 form a basis for C(Λ). Accordingly, there exist matrices A and B such that
\[ \Gamma_2 = \Gamma_1 A \quad\text{and}\quad \Lambda_2 = \Lambda_1 B. \]
And, taking A and B to be any such matrices, we find that
\[ A'\Gamma_1'\Gamma_1 = \Gamma_2'\Gamma_1 = \Sigma_{21} = \Lambda_2'\Lambda_1 = B'\Lambda_1'\Lambda_1 = B'\Sigma_{11} = B'\Gamma_1'\Gamma_1, \]
implying (in light of Corollary 2.3.4) that
\[ A'\Gamma_1' = B'\Gamma_1'. \]
Now, observe that the two random vectors Γ_1′z and Λ_1′y have the same probability distribution; according to result (5.32), each of them has the distribution with probability density function f(·) given by
\[ f(u) = \frac{1}{(2\pi)^{R/2}|\Sigma_{11}|^{1/2}}\exp\bigl(-\tfrac{1}{2}u'\Sigma_{11}^{-1}u\bigr). \]
We conclude that
\[ w = \mu + \Lambda'y = \mu + \begin{pmatrix} I \\ B' \end{pmatrix}\Lambda_1'y \;\sim\; \mu + \begin{pmatrix} I \\ B' \end{pmatrix}\Gamma_1'z = \mu + \begin{pmatrix} I \\ A' \end{pmatrix}\Gamma_1'z = \mu + \Gamma'z = x. \]
That is, w has the same probability distribution as x.

g. Probability density function of a bivariate (2-variate) normal distribution


Let us consider in some detail the bivariate normal distribution. Take x and y to be random variables
whose joint distribution is N.; †/, and let 1 D E.x/, 2 D E.y/, 12 D var x, 22 D var y, and
 D corr.x; y/ (where 1  0 and 2  0), so that
   2 
1 1  1 2
D and †D :
2  2 1 22
And observe [in light of result (2.14.4)] that
j†j D 12 22 .1 2 /: (5.33)
The joint distribution of x and y has a probability density function if rank.†/ D 2, or equivalently
if j†j > 0, or (also equivalently) if
1 > 0; 2 > 0; and 1 <  < 1: (5.34)
Now, suppose that condition (5.34) is satisfied. Then, in light of result (2.5.1),
\[ \Sigma^{-1} = (1-\rho^2)^{-1} \begin{pmatrix} 1/\sigma_1^2 & -\rho/(\sigma_1\sigma_2) \\ -\rho/(\sigma_1\sigma_2) & 1/\sigma_2^2 \end{pmatrix}. \tag{5.35} \]
Upon substituting expressions (5.33) and (5.35) into expression (5.32), we obtain the following expression for the probability density function $f(\cdot\,,\cdot)$ of the joint distribution of $x$ and $y$:
\[ f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x-\mu_1}{\sigma_1}\right)^{2} - 2\rho\left(\frac{x-\mu_1}{\sigma_1}\right)\left(\frac{y-\mu_2}{\sigma_2}\right) + \left(\frac{y-\mu_2}{\sigma_2}\right)^{2} \right] \right\} \tag{5.36} \]
($-\infty < x < \infty$, $-\infty < y < \infty$).
The form of the probability density function (5.36) and the effect on the probability density function of changes in $\rho$ and $\sigma_2/\sigma_1$ are illustrated in Figure 3.2.
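As a small computational aside (not part of the original text), expression (5.36) can be evaluated directly. The following sketch assumes NumPy and arbitrary illustrative parameter values; it compares the density written from (5.36) with the general bivariate form based on $|\Sigma|$ and $\Sigma^{-1}$, which should agree.

```python
import numpy as np

def bvn_pdf(x, y, mu1, mu2, s1, s2, rho):
    """Bivariate normal density, written directly from expression (5.36)."""
    zx = (x - mu1) / s1
    zy = (y - mu2) / s2
    q = (zx**2 - 2 * rho * zx * zy + zy**2) / (1 - rho**2)
    return np.exp(-q / 2) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2))

# Illustrative parameter values (chosen arbitrarily).
mu = np.array([1.0, -0.5])
s1, s2, rho = 2.0, 1.0, 0.8
Sigma = np.array([[s1**2, rho * s1 * s2],
                  [rho * s1 * s2, s2**2]])

def mvn_pdf(v):
    """The same density via the general 2-variate form with |Sigma|, Sigma^{-1}."""
    d = v - mu
    quad = d @ np.linalg.inv(Sigma) @ d
    return np.exp(-quad / 2) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

pt = np.array([0.3, 0.1])
print(bvn_pdf(pt[0], pt[1], mu[0], mu[1], s1, s2, rho))  # expression (5.36)
print(mvn_pdf(pt))                                       # general form, same value
```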
[Figure 3.2 appears here; see the caption below.]
FIGURE 3.2. Contour maps of the probability density function $f(\cdot\,,\cdot)$ of the distribution of random variables $x$ and $y$ that are jointly normal with $E(x) = \mu_1$, $E(y) = \mu_2$, $\operatorname{var} x = \sigma_1^2$ ($\sigma_1 > 0$), $\operatorname{var} y = \sigma_2^2$ ($\sigma_2 > 0$), and $\operatorname{corr}(x, y) = \rho$. The 6 maps are arranged in 3 rows, corresponding to values of $\sigma_2/\sigma_1$ of 0.625, 1, and 1.6, respectively, and in 2 columns, corresponding to values of $\rho$ of 0 and 0.8. The coordinates of the points of each contour line are the values of $(x - \mu_1)/\sigma_1$ and $(y - \mu_2)/\sigma_1$ at which $f(x, y) = k/(2\pi\sigma_1^2)$, where $k = 0.05, 0.3, 0.55, 0.8$, or $1.05$. Contour maps corresponding to $\rho = -0.8$ could be obtained by forming the mirror images of those corresponding to $\rho = 0.8$.

h. Linear transformation of a normally distributed random vector


Let $x$ represent an $N$-dimensional random column vector whose distribution is $N(\mu, \Sigma)$, and consider the distribution of the $M$-dimensional random column vector $y$ defined by
\[ y = c + Ax, \]
where $c$ is an $M$-dimensional nonrandom column vector and $A$ an $M \times N$ nonrandom matrix. By definition, the distribution of $x$ is that of a random vector $\mu + \Gamma'z$, where $\Gamma$ is a nonrandom matrix such that $\Sigma = \Gamma'\Gamma$ and where $z \sim N(0, I)$. Thus, the distribution of $y$ is identical to that of the vector
\[ c + A(\mu + \Gamma'z) = c + A\mu + (\Gamma A')'z. \]
Since $(\Gamma A')'\Gamma A' = A\Sigma A'$, it follows that
\[ y \sim N(c + A\mu,\; A\Sigma A'). \]
In summary, we have the following theorem.
Theorem 3.5.1. Let $x \sim N(\mu, \Sigma)$, and let $y = c + Ax$, where $c$ is a nonrandom vector and $A$ a nonrandom matrix. Then,
\[ y \sim N(c + A\mu,\; A\Sigma A'). \]
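Theorem 3.5.1 is easy to check by simulation. The sketch below is an illustration only (NumPy assumed, with arbitrarily chosen $c$, $A$, $\mu$, and $\Sigma$); it compares the sample moments of $y = c + Ax$ with $c + A\mu$ and $A\Sigma A'$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary illustrative choices of mu, Sigma (positive definite), c, and A.
mu = np.array([0.0, 1.0, -1.0])
L = np.array([[1.0, 0.0, 0.0],
              [0.4, 0.9, 0.0],
              [-0.2, 0.3, 0.7]])
Sigma = L @ L.T
c = np.array([2.0, -3.0])
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 1.0]])

# Draw x ~ N(mu, Sigma) and form y = c + A x.
n = 500_000
x = rng.multivariate_normal(mu, Sigma, size=n)
y = c + x @ A.T

# Theorem 3.5.1 asserts y ~ N(c + A mu, A Sigma A').
print(np.round(y.mean(axis=0), 3), np.round(c + A @ mu, 3))
print(np.round(np.cov(y, rowvar=False), 3))
print(np.round(A @ Sigma @ A.T, 3))
```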

i. Symmetry of the MVN distribution


The distribution of an $M$-dimensional random column vector $x$ is said to be symmetric about an $M$-dimensional nonrandom column vector $\mu$ if the distribution of $-(x - \mu)$ is the same as that of $x - \mu$. If $x \sim N(\mu, \Sigma)$, then it follows from Theorem 3.5.1 that the distribution of $x - \mu$ ($= -\mu + Ix$) and the distribution of $-(x - \mu)$ [$= \mu + (-I)x$] are both $N(0, \Sigma)$. Thus, as a consequence of Theorem 3.5.1, we have the following result.
Theorem 3.5.2. The $N(\mu, \Sigma)$ distribution is symmetric about $\mu$.

j. Marginal distributions
Let $x$ represent an $M$-dimensional random column vector whose distribution is $N(\mu, \Sigma)$, and consider the distribution of a subvector of $x$, say the $M_*$-dimensional subvector $x_*$ obtained by striking out all of the elements of $x$ except the $j_1, j_2, \ldots, j_{M_*}$th elements. Clearly,
\[ x_* = Ax, \]
where $A$ is the $M_* \times M$ submatrix of $I_M$ obtained by striking out all of the rows of $I_M$ except the $j_1, j_2, \ldots, j_{M_*}$th rows—if the elements of $x_*$ are the first $M_*$ elements of $x$, then $A = (I, 0)$. Accordingly, it follows from Theorem 3.5.1 that
\[ x_* \sim N(A\mu,\; A\Sigma A'). \]
Thus, as an additional consequence of Theorem 3.5.1, we have the following result.
Theorem 3.5.3. Let $x \sim N(\mu, \Sigma)$. Further, let $x_*$ represent a subvector of $x$, and let $\mu_*$ represent the corresponding subvector of $\mu$ and $\Sigma_*$ the corresponding principal submatrix of $\Sigma$. Then,
\[ x_* \sim N(\mu_*, \Sigma_*). \]

k. Statistical independence
Let $x_1, x_2, \ldots, x_P$ represent random column vectors having expected values $\mu_i = E(x_i)$ ($i = 1, 2, \ldots, P$) and covariances $\Sigma_{ij} = \operatorname{cov}(x_i, x_j)$ ($i, j = 1, 2, \ldots, P$). And for $i = 1, 2, \ldots, P$, denote by $M_i$ the number of elements in $x_i$.
Let $x_{is}$ represent the $s$th element of $x_i$ ($i = 1, 2, \ldots, P$; $s = 1, 2, \ldots, M_i$). If $x_i$ and $x_j$ are statistically independent, in which case $x_{is}$ and $x_{jt}$ are statistically independent for every $s$ and every $t$, then (according to Lemma 3.2.2) $\Sigma_{ij} = 0$ ($j \ne i = 1, 2, \ldots, P$). In general, the converse is not true (as is well known and as could be surmised from the discussion of Section 3.2c). That is, $x_i$ and $x_j$ being uncorrelated does not necessarily imply their statistical independence. However, when their joint distribution is MVN, $x_i$ and $x_j$ being uncorrelated does imply their statistical independence. More generally, when the joint distribution of $x_1, x_2, \ldots, x_P$ is MVN, $\Sigma_{ij} = 0$ (i.e., $x_i$ and $x_j$ being uncorrelated) for $j \ne i = 1, 2, \ldots, P$ implies that $x_1, x_2, \ldots, x_P$ are mutually (jointly) independent.
To see this, let
\[ x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_P \end{pmatrix}, \]
and observe that $E(x) = \mu$ and $\operatorname{var}(x) = \Sigma$, where
\[ \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_P \end{pmatrix} \quad\text{and}\quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \cdots & \Sigma_{1P} \\ \Sigma_{21} & \Sigma_{22} & \cdots & \Sigma_{2P} \\ \vdots & \vdots & & \vdots \\ \Sigma_{P1} & \Sigma_{P2} & \cdots & \Sigma_{PP} \end{pmatrix}. \]
And suppose that the distribution of $x$ is MVN and that
\[ \Sigma_{ij} = 0 \quad (j \ne i = 1, 2, \ldots, P). \]
Further, define
\[ \Gamma = \operatorname{diag}(\Gamma_1, \Gamma_2, \ldots, \Gamma_P), \]
where (for $i = 1, 2, \ldots, P$) $\Gamma_i$ is any matrix such that $\Sigma_{ii} = \Gamma_i'\Gamma_i$. Then,
\[ \Sigma = \operatorname{diag}(\Sigma_{11}, \Sigma_{22}, \ldots, \Sigma_{PP}) = \operatorname{diag}(\Gamma_1'\Gamma_1, \Gamma_2'\Gamma_2, \ldots, \Gamma_P'\Gamma_P) = \Gamma'\Gamma. \]

Now, denote by $N_i$ the number of rows in $\Gamma_i$ ($i = 1, 2, \ldots, P$), and take
\[ z = \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_P \end{pmatrix}, \]
where (for $i = 1, 2, \ldots, P$) $z_i$ is an $N_i$-dimensional random column vector whose distribution is $N(0, I)$ and where $z_1, z_2, \ldots, z_P$ are statistically independent. Clearly, $z \sim N(0, I)$, so that (by definition) the distribution of $x$ is identical to that of the random vector
\[ \mu + \Gamma'z = \begin{pmatrix} \mu_1 + \Gamma_1'z_1 \\ \mu_2 + \Gamma_2'z_2 \\ \vdots \\ \mu_P + \Gamma_P'z_P \end{pmatrix}, \]
or, equivalently, the joint distribution of $x_1, x_2, \ldots, x_P$ is identical to that of the random vectors $\mu_1 + \Gamma_1'z_1$, $\mu_2 + \Gamma_2'z_2, \ldots, \mu_P + \Gamma_P'z_P$. Since $z_1, z_2, \ldots, z_P$ are distributed independently, so are the vector-valued functions $\mu_1 + \Gamma_1'z_1$, $\mu_2 + \Gamma_2'z_2, \ldots, \mu_P + \Gamma_P'z_P$ and hence so are $x_1, x_2, \ldots, x_P$—vector-valued functions of statistically independent random vectors are statistically independent, as is evident, for example, from the discussion of Casella and Berger (2002, sec. 4.6) or Bickel and Doksum (2001, app. A).
In summary, we have the following theorem.
Theorem 3.5.4. Let $x_1, x_2, \ldots, x_P$ represent random column vectors whose joint distribution is MVN. Then, $x_1, x_2, \ldots, x_P$ are distributed independently if (and only if)
\[ \operatorname{cov}(x_i, x_j) = 0 \quad (j > i = 1, 2, \ldots, P). \]
Note that the coverage of Theorem 3.5.4 includes the case where each of the random vectors $x_1, x_2, \ldots, x_P$ is of dimension 1 and hence is in effect a random variable. In the special case where $P = 2$, Theorem 3.5.4 can be restated in the form of the following corollary.
Corollary 3.5.5. Let $x$ and $y$ represent random column vectors whose joint distribution is MVN. Then, $x$ and $y$ are statistically independent if (and only if) $\operatorname{cov}(x, y) = 0$.
As an additional corollary of Theorem 3.5.4, we have the following result.
Corollary 3.5.6. Let $x$ represent an $N$-dimensional random column vector whose distribution is MVN; and, for $i = 1, 2, \ldots, P$, let $y_i = c_i + A_i x$, where $c_i$ is an $M_i$-dimensional nonrandom column vector and $A_i$ is an $M_i \times N$ nonrandom matrix. Then, $y_1, y_2, \ldots, y_P$ are distributed independently if (and only if)
\[ \operatorname{cov}(y_i, y_j) = 0 \quad (j > i = 1, 2, \ldots, P). \]
Proof. Let
\[ y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_P \end{pmatrix}, \quad c = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_P \end{pmatrix}, \quad\text{and}\quad A = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_P \end{pmatrix}. \]
Then,
\[ y = c + Ax, \]
implying (in light of Theorem 3.5.1) that the joint distribution of $y_1, y_2, \ldots, y_P$ is MVN. Accordingly, it follows from Theorem 3.5.4 that $y_1, y_2, \ldots, y_P$ are distributed independently if (and only if) $\operatorname{cov}(y_i, y_j) = 0$ ($j > i = 1, 2, \ldots, P$). Q.E.D.

If each of two or more independently distributed random vectors has an MVN distribution, then, as indicated by the following theorem, their joint distribution is MVN.
Theorem 3.5.7. For $i = 1, 2, \ldots, P$, let $x_i$ represent an $M_i$-dimensional random column vector whose distribution is $N(\mu_i, \Sigma_i)$. If $x_1, x_2, \ldots, x_P$ are mutually independent, then the distribution of the random vector $x$ defined by $x' = (x_1', x_2', \ldots, x_P')$ is $N(\mu, \Sigma)$, where $\mu' = (\mu_1', \mu_2', \ldots, \mu_P')$ and $\Sigma = \operatorname{diag}(\Sigma_1, \Sigma_2, \ldots, \Sigma_P)$.
Proof. For $i = 1, 2, \ldots, P$, take $\Gamma_i$ to be a matrix (having $M_i$ columns) such that $\Sigma_i = \Gamma_i'\Gamma_i$—the existence of such a matrix follows from Corollary 2.13.25—and denote by $N_i$ the number of rows in $\Gamma_i$. And let $N = \sum_{i=1}^{P} N_i$. Further, let $\Gamma = \operatorname{diag}(\Gamma_1, \Gamma_2, \ldots, \Gamma_P)$, and define $z$ by $z' = (z_1', z_2', \ldots, z_P')$, where (for $i = 1, 2, \ldots, P$) $z_i$ is an $N_i$-dimensional random column vector whose distribution is $N(0, I)$ and where $z_1, z_2, \ldots, z_P$ are statistically independent.
For $i = 1, 2, \ldots, P$, $\mu_i + \Gamma_i'z_i \sim N(\mu_i, \Sigma_i)$ (so that $\mu_i + \Gamma_i'z_i$ has the same distribution as $x_i$). Further, $\mu_1 + \Gamma_1'z_1$, $\mu_2 + \Gamma_2'z_2, \ldots, \mu_P + \Gamma_P'z_P$ are mutually independent. And since
\[ \mu + \Gamma'z = \begin{pmatrix} \mu_1 + \Gamma_1'z_1 \\ \mu_2 + \Gamma_2'z_2 \\ \vdots \\ \mu_P + \Gamma_P'z_P \end{pmatrix}, \]
it follows that $\mu + \Gamma'z$ has the same distribution as $x$. Clearly, $z \sim N(0, I_N)$ and $\Gamma'\Gamma = \Sigma$. Thus, $\mu + \Gamma'z \sim N(\mu, \Sigma)$ and hence $x \sim N(\mu, \Sigma)$. Q.E.D.

l. Conditional distributions: a special case


Suppose that $x$ is an $M_1$-dimensional random column vector and $y$ an $M_2$-dimensional random column vector whose joint distribution is MVN. Let $\mu_1 = E(x)$, $\mu_2 = E(y)$, $\Sigma_{11} = \operatorname{var}(x)$, $\Sigma_{22} = \operatorname{var}(y)$, and $\Sigma_{21} = \operatorname{cov}(y, x)$. And define
\[ \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{21}' \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}. \]
Let us derive the conditional distribution of $y$ given $x$. Assume that $\Sigma$ is positive definite (and hence nonsingular)—consideration of the more general (and more difficult) case where $\Sigma$ may be positive semidefinite is deferred until Subsection m. Denoting by $h(\cdot\,,\cdot)$ the probability density function of the joint distribution of $x$ and $y$ and by $h_1(\cdot)$ the probability density function of the marginal distribution of $x$, letting
\[ \mu(x) = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x - \mu_1) \quad\text{and}\quad V = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{21}', \]

and making use of Theorems 2.14.22 and 2.6.6, we find that the conditional distribution of $y$ given $x$ is the distribution with probability density function $f(\cdot \mid \cdot)$ given by
\[ f(y \mid x) = \frac{h(x, y)}{h_1(x)} = \frac{1}{(2\pi)^{M_2/2}\,c^{1/2}} \exp[-\tfrac{1}{2}\,q(x, y)], \]
where
\[ c = |\Sigma|/|\Sigma_{11}| = |\Sigma_{11}||V|/|\Sigma_{11}| = |V| \]
and
\begin{align*}
q(x, y) &= \begin{pmatrix} x - \mu_1 \\ y - \mu_2 \end{pmatrix}' \begin{pmatrix} \Sigma_{11} & \Sigma_{21}' \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} \begin{pmatrix} x - \mu_1 \\ y - \mu_2 \end{pmatrix} - (x - \mu_1)'\Sigma_{11}^{-1}(x - \mu_1) \\
&= \begin{pmatrix} x - \mu_1 \\ y - \mu_2 \end{pmatrix}' \begin{pmatrix} \Sigma_{11}^{-1} + \Sigma_{11}^{-1}\Sigma_{21}'V^{-1}\Sigma_{21}\Sigma_{11}^{-1} & -\Sigma_{11}^{-1}\Sigma_{21}'V^{-1} \\ -V^{-1}\Sigma_{21}\Sigma_{11}^{-1} & V^{-1} \end{pmatrix} \begin{pmatrix} x - \mu_1 \\ y - \mu_2 \end{pmatrix} - (x - \mu_1)'\Sigma_{11}^{-1}(x - \mu_1) \\
&= [y - \mu(x)]'\,V^{-1}\,[y - \mu(x)].
\end{align*}

The probability density function of the conditional distribution of $y$ given $x$ is seen to be that of the MVN distribution with mean vector $\mu(x)$ and variance-covariance matrix $V$. Thus, we have the following theorem.
Theorem 3.5.8. Let $x$ and $y$ represent random column vectors whose joint distribution is MVN, and let $\mu_1 = E(x)$, $\mu_2 = E(y)$, $\Sigma_{11} = \operatorname{var}(x)$, $\Sigma_{22} = \operatorname{var}(y)$, and $\Sigma_{21} = \operatorname{cov}(y, x)$. Then, under the supposition that
\[ \operatorname{var}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{21}' \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \]
is positive definite, the conditional distribution of $y$ given $x$ is $N[\mu(x), V]$, where
\[ \mu(x) = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x - \mu_1) \quad\text{and}\quad V = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{21}'. \]
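The formulas of Theorem 3.5.8 (and the accompanying properties of $e = y - \mu(x)$ established in Theorem 3.5.9 below) can be checked numerically. The following sketch is an illustration only, with an arbitrary positive definite $\Sigma$ and NumPy assumed; it verifies by simulation that $e$ has mean approximately $0$, variance approximately $V$, and approximately zero covariance with $x$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative joint distribution of (x, y): x of dimension 2, y of dimension 1.
mu1 = np.array([0.0, 1.0])          # E(x)
mu2 = np.array([-1.0])              # E(y)
S11 = np.array([[2.0, 0.6],
                [0.6, 1.0]])        # var(x)
S21 = np.array([[0.8, -0.3]])       # cov(y, x), 1 x 2
S22 = np.array([[1.5]])             # var(y)
Sigma = np.block([[S11, S21.T],
                  [S21, S22]])

# mu(x) = mu2 + S21 S11^{-1} (x - mu1),  V = S22 - S21 S11^{-1} S21'.
S11_inv = np.linalg.inv(S11)
V = S22 - S21 @ S11_inv @ S21.T

n = 400_000
xy = rng.multivariate_normal(np.concatenate([mu1, mu2]), Sigma, size=n)
x, y = xy[:, :2], xy[:, 2:]
mu_x = mu2 + (x - mu1) @ (S21 @ S11_inv).T
e = y - mu_x

print(np.round(e.mean(axis=0), 3))                                   # approx. 0
print(np.round(np.cov(e, rowvar=False), 3), V)                       # approx. V
print(np.round(np.cov(np.hstack([e, x]), rowvar=False)[0, 1:], 3))   # approx. 0
```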

The results of the following theorem complement those of Theorem 3.5.8.
Theorem 3.5.9. Let $x$ and $y$ represent random column vectors, and let $\mu_1 = E(x)$, $\mu_2 = E(y)$, $\Sigma_{11} = \operatorname{var}(x)$, $\Sigma_{22} = \operatorname{var}(y)$, and $\Sigma_{21} = \operatorname{cov}(y, x)$. Suppose that $\Sigma_{11}$ is nonsingular. And let
\[ \mu(x) = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x - \mu_1) \quad\text{and}\quad V = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{21}', \]
and define
\[ e = y - \mu(x). \]
Then,
(1) $E[\mu(x)] = \mu_2$, and $\operatorname{var}[\mu(x)] = \operatorname{cov}[\mu(x), y] = \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{21}'$;
(2) $E(e) = 0$, $\operatorname{var}(e) = V$, and $\operatorname{cov}(e, x) = 0$; and
(3) under the assumption that the joint distribution of $x$ and $y$ is MVN, the distribution of $e$ is $N(0, V)$ and $e$ and $x$ are statistically independent.
Proof. (1) and (2). Making use of results (1.7) and (1.8), we find that
\[ E[\mu(x)] = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}E(x - \mu_1) = \mu_2 \]
and that
\[ E(e) = E(y) - E[\mu(x)] = 0. \]
Further, in light of result (2.47), we have that
\[ \operatorname{var}[\mu(x)] = \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{11}(\Sigma_{21}\Sigma_{11}^{-1})' = \Sigma_{21}(\Sigma_{21}\Sigma_{11}^{-1})' = (\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{21}')' \]
and hence (because a variance-covariance matrix is inherently symmetric) that
\[ \operatorname{var}[\mu(x)] = \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{21}'. \]
Similarly, in light of results (2.45) and (2.46), we have that
\[ \operatorname{cov}[\mu(x), y] = \operatorname{cov}[\mu(x), Iy] = \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{21}'I' = \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{21}' \]
and
\[ \operatorname{cov}[\mu(x), x] = \operatorname{cov}[\mu(x), Ix] = \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{11}I' = \Sigma_{21}. \]
And, upon recalling results (2.48) and (2.50), it follows that
\[ \operatorname{cov}(e, x) = \operatorname{cov}(y, x) - \operatorname{cov}[\mu(x), x] = 0 \]
and
\begin{align*}
\operatorname{var}(e) &= \operatorname{var}(y) + \operatorname{var}[\mu(x)] - \{\operatorname{cov}[\mu(x), y]\}' - \operatorname{cov}[\mu(x), y] \\
&= \Sigma_{22} + \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{21}' - (\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{21}')' - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{21}' \\
&= V.
\end{align*}

(3) Suppose that the joint distribution of $x$ and $y$ is MVN. Then, upon observing that
\[ \begin{pmatrix} e \\ x \end{pmatrix} = \begin{pmatrix} -\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}\mu_1 \\ 0 \end{pmatrix} + \begin{pmatrix} -\Sigma_{21}\Sigma_{11}^{-1} & I \\ I & 0 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}, \]
it follows from Theorem 3.5.1 that the joint distribution of $e$ and $x$ is MVN. Since [according to Part (2)] $\operatorname{cov}(e, x) = 0$, we conclude (on the basis of Corollary 3.5.5) that $e$ and $x$ are statistically independent. To establish that the distribution of $e$ is $N(0, V)$, it suffices [since it has already been established in Part (2) that $E(e) = 0$ and $\operatorname{var}(e) = V$] to observe (e.g., on the basis of Theorem 3.5.3) that the distribution of $e$ is MVN. Q.E.D.

m. Conditional distributions: general case


Let $x$ represent an $M_1$-dimensional random column vector and $y$ an $M_2$-dimensional random column vector, and let $\mu_1 = E(x)$, $\mu_2 = E(y)$, $\Sigma_{11} = \operatorname{var}(x)$, $\Sigma_{22} = \operatorname{var}(y)$, and $\Sigma_{21} = \operatorname{cov}(y, x)$. And define
\[ \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{21}' \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}. \]
What is the distribution of $y$ conditional on $x$ when the joint distribution of $y$ and $x$ is MVN? This distribution was derived in Subsection l under the supposition that $\Sigma$ is positive definite. Under that supposition, the conditional distribution of $y$ given $x$ is MVN with mean vector $\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x - \mu_1)$ and variance-covariance matrix $\Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{21}'$. What is the conditional distribution of $y$ given $x$ in the general case where $\Sigma$ may be positive semidefinite? In what follows, it is established that in the general case, the conditional distribution of $y$ given $x$ is MVN with mean vector $\mu_2 + \Sigma_{21}\Sigma_{11}^{-}(x - \mu_1)$ and variance-covariance matrix $\Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-}\Sigma_{21}'$. Thus, the generalization takes a simple form; it suffices to replace the ordinary inverse of $\Sigma_{11}$ with a generalized inverse.
As a first step in establishing this generalization, let us extend the results of Theorem 3.5.9 (to the general case where $\Sigma$ may be positive semidefinite). Let us take
\[ \mu(x) = \mu_2 + \Sigma_{21}\Sigma_{11}^{-}(x - \mu_1) \quad\text{and}\quad V = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-}\Sigma_{21}', \]

and define
\[ e = y - \mu(x). \]
Observe (in light of Theorem 2.13.25) that $\Sigma = \Gamma'\Gamma$ for some matrix $\Gamma$. Accordingly,
\[ \Sigma_{11} = \Gamma_1'\Gamma_1 \quad\text{and}\quad \Sigma_{21} = \Gamma_2'\Gamma_1 \tag{5.37} \]
for a suitable partitioning $\Gamma = (\Gamma_1, \Gamma_2)$. And making use of Theorem 2.12.2 [and equating $(\Gamma_1'\Gamma_1)^-$ and $\Sigma_{11}^-$], we find that
\[ \Sigma_{21}\Sigma_{11}^{-}\Sigma_{11} = \Gamma_2'\Gamma_1(\Gamma_1'\Gamma_1)^-\Gamma_1'\Gamma_1 = \Gamma_2'\Gamma_1 = \Sigma_{21}. \tag{5.38} \]
By taking advantage of equality (5.38) and by proceeding in the same fashion as in proving Parts (1) and (2) of Theorem 3.5.9, we obtain the following results:
\[ E[\mu(x)] = \mu_2 \quad\text{and}\quad \operatorname{var}[\mu(x)] = \operatorname{cov}[\mu(x), y] = \Sigma_{21}\Sigma_{11}^{-}\Sigma_{21}'; \]
\[ E(e) = 0, \quad \operatorname{var}(e) = V, \quad\text{and}\quad \operatorname{cov}(e, x) = 0. \tag{5.39} \]
Before proceeding, let us consider the extent to which $\mu(x)$ and $\Sigma_{21}\Sigma_{11}^{-}\Sigma_{21}'$ (and hence $V$) are invariant to the choice of the generalized inverse $\Sigma_{11}^-$. Recalling result (5.37) [and equating $(\Gamma_1'\Gamma_1)^-$ and $\Sigma_{11}^-$], we find that
\[ \Sigma_{21}\Sigma_{11}^{-}\Sigma_{21}' = \Gamma_2'\Gamma_1(\Gamma_1'\Gamma_1)^-(\Gamma_2'\Gamma_1)' = \Gamma_2'[\Gamma_1(\Gamma_1'\Gamma_1)^-\Gamma_1']\Gamma_2. \]
And, based on Theorem 2.12.2, we conclude that $\Sigma_{21}\Sigma_{11}^{-}\Sigma_{21}'$ and $V$ are invariant to the choice of $\Sigma_{11}^-$.
With regard to $\mu(x)$, the situation is a bit more complicated. If $x$ is such that $x - \mu_1 \in \mathcal{C}(\Sigma_{11})$, then there exists a column vector $t$ such that
\[ x - \mu_1 = \Sigma_{11}t, \]
in which case it follows from result (5.38) that
\[ \Sigma_{21}\Sigma_{11}^{-}(x - \mu_1) = \Sigma_{21}\Sigma_{11}^{-}\Sigma_{11}t = \Sigma_{21}t \]
and hence that $\Sigma_{21}\Sigma_{11}^{-}(x - \mu_1)$ is invariant to the choice of $\Sigma_{11}^-$. Thus, $\mu(x)$ is invariant to the choice of $\Sigma_{11}^-$ for every $x$ such that $x - \mu_1 \in \mathcal{C}(\Sigma_{11})$. Moreover, $x - \mu_1 \in \mathcal{C}(\Sigma_{11})$ is an event of probability one. To see this, observe (in light of Lemmas 2.4.2 and 2.11.2) that
\[ x - \mu_1 \in \mathcal{C}(\Sigma_{11}) \;\Leftrightarrow\; \mathcal{C}(x - \mu_1) \subset \mathcal{C}(\Sigma_{11}) \;\Leftrightarrow\; (I - \Sigma_{11}\Sigma_{11}^{-})(x - \mu_1) = 0, \]
and observe also that
\[ E[(I - \Sigma_{11}\Sigma_{11}^{-})(x - \mu_1)] = 0 \]
and
\[ \operatorname{var}[(I - \Sigma_{11}\Sigma_{11}^{-})(x - \mu_1)] = (I - \Sigma_{11}\Sigma_{11}^{-})\Sigma_{11}(I - \Sigma_{11}\Sigma_{11}^{-})' = 0 \]
and hence that $\Pr[(I - \Sigma_{11}\Sigma_{11}^{-})(x - \mu_1) = 0] = 1$.
Now, returning to the primary development, suppose that the joint distribution of $x$ and $y$ is MVN. Then, by making use of result (5.39) and by proceeding in the same fashion as in proving Part (3) of Theorem 3.5.9, we find that $e \sim N(0, V)$ and that $e$ and $x$ are statistically independent.
At this point, we are in a position to derive the conditional distribution of $y$ given $x$. Since $e$ and $x$ are distributed independently, the conditional distribution of $e$ given $x$ is the same as the marginal distribution of $e$. Thus, the conditional distribution of $e$ given $x$ is $N(0, V)$. And upon observing that $y = \mu(x) + e$, it follows that the conditional distribution of $y$ given $x$ is $N[\mu(x), V]$.
In summary, we have the following two theorems, which are generalizations of Theorems 3.5.8 and 3.5.9.
Theorem 3.5.10. Let $x$ and $y$ represent random column vectors whose joint distribution is MVN, and let $\mu_1 = E(x)$, $\mu_2 = E(y)$, $\Sigma_{11} = \operatorname{var}(x)$, $\Sigma_{22} = \operatorname{var}(y)$, and $\Sigma_{21} = \operatorname{cov}(y, x)$. Then, the conditional distribution of $y$ given $x$ is $N[\mu(x), V]$, where
\[ \mu(x) = \mu_2 + \Sigma_{21}\Sigma_{11}^{-}(x - \mu_1) \quad\text{and}\quad V = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-}\Sigma_{21}'. \]

Theorem 3.5.11. Let $x$ and $y$ represent random column vectors, and let $\mu_1 = E(x)$, $\mu_2 = E(y)$, $\Sigma_{11} = \operatorname{var}(x)$, $\Sigma_{22} = \operatorname{var}(y)$, and $\Sigma_{21} = \operatorname{cov}(y, x)$. Further, let
\[ \mu(x) = \mu_2 + \Sigma_{21}\Sigma_{11}^{-}(x - \mu_1) \quad\text{and}\quad V = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-}\Sigma_{21}', \]
and define
\[ e = y - \mu(x). \]
Then,
(0) $\Sigma_{21}\Sigma_{11}^{-}\Sigma_{21}'$ and $V$ are invariant to the choice of $\Sigma_{11}^-$, and for $x$ such that $x - \mu_1 \in \mathcal{C}(\Sigma_{11})$, $\mu(x)$ is invariant to the choice of $\Sigma_{11}^-$; moreover, $x - \mu_1 \in \mathcal{C}(\Sigma_{11})$ is an event of probability one;
(1) $E[\mu(x)] = \mu_2$, and $\operatorname{var}[\mu(x)] = \operatorname{cov}[\mu(x), y] = \Sigma_{21}\Sigma_{11}^{-}\Sigma_{21}'$;
(2) $E(e) = 0$, $\operatorname{var}(e) = V$, and $\operatorname{cov}(e, x) = 0$; and
(3) under the assumption that the joint distribution of $x$ and $y$ is MVN, the distribution of $e$ is $N(0, V)$ and $e$ and $x$ are statistically independent.
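A small numerical sketch may help make the generalized-inverse form concrete. The code below is an illustration only (NumPy assumed); it uses the Moore–Penrose inverse computed by numpy.linalg.pinv as one particular choice of $\Sigma_{11}^-$ and builds a second, different generalized inverse, then verifies relation (5.38) and the invariance asserted in Part (0) of Theorem 3.5.11 for a singular $\Sigma_{11}$.

```python
import numpy as np

rng = np.random.default_rng(3)

# A singular Sigma_11 (rank 2, dimension 3) and a Sigma_21 consistent with a
# joint nonnegative definite Sigma; values are illustrative only.
Gamma1 = np.array([[1.0, 0.5, 1.5],
                   [0.0, 1.0, 1.0]])      # Gamma_1 (2 x 3)
Gamma2 = np.array([[0.7, -0.2, 0.5]])     # Gamma_2 (1 x 3), y of dimension 1
S11 = Gamma1.T @ Gamma1                   # var(x), singular
S21 = Gamma2 @ Gamma1                     # cov(y, x)
S22 = Gamma2 @ Gamma2.T                   # var(y)

def another_g_inverse(A, U):
    """A generalized inverse of A of the form G0 + U - G0 A U A G0, where G0
    is the Moore-Penrose inverse; this is a g-inverse for any conformable U."""
    G0 = np.linalg.pinv(A)
    return G0 + U - G0 @ A @ U @ A @ G0

Ga = np.linalg.pinv(S11)                                  # one choice of S11^-
Gb = another_g_inverse(S11, rng.standard_normal((3, 3)))  # another choice

# Both satisfy the defining property S11 G S11 = S11.
assert np.allclose(S11 @ Ga @ S11, S11)
assert np.allclose(S11 @ Gb @ S11, S11)

# Relation (5.38) and the invariance of S21 S11^- S21' (hence of V):
assert np.allclose(S21 @ Ga @ S11, S21)
assert np.allclose(S21 @ Gb @ S11, S21)
print(np.round(S21 @ Ga @ S21.T, 6))          # same value for either choice
print(np.round(S21 @ Gb @ S21.T, 6))
print(np.round(S22 - S21 @ Ga @ S21.T, 6))    # V
```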

n. Third- and fourth-order central moments of the MVN distribution


Result (5.27) gives the third and fourth central moments of the univariate normal distribution. Let us extend that result by obtaining the third- and fourth-order central moments of the MVN distribution. Take $x$ to be an $M$-dimensional random column vector whose distribution is $N(\mu, \Sigma)$, and denote by $x_i$ and $\mu_i$ the $i$th elements of $x$ and $\mu$, respectively, and by $\sigma_{ij}$ the $ij$th element of $\Sigma$. Further, let $\Gamma = \{\gamma_{ij}\}$ represent any matrix such that $\Sigma = \Gamma'\Gamma$, denote by $N$ the number of rows in $\Gamma$, and take $z = \{z_i\}$ to be an $N$-dimensional random column vector whose distribution is $N(0, I)$, so that $x \sim \mu + \Gamma'z$.

As a preliminary step, observe that, for arbitrary nonnegative integers $k_1, k_2, \ldots, k_N$,
\[ E\bigl(z_1^{k_1}z_2^{k_2}\cdots z_N^{k_N}\bigr) = E\bigl(z_1^{k_1}\bigr)E\bigl(z_2^{k_2}\bigr)\cdots E\bigl(z_N^{k_N}\bigr). \tag{5.40} \]
Observe also that $-z \sim z$ (i.e., $z$ is distributed symmetrically about $0$), so that
\[ (-1)^{\sum_i k_i}\, z_1^{k_1}z_2^{k_2}\cdots z_N^{k_N} = (-z_1)^{k_1}(-z_2)^{k_2}\cdots(-z_N)^{k_N} \;\sim\; z_1^{k_1}z_2^{k_2}\cdots z_N^{k_N}. \tag{5.41} \]
If $\sum_i k_i$ is an odd number or, equivalently, if $(-1)^{\sum_i k_i} = -1$, then it follows from result (5.41) that
\[ -E\bigl(z_1^{k_1}z_2^{k_2}\cdots z_N^{k_N}\bigr) = E\bigl(z_1^{k_1}z_2^{k_2}\cdots z_N^{k_N}\bigr) \]
and hence that $E\bigl(z_1^{k_1}z_2^{k_2}\cdots z_N^{k_N}\bigr) = 0$. Thus,
\[ E\bigl(z_1^{k_1}z_2^{k_2}\cdots z_N^{k_N}\bigr) = 0 \quad\text{for } k_1, k_2, \ldots, k_N \text{ such that } \textstyle\sum_i k_i = 1, 3, 5, 7, \ldots. \tag{5.42} \]

Now, consider the third-order central moments of the MVN distribution. Making use of result (5.42), we find that, for $i, j, s = 1, 2, \ldots, M$,
\begin{align*}
E[(x_i - \mu_i)(x_j - \mu_j)(x_s - \mu_s)] &= E\Bigl[\textstyle\sum_{i'}\gamma_{i'i}z_{i'}\,\sum_{j'}\gamma_{j'j}z_{j'}\,\sum_{s'}\gamma_{s's}z_{s'}\Bigr] \\
&= \textstyle\sum_{i',\,j',\,s'}\gamma_{i'i}\gamma_{j'j}\gamma_{s's}\,E(z_{i'}z_{j'}z_{s'}) \\
&= 0. \tag{5.43}
\end{align*}
That is, all of the third-order central moments of the MVN distribution equal zero. In fact, by proceeding in similar fashion, we can establish the more general result that, for every odd positive integer $r$ (i.e., for $r = 1, 3, 5, 7, \ldots$), all of the $r$th-order central moments of the MVN distribution equal zero.
Turning to the fourth-order central moments of the MVN distribution, we find [in light of results (5.40), (5.19), and (5.20)] that, for $i, j, s, t = 1, 2, \ldots, M$,
\begin{align*}
E[(x_i - \mu_i)(x_j - \mu_j)&(x_s - \mu_s)(x_t - \mu_t)] \\
&= E\Bigl[\textstyle\sum_{i'}\gamma_{i'i}z_{i'}\,\sum_{j'}\gamma_{j'j}z_{j'}\,\sum_{s'}\gamma_{s's}z_{s'}\,\sum_{t'}\gamma_{t't}z_{t'}\Bigr] \\
&= \textstyle\sum_{i',\,j',\,s',\,t'}\gamma_{i'i}\gamma_{j'j}\gamma_{s's}\gamma_{t't}\,E(z_{i'}z_{j'}z_{s'}z_{t'}) \\
&= \textstyle\sum_{i'}\gamma_{i'i}\gamma_{i'j}\gamma_{i's}\gamma_{i't}\,(3) + \sum_{i'}\Bigl(\sum_{s'\ne i'}\gamma_{i'i}\gamma_{i'j}\gamma_{s's}\gamma_{s't} + \sum_{j'\ne i'}\gamma_{i'i}\gamma_{i's}\gamma_{j'j}\gamma_{j't} + \sum_{j'\ne i'}\gamma_{i'i}\gamma_{i't}\gamma_{j'j}\gamma_{j's}\Bigr)(1) \\
&= \textstyle\sum_{i'}\gamma_{i'i}\gamma_{i'j}\sum_{s'}\gamma_{s's}\gamma_{s't} + \sum_{i'}\gamma_{i'i}\gamma_{i's}\sum_{j'}\gamma_{j'j}\gamma_{j't} + \sum_{i'}\gamma_{i'i}\gamma_{i't}\sum_{j'}\gamma_{j'j}\gamma_{j's} \\
&= \sigma_{ij}\sigma_{st} + \sigma_{is}\sigma_{jt} + \sigma_{it}\sigma_{js}. \tag{5.44}
\end{align*}
In the special case $t = s = j = i$, formula (5.44) reduces to
\[ E[(x_i - \mu_i)^4] = 3\sigma_{ii}^2, \]
in agreement with the expression given earlier [in result (5.27)] for the fourth central moment of the univariate normal distribution.
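Formula (5.44) lends itself to a quick Monte Carlo check. The sketch below is an illustration only (NumPy assumed, with an arbitrarily chosen $\mu$ and $\Sigma$); it estimates a few fourth-order central moments by simulation and compares them with $\sigma_{ij}\sigma_{st} + \sigma_{is}\sigma_{jt} + \sigma_{it}\sigma_{js}$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Arbitrary 3-dimensional example.
mu = np.array([1.0, 0.0, -2.0])
L = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.2, 0.0],
              [-0.3, 0.4, 0.8]])
Sigma = L @ L.T

n = 1_000_000
d = rng.multivariate_normal(mu, Sigma, size=n) - mu   # centered draws

def fourth_moment_formula(i, j, s, t):
    # Expression (5.44): sigma_ij sigma_st + sigma_is sigma_jt + sigma_it sigma_js.
    return (Sigma[i, j] * Sigma[s, t]
            + Sigma[i, s] * Sigma[j, t]
            + Sigma[i, t] * Sigma[j, s])

for (i, j, s, t) in [(0, 0, 0, 0), (0, 1, 1, 2), (0, 1, 2, 2)]:
    mc = np.mean(d[:, i] * d[:, j] * d[:, s] * d[:, t])
    print((i, j, s, t), round(mc, 3), round(fourth_moment_formula(i, j, s, t), 3))
```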

o. Moment generating function


Consider an $M$-variate normal distribution with mean vector $\mu$ and variance-covariance matrix $\Sigma$. When $\Sigma$ is positive definite, this distribution can be characterized in terms of its probability density function—when $\Sigma$ is positive semidefinite, there is no probability density function. Alternatively (and regardless of whether $\Sigma$ is positive definite or positive semidefinite), the $N(\mu, \Sigma)$ distribution can be characterized in terms of its moment generating function.
Denote by $m(\cdot)$ the moment generating function of the $N(\mu, \Sigma)$ distribution. And let $x = \mu + \Gamma'z$, where (for some integer $N$) $\Gamma$ is an $N \times M$ matrix such that $\Sigma = \Gamma'\Gamma$ and where $z$ is an $N$-dimensional random column vector whose distribution is $N(0, I)$. Then, for an arbitrary $M$-dimensional (nonrandom) column vector $t$, we have that
\[ m(t) = E[\exp(t'x)] = E\{\exp[t'(\mu + \Gamma'z)]\} = \int_{\mathbb{R}^N} (2\pi)^{-N/2} \exp\bigl(t'\mu + t'\Gamma'z - \tfrac{1}{2}z'z\bigr)\,dz. \tag{5.45} \]

The evaluation of expression (5.45) is facilitated by the identity
\[ t'\Gamma'z - \tfrac{1}{2}z'z = -\tfrac{1}{2}(z - \Gamma t)'(z - \Gamma t) + \tfrac{1}{2}t'\Sigma t, \tag{5.46} \]
obtained by “completing the square” and observing that (since $z'\Gamma t$ is of dimensions $1 \times 1$) $z'\Gamma t = (z'\Gamma t)' = t'\Gamma'z$. Upon substituting expression (5.46) into expression (5.45), we find that
\[ m(t) = \exp\bigl(t'\mu + \tfrac{1}{2}t'\Sigma t\bigr) \int_{\mathbb{R}^N} (2\pi)^{-N/2} \exp[-\tfrac{1}{2}(z - \Gamma t)'(z - \Gamma t)]\,dz. \]
Moreover,
\[ \int_{\mathbb{R}^N} (2\pi)^{-N/2} \exp[-\tfrac{1}{2}(z - \Gamma t)'(z - \Gamma t)]\,dz = 1, \]
as is evident upon making a change of variables from $z$ to $y = z - \Gamma t$ or upon observing that the integrand is a probability density function [that of the $N(\Gamma t, I)$ distribution].
Thus, the moment generating function $m(\cdot)$ of the $M$-variate normal distribution with mean vector $\mu$ and variance-covariance matrix $\Sigma$ is given by
\[ m(t) = \exp\bigl(t'\mu + \tfrac{1}{2}t'\Sigma t\bigr) \quad (t \in \mathbb{R}^M). \tag{5.47} \]
Corresponding to the moment generating function $m(\cdot)$ is the cumulant generating function, say $c(\cdot)$, of this distribution, which is given by
\[ c(t) = \log m(t) = t'\mu + \tfrac{1}{2}t'\Sigma t \quad (t \in \mathbb{R}^M). \tag{5.48} \]

p. A univariate characterization of multivariate normality


Let $x = \{x_i\}$ represent an $M$-dimensional random column vector. If the distribution of $x$ is MVN, then it follows from Theorem 3.5.1 that for every $M$-dimensional nonrandom column vector $a = \{a_i\}$, the linear combination $a'x = \sum_{i=1}^{M} a_i x_i$ has a (univariate) normal distribution.
Is the converse true? That is, if every linear combination of the elements $x_1, x_2, \ldots, x_M$ of $x$ has a (univariate) normal distribution, is it necessarily the case that the distribution of $x$ is MVN? In what follows, the answer is shown to be yes.
Let $a$ represent an arbitrary $M$-dimensional nonrandom column vector, and suppose that for every $a$, $a'x$ has a (univariate) normal distribution (implying, in particular, that $x_1, x_2, \ldots, x_M$ have normal distributions and hence that the expected values and variances of $x_1, x_2, \ldots, x_M$ exist). Further, let $\mu = E(x)$ and $\Sigma = \operatorname{var}(x)$, and observe that (for every $a$) $E(a'x) = a'\mu$ and $\operatorname{var}(a'x) = a'\Sigma a$. Then, recalling the results of Subsection o and denoting by $m_*(\cdot\,; a)$ the moment generating function of the $N(a'\mu, a'\Sigma a)$ distribution, we find that (for every $a$)
\[ E[\exp(a'x)] = E[\exp(1 \cdot a'x)] = m_*(1; a) = \exp\bigl(a'\mu + \tfrac{1}{2}a'\Sigma a\bigr). \tag{5.49} \]
And we conclude that the distribution of $x$ has a moment generating function, say $m(\cdot)$, and that
\[ m(a) = \exp\bigl(a'\mu + \tfrac{1}{2}a'\Sigma a\bigr) \quad (a \in \mathbb{R}^M). \tag{5.50} \]
A comparison of expression (5.50) with expression (5.47) reveals that the moment generating function of the distribution of $x$ is the same as that of the $N(\mu, \Sigma)$ distribution. If two distributions have the same moment generating function, they are identical (e.g., Casella and Berger 2002, p. 65; Bickel and Doksum 2001, pp. 460 and 505). Consequently, the distribution of $x$ is MVN.
In summary, we have the following characterization of multivariate normality.
Theorem 3.5.12. The distribution of the $M$-dimensional random column vector $x = \{x_i\}$ is MVN if and only if, for every $M$-dimensional nonrandom column vector $a = \{a_i\}$, the distribution of the linear combination $a'x = \sum_{i=1}^{M} a_i x_i$ is (univariate) normal.

Exercises
Exercise 1. Provide detailed verifications for (1) equality (1.7), (2) equality (1.8), (3) equality (1.10),
and (4) equality (1.9).
Exercise 2.
(a) Let $w$ and $z$ represent random variables [such that $E(w^2) < \infty$ and $E(z^2) < \infty$]. Show that
\[ |E(wz)| \le E(|wz|) \le \sqrt{E(w^2)}\,\sqrt{E(z^2)}; \tag{E.1} \]
and determine the conditions under which the first inequality holds as an equality, the conditions under which the second inequality holds as an equality, and the conditions under which both inequalities hold as equalities.
(b) Let $x$ and $y$ represent random variables [such that $E(x^2) < \infty$ and $E(y^2) < \infty$]. Using Part (a) (or otherwise), show that
\[ |\operatorname{cov}(x, y)| \le E[\,|x - E(x)|\,|y - E(y)|\,] \le \sqrt{\operatorname{var}(x)}\,\sqrt{\operatorname{var}(y)}; \tag{E.2} \]
and determine the conditions under which the first inequality holds as an equality, the conditions under which the second inequality holds as an equality, and the conditions under which both inequalities hold as equalities.

Exercise 3. Let $x$ represent an $N$-dimensional random column vector and $y$ a $T$-dimensional random column vector. And define $x_*$ to be an $R$-dimensional subvector of $x$ and $y_*$ an $S$-dimensional subvector of $y$ (where $1 \le R \le N$ and $1 \le S \le T$). Relate $E(x_*)$ to $E(x)$, $\operatorname{var}(x_*)$ to $\operatorname{var}(x)$, and $\operatorname{cov}(x_*, y_*)$ and $\operatorname{cov}(y_*, x_*)$ to $\operatorname{cov}(x, y)$.
Exercise 4. Let $x$ represent a random variable that is distributed symmetrically about 0 (so that $-x \sim x$); and suppose that the distribution of $x$ is “nondegenerate” in the sense that there exists a nonnegative constant $c$ such that $0 < \Pr(x > c) < \tfrac{1}{2}$ [and assume that $E(x^2) < \infty$]. Further, define $y = |x|$.
(a) Show that $\operatorname{cov}(x, y) = 0$.
(b) Are $x$ and $y$ statistically independent? Why or why not?

Exercise 5. Provide detailed verifications for (1) equality (2.39), (2) equality (2.45), and (3) equality
(2.48).
Exercise 6.
(a) Let $x$ and $y$ represent random variables. Show that $\operatorname{cov}(x, y)$ can be determined from knowledge of $\operatorname{var}(x)$, $\operatorname{var}(y)$, and $\operatorname{var}(x + y)$, and give a formula for doing so.
(b) Let $x = (x_1, x_2)'$ and $y = (y_1, y_2)'$ represent 2-dimensional random column vectors. Can $\operatorname{cov}(x, y)$ be determined from knowledge of $\operatorname{var}(x)$, $\operatorname{var}(y)$, and $\operatorname{var}(x + y)$? Why or why not?

Exercise 7. Let $x$ represent an $M$-dimensional random column vector with mean vector $\mu$ and variance-covariance matrix $\Sigma$. Show that there exist $M(M+1)/2$ linear combinations of the $M$ elements of $x$ such that $\mu$ can be determined from knowledge of the expected values of $M$ of these linear combinations and $\Sigma$ can be determined from knowledge of the [$M(M+1)/2$] variances of these linear combinations.
Exercise 8. Let $x$ and $y$ represent random variables (whose expected values and variances exist), let $V$ represent the variance-covariance matrix of the random vector $(x, y)'$, and suppose that $\operatorname{var}(x) > 0$ and $\operatorname{var}(y) > 0$.
(a) Show that if $|V| = 0$, then, for scalars $a$ and $b$,
\[ (a, b)' \in \mathcal{N}(V) \;\Leftrightarrow\; b\sqrt{\operatorname{var} y} = c\,a\sqrt{\operatorname{var} x}, \quad\text{where } c = \begin{cases} +1, & \text{when } \operatorname{cov}(x, y) < 0, \\ -1, & \text{when } \operatorname{cov}(x, y) > 0. \end{cases} \]
(b) Use the result of Part (a) and the results of Section 3.2e to devise an alternative proof of the result (established in Section 3.2a) that $\operatorname{cov}(x, y) = \sqrt{\operatorname{var} x}\sqrt{\operatorname{var} y}$ if and only if $[y - E(y)]/\sqrt{\operatorname{var} y} = [x - E(x)]/\sqrt{\operatorname{var} x}$ with probability 1 and that $\operatorname{cov}(x, y) = -\sqrt{\operatorname{var} x}\sqrt{\operatorname{var} y}$ if and only if $[y - E(y)]/\sqrt{\operatorname{var} y} = -[x - E(x)]/\sqrt{\operatorname{var} x}$ with probability 1.

Exercise 9. Let $x$ represent an $N$-dimensional random column vector (with elements whose expected values and variances exist). Show that (regardless of the rank of $\operatorname{var} x$) there exist a nonrandom column vector $c$ and an $N \times N$ nonsingular nonrandom matrix $A$ such that the random vector $w$, defined implicitly by $x = c + A'w$, has mean $0$ and a variance-covariance matrix of the form $\operatorname{diag}(I, 0)$ [where $\operatorname{diag}(I, 0)$ is to be regarded as including $0$ and $I$ as special cases].
Exercise 10. Establish the validity of result (5.11).
Exercise 11. Let $w = |z|$, where $z$ is a standard normal random variable.
(a) Find a probability density function for the distribution of $w$.
(b) Use the expression obtained in Part (a) (for the probability density function of the distribution of $w$) to derive formula (5.15) for $E(w^r)$—in Section 3.5c, this formula is derived from the probability density function of the distribution of $z$.
(c) Find $E(w)$ and $\operatorname{var}(w)$.

Exercise 12. Let $x$ represent a random variable having mean $\mu$ and variance $\sigma^2$. Then, $E(x^2) = \mu^2 + \sigma^2$ [as is evident from result (2.3)]. Thus, the second moment of $x$ depends on the distribution of $x$ only through $\mu$ and $\sigma^2$. If the distribution of $x$ is normal, then the third and higher moments of $x$ also depend only on $\mu$ and $\sigma^2$. Taking the distribution of $x$ to be normal, obtain explicit expressions for $E(x^3)$, $E(x^4)$, and, more generally, $E(x^r)$ (where $r$ is an arbitrary positive integer).
Exercise 13. Let $x$ represent an $N$-dimensional random column vector whose distribution is $N(\mu, \Sigma)$. Further, let $R = \operatorname{rank}(\Sigma)$, and assume that $\Sigma$ is nonnull (so that $R \ge 1$). Show that there exist an $R$-dimensional nonrandom column vector $c$ and an $R \times N$ nonrandom matrix $A$ such that $c + Ax \sim N(0, I)$ (i.e., such that the distribution of $c + Ax$ is $R$-variate standard normal).
Exercise 14. Let $x$ and $y$ represent random variables, and suppose that $x + y$ and $x - y$ are independently and normally distributed and have the same mean, say $\mu$, and the same variance, say $\sigma^2$. Show that $x$ and $y$ are statistically independent, and determine the distribution of $x$ and the distribution of $y$.
Exercise 15. Suppose that two or more random column vectors x1 ; x2 ; : : : ; xP are pairwise inde-
pendent (i.e., xi and xj are statistically independent for j > i D 1; 2; : : : ; P ) and that the joint
distribution of x1 ; x2 ; : : : ; xP is MVN. Is it necessarily the case that x1 ; x2 ; : : : ; xP are mutually
independent? Why or why not?
Exercise 16. Let $x$ represent a random variable whose distribution is $N(0, 1)$, and define $y = ux$, where $u$ is a discrete random variable that is distributed independently of $x$ with $\Pr(u = 1) = \Pr(u = -1) = \tfrac{1}{2}$.
(a) Show that $y \sim N(0, 1)$.
(b) Show that $\operatorname{cov}(x, y) = 0$.
(c) Show that $x$ and $y$ are statistically dependent.
(d) Is the joint distribution of $x$ and $y$ bivariate normal? Why or why not?

Exercise 17. Let $x_1, x_2, \ldots, x_K$ represent $N$-dimensional random column vectors, and suppose that $x_1, x_2, \ldots, x_K$ are mutually independent and that (for $i = 1, 2, \ldots, K$) $x_i \sim N(\mu_i, \Sigma_i)$. Derive (for arbitrary scalars $a_1, a_2, \ldots, a_K$) the distribution of the linear combination $\sum_{i=1}^{K} a_i x_i$.

Exercise 18. Let $x$ and $y$ represent random variables whose joint distribution is bivariate normal. Further, let $\mu_1 = E(x)$, $\mu_2 = E(y)$, $\sigma_1^2 = \operatorname{var} x$, $\sigma_2^2 = \operatorname{var} y$, and $\rho = \operatorname{corr}(x, y)$ (where $\sigma_1 \ge 0$ and $\sigma_2 \ge 0$). Assuming that $\sigma_1 > 0$, $\sigma_2 > 0$, and $-1 < \rho < 1$, show that the conditional distribution of $y$ given $x$ is $N[\mu_2 + \rho\sigma_2(x - \mu_1)/\sigma_1,\; \sigma_2^2(1 - \rho^2)]$.
Exercise 19. Let $x$ and $y$ represent random variables whose joint distribution is bivariate normal. Further, let $\sigma_1^2 = \operatorname{var} x$, $\sigma_2^2 = \operatorname{var} y$, and $\sigma_{12} = \operatorname{cov}(x, y)$ (where $\sigma_1 \ge 0$ and $\sigma_2 \ge 0$). Describe (in as simple terms as possible) the marginal distributions of $x$ and $y$ and the conditional distributions of $y$ given $x$ and of $x$ given $y$. Do so for each of the following two “degenerate” cases: (1) $\sigma_1^2 = 0$; and (2) $\sigma_1^2 > 0$, $\sigma_2^2 > 0$, and $|\sigma_{12}| = \sigma_1\sigma_2$.
Exercise 20. Let $x$ represent an $N$-dimensional random column vector, and take $y$ to be the $M$-dimensional random column vector defined by $y = c + Ax$, where $c$ is an $M$-dimensional nonrandom column vector and $A$ an $M \times N$ nonrandom matrix.
(a) Express the moment generating function of the distribution of $y$ in terms of the moment generating function of the distribution of $x$.
(b) Use the result of Part (a) to show that if the distribution of $x$ is $N(\mu, \Sigma)$, then the moment generating function of the distribution of $y$ is the same as that of the $N(c + A\mu, A\Sigma A')$ distribution, thereby (since distributions having the same moment generating function are identical) providing an alternative way of arriving at Theorem 3.5.1.

Bibliographic and Supplementary Notes


§1. The term probability density function is used in connection with the probability distribution of a random
variable or, more generally, that of a random vector, say a random vector of dimension N —refer to Section 3.1.
Such use is restricted herein to probability density functions that are probability density functions with respect
to N -dimensional Lebesgue measure. Thus, a statement that the probability distribution of an N -dimensional
random vector does not have a probability density function means that it does not have a probability density
function with respect to N -dimensional Lebesgue measure.
§2. Inequality (2.13) can be regarded as a special case of an inequality, known as the Cauchy–Schwarz
inequality (or simply as the Schwarz inequality), that, in a more general form, is applicable to the members of
any inner-product space—refer, e.g., to Halmos (1958, sec. 64) for a statement, proof, and discussion of the
general version of this inequality. Accordingly, inequality (2.13) [or the equivalent inequality (2.14)] is often
referred to as the Cauchy–Schwarz inequality. Earlier (in Section 2.4), the names of Cauchy and Schwarz were
applied to inequality (2.4.9). Like inequality (2.13), that inequality can be regarded as a special case of the
general version of the Cauchy–Schwarz inequality.
§5p. Moment generating functions are closely related to characteristic functions. For the most part, moment
generating functions are adequate for our purposes, and their use results in a presentation suitable for those
with little or no knowledge of complex analysis—characteristic functions involve complex numbers, while
moment generating functions do not. Distributions having the same moment generating function are identical
(e.g., Casella and Berger 2002, p. 65; Bickel and Doksum 2001, pp. 460 and 505). And if two or more random
variables or vectors are such that the moment generating function of their joint distribution equals the product
of the moment generating functions of their marginal distributions, then those random variables or vectors are
statistically independent (e.g., Parzen 1960, p. 364). These two results constitute powerful “tools” for establishing
the identity of a distribution and for establishing the statistical independence of two or more random variables or
vectors. In fact, the first of the two results is used in Section 3.5p in arriving at Theorem 3.5.12. Unfortunately,
there is a downside to their use. Their proofs are relatively difficult and may be unfamiliar to (and possibly
“inaccessible” to) many potential readers. Consequently, an attempt is made herein to avoid the use of the
aforementioned results (on moment generating functions) in proving other results. Preference is given to the
use of results that are relatively elementary and easily proven.
4
The General Linear Model

The first two sections of Chapter 1 provide an introduction to linear statistical models in general
and to linear regression models in particular. Let us now expand on that introduction, doing so in a
way that facilitates the presentation (in subsequent chapters) of the results on statistical theory and
methodology that constitute the primary subject matter of the book.
The setting is one in which some number, say $N$, of data points are (for purposes of making statistical inferences about various quantities of interest) to be regarded as the respective values of observable random variables $y_1, y_2, \ldots, y_N$. Define $y = (y_1, y_2, \ldots, y_N)'$. It is supposed that (for $i = 1, 2, \ldots, N$) the $i$th datum (the observed value of $y_i$) is accompanied by the corresponding value $u_i$ of a column vector $u = (u_1, u_2, \ldots, u_C)'$ of $C$ “explanatory” variables $u_1, u_2, \ldots, u_C$. The observable random vector $y$ is to be modeled by specifying a “family,” say $\Delta$, of functions of $u$, and by assuming that for some member of $\Delta$ (of unknown identity), say $\delta(\cdot)$, the random deviations $y_i - \delta(u_i)$ ($i = 1, 2, \ldots, N$) have (“conditionally” on $u_1, u_2, \ldots, u_N$) a joint distribution with certain specified characteristics. In particular, it might be assumed that these random deviations have a common mean of 0 and a common variance $\sigma^2$ (of unknown value), that they are uncorrelated, and possibly that they are jointly normal.
The emphasis herein is on models in which $\Delta$ consists of some or all of those functions (of $u$) that are expressible as linear combinations of $P$ specified functions $\delta_1(\cdot), \delta_2(\cdot), \ldots, \delta_P(\cdot)$. When $\Delta$ is of that form, the assumption that, for some function $\delta(\cdot)$, the joint distribution of $y_i - \delta(u_i)$ ($i = 1, 2, \ldots, N$) has certain specified characteristics can be replaced by the assumption that, for some linear combination in $\Delta$ (of unknown identity), say one with coefficients $\beta_1, \beta_2, \ldots, \beta_P$, the joint distribution of $y_i - \sum_{j=1}^{P} \beta_j\delta_j(u_i)$ ($i = 1, 2, \ldots, N$) has the specified characteristics. Corresponding to this linear combination is the $P$-dimensional parameter vector $\beta = (\beta_1, \beta_2, \ldots, \beta_P)'$ of coefficients—in general, this vector is of unknown value.
As what can be regarded as a very special case, we have the kind of situation where (with probability 1) $y_i = \delta(u_i)$ ($i = 1, 2, \ldots, N$) for some function $\delta(\cdot)$ whose identity is known. In that special case, $\Delta$ has only one member, and (with probability 1) all $N$ of the “random” deviations $y_i - \delta(u_i)$ ($i = 1, 2, \ldots, N$) equal 0. An example of that kind of situation is provided by the ideal-gas law of physics:
\[ p = rnt/v. \]
Here, $p$ is the pressure within a container of gas, $v$ is the volume of the container, $t$ is the absolute temperature of the gas, $n$ is the number of moles of gas present in the container, and $r$ is the universal gas constant. It has been found that, under laboratory conditions and for any of a wide variety of gases, the pressure readings obtained for any of a broad range of values of $n$, $t$, and $v$ conform almost perfectly to the ideal-gas law. Note that (by taking logarithms) the ideal-gas law can be reexpressed in the form of the linear equation
\[ \log p = -\log v + \log r + \log n + \log t. \]

In general, $\Delta$ consists of a possibly infinite number of functions. And the relationship between $y_1, y_2, \ldots, y_N$ and the corresponding values $u_1, u_2, \ldots, u_N$ of the vector $u$ of explanatory variables is typically imperfect.

At times (when the intended usage would seem to be clear from the context), resort is made herein to the convenient practice of using the same symbol for a realization of a random variable (or of a random vector or random matrix) as for the random quantity itself. At other times, the realization might be distinguished from the random quantity by means of an underline or by the use of an altogether different symbol. Thus, depending on the context, the $N$ data points might be represented by either $y_1, y_2, \ldots, y_N$ or $\underline{y}_1, \underline{y}_2, \ldots, \underline{y}_N$, and the $N$-dimensional column vector comprising these points might be represented by either $y$ or $\underline{y}$. Similarly, depending on the context, an arbitrary member of $\Delta$ might be denoted either by $\delta(\cdot)$ (the same symbol used to denote the member having the specified characteristics) or by $\underline{\delta}(\cdot)$. And either $\beta_1, \beta_2, \ldots, \beta_P$ or $b_1, b_2, \ldots, b_P$ might be used to represent the coefficients of an arbitrary linear combination of the functions $\delta_1(\cdot), \delta_2(\cdot), \ldots, \delta_P(\cdot)$, and either $\beta$ or $b$ might be used to represent the $P$-dimensional column vector comprising these coefficients.
The family $\Delta$ of functions of $u$, in combination with whatever assumptions are made about the joint distribution of the $N$ random deviations $y_i - \delta(u_i)$ ($i = 1, 2, \ldots, N$), determines the statistical model. Accordingly, it can play a critical role in establishing a basis for the use of the data in making statistical inferences. Moreover, corresponding to an arbitrary member $\delta(\cdot)$ of $\Delta$ is the approximation to the data vector $\underline{y} = (\underline{y}_1, \underline{y}_2, \ldots, \underline{y}_N)'$ provided by the vector $[\delta(u_1), \delta(u_2), \ldots, \delta(u_N)]'$. It may be of direct or indirect interest to determine the member of $\Delta$ for which this approximation is the best [best in the sense that the norm of the $N$-dimensional vector with elements $\underline{y}_1 - \delta(u_1), \underline{y}_2 - \delta(u_2), \ldots, \underline{y}_N - \delta(u_N)$ is minimized for $\delta(\cdot) \in \Delta$]. Note that the solution to this optimization problem may be well-defined even in the absence of any assumptions of a statistical nature.
The data could be either univariate or multivariate. If some of the N data points were “altogether
different” in character than some of the others, the data would be regarded as multivariate. Such
would be the case if, for example, part of the data consisted of height measurements and part of
weight measurements. In some cases, the distinction (between univariate data and multivariate data)
might be less than clear-cut. For example, if the data consisted entirely of measurements of the
level of a pollutant but some measurements were obtained by different means, at a different time, or
under different conditions than others, the data might be regarded as univariate or, alternatively, as
multivariate.

4.1 Some Basic Types of Linear Models


Let us continue to take the setting to be one in which there are $N$ data points that are to be regarded as the respective values of observable random variables $y_1, y_2, \ldots, y_N$. And let us continue to define $y = (y_1, y_2, \ldots, y_N)'$ and to suppose that (for $i = 1, 2, \ldots, N$) the $i$th datum is accompanied by the corresponding value $u_i$ of a $C$-dimensional column vector $u = (u_1, u_2, \ldots, u_C)'$ of explanatory variables. As previously indicated, the observable random vector $y$ is to be modeled in terms of a specified family $\Delta$ of functions of $u$—it is implicitly assumed that the domain of each of these functions includes $u_1, u_2, \ldots, u_N$.
The models to be considered start with the assumption that for some (unidentified) function $\delta(\cdot)$ in $\Delta$,
\[ E(y_i) = \delta(u_i) \quad (i = 1, 2, \ldots, N) \tag{1.1} \]
or, equivalently,
\[ E[y_i - \delta(u_i)] = 0 \quad (i = 1, 2, \ldots, N). \tag{1.2} \]
Letting $\delta = [\delta(u_1), \delta(u_2), \ldots, \delta(u_N)]'$, condition (1.1) is reexpressible as
\[ E(y) = \delta, \tag{1.3} \]
and condition (1.2) as
\[ E(y - \delta) = 0. \tag{1.4} \]

Here, the distributions of $y$ and $y - \delta$ [and hence the expected values in conditions (1.1), (1.2), (1.3), and (1.4)] are regarded as “conditional” on $u_1, u_2, \ldots, u_N$. Further, it is assumed that the distribution of $y - \delta$ does not depend on $\delta(\cdot)$ or, less stringently, that $\operatorname{var}(y - \delta)$ ($= \operatorname{var} y$) does not depend on $\delta(\cdot)$.
As is evident from the simple identity
\[ y = \delta + (y - \delta), \]
condition (1.3) or (1.4) is equivalent to the condition
\[ y = \delta + e, \tag{1.5} \]
where $e$ is an $N$-dimensional random column vector with $E(e) = 0$. Under condition (1.5),
\[ e = y - \delta = y - E(y), \]
and—aside from a trivial case where $\delta$ has the same value for all $\delta(\cdot) \in \Delta$ (and that value is known) and/or where $e = 0$ (with probability 1)—$e$ is unobservable. Moreover, under condition (1.5), the assumption that the distribution of $y - \delta$ does not depend on $\delta(\cdot)$ is equivalent to an assumption that the distribution of $e$ does not depend on $\delta(\cdot)$, and, similarly, the less stringent assumption that $\operatorname{var}(y - \delta)$ or $\operatorname{var}(y)$ does not depend on $\delta(\cdot)$ is equivalent to an assumption that $\operatorname{var}(e)$ does not depend on $\delta(\cdot)$.
Suppose now that $\Delta$ consists of some or all of those functions (of $u$) that are expressible as linear combinations of the $P$ specified functions $\delta_1(\cdot), \delta_2(\cdot), \ldots, \delta_P(\cdot)$. Then, by definition, a function $\delta(\cdot)$ (of $u$) is a member of $\Delta$ only if (for “all” $u$) $\delta(u)$ is expressible in the form
\[ \delta(u) = b_1\delta_1(u) + b_2\delta_2(u) + \cdots + b_P\delta_P(u), \tag{1.6} \]
where $b_1, b_2, \ldots, b_P$ are scalars (that do not vary with $u$).
Let $x_{ij} = \delta_j(u_i)$ ($i = 1, 2, \ldots, N$), and define $x_j = (x_{1j}, x_{2j}, \ldots, x_{Nj})'$ ($j = 1, 2, \ldots, P$). Further, take $X = (x_1, x_2, \ldots, x_P)$, or equivalently take $X$ to be the matrix with $ij$th element $x_{ij}$. And observe that for a function $\delta(\cdot)$ (of $u$) for which $\delta(u)$ is expressible in the form (1.6), we have that
\[ \delta(u_i) = \sum_{j=1}^{P} x_{ij}b_j \quad (i = 1, 2, \ldots, N) \tag{1.7} \]
and that the $N$-dimensional column vector $\delta$ with elements $\delta(u_1), \delta(u_2), \ldots, \delta(u_N)$ is expressible as
\[ \delta = b_1x_1 + b_2x_2 + \cdots + b_Px_P = Xb, \tag{1.8} \]
where $b = (b_1, b_2, \ldots, b_P)'$.
In light of result (1.7), the assumption that condition (1.1) or (1.2) is satisfied by some (unidentified) function $\delta(\cdot)$ in $\Delta$ can be replaced by the assumption that for parameters $\beta_1, \beta_2, \ldots, \beta_P$ (which are the coefficients of a linear combination in $\Delta$ of unknown identity),
\[ E(y_i) = \sum_{j=1}^{P} x_{ij}\beta_j \quad (i = 1, 2, \ldots, N) \tag{1.9} \]
or, equivalently,
\[ E\Bigl(y_i - \sum_{j=1}^{P} x_{ij}\beta_j\Bigr) = 0 \quad (i = 1, 2, \ldots, N). \tag{1.10} \]
Upon letting $\beta = (\beta_1, \beta_2, \ldots, \beta_P)'$ (which is a parameter vector of unknown value), condition (1.9) is reexpressible as
\[ E(y) = X\beta, \tag{1.11} \]
and condition (1.10) as
\[ E(y - X\beta) = 0. \tag{1.12} \]
Moreover, the assumption that $\operatorname{var}(y - \delta)$ or $\operatorname{var}(y)$ does not depend on $\delta$ can be replaced by the assumption that $\operatorname{var}(y - X\beta)$ ($= \operatorname{var} y$) does not depend on $\beta$. [If $\operatorname{rank}(X) < P$, then, strictly speaking, the assumption that $\operatorname{var}(y - X\beta)$ does not depend on $\beta$ may be a “bit” stronger than the assumption that $\operatorname{var}(y - \delta)$ does not depend on $\delta$.]
As noted earlier, condition (1.3) or (1.4) is equivalent to condition (1.5). Similarly, condition (1.11) or (1.12) is equivalent to the condition
\[ y = X\beta + e, \tag{1.13} \]
where [as in condition (1.5)] $e$ is an $N$-dimensional random column vector with $E(e) = 0$. In nonmatrix notation, condition (1.13) is
\[ y_i = \sum_{j=1}^{P} x_{ij}\beta_j + e_i \quad (i = 1, 2, \ldots, N), \]
where (for $i = 1, 2, \ldots, N$) $e_i$ is the $i$th element of $e$ (which is a random variable with an expected value of 0). Under condition (1.13),
\[ e = y - X\beta = y - E(y). \]
Moreover, under condition (1.13), the assumption that $\operatorname{var}(y - X\beta)$ or $\operatorname{var}(y)$ does not depend on $\beta$ is equivalent to an assumption that $\operatorname{var}(e)$ does not depend on $\beta$.
In what follows, three progressively more general models (for $y$) are defined. Each of these models starts with the assumption that $y$ satisfies condition (1.13), and each is such that $\operatorname{var}(e)$ does not depend on $\beta$. And, in each of them, $\beta$ is assumed to be unrestricted; that is, the parameter space for $\beta$ is taken to be the linear space $\mathbb{R}^P$ comprising all $P$-dimensional column vectors. A model with these characteristics is referred to as a linear model. (Depending on the nature of the restrictions, the model may be referred to as a linear model even if $\beta$ is restricted to a subset of $\mathbb{R}^P$.)
In each of the three progressively more general models, the $N$-dimensional observable random column vector $y = (y_1, y_2, \ldots, y_N)'$ is assumed to be such that
\[ y = X\beta + e, \tag{1.14} \]
where $X$ is the $N \times P$ nonrandom matrix with $ij$th element $x_{ij} = \delta_j(u_i)$, where $\beta = (\beta_1, \beta_2, \ldots, \beta_P)'$ is a $P$-dimensional column vector of unknown parameters, and where $e = (e_1, e_2, \ldots, e_N)'$ is an $N$-dimensional random column vector with $E(e) = 0$. Or, equivalently, $y$ is assumed to be such that
\[ y_i = \sum_{j=1}^{P} x_{ij}\beta_j + e_i \quad (i = 1, 2, \ldots, N). \tag{1.15} \]
The elements $e_1, e_2, \ldots, e_N$ of the vector $e$ constitute what are sometimes (including herein) referred to as residual effects and sometimes referred to as errors.
When assumption (1.14) or (1.15) is coupled with an assumption that
\[ \operatorname{var}(e_i) = \sigma^2 \quad\text{and}\quad \operatorname{cov}(e_i, e_s) = 0 \quad (s > i = 1, 2, \ldots, N), \tag{1.16} \]
or equivalently that
\[ \operatorname{var}(e) = \sigma^2 I, \tag{1.17} \]
we arrive at a model that is to be called the Gauss–Markov model or (for short) the G–M model. Here, $\sigma$ is a (strictly) positive parameter that is functionally unrelated to $\beta$; accordingly, the parameter space of the G–M model is
\[ \{\beta, \sigma : \beta \in \mathbb{R}^P,\; \sigma > 0\}. \tag{1.18} \]
Assumption (1.16) or (1.17) indicates that $e_1, e_2, \ldots, e_N$ have a common standard deviation $\sigma$ and variance $\sigma^2$ and that $e_1, e_2, \ldots, e_N$ are uncorrelated with each other.
A generalization of the G–M model can be obtained by replacing assumption (1.16) or (1.17) with the assumption that
\[ \operatorname{var}(e) = \sigma^2 H, \tag{1.19} \]
where $H = \{h_{ij}\}$ is a symmetric nonnegative definite matrix (whose value is known). This generalization is to be called the Aitken generalization or the Aitken model. It is able to allow for the possibility of nonzero correlations and nonhomogeneous variances. However, it includes an implicit assumption that the correlations are known and that the variances are known up to a constant of proportionality, which limits its usefulness. In the special case where $H = I$, the Aitken model reduces to the G–M model.
A generalization of the G–M model that is considerably more flexible than the Aitken generalization is obtained by replacing assumption (1.16) or (1.17) with the assumption that
\[ \operatorname{var}(e) = V(\theta), \tag{1.20} \]
where $V(\theta)$ is an $N \times N$ symmetric nonnegative definite matrix with $ij$th element $v_{ij}(\theta)$ that (for $i, j = 1, 2, \ldots, N$) is functionally dependent on a $T$-dimensional (column) vector $\theta = (\theta_1, \theta_2, \ldots, \theta_T)'$ of unknown parameters. Here, $\theta$ belongs to a specified subset, say $\Theta$, of $\mathbb{R}^T$. In the special case where $T = 1$, $\Theta = \{\theta : \theta_1 > 0\}$, and $V(\theta) = \theta_1^2 H$, assumption (1.20) reduces to what is essentially assumption (1.19) (with $\theta_1$ in place of $\sigma$)—when $H = I$, there is a further reduction to what is essentially assumption (1.17). Accordingly, when the assumption that the observable random column vector $y$ is such that $y = X\beta + e$ is coupled with assumption (1.20), we obtain what is essentially a further generalization of the Aitken generalization of the G–M model. This generalization, whose parameter space is implicitly taken to be
\[ \{\beta, \theta : \beta \in \mathbb{R}^P,\; \theta \in \Theta\}, \tag{1.21} \]
is to be called the general linear model.


In effect, the G–M, Aitken, and general linear models serve to specify the form of the first-order moments and second-order (central) moments of the joint distribution of the observable random variables $y_1, y_2, \ldots, y_N$. In the case of the G–M model,
\[ E(y) = X\beta \quad\text{and}\quad \operatorname{var}(y) = \operatorname{var}(e) = \sigma^2 I; \tag{1.22} \]
more generally (in the case of the Aitken model),
\[ E(y) = X\beta \quad\text{and}\quad \operatorname{var}(y) = \operatorname{var}(e) = \sigma^2 H; \tag{1.23} \]
and still more generally (in the case of the general linear model),
\[ E(y) = X\beta \quad\text{and}\quad \operatorname{var}(y) = \operatorname{var}(e) = V(\theta). \tag{1.24} \]
Unlike the G–M model, the Aitken and general linear models are able to allow for the possibility that $y_1, y_2, \ldots, y_N$ are correlated and/or have nonhomogeneous variances; however (as in the special case of the G–M model), the variances and covariances of $y_1, y_2, \ldots, y_N$ are functionally unrelated to the expected values of $y_1, y_2, \ldots, y_N$.
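To make the distinction between (1.22) and (1.23) concrete, the following sketch (an illustration only; the model matrix, $\beta$, $\sigma$, and the correlation structure are all arbitrary choices, and NumPy is assumed) simulates a data vector $y$ under a G–M model and under an Aitken model with a simple known matrix $H$.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative model matrix X (N = 6 observations, P = 2): a first-order
# polynomial model, delta_1(u) = 1 and delta_2(u) = u.
u = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(u), u])
beta = np.array([1.0, 0.5])            # arbitrary "true" coefficients
sigma = 0.3

# G-M model: var(e) = sigma^2 I.
e_gm = sigma * rng.standard_normal(len(u))
y_gm = X @ beta + e_gm

# Aitken model: var(e) = sigma^2 H with a known H, here an AR(1)-type
# correlation matrix (an arbitrary illustrative choice).
rho = 0.6
H = rho ** np.abs(np.subtract.outer(np.arange(len(u)), np.arange(len(u))))
C = np.linalg.cholesky(sigma**2 * H)   # C C' = sigma^2 H
y_aitken = X @ beta + C @ rng.standard_normal(len(u))

print(np.round(y_gm, 3))
print(np.round(y_aitken, 3))
```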
Instead of parameterizing the G–M, Aitken, and general linear models in terms of the $P$-dimensional vector $\beta = (\beta_1, \beta_2, \ldots, \beta_P)'$ of coefficients of the $P$ specified functions $\delta_1(\cdot), \delta_2(\cdot), \ldots, \delta_P(\cdot)$, they can be parameterized in terms of an $N$-dimensional (column) vector $\mu = (\mu_1, \mu_2, \ldots, \mu_N)'$ that “corresponds” to the vector $X\beta$. In the alternative parameterization, the model equation (1.14) becomes
\[ y = \mu + e. \tag{1.25} \]
And the parameter space for the G–M model and its Aitken generalization becomes
\[ \{\mu, \sigma : \mu \in \mathcal{C}(X),\; \sigma > 0\}, \tag{1.26} \]
and that for the general linear model becomes
\[ \{\mu, \theta : \mu \in \mathcal{C}(X),\; \theta \in \Theta\}. \tag{1.27} \]
Further, specifying (in the case of the G–M model or its Aitken generalization) that $E(y)$ and $\operatorname{var}(y)$ are expressible in the form (1.22) or (1.23) [subject to the restriction that $\beta$ and $\sigma$ are confined to the space (1.18)] is equivalent to specifying that they are expressible in the form
\[ E(y) = \mu \quad\text{and}\quad \operatorname{var}(y) = \sigma^2 I \tag{1.28} \]
or
\[ E(y) = \mu \quad\text{and}\quad \operatorname{var}(y) = \sigma^2 H, \tag{1.29} \]
respectively [subject to the restriction that $\mu$ and $\sigma$ are confined to the space (1.26)]. Similarly, specifying (in the case of the general linear model) that $E(y)$ and $\operatorname{var}(y)$ are expressible in the form (1.24) [subject to the restriction that $\beta$ and $\theta$ are confined to the space (1.21)] is equivalent to specifying that they are expressible in the form
\[ E(y) = \mu \quad\text{and}\quad \operatorname{var}(y) = V(\theta) \tag{1.30} \]
[subject to the restriction that $\mu$ and $\theta$ are confined to the space (1.27)].
The equation (1.14) is sometimes referred to as the model equation. And the ($N \times P$) matrix $X$, which appears in that equation, is referred to as the model matrix. What distinguishes one G–M, Aitken, or general linear model from another is the choice of the model matrix $X$ and (in the case of the Aitken or general linear model) the choice of the matrix $H$ or the choices made in the specification (up to the value of a vector of parameters) of the variance-covariance matrix $V(\theta)$. Different choices for $X$ are associated with different choices for the functions $\delta_1(\cdot), \delta_2(\cdot), \ldots, \delta_P(\cdot)$. The number ($N$) of rows of $X$ is fixed; it must equal the number of observations. However, the number ($P$) of columns of $X$ (and accordingly the dimension of the parameter vector $\beta$) may vary from one choice of $X$ to another.
The only assumptions inherent in the G–M, Aitken, or general linear model about the distribution of the vector $y$ are those reflected in result (1.22), (1.23), or (1.24), which pertain to the first- and second-order moments. Stronger versions of these models can be obtained by making additional assumptions. One highly convenient and frequently adopted additional assumption is that of taking the distribution of the vector $e$ to be MVN. In combination with assumption (1.17), (1.19), or (1.20), this assumption implies that $e \sim N(0, \sigma^2 I)$, $e \sim N(0, \sigma^2 H)$, or $e \sim N[0, V(\theta)]$. Further, taking the distribution of $e$ to be MVN implies (in light of Theorem 3.5.1) that the distribution of $y$ is also MVN; specifically, it implies that $y \sim N(X\beta, \sigma^2 I)$, $y \sim N(X\beta, \sigma^2 H)$, or $y \sim N[X\beta, V(\theta)]$.
The credibility of the normality assumption varies with the nature of the application. Its credibility would seem to be greatest under circumstances in which each of the quantities $e_1, e_2, \ldots, e_N$ can reasonably be regarded as the sum of a large number of deviations of a similar magnitude. Under such circumstances, the central limit theorem may be “operative.” The credibility of the basic assumptions inherent in the G–M, Aitken, and general linear models is affected by the choice of the functions $\delta_1(\cdot), \delta_2(\cdot), \ldots, \delta_P(\cdot)$. The credibility of these assumptions as well as that of the normality assumption can sometimes be enhanced via a transformation of the data.
The coverage in subsequent chapters (and in subsequent sections of the present chapter) includes numerous results on the G–M, Aitken, and general linear models and much discussion pertaining to those results and to various aspects of the models themselves. Except where otherwise indicated, the notation employed in that coverage is implicitly taken to be that employed in the present section (in the introduction of the models). And in connection with the G–M, Aitken, and general linear models, $\delta(\cdot)$ is subsequently taken to be the function (of $u$) defined by $\delta(u) = \sum_{j=1}^{P} \beta_j\delta_j(u)$.
In making statistical inferences on the basis of a statistical model, it is necessary to express the quantities of interest in terms related to the model. A G–M, Aitken, or general linear model is often used as a basis for making inferences about quantities that are expressible as linear combinations of $\beta_1, \beta_2, \ldots, \beta_P$, that is, ones that are expressible in the form $\lambda'\beta$, where $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_P)'$ is a $P$-dimensional column vector of scalars, or equivalently ones that are expressible in the form $\sum_{j=1}^{P} \lambda_j\beta_j$. More generally, a G–M, Aitken, or general linear model is often used as a basis for making inferences about quantities that are expressible as random variables, each of which has an expected value that is a linear combination of $\beta_1, \beta_2, \ldots, \beta_P$; these quantities may or may not be correlated with $e_1, e_2, \ldots, e_N$ and/or with each other. And a general linear model is sometimes used as a basis for making inferences about quantities that are expressible as functions of $\theta$ (or in the special case of a G–M or Aitken model, for making inferences about $\sigma^2$ or a function of $\sigma^2$).

4.2 Some Specific Types of Gauss–Markov Models (with Examples)

The different choices for a G–M model are associated with different choices for the functions $\delta_1(u), \delta_2(u), \ldots, \delta_P(u)$. These choices determine the form of the function $\delta(u)$; they determine this function up to the values of $\beta_1, \beta_2, \ldots, \beta_P$. In many cases, $\delta(u)$ is taken to be a polynomial in the elements of $u$ (typically a polynomial of low degree). The appropriateness of taking $\delta(u)$ to be such a polynomial can sometimes be enhanced by introducing a transformation of the vector $u$ and by redefining $u$ accordingly [so that $\delta(u)$ is a polynomial in the elements of the transformed vector].
a. Polynomials (in 1 variable)

Suppose (in connection with the G–M model) that $C = 1$, in which case $u = (u_1)$. And let us write $u$ for the variable $u_1$ and also for the ($C = 1$)-dimensional vector $u$. Further, suppose that for $j = 1, 2, \ldots, P$,
\[ \delta_j(u) = (u - a)^{j-1}, \]
where $a$ is a specified value of $u$—by convention, $(u - a)^0 = 1$ for all $u$ (including $u = a$). Then,
\[ \delta(u) = \beta_1 + \beta_2 (u - a) + \beta_3 (u - a)^2 + \cdots + \beta_P (u - a)^{P-1}, \qquad (2.1) \]
which is a polynomial in $u - a$ (and also in $u$) of degree $P - 1$.

Now, for $k = 1, 2, \ldots,$ write $\delta^{(k)}(u)$ for the $k$th-order derivative of $\delta(\cdot)$ at an arbitrary point $u$ [in the interior of the domain of $\delta(\cdot)$]. Then, for $k = 2, 3, \ldots, P$,
\[ \delta^{(k-1)}(a) = (k-1)!\,\beta_k \qquad (2.2) \]
[assuming that $a$ is an interior point of the domain of $\delta(\cdot)$] and, more generally,
\[ \delta^{(k-1)}(u) = (k-1)!\,\beta_k + \sum_{j=k+1}^{P} (j-1)(j-2)\cdots(j-k+1)\,(u - a)^{j-k}\,\beta_j, \qquad (2.3) \]
as is easily verified. Thus, the derivatives of $\delta(\cdot)$ at the point $a$ are scalar multiples of the parameters $\beta_2, \beta_3, \ldots, \beta_P$—the parameter $\beta_1$ is the value $\delta(a)$ of $\delta(u)$ at $u = a$. And the $(k-1)$th derivative of $\delta(\cdot)$ at an arbitrary interior point $u$ is a linear combination of $\beta_k, \beta_{k+1}, \ldots, \beta_P$.
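As a quick numerical illustration of relation (2.2) (an added sketch, not part of the original text), the following Python fragment evaluates the derivatives of a cubic $\delta(\cdot)$ at $u = a$ for an arbitrary, purely illustrative choice of $\beta_1, \ldots, \beta_4$ and confirms that the $(k-1)$th derivative equals $(k-1)!\,\beta_k$.

    import math
    import numpy as np

    # Sketch (added illustration): numerical check of relation (2.2) for a cubic (P = 4),
    # with an arbitrary illustrative parameter vector beta and centering value a.
    beta = np.array([2.0, -1.0, 0.5, 3.0])        # beta_1, ..., beta_4 (hypothetical values)
    # delta(u) = beta_1 + beta_2 (u - a) + beta_3 (u - a)**2 + beta_4 (u - a)**3,
    # written as a polynomial in the shifted variable w = u - a.
    delta = np.polynomial.Polynomial(beta)
    for k in range(2, 5):
        deriv_at_a = delta.deriv(k - 1)(0.0)      # (k-1)th derivative of delta at u = a (w = 0)
        print(k, deriv_at_a, math.factorial(k - 1) * beta[k - 1])   # the two agree, per (2.2)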
TABLE 4.1. Lethal doses of ouabain in cats for each of four rates of injection.

  Rate   Lethal doses for individual cats
   1     5, 9, 11, 13, 14, 16, 17, 20, 22, 28, 31, 31
   2     3, 6, 22, 27, 27, 28, 28, 37, 40, 42, 50
   4     34, 34, 38, 40, 46, 58, 60, 60, 65
   8     51, 56, 62, 63, 70, 73, 76, 89, 92
Taking $\delta(u)$ to be a relatively low-degree polynomial is most likely to be satisfactory in a situation where $u_1, u_2, \ldots, u_N$ (and other "relevant" values of $u$) are confined to a range that is not overly large. Taylor's theorem (e.g., Bartle 1976, sec. 28; Bartle and Sherbert 2011, sec. 6.4) indicates that, over a limited range, low-degree polynomials provide "reasonable" approximations to a broad class of functions.

In the special case $P = 2$, expression (2.1) simplifies to
\[ \delta(u) = \beta_1 + \beta_2 (u - a), \qquad (2.4) \]
and is reexpressible as
\[ \delta(u) = (\beta_1 - \beta_2 a) + \beta_2 u. \qquad (2.5) \]
In this special case, $\delta(u)$ is a polynomial of degree 1, and the model is sometimes referred to as first-order. When $a = 0$, expressions (2.4) and (2.5) reduce to
\[ \delta(u) = \beta_1 + \beta_2 u. \qquad (2.6) \]

There are cases where $\delta(u)$ is not of the form (2.1) but where $\delta(u)$ can be reexpressed in that form by introducing a transformation of $u$ and by redefining $u$ accordingly. Suppose, for example, that $u$ is strictly positive and that
\[ \delta(u) = \beta_1 + \beta_2 \log u \qquad (2.7) \]
(where $\beta_1$ and $\beta_2$ are parameters). Clearly, expression (2.7) is not a polynomial in $u$. However, upon introducing a transformation from $u$ to the variable $u^*$ defined by $u^* = \log u$, $\delta(u)$ can be reexpressed as a function of $u^*$; specifically, it can be reexpressed in terms of the function $\delta^*(\cdot)$ defined by
\[ \delta^*(u^*) = \beta_1 + \beta_2 u^*. \]
Thus, when $u$ is redefined to be $u^*$, we can take $\delta(u)$ to be of the form
\[ \delta(u) = \beta_1 + \beta_2 u. \]
b. Example: ouabain data

Snedecor and Cochran (1989) published data on the dose of ouabain that proves to be lethal when injected intravenously into a cat. To learn how the lethal dose is affected by the rate of injection, each of 41 cats was injected at one of four rates. The data are reproduced in Table 4.1 (with dose and rate being recorded in the same units and same way as by Snedecor and Cochran).

We could consider applying to these data a G–M model in which $N = 41$, in which $y_1, y_2, \ldots, y_N$ are the observable random variables whose values are the lethal doses (or perhaps the logarithms of the lethal doses), in which $C = 1$, and (adopting the same notation for the special case $C = 1$ as in Subsection a) in which $u$ is the rate of injection (or perhaps the logarithm of the rate). And we could consider taking $\delta(u)$ to be the degree-$(P-1)$ polynomial (2.1). Different choices for $P$ correspond to different versions of the model, more than one of which may be worthy of consideration—the design of the study that gave rise to these data suggests that a choice for $P$ of no more than 4 would have been regarded as adequate. The choice of the value $a$ in expression (2.1) is more or less a matter of convenience.
Now, let us consider (for purposes of illustration) the matrix representation $y = X\beta + e$ of the observable random column vector $y$ in an application of the G–M model in which the observed value of $y$ comprises the lethal doses, in which $C = 1$, in which $u$ is the rate of injection, and in which $\delta(u)$ is the special case of the degree-$(P-1)$ polynomial (2.1) (in $u - a$) obtained by setting $P = 5$ and $a = 0$. There are $N = 41$ data points. Suppose that the data points are numbered $1, 2, \ldots, 41$ by proceeding row by row in Table 4.1 from the top to the bottom and by proceeding from left to right within each row. Then, $y' = (y_1', y_2', y_3', y_4')$, where
\[ y_1' = (y_1, y_2, \ldots, y_{12}) = (5, 9, 11, 13, 14, 16, 17, 20, 22, 28, 31, 31), \]
\[ y_2' = (y_{13}, y_{14}, \ldots, y_{23}) = (3, 6, 22, 27, 27, 28, 28, 37, 40, 42, 50), \]
\[ y_3' = (y_{24}, y_{25}, \ldots, y_{32}) = (34, 34, 38, 40, 46, 58, 60, 60, 65), \ \text{and} \]
\[ y_4' = (y_{33}, y_{34}, \ldots, y_{41}) = (51, 56, 62, 63, 70, 73, 76, 89, 92). \]
Further, $\beta' = (\beta_1, \beta_2, \beta_3, \beta_4, \beta_5)$. Because the data are arranged in groups, with each group having a common value of $u$ (corresponding to a common rate of injection), the model matrix $X$ has a succinct representation. Specifically, this matrix, whose $ij$th element $x_{ij}$ corresponds to the $i$th datum and to the $j$th element $\beta_j$ of $\beta$, is given by
\[
X = \begin{pmatrix}
1_{12} & 1_{12} & 1_{12} & 1_{12} & 1_{12} \\
1_{11} & 2 \cdot 1_{11} & 4 \cdot 1_{11} & 8 \cdot 1_{11} & 16 \cdot 1_{11} \\
1_{9} & 4 \cdot 1_{9} & 16 \cdot 1_{9} & 64 \cdot 1_{9} & 256 \cdot 1_{9} \\
1_{9} & 8 \cdot 1_{9} & 64 \cdot 1_{9} & 512 \cdot 1_{9} & 4096 \cdot 1_{9}
\end{pmatrix}.
\]
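The construction of $X$ just described can be sketched in a few lines of code. The following Python fragment (an added illustration, not taken from the source) builds the $41 \times 5$ matrix from the four injection rates and the group sizes, together with the vector of lethal doses.

    import numpy as np

    # Sketch (added illustration): build the 41 x 5 model matrix X of the ouabain example
    # (P = 5, a = 0), in which each row is (1, u, u**2, u**3, u**4) for that cat's rate u.
    rates = [1, 2, 4, 8]
    counts = [12, 11, 9, 9]                       # numbers of cats injected at each rate
    u = np.repeat(rates, counts)                  # the N = 41 values of the explanatory variable
    X = np.vander(u, N=5, increasing=True)        # columns: u**0, u**1, ..., u**4

    y = np.array([5, 9, 11, 13, 14, 16, 17, 20, 22, 28, 31, 31,
                  3, 6, 22, 27, 27, 28, 28, 37, 40, 42, 50,
                  34, 34, 38, 40, 46, 58, 60, 60, 65,
                  51, 56, 62, 63, 70, 73, 76, 89, 92], dtype=float)

    print(X.shape)                 # (41, 5)
    print(X[[0, 12, 23, 32]])      # one representative row per injection rate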
c. Polynomials (in general)

Let us now extend (in connection with the G–M model) the development in Subsection a (which pertains to the special case where $C = 1$) to the general case where $C$ is possibly greater than 1. Taking $a_1, a_2, \ldots, a_C$ to be specified values of $u_1, u_2, \ldots, u_C$, respectively, suppose that, for $j = 1, 2, \ldots, P$,
\[ \delta_j(u) = (u_1 - a_1)^{k_{j1}} (u_2 - a_2)^{k_{j2}} \cdots (u_C - a_C)^{k_{jC}}, \qquad (2.8) \]
where $k_{j1}, k_{j2}, \ldots, k_{jC}$ are nonnegative integers. To avoid trivialities (in the form of duplicative expressions), suppose that, for $s > j = 1, 2, \ldots, P$, $k_{st} \ne k_{jt}$ for one or more values of $t$.

The setting is such that the function $\delta(u)$ is a polynomial (in $u_1 - a_1, u_2 - a_2, \ldots, u_C - a_C$), the coefficients of which are $\beta_1, \beta_2, \ldots, \beta_P$. The $j$th term of this polynomial is of degree $\sum_{t=1}^{C} k_{jt}$, and the polynomial itself is of degree $\max_j \sum_{t=1}^{C} k_{jt}$. Often, the first term is of degree 0, that is, the first term equals $\beta_1$.

A simple (but important) special case is that where $P = C + 1$ and where $k_{11} = k_{12} = \cdots = k_{1C} = 0$ and, for $j = 2, 3, \ldots, C+1$,
\[ k_{jt} = \begin{cases} 1 & \text{for } t = j-1, \\ 0 & \text{for } t \ne j-1. \end{cases} \qquad (2.9) \]
In that special case,
\[ \delta(u) = \beta_1 + \beta_2 (u_1 - a_1) + \beta_3 (u_2 - a_2) + \cdots + \beta_{C+1} (u_C - a_C), \qquad (2.10) \]
which is a "polynomial" of degree 1, and the model is sometimes referred to as first-order. When $a_1 = a_2 = \cdots = a_C = 0$, expression (2.10) reduces to
\[ \delta(u) = \beta_1 + \beta_2 u_1 + \beta_3 u_2 + \cdots + \beta_{C+1} u_C. \qquad (2.11) \]

Another special case worthy of mention is that where $P = C + 1 + [C(C+1)/2]$, where (as in the previous special case) $k_{11} = k_{12} = \cdots = k_{1C} = 0$ and, for $j = 2, 3, \ldots, C+1$, $k_{j1}, k_{j2}, \ldots, k_{jC}$ are given by expression (2.9), and where, for $j = C+2, C+3, \ldots, C+1+[C(C+1)/2]$, $k_{j1}, k_{j2}, \ldots, k_{jC}$ are such that $\sum_{t=1}^{C} k_{jt} = 2$. In that special case, $\delta(u)$ is a polynomial of degree 2. That polynomial is obtainable from the degree-1 polynomial (2.10) by adding $C(C+1)/2$ terms of degree 2. Each of these degree-2 terms is expressible in the form
\[ \beta_j (u_t - a_t)(u_{t'} - a_{t'}), \qquad (2.12) \]
where $j$ is an integer between $C+2$ and $C+1+[C(C+1)/2]$, inclusive, and where $t$ is an integer between 1 and $C$, inclusive, and $t'$ an integer between $t$ and $C$, inclusive—$j$, $t$, and $t'$ are such that either $k_{jt} = 2$ and $t' = t$ or $k_{jt} = k_{jt'} = 1$ and $t' > t$. It is convenient to express the degree-2 terms in the form (2.12) and to adopt a modified notation for the coefficients $\beta_{C+2}, \beta_{C+3}, \ldots, \beta_{C+1+[C(C+1)/2]}$ in which the coefficient in term (2.12) is identified by the values of $t$ and $t'$, that is, in which $\beta_{tt'}$ is written in place of $\beta_j$. Accordingly,
\[ \delta(u) = \beta_1 + \sum_{j=1}^{C} \beta_{j+1} (u_j - a_j) + \sum_{t=1}^{C} \sum_{t'=t}^{C} \beta_{tt'} (u_t - a_t)(u_{t'} - a_{t'}). \qquad (2.13) \]
When $\delta(u)$ is of the form (2.13), the model is sometimes referred to as second-order. When $a_1 = a_2 = \cdots = a_C = 0$, expression (2.13) reduces to
\[ \delta(u) = \beta_1 + \sum_{j=1}^{C} \beta_{j+1} u_j + \sum_{t=1}^{C} \sum_{t'=t}^{C} \beta_{tt'} u_t u_{t'}. \qquad (2.14) \]
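For a concrete sense of how many columns a second-order model entails, the following Python sketch (an added illustration; "second_order_design" is a hypothetical helper name, not a library function) assembles the intercept, the $C$ linear terms, and the $C(C+1)/2$ degree-2 terms of (2.13) for an arbitrary $N \times C$ array of explanatory-variable values.

    import itertools
    import numpy as np

    # Sketch (added illustration): assemble the columns of a second-order (degree-2) model
    # matrix as in (2.13).  U is an N x C array whose i-th row is u_i; a holds the centering
    # values a_1, ..., a_C.
    def second_order_design(U, a=None):
        U = np.asarray(U, dtype=float)
        N, C = U.shape
        W = U - (0.0 if a is None else np.asarray(a, dtype=float))
        cols = [np.ones(N)]                                   # the degree-0 term
        cols += [W[:, j] for j in range(C)]                   # the C degree-1 terms
        cols += [W[:, t] * W[:, s]                            # the C(C+1)/2 degree-2 terms
                 for t, s in itertools.combinations_with_replacement(range(C), 2)]
        return np.column_stack(cols)

    # With C = 3 explanatory variables, P = 1 + 3 + 6 = 10 columns.
    X = second_order_design(np.random.default_rng(0).normal(size=(20, 3)))
    print(X.shape)   # (20, 10)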
d. Example: cement data

Hald (1952, p. 647) published data from an experimental investigation of how the heat that evolves during the hardening of cement varies with the respective amounts (in the clinkers from which the cement is produced) of the following four compounds: tricalcium aluminate, tricalcium silicate, tetracalcium aluminoferrite, and β-dicalcium silicate. Heat was measured as of 180 days and recorded in units of calories per gram of cement, and the amount of each of the four compounds was recorded as a percentage of the weight of the clinkers. This process was carried out for each of 13 batches of cement. These data, the original source of which was Table I of Woods, Steinour, and Starke (1932), are reproduced in Table 4.2. They could possibly be regarded as suitable for the application of a G–M model in which $N = 13$, in which the value of $y$ comprises the 13 heat measurements, in which $C = 4$, and in which $u_1$, $u_2$, $u_3$, and $u_4$ are the respective amounts of the first through fourth compounds. Among the various versions of the model that might be applied to the cement data is the first-order model, in which $\delta(u)$ is expressible in the form (2.10) or (2.11).

TABLE 4.2. The heat evolved during hardening and the respective amounts of four compounds for each of 13 batches of cement (Hald 1952, p. 647).

  Heat evolved       Amount of    Amount of    Amount of        Amount of
  during hardening   tricalcium   tricalcium   tetracalcium     β-dicalcium
                     aluminate    silicate     aluminoferrite   silicate
       78.5               7           26             6              60
       74.3               1           29            15              52
      104.3              11           56             8              20
       87.6              11           31             8              47
       95.9               7           52             6              33
      109.2              11           55             9              22
      102.7               3           71            17               6
       72.5               1           31            22              44
       93.1               2           54            18              22
      115.9              21           47             4              26
       83.8               1           40            23              34
      113.3              11           66             9              12
      109.4              10           68             8              12
e. Example: lettuce data

Hader, Harward, Mason, and Moore (1957) reported the results of an experimental study of how the yield of lettuce plants is affected by the amounts of three trace minerals: copper (Cu), molybdenum (Mo), and iron (Fe). The plants were grown in a medium in containers of 3 plants each. Each container was assigned one of 5 levels of each of the 3 trace minerals; for Fe, the lowest and highest levels were 0.0025 ppm and 25 ppm, and for both Cu and Mo, the lowest and highest levels were 0.0002 ppm and 2 ppm. The 5 levels of each mineral were reexpressed on a transformed scale: the ppm were replaced by a linear function of the logarithm of the ppm chosen so that the transformed values of the highest and lowest levels were $\pm 8^{1/4}$. Yield was recorded as grams of dry weight. The results of the experimental study are reproduced in Table 4.3.

A G–M model could be applied to these data. We could take $N = 20$, take the values of $y_1, y_2, \ldots, y_N$ to be the yields, take $C = 3$, and take $u_1$, $u_2$, and $u_3$ to be the transformed amounts of Cu, Mo, and Fe, respectively. And we could consider taking $\delta(u)$ to be a polynomial (in $u_1$, $u_2$, and $u_3$). The levels of Cu, Mo, and Fe represented in the study covered what was considered to be a rather wide range. Accordingly, the first-order model, in which $\delta(u)$ is taken to be the degree-1 polynomial (2.10) or (2.11), is not suitable. The second-order model, in which $\delta(u)$ is taken to be the degree-2 polynomial (2.13) or (2.14), would seem to be a much better choice.
4.3 Regression

Suppose there are $N$ data points $y_1, y_2, \ldots, y_N$ and that (for $i = 1, 2, \ldots, N$) $y_i$ is accompanied by the corresponding value $u_i$ of a vector $u = (u_1, u_2, \ldots, u_C)'$ of $C$ explanatory variables. Further, regard $y_1, y_2, \ldots, y_N$ as $N$ values of a variable $y$. In some applications, both $u$ and $y$ can reasonably be regarded as random, that is, the $(C+1)$-dimensional column vector $\binom{y}{u}$ can reasonably be regarded as a random vector; and the $(C+1)$-dimensional vectors $\binom{y_1}{u_1}, \binom{y_2}{u_2}, \ldots, \binom{y_N}{u_N}$ can reasonably be regarded as a random sample of size $N$ from the distribution of $\binom{y}{u}$. Assume that $\binom{y}{u}$ and $\binom{y_1}{u_1}, \binom{y_2}{u_2}, \ldots, \binom{y_N}{u_N}$ can be so regarded.
TABLE 4.3. Yields of lettuce from pots containing various amounts of Cu, Mo, and Fe (Hader, Harward, Mason, and Moore 1957, p. 63; Moore, Harward, Mason, Hader, Lott, and Jackson 1957, p. 67).

                 Transformed amount
   Yield       Cu         Mo         Fe
   21.42        1          1         0.4965
   15.92        1          1         0.4965
   22.81        1          1         0.4965
   14.90        1          1         1
   14.95        1          1         0.4965
    7.83        1          1         1
   19.90        1          1         1
    4.68        1          1         1
    0.20     8^(1/4)       0         0
   17.65     8^(1/4)       0         0
   18.16        0       8^(1/4)      0
   25.39        0       8^(1/4)      0
   11.99        0          0      8^(1/4)
    7.37        0          0      8^(1/4)
   22.22        0          0         0
   19.49        0          0         0
   22.76        0          0         0
   24.27        0          0         0
   27.88        0          0         0
   27.53        0          0         0
Then, in effect, $\binom{y_1}{u_1}, \binom{y_2}{u_2}, \ldots, \binom{y_N}{u_N}$ are the realizations of $N$ statistically independent $(C+1)$-dimensional random vectors, each of which is distributed identically to $\binom{y}{u}$. In what follows, let us (for the sake of simplicity and as a matter of convenience) use the same notation for these random vectors as for their realizations.

Let us consider the distribution of $y_1, y_2, \ldots, y_N$ conditional on $u_1, u_2, \ldots, u_N$, doing so with the ultimate objective of establishing a connection to the G–M model. Take $\delta(\cdot)$ to be the function (of $u$) defined by
\[ \delta(u) = E(y \mid u). \]
And observe that (with probability 1)
\[ E[y - \delta(u) \mid u] = 0. \qquad (3.1) \]
Further, take $v(\cdot)$ to be the function (of $u$) defined by
\[ v(u) = \mathrm{var}[y - \delta(u) \mid u] \]
or, equivalently, by $v(u) = \mathrm{var}(y \mid u)$. The nature of the functions $\delta(\cdot)$ and $v(\cdot)$ depends on the nature of the distribution of the vector $\binom{y}{u}$.

Corresponding to the function $\delta(\cdot)$ is the decomposition
\[ y_i = \delta(u_i) + e_i, \]
where $e_i = y_i - \delta(u_i)$ ($i = 1, 2, \ldots, N$). That $\binom{y_1}{u_1}, \binom{y_2}{u_2}, \ldots, \binom{y_N}{u_N}$ are distributed independently implies that, conditionally on $u_1, u_2, \ldots, u_N$ (as well as unconditionally), $e_1, e_2, \ldots, e_N$ are distributed independently. A further implication is that (for $i = 1, 2, \ldots, N$) a conditional distribution of $e_i$ given $u_i$ is a conditional distribution of $e_i$ given $u_1, u_2, \ldots, u_N$. Thus, conditionally on $u_1, u_2, \ldots, u_N$, we have [in light of result (3.1) and the very definition of the function $v(\cdot)$] that
\[ E(e_i) = 0 \quad \text{and} \quad \mathrm{var}(e_i) = v(u_i) \quad (i = 1, 2, \ldots, N) \]
and we also have that
\[ \mathrm{cov}(e_i, e_s) = 0 \quad (s > i = 1, 2, \ldots, N). \]
Upon defining $e = (e_1, e_2, \ldots, e_N)'$, these results can be restated in matrix notation. Conditionally on $u_1, u_2, \ldots, u_N$, we have that
\[ E(e) = 0 \quad \text{and} \quad \mathrm{var}(e) = \mathrm{diag}[v(u_1), v(u_2), \ldots, v(u_N)]. \]

Let us now specialize to the case where the $(C+1)$-dimensional vector $\binom{y}{u}$ has an MVN distribution (with a nonsingular variance-covariance matrix). In that special case, we have (in light of the results of Section 3.5l) that the distribution of $y$ conditional on $u$ is normal with mean
\[ E(y \mid u) = E(y) + \mathrm{cov}(y, u)(\mathrm{var}\,u)^{-1}[u - E(u)] \qquad (3.2) \]
and variance
\[ \mathrm{var}(y \mid u) = \mathrm{var}(y) - \mathrm{cov}(y, u)(\mathrm{var}\,u)^{-1}\mathrm{cov}(u, y). \qquad (3.3) \]
Further, based on result (3.2), we have that
\[ \delta(u) = \beta_1 + \beta_2 (u_1 - a_1) + \beta_3 (u_2 - a_2) + \cdots + \beta_{C+1} (u_C - a_C), \qquad (3.4) \]
where $a_1, a_2, \ldots, a_C$ are arbitrarily specified values of $u_1, u_2, \ldots, u_C$, respectively, where
\[ (\beta_2, \beta_3, \ldots, \beta_{C+1}) = \mathrm{cov}(y, u)(\mathrm{var}\,u)^{-1}, \]
and where
\[ \beta_1 = E(y) + \sum_{j=2}^{C+1} \beta_j [a_{j-1} - E(u_{j-1})]. \]
[If (for $j = 1, 2, \ldots, C$) $a_j = E(u_j)$, then $\beta_1 = E(y)$; if $a_1 = a_2 = \cdots = a_C = 0$, then $\beta_1 = E(y) - \sum_{j=2}^{C+1} \beta_j E(u_{j-1})$.] And, based on result (3.3), we have that
\[ v(u) = \sigma^2, \qquad (3.5) \]
where
\[ \sigma^2 = \mathrm{var}(y) - \mathrm{cov}(y, u)(\mathrm{var}\,u)^{-1}\mathrm{cov}(u, y). \]
We conclude that when the distribution of the vector $\binom{y}{u}$ is MVN, the joint distribution of $y_1, y_2, \ldots, y_N$ obtained by regarding $\binom{y_1}{u_1}, \binom{y_2}{u_2}, \ldots, \binom{y_N}{u_N}$ as a random sample from the distribution of $\binom{y}{u}$ and by conditioning on $u_1, u_2, \ldots, u_N$ is identical to that obtained by adopting a first-order G–M model and by taking the joint distribution of the residual effects $e_1, e_2, \ldots, e_N$ of the G–M model to be MVN. Moreover, when the distribution of $\binom{y}{u}$ is MVN, the parameters $\beta_1, \beta_2, \ldots, \beta_{C+1}$ and $\sigma$ of that first-order G–M model are expressible in terms of the mean vector and the variance-covariance matrix of $\binom{y}{u}$.

There are other distributions of $\binom{y}{u}$ (besides the MVN) for which the joint distribution of $y_1, y_2, \ldots, y_N$ obtained by regarding $\binom{y_1}{u_1}, \binom{y_2}{u_2}, \ldots, \binom{y_N}{u_N}$ as a random sample from the distribution of $\binom{y}{u}$ and by conditioning on $u_1, u_2, \ldots, u_N$ is consistent with the adoption of a first-order G–M model. Whether or not the joint distribution obtained in that way is consistent with the adoption of a first-order G–M model depends on the nature of the distribution of $\binom{y}{u}$ solely through the nature of the conditional distribution of $y$ given $u$; in fact, it depends solely on the nature of the mean and variance of the conditional distribution of $y$ given $u$. The nature of the marginal distribution of $u$ is without relevance (to that particular issue).
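A small simulation can make results (3.2) and (3.4) concrete. The Python sketch below is an added illustration; the mean vector and variance-covariance matrix are arbitrary and are not taken from the text. It draws a large sample from an MVN distribution of $(y, u)$, computes $\mathrm{cov}(y, u)(\mathrm{var}\,u)^{-1}$ directly, and checks that an ordinary least-squares fit of $y$ on $u$ recovers essentially the same first-order coefficients.

    import numpy as np

    # Sketch (added illustration): a numerical check of results (3.2) and (3.4).
    rng = np.random.default_rng(1)
    mu = np.array([10.0, 2.0, -1.0])                  # means of (y, u_1, u_2), so C = 2
    S = np.array([[4.0, 1.5, 0.8],                    # variance-covariance matrix of (y, u_1, u_2)
                  [1.5, 2.0, 0.3],
                  [0.8, 0.3, 1.0]])
    z = rng.multivariate_normal(mu, S, size=50_000)
    y, U = z[:, 0], z[:, 1:]

    slope = S[0, 1:] @ np.linalg.inv(S[1:, 1:])       # cov(y,u) (var u)^{-1} = (beta_2, beta_3)
    intercept = mu[0] - slope @ mu[1:]                # beta_1 when a_1 = a_2 = 0
    print(intercept, slope)

    # An ordinary least-squares fit of y on u recovers essentially the same coefficients.
    coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(y)), U]), y, rcond=None)
    print(coef)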
It is instructive to consider the implications of the expression for $E(y \mid u)$ given by equation (3.2). Suppose that $C = 1$. Then, writing $u$ for $u_1$, equation (3.2) can be reexpressed in the form
\[ \frac{E(y \mid u) - E(y)}{\sqrt{\mathrm{var}\,y}} = \mathrm{corr}(y, u)\,\frac{u - E(u)}{\sqrt{\mathrm{var}\,u}}. \qquad (3.6) \]
Excluding the limiting case where $|\mathrm{corr}(y, u)| = 1$, there is an implication that, for any particular value of $u$, $E(y \mid u)$ is less extreme than $u$ in the sense that (in units of standard deviations) it is closer to $E(y)$ than $u$ is to $E(u)$.

Roughly speaking, observations on $y$ corresponding to any particular value of $u$ are on average less extreme than the value of $u$. This phenomenon was recognized early on by Sir Francis Galton, who determined (from data on a human population) that the heights of the offspring of very tall (or very short) parents, while typically above (or below) average, tend to be less extreme than the heights of the parents. It is a phenomenon that has come to be known as "regression to the mean" or simply as regression. This term evolved from the term "regression (or reversion) towards mediocrity" introduced by Galton.

Some authors reserve the use of the term regression for situations (like that under consideration in the present section) where the explanatory variables can reasonably be regarded as realizations of random variables (e.g., Graybill 1976; Rao 1973). This would seem to be more or less in keeping with the original meaning of the term. However, over time, the term regression has come to be used much more broadly. In particular, it has become common to use the term linear regression almost synonymously with what is being referred to herein as the G–M model, with the possible proviso that the explanatory variables be continuous. This broader usage is inclusive enough to cover a study (like that which produced the lettuce data) where the values of the explanatory variables have been determined systematically as part of a designed experiment.
4.4 Heteroscedastic and Correlated Residual Effects

In the G–M model, the residual effects $e_1, e_2, \ldots, e_N$ are regarded as random variables that, in addition to having expected values of 0, are assumed to be homoscedastic (i.e., to have the same variance) and to be uncorrelated. There are applications for which these assumptions are unrealistic. The Aitken generalization of the G–M model and (to a considerably greater extent) the general linear model are more flexible. The residual effects in the Aitken generalization and in the general linear model can be heteroscedastic or correlated (or both).

There are certain types of heteroscedasticity and certain correlation patterns that are relatively common and that can be readily accommodated within the framework of the Aitken generalization or the general linear model. In the present section, an attempt is made to identify and discuss some of the more basic types of heteroscedasticity and some of the more basic correlation patterns.

In the Aitken generalization or the general linear model, the variance-covariance matrix of the vector $e$ of residual effects is of the form $\sigma^2 H$ or $V(\theta)$, respectively. By definition, the elements $h_{ij}$ ($i, j = 1, 2, \ldots, N$) of $H$ are known constants, and the elements $v_{ij}(\theta)$ ($i, j = 1, 2, \ldots, N$) of $V(\theta)$ are known functions of the parameter vector $\theta$. These constants or functions may depend on the $N$ values $u_1, u_2, \ldots, u_N$ of the vector $u = (u_1, u_2, \ldots, u_C)'$ of explanatory variables, though any such dependence is suppressed in the notation.
a. Heteroscedastic residual effects

There are situations where the residual effects $e_1, e_2, \ldots, e_N$ in the model equation (1.14) or equations (1.15) can reasonably be regarded as uncorrelated, but cannot reasonably be regarded as homoscedastic (at least not completely so). Three situations of this sort are as follows.

Group averages. Suppose that $y_1, y_2, \ldots, y_N$ follow a G–M model. Suppose also that, among the $N$ rows of the model matrix $X$, there are only $K$ distinct rows, say the $i_1, i_2, \ldots, i_K$th rows, $(x_{i_k 1}, x_{i_k 2}, \ldots, x_{i_k P})$ ($k = 1, 2, \ldots, K$). And, for $k = 1, 2, \ldots, K$, let $I_k$ represent the subset of the integers $1, 2, \ldots, N$ such that $i \in I_k$ if the $i$th row of $X$ equals $(x_{i_k 1}, x_{i_k 2}, \ldots, x_{i_k P})$, and denote by $N_k$ the size of this subset. Further, define $\bar{y}_k = (1/N_k) \sum_{i \in I_k} y_i$.

Now, suppose that the individual observations $y_1, y_2, \ldots, y_N$ are discarded, but that the averages $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_K$ are retained—if the residual effects are jointly normal, then it could be argued, on the grounds of sufficiency, that there is no need to retain anything other than $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_K$ and the sum of squares $\sum_{i=1}^{N} y_i^2$. And suppose that $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_K$ are then regarded as the data. It is a simple exercise to show that, for $k = 1, 2, \ldots, K$, $E(\bar{y}_k) = \sum_{j=1}^{P} x_{i_k j}\,\beta_j$ and $\mathrm{var}(\bar{y}_k) = \sigma^2/N_k$ and that, for $k' \ne k = 1, 2, \ldots, K$, $\mathrm{cov}(\bar{y}_k, \bar{y}_{k'}) = 0$. Thus, $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_K$ follow an Aitken generalization of a G–M model in which, taking $y$ to be the $K$-dimensional vector $(\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_K)'$, the model matrix is the $K \times P$ matrix whose $k$th row is the $i_k$th row of the original model matrix, and the matrix $H$ is the diagonal matrix $\mathrm{diag}(1/N_1, 1/N_2, \ldots, 1/N_K)$—the parameters $\beta_1, \beta_2, \ldots, \beta_P$ and $\sigma$ of this model are "identical" to those of the original (G–M) model (i.e., the model for the individual observations $y_1, y_2, \ldots, y_N$).

As a simple example, consider the ouabain data of Section 4.2b. Suppose that the lethal doses for the 41 cats follow a G–M model in which the rate of injection is the sole explanatory variable and in which $\delta(\cdot)$ is a polynomial. Then, the model matrix has 4 distinct rows, corresponding to the 4 rates of injection: 1, 2, 4, and 8. The numbers of cats injected at those 4 rates were 12, 11, 9, and 9, respectively. The average lethal doses for the 4 rates would follow an Aitken model, with a model matrix that has 4 rows (which are the distinct rows of the original model matrix and whose lengths and entries depend on the choice of polynomial) and with $H = \mathrm{diag}(1/12, 1/11, 1/9, 1/9)$.
Within-group homoscedasticity. There are situations where the residual effects $e_1, e_2, \ldots, e_N$ cannot reasonably be regarded as homoscedastic, but where they can be partitioned into some (hopefully modest) number of mutually exclusive and exhaustive subsets or "groups," each of which consists of residual effects that are thought to be homoscedastic. Suppose there are $K$ such groups and that (for purposes of identification) they are numbered $1, 2, \ldots, K$. And (for $k = 1, 2, \ldots, K$) denote by $I_k$ the subset of the integers $1, 2, \ldots, N$ defined by $i \in I_k$ if the $i$th residual effect $e_i$ is a member of the $k$th group.

Let us assume the existence of a function, say $\kappa(u)$, of $u$ (the vector of explanatory variables) whose value $\kappa(u_i)$ for $u = u_i$ is as follows: $\kappa(u_i) = k$ if $i \in I_k$. This assumption entails essentially no loss of generality. If necessary, it can be satisfied by introducing an additional explanatory variable. In particular, it can be satisfied by including an explanatory variable whose $i$th value (the value of the explanatory variable corresponding to $y_i$) equals $k$ for every $i \in I_k$ ($k = 1, 2, \ldots, K$). It is worth noting that there is nothing in the formulation of the G–M model (or its Aitken generalization) or in the general linear model requiring that $\delta(u)$ (whose values for $u = u_1, u_2, \ldots, u_N$ are the expected values of $y_1, y_2, \ldots, y_N$) depend nontrivially on every component of $u$.

Consider, for example, the case of the ouabain data. For those data, we could conceivably define 4 groups of residual effects, corresponding to the 4 rates of injection, and assume that the residual effects are homoscedastic within a group but (contrary to what is inherent in the G–M model) not necessarily homoscedastic across groups. Then, assuming (as before) that the rate of injection is the sole explanatory variable $u_1$ (and writing $u$ for $u_1$), we could choose the function $\kappa(u)$ so that $\kappa(1) = 1$, $\kappa(2) = 2$, $\kappa(4) = 3$, and $\kappa(8) = 4$.

The situation is one in which the residual effects in the $k$th group have a common variance, say
$\sigma_k^2$ ($k = 1, 2, \ldots, K$). One way of proceeding is to regard the standard deviations $\sigma_1, \sigma_2, \ldots, \sigma_K$ as "unrelated," strictly positive parameters (whose values are unknown). This approach is simple and highly "flexible," though in general not very "parsimonious." It can be accommodated within the framework of the general linear model. One way of doing so is to take the parameter vector $\theta$ to be the $K$-dimensional column vector with $k$th element $\theta_k = \sigma_k$ ($k = 1, 2, \ldots, K$), in which case $\Theta = \{\theta : \theta_k > 0\ (k = 1, 2, \ldots, K)\}$. Then (assuming that the $N$ residual effects are uncorrelated), $V(\theta)$ is the diagonal matrix whose $i$th diagonal element $v_{ii}(\theta)$ is $\sigma_k^2 = \theta_k^2$ for all $i \in I_k$ ($k = 1, 2, \ldots, K$) or, equivalently, is $\sigma_{\kappa(u_i)}^2 = \theta_{\kappa(u_i)}^2$ (for $i = 1, 2, \ldots, N$).
Dependence of variability on the explanatory variables. In some situations in which the residual effects $e_1, e_2, \ldots, e_N$ are heteroscedastic, the variances of the residual effects may be related to the values of the explanatory variables (related in a more substantial way than in the case of within-group homoscedasticity). Suppose that for some nonnegative function, say $v(u)$, of $u$ (and for $i = 1, 2, \ldots, N$) $\mathrm{var}(e_i) = v(u_i)$ [or, equivalently, that the standard deviation of $e_i$ equals $\sqrt{v(u_i)}$]. Then, assuming that $e_1, e_2, \ldots, e_N$ are uncorrelated,
\[ \mathrm{var}(e) = \mathrm{diag}[v(u_1), v(u_2), \ldots, v(u_N)]. \qquad (4.1) \]

In general, the function $v(\cdot)$ is known only up to the (unknown) values of one or more parameters—the dependence on the parameters is suppressed in the notation. It is implicitly assumed that these parameters are unrelated to the parameters $\beta_1, \beta_2, \ldots, \beta_P$, whose values determine the expected values of $y_1, y_2, \ldots, y_N$. Thus, when they are regarded as the elements of the vector $\theta$, $\mathrm{var}(e)$ is of the form $V(\theta)$ of $\mathrm{var}(e)$ in the general linear model.

In what is a relatively simple special case, $v(u)$ is of the form
\[ v(u) = \sigma^2 h(u), \qquad (4.2) \]
where $\sigma$ is a strictly positive parameter (of unknown value) and $h(u)$ is a known (nonnegatively valued) function of $u$. In that special case, formula (4.1) is expressible as
\[ \mathrm{var}(e) = \sigma^2 \mathrm{diag}[h(u_1), h(u_2), \ldots, h(u_N)]. \qquad (4.3) \]
This expression is of the form $\sigma^2 H$ of $\mathrm{var}(e)$ in the Aitken generalization of the G–M model.

Let us consider some of the more common choices for the function $v(u)$. For the sake of simplicity, let us do so for the special case where $v(u)$ depends on $u$ only through a single one of its $C$ components. Further, for convenience, let us write $u$ for this component (dropping the subscript) and write $v(u)$ for $v(\mathbf{u})$ [thereby regarding $v(u)$ as a function solely of $u$]. In the case of the ouabain data, we could take $u$ to be the rate of injection, or, alternatively, we could regard some strictly monotonic function of the rate of injection as an explanatory variable and take it to be $u$.

One very simple choice for $v(u)$ is
\[ v(u) = \sigma^2 |u| \qquad (4.4) \]
(where $\sigma$ is a strictly positive parameter of unknown value). More generally, we could take $v(u)$ to be of the form
\[ v(u) = \sigma^2 |u|^{2\alpha}, \qquad (4.5) \]
where $\alpha$ is a strictly positive scalar or possibly (if the domain of $u$ does not include the value 0) an unrestricted scalar. And, still more generally, we could take $v(u)$ to be of the form
\[ v(u) = \sigma^2 (\gamma + |u|^{\alpha})^2, \qquad (4.6) \]
where $\gamma$ is a nonnegative scalar. While expressions (4.4), (4.5), and (4.6) depend on $u$ only through its absolute value and consequently are well-defined for both positive and negative values of $u$, choices for $v(u)$ of the form (4.4), (4.5), or (4.6) would seem to be best-suited for use in situations where $u$ is either strictly positive or strictly negative.

Note that taking $v(u)$ to be of the form (4.4), (4.5), or (4.6) is equivalent to taking $\sqrt{v(u)}$ to be $\sigma\sqrt{|u|}$, $\sigma|u|^{\alpha}$, or $\sigma(\gamma + |u|^{\alpha})$, respectively. Note also that expression (4.4) is of the form (4.2);
and recall that when $v(u)$ is of the form (4.2), $\mathrm{var}(e)$ is of the form $\sigma^2 H$ associated with the Aitken generalization of the G–M model. More generally, if $\alpha$ is known, then expression (4.5) is of the form (4.2); and if both $\gamma$ and $\alpha$ are known, expression (4.6) is of the form (4.2). However, if $\alpha$ or if $\gamma$ and/or $\alpha$ are (like $\sigma$) regarded as unknown parameters, then expression (4.5) or (4.6), respectively, is not of the form (4.2), and, while $\mathrm{var}(e)$ is of the form associated with the general linear model, it is not of the form associated with the Aitken generalization of the G–M model.

As an alternative to taking $v(u)$ to be of the form (4.4), (4.5), or (4.6), we could take it to be of the form
\[ v(u) = \sigma^2 e^{2\alpha u}, \qquad (4.7) \]
where $\alpha$ is an unrestricted scalar (and $\sigma$ is a strictly positive parameter of unknown value). Note that taking $v(u)$ to be of the form (4.7) is equivalent to taking $\sqrt{v(u)}$ to be of the form $\sqrt{v(u)} = \sigma e^{\alpha u}$, and is also equivalent to taking $\log\sqrt{v(u)}$ [which equals $\tfrac{1}{2}\log v(u)$] to be of the form
\[ \log\sqrt{v(u)} = \log\sigma + \alpha u. \]
Note also that if the scalar $\alpha$ in expression (4.7) is known, then that expression is of the form (4.2), in which case $\mathrm{var}(e)$ is of the form associated with the Aitken generalization of the G–M model. Alternatively, if $\alpha$ is an unknown parameter, then expression (4.7) is not of the form (4.2) and $\mathrm{var}(e)$ is not of the form associated with the Aitken generalization of the G–M model [though $\mathrm{var}(e)$ is of the form associated with the general linear model].

In some applications, there may not be any choice for $v(u)$ of a relatively simple form [like (4.6) or (4.7)] for which it is realistic to assume that $\mathrm{var}(e_i) = v(u_i)$ for $i = 1, 2, \ldots, N$. However, in some such applications, it may be possible to partition $e_1, e_2, \ldots, e_N$ into mutually exclusive and exhaustive subsets (perhaps on the basis of the values $u_1, u_2, \ldots, u_N$ of the vector $u$ of explanatory variables) in such a way that, specific to each subset, there is a choice for $v(u)$ (of a relatively simple form) for which it may be realistic to assume that $\mathrm{var}(e_i) = v(u_i)$ for those $i$ corresponding to the members of that subset. While one subset may require a different choice for $v(u)$ than another, the various choices (corresponding to the various subsets) may all be of the same general form; for example, they could all be of the form (4.7) (but with possibly different values of $\sigma$ and/or $\alpha$).
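To illustrate how a variance function of the form (4.7) translates into the matrix $\mathrm{var}(e)$ of the general linear model, the Python sketch below (added here; the values of $\sigma$ and $\alpha$ are purely illustrative, not estimates from any data set) assembles $\mathrm{diag}[v(u_1), \ldots, v(u_N)]$ for the ouabain injection rates.

    import numpy as np

    # Sketch (added illustration): build var(e) = diag[v(u_1), ..., v(u_N)] for the variance
    # function v(u) = sigma**2 * exp(2 * alpha * u) of expression (4.7).
    def var_e(u_values, sigma, alpha):
        v = sigma**2 * np.exp(2.0 * alpha * np.asarray(u_values, dtype=float))
        return np.diag(v)

    u = np.repeat([1, 2, 4, 8], [12, 11, 9, 9])        # e.g., the ouabain injection rates
    V = var_e(u, sigma=3.0, alpha=0.15)
    print(np.unique(np.round(np.diag(V), 3)))          # one distinct variance per rate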
b. Intraclass correlation: compound symmetry

There are situations where not all of the residual effects $e_1, e_2, \ldots, e_N$ can reasonably be regarded as uncorrelated. In some such situations, the residual effects can be partitioned into some number of mutually exclusive and exhaustive subsets in such a way that the residual effects in any particular subset are thought to be correlated (to an equal extent) while those in different subsets are thought to be uncorrelated. It is customary to refer to these subsets as classes. Residual effects in the same class are assumed to be homoscedastic (i.e., to have the same variance); in the most general case, the variances of the residual effects may (as in Part 2 of Subsection a) differ from class to class.

For random variables $x$ and $w$ (whose variances exist and are strictly positive),
\[ \mathrm{var}\!\left( \frac{x}{\sqrt{\mathrm{var}\,x}} - \frac{w}{\sqrt{\mathrm{var}\,w}} \right) = 2\,[1 - \mathrm{corr}(x, w)]. \]
Thus, the variance of the difference between the standardized versions of $x$ and $w$ is a decreasing function of $\mathrm{corr}(x, w)$. It follows that, in a certain sense, positively correlated random variables tend to be more alike, and negatively correlated random variables less alike, than uncorrelated random variables. In the case of the residual effects, there is an implication that, depending on whether the "intraclass correlation" is positive or negative, residual effects in the same class tend to be either more alike or less alike than residual effects in different classes.

Suppose that the $N$ residual effects have been partitioned into $K$ mutually exclusive and exhaustive subsets or classes numbered $1, 2, \ldots, K$. And for $k = 1, 2, \ldots, K$, denote by $N_k$ the number
of residual effects in the $k$th subset or class (so that $\sum_{k=1}^{K} N_k = N$). Further, let us suppose that the numbering of the $N$ data points and residual effects is such that the residual effects numbered $N_1 + N_2 + \cdots + N_{k-1} + 1$ through $N_1 + N_2 + \cdots + N_k$ are those in the $k$th subset or class—interpret $N_1 + N_2 + \cdots + N_0$ as 0. And for the sake of convenience and simplicity, let us use two subscripts instead of one to identify the residual effects. Accordingly, let us write $e_{k1}, e_{k2}, \ldots, e_{kN_k}$ (instead of $e_{N_1 + \cdots + N_{k-1} + 1}, e_{N_1 + \cdots + N_{k-1} + 2}, \ldots, e_{N_1 + \cdots + N_k}$) for the residual effects in the $k$th subset or class. Also, define (for $k = 1, 2, \ldots, K$) $e_k = (e_{k1}, e_{k2}, \ldots, e_{kN_k})'$, and observe that
\[ e' = (e_1', e_2', \ldots, e_K'). \]
It is supposed that $\mathrm{cov}(e_k, e_j) = 0$ for $j \ne k = 1, 2, \ldots, K$. It is further supposed that, for some scalar $\rho_k$,
\[ \mathrm{corr}(e_{ks}, e_{kt}) = \rho_k \quad (t \ne s = 1, 2, \ldots, N_k) \qquad (4.8) \]
and that, for some strictly positive scalar $\sigma_k$,
\[ \mathrm{var}(e_{ks}) = \sigma_k^2 \quad (s = 1, 2, \ldots, N_k), \qquad (4.9) \]
so that the correlation matrix of $e_k$ is the $N_k \times N_k$ matrix
\[ R_k = \begin{pmatrix} 1 & \rho_k & \cdots & \rho_k \\ \rho_k & 1 & & \rho_k \\ \vdots & & \ddots & \vdots \\ \rho_k & \rho_k & \cdots & 1 \end{pmatrix} \qquad (4.10) \]
\[ \phantom{R_k} = (1 - \rho_k) I_{N_k} + \rho_k 1_{N_k} 1_{N_k}' \qquad (4.11) \]
and the variance-covariance matrix of $e_k$ is
\[ \mathrm{var}(e_k) = \sigma_k^2 R_k \qquad (4.12) \]
($k = 1, 2, \ldots, K$). It follows that
\[ \mathrm{var}(e) = \mathrm{diag}(\sigma_1^2 R_1, \sigma_2^2 R_2, \ldots, \sigma_K^2 R_K). \qquad (4.13) \]
Condition (4.8) stipulates that the correlation of every two residual effects in the $k$th class equals $\rho_k$, so that, by definition, $\rho_k$ is the "intraclass correlation." Because $\rho_k$ is a correlation, it is necessarily the case that $-1 \le \rho_k \le 1$. However, not every value of $\rho_k$ between $\pm 1$ is a "permissible" value. The permissible values of $\rho_k$ are those for which the matrix (4.11) is nonnegative definite. The determination of those values is greatly facilitated by the introduction of the following lemma.

A matrix lemma.

Lemma 4.4.1. Let $R$ represent an $M \times M$ matrix of the form
\[ R = a I_M + b\,1_M 1_M', \]
where $a$ and $b$ are scalars and where $M \ge 2$. Then, $R$ is nonnegative definite if and only if $a \ge 0$ and $a + Mb \ge 0$, and is positive definite if and only if $a > 0$ and $a + Mb > 0$.

Proof. Let $x = (x_1, x_2, \ldots, x_M)'$ represent an arbitrary $M$-dimensional column vector, and define $\bar{x} = (1/M) \sum_{i=1}^{M} x_i$. Observe that
\[ x' R x = a \sum_{i=1}^{M} x_i^2 + b (M\bar{x})^2 = a \Big( \sum_{i=1}^{M} x_i^2 - M\bar{x}^2 \Big) + (a + Mb) M\bar{x}^2 = a \sum_{i=1}^{M} (x_i - \bar{x})^2 + (a + Mb) M\bar{x}^2. \qquad (4.14) \]
Observe also that $\sum_{i=1}^{M} (x_i - \bar{x})^2 = 0$ if and only if $x_1 = x_2 = \cdots = x_M$, and that $\sum_{i=1}^{M} (x_i - \bar{x})^2 = 0$ and $M\bar{x}^2 = 0$ if and only if $x_1 = x_2 = \cdots = x_M = 0$ or, equivalently, if and only if $x = 0$.

If $a \ge 0$ and $a + Mb \ge 0$, then it is clear from result (4.14) that $x' R x \ge 0$ for every $x$ and hence that $R$ is nonnegative definite. Further, if $a + Mb < 0$, then $x' R x < 0$ for any $x$ of the form $x = c\,1_M$, where $c$ is a nonzero scalar; and if $a + Mb = 0$, then $x' R x = 0$ for any $x$ of that form—note that any $x$ of that form is nonnull. Thus, if $a + Mb < 0$, then $R$ is not nonnegative definite (i.e., is neither positive definite nor positive semidefinite); and if $a + Mb = 0$, then $R$ is not positive definite. Similarly, if $a < 0$, then $x' R x < 0$ for any $x$ with $\sum_{i=1}^{M} x_i = 0$ and with $x_i \ne x_j$ for some $i$ and $j$; and if $a = 0$, then $x' R x = 0$ for any such $x$ (and any such $x$ is nonnull). Thus, if $a < 0$, then $R$ is not nonnegative definite; and if $a = 0$, then $R$ is not positive definite. To complete the proof, it suffices to observe that if $a > 0$ and $a + Mb > 0$, then $x' R x = 0$ only if $x$ is such that both $\sum_{i=1}^{M} (x_i - \bar{x})^2 = 0$ and $M\bar{x}^2 = 0$ and hence only if $x = 0$. Q.E.D.
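Lemma 4.4.1 can also be read off from the eigenvalues of $R$: a matrix of the form $aI_M + b\,1_M 1_M'$ has eigenvalue $a$ with multiplicity $M - 1$ and eigenvalue $a + Mb$ with multiplicity 1. The short Python check below (an added illustration, with arbitrary values of $a$, $b$, and $M$) verifies this numerically.

    import numpy as np

    # Sketch (added illustration): R = a*I_M + b*1_M 1_M' has eigenvalue a with multiplicity
    # M - 1 and eigenvalue a + M*b with multiplicity 1, so the conditions of Lemma 4.4.1 are
    # exactly the conditions for nonnegative (or strictly positive) eigenvalues.
    M, a, b = 5, 0.7, 0.3
    R = a * np.eye(M) + b * np.ones((M, M))
    print(np.round(np.linalg.eigvalsh(R), 6))    # [0.7, 0.7, 0.7, 0.7, 2.2]
    print(a, a + M * b)                          # the two distinct eigenvalues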

Form of the variance-covariance matrix. Let us now return to the main development. Upon applying Lemma 4.4.1 with $a = 1 - \rho_k$ and $b = \rho_k$, we find that (when $N_k \ge 2$) the correlation matrix $R_k$ of the vector $e_k$ is nonnegative definite if and only if $1 - \rho_k \ge 0$ and $1 + (N_k - 1)\rho_k \ge 0$, or equivalently if and only if
\[ -\frac{1}{N_k - 1} \le \rho_k \le 1, \qquad (4.15) \]
and similarly that $R_k$ is positive definite if and only if
\[ -\frac{1}{N_k - 1} < \rho_k < 1. \qquad (4.16) \]
The permissible values of $\rho_k$ are those in the interval (4.15).

The variance-covariance matrix of the vector $e_k$ is such that all of its diagonal elements (which are the variances) equal each other and all of its off-diagonal elements (which are the covariances) also equal each other. Such a variance-covariance matrix is said to be compound symmetric (e.g., Milliken and Johnson 2009, p. 536).

The variance-covariance matrix of the vector $e$ of residual effects is positive definite if and only if all $K$ of the matrices $R_1, R_2, \ldots, R_K$ are positive definite—refer to result (4.13) and "recall" Lemma 2.13.31. Thus, $\mathrm{var}(e)$ is positive definite if, for every $k$ for which $N_k \ge 2$, $\rho_k$ is in the interior (4.16) of interval (4.15). [If, for every $k$ for which $N_k \ge 2$, $\rho_k$ is in interval (4.15) but, for at least one such $k$, $\rho_k$ equals an end point of interval (4.15), $\mathrm{var}(e)$ is positive semidefinite.]

There are applications in which it may reasonably be assumed that the intraclass correlations are nonnegative (i.e., that $\rho_k \ge 0$ for every $k$ for which $N_k \ge 2$). In some such applications, it is further assumed that all of the intraclass correlations are equal. Together, these assumptions are equivalent to the assumption that, for some scalar $\rho$ in the interval $0 \le \rho \le 1$,
\[ \rho_k = \rho \quad \text{(for every $k$ for which $N_k \ge 2$).} \qquad (4.17) \]

Now, suppose that assumption (4.17) is adopted. If $\rho$ were taken to be 0, $\mathrm{var}(e)$ would be of the form considered in Part 2 of Subsection a. When $\rho$ as well as $\sigma_1, \sigma_2, \ldots, \sigma_K$ are regarded as unknown parameters, $\mathrm{var}(e)$ is of the form $V(\theta)$ of $\mathrm{var}(e)$ in the general linear model—take $\theta$ to be a $(K+1)$-dimensional (column) vector whose elements are $\sigma_1, \sigma_2, \ldots, \sigma_K$, and $\rho$.

In some applications, there may be a willingness to augment assumption (4.17) with the additional assumption that the residual effects are completely homoscedastic, that is, with the additional assumption that, for some strictly positive scalar $\sigma$,
\[ \sigma_1 = \sigma_2 = \cdots = \sigma_K = \sigma. \qquad (4.18) \]
Then,
\[ \mathrm{var}(e) = \sigma^2 \mathrm{diag}(R_1, R_2, \ldots, R_K), \qquad (4.19) \]
and when $\sigma$ is regarded as an unknown parameter and $\rho$ is taken to be known, $\mathrm{var}(e)$ is of the form
$\sigma^2 H$ of $\mathrm{var}(e)$ in the Aitken generalization of the G–M model. When both $\sigma$ and $\rho$ are regarded as unknown parameters, $\mathrm{var}(e)$ is of the form $V(\theta)$ of $\mathrm{var}(e)$ in the general linear model—take $\theta$ to be the 2-dimensional (column) vector whose elements are $\sigma$ and $\rho$. Assumption (4.18) leads to a model that is less flexible but more parsimonious than that obtained by allowing the variances of the residual effects to differ from class to class.
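As a sketch of how $\mathrm{var}(e)$ in (4.19) might be assembled in practice (an added illustration; the class sizes, $\sigma$, and $\rho$ are arbitrary), the following Python fragment builds the block-diagonal matrix $\sigma^2\,\mathrm{diag}(R_1, \ldots, R_K)$ with compound-symmetric blocks.

    import numpy as np

    # Sketch (added illustration): assemble var(e) = sigma**2 * diag(R_1, ..., R_K) as in (4.19),
    # with compound-symmetric blocks R_k = (1 - rho) I_{N_k} + rho 1 1'.
    def compound_symmetric_var(class_sizes, sigma, rho):
        N = sum(class_sizes)
        V = np.zeros((N, N))
        start = 0
        for n in class_sizes:
            V[start:start + n, start:start + n] = (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))
            start += n
        return sigma**2 * V

    V = compound_symmetric_var([3, 3, 2], sigma=2.0, rho=0.4)
    print(V.shape)      # (8, 8)
    print(V[:3, :3])    # one block: 4.0 on the diagonal, 1.6 off the diagonal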
Decomposition of the residual effects. A supposition that an intraclass correlation is nonnegative, but strictly less than 1, is the equivalent of a supposition that each residual effect can be regarded as the sum of two uncorrelated components, one of which is specific to that residual effect and the other of which is shared by all of the residual effects in the same class. Let us consider this equivalence in some detail. Accordingly, take $a_k$ ($k = 1, 2, \ldots, K$) and $r_{ks}$ ($k = 1, 2, \ldots, K$; $s = 1, 2, \ldots, N_k$) to be uncorrelated random variables, each with mean 0, such that $\mathrm{var}(a_k) = \psi_k^2$ for some nonnegative scalar $\psi_k$ and $\mathrm{var}(r_{ks}) = \tau_k^2$ ($s = 1, 2, \ldots, N_k$) for some strictly positive scalar $\tau_k$. And suppose that, for $k = 1, 2, \ldots, K$ and $s = 1, 2, \ldots, N_k$,
\[ e_{ks} = a_k + r_{ks}. \qquad (4.20) \]
Here, $a_k$ is the component of the $s$th residual effect in the $k$th class that is shared by all of the residual effects in the $k$th class and $r_{ks}$ is the component that is specific to $e_{ks}$.

We find that the variance $\sigma_k^2$ of the residual effects in the $k$th class is expressible as
\[ \sigma_k^2 = \psi_k^2 + \tau_k^2. \qquad (4.21) \]
Further, upon observing that (for $t \ne s = 1, 2, \ldots, N_k$) $\mathrm{cov}(e_{ks}, e_{kt}) = \psi_k^2$, we find that the correlation of any two residual effects in the $k$th class (the intraclass correlation) is expressible as
\[ \rho_k = \frac{\psi_k^2}{\sigma_k^2} = \frac{\psi_k^2}{\psi_k^2 + \tau_k^2} = \frac{(\psi_k/\tau_k)^2}{1 + (\psi_k/\tau_k)^2} \qquad (4.22) \]
—this expression is well-defined even if $N_k = 1$. And, in addition to representations (4.10) and (4.11) for the correlation matrix $R_k$ of the vector $e_k$ and representation (4.12) for $\mathrm{var}(e_k)$, we have the representations
\[ R_k = \frac{\tau_k^2}{\psi_k^2 + \tau_k^2}\,I_{N_k} + \frac{\psi_k^2}{\psi_k^2 + \tau_k^2}\,1_{N_k} 1_{N_k}' \qquad (4.23) \]
and
\[ \mathrm{var}(e_k) = \tau_k^2 I_{N_k} + \psi_k^2 1_{N_k} 1_{N_k}' \qquad (4.24) \]
\[ \phantom{\mathrm{var}(e_k)} = \tau_k^2 [I_{N_k} + (\psi_k/\tau_k)^2 1_{N_k} 1_{N_k}']. \qquad (4.25) \]

Note that result (4.22) implies that the intraclass correlation is nonnegative, but strictly less than 1. Note also that an assumption that $\rho_k$ does not vary with $k$ can be restated as an assumption that $\psi_k^2/\sigma_k^2$ does not vary with $k$ and also as an assumption that $\psi_k/\tau_k$ (or $\psi_k^2/\tau_k^2$) does not vary with $k$.
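The equivalence between decomposition (4.20) and a nonnegative intraclass correlation is easy to check by simulation. The Python sketch below is an added illustration; the values written here as psi and tau are arbitrary, and normal draws are used purely for convenience (the text assumes only uncorrelated components with mean 0).

    import numpy as np

    # Sketch (added illustration): simulate the decomposition e_ks = a_k + r_ks of (4.20)
    # and compare the empirical intraclass correlation with expression (4.22).
    rng = np.random.default_rng(2)
    K, Nk, psi, tau = 20_000, 5, 1.5, 2.0
    a = rng.normal(0.0, psi, size=(K, 1))        # class-level component, variance psi**2
    r = rng.normal(0.0, tau, size=(K, Nk))       # effect-specific component, variance tau**2
    e = a + r                                    # e_ks = a_k + r_ks

    empirical = np.corrcoef(e[:, 0], e[:, 1])[0, 1]
    print(round(empirical, 3), round(psi**2 / (psi**2 + tau**2), 3))   # both close to 0.36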
The effects of intraclass competition. In some applications, the residual effects in each class may be those for data on entities among which there is competition. For example, the entities might consist of individual animals that are kept in the same pen and that may compete for space and for feed. Or they might consist of individual plants that are in close proximity and that may compete for water, nutrients, and light. In the presence of such competition, the residual effects in each class may tend to be less alike than would otherwise be the case.

The decomposition of the residual effects considered in the previous part can be modified so as to reflect the effects of competition. Let us suppose that the $N_k$ residual effects in the $k$th class are those for data on $N_k$ of a possibly larger number $N_k^*$ of entities among which there is competition. Define $a_k$ ($k = 1, 2, \ldots, K$) and $r_{ks}$ ($k = 1, 2, \ldots, K$; $s = 1, 2, \ldots, N_k$) as in the previous part. And take $d_{ks}$ ($k = 1, 2, \ldots, K$; $s = 1, 2, \ldots, N_k^*$) to be random variables, each with mean 0, that are uncorrelated with the $a_k$'s and the $r_{ks}$'s, and suppose that $\mathrm{cov}(d_{ks}, d_{jt}) = 0$ for $j \ne k = 1, 2, \ldots, K$
(and for all $s$ and $t$). Suppose further that (for $k = 1, 2, \ldots, K$) $\sum_{s=1}^{N_k^*} d_{ks} = 0$, that $\mathrm{var}(d_{ks}) = \omega_k^2$ ($s = 1, 2, \ldots, N_k^*$) for some nonnegative scalar $\omega_k$, and that $\mathrm{cov}(d_{ks}, d_{kt})$ has the same value for all $t \ne s = 1, 2, \ldots, N_k^*$. Finally, suppose that, for $k = 1, 2, \ldots, K$ and $s = 1, 2, \ldots, N_k$,
\[ e_{ks} = a_k + d_{ks} + r_{ks}. \qquad (4.26) \]

Decomposition (4.26) can be regarded as a modification of decomposition (4.20) in which the term $r_{ks}$ is replaced by the sum $d_{ks} + r_{ks}$. This modification is for the purpose of accounting for the possibility of intraclass competition.

In light of the various suppositions, we have that (for an arbitrary $s$ and $t \ne s$)
\[ \mathrm{var}\Big( \sum_{s'=1}^{N_k^*} d_{ks'} \Big) = N_k^* \omega_k^2 + N_k^*(N_k^* - 1)\,\mathrm{cov}(d_{ks}, d_{kt}) \qquad (4.27) \]
and that
\[ \mathrm{var}\Big( \sum_{s'=1}^{N_k^*} d_{ks'} \Big) = \mathrm{var}(0) = 0. \qquad (4.28) \]
Together, results (4.27) and (4.28) imply that (for $t \ne s = 1, 2, \ldots, N_k^*$)
\[ \mathrm{cov}(d_{ks}, d_{kt}) = -\frac{1}{N_k^* - 1}\,\omega_k^2. \qquad (4.29) \]

Decomposition (4.26) is such that the variance $\sigma_k^2$ of the residual effects in the $k$th class is expressible as
\[ \sigma_k^2 = \psi_k^2 + \tau_k^2 + \omega_k^2. \qquad (4.30) \]
Further, for $t \ne s = 1, 2, \ldots, N_k$,
\[ \mathrm{cov}(e_{ks}, e_{kt}) = \psi_k^2 + \mathrm{cov}(d_{ks}, d_{kt}) = \psi_k^2 - \frac{1}{N_k^* - 1}\,\omega_k^2. \qquad (4.31) \]
Thus, the correlation of any two residual effects in the $k$th class (the intraclass correlation) is expressible as
\[ \rho_k = \frac{\psi_k^2 - \dfrac{1}{N_k^* - 1}\,\omega_k^2}{\sigma_k^2} = \frac{\psi_k^2 - \dfrac{1}{N_k^* - 1}\,\omega_k^2}{\psi_k^2 + \tau_k^2 + \omega_k^2} = \frac{(\psi_k/\tau_k)^2 - \dfrac{1}{N_k^* - 1}\,(\omega_k/\tau_k)^2}{1 + (\psi_k/\tau_k)^2 + (\omega_k/\tau_k)^2}. \qquad (4.32) \]

In light of expression (4.32), the permissible values of $\rho_k$ are those in the interval
\[ -\frac{1}{N_k^* - 1} < \rho_k < 1. \qquad (4.33) \]
The intraclass correlation $\rho_k$ approaches the upper end point of this interval as $\psi_k/\tau_k \to \infty$ (for fixed $\omega_k/\tau_k$) and approaches the lower end point as $\omega_k/\tau_k \to \infty$ (for fixed $\psi_k/\tau_k$).

Expression (4.32) "simplifies" to a considerable extent if neither $\psi_k/\tau_k$ nor $\omega_k/\tau_k$ varies with $k$ (or, equivalently, if neither $\psi_k^2/\tau_k^2$ nor $\omega_k^2/\tau_k^2$ varies with $k$). However, even then, $\rho_k$ may depend on $N_k^*$ and hence may vary with $k$.
c. Example: corn-milling data

Littell, Milliken, Stroup, Wolfinger, and Schabenberger (2006, sec. 16.2) reported the results of an experimental study of how the milling of corn is affected by the moisture content of the corn and by the operating characteristics of the grinding mill. The operating characteristics were those associated with three variables: roll gap, screen size, and roller speed. Three equally spaced settings of each of these three variables were represented in the study. The experimental material consisted of ten 30-kilogram "batches" of corn. Each batch was tempered so that its moisture content conformed to a specified setting (one of three equally spaced settings selected for inclusion in the study).

Following its preparation, each batch was split into three equal (10 kg) parts. And for each part of each batch, settings were specified for the roll gap, screen size, and roller speed, the grinding mill was configured (to conform to the specified settings), the processing of the corn was undertaken, and the amount of grits obtained from a one-minute run was determined. The moisture content, roll gap, screen size, and roller speed were recorded on a transformed scale chosen so that the values of the high, medium, and low settings were +1, 0, and -1, respectively. The results of the 30 experimental runs are reproduced in Table 4.4.

TABLE 4.4. Amount of grits obtained from each of 30 1-minute runs of a grinding mill and the batch, moisture content, roll gap, screen size, and roller speed for each run (Littell, Milliken, Stroup, Wolfinger, and Schabenberger 2006, sec. 16.2).

                       Transformed setting (+1, high; 0, med.; -1, low)
  Amount of          Moisture    Roll    Screen    Roller
  grits      Batch   content     gap      size     speed
   505         1       +1        +1       +1        +1
   493         1       +1        -1       -1        -1
   491         1       +1        -1       +1        -1
   498         2       +1        +1       -1         0
   504         2       +1        +1       -1        -1
   500         2       +1        -1       +1         0
   494         3       -1         0       -1        -1
   498         3       -1         0       +1         0
   498         3       -1        -1        0        +1
   496         4        0        -1       -1         0
   503         4        0         0       +1        +1
   496         4        0        -1        0        -1
   503         5       -1        -1       +1        +1
   495         5       -1        +1       +1        -1
   494         5       -1        -1       +1        -1
   486         6        0         0        0         0
   501         6        0        +1       +1        -1
   490         6        0        +1       -1        +1
   494         7       -1        +1        0         0
   497         7       -1        +1       +1        +1
   492         7       -1        -1       +1        -1
   503         8       +1        -1       +1        +1
   499         8       +1         0        0        -1
   493         8       +1         0       -1        +1
   505         9       +1        +1       +1        -1
   500         9       +1        +1        0        +1
   490         9       +1        -1       -1        +1
   494        10       -1        -1       -1        +1
   497        10       -1        +1       -1        -1
   495        10       -1        -1       -1        -1
The corn-milling experiment was conducted in accordance with what is known as a split-plot design (e.g., Hinkelmann and Kempthorne 2008, chap. 13). The ten batches are the so-called whole plots, and the three moisture-content settings constitute the so-called whole-plot treatments. Further, the 30 parts (obtained by splitting the 10 batches into 3 parts each) are the so-called split plots, and the various combinations of settings for roll gap, screen size, and roller speed constitute the so-called split-plot treatments—21 of a possible 27 combinations were included in the experiment.

The data from the corn-milling experiment might be suitable for the application of a general linear model. We could take $N = 30$, take the observed value of $y_i$ to be the amount of grits obtained on the $i$th experimental run ($i = 1, 2, \ldots, 30$), take $C = 4$, and take $u_1$, $u_2$, $u_3$, and $u_4$ to be the moisture content, roll gap, screen size, and roller speed, respectively (with each being expressed on the transformed scale). And we could consider taking $\delta(u)$ to be a polynomial of degree 2 in $u_1$, $u_2$, $u_3$, and $u_4$. The nature of the application is such that $\delta(u)$ defines what is commonly referred to as a "response surface."

The batches in the corn-milling experiment define classes within which the residual effects are likely to be correlated. Moreover, this correlation is likely to be compound symmetric, that is, to be the same for every two residual effects in the same class. In the simplest case, the intraclass correlation would be regarded as having the same value, say $\rho$, for every class. Then, assuming that the data points (and the corresponding residual effects) are numbered $1, 2, \ldots, 30$ in the order in which the data points are listed in Table 4.4 and that the residual effects have a common variance $\sigma^2$, the variance-covariance matrix of the vector $e$ of residual effects would be
\[ \mathrm{var}(e) = \sigma^2 \mathrm{diag}(R, R, \ldots, R), \]
where
\[ R = \begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix}. \]
The batches can be expected to differ from one another in ways that go beyond any of the
specified differences in moisture content. To the extent that any such differences are reflected in
the amounts of grits obtained from the various one-minute runs, they contribute positively to the
intraclass correlation. However, their influence could be offset to at least some small extent by the
effects of splitting the batches into parts. If the splitting is such that (by the very nature of the process)
some parts are favored at the expense of others (in ways that may affect the amount of grits), then the
splitting may contribute negatively to the intraclass correlation. In effect, the splitting may introduce
what could be regarded as a form of intraclass competition.
d. Example: shear-strength data

Khuri (1992, sec. 4) reported the results of an experimental study of how the effectiveness of an 11-component adhesive is influenced by the temperature and the curing time employed in its manufacture. The measure of effectiveness was taken to be the shear strength of the bond between galvanized steel bars created through application of the adhesive. Results were obtained for three temperatures (375°F, 400°F, and 450°F) and three curing times (30, 35, and 40 seconds) in all (nine) possible combinations. The steel used in assessing the shear strength consisted of "aliquots" obtained by sampling (at random) from the supply on hand in a warehouse; a sample was taken on each of 12 dates between July 11th and October 10th, inclusive.

The experiment was conducted in accordance with what could be regarded as a randomized complete block design (e.g., Hinkelmann and Kempthorne 2008, chap. 9). There are 12 "blocks" corresponding to the 12 dates on which samples were taken and 9 "treatments" corresponding to the 9 possible combinations of the 3 temperatures and the 3 curing times. The basic design was augmented by including (in four of the blocks) some "replicates" of one of the treatments (the one comprising a temperature of 400°F and a curing time of 35 seconds).

The data are reproduced in Table 4.5. These data might be suitable for the application of a general linear model. We could take $N = (12 \times 9) + 3 + 3 + 2 + 2 = 118$, and take the observed value of $y$ to be the (118-dimensional) column vector formed from the 12 columns of shear strengths in Table 4.5 by listing them successively one under the other. And we could take $C = 2$, take $u_1$ and $u_2$ to be the temperature and curing time (employed in the manufacture of the adhesive), and take $\delta(u)$ to be a polynomial of degree 2 in $u_1$ and $u_2$. As in the previous example (that of Subsection c), the nature of the application is such that $\delta(u)$ defines a "response surface."

TABLE 4.5. Shear strength (psi) of the bond between galvanized steel bars created through application of an adhesive: data for each of nine combinations of temperature and curing time (those employed in the manufacture of the adhesive) obtained using steel aliquots selected at random from those on hand on each of twelve dates (Khuri 1992, sec. 4).

  Temp.  Time                                       Date (month/day)
  (°F)  (sec.)  07/11  07/16  07/20  08/07  08/08  08/14  08/20  08/22  09/11  09/24  10/03  10/10
  375    30     1,226  1,075  1,172  1,213  1,282  1,142  1,281  1,305  1,091  1,281  1,305  1,207
  400    30     1,898  1,790  1,804  1,961  1,940  1,699  1,833  1,774  1,588  1,992  2,011  1,742
  450    30     2,142  1,843  2,061  2,184  2,095  1,935  2,116  2,133  1,913  2,213  2,192  1,995
  375    35     1,472  1,121  1,506  1,606  1,572  1,608  1,502  1,580  1,343  1,691  1,584  1,486
  400    35     2,010  2,175  2,279  2,450  2,291  2,374  2,417  2,393  2,205  2,142  2,052  2,339
  400    35     1,882                2,355                              2,268         2,032
  400    35     1,915                2,420                              2,103         2,190
  400    35     2,106                2,240
  450    35     2,352  2,274  2,168  2,298  2,147  2,413  2,430  2,440  2,093  2,208  2,201  2,216
  375    40     1,491  1,691  1,707  1,882  1,741  1,846  1,645  1,688  1,582  1,692  1,744  1,751
  400    40     2,078  2,513  2,392  2,531  2,366  2,392  2,392  2,413  2,392  2,488  2,392  2,390
  450    40     2,531  2,588  2,617  2,609  2,431  2,408  2,517  2,604  2,477  2,601  2,588  2,572
Steel aliquots chosen at random from those on hand on any particular date may tend to resemble each other more closely than ones chosen at random from those on hand on different dates. Accordingly, it may be prudent to regard the 12 blocks as "classes" and to allow for the possibility of an intraclass correlation. If (in doing so) it is assumed that the intraclass correlation and the variance of the random effects have values, say $\rho$ and $\sigma^2$, respectively, that do not vary from block to block, then the variance-covariance matrix of the vector $e$ of residual effects is
\[ \mathrm{var}(e) = \sigma^2 \mathrm{diag}(R_1, R_2, \ldots, R_{12}), \qquad (4.34) \]
where (for $k = 1, 2, \ldots, 12$)
\[ R_k = (1 - \rho) I_{N_k} + \rho\,1_{N_k} 1_{N_k}' \]
with
\[ N_k = \begin{cases} 9 & \text{if } k = 2, 3, 5, 6, 7, 8, 10, \text{ or } 12, \\ 11 & \text{if } k = 9 \text{ or } 11, \\ 12 & \text{if } k = 1 \text{ or } 4. \end{cases} \]

In arriving at expression (4.34), it was implicitly assumed that the residual effects associated with the data points in any one block are uncorrelated with those associated with the data points in any other block. It is conceivable that steel aliquots chosen at random from those on hand on different dates may tend to be more alike when the intervening time (between dates) is short than when it is long. If we wished to account for any such tendency, we would need to allow for the possibility that the residual effects associated with the data points in different blocks may be correlated to an extent that diminishes with the separation (in time) between the blocks. That would seem to call for taking $\mathrm{var}(e)$ to be of a different and more complex form than the block-diagonal form (4.34).
e. Longitudinal data
There are situations where the data are obtained by recording the value of what is essentially the same
variate for each of a number of “observational units” on each of a number of occasions (corresponding
Heteroscedastic and Correlated Residual Effects 147

to different points in time). The observational units might be people, animals, plants, laboratory
specimens, households, experimental plots (of land), or other such entities. For example, in a clinical
trial of drugs for lowering blood pressure, each drug might be administered to a different group of
people, with a placebo being administered to an additional group, and each person’s blood pressure
might be recorded periodically, including at least once prior to the administration of the drug or
placebo—in this example, each person constitutes an observational unit. Data of this kind are referred
to as longitudinal data.
Suppose that the observed values of the random variables y1 ; y2 ; : : : ; yN in the general linear
model are longitudinal data. Further, let t represent time, and denote by t1 ; t2 ; : : : ; tN the values of t
corresponding to y1 ; y2 ; : : : ; yN , respectively. And assume (as can be done essentially without loss
of generality) that the vector u of explanatory variables u1 ; u2 ; : : : ; uC is such that t is one of the
explanatory variables or, more generally, is expressible as a function of u, so that (for i D 1; 2; : : : ; N )
the i th value ti of t is determinable from the corresponding (i th) value ui of u.
Denote by K the number of observational units represented in the data, suppose that the observa-
tional units are numbered 1; 2; : : : ; K, and define Nk to be the number of data points pertaining to the
kth observational unit (so that Σ_{k=1}^{K} N_k = N). Assume that the numbering (from 1 through N) of
the N data points is such that they are ordered by observational unit and by time within observational
unit (so that if the ith data point pertains to the kth observational unit and the i′th data point to the
k′th observational unit where i′ > i, then either k′ > k or k′ = k and t_{i′} ≥ t_i)—it is always
possible to number the data points in such a way. The setting is one in which it is customary and
convenient to use two subscripts, rather than one, to identify the random variables y1 ; y2 ; : : : ; yN ,
residual effects e1 ; e2 ; : : : ; eN , and times t1 ; t2 ; : : : ; tN . Accordingly, let us write eks for the residual
effect corresponding to the sth of those data points that pertain to the kth observational unit, and tks
for the time corresponding to that data point.
It is often possible to account for the more “systematic” effects of time through the choice of the
form of the function ı.u/. However, even then, it is seldom appropriate to assume (as in the G–M
model) that all N of the residual effects are uncorrelated with each other.
The vector e of residual effects is such that

    e′ = (e′_1, e′_2, ..., e′_K),

where (for k = 1, 2, ..., K) e_k = (e_{k1}, e_{k2}, ..., e_{kN_k})′. It is assumed that cov(e_k, e_j) = 0 for
j ¤ k D 1; 2; : : : ; K, so that
    var(e) = diag[var(e_1), var(e_2), ..., var(e_K)].    (4.35)
For k = 1, 2, ..., K, var(e_k) is the N_k × N_k matrix with rth diagonal element var(e_kr) and rsth
(where s ≠ r) off-diagonal element cov(e_kr, e_ks). It is to be expected that e_kr and e_ks will be
positively correlated, with the extent of their correlation depending on |t_ks − t_kr|; typically, the
correlation of e_kr and e_ks is a strictly decreasing function of |t_kr − t_ks|. There are various kinds
of stochastic processes that exhibit that kind of correlation structure. Among the simplest and most
prominent of them is the following.
Stationary first-order autoregressive processes. Let x1 represent a random variable having mean 0
(and a strictly positive variance), and let x2 ; x3 ; : : : represent a possibly infinite sequence of random
variables generated successively (starting with x1 ) in accordance with the following relationship:

    x_{i+1} = ρ x_i + d_{i+1}.    (4.36)

Here, ρ is a (nonrandom) scalar in the interval 0 < ρ < 1, and d_2, d_3, ... are random variables, each
with mean 0 (and a finite variance), that are uncorrelated with x1 and with each other. The sequence
of random variables x1 ; x2 ; x3 ; : : : represents a stochastic process that is characterized as first-order
autoregressive.

Note that
    E(x_i) = 0 (for all i).    (4.37)
Note also that for r = 1, 2, ..., i, cov(x_r, d_{i+1}) = 0, as can be readily verified by mathematical
induction—because x_r = ρ x_{r−1} + d_r, cov(x_{r−1}, d_{i+1}) = 0 implies that cov(x_r, d_{i+1}) = 0. In
particular, cov(x_i, d_{i+1}) = 0. Thus,
    var(x_{i+1}) = ρ² var(x_i) + var(d_{i+1}).    (4.38)

Let us determine the conditions under which the sequence of random variables x1 ; x2 ; x3 ; : : : is
stationary in the sense that
    var(x_1) = var(x_2) = var(x_3) = ··· .    (4.39)
In light of equality (4.38),

    var(x_{i+1}) = var(x_i)  ⟺  var(d_{i+1}) = (1 − ρ²) var(x_i).

Accordingly, the sequence x1 ; x2 ; x3 ; : : : satisfies condition (4.39) if (and only if)

    var(d_{i+1}) = (1 − ρ²) var(x_1) (for all i).    (4.40)

By making repeated use of the defining relationship (4.36), we find that, for an arbitrary positive
integer s,

    x_{i+s} = ρ^s x_i + Σ_{j=0}^{s−1} ρ^j d_{i+s−j}.    (4.41)

And it follows that

    cov(x_i, x_{i+s}) = cov(x_i, ρ^s x_i) = ρ^s var(x_i).    (4.42)

Thus, in the special case where the sequence x_1, x_2, x_3, ... satisfies condition (4.39), we have that

    corr(x_i, x_{i+s}) = ρ^s    (4.43)

or, equivalently, that (for r ≠ i = 1, 2, 3, ...)

    corr(x_i, x_r) = ρ^{|r−i|}.    (4.44)

In summary, we have that when d2 ; d3 ; d4 ; : : : satisfy condition (4.40), the sequence of random
variables x1 ; x2 ; x3 ; : : : satisfies condition (4.39) and condition (4.43) or (4.44). Accordingly, when
d2 ; d3 ; d4 ; : : : satisfy condition (4.40), the sequence x1 ; x2 ; x3 ; : : : is of a kind that is sometimes
referred to as stationary in the wide sense (e.g., Parzen 1960, chap. 10).
The entries in the sequence x1 ; x2 ; x3 ; : : : may represent the state of some phenomenon at a
succession of times, say times t_1, t_2, t_3, ... . The coefficient of x_i in expression (4.36) is ρ, which
does not vary with i and hence does not vary with the "elapsed times" t_2 − t_1, t_3 − t_2, t_4 − t_3, ... .
So, for the sequence x1 ; x2 ; x3 ; : : : to be a suitable reflection of the evolution of the phenomenon
over time, it would seem to be necessary that t1 ; t2 ; t3 ; : : : be equally spaced.
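As a quick numerical check of result (4.44), one can simulate many replicates of a stationary sequence generated by (4.36) and compare the sample correlations with powers of ρ. The Python sketch below uses an arbitrary illustrative value ρ = 0.6 and draws the d's with variance (1 − ρ²)var(x_1), as required by condition (4.40).

    import numpy as np

    rng = np.random.default_rng(0)
    rho, n, reps = 0.6, 8, 200_000

    # simulate many replicates of x_1, ..., x_n under (4.36), with var(x_1) = 1
    x = np.empty((reps, n))
    x[:, 0] = rng.normal(0.0, 1.0, reps)
    for i in range(1, n):
        d = rng.normal(0.0, np.sqrt(1 - rho**2), reps)   # condition (4.40)
        x[:, i] = rho * x[:, i - 1] + d                  # relationship (4.36)

    # the sample correlation matrix should be close to rho**|r - i|, as in (4.44)
    emp = np.corrcoef(x, rowvar=False)
    theo = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    print(np.max(np.abs(emp - theo)))   # small (roughly of order 1/sqrt(reps))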
A sequence that may be suitable even if t1 ; t2 ; t3 ; : : : are not equally spaced can be achieved by
introducing a modified version of the defining relationship (4.36). The requisite modification can be
discerned from result (4.41) by thinking of the implications of that result as applied to a situation
where the successive differences in time (t_2 − t_1, t_3 − t_2, t_4 − t_3, ...) are equal and arbitrarily small.
Specifically, what is needed is to replace relationship (4.36) with the relationship

    x_{i+1} = ρ^{t_{i+1} − t_i} x_i + d_{i+1},    (4.45)

where ρ is a (nonrandom) scalar in the interval 0 < ρ < 1.
Suppose that this replacement is made. Then, in lieu of result (4.38), we have that

    var(x_{i+1}) = ρ^{2(t_{i+1} − t_i)} var(x_i) + var(d_{i+1}).

And by employing essentially the same reasoning as before, we find that the sequence x_1, x_2, x_3, ...
satisfies condition (4.39) [i.e., the condition that the sequence is stationary in the sense that var(x_1) =
var(x_2) = var(x_3) = ··· ] if (and only if)

    var(d_{i+1}) = [1 − ρ^{2(t_{i+1} − t_i)}] var(x_1) (for all i).    (4.46)

Further, in lieu of result (4.42), we have that

    cov(x_i, x_{i+s}) = ρ^{t_{i+s} − t_i} var(x_i).

Thus, in the special case where the sequence x_1, x_2, x_3, ... satisfies condition (4.39), we have that

    corr(x_i, x_{i+s}) = ρ^{t_{i+s} − t_i}

or, equivalently, that (for r ≠ i = 1, 2, 3, ...)

    corr(x_i, x_r) = ρ^{|t_r − t_i|}    (4.47)

—these two results take the place of results (4.43) and (4.44), respectively.
In connection with result (4.47), it is worth noting that [in the special case where the sequence
x_1, x_2, x_3, ... satisfies condition (4.39)] corr(x_i, x_r) → 1 as |t_r − t_i| → 0 and corr(x_i, x_r) → 0 as
|t_r − t_i| → ∞. And (in that special case) the quantity ρ represents the correlation of any two of the
x_i's that are separated from each other by a single unit of time.
In what follows, it is the stochastic process defined by relationship (4.45) that is referred to
as a first-order autoregressive process. Moreover, in the special case where the stochastic process
defined by relationship (4.45) satisfies condition (4.39), it is referred to as a stationary first-order
autoregressive process.
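For reference, the correlation matrix implied by (4.47) at an arbitrary (possibly unequally spaced) set of times can be computed directly; a minimal Python sketch, with made-up times and an illustrative value of ρ:

    import numpy as np

    def ar1_corr(times, rho):
        """Correlation matrix of a stationary first-order autoregressive process
        observed at the given times; the (i, r) entry is rho ** |t_r - t_i|, as in (4.47)."""
        t = np.asarray(times, dtype=float)
        return rho ** np.abs(np.subtract.outer(t, t))

    print(ar1_corr([0.0, 0.5, 2.0, 3.5], rho=0.7))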
Variance-covariance matrix of a subvector of residual effects. Let us now return to the main de-
velopment, and consider further the form of the matrices var.e1 /; var.e2 /; : : : ; var.eK /. We could
consider taking (for an “arbitrary” k) var.ek / to be of the form that would result from regarding
ek1 ; ek2 ; : : : ; ekNk as Nk successive members of a stationary first-order autoregressive process, in
which case we would have (based on the results of Part 1) that (for some strictly positive scalar σ
and some scalar ρ in the interval 0 < ρ < 1) var(e_{k1}) = var(e_{k2}) = ··· = var(e_{kN_k}) = σ²
and corr(e_{ks}, e_{kj}) = ρ^{|t_{kj} − t_{ks}|} (j ≠ s = 1, 2, ..., N_k). However, for most applications, taking
var.ek / to be of that form would be inappropriate. It would imply that residual effects corresponding
to “replicate” data points, that is, data points pertaining to the same observational unit and to the
same time, are perfectly correlated. It would also imply that residual effects corresponding to data
points that pertain to the same observational unit, but that are widely separated in time, are essentially
uncorrelated. Neither characteristic conforms to what is found in many applications. There may be
“measurement error,” in which case the residual effects corresponding to replicate data points are
expected to differ from one another. And all of the data points that pertain to the same observational
unit (including ones that are widely separated in time) may be subject to some common influences,
in which case every two of the residual effects corresponding to those data points may be correlated
to at least some minimal extent.
Results that are better suited for most applications can be obtained by adopting a somewhat
more elaborate approach. This approach builds on the approach introduced in Part 3 of Subsection
b in connection with compound symmetry. Take ak (k D 1; 2; : : : ; K) and rks (k D 1; 2; : : : ; K;
s = 1, 2, ..., N_k) to be uncorrelated random variables, each with mean 0, such that var(a_k) = τ_k²
for some nonnegative scalar τ_k and var(r_{ks}) = δ_k² (s = 1, 2, ..., N_k) for some strictly positive
scalar δ_k. Further, take f_{ks} (k = 1, 2, ..., K; s = 1, 2, ..., N_k) to be random variables, each with
mean 0, that are uncorrelated with the ak ’s and the rks ’s; take the elements of each of the K sets
ffk1 ; fk2 ; : : : ; fkNk g (k D 1; 2; : : : ; K) to be uncorrelated with those of each of the others; and
take the variances and covariances of fk1 ; fk2 ; : : : ; fkNk to be those obtained by regarding these
random variables as Nk successive members of a stationary first-order autoregressive process, so


that (for some strictly positive scalar ω_k and some scalar ρ_k in the interval 0 < ρ_k < 1) var(f_{k1}) =
var(f_{k2}) = ··· = var(f_{kN_k}) = ω_k² and corr(f_{ks}, f_{kj}) = ρ_k^{|t_{kj} − t_{ks}|} (j ≠ s = 1, 2, ..., N_k).

Now, suppose that, for k D 1; 2; : : : ; K and s D 1; 2; : : : ; Nk ,


    e_{ks} = a_k + f_{ks} + r_{ks}    (4.48)

—clearly, this supposition is compatible with the assumptions that E(e) = 0 and that var(e) is of the
block-diagonal form (4.35). Equality (4.48) decomposes e_{ks} into three uncorrelated components; it
implies that var(e_{ks}) equals the quantity σ_k² defined by

    σ_k² = τ_k² + ω_k² + δ_k².    (4.49)


Further, for j ≠ s,

    cov(e_{ks}, e_{kj}) = τ_k² + ω_k² ρ_k^{|t_{kj} − t_{ks}|}    (4.50)

and hence

    corr(e_{ks}, e_{kj}) = [τ_k² + ω_k² ρ_k^{|t_{kj} − t_{ks}|}] / σ_k²
                         = [τ_k² + ω_k² ρ_k^{|t_{kj} − t_{ks}|}] / (τ_k² + δ_k² + ω_k²)
                         = [(τ_k/δ_k)² + (ω_k/δ_k)² ρ_k^{|t_{kj} − t_{ks}|}] / [1 + (τ_k/δ_k)² + (ω_k/δ_k)²].

The correlation corr(e_{ks}, e_{kj}) can be regarded as a function of the elapsed time |t_{kj} − t_{ks}|. Clearly,
if the τ_k's, ω_k's, δ_k's, and ρ_k's are such that τ_k/δ_k, ω_k/δ_k, and ρ_k do not vary with k, then this
function does not vary with k.
In connection with supposition (4.48), it may be advisable to extend the parameter space of ω_k by
appending the value 0, which corresponds to allowing for the possibility that var(f_{k1}) = var(f_{k2}) =
··· = var(f_{kN_k}) = 0 or, equivalently, that f_{k1} = f_{k2} = ··· = f_{kN_k} = 0 with probability 1. When
ω_k = 0, var(e_k) is of the compound-symmetric form considered in Subsection b.
In practice, it is common to make the simplifying assumption that the δ_k's, τ_k's, ω_k's, and
ρ_k's do not vary with k, so that δ_1 = δ_2 = ··· = δ_K = δ, τ_1 = τ_2 = ··· = τ_K = τ,
ω_1 = ω_2 = ··· = ω_K = ω, and ρ_1 = ρ_2 = ··· = ρ_K = ρ for some strictly positive scalar δ, for
some nonnegative scalar τ, for some strictly positive (or alternatively nonnegative) scalar ω, and for
some scalar ρ in the interval 0 < ρ < 1. Under that assumption, neither var(e_{ks}) nor corr(e_{ks}, e_{kj})
varies with k. Moreover, if ρ and the ratios τ/δ and ω/δ are known, then var(e) is of the form
σ²H of var(e) in the Aitken generalization of the G–M model. When δ, τ, ω, and ρ are regarded as
unknown parameters and are taken to be the elements of the vector θ, var(e) is of the form V(θ) of
var(e) in the general linear model.
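Under that simplifying assumption, each block var(e_k) in (4.35) has diagonal elements of the form (4.49) and off-diagonal elements of the form (4.50). A minimal Python sketch of one such block, using hypothetical parameter values (the τ, ω, δ, and ρ below are chosen only for illustration):

    import numpy as np

    def var_ek(times, tau, omega, delta, rho):
        """var(e_k) for one observational unit observed at the given times:
        the decomposition e_ks = a_k + f_ks + r_ks gives diagonal elements
        tau^2 + omega^2 + delta^2, as in (4.49), and off-diagonal elements
        tau^2 + omega^2 * rho**|t_kj - t_ks|, as in (4.50)."""
        t = np.asarray(times, dtype=float)
        ar = rho ** np.abs(np.subtract.outer(t, t))
        V = tau**2 * np.ones((t.size, t.size)) + omega**2 * ar
        V[np.diag_indices(t.size)] += delta**2
        return V

    # hypothetical values, for illustration only
    print(var_ek([1.0, 2.0, 4.0], tau=0.5, omega=1.0, delta=0.3, rho=0.8))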

f. Example: dental data


Potthoff and Roy (1964) reported the results of a study involving youngsters (carried out at the
University of North Carolina Dental School) of how the distance from the center of the pituitary to
the pteryomaxillary fissure changes with age. A measurement of this distance was obtained for each
of 27 youngsters (11 girls and 16 boys) at each of 4 ages (8, 10, 12, and 14 years). The resultant data
are reproduced in Table 4.6.
These data may be suitable for application of a general linear model. We can take N = 27 × 4 =
108. And, upon taking y_1, y_2, ..., y_27 to be the 4-dimensional subvectors of y defined implicitly
by the partitioning y′ = (y′_1, y′_2, ..., y′_27), we can (for k = 1, 2, ..., 27) regard the measurements
obtained on the kth youngster at ages 8, 10, 12, and 14 as the observed values of the elements of yk .
Further, we can take C D 2, take u1 to be the youngster’s age at the time of the measurement, and
take u2 to be a variable that has one value, say 0, if the youngster is a girl, and a second (different)
value, say 1, if the youngster is a boy.
TABLE 4.6. Data on 27 youngsters (11 girls and 16 boys) at 4 different ages: each data point is a measurement of
the distance (in millimeters) from the center of the pituitary to the pteryomaxillary fissure (Potthoff
and Roy 1964).

Girls Boys
Age (in years) Age (in years)
Youngster 8 10 12 14 Youngster 8 10 12 14
1 21 20 21.5 23 12 26 25 29 31
2 21 21.5 24 25.5 13 21.5 22.5 23 26.5
3 20.5 24 24.5 26 14 23 22.5 24 27.5
4 23.5 24.5 25 26.5 15 25.5 27.5 26.5 27
5 21.5 23 22.5 23.5 16 20 23.5 22.5 26
6 20 21 21 22.5 17 24.5 25.5 27 28.5
7 21.5 22.5 23 25 18 22 22 24.5 26.5
8 23 23 23.5 24 19 24 21.5 24.5 25.5
9 20 21 22 21.5 20 23 20.5 31 26
10 16.5 19 19 19.5 21 27.5 28 31 31.5
11 24.5 25 28 28 22 23 23 23.5 25
23 21.5 23.5 24 28
24 17 24.5 26 29.5
25 22.5 25.5 25.5 26
26 23 24.5 26 30
27 22 21.5 23.5 25

We might consider taking

           ⎧ r(u_1) if u_2 = 0,
    δ(u) = ⎨                         (4.51)
           ⎩ s(u_1) if u_2 = 1,
where r.u1 / and s.u1 / are both polynomials in u1 (the coefficients of which are unknown parameters).
For example, r.u1 / and s.u1 / might both be polynomials of degree 3 in u1 ; that is, r.u1 / and s.u1 /
might be of the form
    r(u_1) = β_1 + β_2 u_1 + β_3 u_1² + β_4 u_1³    (4.52)
and
    s(u_1) = β_5 + β_6 u_1 + β_7 u_1² + β_8 u_1³,    (4.53)
where ˇ1 ; ˇ2 ; : : : ; ˇ8 are unknown parameters. If ı.u/ is taken to be of the form (4.51) for polyno-
mials r.u1 / and s.u1 / of the form (4.52) and (4.53), then the vector ˇ (in the general linear model)
is of dimension 8 (with elements ˇ1 ; ˇ2 ; : : : ; ˇ8 ) and the model matrix is
        ⎡ X*  0  ⎤
        ⎢ X*  0  ⎥
        ⎢  ⋮   ⋮  ⎥
        ⎢ X*  0  ⎥                        ⎡ 1   8   64   512 ⎤
    X = ⎢ 0   X* ⎥ ,    where    X* =    ⎢ 1  10  100  1000 ⎥
        ⎢ 0   X* ⎥                        ⎢ 1  12  144  1728 ⎥
        ⎢  ⋮   ⋮  ⎥                        ⎣ 1  14  196  2744 ⎦
        ⎣ 0   X* ⎦

(the first 11 blocks of rows of X, one per girl, are of the form (X*, 0), and the remaining 16 blocks,
one per boy, are of the form (0, X*)).

Taking ı.u/ to be of the form (4.51) allows for the possibility that the relationship between distance
and age may be markedly different for boys than for girls.
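A small Python sketch of how the 108 × 8 model matrix X just displayed could be assembled (the ordering—11 girls followed by 16 boys, each contributing the four ages 8, 10, 12, and 14—follows the partitioning of y described above):

    import numpy as np

    ages = np.array([8.0, 10.0, 12.0, 14.0])
    X_star = np.column_stack([ages**p for p in range(4)])   # rows (1, u1, u1^2, u1^3)

    n_girls, n_boys = 11, 16
    zero = np.zeros_like(X_star)

    girl_block = np.hstack([X_star, zero])    # girls: only the r(u1) coefficients enter
    boy_block = np.hstack([zero, X_star])     # boys:  only the s(u1) coefficients enter

    X = np.vstack([girl_block] * n_girls + [boy_block] * n_boys)
    print(X.shape)   # (108, 8)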

The distance measurements can be regarded as longitudinal data. The 27 youngsters (11 girls
and 16 boys) form K D 27 observational units, each of which contributes 4 data points. And age
plays the role of time.
Partition the (column) vector e of residual effects (in the general linear model) into 4-dimensional
subvectors e1 ; e2 ; : : : ; e27 [in such a way that e0 D .e01 ; e02 ; : : : ; e027 /]. (This partitioning corresponds
to the partitioning of y into the subvectors y1 ; y2 ; : : : ; y27 .) And (for k D 1; 2; : : : ; 27) denote by
ek1 , ek2 , ek3 , and ek4 the elements of the subvector ek —these are the residual effects that correspond
to the distance measurements on the kth youngster at ages 8, 10, 12, and 14. Then, proceeding as in
Subsection e, we could take var.e/ to be of the form

    var(e) = diag[var(e_1), var(e_2), ..., var(e_27)].

Further, for k D 1; 2; : : : ; 27, we could take var.ek / to be of the form associated with the
supposition that (for s D 1; 2; 3; 4) eks is expressible in the form of the decomposition (4.48).
That is, we could take the diagonal elements of var.ek / to be of the form (4.49), and the off-
diagonal elements to be of the form (4.50). Expressions (4.49) and (4.50) determine var.ek / up
to the values of the four parameters τ_k, ω_k, δ_k, and ρ_k. It is assumed that, for k = 1, 2, ..., 11
(corresponding to the 11 girls), the values of τ_k, ω_k, δ_k, and ρ_k do not vary with k, and similarly that,
for k = 12, 13, ..., 27 (corresponding to the 16 boys), the values of τ_k, ω_k, δ_k, and ρ_k do not vary
with k. Under that assumption, var.e/ would depend on 8 parameters. Presumably, those parameters
would be unknown, in which case the vector θ of unknown parameters in the variance-covariance
matrix V(θ) (of the vector e of residual effects in the general linear model) would be of dimension 8.
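Continuing the sketch given at the end of Subsection e, the 108 × 108 matrix V(θ) could then be assembled block by block, with one (hypothetical) set of parameter values shared by the 11 girls and another shared by the 16 boys:

    import numpy as np

    def var_ek(times, tau, omega, delta, rho):
        """One 4 x 4 block of V(theta): diagonal elements as in (4.49),
        off-diagonal elements as in (4.50)."""
        t = np.asarray(times, dtype=float)
        ar = rho ** np.abs(np.subtract.outer(t, t))
        V = tau**2 * np.ones((t.size, t.size)) + omega**2 * ar
        V[np.diag_indices(t.size)] += delta**2
        return V

    ages = [8.0, 10.0, 12.0, 14.0]
    girl_pars = dict(tau=0.6, omega=0.9, delta=0.4, rho=0.7)   # hypothetical values
    boy_pars = dict(tau=0.8, omega=1.1, delta=0.5, rho=0.7)    # hypothetical values

    blocks = [var_ek(ages, **girl_pars)] * 11 + [var_ek(ages, **boy_pars)] * 16
    V = np.zeros((108, 108))
    for k, B in enumerate(blocks):
        V[4*k:4*k+4, 4*k:4*k+4] = B
    print(V.shape, np.allclose(V, V.T))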

g. Spatial data
There are situations where each of the N data points is associated with a specific location (in 1-,
2-, or 3-dimensional space). For example, the data points might be measurements of the hardness
of samples of water obtained from different wells. Data of this kind are among those referred to as
spatial data.
Suppose that the observed value of each of the random variables y1 ; y2 ; : : : ; yN in the general
linear model is associated with a specific location in D-dimensional space. Further, let us represent
an arbitrary location in D-dimensional space by a D-dimensional column vector s of “coordinates,”
denote by s1 ; s2 ; : : : ; sN the values of s corresponding to y1 ; y2 ; : : : ; yN , respectively, and take S to
be a finite or infinite set of values of s that includes s1 ; s2 ; : : : ; sN (and perhaps other values of s that
may be of interest). Assume (as can be done essentially without loss of generality) that the vector u
of explanatory variables includes the vector s as a subvector or, more generally, that the elements of
s are expressible as functions of u, so that (for i D 1; 2; : : : ; N ) the i th value si of s is determinable
from the i th value ui of u.
Typically, data points associated with locations that are in close proximity tend to be more alike
than those associated with locations that are farther apart. This phenomenon may be due in part to
“systematic forces” that manifest themselves as “trends” or “gradients” in the surface defined by
the function ı.u/—whatever part may be due to these systematic forces is sometimes referred to
as large-scale variation. However, typically not all of this phenomenon is attributable to systematic
forces or is reflected in the surface defined by ı.u/. There is generally a nonsystematic component.
The nonsystematic component takes the form of what is sometimes called small-scale variation and
is reflected in the correlation matrix of the residual effects e1 ; e2 ; : : : ; eN . The residual effects may
be positively correlated, with the correlation being greatest among residual effects corresponding to
locations that are in close proximity.
It is supposed that the residual effects e1 ; e2 ; : : : ; eN are expressible as follows:

    e_i = a_i + r_i  (i = 1, 2, ..., N).    (4.54)

Here, a1 ; a2 ; : : : ; aN and r1 ; r2 ; : : : ; rN are random variables, each with an expected value of 0.


Moreover, r1 ; r2 ; : : : ; rN are assumed to be uncorrelated with each other and with a1 ; a2 ; : : : ; aN .
And it is sometimes assumed that
    var(a_i) = ω²  and  var(r_i) = δ²  (i = 1, 2, ..., N)    (4.55)
for some nonnegative scalar ω and some strictly positive scalar δ, in which case
    var(e_i) = σ²  (i = 1, 2, ..., N),    (4.56)
where σ² = ω² + δ² > 0.
The random variables a_1, a_2, ..., a_N may be correlated, with the magnitude of the correlation
between any two of them, say a_i and a_j, depending on the corresponding locations s_i and s_j. Of
particular interest is the case where

    cov(a_i, a_j) = √(var a_i) √(var a_j) K(s_i − s_j)  (i, j = 1, 2, ..., N).    (4.57)

Here, K(·) is a function whose domain is the set

    H = {h ∈ R^D : h = s − t for s, t ∈ S}

and that has the following three properties: (1) K(0) = 1; (2) K(−h) = K(h) for h ∈ H; and
(3) Σ_{i=1}^{M} Σ_{j=1}^{M} x_i x_j K(t_i − t_j) ≥ 0 for every positive integer M, for all not-necessarily-distinct
vectors t_1, t_2, ..., t_M in S, and for all scalars x_1, x_2, ..., x_M. Note that the third property of K(·),
in combination with the first two properties, establishes that the M × M matrix with ijth element
K(t_i − t_j) can serve as a correlation matrix—it is symmetric and nonnegative definite and its diagonal
elements equal 1. It establishes, in particular, that the N × N matrix with ijth element K(s_i − s_j)
can serve as a correlation matrix. The function K(·) is sometimes referred to as an autocorrelation
function or a correlogram.
Suppose that the variances of the random variables r1 ; r2 ; : : : ; rN and the variances and covari-
ances of the random variables a1; a2 ; : : : ; aN are of the form specified in conditions (4.55) and (4.57).
Then, the variances of the residual effects e1 ; e2 ; : : : ; eN are of the form (4.56), and the covariances
of the residual effects are of the form

    cov(e_i, e_j) = cov(a_i, a_j) = ω² K(s_i − s_j)  (j ≠ i = 1, 2, ..., N).    (4.58)

Further, the correlation between the ith and jth residual effects is of the form

    corr(e_i, e_j) = (ω²/σ²) K(s_i − s_j) = [ω²/(ω² + δ²)] K(s_i − s_j)
                   = {(ω/δ)² / [1 + (ω/δ)²]} K(s_i − s_j).
Accordingly, the distribution of the residual effects is weakly stationary in the sense that all of the
residual effects have the same variance and in the sense that the covariance and correlation between
any two residual effects depend on the corresponding locations only through the difference between
the locations.
The variance σ² of the residual effects is the sum of two components: (1) the variance ω² of
the a_i's and (2) the variance δ² of the r_i's. The first component ω² accounts for a part of whatever
variability is spatial in origin, that is, a part of whatever variability is related to differences in
location—it accounts for the so-called small-scale variation. The second component δ² accounts for
the remaining variability, including any variability that may be attributable to measurement error.
Depending on the application, some of the N locations s1 ; s2 ; : : : ; sN may be identical. For
example, the N data points may represent the scores achieved on a standardized test by N different
students and (for i = 1, 2, ..., N) s_i may represent the location of the school system in which
the ith student is enrolled. In such a circumstance, the variation accounted for by δ² would include
the variation among those residual effects for which the corresponding locations are the same. If
there were L distinct values of s represented among the N locations s1 ; s2 ; : : : ; sN and the residual
effects e1 ; e2 ; : : : ; eN were divided into L classes in such a way that the locations corresponding
to the residual effects in any particular class were identical, then the ratio ω²/(ω² + δ²) could be
regarded as the intraclass correlation—refer to Subsection b.
There may be spatial variation that is so localized in nature that it does not contribute to the
covariances among the residual effects. This kind of spatial variation is sometimes referred to as
microscale variation (e.g., Cressie 1993). The range of its influence is less than the distance between
any two of the locations s1 ; s2 ; : : : ; sN . Its contribution, which has come to be known as the “nugget
effect," is reflected in the component δ².
To complete the specification of the form (4.58) of the covariances of the residual effects
e1 ; e2 ; : : : ; eN , it remains to specify the form of the function K./. In that regard, it suffices to
take [for an arbitrary D-dimensional column vector h in the domain H of K./]

    K(h) = E(cos h′w),    (4.59)

where w is a D-dimensional random column vector (the distribution of which may depend on
unknown parameters).
Let us confirm this claim; that is, let us confirm that when K./ is taken to be of the form (4.59),
it has (even if S and hence H comprise all of RD ) the three properties required of an autocorrelation
function. Recall that cos 0 = 1; that for any real number x, cos(−x) = cos x; and that for any real
numbers x and z, cos(x − z) = (cos x)(cos z) + (sin x)(sin z). When K(·) is taken to be of the form
(4.59), we have (in light of the properties of the cosine operator) that (1)

    K(0) = E(cos 0′w) = E(cos 0) = E(1) = 1;

that (2) for h ∈ H,

    K(−h) = E[cos (−h)′w] = E[cos(−h′w)] = E(cos h′w) = K(h);

and that (3) for every positive integer M, any M vectors t_1, t_2, ..., t_M in S, and any M scalars
x_1, x_2, ..., x_M,

    Σ_{i=1}^{M} Σ_{j=1}^{M} x_i x_j K(t_i − t_j)
        = Σ_{i=1}^{M} Σ_{j=1}^{M} x_i x_j E[cos(t_i′w − t_j′w)]
        = E[ Σ_{i=1}^{M} Σ_{j=1}^{M} x_i x_j cos(t_i′w − t_j′w) ]
        = E[ Σ_{i=1}^{M} Σ_{j=1}^{M} x_i x_j (cos t_i′w)(cos t_j′w) + Σ_{i=1}^{M} Σ_{j=1}^{M} x_i x_j (sin t_i′w)(sin t_j′w) ]
        = E[ (Σ_{i=1}^{M} x_i cos t_i′w)² + (Σ_{i=1}^{M} x_i sin t_i′w)² ]
        ≥ 0.

Thus, when K./ is taken to be of the form (4.59), it has (even if S and H comprise all of RD ) the
requisite three properties.
Expression (4.59) depends on the distribution of w. The evaluation of this expression for a
distribution of any particular form is closely related to the evaluation of the characteristic function
of a distribution of that form. The characteristic function of the distribution of w is the function c(·)
defined (for an arbitrary D-dimensional column vector h) by c(h) = E(e^{ih′w}) (where i = √−1).
The characteristic function of the distribution of w can be expressed in the form

    c(h) = E(cos h′w) + i E(sin h′w),    (4.60)
the real component of which is identical to expression (4.59). It is worth mentioning that if the
distribution of w has a moment generating function, say m(·), then

    c(h) = m(ih)    (4.61)

(e.g., Grimmett and Welsh 1986, p. 117).
In the special case where w is distributed symmetrically about 0 (i.e., where −w ∼ w), expression
(4.60) simplifies to

    c(h) = E(cos h′w),

as is evident upon recalling that (for any real number x) sin(−x) = −sin x and then observing that
E(sin h′w) = E[sin h′(−w)] = E[sin(−h′w)] = −E(sin h′w) [which implies that E(sin h′w) = 0].
Thus, in the special case where w is distributed symmetrically about 0, taking K(·) to be of the
form (4.59) is equivalent to taking K(·) to be of the form

    K(h) = c(h).

Suppose, for example, that w ∼ N(0, Γ), where Γ = {γ_ij} is a symmetric nonnegative definite
matrix. Then, it follows from result (3.5.47) that the moment generating function m(·) of the
distribution of w is

    m(h) = exp(½ h′Γh),

implying [in light of result (4.61)] that the characteristic function c(·) of the distribution of w is

    c(h) = exp(−½ h′Γh).

Thus, the choices for the form of the function K(·) include

    K(h) = exp(−½ h′Γh).    (4.62)

Autocorrelation functions of the form (4.62) are referred to as Gaussian.


When K(·) is taken to be of the form (4.62), the matrix V(θ) (the variance-covariance matrix
of the residual effects in the general linear model) is the N × N matrix whose diagonal elements
are σ² = ω² + δ² and whose ijth off-diagonal element equals ω² exp[−½(s_i − s_j)′Γ(s_i − s_j)].
Assuming that Γ is to be regarded as unknown, the vector θ (on which the elements of the matrix
V(θ) are functionally dependent) could be taken to be the [2 + D(D+1)/2]-dimensional column
vector whose elements are ω, δ, and γ_ij (j ≥ i = 1, 2, ..., D).
If in addition to satisfying the three properties required of an autocorrelation function, the function
K(·) is such that K(h) depends on h only through the value of ‖h‖ (the usual norm of the vector h),
the function is said to be isotropic (e.g., Cressie 1993). For example, an autocorrelation function of
the form (4.62) would be isotropic if the matrix Γ were restricted to D × D matrices of the form γI,
where γ is a nonnegative scalar. If Γ were restricted in that way (and the scalar γ were regarded as an
unknown parameter), then (in connection with the variance-covariance matrix V(θ) of the residual
effects in the general linear model) the vector θ could be taken to be the 3-dimensional (column)
vector with elements ω, δ, and γ. Isotropic autocorrelation functions are quite popular. For a list of
some of the more widely employed isotropic autocorrelation functions, refer, for example, to Littell
et al. (2006, sec. 11.3).
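As a sketch of this isotropic Gaussian case, the Python snippet below builds V(θ) for a few made-up 2-dimensional locations, with diagonal elements ω² + δ² and ijth off-diagonal element ω² exp(−γ‖s_i − s_j‖²/2); the parameter values are hypothetical.

    import numpy as np

    def spatial_V(locations, omega, delta, gamma):
        """V(theta) under the isotropic Gaussian correlogram (4.62) with Gamma = gamma*I:
        diagonal elements omega^2 + delta^2, off-diagonal elements
        omega^2 * exp(-gamma * ||s_i - s_j||^2 / 2), as in (4.56) and (4.58)."""
        s = np.asarray(locations, dtype=float)
        d2 = np.sum((s[:, None, :] - s[None, :, :])**2, axis=-1)
        V = omega**2 * np.exp(-0.5 * gamma * d2)
        V[np.diag_indices(len(s))] += delta**2
        return V

    # made-up locations (first and second coordinates), for illustration only
    locs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
    print(spatial_V(locs, omega=1.0, delta=0.5, gamma=0.2))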

h. Example: tree-height data


Zhang, Bi, Cheng, and Davis (2004) considered a set of measurements of the diameters (at breast
height overbark) and heights of trees in a circular plot of radius 40 m. The plot is located in the
regrowth Eucalyptus fastigata forests in Glenbog State Forest in New South Wales, Australia. While
E. fastigata was the dominant species, other eucalypts were also present as was a species of smaller
trees or shrubs. Double stems, coppicing (i.e., the formation of trees or shrubs from shoots or root

TABLE 4.7. The diameters and heights and the coordinates of the locations of 101 trees in a circular region of
radius 40 m—the locations are relative to the center of the region (Zhang, Bi, Cheng, and Davis
2004).

Coordinates Coordinates
Diam. Hgt. Diam. Hgt.
Tree (cm) (m) 1st (m) 2nd (m) Tree (cm) (m) 1st (m) 2nd (m)
1 17:5 14:0 0:73 1:31 52 43:7 31:5 32:13 18:55
2 18:3 15:1 1:99 0:24 53 23:5 22:0 31:55 18:95
3 10:1 7:3 2:96 3:65 54 53:1 25:1 19:79 4:93
4 49:0 26:5 32:90 15:34 55 18:4 12:1 21:30 6:92
5 32:9 22:9 18:28 11:87 56 38:8 24:7 24:06 7:82
6 50:6 16:3 22:31 1:95 57 66:7 35:3 32:98 12:00
7 66:2 26:7 27:69 3:40 58 61:7 31:5 36:32 9:73
8 73:5 31:0 29:30 0:51 59 15:9 11:2 35:84 9:60
9 39:7 26:5 37:65 1:97 60 14:5 11:5 22:11 4:70
10 69:8 32:5 39:75 3:47 61 18:8 14:2 1:98 0:31
11 67:4 28:3 15:78 0:83 62 19:4 14:8 2:26 0:44
12 68:6 31:1 33:45 9:60 63 16:2 12:7 8:25 0:87
13 27:0 15:2 34:43 10:53 64 55:9 26:0 13:10 2:31
14 44:8 25:2 36:40 12:54 65 55:9 25:6 13:10 2:31
15 44:4 29:2 32:00 15:61 66 39:6 24:3 28:44 2:99
16 44:5 32:4 29:53 17:05 67 12:7 11:4 29:43 2:06
17 72:0 28:8 9:92 6:20 68 10:2 10:6 31:08 2:72
18 94:2 33:0 12:70 9:57 69 10:2 9:2 31:18 2:73
19 63:8 29:0 29:39 22:97 70 10:3 7:2 31:88 2:79
20 12:7 12:9 20:61 22:89 71 15:3 13:2 33:65 1:77
21 40:6 23:9 14:65 22:56 72 13:5 8:4 36:08 1:26
22 31:5 13:2 13:90 23:14 73 10:8 8:3 34:33 11:16
23 40:2 26:7 16:37 26:21 74 55:8 27:9 28:09 11:35
24 38:9 22:6 16:70 34:25 75 55:8 27:5 28:09 11:35
25 20:0 15:0 16:14 34:62 76 41:4 25:5 25:72 14:85
26 17:7 9:1 10:60 29:13 77 70:4 25:0 18:45 11:99
27 24:3 18:0 11:18 29:13 78 79:0 27:5 25:29 19:77
28 10:4 10:1 0:34 3:89 79 12:0 10:6 8:19 13:63
29 52:2 28:9 0:71 20:39 80 14:7 14:5 28:34 27:37
30 17:7 13:1 0:00 21:90 81 12:4 10:2 16:44 24:38
31 19:9 13:6 0:75 21:49 82 19:4 16:8 18:33 29:34
32 56:4 29:0 1:68 24:04 83 120:0 34:0 2:41 17:13
33 27:0 16:6 2:55 36:51 84 10:2 10:0 13:75 30:88
34 37:0 28:5 2:73 19:41 85 28:7 21:3 14:37 33:88
35 55:2 24:1 3:43 32:62 86 36:5 23:9 14:49 34:15
36 81:2 30:0 4:59 32:68 87 30:4 21:0 9:79 32:04
37 41:3 27:4 4:37 8:21 88 18:8 12:1 9:76 31:94
38 16:9 13:8 19:75 34:21 89 59:0 26:6 6:62 24:73
39 15:2 12:3 19:61 33:94 90 38:7 19:7 5:88 23:58
40 11:6 12:4 17:09 17:69 91 15:0 14:1 8:03 34:79
41 12:0 10:5 19:64 21:06 92 58:0 30:8 3:24 13:00
42 11:6 13:6 22:89 32:68 93 34:7 26:6 3:54 33:71
43 10:6 10:9 23:09 18:04 94 38:3 29:2 3:65 34:71
44 36:7 20:0 26:04 22:63 95 44:4 29:2 5:51 25:92
45 31:5 15:9 25:49 22:95 96 48:3 25:7 6:36 25:52
46 65:8 33:0 10:57 6:86 97 32:8 15:7 7:73 23:78
47 51:1 27:4 14:41 6:72 98 83:7 26:7 14:39 15:43
48 19:4 10:5 14:74 7:19 99 39:0 25:1 19:46 16:32
49 30:0 21:9 18:61 9:07 100 49:0 25:4 28:67 21:60
50 40:2 23:3 18:71 9:53 101 14:4 11:4 3:61 29:38
51 136:0 32:5 27:23 13:28

suckers rather than seed), and double leaders occurred with some frequency. The diameters and
heights of the trees are listed in Table 4.7, along with the location of each tree. Only those trees with
a diameter greater than 10 cm were considered; there were 101 such trees.
Interest centered on ascertaining how tree height relates to tree diameter. Refer to Zhang et al.
(2004) for some informative graphical displays that bear on this relationship and that indicate how
the trees are distributed within the plot.
The setting is one that might be suitable for the application of a general linear model. We could
take N D 101, take the observed value of yi to be the logarithm of the height of the i th tree
(i D 1; 2; : : : ; 101), take C D 3, take u1 to be the diameter of the tree, and take u2 and u3 to be the
first and second coordinates of the location of the tree. And we could consider taking δ(u) to be of
the simple form

    δ(u) = β_1 + β_2 log u_1,    (4.63)

where β_1 and β_2 are unknown parameters; expression (4.63) is a first-degree polynomial in log u_1
with coefficients β_1 and β_2. Or, following Zhang et al. (2004), we could consider taking δ(u) to
be a variation on the first-degree polynomial (4.63) in which the coefficients of the polynomial are
allowed to differ from location to location (i.e., allowed to vary with u2 and u3 ).
The logarithms of the heights of the 101 trees constitute spatial data. The logarithm of the
height of each tree is associated with a specific location in 2-dimensional space; this location is that
represented by the value of the 2-dimensional vector s whose elements (in this particular setting) are
u2 and u3 . Accordingly, the residual effects e1 ; e2 ; : : : ; eN are likely to be correlated to an extent
that depends on the relative locations of the trees with which they are associated. Trees that are in
close proximity may tend to be subject to similar conditions, and consequently the residual effects
identified with those trees may tend to be similar. It is worth noting, however, that the influence on the
residual effects of any such similarity in conditions could be at least partially offset by the influence
of competition for resources among neighboring trees.
In light of the spatial nature of the data, we might wish to take the variance-covariance matrix
V(θ) of the vector e of residual effects in the general linear model to be that whose diagonal and
off-diagonal elements are of the form (4.56) and (4.58). Among the choices for the form of the
autocorrelation function K(·) is that specified by expression (4.62). It might or might not be realistic
to restrict the matrix Γ in expression (4.62) to be of the form Γ = γI (where γ is a nonnegative
scalar). When Γ is restricted in that way, the autocorrelation function K(·) is isotropic, and the vector
θ could be taken to be the 3-dimensional (column) vector with elements ω, δ, and γ (assuming that
each of these 3 scalars is to be regarded as an unknown parameter).

4.5 Multivariate Data


There are situations where the data are multivariate in nature. That is, there are situations where
the data consist of possibly multiple observations on each of a number of “observational units” and
where the multiple observations represent the observed values of different “response” variables. For
example, the observational units might be individual people, and the response variables might be a
person’s height, weight, and blood pressure.
There is a similarity to longitudinal data; longitudinal data were discussed earlier (in Section 4.4e).
In both cases, there are possibly multiple observations on each of a number of observational units.
However, in the case of multivariate data, the multiple observations represent the observed values
of different response variables, while in the case of longitudinal data, the multiple observations
represent the values obtained at possibly different points in time for what is essentially the same
variable. Actually, longitudinal data can be regarded as a special kind of multivariate data—think
of the different points in time as defining different response variables. Nevertheless, there is a very
meaningful distinction between longitudinal data and the sort of “unstructured” multivariate data
that is the subject of the present section. Longitudinal data exhibit a structure that can be exploited
for modeling purposes. Models (like those considered in Section 4.4e) that exploit that structure are
not suitable for use with unstructured multivariate data.

a. Application of the general linear model to multivariate data


Let us consider the application of the general linear model under circumstances where the observed
values of the random variables y1 ; y2 ; : : : ; yN represent multivariate data. Suppose that there are
R observational units, numbered 1; 2; : : : ; R, and S response variables, numbered 1; 2; : : : ; S. In
general, the data on any particular observational unit may be incomplete; that is, the observed value
of one or more of the S response variables may not be available for that observational unit. Let Rs
represent the number of observational units for which the observed value of the sth response variable
is available (s = 1, 2, ..., S); and observe that Σ_{s=1}^{S} R_s = N (the total number of data points).
Further, let rs1 ; rs2 ; : : : ; rsRs represent the subsequence of the sequence 1; 2; : : : ; R such that the
integer r is in the subsequence if the sth response variable is available on the rth observational unit.
Let us assume that the random variables y1 ; y2 ; : : : ; yN have been numbered in such a way that

    y′ = (y′_1, y′_2, ..., y′_S),

where (for s D 1; 2; : : : ; S) ys is the Rs -dimensional column vector, the kth element of which is
the random variable whose observed value is the value obtained for the sth response variable on
the rsk th observational unit (k D 1; 2; : : : ; Rs ). The setting is such that it is convenient to use two
subscripts instead of one to distinguish among y1 ; y2 ; : : : ; yN and also to distinguish among the
residual effects e1 ; e2 ; : : : ; eN . Let us use the first subscript to identify the response variable and the
second to identify the observational unit. Accordingly, let us write ys rsk for the kth element of ys ,
so that, by definition,
    y_s = (y_{s r_{s1}}, y_{s r_{s2}}, ..., y_{s r_{sR_s}})′.
And let us write es rsk for the residual effect corresponding to ys rsk .
In a typical application, the model is taken to be such that the model matrix is of the block-diagonal
form

    X = diag(X_1, X_2, ..., X_S),    (5.1)

where (for s = 1, 2, ..., S) X_s is of dimensions R_s × P_s (and where P_1, P_2, ..., P_S are positive
integers that sum to P). Now, suppose that X is of the form (5.1), and partition the parameter vector
β into S subvectors β_1, β_2, ..., β_S of dimensions P_1, P_2, ..., P_S, respectively, so that

    β′ = (β′_1, β′_2, ..., β′_S).

Further, partition the vector e of residual effects into S subvectors e_1, e_2, ..., e_S of dimensions
R_1, R_2, ..., R_S, respectively, so that

    e′ = (e′_1, e′_2, ..., e′_S),

where (for s = 1, 2, ..., S) e_s = (e_{s r_{s1}}, e_{s r_{s2}}, ..., e_{s r_{sR_s}})′—the partitioning of e is conformal to
the partitioning of y. Then, the model equation y = Xβ + e is reexpressible as

    y_s = X_s β_s + e_s  (s = 1, 2, ..., S).    (5.2)

Note that the model equation for the vector ys of observations on the sth response variable depends
on only P_s of the elements of β, namely, those P_s elements that are members of the subvector β_s.
In practice, it is often the case that P_1 = P_2 = ··· = P_S = P* (where P* = P/S) and that there
is a matrix X* of dimensions R × P* such that (for s = 1, 2, ..., S) the (first through R_sth) rows of
X_s are respectively the r_{s1}, r_{s2}, ..., r_{sR_s}th rows of X*.
An important special case is that where there is complete information on every observational unit;
that is, where every one of the S response variables is observed on every one of the R observational
units. Then, R_1 = R_2 = ··· = R_S = R. And, commonly, P_1 = P_2 = ··· = P_S = P* (= P/S)
and X_1 = X_2 = ··· = X_S = X* (for some R × P* matrix X*). Under those conditions, it is
possible to reexpress the model equation y = Xβ + e as

    Y = X*B + E,    (5.3)

where Y = (y_1, y_2, ..., y_S), B = (β_1, β_2, ..., β_S), and E = (e_1, e_2, ..., e_S).
The N residual effects es rsk (s D 1; 2; : : : ; S; k D 1; 2; : : : ; Rs ) can be regarded as a subset of
a set of RS random variables es r (s D 1; 2; : : : ; S; r D 1; 2; : : : ; R) having expected values of 0—
think of these RS random variables as the residual effects for the special case where there is complete
information on every observational unit. It is assumed that the distribution of the random variables es r
(s D 1; 2; : : : ; S; r D 1; 2; : : : ; R) is such that the R vectors .e1r ; e2r ; : : : ; eSr /0 (r D 1; 2; : : : ; R)
are uncorrelated and each has the same variance-covariance matrix Σ = {σ_ij}. Then, in the special
case where there is complete information on every observational unit, the variance-covariance matrix
of the vector e of residual effects is
    ⎡ σ_11 I_R   σ_12 I_R   ···   σ_1S I_R ⎤
    ⎢ σ_12 I_R   σ_22 I_R   ···   σ_2S I_R ⎥
    ⎢    ⋮           ⋮        ⋱        ⋮    ⎥ .    (5.4)
    ⎣ σ_1S I_R   σ_2S I_R   ···   σ_SS I_R ⎦

Moreover, in the general case (where the information on some or all of the observational units is
incomplete), the variance-covariance matrix of e is a submatrix of the matrix (5.4). Specifically, it
is the submatrix obtained by replacing the stth block of matrix (5.4) with the R_s × R_t submatrix
formed from that block by striking out all of the rows and columns of the block save the r_{s1}, r_{s2}, ...,
r_{sR_s}th rows and the r_{t1}, r_{t2}, ..., r_{tR_t}th columns (s, t = 1, 2, ..., S).
Typically, the matrix Σ (which is inherently symmetric and nonnegative definite) is assumed
to be positive definite, and its S(S+1)/2 distinct elements, say σ_ij (j ≥ i = 1, 2, ..., S), are
regarded as unknown parameters. The situation is such that (even in the absence of the assumption
that Σ is positive definite) var(e) is of the form V(θ) of var(e) in the general linear model—the
parameter vector θ can be taken to be the [S(S+1)/2]-dimensional (column) vector with elements
σ_ij (j ≥ i = 1, 2, ..., S).
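In the complete-information case, the matrix (5.4) is simply the Kronecker product Σ ⊗ I_R (with the ordering of y used here, in which the response variable varies slowest). A minimal Python sketch with a hypothetical 3 × 3 matrix Σ:

    import numpy as np

    R, S = 5, 3                              # observational units and response variables
    Sigma = np.array([[2.0, 0.5, 0.3],
                      [0.5, 1.5, 0.2],
                      [0.3, 0.2, 1.0]])      # hypothetical S x S variance-covariance matrix

    V = np.kron(Sigma, np.eye(R))            # the RS x RS matrix (5.4)

    # the (s, t) block of V is sigma_st * I_R
    assert np.allclose(V[:R, R:2*R], Sigma[0, 1] * np.eye(R))
    print(V.shape)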

b. Example: data on four characteristics of whey-protein gels


Schmidt, Illingworth, Deng, and Cornell (1979) reported the results of a study of the effects of the
use of various levels of two reagents on the formation of whey-protein gels. The reagents are cysteine
and CaCl2 (calcium chloride). The effects of interest are those on the texture and the water-retention
capacity of the gels. Various textural characteristics are reflected in the values of three variables
known as hardness, cohesiveness, and springiness, and water-retention capacity can be assessed on
the basis of a variable that is referred to as compressible H2 O.
Data on these variables were obtained from an experiment conducted in accordance with a
response-surface design known as a central composite design (one with added center points). The
experiment consisted of 13 trials, each of which took the form of five replications conducted at
pre-specified levels of cysteine and CaCl2 . The five replicate values obtained in each trial for each
of the four response variables (hardness, cohesiveness, springiness, and compressible H2 O) were
TABLE 4.8. Data on the textural characteristics (hardness, cohesiveness, and springiness) and water-retention
capacity (compressible H2 O) of whey-protein gels at various levels of two reagents (cysteine and
CaCl2 ) used in their formation (Schmidt, Illingworth, Deng, and Cornell 1979).

Reagent levels Characteristics of gel


Cysteine CaCl2 Hardness Cohesiveness Springiness Compressible H2 O
Trial (mM) (mM) (kg) (mm) (g)
1 8.0 6.5 2.48 0.55 1.95 0.22
2 34.0 6.5 0.91 0.52 1.37 0.67
3 8.0 25.9 0.71 0.67 1.74 0.57
4 34.0 25.9 0.41 0.36 1.20 0.69
5 2.6 16.2 2.28 0.59 1.75 0.33
6 39.4 16.2 0.35 0.31 1.13 0.67
7 21.0 2.5 2.14 0.54 1.68 0.42
8 21.0 29.9 0.78 0.51 1.51 0.57
9 21.0 16.2 1.50 0.66 1.80 0.44
10 21.0 16.2 1.66 0.66 1.79 0.50
11 21.0 16.2 1.48 0.66 1.79 0.50
12 21.0 16.2 1.41 0.66 1.77 0.43
13 21.0 16.2 1.58 0.66 1.73 0.47

averaged. Accordingly, each trial resulted in four data points, one for each response variable. The
data from the 13 trials are reproduced in Table 4.8.
These data are multivariate in nature. The trials constitute the observational units; there are
R D 13 of them. And there are S D 4 response variables: hardness, cohesiveness, springiness, and
compressible H2 O. Moreover, there is complete information on every observational unit; every one
of the four response variables was observed on every one of the 13 trials.
The setting is one that might be suitable for the application of a general linear model. More
specifically, it might be suitable for the application of a general linear model of the form considered
in Subsection a. In such an application, we might take C D 3, take u1 to be the level of cysteine,
take u2 to be the level of CaCl2 , and take the value of u3 to be 1, 2, 3, or 4 depending on whether
the response variable is hardness, cohesiveness, springiness, or compressible H2 O.
Further, following Schmidt et al. (1979) and adopting the notation of Subsection a, we might
take (for s D 1; 2; 3; 4 and r D 1; 2; : : : ; 13)

    y_{sr} = β_{s1} + β_{s2} u_{1r} + β_{s3} u_{2r} + β_{s4} u_{1r}² + β_{s5} u_{2r}² + β_{s6} u_{1r} u_{2r} + e_{sr},    (5.5)

where u_{1r} and u_{2r} are the values of u_1 and u_2 for the rth observational unit and where
β_{s1}, β_{s2}, ..., β_{s6} are unknown parameters. Taking (for s = 1, 2, 3, 4 and r = 1, 2, ..., 13) y_{sr}
to be of the form (5.5) is equivalent to taking (for s = 1, 2, 3, 4) the vector y_s (with elements
y_{s1}, y_{s2}, ..., y_{sR}) to be of the form (5.2) and to taking X_1 = X_2 = X_3 = X_4 = X*, where X* is
the 13 × 6 matrix whose rth row is (1, u_{1r}, u_{2r}, u_{1r}², u_{2r}², u_{1r}u_{2r}).
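A short Python sketch constructing the 13 × 6 matrix X* from the reagent levels in Table 4.8 and then the block-diagonal model matrix X = diag(X*, X*, X*, X*) of the form (5.1):

    import numpy as np

    # cysteine and CaCl2 levels (mM) for the 13 trials, taken from Table 4.8
    u1 = np.array([8.0, 34.0, 8.0, 34.0, 2.6, 39.4, 21.0, 21.0, 21.0, 21.0, 21.0, 21.0, 21.0])
    u2 = np.array([6.5, 6.5, 25.9, 25.9, 16.2, 16.2, 2.5, 29.9, 16.2, 16.2, 16.2, 16.2, 16.2])

    # rth row of X* is (1, u_1r, u_2r, u_1r^2, u_2r^2, u_1r*u_2r)
    X_star = np.column_stack([np.ones(13), u1, u2, u1**2, u2**2, u1 * u2])

    # model matrix for the S = 4 response variables
    X = np.kron(np.eye(4), X_star)
    print(X_star.shape, X.shape)   # (13, 6) (52, 24)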

Exercises
Exercise 1. Verify formula (2.3).
Exercise 2. Write out the elements of the vector ˇ, of the observed value of the vector y, and of the
matrix X (in the model equation y D Xˇ C e) in an application of the G–M model to the cement
data of Section 4.2 d. In doing so, regard the measurements of the heat that evolves during hardening
as the data points, take C D 4, take u1 ; u2 , u3 , and u4 to be the respective amounts of tricalcium
aluminate, tricalcium silicate, tetracalcium aluminoferrite, and ˇ-dicalcium silicate, and take ı.u/
to be of the form (2.11).
Exercise 3. Write out the elements of the vector ˇ, of the observed value of the vector y, and of the
matrix X (in the model equation y D Xˇ C e) in an application of the G–M model to the lettuce
data of Section 4.2 e. In doing so, regard the yields of lettuce as the data points, take C D 3, take
u1 ; u2 , and u3 to be the transformed amounts of Cu, Mo, and Fe, respectively, and take ı.u/ to be
of the form (2.14).
Exercise 4. Let y represent a random variable and u a C -dimensional random column vector such
that the joint distribution of y and u is MVN (with a nonsingular variance-covariance matrix). And
take z D fzj g to be a transformation (of u) of the form

    z = R′[u − E(u)],

where R is a nonsingular (nonrandom) matrix such that var.z/ D I—the existence of such a matrix
follows from the results of Section 3.3b. Show that
    [E(y | u) − E(y)] / √(var y) = Σ_{j=1}^{C} corr(y, z_j) z_j.

Exercise 5. Let y represent a random variable and u D .u1 ; u2 ; : : : ; uC /0 a C -dimensional random


column vector, assume that var.u/ is nonsingular, and suppose that E.y j u/ is expressible in the
form
    E(y | u) = β_1 + β_2(u_1 − a_1) + β_3(u_2 − a_2) + ··· + β_{C+1}(u_C − a_C),
where a1 ; a2 , : : : ; aC and ˇ1 ; ˇ2 ; ˇ3 , : : : ; ˇC C1 are nonrandom scalars.
(a) Using the results of Section 3.4 (or otherwise), show that

    (β_2, β_3, ..., β_{C+1}) = cov(y, u)(var u)⁻¹,


and that

    β_1 = E(y) + Σ_{j=2}^{C+1} β_j [a_{j−1} − E(u_{j−1})],

in agreement with the results obtained in Section 4.3 (under the assumption that the joint distri-
bution of y and u is MVN).
(b) Show that

    E[var(y | u)] = var(y) − cov(y, u)(var u)⁻¹ cov(u, y).

Exercise 6. Suppose that (in conformance with the development in Section 4.4b) the residual effects
in the general linear model have been partitioned into K mutually exclusive and exhaustive subsets
or classes numbered 1; 2; : : : ; K. And for k D 1; 2; : : : ; K, write ek1 ; ek2 ; : : : ; ekNk for the residual
effects in the kth class. Take a_k (k = 1, 2, ..., K) and r*_{ks} (k = 1, 2, ..., K; s = 1, 2, ..., N_k) to
be uncorrelated random variables, each with mean 0, such that var(a_k) = τ_k² for some nonnegative
scalar τ_k and var(r*_{ks}) = δ_k² (s = 1, 2, ..., N_k) for some strictly positive scalar δ_k. Consider the
effect of taking the residual effects to be of the form

    e_{ks} = a_k + r*_{ks},    (E.1)

rather than of the form (4.26). Are there values of τ_k² and δ_k² for which the value of var(e) is the
same when the residual effects are taken to be of the form (E.1) as when they are taken to be of the
form (4.26)? If so, what are those values; if not, why not?

Exercise 7. Develop a correlation structure for the residual effects in the general linear model that,
in the application of the model to the shear-strength data (of Section 4.4 d), would allow for the
possibility that steel aliquots chosen at random from those on hand on different dates may tend to
be more alike when the intervening time is short than when it is long. Do so by making use of the
results (in Section 4.4 e) on stationary first-order autoregressive processes.
Exercise 8. Suppose (as in Section 4.4g) that the residual effects e1 ; e2 ; : : : ; eN in the general linear
model correspond to locations in D-dimensional space, that these locations are represented by D-
dimensional column vectors s1 ; s2 ; : : : ; sN of coordinates, and that S is a finite or infinite set of
D-dimensional column vectors that includes s1 ; s2 ; : : : ; sN . Suppose further that e1 ; e2 ; : : : ; eN are
expressible in the form (4.54) and that conditions (4.55) and (4.57) are applicable. And take γ*(·) to
be the function defined on the set H = {h ∈ R^D : h = s − t for s, t ∈ S} by

    γ*(h) = δ² + ω²[1 − K(h)].

(a) Show that, for j ≠ i = 1, 2, ..., N,

    ½ var(e_i − e_j) = γ*(s_i − s_j)

—this result serves to establish the function γ(·), defined by

    γ(h) = γ*(h) if h ≠ 0,  and  γ(h) = 0 if h = 0,

as what in spatial statistics is known as a semivariogram (e.g., Cressie 1993).
(b) Show that (1) γ*(0) = δ²; that (2) γ*(−h) = γ*(h) for h ∈ H; and that (3)
Σ_{i=1}^{M} Σ_{j=1}^{M} x_i x_j γ*(t_i − t_j) ≤ 0 for every positive integer M, for all not-necessarily-distinct
vectors t_1, t_2, ..., t_M in S, and for all scalars x_1, x_2, ..., x_M such that Σ_{i=1}^{M} x_i = 0.

Exercise 9. Suppose that the general linear model is applied to the example of Section 4.5b (in the
way described in Section 4.5b). What is the form of the function ı.u/?

Bibliographic and Supplementary Notes


§1. What is herein called the model matrix is sometimes referred to as the design matrix. The term model
matrix seems preferable in that the data may not have come from a designed experiment. Moreover, as discussed
by Kempthorne (1980), different models (with different model matrices) can be contemplated even in the case
of a designed experiment.
§2e. The data on yield reported by Hader, Harward, Mason, and Moore (1957) and presented herein are
part of a larger collection of data reported by Moore, Harward, Mason, Hader, Lott, and Jackson (1957). In
addition to the data on yield, data were obtained on the Cu content and the Fe content of the lettuce plants. And
the experiment was one of four similar experiments; these experiments differed from each other in regard to
the source of Fe and/or the source of nitrogen.
§3. For an illuminating discussion of regression from a historical perspective (that includes a detailed
account of the contributions of Sir Francis Galton), refer to Stigler (1986, chap. 8; 1999, chap. 9).
§4a. For additional discussion of the ways in which the variability of the residual effects can depend on the
explanatory variables, refer to Pinheiro and Bates (2000, sec. 5.2), Carroll and Ruppert (1988, chap. 3), and/or
Davidian and Giltinan (1995, sec. 2.2). They consider the choice of a nonnegative function v.u/ of u such that
(for i D 1; 2; : : : ; N ) var.ei / D v.ui /. They do so in a broad framework that includes choices for v.u/ in
which the dependence on u can be wholly or partly through the value of ı.u/. Such choices result in models for
y1 ; y2 , : : : ; yN of a form not covered by the general linear model (as defined herein)—they result in models in
which the variances of y1 ; y2 , : : : ; yN are related to their expected values.

§4b. In some presentations, the intraclass correlation is taken to be the same for every class, and the permis-
sible values of the intraclass correlation are taken to be all of the values for which the variance-covariance matrix
of the residual effects is nonnegative definite. Then, assuming that there are K classes of sizes N1 ; N2 ; : : : ; NK
and denoting the intraclass correlation by γ, the permissible values would be those in the interval

    −min_k {1/(N_k − 1)} ≤ γ ≤ 1.
One could question whether the intraclass correlation’s being the same for every class is compatible with its being
negative. Assuming that a negative intraclass correlation is indicative of competition among some number ( the
class size) of entities, it would seem that the correlation would depend on the number of entities—presumably,
the pairwise competition would be less intense and the correlation less affected if the number of entities were
relatively large.
§4e. The development in Part 2 is based on taking the correlation structure of the sequence
fk1 ; fk2 ; : : : ; fkNk to be that of a stationary first-order autoregressive process. There are other possible choices
for this correlation structure; see, for example, Diggle, Heagerty, Liang, and Zeger (2002, secs. 4.2.2 and 5.2)
and Laird (2004, sec. 1.3).
§4g. For extensive (book-length) treatises on spatial statistics, refer, e.g., to Cressie (1993) and Schaben-
berger and Gotway (2005). Gaussian autocorrelation functions may be regarded as “artificial” and their use
discouraged; they have certain characteristics that are considered by Schabenberger and Gotway—refer to their
Section 4.3—and by many others to be inconsistent with the characteristics of real physical and biological
processes.
§5a. By transposing both of its sides, model equation (5.3) can be reexpressed in the form of the equation
    Y′ = B′X*′ + E′,
each side of which is an S  R matrix whose rows correspond to the response variables and whose columns
correspond to the observational units. In many publications, the model equation is presented in this alternative
form rather than in the form (5.3). As pointed out, for example, by Arnold (1981, p. 348), the form (5.3) has the
appealing property that, in the special case of univariate data (i.e., the special case where S D 1), each side of
the equation reduces to a column vector (rather than a row vector), in conformance with the usual representation
for that case.
5
Estimation and Prediction: Classical Approach

Models of the form of the general linear model, and in particular those of the form of the Gauss–
Markov or Aitken model, are often used to obtain point estimates of the unobservable quantities
represented by various parametric functions. In many cases, the parametric functions are ones that
are expressible in the form λ′β, where λ = (λ_1, λ_2, ..., λ_P)′ is a P-dimensional column vector
of constants, or equivalently ones that are expressible in the form Σ_{j=1}^{P} λ_j β_j. Models of the form
of the G–M, Aitken, or general linear model may also be used to obtain predictions for future
quantities; these would be future quantities that are represented by unobservable random variables
with expected values of the form λ′β. The emphasis in this chapter is on the G–M model (in which
the only parameter other than β_1, β_2, ..., β_P is the standard deviation σ) and on what might be
regarded as a classical approach to estimation and prediction.

5.1 Linearity and Unbiasedness


Suppose that y is an N × 1 observable random vector that follows the G–M, Aitken, or general linear
model, and consider the estimation of a parametric function of the form λ′β = Σ_{j=1}^{P} λ_j β_j, where
λ = (λ_1, λ_2, ..., λ_P)′. Is it "possible" to estimate λ′β from the available information, and if so,
which estimator is best and in what sense is it best? One way to judge the “goodness” of an estimator
is on the basis of its mean squared error (MSE) or its root mean squared error (root MSE)—the root
MSE is the square root of the MSE. When a function t.y/ of y is regarded as an estimator of 0ˇ,
its MSE is (by definition) EfŒt.y/ 0ˇ2 g.
The information about the distribution of $y$ provided by the G–M, Aitken, or general linear model is limited; it is confined to information about $E(y)$ and $\mathrm{var}(y)$. The evaluation and comparison of
potential estimators are greatly facilitated by restricting attention to estimators that are of a relatively
simple form or that satisfy certain criteria and/or by making assumptions about the distribution of
y that go beyond those inherent in the G–M, Aitken, or general linear model. If the evaluations and
comparisons are to be meaningful, the restrictions need to be ones that have appeal in their own right,
and the assumptions need to be realistic.
An estimator of $\lambda'\beta$, say an estimator $t(y)$, is said to be linear if it is expressible in the form
$$t(y) = c + \sum_{i=1}^{N} a_i y_i,$$
where $c$ and $a_1, a_2, \ldots, a_N$ are constants, or equivalently if it is expressible in the form
$$t(y) = c + a'y,$$
where $c$ is a constant and $a = (a_1, a_2, \ldots, a_N)'$ is an $N$-dimensional column vector of constants. Linear estimators of $\lambda'\beta$ are of a relatively simple form, which makes them readily amenable to evaluation, comparison, and interpretation. Accordingly, it is convenient and of some interest to obtain results on the estimation of $\lambda'\beta$ in the special case where consideration is restricted to linear estimators.
Attention is sometimes restricted to estimators that are unbiased. By definition, an estimator $t(y)$ of $\lambda'\beta$ is unbiased if $E[t(y)] = \lambda'\beta$. If $t(y)$ is an unbiased estimator of $\lambda'\beta$, then
$$E\{[t(y) - \lambda'\beta]^2\} = \mathrm{var}[t(y)], \qquad (1.1)$$
that is, its MSE equals its variance.
In the case of a linear estimator $c + a'y$, the expected value of the estimator is
$$E(c + a'y) = c + a'E(y) = c + a'X\beta. \qquad (1.2)$$
Accordingly, $c + a'y$ is an unbiased estimator of $\lambda'\beta$ if and only if, for every $P$-dimensional column vector $\beta$,
$$c + a'X\beta = \lambda'\beta. \qquad (1.3)$$
Clearly, a sufficient condition for the unbiasedness of the linear estimator $c + a'y$ is
$$c = 0 \quad\text{and}\quad a'X = \lambda' \qquad (1.4)$$
or, equivalently,
$$c = 0 \quad\text{and}\quad X'a = \lambda. \qquad (1.5)$$
This condition is also a necessary condition for the unbiasedness of $c + a'y$, as is evident upon observing that if equality (1.3) holds for every column vector $\beta$ in $\mathbb{R}^P$, then it holds in particular when $\beta$ is taken to be the $P \times 1$ null vector $0$ (so that $c = 0$) and when (for each integer $j$ between 1 and $P$, inclusive) $\beta$ is taken to be the $j$th column of $I_P$ (so that the $j$th element of $a'X$ equals the $j$th element of $\lambda'$).
In the special case of a linear unbiased estimator $a'y$, expression (1.1) for the MSE of an unbiased estimator of $\lambda'\beta$ simplifies to
$$E[(a'y - \lambda'\beta)^2] = a'\,\mathrm{var}(y)\,a. \qquad (1.6)$$
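The unbiasedness condition (1.5) is easy to check numerically. The following is a minimal sketch (not an example from the text; the model matrix, $\beta$, $\sigma$, and $\lambda$ are all assumed for illustration) that verifies that an estimator $a'y$ with $X'a = \lambda$ has expectation $\lambda'\beta$ and that its MSE equals its variance, in line with (1.1) and (1.6).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed small G-M setup: E(y) = X beta, var(y) = sigma^2 I.
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])    # N = 4, P = 2
beta = np.array([2., -1.])
sigma = 0.5
lam = np.array([1., 2.])                                   # coefficient vector lambda

# One choice of a satisfying X'a = lambda (any solution of this consistent system will do).
a, *_ = np.linalg.lstsq(X.T, lam, rcond=None)
print(np.allclose(X.T @ a, lam))                           # True: condition (1.5) holds

# Monte Carlo check: E(a'y) is close to lambda'beta, and the MSE equals a' var(y) a = sigma^2 a'a.
y = X @ beta + sigma * rng.standard_normal((100_000, X.shape[0]))
est = y @ a
print(est.mean(), lam @ beta)                              # approximately equal (unbiasedness)
print(((est - lam @ beta) ** 2).mean(), sigma**2 * a @ a)  # MSE matches variance, cf. (1.1), (1.6)
```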

5.2 Translation Equivariance


Suppose (as in Section 5.1) that $y$ is an $N \times 1$ observable random vector that follows the Gauss–Markov, Aitken, or general linear model and that we wish to estimate a parametric function of the form $\lambda'\beta$. Attention is sometimes restricted to estimators that are unbiased. However, unbiasedness is not the only criterion that could be used to restrict the collection of estimators under consideration. Another possible criterion is translation equivariance (also known as location equivariance).
Let $k$ represent a $P$-dimensional column vector of known constants, and define $z = y + Xk$. The vector $z$, like the vector $y$, is an $N$-dimensional observable random vector. Moreover,
$$z = X\gamma + e, \qquad (2.1)$$
where $\gamma = \beta + k$. Accordingly, $z$ follows a G–M, Aitken, or general linear model that is identical in all respects to the model followed by $y$, except that the role of the parameter vector $\beta$ is played by a vector [represented by $\gamma$ in equality (2.1)] that has a different interpretation.
It can be argued that an estimator, say $t(y)$, of $\lambda'\beta$ should be such that the results obtained in using $t(y)$ to estimate $\lambda'\beta$ are consistent with those obtained in using $t(z)$ to estimate the corresponding parametric function ($\lambda'\gamma = \lambda'\beta + \lambda'k$). Here, the consistency is in the sense that
$$t(y) + \lambda'k = t(z)$$
or, equivalently, that
$$t(y) + \lambda'k = t(y + Xk). \qquad (2.2)$$
When applied to a linear estimator $c + a'y$, condition (2.2) becomes
$$c + a'y + \lambda'k = c + a'(y + Xk),$$
which (after some simplification) can be restated as
$$a'Xk = \lambda'k. \qquad (2.3)$$
The estimator $t(y)$ is said to be translation equivariant if it is such that condition (2.2) is satisfied for every $k \in \mathbb{R}^P$ (and for every value of $y$). Accordingly, the linear estimator $c + a'y$ is translation equivariant if and only if condition (2.3) is satisfied for every $k \in \mathbb{R}^P$ or, equivalently, if and only if
$$a'X = \lambda'. \qquad (2.4)$$

Observe (in light of the results of Section 5.1) that condition (2.4) is identical to one of the conditions needed for unbiasedness; for unbiasedness, we also need the condition $c = 0$. Thus, the motivation for requiring that the coefficient vector $a'$ in the linear estimator $c + a'y$ satisfy the condition $a'X = \lambda'$ can come from a desire to achieve unbiasedness or translation equivariance or both.
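As a quick numerical illustration (a sketch using the same assumed toy model matrix as in the previous sketch, not an example from the text), translation equivariance of a linear estimator whose coefficient vector satisfies $a'X = \lambda'$ can be checked directly from condition (2.2):

```python
import numpy as np

# Assumed toy setup: any X and any a with X'a = lambda will do.
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
lam = np.array([1., 2.])
a, *_ = np.linalg.lstsq(X.T, lam, rcond=None)   # one vector satisfying X'a = lambda

def t(y, c=0.0):
    # linear estimator t(y) = c + a'y
    return c + a @ y

y = np.array([1.2, 0.7, -0.3, 2.5])             # an arbitrary data vector
k = np.array([0.4, -1.1])                       # an arbitrary shift of beta

# Condition (2.2): t(y + Xk) = t(y) + lambda'k whenever a'X = lambda' (c may be nonzero).
print(np.isclose(t(y + X @ k, c=3.0), t(y, c=3.0) + lam @ k))   # True
```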

5.3 Estimability
Suppose (as in Sections 5.1 and 5.2) that $y$ is an $N \times 1$ observable random vector that follows the G–M, Aitken, or general linear model, and consider the estimation of a parametric function that is expressible in the form $\lambda'\beta$ or $\sum_{j=1}^{P}\lambda_j\beta_j$, where $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_P)'$ is a $P \times 1$ vector of coefficients. If there exists a linear unbiased estimator of $\lambda'\beta$ [i.e., if there exists a constant $c$ and an $N \times 1$ vector of constants $a$ such that $E(c + a'y) = \lambda'\beta$], then $\lambda'\beta$ is said to be estimable. Otherwise (if no such estimator exists), $\lambda'\beta$ is said to be nonestimable.
If $\lambda'\beta$ is estimable, then the data provide at least some information about $\lambda'\beta$. Estimability can be of critical importance in the design of an experiment. If the data from the experiment are to be regarded as having originated from a G–M, Aitken, or general linear model and if the quantities of interest are to be formulated as parametric functions of the form $\lambda'\beta$ (as is common practice), then it is imperative that every one of the relevant functions be estimable.
It follows immediately from the results of Section 5.1 that $\lambda'\beta$ is estimable if and only if there exists an $N \times 1$ vector $a$ such that
$$\lambda' = a'X \qquad (3.1)$$
or, equivalently, such that
$$\lambda = X'a. \qquad (3.2)$$
Thus, for $\lambda'\beta$ to be estimable (under the G–M, Aitken, or general linear model), it is necessary and sufficient that
$$\lambda' \in \mathcal{R}(X) \qquad (3.3)$$
or, equivalently, that
$$\lambda \in \mathcal{C}(X'). \qquad (3.4)$$

Note that it follows from the very definition of estimability [as well as from condition (3.1)] that if $\lambda'\beta$ is estimable, then there exists an $N \times 1$ vector $a$ such that
$$\lambda'\beta = a'E(y). \qquad (3.5)$$
Thus, if $\lambda'\beta$ is estimable, it is interpretable in terms of the expected values of $y_1, y_2, \ldots, y_N$. In fact, if $\lambda'\beta$ is estimable, it may be expressible in the form (3.5) for each of a number of different choices of $a$ and, consequently, it may have multiple interpretations in terms of the expected values of $y_1, y_2, \ldots, y_N$.
Two basic and readily verifiable observations about linear combinations of parametric functions
of the form 0ˇ are as follows:
(1) linear combinations of estimable functions are estimable; and
(2) linear combinations of nonestimable functions are not necessarily nonestimable.
How many "essentially different" estimable functions are there? Let $\lambda_1'\beta, \lambda_2'\beta, \ldots, \lambda_K'\beta$ represent $K$ (where $K$ is an arbitrary positive integer) linear combinations of the elements of $\beta$. These linear combinations are said to be linearly independent if their coefficient vectors $\lambda_1', \lambda_2', \ldots, \lambda_K'$ are linearly independent vectors. A question as to whether $\lambda_1'\beta, \lambda_2'\beta, \ldots, \lambda_K'\beta$ are essentially different can be made precise by taking essentially different to mean linearly independent.
Letting $R = \mathrm{rank}(X)$, some basic and readily verifiable observations about linearly independent parametric functions of the form $\lambda'\beta$ and about their estimability or nonestimability are as follows:
(1) there exists a set of $R$ linearly independent estimable functions;
(2) no set of estimable functions contains more than $R$ linearly independent estimable functions; and
(3) if the model is not of full rank (i.e., if $R < P$), then at least one and, in fact, at least $P - R$ of the individual parameters $\beta_1, \beta_2, \ldots, \beta_P$ are nonestimable.
When the model matrix $X$ has full column rank $P$, the model is said to be of full rank. In the special case of a full-rank model, $\mathcal{R}(X) = \mathbb{R}^P$, and every parametric function of the form $\lambda'\beta$ is estimable.
Note that the existence of an $N \times 1$ vector $a$ that satisfies equality (3.2) is equivalent to the consistency of a linear system (in an $N \times 1$ vector $a$ of unknowns), namely, the linear system with coefficient matrix $X'$ (which is of dimensions $P \times N$) and with right side $\lambda$. The significance of this equivalence is that any result on the consistency of a linear system can be readily translated into a result on the estimability of the parametric function $\lambda'\beta$. Consider, in particular, Theorem 2.11.1. Upon applying this theorem [and observing that $(X^-)'$ is a generalized inverse of $X'$], we find that for $\lambda'\beta$ to be estimable, it is necessary and sufficient that
$$\lambda'X^-X = \lambda' \qquad (3.6)$$
or, equivalently, that
$$\lambda'(I - X^-X) = 0'. \qquad (3.7)$$
If $\mathrm{rank}(X) = P$, then (in light of Lemma 2.10.3) $X^-X = I$. Thus, in the special case of a full-rank model, conditions (3.6) and (3.7) are vacuous.
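Condition (3.6) translates directly into a numerical estimability check: compute a generalized inverse $X^-$ (for instance, the Moore–Penrose inverse) and test whether $\lambda'X^-X = \lambda'$. The following is a minimal sketch; the rank-deficient one-way-classification matrix is assumed for illustration and is not an example from the text.

```python
import numpy as np

# Assumed rank-deficient model matrix (overall mean plus two group effects), rank 2 < P = 3.
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])

def is_estimable(lam, X, tol=1e-10):
    # Condition (3.6): lambda' X^- X = lambda', with X^- taken to be the Moore-Penrose inverse.
    Xg = np.linalg.pinv(X)
    return np.allclose(lam @ Xg @ X, lam, atol=tol)

print(is_estimable(np.array([1., 1., 0.]), X))   # mu + alpha_1: estimable -> True
print(is_estimable(np.array([0., 1., -1.]), X))  # alpha_1 - alpha_2: estimable -> True
print(is_estimable(np.array([0., 1., 0.]), X))   # alpha_1 alone: not estimable -> False
```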

a. A result on the consistency of a linear system


The following result on the consistency of a linear system can be used to obtain additional results on the estimability of a parametric function of the form $\lambda'\beta$.
Theorem 5.3.1. A linear system $AX = B$ (in $X$) is consistent if and only if $k'B = 0$ for every column vector $k$ (of compatible dimension) such that $k'A = 0$.
Proof. Denote by $M$ the number of rows in $A$ (and in $B$), let $k$ represent an $M$-dimensional column vector, and observe (in light of Corollary 2.11.4 and Lemma 2.10.10) that
$$\begin{aligned}
k'A = 0 &\;\Leftrightarrow\; A'k = 0\\
&\;\Leftrightarrow\; k \in \mathcal{N}(A')\\
&\;\Leftrightarrow\; k \in \mathcal{C}[I - (A^-)'A']\\
&\;\Leftrightarrow\; k = [I - (A^-)'A']\,r \text{ for some } M \times 1 \text{ vector } r\\
&\;\Leftrightarrow\; k' = r'(I - AA^-) \text{ for some } M \times 1 \text{ vector } r.
\end{aligned}$$
Thus, recalling Lemma 2.2.2, we find that
$$k'B = 0 \text{ for every } k \text{ such that } k'A = 0 \;\Leftrightarrow\; r'(I - AA^-)B = 0 \text{ for every } M \times 1 \text{ vector } r \;\Leftrightarrow\; (I - AA^-)B = 0.$$
And based on Theorem 2.11.1, we conclude that the linear system $AX = B$ is consistent if and only if $k'B = 0$ for every $k$ such that $k'A = 0$. Q.E.D.
Theorem 5.3.1 establishes that the consistency of the linear system $AX = B$ is equivalent to a condition that is sometimes referred to as compatibility; the linear system $AX = B$ is said to be compatible if every linear relationship that exists among the rows of the coefficient matrix $A$ also exists among the rows of the right side $B$ (in the sense that $k'A = 0 \Rightarrow k'B = 0$). The proof presented herein differs from that presented in Matrix Algebra from a Statistician's Perspective (Harville 1997, sec. 7.3); it makes use of results on generalized inverses.

b. Some alternative necessary and sufficient conditions for estimability


Let us now consider further the estimability of a parametric function of the form $\lambda'\beta$ (under the G–M, Aitken, or general linear model). As noted earlier, $\lambda'\beta$ is estimable if and only if the linear system $X'a = \lambda$ (in $a$) is consistent. Accordingly, it follows from Theorem 5.3.1 that for $\lambda'\beta$ to be estimable, it is necessary and sufficient that
$$k'\lambda = 0 \text{ for every } P \times 1 \text{ vector } k \text{ such that } k'X' = 0 \qquad (3.8)$$
or, equivalently, that
$$k'\lambda = 0 \text{ for every } P \times 1 \text{ vector } k \text{ in } \mathcal{N}(X). \qquad (3.9)$$
Let $S = \dim[\mathcal{N}(X)]$. And observe (in light of Lemma 2.11.5) that
$$S = P - \mathrm{rank}(X).$$
Unless the model is of full rank [in which case $S = 0$, $\mathcal{N}(X) = \{0\}$, and conditions (3.8) and (3.9) are vacuous], condition (3.9) comprises an infinite number of equalities; there is one equality for each vector $k$ in the $S$-dimensional linear space $\mathcal{N}(X)$. Fortunately, all but $S$ of the equalities that form condition (3.9) can be eliminated without affecting the necessity or sufficiency of the condition.
Let $k_1, k_2, \ldots, k_S$ represent any $S$ linearly independent vectors in $\mathcal{N}(X)$, that is, any $S$ linearly independent ($P$-dimensional) column vectors such that $Xk_1 = Xk_2 = \cdots = Xk_S = 0$. Then, for $\lambda'\beta$ to be estimable, it is necessary and sufficient that
$$k_1'\lambda = k_2'\lambda = \cdots = k_S'\lambda = 0. \qquad (3.10)$$
To verify the necessity and sufficiency of condition (3.10), it suffices to establish that condition (3.10) is equivalent to condition (3.9). In fact, it is enough to establish that condition (3.10) implies condition (3.9); that condition (3.9) implies condition (3.10) is obvious. Accordingly, let $k$ represent an arbitrary member of $\mathcal{N}(X)$. And observe (in light of Theorem 2.4.11) that the set $\{k_1, k_2, \ldots, k_S\}$ is a basis for $\mathcal{N}(X)$, implying the existence of scalars $a_1, a_2, \ldots, a_S$ such that
$$k = a_1 k_1 + a_2 k_2 + \cdots + a_S k_S$$
and hence such that
$$k'\lambda = a_1 k_1'\lambda + a_2 k_2'\lambda + \cdots + a_S k_S'\lambda.$$
Thus, if $k_1'\lambda = k_2'\lambda = \cdots = k_S'\lambda = 0$, then $k'\lambda = 0$, leading to the conclusion that condition (3.10) implies condition (3.9).
Condition (3.10) comprises only $S$ of the infinite number of equalities that form condition (3.9), making it much easier to administer than condition (3.9).
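Condition (3.10) also suggests an equivalent computational check: form a basis for $\mathcal{N}(X)$ (for instance, from the singular value decomposition) and verify that $\lambda$ is orthogonal to every basis vector. A sketch under the same assumed one-way-classification matrix as in the previous sketch (the SVD-based null-space construction assumes $N \ge P$):

```python
import numpy as np

X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])

# Basis for N(X) from the SVD: right singular vectors whose singular values are (numerically) zero.
_, s, Vt = np.linalg.svd(X)
tol = max(X.shape) * np.finfo(float).eps * s[0]
null_basis = Vt[s <= tol]            # S x P array; its rows are k_1', ..., k_S'

def is_estimable(lam):
    # Condition (3.10): k_s' lambda = 0 for every basis vector k_s of N(X).
    return np.allclose(null_basis @ lam, 0.0, atol=1e-10)

print(null_basis.shape[0])                        # S = P - rank(X) = 1
print(is_estimable(np.array([0., 1., -1.])))      # True
print(is_estimable(np.array([0., 1., 0.])))       # False
```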
c. A related concept: identifiability


Let us continue to suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M, Aitken, or general linear model. Recall that $E(y) = X\beta$. Does a parametric function of the form $\lambda'\beta$ have a fixed value for each value of $E(y)$? This question can be restated more formally as follows: is $\lambda'\beta_1 = \lambda'\beta_2$ for every pair of $P$-dimensional column vectors $\beta_1$ and $\beta_2$ such that $X\beta_1 = X\beta_2$? Or, equivalently, is $X\beta_1 \ne X\beta_2$ for every pair of $P$-dimensional column vectors $\beta_1$ and $\beta_2$ such that $\lambda'\beta_1 \ne \lambda'\beta_2$? Unless the model is of full rank, the answer depends on the coefficient vector $\lambda'$. When the answer is yes, the parametric function $\lambda'\beta$ is said to be identifiable; this terminology is consistent with that of Hinkelmann and Kempthorne (2008, sec. 4.4).
The parametric function $\lambda'\beta$ is identifiable if and only if it is estimable. To see this, suppose that $\lambda'\beta$ is estimable. Then, $\lambda' = a'X$ for some column vector $a$. And for any $P$-dimensional column vectors $\beta_1$ and $\beta_2$ such that $X\beta_1 = X\beta_2$,
$$\lambda'\beta_1 = a'X\beta_1 = a'X\beta_2 = \lambda'\beta_2.$$
Accordingly, $\lambda'\beta$ is identifiable.
Conversely, suppose that $\lambda'\beta$ is identifiable. Then, $\lambda'\beta_1 = \lambda'0$ for every $P$-dimensional column vector $\beta_1$ such that $X\beta_1 = X0$, or equivalently $\lambda'k = 0$ for every vector $k$ in $\mathcal{N}(X)$. And based on the results on estimability established in Subsection b (and on the observation that $\lambda'k = k'\lambda$), we conclude that $\lambda'\beta$ is estimable.

d. Polynomials (in 1 variable)


Suppose that $y$ is an $N$-dimensional observable random vector that follows a G–M, Aitken, or general linear model. Suppose further that there is a single explanatory variable, so that $C = 1$ and $u = (u_1)$. And (for the sake of simplicity) let us write $u$ for $u_1$ or (depending on the context) for the vector $(u_1)$.
Let us consider the case (considered initially in Section 4.2a) where $\delta(u)$ is a polynomial. Specifically, let us consider the case where
$$\delta(u) = \beta_1 + \beta_2 u + \beta_3 u^2 + \cdots + \beta_P u^{P-1}. \qquad (3.11)$$
Under what circumstances are all $P$ of the coefficients $\beta_1, \beta_2, \ldots, \beta_P$ estimable? Or, equivalently, under what circumstances is the model of full rank? The answer to this question can be established with the help of a result on a kind of matrix known as a Vandermonde matrix.
Vandermonde matrices. A Vandermonde matrix is a square matrix $A$ of the general form
$$A = \begin{pmatrix}
1 & t_1 & t_1^2 & \cdots & t_1^{K-1}\\
1 & t_2 & t_2^2 & \cdots & t_2^{K-1}\\
1 & t_3 & t_3^2 & \cdots & t_3^{K-1}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & t_K & t_K^2 & \cdots & t_K^{K-1}
\end{pmatrix},$$
where $t_1, t_2, t_3, \ldots, t_K$ are arbitrary scalars. The determinant of a Vandermonde matrix is obtainable from the formula
$$|A| = \prod_{\substack{i,j\\ (j<i)}} (t_i - t_j). \qquad (3.12)$$
For a derivation of this formula, refer, for example, to Harville (1997, sec. 13.6).
Denote by $R$ the number of distinct values represented among $t_1, t_2, t_3, \ldots, t_K$. Then, it follows from result (3.12) that
$$R = K \;\Leftrightarrow\; |A| \ne 0$$
and hence (in light of Theorem 2.14.20) that
$$R = K \;\Leftrightarrow\; A \text{ is nonsingular}. \qquad (3.13)$$

Rank of the model matrix. We are now in a position to determine the rank of the model matrix $X$ (of a G–M, Aitken, or general linear model) in the special case where $C = 1$ and where $\delta(u)$ is the polynomial (3.11) (which is of degree $P - 1$ in the lone explanatory variable $u$). Denote by $D$ the number of distinct values of $u$ represented among the $N$ values of $u$ corresponding to the $N$ observable random variables $y_1, y_2, \ldots, y_N$. And take $i_1, i_2, \ldots, i_D$ ($i_1 < i_2 < \cdots < i_D$) to be integers between 1 and $N$, inclusive, such that the values of $u$ corresponding to $y_{i_1}, y_{i_2}, \ldots, y_{i_D}$ are distinct. Each of the $N$ rows of $X$ is either among its $i_1, i_2, \ldots, i_D$th rows or is a duplicate of one of those rows. Thus, $\mathcal{R}(X)$ is spanned by the $i_1, i_2, \ldots, i_D$th rows of $X$, and it follows that $\mathrm{rank}(X) \le D$ and hence [since $\mathrm{rank}(X) \le P$] that $\mathrm{rank}(X) \le M$, where $M = \min(D, P)$. Moreover, it follows from result (3.13) that the $M \times M$ submatrix of $X$ formed from its $i_1, i_2, \ldots, i_M$th rows and its first $M$ columns is nonsingular, implying (in light of Theorem 2.4.19) that $\mathrm{rank}(X) \ge M$. And we conclude that
$$\mathrm{rank}(X) = \min(D, P). \qquad (3.14)$$
In light of result (3.14), it is evident that [in the special case where $\delta(u)$ is the $(P-1)$-degree polynomial (3.11)] the model is of full rank if and only if $D \ge P$, that is, if and only if at least $P$ of the $N$ values of $u$ (the $N$ values of $u$ corresponding to $y_1, y_2, \ldots, y_N$) are distinct. When the model is of full rank, all $P$ of the coefficients $\beta_1, \beta_2, \ldots, \beta_P$ [in the $(P-1)$-degree polynomial] are estimable.
In the application to the ouabain data (of Section 4.2b), there are 4 distinct values of the ex-
planatory variable, representing the 4 different rates of injection (or perhaps the logarithms of the
4 different rates). Accordingly, if ı.u/ were taken to be a polynomial of the form (3.11), the model
would be of full rank if and only if the degree of the polynomial were taken to be 3 or less.
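Result (3.14) is easy to confirm numerically: build the polynomial model matrix for a given set of $u$ values and compare its rank with $\min(D, P)$. A minimal sketch with assumed (made-up) values of $u$, including repeats; the values are not taken from any example in the text:

```python
import numpy as np

def poly_model_matrix(u, P):
    # N x P matrix whose ij-th element is u_i^(j-1), as in the polynomial (3.11).
    u = np.asarray(u, dtype=float)
    return np.vander(u, N=P, increasing=True)

u = [1.0, 1.0, 2.0, 2.0, 3.0, 5.0]        # N = 6 values of u with D = 4 distinct values (assumed)
D = len(set(u))
for P in (3, 4, 5):
    X = poly_model_matrix(u, P)
    print(P, np.linalg.matrix_rank(X), min(D, P))   # rank(X) equals min(D, P) in each case
```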

e. An illustration: mixture data


Consider an application of the G–M, Aitken, or general linear model in which the $C$ explanatory variables $u_1, u_2, \ldots, u_C$ represent the proportionate amounts of the ingredients in a mixture. By their very nature, the explanatory variables are such that
$$\sum_{j=1}^{C} u_j = 1. \qquad (3.15)$$
Now, suppose that
$$\delta(u) = \beta_1 + \beta_2 u_1 + \beta_3 u_2 + \cdots + \beta_{C+1} u_C \qquad (3.16)$$
(in which case, $P = C + 1$). And let $x_1, x_2, \ldots, x_{C+1}$ represent the first through last columns of the model matrix $X$. Then,
$$x_1 = \sum_{j=2}^{C+1} x_j \qquad (3.17)$$
or, equivalently,
$$X \begin{pmatrix} -1 \\ 1_C \end{pmatrix} = 0. \qquad (3.18)$$
Suppose, for example, that the mixtures are fruit juices, each of which is a blend of watermelon juice, orange juice, pineapple juice, and grapefruit juice (in which case, the data points might be flavor scores). Suppose further that $N = 6$ and that the data are those obtained for the following 6 blends:

    u_1    u_2    u_3    u_4
    0.8    0.2    0      0
    0.2    0      0.8    0
    0.4    0      0      0.6
    0.5    0.1    0.4    0
    0.6    0.1    0      0.3
    0.3    0      0.4    0.3
What parametric functions of the form $\lambda'\beta$ are estimable? In light of equality (3.18), $\mathrm{rank}(X) \le C$, and it follows from the results of Subsection b that for $\lambda'\beta$ to be estimable, it is necessary that
$$\begin{pmatrix} -1 \\ 1_C \end{pmatrix}'\lambda = 0 \qquad (3.19)$$
or, equivalently, that
$$\sum_{j=2}^{C+1} \lambda_j = \lambda_1. \qquad (3.20)$$
Is this condition sufficient as well as necessary? The answer to this question depends on $\mathrm{rank}(X)$. It follows from the results of Subsection b that if $\mathrm{rank}(X) = C$, then condition (3.20) is sufficient (as well as necessary) for the estimability of $\lambda'\beta$; however, if $\mathrm{rank}(X) < C$, then condition (3.20) is not (in and of itself) sufficient.
In the case of the 6 blends of the 4 juices,
$$\mathrm{rank}(X) = 3 = C - 1 < C.$$
To see this, observe that the last 3 columns of $X$ (which contain the values of $u_2$, $u_3$, and $u_4$, respectively) are linearly independent, so that $\mathrm{rank}(X) \ge 3$. Observe also that each of the 6 blends of the 4 juices is such that
$$u_4 = 1.5u_1 - 6u_2 - 0.375u_3,$$
so that (in the case of the 6 blends of the 4 juices)
$$x_5 = 1.5x_2 - 6x_3 - 0.375x_4, \qquad (3.21)$$
which [together with result (3.17)] implies that the first and last columns of $X$ are expressible as linear combinations of the other 3 columns and hence that $\mathrm{rank}(X) \le 3$. Thus, $\mathrm{rank}(X) = 3$.
To obtain conditions that are both necessary and sufficient for the estimability of $\lambda'\beta$ (from the information provided by the data on the 6 blends of the 4 juices), observe that equality (3.21) can be reexpressed in the form
$$X \begin{pmatrix} 0 \\ 1.5 \\ -6 \\ -0.375 \\ -1 \end{pmatrix} = 0.$$
And it follows from the results of Subsection b that for $\lambda'\beta$ to be estimable, it is necessary that
$$\begin{pmatrix} 0 \\ 1.5 \\ -6 \\ -0.375 \\ -1 \end{pmatrix}'\lambda = 0 \qquad (3.22)$$
or, equivalently, that
$$\lambda_5 = 1.5\lambda_2 - 6\lambda_3 - 0.375\lambda_4. \qquad (3.23)$$
Moreover, together, the two conditions (3.19) and (3.22) or, equivalently, the two conditions (3.20) and (3.23) are sufficient (as well as necessary) for the estimability of $\lambda'\beta$.
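The computations for the 6 blends can be reproduced directly. The following sketch builds the model matrix for $\delta(u)$ of the form (3.16) from the table above, confirms that $\mathrm{rank}(X) = 3$, and checks conditions (3.20) and (3.23) for two coefficient vectors; the estimability test via the Moore–Penrose inverse is the same device used in the earlier sketch.

```python
import numpy as np

U = np.array([[0.8, 0.2, 0.0, 0.0],
              [0.2, 0.0, 0.8, 0.0],
              [0.4, 0.0, 0.0, 0.6],
              [0.5, 0.1, 0.4, 0.0],
              [0.6, 0.1, 0.0, 0.3],
              [0.3, 0.0, 0.4, 0.3]])
X = np.column_stack([np.ones(6), U])      # model matrix for delta(u) in (3.16); P = C + 1 = 5

print(np.linalg.matrix_rank(X))           # 3, as claimed in the text

def is_estimable(lam):
    return np.allclose(lam @ np.linalg.pinv(X) @ X, lam, atol=1e-8)

# beta_2 - beta_3 satisfies the inherent condition (3.20) but not (3.23): not estimable here.
print(is_estimable(np.array([0., 1., -1., 0., 0.])))       # False
# A vector built to satisfy both (3.20) and (3.23) is estimable.
lam = np.array([0., 1.0, 0.2, 0.0, 0.0])
lam[4] = 1.5*lam[1] - 6*lam[2] - 0.375*lam[3]              # condition (3.23)
lam[0] = lam[1] + lam[2] + lam[3] + lam[4]                 # condition (3.20)
print(is_estimable(lam))                                   # True
```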
The example provided by the 6 blends of the 4 juices is one in which $\mathrm{rank}(X) = C - 1$ ($= P - 2$). It is easy to construct examples in which $\mathrm{rank}(X) = C$ ($= P - 1$) and to do so for any $N$ ($\ge C$); in light of result (3.17) or (3.18), $\mathrm{rank}(X)$ cannot be larger than $C$. In what is perhaps the simplest way to construct such an example, we can take any $C$ of the blends corresponding to the $N$ data points to be the $C$ pure blends, each of which consists entirely of one ingredient. This approach results in a model matrix $X$ whose $N$ rows include the vectors $(1, 1, 0, 0, \ldots, 0, 0)$, $(1, 0, 1, 0, \ldots, 0, 0)$, $\ldots$, $(1, 0, 0, 0, \ldots, 0, 1)$. Clearly, these $C$ vectors are linearly independent, implying that $\mathrm{rank}(X) \ge C$ and hence [since $\mathrm{rank}(X) \le C$] that $\mathrm{rank}(X) = C$.
When $\mathrm{rank}(X) = C$, the condition $\begin{pmatrix} -1 \\ 1_C \end{pmatrix}'\lambda = 0$, or equivalently the condition $\sum_{j=2}^{C+1}\lambda_j = \lambda_1$, is sufficient (as well as necessary) for the estimability of $\lambda'\beta$. It is worth noting that this condition is not satisfied by any of the individual parameters $\beta_1, \beta_2, \ldots, \beta_{C+1}$ and, consequently, none of these parameters is estimable. Thus, we have established (by means of an example) that not only is it possible for all $P$ of the individual parameters $\beta_1, \beta_2, \ldots, \beta_P$ of a G–M, Aitken, or general linear model to be nonestimable, but it is possible even if $\mathrm{rank}(X) = P - 1$.

f. Inherent versus noninherent restrictions on estimability


Let us continue to consider the estimability of parametric functions of the form $\lambda'\beta$ (under the G–M, Aitken, or general linear model). Unless the model is of full rank, estimability is restricted to a proper subset of these parametric functions. The restriction is in the form of restrictions on the coefficient vector $\lambda'$. As discussed in Subsection b, a necessary and sufficient condition for $\lambda'\beta$ to be estimable is that
$$k'\lambda = 0 \qquad (3.24)$$
for every $P \times 1$ vector $k$ in $\mathcal{N}(X)$.
It can be helpful to think of each of the restrictions of the form (3.24) as being either an inherent restriction or a noninherent restriction. Let $U$ represent a set consisting of $C$-dimensional column vectors that are considered to be "feasible" values of the vector $u$ of explanatory variables; a value might be deemed infeasible either because of the nature of the explanatory variables or because of the "limitations" of the model. The set $U$ is assumed to include the values $u_1, u_2, \ldots, u_N$ of $u$ that correspond to the $N$ data points.
A restriction of the form (3.24) is an inherent restriction if it would be applicable regardless of the number of data points and regardless of the values of $u$ (in $U$) corresponding to the data points. Otherwise, the restriction is a noninherent restriction. A parametric function of the form $\lambda'\beta$ that fails to satisfy an inherent restriction tends not to be conceptually meaningful. Accordingly, the nonestimability of such a function tends not to be of concern.
A parametric function of the form $\lambda'\beta$ that is nonestimable but that would be estimable if it were not for the presence of noninherent restrictions tends to be conceptually meaningful. Its nonestimability may or may not be of concern, depending on whether or not it represents a quantity of interest.
It is informative to consider these concepts in the context of the illustrative example introduced and discussed in Subsection e. Accordingly, suppose that the $C$ explanatory variables $u_1, u_2, \ldots, u_C$ represent the proportionate amounts of the ingredients in a mixture, in which case they are subject to the constraint $\sum_{j=1}^{C} u_j = 1$. And suppose that $\delta(u)$ is of the form (3.16). Then, as discussed in Subsection e, not all parametric functions of the form $\lambda'\beta$ are estimable; for $\lambda'\beta$ to be estimable, it is necessary that
$$\lambda_1 - \sum_{j=2}^{C+1} \lambda_j = 0 \qquad (3.25)$$
or, equivalently, that
$$\sum_{j=2}^{C+1} \lambda_j = \lambda_1. \qquad (3.26)$$
In this setting, the set $U$ of feasible values of $u$ would be the set
$$\{u : u_j \ge 0 \ (\text{for } j = 1, 2, \ldots, C) \text{ and } \textstyle\sum_{j=1}^{C} u_j = 1\} \qquad (3.27)$$
(or, perhaps, a nondegenerate subset of that set). Geometrically, the form of the set (3.27) is that of a $(C-1)$-dimensional simplex. Regardless of the number of data points and regardless of which values of $u$ correspond to the data points, the model matrix $X$ would be such that $X\begin{pmatrix} -1 \\ 1_C \end{pmatrix} = 0$ and, consequently, condition (3.25), or equivalently condition (3.26), would be a necessary condition for the estimability of $\lambda'\beta$. Accordingly, condition (3.26) constitutes an inherent restriction on the estimability of $\lambda'\beta$.
As discussed in Subsection e, the parametric functions for which condition (3.26) is not satisfied include all $C + 1$ of the individual parameters $\beta_1, \beta_2, \ldots, \beta_{C+1}$. If the explanatory variables $u_1, u_2, \ldots, u_C$ were unrestricted, the individual parameters would have meaningful interpretations, emanating from the observation that $\beta_1 = \delta(0)$ and that (for $j = 1, 2, \ldots, C$) $\beta_{j+1}$ equals the change in $\delta(u)$ effected by a unit change in the $j$th explanatory variable $u_j$ when the other $C - 1$ explanatory variables are held constant. However, those interpretations are rendered meaningless by the restriction of the vector $u$ of explanatory variables to the set (3.27). The interpretation of $\beta_1$ emanating from the observation that $\beta_1 = \delta(0)$ is meaningless because there is no mixture for which $u = 0$. The interpretations of $\beta_2, \beta_3, \ldots, \beta_{C+1}$ in terms of the change in $\delta(u)$ effected by changing one of the explanatory variables while holding the others constant are also meaningless. By their very nature, the explanatory variables $u_1, u_2, \ldots, u_C$ are such that $\sum_{j=1}^{C} u_j = 1$, making it impossible to change one of the explanatory variables without changing any of the others.
Subsection e includes an example in which $N = 6$ and $C = 4$ and in which the mixtures consist of blends of juices. In that example, $X(0, 1.5, -6, -0.375, -1)' = 0$, so that for a parametric function of the form $\lambda'\beta$ to be estimable, it is necessary that
$$1.5\lambda_2 - 6\lambda_3 - 0.375\lambda_4 - \lambda_5 = 0 \qquad (3.28)$$
or, equivalently, that
$$\lambda_5 = 1.5\lambda_2 - 6\lambda_3 - 0.375\lambda_4. \qquad (3.29)$$
Condition (3.29) constitutes a noninherent restriction on the estimability of $\lambda'\beta$. It would not be applicable if, for instance, the value of $u$ corresponding to the fourth of the example's 6 data points were $(0.5, 0.25, 0.25, 0)'$ rather than $(0.5, 0.1, 0.4, 0)'$ [in which case, the first 4 rows of the model matrix $X$ would be linearly independent and $\mathrm{rank}(X)$ would be 4 rather than 3]. Parametric functions of the form $\lambda'\beta$ that satisfy restriction (3.26), but not restriction (3.29), are conceptually meaningful.
Because of restriction (3.29), the only mixtures for which the "response function" $\delta(u)$ would be estimable from the 6 data points in the example are those for which
$$u_4 = 1.5u_1 - 6u_2 - 0.375u_3.$$
If it had been the case that the only restriction on estimability were that determined by the inherent restriction (3.26), then the value of $\delta(u)$ would have been estimable for every mixture [i.e., for every $u$ in the set (3.27)].
Noninherent restrictions on the estimability of parametric functions of the form $\lambda'\beta$ may be
encountered in cases where the data are “observational” in nature. They may also be encountered
in cases where the data come from a designed experiment or a sample survey. The extent to which
their presence is of concern would seem to depend on which parametric functions are rendered
nonestimable and on the extent to which those functions are of interest.
In the case of data from a designed experiment or a sample survey, the presence of noninherent
restrictions may be either inadvertent or by intent. Their presence may be attributable to problems in
execution or design. Or when the affected parametric functions are ones that are considered to be of
little importance and/or of negligible size, the presence of noninherent restrictions may be viewed
as an acceptable consequence of an attempt to make the best possible use of limited resources.

5.4 The Method of Least Squares


Let $\underline{y}_1, \underline{y}_2, \ldots, \underline{y}_N$ represent $N$ data points, and let $u_1, u_2, \ldots, u_N$ represent the corresponding values of a $C$-dimensional column vector $u$ of explanatory variables. Consider the problem of choosing a function $\delta(u)$ of $u$ on the basis of how well the values $\delta(u_1), \delta(u_2), \ldots, \delta(u_N)$ approximate $\underline{y}_1, \underline{y}_2, \ldots, \underline{y}_N$. Let $\Delta$ represent the set of candidates from which the function $\delta(\cdot)$ is to be chosen.
Which member of $\Delta$ provides the best approximation? One way of addressing this question is through the minimization [for $\delta(\cdot) \in \Delta$] of the norm of the $N$-dimensional column vector with elements $\underline{y}_1 - \delta(u_1), \underline{y}_2 - \delta(u_2), \ldots, \underline{y}_N - \delta(u_N)$. When the norm is taken to be the usual norm, this method is equivalent to minimizing $\sum_{i=1}^{N} [\underline{y}_i - \delta(u_i)]^2$, and is referred to as the method of least squares or simply as least squares.
The origins of the method of least squares are a matter of some dispute, but date back at least to an 1805 publication by Adrien Marie Legendre. For discussions of the history of the method, refer, for example, to Plackett (1972) and Stigler (1986, chap. 1).
The focus herein is on the method of least squares as applied to settings in which the data points are regarded as the values of observable random variables that follow a G–M, Aitken, or general linear model. In such a setting, the method of least squares can be used to obtain estimates of estimable functions (of the model's parameters) of the form $\lambda'\beta$. In the special case of the G–M model, least squares estimators have certain optimal properties.
In what follows, consideration of the method of least squares is restricted to the special case where $\Delta$ consists of those functions (of $u$) that are expressible as linear combinations of $P$ specified functions $\delta_1(\cdot), \delta_2(\cdot), \ldots, \delta_P(\cdot)$. Thus, $\delta(\cdot)$ is a member of $\Delta$ if $\delta(u)$ is expressible in the form
$$\delta(u) = b_1\delta_1(u) + b_2\delta_2(u) + \cdots + b_P\delta_P(u), \qquad (4.1)$$
where $b_1, b_2, \ldots, b_P$ are arbitrary scalars. When $\delta(\cdot)$ is such that $\delta(u)$ is expressible in the form (4.1),
$$\sum_{i=1}^{N} [\underline{y}_i - \delta(u_i)]^2 = \sum_{i=1}^{N} \Bigl(\underline{y}_i - \sum_{j=1}^{P} x_{ij} b_j\Bigr)^2,$$
where (for $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, P$) $x_{ij} = \delta_j(u_i)$. Accordingly, in the special case under consideration, the minimization of $\sum_{i=1}^{N} [\underline{y}_i - \delta(u_i)]^2$ with respect to $\delta(\cdot)$ [for $\delta(\cdot) \in \Delta$] is equivalent to the minimization of $\sum_{i=1}^{N} \bigl(\underline{y}_i - \sum_{j=1}^{P} x_{ij} b_j\bigr)^2$ with respect to $b_1, b_2, \ldots, b_P$. Moreover, upon letting $\underline{y} = (\underline{y}_1, \underline{y}_2, \ldots, \underline{y}_N)'$ and $b = (b_1, b_2, \ldots, b_P)'$ and taking $X$ to be the $N \times P$ matrix with $ij$th element $x_{ij}$,
$$\sum_{i=1}^{N} \Bigl(\underline{y}_i - \sum_{j=1}^{P} x_{ij} b_j\Bigr)^2 = (\underline{y} - Xb)'(\underline{y} - Xb), \qquad (4.2)$$
so that the minimization of $\sum_{i=1}^{N} \bigl(\underline{y}_i - \sum_{j=1}^{P} x_{ij} b_j\bigr)^2$ with respect to $b_1, b_2, \ldots, b_P$ is equivalent to the minimization of $(\underline{y} - Xb)'(\underline{y} - Xb)$ with respect to the $P$-dimensional vector $b$ (where $b$ is an arbitrary member of $\mathbb{R}^P$).
We wish to obtain a solution to the problem of minimizing the quantity $\sum_{i=1}^{N} \bigl(\underline{y}_i - \sum_{j=1}^{P} x_{ij} b_j\bigr)^2$ with respect to $b_1, b_2, \ldots, b_P$. Conditions that are necessary for this quantity to attain a minimum value can be obtained by differentiating with respect to $b_1, b_2, \ldots, b_P$ and by equating the resultant partial derivatives to 0. Or in what can be regarded as an appealing variation on this approach, we can reformulate the minimization problem in matrix notation [as the problem of minimizing $(\underline{y} - Xb)'(\underline{y} - Xb)$ with respect to $b$] and take advantage of some basic results on vector differentiation.

a. Some basic results on vector differentiation


Let $x = (x_1, x_2, \ldots, x_M)'$ represent an $M$-dimensional column vector of variables. And let $f(x)$ represent a function of $x$. Write $\dfrac{\partial f(x)}{\partial x}$, or simply $\dfrac{\partial f}{\partial x}$, for the $M$-dimensional column vector whose $j$th element is the (first-order) partial derivative $\dfrac{\partial f(x)}{\partial x_j}$ of $f$ with respect to $x_j$; this vector may be referred to as the derivative of $f(x)$ with respect to $x$ and is sometimes called the gradient vector of $f$. And write $\dfrac{\partial f(x)}{\partial x'}$, or $\dfrac{\partial f}{\partial x'}$, for $\Bigl[\dfrac{\partial f(x)}{\partial x}\Bigr]'$; this vector may be referred to as the derivative of $f(x)$ with respect to $x'$. Further, write $\dfrac{\partial^2 f(x)}{\partial x\,\partial x'}$, or $\dfrac{\partial^2 f}{\partial x\,\partial x'}$, for the $M \times M$ matrix whose $ij$th element is the second-order partial derivative $\dfrac{\partial^2 f(x)}{\partial x_i\,\partial x_j}$ (or, when $j = i$, $\dfrac{\partial^2 f(x)}{\partial x_i^2}$) of $f$ with respect to $x_i$ and $x_j$, and refer to this matrix as the Hessian matrix of $f$.
The formulas for obtaining the partial derivatives of a linear combination of functions, a product of functions, and a ratio of two functions with respect to a single variable extend in an altogether straightforward way to vector differentiation. In particular, in the case of the product of two functions $f(x)$ and $g(x)$ of $x$,
$$\frac{\partial\, f(x)g(x)}{\partial x} = f(x)\frac{\partial g(x)}{\partial x} + g(x)\frac{\partial f(x)}{\partial x}. \qquad (4.3)$$
Refer, for example, to Harville (1997, sec. 15.2) for further particulars.
Denote by $a = (a_1, a_2, \ldots, a_M)'$ an $M$-dimensional column vector of constants and by $A = \{a_{ik}\}$ an $M \times M$ matrix of constants. And consider the differentiation of the linear form $a'x = \sum_i a_i x_i$ and of the quadratic form $x'Ax = \sum_{i,k} a_{ik} x_i x_k$. The first-order partial derivatives of $a'x$ and the first- and second-order partial derivatives of $x'Ax$ are
$$\frac{\partial\, a'x}{\partial x_j} = a_j \quad (j = 1, 2, \ldots, M), \qquad (4.4)$$
$$\frac{\partial\, x'Ax}{\partial x_j} = \sum_i a_{ij} x_i + \sum_k a_{jk} x_k \quad (j = 1, 2, \ldots, M), \qquad (4.5)$$
and
$$\frac{\partial^2 x'Ax}{\partial x_s\,\partial x_j} = a_{sj} + a_{js} \quad (s, j = 1, 2, \ldots, M), \qquad (4.6)$$
as can be verified via a relatively simple exercise; refer, e.g., to Harville (1997, sec. 15.3) for the details.
In light of result (4.4), the gradient vector of $a'x$ is
$$\frac{\partial\, a'x}{\partial x} = a. \qquad (4.7)$$
And in light of result (4.5), the gradient vector of $x'Ax$ is
$$\frac{\partial\, x'Ax}{\partial x} = (A + A')x \qquad (4.8)$$
(as is evident upon observing that $\sum_k a_{jk} x_k$ is the $j$th element of the column vector $Ax$ and $\sum_i a_{ij} x_i$ is the $j$th element of $A'x$). Further, in light of result (4.6), the Hessian matrix of $x'Ax$ is
$$\frac{\partial^2 x'Ax}{\partial x\,\partial x'} = A + A'. \qquad (4.9)$$
In the special case where $A$ is symmetric, results (4.8) and (4.9) simplify to
$$\frac{\partial\, x'Ax}{\partial x} = 2Ax \qquad (4.10)$$
and
$$\frac{\partial^2 x'Ax}{\partial x\,\partial x'} = 2A. \qquad (4.11)$$

Let $f(x)$ represent a $(P \times 1)$-dimensional vector-valued function of the $M$-dimensional (column) vector $x = (x_1, x_2, \ldots, x_M)'$, and denote by $f_1(x), f_2(x), \ldots, f_P(x)$ the $P$ functions of $x$ that constitute the first, second, …, $P$th elements of $f(x)$. Let us write $\dfrac{\partial f(x)}{\partial x'}$, or simply $\dfrac{\partial f}{\partial x'}$, for the $P \times M$ matrix whose $sj$th element is $\dfrac{\partial f_s}{\partial x_j}$; this matrix may be referred to as the derivative of $f(x)$ with respect to $x'$ and is sometimes called the Jacobian matrix of $f$. And let us write $\dfrac{\partial f'(x)}{\partial x}$, or simply $\dfrac{\partial f'}{\partial x}$, for the $M \times P$ matrix whose $js$th element is $\dfrac{\partial f_s}{\partial x_j}$ or, equivalently, whose $s$th column is $\dfrac{\partial f_s}{\partial x}$; this matrix may be referred to as the derivative of $f'(x)$ with respect to $x$ and is sometimes called the gradient matrix of $f$. Note that for any $P \times M$ matrix of constants $B$,
$$\frac{\partial (Bx)}{\partial x'} = B \quad\text{and}\quad \frac{\partial (Bx)'}{\partial x} = B'. \qquad (4.12)$$
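Formulas (4.7)–(4.10) can be spot-checked numerically by comparing them with finite-difference approximations. A small sketch (the vectors and matrices are randomly generated and assumed, not tied to any example in the text):

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4
a = rng.standard_normal(M)
A = rng.standard_normal((M, M))          # a general (not necessarily symmetric) matrix
x = rng.standard_normal(M)
h = 1e-6

def num_grad(f, x):
    # central-difference approximation to the gradient vector of a scalar function f
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x); e[j] = h
        g[j] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# (4.7): the gradient of a'x is a
print(np.allclose(num_grad(lambda v: a @ v, x), a, atol=1e-5))
# (4.8): the gradient of x'Ax is (A + A')x
print(np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x, atol=1e-4))
# (4.10): when A is symmetric, the gradient of x'Ax is 2Ax
S = A + A.T                              # a symmetric matrix
print(np.allclose(num_grad(lambda v: v @ S @ v, x), 2 * S @ x, atol=1e-4))
```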
@ x0 @x

b. Solution to the least squares minimization problem


Let us now consider the implications of the results of Subsection a as applied to the minimization (with respect to the $P \times 1$ vector $b = \{b_j\}$) of $(\underline{y} - Xb)'(\underline{y} - Xb)$, where $\underline{y} = \{\underline{y}_i\}$ is an $N \times 1$ data vector and $X = \{x_{ij}\}$ an $N \times P$ matrix. Take $q(\cdot)$ to be the function (of $b$) defined by $q(b) = (\underline{y} - Xb)'(\underline{y} - Xb)$. And observe that
$$q(b) = \underline{y}'\underline{y} - 2(X'\underline{y})'b + b'X'Xb.$$
Then, as an immediate application of formulas (4.7), (4.10), and (4.11), we have that
$$\frac{\partial q(b)}{\partial b} = -2X'\underline{y} + 2X'Xb \qquad (4.13)$$
and
$$\frac{\partial^2 q(b)}{\partial b\,\partial b'} = 2X'X. \qquad (4.14)$$
It follows from basic results on unconstrained minimization (e.g., Luenberger and Ye 2016, sec. 7.1) that a necessary condition for $q(b)$ to attain a minimum value at a point $\tilde{b}$ is that $\dfrac{\partial q(b)}{\partial b} = 0$ at $b = \tilde{b}$. Clearly,
$$\frac{\partial q(b)}{\partial b} = 0 \;\Leftrightarrow\; X'Xb = X'\underline{y}.$$
@b
Thus, a necessary condition for q.b/ to attain a minimum value at a point bQ is that bQ constitute a
solution to the linear system
X0 Xb D X0 y (4.15)
(in the P  1 vector b). Linear system (4.15) consists of P equations; collectively, these equations
are known as the normal equations.
The normal equations are consistent, as can be verified by, for example, observing [in light of
equality (2.12.2)] that X0 X.X0 X/ X0 y D X0 y [which implies that .X0 X/ X0 y is a solution to the
normal equations]. Moreover, for q.b/ to attain a minimum value at a point b, Q it is sufficient, as well
Q
as necessary, that b constitute a solution to the normal equations. That is, the set of points at which
q.b/ attains a minimum value equals the solution set of linear system (4.15).
Let us verify that every solution to the normal equations is a point at which $q(b)$ attains a minimum value. There are at least two different ways of accomplishing the verification. We could start with the observation that $\dfrac{\partial^2 q(b)}{\partial b\,\partial b'}$ is a nonnegative definite matrix and that, as a consequence, $q(b)$ is a convex function (e.g., Luenberger and Ye 2016, sec. 7.4). We could then take advantage of a general result on the minimization of convex functions (e.g., Luenberger and Ye 2016, sec. 7.5) to conclude that for any $P \times 1$ vector $\tilde{b}$ such that $\dfrac{\partial q(b)}{\partial b} = 0$ at $b = \tilde{b}$ or, equivalently, for any solution $\tilde{b}$ to the normal equations, $q(b)$ attains a minimum value at $b = \tilde{b}$.
An alternative way of accomplishing the verification is to do so directly (without making any demands on the reader's knowledge of the literature on optimization). For any solution $\tilde{b}$ to the normal equations,
$$q(b) = q(\tilde{b}) + [X(b - \tilde{b})]'X(b - \tilde{b}) \;\ge\; q(\tilde{b}), \qquad (4.16)$$
as is evident upon observing that
$$q(b) = [\underline{y} - X\tilde{b} - X(b - \tilde{b})]'[\underline{y} - X\tilde{b} - X(b - \tilde{b})],$$
that
$$[X(b - \tilde{b})]'(\underline{y} - X\tilde{b}) = (b - \tilde{b})'(X'\underline{y} - X'X\tilde{b}) = 0,$$
that
$$(\underline{y} - X\tilde{b})'X(b - \tilde{b}) = [(\underline{y} - X\tilde{b})'X(b - \tilde{b})]' = [X(b - \tilde{b})]'(\underline{y} - X\tilde{b}),$$
and that $[X(b - \tilde{b})]'X(b - \tilde{b})$ is a sum of squares [of the elements of $X(b - \tilde{b})$]. And it follows immediately from result (4.16) that $q(b)$ attains a minimum value at $b = \tilde{b}$.
Result (4.16) also serves to confirm that any point at which $q(b)$ attains a minimum value is a solution to the normal equations (or, equivalently, a point at which $\dfrac{\partial q(b)}{\partial b} = 0$). To see this, observe that
$$\begin{aligned}
q(b) = q(\tilde{b}) &\;\Rightarrow\; [X(b - \tilde{b})]'X(b - \tilde{b}) = 0\\
&\;\Rightarrow\; X(b - \tilde{b}) = 0\\
&\;\Rightarrow\; Xb = X\tilde{b} \qquad (4.17)\\
&\;\Rightarrow\; X'Xb = X'X\tilde{b} = X'\underline{y}.
\end{aligned}$$

The value of the vector $X\tilde{b}$ is the same for any solution $\tilde{b}$ to the normal equations or, equivalently, for any vector $\tilde{b}$ at which $q(b)$ attains its minimum value. That is, for any two solutions $\tilde{b}_1$ and $\tilde{b}_2$ to the normal equations [or any two vectors $\tilde{b}_1$ and $\tilde{b}_2$ at which $q(b)$ attains its minimum value],
$$X\tilde{b}_1 = X\tilde{b}_2. \qquad (4.18)$$
Equality (4.18) can be established by applying result (4.17) or, alternatively, by observing that $X'X\tilde{b}_1 = X'\underline{y} = X'X\tilde{b}_2$ and then observing (in light of Corollary 2.3.4) that $X'X\tilde{b}_1 = X'X\tilde{b}_2 \Rightarrow X\tilde{b}_1 = X\tilde{b}_2$. The vector $(X'X)^-X'\underline{y}$ is a solution to the normal equations. Thus, as a variation on result (4.18), we have that for any solution $\tilde{b}$ to the normal equations or, equivalently, for any vector $\tilde{b}$ at which $q(b)$ attains its minimum value,
$$X\tilde{b} = X(X'X)^-X'\underline{y} = P_X\,\underline{y} \qquad (4.19)$$
[where $P_X = X(X'X)^-X'$].
Let $\tilde{b}$ represent any solution to the normal equations. Then, the minimum value of $q(b)$ is
$$q(\tilde{b}) = (\underline{y} - X\tilde{b})'(\underline{y} - X\tilde{b}).$$
This value is reexpressible in the form
$$q(\tilde{b}) = \underline{y}'(\underline{y} - X\tilde{b}) \qquad (4.20)$$
and in the form
$$q(\tilde{b}) = \underline{y}'\underline{y} - \tilde{b}'X'\underline{y}, \qquad (4.21)$$
as is evident upon observing that
$$(X\tilde{b})'(\underline{y} - X\tilde{b}) = \tilde{b}'(X'\underline{y} - X'X\tilde{b}) = \tilde{b}'0 = 0$$
[and that $\underline{y}'X\tilde{b} = (\underline{y}'X\tilde{b})' = \tilde{b}'X'\underline{y}$]. Moreover, in light of results (4.19) and (4.20), the minimum value of $q(b)$ is also expressible as
$$q(\tilde{b}) = \underline{y}'(I - P_X)\underline{y}. \qquad (4.22)$$
Note that expression (4.22) is a quadratic form (in $\underline{y}$), the matrix of which is $I - P_X$.
In summary, we have established the following:
(1) The function $q(b) = (\underline{y} - Xb)'(\underline{y} - Xb)$ attains a minimum value at a point $\tilde{b}$ if and only if $\tilde{b}$ is a solution to the linear system $X'Xb = X'\underline{y}$ (comprising the normal equations).
(2) The linear system $X'Xb = X'\underline{y}$ is consistent.
(3) $X\tilde{b}_1 = X\tilde{b}_2$ for any two solutions $\tilde{b}_1$ and $\tilde{b}_2$ to the linear system $X'Xb = X'\underline{y}$ or, equivalently, for any two vectors $\tilde{b}_1$ and $\tilde{b}_2$ at which $q(b)$ attains its minimum value.
(3') $X\tilde{b} = P_X\underline{y}$ for any solution $\tilde{b}$ to the linear system $X'Xb = X'\underline{y}$ or, equivalently, for any vector $\tilde{b}$ at which $q(b)$ attains its minimum value.
(4) For any solution $\tilde{b}$ to the linear system $X'Xb = X'\underline{y}$,
$$\min_b q(b) = q(\tilde{b}) = \underline{y}'(\underline{y} - X\tilde{b}) = \underline{y}'\underline{y} - \tilde{b}'X'\underline{y} = \underline{y}'(I - P_X)\underline{y}.$$
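The facts summarized in (1)–(4) are easy to illustrate numerically: with a rank-deficient $X$, different routes to a solution of the normal equations give different $\tilde{b}$, but $X\tilde{b}$, $P_X\underline{y}$, and the minimized sum of squares coincide. A sketch under an assumed small example (the matrix and data vector are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
# Assumed rank-deficient model matrix (one-way classification) and an arbitrary data vector.
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
y = rng.standard_normal(4)

q = lambda b: (y - X @ b) @ (y - X @ b)

# Different solutions of the normal equations X'Xb = X'y:
b1 = np.linalg.pinv(X.T @ X) @ X.T @ y                     # via the Moore-Penrose inverse of X'X
b2 = np.linalg.lstsq(X.T @ X, X.T @ y, rcond=None)[0]      # another route to a solution
b3 = b1 + np.array([1., -1., -1.])                         # b1 plus a vector in N(X): yet another solution

print(np.allclose(X.T @ X @ b3, X.T @ y))                  # (2): the system is consistent; b3 solves it
print(np.allclose(X @ b1, X @ b2) and np.allclose(X @ b1, X @ b3))   # (3): X b~ is invariant
PX = X @ np.linalg.pinv(X.T @ X) @ X.T
print(np.allclose(X @ b1, PX @ y))                         # (3'): X b~ = P_X y
print(np.isclose(q(b1), y @ (np.eye(4) - PX) @ y))         # (4): min q(b) = y'(I - P_X) y
print(q(b1) <= q(b1 + 0.1 * rng.standard_normal(3)))       # (1): perturbing a solution cannot decrease q
```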

c. Least squares estimators of estimable functions


Suppose the setting is one in which $N$ data points are regarded as the respective values of observable random variables $y_1, y_2, \ldots, y_N$ that follow a G–M, Aitken, or general linear model. And consider the method of least squares as applied to the estimation of an estimable parametric function of the form $\lambda'\beta$. Take the functions $\delta_1(\cdot), \delta_2(\cdot), \ldots, \delta_P(\cdot)$ (and the number of such functions $P$) to be those for which (under the G–M, Aitken, or general linear model)
$$E(y_i) = \beta_1\delta_1(u_i) + \beta_2\delta_2(u_i) + \cdots + \beta_P\delta_P(u_i) \quad (i = 1, 2, \ldots, N).$$
Further, let $y = (y_1, y_2, \ldots, y_N)'$, and continue to take $X$ to be the $N \times P$ matrix with $ij$th element $x_{ij} = \delta_j(u_i)$.
By definition, the least squares estimator of an estimable function $\lambda'\beta$ is the function, say $\ell(y)$, of $y$ whose value at $y = \underline{y}$ (an arbitrary $N \times 1$ vector) is taken to be $\lambda'\tilde{b}$, where $\tilde{b}$ is any solution to the linear system
$$X'Xb = X'\underline{y} \qquad (4.23)$$
(in the $P \times 1$ vector $b$), comprising the normal equations. Unless $\mathrm{rank}(X) = P$, there are an infinite number of solutions to the normal equations and hence an infinite number of choices for $\tilde{b}$. Nevertheless, $\lambda'\tilde{b}$ is uniquely defined; that is, $\lambda'\tilde{b}$ is invariant to the choice of $\tilde{b}$. To see this, let $\tilde{b}_1$ and $\tilde{b}_2$ represent any two solutions to linear system (4.23), and observe (in light of the results of Subsection b and Section 5.3) that $X\tilde{b}_1 = X\tilde{b}_2$ and that (because of the estimability of $\lambda'\beta$) $\lambda' = a'X$ for some $N \times 1$ vector $a$, so that
$$\lambda'\tilde{b}_1 = a'X\tilde{b}_1 = a'X\tilde{b}_2 = \lambda'\tilde{b}_2.$$
The solutions to linear system (4.23) include the vector $(X'X)^-X'\underline{y}$. Thus, among the representations for the least squares estimator $\ell(y)$ of an estimable function $\lambda'\beta$ is the representation
$$\ell(y) = \lambda'(X'X)^-X'y, \qquad (4.24)$$
so the least squares estimator is a linear estimator. In the special case where $X$ is of full column rank $P$, linear system (4.23) has the unique solution $(X'X)^{-1}X'\underline{y}$. And in that special case, expression (4.24) becomes
$$\ell(y) = \lambda'(X'X)^{-1}X'y.$$
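Expression (4.24) makes the invariance claim concrete: for an estimable $\lambda'\beta$, the value of $\lambda'\tilde{b}$ does not depend on which solution of the normal equations is used, whereas for a nonestimable function it does. A sketch using the same assumed one-way-classification matrix as in the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
y = rng.standard_normal(4)

b1 = np.linalg.pinv(X.T @ X) @ X.T @ y
b2 = b1 + np.array([1., -1., -1.])          # another solution of the normal equations (shift by N(X))

lam_est = np.array([0., 1., -1.])           # alpha_1 - alpha_2: estimable
lam_non = np.array([0., 1., 0.])            # alpha_1 alone: not estimable

print(np.isclose(lam_est @ b1, lam_est @ b2))       # True: lambda'b~ is the same for every solution
print(np.isclose(lam_non @ b1, lam_non @ b2))       # False: the value depends on the solution chosen
print(lam_est @ np.linalg.pinv(X.T @ X) @ X.T @ y)  # the least squares estimate, as in (4.24)
```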

Some further results on estimability. Let us continue to suppose that $y$ is a column vector of observable random variables $y_1, y_2, \ldots, y_N$ that follow a G–M, Aitken, or general linear model. And let us consider further the subject of Section 5.3, namely, the estimability of a parametric function of the form $\lambda'\beta$ ($= \sum_{j=1}^{P}\lambda_j\beta_j$).

In Section 5.3, the estimability of $\lambda'\beta$ was related to various characteristics of the model matrix $X$. A number of conditions were set forth, each of which is necessary and sufficient for estimability. Those conditions can be restated in terms of various characteristics of the $P \times P$ matrix $X'X$, which is the coefficient matrix of the normal equations. Their restatement is based on the following results:
$$\mathcal{R}(X'X) = \mathcal{R}(X); \qquad (4.25)$$
$$\mathrm{rank}(X'X) = \mathrm{rank}(X); \qquad (4.26)$$
$$X[(X'X)^-X'X] = X \qquad (4.27)$$
[i.e., $(X'X)^-X'$ is a generalized inverse of $X$]; and
$$k'X'X = 0 \;\Leftrightarrow\; k'X' = 0 \qquad (4.28)$$
(where $k$ is an arbitrary $P \times 1$ vector) or, equivalently,
$$X'Xk = 0 \;\Leftrightarrow\; Xk = 0$$
or, also equivalently,
$$\mathcal{N}(X'X) = \mathcal{N}(X).$$
Results (4.25), (4.26), and (4.27) were established in Section 2.12, and result (4.28) is a consequence of Corollary 2.3.4.
In light of results (4.25), (4.26), (4.27), and (4.28), it follows from the results of Section 5.3 that each of the following conditions is necessary and sufficient for the estimability of $\lambda'\beta$:
(1) $\lambda' \in \mathcal{R}(X'X)$;
(2) $\lambda' = r'X'X$ for some $P \times 1$ vector $r$;
(3) $\lambda'(X'X)^-X'X = \lambda'$ or, equivalently,
(3') $\lambda'[I - (X'X)^-X'X] = 0'$;
(4) $k'\lambda = 0$ for every $P \times 1$ vector $k$ such that $k'X'X = 0$ or, equivalently,


(4') $k'\lambda = 0$ for every $P \times 1$ vector $k$ in $\mathcal{N}(X'X)$;
(5) $k_1'\lambda = k_2'\lambda = \cdots = k_S'\lambda = 0$,
where $S = P - \mathrm{rank}(X'X)$ and where $k_1, k_2, \ldots, k_S$ are any $S$ linearly independent vectors in $\mathcal{N}(X'X)$ [i.e., any $S$ linearly independent ($P$-dimensional) column vectors such that $X'Xk_1 = X'Xk_2 = \cdots = X'Xk_S = 0$].
Conditions (1), (2), (3), and (3') are stated in terms of the row vector $\lambda'$. Alternative versions of what are essentially the same conditions can be obtained by restating the conditions in terms of the column vector $\lambda$ {and by observing that $[(X'X)^-]'$, like $(X'X)^-$ itself, is a generalized inverse of $X'X$}. The alternative versions are as follows:
(1) $\lambda \in \mathcal{C}(X'X)$;
(2) $\lambda = X'Xr$ for some $P \times 1$ vector $r$;
(3) $X'X(X'X)^-\lambda = \lambda$;
(3') $[I - X'X(X'X)^-]\lambda = 0$.
As in the case of the original versions, each of these conditions is necessary and sufficient for $\lambda'\beta$ to be estimable.
Note that the $P$-dimensional random column vector $X'y$, defined by the right side of the normal equations, is such that
$$E(X'y) = X'X\beta. \qquad (4.29)$$
And observe that if $\lambda'\beta$ is estimable, then (in light of our results on estimability) there exists a $P \times 1$ vector $r$ such that
$$\lambda'\beta = r'E(X'y). \qquad (4.30)$$
Thus, an estimable function is interpretable in terms of the expected values of the $P$ elements of $X'y$. In fact, unless $\mathrm{rank}(X) = P$, an estimable function $\lambda'\beta$ will have multiple representations of the form (4.30) and hence multiple interpretations in terms of the expected values of the elements of $X'y$.
As a further implication of result (4.29), we have that, for any $P \times 1$ vector $r$ of constants,
$$E(r'X'y) = r'X'X\beta.$$
And upon observing that the least squares estimator of $r'X'X\beta$ is $r'X'y$, it follows that any linear combination of the elements of the vector $X'y$ (defined by the right side of the normal equations) is the least squares estimator of its expected value.
Conjugate normal equations. Let us resume our discussion of least squares estimation, taking the setting to be that in which $N$ data points are regarded as the respective values of the elements of an $N \times 1$ observable random vector $y$ that follows a G–M, Aitken, or general linear model. Corresponding to any linear combination $\lambda'\beta$ (of the elements of the vector $\beta$) is the linear system
$$X'Xr = \lambda, \qquad (4.31)$$
comprising $P$ equations in a $P \times 1$ vector $r$ of unknowns. The coefficient matrix $X'X$ of this linear system is the same as that of the linear system
$$X'Xb = X'y, \qquad (4.32)$$
comprising the normal equations; however, the right side of linear system (4.31) is the coefficient vector $\lambda$, while that of linear system (4.32) is $X'y$. The $P$ equations that form linear system (4.31) are sometimes referred to (collectively) as the conjugate normal equations.
It follows from the results of Part 1 of the present subsection that $\lambda'\beta$ is estimable if and only if the conjugate normal equations are consistent. Now, suppose that $\lambda'\beta$ is estimable, and consider the least squares estimator $\ell(y)$ of $\lambda'\beta$. The value $\ell(\underline{y})$ of $\ell(y)$ at $y = \underline{y}$ is expressible in terms of any
solution to the normal equations; for any solution $\tilde{b}$ to linear system (4.32),
$$\ell(\underline{y}) = \lambda'\tilde{b}. \qquad (4.33)$$
The value of $\ell(\underline{y})$ is also expressible in terms of any solution to the conjugate normal equations. For any solution $\tilde{r}$ to linear system (4.31) [and any solution $\tilde{b}$ to linear system (4.32)], we find that
$$\ell(\underline{y}) = \lambda'\tilde{b} = (X'X\tilde{r})'\tilde{b} = \tilde{r}'X'X\tilde{b} = \tilde{r}'X'\underline{y}. \qquad (4.34)$$
The upshot of result (4.34) is that the roles of the normal equations and the conjugate normal equations are (in a certain sense) interchangeable. The least squares estimate $\ell(\underline{y})$ can be obtained by forming the (usual) inner product $\lambda'\tilde{b}$ of a solution $\tilde{b}$ to the normal equations and of the right side $\lambda$ of the conjugate normal equations. Or, alternatively, it can be obtained by forming the inner product $\tilde{r}'X'\underline{y}$ of a solution $\tilde{r}$ to the conjugate normal equations and of the right side $X'\underline{y}$ of the normal equations.
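The interchangeability expressed by (4.34) can be verified directly: solving the conjugate normal equations for $\tilde{r}$ and forming $\tilde{r}'X'\underline{y}$ gives the same number as $\lambda'\tilde{b}$. A sketch, again under the assumed one-way-classification example used in the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
y = rng.standard_normal(4)
lam = np.array([0., 1., -1.])                        # an estimable function

XtX, Xty = X.T @ X, X.T @ y
b_tilde = np.linalg.pinv(XtX) @ Xty                  # a solution of the normal equations (4.32)
r_tilde = np.linalg.pinv(XtX) @ lam                  # a solution of the conjugate normal equations (4.31)

print(np.allclose(XtX @ r_tilde, lam))               # consistent, since lambda'beta is estimable
print(np.isclose(lam @ b_tilde, r_tilde @ Xty))      # (4.34): lambda'b~ = r~'X'y
```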
The general form and expected values, variances, and covariances of least squares estimators. Let us continue to take $y$ to be an $N$-dimensional observable random (column) vector that follows a G–M, Aitken, or general linear model. And let us consider the general form and expected values, variances, and covariances of least squares estimators of estimable linear combinations of the elements of $\beta$.
Suppose that $\lambda'\beta$ is an estimable linear combination. Then, in light of the results of Part 2 of the present subsection, the least squares estimator $\ell(y)$ of $\lambda'\beta$ is expressible in the form
$$\ell(y) = \tilde{r}'X'y, \qquad (4.35)$$
where $\tilde{r}$ is any solution to the conjugate normal equations $X'Xr = \lambda$. It follows immediately that the least squares estimator is a linear estimator, in confirmation of what was established earlier (in the introductory part of the present subsection) via a different approach. Moreover,
$$E(\tilde{r}'X'y) = \tilde{r}'X'E(y) = \tilde{r}'X'X\beta = (X'X\tilde{r})'\beta = \lambda'\beta. \qquad (4.36)$$
Thus, the least squares estimator is a linear unbiased estimator.


The vector $X\tilde{r}$ is invariant to the choice of the solution $\tilde{r}$ to the conjugate normal equations (as is evident, e.g., from Corollary 2.3.4). The solutions to the conjugate normal equations include the vector $(X'X)^-\lambda$, and {because $[(X'X)^-]'$, like $(X'X)^-$ itself, is a generalized inverse of $X'X$} they also include the vector $[(X'X)^-]'\lambda$. Accordingly,
$$X\tilde{r} = X(X'X)^-\lambda = X[(X'X)^-]'\lambda. \qquad (4.37)$$
Under the general linear model, the variance of the least squares estimator of $\lambda'\beta$ is [in light of result (4.37) and the equality $\tilde{r}'X' = (X\tilde{r})'$] expressible as
$$\mathrm{var}(\tilde{r}'X'y) = \tilde{r}'X'V(\theta)X\tilde{r} = \lambda'(X'X)^-X'V(\theta)X(X'X)^-\lambda. \qquad (4.38)$$
Result (4.38) can be extended. Suppose that $\lambda_1'\beta$ and $\lambda_2'\beta$ are two estimable linear combinations of the elements of $\beta$. Then, the least squares estimator of $\lambda_1'\beta$ equals $\tilde{r}_1'X'y$ and that of $\lambda_2'\beta$ equals $\tilde{r}_2'X'y$. Here, $\tilde{r}_1$ is any solution to the linear system $X'Xr_1 = \lambda_1$ (in $r_1$) and $\tilde{r}_2$ any solution to the linear system $X'Xr_2 = \lambda_2$ (in $r_2$). And under the general linear model,
$$\mathrm{cov}(\tilde{r}_1'X'y,\, \tilde{r}_2'X'y) = \tilde{r}_1'X'V(\theta)X\tilde{r}_2 = \lambda_1'(X'X)^-X'V(\theta)X(X'X)^-\lambda_2. \qquad (4.39)$$
In the special case of the Aitken model, result (4.38) "simplifies" to
$$\mathrm{var}(\tilde{r}'X'y) = \sigma^2\tilde{r}'X'HX\tilde{r} = \sigma^2\lambda'(X'X)^-X'HX(X'X)^-\lambda, \qquad (4.40)$$


and result (4.39) to
$$\mathrm{cov}(\tilde{r}_1'X'y,\, \tilde{r}_2'X'y) = \sigma^2\tilde{r}_1'X'HX\tilde{r}_2 = \sigma^2\lambda_1'(X'X)^-X'HX(X'X)^-\lambda_2. \qquad (4.41)$$
Under the G–M model, considerable further simplification is possible, and various additional representations are obtainable. Specifically, we find that (under the G–M model)
$$\mathrm{var}(\tilde{r}'X'y) = \sigma^2\tilde{r}'X'X\tilde{r} = \sigma^2\tilde{r}'\lambda = \sigma^2\lambda'\tilde{r} \qquad (4.42)$$
$$= \sigma^2\tilde{r}'X'X(X'X)^-X'X\tilde{r} = \sigma^2\lambda'(X'X)^-\lambda, \qquad (4.43)$$
and, similarly,
$$\mathrm{cov}(\tilde{r}_1'X'y,\, \tilde{r}_2'X'y) = \sigma^2\tilde{r}_1'X'X\tilde{r}_2 = \sigma^2\tilde{r}_1'\lambda_2 = \sigma^2\lambda_1'\tilde{r}_2 \qquad (4.44)$$
$$= \sigma^2\tilde{r}_1'X'X(X'X)^-X'X\tilde{r}_2 = \sigma^2\lambda_1'(X'X)^-\lambda_2. \qquad (4.45)$$
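Under the G–M model, formula (4.43) gives the variance of the least squares estimator as $\sigma^2\lambda'(X'X)^-\lambda$. A quick Monte Carlo check under an assumed small example (model matrix, $\beta$, and $\sigma$ made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.array([[1., 1., 0.],
              [1., 1., 0.],
              [1., 0., 1.],
              [1., 0., 1.]])
beta = np.array([1., 2., -1.])
sigma = 0.7
lam = np.array([0., 1., -1.])                       # an estimable function: alpha_1 - alpha_2

XtX_pinv = np.linalg.pinv(X.T @ X)
coef = lam @ XtX_pinv @ X.T                          # the least squares estimator is coef @ y, cf. (4.24)

# Simulate y under the G-M model: var(y) = sigma^2 I.
y = X @ beta + sigma * rng.standard_normal((200_000, X.shape[0]))
est = y @ coef

print(est.mean(), lam @ beta)                        # unbiased, cf. (4.36)
print(est.var(), sigma**2 * lam @ XtX_pinv @ lam)    # sample variance matches (4.43)
```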

d. The geometry of least squares


It can be informative to consider the method of least squares from a geometrical perspective. As a preliminary to doing so, let us extend some of the basic definitions of plane and solid geometry from $\mathbb{R}^2$ and $\mathbb{R}^3$, where they can be interpreted visually, to $\mathbb{R}^M$ (where $M$ may be greater than 3).
Geometrically-related definitions. The inner (or dot) product of two $M$-dimensional column vectors, say $x = \{x_i\}$ and $y = \{y_i\}$, is denoted by the symbol $x \cdot y$. For a general definition of the inner product of two $M$-dimensional column vectors (or two $M \times N$ matrices), refer to Section 2.4f. The usual inner product is that for which
$$x \cdot y = y'x = x'y = \sum_{i=1}^{M} x_i y_i. \qquad (4.46)$$
The usual inner product is (in the special cases $M = 2$ and $M = 3$) the inner product customarily employed in plane and solid geometry.
The definition of an inner product underlies the definitions of various other quantities. Consider, in particular, the norm (also known as the length or magnitude) of an $M$-dimensional column vector $x = \{x_i\}$. The norm of $x$ is denoted by the symbol $\|x\|$. By definition,
$$\|x\| = (x \cdot x)^{1/2}.$$
When the inner product is taken to be the usual inner product,
$$\|x\| = (x'x)^{1/2} = \Bigl(\sum_{i=1}^{M} x_i^2\Bigr)^{1/2}, \qquad (4.47)$$
and the norm is referred to as the usual norm.
The distance between two $M$-dimensional column vectors $x = \{x_i\}$ and $y = \{y_i\}$ is defined to be the norm $\|x - y\|$ of the difference $x - y$ between $x$ and $y$. In the case of the usual inner product,
$$\|x - y\| = [(x - y)'(x - y)]^{1/2} = \Bigl[\sum_{i=1}^{M} (x_i - y_i)^2\Bigr]^{1/2}. \qquad (4.48)$$

The angle between two nonnull M -dimensional column vectors x D fxi g and y D fyi g is
defined indirectly in terms of its cosine. Specifically, the angle between x and y is the angle 
(0    ) defined by
xy
cos  D (4.49)
kxk kyk
xy
—it follows from Theorem 2.4.21 (the Cauchy–Schwarz inequality) that 1   1. In the
kxkkyk
184 Estimation and Prediction: Classical Approach

case of the usual inner product (and usual norm), equality (4.49) can be reexpressed in the form
PM
x0 y i D1 xi yi
cos  D 0 1=2 0 1=2 D P 1=2 PM 2 1=2 : (4.50)
.x x/ .y y/ M 2
i D1 xi i D1 yi

By definition, two M -dimensional column vectors x D fxi g and y D fyi g are orthogonal (or
perpendicular) to each other if x  y D 0. Thus, when the inner product is taken to be the usual inner
product, x and y are orthogonal to each other if x0 y D 0 or, equivalently, if M i D1 xi yi D 0. The
P
statement that x and y are orthogonal to each other is sometimes abbreviated to the statement that x
and y are orthogonal. Clearly, two nonnull vectors are orthogonal if and only if the angle between
them is =2 (90ı) or, equivalently, the cosine of that angle is 0.
If an $M$-dimensional column vector $x$ is orthogonal to every vector in a subspace $U$ of $M$-dimensional column vectors, $x$ is said to be orthogonal to $U$. The set consisting of all $M$-dimensional column vectors that are orthogonal to the subspace $U$ is called the orthogonal complement of $U$ and is denoted by the symbol $U^\perp$. The set $U^\perp$ is a linear space (as can be readily verified). When $U = \mathcal{C}(X)$ (where $X$ is a matrix), we may write $\mathcal{C}^\perp(X)$ for $U^\perp$.
Least squares revisited: the projection and decomposition of the data vector. Denote by $y = (y_1, y_2, \ldots, y_N)'$ an $N$-dimensional column vector of data points; this notation differs somewhat from that employed earlier in the section (which included an underline). Further, suppose that $y_1, y_2, \ldots, y_N$ are accompanied by the corresponding values $u_1, u_2, \ldots, u_N$ of a $C$-dimensional column vector $u$ of explanatory variables. Let us consider the approximation of $y_1, y_2, \ldots, y_N$ by $\delta(u_1), \delta(u_2), \ldots, \delta(u_N)$, where $\delta(u)$ is a function of $u$. Which of the possible choices for the function $\delta(\cdot)$ results in the "best" approximation (and in what sense)? In particular, which results in the best approximation when the choice for $\delta(\cdot)$ is restricted to functions (of $u$) that are expressible as linear combinations of $P$ specified functions $\delta_1(\cdot), \delta_2(\cdot), \ldots, \delta_P(\cdot)$; that is, when the choice is restricted to those functions that are expressible in the form
$$\delta(u) = b_1\delta_1(u) + b_2\delta_2(u) + \cdots + b_P\delta_P(u), \qquad (4.51)$$
where $b_1, b_2, \ldots, b_P$ are arbitrary scalars.
In the method of least squares, the function $\delta(\cdot)$ is chosen so as to minimize the quantity $\{\sum_{i=1}^{N} [y_i - \delta(u_i)]^2\}^{1/2}$. This quantity is the (usual) norm of the $N$-dimensional vector whose elements are the individual errors of approximation $y_1 - \delta(u_1), y_2 - \delta(u_2), \ldots, y_N - \delta(u_N)$. It is interpretable as the (ordinary) distance between the $N$-dimensional data vector $y$ and the $N$-dimensional vector whose elements are the approximations $\delta(u_1), \delta(u_2), \ldots, \delta(u_N)$.
Suppose now that $\delta(\cdot)$ is taken to be of the form (4.51). And (for $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, P$) define $x_{ij} = \delta_j(u_i)$. Further, let $X$ represent the $N \times P$ matrix with $ij$th element $x_{ij}$, and (for $j = 1, 2, \ldots, P$) take $x_j = (x_{1j}, x_{2j}, \ldots, x_{Nj})'$ (so that $x_j$ is the $j$th column of $X$). Then, as discussed earlier (in the introductory part of the present section),
$$\sum_{i=1}^{N} [y_i - \delta(u_i)]^2 = \sum_{i=1}^{N} \Bigl(y_i - \sum_{j=1}^{P} x_{ij} b_j\Bigr)^2 = (y - Xb)'(y - Xb), \qquad (4.52)$$
where $b = (b_1, b_2, \ldots, b_P)'$. Note [in connection with result (4.52)] that
$$(y - Xb)'(y - Xb) = \Bigl(y - \sum_{j=1}^{P} b_j x_j\Bigr)'\Bigl(y - \sum_{j=1}^{P} b_j x_j\Bigr).$$

In light of result (4.52), the minimization problem that gives rise to the method of least squares
can be regarded as that of minimizing .y Xb/0 .y Xb/ [or, equivalently, that of minimizing the
(usual) norm of y Xb] with respect to the P  1 vector b. As previously indicated (in Subsection
The Method of Least Squares 185

b), .y Xb/0 .y Xb/ attains a minimum value at a P  1 vector bQ if and only if bQ is a solution to
the normal equations X0 Xb D X0 y.
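As a purely illustrative numerical sketch of this route (assuming Python with numpy is available, and using hypothetical data), the following forms and solves the normal equations X'Xb = X'y and checks that the resulting residual vector is orthogonal to the columns of X:

```python
import numpy as np

# Hypothetical data: N = 5 observations, P = 2 basis functions (intercept and slope).
u = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
X = np.column_stack([np.ones_like(u), u])      # x_{ij} = delta_j(u_i)

# Form and solve the normal equations X'Xb = X'y.
b_tilde = np.linalg.solve(X.T @ X, X.T @ y)

# The residual vector y - X b_tilde is orthogonal to every column of X.
residual = y - X @ b_tilde
print(b_tilde)
print(X.T @ residual)   # numerically zero
```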
Some further insights into the method of least squares can be obtained by transforming the underlying minimization problem into a more geometrically meaningful form. Let U = C(X), and observe that an N-dimensional column vector w is a member of U if and only if w = Xb for some b, in which case the elements of b are the "coordinates" of w with respect to the spanning set \{x_1, x_2, \ldots, x_P\}. Accordingly, the problem of minimizing (y - Xb)'(y - Xb) with respect to b can be reformulated as the "coordinate-free" problem of minimizing (y - w)'(y - w) with respect to w, where w is an arbitrary member of the linear space U. The latter problem depends on the matrix X only through its column space. From a geometrical perspective, the problem is that of finding the vector in the subspace U (of R^N) that is "closest" to the data vector y.
It follows from the results of Subsection b that (y - w)'(y - w) attains a minimum value over the subspace U, doing so at a unique point z that is expressible as

z = X\tilde{b},    (4.53)

where \tilde{b} is any solution to the normal equations X'Xb = X'y, and also as

z = P_X y.    (4.54)

Taking (here and in the remainder of the present subsection) the inner product to be the usual inner product, the vector z is such that y - z \in U^\perp; that is, the difference between y and z is orthogonal to every vector in U. To see this, let a represent an arbitrary member of U [= C(X)], and observe that a = Xr for some P \times 1 vector r and hence that

a'(y - z) = r'X'(y - X\tilde{b}) = r'(X'y - X'X\tilde{b}) = r'(X'y - X'y) = r'0 = 0.

Moreover, there is no member w of U other than z for which y - w \in U^\perp, as is evident upon observing that if w \in U and y - w \in U^\perp, then w - z \in U and

w - z = (y - z) - (y - w) \in U^\perp

(so that the vector w - z is orthogonal to itself), implying that

(w - z)'(w - z) = 0

and hence that w - z = 0 or, equivalently, that

w = z.

In summary, there is a unique vector w in U such that y - w \in U^\perp, namely, the vector z. This vector is referred to as the orthogonal projection of y on U or simply as the projection of y on U. As previously indicated, the matrix P_X is referred to as a projection matrix; the reason why is apparent from expression (4.54).
Conceptually, the point z in R^N at which (y - w)'(y - w) attains its minimum value for w \in U is obtainable by "projecting" the point y onto the surface U. The point in R^N located by this operation is such that the "line" formed by joining that point with the point y is orthogonal (perpendicular) to the surface U.
Corresponding to the projection z of y on U is the decomposition

y = z + d,    (4.55)

where d = y - z. The first component of this decomposition is a member of the linear space U [= C(X)], and the second component is a member of the orthogonal complement U^\perp [= C^\perp(X)] of U. In this context, the linear space U is sometimes referred to as the estimation space—logically, it could also be referred to as the approximation space—and the linear space U^\perp is sometimes referred to as the error space. Decomposition (4.55) is unique; if y is expressed as the sum of two components, the first of which is in the estimation space U and the second of which is in the error space U^\perp, then necessarily the first component equals z and the second equals d (= y - z).

FIGURE 5.1. The projection z of the 2-dimensional data vector y = (4, 8)' on the 1-dimensional linear space U = C(X), where X = x = (3, 1)'.
Example: N = 2. Suppose that N = 2, that y = (4, 8)', and that X = x, where x is the 2-dimensional column vector x = (3, 1)' (in which case P = 1). Then, the linear system X'Xb = X'y becomes (10)b = (20), which has the unique solution b = (2). Thus, the projection of y on the linear space U [= C(X)] is the vector

z = \begin{pmatrix} 3 \\ 1 \end{pmatrix}(2) = \begin{pmatrix} 6 \\ 2 \end{pmatrix},

as depicted in Figure 5.1.
Example: N = 3. Suppose that N = 3, that y = (3, -38/5, 74/5)', and that X = (x_1, x_2, x_3), where

x_1 = \begin{pmatrix} 0 \\ 3 \\ 6 \end{pmatrix},   x_2 = \begin{pmatrix} -2 \\ 2 \\ 4 \end{pmatrix},   and   x_3 = \begin{pmatrix} -2 \\ 1 \\ 2 \end{pmatrix}.

Clearly, x_1 and x_2 are linearly independent, and x_3 = x_2 - (1/3)x_1. Thus, the linear space U [= C(X)] is of dimension 2.
The normal equations X'Xb = X'y become

\begin{pmatrix} 45 & 30 & 15 \\ 30 & 24 & 14 \\ 15 & 14 & 9 \end{pmatrix} b = \begin{pmatrix} 66 \\ 38 \\ 16 \end{pmatrix}.

One solution to these equations is the vector (32/15, -1/2, -1)'. Thus, the projection of y on the linear space U [= C(X)] is

z = \begin{pmatrix} 0 & -2 & -2 \\ 3 & 2 & 1 \\ 6 & 4 & 2 \end{pmatrix}\begin{pmatrix} 32/15 \\ -1/2 \\ -1 \end{pmatrix} = \begin{pmatrix} 3 \\ 22/5 \\ 44/5 \end{pmatrix},

as depicted in Figure 5.2.

FIGURE 5.2. The projection z of the 3-dimensional data vector y = (3, -38/5, 74/5)' on the 2-dimensional linear space U = C(X), where X = (x_1, x_2, x_3), with x_1 = (0, 3, 6)', x_2 = (-2, 2, 4)', and x_3 = (-2, 1, 2)'.
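The N = 3 example is easy to check numerically. A minimal sketch (assuming numpy is available) recomputes the projection z = P_X y, forming P_X = X(X'X)^- X' with the Moore–Penrose pseudoinverse playing the role of the generalized inverse (X is rank deficient here), and confirms that y - z is orthogonal to C(X):

```python
import numpy as np

x1 = np.array([0.0, 3.0, 6.0])
x2 = np.array([-2.0, 2.0, 4.0])
x3 = np.array([-2.0, 1.0, 2.0])
X = np.column_stack([x1, x2, x3])                  # rank(X) = 2
y = np.array([3.0, -38/5, 74/5])

# P_X = X (X'X)^- X', with the pseudoinverse as the generalized inverse.
P_X = X @ np.linalg.pinv(X.T @ X) @ X.T
z = P_X @ y
print(np.round(z, 6))                 # (3, 22/5, 44/5) = (3, 4.4, 8.8)
print(np.round(X.T @ (y - z), 6))     # orthogonality of y - z to C(X): numerically zero
```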
e. Least squares computations

Let us continue to take y = (y_1, y_2, \ldots, y_N)' to be an N-dimensional vector of data points and to suppose that y_1, y_2, \ldots, y_N are accompanied by the values u_1, u_2, \ldots, u_N of a C-dimensional (column) vector u of explanatory variables. And let us continue to consider the approximation of y_1, y_2, \ldots, y_N by \delta(u_1), \delta(u_2), \ldots, \delta(u_N), where \delta(\cdot) is a function (of u) that is expressible in the form of a linear combination

\delta(u) = b_1\delta_1(u) + b_2\delta_2(u) + \cdots + b_P\delta_P(u)

of P specified functions \delta_1(\cdot), \delta_2(\cdot), \ldots, \delta_P(\cdot). Further, take b = (b_1, b_2, \ldots, b_P)' to be the P \times 1 vector of coefficients, and take X to be the N \times P matrix whose ij-th element x_{ij} is defined by x_{ij} = \delta_j(u_i). In the method of least squares, the value of b is taken to be a value at which the quantity (y - Xb)'(y - Xb) attains a minimum value.
As discussed in Subsection b, (y - Xb)'(y - Xb) attains a minimum value at a P \times 1 vector \tilde{b} if and only if \tilde{b} is a solution to the normal equations X'Xb = X'y. Accordingly, the least squares computations can be carried out by forming and solving the normal equations. Alternatively, by making use of various results on the decomposition of a matrix (as applied to the matrix X), they can be carried out in a way that does not require the formation of the normal equations. The alternative approach can be advantageous from the standpoint of numerical accuracy, though any such advantage typically comes at the expense of greater demands on computing resources. In many implementations of the alternative approach, the underlying decomposition of the matrix X is taken to be a decomposition that is known as the QR decomposition.
QR decomposition of a matrix. Any M \times N matrix A of full column rank N is expressible in the form

A = Q_1 R_1,    (4.56)

where Q_1 is an M \times N matrix with orthonormal columns and R_1 is an upper triangular matrix with (strictly) positive diagonal elements. Moreover, the matrices Q_1 and R_1 are unique. The existence of a decomposition of the form (4.56) can be established by, for example, applying Gram–Schmidt orthogonalization. For a proof of the existence and uniqueness of a decomposition of the form (4.56), refer, for example, to Harville (1997, sec. 6.4).
As a variation on expression (4.56), we have the expression

A = QR,    (4.57)

where Q = (Q_1, Q_2) is an M \times M orthogonal matrix and where R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix}. The columns of the M \times (M - N) submatrix Q_2 are any M - N M-dimensional column vectors that together with the N columns of Q_1 form an orthonormal basis for R^M.
Either of the two decompositions (4.56) and (4.57) might be referred to as a QR decomposition.
QR decomposition as a basis for least squares computations. Let us now resume our discussion of the computational aspects of the minimization of (y - Xb)'(y - Xb). Assume that the N \times P matrix X is of full column rank P—discussion of the general case where rank(X) may be less than P is deferred until the final part of the present subsection. Consider the QR decomposition of X. That is, consider a decomposition of X of the form

X = Q_1 R_1,    (4.58)

where Q_1 is an N \times P matrix with orthonormal columns and R_1 is an upper triangular matrix with (strictly) positive diagonal elements, or of the related form

X = QR,    (4.59)

where Q = (Q_1, Q_2) is an N \times N orthogonal matrix and where R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix}.
Let z = Q'y, and partition z as z = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}, where z_1 = Q_1'y and z_2 = Q_2'y. Then,

y - Xb = Q(z - Rb) = Q_1(z_1 - R_1 b) + Q_2 z_2.    (4.60)

And

(y - Xb)'(y - Xb) = (z - Rb)'Q'Q(z - Rb) = (z - Rb)'(z - Rb) = (z_1 - R_1 b)'(z_1 - R_1 b) + z_2'z_2.    (4.61)
Now, consider the linear system

R_1 b = z_1    (4.62)

(in the vector b), whose coefficient matrix is R_1 and right side is z_1. It follows from elementary results on triangular matrices (e.g., Harville 1997, corollary 8.5.6) that R_1 is nonsingular. Thus, linear system (4.62) has a unique solution, say \tilde{b}. Clearly, the first term of expression (4.61) equals 0 if b = \tilde{b}, and is greater than 0 otherwise (i.e., if b \ne \tilde{b}). And because the second term of expression (4.61) does not depend on b, we conclude that (y - Xb)'(y - Xb) attains a minimum value of z_2'z_2 and does so uniquely at the point \tilde{b}. Moreover, in light of result (4.60),

y - X\tilde{b} = Q_2 z_2.    (4.63)

These results serve as the basis for an alternative approach to the least squares computations (not requiring the formation of the normal equations). In the alternative approach, the formation of the matrix R_1 and the vector z_1 is at the heart of the computations. Their formation can be accomplished through the use of Householder transformations (reflections) or Givens transformations (rotations) or through the use of a modified Gram–Schmidt procedure—refer, for example, to Golub and Van Loan (2013, chap. 5) for a detailed discussion. The value \tilde{b} at which (y - Xb)'(y - Xb) attains its minimum value is determined from R_1 and z_1 by solving linear system (4.62), doing so in a way that exploits the triangularity of R_1—refer, for example, to Harville (1997, sec. 11.8) for a discussion of the solution of a linear system with a triangular coefficient matrix.
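A minimal sketch of the QR-based computations just described, for the full-column-rank case (assuming numpy and scipy are available): numpy's reduced QR factorization supplies Q_1 and R_1, z_1 = Q_1'y is formed, and the triangular system R_1 b = z_1 is solved by back substitution.

```python
import numpy as np
from scipy.linalg import solve_triangular

# Hypothetical full-column-rank model matrix and data vector.
rng = np.random.default_rng(0)
N, P = 8, 3
X = rng.standard_normal((N, P))
y = rng.standard_normal(N)

Q1, R1 = np.linalg.qr(X, mode='reduced')   # X = Q1 R1, R1 upper triangular
z1 = Q1.T @ y
b_tilde = solve_triangular(R1, z1)         # exploits the triangularity of R1

# numpy does not force the diagonal of R1 to be positive, but the minimizer is unaffected;
# it agrees with the normal-equations route.
print(np.allclose(b_tilde, np.linalg.solve(X.T @ X, X.T @ y)))
```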
Our results on the alternative approach to the least squares computations can be extended to the
general case where the matrix X is not necessarily of full column rank. The extension requires some
familiarity with a type of matrix called a permutation matrix.
Permutation matrices. A permutation matrix is a square matrix whose columns can be obtained by permuting (rearranging) the columns of an identity matrix. Thus, letting u_1, u_2, \ldots, u_N represent the first, second, \ldots, Nth columns, respectively, of I_N, an N \times N permutation matrix is a matrix of the general form (u_{k_1}, u_{k_2}, \ldots, u_{k_N}), where k_1, k_2, \ldots, k_N is an arbitrary permutation of the first N positive integers 1, 2, \ldots, N. For example, one permutation matrix of order N = 3 is the 3 \times 3 matrix

(u_3, u_1, u_2) = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix},

whose columns are the third, first, and second columns, respectively, of I_3. Clearly, the columns of any permutation matrix form an orthonormal (with respect to the usual inner product) set, and hence any permutation matrix is an orthogonal matrix.
The jth element of the k_j-th row of the N \times N permutation matrix (u_{k_1}, u_{k_2}, \ldots, u_{k_N}) is 1, and its other N - 1 elements are 0. That is, the jth row u_j' of I_N is the k_j-th row of (u_{k_1}, u_{k_2}, \ldots, u_{k_N}) or, equivalently, the jth column u_j of I_N is the k_j-th column of (u_{k_1}, u_{k_2}, \ldots, u_{k_N})'. Thus, the transpose of any permutation matrix is itself a permutation matrix. Further, the rows of any permutation matrix are a permutation of the rows of an identity matrix and, conversely, any matrix whose rows can be obtained by permuting the rows of an identity matrix is a permutation matrix.
The effect of postmultiplying an M \times N matrix A by an N \times N permutation matrix P is to permute the columns of A in the same way that the columns of I_N were permuted in forming P. Thus, if a_1, a_2, \ldots, a_N are the first, second, \ldots, Nth columns of A, the first, second, \ldots, Nth columns of the product A(u_{k_1}, u_{k_2}, \ldots, u_{k_N}) of A and the N \times N permutation matrix (u_{k_1}, u_{k_2}, \ldots, u_{k_N}) are a_{k_1}, a_{k_2}, \ldots, a_{k_N}, respectively. Further, the first, second, \ldots, Nth columns a_1, a_2, \ldots, a_N of A are the k_1, k_2, \ldots, k_N-th columns, respectively, of the product A(u_{k_1}, u_{k_2}, \ldots, u_{k_N})' of A and the permutation matrix (u_{k_1}, u_{k_2}, \ldots, u_{k_N})'. When N = 3, we have, for example, that

A(u_3, u_1, u_2) = (a_1, a_2, a_3)\begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} = (a_3, a_1, a_2)

and

A(u_3, u_1, u_2)' = (a_1, a_2, a_3)\begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} = (a_2, a_3, a_1).
Similarly, the effect of premultiplying an N \times M matrix A by an N \times N permutation matrix is to permute the rows of A. If the first, second, \ldots, Nth rows of A are a_1', a_2', \ldots, a_N', respectively, then the first, second, \ldots, Nth rows of the product (u_{k_1}, u_{k_2}, \ldots, u_{k_N})'A of the permutation matrix (u_{k_1}, u_{k_2}, \ldots, u_{k_N})' and A are a_{k_1}', a_{k_2}', \ldots, a_{k_N}', respectively, and a_1', a_2', \ldots, a_N' are the k_1, k_2, \ldots, k_N-th rows, respectively, of (u_{k_1}, u_{k_2}, \ldots, u_{k_N})A. When N = 3, we have, for example, that

(u_3, u_1, u_2)'A = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}\begin{pmatrix} a_1' \\ a_2' \\ a_3' \end{pmatrix} = \begin{pmatrix} a_3' \\ a_1' \\ a_2' \end{pmatrix}

and

(u_3, u_1, u_2)A = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}\begin{pmatrix} a_1' \\ a_2' \\ a_3' \end{pmatrix} = \begin{pmatrix} a_2' \\ a_3' \\ a_1' \end{pmatrix}.
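The column- and row-permutation effects just described are easy to verify numerically; a small sketch (numpy assumed):

```python
import numpy as np

I3 = np.eye(3)
P = I3[:, [2, 0, 1]]          # (u_3, u_1, u_2): columns 3, 1, 2 of the identity
A = np.arange(9.0).reshape(3, 3)

print(A @ P)       # columns of A rearranged to (a_3, a_1, a_2)
print(A @ P.T)     # columns of A rearranged to (a_2, a_3, a_1)
print(P.T @ A)     # rows of A rearranged to (a_3', a_1', a_2')
print(P @ A)       # rows of A rearranged to (a_2', a_3', a_1')
print(np.allclose(P.T @ P, I3))   # any permutation matrix is orthogonal
```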
Alternative approach to least squares computations: general case. Let us now extend our initial results on the alternative approach to the least squares computations. Accordingly, suppose that we wish to minimize the quantity (y - Xb)'(y - Xb) and that rank(X) = K, where K is possibly less than P—our initial results (the results of Part 2) were obtained under the simplifying assumption that the N \times P matrix X is of full column rank P.
Let L represent any P \times P permutation matrix such that the first K columns of the N \times P matrix XL are linearly independent, and partition L as L = (L_1, L_2), where L_1 is of dimensions P \times K. Then,

XL = (XL_1, XL_2),

and XL_1 is of full column rank K. Decompose XL_1 as

XL_1 = Q_1 R_1,    (4.64)

where Q_1 is an N \times K matrix with orthonormal columns and R_1 is an upper triangular matrix with (strictly) positive diagonal elements. And observe that the columns of Q_1 form a basis for C(XL) [= C(XL_1)], so that

XL_2 = Q_1 R_2    (4.65)

for some matrix R_2. Together, results (4.64) and (4.65) imply that

XL = Q_1(R_1, R_2)

and also that

XL = QR,

where Q = (Q_1, Q_2) is an N \times N orthogonal matrix and where R = \begin{pmatrix} R_1 & R_2 \\ 0 & 0 \end{pmatrix}. Or, equivalently,

X = Q_1(R_1, R_2)L' = Q_1 R_1 L_1' + Q_1 R_2 L_2'

and

X = QRL'.    (4.66)

Let h = L'b, and partition h as h = \begin{pmatrix} h_1 \\ h_2 \end{pmatrix}, where h_1 = L_1'b and h_2 = L_2'b. Further, let
z = Q'y, and partition z as z = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}, where z_1 = Q_1'y and z_2 = Q_2'y. Then,

y - Xb = Q(z - Rh) = Q_1(z_1 - R_1 h_1 - R_2 h_2) + Q_2 z_2,    (4.67)

which is a generalization of result (4.60). And

(y - Xb)'(y - Xb) = (z - Rh)'Q'Q(z - Rh) = (z - Rh)'(z - Rh)
                  = (z_1 - R_1 h_1 - R_2 h_2)'(z_1 - R_1 h_1 - R_2 h_2) + z_2'z_2,    (4.68)

which is a generalization of result (4.61).
Now, consider the minimization of (y - Xb)'(y - Xb) with respect to the transformed vector h (= L'b). It follows from result (4.68) that (y - Xb)'(y - Xb) attains a minimum value of z_2'z_2 and that it does so at those values of h for which the first term of expression (4.68) equals 0 or, equivalently, at those values for which z_1 - R_1 h_1 - R_2 h_2 = 0. Accordingly, (y - Xb)'(y - Xb) attains a minimum value of z_2'z_2 at values \tilde{h}_1 and \tilde{h}_2 of h_1 and h_2, respectively, if and only if R_1\tilde{h}_1 = z_1 - R_2\tilde{h}_2 or, equivalently, if and only if \tilde{h}_1 is the solution to the linear system

R_1 h_1 = z_1 - R_2\tilde{h}_2    (4.69)

(in the vector h_1). Thus, an arbitrary one of the values of h at which (y - Xb)'(y - Xb) attains a minimum value is obtained by assigning h_2 an arbitrary value \tilde{h}_2 and by then taking the value \tilde{h}_1 of h_1 to be the solution to linear system (4.69)—the matrix R_1 is nonsingular, so that \tilde{h}_1 is uniquely determined by \tilde{h}_2. In particular, we could take the value of h_2 to be 0, and take the value of h_1 to be the (unique) solution to the linear system R_1 h_1 = z_1.
We conclude that (y - Xb)'(y - Xb) attains a minimum value of z_2'z_2 and that it does so at a value \tilde{b} of b if and only if, for some (P - K) \times 1 vector \tilde{h}_2, \tilde{b} = L_1\tilde{h}_1 + L_2\tilde{h}_2, where \tilde{h}_1 is the solution to linear system (4.69). Note [in light of result (4.67)] that for any such (minimizing) value \tilde{b} of b,

y - X\tilde{b} = Q_2 z_2.    (4.70)

These results generalize the results obtained earlier (in Part 2) for the special case where the rank K of the N \times P matrix X equals P. They provide a basis for extending the alternative approach to the least squares computations to the general case (where K may be less than P). As in the special case, the formation of the matrix R_1 and the vector z_1 is at the heart of the computations. (And, as in the special case, the formation of R_1 and z_1 can be accomplished via any of several procedures devised for that purpose.) In the general case, there is also a need to determine the permutation matrix L (i.e., to identify K linearly independent columns of X) and possibly the matrix R_2—if the value of h_2 is taken to be 0, then R_2 is not needed. A value \tilde{b} at which (y - Xb)'(y - Xb) attains its minimum value is determined from R_1, z_1, L, and possibly R_2 by taking \tilde{h}_2 to be any (P - K) \times 1 vector, by computing the solution \tilde{h}_1 to linear system (4.69), and by setting \tilde{b} = L_1\tilde{h}_1 + L_2\tilde{h}_2.
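A sketch of these general-case computations under stated assumptions (numpy and scipy assumed; the permutation L is obtained from scipy's column-pivoted QR, and h_2 is taken to be 0 as suggested above):

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(1)
N, P = 10, 4
B = rng.standard_normal((N, 3))
X = B @ rng.standard_normal((3, P))        # hypothetical N x P model matrix of rank K = 3
y = rng.standard_normal(N)

# Column-pivoted QR: X L = Q R, with L recorded as the index array `piv` (X[:, piv] = Q R).
Q, R, piv = qr(X, mode='economic', pivoting=True)
K = np.linalg.matrix_rank(X)
R1 = R[:K, :K]                             # nonsingular upper triangular block
z1 = (Q.T @ y)[:K]

# Take h_2 = 0, solve R1 h1 = z1, and map back to b via b = L h (i.e., b[piv] = h).
h1 = solve_triangular(R1, z1)
b_tilde = np.zeros(P)
b_tilde[piv[:K]] = h1

# b_tilde minimizes (y - Xb)'(y - Xb): its fitted values match the projection of y on C(X).
print(np.allclose(X @ b_tilde, X @ np.linalg.lstsq(X, y, rcond=None)[0]))
```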
5.5 Best Linear Unbiased or Translation-Equivariant Estimation of Estimable Functions (under the G–M Model)

Suppose that y is an N \times 1 observable random vector that follows a G–M, Aitken, or general linear model. And consider the least squares estimator \ell(y) of an estimable linear combination \lambda'\beta of the elements of the parametric vector \beta. The least squares estimator is a linear estimator, as was demonstrated in Section 5.4c—refer to representation (4.24) or (4.35). Moreover, the least squares estimator is an unbiased estimator. Its unbiasedness can be established directly by verifying that E[\ell(y)] = \lambda'\beta, as was done in Section 5.4c—refer to result (4.36). Alternatively, its unbiasedness can be established by applying the following result (from Section 5.1) on the unbiasedness of linear estimators: for an estimator of the form c + a'y to be an unbiased estimator of \lambda'\beta, it is necessary and sufficient that

c = 0 and X'a = \lambda.    (5.1)

Upon observing [in light of result (4.35)] that \ell(y) = (X\tilde{r})'y, where \tilde{r} is any solution to the conjugate normal equations X'Xr = \lambda, it follows immediately from the sufficiency of condition (5.1) that \ell(y) is an unbiased estimator of \lambda'\beta.
The least squares estimator of \lambda'\beta is translation equivariant as well as unbiased. To see this, recall (from Section 5.2) that for an estimator of the form c + a'y to be a translation-equivariant estimator of \lambda'\beta, it is necessary and sufficient that a'X = \lambda' or, equivalently, that X'a = \lambda. The translation equivariance of the least squares estimator \ell(y) [which is expressible in the form \ell(y) = (X\tilde{r})'y] follows from the sufficiency of the condition X'a = \lambda in much the same way that its unbiasedness follows from the sufficiency of condition (5.1).
When y follows a G–M model, the least squares estimator of \lambda'\beta is superior to other linear unbiased or translation-equivariant estimators in a sense that is to be discussed in Subsections a and b. More generally (when y follows an Aitken or general linear model), this superiority is confined to special cases. These special cases include, of course, G–M models, but also a limited number of other models.
a. Gauss–Markov theorem

In the special case where y is an N \times 1 observable random vector that follows a G–M model, the least squares estimator of an estimable linear combination \lambda'\beta of the elements of the parametric vector \beta is the best linear unbiased estimator in the sense described in the following theorem.
Theorem 5.5.1 (Gauss–Markov theorem). Suppose that y is an N \times 1 observable random vector that follows a G–M, Aitken, or general linear model, and suppose that \lambda'\beta is an estimable linear combination of the elements of the parametric vector \beta. Then, the least squares estimator of \lambda'\beta is a linear unbiased estimator. Moreover, in the special case where y follows a G–M model, the variance (and hence the mean squared error) of the least squares estimator is uniformly smaller than that of any other linear unbiased estimator.
Proof. That the least squares estimator is a linear unbiased estimator was established earlier (in the introductory part of the present section). Now, take c + a'y to be an arbitrary linear unbiased estimator of the estimable linear combination \lambda'\beta, in which case

c = 0 and X'a = \lambda

(as noted earlier). And recall that the least squares estimator of \lambda'\beta is expressible in the form (X\tilde{r})'y, where \tilde{r} is any solution to the conjugate normal equations X'Xr = \lambda.
In the special case where y follows a G–M model, we find that

cov[(X\tilde{r})'y, c + a'y - (X\tilde{r})'y] = \tilde{r}'X'(\sigma^2 I)(a - X\tilde{r})
                                             = \sigma^2\tilde{r}'(X'a - X'X\tilde{r})
                                             = \sigma^2\tilde{r}'(\lambda - \lambda) = 0.
Thus, in that special case,

var(c + a'y) = var[(X\tilde{r})'y + c + a'y - (X\tilde{r})'y]
             = var[(X\tilde{r})'y] + var[c + a'y - (X\tilde{r})'y] + 2 cov[(X\tilde{r})'y, c + a'y - (X\tilde{r})'y]
             = var[(X\tilde{r})'y] + var[c + a'y - (X\tilde{r})'y]
             \ge var[(X\tilde{r})'y],    (5.2)

with equality holding if and only if var[c + a'y - (X\tilde{r})'y] = 0. Moreover, in the special case of the G–M model,

var[c + a'y - (X\tilde{r})'y] = (a - X\tilde{r})'(\sigma^2 I)(a - X\tilde{r}) = \sigma^2(a - X\tilde{r})'(a - X\tilde{r}),

so that equality holds in inequality (5.2) if and only if a - X\tilde{r} = 0 or, equivalently, if and only if a = X\tilde{r}. We conclude that in the special case of the G–M model, the variance of the least squares estimator is uniformly smaller than that of any other linear unbiased estimator. Q.E.D.
Theorem 5.5.1 (in one form or another) has come to be known as the Gauss–Markov theorem
(in honor of the contributions of Carl Friedrich Gauss and Andrei Andreevich Markov). It is one of
the most famous theoretical results in all of statistics. Seal (1967, sec. 3) considered this result from
a historical perspective. That Gauss’s name has come to be attached to the result of Theorem 5.5.1
seems altogether appropriate. The case for the attachment of Markov’s name appears to be much
weaker.
It is customary (both in the present setting and in general) to refer to a linear unbiased estimator that has minimum variance among all linear unbiased estimators as a BLUE (an acronym for best linear unbiased estimator or estimation). If y is an N \times 1 observable random vector that follows a G–M model, then (according to the Gauss–Markov theorem) the least squares estimator of an estimable linear combination \lambda'\beta of the elements of the parametric vector \beta is the unique BLUE of \lambda'\beta. Albert (1972, sec. 6.1), in a comment he characterized as jocular, suggested that the least squares estimator of an estimable linear combination could be referred to as a TRUE (an acronym for tiniest residual unbiased estimator). Accordingly, when the least squares estimator is a BLUE, it could be referred to as a TRUE-BLUE—someone who is unswervingly loyal or faithful is said to be true-blue.
b. A corollary

Suppose that y is an N \times 1 observable random vector that follows a G–M, Aitken, or general linear model, and take \lambda'\beta to be an estimable linear combination of the elements of the parametric vector \beta. Further, let \ell(y) = (X\tilde{r})'y, where \tilde{r} is any solution to the conjugate normal equations X'Xr = \lambda [so that \ell(y) is the least squares estimator of \lambda'\beta]. And let c + a'y represent an arbitrary linear translation-equivariant estimator of \lambda'\beta or, equivalently, any estimator of the form c + a'y that satisfies the condition a'X = \lambda'; and recall (from the introductory part of the present section) that the least squares estimator is a linear translation-equivariant estimator.
Clearly, E(a'y) = \lambda'\beta; that is, a'y is an unbiased estimator of \lambda'\beta. And, as a consequence, the MSE (mean squared error) of c + a'y is

E[(c + a'y - \lambda'\beta)^2] = c^2 + E[(a'y - \lambda'\beta)^2] + 2c\,E(a'y - \lambda'\beta)
                               = c^2 + var(a'y)
                               \ge var(a'y),

with equality holding if and only if c = 0 and hence if and only if c + a'y = a'y. Moreover, in the special case where y follows a G–M model, it follows from the Gauss–Markov theorem that

var(a'y) \ge var[\ell(y)],

with equality holding if and only if a'y = \ell(y), that is, if and only if a'y is the least squares estimator. Accordingly, in that special case, the MSE of the least squares estimator is uniformly smaller than the MSE of any other linear translation-equivariant estimator.
In summary, we have the following result, the main part of which can be regarded as a corollary of the Gauss–Markov theorem.
Corollary 5.5.2. Suppose that y is an N \times 1 observable random vector that follows a G–M, Aitken, or general linear model, and suppose that \lambda'\beta is an estimable linear combination of the elements of the parametric vector \beta. Then, the least squares estimator of \lambda'\beta is a linear translation-equivariant estimator. Moreover, in the special case where y follows a G–M model, the mean squared error of the least squares estimator is uniformly smaller than that of any other linear translation-equivariant estimator.
5.6 Simultaneous Estimation

Suppose that y is an N \times 1 observable random vector that follows a G–M, Aitken, or general linear model. And consider the estimation of estimable linear combinations of the elements of the parametric vector \beta. Specifically, suppose that we wish to estimate a finite number M of such linear combinations, say \lambda_1'\beta, \lambda_2'\beta, \ldots, \lambda_M'\beta (and perhaps some or all linear combinations of \lambda_1'\beta, \lambda_2'\beta, \ldots, \lambda_M'\beta). The Gauss–Markov theorem is relevant to the estimation of these linear combinations when the linear combinations are considered individually. However, that each of these linear combinations is to be estimated simultaneously with M - 1 or more other linear combinations is not reflected in the criterion employed in the Gauss–Markov theorem. In this section, we obtain some results that account explicitly for the simultaneous estimation of the various linear combinations.
Let \Lambda = (\lambda_1, \lambda_2, \ldots, \lambda_M), so that \Lambda'\beta is the M-dimensional column vector whose elements are the M linear combinations \lambda_1'\beta, \lambda_2'\beta, \ldots, \lambda_M'\beta—when all M linear combinations \lambda_1'\beta, \lambda_2'\beta, \ldots, \lambda_M'\beta are estimable (as is being assumed), the vector \Lambda'\beta is said to be estimable. By definition, the least squares estimator of \Lambda'\beta is the M \times 1 vector \ell(y) = [\ell_1(y), \ell_2(y), \ldots, \ell_M(y)]', where \ell_1(y), \ell_2(y), \ldots, \ell_M(y) are the least squares estimators of \lambda_1'\beta, \lambda_2'\beta, \ldots, \lambda_M'\beta, respectively. In light of result (4.24), the least squares estimator is expressible as

\ell(y) = \Lambda'(X'X)^- X'y.    (6.1)

And, in light of result (4.35), it is also expressible as

\ell(y) = \tilde{R}'X'y,    (6.2)

where \tilde{R} is any solution to the linear system X'XR = \Lambda (in the P \times M matrix R).
We have that

E[\ell(y)] = \Lambda'\beta,    (6.3)

as is evident upon taking the expected value of expression (6.2) or, alternatively, upon using result (4.36) to establish that each element of \Lambda'\beta equals the corresponding element of E[\ell(y)].
The least squares estimators \ell_1(y), \ell_2(y), \ldots, \ell_M(y) of \lambda_1'\beta, \lambda_2'\beta, \ldots, \lambda_M'\beta have the following basic property: the least squares estimator of \sum_{j=1}^{M} k_j\lambda_j'\beta (where k_1, k_2, \ldots, k_M are arbitrary constants) is \sum_{j=1}^{M} k_j\ell_j(y)—recall (from Section 5.3) that linear combinations of estimable functions are estimable. Upon letting k = (k_1, k_2, \ldots, k_M)', this property can be restated (in matrix notation)
as follows: the least squares estimator of k'\Lambda'\beta [= (\Lambda k)'\beta] is k'\ell(y). In light of results (4.24) and (6.1), this property can be readily verified by observing that the least squares estimator of (\Lambda k)'\beta is

(\Lambda k)'(X'X)^- X'y = k'\Lambda'(X'X)^- X'y = k'\ell(y).

Alternatively, in light of results (4.35) and (6.2), it can be verified by observing that (for any solution \tilde{R} to X'XR = \Lambda) \tilde{R}k is a solution to the linear system X'Xr = \Lambda k and hence that the least squares estimator of (\Lambda k)'\beta is

(\tilde{R}k)'X'y = k'\tilde{R}'X'y = k'\ell(y).

Under the general linear model, the variance-covariance matrix of the least squares estimator of \Lambda'\beta is expressible as

var[\ell(y)] = \tilde{R}'X'V(\theta)X\tilde{R} = \Lambda'(X'X)^- X'V(\theta)X(X'X)^- \Lambda    (6.4)

(where \tilde{R} is any solution to X'XR = \Lambda). Result (6.4) can be deduced from result (4.39): start with the expressions for cov[\ell_i(y), \ell_j(y)] obtained by applying result (4.39), and then observe that these expressions are essentially the same as the ij-th elements of the expressions for var[\ell(y)] given by result (6.4) (i, j = 1, 2, \ldots, M). In the special case of the Aitken model, result (6.4) "simplifies" to

var[\ell(y)] = \sigma^2\tilde{R}'X'HX\tilde{R} = \sigma^2\Lambda'(X'X)^- X'HX(X'X)^- \Lambda.    (6.5)

And in the further special case of the G–M model, we find that

var[\ell(y)] = \sigma^2\tilde{R}'X'X\tilde{R} = \sigma^2\tilde{R}'\Lambda = \sigma^2\Lambda'\tilde{R}    (6.6)
             = \sigma^2\tilde{R}'X'X(X'X)^- X'X\tilde{R} = \sigma^2\Lambda'(X'X)^- \Lambda.    (6.7)
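Under the G–M model, formula (6.7) gives var[\ell(y)] directly from \Lambda, X, and \sigma^2. A quick numerical check under assumed values (numpy; X taken to have full column rank so that the ordinary inverse can serve as (X'X)^-; the second print is a Monte Carlo approximation, not an exact identity):

```python
import numpy as np

rng = np.random.default_rng(2)
N, P, M = 12, 3, 2
X = rng.standard_normal((N, P))
Lam = rng.standard_normal((P, M))          # Lambda: columns lambda_1, ..., lambda_M
sigma2 = 1.7

# var[l(y)] = sigma^2 Lambda' (X'X)^- Lambda   [result (6.7), G-M model]
XtX_inv = np.linalg.inv(X.T @ X)
V_formula = sigma2 * Lam.T @ XtX_inv @ Lam

# Monte Carlo check: l(y) = Lambda'(X'X)^- X'y with y = X beta + e, e ~ N(0, sigma2 I).
beta = rng.standard_normal(P)
reps = 200_000
Y = X @ beta + np.sqrt(sigma2) * rng.standard_normal((reps, N))
L_hat = (Lam.T @ XtX_inv @ X.T @ Y.T).T    # each row is l(y) for one simulated y
print(np.round(V_formula, 3))
print(np.round(np.cov(L_hat, rowvar=False), 3))   # close to V_formula
```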
a. Best linear unbiased estimation

Let us consider further the estimation of \Lambda'\beta (based on an N \times 1 observable random vector y that follows a G–M, Aitken, or general linear model). An estimator of \Lambda'\beta is said to be a linear estimator if each of its elements is a linear estimator (of the corresponding element of \Lambda'\beta) or, equivalently, if it is expressible in the form c + A'y (where c is an M \times 1 vector of constants and A an N \times M matrix of constants). An estimator t(y) of \Lambda'\beta is said to be unbiased if each of its elements is an unbiased estimator of the corresponding element of \Lambda'\beta or, equivalently, if E[t(y)] = \Lambda'\beta. It follows from the results of Section 5.1 that for a linear estimator c + A'y of \Lambda'\beta to be an unbiased estimator, it is necessary and sufficient that

c = 0 and A'X = \Lambda'.    (6.8)

Moreover, it follows from what was established earlier (e.g., in Section 5.5) that the least squares estimator of \Lambda'\beta is linear and unbiased.
How does the variance-covariance matrix of the least squares estimator of \Lambda'\beta compare with the variance-covariance matrix of other linear unbiased estimators of \Lambda'\beta? The Gauss–Markov theorem implies that in the special case where y follows a G–M model, at least some of the diagonal elements of the variance-covariance matrix of the least squares estimator are (strictly) less than (and the others are equal to) the corresponding diagonal elements of the variance-covariance matrix of any other linear unbiased estimator. The following theorem makes a stronger statement.
Theorem 5.6.1. Suppose that y is an N \times 1 observable random vector that follows a G–M, Aitken, or general linear model, take \Lambda'\beta to be any M \times 1 vector of estimable linear combinations of the elements of the parametric vector \beta, denote by \ell(y) the least squares estimator of \Lambda'\beta, and let c + A'y represent an arbitrary linear unbiased estimator of \Lambda'\beta (or, equivalently, any estimator of the form c + A'y such that c = 0 and A'X = \Lambda'). Then, (the least squares estimator) \ell(y) is a linear unbiased estimator. Moreover, in the special case where y follows a G–M model, var(c + A'y) - var[\ell(y)] is a nonnegative definite matrix, and var(c + A'y) - var[\ell(y)] = 0 or, equivalently, var(c + A'y) = var[\ell(y)] if and only if c + A'y = \ell(y).
Proof. That \ell(y) is a linear unbiased estimator of \Lambda'\beta follows from what was established in Section 5.5 (as was noted previously). Now, suppose that y follows a G–M model, and consider the quadratic form

k'\{var(c + A'y) - var[\ell(y)]\}k    (6.9)

(in an M-dimensional column vector k). Clearly, the quadratic form (6.9) is reexpressible as

k'\{var(c + A'y) - var[\ell(y)]\}k = var[k'c + (Ak)'y] - var[k'\ell(y)].    (6.10)

Moreover, k'c + (Ak)'y is a linear unbiased estimator of (\Lambda k)'\beta, the unbiasedness of which can be verified simply by observing that E[k'c + (Ak)'y] = k'E(c + A'y) = k'\Lambda'\beta = (\Lambda k)'\beta or, alternatively, by observing [in light of the sufficiency of condition (1.4)] that k'c = k'0 = 0 and that (Ak)'X = k'A'X = k'\Lambda' = (\Lambda k)'. And as discussed in the introductory part of the present section, k'\ell(y) is the least squares estimator of (\Lambda k)'\beta. Thus, it follows from the Gauss–Markov theorem that var[k'\ell(y)] \le var[k'c + (Ak)'y] or, equivalently, that

var[k'c + (Ak)'y] - var[k'\ell(y)] \ge 0.    (6.11)

Together, results (6.10) and (6.11) imply that the quadratic form k'\{var(c + A'y) - var[\ell(y)]\}k is nonnegative definite and hence that the matrix var(c + A'y) - var[\ell(y)] is nonnegative definite.
As a further implication of the Gauss–Markov theorem, we have that var[k'c + (Ak)'y] = var[k'\ell(y)] or, equivalently, that equality holds in inequality (6.11) if and only if k'c + (Ak)'y = k'\ell(y), leading [in light of equality (6.10) and Corollary 2.13.4] to the conclusion that var(c + A'y) - var[\ell(y)] = 0 if and only if k'c + (Ak)'y = k'\ell(y) for every k and hence if and only if c + A'y = \ell(y). Q.E.D.
Suppose (in connection with Theorem 5.6.1) that y follows a G–M model, in which case var(c + A'y) - var[\ell(y)] is nonnegative definite. Then, var(c + A'y) - var[\ell(y)] = R'R for some matrix R (as is evident from Corollary 2.13.25). And upon recalling Lemma 2.3.2 and observing that R'R equals 0 if and only if all M of its diagonal elements equal 0 and upon letting (for j = 1, 2, \ldots, M) c_j represent the jth element of c, a_j the jth column of A, and \ell_j(y) the jth element of \ell(y), it follows that

var(c + A'y) - var[\ell(y)] = 0
  \Leftrightarrow var(c_j + a_j'y) - var[\ell_j(y)] = 0 (j = 1, 2, \ldots, M)
  \Leftrightarrow tr\{var(c + A'y) - var[\ell(y)]\} = 0

or, equivalently, that

var(c + A'y) = var[\ell(y)]
  \Leftrightarrow var(c_j + a_j'y) = var[\ell_j(y)] (j = 1, 2, \ldots, M)
  \Leftrightarrow tr[var(c + A'y)] = tr\{var[\ell(y)]\}.

Because the diagonal elements of a nonnegative definite matrix are inherently nonnegative (as evidenced by Corollary 2.13.14), the following result can be regarded as a corollary of Theorem 5.6.1.
Corollary 5.6.2. Suppose that y is an N \times 1 observable random vector that follows a G–M model, take \Lambda'\beta to be any M \times 1 vector of estimable linear combinations of the elements of the parametric vector \beta, denote by \ell(y) the least squares estimator of \Lambda'\beta, and let c + A'y represent an arbitrary linear unbiased estimator of \Lambda'\beta. Then,

tr\{var[\ell(y)]\} \le tr[var(c + A'y)],

with equality holding if and only if c + A'y = \ell(y).

Alternatively, Corollary 5.6.2 can be established as an almost immediate consequence of the
Gauss–Markov theorem. A more substantial implication of Theorem 5.6.1 is provided by the fol-
lowing corollary.
Corollary 5.6.3. Suppose that y is an N  1 observable random vector that follows a G–M
model, take ƒ0ˇ to be any M  1 vector of estimable linear combinations of the elements of the
parametric vector ˇ, denote by `.y/ the least squares estimator of ƒ0ˇ, and let c C A0 y represent an
arbitrary linear unbiased estimator of ƒ0ˇ. Then,
detfvarŒ`.y/g  detŒvar.c C A0 y/;
with equality holding if and only if rank.A/ < M (in which case both sides of the inequality equal
0) or c C A0 y D `.y/.
Corollary 5.6.3 can be derived from Theorem 5.6.1 by applying the following result on deter-
minants: for any M  M symmetric nonnegative definite matrix B and for any M  M symmetric
matrix C such that C B is nonnegative definite, jCj  jBj, with equality holding if and only if
C is singular or C D B—for a proof of this result, refer, e.g., to Harville (1997, corollary 18.1.8).
Specifically, the application is that obtained by setting B D varŒ`.y/ and C D var.c C A0 y/.
The determinant of the variance-covariance matrix of a vector-valued estimator is sometimes
referred to as the generalized variance of the estimator. Accordingly, the result of Corollary 5.6.3
implies that (under the G–M model) the least squares estimator of the M -dimensional vector ƒ0ˇ
is a best linear unbiased estimator in the sense that its generalized variance is less than or equal to
that of any other linear unbiased estimator—if rank.ƒ/ D M , the generalized variance of the least
squares estimator is (strictly) less than that of any other linear unbiased estimator.

b. Best linear translation-equivariant estimation

Let us continue to consider the estimation of \Lambda'\beta (based on an N \times 1 observable random vector y that follows a G–M, Aitken, or general linear model). An estimator, say t(y), of \Lambda'\beta is said to be translation equivariant if the elements of t(y) are translation-equivariant estimators of the corresponding elements of \Lambda'\beta. Thus, in light of the discussion of Section 5.2, a necessary and sufficient condition for t(y) to be a translation-equivariant estimator of \Lambda'\beta is that

t(y) + \Lambda'k = t(y + Xk)    (6.12)

for every P \times 1 vector k (and for every value of y). Further, for a linear estimator c + A'y to be a translation-equivariant estimator of \Lambda'\beta, it is necessary and sufficient that

A'X = \Lambda'.    (6.13)

Moreover, it follows from what was established earlier (in Section 5.5) that the least squares estimator of \Lambda'\beta is translation equivariant.
As what can be regarded as an additional corollary of Theorem 5.6.1, we have the following result.
Corollary 5.6.4. Suppose that y is an N \times 1 observable random vector that follows a G–M, Aitken, or general linear model, take \Lambda'\beta to be any M \times 1 vector of estimable linear combinations of the elements of the parametric vector \beta, denote by \ell(y) the least squares estimator of \Lambda'\beta, and let c + A'y represent an arbitrary linear translation-equivariant estimator of \Lambda'\beta (or, equivalently, any estimator of the form c + A'y such that A'X = \Lambda'). Then, (the least squares estimator) \ell(y) is a linear translation-equivariant estimator. Moreover, in the special case where y follows a G–M model, the difference

E[(c + A'y - \Lambda'\beta)(c + A'y - \Lambda'\beta)'] - E\{[\ell(y) - \Lambda'\beta][\ell(y) - \Lambda'\beta]'\}    (6.14)

between the mean-squared-error matrices of c + A'y and \ell(y) is a nonnegative definite matrix, and this difference equals 0 if and only if c + A'y = \ell(y).
The nonnegative definiteness of the matrix (6.14) and the condition [c + A'y = \ell(y)] under which it equals 0 follow from Theorem 5.6.1 in much the same way that the main part of Corollary 5.5.2 follows from the Gauss–Markov theorem.
5.7 Estimation of Variability and Covariability

Suppose that y = (y_1, y_2, \ldots, y_N)' is an N \times 1 observable random vector that follows a G–M, Aitken, or general linear model. Then, the N diagonal elements var(y_1), var(y_2), \ldots, var(y_N) of the matrix var(y) represent the underlying variability, and the N(N - 1) off-diagonal elements cov(y_i, y_j) (j \ne i = 1, 2, \ldots, N) represent the underlying covariability. In the case of the G–M model, var(y) = \sigma^2 I, so that y_1, y_2, \ldots, y_N are uncorrelated with a common (strictly) positive variance \sigma^2 (of unknown value). In the case of the Aitken model, var(y) = \sigma^2 H (where H is a known symmetric nonnegative definite matrix), so that the variances and covariances of y_1, y_2, \ldots, y_N are known up to the (unknown) value of a (strictly) positive scalar multiple \sigma^2. And, in the case of the general linear model, var(y) = V(\theta) [where V(\theta) is a symmetric nonnegative definite matrix whose elements are known functions of a T \times 1 parameter vector \theta], so that the variances and covariances of y_1, y_2, \ldots, y_N are known up to the (unknown) value of \theta.
The matrix var(y) is of interest because its value determines the variances and covariances of the least squares estimators of estimable linear combinations of the elements of the parametric vector \beta. Moreover, the underlying variability and covariability [represented by the diagonal and off-diagonal elements of var(y)] may be of interest in their own right.
In the present section, some initial results are obtained on the estimation of variability and covariability. The emphasis is on results that are specific to the estimation of \sigma^2 under the G–M model. As a preliminary, formulas are derived for the expected values and variances and covariances of quadratic forms (in a random vector) and for the covariance of a quadratic form and a linear form. And, prior to that, some matrix operations that enter into various of those formulas are introduced and briefly discussed.
a. Some matrix operations

Vec of a matrix. Let A represent an M \times N matrix. It is sometimes convenient to rearrange the elements of A in the form of an MN-dimensional column vector. The conventional way of doing so is to successively place the first, second, \ldots, Nth columns a_1, a_2, \ldots, a_N of A one under the other, giving the column vector

\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_N \end{pmatrix}.    (7.1)

The vector (7.1) is referred to as the vec of A, and is denoted by the symbol vec(A) or (when the parentheses are not needed for clarity) by vec A. By definition, the ij-th element of A is the [(j - 1)M + i]-th element of vec A.
Vech of a symmetric matrix. Let A = \{a_{ij}\} represent an N \times N symmetric matrix. The values of all N^2 elements of A can be determined from the values of those N(N + 1)/2 elements that are on or below the diagonal [or, alternatively, from those N(N + 1)/2 elements that are on or above the diagonal]. Accordingly, in rearranging the elements of A in the form of a vector (as in forming the vec), we may wish to exclude the N(N - 1)/2 "duplicate" elements. Thus, as an alternative to the vec of A, we may wish to consider the N(N + 1)/2-dimensional column vector

\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_N \end{pmatrix},    (7.2)

where (for i = 1, 2, \ldots, N) a_i = (a_{ii}, a_{i+1,i}, \ldots, a_{Ni})' is the subvector of the ith column of A obtained by striking out its first i - 1 elements. The vector (7.2) is referred to as the vech of A and is denoted by the symbol vech(A) or vech A. For N = 1, N = 2, and N = 3,

vech A = (a_{11}),   vech A = \begin{pmatrix} a_{11} \\ a_{21} \\ a_{22} \end{pmatrix},   and   vech A = \begin{pmatrix} a_{11} \\ a_{21} \\ a_{31} \\ a_{22} \\ a_{32} \\ a_{33} \end{pmatrix},   respectively.
Every element of A, and hence every element of vec A, is either an element of vech A or a "duplicate" of an element of vech A. Thus, there exists a unique [N^2 \times N(N + 1)/2]-dimensional matrix, to be denoted by the symbol G_N, such that (for every N \times N symmetric matrix A)

vec A = G_N vech A.

This matrix is called the duplication matrix. Clearly,

G_1 = (1),   G_2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix},   and   G_3 = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}.

Note that

rank G_N = N(N + 1)/2    (7.3)

(so that G_N is of full column rank), as is evident upon observing that every row of I_{N(N+1)/2} is a row of G_N and hence that G_N contains N(N + 1)/2 linearly independent rows.
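A small sketch (numpy assumed; the helper names vech and duplication_matrix are introduced here for illustration) that builds G_N from its defining property and checks vec A = G_N vech A for a symmetric A:

```python
import numpy as np

def vech(A):
    """Stack the on-and-below-diagonal elements of A, column by column."""
    N = A.shape[0]
    return np.concatenate([A[i:, i] for i in range(N)])

def duplication_matrix(N):
    """G_N, the unique matrix with vec(A) = G_N @ vech(A) for every symmetric N x N A."""
    G = np.zeros((N * N, N * (N + 1) // 2))
    col = 0
    for j in range(N):
        for i in range(j, N):
            # the vech entry for position (i, j) duplicates to vec positions (i, j) and (j, i)
            G[j * N + i, col] = 1.0
            G[i * N + j, col] = 1.0
            col += 1
    return G

A = np.array([[4.0, 1.0, 2.0], [1.0, 5.0, 3.0], [2.0, 3.0, 6.0]])
G3 = duplication_matrix(3)
print(np.allclose(A.flatten(order='F'), G3 @ vech(A)))   # vec A (column-stacked) = G_N vech A
print(np.linalg.matrix_rank(G3))                         # N(N+1)/2 = 6: full column rank
```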
Kronecker product. The Kronecker product of two matrices, say an M \times N matrix A = \{a_{ij}\} and a P \times Q matrix B = \{b_{ij}\}, is denoted by the symbol A \otimes B and is defined to be the MP \times NQ matrix

A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \ldots & a_{1N}B \\ a_{21}B & a_{22}B & \ldots & a_{2N}B \\ \vdots & \vdots & & \vdots \\ a_{M1}B & a_{M2}B & \ldots & a_{MN}B \end{pmatrix}

obtained by replacing (for i = 1, 2, \ldots, M and j = 1, 2, \ldots, N) the ij-th element of A with the P \times Q matrix a_{ij}B. Thus, the Kronecker product of A and B can be regarded as a partitioned matrix, comprising M rows and N columns of (P \times Q)-dimensional blocks, the ij-th of which is a_{ij}B.
Among the various properties of the Kronecker product operation is the following: for any matrices A and B,

(A \otimes B)' = A' \otimes B'    (7.4)

—for a verification of equality (7.4), refer, e.g., to Harville (1997, sec. 16.1).
Two formulas. There are two formulas that will be convenient to have at our disposal; one of these is for the vec of a product of three matrices and the other is for the trace of the product of four matrices. The two formulas are as follows. For any M \times N matrix A, N \times P matrix B, and P \times Q matrix C,

vec ABC = (C' \otimes A) vec B.    (7.5)

And for any M \times N matrix A, M \times P matrix B, P \times Q matrix C, and N \times Q matrix D,

tr(A'BCD') = (vec A)'(D \otimes B) vec C.    (7.6)

For a derivation of formulas (7.5) and (7.6), refer, for example, to Harville (1997, sec. 16.2).
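Both identities are easy to spot-check numerically on arbitrary conformable matrices (numpy assumed; vec is taken to be column stacking, as in (7.1)):

```python
import numpy as np

vec = lambda M: M.flatten(order='F')   # stack the columns, as in (7.1)
rng = np.random.default_rng(3)

# (7.5): vec(ABC) = (C' kron A) vec(B), with A: M x N, B: N x P, C: P x Q
A, B, C = rng.standard_normal((2, 3)), rng.standard_normal((3, 4)), rng.standard_normal((4, 5))
print(np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B)))

# (7.6): tr(A'BCD') = (vec A)'(D kron B) vec(C), with A: M x N, B: M x P, C: P x Q, D: N x Q
A, B = rng.standard_normal((2, 3)), rng.standard_normal((2, 4))
C, D = rng.standard_normal((4, 5)), rng.standard_normal((3, 5))
print(np.allclose(np.trace(A.T @ B @ C @ D.T), vec(A) @ np.kron(D, B) @ vec(C)))
```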
b. Expected values and variances of quadratic forms (and their covariances with each other and with linear forms)

Suppose that x is an N-dimensional random column vector. Then, it is customary to refer to a linear combination, say a'x, of the elements of x (where a is an N \times 1 vector of constants) as a linear form (in x).
Formulas for the expected values and the variances and covariances of linear forms are available from the results of Sections 3.1 and 3.2. If the random vector x has a mean vector \mu, then the expected value of a linear form a'x (in x) is expressible as

E(a'x) = a'\mu.    (7.7)

And if, in addition, x has a variance-covariance matrix \Sigma, then the variance of a'x is expressible as

var(a'x) = a'\Sigma a,    (7.8)

and, more generally, the covariance of a'x and b'x (where b'x is a second linear form in x) is expressible as

cov(a'x, b'x) = a'\Sigma b.    (7.9)

In what follows, these results are extended by obtaining formulas for the expected values and variances of quadratic forms (in a random column vector) and formulas for the covariances of the quadratic forms with each other and with linear forms.
Main results. The main results are presented in a series of three theorems.
Theorem 5.7.1. Let x represent an N-dimensional random column vector having mean vector \mu = \{\mu_i\} and variance-covariance matrix \Sigma = \{\sigma_{ij}\}, and take A = \{a_{ij}\} to be an N \times N matrix of constants. Then,

E(x'Ax) = \sum_{i,j} a_{ij}(\sigma_{ij} + \mu_i\mu_j)    (7.10)
        = tr(A\Sigma) + \mu'A\mu.    (7.11)

Proof. Letting x_i represent the ith element of x (i = 1, 2, \ldots, N), we find that

E(x'Ax) = E\Bigl(\sum_{i,j} a_{ij}x_i x_j\Bigr) = \sum_{i,j} a_{ij}E(x_i x_j)
        = \sum_{i,j} a_{ij}(\sigma_{ij} + \mu_i\mu_j)
        = \sum_i\Bigl(\sum_j a_{ij}\sigma_{ji}\Bigr) + \sum_{i,j} a_{ij}\mu_i\mu_j
        = tr(A\Sigma) + \mu'A\mu.
Q.E.D.
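A quick Monte Carlo check of result (7.11) under assumed values (numpy; the simulated average agrees with the formula only up to Monte Carlo error):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 3
mu = np.array([1.0, -2.0, 0.5])
L = rng.standard_normal((N, N))
Sigma = L @ L.T                          # an arbitrary positive definite Sigma
A = rng.standard_normal((N, N))

# E(x'Ax) = tr(A Sigma) + mu'A mu   [result (7.11)]
expected = np.trace(A @ Sigma) + mu @ A @ mu

x = rng.multivariate_normal(mu, Sigma, size=500_000)
simulated = np.mean(np.einsum('ni,ij,nj->n', x, A, x))   # average of x'Ax over the draws
print(expected, simulated)
```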
Theorem 5.7.2. Let x represent an N-dimensional random column vector having mean vector \mu = \{\mu_i\}, variance-covariance matrix \Sigma = \{\sigma_{ij}\}, and third central moments \gamma_{ijk} = E[(x_i - \mu_i)(x_j - \mu_j)(x_k - \mu_k)] (i, j, k = 1, 2, \ldots, N), and take b = \{b_i\} to be an N-dimensional column vector of constants and A = \{a_{ij}\} to be an N \times N symmetric matrix of constants. Then,

cov(b'x, x'Ax) = \sum_{i,j,k} b_i a_{jk}(\gamma_{ijk} + 2\mu_j\sigma_{ik})    (7.12)
               = b'\Lambda vec A + 2b'\Sigma A\mu,    (7.13)

where \Lambda is an N \times N^2 matrix whose entry for the ith row and jk-th column [i.e., column (j - 1)N + k] is \gamma_{ijk}.
Proof. Letting z = \{z_i\} = x - \mu (in which case x = z + \mu) and using Theorem 5.7.1 [and observing that z'A\mu = (z'A\mu)' = \mu'Az and that b'z = z'b], we find that

cov(b'x, x'Ax) = cov(b'z, x'Ax) = E[(b'z)x'Ax]
  = E[(b'z)(z'Az + 2\mu'Az + \mu'A\mu)]
  = E\Bigl[\sum_i b_i z_i \sum_{j,k} a_{jk}z_j z_k\Bigr] + 2E[z'b\,\mu'Az] + 0
  = E\Bigl(\sum_{i,j,k} b_i a_{jk}z_i z_j z_k\Bigr) + 2\sum_{i,k} b_i\Bigl(\sum_j \mu_j a_{jk}\Bigr)\sigma_{ik}
  = \sum_{i,j,k} b_i a_{jk}(\gamma_{ijk} + 2\mu_j\sigma_{ik})
  = \sum_{j,k}\Bigl(\sum_i b_i\gamma_{ijk}\Bigr)a_{kj} + 2\sum_k\sum_i b_i\sigma_{ik}\Bigl(\sum_j a_{kj}\mu_j\Bigr)
  = b'\Lambda vec A + 2b'\Sigma A\mu.
Q.E.D.
If the distribution of x is MVN, then \Lambda = 0 (as is evident from the results of Section 3.5n). More generally, if the distribution of x is symmetric [in the sense that -(x - \mu) \sim x - \mu], then \Lambda = 0 [as is evident upon observing that if the distribution of x is symmetric, then (for all i, j, and k) \gamma_{ijk} = -\gamma_{ijk}]. When \Lambda = 0, formula (7.13) simplifies to

cov(b'x, x'Ax) = 2b'\Sigma A\mu.    (7.14)

Thus, if the distribution of x is symmetric and its mean vector is null, then any linear form in x is uncorrelated with any quadratic form.
Theorem 5.7.3. Let x represent an N-dimensional random column vector having mean vector \mu = \{\mu_i\}, variance-covariance matrix \Sigma = \{\sigma_{ij}\}, third central moments \gamma_{ijk} = E[(x_i - \mu_i)(x_j - \mu_j)(x_k - \mu_k)] (i, j, k = 1, 2, \ldots, N), and fourth central moments \delta_{ijkm} = E[(x_i - \mu_i)(x_j - \mu_j)(x_k - \mu_k)(x_m - \mu_m)] (i, j, k, m = 1, 2, \ldots, N); and take A = \{a_{ij}\} and H = \{h_{ij}\} to be N \times N symmetric matrices of constants. Then,

cov(x'Ax, x'Hx)
  = \sum_{i,j,k,m} a_{ij}h_{km}[(\delta_{ijkm} - \sigma_{ij}\sigma_{km} - \sigma_{ik}\sigma_{jm} - \sigma_{im}\sigma_{jk})
      + 2\mu_k\gamma_{ijm} + 2\mu_i\gamma_{jkm} + 2\sigma_{ik}\sigma_{jm} + 4\mu_j\mu_k\sigma_{im}]    (7.15)
  = (vec A)'\Gamma vec H + 2\mu'H\Lambda vec A + 2\mu'A\Lambda vec H
      + 2 tr(A\Sigma H\Sigma) + 4\mu'A\Sigma H\mu,    (7.16)

where \Gamma is an N^2 \times N^2 matrix whose entry for the ij-th row [row (i - 1)N + j] and km-th column [column (k - 1)N + m] is \delta_{ijkm} - \sigma_{ij}\sigma_{km} - \sigma_{ik}\sigma_{jm} - \sigma_{im}\sigma_{jk} and where \Lambda is an N \times N^2 matrix whose entry for the jth row and km-th column [column (k - 1)N + m] is \gamma_{jkm}.
Proof. Letting z = \{z_i\} = x - \mu (in which case x = z + \mu) and observing that z'A\mu = (z'A\mu)' = \mu'Az and, similarly, that z'H\mu = \mu'Hz, we find [upon expanding (z + \mu)'A(z + \mu)\,(z + \mu)'H(z + \mu), taking expected values, and cancelling the terms common to both products] that

cov(x'Ax, x'Hx) = E[(x'Ax)(x'Hx)] - E(x'Ax)E(x'Hx)
  = E[(z'Az)(z'Hz)] - E(z'Az)E(z'Hz) + 2E[(z'Az)(\mu'Hz)] + 2E[(\mu'Az)(z'Hz)] + 4E[(z'A\mu)(\mu'Hz)]
  = E[(z'Az)(z'Hz)] - E(z'Az)E(z'Hz) + 2 cov(\mu'Hz, z'Az) + 2 cov(\mu'Az, z'Hz) + 4E(z'A\mu\mu'Hz).

Now, E(z) = 0, var(z) = \Sigma, and the third and fourth central moments of z are those of x. Thus, by Theorem 5.7.2,

2 cov(\mu'Hz, z'Az) = 2\mu'H\Lambda vec A   and   2 cov(\mu'Az, z'Hz) = 2\mu'A\Lambda vec H,

and, by Theorem 5.7.1,

4E(z'A\mu\mu'Hz) = 4 tr(A\mu\mu'H\Sigma) = 4\mu'A\Sigma H\mu.

Moreover,

E[(z'Az)(z'Hz)] - E(z'Az)E(z'Hz) = \sum_{i,j,k,m} a_{ij}h_{km}\delta_{ijkm} - \sum_{i,j} a_{ij}\sigma_{ij}\sum_{k,m} h_{km}\sigma_{km}
  = \sum_{i,j,k,m} a_{ij}h_{km}(\delta_{ijkm} - \sigma_{ij}\sigma_{km}),

and [since, in light of the symmetry of A, H, and \Sigma, \sum_{i,j,k,m} a_{ij}h_{km}\sigma_{ik}\sigma_{jm} = \sum_{i,j,k,m} a_{ij}h_{km}\sigma_{im}\sigma_{jk} = tr(A\Sigma H\Sigma)]

\sum_{i,j,k,m} a_{ij}h_{km}(\delta_{ijkm} - \sigma_{ij}\sigma_{km})
  = \sum_{i,j,k,m} a_{ij}h_{km}(\delta_{ijkm} - \sigma_{ij}\sigma_{km} - \sigma_{ik}\sigma_{jm} - \sigma_{im}\sigma_{jk}) + 2 tr(A\Sigma H\Sigma)
  = (vec A)'\Gamma vec H + 2 tr(A\Sigma H\Sigma).

Combining these several results (and, for the element-wise version, reexpressing each term element by element) yields equalities (7.15) and (7.16). Q.E.D.
As a special case of formula (7.16) (that where H = A), we have that

var(x'Ax) = (vec A)'\Gamma vec A + 4\mu'A\Lambda vec A + 2 tr[(A\Sigma)^2] + 4\mu'A\Sigma A\mu.    (7.17)

If the distribution of x is symmetric, then formula (7.16) simplifies to

cov(x'Ax, x'Hx) = (vec A)'\Gamma vec H + 2 tr(A\Sigma H\Sigma) + 4\mu'A\Sigma H\mu.    (7.18)

If the distribution of x is MVN, then it is symmetric and, in addition, it is such that \Gamma = 0 (as is evident from the results of Section 3.5n), in which case there is a further simplification to

cov(x'Ax, x'Hx) = 2 tr(A\Sigma H\Sigma) + 4\mu'A\Sigma H\mu    (7.19)

or, in the special case where H = A, to

var(x'Ax) = 2 tr[(A\Sigma)^2] + 4\mu'A\Sigma A\mu.    (7.20)

The formulas of Theorems 5.7.2 and 5.7.3 were derived under the assumption that the matrix A of the quadratic form x'Ax is symmetric and (in the case of Theorem 5.7.3) the assumption that the matrix H of the quadratic form x'Hx is symmetric. Note that whether or not A and/or H are symmetric, it would be the case that x'Ax = \frac{1}{2}x'(A + A')x and that x'Hx = \frac{1}{2}x'(H + H')x. Thus, the formulas of Theorems 5.7.2 and 5.7.3 could be extended to the case where the matrices of the quadratic forms are possibly nonsymmetric simply by substituting \frac{1}{2}(A + A') for A and (in the case of Theorem 5.7.3) \frac{1}{2}(H + H') for H.
Some alternative representations. By making use of the vec and vech operations, the expressions provided by the formulas of Theorems 5.7.1, 5.7.2, and 5.7.3 can be recast in ways that are informative about the nature of the dependence of the expressions on the elements of the matrices of the quadratic forms.
An alternative to the matrix expression (7.11) provided by Theorem 5.7.1 for the expected value of the quadratic form x'Ax is as follows:

E(x'Ax) = [vec(\Sigma) + (\mu \otimes \mu)]' vec A    (7.21)

[as can be readily verified from expression (7.10)]. Expression (7.21) is a linear form in vec A; that is, it is a linear combination of the elements of vec A (which are the elements of A). If A is symmetric, then expression (7.21) can be restated as follows:

E(x'Ax) = [vec(\Sigma) + (\mu \otimes \mu)]'G_N vech A    (7.22)

(where G_N is the duplication matrix). Expression (7.22) is a linear form in vech A, the elements of which are N(N + 1)/2 nonredundant elements of A—if A is symmetric, N(N - 1)/2 of its elements are redundant.
Now, consider the matrix expression (7.13) provided by Theorem 5.7.2 for the covariance of the linear form b'x and the quadratic form x'Ax. Making use of result (7.6), we find that the second term of expression (7.13) can be reexpressed as follows:

2b'\Sigma A\mu = 2 tr(b'\Sigma A\mu) = 2 tr[b'\Sigma A(\mu')'] = 2(vec b)'(\mu' \otimes \Sigma) vec A = 2b'(\mu' \otimes \Sigma) vec A.

Thus, formula (7.13) can be restated as follows:

cov(b'x, x'Ax) = b'[\Lambda + 2(\mu' \otimes \Sigma)] vec A
               = b'[\Lambda + 2(\mu' \otimes \Sigma)]G_N vech A.    (7.23)
Expression (7.23) is a bilinear form in the N-dimensional column vector b and the N(N + 1)/2-dimensional column vector vech A; that is, for any particular value of b, it is a linear form in vech A, and for any particular value of vech A, it is a linear form in b.
Further, consider the matrix expression (7.16) provided by Theorem 5.7.3 for the covariance of the two quadratic forms x'Ax and x'Hx. Making use of results (7.5) and (7.4), we find that the second and third terms of expression (7.16) can be reexpressed as follows:

2\mu'H\Lambda vec A = 2[\mu'H\Lambda vec A]'
                    = 2(vec A)'\Lambda'H\mu
                    = 2(vec A)' vec(\Lambda'H\mu)
                    = 2(vec A)'(\mu' \otimes \Lambda') vec H
                    = 2(vec A)'(\mu \otimes \Lambda)' vec H    (7.24)

and

2\mu'A\Lambda vec H = 2(\Lambda'A\mu)' vec H
                    = 2[vec(\Lambda'A\mu)]' vec H
                    = 2[(\mu' \otimes \Lambda') vec A]' vec H
                    = 2(vec A)'(\mu \otimes \Lambda) vec H.    (7.25)

Moreover, making use of result (7.6) (and Lemma 2.3.1), the fourth and fifth terms of expression (7.16) are reexpressible as

2 tr(A\Sigma H\Sigma) = 2 tr(A'\Sigma H\Sigma') = 2(vec A)'(\Sigma \otimes \Sigma) vec H    (7.26)

and

4\mu'A\Sigma H\mu = 4 tr(\mu'A\Sigma H\mu)
                  = 4 tr(A\Sigma H\mu\mu')
                  = 4 tr[A'\Sigma H(\mu\mu')']
                  = 4(vec A)'[(\mu\mu') \otimes \Sigma] vec H.    (7.27)

And based on results (7.24), (7.25), (7.26), and (7.27), formula (7.16) can be restated as follows:

cov(x'Ax, x'Hx) = (vec A)'\{\Gamma + 2(\mu \otimes \Lambda)' + 2(\mu \otimes \Lambda)
                    + 2(\Sigma \otimes \Sigma) + 4[(\mu\mu') \otimes \Sigma]\} vec H    (7.28)
                = (vech A)'G_N'\{\Gamma + 2(\mu \otimes \Lambda)' + 2(\mu \otimes \Lambda)
                    + 2(\Sigma \otimes \Sigma) + 4[(\mu\mu') \otimes \Sigma]\}G_N vech H.    (7.29)

In result (7.29), cov(x'Ax, x'Hx) is expressed as a bilinear form in the N(N + 1)/2-dimensional column vectors vech A and vech H. As a special case of result (7.29) (that where H = A), we have the result

var(x'Ax) = (vech A)'G_N'\{\Gamma + 2(\mu \otimes \Lambda)' + 2(\mu \otimes \Lambda)
              + 2(\Sigma \otimes \Sigma) + 4[(\mu\mu') \otimes \Sigma]\}G_N vech A,    (7.30)

in which var(x'Ax) is expressed as a quadratic form in the N(N + 1)/2-dimensional column vector vech A.
c. Estimation of \sigma^2 (under the G–M model)

Let us add to our earlier discussion of the method of least squares by introducing some notation, terminology, and results that are relevant to making inferences about variability and covariability.
Suppose that y = (y_1, y_2, \ldots, y_N)' is an N \times 1 observable random vector that follows a G–M, Aitken, or general linear model. And let

\tilde{e} = y - P_X y [= (I - P_X)y]

[where P_X = X(X'X)^- X'], so that (for i = 1, 2, \ldots, N) the ith element of \tilde{e} is the difference between y_i and the least squares estimator of E(y_i). It is customary to refer to the elements of \tilde{e} (or to their observed values) as least squares residuals, or simply as residuals, and to refer to \tilde{e} itself (or to its observed value) as the residual vector.
Upon applying formulas (3.1.7) and (3.2.47) and observing (in light of Theorem 2.12.2) that P_X X = X and that P_X is symmetric and idempotent, we find that

E(\tilde{e}) = (I - P_X)X\beta = (X - X)\beta = 0    (7.31)

and that, in the special case where y follows a G–M model,

var(\tilde{e}) = (I - P_X)(\sigma^2 I)(I - P_X)' = \sigma^2(I - P_X).    (7.32)

Corresponding to the vector \tilde{e} is the quantity \tilde{e}'\tilde{e}, which is customarily referred to as the residual sum of squares. It follows from the results of Section 5.4b (on least squares minimization) that, for every value of y,

\tilde{e}'\tilde{e} = \min_b (y - Xb)'(y - Xb).    (7.33)

Moreover,

\tilde{e}'\tilde{e} = y'(I - P_X)y,    (7.34)

as is evident from the results of Section 5.4b or upon observing (in light of the symmetry and idempotency of P_X) that (I - P_X)'(I - P_X) = I - P_X.
In what follows (i.e., in the remainder of Subsection c), it is supposed that y follows a G–M model, and the emphasis is on the estimation of the parameter \sigma^2.
An unbiased estimator. The expected value of the residual sum of squares can be derived by applying formula (7.11) (for the expected value of a quadratic form) to expression (7.34) (which is a quadratic form in the random vector $y$). Recalling that $P_X X = X$, we find that
$$E(\tilde{e}'\tilde{e}) = E[y'(I - P_X)y] = \mathrm{tr}[(I - P_X)(\sigma^2 I)] + (X\beta)'(I - P_X)X\beta = \sigma^2\,\mathrm{tr}(I - P_X) + 0 = \sigma^2[N - \mathrm{tr}(P_X)].$$
Moreover, because $P_X$ is idempotent and because $(X'X)^- X'$ is a generalized inverse of $X$—refer to Part (1) of Theorem 2.12.2—we have (in light of Corollary 2.8.3 and Lemma 2.10.13) that
$$\mathrm{tr}(P_X) = \mathrm{rank}(P_X) = \mathrm{rank}[X(X'X)^- X'] = \mathrm{rank}\,X. \tag{7.35}$$
Thus, the expected value of the residual sum of squares is
$$E(\tilde{e}'\tilde{e}) = \sigma^2(N - \mathrm{rank}\,X). \tag{7.36}$$
Assume that the rank of the model matrix $X$ is (strictly) less than $N$. Then, upon dividing the residual sum of squares by $N - \mathrm{rank}\,X$, we obtain the quantity
$$\hat{\sigma}^2 = \frac{\tilde{e}'\tilde{e}}{N - \mathrm{rank}\,X}.$$
Clearly,
$$E(\hat{\sigma}^2) = \sigma^2, \tag{7.37}$$
that is, the quantity $\hat{\sigma}^2$ obtained by dividing the residual sum of squares by $N - \mathrm{rank}\,X$ is an unbiased estimator of the parameter $\sigma^2$.
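As a numerical illustration (the book itself gives no code; the Python/NumPy sketch below, including the simulated rank-deficient model matrix and all variable names, is purely our own), $\hat{\sigma}^2$ can be computed directly from the residual vector $\tilde{e} = (I - P_X)y$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 50, 2.0

# A rank-deficient model matrix: the third column is the sum of the first two.
X = rng.normal(size=(N, 2))
X = np.column_stack([X[:, 0], X[:, 1], X[:, 0] + X[:, 1]])
beta = np.array([1.0, -1.0, 0.5])
y = X @ beta + rng.normal(scale=sigma, size=N)

# Projection matrix P_X = X (X'X)^- X'; pinv supplies a generalized inverse.
P_X = X @ np.linalg.pinv(X.T @ X) @ X.T
e_tilde = y - P_X @ y                 # residual vector (I - P_X) y
rss = e_tilde @ e_tilde               # residual sum of squares
r = np.linalg.matrix_rank(X)
sigma2_hat = rss / (N - r)            # unbiased estimator of sigma^2
print(r, sigma2_hat)
```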
Let us find the variance of the estimator $\hat{\sigma}^2$. Suppose that the fourth-order moments of the distribution of the vector $e = (e_1, e_2, \ldots, e_N)'$ of residual effects are such that (for $i, j, k, m = 1, 2, \ldots, N$)
$$E(e_i e_j e_k e_m) = \begin{cases} 3\sigma^4 & \text{if } m = k = j = i, \\[2pt] \sigma^4 & \text{if } j = i \text{ and } m = k \ne i, \text{ if } k = i \text{ and } m = j \ne i, \text{ or if } m = i \text{ and } k = j \ne i, \\[2pt] 0 & \text{otherwise} \end{cases} \tag{7.38}$$
(as would be the case if the distribution of $e$ were MVN). Further, let $\Lambda$ represent the $N \times N^2$ matrix whose entry for the $j$th row and $km$th column [column $(k-1)N + m$] is $E(e_j e_k e_m)$. Then, upon applying formula (7.17) and once again making use of the properties of the $P_X$ matrix (set forth in Theorem 2.12.2), we find that
$$\mathrm{var}(\tilde{e}'\tilde{e}) = \mathrm{var}[y'(I - P_X)y] = 4\beta'X'(I - P_X)\Lambda\,\mathrm{vec}(I - P_X) + 2\,\mathrm{tr}[(I - P_X)(\sigma^2 I)(I - P_X)(\sigma^2 I)] + 4\beta'X'(I - P_X)(\sigma^2 I)(I - P_X)X\beta$$
$$= 0 + 2\sigma^4\,\mathrm{tr}(I - P_X) + 0 = 2\sigma^4[N - \mathrm{tr}(P_X)].$$
And upon applying result (7.35), we conclude that
$$\mathrm{var}(\tilde{e}'\tilde{e}) = 2\sigma^4(N - \mathrm{rank}\,X). \tag{7.39}$$
Moreover, as a particular implication of result (7.39), we have that
$$\mathrm{var}(\hat{\sigma}^2) = \frac{\mathrm{var}(\tilde{e}'\tilde{e})}{(N - \mathrm{rank}\,X)^2} = \frac{2\sigma^4}{N - \mathrm{rank}\,X}. \tag{7.40}$$
The Hodges–Lehmann estimator. The estimator $\hat{\sigma}^2$ is of the general form
$$\frac{\tilde{e}'\tilde{e}}{k}, \tag{7.41}$$
where $k$ is a (strictly) positive constant. It is the estimator of the form (7.41) obtained by taking $k = N - \mathrm{rank}\,X$. Taking $k = N - \mathrm{rank}\,X$ achieves unbiasedness. Nevertheless, it can be of interest to consider other choices for $k$.
Let us derive the MSE (mean squared error) of the estimator (7.41). And, in doing so, let us continue to suppose that the fourth-order moments of the distribution of the vector $e = (e_1, e_2, \ldots, e_N)'$ of residual effects are such that (for $i, j, k, m = 1, 2, \ldots, N$) $E(e_i e_j e_k e_m)$ satisfies condition (7.38) (as would be the case if the distribution of $e$ were MVN).
The MSE of the estimator (7.41) can be regarded as a function, say $m(k)$, of the scalar $k$. Making use of results (7.36) and (7.39), we find that
$$m(k) = \mathrm{var}\!\Big(\frac{\tilde{e}'\tilde{e}}{k}\Big) + \Big[E\Big(\frac{\tilde{e}'\tilde{e}}{k}\Big) - \sigma^2\Big]^2 = \frac{1}{k^2}\big\{\mathrm{var}(\tilde{e}'\tilde{e}) + [E(\tilde{e}'\tilde{e}) - k\sigma^2]^2\big\}$$
$$= \frac{\sigma^4}{k^2}\big\{2(N - \mathrm{rank}\,X) + [N - \mathrm{rank}(X) - k]^2\big\} \tag{7.42}$$
$$= \sigma^4\Big\{\frac{(N - \mathrm{rank}\,X)[N - \mathrm{rank}(X) + 2 - 2k]}{k^2} + 1\Big\}. \tag{7.43}$$
For what choice of $k$ does $m(k)$ attain its minimum value? Upon differentiating $m(k)$ and engaging in some algebraic simplification, we find that
$$\frac{d\,m(k)}{dk} = -\,\frac{2\sigma^4(N - \mathrm{rank}\,X)[N - \mathrm{rank}(X) + 2 - k]}{k^3},$$
so that $d\,m(k)/dk < 0$ if $k < N - \mathrm{rank}(X) + 2$, $d\,m(k)/dk = 0$ if $k = N - \mathrm{rank}(X) + 2$, and $d\,m(k)/dk > 0$ if $k > N - \mathrm{rank}(X) + 2$. Thus, $m(k)$ is a decreasing function of $k$ over the interval $0 < k \le N - \mathrm{rank}(X) + 2$, is an increasing function over the interval $k \ge N - \mathrm{rank}(X) + 2$, and attains its minimum value at $k = N - \mathrm{rank}(X) + 2$.
We conclude that among estimators of $\sigma^2$ of the form (7.41), the estimator
$$\frac{\tilde{e}'\tilde{e}}{N - \mathrm{rank}(X) + 2} \tag{7.44}$$
has minimum MSE. The estimator (7.44) is sometimes referred to as the Hodges–Lehmann estimator. In light of results (7.36) and (7.42), it has a bias of
$$E\Big[\frac{\tilde{e}'\tilde{e}}{N - \mathrm{rank}(X) + 2}\Big] - \sigma^2 = \sigma^2\Big[\frac{N - \mathrm{rank}\,X}{N - \mathrm{rank}(X) + 2} - 1\Big] = \frac{-2\sigma^2}{N - \mathrm{rank}(X) + 2} \tag{7.45}$$
and an MSE of
$$\frac{\sigma^4[2(N - \mathrm{rank}\,X) + (-2)^2]}{[N - \mathrm{rank}(X) + 2]^2} = \frac{2\sigma^4}{N - \mathrm{rank}(X) + 2}. \tag{7.46}$$
By way of comparison, the unbiased estimator $\hat{\sigma}^2$ (obtained by taking $k = N - \mathrm{rank}\,X$) has an MSE of $2\sigma^4/(N - \mathrm{rank}\,X)$.
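The MSE comparison is easy to check by simulation. The sketch below (illustrative only; the normal design, sample size, and variable names are our own choices, not the book's) estimates the MSEs of the two divisors empirically and compares them with the formulas above:

```python
import numpy as np

rng = np.random.default_rng(1)
N, reps, sigma2 = 30, 20000, 4.0
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta = np.array([1.0, 2.0, -1.0])
r = np.linalg.matrix_rank(X)
P_X = X @ np.linalg.pinv(X.T @ X) @ X.T

rss = np.empty(reps)
for i in range(reps):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=N)
    e = y - P_X @ y
    rss[i] = e @ e

for k, label in [(N - r, "unbiased (k = N - rank X)"),
                 (N - r + 2, "Hodges-Lehmann (k = N - rank X + 2)")]:
    est = rss / k
    print(label, "empirical MSE:", np.mean((est - sigma2) ** 2))

# Theoretical MSEs: 2 sigma^4/(N - rank X) and 2 sigma^4/(N - rank X + 2).
print(2 * sigma2**2 / (N - r), 2 * sigma2**2 / (N - r + 2))
```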
Statistical independence. Let us conclude the present subsection (Subsection c) with some results pertaining to least squares estimators of estimable linear combinations of the elements of the parametric vector $\beta$. The least squares estimator of any such linear combination is expressible as $r'X'y$ for some $P \times 1$ vector $r$ of constants; more generally, the $M$-dimensional column vector whose elements are the least squares estimators of $M$ such linear combinations is expressible as $R'X'y$ for some $P \times M$ matrix $R$ of constants. Making use of formula (3.2.46) and of Parts (4) and (2) of Theorem 2.12.2, we find that
$$\mathrm{cov}(r'X'y, \tilde{e}) = r'X'(\sigma^2 I)(I - P_X)' = \sigma^2 r'X'(I - P_X) = 0 \tag{7.47}$$
and, similarly (and more generally), that
$$\mathrm{cov}(R'X'y, \tilde{e}) = 0. \tag{7.48}$$
Thus, the least squares estimator $r'X'y$ and the residual vector $\tilde{e}$ are uncorrelated. And, more generally, the vector $R'X'y$ of least squares estimators and the residual vector $\tilde{e}$ are uncorrelated.
Is the least squares estimator $r'X'y$ uncorrelated with the residual sum of squares $\tilde{e}'\tilde{e}$? Or, equivalently, is $r'X'y$ uncorrelated with an estimator of $\sigma^2$ of the form (7.41), including the unbiased estimator $\hat{\sigma}^2$ (and the Hodges–Lehmann estimator)? Assuming the model is such that the distribution of the vector $e = (e_1, e_2, \ldots, e_N)'$ of residual effects has third-order moments $\lambda_{ijk} = E(e_i e_j e_k)$ ($i, j, k = 1, 2, \ldots, N$) and making use of formula (7.13), we find that
$$\mathrm{cov}(r'X'y, \tilde{e}'\tilde{e}) = r'X'\Lambda\,\mathrm{vec}(I - P_X) + 2r'X'(\sigma^2 I)(I - P_X)X\beta, \tag{7.49}$$
where $\Lambda$ is an $N \times N^2$ matrix whose entry for the $i$th row and $jk$th column [column $(j-1)N + k$] is $\lambda_{ijk}$. The second term of expression (7.49) equals 0, as is evident upon recalling that $P_X X = X$, and the first term equals 0 if $\Lambda = 0$, as would be the case if the distribution of $e$ were MVN or, more generally, if the distribution of $e$ were symmetric. Thus, if the distribution of $e$ is symmetric, then
$$\mathrm{cov}(r'X'y, \tilde{e}'\tilde{e}) = 0 \tag{7.50}$$
and, more generally,
$$\mathrm{cov}(R'X'y, \tilde{e}'\tilde{e}) = 0. \tag{7.51}$$
Accordingly, if the distribution of $e$ is symmetric, $r'X'y$ and $R'X'y$ are uncorrelated with any estimator of $\sigma^2$ of the form (7.41), including the unbiased estimator $\hat{\sigma}^2$ (and the Hodges–Lehmann estimator).
Are the vector $R'X'y$ of least squares estimators and the residual vector $\tilde{e}$ statistically independent (as well as uncorrelated)? If the model is such that the distribution of $e$ is MVN (in which case the distribution of $y$ is also MVN), then it follows from Corollary 3.5.6 that the answer is yes. That is, if the model is such that the distribution of $e$ is MVN, then $\tilde{e}$ is distributed independently of $R'X'y$ (and, in particular, $\tilde{e}$ is distributed independently of $r'X'y$). Moreover, $\tilde{e}$ being distributed independently of $R'X'y$ implies that "any" function of $\tilde{e}$ is distributed independently of $R'X'y$—refer, e.g., to Casella and Berger (2002, theorem 4.6.12). Accordingly, if the distribution of $e$ is MVN, then the residual sum of squares $\tilde{e}'\tilde{e}$ is distributed independently of $R'X'y$, and any estimator of $\sigma^2$ of the form (7.41) (including the unbiased estimator $\hat{\sigma}^2$ and the Hodges–Lehmann estimator) is distributed independently of $R'X'y$.
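A quick numerical check of these results is possible. In the sketch below (the normal simulation model is our own choice, and $r$ is taken to be the first unit vector, so that $r'X'y$ is simply the first element of $X'y$), the identity $X'(I - P_X) = 0$ behind (7.47) is verified directly, and the sample correlation between a least squares estimator and the residual sum of squares is seen to be negligible:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 40, 4
X = rng.normal(size=(N, P))
P_X = X @ np.linalg.pinv(X.T @ X) @ X.T

# X'(I - P_X) = 0, so cov(R'X'y, e~) = sigma^2 R'X'(I - P_X) = 0 for any R.
print(np.max(np.abs(X.T @ (np.eye(N) - P_X))))     # ~ 0 up to rounding error

# Monte Carlo check that a least squares estimator and the residual sum of
# squares are uncorrelated under normality (illustrative parameter values).
beta, sigma = rng.normal(size=P), 1.5
lse, rss = np.empty(20000), np.empty(20000)
for i in range(20000):
    y = X @ beta + rng.normal(scale=sigma, size=N)
    e = y - P_X @ y
    lse[i] = X[:, 0] @ y            # r'X'y with r = first unit vector
    rss[i] = e @ e
print(np.corrcoef(lse, rss)[0, 1])   # close to 0
```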

d. Translation invariance
Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M, Aitken, or general linear model. And suppose that we wish to make inferences about $\sigma^2$ (in the case of a G–M or Aitken model) or $\theta$ (in the case of a general linear model) or about various functions of $\sigma^2$ or $\theta$. In making such inferences, it is common practice to restrict attention to procedures that depend on the value of $y$ only through the value of a (possibly vector-valued) statistic having a property known as translation invariance (or location invariance).
Proceeding as in Section 5.2 (in discussing the translation-equivariant estimation of a parametric function of the form $\lambda'\beta$), let $k$ represent a $P$-dimensional column vector of known constants, and define $z = y + Xk$. Then, $z = X\gamma + e$, where $\gamma = \beta + k$. And $z$ can be regarded as an $N \times 1$ observable random vector that follows a G–M, Aitken, or general linear model that is identical in all respects to the model followed by $y$, except that the role of the parametric vector $\beta$ is played by a vector (represented by $\gamma$) having a different interpretation. It can be argued that inferences about $\sigma^2$ or $\theta$, or about functions of $\sigma^2$ or $\theta$, should be made on the basis of a statistical procedure that depends on the value of $y$ only through the value of a (possibly vector-valued) statistic $h(y)$ that, for every $k \in \mathbb{R}^P$ (and for every value of $y$), satisfies the condition
$$h(y) = h(z)$$
or, equivalently, the condition
$$h(y) = h(y + Xk). \tag{7.52}$$
Any statistic $h(y)$ that satisfies condition (7.52) and that does so for every $k \in \mathbb{R}^P$ (and for every value of $y$) is said to be translation invariant.
If the statistic $h(y)$ is translation invariant, then
$$h(y) = h[y + X(-\beta)] = h(y - X\beta) = h(e). \tag{7.53}$$
Thus, the statistical properties of a statistical procedure that depends on the value of $y$ only through the value of a translation-invariant statistic $h(y)$ are completely determined by the distribution of the vector $e$ of residual effects. They do not depend on the vector $\beta$.
Let us now consider condition (7.52) in the special case where $h(y)$ is a scalar-valued statistic of the form
$$h(y) = y'Ay,$$
where $A$ is a symmetric matrix of constants. In this special case,
$$h(y) = h(y + Xk) \;\Longleftrightarrow\; y'AXk + k'X'Ay + k'X'AXk = 0 \;\Longleftrightarrow\; 2y'AXk = -k'X'AXk. \tag{7.54}$$
For condition (7.54) to be satisfied for every $k \in \mathbb{R}^P$ (and for every value of $y$), it is sufficient that $AX = 0$. It is also necessary. To see this, suppose that condition (7.54) is satisfied for every $k \in \mathbb{R}^P$ (and for every value of $y$). Then, upon setting $y = 0$ in condition (7.54), we find that $k'X'AXk = 0$ for every $k \in \mathbb{R}^P$, implying (in light of Corollary 2.13.4) that $X'AX = 0$. Thus, $y'AXk = 0$ for every $k \in \mathbb{R}^P$ (and every value of $y$), implying that every element of $AX$ equals 0 and hence that $AX = 0$.
In summary, we have established that the quadratic form $y'Ay$ (where $A$ is a symmetric matrix of constants) is a translation-invariant statistic if and only if the matrix $A$ of the quadratic form satisfies the condition
$$AX = 0. \tag{7.55}$$
Adopting the same notation and terminology as in Subsection c, consider the concept of translation invariance as applied to the residual vector $\tilde{e}$ and to the residual sum of squares $\tilde{e}'\tilde{e}$. Recall that $\tilde{e}$ is expressible as $\tilde{e} = (I - P_X)y$ and $\tilde{e}'\tilde{e}$ as $\tilde{e}'\tilde{e} = y'(I - P_X)y$. Recall also that $P_X X = X$ and hence that $(I - P_X)X = 0$. Thus, for any $P \times 1$ vector $k$ (and for any value of $y$),
$$(I - P_X)(y + Xk) = (I - P_X)y.$$
And it follows that $\tilde{e}$ is translation invariant. Moreover, $\tilde{e}'\tilde{e}$ is also translation invariant, as is evident upon observing that it depends on $y$ only through the value of $\tilde{e}$ or, alternatively, upon applying condition (7.55) (with $A = I - P_X$)—that condition (7.55) is applicable is evident upon recalling that $P_X$ is symmetric and hence that the matrix $I - P_X$ of the quadratic form $y'(I - P_X)y$ is symmetric.
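Condition (7.55) and the translation invariance of $\tilde{e}$ and $\tilde{e}'\tilde{e}$ are easy to verify numerically. The following sketch (with an arbitrary simulated $X$, $y$, and $k$ of our own choosing) checks that $AX = 0$ for $A = I - P_X$ and that the residual vector and residual sum of squares are unchanged when $y$ is replaced by $y + Xk$:

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 25, 3
X = rng.normal(size=(N, P))
P_X = X @ np.linalg.pinv(X.T @ X) @ X.T
A = np.eye(N) - P_X            # matrix of the quadratic form y'(I - P_X)y

y = rng.normal(size=N)
k = rng.normal(size=P)

# AX = 0 is the translation-invariance condition (7.55).
print(np.max(np.abs(A @ X)))                                   # ~ 0
# Residual vector and residual sum of squares are unchanged by y -> y + Xk.
print(np.allclose(A @ (y + X @ k), A @ y))                     # True
print(np.isclose((y + X @ k) @ A @ (y + X @ k), y @ A @ y))    # True
```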
Let us now specialize by supposing that $y$ follows a G–M model, and let us add to the results obtained in Subsection c (on the estimation of $\sigma^2$) by obtaining some results on translation-invariant estimation. Since the residual sum of squares $\tilde{e}'\tilde{e}$ is translation invariant, any estimator of $\sigma^2$ of the form (7.41) is translation invariant. In particular, the unbiased estimator $\hat{\sigma}^2$ is translation invariant (and the Hodges–Lehmann estimator is translation invariant).
A quadratic form $y'Ay$ in the observable random vector $y$ (where $A$ is a symmetric matrix of constants) is an unbiased estimator of $\sigma^2$ and is translation invariant if and only if
$$E(y'Ay) = \sigma^2 \quad \text{and} \quad AX = 0 \tag{7.56}$$
(in which case the quadratic form is referred to as a quadratic unbiased translation-invariant estimator). As an application of formula (7.11), we have that
$$E(y'Ay) = \mathrm{tr}[A(\sigma^2 I)] + \beta'X'AX\beta = \sigma^2\,\mathrm{tr}(A) + \beta'X'AX\beta. \tag{7.57}$$
In light of result (7.57), condition (7.56) is equivalent to the condition
$$\mathrm{tr}(A) = 1 \quad \text{and} \quad AX = 0. \tag{7.58}$$
Thus, the quadratic form $y'Ay$ is a quadratic unbiased translation-invariant estimator of $\sigma^2$ if and only if the matrix $A$ of the quadratic form satisfies condition (7.58).
Clearly, the estimator $\hat{\sigma}^2$ [which is expressible in the form $\hat{\sigma}^2 = y'(I - P_X)y/(N - \mathrm{rank}\,X)$] is a quadratic unbiased translation-invariant estimator of $\sigma^2$. In fact, if the fourth-order moments of the distribution of the vector $e = (e_1, e_2, \ldots, e_N)'$ of residual effects are such that (for $i, j, k, m = 1, 2, \ldots, N$) $E(e_i e_j e_k e_m)$ satisfies condition (7.38) (as would be the case if the distribution of $e$ were MVN), then the estimator $\hat{\sigma}^2$ has minimum variance (and hence minimum MSE) among all quadratic unbiased translation-invariant estimators of $\sigma^2$, as we now proceed to show.
Suppose that the fourth-order moments of the distribution of the vector $e = (e_1, e_2, \ldots, e_N)'$ are such that (for $i, j, k, m = 1, 2, \ldots, N$) $E(e_i e_j e_k e_m)$ satisfies condition (7.38). And denote by $\Lambda$ the $N \times N^2$ matrix whose entry for the $j$th row and $km$th column [column $(k-1)N + m$] is $E(e_j e_k e_m)$. Then, for any quadratic unbiased translation-invariant estimator $y'Ay$ of $\sigma^2$ (where $A$ is symmetric), we find [upon applying formula (7.17) and observing that $AX = 0$] that
$$\mathrm{var}(y'Ay) = 4\beta'(AX)'\Lambda\,\mathrm{vec}\,A + 2\sigma^4\,\mathrm{tr}(A^2) + 4\sigma^2\beta'X'AAX\beta = 0 + 2\sigma^4\,\mathrm{tr}(A^2) + 0 = 2\sigma^4\,\mathrm{tr}(A^2). \tag{7.59}$$
Let $R = A - \dfrac{1}{N - \mathrm{rank}\,X}(I - P_X)$, so that
$$A = \frac{1}{N - \mathrm{rank}\,X}(I - P_X) + R. \tag{7.60}$$
Further, observe that (since $P_X$ is symmetric) $R' = R$, that (since $AX = 0$ and $P_X X = X$)
$$RX = AX - \frac{1}{N - \mathrm{rank}\,X}(I - P_X)X = 0 - 0 = 0,$$
and that
$$X'R = X'R' = (RX)' = 0' = 0.$$
Accordingly, upon substituting expression (7.60) for $A$ (and recalling that $P_X$ is idempotent), we find that
$$A^2 = \frac{1}{(N - \mathrm{rank}\,X)^2}(I - P_X) + \frac{2}{N - \mathrm{rank}\,X}R + R'R. \tag{7.61}$$
Moreover, because $\mathrm{tr}(A) = 1$, we have [in light of result (7.35)] that
$$\mathrm{tr}(R) = \mathrm{tr}(A) - \frac{1}{N - \mathrm{rank}\,X}\mathrm{tr}(I - P_X) = 1 - 1 = 0. \tag{7.62}$$
And upon substituting expression (7.61) for $A^2$ in expression (7.59) and making use of results (7.35) and (7.62), we find that
$$\mathrm{var}(y'Ay) = 2\sigma^4\Big[\frac{1}{(N - \mathrm{rank}\,X)^2}\mathrm{tr}(I - P_X) + \frac{2}{N - \mathrm{rank}\,X}\mathrm{tr}(R) + \mathrm{tr}(R'R)\Big] = 2\sigma^4\Big[\frac{1}{N - \mathrm{rank}\,X} + \mathrm{tr}(R'R)\Big]. \tag{7.63}$$
Finally, upon observing that $\mathrm{tr}(R'R) = \sum_{i,j} r_{ij}^2$, where (for $i, j = 1, 2, \ldots, N$) $r_{ij}$ is the $ij$th element of $R$, we conclude that $\mathrm{var}(y'Ay)$ attains a minimum value of $2\sigma^4/(N - \mathrm{rank}\,X)$ and does so uniquely when $R = 0$ or, equivalently, when $A = \dfrac{1}{N - \mathrm{rank}\,X}(I - P_X)$ (i.e., when $y'Ay = \hat{\sigma}^2$).

5.8 Best (Minimum-Variance) Unbiased Estimation


Take $y$ to be an $N \times 1$ observable random vector that follows a G–M model, and consider the estimation of an estimable linear combination $\lambda'\beta$ of the elements of the parametric vector $\beta$ and consider also the estimation of the parameter $\sigma^2$. In Section 5.5a, it was determined that the least squares estimator of $\lambda'\beta$ has minimum variance among all linear unbiased estimators. And in Section 5.7d, it was determined that the estimator $\hat{\sigma}^2 = y'(I - P_X)y/(N - \mathrm{rank}\,X)$ has minimum variance among all quadratic unbiased translation-invariant estimators of $\sigma^2$ [provided that the fourth-order moments of the distribution of the vector $e = (e_1, e_2, \ldots, e_N)'$ of residual effects are such that (for $i, j, k, m = 1, 2, \ldots, N$) $E(e_i e_j e_k e_m)$ satisfies condition (7.38)].
If the distribution of $e$ is assumed to be MVN, something more can be said. It can be shown that under the assumption of multivariate normality, $X'y$ and $y'(I - P_X)y$ form a complete sufficient statistic—refer, e.g., to Casella and Berger (2002, def. 6.2.21) or to Schervish (1995, def. 2.34) for the definition of completeness—in which case "any" function, say $t[X'y, y'(I - P_X)y]$, of $X'y$ and $y'(I - P_X)y$ is a best (minimum-variance) unbiased estimator of $E\{t[X'y, y'(I - P_X)y]\}$ (e.g., Schervish 1995, theorem 5.5; Casella and Berger 2002, theorem 7.3.23). It follows, in particular, that under the assumption of multivariate normality, the least squares estimator of $\lambda'\beta$ has minimum variance among all unbiased estimators (linear or not), and the estimator $\hat{\sigma}^2$ has minimum variance among all unbiased estimators of $\sigma^2$ (quadratic and/or translation invariant or not).
Let us assume that the distribution of $e$ is MVN, and verify that (under the assumption of multivariate normality) $X'y$ and $y'(I - P_X)y$ form a complete sufficient statistic. Let us begin by introducing a transformation of $X'y$ and $y'(I - P_X)y$ that facilitates the verification.
Define $K = \mathrm{rank}\,X$. And observe that there exists an $N \times K$ matrix, say $W$, whose columns form a basis for $\mathcal{C}(X)$. Observe also that $W = XR$ for some matrix $R$ and that $X = WS$ for some ($K \times P$) matrix $S$ (of rank $K$). Moreover, $X'y$ and $y'(I - P_X)y$ are expressible in terms of the ($K \times 1$) vector $W'y$ and the sum of squares $y'y$; we have that
$$X'y = S'W'y \quad \text{and} \quad y'(I - P_X)y = y'y - (W'y)'S(X'X)^- S'W'y. \tag{8.1}$$
Conversely, $W'y$ and $y'y$ are expressible in terms of $X'y$ and $y'(I - P_X)y$; we have that
$$W'y = R'X'y \quad \text{and} \quad y'y = y'(I - P_X)y + (X'y)'(X'X)^- X'y. \tag{8.2}$$
Thus, corresponding to any function $g[X'y, y'(I - P_X)y]$ of $X'y$ and $y'(I - P_X)y$, there is a function, say $g_*(W'y, y'y)$, of $W'y$ and $y'y$ such that $g_*(W'y, y'y) = g[X'y, y'(I - P_X)y]$ for every value of $y$; namely, the function $g_*(W'y, y'y)$ defined by
$$g_*(W'y, y'y) = g[S'W'y,\; y'y - (W'y)'S(X'X)^- S'W'y].$$
Similarly, corresponding to any function $h(W'y, y'y)$ of $W'y$ and $y'y$, there is a function, say $h_*[X'y, y'(I - P_X)y]$, of $X'y$ and $y'(I - P_X)y$ such that $h_*[X'y, y'(I - P_X)y] = h(W'y, y'y)$ for every value of $y$; namely, the function $h_*[X'y, y'(I - P_X)y]$ defined by
$$h_*[X'y, y'(I - P_X)y] = h[R'X'y,\; y'(I - P_X)y + (X'y)'(X'X)^- X'y].$$
Now, suppose that $W'y$ and $y'y$ form a complete sufficient statistic. Then, it follows from result (8.2) that $X'y$ and $y'(I - P_X)y$ form a sufficient statistic. Moreover, if $E\{g[X'y, y'(I - P_X)y]\} = 0$, then $E[g_*(W'y, y'y)] = 0$, implying that $\Pr[g_*(W'y, y'y) = 0] = 1$ and hence that $\Pr\{g[X'y, y'(I - P_X)y] = 0\} = 1$. Thus, $X'y$ and $y'(I - P_X)y$ form a complete statistic.
Conversely, suppose that $X'y$ and $y'(I - P_X)y$ form a complete sufficient statistic. Then, it follows from result (8.1) that $W'y$ and $y'y$ form a sufficient statistic. Moreover, if $E[h(W'y, y'y)] = 0$, then $E\{h_*[X'y, y'(I - P_X)y]\} = 0$, implying that $\Pr\{h_*[X'y, y'(I - P_X)y] = 0\} = 1$ and hence that $\Pr[h(W'y, y'y) = 0] = 1$. Thus, $W'y$ and $y'y$ form a complete statistic.
At this point, we have established that $X'y$ and $y'(I - P_X)y$ form a complete sufficient statistic if and only if $W'y$ and $y'y$ form a complete sufficient statistic. Thus, for purposes of verifying that $X'y$ and $y'(I - P_X)y$ form a complete sufficient statistic, it suffices to consider the sufficiency and the completeness of the statistic formed by $W'y$ and $y'y$. In that regard, the probability density function of $y$, say $f(\cdot)$, is expressible as follows:
$$f(y) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Big[-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\Big]$$
$$= \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Big[-\frac{1}{2\sigma^2}(y'y - 2\beta'X'y + \beta'X'X\beta)\Big]$$
$$= \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Big(-\frac{1}{2\sigma^2}\beta'X'X\beta\Big)\exp\Big[-\frac{1}{2\sigma^2}y'y + \Big(\frac{1}{\sigma^2}S\beta\Big)'W'y\Big]. \tag{8.3}$$
Based on a well-known result on complete sufficient statistics for exponential families of distributions [a result that is theorem 2.74 in Schervish's (1995) book], it follows from result (8.3) that $W'y$ and $y'y$ form a complete sufficient statistic—to establish that the result on complete sufficient statistics for exponential families is applicable, it suffices to observe [in connection with expression (8.3)] that the parametric function $-1/(2\sigma^2)$ and the ($K \times 1$) vector $(1/\sigma^2)S\beta$ of parametric functions are such that, for any (strictly) negative scalar $c$ and any $K \times 1$ vector $d$, $-1/(2\sigma^2) = c$ and $(1/\sigma^2)S\beta = d$ for some value of $\sigma^2$ and some value of $\beta$ (as is evident upon noting that $S$ contains $K$ linearly independent columns). It remains only to observe that since $W'y$ and $y'y$ form a complete sufficient statistic, $X'y$ and $y'(I - P_X)y$ form a complete sufficient statistic.

5.9 Likelihood-Based Methods


A likelihood-based method, known as maximum likelihood (ML), can be used to estimate functions of the parameters ($\sigma$ and the elements of $\beta$) of the G–M or Aitken model. More generally, it can be used to estimate the parameters (the elements of $\beta$ and of $\theta$) of the general linear model. The use of this method requires an assumption that the distribution of the vector $e$ of residual effects is known up to the value of $\sigma$ (in the special case of a G–M or Aitken model) or up to the value of $\theta$ (in the case of a general linear model). Typically, the distribution of $e$ is taken to be MVN (multivariate normal).

a. (Ordinary) maximum likelihood estimation


It is convenient and instructive to begin by considering ML estimation in the relatively simple case
of a G–M model.
G–M model. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M model. Suppose further that the distribution of the vector $e$ of residual effects is MVN. Then, $y \sim N(X\beta, \sigma^2 I)$.
Let $f(\,\cdot\,; \beta, \sigma)$ represent the probability density function (pdf) of the distribution of $y$, and denote by $y$ the observed value of $y$. Then, by definition, the likelihood function is the function, say $L(\beta, \sigma; y)$, of the parameters (which consist of $\sigma$ and the elements of $\beta$) defined (for $\beta \in \mathbb{R}^P$ and $\sigma > 0$) by $L(\beta, \sigma; y) = f(y; \beta, \sigma)$. Accordingly,
$$L(\beta, \sigma; y) = \frac{1}{(2\pi\sigma^2)^{N/2}}\exp\Big[-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\Big]. \tag{9.1}$$
And the log-likelihood function, say $\ell(\beta, \sigma; y)$ [which, by definition, is the function obtained by equating $\ell(\beta, \sigma; y)$ to the logarithm of the likelihood function, i.e., to $\log L(\beta, \sigma; y)$], is expressible as
$$\ell(\beta, \sigma; y) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta). \tag{9.2}$$
Now, consider the maximization of the likelihood function $L(\beta, \sigma; y)$ or, equivalently, of the log-likelihood function $\ell(\beta, \sigma; y)$. Irrespective of the value of $\sigma$, $\ell(\beta, \sigma; y)$ attains its maximum value with respect to $\beta$ at any value of $\beta$ that minimizes $(y - X\beta)'(y - X\beta)$. Thus, in light of the results of Section 5.4b (on least squares minimization), $\ell(\beta, \sigma; y)$ attains its maximum value with respect to $\beta$ at a point $\tilde{\beta}$ if and only if
$$X'X\tilde{\beta} = X'y, \tag{9.3}$$
that is, if and only if $\tilde{\beta}$ is a solution to the normal equations.
Letting $\tilde{\beta}$ represent any $P \times 1$ vector that satisfies condition (9.3), it remains to consider the maximization of $\ell(\tilde{\beta}, \sigma; y)$ with respect to $\sigma$. In that regard, take $g(\sigma)$ to be a function of $\sigma$ of the form
$$g(\sigma) = a - \frac{K}{2}\log\sigma^2 - \frac{c}{2\sigma^2}, \tag{9.4}$$
where $a$ is a constant, $c$ is a (strictly) positive constant, and $K$ is a (strictly) positive integer. And observe that, unless $y - X\tilde{\beta} = 0$ (which is an event of probability 0), $\ell(\tilde{\beta}, \sigma; y)$ is of the form (9.4); in the special case where $a = -(N/2)\log(2\pi)$, $c = (y - X\tilde{\beta})'(y - X\tilde{\beta})$, and $K = N$, $g(\sigma) = \ell(\tilde{\beta}, \sigma; y)$. Clearly,
$$\frac{dg(\sigma)}{d\sigma} = -\frac{K}{\sigma} + \frac{c}{\sigma^3} = \frac{K}{\sigma^3}\Big(\frac{c}{K} - \sigma^2\Big).$$
Thus, $dg(\sigma)/d\sigma > 0$ if $\sigma^2 < c/K$, $dg(\sigma)/d\sigma = 0$ if $\sigma^2 = c/K$, and $dg(\sigma)/d\sigma < 0$ if $\sigma^2 > c/K$, so that $g(\sigma)$ is an increasing function of $\sigma$ for $\sigma < \sqrt{c/K}$, is a decreasing function for $\sigma > \sqrt{c/K}$, and attains its maximum value at $\sigma = \sqrt{c/K}$.
Unless the model is of full rank (i.e., unless $\mathrm{rank}\,X = P$), there are an infinite number of solutions to the normal equations and hence an infinite number of values of $\beta$ that maximize $\ell(\beta, \sigma; y)$. However, the value of an estimable linear combination $\lambda'\beta$ of the elements of $\beta$ is the same for every value of $\beta$ that maximizes $\ell(\beta, \sigma; y)$—recall (from the results of Section 5.4 on the method of least squares) that $\lambda'\tilde{b}$ has the same value for every solution $\tilde{b}$ to the normal equations.
In effect, we have established that the least squares estimator of any estimable linear combination of the elements of $\beta$ is also the ML estimator. Moreover, since condition (9.3) can be satisfied by taking $\tilde{\beta} = (X'X)^- X'y$, the ML estimator of $\sigma^2$ (the square root of which is the ML estimator of $\sigma$) is the estimator
$$\frac{\tilde{e}'\tilde{e}}{N}, \tag{9.5}$$
where $\tilde{e} = y - P_X y$. Like the unbiased estimator $\tilde{e}'\tilde{e}/(N - \mathrm{rank}\,X)$ and the Hodges–Lehmann estimator $\tilde{e}'\tilde{e}/[N - \mathrm{rank}(X) + 2]$, the ML estimator of $\sigma^2$ is of the form (7.41).
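A minimal computational sketch of ML estimation under the G–M model (ours, not the book's; the simulated data and variable names are illustrative) solves the normal equations and then divides the residual sum of squares by $N$, per (9.5), alongside the divisor $N - \mathrm{rank}\,X$ of the unbiased estimator for comparison:

```python
import numpy as np

rng = np.random.default_rng(4)
N, P, sigma = 60, 3, 1.3
X = rng.normal(size=(N, P))
beta = np.array([0.5, -2.0, 1.0])
y = X @ beta + rng.normal(scale=sigma, size=N)

# Any solution of the normal equations X'X b = X'y maximizes the likelihood in beta.
beta_tilde = np.linalg.pinv(X.T @ X) @ (X.T @ y)
e_tilde = y - X @ beta_tilde

sigma2_ml = (e_tilde @ e_tilde) / N                                   # ML estimate (9.5)
sigma2_unbiased = (e_tilde @ e_tilde) / (N - np.linalg.matrix_rank(X))
print(sigma2_ml, sigma2_unbiased)
```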
A result on minimization and some results on matrices. As a preliminary to considering ML estimation as applied to a general linear model (or an Aitken model), it is convenient to establish the following result on minimization.
Theorem 5.9.1. Let $b$ represent a $P \times 1$ vector of (unconstrained) variables, and define $f(b) = (y - Xb)'W(y - Xb)$, where $W$ is an $N \times N$ symmetric nonnegative definite matrix, $X$ is an $N \times P$ matrix, and $y$ is an $N \times 1$ vector. Then, the linear system $X'WXb = X'Wy$ (in $b$) is consistent. Further, $f(b)$ attains its minimum value at a point $\tilde{b}$ if and only if $\tilde{b}$ is a solution to $X'WXb = X'Wy$, in which case $f(\tilde{b}) = y'Wy - \tilde{b}'X'Wy$.
Proof. Let $R$ represent a matrix such that $W = R'R$—the existence of such a matrix is guaranteed by Corollary 2.13.25. Then, upon letting $t = Ry$ and $U = RX$, $f(b)$ is expressible as $f(b) = (t - Ub)'(t - Ub)$. Moreover, it follows from the results of Section 5.4b (on least squares minimization) that the linear system $U'Ub = U't$ (in $b$) is consistent and that $(t - Ub)'(t - Ub)$ attains its minimum value at a point $\tilde{b}$ if and only if $\tilde{b}$ is a solution to $U'Ub = U't$, in which case
$$(t - U\tilde{b})'(t - U\tilde{b}) = t't - \tilde{b}'U't.$$
It remains only to observe that $U'U = X'WX$, that $U't = X'Wy$, and that $t't = y'Wy$. Q.E.D.
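Theorem 5.9.1 translates directly into a computation. The sketch below (a hypothetical example of ours; the weight matrix $W = R'R$ is generated at random so that it is symmetric nonnegative definite) solves $X'WXb = X'Wy$ and confirms that the minimized value of $f(b)$ equals $y'Wy - \tilde{b}'X'Wy$:

```python
import numpy as np

rng = np.random.default_rng(5)
N, P = 20, 3
X = rng.normal(size=(N, P))
y = rng.normal(size=N)

# A symmetric nonnegative definite weight matrix W = R'R.
R = rng.normal(size=(N, N))
W = R.T @ R

# Solve X'WX b = X'Wy; pinv supplies a (generalized) inverse of X'WX.
b_tilde = np.linalg.pinv(X.T @ W @ X) @ (X.T @ W @ y)

f_min = (y - X @ b_tilde) @ W @ (y - X @ b_tilde)
print(np.isclose(f_min, y @ W @ y - b_tilde @ X.T @ W @ y))   # True, per the theorem
```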
In addition to Theorem 5.9.1, it is convenient to have at our disposal the following lemma, which can be regarded as a generalization of Lemma 2.12.1.
Lemma 5.9.2. For any $N \times P$ matrix $X$ and any $N \times N$ symmetric nonnegative definite matrix $W$,
$$\mathcal{R}(X'WX) = \mathcal{R}(WX), \quad \mathcal{C}(X'WX) = \mathcal{C}(X'W), \quad \text{and} \quad \mathrm{rank}(X'WX) = \mathrm{rank}(WX).$$
Proof. In light of Corollary 2.13.25, $W = R'R$ for some matrix $R$. And upon observing that $X'WX = (RX)'RX$ and making use of Corollary 2.4.4 and Lemma 2.12.1, we find that
$$\mathcal{R}(WX) = \mathcal{R}(R'RX) \subset \mathcal{R}(RX) = \mathcal{R}[(RX)'RX] = \mathcal{R}(X'WX) \subset \mathcal{R}(WX)$$
and hence that $\mathcal{R}(X'WX) = \mathcal{R}(WX)$. Moreover, that $\mathcal{R}(X'WX) = \mathcal{R}(WX)$ implies that $\mathrm{rank}(X'WX) = \mathrm{rank}(WX)$ and, in light of Lemma 2.4.6, that $\mathcal{C}(X'WX) = \mathcal{C}(X'W)$. Q.E.D.
In the special case of Lemma 5.9.2 where $W$ is a (symmetric) positive definite matrix (and hence is nonsingular), it follows from Corollary 2.5.6 that $\mathcal{R}(WX) = \mathcal{R}(X)$, $\mathcal{C}(X'W) = \mathcal{C}(X')$, and $\mathrm{rank}(WX) = \mathrm{rank}(X)$. Thus, we have the following corollary, which (like Lemma 5.9.2 itself) can be regarded as a generalization of Lemma 2.12.1.
Corollary 5.9.3. For any $N \times P$ matrix $X$ and any $N \times N$ symmetric positive definite matrix $W$,
$$\mathcal{R}(X'WX) = \mathcal{R}(X), \quad \mathcal{C}(X'WX) = \mathcal{C}(X'), \quad \text{and} \quad \mathrm{rank}(X'WX) = \mathrm{rank}(X).$$
As an additional corollary of Lemma 5.9.2, we have the following result.
Corollary 5.9.4. For any $N \times P$ matrix $X$ and any $N \times N$ symmetric nonnegative definite matrix $W$,
$$WX(X'WX)^- X'WX = WX \quad \text{and} \quad X'WX(X'WX)^- X'W = X'W.$$
Proof. In light of Lemmas 5.9.2 and 2.4.3, $WX = L'X'WX$ for some $P \times N$ matrix $L$. Thus,
$$WX(X'WX)^- X'WX = L'X'WX(X'WX)^- X'WX = L'X'WX = WX$$
and [since $X'W = (WX)' = (L'X'WX)' = X'WXL$]
$$X'WX(X'WX)^- X'W = X'WX(X'WX)^- X'WXL = X'WXL = X'W.$$
Q.E.D.
General linear model. Suppose that $y$ is an $N \times 1$ observable random vector that follows a general linear model. Suppose further that the distribution of the vector $e$ of residual effects is MVN, so that $y \sim N[X\beta, V(\theta)]$. And suppose that $V(\theta)$ is of rank $N$ (for every $\theta \in \Theta$).
Let us consider the ML estimation of functions of the model's parameters (which consist of the elements $\beta_1, \beta_2, \ldots, \beta_P$ of the vector $\beta$ and the elements $\theta_1, \theta_2, \ldots, \theta_T$ of the vector $\theta$). Let $f(\,\cdot\,; \beta, \theta)$ represent the pdf of the distribution of $y$, and denote by $y$ the observed value of $y$. Then, the likelihood function is the function, say $L(\beta, \theta; y)$, of $\beta$ and $\theta$ defined (for $\beta \in \mathbb{R}^P$ and $\theta \in \Theta$) by $L(\beta, \theta; y) = f(y; \beta, \theta)$. Accordingly,
$$L(\beta, \theta; y) = \frac{1}{(2\pi)^{N/2}|V(\theta)|^{1/2}}\exp\Big\{-\frac{1}{2}(y - X\beta)'[V(\theta)]^{-1}(y - X\beta)\Big\}. \tag{9.6}$$
And the log-likelihood function, say $\ell(\beta, \theta; y)$, is expressible as
$$\ell(\beta, \theta; y) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\log|V(\theta)| - \frac{1}{2}(y - X\beta)'[V(\theta)]^{-1}(y - X\beta). \tag{9.7}$$
Maximum likelihood estimates are obtained by maximizing $L(\beta, \theta; y)$ or, equivalently, $\ell(\beta, \theta; y)$ with respect to $\beta$ and $\theta$: if $L(\beta, \theta; y)$ or $\ell(\beta, \theta; y)$ attains its maximum value at values $\tilde{\beta}$ and $\tilde{\theta}$ (of $\beta$ and $\theta$, respectively), then an ML estimate of a function, say $h(\beta, \theta)$, of $\beta$ and/or $\theta$ is provided by the quantity $h(\tilde{\beta}, \tilde{\theta})$ obtained by substituting $\tilde{\beta}$ and $\tilde{\theta}$ for $\beta$ and $\theta$. In considering the maximization of the likelihood or log-likelihood function, it is helpful to begin by regarding the value of $\theta$ as "fixed" and considering the maximization of the likelihood or log-likelihood function with respect to $\beta$ alone.
Observe [in light of result (4.5.5) and Corollary 2.13.12] that (regardless of the value of $\theta$) $[V(\theta)]^{-1}$ is a symmetric positive definite matrix. Accordingly, it follows from Theorem 5.9.1 that for any particular value of $\theta$, the linear system
$$X'[V(\theta)]^{-1}Xb = X'[V(\theta)]^{-1}y \tag{9.8}$$
(in the $P \times 1$ vector $b$) is consistent. Further, $(y - X\beta)'[V(\theta)]^{-1}(y - X\beta)$ attains its minimum value, or equivalently $-(1/2)(y - X\beta)'[V(\theta)]^{-1}(y - X\beta)$ attains its maximum value, at a value $\tilde{\beta}(\theta)$ of $\beta$ if and only if $\tilde{\beta}(\theta)$ is a solution to linear system (9.8), that is, if and only if
$$X'[V(\theta)]^{-1}X\tilde{\beta}(\theta) = X'[V(\theta)]^{-1}y, \tag{9.9}$$
in which case
$$\max_{\beta \in \mathbb{R}^P}\, -\frac{1}{2}(y - X\beta)'[V(\theta)]^{-1}(y - X\beta) = -\frac{1}{2}[y - X\tilde{\beta}(\theta)]'[V(\theta)]^{-1}[y - X\tilde{\beta}(\theta)]$$
$$= -\frac{1}{2}\big\{y'[V(\theta)]^{-1}y - [\tilde{\beta}(\theta)]'X'[V(\theta)]^{-1}y\big\}. \tag{9.10}$$
Now, suppose that (for $\theta \in \Theta$) $\tilde{\beta}(\theta)$ satisfies condition (9.9). Then, for any matrix $A$ such that $\mathcal{R}(A) \subset \mathcal{R}(X)$, the value of $A\tilde{\beta}(\theta)$ (at any particular value of $\theta$) does not depend on the choice of $\tilde{\beta}(\theta)$, as is evident upon observing (in light of Corollary 5.9.3) that $A = T(\theta)X'[V(\theta)]^{-1}X$ for some matrix-valued function $T(\theta)$ of $\theta$ and hence that
$$A\tilde{\beta}(\theta) = T(\theta)X'[V(\theta)]^{-1}X\tilde{\beta}(\theta) = T(\theta)X'[V(\theta)]^{-1}y.$$
Thus, $X\tilde{\beta}(\theta)$ does not depend on the choice of $\tilde{\beta}(\theta)$, and for any estimable linear combination $\lambda'\beta$ of the elements of $\beta$, $\lambda'\tilde{\beta}(\theta)$ does not depend on the choice of $\tilde{\beta}(\theta)$. Among the possible choices for $\tilde{\beta}(\theta)$ are the vector $\{X'[V(\theta)]^{-1}X\}^- X'[V(\theta)]^{-1}y$ and the vector $(\{X'[V(\theta)]^{-1}X\}^-)'X'[V(\theta)]^{-1}y$.
Define
$$L_*(\theta; y) = L[\tilde{\beta}(\theta), \theta; y] \quad \text{and} \quad \ell_*(\theta; y) = \ell[\tilde{\beta}(\theta), \theta; y] \;[\,= \log L_*(\theta; y)\,]. \tag{9.11}$$
Then,
$$L_*(\theta; y) = \max_{\beta \in \mathbb{R}^P} L(\beta, \theta; y) \quad \text{and} \quad \ell_*(\theta; y) = \max_{\beta \in \mathbb{R}^P} \ell(\beta, \theta; y), \tag{9.12}$$
so that $L_*(\theta; y)$ is a profile likelihood function and $\ell_*(\theta; y)$ is a profile log-likelihood function—refer, e.g., to Severini (2000, sec. 4.6) for the definition of a profile likelihood or profile log-likelihood function. Moreover,
$$\ell_*(\theta; y) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\log|V(\theta)| - \frac{1}{2}[y - X\tilde{\beta}(\theta)]'[V(\theta)]^{-1}[y - X\tilde{\beta}(\theta)] \tag{9.13}$$
$$= -\frac{N}{2}\log(2\pi) - \frac{1}{2}\log|V(\theta)| - \frac{1}{2}\big\{y'[V(\theta)]^{-1}y - [\tilde{\beta}(\theta)]'X'[V(\theta)]^{-1}y\big\} \tag{9.14}$$
$$= -\frac{N}{2}\log(2\pi) - \frac{1}{2}\log|V(\theta)| - \frac{1}{2}y'\big([V(\theta)]^{-1} - [V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^- X'[V(\theta)]^{-1}\big)y. \tag{9.15}$$

Result (9.12) is significant from a computational standpoint. It "reduces" the problem of maximizing $L(\beta, \theta; y)$ or $\ell(\beta, \theta; y)$ with respect to $\beta$ and $\theta$ to that of maximizing $L_*(\theta; y)$ or $\ell_*(\theta; y)$ with respect to $\theta$ alone. Values of $\beta$ and $\theta$ at which $L(\beta, \theta; y)$ or $\ell(\beta, \theta; y)$ attains its maximum value can be obtained by taking the value of $\theta$ to be a value, say $\tilde{\theta}$, at which $L_*(\theta; y)$ or $\ell_*(\theta; y)$ attains its maximum value and by then taking the value of $\beta$ to be a solution $\tilde{\beta}(\tilde{\theta})$ to the linear system
$$X'[V(\tilde{\theta})]^{-1}Xb = X'[V(\tilde{\theta})]^{-1}y$$
(in the $P \times 1$ vector $b$).
(in the P  1 vector b).
In general, a solution to the problem of maximizing ` .I y/ is not obtainable in “closed form”;
rather, the maximization must be accomplished numerically via an iterative procedure—the discus-
sion of such procedures is deferred until later in the book. Nevertheless, there are special cases where
the maximization of ` .I y/, and hence that of `.ˇ; I y/, can be accomplished without resort to
indirect (iterative) numerical methods. Indirect numerical methods are not needed in the special case
where y follows a G–M model; that special case was discussed in Part 1 of the present subsection.
More generally, indirect numerical methods are not needed in the special case where y follows an
Aitken model, as is to be demonstrated in what follows.
Aitken model. Suppose that $y$ follows an Aitken model (and that $H$ is nonsingular and that the distribution of $e$ is MVN). And regard the Aitken model as the special case of the general linear model where $T = 1$ (i.e., where $\theta$ has only 1 element), where $\theta_1 = \sigma$, and where $V(\theta) = \sigma^2 H$. In that special case, linear system (9.8) is equivalent to the linear system
$$X'H^{-1}Xb = X'H^{-1}y \tag{9.16}$$
—the equivalence is in the sense that both linear systems have the same set of solutions. The equations comprising linear system (9.16) are known as the Aitken equations. When $H = I$ (i.e., when the model is a G–M model), the linear system (9.16) of Aitken equations simplifies to the linear system $X'Xb = X'y$ of normal equations.
In this setting, we are free to choose the vector $\tilde{\beta}(\theta)$ in such a way that it has the same value for every value of $\theta$. Accordingly, for every value of $\theta$, take $\tilde{\beta}(\theta)$ to be $\tilde{\beta}$, where $\tilde{\beta}$ is any solution to the Aitken equations. Then, writing $\sigma$ for $\theta$, the profile log-likelihood function $\ell_*(\sigma; y)$ is expressible as
$$\ell_*(\sigma; y) = \ell(\tilde{\beta}, \sigma; y) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2}\log|H| - \frac{1}{2\sigma^2}(y - X\tilde{\beta})'H^{-1}(y - X\tilde{\beta}).$$
Unless $y - X\tilde{\beta} = 0$ (which is an event of probability 0), $\ell_*(\sigma; y)$ is of the form of the function $g(\sigma)$ defined (in Part 1 of the present subsection) by equality (9.4); upon setting $a = -(N/2)\log(2\pi) - (1/2)\log|H|$, $c = (y - X\tilde{\beta})'H^{-1}(y - X\tilde{\beta})$, and $K = N$, $g(\sigma) = \ell_*(\sigma; y)$. Thus, it follows from the results of Part 1 that $\ell_*(\sigma; y)$ attains its maximum value when $\sigma^2$ equals
$$\frac{(y - X\tilde{\beta})'H^{-1}(y - X\tilde{\beta})}{N}. \tag{9.17}$$
And we conclude that $\ell(\beta, \sigma; y)$ attains its maximum value when $\beta$ equals $\tilde{\beta}$ and when $\sigma^2$ equals the quantity (9.17). This conclusion serves to generalize the conclusion reached in Part 1, where it was determined that in the special case of the G–M model, the log-likelihood function attains its maximum value when $\beta$ equals a solution, say $\tilde{\beta}$, to the normal equations (i.e., to the linear system $X'Xb = X'y$) and when $\sigma^2$ equals $(y - X\tilde{\beta})'(y - X\tilde{\beta})/N$.
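For the Aitken model the maximization is thus available in closed form. A short sketch (ours; the AR(1)-style matrix $H$ and the parameter values are illustrative assumptions) solves the Aitken equations (9.16) and evaluates the quantity (9.17):

```python
import numpy as np

rng = np.random.default_rng(7)
N, P = 40, 3
X = rng.normal(size=(N, P))
beta, sigma = np.array([1.0, 0.0, -0.5]), 1.2

# A known, nonsingular matrix H (an AR(1)-like correlation structure, chosen for illustration).
H = 0.6 ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
L = np.linalg.cholesky(H)
y = X @ beta + sigma * (L @ rng.normal(size=N))

Hinv = np.linalg.inv(H)
# Aitken equations (9.16): X'H^{-1}X b = X'H^{-1}y.
beta_tilde = np.linalg.pinv(X.T @ Hinv @ X) @ (X.T @ Hinv @ y)
resid = y - X @ beta_tilde
sigma2_ml = (resid @ Hinv @ resid) / N          # quantity (9.17)
print(beta_tilde, sigma2_ml)
```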

b. Restricted or residual maximum likelihood estimation (REML estimation)
Suppose that $y$ is an $N \times 1$ observable random vector that follows a general linear model. Suppose further that the distribution of the vector $e$ of residual effects is MVN, so that $y \sim N[X\beta, V(\theta)]$. And let $\ell(\beta, \theta; y)$ represent the log-likelihood function [where $y$ is the observed value of $y$ and where $V(\theta)$ is assumed to be of rank $N$ (for every $\theta \in \Theta$)]. This function has the representation (9.7).
Suppose that $\tilde{\beta}$ and $\tilde{\theta}$ are values of $\beta$ and $\theta$ at which $\ell(\beta, \theta; y)$ attains its maximum value. And observe that $\tilde{\theta}$ is a value of $\theta$ at which $\ell(\tilde{\beta}, \theta; y)$ attains its maximum value. There is an implication that $\tilde{\theta}$ is identical to the value of $\theta$ that would be obtained from maximizing the likelihood function under a supposition that $\beta$ is a known ($P \times 1$) vector (rather than a vector of unknown parameters) and under the further supposition that $\beta$ equals $\tilde{\beta}$ (or, perhaps more precisely, $X\beta$ equals $X\tilde{\beta}$). Thus, in a certain sense, maximum likelihood estimators of functions of $\theta$ fail to account for the estimation of $\beta$. This failure can be disconcerting and can have undesirable consequences.
It is informative to consider the manifestation of this phenomenon in the relatively simple special case of a G–M model. In that special case, the use of maximum likelihood estimation results in $\sigma^2$ being estimated by the quantity (9.5), in which the residual sum of squares is divided by $N$ rather than by $N - \mathrm{rank}\,X$ as in the case of the unbiased estimator [or by $N - \mathrm{rank}(X) + 2$ as in the case of the Hodges–Lehmann estimator].
The failure of ML estimators of functions of $\theta$ to account for the estimation of $\beta$ has led to the widespread use of a variant of maximum likelihood that has come to be known by the acronym REML (which is regarded by some as standing for restricted maximum likelihood and by others as standing for residual maximum likelihood). In REML, inferences about functions of $\theta$ are based on the likelihood function associated with a vector of what are sometimes called error contrasts.
An error contrast is a linear unbiased estimator of 0, that is, a linear combination, say $r'y$, of the elements of $y$ such that $E(r'y) = 0$ or, equivalently, such that $X'r = 0$. Thus, $r'y$ is an error contrast if and only if $r \in \mathcal{N}(X')$. Moreover, in light of Lemma 2.11.5,
$$\dim[\mathcal{N}(X')] = N - \mathrm{rank}(X') = N - \mathrm{rank}\,X.$$
And it follows that there exists a set of $N - \mathrm{rank}\,X$ linearly independent error contrasts and that no set of error contrasts contains more than $N - \mathrm{rank}\,X$ linearly independent error contrasts.
Accordingly, let $R$ represent an $N \times (N - \mathrm{rank}\,X)$ matrix (of constants) of full column rank $N - \mathrm{rank}\,X$ such that $X'R = 0$ [or, equivalently, an $N \times (N - \mathrm{rank}\,X)$ matrix whose columns are linearly independent members of the null space $\mathcal{N}(X')$ of $X'$]. And take $z$ to be the $(N - \mathrm{rank}\,X) \times 1$ vector defined by $z = R'y$ (so that the elements of $z$ are $N - \mathrm{rank}\,X$ linearly independent error contrasts). Then, $z \sim N[0, R'V(\theta)R]$, and [in light of the assumption that $V(\theta)$ is nonsingular and in light of Theorem 2.13.10] $R'V(\theta)R$ is nonsingular. Further, let $f(\,\cdot\,; \theta)$ represent the pdf of the distribution of $z$, and take $L(\theta; R'y)$ to be the function of $\theta$ defined (for $\theta \in \Theta$) by $L(\theta; R'y) = f(R'y; \theta)$. The function $L(\theta; R'y)$ is a likelihood function; it is the likelihood function obtained by regarding the observed value of $z$ as the data vector. In REML, the inferences about functions of $\theta$ are based on the likelihood function $L(\theta; R'y)$ [or on a likelihood function that is equivalent to $L(\theta; R'y)$ in the sense that it differs from $L(\theta; R'y)$ by no more than a multiplicative constant].
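Constructing a matrix $R$ whose columns span $\mathcal{N}(X')$, and hence a vector of error contrasts, is straightforward numerically. The sketch below (ours) uses scipy.linalg.null_space, which returns an orthonormal basis for the null space, so the resulting $R$ also happens to satisfy $R'R = I$:

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(8)
N, P = 20, 4
X = rng.normal(size=(N, P))
X[:, 3] = X[:, 0] + X[:, 1]           # make X rank-deficient (rank 3)

# Columns of R form an orthonormal basis for N(X').
R = null_space(X.T)
print(R.shape)                         # (N, N - rank X) = (20, 17)
print(np.max(np.abs(X.T @ R)))         # ~ 0, so each element of R'y is an error contrast

y = X @ rng.normal(size=P) + rng.normal(size=N)
z = R.T @ y                            # N - rank X linearly independent error contrasts
print(z.shape)
```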
It is worth noting that the use of REML results in the same inferences regardless of the choice of the matrix $R$. To see that REML has this property, let $R_1$ and $R_2$ represent any two choices for $R$, that is, take $R_1$ and $R_2$ to be any two $N \times (N - \mathrm{rank}\,X)$ matrices of full column rank such that $X'R_1 = X'R_2 = 0$. Further, define $z_1 = R_1'y$ and $z_2 = R_2'y$. And let $f_1(\,\cdot\,; \theta)$ represent the pdf of the distribution of $z_1$ and $f_2(\,\cdot\,; \theta)$ the pdf of the distribution of $z_2$; and take $L_1(\theta; R_1'y)$ and $L_2(\theta; R_2'y)$ to be the functions of $\theta$ defined by $L_1(\theta; R_1'y) = f_1(R_1'y; \theta)$ and $L_2(\theta; R_2'y) = f_2(R_2'y; \theta)$.
There exists an $(N - \mathrm{rank}\,X) \times (N - \mathrm{rank}\,X)$ matrix $A$ such that $R_2 = R_1 A$, as is evident upon observing that the columns of each of the two matrices $R_1$ and $R_2$ form a basis for the $(N - \mathrm{rank}\,X)$-dimensional linear space $\mathcal{N}(X')$; necessarily, $A$ is nonsingular. Moreover, the pdf's of the distributions of $z_1$ and $z_2$ are such that (for every value of $z_1$)
$$f_1(z_1) = |\det A|\,f_2(A'z_1)$$
—this relationship can be verified directly from formula (3.5.32) for the pdf of an MVN distribution or simply by observing that $z_2 = A'z_1$ and making use of standard results (e.g., Bickel and Doksum 2001, sec. B.2) on a change of variables. Thus,
$$L_2(\theta; R_2'y) = f_2(R_2'y; \theta) = f_2(A'R_1'y; \theta) = |\det A|^{-1}f_1(R_1'y; \theta) = |\det A|^{-1}L_1(\theta; R_1'y).$$
We conclude that the two likelihood functions $L_1(\theta; R_1'y)$ and $L_2(\theta; R_2'y)$ differ from each other by no more than a multiplicative constant and hence that they are equivalent.
The $(N - \mathrm{rank}\,X)$-dimensional vector $z = R'y$ of error contrasts is translation invariant, as is evident upon observing that for every $P \times 1$ vector $k$ (and every value of $y$),
$$R'(y + Xk) = R'y + (X'R)'k = R'y + 0k = R'y.$$
In fact, $z$ is a maximal invariant: in the present context, a (possibly vector-valued) statistic $h(y)$ is said to be a maximal invariant if it is invariant and if corresponding to each pair of values $y_1$ and $y_2$ of $y$ such that $h(y_2) = h(y_1)$, there exists a $P \times 1$ vector $k$ such that $y_2 = y_1 + Xk$—refer, e.g., to Lehmann and Romano (2005b, sec. 6.2) for a general definition (of a maximal invariant).
To confirm that $z$ is a maximal invariant, take $y_1$ and $y_2$ to be any pair of values of $y$ such that $R'y_2 = R'y_1$. And observe that $y_2 = y_1 + (y_2 - y_1)$ and that $y_2 - y_1 \in \mathcal{N}(R')$. Observe also (in light of Lemma 2.11.5) that $\dim[\mathcal{N}(R')] = \mathrm{rank}\,X$. Moreover, $R'X = (X'R)' = 0$, implying (in light of Lemma 2.4.2) that $\mathcal{C}(X) \subset \mathcal{N}(R')$ and hence (in light of Theorem 2.4.10) that $\mathcal{C}(X) = \mathcal{N}(R')$. Thus, the linear space $\mathcal{N}(R')$ is spanned by the columns of $X$, leading to the conclusion that there exists a $P \times 1$ vector $k$ such that $y_2 - y_1 = Xk$ and hence such that $y_2 = y_1 + Xk$.
That $z = R'y$ is a maximal invariant is of interest because any maximal invariant, say $h(y)$, has (in the present context) the following property: a (possibly vector-valued) statistic, say $g(y)$, is translation invariant if and only if $g(y)$ depends on the value of $y$ only through $h(y)$, that is, if and only if there exists a function $s(\cdot)$ such that $g(y) = s[h(y)]$ (for every value of $y$). To see that $h(y)$ has this property, observe that if [for some function $s(\cdot)$] $g(y) = s[h(y)]$ (for every value of $y$), then (for every $P \times 1$ vector $k$)
$$g(y + Xk) = s[h(y + Xk)] = s[h(y)] = g(y),$$
so that $g(y)$ is translation invariant. Conversely, if $g(y)$ is translation invariant and if $y_1$ and $y_2$ are any pair of values of $y$ such that $h(y_2) = h(y_1)$, then $y_2 = y_1 + Xk$ for some vector $k$ and, consequently, $g(y_2) = g(y_1 + Xk) = g(y_1)$.
The vector $z$ consists of $N - \mathrm{rank}\,X$ linearly independent linear combinations of the elements of the $N \times 1$ vector $y$. Suppose that we introduce an additional $\mathrm{rank}\,X$ linear combinations in the form of the $(\mathrm{rank}\,X) \times 1$ vector $u$ defined by $u = X_*'y$, where $X_*$ is any $N \times (\mathrm{rank}\,X)$ matrix (of constants) whose columns are linearly independent columns of $X$ or, more generally, whose columns form a basis for $\mathcal{C}(X)$. Then,
$$\begin{pmatrix} u \\ z \end{pmatrix} = (X_*, R)'y.$$
And (since $X_* = XA$ for some matrix $A$)
$$\mathrm{rank}(X_*, R) = \mathrm{rank}[(X_*, R)'(X_*, R)] = \mathrm{rank}\,\mathrm{diag}(X_*'X_*,\, R'R) = \mathrm{rank}(X_*'X_*) + \mathrm{rank}(R'R) = \mathrm{rank}\,X_* + \mathrm{rank}\,R = \mathrm{rank}(X) + N - \mathrm{rank}(X) = N. \tag{9.18}$$
Accordingly, the likelihood function that would result from regarding the observed value $(X_*, R)'y$ of $\begin{pmatrix} u \\ z \end{pmatrix}$ as the data vector differs by no more than a multiplicative constant from that obtained by regarding the observed value $y$ of $y$ as the data vector (as can be readily verified). When viewed in this context, the likelihood function that is employed in REML can be regarded as what is known as a marginal likelihood—refer, e.g., to Pawitan (2001, sec. 10.3) for the definition of a marginal likelihood.
The vector $\tilde{e} = (I - P_X)y$ [where $P_X = X(X'X)^- X'$] is the vector of (least squares) residuals. Observe [in light of Theorem 2.12.2 and Lemma 2.8.4] that $X'(I - P_X) = 0$ and that
$$\mathrm{rank}(I - P_X) = N - \mathrm{rank}\,P_X = N - \mathrm{rank}\,X. \tag{9.19}$$
Thus, among the choices for the N  .N rank X/ matrix R (of full column rank N rank X such
that X0 R D 0) is any N  .N rank X/ matrix whose columns are a linearly independent subset
of the columns of the (symmetric) matrix I PX . For any such choice of R, the elements of the
.N rank X/  1 vector z D R0 y consist of linearly independent (least squares) residuals.
The letters R and E in the acronym REML can be regarded as representing either restricted or
residual. REML is restricted ML in the sense that in the formation of the likelihood function, the data
are restricted to those inherent in the values of the N rank X linearly independent error contrasts.
REML is residual ML in the sense that the N rank X linearly independent error contrasts can be
taken to be (least squares) residuals.
It might seem as though the use of REML would result in the loss of some information about
functions of . However, in at least one regard, there is no loss of information. Consider the profile
likelihood function L .  I y/ or profile log-likelihood function ` .  I y/ of definition (9.11)—the
(ordinary) ML estimate of a function of  is obtained from a value of  at which L .I y/ or
` .I y/ attains its maximum value. The identity of the function L .  I y/ or, equivalently, that of
the function ` .  I y/ can be determined solely from knowledge of the observed value R0 y of the
vector z of error contrasts; complete knowledge of the observed value y of y is not required. Thus,
the (ordinary) ML estimator of a function of  (like the REML estimator) depends on the value of
y only through the value of the vector of error contrasts.
Let us verify that the identity of the function $\ell_*(\,\cdot\,; y)$ is determinable solely from knowledge of $R'y$. Let $\tilde{e} = (I - P_X)y$, and observe (in light of Theorem 2.12.2) that $X'(I - P_X)' = 0$, implying [since the columns of $R$ form a basis for $\mathcal{N}(X')$] that $(I - P_X)' = RK$ for some matrix $K$ and hence that
$$\tilde{e} = (RK)'y = K'R'y \tag{9.20}$$
—$\tilde{e}$ is the observed value of the vector $(I - P_X)y$. Moreover, upon observing [in light of result (2.5.5) and Corollary 2.13.12] that $[V(\theta)]^{-1}$ is a symmetric positive definite matrix, it follows from Corollary 5.9.4 that
$$\big([V(\theta)]^{-1} - [V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^- X'[V(\theta)]^{-1}\big)X = 0$$
and that
$$X'\big([V(\theta)]^{-1} - [V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^- X'[V(\theta)]^{-1}\big) = 0.$$
And as a consequence, formula (9.15) for $\ell_*(\theta; y)$ can be reexpressed as follows:
$$\ell_*(\theta; y) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\log|V(\theta)| - \frac{1}{2}\tilde{e}'\big([V(\theta)]^{-1} - [V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^- X'[V(\theta)]^{-1}\big)\tilde{e}. \tag{9.21}$$
Together, results (9.21) and (9.20) imply that the identity of the function $\ell_*(\,\cdot\,; y)$ is determinable solely from knowledge of $R'y$.
Some results on symmetric idempotent matrices and on null spaces. As a preliminary to considering
REML in the special case of a G–M model, it is helpful to establish the following three results on
symmetric idempotent matrices and on null spaces.

Theorem 5.9.5. Every symmetric idempotent matrix is nonnegative definite. Moreover, if $A$ is an $N \times N$ symmetric idempotent matrix of rank $R > 0$, then there exists an $N \times R$ matrix $Q$ such that $A = QQ'$, and, for any such $N \times R$ matrix $Q$, $\mathrm{rank}\,Q = R$ and $Q'Q = I$. And, conversely, for any $N \times R$ matrix $Q$ such that $Q'Q = I$, $QQ'$ is an $N \times N$ symmetric idempotent matrix of rank $R$.
Proof. Suppose that $A$ is an $N \times N$ symmetric idempotent matrix of rank $R$ ($\ge 0$). Then, $A = A^2 = A'A$, and it follows from Corollary 2.13.15 that $A$ is nonnegative definite. Moreover, assuming that $R > 0$, it follows from Corollary 2.13.23 that there exists an $N \times R$ matrix $Q$ such that $A = QQ'$. And for any such $N \times R$ matrix $Q$, we find, in light of Lemma 2.12.1 [and result (2.4.1)], that $\mathrm{rank}\,Q = R$ and that $Q'Q$ is nonsingular and, in addition, we find that
$$Q'QQ'QQ'Q = Q'A^2 Q = Q'AQ = Q'QQ'Q \tag{9.22}$$
and hence [upon premultiplying and postmultiplying both sides of equality (9.22) by $(Q'Q)^{-1}$] that $Q'Q = I$.
Conversely, suppose that $Q$ is an $N \times R$ matrix such that $Q'Q = I$. Then, upon observing that $QQ' = P_Q$ and (in light of Lemma 2.12.1) that
$$\mathrm{rank}(Q) = \mathrm{rank}(Q'Q) = \mathrm{rank}(I_R) = R,$$
it follows from Theorem 2.12.2 that $QQ'$ is a symmetric idempotent matrix of rank $R$. Q.E.D.
Theorem 5.9.6. Let $X$ represent an $N \times P$ matrix of rank $R$ ($< N$). Then, $I - P_X$ is an $N \times N$ symmetric idempotent matrix of rank $N - R$, and there exists an $N \times (N - R)$ matrix $Q$ such that $I - P_X = QQ'$. Moreover, for any $N \times (N - R)$ matrix $Q$, $I - P_X = QQ'$ if and only if $X'Q = 0$ and $Q'Q = I$ (in which case $Q$ is of full column rank $N - R$).
Proof. In light of Theorem 2.12.2 and Lemmas 2.8.1 and 2.8.4, it is clear that $I - P_X$ is a symmetric idempotent matrix of rank $N - R$. And in light of Theorem 5.9.5, there exists an $N \times (N - R)$ matrix $Q$ such that $I - P_X = QQ'$.
Now, suppose that $Q$ is any $N \times (N - R)$ matrix such that $X'Q = 0$ and $Q'Q = I$. Then, $QQ'$ is a symmetric idempotent matrix of rank $N - R$ (as is evident from Theorem 5.9.5), and $P_X Q = 0$ and $Q'P_X = 0$. And, consequently, $I - P_X - QQ'$ is a symmetric idempotent matrix. Further, making use of Corollary 2.8.3 and Lemma 2.12.1, we find that
$$\mathrm{rank}(I - P_X - QQ') = \mathrm{tr}(I - P_X - QQ') = \mathrm{tr}(I - P_X) - \mathrm{tr}(QQ') = \mathrm{rank}(I - P_X) - \mathrm{rank}(QQ') = N - R - (N - R) = 0,$$
implying that $I - P_X - QQ' = 0$ and hence that $I - P_X = QQ'$.
Conversely, suppose that $Q$ is any $N \times (N - R)$ matrix such that $I - P_X = QQ'$. Then, according to Theorem 5.9.5, $Q'Q = I$. Moreover, making use of Theorem 2.12.2, we find that
$$X'QQ' = X'(I - P_X) = 0,$$
implying (in light of Corollary 2.3.4) that $X'Q = 0$. Q.E.D.
Lemma 5.9.7. Let $X$ represent an $N \times P$ matrix of rank $R$ ($< N$). Then, for any $N \times (N - R)$ matrix $Q$, $X'Q = 0$ and $Q'Q = I$ if and only if the columns of $Q$ form an orthonormal basis for $\mathcal{N}(X')$.
Proof. If the columns of $Q$ form an orthonormal basis for $\mathcal{N}(X')$, then clearly $X'Q = 0$ and $Q'Q = I$. Conversely, suppose that $X'Q = 0$ and $Q'Q = I$. Then, clearly, the $N - R$ columns of $Q$ are orthonormal, and each of them is contained in $\mathcal{N}(X')$. And since orthonormal vectors are linearly independent (as is evident from Lemma 2.4.22) and since (according to Lemma 2.11.5) $\dim[\mathcal{N}(X')] = N - R$, it follows from Theorem 2.4.11 that the columns of $Q$ form a basis for $\mathcal{N}(X')$. Q.E.D.
REML in the special case of a G–M model. Let us consider REML in the special case where the $N \times 1$ observable random vector $y$ follows a G–M model. And in doing so, let us continue to suppose that the distribution of the vector $e$ of residual effects is MVN. Then, $y \sim N(X\beta, \sigma^2 I)$.
What is the REML estimator of $\sigma^2$, and how does it compare with other estimators of $\sigma^2$, including the (ordinary) ML estimator (which was derived in Subsection a)? These questions can be readily answered by making a judicious choice for the $N \times (N - \mathrm{rank}\,X)$ matrix $R$ (of full column rank $N - \mathrm{rank}\,X$) such that $X'R = 0$.
Let $Q$ represent an $N \times (N - \mathrm{rank}\,X)$ matrix whose columns form an orthonormal basis for $\mathcal{N}(X')$. Or, equivalently (in light of Lemma 5.9.7), take $Q$ to be an $N \times (N - \mathrm{rank}\,X)$ matrix such that $X'Q = 0$ and $Q'Q = I$. And observe (in light of Theorem 5.9.6) that
$$I - P_X = QQ'$$
(and that $Q$ is of full column rank).
Suppose that in implementing REML, we set $R = Q$—clearly, that is a legitimate choice for $R$. Then, $z = Q'y \sim N(0, \sigma^2 I)$. And, letting $y$ represent the observed value of $y$, the log-likelihood function that results from regarding the observed value $Q'y$ of $z$ as the data vector is the function $\ell(\sigma; Q'y)$ of $\sigma$ given by
$$\ell(\sigma; Q'y) = -\frac{N - \mathrm{rank}\,X}{2}\log(2\pi) - \frac{1}{2}\log\big|\sigma^2 I_{N - \mathrm{rank}\,X}\big| - \frac{1}{2}y'Q(\sigma^2 I)^{-1}Q'y$$
$$= -\frac{N - \mathrm{rank}\,X}{2}\log(2\pi) - \frac{N - \mathrm{rank}\,X}{2}\log\sigma^2 - \frac{1}{2\sigma^2}y'(I - P_X)y$$
$$= -\frac{N - \mathrm{rank}\,X}{2}\log(2\pi) - \frac{N - \mathrm{rank}\,X}{2}\log\sigma^2 - \frac{1}{2\sigma^2}[(I - P_X)y]'(I - P_X)y. \tag{9.23}$$
Unless $(I - P_X)y = 0$ (which is an event of probability 0), $\ell(\sigma; Q'y)$ is of the form of the function $g(\sigma)$ defined by equality (9.4); upon setting $a = -[(N - \mathrm{rank}\,X)/2]\log(2\pi)$, $c = [(I - P_X)y]'(I - P_X)y$, and $K = N - \mathrm{rank}\,X$, $g(\sigma) = \ell(\sigma; Q'y)$. Accordingly, it follows from the results of Part 1 of Subsection a that $\ell(\sigma; Q'y)$ attains its maximum value when $\sigma^2$ equals
$$\frac{[(I - P_X)y]'(I - P_X)y}{N - \mathrm{rank}\,X}.$$
Thus, the REML estimator of $\sigma^2$ is the estimator
$$\frac{\tilde{e}'\tilde{e}}{N - \mathrm{rank}\,X}, \tag{9.24}$$
where $\tilde{e} = y - P_X y$.
The REML estimator (9.24) is of the form (7.41) considered in Section 5.7c; it is the estimator of the form (7.41) that is unbiased. Unlike the (ordinary) ML estimator $\tilde{e}'\tilde{e}/N$ [which was derived in Part 1 of Subsection a and is also of the form (7.41)], it "accounts for the estimation of $\beta$"; in the REML estimation of $\sigma^2$, the residual sum of squares $\tilde{e}'\tilde{e}$ is divided by $N - \mathrm{rank}\,X$ rather than by $N$.
A matrix lemma. Preliminary to the further discussion of REML, it is convenient to establish the following lemma.
Lemma 5.9.8. Let $A$ represent a $Q \times S$ matrix. Then, for any $K \times Q$ matrix $C$ of full column rank $Q$ and any $S \times T$ matrix $B$ of full row rank $S$, $B(CAB)^- C$ is a generalized inverse of $A$.
Proof. In light of Lemma 2.5.1, $C$ has a left inverse, say $L$, and $B$ has a right inverse, say $R$. And it follows that
$$AB(CAB)^- CA = IAB(CAB)^- CAI = LCAB(CAB)^- CABR = LCABR = IAI = A.$$
Thus, $B(CAB)^- C$ is a generalized inverse of $A$. Q.E.D.
Note that in the special case where $A$ is nonsingular (i.e., the special case where $A$ is a $Q \times Q$ matrix of rank $Q$), the result of Lemma 5.9.8 can be restated as follows:
$$A^{-1} = B(CAB)^- C. \tag{9.25}$$
An informative and computationally useful expression for the REML log-likelihood function. Suppose that $y$ is an $N \times 1$ observable random vector that follows a general linear model. Suppose further that the distribution of the vector $e$ of residual effects is MVN and that the variance-covariance matrix $V(\theta)$ of $e$ is nonsingular (for every $\theta \in \Theta$). And let $z = R'y$, where $R$ is an $N \times (N - \mathrm{rank}\,X)$ matrix of full column rank $N - \mathrm{rank}\,X$ such that $X'R = 0$, and denote by $y$ the observed value of $y$.
In REML, inferences about functions of $\theta$ are based on the likelihood function $L(\theta; R'y)$ obtained by regarding the observed value $R'y$ of $z$ as the data vector. Corresponding to $L(\theta; R'y)$ is the log-likelihood function $\ell(\theta; R'y) = \log L(\theta; R'y)$. We have that $z \sim N[0, R'V(\theta)R]$, and it follows that
$$\ell(\theta; R'y) = -\frac{N - \mathrm{rank}\,X}{2}\log(2\pi) - \frac{1}{2}\log|R'V(\theta)R| - \frac{1}{2}y'R[R'V(\theta)R]^{-1}R'y \tag{9.26}$$
—recall that $R'V(\theta)R$ is nonsingular.
REML estimates of functions of $\theta$ are obtained from a value, say $\hat{\theta}$, of $\theta$ at which $L(\theta; R'y)$ or, equivalently, $\ell(\theta; R'y)$ attains its maximum value. By way of comparison, (ordinary) ML estimates of such functions are obtained from a value, say $\tilde{\theta}$, at which the profile likelihood function $L_*(\theta; y)$ or profile log-likelihood function $\ell_*(\theta; y)$ attains its maximum value; the (ordinary) ML estimate of a function $h(\theta)$ of $\theta$ is $h(\tilde{\theta})$, whereas the REML estimate is $h(\hat{\theta})$. It is of potential interest to compare $\ell(\theta; R'y)$ with $\ell_*(\theta; y)$. Expressions for $\ell_*(\theta; y)$ are given by results (9.13), (9.14), and (9.15). However, expression (9.26) [for $\ell(\theta; R'y)$] is not of a form that facilitates meaningful comparisons with any of those expressions. Moreover, depending on the nature of the variance-covariance matrix $V(\theta)$ (and on the choice of the matrix $R$), expression (9.26) may not be well-suited for computational purposes [such as in computing the values of $\ell(\theta; R'y)$ corresponding to various values of $\theta$].
For purposes of obtaining a more useful expression for $\ell(\theta; R'y)$, take $S$ to be any matrix (with $N$ rows) whose columns span $\mathcal{C}(X)$, that is, any matrix such that $\mathcal{C}(S) = \mathcal{C}(X)$ (in which case $S = XA$ for some matrix $A$). And, temporarily (for the sake of simplicity) writing $V$ for $V(\theta)$, observe that
$$(V^{-1}S, R)'V(V^{-1}S, R) = \mathrm{diag}(S'V^{-1}S,\; R'VR) \tag{9.27}$$
and [in light of result (2.5.5), Corollary 2.13.12, and Corollary 5.9.3] that
$$\mathrm{rank}[(V^{-1}S, R)'V(V^{-1}S, R)] = \mathrm{rank}[\mathrm{diag}(S'V^{-1}S, R'VR)] = \mathrm{rank}(S'V^{-1}S) + \mathrm{rank}(R'VR) = \mathrm{rank}(S) + \mathrm{rank}(R) = \mathrm{rank}(X) + N - \mathrm{rank}(X) = N. \tag{9.28}$$
Result (9.28) implies (in light of Corollary 5.9.3) that
$$\mathrm{rank}(V^{-1}S, R) = N \tag{9.29}$$
or, equivalently, that $(V^{-1}S, R)$ is of full row rank. Thus, upon applying formula (9.25), it follows from result (9.27) that
$$V^{-1} = (V^{-1}S, R)\,\mathrm{diag}[(S'V^{-1}S)^-,\,(R'VR)^{-1}]\,(V^{-1}S, R)' = V^{-1}S(S'V^{-1}S)^- S'V^{-1} + R(R'VR)^{-1}R' \tag{9.30}$$
and hence that
$$R(R'VR)^{-1}R' = V^{-1} - V^{-1}S(S'V^{-1}S)^- S'V^{-1}. \tag{9.31}$$
Moreover, as a special case of equality (9.31) (that where $S = X$), we obtain the following expression for $R(R'VR)^{-1}R'$ [a quantity which appears in the 3rd term of expression (9.26) for $\ell(\theta; R'y)$]:
$$R(R'VR)^{-1}R' = V^{-1} - V^{-1}X(X'V^{-1}X)^- X'V^{-1}. \tag{9.32}$$
Now, consider the quantity $|R'VR|$ [which appears in the 2nd term of expression (9.26)]. Take $X_*$ to be any $N \times (\mathrm{rank}\,X)$ matrix whose columns are linearly independent columns of $X$ or, more generally, whose columns form a basis for $\mathcal{C}(X)$ (in which case $X_* = XA$ for some matrix $A$). Observing that
$$(X_*, R)'(X_*, R) = \mathrm{diag}(X_*'X_*,\; R'R)$$
and making use of basic properties of determinants, we find that
$$|(X_*, R)'V(X_*, R)| = |(X_*, R)'|\,|(X_*, R)|\,|V| = |(X_*, R)'(X_*, R)|\,|V| = |\mathrm{diag}(X_*'X_*, R'R)|\,|V| = |X_*'X_*|\,|R'R|\,|V|. \tag{9.33}$$
And making use of formula (2.14.29) for the determinant of a partitioned matrix, we find that
$$|(X_*, R)'V(X_*, R)| = \begin{vmatrix} X_*'VX_* & X_*'VR \\ R'VX_* & R'VR \end{vmatrix} = |R'VR|\,\big|X_*'VX_* - X_*'VR(R'VR)^{-1}R'VX_*\big| = |R'VR|\,\big|X_*'[V - VR(R'VR)^{-1}R'V]X_*\big|. \tag{9.34}$$
Moreover, as a special case of equality (9.30) (that where $S = X_*$), we have (since, in light of Corollary 5.9.3, $X_*'V^{-1}X_*$ is nonsingular) that
$$V^{-1} = V^{-1}X_*(X_*'V^{-1}X_*)^{-1}X_*'V^{-1} + R(R'VR)^{-1}R'$$
and (upon premultiplying and postmultiplying by $V$ and rearranging terms) that
$$V - VR(R'VR)^{-1}R'V = X_*(X_*'V^{-1}X_*)^{-1}X_*'. \tag{9.35}$$
Upon replacing $V - VR(R'VR)^{-1}R'V$ with expression (9.35), result (9.34) simplifies as follows:
$$|(X_*, R)'V(X_*, R)| = |R'VR|\,\big|X_*'X_*(X_*'V^{-1}X_*)^{-1}X_*'X_*\big| = |R'VR|\,|X_*'X_*|^2 / |X_*'V^{-1}X_*|. \tag{9.36}$$
It remains to equate expressions (9.33) and (9.36); doing so leads to the following expression for $|R'VR|$:
$$|R'VR| = |R'R|\,|V|\,|X_*'V^{-1}X_*| / |X_*'X_*|. \tag{9.37}$$
Upon substituting expressions (9.32) and (9.37) [for $R(R'VR)^{-1}R'$ and $|R'VR|$] into expression (9.26), we find that the REML log-likelihood function $\ell(\theta; R'y)$ is reexpressible as follows:
$$\ell(\theta; R'y) = -\frac{N - \mathrm{rank}\,X}{2}\log(2\pi) - \frac{1}{2}\log|R'R| + \frac{1}{2}\log|X_*'X_*| - \frac{1}{2}\log|V(\theta)| - \frac{1}{2}\log\big|X_*'[V(\theta)]^{-1}X_*\big|$$
$$\qquad - \frac{1}{2}y'\big([V(\theta)]^{-1} - [V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^- X'[V(\theta)]^{-1}\big)y. \tag{9.38}$$
If $R$ is taken to be a matrix whose columns form an orthonormal basis for $\mathcal{N}(X')$, then the second term of expression (9.38) equals 0; similarly, if $X_*$ is taken to be a matrix whose columns form an orthonormal basis for $\mathcal{C}(X)$, then the third term of expression (9.38) equals 0. However, what is more important is that the choice of $R$ affects expression (9.38) only through its second term, which is a constant (i.e., does not involve $\theta$). And for any two choices of $X_*$, say $X_{*1}$ and $X_{*2}$, $X_{*2} = X_{*1}B$ for some matrix $B$ (which is necessarily nonsingular), implying that
$$-\frac{1}{2}\log\big|X_{*2}'[V(\theta)]^{-1}X_{*2}\big| = -\frac{1}{2}\log\big|B'X_{*1}'[V(\theta)]^{-1}X_{*1}B\big| = -\log|\det B| - \frac{1}{2}\log\big|X_{*1}'[V(\theta)]^{-1}X_{*1}\big|$$
and, similarly, that
$$\frac{1}{2}\log|X_{*2}'X_{*2}| = \log|\det B| + \frac{1}{2}\log|X_{*1}'X_{*1}|,$$
so that the only effect on expression (9.38) of a change in the choice of $X_*$ from $X_{*1}$ to $X_{*2}$ is to add a constant to the third term and to subtract the same constant from the fifth term. Thus, the choice of $R$ and the choice of $X_*$ are immaterial.
The last term of expression (9.38) can be reexpressed in terms of an arbitrary solution, say $\tilde{\beta}(\theta)$, to the linear system
\[ X'[V(\theta)]^{-1}X\,b = X'[V(\theta)]^{-1}y \tag{9.39} \]
(in the $P \times 1$ vector $b$)—recall (from Subsection a) that this linear system is consistent, that $X\tilde{\beta}(\theta)$ does not depend on the choice of $\tilde{\beta}(\theta)$, and that the choices for $\tilde{\beta}(\theta)$ include the vector $\{X'[V(\theta)]^{-1}X\}^{-}X'[V(\theta)]^{-1}y$. We find that
\[ y'\bigl\{[V(\theta)]^{-1} - [V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^{-}X'[V(\theta)]^{-1}\bigr\}\, y = y'[V(\theta)]^{-1}y - [\tilde{\beta}(\theta)]'X'[V(\theta)]^{-1}y \tag{9.40} \]
\[ {} = [y - X\tilde{\beta}(\theta)]'[V(\theta)]^{-1}[y - X\tilde{\beta}(\theta)]. \tag{9.41} \]
It is informative to compare expression (9.38) for $\ell(\theta; R'y)$ with expression (9.15) for the profile log-likelihood function $\ell_*(\theta; y)$. Aside from the terms that do not depend on $\theta$ [the 1st term of expression (9.15) and the first 3 terms of expression (9.38)], the only difference between the two expressions is the inclusion in expression (9.38) of the term $-\frac{1}{2}\log|X_*'[V(\theta)]^{-1}X_*|$. This term depends on $\theta$, but not on $y$. Its inclusion serves to adjust the profile log-likelihood function $\ell_*(\theta; y)$ so as to compensate for the failure of ordinary ML (in estimating functions of $\theta$) to account for the estimation of $\beta$. Unlike the profile log-likelihood function, $\ell(\theta; R'y)$ is the logarithm of an actual likelihood function and, consequently, has the properties thereof—it is the logarithm of the likelihood function $L(\theta; R'y)$ obtained by regarding the observed value $R'y$ of $z$ as the data vector.

If the form of the $N \times N$ matrix $V(\theta)$ is such that $V(\theta)$ is relatively easy to invert (as is often the case in practice), then expression (9.38) for $\ell(\theta; R'y)$ is likely to be much more useful for computational purposes than expression (9.26). Expression (9.38) [along with expression (9.40) or (9.41)] serves to relate the numerical evaluation of $\ell(\theta; R'y)$ for any particular value of $\theta$ to the solution of the linear system (9.39), comprising $P$ equations in $P$ "unknowns."
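To make the computational recipe concrete, here is a minimal sketch (assuming NumPy; $V(\theta)$ is supplied as a callable returning a positive definite matrix, and the generalized inverse is taken to be the Moore–Penrose inverse—both are illustrative choices, not prescriptions) that evaluates expression (9.38) via the residual form (9.41).

```python
import numpy as np

def reml_loglik(theta, y, X, Xstar, R, V_of_theta):
    """Evaluate the REML log-likelihood (9.38), using the residual form (9.41).

    V_of_theta(theta) must return the N x N positive definite matrix V(theta);
    Xstar has columns forming a basis for C(X); R has columns spanning N(X').
    """
    V = V_of_theta(theta)
    Vinv = np.linalg.inv(V)
    N, rank_X = X.shape[0], np.linalg.matrix_rank(X)
    # any solution of the linear system (9.39) will do
    beta_tilde = np.linalg.pinv(X.T @ Vinv @ X) @ X.T @ Vinv @ y
    resid = y - X @ beta_tilde
    return (-(N - rank_X) / 2 * np.log(2 * np.pi)
            - 0.5 * np.linalg.slogdet(R.T @ R)[1]
            + 0.5 * np.linalg.slogdet(Xstar.T @ Xstar)[1]
            - 0.5 * np.linalg.slogdet(V)[1]
            - 0.5 * np.linalg.slogdet(Xstar.T @ Vinv @ Xstar)[1]
            - 0.5 * resid @ Vinv @ resid)
```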
Special case: Aitken model. Let us now specialize to the case where $y$ follows an Aitken model (and where $H$ is nonsingular). As in Subsection a, this case is to be regarded as the special case of the general linear model where $T = 1$, where $\theta = (\sigma^2)$, and where $V(\theta) = \sigma^2 H$. In this special case, linear system (9.39) is equivalent to (i.e., has the same solutions as) the linear system
\[ X'H^{-1}X\,b = X'H^{-1}y, \tag{9.42} \]
comprising the Aitken equations. And taking $\tilde{\beta}$ to be any solution to linear system (9.42), we find
[in light of results (9.38) and (9.41)] that the log-likelihood function $\ell(\sigma^2; R'y)$ is expressible as
\[ \begin{aligned} \ell(\sigma^2; R'y) = {}& -\frac{N - \operatorname{rank} X}{2}\log(2\pi) - \frac{1}{2}\log|R'R| + \frac{1}{2}\log|X_*'X_*| \\ & -\frac{1}{2}\log|H| - \frac{1}{2}\log|X_*'H^{-1}X_*| - \frac{N - \operatorname{rank} X}{2}\log\sigma^2 \\ & -\frac{1}{2\sigma^2}\,(y - X\tilde{\beta})'H^{-1}(y - X\tilde{\beta}). \end{aligned} \tag{9.43} \]
Unless $y - X\tilde{\beta} = 0$ (which is an event of probability 0), $\ell(\sigma^2; R'y)$ is of the form of the function $g(\sigma^2)$ defined (in Part 1 of Subsection a) by equality (9.4); upon setting $a = -[(N - \operatorname{rank} X)/2]\log(2\pi) - (1/2)\log|R'R| + (1/2)\log|X_*'X_*| - (1/2)\log|H| - (1/2)\log|X_*'H^{-1}X_*|$, $c = (y - X\tilde{\beta})'H^{-1}(y - X\tilde{\beta})$, and $K = N - \operatorname{rank} X$, $g(\sigma^2) = \ell(\sigma^2; R'y)$. Accordingly, it follows from the results of Part 1 of Subsection a that $\ell(\sigma^2; R'y)$ attains its maximum value when $\sigma^2$ equals
\[ \frac{(y - X\tilde{\beta})'H^{-1}(y - X\tilde{\beta})}{N - \operatorname{rank} X}. \tag{9.44} \]
The quantity (9.44) is the REML estimate of $\sigma^2$; it is the estimate obtained by dividing $(y - X\tilde{\beta})'H^{-1}(y - X\tilde{\beta})$ by $N - \operatorname{rank} X$. It differs from the (ordinary) ML estimate of $\sigma^2$, which (as is evident from the results of Subsection a) is obtained by dividing $(y - X\tilde{\beta})'H^{-1}(y - X\tilde{\beta})$ by $N$. Note that in the further special case of the G–M model (i.e., the further special case where $H = I$), the Aitken equations simplify to the normal equations $X'Xb = X'y$ and expression (9.44) (for the REML estimate) is [upon setting $\tilde{\beta} = (X'X)^{-}X'y$] reexpressible as $[(I - P_X)y]'(I - P_X)y/(N - \operatorname{rank} X)$, in agreement with the expression for the REML estimator [expression (9.24)] derived in a previous part of the present subsection.
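The following sketch (assuming NumPy; the data and the matrix $H$ are small simulated examples, not taken from the text) computes a solution of the Aitken equations (9.42) and compares the REML estimate (9.44) of $\sigma^2$ with the corresponding ML estimate.

```python
import numpy as np

def aitken_sigma2_estimates(y, X, H):
    """Return (REML, ML) estimates of sigma^2 under an Aitken model with known H."""
    Hinv = np.linalg.inv(H)
    # any solution of the Aitken equations X'H^{-1}X b = X'H^{-1} y
    beta_tilde = np.linalg.pinv(X.T @ Hinv @ X) @ X.T @ Hinv @ y
    resid = y - X @ beta_tilde
    quad = resid @ Hinv @ resid
    N, rank_X = X.shape[0], np.linalg.matrix_rank(X)
    return quad / (N - rank_X), quad / N   # REML divides by N - rank X, ML by N

# illustrative use (arbitrary simulated data)
rng = np.random.default_rng(1)
N, P = 50, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, P - 1))])
H = np.diag(rng.uniform(0.5, 2.0, N))      # known, nonsingular H
y = X @ np.array([1.0, 2.0, -1.0]) + rng.multivariate_normal(np.zeros(N), 0.8 * H)
print(aitken_sigma2_estimates(y, X, H))
```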
c. Elliptical distributions
The results of Subsections a and b (on the ML and REML estimation of functions of the parameters
of a G–M, Aitken, or general linear model) were obtained under the assumption that the distribution
of the vector e of residual effects is MVN. Some of the properties of the MVN distribution extend (in
a relatively straightforward way) to a broader class of distributions called elliptical distributions (or
elliptically contoured or elliptically symmetric distributions). Elliptical distributions are introduced
(and some of their basic properties described) in the present subsection—this follows the presentation
(in Part 1 of the present subsection) of a useful result on orthogonal matrices. Then, in Subsection
d, the results of Subsections a and b are revisited with the intent of obtaining extensions suitable for
G–M, Aitken, or general linear models when the form of the distribution of the vector e of residual
effects is taken to be that of an elliptical distribution other than a multivariate normal distribution.
A matrix lemma.

Lemma 5.9.9. For any two $M$-dimensional column vectors $x_1$ and $x_2$, $x_2'x_2 = x_1'x_1$ if and only if there exists an $M \times M$ orthogonal matrix $O$ such that $x_2 = Ox_1$.

Proof. If there exists an orthogonal matrix $O$ such that $x_2 = Ox_1$, then, clearly,
\[ x_2'x_2 = (Ox_1)'Ox_1 = x_1'O'Ox_1 = x_1'x_1. \]
For purposes of establishing the converse, take $u = (1, 0, 0, \ldots, 0)'$ to be the first column of $I_M$, and assume that both $x_1$ and $x_2$ are nonnull—if $x_2'x_2 = x_1'x_1$ and either $x_1$ or $x_2$ is null, then both $x_1$ and $x_2$ are null, in which case $x_2 = Ox_1$ for any $M \times M$ orthogonal matrix $O$. And for $i = 1, 2$, define
\[ P_i = I - 2(v_i'v_i)^{-1}v_i v_i', \]
where $v_i = x_i - (x_i'x_i)^{1/2}u$—if $v_i = 0$, take $P_i = I$. The two matrices $P_1$ and $P_2$ are Householder matrices; they are orthogonal and are such that, for $i = 1, 2$, $P_i x_i = (x_i'x_i)^{1/2}u$—refer, e.g., to Golub and Van Loan (2013, sec. 5.1.2). Thus, if $x_2'x_2 = x_1'x_1$, then
\[ P_2 x_2 = (x_2'x_2)^{1/2}u = (x_1'x_1)^{1/2}u = P_1 x_1, \]
implying that
\[ x_2 = P_2'P_1 x_1 \]
and hence (since $P_2'P_1$ is orthogonal) that there exists an orthogonal matrix $O$ such that $x_2 = Ox_1$. Q.E.D.
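As a small illustration of the construction used in the proof (a sketch assuming NumPy; the vectors are arbitrary examples), the Householder matrices $P_1$ and $P_2$ can be formed directly and the resulting orthogonal matrix $O = P_2'P_1$ checked to satisfy $x_2 = Ox_1$.

```python
import numpy as np

def householder(x):
    """Householder matrix P with P x = (x'x)^{1/2} e_1 (P = I when v = 0)."""
    M = x.size
    u = np.zeros(M); u[0] = 1.0
    v = x - np.sqrt(x @ x) * u
    if np.allclose(v, 0):
        return np.eye(M)
    return np.eye(M) - 2.0 * np.outer(v, v) / (v @ v)

rng = np.random.default_rng(2)
x1 = rng.standard_normal(4)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
x2 = Q @ x1                                   # same squared length as x1
O = householder(x2).T @ householder(x1)       # O = P2' P1, orthogonal
print(np.allclose(x2, O @ x1))                # True
```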
Spherical distributions. Elliptical distributions are defined in terms of spherical distributions (which are themselves elliptical distributions, albeit of a relatively simple kind). An $M \times 1$ random vector $z$ is said to have a spherical (or spherically symmetric) distribution if, for every $M \times M$ orthogonal matrix $O$, the distribution of $Oz$ is the same as that of $z$. For example, the $N(0, \sigma^2 I)$ distribution (where $\sigma$ is any nonnegative scalar) is a spherical distribution.

Suppose that the distribution of the $M$-dimensional random vector $z = (z_1, z_2, \ldots, z_M)'$ is spherical. Then, upon observing that $-I_M$ is an orthogonal matrix, we find that
\[ -z = -I_M z \sim z. \tag{9.45} \]
Thus, a spherical distribution is symmetric. And, it follows, in particular, that if $E(z)$ exists, then
\[ E(z) = 0. \tag{9.46} \]
Further, if the second-order moments of the distribution of $z$ exist, then
\[ \operatorname{var}(z) = cI \tag{9.47} \]
for some nonnegative scalar $c$.
To verify result (9.47), take $O_i$ to be the $M \times M$ orthogonal matrix obtained by interchanging the first and $i$th rows of $I_M$, and take $P_i$ to be the $M \times M$ orthogonal matrix obtained by multiplying the $i$th row of $I_M$ by $-1$. Then, upon observing that $O_i z \sim z$ and that $z_i$ is the first element of $O_i z$, we find that
\[ z_i \sim z_1. \tag{9.48} \]
And upon observing that $P_i z \sim z$ and that $-z_i$ is the $i$th element of $P_i z$, we find that (for $j > i$)
\[ \begin{pmatrix} -z_i \\ z_j \end{pmatrix} \sim \begin{pmatrix} z_i \\ z_j \end{pmatrix} \]
and hence that
\[ -z_i z_j \sim z_i z_j. \tag{9.49} \]
It follows from equality (9.48) that the diagonal elements of $\operatorname{var}(z)$ have a common value $c$ and from equality (9.49) that the off-diagonal elements of $\operatorname{var}(z)$ [the $ij$th of which equals $E(z_i z_j)$] are 0.
According to result (9.47), the $M$ elements $z_1, z_2, \ldots, z_M$ of the spherically distributed random vector $z$ are uncorrelated. However, it is only in the special case where the distribution of $z$ is MVN that $z_1, z_2, \ldots, z_M$ are statistically independent—refer, e.g., to Kollo and von Rosen (2005, sec. 2.3) or to Fang, Kotz, and Ng (1990, sec. 4.3) for a proof.

The variance-covariance matrix of the spherically distributed random vector $z$ is a scalar multiple $cI$ of $I$. Note that [aside from the degenerate special case where $\operatorname{var}(z) = 0$ or, equivalently, where $z = 0$ with probability 1] the elements of $z$ can be rescaled by dividing each of them by $\sqrt{c}$, the effect of which is to transform $z$ into the vector $c^{-1/2}z$ whose variance-covariance matrix is $I$. Note also that, like $z$, the transformed vector $c^{-1/2}z$ has a spherical distribution.
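As a small empirical illustration (a sketch assuming NumPy; the particular spherical distribution simulated—a uniformly random direction scaled by an independent random radius—is an arbitrary choice), the sample variance-covariance matrix of a spherically distributed vector is close to a scalar multiple of $I$, and the elements are uncorrelated but not independent.

```python
import numpy as np

rng = np.random.default_rng(3)
M, n = 4, 200_000
# a spherical (non-Gaussian) vector: uniform direction times an independent radius
g = rng.standard_normal((n, M))
direction = g / np.linalg.norm(g, axis=1, keepdims=True)
radius = rng.uniform(0.0, 3.0, size=(n, 1))
z = radius * direction

print(np.round(np.cov(z, rowvar=False), 2))        # approximately c * I
print(np.round(np.cov(z[:, 0]**2, z[:, 1]**2), 3)) # squared elements are correlated here,
                                                   # so the elements are not independent
```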
Pdf of a spherical distribution. Take $z = (z_1, z_2, \ldots, z_M)'$ to be an $M$-dimensional random (column) vector that has an absolutely continuous distribution with pdf $f(\cdot)$. Clearly, whether or not this distribution is spherical depends on the nature of the pdf.
Define $u = Oz$, where $O$ is an arbitrary $M \times M$ orthogonal matrix, and denote by $u_i$ the $i$th element of $u$. Then, the distribution of $u$ has as a pdf the function $h(\cdot)$ obtained by taking (for every value of $u$)
\[ h(u) = |\det J|\, f(O'u), \]
where $J$ is the $M \times M$ matrix with $ij$th element $\partial z_i / \partial u_j$ (e.g., Bickel and Doksum 2001, sec. B.2). Moreover, $J = O'$, implying (in light of Corollary 2.14.19) that $\det J = \pm 1$. Thus,
\[ h(u) = f(O'u) \quad \text{or, equivalently,} \quad h(Oz) = f(z). \]
And upon observing [in light of the fundamental theorem of (integral) calculus (e.g., Billingsley 1995)] that $u \sim z$ if and only if $h(Oz) = f(Oz)$ (with probability 1), it follows that $u \sim z$ if and only if
\[ f(Oz) = f(z) \quad \text{(with probability 1).} \tag{9.50} \]
In effect, we have established that $z$ has a spherical distribution if and only if, for every orthogonal matrix $O$, the pdf $f(\cdot)$ satisfies condition (9.50). Now, suppose that $f(z)$ depends on the value of $z$ only through $z'z$ or, equivalently, that there exists a (nonnegative) function $g(\cdot)$ (of a single nonnegative variable) such that
\[ f(z) = g(z'z) \quad \text{(for every value of } z\text{).} \tag{9.51} \]
Clearly, if $f(\cdot)$ is of the form (9.51), then, for every orthogonal matrix $O$, $f(\cdot)$ satisfies condition (9.50) and, in fact, satisfies the more stringent condition
\[ f(Oz) = f(z) \quad \text{(for every value of } z\text{).} \tag{9.52} \]
Thus, if $f(\cdot)$ is of the form (9.51), then the distribution of $z$ is spherical.
Consider the converse. Suppose that the distribution of $z$ is spherical and hence that, for every orthogonal matrix $O$, $f(\cdot)$ satisfies condition (9.50). Is $f(\cdot)$ necessarily of the form (9.51)? If, for every orthogonal matrix $O$, $f(\cdot)$ satisfies condition (9.52), then the answer is yes.

To see this, suppose that (for every orthogonal matrix $O$) $f(\cdot)$ satisfies condition (9.52). Then, for any $M \times 1$ vectors $z_1$ and $z_2$ such that $z_2'z_2 = z_1'z_1$, we find [upon observing (in light of Lemma 5.9.9) that $z_2 = Oz_1$ for some orthogonal matrix $O$] that $f(z_2) = f(z_1)$. Thus, for any particular (nonnegative) constant $c$, $f(z)$ has the same value for every $z$ for which $z'z = c$. And it follows that there exists a function $g(\cdot)$ for which $f(\cdot)$ is expressible in the form (9.51).

Subsequently, there will be occasion to refer to the distribution of an $M$-dimensional random column vector that is absolutely continuous with a pdf $f(\cdot)$ of the form (9.51). Accordingly, as a matter of convenience, let us interpret any reference to an absolutely continuous spherical distribution as a reference to a distribution with those characteristics.
Let $g(\cdot)$ represent a nonnegative function whose domain is the interval $[0, \infty)$. And let $z = (z_1, z_2, \ldots, z_M)'$ represent an $M \times 1$ vector of (unrestricted) variables, and suppose that
\[ 0 < \int_{\mathbb{R}^M} g(z'z)\, dz < \infty. \tag{9.53} \]
Further, take $f(z)$ to be the (nonnegative) function of $z$ defined by
\[ f(z) = c^{-1} g(z'z), \tag{9.54} \]
where $c = \int_{\mathbb{R}^M} g(z'z)\, dz$ (and observe that $\int_{\mathbb{R}^M} f(z)\, dz = 1$). Then, there is an absolutely continuous distribution (of an $M \times 1$ random vector) having $f(\cdot)$ as a pdf, and [since $f(\cdot)$ is of the form (9.51)] that distribution is spherical.
The $M$-dimensional integral $\int_{\mathbb{R}^M} g(z'z)\, dz$ can be simplified. Clearly,
\[ \int_{\mathbb{R}^M} g(z'z)\, dz = 2^M \int_0^\infty \int_0^\infty \cdots \int_0^\infty g\Bigl(\textstyle\sum_{i=1}^M z_i^2\Bigr)\, dz_1\, dz_2 \cdots dz_M. \]
Upon making the change of variables $u_i = z_i^2$ ($i = 1, 2, \ldots, M$) and observing that $\partial z_i / \partial u_i = (1/2)u_i^{-1/2}$, we find that
\[ \begin{aligned} \int_{\mathbb{R}^M} g(z'z)\, dz &= 2^M \int_0^\infty \int_0^\infty \cdots \int_0^\infty g\Bigl(\textstyle\sum_{i=1}^M u_i\Bigr)\, \bigl(\tfrac{1}{2}\bigr)^{M} \prod_{i=1}^M u_i^{-1/2}\, du_1\, du_2 \cdots du_M \\ &= \int_0^\infty \int_0^\infty \cdots \int_0^\infty g\Bigl(\textstyle\sum_{i=1}^M u_i\Bigr) \prod_{i=1}^M u_i^{-1/2}\, du_1\, du_2 \cdots du_M. \end{aligned} \]
And upon making the further change of variables $y_i = u_i$ ($i = 1, 2, \ldots, M-1$), $y_M = \sum_{i=1}^M u_i$ and observing that the $M \times M$ matrix with $ij$th element $\partial u_i / \partial y_j$ equals $\begin{pmatrix} I & 0 \\ -\mathbf{1}' & 1 \end{pmatrix}$ (the determinant of which equals 1), we find that
\[ \int_{\mathbb{R}^M} g(z'z)\, dz = \int_D g(y_M)\, \prod_{i=1}^{M-1} y_i^{-1/2}\, \Bigl(y_M - \textstyle\sum_{i=1}^{M-1} y_i\Bigr)^{-1/2} dy_1\, dy_2 \cdots dy_M, \]
where $D = \{y_1, y_2, \ldots, y_M : y_i \ge 0\ (i = 1, 2, \ldots, M-1),\ y_M \ge \sum_{i=1}^{M-1} y_i\}$. Moreover, upon making yet another change of variables $w_i = y_i / y_M$ ($i = 1, 2, \ldots, M-1$), $w_M = y_M$ and observing that the $M \times M$ matrix with $ij$th element $\partial y_i / \partial w_j$ equals $\begin{pmatrix} w_M I & (w_1, w_2, \ldots, w_{M-1})' \\ 0 & 1 \end{pmatrix}$ (the determinant of which equals $w_M^{M-1}$), we find that
\[ \int_{\mathbb{R}^M} g(z'z)\, dz = \int_{D_*} \prod_{i=1}^{M-1} w_i^{-1/2}\, \Bigl(1 - \textstyle\sum_{i=1}^{M-1} w_i\Bigr)^{-1/2} dw_1\, dw_2 \cdots dw_{M-1} \;\times\; \int_0^\infty w_M^{(M/2)-1} g(w_M)\, dw_M, \tag{9.55} \]
where $D_* = \{w_1, w_2, \ldots, w_{M-1} : w_i \ge 0\ (i = 1, 2, \ldots, M-1),\ \sum_{i=1}^{M-1} w_i \le 1\}$.

According to a basic result on the normalizing constant for the pdf of a Dirichlet distribution—the Dirichlet distribution is the subject of Section 6.1e—
\[ \int_{D_*} \prod_{i=1}^{M-1} w_i^{-1/2}\, \Bigl(1 - \textstyle\sum_{i=1}^{M-1} w_i\Bigr)^{-1/2} dw_1\, dw_2 \cdots dw_{M-1} = \frac{[\Gamma(1/2)]^M}{\Gamma(M/2)} = \frac{\pi^{M/2}}{\Gamma(M/2)}. \]
Thus,
\[ \int_{\mathbb{R}^M} g(z'z)\, dz = \frac{\pi^{M/2}}{\Gamma(M/2)} \int_0^\infty w_M^{(M/2)-1} g(w_M)\, dw_M; \tag{9.56} \]
and upon introducing the change of variable $s = w_M^{1/2}$ and observing that $dw_M/ds = 2s$, we find that
\[ \int_{\mathbb{R}^M} g(z'z)\, dz = \frac{2\pi^{M/2}}{\Gamma(M/2)} \int_0^\infty s^{M-1} g(s^2)\, ds. \tag{9.57} \]
In light of result (9.57), the function $g(\cdot)$ satisfies condition (9.53) if and only if
\[ 0 < \int_0^\infty s^{M-1} g(s^2)\, ds < \infty, \]
in which case the constant $c$ in expression (9.54) is expressible in the form (9.57).
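As a numerical check of result (9.57) (a sketch assuming NumPy and SciPy; the kernel chosen, $g(u) = \exp(-u/2)$, is the multivariate normal kernel, for which the constant $c$ is known to be $(2\pi)^{M/2}$), the one-dimensional integral on the right-hand side can be evaluated by quadrature.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def spherical_normalizer(g, M):
    """Constant c = integral of g(z'z) over R^M, computed via formula (9.57)."""
    integral, _ = quad(lambda s: s**(M - 1) * g(s**2), 0, np.inf)
    return 2 * np.pi**(M / 2) / gamma(M / 2) * integral

M = 5
c = spherical_normalizer(lambda u: np.exp(-u / 2), M)   # normal kernel
print(np.isclose(c, (2 * np.pi)**(M / 2)))               # True
```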
Moment generating function of a spherical distribution. Spherical distributions can be characterized in terms of their moment generating functions (or, more generally, their characteristic functions) as well as in terms of their pdfs. Take $z = (z_1, z_2, \ldots, z_M)'$ to be an $M$-dimensional random (column) vector, and suppose that the distribution of $z$ has a moment generating function, say $\phi(\cdot)$. Then, for the distribution of $z$ to be spherical, it is necessary and sufficient that
\[ \phi(Ot) = \phi(t) \quad \text{for every } M \times M \text{ orthogonal matrix } O \text{ (and for every } M \times 1 \text{ vector } t \text{ in a neighborhood of } 0\text{).} \tag{9.58} \]
To see this, let $O$ represent an arbitrary $M \times M$ orthogonal matrix, and observe that (for any $M \times 1$ vector $t$)
\[ \phi(Ot) = E\bigl[e^{(Ot)'z}\bigr] = E\bigl[e^{t'(O'z)}\bigr] \]
and hence that $\phi(Ot) = \phi(t)$ (for every $M \times 1$ vector $t$ in a neighborhood of 0) if and only if $\phi(\cdot)$ is the moment generating function of the distribution of $O'z$ (as well as that of the distribution of $z$), or, equivalently—refer, e.g., to Casella and Berger (2002, p. 65)—if and only if $O'z$ and $z$ have the same distribution.

For the distribution of $z$ to be spherical, it is necessary and sufficient that $\phi(t)$ depend on the $M \times 1$ vector $t$ only through the value of $t't$ or, equivalently, that there exists a function $\psi(\cdot)$ (of a single nonnegative variable) such that
\[ \phi(t) = \psi(t't) \quad \text{(for every } M \times 1 \text{ vector } t \text{ in a neighborhood of } 0\text{).} \tag{9.59} \]
Let us verify the necessity and sufficiency of the existence of a function $\psi(\cdot)$ that satisfies condition (9.59). If there exists a function $\psi(\cdot)$ that satisfies condition (9.59), then for every $M \times M$ orthogonal matrix $O$ (and for every $M \times 1$ vector $t$ in a neighborhood of 0),
\[ \phi(Ot) = \psi[(Ot)'Ot] = \psi(t'O'Ot) = \psi(t't) = \phi(t), \]
so that condition (9.58) is satisfied and, consequently, the distribution of $z$ is spherical.

Conversely, suppose that the distribution of $z$ is spherical and hence that condition (9.58) is satisfied. Then, for "any" $M \times 1$ vectors $t_1$ and $t_2$ such that $t_2't_2 = t_1't_1$, we find [upon observing (in light of Lemma 5.9.9) that $t_2 = Ot_1$ for some orthogonal matrix $O$] that $\phi(t_2) = \phi(t_1)$. Thus, for any sufficiently small nonnegative constant $c$, $\phi(t)$ has the same value for every $M \times 1$ vector $t$ for which $t't = c$. And it follows that there exists a function $\psi(\cdot)$ that satisfies condition (9.59).

What can be said about the nature of the function $\psi(\cdot)$? Clearly, $\psi(0) = 1$. Moreover, $\psi(\cdot)$ is a strictly increasing function. To see this, take $t$ to be any $M \times 1$ vector (of constants) such that $t't = 1$, and observe that for any nonnegative scalar $k$,
\[ \psi(k) = \psi(k\, t't) = \tfrac{1}{2}\psi(k\, t't) + \tfrac{1}{2}\psi[k\,(-t)'(-t)] = \tfrac{1}{2}\, E\bigl[e^{\sqrt{k}\, t'z} + e^{-\sqrt{k}\, t'z}\bigr]. \]
Observe also that (for $k > 0$)
\[ \frac{d\bigl(e^{\sqrt{k}\, t'z} + e^{-\sqrt{k}\, t'z}\bigr)}{dk} = (1/2)\, k^{-1/2}\, t'z\, \bigl(e^{\sqrt{k}\, t'z} - e^{-\sqrt{k}\, t'z}\bigr) > 0 \quad \text{if } t'z \ne 0. \]
Thus (for $k > 0$)
\[ \frac{d\psi(k)}{dk} = \tfrac{1}{2}\, E\biggl\{ \frac{d\bigl(e^{\sqrt{k}\, t'z} + e^{-\sqrt{k}\, t'z}\bigr)}{dk} \biggr\} > 0, \]
which confirms that $\psi(\cdot)$ is a strictly increasing function.
Linear transformation of a spherically distributed random vector. Let $M$ and $N$ represent arbitrary positive integers. And define
\[ x = \mu + \Gamma'z, \tag{9.60} \]
where $\mu$ is an arbitrary $M$-dimensional nonrandom column vector, $\Gamma$ is an arbitrary $N \times M$ nonrandom matrix, and $z$ is an $N$-dimensional spherically distributed random column vector. Further, let $\Sigma = \Gamma'\Gamma$.
If $E(z)$ exists, then $E(x)$ exists and [in light of result (9.46)]
\[ E(x) = \mu. \tag{9.61} \]
And if the second-order moments of the distribution of $z$ exist, then so do those of the distribution of $x$ and [in light of result (9.47)]
\[ \operatorname{var}(x) = c\Sigma, \tag{9.62} \]
where $c$ is the variance of any element of $z$—every element of $z$ has the same variance.

If the distribution of $z$ has a moment generating function, say $\omega(\cdot)$, then there exists a (nonnegative) function $\psi(\cdot)$ (of a single nonnegative variable) such that (for every $N \times 1$ vector $s$ in a neighborhood of 0) $\omega(s) = \psi(s's)$, and the distribution of $x$ has the moment generating function $\phi(\cdot)$, where (for every $M \times 1$ vector $t$ in a neighborhood of 0)
\[ \phi(t) = E\bigl[e^{t'x}\bigr] = E\bigl[e^{t'(\mu + \Gamma'z)}\bigr] = e^{t'\mu}\, E\bigl[e^{(\Gamma t)'z}\bigr] = e^{t'\mu}\, \omega(\Gamma t) = e^{t'\mu}\, \psi(t'\Sigma t). \tag{9.63} \]
Note that the moment generating function of the distribution of $x$ and hence the distribution itself depend on the value of the $N \times M$ matrix $\Gamma$ only through the value of the $M \times M$ matrix $\Sigma$.
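A short simulation sketch (assuming NumPy; the spherical distribution used for $z$—a normal vector divided by an independent chi factor, i.e., a multivariate-$t$ construction—is an arbitrary illustrative choice) confirms results (9.61) and (9.62): the sample mean of $x$ is close to $\mu$ and its sample variance-covariance matrix is close to $c\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, M = 400_000, 3, 2
mu = np.array([1.0, -2.0])
Gamma = rng.standard_normal((N, M))
Sigma = Gamma.T @ Gamma

# a spherical z: normal vector scaled by an independent positive radius factor
nu = 8
z = rng.standard_normal((n, N)) / np.sqrt(rng.chisquare(nu, size=(n, 1)) / nu)
c = nu / (nu - 2)                      # common variance of each element of z
x = mu + z @ Gamma                     # rows are draws of x = mu + Gamma' z

print(np.round(x.mean(axis=0), 2))     # approximately mu                      (9.61)
print(np.allclose(np.cov(x, rowvar=False), c * Sigma, rtol=0.05, atol=0.02))  # ~ c*Sigma (9.62)
```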
Marginal distributions (of spherically distributed random vectors). Let $z$ represent an $N$-dimensional spherically distributed random column vector. And take $z_*$ to be an $M$-dimensional subvector of $z$ (where $M < N$), say the subvector obtained by striking out all of the elements of $z$ save the $i_1, i_2, \ldots, i_M$th elements.

Suppose that the distribution of $z$ has a moment generating function, say $\phi(\cdot)$. Then, necessarily, there exists a (nonnegative) function $\psi(\cdot)$ (of a single nonnegative variable) such that $\phi(s) = \psi(s's)$ (for every $N \times 1$ vector $s$ in a neighborhood of 0). Clearly, the subvector $z_*$ can be regarded as a special case of the random column vector $x$ defined by expression (9.60); it is the special case obtained by setting $\mu = 0$ and taking $\Gamma$ to be the $N \times M$ matrix whose first, second, ..., $M$th columns are, respectively, the $i_1, i_2, \ldots, i_M$th columns of $I_N$. And (in light of the results of the preceding part of the present subsection) it follows that the distribution of $z_*$ has a moment generating function, say $\phi_*(\cdot)$, and that (for every $M \times 1$ vector $t$ in some neighborhood of 0)
\[ \phi_*(t) = \psi(t't). \tag{9.64} \]
Thus, the moment generating function of the distribution of the subvector $z_*$ is characterized by the same function $\psi(\cdot)$ as that of the distribution of $z$ itself.

Suppose now that $u$ is an $M$-dimensional random column vector whose distribution has a moment generating function, say $\omega(\cdot)$, and that (for every $M \times 1$ vector $t$ in a neighborhood of 0) $\omega(t) = \psi(t't)$. Then, the distribution of $u$ is spherical. Moreover, it has the same moment generating function as the distribution of $z_*$ (and, consequently, $u \sim z_*$). There is an implication that the elements of $u$, like those of $z_*$, have the same variance as the elements of $z$.

The moment generating function of a marginal distribution of $z$ (i.e., of the distribution of a subvector of $z$) is characterized by the same function $\psi(\cdot)$ as that of the distribution of $z$ itself. In the case of pdfs, the relationship is more complex.
Suppose that the distribution of the $N$-dimensional spherically distributed random column vector $z$ is an absolutely continuous spherical distribution. Then, the distribution of $z$ has a pdf $f(\cdot)$, where $f(z) = g(z'z)$ for some (nonnegative) function $g(\cdot)$ of a single nonnegative variable (and for every value of $z$). Accordingly, the distribution of the $M$-dimensional subvector $z_*$ is the absolutely continuous distribution with pdf $f_*(\cdot)$ defined (for every value of $z_*$) by
\[ f_*(z_*) = \int_{\mathbb{R}^{N-M}} g(z_*'z_* + \bar{z}_*'\bar{z}_*)\, d\bar{z}_*, \]
where $\bar{z}_*$ is the $(N-M)$-dimensional subvector of $z$ obtained by striking out the $i_1, i_2, \ldots, i_M$th elements. And upon regarding $g(z_*'z_* + w)$ as a function of a nonnegative variable $w$ and applying result (9.57), we find that (for every value of $z_*$)
\[ f_*(z_*) = \frac{2\pi^{(N-M)/2}}{\Gamma[(N-M)/2]} \int_0^\infty s^{N-M-1}\, g(z_*'z_* + s^2)\, ds. \tag{9.65} \]
Clearly, $f_*(z_*)$ depends on the value of $z_*$ only through $z_*'z_*$, so that (as could have been anticipated from our results on the moment generating function of the distribution of a subvector of a spherically distributed random vector) the distribution of $z_*$ is spherical. Further, upon introducing the changes of variable $w = s^2$ and $u = z_*'z_* + w$, we obtain the following variations on expression (9.65):
\[ f_*(z_*) = \frac{\pi^{(N-M)/2}}{\Gamma[(N-M)/2]} \int_0^\infty w^{[(N-M)/2]-1}\, g(z_*'z_* + w)\, dw \tag{9.66} \]
\[ {} = \frac{\pi^{(N-M)/2}}{\Gamma[(N-M)/2]} \int_{z_*'z_*}^\infty (u - z_*'z_*)^{[(N-M)/2]-1}\, g(u)\, du. \tag{9.67} \]
Elliptical distributions: definition. The distribution of a random column vector of the form of the vector $x$ of equality (9.60) is said to be elliptical. And a random column vector whose distribution is that of the vector $x$ of equality (9.60) may be referred to as being distributed elliptically about $\mu$ or, in the special case where $\Gamma = I$ (or where $\Gamma$ is orthogonal), as being distributed spherically about $\mu$. Clearly, a random column vector $x$ is distributed elliptically about $\mu$ if and only if $x - \mu$ is distributed elliptically about 0 and is distributed spherically about $\mu$ if and only if $x - \mu$ is distributed spherically about 0. Let us consider the definition of an elliptical distribution as applied to distributions whose second-order moments exist.

For any $M \times 1$ vector $\mu$ and any $M \times M$ nonnegative definite matrix $\Sigma$, an $M \times 1$ random vector $x$ has an elliptical distribution with mean vector $\mu$ and variance-covariance matrix $\Sigma$ if and only if
\[ x \sim \mu + \Gamma'z \tag{9.68} \]
for some matrix $\Gamma$ such that $\Sigma = \Gamma'\Gamma$ and some random (column) vector $z$ (of compatible dimension) having a spherical distribution with variance-covariance matrix $I$—recall that if a random column vector $z$ has a spherical distribution with a variance-covariance matrix that is a nonzero scalar multiple $cI$ of $I$, then the rescaled vector $c^{-1/2}z$ has a spherical distribution with variance-covariance matrix $I$. In connection with condition (9.68), define $K = \operatorname{rank}\Sigma$, and denote by $N$ the number of rows in the matrix $\Gamma$ (or, equivalently, the number of elements in $z$)—necessarily, $N \ge K$. For any particular $N$, the distribution of $\mu + \Gamma'z$ does not depend on the choice of $\Gamma$ [as is evident (for the case where the distribution of $z$ has a moment generating function) from result (9.63)]; rather, it depends only on $\mu$, $\Sigma$, and the distribution of $z$.

Now, consider the distribution of $\mu + \Gamma'z$ for different choices of $N$. Assume that $K \ge 1$—if $K = 0$, then (for any choice of $N$) $\Gamma = 0$ and hence $\mu + \Gamma'z = \mu$. And take $\Gamma_*$ to be a $K \times M$ matrix such that $\Sigma = \Gamma_*'\Gamma_*$, and take $z_*$ to be a $K \times 1$ random vector having a spherical distribution with variance-covariance matrix $I_K$.

Suppose that the distribution of $z_*$ has a moment generating function, say $\omega_*(\cdot)$. Then, because the distribution of $z_*$ is spherical, there exists a (nonnegative) function $\psi(\cdot)$ (of a single nonnegative variable) such that (for every $K \times 1$ vector $t_*$ in a neighborhood of 0) $\omega_*(t_*) = \psi(t_*'t_*)$.

Take $\omega(t)$ to be the function of an $N \times 1$ vector $t$ defined (for every value of $t$ in some neighborhood of 0) by $\omega(t) = \psi(t't)$. There may or may not exist an ($N$-dimensional) distribution having $\omega(\cdot)$ as a moment generating function. If such a distribution exists, then that distribution is spherical, and for any random vector, say $w$, having that distribution, the distribution of $z_*$ is a marginal distribution of $w$ and $\operatorname{var}(w) = I_N$. Accordingly, if there exists a distribution having $\omega(\cdot)$ as a moment generating function, then the distribution of the random vector $z$ [in expression (9.68)] could be taken to be
that distribution, in which case the distribution of $\mu + \Gamma'z$ would have the same moment generating function as the distribution of $\mu + \Gamma_*'z_*$ [as is evident from result (9.63)] and it would follow that $\mu + \Gamma'z \sim \mu + \Gamma_*'z_*$.

Thus, as long as there exists an ($N$-dimensional) distribution having $\omega(\cdot)$ as a moment generating function [where $\omega(t) = \psi(t't)$] and as long as the distribution of $z$ is taken to be that distribution, the distribution of $\mu + \Gamma'z$ is invariant to the choice of $N$. This invariance extends to every $N$ for which there exists an ($N$-dimensional) distribution having $\omega(\cdot)$ as a moment generating function.

Let us refer to the function $\psi(\cdot)$ as the mgf generator of the distribution of the $M$-dimensional random vector $\mu + \Gamma_*'z_*$ (with mgf being regarded as an acronym for moment generating function). The moment generating function of the distribution of $\mu + \Gamma_*'z_*$ is the function $\phi(\cdot)$ defined (for every $M \times 1$ vector $t$ in a neighborhood of 0) by
\[ \phi(t) = e^{t'\mu}\, \psi(t'\Sigma t) \tag{9.69} \]
[as is evident from result (9.63)]. The distribution of $\mu + \Gamma_*'z_*$ is completely determined by the mean vector $\mu$, the variance-covariance matrix $\Sigma$, and the mgf generator $\psi(\cdot)$. Accordingly, we may refer to this distribution as an ($M$-dimensional) elliptical distribution with mean $\mu$, variance-covariance matrix $\Sigma$, and mgf generator $\psi(\cdot)$. The mgf generator $\psi(\cdot)$ serves to identify the applicable distribution of $z_*$; alternatively, some other characteristic of the distribution of $z_*$ could be used for that purpose (e.g., the pdf). Note that the $N(\mu, \Sigma)$ distribution is an elliptical distribution with mean $\mu$, variance-covariance matrix $\Sigma$, and mgf generator $\psi_N(\cdot)$, where (for every nonnegative scalar $u$) $\psi_N(u) = \exp(u/2)$.
Pdf of an elliptical distribution. Let $x = (x_1, x_2, \ldots, x_M)'$ represent an $M \times 1$ random vector, and suppose that for some $M \times 1$ (nonrandom) vector $\mu$ and some $M \times M$ (nonrandom) positive definite matrix $\Sigma$,
\[ x = \mu + \Gamma'z, \]
where $\Gamma$ is an $M \times M$ (nonsingular) matrix such that $\Sigma = \Gamma'\Gamma$ and where $z = (z_1, z_2, \ldots, z_M)'$ is an $M \times 1$ spherically distributed random vector with variance-covariance matrix $I$. Then, $x$ has an elliptical distribution with mean vector $\mu$ and variance-covariance matrix $\Sigma$.

Now, suppose that the distribution of $z$ is an absolutely continuous spherical distribution. Then, the distribution of $z$ is absolutely continuous with a pdf $h(\cdot)$ defined as follows in terms of some (nonnegative) function $g(\cdot)$ (of a single nonnegative variable) for which $\int_0^\infty s^{M-1} g(s^2)\, ds < \infty$:
\[ h(z) = c^{-1} g(z'z), \]
where $c = [2\pi^{M/2}/\Gamma(M/2)] \int_0^\infty s^{M-1} g(s^2)\, ds$. And the distribution of $x$ is absolutely continuous with a pdf, say $f(\cdot)$, that is derivable from the pdf of the distribution of $z$.

Let us derive an expression for $f(x)$. Clearly, $z = (\Gamma')^{-1}(x - \mu)$, and the $M \times M$ matrix with $ij$th element $\partial z_i / \partial x_j$ equals $(\Gamma')^{-1}$. Moreover,
\[ |\det(\Gamma')^{-1}| = |\det \Gamma'|^{-1} = [(\det \Gamma')^2]^{-1/2} = [(\det \Gamma')\det \Gamma]^{-1/2} = [\det(\Gamma'\Gamma)]^{-1/2} = (\det \Sigma)^{-1/2}. \]
Thus, making use of standard results on a change of variables (e.g., Bickel and Doksum 2001, sec. B.2) and observing that $\Sigma^{-1} = \Gamma^{-1}(\Gamma')^{-1} = [(\Gamma')^{-1}]'(\Gamma')^{-1}$, we find that
\[ f(x) = c^{-1} |\Sigma|^{-1/2}\, g[(x - \mu)'\Sigma^{-1}(x - \mu)]. \tag{9.70} \]
Linear transformation of an elliptically distributed random vector. Let $x$ represent an $N \times 1$ random vector that has an ($N$-dimensional) elliptical distribution with mean $\mu$, variance-covariance matrix $\Sigma$, and (if $\Sigma \ne 0$) mgf generator $\psi(\cdot)$. And take $y$ to be the $M \times 1$ random vector obtained by transforming $x$ as follows:
\[ y = c + Ax, \tag{9.71} \]
where $c$ is an $M \times 1$ (nonrandom) vector and $A$ an $M \times N$ (nonrandom) matrix. Then, $y$ has an ($M$-dimensional) elliptical distribution with mean $c + A\mu$, variance-covariance matrix $A\Sigma A'$, and (if $A\Sigma A' \ne 0$) mgf generator $\psi(\cdot)$ (identical to the mgf generator of the distribution of $x$).

Let us verify that $y$ has this distribution. Define $K = \operatorname{rank}\Sigma$. And suppose that $K > 0$ (or, equivalently, that $\Sigma \ne 0$), in which case
\[ x \sim \mu + \Gamma'z, \]
where $\Gamma$ is any $K \times N$ (nonrandom) matrix such that $\Sigma = \Gamma'\Gamma$ and where $z$ is a $K \times 1$ random vector having a spherical distribution with a moment generating function $\omega(\cdot)$ defined (for every $K \times 1$ vector $s$ in a neighborhood of 0) by $\omega(s) = \psi(s's)$. Then,
\[ y \sim c + A(\mu + \Gamma'z) = c + A\mu + (\Gamma A')'z. \tag{9.72} \]
Now, let $K_* = \operatorname{rank}(A\Sigma A')$, and observe that $K_* \le K$ and that $A\Sigma A' = (\Gamma A')'\Gamma A'$. Further, suppose that $K_* > 0$ (or, equivalently, that $A\Sigma A' \ne 0$), take $\Gamma_*$ to be any $K_* \times M$ (nonrandom) matrix such that $A\Sigma A' = \Gamma_*'\Gamma_*$, and take $z_*$ to be a $K_* \times 1$ random vector having a distribution that is a marginal distribution of $z$ and that, consequently, has a moment generating function $\omega_*(\cdot)$ defined (for every $K_* \times 1$ vector $s_*$ in a neighborhood of 0) by $\omega_*(s_*) = \psi(s_*'s_*)$. Then, it follows from what was established earlier (in defining elliptical distributions) that
\[ c + A\mu + (\Gamma A')'z \sim c + A\mu + \Gamma_*'z_*, \]
which [in combination with result (9.72)] implies that $y$ has an elliptical distribution with mean $c + A\mu$, variance-covariance matrix $A\Sigma A'$, and mgf generator $\psi(\cdot)$. It remains only to observe that even in the "degenerate" case where $\Sigma = 0$ or, more generally, where $A\Sigma A' = 0$, $E(y) = c + A\mu$ and $\operatorname{var}(y) = A\Sigma A'$ (and to observe that the distribution of a random vector whose variance-covariance matrix equals a null matrix qualifies as an elliptical distribution).
Marginal distributions (of elliptically distributed random vectors). Let $x$ represent an $N \times 1$ random vector that has an ($N$-dimensional) elliptical distribution with mean $\mu$, nonnull variance-covariance matrix $\Sigma$, and mgf generator $\psi(\cdot)$. And take $x_*$ to be an $M$-dimensional subvector of $x$ (where $M < N$), say the subvector obtained by striking out all of the elements of $x$ save the $i_1, i_2, \ldots, i_M$th elements. Further, take $\mu_*$ to be the $M$-dimensional subvector of $\mu$ obtained by striking out all of the elements of $\mu$ save the $i_1, i_2, \ldots, i_M$th elements and $\Sigma_*$ to be the $M \times M$ submatrix of $\Sigma$ obtained by striking out all of the rows and columns of $\Sigma$ save the $i_1, i_2, \ldots, i_M$th rows and columns.

Consider the distribution of $x_*$. Clearly, $x_* = Ax$, where $A$ is the $M \times N$ matrix whose first, second, $\ldots$, $M$th rows are, respectively, the $i_1, i_2, \ldots, i_M$th rows of $I_N$. Thus, upon observing that $A\mu = \mu_*$ and that $A\Sigma A' = \Sigma_*$, it follows from the result of the preceding subsection (the subsection pertaining to linear transformation of elliptically distributed random vectors) that $x_*$ has an elliptical distribution with mean $\mu_*$, variance-covariance matrix $\Sigma_*$, and (if $\Sigma_* \ne 0$) mgf generator $\psi(\cdot)$ (identical to the mgf generator of $x$).
d. Maximum likelihood as applied to elliptical distributions (besides the MVN distribution)
Suppose that $y$ is an $N \times 1$ observable random vector that follows a general linear model. Suppose further that the variance-covariance matrix $V(\theta)$ of the vector $e$ of residual effects is nonsingular and that
\[ e \sim [\Gamma(\theta)]'u, \tag{9.73} \]
where $\Gamma(\theta)$ is an $N \times N$ (nonsingular) matrix (whose elements may be functionally dependent on $\theta$) such that $V(\theta) = [\Gamma(\theta)]'\Gamma(\theta)$ and where $u$ is an $N \times 1$ random vector having an absolutely continuous spherical distribution with variance-covariance matrix $I$. The distribution of $u$ has a pdf $h(\cdot)$, where (for every value of $u$)
\[ h(u) = c^{-1} g(u'u). \]
Here, $g(\cdot)$ is a nonnegative function (of a single nonnegative variable) such that $\int_0^\infty s^{N-1} g(s^2)\, ds < \infty$, and $c = [2\pi^{N/2}/\Gamma(N/2)] \int_0^\infty s^{N-1} g(s^2)\, ds$. As a consequence of supposition (9.73), $y$ has an elliptical distribution.
Let us consider the ML estimation of functions of the parameters of the general linear model (i.e., functions of the elements $\beta_1, \beta_2, \ldots, \beta_P$ of the vector $\beta$ and the elements $\theta_1, \theta_2, \ldots, \theta_T$ of the vector $\theta$). That topic was considered earlier (in Subsection a) in the special case where $e \sim N[0, V(\theta)]$—when $g(s^2) = \exp(-s^2/2)$, $h(u) = (2\pi)^{-N/2}\exp(-\tfrac{1}{2}u'u)$, which is the pdf of the $N(0, I_N)$ distribution.

Let $f(\,\cdot\,; \beta, \theta)$ represent the pdf of the distribution of $y$, and denote by $y$ the observed value of $y$. Then, the likelihood function is the function, say $L(\beta, \theta; y)$, of $\beta$ and $\theta$ defined by $L(\beta, \theta; y) = f(y; \beta, \theta)$. Accordingly, it follows from result (9.70) that
\[ L(\beta, \theta; y) = c^{-1} |V(\theta)|^{-1/2}\, g\{(y - X\beta)'[V(\theta)]^{-1}(y - X\beta)\}. \tag{9.74} \]
Maximum likelihood estimates of functions of $\beta$ and/or $\theta$ are obtained from values, say $\tilde{\beta}$ and $\tilde{\theta}$, of $\beta$ and $\theta$ at which $L(\beta, \theta; y)$ attains its maximum value: a maximum likelihood estimate of a function, say $r(\beta, \theta)$, of $\beta$ and/or $\theta$ is provided by the quantity $r(\tilde{\beta}, \tilde{\theta})$ obtained by substituting $\tilde{\beta}$ and $\tilde{\theta}$ for $\beta$ and $\theta$.
Profile likelihood function. Now, suppose that the function $g(\cdot)$ is a strictly decreasing function (as in the special case where the distribution of $e$ is MVN). Then, for any particular value of $\theta$, the maximization of $L(\beta, \theta; y)$ with respect to $\beta$ is equivalent to the minimization of $(y - X\beta)'[V(\theta)]^{-1}(y - X\beta)$ with respect to $\beta$. Thus, upon regarding the value of $\theta$ as "fixed," upon recalling (from Part 3 of Subsection a) that the linear system
\[ X'[V(\theta)]^{-1}X\,b = X'[V(\theta)]^{-1}y \tag{9.75} \]
(in the $P \times 1$ vector $b$) is consistent, and upon employing the same line of reasoning as in Part 3 of Subsection a, we find that $L(\beta, \theta; y)$ attains its maximum value at a value $\tilde{\beta}(\theta)$ of $\beta$ if and only if $\tilde{\beta}(\theta)$ is a solution to linear system (9.75) or, equivalently, if and only if
\[ X'[V(\theta)]^{-1}X\tilde{\beta}(\theta) = X'[V(\theta)]^{-1}y, \]
in which case
\[ L[\tilde{\beta}(\theta), \theta; y] = c^{-1}|V(\theta)|^{-1/2}\, g\bigl\{[y - X\tilde{\beta}(\theta)]'[V(\theta)]^{-1}[y - X\tilde{\beta}(\theta)]\bigr\} \tag{9.76} \]
\[ {} = c^{-1}|V(\theta)|^{-1/2}\, g\bigl\{y'[V(\theta)]^{-1}y - [\tilde{\beta}(\theta)]'X'[V(\theta)]^{-1}y\bigr\} \tag{9.77} \]
\[ {} = c^{-1}|V(\theta)|^{-1/2}\, g\bigl\{y'\bigl([V(\theta)]^{-1} - [V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^{-}X'[V(\theta)]^{-1}\bigr)y\bigr\}. \tag{9.78} \]
Accordingly, the function $L_*(\theta; y)$ of $\theta$ defined by $L_*(\theta; y) = L[\tilde{\beta}(\theta), \theta; y]$ is a profile likelihood function.

Values, say $\tilde{\beta}$ and $\tilde{\theta}$ (of $\beta$ and $\theta$, respectively), at which $L(\beta, \theta; y)$ attains its maximum value can be obtained by taking $\tilde{\theta}$ to be a value at which the profile likelihood function $L_*(\theta; y)$ attains its maximum value and by then taking $\tilde{\beta}$ to be a solution to the linear system
\[ X'[V(\tilde{\theta})]^{-1}X\,b = X'[V(\tilde{\theta})]^{-1}y. \]
Except for relatively simple special cases, the maximization of $L_*(\theta; y)$ must be accomplished numerically via an iterative procedure.
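The following sketch (assuming NumPy; the one-parameter variance structure $V(\theta) = I + \theta\, ZZ'$, the matrix $Z$, the multivariate-$t$-type kernel, and the grid search are all illustrative assumptions, not a general-purpose implementation) shows the computational pattern just described: for each trial $\theta$, solve the linear system (9.75) and evaluate the profile log-likelihood for a strictly decreasing kernel $g$.

```python
import numpy as np

def profile_loglik(theta, y, X, Z, log_g):
    """log L_*(theta; y) up to an additive constant, for V(theta) = I + theta * Z Z'."""
    N = len(y)
    V = np.eye(N) + theta * Z @ Z.T
    Vinv = np.linalg.inv(V)
    beta_tilde = np.linalg.pinv(X.T @ Vinv @ X) @ X.T @ Vinv @ y   # solves (9.75)
    resid = y - X @ beta_tilde
    return -0.5 * np.linalg.slogdet(V)[1] + log_g(resid @ Vinv @ resid)

# illustrative data and a strictly decreasing kernel (multivariate-t type)
rng = np.random.default_rng(5)
N, P, q, nu = 40, 2, 5, 6
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
Z = rng.standard_normal((N, q))
y = X @ np.array([1.0, 0.5]) + Z @ rng.standard_normal(q) + rng.standard_normal(N)
log_g = lambda u: -(N + nu) / 2 * np.log1p(u / nu)

grid = np.linspace(0.01, 5.0, 200)
theta_tilde = grid[np.argmax([profile_loglik(t, y, X, Z, log_g) for t in grid])]
print(theta_tilde)
```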
REML variant. REML is a variant of ML in which inferences about functions of $\theta$ are based on the likelihood function associated with a vector of so-called error contrasts. REML was introduced and discussed in an earlier subsection (Subsection b) under the assumption that the distribution of $e$ is MVN. Let us consider REML in the present, more general context (where the distribution of $e$ is taken to be elliptical).

Let $R$ represent an $N \times (N - \operatorname{rank} X)$ matrix (of constants) of full column rank $N - \operatorname{rank} X$ such that $X'R = 0$, and take $z$ to be the $(N - \operatorname{rank} X) \times 1$ vector defined by $z = R'y$. Note that $z = R'e$ and hence that the distribution of $z$ does not depend on $\beta$. Further, let $k(\,\cdot\,; \theta)$ represent the pdf of the distribution of $z$, and take $L(\theta; R'y)$ to be the function of $\theta$ defined (for $\theta \in \Theta$) by $L(\theta; R'y) = k(R'y; \theta)$. The function $L(\theta; R'y)$ is a likelihood function; it is the likelihood function obtained by regarding the value of $z$ as the data vector.

Now, suppose that the ($N$-dimensional spherical) distribution of the random vector $u$ [in expression (9.73)] has a moment generating function, say $\phi(\cdot)$. Then, necessarily, there exists a (nonnegative) function $\psi(\cdot)$ (of a single nonnegative variable) such that (for every $N \times 1$ vector $t$ in a neighborhood of 0) $\phi(t) = \psi(t't)$. And in light of the results of Subsection c, it follows that $z$ has an [$(N - \operatorname{rank} X)$-dimensional] elliptical distribution with mean 0, variance-covariance matrix $R'V(\theta)R$, and mgf generator $\psi(\cdot)$. Further,
\[ z \sim [\Gamma_*(\theta)]'u_*, \]
where $\Gamma_*(\theta)$ is any $(N - \operatorname{rank} X) \times (N - \operatorname{rank} X)$ matrix such that $R'V(\theta)R = [\Gamma_*(\theta)]'\Gamma_*(\theta)$ and where $u_*$ is an $(N - \operatorname{rank} X) \times 1$ random vector whose distribution is spherical with variance-covariance matrix $I$ and with moment generating function $\phi_*(\cdot)$ defined [for every $(N - \operatorname{rank} X) \times 1$ vector $t_*$ in a neighborhood of 0] by $\phi_*(t_*) = \psi(t_*'t_*)$—the distribution of $u_*$ is a marginal distribution of $u$.
The distribution of $u_*$ is absolutely continuous with a pdf $h_*(\cdot)$ that (at least in principle) is determinable from the pdf of the distribution of $u$ and that is expressible in the form
\[ h_*(u_*) = c_*^{-1}\, g_*(u_*'u_*), \]
where $g_*(\cdot)$ is a nonnegative function (of a single nonnegative variable) such that $\int_0^\infty s^{N - \operatorname{rank}(X) - 1} g_*(s^2)\, ds < \infty$ and where $c_*$ is a strictly positive constant. Necessarily,
\[ c_* = \frac{2\pi^{(N - \operatorname{rank} X)/2}}{\Gamma[(N - \operatorname{rank} X)/2]} \int_0^\infty s^{N - \operatorname{rank}(X) - 1} g_*(s^2)\, ds. \]
Thus, in light of result (9.70), the distribution of $z$ is absolutely continuous with a pdf $k(\,\cdot\,; \theta)$ that is expressible as
\[ k(z; \theta) = c_*^{-1} |R'V(\theta)R|^{-1/2}\, g_*\{z'[R'V(\theta)R]^{-1}z\}. \]
And it follows that the REML likelihood function is expressible as
\[ L(\theta; R'y) = c_*^{-1} |R'V(\theta)R|^{-1/2}\, g_*\{y'R[R'V(\theta)R]^{-1}R'y\}. \tag{9.79} \]
As in the special case where the distribution of $e$ is MVN, an alternative expression for $L(\theta; R'y)$ can be obtained by taking advantage of identities (9.32) and (9.37). Taking $X_*$ to be any $N \times (\operatorname{rank} X)$ matrix whose columns form a basis for $\mathcal{C}(X)$, we find that
\[ \begin{aligned} L(\theta; R'y) = {}& c_*^{-1}\, |R'R|^{-1/2}\, |X_*'X_*|^{1/2}\, |V(\theta)|^{-1/2}\, |X_*'[V(\theta)]^{-1}X_*|^{-1/2} \\ & \times g_*\bigl\{y'\bigl([V(\theta)]^{-1} - [V(\theta)]^{-1}X\{X'[V(\theta)]^{-1}X\}^{-}X'[V(\theta)]^{-1}\bigr)y\bigr\}. \end{aligned} \tag{9.80} \]
Alternative versions of this expression can be obtained by replacing the argument of the function $g_*(\cdot)$ with expression (9.40) or expression (9.41).

As in the special case where the distribution of $e$ is MVN, $L(\theta; R'y)$ depends on the choice of the matrix $R$ only through the multiplicative constant $|R'R|^{-1/2}$. In some special cases, including that where the distribution of $e$ is MVN, the function $g_*(\cdot)$ differs from the function $g(\cdot)$ by no more than a multiplicative constant. However, in general, the relationship between $g_*(\cdot)$ and $g(\cdot)$ is more complex.
5.10 Prediction

a. Some general results

Let $y$ represent an $N \times 1$ observable random vector. And consider the use of $y$ in predicting an unobservable random variable or, more generally, an unobservable random vector, say an $M \times 1$ unobservable random vector $w = (w_1, w_2, \ldots, w_M)'$. That is, consider the use of the observed value of $y$ (the so-called data vector) in making inferences about an unobservable quantity that can be regarded as a realization (i.e., sample value) of $w$. Here, an unobservable quantity is a quantity that is unobservable at the time the inferences are to be made; it may become observable at some future time (as suggested by the use of the word prediction). In the present section, the focus is on obtaining a point estimate of the unobservable quantity; that is, on what might be deemed a point prediction.

Suppose that the second-order moments of the joint distribution of $w$ and $y$ exist. And adopt the following notation: $\mu_y = E(y)$, $\mu_w = E(w)$, $V_y = \operatorname{var}(y)$, $V_{yw} = \operatorname{cov}(y, w)$, and $V_w = \operatorname{var}(w)$. Further, in considering the special case $M = 1$, let us write $w$, $\mu_w$, $v_{yw}$, and $v_w$ for the (now scalar or vector-valued) $w$, $\mu_w$, $V_{yw}$, and $V_w$, respectively.
It is informative to consider the prediction of $w$ under each of the following states of knowledge: (1) the joint distribution of $y$ and $w$ is known; (2) only $\mu_y$, $\mu_w$, $V_y$, $V_{yw}$, and $V_w$ are known; and (3) only $V_y$, $V_{yw}$, and $V_w$ are known.

Let $\tilde{w}(y)$ represent an ($M \times 1$)-dimensional vector-valued function of $y$ that qualifies as a (point) predictor of $w$—in the special case where $M = 1$, let us write $\tilde{w}(y)$ for the (scalar-valued) predictor. That $\tilde{w}(y)$ qualifies as a predictor implies that the vector-valued function $\tilde{w}(\cdot)$ depends on the joint distribution of $y$ and $w$ (if at all) only through characteristics of the joint distribution that are known.

The difference $\tilde{w}(y) - w$ is referred to as the prediction error. The predictor $\tilde{w}(y)$ is said to be unbiased if $E[\tilde{w}(y) - w] = 0$, that is, if the expected value of the prediction error equals 0, or, equivalently, if $E[\tilde{w}(y)] = \mu_w$, that is, if the expected value of the predictor is the same as that of the random vector $w$ whose realization is being predicted.

Attention is sometimes restricted to linear predictors. An ($M \times 1$)-dimensional vector-valued function $t(y)$ of $y$ is said to be linear if it is expressible in the form
\[ t(y) = c + A'y, \tag{10.1} \]
where $c$ is an $M \times 1$ vector of constants and $A$ is an $N \times M$ matrix of constants. A vector-valued function $t(y)$ that is expressible in the form (10.1) is regarded as linear even if the vector $c$ and the matrix $A$ depend on the joint distribution of $y$ and $w$—the linearity reflects the nature of the dependence on the value of $y$, not the nature of any dependence on the joint distribution. And it qualifies as a predictor if any dependence on the joint distribution of $y$ and $w$ is confined to characteristics of the joint distribution that are known.
The $M \times M$ matrix $E\{[\tilde{w}(y) - w][\tilde{w}(y) - w]'\}$ is referred to as the mean-squared-error (MSE) matrix of the predictor $\tilde{w}(y)$. If $\tilde{w}(y)$ is an unbiased predictor (of $w$), then
\[ E\{[\tilde{w}(y) - w][\tilde{w}(y) - w]'\} = \operatorname{var}[\tilde{w}(y) - w]. \]
That is, the MSE matrix of an unbiased predictor equals the variance-covariance matrix of its prediction error (not the variance-covariance matrix of the predictor itself). Note that in the special case where $M = 1$, the MSE matrix has only one element, which is expressible as $E\{[\tilde{w}(y) - w]^2\}$ and which is referred to simply as the mean squared error (MSE) of the (scalar-valued) predictor $\tilde{w}(y)$.
State (1): joint distribution known. Suppose that the joint distribution of $y$ and $w$ is known or that, at the very least, enough is known about the joint distribution to determine the conditional expected value $E(w \mid y)$ of $w$ given $y$. And observe that
\[ E[E(w \mid y) - w \mid y] = 0 \quad \text{(with probability 1).} \tag{10.2} \]
Observe also that, for "any" column vector $h(y)$ of functions of $y$,
\[ E\{h(y)[E(w \mid y) - w]' \mid y\} = 0 \quad \text{(with probability 1).} \tag{10.3} \]
Now, let $t(y)$ represent "any" ($M \times 1$)-dimensional vector-valued function of $y$—in the special case where $M = 1$, let us write $t(y)$ for the (scalar-valued) function. Then, upon observing that
\[ \begin{aligned} [t(y) - w][t(y) - w]' &= \{t(y) - E(w \mid y) + [E(w \mid y) - w]\}\{t(y) - E(w \mid y) + [E(w \mid y) - w]\}' \\ &= [t(y) - E(w \mid y)][t(y) - E(w \mid y)]' + [E(w \mid y) - w][E(w \mid y) - w]' \\ &\quad + [t(y) - E(w \mid y)][E(w \mid y) - w]' + \{[t(y) - E(w \mid y)][E(w \mid y) - w]'\}' \end{aligned} \]
and [in light of result (10.3)] that
\[ E\{[t(y) - E(w \mid y)][E(w \mid y) - w]' \mid y\} = 0 \quad \text{(with probability 1),} \tag{10.4} \]
we find that
\[ \begin{aligned} E\{[t(y) - w][t(y) - w]' \mid y\} = {}& [t(y) - E(w \mid y)][t(y) - E(w \mid y)]' \\ & + E\{[E(w \mid y) - w][E(w \mid y) - w]' \mid y\} \quad \text{(with probability 1).} \end{aligned} \tag{10.5} \]
Result (10.5) implies that $E(w \mid y)$ is an optimal predictor of $w$. It is optimal in the sense that the difference $E\{[t(y) - w][t(y) - w]' \mid y\} - E\{[E(w \mid y) - w][E(w \mid y) - w]' \mid y\}$ between the conditional (given $y$) MSE matrix of an arbitrary predictor $t(y)$ and that of $E(w \mid y)$ equals (with probability 1) the matrix $[t(y) - E(w \mid y)][t(y) - E(w \mid y)]'$, which is nonnegative definite and which equals 0 if and only if $t(y) - E(w \mid y) = 0$ or, equivalently, if and only if $t(y) = E(w \mid y)$. In the special case where $M = 1$, we have that [for an arbitrary predictor $t(y)$]
\[ E\{[t(y) - w]^2 \mid y\} \ge E\{[E(w \mid y) - w]^2 \mid y\} \quad \text{(with probability 1).} \]
It is worth noting that
\[ E\{[t(y) - w][t(y) - w]'\} = E\bigl(E\{[t(y) - w][t(y) - w]' \mid y\}\bigr), \]
so that $E(w \mid y)$ is optimal when the various predictors are compared on the basis of their unconditional MSE matrices as well as when they are compared on the basis of their conditional MSE matrices.

The conditional MSE matrix of the optimal predictor $E(w \mid y)$ is
\[ E\{[E(w \mid y) - w][E(w \mid y) - w]' \mid y\} = \operatorname{var}(w \mid y), \]
and the (unconditional) MSE matrix of $E(w \mid y)$ or, equivalently, the (unconditional) variance-covariance matrix of $E(w \mid y) - w$ is
\[ \operatorname{var}[E(w \mid y) - w] = E\{[E(w \mid y) - w][E(w \mid y) - w]'\} = E[\operatorname{var}(w \mid y)]. \]
Clearly, $E(w \mid y)$ is an unbiased predictor; in fact, the expected value of its prediction error equals 0 conditionally on $y$ (albeit with probability 1) as well as unconditionally [as is evident from result (10.2)]. Whether or not $E(w \mid y)$ is a linear predictor (or, more generally, equal to a linear predictor with probability 1) depends on the form of the joint distribution of $y$ and $w$; a sufficient (but not a necessary) condition for $E(w \mid y)$ to be linear (or, at least, "linear with probability 1") is that the joint distribution of $y$ and $w$ be MVN.
State (2): only the means and the variances and covariances are known. Suppose that $\mu_y$, $\mu_w$, $V_y$, $V_{yw}$, and $V_w$ are known, but that nothing else is known about the joint distribution of $y$ and $w$. Then, $E(w \mid y)$ is not determinable from what is known, forcing us to look elsewhere for a predictor of $w$.

Assume (for the sake of simplicity) that $V_y$ is nonsingular. And consider the predictor
\[ \mu(y) = \mu_w + V_{yw}'V_y^{-1}(y - \mu_y) = \delta + V_{yw}'V_y^{-1}y, \]
where $\delta = \mu_w - V_{yw}'V_y^{-1}\mu_y$—in the special case where $M = 1$, let us write $\mu(y)$ for the (scalar-valued) predictor. Clearly, $\mu(y)$ is linear; it is also unbiased. Now, consider its MSE matrix $E\{[\mu(y) - w][\mu(y) - w]'\}$ or, equivalently, the variance-covariance matrix $\operatorname{var}[\mu(y) - w]$ of its prediction error. Let us compare the MSE matrix of $\mu(y)$ with the MSE matrices of other linear predictors.

Let $t(y)$ represent an ($M \times 1$)-dimensional vector-valued function of $y$ of the form $t(y) = c + A'y$, where $c$ is an $M \times 1$ vector of constants and $A$ an $N \times M$ matrix of constants—in the special case where $M = 1$, let us write $t(y)$ for the (scalar-valued) function. Further, decompose the difference between $t(y)$ and $w$ into two components as follows:
\[ t(y) - w = [t(y) - \mu(y)] + [\mu(y) - w]. \tag{10.6} \]
And observe that
\[ \operatorname{cov}[y,\, \mu(y) - w] = \operatorname{cov}(y,\, V_{yw}'V_y^{-1}y - w) = V_y[V_{yw}'V_y^{-1}]' - V_{yw} = 0. \tag{10.7} \]
Then, because $E[\mu(y) - w] = 0$ and because $t(y) - \mu(y) = c - \delta + (A' - V_{yw}'V_y^{-1})y$, it follows that
\[ E\{[t(y) - \mu(y)][\mu(y) - w]'\} = \operatorname{cov}[t(y) - \mu(y),\, \mu(y) - w] = (A' - V_{yw}'V_y^{-1})\operatorname{cov}[y,\, \mu(y) - w] = 0. \tag{10.8} \]
Thus,
\[ \begin{aligned} E\{[t(y) - w][t(y) - w]'\} &= E\bigl(\{[t(y) - \mu(y)] + [\mu(y) - w]\}\{[t(y) - \mu(y)]' + [\mu(y) - w]'\}\bigr) \\ &= E\{[t(y) - \mu(y)][t(y) - \mu(y)]'\} + \operatorname{var}[\mu(y) - w]. \end{aligned} \tag{10.9} \]
Any linear predictor of $w$ is expressible in the form [$t(y) = c + A'y$] of the vector-valued function $t(y)$. Accordingly, result (10.9) implies that $\mu(y)$ is the best linear predictor of $w$. It is the best linear predictor in the sense that the difference between the MSE matrix $E\{[t(y) - w][t(y) - w]'\}$ of an arbitrary linear predictor $t(y)$ and the matrix $\operatorname{var}[\mu(y) - w]$ [which is the MSE matrix of $\mu(y)$] equals the matrix $E\{[t(y) - \mu(y)][t(y) - \mu(y)]'\}$, which is nonnegative definite and which equals 0 if and only if $t(y) - \mu(y) = 0$ or, equivalently, if and only if $t(y) = \mu(y)$. (To see that $E\{[t(y) - \mu(y)][t(y) - \mu(y)]'\} = 0$ implies that $t(y) - \mu(y) = 0$, observe that (for $j = 1, 2, \ldots, M$) the $j$th element of $t(y) - \mu(y)$ equals $k_j + \ell_j'y$, where $k_j$ is the $j$th element of $c - \delta$ and $\ell_j$ the $j$th column of $A - V_y^{-1}V_{yw}$, that the $j$th diagonal element of $E\{[t(y) - \mu(y)][t(y) - \mu(y)]'\}$ equals $E[(k_j + \ell_j'y)^2]$, and that $E[(k_j + \ell_j'y)^2] = 0$ implies that $E(k_j + \ell_j'y) = 0$ and $\operatorname{var}(k_j + \ell_j'y) = 0$ and hence that $\ell_j = 0$ and $k_j = 0$.) In the special case where $M = 1$, we have [for an arbitrary linear predictor $t(y)$] that
\[ E\{[t(y) - w]^2\} \ge \operatorname{var}[\mu(y) - w] = E\{[\mu(y) - w]^2\}, \tag{10.10} \]
with equality holding in inequality (10.10) if and only if $t(y) = \mu(y)$.
The prediction error of the best linear predictor $\mu(y)$ can be decomposed into two components on the basis of the following identity:
\[ \mu(y) - w = [\mu(y) - E(w \mid y)] + [E(w \mid y) - w]. \tag{10.11} \]
The second component $E(w \mid y) - w$ of this decomposition has an expected value of 0 [conditionally on $y$ (albeit with probability 1) as well as unconditionally], and because $E[\mu(y) - w] = 0$, the first
component $\mu(y) - E(w \mid y)$ also has an expected value of 0. Moreover, it follows from result (10.3) that
\[ E\{[\mu(y) - E(w \mid y)][E(w \mid y) - w]' \mid y\} = 0 \quad \text{(with probability 1),} \tag{10.12} \]
implying that
\[ E\{[\mu(y) - E(w \mid y)][E(w \mid y) - w]'\} = 0 \tag{10.13} \]
and hence that the two components $\mu(y) - E(w \mid y)$ and $E(w \mid y) - w$ of decomposition (10.11) are uncorrelated. And upon applying result (10.5) [with $t(y) = \mu(y)$], we find that
\[ \begin{aligned} E\{[\mu(y) - w][\mu(y) - w]' \mid y\} = {}& [\mu(y) - E(w \mid y)][\mu(y) - E(w \mid y)]' \\ & + \operatorname{var}(w \mid y) \quad \text{(with probability 1).} \end{aligned} \tag{10.14} \]
In the special case where $M = 1$, result (10.14) is reexpressible as
\[ E\{[\mu(y) - w]^2 \mid y\} = [\mu(y) - E(w \mid y)]^2 + \operatorname{var}(w \mid y) \quad \text{(with probability 1).} \]
Equality (10.14) serves to decompose the conditional (on $y$) MSE matrix of the best linear predictor $\mu(y)$ into two components, corresponding to the two components of the decomposition (10.11) of the prediction error of $\mu(y)$. The (unconditional) MSE matrix of $\mu(y)$ or, equivalently, the (unconditional) variance-covariance matrix of $\mu(y) - w$ lends itself to a similar decomposition. We find that
\[ \begin{aligned} \operatorname{var}[\mu(y) - w] &= \operatorname{var}[\mu(y) - E(w \mid y)] + \operatorname{var}[E(w \mid y) - w] \\ &= \operatorname{var}[\mu(y) - E(w \mid y)] + E[\operatorname{var}(w \mid y)]. \end{aligned} \tag{10.15} \]
Of the two components of the prediction error of $\mu(y)$, the second component $E(w \mid y) - w$ can be regarded as an "inherent" component. It is inherent in the sense that it is an error that would be incurred even if enough were known about the joint distribution of $y$ and $w$ that $E(w \mid y)$ were determinable and were employed as the predictor. The first component $\mu(y) - E(w \mid y)$ of the prediction error can be regarded as a "nonlinearity" component; it equals 0 if and only if $E(w \mid y) = c + A'y$ for some vector $c$ of constants and some matrix $A$ of constants.

The variance-covariance matrix of the prediction error of $\mu(y)$ is expressible as
\[ \begin{aligned} \operatorname{var}[\mu(y) - w] &= \operatorname{var}(V_{yw}'V_y^{-1}y - w) \\ &= V_w + V_{yw}'V_y^{-1}V_y(V_{yw}'V_y^{-1})' - V_{yw}'V_y^{-1}V_{yw} - [V_{yw}'V_y^{-1}V_{yw}]' \\ &= V_w - V_{yw}'V_y^{-1}V_{yw}. \end{aligned} \tag{10.16} \]
It differs from the variance-covariance matrix of $\mu(y)$; the latter variance-covariance matrix is expressible as
\[ \operatorname{var}[\mu(y)] = \operatorname{var}(V_{yw}'V_y^{-1}y) = V_{yw}'V_y^{-1}V_y[V_{yw}'V_y^{-1}]' = V_{yw}'V_y^{-1}V_{yw}. \]
In fact, $\operatorname{var}[\mu(y) - w]$ and $\operatorname{var}[\mu(y)]$ are the first and second components in the following decomposition of $\operatorname{var}(w)$:
\[ \operatorname{var}(w) = \operatorname{var}[\mu(y) - w] + \operatorname{var}[\mu(y)]. \]
The best linear predictor $\mu(y)$ can be regarded as an approximation to $E(w \mid y)$. The expected value $E[\mu(y) - E(w \mid y)]$ of the error of this approximation equals 0. Note that $\operatorname{var}[\mu(y) - E(w \mid y)] = E\{[\mu(y) - E(w \mid y)][\mu(y) - E(w \mid y)]'\}$. Further, $\mu(y)$ is the best linear approximation to $E(w \mid y)$ in the sense that, for any ($M \times 1$)-dimensional vector-valued function $t(y)$ of the form $t(y) = c + A'y$, the difference between the matrix $E\{[t(y) - E(w \mid y)][t(y) - E(w \mid y)]'\}$ and the matrix $\operatorname{var}[\mu(y) - E(w \mid y)]$ equals the matrix $E\{[t(y) - \mu(y)][t(y) - \mu(y)]'\}$, which is nonnegative definite and which equals 0 if and only if $t(y) = \mu(y)$. This result follows from what has already been established (in regard to the best linear prediction of $w$) upon observing [in light of result (10.5)] that
\[ E\{[t(y) - w][t(y) - w]'\} = E\{[t(y) - E(w \mid y)][t(y) - E(w \mid y)]'\} + E[\operatorname{var}(w \mid y)], \]
which in combination with result (10.15) implies that the difference between the two matrices $E\{[t(y) - E(w \mid y)][t(y) - E(w \mid y)]'\}$ and $\operatorname{var}[\mu(y) - E(w \mid y)]$ is the same as that between the two matrices $E\{[t(y) - w][t(y) - w]'\}$ and $\operatorname{var}[\mu(y) - w]$. In the special case where $M = 1$, we have [for any function $t(y)$ of $y$ of the form $t(y) = c + a'y$ (where $c$ is a constant and $a$ an $N \times 1$ vector of constants)] that
\[ E\{[t(y) - E(w \mid y)]^2\} \ge \operatorname{var}[\mu(y) - E(w \mid y)] = E\{[\mu(y) - E(w \mid y)]^2\}, \tag{10.17} \]
with equality holding in inequality (10.17) if and only if $t(y) = \mu(y)$.

Hartigan (1969) refers to $\mu(y)$ as the linear expectation of $w$ given $y$. And in the special case where $M = 1$, he refers to $\operatorname{var}[w - \mu(y)]$ ($= v_w - v_{yw}'V_y^{-1}v_{yw}$) as the linear variance of $w$ given $y$—in the general case (where $M$ can exceed 1), this quantity could be referred to as the linear variance-covariance matrix of $w$ given $y$. It is only in special cases, such as that where the joint distribution of $y$ and $w$ is MVN, that the linear expectation and linear variance-covariance matrix of $w$ given $y$ coincide with the conditional expectation $E(w \mid y)$ and conditional variance-covariance matrix $\operatorname{var}(w \mid y)$ of $w$ given $y$.

Note that for the vector-valued function $\mu(\cdot)$ to be determinable from what is known about the joint distribution of $y$ and $w$, the supposition that $\mu_y$, $\mu_w$, $V_y$, $V_{yw}$, and $V_w$ are known is stronger than necessary. It suffices to know the vector $\delta = \mu_w - V_{yw}'V_y^{-1}\mu_y$ and the matrix $V_{yw}'V_y^{-1}$.
State (3): only the variances and covariances are known. Suppose that $V_y$, $V_{yw}$, and $V_w$ are known (and that $V_y$ is nonsingular), but that nothing else is known about the joint distribution of $y$ and $w$. Then, $\mu(\cdot)$ is not determinable from what is known, and consequently $\mu(y)$ does not qualify as a predictor. Thus, we are forced to look elsewhere for a predictor of $w$.

Corresponding to any estimator $\tilde{\delta}(y)$ of the vector $\delta$ ($= \mu_w - V_{yw}'V_y^{-1}\mu_y$) is the predictor $\tilde{\mu}(y)$ of $w$ obtained from $\mu(y)$ by substituting $\tilde{\delta}(y)$ for $\delta$. That is, corresponding to $\tilde{\delta}(y)$ is the predictor $\tilde{\mu}(y)$ defined as follows:
\[ \tilde{\mu}(y) = \tilde{\delta}(y) + V_{yw}'V_y^{-1}y. \tag{10.18} \]
Equality (10.18) serves to establish a one-to-one correspondence between estimators of $\delta$ and predictors of $w$—corresponding to any predictor $\tilde{\mu}(y)$ of $w$ is a unique estimator $\tilde{\delta}(y)$ of $\delta$ that satisfies equality (10.18), namely, the estimator $\tilde{\delta}(y)$ defined by $\tilde{\delta}(y) = \tilde{\mu}(y) - V_{yw}'V_y^{-1}y$.

Clearly, the predictor $\tilde{\mu}(y)$ is linear if and only if the corresponding estimator $\tilde{\delta}(y)$ is linear. Moreover,
\[ E[\tilde{\mu}(y)] = E[\tilde{\delta}(y)] + V_{yw}'V_y^{-1}\mu_y \tag{10.19} \]
and, consequently,
\[ E[\tilde{\mu}(y)] = \mu_w \iff E[\tilde{\delta}(y)] = \delta. \tag{10.20} \]
Thus, $\tilde{\mu}(y)$ is an unbiased predictor of $w$ if and only if the corresponding estimator $\tilde{\delta}(y)$ is an unbiased estimator of $\delta$.

The following identity serves to decompose the prediction error of the predictor $\tilde{\mu}(y)$ into two components:
\[ \tilde{\mu}(y) - w = [\tilde{\mu}(y) - \mu(y)] + [\mu(y) - w]. \tag{10.21} \]
Clearly,
\[ \tilde{\mu}(y) - \mu(y) = \tilde{\delta}(y) - \delta. \tag{10.22} \]
Thus, decomposition (10.21) can be reexpressed as follows:
\[ \tilde{\mu}(y) - w = [\tilde{\delta}(y) - \delta] + [\mu(y) - w]. \tag{10.23} \]

Let us now specialize to linear predictors. Let us write Q L .y/ for a linear predictor of w and
QL .y/ for the corresponding estimator of  [which, like Q L .y/, is linear]. Then, in light of results
(10.22) and (10.8),
EfŒQL .y/ Œ.y/ w0 g D EfŒQ L .y/ .y/Œ.y/ w0 g D 0: (10.24)
And making use of results (10.23) and (10.16), it follows that

EfŒQ L .y/ wŒQ L .y/ w0 g


D E fŒQL .y/ 0 C Œ.y/ w0 g

 C Œ.y/ wgfŒQL .y/
D EfŒQL .y/ ŒQL .y/ 0 g C varŒ.y/ w
0 0
D EfŒQL .y/ ŒQL .y/  g C Vw Vyw Vy 1 Vyw : (10.25)

Let Lp represent a collection of linear predictors of w. And let Le represent the collection of
(linear) estimators of  that correspond to the predictors in Lp . Then, for a predictor, say O L .y/, in
the collection Lp to be best in the sense that, for every predictor Q L .y/ in Lp , the matrix EfŒQ L .y/
wŒQ L .y/ w0 g EfŒO L .y/ wŒO L .y/ w0 g is nonnegative definite, it is necessary and sufficient
that 0
O L .y/ D O L .y/ C Vyw Vy 1 y
for some estimator O L .y/ in Le that is best in the sense that, for every estimator Q L .y/ in Le , the
matrix EfŒQL .y/ ŒQL .y/ 0 g EfŒOL .y/ ŒOL .y/ 0 g is nonnegative definite. In general,
there may or may not be an estimator that is best in such a sense; the existence of such an estimator
depends on the nature of the collection Le and on any assumptions that may be made about y and
w .
If Lp is the collection of all linear unbiased predictors of w, then Le is the collection of all
linear unbiased estimators of . As previously indicated (in Section 5.5a), it is customary to refer to
an estimator that is best among linear unbiased estimators as a BLUE (an acronym for best linear
unbiased estimator or estimation). Similarly, a predictor that is best among linear unbiased predictors
is customarily referred to as a BLUP (an acronym for best linear unbiased predictor or prediction).
The prediction error of the predictor .y/
Q can be decomposed into three components by start-
ing with decomposition (10.23) and by expanding the component .y/ w into two components
on the basis of decomposition (10.11). As specialized to the linear predictor Q L .y/, the resultant
decomposition is
Q L .y/ w D ŒQL .y/  C Œ.y/ E.w j y/ C ŒE.w j y/ w: (10.26)

Recall (from the preceding part of the present subsection) that .y/ E.w j y/ and E.w j y/ w
[which are the 2nd and 3rd components of decomposition (10.26)] are uncorrelated and that each
has an expected value of 0. Moreover, it follows from result (10.3) that QL .y/  is uncorre-
lated with E.w j y/ w and from result (10.7) that it is uncorrelated with .y/ w and hence
uncorrelated with .y/ E.w j y/ [which is expressible as the difference between .y/ w and
E.w j y/ w]. Thus, all three components of decomposition (10.26) are uncorrelated. Expanding
on the terminology introduced in the preceding part of the present subsection, the first, second, and
third components of decomposition (10.26) can be regarded, respectively, as an “unknown-means”
component, a “nonlinearity” component, and an “inherent” component.
Corresponding to decomposition (10.26) of the prediction error of Q L .y/ is the following de-
composition of the MSE matrix of Q L .y/:
EfŒQ L .y/ wŒQ L .y/ w0 g
D EfŒQL .y/ ŒQL .y/ 0 g C varŒ.y/ E.w j y/ C varŒE.w j y/ w: (10.27)
In the special case where $M = 1$, this decomposition can [upon writing $\tilde{\omega}_L(y)$, $\tilde{\tau}_L(y)$, $\tau$, $\mu(y)$, and $w$ for their one-dimensional (vector) counterparts] be reexpressed as follows:
EfŒQL .y/ w2 g D EfŒQL .y/ 2 g C varŒ.y/ E.w j y/ C varŒE.w j y/ w:

In taking $\tilde{\tau}(y)$ to be an estimator of $\tau$ and regarding $\tilde{\omega}(y)$ as a predictor of $w$, it is implicitly assumed that the functions $\tilde{\tau}(\cdot)$ and $\tilde{\omega}(\cdot)$ depend on the joint distribution of $y$ and $w$ only through $V_y$, $V_{yw}$, and $V_w$. In practice, the dependence may only be through the elements of the matrix $V_{yw}' V_y^{-1}$ and through various functions of the elements of $V_y$, in which case $\tilde{\tau}(y)$ may qualify as an estimator and $\tilde{\omega}(y)$ as a predictor even in the absence of complete knowledge of $V_y$, $V_{yw}$, and $V_w$.
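The correspondence (10.18) and the MSE decomposition (10.25) are easy to check numerically. The following Python sketch is purely illustrative and is not part of the original text: the moment matrices and the particular linear unbiased estimator (the matrix `k`) are arbitrary assumptions chosen only to make the check concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed (hypothetical) moments of the joint distribution of y (N = 2) and w (M = 1).
mu_y, mu_w = np.array([1.0, -0.5]), np.array([0.3])
V_y  = np.array([[2.0, 0.6], [0.6, 1.5]])
V_yw = np.array([[0.8], [0.4]])                      # cov(y, w), N x M
V_w  = np.array([[1.2]])

A = V_yw.T @ np.linalg.inv(V_y)                      # V_yw' V_y^{-1}, M x N
tau = mu_w - A @ mu_y                                # tau = mu_w - V_yw' V_y^{-1} mu_y

# An arbitrary linear unbiased estimator of tau and, via (10.18), the corresponding predictor.
k = np.array([[0.5, -0.2]])                          # arbitrary M x N coefficient matrix

# Simulate (y, w) jointly normal with the assumed moments.
mean = np.concatenate([mu_y, mu_w])
cov = np.block([[V_y, V_yw], [V_yw.T, V_w]])
draws = rng.multivariate_normal(mean, cov, size=200_000)
ys, ws = draws[:, :2], draws[:, 2:]

est_err = (ys - mu_y) @ k.T                          # tau_tilde(y) - tau
pred_err = est_err + ys @ A.T + tau - ws             # omega_tilde(y) - w

mse_pred = pred_err.T @ pred_err / len(pred_err)     # E{[omega_tilde - w][omega_tilde - w]'}
mse_est = est_err.T @ est_err / len(est_err)         # E{[tau_tilde - tau][tau_tilde - tau]'}
print(mse_pred)                                      # approximately equal to the next line,
print(mse_est + V_w - A @ V_yw)                      # as decomposition (10.25) requires
```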
b. Prediction on the basis of a G–M, Aitken, or general linear model
Suppose that the value of an $N \times 1$ observable random vector $y$ is to be used to predict the realization of an $M \times 1$ unobservable random vector $w = (w_1, w_2, \ldots, w_M)'$. How might we proceed? As is evident from the results of Subsection a, the answer depends on what is "known" about the joint distribution of $y$ and $w$.
We could refer to whatever assumptions are made about the joint distribution of $y$ and $w$ as a (statistical) model. However, while doing so might be logical, it would be unconventional and hence potentially confusing. It is customary to restrict the use of the word model to the assumptions made about the distribution of the observable random vector $y$.
Irrespective of the terminology, the assumptions made about the distribution of $y$ do not in and of themselves provide an adequate basis for prediction. The prediction of $w$ requires the larger set of assumptions that apply to the joint distribution of $y$ and $w$. It is this larger set of assumptions that establishes a statistical relationship between $y$ and $w$.
Now, assume that $y$ follows a general linear model. And for purposes of predicting the realization of $w$ from the value of $y$, let us augment that assumption with an assumption that
$$E(w) = \Lambda'\beta \qquad (10.28)$$
for some ($P \times M$) matrix $\Lambda$ of (known) constants and an assumption that
$$\mathrm{cov}(y, w) = V_{yw}(\theta) \quad\text{and}\quad \mathrm{var}(w) = V_w(\theta) \qquad (10.29)$$
for some matrices $V_{yw}(\theta)$ and $V_w(\theta)$ whose elements are known functions of the parametric vector $\theta$—it is assumed that $\mathrm{cov}(y, w)$ and $\mathrm{var}(w)$, like $\mathrm{var}(y)$, do not depend on $\beta$. Further, let us (in the present context) write $V_y(\theta)$ for $V(\theta)$. Note that the $(N+M) \times (N+M)$ matrix
$$\begin{pmatrix} V_y(\theta) & V_{yw}(\theta) \\ [V_{yw}(\theta)]' & V_w(\theta) \end{pmatrix}$$
—which is the variance-covariance matrix of the ($N+M$)-dimensional vector $\binom{y}{w}$—is inherently nonnegative definite.
Note that the assumption that $w$ satisfies condition (10.28) is consistent with taking $w$ to be of the form
$$w = \Lambda'\beta + d, \qquad (10.30)$$
where $d$ is an $M$-dimensional random column vector with $E(d) = 0$. The vector $d$ can be regarded as the counterpart of the vector $e$ of residual effects in the model equation (1.14). Upon taking $w$ to be of the form (10.30), assumption (10.29) is reexpressible as
$$\mathrm{cov}(e, d) = V_{yw}(\theta) \quad\text{and}\quad \mathrm{var}(d) = V_w(\theta). \qquad (10.31)$$
In the present context, a predictor, say $\tilde{w}(y)$, of $w$ is unbiased if and only if $E[\tilde{w}(y)] = \Lambda'\beta$. If there exists a linear unbiased predictor of $w$, let us refer to $w$ as predictable; otherwise, let us refer to $w$ as unpredictable. Clearly, $\tilde{w}(y)$ is an unbiased predictor of $w$ if and only if it is an unbiased estimator of $\Lambda'\beta$. And $w$ is predictable if and only if $\Lambda'\beta$ is estimable, that is, if and only if all $M$ of the elements of the vector $\Lambda'\beta$ are estimable linear combinations of the $P$ elements of the parametric vector $\beta$.
As defined and discussed in Sections 5.2 and 5.6b, translation equivariance is a criterion that is applicable to estimators of a linear combination of the elements of $\beta$ or, more generally, to estimators of a vector of such linear combinations. This criterion can also be applied to predictors. A predictor $\tilde{w}(y)$ of the random vector $w$ (the expected value of which is $\Lambda'\beta$) is said to be translation equivariant if $\tilde{w}(y + Xk) = \tilde{w}(y) + \Lambda'k$ for every $P \times 1$ vector $k$ (and for every value of $y$). Clearly, $\tilde{w}(y)$ is a translation-equivariant predictor of $w$ if and only if it is a translation-equivariant estimator of the expected value $\Lambda'\beta$ of $w$.
Special case: Aitken and G–M models. Let us now specialize to the case where $y$ follows an Aitken model. Under the Aitken model, $\mathrm{var}(y)$ is an unknown scalar multiple of a known (nonnegative definite) matrix $H$. It is convenient (and potentially useful) to consider the prediction of $w$ under the assumption that $\mathrm{var}\binom{y}{w}$ is also an unknown scalar multiple of a known (nonnegative definite) matrix. Accordingly, it is supposed that $\mathrm{cov}(y, w)$ and $\mathrm{var}(w)$ are of the form
$$\mathrm{cov}(y, w) = \sigma^2 H_{yw} \quad\text{and}\quad \mathrm{var}(w) = \sigma^2 H_w,$$
where $H_{yw}$ and $H_w$ are known matrices. Thus, writing $H_y$ for $H$, the setup is such that
$$\mathrm{var}\begin{pmatrix} y \\ w \end{pmatrix} = \sigma^2\begin{pmatrix} H_y & H_{yw} \\ H_{yw}' & H_w \end{pmatrix}.$$
As in the general case, it is supposed that
$$E(w) = \Lambda'\beta$$
(where $\Lambda$ is a known matrix).
The setup can be regarded as a special case of the more general setup where $y$ follows a general linear model and where $E(w)$ is of the form (10.28) and $\mathrm{cov}(y, w)$ and $\mathrm{var}(w)$ of the form (10.29). Specifically, it can be regarded as the special case where $\theta$ is the one-dimensional vector whose only element is $\theta_1 = \sigma^2$, where $\Theta = \{\theta : \theta_1 > 0\}$, and where $V_y(\theta) = \theta_1 H_y$, $V_{yw}(\theta) = \theta_1 H_{yw}$, and $V_w(\theta) = \theta_1 H_w$. Clearly, in this special case, $\mathrm{var}\binom{y}{w}$ is known up to the value of the unknown scalar multiple $\theta_1 = \sigma^2$.
In the further special case where $y$ follows a G–M model [i.e., where $\mathrm{var}(y) = \sigma^2 I$], $H_y = I$. When $H_y = I$, the case where $H_{yw} = 0$ and $H_w = I$ is often singled out for special attention. The case where $H_y = I$, $H_{yw} = 0$, and $H_w = I$ is encountered in applications where the realization of $w$ corresponds to a vector of future data points and where the augmented vector $\binom{y}{w}$ is assumed to follow a G–M model, the model matrix of which is $\binom{X}{\Lambda'}$.
Best linear unbiased prediction (under a G–M model). Suppose that the $N \times 1$ observable random vector $y$ follows a G–M model, in which case $E(y) = X\beta$ and $\mathrm{var}(y) = \sigma^2 I$. And consider the prediction of the $M \times 1$ unobservable random vector $w$ whose expected value is of the form
$$E(w) = \Lambda'\beta \qquad (10.32)$$
(where $\Lambda$ is a matrix of known constants). Assume that $\mathrm{cov}(y, w)$ and $\mathrm{var}(w)$ are of the form
$$\mathrm{cov}(y, w) = \sigma^2 H_{yw} \quad\text{and}\quad \mathrm{var}(w) = \sigma^2 H_w \qquad (10.33)$$
(where $H_{yw}$ and $H_w$ are known matrices). Assume also that $w$ is predictable or, equivalently, that $\Lambda'\beta$ is estimable.
For purposes of applying the results of the final part of the preceding subsection (Subsection a), take $\tau$ to be the $M \times 1$ vector (of linear combinations of the elements of $\beta$) defined as follows:
$$\tau = E(w) - [\mathrm{cov}(y, w)]'[\mathrm{var}(y)]^{-1}E(y) = \Lambda'\beta - H_{yw}'X\beta = (\Lambda' - H_{yw}'X)\beta. \qquad (10.34)$$
Clearly, $\tau$ is estimable, and its least squares estimator is the vector $\hat{\tau}_L(y)$ defined as follows:
$$\hat{\tau}_L(y) = (\Lambda' - H_{yw}'X)(X'X)^{-}X'y = \Lambda'(X'X)^{-}X'y - H_{yw}'P_X y \qquad (10.35)$$
[where $P_X = X(X'X)^{-}X'$]. Moreover, according to Theorem 5.6.1, $\hat{\tau}_L(y)$ is a linear unbiased estimator of $\tau$ and, in fact, is the BLUE (best linear unbiased estimator) of $\tau$. It is the BLUE in the sense that the difference between the MSE matrix $E\{[\tilde{\tau}_L(y) - \tau][\tilde{\tau}_L(y) - \tau]'\} = \mathrm{var}[\tilde{\tau}_L(y)]$ of an arbitrary linear unbiased estimator $\tilde{\tau}_L(y)$ of $\tau$ and the MSE matrix $E\{[\hat{\tau}_L(y) - \tau][\hat{\tau}_L(y) - \tau]'\} = \mathrm{var}[\hat{\tau}_L(y)]$ of the least squares estimator $\hat{\tau}_L(y)$ is nonnegative definite [and is equal to $0$ if and only if $\tilde{\tau}_L(y) = \hat{\tau}_L(y)$].
Now, let
$$\hat{w}_L(y) = \hat{\tau}_L(y) + [\mathrm{cov}(y, w)]'[\mathrm{var}(y)]^{-1}y = \Lambda'(X'X)^{-}X'y + H_{yw}'(I - P_X)y. \qquad (10.36)$$
Then, it follows from the results of the final part of Subsection a that $\hat{w}_L(y)$ is a linear unbiased predictor of $w$ and, in fact, is the BLUP (best linear unbiased predictor) of $w$. It is the BLUP in the sense that the difference between the MSE matrix $E\{[\tilde{w}_L(y) - w][\tilde{w}_L(y) - w]'\} = \mathrm{var}[\tilde{w}_L(y) - w]$ of an arbitrary linear unbiased predictor $\tilde{w}_L(y)$ of $w$ and the MSE matrix $E\{[\hat{w}_L(y) - w][\hat{w}_L(y) - w]'\} = \mathrm{var}[\hat{w}_L(y) - w]$ of $\hat{w}_L(y)$ is nonnegative definite [and is equal to $0$ if and only if $\tilde{w}_L(y) = \hat{w}_L(y)$].
In the special case where $M = 1$, the sense in which $\hat{w}_L(y)$ is the BLUP can [upon writing $\hat{w}_L(y)$ and $w$ for their one-dimensional (vector) counterparts] be restated as follows: the MSE of $\hat{w}_L(y)$ [or, equivalently, the variance of the prediction error of $\hat{w}_L(y)$] is smaller than that of any other linear unbiased predictor of $w$.
In light of result (6.7), the variance-covariance matrix of the least squares estimator of $\tau$ is
$$\mathrm{var}[\hat{\tau}_L(y)] = \sigma^2(\Lambda' - H_{yw}'X)(X'X)^{-}(\Lambda - X'H_{yw}). \qquad (10.37)$$
Accordingly, it follows from result (10.25) that the MSE matrix of the BLUP of $w$ or, equivalently, the variance-covariance matrix of the prediction error of the BLUP is
$$\mathrm{var}[\hat{w}_L(y) - w] = \sigma^2(\Lambda' - H_{yw}'X)(X'X)^{-}(\Lambda - X'H_{yw}) + \sigma^2(H_w - H_{yw}'H_{yw}). \qquad (10.38)$$
In the special case where $H_{yw} = 0$, we find that $\tau = \Lambda'\beta$, $\hat{w}_L(y) = \hat{\tau}_L(y)$, $\mathrm{var}[\hat{\tau}_L(y)] = \sigma^2\Lambda'(X'X)^{-}\Lambda$, and
$$\mathrm{var}[\hat{w}_L(y) - w] = \sigma^2\Lambda'(X'X)^{-}\Lambda + \sigma^2 H_w.$$
Note that even in this special case [where the BLUP of $w$ equals the BLUE of $\tau$ and where $\tau = E(w)$], the MSE matrix of the BLUP typically differs from that of the BLUE. The difference between the two MSE matrices [$\sigma^2 H_w$ in the special case and $\sigma^2(H_w - H_{yw}'H_{yw})$ in the general case] is nonnegative definite. This difference is attributable to the variability of $w$, which contributes to the variability of the prediction error $\hat{w}_L(y) - w$ but not to the variability of $\hat{\tau}_L(y) - \tau$.
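As a concrete illustration of formulas (10.35)–(10.38), the following Python sketch computes the BLUE of $\tau$, the BLUP of $w$, and the prediction-error variance matrix for a small assumed G–M setup. The matrices X, Lam, H_yw, H_w and the value of sigma2 are arbitrary hypothetical choices (not taken from the text), and a Moore–Penrose inverse is used as one convenient choice of generalized inverse $(X'X)^{-}$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed (hypothetical) G-M setup: N = 6 observations, P = 2, M = 1 quantity to predict.
X = np.column_stack([np.ones(6), np.arange(6.0)])             # N x P model matrix
beta, sigma2 = np.array([2.0, 0.5]), 0.25
Lam = np.array([[1.0], [3.0]])                                # P x M, so E(w) = Lam' beta
H_yw = np.array([[0.0], [0.0], [0.0], [0.2], [0.3], [0.5]])   # N x M
H_w = np.array([[1.0]])

# One realization of (y, w) consistent with the assumed first and second moments.
cov = sigma2 * np.block([[np.eye(6), H_yw], [H_yw.T, H_w]])
mean = np.concatenate([X @ beta, Lam.T @ beta])
y_and_w = rng.multivariate_normal(mean, cov)
y, w = y_and_w[:6], y_and_w[6:]

XtX_ginv = np.linalg.pinv(X.T @ X)                            # a generalized inverse (X'X)^-
P_X = X @ XtX_ginv @ X.T

tau_hat = (Lam.T - H_yw.T @ X) @ XtX_ginv @ X.T @ y           # BLUE of tau, equation (10.35)
w_hat = Lam.T @ XtX_ginv @ X.T @ y + H_yw.T @ (np.eye(6) - P_X) @ y   # BLUP, equation (10.36)

# Prediction-error variance matrix, equation (10.38).
D = Lam - X.T @ H_yw
mse_blup = sigma2 * (D.T @ XtX_ginv @ D) + sigma2 * (H_w - H_yw.T @ H_yw)
print(tau_hat, w_hat, w, mse_blup)
```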
Best linear translation-equivariant prediction (under a G–M model). Let us continue to consider the prediction of the $M \times 1$ unobservable random vector $w$ on the basis of the $N \times 1$ observable random vector $y$, doing so under the same conditions as in the preceding part of the present subsection. Thus, it is supposed that $y$ follows a G–M model, that $E(w)$ is of the form (10.32), that $\mathrm{cov}(y, w)$ and $\mathrm{var}(w)$ are of the form (10.33), and that $w$ is predictable. Further, define $\tau$, $\hat{\tau}_L(y)$, and $\hat{w}_L(y)$ as in equations (10.34), (10.35), and (10.36) [so that $\hat{\tau}_L(y)$ is the BLUE of $\tau$ and $\hat{w}_L(y)$ the BLUP of $w$].
Let us consider the translation-equivariant prediction of $w$. Denote by $\tilde{w}(y)$ an arbitrary predictor of $w$, and take
$$\tilde{\tau}(y) = \tilde{w}(y) - [\mathrm{cov}(y, w)]'[\mathrm{var}(y)]^{-1}y = \tilde{w}(y) - H_{yw}'y$$
to be the corresponding estimator of $\tau$—refer to the final part of Subsection a. Then, $\tilde{w}(y)$ is a translation-equivariant predictor (of $w$) if and only if $\tilde{\tau}(y)$ is a translation-equivariant estimator (of $\tau$), as can be readily verified. Further, $\tilde{w}(y)$ is a linear translation-equivariant predictor (of $w$) if and only if $\tilde{\tau}(y)$ is a linear translation-equivariant estimator (of $\tau$).
In light of Corollary 5.6.4, the estimator $\hat{\tau}_L(y)$ is a linear translation-equivariant estimator of $\tau$ and, in fact, is the best linear translation-equivariant estimator of $\tau$. It is the best linear translation-equivariant estimator in the sense that the difference between the MSE matrix $E\{[\tilde{\tau}_L(y) - \tau][\tilde{\tau}_L(y) - \tau]'\}$ of an arbitrary linear translation-equivariant estimator $\tilde{\tau}_L(y)$ of $\tau$ and the MSE matrix $E\{[\hat{\tau}_L(y) - \tau][\hat{\tau}_L(y) - \tau]'\}$ of $\hat{\tau}_L(y)$ is nonnegative definite [and is equal to $0$ if and only if $\tilde{\tau}_L(y) = \hat{\tau}_L(y)$]. And upon recalling the results of the final part of Subsection a, it follows that the predictor $\hat{w}_L(y)$ is a linear translation-equivariant predictor of $w$ and, in fact, is the best linear translation-equivariant predictor of $w$. It is the best linear translation-equivariant predictor in the sense that the difference between the MSE matrix $E\{[\tilde{w}_L(y) - w][\tilde{w}_L(y) - w]'\}$ of an arbitrary linear translation-equivariant predictor $\tilde{w}_L(y)$ of $w$ and the MSE matrix $E\{[\hat{w}_L(y) - w][\hat{w}_L(y) - w]'\} = \mathrm{var}[\hat{w}_L(y) - w]$ of $\hat{w}_L(y)$ is nonnegative definite [and is equal to $0$ if and only if $\tilde{w}_L(y) = \hat{w}_L(y)$]. In the special case where $M = 1$, the sense in which $\hat{w}_L(y)$ is the best linear translation-equivariant predictor can [upon writing $\hat{w}_L(y)$ and $w$ for their one-dimensional (vector) counterparts] be restated as follows: the MSE of $\hat{w}_L(y)$ is smaller than that of any other linear translation-equivariant predictor of $w$.
c. Conditional expected values: elliptical distributions
Let $w$ represent an $M \times 1$ random vector and $y$ an $N \times 1$ random vector. Suppose that the second-order moments of the joint distribution of $w$ and $y$ exist. And adopt the following notation: $\mu_y = E(y)$, $\mu_w = E(w)$, $V_y = \mathrm{var}(y)$, $V_{yw} = \mathrm{cov}(y, w)$, and $V_w = \mathrm{var}(w)$. Further, suppose that $V_y$ is nonsingular.
Let $\mu(y) = \mu_w + V_{yw}'V_y^{-1}(y - \mu_y)$. If $y$ is observable but $w$ is unobservable, we might wish to use the value of $y$ to predict the realization of $w$. If $\mu(\cdot)$ is determinable from what is known about the joint distribution of $w$ and $y$, we could use $\mu(y)$ to make the prediction; it would be the best linear predictor in the sense described in Part 2 of Subsection a. If enough more is known about the joint distribution of $w$ and $y$ that $E(w \mid y)$ is determinable, we might prefer to use $E(w \mid y)$ to make the prediction; it would be the best predictor in the sense described in Part 1 of Subsection a.
Under what circumstances is $E(w \mid y)$ equal to $\mu(y)$ (at least with probability 1) or, equivalently, under what circumstances is $E(w \mid y)$ linear (or at least "linear with probability 1")? As previously indicated (in Part 1 of Subsection a), one such circumstance is that where the joint distribution of $w$ and $y$ is MVN. More generally, $E(w \mid y)$ equals $\mu(y)$ (at least with probability 1) if the joint distribution of $w$ and $y$ is elliptical, as will now be shown.
Let $e = w - \mu(y)$, and observe that
$$\begin{pmatrix} e \\ y \end{pmatrix} = \begin{pmatrix} -\mu_w + V_{yw}'V_y^{-1}\mu_y \\ 0 \end{pmatrix} + \begin{pmatrix} I & -V_{yw}'V_y^{-1} \\ 0 & I \end{pmatrix}\begin{pmatrix} w \\ y \end{pmatrix}.$$
Observe also [in light of Part (2) of Theorem 3.5.9] that
$$E(e) = 0, \quad \mathrm{var}(e) = V_w - V_{yw}'V_y^{-1}V_{yw}, \quad\text{and}\quad \mathrm{cov}(e, y) = 0.$$
Now, suppose that the distribution of the vector $\binom{w}{y}$ is elliptical with mgf generator $\psi(\cdot)$. Then, it follows from the next-to-last part of Section 5.9c that the vector $\binom{e}{y}$ has an elliptical distribution with mean $\binom{0}{\mu_y}$, variance-covariance matrix
$$\begin{pmatrix} V_w - V_{yw}'V_y^{-1}V_{yw} & 0 \\ 0 & V_y \end{pmatrix},$$
and mgf generator $\psi(\cdot)$ and that the vector $\binom{-e}{y}$ has this same distribution. Thus, the conditional distribution of $e$ given $y$ is the same as that of $-e$ given $y$, so that the conditional distribution of $e$ is symmetrical about $0$ and hence $E(e \mid y) = 0$ (with probability 1). And since $w = \mu(y) + e$, we conclude that $E(w \mid y) = \mu(y)$ (with probability 1).
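A quick numerical check of this conclusion is to draw from a non-normal elliptical distribution, such as a multivariate t, form $e = w - \mu(y)$, and confirm that $E(e) \approx 0$ and $\mathrm{cov}(e, y) \approx 0$, which is what the argument above requires. The sketch below is hypothetical; the dimensions, scale matrix, and degrees of freedom are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed scale matrix for (w, y): first coordinate is w (M = 1), last two are y (N = 2).
Sigma = np.array([[1.5, 0.4, 0.2],
                  [0.4, 1.0, 0.3],
                  [0.2, 0.3, 2.0]])
nu, n = 7, 400_000                                   # degrees of freedom, Monte Carlo size

# Multivariate t draws (an elliptical, non-normal distribution): normal / sqrt(chi2_nu / nu).
z = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
x = z / np.sqrt(rng.chisquare(nu, size=(n, 1)) / nu)
w, y = x[:, :1], x[:, 1:]

# Moments of the multivariate t: mean 0, variance (nu/(nu-2)) * Sigma.
V = (nu / (nu - 2)) * Sigma
V_w, V_yw, V_y = V[:1, :1], V[1:, :1], V[1:, 1:]

e = w - y @ (np.linalg.inv(V_y) @ V_yw)              # e = w - mu(y), with mu_y = mu_w = 0
print(e.mean(axis=0))                                # approximately 0
print(e.T @ y / n)                                   # approximately cov(e, y) = 0
```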
Exercises
Exercise 1. Take the context to be that of estimating parametric functions of the form $\lambda'\beta$ from an $N \times 1$ observable random vector $y$ that follows a G–M, Aitken, or general linear model. Verify (1) that linear combinations of estimable functions are estimable and (2) that linear combinations of nonestimable functions are not necessarily nonestimable.
Exercise 2. Take the context to be that of estimating parametric functions of the form $\lambda'\beta$ from an $N \times 1$ observable random vector $y$ that follows a G–M, Aitken, or general linear model. And let $R = \mathrm{rank}(X)$.
(a) Verify (1) that there exists a set of $R$ linearly independent estimable functions; (2) that no set of estimable functions contains more than $R$ linearly independent estimable functions; and (3) that if the model is not of full rank (i.e., if $R < P$), then at least one and, in fact, at least $P - R$ of the individual parameters $\beta_1, \beta_2, \ldots, \beta_P$ are nonestimable.
(b) Show that the $j$th of the individual parameters $\beta_1, \beta_2, \ldots, \beta_P$ is estimable if and only if the $j$th element of every vector in $N(X)$ equals $0$ ($j = 1, 2, \ldots, P$).
Exercise 3. Show that for a parametric function of the form $\lambda'\beta$ to be estimable from an $N \times 1$ observable random vector $y$ that follows a G–M, Aitken, or general linear model, it is necessary and sufficient that
$$\mathrm{rank}(X', \lambda) = \mathrm{rank}(X).$$
Exercise 4. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M, Aitken, or general linear model. Further, take $y$ to be any value of $y$, and consider the quantity $\lambda'\tilde{b}$, where $\lambda$ is an arbitrary $P \times 1$ vector of constants and $\tilde{b}$ is any solution to the linear system $X'Xb = X'y$ (in the $P \times 1$ vector $b$). Show that if $\lambda'\tilde{b}$ is invariant to the choice of the solution $\tilde{b}$, then $\lambda'\beta$ is an estimable function. And discuss the implications of this result.
Exercise 5. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M, Aitken, or general linear model. And let $a$ represent an arbitrary $N \times 1$ vector of constants. Show that $a'y$ is the least squares estimator of its expected value $E(a'y)$ (i.e., of the parametric function $a'X\beta$) if and only if $a \in C(X)$.
Exercise 6. Let $U$ represent a subspace of the linear space $\mathbb{R}^M$ of all $M$-dimensional column vectors. Verify that the set $U^{\perp}$ (comprising all $M$-dimensional column vectors that are orthogonal to $U$) is a linear space.
Exercise 7. Let $X$ represent an $N \times P$ matrix. A $P \times N$ matrix $G$ is said to be a least squares generalized inverse of $X$ if it is a generalized inverse of $X$ (i.e., if $XGX = X$) and if, in addition, $(XG)' = XG$ (i.e., $XG$ is symmetric).
(a) Show that $G$ is a least squares generalized inverse of $X$ if and only if $X'XG = X'$.
(b) Using Part (a) (or otherwise), establish the existence of a least squares generalized inverse of $X$.
(c) Show that if $G$ is a least squares generalized inverse of $X$, then, for any $N \times Q$ matrix $Y$, the matrix $GY$ is a solution to the linear system $X'XB = X'Y$ (in the $P \times Q$ matrix $B$).
Exercise 8. Let $A$ represent an $M \times N$ matrix. An $N \times M$ matrix $H$ is said to be a minimum norm generalized inverse of $A$ if it is a generalized inverse of $A$ (i.e., if $AHA = A$) and if, in addition, $(HA)' = HA$ (i.e., $HA$ is symmetric).

(a) Show that H is a minimum norm generalized inverse of A if and only if H0 is a least squares
generalized inverse of A0 (where least squares generalized inverse is as defined in Exercise 7).
(b) Using the results of Exercise 7 (or otherwise), establish the existence of a minimum norm
generalized inverse of A.
(c) Show that if H is a minimum norm generalized inverse of A, then, for any vector b 2 C.A/,
kxk attains its minimum value over the set fx W Ax D bg [comprising all solutions to the linear
system Ax D b (in x)] uniquely at x D Hb (where kk denotes the usual norm).

Exercise 9. Let $X$ represent an $N \times P$ matrix, and let $G$ represent a $P \times N$ matrix that is subject to the following four conditions: (1) $XGX = X$; (2) $GXG = G$; (3) $(XG)' = XG$; and (4) $(GX)' = GX$.
(a) Show that if a $P \times P$ matrix $H$ is a minimum norm generalized inverse of $X'X$, then conditions (1)–(4) can be satisfied by taking $G = HX'$.
(b) Use Part (a) and the result of Part (b) of Exercise 8 (or other means) to establish the existence of a $P \times N$ matrix $G$ that satisfies conditions (1)–(4) and show that there is only one such matrix.
(c) Let $X^{+}$ represent the unique $P \times N$ matrix $G$ that satisfies conditions (1)–(4)—this matrix is customarily referred to as the Moore–Penrose inverse, and conditions (1)–(4) are customarily referred to as the Moore–Penrose conditions. Using Parts (a) and (b) and the results of Part (c) of Exercise 7 and Part (c) of Exercise 8 (or otherwise), show that $X^{+}y$ is a solution to the linear system $X'Xb = X'y$ (in $b$) and that $\|b\|$ attains its minimum value over the set $\{b : X'Xb = X'y\}$ (comprising all solutions to the linear system) uniquely at $b = X^{+}y$ (where $\|\cdot\|$ denotes the usual norm).
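The four Moore–Penrose conditions and the minimum-norm property in Part (c) of Exercise 9 are easy to check numerically. The sketch below is not part of the exercise; the rank-deficient matrix X and the data vector y are arbitrary assumptions, and numpy.linalg.pinv (which computes the Moore–Penrose inverse) plays the role of $X^{+}$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed rank-deficient model matrix (third column = sum of the first two) and data vector.
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 2.0, 3.0],
              [1.0, 3.0, 4.0]])
y = rng.standard_normal(4)

Xp = np.linalg.pinv(X)                               # Moore-Penrose inverse X^+

# Conditions (1)-(4).
print([np.allclose(X @ Xp @ X, X),
       np.allclose(Xp @ X @ Xp, Xp),
       np.allclose((X @ Xp).T, X @ Xp),
       np.allclose((Xp @ X).T, Xp @ X)])

# X^+ y solves the normal equations X'Xb = X'y ...
b_min = Xp @ y
print(np.allclose(X.T @ X @ b_min, X.T @ y))

# ... and has minimum norm: adding a null-space vector of X gives another solution of larger norm.
null_vec = np.array([1.0, 1.0, -1.0])                # X @ null_vec = 0 for this X
b_other = b_min + 0.7 * null_vec
print(np.allclose(X.T @ X @ b_other, X.T @ y), np.linalg.norm(b_min) < np.linalg.norm(b_other))
```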
Exercise 10. Consider further the alternative approach to the least squares computations, taking the formulation and the notation to be those of the final part of Section 5.4e.
(a) Let $\tilde{b} = L_1\tilde{h}_1 + L_2\tilde{h}_2$, where $\tilde{h}_2$ is an arbitrary $(P-K)$-dimensional column vector and $\tilde{h}_1$ is the solution to the linear system $R_1 h_1 = z_1 - R_2\tilde{h}_2$. Show that $\|\tilde{b}\|$ is minimized by taking
$$\tilde{h}_2 = [I + (R_1^{-1}R_2)'R_1^{-1}R_2]^{-1}(R_1^{-1}R_2)'R_1^{-1}z_1.$$
Do so by formulating this minimization problem as a least squares problem in which the role of $y$ is played by the vector $\binom{R_1^{-1}z_1}{0}$, the role of $X$ is played by the matrix $\binom{R_1^{-1}R_2}{I}$, and the role of $b$ is played by $\tilde{h}_2$.
(b) Let $O_1$ represent a $P \times K$ matrix with orthonormal columns and $T_1$ a $K \times K$ upper triangular matrix such that $\binom{R_1'}{R_2'} = O_1 T_1'$—the existence of a decomposition of this form can be established in much the same way as the existence of the QR decomposition (in which $T_1$ would be lower triangular rather than upper triangular). Further, take $O_2$ to be any $P \times (P-K)$ matrix such that the $P \times P$ matrix $O$ defined by $O = (O_1, O_2)$ is orthogonal.
(1) Show that $X = QT(LO)'$, where $T = \begin{pmatrix} T_1 & 0 \\ 0 & 0 \end{pmatrix}$.
(2) Show that $y - Xb = Q_1(z_1 - T_1 d_1) + Q_2 z_2$, where $d = (LO)'b$ and $d$ is partitioned as $d = \binom{d_1}{d_2}$.
(3) Show that $(y - Xb)'(y - Xb) = (z_1 - T_1 d_1)'(z_1 - T_1 d_1) + z_2'z_2$.
(3) Show that .y Xb/0 .y Xb/ D .z1 T1 d1 /0 .z1 T1 d1 / C z02 z2 .
(4) Taking dQ 1 to be the solution to the linear system T1 d1 D z1 (in d1 ), show that .y
248 Estimation and Prediction: Classical Approach

Xb/0 .y Xb/ attains a minimum value of z02 z2 and that it does so at a value bQ of b if and
dQ
 
only if bQ D LO Q 1 for some .P K/  1 vector dQ 2 .
d2
(5) Letting bQ represent an arbitrary one of the values of b at which .y Xb/0 .y Xb/ attains a
minimum value [and, as in Part (4), taking dQ 1 to be the solution to T1 d1 D z1 ], show that
Q 2 (where kk denotes the usual norm) attains a minimum value of dQ 0 dQ and that it does
kbk 1 1
Q1
 
d
so uniquely at bQ D LO .
0

Exercise 11. Verify that the difference (6.14) is a nonnegative definite matrix and that it equals $0$ if and only if $c + A'y = \ell(y)$.
Exercise 12. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M, Aitken, or general linear model. And let $s(y)$ represent any particular translation-equivariant estimator of an estimable linear combination $\lambda'\beta$ of the elements of the parametric vector $\beta$—e.g., $s(y)$ could be the least squares estimator of $\lambda'\beta$. Show that an estimator $t(y)$ of $\lambda'\beta$ is translation equivariant if and only if
$$t(y) = s(y) + d(y)$$
for some translation-invariant statistic $d(y)$.
Exercise 13. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M model. And let $y'Ay$ represent a quadratic unbiased nonnegative-definite estimator of $\sigma^2$, that is, a quadratic form in $y$ whose matrix $A$ is a symmetric nonnegative definite matrix of constants and whose expected value is $\sigma^2$.
(a) Show that $y'Ay$ is translation invariant.
(b) Suppose that the fourth-order moments of the distribution of the vector $e = (e_1, e_2, \ldots, e_N)'$ are such that (for $i, j, k, m = 1, 2, \ldots, N$) $E(e_i e_j e_k e_m)$ satisfies condition (7.38). For what choice of $A$ is the variance of the quadratic unbiased nonnegative-definite estimator $y'Ay$ a minimum? Describe your reasoning.
Exercise 14. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M model. Suppose further that the distribution of the vector $e = (e_1, e_2, \ldots, e_N)'$ has third-order moments $\gamma_{jkm} = E(e_j e_k e_m)$ ($j, k, m = 1, 2, \ldots, N$) and fourth-order moments $\gamma_{ijkm} = E(e_i e_j e_k e_m)$ ($i, j, k, m = 1, 2, \ldots, N$). And let $A = \{a_{ij}\}$ represent an $N \times N$ symmetric matrix of constants.
(a) Show that in the special case where the elements $e_1, e_2, \ldots, e_N$ of $e$ are statistically independent,
$$\mathrm{var}(y'Ay) = a'\Delta a + 4\beta'X'A\Lambda a + 2\sigma^4\,\mathrm{tr}(A^2) + 4\sigma^2\beta'X'A^2X\beta, \qquad (E.1)$$
where $\Delta$ is the $N \times N$ diagonal matrix whose $i$th diagonal element is $\gamma_{iiii} - 3\sigma^4$, where $\Lambda$ is the $N \times N$ diagonal matrix whose $i$th diagonal element is $\gamma_{iii}$, and where $a$ is the $N \times 1$ vector whose elements are the diagonal elements $a_{11}, a_{22}, \ldots, a_{NN}$ of $A$.
(b) Suppose that the elements $e_1, e_2, \ldots, e_N$ of $e$ are statistically independent, that (for $i = 1, 2, \ldots, N$) $\gamma_{iiii} = \gamma$ (for some scalar $\gamma$), and that all $N$ of the diagonal elements of the $P_X$ matrix are equal to each other. Show that the estimator $\hat\sigma^2 = \tilde{e}'\tilde{e}/(N - \mathrm{rank}\,X)$ [where $\tilde{e} = (I - P_X)y$] has minimum variance among all quadratic unbiased translation-invariant estimators of $\sigma^2$.
Exercise 15. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M model, and assume that the distribution of the vector $e$ of residual effects is MVN.
(a) Letting $\lambda'\beta$ represent an estimable linear combination of the elements of the parametric vector $\beta$, find a minimum-variance unbiased estimator of $(\lambda'\beta)^2$.
(b) Find a minimum-variance unbiased estimator of $\sigma^4$.
Exercise 16. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M model, and assume that the distribution of the vector $e$ of residual effects is MVN. Show that if $\sigma^2$ were known, $X'y$ would be a complete sufficient statistic.
Exercise 17. Suppose that $y$ is an $N \times 1$ observable random vector that follows a general linear model. Suppose further that the distribution of the vector $e$ of residual effects is MVN or, more generally, that the distribution of $e$ is known up to the value of the vector $\theta$. And take $h(y)$ to be any (possibly vector-valued) translation-invariant statistic.
(a) Show that if $\theta$ were known, $h(y)$ would be an ancillary statistic—for a definition of ancillarity, refer, e.g., to Casella and Berger (2002, def. 6.2.16) or to Lehmann and Casella (1998, p. 41).
(b) Suppose that $X'y$ would be a complete sufficient statistic if $\theta$ were known. Show (1) that the least squares estimator of any estimable linear combination $\lambda'\beta$ of the elements of the parametric vector $\beta$ has minimum variance among all unbiased estimators, (2) that any vector of least squares estimators of estimable linear combinations (of the elements of $\beta$) is distributed independently of $h(y)$, and (3) (using the result of Exercise 12 or otherwise) that the least squares estimator of any estimable linear combination $\lambda'\beta$ has minimum mean squared error among all translation-equivariant estimators. {Hint [for Part (2)]. Make use of Basu's theorem—refer, e.g., to Lehmann and Casella (1998, p. 42) for a statement of Basu's theorem.}
Exercise 18. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M model. Suppose further that the distribution of the vector $e$ of residual effects is MVN. And, letting $\tilde{e} = y - P_X y$, take $\tilde\sigma^2 = \tilde{e}'\tilde{e}/N$ to be the ML estimator of $\sigma^2$ and $\hat\sigma^2 = \tilde{e}'\tilde{e}/(N - \mathrm{rank}\,X)$ to be the unbiased estimator.
(a) Find the bias and the MSE of the ML estimator $\tilde\sigma^2$.
(b) Compare the MSE of the ML estimator $\tilde\sigma^2$ with that of the unbiased estimator $\hat\sigma^2$: for which values of $N$ and of $\mathrm{rank}\,X$ is the MSE of the ML estimator smaller than that of the unbiased estimator and for which values is it larger?
Exercise 19. Suppose that $y$ is an $N \times 1$ observable random vector that follows a general linear model, that the distribution of the vector $e$ of residual effects is MVN, and that the variance-covariance matrix $V(\theta)$ of $e$ is nonsingular (for all $\theta \in \Theta$). And, letting $K = N - \mathrm{rank}\,X$, take $R$ to be any $N \times K$ matrix (of constants) of full column rank $K$ such that $X'R = 0$, and (as in Section 5.9b) define $z = R'y$. Further, let $w = s(z)$, where $s(\cdot)$ is a $K \times 1$ vector of real-valued functions that defines a one-to-one mapping of $\mathbb{R}^K$ onto some set $\mathcal{W}$.
(a) Show that $w$ is a maximal invariant.
(b) Let $f_1(\cdot\,; \theta)$ represent the pdf of the distribution of $z$, and assume that $s(\cdot)$ is such that the distribution of $w$ has a pdf, say $f_2(\cdot\,; \theta)$, that is obtainable from $f_1(\cdot\,; \theta)$ via an application of the basic formula (e.g., Bickel and Doksum 2001, sec. B.2) for a change of variables. And, taking $L_1(\theta; R'y)$ and $L_2[\theta; s(R'y)]$ (where $y$ denotes the observed value of $y$) to be the likelihood functions defined by $L_1(\theta; R'y) = f_1(R'y; \theta)$ and $L_2[\theta; s(R'y)] = f_2[s(R'y); \theta]$, show that $L_1(\theta; R'y)$ and $L_2[\theta; s(R'y)]$ differ from each other by no more than a multiplicative constant.
Exercise 20. Suppose that $y$ is an $N \times 1$ observable random vector that follows a general linear model, that the distribution of the vector $e$ of residual effects is MVN, and that the variance-covariance matrix $V(\theta)$ of $e$ is nonsingular (for all $\theta \in \Theta$). Further, let $z = R'y$, where $R$ is any $N \times (N - \mathrm{rank}\,X)$ matrix (of constants) of full column rank $N - \mathrm{rank}\,X$ such that $X'R = 0$; and let $u = X_*'y$, where

X is any N  .rank X/ matrix (of constants) whose columns form a basis for C.X/. And denote by
y the observed value of y.
0
(a) Verify
 that the likelihood function that would result from regarding the observed value .X ; R/ y
u
of as the data vector differs by no more than a multiplicative constant from that obtained
z
by regarding the observed value y of y as the data vector.
(b) Let f0 .  j  I ˇ; / represent the pdf of the conditional distribution of u given z. And take
L0 Œˇ; I .X ; R/0 y to be the function of ˇ and  defined by L0 Œˇ; I .X ; R/0 y D
f0 .X0 y j R0 yI ˇ; /. Show that
L0 Œˇ; I .X ; R/0 y D .2/ .rank X/=2
jX0 X jjX0 ŒV ./ 1 X j1=2
1

Q
 expf 21 Œˇ./ ˇ0 X0 ŒV ./ 1 Q
XŒˇ./ ˇg;
Q
where ˇ./ is any solution to the linear system X0 ŒV ./ 1
Xb D X0 ŒV ./ 1
y (in the P  1
vector b).
(c) In connection with Part (b), show (1) that
Q
Œˇ./ ˇ0 X0 ŒV ./ 1 Q
XŒˇ./ ˇ
D .y Xˇ/0 ŒV ./ 1
XfX0 ŒV ./ 1
Xg X0 ŒV ./ 1.y Xˇ/
and (2) that the distribution of the random variable s defined by

s D .y Xˇ/0 ŒV ./ 1
XfX0 ŒV ./ 1
Xg X0 ŒV ./ 1.y Xˇ/

does not depend on ˇ.

Exercise 21. Suppose that $z$ is an $S \times 1$ observable random vector and that $z \sim N(0, \sigma^2 I)$, where $\sigma$ is a (strictly) positive unknown parameter.
(a) Show that $z'z$ is a complete sufficient statistic.
(b) Take $w(z)$ to be the $S$-dimensional vector-valued statistic defined by $w(z) = (z'z)^{-1/2}z$—$w(z)$ is defined for $z \neq 0$ and hence with probability 1. Show that $z'z$ and $w(z)$ are statistically independent. (Hint. Make use of Basu's theorem.)
(c) Show that any estimator of $\sigma^2$ of the form $z'z/k$ (where $k$ is a nonzero constant) is scale equivariant—an estimator, say $t(z)$, of $\sigma^2$ is to be regarded as scale equivariant if for every (strictly) positive scalar $c$ (and for every nonnull value of $z$) $t(cz) = c^2 t(z)$.
(d) Let $t_0(z)$ represent any particular scale-equivariant estimator of $\sigma^2$ such that $t_0(z) \neq 0$ for $z \neq 0$. Show that an estimator $t(z)$ of $\sigma^2$ is scale equivariant if and only if, for some function $u(\cdot)$ such that $u(cz) = u(z)$ (for every strictly positive constant $c$ and every nonnull value of $z$),
$$t(z) = u(z)t_0(z) \quad\text{for } z \neq 0. \qquad (E.2)$$
(e) Show that a function $u(z)$ of $z$ is such that $u(cz) = u(z)$ (for every strictly positive constant $c$ and every nonnull value of $z$) if and only if $u(z)$ depends on the value of $z$ only through $w(z)$ [where $w(z)$ is as defined in Part (b)].
(f) Show that the estimator $z'z/(S+2)$ has minimum MSE among all scale-equivariant estimators of $\sigma^2$.
Exercise 22. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M model and that the distribution of the vector $e$ of residual effects is MVN. Using the result of Part (f) of Exercise 21 (or otherwise), show that the Hodges–Lehmann estimator $y'(I - P_X)y/[N - \mathrm{rank}(X) + 2]$ has minimum MSE among all translation-invariant estimators of $\sigma^2$ that are scale equivariant—a
translation-invariant estimator, say $t(y)$, of $\sigma^2$ is to be regarded as scale equivariant if $t(cy) = c^2 t(y)$ for every (strictly) positive scalar $c$ and for every nonnull value of $y$ in $N(X')$.
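Exercises 18 and 22 compare the unbiased estimator, the ML estimator, and the Hodges–Lehmann estimator of $\sigma^2$. The following Monte Carlo sketch is a hypothetical check, not part of the exercises: the model matrix, $\beta$, $\sigma^2$, and the replication count are arbitrary assumptions. Under these assumptions the divisor $N - \mathrm{rank}(X) + 2$ should give the smallest estimated MSE.

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed small G-M model with normal errors.
N, sigma2 = 12, 2.0
t = np.arange(N, dtype=float)
X = np.column_stack([np.ones(N), t, t ** 2])
beta = np.array([1.0, -0.3, 0.05])
r = np.linalg.matrix_rank(X)
P_X = X @ np.linalg.pinv(X.T @ X) @ X.T

reps = 100_000
y = X @ beta + np.sqrt(sigma2) * rng.standard_normal((reps, N))
rss = np.einsum('ij,ij->i', y @ (np.eye(N) - P_X), y)          # y'(I - P_X)y per replicate

for name, divisor in [("unbiased", N - r), ("ML", N), ("Hodges-Lehmann", N - r + 2)]:
    est = rss / divisor
    print(name, np.mean((est - sigma2) ** 2))                  # Monte Carlo MSE
```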
Exercise 23. Let $z = (z_1, z_2, \ldots, z_M)'$ represent an $M$-dimensional random (column) vector that has an absolutely continuous distribution with a pdf $f(\cdot)$. And suppose that for some (nonnegative) function $g(\cdot)$ (of a single nonnegative variable), $f(z) \propto g(z'z)$ (in which case the distribution of $z$ is spherical). Show (for $i = 1, 2, \ldots, M$) that $E(z_i^2)$ exists if and only if $\int_0^\infty s^{M+1}g(s^2)\,ds < \infty$, in which case
$$\mathrm{var}(z_i) = E(z_i^2) = \frac{\int_0^\infty s^{M+1}g(s^2)\,ds}{M\int_0^\infty s^{M-1}g(s^2)\,ds}.$$
Exercise 24. Let $z$ represent an $N$-dimensional random column vector, and let $z_*$ represent an $M$-dimensional subvector of $z$ (where $M < N$). And suppose that the distributions of $z$ and $z_*$ are absolutely continuous with pdfs $f(\cdot)$ and $f_*(\cdot)$, respectively. Suppose also that there exist (nonnegative) functions $g(\cdot)$ and $g_*(\cdot)$ (of a single nonnegative variable) such that (for every value of $z$) $f(z) = g(z'z)$ and (for every value of $z_*$) $f_*(z_*) = g_*(z_*'z_*)$ (in which case the distributions of $z$ and $z_*$ are spherical).
(a) Show that (for $v \geq 0$)
$$g_*(v) = \frac{\pi^{(N-M)/2}}{\Gamma[(N-M)/2]}\int_v^\infty (u - v)^{[(N-M)/2]-1}g(u)\,du.$$
(b) Show that if $N - M = 2$, then (for $v > 0$)
$$g(v) = -\frac{1}{\pi}\,g_*'(v),$$
where $g_*'(\cdot)$ is the derivative of $g_*(\cdot)$.
Exercise 25. Let $y$ represent an $N \times 1$ random vector and $w$ an $M \times 1$ random vector. Suppose that the second-order moments of the joint distribution of $y$ and $w$ exist, and adopt the following notation: $\mu_y = E(y)$, $\mu_w = E(w)$, $V_y = \mathrm{var}(y)$, $V_{yw} = \mathrm{cov}(y, w)$, and $V_w = \mathrm{var}(w)$. Further, assume that $V_y$ is nonsingular.
(a) Show that the matrix $V_w - V_{yw}'V_y^{-1}V_{yw} - E[\mathrm{var}(w \mid y)]$ is nonnegative definite and that it equals $0$ if and only if (for some nonrandom vector $c$ and some nonrandom matrix $A$) $E(w \mid y) = c + A'y$ (with probability 1).
(b) Show that the matrix $\mathrm{var}[E(w \mid y)] - V_{yw}'V_y^{-1}V_{yw}$ is nonnegative definite and that it equals $0$ if and only if (for some nonrandom vector $c$ and some nonrandom matrix $A$) $E(w \mid y) = c + A'y$ (with probability 1).
Exercise 26. Let $y$ represent an $N \times 1$ observable random vector and $w$ an $M \times 1$ unobservable random vector. Suppose that the second-order moments of the joint distribution of $y$ and $w$ exist, and adopt the following notation: $\mu_y = E(y)$, $\mu_w = E(w)$, $V_y = \mathrm{var}(y)$, $V_{yw} = \mathrm{cov}(y, w)$, and $V_w = \mathrm{var}(w)$. Assume that $\mu_y$, $\mu_w$, $V_y$, $V_{yw}$, and $V_w$ are known. Further, define $\mu(y) = \mu_w + V_{yw}'V_y^{-}(y - \mu_y)$, and take $t(y)$ to be an ($M \times 1$)-dimensional vector-valued function of the form $t(y) = c + A'y$, where $c$ is a vector of constants and $A$ is an $N \times M$ matrix of constants. Extend various of the results of Section 5.10a (to the case where $V_y$ may be singular) by using Theorem 3.5.11 to show (1) that $\mu(y)$ is the best linear predictor of $w$ in the sense that the difference between the matrix $E\{[t(y) - w][t(y) - w]'\}$ and the matrix $\mathrm{var}[\mu(y) - w]$ [which is the MSE matrix of $\mu(y)$] equals the matrix $E\{[t(y) - \mu(y)][t(y) - \mu(y)]'\}$, which is nonnegative definite and which equals $0$ if and only if $t(y) = \mu(y)$ for every value of $y$ such that $y - \mu_y \in C(V_y)$, (2) that $\Pr[y - \mu_y \in C(V_y)] = 1$, and (3) that $\mathrm{var}[\mu(y) - w] = V_w - V_{yw}'V_y^{-}V_{yw}$.
Exercise 27. Suppose that $y$ is an $N \times 1$ observable random vector that follows a G–M model, and take $w$ to be an $M \times 1$ unobservable random vector whose value is to be predicted. Suppose further that $E(w)$ is of the form $E(w) = \Lambda'\beta$ (where $\Lambda$ is a matrix of known constants) and that $\mathrm{cov}(y, w)$ is of the form $\mathrm{cov}(y, w) = \sigma^2 H_{yw}$ (where $H_{yw}$ is a known matrix). Let $\tau = (\Lambda' - H_{yw}'X)\beta$, denote by $\tilde{w}(y)$ an arbitrary predictor (of $w$), and define $\tilde{\tau}(y) = \tilde{w}(y) - H_{yw}'y$. Verify that $\tilde{w}(y)$ is a translation-equivariant predictor (of $w$) if and only if $\tilde{\tau}(y)$ is a translation-equivariant estimator of $\tau$.
Bibliographic and Supplementary Notes
§2. In some presentations, the use of the term translation (or location) invariance is extended to include
what is herein referred to as translation equivariance.
§3e. For an extensive (book-length) discussion of mixture data, refer to Cornell (2002).
§4c. My acquaintance with the term conjugate normal equations came through some class notes authored
by Oscar Kempthorne.
§4d. For a discussion of projections that is considerably more extensive and at a somewhat more general
level than that provided herein, refer, for example, to Harville (1997, chaps. 12 and 17).
§7a. For a relatively extensive discussion of the vec and vech operations and of Kronecker products, refer,
for example, to Chapter 16 of Harville’s (1997) book and to the references cited therein.
§7c. Justification for referring to the estimator (7.44) as the Hodges–Lehmann estimator is provided by
results presented by Hodges and Lehmann in their 1951 paper. Refer to the expository note by David (2009) for
some discussion of a historical nature that relates to the statistical independence of the residual sum of squares
and a least squares estimator.
§7d. The results in this subsection that pertain to the minimum-variance quadratic unbiased translation-invariant estimation of $\sigma^2$ (and the related results that are the subject of Exercise 14) are variations on the results of Atiqullah (1962) on the minimum-variance quadratic unbiased nonnegative-definite estimation of $\sigma^2$, which are covered by Ravishanker and Dey (2002) in their Section 4.4—Exercise 13 serves to relate the minimum-variance quadratic unbiased nonnegative-definite estimation of $\sigma^2$ to the minimum-variance quadratic unbiased translation-invariant estimation of $\sigma^2$.
§9b. REML originated with the work of Patterson and R. Thompson (1971)—while related ideas can be
found in earlier work by others (e.g., W. A. Thompson, Jr. 1962), it was Patterson and R. Thompson who
provided the kind of substantive development that was needed for REML to become a viable alternative to
ordinary ML. The discussion of maximal invariants is based on results presented (in a more general context)
in Section 6.2 of Lehmann and Romano’s (2005b) book. Refer to Verbyla (1990) and to LaMotte (2007) for
discussion of various matters pertaining to the derivation of expression (9.38) (and related expressions) for the
log-likelihood function $\ell(\theta; R'y)$ employed in REML in making inferences about functions of $\theta$.
§9c. Refer, e.g., to Kollo and von Rosen (2005, table 2.3.1) or Fang, Kotz, and Ng (1990, table 3.1) for
a table [originating with Jensen (1985)] that characterizes (in terms of the pdf or the characteristic function)
various subclasses of multidimensional spherical distributions.
§10a. The approach (to point prediction) taken in this subsection is essentially the same as that taken in
Harville’s (1985) paper.
Exercises 7, 8, and 9. For a relatively extensive discussion of generalized inverses that satisfy one or
more of Moore–Penrose conditions (2)–(4) [as well as Moore–Penrose condition (1)], including least squares
generalized inverses, minimum norm generalized inverses, and the Moore–Penrose inverse itself, refer, e.g., to
Harville (1997, chap. 20).
Exercise 20. For some general discussion bearing on the implications of Parts (b) and (c) of Exercise 20,
refer to Sprott (1975).
Exercise 25. Exercise 25 is based on results presented by Harville (2003a).
6 Some Relevant Distributions and Their Properties
The multivariate normal distribution was introduced and was discussed extensively in Section 3.5. A
broader class of multivariate distributions, comprising so-called elliptical distributions, was consid-
ered in Section 5.9c. Numerous results on the first- and second-order moments of linear and quadratic
forms (in random vectors) were presented in Chapter 3 and in Section 5.7b.
Knowledge of the multivariate normal distribution and of other elliptical distributions and knowl-
edge of results on the first- and second-order moments of linear and quadratic forms provide a more-
or-less adequate background for the discussion of the classical approach to point estimation and
prediction, which was the subject of Chapter 5. However, when it comes to extending the results
on point estimation and prediction to the construction and evaluation of confidence regions and of
test procedures, this knowledge, while still relevant, is far from adequate. It needs to be augmented
with a knowledge of the distributions of certain functions of normally distributed random vectors
and a knowledge of various related distributions and with a knowledge of the properties of such
distributions. It is these distributions and their properties that form the subject matter of the present
chapter.

6.1 Chi-Square, Gamma, Beta, and Dirichlet Distributions


Let $z = (z_1, z_2, \ldots, z_N)'$ represent an $N \times 1$ random vector whose distribution is $N(0, I)$ or, equivalently, whose $N$ elements are distributed independently and identically as $N(0, 1)$. The sum of squares $z'z = \sum_{i=1}^N z_i^2$ of the elements of $z$ has a distribution that is known as a chi-square (or chi-squared) distribution—this distribution depends on the number $N$, which is referred to as the degrees of freedom of the distribution. Chi-square distributions (and various related distributions) play an important role in making statistical inferences on the basis of linear statistical models. Chi-square distributions belong to a broader class of distributions known as gamma distributions. Gamma distributions, including chi-square distributions, give rise to other important distributions known as beta distributions or, more generally, Dirichlet distributions. Some relevant background is provided in what follows.
a. Gamma distribution
For strictly positive scalars $\alpha$ and $\beta$, let $f(\cdot)$ represent the function defined (on the real line) by
$$f(x) = \begin{cases} \dfrac{1}{\Gamma(\alpha)\beta^{\alpha}}\,x^{\alpha-1}e^{-x/\beta}, & \text{for } 0 < x < \infty, \\ 0, & \text{elsewhere.} \end{cases} \qquad (1.1)$$
Clearly, $f(x) \geq 0$ for $-\infty < x < \infty$. And $\int_{-\infty}^{\infty} f(x)\,dx = 1$, as is evident upon introducing the change of variable $w = x/\beta$ and recalling the definition—refer to expression (3.5.5)—of the gamma function. Thus, the function $f(\cdot)$ qualifies as a pdf (probability density function). The distribution
determined by this pdf is known as the gamma distribution (with parameters $\alpha$ and $\beta$). Let us use the symbol $\mathrm{Ga}(\alpha, \beta)$ to denote this distribution.
If a random variable $x$ has a $\mathrm{Ga}(\alpha, \beta)$ distribution, then for any (strictly) positive constant $c$,
$$cx \sim \mathrm{Ga}(\alpha, c\beta), \qquad (1.2)$$
as can be readily verified.
Suppose that two random variables $w_1$ and $w_2$ are distributed independently as $\mathrm{Ga}(\alpha_1, \beta)$ and $\mathrm{Ga}(\alpha_2, \beta)$, respectively. Let
$$w = w_1 + w_2 \quad\text{and}\quad s = \frac{w_1}{w_1 + w_2}. \qquad (1.3)$$
And note that equalities (1.3) define a one-to-one transformation from the rectangular region $\{w_1, w_2 : w_1 > 0,\ w_2 > 0\}$ onto the rectangular region
$$\{w, s : w > 0,\ 0 < s < 1\}. \qquad (1.4)$$
Note also that the inverse transformation is defined by
$$w_1 = sw \quad\text{and}\quad w_2 = (1 - s)w.$$
We find that
$$\det\begin{pmatrix} \partial w_1/\partial w & \partial w_1/\partial s \\ \partial w_2/\partial w & \partial w_2/\partial s \end{pmatrix} = \det\begin{pmatrix} s & w \\ 1 - s & -w \end{pmatrix} = -w.$$
Let $f(\cdot, \cdot)$ represent the pdf of the joint distribution of $w$ and $s$, and $f_1(\cdot)$ and $f_2(\cdot)$ the pdfs of the distributions of $w_1$ and $w_2$, respectively. Then, for values of $w$ and $s$ in the region (1.4),
$$f(w, s) = f_1(sw)\,f_2[(1-s)w]\,|-w| = \frac{1}{\Gamma(\alpha_1)\Gamma(\alpha_2)\beta^{\alpha_1+\alpha_2}}\,w^{\alpha_1+\alpha_2-1}s^{\alpha_1-1}(1-s)^{\alpha_2-1}e^{-w/\beta}$$
—for values of $w$ and $s$ outside region (1.4), $f(w, s) = 0$.
Clearly,
$$f(w, s) = g(w)\,h(s) \quad\text{(for all values of } w \text{ and } s\text{)}, \qquad (1.5)$$
where $g(\cdot)$ and $h(\cdot)$ are functions defined as follows:
$$g(w) = \begin{cases} \dfrac{1}{\Gamma(\alpha_1+\alpha_2)\beta^{\alpha_1+\alpha_2}}\,w^{\alpha_1+\alpha_2-1}e^{-w/\beta}, & \text{for } w > 0, \\ 0, & \text{elsewhere,} \end{cases} \qquad (1.6)$$
and
$$h(s) = \begin{cases} \dfrac{\Gamma(\alpha_1+\alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)}\,s^{\alpha_1-1}(1-s)^{\alpha_2-1}, & \text{for } 0 < s < 1, \\ 0, & \text{elsewhere.} \end{cases} \qquad (1.7)$$
The function $g(\cdot)$ is seen to be the pdf of the $\mathrm{Ga}(\alpha_1+\alpha_2, \beta)$ distribution. And because $h(s) \geq 0$ (for every value of $s$) and because
$$\int_{-\infty}^{\infty} h(s)\,ds = \int_{-\infty}^{\infty} h(s)\,ds\int_{-\infty}^{\infty} g(w)\,dw = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} f(w, s)\,dw\,ds = 1, \qquad (1.8)$$
$h(\cdot)$, like $g(\cdot)$, is a pdf; it is the pdf of the distribution of the random variable $s = w_1/(w_1+w_2)$. Moreover, the random variables $w$ and $s$ are distributed independently.
Based on what has been established, we have the following result.

Theorem 6.1.1. If two random variables w1 and w2 are distributed independently as Ga.˛1 ; ˇ/
and Ga.˛2 ; ˇ/, respectively, then (1) w1 C w2 is distributed as Ga.˛1 C ˛2 ; ˇ/ and (2) w1 C w2
is distributed independently of w1 =.w1 C w2 /.
By employing a simple induction argument, we can establish the following generalization of the
first part of Theorem 6.1.1.
Theorem 6.1.2. If N random variables w1 ; w2 ; : : : ; wN are distributed independently as
Ga.˛1 ; ˇ/; Ga.˛2 ; ˇ/; : : : ; Ga.˛N ; ˇ/, respectively, then w1 C w2 C    C wN is distributed as
Ga.˛1 C ˛2 C    C ˛N ; ˇ/.

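A quick Monte Carlo check of Theorem 6.1.1 is sketched below (the parameter values and sample size are arbitrary assumptions): the sum of two independent gammas with a common scale should match the mean $\alpha\beta$ and variance $\alpha\beta^2$ of $\mathrm{Ga}(\alpha_1+\alpha_2, \beta)$, and the sum should be uncorrelated with the ratio $w_1/(w_1+w_2)$.

```python
import numpy as np

rng = np.random.default_rng(5)

alpha1, alpha2, beta, n = 2.5, 4.0, 1.5, 500_000      # assumed shape and scale parameters
w1 = rng.gamma(shape=alpha1, scale=beta, size=n)
w2 = rng.gamma(shape=alpha2, scale=beta, size=n)

w, s = w1 + w2, w1 / (w1 + w2)

# Part (1): w should behave like Ga(alpha1 + alpha2, beta), with mean alpha*beta and
# variance alpha*beta^2.
a = alpha1 + alpha2
print(w.mean(), a * beta)
print(w.var(), a * beta ** 2)

# Part (2): w and s are independent, so in particular their correlation is (approximately) zero.
print(np.corrcoef(w, s)[0, 1])

# s itself has the beta pdf (1.7); its mean is alpha1 / (alpha1 + alpha2).
print(s.mean(), alpha1 / (alpha1 + alpha2))
```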
b. Beta distribution and function
The distribution with pdf $h(\cdot)$ defined by expression (1.7) is known as the beta distribution (with parameters $\alpha_1$ and $\alpha_2$) and is of interest in its own right. Let us use the symbol $\mathrm{Be}(\alpha_1, \alpha_2)$ to denote this distribution.
Take $B(\cdot, \cdot)$ to be the function (whose domain consists of the coordinates of the points in the first quadrant of the plane) defined by
$$B(y, z) = \int_0^1 t^{y-1}(1-t)^{z-1}\,dt \qquad (y > 0,\ z > 0). \qquad (1.9)$$
This function is known as the beta function. The beta function is expressible in terms of the gamma function:
$$B(y, z) = \frac{\Gamma(y)\Gamma(z)}{\Gamma(y+z)}, \qquad (1.10)$$
as is evident from result (1.8).
Note that the pdf $h(\cdot)$ of the $\mathrm{Be}(\alpha_1, \alpha_2)$ distribution can be reexpressed in terms of the beta function. We have that
$$h(s) = \begin{cases} \dfrac{1}{B(\alpha_1, \alpha_2)}\,s^{\alpha_1-1}(1-s)^{\alpha_2-1}, & \text{for } 0 < s < 1, \\ 0, & \text{elsewhere.} \end{cases} \qquad (1.11)$$
For $0 \leq x \leq 1$ and for $y > 0$ and $z > 0$, define
$$I_x(y, z) = \frac{1}{B(y, z)}\int_0^x t^{y-1}(1-t)^{z-1}\,dt. \qquad (1.12)$$
The function $I_x(\cdot, \cdot)$ is known as the incomplete beta function ratio. Clearly, the function $F(\cdot)$ defined (in terms of the incomplete beta function ratio) by
$$F(x) = I_x(\alpha_1, \alpha_2) \qquad (0 \leq x \leq 1) \qquad (1.13)$$
coincides with the cdf (cumulative distribution function) of the $\mathrm{Be}(\alpha_1, \alpha_2)$ distribution over the interval $0 \leq x \leq 1$.
Upon making the change of variable $r = 1 - t$ in expression (1.12), we find that
$$I_x(y, z) = \frac{1}{B(z, y)}\int_{1-x}^1 r^{z-1}(1-r)^{y-1}\,dr = \frac{1}{B(z, y)}\left[B(z, y) - \int_0^{1-x} r^{z-1}(1-r)^{y-1}\,dr\right].$$
Thus,
$$I_x(y, z) = 1 - I_{1-x}(z, y). \qquad (1.14)$$
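In SciPy, the regularized incomplete beta function scipy.special.betainc(y, z, x) computes exactly the ratio $I_x(y, z)$ of (1.12), so it can serve as the $\mathrm{Be}(\alpha_1, \alpha_2)$ cdf. The short sketch below is an illustration with arbitrarily chosen parameter values; it also checks the reflection identity (1.14).

```python
import numpy as np
from scipy.special import betainc
from scipy.stats import beta as beta_dist

a1, a2 = 2.5, 4.0                                    # assumed parameters alpha_1, alpha_2
x = np.linspace(0.01, 0.99, 5)

# I_x(a1, a2) coincides with the Be(a1, a2) cdf, as in (1.13).
print(betainc(a1, a2, x))
print(beta_dist.cdf(x, a1, a2))

# Reflection identity (1.14): I_x(y, z) = 1 - I_{1-x}(z, y).
print(np.allclose(betainc(a1, a2, x), 1.0 - betainc(a2, a1, 1.0 - x)))
```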
c. Chi-square distribution: definition and relationship to the gamma distribution
Let $z = (z_1, z_2, \ldots, z_N)'$ represent an $N$-dimensional random (column) vector whose distribution is $N(0, I)$ or, equivalently, whose elements are distributed independently and identically as $N(0, 1)$. The distribution of the sum of squares $z'z = \sum_{i=1}^N z_i^2$ is known as the chi-square distribution. The positive integer $N$ is regarded as a parameter of the chi-square distribution—there is a family of chi-square distributions, one for each value of $N$. It is customary to refer to the parameter $N$ as the degrees of freedom of the distribution.
Let us find the pdf of the distribution of $z'z$. It is fruitful to begin by finding the pdf of the distribution of $z^2$, where $z$ is a random variable whose distribution is $N(0, 1)$. Define $u = z^2$. And denote by $f(\cdot)$ the pdf of the $N(0, 1)$ distribution, and by $h(\cdot)$ the pdf of the distribution of $u$. Note that there are two values of $z$ that give rise to each nonzero value of $u$, namely, $z = \pm\sqrt{u}$. Note also that $d\sqrt{u}/du = \left(2\sqrt{u}\right)^{-1}$. Accordingly, using standard techniques for a change of variable (e.g., Casella and Berger 2002, sec. 2.1), we find that, for $u > 0$,
$$h(u) = f\!\left(-\sqrt{u}\right)\left|-\left(2\sqrt{u}\right)^{-1}\right| + f\!\left(\sqrt{u}\right)\left(2\sqrt{u}\right)^{-1} = \frac{1}{\sqrt{2\pi u}}\,e^{-u/2}$$
and that, for $u \leq 0$, $h(u) = 0$.
Upon recalling that $\Gamma\!\left(\tfrac12\right) = \sqrt{\pi}$, we find that the pdf $h(\cdot)$ is identical to the pdf of the $\mathrm{Ga}\!\left(\tfrac12, 2\right)$ distribution. Thus, $z^2 \sim \mathrm{Ga}\!\left(\tfrac12, 2\right)$, so that $z_1^2, z_2^2, \ldots, z_N^2$ are distributed independently—functions of statistically independent random variables are statistically independent—and identically as $\mathrm{Ga}\!\left(\tfrac12, 2\right)$. And upon applying Theorem 6.1.2, it follows that
$$z'z = \sum_{i=1}^N z_i^2 \sim \mathrm{Ga}\!\left(\tfrac{N}{2}, 2\right). \qquad (1.15)$$
Instead of defining the chi-square distribution (with $N$ degrees of freedom) in terms of the $N$-variate standard normal distribution (as was done herein where it was defined to be the distribution of $z'z$), some authors define it directly in terms of the gamma distribution. Specifically, they define the chi-square distribution with $N$ degrees of freedom to be the special case of the $\mathrm{Ga}(\alpha, \beta)$ distribution where $\alpha = N/2$ and $\beta = 2$. In their approach, the derivation of the distribution of $z'z$ serves a different purpose; it serves to establish what is no longer true by definition, namely, that $z'z$ has a chi-square distribution with $N$ degrees of freedom. The two approaches (their approach and the one taken herein) can be regarded as different means to the same end. In either case, the chi-square distribution with $N$ degrees of freedom is the distribution with a pdf, say $g(\cdot)$, that is the pdf of the $\mathrm{Ga}\!\left(\tfrac{N}{2}, 2\right)$ distribution and that is expressible as follows:
$$g(x) = \begin{cases} \dfrac{1}{\Gamma(N/2)\,2^{N/2}}\,x^{(N/2)-1}e^{-x/2}, & \text{for } 0 < x < \infty, \\ 0, & \text{elsewhere.} \end{cases} \qquad (1.16)$$
Let us write $\chi^2(N)$ for a chi-square distribution with $N$ degrees of freedom. Here, $N$ is assumed to be an integer. Reference is sometimes made to a chi-square distribution with noninteger (but strictly positive) degrees of freedom $N$. Unless otherwise indicated, such a reference is to be interpreted as a reference to the $\mathrm{Ga}\!\left(\tfrac{N}{2}, 2\right)$ distribution; this interpretation is that obtained when the relationship between the $\chi^2(N)$ and $\mathrm{Ga}\!\left(\tfrac{N}{2}, 2\right)$ distributions is regarded as extending to noninteger values of $N$.
In light of the relationship of the chi-square distribution to the gamma distribution, various results on the gamma distribution can be translated into results on the chi-square distribution. In particular, if a random variable $x$ has a $\chi^2(N)$ distribution and if $c$ is a (strictly) positive constant, then it follows from result (1.2) that
$$cx \sim \mathrm{Ga}\!\left(\tfrac{N}{2}, 2c\right). \qquad (1.17)$$
The following result is an immediate consequence of Theorem 6.1.2.

Theorem 6.1.3. If K random variables u1 ; u2 ; : : : ; uK are distributed independently as


2 .N1 /; 2 .N2 /; : : : ; 2 .NK /, respectively, then u1 C u2 C    C uK is distributed as 2 .N1 C
N2 C    C NK /.

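The identification of $\chi^2(N)$ with $\mathrm{Ga}(N/2, 2)$ and the additivity in Theorem 6.1.3 are easy to confirm numerically. The sketch below uses arbitrarily chosen parameter values: it compares the two cdfs in SciPy and checks that a sum of squared standard normals matches the first two $\chi^2(N)$ moments.

```python
import numpy as np
from scipy.stats import chi2, gamma

N = 7                                                 # assumed degrees of freedom
x = np.linspace(0.5, 20.0, 6)

# chi^2(N) is the Ga(N/2, 2) distribution: the cdfs agree.
print(np.allclose(chi2.cdf(x, df=N), gamma.cdf(x, a=N / 2, scale=2)))

# z'z for z ~ N(0, I_N) has the chi^2(N) distribution: mean N and variance 2N.
rng = np.random.default_rng(6)
zz = (rng.standard_normal((200_000, N)) ** 2).sum(axis=1)
print(zz.mean(), zz.var())                            # approximately N and 2N

# Theorem 6.1.3: a chi^2(3) plus an independent chi^2(4) behaves like a chi^2(7).
u = rng.chisquare(3, 200_000) + rng.chisquare(4, 200_000)
print(u.mean(), u.var())                              # approximately 7 and 14
```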
d. Moment generating function, moments, and cumulants of the gamma and chi-square distributions
Let $w$ represent a random variable whose distribution is $\mathrm{Ga}(\alpha, \beta)$. And denote by $f(\cdot)$ the pdf of the $\mathrm{Ga}(\alpha, \beta)$ distribution. Further, let $u$ represent a random variable whose distribution is $\chi^2(N)$.
For $t < 1/\beta$,
$$E(e^{tw}) = \int_0^\infty e^{tx}f(x)\,dx = \int_0^\infty \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\,x^{\alpha-1}e^{-x(1-\beta t)/\beta}\,dx = \frac{1}{(1-\beta t)^{\alpha}}\int_0^\infty \frac{1}{\Gamma(\alpha)\gamma^{\alpha}}\,x^{\alpha-1}e^{-x/\gamma}\,dx, \qquad (1.18)$$
where $\gamma = \beta/(1-\beta t)$. The integrand of the integral in expression (1.18) equals $g(x)$, where $g(\cdot)$ is the pdf of the $\mathrm{Ga}(\alpha, \gamma)$ distribution, so that the integral equals 1. Thus, the mgf (moment generating function), say $m(\cdot)$, of the $\mathrm{Ga}(\alpha, \beta)$ distribution is
$$m(t) = (1-\beta t)^{-\alpha} \qquad (t < 1/\beta). \qquad (1.19)$$
As a special case of result (1.19), we have that the mgf, say $m(\cdot)$, of the $\chi^2(N)$ distribution is
$$m(t) = (1-2t)^{-N/2} \qquad (t < 1/2). \qquad (1.20)$$
For $r > -\alpha$,
$$E(w^r) = \int_0^\infty x^r f(x)\,dx = \int_0^\infty \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\,x^{\alpha+r-1}e^{-x/\beta}\,dx = \frac{\beta^r\Gamma(\alpha+r)}{\Gamma(\alpha)}\int_0^\infty \frac{1}{\Gamma(\alpha+r)\beta^{\alpha+r}}\,x^{\alpha+r-1}e^{-x/\beta}\,dx. \qquad (1.21)$$
The integrand of the integral in expression (1.21) equals $g(x)$, where $g(\cdot)$ is the pdf of the $\mathrm{Ga}(\alpha+r, \beta)$ distribution, so that the integral equals 1. Thus, for $r > -\alpha$,
$$E(w^r) = \beta^r\,\frac{\Gamma(\alpha+r)}{\Gamma(\alpha)}. \qquad (1.22)$$
The gamma function $\Gamma(\cdot)$ is such that, for $x > 0$ and for any positive integer $r$,
$$\Gamma(x+r) = (x+r-1)\cdots(x+2)(x+1)x\,\Gamma(x), \qquad (1.23)$$
as is evident upon the repeated application of result (3.5.6). In light of result (1.23), it follows from formula (1.22) that the $r$th (positive, integer-valued) moment of the $\mathrm{Ga}(\alpha, \beta)$ distribution is
$$E(w^r) = \beta^r\,\alpha(\alpha+1)(\alpha+2)\cdots(\alpha+r-1). \qquad (1.24)$$
Thus, the mean and variance of the $\mathrm{Ga}(\alpha, \beta)$ distribution are
$$E(w) = \alpha\beta \qquad (1.25)$$
and
$$\mathrm{var}(w) = \beta^2\alpha(\alpha+1) - (\alpha\beta)^2 = \alpha\beta^2. \qquad (1.26)$$
Upon setting $\alpha = N/2$ and $\beta = 2$ in expression (1.22), we find that (for $r > -N/2$)
$$E(u^r) = 2^r\,\frac{\Gamma[(N/2)+r]}{\Gamma(N/2)}. \qquad (1.27)$$
Further, the $r$th (positive, integer-valued) moment of the $\chi^2(N)$ distribution is
$$E(u^r) = 2^r(N/2)[(N/2)+1][(N/2)+2]\cdots[(N/2)+r-1] = N(N+2)(N+4)\cdots[N+2(r-1)]. \qquad (1.28)$$
And the mean and variance of the $\chi^2(N)$ distribution are
$$E(u) = N \qquad (1.29)$$
and
$$\mathrm{var}(u) = 2N. \qquad (1.30)$$
Upon applying formula (1.27) [and making use of result (1.23)], we find that the $r$th moment of the reciprocal of a chi-square random variable (with $N$ degrees of freedom) is (for $r = 1, 2, \ldots < N/2$)
$$E(u^{-r}) = 2^{-r}\,\frac{\Gamma[(N/2)-r]}{\Gamma(N/2)} = \{2^r[(N/2)-1][(N/2)-2]\cdots[(N/2)-r]\}^{-1} = [(N-2)(N-4)\cdots(N-2r)]^{-1}. \qquad (1.31)$$
In particular, we find that (for $N > 2$)
$$E\!\left(\frac{1}{u}\right) = \frac{1}{N-2}. \qquad (1.32)$$
Upon taking the logarithm of the mgf $m(\cdot)$ of the $\mathrm{Ga}(\alpha, \beta)$ distribution, we obtain the cumulant generating function, say $c(\cdot)$, of this distribution—refer, e.g., to Bickel and Doksum (2001, sec. A.12) for an introduction to cumulants and cumulant generating functions. In light of result (1.19),
$$c(t) = -\alpha\log(1-\beta t) \qquad (t < 1/\beta). \qquad (1.33)$$
And, for $-1/\beta \leq t < 1/\beta$,
$$c(t) = \alpha\sum_{r=1}^{\infty}(t\beta)^r/r = \sum_{r=1}^{\infty}\alpha\beta^r(r-1)!\;t^r/r!\,.$$
Thus, the $r$th cumulant of the $\mathrm{Ga}(\alpha, \beta)$ distribution is
$$\alpha\beta^r(r-1)!\,. \qquad (1.34)$$
As a special case, we have that the $r$th cumulant of the $\chi^2(N)$ distribution is
$$N\,2^{r-1}(r-1)!\,. \qquad (1.35)$$
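Formulas (1.28), (1.32), and (1.35) can be verified directly. The sketch below (degrees of freedom and the number of moments are arbitrary choices) compares the product formula for $E(u^r)$ with SciPy's moments, checks $E(1/u) = 1/(N-2)$ by Monte Carlo, and confirms the first three cumulants $N\,2^{r-1}(r-1)!$.

```python
import math
import numpy as np
from scipy.stats import chi2

N = 9                                                 # assumed degrees of freedom
u = chi2(df=N)

# (1.28): E(u^r) = N (N+2) ... [N + 2(r-1)].
for r in range(1, 5):
    product = math.prod(N + 2 * j for j in range(r))
    print(r, u.moment(r), product)

# (1.32): E(1/u) = 1/(N-2), checked by Monte Carlo.
rng = np.random.default_rng(7)
print((1.0 / rng.chisquare(N, size=500_000)).mean(), 1.0 / (N - 2))

# (1.35): the r-th cumulant is N 2^{r-1} (r-1)!; for r = 1, 2, 3 these are the mean,
# the variance, and the third central moment.
mean, var = u.stats(moments='mv')
third_central = u.moment(3) - 3 * mean * var - mean ** 3
print(mean, N)                                        # r = 1
print(var, 2 * N)                                     # r = 2
print(third_central, 8 * N)                           # r = 3: N * 2^2 * 2!
```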
e. Dirichlet distribution
Let $w_1, w_2, \ldots, w_K$, and $w_{K+1}$ represent $K+1$ random variables that are distributed independently as $\mathrm{Ga}(\alpha_1, \beta), \mathrm{Ga}(\alpha_2, \beta), \ldots, \mathrm{Ga}(\alpha_K, \beta)$, and $\mathrm{Ga}(\alpha_{K+1}, \beta)$, respectively. And consider the joint distribution of the $K+1$ random variables $w$ and $s_1, s_2, \ldots, s_K$ defined as follows:
$$w = \sum_{k=1}^{K+1} w_k \quad\text{and}\quad s_k = \frac{w_k}{\sum_{k'=1}^{K+1} w_{k'}} \quad (k = 1, 2, \ldots, K). \qquad (1.36)$$
A derivation of the pdf of the joint distribution of $w$ and $s_1, s_2, \ldots, s_K$ was presented for the special case where $K = 1$ in Subsection a. Let us extend that derivation to the general case (where $K$ is an arbitrary positive integer).
The $K+1$ equalities (1.36) define a one-to-one transformation from the rectangular region $\{w_1, w_2, \ldots, w_K, w_{K+1} : w_k > 0\ (k = 1, 2, \ldots, K, K+1)\}$ onto the region
$$\{w, s_1, s_2, \ldots, s_K : w > 0,\ s_k > 0\ (k = 1, 2, \ldots, K),\ \textstyle\sum_{k=1}^K s_k < 1\}. \qquad (1.37)$$
The inverse transformation is defined by
$$w_k = s_k w \quad (k = 1, 2, \ldots, K) \quad\text{and}\quad w_{K+1} = \left(1 - \textstyle\sum_{k=1}^K s_k\right)w.$$
For $j, k = 1, 2, \ldots, K$,
$$\frac{\partial w_j}{\partial s_k} = \begin{cases} w, & \text{if } k = j, \\ 0, & \text{if } k \neq j. \end{cases}$$
Further, $\partial w_j/\partial w = s_j$ ($j = 1, 2, \ldots, K$), $\partial w_{K+1}/\partial s_k = -w$ ($k = 1, 2, \ldots, K$), and $\partial w_{K+1}/\partial w = 1 - \sum_{k=1}^K s_k$. Thus, letting $s = (s_1, s_2, \ldots, s_K)'$ and making use of Theorem 2.14.9, formula (2.14.29), and Corollary 2.14.2, we find that the determinant of the $(K+1) \times (K+1)$ matrix of partial derivatives of $w_1, \ldots, w_K, w_{K+1}$ with respect to $w, s_1, \ldots, s_K$ is
$$\det\begin{pmatrix} s & wI \\ 1 - \sum_{k=1}^K s_k & -w\mathbf{1}' \end{pmatrix} = (-1)^K\det\begin{pmatrix} wI & s \\ -w\mathbf{1}' & 1 - \sum_{k=1}^K s_k \end{pmatrix} = (-1)^K w^K\left(1 - \textstyle\sum_{k=1}^K s_k + \textstyle\sum_{k=1}^K s_k\right) = (-w)^K.$$
Now, let $f(\cdot, \cdot, \cdot, \ldots, \cdot)$ represent the pdf of the joint distribution of $w$ and $s_1, s_2, \ldots, s_K$, define $\alpha = \sum_{k=1}^{K+1}\alpha_k$, and, for $k = 1, 2, \ldots, K, K+1$, denote by $f_k(\cdot)$ the pdf of the distribution of $w_k$. Then, for $w, s_1, s_2, \ldots, s_K$ in the region (1.37),
$$f(w, s_1, s_2, \ldots, s_K) = f_1(s_1 w)f_2(s_2 w)\cdots f_K(s_K w)\,f_{K+1}\!\left[\left(1 - \textstyle\sum_{k=1}^K s_k\right)w\right]\,|(-w)^K| = \frac{1}{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_K)\Gamma(\alpha_{K+1})\beta^{\alpha}}\,w^{\alpha-1}s_1^{\alpha_1-1}s_2^{\alpha_2-1}\cdots s_K^{\alpha_K-1}\left(1 - \textstyle\sum_{k=1}^K s_k\right)^{\alpha_{K+1}-1}e^{-w/\beta}$$
—for w; s1 ; s2 ; : : : ; sK outside region (1.37), f .w; s1 ; s2 ; : : : ; sK / D 0.


Clearly,
f .w; s1 ; s2 ; : : : ; sK / D g.w/ h.s1 ; s2 ; : : : ; sK / (for all w and s1 ; s2 ; : : : ; sK );
where 8̂
1
< w˛ 1
e w=ˇ
; for 0 < w < 1,
g.w/ D €.˛/ˇ ˛
:̂ 0; elsewhere,

and, letting S D f.s1 ; s2 ; : : : ; sK / W sk > 0 .k D 1; 2; : : : ; K/; K


P
kD1 sk < 1g,
€.˛1 C˛2 C  C˛K C˛KC1 /

ˆ
€.˛1 /€.˛2 /    €.˛K /€.˛KC1 /
ˆ
ˆ
ˆ
ˆ ˛KC1 1
 s1˛1 1 s2˛2 1    sK
˛K 1
< PK
h.s1 ; s2 ; : : : ; sK / D 1 kD1 sk ; (1.38)
ˆ
ˆ
ˆ
ˆ for .s ;
1 2s ; : : : ; sK/ 2 S,
ˆ
0; elsewhere.

The function g./ is seen to be the pdf of the Ga. KC1 kD1 ˛k ; ˇ/ distribution. And because
P
h.s1 ; s2 ; : : : ; sK /  0 for all s1 ; s2 ; : : : ; sK and because
Z 1Z 1 Z 1
 h.s1 ; s2 ; : : : ; sK / ds1 ds2 : : : dsK
1 1
Z 1 Z11 Z 1 Z 1
D  h.s1 ; s2 ; : : : ; sK / ds1 ds2 : : : dsK g.w/ dw
1 1
Z 1 Z 1 Z 1 1Z 1 1

D  f .w; s1 ; s2 ; : : : ; sK / dwds1 ds2 : : : dsK D 1;


1 1 1 1

h. ;  ; : : : ; /, like g./, is a pdf; it is the pdf of the joint distribution of the K random variables
sk D wk = KC1 k 0 D1 wk 0 (k D 1; 2; : : : ; K). Moreover, w is distributed independently of s1; s2 ; : : : ; sK .
P

Based on what has been established, we have the following generalization of Theorem 6.1.1.

Theorem 6.1.4. If $K+1$ random variables $w_1, w_2, \dots, w_K, w_{K+1}$ are distributed independently as $\mathrm{Ga}(\alpha_1, \beta), \mathrm{Ga}(\alpha_2, \beta), \dots, \mathrm{Ga}(\alpha_K, \beta), \mathrm{Ga}(\alpha_{K+1}, \beta)$, respectively, then (1) $\sum_{k=1}^{K+1} w_k$ is distributed as $\mathrm{Ga}\bigl(\sum_{k=1}^{K+1} \alpha_k,\, \beta\bigr)$ and (2) $\sum_{k=1}^{K+1} w_k$ is distributed independently of $w_1\big/\sum_{k=1}^{K+1} w_k,\; w_2\big/\sum_{k=1}^{K+1} w_k,\; \dots,\; w_K\big/\sum_{k=1}^{K+1} w_k$.

Note that Part (1) of Theorem 6.1.4 is essentially a restatement of a result established earlier (in the form of Theorem 6.1.2) via a mathematical induction argument.

The joint distribution of the $K$ random variables $s_1, s_2, \dots, s_K$, the pdf of which is the function $h(\cdot, \cdot, \dots, \cdot)$ defined by expression (1.38), is known as the Dirichlet distribution (with parameters $\alpha_1, \alpha_2, \dots, \alpha_K$, and $\alpha_{K+1}$). Let us denote this distribution by the symbol $\mathrm{Di}(\alpha_1, \alpha_2, \dots, \alpha_K, \alpha_{K+1}; K)$. The beta distribution is a special case of the Dirichlet distribution; specifically, the $\mathrm{Be}(\alpha_1, \alpha_2)$ distribution is identical to the $\mathrm{Di}(\alpha_1, \alpha_2; 1)$ distribution.
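
Theorem 6.1.4 and the gamma-ratio construction of the Dirichlet distribution lend themselves to a quick numerical check. The following Python sketch (the parameter values and sample size are arbitrary choices made only for illustration) draws independent gamma variables, forms the ratios $s_k$, and compares them and their sum with the distributions asserted above.

```python
# A minimal Monte Carlo sketch of Theorem 6.1.4 and the Dirichlet construction.
# The parameter values (alpha, beta, sample size) are arbitrary illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = np.array([1.5, 2.0, 0.7, 3.2])   # alpha_1, ..., alpha_{K+1}  (K = 3)
beta = 2.5                                # common scale parameter
n = 200_000

# Independent gamma variables w_1, ..., w_{K+1} with common scale beta.
w = rng.gamma(shape=alpha, scale=beta, size=(n, alpha.size))
total = w.sum(axis=1)                     # should be Ga(sum(alpha), beta)
s = w[:, :-1] / total[:, None]            # s_1, ..., s_K: should be Di(alpha; K)

# Part (1): the sum is Ga(sum(alpha), beta).
print(stats.kstest(total, stats.gamma(a=alpha.sum(), scale=beta).cdf))

# Dirichlet check: each s_k should have the Be(alpha_k, sum(alpha) - alpha_k) marginal.
for k in range(alpha.size - 1):
    print(stats.kstest(s[:, k], stats.beta(alpha[k], alpha.sum() - alpha[k]).cdf))

# Part (2): the sum and the ratios are independent; as a weak check,
# their sample correlations should be near zero.
print(np.corrcoef(np.column_stack([total, s]), rowvar=False)[0, 1:])
```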
Some results on the Dirichlet distribution. Some results on the Dirichlet distribution are stated in the form of the following theorem.

Theorem 6.1.5. Take $s_1, s_2, \dots, s_K$ to be $K$ random variables whose joint distribution is $\mathrm{Di}(\alpha_1, \alpha_2, \dots, \alpha_K, \alpha_{K+1}; K)$, and define $s_{K+1} = 1 - \sum_{k=1}^K s_k$. Further, partition the integers $1, \dots, K, K+1$ into $I+1$ (nonempty) mutually exclusive and exhaustive subsets, say $B_1, \dots, B_I, B_{I+1}$, of sizes $K_1+1, \dots, K_I+1, K_{I+1}+1$, respectively, and denote by $\{i_1, i_2, \dots, i_P\}$ the subset of $\{1, \dots, I, I+1\}$ consisting of every integer $i$ between $1$ and $I+1$, inclusive, for which $K_i \ge 1$. And for $i = 1, \dots, I, I+1$, let $s_{\cdot i} = \sum_{k \in B_i} s_k$ and $\alpha_{\cdot i} = \sum_{k \in B_i} \alpha_k$; and for $p = 1, 2, \dots, P$, let $u_p$ represent the $K_{i_p} \times 1$ vector whose elements are the first $K_{i_p}$ of the $K_{i_p}+1$ quantities $s_k \big/ \sum_{k' \in B_{i_p}} s_{k'}$ ($k \in B_{i_p}$). Then,

(1) the $P+1$ random vectors $u_1, u_2, \dots, u_P$, and $(s_{\cdot 1}, \dots, s_{\cdot I}, s_{\cdot I+1})'$ are statistically independent;

(2) the joint distribution of $s_{\cdot 1}, s_{\cdot 2}, \dots, s_{\cdot I}$ is $\mathrm{Di}(\alpha_{\cdot 1}, \alpha_{\cdot 2}, \dots, \alpha_{\cdot I}, \alpha_{\cdot I+1}; I)$; and

(3) for $p = 1, 2, \dots, P$, the joint distribution of the $K_{i_p}$ elements of $u_p$ is Dirichlet with parameters $\alpha_k$ ($k \in B_{i_p}$).

Proof. Let $w_{ij}$ ($i = 1, \dots, I, I+1$; $j = 1, \dots, K_i, K_i+1$) represent statistically independent random variables, and suppose that (for all $i$ and $j$) $w_{ij} \sim \mathrm{Ga}(\alpha_{ij}, \beta)$, where $\alpha_{ij}$ is the $j$th of the $K_i+1$ parameters $\alpha_k$ ($k \in B_i$). Further, let $w_i = (w_{i1}, \dots, w_{iK_i}, w_{i,K_i+1})'$. And observe that in light of the very definition of the Dirichlet distribution, it suffices (for purposes of the proof) to set the $K_i+1$ random variables $s_k$ ($k \in B_i$) equal to $w_{ij} \big/ \sum_{i'=1}^{I+1} \sum_{j'=1}^{K_{i'}+1} w_{i'j'}$ ($j = 1, \dots, K_i, K_i+1$), respectively ($i = 1, \dots, I, I+1$). Upon doing so, we find that

$$s_{\cdot i} = \sum_{j=1}^{K_i+1} w_{ij} \Big/ \sum_{i'=1}^{I+1} \sum_{j'=1}^{K_{i'}+1} w_{i'j'} \quad (i = 1, \dots, I, I+1) \qquad (1.39)$$

and that (for $p = 1, 2, \dots, P$) the $K_{i_p}$ elements of the vector $u_p$ are $u_{p1}, u_{p2}, \dots, u_{pK_{i_p}}$, where (for $j = 1, 2, \dots, K_{i_p}$)

$$u_{pj} = \frac{w_{i_p j} \Big/ \sum_{i'=1}^{I+1} \sum_{j'=1}^{K_{i'}+1} w_{i'j'}}{\sum_{j'=1}^{K_{i_p}+1} w_{i_p j'} \Big/ \sum_{i'=1}^{I+1} \sum_{j'=1}^{K_{i'}+1} w_{i'j'}} = \frac{w_{i_p j}}{\sum_{j'=1}^{K_{i_p}+1} w_{i_p j'}}. \qquad (1.40)$$

Part (3) of Theorem 6.1.5 follows immediately from result (1.40). And upon observing that the $I+1$ sums $\sum_{j=1}^{K_i+1} w_{ij}$ ($i = 1, \dots, I, I+1$) are statistically independent and observing also [in light of Theorem 6.1.2 or Part (1) of Theorem 6.1.4] that (for $i = 1, \dots, I, I+1$) $\sum_{j=1}^{K_i+1} w_{ij} \sim \mathrm{Ga}(\alpha_{\cdot i}, \beta)$, Part (2) follows from result (1.39).

It remains to verify Part (1). Let $y = (y_1, \dots, y_I, y_{I+1})'$, where (for $i = 1, \dots, I, I+1$) $y_i = \sum_{j=1}^{K_i+1} w_{ij}$, and consider a change of variables from the elements of the $I+1$ (statistically independent) random vectors $w_1, \dots, w_I, w_{I+1}$ to the elements of the $P+1$ random vectors $u_1, u_2, \dots, u_P$, and $y$. We find that the pdf, say $f(\cdot, \cdot, \dots, \cdot, \cdot)$, of $u_1, u_2, \dots, u_P$, and $y$ is expressible as

$$f(u_1, u_2, \dots, u_P, y) = \prod_{i=1}^{I+1} g_i(y_i) \prod_{p=1}^{P} h_p(u_p), \qquad (1.41)$$

where $g_i(\cdot)$ is the pdf of the $\mathrm{Ga}(\alpha_{\cdot i}, \beta)$ distribution and $h_p(\cdot)$ is the pdf of the $\mathrm{Di}(\alpha_{i_p 1}, \dots, \alpha_{i_p K_{i_p}}, \alpha_{i_p, K_{i_p}+1}; K_{i_p})$ distribution. And upon observing that $s_{\cdot 1}, \dots, s_{\cdot I}, s_{\cdot I+1}$ are expressible as functions of $y$, it follows from result (1.41) that $u_1, u_2, \dots, u_P$, and $(s_{\cdot 1}, \dots, s_{\cdot I}, s_{\cdot I+1})'$ are statistically independent. Q.E.D.

Marginal distributions. Define $s_1, s_2, \dots, s_K$, and $s_{K+1}$ as in Theorem 6.1.5; that is, take $s_1, s_2, \dots, s_K$ to be $K$ random variables whose joint distribution is $\mathrm{Di}(\alpha_1, \alpha_2, \dots, \alpha_K, \alpha_{K+1}; K)$, and let $s_{K+1} = 1 - \sum_{k=1}^K s_k$. And, taking $I$ to be an arbitrary integer between $1$ and $K$, inclusive, consider the joint distribution of any $I$ elements of the set $\{s_1, \dots, s_K, s_{K+1}\}$, say the $k_1, k_2, \dots, k_I$th elements $s_{k_1}, s_{k_2}, \dots, s_{k_I}$.

The joint distribution of $s_{k_1}, s_{k_2}, \dots, s_{k_I}$ can be readily determined by applying Part (2) of Theorem 6.1.5 [in the special case where $B_1 = \{k_1\}$, $B_2 = \{k_2\}$, $\dots$, $B_I = \{k_I\}$ and where $B_{I+1}$ is the $(K+1-I)$-dimensional subset of $\{1, \dots, K, K+1\}$ obtained by striking out $k_1, k_2, \dots, k_I$]. The joint distribution is $\mathrm{Di}(\alpha_{k_1}, \alpha_{k_2}, \dots, \alpha_{k_I}, \alpha_{\cdot I+1}; I)$, where $\alpha_{\cdot I+1} = \sum_{k \in B_{I+1}} \alpha_k$. In particular, for an arbitrary one of the integers $1, \dots, K, K+1$, say the integer $k$, the (marginal) distribution of $s_k$ is $\mathrm{Di}\bigl(\alpha_k,\, \sum_{k'=1\,(k' \ne k)}^{K+1} \alpha_{k'};\, 1\bigr)$ or, equivalently, $\mathrm{Be}\bigl(\alpha_k,\, \sum_{k'=1\,(k' \ne k)}^{K+1} \alpha_{k'}\bigr)$.
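
The marginal and aggregation results just stated are easy to confirm numerically. The sketch below (parameter values and the grouping are arbitrary illustrative choices) draws from a Dirichlet distribution with SciPy and compares a single coordinate, and a sum of coordinates, against the beta distributions implied by Theorem 6.1.5.

```python
# Monte Carlo check of the Dirichlet marginal and aggregation results.
# Parameter values and the grouping below are arbitrary illustrations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = np.array([0.8, 1.2, 2.5, 3.0, 0.6])   # parameters of Di(alpha; K) with K = 4
n = 100_000
samples = stats.dirichlet(alpha).rvs(size=n, random_state=rng)  # rows sum to 1

# Marginal of a single coordinate: Be(alpha_k, sum of the others).
k = 2
print(stats.kstest(samples[:, k], stats.beta(alpha[k], alpha.sum() - alpha[k]).cdf))

# Aggregation [Part (2) of Theorem 6.1.5 with B_1 = {1, 2}]: the sum of the first
# two coordinates is Be(alpha_1 + alpha_2, total of the remaining alphas).
agg = samples[:, 0] + samples[:, 1]
print(stats.kstest(agg, stats.beta(alpha[0] + alpha[1], alpha[2:].sum()).cdf))
```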

f. Applications to the multivariate standard normal distribution


Let $z = (z_1, \dots, z_K, z_{K+1})'$ represent a $(K+1)$-dimensional random (column) vector whose distribution is $(K+1)$-variate standard normal, that is, whose distribution is $N(0, I_{K+1})$ or, equivalently, whose elements are distributed independently and identically as $N(0, 1)$. Then, $z_1^2, \dots, z_K^2$, and $z_{K+1}^2$ are distributed independently, and each has a $\chi^2(1)$ distribution or, equivalently, a $\mathrm{Ga}\bigl(\tfrac12, 2\bigr)$ distribution.

Consider the distribution of the $K+1$ random variables $z_1^2 \big/ \sum_{k=1}^{K+1} z_k^2, \dots, z_K^2 \big/ \sum_{k=1}^{K+1} z_k^2$, and $\sum_{k=1}^{K+1} z_k^2$. As a special case of Part (2) of Theorem 6.1.4, we have that $\sum_{k=1}^{K+1} z_k^2$ is distributed independently of $z_1^2 \big/ \sum_{k=1}^{K+1} z_k^2, \dots, z_K^2 \big/ \sum_{k=1}^{K+1} z_k^2$. And as a special case of Theorem 6.1.3 [or of Part (1) of Theorem 6.1.4], we have that $\sum_{k=1}^{K+1} z_k^2 \sim \chi^2(K+1)$. Moreover, $z_1^2 \big/ \sum_{k=1}^{K+1} z_k^2, \dots, z_K^2 \big/ \sum_{k=1}^{K+1} z_k^2$ have a $\mathrm{Di}\bigl(\tfrac12, \dots, \tfrac12, \tfrac12; K\bigr)$ distribution, and, more generally, any $K'$ (where $1 \le K' \le K$) of the random variables $z_1^2 \big/ \sum_{k=1}^{K+1} z_k^2, \dots, z_K^2 \big/ \sum_{k=1}^{K+1} z_k^2, z_{K+1}^2 \big/ \sum_{k=1}^{K+1} z_k^2$ have a $\mathrm{Di}\bigl(\tfrac12, \dots, \tfrac12, \tfrac{K+1-K'}{2}; K'\bigr)$ distribution.


Now, let

$$u = \sum_{k=1}^{K+1} z_k^2 \quad\text{and (for } k = 1, \dots, K, K+1)\quad y_k = \frac{z_k}{\bigl(\sum_{j=1}^{K+1} z_j^2\bigr)^{1/2}},$$

and consider the joint distribution of $u$ and any $K'$ (where $1 \le K' \le K$) of the random variables $y_1, \dots, y_K, y_{K+1}$ (which for notational convenience and without any essential loss of generality are taken to be the first $K'$ of these random variables). Let us reexpress $u$ and $y_1, y_2, \dots, y_{K'}$ as

$$u = v + \sum_{k=1}^{K'} z_k^2 \quad\text{and (for } k = 1, 2, \dots, K')\quad y_k = \frac{z_k}{\bigl(v + \sum_{j=1}^{K'} z_j^2\bigr)^{1/2}}, \qquad (1.42)$$

where $v = \sum_{k=K'+1}^{K+1} z_k^2$. Clearly, $v$ is distributed independently of $z_1, z_2, \dots, z_{K'}$ as $\chi^2(K - K' + 1)$.

Define $y = (y_1, y_2, \dots, y_{K'})'$. And observe that the $K'+1$ equalities (1.42) define a one-to-one transformation from the $(K'+1)$-dimensional region defined by the $K'+1$ inequalities $0 < v < \infty$ and $-\infty < z_k < \infty$ ($k = 1, 2, \dots, K'$) onto the region

$$\{u, y : 0 < u < \infty,\; y \in D^*\},$$

where $D^* = \{y : \sum_{k=1}^{K'} y_k^2 < 1\}$. Observe also that the inverse of this transformation is the transformation defined by the $K'+1$ equalities

$$v = u\bigl(1 - \textstyle\sum_{k=1}^{K'} y_k^2\bigr) \quad\text{and (for } k = 1, 2, \dots, K')\quad z_k = u^{1/2} y_k.$$

Further, letting $A$ represent the $(K'+1) \times (K'+1)$ matrix whose $ij$th element is the partial derivative of the $i$th element of the vector $(v, z_1, z_2, \dots, z_{K'})$ with respect to the $j$th element of the vector $(u, y_1, y_2, \dots, y_{K'})$ and recalling Theorem 2.14.22, we find that

$$|A| = \begin{vmatrix} 1 - \sum_{j=1}^{K'} y_j^2 & -2u y' \\ (1/2) u^{-1/2} y & u^{1/2} I \end{vmatrix} = u^{K'/2}.$$

Thus, denoting by $d(\cdot)$ the pdf of the $\chi^2(K - K' + 1)$ distribution and by $b(\cdot)$ the pdf of the $N(0, I_{K'})$ distribution and making use of standard results on a change of variables, the joint distribution of $u, y_1, y_2, \dots, y_{K'}$ has as a pdf the function $q(\cdot, \cdot, \cdot, \dots, \cdot)$ (of $K'+1$ variables) obtained by taking (for $0 < u < \infty$ and $y \in D^*$)

$$\begin{aligned} q(u, y_1, y_2, \dots, y_{K'}) &= d\bigl[u\bigl(1 - \textstyle\sum_{k=1}^{K'} y_k^2\bigr)\bigr]\, b\bigl(u^{1/2} y\bigr)\, u^{K'/2} \\ &= \frac{1}{2^{(K+1)/2}\, \Gamma[(K-K'+1)/2]\, \pi^{K'/2}}\, u^{[(K+1)/2]-1} \bigl(1 - \textstyle\sum_{k=1}^{K'} y_k^2\bigr)^{[(K-K'+1)/2]-1} e^{-u/2} \\ &= \frac{1}{2^{(K+1)/2}\, \Gamma[(K+1)/2]}\, u^{[(K+1)/2]-1} e^{-u/2} \;\times\; \frac{\Gamma[(K+1)/2]}{\Gamma[(K-K'+1)/2]\, \pi^{K'/2}}\, \bigl(1 - \textstyle\sum_{k=1}^{K'} y_k^2\bigr)^{[(K-K'+1)/2]-1} \end{aligned} \qquad (1.43)$$

(for $u$ and $y$ such that $-\infty < u \le 0$ or $y \notin D^*$, $q(u, y_1, y_2, \dots, y_{K'}) = 0$). Accordingly, we conclude that (for all $u$ and $y_1, y_2, \dots, y_{K'}$)

$$q(u, y_1, y_2, \dots, y_{K'}) = r(u)\, h^*(y_1, y_2, \dots, y_{K'}), \qquad (1.44)$$

where $r(\cdot)$ is the pdf of the $\chi^2(K+1)$ distribution and $h^*(\cdot, \cdot, \dots, \cdot)$ is the function (of $K'$ variables) defined (for all $y_1, y_2, \dots, y_{K'}$) as follows:

$$h^*(y_1, y_2, \dots, y_{K'}) = \begin{cases} \dfrac{\Gamma[(K+1)/2]}{\Gamma[(K-K'+1)/2]\, \pi^{K'/2}}\, \bigl(1 - \textstyle\sum_{k=1}^{K'} y_k^2\bigr)^{[(K-K'+1)/2]-1}, & \text{if } \sum_{k=1}^{K'} y_k^2 < 1, \\ 0, & \text{otherwise.} \end{cases} \qquad (1.45)$$

In effect, we have established that $y_1, y_2, \dots, y_{K'}$ are statistically independent of $\sum_{k=1}^{K+1} z_k^2$ and that the distribution of $y_1, y_2, \dots, y_{K'}$ has as a pdf the function $h^*(\cdot, \cdot, \dots, \cdot)$ defined by expression (1.45). In the special case where $K' = K$, the function $h^*(\cdot, \cdot, \dots, \cdot)$ is the pdf of the joint distribution of $y_1, y_2, \dots, y_K$. When $K' = K$, $h^*(\cdot, \cdot, \dots, \cdot)$ is reexpressible as follows: for all $y_1, y_2, \dots, y_K$,

$$h^*(y_1, y_2, \dots, y_K) = \begin{cases} \dfrac{\Gamma[(K+1)/2]}{\pi^{(K+1)/2}}\, \bigl(1 - \textstyle\sum_{k=1}^K y_k^2\bigr)^{-1/2}, & \text{if } \sum_{k=1}^K y_k^2 < 1, \\ 0, & \text{otherwise.} \end{cases} \qquad (1.46)$$

Denote by $i(\cdot)$ the function of a single variable, say $z$, defined as follows:

$$i(z) = \begin{cases} 1, & \text{for } z > 0, \\ 0, & \text{for } z = 0, \\ -1, & \text{for } z < 0. \end{cases}$$

Clearly,

$$y_{K+1} = i_{K+1} \bigl(1 - \textstyle\sum_{k=1}^K y_k^2\bigr)^{1/2},$$

where $i_{K+1} = i(z_{K+1})$. Moreover, $\Pr(i_{K+1} = 0) = 0$. And the joint distribution of $z_1, z_2, \dots, z_K, -z_{K+1}$ is the same as that of $z_1, z_2, \dots, z_K, z_{K+1}$, implying that the joint distribution of $u, y_1, y_2, \dots, y_K, -i_{K+1}$ is the same as that of $u, y_1, y_2, \dots, y_K, i_{K+1}$ and hence that $\Pr(i_{K+1} = 1) = \Pr(i_{K+1} = -1) = \tfrac12$, both unconditionally and conditionally on $u, y_1, y_2, \dots, y_K$ or $y_1, y_2, \dots, y_K$. Thus, conditionally on $u, y_1, y_2, \dots, y_K$ or $y_1, y_2, \dots, y_K$,

$$y_{K+1} = \begin{cases} \bigl(1 - \textstyle\sum_{k=1}^K y_k^2\bigr)^{1/2}, & \text{with probability } \tfrac12, \\ -\bigl(1 - \textstyle\sum_{k=1}^K y_k^2\bigr)^{1/2}, & \text{with probability } \tfrac12. \end{cases}$$
Random variables, say $x_1, \dots, x_K, x_{K+1}$, whose joint distribution is that of the random variables $y_1, \dots, y_K, y_{K+1}$ are said to be distributed uniformly on the surface of a $(K+1)$-dimensional unit ball (refer, e.g., to definition 1.1 of Gupta and Song 1997). More generally, random variables $x_1, \dots, x_K, x_{K+1}$ whose joint distribution is that of the random variables $ry_1, \dots, ry_K, ry_{K+1}$ are said to be distributed uniformly on the surface of a $(K+1)$-dimensional ball of radius $r$. Note that if $x_1, \dots, x_K, x_{K+1}$ are distributed uniformly on the surface of a $(K+1)$-dimensional unit ball, then $x_1^2, \dots, x_K^2$ have a $\mathrm{Di}\bigl(\tfrac12, \dots, \tfrac12, \tfrac12; K\bigr)$ distribution.

The $(K+1)$-dimensional random vector $z$ [which has a $(K+1)$-dimensional standard normal distribution] is expressible in terms of the random variable $u$ [which has a $\chi^2(K+1)$ distribution] and the $K+1$ random variables $y_1, \dots, y_K, y_{K+1}$ [which are distributed uniformly on the surface of a $(K+1)$-dimensional unit ball independently of $u$]. Clearly,

$$z = \sqrt{u}\, y, \qquad (1.47)$$

where $y = (y_1, \dots, y_K, y_{K+1})'$.

The distribution of the (positive) square root of a chi-square random variable, say a chi-square random variable with $N$ degrees of freedom, is sometimes referred to as a chi distribution (with $N$ degrees of freedom). This distribution has a pdf $b(\cdot)$ that is expressible as

$$b(x) = \begin{cases} \dfrac{1}{\Gamma(N/2)\, 2^{(N/2)-1}}\, x^{N-1} e^{-x^2/2}, & \text{for } 0 < x < \infty, \\ 0, & \text{elsewhere,} \end{cases} \qquad (1.48)$$

as can be readily verified. Accordingly, the random variable $\sqrt{u}$, which appears in expression (1.47), has a chi distribution with $K+1$ degrees of freedom, the pdf of which is obtainable from expression (1.48) (upon setting $N = K+1$).
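
The decomposition (1.47) is easy to exercise numerically. The sketch below (the dimension and sample size are arbitrary choices) splits standard normal vectors into a radius and a direction, checks the radius against the chi distribution with pdf (1.48), and checks that one squared direction coordinate has the $\mathrm{Be}(1/2, K/2)$ marginal implied by the Dirichlet result above.

```python
# Monte Carlo illustration of the decomposition z = sqrt(u) * y for z ~ N(0, I).
# The dimension K + 1 and the sample size are arbitrary choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
K_plus_1 = 5
n = 100_000

z = rng.standard_normal((n, K_plus_1))
u = (z ** 2).sum(axis=1)            # squared norm: chi-square with K + 1 d.f.
y = z / np.sqrt(u)[:, None]         # direction: uniform on the unit sphere

# sqrt(u) should follow the chi distribution with K + 1 degrees of freedom.
print(stats.kstest(np.sqrt(u), stats.chi(K_plus_1).cdf))

# Each y_k^2 should be Be(1/2, K/2), the marginal of Di(1/2, ..., 1/2; K).
print(stats.kstest(y[:, 0] ** 2, stats.beta(0.5, (K_plus_1 - 1) / 2).cdf))

# Radius and direction should be independent; sample correlation near zero.
print(np.corrcoef(u, y[:, 0] ** 2)[0, 1])
```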

g. Extensions to spherical distributions

A transformation from a spherical distribution to a Dirichlet distribution. Let $z_1, z_2, \dots, z_M$ represent $M$ random variables whose joint distribution is absolutely continuous with a pdf $f(\cdot, \cdot, \dots, \cdot)$. And suppose that for some (nonnegative) function $g(\cdot)$ (of a single nonnegative variable),

$$f(z_1, z_2, \dots, z_M) = g\bigl(\textstyle\sum_{i=1}^M z_i^2\bigr) \quad\text{(for all values of } z_1, z_2, \dots, z_M). \qquad (1.49)$$

Then, as discussed in Section 5.9c, the joint distribution of $z_1, z_2, \dots, z_M$ is spherical.

Define $u_i = z_i^2$ ($i = 1, 2, \dots, M$), and denote by $q(\cdot, \cdot, \dots, \cdot)$ the pdf of the joint distribution of $u_1, u_2, \dots, u_M$. Then, upon observing that $\partial z_i/\partial u_i = \pm (1/2) u_i^{-1/2}$ ($i = 1, 2, \dots, M$) and making use of standard results on a change of variables (e.g., Casella and Berger 2002, sec. 4.6), we find that (for $u_1 > 0$, $u_2 > 0$, $\dots$, $u_M > 0$)

$$q(u_1, u_2, \dots, u_M) = 2^M g\bigl(\textstyle\sum_{i=1}^M u_i\bigr) \bigl(\tfrac12\bigr)^M \prod_{i=1}^M u_i^{-1/2} = g\bigl(\textstyle\sum_{i=1}^M u_i\bigr) \prod_{i=1}^M u_i^{-1/2}$$

(for $u_1, u_2, \dots, u_M$ such that $u_i \le 0$ for some $i$, $q(u_1, u_2, \dots, u_M) = 0$).

Now, let $w_i = z_i^2 \big/ \sum_{j=1}^M z_j^2$ ($i = 1, \dots, M-1$) and $w_M = \sum_{i=1}^M z_i^2$, and denote by $p(\cdot, \dots, \cdot, \cdot)$ the pdf of the joint distribution of $w_1, \dots, w_{M-1}, w_M$. Further, take $D$ to be the set

$$\bigl\{(w_1, \dots, w_{M-1}) : w_i > 0 \;(i = 1, \dots, M-1),\; \textstyle\sum_{i=1}^{M-1} w_i < 1\bigr\}.$$

And take $y_i = u_i$ ($i = 1, \dots, M-1$) and $y_M = \sum_{i=1}^M u_i$, and observe that $w_i = y_i/y_M$ ($i = 1, \dots, M-1$) and $w_M = y_M$. Then, proceeding in essentially the same way as in the derivation of result (5.9.55), we find that for $(w_1, \dots, w_{M-1}) \in D$ and for $w_M > 0$,

$$\begin{aligned} p(w_1, \dots, w_{M-1}, w_M) &= \prod_{i=1}^{M-1} w_i^{-1/2} \bigl(1 - \textstyle\sum_{i=1}^{M-1} w_i\bigr)^{-1/2} w_M^{(M/2)-1} g(w_M) \\ &= \frac{\Gamma(M/2)}{\pi^{M/2}} \prod_{i=1}^{M-1} w_i^{-1/2} \bigl(1 - \textstyle\sum_{i=1}^{M-1} w_i\bigr)^{-1/2} \;\times\; \frac{\pi^{M/2}}{\Gamma(M/2)}\, w_M^{(M/2)-1} g(w_M) \end{aligned}$$

(if $(w_1, \dots, w_{M-1}) \notin D$ or if $w_M \le 0$, then $p(w_1, \dots, w_{M-1}, w_M) = 0$). Thus, for all values of $w_1, \dots, w_{M-1}, w_M$,

$$p(w_1, \dots, w_{M-1}, w_M) = r(w_M)\, h(w_1, \dots, w_{M-1}), \qquad (1.50)$$

where $r(\cdot)$ and $h(\cdot, \dots, \cdot)$ are functions defined as follows:

$$r(w_M) = \begin{cases} \dfrac{\pi^{M/2}}{\Gamma(M/2)}\, w_M^{(M/2)-1} g(w_M), & \text{for } 0 < w_M < \infty, \\ 0, & \text{for } -\infty < w_M \le 0, \end{cases} \qquad (1.51)$$

and

$$h(w_1, \dots, w_{M-1}) = \begin{cases} \dfrac{\Gamma(M/2)}{\pi^{M/2}} \prod_{i=1}^{M-1} w_i^{-1/2} \bigl(1 - \textstyle\sum_{i=1}^{M-1} w_i\bigr)^{-1/2}, & \text{for } (w_1, \dots, w_{M-1}) \in D, \\ 0, & \text{elsewhere.} \end{cases} \qquad (1.52)$$

The function $h(\cdot, \dots, \cdot)$ is the pdf of a $\mathrm{Di}\bigl(\tfrac12, \dots, \tfrac12, \tfrac12; M-1\bigr)$ distribution. And the function $r(\cdot)$ is nonnegative and [in light of result (5.9.56)] $\int_{-\infty}^{\infty} r(w_M)\, dw_M = 1$, so that $r(\cdot)$, like $h(\cdot, \dots, \cdot)$, is a pdf. Accordingly, we conclude that the random variable $\sum_{i=1}^M z_i^2$ is distributed independently of the random variables $z_1^2 \big/ \sum_{i=1}^M z_i^2, \dots, z_{M-1}^2 \big/ \sum_{i=1}^M z_i^2$, that the distribution of $\sum_{i=1}^M z_i^2$ is the distribution with pdf $r(\cdot)$, and that $z_1^2 \big/ \sum_{i=1}^M z_i^2, \dots, z_{M-1}^2 \big/ \sum_{i=1}^M z_i^2$ have an $(M-1)$-dimensional Dirichlet distribution (with parameters $\tfrac12, \dots, \tfrac12, \tfrac12$). Further, the distribution of the random variable $\bigl(\sum_{i=1}^M z_i^2\bigr)^{1/2}$ is the distribution with pdf $b(\cdot)$, where

$$b(s) = \begin{cases} \dfrac{2\pi^{M/2}}{\Gamma(M/2)}\, s^{M-1} g(s^2), & \text{for } 0 < s < \infty, \\ 0, & \text{elsewhere,} \end{cases} \qquad (1.53)$$

as can be readily verified.

In the special case where the joint distribution of $z_1, z_2, \dots, z_M$ is $N(0, I)$ (which is the special case considered in the preceding subsection, i.e., in Subsection f), the function $g(\cdot)$ is that for which $g(x) = (2\pi)^{-M/2} e^{-x/2}$ (for every scalar $x$). In that special case, $r(\cdot)$ is the pdf of the $\chi^2(M)$ distribution, and $b(\cdot)$ is the pdf of the chi distribution (with $M$ degrees of freedom).
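
One way to see this result in action is with a spherical but non-normal distribution. A convenient example (an illustrative choice, not one discussed here) is the multivariate t distribution with identity scale matrix, which is spherical; the sketch below checks that the normalized squared coordinates are Dirichlet with parameters $\tfrac12, \dots, \tfrac12$ even though the radial law is no longer chi-square.

```python
# Monte Carlo check that, for a spherical (here multivariate t) distribution,
# the normalized squared coordinates z_i^2 / sum_j z_j^2 are Di(1/2, ..., 1/2; M-1).
# The dimension, degrees of freedom, and sample size are arbitrary choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
M, df, n = 4, 3.0, 100_000

# Multivariate t with identity scale: a normal vector divided by an independent
# sqrt(chi-square(df)/df) factor; the resulting distribution is spherical.
z = rng.standard_normal((n, M)) / np.sqrt(rng.chisquare(df, size=n) / df)[:, None]

w = z ** 2 / (z ** 2).sum(axis=1, keepdims=True)   # normalized squared coordinates

# Each coordinate of a Di(1/2, ..., 1/2; M-1) vector has a Be(1/2, (M-1)/2) marginal.
print(stats.kstest(w[:, 0], stats.beta(0.5, (M - 1) / 2).cdf))

# The squared norm, by contrast, is not chi-square here; divided by M it is F(M, df).
print(stats.kstest((z ** 2).sum(axis=1) / M, stats.f(M, df).cdf))
```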
Decomposition of spherically distributed random variables. Let $z_1, \dots, z_K, z_{K+1}$ represent $K+1$ random variables whose joint distribution is absolutely continuous with a pdf $f(\cdot, \dots, \cdot, \cdot)$. And suppose that (for arbitrary values of $z_1, \dots, z_K, z_{K+1}$)

$$f(z_1, \dots, z_K, z_{K+1}) = g\bigl(\textstyle\sum_{k=1}^{K+1} z_k^2\bigr), \qquad (1.54)$$

where $g(\cdot)$ is a (nonnegative) function of a single nonnegative variable (in which case the joint distribution of $z_1, \dots, z_K, z_{K+1}$ is spherical). Further, let

$$u = \sum_{k=1}^{K+1} z_k^2 \quad\text{and (for } k = 1, \dots, K, K+1)\quad y_k = \frac{z_k}{\bigl(\sum_{j=1}^{K+1} z_j^2\bigr)^{1/2}}.$$

And consider the joint distribution of $u, y_1, y_2, \dots, y_K$.

Clearly,

$$u = v + \sum_{k=1}^{K} z_k^2 \quad\text{and (for } k = 1, \dots, K, K+1)\quad y_k = \frac{z_k}{\bigl(v + \sum_{j=1}^{K} z_j^2\bigr)^{1/2}},$$

where $v = z_{K+1}^2$. Moreover, upon applying standard results on a change of variables (e.g., Casella and Berger 2002, sec. 4.6), we find that the joint distribution of the random variables $v, z_1, z_2, \dots, z_K$ has as a pdf the function $f^*(\cdot, \cdot, \cdot, \dots, \cdot)$ (of $K+1$ variables) obtained by taking (for $0 < v < \infty$ and for all $z_1, z_2, \dots, z_K$)

$$f^*(v, z_1, z_2, \dots, z_K) = 2\, g\bigl(v + \textstyle\sum_{k=1}^K z_k^2\bigr)\, (1/2) v^{-1/2} = g\bigl(v + \textstyle\sum_{k=1}^K z_k^2\bigr)\, v^{-1/2}$$

(if $-\infty < v \le 0$, take $f^*(v, z_1, z_2, \dots, z_K) = 0$).

Let $y = (y_1, y_2, \dots, y_K)'$, and define $D^* = \{y : \sum_{k=1}^K y_k^2 < 1\}$. Then, proceeding in essentially the same way as in the derivation of result (1.43), we find that the joint distribution of $u, y_1, y_2, \dots, y_K$ has as a pdf the function $q(\cdot, \cdot, \cdot, \dots, \cdot)$ (of $K+1$ variables) obtained by taking (for $0 < u < \infty$ and $y \in D^*$)

$$\begin{aligned} q(u, y_1, y_2, \dots, y_K) &= f^*\bigl[u\bigl(1 - \textstyle\sum_{k=1}^K y_k^2\bigr),\, u^{1/2} y_1,\, u^{1/2} y_2,\, \dots,\, u^{1/2} y_K\bigr]\, u^{K/2} \\ &= g(u)\, \bigl[u\bigl(1 - \textstyle\sum_{k=1}^K y_k^2\bigr)\bigr]^{-1/2} u^{K/2} \\ &= \frac{\pi^{(K+1)/2}}{\Gamma[(K+1)/2]}\, u^{[(K+1)/2]-1} g(u) \;\times\; \frac{\Gamma[(K+1)/2]}{\pi^{(K+1)/2}}\, \bigl(1 - \textstyle\sum_{k=1}^K y_k^2\bigr)^{-1/2} \end{aligned}$$

(for $u$ and $y$ such that $-\infty < u \le 0$ or $y \notin D^*$, take $q(u, y_1, y_2, \dots, y_K) = 0$). Thus, for all $u$ and $y_1, y_2, \dots, y_K$,

$$q(u, y_1, y_2, \dots, y_K) = r(u)\, h^*(y_1, y_2, \dots, y_K), \qquad (1.55)$$

where $r(\cdot)$ is the function (of a single variable) defined (for all $u$) and $h^*(\cdot, \cdot, \dots, \cdot)$ the function (of $K$ variables) defined (for all $y_1, y_2, \dots, y_K$) as follows:

$$r(u) = \begin{cases} \dfrac{\pi^{(K+1)/2}}{\Gamma[(K+1)/2]}\, u^{[(K+1)/2]-1} g(u), & \text{for } 0 < u < \infty, \\ 0, & \text{for } -\infty < u \le 0, \end{cases} \qquad (1.56)$$

and

$$h^*(y_1, y_2, \dots, y_K) = \begin{cases} \dfrac{\Gamma[(K+1)/2]}{\pi^{(K+1)/2}}\, \bigl(1 - \textstyle\sum_{k=1}^K y_k^2\bigr)^{-1/2}, & \text{if } \sum_{k=1}^K y_k^2 < 1, \\ 0, & \text{otherwise.} \end{cases} \qquad (1.57)$$

As is evident from the results of Part 1 of the present subsection, the function $r(\cdot)$ is a pdf; it is the pdf of the distribution of $\sum_{k=1}^{K+1} z_k^2$. Further, $y_1, y_2, \dots, y_K$ are statistically independent of $\sum_{k=1}^{K+1} z_k^2$, and the distribution of $y_1, y_2, \dots, y_K$ has as a pdf the function $h^*(\cdot, \cdot, \dots, \cdot)$ defined by expression (1.57). And, conditionally on $u, y_1, y_2, \dots, y_K$ or $y_1, y_2, \dots, y_K$,

$$y_{K+1} = \begin{cases} \bigl(1 - \textstyle\sum_{k=1}^K y_k^2\bigr)^{1/2}, & \text{with probability } \tfrac12, \\ -\bigl(1 - \textstyle\sum_{k=1}^K y_k^2\bigr)^{1/2}, & \text{with probability } \tfrac12, \end{cases}$$

as can be established in the same way as in Subsection f [where it was assumed that the joint distribution of $z_1, \dots, z_K, z_{K+1}$ is $N(0, I)$]. Accordingly, $y_1, \dots, y_K, y_{K+1}$ are distributed uniformly on the surface of a $(K+1)$-dimensional unit ball.
Let $z = (z_1, \dots, z_K, z_{K+1})'$, and consider the decomposition of the vector $z$ defined by the identity

$$z = \sqrt{u}\, y, \qquad (1.58)$$

where $y = (y_1, \dots, y_K, y_{K+1})'$. This decomposition was considered previously (in Subsection f) in the special case where $z \sim N(0, I)$. As in the special case, $y$ is distributed uniformly on the surface of a $(K+1)$-dimensional unit ball (and is distributed independently of $u$ or $\sqrt{u}$). In the present (general) case of an arbitrary absolutely continuous spherical distribution [i.e., where the distribution of $z$ is any absolutely continuous distribution with a pdf of the form (1.54)], the distribution of $u$ is the distribution with the pdf $r(\cdot)$ given by expression (1.56) and (recalling the results of Part 1 of the present subsection) the distribution of $\sqrt{u}$ is the distribution with the pdf $b(\cdot)$ given by the expression

$$b(x) = \begin{cases} \dfrac{2\pi^{(K+1)/2}}{\Gamma[(K+1)/2]}\, x^{K} g(x^2), & \text{for } 0 < x < \infty, \\ 0, & \text{elsewhere.} \end{cases} \qquad (1.59)$$

In the special case where $z \sim N(0, I)$, $u \sim \chi^2(K+1)$, and $\sqrt{u}$ has a chi distribution (with $K+1$ degrees of freedom).
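
As a concrete (and admittedly artificial) check of expressions (1.56) and (1.59), take the spherical distribution to be the uniform distribution on the unit ball in $\mathbb{R}^{K+1}$, so that $g$ is constant on $[0, 1]$; the radial pdfs should then integrate to 1, which the short sketch below verifies by numerical quadrature (the dimension is an arbitrary choice).

```python
# Numerical check that the radial pdfs (1.56) and (1.59) integrate to 1
# when g corresponds to the uniform distribution on the unit ball in R^(K+1).
# The dimension is an arbitrary choice.
import numpy as np
from scipy.special import gamma
from scipy.integrate import quad

K = 4
dim = K + 1
ball_volume = np.pi ** (dim / 2) / gamma(dim / 2 + 1)

def g(t):
    # Density of the uniform distribution on the unit ball, as a function of z'z = t.
    return (1.0 / ball_volume) if 0.0 <= t <= 1.0 else 0.0

def r(u):   # expression (1.56): pdf of u = z'z
    return np.pi ** (dim / 2) / gamma(dim / 2) * u ** (dim / 2 - 1) * g(u) if u > 0 else 0.0

def b(x):   # expression (1.59): pdf of sqrt(u) = ||z||
    return 2 * np.pi ** (dim / 2) / gamma(dim / 2) * x ** K * g(x ** 2) if x > 0 else 0.0

print(quad(r, 0, 1)[0])   # should be close to 1
print(quad(b, 0, 1)[0])   # should be close to 1
```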

6.2 Noncentral Chi-Square Distribution


Chi-square distributions were considered in Section 6.1. Those distributions form a subclass of a
larger class of distributions known as noncentral chi-square distributions. Preliminary to defining
and discussing noncentral chi-square distributions, it is convenient to introduce an orthogonal matrix
known as a Helmert matrix.

a. Helmert matrix
Let $a = (a_1, a_2, \dots, a_N)'$ represent an $N$-dimensional nonnull (column) vector. Does there exist an $N \times N$ orthogonal matrix, one of whose rows, say the first row, is proportional to $a'$? Or, equivalently, does there exist an orthonormal basis for $\mathbb{R}^N$ that includes the normalized vector $(a'a)^{-1/2} a$? In what follows, the answer is shown to be yes. The approach taken is to describe a particular $N \times N$ orthogonal matrix whose first row is proportional to $a'$. Other approaches are possible; refer, e.g., to Harville (1997, sec. 6.4).

Let us begin by considering a special case. Suppose that $a_i \ne 0$ for $i = 1, 2, \dots, N$. And consider the $N \times N$ matrix $P$, whose first through $N$th rows, say $p_1', p_2', \dots, p_N'$, are each of norm 1 and are further defined as follows: take $p_1'$ proportional to $a'$, take $p_2'$ proportional to

$$(a_1,\; -a_1^2/a_2,\; 0,\; 0,\; \dots,\; 0);$$

take $p_3'$ proportional to

$$[a_1,\; a_2,\; -(a_1^2 + a_2^2)/a_3,\; 0,\; 0,\; \dots,\; 0];$$

and, more generally, take the second through $N$th rows proportional to

$$\bigl(a_1,\; a_2,\; \dots,\; a_{k-1},\; -\textstyle\sum_{i=1}^{k-1} a_i^2/a_k,\; 0,\; 0,\; \dots,\; 0\bigr) \quad (k = 2, 3, \dots, N), \qquad (2.1)$$

respectively.

Clearly, the $N-1$ vectors (2.1) are orthogonal to each other and to the vector $a'$. Thus, $P$ is an orthogonal matrix. Moreover, upon "normalizing" $a'$ and the vectors (2.1), we find that

$$p_1' = \bigl(\textstyle\sum_{i=1}^N a_i^2\bigr)^{-1/2} (a_1, a_2, \dots, a_N) \qquad (2.2)$$

and that, for $k = 2, 3, \dots, N$,

$$p_k' = \left[\frac{a_k^2}{\sum_{i=1}^{k-1} a_i^2 \;\sum_{i=1}^{k} a_i^2}\right]^{1/2} \bigl(a_1,\; a_2,\; \dots,\; a_{k-1},\; -\textstyle\sum_{i=1}^{k-1} a_i^2/a_k,\; 0,\; 0,\; \dots,\; 0\bigr). \qquad (2.3)$$

When $a = (1, 1, \dots, 1)'$, formulas (2.2) and (2.3) simplify to

$$p_1' = N^{-1/2} (1, 1, \dots, 1)$$

and

$$p_k' = [k(k-1)]^{-1/2} (1, 1, \dots, 1,\; 1-k,\; 0, 0, \dots, 0) \quad (k = 2, 3, \dots, N),$$

and $P$ reduces to a matrix known as the Helmert matrix (of order $N$). (In some presentations, it is the transpose of this matrix that is called the Helmert matrix.)

We have established that in the special case where $a_i \ne 0$ for $i = 1, 2, \dots, N$, there exists an $N \times N$ orthogonal matrix whose first row is proportional to $a'$; the matrix $P$, whose $N$ rows $p_1', p_2', \dots, p_N'$ are determinable from formulas (2.2) and (2.3), is such a matrix. Now, consider the general case, in which as many as $N-1$ elements of $a$ may equal 0.

Suppose that $K$ of the elements of $a$ are nonzero (where $1 \le K \le N$), say the $j_1, j_2, \dots, j_K$th elements (and that the other $N-K$ elements of $a$ equal 0). Then, it follows from what has already been established that there exists a $K \times K$ orthogonal matrix, say $Q$, whose first row is proportional to the vector $(a_{j_1}, a_{j_2}, \dots, a_{j_K})$. And, denoting by $q_1, q_2, \dots, q_K$ the columns of $Q$, an $N \times N$ orthogonal matrix, say $P$, whose first row is proportional to $a'$ is obtainable as follows: take

$$P = \begin{pmatrix} P_1 \\ P_2 \end{pmatrix},$$

where $P_1$ is a $K \times N$ matrix whose $j_1, j_2, \dots, j_K$th columns are $q_1, q_2, \dots, q_K$, respectively, and whose other $N-K$ columns equal 0 and where $P_2$ is an $(N-K) \times N$ matrix whose $j_1, j_2, \dots, j_K$th columns equal 0 and whose other $N-K$ columns are the columns of $I_{N-K}$. That $P$ is orthogonal is evident upon observing that its columns are orthonormal. Thus, as in the special case where $a_i \ne 0$ for $i = 1, 2, \dots, N$, there exists an $N \times N$ orthogonal matrix whose first row is proportional to $a'$.
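
The construction just described translates directly into code. The sketch below (which assumes the special case in which no entry of $a$ is zero) builds $P$ row by row from formulas (2.2) and (2.3) and verifies orthogonality; applied to a vector of ones it reproduces the Helmert matrix.

```python
# Construct an N x N orthogonal matrix whose first row is proportional to a,
# following formulas (2.2) and (2.3); assumes every element of a is nonzero
# (the special case treated first in the text).
import numpy as np

def orthogonal_with_first_row(a):
    a = np.asarray(a, dtype=float)
    N = a.size
    P = np.zeros((N, N))
    P[0] = a / np.sqrt(np.sum(a ** 2))                       # formula (2.2)
    for k in range(2, N + 1):                                 # formula (2.3)
        head = np.sum(a[: k - 1] ** 2)
        row = np.zeros(N)
        row[: k - 1] = a[: k - 1]
        row[k - 1] = -head / a[k - 1]
        P[k - 1] = row * np.sqrt(a[k - 1] ** 2 / (head * (head + a[k - 1] ** 2)))
    return P

a = np.array([3.0, -1.0, 2.0, 0.5])
P = orthogonal_with_first_row(a)
print(np.allclose(P @ P.T, np.eye(a.size)))        # orthogonality
print(P[0] * np.sqrt(np.sum(a ** 2)) - a)          # first row proportional to a

helmert = orthogonal_with_first_row(np.ones(5))    # Helmert matrix of order 5
print(np.round(helmert, 3))
```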

b. Noncentral chi-square distribution: definition


Let $z = (z_1, z_2, \dots, z_N)'$ represent an $N$-dimensional random (column) vector whose distribution is $N$-variate normal with mean vector $\mu = (\mu_1, \mu_2, \dots, \mu_N)'$ and variance-covariance matrix $I$ or, equivalently, whose elements are distributed independently as $N(\mu_1, 1), N(\mu_2, 1), \dots, N(\mu_N, 1)$. Define

$$w = z'z \quad\text{or, equivalently,}\quad w = \textstyle\sum_{i=1}^N z_i^2.$$

When $\mu = 0$, the distribution of $w$ is, by definition, a chi-square distribution (with $N$ degrees of freedom); the chi-square distribution was introduced and discussed in Section 6.1. Let us now consider the distribution of $w$ in the general case where $\mu$ is not necessarily null.

Let

$$\lambda = \mu'\mu = \textstyle\sum_{i=1}^N \mu_i^2,$$

and let

$$\delta = \sqrt{\lambda} = \sqrt{\mu'\mu} = \sqrt{\textstyle\sum_{i=1}^N \mu_i^2}.$$

If $\lambda \ne 0$ (or, equivalently, if $\delta \ne 0$), define $a_1 = (1/\delta)\mu$; otherwise (if $\lambda = 0$ or, equivalently, if $\delta = 0$), take $a_1$ to be any $N \times 1$ (nonrandom) vector such that $a_1'a_1 = 1$. Further, take $A_2$ to be any $N \times (N-1)$ (nonrandom) matrix such that the matrix $A$ defined by $A = (a_1, A_2)$ is orthogonal; the existence of such a matrix follows from the results of Subsection a. And define

$$x_1 = a_1'z \quad\text{and}\quad x_2 = A_2'z.$$

Then, clearly,

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim N\!\left[\begin{pmatrix} \delta \\ 0 \end{pmatrix},\; \begin{pmatrix} 1 & 0 \\ 0 & I_{N-1} \end{pmatrix}\right], \qquad (2.4)$$

and

$$w = z'z = z'Iz = z'AA'z = (A'z)'A'z = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}'\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1^2 + x_2'x_2. \qquad (2.5)$$

The distribution of the random variable $w$ is called the noncentral chi-square distribution. As is evident from results (2.4) and (2.5), this distribution depends on the value of $\mu$ only through $\lambda$ or, equivalently, only through $\delta$. In the special case where $\lambda = 0$, the noncentral chi-square distribution is identical to the distribution that we have been referring to as the chi-square distribution. In this special case, the distribution is sometimes (for the sake of clarity and/or emphasis) referred to as the central chi-square distribution.

The noncentral chi-square distribution depends on two parameters: in addition to the degrees of freedom $N$, it depends on the quantity $\lambda$, which is referred to as the noncentrality parameter. Let us use the symbol $\chi^2(N, \lambda)$ to represent a noncentral chi-square distribution with degrees of freedom $N$ and noncentrality parameter $\lambda$.
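
SciPy implements this distribution (as `scipy.stats.ncx2`, parameterized by the same noncentrality $\lambda = \mu'\mu$), so the definition can be checked directly by simulation; the mean vector in the sketch below is an arbitrary choice.

```python
# Simulate w = z'z with z ~ N(mu, I) and compare with the noncentral chi-square
# distribution chi^2(N, lambda), lambda = mu'mu.  The mean vector is arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
mu = np.array([1.0, -0.5, 2.0])          # N = 3
lam = np.sum(mu ** 2)                     # noncentrality parameter
n = 100_000

z = mu + rng.standard_normal((n, mu.size))
w = (z ** 2).sum(axis=1)

print(stats.kstest(w, stats.ncx2(df=mu.size, nc=lam).cdf))
print(w.mean(), mu.size + lam)            # compare with E(w) = N + lambda
```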

c. Pdf of the noncentral chi-square distribution


As a first step in deriving the pdf of the $\chi^2(N, \lambda)$ distribution (for arbitrary degrees of freedom $N$), let us derive the pdf of the $\chi^2(1, \lambda)$ distribution.

Pdf of the $\chi^2(1, \lambda)$ distribution. The derivation of the pdf of the $\chi^2(1, \lambda)$ distribution parallels the derivation (in Section 6.1c) of the pdf of the $\chi^2(1)$ distribution.

Let $\lambda$ represent an arbitrary nonnegative scalar, take $z$ to be a random variable whose distribution is $N\bigl(\sqrt{\lambda}, 1\bigr)$, and define $u = z^2$. Then, by definition, $u \sim \chi^2(1, \lambda)$. And a pdf, say $h(\cdot)$, of the distribution of $u$ is obtainable from the pdf, say $f(\cdot)$, of the $N\bigl(\sqrt{\lambda}, 1\bigr)$ distribution: for $u > 0$, take

$$\begin{aligned} h(u) &= f\bigl(\sqrt{u}\bigr)\, \bigl|2\sqrt{u}\bigr|^{-1} + f\bigl(-\sqrt{u}\bigr)\, \bigl(2\sqrt{u}\bigr)^{-1} \\ &= \bigl(2\sqrt{2\pi u}\bigr)^{-1} \Bigl[e^{-(\sqrt{u}-\sqrt{\lambda})^2/2} + e^{-(\sqrt{u}+\sqrt{\lambda})^2/2}\Bigr] \\ &= \bigl(2\sqrt{2\pi u}\bigr)^{-1} e^{-(u+\lambda)/2} \Bigl(e^{\sqrt{u\lambda}} + e^{-\sqrt{u\lambda}}\Bigr) \end{aligned}$$

(for $u \le 0$, take $h(u) = 0$).

The pdf $h(\cdot)$ is reexpressible in terms of the hyperbolic cosine function $\cosh(\cdot)$. By definition,

$$\cosh(x) = \tfrac12\bigl(e^x + e^{-x}\bigr) \quad\text{(for every scalar } x).$$

Thus,

$$h(u) = \begin{cases} \dfrac{1}{\sqrt{2\pi u}}\, e^{-(u+\lambda)/2} \cosh\bigl(\sqrt{\lambda u}\bigr), & \text{for } u > 0, \\ 0, & \text{for } u \le 0. \end{cases} \qquad (2.6)$$

The pdf $h(\cdot)$ can be further reexpressed by making use of the power-series representation

$$\cosh(x) = \sum_{r=0}^{\infty} \frac{x^{2r}}{(2r)!} \quad (-\infty < x < \infty)$$

for the hyperbolic cosine function and by making use of result (3.5.11). We find that, for $u > 0$,

$$\begin{aligned} h(u) &= \frac{1}{\sqrt{2\pi u}}\, e^{-(u+\lambda)/2} \sum_{r=0}^{\infty} \frac{(u\lambda)^r}{(2\pi)^{-1/2}\, 2^{2r+(1/2)}\, r!\, \Gamma\bigl(r + \tfrac12\bigr)} \\ &= \sum_{r=0}^{\infty} \frac{(\lambda/2)^r e^{-\lambda/2}}{r!} \cdot \frac{1}{\Gamma[(2r+1)/2]\, 2^{(2r+1)/2}}\, u^{[(2r+1)/2]-1} e^{-u/2}. \end{aligned} \qquad (2.7)$$

And letting (for $j = 1, 2, 3, \dots$) $g_j(\cdot)$ represent the pdf of a central chi-square distribution with $j$ degrees of freedom, it follows that (for all $u$)

$$h(u) = \sum_{r=0}^{\infty} \frac{(\lambda/2)^r e^{-\lambda/2}}{r!}\, g_{2r+1}(u). \qquad (2.8)$$

The coefficients $(\lambda/2)^r e^{-\lambda/2}/r!$ ($r = 0, 1, 2, \dots$) of the quantities $g_{2r+1}(u)$ ($r = 0, 1, 2, \dots$) in the sum (2.8) can be regarded as the values $p(r)$ ($r = 0, 1, 2, \dots$) of a function $p(\cdot)$; this function is the probability mass function of a Poisson distribution with parameter $\lambda/2$. Thus, the pdf $h(\cdot)$ of the $\chi^2(1, \lambda)$ distribution is a weighted average of the pdfs of central chi-square distributions. A distribution whose pdf (or cumulative distribution function) is expressible as a weighted average of the pdfs (or cumulative distribution functions) of other distributions is known as a mixture distribution.
Extension to the general case (of arbitrary degrees of freedom). According to result (2.5), the $\chi^2(N, \lambda)$ distribution can be regarded as the distribution of the sum of two independently distributed random variables, the first of which has a $\chi^2(1, \lambda)$ distribution and the second of which has a $\chi^2(N-1)$ distribution. Moreover, in Part 1 (of the present subsection), the pdf of the $\chi^2(1, \lambda)$ distribution was determined to be that given by expression (2.8), which is a weighted average of pdfs of central chi-square distributions. Accordingly, take $w_1$ to be a random variable whose distribution is that with a pdf $h(\cdot)$ of the form

$$h(w_1) = \sum_{r=0}^{\infty} p_r\, g_{N_r}(w_1), \qquad (2.9)$$

where $p_0, p_1, p_2, \dots$ are nonnegative constants such that $\sum_{r=0}^{\infty} p_r = 1$, where $N_0, N_1, N_2, \dots$ are (strictly) positive integers, and where (as in Part 1) $g_j(\cdot)$ denotes (for any strictly positive integer $j$) the pdf of a central chi-square distribution with $j$ degrees of freedom. And for an arbitrary (strictly) positive integer $K$, take $w_2$ to be a random variable that is distributed independently of $w_1$ as $\chi^2(K)$, and define

$$w = w_1 + w_2.$$

Let us determine the pdf of the distribution of $w$. Denote by $b(\cdot)$ the pdf of the $\chi^2(K)$ distribution, and define

$$s = \frac{w_1}{w_1 + w_2}.$$

Then, proceeding in essentially the same way as in Section 6.1a (in arriving at Theorem 6.1.1), we find that a pdf, say $f(\cdot, \cdot)$, of the joint distribution of $w$ and $s$ is obtained by taking, for $w > 0$ and $0 < s < 1$,

$$\begin{aligned} f(w, s) &= h(sw)\, b[(1-s)w]\, |w| \\ &= w \sum_{r=0}^{\infty} p_r\, g_{N_r}(sw)\, b[(1-s)w] \\ &= \sum_{r=0}^{\infty} p_r\, \frac{1}{\Gamma[(N_r+K)/2]\, 2^{(N_r+K)/2}}\, w^{[(N_r+K)/2]-1} e^{-w/2} \;\times\; \frac{\Gamma[(N_r+K)/2]}{\Gamma(N_r/2)\,\Gamma(K/2)}\, s^{(N_r/2)-1} (1-s)^{(K/2)-1} \end{aligned} \qquad (2.10)$$

(for $w \le 0$ and for $s$ such that $s \le 0$ or $s \ge 1$, $f(w, s) = 0$). And letting $d_{\alpha_1, \alpha_2}(\cdot)$ represent the pdf of a $\mathrm{Be}(\alpha_1, \alpha_2)$ distribution (for arbitrary values of the parameters $\alpha_1$ and $\alpha_2$), it follows that (for all $w$ and $s$)

$$f(w, s) = \sum_{r=0}^{\infty} p_r\, g_{N_r+K}(w)\, d_{N_r/2,\, K/2}(s). \qquad (2.11)$$

Thus, as a pdf of the (marginal) distribution of $w$, we have the function $q(\cdot)$ obtained by taking (for all $w$)

$$q(w) = \int_0^1 f(w, s)\, ds = \sum_{r=0}^{\infty} p_r\, g_{N_r+K}(w) \int_0^1 d_{N_r/2,\, K/2}(s)\, ds = \sum_{r=0}^{\infty} p_r\, g_{N_r+K}(w). \qquad (2.12)$$

The distribution of $w$, like that of $w_1$, is a mixture distribution. As in the case of the pdf of the distribution of $w_1$, the pdf of the distribution of $w$ is a weighted average of the pdfs of central chi-square distributions. Moreover, the sequence $p_0, p_1, p_2, \dots$ of weights is the same in the case of the pdf of the distribution of $w$ as in the case of the pdf of the distribution of $w_1$. And the central chi-square distributions represented in one of these weighted averages are related in a simple way to those represented in the other; each of the central chi-square distributions represented in the weighted average (2.12) has an additional $K$ degrees of freedom.

In light of results (2.5) and (2.8), the pdf of the $\chi^2(N, \lambda)$ distribution is obtainable as a special case of the pdf (2.12). Specifically, upon setting $K = N-1$ and setting (for $r = 0, 1, 2, \dots$) $N_r = 2r+1$ and $p_r = p(r)$ [where, as in Part 1 of the present subsection, $p(r) = (\lambda/2)^r e^{-\lambda/2}/r!$], we obtain the pdf of the $\chi^2(N, \lambda)$ distribution as a special case of the pdf (2.12). Accordingly, the pdf of the $\chi^2(N, \lambda)$ distribution is the function $q(\cdot)$ that is expressible as follows:

$$q(w) = \sum_{r=0}^{\infty} p(r)\, g_{2r+N}(w) \qquad (2.13)$$

$$\phantom{q(w)} = \sum_{r=0}^{\infty} \frac{(\lambda/2)^r e^{-\lambda/2}}{r!}\, g_{2r+N}(w) \qquad (2.14)$$

$$\phantom{q(w)} = \begin{cases} \displaystyle\sum_{r=0}^{\infty} \frac{(\lambda/2)^r e^{-\lambda/2}}{r!}\, \frac{1}{\Gamma[(2r+N)/2]\, 2^{(2r+N)/2}}\, w^{[(2r+N)/2]-1} e^{-w/2}, & \text{for } w > 0, \\ 0, & \text{for } w \le 0. \end{cases} \qquad (2.15)$$
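
Expression (2.15) is exactly the Poisson-mixture form used by numerical libraries; truncating the series gives a practical way to evaluate the pdf, as in the sketch below, which compares the truncated series against SciPy's `ncx2.pdf` (the degrees of freedom, noncentrality, and truncation point are arbitrary choices).

```python
# Evaluate the noncentral chi-square pdf (2.15) by truncating the Poisson-weighted
# series of central chi-square pdfs, and compare with scipy.stats.ncx2.
# Degrees of freedom, noncentrality, and truncation point are arbitrary choices.
import numpy as np
from scipy import stats

def ncx2_pdf_series(w, N, lam, terms=200):
    w = np.asarray(w, dtype=float)
    weights = stats.poisson(lam / 2).pmf(np.arange(terms))   # (lambda/2)^r e^(-lambda/2)/r!
    dens = np.array([stats.chi2(2 * r + N).pdf(w) for r in range(terms)])
    return weights @ dens

N, lam = 5, 3.7
w = np.linspace(0.1, 30, 7)
print(ncx2_pdf_series(w, N, lam))
print(stats.ncx2(N, lam).pdf(w))    # should agree to within truncation error
```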

d. Distribution of a sum of noncentral chi-square random variables


Suppose that two random variables, say $w_1$ and $w_2$, are distributed independently as $\chi^2(N_1, \lambda_1)$ and $\chi^2(N_2, \lambda_2)$, respectively. And consider the distribution of the sum $w = w_1 + w_2$.

Let $x$ represent an $(N_1 + N_2) \times 1$ random vector whose distribution is

$$N\!\left[\begin{pmatrix} \sqrt{\lambda_1/N_1}\, \mathbf{1}_{N_1} \\ \sqrt{\lambda_2/N_2}\, \mathbf{1}_{N_2} \end{pmatrix},\; \begin{pmatrix} I_{N_1} & 0 \\ 0 & I_{N_2} \end{pmatrix}\right],$$

and partition $x$ as $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ (where $x_1$ is of dimensions $N_1 \times 1$). And observe (in light of Theorem 3.5.3 and Corollary 3.5.5) that $x_1$ and $x_2$ are distributed independently as $N\bigl(\sqrt{\lambda_1/N_1}\, \mathbf{1}_{N_1},\, I_{N_1}\bigr)$ and $N\bigl(\sqrt{\lambda_2/N_2}\, \mathbf{1}_{N_2},\, I_{N_2}\bigr)$, respectively. Observe also (in light of the very definition of the noncentral chi-square distribution) that the joint distribution of $x_1'x_1$ and $x_2'x_2$ is the same as that of $w_1$ and $w_2$. Thus,

$$x'x = x_1'x_1 + x_2'x_2 \sim w_1 + w_2 = w.$$

Moreover, $x'x$ has a noncentral chi-square distribution with $N_1 + N_2$ degrees of freedom and noncentrality parameter

$$\begin{pmatrix} \sqrt{\lambda_1/N_1}\, \mathbf{1}_{N_1} \\ \sqrt{\lambda_2/N_2}\, \mathbf{1}_{N_2} \end{pmatrix}'\begin{pmatrix} \sqrt{\lambda_1/N_1}\, \mathbf{1}_{N_1} \\ \sqrt{\lambda_2/N_2}\, \mathbf{1}_{N_2} \end{pmatrix} = (\lambda_1/N_1)\, \mathbf{1}_{N_1}'\mathbf{1}_{N_1} + (\lambda_2/N_2)\, \mathbf{1}_{N_2}'\mathbf{1}_{N_2} = \lambda_1 + \lambda_2,$$

leading to the conclusion that

$$w \sim \chi^2(N_1 + N_2,\; \lambda_1 + \lambda_2). \qquad (2.16)$$

Upon employing a simple mathematical-induction argument, we arrive at the following generalization of result (2.16).
Theorem 6.2.1. If $K$ random variables $w_1, w_2, \dots, w_K$ are distributed independently as $\chi^2(N_1, \lambda_1), \chi^2(N_2, \lambda_2), \dots, \chi^2(N_K, \lambda_K)$, respectively, then $w_1 + w_2 + \cdots + w_K$ is distributed as $\chi^2(N_1 + N_2 + \cdots + N_K,\; \lambda_1 + \lambda_2 + \cdots + \lambda_K)$.

Note that in the special case where $\lambda_1 = \lambda_2 = \cdots = \lambda_K = 0$, Theorem 6.2.1 reduces to Theorem 6.1.3.
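
Theorem 6.2.1 can be confirmed quickly by simulation; the degrees of freedom and noncentrality parameters below are arbitrary choices.

```python
# Monte Carlo check of Theorem 6.2.1: a sum of independent noncentral chi-square
# variables is noncentral chi-square with summed d.f. and noncentralities.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
dfs = [2, 3, 6]
ncs = [0.5, 4.0, 1.2]
n = 100_000

total = sum(stats.ncx2(df, nc).rvs(size=n, random_state=rng) for df, nc in zip(dfs, ncs))
print(stats.kstest(total, stats.ncx2(sum(dfs), sum(ncs)).cdf))
```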

e. Moment generating function and cumulants


Let $z$ represent a random variable whose distribution is $N(\mu, 1)$. Further, let $f(\cdot)$ represent the pdf of the $N(\mu, 1)$ distribution. Then, for $t < 1/2$,

$$\begin{aligned} E\bigl(e^{tz^2}\bigr) &= \int_{-\infty}^{\infty} e^{tx^2} f(x)\, dx \\ &= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\!\left[-\frac{(1-2t)}{2}\, x^2 + \mu x - \frac{\mu^2}{2}\right] dx \\ &= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\!\left[-\frac{(1-2t)}{2} \left(x - \frac{\mu}{1-2t}\right)^{\!2} + \frac{t\mu^2}{1-2t}\right] dx \\ &= (1-2t)^{-1/2} \exp\!\left(\frac{t\mu^2}{1-2t}\right) \int_{-\infty}^{\infty} h_t(x)\, dx, \end{aligned}$$

where $h_t(\cdot)$ is the pdf of the $N[\mu/(1-2t),\, (1-2t)^{-1}]$ distribution. Thus, for $t < 1/2$,

$$E\bigl(e^{tz^2}\bigr) = (1-2t)^{-1/2} \exp\!\left(\frac{t\mu^2}{1-2t}\right).$$

And it follows that the moment generating function, say $m(\cdot)$, of the $\chi^2(1, \lambda)$ distribution is

$$m(t) = (1-2t)^{-1/2} \exp\!\left(\frac{\lambda t}{1-2t}\right) \quad (t < 1/2). \qquad (2.17)$$

To obtain an expression for the moment generating function of the $\chi^2(N, \lambda)$ distribution (where $N$ is any strictly positive integer), it suffices (in light of Theorem 6.2.1) to find the moment generating function of the distribution of the sum $w = w_1 + w_2$ of two random variables $w_1$ and $w_2$ that are distributed independently as $\chi^2(N-1)$ and $\chi^2(1, \lambda)$, respectively. Letting $m_1(\cdot)$ represent the moment generating function of the $\chi^2(N-1)$ distribution and $m_2(\cdot)$ the moment generating function of the $\chi^2(1, \lambda)$ distribution and making use of results (1.20) and (2.17), we find that, for $t < 1/2$,

$$\begin{aligned} E\bigl(e^{tw}\bigr) &= E\bigl(e^{tw_1}\bigr)\, E\bigl(e^{tw_2}\bigr) = m_1(t)\, m_2(t) \\ &= (1-2t)^{-(N-1)/2}\, (1-2t)^{-1/2} \exp\!\left(\frac{\lambda t}{1-2t}\right) \\ &= (1-2t)^{-N/2} \exp\!\left(\frac{\lambda t}{1-2t}\right). \end{aligned}$$

Thus, the moment generating function, say $m(\cdot)$, of the $\chi^2(N, \lambda)$ distribution is

$$m(t) = (1-2t)^{-N/2} \exp\!\left(\frac{\lambda t}{1-2t}\right) \quad (t < 1/2). \qquad (2.18)$$
1 2t
Or, upon reexpressing expŒt=.1 2t/ as
   
t =2
exp D exp. =2/ exp
1 2t 1 2t
Noncentral Chi-Square Distribution 273

and replacing expŒ.=2/=.1 2t/ with its power-series expansion


1
.=2/r
  X
=2
exp D ; (2.19)
1 2t rD0
.1 2t/r rŠ

we obtain the alternative representation


1
X .=2/r e =2
.2rCN /=2
m.t/ D .1 2t/ .t < 1=2/: (2.20)
rD0

Alternatively, expression (2.20) for the moment generating function of the 2 .N; / distribution
can be derived from the pdf (2.14). Letting (as in Subsection c) gj ./ represent (for an arbitrary
strictly positive integer j ) the pdf of a 2 .j / distribution, the alternative approach gives
1 1
.=2/r e =2
Z X
m.t/ D et w g2rCN .w/ dw
0 rD0

1
.=2/r e =2 1 t w
X Z
D e g2rCN .w/ dw: (2.21)
rD0
rŠ 0

R1
If we use formula (1.20) to evaluate (for each r) the integral 0 e t w g2rCN .w/ dw, we arrive
immediately at expression (2.20).
The cumulant generating function, say c./, of the 2 .N; / distribution is [in light of result
(2.18)]
c.t/ D log m.t/ D .N=2/ log.1 2t/ C t.1 2t/ 1 .t < 1=2/: (2.22)
Upon expanding c.t/ in a power series (about 0), we find that (for 1=2 < t < 1=2)
1
X 1
X
c.t/ D .N=2/ .2t/r=r C t .2t/r 1

rD1 rD1
1
X
D 2r 1 r
t .N C r/=r
rD1
X1
D .N C r/ 2r 1
.r 1/Š t r=rŠ: (2.23)
rD1

Thus, the rth cumulant of the 2 .N; / distribution is

.N C r/ 2r 1
.r 1/Š: (2.24)

f. Moments
Mean and variance. Let $w$ represent a random variable whose distribution is $\chi^2(N, \lambda)$. And [for purposes of determining $E(w)$, $\mathrm{var}(w)$, and $E(w^2)$] let $x = \sqrt{\lambda} + z$, where $z$ is a random variable that has a standard normal distribution, so that $x \sim N\bigl(\sqrt{\lambda}, 1\bigr)$ and hence $x^2 \sim \chi^2(1, \lambda)$. Then, in light of Theorem 6.2.1,

$$w \sim x^2 + u,$$

where $u$ is a random variable that is distributed independently of $z$ (and hence distributed independently of $x$ and $x^2$) as $\chi^2(N-1)$.

Clearly,

$$E(x^2) = \mathrm{var}(x) + [E(x)]^2 = 1 + \lambda. \qquad (2.25)$$

And, making use of results (3.5.19) and (3.5.20), we find that

$$E(x^4) = E\bigl[\bigl(\sqrt{\lambda} + z\bigr)^4\bigr] = E\bigl(z^4 + 4z^3\sqrt{\lambda} + 6z^2\lambda + 4z\lambda^{3/2} + \lambda^2\bigr) = 3 + 0 + 6\lambda + 0 + \lambda^2 = \lambda^2 + 6\lambda + 3, \qquad (2.26)$$

which [in combination with result (2.25)] implies that

$$\mathrm{var}(x^2) = E(x^4) - [E(x^2)]^2 = 4\lambda + 2 = 2(2\lambda + 1). \qquad (2.27)$$

Thus, upon recalling results (1.29) and (1.30), it follows that

$$E(w) = E(x^2) + E(u) = 1 + \lambda + N - 1 = N + \lambda, \qquad (2.28)$$

$$\mathrm{var}(w) = \mathrm{var}(x^2) + \mathrm{var}(u) = 4\lambda + 2 + 2(N-1) = 2(N + 2\lambda), \qquad (2.29)$$

and

$$E(w^2) = \mathrm{var}(w) + [E(w)]^2 = (N+2)(N+2\lambda) + \lambda^2. \qquad (2.30)$$

Higher-order moments. Let (as in Part 1) $w$ represent a random variable whose distribution is $\chi^2(N, \lambda)$. Further, take (for an arbitrary strictly positive integer $k$) $g_k(\cdot)$ to be the pdf of a $\chi^2(k)$ distribution, and recall [from result (2.14)] that the pdf, say $q(\cdot)$, of the $\chi^2(N, \lambda)$ distribution is expressible (for all $w$) as

$$q(w) = \sum_{j=0}^{\infty} \frac{(\lambda/2)^j e^{-\lambda/2}}{j!}\, g_{2j+N}(w).$$

Then, using result (1.27), we find that (for $r > -N/2$)

$$\begin{aligned} E(w^r) &= \int_0^{\infty} w^r q(w)\, dw \\ &= \sum_{j=0}^{\infty} \frac{(\lambda/2)^j e^{-\lambda/2}}{j!} \int_0^{\infty} w^r g_{2j+N}(w)\, dw \\ &= \sum_{j=0}^{\infty} \frac{(\lambda/2)^j e^{-\lambda/2}}{j!}\, 2^r\, \frac{\Gamma[(N/2)+j+r]}{\Gamma[(N/2)+j]} \\ &= 2^r e^{-\lambda/2} \sum_{j=0}^{\infty} \frac{(\lambda/2)^j\, \Gamma[(N/2)+j+r]}{j!\, \Gamma[(N/2)+j]}. \end{aligned} \qquad (2.31)$$

Now, define $m_r = E(w^r)$, and regard $m_r$ as a function of the noncentrality parameter $\lambda$. And observe that (for $r > -N/2$)

$$\frac{dm_r}{d\lambda} = -(1/2)\, m_r + 2^{r-1} e^{-\lambda/2} \sum_{j=0}^{\infty} \frac{j\, (\lambda/2)^{j-1}\, \Gamma[(N/2)+j+r]}{j!\, \Gamma[(N/2)+j]}$$

and hence that

$$\lambda\!\left(m_r + 2\, \frac{dm_r}{d\lambda}\right) = 2^{r+1} e^{-\lambda/2} \sum_{j=0}^{\infty} \frac{j\, (\lambda/2)^j\, \Gamma[(N/2)+j+r]}{j!\, \Gamma[(N/2)+j]}.$$

Thus, for $r > -N/2$,

$$\begin{aligned} m_{r+1} &= 2^{r+1} e^{-\lambda/2} \sum_{j=0}^{\infty} \frac{(\lambda/2)^j\, \Gamma[(N/2)+j+r+1]}{j!\, \Gamma[(N/2)+j]} \\ &= 2^{r+1} e^{-\lambda/2} \sum_{j=0}^{\infty} \frac{(\lambda/2)^j\, [(N/2)+j+r]\, \Gamma[(N/2)+j+r]}{j!\, \Gamma[(N/2)+j]} \\ &= 2\,[(N/2)+r]\, m_r + \lambda\!\left(m_r + 2\, \frac{dm_r}{d\lambda}\right) \\ &= (N + 2r + \lambda)\, m_r + 2\lambda\, \frac{dm_r}{d\lambda}. \end{aligned} \qquad (2.32)$$

Formula (2.32) relates $E(w^{r+1})$ to $E(w^r)$ and to the derivative of $E(w^r)$ (with respect to $\lambda$). Since clearly $E(w^0) = 1$, formula (2.32) can be used to determine the moments $E(w^1), E(w^2), E(w^3), \dots$ [of the $\chi^2(N, \lambda)$ distribution] recursively. In particular,

$$E(w) = m_1 = [N + 2(0) + \lambda]\, m_0 + 2\lambda\, \frac{dm_0}{d\lambda} = (N + \lambda)\, 1 + 2\lambda\, (0) = N + \lambda \qquad (2.33)$$

and

$$E(w^2) = m_2 = [N + 2(1) + \lambda]\, m_1 + 2\lambda\, \frac{dm_1}{d\lambda} = (N + \lambda + 2)(N + \lambda) + 2\lambda(1) = (N+2)(N+2\lambda) + \lambda^2, \qquad (2.34)$$

in agreement with results (2.28) and (2.30) from Part 1.

Explicit expressions for the moments of the $\chi^2(N, \lambda)$ distribution are provided by the following result: for $r = 0, 1, 2, \dots$,

$$E(w^r) = 2^r\, \Gamma[(N/2)+r] \sum_{j=0}^{r} \binom{r}{j} \frac{(\lambda/2)^j}{\Gamma[(N/2)+j]} \qquad (2.35)$$

(interpret $0^0$ as 1). This result, whose verification is the subject of the final part of the present subsection, provides a representation for the $r$th moment of the $\chi^2(N, \lambda)$ distribution that consists of a sum of $r+1$ terms. In contrast, the representation provided by formula (2.31) consists of a sum of an infinite number of terms. Moreover, by making use of result (1.23), result (2.35) can be reexpressed in the following, "simplified" form: for $r = 0, 1, 2, \dots$,

$$E(w^r) = \lambda^r + \sum_{j=0}^{r-1} \binom{r}{j} (N+2j)[N+2(j+1)][N+2(j+2)] \cdots [N+2(r-1)]\, \lambda^j. \qquad (2.36)$$
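
Formula (2.35) is easy to implement and to compare against library moments; the sketch below uses log-gamma values for numerical stability and checks the first few moments against SciPy (the degrees of freedom and noncentrality are arbitrary choices).

```python
# Closed-form noncentral chi-square moments via formula (2.35), checked against scipy.
# Degrees of freedom and noncentrality parameter are arbitrary choices.
import numpy as np
from scipy.special import gammaln, comb
from scipy import stats

def ncx2_moment(r, N, lam):
    j = np.arange(r + 1)
    terms = comb(r, j) * (lam / 2.0) ** j * np.exp(gammaln(N / 2 + r) - gammaln(N / 2 + j))
    return 2.0 ** r * terms.sum()

N, lam = 4, 2.5
for r in range(1, 4):
    print(r, ncx2_moment(r, N, lam), stats.ncx2(N, lam).moment(r))
```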

Verification of formula (2.35). The verification of formula (2.35) is by mathematical induction. The formula is valid for $r = 0$; according to the formula, $E(w^0) = 1$. Now, suppose that the formula is valid for $r = k$ (where $k$ is an arbitrary nonnegative integer), that is, suppose that

$$m_k = 2^k\, \Gamma[(N/2)+k] \sum_{j=0}^{k} \binom{k}{j} \frac{(\lambda/2)^j}{\Gamma[(N/2)+j]} \qquad (2.37)$$

(as before, $m_j = E(w^j)$ for $j > -N/2$). We wish to show that formula (2.35) is valid for $r = k+1$, that is, to show that

$$m_{k+1} = 2^{k+1}\, \Gamma[(N/2)+k+1] \sum_{j=0}^{k+1} \binom{k+1}{j} \frac{(\lambda/2)^j}{\Gamma[(N/2)+j]}. \qquad (2.38)$$

From result (2.32), we have that

$$m_{k+1} = (N + 2k)\, m_k + \lambda\, m_k + 2\lambda\, \frac{dm_k}{d\lambda}. \qquad (2.39)$$

And it follows from supposition (2.37) that

$$\begin{aligned} \lambda\, m_k &= 2^{k+1}\, \Gamma[(N/2)+k] \sum_{s=0}^{k} \binom{k}{s} \frac{(\lambda/2)^{s+1}}{\Gamma[(N/2)+s]} \\ &= 2^{k+1}\, \Gamma[(N/2)+k] \sum_{j=1}^{k+1} \binom{k}{j-1} \frac{(\lambda/2)^j}{\Gamma[(N/2)+j-1]} \\ &= 2^{k+1}\, \Gamma[(N/2)+k] \sum_{j=1}^{k+1} \binom{k}{j-1} \frac{[(N/2)+j-1]\, (\lambda/2)^j}{\Gamma[(N/2)+j]} \end{aligned} \qquad (2.40)$$

and that

$$2\lambda\, \frac{dm_k}{d\lambda} = 2^{k+1}\, \Gamma[(N/2)+k] \sum_{j=1}^{k} \binom{k}{j} \frac{j\, (\lambda/2)^j}{\Gamma[(N/2)+j]}. \qquad (2.41)$$

Upon starting with expression (2.39) and substituting expressions (2.37), (2.40), and (2.41) for $m_k$, $\lambda m_k$, and $2\lambda(dm_k/d\lambda)$, respectively, we find that

$$m_{k+1} = 2^{k+1}\, \Gamma[(N/2)+k] \left\{\sum_{j=0}^{k} \binom{k}{j} \frac{[(N/2)+k]\, (\lambda/2)^j}{\Gamma[(N/2)+j]} + \sum_{j=1}^{k+1} \binom{k}{j-1} \frac{[(N/2)+j-1]\, (\lambda/2)^j}{\Gamma[(N/2)+j]} + \sum_{j=1}^{k} \binom{k}{j} \frac{j\, (\lambda/2)^j}{\Gamma[(N/2)+j]}\right\}. \qquad (2.42)$$

Expressions (2.38) and (2.42) are both polynomials of degree $k+1$ in $\lambda/2$. Thus, to establish equality (2.38), it suffices to establish that the coefficient of $(\lambda/2)^j$ is the same for each of these two polynomials ($j = 0, 1, \dots, k+1$). In the case of the polynomial (2.42), the coefficient of $(\lambda/2)^0$ is

$$2^{k+1}\, \Gamma[(N/2)+k]\, [(N/2)+k] \big/ \Gamma(N/2) = 2^{k+1}\, \Gamma[(N/2)+k+1] \big/ \Gamma(N/2); \qquad (2.43)$$

the coefficient of $(\lambda/2)^{k+1}$ is

$$2^{k+1}\, \Gamma[(N/2)+k]\, [(N/2)+k] \big/ \Gamma[(N/2)+k+1] = 2^{k+1}; \qquad (2.44)$$

and, for $j = 1, 2, \dots, k$, the coefficient of $(\lambda/2)^j$ is

$$\begin{aligned} \frac{2^{k+1}\, \Gamma[(N/2)+k]}{\Gamma[(N/2)+j]} &\left\{[(N/2)+k] \binom{k}{j} + [(N/2)-1] \binom{k}{j-1} + j\!\left[\binom{k}{j-1} + \binom{k}{j}\right]\right\} \\ &= \frac{2^{k+1}\, \Gamma[(N/2)+k]}{\Gamma[(N/2)+j]} \left\{[(N/2)+k] \binom{k}{j} + [(N/2)-1] \binom{k}{j-1} + j \binom{k+1}{j}\right\} \\ &= \frac{2^{k+1}\, \Gamma[(N/2)+k]}{\Gamma[(N/2)+j]} \binom{k+1}{j} \left\{[(N/2)+k]\, \frac{k-j+1}{k+1} + [(N/2)-1]\, \frac{j}{k+1} + j\right\} \\ &= \frac{2^{k+1}\, \Gamma[(N/2)+k]}{\Gamma[(N/2)+j]} \binom{k+1}{j}\, [(N/2)+k] \\ &= \frac{2^{k+1}\, \Gamma[(N/2)+k+1]}{\Gamma[(N/2)+j]} \binom{k+1}{j}. \end{aligned} \qquad (2.45)$$

Clearly, the coefficients (2.43), (2.44), and (2.45) are identical to the coefficients of the polynomial (2.38). Accordingly, equality (2.38) is established, and the mathematical-induction argument is complete.

g. An extension: the “noncentral gamma distribution”


For strictly positive parameters $\alpha$ and $\beta$ and for a nonnegative parameter $\delta$, let $f(\cdot)$ represent the function defined (on the real line) by

$$f(x) = \sum_{r=0}^{\infty} \frac{\delta^r e^{-\delta}}{r!}\, h_r(x), \qquad (2.46)$$

where $h_r(\cdot)$ is the pdf of a $\mathrm{Ga}(\alpha + r, \beta)$ distribution. The function $f(\cdot)$ is the pdf of a distribution. More specifically, it is the pdf of a mixture distribution; it is a weighted average of the pdfs of gamma distributions, where the weights $\delta^r e^{-\delta}/r!$ ($r = 0, 1, 2, \dots$) are the values assumed by the probability mass function of a Poisson distribution with parameter $\delta$.

The noncentral chi-square distribution is related to the distribution with pdf (2.46) in much the same way that the (central) chi-square distribution is related to the gamma distribution. In the special case where $\alpha = N/2$, $\beta = 2$, and $\delta = \lambda/2$, the pdf (2.46) is identical to the pdf of the $\chi^2(N, \lambda)$ distribution. Accordingly, the pdf (2.46) provides a basis for extending the definition of the noncentral chi-square distribution to noninteger degrees of freedom.

Let us denote the distribution with pdf (2.46) by the symbol $\mathrm{Ga}(\alpha, \beta, \delta)$. And let us write $m(\cdot)$ for the moment generating function of that distribution. Then, proceeding in much the same way as in arriving at expression (2.21) and recalling result (1.19), we find that (for $t < 1/\beta$)

$$m(t) = \sum_{r=0}^{\infty} \frac{\delta^r e^{-\delta}}{r!} \int_0^{\infty} e^{tx} h_r(x)\, dx = \sum_{r=0}^{\infty} \frac{\delta^r e^{-\delta}}{r!}\, (1 - \beta t)^{-(\alpha + r)}. \qquad (2.47)$$

Moreover, analogous to expression (2.19), we have the power-series representation

$$\exp\!\left(\frac{\delta}{1 - \beta t}\right) = \sum_{r=0}^{\infty} \frac{\delta^r}{(1 - \beta t)^r\, r!},$$

so that the moment generating function is reexpressible in the form

$$m(t) = (1 - \beta t)^{-\alpha} \exp\!\left(\frac{\delta \beta t}{1 - \beta t}\right) \quad (t < 1/\beta). \qquad (2.48)$$

The cumulants of the $\mathrm{Ga}(\alpha, \beta, \delta)$ distribution can be determined from expression (2.48) in essentially the same way that the cumulants of the $\chi^2(N, \lambda)$ distribution were determined from expression (2.18): for $r = 1, 2, 3, \dots$, the $r$th cumulant is

$$(\alpha + \delta r)\, \beta^r\, (r-1)!. \qquad (2.49)$$


Theorem 6.2.1 (on the distribution of a sum of noncentral chi-square random variables) can be generalized. Suppose that $K$ random variables $w_1, w_2, \dots, w_K$ are distributed independently as $\mathrm{Ga}(\alpha_1, \beta, \delta_1), \mathrm{Ga}(\alpha_2, \beta, \delta_2), \dots, \mathrm{Ga}(\alpha_K, \beta, \delta_K)$, respectively. Then, the moment generating function, say $m(\cdot)$, of the sum $\sum_{k=1}^K w_k$ of $w_1, w_2, \dots, w_K$ is given by the formula

$$m(t) = (1 - \beta t)^{-(\alpha_1 + \alpha_2 + \cdots + \alpha_K)} \exp\!\left[\frac{(\delta_1 + \delta_2 + \cdots + \delta_K)\, \beta t}{1 - \beta t}\right] \quad (t < 1/\beta),$$

as is evident from result (2.48) upon observing that

$$E\Bigl(e^{t \sum_{k=1}^K w_k}\Bigr) = E\!\left(\prod_{k=1}^K e^{t w_k}\right) = \prod_{k=1}^K E\bigl(e^{t w_k}\bigr).$$

Thus, $m(\cdot)$ is the moment generating function of a $\mathrm{Ga}\bigl(\sum_{k=1}^K \alpha_k,\, \beta,\, \sum_{k=1}^K \delta_k\bigr)$ distribution, and it follows that $\sum_{k=1}^K w_k \sim \mathrm{Ga}\bigl(\sum_{k=1}^K \alpha_k,\, \beta,\, \sum_{k=1}^K \delta_k\bigr)$. In conclusion, we have the following theorem, which can be regarded as a generalization of Theorem 6.1.2 as well as of Theorem 6.2.1.

Theorem 6.2.2. If $K$ random variables $w_1, w_2, \dots, w_K$ are distributed independently as $\mathrm{Ga}(\alpha_1, \beta, \delta_1), \mathrm{Ga}(\alpha_2, \beta, \delta_2), \dots, \mathrm{Ga}(\alpha_K, \beta, \delta_K)$, respectively, then $w_1 + w_2 + \cdots + w_K \sim \mathrm{Ga}\bigl(\sum_{k=1}^K \alpha_k,\, \beta,\, \sum_{k=1}^K \delta_k\bigr)$.
It remains to extend the results of Subsection f (on the moments of the noncentral chi-square distribution) to the moments of the distribution with pdf (2.46). Let $w$ represent a random variable whose distribution is $\mathrm{Ga}(\alpha, \beta, \delta)$. Then, as a readily derivable generalization of formula (2.31), we have that (for $r > -\alpha$)

$$E(w^r) = \beta^r e^{-\delta} \sum_{j=0}^{\infty} \frac{\delta^j\, \Gamma(\alpha + j + r)}{j!\, \Gamma(\alpha + j)}. \qquad (2.50)$$

Further, defining $m_r = E(w^r)$, regarding $m_r$ as a function of $\delta$, and proceeding as in the derivation of result (2.32), we find that

$$m_{r+1} = \beta\!\left[(\alpha + r + \delta)\, m_r + \delta\, \frac{dm_r}{d\delta}\right]. \qquad (2.51)$$

The recursive relationship (2.51) can be used to derive the first two moments of the $\mathrm{Ga}(\alpha, \beta, \delta)$ distribution in much the same way that the recursive relationship (2.32) was used to derive the first two moments of the $\chi^2(N, \lambda)$ distribution. Upon observing that $E(w^0) = 1$, we find that

$$E(w) = m_1 = \beta\!\left[(\alpha + 0 + \delta)\, m_0 + \delta\, \frac{dm_0}{d\delta}\right] = \beta(\alpha + \delta) \qquad (2.52)$$

and that

$$E(w^2) = m_2 = \beta\!\left[(\alpha + 1 + \delta)\, m_1 + \delta\, \frac{dm_1}{d\delta}\right] = \beta\bigl[(\alpha + 1 + \delta)\, \beta(\alpha + \delta) + \delta\beta\bigr] = \beta^2\bigl[(\alpha + 1)(\alpha + 2\delta) + \delta^2\bigr]. \qquad (2.53)$$

Further,

$$\mathrm{var}(w) = E(w^2) - [E(w)]^2 = \beta^2(\alpha + 2\delta). \qquad (2.54)$$

Finally, as a generalization of formula (2.35), we have that (for $r = 0, 1, 2, \dots$)

$$E(w^r) = \beta^r\, \Gamma(\alpha + r) \sum_{j=0}^{r} \binom{r}{j} \frac{\delta^j}{\Gamma(\alpha + j)}. \qquad (2.55)$$

Formula (2.55) can be verified via a mathematical-induction argument akin to that employed in Part 3 of Subsection f, and, in what is essentially a generalization of result (2.36), is [in light of result (1.23)] reexpressible in the "simplified" form

$$E(w^r) = \beta^r\!\left[\delta^r + \sum_{j=0}^{r-1} \binom{r}{j} (\alpha + j)(\alpha + j + 1)(\alpha + j + 2) \cdots (\alpha + r - 1)\, \delta^j\right]. \qquad (2.56)$$
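
Sampling from the $\mathrm{Ga}(\alpha, \beta, \delta)$ distribution is straightforward given its Poisson-mixture definition, which makes formulas (2.52) and (2.54) easy to check; the parameter values in the sketch below are arbitrary.

```python
# Sample from the "noncentral gamma" Ga(alpha, beta, delta) by drawing a Poisson(delta)
# index R and then a Ga(alpha + R, beta) variate; check E(w) and var(w) against
# (2.52) and (2.54).  Parameter values are arbitrary.
import numpy as np

rng = np.random.default_rng(6)
alpha, beta, delta = 1.8, 2.0, 3.5
n = 500_000

r = rng.poisson(delta, size=n)
w = rng.gamma(shape=alpha + r, scale=beta)

print(w.mean(), beta * (alpha + delta))            # (2.52)
print(w.var(), beta ** 2 * (alpha + 2 * delta))    # (2.54)
```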

h. An extension: distribution of the sum of the squared elements of a random vector that is distributed spherically about a nonnull vector of constants

Let $z = (z_1, z_2, \dots, z_N)'$ represent an $N$-dimensional random (column) vector that has any particular spherical distribution. Further, let $\mu = (\mu_1, \mu_2, \dots, \mu_N)'$ represent any $N$-dimensional nonrandom (column) vector. And consider the distribution of the quantity $x'x = \sum_{i=1}^N x_i^2$, where $x = (x_1, x_2, \dots, x_N)'$ is the $N$-dimensional random (column) vector defined as follows: $x = \mu + z$ or, equivalently, $x_i = \mu_i + z_i$ ($i = 1, 2, \dots, N$) (so that $x$ is distributed spherically about $\mu$).

Define $\lambda = \mu'\mu = \sum_{i=1}^N \mu_i^2$ and $\delta = \sqrt{\lambda} = \sqrt{\mu'\mu} = \sqrt{\sum_{i=1}^N \mu_i^2}$. It follows from what was established earlier (in Subsection b) that in the special case where $z \sim N(0, I)$ and hence where $x \sim N(\mu, I)$, the distribution of $x'x$ depends on the value of $\mu$ only through $\lambda$ or, equivalently, only through $\delta$; in that special case, the distribution of $x'x$ is, by definition, the noncentral chi-square distribution with parameters $N$ and $\lambda$. In fact, the distribution of $x'x$ has this property (i.e., the property of depending on the value of $\mu$ only through $\lambda$ or, equivalently, only through $\delta$) not only in the special case where $z \sim N(0, I)$, but also in the general case (where the distribution of $z$ is any particular spherical distribution).

To see this, take (as in Subsection b) $A$ to be any $N \times N$ orthogonal matrix whose first column is $(1/\delta)\mu$; if $\lambda = 0$ (or, equivalently, if $\delta = 0$), take $A$ to be an arbitrary $N \times N$ orthogonal matrix. And observe that

$$x'x = x'Ix = x'AA'x = (A'x)'A'x \qquad (2.57)$$

and that

$$A'x = A'\mu + w, \qquad (2.58)$$

where $w = A'z$. Observe also that

$$A'\mu = \begin{pmatrix} \delta \\ 0 \end{pmatrix}, \qquad (2.59)$$

that

$$w \sim z, \qquad (2.60)$$

and, upon partitioning $w$ as $w = \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}$ (where $w_1$ is a scalar), that

$$x'x = (A'\mu + w)'(A'\mu + w) = (\delta + w_1)^2 + w_2'w_2 = \lambda + 2\delta w_1 + w'w. \qquad (2.61)$$

It is now clear [in light of results (2.60) and (2.61)] that the distribution of $x'x$ depends on the value of $\mu$ only through $\lambda$ or, equivalently, only through $\delta$.

Consider now the special case of an absolutely continuous spherical distribution where the distribution of $z$ is absolutely continuous with a pdf $f(\cdot)$ such that

$$f(z) = g(z'z) \quad (z \in \mathbb{R}^N) \qquad (2.62)$$

for some (nonnegative) function $g(\cdot)$ (of a single nonnegative variable). Letting $u = w'w$ and $v = w_1/(w'w)^{1/2}$, we find [in light of result (2.61)] that (for $w \ne 0$) $x'x$ is expressible in the form

$$x'x = \lambda + 2\delta v u^{1/2} + u. \qquad (2.63)$$

Moreover, in light of result (2.60), the joint distribution of $u$ and $v$ is the same as that of $z'z$ and $z_1/(z'z)^{1/2}$, implying (in light of the results of Section 6.1g) that $u$ and $v$ are distributed independently, that the distribution of $u$ is the distribution with pdf $r(\cdot)$, where
$$r(u) = \begin{cases} \dfrac{\pi^{N/2}}{\Gamma(N/2)}\, u^{(N/2)-1} g(u), & \text{for } 0 < u < \infty, \\ 0, & \text{for } -\infty < u \le 0, \end{cases} \qquad (2.64)$$

and that the distribution of $v$ is the distribution with pdf $h^*(\cdot)$, where

$$h^*(v) = \begin{cases} \dfrac{\Gamma(N/2)}{\Gamma[(N-1)/2]\, \pi^{1/2}}\, (1 - v^2)^{(N-3)/2}, & \text{if } -1 < v < 1, \\ 0, & \text{otherwise.} \end{cases} \qquad (2.65)$$

Define $y = x'x$. In the special case where $z \sim N(0, I)$ [and hence where $x \sim N(\mu, I)$], $y$ has (by definition) a $\chi^2(N, \lambda)$ distribution, the pdf of which is a function $q(\cdot)$ that is expressible in the form (2.15). Let us obtain an expression for the pdf of the distribution of $y$ in the general case [where the distribution of $z$ is any (spherical) distribution that is absolutely continuous with a pdf $f(\cdot)$ of the form (2.62)].

Denote by $d(\cdot, \cdot)$ the pdf of the joint distribution of $u$ and $v$, so that (for all $u$ and $v$)

$$d(u, v) = r(u)\, h^*(v).$$

Now, introduce a change of variables from $u$ and $v$ to the random variables $y$ and $s$, where

$$s = \frac{\delta + w_1}{y^{1/2}} = \frac{\delta + v u^{1/2}}{(\lambda + 2\delta v u^{1/2} + u)^{1/2}}$$

(and where $y = \lambda + 2\delta v u^{1/2} + u$). And observe that

$$\frac{\partial y}{\partial u} = 1 + \delta v u^{-1/2}, \qquad \frac{\partial y}{\partial v} = 2\delta u^{1/2},$$

$$\frac{\partial s}{\partial u} = (1/y)\bigl[(1/2)\, y^{1/2} v u^{-1/2} - (1/2)\, y^{-1/2} (1 + \delta v u^{-1/2})(\delta + v u^{1/2})\bigr], \quad\text{and}$$

$$\frac{\partial s}{\partial v} = (1/y)\bigl[y^{1/2} u^{1/2} - y^{-1/2} \delta u^{1/2} (\delta + v u^{1/2})\bigr];$$

accordingly,

$$\begin{vmatrix} \partial y/\partial u & \partial y/\partial v \\ \partial s/\partial u & \partial s/\partial v \end{vmatrix} = u^{1/2} y^{-1/2},$$

as can be readily verified. Observe also that $v u^{1/2} = s y^{1/2} - \delta$ and hence that

$$u = y - \lambda - 2\delta v u^{1/2} = y - 2\delta s y^{1/2} + \lambda$$

and

$$1 - v^2 = \frac{u - (s y^{1/2} - \delta)^2}{u} = \frac{y - s^2 y}{u} = (1 - s^2)(y/u).$$

Further, denoting by $b(\cdot, \cdot)$ the pdf of the joint distribution of $y$ and $s$ and making use of standard results on a change of variables (e.g., Bickel and Doksum 2001, sec. B.2.1), we find that, for $0 < y < \infty$ and for $-1 < s < 1$,

$$b(y, s) = \frac{\pi^{(N-1)/2}}{\Gamma[(N-1)/2]}\, y^{(N/2)-1} (1 - s^2)^{(N-3)/2}\, g\bigl(y - 2\delta s y^{1/2} + \lambda\bigr) \qquad (2.66)$$

(for $-\infty < y \le 0$ and for $s$ such that $s \le -1$ or $s \ge 1$, $b(y, s) = 0$).

We conclude that a pdf, say $q(\cdot)$, of the (marginal) distribution of the random variable $y$ ($= x'x$) is obtainable by taking, for $0 < y < \infty$,

$$q(y) = \frac{\pi^{(N-1)/2}}{\Gamma[(N-1)/2]}\, y^{(N/2)-1} \int_{-1}^{1} (1 - s^2)^{(N-3)/2}\, g\bigl(y - 2\delta s y^{1/2} + \lambda\bigr)\, ds \qquad (2.67)$$

[and by taking, for $-\infty < y \le 0$, $q(y) = 0$]. In the special case where $\lambda = 0$ (and hence where $\delta = 0$), expression (2.67) simplifies to

$$q(y) = \frac{\pi^{N/2}}{\Gamma(N/2)}\, y^{(N/2)-1} g(y),$$

in agreement with the pdf (1.56) derived earlier (in Section 6.1g) for that special case.
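
As a sanity check on (2.67), one can take $g$ to be the standard normal kernel $g(t) = (2\pi)^{-N/2} e^{-t/2}$, in which case the integral should reproduce the noncentral chi-square pdf; the sketch below evaluates (2.67) by numerical quadrature and compares it with SciPy's `ncx2.pdf` (the values of $N$, $\lambda$, and the evaluation points are arbitrary).

```python
# Numerical check of (2.67): with the normal kernel g, the pdf of y = x'x should
# coincide with the noncentral chi-square pdf.  N, lambda, and y-values are arbitrary.
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln
from scipy import stats

N, lam = 4, 3.0
delta = np.sqrt(lam)

def g(t):                                   # normal case: g(z'z) = (2*pi)^(-N/2) e^(-t/2)
    return (2 * np.pi) ** (-N / 2) * np.exp(-t / 2)

def q(y):                                   # expression (2.67)
    const = np.exp((N - 1) / 2 * np.log(np.pi) - gammaln((N - 1) / 2))
    integrand = lambda s: (1 - s ** 2) ** ((N - 3) / 2) * g(y - 2 * delta * s * np.sqrt(y) + lam)
    return const * y ** (N / 2 - 1) * quad(integrand, -1, 1)[0]

for y in [1.0, 5.0, 12.0]:
    print(q(y), stats.ncx2(N, lam).pdf(y))   # should agree closely
```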

6.3 Central and Noncentral F Distributions


Let

$$F = \frac{u/M}{v/N} = (N/M)\, \frac{u}{v}, \qquad (3.1)$$

where $u$ and $v$ are random variables that are distributed independently as $\chi^2(M)$ and $\chi^2(N)$, respectively. The distribution of the random variable $F$ plays an important role in the use of linear models to make statistical inferences. This distribution is known as Snedecor's $F$ distribution or simply as the $F$ distribution. It is also sometimes referred to as Fisher's variance-ratio distribution.

The distribution of the random variable $F$ depends on two parameters: $M$, which is referred to as the numerator degrees of freedom, and $N$, which is referred to as the denominator degrees of freedom. Let us denote the $F$ distribution with $M$ numerator degrees of freedom and $N$ denominator degrees of freedom by the symbol $SF(M, N)$.

As a generalization of the $F$ distribution, we have the noncentral $F$ distribution. The noncentral $F$ distribution is the distribution of the random variable

$$F^* = \frac{u^*/M}{v/N} = (N/M)\, \frac{u^*}{v} \qquad (3.2)$$

obtained from expression (3.1) for the random variable $F$ upon replacing the random variable $u$ (which has a central chi-square distribution) with a random variable $u^*$ that has the noncentral chi-square distribution $\chi^2(M, \lambda)$ (and that, like $u$, is distributed independently of $v$). The distribution of the random variable $F^*$ has three parameters: the numerator degrees of freedom $M$, the denominator degrees of freedom $N$, and the noncentrality parameter $\lambda$. Let us denote the noncentral $F$ distribution with parameters $M$, $N$, and $\lambda$ by the symbol $SF(M, N, \lambda)$.

Clearly, the $F$ distribution $SF(M, N)$ can be regarded as the special case $SF(M, N, 0)$ of the noncentral $F$ distribution obtained upon setting the noncentrality parameter $\lambda$ equal to 0. For the sake of clarity (i.e., to distinguish it from the noncentral $F$ distribution), the "ordinary" $F$ distribution may be referred to as the central $F$ distribution.

a. (Central) F distribution

Relationship to the beta distribution. Let $u$ and $v$ represent random variables that are distributed independently as $\chi^2(M)$ and $\chi^2(N)$, respectively. Further, define

$$w = \frac{u/M}{v/N} \quad\text{and}\quad x = \frac{u}{u+v}.$$

Then,

$$w = \frac{Nx}{M(1-x)} \quad\text{and}\quad x = \frac{Mw}{N + Mw} = \frac{(M/N)w}{1 + (M/N)w}, \qquad (3.3)$$

as can be readily verified.

By definition, $w \sim SF(M, N)$. And in light of the discussion of Sections 6.1a, 6.1b, and 6.1c, $x \sim \mathrm{Be}(M/2, N/2)$. In effect, we have established the following result: if $x$ is a random variable that is distributed as $\mathrm{Be}(M/2, N/2)$, then

$$\frac{Nx}{M(1-x)} \sim SF(M, N); \qquad (3.4)$$

and if $w$ is a random variable that is distributed as $SF(M, N)$, then

$$\frac{Mw}{N + Mw} \sim \mathrm{Be}(M/2, N/2). \qquad (3.5)$$

The cdf (cumulative distribution function) of an $F$ distribution can be reexpressed in terms of an incomplete beta function ratio (which coincides with the cdf of a beta distribution); the incomplete beta function ratio was introduced and discussed in Section 6.1b. Denote by $F(\cdot)$ the cdf of the $SF(M, N)$ distribution. Then, letting $x$ represent a random variable that is distributed as $\mathrm{Be}(M/2, N/2)$, we find [in light of result (3.4)] that (for any nonnegative scalar $c$)

$$F(c) = \Pr\!\left[\frac{Nx}{M(1-x)} \le c\right] = \Pr\!\left(x \le \frac{Mc}{N + Mc}\right) = I_{Mc/(N+Mc)}(M/2,\, N/2) \qquad (3.6)$$

(for $c < 0$, $F(c) = 0$). Moreover, in light of result (1.14) on the incomplete beta function ratio, the cdf of the $SF(M, N)$ distribution can also be expressed (for $c \ge 0$) as

$$F(c) = 1 - I_{N/(N+Mc)}(N/2,\, M/2). \qquad (3.7)$$
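
Relations (3.6) and (3.7) are the basis on which $F$ probabilities are typically computed; the regularized incomplete beta function is available as `scipy.special.betainc`, so the identities can be checked directly (the degrees of freedom and cutoff in the sketch below are arbitrary choices).

```python
# Check (3.6) and (3.7): the F cdf expressed through the incomplete beta function ratio.
# Degrees of freedom and the cutoff c are arbitrary choices.
from scipy import stats
from scipy.special import betainc

M, N, c = 5, 12, 2.3
lhs = stats.f(M, N).cdf(c)
rhs_36 = betainc(M / 2, N / 2, M * c / (N + M * c))        # I_{Mc/(N+Mc)}(M/2, N/2)
rhs_37 = 1 - betainc(N / 2, M / 2, N / (N + M * c))        # 1 - I_{N/(N+Mc)}(N/2, M/2)
print(lhs, rhs_36, rhs_37)     # all three should agree
```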

Distribution of the reciprocal. Let $w$ represent a random variable that has an $SF(M, N)$ distribution. Then, clearly,

$$\frac{1}{w} \sim SF(N, M). \qquad (3.8)$$

Now, let $F(\cdot)$ represent the cdf (cumulative distribution function) of the $SF(M, N)$ distribution and $G(\cdot)$ the cdf of the $SF(N, M)$ distribution. Then, for any strictly positive scalar $c$,

$$F(c) = 1 - G(1/c), \qquad (3.9)$$

as is evident upon observing that

$$\Pr(w \le c) = \Pr(1/w \ge 1/c) = \Pr(1/w > 1/c) = 1 - \Pr(1/w \le 1/c)$$

[and as could also be ascertained from results (3.6) and (3.7)]. Moreover, for $0 < \alpha < 1$, the upper $100\,\alpha\%$ point, say $\bar{F}_\alpha(M, N)$, of the $SF(M, N)$ distribution is related to the upper $100(1-\alpha)\%$ point, say $\bar{F}_{1-\alpha}(N, M)$, of the $SF(N, M)$ distribution as follows:

$$\bar{F}_\alpha(M, N) = 1/\bar{F}_{1-\alpha}(N, M). \qquad (3.10)$$

This relationship can be readily verified by applying result (3.9) [with $c = 1/\bar{F}_{1-\alpha}(N, M)$] or simply by observing that

$$\Pr[w > 1/\bar{F}_{1-\alpha}(N, M)] = \Pr[1/w < \bar{F}_{1-\alpha}(N, M)] = \Pr[1/w \le \bar{F}_{1-\alpha}(N, M)] = \alpha.$$
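
Relationship (3.10) underlies the old practice of tabulating only upper percentage points of the $F$ distribution; it is easy to confirm numerically with SciPy's quantile function (the values of $M$, $N$, and $\alpha$ below are arbitrary).

```python
# Check (3.10): the upper 100*alpha% point of SF(M, N) is the reciprocal of the
# upper 100*(1-alpha)% point of SF(N, M).  M, N, and alpha are arbitrary choices.
from scipy import stats

M, N, alpha = 7, 15, 0.05
upper_MN = stats.f(M, N).ppf(1 - alpha)          # upper 100*alpha% point of SF(M, N)
upper_NM = stats.f(N, M).ppf(alpha)              # upper 100*(1-alpha)% point of SF(N, M)
print(upper_MN, 1 / upper_NM)                    # should agree
```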

Joint distribution. As in Part 1 of the present subsection, take $u$ and $v$ to be statistically independent random variables that are distributed as $\chi^2(M)$ and $\chi^2(N)$, respectively, and define $w = (u/M)/(v/N)$ and $x = u/(u+v)$. Let us consider the joint distribution of $w$ and the random variable $y$ defined as follows:

$$y = u + v.$$

By definition, $w \sim SF(M, N)$. And in light of Theorem 6.1.3, $y \sim \chi^2(M+N)$. Moreover, upon observing (in light of the results of Section 6.1c) that the $\chi^2(M)$ distribution is identical to the $\mathrm{Ga}(M/2, 2)$ distribution and that the $\chi^2(N)$ distribution is identical to the $\mathrm{Ga}(N/2, 2)$ distribution, it follows from Theorem 6.1.1 that $y$ is distributed independently of $x$ and hence [since, in light of result (3.3), $w$ is expressible as a function of $x$] that $y$ is distributed independently of $w$.

Probability density function (pdf). Let $x$ represent a random variable whose distribution is $\mathrm{Be}(M/2, N/2)$ (where $M$ and $N$ are arbitrary strictly positive integers). As is evident from the results of Sections 6.1a and 6.1b, the $\mathrm{Be}(M/2, N/2)$ distribution has as a pdf the function $h(\cdot)$ given by the formula

$$h(x) = \begin{cases} \dfrac{\Gamma[(M+N)/2]}{\Gamma(M/2)\,\Gamma(N/2)}\, x^{(M/2)-1} (1-x)^{(N/2)-1}, & \text{for } 0 < x < 1, \\ 0, & \text{elsewhere.} \end{cases}$$

Consider the random variable $w$ defined as follows:

$$w = \frac{Nx}{M(1-x)}. \qquad (3.11)$$

According to result (3.4), $w \sim SF(M, N)$. Moreover, equality (3.11) defines a one-to-one transformation from the interval $0 < x < 1$ onto the interval $0 < w < \infty$; the inverse transformation is that defined by the equality

$$x = (M/N)w\,[1 + (M/N)w]^{-1}.$$

Thus, a pdf, say $f(\cdot)$, of the $SF(M, N)$ distribution is obtainable from the pdf of the $\mathrm{Be}(M/2, N/2)$ distribution. Upon observing that

$$\frac{dx}{dw} = (M/N)\,[1 + (M/N)w]^{-2} \quad\text{and}\quad 1 - x = [1 + (M/N)w]^{-1}$$

and making use of standard results on a change of variable, we find that, for $0 < w < \infty$,

$$\begin{aligned} f(w) &= h\bigl\{(M/N)w\,[1 + (M/N)w]^{-1}\bigr\}\, (M/N)\,[1 + (M/N)w]^{-2} \\ &= \frac{\Gamma[(M+N)/2]}{\Gamma(M/2)\,\Gamma(N/2)}\, (M/N)^{M/2}\, w^{(M/2)-1}\, [1 + (M/N)w]^{-(M+N)/2} \end{aligned} \qquad (3.12)$$

(for $-\infty < w \le 0$, $f(w) = 0$).


Moments. Let w represent a random variable that has an SF(M, N) distribution. Then, by definition,
w ∼ (N/M)(u/v), where u and v are random variables that are distributed independently as χ²(M)
and χ²(N), respectively. And making use of result (1.27), we find that, for −M/2 < r < N/2,

    E(w^r) = (N/M)^r E(u^r) E(v^{−r}) = (N/M)^r Γ[(M/2)+r] Γ[(N/2)−r] / [Γ(M/2) Γ(N/2)].   (3.13)

Further, it follows from results (1.28) and (1.31) that the rth (integer) moment of the SF(M, N)
distribution is expressible as

    E(w^r) = (N/M)^r M(M+2)(M+4)···[M+2(r−1)] / [(N−2)(N−4)···(N−2r)]                      (3.14)

(r = 1, 2, ... < N/2). For r ≥ N/2, the rth moment of the SF(M, N) distribution does not exist.
(And, as a consequence, the F distribution does not have a moment generating function.)
The mean of the SF(M, N) distribution is (if N > 2)

    E(w) = N/(N − 2).                                                                       (3.15)
And the second moment and the variance are (if N > 4)

    E(w²) = (N/M)² M(M+2) / [(N−2)(N−4)] = [(M+2)/M] N² / [(N−2)(N−4)]                     (3.16)

and

    var(w) = E(w²) − [E(w)]² = 2N²(M+N−2) / [M(N−2)²(N−4)].                                (3.17)
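Formulas (3.15)–(3.17) can be compared with the moments that SciPy reports for the central F
distribution; the following sketch (with illustrative M and N) is not part of the original development.

```python
from scipy import stats

M, N = 6.0, 10.0
mean_315 = N / (N - 2)
var_317 = 2 * N**2 * (M + N - 2) / (M * (N - 2)**2 * (N - 4))
dist = stats.f(M, N)
print(abs(dist.mean() - mean_315) < 1e-12)   # expected: True
print(abs(dist.var() - var_317) < 1e-12)     # expected: True
```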

Noninteger degrees of freedom. The definition of the F distribution can be extended to noninteger
degrees of freedom in much the same way as the definition of the chi-square distribution. For
arbitrary strictly positive numbers M and N, take u and v to be random variables that are distributed
independently as Ga(M/2, 2) and Ga(N/2, 2), respectively, and define

    w = (u/M)/(v/N).                                                                       (3.18)

Let us regard the distribution of the random variable w as an F distribution with M (possibly
noninteger) numerator degrees of freedom and N (possibly noninteger) denominator degrees of
freedom. In the special case where M and N are (strictly positive) integers, the Ga(M/2, 2) distribution
is identical to the χ²(M) distribution and the Ga(N/2, 2) distribution is identical to the χ²(N)
distribution (as is evident from the results of Section 6.1c), so that this usage of the term F distribution
is consistent with our previous usage of this term.
Note that in the definition (3.18) of the random variable w, we could have taken u and v to
be random variables that are distributed independently as Ga(M/2, β) and Ga(N/2, β), respectively
(where β is an arbitrary strictly positive scalar). The distribution of w is unaffected by the choice
of β. To see this, observe (as in Part 1 of the present subsection) that w = Nx/[M(1 − x)], where
x = u/(u + v), and that (irrespective of the value of β) u/(u + v) ∼ Be(M/2, N/2).
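Definition (3.18) is easy to explore by simulation; the sketch below draws from the two gamma
distributions (with the scale β = 2 used above, though any common β would do) for illustrative
noninteger M and N, and compares the resulting ratios with SciPy's F distribution via a
Kolmogorov–Smirnov test. It is a numerical illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
M, N = 3.7, 8.2                                     # noninteger degrees of freedom
u = rng.gamma(shape=M / 2, scale=2.0, size=200_000)
v = rng.gamma(shape=N / 2, scale=2.0, size=200_000)
w = (u / M) / (v / N)
# compare the simulated ratios with the SF(M, N) cdf
print(stats.kstest(w, stats.f(M, N).cdf).pvalue > 0.01)   # typically True
```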


A function of random variables that are distributed independently and identically as N(0, 1).
Let z = (z₁, z₂, ..., z_N)′ (where N ≥ 2) represent an (N-dimensional) random (column) vector
that has an N-variate standard normal distribution or, equivalently, whose elements are distributed
independently and identically as N(0, 1). And let K represent any integer between 1 and N − 1,
inclusive. Then,

    (Σ_{i=1}^K z_i²/K) / (Σ_{i=K+1}^N z_i²/(N − K)) ∼ SF(K, N − K),                       (3.19)

as is evident from the very definitions of the chi-square and F distributions.


A multivariate version. Let u₁, u₂, ..., u_J, and v represent random variables that are distributed
independently as χ²(M₁), χ²(M₂), ..., χ²(M_J), and χ²(N), respectively. Further, define (for j =
1, 2, ..., J)

    w_j = (u_j/M_j)/(v/N).

Then, the (marginal) distribution of w_j is SF(M_j, N). And the (marginal) distribution of w_j is
related to a beta distribution; as is evident from the results of Part 1 of the present subsection (and
as can be readily verified),

    w_j = N x_j / [M_j(1 − x_j)],

where x_j = u_j/(u_j + v) ∼ Be(M_j/2, N/2).
Now, consider the joint distribution of the random variables w₁, w₂, ..., w_J. The numerators
u₁/M₁, u₂/M₂, ..., u_J/M_J of w₁, w₂, ..., w_J are statistically independent; however, they have a
common denominator v/N.
For j = 1, 2, ..., J, let

    s_j = u_j / (v + Σ_{j′=1}^J u_{j′}).

Then, 1 − Σ_{j′=1}^J s_{j′} = v / (v + Σ_{j′=1}^J u_{j′}) and, consequently,

    w_j = N s_j / [M_j (1 − Σ_{j′=1}^J s_{j′})].

Moreover, the joint distribution of s₁, s₂, ..., s_J is the Dirichlet distribution Di(M₁/2, M₂/2, ...,
M_J/2, N/2; J). Thus, the joint distribution of w₁, w₂, ..., w_J is expressible in terms of a Dirichlet
distribution.
More specifically, suppose that z1 ; z2 ; : : : ; zJ ; zJ C1 are random column vectors of dimensions
PJ C1 
N1 ; N2 ; : : : ; NJ ; NJ C1 , respectively, the joint distribution of which is j D1 Nj -variate standard
normal N.0; I/. And, for j D 1; 2; : : : ; J; J C 1, denote by zjk the kth element of zj . Further, define
PNj 2
kD1 zjk =Nj
wj D PN .j D 1; 2; : : : ; J /;
J C1 2
kD1 zJC1; k =N J C1
PNJC1 2
and observe that the JC1 sums of squares kD1 z1k ; kD1 z2k ; : : : ; N
PN1 2 PN2 2 P J 2
z , and kD1
kD1 J k
zJC1; k
2 2 2 2
are distributed independently as  .N1 /;  .N2 /; : : : ;  .NJ /, and  .NJC1 /, respectively. Then,
for j D 1; 2; : : : ; J , the (marginal) distribution of wj is SF .Nj ; NJC1 / and is related to a beta
distribution; for j D 1; 2; : : : ; J ,
NJC1 xj
wj D ;
Nj .1 xj /
PNj 2 ı PNj 2 PNJC1 2 Nj NJC1 
where xj D kD1 . The joint distribution of

zjk kD1 zjk C kD1 zJC1; k  Be 2 ; 2
w1 ; w2 ; : : : ; wJ is related to a Dirichlet distribution; clearly,
PNj 2
NJC1 sj kD1 zjk
wj D PJ  ; where sj D PJC1 PN 0 .j D 1; 2; : : : ; J /;
Nj 1 2
j 0 D1 sj 0
j
0
j D1 z
kD1 j k 0

N1 N2 NJ NJC1
and s1 ; s2 ; : : : ; sJ are jointly distributed as Di 2 ; 2 ; : : : ; 2 ; 2 I J .


b. The wider applicability of various results derived under an assumption of normality


Let z D .z1 ; : : : ; zK ; zKC1 /0 represent any (K C1)-dimensional random (column) vector having an
absolutely continuous spherical distribution. And define w0 D KC1 2
kD1 zk and (for k D 1; : : : ; K; KC
P

1) zk zk2
2
yk D P and w k D y k D KC1 2
;
KC1 2 1=2
 P
j D1 zj j D1 zj
PK
in which case wKC1 D 1 kD1 wk . As is evident from the results of Part 1  of Section 6.1g,
w1 ; w2 ; : : : ; wK are statistically independent of w0 and have a Di 12 ; : : : ; 12 ; 12 I K distribution; and
as is evident from the results of Part 2, y1 , : : : ; yK , yKC1 are statistically independent of w0 and
are distributed uniformly on the surface of a (K C1)-dimensional unit ball. There is an implication
that the distribution of w1 ; : : : ; wK ; wKC1 and the distribution of y1 ; : : : ; yK ; yKC1 are the same in
the general case (of an arbitrary absolutely continuous spherical distribution) as in the special case
where z  N.0; IKC1 /. Thus, we have the following theorem.
Theorem 6.3.1. Let z D .z1 ; : : : ; zK ; zKC1 /0 represent any (K C1)-dimensional random (col-
umn) vector having an absolutely continuous spherical distribution. And define w0 D KC1 2
kD1 zk and
P
(for k D 1; : : : ; K; K C1)
zk zk2
yk D P and wk D yk2 D PKC1 :
KC1 2 1=2 zj2
j D1 zj j D1

(1) For “any” function g./ (defined on RKC1 ) such that g.z/ depends on the value of z only through
w1 ; : : : ; wK ; wKC1 or, more generally, only through y1 , : : : ; yK , yKC1 , the random variable g.z/
is statistically independent of w0 and has the same distribution in the general case (of an arbitrary
absolutely continuous spherical distribution) as in the special case where z  N.0; IKC1 /. (2)
For “any” P functions g1 ./; g2 ./; : : : ; gP ./ (defined on RKC1 ) such that (for j D 1; 2; : : : ; P )
gj .z/ depends on the value of z only through w1 ; : : : ; wK ; wKC1 or, more generally, only through
y1 ; : : : ; yK ; yKC1 , the random variables g1 .z/; g2 .z/; : : : ; gP .z/ are statistically independent of w0
and have the same (joint) distribution in the general case (of an arbitrary absolutely continuous
spherical distribution) as in the special case where z  N.0; IKC1 /.
Application to the F distribution. Let z = (z₁, z₂, ..., z_N)′ (where N ≥ 2) represent an N-di-
mensional random (column) vector. In the next-to-last part of Subsection a, it was established that if
z ∼ N(0, I_N), then (for any integer K between 1 and N − 1, inclusive)

    (Σ_{i=1}^K z_i²/K) / (Σ_{i=K+1}^N z_i²/(N − K)) ∼ SF(K, N − K).                       (3.20)

Clearly,

    (Σ_{i=1}^K z_i²/K) / (Σ_{i=K+1}^N z_i²/(N − K)) = (Σ_{i=1}^K w_i/K) / (Σ_{i=K+1}^N w_i/(N − K)),   (3.21)

where (for i = 1, 2, ..., N) w_i = z_i² / Σ_{i′=1}^N z_{i′}². And we conclude [on the basis of Part (1)
of Theorem 6.3.1] that the random variable (3.21) is distributed (independently of Σ_{i=1}^N z_i²) as
SF(K, N − K) provided only that z has an absolutely continuous spherical distribution—that z ∼
N(0, I_N) is sufficient, but it is not necessary.
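The wider applicability just described can be illustrated by simulation. In the Python sketch below
(not part of the original development), z is drawn from a multivariate t distribution with identity
scale matrix, which is an absolutely continuous spherical distribution that is not normal; the ratio
(3.21) formed from these draws is then compared with the SF(K, N − K) distribution. All numerical
choices are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, K, df, draws = 8, 3, 5, 100_000
g = rng.standard_normal((draws, N))
chi = rng.chisquare(df, size=draws)
z = g / np.sqrt(chi / df)[:, None]                  # spherical but not normal draws of z
num = np.sum(z[:, :K] ** 2, axis=1) / K
den = np.sum(z[:, K:] ** 2, axis=1) / (N - K)
print(stats.kstest(num / den, stats.f(K, N - K).cdf).pvalue > 0.01)   # typically True
```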
A “multivariate” version of the application. Let z1 ; z2 ; : : : ; zJ ; zJ C1 represent random column
vectors of dimensions N1 ; N2 ; : : : ; NJ ; NJ C1 , respectively. And, for j D 1; 2; : : : ; J; J C 1, denote
by zjk the kth element of zj . Further, define
PNj 2
z =Nj
kD1 jk
wj D PN .j D 1; 2; : : : ; J /:
J C1 2
kD1
zJC1; k =NJ C1
Then, as observed earlier (in the last part of Subsection a) in connection with the special case where
PJ C1 
the joint distribution of z1 ; z2 ; : : : ; zJ ; zJ C1 is j D1 Nj -variate standard normal N.0; I/ (and as
can be readily verified),
PNj 2
NJC1 sj z
kD1 jk
wj D PJ  ; where sj D PJC1 PN 0 .j D 1; 2; : : : ; J /:
Nj 1 2
j 0 D1 sj 0
j
j 0 D1 z
kD1 j k 0

P C1
Clearly, s1 ; s2 ; : : : ; sJ and hence w1 ; w2 ; : : : ; wJ depend on the values of the jJD1 Nj quanti-
2
ıPJC1 PNj 0 2
ties zjk j 0 D1 k 0 D1 zj 0 k 0 (j D 1; 2; : : : ; JC1; k D 1; 2, : : : ; Nj ). Thus, it follows from Theorem
6.3.1 that if the joint distribution of z1 , z2 , : : : ; zJ , zJ C1 is an absolutely continuous spherical dis-
PNj 2
tribution, then s1 , s2 , : : : ; sJ and w1 ; w2 ; : : : ; wJ are distributed independently of jJC1 z ,
P
D1 kD1 jk
and the joint distribution of s1 ; s2 ; : : : ; sJ and the joint distribution of w1 ; w2 ; : : : ; wJ are the same
as in the special case where the joint distribution of z1 ; z2 ; : : : ; zJ ; zJ C1 is N.0; I/. And in light of
the results obtained earlier (in the last part of Subsection a) for that special case, we are able to infer
that if the joint distribution of z1 , z2 , : : : ; zJ , zJ C1 is an absolutely continuous spherical distribution,
N
then the joint distribution of s1 ; s2 ; : : : ; sJ is Di N21 ; N22 ; : : : ; N2J ; JC1 2 I J . [The same inference

could be made “directly” by using the results of Section 6.1g to establish that if the joint distribution
of z1 , z2 , : : : ; zJ , zJ C1 is an absolutely continuous spherical distribution, then the joint distribution


P C1 ıPJC1 PNj 0 2 1 1 PJ C1
of the jJD1 Nj random variables zjk 2 1 1
k 0 D1 zj 0 k 0 is Di 2 ; 2 ; : : : ; 2 ; 2 I

j 0 D1 j D1 Nj 1
and by then applying Theorem 6.1.5.]

c. Noncentral F distribution

A related distribution: the noncentral beta distribution. Let u and v represent random variables
that are distributed independently as χ²(M, λ) and χ²(N), respectively. Further, define

    w = (u/M)/(v/N)   and   x = u/(u + v).

Then, as in the case of result (3.3),

    w = Nx/[M(1 − x)]   and   x = Mw/(N + Mw) = (M/N)w/[1 + (M/N)w].                      (3.22)

By definition, w ∼ SF(M, N, λ). And the distribution of x is a generalization of the Be(M/2, N/2)
distribution that is referred to as a noncentral beta distribution with parameters M/2, N/2, and λ/2
and that is to be denoted by the symbol Be(M/2, N/2, λ/2).


Let y = u + v. Then, in light of results (2.11) and (2.13), the joint distribution of y and x has
as a pdf the function f(·, ·) obtained by taking (for all y and x)

    f(y, x) = Σ_{r=0}^∞ p(r) g_{M+N+2r}(y) d_{(M+2r)/2, N/2}(x),                          (3.23)

where (for r = 0, 1, 2, ...) p(r) = (λ/2)^r e^{−λ/2}/r!, where (for any strictly positive integer j)
g_j(·) denotes the pdf of a χ²(j) distribution, and where (for any values of the parameters α₁ and
α₂) d_{α₁,α₂}(·) denotes the pdf of a Be(α₁, α₂) distribution. Thus, a pdf, say h(·), of the (marginal)
distribution of x is obtained by taking (for all x)

    h(x) = ∫₀^∞ f(y, x) dy = Σ_{r=0}^∞ p(r) d_{(M+2r)/2, N/2}(x) ∫₀^∞ g_{M+N+2r}(y) dy
         = Σ_{r=0}^∞ p(r) d_{(M+2r)/2, N/2}(x).                                           (3.24)

An extension. The definition of the noncentral beta distribution can be extended. Take x D u=.u C
v/, where u and v are random variables that are distributed independently as Ga.˛1 ; ˇ; ı/ and
Ga.˛2 ; ˇ/, respectively—here, ˛1, ˛2 , and ˇ are arbitrary strictly positive scalars and ı is an arbitrary
nonnegative scalar. Letting y D u Cv and proceeding in essentially the same way as in the derivation
of result (2.11), we find that the joint distribution of y and x has as a pdf the function f . ; / obtained
by taking (for all y and x)
1
X ı re ı 
f .y; x/ D gr .y/ d˛1 Cr; ˛2 .x/; (3.25)
rD0

where (for r D 0; 1; 2; : : : ) gr ./ represents the pdf of a Ga.˛1 C ˛2 C r; ˇ/ distribution and
d˛1 Cr; ˛2 ./ represents the pdf of a Be.˛1Cr; ˛2 / distribution. Accordingly, the pdf of the (marginal)
distribution of x is the function h./ obtained by taking (for all x)
Z 1 1
X ı re ı
h.x/ D f .y; x/ dy D d˛1 Cr; ˛2 .x/: (3.26)
0 rD0

Formulas (3.25) and (3.26) can be regarded as extensions of formulas (3.23) and (3.24); formulas
(3.23) and (3.24) are for the special case where ˛1 D M=2, ˇ D 2, and ˛2 D N=2 (and where ı is
expressed in the form =2).
Take the noncentral beta distribution with parameters ˛1 , ˛2 , and ı (where ˛1 > 0, ˛2 > 0, and
ı  0) to be the distribution of the random variable x or, equivalently, the distribution with pdf h./
given by expression (3.26)—the distribution of x does not depend on the parameter ˇ. And denote
this distribution by the symbol Be.˛1 ; ˛2 ; ı/.
Probability density function (of the noncentral F distribution). Earlier (in Subsection a), the pdf
(3.12) of the (central) F distribution was derived from the pdf of a beta distribution. By taking a
similar approach, the pdf of the noncentral F distribution can be derived from the pdf of a noncentral
beta distribution.
Let x represent a random variable that has a noncentral beta distribution with parameters M/2,
N/2, and λ/2 (where M and N are arbitrary strictly positive integers and λ is an arbitrary nonnegative
scalar). And define

    w = Nx/[M(1 − x)].                                                                    (3.27)

Then, in light of what was established earlier (in Part 1 of the present subsection),

    w ∼ SF(M, N, λ).

Moreover, equality (3.27) defines a one-to-one transformation from the interval 0 < x < 1 onto the
interval 0 < w < ∞; the inverse transformation is that defined by the equality

    x = (M/N)w[1 + (M/N)w]^{−1}.

Thus, a pdf, say f(·), of the SF(M, N, λ) distribution is obtainable from a pdf of the distribution of x.
For r = 0, 1, 2, ..., let p(r) = (λ/2)^r e^{−λ/2}/r!, and, for arbitrary strictly positive scalars α₁
and α₂, denote by d_{α₁, α₂}(·) the pdf of a Be(α₁, α₂) distribution. Then, according to result (3.24), a
pdf, say h(·), of the distribution of x is obtained by taking (for all x)

    h(x) = Σ_{r=0}^∞ p(r) d_{(M+2r)/2, N/2}(x).

And upon observing that

    dx/dw = (M/N)[1 + (M/N)w]^{−2}   and   1 − x = [1 + (M/N)w]^{−1}

and making use of standard results on a change of variable, we find that, for 0 < w < ∞,

    f(w) = h{(M/N)w[1 + (M/N)w]^{−1}} (M/N)[1 + (M/N)w]^{−2}
         = Σ_{r=0}^∞ p(r) {Γ[(M+N+2r)/2] / (Γ[(M+2r)/2] Γ(N/2))} (M/N)^{(M+2r)/2}
                     × w^{[(M+2r)/2]−1} [1 + (M/N)w]^{−(M+N+2r)/2}
         = Σ_{r=0}^∞ p(r) [M/(M+2r)] g_r{[M/(M+2r)]w},                                    (3.28)

where (for r = 0, 1, 2, ...) g_r(·) denotes the pdf of the SF(M+2r, N) distribution—for −∞ <
w ≤ 0, f(w) = 0.
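The Poisson-weighted mixture representation (3.28) can be evaluated term by term and compared
with a packaged noncentral F density; the Python sketch below truncates the series and assumes
that SciPy's ncf noncentrality parameter nc is the λ used here. It is a numerical illustration only.

```python
import numpy as np
from scipy import stats

M, N, lam, w = 4, 11, 3.5, 1.7                          # illustrative values
pdf_328 = sum(
    stats.poisson(lam / 2).pmf(r)                       # p(r)
    * (M / (M + 2 * r))
    * stats.f(M + 2 * r, N).pdf(M * w / (M + 2 * r))    # g_r evaluated at [M/(M+2r)]w
    for r in range(200)                                 # truncated series
)
print(np.isclose(pdf_328, stats.ncf(M, N, lam).pdf(w)))  # expected: True
```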
Moments. Let w represent a random variable that has an SF(M, N, λ) distribution. Then, by def-
inition, w ∼ (N/M)(u/v), where u and v are random variables that are distributed independently
as χ²(M, λ) and χ²(N), respectively. And making use of results (1.27) and (2.31), we find that, for
−M/2 < r < N/2,

    E(w^r) = (N/M)^r E(u^r) E(v^{−r})
           = (N/M)^r {Γ[(N/2)−r] / [2^r Γ(N/2)]} E(u^r)                                   (3.29)
           = (N/M)^r {Γ[(N/2)−r] / Γ(N/2)} e^{−λ/2} Σ_{j=0}^∞ [(λ/2)^j / j!] Γ[(M/2)+j+r] / Γ[(M/2)+j].   (3.30)

For r ≥ N/2, the rth moment of the SF(M, N, λ) distribution, like that of the central F
distribution, does not exist. And (as in the case of the central F distribution) the noncentral F
distribution does not have a moment generating function.
Upon recalling results (2.28) and (2.30) [or (2.33) and (2.34)] and result (1.23) and applying
formula (3.29), we find that the mean of the SF(M, N, λ) distribution is (if N > 2)

    E(w) = [N/(N − 2)](1 + λ/M)                                                           (3.31)

and the second moment and variance are (if N > 4)

    E(w²) = (N/M)² [(M+2)(M+2λ) + λ²] / [(N−2)(N−4)]
          = [N²/((N−2)(N−4))] [(1 + 2/M)(1 + 2λ/M) + (λ/M)²]                              (3.32)

and

    var(w) = E(w²) − [E(w)]² = 2N² {M[1 + (λ/M)]² + (N−2)[1 + 2(λ/M)]} / [M(N−2)²(N−4)].  (3.33)
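As a quick numerical check (again assuming SciPy's nc parameter equals the noncentrality λ used
here), the mean formula (3.31) can be compared with the mean SciPy reports for the noncentral F
distribution; the values below are illustrative.

```python
from scipy import stats

M, N, lam = 5, 14, 2.0
mean_331 = (N / (N - 2)) * (1 + lam / M)
print(abs(stats.ncf(M, N, lam).mean() - mean_331) < 1e-8)   # expected: True
```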

Moreover, results (3.31) and (3.32) can be regarded as special cases of a more general result: in
light of results (1.23) and (2.36), it follows from result (3.29) that (for r = 1, 2, ... < N/2) the rth
moment of the SF(M, N, λ) distribution is

    E(w^r) = [N^r / ((N−2)(N−4)···(N−2r))]
             × { (λ/M)^r + Σ_{j=0}^{r−1} {(M+2j)[M+2(j+1)][M+2(j+2)]···[M+2(r−1)] / M^{r−j}} (r choose j) (λ/M)^j }.   (3.34)

Noninteger degrees of freedom. The definition of the noncentral F distribution can be extended to
noninteger degrees of freedom in much the same way as the definitions of the (central) F distribution
and the central and noncentral chi-square distributions. For arbitrary strictly positive numbers M and
N (and an arbitrary nonnegative number λ), take u and v to be random variables that are distributed
independently as Ga(M/2, 2, λ/2) and Ga(N/2, 2), respectively. Further, define

    w = (u/M)/(v/N).

Let us regard the distribution of the random variable w as a noncentral F distribution with M
(possibly noninteger) numerator degrees of freedom and N (possibly noninteger) denominator de-
grees of freedom (and with noncentrality parameter λ). When M is an integer, the Ga(M/2, 2, λ/2)
distribution is identical to the χ²(M, λ) distribution, and when N is an integer, the Ga(N/2, 2) dis-
tribution is identical to the χ²(N) distribution, so that this usage of the term noncentral F distribution
is consistent with our previous usage.
Let x = u/(u + v). As in the special case where M and N are integers, w and x are related
to each other as follows:

    w = Nx/[M(1 − x)]   and   x = (M/N)w/[1 + (M/N)w].

By definition, x has a Be(M/2, N/2, λ/2) distribution—refer to Part 2 of the present subsection.
Accordingly, the distribution of w is related to the Be(M/2, N/2, λ/2) distribution in the same
way as in the special case where M and N are integers. Further, the distribution of x and hence that
of w would be unaffected if the distributions of the (statistically independent) random variables u
and v were taken to be Ga(M/2, β, λ/2) and Ga(N/2, β), respectively, where β is an arbitrary
strictly positive number (not necessarily equal to 2).

6.4 Central, Noncentral, and Multivariate t Distributions


Let

    t = z / √(v/N),                                                                       (4.1)

where z and v are random variables that are statistically independent with z ∼ N(0, 1) and v ∼
χ²(N). The distribution of the random variable t is known as Student's t distribution or simply as
the t distribution. Like the F distribution, it plays an important role in the use of linear models to
make statistical inferences.
The distribution of the random variable t depends on one parameter; this parameter is the quantity
N, which [as in the case of the χ²(N) distribution] is referred to as the degrees of freedom. Let us
denote the t distribution with N degrees of freedom by the symbol St(N).
As a generalization of the t distribution, we have the noncentral t distribution. The noncentral t
distribution is the distribution of the random variable

    t* = x / √(v/N),

obtained from expression (4.1) for the random variable t upon replacing the random variable z [which
has an N(0, 1) distribution] with a random variable x that has an N(μ, 1) distribution (where μ is an
arbitrary scalar and where x, like z, is statistically independent of v). The distribution of the random
variable t* depends on two parameters: the degrees of freedom N and the scalar μ, which is referred
to as the noncentrality parameter. Let us denote the noncentral t distribution with parameters N and
μ by the symbol St(N, μ).
Clearly, the t distribution St(N) can be regarded as the special case St(N, 0) of the noncentral
t distribution obtained upon setting the noncentrality parameter μ equal to 0. For the sake of clarity
(i.e., to distinguish it from the noncentral t distribution), the "ordinary" t distribution may be referred
to as the central t distribution.
There is a multivariate version of the t distribution. Let us continue to take v to be a random
variable that is distributed as χ²(N), and let us take z to be an M-dimensional random column
vector that is distributed independently of v as N(0, R), where R = {r_ij} is an arbitrary (M × M)
correlation matrix. Then, the distribution of the M-dimensional random column vector

    t = (1/√(v/N)) z

is referred to as the M-variate t distribution or (when the dimension of t is unspecified or is clear
from the context) as the multivariate t distribution. The parameters of this distribution consist of
the degrees of freedom N and the M(M−1)/2 correlations r_ij (i > j = 1, 2, ..., M)—the
diagonal elements of R equal 1 and (because R is symmetric) only M(M−1)/2 of its off-diagonal
elements are distinct. Let us denote the multivariate t distribution with N degrees of freedom and
with correlation matrix R by the symbol MVt(N, R)—the number M of variables is discernible
from the dimensions of R. The ordinary (univariate) t distribution St(N) can be regarded as a special
case of the multivariate t distribution; it is the special case MVt(N, 1) where the correlation matrix
is the 1 × 1 matrix whose only element equals 1.

a. (Central) t distribution

Related distributions. The t distribution is closely related to the F distribution (as is apparent from
the very definitions of the t and F distributions). More specifically, the St(N) distribution is related
to the SF(1, N) distribution.
Let t represent a random variable that is distributed as St(N), and F a random variable that is
distributed as SF(1, N). Then,

    t² ∼ F,                                                                               (4.2)

or, equivalently,

    |t| ∼ √F.                                                                             (4.3)

The St(N) distribution (i.e., the t distribution with N degrees of freedom) is also closely related
to the distribution of the random variable y defined as follows:

    y = z / (v + z²)^{1/2},

where z and v are as defined in the introduction to the present section [i.e., where z and v are
random variables that are distributed independently as N(0, 1) and χ²(N), respectively]. Now, let
t = z/√(v/N), in which case t ∼ St(N) (as is evident from the very definition of the t distribution).
Then, t and y are related as follows:

    t = √N y / (1 − y²)^{1/2}   and   y = t / (N + t²)^{1/2}.                             (4.4)

Probability density function (pdf). Let us continue to take z and v to be random variables that are
distributed independently as N.0; 1/ and 2 .N /, respectively, and to take y D z=.v C z 2 /1=2 and
ıp
t D z v=N . Further, define u D v C z 2.
Let us determine the pdf of the joint distribution of u and y and the pdf of the (marginal)
distribution of y—in light of the relationships (4.4), the pdf of the distribution of t is determinable
from the pdf of the distribution of y. The equalities u D v C z 2 and y D z=.v C z 2 /1=2 define a one-
to-one transformation from the region defined by the inequalities 0 < v < 1 and 1 < z < 1
onto the region defined by the inequalities 0 < u < 1 and 1 < y < 1. The inverse of this
transformation is the transformation defined by the equalities
v D u.1 y2/ and z D u1=2y:
Further,
ˇ@v=@u @v=@y ˇ ˇ 1 y 2
ˇ ˇ ˇ ˇ
2uy ˇˇ
ˇ@z=@u @z=@y ˇ D ˇ.1=2/u 1=2y
ˇ ˇ ˇ D u1=2:
u1=2 ˇ
Thus, denoting by d./ the pdf of the 2 .N / distribution and by b./ the pdf of the N.0; 1/ distribution
and making use of standard results on a change of variables, the joint distribution of u and y has as
a pdf the function q. ; / (of 2 variables) obtained by taking (for 0 < u < 1 and 1 < y < 1)

q.u; y/ D d Œu.1 y 2 / b.u1=2 y/ u1=2


1
D uŒ.N C1/=2 1 e u=2
.1 y 2 /.N=2/ 1
€.N=2/ 2 C1/=2  1=2
.N

1 €Œ.N C1/=2
D uŒ.N C1/=2 1 e u=2
.1 y 2 /.N=2/ 1
(4.5)
€Œ.N C1/=2 2.N C1/=2 €.N=2/  1=2

—for u and y such that 1 < u  0 or 1  jyj < 1, q.u; y/ D 0. The derivation of expression
(4.5) is more or less the same as the derivation (in Section 6.1f) of expression (1.43).
The quantity q.u; y/ is reexpressible (for all u and y) in the form
q.u; y/ D g.u/ h .y/; (4.6)
where g./ is the pdf of the 2 .NC1/ distribution and where h ./ is the function (of a single variable)
defined as follows:

< €Œ.N C1/=2 .1 y 2 /.N=2/ 1; for 1 < y < 1,
h .y/ D €.N=2/  1=2

(4.7)
:̂ 0; elsewhere.

Accordingly, we conclude that h ./ is a pdf; it is the pdf of the distribution of y. Moreover, y is
distributed independently of u.
Now, upon making a change of variable from y to t [based on the relationships (4.4)] and
observing that

    d[t(N + t²)^{−1/2}]/dt = N(N + t²)^{−3/2},

we find that the distribution of t is the distribution with pdf f(·) defined (for all t) by

    f(t) = h*[t(N + t²)^{−1/2}] N(N + t²)^{−3/2}
         = {Γ[(N+1)/2] / [Γ(N/2) π^{1/2}]} N^{−1/2} (1 + t²/N)^{−(N+1)/2}.                (4.8)

And t (like y) is distributed independently of u (i.e., independently of v + z²).
In the special case where N = 1 (i.e., in the special case of a t distribution with 1 degree of
freedom), expression (4.8) simplifies to

    f(t) = (1/π) 1/(1 + t²).                                                              (4.9)

The distribution with pdf (4.9) is known as the (standard) Cauchy distribution. Thus, the t distribution
with 1 degree of freedom is identical to the Cauchy distribution.
As N → ∞, the pdf of the St(N) distribution converges (on all of R¹) to the pdf of the N(0, 1)
distribution—refer, e.g., to Casella and Berger (2002, exercise 5.18). And, accordingly, t converges
in distribution to z.
The pdf of the St(5) distribution is displayed in Figure 6.1 along with the pdf of the St(1)
(Cauchy) distribution and the pdf of the N(0, 1) (standard normal) distribution.
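Expression (4.8) can be evaluated directly and compared with a packaged t density; the Python
sketch below (a numerical illustration, not part of the original development) uses log-gamma
functions for numerical stability.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

def st_pdf(t, N):
    """Pdf of the St(N) distribution, per expression (4.8)."""
    log_const = gammaln((N + 1) / 2) - gammaln(N / 2) - 0.5 * np.log(N * np.pi)
    return np.exp(log_const) * (1 + t ** 2 / N) ** (-(N + 1) / 2)

grid = np.linspace(-4, 4, 9)
print(np.allclose(st_pdf(grid, 5), stats.t(5).pdf(grid)))   # expected: True
```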
Symmetry and (absolute, odd, and even) moments. The absolute moments of the t distribution are
determinable from the results of Section 6.3a on the F distribution. Let t represent a random variable
that is distributed as St(N) and w a random variable that is distributed as SF(1, N). Then, as an
implication of the relationship (4.3), we have (for an arbitrary scalar r) that E(|t|^r) exists if and only
if E(w^{r/2}) exists, in which case

    E(|t|^r) = E(w^{r/2}).                                                                (4.10)
[Figure 6.1 (a plot of the three density curves over the range −4 to 4) appears here.]

FIGURE 6.1. The probability density functions of the N(0, 1) (standard normal), St(5), and St(1) (Cauchy)
distributions.

And upon applying result (3.13), we find that, for −1 < r < N,

    E(|t|^r) = N^{r/2} Γ[(r+1)/2] Γ[(N−r)/2] / [Γ(1/2) Γ(N/2)].                           (4.11)

For any even positive integer r, |t|^r = t^r. Thus, upon applying result (3.14), we find [in light of
result (4.10)] that, for r = 2, 4, 6, ... < N, the rth moment of the St(N) distribution exists and is
expressible as

    E(t^r) = N^{r/2} (r−1)(r−3)···(3)(1) / [(N−2)(N−4)···(N−r)].                          (4.12)

For r ≥ N, the rth moment of the St(N) distribution does not exist (and, as a consequence, the t
distribution, like the F distribution, does not have a moment generating function).
The St(N) distribution is symmetric (about 0), that is,

    −t ∼ t                                                                                (4.13)

(as is evident from the very definition of the t distribution). And upon observing that, for any odd
positive integer r, −t^r = (−t)^r, we find that, for r = 1, 3, 5, ... < N,

    E(t^r) = −E(−t^r) = −E[(−t)^r] = −E(t^r)

and hence that (for r = 1, 3, 5, ... < N)

    E(t^r) = 0.                                                                           (4.14)

Thus, those odd moments of the St(N) distribution that exist (which are those of order less than N)
are all equal to 0. In particular, for N > 1,

    E(t) = 0.                                                                             (4.15)

Note that none of the moments of the St(1) (Cauchy) distribution exist, not even the mean. And
the St(2) distribution has a mean (which equals 0), but does not have a second moment (or any other
moments of order greater than 1) and hence does not have a variance.
For N > 2, we have [upon applying result (4.12)] that

    var(t) = E(t²) = N/(N − 2).                                                           (4.16)
And, for N > 4,

    E(t⁴) = 3N² / [(N−2)(N−4)] = 3 [N/(N−2)]² (N−2)/(N−4),                                (4.17)

which in combination with result (4.16) implies that (for N > 4)

    E(t⁴) / [var(t)]² = 3 (N−2)/(N−4).                                                    (4.18)

[Expression (4.18) is an expression for a quantity that is sometimes referred to as the kurtosis, though
in some presentations it is the difference between this quantity and the number 3 that is referred to
as the kurtosis.]
Standardized version. Let t represent a random variable that is distributed as St(N). And suppose
that N > 2. Then, a standardized version, say s, of the random variable t can be created by taking

    s = t/a,                                                                              (4.19)

where a = √var(t) = √(N/(N−2)).
The distribution of s has mean 0 and variance 1. Accordingly, the mean and variance of the
distribution of s are identical to the mean and variance of the N(0, 1) distribution. And if N > 3,
the third moment E(s³) of the distribution of s equals 0, which is identical to the third moment of
the N(0, 1) distribution. Further, if N > 4, the fourth moment of the distribution of s is

    E(s⁴) = E(t⁴) / [var(t)]² = 3 (N−2)/(N−4)                                             (4.20)

[as is evident from result (4.18)]; by way of comparison, the fourth moment of the N(0, 1) distribution
is 3.
Let f(·) represent the pdf of the distribution of t, that is, the pdf of the St(N) distribution. Then,
making use of expression (4.8) and of standard results on a change of variable, we find that the
distribution of s has as a pdf the function f*(·) obtained by taking (for all s)

    f*(s) = a f(as) = {Γ[(N+1)/2] / [Γ(N/2) π^{1/2}]} (N−2)^{−1/2} [1 + s²/(N−2)]^{−(N+1)/2}.   (4.21)

The pdf f*(·) is displayed in Figure 6.2 for the case where N = 5 and for the case where N = 3;
for purposes of comparison, the pdf of the N(0, 1) (standard normal) distribution is also displayed.
Noninteger degrees of freedom. The definition of the t distribution can be extended to noninteger
degrees of freedom by proceeding along the same lines as in Sections 6.1c and 6.3a in extending
the definitions of the chi-square and F distributions. For any (strictly) positive number N, take the t
distribution with N degrees of freedom to be the distribution of the random variable t = z/√(v/N),
where z and v are random variables that are statistically independent with z ∼ N(0, 1) and v ∼
Ga(N/2, 2). In the special case where N is a (strictly positive) integer, the Ga(N/2, 2) distribution is
identical to the χ²(N) distribution, so that this usage of the term t distribution is consistent with our
previous usage of this term.
Percentage points. Let t represent a random variable that is distributed as St(N). Further, for
0 < α < 1, denote by t̄_α(N) the upper 100α% point of the St(N) distribution, that is, the point c
such that Pr(t > c) = α. And observe (in light of the symmetry and absolute continuity of the t
distribution) that, for any (nonrandom) scalar c,

    Pr(t ≤ −c) = Pr(t < −c) = Pr(−t < −c) = Pr(t > c)                                     (4.22)

and that, for c ≥ 0,

    Pr(|t| > c) = Pr(t > c) + Pr(t < −c) = 2 Pr(t > c).                                   (4.23)
[Figure 6.2 (a plot of the three density curves over the range −4 to 4) appears here.]

FIGURE 6.2. The probability density functions of the N(0, 1) (standard normal) distribution and of the dis-
tributions of the standardized versions of a random variable having an St(5) distribution and a
random variable having an St(3) distribution.

In light of result (4.22), we find that (for 0 < α < 1)

    Pr[t ≤ −t̄_{1−α}(N)] = Pr[t > t̄_{1−α}(N)] = 1 − α = 1 − Pr[t > t̄_α(N)] = Pr[t ≤ t̄_α(N)],

implying that

    t̄_α(N) = −t̄_{1−α}(N).                                                               (4.24)

And in light of result (4.23), we find that

    Pr[|t| > t̄_{α/2}(N)] = 2 Pr[t > t̄_{α/2}(N)] = 2(α/2) = α,                            (4.25)

so that the upper 100α% point of the distribution of |t| equals the upper 100(α/2)% point t̄_{α/2}(N)
of the distribution of t [i.e., of the St(N) distribution]. Moreover, in light of relationship (4.3),

    t̄_{α/2}(N) = √(F̄_α(1, N)),                                                          (4.26)

where F̄_α(1, N) is the upper 100α% point of the SF(1, N) distribution.
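Relations (4.24) and (4.26) between t and F percentage points can be confirmed numerically; the
Python sketch below (with illustrative α and N) is an aside to the development above.

```python
import numpy as np
from scipy import stats

N, alpha = 9, 0.05
t_bar_alpha = stats.t(N).ppf(1 - alpha)                       # upper 100*alpha% point of St(N)
print(np.isclose(t_bar_alpha, -stats.t(N).ppf(alpha)))        # relation (4.24)
t_bar_half = stats.t(N).ppf(1 - alpha / 2)                    # upper 100*(alpha/2)% point
print(np.isclose(t_bar_half, np.sqrt(stats.f(1, N).ppf(1 - alpha))))   # relation (4.26)
```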
The t distribution as the distribution of a function of a random vector having a multivariate
standard normal distribution or a spherical distribution. Let z D .z1 ; z2 , : : : ; zN C1 /0 represent an
(N C1)-dimensional random (column) vector. And let
zN C1
tDq PN 2 :
.1=N / i D1 zi

Suppose that z  N.0; IN C1 /. Then, zN C1 and N 2


i D1 zi are statistically independent random
P
PN 2
variables with zN C1  N.0; 1/ and i D1 zi  2 .N /. Thus,

t  S t.N /: (4.27)

Moreover, it follows from the results of Part 2 (of the present subsection) that t is distributed inde-
P C1 2
pendently of N i D1 zi .
More generally, suppose that z has an absolutely continuous spherical distribution. And [recalling
result (4.4)] observe that
p
Ny z
tD 2 1=2
; where y D P N C1 1=2 :
.1 y / N C1 2
i D1 zi

Then, it follows from Theorem 6.3.1 that [as in the special case where z  N.0; I/]

t  S t.N /; (4.28)
PN C1
and t is distributed independently of i D1 zi2 .

b. Noncentral t distribution

Related distributions. Let x and v represent random variables that are distributed independently
as N(μ, 1) and χ²(N), respectively. And observe that (by definition) x/√(v/N) ∼ St(N, μ). Ob-
serve also that x² is distributed independently of v as χ²(1, μ²) and hence that x²/(v/N) ∼
SF(1, N, μ²). Thus, if t is a random variable that is distributed as St(N, μ), then

    t² ∼ SF(1, N, μ²)                                                                     (4.29)

or, equivalently,

    |t| ∼ √F,                                                                             (4.30)

where F is a random variable that is distributed as SF(1, N, μ²).
Now, let t = x/√(v/N), in which case t ∼ St(N, μ), and define

    y = x / √(v + x²).

Then, as in the case of result (4.4), t and y are related as follows:

    t = √N y / (1 − y²)^{1/2}   and   y = t / (N + t²)^{1/2}.                             (4.31)

Probability density function (pdf). Let us derive an expression for the pdf of the noncentral t
distribution. Let us do so by following an approach analogous to the one taken in Subsection a in
deriving expression (4.8) for the pdf of the central t distribution.
Take x and v to be random variables that are distributed independently as N.; 1/ and 2 .N /,
ıp
respectively. Further, take t D x v=N , in which case t  S t.N; /, and define u D v C x 2 and
y D x=.v Cx 2 /1=2, in which case t and y are related by equalities (4.31). Then, denoting by d./
the pdf of the 2 .N / distribution and by b./ the pdf of the N.; 1/ distribution and proceeding in
the same way as in arriving at expression (4.5), we find that the joint distribution of u and y has as
a pdf the function q. ; / obtained by taking (for 0 < u < 1 and 1 < y < 1)

q.u; y/ D d Œu.1 y 2 / b.u1=2 y/ u1=2


1 u=2 u1=2 y 2 =2
D .1 y 2 /.N=2/ 1 .N 1/=2
u e e e (4.32)
€.N=2/ 2.N C1/=2  1=2
—for u and y such that 1 < u  0 or 1  jyj < 1, q.u; y/ D 0. And upon replacing the
1=2
quantity e u y in expression (4.32) with its power-series representation
1
1=2 y
X .u1=2 y/r
eu D ;
rD0

we find that (for 0 < u < 1 and 1 < y < 1)


1
€Œ.N C1/=2 2 .N=2/ 1 2 =2
X €Œ.N Cr C1/=2 p r
q.u; y/ D .1 y / e 2 y gr .u/; (4.33)
€.N=2/  1=2 rD0
€Œ.N C1/=2 rŠ

where (for r D 0; 1; 2; : : : ) gr ./ is the pdf of the 2 .N Cr C1/ distribution.


As a consequence of result (4.33), we have that (for 1 < y < 1)
Z 1 1
€Œ.N C1/=2 2 .N=2/ 1 2 =2
X €Œ.N Cr C1/=2 p r
q.u; y/ du D 1=2
.1 y / e 2 y :
0 €.N=2/  rD0
€Œ.N C1/=2 rŠ
Thus, the (marginal) distribution of y has as a pdf the function h ./ defined as follows:

€Œ.N C1/=2

2
ˆ 1=2
.1 y 2 /.N=2/ 1 e  =2
ˆ
ˆ €.N=2/  1
€Œ.N Cr C1/=2 p
<
h .y/ D
X r
 2 y ; for 1 < y < 1,
ˆ
ˆ
ˆ rD0
€Œ.N C1/=2 rŠ
0; elsewhere.

Finally, making a change of variable from y to t [based on the relationship (4.31)] and proceeding
in the same way as in arriving at expression (4.8), we find that the distribution of t has as a pdf the
function f ./ obtained by taking (for all t)

f .t/ D h Πt.N C t 2 / 1=2


 N.N C t 2 / 3=2

1 p
€Œ.N C1/=2 1=2 2 =2
X €Œ.N Cr C1/=2  2 t r  t2  .N CrC1/=2
D N e p 1C : (4.34)
€.N=2/  1=2 rD0
€Œ.N C1/=2 rŠ N N

In the special case where  D 0, expression (4.34) simplifies to expression (4.8) for the pdf of the
central t distribution S t.N /.
Moments: relationship to the moments of the N(μ, 1) distribution. Let t represent a random variable
that has an St(N, μ) distribution. Then, by definition, t ∼ x/√(v/N), where x and v are random
variables that are distributed independently as N(μ, 1) and χ²(N), respectively. And, for r =
1, 2, ... < N, the rth moment of the St(N, μ) distribution exists and is expressible as

    E(t^r) = N^{r/2} E(v^{−r/2}) E(x^r)                                                   (4.35)

or [in light of results (1.27) and (1.31)] as

    E(t^r) = (N/2)^{r/2} {Γ[(N−r)/2] / Γ(N/2)} E(x^r)                                     (4.36)

or, in the special case where r is an even number,

    E(t^r) = {N^{r/2} / [(N−2)(N−4)···(N−r)]} E(x^r).                                     (4.37)

Like the St(N) distribution, the St(N, μ) distribution does not have moments of order N or greater
and, accordingly, does not have a moment generating function.
In light of results (4.36) and (4.37), we find that the mean of the St(N, μ) distribution is (if
N > 1)

    E(t) = √(N/2) μ Γ[(N−1)/2] / Γ(N/2)                                                   (4.38)

and the second moment is (if N > 2)

    E(t²) = [N/(N − 2)](1 + μ²).                                                          (4.39)
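The mean formula (4.38) can be compared with the mean of a packaged noncentral t distribution;
the Python sketch below assumes that SciPy's nct noncentrality parameter nc is the μ used here,
and uses illustrative values of N and μ.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

N, mu = 8, 1.5
mean_438 = np.sqrt(N / 2) * mu * np.exp(gammaln((N - 1) / 2) - gammaln(N / 2))
print(np.isclose(stats.nct(N, mu).mean(), mean_438))   # expected: True
```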

Noninteger degrees of freedom. The definition of the noncentral t distribution can be extended
to noninteger degrees of freedom in essentially the same way as the definition of the central t
distribution. For any (strictly) positive scalar N (and any scalar μ), take the noncentral t distribution
with N degrees of freedom and noncentrality parameter μ to be the distribution of the random
variable t = x/√(v/N), where x and v are random variables that are statistically independent with
x ∼ N(μ, 1) and v ∼ Ga(N/2, 2). In the special case where N is a (strictly positive) integer, the
Ga(N/2, 2) distribution is identical to the χ²(N) distribution, so that this usage of the term noncentral
t distribution is consistent with our previous usage of this term.
Some relationships. Let t represent a random variable that has an St(N, μ) distribution and t* a
random variable that has an St(N, −μ) distribution. Then, clearly,

    t* ∼ −t.                                                                              (4.40)

And for any (nonrandom) scalar c, we have [as a generalization of result (4.22)] that

    Pr(t* ≤ −c) = Pr(t* < −c) = Pr(−t < −c) = Pr(t > c),                                  (4.41)

implying in particular that

    Pr(t* > −c) = 1 − Pr(t > c)                                                           (4.42)

and that

    Pr(t* ≤ −c) = 1 − Pr(t ≤ c).                                                          (4.43)

In light of relationships (4.42) and (4.43), it suffices in evaluating Pr(t > c) or Pr(t ≤ c) to restrict
attention to nonnegative values of μ or, alternatively, to nonnegative values of c. Further, for c ≥ 0,
we have [as a generalization of result (4.23)] that

    Pr(|t| > c) = Pr(t > c) + Pr(t < −c) = Pr(t > c) + Pr(−t > c)
                = Pr(t > c) + Pr(t* > c).                                                 (4.44)

c. A result on determinants of matrices of the form R + STU

As a preliminary to deriving (in Subsection d) some results on the multivariate t distribution, it is
convenient to introduce the following result on determinants.
Theorem 6.4.1. Let R represent an N × N matrix, S an N × M matrix, T an M × M matrix,
and U an M × N matrix. If R and T are nonsingular, then

    |R + STU| = |R| |T| |T⁻¹ + UR⁻¹S|.                                                    (4.45)

Proof. Suppose that R and T are nonsingular. Then, making use of Theorem 2.14.22, we find
that

    | R   −S  |
    | U   T⁻¹ | = |T⁻¹| |R − (−S)(T⁻¹)⁻¹U| = |T⁻¹| |R + STU|

and also that

    | R   −S  |
    | U   T⁻¹ | = |R| |T⁻¹ − UR⁻¹(−S)| = |R| |T⁻¹ + UR⁻¹S|.

Thus,

    |T⁻¹| |R + STU| = |R| |T⁻¹ + UR⁻¹S|

or, equivalently (since |T⁻¹| = 1/|T|),

    |R + STU| = |R| |T| |T⁻¹ + UR⁻¹S|.
Q.E.D.

In the special case where R = I_N and T = I_M, Theorem 6.4.1 simplifies to the following result.
Corollary 6.4.2. For any N × M matrix S and any M × N matrix U,

    |I_N + SU| = |I_M + US|.                                                              (4.46)

In the special case where M = 1, Corollary 6.4.2 can be restated as the following corollary.
Corollary 6.4.3. For any N-dimensional column vectors s = {s_i} and u = {u_i},

    |I_N + su′| = 1 + u′s = 1 + s′u = 1 + Σ_{i=1}^N s_i u_i.                              (4.47)
d. Multivariate t distribution

Related distributions. Let t = (t₁, t₂, ..., t_M)′ represent a random (column) vector that has an
MVt(N, R) distribution, that is, an M-variate t distribution with N degrees of freedom and corre-
lation matrix R. If t* is any subvector of t, say the M*-dimensional subvector (t_{i₁}, t_{i₂}, ..., t_{i_{M*}})′
consisting of the i₁, i₂, ..., i_{M*}th elements, then, clearly,

    t* ∼ MVt(N, R*),                                                                      (4.48)

where R* is the M* × M* submatrix of R formed by striking out all of the rows and columns of R
save its i₁, i₂, ..., i_{M*}th rows and columns. And, for i = 1, 2, ..., M,

    t_i ∼ St(N),                                                                          (4.49)

that is, the (marginal) distribution of each of the elements of t is a (univariate) t distribution with the
same number of degrees of freedom (N) as the distribution of t.
In the special case where R = I,

    M⁻¹ t′t ∼ SF(M, N).                                                                   (4.50)
More generally, if t is partitioned into some number of subvectors, say K subvectors t1 ; t2 ; : : : ; tK
of dimensions M1 ; M2 ; : : : ; MK , respectively, then, letting u1 ; u2 ; : : : ; uK , and v represent ran-
dom variables that are distributed independently as 2 .M1 /; 2 .M2 /; : : : ; 2 .MK /, and 2 .N /,
respectively, we find that, in the special case where R D I, the joint distribution of the K quan-
tities M1 1 t10 t1 , M2 1 t20 t2 , : : : ; MK 1 tK
0
tK is identical to the joint distribution of the K ratios
.u1 =M1 /=.v=N /, .u2 =M2 /=.v=N /, : : : ; .uK =MK /=.v=N / [the marginal distributions of which
are SF .M1 ; N /, SF .M2 ; N /, : : : ; SF .MK ; N /, respectively, and which have a common denomi-
nator v=N ].
The multivariate t distribution is related asymptotically to the multivariate normal distribution.
It can be shown that as N ! 1, the MV t.N; R/ distribution converges to the N.0; R/ distribution.
Now, take z to be an M -dimensional random column vector and v a random variable that are
statistically independent with z  N.0; R/ and v  2 .N /. And take t D .v=N / 1=2 z, in which
case t  MV t.N; R/, and define y D .v C z0 R 1 z/ 1=2 z. Then, t and y are related as follows:

t D ŒN=.1 y 0R 1
y/1=2 y and y D .N C t 0 R 1
t/ 1=2
t: (4.51)
Probability density function (pdf). Let us continue to take z to be an M -dimensional random column
vector and v a random variable that are statistically independent with z  N.0; R/ and v  2 .N /.
And let us continue to take t D .v=N / 1=2 z and to define y D .v C z0 R 1 z/ 1=2 z. Further, define
u D v C z0 R 1 z.
Consider the joint distribution of the random variable u and the random vector y. The equalities
u D v C z0 R 1 z and y D .v C z0 R 1 z/ 1=2 z define a one-to-one transformation from the region
fv; z W 0 < v < 1; z 2 RM g onto the region fu; y W 0 < u < 1; y 0 R 1 y < 1g. The inverse of
this transformation is the transformation defined by the equalities

v D u.1 y 0R 1
y/ and z D u1=2 y:

Further, letting J represent the .M   .M C1/ matrix whose ij th element is thepartial


 C1/  derivative
v u
of the i th element of the vector with respect to the j th element of the vector and making
z y
use of results (5.4.10), (2.14.29), (2.14.11), and (2.14.9), we find that
ˇ ˇ ˇ ˇ
ˇ@v=@u @v=@y 0 ˇ ˇ 1 y 0 R 1 y 2uy 0 R 1 ˇˇ
jJ j D ˇ ˇDˇ ˇ D uM=2:
ˇ ˇ ˇ
ˇ@z=@u @z=@y 0 ˇ ˇ.1=2/u 1=2 y u1=2 I ˇ

Thus, denoting by d./ the pdf of the 2.N / distribution and by b./ the pdf of the N.0; R/ distribution
and making use of standard results on a change of variables, the joint distribution of u and y has as
a pdf the function q. ; / obtained by taking (for u and y such that 0 < u < 1 and y 0 R 1 y < 1)

q.u; y/ D d Œu.1 y 0 R 1
y/ b.u1=2 y/ uM=2
1
D uŒ.N CM /=2 1 e u=2 .1 y 0 R 1 y/.N=2/ 1
€.N=2/ 2.N CM /=2  M=2 jRj1=2
1
D uŒ.N CM /=2 1 e u=2
€Œ.N CM /=2 2.N CM /=2
€Œ.N CM /=2
 .1 y 0 R 1 y/.N=2/ 1
(4.52)
€.N=2/  M=2 jRj1=2

—for u and y such that 1 < u  0 or y 0 R 1 y  1, q.u; y/ D 0. The derivation of expression


(4.52) parallels the derivation (in Part 2 of Subsection a) of expression (4.5) and is very similar to
the derivation (in Section 6.1f) of expression (1.43).
The quantity q.u; y/ is reexpressible (for all u and y) in the form
q.u; y/ D g.u/ h .y/; (4.53)
2 
where g./ is the pdf of the  .N CM / distribution and where h ./ is the function (of an M  1
vector) defined as follows: for all y,

< €Œ.N CM /=2 .1 y 0 R 1 y/.N=2/ 1; if y 0 R 1 y < 1,
h .y/ D €.N=2/  M=2 jRj1=2

(4.54)
:̂ 0; otherwise.

Accordingly, we conclude that h ./ is a pdf; it is the pdf of the distribution of y. Moreover, y is
distributed independently of u.
Now, for j D 1; 2; : : : ; M , let yj represent the j th element of y, tj the j th element of t, and ej
the j th column of IM , and observe [in light of relationship (4.51) and result (5.4.10)] that

@yj @.N C t 0 R 1 t/ 1=2


tj
D
@t @t
@tj @.N C t 0 R 1 t/ 1=2
D .N C t 0 R 1
t/ 1=2
C tj
@t @t
@t 0 R 1 t
D .N C t 0 R 1
t/ 1=2
ej C tj . 1=2/.N C t 0 R 1
t/ 3=2
@t
D .N C t 0 R 1
t/ 1=2
ej .N C t 0 R 1
t/ 3=2
tj R 1
t:
Then,
@y 0  @y @y
1 2 @yM 
D ; ;:::; D .N C t 0 R 1
t/ 1=2
I .N C t 0 R 1
t/ 3=2
R 1
tt 0
@t @t @t @t
D .N C t 0 R 1
t/ 1=2
ŒI .N C t 0 R 1
t/ 1
R 1
tt 0 ;

implying [in light of Lemma 2.14.3 and Corollaries 2.14.6 and 6.4.3] that
ˇ ˇ ˇ 0ˇ
ˇ @y ˇ ˇ @y ˇ 0 1
ˇ @t 0 ˇ ˇ @t ˇ D .N C t R t/
ˇ ˇDˇ ˇ M=2
Œ1 .N C t 0 R 1 t/ 1 t 0 R 1 t D N .N C t 0 R 1
t/ .M=2/ 1
:

Thus, upon making a change of variables from the elements of y to the elements of t, we find that
the distribution of t (which is the M -variate t distribution with N degrees of freedom) has as a pdf
the function f ./ obtained by taking (for all t)

f .t/ D h Œ.N C t 0 R 1 t/ 1=2 t N.N C t 0 R 1 t/ .M=2/ 1


€Œ.N CM /=2
D N N=2 .N C t 0 R 1 t/ .N CM /=2
€.N=2/  M=2 jRj1=2
€Œ.N CM /=2 M=2
 t 0 R 1 t  .N CM /=2
D N 1 C : (4.55)
€.N=2/  M=2 jRj1=2 N
And t (like y) is distributed independently of u (i.e., independently of v C z0 R 1 z).
In the special case where M D 1, expression (4.55) simplifies to expression (4.8) for the pdf of
the (univariate) t distribution with N degrees of freedom.
Moments. Let t D .t1 ; t2 ; : : : ; tM /0 represent an M -dimensional random (column) vector that has
an MV t.N; R/ distribution. Further, for i; j D 1; 2; : : : ; M , denote by rij the ij th element of the
correlation matrix R. And denote by k an arbitrary (strictly) positive integer, and by k1 ; k2 ; : : : ; kM
any nonnegative integers such that k D M i D1 ki .
P

By definition, t  .v=N / 1=2 z, where z D .z1 , z2 , : : : ; zM /0 is an M -dimensional random


(column) vector and v a random variable that are statistically independent with z  N.0; R/ and
v  2 .N /. For k < N , E.v k=2 / exists, and the kth-order moment E t1k1 t2k2    tM kM 
of the
MV t.N; R/ distribution is expressible as follows:
E t1k1 t2k2    tM
kM 
D E .v=N / k=2 z1k1 z2k2    zM
kM 


D N k=2 E v k=2 E z1k1 z2k2    zMkM 


(4.56)

:
Moreover, z  z, so that
E z1k1 z2k2    zM
kM 
D E . z1/k1 . z2/k2    . zM /kM D . 1/k E z1k1 z2k2    zM
kM 
 
:

Thus, if k is an odd number, then E z1k1 z2k2    zM kM 


D 0 and hence (if k is an odd number smaller
than N )
E t1k1 t2k2    tM
kM 
D 0: (4.57)
Alternatively, if k is an even number (smaller than N ), then [in light of result (1.31)]
N k=2
E t1k1 t2k2    tM
kM 
D E z1k1 z2k2    zM
kM 
: (4.58)
.N 2/.N 4/    .N k/
We conclude that (for k < N) each kth-order moment of the MVt(N, R) distribution is either 0
or is obtainable from the corresponding kth-order moment of the N(0, R) distribution (depending
on whether k is odd or even). In particular, for i = 1, 2, ..., M, we find that (if N > 1)

    E(t_i) = 0                                                                            (4.59)

and that (if N > 2)

    var(t_i) = E(t_i²) = [N/(N − 2)] r_ii = N/(N − 2),                                    (4.60)

in agreement with results (4.15) and (4.16). And, for j ≠ i = 1, 2, ..., M, we find that (if N > 2)

    cov(t_i, t_j) = E(t_i t_j) = [N/(N − 2)] r_ij                                         (4.61)

and that (if N > 2)

    corr(t_i, t_j) = r_ij [= corr(z_i, z_j)].                                             (4.62)

In matrix notation, we have that (if N > 1)

    E(t) = 0                                                                              (4.63)

and that (if N > 2)

    var(t) = [N/(N − 2)] R.                                                               (4.64)
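The construction t = (v/N)^{−1/2} z, together with the variance result (4.64), can be illustrated by
simulation; the Python sketch below uses an illustrative N and correlation matrix R and compares
the sample covariance matrix of the simulated t vectors with [N/(N − 2)]R.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10
R = np.array([[1.0, 0.4],
              [0.4, 1.0]])
draws = 200_000
z = rng.multivariate_normal(np.zeros(2), R, size=draws)   # z ~ N(0, R)
v = rng.chisquare(N, size=draws)                           # v ~ chi^2(N), independent of z
t = z / np.sqrt(v / N)[:, None]                            # t ~ MVt(N, R)
print(np.allclose(np.cov(t, rowvar=False), (N / (N - 2)) * R, atol=0.02))  # typically True
```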

Noninteger degrees of freedom. The definition of the multivariate t distribution can be extended
to noninteger degrees of freedom in essentially the same way as the definition of the (univariate) t
distribution. For any (strictly) positive number N , take the M -variate t distribution with degrees of
freedom N and correlation matrix R to be the distribution of the M -variate random (column) vector
t D .v=N / 1=2 z, where z is an M -dimensional random column vector  and v a random variable that
N
are statistically independent with z  N.0; R/ and v  Ga 2
; 2 . In the special case where N is
N
a (strictly positive) integer, the Ga 2 ; 2 distribution is identical to the 2 .N / distribution, so that


this usage of the term M -variate t distribution is consistent with our previous usage of this term.
Sphericity and ellipticity. The MV t.N; IM / distribution is spherical. To see this, take z to be an
M -dimensional random column vector and v a random variable that are statistically independent
with z  N.0; IM / and v  2 .N /, and take O to be any M  M orthogonal matrix of constants.
Further, let t D .v=N / 1=2 z, in which case t  MV t.N; IM /, and observe that the M  1 vector
Oz, like z itself, is distributed independently of v as N.0; IM /. Thus,
1=2 1=2
Ot D .v=N / .Oz/  .v=N / z D t:

And we conclude that the distribution of t [the MV t.N; IM / distribution] is spherical.


That the MV t.N; IM / distribution is spherical can also be inferred from the form of its pdf. As
is evident from result (4.55), the MV t.N; IM / distribution has a pdf f ./ that is expressible (for all
t) in the form
f .t/ D g.t 0 t/; (4.65)
where g./ is the following (nonnegative) function of a single nonnegative variable, say w:
€Œ.N CM /=2 M=2
 w .N CM /=2
g.w/ D N 1C :
€.N=2/  M=2 N
Now, let R represent an arbitrary M  M correlation matrix. Then, in light of Corollary 2.13.24,
there exists an M  M matrix S such that R D S0 S. And since the distribution of t is spherical, the
distribution of S0 t is (by definition) elliptical; S0 t is distributed elliptically about 0. Moreover,
S0 t D .v=N / 1=2
.S0 z/;
and S0 z is distributed independently of v as N.0; R/, so that S0 t  MV t.N; R/. Thus, the
MV t.N; R/ distribution is elliptical.
The MV t.N; IM / distribution as the distribution of a vector-valued function of a random vec-
tor having a standard normal distribution or a spherical distribution. Let z D .z1 ; z2 ; : : : ; zN ,
zN C1 ; zN C2 ; : : : ; zN CM /0 represent an (N CM )-dimensional random (column) vector. Further, let
t D Œ.1=N / N 2 1=2
P
i D1 zi  z ;
where z D .zN C1 ; zN C2 ; : : : ; zN CM /0.
Suppose that z  N.0; IN CM /. Then, z and N 2
i D1 zi are distributed independently as
P
2
N.0; IM / and  .N /, respectively. Thus,
t  MV t.N; IM /: (4.66)
Moreover, it follows from the results of Part 2 (of the present subsection) that t is distributed
P CM 2
independently of N i D1 zi .
More generally, suppose that z has an absolutely continuous spherical distribution. And [recalling
result (4.51)] observe that
PN CM 2  1=2
t D ŒN=.1 y 0 y/1=2 y; where y D i D1 zi z :

Then, it follows from Theorem 6.3.1 that [as in the special case where z  N.0; I/]

t  MV t.N; IM /; (4.67)
PN CM 2
and t is statistically independent of i D1 zi .

6.5 Moment Generating Function of the Distribution of One or More
Quadratic Forms or Second-Degree Polynomials (in a Normally
Distributed Random Vector)

a. Some preliminary results

As a preliminary to deriving the moment generating function of the distribution of one or more quad-
ratic forms or second-degree polynomials (in a normally distributed random vector), it is convenient
to introduce some basic results on the positive definiteness of linear combinations of matrices. The
linear combinations of immediate interest are those of the form I_M − tA, where A is an M × M
symmetric matrix and t is a scalar, or, more generally, those of the form I_M − Σ_{i=1}^K t_i A_i, where A₁,
A₂, ..., A_K are M × M symmetric matrices and t₁, t₂, ..., t_K are scalars. For what values of t is
I_M − tA positive definite? Or, more generally, for what values of t₁, t₂, ..., t_K is I_M − Σ_{i=1}^K t_i A_i
positive definite?
Existence of a neighborhood within which a linear combination is positive definite.
Lemma 6.5.1. Corresponding to any M × M symmetric matrix A, there exists a (strictly)
positive scalar c such that I_M − tA is positive definite for every scalar t in the interval −c < t < c.
Lemma 6.5.1 is essentially a special case of the following lemma.
Lemma 6.5.2. Let A₁, A₂, ..., A_K represent M × M symmetric matrices, and let t =
(t₁, t₂, ..., t_K)′ represent a K-dimensional (column) vector. Then, there exists a neighborhood N of
the K × 1 null vector 0 such that I_M − Σ_{i=1}^K t_i A_i is positive definite for t ∈ N.
Proof (of Lemma 6.5.2). For p D 1; 2; : : : ; M, let A.p/ .p/ .p/


1 ; A2 ; : : : ; AK represent the leading
principal submatrices of order p of A1 ; A2 ; : : : ; AK , respectively. Then, when regarded as a function
PK .p/ K
of the vector t, jIp i D1 ti Ai j is continuous at 0 (and at every other point in R ), as is evident
from the very definition of a determinant. Moreover,
K K
ti A.p/ 0 A.p/
ˇ X ˇ ˇ X ˇ
lim ˇIp i
ˇ D ˇI
p i
ˇ D jI j D 1:
p
t!0
i D1 i D1
PK .p/
Thus, there exists a neighborhood, say Np , of 0 such that jIp i D1 ti Ai j > 0 for t 2 Np .
Now, take N to be the smallest of the M neighborhoods N1 ; N2 ; : : : ; NM . Then, for t 2 N ,
PK .p/ PK .p/
jIp i D1 ti Ai j > 0 (p D 1; 2; : : : ; M ). And upon observing that the matrices Ip i D1 ti Ai
PK
(p D 1; 2; : : : ; M ) are the leading principal submatrices of the matrix IM i D1 ti Ai , it follows
PK
from Theorem 2.14.23 that IM t A
i D1 i i is positive definite for t 2 N . Q.E.D.
A more specific result. Let A represent an M  M symmetric matrix, and let t represent an arbitrary
scalar. And take S to be the subset of R defined as follows:
S D ft W IM tA is positive definiteg:
According to Lemma 6.5.1, there exists a (strictly) positive scalar c such that the interval . c; c/ is
contained in the set S . Let us investigate the nature of the set S more thoroughly.
Let x represent an M -dimensional column vector of variables. Further, take q./ to be the function
defined (on RM ) as follows: q.x/ D x0 Ax. Then, q.x/ attains a maximum value and a minimum
value over the set fx W x0 x D 1g, as is evident upon observing that the function f ./ is continuous
and that the set fx W x0 x D 1g is closed and bounded—“recall” that any continuous function attains
a maximum value and a minimum value over any closed and bounded set (e.g., Bartle 1976, secs.
11 and 22; Bartle and Sherbert 2011).
Accordingly, define
d0 D min x0 Ax and d1 D max x0 Ax:
x W x0 xD1 x W x0 xD1

[The scalars d0 and d1 are eigenvalues of the matrix A; in fact, they are respectively the smallest and
largest eigenvalues of A, as can be ascertained from results to be presented subsequently (in Section
6.7a).] And observe that

I tA positive definite , x0 .I tA/x > 0 for x such that x ¤ 0


, x0 .I tA/x > 0 for x such that x0 x D 1
, t x0Ax < 1 for x such that x0 x D 1
1=d0 < t < 1=d1; if d0 < 0 and d1 > 0,

ˆ
if d0 < 0 and d1  0,
ˆ
<1=d < t;
0
,
ˆ
ˆ t < 1=d 1 ; if d0  0 and d1 > 0,
0 t < 1; if d0 D d1 D 0.

Thus,
if d0 < 0 and d1 > 0,

ˆ .1=d0 ; 1=d1 /;
if d0 < 0 and d1  0,
ˆ
<.1=d ; 1/;
0
SD (5.1)
ˆ
ˆ . 1; 1=d1 /; if d0  0 and d1 > 0,
. 1; 1/; if d0 D d1 D 0.

Extended applicability of conditions. Let A represent an M  M symmetric matrix, t an arbitrary


scalar, and V an M  M symmetric positive definite matrix. Consider the extension of the conditions
under which a matrix of the form IM tA is positive definite to matrices of the more general form
V tA.
According to Corollary 2.13.29, there exists an M M nonsingular matrix Q such that V D Q0 Q.
And upon observing that
V tA D Q0 ŒI t.Q 1/0AQ 1
Q and I t.Q 1/0AQ 1
D .Q 1/0 .V tA/Q 1;
it becomes clear (in light of Corollary 2.13.11) that V tA is positive definite if and only if the
matrix IM t.Q 1/0AQ 1 is positive definite. Thus, the applicability of the conditions under which
a matrix of the form IM tA is positive definite can be readily extended to a matrix of the more
general form V tA; it is a simple matter of applying those conditions with .Q 1/0AQ 1 in place
PK
of A. More generally, conditions under which a matrix of the form IM i D1 ti Ai (where A1 , A2 ,
: : : ; AK are M  M symmetric matrices and t1 ; t2 ; : : : ; tK arbitrary scalars) is positive definite can
PK
be translated into conditions under which a matrix of the form V i D1 ti Ai is positive definite
by replacing A1 ; A2 ; : : : ; AK with .Q / A1 Q , .Q / A2 Q , : : : ; .Q 1/0AK Q 1, respectively.
1 0 1 1 0 1
PK PK
Note that conditions under which I tA or I i D1 ti Ai (or V tA or V ti Ai ) is
PKi D1
positive definite can be easily translated into conditions under which I C tA or I C i D1 ti Ai (or
V C tA or V C K i D1 ti Ai ) is positive definite.
P

b. Main results
Let us derive the moment generating function of the distribution of a quadratic form $x'Ax$, or more generally of the distribution of a second-degree polynomial $c + b'x + x'Ax$, in a random column vector $x$, where $x \sim N(\mu, \Sigma)$. And let us derive the moment generating function of the joint distribution of two or more quadratic forms or second-degree polynomials. Let us do so by establishing and exploiting the following theorem.
Theorem 6.5.3. Let $z$ represent an $M$-dimensional random column vector that has an $M$-variate standard normal distribution $N(0, I_M)$. Then, for any constant $c$ and any $M$-dimensional column vector $b$ (of constants) and for any $M \times M$ symmetric matrix $A$ (of constants) such that $I - 2A$ is positive definite,
$$E\bigl(e^{c + b'z + z'Az}\bigr) = |I - 2A|^{-1/2}\, e^{c + (1/2)\, b'(I - 2A)^{-1} b}. \tag{5.2}$$
Proof. Let $f(\cdot)$ represent the pdf of the $N(0, I_M)$ distribution and $g(\cdot)$ the pdf of the $N[\,(I - 2A)^{-1}b,\; (I - 2A)^{-1}\,]$ distribution. Then, for all $z$,
$$e^{c + b'z + z'Az} f(z) = |I - 2A|^{-1/2}\, e^{c + (1/2)\, b'(I - 2A)^{-1} b}\, g(z),$$
as can be readily verified. And it follows that
$$E\bigl(e^{c + b'z + z'Az}\bigr) = \int_{\mathbb{R}^M} e^{c + b'z + z'Az} f(z)\, dz = \int_{\mathbb{R}^M} |I - 2A|^{-1/2}\, e^{c + (1/2)\, b'(I - 2A)^{-1} b}\, g(z)\, dz = |I - 2A|^{-1/2}\, e^{c + (1/2)\, b'(I - 2A)^{-1} b}.$$
Q.E.D.
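Formula (5.2) lends itself to a simple Monte Carlo check. The sketch below (Python with NumPy; the particular $c$, $b$, and $A$ are arbitrary illustrative choices, with $A$ scaled so that $I - 2A$ is positive definite) compares the closed-form value with a sample average of $e^{c + b'z + z'Az}$ over draws $z \sim N(0, I_M)$.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 3
c = 0.4
b = rng.standard_normal(M)
B = rng.standard_normal((M, M))
A = 0.05 * (B + B.T)                       # small symmetric A, so I - 2A is positive definite
assert np.linalg.eigvalsh(np.eye(M) - 2 * A).min() > 0

# closed form (5.2)
W = np.eye(M) - 2 * A
closed = np.linalg.det(W) ** (-0.5) * np.exp(c + 0.5 * b @ np.linalg.solve(W, b))

# Monte Carlo estimate of E[exp(c + b'z + z'Az)] with z ~ N(0, I_M)
z = rng.standard_normal((2_000_000, M))
mc = np.mean(np.exp(c + z @ b + np.einsum("ij,jk,ik->i", z, A, z)))
print(closed, mc)   # the two values should agree approximately
```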
Moment generating function of the distribution of a single quadratic form or second-degree polynomial. Let $x$ represent an $M$-dimensional random column vector that has an $N(\mu, \Sigma)$ distribution (where the rank of $\Sigma$ is possibly less than $M$). Further, take $\Gamma$ to be any matrix such that $\Sigma = \Gamma'\Gamma$—the existence of such a matrix follows from Corollary 2.13.25—and denote by $R$ the number of rows in $\Gamma$. And observe that
$$x \sim \mu + \Gamma'z, \tag{5.3}$$
where $z$ is an $R$-dimensional random (column) vector that has an $N(0, I_R)$ distribution.
Let $c$ represent a constant, $b$ an $M$-dimensional column vector of constants, and $A$ an $M \times M$ symmetric matrix of constants. And denote by $m(\cdot)$ the moment generating function of the second-degree polynomial $c + b'x + x'Ax$ (in the random vector $x$), and let $t$ represent an arbitrary scalar. Further, take $S$ to be the subset of $\mathbb{R}$ defined as follows:
$$S = \{\, t : \text{the matrix } I - 2t\Gamma A\Gamma' \text{ is positive definite} \,\}.$$
As is evident from Lemma 6.5.1, this subset includes a neighborhood of $0$, and, letting $d_0 = 2\min_{z\,:\,z'z=1} z'\Gamma A\Gamma'z$ and $d_1 = 2\max_{z\,:\,z'z=1} z'\Gamma A\Gamma'z$, it is [in light of result (5.1)] expressible in the form
$$S = \begin{cases} (1/d_0,\, 1/d_1), & \text{if } d_0 < 0 \text{ and } d_1 > 0, \\ (1/d_0,\, \infty), & \text{if } d_0 < 0 \text{ and } d_1 \le 0, \\ (-\infty,\, 1/d_1), & \text{if } d_0 \ge 0 \text{ and } d_1 > 0, \\ (-\infty,\, \infty), & \text{if } d_0 = d_1 = 0. \end{cases}$$
Upon observing [in light of result (5.3)] that
$$c + b'x + x'Ax \sim c + b'(\mu + \Gamma'z) + (\mu + \Gamma'z)'A(\mu + \Gamma'z) = c + b'\mu + \mu'A\mu + [\Gamma(b + 2A\mu)]'z + z'\Gamma A\Gamma'z \tag{5.4}$$
and upon applying result (5.2), we find that, for $t \in S$,
$$m(t) = E\bigl[e^{t(c + b'x + x'Ax)}\bigr] = E\bigl\{e^{t(c + b'\mu + \mu'A\mu) + [t\Gamma(b + 2A\mu)]'z + z'(t\Gamma A\Gamma')z}\bigr\}$$
$$= |I - 2t\Gamma A\Gamma'|^{-1/2} \exp[\,t(c + b'\mu + \mu'A\mu)\,] \exp[\,(1/2)\,t^2 (b + 2A\mu)'\Gamma'(I - 2t\Gamma A\Gamma')^{-1}\Gamma(b + 2A\mu)\,]. \tag{5.5}$$

The dependence of expression (5.5) on the variance-covariance matrix $\Sigma$ is through the "intermediary" $\Gamma$. The moment generating function can be reexpressed in terms of $\Sigma$ itself. In light of Corollary 6.4.2,
$$|I - 2t\Gamma A\Gamma'| = |I - \Gamma(2tA)\Gamma'| = |I - 2tA\Gamma'\Gamma| = |I - 2tA\Sigma|, \tag{5.6}$$
implying that
$$|I - 2tA\Sigma| > 0 \quad \text{for } t \in S \tag{5.7}$$
and hence that $I - 2tA\Sigma$ is nonsingular for $t \in S$. Moreover,
$$(I - 2t\Gamma A\Gamma')^{-1}\Gamma = \Gamma(I - 2tA\Sigma)^{-1} \quad \text{for } t \in S, \tag{5.8}$$
as is evident upon observing that
$$\Gamma(I - 2tA\Sigma) = (I - 2t\Gamma A\Gamma')\Gamma$$
and upon premultiplying both sides of this equality by $(I - 2t\Gamma A\Gamma')^{-1}$ and postmultiplying both sides by $(I - 2tA\Sigma)^{-1}$.
Results (5.6) and (5.8) can be used to reexpress expression (5.5) (for the moment generating function) as follows: for $t \in S$,
$$m(t) = |I - 2tA\Sigma|^{-1/2} \exp[\,t(c + b'\mu + \mu'A\mu)\,] \exp[\,(1/2)\,t^2 (b + 2A\mu)'\Sigma(I - 2tA\Sigma)^{-1}(b + 2A\mu)\,]. \tag{5.9}$$

In the special case where $c = 0$ and $b = 0$ [i.e., where $m(\cdot)$ is the moment generating function of the quadratic form $x'Ax$], expression (5.9) simplifies as follows: for $t \in S$,
$$m(t) = |I - 2tA\Sigma|^{-1/2} \exp\{\,t\mu'[I + 2tA\Sigma(I - 2tA\Sigma)^{-1}]A\mu\,\} = |I - 2tA\Sigma|^{-1/2} \exp[\,t\mu'(I - 2tA\Sigma)^{-1}A\mu\,]. \tag{5.10}$$
And in the further special case where (in addition to $c = 0$ and $b = 0$) $\Sigma$ is nonsingular, the moment generating function (of the distribution of $x'Ax$) is also expressible as follows: for $t \in S$,
$$m(t) = |I - 2tA\Sigma|^{-1/2} \exp\{\,-(1/2)\,\mu'[I - (I - 2tA\Sigma)^{-1}]\Sigma^{-1}\mu\,\}, \tag{5.11}$$
as is evident upon observing that
$$t\mu'(I - 2tA\Sigma)^{-1}A\mu = -(1/2)\,\mu'(I - 2tA\Sigma)^{-1}(-2tA\Sigma)\Sigma^{-1}\mu = -(1/2)\,\mu'[I - (I - 2tA\Sigma)^{-1}]\Sigma^{-1}\mu.$$

Moment generating function of the joint distribution of multiple quadratic forms or second-degree polynomials. Let us continue to take $x$ to be an $M$-dimensional random column vector that has an $N(\mu, \Sigma)$ distribution, to take $\Gamma$ to be any matrix such that $\Sigma = \Gamma'\Gamma$, to denote by $R$ the number of rows in $\Gamma$, and to take $z$ to be an $R$-dimensional random column vector that has an $N(0, I_R)$ distribution.
For $i = 1, 2, \dots, K$ (where $K$ is a strictly positive integer), let $c_i$ represent a constant, $b_i$ an $M$-dimensional column vector of constants, and $A_i$ an $M \times M$ symmetric matrix of constants. And denote by $m(\cdot)$ the moment generating function of the distribution of the $K$-dimensional random column vector whose $i$th element is the second-degree polynomial $c_i + b_i'x + x'A_i x$ (in the random vector $x$), and let $t = (t_1, t_2, \dots, t_K)'$ represent an arbitrary $K$-dimensional column vector. Further, take $S$ to be the subset of $\mathbb{R}^K$ defined as follows:
$$S = \{\, (t_1, t_2, \dots, t_K)' : I - 2\textstyle\sum_{i=1}^K t_i \Gamma A_i \Gamma' \text{ is positive definite} \,\}.$$
As indicated by Lemma 6.5.2, this subset includes a neighborhood of $0$.


Now, recalling (from Part 1) that $x \sim \mu + \Gamma'z$, we have [analogous to result (5.4)] that
$$c_i + b_i'x + x'A_i x \sim c_i + b_i'\mu + \mu'A_i\mu + [\Gamma(b_i + 2A_i\mu)]'z + z'\Gamma A_i\Gamma'z$$
($i = 1, 2, \dots, K$). And upon applying result (5.2), we obtain the following generalization of result (5.5): for $t \in S$,
$$m(t) = E\bigl[e^{\sum_i t_i (c_i + b_i'x + x'A_i x)}\bigr] = E\bigl\{e^{\sum_i t_i (c_i + b_i'\mu + \mu'A_i\mu) + [\Gamma\sum_i t_i (b_i + 2A_i\mu)]'z + z'(\sum_i t_i \Gamma A_i\Gamma')z}\bigr\}$$
$$= \bigl|I - 2\textstyle\sum_i t_i \Gamma A_i\Gamma'\bigr|^{-1/2} \exp\bigl[\textstyle\sum_i t_i (c_i + b_i'\mu + \mu'A_i\mu)\bigr] \exp\bigl\{(1/2)\bigl[\textstyle\sum_i t_i (b_i + 2A_i\mu)\bigr]'\Gamma'\bigl(I - 2\textstyle\sum_i t_i \Gamma A_i\Gamma'\bigr)^{-1}\Gamma\textstyle\sum_i t_i (b_i + 2A_i\mu)\bigr\}. \tag{5.12}$$

As a straightforward generalization of result (5.6), we have that
$$\bigl|I - 2\textstyle\sum_{i=1}^K t_i \Gamma A_i\Gamma'\bigr| = \bigl|I - 2\textstyle\sum_{i=1}^K t_i A_i\Sigma\bigr|, \tag{5.13}$$
implying that
$$\bigl|I - 2\textstyle\sum_{i=1}^K t_i A_i\Sigma\bigr| > 0 \quad \text{for } t \in S \tag{5.14}$$
and hence that $I - 2\sum_{i=1}^K t_i A_i\Sigma$ is nonsingular for $t \in S$. And as a straightforward generalization of result (5.8), we have that
$$\bigl(I - 2\textstyle\sum_{i=1}^K t_i \Gamma A_i\Gamma'\bigr)^{-1}\Gamma = \Gamma\bigl(I - 2\textstyle\sum_{i=1}^K t_i A_i\Sigma\bigr)^{-1} \quad \text{for } t \in S. \tag{5.15}$$
Based on results (5.13) and (5.15), we obtain, as a variation on expression (5.12) for the moment generating function, the following generalization of expression (5.9): for $t \in S$,
$$m(t) = \bigl|I - 2\textstyle\sum_i t_i A_i\Sigma\bigr|^{-1/2} \exp\bigl[\textstyle\sum_i t_i (c_i + b_i'\mu + \mu'A_i\mu)\bigr] \exp\bigl\{(1/2)\bigl[\textstyle\sum_i t_i (b_i + 2A_i\mu)\bigr]'\Sigma\bigl(I - 2\textstyle\sum_i t_i A_i\Sigma\bigr)^{-1}\textstyle\sum_i t_i (b_i + 2A_i\mu)\bigr\}. \tag{5.16}$$

In the special case where $c_1 = c_2 = \dots = c_K = 0$ and $b_1 = b_2 = \dots = b_K = 0$ [i.e., where $m(\cdot)$ is the moment generating function of the joint distribution of the quadratic forms $x'A_1x, x'A_2x, \dots, x'A_Kx$], expression (5.16) simplifies to the following generalization of expression (5.10): for $t \in S$,
$$m(t) = \bigl|I - 2\textstyle\sum_i t_i A_i\Sigma\bigr|^{-1/2} \exp\bigl[\mu'\bigl(I - 2\textstyle\sum_i t_i A_i\Sigma\bigr)^{-1}\textstyle\sum_i t_i A_i\,\mu\bigr]. \tag{5.17}$$
And in the further special case where (in addition to $c_1 = c_2 = \dots = c_K = 0$ and $b_1 = b_2 = \dots = b_K = 0$) $\Sigma$ is nonsingular, the moment generating function (of the joint distribution of $x'A_1x, x'A_2x, \dots, x'A_Kx$) is alternatively expressible as the following generalization of expression (5.11): for $t \in S$,
$$m(t) = \bigl|I - 2\textstyle\sum_i t_i A_i\Sigma\bigr|^{-1/2} \exp\bigl\{-(1/2)\,\mu'\bigl[I - \bigl(I - 2\textstyle\sum_i t_i A_i\Sigma\bigr)^{-1}\bigr]\Sigma^{-1}\mu\bigr\}. \tag{5.18}$$

6.6 Distribution of Quadratic Forms or Second-Degree Polynomials (in a Normally Distributed Random Vector): Chi-Squareness
Suppose that $x$ is an $M \times 1$ random column vector that has an $N(\mu, \Sigma)$ distribution. Under what conditions does the quadratic form $x'Ax$ (where $A$ is an $M \times M$ symmetric matrix of constants) have a (possibly noncentral) chi-square distribution? And, more generally, under what conditions does the second-degree polynomial $c + b'x + x'Ax$ (where $c$ is a constant and $b$ an $M \times 1$ vector of constants) have a (possibly noncentral) chi-square distribution? In answering these questions, it is convenient to initially restrict attention to the special case where $\mu = 0$ and $\Sigma = I$ (i.e., the special case where $x$ has an $M$-variate standard normal distribution).

a. Special case: quadratic form or second-degree polynomial in a random vector that has a multivariate standard normal distribution
The following theorem gives conditions that are necessary and sufficient for a second-degree poly-
nomial (in a random vector that has a multivariate standard normal distribution) to have a noncentral
chi-square distribution.
Theorem 6.6.1. Let $z$ represent an $M$-dimensional random column vector that has an $N(0, I_M)$ distribution, and take $q = c + b'z + z'Az$, where $c$ is a constant, $b$ an $M$-dimensional column vector of constants, and $A$ an $M \times M$ (nonnull) symmetric matrix of constants. If
$$A^2 = A, \tag{6.1}$$
$$b = Ab, \tag{6.2}$$
and
$$c = \tfrac{1}{4}\,b'b, \tag{6.3}$$
then $q \sim \chi^2(R, c)$, where $R = \operatorname{rank} A = \operatorname{tr}(A)$. Conversely, if $q \sim \chi^2(R, \lambda)$ (for some strictly positive integer $R$), then $A$, $b$, and $c$ satisfy conditions (6.1), (6.2), and (6.3), $R = \operatorname{rank} A = \operatorname{tr}(A)$, and $\lambda = c$.

In connection with Theorem 6.6.1, it is worth noting that if $A$, $b$, and $c$ satisfy conditions (6.2) and (6.3), then the second-degree polynomial $q$ is reexpressible as a quadratic form
$$q = \bigl(z + \tfrac{1}{2}b\bigr)'A\bigl(z + \tfrac{1}{2}b\bigr) \tag{6.4}$$
[in the vector $z + \tfrac{1}{2}b$, the distribution of which is $N\bigl(\tfrac{1}{2}b,\, I_M\bigr)$]. Moreover, if $A$, $b$, and $c$ satisfy all three of conditions (6.1), (6.2), and (6.3), then $q$ is reexpressible as a sum of squares
$$q = \bigl(Az + \tfrac{1}{2}b\bigr)'\bigl(Az + \tfrac{1}{2}b\bigr) \tag{6.5}$$
[of the elements of the vector $Az + \tfrac{1}{2}b$, the distribution of which is $N\bigl(\tfrac{1}{2}b,\, A\bigr)$].
Theorem 6.6.1 asserts that conditions (6.1), (6.2), and (6.3) are necessary and sufficient for the
second-degree polynomial q to have a noncentral chi-square distribution. In proving Theorem 6.6.1,
it is convenient to devote our initial efforts to establishing the sufficiency of these conditions. The
proof of sufficiency is considerably simpler than that of necessity. And, perhaps fortuitously, it is
the sufficiency that is of the most importance; it is typically the sufficiency of the conditions that is
invoked in an application of the theorem rather than their necessity.
Proof (of Theorem 6.6.1): sufficiency. Suppose that the symmetric matrix $A$, the column vector $b$, and the scalar $c$ satisfy conditions (6.1), (6.2), and (6.3). Then, in conformance with the earlier observation (6.4),
$$q = \bigl(z + \tfrac{1}{2}b\bigr)'A\bigl(z + \tfrac{1}{2}b\bigr).$$
Moreover, it follows from Theorem 5.9.5 that there exists a matrix $O$ of dimensions $M \times R$, where $R = \operatorname{rank} A$, such that $A = OO'$ and that necessarily this matrix is such that $O'O = I_R$. Thus,
$$q = \bigl(z + \tfrac{1}{2}b\bigr)'OO'\bigl(z + \tfrac{1}{2}b\bigr) = x'x,$$
where $x = O'z + \tfrac{1}{2}O'b$. And upon observing that
$$x \sim N\bigl(\tfrac{1}{2}O'b,\, I_R\bigr),$$
we conclude (on the basis of the very definition of the noncentral chi-square distribution) that
$$q \sim \chi^2\bigl[R,\, \bigl(\tfrac{1}{2}O'b\bigr)'\bigl(\tfrac{1}{2}O'b\bigr)\bigr].$$
It remains only to observe that (because $A$ is idempotent) $\operatorname{rank} A = \operatorname{tr}(A)$—refer to Corollary 2.8.3—and that
$$\bigl(\tfrac{1}{2}O'b\bigr)'\bigl(\tfrac{1}{2}O'b\bigr) = \tfrac{1}{4}\,b'OO'b = \tfrac{1}{4}\,b'Ab = \tfrac{1}{4}\,b'b = c.$$
Q.E.D.
The proof of the "necessity part" of Theorem 6.6.1 is deferred until Section 6.7, subsequent to a discussion of the spectral decomposition of a symmetric matrix and to the introduction of some results on polynomials.
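The sufficiency part of Theorem 6.6.1 is readily illustrated by simulation. In the sketch below (Python with NumPy/SciPy; the projection matrix and the vector used to construct $b$ are arbitrary illustrative choices), $A$ is symmetric and idempotent, $b$ satisfies $b = Ab$, and $c = \tfrac{1}{4}b'b$; the simulated values of $q$ are then compared with the $\chi^2(R, c)$ distribution. (It is assumed here that the noncentrality convention of scipy.stats.ncx2, namely the sum of the squared means, matches the one used in the text.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
M, R = 6, 3
# an M x M symmetric idempotent matrix of rank R (orthogonal projection onto a random R-dim subspace)
X = rng.standard_normal((M, R))
A = X @ np.linalg.inv(X.T @ X) @ X.T
b = A @ rng.standard_normal(M)        # ensures b = Ab, i.e., condition (6.2)
c = 0.25 * b @ b                      # condition (6.3)

# simulate q = c + b'z + z'Az with z ~ N(0, I_M)
z = rng.standard_normal((200_000, M))
q = c + z @ b + np.einsum("ij,jk,ik->i", z, A, z)

# Theorem 6.6.1: q ~ chi-square with R = rank A = tr A degrees of freedom and noncentrality c
print(np.trace(A))                                    # approximately R
print(stats.kstest(q, stats.ncx2(df=R, nc=c).cdf))    # a large p-value is expected
```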

b. Extension to quadratic forms or second-degree polynomials in a random vector having an arbitrary multivariate normal distribution
Theorem 6.6.1 can be generalized as follows.
Theorem 6.6.2. Let $x$ represent an $M$-dimensional random column vector that has an $N(\mu, \Sigma)$ distribution, and take $q = c + b'x + x'Ax$, where $c$ is a constant, $b$ an $M$-dimensional column vector of constants, and $A$ an $M \times M$ symmetric matrix of constants (such that $\Sigma A\Sigma \ne 0$). If
$$\Sigma A\Sigma A\Sigma = \Sigma A\Sigma, \tag{6.6}$$
$$\Sigma(b + 2A\mu) = \Sigma A\Sigma(b + 2A\mu), \tag{6.7}$$
and
$$c + b'\mu + \mu'A\mu = \tfrac{1}{4}\,(b + 2A\mu)'\Sigma(b + 2A\mu), \tag{6.8}$$
then $q \sim \chi^2(R,\, c + b'\mu + \mu'A\mu)$, where $R = \operatorname{rank}(\Sigma A\Sigma) = \operatorname{tr}(A\Sigma)$. Conversely, if $q \sim \chi^2(R, \lambda)$ (for some strictly positive integer $R$), then $A$, $b$, and $c$ (and $\Sigma$ and $\mu$) satisfy conditions (6.6), (6.7), and (6.8), $R = \operatorname{rank}(\Sigma A\Sigma) = \operatorname{tr}(A\Sigma)$, and $\lambda = c + b'\mu + \mu'A\mu$.
Proof. Let $d = b + 2A\mu$, take $\Gamma$ to be any matrix (with $M$ columns) such that $\Sigma = \Gamma'\Gamma$ (the existence of which follows from Corollary 2.13.25), and denote by $P$ the number of rows in $\Gamma$. Further, take $z$ to be a $P$-dimensional random column vector that is distributed as $N(0, I_P)$. And observe that $x \sim \mu + \Gamma'z$ and hence that
$$q \sim c + b'(\mu + \Gamma'z) + (\mu + \Gamma'z)'A(\mu + \Gamma'z) = c + b'\mu + \mu'A\mu + (\Gamma d)'z + z'\Gamma A\Gamma'z.$$
Observe also (in light of Corollary 2.3.4) that
$$\Sigma A\Sigma \ne 0 \;\Leftrightarrow\; \Sigma A\Gamma' \ne 0 \;\Leftrightarrow\; \Gamma A\Gamma' \ne 0.$$
Accordingly, it follows from Theorem 6.6.1 that if
$$\Gamma A\Gamma'\Gamma A\Gamma' = \Gamma A\Gamma', \tag{6.9}$$
$$\Gamma d = \Gamma A\Gamma'\Gamma d, \tag{6.10}$$
and
$$c + b'\mu + \mu'A\mu = \tfrac{1}{4}\,d'\Gamma'\Gamma d, \tag{6.11}$$
then $q \sim \chi^2(R,\, c + b'\mu + \mu'A\mu)$, where $R = \operatorname{rank}(\Gamma A\Gamma') = \operatorname{tr}(\Gamma A\Gamma')$; and, conversely, if $q \sim \chi^2(R, \lambda)$ (for some strictly positive integer $R$), then $A$, $b$, and $c$ (and $\Gamma$ and $\mu$) satisfy conditions (6.9), (6.10), and (6.11), $R = \operatorname{rank}(\Gamma A\Gamma') = \operatorname{tr}(\Gamma A\Gamma')$, and $\lambda = c + b'\mu + \mu'A\mu$. Moreover, in light of Lemma 2.12.3,
$$\operatorname{rank}(\Gamma A\Gamma') = \operatorname{rank}(\Sigma A\Gamma') = \operatorname{rank}(\Sigma A\Sigma),$$
and in light of Lemma 2.3.1, $\operatorname{tr}(\Gamma A\Gamma') = \operatorname{tr}(A\Sigma)$. Since $d'\Gamma'\Gamma d = (b + 2A\mu)'\Sigma(b + 2A\mu)$, it remains only to observe (in light of Corollary 2.3.4) that
$$\Gamma A\Gamma'\Gamma A\Gamma' = \Gamma A\Gamma' \;\Leftrightarrow\; \Gamma'\Gamma A\Gamma'\Gamma A\Gamma' = \Gamma'\Gamma A\Gamma' \;\Leftrightarrow\; \Gamma'\Gamma A\Gamma'\Gamma A\Gamma'\Gamma = \Gamma'\Gamma A\Gamma'\Gamma$$
[so that conditions (6.6) and (6.9) are equivalent] and that
$$\Gamma d = \Gamma A\Gamma'\Gamma d \;\Leftrightarrow\; \Gamma'\Gamma d = \Gamma'\Gamma A\Gamma'\Gamma d$$
[so that conditions (6.7) and (6.10) are equivalent]. Q.E.D.
Note that condition (6.6) is satisfied if $A$ is such that
$$(A\Sigma)^2 = A\Sigma \quad [\text{or, equivalently, } (\Sigma A)^2 = \Sigma A],$$
that is, if $A\Sigma$ is idempotent (or, equivalently, $\Sigma A$ is idempotent), in which case
$$\operatorname{tr}(A\Sigma) = \operatorname{rank}(A\Sigma) \;[= \operatorname{rank}(\Sigma A\Sigma)].$$
And note that conditions (6.6) and (6.7) are both satisfied if $A$ and $b$ are such that
$$(A\Sigma)^2 = A\Sigma \quad \text{and} \quad b \in \mathcal{C}(A).$$
Note also that all three of conditions (6.6), (6.7), and (6.8) are satisfied if $A$, $b$, and $c$ are such that
$$A\Sigma A = A, \quad b \in \mathcal{C}(A), \quad \text{and} \quad c = \tfrac{1}{4}\,b'\Sigma b. \tag{6.12}$$
Finally, note that (by definition) $A\Sigma A = A$ if and only if $\Sigma$ is a generalized inverse of $A$.
In the special case where $\Sigma$ is nonsingular, condition (6.12) is a necessary condition for $A$, $b$, and $c$ to satisfy all three of conditions (6.6), (6.7), and (6.8) (as well as a sufficient condition), as can be readily verified. Moreover, if $\Sigma$ is nonsingular, then $\operatorname{rank}(\Sigma A\Sigma) = \operatorname{rank}(A)$. Thus, as a corollary of Theorem 6.6.2, we have the following result.
Corollary 6.6.3. Let $x$ represent an $M$-dimensional random column vector that has an $N(\mu, \Sigma)$ distribution, where $\Sigma$ is nonsingular. And take $q = c + b'x + x'Ax$, where $c$ is a constant, $b$ an $M$-dimensional column vector of constants, and $A$ an $M \times M$ (nonnull) symmetric matrix of constants. If
$$A\Sigma A = A, \quad b \in \mathcal{C}(A), \quad \text{and} \quad c = \tfrac{1}{4}\,b'\Sigma b, \tag{6.13}$$
then $q \sim \chi^2(\operatorname{rank} A,\, c + b'\mu + \mu'A\mu)$. Conversely, if $q \sim \chi^2(R, \lambda)$ (for some strictly positive integer $R$), then $A$, $b$, and $c$ (and $\Sigma$) satisfy condition (6.13), $R = \operatorname{rank} A$, and $\lambda = c + b'\mu + \mu'A\mu$.
In the special case where $q$ is a quadratic form (i.e., where $c = 0$ and $b = 0$), Corollary 6.6.3 simplifies to the following result.
Corollary 6.6.4. Let $x$ represent an $M$-dimensional random column vector that has an $N(\mu, \Sigma)$ distribution, where $\Sigma$ is nonsingular. And take $A$ to be an $M \times M$ (nonnull) symmetric matrix of constants. If $A\Sigma A = A$, then $x'Ax \sim \chi^2(\operatorname{rank} A,\, \mu'A\mu)$. Conversely, if $x'Ax \sim \chi^2(R, \lambda)$ (for some strictly positive integer $R$), then $A\Sigma A = A$, $R = \operatorname{rank} A$, and $\lambda = \mu'A\mu$.
In connection with Corollaries 6.6.3 and 6.6.4, note that if $\Sigma$ is nonsingular, then
$$A\Sigma A = A \;\Leftrightarrow\; (A\Sigma)^2 = A\Sigma \;(\text{i.e., } A\Sigma \text{ is idempotent}).$$
Moreover, upon taking $k$ to be an $M$-dimensional column vector of constants and upon applying Corollary 6.6.3 (with $A = \Sigma^{-1}$, $b = -2\Sigma^{-1}k$, and $c = k'\Sigma^{-1}k$), we find that [for an $M$-dimensional random column vector $x$ that has an $N(\mu, \Sigma)$ distribution, where $\Sigma$ is nonsingular]
$$(x - k)'\Sigma^{-1}(x - k) \sim \chi^2\bigl[M,\, (\mu - k)'\Sigma^{-1}(\mu - k)\bigr]. \tag{6.14}$$
In the special case where $k = 0$, result (6.14) simplifies to the following result:
$$x'\Sigma^{-1}x \sim \chi^2(M,\, \mu'\Sigma^{-1}\mu). \tag{6.15}$$
Alternatively, result (6.15) is obtainable as an application of Corollary 6.6.4 (that where $A = \Sigma^{-1}$).
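Result (6.15) is easy to confirm by simulation. The sketch below (Python with NumPy/SciPy; the mean vector and variance-covariance matrix are arbitrary illustrative choices, and the same caveat about scipy's noncentrality convention applies as in the earlier sketch) compares simulated values of $x'\Sigma^{-1}x$ with the $\chi^2(M,\, \mu'\Sigma^{-1}\mu)$ distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
M = 4
mu = rng.standard_normal(M)
G = rng.standard_normal((M, M))
Sigma = G.T @ G + np.eye(M)            # a nonsingular variance-covariance matrix
Sinv = np.linalg.inv(Sigma)

# result (6.15): x' Sigma^{-1} x ~ chi-square(M, mu' Sigma^{-1} mu) for x ~ N(mu, Sigma)
x = rng.multivariate_normal(mu, Sigma, size=200_000)
q = np.einsum("ij,jk,ik->i", x, Sinv, x)
nc = mu @ Sinv @ mu
print(stats.kstest(q, stats.ncx2(df=M, nc=nc).cdf))   # a large p-value is expected
```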

c. Some results on linear spaces (of $M$-dimensional row or column vectors or, more generally, of $M \times N$ matrices)
At this point in the discussion of the distribution of quadratic forms, it is helpful to introduce some additional results on linear spaces. According to Theorem 2.4.7, every linear space (of $M \times N$ matrices) has a basis. And according to Theorem 2.4.11, any set of $R$ linearly independent matrices in an $R$-dimensional linear space $\mathcal{V}$ (of $M \times N$ matrices) is a basis for $\mathcal{V}$. A useful generalization of these results is provided by the following theorem.
Theorem 6.6.5. For any set $S$ of $R$ linearly independent matrices in a $K$-dimensional linear space $\mathcal{V}$ (of $M \times N$ matrices), there exists a basis for $\mathcal{V}$ that includes all $R$ of the matrices in $S$ (and $K - R$ additional matrices).
For a proof of the result set forth in Theorem 6.6.5, refer, for example, to Harville (1997, sec. 4.3g).
Not only does every linear space (of $M \times N$ matrices) have a basis (as asserted by Theorem 2.4.7), but (according to Theorem 2.4.23) every linear space (of $M \times N$ matrices) has an orthonormal basis. A useful generalization of this result is provided by the following variation on Theorem 6.6.5.
Theorem 6.6.6. For any orthonormal set $S$ of $R$ matrices in a $K$-dimensional linear space $\mathcal{V}$ (of $M \times N$ matrices), there exists an orthonormal basis for $\mathcal{V}$ that includes all $R$ of the matrices in $S$ (and $K - R$ additional matrices).
Theorem 6.6.6 can be derived from Theorem 6.6.5 in much the same way that Theorem 2.4.23 can be derived from Theorem 2.4.7—refer, e.g., to Harville (1997, sec. 6.4c) for some specifics.

d. Some variations on the results of Subsections a and b
Suppose that $z$ is an $M$-dimensional random column vector that is distributed as $N(0, I_M)$ and that $A$ is an $M \times M$ (nonnull) symmetric matrix of constants. As a special case of Theorem 6.6.1, we have the following result: if $A^2 = A$, then $z'Az \sim \chi^2(R)$, where $R = \operatorname{rank} A = \operatorname{tr}(A)$; and, conversely, if $z'Az \sim \chi^2(R)$ (for some strictly positive integer $R$), then $A^2 = A$ and $R = \operatorname{rank} A = \operatorname{tr}(A)$. A variation on this result is as follows.
Theorem 6.6.7. Let $z$ represent an $M$-dimensional random column vector that has an $N(0, I_M)$ distribution, take $y_1, y_2, \dots, y_M$ to be statistically independent random variables that are distributed identically as $N(0, 1)$, and denote by $A$ an $M \times M$ (nonnull) symmetric matrix of constants. If $A^2 = A$, then
$$\frac{z'Az}{z'z} \sim \frac{\sum_{i=1}^R y_i^2}{\sum_{i=1}^M y_i^2},$$
where $R = \operatorname{rank} A = \operatorname{tr}(A)$; and, conversely, if $z'Az/z'z \sim \bigl(\sum_{i=1}^R y_i^2\bigr)\big/\bigl(\sum_{i=1}^M y_i^2\bigr)$ for some integer $R$ between $1$ and $M$, inclusive, then $A^2 = A$ and $R = \operatorname{rank} A = \operatorname{tr}(A)$.
In connection with Theorem 6.6.7, note that if $z$ is an $M$-dimensional random column vector that has an $N(0, I_M)$ distribution and if $y_1, y_2, \dots, y_M$ are statistically independent random variables that are distributed identically as $N(0, 1)$, then for any integer $R$ between $1$ and $M - 1$, inclusive,
$$\frac{z'Az}{z'z} \sim \frac{\sum_{i=1}^R y_i^2}{\sum_{i=1}^M y_i^2} \;\Leftrightarrow\; \frac{z'Az}{z'z} \sim \operatorname{Be}\Bigl(\frac{R}{2},\, \frac{M - R}{2}\Bigr) \tag{6.16}$$
—for $R = M$, $\bigl(\sum_{i=1}^R y_i^2\bigr)\big/\bigl(\sum_{i=1}^M y_i^2\bigr) = 1$.
Proof (of Theorem 6.6.7). Suppose that $A^2 = A$ [in which case $\operatorname{rank} A = \operatorname{tr}(A)$]. Then, according to Theorem 5.9.5, there exists a matrix $Q_1$ of dimensions $M \times R$, where $R = \operatorname{rank} A$, such that $A = Q_1Q_1'$, and, necessarily, this matrix is such that $Q_1'Q_1 = I_R$. Now, take $Q$ to be the $M \times M$ matrix defined as follows: if $R = M$, take $Q = Q_1$; if $R < M$, take $Q = (Q_1, Q_2)$, where $Q_2$ is an $M \times (M - R)$ matrix whose columns consist of any $M - R$ vectors that, together with the $R$ columns of $Q_1$, form (when the inner product is taken to be the usual inner product) an orthonormal basis for $\mathbb{R}^M$—the existence of such vectors follows from Theorem 6.6.6. Further, define $y_1 = Q_1'z$ and $y = Q'z$, and observe that $Q$ is orthogonal. And upon observing that $y \sim N(0, I_M)$, that
$$\frac{z'Az}{z'z} = \frac{z'Q_1Q_1'z}{z'QQ'z} = \frac{y_1'y_1}{y'y},$$
and that the elements of $y_1$ are the first $R$ elements of $y$, we conclude that
$$\frac{z'Az}{z'z} \sim \frac{\sum_{i=1}^R y_i^2}{\sum_{i=1}^M y_i^2}.$$
Conversely, suppose that $z'Az/z'z \sim \bigl(\sum_{i=1}^R y_i^2\bigr)\big/\bigl(\sum_{i=1}^M y_i^2\bigr)$ for some integer $R$ between $1$ and $M$, inclusive. Then, letting $z_1, z_2, \dots, z_M$ represent the elements of $z$ (and observing that the joint distribution of $z_1, z_2, \dots, z_M$ is identical to that of $y_1, y_2, \dots, y_M$),
$$\frac{z'Az}{z'z} \sim \frac{\sum_{i=1}^R z_i^2}{\sum_{i=1}^M z_i^2}.$$
Moreover, each of the quantities $z'Az/z'z$ and $\bigl(\sum_{i=1}^R z_i^2\bigr)\big/\bigl(\sum_{i=1}^M z_i^2\bigr)$ depends on the value of $z$ only through $(z'z)^{-1/2}z$, and, consequently, it follows from the results of Section 6.1f that each of these quantities is distributed independently of $z'z$ ($= \sum_{i=1}^M z_i^2$). Thus,
$$z'Az = z'z\,\frac{z'Az}{z'z} \sim z'z\,\frac{\sum_{i=1}^R z_i^2}{\sum_{i=1}^M z_i^2} = \sum_{i=1}^R z_i^2.$$
It is now clear that $z'Az \sim \chi^2(R)$ and hence, upon applying the "necessity part" of Theorem 6.6.1, that $A^2 = A$ and that $R = \operatorname{rank} A = \operatorname{tr}(A)$. Q.E.D.
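The beta-distribution form (6.16) of Theorem 6.6.7 can be illustrated numerically. The following sketch (Python with NumPy/SciPy; the projection matrix is an arbitrary illustrative choice) simulates $z'Az/z'z$ for a symmetric idempotent $A$ of rank $R$ and compares the result with the $\operatorname{Be}(R/2, (M-R)/2)$ distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
M, R = 6, 2
X = rng.standard_normal((M, R))
A = X @ np.linalg.inv(X.T @ X) @ X.T   # symmetric idempotent matrix of rank R

z = rng.standard_normal((200_000, M))
ratio = np.einsum("ij,jk,ik->i", z, A, z) / np.einsum("ij,ij->i", z, z)

# per (6.16), the ratio z'Az / z'z should follow a Be(R/2, (M-R)/2) distribution
print(stats.kstest(ratio, stats.beta(R / 2, (M - R) / 2).cdf))   # a large p-value is expected
```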
The result of Theorem 6.6.7 can be generalized. Suppose that $x$ is an $M$-dimensional random column vector that is distributed as $N(0, \Sigma)$, let $P = \operatorname{rank}\Sigma$, suppose that $P > 0$, and take $A$ to be an $M \times M$ symmetric matrix of constants (such that $\Sigma A\Sigma \ne 0$). Further, take $\Gamma$ to be a matrix of dimensions $P \times M$ such that $\Sigma = \Gamma'\Gamma$ (the existence of which follows from Corollary 2.13.23), and take $z$ to be a $P$-dimensional random column vector that has an $N(0, I_P)$ distribution.
The matrix $\Gamma$ has full row rank $P$ (as is evident from Corollary 2.13.23), implying (in light of Lemma 2.5.1) that it has a right inverse, say $\Lambda$. Accordingly, it follows from Theorem 2.10.5 that $\Lambda\Lambda'$ is a generalized inverse of $\Sigma$. Further, upon observing that $x \sim \Gamma'z$ and (in light of Theorem 2.12.2) that $\Gamma\Sigma^-\Gamma'$ is invariant to the choice of the generalized inverse $\Sigma^-$, we find that
$$\frac{x'Ax}{x'\Sigma^-x} \sim \frac{z'\Gamma A\Gamma'z}{z'\Gamma\Sigma^-\Gamma'z} = \frac{z'\Gamma A\Gamma'z}{z'\Gamma\Lambda\Lambda'\Gamma'z} = \frac{z'\Gamma A\Gamma'z}{z'z}. \tag{6.17}$$
And upon applying Theorem 6.6.7 [and taking $y_1, y_2, \dots, y_P$ to be statistically independent random variables that are distributed identically as $N(0, 1)$], we conclude that if $\Gamma A\Sigma A\Gamma' = \Gamma A\Gamma'$, then
$$\frac{x'Ax}{x'\Sigma^-x} \sim \frac{\sum_{i=1}^R y_i^2}{\sum_{i=1}^P y_i^2},$$
where $R = \operatorname{rank}(\Gamma A\Gamma') = \operatorname{tr}(\Gamma A\Gamma')$; and, conversely, if $\dfrac{x'Ax}{x'\Sigma^-x} \sim \dfrac{\sum_{i=1}^R y_i^2}{\sum_{i=1}^P y_i^2}$ for some integer $R$ between $1$ and $P$, inclusive, then $\Gamma A\Sigma A\Gamma' = \Gamma A\Gamma'$ and $R = \operatorname{rank}(\Gamma A\Gamma') = \operatorname{tr}(\Gamma A\Gamma')$—note (in light of Corollary 2.3.4) that $\Sigma A\Sigma \ne 0 \Leftrightarrow \Gamma A\Gamma' \ne 0$.
The condition $\Gamma A\Sigma A\Gamma' = \Gamma A\Gamma'$ can be restated in terms that do not involve $\Gamma$. Upon applying Corollary 2.3.4, we find that
$$\Gamma A\Sigma A\Gamma' = \Gamma A\Gamma' \;\Leftrightarrow\; \Sigma A\Sigma A\Sigma = \Sigma A\Sigma.$$
Moreover, as observed earlier (in the proof of Theorem 6.6.2),
$$\operatorname{rank}(\Gamma A\Gamma') = \operatorname{rank}(\Sigma A\Sigma) \quad \text{and} \quad \operatorname{tr}(\Gamma A\Gamma') = \operatorname{tr}(A\Sigma).$$
In summary, we have the following theorem, which generalizes Theorem 6.6.7 and which relates to Theorem 6.6.2 in the same way that Theorem 6.6.7 relates to Theorem 6.6.1.
Theorem 6.6.8. Let $x$ represent an $M$-dimensional random column vector that has an $N(0, \Sigma)$ distribution, let $P = \operatorname{rank}\Sigma$, suppose that $P > 0$, take $y_1, y_2, \dots, y_P$ to be statistically independent random variables that are distributed identically as $N(0, 1)$, and denote by $A$ an $M \times M$ symmetric matrix of constants (such that $\Sigma A\Sigma \ne 0$). If $\Sigma A\Sigma A\Sigma = \Sigma A\Sigma$, then
$$\frac{x'Ax}{x'\Sigma^-x} \sim \frac{\sum_{i=1}^R y_i^2}{\sum_{i=1}^P y_i^2},$$
where $R = \operatorname{rank}(\Sigma A\Sigma) = \operatorname{tr}(A\Sigma)$; and, conversely, if $\dfrac{x'Ax}{x'\Sigma^-x} \sim \dfrac{\sum_{i=1}^R y_i^2}{\sum_{i=1}^P y_i^2}$ for some integer $R$ between $1$ and $P$, inclusive, then $\Sigma A\Sigma A\Sigma = \Sigma A\Sigma$ and $R = \operatorname{rank}(\Sigma A\Sigma) = \operatorname{tr}(A\Sigma)$.
In connection with Theorem 6.6.8, note that (for $1 \le R \le P - 1$) the condition $\dfrac{x'Ax}{x'\Sigma^-x} \sim \dfrac{\sum_{i=1}^R y_i^2}{\sum_{i=1}^P y_i^2}$ is equivalent to the condition
$$\frac{x'Ax}{x'\Sigma^-x} \sim \operatorname{Be}\Bigl(\frac{R}{2},\, \frac{P - R}{2}\Bigr). \tag{6.18}$$
Note also that the condition $\Sigma A\Sigma A\Sigma = \Sigma A\Sigma$ is satisfied if, in particular, $(A\Sigma)^2 = A\Sigma$, in which case
$$\operatorname{tr}(A\Sigma) = \operatorname{rank}(A\Sigma) \;[= \operatorname{rank}(\Sigma A\Sigma)].$$
Finally, note that if $\Sigma$ is nonsingular, then the condition $\Sigma A\Sigma A\Sigma = \Sigma A\Sigma$ is equivalent to the condition $A\Sigma A = A$, and $\operatorname{rank}(\Sigma A\Sigma) = \operatorname{rank} A$.

e. Extensions to spherically or elliptically distributed random vectors
The results of Theorem 6.6.7 [and result (6.16)] pertain to the distribution of $z'Az/z'z$, where $z$ is an $M$-dimensional random column vector that has an $N(0, I_M)$ distribution (and where $A$ is an $M \times M$ nonnull symmetric matrix of constants). The validity of these results is not limited to the case where the distribution of the $M$-dimensional random column vector $z$ is $N(0, I_M)$; it extends to the more general case where the distribution of $z$ is an absolutely continuous spherical distribution. To see this, suppose that the distribution of $z$ is an absolutely continuous spherical distribution, and observe that
$$\frac{z'Az}{z'z} = [(z'z)^{-1/2}z]'A[(z'z)^{-1/2}z]$$
and that the normalized vector $(z'z)^{-1/2}z$ has the same distribution as in the special case where the distribution of $z$ is $N(0, I_M)$; as in the special case, $(z'z)^{-1/2}z$ is distributed uniformly on the surface of an $M$-dimensional unit ball—refer to the results of Sections 6.1f and 6.1g. Thus, the distribution of $z'Az/z'z$ is the same in the general case where the distribution of $z$ is an absolutely continuous spherical distribution as in the special case where $z \sim N(0, I_M)$.
Now, consider the results summarized in Theorem 6.6.8 [and result (6.18)]; these results pertain to the distribution of $x'Ax/x'\Sigma^-x$, where $x$ is an $M$-dimensional random column vector that has an $N(0, \Sigma)$ distribution [and where $\Sigma$ is an $M \times M$ symmetric nonnegative definite matrix of rank $P$ ($> 0$) and where $A$ is an $M \times M$ symmetric matrix of constants (such that $\Sigma A\Sigma \ne 0$)]. The validity of these results is not limited to the case where the distribution of the $M$-dimensional random column vector $x$ is $N(0, \Sigma)$; it extends to the more general case where the distribution of $x$ is that of the vector $\Gamma'z$, where $\Gamma$ is a $P \times M$ matrix such that $\Sigma = \Gamma'\Gamma$ and where $z$ is a $P$-dimensional random column vector that has an absolutely continuous spherical distribution—in the more general case, $x$ is distributed elliptically about $0$.
To see this, suppose that $x \sim \Gamma'z$ (where $\Gamma$ is a $P \times M$ matrix such that $\Sigma = \Gamma'\Gamma$ and where $z$ is a $P$-dimensional random column vector that has an absolutely continuous spherical distribution). And observe that, as in the special case of result (6.17) [where $x \sim N(0, \Sigma)$ and $z \sim N(0, I_P)$],
$$\frac{x'Ax}{x'\Sigma^-x} \sim \frac{z'\Gamma A\Gamma'z}{z'z},$$
that
$$\frac{z'\Gamma A\Gamma'z}{z'z} = [(z'z)^{-1/2}z]'\Gamma A\Gamma'[(z'z)^{-1/2}z],$$
and that the normalized vector $(z'z)^{-1/2}z$ has the same distribution as in the special case where $z \sim N(0, I_P)$. Accordingly, the distribution of $x'Ax/x'\Sigma^-x$ is the same in the general case where (for a $P \times M$ matrix $\Gamma$ such that $\Sigma = \Gamma'\Gamma$ and a $P$-dimensional random column vector $z$ that has an absolutely continuous spherical distribution) $x \sim \Gamma'z$ as in the special case where $x \sim N(0, \Sigma)$.

6.7 The Spectral Decomposition, with Application to the Distribution of Quadratic Forms
The existence of a decomposition (of a symmetric matrix) known as the spectral decomposition
can be extremely useful, and in some cases indispensable, in establishing various results on the
distribution of quadratic forms. There is an intimate relationship between the spectral decomposition
(of a symmetric matrix) and the so-called eigenvalues and eigenvectors of the matrix.

a. Eigenvalues, eigenvectors, and the spectral decomposition


Let $A = \{a_{ij}\}$ represent an $N \times N$ matrix. A scalar (real number) $\lambda$ is said to be an eigenvalue of $A$ if there exists an $N$-dimensional nonnull column vector $x$ such that
$$Ax = \lambda x$$
or, equivalently, such that
$$(A - \lambda I_N)x = 0.$$
Consider the function $p(\cdot)$ of a single variable $\lambda$ defined (for all $\lambda$) as follows:
$$p(\lambda) = |A - \lambda I_N|.$$
It follows from the very definition of a determinant that $p(\lambda)$ is a polynomial (in $\lambda$) of degree $N$; this polynomial is referred to as the characteristic polynomial of the matrix $A$. Upon equating $p(\lambda)$ to $0$, we obtain the equality
$$p(\lambda) = 0,$$
which can be regarded as an equation (in $\lambda$) and (when so regarded) is referred to as the characteristic equation. Clearly, a scalar is an eigenvalue of $A$ if and only if it is a root of the characteristic polynomial or, equivalently, is a solution to the characteristic equation.
An $N$-dimensional nonnull column vector $x$ is said to be an eigenvector of the $N \times N$ matrix $A$ if there exists a scalar (real number) $\lambda$ such that $Ax = \lambda x$, in which case $\lambda$ is (by definition) an eigenvalue of $A$. For any particular eigenvector $x$ (of $A$), there is only one eigenvalue $\lambda$ such that $Ax = \lambda x$, which (since $Ax = \lambda x \Rightarrow x'Ax = \lambda x'x$) is
$$\lambda = \frac{x'Ax}{x'x}.$$
The eigenvector $x$ is said to correspond to (or belong to) this eigenvalue.
Note that if $x$ is an eigenvector of $A$ corresponding to an eigenvalue $\lambda$, then for any nonzero scalar $c$, the scalar multiple $cx$ is also an eigenvector of $A$, and $cx$ corresponds to the same eigenvalue as $x$. In particular, if $x$ is an eigenvector of $A$ corresponding to an eigenvalue $\lambda$, then the vector $(x'x)^{-1/2}x$, which is the scalar multiple of $x$ having a norm of $1$, is also an eigenvector of $A$ corresponding to $\lambda$.
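These definitions are conveniently illustrated with a small numerical example (Python with NumPy; the matrix is an arbitrary symmetric choice, not from the text). Each eigenpair satisfies $Ax = \lambda x$, the eigenvalue equals the Rayleigh quotient $x'Ax/x'x$, and each eigenvalue is a root of the characteristic polynomial $p(\lambda) = |A - \lambda I_N|$.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 4
B = rng.standard_normal((N, N))
A = (B + B.T) / 2                      # an arbitrary symmetric N x N matrix

eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues and orthonormal eigenvectors (columns)
for lam, x in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ x, lam * x)                        # Ax = lambda x
    assert np.isclose(lam, (x @ A @ x) / (x @ x))             # lambda = x'Ax / x'x
    assert abs(np.linalg.det(A - lam * np.eye(N))) < 1e-8     # p(lambda) = 0
```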
Existence of eigenvalues. Does an $N \times N$ matrix necessarily have an eigenvalue? The corollary of the following theorem indicates that in the case of a symmetric matrix, the answer is yes.
Theorem 6.7.1. Let $A$ represent an $N \times N$ matrix. Then, there exist $N$-dimensional nonnull column vectors $x_0$ and $x_1$ such that
$$\frac{x_0'Ax_0}{x_0'x_0} \le \frac{x'Ax}{x'x} \le \frac{x_1'Ax_1}{x_1'x_1}$$
for every nonnull column vector $x$ in $\mathbb{R}^N$ (or, equivalently, such that
$$\frac{x_0'Ax_0}{x_0'x_0} = \min_{x \ne 0} \frac{x'Ax}{x'x} = \min_{x \,:\, x'x=1} x'Ax \quad \text{and} \quad \frac{x_1'Ax_1}{x_1'x_1} = \max_{x \ne 0} \frac{x'Ax}{x'x} = \max_{x \,:\, x'x=1} x'Ax\,).$$
Moreover, if $A$ is symmetric, then $x_0'Ax_0/x_0'x_0$ and $x_1'Ax_1/x_1'x_1$ are eigenvalues of $A$—they are respectively the smallest and largest eigenvalues of $A$—and $x_0$ and $x_1$ are eigenvectors corresponding to $x_0'Ax_0/x_0'x_0$ and $x_1'Ax_1/x_1'x_1$, respectively.
Proof. Let $x$ represent an $N$-dimensional column vector of (unconstrained) variables, and take $f(\cdot)$ to be the function defined (on $\mathbb{R}^N$) as follows: $f(x) = x'Ax$. Further, define $S = \{x : x'x = 1\}$. And observe that the function $f(\cdot)$ is continuous and that the set $S$ is closed and bounded. Then, upon recalling (as in Section 6.5a) that a continuous function attains a minimum value and a maximum value over any closed and bounded set, it follows that $S$ contains vectors $x_0$ and $x_1$ such that, for $x \in S$,
$$x_0'Ax_0 \le x'Ax \le x_1'Ax_1.$$
Thus, for $x \ne 0$,
$$\frac{x_0'Ax_0}{x_0'x_0} = x_0'Ax_0 \le \frac{x'Ax}{x'x} \le x_1'Ax_1 = \frac{x_1'Ax_1}{x_1'x_1}.$$
Now, suppose that $A$ is symmetric, and take $x_0$ and $x_1$ to be any $N$-dimensional nonnull column vectors such that
$$\frac{x_0'Ax_0}{x_0'x_0} \le \frac{x'Ax}{x'x} \le \frac{x_1'Ax_1}{x_1'x_1}$$
for every nonnull column vector $x$ in $\mathbb{R}^N$. And observe that, for $x \ne 0$,
$$\frac{1}{x'x}\,x'[A - (x_0'Ax_0/x_0'x_0)\,I_N]x \ge 0$$
or, equivalently,
$$x'[A - (x_0'Ax_0/x_0'x_0)\,I_N]x \ge 0.$$
Thus, $A - (x_0'Ax_0/x_0'x_0)\,I_N$ is a symmetric nonnegative definite matrix, and upon observing that
$$x_0'[A - (x_0'Ax_0/x_0'x_0)\,I_N]x_0 = 0,$$
it follows from Corollary 2.13.27 that
$$[A - (x_0'Ax_0/x_0'x_0)\,I_N]x_0 = 0.$$
It is now clear that $x_0'Ax_0/x_0'x_0$ is an eigenvalue of $A$, that $x_0$ is an eigenvector of $A$ corresponding to $x_0'Ax_0/x_0'x_0$, and (since if $\lambda$ is an eigenvalue of $A$, $\lambda = x'Ax/x'x$ for some nonnull vector $x$) that $x_0'Ax_0/x_0'x_0$ is the smallest eigenvalue of $A$. It follows from a similar argument that $x_1'Ax_1/x_1'x_1$ is an eigenvalue of $A$, that $x_1$ is an eigenvector of $A$ corresponding to $x_1'Ax_1/x_1'x_1$, and that $x_1'Ax_1/x_1'x_1$ is the largest eigenvalue of $A$. Q.E.D.
As an immediate consequence of Theorem 6.7.1, we have the following corollary.
Corollary 6.7.2. Every symmetric matrix has an eigenvalue.
Does the result of Corollary 6.7.2 extend to ($N \times N$) nonsymmetric matrices? That is, does an ($N \times N$) nonsymmetric matrix necessarily have an eigenvalue? The answer to this question depends on whether an eigenvalue is required to be a real number (and an eigenvector a vector of real numbers), as is the case herein, or whether the definition of an eigenvalue (and the definition of an eigenvector) are extended, as is done in many presentations on the subject, so that a complex number can qualify as an eigenvalue (and a vector of complex numbers as an eigenvector).
Consider, for example, the $2 \times 2$ matrix $\begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$. The characteristic polynomial of this matrix is
$$p(\lambda) = \begin{vmatrix} -\lambda & 1 \\ -1 & -\lambda \end{vmatrix} = \lambda^2 + 1,$$
which has no real roots but 2 imaginary roots (namely, $\lambda = i$ and $\lambda = -i$). In fact, the fundamental theorem of algebra—the proof of which involves some higher-level mathematics—guarantees that the characteristic polynomial of any $N \times N$ matrix (symmetric or not) has a possibly complex root and hence that any such matrix has a "possibly complex eigenvalue."
Orthogonality of the eigenvectors of a symmetric matrix. Any two eigenvectors of a symmetric matrix that correspond to different eigenvalues have the basic property described in the following lemma.
Lemma 6.7.3. Suppose that $A$ is an $N \times N$ symmetric matrix that has an eigenvector $x_1$ corresponding to an eigenvalue $\lambda_1$ and an eigenvector $x_2$ corresponding to an eigenvalue $\lambda_2$. If $\lambda_1 \ne \lambda_2$, then $x_1$ and $x_2$ are orthogonal (with respect to the usual inner product).
Proof. By definition, $Ax_1 = \lambda_1 x_1$ and $Ax_2 = \lambda_2 x_2$. Further, upon premultiplying both sides of the first equality by $x_2'$ and both sides of the second by $x_1'$, we find that
$$x_2'Ax_1 = \lambda_1 x_2'x_1 \quad \text{and} \quad x_1'Ax_2 = \lambda_2 x_1'x_2.$$
And since $A$ is symmetric, it follows that
$$\lambda_1 x_2'x_1 = x_2'Ax_1 = (x_1'Ax_2)' = (\lambda_2 x_1'x_2)' = \lambda_2 x_2'x_1,$$
implying that $(\lambda_1 - \lambda_2)\,x_2'x_1 = 0$ and hence, if $\lambda_1 \ne \lambda_2$, that $x_2'x_1 = 0$. Thus, if $\lambda_1 \ne \lambda_2$, then $x_1$ and $x_2$ are orthogonal. Q.E.D.
Diagonalization. An $N \times N$ matrix $A$ is said to be diagonalizable (or diagonable) if there exists an $N \times N$ nonsingular matrix $Q$ such that $Q^{-1}AQ = D$ for some diagonal matrix $D$, in which case $Q$ is said to diagonalize $A$ (or $A$ is said to be diagonalized by $Q$). Note (in connection with this definition) that
$$Q^{-1}AQ = D \;\Leftrightarrow\; AQ = QD \;\Leftrightarrow\; A = QDQ^{-1}. \tag{7.1}$$
An $N \times N$ matrix $A$ is said to be orthogonally diagonalizable if it can be diagonalized by an orthogonal matrix. Thus, an $N \times N$ matrix $A$ is orthogonally diagonalizable if there exists an $N \times N$ orthogonal matrix $Q$ such that $Q'AQ = D$ for some diagonal matrix $D$, in which case
$$A = QDQ'. \tag{7.2}$$
Since $QDQ'$ is symmetric, it is clear from equality (7.2) that a necessary condition for an $N \times N$ matrix $A$ to be orthogonally diagonalizable is that $A$ be symmetric—certain nonsymmetric matrices are diagonalizable; however, they are not orthogonally diagonalizable. This condition is also sufficient, as indicated by the following theorem.
Theorem 6.7.4. Every symmetric matrix is orthogonally diagonalizable.
Proof. The proof is by mathematical induction. Clearly, every $1 \times 1$ matrix is orthogonally diagonalizable. Now, suppose that every $(N-1) \times (N-1)$ symmetric matrix is orthogonally diagonalizable (where $N \ge 2$). Then, it suffices to show that every $N \times N$ symmetric matrix is orthogonally diagonalizable.
Let $A$ represent an $N \times N$ symmetric matrix. And let $\lambda$ represent an eigenvalue of $A$ (the existence of which is guaranteed by Corollary 6.7.2), and take $u$ to be an eigenvector (of $A$) with (usual) norm $1$ that corresponds to $\lambda$. Further, take $V$ to be any $N \times (N-1)$ matrix such that the $N$ vectors consisting of $u$ and the $N-1$ columns of $V$ form an orthonormal basis for $\mathbb{R}^N$—the existence of such a matrix follows from Theorem 6.6.6—or, equivalently, such that $(u, V)$ is an $N \times N$ orthogonal matrix. Then, $Au = \lambda u$, $u'u = 1$, and $V'u = 0$, and, consequently,
$$(u, V)'A(u, V) = \begin{pmatrix} u'Au & (V'Au)' \\ V'Au & V'AV \end{pmatrix} = \begin{pmatrix} \lambda & 0' \\ 0 & V'AV \end{pmatrix}.$$
Clearly, $V'AV$ is a symmetric matrix of order $N-1$, so that (by supposition) there exists an $(N-1) \times (N-1)$ orthogonal matrix $R$ such that $R'V'AVR = F$ for some diagonal matrix $F$ (of order $N-1$). Define $S = \operatorname{diag}(1, R)$, and let $P = (u, V)S$. Then,
$$S'S = \operatorname{diag}(1, R'R) = \operatorname{diag}(1, I_{N-1}) = I_N,$$
so that $S$ is orthogonal and hence (according to Lemma 2.7.1) $P$ is orthogonal. Further,
$$P'AP = S'(u, V)'A(u, V)S = S'\operatorname{diag}(\lambda, V'AV)S = \operatorname{diag}(\lambda, R'V'AVR) = \operatorname{diag}(\lambda, F),$$
so that $P'AP$ is a diagonal matrix. Thus, $A$ is orthogonally diagonalizable. Q.E.D.


Spectral decomposition: definition and some basic properties. Let $A$ represent an $N \times N$ symmetric matrix. Further, let $Q$ represent an $N \times N$ orthogonal matrix and $D$ an $N \times N$ diagonal matrix such that
$$Q'AQ = D \tag{7.3}$$
or, equivalently, such that
$$A = QDQ' \tag{7.4}$$
—the existence of an orthogonal matrix $Q$ and a diagonal matrix $D$ that satisfy condition (7.3) follows from Theorem 6.7.4. And observe that equality (7.4) is also expressible in the form
$$A = \sum_{i=1}^N d_i\, q_i q_i', \tag{7.5}$$
where (for $i = 1, 2, \dots, N$) $d_i$ represents the $i$th diagonal element of $D$ and $q_i$ the $i$th column of $Q$. Expression (7.4) or expression (7.5) is sometimes referred to as the spectral decomposition or spectral representation of the matrix $A$.
The characteristic polynomial $p(\lambda)$ (of $A$) can be reexpressed in terms related to the spectral decomposition (7.4) or (7.5); specifically, it can be reexpressed in terms of the diagonal elements $d_1, d_2, \dots, d_N$ of the diagonal matrix $D$. For $\lambda \in \mathbb{R}$,
$$p(\lambda) = |A - \lambda I_N| = |Q(D - \lambda I_N)Q'| = |Q|^2\,|D - \lambda I_N| = |D - \lambda I_N|$$
and hence
$$p(\lambda) = (-1)^N \prod_{i=1}^N (\lambda - d_i) \tag{7.6}$$
$$= (-1)^N \prod_{j=1}^K (\lambda - \lambda_j)^{N_j}, \tag{7.7}$$
where $\{\lambda_1, \lambda_2, \dots, \lambda_K\}$ is a set whose elements consist of the distinct values represented among the $N$ scalars $d_1, d_2, \dots, d_N$ and where (for $j = 1, 2, \dots, K$) $N_j$ represents the number of values of the integer $i$ (between $1$ and $N$, inclusive) for which $d_i = \lambda_j$.
In light of expression (7.6) or (7.7), it is clear that a scalar $\lambda$ is an eigenvalue of $A$ if and only if $\lambda = d_i$ for some integer $i$ (between $1$ and $N$, inclusive) or, equivalently, if and only if $\lambda$ is contained in the set $\{\lambda_1, \lambda_2, \dots, \lambda_K\}$. Accordingly, $\lambda_1, \lambda_2, \dots, \lambda_K$ may be referred to as the distinct eigenvalues of $A$. And, collectively, $d_1, d_2, \dots, d_N$ may be referred to as the not-necessarily-distinct eigenvalues of $A$. Further, the set $\{\lambda_1, \lambda_2, \dots, \lambda_K\}$ is sometimes referred to as the spectrum of $A$, and (for $j = 1, 2, \dots, K$) $N_j$ is sometimes referred to as the multiplicity of $\lambda_j$.

Clearly,
AQ D QD; (7.8)
or, equivalently,
Aqi D di qi .i D 1; 2; : : : ; N /: (7.9)
Thus, the N orthonormal vectors q1 ; q2 ; : : : ; qN are eigenvectors of A, the i th of which corresponds
to the eigenvalue di . Note that result (7.8) or (7.9) is also expressible in the form
AQj D j Qj .j D 1; 2; : : : ; K/; (7.10)
where Qj is the N  Nj matrix whose columns consist of those of the eigenvectors q1 ; q2 ; : : : ; qN
for which the corresponding eigenvalue equals j . Note also that the K equalities in the collection
(7.10) are reexpressible as
.A j IN /Qj D 0 .j D 1; 2; : : : ; K/: (7.11)

It is clear from result (7.11) that the columns of Qj are members of N.A j IN /. In fact, they
form a basis (an orthonormal basis) for N.A j IN /, as is evident upon observing (in light of
Lemma 2.11.5) that

dimŒN.A j IN / D N rank.A j IN /
DN rankŒQ0 .A j IN /Q
DN rank.D j IN / D N .N Nj / D Nj : (7.12)

In general, a distinction needs to be made between the algebraic multiplicity and the geometric
multiplicity of an eigenvalue j ; algebraic multiplicity refers to the multiplicity of  j as a factor
of the characteristic polynomial p./, whereas geometric multiplicity refers to the dimension of
the linear space N.A j IN /. However, in the present context (where the eigenvalue is that of a
symmetic matrix A), the algebraic and geometric multiplicities are equal, so that no distinction is
necessary.
To what extent is the spectral decomposition of A unique? The distinct eigenvalues 1; 2 ; : : : ; K
are (aside from order) unique and their multiplicities N1 ; N2 ; : : : ; NK are unique, as is evident from
result (7.7). Moreover, for j (an integer between 1 and K, inclusive) such that Nj D 1, Qj is
unique, as is evident from result (7.11) upon observing [in light of result (7.12)] that (if Nj D 1)
dimŒN.A j IN / D 1. For j such that Nj > 1, Qj is not uniquely determined; Qj can be taken
to be any N  Nj matrix whose columns are orthonormal eigenvectors (of A) corresponding to the
eigenvalue j or, equivalently, whose columns form an orthonormal basis for N.A j IN /—refer
to Lemma 6.7.3. However, even for j such that Nj > 1, Qj Qj0 is uniquely determined—refer, e.g.,
to Harville (1997, sec. 21.5) for a proof. Accordingly, a decomposition of A that is unique (aside
from the order of the terms) is obtained upon reexpressing decomposition (7.5) in the form

A D jKD1 j Ej ; (7.13)
P

where Ej D Qj Qj0 . Decomposition (7.13), like decompositions (7.4) and 7.5), is sometimes referred
to as the spectral decomposition.
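The spectral decomposition is available directly in standard numerical libraries. The following brief sketch (Python with NumPy; the matrix is an arbitrary symmetric choice) verifies the forms (7.4) and (7.5).

```python
import numpy as np

rng = np.random.default_rng(7)
N = 5
B = rng.standard_normal((N, N))
A = (B + B.T) / 2

# spectral decomposition A = Q D Q' with Q orthogonal and D diagonal (eigh returns ascending eigenvalues)
d, Q = np.linalg.eigh(A)
assert np.allclose(Q @ Q.T, np.eye(N))                                          # Q is orthogonal
assert np.allclose(A, Q @ np.diag(d) @ Q.T)                                     # form (7.4)
assert np.allclose(A, sum(di * np.outer(qi, qi) for di, qi in zip(d, Q.T)))     # form (7.5)
```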
Rank, trace, and determinant of a symmetric matrix. The following theorem provides expressions for the rank, trace, and determinant of a symmetric matrix (in terms of its eigenvalues).
Theorem 6.7.5. Let $A$ represent an $N \times N$ symmetric matrix with not-necessarily-distinct eigenvalues $d_1, d_2, \dots, d_N$ and with distinct eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_K$ of multiplicities $N_1, N_2, \dots, N_K$, respectively. Then,
(1) $\operatorname{rank} A = N - N_0$, where $N_0 = N_j$ if $\lambda_j = 0$ ($1 \le j \le K$) and where $N_0 = 0$ if $0 \notin \{\lambda_1, \lambda_2, \dots, \lambda_K\}$ (i.e., where $N_0$ equals the multiplicity of the eigenvalue $0$ if $0$ is an eigenvalue of $A$ and equals $0$ otherwise);
(2) $\operatorname{tr}(A) = \sum_{i=1}^N d_i = \sum_{j=1}^K N_j \lambda_j$; and
(3) $\det(A) = \prod_{i=1}^N d_i = \prod_{j=1}^K \lambda_j^{N_j}$.
Proof. Let $Q$ represent an $N \times N$ orthogonal matrix such that $A = QDQ'$, where $D = \operatorname{diag}(d_1, d_2, \dots, d_N)$—the existence of such a matrix follows from the results of the preceding part of the present subsection (i.e., the part pertaining to the spectral decomposition).
(1) Clearly, $\operatorname{rank} A$ equals $\operatorname{rank} D$, and $\operatorname{rank} D$ equals the number of diagonal elements of $D$ that are nonzero. Thus, $\operatorname{rank} A = N - N_0$.
(2) Making use of Lemma 2.3.1, we find that
$$\operatorname{tr}(A) = \operatorname{tr}(QDQ') = \operatorname{tr}(DQ'Q) = \operatorname{tr}(DI) = \operatorname{tr}(D) = \sum_{i=1}^N d_i = \sum_{j=1}^K N_j \lambda_j.$$
(3) Making use of result (2.14.25), Lemma 2.14.3, and Corollary 2.14.19, we find that
$$|A| = |QDQ'| = |Q|\,|D|\,|Q'| = |Q|^2\,|D| = |D| = \prod_{i=1}^N d_i = \prod_{j=1}^K \lambda_j^{N_j}.$$
Q.E.D.
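The three parts of Theorem 6.7.5 are easy to verify numerically, as in the sketch below (Python with NumPy; the rank-deficient matrix is an arbitrary illustrative choice, built so that $0$ is an eigenvalue of known multiplicity).

```python
import numpy as np

rng = np.random.default_rng(8)
N, r = 5, 3
X = rng.standard_normal((N, r))
A = X @ X.T                            # symmetric, rank r, so 0 is an eigenvalue of multiplicity N - r

d = np.linalg.eigvalsh(A)              # the not-necessarily-distinct eigenvalues d_1, ..., d_N
print(np.linalg.matrix_rank(A), np.sum(np.abs(d) > 1e-10))    # rank A = N - N_0
print(np.trace(A), d.sum())                                    # tr(A) = sum of the eigenvalues
print(np.linalg.det(A), d.prod())                              # det(A) = product of the eigenvalues (here ~0)
```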
When is a symmetric matrix nonnegative definite, positive definite, or positive semidefinite? Let $A$ represent an $N \times N$ symmetric matrix. And take $Q$ to be an $N \times N$ orthogonal matrix and $D$ an $N \times N$ diagonal matrix such that $A = QDQ'$—the existence of such an orthogonal matrix and such a diagonal matrix follows from Theorem 6.7.4. Then, upon recalling (from the discussion of the spectral decomposition) that the diagonal elements of $D$ constitute the (not-necessarily-distinct) eigenvalues of $A$ and upon applying Corollary 2.13.16, we arrive at the following result.
Theorem 6.7.6. Let $A$ represent an $N \times N$ symmetric matrix with not-necessarily-distinct eigenvalues $d_1, d_2, \dots, d_N$. Then, (1) $A$ is nonnegative definite if and only if $d_1, d_2, \dots, d_N$ are nonnegative; (2) $A$ is positive definite if and only if $d_1, d_2, \dots, d_N$ are (strictly) positive; and (3) $A$ is positive semidefinite if and only if $d_i \ge 0$ for $i = 1, 2, \dots, N$ with equality holding for one or more values of $i$.
When is a symmetric matrix idempotent? The following theorem characterizes the idempotency of a symmetric matrix in terms of its eigenvalues.
Theorem 6.7.7. An $N \times N$ symmetric matrix is idempotent if (and only if) it has no eigenvalues other than $0$ or $1$.
Proof. Let $A$ represent an $N \times N$ symmetric matrix, and denote by $d_1, d_2, \dots, d_N$ its not-necessarily-distinct eigenvalues. And observe (in light of the discussion of the spectral decomposition) that there exists an $N \times N$ orthogonal matrix $Q$ such that $A = QDQ'$, where $D = \operatorname{diag}(d_1, d_2, \dots, d_N)$. Observe also that
$$A^2 = QDQ'QDQ' = QD^2Q'.$$
Thus,
$$A^2 = A \;\Leftrightarrow\; D^2 = D \;\Leftrightarrow\; d_i^2 = d_i \;(i = 1, 2, \dots, N).$$
Moreover, $d_i^2 = d_i$ if and only if either $d_i = 0$ or $d_i = 1$. It is now clear that $A$ is idempotent if (and only if) it has no eigenvalues other than $0$ or $1$. Q.E.D.
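A familiar example of Theorem 6.7.7 is an orthogonal projection matrix, whose eigenvalues consist entirely of $0$'s and $1$'s. The sketch below (Python with NumPy; the projection matrix is an arbitrary illustrative choice) checks both the idempotency and the eigenvalue pattern.

```python
import numpy as np

rng = np.random.default_rng(9)
N, R = 5, 2
X = rng.standard_normal((N, R))
A = X @ np.linalg.inv(X.T @ X) @ X.T        # an orthogonal projection matrix: symmetric and idempotent

print(np.allclose(A @ A, A))                # A^2 = A
print(np.round(np.linalg.eigvalsh(A), 6))   # eigenvalues are only 0's and 1's (R ones, N - R zeros)
```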

b. Reexpression of a quadratic form (in a normally distributed random vector) as a linear combination of independently distributed random variables
Let $z$ represent an $N$-dimensional random column vector that has an $N(0, I_N)$ distribution. And consider the distribution of a second-degree polynomial (in $z$), say the second-degree polynomial $q = c + b'z + z'Az$, where $c$ is a constant, $b$ an $N$-dimensional column vector of constants, and $A$ an $N \times N$ symmetric matrix of constants. Further, take $O$ to be an $N \times N$ orthogonal matrix and $D$ an $N \times N$ diagonal matrix such that
$$A = ODO' \tag{7.14}$$
—the existence of such an orthogonal matrix and such a diagonal matrix follows from Theorem 6.7.4. As previously indicated (in Subsection a), the representation (7.14) is sometimes referred to as the spectral decomposition or representation (of the matrix $A$).
The second-degree polynomial $q$ can be reexpressed in terms related to the representation (7.14). Let $r = O'b$ and $u = O'z$. Then, clearly,
$$q = c + r'u + u'Du = \sum_{i=1}^N (c_i + r_i u_i + d_i u_i^2), \tag{7.15}$$
where $c_1, c_2, \dots, c_N$ represent any constants such that $\sum_{i=1}^N c_i = c$ and where (for $i = 1, 2, \dots, N$) $r_i$ represents the $i$th element of $r$, $u_i$ the $i$th element of $u$, and $d_i$ the $i$th diagonal element of $D$. Moreover, $u \sim N(0, I_N)$ or, equivalently, $u_1, u_2, \dots, u_N$ are distributed independently and identically as $N(0, 1)$.
Now, consider the distribution of $q$ in the special case where $c = 0$ and $b = 0$, that is, the special case where $q$ is the quadratic form $z'Az$. Based on result (7.15), we have that
$$z'Az = u'Du = \sum_{i=1}^N d_i u_i^2. \tag{7.16}$$
And it follows that
$$z'Az \sim \sum_{i=1}^N d_i w_i, \tag{7.17}$$
where $w_1, w_2, \dots, w_N$ are random variables that are distributed independently and identically as $\chi^2(1)$.
According to result (7.17), $z'Az$ is distributed as a linear combination of the random variables $w_1, w_2, \dots, w_N$. Upon observing that the coefficients in this linear combination are $d_1, d_2, \dots, d_N$ and upon recalling (from Subsection a) that $d_1, d_2, \dots, d_N$ are the not-necessarily-distinct eigenvalues of the matrix $A$, we obtain the further result that
$$z'Az \sim \sum_{j=1}^K \lambda_j v_j, \tag{7.18}$$
where $\lambda_1, \lambda_2, \dots, \lambda_K$ are the distinct eigenvalues of $A$ with multiplicities $N_1, N_2, \dots, N_K$, respectively, and where $v_1, v_2, \dots, v_K$ are random variables that are distributed independently as $\chi^2(N_1), \chi^2(N_2), \dots, \chi^2(N_K)$, respectively.
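Result (7.17) can be illustrated by simulation. The sketch below (Python with NumPy; the matrix and sample sizes are arbitrary illustrative choices) generates one sample of $z'Az$ directly and another sample of $\sum_i d_i w_i$ with $w_i \sim \chi^2(1)$, and compares a few quantiles of the two samples.

```python
import numpy as np

rng = np.random.default_rng(10)
N = 4
B = rng.standard_normal((N, N))
A = (B + B.T) / 2
d = np.linalg.eigvalsh(A)              # the not-necessarily-distinct eigenvalues d_1, ..., d_N

n = 500_000
z = rng.standard_normal((n, N))
lhs = np.einsum("ij,jk,ik->i", z, A, z)                   # samples of z'Az
rhs = (rng.standard_normal((n, N)) ** 2) @ d              # samples of sum_i d_i w_i, w_i ~ chi-square(1)

# per (7.17), the two samples estimate the same distribution
print(np.quantile(lhs, [0.1, 0.5, 0.9]))
print(np.quantile(rhs, [0.1, 0.5, 0.9]))
```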
The moment generating function of the second-degree polynomial $q$ can be expressed in terms of $c$, the elements of $r$, and the not-necessarily-distinct eigenvalues $d_1, d_2, \dots, d_N$ of the matrix $A$. Let $m(\cdot)$ represent the moment generating function of $q$, define $t_0 = 2\min_i d_i$ and $t_1 = 2\max_i d_i$, and take
$$S = \{\, t \in \mathbb{R} : I_N - 2tD \text{ is positive definite} \,\}$$
or, equivalently,
$$S = \begin{cases} (1/t_0,\, 1/t_1), & \text{if } t_0 < 0 \text{ and } t_1 > 0, \\ (1/t_0,\, \infty), & \text{if } t_0 < 0 \text{ and } t_1 \le 0, \\ (-\infty,\, 1/t_1), & \text{if } t_0 \ge 0 \text{ and } t_1 > 0, \\ (-\infty,\, \infty), & \text{if } t_0 = t_1 = 0. \end{cases}$$
Then, upon regarding $q$ as a second-degree polynomial in the random vector $u$ (rather than in the random vector $z$) and applying formula (5.9), we find that, for $t \in S$,
$$m(t) = |I - 2tD|^{-1/2} \exp[\,tc + (1/2)\,t^2 r'(I - 2tD)^{-1}r\,] \tag{7.19}$$
$$= \prod_{i=1}^N (1 - 2td_i)^{-1/2} \exp\bigl[\,tc + (1/2)\,t^2 \textstyle\sum_{i=1}^N (1 - 2td_i)^{-1} r_i^2\,\bigr]. \tag{7.20}$$
In the special case where $c = 0$ and $b = 0$, that is, the special case where $q = z'Az$, we have that, for $t \in S$,
$$m(t) = |I - 2tD|^{-1/2} = \prod_{i=1}^N (1 - 2td_i)^{-1/2}. \tag{7.21}$$

An extension. The results on the distribution of a second-degree polynomial in an $N$-dimensional random column vector that has an $N(0, I_N)$ distribution can be readily extended to the distribution of a second-degree polynomial in an $N$-dimensional random column vector that has an $N$-variate normal distribution with an arbitrary mean vector and arbitrary variance-covariance matrix. Let $x$ represent an $N$-dimensional random column vector that has an $N(\mu, \Sigma)$ distribution. And consider the distribution of the second-degree polynomial $q = c + b'x + x'Ax$ (where, as before, $c$ is a constant, $b$ an $N$-dimensional column vector of constants, and $A$ a symmetric matrix of constants).
Take $\Gamma$ to be any matrix (with $N$ columns) such that $\Sigma = \Gamma'\Gamma$, and denote by $P$ the number of rows in $\Gamma$. Further, take $z$ to be a $P$-dimensional random column vector that has an $N(0, I_P)$ distribution. Then, $x \sim \mu + \Gamma'z$, and as observed earlier [in result (5.4)],
$$q \sim c + b'\mu + \mu'A\mu + [\Gamma(b + 2A\mu)]'z + z'\Gamma A\Gamma'z. \tag{7.22}$$
Thus, upon applying result (7.15) with $c + b'\mu + \mu'A\mu$, $\Gamma(b + 2A\mu)$, and $\Gamma A\Gamma'$ in place of $c$, $b$, and $A$, respectively, and taking $O$ to be a $P \times P$ orthogonal matrix and $D = \{d_i\}$ a $P \times P$ diagonal matrix such that $\Gamma A\Gamma' = ODO'$, we find that
$$q \sim k + r'u + u'Du = \sum_{i=1}^P (k_i + r_i u_i + d_i u_i^2), \tag{7.23}$$
where $k = c + b'\mu + \mu'A\mu$ and $k_1, k_2, \dots, k_P$ are any constants such that $\sum_{i=1}^P k_i = k$, where $r = \{r_i\} = O'\Gamma(b + 2A\mu)$, and where $u = \{u_i\}$ is a $P$-dimensional random column vector that has an $N(0, I_P)$ distribution [or, equivalently, whose elements $u_1, u_2, \dots, u_P$ are distributed independently and identically as $N(0, 1)$].
Note that the same approach that led to result (7.23) could be used to obtain a formula for the moment generating function of $q$ that could be regarded as an extension of formula (7.19) or (7.20). It is simply a matter of applying formula (7.19) or (7.20) to the second-degree polynomial in $z$ that has the same distribution as $q$—refer to result (7.22). That is, it is simply a matter of applying formula (7.19) or (7.20) with $c + b'\mu + \mu'A\mu$, $\Gamma(b + 2A\mu)$, and $\Gamma A\Gamma'$ in place of $c$, $b$, and $A$, respectively.
A variation. Let $z$ represent an $N$-dimensional random column vector that has an $N(0, I_N)$ distribution. And consider the distribution of the ratio $z'Az/z'z$, where $A$ is an $N \times N$ symmetric matrix of constants. Note that $z'Az/z'z$ is reexpressible as a quadratic form $[(z'z)^{-1/2}z]'A[(z'z)^{-1/2}z]$ in the normalized vector $(z'z)^{-1/2}z$, which is distributed uniformly on the surface of an $N$-dimensional unit ball.
Take $O$ to be an $N \times N$ orthogonal matrix and $D$ an $N \times N$ diagonal matrix such that $A = ODO'$ (i.e., such that $A = ODO'$ is the spectral decomposition of $A$). And define $u = O'z$, and (for $i = 1, 2, \dots, N$) denote by $u_i$ the $i$th element of $u$ and by $d_i$ the $i$th diagonal element of $D$ (so that $d_i$ is the $i$th of the not-necessarily-distinct eigenvalues of $A$). Then, $u \sim N(0, I_N)$, and as was established earlier—refer to result (7.16)—
$$z'Az = \sum_{i=1}^N d_i u_i^2.$$
Further,
$$z'z = z'Iz = z'OO'z = u'u = \sum_{i=1}^N u_i^2.$$
Thus,
$$\frac{z'Az}{z'z} = \frac{\sum_{i=1}^N d_i u_i^2}{\sum_{i=1}^N u_i^2} \sim \frac{\sum_{j=1}^K \lambda_j v_j}{\sum_{j=1}^K v_j},$$
where $\lambda_1, \lambda_2, \dots, \lambda_K$ are the distinct eigenvalues of $A$ with multiplicities $N_1, N_2, \dots, N_K$, respectively, and $v_1, v_2, \dots, v_K$ are random variables that are distributed independently as $\chi^2(N_1), \chi^2(N_2), \dots, \chi^2(N_K)$, respectively. Upon recalling the definition of the Dirichlet distribution (and the relationship between the chi-square distribution and the gamma distribution), we conclude that
$$\frac{z'Az}{z'z} \sim \sum_{j=1}^K \lambda_j w_j, \tag{7.24}$$
where $w_1, w_2, \dots, w_{K-1}$ are random variables that are jointly distributed as $\operatorname{Di}\bigl(\tfrac{N_1}{2}, \tfrac{N_2}{2}, \dots, \tfrac{N_{K-1}}{2}, \tfrac{N_K}{2};\, K-1\bigr)$ and where $w_K = 1 - \sum_{j=1}^{K-1} w_j$—in the "degenerate" case where $K = 1$, $w_1 = 1$ and $z'Az/z'z = \lambda_1$.
Result (7.24) can be generalized. Suppose that $x$ is an $N$-dimensional random column vector that is distributed as $N(0, \Sigma)$ (where $\Sigma \ne 0$), let $P = \operatorname{rank}\Sigma$, and take $A$ to be an $N \times N$ symmetric matrix of constants. Further, take $\Gamma$ to be a matrix of dimensions $P \times N$ such that $\Sigma = \Gamma'\Gamma$ (the existence of which follows from Corollary 2.13.23), and take $z$ to be a $P$-dimensional random column vector that has an $N(0, I_P)$ distribution. Then, as in the case of result (6.17), we have that
$$\frac{x'Ax}{x'\Sigma^-x} \sim \frac{z'\Gamma A\Gamma'z}{z'z}.$$
And upon applying result (7.24) (with the $P \times P$ matrix $\Gamma A\Gamma'$ in place of the $N \times N$ matrix $A$), we find that
$$\frac{x'Ax}{x'\Sigma^-x} \sim \sum_{j=1}^K \lambda_j w_j, \tag{7.25}$$
where $\lambda_1, \lambda_2, \dots, \lambda_K$ are the distinct eigenvalues of $\Gamma A\Gamma'$ with multiplicities $N_1, N_2, \dots, N_K$, respectively, where $w_1, w_2, \dots, w_{K-1}$ are random variables that are jointly distributed as $\operatorname{Di}\bigl(\tfrac{N_1}{2}, \tfrac{N_2}{2}, \dots, \tfrac{N_{K-1}}{2}, \tfrac{N_K}{2};\, K-1\bigr)$, and where $w_K = 1 - \sum_{j=1}^{K-1} w_j$—in the "degenerate" case where $K = 1$, $w_1 = 1$ and $x'Ax/x'\Sigma^-x = \lambda_1$ (with probability 1).


Applicability to spherically or elliptically distributed random vectors. Result (7.24) on the distri-
bution of the ratio z0 Az=z0 z (where z is an N -dimensional random column vector and A an N  N
symmetric matrix of constants) was derived under the assumption that z  N.0; IN /. However, as
discussed earlier (in Section 6.6e), z0 Az=z0 z has the same distribution when the distribution of z is
taken to be an arbitrary absolutely continuous spherical distribution as it does when z  N.0; IN /.
Thus, result (7.24) is applicable even if the distribution of z is an absolutely continuous distribution
other than the N.0; IN / distribution.
Now, consider result (7.25) on the distribution of the ratio x0 Ax=x0 † x, where x is an N -
dimensional random column vector that has an N.0; †/ distribution (and where † ¤ 0 and where
A is an N  N symmetric matrix of constants). Let P D rank †, take € to be a P  N matrix
such that † D € 0 €, and take z to be a P -dimensional random column vector that has an absolutely
continuous spherical distribution. Then, € 0 z is distributed elliptically about 0. Moreover,
.€ 0 z/0 A€ 0 z x0 Ax
 :
.€ 0 z/0 † € 0 z x0 † x
—refer to Section 6.6e. Thus, result (7.25) is applicable when x is taken to be an N -dimensional
random column vector whose distribution is that of the elliptically distributed random vector € 0 z as
well as when x  N.0; †/—note that € 0 z  N.0; †/ in the special case where z  N.0; IP /.

c. Some properties of polynomials (in a single variable)


Let us consider some properties of polynomials (in a single variable). The immediate objective in
doing so is to set the stage for proving (in Subsection d) the “necessity part” of Theorem 6.6.1.

Let x represent a real variable. Then, a function of x, say p(x), that is expressible in the form
    p(x) = a₀ + a₁x + a₂x² + ⋯ + a_N x^N,
where N is a nonnegative integer and where the coefficients a₀, a₁, a₂, ..., a_N are real numbers, is referred to as a polynomial (in x). The polynomial p(x) is said to be nonzero if one or more of the coefficients a₀, a₁, a₂, ..., a_N are nonzero, in which case the largest nonnegative integer k such that a_k ≠ 0 is referred to as the degree of p(x) and is denoted by the symbol deg[p(x)]. When it causes no confusion, p(x) may be abbreviated to p {and deg[p(x)] to deg(p)}.
A polynomial q is said to be a factor of a polynomial p if there exists a polynomial r such that p ≡ qr. And a real number c is said to be a root (or a zero) of a polynomial p if p(c) = 0.
A basic property of polynomials is as follows.
Theorem 6.7.8. Let p(x) and q(x) represent polynomials (in a variable x). And suppose that p(x) = q(x) for all x in some nondegenerate interval. Or, more generally, taking N to be a nonnegative integer such that N > deg(p) (if p is nonzero) and N > deg(q) (if q is nonzero), suppose that p(x) = q(x) for N distinct values of x. Then, p(x) = q(x) for all x.
Proof. Suppose that N > 0, and take x₁, x₂, ..., x_N to be N distinct values of x such that p(x_i) = q(x_i) for i = 1, 2, ..., N—if N = 0, then neither p nor q is nonzero, in which case p(x) = 0 = q(x) for all x. And observe that there exist real numbers a₀, a₁, a₂, ..., a_{N−1} and b₀, b₁, b₂, ..., b_{N−1} such that
    p(x) = a₀ + a₁x + a₂x² + ⋯ + a_{N−1}x^{N−1}
and
    q(x) = b₀ + b₁x + b₂x² + ⋯ + b_{N−1}x^{N−1}.
Further, let p = [p(x₁), p(x₂), ..., p(x_N)]′ and q = [q(x₁), q(x₂), ..., q(x_N)]′, define a = (a₀, a₁, a₂, ..., a_{N−1})′ and b = (b₀, b₁, b₂, ..., b_{N−1})′, and take H to be the N × N matrix with ijth element x_i^{j−1}—by convention, 0⁰ = 1. Then, clearly,
    H(a − b) = Ha − Hb = p − q = 0.
Moreover, the matrix H is a Vandermonde matrix, and upon applying result (5.3.13), it follows that H is nonsingular. Thus, a − b = H⁻¹0 = 0, implying that a = b and hence that p(x) = q(x) for all x. Q.E.D.
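A quick numerical illustration (my own example values, not from the text) of the argument just given: a polynomial of degree less than N is determined by its values at N distinct points, because the Vandermonde matrix H with ijth element x_i^{j−1} is nonsingular.

```python
import numpy as np

x_pts = np.array([-1.0, 0.0, 0.5, 2.0])          # N = 4 distinct values of x
a = np.array([1.0, -2.0, 0.0, 3.0])              # coefficients of p (degree 3)
H = np.vander(x_pts, N=4, increasing=True)       # H[i, j] = x_i**j
p_vals = H @ a                                   # p(x_1), ..., p(x_N)
# recover the coefficients of the unique degree-<4 polynomial through these values
b = np.linalg.solve(H, p_vals)
print(np.allclose(a, b), np.linalg.cond(H))      # True, and H is well away from singular
```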
Some additional basic properties of polynomials are described in the following four theorems.
Theorem 6.7.9 (the division algorithm). Let p and q represent polynomials. Suppose that q is nonzero. Then, there exist unique polynomials b and r such that
    p ≡ bq + r,
where either r ≡ 0 or else deg(r) < deg(q).
Theorem 6.7.10. Let p(x) represent a nonzero polynomial (in x) of degree N. Then, for any real number c, p(x) has a unique representation of the form
    p(x) = b₀ + b₁(x − c) + b₂(x − c)² + ⋯ + b_N(x − c)^N,
where b₀, b₁, b₂, ..., b_N are real numbers.
Theorem 6.7.11 (the factor theorem). A real number c is a root of a polynomial p(x) (in x) if and only if the polynomial x − c is a factor of p(x).
Theorem 6.7.12. Let p(x), q(x), and r(x) represent polynomials (in x). And suppose that
    p(x)q(x) = (x − c)^M r(x),
where M is a positive integer and where c is a real number that is not a root of p(x). Then, (x − c)^M is a factor of q(x); that is, there exists a polynomial s(x) such that
    q(x) = (x − c)^M s(x).
Refer, for example, to Beaumont and Pierce (1963) for proofs of Theorems 6.7.9, 6.7.10, and
6.7.11, which are equivalent to their theorems 9-3.3, 9-3.5, and 9-7.5, and to Harville (1997, appendix
to chap. 21) for a proof of Theorem 6.7.12.
The various basic properties of polynomials can be used to establish the following result.
Theorem 6.7.13. Let r₁(x), s₁(x), and s₂(x) represent polynomials in a real variable x. And take r₂(x) to be a function of x defined as follows:
    r₂(x) = γ(x − λ₁)^{M₁}(x − λ₂)^{M₂} ⋯ (x − λ_K)^{M_K},
where K is a nonnegative integer, where M₁, M₂, ..., M_K are (strictly) positive integers, where γ is a nonzero real number, and where λ₁, λ₂, ..., λ_K are real numbers—when K = 0, r₂(x) = γ. Suppose that
    log[s₁(x)/s₂(x)] = r₁(x)/r₂(x)
for all x in some nondegenerate interval [that does not include λ₁, λ₂, ..., λ_K or any roots of s₂(x)]. Then, there exists a real number α such that r₁(x) = αr₂(x) for all x. Further, s₁(x) = e^α s₂(x) for all x.
For a proof of Theorem 6.7.13, refer to Harville (1997, sec. 21.16) or to Harville and Kempthorne
(1997, sec. 3).

d. Proof of the "necessity part" of Theorem 6.6.1

Let z represent an M-dimensional random column vector that has an N(0, I_M) distribution, and take q = c + b′z + z′Az, where c is a constant, b an M-dimensional column vector of constants, and A an M × M (nonnull) symmetric matrix of constants. Let us complete the proof of Theorem 6.6.1 by showing that if q ∼ χ²(R, λ) (for some strictly positive integer R), then A, b, and c satisfy the conditions
    A² = A,  b = Ab,  and  c = ¼ b′b
and R and λ are such that
    R = rank A = tr(A)  and  λ = c.
And let us do so in a way that takes advantage of the results of Subsections a, b, and c.
Define K = rank A. And take O to be an M × M orthogonal matrix such that
    O′AO = diag(d₁, d₂, ..., d_M)
for some scalars d₁, d₂, ..., d_M—the existence of such an M × M orthogonal matrix follows from Theorem 6.7.4—and define u = O′b. In light of Theorem 6.7.5, K of the scalars d₁, d₂, ..., d_M (which are the not-necessarily-distinct eigenvalues of A) are nonzero, and the rest of them equal 0. Assume (without loss of generality) that it is the first K of the scalars d₁, d₂, ..., d_M that are nonzero, so that d_{K+1} = d_{K+2} = ⋯ = d_M = 0—this assumption can always be satisfied by reordering d₁, d₂, ..., d_M and the corresponding columns of O (as necessary). Further, letting D₁ = diag(d₁, d₂, ..., d_K) and partitioning O as O = (O₁, O₂) (where the dimensions of O₁ are M × K), observe that
    A = O₁D₁O₁′.
Let m(·) represent the moment generating function of q. Then, upon applying result (7.20), we find that, for every scalar t in some nondegenerate interval that includes 0,
    m(t) = ∏_{i=1}^K (1 − 2td_i)^{−1/2} exp{tc + (t²/2)[Σ_{i=1}^K u_i²/(1 − 2td_i) + Σ_{i=K+1}^M u_i²]},    (7.26)
where u₁, u₂, ..., u_M represent the elements of u.
Now, suppose that q ∼ χ²(R, λ). Then, it follows from result (2.18) that, for every scalar t such that t < 1/2,
    m(t) = (1 − 2t)^{−R/2} exp[λt/(1 − 2t)].    (7.27)
And upon equating expressions (7.26) and (7.27) and squaring both sides of the resultant equality, we find that, for every scalar t in some interval I that includes 0,
    ∏_{i=1}^K (1 − 2td_i) / (1 − 2t)^R = exp{2tc + t²[Σ_{i=1}^K u_i²/(1 − 2td_i) + Σ_{i=K+1}^M u_i²] − 2λt/(1 − 2t)}.    (7.28)
Upon applying Theorem 6.7.13 to result (7.28) [and observing that, at t = 0, ∏_{i=1}^K (1 − 2td_i) = (1 − 2t)^R], we obtain the following result:
    ∏_{i=1}^K (1 − 2td_i) = (1 − 2t)^R  (for all t).    (7.29)
Thus, R = K [since, otherwise, the polynomials forming the left and right sides of equality (7.29) would be of different degrees]. And, for i = 1, 2, ..., K, d_i = 1 [since the left side of equality (7.29) has a root at t = 1/(2d_i), while, if d_i ≠ 1, the right side does not]. We conclude (on the basis of Theorem 6.7.7) that A² = A and (in light of Corollary 2.8.3) that K = tr(A).
It remains to show that b = Ab and that λ = c = ¼ b′b. Since R = K and d₁ = d₂ = ⋯ = d_K = 1, it follows from result (7.28) that (for t ∈ I)
    2tc + t²[Σ_{i=1}^K u_i²/(1 − 2t) + Σ_{i=K+1}^M u_i²] − 2λt/(1 − 2t) = 0.    (7.30)
And upon multiplying both sides of equality (7.30) by 1 − 2t, we obtain the equality
    2(c − λ)t − 4[c − ¼ Σ_{i=1}^M u_i²]t² − 2t³ Σ_{i=K+1}^M u_i² = 0
[which, since both sides are polynomials in t, holds for all t]. Thus,
    c − λ = c − ¼ Σ_{i=1}^M u_i² = Σ_{i=K+1}^M u_i² = 0.
Moreover,
    Σ_{i=1}^M u_i² = u′u = (O′b)′O′b = b′b,
and
    Σ_{i=K+1}^M u_i² = (u_{K+1}, u_{K+2}, ..., u_M)(u_{K+1}, u_{K+2}, ..., u_M)′ = (O₂′b)′O₂′b.
We conclude that
    λ = c = ¼ b′b
and that O₂′b = 0 and hence that
    b = OO′b = (O₁O₁′ + O₂O₂′)b = O₁O₁′b = O₁I_K O₁′b = O₁D₁O₁′b = Ab.
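A simulation sketch (the idempotent A, vector b, and constant c below are hand-picked examples of my own) of the characterization just established: when A² = A, b = Ab, and c = ¼ b′b, the second-degree polynomial q = c + b′z + z′Az is noncentral chi-square with R = rank A = tr(A) degrees of freedom and noncentrality λ = c.

```python
import numpy as np

rng = np.random.default_rng(1)
M, R = 5, 3
O, _ = np.linalg.qr(rng.standard_normal((M, M)))
A = O[:, :R] @ O[:, :R].T                 # symmetric idempotent of rank R
b = A @ rng.standard_normal(M)            # forces b = Ab
c = 0.25 * b @ b                          # forces c = b'b/4  (so λ = c)

z = rng.standard_normal((200000, M))
q = c + z @ b + np.einsum('ij,jk,ik->i', z, A, z)
ref = rng.noncentral_chisquare(df=R, nonc=c, size=200000)
print(q.mean(), ref.mean())               # both ≈ R + λ
print(np.quantile(q, [0.25, 0.5, 0.75]), np.quantile(ref, [0.25, 0.5, 0.75]))
```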

6.8 More on the Distribution of Quadratic Forms or Second-Degree Polynomials (in a Normally Distributed Random Vector)
a. Statistical independence of quadratic forms or second-degree polynomials (in a
normally distributed random vector)
The following theorem can be used to determine whether or not two or more quadratic forms or
second-degree polynomials (in a normally distributed random vector) are statistically independent.
Theorem 6.8.1. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution. And, for i = 1, 2, ..., K, take q_i = c_i + b_i′x + x′A_ix, where c_i is a constant, b_i an M-dimensional column vector of constants, and A_i an M × M symmetric matrix of constants. Then, q₁, q₂, ..., q_K are distributed independently if and only if, for j ≠ i = 1, 2, ..., K,
    ΣA_iΣA_jΣ = 0,    (8.1)
    ΣA_iΣ(b_j + 2A_jμ) = 0,    (8.2)
and
    (b_i + 2A_iμ)′Σ(b_j + 2A_jμ) = 0.    (8.3)
In connection with conditions (8.1) and (8.3), it is worth noting that
    ΣA_iΣA_jΣ = 0 ⇔ ΣA_jΣA_iΣ = 0
and
    (b_i + 2A_iμ)′Σ(b_j + 2A_jμ) = 0 ⇔ (b_j + 2A_jμ)′Σ(b_i + 2A_iμ) = 0,
as is evident upon “taking transposes.”
In the special case of second-degree polynomials in a normally distributed random vector that
has a nonsingular variance-covariance matrix, we obtain the following corollary of Theorem 6.8.1.
Corollary 6.8.2. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution, where Σ is nonsingular. And, for i = 1, 2, ..., K, take q_i = c_i + b_i′x + x′A_ix, where c_i is a constant, b_i an M-dimensional column vector of constants, and A_i an M × M symmetric matrix of constants. Then, q₁, q₂, ..., q_K are distributed independently if and only if, for j ≠ i = 1, 2, ..., K,
    A_iΣA_j = 0,    (8.4)
    A_iΣb_j = 0,    (8.5)
and
    b_i′Σb_j = 0.    (8.6)

Note that in the special case of Corollary 6.8.2 where q₁, q₂, ..., q_K are quadratic forms (i.e., the special case where c₁ = c₂ = ⋯ = c_K = 0 and b₁ = b₂ = ⋯ = b_K = 0), conditions (8.5) and (8.6) are vacuous; in this special case, only condition (8.4) is "operative." Note also that in the special case of Corollary 6.8.2 where Σ = I, Corollary 6.8.2 reduces to the following result.
Theorem 6.8.3. Let x represent an M-dimensional random column vector that has an N(μ, I_M) distribution. And, for i = 1, 2, ..., K, take q_i = c_i + b_i′x + x′A_ix, where c_i is a constant, b_i an M-dimensional column vector of constants, and A_i an M × M symmetric matrix of constants. Then, q₁, q₂, ..., q_K are distributed independently if and only if, for j ≠ i = 1, 2, ..., K,
    A_iA_j = 0,  A_ib_j = 0,  and  b_i′b_j = 0.    (8.7)

Verification of theorems. To prove Theorem 6.8.1, it suffices to prove Theorem 6.8.3. In fact, it suffices to prove the special case of Theorem 6.8.3 where μ = 0. To see this, define x and q₁, q₂, ..., q_K as in Theorem 6.8.1. That is, take x to be an M-dimensional random column vector that has an N(μ, Σ) distribution. And, for i = 1, 2, ..., K, take q_i = c_i + b_i′x + x′A_ix, where c_i is a constant, b_i an M-dimensional column vector of constants, and A_i an M × M symmetric matrix of constants. Further, take Γ to be any matrix (with M columns) such that Σ = Γ′Γ, denote by P the number of rows in Γ, and define z to be a P-dimensional random column vector that has an N(0, I_P) distribution. Then, x ∼ μ + Γ′z, and hence the joint distribution of q₁, q₂, ..., q_K is identical to that of q₁*, q₂*, ..., q_K*, where (for i = 1, 2, ..., K)
    q_i* = c_i + b_i′(μ + Γ′z) + (μ + Γ′z)′A_i(μ + Γ′z).
For i = 1, 2, ..., K, q_i* is reexpressible in the form
    q_i* = c_i + b_i′μ + μ′A_iμ + [Γ(b_i + 2A_iμ)]′z + z′ΓA_iΓ′z,
which is a second-degree polynomial in z. Thus, it follows from Theorem 6.8.3 (upon applying the theorem to q₁*, q₂*, ..., q_K*) that q₁*, q₂*, ..., q_K* (and hence q₁, q₂, ..., q_K) are statistically independent if and only if, for j ≠ i = 1, 2, ..., K,
    ΓA_iΓ′ΓA_jΓ′ = 0,    (8.8)
    ΓA_iΓ′Γ(b_j + 2A_jμ) = 0,    (8.9)
and
    (b_i + 2A_iμ)′Γ′Γ(b_j + 2A_jμ) = 0.    (8.10)
Moreover, in light of Corollary 2.3.4,
    ΓA_iΓ′ΓA_jΓ′ = 0 ⇔ Γ′ΓA_iΓ′ΓA_jΓ′ = 0 ⇔ Γ′ΓA_iΓ′ΓA_jΓ′Γ = 0
and
    ΓA_iΓ′Γ(b_j + 2A_jμ) = 0 ⇔ Γ′ΓA_iΓ′Γ(b_j + 2A_jμ) = 0.
And we conclude that the conditions [conditions (8.8), (8.9), and (8.10)] derived from the application of Theorem 6.8.3 to q₁*, q₂*, ..., q_K* are equivalent to conditions (8.1), (8.2), and (8.3) (of Theorem 6.8.1).
Theorem 6.8.3, like Theorem 6.6.1, pertains to second-degree polynomials in a normally dis-
tributed random column vector. Theorem 6.8.3 gives conditions that are necessary and sufficient
for such second-degree polynomials to be statistically independent, whereas Theorem 6.6.1 gives
conditions that are necessary and sufficient for such a second-degree polynomial to have a noncentral
chi-square distribution. In the case of Theorem 6.8.3, as in the case of Theorem 6.6.1, it is much
easier to prove sufficiency than necessity, and the sufficiency is more important than the necessity
(in the sense that it is typically the sufficiency that is invoked in an application). Accordingly, the
following proof of Theorem 6.8.3 is a proof of sufficiency; the proof of necessity is deferred until a
subsequent subsection (Subsection e).
Proof (of Theorem 6.8.3): sufficiency. For i = 1, 2, ..., K, q_i is reexpressible in the form
    q_i = c_i + b_i′x + (A_ix)′A_i⁻(A_ix),
and hence q_i depends on the value of x only through the (M + 1)-dimensional column vector (b_i, A_i)′x. Moreover, for j ≠ i = 1, 2, ..., K,
    cov[(b_i, A_i)′x, (b_j, A_j)′x] = (b_i, A_i)′(b_j, A_j) = [ b_i′b_j  (A_jb_i)′ ; A_ib_j  A_iA_j ].    (8.11)
And the joint distribution of the K vectors (b₁, A₁)′x, (b₂, A₂)′x, ..., (b_K, A_K)′x is multivariate normal.
Now, suppose that, for j ≠ i = 1, 2, ..., K, condition (8.7) is satisfied. Then, in light of result (8.11), (b₁, A₁)′x, (b₂, A₂)′x, ..., (b_K, A_K)′x are uncorrelated and hence (since their joint distribution is multivariate normal) statistically independent, leading to the conclusion that the second-degree polynomials q₁, q₂, ..., q_K [each of which depends on a different one of the vectors (b₁, A₁)′x, (b₂, A₂)′x, ..., (b_K, A_K)′x] are statistically independent. Q.E.D.
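A numerical sketch of the sufficiency direction (the matrices below are example choices of my own, built from projections onto orthogonal subspaces so that conditions (8.7) hold): the resulting second-degree polynomials are uncorrelated, as independence requires.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 6
O, _ = np.linalg.qr(rng.standard_normal((M, M)))
A1 = O[:, :2] @ O[:, :2].T                # projections onto orthogonal column spaces
A2 = O[:, 2:5] @ O[:, 2:5].T
b1, b2 = A1 @ rng.standard_normal(M), A2 @ rng.standard_normal(M)
print(np.allclose(A1 @ A2, 0), np.allclose(A1 @ b2, 0),
      np.allclose(A2 @ b1, 0), np.isclose(b1 @ b2, 0))   # conditions (8.7)

mu = rng.standard_normal(M)
x = mu + rng.standard_normal((100000, M))                # x ~ N(mu, I_M)
q1 = x @ b1 + np.einsum('ij,jk,ik->i', x, A1, x)
q2 = x @ b2 + np.einsum('ij,jk,ik->i', x, A2, x)
# independence implies zero correlation between (functions of) q1 and q2
print(np.corrcoef(q1, q2)[0, 1], np.corrcoef(q1**2, np.abs(q2))[0, 1])
```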
An extension. The coverage of Theorem 6.8.1 includes the special case where one or more of the
quantities q1 ; q2 ; : : : ; qK (whose statistical independence is in question) are linear forms. In the
following generalization, the coverage is extended to include vectors of linear forms.
Theorem 6.8.4. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution. And, for i = 1, 2, ..., K, take q_i = c_i + b_i′x + x′A_ix, where c_i is a constant, b_i an M-dimensional column vector of constants, and A_i an M × M symmetric matrix of constants. Further, for s = 1, 2, ..., R, denote by d_s an N_s-dimensional column vector of constants and by L_s an M × N_s matrix of constants. Then, q₁, q₂, ..., q_K, d₁ + L₁′x, d₂ + L₂′x, ..., d_R + L_R′x are distributed independently if and only if, for j ≠ i = 1, 2, ..., K,
    ΣA_iΣA_jΣ = 0,    (8.12)
    ΣA_iΣ(b_j + 2A_jμ) = 0,    (8.13)
and
    (b_i + 2A_iμ)′Σ(b_j + 2A_jμ) = 0,    (8.14)
for i = 1, 2, ..., K and s = 1, 2, ..., R,
    ΣA_iΣL_s = 0    (8.15)
and
    (b_i + 2A_iμ)′ΣL_s = 0,    (8.16)
and, for t ≠ s = 1, 2, ..., R,
    L_t′ΣL_s = 0.    (8.17)

Note that in the special case where Σ is nonsingular, conditions (8.15) and (8.16) are (collectively) equivalent to the condition
    A_iΣL_s = 0  and  b_i′ΣL_s = 0.    (8.18)
Accordingly, in the further special case where Σ = I, the result of Theorem 6.8.4 can be restated in the form of the following generalization of Theorem 6.8.3.
Theorem 6.8.5. Let x represent an M-dimensional random column vector that has an N(μ, I_M) distribution. And, for i = 1, 2, ..., K, take q_i = c_i + b_i′x + x′A_ix, where c_i is a constant, b_i an M-dimensional column vector of constants, and A_i an M × M symmetric matrix of constants. Further, for s = 1, 2, ..., R, denote by d_s an N_s-dimensional column vector of constants and by L_s an M × N_s matrix of constants. Then, q₁, q₂, ..., q_K, d₁ + L₁′x, d₂ + L₂′x, ..., d_R + L_R′x are distributed independently if and only if, for j ≠ i = 1, 2, ..., K,
    A_iA_j = 0,  A_ib_j = 0,  and  b_i′b_j = 0,    (8.19)
for i = 1, 2, ..., K and s = 1, 2, ..., R,
    A_iL_s = 0  and  b_i′L_s = 0,    (8.20)
and, for t ≠ s = 1, 2, ..., R,
    L_t′L_s = 0.    (8.21)

To prove Theorem 6.8.4, it suffices to prove Theorem 6.8.5, as can be established via a straight-
forward extension of the argument used in establishing that to prove Theorem 6.8.1, it suffices to
prove Theorem 6.8.3. Moreover, the “sufficiency part” of Theorem 6.8.5 can be established via a
straightforward extension of the argument used to establish the sufficiency part of Theorem 6.8.3.
Now, consider the "necessity part" of Theorem 6.8.5. Suppose that q₁, q₂, ..., q_K, d₁ + L₁′x, d₂ + L₂′x, ..., d_R + L_R′x are statistically independent. Then, for arbitrary column vectors h₁, h₂, ..., h_R of constants (of dimensions N₁, N₂, ..., N_R, respectively), q₁, q₂, ..., q_K, h₁′d₁ + (L₁h₁)′x, h₂′d₂ + (L₂h₂)′x, ..., h_R′d_R + (L_Rh_R)′x are statistically independent. And it follows from the necessity part of Theorem 6.8.3 that, for j ≠ i = 1, 2, ..., K,
    A_iA_j = 0,  A_ib_j = 0,  and  b_i′b_j = 0,
for i = 1, 2, ..., K and s = 1, 2, ..., R,
    A_iL_sh_s = 0  and  b_i′L_sh_s = 0,    (8.22)
and, for t ≠ s = 1, 2, ..., R,
    h_t′L_t′L_sh_s = 0.    (8.23)
Moreover, since the vectors h₁, h₂, ..., h_R are arbitrary, results (8.22) and (8.23) imply that, for i = 1, 2, ..., K and s = 1, 2, ..., R,
    A_iL_s = 0  and  b_i′L_s = 0
and, for t ≠ s = 1, 2, ..., R,
    L_t′L_s = 0.
Thus, the necessity part of Theorem 6.8.5 follows from the necessity part of Theorem 6.8.3.
Statistical independence versus zero correlation. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution. For an M × N₁ matrix of constants L₁ and an M × N₂ matrix of constants L₂,
    cov(L₁′x, L₂′x) = L₁′ΣL₂.
Thus, two vectors of linear forms in a normally distributed random column vector are statistically independent if and only if they are uncorrelated.
For an M × N matrix of constants L and an M × M symmetric matrix of constants A,
    cov(L′x, x′Ax) = 2L′ΣAμ
—refer to result (5.7.14)—so that the vector L′x of linear forms and the quadratic form x′Ax (in the normally distributed random vector x) are uncorrelated if and only if L′ΣAμ = 0, but are statistically independent if and only if ΣAΣL = 0 and μ′AΣL = 0 (or, equivalently, if and only if L′ΣAΣ = 0 and L′ΣAμ = 0). And, for two M × M symmetric matrices of constants A₁ and A₂,
    cov(x′A₁x, x′A₂x) = 2 tr(A₁ΣA₂Σ) + 4μ′A₁ΣA₂μ
—refer to result (5.7.19)—so that the two quadratic forms x′A₁x and x′A₂x (in the normally distributed random vector x) are uncorrelated if and only if
    tr(A₁ΣA₂Σ) + 2μ′A₁ΣA₂μ = 0,    (8.24)
but are statistically independent if and only if
    ΣA₁ΣA₂Σ = 0,  ΣA₁ΣA₂μ = 0,  ΣA₂ΣA₁μ = 0,  and  μ′A₁ΣA₂μ = 0.
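A Monte Carlo sketch (the mean vector, covariance matrix, and symmetric matrices below are arbitrary example choices of my own) of the covariance formula cov(x′A₁x, x′A₂x) = 2 tr(A₁ΣA₂Σ) + 4μ′A₁ΣA₂μ invoked above.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 4
mu = rng.standard_normal(M)
L = rng.standard_normal((M, M)); Sigma = L @ L.T
A1 = rng.standard_normal((M, M)); A1 = (A1 + A1.T) / 2
A2 = rng.standard_normal((M, M)); A2 = (A2 + A2.T) / 2

x = rng.multivariate_normal(mu, Sigma, size=400000)
q1 = np.einsum('ij,jk,ik->i', x, A1, x)
q2 = np.einsum('ij,jk,ik->i', x, A2, x)
theory = 2 * np.trace(A1 @ Sigma @ A2 @ Sigma) + 4 * mu @ A1 @ Sigma @ A2 @ mu
print(np.cov(q1, q2)[0, 1], theory)       # the two values should be close
```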

b. Cochran’s theorem
Theorem 6.8.1 can be used to determine whether two or more second-degree polynomials (in a
normally distributed random vector) are statistically independent. The following theorem can be
used to determine whether the second-degree polynomials are not only statistically independent but
whether, in addition, they have noncentral chi-square distributions.
Theorem 6.8.6. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution. And, for i = 1, 2, ..., K, take q_i = c_i + b_i′x + x′A_ix, where c_i is a constant, b_i an M-dimensional column vector of constants, and A_i an M × M symmetric matrix of constants (such that ΣA_iΣ ≠ 0). Further, define A = A₁ + A₂ + ⋯ + A_K. If
    ΣAΣAΣ = ΣAΣ,    (8.25)
    rank(ΣA₁Σ) + rank(ΣA₂Σ) + ⋯ + rank(ΣA_KΣ) = rank(ΣAΣ),    (8.26)
    Σ(b_i + 2A_iμ) = ΣA_iΣ(b_i + 2A_iμ)  (i = 1, 2, ..., K),    (8.27)
and
    c_i + b_i′μ + μ′A_iμ = ¼(b_i + 2A_iμ)′Σ(b_i + 2A_iμ)  (i = 1, 2, ..., K),    (8.28)
then q₁, q₂, ..., q_K are statistically independent and (for i = 1, 2, ..., K) q_i ∼ χ²(R_i, c_i + b_i′μ + μ′A_iμ), where R_i = rank(ΣA_iΣ) = tr(A_iΣ). Conversely, if q₁, q₂, ..., q_K are statistically independent and (for i = 1, 2, ..., K) q_i ∼ χ²(R_i, λ_i) (where R_i is a strictly positive integer), then conditions (8.25), (8.26), (8.27), and (8.28) are satisfied and (for i = 1, 2, ..., K) R_i = rank(ΣA_iΣ) = tr(A_iΣ) and λ_i = c_i + b_i′μ + μ′A_iμ.
Some results on matrices. The proof of Theorem 6.8.6 makes use of certain properties of matrices.
These properties are presented in the form of a generalization of the following theorem.
Theorem 6.8.7. Let A₁, A₂, ..., A_K represent N × N matrices, and define A = A₁ + A₂ + ⋯ + A_K. Suppose that A is idempotent. Then, each of the following conditions implies the other two:
(1) A_iA_j = 0 (for j ≠ i = 1, 2, ..., K) and rank(A_i²) = rank(A_i) (for i = 1, 2, ..., K);
(2) A_i² = A_i (for i = 1, 2, ..., K);
(3) rank(A₁) + rank(A₂) + ⋯ + rank(A_K) = rank(A).
As a preliminary to proving Theorem 6.8.7, let us establish the following two lemmas on idem-
potent matrices.
Lemma 6.8.8. An N × N matrix A is idempotent if and only if rank(A) + rank(I − A) = N.
Proof (of Lemma 6.8.8). That rank(A) + rank(I − A) = N if A is idempotent is an immediate consequence of Lemma 2.8.4. Now, for purposes of establishing the converse, observe that N(A) ⊂ C(I − A)—if x ∈ N(A), then Ax = 0, in which case x = (I − A)x ∈ C(I − A). Observe also (in light of Lemma 2.11.5) that dim[N(A)] = N − rank(A). Thus, if rank(A) + rank(I − A) = N, then
    dim[N(A)] = rank(I − A) = dim[C(I − A)],
implying (in light of Theorem 2.4.10) that N(A) = C(I − A) and hence that every column of I − A is a member of N(A), in which case A(I − A) = 0 or, equivalently, A² = A. Q.E.D.
Lemma 6.8.9. For any two idempotent matrices A and B (of the same size), A + B is idempotent if and only if BA = AB = 0.
Proof (of Lemma 6.8.9). Clearly,
    (A + B)² = A² + B² + AB + BA = A + B + AB + BA.
Thus, A + B is idempotent if and only if AB + BA = 0.
Now, suppose that BA = AB = 0. Then, obviously, AB + BA = 0, and, consequently, A + B is idempotent.
Conversely, suppose that AB + BA = 0 (as would be the case if A + B were idempotent). Then,
    AB + ABA = A²B + ABA = A(AB + BA) = 0,
and
    ABA + BA = ABA + BA² = (AB + BA)A = 0,
implying that
    AB − BA = AB + ABA − (ABA + BA) = 0 − 0 = 0
and hence that
    BA = AB.
Moreover,
    AB = ½(AB + AB) = ½(AB + BA) = 0.
Q.E.D.
Proof (of Theorem 6.8.7). The proof consists of successively showing that Condition (1) implies Condition (2), that Condition (2) implies Condition (3), and that Condition (3) implies Condition (1).
(1) ⇒ (2). Suppose that Condition (1) is satisfied. Then, for an arbitrary integer i between 1 and K, inclusive,
    A_i² = A_iA = A_iAA = A_i²A = A_i³.
Moreover, since rank(A_i²) = rank(A_i), it follows from Corollary 2.4.17 that C(A_i²) = C(A_i), so that A_i = A_i²L_i for some matrix L_i. Thus,
    A_i = A_i²L_i = A_i³L_i = A_i(A_i²L_i) = A_iA_i = A_i².
(2) ⇒ (3). Suppose that Condition (2) is satisfied. Then, making use of Corollary 2.8.3, we find that
    Σ_{i=1}^K rank(A_i) = Σ_{i=1}^K tr(A_i) = tr(Σ_{i=1}^K A_i) = tr(A) = rank(A).
(3) ⇒ (1). Suppose that Condition (3) is satisfied. And define A₀ = I_N − A. Then, Σ_{i=0}^K A_i = I. Moreover, it follows from Lemma 2.8.4 (and also from Lemma 6.8.8) that rank(A₀) = N − rank(A) and hence that Σ_{i=0}^K rank(A_i) = N.
Now, making use of inequality (2.4.24), we find (for an arbitrary integer i between 1 and K, inclusive) that
    rank(I − A_i) = rank(Σ_{s=0 (s≠i)}^K A_s) ≤ Σ_{s=0 (s≠i)}^K rank(A_s) = N − rank(A_i),
so that rank(A_i) + rank(I − A_i) ≤ N, implying [since rank(A_i) + rank(I − A_i) ≥ rank(A_i + I − A_i) = rank(I_N) = N] that
    rank(A_i) + rank(I − A_i) = N.
And upon applying Lemma 6.8.8, it follows that A_i is idempotent [and that rank(A_i²) = rank(A_i)].
Further, upon again making use of inequality (2.4.24), we find (for j ≠ i = 1, 2, ..., K) that
    rank(I − A_i − A_j) = rank(Σ_{s=0 (s≠i,j)}^K A_s) ≤ Σ_{s=0 (s≠i,j)}^K rank(A_s) = N − [rank(A_i) + rank(A_j)] ≤ N − rank(A_i + A_j),
so that rank(A_i + A_j) + rank(I − A_i − A_j) ≤ N, implying [since rank(A_i + A_j) + rank(I − A_i − A_j) ≥ rank(A_i + A_j + I − A_i − A_j) = rank(I_N) = N] that
    rank(A_i + A_j) + rank[I − (A_i + A_j)] = N
and hence (in light of Lemma 6.8.8) that A_i + A_j is idempotent and leading (in light of Lemma 6.8.9) to the conclusion that A_iA_j = 0. Q.E.D.
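A small check (with example matrices of my own choosing) of the equivalences in Theorem 6.8.7: idempotent summands of an idempotent sum satisfy rank additivity and pairwise orthogonality.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 6
O, _ = np.linalg.qr(rng.standard_normal((N, N)))
A1 = O[:, :2] @ O[:, :2].T                # idempotent, rank 2
A2 = O[:, 2:5] @ O[:, 2:5].T              # idempotent, rank 3
A = A1 + A2                               # idempotent, rank 5

rank = np.linalg.matrix_rank
print(np.allclose(A @ A, A))                               # A is idempotent
print(rank(A1) + rank(A2) == rank(A))                      # condition (3)
print(np.allclose(A1 @ A1, A1), np.allclose(A2 @ A2, A2))  # condition (2)
print(np.allclose(A1 @ A2, 0))                             # condition (1)
```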
When the matrices A₁, A₂, ..., A_K are symmetric, Condition (1) of Theorem 6.8.7 reduces to the condition
    A_iA_j = 0  (for j ≠ i = 1, 2, ..., K).
To see this, observe that if A_i is symmetric, then A_i² = A_i′A_i, implying (in light of Lemma 2.12.1) that rank(A_i²) = rank(A_i) (i = 1, 2, ..., K). As what can be regarded as a generalization of the special case of Theorem 6.8.7 where A₁, A₂, ..., A_K are symmetric, we have the following result.
Theorem 6.8.10. Let A₁, A₂, ..., A_K represent N × N symmetric matrices, define A = A₁ + A₂ + ⋯ + A_K, and take Σ to be an N × N symmetric nonnegative definite matrix. Suppose that ΣAΣAΣ = ΣAΣ. Then, each of the following conditions implies the other two:
(1) ΣA_iΣA_jΣ = 0 (for j ≠ i = 1, 2, ..., K);
(2) ΣA_iΣA_iΣ = ΣA_iΣ (for i = 1, 2, ..., K);
(3) rank(ΣA₁Σ) + rank(ΣA₂Σ) + ⋯ + rank(ΣA_KΣ) = rank(ΣAΣ).
Proof (of Theorem 6.8.10). Take Γ to be any matrix (with N columns) such that Σ = Γ′Γ. Clearly,
    ΓAΓ′ = ΓA₁Γ′ + ΓA₂Γ′ + ⋯ + ΓA_KΓ′.
And Corollary 2.3.4 can be used to show that the condition ΣAΣAΣ = ΣAΣ is equivalent to the condition that ΓAΓ′ be idempotent, to show that Condition (1) is equivalent to the condition
    (ΓA_iΓ′)(ΓA_jΓ′) = 0  (for j ≠ i = 1, 2, ..., K),
and to show that Condition (2) is equivalent to the condition
    (ΓA_iΓ′)² = ΓA_iΓ′  (for i = 1, 2, ..., K).
Moreover, Lemma 2.12.3 can be used to show that Condition (3) is equivalent to the condition
    rank(ΓA₁Γ′) + rank(ΓA₂Γ′) + ⋯ + rank(ΓA_KΓ′) = rank(ΓAΓ′).
Thus, Theorem 6.8.10 can be established by applying Theorem 6.8.7 to the symmetric matrices ΓA₁Γ′, ΓA₂Γ′, ..., ΓA_KΓ′. Q.E.D.
A result related to Theorem 6.8.7 is that if A₁, A₂, ..., A_K are square matrices such that A_i² = A_i and A_iA_j = 0 for j ≠ i = 1, 2, ..., K, then A₁ + A₂ + ⋯ + A_K is idempotent. Similarly, a result related to Theorem 6.8.10 is that if A₁, A₂, ..., A_K are square matrices such that ΣA_iΣA_iΣ = ΣA_iΣ and ΣA_iΣA_jΣ = 0 for j ≠ i = 1, 2, ..., K, then
    Σ(A₁ + ⋯ + A_K)Σ(A₁ + ⋯ + A_K)Σ = Σ(A₁ + ⋯ + A_K)Σ.

Proof of Theorem 6.8.6. Let us prove Theorem 6.8.6, doing so by taking advantage of Theorem 6.8.10. Suppose that conditions (8.25), (8.26), (8.27), and (8.28) are satisfied. In light of Theorem 6.8.10, it follows from conditions (8.25) and (8.26) that
    ΣA_iΣA_jΣ = 0  (j ≠ i = 1, 2, ..., K)    (8.29)
and that
    ΣA_iΣA_iΣ = ΣA_iΣ  (i = 1, 2, ..., K).    (8.30)
Moreover, for j ≠ i = 1, 2, ..., K, condition (8.27) [in combination with result (8.29)] implies that
    ΣA_iΣ(b_j + 2A_jμ) = ΣA_iΣA_jΣ(b_j + 2A_jμ) = 0
and that
    (b_i + 2A_iμ)′Σ(b_j + 2A_jμ) = [Σ(b_i + 2A_iμ)]′(b_j + 2A_jμ)
        = [ΣA_iΣ(b_i + 2A_iμ)]′(b_j + 2A_jμ)
        = (b_i + 2A_iμ)′ΣA_iΣ(b_j + 2A_jμ) = 0.
And upon applying Theorems 6.8.1 and 6.6.2, we conclude that q₁, q₂, ..., q_K are statistically independent and that (for i = 1, 2, ..., K) q_i ∼ χ²(R_i, c_i + b_i′μ + μ′A_iμ), where R_i = rank(ΣA_iΣ) = tr(A_iΣ).
Conversely, suppose that q₁, q₂, ..., q_K are statistically independent and that (for i = 1, 2, ..., K) q_i ∼ χ²(R_i, λ_i) (where R_i is a strictly positive integer). Then, from Theorem 6.6.2, we have that conditions (8.27) and (8.28) are satisfied, that (for i = 1, 2, ..., K) R_i = rank(ΣA_iΣ) = tr(A_iΣ) and λ_i = c_i + b_i′μ + μ′A_iμ, and that
    ΣA_iΣA_iΣ = ΣA_iΣ  (i = 1, 2, ..., K).    (8.31)
And from Theorem 6.8.1, we have that
    ΣA_iΣA_jΣ = 0  (j ≠ i = 1, 2, ..., K).    (8.32)
Together, results (8.31) and (8.32) imply that condition (8.25) is satisfied. That condition (8.26) is also satisfied can be inferred from Theorem 6.8.10. Q.E.D.
Corollaries of Theorem 6.8.6. In light of Theorem 6.8.10, alternative versions of Theorem 6.8.6 can be obtained by replacing condition (8.26) with the condition
    ΣA_iΣA_jΣ = 0  (j ≠ i = 1, 2, ..., K)
or with the condition
    ΣA_iΣA_iΣ = ΣA_iΣ  (i = 1, 2, ..., K).
In either case, the replacement results in what can be regarded as a corollary of Theorem 6.8.6. The following result can also be regarded as a corollary of Theorem 6.8.6.
Corollary 6.8.11. Let x represent an M-dimensional random column vector that has an N(μ, I_M) distribution. And take A₁, A₂, ..., A_K to be M × M (nonnull) symmetric matrices of constants, and define A = A₁ + A₂ + ⋯ + A_K. If A is idempotent and
    rank(A₁) + rank(A₂) + ⋯ + rank(A_K) = rank(A),    (8.33)
then the quadratic forms x′A₁x, x′A₂x, ..., x′A_Kx are statistically independent and (for i = 1, 2, ..., K) x′A_ix ∼ χ²(R_i, μ′A_iμ), where R_i = rank(A_i) = tr(A_i). Conversely, if x′A₁x, x′A₂x, ..., x′A_Kx are statistically independent and if (for i = 1, 2, ..., K) x′A_ix ∼ χ²(R_i, λ_i) (where R_i is a strictly positive integer), then A is idempotent, condition (8.33) is satisfied, and (for i = 1, 2, ..., K) R_i = rank(A_i) = tr(A_i) and λ_i = μ′A_iμ.
Corollary 6.8.11 can be deduced from Theorem 6.8.6 by making use of Theorem 6.8.7. The special case of Corollary 6.8.11 where μ = 0 and where A₁, A₂, ..., A_K are such that A = I was formulated and proved by Cochran (1934) and is known as Cochran's theorem. Cochran's theorem is one of the most famous theoretical results in all of statistics. Note that in light of Theorem 6.8.7, alternative versions of Corollary 6.8.11 can be obtained by replacing condition (8.33) with the condition
    A_iA_j = 0  (j ≠ i = 1, 2, ..., K)
or with the condition
    A_i² = A_i  (i = 1, 2, ..., K).
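A simulation sketch of Cochran's theorem in its classic use (the example decomposition is my own, not the text's): for z ∼ N(0, I_n), taking A₁ = J/n (the projection onto the constant vector) and A₂ = I − J/n gives A₁ + A₂ = I with ranks adding to n, so z′A₁z and z′A₂z are independent chi-squares with 1 and n − 1 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 8, 200000
A1 = np.full((n, n), 1.0 / n)
A2 = np.eye(n) - A1
print(np.linalg.matrix_rank(A1) + np.linalg.matrix_rank(A2) == n)  # ranks add to rank(I)

z = rng.standard_normal((reps, n))
q1 = n * z.mean(axis=1) ** 2                                       # equals z'A1z
q2 = ((z - z.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)        # equals z'A2z
print(q1.mean(), q2.mean())               # ≈ 1 and ≈ n - 1
print(np.corrcoef(q1, q2)[0, 1])          # ≈ 0, consistent with independence
```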
Another result that can be regarded as a corollary of Theorem 6.8.6 is as follows.
Corollary 6.8.12. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution. And, for i = 1, 2, ..., K, take A_i to be an M × M symmetric matrix of constants (such that A_iΣ ≠ 0). Further, define A = A₁ + A₂ + ⋯ + A_K. If
    AΣA = A    (8.34)
and
    A_iΣA_i = A_i  (i = 1, 2, ..., K),    (8.35)
then the quadratic forms x′A₁x, x′A₂x, ..., x′A_Kx are statistically independent and (for i = 1, 2, ..., K) x′A_ix ∼ χ²(R_i, μ′A_iμ), where R_i = rank(A_iΣ) = tr(A_iΣ). Conversely, if x′A₁x, x′A₂x, ..., x′A_Kx are statistically independent and if (for i = 1, 2, ..., K) x′A_ix ∼ χ²(R_i, λ_i) (where R_i is a strictly positive integer), then in the special case where Σ is nonsingular, conditions (8.34) and (8.35) are satisfied and (for i = 1, 2, ..., K) R_i = rank(A_iΣ) = tr(A_iΣ) and λ_i = μ′A_iμ.
Proof. Condition (8.34) implies condition (8.25), and in the special case where Σ is nonsingular, conditions (8.34) and (8.25) are equivalent. And as noted earlier, condition (8.26) of Theorem 6.8.6 can be replaced by the condition
    ΣA_iΣA_iΣ = ΣA_iΣ  (i = 1, 2, ..., K).    (8.36)
Moreover, condition (8.35) implies condition (8.36) and that (for i = 1, 2, ..., K) ΣA_iμ = ΣA_iΣA_iμ and μ′A_iμ = μ′A_iΣA_iμ; in the special case where Σ is nonsingular, conditions (8.35) and (8.36) are equivalent. Condition (8.35) also implies that (for i = 1, 2, ..., K) A_iΣ is idempotent and hence that rank(A_iΣ) = tr(A_iΣ). Accordingly, Corollary 6.8.12 is obtainable as an "application" of Theorem 6.8.6. Q.E.D.
If in Corollary 6.8.12, we substitute (1/γ_i)A_i for A_i, where γ_i is a nonzero scalar, we obtain an additional corollary as follows.
Corollary 6.8.13. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution. And, for i = 1, 2, ..., K, take A_i to be an M × M symmetric matrix of constants (such that A_iΣ ≠ 0), and take γ_i to be a nonzero constant. Further, define B = (1/γ₁)A₁ + (1/γ₂)A₂ + ⋯ + (1/γ_K)A_K. If
    BΣB = B    (8.37)
and
    A_iΣA_i = γ_iA_i  (i = 1, 2, ..., K),    (8.38)
then the quadratic forms x′A₁x, x′A₂x, ..., x′A_Kx are statistically independent and (for i = 1, 2, ..., K) x′A_ix/γ_i ∼ χ²(R_i, μ′A_iμ/γ_i), where R_i = rank(A_iΣ) = (1/γ_i) tr(A_iΣ). Conversely, if x′A₁x, x′A₂x, ..., x′A_Kx are statistically independent and if (for i = 1, 2, ..., K) x′A_ix/γ_i ∼ χ²(R_i, λ_i) (where R_i is a strictly positive integer), then in the special case where Σ is nonsingular, conditions (8.37) and (8.38) are satisfied and (for i = 1, 2, ..., K) R_i = rank(A_iΣ) = (1/γ_i) tr(A_iΣ) and λ_i = μ′A_iμ/γ_i.
As yet another corollary of Theorem 6.8.6, we have the following generalization of a result
attributable to Albert (1976).
Corollary 6.8.14. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution. And, for i = 1, 2, ..., K, take A_i to be an M × M (nonnull) symmetric matrix of constants, and take γ_i to be a nonzero constant. Further, define A = A₁ + A₂ + ⋯ + A_K, and suppose that A is idempotent. If
    A_i² = A_i  (i = 1, 2, ..., K)    (8.39)
and
    AΣA_i = γ_iA_i  (i = 1, 2, ..., K),    (8.40)
then the quadratic forms x′A₁x, x′A₂x, ..., x′A_Kx are statistically independent and (for i = 1, 2, ..., K) γ_i = tr(A_iΣ)/tr(A_i) > 0 and x′A_ix/γ_i ∼ χ²(R_i, μ′A_iμ/γ_i), where R_i = rank A_i = tr(A_i). Conversely, if x′A₁x, x′A₂x, ..., x′A_Kx are statistically independent and if (for i = 1, 2, ..., K) x′A_ix/γ_i ∼ χ²(R_i, λ_i) (where R_i is a strictly positive integer), then in the special case where Σ is nonsingular, conditions (8.39) and (8.40) are satisfied and (for i = 1, 2, ..., K) γ_i = tr(A_iΣ)/tr(A_i) > 0, R_i = rank A_i = tr(A_i), and λ_i = μ′A_iμ/γ_i.
Proof (of Corollary 6.8.14). Define B = (1/γ₁)A₁ + (1/γ₂)A₂ + ⋯ + (1/γ_K)A_K.
Suppose that conditions (8.39) and (8.40) are satisfied. Then, for j ≠ i = 1, 2, ..., K, A_iA_j = 0 (as is evident from Theorem 6.8.7). Thus, for i = 1, 2, ..., K,
    A_iΣA_i = A_iAΣA_i = γ_iA_i² = γ_iA_i,    (8.41)
implying in particular that
    γ_i tr(A_i) = tr(A_iΣA_i) = tr(A_i²Σ) = tr(A_iΣ)
and hence [since tr(A_i) = tr(A_i²) = tr(A_i′A_i) > 0 and since A_iΣA_i = A_i′ΣA_i is a symmetric nonnegative definite matrix] that
    γ_i = tr(A_iΣ)/tr(A_i) > 0.    (8.42)
Moreover, for j ≠ i = 1, 2, ..., K,
    A_iΣA_j = A_iAΣA_j = γ_jA_iA_j = 0,
so that
    BΣB = B.
And upon observing [in light of result (8.41) or (8.42)] that (for i = 1, 2, ..., K) A_iΣ ≠ 0, it follows from Corollary 6.8.13 that x′A₁x, x′A₂x, ..., x′A_Kx are statistically independent and that (for i = 1, 2, ..., K) x′A_ix/γ_i ∼ χ²(R_i, μ′A_iμ/γ_i), where R_i = (1/γ_i) tr(A_iΣ) = tr(A_i) = rank A_i.
Conversely, suppose that x′A₁x, x′A₂x, ..., x′A_Kx are statistically independent and that (for i = 1, 2, ..., K) x′A_ix/γ_i ∼ χ²(R_i, λ_i) (where R_i is a strictly positive integer). And suppose that Σ is nonsingular (in which case A_iΣ ≠ 0 for i = 1, 2, ..., K). Then, from Corollary 6.8.13, we have that BΣB = B and (for i = 1, 2, ..., K) that
    A_iΣA_i = γ_iA_i
and also that R_i = (1/γ_i) tr(A_iΣ) and λ_i = μ′A_iμ/γ_i. Further, ΣBΣBΣ = ΣBΣ and (for i = 1, 2, ..., K) Σ(γ_i⁻¹A_i)Σ(γ_i⁻¹A_i)Σ = Σ(γ_i⁻¹A_i)Σ, so that (in light of Theorem 6.8.10) we have that
    (γ_iγ_j)⁻¹ ΣA_iΣA_jΣ = Σ(γ_i⁻¹A_i)Σ(γ_j⁻¹A_j)Σ = 0
or, equivalently,
    A_iΣA_j = 0.
Thus, for i = 1, 2, ..., K, we find that
    AΣA_i = A_iΣA_i = γ_iA_i
and (since A is idempotent) that
    A_i² = A_i′A_i = γ_i⁻² A_iΣA²ΣA_i = γ_i⁻² A_iΣAΣA_i = γ_i⁻¹ A_iΣA_i = A_i,
leading to the conclusion that conditions (8.39) and (8.40) are satisfied and by implication—via the same argument that gave rise to result (8.42)—that (for i = 1, 2, ..., K) γ_i = tr(A_iΣ)/tr(A_i) > 0 [in which case R_i = (1/γ_i) tr(A_iΣ) = tr(A_i) = rank A_i]. Q.E.D.

c. Some connections to the Dirichlet distribution


As a variation on Cochran's theorem—recall that in the special case where μ = 0 and A = I, Corollary 6.8.11 is Cochran's theorem—we have the following result.
Theorem 6.8.15. Let z represent an M-dimensional random column vector that has an N(0, I_M) distribution, where M > 1. And take A₁, A₂, ..., A_K to be M × M (nonnull) symmetric matrices of constants, and define A = A₁ + A₂ + ⋯ + A_K. If A is idempotent and
    rank(A₁) + rank(A₂) + ⋯ + rank(A_K) = rank(A) < M,    (8.43)
then z′A₁z/z′z, z′A₂z/z′z, ..., z′A_Kz/z′z have a Di(R₁/2, R₂/2, ..., R_K/2, (M − Σ_{i=1}^K R_i)/2; K) distribution, where (for i = 1, 2, ..., K) R_i = rank(A_i) = tr(A_i). Conversely, if z′A₁z/z′z, z′A₂z/z′z, ..., z′A_Kz/z′z have a Di(R₁/2, R₂/2, ..., R_K/2, (M − Σ_{i=1}^K R_i)/2; K) distribution (where R₁, R₂, ..., R_K are strictly positive integers such that Σ_{i=1}^K R_i < M), then A is idempotent, condition (8.43) is satisfied, and (for i = 1, 2, ..., K) R_i = rank(A_i) = tr(A_i).
Proof. Suppose that A is idempotent and that condition (8.43) is satisfied. Further, let A_{K+1} = I − A, so that Σ_{i=1}^{K+1} A_i = I. And observe (in light of Lemma 2.8.4) that rank(A_{K+1}) = M − rank(A) and hence that A_{K+1} is nonnull and that Σ_{i=1}^{K+1} rank(A_i) = M. Then, as an application of Corollary 6.8.11 (one where μ = 0 and A = I), we have that the K + 1 quadratic forms z′A₁z, z′A₂z, ..., z′A_Kz, z′A_{K+1}z are statistically independent and that (for i = 1, 2, ..., K, K+1) z′A_iz ∼ χ²(R_i), where R_i = rank(A_i) = tr(A_i). Clearly, z′z = Σ_{i=1}^{K+1} z′A_iz, and R_{K+1} = M − Σ_{i=1}^K R_i. Accordingly, we conclude that z′A₁z/z′z, z′A₂z/z′z, ..., z′A_Kz/z′z have a Di(R₁/2, R₂/2, ..., R_K/2, (M − Σ_{i=1}^K R_i)/2; K) distribution.
Conversely, suppose that z′A₁z/z′z, z′A₂z/z′z, ..., z′A_Kz/z′z have a Di(R₁/2, R₂/2, ..., R_K/2, (M − Σ_{i=1}^K R_i)/2; K) distribution (where R₁, R₂, ..., R_K are strictly positive integers such that Σ_{i=1}^K R_i < M). And partition z into subvectors z₁, z₂, ..., z_K, z_{K+1} of dimensions R₁, R₂, ..., R_K, M − Σ_{i=1}^K R_i, respectively, so that z′ = (z₁′, z₂′, ..., z_K′, z_{K+1}′). Then, the (joint) distribution of z′A₁z/z′z, z′A₂z/z′z, ..., z′A_Kz/z′z is identical to that of the quantities z₁′z₁/z′z, z₂′z₂/z′z, ..., z_K′z_K/z′z. Moreover, the quantities z′A₁z/z′z, z′A₂z/z′z, ..., z′A_Kz/z′z and the quantities z₁′z₁/z′z, z₂′z₂/z′z, ..., z_K′z_K/z′z are both distributed independently of z′z, as is evident from the results of Section 6.1f upon observing that (for i = 1, 2, ..., K) z′A_iz/z′z and z_i′z_i/z′z depend on the value of z only through (z′z)^{−1/2}z. Thus,
    (z′A₁z, z′A₂z, ..., z′A_Kz)′ = z′z (z′A₁z/z′z, z′A₂z/z′z, ..., z′A_Kz/z′z)′ ∼ z′z (z₁′z₁/z′z, z₂′z₂/z′z, ..., z_K′z_K/z′z)′ = (z₁′z₁, z₂′z₂, ..., z_K′z_K)′,
which implies that the quadratic forms z′A₁z, z′A₂z, ..., z′A_Kz are statistically independent and that (for i = 1, 2, ..., K) z′A_iz ∼ χ²(R_i). Upon applying Corollary 6.8.11, it follows that A is idempotent, that Σ_{i=1}^K rank(A_i) = rank(A), and that (for i = 1, 2, ..., K) R_i = rank(A_i) = tr(A_i) (in which case Σ_{i=1}^K rank(A_i) = Σ_{i=1}^K R_i < M). Q.E.D.
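A Monte Carlo sketch (the orthogonal projections below are example choices of my own) of Theorem 6.8.15: with rank(A₁) + rank(A₂) = rank(A₁ + A₂) < M, the ratios z′A_iz/z′z are jointly Dirichlet with parameters R₁/2, R₂/2, (M − R₁ − R₂)/2.

```python
import numpy as np

rng = np.random.default_rng(6)
M, R1, R2, reps = 7, 2, 3, 100000
O, _ = np.linalg.qr(rng.standard_normal((M, M)))
A1 = O[:, :R1] @ O[:, :R1].T
A2 = O[:, R1:R1 + R2] @ O[:, R1:R1 + R2].T

z = rng.standard_normal((reps, M))
zz = (z ** 2).sum(axis=1)
r1 = np.einsum('ij,jk,ik->i', z, A1, z) / zz
r2 = np.einsum('ij,jk,ik->i', z, A2, z) / zz
ref = rng.dirichlet([R1 / 2, R2 / 2, (M - R1 - R2) / 2], size=reps)
print(np.quantile(r1, [0.1, 0.5, 0.9]), np.quantile(ref[:, 0], [0.1, 0.5, 0.9]))
print(np.quantile(r2, [0.1, 0.5, 0.9]), np.quantile(ref[:, 1], [0.1, 0.5, 0.9]))
```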
P

The result of Theorem 6.8.15 can be generalized in much the same fashion as Theorem 6.6.7. Take
x to be an M -dimensional random column vector that is distributed as N.0; †/, let P D rank †, sup-
pose that P > 1, and take A1 ; A2 ; : : : ; AK to be M  M symmetric matrices of constants (such that
†A1 †; †A2 †; : : : ; †AK † are nonnull). Further, take € to be a matrix of dimensions P  M such
that † D € 0 € (the existence of which follows from Corollary 2.13.23), take z to be a P -dimensional
random column vector that has an N.0; IP / distribution, and define A D A1 C A2 C    AK . And
observe that x  € 0 z, that (for i D 1; 2; : : : ; K) .€ 0 z/0 Ai € 0 z D z0 €Ai € 0 z, and (in light of the
discourse in Section 6.6d) that .€ 0 z/0 † € 0 z D z0 z. Observe also that €A1 € 0; €A2 € 0; : : : ; €AK € 0
are nonnull (as can be readily verified by making use of Corollary 2.3.4).
Now, suppose that €A€ 0 is idempotent and that
rank.€A1 € 0 / C rank.€A2 € 0 / C    C rank.€AK € 0 / D rank.€A€ 0 / < P: (8.44)
0 0 0
Then, upon applying Theorem 6.8.15 (with €A1 € ; €A2 € ; : : : ; €AK € in place of A1 , A2 , : : : ;
AK ), we find that z0 €A1 € 0 z=z0 z; z0 €A2 € 0 z=z0 z; : : : ; z0 €AK € 0 z=z0 z have a Di R1 =2; R2 =2, : : : ;

PK 0
i D1 Ri =2; K distribution, where (for i D 1; 2; : : : ; K) Ri D rank.€Ai € / D
 
RK =2, P
0 0 0 0 0 0 0 0 0 0
tr.€A
 i € /. And, conversely, if PzK€A1 € z=z z;  z €A2 € z=z z; : : : ; z €AK € z=z z have a
Di R1 =2, R2 =2, : : : ; RK =2, P R
i D1 i =2; K distribution (where R1 ; R2 ; : : : ; RK are strictly
PK
positive integers such that i D1 Ri < P ), then €A€ 0 is idempotent, condition (8.44) is satisfied,
and (for i D 1; 2; : : : ; K) Ri D rank.€Ai € 0 / D tr.€Ai € 0 /.
Clearly, x0A1 x=x0 † x; x0A2 x=x0 † x; : : : ; x0AK x=x0 † x have the same distribution as
z €A1 € 0 z=z0 z; z0 €A2 € 0 z=z0 z; : : : ; z0 €AK € 0 z=z0 z. And by employing the same line of reason-
0

ing as in the proof of Theorem 6.6.2, it can be shown that


€A€ 0 €A€ 0 D €A€ 0 , †A†A† D †A†;
0
that rank.€A€ / D rank.†A†/, that
rank.€Ai € 0 / D rank.†Ai †/ .i D 1; 2; : : : ; K/;
and that
tr.€Ai € 0 / D tr.Ai †/ .i D 1; 2; : : : ; K/:
Thus, we have the following generalization of Theorem 6.8.15.
Theorem 6.8.16. Let x represent an M-dimensional random column vector that has an N(0, Σ) distribution, let P = rank Σ, suppose that P > 1, take A₁, A₂, ..., A_K to be M × M symmetric matrices of constants (such that ΣA₁Σ, ΣA₂Σ, ..., ΣA_KΣ are nonnull), and define A = A₁ + A₂ + ⋯ + A_K. If ΣAΣAΣ = ΣAΣ and
    rank(ΣA₁Σ) + rank(ΣA₂Σ) + ⋯ + rank(ΣA_KΣ) = rank(ΣAΣ) < P,    (8.45)
then x′A₁x/x′Σ⁻x, x′A₂x/x′Σ⁻x, ..., x′A_Kx/x′Σ⁻x have a Di(R₁/2, R₂/2, ..., R_K/2, (P − Σ_{i=1}^K R_i)/2; K) distribution, where (for i = 1, 2, ..., K) R_i = rank(ΣA_iΣ) = tr(A_iΣ). Conversely, if x′A₁x/x′Σ⁻x, x′A₂x/x′Σ⁻x, ..., x′A_Kx/x′Σ⁻x have a Di(R₁/2, R₂/2, ..., R_K/2, (P − Σ_{i=1}^K R_i)/2; K) distribution (where R₁, R₂, ..., R_K are strictly positive integers such that Σ_{i=1}^K R_i < P), then ΣAΣAΣ = ΣAΣ, condition (8.45) is satisfied, and (for i = 1, 2, ..., K) R_i = rank(ΣA_iΣ) = tr(A_iΣ).
The validity of the results of Theorem 6.8.15 is not limited to the case where the distribution of the M-dimensional random column vector z is N(0, I_M). Similarly, the validity of the results of Theorem 6.8.16 is not limited to the case where the distribution of the M-dimensional random column vector x is N(0, Σ). The validity of these results can be extended to a broader class of distributions by employing an approach analogous to that described in Section 6.6e for extending the results of Theorems 6.6.7 and 6.6.8.
Specifically, the validity of the results of Theorem 6.8.15 extends to the case where the distribution of the M-dimensional random column vector z is an arbitrary absolutely continuous spherical distribution. Similarly, the validity of the results of Theorem 6.8.16 extends to the case where the distribution of the M-dimensional random column vector x is that of the vector Γ′z, where (with P = rank Σ) Γ is a P × M matrix such that Σ = Γ′Γ and where z is a P-dimensional random column vector that has an absolutely continuous spherical distribution (i.e., the case where x is distributed elliptically about 0).

d. Some results on matrices

Let us introduce some additional results on matrices, thereby setting the stage for proving the necessity part of Theorem 6.8.3 (on the statistical independence of second-degree polynomials).
Differentiation of a matrix: some basic results. Suppose that, for i = 1, 2, ..., P and j = 1, 2, ..., Q, f_ij(t) is a function of a variable t. And define F(t) to be the P × Q matrix with ijth element f_ij(t), so that F(t) is a matrix-valued function of t. Further, write ∂f_ij(t)/∂t, or simply ∂f_ij/∂t, for the derivative of f_ij(t) at an arbitrary value of t (for which the derivative exists), and write ∂F(t)/∂t, or simply ∂F/∂t, for the P × Q matrix with ijth element ∂f_ij(t)/∂t or ∂f_ij/∂t.
Certain of the basic properties of the derivatives of scalar-valued functions extend in a straightforward way to matrix-valued functions. In particular, if F(t) is a P × Q matrix-valued function of a variable t, then
    ∂F′/∂t = (∂F/∂t)′    (8.46)
and, for any R × P matrix of constants A and any Q × S matrix of constants B,
    ∂AF/∂t = A ∂F/∂t  and  ∂FB/∂t = (∂F/∂t)B.    (8.47)
And if F(t) is a P × Q matrix-valued function and G(t) a Q × S matrix-valued function of a variable t, then
    ∂FG/∂t = F ∂G/∂t + (∂F/∂t)G.    (8.48)
Further, if g(t) is a scalar-valued function and F(t) a matrix-valued function of a variable t, then [writing ∂g/∂t for the derivative of g(t) at an arbitrary value of t]
    ∂gF/∂t = (∂g/∂t)F + g ∂F/∂t.    (8.49)
Refer, for example, to Harville (1997, sec. 15.4) for additional discussion of basic results pertaining to the differentiation of matrix-valued functions of a single variable.
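A finite-difference sketch (the matrix-valued functions F and G below are examples of my own) of the product rule (8.48) for matrix-valued functions of a single variable.

```python
import numpy as np

B = np.array([[1.0, 2.0], [0.0, -1.0]])
C = np.array([[0.5, 0.0], [1.0, 2.0]])
F = lambda t: np.array([[np.cos(t), t], [t ** 2, 1.0]]) @ B
G = lambda t: C + t * np.array([[0.0, 1.0], [3.0, np.sin(t)]])

t0, h = 0.7, 1e-6
num = (F(t0 + h) @ G(t0 + h) - F(t0 - h) @ G(t0 - h)) / (2 * h)   # numerical d(FG)/dt
dF = (F(t0 + h) - F(t0 - h)) / (2 * h)
dG = (G(t0 + h) - G(t0 - h)) / (2 * h)
print(np.allclose(num, F(t0) @ dG + dF @ G(t0), atol=1e-6))       # True
```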
Some results on matrices of the form I − tA (where t is a scalar and A is a symmetric matrix). Let A represent an N × N symmetric matrix. And regard the matrix I − tA as a matrix-valued function of a variable t. Further, take O to be an N × N orthogonal matrix and D an N × N diagonal matrix such that
    A = ODO′,    (8.50)
and denote by d₁, d₂, ..., d_N the diagonal elements of D—the decomposition (8.50) is the spectral decomposition (of the matrix A), the existence of which follows from Theorem 6.7.4.
Clearly,
    I − tA = O(I − tD)O′ = O diag(1 − td₁, 1 − td₂, ..., 1 − td_N) O′.    (8.51)
And in light of result (2.14.25), Lemma 2.14.3, and Corollaries 2.14.19 and 2.14.2, it follows that
    |I − tA| = ∏_{i=1}^N (1 − td_i),    (8.52)
and in light of results (2.5.11) and (2.6.5), that
    (I − tA)⁻¹ = O diag[1/(1 − td₁), 1/(1 − td₂), ..., 1/(1 − td_N)] O′.    (8.53)
Thus,
    ∂|I − tA|/∂t = −Σ_{i=1}^N d_i ∏_{j=1 (j≠i)}^N (1 − td_j),
implying (in light of Theorem 6.7.5) that
    ∂|I − tA|/∂t |_{t=0} = −tr(A).    (8.54)
Further,
    ∂(I − tA)⁻¹/∂t = O diag[d₁/(1 − td₁)², d₂/(1 − td₂)², ..., d_N/(1 − td_N)²] O′ = (I − tA)⁻¹A(I − tA)⁻¹    (8.55)
and [denoting by ∂^kF(t)/∂t^k or ∂^kF/∂t^k the N × N matrix whose ijth element is the kth-order derivative of the ijth element of an N × N matrix F(t) of functions of a variable t]
    ∂²(I − tA)⁻¹/∂t² = [∂(I − tA)⁻¹/∂t]A(I − tA)⁻¹ + (I − tA)⁻¹A[∂(I − tA)⁻¹/∂t] = 2(I − tA)⁻¹A(I − tA)⁻¹A(I − tA)⁻¹.    (8.56)
The matrix I − tA is singular if t = 1/d_i for some i such that d_i ≠ 0; otherwise, I − tA is nonsingular. And formulas (8.53), (8.55), and (8.56) are valid for any t for which I − tA is nonsingular.
Formulas (8.54), (8.55), and (8.56) can be generalized. Let V represent an N × N symmetric nonnegative definite matrix (and continue to take A to be an N × N symmetric matrix and to regard t as a scalar-valued variable). Then, V = R′R for some matrix R (having N columns). And upon observing (in light of Corollary 6.4.2) that
    |I − tAV| = |I − tRAR′|
and applying result (8.54) (with RAR′ in place of A), we find that
    ∂|I − tAV|/∂t |_{t=0} = −tr(RAR′) = −tr(AR′R) = −tr(AV).    (8.57)
Now, suppose that the N × N symmetric nonnegative definite matrix V is positive definite, and take R to be a nonsingular matrix (of order N) such that V = R′R. Then, upon observing that
    V − tA = R′[I − t(R⁻¹)′AR⁻¹]R
{so that (V − tA)⁻¹ = R⁻¹[I − t(R⁻¹)′AR⁻¹]⁻¹(R⁻¹)′} and applying results (8.55) and (8.56), we find that
    ∂(V − tA)⁻¹/∂t = R⁻¹[I − t(R⁻¹)′AR⁻¹]⁻¹(R⁻¹)′A R⁻¹[I − t(R⁻¹)′AR⁻¹]⁻¹(R⁻¹)′ = (V − tA)⁻¹A(V − tA)⁻¹    (8.58)
and that
    ∂²(V − tA)⁻¹/∂t² = 2(V − tA)⁻¹A(V − tA)⁻¹A(V − tA)⁻¹.    (8.59)
Formulas (8.58) and (8.59) are valid for any t for which V − tA is nonsingular.
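A numerical sketch (with arbitrary example matrices of my own) of formulas (8.57) and (8.58): the derivative of |I − tAV| at t = 0 is −tr(AV), and the derivative of (V − tA)⁻¹ is (V − tA)⁻¹A(V − tA)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(7)
N, h, t0 = 4, 1e-6, 0.01
A = rng.standard_normal((N, N)); A = (A + A.T) / 2     # symmetric
R = rng.standard_normal((N, N)); V = R.T @ R           # symmetric positive definite

det = lambda t: np.linalg.det(np.eye(N) - t * A @ V)
print((det(h) - det(-h)) / (2 * h), -np.trace(A @ V))  # these agree: result (8.57)

inv = lambda t: np.linalg.inv(V - t * A)
num = (inv(t0 + h) - inv(t0 - h)) / (2 * h)
print(np.allclose(num, inv(t0) @ A @ inv(t0), atol=1e-4))   # result (8.58)
```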
Some results on determinants. We are now in a position to establish the following result.
Theorem 6.8.17. Let A and B represent N × N symmetric matrices, and let c and d represent (strictly) positive scalars. Then, a necessary and sufficient condition for
    |I − tA − uB| = |I − tA| |I − uB|    (8.60)
for all (scalars) t and u satisfying |t| < c and |u| < d (or, equivalently, for all t and u) is that AB = 0.
Proof. The sufficiency of the condition AB = 0 is evident upon observing that (for all t and u)
    |I − tA||I − uB| = |(I − tA)(I − uB)| = |I − tA − uB + tuAB|.
Now, for purposes of establishing the necessity of this condition, suppose that equality (8.60) holds for all t and u satisfying |t| < c and |u| < d. And let c* (≤ c) represent a (strictly) positive scalar such that I − tA is positive definite whenever |t| < c*—the existence of such a scalar is guaranteed by Lemma 6.5.1—and (for t satisfying |t| < c*) let
    H(t) = (I − tA)⁻¹.
If |t| < c*, then |I − uB| = |I − uBH(t)| and |I − (−u)B| = |I − (−u)BH(t)|, implying that
    |I − u²B²| = |I − u²[BH(t)]²|.    (8.61)
Since each side of equality (8.61) is (for fixed t) a polynomial in u², we have that
    |I − rB²| = |I − r[BH(t)]²|
for every scalar r (and for t such that |t| < c*)—refer to Theorem 6.7.8.
Upon observing that [BH(t)]² = [BH(t)B]H(t), that B² and BH(t)B are symmetric, and that H(t) is symmetric and nonnegative definite, and upon applying results (8.54) and (8.57), we find that, for every scalar t such that |t| < c*,
    tr(B²) = −∂|I − rB²|/∂r |_{r=0} = −∂|I − r[BH(t)]²|/∂r |_{r=0} = tr{[BH(t)]²}.
Thus,
    ∂² tr{[BH(t)]²}/∂t² = ∂² tr(B²)/∂t² = 0    (8.62)
(for t such that |t| < c*). Moreover,
    ∂ tr{[BH(t)]²}/∂t = tr{∂[BH(t)]²/∂t} = tr{B [∂H(t)/∂t] BH(t) + BH(t)B [∂H(t)/∂t]} = 2 tr{BH(t)B [∂H(t)/∂t]},
implying [in light of results (8.55) and (8.56)] that
    ∂² tr{[BH(t)]²}/∂t² = 2 tr{B [∂H(t)/∂t] B [∂H(t)/∂t] + BH(t)B [∂²H(t)/∂t²]}
        = 2 tr[BH(t)AH(t)BH(t)AH(t) + 2BH(t)BH(t)AH(t)AH(t)].    (8.63)
Combining results (8.62) and (8.63) and setting t = 0 gives
    0 = tr[(BA)²] + 2 tr(B²A²)
      = tr[(BA)² + B²A²] + tr(B²A²)
      = tr[B(AB + BA)A] + tr(B²A²)
      = ½ tr[B(AB + BA)A] + ½ tr{[B(AB + BA)A]′} + tr(BA²B)
      = ½ tr[(AB + BA)AB] + ½ tr[A(AB + BA)′B] + tr(BA²B)
      = ½ tr[(AB + BA)′AB] + ½ tr[(AB + BA)′BA] + tr(BA²B)
      = ½ tr[(AB + BA)′(AB + BA)] + tr[(AB)′AB].    (8.64)
Both terms of expression (8.64) are nonnegative and hence equal to 0. Moreover, tr[(AB)′AB] = 0 implies that AB = 0—refer to Lemma 2.3.2. Q.E.D.
In light of Lemma 6.5.2, we have the following variation on Theorem 6.8.17.
Corollary 6.8.18. Let A and B represent N × N symmetric matrices. Then, there exist (strictly) positive scalars c and d such that I − tA, I − uB, and I − tA − uB are positive definite for all (scalars) t and u satisfying |t| < c and |u| < d. And a necessary and sufficient condition for
    log{|I − tA − uB| / (|I − tA||I − uB|)} = 0
for all t and u satisfying |t| < c and |u| < d is that AB = 0.
The following theorem can be regarded as a generalization of Corollary 6.8.18.
Theorem 6.8.19. Let A and B represent N × N symmetric matrices. Then, there exist (strictly) positive scalars c and d such that I − tA, I − uB, and I − tA − uB are positive definite for all (scalars) t and u satisfying |t| < c and |u| < d. And letting h(t, u) represent a polynomial (in t and u), necessary and sufficient conditions for
    log{|I − tA − uB| / (|I − tA||I − uB|)} = h(t, u) / (|I − tA||I − uB||I − tA − uB|)    (8.65)
for all t and u satisfying |t| < c and |u| < d are that AB = 0 and that, for all t and u satisfying |t| < c and |u| < d (or, equivalently, for all t and u), h(t, u) = 0.
Proof (of Theorem 6.8.19). The sufficiency of these conditions is an immediate consequence of Corollary 6.8.18 [as is the existence of positive scalars c and d such that I − tA, I − uB, and I − tA − uB are positive definite for all t and u satisfying |t| < c and |u| < d].
For purposes of establishing their necessity, take u to be an arbitrary scalar satisfying |u| < d, and observe (in light of Corollaries 2.13.12 and 2.13.29) that there exists an N × N nonsingular matrix S such that (I − uB)⁻¹ = S′S. Observe also (in light of Theorem 6.7.4) that there exist N × N matrices P and Q such that A = P diag(d₁, d₂, ..., d_N)P′ and SAS′ = Q diag(f₁, f₂, ..., f_N)Q′ for some scalars d₁, d₂, ..., d_N and f₁, f₂, ..., f_N—d₁, d₂, ..., d_N are the not-necessarily-distinct eigenvalues of A and f₁, f₂, ..., f_N the not-necessarily-distinct eigenvalues of SAS′. Moreover, letting R = rank A, it follows from Theorem 6.7.5 that exactly R of the scalars d₁, d₂, ..., d_N and exactly R of the scalars f₁, f₂, ..., f_N are nonzero; assume (without any essential loss of generality) that it is the first R of the scalars d₁, d₂, ..., d_N and the first R of the scalars f₁, f₂, ..., f_N that are nonzero. Then, in light of results (2.14.25) and (2.14.10) and Corollary 2.14.19, we find that
    |I − tA| = |P[I − diag(td₁, ..., td_{R−1}, td_R, 0, 0, ..., 0)]P′|
        = |diag(1 − td₁, ..., 1 − td_{R−1}, 1 − td_R, 1, 1, ..., 1)|
        = ∏_{i=1}^R (1 − td_i) = ∏_{i=1}^R (−d_i)(t − d_i⁻¹)
and that
    |I − tA − uB| = |S⁻¹(I − tSAS′)(S′)⁻¹|
        = |S|⁻²|I − tSAS′|
        = |S|⁻²|Q[I − diag(tf₁, ..., tf_{R−1}, tf_R, 0, 0, ..., 0)]Q′|
        = |S|⁻²|diag(1 − tf₁, ..., 1 − tf_{R−1}, 1 − tf_R, 1, 1, ..., 1)|
        = |S|⁻² ∏_{i=1}^R (1 − tf_i) = |S|⁻² ∏_{i=1}^R (−f_i)(t − f_i⁻¹),
so that (for fixed u) |I − tA − uB| and |I − tA||I − uB| are polynomials in t. And
    |I − tA||I − uB||I − tA − uB| = |I − uB||S|⁻² ∏_{i=1}^R d_if_i(t − d_i⁻¹)(t − f_i⁻¹),    (8.66)
which (for fixed u) is a polynomial in t of degree 2R with roots d₁⁻¹, d₂⁻¹, ..., d_R⁻¹, f₁⁻¹, f₂⁻¹, ..., f_R⁻¹.
Now, regarding u as fixed, suppose that equality (8.65) holds for all t satisfying |t| < c. Then, in light of equality (8.66), it follows from Theorem 6.7.13 that there exists a real number α(u) such that, for all t,
    h(t, u) = α(u)|I − tA||I − uB||I − tA − uB|
and
    |I − tA − uB| = e^{α(u)}|I − tA||I − uB|.    (8.67)
[In applying Theorem 6.7.13, take x = t, s₁(t) = |I − tA − uB|, s₂(t) = |I − tA||I − uB|, r₁(t) = h(t, u), and r₂(t) = |I − tA||I − uB||I − tA − uB|.] Moreover, upon setting t = 0 in equality (8.67), we find that
    |I − uB| = e^{α(u)}|I − uB|,
implying that e^{α(u)} = 1 or, equivalently, that α(u) = 0. Thus, for all t,
    h(t, u) = 0  and  |I − tA − uB| = |I − tA||I − uB|.
We conclude that if equality (8.65) holds for all t and u satisfying |t| < c and |u| < d, then h(t, u) = 0 and |I − tA − uB| = |I − tA||I − uB| for all t and u satisfying |u| < d, implying (in light of Theorem 6.7.8) that h(t, u) = 0 for all t and u and (in light of Theorem 6.8.17) that AB = 0. Q.E.D.
The cofactors of a (square) matrix. Let A = {a_ij} represent an N × N matrix. And (for i, j = 1, 2, ..., N) let A_ij represent the (N − 1) × (N − 1) submatrix of A obtained by striking out the row and column that contain the element a_ij, that is, by striking out the ith row and the jth column. The determinant |A_ij| of this submatrix is called the minor of the element a_ij; the "signed" minor (−1)^{i+j}|A_ij| is called the cofactor of a_ij.
The determinant of an N × N matrix A can be expanded in terms of the cofactors of the N elements of any particular row or column of A, as described in the following theorem.
Theorem 6.8.20. Let A represent an N × N matrix. And (for i, j = 1, 2, ..., N) let a_ij represent the ijth element of A and let α_ij represent the cofactor of a_ij. Then, for i = 1, 2, ..., N,
    |A| = Σ_{j=1}^N a_ij α_ij = a_i1 α_i1 + a_i2 α_i2 + ⋯ + a_iN α_iN    (8.68)
        = Σ_{j=1}^N a_ji α_ji = a_1i α_1i + a_2i α_2i + ⋯ + a_Ni α_Ni.    (8.69)

For a proof of Theorem 6.8.20, refer, for example, to Harville (1997, sec 13.5). The following
theorem adds to the results of Theorem 6.8.20.
Theorem 6.8.21. Let A represent an N N matrix. And (for i; j D 1; 2; : : : ; N ) let aij represent
the ij th element of A and let ˛ij represent the cofactor of aij . Then, for i 0 ¤ i D 1; : : : ; N ,
XN
aij ˛i 0j D ai1 ˛i 0 1 C ai 2 ˛i 0 2 C    C aiN ˛i 0 N D 0; (8.70)
j D1
and N
X
aj i ˛j i 0 D a1i ˛1i 0 C a2i ˛2i 0 C    C aN i ˛N i 0 D 0: (8.71)
j D1

Proof (of Theorem 6.8.21). Consider result (8.70). Let B represent a matrix whose i′th row equals the ith row of A and whose first, second, ..., (i′ − 1)th, (i′ + 1)th, ..., (N − 1)th, Nth rows are identical to those of A (where i′ ≠ i). Observe that the i′th row of B is a duplicate of its ith row and hence (in light of Lemma 2.14.10) that |B| = 0.
Let b_kj represent the kjth element of B (k, j = 1, 2, ..., N). Clearly, the cofactor of b_{i′j} is the same as that of a_{i′j} (j = 1, 2, ..., N). Thus, making use of Theorem 6.8.20, we find that
    Σ_{j=1}^N a_ij α_{i′j} = Σ_{j=1}^N b_{i′j} α_{i′j} = |B| = 0,
which establishes result (8.70). Result (8.71) can be proved via an analogous argument. Q.E.D.
For any N × N matrix A = {a_ij}, the N × N matrix whose ijth element is the cofactor α_ij of a_ij is called the matrix of cofactors (or cofactor matrix) of A. The transpose of this matrix is called the adjoint or adjoint matrix of A and is denoted by the symbol adj A or adj(A).
There is a close relationship between the adjoint of a nonsingular matrix A and the inverse of A, as is evident from the following theorem and as is made explicit in the corollary of this theorem.
Theorem 6.8.22. For any N × N matrix A,
    A adj(A) = (adj A)A = |A|I_N.

Proof. Let aij represent the ijth element of A and let αij represent the cofactor of aij (i, j = 1, 2, …, N). Then, the ii′th element of the matrix product A adj(A) is Σ_{j=1}^{N} aij α_{i′j} (i, i′ = 1, 2, …, N). Moreover, according to Theorems 6.8.20 and 6.8.21,
    Σ_{j=1}^{N} aij α_{i′j} = |A| if i′ = i, and Σ_{j=1}^{N} aij α_{i′j} = 0 if i′ ≠ i.
Thus, A adj(A) = |A| I. That (adj A)A = |A| I can be established via a similar argument. Q.E.D.
Corollary 6.8.23. If A is an N × N nonsingular matrix, then
    adj A = |A| A⁻¹    (8.72)
or, equivalently,
    A⁻¹ = (1/|A|) adj(A).    (8.73)
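As an informal numerical check of Theorems 6.8.20 and 6.8.22 and Corollary 6.8.23 (not part of the original text), the following sketch computes the cofactors of a small, arbitrarily chosen matrix directly from their definition and verifies the cofactor expansion and the identities A adj(A) = |A| I and A⁻¹ = adj(A)/|A|.

```python
import numpy as np

def cofactor_matrix(A):
    """Matrix of cofactors: alpha_ij = (-1)^(i+j) |A_ij|, where A_ij is A with
    row i and column j struck out."""
    N = A.shape[0]
    C = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return C

A = np.array([[2.0, -1.0, 0.0],
              [1.0,  3.0, 2.0],
              [0.0,  1.0, 4.0]])
C = cofactor_matrix(A)
adjA = C.T                                                         # adjoint = transpose of the cofactor matrix

# Cofactor expansion along the first row (Theorem 6.8.20)
print(np.isclose(np.sum(A[0, :] * C[0, :]), np.linalg.det(A)))     # True
# A adj(A) = |A| I (Theorem 6.8.22) and A^{-1} = adj(A)/|A| (Corollary 6.8.23)
print(np.allclose(A @ adjA, np.linalg.det(A) * np.eye(3)))         # True
print(np.allclose(np.linalg.inv(A), adjA / np.linalg.det(A)))      # True
```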
e. Proof of the "necessity part" of Theorem 6.8.3
As in Theorem 6.8.3, let x represent an M-dimensional random column vector that has an N(μ, IM) distribution. And, for i = 1, 2, …, K, take qi = ci + bi′x + x′Aix, where ci is a constant, bi an M-dimensional column vector of constants, and Ai an M × M symmetric matrix of constants. Suppose that q1, q2, …, qK are statistically independent. We wish to show that (for j ≠ i = 1, 2, …, K)
    AiAj = 0, Aibj = 0, and bi′bj = 0,
thereby proving the necessity part of Theorem 6.8.3.
For i = 1, 2, …, K, let di = bi + 2Aiμ. Further, for j ≠ i = 1, 2, …, K, denote by mij(·, ·) the moment generating function of the joint distribution of qi and qj. And letting i and j represent arbitrary distinct integers between 1 and K, inclusive, observe (in light of the results of Section 6.5) that there exists a neighborhood Nij of the 2 × 1 null vector 0 such that, for (t, u)′ ∈ Nij (where t and u are scalars), I − 2tAi − 2uAj is positive definite and
    mij(t, u) = |I − 2tAi − 2uAj|^{−1/2} exp[t(ci + bi′μ + μ′Aiμ) + u(cj + bj′μ + μ′Ajμ)]
                × exp[(1/2)(tdi + udj)′(I − 2tAi − 2uAj)⁻¹(tdi + udj)].    (8.74)
Observe also (in light of the statistical independence of q1, q2, …, qK) that, for scalars t and u such that (t, u)′ ∈ Nij,
    mij(t, u) = mij(t, 0) mij(0, u).    (8.75)
Upon squaring both sides of equality (8.75) and making use of formula (8.74), we find that, for t and u such that (t, u)′ ∈ Nij,
    log[ |I − 2tAi − 2uAj| / (|I − 2tAi| |I − 2uAj|) ] = rij(t, u),    (8.76)
where
    rij(t, u) = (tdi + udj)′(I − 2tAi − 2uAj)⁻¹(tdi + udj) − t² di′(I − 2tAi)⁻¹di − u² dj′(I − 2uAj)⁻¹dj.    (8.77)
In light of Corollary 6.8.23,
    (I − 2tAi − 2uAj)⁻¹ = (1/|I − 2tAi − 2uAj|) adj(I − 2tAi − 2uAj)
and, similarly,
    (I − 2tAi)⁻¹ = (1/|I − 2tAi|) adj(I − 2tAi) and (I − 2uAj)⁻¹ = (1/|I − 2uAj|) adj(I − 2uAj).
Moreover, the elements of adj(I − 2tAi − 2uAj), adj(I − 2tAi), and adj(I − 2uAj) are polynomials in t and/or u. And |I − 2tAi − 2uAj|, |I − 2tAi|, and |I − 2uAj| are also polynomials in t and/or u. Thus, rij(t, u) is expressible in the form
    rij(t, u) = hij(t, u) / [ |I − 2tAi| |I − 2uAj| |I − 2tAi − 2uAj| ],    (8.78)
where hij(t, u) is a polynomial in t and u.
In light of results (8.76) and (8.78), it follows from Theorem 6.8.19 that AiAj = 0. It also follows that hij(t, u) = 0 for all scalars t and u and hence that rij(t, u) = 0 for all t and u such that (t, u)′ ∈ Nij. Moreover, by making use of various of the results of Section 6.8d on matrix differentiation, it can be shown (via a straightforward, though tedious, exercise) that
    ∂²rij(t, u)/∂t∂u |_{t=u=0} = 2di′dj    (8.79)
and that
    ∂⁴rij(t, u)/∂t²∂u² |_{t=u=0} = 16(Ajdi)′Ajdi + 16(Aidj)′Aidj.    (8.80)
And upon observing that the partial derivatives of rij(t, u), evaluated at t = u = 0, equal 0, it follows from result (8.79) that di′dj = 0 and from result (8.80) that Ajdi = 0 and Aidj = 0. Thus,
    Aibj = Ai(dj − 2Ajμ) = Aidj − 2AiAjμ = 0
and
    bi′bj = (di − 2Aiμ)′(dj − 2Ajμ) = di′dj − 2μ′Aidj − 2(Ajdi)′μ + 4μ′AiAjμ = 0.
The proof of the necessity part of Theorem 6.8.3 is now complete.
Exercises
Exercise 1. Let x represent a random variable whose distribution is Ga(α, β), and let c represent a (strictly) positive constant. Show that cx ~ Ga(α, cβ) [thereby verifying result (1.2)].
Exercise 2. Let w represent a random variable whose distribution is Ga(α, β), where α is a (strictly positive) integer. Show that (for any strictly positive scalar t)
    Pr(w ≤ t) = Pr(u ≥ α),
where u is a random variable whose distribution is Poisson with parameter t/β [so that Pr(u = s) = e^{−t/β}(t/β)^s/s! for s = 0, 1, 2, …].
Exercise 3. Let u and w represent random variables that are distributed independently as Be(α, δ) and Be(α+δ, γ), respectively. Show that uw ~ Be(α, γ+δ).
Exercise 4. Let x represent a random variable whose distribution is Be(α, γ).
(a) Show that, for r > −α,
    E(x^r) = Γ(α+r)Γ(α+γ) / [Γ(α)Γ(α+γ+r)].
(b) Show that
    E(x) = α/(α+γ) and var(x) = αγ / [(α+γ)²(α+γ+1)].
Exercise 5. Take x1, x2, …, xK to be K random variables whose joint distribution is Di(α1, α2, …, αK, α_{K+1}; K), define x_{K+1} = 1 − Σ_{k=1}^{K} xk, and let α = α1 + ⋯ + αK + α_{K+1}.
(a) Generalize the results of Part (a) of Exercise 4 by showing that, for r1 > −α1, …, rK > −αK, and r_{K+1} > −α_{K+1},
    E(x1^{r1} ⋯ xK^{rK} x_{K+1}^{r_{K+1}}) = [Γ(α) / Γ(α + Σ_{k=1}^{K+1} rk)] Π_{k=1}^{K+1} [Γ(αk + rk)/Γ(αk)].
(b) Generalize the results of Part (b) of Exercise 4 by showing that (for an arbitrary integer k between 1 and K+1, inclusive)
    E(xk) = αk/α and var(xk) = αk(α − αk) / [α²(α + 1)]
and that (for any 2 distinct integers j and k between 1 and K+1, inclusive)
    cov(xj, xk) = −αjαk / [α²(α + 1)].
Exercise 6. Verify that the function b(·) defined by expression (1.48) is a pdf of the chi distribution with N degrees of freedom.
Exercise 7. For strictly positive integers J and K, let s1, …, sJ, s_{J+1}, …, s_{J+K} represent J+K random variables whose joint distribution is Di(α1, …, αJ, α_{J+1}, …, α_{J+K}, α_{J+K+1}; J+K). Further, for k = 1, 2, …, K, let xk = s_{J+k} / (1 − Σ_{j=1}^{J} sj). Show that the conditional distribution of x1, x2, …, xK given s1, s2, …, sJ is Di(α_{J+1}, …, α_{J+K}, α_{J+K+1}; K).
Exercise 8. Let z1, z2, …, zM represent random variables whose joint distribution is absolutely continuous with a pdf f(·, ·, …, ·) of the form f(z1, z2, …, zM) = g(Σ_{i=1}^{M} zi²) [where g(·) is a nonnegative function of a single nonnegative variable]. Verify that the function b(·) defined by expression (1.53) is a pdf of the distribution of the random variable (Σ_{i=1}^{M} zi²)^{1/2}. (Note. This exercise can be regarded as a more general version of Exercise 6.)
Exercise 9. Use the procedure described in Section 6.2a to construct a 6 × 6 orthogonal matrix whose first row is proportional to the vector (0, 3, 4, 2, 0, 1).
Exercise 10. Let x1 and x2 represent M-dimensional column vectors.
(a) Use the results of Section 6.2a (pertaining to Helmert matrices) to show that if x2′x2 = x1′x1, then there exist orthogonal matrices O1 and O2 such that O2x2 = O1x1.
(b) Use the result of Part (a) to devise an alternative proof of the "only if" part of Lemma 5.9.9.
Exercise 11. Let w represent a random variable whose distribution is χ²(N, λ). Verify that the expressions for E(w) and E(w²) provided by formula (2.36) are in agreement with those provided by results (2.33) and (2.34) [or, equivalently, by results (2.28) and (2.30)].
Exercise 12. Let w1 and w2 represent random variables that are distributed independently as Ga(α1, β, δ1) and Ga(α2, β, δ2), respectively, and define w = w1 + w2. Derive the pdf of the distribution of w by starting with the pdf of the joint distribution of w1 and w2 and introducing a suitable change of variables. [Note. This derivation serves the purpose of verifying that w ~ Ga(α1+α2, β, δ1+δ2) and (when coupled with a mathematical-induction argument) represents an alternative way of establishing Theorem 6.2.2 (and Theorem 6.2.1).]
Exercise 13. Let x = μ + z, where z is an N-dimensional random column vector that has an absolutely continuous spherical distribution and where μ is an N-dimensional nonrandom column vector. Verify that in the special case where z ~ N(0, I), the pdf q(·) derived in Section 6.2h for the distribution of x′x "simplifies to" (i.e., is reexpressible in the form of) the expression (2.15) given in Section 6.2c for the pdf of the noncentral chi-square distribution [with N degrees of freedom and with noncentrality parameter λ (= μ′μ)].
Exercise 14. Let u and v represent random variables that are distributed independently as χ²(M) and χ²(N), respectively. And define w = (u/M)/(v/N). Devise an alternative derivation of the pdf of the SF(M, N) distribution by (1) deriving the pdf of the joint distribution of w and v and by (2) determining the pdf of the marginal distribution of w from the pdf of the joint distribution of w and v.
Exercise 15. Let t = z/√(v/N), where z and v are random variables that are statistically independent with z ~ N(0, 1) and v ~ χ²(N) [in which case t ~ St(N)].
(a) Starting with the pdf of the joint distribution of z and v, derive the pdf of the joint distribution of t and v.
(b) Derive the pdf of the St(N) distribution from the pdf of the joint distribution of t and v, thereby providing an alternative to the derivation given in Part 2 of Section 6.4a.
Exercise 16. Let t = (x1 + x2)/|x1 − x2|, where x1 and x2 are random variables that are distributed independently and identically as N(μ, σ²) (with σ > 0). Show that t has a noncentral t distribution, and determine the values of the parameters (the degrees of freedom and the noncentrality parameter) of this distribution.
Exercise 17. Let t represent a random variable that has an St(N, μ) distribution. And take r to be an arbitrary one of the integers 1, 2, … that are less than N. Generalize expressions (4.38) and (4.39) [for E(t¹) and E(t²), respectively] by obtaining an expression for E(t^r) (in terms of μ). (Note. This exercise is closely related to Exercise 3.12.)
Exercise 18. Let t represent an M-dimensional random column vector that has an MVt(N, IM) distribution. And let w = t′t. Derive the pdf of the distribution of w in each of the following two ways: (1) as a special case of the pdf (1.51) and (2) by making use of the relationship (4.50).
Exercise 19. Let x represent an M-dimensional random column vector whose distribution has as a pdf a function f(·) that is expressible in the following form: for all x,
    f(x) = ∫₀^∞ h(x | u) g(u) du,
where g(·) is the pdf of the distribution of a strictly positive random variable u and where (for every u) h(· | u) is the pdf of the N(0, u⁻¹IM) distribution.
(a) Show that the distribution of x is spherical.
(b) Show that the distribution of u can be chosen in such a way that f(·) is the pdf of the MVt(N, IM) distribution.
Exercise 20. Show that if condition (6.7) of Theorem 6.6.2 is replaced by the condition
    Σ(b + 2Aμ) ∈ C(ΣAΣ),
the theorem is still valid.
Exercise 21. Let x represent an M-dimensional random column vector that has an N(μ, Σ) distribution (where Σ ≠ 0), and take G to be a symmetric generalized inverse of Σ. Show that
    x′Gx ~ χ²(rank Σ, μ′Gμ)
if μ ∈ C(Σ) or GΣG = G. [Note. A symmetric generalized inverse G is obtainable from a possibly nonsymmetric generalized inverse, say H, by taking G = (1/2)H + (1/2)H′; the condition GΣG = G is the second of the so-called Moore–Penrose conditions—refer, e.g., to Harville (1997, chap. 20) for a discussion of the Moore–Penrose conditions.]
Exercise 22. Let z represent an N-dimensional random column vector. And suppose that the distribution of z is an absolutely continuous spherical distribution, so that the distribution of z has as a pdf a function f(·) such that (for all z) f(z) = g(z′z), where g(·) is a (nonnegative) function of a single nonnegative variable. Further, take z1 to be an M-dimensional subvector of z (where M < N), and let v = z1′z1.
(a) Show that the distribution of v has as a pdf the function h(·) defined as follows: for v > 0,
    h(v) = {π^{N/2} / (Γ(M/2) Γ[(N−M)/2])} v^{(M/2)−1} ∫₀^∞ w^{[(N−M)/2]−1} g(v+w) dw;
for v ≤ 0, h(v) = 0.
(b) Verify that in the special case where z ~ N(0, IN), h(·) simplifies to the pdf of the χ²(M) distribution.
Exercise 23. Let z = (z1, z2, …, zM)′ represent an M-dimensional random (column) vector that has a spherical distribution. And take A to be an M × M symmetric idempotent matrix of rank R (where R ≥ 1).
(a) Starting from first principles (i.e., from the definition of a spherical distribution), use the results of Theorems 5.9.5 and 6.6.6 to show (1) that z′Az ~ Σ_{i=1}^{R} zi² and [assuming that Pr(z ≠ 0) = 1] (2) that z′Az/z′z ~ Σ_{i=1}^{R} zi² / Σ_{i=1}^{M} zi².
(b) Provide an alternative "derivation" of results (1) and (2) of Part (a); do so by showing that (when z has an absolutely continuous spherical distribution) these two results can be obtained by applying Theorem 6.6.7 (and by making use of the results of Sections 6.1f and 6.1g).
Exercise 24. Let A represent an N × N symmetric matrix. And take Q to be an N × N orthogonal matrix and D an N × N diagonal matrix such that A = QDQ′—the decomposition A = QDQ′ is the spectral decomposition, the existence and properties of which are established in Section 6.7a. Further, denote by d1, d2, …, dN the diagonal elements of D (which are the not-necessarily-distinct eigenvalues of A), and taking D⁺ to be the N × N diagonal matrix whose ith diagonal element is di⁺, where di⁺ = 0 if di = 0 and where di⁺ = 1/di if di ≠ 0, define A⁺ = QD⁺Q′. Show that (1) AA⁺A = A (i.e., A⁺ is a generalized inverse of A) and also that (2) A⁺AA⁺ = A⁺, (3) AA⁺ is symmetric, and (4) A⁺A is symmetric—as discussed, e.g., by Harville (1997, chap. 20), these four conditions are known as the Moore–Penrose conditions and they serve to determine a unique matrix A⁺ that is known as the Moore–Penrose inverse.
Exercise 25. Let Σ represent an N × N symmetric nonnegative definite matrix, and take Γ1 to be a P1 × N matrix and Γ2 a P2 × N matrix such that Σ = Γ1′Γ1 = Γ2′Γ2. Further, take A to be an N × N symmetric matrix. And assuming that P2 ≥ P1 (as can be done without any essential loss of generality), show that the P2 not-necessarily-distinct eigenvalues of the P2 × P2 matrix Γ2AΓ2′ consist of the P1 not-necessarily-distinct eigenvalues of the P1 × P1 matrix Γ1AΓ1′ and of P2 − P1 zeroes. (Hint. Make use of Corollary 6.4.2.)
Exercise 26. Let A represent an M × M symmetric matrix and Σ an M × M symmetric nonnegative definite matrix. Show that the condition ΣAΣAΣ = ΣAΣ (which appears in Theorem 6.6.2) is equivalent to each of the following three conditions:
(1) (AΣ)³ = (AΣ)²;
(2) tr[(AΣ)²] = tr[(AΣ)³] = tr[(AΣ)⁴]; and
(3) tr[(AΣ)²] = tr(AΣ) = rank(ΣAΣ).
Exercise 27. Let z represent an M-dimensional random column vector that has an N(0, IM) distribution, and take q = c + b′z + z′Az, where c is a constant, b an M-dimensional column vector of constants, and A an M × M (nonnull) symmetric matrix of constants. Further, denote by m(·) the moment generating function of q. Provide an alternative derivation of the "sufficiency part" of Theorem 6.6.1 by showing that if A² = A, b = Ab, and c = (1/4)b′b, then, for every scalar t in some neighborhood of 0, m(t) = m*(t), where m*(·) is the moment generating function of a χ²(R, c) distribution and where R = rank A = tr(A).
Exercise 28. Let x represent an M-dimensional random column vector that has an N(0, Σ) distribution, and denote by A an M × M symmetric matrix of constants. Construct an example where M = 3 and where Σ and A are such that AΣ is not idempotent but are nevertheless such that x′Ax has a chi-square distribution.
Exercise 29. Let x represent an M-dimensional random column vector that has an N(0, Σ) distribution. Further, partition x and Σ as
    x = [ x1 ; x2 ] and Σ = [ Σ11  Σ12 ; Σ21  Σ22 ]
(where the dimensions of Σ11 are the same as the dimension of x1). And take G1 to be a generalized inverse of Σ11 and G2 a generalized inverse of Σ22. Show that x1′G1x1 and x2′G2x2 are distributed independently if and only if Σ12 = 0.
Exercise 30. Let x represent an M-dimensional random column vector that has an N(μ, IM) distribution. And, for i = 1, 2, …, K, take qi = ci + bi′x + x′Aix, where ci is a constant, bi an M-dimensional column vector of constants, and Ai an M × M symmetric matrix of constants. Further, denote by m(·, ·, …, ·) the moment generating function of the joint distribution of q1, q2, …, qK. Provide an alternative derivation of the "sufficiency part" of Theorem 6.8.3 by showing that if, for j ≠ i = 1, 2, …, K, AiAj = 0, Aibj = 0, and bi′bj = 0, then there exist (strictly) positive scalars h1, h2, …, hK such that, for any scalars t1, t2, …, tK for which |t1| < h1, |t2| < h2, …, |tK| < hK,
    m(t1, t2, …, tK) = m(t1, 0, 0, …, 0) m(0, t2, 0, 0, …, 0) ⋯ m(0, …, 0, 0, tK).
Exercise 31. Let x represent an M-dimensional random column vector that has an N(0, Σ) distribution. And take A1 and A2 to be M × M symmetric nonnegative definite matrices of constants. Show that the two quadratic forms x′A1x and x′A2x are statistically independent if and only if they are uncorrelated.
Exercise 32. Let x represent an M-dimensional random column vector that has an N(μ, IM) distribution. Show (by producing an example) that there exist quadratic forms x′A1x and x′A2x (where A1 and A2 are M × M symmetric matrices of constants) that are uncorrelated for every μ ∈ R^M but that are not statistically independent for any μ ∈ R^M.
Bibliographic and Supplementary Notes
§1e. Theorem 6.1.5 is more or less identical to Theorem 1.4 of Fang, Kotz, and Ng (1990).
§2b. In some presentations (e.g., Ravishanker and Dey 2002, sec. 5.3), the noncentrality parameter of the noncentral chi-square distribution is defined to be μ′μ/2 (instead of μ′μ).
§2h. Refer, for example, to Cacoullos and Koutras (1984) for a considerably more extensive discussion of the distribution of the random variable x′x, where x = μ + z for some N-dimensional spherically distributed random column vector z and for some N-dimensional nonrandom column vector μ.
§3. The contributions of (George W.) Snedecor to the establishment of Snedecor’s F distribution would
seem to be more modest than might be inferred from the terminology. Snedecor initiated the practice of using F
as a symbol for a random variable whose distribution is that of the ratio (3.1) and indicated that he had done so to
honor the contributions of R. A. Fisher, though, subsequently, in a letter to H. W. Heckstall-Smith—refer, e.g.,
to page 319 of the volume edited by Bennett (1990)—Fisher dismissed Snedecor’s gesture as an “afterthought.”
And Snedecor was among the first to present tables of the percentage points of the F distribution; the earliest
tables, which were those of Fisher, were expressed in terms related to the distribution of the logarithm of the
ratio (3.1).
§4. The t distribution is attributable to William Sealy Gosset (1876–1937). His work on this distribution
took place while he was in the employ of the Guinness Brewery and was published under the pseudonym Student
(Student 1908)—refer, e.g., to Zabell (2008).
§4c. This subsection is essentially a replicate of the first part of Section 18.1a of Harville (1997).
§5a. The results of Part 1 of this subsection more or less duplicate results presented in Section 18.3 of
Harville (1997).
§6d and §6e. The inspiration for the content of these subsections (and for Exercises 22 and 23) came from
results like those presented by Anderson and Fang (1987).
§7a. The polynomial p(λ) and equation p(λ) = 0 referred to herein as the characteristic polynomial and characteristic equation differ by a factor of (−1)^N from what some authors refer to as the characteristic polynomial and equation; those authors refer to the polynomial q(λ) obtained by taking (for all λ) q(λ) = |λIN − A| as the characteristic polynomial and/or to the equation q(λ) = 0 as the characteristic equation. Theorem 6.7.1 is essentially the same as Theorem 21.5.6 of Harville (1997), and the approach taken in devising a proof is a variation on the approach taken by Harville.
§7c. Theorem 6.7.13 can be regarded as a variant of a “lemma” on polynomials in 2 variables that was stated
(without proof) by Laha (1956) and that has come to be identified (at least among statisticians) with Laha’s
name—essentially the same lemma appears (along with a proof) in Ogawa’s (1950) paper. Various approaches
to the proof of Laha’s lemma are discussed by Driscoll and Gundberg (1986, sec. 3)—refer also to Driscoll and
Krasnicka (1995, sec. 4).
§7d. The proof (of the “necessity part” of Theorem 6.6.1) presented in Section 6.7d makes use of Theorem
6.7.13 (on polynomials). Driscoll (1999) and Khatri (1999) introduced (in the context of Corollary 6.6.4) an
alternative proof of necessity—refer also to Ravishanker and Dey (2002, sec. 5.4) and to Khuri (2010, sec. 5.2).
§8a (and §8e). The result presented herein as Corollary 6.8.2 includes as a special case a result that has
come to be widely known as Craig’s theorem (in recognition of the contributions of A. T. Craig) and that (to
acknowledge the relevance of the work of H. Sakamoto and/or K. Matusita) is also sometimes referred to as
the Craig–Sakamoto or Craig–Sakamoto–Matusita theorem. Craig’s theorem has a long and tortuous history
that includes numerous attempts at proofs of necessity, many of which have been judged to be incomplete or
otherwise flawed or deficient. Accounts of this history are provided by, for example, Driscoll and Gundberg
(1986), Reid and Driscoll (1988), and Driscoll and Krasnicka (1995) and, more recently, Ogawa and Olkin
(2008).
§8c. For more on the kind of variations on Cochran’s theorem that are the subject of this subsection, refer,
e.g., to Anderson and Fang (1987).
§8d. The proof of Theorem 6.8.17 is based on a proof presented by Rao and Mitra (1971, pp. 170–171).
For some discussion (of a historical nature and also of a more general nature) pertaining to alternative proofs
of the result of Theorem 6.8.17, refer to Ogawa and Olkin (2008).
§8e. The proof (of the “necessity part” of Theorem 6.8.3) presented in Section 6.8e is based on Theorem
6.8.19, which was proved (in Section 6.8d) by making use of Theorem 6.7.13 (on polynomials). Reid and
Driscoll (1988) and Driscoll and Krasnicka (1995) introduced an alternative proof of the necessity of the
conditions under which two or more quadratic forms (in a normally distributed random vector) are distributed
independently—refer also to Khuri (2010, sec. 5.3).
Exercise 7. The result of Exercise 7 is essentially the same as Theorem 1.6 of Fang, Kotz, and Ng (1990).
Exercise 26. Conditions (1) and (3) of Exercise 26 correspond to conditions given by Shanbhag (1968).
7
Confidence Intervals (or Sets) and Tests of Hypotheses
Suppose that y is an N × 1 observable random vector that follows a G–M model. And suppose that we wish to make inferences about a parametric function of the form λ′β (where λ is a P × 1 vector of constants) or, more generally, about a vector of such parametric functions. Or suppose that we wish to make inferences about the realization of an unobservable random variable whose expected value is of the form λ′β or about the realization of a vector of such random variables. Inferences that take the form of point estimation or prediction were considered in Chapter 5. The present chapter is devoted to inferences that take the form of an interval or set of values. More specifically, it is devoted to confidence intervals and sets (and to the corresponding tests of hypotheses).

7.1 "Setting the Stage": Response Surfaces in the Context of a Specific Application and in General
Recall (from Section 4.2e) the description and discussion of the experimental study of how the yield of lettuce plants is affected by the levels of the three trace minerals Cu, Mo, and Fe. Let u = (u1, u2, u3)′, where u1, u2, and u3 represent the transformed amounts of Cu, Mo, and Fe, respectively. The data from the experimental study consisted of the 20 yields listed in column 1 of Table 4.3. The ith of these yields came from the plants in a container in which the value, say ui = (ui1, ui2, ui3)′, of u is the 3 × 1 vector whose elements ui1, ui2, and ui3 are the values of u1, u2, and u3 (i.e., of the transformed amounts of Cu, Mo, and Fe) listed in the ith row of Table 4.3 (i = 1, 2, …, 20). The 20 yields are regarded as the observed values of random variables y1, y2, …, y20, respectively. It is assumed that E(yi) = δ(ui) (i = 1, 2, …, 20) for some function δ(u) of the vector u. The function δ(u) is an example of what (in geometric terms) is customarily referred to as a response surface.
Among the choices for the function δ(u) is that of a polynomial. Whether or not such a choice is likely to be satisfactory depends in part on the degree of the polynomial and on the relevant domain (i.e., relevant set of u-values). In the case of a first-, second-, or third-order polynomial,
    δ(u) = β1 + β2u1 + β3u2 + β4u3,    (1.1)
    δ(u) = β1 + β2u1 + β3u2 + β4u3 + β11u1² + β12u1u2 + β13u1u3 + β22u2² + β23u2u3 + β33u3²,    (1.2)
or
    δ(u) = β1 + β2u1 + β3u2 + β4u3 + β11u1² + β12u1u2 + β13u1u3 + β22u2² + β23u2u3 + β33u3²
           + β111u1³ + β112u1²u2 + β113u1²u3 + β122u1u2² + β123u1u2u3 + β133u1u3² + β222u2³ + β223u2²u3 + β233u2u3² + β333u3³,    (1.3)
respectively. Here, the coefficients βj (j = 1, 2, …, 4), βjk (j = 1, 2, 3; k = j, …, 3), and βjkℓ (j = 1, 2, 3; k = j, …, 3; ℓ = k, …, 3) are regarded as unknown (unconstrained) parameters. Upon taking the function δ(u) to be the first-, second-, or third-order polynomial (1.1), (1.2), or (1.3), we obtain a G–M model (with P = 4, 10, or 20, respectively).
Let us (in the present context) refer to the three G–M models corresponding to the three choices (1.1), (1.2), and (1.3) for δ(u) as the first-order, second-order, and third-order models, respectively. Each of these models is expressible in terms related to the general formulation (in Section 4.1) of the G–M model. In each case, N equals 20, and for the sake of consistency with the notation introduced in Section 4.1 (where y1, y2, …, yN were taken to be the first through Nth elements of y) take y to be the 20 × 1 random vector whose ith element is the random variable yi (the observed value of which is the ith of the lettuce yields listed in column 1 of Table 4.3). Further, letting β1 = (β1, β2, β3, β4)′, β2 = (β11, β12, β13, β22, β23, β33)′, and β3 = (β111, β112, β113, β122, β123, β133, β222, β223, β233, β333)′, take β for the first-, second-, or third-order model to be β = β1, β = (β1′, β2′)′, or β = (β1′, β2′, β3′)′, respectively (in which case P = 4, P = 10, or P = 20, respectively). Then, letting X1 represent the 20×4 matrix with ith row (1, ui′) = (1, ui1, ui2, ui3), X2 the 20×6 matrix with ith row (ui1², ui1ui2, ui1ui3, ui2², ui2ui3, ui3²), and X3 the 20×10 matrix with ith row (ui1³, ui1²ui2, ui1²ui3, ui1ui2², ui1ui2ui3, ui1ui3², ui2³, ui2²ui3, ui2ui3², ui3³), the model matrix X for the first-, second-, or third-order model is X = X1, X = (X1, X2), or X = (X1, X2, X3), respectively.
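As a computational aside (not part of the original text), the sketch below builds the model matrices X1, X2, and X3 from a 20 × 3 array U whose ith row holds (ui1, ui2, ui3); the array U used here is a randomly generated placeholder for the transformed levels listed in Table 4.3, which are not reproduced here.

```python
import numpy as np

def model_matrices(U):
    """Return X1 (first-order terms), X2 (pure second-order terms), and X3
    (pure third-order terms) for the rows u_i = (u_i1, u_i2, u_i3) of U."""
    u1, u2, u3 = U[:, 0], U[:, 1], U[:, 2]
    X1 = np.column_stack([np.ones(len(U)), u1, u2, u3])
    X2 = np.column_stack([u1**2, u1*u2, u1*u3, u2**2, u2*u3, u3**2])
    X3 = np.column_stack([u1**3, u1**2*u2, u1**2*u3, u1*u2**2, u1*u2*u3,
                          u1*u3**2, u2**3, u2**2*u3, u2*u3**2, u3**3])
    return X1, X2, X3

# U is a hypothetical stand-in for the 20 design points of Table 4.3.
U = np.random.default_rng(0).uniform(-1, 1, size=(20, 3))
X1, X2, X3 = model_matrices(U)
X_first, X_second, X_third = X1, np.hstack([X1, X2]), np.hstack([X1, X2, X3])
print(X_first.shape, X_second.shape, X_third.shape)   # (20, 4) (20, 10) (20, 20)
```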
The lettuce-yield data were obtained from a designed experiment. What seemed to be of interest
were inferences about the response surface over a targeted region and perhaps inferences about the
yields of lettuce that might be obtained from plants grown in the future under conditions similar
to those present in the experimental study. Of particular interest were various characteristics of the
response surface; these included the presence and magnitude of any interactions among Cu, Mo, and
Fe (in their effects on yield) and the location of the optimal combination of levels of Cu, Mo, and Fe
(i.e., the combination that results in the largest expected yield).
The range of u-values for which the data were to be used to make inferences about the response
surface (or its characteristics) and the range of u-values represented in the experiment were such
that (in light of information available from earlier studies) a second-order model was adopted.
The intended design was of a kind known as a rotatable central composite design (e.g., Myers,
Montgomery, and Anderson-Cook 2016, chap. 8). However, due to an error, the experiment was
carried out in such a way that the level of Fe (on the transformed scale) was 0.4965 in the containers for which the intended level was 1—there were 4 such containers.
Suppose that δ(u) is the second-order polynomial (1.2) in the elements of the vector u = (u1, u2, u3)′. Or, more generally, suppose that δ(u) is the second-order polynomial (4.2.14) in the elements of the vector u = (u1, u2, …, uC)′. Then, δ(u) is reexpressible in matrix notation. Clearly,
    δ(u) = β1 + a′u + u′Au,    (1.4)
where a = (β2, β3, …, β_{C+1})′ and where A is the C × C symmetric matrix with ith diagonal element βii and with (for j > i) ijth and jith off-diagonal elements βij/2.
An expression for the gradient vector of the second-order polynomial δ(u) is obtained upon applying results (5.4.7) and (5.4.10) on vector differentiation. We find that
    ∂δ(u)/∂u = a + 2Au.    (1.5)
By definition, a point, say u0, is a stationary point of δ(u) if and only if ∂δ(u)/∂u |_{u=u0} = 0 and hence if and only if u = u0 is a solution to the following linear system (in the vector u):
    2Au = −a,    (1.6)
in which case
    δ(u0) = β1 + a′u0 + (Au0)′u0 = β1 + (1/2)a′u0.    (1.7)
The linear system (1.6) is consistent (i.e., has a solution) if the matrix A is nonsingular, in which case the linear system has a unique solution u0 that is expressible as
    u0 = −(1/2)A⁻¹a.    (1.8)
More generally, linear system (1.6) is consistent if (and only if) a ∈ C(A).
In light of result (5.4.11), the Hessian matrix of δ(u) is expressible as follows:
    ∂²δ(u)/∂u∂u′ = 2A.    (1.9)
If the matrix −A is nonnegative definite, then the stationary points of δ(u) are points at which δ(u) attains a maximum value—refer, e.g., to Harville (1997, sec. 19.1).
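To make the algebra of (1.5)–(1.9) concrete, here is a small numerical sketch (the coefficient values are invented for illustration and are not taken from the lettuce study): it forms a and A from second-order coefficients, solves 2Au = −a for the stationary point, and checks the definiteness of the Hessian 2A.

```python
import numpy as np

# Hypothetical second-order coefficients for C = 3 factors (illustrative only).
beta1 = 24.0
a = np.array([4.5, 0.8, 0.1])                    # (beta_2, beta_3, beta_4)'
A = np.array([[-5.3, -0.65, 0.65],               # diagonal: beta_ii; off-diagonal: beta_ij / 2
              [-0.65, -0.7, -0.33],
              [0.65, -0.33, -5.3]])

u0 = np.linalg.solve(2.0 * A, -a)                # stationary point: solution of 2 A u = -a, per (1.6), (1.8)
delta_u0 = beta1 + 0.5 * a @ u0                  # value at the stationary point, per (1.7)

hessian_eigs = np.linalg.eigvalsh(2.0 * A)       # Hessian of delta(u) is 2A, per (1.9)
is_max = np.all(hessian_eigs < 0)                # all eigenvalues negative => u0 is a maximum
print(u0, delta_u0, is_max)
```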
Let us consider the use of the data on lettuce yields in making inferences (on the basis of the second-order model) about the response surface and various of its characteristics. Specifically, let us consider the use of these data in making inferences about the value of the second-order polynomial δ(u) for each value of u in the relevant region. And let us consider the use of these data in making inferences about the parameters of the second-order model [and hence about the values of the first-order derivatives (at u = 0) and second-order derivatives of the second-order polynomial δ(u)] and in making inferences about the location of the stationary points of δ(u).
The (20 × 10) model matrix X = (X1, X2) is of full column rank 10—the model matrix is sufficiently simple that its rank can be determined without resort to numerical means. Thus, all ten of the parameters that form the elements of β are estimable (and every linear combination of these parameters is estimable).
Let β̂ = (β̂1, β̂2, β̂3, β̂4, β̂11, β̂12, β̂13, β̂22, β̂23, β̂33)′ represent the least squares estimator of β, that is, the 10 × 1 vector whose elements are the least squares estimators of the corresponding elements of β; and let σ̂² represent the usual unbiased estimator of σ². The least squares estimate of β (the value assumed by β̂ at the observed data vector, the elements of which are listed in column 1 of Table 4.3) is obtainable as the (unique) solution, say b̃, to the linear system X′Xb = X′y (in the vector b) comprising the so-called normal equations. The residual sum of squares equals 108.9407; dividing this quantity by N − P = 10 (to obtain the value of σ̂²) gives 10.89 as an estimate of σ². Upon taking the square root of this estimate of σ², we obtain 3.30 as an estimate of σ. The variance-covariance matrix of β̂ is expressible as var(β̂) = σ²(X′X)⁻¹ and is estimated unbiasedly by the matrix σ̂²(X′X)⁻¹. The standard errors of the elements of β̂ are given by the square roots of the diagonal elements of σ²(X′X)⁻¹, and the estimated standard errors (those corresponding to the estimator σ̂²) are given by the square roots of the values of the diagonal elements of σ̂²(X′X)⁻¹.
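The computations just described are easy to reproduce; the sketch below is an illustration rather than the original analysis (y and X are hypothetical placeholders for the lettuce data vector and the second-order model matrix). It obtains β̂ from the normal equations, σ̂² from the residual sum of squares, and the estimated standard errors and the correlation matrix of β̂.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 10
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=(n, p - 1))])  # stand-in model matrix
y = rng.normal(25.0, 3.0, size=n)                                       # stand-in data vector

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                      # solution of the normal equations X'X b = X'y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)              # residual sum of squares / (N - P)

cov_beta_hat = sigma2_hat * XtX_inv               # unbiased estimator of sigma^2 (X'X)^{-1}
est_std_errors = np.sqrt(np.diag(cov_beta_hat))   # estimated standard errors of the elements of beta_hat

S = np.diag(np.sqrt(np.diag(XtX_inv)))            # diagonal matrix of square roots of diag[(X'X)^{-1}]
corr_beta_hat = np.linalg.inv(S) @ XtX_inv @ np.linalg.inv(S)
print(beta_hat.round(2), np.sqrt(sigma2_hat).round(2))
```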
The least squares estimates of the elements of β (i.e., of the parameters β1, β2, β3, β4, β11, β12, β13, β22, β23, and β33) are presented in Table 7.1 along with their standard errors and estimated standard errors. And letting S represent the diagonal matrix of order P = 10 whose first through Pth diagonal elements are respectively the square roots of the first through Pth diagonal elements of the P × P matrix (X′X)⁻¹, the correlation matrix of the vector β̂ of least squares estimators of the elements of β is

    S⁻¹(X′X)⁻¹S⁻¹ =
        1.00  0.00  0.00  0.09  0.57  0.00  0.00  0.57  0.00  0.52
        0.00  1.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00
        0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00  0.24  0.00
        0.09  0.00  0.00  1.00  0.10  0.00  0.00  0.10  0.00  0.22
        0.57  0.00  0.00  0.10  1.00  0.00  0.00  0.13  0.00  0.19
        0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00
        0.00  0.24  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00
        0.57  0.00  0.00  0.10  0.13  0.00  0.00  1.00  0.00  0.19
        0.00  0.00  0.24  0.00  0.00  0.00  0.00  0.00  1.00  0.00
        0.52  0.00  0.00  0.22  0.19  0.00  0.00  0.19  0.00  1.00
TABLE 7.1. Least squares estimates (with standard errors and estimated standard errors) obtained from the lettuce-yield data for the regression coefficients in a second-order G–M model.

    Coefficient of   Regression coefficient   Least squares estimate   Std. error of the estimator   Estimated std. error of the estimator
    1                β1                       24.31                    0.404σ                        1.34
    u1               β2                        4.57                    0.279σ                        0.92
    u2               β3                        0.82                    0.279σ                        0.92
    u3               β4                        0.03                    0.318σ                        1.05
    u1²              β11                       5.27                    0.267σ                        0.88
    u1u2             β12                       1.31                    0.354σ                        1.17
    u1u3             β13                       1.29                    0.462σ                        1.52
    u2²              β22                       0.73                    0.267σ                        0.88
    u2u3             β23                       0.66                    0.462σ                        1.52
    u3²              β33                       5.34                    0.276σ                        0.91
The model matrix X has a relatively simple structure, as might be expected in the case of a designed experiment. That structure is reflected in the correlation matrix of β̂ and in the various other quantities that depend on the model matrix through the matrix X′X. Of course, that structure would have been even simpler had not the level of Fe in the four containers where it was supposed to have been 1 (on the transformed scale) been taken instead to be 0.4965.
Assuming that the distribution of the vector e of residual effects in the G–M model is N(0, σ²I) or, more generally, that the fourth-order moments of its distribution are identical to those of the N(0, σ²I) distribution,
    var(σ̂²) = 2σ⁴/(N − rank X) = σ⁴/5    (1.10)
—refer to result (5.7.40). And (under the same assumption)
    E(σ̂⁴) = var(σ̂²) + [E(σ̂²)]² = {[N − rank(X) + 2]/[N − rank(X)]} σ⁴
and, consequently, var(σ̂²) is estimated unbiasedly by
    2σ̂⁴ / [N − rank(X) + 2].    (1.11)
Corresponding to expression (1.10) for var(σ̂²) is the expression
    √[2/(N − rank X)] σ² = σ²/√5    (1.12)
for the standard error of σ̂², and corresponding to the estimator (1.11) of var(σ̂²) is the estimator
    √{2/[N − rank(X) + 2]} σ̂²    (1.13)
of the standard error of σ̂². The estimated standard error of σ̂² [i.e., the value of the estimator (1.13)] is
    √(2/12) × 10.89407 = 4.45.
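As a quick numerical check of (1.11)–(1.13) for the lettuce data (under the stated assumption on the distribution of e), with N − rank X = 10 and σ̂² = 10.894:

```python
import numpy as np

N_minus_rank = 10
sigma2_hat = 10.894                                                  # residual sum of squares 108.9407 / 10

var_sigma2_hat = 2 * sigma2_hat**2 / (N_minus_rank + 2)              # unbiased estimate of var(sigma^2-hat), per (1.11)
est_se_sigma2_hat = np.sqrt(2 / (N_minus_rank + 2)) * sigma2_hat     # estimated standard error of sigma^2-hat, per (1.13)
print(round(est_se_sigma2_hat, 2))                                   # 4.45
```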
[Figure 7.1 appears here: four contour plots, one for each of the Mo levels −1, −1/3, 1/3, and 1, with Cu on the vertical axis and Fe on the horizontal axis (both ranging from −1.5 to 1.5); the contours are labeled with estimated yields 3, 8, 13, 18, and 23, with additional contour labels 24.91, 25.32, 25.17, and 24.49 in the respective panels.]
FIGURE 7.1. Contour plots of the estimated response surface obtained from the lettuce-yield data (on the basis of a second-order model). Each plot serves to relate the yield of lettuce plants to the levels of 2 trace minerals (Cu and Fe) at one of 4 levels of a third trace mineral (Mo).

Let
    â = (β̂2, β̂3, β̂4)′ and Â = [ β̂11  (1/2)β̂12  (1/2)β̂13 ; (1/2)β̂12  β̂22  (1/2)β̂23 ; (1/2)β̂13  (1/2)β̂23  β̂33 ].
For any particular value of the vector u = (u1, u2, u3)′ (the elements of which represent the transformed levels of Cu, Mo, and Fe), the value of δ(u) is a linear combination of the ten regression coefficients β1, β2, β3, β4, β11, β12, β13, β22, β23, and β33. And upon taking the same linear combination of the least squares estimators of the regression coefficients, we obtain the least squares estimator of the value of δ(u). Accordingly, the least squares estimator of the value of δ(u) is the value of the function δ̂(u) (of u) defined as follows:
    δ̂(u) = β̂1 + â′u + u′Âu.    (1.14)
The estimated response surface defined by this function is depicted in Figure 7.1 in the form of four
contour plots; each of these plots corresponds to a different one of four levels of Mo.
For any linear combination λ′β of the regression coefficients β1, β2, β3, β4, β11, β12, β13, β22, β23, and β33,
    var(λ′β̂) = σ²λ′(X′X)⁻¹λ.    (1.15)
More generally, for any two linear combinations λ′β and ℓ′β,
    cov(λ′β̂, ℓ′β̂) = σ²λ′(X′X)⁻¹ℓ.    (1.16)
And upon replacing σ² in expression (1.15) or (1.16) with the unbiased estimator σ̂², we obtain an unbiased estimator of var(λ′β̂) or cov(λ′β̂, ℓ′β̂). Further, upon taking the square root of var(λ′β̂) and of its unbiased estimator, we obtain σ√[λ′(X′X)⁻¹λ] as an expression for the standard error of λ′β̂ and σ̂√[λ′(X′X)⁻¹λ] as an estimator of the standard error.
Note that these results (on the least squares estimators of the arbitrary linear combinations λ′β and ℓ′β) are applicable to δ̂(u) and δ̂(v), where u = (u1, u2, u3)′ and v = (v1, v2, v3)′ represent any particular points in 3-dimensional space. Upon setting λ = (1, u1, u2, u3, u1², u1u2, u1u3, u2², u2u3, u3²)′ and ℓ = (1, v1, v2, v3, v1², v1v2, v1v3, v2², v2v3, v3²)′, we find that λ′β̂ = δ̂(u) and ℓ′β̂ = δ̂(v). Clearly, δ̂(0) = β̂1.
The results on the least squares estimators of the arbitrary linear combinations λ′β and ℓ′β are also applicable to the elements of the vector â + 2Âu and the elements of the matrix 2Â. These quantities are the least squares estimators of the elements of the vector a + 2Au and the elements of the matrix 2A—clearly, each element of a + 2Au and 2A is expressible as a linear combination of the elements of β. And (under the second-order G–M model) a + 2Au is the gradient vector and 2A the Hessian matrix of the function δ(u). Note that, when u = 0, the estimator â + 2Âu of the gradient vector simplifies to the vector â, the elements of which are β̂2, β̂3, and β̂4.
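A short sketch of these point-estimation formulas (again with placeholder inputs rather than the actual lettuce-yield quantities): given β̂, σ̂², and (X′X)⁻¹, it evaluates δ̂(u), its estimated standard error via (1.15), and the estimated gradient â + 2Âu at a chosen u.

```python
import numpy as np

def lam(u):
    """Coefficient vector such that lam(u)' beta = delta(u) for the second-order model."""
    u1, u2, u3 = u
    return np.array([1, u1, u2, u3, u1**2, u1*u2, u1*u3, u2**2, u2*u3, u3**2])

# Placeholders: in practice these come from the fit of the second-order model.
beta_hat = np.arange(1.0, 11.0) / 10.0
XtX_inv = np.eye(10) * 0.05
sigma2_hat = 10.89

a_hat = beta_hat[1:4]
A_hat = np.array([[beta_hat[4],   beta_hat[5]/2, beta_hat[6]/2],
                  [beta_hat[5]/2, beta_hat[7],   beta_hat[8]/2],
                  [beta_hat[6]/2, beta_hat[8]/2, beta_hat[9]]])

u = np.array([0.5, -0.2, 0.1])
l = lam(u)
delta_hat = l @ beta_hat                                   # least squares estimator of delta(u)
est_se = np.sqrt(sigma2_hat * l @ XtX_inv @ l)             # estimated standard error, per (1.15)
grad_hat = a_hat + 2 * A_hat @ u                           # estimator of the gradient a + 2Au
print(delta_hat, est_se, grad_hat)
```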
Like the function δ(u), the function δ̂(u) is a second-degree polynomial in the elements of the vector u = (u1, u2, u3)′. And analogous to results (1.5) and (1.9), we have that
    ∂δ̂(u)/∂u = â + 2Âu and ∂²δ̂(u)/∂u∂u′ = 2Â.    (1.17)
Assume that Â is nonsingular with probability 1 (as would be the case if, e.g., the distribution of the vector e of residual effects in the second-order G–M model were MVN), and define
    û0 = −(1/2)Â⁻¹â.    (1.18)
Then, with probability 1, the estimated response surface δ̂(u) has a unique stationary point and that point equals û0. Further,
    δ̂(û0) = β̂1 + (1/2)â′û0;    (1.19)
and if −Â is positive definite, δ̂(u) attains its maximum value (uniquely) at û0. When the least squares estimates of β1 and of the elements of a and A are those obtained from the lettuce-yield data, we find (upon, e.g., making use of Theorem 2.14.23) that −Â is positive definite and that
    û0 = (−0.42, 0.17, 0.04)′ and δ̂(û0) = 25.33.
Suppose that the second-order regression coefficients β11, β12, β13, β22, β23, and β33 are such that the matrix A is nonsingular, in which case the function δ(u) has a unique stationary point u0 = −(1/2)A⁻¹a. Then, it seems natural to regard û0 as an estimator of u0 (as suggested by the notation).
A more formal justification for regarding û0 as an estimator of u0 is possible. Suppose that the distribution of the vector e of residual effects in the second-order (G–M) model is MVN. Then, it follows from the results of Section 5.9a that the elements of the matrix Â and the vector â are the ML estimators of the corresponding elements of the matrix A and the vector a (and are the ML estimators even when the values of the second-order regression coefficients β11, β12, β13, β22, β23, and β33 are restricted to those values for which A is nonsingular). And upon applying a well-known general result on the ML estimation of parametric functions [which is Theorem 5.1.1 of Zacks (1971)], we conclude that û0 is the ML estimator of u0.
In making inferences from the results of the experimental study (of the yield of lettuce plants), there could be interest in predictive inferences as well as in inferences about the function δ(u) (and about various characteristics of this function). Specifically, there could be interest in making inferences about the yield to be obtained in the future from a container of lettuce plants, based on regarding the future yield as a realization of the random variable δ(u) + d, where d is a random variable that has an expected value of 0 and a variance of σ² (and that is uncorrelated with the vector e). Then, with regard to point prediction, the BLUP of this quantity equals the least squares estimator δ̂(u) of δ(u), and the variance of the prediction error (which equals the mean squared error of the BLUP) is σ² + var[δ̂(u)]. And the covariance of the prediction errors of the BLUPs of two future yields, one of which is modeled as the realization of δ(u) + d and the other as the realization of δ(v) + h (where h is a random variable that has an expected value of 0 and a variance of σ² and that is uncorrelated with e and with d), equals cov[δ̂(u), δ̂(v)].
Point estimation (or prediction) can be quite informative, especially when accompanied by stan-
dard errors or other quantities that reflect the magnitude of the underlying variability. However, it is
generally desirable to augment any such inferences with inferences that take the form of intervals or
sets. Following the presentation (in Section 7.2) of some results on “multi-part” G–M models, the
emphasis (beginning in Section 7.3) in the present chapter is on confidence intervals and sets (and
on the closely related topic of tests of hypotheses).

7.2 Augmented G–M Model


Section 7.1 was devoted to a discussion of the use of the data (from Section 4.2e) on the yield of
lettuce plants in making statistical inferences on the basis of a G–M model. Three different versions
of the G–M model were considered; these were referred to as the first-, second-, and third-order
models. The second of these versions is obtainable from the first and the third from the second via
the introduction of some additional terms. Let us consider the effects (both in general and in the
context of the lettuce-yield application) of the inclusion of additional terms in a G–M model.

a. General results
Let Z = {zij} represent a matrix with N rows, and denote by Q the number of columns in Z. And as an alternative to the original G–M model with model matrix X, consider the following G–M model with model matrix (X, Z):
    y = (X, Z)(β′, γ′)′ + e,    (2.1)
where γ = (γ1, γ2, …, γQ)′ is a Q-dimensional (column) vector of additional parameters [and where the parameter space for the augmented parameter vector (β′, γ′)′ is R^{P+Q}]. To distinguish this model from the original G–M model, let us refer to it as the augmented G–M model. Note that the model equation (2.1) for the augmented G–M model is reexpressible in the form
    y = Xβ + Zγ + e.    (2.2)
Let Λ represent a matrix (with P rows) such that R(Λ′) = R(X), and denote by M the number of columns in Λ (in which case M ≥ rank Λ = rank X). Then, under the original G–M model y = Xβ + e, the M elements of the vector Λ′β are estimable linear combinations of the elements of β. Moreover, these estimable linear combinations include rank(X) linearly independent estimable linear combinations [and no set of linearly independent estimable linear combinations can include more than rank(X) linear combinations].
Under the original G–M model, the least squares estimators of the elements of the vector Λ′β are the elements of the vector
    R̃′X′y,    (2.3)
where R̃ is any solution to the linear system
    X′XR = Λ    (2.4)
(in the P × M matrix R)—refer to Section 5.4 or 5.6. Note that the matrix XR̃ and hence its transpose R̃′X′ do not vary with the choice of solution to linear system (2.4) (as is evident from, e.g., Corollary 2.3.4).
Under the augmented G–M model,
    E(R̃′X′y) = R̃′X′(Xβ + Zγ) = Λ′β + R̃′X′Zγ.    (2.5)
Thus, while R̃′X′y is an unbiased estimator of Λ′β under the original G–M model (in fact, it is the best linear unbiased estimator of Λ′β in the sense described in Section 5.6), it is not (in general) an unbiased estimator of Λ′β under the augmented G–M model. In fact, under the augmented G–M model, the elements of R̃′X′y are the least squares estimators of the elements of Λ′β + R̃′X′Zγ, as is evident upon observing that
    (X, Z)′(X, Z) = [ X′X  X′Z ; Z′X  Z′Z ],
    Λ′β + R̃′X′Zγ = (Λ′, R̃′X′Z)(β′, γ′)′ = [ Λ ; Z′XR̃ ]′ (β′, γ′)′,    (2.6)
and
    [ X′X  X′Z ; Z′X  Z′Z ] [ R̃ ; 0 ] = [ Λ ; Z′XR̃ ],
and as is also evident upon observing (in light of the results of Section 5.4) that any linear combination of the elements of the vector (X, Z)′y is the least squares estimator of its expected value.
Note that while in general E(R̃′X′y) is affected by the augmentation of the G–M model, var(R̃′X′y) is unaffected (in the sense that the same expressions continue to apply).
In regard to the coefficient matrix (Λ′, R̃′X′Z) = [ Λ ; Z′XR̃ ]′ of expression (2.6) for the vector Λ′β + R̃′X′Zγ of linear combinations of the elements of the parametric vector (β′, γ′)′, note that
    rank(Λ′, R̃′X′Z) = rank Λ (= rank X),    (2.7)
as is evident upon observing that Λ′ = R̃′X′X and
    (Λ′, R̃′X′Z) = R̃′X′(X, Z)
and hence (in light of Lemma 2.12.3 and Corollary 2.4.17) that
    rank(Λ′) = rank(R̃′X′) ≥ rank(Λ′, R̃′X′Z) ≥ rank(Λ′).
Further, for any M-dimensional column vector ℓ, we find [upon observing that ℓ′Λ′ = (R̃ℓ)′X′X and making use of Corollary 2.3.4] that
    ℓ′Λ′ = 0 ⇔ (R̃ℓ)′X′ = 0 ⇒ (R̃ℓ)′X′Z = 0 ⇔ ℓ′R̃′X′Z = 0    (2.8)
and that
    ℓ′Λ′ = 0 ⇔ ℓ′(Λ′, R̃′X′Z) = 0.    (2.9)
Thus, if any row or linear combination of rows of the matrix R̃′X′Z is nonnull, then the corresponding row or linear combination of rows of the matrix Λ′ is also nonnull. And a subset of the rows of the matrix (Λ′, R̃′X′Z) is linearly independent if and only if the corresponding subset of the rows of the matrix Λ′ is linearly independent.
Let S represent a matrix (with N rows) such that C(S) = N(X′) (i.e., such that the columns of S span the null space of X′) or, equivalently {since (according to Lemma 2.11.5) dim[N(X′)] = N − rank(X)}, such that X′S = 0 and rank(S) = N − rank(X). Further, denote by N* the number of columns in S—necessarily, N* ≥ N − rank(X). And consider the N*-dimensional column vector S′Zγ, the elements of which are linear combinations of the elements of the vector γ.
Under the augmented G–M model, the elements of the vector S′Zγ (like those of the vector Λ′β + R̃′X′Zγ) are estimable linear combinations of the elements of the vector (β′, γ′)′, as is evident upon observing that
    S′X = (X′S)′ = 0
and hence that
    S′Zγ = (0, S′Z)(β′, γ′)′ = S′(X, Z)(β′, γ′)′.
Clearly,
    [ Λ′β + R̃′X′Zγ ; S′Zγ ] = [ Λ′  R̃′X′Z ; 0  S′Z ] (β′, γ′)′.
And upon observing that
    [ Λ′  R̃′X′Z ; 0  S′Z ] = [ R̃′X′ ; S′ ] (X, Z)
and (in light of Lemmas 2.12.1, 2.6.1, and 2.12.3) that
    rank [ R̃′X′ ; S′ ] = rank( [ R̃′X′ ; S′ ] (XR̃, S) ) = rank [ (XR̃)′XR̃  0 ; 0  S′S ]
        = rank[(XR̃)′XR̃] + rank(S′S)
        = rank(XR̃) + rank(S)
        = rank(Λ) + rank(S)
        = rank(X) + N − rank(X) = N,
it follows from Lemma 2.5.5 that
    R[ Λ′  R̃′X′Z ; 0  S′Z ] = R[(X, Z)] and rank [ Λ′  R̃′X′Z ; 0  S′Z ] = rank(X, Z).    (2.10)

Moreover,
    [ Λ′  R̃′X′Z ; 0  S′Z ] = [ Λ′  Λ′L ; 0  S′Z ]    (2.11)
for some matrix L [as is evident from Lemma 2.4.3 upon observing (in light of Corollary 2.4.4 and Lemma 2.12.3) that C(R̃′X′Z) ⊆ C(R̃′X′) = C(R̃′X′X) = C(Λ′)], implying that
    [ Λ′  R̃′X′Z ; 0  S′Z ] = [ Λ′  0 ; 0  S′Z ] [ IP  L ; 0  IQ ]
and hence [since (according to Lemma 2.6.2) the matrix [ I  L ; 0  I ] is nonsingular] that
    rank [ Λ′  R̃′X′Z ; 0  S′Z ] = rank(Λ) + rank(S′Z) = rank(X) + rank(S′Z)    (2.12)
(as is evident from Corollary 2.5.6 and Lemma 2.6.1). Together with result (2.10), result (2.12) implies that
    rank(S′Z) = rank(X, Z) − rank(X).    (2.13)
For any M-dimensional column vector ℓ1 and any N*-dimensional column vector ℓ2,
    (ℓ1′, ℓ2′) [ Λ′  R̃′X′Z ; 0  S′Z ] = 0  ⇔  ℓ1′(Λ′, R̃′X′Z) = 0 and ℓ2′S′Z = 0,    (2.14)
as is evident from result (2.9). Thus, a subset of size rank(X, Z) of the M + N* rows of the matrix [ Λ′  R̃′X′Z ; 0  S′Z ] is linearly independent if and only if the subset consists of rank(X) linearly independent rows of (Λ′, R̃′X′Z) and rank(X, Z) − rank(X) linearly independent rows of (0, S′Z). [In light of result (2.10), there exists a linearly independent subset of size rank(X, Z) of the rows of [ Λ′  R̃′X′Z ; 0  S′Z ], and no linearly independent subset of a size larger than rank(X, Z).]
Under the augmented G–M model, a linear combination, say λ1′β + λ2′γ, of the elements of β and γ is [in light of result (2.10)] estimable if and only if
    λ1′ = ℓ1′Λ′ and λ2′ = ℓ1′R̃′X′Z + ℓ2′S′Z    (2.15)
for some M-dimensional column vector ℓ1 and some N*-dimensional column vector ℓ2. Moreover, in the special case where λ1 = 0 (i.e., in the special case of a linear combination of the elements of γ), this result can [in light of result (2.8)] be simplified as follows: under the augmented G–M model, λ2′γ is estimable if and only if
    λ2′ = ℓ2′S′Z    (2.16)
for some N*-dimensional column vector ℓ2.
Among the choices for the N × N* matrix S is the N × N matrix I − PX. Since (according to Theorem 2.12.2) X′PX = X′, PX² = PX, and rank(PX) = rank(X),
    X′(I − PX) = 0
and (in light of Lemma 2.8.4)
    rank(I − PX) = N − rank(X).
Clearly, when S = I − PX, N* = N.
Let T̃ represent any solution to the following linear system [in the (P+Q) × N* matrix T]:
    (X, Z)′(X, Z)T = (0, S′Z)′    (2.17)
—the existence of a solution follows (in light of the results of Section 5.4c) from the estimability (under the augmented G–M model) of the elements of the vector (0, S′Z)(β′, γ′)′. Further, partition T̃ as T̃ = [ T̃1 ; T̃2 ] (where T̃1 has P rows), and observe that
    X′XT̃1 = −X′ZT̃2
and hence (in light of Theorem 2.12.2) that
    XT̃1 = PX XT̃1 = X(X′X)⁻X′XT̃1 = −X(X′X)⁻X′ZT̃2 = −PX ZT̃2.    (2.18)
Then, under the augmented G–M model, the least squares estimators of the elements of the vector S′Zγ are [in light of result (5.4.35)] the (corresponding) elements of a vector that is expressible as follows:
    T̃′(X, Z)′y = T̃1′X′y + T̃2′Z′y = (−PX ZT̃2)′y + T̃2′Z′y = T̃2′Z′(I − PX)y.    (2.19)
And (under the augmented G–M model)
    cov[T̃2′Z′(I − PX)y, R̃′X′y] = σ²T̃2′Z′(I − PX)XR̃ = 0    (2.20)
(i.e., the least squares estimators of the elements of S′Zγ are uncorrelated with those of the elements of Λ′β + R̃′X′Zγ) and [in light of result (5.6.6)]
    var[T̃2′Z′(I − PX)y] = σ²T̃′(0, S′Z)′ = σ²T̃2′Z′S.    (2.21)
In connection with linear system (2.17), note [in light of result (2.18)] that
    Z′(I − PX)ZT̃2 = Z′S [= Z′(I − PX)S]
(and that X′XT̃1 = −X′ZT̃2). Conversely, if T̃2 is taken to be a solution to the linear system
    Z′(I − PX)Z T2 = Z′S    (2.22)
(in T2) and T̃1 a solution to the linear system X′XT1 = −X′ZT̃2 (in T1), then [ T̃1 ; T̃2 ] is a solution to linear system (2.17).
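The following sketch (illustrative, with randomly generated X, Z, and y) implements the estimator (2.19) with S taken to be I − PX: it solves (2.22) for T̃2 and forms T̃2′Z′(I − PX)y.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P, Q = 20, 4, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])
Z = rng.normal(size=(N, Q))
y = rng.normal(size=N)

P_X = X @ np.linalg.pinv(X)                    # projection matrix onto C(X)
S = np.eye(N) - P_X                            # one choice of S: its columns span N(X')

# Solve Z'(I - P_X) Z T2 = Z'S for T2, as in (2.22); lstsq handles a possibly singular system.
T2, *_ = np.linalg.lstsq(Z.T @ S @ Z, Z.T @ S, rcond=None)

est_SZgamma = T2.T @ Z.T @ S @ y               # least squares estimator of S'Z gamma, per (2.19)
print(est_SZgamma.shape)                       # one element per column of S (here N* = N)
```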

b. Some results for a specific implementation
The results of Subsection a depend on the matrices Λ and S. Among the choices for Λ and S are those derived from a particular decomposition of the model matrix (X, Z) of the augmented G–M model; this decomposition is as follows:
    (X, Z) = OU,    (2.23)
where O is an N × rank(X, Z) matrix with orthonormal columns and where, for some rank(X) × P matrix U11 (of full row rank), some rank(X) × Q matrix U12, and some [rank(X, Z) − rank(X)] × Q matrix U22 (of full row rank),
    U = [ U11  U12 ; 0  U22 ].
A decomposition of the form (2.23) can be constructed by, for example, applying Gram–Schmidt orthogonalization, or (in what would be preferable for numerical purposes) modified Gram–Schmidt orthogonalization, to the columns of the matrix (X, Z). In fact, when this method of construction is employed, U is a submatrix of a (P+Q) × (P+Q) upper triangular matrix having rank(X, Z) positive diagonal elements and P+Q − rank(X, Z) null rows; it is the submatrix obtained by striking out the null rows—refer, e.g., to Harville (1997, chap. 6). When U is of this form, the decomposition (2.23) is what is known (at least in the special case of the decomposition of a matrix having full column rank) as the QR decomposition (or the "skinny" QR decomposition). The QR decomposition was encountered earlier; Section 5.4e included a discussion of the use of the QR decomposition in the computation of least squares estimates.
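In the full-column-rank case, a decomposition of the form (2.23) can be obtained from NumPy's reduced QR factorization (up to the signs of the diagonal elements of U); the following sketch, with randomly generated X and Z assumed to be of full column rank, extracts O1, O2, U11, U12, and U22 from the factorization of (X, Z).

```python
import numpy as np

rng = np.random.default_rng(3)
N, P, Q = 20, 4, 3
X = rng.normal(size=(N, P))                   # assumed of full column rank
Z = rng.normal(size=(N, Q))

W = np.hstack([X, Z])
O, U = np.linalg.qr(W, mode='reduced')        # (X, Z) = O U, O with orthonormal columns, U upper triangular

O1, O2 = O[:, :P], O[:, P:]                   # conformal partition of O
U11, U12 = U[:P, :P], U[:P, P:]
U22 = U[P:, P:]                               # U[P:, :P] is null because U is upper triangular

print(np.allclose(X, O1 @ U11))               # X = O1 U11, per (2.24)
print(np.allclose(Z, O2 @ U22 + O1 @ U12))    # Z = O2 U22 + O1 U12, per (2.24)
```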
Partition the matrix O (conformally to the partitioning of U) as O = (O1, O2), where O1 has rank(X) columns. And observe that
    X = O1U11 and Z = O2U22 + O1U12.    (2.24)
Then, among the choices for Λ is that obtained by taking Λ′ = U11 (as is evident from Corollary 2.4.17). Further, the choices for S include the matrix (O2, O3), where O3 is an N × [N − rank(X, Z)] matrix whose columns form an orthonormal basis for N[(X, Z)′] or, equivalently (in light of Corollary 2.4.17), for N[(O1, O2)′]—the N − rank(X) columns of (O2, O3) form an orthonormal basis for N(X′).
In light of result (2.24), we have that
    X′X = U11′U11 and X′Z = U11′U12.
Moreover, when Λ′ = U11, we find that
    X′XR = Λ ⇔ U11′U11R = U11′ ⇔ U11R = I
—that U11′U11R = U11′ ⇒ U11R = I is clear upon, e.g., observing that U11U11′ is nonsingular and premultiplying both sides of the equation U11′U11R = U11′ by (U11U11′)⁻¹U11. Thus, when Λ′ = U11, the expected value Λ′β + R̃′X′Zγ (under the augmented G–M model) of the estimator R̃′X′y (where, as in Subsection a, R̃ represents an arbitrary solution to X′XR = Λ) is reexpressible as
    Λ′β + R̃′X′Zγ = U11β + R̃′U11′U12γ = U11β + U12γ.    (2.25)
Moreover, R̃ is a right inverse of U11, and
    var(R̃′X′y) = σ²R̃′X′XR̃ = σ²Λ′R̃ = σ²U11R̃ = σ²I.    (2.26)
Now, suppose that S D .O2 ; O3 /, in which case

Z0 S D .Z0 O2 ; Z0 O3 / D .U22
0
; 0/;
so that 
U22
 
U22 

S0 Z D D :
0 0
And observe (in light of Lemma 2.6.3) that

rank U D rank.X; Z/
and hence that
rank.UU 0 / D rank.X; Z/: (2.27)
Observe also that .O; O3 / is an (N  N ) orthogonal matrix and hence that

I D .O; O3 /.O; O3 /0 D OO 0 C O3 O30 :


Then,
.X; Z/0 .X; Z/ D .X; Z/0 .OO 0 C O3 O30 /.X; Z/ D U 0 U (2.28)
and
.X; Z/0 y D .X; Z/0 .OO 0 C O3 O30 /y D U 0 O 0 y: (2.29)
Moreover,    
0 0 0 0 0
.0; S0 Z/0 D 0 D U : (2.30)
U22 0 I 0
Thus, the solution TQ to the linear system .X; Z/0 .X; Z/T D .0; S0 Z/0 satisfies the equality
 
0 Q 0 0 0
U UT D U ;
I 0
so that TQ also satisfies the equality  
0 Q 0 0 0
UU U T D UU
I 0
and hence [since according to result (2.27), UU 0 is nonsingular] the equality
 
Q 0 0
UT D : (2.31)
I 0
Making use of equalities (2.29) and (2.31), we find that (under
  the augmented
 G–M model) the
U 22 
least squares estimators of the elements of the vector S0 Z D are the (corresponding)
0
elements of the vector
0 
0 0 O10 y
   0 
O2 y
TQ 0 .X; Z/0 y D TQ 0 U 0 O 0 y D 0 D : (2.32)
I 0 O2 y 0
Clearly, the variance-covariance matrix of this vector is
 
2 I 0
 : (2.33)
0 0
Note that S0 Z, the estimator (2.32) of S0 Z, and the variance-covariance matrix (2.33) of the
estimator (2.32) do not depend on O3 (even though S itself depends on O3 ).
c. An illustration
Let us illustrate the results of Subsections a and b by using them to add to the results obtained
earlier (in Section 7.1) for the lettuce-yield data. Accordingly, let us take y to be the 20  1 random
vector whose observed value is the vector of lettuce yields. Further, let us adopt the terminology and
notation introduced in Section 7.1 (along with those introduced in Subsections a and b of the present
section).
Suppose that the original G–M model is the second-order G–M model, which is the model
that was adopted in the analyses carried out in Section 7.1. And suppose that the augmented G–
M model is the third-order G–M model. Then, X = (X₁, X₂) and Z = X₃ (where X₁, X₂, and X₃ are as defined in Section 7.1). Further, β = (β₁, β₂, β₃, β₄, β₁₁, β₁₂, β₁₃, β₂₂, β₂₃, β₃₃)′ and δ = (β₁₁₁, β₁₁₂, β₁₁₃, β₁₂₂, β₁₂₃, β₁₃₃, β₂₂₂, β₂₂₃, β₂₃₃, β₃₃₃)′.
Upon applying result (2.5) with R̃ = (X′X)⁻¹ (corresponding to Λ = I), we find that under the augmented (third-order) model
E(β̂₁) = β₁ − 0.064β₁₁₃ − 0.064β₂₂₃ + 0.148β₃₃₃,
E(β̂₂) = β₂ + 1.805β₁₁₁ + 0.560β₁₂₂ + 0.278β₁₃₃,
E(β̂₃) = β₃ + 0.560β₁₁₂ + 1.805β₂₂₂ + 0.278β₂₃₃,
E(β̂₄) = β₄ + 0.427β₁₁₃ + 0.427β₂₂₃ + 1.958β₃₃₃,
E(β̂₁₁) = β₁₁ + 0.046β₁₁₃ + 0.046β₂₂₃ − 0.045β₃₃₃,
E(β̂₁₂) = β₁₂ + 0.252β₁₂₃,
E(β̂₁₃) = β₁₃ − 0.325β₁₁₁ + 0.178β₁₂₂ + 0.592β₁₃₃,
E(β̂₂₂) = β₂₂ + 0.046β₁₁₃ + 0.046β₂₂₃ − 0.045β₃₃₃,
E(β̂₂₃) = β₂₃ + 0.178β₁₁₂ − 0.325β₂₂₂ + 0.592β₂₃₃, and
E(β̂₃₃) = β₃₃ + 0.110β₁₁₃ + 0.110β₂₂₃ − 0.204β₃₃₃.
All ten of the least squares estimators ˇO1 , ˇO2 , ˇO3 , ˇO4 , ˇO11 , ˇO12 , ˇO13 , ˇO22 , ˇO23 , and ˇO33 of the
elements of ˇ are at least somewhat susceptible to biases occasioned by the exclusion from the model
of third-order terms. The exposure to such biases appears to be greatest in the case of the estimators
ˇO2 , ˇO3 , and ˇO4 of the first-order regression coefficients ˇ2 , ˇ3 , and ˇ4 . In fact, if the level of Fe
in the first, second, third, and fifth containers had been 1 (on the transformed scale), which was
the intended level, instead of 0:4965, the expected values of the other seven estimators (ˇO1 , ˇO11 ,
ˇO12 , ˇO13 , ˇO22 , ˇO23 , and ˇO33 ) would have been the same under the third-order model as under the
second-order model (i.e., would have equalled ˇ1 , ˇ11 , ˇ12 , ˇ13 , ˇ22 , ˇ23 , and ˇ33 , respectively,
under both models)—if the level of Fe in those containers had been 1, the expected values of ˇO2 ,
ˇO3 , and ˇO4 under the third-order model would have been
E(β̂₂) = β₂ + 1.757β₁₁₁ + 0.586β₁₂₂ + 0.586β₁₃₃,
E(β̂₃) = β₃ + 0.586β₁₁₂ + 1.757β₂₂₂ + 0.586β₂₃₃, and
E(β̂₄) = β₄ + 0.586β₁₁₃ + 0.586β₂₂₃ + 1.757β₃₃₃.
The estimators ˇO1 , ˇO2 , ˇO3 , ˇO4 , ˇO11 , ˇO12 , ˇO13 , ˇO22 , ˇO23 , and ˇO33 (which are the least squares
estimators of the regression coefficients in the second-order model) have the “same” standard errors
under the augmented (third-order) model as under the original (second-order) model; they are the
same in the sense that the expressions given in Table 7.1 for the standard errors are still applicable.
However, the “interpretation” of the parameter  (which appears as a multiplicative factor in those
expressions) differs. This difference manifests itself in the estimation of σ and σ². In the case of the augmented G–M model, the usual estimator σ̂² of σ² is that represented by the quadratic form
y′[I − P₍X,Z₎]y / [N − rank(X, Z)],   (2.34)
rather than (as in the case of the original G–M model) that represented by the quadratic form y′(I − P_X)y / [N − rank(X)].
For the lettuce-yield data, y′[I − P₍X,Z₎]y = 52.6306 and rank(X, Z) = 15, so that when the estimator (2.34) is applied to the lettuce-yield data, we obtain σ̂² = 10.53. Upon taking the square root of this value, we obtain σ̂ = 3.24, which is 1.70% smaller than the value (3.30) obtained for σ̂ when the model was taken to be the original (second-order) model. Accordingly, when the model for the lettuce-yield data is taken to be the augmented (third-order) model, the estimated standard errors obtained for β̂₁, β̂₂, β̂₃, β̂₄, β̂₁₁, β̂₁₂, β̂₁₃, β̂₂₂, β̂₂₃, and β̂₃₃ are 1.70% smaller than those given in Table 7.1 (which were obtained on the basis of the second-order model).
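In computational terms, the estimator (2.34) is just the residual sum of squares from the augmented fit divided by N − rank(X, Z). A small Python sketch (NumPy; the data below are placeholders, not the lettuce yields):

    import numpy as np

    def sigma2_hat(W, y):
        """Usual unbiased estimator of sigma^2: y'(I - P_W)y / (N - rank W)."""
        N = W.shape[0]
        # lstsq returns a least squares solution even when W is rank deficient
        fitted = W @ np.linalg.lstsq(W, y, rcond=None)[0]
        rss = np.sum((y - fitted) ** 2)
        return rss / (N - np.linalg.matrix_rank(W))

    rng = np.random.default_rng(2)
    X = rng.normal(size=(20, 10))   # stand-in for the second-order model matrix
    Z = rng.normal(size=(20, 5))    # stand-in for the third-order columns
    y = rng.normal(size=20)

    print(sigma2_hat(X, y))                          # original (second-order) model
    print(sigma2_hat(np.column_stack([X, Z]), y))    # augmented (third-order) model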
Under the augmented G–M model, the rank(X, Z) − rank(X) elements of the vector U₂₂δ [where U₂₂ is defined in terms of the decomposition (2.23)] are linearly independent estimable linear combinations of the elements of δ [and every estimable linear combination of the elements of δ is expressible in terms of these rank(X, Z) − rank(X) linear combinations]. In the lettuce-yield application [where the augmented G–M model is the third-order model and where rank(X, Z) − rank(X) = 5], the elements of U₂₂δ [those obtained when the decomposition (2.23) is taken to be the QR decomposition] are the following linear combinations of the third-order regression coefficients:
3.253β₁₁₁ − 1.779β₁₂₂ − 0.883β₁₃₃,
1.779β₁₁₂ − 3.253β₂₂₂ + 0.883β₂₃₃,
1.554β₁₁₃ + 1.554β₂₂₃ − 3.168β₃₃₃,
2.116β₁₂₃,
and
0.471β₃₃₃.
The least squares estimates of these five linear combinations are 1.97, 4.22, 5.11, 2.05, and 2.07, respectively—the least squares estimators are uncorrelated, and each of them has a standard error of σ and an estimated standard error of 3.24.

7.3 The F Test (and Corresponding Confidence Set) and a Generalized S Method
Suppose that y is an N × 1 observable random vector that follows the G–M model. And let τ = λ′β, where λ is a P × 1 vector of constants. Further, suppose that λ is nonnull and that λ′ ∈ R(X) (so that τ is a nontrivial estimable function).
In addition to the (point) estimation of τ—the estimation of such a function was considered earlier (in Chapter 5)—inferences about τ may take the form of a confidence interval (or set). They may also take the form of a test (of a specified size) of the null hypothesis H₀: τ = 0 versus the alternative hypothesis H₁: τ ≠ 0 or, more generally, of H₀: τ = τ⁽⁰⁾ (where τ⁽⁰⁾ is any hypothesized value) versus H₁: τ ≠ τ⁽⁰⁾.
In any particular application, there are likely to be a number of linear combinations of the elements of β that represent quantities of interest. For i = 1, 2, ..., M, let τᵢ = λᵢ′β, where λ₁, λ₂, ..., λ_M are P × 1 vectors of constants (one or more of which are nonnull). In “matrix notation,” τ = Λ′β, where τ = (τ₁, τ₂, ..., τ_M)′ and Λ = (λ₁, λ₂, ..., λ_M).
Assume that τ₁, τ₂, ..., τ_M are estimable, and suppose that we wish to make inferences about
τ₁, τ₂, ..., τ_M and possibly about some or all linear combinations of τ₁, τ₂, ..., τ_M. These inferences may take the form of a confidence set for the vector τ. Or they may take the form of individual confidence intervals (or sets) for τ₁, τ₂, ..., τ_M (and possibly for linear combinations of τ₁, τ₂, ..., τ_M) that have a specified probability of simultaneous coverage. Alternatively, these inferences may take the form of a test of hypothesis. More specifically, they may take the form of a test (of a specified size) of the null hypothesis H₀: τ = 0 versus the alternative hypothesis H₁: τ ≠ 0 or, more generally, of H₀: τ = τ⁽⁰⁾ [where τ⁽⁰⁾ = (τ₁⁽⁰⁾, τ₂⁽⁰⁾, ..., τ_M⁽⁰⁾)′ is any vector of hypothesized values] versus H₁: τ ≠ τ⁽⁰⁾. They may also take a form that consists of testing whether or not each of the M quantities τ₁, τ₂, ..., τ_M (and possibly each of various linear combinations of these M quantities) equals a hypothesized value (subject to some restriction on the probability of an “excessive overall number” of false rejections).
In testing H₀: τ = τ⁽⁰⁾ (versus H₁: τ ≠ τ⁽⁰⁾), attention is restricted to what are called testable hypotheses. The null hypothesis H₀: τ = τ⁽⁰⁾ is said to be testable if (in addition to τ₁, τ₂, ..., τ_M being estimable and Λ being nonnull) τ⁽⁰⁾ ∈ C(Λ′), that is, if τ⁽⁰⁾ = Λ′β⁽⁰⁾ for some P × 1 vector β⁽⁰⁾. Note that if τ⁽⁰⁾ ∉ C(Λ′), there would not exist any values of β for which Λ′β equals the hypothesized value τ⁽⁰⁾ and, consequently, H₀ would be inherently false. It is worth noting that while the definition of testability adopted herein rules out the existence of any contradictions among the M equalities τᵢ = τᵢ⁽⁰⁾ (i = 1, 2, ..., M) that define H₀, it is sufficiently flexible to accommodate redundancies among these equalities—a more restrictive definition (one adopted by many authors) would be to require that rank(Λ) = M.
Most of the results on confidence sets for τ or its individual elements, or for testing whether or not τ or its individual elements equal hypothesized values, are obtained under an assumption that the vector e of the residual effects in the G–M model has an N(0, σ²I) distribution. However, for some of these results, this assumption is stronger than necessary; it suffices to assume that the distribution of e is spherically symmetric.
a. Canonical form (of the G–M model)
Let us continue to take y to be an N × 1 observable random vector that follows the G–M model. And let us consider further inference about the M × 1 vector τ (= Λ′β).
The problem of making inferences about τ can be reduced to its essence, and considerable insight into this problem can be gained, by introducing a suitable transformation. Accordingly, let P* = rank X and M* = rank Λ, and assume that τ is estimable and that M* ≥ 1. Further, let τ̂ = (τ̂₁, τ̂₂, ..., τ̂_M)′, where (for j = 1, 2, ..., M) τ̂ⱼ is the least squares estimator of τⱼ, denote by R̃ an arbitrary solution to the linear system X′XR = Λ (in the P × M matrix R), and recall (from Chapter 5) that
τ̂ = R̃′X′y = Λ′(X′X)⁻X′y  and  var(τ̂) = σ²C,
where
C = R̃′X′XR̃ = Λ′R̃ = R̃′Λ = Λ′(X′X)⁻Λ.
Note that the matrix C is symmetric and nonnegative definite and (in light of Lemmas 2.12.1 and 2.12.3) that
rank C = rank(XR̃) = rank(X′XR̃) = rank Λ = M*,   (3.1)
implying in particular (in light of Corollary 2.13.23) that
C = T′T
for some M* × M matrix T of full row rank M*. Now, take S to be any right inverse of T—that T has a right inverse is evident from Lemma 2.5.1—or, more generally, take S to be any M × M* matrix such that TS is orthogonal, in which case
S′CS = (TS)′TS = I   (3.2)
—conversely, if S were taken to be any M × M* matrix such that S′CS = I, TS would be orthogonal. Then,
rank S = rank(ΛS) = M*,   (3.3)
as is evident upon observing that
M* ≥ rank S ≥ rank(ΛS) ≥ rank(S′R̃′ΛS) = rank(S′CS) = rank(I_{M*}) = M*.

Further, let
α = S′τ = S′Λ′β = (ΛS)′β,
so that α is an M* × 1 vector whose elements are expressible as linearly independent linear combinations of the elements of either τ or β. And let α̂ represent the least squares estimator of α. Then, clearly,
α̂ = (R̃S)′X′y = S′τ̂
and
var(α̂) = σ²I.

Inverse relationship. The transformation from the M-dimensional vector τ to the M*-dimensional vector α is invertible. In light of result (3.3), the M* columns of the matrix ΛS form a basis for C(Λ), and, consequently, there exists a unique M* × M matrix W (of full row rank M*) such that
ΛSW = Λ.   (3.4)
And upon premultiplying both sides of equality (3.4) by S′Λ′(X′X)⁻ [and making use of result (3.2)], we find that
W = (TS)′T   (3.5)
—in the special case where S is a right inverse of T, W = T. Note that
C(W′) = C(Λ′)  and  WS = I,   (3.6)
as is evident upon, for example, observing that C(Λ′) = C[(ΛSW)′] ⊂ C(W′) (and invoking Theorem 2.4.16) and upon observing [in light of result (3.5)] that WS = (TS)′TS [and applying result (3.2)]. Note also that
C = Λ′(X′X)⁻Λ = (ΛSW)′(X′X)⁻ΛSW = W′S′CSW = W′W.   (3.7)
Clearly,
τ = Λ′β = (ΛSW)′β = W′S′τ = W′α.   (3.8)
Similarly,
τ̂ = Λ′(X′X)⁻X′y = (ΛSW)′(X′X)⁻X′y = W′S′τ̂ = W′α̂.   (3.9)

A particular implementation. The vectors λ₁, λ₂, ..., λ_M that form the columns of the P × M matrix Λ (which is of rank M*) include M* linearly independent vectors. Let j₁, j₂, ..., j_M (where j₁ < j₂ < ⋯ < j_{M*} and j_{M*+1} < j_{M*+2} < ⋯ < j_M) represent a permutation of the first M positive integers 1, 2, ..., M such that the j₁, j₂, ..., j_{M*}th columns λ_{j₁}, λ_{j₂}, ..., λ_{j_{M*}} of Λ are linearly independent; and denote by Λ* the P × M* submatrix of Λ whose first, second, ..., M*th columns are λ_{j₁}, λ_{j₂}, ..., λ_{j_{M*}}, respectively. Then, λⱼ = Λ*kⱼ for some uniquely defined M* × 1 vector kⱼ (j = 1, 2, ..., M); k_{j₁}, k_{j₂}, ..., k_{j_{M*}} are respectively the first, second, ..., M*th columns of I_{M*}. Further, let K = (k₁, k₂, ..., k_M), so that Λ = Λ*K. And take τ* and τ̂* to be the M* × 1 subvectors of τ and τ̂, respectively, obtained by striking out their j_{M*+1}, j_{M*+2}, ..., j_Mth elements, and take C* to be the M* × M* submatrix of C obtained by striking out its j_{M*+1}, j_{M*+2}, ..., j_Mth rows and columns, so that
τ* = Λ*′β,  τ̂* = Λ*′(X′X)⁻X′y,  and  C* = Λ*′(X′X)⁻Λ*;
and observe that
τ = K′τ*,  τ̂ = K′τ̂*,  and  C = K′C*K.
The submatrix C* is a symmetric positive definite matrix (as is evident from Corollary 2.13.28 upon observing that C* is an M* × M* symmetric nonnegative definite matrix of rank M*). Accordingly, let T* represent an M* × M* nonsingular matrix such that C* = T*′T*—the existence of such a matrix is evident from Corollary 2.13.29. Further, let T̃ = T*K, and define S̃ to be the M × M* matrix whose j₁, j₂, ..., j_{M*}th rows are respectively the first, second, ..., M*th rows of T*⁻¹ and whose remaining (j_{M*+1}, j_{M*+2}, ..., j_Mth) rows are null vectors. Then,
C = T̃′T̃.
And
T̃S̃ = T*T*⁻¹ = I,
so that S̃ is a right inverse of T̃. Thus, among the choices for a matrix T such that C = T′T and for a matrix S such that S′CS = I are T = T̃ and S = S̃; and when T = T̃ and S = S̃,
W = T̃ = T*K,
in which case τ = K′T*′α and τ̂ = K′T*′α̂ or, equivalently,
τ* = T*′α  and  τ_{jᵢ} = k_{jᵢ}′τ*  (i = M*+1, M*+2, ..., M),
and
τ̂* = T*′α̂  and  τ̂_{jᵢ} = k_{jᵢ}′τ̂*  (i = M*+1, M*+2, ..., M).

An equivalent null hypothesis. Now, consider the problem of testing the null hypothesis H₀: τ = τ⁽⁰⁾ versus the alternative hypothesis H₁: τ ≠ τ⁽⁰⁾. This problem can be reformulated in terms of the vector α. Assume that H₀ is testable, in which case τ⁽⁰⁾ = Λ′β⁽⁰⁾ for some P × 1 vector β⁽⁰⁾, let α⁽⁰⁾ = S′τ⁽⁰⁾, and consider the problem of testing the null hypothesis H̃₀: α = α⁽⁰⁾ versus the alternative hypothesis H̃₁: α ≠ α⁽⁰⁾.
The problem of testing H̃₀ versus H̃₁ is equivalent to that of testing H₀ versus H₁; they are equivalent in the sense that any value of β that satisfies H̃₀ satisfies H₀ and vice versa. To see this, observe that a value of β satisfies H̃₀ if and only if it satisfies the equality S′Λ′(β − β⁽⁰⁾) = 0, and it satisfies H₀ if and only if it satisfies the equality Λ′(β − β⁽⁰⁾) = 0. Observe also that N(S′Λ′) ⊃ N(Λ′) and [in light of Lemma 2.11.5 and result (3.3)] that
dim[N(S′Λ′)] = P − rank(S′Λ′) = P − rank(ΛS) = P − M* = dim[N(Λ′)]
and hence (recalling Theorem 2.4.10) that N(S′Λ′) = N(Λ′).
A vector of error contrasts. Let L represent an N × (N − P*) matrix whose columns form an orthonormal basis for N(X′)—the existence of an orthonormal basis follows from Theorem 2.4.23, and, as previously indicated (in Section 5.9b) and as is evident from Lemma 2.11.5, dim[N(X′)] = N − P*. Further, let d = L′y. And observe that
rank L = N − P*,  X′L = 0,  L′X = 0,  and  L′L = I,   (3.10)
that
E(d) = 0  and  var(d) = σ²I,
and that cov(d, X′y) = 0 and hence that
cov(d, α̂) = 0.
The N − P* elements of the vector d are error contrasts—error contrasts were discussed earlier (in Section 5.9b).
Complementary parametric functions and their least squares estimators. The least squares estimator α̂ of the vector α and the vector d of error contrasts can be combined into a single vector and expressed as follows:
[α̂; d] = [S′R̃′X′y; L′y] = (XR̃S, L)′y.
The columns of the N × (N − P* + M*) matrix (XR̃S, L) are orthonormal. And (in light of Lemma 2.11.5)
dim{N[(XR̃S, L)′]} = N − (N − P* + M*) = P* − M*.   (3.11)
Further,
N[(XR̃S, L)′] ⊂ N(L′) = C(X)   (3.12)
—that N(L′) = C(X) follows from Theorem 2.4.10 upon observing [in light of result (3.10) and Lemma 2.4.2] that C(X) ⊂ N(L′) and (in light of Lemma 2.11.5) that dim[N(L′)] = N − (N − P*) = P* = dim[C(X)].
Let U represent a matrix whose columns form an orthonormal basis for N[(XR̃S, L)′]—the existence of such a matrix follows from Theorem 2.4.23. And observe [in light of result (3.11)] that U is of dimensions N × (P* − M*) and [in light of result (3.12)] that U = XK for some matrix K [of dimensions P × (P* − M*)]. Observe also that
X′XK = X′U = (U′X)′   (3.13)
and (in light of Lemma 2.12.3) that
rank(U′X) = rank(X′XK) = rank(XK) = rank(U) = P* − M*.   (3.14)
Now, let ξ = U′Xβ and ξ̂ = U′y (= K′X′y). Then, in light of results (3.14) and (3.13), ξ is a vector of P* − M* linearly independent estimable functions, and ξ̂ is the least squares estimator of ξ. Further,
rank U = P* − M*,  U′L = 0,  U′XR̃S = 0,  and  U′U = I,
implying in particular that
cov(ξ̂, d) = 0,  cov(ξ̂, α̂) = 0,  and  var(ξ̂) = σ²I.

An orthogonal transformation. Let z = O′y, where O = (XR̃S, U, L). Upon partitioning z (which is an N-dimensional column vector) into subvectors z₁, z₂, and z₃ of dimensions M*, P* − M*, and N − P*, respectively, so that z′ = (z₁′, z₂′, z₃′), we find that
z₁ = S′R̃′X′y = α̂,  z₂ = U′y = ξ̂,  and  z₃ = L′y = d.
And
O′O = I,   (3.15)
that is, O is orthogonal, and
O′X = [S′R̃′X′X; U′X; 0] = [S′Λ′; U′X; 0],   (3.16)
so that
E(z) = O′Xβ = [α; ξ; 0]  and  var(z) = O′(σ²I)O = σ²I.   (3.17)
Moreover, the distribution of z is determinable from that of y and vice versa. In particular, z ~ N([α; ξ; 0], σ²I) if and only if y ~ N(Xβ, σ²I). Accordingly, in devising and evaluating procedures for making inferences about the vector τ, we can take advantage of the relatively simple and transparent form of the mean vector of z by working with z rather than y.
Definition and “role” (of the canonical form). Since (by supposition) y follows the G–M model,
z = O′Xβ + O′e.   (3.18)
Thus,
z = [I_{M*}, 0; 0, I_{P*−M*}; 0, 0][α; ξ] + e*,
where e* = O′e. Clearly,
E(e*) = 0  and  var(e*) = σ²I.
Like y, z is an observable random vector. Also like y, it follows a G–M model. Specifically, it follows a G–M model in which the role of the N × P model matrix X is played by the N × P* matrix [I_{M*}, 0; 0, I_{P*−M*}; 0, 0], the role of the P × 1 parameter vector β is played by the P* × 1 vector [α; ξ], and the M*-dimensional vector α and the (P* − M*)-dimensional vector ξ are regarded as vectors of unknown (unconstrained) parameters rather than as vectors of parametric functions. This model is referred to as the canonical form of the (G–M) model.
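The passage to the canonical form can be carried out numerically. The sketch below (Python/NumPy; the model matrix, Λ, and dimensions are illustrative, Λ is taken to have full column rank, and the orthonormal bases are obtained from QR and SVD factorizations rather than by any particular construction in the text) builds O = (XR̃S, U, L) and verifies that z₁ = α̂:

    import numpy as np

    rng = np.random.default_rng(3)
    N, P, M = 12, 4, 2
    X = rng.normal(size=(N, P))                 # full column rank (almost surely)
    Lam = rng.normal(size=(P, M))               # Lambda, full column rank
    y = X @ rng.normal(size=P) + 0.5 * rng.normal(size=N)

    R_tilde = np.linalg.solve(X.T @ X, Lam)     # solution of X'X R = Lambda
    C = Lam.T @ R_tilde                         # = Lambda'(X'X)^{-1} Lambda
    T = np.linalg.cholesky(C).T                 # C = T'T with T upper triangular
    S = np.linalg.inv(T)                        # a right inverse of T

    XRS = X @ R_tilde @ S                       # orthonormal columns
    Q, _ = np.linalg.qr(X, mode='complete')
    L = Q[:, P:]                                # orthonormal basis for N(X')

    # U: orthonormal basis for the part of C(X) orthogonal to C(XRS)
    B = Q[:, :P] - XRS @ (XRS.T @ Q[:, :P])
    u_svd, s_svd, _ = np.linalg.svd(B, full_matrices=False)
    U = u_svd[:, s_svd > 1e-8]                  # P - M columns

    O = np.column_stack([XRS, U, L])
    print(np.allclose(O.T @ O, np.eye(N)))      # O is orthogonal
    z1 = XRS.T @ y
    tau_hat = Lam.T @ np.linalg.solve(X.T @ X, X.T @ y)
    print(np.allclose(z1, S.T @ tau_hat))       # z1 equals alpha_hat = S'tau_hat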
Suppose that the vector z is assumed to follow the canonical form of the G–M model (with
parameterization ˛, , and  2 ). Suppose further that the distribution of the vector of residual effects
in the canonical form is taken to be the same as that of the vector e of residual effects in the original
form y D Xˇ C e. And suppose that O 0 e  e, as would be the case if e  N.0;  2 I/ or, more
generally, if the distribution of e is spherically symmetric. Then, the distribution of the vector Oz
obtained (on the basis of the canonical form) upon setting the parameter vector ˛ equal to S0ƒ0ˇ and
the parameter vector  equal to U 0 Xˇ is identical to the distribution of y (i.e., to the distribution of
y obtained from a direct application of the model y D Xˇ C e). Note that (when it comes to making
inferences about ƒ0ˇ) the elements of the parameter vector  of the canonical form can be viewed
as “nuisance parameters.”
Sufficiency and completeness. Suppose that y ~ N(Xβ, σ²I), as is the case when the distribution of the vector e of residual effects in the G–M model is taken to be N(0, σ²I) (which is a case where O′e ~ e). Then, as was established earlier (in Section 5.8) by working directly with the pdf of the distribution of y, X′y and y′(I − P_X)y form a complete sufficient statistic. In establishing results of this kind, it can be advantageous to adopt an alternative approach that makes use of the canonical form of the model.
Suppose that the transformed vector z (= O′y) follows the canonical form of the G–M model. Suppose also that the distribution of the vector of residual effects in the canonical form is N(0, σ²I), in which case z ~ N([α; ξ; 0], σ²I). Further, denote by f(·) the pdf of the distribution of z. Then,
f(z) = (2πσ²)^{−N/2} exp{−(1/2σ²)[(z₁ − α)′(z₁ − α) + (z₂ − ξ)′(z₂ − ξ) + z₃′z₃]}
  = (2πσ²)^{−N/2} exp{−(1/2σ²)[(α̂ − α)′(α̂ − α) + (ξ̂ − ξ)′(ξ̂ − ξ) + d′d]}
  = (2πσ²)^{−N/2} exp[−(1/2σ²)(α′α + ξ′ξ)]
    × exp[−(1/2σ²)(d′d + α̂′α̂ + ξ̂′ξ̂) + (1/σ²)α′α̂ + (1/σ²)ξ′ξ̂].   (3.19)
And it follows from a standard result (e.g., Schervish 1995, theorem 2.74) on exponential families of distributions that (d′d + α̂′α̂ + ξ̂′ξ̂), α̂, and ξ̂ form a complete sufficient statistic.
The complete sufficient statistic can be reexpressed in various alternative forms. Upon observing that
d′d + α̂′α̂ + ξ̂′ξ̂ = z′z = z′Iz = z′O′Oz = y′y,   (3.20)
[in light of Theorem 5.9.6 and result (3.10)] that
d′d = y′LL′y = y′(I − P_X)y,   (3.21)
and [in light of result (3.16)] that
X′y = X′Iy = X′OO′y = (O′X)′z = ΛSα̂ + X′Uξ̂,   (3.22)
and upon recalling result (3.9), we are able to conclude that each of the following combinations forms a complete sufficient statistic:
(1) y′y, α̂, and ξ̂;
(2) y′y, τ̂, and ξ̂;
(3) y′y and X′y;
(4) y′(I − P_X)y, α̂, and ξ̂;
(5) y′(I − P_X)y, τ̂, and ξ̂; and
(6) y′(I − P_X)y and X′y.
An extension. Consider a generalization of the situation considered in the preceding part, where it was assumed that the vector e of residual effects in the G–M model is N(0, σ²I). Continue to suppose that the distribution of the vector of residual effects in the canonical form is the same as that of the vector e in the G–M model. And suppose that the distribution of the vector σ⁻¹e of standardized residual effects is not necessarily N(0, I) but rather is an absolutely continuous spherically symmetric distribution with a pdf h(·) that (letting u represent an arbitrary N × 1 vector) is of the form h(u) = c⁻¹g(u′u), where g(·) is a known (nonnegative) function (of a single variable) such that ∫₀^∞ s^{N−1}g(s²) ds < ∞ and where c = [2π^{N/2}/Γ(N/2)] ∫₀^∞ s^{N−1}g(s²) ds. Then, O′e ~ e, and in light of the discussion of Section 5.9—refer, in particular, to expression (5.9.70)—the distribution of z has a pdf f(·) of the form
f(z) = c⁻¹(σ²)^{−N/2} g{σ⁻²[(z₁ − α)′(z₁ − α) + (z₂ − ξ)′(z₂ − ξ) + z₃′z₃]}.
And upon proceeding in essentially the same way as in arriving at expression (3.19) and upon applying the factorization theorem (e.g., Casella and Berger 2002, theorem 6.2.6), we arrive at the conclusion that the same quantities that form a complete sufficient statistic in the special case where e has an N(0, σ²I) distribution form a sufficient (though not necessarily complete) statistic.
b. The test and confidence set and their basic properties
Let us continue to take y to be an N × 1 observable random vector that follows the G–M model. And taking the results of Subsection a to be our starting point (and adopting the notation employed therein), let us consider further inferences about the vector τ (= Λ′β).
Suppose that the distribution of the vector e of residual effects is MVN. Then, y ~ N(Xβ, σ²I). And z (= O′y) ~ N([α; ξ; 0], σ²I).
F statistic and pivotal quantity. Letting α̇ represent an arbitrary value of α (i.e., an arbitrary M* × 1 vector), consider the random variable F̃(α̇) defined as follows:
F̃(α̇) = [(α̂ − α̇)′(α̂ − α̇)/M*] / [d′d/(N − P*)]
—if d = 0 (which is an event of probability 0), interpret F̃(α̇) as 0 or ∞, depending on whether α̂ = α̇ or α̂ ≠ α̇. Observe (in light of the results of Subsection a) that
(1/σ)(α̂ − α̇) ~ N[(1/σ)(α − α̇), I],
that
(1/σ)d ~ N(0, I),
and that α̂ and d are statistically independent, implying that
(1/σ²)(α̂ − α̇)′(α̂ − α̇) ~ χ²[M*, (1/σ²)(α − α̇)′(α − α̇)],
that
(1/σ²)d′d ~ χ²(N − P*),
and that (1/σ²)(α̂ − α̇)′(α̂ − α̇) and (1/σ²)d′d are statistically independent. Since F̃(α̇) is reexpressible as
F̃(α̇) = [(1/σ²)(α̂ − α̇)′(α̂ − α̇)/M*] / [(1/σ²)d′d/(N − P*)],   (3.23)
it follows that
F̃(α̇) ~ SF[M*, N − P*, (1/σ²)(α̇ − α)′(α̇ − α)].   (3.24)
In particular (in the special case where α̇ = α),
F̃(α) ~ SF(M*, N − P*).   (3.25)
In the special case where α̇ = α⁽⁰⁾, F̃(α̇) can serve as a “test statistic”; it provides a basis for testing the null hypothesis H̃₀: α = α⁽⁰⁾ versus the alternative hypothesis H̃₁: α ≠ α⁽⁰⁾ or, equivalently, H₀: τ = τ⁽⁰⁾ versus H₁: τ ≠ τ⁽⁰⁾—it is being assumed that H₀ is testable. In the special case where α̇ = α, it can serve as a “pivotal quantity” for devising a confidence set for α and ultimately for τ.
Some alternative expressions. As is evident from result (3.21), the “denominator” of F̃(α̇) is reexpressible as follows:
d′d/(N − P*) = y′(I − P_X)y/(N − P*) = σ̂²,   (3.26)
where σ̂² is the customary unbiased estimator of σ² (discussed in Section 5.7c). Thus,
F̃(α̇) = [(α̂ − α̇)′(α̂ − α̇)/M*] / [y′(I − P_X)y/(N − P*)] = (α̂ − α̇)′(α̂ − α̇)/(M*σ̂²).   (3.27)

Now, consider the “numerator” of F̃(α̇). Let τ̇ represent an arbitrary value of τ, that is, an arbitrary member of C(Λ′). Then, for α̇ = S′τ̇ (which is the “corresponding value” of α),
(α̂ − α̇)′(α̂ − α̇)/M* = (τ̂ − τ̇)′C⁻(τ̂ − τ̇)/M*.   (3.28)
To see this, observe that
(α̂ − S′τ̇)′(α̂ − S′τ̇) = (τ̂ − τ̇)′SS′(τ̂ − τ̇)
and that
CSS′C = T′TS(TS)′T = T′IT = T′T = C   (3.29)
(so that SS′ is a generalized inverse of C). Recalling that τ̂ = Λ′(X′X)⁻X′y, observe also that
τ̂ − τ̇ ∈ C(Λ′)   (3.30)
and [in light of Corollary 2.4.17 and result (3.1)] that
C(Λ′) = C(C),   (3.31)
implying that τ̂ − τ̇ = Cr for some vector r and hence that
(τ̂ − τ̇)′C⁻(τ̂ − τ̇) = r′CC⁻Cr = r′Cr
and leading to the conclusion that (τ̂ − τ̇)′C⁻(τ̂ − τ̇) is invariant to the choice of the generalized inverse C⁻.
In light of result (3.28), F̃(α̇) is reexpressible in terms of the quantity F(τ̇) defined as follows:
F(τ̇) = [(τ̂ − τ̇)′C⁻(τ̂ − τ̇)/M*] / [y′(I − P_X)y/(N − P*)] = (τ̂ − τ̇)′C⁻(τ̂ − τ̇)/(M*σ̂²)
—if y′(I − P_X)y = 0 or, equivalently, σ̂ = 0 (which is an event of probability 0), interpret F(τ̇) as 0 or ∞, depending on whether τ̂ = τ̇ or τ̂ ≠ τ̇. For α̇ = S′τ̇ or, equivalently, for τ̇ = W′α̇,
F̃(α̇) = F(τ̇).   (3.32)
In particular [in the case of the test statistic F̃(α⁽⁰⁾) and the pivotal quantity F̃(α)],
F̃(α⁽⁰⁾) = F(τ⁽⁰⁾)  and  F̃(α) = F(τ).   (3.33)
An expression for the quantity (1/σ²)(α̇ − α)′(α̇ − α)—this quantity appears in result (3.24)—can be obtained that is analogous to expression (3.28) for (α̂ − α̇)′(α̂ − α̇)/M* and that can be verified in essentially the same way. For α̇ = S′τ̇ [where τ̇ ∈ C(Λ′)], we find that
(1/σ²)(α̇ − α)′(α̇ − α) = (1/σ²)(τ̇ − τ)′C⁻(τ̇ − τ).   (3.34)

A characterization of the members of C(Λ′). Let τ̇ = (τ̇₁, τ̇₂, ..., τ̇_M)′ represent an arbitrary M × 1 vector. And (defining j₁, j₂, ..., j_M as in Subsection a) let τ̇* = (τ̇_{j₁}, τ̇_{j₂}, ..., τ̇_{j_{M*}})′. Then,
τ̇ ∈ C(Λ′)  ⇔  τ̇* ∈ R^{M*} and τ̇_{jᵢ} = k_{jᵢ}′τ̇*  (i = M*+1, M*+2, ..., M)   (3.35)
(where k_{j_{M*+1}}, k_{j_{M*+2}}, ..., k_{j_M} are as defined in Subsection a), as becomes evident upon observing that τ̇ ∈ C(Λ′) if and only if τ̇ = Λ′β̇ for some P × 1 vector β̇ and that Λ′ = K′Λ*′ (where K and Λ* are as defined in Subsection a).
A particular choice for the generalized inverse of C. Among the choices for the generalized inverse C⁻ is
C⁻ = S̃S̃′   (3.36)
(where S̃ is as defined in Subsection a). For i, i′ = 1, 2, ..., M*, the jᵢjᵢ′th element of this particular generalized inverse equals the ii′th element of the ordinary inverse C*⁻¹ of the M* × M* nonsingular submatrix C* of C (the submatrix obtained upon striking out the j_{M*+1}, j_{M*+2}, ..., j_Mth rows and columns of C); the remaining elements of this particular generalized inverse equal 0. Thus, upon setting C⁻ equal to this particular generalized inverse, we find that (τ̂ − τ̇)′C⁻(τ̂ − τ̇) is expressible [for τ̇ ∈ C(Λ′)] as follows:
(τ̂ − τ̇)′C⁻(τ̂ − τ̇) = (τ̂* − τ̇*)′C*⁻¹(τ̂* − τ̇*)   (3.37)
(where, as in Subsection a, τ̂* and τ̇* represent the subvectors of τ̂ and τ̇, respectively, obtained by striking out their j_{M*+1}, j_{M*+2}, ..., j_Mth elements).
Confidence set. Denote by ÃF a set of α-values defined as follows:
ÃF = {α̇ : F̃(α̇) ≤ F̄_γ̇(M*, N − P*)},
where F̄_γ̇(M*, N − P*) is the upper 100γ̇% point of the SF(M*, N − P*) distribution and γ̇ is a scalar between 0 and 1. Since F̃(α̇) varies with z, the set ÃF also varies with z. For purposes of making explicit the dependence of ÃF on z, let us write ÃF(z), or alternatively ÃF(α̂, ξ̂, d), for ÃF. On the basis of result (3.25), we have that
Pr[α ∈ ÃF(z)] = 1 − γ̇.
Thus, the set ÃF constitutes a 100(1 − γ̇)% confidence set for α. In light of result (3.27),
ÃF = {α̇ : (α̇ − α̂)′(α̇ − α̂) ≤ M*σ̂²F̄_γ̇(M*, N − P*)}.   (3.38)
The geometrical form of the set ÃF is that of an M*-dimensional closed ball centered at the point α̂ and with radius [M*σ̂²F̄_γ̇(M*, N − P*)]^{1/2}.
By exploiting the relationship τ = W′α, a confidence set for τ can be obtained from that for α. Define a set AF (of τ-values) as follows:
AF = {τ̇ : τ̇ = W′α̇, α̇ ∈ ÃF}.   (3.39)
Since ÃF depends on z and hence (since z = O′y) on y, the set AF depends on y. For purposes of making this dependence explicit, let us write AF(y) for AF. Clearly,
Pr[τ ∈ AF(y)] = Pr[W′α ∈ AF(y)] = Pr[α ∈ ÃF(z)] = 1 − γ̇.
Thus, the set AF constitutes a 100(1 − γ̇)% confidence set for τ.
Making use of results (3.32) and (3.6), we find that
AF = {τ̇ ∈ C(Λ′) : F(τ̇) ≤ F̄_γ̇(M*, N − P*)}
  = {τ̇ ∈ C(Λ′) : (τ̇ − τ̂)′C⁻(τ̇ − τ̂) ≤ M*σ̂²F̄_γ̇(M*, N − P*)}.   (3.40)
In the special case where M* = M, result (3.40) simplifies to
AF = {τ̇ : (τ̇ − τ̂)′C⁻¹(τ̇ − τ̂) ≤ Mσ̂²F̄_γ̇(M, N − P*)}.   (3.41)
More generally,
AF = {τ̇ : (τ̇* − τ̂*)′C*⁻¹(τ̇* − τ̂*) ≤ M*σ̂²F̄_γ̇(M*, N − P*), τ̇_{jᵢ} = k_{jᵢ}′τ̇* (i = M*+1, M*+2, ..., M)}   (3.42)
[where τ̇* = (τ̇_{j₁}, τ̇_{j₂}, ..., τ̇_{j_{M*}})′, where τ̇₁, τ̇₂, ..., τ̇_M represent the first, second, …, Mth elements of τ̇, and where k_{j_{M*+1}}, k_{j_{M*+2}}, ..., k_{j_M} and j₁, j₂, ..., j_M are as defined in Subsection a]. Geometrically, the set {τ̇* : (τ̇* − τ̂*)′C*⁻¹(τ̇* − τ̂*) ≤ M*σ̂²F̄_γ̇(M*, N − P*)} is represented by the points in M*-dimensional space enclosed by a surface that is “elliptical in nature.”
F test. Define a set C̃F of z-values as follows:
C̃F = {z : F̃(α⁽⁰⁾) > F̄_γ̇(M*, N − P*)}
  = {z : [(α̂ − α⁽⁰⁾)′(α̂ − α⁽⁰⁾)/M*] / [d′d/(N − P*)] > F̄_γ̇(M*, N − P*)}.
And take φ̃F(z) to be the corresponding indicator function:
φ̃F(z) = 1 if z ∈ C̃F, and φ̃F(z) = 0 if z ∉ C̃F.
Then, under the null hypothesis H̃₀: α = α⁽⁰⁾ or, equivalently, H₀: τ = τ⁽⁰⁾,
Pr(z ∈ C̃F) = Pr[φ̃F(z) = 1] = γ̇.
Thus, as a size-γ̇ test of H̃₀ or H₀, we have the test with critical (rejection) region C̃F and critical (test) function φ̃F(z), that is, the test that rejects H̃₀ or H₀ if z ∈ C̃F or φ̃F(z) = 1 and accepts H̃₀ or H₀ otherwise. This test is referred to as the size-γ̇ F test.
The critical region and critical function of the size-γ̇ F test can be reexpressed in terms of y. In light of result (3.33),
z ∈ C̃F ⇔ y ∈ CF  and  φ̃F(z) = 1 ⇔ φF(y) = 1,   (3.43)
where
CF = {y : F(τ⁽⁰⁾) > F̄_γ̇(M*, N − P*)}
  = {y : [(τ̂ − τ⁽⁰⁾)′C⁻(τ̂ − τ⁽⁰⁾)/M*] / [y′(I − P_X)y/(N − P*)] > F̄_γ̇(M*, N − P*)}
and
φF(y) = 1 if y ∈ CF, and φF(y) = 0 if y ∉ CF.
In connection with result (3.43), it is worth noting that [in light of result (3.37)]
(τ̂ − τ⁽⁰⁾)′C⁻(τ̂ − τ⁽⁰⁾) = (τ̂* − τ*⁽⁰⁾)′C*⁻¹(τ̂* − τ*⁽⁰⁾),   (3.44)
where τ*⁽⁰⁾ is the subvector of τ⁽⁰⁾ obtained upon striking out its j_{M*+1}, j_{M*+2}, ..., j_Mth elements—result (3.44) is consistent with the observation that the (testable) null hypothesis H₀ is equivalent to the null hypothesis that τ* = τ*⁽⁰⁾.
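As a computational companion to the preceding definitions, the following sketch (Python with NumPy/SciPy; the model matrix, Λ, hypothesized value, and the level—written gamma_dot here—are all illustrative) evaluates F(τ⁽⁰⁾) in the full-rank case rank Λ = M, where the ordinary inverse of C can be used, and compares it with the upper percentage point of the F distribution:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    N, P = 30, 5
    X = rng.normal(size=(N, P))
    y = X @ rng.normal(size=P) + rng.normal(size=N)

    Lam = np.eye(P)[:, [1, 2]]          # test that the 2nd and 3rd coefficients are 0
    tau0 = np.zeros(2)
    M_star = np.linalg.matrix_rank(Lam)  # here M* = M = 2

    XtX_inv = np.linalg.inv(X.T @ X)
    tau_hat = Lam.T @ XtX_inv @ X.T @ y
    C = Lam.T @ XtX_inv @ Lam
    sigma2_hat = np.sum((y - X @ XtX_inv @ X.T @ y) ** 2) / (N - P)

    F_stat = (tau_hat - tau0) @ np.linalg.solve(C, tau_hat - tau0) / (M_star * sigma2_hat)
    gamma_dot = 0.05                    # size of the test
    F_crit = stats.f.ppf(1 - gamma_dot, M_star, N - P)
    print(F_stat, F_crit, F_stat > F_crit)   # reject H0 when the last entry is True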
A relationship. There is an intimate relationship between the F test of the null hypothesis H̃₀ (or the equivalent null hypothesis H₀) and the confidence sets ÃF and AF for α and τ. Clearly,
z ∉ C̃F ⇔ α⁽⁰⁾ ∈ ÃF  and  y ∉ CF ⇔ τ⁽⁰⁾ ∈ AF.   (3.45)
Thus, the confidence set ÃF for α consists of those values of α⁽⁰⁾ for which the null hypothesis H̃₀: α = α⁽⁰⁾ is accepted, and the confidence set AF for τ consists of those values of τ⁽⁰⁾ [∈ C(Λ′)] for which the equivalent null hypothesis H₀: τ = τ⁽⁰⁾ is accepted.
Power function and probability of false coverage. The probability of H̃₀ or H₀ being rejected by the F test (or any other test) depends on the model's parameters; when regarded as a function of the model's parameters, this probability is referred to as the power function of the test. The power function of the size-γ̇ F test of H̃₀ or H₀ is expressible in terms of α, ξ, and σ; specifically, it is expressible as the function π̃F(α, ξ, σ) defined as follows:
π̃F(α, ξ, σ) = Pr[F̃(α⁽⁰⁾) > F̄_γ̇(M*, N − P*)].
And [as is evident from result (3.24)]
F̃(α⁽⁰⁾) ~ SF[M*, N − P*, (1/σ²)(α − α⁽⁰⁾)′(α − α⁽⁰⁾)].   (3.46)
Thus, π̃F(α, ξ, σ) does not depend on ξ, and it depends on α and σ only through the quantity
(1/σ²)(α − α⁽⁰⁾)′(α − α⁽⁰⁾) = [(1/σ)(α − α⁽⁰⁾)]′[(1/σ)(α − α⁽⁰⁾)].   (3.47)
This quantity is interpretable as the squared distance (in units of σ) between the true and hypothesized values of α. When α = α⁽⁰⁾, π̃F(α, ξ, σ) = γ̇.
The power function can be reexpressed as a function, say πF(τ, ξ, σ), of τ, ξ, and σ. Clearly,
πF(τ, ξ, σ) = π̃F(S′τ, ξ, σ),
and, in light of result (3.34),
(1/σ²)(S′τ − α⁽⁰⁾)′(S′τ − α⁽⁰⁾) = (1/σ²)(τ − τ⁽⁰⁾)′C⁻(τ − τ⁽⁰⁾)   (3.48)
  = (1/σ²)(τ* − τ*⁽⁰⁾)′C*⁻¹(τ* − τ*⁽⁰⁾).   (3.49)
For α ≠ α⁽⁰⁾ or, equivalently, τ ≠ τ⁽⁰⁾, π̃F(α, ξ, σ) or πF(τ, ξ, σ) represents the power of the size-γ̇ F test, that is, the probability of rejecting H̃₀ or H₀ when H̃₀ or H₀ is false. The power of a size-γ̇ test is a widely adopted criterion for assessing the test's effectiveness.
In the case of a 100(1 − γ̇)% confidence region for α or τ, the assessment of its effectiveness might be based on the probability of false coverage, which (by definition) is the probability that the region will cover (i.e., include) a vector α⁽⁰⁾ when α⁽⁰⁾ ≠ α or a vector τ⁽⁰⁾ when τ⁽⁰⁾ ≠ τ. In light of the relationships (3.43) and (3.45), the probability Pr(τ⁽⁰⁾ ∈ AF) of AF covering τ⁽⁰⁾ [where τ⁽⁰⁾ ∈ C(Λ′)] equals the probability Pr(α⁽⁰⁾ ∈ ÃF) of ÃF covering α⁽⁰⁾ (= S′τ⁽⁰⁾), and their probability of coverage equals 1 − πF(τ, ξ, σ) or (in terms of α and ξ) 1 − π̃F(α, ξ, σ).
A property of the noncentral F distribution. The power function π̃F(α, ξ, σ) or πF(τ, ξ, σ) of the F test of H̃₀ or H₀ depends on the model's parameters only through the noncentrality parameter of a noncentral F distribution with numerator degrees of freedom M* and denominator degrees of freedom N − P*. An important characteristic of this dependence is discernible from the following lemma.
Lemma 7.3.1. Let w represent a random variable that has an SF(r, s, λ) distribution (where 0 < r < ∞, 0 < s < ∞, and 0 ≤ λ < ∞). Then, for any (strictly) positive constant c, Pr(w > c) is a strictly increasing function of λ.
Preliminary to proving Lemma 7.3.1, it is convenient to establish (in the form of the following two lemmas) some results on central or noncentral chi-square distributions.
Lemma 7.3.2. Let u represent a random variable that has a χ²(r) distribution (where 0 < r < ∞). Then, for any (strictly) positive constant c, Pr(u > c) is a strictly increasing function of r.
Proof (of Lemma 7.3.2). Let v represent a random variable that is distributed independently of u as χ²(s) (where 0 < s < ∞). Then, u + v ~ χ²(r + s), and it suffices to observe that
Pr(u + v > c) = Pr(u > c) + Pr(u ≤ c, v > c − u) > Pr(u > c).  Q.E.D.
Lemma 7.3.3. Let u represent a random variable that has a χ²(r, λ) distribution (where 0 < r < ∞ and 0 ≤ λ < ∞). Then, for any (strictly) positive constant c, Pr(u > c) is a strictly increasing function of λ.
Proof (of Lemma 7.3.3). Let h(·) represent the pdf of the χ²(r, λ) distribution. Further, for j = 1, 2, 3, ..., let g_j(·) represent the pdf of the χ²(j) distribution, and let v_j represent a random variable that has a χ²(j) distribution. Then, making use of expression (6.2.14), we find that (for λ > 0)
d Pr(u > c)/dλ = ∫_c^∞ ∂h(u)/∂λ du
  = ∫_c^∞ Σ_{k=0}^∞ {d[(λ/2)^k e^{−λ/2}/k!]/dλ} g_{2k+r}(u) du
  = ∫_c^∞ {−(1/2)e^{−λ/2} g_r(u) + Σ_{k=1}^∞ (1/k!)[(1/2)k(λ/2)^{k−1}e^{−λ/2} − (1/2)(λ/2)^k e^{−λ/2}] g_{2k+r}(u)} du
  = (1/2) ∫_c^∞ {Σ_{j=1}^∞ [(λ/2)^{j−1}e^{−λ/2}/(j−1)!] g_{2j+r}(u) − Σ_{k=0}^∞ [(λ/2)^k e^{−λ/2}/k!] g_{2k+r}(u)} du
  = (1/2) ∫_c^∞ {Σ_{k=0}^∞ [(λ/2)^k e^{−λ/2}/k!] g_{2k+r+2}(u) − Σ_{k=0}^∞ [(λ/2)^k e^{−λ/2}/k!] g_{2k+r}(u)} du
  = (1/2) Σ_{k=0}^∞ [(λ/2)^k e^{−λ/2}/k!] [Pr(v_{2k+r+2} > c) − Pr(v_{2k+r} > c)].
And in light of Lemma 7.3.2, it follows that d Pr(u > c)/dλ > 0 and hence that Pr(u > c) is a strictly increasing function of λ.  Q.E.D.
Proof (of Lemma 7.3.1). Let u and v represent random variables that are distributed independently as χ²(r, λ) and χ²(s), respectively. Then, observing that w ~ (u/r)/(v/s) and denoting by g(·) the pdf of the χ²(s) distribution, we find that
Pr(w > c) = Pr[(u/r)/(v/s) > c] = Pr(u > crv/s) = ∫₀^∞ Pr(u > crv̇/s) g(v̇) dv̇.
And based on Lemma 7.3.3, we conclude that Pr(w > c) is a strictly increasing function of λ.  Q.E.D.
As an application of Lemma 7.3.1, we have that the size-γ̇ F test of H̃₀ or H₀ is such that the probability of rejection is a strictly increasing function of the quantity (3.47), (3.48), or (3.49). And in the case of the 100(1 − γ̇)% confidence region ÃF or AF, there is an implication that the probability Pr(α⁽⁰⁾ ∈ ÃF) or Pr(τ⁽⁰⁾ ∈ AF) of ÃF or AF covering the vector α⁽⁰⁾ or τ⁽⁰⁾ [where τ⁽⁰⁾ ∈ C(Λ′)] is a strictly decreasing function of the quantity (3.47), (3.48), or (3.49).
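Lemma 7.3.1, and the power computation it underlies, can be illustrated directly with the noncentral F distribution. A brief sketch (Python with SciPy; the degrees of freedom, level, and noncentrality values are arbitrary illustrations):

    from scipy import stats

    M_star, N_minus_P = 3, 25          # numerator and denominator degrees of freedom
    gamma_dot = 0.05                   # size of the test
    F_crit = stats.f.ppf(1 - gamma_dot, M_star, N_minus_P)

    # Power of the size-gamma_dot F test as a function of the noncentrality
    # parameter; by Lemma 7.3.1 it is strictly increasing in that parameter.
    for nc in [0.0, 1.0, 4.0, 9.0, 16.0]:
        power = stats.ncf.sf(F_crit, M_star, N_minus_P, nc)
        print(nc, round(power, 4))

At noncentrality 0 the printed value reduces to the size of the test, consistent with the similarity property discussed below.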
Similarity and unbiasedness. The size- P F test of the null hypothesis HQ 0 or H0 is a similar test in the
sense that the probability of rejection is the same (equal to P ) for all values of the model’s parameters
for which the null hypothesis is satisfied. And the size- P F test of HQ 0 or H0 is an unbiased test in
the sense that the probability of rejection is at least as great for all values of the model’s parameters
for which the alternative hypothesis is satisfied as for any values for which the null hypothesis is
satisfied. In fact, as is evident from the results of the preceding part of the present subsection, it is
strictly unbiased in the sense that the probability of rejection is (strictly) greater for all values of the
model’s parameters for which the alternative hypothesis is satisfied than for any values for which the
null hypothesis is satisfied.
The 100.1 P /% confidence regions AQF or AF possess properties analogous to those of the
size- P test. The probability of coverage Pr.˛ 2 AQF / or Pr. 2 AF / is the same (equal to 1 P ) for
all values of the model’s parameters. And the probability of AQF or AF covering a vector ˛.0/ or  .0/
that differs from ˛ or , respectively, is (strictly) less than 1 P .
A special case: M = M* = 1. Suppose (for the time being) that M = M* = 1. Then, each of the quantities τ, τ̂, C, α, and α̂ has a single element; let us write τ, τ̂, c, α, or α̂, respectively, for this element. Each of the matrices T and S also has a single element. The element of T is ±√c, and hence the element of S can be taken to be either 1/√c or −1/√c; let us take it to be 1/√c, in which case
α = τ/√c  and  α̂ = τ̂/√c.
We find that (for arbitrary values α̇ and τ̇ of α and τ, respectively)
F̃(α̇) = [t̃(α̇)]²  and  F(τ̇) = [t(τ̇)]²,   (3.50)
where t̃(α̇) = (α̂ − α̇)/σ̂ and t(τ̇) = (τ̂ − τ̇)/(√c σ̂)—if σ̂ = 0 or, equivalently, d = 0, interpret t̃(α̇) as 0, ∞, or −∞ depending on whether α̂ = α̇, α̂ > α̇, or α̂ < α̇ and, similarly, interpret t(τ̇) as 0, ∞, or −∞ depending on whether τ̂ = τ̇, τ̂ > τ̇, or τ̂ < τ̇. Further,
t̃(α̇) = (1/σ)(α̂ − α̇) / [(1/σ²)d′d/(N − P*)]^{1/2} ~ St[N − P*, (α − α̇)/σ].   (3.51)
And for α̇ = τ̇/√c or, equivalently, for τ̇ = √c α̇,
t̃(α̇) = t(τ̇)  and  (α − α̇)/σ = (τ − τ̇)/(√c σ).   (3.52)
Let t̄_{γ̇/2}(N − P*) represent the upper 100(γ̇/2)% point of the St(N − P*) distribution, and observe [in light of result (6.4.26)] that t̄_{γ̇/2}(N − P*) = [F̄_γ̇(1, N − P*)]^{1/2}. Then, writing τ⁽⁰⁾ for the single element of the (1 × 1) vector of hypothesized values, the critical region CF of the size-γ̇ F test of the null hypothesis H₀: τ = τ⁽⁰⁾ is reexpressible in the form
CF = {y : |t(τ⁽⁰⁾)| > t̄_{γ̇/2}(N − P*)}   (3.53)
  = {y : t(τ⁽⁰⁾) > t̄_{γ̇/2}(N − P*) or t(τ⁽⁰⁾) < −t̄_{γ̇/2}(N − P*)}.   (3.54)
And the 100(1 − γ̇)% confidence set AF for τ is reexpressible in the form
AF = {τ̇ ∈ R¹ : τ̂ − √c σ̂ t̄_{γ̇/2}(N − P*) ≤ τ̇ ≤ τ̂ + √c σ̂ t̄_{γ̇/2}(N − P*)}.   (3.55)
Accordingly, in the special case under consideration (that where M = M* = 1), the F test of H₀ is equivalent to what is commonly referred to as the two-sided t test, and the corresponding confidence set takes the form of an interval whose end points are equidistant from τ̂—this interval is sometimes referred to as a t interval.
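For the case M = M* = 1, the confidence set (3.55) is the familiar t interval, which is straightforward to compute. A sketch (Python with NumPy/SciPy; the data and the coefficient singled out for inference are illustrative):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    N, P = 25, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])
    y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=N)

    lam = np.array([0.0, 1.0, 0.0])                 # tau = lambda'beta, a single coefficient
    XtX_inv = np.linalg.inv(X.T @ X)
    tau_hat = lam @ XtX_inv @ X.T @ y
    c = lam @ XtX_inv @ lam                         # the scalar c = lambda'(X'X)^{-1}lambda
    sigma2_hat = np.sum((y - X @ XtX_inv @ X.T @ y) ** 2) / (N - P)

    gamma_dot = 0.05
    t_crit = stats.t.ppf(1 - gamma_dot / 2, N - P)  # upper 100(gamma_dot/2)% point
    half_width = np.sqrt(c * sigma2_hat) * t_crit
    print(tau_hat - half_width, tau_hat + half_width)   # the t interval of (3.55)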
An extension. Let us resume our discussion of the general case (where M is an arbitrary positive
integer and where M may be less than M ). And let us consider the extent to which the “validity” of
the F test of HQ 0 or H0 and of the corresponding confidence set for ˛ or  depend on the assumption
that the distribution of the vector e of residual effects is MVN.
Recalling result (3.18), observe that
0 1
˛O ˛
@ O  A D O 0 e:
d
Observe also that if e has an absolutely continuous spherical distribution, then so does the transformed
0
vector O  e. Moreover,
 if O 0 e has an absolutely continuous spherical distribution, then so does the
˛O ˛
vector (as is evident upon observing that this vector is a subvector of O 0 e and upon
d
recalling the discussion of Section 5.9c). And in light of expression(3.23),it follows from the
Q ˛O ˛
discussion of Section 6.3b that F.˛/ has the same distribution when has any absolutely
d
continuous spherical distribution as it does in the special case where the distribution of this vector is
N.0;  2 I/ [which is its distribution when e  N.0;  2 I/]. It is now clear that the assumption that
the distribution of e is MVN is stronger than needed to insure the validity of the size- P F test of HQ 0
or H0 and of the corresponding 100.1 Q
 P /%  confidence set AF or AF . When the distribution of e or,
˛O ˛
more generally, the distribution of , is an absolutely continuous spherical distribution, the
d
size of this test and the probability of coverage of this confidence set are the same as in the special
case where the distribution of e is N.0;  2 I/.
It is worth noting that while the size of the test and the probability of coverage of the confidence
set are the same for all absolutely continuous spherical distributions, the power of the test and the
probability of false coverage are not the same. For ˛P D ˛, the distribution of F. Q ˛/
P is the same for
all absolutely continuous spherical distributions; however, for ˛P ¤ ˛, it differs  from one absolutely
˛O ˛
continuous spherical distribution to another. If the distribution of the vector is N.0;  2 I/
d
[as would be the case if e  N.0;  2 I/], then the distribution of F. Q ˛/
P is noncentral F with degrees
of freedom M and N P and noncentrality  parameter .1= 2
/. P
˛ ˛/0 .˛P ˛/. More generally, if
˛O ˛
the distribution of the vector .1=/ (which has expected value 0 and variance-covariance
d
matrix I) is an absolutely continuous spherical distribution (that does not depend on the model’s
parameters), then the distribution of F. Q ˛/P depends on the model’s parameters only through the value
of the quantity .1= 2 /.˛P ˛/0 .˛P ˛/ [as can be readily verified via a development similar to that
leading up to result (6.2.61)]. However, only in special cases is this distribution a noncentral F
distribution.
Invariance/equivariance. Let A(y) (or simply A) represent an arbitrary 100(1 − γ̇)% confidence set for the vector τ (= Λ′β). And let Ã(z) (or simply Ã) represent the corresponding 100(1 − γ̇)% confidence set for the vector α (= S′τ), so that
Ã(z) = {α̇ : α̇ = S′τ̇, τ̇ ∈ A(Oz)}.
Further, take φ(·) and φ̃(·) to be functions (of y and z, respectively) defined as follows:
φ(y) = 1 if τ⁽⁰⁾ ∉ A(y), and φ(y) = 0 if τ⁽⁰⁾ ∈ A(y);  φ̃(z) = 1 if α⁽⁰⁾ ∉ Ã(z), and φ̃(z) = 0 if α⁽⁰⁾ ∈ Ã(z),   (3.56)

in which case .y/ is the critical function for a size- P test of the null hypothesis H0 W  D  .0/ (versus
the alternative hypothesis H1 W  ¤  .0/ ) or, equivalently, of HQ 0 W ˛ D ˛.0/ (versus HQ 1 W ˛ ¤ ˛.0/ )
Q
and .z/ D .Oz/. Or, more generally, define .y/ to be the critical function for an arbitrary size- P
test of H0 or HQ 0 (versus H1 or HQ 1 ), and define .z/Q D .Oz/.
Taking u to be an N  1 unobservable random vector that has an N.0; I/ distribution or some
other absolutely continuous spherical distribution0with 1 mean 0 and variance-covariance matrix I and
˛
assuming that y  Xˇ C  u and hence that z  @ A C  u, let TQ ./ represent a one-to-one (linear)
0
transformation from RN onto RN for which there exist corresponding one-to-one transformations
FQ1 ./ from RM onto RM, FQ2 ./ from RP M onto RP M, and FQ3 ./ from the interval .0; 1/
onto .0; 1/ [where TQ ./, FQ1 ./, FQ2 ./, and FQ3 ./ do not depend on ˇ or ] such that
2 3
FQ1 .˛/
TQ .z/  4FQ2 ./5 C FQ3 ./u; (3.57)
6 7
0
so that the problem of making inferences about the vector FQ1 .˛/ on the basis of the transformed
vector TQ .z/ is of the same general form as that of making inferences about ˛ on the basis of z. Further,
let GQ represent a group of such transformations—refer, e.g., to Casella and Berger (2002, sec. 6.4)
for the definition of a group. And let us write TQ .z1 ; z2 ; z3 / for TQ .z/ whenever it is convenient to do
so.
In making a choice for the 100.1 P /% confidence set A (for ), there would seem to be some
appeal in restricting attention to choices for which the corresponding 100.1 P /% confidence set AQ
(for ˛) is such that, for every value of z and for every transformation TQ ./ in G, Q
Q TQ .z/ D f˛R W ˛R D FQ1 .˛/;
AΠQ
P ˛P 2 A.z/g: (3.58)
A choice for AQ having this property is said to be invariant or equivariant with respect to G,
Q with the
term invariant being reserved for the special case where
FQ1 .˛/
P D ˛P for every ˛P 2 RM [and every TQ ./ 2 G]
Q (3.59)
Q TQ .z/ D A.z/.
—in that special case, condition (3.58) simplifies to AŒ Q
Clearly,
˛ D ˛.0/ , FQ1 .˛/ D FQ1 .˛.0/ /: (3.60)
Suppose that
FQ1 .˛.0/ / D ˛.0/ [for every TQ ./ 2 G].
Q (3.61)
Then, in making a choice for a size- P test of the null hypothesis HQ 0 W ˛ D ˛ (versus the alternative
.0/

hypothesis HQ 1 W ˛ ¤ ˛.0/ ), there would seem to be some appeal in restricting attention to choices
Q is such that, for every value of z and for every transformation
for which the critical function ./
TQ ./ in G,
Q
Q TQ .z/ D .z/:
ΠQ (3.62)
—the appeal may be enhanced if condition (3.59) [which is more restrictive than condition (3.61)]
is satisfied. A choice for a size- P test (of HQ 0 versus HQ 1 ) having this property is said to be invariant
with respect to GQ (as is the critical function itself). In the special case where condition (3.59) is
satisfied, a size- P test with critical function of the form (3.56) is invariant with respect to GQ for every
hypothesized value ˛.0/ 2 RM if and only if the corresponding 100.1 P /% confidence set AQ is
invariant with respect to G. Q
The characterization of equivariance and invariance can be recast in terms of  and/or in terms of
transformations of y. Corresponding to the one-to-one transformation FQ1 ./ (from RM onto RM )
is the one-to-one transformation F1 ./ from C.ƒ0 / onto C.ƒ0 / defined as follows:
P D W 0FQ1 .S0 /
F1 ./ P [for every P 2 C.ƒ0 /] (3.63)
or, equivalently,
P D S0F1 .W 0˛/
FQ1 .˛/ P (for every ˛P 2 RM ). (3.64)
And corresponding to the one-to-one transformation TQ ./ (from RN onto RN ) is the one-to-one
transformation T ./ (also from RN onto RN ) defined as follows:
T .y/ D O TQ .O 0 y/ or, equivalently, TQ .z/ D O 0 T .Oz/: (3.65)
Clearly, 2
FQ1 .˛/
3
T .y/ D O TQ .z/  O4FQ2 ./ 5 C FQ3 ./u: (3.66)
0
Moreover,
FQ1 .˛/
2 3
Q
O4FQ2 ./ 5 D XRSS 0
F1 ./ C UFQ2 ./ D XRC Q F1 ./ C UFQ2 ./; (3.67)
0
as can be readily verified.
Now, let G represent the group of transformations of y obtained upon reexpressing each of the
transformations (of z) in the group GQ in terms of y, so that T ./ 2 G if and only if TQ ./ 2 G,
Q where
TQ ./ is the unique transformation (of z) that corresponds to the transformation T ./ (of y) in the
sense determined by relationship (3.65). Then, condition (3.59) is satisfied if and only if
P D P for every P 2 C.ƒ0 / [and every T ./ 2 G];
F1 ./ (3.68)
and condition (3.61) is satisfied if and only if
F1 . .0/ / D  .0/ [for every T ./ 2 G]. (3.69)

—the equivalence of conditions (3.59) and (3.68) and the equivalence of conditions (3.61) and (3.69)
can be readily verified.
Q for ˛ is equivariant or invariant with respect to GQ
Further, the 100.1 P /% confidence set A.z/
if and only if the corresponding 100.1 P /% confidence set A.y/ for  is respectively equivariant
or, in the special case where condition (3.68) is satisfied, invariant in the sense that, for every value
of y and for every transformation T ./ in G,
AŒT .y/ D fR W R D F1 ./;
P P 2 A.y/g (3.70)
—in the special case where condition (3.68) is satisfied, condition (3.70) simplifies to AŒT .y/ D
A.y/. To see this, observe that [for P and R in C.ƒ0 /]
P 2 A.y/ , P 2 A.Oz/ , S0 P 2 A.z/;
Q
that
P , R D W 0FQ1 .S0 /
R D F1 ./ P , S0 R D FQ1 .S0 /;
P
and that
R 2 AŒT .y/ , S0 R 2 AŒ
Q TQ .z/:

Now, consider the size- P test of H0 or HQ 0 (versus H1 or HQ 1 ) with critical function .z/.
Q This
Q 0y/. And assuming that condition (3.61)
test is identical to that with critical function .y/ ŒD .O
or equivalently condition (3.69) is satisfied, the test and the critical function .z/ are invariant with
respect to GQ [in the sense (3.62)] if and only if the test and the critical function .y/ are invariant
with respect to G in the sense that [for every value of y and for every transformation T ./ in G]
ŒT .y/ D .y/; (3.71)
as is evident upon observing that
ŒT .y/ D .y/ , ŒOTQ .z/ D .Oz/ , Œ


Q TQ .z/ D .z/:
Q
In the special case where condition (3.68) is satisfied, a size- P test with critical function of the form
(3.56) is invariant with respect to G for every hypothesized value  .0/ 2 C.ƒ0 / if and only if the
corresponding 100.1 P /% confidence set A is invariant with respect to G.
Translation (location) invariance/equivariance. As special cases of the size- P test (of HQ 0 or H0 )
with critical function ./ Q or ./ and of the 100.1 P /% confidence sets AQ and A (for ˛ and ,
respectively), we have the size- P F test and the corresponding confidence sets. These are the special
cases where ./ Q D Q F ./, ./ D F ./, AQ D AQF , and A D AF . Are the size- P F test and the
corresponding confidence sets invariant or equivariant?
Let T̃₀(z) and T̃₁(z) represent one-to-one transformations from R^N onto R^N of the following form:
T̃₀(z) = [z₁ + a; z₂; z₃]  and  T̃₁(z) = [z₁; z₂ + c; z₃],
where a is an M*-dimensional and c a (P* − M*)-dimensional vector of constants. And let T̃₀₁(z) represent the transformation formed by composition from T̃₀(z) and T̃₁(z), so that
T̃₀₁(z) = T̃₀[T̃₁(z)] = T̃₁[T̃₀(z)] = [z₁ + a; z₂ + c; z₃].
Clearly, TQ0 .z/, TQ1 .z/, and TQ01 .z/ are special cases of the transformation TQ .z/, which is characterized
by property (3.57)—FQ1.˛/ D ˛Ca for TQ0 .z/ and TQ01 .z/ and D ˛ for TQ1 .z/, FQ2 ./ D Cc for TQ1 .z/
and TQ01 .z/ and D  for TQ0 .z/, and FQ3 ./ D  for TQ0 .z/, TQ1 .z/, and TQ01 .z/. Further, corresponding
to TQ0 .z/, TQ1 .z/, and TQ01 .z/ are groups GQ 0 , GQ 1 , and GQ 01 of transformations [of the form TQ .z/] defined
as follows: TQ ./ 2 GQ 0 if, for some a, TQ ./ D TQ0 ./; TQ ./ 2 GQ 1 if, for some c, TQ ./ D TQ1 ./; and
TQ ./ 2 GQ 01 if, for some a and c, TQ ./ D TQ01 ./.
Corresponding to the transformations TQ0 .z/, TQ1 .z/, and TQ01 .z/ [of the form TQ .z/] are the trans-
formations T0 .y/, T1 .y/, and T01 .y/, respectively [of the form T .y/] defined as follows:
T0 .y/ D y Cv0 ; T1 .y/ D y Cv1 ; and T01 .y/ D y Cv; (3.72)
Q
where v0 2 C.XRS/, v1 2 C.U/, and v 2 C.X/, or, equivalently,
T0 .y/ D y CXb0 ; T1 .y/ D y CXb1 ; and T01 .y/ D y CXb; (3.73)
where b0 2 N.U 0 X/, b1 2 N.S0ƒ0 /, and b 2 RP. And corresponding to T0 .y/, T1 .y/, and T01 .y/
are groups G0 , G1 , and G01 of transformations [of the form T .y/] defined as follows: T ./ 2 G0
if, for some v0 or b0 , T ./ D T0 ./; T ./ 2 G1 if, for some v1 or b1 , T ./ D T1 ./; and T ./ 2 G01
if, for some v or b, T ./ D T01 ./. The groups G0 , G1 , and G01 do not vary with the choice of the
matrices S, U, and L, as can be readily verified.
Clearly, the F test (of HQ 0 or H0 ) and the corresponding confidence sets AQF .z/ and AF .y/
are invariant with respect to the group GQ 1 or G1 of transformations of the form TQ1 ./ or T1 ./.
And the confidence sets AQF .z/ and AF .y/ are equivariant with respect to the group GQ 0 or G0 of
transformations of the form TQ0 ./ or T0 ./ and with respect to the group GQ 01 or G01 of transformations
of the form TQ01 ./ or T01 ./.
Scale invariance/equivariance. Let us consider further the invariance or equivariance of the F test of H̃₀ or H₀ and of the corresponding confidence sets for α and τ. For α⁽⁰⁾ ∈ R^{M*}, let T̃₂(z; α⁽⁰⁾) represent a one-to-one transformation from R^N onto R^N of the following form:
T̃₂(z; α⁽⁰⁾) = [α⁽⁰⁾ + k(z₁ − α⁽⁰⁾); kz₂; kz₃],   (3.74)
where k is a strictly positive scalar. Note that in the special case where α⁽⁰⁾ = 0, equality (3.74) simplifies to
T̃₂(z; 0) = kz.   (3.75)
The transformation T̃₂(z; α⁽⁰⁾) is a special case of the transformation T̃(z). In this special case,
F̃₁(α) = α⁽⁰⁾ + k(α − α⁽⁰⁾),  F̃₂(ξ) = kξ,  and  F̃₃(σ) = kσ.

Denote by GQ 2.˛.0/ / the group of transformations (of z) of the form (3.74), so that a transformation
(of z) of the general form TQ .z/ is contained in GQ 2 .˛.0/ / if and only if, for some k .> 0/, TQ ./ D
TQ2 . I ˛.0/ /. Further, for  .0/ 2 C.ƒ0 /, take T2 .y I  .0/ / to be the transformation of y determined
from the transformation TQ2 .zI S0  .0/ / in accordance with relationship (3.65); and take G2 . .0/ / to
be the group of all such transformations, so that T2 . I  .0/ / 2 G2 . .0/ / if and only if, for some
transformation TQ2 . I S0  .0/ / 2 GQ 2 .S0  .0/ /, T2 .y I  .0/ / D O TQ2 .O 0 y I S0  .0/ /. The group G2 . .0/ /
does not vary with the choice of the matrices S, U, and L, as can be demonstrated via a relatively
straightforward exercise. In the particularly simple special case where  .0/ D 0, we have that
T2 .y I 0/ D ky: (3.76)

Clearly, the F test (of HQ 0 or H0 ) is invariant with respect to the group GQ 2 .˛.0/ / or G2 . .0/ / of
transformations of the form TQ2 . I ˛.0/ / or T2 . I  .0/ /. And the corresponding confidence sets AQF .z/
and AF .y/ are equivariant with respect to GQ 2 .˛.0/ / or G2 . .0/ /.
Invariance/equivariance with respect to groups of orthogonal transformations. The F test and the corresponding confidence sets are invariant with respect to various groups of orthogonal transformations (of the vector z). For α⁽⁰⁾ ∈ R^{M*}, let T̃₃(z; α⁽⁰⁾) represent a one-to-one transformation from R^N onto R^N of the following form:
T̃₃(z; α⁽⁰⁾) = [α⁽⁰⁾ + P′(z₁ − α⁽⁰⁾); z₂; z₃],   (3.77)
where P is an M* × M* orthogonal matrix. In particular, T̃₃(z; 0) represents a transformation of the form
T̃₃(z; 0) = [P′z₁; z₂; z₃].   (3.78)
Further, let T̃₄(z) represent a one-to-one transformation from R^N onto R^N of the form
T̃₄(z) = [z₁; z₂; B′z₃],   (3.79)
where B is an (N − P*) × (N − P*) orthogonal matrix.
The transformations T̃₃(z; α⁽⁰⁾) and T̃₄(z) are special cases of the transformation T̃(z). In the special case T̃₃(z; α⁽⁰⁾),
F̃₁(α) = α⁽⁰⁾ + P′(α − α⁽⁰⁾),  F̃₂(ξ) = ξ,  and  F̃₃(σ) = σ.
And in the special case T̃₄(z),
F̃₁(α) = α,  F̃₂(ξ) = ξ,  and  F̃₃(σ) = σ.

Denote by GQ 3 .˛.0/ / the group of transformations (of z) of the form (3.77) and by GQ 4 the group of
the form (3.79), so that a transformation (of z) of the general form TQ .z/ is contained in GQ 3.˛.0/ / if and
only if, for some (M  M orthogonal matrix) P , TQ ./ D TQ3 . I ˛.0/ / and is contained in GQ 4 if and
only if, for some [.N P/.N P / orthogonal matrix] B, TQ ./ D TQ4 ./. Further, take T4.y/ to be the
transformation of y determined from the transformation TQ4 .z/ in accordance with relationship (3.65),
and [for  .0/ 2 C.ƒ0 /] take T3 .y I  .0/ / to be that determined from the transformation TQ3 .zI S0  .0/ /.
And take G4 and G3 . .0/ / to be the respective groups of all such transformations, so that T4./ 2 G4 if
and only if, for some transformation TQ4 ./ 2 GQ 4 , T4 .y/ D O TQ4 .O 0 y/ and T3 . I  .0/ / 2 G3 . .0/ / if
and only if, for some transformation TQ3 . I S0  .0/ / 2 GQ 3 .S0  .0/ /, T3 .y I  .0/ / D O TQ3 .O 0 y I S0  .0/ /.
The groups G3 . .0/ / and G4 , like the groups G0 , G1 , G01 , and G2 . .0/ /, do not vary with the choice
of the matrices S, U, and L.
Clearly, the F test (of HQ 0 or H0 ) is invariant with respect to both the group GQ 3 .˛.0/ / or G3 . .0/ /
of transformations of the form TQ3 . I ˛.0/ / or T3 . I  .0/ / and the group GQ 4 or G4 of transformations
of the form TQ4 ./ or T4 ./. And the corresponding confidence sets AQF .z/ and AF .y/ are equivariant
with respect to GQ 3 .˛.0/ / or G3 . .0/ / and are invariant with respect to GQ 4 or G4 .

c. Simultaneous confidence intervals (or sets) and multiple comparisons: a generalized S method
Let us continue to take y to be an N  1 observable random vector that follows the G–M model.
And let us continue to take i D i0 ˇ (i D 1; 2; : : : ; M ) to be estimable linear combinations of the
elements of ˇ (where at least one of the vectors 1 ; 2 ; : : : ; M of coefficients is nonnull) and to take
 D .1 ; 2 ; : : : ; M /0 and ƒ D .1 ; 2 ; : : : ; M / (in which case  D ƒ0ˇ). Further, denote by D
the distribution of the vector e of residual effects (which, by definition, is a distribution with mean
vector 0 and variance-covariance matrix  2 I), and assume that D 2 D./, where D./ is a specified
subset of the set of all N -variate distributions with mean vector 0 and variance-covariance matrix
 2 I—e.g., D./ might consist of spherically symmetric distributions or only of the N.0;  2 I/
distribution.
Suppose that we wish to make inferences about each of the M linear combinations 1 ; 2 ; : : : ; M
or, more generally, about some or all linear combinations of 1 ; 2 ; : : : ; M . That is, suppose that we
wish to make inferences about a linear combination  D ı 0 and that we wish to do so for every
ı 2 , where  D RM , where  is the M -dimensional set formed by the columns of IM , or, more
generally, where  is a finite or infinite set of M 1 vectors—to avoid trivialities, it is assumed that 
is such that ƒı ¤ 0 for some ı 2 . Inference can take the form of a point estimate, of a confidence
interval (or, more generally, a confidence set), and/or of a test that  equals some hypothesized value
(versus the alternative that it does not equal the hypothesized value).
One approach is to carry out the inferences for each of the linear combinations in isolation, that
is, to ignore the fact that inferences are being carried out for other linear combinations. Such an
approach is sometimes referred to as “one-at-a-time.” In practice, there is a natural tendency to focus
on those linear combinations (of 1 ; 2 ; : : : ; M ) for which the results of the inferences are the “most
extreme.” The one-at-a-time approach does not account for any such tendency and, as a consequence,
can result in unjustified and misleading conclusions.
In the case of confidence intervals or sets, one way to counter the deficiencies of the one-at-a-time
approach is to require that the probability of simultaneous coverage equal 1 P , where P is a specified
scalar between 0 and 1. That is, letting Aı .y/ or simply Aı represent the confidence interval or set
for the linear combination  D ı 0, the requirement is that [for all ˇ 2 RP,  > 0, and D 2 D./]
PrŒı 0 2 Aı .y/ for every ı 2  D 1 P: (3.80)
Similarly, in the case of hypothesis tests, an alternative to the one-at-a-time approach is to require that the probability of falsely rejecting one or more of the null hypotheses not exceed $\gamma$. More specifically, letting $\delta'\tau^{(0)}$ represent the hypothesized value of the linear combination $\delta'\tau$ and letting $C(\delta)$ represent the critical region for the test of the null hypothesis $H_0^{(\delta)}\!: \delta'\tau = \delta'\tau^{(0)}$ versus the alternative hypothesis $H_1^{(\delta)}\!: \delta'\tau \neq \delta'\tau^{(0)}$, the requirement is that
$$\max_{\beta \in \mathbb{R}^P,\; \sigma > 0,\; D \in \mathcal{D}(\cdot)} \Pr\Bigl[\, y \in \textstyle\bigcup_{\{\delta \in \Delta \,:\, \delta'\tau = \delta'\tau^{(0)}\}} C(\delta) \Bigr] = \gamma. \qquad (3.81)$$

In regard to requirement (3.81) and in what follows, it is assumed that there exists an M  1 vector
 .0/ 2 C.ƒ0 / such that ı.0/ D ı 0 .0/, so that the collection H0.ı/ (ı 2 ) of null hypotheses is
“internally consistent.” Corresponding to a collection Aı (ı 2 ) of confidence intervals or sets that
satisfies requirement (3.80) (i.e., for which the probability of simultaneous coverage equals 1 P )
is a collection of tests of H0.ı/ versus H1.ı/ (ı 2 ) with critical regions C.ı/ (ı 2 ) defined
(implicitly) as follows:
y 2 C.ı/ , ı.0/ … Aı .y/: (3.82)
As can be readily verified (via an argument that will subsequently be demonstrated), this collection
of tests satisfies condition (3.81).
The null hypothesis H0.ı/ can be thought of as representing a comparison between the “actual” or
“true” value of some entity (e.g., a difference in effect between two “treatments”) and a hypothesized
value (e.g., 0). Accordingly, the null hypotheses forming the collection H0.ı/ (ı 2 ) may be referred
to as multiple comparisons, and procedures for testing these null hypotheses in a way that accounts
for the multiplicity may be referred to as multiple-comparison procedures.
A reformulation. The problem of making inferences about the linear combinations (of the elements $\tau_1, \tau_2, \ldots, \tau_M$ of $\tau$) forming the collection $\delta'\tau$ ($\delta \in \Delta$) is reexpressible in terms associated with the canonical form of the G–M model. Adopting the notation and recalling the results of Subsection a, the linear combination $\delta'\tau$ (of $\tau_1, \tau_2, \ldots, \tau_M$) is reexpressible as a linear combination of the $M^*$ elements of the vector $\alpha$ ($= S'\tau = S'\Lambda'\beta$). Making use of result (3.8), we find that
$$\delta'\tau = \delta'W'\alpha = (W\delta)'\alpha. \qquad (3.83)$$
Moreover, expression (3.83) is unique; that is, if $\tilde\delta$ is an $M^* \times 1$ vector of constants such that (for every value of $\beta$) $\delta'\tau = \tilde\delta'\alpha$, then
$$\tilde\delta = W\delta. \qquad (3.84)$$
To see this, observe that if $\tilde\delta'\alpha = \delta'\tau$ (for every value of $\beta$), then
$$\tilde\delta'S'\Lambda' = \delta'\Lambda' = \delta'(\Lambda S W)' = \delta'W'S'\Lambda',$$
implying that
$$(\tilde\delta - W\delta)'(\Lambda S)' = 0$$
and hence [since, in light of result (3.3), the rows of $(\Lambda S)'$ are linearly independent] that $\tilde\delta - W\delta = 0$ or, equivalently, that $\tilde\delta = W\delta$.
It is now clear that the problem of making inferences about $\delta'\tau$ for $\delta \in \Delta$ can be recast as one of making inferences about $\tilde\delta'\alpha$ for $\tilde\delta \in \tilde\Delta$, where
$$\tilde\Delta = \{\tilde\delta : \tilde\delta = W\delta,\; \delta \in \Delta\}.$$
Note that if $\Delta = \mathbb{R}^M$, then
$$\tilde\Delta = \mathbb{R}^{M^*},$$
as is evident upon recalling that $\operatorname{rank} W = M^*$ and observing that $\mathcal{C}(W) = \mathbb{R}^{M^*}$. Note also that, corresponding to $\tau^{(0)}$, there exists a unique $M^* \times 1$ vector $\alpha^{(0)}$ such that
$$\tau^{(0)} = W'\alpha^{(0)}. \qquad (3.85)$$
Simultaneous confidence intervals: a general approach. Suppose that the distribution of the vector $e$ of residual effects in the G–M model or, more generally, the distribution of the vector $\begin{pmatrix} \hat\alpha - \alpha \\ d \end{pmatrix}$ is MVN or is some other absolutely continuous spherical distribution (with mean $0$ and variance-covariance matrix $\sigma^2 I$). Then, making use of result (6.4.67) and letting $d_1, d_2, \ldots, d_{N-P^*}$ represent the elements of $d$, we find that
$$\hat\sigma^{-1}(\hat\alpha - \alpha) = \Bigl[(N-P^*)^{-1}\textstyle\sum_{i=1}^{N-P^*} d_i^2\Bigr]^{-1/2}(\hat\alpha - \alpha) \;\sim\; MVt(N-P^*,\, I_{M^*}). \qquad (3.86)$$

And letting $t$ represent an $M^* \times 1$ random vector that has an $MVt(N-P^*, I_{M^*})$ distribution, it follows that
$$\max_{\{\tilde\delta \in \tilde\Delta \,:\, \tilde\delta \neq 0\}} \frac{|\tilde\delta'(\hat\alpha - \alpha)|}{(\tilde\delta'\tilde\delta)^{1/2}\,\hat\sigma} \;\sim\; \max_{\{\tilde\delta \in \tilde\Delta \,:\, \tilde\delta \neq 0\}} \frac{|\tilde\delta' t|}{(\tilde\delta'\tilde\delta)^{1/2}}. \qquad (3.87)$$
Thus, letting (for any scalar $\gamma$ such that $0 < \gamma < 1$) $c_{\gamma}$ represent the upper $100\gamma\%$ point of the distribution of the random variable $\max_{\{\tilde\delta \in \tilde\Delta : \tilde\delta \neq 0\}} |\tilde\delta' t|/(\tilde\delta'\tilde\delta)^{1/2}$, we find that
$$\Pr\bigl[\,|\tilde\delta'(\hat\alpha - \alpha)| \le (\tilde\delta'\tilde\delta)^{1/2}\hat\sigma\, c_{\gamma} \text{ for every } \tilde\delta \in \tilde\Delta\,\bigr] = \Pr\Bigl[\max_{\{\tilde\delta \in \tilde\Delta : \tilde\delta \neq 0\}} \frac{|\tilde\delta'(\hat\alpha - \alpha)|}{(\tilde\delta'\tilde\delta)^{1/2}\,\hat\sigma} \le c_{\gamma}\Bigr] = 1 - \gamma. \qquad (3.88)$$
For ıQ 2 ,
Q denote by AQ Q .z/ or simply by AQ Q a set of -values (i.e., a set of scalars), the contents
ı ı
of which may depend on the value of the vector z (defined in Subsection a). Further, suppose that
AQ Q D fP 2 R1 W P D ıQ 0˛;
ı P jıQ 0 .˛O ˛/j Q 1=2 O c P g:
P  .ıQ 0 ı/ (3.89)
Clearly, the set (3.89) is reexpressible in the form
Q 1=2 O c P ;
Q 1=2 O c P  P  ıQ 0 ˛O C .ıQ 0 ı/
AQıQ D P 2 R1 W ıQ 0 ˛O .ıQ 0 ı/ (3.90)
˚

Q 1=2 O c P . And in light of result


that is, as an interval with upper and lower end points ıQ 0 ˛O ˙ .ıQ 0 ı/
(3.88),
PrŒıQ 0 ˛ 2 AQıQ .z/ for every ıQ 2 
Q D 1 P; (3.91)
that is, the probability of simultaneous coverage of the intervals AQıQ .z/ (ıQ 2 ) Q equals 1 P .
For ı 2  [and for any choice of the sets AQıQ (ıQ 2 )],Q denote by Aı .y/ or simply by Aı the set
Q 0
AW ı .O y/. Then, for any M  1 vector ˛, P
ıQ 0 ˛P 2 AQıQ .z/ for every ıQ 2 
Q , ı 0 W 0 ˛P 2 Aı .Oz/ for every ı 2 : (3.92)
And upon observing that, for ˛P D ˛, ı 0 W 0 ˛P D ı 0  [as is evident from result (3.8)], it follows, in
particular, that
PrŒı 0  2 Aı .y/ for every ı 2  D PrŒıQ 0 ˛ 2 AQ Q .z/ for every ıQ 2 :
ı
Q (3.93)

Now, suppose that (for ıQ 2 )


Q AQ Q is interval (3.90). Then, we find [in light of results (3.7) and
ı
(3.9)] that (for ı 2 )
Aı .y/ D AQW ı .O 0 y/ D fP 2 R1 W ı 0 O .ı 0 Cı/1=2 O c P  P  ı 0 O C .ı 0 Cı/1=2 O c P g (3.94)
and [in light of results (3.91) and (3.93)] that
PrŒı 0 2 Aı .y/ for every ı 2  D 1 P: (3.95)
0 0 1=2
Thus, Aı .y/ is an interval with upper and lower end points ı O ˙ .ı Cı/ O c P , and the probability
of simultaneous coverage of the intervals Aı .y/ (ı 2 ) equals 1 P . In the special case where
M D M D 1 and ı D 1, interval (3.94) simplifies to the t interval (3.55).
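The computation implied by interval (3.94) is elementary. The following sketch is not part of the text; it assumes NumPy, uses hypothetical names, and takes the estimates, the matrix $C$, $\hat\sigma$, and the critical value $c_\gamma$ as given inputs.

    import numpy as np

    def simultaneous_intervals(tau_hat, C, sigma_hat, c_gamma, deltas):
        # End points of interval (3.94) for each coefficient vector delta in 'deltas':
        #   delta' tau_hat  minus/plus  (delta' C delta)^(1/2) * sigma_hat * c_gamma
        intervals = []
        for delta in deltas:
            delta = np.asarray(delta, dtype=float)
            center = float(delta @ tau_hat)
            half_width = float(np.sqrt(delta @ C @ delta)) * sigma_hat * c_gamma
            intervals.append((center - half_width, center + half_width))
        return intervals

Taking deltas to be the columns of $I_M$ gives intervals for $\tau_1, \ldots, \tau_M$ individually; the same $c_\gamma$ applies to any $\delta$ in the set $\Delta$ for which $c_\gamma$ was computed.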
Note [in connection with the random variable $\max_{\{\tilde\delta \in \tilde\Delta : \tilde\delta \neq 0\}} |\tilde\delta' t|/(\tilde\delta'\tilde\delta)^{1/2}$ and with the upper $100\gamma\%$ point $c_{\gamma}$ of its distribution] that for any $M^* \times 1$ vector $\dot t$,
$$\max_{\{\tilde\delta \in \tilde\Delta : \tilde\delta \neq 0\}} \frac{|\tilde\delta'\dot t|}{(\tilde\delta'\tilde\delta)^{1/2}} = \max\Bigl\{ -\min_{\{\tilde\delta \in \tilde\Delta : \tilde\delta \neq 0\}} \frac{\tilde\delta'\dot t}{(\tilde\delta'\tilde\delta)^{1/2}},\;\; \max_{\{\tilde\delta \in \tilde\Delta : \tilde\delta \neq 0\}} \frac{\tilde\delta'\dot t}{(\tilde\delta'\tilde\delta)^{1/2}} \Bigr\}, \qquad (3.96)$$
as can be readily verified.
Multiple comparisons: some relationships. Let $\alpha^{(0)}$ represent any particular value of $\alpha$, and consider [for $\tilde\delta \in \tilde\Delta$ and for an arbitrary choice of the set $\tilde A_{\tilde\delta}(z)$] the test of the null hypothesis $\tilde H_0^{(\tilde\delta)}\!: \tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}$ (versus the alternative hypothesis $\tilde H_1^{(\tilde\delta)}\!: \tilde\delta'\alpha \neq \tilde\delta'\alpha^{(0)}$) with critical region $\tilde C(\tilde\delta)$ defined as follows:
$$\tilde C(\tilde\delta) = \{z : \tilde\delta'\alpha^{(0)} \notin \tilde A_{\tilde\delta}(z)\}. \qquad (3.97)$$
When this test is used to test $\tilde H_0^{(\tilde\delta)}$ (versus $\tilde H_1^{(\tilde\delta)}$) for every $\tilde\delta \in \tilde\Delta$, the probability of one or more false rejections is (by definition)
$$\Pr\Bigl[z \in \textstyle\bigcup_{\{\tilde\delta \in \tilde\Delta \,:\, \tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}\}} \tilde C(\tilde\delta)\Bigr].$$
Clearly,
$$\Pr\Bigl[z \in \textstyle\bigcup_{\{\tilde\delta \in \tilde\Delta : \tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}\}} \tilde C(\tilde\delta)\Bigr] = 1 - \Pr\Bigl[z \notin \textstyle\bigcup_{\{\tilde\delta \in \tilde\Delta : \tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}\}} \tilde C(\tilde\delta)\Bigr]$$
$$= 1 - \Pr\bigl[\tilde\delta'\alpha \in \tilde A_{\tilde\delta}(z) \text{ for every } \tilde\delta \in \tilde\Delta \text{ such that } \tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}\bigr]$$
$$\le 1 - \Pr\bigl[\tilde\delta'\alpha \in \tilde A_{\tilde\delta}(z) \text{ for every } \tilde\delta \in \tilde\Delta\bigr],$$
with equality holding when $\tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}$ for every $\tilde\delta \in \tilde\Delta$. Thus, if the probability $\Pr[\tilde\delta'\alpha \in \tilde A_{\tilde\delta}(z) \text{ for every } \tilde\delta \in \tilde\Delta]$ of simultaneous coverage of the sets $\tilde A_{\tilde\delta}(z)$ ($\tilde\delta \in \tilde\Delta$) is greater than or equal to $1 - \gamma$, then
$$\Pr\Bigl[z \in \textstyle\bigcup_{\{\tilde\delta \in \tilde\Delta : \tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}\}} \tilde C(\tilde\delta)\Bigr] \le \gamma. \qquad (3.98)$$
Moreover, if the probability of simultaneous coverage of the sets $\tilde A_{\tilde\delta}(z)$ ($\tilde\delta \in \tilde\Delta$) equals $1 - \gamma$, then equality is attained in inequality (3.98) when $\tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}$ for every $\tilde\delta \in \tilde\Delta$.
Now, suppose that $\alpha^{(0)}$ is the unique value of $\alpha$ that satisfies condition (3.85) (i.e., the condition $\tau^{(0)} = W'\alpha^{(0)}$). And observe that (by definition) $\tilde\delta \in \tilde\Delta$ if and only if $\tilde\delta = W\delta$ for some $\delta \in \Delta$. Observe also [in light of result (3.8)] that for $\delta \in \Delta$ and $\tilde\delta = W\delta$,
$$\delta'\tau = \delta'W'\alpha = \tilde\delta'\alpha \quad\text{and}\quad \delta'\tau^{(0)} = \delta'W'\alpha^{(0)} = \tilde\delta'\alpha^{(0)},$$
in which case $H_0^{(\delta)}$ is equivalent to $\tilde H_0^{(\tilde\delta)}$ and $H_1^{(\delta)}$ to $\tilde H_1^{(\tilde\delta)}$.
The test of $H_0^{(\delta)}$ (versus $H_1^{(\delta)}$) with critical region $C(\delta)$ [defined (implicitly) by relationship (3.82)] is related to the test of $\tilde H_0^{(W\delta)}$ (versus $\tilde H_1^{(W\delta)}$) with critical region $\tilde C(W\delta)$ [defined by expression (3.97)]. For $\delta \in \Delta$,
$$C(\delta) = \{y : \delta'\tau^{(0)} \notin A_\delta(y)\} = \{y : \delta'W'\alpha^{(0)} \notin \tilde A_{W\delta}(O'y)\} = \{y : O'y \in \tilde C(W\delta)\}. \qquad (3.99)$$
And the probability of one or more false rejections is expressible as
$$\Pr\Bigl[y \in \textstyle\bigcup_{\{\delta \in \Delta : \delta'\tau = \delta'\tau^{(0)}\}} C(\delta)\Bigr] = \Pr\Bigl[z \in \textstyle\bigcup_{\{\tilde\delta \in \tilde\Delta : \tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}\}} \tilde C(\tilde\delta)\Bigr], \qquad (3.100)$$
as is evident upon observing that
$$\{\tilde\delta \in \tilde\Delta : \tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}\} = \{\tilde\delta : \tilde\delta = W\delta,\; \delta \in \Delta,\; \delta'\tau = \delta'\tau^{(0)}\}.$$
Moreover, if the probability of simultaneous coverage of the sets $A_\delta(y)$ ($\delta \in \Delta$) equals $1 - \gamma$, then
$$\Pr\Bigl[y \in \textstyle\bigcup_{\{\delta \in \Delta : \delta'\tau = \delta'\tau^{(0)}\}} C(\delta)\Bigr] \le \gamma, \qquad (3.101)$$
with equality holding when $\delta'\tau = \delta'\tau^{(0)}$ for every $\delta \in \Delta$.


Multiple comparisons: a general method. When (for $\tilde\delta \in \tilde\Delta$) the set $\tilde A_{\tilde\delta}$ is taken to be the interval (3.90) [in which case the set $A_\delta$ is the interval (3.94)], the critical region $\tilde C(\tilde\delta)$ [defined by equality (3.97)] and the critical region $C(\delta)$ [defined (implicitly) by relationship (3.82)] are expressible as follows:
$$\tilde C(\tilde\delta) = \{z : |\tilde\delta'(\hat\alpha - \alpha^{(0)})| > (\tilde\delta'\tilde\delta)^{1/2}\hat\sigma\, c_{\gamma}\} \qquad (3.102)$$
and
$$C(\delta) = \{y : |\delta'(\hat\tau - \tau^{(0)})| > (\delta'C\delta)^{1/2}\hat\sigma\, c_{\gamma}\}. \qquad (3.103)$$

And when (in addition) the distribution of the vector $e$ of residual effects in the G–M model or, more generally, the distribution of the vector $\begin{pmatrix} \hat\alpha - \alpha \\ d \end{pmatrix}$ is MVN or is some other absolutely continuous spherical distribution [in which case the probability of simultaneous coverage of the intervals $\tilde A_{\tilde\delta}(z)$ ($\tilde\delta \in \tilde\Delta$) and of the intervals $A_\delta(y)$ ($\delta \in \Delta$) is $1 - \gamma$],
$$\Pr\Bigl[z \in \textstyle\bigcup_{\{\tilde\delta \in \tilde\Delta : \tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}\}} \tilde C(\tilde\delta)\Bigr] \le \gamma, \qquad (3.104)$$
with equality holding when $\tilde\delta'\alpha = \tilde\delta'\alpha^{(0)}$ for every $\tilde\delta \in \tilde\Delta$, and
$$\Pr\Bigl[y \in \textstyle\bigcup_{\{\delta \in \Delta : \delta'\tau = \delta'\tau^{(0)}\}} C(\delta)\Bigr] \le \gamma, \qquad (3.105)$$
with equality holding when $\delta'\tau = \delta'\tau^{(0)}$ for every $\delta \in \Delta$.


The S method. By definition, $c_{\gamma}$ is the upper $100\gamma\%$ point of the distribution of the random variable $\max_{\{\tilde\delta \in \tilde\Delta : \tilde\delta \neq 0\}} |\tilde\delta' t|/(\tilde\delta'\tilde\delta)^{1/2}$ [where $t$ is an $M^* \times 1$ random vector that has an $MVt(N-P^*, I_{M^*})$ distribution]. The upper $100\gamma\%$ point $c_{\gamma}$ is such that
$$c_{\gamma} \le [M^* \bar F_{\gamma}(M^*, N-P^*)]^{1/2}, \qquad (3.106)$$
with equality holding if $\Delta = \mathbb{R}^M$ or, more generally, if $\tilde\Delta = \mathbb{R}^{M^*}$ or, still more generally, if
$$\Pr(t \in \dot\Delta) = 1, \qquad (3.107)$$
where $\dot\Delta$ is the set $\{\dot\delta \in \mathbb{R}^{M^*} : \exists \text{ a nonnull vector in } \tilde\Delta \text{ that is proportional to } \dot\delta\}$.
For purposes of verification, observe (in light of Theorem 2.4.21, i.e., the Cauchy–Schwarz inequality) that for any $M^* \times 1$ vector $\dot t$ and for any nonnull $M^* \times 1$ vector $\tilde\delta$,
$$\frac{|\tilde\delta'\dot t|}{(\tilde\delta'\tilde\delta)^{1/2}} \le (\dot t'\dot t)^{1/2}, \qquad (3.108)$$
with equality holding if and only if $\dot t = 0$ or $\tilde\delta = k\dot t$ for some (nonzero) scalar $k$. And it follows that
$$\max_{\{\tilde\delta \in \tilde\Delta : \tilde\delta \neq 0\}} \frac{|\tilde\delta'\dot t|}{(\tilde\delta'\tilde\delta)^{1/2}} \le (\dot t'\dot t)^{1/2},$$
with equality holding for $\dot t = 0$ and for every nonnull vector $\dot t$ for which there exists a nonnull vector $\tilde\delta \in \tilde\Delta$ such that $\tilde\delta = k\dot t$ for some scalar $k$. Observe also that $(t't)/M^* \sim SF(M^*, N-P^*)$ and hence that
$$\Pr\{(t't)^{1/2} > [M^* \bar F_{\gamma}(M^*, N-P^*)]^{1/2}\} = \Pr[(t't)/M^* > \bar F_{\gamma}(M^*, N-P^*)] = \gamma,$$
so that $[M^* \bar F_{\gamma}(M^*, N-P^*)]^{1/2}$ is the upper $100\gamma\%$ point of the distribution of $(t't)^{1/2}$. Thus, $c_{\gamma}$ satisfies inequality (3.106), and equality holds in inequality (3.106) if $\tilde\Delta$ satisfies condition (3.107).
Now, suppose (as before) that the distribution of the vector $e$ of residual effects in the G–M model or, more generally, the distribution of the vector $\begin{pmatrix} \hat\alpha - \alpha \\ d \end{pmatrix}$ is MVN or is some other absolutely continuous spherical distribution (with mean $0$ and variance-covariance matrix $\sigma^2 I$). Then, as a special case of result (3.95) (that where $\Delta = \mathbb{R}^M$), we have that
$$\Pr[\delta'\tau \in A_\delta(y) \text{ for every } \delta \in \mathbb{R}^M] = 1 - \gamma, \qquad (3.109)$$
where (for $\delta \in \mathbb{R}^M$)
$$A_\delta(y) = \bigl[\delta'\hat\tau - (\delta'C\delta)^{1/2}\hat\sigma\,[M^* \bar F_{\gamma}(M^*, N-P^*)]^{1/2},\;\; \delta'\hat\tau + (\delta'C\delta)^{1/2}\hat\sigma\,[M^* \bar F_{\gamma}(M^*, N-P^*)]^{1/2}\bigr]. \qquad (3.110)$$
And as a special case of result (3.105), we have that

TABLE 7.2. Value of $[M^* \bar F_{\gamma}(M^*, N-P^*)]^{1/2}$ for selected values of $M^*$, $N-P^*$, and $\gamma$.

                 N-P* = 10                  N-P* = 25                  N-P* = infinity
  M*    .01     .10     .50        .01     .10     .50        .01     .10     .50
   1    3.17    1.81    0.70       2.79    1.71    0.68       2.58    1.64    0.67
   2    3.89    2.42    1.22       3.34    2.25    1.19       3.03    2.15    1.18
   3    4.43    2.86    1.59       3.75    2.64    1.56       3.37    2.50    1.54
   4    4.90    3.23    1.90       4.09    2.96    1.86       3.64    2.79    1.83
   5    5.31    3.55    2.16       4.39    3.23    2.11       3.88    3.04    2.09
  10    6.96    4.82    3.16       5.59    4.32    3.10       4.82    4.00    3.06
  20    9.39    6.63    4.55       7.35    5.86    4.46       6.13    5.33    4.40
  40   12.91    9.23    6.49       9.91    8.07    6.36       7.98    7.20    6.27

(Within each block of three columns, the entries correspond to $\gamma = .01$, $.10$, and $.50$.)

 
$$\Pr\Bigl[y \in \textstyle\bigcup_{\{\delta \in \mathbb{R}^M : \delta'\tau = \delta'\tau^{(0)}\}} C(\delta)\Bigr] \le \gamma, \qquad (3.111)$$
where (for $\delta \in \mathbb{R}^M$)
$$C(\delta) = \{y : |\delta'(\hat\tau - \tau^{(0)})| > (\delta'C\delta)^{1/2}\hat\sigma\,[M^* \bar F_{\gamma}(M^*, N-P^*)]^{1/2}\}, \qquad (3.112)$$
and that equality holds in inequality (3.111) when $\delta'\tau = \delta'\tau^{(0)}$ for every $\delta \in \mathbb{R}^M$.
The use of the interval (3.110) as a means for obtaining a confidence set for ı 0  (for every
ı 2 RM ) and the use of the critical region (3.112) as a means for obtaining a test of H0.ı/ (versus
H1.ı/ ) (for every ı 2 RM ) are known as Scheffé’s method or simply as the S method. The S method
was proposed by Scheffé (1953, 1959).
The interval (3.110) (with end points of ı 0 O ˙ .ı 0 Cı/1=2 O ŒM FN P .M ; N P /1=2 ) is of length
2.ı Cı/1=2 O ŒM FN P .M ; N P /1=2, which is proportional to ŒMFN P .M ; N P /1=2 and depends
0

on M , N P , and P —more generally, the interval (3.94) (for a linear combination ı 0  such that
ı 2  and with end points of ı 0 O ˙ .ı 0 Cı/1=2 c O P ) is of length 2.ı 0 Cı/1=2 O c P . Note that when
M D 1, the interval (3.110) is identical to the 100.1 P /% confidence interval for ı 0  [D .ƒı/0ˇ]
obtained via a one-at-a-time approach by applying formula (3.55). Table 7.2 gives the value of the
factor ŒM FN P .M ; N P /1=2 for selected values of M , N P , and P . As is evident from the
tabulated values, this factor increases rapidly as M increases.
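For the case $\Delta = \mathbb{R}^M$ underlying the S method, the factor $[M^* \bar F_{\gamma}(M^*, N-P^*)]^{1/2}$ is available directly from any F-distribution routine. A minimal sketch (assuming SciPy; not part of the text):

    import numpy as np
    from scipy.stats import f

    def scheffe_factor(M_star, resid_df, gamma):
        # [M* * Fbar_gamma(M*, N - P*)]^(1/2), the upper 100*gamma% point of (t't)^(1/2);
        # f.ppf(1 - gamma, M_star, resid_df) is the upper-gamma point of the F distribution.
        return float(np.sqrt(M_star * f.ppf(1.0 - gamma, M_star, resid_df)))

For example, scheffe_factor(3, 10, 0.10) returns approximately 2.86, in agreement with the corresponding entry of Table 7.2.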
A connection. For ı 2 RM , let Aı or Aı .y/ represent the interval (3.110) (of ı 0 -values) associated
with the S method for obtaining confidence intervals [for ı 0  (ı 2 RM )] having a probability of
simultaneous coverage equal to 1 P . And denote by A or A.y/ the set (3.38) (of -values), which
as discussed earlier (in Part b) is a 100.1 P /% confidence set for the vector .
The sets Aı (ı 2 RM ) are related to the set A; their relationship is as follows: for every value
of y,
A D fP 2 C.ƒ0 / W ı 0P 2 Aı for every ı 2 RM g; (3.113)
so that [for P 2 C.ƒ0 /]
PrŒı 0P 2 Aı .y/ for every ı 2 RM  D PrŒP 2 A.y/: (3.114)
Moreover, relationship (3.113) implies that the sets C.ı/ (ı 2 RM ), where C.ı/ is the critical
.ı/
region (3.112) (for testing the null hypothesis H0 W ı 0  D ı 0  .0/ versus the alternative hypothesis
.ı/
H1 W ı 0  ¤ ı 0  .0/ ) associated with the S method of multiple comparisons, are related to the critical
region CF associated with the F test of the null hypothesis H0 W  D  .0/ versus the alternative
hypothesis H1 W  ¤  .0/ ; their relationship is as follows:
CF D fyP 2 RN W yP 2 C.ı/ for some ı 2 RM g; (3.115)

so that
PrŒy 2 C.ı/ for some ı 2 RM  D Pr.y 2 CF /: (3.116)
Let us verify relationship (3.113). Taking (for $\tilde\delta \in \mathbb{R}^{M^*}$)
$$\tilde A_{\tilde\delta} = \bigl[\tilde\delta'\hat\alpha - (\tilde\delta'\tilde\delta)^{1/2}\hat\sigma\,[M^* \bar F_{\gamma}(M^*, N-P^*)]^{1/2},\;\; \tilde\delta'\hat\alpha + (\tilde\delta'\tilde\delta)^{1/2}\hat\sigma\,[M^* \bar F_{\gamma}(M^*, N-P^*)]^{1/2}\bigr] \qquad (3.117)$$
and denoting by $\tilde A$ the set (3.38), the relationship (3.113) is [in light of results (3.39) and (3.92)] equivalent to the following relationship: for every value of $z$,
$$\tilde A = \{\dot\alpha \in \mathbb{R}^{M^*} : \tilde\delta'\dot\alpha \in \tilde A_{\tilde\delta} \text{ for every } \tilde\delta \in \mathbb{R}^{M^*}\}. \qquad (3.118)$$
Thus, it suffices to verify relationship (3.118).
Let r D ŒMO  FN P .M ; N P /1=2. And (letting ˛P represent an arbitrary value of ˛) observe
that
˛P 2 AQ , Œ.˛P ˛/ O 0 .˛P ˛/
O 1=2  r (3.119)
and that (for ıQ 2 RM )
ıQ 0 ˛P 2 AQıQ , jıQ 0 .˛P ˛/jO  .ıQ 0 ı/
Q 1=2 r: (3.120)
Q Then, upon applying result (2.4.10) (which is a special case of the
Now, suppose that ˛P 2 A.
Q that
Cauchy–Schwarz inequality) and result (3.119), we find (for every M  1 vector ı)
jıQ 0 .˛P ˛/j
O  .ıQ 0 ı/
Q 1=2 Œ.˛P ˛/ O 1=2  .ıQ 0 ı/
O 0 .˛P ˛/ Q 1=2 r;

implying [in light of result (3.120)] that ıQ 0 ˛P 2 AQıQ .


Conversely, suppose that ˛P … A. Q Then, in light of result (3.119), Œ.˛P ˛/
O 0 .˛P ˛/
O 1=2 > r. And
upon setting ıQ D ˛P ˛,O we find that
jıQ 0 .˛P ˛/j O D .ıQ 0 ı/
O 0 .˛P ˛/
O D .˛P ˛/ Q 1=2 Œ.˛P ˛/ O 1=2 > .ıQ 0 ı/
O 0 .˛P ˛/ Q 1=2 r;

thereby establishing [in light of result (3.120)] the existence of an M 1 vector ıQ such that ıQ 0 ˛P … AQıQ
and completing the verification of relationship (3.118).
In connection with relationship (3.113), it is worth noting that if the condition ı 0P 2 Aı is
satisfied by any particular nonnull vector ı in RM, then it is also satisfied by any vector ıP in RM
such that ƒıP D ƒı or such that ıP / ı.
An extension. Result (3.118) relates AQıQ (ıQ 2 RM ) to A, Q where AQ Q is the set (3.117) (of ıQ 0 ˛-values)
ı
and AQ is the set (3.38) (of ˛-values). This relationship can be extended to a broader class of sets.
Suppose that the set  of M  1 vectors is such that the set  Q D fıQ W ıDW
Q ı; ı2g contains
M linearly independent (M 1) vectors, say ıQ1 ; ıQ2 ; : : : ; ıQM . And for ıQ 2 , Q take AQ Q to be interval
ı
(3.90) [which, when  Q D RM, is identical to interval (3.117)], and take AQ to be the set of ˛-values
defined as follows:
jıQ 0 .˛O ˛/j
P
 
Q
A D ˛P W max  cP (3.121)
Q 
fı2 Q
Q W ı¤0g .ıQ 0 ı/
Q 1=2 O
or, equivalently,
AQ D ˛P W ıQ 0 ˛O .ıQ 0 ı/
Q 1=2 O c P  ıQ 0 ˛P  ıQ 0 ˛C
O .ıQ 0 ı/Q 1=2 O c P for every ıQ 2 
Q (3.122)
˚

—in the special case where  Q D RM, this set is identical to the set (3.38), as is evident from result
Q Q take
(3.118). Further, for ı … ,
AQ Q D fP 2 R1 W P D ıQ 0 ˛;
ı
Q
P ˛P 2 Ag; (3.123)
thereby extending the definition of AQıQ to every ıQ 2 RM. Then, clearly,

AQ D f˛P 2 RM W ıQ 0˛P 2 AQıQ for every ıQ 2 RM g: (3.124)


Q for AQ and AQ Q .z/ for AQ Q and applying result (3.88)], we find that
And [upon writing A.z/ ı ı

PrŒıQ 0 ˛ 2 AQıQ .z/ for every ıQ 2 RM  D PrŒ˛ 2 A.z/


Q D1 P; (3.125)
so that AQ is a 100.1 P /% confidence set for the vector ˛ and the probability of simultaneous coverage
of the sets AQıQ (ıQ 2 RM ) equals 1 P .
Results (3.124) and (3.125) can be reexpressed in terms of A.y/ and Aı .y/ (ı 2 RM ), where
A.y/ is a set of -values and Aı .y/ a set of ı 0-values defined as follows:
A.y/ D fP W P D W 0 ˛; Q 0 y/g;
P ˛P 2 A.O
or equivalently
A D fP 2C.ƒ0 / W ı 0 O .ı 0 Cı/1=2 O c P  ı 0 P  ı 0 O C .ı 0 Cı/1=2 O c P for every ı2g;
and
Aı .y/ D AQW ı .O 0 y/
—for ı 2 , Aı .y/ is interval (3.94). Upon applying result (3.124), we find that
A.y/ D fP 2 C.ƒ0 / W ı 0P 2 Aı .y/ for every ı 2 RM g: (3.126)
Moreover,
PrŒı 0 2 Aı .y/ for every ı 2 RM 
D PrŒ 2 A.y/ D PrŒW 0˛ 2 A.y/ D PrŒ˛ 2 A.z/
Q D1 P: (3.127)

Let us now specialize to the case where (for M linearly independent M  1 vectors
Qı1 ; ıQ2 ; : : : ; ıQM ) Q D fıQ1 ; ıQ2 ; : : : ; ıQM g. For i D 1; 2; : : : ; M , let  D ıQ 0 ˛ and O D ıQ 0 ˛.
i i i i O
And observe that corresponding to any M  1 vector ı, Q there exist (unique) scalars k1 ; k2 ; : : : ; kM
such that ıQ D M Q
i D1 ki ıi and hence such that
P 

ıQ 0 ˛ D M Q 0 O D M ki Oi :
i D1 ki i and ı ˛
P  P
i D1

Observe also that the set AQ [defined by expression (3.121) or (3.122)] is reexpressible as
AQ D f˛P W Oi .ıQi0 ıQi /1=2 O c P  ıQi0 ˛P  Oi C.ıQi0 ıQi /1=2 c
O P .i D 1; 2; : : : ; M /g: (3.128)
Moreover, for ki ¤ 0,
Oi .ıQi0 ıQi /1=2 c
O P  ıQi0 ˛P  Oi C.ıQi0 ıQi /1=2 O c P
O P  ki ıQi0 ˛P  ki Oi Cjki j.ıQi0 ıQi /1=2 c
, ki Oi jki j.ıQi0 ıQi /1=2 c O P

(i D 1; 2; : : : ; M ), so that the set AQıQ [defined for ıQ 2  Q by expression (3.90) and for ıQ …  Q by
Q
expression (3.123)] is expressible (for every ı 2 R ) as M

AQıQ D fP 2 R1 W M
P M
Q 0 Q 1=2 O c
P 
i D1 ki Oi i D1 jki j.ıi ıi / P
 P  M
P M
 Q 0 Q 1=2 c
O P g: (3.129)
P
i D1 ki Oi C i D1 jki j.ıi ıi /

Thus, it follows from result (3.125) that

Pr M
PM
O P  M
P M
Q 0 Q 1=2 c
P  P 
i D1 ki Oi i D1 jki j.ıi ıi / i D1 ki i  i D1 ki Oi
P M
C i D1 jki j.ıQi0 ıQi /1=2 O c P for all scalars k1 ; k2 ; : : : ; kM D 1 P : (3.130)


Suppose (for purposes of illustration) that M D 2 and that ıQ1 D .1; 0/0 and ıQ2 D .0; 1/0. Then,
O expression (3.128) for the set AQ [defined by expression
letting ˛O 1 and ˛O 2 represent the elements of ˛,
(3.121) or (3.122)] is reexpressible as
AQ D f˛P D .˛P 1 ; ˛P 2 /0 W ˛O i O P  ˛P i  ˛O i C O c P .i D 1; 2/g:
c (3.131)
FIGURE 7.2. Display of the sets (3.131) and (3.133) [in terms of the transformed coordinates $(\dot\alpha_1 - \hat\alpha_1)/\hat\sigma$ and $(\dot\alpha_2 - \hat\alpha_2)/\hat\sigma$] for the case where $\gamma = 0.10$ and $N - P^* = 10$; the set (3.131) is represented by the rectangular region and the set (3.133) by the circular region.

And expression (3.129) for the set AQıQ [defined for ıQ 2  Q by expression (3.90) and for ıQ …  Q by
expression (3.123)] is reexpressible as
AQıQ D fP 2 R1 W 2iD1 ki ˛O i O c P 2iD1 jki j  P  2iD1 ki ˛O i C O c P 2iD1 jki jg (3.132)
P P P P

[where ıQ D .k1 ; k2 /0 ]. By way of comparison, consider the set AQF (of ˛-values) given by expression
(3.38), which (in the case under consideration) is reexpressible as
f˛P D .˛P 1 ; ˛P 2 /0 W 2 .˛P i ˛O i /2  2 O 2FN P .2; N P /g; (3.133)
P
i D1

and the set (of ıQ 0 ˛-values) given by expression (3.117), which (in the present context) is reexpressible
as
P2
AQıQ D fP 2 R1 W O FN P .2; N P /1=2 .k12 Ck22 /1=2
O i Œ2
i D1 ki ˛
 P  2iD1 ki ˛O i C O Œ2FN P .2; N P /1=2 .k12 Ck22 /1=2 g: (3.134)
P

The two sets (3.131) and (3.133) of ˛-values are displayed in Figure 7.2 [in terms of the trans-
formed coordinates .˛P 1 ˛O 1 /=O and .˛P 2 ˛O 2 /=]
O for the case where P D 0:10 and N P D 10—in
N
this case c P D 2:193 and F P .2; N P / D 2:924. The set (3.131) is represented by the rectangular
region, and the set (3.133) by the circular region. For each of these two sets, the probability of
coverage is 1 P D 0:90.
Interval (3.132) is of length 2 c O FN P .2; N
O P .jk1 j C jk2 j/, and interval (3.134) of length 2 Œ2
1=2 2 2 1=2
P / .k1 Ck2 / . Suppose that k1 or k2 is nonzero, in which case the length of both intervals is
strictly positive—if both k1 and k2 were 0, both intervals would be of length 0. Further, let v represent
the ratio of the length of interval (3.132) to the length of interval (3.134), and let u D k12 =.k12 Ck22 /.
And observe that
cP jk1 j C jk2 j cP  1=2
u C .1 u/1=2 ;

vD D
Œ2FN .2; N P / .k1 Ck2 /
P
1=2 2 2 1=2 Œ2FN .2; N P /P
1=2

so that v can be regarded as a function of u. Observe also that 0 pu  1, and that as u increases
from 0 to 21 , u1=2 C .1 u/1=2 increases monotonically from 1 to 2 and that as u increases from
1 1=2
p
2 to 1, u C .1 u/1=2 decreases monotonically from 2 to 1.
In Figure 7.3, v is plotted as a function of u for the case where P D 0:10 and N P D 10.
When P D 0:10 and N P D 10, v D 0:907Œu1=2 C .1 u/1=2 , and we find that v > 1 if
FIGURE 7.3. Plot (represented by the solid line) of $v = c_{\gamma}\,[2\bar F_{\gamma}(2, N-P^*)]^{-1/2}\,[u^{1/2} + (1-u)^{1/2}]$ as a function of $u = k_1^2/(k_1^2 + k_2^2)$ for the case where $\gamma = 0.10$ and $N - P^* = 10$.

0:012 < u < 0:988, that v D 1 if u D 0:012 or u D 0:988, and that v < 1 if u < 0:012
or u > 0:988—u D 0:012 when jk1 =k2 j D 0:109 and u D 0:988 when jk2 =k1 j D 0:109 (or,
equivalently, when jk1 =k2 j D 9:145).
Computational issues. Consider the set $\tilde A_{\tilde\delta}$ given for $\tilde\delta \in \tilde\Delta$ by expression (3.90) and extended to $\tilde\delta \notin \tilde\Delta$ by expression (3.123). These expressions involve (either explicitly or implicitly) the upper $100\gamma\%$ point $c_{\gamma}$ of the distribution of the random variable $\max_{\{\tilde\delta \in \tilde\Delta : \tilde\delta \neq 0\}} |\tilde\delta' t|/(\tilde\delta'\tilde\delta)^{1/2}$. Let us consider the computation of $c_{\gamma}$.
In certain special cases, the distribution of $\max_{\{\tilde\delta \in \tilde\Delta : \tilde\delta \neq 0\}} |\tilde\delta' t|/(\tilde\delta'\tilde\delta)^{1/2}$ is sufficiently simple that the computation of $c_{\gamma}$ is relatively tractable. One such special case has already been considered. If $\tilde\Delta = \mathbb{R}^{M^*}$ or, more generally, if $\tilde\Delta$ is such that condition (3.107) is satisfied, then
$$c_{\gamma} = [M^* \bar F_{\gamma}(M^*, N-P^*)]^{1/2}.$$

A second special case where the computation of $c_{\gamma}$ is relatively tractable is that where, for some nonnull $M^* \times 1$ orthogonal vectors $\tilde\delta_1, \tilde\delta_2, \ldots, \tilde\delta_{M^*}$,
$$\tilde\Delta = \{\tilde\delta_1, \tilde\delta_2, \ldots, \tilde\delta_{M^*}\}. \qquad (3.135)$$
In that special case,
$$\max_{\{\tilde\delta \in \tilde\Delta : \tilde\delta \neq 0\}} \frac{|\tilde\delta' t|}{(\tilde\delta'\tilde\delta)^{1/2}} \;\sim\; \max(|t_1|, |t_2|, \ldots, |t_{M^*}|),$$
where $t_1, t_2, \ldots, t_{M^*}$ are the elements of the $M^*$-dimensional random vector $t$ [whose distribution is $MVt(N-P^*, I_{M^*})$]. The distribution of $\max(|t_1|, |t_2|, \ldots, |t_{M^*}|)$ is referred to as a Studentized maximum modulus distribution.
In the special case where $\tilde\Delta$ is of the form (3.135) (and where $M^* \ge 2$),
$$c_{\gamma} > [\bar F_{\gamma}(M^*, N-P^*)]^{1/2}, \qquad (3.136)$$
as is evident upon observing that
$$\max(|t_1|, |t_2|, \ldots, |t_{M^*}|) = [\max(t_1^2, t_2^2, \ldots, t_{M^*}^2)]^{1/2} > \Bigl(\textstyle\sum_{i=1}^{M^*} t_i^2 \big/ M^*\Bigr)^{1/2} \quad\text{(with probability 1)}$$
and that $\sum_{i=1}^{M^*} t_i^2 / M^* \sim SF(M^*, N-P^*)$.
A third special case where the computation of $c_{\gamma}$ is relatively tractable is that where, for some nonzero scalar $a$ and some $M^* \times 1$ orthonormal vectors $\dot\delta_1, \dot\delta_2, \ldots, \dot\delta_{M^*}$,
$$\tilde\Delta = \{a(\dot\delta_i - \dot\delta_j)\;\; (j \neq i = 1, 2, \ldots, M^*)\}. \qquad (3.137)$$

In that special case,
$$\max_{\{\tilde\delta \in \tilde\Delta : \tilde\delta \neq 0\}} \frac{|\tilde\delta' t|}{(\tilde\delta'\tilde\delta)^{1/2}} \;\sim\; 2^{-1/2}\,[\max(t_1, t_2, \ldots, t_{M^*}) - \min(t_1, t_2, \ldots, t_{M^*})],$$
where (as in the second special case) $t_1, t_2, \ldots, t_{M^*}$ are the elements of the random vector $t$ [whose distribution is $MVt(N-P^*, I_{M^*})$]. The distribution of the random variable $\max_i t_i - \min_j t_j$ is known as the distribution of the Studentized range.
The special cases where  Q D RM or where  Q is of the form (3.135) or of the form (3.137)
are among those considered by Scheffé (1959, chap. 3) in his discussion of multiple comparisons
and simultaneous confidence intervals. In the special case where  Q is of the form (3.137), the use of
0
interval (3.94) as a means for obtaining a confidence set for ı  (for every ı 2 ) or of the critical
region (3.103) as a means for obtaining a test of the null hypothesis H0.ı/ (versus the alternative
hypothesis H1.ı/ ) (for every ı 2 ) is synonymous with what is known as Tukey’s method or simply
as the T method.
There are other special cases where the computation of c P is relatively tractable, including some
that are encountered in making inferences about the points on a regression line or a response surface
and that are discussed in some detail by Liu (2011). In less tractable cases, resort can be made to
Monte Carlo methods. Repeated draws can be made from the distribution of the random variable
maxfı2 Q  Q
Q W ı¤0g
Q 1=2, and the results used to approximate c P to a high degree of accuracy;
jıQ 0 tj=.ıQ 0 ı/
methods for using the draws in this way are discussed by, for example, Edwards and Berry (1987)
and Liu (2011, app. A). To obtain a draw from the distribution of maxfı2 Q  Q
Q W ı¤0g
Q 1=2,
jıQ 0 tj=.ıQ 0 ı/
it suffices to obtain a draw, say tP , from the distribution of t [i.e., from the MV t.N P ; IM /
distribution], which is relatively easy, and to then find the maximum value of jıQ 0 tP j=.ıQ 0 ı/ Q 1=2 for
Qı 2 ,
Q the difficulty of which depends on the characteristics of the set . Q
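When the set is finite, each such maximization reduces to an enumeration. The following Monte Carlo sketch is not part of the text; it assumes NumPy, and the draws from the $MVt(N-P^*, I_{M^*})$ distribution are obtained by the standard normal-over-chi-square construction.

    import numpy as np

    def mc_c_gamma(Delta_tilde, resid_df, gamma, n_draws=100000, seed=0):
        # Approximate the upper 100*gamma% point of
        #   max over delta in Delta_tilde of |delta' t| / (delta' delta)^(1/2),
        # where t ~ MVt(resid_df, I).  Rows of Delta_tilde are the (nonnull) vectors delta.
        rng = np.random.default_rng(seed)
        D = np.asarray(Delta_tilde, dtype=float)
        norms = np.linalg.norm(D, axis=1)
        z = rng.standard_normal((n_draws, D.shape[1]))
        s = np.sqrt(rng.chisquare(resid_df, size=n_draws) / resid_df)
        t = z / s[:, None]                              # draws from MVt(resid_df, I)
        stat = np.max(np.abs(t @ D.T) / norms, axis=1)  # maximum over Delta_tilde for each draw
        return float(np.quantile(stat, 1.0 - gamma))

When the rows of Delta_tilde are orthogonal, as in the special case (3.135), this approximation should be close to the corresponding Studentized-maximum-modulus point; for three orthogonal vectors and 10 degrees of freedom with $\gamma = .10$, for instance, it should be near the value 2.410 quoted in Subsection d.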
For ıQ … , Q the set AQ Q is defined by expression (3.123). What is the nature of this set, and
ı
how might it be constructed? These issues were considered in the preceding part of the present
subsection for the special case where  Q consists of M linearly independent vectors. That special
case is relatively simple; in that special case, AQıQ takes (for every ıQ 2 RM ) the form of interval
(3.129), the end points of which are expressible in “closed” form.
Let us consider a more general case. Suppose that  Q consists of a finite number, say L ( M ),
Q Q Q
vectors ı1 ; ı2 ; : : : ; ıL . Suppose further that M of these vectors are linearly independent, and assume
(for convenience) that the vectors have been numbered or renumbered so that ıQ1 ; ıQ2 ; : : : ; ıQM are
linearly independent. And consider a setting where c P has been computed and used to determine
(for every ıQ 2 ) Q the end points of the interval AQ Q given by expression (3.90) and where we wish to
ı
construct (for ıQ … ) Q the set AQ Q given by expression (3.123).
ı
Let ıQ represent any M  1 vector such that ıQ … . Q And let P D ıQ 0 ˛,
P where ˛P is an M  1
vector, and regard P as a function (of ˛) P whose domain is restricted to the set AQ defined by expression
(3.122). The function P is linear, and its domain AQ is a closed, bounded, convex set. Accordingly,
P assumes a maximum value, say Pmax , and a minimum value, say Pmin , and the set AQıQ defined by
expression (3.123) is the interval
AQıQ D ŒPmin ; Pmax ;
with upper and lower end points Pmax and Pmin . In general, Pmax and Pmin must be determined by
numerical methods.
Corresponding to $\tilde\delta$ are scalars $k_1, k_2, \ldots, k_{M^*}$ such that $\tilde\delta = \sum_{i=1}^{M^*} k_i \tilde\delta_i$. Thus, letting (for $i = 1, 2, \ldots, M^*$) $x_i = (\tilde\delta_i'\tilde\delta_i)^{-1/2}\,\tilde\delta_i'(\dot\alpha - \hat\alpha)$, the quantity $\tilde\delta'\dot\alpha$ is reexpressible as
$$\tilde\delta'\dot\alpha = \tilde\delta'\hat\alpha + \tilde\delta'(\dot\alpha - \hat\alpha) = \tilde\delta'\hat\alpha + \sum_{i=1}^{M^*} k_i (\tilde\delta_i'\tilde\delta_i)^{1/2} x_i. \qquad (3.138)$$
Moreover, $\dot\alpha \in \tilde A$ [where $\tilde A$ is the set (3.121) or (3.122)] if and only if
$$-\hat\sigma\, c_{\gamma} \le (\tilde\delta_i'\tilde\delta_i)^{-1/2}\,\tilde\delta_i'(\dot\alpha - \hat\alpha) \le \hat\sigma\, c_{\gamma} \quad (i = 1, 2, \ldots, L). \qquad (3.139)$$
Clearly, the first $M^*$ of the $L$ inequalities (3.139) are reexpressible as
$$-\hat\sigma\, c_{\gamma} \le x_i \le \hat\sigma\, c_{\gamma} \quad (i = 1, 2, \ldots, M^*). \qquad (3.140)$$
And (for $j = 1, 2, \ldots, L - M^*$) there exist scalars $a_{j1}, a_{j2}, \ldots, a_{jM^*}$ such that $\tilde\delta_{M^*+j} = \sum_{i=1}^{M^*} a_{ji} \tilde\delta_i$, so that the last $L - M^*$ of the $L$ inequalities (3.139) are reexpressible as
$$-\hat\sigma\, c_{\gamma} \le (\tilde\delta_{M^*+j}'\tilde\delta_{M^*+j})^{-1/2} \sum_{i=1}^{M^*} a_{ji} (\tilde\delta_i'\tilde\delta_i)^{1/2} x_i \le \hat\sigma\, c_{\gamma} \quad (j = 1, 2, \ldots, L - M^*). \qquad (3.141)$$

We conclude that the problem of determining Pmax and Pmin is essentially that of determining the
maximum and minimum values of the quantity (3.138) with respect to x1 ; x2 ; : : : ; xM subject to the
constraints (3.140) and (3.141). The problem of maximizing or minimizing this quantity subject to
these constraints can be formulated as a linear programming problem, and its solution can be effected
by employing an algorithm for solving linear programming problems—refer, e.g., to Nocedal and
Wright (2006, chaps. 13 & 14).

d. An illustration
Let us illustrate various of the results of Subsections a, b, and c by using them to add to the results
obtained earlier (in Sections 7.1 and 7.2c) for the lettuce-yield data. Accordingly, let us take y to be
the 201 random vector whose observed value is the vector of lettuce yields. Further, let us adopt
the terminology and notation introduced in Section 7.1 along with those introduced in the present
section. And let us restrict attention to the case where y is assumed to follow either the second-order
or third-order model, that is, the G–M model obtained upon taking the function ı.u/ (that defines the
response surface) to be either the second-order polynomial (1.2) or the third-order polynomial (1.3)
(and taking u to be the 3-dimensional column vector whose elements u1 , u2 , and u3 represent the
transformed amounts of Cu, Mo, and Fe). In what follows, the distribution of the vector e of residual
effects (in the second- or third-order model) is taken to be N.0;  2 I/.
Second-order model versus the third-order model. The second-order model has considerable appeal;
it is relatively simple and relatively tractable. However, there may be a question as to whether the
second-order polynomial (1.2) provides an “adequate” approximation to the response surface over
the region of interest. A common way of addressing this question is to take the model to be the
third-order model and to attempt to determine whether the data are consistent with the hypothesis
that the coefficients of the third-order terms [i.e., the terms that appear in the third-order polynomial
(1.3) but not the second-order polynomial (1.2)] equal 0.
Accordingly, suppose that y follows the third-order model (in which case P D 20,
P D 15, and N P D 5). There are 10 third-order terms, the coefficients of which are
ˇ111 ; ˇ112 ; ˇ113 ; ˇ122 ; ˇ123 ; ˇ133 ; ˇ222 ; ˇ223 ; ˇ233 , and ˇ333 . Not all of these coefficients are es-
timable from the lettuce-yield data; only certain linear combinations are estimable. In fact, as
discussed in Section 7.2c, a linear combination of the coefficients of the third-order terms is es-
timable (from these data) if and only if it is expressible as a linear combination of 5 linearly in-
dependent estimable linear combinations, and among the choices for the 5 linearly independent
estimable linear combinations are the linear combinations 3:253ˇ111 1:779ˇ122 0:883ˇ133 ,
1:779ˇ112 3:253ˇ222C0:883ˇ233 , 1:554ˇ113C1:554ˇ223 3:168ˇ333 , 2:116ˇ123 , and 0:471ˇ333 .
Let  represent the 5-dimensional column vector with elements 1 D 3:253ˇ111 1:779ˇ122
0:883ˇ133 , 2 D 1:779ˇ112 3:253ˇ222 C 0:883ˇ233 , 3 D 1:554ˇ113 C 1:554ˇ223 3:168ˇ333 ,
4 D 2:116ˇ123 , and 5 D 0:471ˇ333 . And consider the null hypothesis H0 W  D  .0/ , where
 .0/ D 0 (and where the alternative hypothesis is H1 W  ¤  .0/ ). Clearly, H0 is testable.
In light of the results of Section 7.2c, the least squares estimator O of  equals
. 1:97; 4:22; 5:11; 2:05; 2:07/0, and var./ O D  2 I—by construction, the linear combinations
that form the elements of  are such that their least squares estimators are uncorrelated and have

standard errors equal to $\sigma$. Further, $\hat\sigma^2 = 10.53$ (so that each element of $\hat\tau$ has an estimated standard error of $\hat\sigma = 3.24$). And
$$F(0) = \frac{(\hat\tau - 0)'\,C^{-1}(\hat\tau - 0)}{M^*\,\hat\sigma^2} = \frac{\hat\tau'\hat\tau}{5\,\hat\sigma^2} = 1.07.$$
The size- P F test of H0 W  D 0 (versus H1 W  ¤ 0) consists of rejecting H0 if F .0/ > FN P .5; 5/,
and accepting H0 otherwise. The p-value of the F test of H0 (versus H1 ), which (by definition)
is the value of P such that F .0/ D FN P .5; 5/, equals 0:471. Thus, the size- P F test rejects H0 for
values of P larger than 0:471 and accepts H0 for values less than or equal to 0:471. This result is
more-or-less consistent with a hypothesis that the coefficients of the 10 third-order terms (of the
third-order model) equal 0. However, there is a caveat: the power of the test depends on the values of
the coefficients of the 10 third-order terms only through the values of the 5 linear combinations 1 ,
$\tau_2$, $\tau_3$, $\tau_4$, and $\tau_5$. The distribution of $F(0)$ (under both $H_0$ and $H_1$), from which the power function of the size-$\gamma$ F test of $H_0$ is determined, is $SF\bigl(5,\, 5,\, \sum_{i=1}^{5}\tau_i^2/\sigma^2\bigr)$.
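As a numerical check of the figures quoted above (a sketch only; it assumes SciPy and uses the magnitudes as printed, whose signs do not affect the statistic):

    import numpy as np
    from scipy.stats import f

    # Quantities quoted in the text for the third-order model: C = I, M* = 5, N - P* = 5,
    # sigma_hat^2 = 10.53, and least squares estimates with the magnitudes below
    # (the signs, which are immaterial for F(0), are not reproduced here).
    tau_hat_sq = np.array([1.97, 4.22, 5.11, 2.05, 2.07]) ** 2
    sigma2_hat = 10.53
    M_star, resid_df = 5, 5

    F0 = tau_hat_sq.sum() / (M_star * sigma2_hat)   # approximately 1.07
    p_value = f.sf(F0, M_star, resid_df)            # approximately 0.47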
Presence or absence of interactions. Among the stated objectives of the experimental study of
lettuce yield was that of “determining the importance of interactions among Cu, Mo, and Fe.” That
is, to what extent (if any) does the change in yield effected by a change in the level of one of these
three variables vary with the levels of the other two?
Suppose that y follows the second-order model, in which ˇ D .ˇ1 ; ˇ2 ; ˇ3 ; ˇ4 , ˇ11 ,
ˇ12 ; ˇ13 ; ˇ22 ; ˇ23 ; ˇ33 /0, P D P D 10, and N P D 10. And take  to be the 3-dimensional
column vector with elements 1 D ˇ12 , 2 D ˇ13 , and 3 D ˇ23 . Then, M D M D 3, and
 D ƒ0ˇ, where ƒ is the 103 matrix whose columns are the 6th, 7th, and 9 th columns of the 1010
identity matrix.
Consider the problem of obtaining a 100.1 P /% confidence set for the vector  and that of
obtaining confidence intervals for ˇ12 , ˇ13 , and ˇ23 (and possibly for linear combinations of ˇ12 ,
ˇ13 , and ˇ23 ) for which the probability of simultaneous coverage is 1 P . Consider also the problem
of obtaining a size- P test of the null hypothesis H0 W  D 0 (versus the alternative hypothesis
H1 W  ¤ 0) and that of testing whether each of the quantities ˇ12 , ˇ13 , and ˇ23 (and possibly each
of their linear combinations) equals 0 (and of doing so in such a way that the probability of one or
more false rejections equals P ).
Letting ˇO12 , ˇO13 , and ˇO23 represent the least squares estimators of ˇ12 , ˇ13 , and ˇ23 , respectively,
we find that ˇO12 D 1:31, ˇO13 D 1:29, and ˇO23 D 0:66, so that the least squares estimator O of 
equals . 1:31; 1:29; 0:66/0 —refer to Table 7.1. Further, var./ O D  2 C, and
C D ƒ0.X0 X/ ƒ D diag.0:125; 0:213; 0:213/:
And for the M M matrix T of full row rank such that C D T 0 T and for the M M matrix S
such that T S is orthogonal, take
1
T D diag.0:354; 0:462; 0:462/ and S D T D diag.2:83; 2:17; 2:17/;
in which case
˛ D S0  D .2:83ˇ12 ; 2:17ˇ13 ; 2:17ˇ23 /0 and ˛O D S0 O D . 3:72; 2:80; 1:43/
(and the M M matrix W such that ƒSW D ƒ equals T ).
The “F statistic” F .0/ for testing the null hypothesis H0 W  D 0 (versus the alternative hypothesis
H1 W  ¤ 0) is expressible as
$$F(0) = \frac{(\hat\beta_{12},\, \hat\beta_{13},\, \hat\beta_{23})\, C^{-1} (\hat\beta_{12},\, \hat\beta_{13},\, \hat\beta_{23})'}{3\,\hat\sigma^2};$$
its value is 0:726, which is “quite small.” Thus, if there is any “nonadditivity” in the effects of Cu,
Mo, and Fe on lettuce yield, it is not detectable from the results obtained by carrying out an F test
(on these data). Corresponding to the size- P F test is the 100.1 P /% ellipsoidal confidence set AF

(for the vector ) given by expression (3.41); it consists of the values of  D .ˇ12 ; ˇ13 ; ˇ23 /0 such
that
0:244.ˇ12 C 1:31/2 C 0:143.ˇ13 C 1:29/2 C 0:143.ˇ23 0:66/2  FN P .3; 10/: (3.142)
For P equal to :01, :10, and :50, the values of FN P .3; 10/ are 6:552, 2:728, and 0:845, respectively.
Confidence intervals for which the probability of simultaneous coverage is 1 P can be obtained
for ˇ12 , ˇ13 , and ˇ23 and all linear combinations of ˇ12 , ˇ13 , and ˇ23 by applying the S method. In
the special case where P D :10, the intervals obtained for ˇ12 , ˇ13 , and ˇ23 via the S method [using
formula (3.110)] are:
4:65  ˇ12  2:02; 5:66  ˇ13  3:07; and 3:70  ˇ23  5:02:
By way of comparison, the 90% one-at-a-time confidence intervals obtained for ˇ12 , ˇ13 , and ˇ23
[upon applying formula (3.55)] are:
3:43  ˇ12  0:80; 4:06  ˇ13  1:47; and 2:10  ˇ23  3:42:
Corresponding to the S method for obtaining (for all linear combinations of ˇ12 , ˇ13 , and ˇ23
including ˇ12 , ˇ13 , and ˇ23 themselves) confidence intervals for which the probability of simultane-
ous coverage is 1 P is the S method for obtaining for every linear combination of ˇ12 , ˇ13 , and ˇ23
a test of the null hypothesis that the linear combination equals 0 versus the alternative hypothesis
that it does not equal 0. The null hypothesis is either accepted or rejected depending on whether or
not 0 is a member of the confidence interval for that linear combination. The tests are such that the
probability of one or more false rejections is less than or equal to P .
The confidence intervals obtained for ˇ12 , ˇ13 , and ˇ23 via the S method [using formula (3.110)]
are “conservative.” The probability of simultaneous coverage for these three intervals is greater than
1 P , not equal to 1 P —there are values of y for which coverage is achieved by these three intervals
but for which coverage is not achieved by the intervals obtained [using formula (3.110)] for some
linear combinations of ˇ12 , ˇ13 , and ˇ23 .
Confidence intervals can be obtained for ˇ12 , ˇ13 , and ˇ23 for which the probability of simulta-
neous coverage equals 1 P . Letting ı1 , ı2 , and ı3 represent the columns of the 33 identity matrix,
take  D fı1 ; ı2 ; ı3 g, in which case  Q D fıQ1 ; ıQ2 ; ıQ3 g, where ıQ1 , ıQ2 , and ıQ3 are the columns of the
matrix W [which equals diag.0:354; 0:462; 0:462/]. Then, intervals can be obtained for ˇ12 , ˇ13 ,
and ˇ23 for which the probability of simultaneous coverage equals 1 P by using formula (3.90) or
(3.94). When  Q D fıQ1 ; ıQ2 ; ıQ3 g, 
Q is of the form (3.135) and, consequently, c P is the upper 100 P %
point of a Studentized maximum modulus distribution; specifically, it is the upper 100 P % point
of the distribution of max.jt1 j; jt2 j; jt3 j/, where t1 , t2 , and t3 are the elements of a 3-dimensional
random column vector whose distribution is MV t.10; I3 /. The value of c:10 is 2:410—refer, e.g.,
to Graybill (1976, p. 656). And [as obtained from formula (3.94)] confidence intervals for ˇ12 , ˇ13 ,
and ˇ23 with a probability of simultaneous coverage equal to :90 are:
$$-4.13 \le \beta_{12} \le 1.50, \qquad -4.97 \le \beta_{13} \le 2.38, \qquad\text{and}\qquad -3.01 \le \beta_{23} \le 4.33. \qquad (3.143)$$

The values of  whose elements (ˇ12 , ˇ13 , and ˇ23 ) satisfy the three inequalities (3.143) form
a 3-dimensional rectangular set A. The set A is a 90% confidence set for . It can be regarded as a
“competitor” to the 90% ellipsoidal confidence set for  defined (upon setting P D :10) by inequality
(3.142).
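The arithmetic behind the two sets of intervals can be summarized as follows (a sketch only, assuming SciPy; $\hat\sigma$ is not reported directly in the text, and the value used here is backed out from the product $\hat\sigma\, c_{.10} = 7.954$ quoted below, so treat it as an assumption):

    import numpy as np
    from scipy.stats import f

    beta_hat = np.array([-1.31, -1.29, 0.66])      # estimates of beta_12, beta_13, beta_23
    diag_C   = np.array([0.125, 0.213, 0.213])     # diagonal elements of C
    sigma_hat = 7.954 / 2.410                      # assumed: backed out from sigma_hat * c_.10 = 7.954

    scheffe_c = np.sqrt(3 * f.ppf(0.90, 3, 10))    # S-method critical value, about 2.86
    maxmod_c  = 2.410                              # Studentized maximum modulus value (3 variables, 10 d.f.)

    half_s  = np.sqrt(diag_C) * sigma_hat * scheffe_c   # half-widths of the S-method intervals
    half_mm = np.sqrt(diag_C) * sigma_hat * maxmod_c    # half-widths of the intervals (3.143)
    s_intervals  = np.c_[beta_hat - half_s,  beta_hat + half_s]
    mm_intervals = np.c_[beta_hat - half_mm, beta_hat + half_mm]

The resulting end points agree (to rounding) with the intervals displayed above, and the maximum-modulus intervals are the shorter ones for these three coefficients.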
Starting with the confidence intervals (3.143) for ˇ12 , ˇ13 , and ˇ23 , confidence intervals (with
the same probability of simultaneous coverage) can be obtained for every linear combination of ˇ12 ,
ˇ13 , and ˇ23 . Let ı represent an arbitrary 31 vector and denote by k1 , k2 , and k3 the elements of
ı, so that ı 0  D k1 ˇ12 C k2 ˇ13 C k3 ˇ23 . Further, take Aı D AQW ı [where AQıQ is the set defined for
ıQ 2 Q by expression (3.90) and for ıQ … Q by expression (3.123)]. Then,
PrŒı 0 2 Aı .y/ for every ı 2 R3  D 1 P:

And upon observing that W ı D k1 ıQ1 C k2 ıQ2 C k3 ıQ3 and making use of formula (3.129), we find
that Aı is the interval with end points
k1 ˇO12 Ck2 ˇO13 Ck3 ˇO23 ˙ 3iD1 jki j.ıi0 Cıi /1=2 O c P : (3.144)
P

When P D :10,
P3 0 1=2
i D1 jki j.ıi Cıi / O c P D 7:954.0:354jk1j C 0:462jk2 j C 0:462jk3j/:

The intervals obtained for all linear combinations of ˇ12 , ˇ13 , and ˇ23 by taking the end points
of each interval to be those given by expression (3.144) can be regarded as competitors to those
obtained by applying the S method. When only one of the 3 coefficients k1 , k2 , and k3 of the linear
combination k1 ˇ12 C k2 ˇ13 C k3 ˇ23 is nonzero, the interval with end points (3.144) is shorter than
the interval obtained by applying the S method.
Suppose however that $k_1$, $k_2$, and $k_3$ are such that for some nonzero scalar $k$, $k_i = k(\tilde\delta_i'\tilde\delta_i)^{-1/2}$ or, equivalently, $k_i = k(\delta_i'C\delta_i)^{-1/2}$ ($i = 1, 2, 3$). Then, the length of the interval with end points (3.144) is
$$2\sum_{i=1}^{3}|k_i|\,(\delta_i'C\delta_i)^{1/2}\hat\sigma\, c_{\gamma} = 6\,|k|\,\hat\sigma\, c_{\gamma} = 19.80\,|k|\,c_{\gamma}$$
and that of the interval obtained by applying the S method is
$$2(\delta'C\delta)^{1/2}\hat\sigma\,[M^*\bar F_{\gamma}(M^*, N-P^*)]^{1/2} = 2\Bigl(\sum_{i=1}^{3}k_i^2\,\delta_i'C\delta_i\Bigr)^{1/2}\hat\sigma\,[3\bar F_{\gamma}(3,10)]^{1/2}$$
$$= 2\,(3k^2)^{1/2}\,\hat\sigma\,[3\bar F_{\gamma}(3,10)]^{1/2} = 6\,|k|\,\hat\sigma\,[\bar F_{\gamma}(3,10)]^{1/2} = 19.80\,|k|\,[\bar F_{\gamma}(3,10)]^{1/2}.$$
As a special case of result (3.136) (that where $M^* = 3$ and $N - P^* = 10$), we have that $c_{\gamma}$ is greater than $[\bar F_{\gamma}(3, 10)]^{1/2}$; in particular,
$$c_{.10} = 2.410 > 1.652 = [\bar F_{.10}(3, 10)]^{1/2}.$$
Thus, when $k_i = k(\delta_i'C\delta_i)^{-1/2}$ ($i = 1, 2, 3$), the interval with end points (3.144) is lengthier than that obtained by applying the S method.

e. Some additional results on the generalized S method


Preliminary to a further discussion of the lettuce-yield data, it is convenient to introduce some
additional results on the generalized S method (for obtaining simultaneous confidence intervals or
sets and for making multiple comparisons). In what follows, let us adopt the notation employed in
Subsection c.
Reformulation of a maximization problem. When in computing the upper 100 P % point c P [of the
distribution of the random variable maxfı2 Q  Q
Q W ı¤0g jıQ 0 tj=.ıQ 0 ı/
Q 1=2 ] resort is made to Monte Carlo
methods, the value of max Q Q Q jıQ 0 tj=.ıQ 0 ı/
fı2 W ı¤0g
Q 1=2 must be computed for each of a large number
of values of t. As is evident upon recalling result (3.96) and upon observing that
minfı2
Q  Q
Q W ı¤0g ıQ 0 tP =.ıQ 0 ı/
Q 1=2 D max Q Q Q Q 0 P Q 0 Q 1=2;
fı2 W ı¤0g ı . t/=.ı ı/ (3.145)

the value of maxfı2 Q  Q


Q W ı¤0g jıQ 0 tj=.ıQ 0 ı/
Q 1=2 can be computed for t D tP by computing the value of
maxfı2 Q  Q
Q W ı¤0g ıQ 0 t=.ıQ 0 ı/
Q 1=2 for each of two values of t (namely, t D tP and t D tP ) and by then
selecting the larger of these two values.
The value of maxfı2 Q  Q
Q W ı¤0g ıQ 0 t=.ıQ 0 ı/
Q 1=2 can be obtained for any particular value of t from the
solution to a constrained nonlinear least squares problem. Letting tP represent an arbitrary value of t
(i.e., an arbitrary M 1 vector) and letting  represent an arbitrary scalar, consider the minimization
(with respect to ıQ and ) of the sum of squares

.tP Q 0 .tP
ı/ Q .D tP 0 tP
ı/ (3.146) Q
2ıQ 0 tP C 2 ıQ 0 ı/
subject to the constraints ıQ 2 
Q and   0. Suppose that tP is such that for some vector ıR 2 ,
Q ıR 0 tP >
0—when no such vector exists, maxfı2 Q  Q
Q W ı¤0g
Q 0P Q 0 Q 1=2
ı t =.ı ı/  0  maxfı2Q  Q
Q W ı¤0g ı . t/=.ıQ 0 ı/
Q 0 P Q 1=2,
and (subject to the constraints ıQ 2 Q and   0) the minimum value of the sum of squares (3.146)
is the value t t attained at  D 0. And let ıP and P represent any values of ıQ and  at which the sum
P 0P

of squares (3.146) attains its minimum value (for ıQ 2  Q and   0). Then,
ıP 0 tP .ıP 0 tP /2
P D >0 and .tP P ı/
P 0 .tP P ı/
P D tP 0 tP < tP 0 tP : (3.147)
ıP 0 ıP ıP 0 ıP
Further, ıQ 0 tP ıP 0 tP
max D > 0: (3.148)
Q 
fı2 Q
Q W ı¤0g .ıQ 0 ı/
Q 1=2 .ıP 0 ı/
P 1=2

Let us verify results (3.147) and (3.148). For any vector ıR in  Q such that ıR 0 tP ¤ 0 and for
R D ıR 0 tP =ıR 0 ı,
R
R0 P 2
.tP R ı/ R 0 .tP R ı/ R D tP 0 tP .ı t / < tP 0 tP : (3.149)
ıR 0 ıR
Thus, minfı; Q  W ı2 Q 0g .t
Q ; P ı/ Q 0 .tP ı/ Q is less than tP 0 tP , which is the value of .tP ı/ Q 0 .tP ı/
Q when
 D 0 or ıQ D 0. And it follows that P > 0 and ıP ¤ 0.
To establish that maxfı2 Q  Q
Q W ı¤0g ıQ 0 tP =.ıQ 0 ı/Q 1=2 D ıP 0 tP =.ıP 0 ı/ P 1=2, assume (for purposes of estab-
lishing a contradiction) the contrary, that is, assume that there exists a nonnull vector ıR 2  Q such
that
ıR 0 tP ıP 0 tP
> :
.ıR 0 ı/
R 1=2 .ıP 0 ı/
P 1=2
Then, letting R D .ıP 0 ı= P we find that
R 1=2 ,
P ıR 0 ı/
R ıR 0 tP > P ıP 0 tP and R 2 ıR 0 ıR D P 2 ıP 0 ı;
P
which implies that
.tP R ı/
R 0 .tP R ı/
R < .tP P ı/
P 0 .tP P ı/;
P
thereby establishing the desired contradiction.
The verification of result (3.148) is complete upon observing that (since, by supposition, there
exists a vector ıR 2  Q such that ıR 0 tP > 0) max Q Q Q Q 0 P Q 0 Q 1=2 > 0. Further, turning to result
fı2 W ı¤0g ı t =.ı ı/
P as is evident upon letting R D ıP 0 tP =.ıP 0 ı/
(3.147), P D ıP 0 tP =.ıP 0 ı/, P and upon observing that

.tP P ı/
P 0 .tP P ı/
P D .tP R ı/
P 0 .tP R ı/
P C .P R 2 ıP 0 ıP
/
and [in light of result (3.148)] that R > 0. To complete the verification of result (3.147), it remains
only to observe that ıP 0 tP ¤ 0 and to apply the special case of result (3.149) obtained upon setting
ıR D ıP (and implicitly R D ).
P
The constrained nonlinear least squares problem can be reformulated. Consider the minimization
(with respect to ı and ) of the sum of squares
W .ı/0 Œ tP W .ı/
ΠtP (3.150)
subject to the constraints ı 2  and   0. And let ıR and R represent any solution to this constrained
nonlinear least squares problem, that is, any values of ı and  at which the sum of squares (3.150)
attains its minimum value (for ı 2  and   0). Then, a solution ıP and P to the original constrained
nonlinear least squares problem, that is, values of ıQ and  that minimize the sum of squares (3.146)
Q and   0, can be obtained by taking ıP D W ıR and P D .
subject to the constraints ıQ 2  R

A variation. The computation of maxfı2Q  Q


Q W ı¤0g jıQ 0 tP j=.ıQ 0 ı/
Q 1=2 can be approached in a way that
differs somewhat from the preceding approach. Let P and ıP represent values of  and ıQ at which the

sum of squares .tP ı/ Q 0 .tP ı/


Q attains its minimum value subject to the constraint ıQ 2 . Q And
suppose that  Q contains one or more vectors that are not orthogonal (with respect to the usual inner
product) to tP —when every vector in  Q is orthogonal to tP , max Q Q Q Q 0 P Q 0 Q 1=2 D 0. Then,
fı2 W ı¤0g jı t j=.ı ı/
Pı ¤ 0—refer to result (3.149)—and P D ıP 0 tP =.ıP 0 ı/. P Moreover, for any nonnull vector ıQ in  Q and
Q 0P Q 0Q
for  D ı t =.ı ı/,
.ıQ 0 tP /2 .ıQ 0 tP /2
 
tP 0 tP .tP ı/ Q 0 .tP ı/ Q D tP 0 tP tP 0 tP D : (3.151)
ıQ 0 ıQ ıQ 0 ıQ
Thus,
.ıQ 0 tP /2 .ıP 0 tP /2
max D ; (3.152)
Q 
fı2 Q
Q W ı¤0g ıQ 0 ıQ ıP 0 ıP
since otherwise there would exist a nonnull value of ı, Q say ı, R D ıR 0 tP =.ıR 0 ı/,
R and a value of , namely,  R
P R R 0 P R R P P
such that .t ı/ .t ı/ < .t ı/ .t ı/.P 0 P P P
Upon observing that
j ıQ 0 tPj .ıQ 0 tP /2 1=2
 
max D max ; (3.153)
Q 
fı2 Q
Q W ı¤0g .ıQ 0 ı/
Q 1=2 Q 
fı2 Q
Q W ı¤0g ıQ 0 ıQ
upon applying result (3.152), and upon making use of the special case of result (3.151) obtained
P we find that
upon setting ıQ D ıP and (implicitly)  D ,
j ıQ 0 tPj
max D ŒtP 0 tP .tP P ı/
P 0 .tP P ı/
P 1=2: (3.154)
Q 
fı2 Q
Q W ı¤0g .ıQ 0 ı/
Q 1=2

Thus, maxfı2Q  Q
Q W ı¤0g
Q 1=2 can be computed as the square root of the difference between
jıQ 0 tj=.ıQ 0 ı/
the total sum of squares t t and the residual sum of squares .tP P ı/
P 0P P 0 .tP P ı/.
P Further,

min .tP Q 0 .tP


ı/ Q D
ı/ min ŒtP W .ı/0 ŒtP W .ı/; (3.155)
Q W ı2
fı; Q g
Q fı; W ı2g

so that the residual sum of squares obtained by minimizing .tP ı/ Q 0 .tP ı/
Q with respect to ıQ and 
Q Q
(subject to the constraint ı 2 ) is identical to that obtained by minimizing ŒtP W .ı/0 ŒtP W .ı/
with respect to ı and  (subject to the constraint ı 2 ).
Constant-width simultaneous confidence intervals. For ı 2 , Aı is (as defined in Part c) the
confidence interval (for ı 0 ) with end points ı 0 O ˙ .ı 0 Cı/1=2 O c P . The intervals Aı (ı 2 ) were
constructed in such a way that their probability of simultaneous coverage equals 1 P . Even when
the intervals corresponding to values of ı for which ƒı D 0 (i.e., values for which ı 0  D 0) are
excluded, these intervals are (aside from various special cases) not all of the same width. Clearly,
the width of the interval Aı is proportional to the standard error .ı 0 Cı/1=2  or estimated standard
error .ı 0 Cı/1=2 O of the least squares estimator ı 0 O of ı 0 .
Suppose that  is such that ƒı ¤ 0 for every ı 2 , and suppose that  Q (which is the set
fıQ W ıQ D W ı; ı 2 g) is such that for every M  1 vector tP , maxfı2 Q0 P
Q jı t j exists. Then,
Q g
confidence intervals can be obtained for the linear combinations ı 0 (ı 2 ) that have a probability
of simultaneous coverage equal to 1 P and that are all of the same width.
Analogous to result (3.87), which underlies the generalized S method, we find that
$$\max_{\tilde\delta \in \tilde\Delta} \hat\sigma^{-1}\,|\tilde\delta'(\hat\alpha - \alpha)| \;\sim\; \max_{\tilde\delta \in \tilde\Delta} |\tilde\delta' t|, \qquad (3.156)$$
where $t \sim MVt(N-P^*, I_{M^*})$. This result can be used to devise (for each $\tilde\delta \in \tilde\Delta$) an interval, say $\tilde A^*_{\tilde\delta}$, of values of $\tilde\delta'\alpha$ such that the probability of simultaneous coverage of the intervals $\tilde A^*_{\tilde\delta}$ ($\tilde\delta \in \tilde\Delta$), like that of the intervals $\tilde A_{\tilde\delta}$ ($\tilde\delta \in \tilde\Delta$), equals $1 - \gamma$. Letting $c^*_{\gamma}$ represent the upper $100\gamma\%$ point of the distribution of $\max_{\tilde\delta \in \tilde\Delta} |\tilde\delta' t|$, this interval is as follows:
$$\tilde A^*_{\tilde\delta} = \bigl[\tilde\delta'\hat\alpha - \hat\sigma\, c^*_{\gamma},\;\; \tilde\delta'\hat\alpha + \hat\sigma\, c^*_{\gamma}\bigr]. \qquad (3.157)$$
Now, for $\delta \in \Delta$, let $A^*_\delta = \tilde A^*_{W\delta}$, so that $A^*_\delta$ is an interval of values of $\delta'\tau$ expressible as follows:
$$A^*_\delta = \bigl[\delta'\hat\tau - \hat\sigma\, c^*_{\gamma},\;\; \delta'\hat\tau + \hat\sigma\, c^*_{\gamma}\bigr]. \qquad (3.158)$$
Like the intervals $A_\delta$ ($\delta \in \Delta$) associated with the generalized S method, the probability of simultaneous coverage of the intervals $A^*_\delta$ ($\delta \in \Delta$) equals $1 - \gamma$. Unlike the intervals $A_\delta$ ($\delta \in \Delta$), the intervals $A^*_\delta$ ($\delta \in \Delta$) are all of the same width; each of them is of width $2\hat\sigma\, c^*_{\gamma}$.

In practice, c P may have to be approximated by numerical means—refer, e.g., to Liu (2011,
chaps. 3 & 7). If necessary, c P can be approximated by adopting a Monte Carlo approach in which
repeated draws are made from the distribution of the random variable maxfı2 Q0
Q jı tj. Note that
Q g

maxfı2 Q0
Q jı tj D max maxfı2
Q0
Q ı t; maxfı2
Q0 (3.159)
 
Q g Q g Q ı . t/ ;
Q g

so that to compute the value of maxfı2 Q0


Q jı tj for any particular value of t, say t
Q g P , it suffices to
compute the value of maxfı2 Q 0 P
Q ı t for each of two values of t (namely, t D t and t D t
Q g P ) and to
then select the larger of these two values.
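For a finite set of vectors, the corresponding Monte Carlo computation of $c^*_{\gamma}$ differs from the earlier sketch for $c_{\gamma}$ only in the statistic being maximized, since no normalization by $(\tilde\delta'\tilde\delta)^{1/2}$ is involved. A minimal sketch, assuming NumPy (not part of the text):

    import numpy as np

    def mc_c_star(Delta_tilde, resid_df, gamma, n_draws=100000, seed=0):
        # Approximate the upper 100*gamma% point of max over delta in Delta_tilde of |delta' t|,
        # with t ~ MVt(resid_df, I); rows of Delta_tilde are the vectors delta.
        rng = np.random.default_rng(seed)
        D = np.asarray(Delta_tilde, dtype=float)
        z = rng.standard_normal((n_draws, D.shape[1]))
        s = np.sqrt(rng.chisquare(resid_df, size=n_draws) / resid_df)
        t = z / s[:, None]
        stat = np.max(np.abs(t @ D.T), axis=1)   # max of delta' t and delta' (-t), in the spirit of (3.159)
        return float(np.quantile(stat, 1.0 - gamma))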
Suppose that the set  is such that the set  Q contains M linearly independent (M  1) vectors,
say ı1 ; ı2 ; : : : ; ıM . Then, in much the same way that the intervals AQıQ (ıQ 2 )
Q Q Q Q define a set AQ of
˛-values—refer to expression (3.122)—and a set A of -values, the intervals AQQ (ıQ 2 ) Q define a
ı
set
AQ D ˛P W ıQ 0 ˛O c
O P  ıQ 0 ˛P  ıQ 0 ˛C
O O c P for every ıQ 2 
Q
˚

of ˛-values and a corresponding set A D fP W P D W 0 ˛; P ˛P 2 AQ g of -values. And analogous to


Q Q
the set AıQ defined for ı …  in terms of A by expression (3.123), we have the set AQQ defined in
Q Q
ı
terms of the set AQ as follows:
AQıQ D fP 2 R1 W P D ıQ 0 ˛;
P ˛P 2 AQ g: (3.160)
Further, analogous to the set Aı (of ı 0 -values) defined for every ı 2 RM by the relationship
Aı D AQW ı , we have the set Aı defined (for every ı 2 RM ) by the relationship Aı D AQW ı .
Two lemmas (on order statistics). Subsequently, use is made of the following two lemmas.

Lemma 7.3.4. Let X₁, X₂, …, X_K and X represent K + 1 statistically independent random variables, each of which has the same absolutely continuous distribution. Further, let α represent a scalar between 0 and 1, let R = (K + 1)(1 − α), and suppose that K and α are such that R is an integer. And denote by X_[1], X_[2], …, X_[K] the first through Kth order statistics of X₁, X₂, …, X_K (so that with probability one, X_[1] < X_[2] < ⋯ < X_[K]). Then,

    Pr(X > X_[R]) = α.

Proof. Denote by R′ the unique (with probability 1) integer such that X ranks R′th (in magnitude) among the K + 1 random variables X₁, X₂, …, X_K, X (so that X < X_[1], X_[R′−1] < X < X_[R′], or X > X_[K], depending on whether R′ = 1, 2 ≤ R′ ≤ K, or R′ = K + 1). Then, upon observing that Pr(R′ = k) = 1/(K + 1) for k = 1, 2, …, K + 1, we find that

    Pr(X > X_[R]) = Pr(R′ > R) = Σ_{k=R+1}^{K+1} Pr(R′ = k) = (K − R + 1)/(K + 1) = α.

Q.E.D.
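Lemma 7.3.4 lends itself to a quick simulation check. The following sketch (not part of the text) assumes Python with NumPy and uses the standard normal distribution as the common absolutely continuous distribution; the values K = 19 and α = .05 (so that R = 19) are chosen purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    K, alpha = 19, 0.05                      # R = (K+1)(1-alpha) = 19, an integer
    R = int(round((K + 1) * (1 - alpha)))

    n_rep, hits = 200_000, 0
    for _ in range(n_rep):
        x = rng.standard_normal(K)           # X_1, ..., X_K
        x_new = rng.standard_normal()        # X, independent of X_1, ..., X_K
        hits += x_new > np.sort(x)[R - 1]    # compare X with the R-th order statistic X_[R]

    print(hits / n_rep)                      # relative frequency; close to alpha

With these particular choices X_[R] is the largest of the K draws, and the event X > X_[R] occurs exactly when the extra draw is the largest of all K + 1 draws, which has probability 1/(K + 1) = .05.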
Lemma 7.3.5. Let X₁, X₂, …, X_K represent K statistically independent random variables. Further, suppose that X_k ∼ X (k = 1, 2, …, K) for some random variable X whose distribution is absolutely continuous with a cdf G(·) that is strictly increasing over some finite or infinite interval I for which Pr(X ∈ I) = 1. And denote by X_[1], X_[2], …, X_[K] the first through Kth order statistics of X₁, X₂, …, X_K. Then, for any integer R between 1 and K, inclusive,

    G(X_[R]) ∼ Be(R, K − R + 1).

Proof. Let U_k = G(X_k) (k = 1, 2, …, K). Then, U₁, U₂, …, U_K are statistically independent random variables, and each of them is distributed uniformly on the interval (0, 1); refer, e.g., to Theorems 4.3.5 and 2.1.10 of Casella and Berger (2002).

Now, denote by U_[1], U_[2], …, U_[K] the first through Kth order statistics of U₁, U₂, …, U_K. And observe that (with probability 1)

    U_[k] = G(X_[k])   (k = 1, 2, …, K).

Thus, for any scalar u between 0 and 1 (and for R = 1, 2, …, K),

    Pr[G(X_[R]) ≤ u] = Pr(U_[R] ≤ u) = Σ_{k=R}^{K} (K choose k) u^k (1 − u)^{K−k},

so that the distribution of G(X_[R]) is an absolutely continuous distribution with a pdf f(·), where

    f(u) = d[Σ_{k=R}^{K} (K choose k) u^k (1 − u)^{K−k}]/du   if 0 < u < 1,  and  f(u) = 0 otherwise.   (3.161)

Moreover, for 0 < u < 1,

    d[Σ_{k=R}^{K} (K choose k) u^k (1 − u)^{K−k}]/du = [K!/((R − 1)!(K − R)!)] u^{R−1}(1 − u)^{K−R},   (3.162)

as can be readily verified via a series of steps that are essentially the same as those taken by Casella and Berger (2002) in completing the proof of their Theorem 5.4.4. Based on result (3.162), we conclude that f(·) is the pdf of a Be(R, K − R + 1) distribution and hence that G(X_[R]) ∼ Be(R, K − R + 1). Q.E.D.
Monte Carlo approximation. For purposes of devising a Monte Carlo approximation to c_γ̇ or c*_γ̇, let X = max_{δ̃∈Δ̃ : δ̃≠0} |δ̃'t|/(δ̃'δ̃)^{1/2} (in the case of c_γ̇) or X = max_{δ̃∈Δ̃} |δ̃'t| (in the case of c*_γ̇). Further, let X₁, X₂, …, X_K represent K statistically independent random variables, each of which has the same distribution as X, let R = (K + 1)(1 − γ̇), suppose that K and γ̇ are such that R is an integer (which implies that γ̇ is a rational number), and denote by X_[1], X_[2], …, X_[K] the first through Kth order statistics of X₁, X₂, …, X_K. And observe that (as a consequence of Lemma 7.3.4)

    Pr(X > X_[R]) = γ̇.                                                                       (3.163)

To obtain a Monte Carlo approximation to c_γ̇ or c*_γ̇, make K draws, say x₁, x₂, …, x_K, from the distribution of X. And letting x_[1], x_[2], …, x_[K] represent the values of X obtained upon rearranging x₁, x₂, …, x_K in increasing order from smallest to largest, take the Monte Carlo approximation to be as follows:

    c_γ̇ ≐ x_[R]   or   c*_γ̇ ≐ x_[R].                                                        (3.164)
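For concreteness, here is a minimal sketch (not part of the text) of the approximation (3.164) for c*_γ̇ in Python with NumPy. It assumes that Δ̃ is a finite set whose vectors are supplied as the columns of an array Dtil, that t is generated through the usual representation of an MVt(ν, I) vector as a standard normal vector divided by an independent (χ²(ν)/ν)^{1/2}, and that K and γ̇ satisfy the integrality condition on R; the function name and the illustrative inputs are hypothetical.

    import numpy as np

    def mc_c_star(Dtil, df, gamma_dot, K, seed=0):
        # R-th order statistic of K draws of X = max_j |delta_j' t|,
        # where t ~ MVt(df, I) and the delta_j are the columns of Dtil
        rng = np.random.default_rng(seed)
        R = (K + 1) * (1 - gamma_dot)
        assert abs(R - round(R)) < 1e-6, "(K+1)(1-gamma_dot) must be an integer"
        R = int(round(R))
        M = Dtil.shape[0]
        draws = np.empty(K)
        for i in range(K):
            t = rng.standard_normal(M) / np.sqrt(rng.chisquare(df) / df)
            draws[i] = np.max(np.abs(Dtil.T @ t))
        return np.sort(draws)[R - 1]              # x_[R]

    # three linear combinations in two dimensions, 10 df, gamma_dot = .10, K = 999
    Dtil = np.array([[1.0, 0.0, 1.0],
                     [0.0, 1.0, 1.0]])
    print(mc_c_star(Dtil, df=10, gamma_dot=0.10, K=999))

The same recipe applies to c_γ̇ upon replacing |δ̃'t| with |δ̃'t|/(δ̃'δ̃)^{1/2} (and, when Δ̃ is not a finite set, upon replacing the maximum over the columns of Dtil with whatever maximization the set Δ̃ calls for).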
Clearly, x₁, x₂, …, x_K can be regarded as realizations of the random variables X₁, X₂, …, X_K (and x_[1], x_[2], …, x_[K] as realizations of the random variables X_[1], X_[2], …, X_[K]). Conceivably (and conceptually), the realizations x₁, x₂, …, x_K of X₁, X₂, …, X_K could be included (along with the realizations of the elements of y) in what are regarded as the data. Then, in repeated sampling from the joint distribution of X₁, X₂, …, X_K and y, the probability of simultaneous coverage of the intervals A_δ (δ ∈ Δ) or A*_δ (δ ∈ Δ) when X_[R] is substituted for c_γ̇ or c*_γ̇ is (exactly) 1 − γ̇, as is evident from result (3.163).

When x_[R] is substituted for c_γ̇ or c*_γ̇ and when the repeated sampling is restricted to the distribution of y, the probability of simultaneous coverage of the intervals A_δ (δ ∈ Δ) or A*_δ (δ ∈ Δ) is G(x_[R]), where G(·) is the cdf of the random variable X. The difference between this probability and the specified probability is

    G(x_[R]) − (1 − γ̇).

This difference can be regarded as the realization of the random variable
    G(X_[R]) − (1 − γ̇).

The number K of draws from the distribution of X can be chosen so that for some specified “tolerance” ε > 0 and some specified probability ω,

    Pr[ |G(X_[R]) − (1 − γ̇)| ≤ ε ] ≥ ω.                                                      (3.165)

As an application of Lemma 7.3.5, we have that

    G(X_[R]) ∼ Be[(K + 1)(1 − γ̇), (K + 1)γ̇].

Accordingly,

    E[G(X_[R])] = 1 − γ̇   and   var[G(X_[R])] = γ̇(1 − γ̇)/(K + 2),

as can be readily verified; refer to Exercise 6.4. Edwards and Berry (1987, p. 915) proposed (for purposes of deciding on a value for K and implicitly for R) an implementation of the criterion (3.165) in which the distribution of G(X_[R]) − (1 − γ̇) is approximated by an N[0, γ̇(1 − γ̇)/(K + 2)] distribution; as K → ∞, the pdf of the standardized random variable [G(X_[R]) − (1 − γ̇)]/[γ̇(1 − γ̇)/(K + 2)]^{1/2} tends to the pdf of the N(0, 1) distribution (Johnson, Kotz, and Balakrishnan 1995, p. 240). When this implementation is adopted, the criterion (3.165) is replaced by the much simpler criterion

    K ≥ γ̇(1 − γ̇)(z_{(1−ω)/2}/ε)² − 2,                                                        (3.166)

where (for any scalar α between 0 and 1) z_α is the upper 100α% point of the N(0, 1) distribution. And the number K of draws is chosen to be such that inequality (3.166) is satisfied and such that (K + 1)(1 − γ̇) is an integer.

Table 1 of Edwards and Berry gives choices for K that would be suitable if γ̇ were taken to be .10, .05, or .01, ε were taken to be .01, .005, .002, or .001, and ω were taken to be .99; the entries in their Table 1 are for K + 1. For example, if γ̇ were taken to be .10, ε to be .001, and ω to be .99 [in which case z_{(1−ω)/2} = 2.5758 and the right side of inequality (3.166) equals 597125], we could take K = 599999 [in which case R = 600000 × .90 = 540000].
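Criterion (3.166) is simple enough to automate. The following sketch (an illustration, not part of the text) assumes Python with SciPy; it returns the smallest K that satisfies (3.166) and makes (K + 1)(1 − γ̇) an integer, which for the example above is somewhat smaller than the round value K = 599999 used in the text.

    from math import ceil
    from scipy.stats import norm

    def choose_K(gamma_dot, eps, omega):
        # smallest K satisfying (3.166) with (K+1)(1-gamma_dot) an integer
        z = norm.ppf(1 - (1 - omega) / 2)          # upper 100[(1-omega)/2]% point of N(0,1)
        K = int(ceil(gamma_dot * (1 - gamma_dot) * (z / eps) ** 2 - 2))
        while abs((K + 1) * (1 - gamma_dot) - round((K + 1) * (1 - gamma_dot))) > 1e-6:
            K += 1
        return K

    print(choose_K(0.10, 0.001, 0.99))   # 597139; any larger admissible K (e.g., 599999) also works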

f. Confidence bands (as applied to the lettuce-yield data and in general)


Let us revisit the situation considered in Section 7.1 (and considered further in Subsection d of the present section), adopting the notation and terminology introduced therein. Accordingly, u = (u₁, u₂, …, u_C)' is a column vector of C explanatory variables u₁, u₂, …, u_C. And δ(u) is a function of u that defines a response surface and that is assumed to be of the form

    δ(u) = Σ_{j=1}^{P} β_j x_j(u),                                                           (3.167)

where x_j(u) (j = 1, 2, …, P) are specified functions of u (and where β₁, β₂, …, β_P are unknown parameters). Further, the data are to be regarded as the realizations of the elements of an N-dimensional observable random column vector y = (y₁, y₂, …, y_N)' that follows a G–M model with model matrix X whose ijth element is (for i = 1, 2, …, N and j = 1, 2, …, P) x_j(u_i), where u₁, u₂, …, u_N are the values of u corresponding to the first through Nth data points. Note that δ(u) is reexpressible in the form

    δ(u) = [x(u)]'β,                                                                         (3.168)

where x(u) = [x₁(u), x₂(u), …, x_P(u)]' [and where β = (β₁, β₂, …, β_P)']. In the case of the lettuce-yield data, N = 20, C = 3, u₁, u₂, and u₃ represent transformed amounts of Cu, Mo, and Fe, respectively, and among the choices for the function δ(u) are the first-, second-, and third-order polynomials (1.1), (1.2), and (1.3).

We may wish to make inferences about δ(u) for some or all values of u. Assume that rank X = P, in which case all of the elements β₁, β₂, …, β_P of β are estimable, and let β̂ = (X'X)⁻¹X'y, which is the least squares estimator of β. Further, let δ̂(u) = [x(u)]'β̂. Then, for any particular value of u,
δ(u) is estimable, δ̂(u) is the least squares estimator of δ(u), var[δ̂(u)] = σ²[x(u)]'(X'X)⁻¹x(u), and var[δ̂(u)] is estimated unbiasedly by σ̂²[x(u)]'(X'X)⁻¹x(u) (where σ̂² is the usual unbiased estimator of σ²). In the case of the lettuce-yield data, the assumption that rank X = P is satisfied when δ(u) is taken to be the second-order polynomial (1.2) [though not when it is taken to be the third-order polynomial (1.3)], and [when δ(u) is taken to be the second-order polynomial (1.2)] δ̂(u) is as depicted in Figure 7.1.

Suppose that inferences are to be made about δ(u) for every value of u in some subspace U of C-dimensional space [assuming that (at least for u ∈ U) δ(u) is of the form (3.167) or (3.168) and x(u) is nonnull]. In addition to obtaining a point estimate of δ(u) for every u ∈ U, we may wish to obtain a confidence interval for every such u. With that in mind, assume that the distribution of the vector e of residual effects in the G–M model is MVN or is some other absolutely continuous spherical distribution (with mean 0 and variance-covariance matrix σ²I).

Some alternative procedures. As a one-at-a-time 100(1 − γ̇)% confidence interval for the value of δ(u) corresponding to any particular value of u, we have the interval with end points

    δ̂(u) ± {[x(u)]'(X'X)⁻¹x(u)}^{1/2} σ̂ t̄_{γ̇/2}(N − P)                                       (3.169)

(refer to result (3.55)). When an interval, say interval I_u^(1)(y), is obtained for every u ∈ U by taking the end points of the interval to be those given by expression (3.169), the probability of simultaneous coverage Pr[δ(u) ∈ I_u^(1)(y) for every u ∈ U] is less than 1 − γ̇; typically, it is much less than 1 − γ̇.

At the opposite extreme from interval I_u^(1)(y) is the interval, say interval I_u^(2)(y), with end points

    δ̂(u) ± {[x(u)]'(X'X)⁻¹x(u)}^{1/2} σ̂ [P F̄_γ̇(P, N − P)]^{1/2}.                             (3.170)

This interval is that obtained {for the linear combination [x(u)]'β} when the (ordinary) S method is used to obtain confidence intervals for every linear combination of β₁, β₂, …, β_P such that the probability of simultaneous coverage of the entire collection of intervals equals 1 − γ̇. When attention is restricted to the confidence intervals for those of the linear combinations that are expressible in the form [x(u)]'β for some u ∈ U, the probability of simultaneous coverage Pr[δ(u) ∈ I_u^(2)(y) for every u ∈ U] is greater than or equal to 1 − γ̇. In fact, aside from special cases like those where every linear combination of β₁, β₂, …, β_P is expressible in the form [x(u)]'β for some u ∈ U, the intervals I_u^(2)(y) (u ∈ U) are conservative; that is, the probability of simultaneous coverage of these intervals exceeds 1 − γ̇. The extent to which these intervals are conservative depends on the space U as well as on the functional form of the elements of the vector x(u); they are less conservative when U = R^C than when U is a proper subset of R^C.

Intermediate to intervals I_u^(1)(y) and I_u^(2)(y) is the interval, say interval I_u^(3)(y), with end points

    δ̂(u) ± {[x(u)]'(X'X)⁻¹x(u)}^{1/2} σ̂ c_γ̇,                                                 (3.171)

where [letting W represent any P × P matrix such that (X'X)⁻¹ = W'W and letting t represent a P × 1 random vector that has an MVt(N − P, I_P) distribution] c_γ̇ is the upper 100γ̇% point of the distribution of the random variable

    max_{u∈U} |[x(u)]'W't| / {[x(u)]'(X'X)⁻¹x(u)}^{1/2}.

The collection of intervals I_u^(3)(y) (u ∈ U) is such that the probability of simultaneous coverage Pr[δ(u) ∈ I_u^(3)(y) for every u ∈ U] equals 1 − γ̇.

As u ranges over the space U, the lower and upper end points of interval I_u^(3)(y) form surfaces that define what is customarily referred to as a confidence band. The probability of this band covering the true response surface (in its entirety) equals 1 − γ̇. The lower and upper end points of intervals I_u^(1)(y) and I_u^(2)(y) also form surfaces that define confidence bands; the first of these confidence bands has a probability of coverage that is typically much less than 1 − γ̇, and the second is typically quite conservative (i.e., has a probability of coverage that is typically considerably greater than 1 − γ̇).

The width of the confidence band defined by the surfaces formed by the end points of interval I_u^(3)(y) [or by the end points of interval I_u^(1)(y) or I_u^(2)(y)] varies from one point in the space U to another. Suppose that U is such that for every P × 1 vector ṫ, max_{u∈U} |[x(u)]'W'ṫ| exists [as would be the case if the functions x₁(u), x₂(u), …, x_P(u) are continuous and the set U is closed and bounded]. Then, as an alternative to interval I_u^(3)(y) with end points (3.171), we have the interval, say interval I_u^(4)(y), with end points

    δ̂(u) ± σ̂ c*_γ̇,                                                                           (3.172)

where c*_γ̇ is the upper 100γ̇% point of the distribution of the random variable max_{u∈U} |[x(u)]'W't|. Like the end points of interval I_u^(3)(y), the end points of interval I_u^(4)(y) form surfaces that define a confidence band having a probability of coverage equal to 1 − γ̇. Unlike the confidence band formed by the end points of interval I_u^(3)(y), this confidence band is of uniform width (equal to 2σ̂c*_γ̇).

The end points (3.171) of interval I_u^(3)(y) depend on c_γ̇, and the end points (3.172) of interval I_u^(4)(y) depend on c*_γ̇. Except for relatively simple special cases, c_γ̇ and c*_γ̇ have to be replaced by approximations obtained via Monte Carlo methods. The computation of these approximations requires (in the case of c_γ̇) the computation of max_{u∈U} |[x(u)]'W't| / {[x(u)]'(X'X)⁻¹x(u)}^{1/2} and (in the case of c*_γ̇) the computation of max_{u∈U} |[x(u)]'W't| for each of a large number of values of t.

In light of results (3.154) and (3.155), we find that for any value, say ṫ, of t, max_{u∈U} |[x(u)]'W'ṫ| / {[x(u)]'(X'X)⁻¹x(u)}^{1/2} can be determined by finding values, say λ̇ and u̇, of the scalar λ and the vector u that minimize the sum of squares

    [ṫ − λW x(u)]'[ṫ − λW x(u)],

subject to the constraint u ∈ U, and by then observing that

    max_{u∈U} |[x(u)]'W'ṫ| / {[x(u)]'(X'X)⁻¹x(u)}^{1/2} = {ṫ'ṫ − [ṫ − λ̇W x(u̇)]'[ṫ − λ̇W x(u̇)]}^{1/2}.   (3.173)

And max_{u∈U} |[x(u)]'W'ṫ| can be determined by finding the maximum values of [x(u)]'W'ṫ and [x(u)]'W'(−ṫ) with respect to u (subject to the constraint u ∈ U) and by then observing that max_{u∈U} |[x(u)]'W'ṫ| equals the larger of these two values.
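As a concrete (and deliberately crude) illustration of the inner maximization, the following sketch, which is not part of the text and assumes Python with NumPy, approximates max_{u∈U} |[x(u)]'W'ṫ| over the rectangle U = {u : |u_i| ≤ 1 (i = 1, 2, 3)} by evaluating the two one-sided maxima on a regular grid; the basis x_quadratic (the full second-order polynomial in three variables) and the grid resolution are illustrative assumptions, and a constrained optimizer could be substituted for the grid.

    import itertools
    import numpy as np

    def x_quadratic(u):
        # second-order polynomial basis in u = (u1, u2, u3); P = 10 terms
        u1, u2, u3 = u
        return np.array([1.0, u1, u2, u3, u1*u2, u1*u3, u2*u3,
                         u1**2, u2**2, u3**2])

    def max_abs_over_U(t_dot, W, x_fun=x_quadratic, n_grid=21):
        # approximate max over {u : |u_i| <= 1} of |x(u)'W't_dot| as the larger
        # of the maxima of x(u)'W't_dot and x(u)'W'(-t_dot) over a grid
        w_t = W.T @ t_dot                          # W't_dot
        grid = np.linspace(-1.0, 1.0, n_grid)
        best = -np.inf
        for u in itertools.product(grid, repeat=3):
            v = x_fun(np.array(u)) @ w_t           # x(u)'W't_dot
            best = max(best, v, -v)
        return best

Feeding many independent draws ṫ of t (generated as in the sketch following (3.164), with ν = N − P degrees of freedom and dimension P) through max_abs_over_U and taking the Rth order statistic of the resulting values then gives a Monte Carlo approximation to c*_γ̇; the normalized maximization needed for c_γ̇ is handled analogously after dividing by {[x(u)]'(X'X)⁻¹x(u)}^{1/2}, or via the least squares characterization (3.173).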
An illustration. Let us illustrate the four alternative procedures for constructing confidence bands by applying them to the lettuce-yield data. In this application (which adds to the results obtained for these data in Sections 7.1 and 7.2c and in Subsection d of the present section), N = 20, C = 3, and u₁, u₂, and u₃ represent transformed amounts of Cu, Mo, and Fe, respectively. Further, let us take δ(u) to be the second-order polynomial (1.2), in which case rank X = P = 10 (and N − rank X = N − P = 10). And let us take γ̇ = .10 and take U to be the rectangular region defined by imposing on u₁, u₂, and u₃ upper and lower bounds as follows: −1 ≤ u_i ≤ 1 (i = 1, 2, 3); the determination of the constants c_γ̇ and c*_γ̇ needed to construct confidence bands I_u^(3)(y) and I_u^(4)(y) is considerably more straightforward when U is rectangular than when, e.g., it is spherical.

The values of the constants t̄_{.05}(10) and [10 F̄_{.10}(10, 10)]^{1/2} needed to construct confidence bands I_u^(1)(y) and I_u^(2)(y) can be determined via well-known numerical methods and are readily available from multiple sources. They are as follows: t̄_{.05}(10) = 1.812461 and [10 F̄_{.10}(10, 10)]^{1/2} = 4.819340.
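These two constants are easy to reproduce with standard software; for instance (a sketch assuming Python with SciPy, not part of the text):

    from scipy.stats import t as t_dist, f as f_dist

    P, resid_df, gamma_dot = 10, 10, 0.10

    t_bar = t_dist.ppf(1 - gamma_dot / 2, resid_df)                  # t-bar_.05(10)
    scheffe = (P * f_dist.ppf(1 - gamma_dot, P, resid_df)) ** 0.5    # [10 F-bar_.10(10,10)]^(1/2)

    print(round(t_bar, 6), round(scheffe, 6))                        # 1.812461  4.81934

As noted in the next paragraph, the Monte Carlo approximation to c_{.10} lies between these two values.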
In constructing confidence bands I_u^(3)(y) and I_u^(4)(y), resort was made to the approximate versions of those bands obtained upon replacing c_γ̇ and c*_γ̇ with Monte Carlo approximations. The Monte Carlo approximations were determined from K = 599999 draws with the following results: c_{.10} ≐ 3.448802 and c*_{.10} ≐ 2.776452; refer to the final 3 paragraphs of Subsection e for some discussion relevant to the “accuracy” of these approximations. Note that the approximation to c_{.10} is considerably greater than t̄_{.05}(10) and significantly smaller than [10 F̄_{.10}(10, 10)]^{1/2}. If U had been taken to be U = R³ rather than the rectangular region U = {u : |u_i| ≤ 1 (i = 1, 2, 3)}, the Monte Carlo approximation to c_{.10} would have been c_{.10} ≐ 3.520382 (rather than c_{.10} ≐ 3.448802). The difference between the two Monte Carlo approximations to c_{.10} is relatively small, suggesting that most of the difference between c_{.10} and [10 F̄_{.10}(10, 10)]^{1/2} is accounted for by the restriction of δ(u) to the form of the second-order polynomial (1.2) rather than the restriction of u to the region {u : |u_i| ≤ 1 (i = 1, 2, 3)}.

Segments of the various confidence bands [and of the estimated response surface δ̂(u)] are depicted in Figure 7.4; the segments depicted in the first plot are those for u-values such that u₂ = u₃ = 1, and the segments depicted in the second plot are those for u-values such that u₂ = u₃ = 0.

[Figure 7.4 appears here: two panels (u₂ = u₃ = 1 and u₂ = u₃ = 0) plotting δ̂ and the bands I^(1), I^(2), I^(3), and I^(4) against u₁ over the range −1 to 1.]

FIGURE 7.4. Two segments of the confidence bands I_u^(1)(y), I_u^(2)(y), I_u^(3)(y), and I_u^(4)(y) [and of the estimated response surface δ̂(u)] when the data are taken to be the lettuce-yield data, when δ(u) is taken to be the second-order polynomial (1.2) (where u₁, u₂, and u₃ are the transformed amounts of Cu, Mo, and Fe, respectively), when U = {u : |u_i| ≤ 1 (i = 1, 2, 3)}, when γ̇ = .10, and when c_{.10} and c*_{.10} are replaced by the Monte Carlo approximations c_{.10} ≐ 3.448802 and c*_{.10} ≐ 2.776452.

7.4 Some Optimality Properties


Suppose that y is an N × 1 observable random vector that follows the G–M model. Suppose also that the distribution of the vector e of residual effects in the G–M model is N(0, σ²I); or, more generally, suppose that the distribution of the vector u = σ⁻¹e of standardized residual effects is an absolutely continuous distribution with a pdf h(·) of the form h(u) ∝ g(u'u), where g(·) is a known function such that ∫₀^∞ s^{N−1} g(s²) ds < ∞. Further, adopt the notation and terminology of Section 7.3. And partition u into subvectors u₁, u₂, and u₃ (conformally with a partitioning of z into the subvectors z₁, z₂, and z₃), so that

    z = (z₁', z₂', z₃')' ∼ [(α + σu₁)', (ξ + σu₂)', (σu₃)']'.

Let us consider further the problem of testing the null hypothesis H₀: τ = τ⁽⁰⁾ or H̃₀: α = α⁽⁰⁾ versus the alternative hypothesis H₁: τ ≠ τ⁽⁰⁾ or H̃₁: α ≠ α⁽⁰⁾. Let us also consider further the problem of forming a confidence set for τ or α.

Among the procedures for testing H₀ or H̃₀ versus H₁ or H̃₁ is the size-γ̇ F test, with critical region C_F or C̃_F and with critical (test) function φ_F(y) or φ̃_F(z). Corresponding to the size-γ̇ F test are the 100(1 − γ̇)% confidence set A_F for τ and the 100(1 − γ̇)% confidence set Ã_F for α. In Section 7.3b, it was established that the size-γ̇ F test and the corresponding 100(1 − γ̇)% confidence sets have various seemingly desirable properties. These properties serve to define certain classes of test procedures and certain classes of procedures for forming confidence sets. In what follows, the focus is on obtaining useful characterizations of these classes and on establishing the optimality of the F test and the corresponding confidence sets within these classes.

a. Some results on invariance


Denote by φ̃(z) or (when convenient) by φ̃(z₁, z₂, z₃) an arbitrary function of the random vector z (= O'y). And let T̃(z) represent a transformation (of z) that satisfies the condition (3.57), and let G̃ represent a group of such transformations. Then, as discussed earlier [in the special case where φ̃(·) is the critical function of a size-γ̇ test of H̃₀ versus H̃₁], φ̃(·) is said to be invariant with respect to G̃ if

    φ̃[T̃(z)] = φ̃(z)

for every transformation T̃(·) in G̃ (and for every value of z). What can be discerned about the characteristics and the distribution of φ̃(z) when the function φ̃(·) is invariant with respect to various of the groups introduced in the final two parts of Section 7.2b?

In what follows, the primary results are presented as a series of propositions. Each of these propositions (after the first) builds on its predecessor. The verification of the propositions is deferred until the presentation of all of the propositions is complete.

The propositions are as follows:

(1) The function φ̃(z) is invariant with respect to the group G̃₁ of transformations of the form T̃₁(z) if and only if φ̃(z) = φ̃₁(z₁, z₃) for some function φ̃₁(·, ·); z₁ and z₃ form what is commonly referred to (e.g., Lehmann and Romano 2005b, chap. 6) as a maximal invariant. Moreover, the joint distribution of z₁ and z₃ depends on α, ξ, and σ only through the values of α and σ.

(2) The function φ̃(z) is invariant with respect to the group G̃₃(α⁽⁰⁾) of transformations of the form T̃₃(z; α⁽⁰⁾) as well as the group G̃₁ of transformations of the form T̃₁(z) if and only if φ̃(z) = φ̃₁₃[(z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) + z₃'z₃, z₃] for some function φ̃₁₃(·, ·). Moreover, the joint distribution of (z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) + z₃'z₃ and z₃ depends on α, ξ, and σ only through the values of (α − α⁽⁰⁾)'(α − α⁽⁰⁾) and σ.

(3) The function φ̃(z) is invariant with respect to the group G̃₂(α⁽⁰⁾) of transformations of the form T̃₂(z; α⁽⁰⁾) as well as the groups G̃₁ and G̃₃(α⁽⁰⁾) of transformations of the form T̃₁(z) and T̃₃(z; α⁽⁰⁾) only if there exists a function φ̃₁₃₂(·) [of an (N − P*)-dimensional vector] such that

    φ̃(z) = φ̃₁₃₂{[(z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) + z₃'z₃]^{−1/2} z₃}

for those values of z for which z₃ ≠ 0. Moreover, Pr(z₃ ≠ 0) = 1, and the distribution of [(z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) + z₃'z₃]^{−1/2} z₃ depends on α, ξ, and σ only through the value of (α − α⁽⁰⁾)'(α − α⁽⁰⁾)/σ².

(4) The function φ̃(z) is invariant with respect to the group G̃₄ of transformations of the form T̃₄(z) as well as the groups G̃₁, G̃₃(α⁽⁰⁾), and G̃₂(α⁽⁰⁾) of transformations of the form T̃₁(z), T̃₃(z; α⁽⁰⁾), and T̃₂(z; α⁽⁰⁾) only if there exists a function φ̃₁₃₂₄(·) (of a single variable) such that

    φ̃(z) = φ̃₁₃₂₄{z₃'z₃ / [(z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) + z₃'z₃]}

for those values of z for which z₃ ≠ 0. Moreover, Pr(z₃ ≠ 0) = 1, and the distribution of z₃'z₃ / [(z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) + z₃'z₃] depends on α, ξ, and σ only through the value of (α − α⁽⁰⁾)'(α − α⁽⁰⁾)/σ².
Q
Verification of Proposition (1). If .z/ D Q1 .z1 ; z3 / for some function Q1 . ; /, then it is clear that
[for every transformation TQ1 ./ in the group GQ 1 ] Œ
Q TQ1 .z/ D Q 1 .z1 ; z3 / D .z/
Q and hence that the
Q Q
function .z/ is invariant with respect to the group G1 . Conversely, suppose that .z/ Q is invariant
Q
with respect to the group G1 . Then, for every choice of the vector c,
Q 1 ; z2 Cc; z3 / D Œ
.z Q TQ1 .z/ D .z
Q 1 ; z2 ; z3 /:
And upon setting c D z2 , we find that .z/ Q D Q 1 .z1 ; z3 /, where Q 1 .z1 ; z3 / D .z
Q 1 ; 0; z3 /.
Moreover, the joint distribution of z1 and z3 does not depend on , as is evident upon observing that
   
z1 ˛ C u1
 :
z3 u3

Verification of Proposition (2). Suppose that .z/ Q D Q13 Œ.z1 ˛.0/ /0 .z1 ˛.0/ / C z03 z3 ; z3  for
some function Q 13 . ; /. Then, .z/
Q is invariant with respect to the group GQ 1 of transformations, as is
evident from Proposition (1). And it is also invariant with respect to the group GQ 3 .˛.0/ /, as is evident
upon observing that
Q TQ3 .zI ˛.0/ / D Q Œ.z1 ˛.0/ /0 P P 0 .z1 ˛.0/ /Cz03 z3 ; z3 
Π13
D Q13 Œ.z1 ˛.0/ /0 .z1 ˛.0/ /Cz03 z3 ; z3  D .z/:
Q

Conversely, suppose that .z/ Q is invariant with respect to the group GQ 3 .˛.0/ / as well as the group
Q
G1 . Then, in light of Proposition (1), .z/ Q D Q1 .z1 ; z3 / for some function Q1 . ; /. And to establish
Q
that .z/ D Q 13 Œ.z1 ˛ / .z1 ˛ /Cz3 z3 ; z3  for some function Q13 . ; /, it suffices to observe
.0/ 0 .0/ 0

that corresponding to any two values zP 1 and zR 1 of z1 that satisfy the equality .Rz1 ˛.0/ /0 .Rz1 ˛.0/ / D
.Pz1 ˛.0/ /0 .Pz1 ˛.0/ /, there is an orthogonal matrix P such that zR 1 D ˛.0/ CP 0 .Pz1 ˛.0/ / (the existence
of which follows from Lemma 5.9.9) and hence such that Q1 .Rz1 ; z3 / D Q 1 Œ˛.0/ CP 0 .Pz1 ˛.0/ /; z3 .
It remains to verify that the joint distribution of .z1 ˛.0/ /0 .z1 ˛.0/ /Cz03 z3 and z3 depends on ˛,
, and  only through the values of .˛ ˛.0/ / 0 .˛ ˛.0/ / and . Denote by O an M M orthogonal
matrix defined as follows: if ˛ D ˛.0/ , take O to be IM or any other M M orthogonal matrix; if
˛ ¤ ˛.0/ , take O to be the Helmert matrix whose first row is proportional to the vector .˛ ˛.0/ /0.
Further, let uQ 1 D Ou1 , and take uQ to be the N -dimensional column vector whose transpose is
uQ 0 D .uQ 01 ; u02 ; u03 /. Then, upon observing that the joint distribution of .z1 ˛.0/ /0 .z1 ˛.0/ / C z03 z3
and z3 is identical to that of the random variable .˛ ˛.0/ Cu1 /0 .˛ ˛.0/ Cu1 / C  2 u03 u3 and
the random vector u3 , observing that

.˛ ˛.0/ Cu1 /0 .˛ ˛.0/ Cu1 / C  2 u03 u3


D .˛ ˛.0/ /0 .˛ ˛.0/ / C 2.˛ ˛.0/ /0 u1 C  2 u01 u1 C  2 u03 u3
D .˛ ˛.0/ /0 .˛ ˛.0/ / C 2.˛ ˛.0/ /0 O 0 Ou1 C  2 u01 O 0 Ou1 C  2 u03 u3
D .˛ ˛.0/ /0 .˛ ˛.0/ / C  2 .uQ 01 uQ 1 C u03 u3 / C 2fŒ.˛ ˛.0/ /0 .˛ ˛.0/ /1=2; 0; 0; : : : ; 0g u
Q 1;
and observing that uQ D diag.O; I; I/ u  u, it is evident that the joint distribution of .z1 ˛.0/ /0 .z1
˛.0/ / C z03 z3 and z3 depends on ˛, , and  only through the values of .˛ ˛.0/ / 0 .˛ ˛.0/ / and .
Verification of Proposition (3). Suppose that .z/ Q is invariant with respect to the group GQ 2 .˛.0/ / as
well as the groups GQ 1 and GQ 3 .˛ /. Then, in light of Proposition (2), .z/
.0/ Q D Q 13 Œ.z1 ˛.0/ /0 .z1
˛.0/ /Cz03 z3 ; z3  for some function Q 13 . ; /, in which case Œ
Q TQ .zI ˛.0/ / D Q fk 2 Œ.z ˛.0/ /0 .z
2 13 1 1
˛ /Cz3 z3 ; kz3 g. Thus, to establish the existence of a function Q132 ./ such that .z/
.0/ 0 Q D Q132 fŒ.z1
˛.0/ /0 .z1 ˛.0/ /Cz03 z3  1=2 z3 g for those values of z for which z3 ¤ 0, it suffices to take zP 1 and zR 1
to be values of z1 and zP 3 and zR 3 nonnull values of z3 such that
Œ.Rz1 ˛.0/ /0 .Rz1 ˛.0/ /C zR 03 zR 3  1=2
zR 3 D Œ.Pz1 ˛.0/ /0 .Pz1 ˛.0/ /C zP 03 zP 3  1=2
zP 3
and to observe that
zR 3 D k zP 3 and .Rz1 ˛.0/ /0 .Rz1 ˛.0/ /C zR 03 zR 3 D k 2 Œ.Pz1 ˛.0/ /0 .Pz1 ˛.0/ /C zP 03 zP 3 ;
where k D Œ.Rz1 ˛.0/ /0 .Rz1 ˛.0/ /CRz03 zR 3 1=2=Œ.Pz1 ˛.0/ /0 .Pz1 ˛.0/ /CPz03 zP 3 1=2. It remains only to observe
[as in the verification of Proposition (2)] that the joint distribution of .z1 ˛.0/ /0 .z1 ˛.0/ / C z03 z3
and z3 is identical to that of the random variable
.˛ ˛.0/ /0 .˛ ˛.0/ / C  2 .u01 u1 C u03 u3 / C 2fŒ.˛ ˛.0/ /0 .˛ ˛.0/ /1=2; 0; 0; : : : ; 0gu1
and the random vector u3 and hence that
Œ.z1 ˛.0/ /0 .z1 ˛.0/ /Cz03 z3  1=2 z3
.˛ ˛.0/ /0 .˛ ˛.0/ /

 C u01 u1 C u03 u3
2 1=2
Œ.˛ ˛.0/ /0 .˛ ˛.0/ /1=2
  
C2 ; 0; 0; : : : ; 0 u1 u3 :


Verification of Proposition (4). Suppose that .z/Q is invariant with respect to the group GQ 4 as well
Q Q .0/ Q .0/
as the groups G1 , G3 .˛ /, and G2 .˛ /. Then, in light of Proposition (3), there exists a function
Q132 ./ such that .z/
Q D Q132 fŒ.z1 ˛.0/ /0 .z1 ˛.0/ /Cz03 z3  1=2 z3 g for those values of z for which
z3 ¤ 0. Moreover, there exists a function Q1324 ./ such that for every value of z1 and every nonnull
value of z3 ,
Q 132fŒ.z1 ˛.0/ /0 .z1 ˛.0/ /Cz03 z3  1=2
z3 g D Q1324fz03 z3 =Œ.z1 ˛.0/ /0 .z1 ˛.0/ /Cz03 z3 g:
To confirm this, it suffices to take zP 1 and zR 1 to be values of z1 and zP 3 and zR 3 nonnull values of z3
such that zR 03 zR 3 zP 03 zP 3
0
D (4.1)
.Rz1 ˛.0/ /0 .Rz1 ˛.0/ /C zR 3 zR 3 .Pz1 ˛.0/ /0 .Pz1 ˛.0/ /C zP 03 zP 3
and to observe that equality (4.1) is reexpressible as the equality

fŒ.Rz1 ˛.0/ /0 .Rz1 ˛.0/ /C zR 03 zR 3  1=2


zR 3 g0 fŒ.Rz1 ˛.0/ /0 .Rz1 ˛.0/ /C zR 03 zR 3  1=2
zR 3 g
.0/ 0
D fŒ.Pz1 ˛ / .Pz1 ˛ .0/
/C zP 03 zP 3  1=2 zP 3 g0 fŒ.Pz1 .0/ 0
˛ / .Pz1 ˛.0/ /C zP 03 zP 3  1=2
zP 3 g
and hence that equality (4.1) implies (in light of Lemma 5.9.9) the existence of an orthogonal matrix
B for which
.0/ 0
Œ.Rz1 ˛ / .Rz1 ˛ .0/
/C zR 03 zR 3  1=2 zR 3 D B0 fŒ.Pz1 ˛.0/ /0 .Pz1 ˛.0/ /C zP 03 zP 3  1=2
zP 3 g
D Œ.Pz1 ˛.0/ /0 .Pz1 ˛.0/ /C.B0 zP 3 /0 B0 zP 3  1=2
B0 zP 3 :
That the distribution of z03 z3 =Œ.z1 ˛.0/ /0 .z1 ˛.0/ /Cz03 z3  depends on ˛, , and  only through the
value of .˛ ˛.0/ / 0 .˛ ˛.0/ /= 2 follows from Proposition (3) upon observing that

z03 z3 =Œ.z1 ˛.0/ /0 .z1 ˛.0/ /Cz03 z3 


D fŒ.z1 ˛.0/ /0 .z1 ˛.0/ /Cz03 z3  1=2
z3 g0 fŒ.z1 ˛.0/ /0 .z1 ˛.0/ /Cz03 z3  1=2
z3 g:

b. A relationship between sufficiency and invariance


As is evident from the final two parts of Section 7.3a, z₁, z₂, and z₃'z₃ form a sufficient statistic. The following proposition has the effect of establishing a relationship between sufficiency and invariance.

Proposition. The function φ̃(z) is invariant with respect to the group G̃₄ of transformations of the form T̃₄(z) if and only if φ̃(z) = φ̃₄(z₁, z₂, z₃'z₃) for some function φ̃₄(·, ·, ·).

Verification of the proposition. Suppose that φ̃(z) = φ̃₄(z₁, z₂, z₃'z₃) for some function φ̃₄(·, ·, ·). Then,

    φ̃[T̃₄(z)] = φ̃₄[z₁, z₂, (B'z₃)'B'z₃] = φ̃₄(z₁, z₂, z₃'z₃) = φ̃(z).

Conversely, suppose that φ̃(z) is invariant with respect to the group G̃₄. Then, to establish that φ̃(z) = φ̃₄(z₁, z₂, z₃'z₃) for some function φ̃₄(·, ·, ·), it suffices to observe that corresponding to any two values ż₃ and z̈₃ of z₃ that satisfy the equality z̈₃'z̈₃ = ż₃'ż₃, there is an orthogonal matrix B such that z̈₃ = B'ż₃ (the existence of which follows from Lemma 5.9.9).

c. The Neyman–Pearson fundamental lemma and its implications


Preliminary to comparing the power of the size-γ̇ F test of H̃₀ or H₀ (versus H̃₁ or H₁) with that of other invariant tests of H̃₀ or H₀, it is convenient to introduce the following theorem.

Theorem 7.4.1. Let X represent an observable random variable with an absolutely continuous distribution that depends on a parameter θ (of unknown value). Further, let Θ represent the parameter space, let θ⁽⁰⁾ represent an hypothesized value of θ, let C represent an arbitrary critical (rejection) region for testing the null hypothesis that θ = θ⁽⁰⁾ versus the alternative hypothesis that θ ≠ θ⁽⁰⁾, define π(θ; C) = Pr(X ∈ C) [so that π(θ; C) is the power function of the test with critical region C], and (for θ ∈ Θ) denote by f(·; θ) the pdf of the distribution of X. And let γ̇ represent a scalar in the interval 0 < γ̇ < 1. Then, subject to the constraint

    π(θ⁽⁰⁾; C) ≤ γ̇

(i.e., the constraint that the size of the test not exceed γ̇), π(θ*; C) attains its maximum value for any specified value θ* of θ (other than θ⁽⁰⁾) when C is taken to be a critical region C* that satisfies the following conditions:

    x ∈ C* if f(x; θ*) > k f(x; θ⁽⁰⁾)   and   x ∉ C* if f(x; θ*) < k f(x; θ⁽⁰⁾)              (4.2)

(for some nonnegative constant k) and

    π(θ⁽⁰⁾; C*) = γ̇.                                                                          (4.3)

The result of Theorem 7.4.1 constitutes part of a version of what is known as the Neyman–Pearson fundamental lemma or simply as the Neyman–Pearson lemma. For a proof of this result, refer, for example, to Casella and Berger (2002, sec. 8.3).

In regard to Theorem 7.4.1, the test with critical region (set) C can be identified by the indicator
function of C rather than by C itself; this function is the so-called critical (test) function. The
definition of a test can be extended to include “randomized” tests; this can be done by extending the
definition of a critical function to include any function ./ such that 0  .x/  1 for every scalar
x—when ./ is an indicator function of a set C , .x/ equals either 0 or 1. Under the extended
definition, the test with critical function ./ consists of rejecting the null hypothesis  D  .0/ with
probability .x/, where x is the observed value of X . This test has a power function, say .I /,
that is expressible as .I / D EŒ.X /. The coverage of Theorem 7.4.1 can be extended: it can be
shown that among tests (randomized as well as nonrandomized) whose size does not exceed P , the
power function attains its maximum value for any specified value   of  (other than  .0/ ) when the
test is taken to be a nonrandomized test with critical region C  that satisfies conditions (4.2) and (4.3).
In the context of Theorem 7.4.1, the (nonrandomized) test with critical region C  that satisfies
conditions (4.2) and (4.3) is optimal (in the sense that the value attained by its power function at the
specified value   of  is a maximum among all tests whose size does not exceed P ). In general, this
test varies with   . Suppose, however, that the set X D fx W f .xI / > 0g does not vary with the
value of  and that for every value of  in ‚ and for x 2 X, the “likelihood ratio” f .xI /=f .xI  .0/ /
is a nondecreasing function of x or, alternatively, is a nonincreasing function of x. Then, there is a
critical region C  that satisfies conditions (4.2) and (4.3) and that does not vary with   : depending
on whether the ratio f .xI /=f .xI  .0/ / is a nondecreasing function of x or a nonincreasing function
of x, we can take C  D fx W x > k 0 g, where k 0 is the upper 100 P % point of the distribution with
pdf f . I  .0/ /, or take C  D fx W x < k 0 g, where k 0 is the lower 100 P % point of the distribution
with pdf f . I  .0/ /. In either case, the test with critical region C  constitutes what is referred to as
a UMP (uniformly most powerful) test.

d. A UMP invariant test and a UMA invariant confidence set


Let us now resume the discussion begun in Subsections a and b (pertaining to the problem of testing the null hypothesis H₀: τ = τ⁽⁰⁾ or H̃₀: α = α⁽⁰⁾ versus the alternative hypothesis H₁: τ ≠ τ⁽⁰⁾ or H̃₁: α ≠ α⁽⁰⁾). And let us continue to denote by φ̃(z) or (when convenient) by φ̃(z₁, z₂, z₃) an arbitrary function of the random vector z (= O'y).

Invariant functions: an alternative representation. If φ̃(z) is invariant with respect to the four groups G̃₁, G̃₃(α⁽⁰⁾), G̃₂(α⁽⁰⁾), and G̃₄ of transformations, then [according to Proposition (4) of Subsection a] there exists a function φ̃₁₃₂₄(·) such that φ̃(z) = φ̃₁₃₂₄{z₃'z₃ / [(z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) + z₃'z₃]} for those values of z for which z₃ ≠ 0. And the distribution of z₃'z₃ / [(z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) + z₃'z₃] depends on α, ξ, and σ only through the value of (α − α⁽⁰⁾)'(α − α⁽⁰⁾)/σ². Moreover, for those values of z for which z₁ ≠ α⁽⁰⁾ or z₃ ≠ 0,

    z₃'z₃ / [(z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) + z₃'z₃] = 1 − (z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) / [(z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) + z₃'z₃].   (4.4)

Now, take X to be an observable random variable defined as follows:

    X = (z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) / [(z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) + z₃'z₃]   if z₁ ≠ α⁽⁰⁾ or z₃ ≠ 0,
    X = 0   if z₁ = α⁽⁰⁾ and z₃ = 0.                                                          (4.5)

And let λ = (α − α⁽⁰⁾)'(α − α⁽⁰⁾)/σ². When regarded as a function of z, X is invariant with respect to the groups G̃₁, G̃₃(α⁽⁰⁾), G̃₂(α⁽⁰⁾), and G̃₄ (as can be readily verified). Thus, if φ̃(z) depends on z only through the value of X, then it is invariant with respect to the groups G̃₁, G̃₃(α⁽⁰⁾), G̃₂(α⁽⁰⁾), and G̃₄. Conversely, if φ̃(z) is invariant with respect to the groups G̃₁, G̃₃(α⁽⁰⁾), G̃₂(α⁽⁰⁾), and G̃₄, then [in light of equality (4.4)] there exists a function φ̃*₁₃₂₄(·) such that
    φ̃(z) = φ̃*₁₃₂₄(X)                                                                         (4.6)

for those values of z for which z₃ ≠ 0, in which case [since Pr(z₃ ≠ 0) = 1] φ̃(z) = φ̃*₁₃₂₄(X) with probability 1. Moreover, the distribution of X or any function of X depends on α, ξ, and σ only through the value of the nonnegative parametric function λ. In fact,

    X ∼ [λ + 2(√λ, 0, 0, …, 0)u₁ + u₁'u₁] / [λ + 2(√λ, 0, 0, …, 0)u₁ + u₁'u₁ + u₃'u₃],         (4.7)

as can be readily verified via a development similar to that employed in Subsection a in the verification of Proposition (2).
Applicability of the Neyman–Pearson lemma. As is evident from the preceding part of the present subsection, a test of H̃₀ versus H̃₁ with critical function φ̃(z) is invariant with respect to the groups G̃₁, G̃₃(α⁽⁰⁾), G̃₂(α⁽⁰⁾), and G̃₄ if and only if there exists a test with critical function φ̃*₁₃₂₄(X) such that φ̃(z) = φ̃*₁₃₂₄(X) with probability 1, in which case

    E[φ̃(z)] = E[φ̃*₁₃₂₄(X)],                                                                  (4.8)

that is, the two tests have the same power function. The upshot of this remark is that the Neyman–Pearson lemma can be used to address the problem of finding a test of H̃₀ versus H̃₁ that is “optimal” among tests that are invariant with respect to the groups G̃₁, G̃₃(α⁽⁰⁾), G̃₂(α⁽⁰⁾), and G̃₄ and whose size does not exceed γ̇. Among such tests, the power function of the test attains its maximum value for values of α, ξ, and σ such that λ equals some specified value λ* when the critical region of the test is a critical region obtained upon applying the Neyman–Pearson lemma, taking X to be the observable random variable defined by expression (4.5) and taking θ = λ, Θ = [0, ∞), θ⁽⁰⁾ = 0, and θ* = λ*.
Special case: σ⁻¹e ∼ N(0, I). Suppose that the distribution of the vector u = σ⁻¹e is N(0, I), in which case z ∼ N[(α', ξ', 0')', σ²I]. Then,

    X = (z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) / [(z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) + z₃'z₃] ∼ Be[M*/2, (N − P*)/2; λ/2],   (4.9)

as is evident upon observing that z₃'z₃/σ² ∼ χ²(N − P*) and that (z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾)/σ² is statistically independent of z₃'z₃/σ² and has a χ²(M*, λ) distribution. Further, let (for λ ≥ 0) f(·; λ) represent the pdf of the Be[M*/2, (N − P*)/2; λ/2] distribution, and observe [in light of result (6.3.24)] that (for λ > 0) the ratio f(x; λ)/f(x; 0) is a strictly increasing function of x (over the interval 0 < x < 1). And consider the test (of the null hypothesis H̃₀: α = α⁽⁰⁾ versus the alternative hypothesis H̃₁: α ≠ α⁽⁰⁾) that rejects H̃₀ if and only if

    X > B̄_γ̇[M*/2, (N − P*)/2],                                                               (4.10)

where B̄_γ̇[M*/2, (N − P*)/2] is the upper 100γ̇% point of the Be[M*/2, (N − P*)/2] distribution. This test is of size γ̇, is invariant with respect to the groups G̃₁, G̃₃(α⁽⁰⁾), G̃₂(α⁽⁰⁾), and G̃₄, and, among tests whose size does not exceed γ̇ and that are invariant with respect to the groups G̃₁, G̃₃(α⁽⁰⁾), G̃₂(α⁽⁰⁾), and G̃₄, is UMP (as is evident from the discussion in Subsection c and from the discussion in the preceding part of the present subsection).
The F test. Note that (for z₃ ≠ 0)

    (z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) / [(z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) + z₃'z₃] = [M*/(N − P*)]F̃(α⁽⁰⁾) / {1 + [M*/(N − P*)]F̃(α⁽⁰⁾)},   (4.11)

where (for an arbitrary M* × 1 vector α̇)

    F̃(α̇) = [(z₁ − α̇)'(z₁ − α̇)/M*] / [z₃'z₃/(N − P*)].

Note also that expression (4.11) is [for F̃(α⁽⁰⁾) ≥ 0] a strictly increasing function of F̃(α⁽⁰⁾). Thus, the set of z-values that satisfy inequality (4.10) is essentially (wp1) identical to the critical region C̃_F of the size-γ̇ F test. Accordingly, the size-γ̇ F test (of H̃₀ versus H̃₁) is equivalent to a size-γ̇ test that is invariant with respect to the groups G̃₁, G̃₃(α⁽⁰⁾), G̃₂(α⁽⁰⁾), and G̃₄ and that [when σ⁻¹e ∼ N(0, I)] is UMP among all tests (of H̃₀ versus H̃₁) whose size does not exceed γ̇ and that are invariant with respect to the groups G̃₁, G̃₃(α⁽⁰⁾), G̃₂(α⁽⁰⁾), and G̃₄; the equivalence of the two tests is in the sense that their critical functions are equal wp1. Earlier (in Section 7.3b), it was noted that the F test is invariant [with respect to the groups G̃₁, G̃₃(α⁽⁰⁾), G̃₂(α⁽⁰⁾), and G̃₄] for those values of z in the set {z : z₃ ≠ 0} (which are the values for which the F test is defined).
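The equivalence of the beta-scale critical region (4.10) and the F-scale critical region rests on the monotone relationship (4.11), under which the upper 100γ̇% point of the Be[M*/2, (N − P*)/2] distribution corresponds exactly to the upper 100γ̇% point of the F(M*, N − P*) distribution. A quick numerical check (a sketch assuming Python with SciPy; the values M* = 3, N − P* = 10, and γ̇ = .10 are purely illustrative):

    from scipy.stats import beta, f

    M_star, resid_df, gamma_dot = 3, 10, 0.10

    B_bar = beta.ppf(1 - gamma_dot, M_star / 2, resid_df / 2)   # upper point of Be[M*/2, (N-P*)/2]
    F_bar = f.ppf(1 - gamma_dot, M_star, resid_df)              # upper point of F(M*, N-P*)

    r = (M_star / resid_df) * F_bar                             # [M*/(N-P*)] F-bar, as in (4.11)
    print(B_bar, r / (1 + r))                                   # the two numbers coincide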
A stronger result. Suppose (as in the preceding 2 parts of the present subsection) that  1 e 
N.0; I/. Then, as is evident from the results in the next-to-last part of Section 7.3a, z1 , z2 , and z03 z3
form a (vector-valued) sufficient statistic. And in light of the proposition of Subsection b, a critical
Q
function .z/ is reexpressible as a function of this statistic if and only if it is invariant with respect to
the group GQ 4 of transformations of the form TQ4 .z/. Thus, a test (of HQ 0 versus HQ 1 ) is invariant with
respect to the groups GQ 1 , GQ 3 .˛.0/ /, GQ 2 .˛.0/ /, and GQ 4 of transformations if and only if it is invariant
with respect to the groups GQ 1 , GQ 3 .˛.0/ /, and GQ 2 .˛.0/ / and, in addition, its critical function depends
on z only through the value of the sufficient statistic formed by z1 , z2 , and z03 z3 .
Q
Let .z/ represent the critical function of any (possibly randomized) test (of HQ 0 versus HQ 1 ).
Corresponding to .z/Q is a critical function .z/ N Q j z ; z ; z0 z  that depends on z only
D EŒ.z/ 1 2 3 3
through the value of the sufficient statistic—here, the conditional expectation of any function of z is
taken to be that determined from any particular version of the conditional distribution of z. Moreover,
N
EŒ.z/ Q
D EŒ.z/;
N
so that the power function of the test with critical function .z/ is identical to that of the test with
Q Q Q TQ .z/ D .z/,
critical function .z/. And for any transformation T .z/ such that ΠQ
Œ Q TQ .z/ j z1 ; z2 ; z03 z3 g D .z/:
N TQ .z/ D EfŒ N
Thus, if the test with critical function .z/ Q is invariant with respect to the groups GQ 1 , GQ 3 .˛.0/ /, and
GQ 2 .˛.0/ / of transformations, then so is the test with critical function .z/.
N
There is an implication that the size- P test (of HQ 0 versus HQ 1 ) that is invariant with respect to the
groups GQ 1 , GQ 3 .˛.0/ /, and GQ 2 .˛.0/ / and whose critical region is determined by inequality (4.10) or
(for values of z such that z3 ¤ 0) by the inequality F.˛ Q .0/ / > FN P .M ; N P / is UMP among all
tests whose size does not exceed P and that are invariant with respect to the groups GQ 1 , GQ 3 .˛.0/ /,
and GQ 2 .˛.0/ /. The restriction to tests that are invariant with respect to the group GQ 4 is unnecessary.
General case. The results of the preceding three parts of the present subsection were obtained under
a supposition that the distribution of the vector u D  1 e is N.0; I/. Let us now consider the general
case where the distribution of u is an absolutely continuous spherically symmetric distribution that
may differ from the N.0; I/ distribution. Specifically, let us consider the extent to which the results
obtained in the special case where u  N.0; I/ extend to the general case.
In the general case as well as the special case, the set of z-values that satisfy inequality (4.10)
form the critical region of a test of HQ 0 versus HQ 1 that is invariant with respect to the groups GQ 1 ,
GQ 3 .˛.0/ /, GQ 2 .˛.0/ /, and GQ 4 of transformations and that is of size P . This observaion (as it pertains
to the size of the test) is consistent with one made earlier (in Section 7.3b) in a discussion of the F
test—the size of this test is the same as that of the F test.
Is this test UMP among all tests that are invariant with respect to the groups GQ 1 , GQ 3 .˛.0/ /, and
GQ 2 .˛.0/ / and whose size does not exceed P (as it is in the special case of normality)? As in the
special case, this question can be addressed by applying the Neyman–Pearson lemma.

Let us continue to take X to be the observable random variable defined by equality (4.5), and
let us take f . I / to be the pdf of the distribution of X [as determined from the distribution of u
on the basis of expression (4.7)]. When  D 0, X  BeŒM =2; .N P /=2, so that f . I 0/ is
the same in the general case as in the special case where u  N.0; I/—refer, e.g., to Part (1) of
Theorem 6.3.1. However, when  > 0, the distribution of X varies with the distribution of u—in
the special case where u  N.0; I/, this distribution is BeŒM =2; .N P /=2; =2 (a noncentral
beta distribution). Further, let C represent an arbitrary critical region for testing (on the basis of X )
the null hypothesis that  D 0 versus the alternative hypothesis that  > 0. And denote R by . I C /
the power function of the test with critical region C, so that (for   0) .I C / D C f .xI / dx.
Now, consider a critical region C  such that subject to the constraint .0I C /  P, .I C /
attains its maximum value for any particular (strictly positive) value  of  when C D C . At least
in principle, such a critical region can be determined by applying the Neyman–Pearson lemma. The
critical region C  defines a set of z-values that form the critical region of a size- P test of HQ 0 versus
HQ 1 that is invariant with respect to the groups GQ 1 , GQ 3 .˛.0/ /, GQ 2 .˛.0/ /, and GQ 4 of transformations.
Moreover, upon recalling (from the final part of Section 7.3a) that [as in the special case where
u  N.0; I/] z1 , z2 , and z03 z3 form a (vector-valued) sufficient statistic and upon employing
essentially the same reasoning as in the preceding part of the present subsection, we find that the
value attained for  D  by the power function of that test is greater than or equal to that attained
for  D  by the power function of any test of HQ 0 versus HQ 1 that is invariant with respect to the
groups GQ 1 , GQ 3 .˛.0/ /, and GQ 2 .˛.0/ / and whose size does not exceed P.
If [as in the special case where u  N.0; I/] the ratio f .xI  /=f .xI 0/ is a nondecreasing
function of x, then C  can be taken to be the critical region defined by inequality (4.10) or (aside
from a set of z-values of probability 0) by the inequality F.˛ Q .0/ / > FN P .M ; N P /. And if [as

in the special case where u  N.0; I/] the ratio f .xI  /=f .xI 0/ is a nondecreasing function
of x for every choice of , then the size- P test (of HQ 0 versus HQ 1 ) that is invariant with respect
to the groups GQ 1 , GQ 3 .˛.0/ /, GQ 2 .˛.0/ /, and GQ 4 and whose critical region is defined by inequality
(4.10) or (for values of z such that z3 ¤ 0) by the inequality F.˛ Q .0/ / > FN P .M ; N P / is UMP
among all tests that are invariant with respect to the groups GQ 1 , GQ 3 .˛.0/ /, and GQ 2 .˛.0/ / and whose
size does not exceed P. However, in general, C  may differ nontrivially (i.e., for a set of z-values
having nonzero probability) from the critical region defined by inequality (4.10) or by the inequality
Q .0/ / > FN P .M ; N P / and from one choice of  to another, and there may not be any test that
F.˛
is UMP among all tests that are invariant with respect to the groups GQ 1 , GQ 3 .˛.0/ /, and GQ 2 .˛.0/ / and
whose size does not exceed P.
Restatement of results. The results of the preceding parts of the present subsection are stated in terms of the transformed observable random vector z (= O'y) and in terms of the transformed parametric vector α (= S'τ) associated with the canonical form of the G–M model. These results can be restated in terms of y and τ.

The null hypothesis H̃₀: α = α⁽⁰⁾ is equivalent to the null hypothesis H₀: τ = τ⁽⁰⁾ (the parameter vector β satisfies the condition τ = τ⁽⁰⁾ if and only if it satisfies the condition α = α⁽⁰⁾), and the alternative hypothesis H̃₁: α ≠ α⁽⁰⁾ is equivalent to the alternative hypothesis H₁: τ ≠ τ⁽⁰⁾. Moreover, λ = (α − α⁽⁰⁾)'(α − α⁽⁰⁾)/σ² is reexpressible as

    λ = (τ − τ⁽⁰⁾)'C⁻(τ − τ⁽⁰⁾)/σ²                                                            (4.12)

[refer to result (3.34)], and z₃'z₃ and (z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) are reexpressible as

    z₃'z₃ = d'd = y'(I − P_X)y                                                                (4.13)

and

    (z₁ − α⁽⁰⁾)'(z₁ − α⁽⁰⁾) = (α̂ − α⁽⁰⁾)'(α̂ − α⁽⁰⁾) = (τ̂ − τ⁽⁰⁾)'C⁻(τ̂ − τ⁽⁰⁾)                   (4.14)

[refer to result (3.21) or (3.26) and to result (3.28)]. Results (4.13) and (4.14) can be used to reexpress (in terms of y) the observable random variable X defined by expression (4.5) and, as noted earlier (in Section 7.3b), to reexpress F̃(α⁽⁰⁾).
Q
A function .z/ of z that is the critical function of a test of HQ 0 or H0 versus HQ 1 or H1 is reex-
pressible as a function .y/ [D .O Q 0 y/] of y. And corresponding to any one-to-one transformation
TQ .z/ of z (from RN onto RN ) is a one-to-one transformation T .y/ [D O TQ .O 0 y/] of y (from RN
onto RN ), and corresponding to any group GQ of such transformations of z is a group G consisting
of the corresponding transformations of y. Further, a test (of HQ 0 or H0 versus HQ 1 or H1 ) is invariant
with respect to the group GQ if and only if it is invariant with respect to the group G—if .z/ Q is the
critical function expressed as a function of z and .y/ the critical function expressed as a function
of y, then ŒQ TQ .z/ D .z/
Q for every transformation TQ .z/ in GQ if and only if ŒT .y/ D .y/ for
every transformation T .y/ in G. Refer to the final two parts of Section 7.3b for some specifics as
they pertain to the groups GQ 1 , GQ 2 .˛.0/ /, GQ 3 .˛.0/ /, and GQ 4 .
Confidence sets. Take (for each value α̇ of α) X̃(α̇) to be the random variable defined as follows:

    X̃(α̇) = (z₁ − α̇)'(z₁ − α̇) / [(z₁ − α̇)'(z₁ − α̇) + z₃'z₃]   if z₁ ≠ α̇ or z₃ ≠ 0,
    X̃(α̇) = 0   if z₁ = α̇ and z₃ = 0.

And take Ã_F(z) to be the set (of α-values)

    Ã_F(z) = {α̇ : X̃(α̇) ≤ B̄_γ̇[M*/2, (N − P*)/2]}

or, equivalently (for those values of z such that z₃ ≠ 0), the set

    Ã_F(z) = {α̇ : F̃(α̇) ≤ F̄_γ̇(M*, N − P*)}.

As discussed in Section 7.3b, Ã_F(z) is a 100(1 − γ̇)% confidence set for α.
Let GQ 2 represent the group of transformations consisting of the totality of the groups GQ 2 .˛.0/ /
(˛ 2 RM), and, similarly, let GQ 3 represent the group of transformations consisting of the totality
.0/

of the groups GQ 3 .˛.0/ / (˛.0/ 2 RM). Then, the 100.1 P /% confidence set AQF .z/ is invariant with
respect to the groups GQ 1 and GQ 4 and is equivariant with respect to the groups GQ 2 and GQ 3 —refer to
Q represent
the discussion in the final 3 parts of Section 7.3b. And for purposes of comparison, let A.z/
any confidence set for ˛ whose probability of coverage equals or exceeds 1 P and that is invariant
with respect to the group GQ 1 and equivariant with respect to the groups GQ 2 and GQ 3 . Further, for
˛.0/ 2 RM, take ı.˛.0/ I ˛/ to be the probability PrŒ˛.0/ 2 A.z/ Q of ˛.0/ being “covered” by the
Q .0/ .0/
confidence set A.z/—ı.˛I ˛/  1 P , and for ˛ ¤ ˛, ı.˛ I ˛/ is referred to as the probability
of false coverage.
Corresponding to A.z/ Q is the test of the null hypothesis HQ 0 W ˛ D ˛.0/ versus the alternative
Q .0/
hypothesis H1 W ˛ ¤ ˛ with critical function . Q I ˛0 / defined as follows:
(
Q
1; if ˛.0/ … A.z/,
Q
.zI ˛0 / D .0/
0; if ˛ 2 A.z/.Q

And in light of the invariance of A.z/Q with respect to the group GQ 1 and the equivariance of A.z/
Q
with respect to the groups GQ 2 and GQ 3 ,
Q TQ1 .z/I ˛.0/  D Œ
Œ Q TQ2 .z/I ˛.0/  D Œ
Q TQ3 .z/I ˛.0/  D .zI
Q ˛0 /; (4.15)
Q
as can be readily verified. Further, let .˛I ˛.0/ / D EŒ.zI ˛.0/ /, and observe that
.˛I ˛.0/ / D 1 ı.˛.0/ I ˛/; (4.16)
implying in particular that
.˛I ˛/ D 1 ı.˛I ˛/  P : (4.17)
1
Now, suppose that the distribution of the random vector u D  e is N.0; I/ or, more generally,
that the pdf f . I / of the distribution of the random variable (4.7) is such that for every  > 0 the
ratio f .xI /=f .xI 0/ is (for 0 < x < 1) a nondecreasing function of x. Then, in light of equalities
(4.15) and (4.17), it follows from our previous results (on the optimality of the F test) that (for every

value of ˛ and regardless of the choice for ˛.0/ ) .˛I ˛.0/ / attains its maximum value when A.z/ Q is
taken to be the set AQF .z/ [in which case .zI
Q ˛.0/ / is the critical function of the test that rejects HQ 0
if XQ .˛.0/ / > BN P ŒM =2; .N P /=2 or (for values of z such that z3 ¤ 0) the critical function of
the F test (of HQ 0 versus HQ 1 )]. And in light of equality (4.16), we conclude that among confidence
sets for ˛ whose probability of coverage equals or exceeds 1 P and that are invariant with respect
to the group GQ 1 and equivariant with respect to the groups GQ 2 and GQ 3 , the confidence set AQF .z/ is
UMA (uniformly most accurate) in the sense that (for every false value of ˛) its probability of false
coverage is a minimum.
This result on the optimality of the 100(1 − γ̇)% confidence set Ã_F(z) (for the vector α) can be translated into a result on the optimality of the following 100(1 − γ̇)% confidence set for the vector τ:

    A_F(y) = {τ̇ : τ̇ = W'α̇, α̇ ∈ Ã_F(O'y)} = {τ̇ ∈ C(Λ') : X(τ̇) ≤ B̄_γ̇[M*/2, (N − P*)/2]},

where (for τ̇ ∈ C(Λ'))

    X(τ̇) = (τ̂ − τ̇)'C⁻(τ̂ − τ̇) / [(τ̂ − τ̇)'C⁻(τ̂ − τ̇) + y'(I − P_X)y]   if τ̂ ≠ τ̇ or y'(I − P_X)y > 0,
    X(τ̇) = 0   if τ̂ = τ̇ and y'(I − P_X)y = 0.

The set AF .y/ is invariant with respect to the group G1 and equivariant with respect to the groups
G2 and G3 consisting respectively of the totality of the groups G2 . .0/ / [ .0/ 2 C.ƒ0 /] and the
totality of the groups G3 . .0/ / [ .0/ 2 C.ƒ0 /]. And the probability PrŒ .0/ 2 AF .y/ of the set AF .y/
covering any particular vector  .0/ in C.ƒ0 / equals the probability PrŒS0  .0/ 2 AQF .z/ of the set
AQF .z/ covering the vector S0  .0/.
Corresponding to any confidence set A.y/ (for ) whose probability of coverage equals or exceeds
1 P and that is invariant with respect to the group G1 and equivariant with respect to the groups G2
and G3 is a confidence set A.z/Q (for ˛) defined as follows:
Q D f˛P W ˛P D S0;
A.z/ P P 2 A.Oz/g:
This set is such that
A.y/ D fP W P D W 0˛; Q 0 y/g;
P ˛P 2 A.O
and it is invariant with respect to the group GQ 1 and equivariant with respect to the groups GQ 2 and
GQ 3 . Moreover, for any vector  .0/ 2 C.ƒ0 /,
PrŒ .0/ 2 A.y/ D PrŒS0  .0/ 2 A.z/
Q
—in particular, PrŒ 2 A.y/ D PrŒ˛ 2 A.z/. Q Since the confidence set AQF .z/ is UMA among
confidence sets for ˛ whose probability of coverage equals or exceeds 1 P and that are invariant
with respect to the group GQ 1 and equivariant with respect to the groups GQ 2 and GQ 3 , we conclude that
the confidence set AF .y/ is UMA among confidence sets for  whose probability of coverage equals
or exceeds 1 P and that are invariant with respect to the group G1 and equivariant with respect to
the groups G2 and G3 .

e. Average power and average probability of false coverage


Let us consider further the problem of testing the null hypothesis H₀: τ = τ⁽⁰⁾ or H̃₀: α = α⁽⁰⁾ versus the alternative hypothesis H₁: τ ≠ τ⁽⁰⁾ or H̃₁: α ≠ α⁽⁰⁾. In Subsection d, it was shown (under an assumption of normality) that the size-γ̇ F test is optimal in the sense that it is UMP among tests whose size does not exceed γ̇ and that are invariant with respect to certain groups of transformations. In what follows, it is shown that the size-γ̇ F test is also optimal in another sense. To proceed, we require certain definitions/results involving the integration of a function over a hypersphere.

Integration of a function over a hypersphere. Let N represent a positive integer greater than or

equal to 2. And consider the integration of a function g(s) of an N × 1 vector s over a set S_N(s⁽⁰⁾, ρ) defined for s⁽⁰⁾ ∈ R^N and ρ > 0 as follows:

    S_N(s⁽⁰⁾, ρ) = {s ∈ R^N : (s − s⁽⁰⁾)'(s − s⁽⁰⁾) = ρ²}.

When N = 2, S_N(s⁽⁰⁾, ρ) is a circle, and when N = 3, it is a sphere. More generally, S_N(s⁽⁰⁾, ρ) is referred to as a hypersphere (of dimension N − 1). This circle, sphere, or hypersphere is centered at the point s⁽⁰⁾ and is of radius ρ.

Integration over the set S_N(s⁽⁰⁾, ρ) is related to integration over a set B_N defined as follows:

    B_N = {x ∈ R^N : x'x ≤ 1}.

This set is a closed ball; it is centered at the origin 0 and is of radius 1.

Let us write ∫_{S_N(s⁽⁰⁾,ρ)} g(s) ds for the integral of the function g(·) over the hypersphere S_N(s⁽⁰⁾, ρ) centered at s⁽⁰⁾ and of radius ρ. In the special case of a hypersphere centered at the origin and of radius 1,

    ∫_{S_N(0,1)} g(s) ds = N ∫_{B_N} g[(x'x)^{−1/2} x] dx                                     (4.18)

(for x = 0, define (x'x)^{−1/2} x = 0). More generally (in the special case of a hypersphere centered at the origin and of radius ρ),

    ∫_{S_N(0,ρ)} g(s) ds = N ρ^{N−1} ∫_{B_N} g[ρ(x'x)^{−1/2} x] dx.                            (4.19)

And still more generally,

    ∫_{S_N(s⁽⁰⁾,ρ)} g(s) ds = ∫_{S_N(0,ρ)} g(s⁽⁰⁾ + s̃) ds̃                                      (4.20)
                           = N ρ^{N−1} ∫_{B_N} g[s⁽⁰⁾ + ρ(x'x)^{−1/2} x] dx.                   (4.21)

Following Baker (1997), let us regard equality (4.18) or, more generally, equality (4.19) or (4.21) as a definition; Baker indicated that he suspects equality (4.18) “is folkloric, and likely has appeared as a theorem rather than a definition.” Various basic results on integration over a hypersphere follow readily from equality (4.18), (4.19), or (4.21). In particular, for any two constants a and b and for “any” two functions g₁(s) and g₂(s) of an N × 1 vector s,

    ∫_{S_N(s⁽⁰⁾,ρ)} [a g₁(s) + b g₂(s)] ds = a ∫_{S_N(s⁽⁰⁾,ρ)} g₁(s) ds + b ∫_{S_N(s⁽⁰⁾,ρ)} g₂(s) ds,   (4.22)

and for any N × N orthogonal matrix O [and any function g(s) of an N × 1 vector s],

    ∫_{S_N(0,ρ)} g(Os) ds = ∫_{S_N(0,ρ)} g(s) ds.                                              (4.23)

In the special case where g(s) = 1 (for every N × 1 vector s), ∫_{S_N(s⁽⁰⁾,ρ)} g(s) ds represents the “surface area” of the (N − 1)-dimensional hypersphere S_N(s⁽⁰⁾, ρ). As demonstrated by Baker (1997),

    ∫_{S_N(s⁽⁰⁾,ρ)} ds = [2π^{N/2}/Γ(N/2)] ρ^{N−1}.                                            (4.24)
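Definition (4.19) translates directly into a simple Monte Carlo scheme: the integral over the ball B_N can be estimated by averaging over points drawn uniformly from the enclosing cube, with points falling outside the ball contributing zero. The following sketch (not part of the text, assuming Python with NumPy) uses g(s) = 1 so that the estimate can be compared with the surface-area formula (4.24); the draw count and the choice N = 3, ρ = 2 are illustrative.

    import numpy as np
    from math import gamma, pi

    def sphere_integral_mc(g, N, rho, n_draws=200_000, seed=0):
        # definition (4.19): N * rho^(N-1) * integral over B_N of g(rho * x/||x||)
        rng = np.random.default_rng(seed)
        x = rng.uniform(-1.0, 1.0, size=(n_draws, N))
        r = np.sqrt((x * x).sum(axis=1))
        inside = r <= 1.0
        vals = np.zeros(n_draws)
        vals[inside] = [g(rho * xi / ri) for xi, ri in zip(x[inside], r[inside])]
        ball_integral = (2.0 ** N) * vals.mean()        # cube volume times average value
        return N * rho ** (N - 1) * ball_integral

    N, rho = 3, 2.0
    print(sphere_integral_mc(lambda s: 1.0, N, rho))            # Monte Carlo estimate
    print(2 * pi ** (N / 2) * rho ** (N - 1) / gamma(N / 2))    # exact value (4.24): 16*pi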
The integration of a function over a hypersphere is defined for hyperspheres of dimension 1 or
more by equality (4.21) or, in special cases, by equality (4.18) or (4.19). It is convenient to also
define the integration of a function over a hypersphere for a “hypersphere” S1.s .0/; / of dimension
0 (centered at s .0/ and of radius ), that is, for the “hypersphere”
S1.s .0/; / D fs 2 R1 W .s s .0/ /2 D  2 g D fs 2 R1 W s D s .0/ ˙g: (4.25)
.0/
Let us write g.s/ ds for the integral of a function g./ over the set S1.s ; /, and define
R
S1.s .0/; /
.0/
C/ C g.s .0/ /: (4.26)
R
S1.s .0/; / g.s/ ds D g.s

It is also convenient to extend the definition of the integral SN .s.0/; / g.s/ ds to “hyperspheres” of
R

radius 0, that is, to  D 0. For N  1, let us take


.0/
(4.27)
R
SN .s.0/; 0/ g.s/ ds D g.s /:
Note that when the definition of the integral of a function $g(\cdot)$ over the set $S_N(s^{(0)},\rho)$ is extended to $N=1$ and/or $\rho=0$ via equalities (4.26) and/or (4.27), properties (4.22) and (4.23) continue to apply. Note also that
\[ \int_{S_N(s^{(0)},0)}ds=1 \tag{4.28} \]
and that (for $\rho>0$)
\[ \int_{S_1(s^{(0)},\rho)}ds=2. \tag{4.29} \]
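As a quick numerical illustration of definitions (4.19) and (4.24), the following sketch (a minimal example assuming Python with NumPy and SciPy; the function and variable names are illustrative, not from the text) evaluates the ball integral on the right side of (4.19) by simple Monte Carlo sampling in the cube and compares the result, for $g(s)=1$, with the surface-area formula (4.24).

```python
import numpy as np
from scipy.special import gamma

def sphere_integral_mc(g, N, rho, draws=200_000, seed=0):
    """Monte Carlo evaluation of the right-hand side of (4.19):
    N * rho**(N-1) * integral over the unit ball B_N of g(rho*(x'x)^(-1/2) x) dx.
    Points are drawn uniformly in the cube [-1, 1]^N; those falling inside
    B_N contribute, and the cube volume 2**N rescales the sample average."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=(draws, N))
    r = np.linalg.norm(x, axis=1)
    inside = r <= 1.0
    safe_r = np.where(r[:, None] == 0, 1.0, r[:, None])   # guard the origin
    vals = np.where(inside, g(rho * x / safe_r), 0.0)      # g at the radial projection
    ball_integral = (2.0 ** N) * vals.mean()
    return N * rho ** (N - 1) * ball_integral

# With g(s) = 1, the integral is the surface area (4.24): 2*pi^(N/2)*rho^(N-1)/Gamma(N/2).
N, rho = 4, 2.0
est = sphere_integral_mc(lambda s: np.ones(s.shape[0]), N, rho)
exact = 2 * np.pi ** (N / 2) * rho ** (N - 1) / gamma(N / 2)
print(f"Monte Carlo: {est:.2f}   exact (4.24): {exact:.2f}")
```

The cube-rejection scheme is adequate only for small $N$ (the acceptance rate decays rapidly with dimension); it is used here purely to make the defining identity concrete.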
An optimality criterion: average power. Let us now resume discussion of the problem of testing the null hypothesis $H_0\colon\tau=\tau^{(0)}$ or $\tilde H_0\colon\alpha=\alpha^{(0)}$ versus the alternative hypothesis $H_1\colon\tau\ne\tau^{(0)}$ or $\tilde H_1\colon\alpha\ne\alpha^{(0)}$. In doing so, let us take the context to be that of Section 7.3, and let us adopt the notation and terminology introduced therein. Thus, $y$ is taken to be an $N\times 1$ observable random vector that follows the G–M model, and $z=O'y=(z_1',z_2',z_3')'=(\hat\alpha',\hat\gamma',d')'$. Moreover, in what follows, it is assumed that the distribution of the vector $e$ of residual effects in the G–M model is $N(0,\sigma^2 I)$, in which case $y\sim N(X\beta,\sigma^2 I)$ and $z\sim N[(\alpha',\gamma',0')',\sigma^2 I]$.
Let $\tilde\phi(z)$ represent (in terms of the transformed vector $z$) the critical function of an arbitrary (possibly randomized) test of the null hypothesis $\tilde H_0\colon\alpha=\alpha^{(0)}$ [so that $0\le\tilde\phi(z)\le 1$ for every value of $z$]. And let $\tilde\psi(\alpha,\gamma,\sigma)$ represent the power function of this test. By definition,
\[ \tilde\psi(\alpha,\gamma,\sigma)=E[\tilde\phi(z)]. \]
Further, let $\tilde\alpha=\alpha-\alpha^{(0)}$, and take $\bar\psi(\rho,\gamma,\sigma)$ to be the function of $\rho$, $\gamma$, and $\sigma$ defined (for $\rho\ge 0$) as
\[ \bar\psi(\rho,\gamma,\sigma)=\frac{\int_{S(\alpha^{(0)},\rho)}\tilde\psi(\alpha,\gamma,\sigma)\,d\alpha}{\int_{S(\alpha^{(0)},\rho)}d\alpha}, \]
where $S(\alpha^{(0)},\rho)=\{\alpha : (\alpha-\alpha^{(0)})'(\alpha-\alpha^{(0)})=\rho^2\}$, or equivalently as
\[ \bar\psi(\rho,\gamma,\sigma)=\frac{\int_{S(0,\rho)}\tilde\psi(\alpha^{(0)}+\tilde\alpha,\gamma,\sigma)\,d\tilde\alpha}{\int_{S(0,\rho)}d\tilde\alpha}, \]
where $S(0,\rho)=\{\tilde\alpha : \tilde\alpha'\tilde\alpha=\rho^2\}$ (for convenience, the same symbol, e.g., $\alpha$, is sometimes used for more than one purpose).
The function $\bar\psi(\rho,\gamma,\sigma)$ provides in whole or in part a possible basis for the evaluation of the test. For $\rho>0$, the value of $\bar\psi(\rho,\gamma,\sigma)$ represents the average (with respect to the value of $\alpha$) power of the test over a hypersphere centered at (the hypothesized value) $\alpha^{(0)}$ and of radius $\rho$. When $\rho=0$, this value equals the Type-I error of the test, that is, the probability of falsely rejecting $\tilde H_0$; in general, this value can vary with $\gamma$ and/or $\sigma$.
Upon writing $f(\,\cdot\,;\alpha,\gamma,\sigma)$ for the pdf of the distribution of $z$, we find that
\[ \bar\psi(\rho,\gamma,\sigma)=\frac{\int_{S(0,\rho)}\int_{\mathbb{R}^N}\tilde\phi(z)f(z;\alpha^{(0)}+\tilde\alpha,\gamma,\sigma)\,dz\,d\tilde\alpha}{\int_{S(0,\rho)}d\tilde\alpha}=\frac{\int_{\mathbb{R}^N}\tilde\phi(z)\int_{S(0,\rho)}f(z;\alpha^{(0)}+\tilde\alpha,\gamma,\sigma)\,d\tilde\alpha\,dz}{\int_{S(0,\rho)}d\tilde\alpha}. \tag{4.30} \]
And upon writing $f_1(\,\cdot\,;\alpha,\sigma)$, $f_2(\,\cdot\,;\gamma,\sigma)$, and $f_3(\,\cdot\,;\sigma)$ for the pdfs of the distributions of $\hat\alpha$, $\hat\gamma$, and $d$, we find that
\[ \frac{\int_{S(0,\rho)}f(z;\alpha^{(0)}+\tilde\alpha,\gamma,\sigma)\,d\tilde\alpha}{\int_{S(0,\rho)}d\tilde\alpha}=f_1^*(\hat\alpha;\rho,\sigma)\,f_2(\hat\gamma;\gamma,\sigma)\,f_3(d;\sigma), \tag{4.31} \]
where
\[ f_1^*(\hat\alpha;\rho,\sigma)=\frac{\int_{S(0,\rho)}f_1(\hat\alpha;\alpha^{(0)}+\tilde\alpha,\sigma)\,d\tilde\alpha}{\int_{S(0,\rho)}d\tilde\alpha}=(2\pi\sigma^2)^{-M^*/2}\,e^{-(\hat\alpha-\alpha^{(0)})'(\hat\alpha-\alpha^{(0)})/(2\sigma^2)}\,e^{-\rho^2/(2\sigma^2)}\,\kappa(\hat\alpha-\alpha^{(0)};\rho,\sigma) \]
and where $\kappa(\,\cdot\,;\rho,\sigma)$ is a function whose value is defined for every $M^*\times 1$ vector $x$ as follows:
\[ \kappa(x;\rho,\sigma)=\frac{\int_{S(0,\rho)}e^{x'\tilde\alpha/\sigma^2}\,d\tilde\alpha}{\int_{S(0,\rho)}d\tilde\alpha}. \]
Note that
\[ f_1^*(\hat\alpha;0,\sigma)=(2\pi\sigma^2)^{-M^*/2}\,e^{-(\hat\alpha-\alpha^{(0)})'(\hat\alpha-\alpha^{(0)})/(2\sigma^2)}=f_1(\hat\alpha;\alpha^{(0)},\sigma). \tag{4.32} \]
The function $\kappa(\,\cdot\,;\rho,\sigma)$ is such that for every $M^*\times M^*$ orthogonal matrix $O$ and every $M^*\times 1$ vector $x$,
\[ \kappa(Ox;\rho,\sigma)=\kappa(x;\rho,\sigma), \]
as is evident from result (4.23). Moreover, corresponding to any $M^*\times 1$ vectors $x_1$ and $x_2$ such that $x_2'x_2=x_1'x_1$, there exists an orthogonal matrix $O$ such that $x_2=Ox_1$, as is evident from Lemma 5.9.9. Thus, $\kappa(x;\rho,\sigma)$ depends on $x$ only through the value of $x'x$. That is,
\[ \kappa(x;\rho,\sigma)=\tilde\kappa(x'x;\rho,\sigma) \]
for some function $\tilde\kappa(\,\cdot\,;\rho,\sigma)$ of a single (nonnegative) variable. And $f_1^*(\hat\alpha;\rho,\sigma)$ is reexpressible as
\[ f_1^*(\hat\alpha;\rho,\sigma)=(2\pi\sigma^2)^{-M^*/2}\,e^{-(\hat\alpha-\alpha^{(0)})'(\hat\alpha-\alpha^{(0)})/(2\sigma^2)}\,e^{-\rho^2/(2\sigma^2)}\,\tilde\kappa[(\hat\alpha-\alpha^{(0)})'(\hat\alpha-\alpha^{(0)});\rho,\sigma]. \tag{4.33} \]
Letting $x$ represent any $M^*\times 1$ vector such that $x'x=1$, we find that for any nonnegative scalar $t$,
\[ \tilde\kappa(t;\rho,\sigma)=\tilde\kappa(t\,x'x;\rho,\sigma)=\tfrac12\big[\kappa(-\sqrt t\,x;\rho,\sigma)+\kappa(\sqrt t\,x;\rho,\sigma)\big]. \]
Thus,
\[ \tilde\kappa(0;\rho,\sigma)=1. \tag{4.34} \]
Moreover, for $t>0$,
\[ \frac{d\tilde\kappa(t;\rho,\sigma)}{dt}=\frac{\tfrac12\int_{S(0,\rho)}\dfrac{d\big(e^{-\sqrt t\,x'\tilde\alpha/\sigma^2}+e^{\sqrt t\,x'\tilde\alpha/\sigma^2}\big)}{dt}\,d\tilde\alpha}{\int_{S(0,\rho)}d\tilde\alpha}; \]
and upon observing that (for $t>0$)
\[ \frac{d\big(e^{-\sqrt t\,x'\tilde\alpha/\sigma^2}+e^{\sqrt t\,x'\tilde\alpha/\sigma^2}\big)}{dt}=\tfrac12\,t^{-1/2}\,(x'\tilde\alpha/\sigma^2)\big(e^{\sqrt t\,x'\tilde\alpha/\sigma^2}-e^{-\sqrt t\,x'\tilde\alpha/\sigma^2}\big)>0\quad(\text{unless }x'\tilde\alpha=0), \]
it follows that (for $t>0$ and $\rho>0$)
\[ \frac{d\tilde\kappa(t;\rho,\sigma)}{dt}>0 \tag{4.35} \]
and hence that (unless $\rho=0$) $\tilde\kappa(\,\cdot\,;\rho,\sigma)$ is a strictly increasing function.
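Results (4.34) and (4.35) can be checked numerically. The sketch below (assuming Python with NumPy; the names kappa_tilde, m_star, and so forth are illustrative) approximates $\tilde\kappa(t;\rho,\sigma)$ by averaging $e^{x'\tilde\alpha/\sigma^2}$ over uniform draws on the sphere $S(0,\rho)$; since $\kappa$ depends on $x$ only through $x'x$, the particular choice $x=(\sqrt t,0,\ldots,0)'$ involves no loss of generality.

```python
import numpy as np

def kappa_tilde(t, rho, sigma, m_star, draws=200_000, seed=0):
    """Monte Carlo approximation of kappa_tilde(t; rho, sigma): the average of
    exp(x'a / sigma^2) over a uniform on the sphere of radius rho in R^{m_star},
    where x is any vector with x'x = t.  Uniform points on the sphere are obtained
    by normalizing standard-normal draws."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((draws, m_star))
    a = rho * u / np.linalg.norm(u, axis=1, keepdims=True)   # points on S(0, rho)
    x = np.zeros(m_star)
    x[0] = np.sqrt(t)                                        # any x with x'x = t will do
    return np.mean(np.exp(a @ x / sigma ** 2))

rho, sigma, m_star = 2.0, 1.5, 3
for t in [0.0, 0.5, 1.0, 2.0, 4.0]:
    print(t, round(kappa_tilde(t, rho, sigma, m_star), 4))
# kappa_tilde(0) equals 1, and the values increase with t, in line with (4.34) and (4.35).
```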
An equivalence. Let $T$ represent a collection of (possibly randomized) tests of $H_0$ or $\tilde H_0$ (versus $H_1$ or $\tilde H_1$). And consider the problem of identifying a test in $T$ for which the value of $\bar\psi(\rho,\gamma,\sigma)$ (at any particular values of $\rho$, $\gamma$, and $\sigma$) is greater than or equal to its value for every other test in $T$. This problem can be transformed into a problem that is equivalent to the original but that is more directly amenable to solution by conventional means.
Together, results (4.30) and (4.31) imply that
\[ \bar\psi(\rho,\gamma,\sigma)=\int_{\mathbb{R}^N}\tilde\phi(z)\,f^*(z;\rho,\gamma,\sigma)\,dz, \tag{4.36} \]
where $f^*(z;\rho,\gamma,\sigma)=f_1^*(\hat\alpha;\rho,\sigma)f_2(\hat\gamma;\gamma,\sigma)f_3(d;\sigma)$. Moreover, $f_1^*(\hat\alpha;\rho,\sigma)\ge 0$ (for every value of $\hat\alpha$) and $\int_{\mathbb{R}^{M^*}}f_1^*(\hat\alpha;\rho,\sigma)\,d\hat\alpha=1$, so that $f_1^*(\,\cdot\,;\rho,\sigma)$ can serve as the pdf of an absolutely continuous distribution. And it follows that
\[ \bar\psi(\rho,\gamma,\sigma)=E^*[\tilde\phi(z)], \tag{4.37} \]
where the symbol $E^*$ denotes an expected value obtained when the underlying distribution of $z$ is taken to be that with pdf $f^*(\,\cdot\,;\rho,\gamma,\sigma)$ rather than that with pdf $f(\,\cdot\,;\alpha,\gamma,\sigma)$.
In the transformed version of the problem of identifying a test of $H_0$ or $\tilde H_0$ that maximizes the value of $\bar\psi(\rho,\gamma,\sigma)$, the distribution of $\hat\alpha$ is taken to be that with pdf $f_1^*(\,\cdot\,;\rho,\sigma)$ and the distribution of the observable random vector $z=(\hat\alpha',\hat\gamma',d')'$ is taken to be that with pdf $f^*(\,\cdot\,;\rho,\gamma,\sigma)$. Further, the nonnegative scalar $\rho$ (as well as $\sigma$ and each element of $\gamma$) is regarded as an unknown parameter. And the test with critical function $\tilde\phi(\cdot)$ is regarded as a test of the null hypothesis $H_0^*\colon\rho=0$ versus the alternative hypothesis $H_1^*\colon\rho>0$ (and $T$ is regarded as a collection of tests of $H_0^*$). Note [in light of result (4.37)] that in this context, $\bar\psi(\rho,\gamma,\sigma)$ is interpretable as the power function of the test with critical function $\tilde\phi(\cdot)$.
The transformed version of the problem consists of identifying a test in $T$ for which the value of the power function (at the particular values of $\rho$, $\gamma$, and $\sigma$) is greater than or equal to the value of the power function for every other test in $T$. The transformed version is equivalent to the original version in that a test of $H_0$ or $\tilde H_0$ is a solution to the original version if and only if its critical function is the critical function of a test of $H_0^*$ that is a solution to the transformed version.
In what follows, $T$ is taken to be the collection of all size-$\dot\gamma$ similar tests of $H_0$ or $\tilde H_0$. And a solution to the problem of identifying a test in $T$ for which the value of $\bar\psi(\rho,\gamma,\sigma)$ is greater than or equal to its value for every other test in $T$ is effected by solving the transformed version of this problem. Note that in the context of the transformed version, $T$ represents the collection of all size-$\dot\gamma$ similar tests of $H_0^*$, as is evident upon observing [in light of result (4.32)] that a test of $H_0^*$ and a test of $H_0$ or $\tilde H_0$ that have the same critical function have the same Type-I error.
A sufficient statistic. Suppose that the distribution of $\hat\alpha$ is taken to be that with pdf $f_1^*(\,\cdot\,;\rho,\sigma)$ and the distribution of $z=(\hat\alpha',\hat\gamma',d')'$ to be that with pdf $f^*(\,\cdot\,;\rho,\gamma,\sigma)$. Further, let
\[ u=(\hat\alpha-\alpha^{(0)})'(\hat\alpha-\alpha^{(0)})\quad\text{and}\quad v=d'd. \]
And define
\[ s=\frac{u}{u+v}\quad\text{and}\quad w=u+v. \]
Then, clearly, $u$, $v$, and $\hat\gamma$ form a sufficient statistic [recall result (4.33)], and $s$, $w$, and $\hat\gamma$ also form a sufficient statistic.
The random vectors $\hat\alpha$, $\hat\gamma$, and $d$ are distributed independently and, consequently, $u$, $v$, and $\hat\gamma$ are distributed independently. And the random variable $u$ has an absolutely continuous distribution with a pdf $g_1(\,\cdot\,;\rho,\sigma)$ that is derivable from the pdf $f_1^*(\,\cdot\,;\rho,\sigma)$ of $\hat\alpha$ by introducing successive changes in variables as follows: from $\hat\alpha$ to the vector $x=(x_1,x_2,\ldots,x_{M^*})'=\hat\alpha-\alpha^{(0)}$, from $x$ to the $M^*\times 1$ vector $t$ whose $i$th element is $t_i=x_i^2$, from $t$ to the $M^*\times 1$ vector $y$ whose first $M^*-1$ elements are $y_i=t_i$ ($i=1,2,\ldots,M^*-1$) and whose $M^*$th element is $y_{M^*}=\sum_{i=1}^{M^*}t_i$, and finally from $y$ to the vector whose first $M^*-1$ elements are $y_i/y_{M^*}$ ($i=1,2,\ldots,M^*-1$) and whose $M^*$th element is $y_{M^*}$ (refer to Section 6.1g for the details). Upon introducing these changes of variables and upon observing that $u=y_{M^*}$, we find that (for $u>0$)
\[ g_1(u;\rho,\sigma)\propto u^{(M^*/2)-1}\,e^{-u/(2\sigma^2)}\,\tilde\kappa(u;\rho,\sigma). \tag{4.38} \]
Moreover, the random variable $v$ has an absolutely continuous distribution with a pdf $g_2(\,\cdot\,;\sigma)$ such that (for $v>0$)
\[ g_2(v;\sigma)=\frac{1}{\Gamma[(N-P)/2]\,(2\sigma^2)^{(N-P)/2}}\,v^{[(N-P)/2]-1}\,e^{-v/(2\sigma^2)}, \tag{4.39} \]
as is evident upon observing that $v/\sigma^2$ has a chi-square distribution with $N-P$ degrees of freedom.
The random variables $s$ and $w$ have a joint distribution that is absolutely continuous with a pdf $h(\,\cdot\,,\cdot\,;\rho,\sigma)$ that is determinable (via a change of variables) from the pdfs of the distributions of $u$ and $v$. For $0<s<1$ and $w>0$, we find that
\[ h(s,w;\rho,\sigma)=g_1(sw;\rho,\sigma)\,g_2[(1-s)w;\sigma]\,w\propto s^{(M^*/2)-1}(1-s)^{[(N-P)/2]-1}\,w^{[(N-P+M^*)/2]-1}\,e^{-w/(2\sigma^2)}\,\tilde\kappa(sw;\rho,\sigma). \tag{4.40} \]
Corresponding to a test of $H_0^*$ versus $H_1^*$ with critical function $\tilde\phi(z)$ is the test with the critical function $E^*[\tilde\phi(z)\mid s,w,\hat\gamma]$ obtained upon taking the expected value of $\tilde\phi(z)$ conditional on $s$, $w$, and $\hat\gamma$. Moreover,
\[ E^*\{E^*[\tilde\phi(z)\mid s,w,\hat\gamma]\}=E^*[\tilde\phi(z)], \tag{4.41} \]
so that the test with critical function $E^*[\tilde\phi(z)\mid s,w,\hat\gamma]$ has the same power function as the test with critical function $\tilde\phi(z)$. Thus, for purposes of identifying a size-$\dot\gamma$ similar test of $H_0^*$ for which the value of the power function (at particular values of $\rho$, $\gamma$, and $\sigma$) is greater than or equal to the value of the power function for every other size-$\dot\gamma$ similar test of $H_0^*$, it suffices to restrict attention to tests that depend on $z=(\hat\alpha',\hat\gamma',d')'$ only through the values of $s$, $w$, and $\hat\gamma$.
Conditional Type-I error and conditional power. Let $\tilde\phi(s,w,\hat\gamma)$ represent the critical function of a (possibly randomized) test (of $H_0^*$ versus $H_1^*$) that depends on $z$ only through the value of the sufficient statistic formed by $s$, $w$, and $\hat\gamma$. And let $\psi^*(\rho,\gamma,\sigma)$ represent the power function of the test. Then,
\[ \psi^*(\rho,\gamma,\sigma)=E^*[\tilde\phi(s,w,\hat\gamma)]. \tag{4.42} \]
And the test is a size-$\dot\gamma$ similar test if and only if
\[ \psi^*(0,\gamma,\sigma)=\dot\gamma \tag{4.43} \]
for all values of $\gamma$ and $\sigma$.
The conditional expected value $E^*[\tilde\phi(s,w,\hat\gamma)\mid w,\hat\gamma]$ represents the conditional (on $w$ and $\hat\gamma$) probability of the test with critical function $\tilde\phi(s,w,\hat\gamma)$ rejecting the null hypothesis $H_0^*$. The power function $\psi^*(\rho,\gamma,\sigma)$ of this test can be expressed in terms of $E^*[\tilde\phi(s,w,\hat\gamma)\mid w,\hat\gamma]$. Upon reexpressing the right side of equality (4.42) in terms of this conditional expected value, we find that
\[ \psi^*(\rho,\gamma,\sigma)=E^*\{E^*[\tilde\phi(s,w,\hat\gamma)\mid w,\hat\gamma]\}. \tag{4.44} \]
Let us write $E_0^*[\tilde\phi(s,w,\hat\gamma)\mid w,\hat\gamma]$ for $E^*[\tilde\phi(s,w,\hat\gamma)\mid w,\hat\gamma]$ in the special case where $\rho=0$, so that $E_0^*[\tilde\phi(s,w,\hat\gamma)\mid w,\hat\gamma]$ represents the conditional (on $w$ and $\hat\gamma$) probability of a Type-I error. When $\rho=0$, $s$ is distributed independently of $w$ and $\hat\gamma$ as a $\mathrm{Be}[M^*/2,(N-P)/2]$ random variable, so that (under $H_0^*$) the conditional distribution of $s$ given $w$ and $\hat\gamma$ is an absolutely continuous distribution with a pdf $h_0(s)$ that is expressible (for $0<s<1$) as
\[ h_0(s)=\frac{1}{B[M^*/2,(N-P)/2]}\,s^{(M^*/2)-1}(1-s)^{[(N-P)/2]-1} \]
[refer to result (4.40) and note that $\tilde\kappa(sw;0,\sigma)=1$]. And (for every value of $w$ and every value of $\hat\gamma$)
\[ E_0^*[\tilde\phi(s,w,\hat\gamma)\mid w,\hat\gamma]=\int_0^1\tilde\phi(s,w,\hat\gamma)\,h_0(s)\,ds. \]
Under $H_0^*$ (i.e., when $\rho=0$), $w$ and $\hat\gamma$ form a complete sufficient statistic [for distributions of $z$ with a pdf of the form $f^*(z;0,\gamma,\sigma)$]; refer to Section 5.8. And in light of result (4.44), it follows that the test with critical function $\tilde\phi(s,w,\hat\gamma)$ is a size-$\dot\gamma$ similar test if and only if
\[ E_0^*[\tilde\phi(s,w,\hat\gamma)\mid w,\hat\gamma]=\dot\gamma\quad(\text{wp1}). \tag{4.45} \]
Thus, a size-$\dot\gamma$ similar test of $H_0^*$ for which the value of $\psi^*(\rho,\gamma,\sigma)$ (at any particular values of $\rho$, $\gamma$, and $\sigma$) is greater than or equal to its value for any other size-$\dot\gamma$ similar test can be obtained by taking the critical function $\tilde\phi(s,w,\hat\gamma)$ of the test to be that derived by regarding (for each value of $w$ and each value of $\hat\gamma$) $\tilde\phi(s,w,\hat\gamma)$ as a function of $s$ alone and by maximizing the value of $E^*[\tilde\phi(s,w,\hat\gamma)\mid w,\hat\gamma]$ (with respect to the choice of that function) subject to the constraint $E_0^*[\tilde\phi(s,w,\hat\gamma)\mid w,\hat\gamma]=\dot\gamma$.
When $\rho>0$ (as when $\rho=0$), $\hat\gamma$ is distributed independently of $s$ and $w$. However, when $\rho>0$ (unlike when $\rho=0$), $s$ and $w$ are statistically dependent. And in general (i.e., for $\rho\ge 0$), the conditional distribution of $s$ given $w$ and $\hat\gamma$ is an absolutely continuous distribution with a pdf $h_+(s\mid w)$ such that
\[ h_+(s\mid w)\propto h(s,w;\rho,\sigma); \tag{4.46} \]
this distribution varies with the values of $\rho$ and $\sigma$ as well as the value of $w$.
Application of the Neyman–Pearson lemma. Observe [in light of results (4.46) and (4.40)] that (for $0<s<1$)
\[ \frac{h_+(s\mid w)}{h_0(s)}\propto\tilde\kappa(sw;\rho,\sigma). \]
Observe also that (when $w>0$ and $\rho>0$) $\tilde\kappa(sw;\rho,\sigma)$ is strictly increasing in $s$ [as is evident upon recalling that (when $\rho>0$) $\tilde\kappa(\,\cdot\,;\rho,\sigma)$ is a strictly increasing function]. Further, let $\tilde\phi^*(s)$ represent the critical function of a (size-$\dot\gamma$) test (of $H_0^*$) that depends on $z$ only through the value of $s$ and that is defined as follows:
\[ \tilde\phi^*(s)=\begin{cases}1, & \text{if }s>k,\\ 0, & \text{if }s\le k,\end{cases} \]
where $\int_k^1 h_0(s)\,ds=\dot\gamma$. Then, it follows from an extended (to cover randomized tests) version of Theorem 7.4.1 (the Neyman–Pearson lemma) that (for every strictly positive value of $\rho$ and for all values of $\gamma$ and $\sigma$) the "conditional power" $E^*[\tilde\phi(s,w,\hat\gamma)\mid w,\hat\gamma]$ of the test with critical function $\tilde\phi(s,w,\hat\gamma)$ attains its maximum value, subject to the constraint $E_0^*[\tilde\phi(s,w,\hat\gamma)\mid w,\hat\gamma]=\dot\gamma$, when $\tilde\phi(s,w,\hat\gamma)$ is taken to be the function $\tilde\phi^*(s)$. And (in light of the results of the preceding two parts of the present subsection) we conclude that the test of $H_0^*$ with critical function $\tilde\phi^*(s)$ is UMP among all size-$\dot\gamma$ similar tests.
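Because $s$ is an increasing function of the F statistic, the cutoff $k$ can be expressed either through a Beta quantile or through an F quantile. The following sketch (assuming Python with SciPy; the dimensions $M^*=3$, $N-P=20$, and $\dot\gamma=0.05$ are illustrative) checks numerically that the rule "reject when $s>k$", with $k$ the upper $\dot\gamma$ point of the $\mathrm{Be}[M^*/2,(N-P)/2]$ distribution, coincides with the usual size-$\dot\gamma$ F test.

```python
from scipy.stats import beta, f

# Illustrative dimensions (assumptions, not taken from the text)
m_star, n_minus_p, gam = 3, 20, 0.05

# Cutoff k for "reject when s > k", with s ~ Be[m*/2, (N-P)/2] under the null
k = beta.ppf(1 - gam, m_star / 2, n_minus_p / 2)

# The F statistic is a monotone function of s: F = [(N-P)/m*] * s/(1-s),
# so "s > k" is the same event as "F > F-bar_{gamma}(m*, N-P)".
F_crit = f.ppf(1 - gam, m_star, n_minus_p)
print(k, (n_minus_p / m_star) * k / (1 - k), F_crit)   # the last two values agree
```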
Main result. Based on relationship (4.37) and equality (4.32), the result on the optimality of the test of $H_0^*$ with critical function $\tilde\phi^*(s)$ can be reexpressed as a result on the optimality of the test of $H_0$ or $\tilde H_0$ with the same critical function. Moreover, the test of $H_0$ or $\tilde H_0$ with critical function $\tilde\phi^*(s)$ is equivalent to the size-$\dot\gamma$ F test of $H_0$ or $\tilde H_0$. Thus, among size-$\dot\gamma$ similar tests of $H_0\colon\tau=\tau^{(0)}$ or $\tilde H_0\colon\alpha=\alpha^{(0)}$, the average value $\bar\psi(\rho,\gamma,\sigma)$ of the power function $\tilde\psi(\alpha,\gamma,\sigma)$ over those $\alpha$-values located on the sphere $S(\alpha^{(0)},\rho)$, centered at $\alpha^{(0)}$ and of radius $\rho$, is maximized (for every $\rho>0$ and for all $\gamma$ and $\sigma$) by taking the test of $H_0$ or $\tilde H_0$ to be the size-$\dot\gamma$ F test.
A corollary. Suppose that a test of $H_0\colon\tau=\tau^{(0)}$ or $\tilde H_0\colon\alpha=\alpha^{(0)}$ versus $H_1\colon\tau\ne\tau^{(0)}$ or $\tilde H_1\colon\alpha\ne\alpha^{(0)}$ is such that the power function $\tilde\psi(\alpha,\gamma,\sigma)$ depends on $\alpha$, $\gamma$, and $\sigma$ only through the value of $(\alpha-\alpha^{(0)})'(\alpha-\alpha^{(0)})/\sigma^2$, so that
\[ \tilde\psi(\alpha,\gamma,\sigma)=R^*[(\alpha-\alpha^{(0)})'(\alpha-\alpha^{(0)})/\sigma^2] \tag{4.47} \]
for some function $R^*(\cdot)$. Then, for all $\gamma$ and $\sigma$,
\[ \tilde\psi(\alpha^{(0)},\gamma,\sigma)=R^*(0). \tag{4.48} \]
And for $\rho\ge 0$ and for all $\gamma$ and $\sigma$,
\[ \bar\psi(\rho,\gamma,\sigma)=R^*(\rho^2/\sigma^2). \tag{4.49} \]
As noted in Section 7.3b, the size-$\dot\gamma$ F test is among those tests of $H_0$ or $\tilde H_0$ for which the power function $\tilde\psi(\alpha,\gamma,\sigma)$ is of the form (4.47). Moreover, any size-$\dot\gamma$ test of $H_0$ or $\tilde H_0$ for which $\tilde\psi(\alpha,\gamma,\sigma)$ is of the form (4.47) is a size-$\dot\gamma$ similar test, as is evident from result (4.48). And, together, results (4.47) and (4.49) imply that (for all $\alpha$, $\gamma$, and $\sigma$)
\[ \tilde\psi(\alpha,\gamma,\sigma)=\bar\psi\{[(\alpha-\alpha^{(0)})'(\alpha-\alpha^{(0)})]^{1/2},\gamma,\sigma\}. \]
Thus, it follows from what has already been proven that among size-$\dot\gamma$ tests of $H_0$ or $\tilde H_0$ (versus $H_1$ or $\tilde H_1$) with a power function of the form (4.47), the size-$\dot\gamma$ F test is a UMP test.
The power function of a test of $H_0$ or $\tilde H_0$ versus $H_1$ or $\tilde H_1$ can be expressed as either a function $\tilde\psi(\alpha,\gamma,\sigma)$ of $\alpha$, $\gamma$, and $\sigma$ or as a function $\psi(\tau,\gamma,\sigma)=\tilde\psi(S'\tau,\gamma,\sigma)$ of $\tau$, $\gamma$, and $\sigma$. Note [in light of result (3.48)] that for $\tilde\psi(\alpha,\gamma,\sigma)$ to depend on $\alpha$, $\gamma$, and $\sigma$ only through the value of $(\alpha-\alpha^{(0)})'(\alpha-\alpha^{(0)})/\sigma^2$, it is necessary and sufficient that $\psi(\tau,\gamma,\sigma)$ depend on $\tau$, $\gamma$, and $\sigma$ only through the value of $(\tau-\tau^{(0)})'C^-(\tau-\tau^{(0)})/\sigma^2$.
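The F test's power function has exactly the form (4.47): the F statistic follows a noncentral F distribution with noncentrality parameter $(\alpha-\alpha^{(0)})'(\alpha-\alpha^{(0)})/\sigma^2=\rho^2/\sigma^2$, so the power is constant over each sphere $S(\alpha^{(0)},\rho)$ and equals the average power (4.49). A small check of this, assuming Python with SciPy and illustrative dimensions:

```python
import numpy as np
from scipy.stats import f, ncf

# Illustrative dimensions (assumptions, not taken from the text)
m_star, n_minus_p, gam, sigma = 3, 20, 0.05, 1.0
F_crit = f.ppf(1 - gam, m_star, n_minus_p)

def f_test_power(alpha_diff, sigma):
    """Power of the size-gamma F test at a given alpha - alpha^(0); it depends on that
    vector only through its squared length divided by sigma^2, as in (4.47)."""
    lam = alpha_diff @ alpha_diff / sigma ** 2     # noncentrality rho^2 / sigma^2
    return ncf.sf(F_crit, m_star, n_minus_p, lam)

rng = np.random.default_rng(1)
rho = 2.0
for _ in range(3):
    u = rng.standard_normal(m_star)
    point = rho * u / np.linalg.norm(u)            # a point on the sphere S(alpha^(0), rho)
    print(f_test_power(point, sigma))              # identical at every point of the sphere
```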
Special case: $M=M^*=1$. Suppose that $M=M^*=1$. And recall (from Section 7.3b) that in this special case, the size-$\dot\gamma$ F test of $H_0\colon\tau=\tau^{(0)}$ or $\tilde H_0\colon\alpha=\alpha^{(0)}$ versus $H_1\colon\tau\ne\tau^{(0)}$ or $\tilde H_1\colon\alpha\ne\alpha^{(0)}$ simplifies to the (size-$\dot\gamma$) two-sided t test. Further, since the vectors $\alpha$ and $\tau$ are now $1\times 1$, let us write $\alpha$ and $\tau$ for their lone elements, and $\alpha^{(0)}$ and $\tau^{(0)}$ for the lone elements of $\alpha^{(0)}$ and $\tau^{(0)}$.
For $\rho>0$, $S(\alpha^{(0)},\rho)=\{\alpha : |\alpha-\alpha^{(0)}|=\rho\}$, so that $S(\alpha^{(0)},\rho)$ consists of the two points $\alpha^{(0)}\pm\rho$. And we find that among size-$\dot\gamma$ similar tests of $H_0$ or $\tilde H_0$ versus $H_1$ or $\tilde H_1$, the average power at any two points that are equidistant from $\alpha^{(0)}$ or $\tau^{(0)}$ is maximized by taking the test to be the (size-$\dot\gamma$) two-sided t test. Moreover, among all size-$\dot\gamma$ tests of $H_0$ or $\tilde H_0$ whose power functions depend on $\alpha$, $\gamma$, and $\sigma$ or $\tau$, $\gamma$, and $\sigma$ only through the value of $|\alpha-\alpha^{(0)}|/\sigma$ or $|\tau-\tau^{(0)}|/\sigma$, the (size-$\dot\gamma$) two-sided t test is a UMP test.
Since the size-$\dot\gamma$ two-sided t test is equivalent to the size-$\dot\gamma$ F test and since (as discussed in Section 7.3b) the size-$\dot\gamma$ F test is strictly unbiased, the size-$\dot\gamma$ two-sided t test is a strictly unbiased test. Moreover, it can be shown that among all level-$\dot\gamma$ unbiased tests of $H_0$ or $\tilde H_0$ versus $H_1$ or $\tilde H_1$, the size-$\dot\gamma$ two-sided t test is a UMP test.
Confidence sets. Let $\tilde A(\hat\alpha,\hat\gamma,d)$ or simply $\tilde A$ represent an arbitrary confidence set for $\alpha$ with confidence coefficient $1-\dot\gamma$. And let $\tilde A_F(\hat\alpha,\hat\gamma,d)$ or simply $\tilde A_F$ represent the $100(1-\dot\gamma)\%$ confidence set
\[ \{\dot\alpha : (\dot\alpha-\hat\alpha)'(\dot\alpha-\hat\alpha)\le M^*\hat\sigma^2\,\bar F_{\dot\gamma}(M^*,N-P)\}. \]
Our results on the optimality of the size-$\dot\gamma$ F test of $\tilde H_0\colon\alpha=\alpha^{(0)}$ can be reexpressed as results on the optimality of $\tilde A_F$. When $\tilde A$ is required to be such that $\Pr(\alpha\in\tilde A)=1-\dot\gamma$ for all $\alpha$, $\gamma$, and $\sigma$, the optimal choice for $\tilde A$ is $\tilde A_F$; this choice is optimal in the sense that (for $\rho>0$) it minimizes the average value (with respect to $\alpha^{(0)}$) of $\Pr(\alpha^{(0)}\in\tilde A)$ (the probability of false coverage) over the sphere $S(\alpha,\rho)$ centered at $\alpha$ with radius $\rho$. And $\tilde A_F$ is UMA (uniformly most accurate) when the choice of the set $\tilde A$ is restricted to a set for which the probability $\Pr(\alpha^{(0)}\in\tilde A)$ of $\tilde A$ covering a vector $\alpha^{(0)}$ depends on $\alpha$, $\gamma$, and $\sigma$ only through the value of $(\alpha^{(0)}-\alpha)'(\alpha^{(0)}-\alpha)/\sigma^2$; it is UMA in the sense that the probability of false coverage is minimized for every $\alpha^{(0)}$ and for all $\alpha$, $\gamma$, and $\sigma$. Moreover, among those $100(1-\dot\gamma)\%$ confidence sets for which the probability of covering a vector $\tau^{(0)}$ [in $\mathcal{C}(\Lambda')$] depends on $\tau$, $\gamma$, and $\sigma$ only through the value of $(\tau^{(0)}-\tau)'C^-(\tau^{(0)}-\tau)/\sigma^2$, the UMA set is the set
\[ A_F=\{\dot\tau\in\mathcal{C}(\Lambda') : (\dot\tau-\hat\tau)'C^-(\dot\tau-\hat\tau)\le M^*\hat\sigma^2\,\bar F_{\dot\gamma}(M^*,N-P)\}. \]
7.5 One-Sided t Tests and the Corresponding Confidence Bounds

Suppose (as in Sections 7.3 and 7.4) that $y$ is an $N\times 1$ observable random vector that follows the G–M model. And suppose that the distribution of the vector $e$ of residual effects in the G–M model is $N(0,\sigma^2 I)$, so that $y\sim N(X\beta,\sigma^2 I)$. Let us consider further the problem of making inferences for estimable linear combinations of the elements of $\beta$.
Procedures were presented and discussed in Sections 7.3 and 7.4 for constructing a confidence interval for any particular estimable linear combination or for each of a number of estimable linear combinations in such a way that the probability of coverage or probability of simultaneous coverage equals $1-\dot\gamma$. The end points of the interval are interpretable as upper and lower bounds; these bounds are equidistant from the least squares estimate. In some cases, there may be no interest in bounding the linear combination both from above and from below; rather, the objective may be to obtain as tight an upper bound as possible or, alternatively, as tight a lower bound as possible.
a. Some basic results

Let us consider inference about a single estimable linear combination $\tau=\lambda'\beta$ of the elements of $\beta$ (where $\lambda\ne 0$). Further, let us take the context to be that of Section 7.3, and let us adopt the notation and terminology employed therein. Accordingly, suppose that $M=M^*=1$, regard $\tau$ as the lone element of the vector $\tau$, and write $\hat\tau$ for the lone element of the vector $\hat\tau$ (of least squares estimators) and $c$ for the lone element of the matrix $C$ [$=\Lambda'(X'X)^-\Lambda$]. And observe that $S$ is the matrix with lone element $c^{-1/2}$, $\alpha$ is the vector with lone element $\alpha=c^{-1/2}\tau$, and $\hat\alpha$ is the vector with lone element $\hat\alpha=c^{-1/2}\hat\tau$.
Confidence bounds. Letting $\hat\sigma^2=y'(I-P_X)y/(N-P)$, letting $\dot\tau$ represent an arbitrary value of $\tau$ and defining $\dot\alpha=c^{-1/2}\dot\tau$, and recalling results (3.21) and (3.51), we find that
\[ \frac{\hat\tau-\dot\tau}{\hat\sigma\sqrt c}=\frac{(1/\sigma)(\hat\alpha-\dot\alpha)}{\sqrt{(1/\sigma^2)\,d'd/(N-P)}}\sim St[N-P;\,(\alpha-\dot\alpha)/\sigma]. \tag{5.1} \]
And upon applying this result in the special case where $\dot\tau=\tau$ or equivalently where $\dot\alpha=\alpha$, we find that the interval
\[ [\,\hat\tau-\sqrt c\,\hat\sigma\,\bar t_{\dot\gamma}(N-P),\ \infty) \tag{5.2} \]
is a $100(1-\dot\gamma)\%$ confidence interval for $\tau$. Similarly, the interval
\[ (-\infty,\ \hat\tau+\sqrt c\,\hat\sigma\,\bar t_{\dot\gamma}(N-P)\,] \tag{5.3} \]
is a $100(1-\dot\gamma)\%$ confidence interval for $\tau$. Thus, $\hat\tau-\sqrt c\,\hat\sigma\,\bar t_{\dot\gamma}(N-P)$ is a $100(1-\dot\gamma)\%$ lower "confidence bound" for $\tau$, and $\hat\tau+\sqrt c\,\hat\sigma\,\bar t_{\dot\gamma}(N-P)$ is a $100(1-\dot\gamma)\%$ upper "confidence bound" for $\tau$.
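For concreteness, the sketch below (assuming Python with NumPy and SciPy, and assuming $X$ has full column rank so that $(X'X)^{-1}$ can stand in for the generalized inverse) computes the lower bound of (5.2) and the upper bound of (5.3) for $\tau=\lambda'\beta$ from data; the function name and the simulated data are illustrative only.

```python
import numpy as np
from scipy.stats import t as student_t

def one_sided_bounds(y, X, lam, gam=0.05):
    """Lower and upper 100(1-gam)% confidence bounds (5.2) and (5.3) for tau = lam'beta
    under the G-M model with normal errors.  Assumes X has full column rank."""
    N, P = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    tau_hat = lam @ beta_hat
    c = lam @ XtX_inv @ lam                        # lone element of C = lam'(X'X)^- lam
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (N - P))   # sigma-hat^2 = y'(I - P_X)y / (N - P)
    margin = np.sqrt(c) * sigma_hat * student_t.ppf(1 - gam, N - P)
    return tau_hat - margin, tau_hat + margin      # lower bound of (5.2), upper bound of (5.3)

# Simulated illustration
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.standard_normal(30)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(30)
print(one_sided_bounds(y, X, np.array([0.0, 1.0])))
```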
One-sided t tests. Let $\tau^{(0)}$ represent any hypothesized value of $\tau$. Then, corresponding to the confidence interval (5.2) is a "one-sided t test" of the null hypothesis $H_0^+\colon\tau\le\tau^{(0)}$ versus the alternative hypothesis $H_1^+\colon\tau>\tau^{(0)}$ with critical region $C^+$ defined as follows: $C^+=\{y : \tau^{(0)}<\hat\tau-\sqrt c\,\hat\sigma\,\bar t_{\dot\gamma}(N-P)\}$ or, equivalently,
\[ C^+=\Big\{y : \frac{\hat\tau-\tau^{(0)}}{\hat\sigma\sqrt c}>\bar t_{\dot\gamma}(N-P)\Big\} \]
[when $\hat\sigma=0$, interpret the value of $(\hat\tau-\tau^{(0)})/(\hat\sigma\sqrt c)$ as $-\infty$, $0$, or $\infty$ depending on whether $\hat\tau<\tau^{(0)}$, $\hat\tau=\tau^{(0)}$, or $\hat\tau>\tau^{(0)}$]. This test is such that the probability $\Pr(y\in C^+)$ of rejecting $H_0^+$ depends on $\beta$ and $\sigma$ only through the value of $(\tau-\tau^{(0)})/\sigma$ and in fact is a strictly increasing function of $(\tau-\tau^{(0)})/\sigma$, as is evident from result (5.1) and from the very definition of the noncentral t distribution. Thus, the test of $H_0^+$ versus $H_1^+$ with critical region $C^+$ is of size $\dot\gamma$ and is (strictly) unbiased.
There is a second one-sided t test. Corresponding to the confidence interval (5.3) is the one-sided t test of the null hypothesis $H_0^-\colon\tau\ge\tau^{(0)}$ versus the alternative hypothesis $H_1^-\colon\tau<\tau^{(0)}$ with critical region $C^-$ defined as follows: $C^-=\{y : \tau^{(0)}>\hat\tau+\sqrt c\,\hat\sigma\,\bar t_{\dot\gamma}(N-P)\}$ or, equivalently,
\[ C^-=\Big\{y : \frac{\hat\tau-\tau^{(0)}}{\hat\sigma\sqrt c}<-\bar t_{\dot\gamma}(N-P)\Big\}. \]
This test is such that the probability $\Pr(y\in C^-)$ of rejecting $H_0^-$ depends on $\beta$ and $\sigma$ only through the value of $(\tau-\tau^{(0)})/\sigma$ and in fact is a strictly decreasing function of $(\tau-\tau^{(0)})/\sigma$. And it is of size $\dot\gamma$ and is (strictly) unbiased.
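Since, by (5.1), the t statistic has a noncentral t distribution with noncentrality $(\tau-\tau^{(0)})/(\sigma\sqrt c)$, the rejection probability of the test with critical region $C^+$ can be evaluated directly. A minimal sketch (assuming Python with SciPy; the degrees of freedom, $c$, and $\sigma$ are illustrative assumptions):

```python
from scipy.stats import t as student_t, nct

# Illustrative values: df = N - P and the scale sqrt(c)*sigma (assumptions)
df, c, sigma, gam = 25, 0.2, 1.0, 0.05
t_crit = student_t.ppf(1 - gam, df)

def power_upper_tail_test(tau_minus_tau0):
    """Pr(y in C+) as a function of tau - tau^(0); by (5.1) the t statistic is
    noncentral t with noncentrality (tau - tau^(0)) / (sigma * sqrt(c))."""
    delta = tau_minus_tau0 / (sigma * c ** 0.5)
    return nct.sf(t_crit, df, delta)

for diff in [-0.5, 0.0, 0.5, 1.0]:
    print(diff, round(power_upper_tail_test(diff), 4))
# The rejection probability increases strictly with (tau - tau^(0))/sigma and equals
# gam at tau = tau^(0), consistent with size gam and strict unbiasedness.
```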
Recharacterization of the one-sided t tests. Upon letting $\alpha^{(0)}=c^{-1/2}\tau^{(0)}$, the size-$\dot\gamma$ one-sided t test of $H_0^+$ can be recharacterized as a test of the null hypothesis $\tilde H_0^+\colon\alpha\le\alpha^{(0)}$ versus the alternative hypothesis $\tilde H_1^+\colon\alpha>\alpha^{(0)}$, and its critical region $C^+$ can be reexpressed as a set $\tilde C^+=\{z : (\hat\alpha-\alpha^{(0)})/\hat\sigma>\bar t_{\dot\gamma}(N-P)\}$ of values of the transformed vector $z=(\hat\alpha',\hat\gamma',d')'$; clearly, $z\in\tilde C^+\Leftrightarrow y\in C^+$. Similarly, the size-$\dot\gamma$ one-sided t test of $H_0^-$ can be recharacterized as a test of the null hypothesis $\tilde H_0^-\colon\alpha\ge\alpha^{(0)}$ versus the alternative hypothesis $\tilde H_1^-\colon\alpha<\alpha^{(0)}$, and its critical region $C^-$ can be reexpressed in the form of the set $\tilde C^-=\{z : (\hat\alpha-\alpha^{(0)})/\hat\sigma<-\bar t_{\dot\gamma}(N-P)\}$.
Invariance. Both of the two size-$\dot\gamma$ one-sided t tests are invariant with respect to groups of transformations (of $z$) of the form $\tilde T_1(z)$, of the form $\tilde T_2(z;\alpha^{(0)})$, and of the form $\tilde T_4(z)$; transformations of these forms are among those discussed in Section 7.3b. And the $100(1-\dot\gamma)\%$ confidence intervals (5.2) and (5.3) are invariant with respect to groups of transformations of the form $\tilde T_1(z)$ and of the form $\tilde T_4(z)$ and are equivariant with respect to groups of transformations of the form $\tilde T_2(z;\alpha^{(0)})$.

b. Confidence intervals of a more general form


The 100.1 P /% confidence intervals (5.2) and (5.3) and the 100.1 P /% confidence interval (3.55)
can be regarded as special cases of a 100.1 P /% confidence interval (for ) of a more general form.
Let P` and Pu represent any two nonnegative scalars such that P` C Pu D P (and define tN0 D 1).
Then, clearly, the interval
p p
fP 2 R1 W O c O tNP` .N P /  P  O C c O tNPu .N P /g (5.4)
is a 100.1 P /% confidence interval for . And interval (3.55) is the special case of interval (5.4)
where P` D Pu D P =2, interval (5.2) is the special case where P` D P and Pu D 0, and interval
(5.3) is the special case where P` D 0 and Pu D P .

c. Invariant tests
Let us add to the brief “discussion” (in the final part of Subsection a) of the invariance of one-sided
t tests by establishing some properties possessed by the critical functions of tests that are invariant
with respect to various groups of transformations but not by the critical functions of other tests.
Two propositions. The four propositions introduced previously (in Section 7.4a) serve to characterize
functions of z that are invariant with respect to the four groups GQ 1 , GQ 3 .˛ .0/ /, GQ 2 .˛ .0/ /, and GQ 4 .
In the present context, the group GQ 3 .˛ .0/ / is irrelevant. The following two propositions [in which
Q
.z/ represents an arbitrary function of z and in which z1 represents the lone element of the vector
z1 ] take the place of Propositions (2), (3), and (4) of Section 7.4a and provide the characterization
needed for present purposes:
(2 0 ) The function .z/Q is invariant with respect to the group GQ 2 .˛.0/ / of transformations of the form
T2 .zI ˛ / as well as the group GQ 1 of transformations of the form TQ1 .z/ only if there exists a
Q .0/

function Q 12 ./ [of an (N P C1)-dimensional vector] suchthat


z ˛ .0/
 
Q
.z/ D Q12 Œ.z1 ˛ / C z03 z3  1=2 1
.0/ 2
z3
for those values of z for which z3 ¤ 0. Moreover, Pr.z3 ¤ 0/ D 1, and the distribution
.0/
 
.0/ 2 0 1=2 z1 ˛
of Œ.z1 ˛ / C z3 z3  depends on ˛, , and  only through the value of
z3
.0/
.˛ ˛ /=.
Q
(3 0 ) The function .z/ is invariant with respect to the group GQ 4 of transformations of the form TQ4 .z/
as well as the groups GQ 1 and GQ 2 .˛.0/ / of transformations of the form TQ1 .z/ and TQ2 .zI ˛.0/ /
only if there exists a function Q 124 ./ (of a single variable) such that
Q
.z/ D Q124fŒ.z1 ˛ .0/ /2 C z03 z3  1=2
.z1 ˛ .0/ /g
for those values of z for which z3 ¤ 0. Moreover, Pr.z3 ¤ 0/ D 1, and the distribution of
Œ.z1 ˛ .0/ /2 Cz03 z3  1=2 .z1 ˛ .0/ / depends on ˛, , and  only through the value of .˛ ˛ .0/ /=.
Verification of Proposition (2$'$). Suppose that $\tilde\phi(z)$ is invariant with respect to the group $\tilde G_2(\alpha^{(0)})$ as well as the group $\tilde G_1$. Then, in light of Proposition (1) (of Section 7.4a), $\tilde\phi(z)=\tilde\phi_1(z_1,z_3)$ for some function $\tilde\phi_1(\,\cdot\,,\cdot\,)$. Thus, to establish the existence of a function $\tilde\phi_{12}(\cdot)$ such that $\tilde\phi(z)=\tilde\phi_{12}\big([(z_1-\alpha^{(0)})^2+z_3'z_3]^{-1/2}\binom{z_1-\alpha^{(0)}}{z_3}\big)$ for those values of $z$ for which $z_3\ne 0$, it suffices to take $\dot z_1$ and $\ddot z_1$ to be values of $z_1$ and $\dot z_3$ and $\ddot z_3$ nonnull values of $z_3$ such that
\[ [(\ddot z_1-\alpha^{(0)})^2+\ddot z_3'\ddot z_3]^{-1/2}\binom{\ddot z_1-\alpha^{(0)}}{\ddot z_3}=[(\dot z_1-\alpha^{(0)})^2+\dot z_3'\dot z_3]^{-1/2}\binom{\dot z_1-\alpha^{(0)}}{\dot z_3} \]
and to observe that
\[ \ddot z_3=k\dot z_3\quad\text{and}\quad \ddot z_1=\alpha^{(0)}+k(\dot z_1-\alpha^{(0)}), \]
where $k=[(\ddot z_1-\alpha^{(0)})^2+\ddot z_3'\ddot z_3]^{1/2}/[(\dot z_1-\alpha^{(0)})^2+\dot z_3'\dot z_3]^{1/2}$. And upon letting $u_1$ represent a random variable and $u_3$ an $(N-P)\times 1$ random vector such that $\binom{u_1}{u_3}\sim N(0,I)$, it remains only to observe that
\[ [(z_1-\alpha^{(0)})^2+z_3'z_3]^{-1/2}\binom{z_1-\alpha^{(0)}}{z_3}\sim[(\alpha-\alpha^{(0)}+\sigma u_1)^2+\sigma^2 u_3'u_3]^{-1/2}\binom{\alpha-\alpha^{(0)}+\sigma u_1}{\sigma u_3}=\{[\sigma^{-1}(\alpha-\alpha^{(0)})+u_1]^2+u_3'u_3\}^{-1/2}\binom{\sigma^{-1}(\alpha-\alpha^{(0)})+u_1}{u_3}. \]
Verification of Proposition (3$'$). Suppose that $\tilde\phi(z)$ is invariant with respect to the group $\tilde G_4$ as well as the groups $\tilde G_1$ and $\tilde G_2(\alpha^{(0)})$. Then, in light of Proposition (2$'$), there exists a function $\tilde\phi_{12}(\cdot)$ such that $\tilde\phi(z)=\tilde\phi_{12}\big([(z_1-\alpha^{(0)})^2+z_3'z_3]^{-1/2}\binom{z_1-\alpha^{(0)}}{z_3}\big)$ for those values of $z$ for which $z_3\ne 0$. Moreover, there also exists a function $\tilde\phi_{124}(\cdot)$ such that
\[ \tilde\phi_{12}\Big([(z_1-\alpha^{(0)})^2+z_3'z_3]^{-1/2}\binom{z_1-\alpha^{(0)}}{z_3}\Big)=\tilde\phi_{124}\{[(z_1-\alpha^{(0)})^2+z_3'z_3]^{-1/2}(z_1-\alpha^{(0)})\} \]
(for those values of $z$ for which $z_3\ne 0$). To confirm this, it suffices to take $\dot z_1$ and $\ddot z_1$ to be values of $z_1$ and $\dot z_3$ and $\ddot z_3$ nonnull values of $z_3$ such that
\[ [(\ddot z_1-\alpha^{(0)})^2+\ddot z_3'\ddot z_3]^{-1/2}(\ddot z_1-\alpha^{(0)})=[(\dot z_1-\alpha^{(0)})^2+\dot z_3'\dot z_3]^{-1/2}(\dot z_1-\alpha^{(0)}) \tag{5.5} \]
and to observe that equality (5.5) implies that
\[ \{[(\ddot z_1-\alpha^{(0)})^2+\ddot z_3'\ddot z_3]^{-1/2}\ddot z_3\}'\{[(\ddot z_1-\alpha^{(0)})^2+\ddot z_3'\ddot z_3]^{-1/2}\ddot z_3\}=\{[(\dot z_1-\alpha^{(0)})^2+\dot z_3'\dot z_3]^{-1/2}\dot z_3\}'\{[(\dot z_1-\alpha^{(0)})^2+\dot z_3'\dot z_3]^{-1/2}\dot z_3\} \]
and hence implies (in light of Lemma 5.9.9) the existence of an orthogonal matrix $B$ such that
\[ [(\ddot z_1-\alpha^{(0)})^2+\ddot z_3'\ddot z_3]^{-1/2}\ddot z_3=B'\{[(\dot z_1-\alpha^{(0)})^2+\dot z_3'\dot z_3]^{-1/2}\dot z_3\}=[(\dot z_1-\alpha^{(0)})^2+(B'\dot z_3)'B'\dot z_3]^{-1/2}B'\dot z_3 \]
and
\[ [(\ddot z_1-\alpha^{(0)})^2+\ddot z_3'\ddot z_3]^{-1/2}(\ddot z_1-\alpha^{(0)})=[(\dot z_1-\alpha^{(0)})^2+(B'\dot z_3)'B'\dot z_3]^{-1/2}(\dot z_1-\alpha^{(0)}). \]
That the distribution of $[(z_1-\alpha^{(0)})^2+z_3'z_3]^{-1/2}(z_1-\alpha^{(0)})$ depends on $\alpha$, $\gamma$, and $\sigma$ only through the value of $(\alpha-\alpha^{(0)})/\sigma$ follows from Proposition (2$'$).
d. Optimality of a one-sided t test

Let us consider the extent to which the one-sided t test of the null hypothesis $H_0^+$ or $\tilde H_0^+$ (versus $H_1^+$ or $\tilde H_1^+$) and the one-sided t test of the null hypothesis $H_0^-$ or $\tilde H_0^-$ (versus $H_1^-$ or $\tilde H_1^-$) compare favorably with various other tests of these null hypotheses.
Best invariant test of $H_0^+$ or $\tilde H_0^+$ (versus $H_1^+$ or $\tilde H_1^+$). The size-$\dot\gamma$ one-sided t test of $H_0^+$ can be recharacterized as a test of $\tilde H_0^+$, and its critical region and critical function can be reexpressed as a set of $z$-values or as a function of $z$ rather than as a set of $y$-values or a function of $y$; refer to Subsection a. As noted earlier (in Subsection a), the size-$\dot\gamma$ one-sided t test of $H_0^+$ or $\tilde H_0^+$ is invariant to the groups $\tilde G_1$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$ of transformations (of $z$) of the form $\tilde T_1(z)$, $\tilde T_2(z;\alpha^{(0)})$, and $\tilde T_4(z)$. In fact, as we now proceed to show, the size-$\dot\gamma$ one-sided t test is UMP among all level-$\dot\gamma$ tests of $H_0^+$ or $\tilde H_0^+$ (versus $H_1^+$ or $\tilde H_1^+$) that are invariant with respect to these groups of transformations; under $H_1^+$ or $\tilde H_1^+$, its power is uniformly (i.e., for all $\alpha>\alpha^{(0)}$ and for all $\gamma$ and $\sigma$) greater than or equal to that of every test of $H_0^+$ or $\tilde H_0^+$ whose size does not exceed $\dot\gamma$ and that is invariant with respect to the groups $\tilde G_1$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$.
Let $\tilde\phi(z)$ represent [in terms of the transformed observable random vector $z=(z_1,z_2',z_3')'$] the critical function of an arbitrary (possibly randomized) test of the null hypothesis $H_0^+$ or $\tilde H_0^+$ (versus the alternative hypothesis $H_1^+$ or $\tilde H_1^+$). Further, let $Y$ represent an observable random variable defined as follows:
\[ Y=\begin{cases}[(z_1-\alpha^{(0)})^2+z_3'z_3]^{-1/2}(z_1-\alpha^{(0)}), & \text{if }z_1\ne\alpha^{(0)}\text{ or }z_3\ne 0,\\ 0, & \text{if }z_1=\alpha^{(0)}\text{ and }z_3=0.\end{cases} \]
And let $\Delta=(\alpha-\alpha^{(0)})/\sigma$.
When regarded as a function of $z$, $Y$ is invariant with respect to the groups $\tilde G_1$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$ (as can be readily verified). Thus, if $\tilde\phi(z)$ depends on $z$ only through the value of $Y$, then $\tilde\phi(z)$ is invariant with respect to the groups $\tilde G_1$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$. Conversely, if $\tilde\phi(z)$ is invariant with respect to the groups $\tilde G_1$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$, then [in light of Proposition (3$'$) of Subsection c] there exists a test with a critical function $\tilde\phi_{124}(Y)$ for which
\[ \tilde\phi(z)=\tilde\phi_{124}(Y)\quad(\text{wp1}) \]
and hence for which
\[ E[\tilde\phi(z)]=E[\tilde\phi_{124}(Y)], \]
so that the test with critical function $\tilde\phi_{124}(Y)$ has the same power function as the test with critical function $\tilde\phi(z)$. And the distribution of $Y$ or "any" function of $Y$ depends on $\alpha$, $\gamma$, and $\sigma$ only through the value of $\Delta$ {as is evident from Proposition (3$'$) of Subsection c upon observing that $Y\sim[(z_1-\alpha^{(0)})^2+z_3'z_3]^{-1/2}(z_1-\alpha^{(0)})$}. Moreover, $Y$ is a strictly increasing function of the quantity $(\hat\alpha-\alpha^{(0)})/\hat\sigma$ [as can be readily verified by making use of relationship (6.4.4)], so that
\[ z\in\tilde C^+\ \Leftrightarrow\ Y\in\Big\{Y : Y\ge\frac{\bar t_{\dot\gamma}(N-P)}{\sqrt{N-P+[\bar t_{\dot\gamma}(N-P)]^2}}\Big\}. \tag{5.6} \]
The upshot of these remarks is that Theorem 7.4.1 (the Neyman–Pearson lemma) can be used to show that the size-$\dot\gamma$ one-sided t test is UMP among all level-$\dot\gamma$ tests of $H_0^+$ or $\tilde H_0^+$ (versus $H_1^+$ or $\tilde H_1^+$) that are invariant with respect to the groups $\tilde G_1$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$. Accordingly, let $\Delta^*$ represent any strictly positive scalar, and let $f(\,\cdot\,;\Delta)$ represent the pdf of the distribution of $Y$; clearly, this distribution is absolutely continuous. Further, upon letting $U=[(z_1-\alpha^{(0)})^2+z_3'z_3]/\sigma^2$ and letting $q(\,\cdot\,,\cdot\,;\Delta)$ represent the pdf of the joint distribution of $U$ and $Y$ (which is absolutely continuous and depends on $\alpha$, $\gamma$, and $\sigma$ only through the value of $\Delta$) and upon observing that (for $-1<y<1$)
\[ \frac{f(y;\Delta^*)}{f(y;0)}=\int_0^\infty\frac{q(u,y;\Delta^*)}{f(y;0)}\,du \]
and [in light of results (6.4.32) and (6.4.7)] that (for some strictly positive scalar $c$ that does not depend on $u$, $y$, or $\Delta^*$ and for $0<u<\infty$ and $-1<y<1$)
\[ \frac{q(u,y;\Delta^*)}{f(y;0)}=c\,u^{(N-P-1)/2}\,e^{-u/2}\,e^{-\Delta^{*2}/2}\,e^{\Delta^*u^{1/2}y}, \]
we find that (for $-1<y<1$)
\[ \frac{d[f(y;\Delta^*)/f(y;0)]}{dy}=\int_0^\infty\frac{\partial[q(u,y;\Delta^*)/f(y;0)]}{\partial y}\,du=c\,\Delta^*e^{-\Delta^{*2}/2}\int_0^\infty u^{(N-P)/2}\,e^{-u/2}\,e^{\Delta^*u^{1/2}y}\,du>0\quad\text{if }\Delta^*>0, \]
so that the ratio $f(y;\Delta^*)/f(y;0)$ is a strictly increasing function of $y$.
Thus, upon applying Theorem 7.4.1 [with $X=Y$, $\theta=\Delta$, $\Theta=(-\infty,\infty)$, $\theta^{(0)}=0$, and $\theta^*=\Delta^*$], we find [in light of the equivalence (5.6)] that among tests of the null hypothesis $\Delta=0$ (versus the alternative hypothesis $\Delta\ne 0$) that are of level $\dot\gamma$ and that are invariant with respect to the groups $\tilde G_1$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$, the power of the test at the point $\Delta=\Delta^*$ attains its maximum value (for every choice of the strictly positive scalar $\Delta^*$) when the critical region of the test is taken to be the region $\tilde C^+$. Moreover, tests of the null hypothesis $\Delta=0$ that are of level $\dot\gamma$ and that are invariant with respect to the groups $\tilde G_1$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$ include as a subset tests of the null hypothesis $\Delta\le 0$ that are of level $\dot\gamma$ and that are invariant with respect to those groups. And we conclude that among all tests of the null hypothesis $H_0^+\colon\tau\le\tau^{(0)}$ or $\tilde H_0^+\colon\alpha\le\alpha^{(0)}$ (versus the alternative hypothesis $H_1^+\colon\tau>\tau^{(0)}$ or $\tilde H_1^+\colon\alpha>\alpha^{(0)}$) that are of level $\dot\gamma$ and that are invariant with respect to the groups $\tilde G_1$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$, the size-$\dot\gamma$ one-sided t test of $H_0^+$ is a UMP test.
A stronger result. As is evident from the next-to-last part of Section 7.3a, $z_1$, $z_2$, and $z_3'z_3$ form a (vector-valued) sufficient statistic. And in light of the proposition of Section 7.4b, the critical function of a test of $H_0^+$ or $\tilde H_0^+$ is expressible as a function of this statistic if and only if the test is invariant with respect to the group $\tilde G_4$ of transformations (of $z$) of the form $\tilde T_4(z)$. Thus, a test of $H_0^+$ or $\tilde H_0^+$ is invariant with respect to the groups $\tilde G_1$, $\tilde G_2(\alpha^{(0)})$, and $\tilde G_4$ of transformations if and only if it is invariant with respect to the groups $\tilde G_1$ and $\tilde G_2(\alpha^{(0)})$ and, in addition, its critical function is expressible as a function of $z_1$, $z_2$, and $z_3'z_3$.
Let $\tilde\phi(z)$ represent a function of $z$ that is the critical function of a (possibly randomized) test of $H_0^+$ or $\tilde H_0^+$. Then, corresponding to $\tilde\phi(z)$ is a critical function $\bar\phi(z)=E[\tilde\phi(z)\mid z_1,z_2,z_3'z_3]$ that depends on $z$ only through the value of the sufficient statistic. Moreover, the test with critical function $\bar\phi(z)$ has the same power function as that with critical function $\tilde\phi(z)$; and if the test with critical function $\tilde\phi(z)$ is invariant with respect to the groups $\tilde G_1$ and $\tilde G_2(\alpha^{(0)})$ of transformations, then so is the test with critical function $\bar\phi(z)$; refer to the fifth part of Section 7.4d.
Thus, the results (obtained in the preceding part of the present subsection) on the optimality of the size-$\dot\gamma$ one-sided t test can be strengthened. We find that among all tests of the null hypothesis $H_0^+\colon\tau\le\tau^{(0)}$ or $\tilde H_0^+\colon\alpha\le\alpha^{(0)}$ (versus the alternative hypothesis $H_1^+\colon\tau>\tau^{(0)}$ or $\tilde H_1^+\colon\alpha>\alpha^{(0)}$) that are of level $\dot\gamma$ and that are invariant with respect to the groups $\tilde G_1$ and $\tilde G_2(\alpha^{(0)})$, the size-$\dot\gamma$ one-sided t test of $H_0^+$ is a UMP test. The restriction to tests that are invariant with respect to the group $\tilde G_4$ is unnecessary.
Best invariant test of $H_0^-$ or $\tilde H_0^-$ (versus $H_1^-$ or $\tilde H_1^-$): an analogous result. By proceeding in much the same fashion as in arriving at the result (on the optimality of the size-$\dot\gamma$ one-sided t test of the null hypothesis $H_0^+$) presented in the preceding part of the present subsection, one can establish the following result on the optimality of the size-$\dot\gamma$ one-sided t test of the null hypothesis $H_0^-$: Among all tests of the null hypothesis $H_0^-\colon\tau\ge\tau^{(0)}$ or $\tilde H_0^-\colon\alpha\ge\alpha^{(0)}$ (versus the alternative hypothesis $H_1^-\colon\tau<\tau^{(0)}$ or $\tilde H_1^-\colon\alpha<\alpha^{(0)}$) that are of level $\dot\gamma$ and that are invariant with respect to the groups $\tilde G_1$ and $\tilde G_2(\alpha^{(0)})$, the size-$\dot\gamma$ one-sided t test of $H_0^-$ is a UMP test.
Confidence intervals. By proceeding in much the same fashion as in the final part of Section 7.4d (in translating results on the optimality of the F test into results on the optimality of the corresponding confidence set), the results of the preceding parts of the present subsection (of Section 7.5) on the optimality of the one-sided t tests can be translated into results on the optimality of the corresponding confidence intervals.
Clearly, the $100(1-\dot\gamma)\%$ confidence intervals (5.2) and (5.3) (for $\tau$) and the corresponding intervals $[\hat\alpha-\hat\sigma\,\bar t_{\dot\gamma}(N-P),\ \infty)$ and $(-\infty,\ \hat\alpha+\hat\sigma\,\bar t_{\dot\gamma}(N-P)]$ for $\alpha$ are equivariant with respect to the group $\tilde G_2$ of transformations [the group formed by the totality of the groups $\tilde G_2(\alpha^{(0)})$ ($\alpha^{(0)}\in\mathbb{R}^1$)], and they are invariant with respect to the group $\tilde G_1$ (and also with respect to the group $\tilde G_4$). Now, let $\tilde A(z)$ represent any confidence set for $\alpha$ whose probability of coverage equals or exceeds $1-\dot\gamma$ and that is invariant with respect to the group $\tilde G_1$ and equivariant with respect to the group $\tilde G_2$. Further, for $\alpha^{(0)}\in\mathbb{R}^1$, denote by $\delta(\alpha^{(0)};\alpha)$ the probability $\Pr[\alpha^{(0)}\in\tilde A(z)]$ of $\alpha^{(0)}$ being covered by the confidence set $\tilde A(z)$ [and note that $\delta(\alpha;\alpha)\ge 1-\dot\gamma$]. Then, the interval $[\hat\alpha-\hat\sigma\,\bar t_{\dot\gamma}(N-P),\ \infty)$ is the optimal choice for $\tilde A(z)$ in the sense that for every scalar $\alpha^{(0)}$ such that $\alpha^{(0)}<\alpha$, it is the choice that minimizes $\delta(\alpha^{(0)};\alpha)$, that is, the choice that minimizes the probability of the confidence set covering any value of $\alpha$ smaller than the true value.
This result on the optimality of the $100(1-\dot\gamma)\%$ confidence interval $[\hat\alpha-\hat\sigma\,\bar t_{\dot\gamma}(N-P),\ \infty)$ (for $\alpha$) can be reexpressed as a result on the optimality of the $100(1-\dot\gamma)\%$ confidence interval (5.2) (for $\tau$), and/or (as discussed in Section 7.3b) conditions (pertaining to the invariance or equivariance of confidence sets) that are expressed in terms of groups of transformations of $z$ can be reexpressed in terms of the corresponding groups of transformations of $y$. Corresponding to the groups $\tilde G_1$ and $\tilde G_4$ and the group $\tilde G_2$ of transformations of $z$ are the groups $G_1$ and $G_4$ and the group $G_2$ [consisting of the totality of the groups $G_2(\tau^{(0)})$ ($\tau^{(0)}\in\mathbb{R}^1$)] of transformations of $y$.
Let $A(y)$ represent any confidence set for $\tau$ for which $\Pr[\tau\in A(y)]\ge 1-\dot\gamma$ and that is invariant with respect to the group $G_1$ and equivariant with respect to the group $G_2$. Then, the interval (5.2) is the optimal choice for $A(y)$ in the sense that for every scalar $\tau^{(0)}<\tau$, it is the choice that minimizes $\Pr[\tau^{(0)}\in A(y)]$; that is, the choice that minimizes the probability of the confidence set covering any value of $\tau$ smaller than the true value. Moreover, the probability of the interval (5.2) covering a scalar $\tau^{(0)}$ is less than, equal to, or greater than $1-\dot\gamma$ depending on whether $\tau^{(0)}<\tau$, $\tau^{(0)}=\tau$, or $\tau^{(0)}>\tau$. While the interval (5.2) is among those choices for the confidence set $A(y)$ that are invariant with respect to the group $G_4$ of transformations (of $y$) and is also among those choices for which $\Pr[\tau^{(0)}\in A(y)]>1-\dot\gamma$ for every scalar $\tau^{(0)}>\tau$, the optimality of the interval (5.2) is not limited to choices for $A(y)$ that have either or both of those properties.
If the objective is to minimize the probability of $A(y)$ covering values of $\tau$ that are larger than the true value rather than smaller, the optimal choice for $A(y)$ is interval (5.3) rather than interval (5.2). For every scalar $\tau^{(0)}>\tau$, interval (5.3) is the choice that minimizes $\Pr[\tau^{(0)}\in A(y)]$. And the probability of the interval (5.3) covering a scalar $\tau^{(0)}$ is greater than, equal to, or less than $1-\dot\gamma$ depending on whether $\tau^{(0)}<\tau$, $\tau^{(0)}=\tau$, or $\tau^{(0)}>\tau$.
Best unbiased tests. It can be shown that among all tests of the null hypothesis $H_0^+\colon\tau\le\tau^{(0)}$ or $\tilde H_0^+\colon\alpha\le\alpha^{(0)}$ (versus the alternative hypothesis $H_1^+\colon\tau>\tau^{(0)}$ or $\tilde H_1^+\colon\alpha>\alpha^{(0)}$) that are of level $\dot\gamma$ and that are unbiased, the size-$\dot\gamma$ one-sided t test of $H_0^+$ is a UMP test. Similarly, among all tests of the null hypothesis $H_0^-\colon\tau\ge\tau^{(0)}$ or $\tilde H_0^-\colon\alpha\ge\alpha^{(0)}$ (versus the alternative hypothesis $H_1^-\colon\tau<\tau^{(0)}$ or $\tilde H_1^-\colon\alpha<\alpha^{(0)}$) that are of level $\dot\gamma$ and that are unbiased, the size-$\dot\gamma$ one-sided t test of $H_0^-$ is a UMP test.
e. Simultaneous inference
Suppose that we wish to obtain either a lower confidence bound or an upper confidence bound for each of a number of estimable linear combinations of the elements of $\beta$. And suppose that we wish for these confidence bounds to be such that the probability of simultaneous coverage equals $1-\dot\gamma$; simultaneous coverage occurs when for every one of the linear combinations, the true value of the linear combination is covered by the interval formed (in the case of a lower confidence bound) by the scalars greater than or equal to the confidence bound or alternatively (in the case of an upper confidence bound) by the scalars less than or equal to the confidence bound. Or suppose that for each of a number of estimable linear combinations, we wish to test the null hypothesis that the true value of the linear combination is less than or equal to some hypothesized value (versus the alternative that it exceeds the hypothesized value) or the null hypothesis that the true value is greater than or equal to some hypothesized value (versus the alternative that it is less than the hypothesized value), and suppose that we wish to do so in such a way that the probability of falsely rejecting one or more of the null hypotheses is less than or equal to $\dot\gamma$.
Let $\tau_1,\tau_2,\ldots,\tau_M$ represent estimable linear combinations of the elements of $\beta$, and let $\tau=(\tau_1,\tau_2,\ldots,\tau_M)'$. And suppose that the linear combinations (of the elements of $\beta$) for which we wish to obtain confidence bounds or to subject to hypothesis tests are expressible in the form $\delta'\tau$, where $\delta$ is an arbitrary member of a specified collection $\Delta$ of $M\times 1$ vectors; it is assumed that $\Lambda\delta\ne 0$ for some $\delta\in\Delta$. Further, in connection with the hypothesis tests, denote by $\tau_\delta^{(0)}$ the hypothesized value of $\delta'\tau$ and by $H_0^{(\delta)}$ and $H_1^{(\delta)}$ the null and alternative hypotheses, so that either $H_0^{(\delta)}\colon\delta'\tau\le\tau_\delta^{(0)}$ and $H_1^{(\delta)}\colon\delta'\tau>\tau_\delta^{(0)}$ or $H_0^{(\delta)}\colon\delta'\tau\ge\tau_\delta^{(0)}$ and $H_1^{(\delta)}\colon\delta'\tau<\tau_\delta^{(0)}$. It is assumed that the various hypothesized values $\tau_\delta^{(0)}$ ($\delta\in\Delta$) are simultaneously achievable in the sense that there exists an $M\times 1$ vector $\tau^{(0)}\in\mathcal{C}(\Lambda')$ such that (for every $\delta\in\Delta$) $\tau_\delta^{(0)}=\delta'\tau^{(0)}$.
Note that for any scalar $\dot\tau$,
\[ \delta'\tau\ge\dot\tau\ \Leftrightarrow\ (-\delta)'\tau\le-\dot\tau\quad\text{and}\quad \delta'\tau>\dot\tau\ \Leftrightarrow\ (-\delta)'\tau<-\dot\tau. \]
Thus, for purposes of obtaining for every one of the linear combinations $\delta'\tau$ ($\delta\in\Delta$) either a lower or an upper confidence bound (and for doing so in such a way that the probability of simultaneous coverage equals $1-\dot\gamma$), there is no real loss of generality in restricting attention to the case where all of the confidence bounds are lower bounds. Similarly, for purposes of obtaining for every one of the linear combinations $\delta'\tau$ ($\delta\in\Delta$) a test of the null hypothesis $H_0^{(\delta)}$ versus the alternative hypothesis $H_1^{(\delta)}$ (and for doing so in such a way that the probability of falsely rejecting one or more of the null hypotheses is less than or equal to $\dot\gamma$), there is no real loss of generality in restricting attention to the case where for every $\delta\in\Delta$, $H_0^{(\delta)}$ and $H_1^{(\delta)}$ are $H_0^{(\delta)}\colon\delta'\tau\le\tau_\delta^{(0)}$ and $H_1^{(\delta)}\colon\delta'\tau>\tau_\delta^{(0)}$.
Simultaneous confidence bounds. Confidence bounds with a specified probability of simultaneous coverage can be obtained by adopting an approach similar to that employed in Sections 7.3c and 7.3e in obtaining confidence intervals (each of which has end points that are equidistant from the least squares estimator) with a specified probability of simultaneous coverage. As before, let $\tilde\Delta$ represent the set of $M^*\times 1$ vectors defined as follows:
\[ \tilde\Delta=\{\tilde\delta : \tilde\delta=W\delta,\ \delta\in\Delta\}. \]
Further, letting $t$ represent an $M^*\times 1$ random vector that has an $MVt(N-P,\,I_{M^*})$ distribution, denote by $a_{\dot\gamma}$ the upper $100\dot\gamma\%$ point of the distribution of the random variable
\[ \max_{\{\tilde\delta\in\tilde\Delta\,:\,\tilde\delta\ne 0\}}\frac{\tilde\delta' t}{(\tilde\delta'\tilde\delta)^{1/2}}. \tag{5.7} \]
And (for $\delta\in\Delta$) take $A_\delta(y)$ to be the set
\[ \{\dot\tau\in\mathbb{R}^1 : \delta'\hat\tau-(\delta'C\delta)^{1/2}\hat\sigma\,a_{\dot\gamma}\le\dot\tau<\infty\} \]
of $\delta'\tau$-values obtained upon regarding $\delta'\hat\tau-(\delta'C\delta)^{1/2}\hat\sigma\,a_{\dot\gamma}$ as a lower bound. Then,
\[ \Pr[\delta'\tau\in A_\delta(y)\ \text{for every}\ \delta\in\Delta]=1-\dot\gamma, \]
so that the probability of simultaneous coverage of the intervals $A_\delta(y)$ ($\delta\in\Delta$) defined by the lower bounds $\delta'\hat\tau-(\delta'C\delta)^{1/2}\hat\sigma\,a_{\dot\gamma}$ ($\delta\in\Delta$) is equal to $1-\dot\gamma$.
To make use of the lower bounds $\delta'\hat\tau-(\delta'C\delta)^{1/2}\hat\sigma\,a_{\dot\gamma}$ ($\delta\in\Delta$), we need the upper $100\dot\gamma\%$ point $a_{\dot\gamma}$ of the distribution of the random variable (5.7). Aside from relatively simple special cases, the distribution of this random variable is sufficiently complex that the computation of $a_{\dot\gamma}$ requires the use of Monte Carlo methods. When resort is made to Monte Carlo methods, the maximum value of the quantity $\tilde\delta' t/(\tilde\delta'\tilde\delta)^{1/2}$ (where the maximization is with respect to $\tilde\delta$ and is subject to the constraint $\tilde\delta\in\tilde\Delta$) must be determined for each of a very large number of values of $t$. As discussed in Section 7.3e, the maximum value of this quantity can be determined from the solution to a constrained nonlinear least squares problem.
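In the special case where $\tilde\Delta$ is a finite collection, the inner maximization reduces to taking a maximum over finitely many normalized inner products, and $a_{\dot\gamma}$ can be approximated directly by simulation. The sketch below (assuming Python with NumPy; the particular collection of vectors and the degrees of freedom are illustrative assumptions) shows this simple case; for infinite collections the constrained optimization described in the text would replace the matrix product.

```python
import numpy as np

def critical_value_a(delta_tilde, df, gam=0.05, draws=100_000, seed=0):
    """Monte Carlo approximation of the upper 100*gam% point of (5.7):
    max over delta-tilde in Delta-tilde of delta_tilde' t / ||delta_tilde||,
    where t ~ MVt(df, I).  Delta-tilde is given as the rows of an array."""
    rng = np.random.default_rng(seed)
    m_star = delta_tilde.shape[1]
    D = delta_tilde / np.linalg.norm(delta_tilde, axis=1, keepdims=True)
    z = rng.standard_normal((draws, m_star))
    chi = np.sqrt(rng.chisquare(df, size=draws) / df)
    t = z / chi[:, None]                  # multivariate t vectors: normal / sqrt(chi2/df)
    stat = (t @ D.T).max(axis=1)          # per draw, the maximum over the finite collection
    return np.quantile(stat, 1 - gam)

# Illustrative finite collection of delta-tilde vectors (an assumption for this sketch)
Dt = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]])
print(critical_value_a(Dt, df=20, gam=0.05))
```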
A variation. Suppose that $\Delta$ is such that $\Lambda\delta\ne 0$ for every $\delta\in\Delta$ (in which case $0\notin\tilde\Delta$, as is evident upon observing that $\Lambda\delta=\Lambda SW\delta$ and hence that $\Lambda\delta\ne 0\Rightarrow W\delta\ne 0$). Aside from special cases, the interval $A_\delta(y)$ is such that the distance $(\delta'C\delta)^{1/2}\hat\sigma\,a_{\dot\gamma}$ between the least squares estimator $\delta'\hat\tau$ (of $\delta'\tau$) and the lower confidence bound $\delta'\hat\tau-(\delta'C\delta)^{1/2}\hat\sigma\,a_{\dot\gamma}$ varies with $\delta$.
Now, suppose that $\tilde\Delta$ is such that for every $M^*\times 1$ vector $\dot t$, $\max_{\{\tilde\delta\in\tilde\Delta\}}\tilde\delta'\dot t$ exists, and denote by $a_{\dot\gamma}^*$ the upper $100\dot\gamma\%$ point of the distribution of the random variable $\max_{\{\tilde\delta\in\tilde\Delta\}}\tilde\delta' t$. Then, as an alternative to the interval $A_\delta(y)$, we have the interval $A_\delta^*(y)$ defined as follows:
\[ A_\delta^*(y)=\{\dot\tau\in\mathbb{R}^1 : \delta'\hat\tau-\hat\sigma\,a_{\dot\gamma}^*\le\dot\tau<\infty\}. \]
The probability of simultaneous coverage of the intervals $A_\delta^*(y)$ ($\delta\in\Delta$), like that of the intervals $A_\delta(y)$ ($\delta\in\Delta$), is equal to $1-\dot\gamma$. However, in the case of the interval $A_\delta^*(y)$, the lower confidence bound $\delta'\hat\tau-\hat\sigma\,a_{\dot\gamma}^*$ is such that the distance $\hat\sigma\,a_{\dot\gamma}^*$ between it and the least squares estimator $\delta'\hat\tau$ does not vary with $\delta$.
Application to the points on a regression line or on a response surface. Refer to Liu (2011) for a detailed and well-illustrated discussion of the use of the upper or lower confidence bounds $\delta'\hat\tau\pm(\delta'C\delta)^{1/2}\hat\sigma\,a_{\dot\gamma}$ and $\delta'\hat\tau\pm\hat\sigma\,a_{\dot\gamma}^*$ in making inferences about the points on a regression line or on a response surface.
One-sided tests. Corresponding to the confidence intervals $A_\delta(y)$ ($\delta\in\Delta$), for which the probability of simultaneous coverage is equal to $1-\dot\gamma$, are the tests of the null hypotheses $H_0^{(\delta)}\colon\delta'\tau\le\tau_\delta^{(0)}$ ($\delta\in\Delta$) with critical regions $C(\delta)$ ($\delta\in\Delta$), respectively, defined (for each $\delta\in\Delta$) as follows:
\[ C(\delta)=\{y : \tau_\delta^{(0)}\notin A_\delta(y)\}. \]
Similarly, if $\Delta$ is such that $\Lambda\delta\ne 0$ for every $\delta\in\Delta$ and if $\tilde\Delta=\{\tilde\delta : \tilde\delta=W\delta,\ \delta\in\Delta\}$ is such that (for every $M^*\times 1$ vector $\dot t$) $\max_{\{\tilde\delta\in\tilde\Delta\}}\tilde\delta'\dot t$ exists, then corresponding to the confidence intervals $A_\delta^*(y)$ ($\delta\in\Delta$) are the tests of the null hypotheses $H_0^{(\delta)}$ ($\delta\in\Delta$) with critical regions $C^*(\delta)$ ($\delta\in\Delta$), respectively, defined (for each $\delta\in\Delta$) as
\[ C^*(\delta)=\{y : \tau_\delta^{(0)}\notin A_\delta^*(y)\}. \]
The tests of the null hypotheses $H_0^{(\delta)}$ ($\delta\in\Delta$) with critical regions $C(\delta)$ ($\delta\in\Delta$) are such that the probability of one or more false rejections is less than or equal to $\dot\gamma$, with equality holding when $\delta'\tau=\tau_\delta^{(0)}$ for every $\delta\in\Delta$. To see this, let (for $\delta\in\Delta$)
\[ C^0(\delta)=\{y : \delta'\tau\notin A_\delta(y)\}. \]
Then, for $\delta\in\Delta$ such that $\delta'\tau\le\tau_\delta^{(0)}$, $C(\delta)\subseteq C^0(\delta)$. Thus,
\[ \Pr\Big[y\in\bigcup_{\{\delta\in\Delta\,:\,\delta'\tau\le\tau_\delta^{(0)}\}}C(\delta)\Big]\le\Pr\Big[y\in\bigcup_{\{\delta\in\Delta\,:\,\delta'\tau\le\tau_\delta^{(0)}\}}C^0(\delta)\Big]\le\Pr\Big[y\in\bigcup_{\{\delta\in\Delta\}}C^0(\delta)\Big]=\dot\gamma, \]
so that
\[ \Pr\Big[y\in\bigcup_{\{\delta\in\Delta\,:\,\delta'\tau\le\tau_\delta^{(0)}\}}C(\delta)\Big]\le\dot\gamma, \]
with equality holding when $\delta'\tau=\tau_\delta^{(0)}$ for every $\delta\in\Delta$.
Now, suppose that $\Delta$ is such that $\Lambda\delta\ne 0$ for every $\delta\in\Delta$ and is such that (for every $M^*\times 1$ vector $\dot t$) $\max_{\{\tilde\delta\in\tilde\Delta\}}\tilde\delta'\dot t$ exists. Then, by employing essentially the same argument as in the case of the tests with critical regions $C(\delta)$ ($\delta\in\Delta$), it can be shown that the tests of the null hypotheses $H_0^{(\delta)}$ ($\delta\in\Delta$) with critical regions $C^*(\delta)$ ($\delta\in\Delta$) are such that the probability of one or more false rejections is less than or equal to $\dot\gamma$, with equality holding when $\delta'\tau=\tau_\delta^{(0)}$ for every $\delta\in\Delta$.

f. Nonnormality
In Subsections a and b, it was established that each of the confidence sets (5.2), (5.3), and more
generally (5.4) has a probability of coverage equal to 1 P . It was also established that the test of
the null hypothesis H0C or alternatively H0 with critical region C C or C , respectively, is such that
the probability of falsely rejecting H0C or H0 is equal to P . And in Subsection e, it was established
that the probability of simultaneous coverage of the confidence intervals Aı .y/ (ı 2 ) and the
probability of simultaneous coverage of the confidence intervals Aı .y/ (ı 2 ) are both equal to
1 P . In addition, it was established that the tests of the null hypotheses H0.ı/ (ı 2 ) with critical
regions C.ı/ (ı 2 ) and the tests of H0.ı/ (ı 2 ) with critical regions C  .ı/ (ı 2 ) are both such
that the probability of falsely rejecting one or more of the null hypotheses is less than or equal to P .
A supposition of normality (made at the beginning of Section 7.5) underlies those results. How-
ever, this supposition is stronger than necessary. It suffices to take the distribution of the vector e of
residual effects in the G–M model to be an absolutely continuous spherical distribution. In fact, as
is evident upon observing that
O 1
.˛O ˛/ D Œd0 d=.N P / .˛O ˛/  1=2

˛O ˛
and recalling result (6.4.67), it suffices to take the distribution of the vector to be an absolutely
d
continuous spherical distribution. A supposition of normality is also stronger than what is needed
(in Subsection c) in establishing the results of Propositions (2 0 ) and (3 0 ).

7.6 The Residual Variance  2 : Confidence Intervals and Tests of


Hypotheses
Suppose (as in Sections 7.3, 7.4, and 7.5) that y D .y1 ; y2 ; : : : ; yN /0 is an observable random vector
that follows the G–M model. Suppose further that the distribution of the vector e D .e1 ; e2 ; : : : ; eN /0
of residual effects in the G–M model is N.0;  2 I/, in which case y  N.Xˇ;  2 I/.
Let us consider the problem of constructing a confidence interval for the variance  2 (of the
residual effects e1; e2 ; : : : ; eN and of the observable random variables y1; y2 ; : : : ; yN ) or its (positive)
square root  (the standard deviation). Let us also consider the closely related problem of testing
hypotheses about  2 or . In doing so, let us adopt the notation and terminology employed in Section
7.3 (and subsequently in Sections 7.4 and 7.5).
a. The basics
Let $S$ represent the sum of squares $\tilde e'\tilde e=y'(I-P_X)y$ of the elements of the vector $\tilde e=(I-P_X)y$ of least squares residuals. And letting (as in Section 7.3a) $d=L'y$ [where $L$ is an $N\times(N-P)$ matrix of rank $N-P$ such that $X'L=0$ and $L'L=I$] and recalling result (3.21), observe that
\[ S=d'd. \]
Moreover,
\[ d\sim N(0,\sigma^2 I) \]
and hence $(1/\sigma)d\sim N(0,I)$, so that
\[ S/\sigma^2=[(1/\sigma)d]'[(1/\sigma)d]\sim\chi^2(N-P). \tag{6.1} \]
Confidence intervals/bounds. For $0<\alpha<1$, denote by $\bar\chi^2_\alpha(N-P)$ or simply by $\bar\chi^2_\alpha$ the upper $100\alpha\%$ point of the $\chi^2(N-P)$ distribution. Then, upon applying result (6.1), we obtain as a $100(1-\dot\gamma)\%$ "one-sided" confidence interval for $\sigma$, the interval
\[ \sqrt{\frac{S}{\bar\chi^2_{\dot\gamma}}}\le\sigma<\infty, \tag{6.2} \]
so that $\sqrt{S/\bar\chi^2_{\dot\gamma}}$ is a $100(1-\dot\gamma)\%$ lower "confidence bound" for $\sigma$. Similarly, the one-sided interval
\[ 0<\sigma\le\sqrt{\frac{S}{\bar\chi^2_{1-\dot\gamma}}} \tag{6.3} \]
constitutes a $100(1-\dot\gamma)\%$ confidence interval for $\sigma$, and $\sqrt{S/\bar\chi^2_{1-\dot\gamma}}$ is interpretable as a $100(1-\dot\gamma)\%$ upper confidence bound. And upon letting $\dot\gamma_1$ and $\dot\gamma_2$ represent any two strictly positive scalars such that $\dot\gamma_1+\dot\gamma_2=\dot\gamma$, we obtain as a $100(1-\dot\gamma)\%$ "two-sided" confidence interval for $\sigma$, the interval
\[ \sqrt{\frac{S}{\bar\chi^2_{\dot\gamma_1}}}\le\sigma\le\sqrt{\frac{S}{\bar\chi^2_{1-\dot\gamma_2}}}. \tag{6.4} \]
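A minimal computational sketch of interval (6.4), assuming Python with NumPy and SciPy and assuming $X$ has full column rank (the function name and simulated data are illustrative):

```python
import numpy as np
from scipy.stats import chi2

def sigma_interval(y, X, gam1=0.025, gam2=0.025):
    """Two-sided 100(1 - gam1 - gam2)% confidence interval (6.4) for sigma,
    based on S / sigma^2 ~ chi-square(N - P).  Assumes X has full column rank."""
    N, P = X.shape
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    S = resid @ resid                               # S = y'(I - P_X)y
    upper_pt = chi2.ppf(1 - gam1, N - P)            # chi-bar^2_{gam1}
    lower_pt = chi2.ppf(gam2, N - P)                # chi-bar^2_{1 - gam2}
    return np.sqrt(S / upper_pt), np.sqrt(S / lower_pt)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(40), rng.standard_normal(40)])
y = X @ np.array([2.0, -1.0]) + 1.5 * rng.standard_normal(40)
print(sigma_interval(y, X))    # should tend to bracket the true sigma = 1.5
```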
Tests of hypotheses. Let $\sigma_0$ represent a hypothesized value of $\sigma$ (where $0<\sigma_0<\infty$). Further, let $T=S/\sigma_0^2$. Then, corresponding to the confidence interval (6.4) is a size-$\dot\gamma$ (nonrandomized) test of the null hypothesis $H_0\colon\sigma=\sigma_0$ versus the alternative hypothesis $H_1\colon\sigma\ne\sigma_0$ with critical region
\[ C=\{y : T<\bar\chi^2_{1-\dot\gamma_2}\ \text{or}\ T>\bar\chi^2_{\dot\gamma_1}\} \tag{6.5} \]
consisting of all values of $y$ for which $\sigma_0$ is larger than $\sqrt{S/\bar\chi^2_{1-\dot\gamma_2}}$ or smaller than $\sqrt{S/\bar\chi^2_{\dot\gamma_1}}$. And corresponding to the confidence interval (6.2) is a (nonrandomized) test of the null hypothesis $H_0^+\colon\sigma\le\sigma_0$ versus the alternative hypothesis $H_1^+\colon\sigma>\sigma_0$ with critical region
\[ C^+=\{y : T>\bar\chi^2_{\dot\gamma}\} \tag{6.6} \]
consisting of all values of $y$ for which the lower confidence bound $\sqrt{S/\bar\chi^2_{\dot\gamma}}$ exceeds $\sigma_0$. Similarly, corresponding to the confidence interval (6.3) is a (nonrandomized) test of the null hypothesis $H_0^-\colon\sigma\ge\sigma_0$ versus the alternative hypothesis $H_1^-\colon\sigma<\sigma_0$ with critical region
\[ C^-=\{y : T<\bar\chi^2_{1-\dot\gamma}\} \tag{6.7} \]
consisting of all values of $y$ for which $\sigma_0$ exceeds the upper confidence bound $\sqrt{S/\bar\chi^2_{1-\dot\gamma}}$.
The tests of $H_0^+$ and $H_0^-$ with critical regions $C^+$ and $C^-$, respectively, like the test of $H_0$ with critical region $C$, are of size $\dot\gamma$. To see this, it suffices to observe that
\[ \Pr(T>\bar\chi^2_{\dot\gamma})=\Pr[S/\sigma^2>(\sigma_0/\sigma)^2\bar\chi^2_{\dot\gamma}] \tag{6.8} \]
and
\[ \Pr(T<\bar\chi^2_{1-\dot\gamma})=\Pr[S/\sigma^2<(\sigma_0/\sigma)^2\bar\chi^2_{1-\dot\gamma}], \tag{6.9} \]
which implies that $\Pr(T>\bar\chi^2_{\dot\gamma})$ is greater than, equal to, or less than $\dot\gamma$ and $\Pr(T<\bar\chi^2_{1-\dot\gamma})$ less than, equal to, or greater than $\dot\gamma$ depending on whether $\sigma>\sigma_0$, $\sigma=\sigma_0$, or $\sigma<\sigma_0$.
Translation invariance. The confidence intervals (6.2), (6.3), and (6.4) and the tests of $H_0$, $H_0^+$, and $H_0^-$ with critical regions $C$, $C^+$, and $C^-$ are translation invariant. That is, the results produced by these procedures are unaffected when for any $P\times 1$ vector $k$, the value of the vector $y$ is replaced by the value of the vector $y+Xk$. To see this, it suffices to observe that the procedures depend on $y$ only through the value of the vector $\tilde e=(I-P_X)y$ of least squares residuals and that
\[ (I-P_X)(y+Xk)=(I-P_X)y. \]
Unbiasedness. The test of H0C versus H1C with critical region C C and the test of H0 versus H1
with critical region C are both unbiased, as is evident from results (6.8) and (6.9). In fact, they are
both strictly unbiased. In contrast, the test of H0 versus H1 with critical region C is unbiased only
if P1 (and P2 D P P1 ) are chosen judiciously. Let us consider how to choose P1 so as to achieve
unbiasedness—it is not a simple matter of setting P1 D P=2.
Let ./ represent the power function of a (possibly randomized) size- P test of H0 versus H1 .
And suppose that (as in the case of the test with critical region C ), the test is among those size- P tests
with a critical function that is expressible as a function, say .T /, of T. Then, ./ D EŒ.T /, and
Z 1
1 ./ D EŒ1 .T / D Œ1 .t/ h.t/ dt; (6.10)
0
where h./ is the pdf of the distribution of T.
The pdf of the distribution of T is derivable from the pdf of the 2 .N P / distribution. Let
U D S=. 2 /, so that U  2 .N P /. And observe that U D .0 =/2 T. Further, let g./ represent
the pdf of the 2 .N P / distribution. Then, upon recalling result (6.1.16), we find that for t > 0,
h.t/ D .0 =/2 gŒ.0 =/2 t
1 2
D .N P /=2
.0 =/N P t Œ.N P /=2 1 e .0 =/ t =2 (6.11)
€Œ.N P /=2 2 

—for t  0, h.t/ D 0.
A necessary and sufficient condition for the test to be unbiased is that $\pi(\sigma)$ attain its minimum value (with respect to $\sigma$) at $\sigma = \sigma_0$ or, equivalently, that $1 - \pi(\sigma)$ attain its maximum value at $\sigma = \sigma_0$—for the test to be strictly unbiased, it is necessary and sufficient that $1 - \pi(\sigma)$ attain its maximum value at $\sigma = \sigma_0$ and at no other value of $\sigma$. And if $1 - \pi(\sigma)$ attains its maximum value at $\sigma = \sigma_0$, then
    $\dfrac{d[1 - \pi(\sigma)]}{d\sigma}\Big|_{\sigma=\sigma_0} = 0.$   (6.12)
Now, suppose that the test is such that $\lim_{\sigma\to 0}[1 - \pi(\sigma)]$ and $\lim_{\sigma\to\infty}[1 - \pi(\sigma)]$ are both equal to 0 (as in the case of the test with critical region $C$) or, more generally, that both of these limits are smaller than $1 - \gamma$. Suppose also that the test is such that condition (6.12) is satisfied and such that $d[1 - \pi(\sigma)]/d\sigma \ne 0$ for $\sigma \ne \sigma_0$. Then, $1 - \pi(\sigma)$ attains its maximum value at $\sigma = \sigma_0$ and at no other value of $\sigma$, and hence the test is unbiased and, in fact, is strictly unbiased.
The derivative of $1 - \pi(\sigma)$ is expressible as follows:
    $\dfrac{d[1 - \pi(\sigma)]}{d\sigma} = \int_0^{\infty} [1 - \phi(t)]\, \dfrac{dh(t)}{d\sigma}\, dt = \sigma^{-1}(N-P) \int_0^{\infty} [1 - \phi(t)] \Big[\dfrac{(\sigma_0/\sigma)^2 t}{N-P} - 1\Big] h(t)\, dt.$   (6.13)
Further, upon recalling the relationship $U = (\sigma_0/\sigma)^2 T$ between the random variable $U$ [$= S/\sigma^2$] that has a $\chi^2(N-P)$ distribution with pdf $g(\cdot)$ and the random variable $T$ [$= S/\sigma_0^2$] and upon introducing a change of variable, we find that $1 - \pi(\sigma)$ and its derivative are reexpressible in the form
    $1 - \pi(\sigma) = \int_0^{\infty} \{1 - \phi[(\sigma/\sigma_0)^2 u]\}\, g(u)\, du,$   (6.14)
and
    $\dfrac{d[1 - \pi(\sigma)]}{d\sigma} = \sigma^{-1}(N-P) \int_0^{\infty} \{1 - \phi[(\sigma/\sigma_0)^2 u]\} \Big[\dfrac{u}{N-P} - 1\Big] g(u)\, du.$   (6.15)
And in the special case of the test with critical region $C$,
    $1 - \pi(\sigma) = \int_{\bar\chi^2_{1-\gamma+\gamma_1}(\sigma_0/\sigma)^2}^{\bar\chi^2_{\gamma_1}(\sigma_0/\sigma)^2} g(u)\, du,$   (6.16)
and
    $\dfrac{d[1 - \pi(\sigma)]}{d\sigma} = \sigma^{-1}(N-P) \int_{\bar\chi^2_{1-\gamma+\gamma_1}(\sigma_0/\sigma)^2}^{\bar\chi^2_{\gamma_1}(\sigma_0/\sigma)^2} \Big[\dfrac{u}{N-P} - 1\Big] g(u)\, du.$   (6.17)
In the further special case where $\sigma = \sigma_0$, result (6.17) is reexpressible as
    $\dfrac{d[1 - \pi(\sigma)]}{d\sigma}\Big|_{\sigma=\sigma_0} = \sigma_0^{-1}(N-P) \int_{\bar\chi^2_{1-\gamma+\gamma_1}}^{\bar\chi^2_{\gamma_1}} \Big[\dfrac{u}{N-P} - 1\Big] g(u)\, du.$   (6.18)
For what value or values of $\gamma_1$ (in the interval $0 < \gamma_1 < \gamma$) is expression (6.18) equal to 0? Note that the integrand of the integral in expression (6.18) is greater than 0 for $u > N-P$ and less than 0 for $u < N-P$ [and that $N-P = E(U)$]. And assume that $\gamma$ is sufficiently small that $\bar\chi^2_{\gamma} > N-P$—the median of a chi-square distribution is smaller than the mean (e.g., Sen 1989), so that $\bar\chi^2_{\gamma} > N-P \Rightarrow \bar\chi^2_{1-\gamma} < N-P$. Then, the value or values of $\gamma_1$ for which expression (6.18) equals 0 are those for which
    $\int_{N-P}^{\bar\chi^2_{\gamma_1}} \Big[\dfrac{u}{N-P} - 1\Big] g(u)\, du = \int_{\bar\chi^2_{1-\gamma+\gamma_1}}^{N-P} \Big[1 - \dfrac{u}{N-P}\Big] g(u)\, du.$   (6.19)
Both the left and right sides of equation (6.19) are strictly positive. As $\gamma_1$ increases from 0 to $\gamma$ (its upper limit), the left side of equation (6.19) decreases from $\int_{N-P}^{\infty} [u/(N-P) - 1]\, g(u)\, du$ to $\int_{N-P}^{\bar\chi^2_{\gamma}} [u/(N-P) - 1]\, g(u)\, du$, and the right side increases from $\int_{\bar\chi^2_{1-\gamma}}^{N-P} [1 - u/(N-P)]\, g(u)\, du$ to $\int_{0}^{N-P} [1 - u/(N-P)]\, g(u)\, du$. Assume that $\gamma$ is such that
    $\int_{N-P}^{\infty} [u/(N-P) - 1]\, g(u)\, du \;\ge\; \int_{\bar\chi^2_{1-\gamma}}^{N-P} [1 - u/(N-P)]\, g(u)\, du$
and is also such that
    $\int_{0}^{N-P} [1 - u/(N-P)]\, g(u)\, du \;\ge\; \int_{N-P}^{\bar\chi^2_{\gamma}} [u/(N-P) - 1]\, g(u)\, du$
—otherwise, there would not exist any solution (for $\gamma_1$) to equation (6.19). Then, there exists a unique value, say $\gamma_1^*$, of $\gamma_1$ that is a solution to equation (6.19).
Suppose the test (of $H_0$ versus $H_1$) is that with critical region $C$ and that $\gamma_1 = \gamma_1^*$. That is, suppose the test is that with critical region
    $C^* = \{\, y : T < \bar\chi^2_{1-\gamma_2^*} \;\text{or}\; T > \bar\chi^2_{\gamma_1^*} \,\},$   (6.20)
where $\gamma_2^* = \gamma - \gamma_1^*$. Then,
    $\dfrac{d[1 - \pi(\sigma)]}{d\sigma} = \sigma^{-1}(N-P) \int_{\bar\chi^2_{1-\gamma+\gamma_1^*}(\sigma_0/\sigma)^2}^{\bar\chi^2_{\gamma_1^*}(\sigma_0/\sigma)^2} \Big[\dfrac{u}{N-P} - 1\Big] g(u)\, du$
and
    $\dfrac{d[1 - \pi(\sigma)]}{d\sigma}\Big|_{\sigma=\sigma_0} = 0.$
To conclude that the test is unbiased (and, in fact, strictly unbiased), it remains only to show that $d[1 - \pi(\sigma)]/d\sigma \ne 0$ for $\sigma \ne \sigma_0$ or, equivalently, that
    $\int_{\bar\chi^2_{1-\gamma+\gamma_1^*}(\sigma_0/\sigma)^2}^{\bar\chi^2_{\gamma_1^*}(\sigma_0/\sigma)^2} \Big[\dfrac{u}{N-P} - 1\Big] g(u)\, du = 0$   (6.21)
implies that $\sigma = \sigma_0$.
Suppose that $\sigma$ satisfies condition (6.21). Then,
    $\bar\chi^2_{\gamma_1^*}(\sigma_0/\sigma)^2 > N-P > \bar\chi^2_{1-\gamma+\gamma_1^*}(\sigma_0/\sigma)^2,$
since otherwise $u/(N-P) - 1$ would either be less than 0 for all values of $u$ between $\bar\chi^2_{1-\gamma+\gamma_1^*}(\sigma_0/\sigma)^2$ and $\bar\chi^2_{\gamma_1^*}(\sigma_0/\sigma)^2$ or greater than 0 for all such values. Thus, $\sigma$ is such that
    $\int_{N-P}^{\bar\chi^2_{\gamma_1^*}(\sigma_0/\sigma)^2} \Big[\dfrac{u}{N-P} - 1\Big] g(u)\, du = \int_{\bar\chi^2_{1-\gamma+\gamma_1^*}(\sigma_0/\sigma)^2}^{N-P} \Big[1 - \dfrac{u}{N-P}\Big] g(u)\, du.$
Moreover, if $\sigma < \sigma_0$ (in which case $\sigma_0/\sigma > 1$), then
    $\int_{N-P}^{\bar\chi^2_{\gamma_1^*}(\sigma_0/\sigma)^2} \Big[\dfrac{u}{N-P} - 1\Big] g(u)\, du \;>\; \int_{N-P}^{\bar\chi^2_{\gamma_1^*}} \Big[\dfrac{u}{N-P} - 1\Big] g(u)\, du$
    $= \int_{\bar\chi^2_{1-\gamma+\gamma_1^*}}^{N-P} \Big[1 - \dfrac{u}{N-P}\Big] g(u)\, du \;>\; \int_{\bar\chi^2_{1-\gamma+\gamma_1^*}(\sigma_0/\sigma)^2}^{N-P} \Big[1 - \dfrac{u}{N-P}\Big] g(u)\, du.$
Similarly, if $\sigma > \sigma_0$ (in which case $\sigma_0/\sigma < 1$), then
    $\int_{N-P}^{\bar\chi^2_{\gamma_1^*}(\sigma_0/\sigma)^2} \Big[\dfrac{u}{N-P} - 1\Big] g(u)\, du \;<\; \int_{N-P}^{\bar\chi^2_{\gamma_1^*}} \Big[\dfrac{u}{N-P} - 1\Big] g(u)\, du$
    $= \int_{\bar\chi^2_{1-\gamma+\gamma_1^*}}^{N-P} \Big[1 - \dfrac{u}{N-P}\Big] g(u)\, du \;<\; \int_{\bar\chi^2_{1-\gamma+\gamma_1^*}(\sigma_0/\sigma)^2}^{N-P} \Big[1 - \dfrac{u}{N-P}\Big] g(u)\, du.$
Thus, $\sigma \ge \sigma_0$ and $\sigma \le \sigma_0$ (since if $\sigma < \sigma_0$, we arrive at a contradiction, and if $\sigma > \sigma_0$, we also arrive at a contradiction), and hence $\sigma = \sigma_0$.
We have established that the size-$\gamma$ test of $H_0$ versus $H_1$ with critical region $C^*$ is a strictly unbiased test. The value $\gamma_1^*$ of $\gamma_1$ that is a solution to equation (6.19) and that is needed to implement this test can be determined by, for example, employing the method of bisection. In that regard, it can be shown (and is worth noting) that for any constants $c_0$ and $c_1$ such that $\infty \ge c_1 > c_0 \ge 0$,
    $\int_{c_0}^{c_1} \Big[\dfrac{u}{N-P} - 1\Big] g(u)\, du = G^*(c_1) - G(c_1) - [G^*(c_0) - G(c_0)],$   (6.22)
where $G(\cdot)$ is the cdf of the $\chi^2(N-P)$ distribution and $G^*(\cdot)$ is the cdf of the $\chi^2(N-P+2)$ distribution. It is also worth noting that equation (6.19) does not involve $\sigma_0$ and hence that $\gamma_1^*$ does not vary with the choice of $\sigma_0$.
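The bisection computation just described can be sketched as follows (assuming SciPy is available; the function name gamma1_star and the tolerance are illustrative, not from the text). It solves equation (6.19) for $\gamma_1^*$ by driving expression (6.18) to zero, with the integrals evaluated via identity (6.22).

# A sketch (assuming SciPy) of solving equation (6.19) for gamma_1* by bisection,
# using identity (6.22): the integral of [u/(N-P) - 1] g(u) du over (c0, c1) equals
# [Gstar(c1) - G(c1)] - [Gstar(c0) - G(c0)], where G is the chi-square(N-P) cdf and
# Gstar is the chi-square(N-P+2) cdf.
from scipy.stats import chi2

def gamma1_star(gamma, df, tol=1e-10):
    G, Gstar = chi2(df), chi2(df + 2)

    def integral(c0, c1):                       # right-hand side of identity (6.22)
        return (Gstar.cdf(c1) - G.cdf(c1)) - (Gstar.cdf(c0) - G.cdf(c0))

    def deriv_at_sigma0(g1):                    # proportional to expression (6.18)
        c1 = G.ppf(1 - g1)                      # upper 100*g1 % point
        c0 = G.ppf(1 - (1 - gamma + g1))        # upper 100*(1 - gamma + g1) % point
        return integral(c0, c1)

    lo, hi = 1e-12, gamma - 1e-12               # the root lies strictly between 0 and gamma
    while hi - lo > tol:                        # expression (6.18) is decreasing in g1
        mid = 0.5 * (lo + hi)
        if deriv_at_sigma0(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(gamma1_star(0.10, 10))   # approximately 0.035 (cf. the illustration in Subsection b)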
Corresponding to the size-$\gamma$ strictly unbiased test of $H_0$ versus $H_1$ with critical region $C^*$ is the following $100(1-\gamma)\%$ confidence interval for $\sigma$:
    $\sqrt{S/\bar\chi^2_{\gamma_1^*}} \;\le\; \sigma \;\le\; \sqrt{S/\bar\chi^2_{1-\gamma_2^*}}.$   (6.23)
This interval is the special case of the $100(1-\gamma)\%$ confidence interval (6.4) obtained upon setting $\gamma_1 = \gamma_1^*$ and $\gamma_2 = \gamma_2^*$ ($= \gamma - \gamma_1^*$). As is evident from the (strict) unbiasedness of the corresponding test, the $100(1-\gamma)\%$ confidence interval (6.23) is strictly unbiased in the sense that the probability $1-\gamma$ of its covering the true value of $\sigma$ is greater than the probability of its covering any value other than the true value.
Like the $100(1-\gamma)\%$ confidence interval (6.23), the $100(1-\gamma)\%$ lower confidence bound $\sqrt{S/\bar\chi^2_{\gamma}}$ and the $100(1-\gamma)\%$ upper confidence bound $\sqrt{S/\bar\chi^2_{1-\gamma}}$ (for $\sigma$) are strictly unbiased. The strict unbiasedness of the lower confidence bound follows from the strict unbiasedness of the size-$\gamma$ test of $H_0^+$ versus $H_1^+$ with critical region $C^+$ and is in the sense that $\Pr\big[\sqrt{S/\bar\chi^2_{\gamma}} \le \sigma_0\big] < 1-\gamma$ for any positive scalar $\sigma_0$ such that $\sigma_0 < \sigma$. Similarly, the strict unbiasedness of the upper confidence bound follows from the strict unbiasedness of the size-$\gamma$ test of $H_0^-$ versus $H_1^-$ with critical region $C^-$ and is in the sense that $\Pr\big[\sqrt{S/\bar\chi^2_{1-\gamma}} \ge \sigma_0\big] < 1-\gamma$ for any scalar $\sigma_0$ such that $\sigma_0 > \sigma$.
b. An illustration
Let us illustrate various of the results of Subsection a by using them to add to the results obtained earlier (in Sections 7.1, 7.2c, and 7.3d and in the final part of Section 7.3f) for the lettuce-yield data. Accordingly, let us take $y$ to be the $20 \times 1$ random vector whose observed value is the vector of lettuce yields. And suppose that $y$ follows the G–M model obtained upon taking the function $\delta(u)$ (that defines the response surface) to be the second-order polynomial (1.2) (where $u$ is the 3-dimensional column vector whose elements represent transformed amounts of Cu, Mo, and Fe). Suppose further that the distribution of the vector $e$ of residual effects is $N(0, \sigma^2 I)$.
Then, as is evident from the results of Section 7.1, $S$ (the residual sum of squares) equals 108.9407. And $N - P = 10$. Further, the usual (unbiased) point estimator $\hat\sigma^2 = S/(N-P)$ of $\sigma^2$ equals 10.89, and upon taking the square root of this value, we obtain 3.30 as an estimate of $\sigma$.
When $\gamma = 0.10$, the value $\gamma_1^*$ of $\gamma_1$ that is a solution to equation (6.19) is found to be 0.03495, and the corresponding value $\gamma_2^*$ of $\gamma_2$ is $\gamma_2^* = \gamma - \gamma_1^* = 0.06505$. And $\bar\chi^2_{.03495} = 19.446$, and $\bar\chi^2_{1-.06505} = \bar\chi^2_{.93495} = 4.258$. Thus, upon setting $S = 108.9407$, $\gamma_1^* = 0.03495$, and $\gamma_2^* = 0.06505$ in the interval (6.23), we obtain as a 90% strictly unbiased confidence interval for $\sigma$ the interval
    $2.37 \le \sigma \le 5.06.$
By way of comparison, the 90% confidence interval for $\sigma$ obtained upon setting $S = 108.9407$ and $\gamma_1 = \gamma_2 = 0.05$ in the interval (6.4) is
    $2.44 \le \sigma \le 5.26$
—this interval is not unbiased.
If (instead of obtaining a two-sided confidence interval for $\sigma$) we had chosen to obtain [as an application of interval (6.2)] a 90% "lower confidence bound," we would have obtained
    $2.61 \le \sigma < \infty.$
Similarly, if we had chosen to obtain [as an application of interval (6.3)] a 90% "upper confidence bound," we would have obtained
    $0 < \sigma \le 4.73.$
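The numerical results of this illustration can be reproduced with a few lines of code; the following sketch (assuming SciPy is available) takes $S = 108.9407$, $N - P = 10$, $\gamma = 0.10$, and the value $\gamma_1^* = 0.03495$ reported above as given.

# A sketch (assuming SciPy) reproducing the computations of the illustration.
from math import sqrt
from scipy.stats import chi2

S, df, gamma, g1 = 108.9407, 10, 0.10, 0.03495
g2 = gamma - g1

sigma_hat = sqrt(S / df)                                   # point estimate, about 3.30

# strictly unbiased 90% interval (6.23)
lo = sqrt(S / chi2.ppf(1 - g1, df))                        # about 2.37
hi = sqrt(S / chi2.ppf(g2, df))                            # about 5.06

# equal-tailed 90% interval (6.4) with gamma_1 = gamma_2 = 0.05
lo_eq = sqrt(S / chi2.ppf(1 - gamma / 2, df))              # about 2.44
hi_eq = sqrt(S / chi2.ppf(gamma / 2, df))                  # about 5.26

# 90% lower and upper confidence bounds, intervals (6.2) and (6.3)
lower_bound = sqrt(S / chi2.ppf(1 - gamma, df))            # about 2.61
upper_bound = sqrt(S / chi2.ppf(gamma, df))                # about 4.73

print(sigma_hat, (lo, hi), (lo_eq, hi_eq), lower_bound, upper_bound)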
c. Optimality
Are the tests of $H_0^+$, $H_0^-$, and $H_0$ (versus $H_1^+$, $H_1^-$, and $H_1$) with critical regions $C^+$, $C^-$, and $C^*$ and the corresponding confidence intervals (6.2), (6.3), and (6.23) optimal and, if so, in what sense? These questions are addressed in what follows. In the initial treatment, attention is restricted to translation-invariant procedures. Then, the results obtained in that context are extended to a broader class of procedures.
Translation-invariant procedures. As noted earlier (in Section 7.3a), the vector $d = L'y$ [where $L$ is an $N \times (N-P)$ matrix of full column rank such that $X'L = 0$ and $L'L = I$] is an $(N-P)$-dimensional vector of linearly independent error contrasts—an error contrast is a linear combination (of the elements of $y$) with an expected value equal to 0. And as is evident from the discussion of error contrasts in Section 5.9b, a (possibly randomized) test of $H_0^+$, $H_0^-$, or $H_0$ (versus $H_1^+$, $H_1^-$, or $H_1$) is translation invariant if and only if its critical function is expressible as a function, say $\phi(d)$, of $d$. Moreover, when the observed value of $d$ (rather than that of $y$) is regarded as the data vector, $S = d'd$ or, alternatively, $T = S/\sigma_0^2$ is a sufficient statistic—refer to the next-to-last part of Section 7.3a for some relevant discussion. Thus, corresponding to the test with critical function $\phi(d)$ is a (possibly randomized) test with critical function $E[\phi(d) \mid S]$ or $E[\phi(d) \mid T]$ that depends on $d$ only through the value of $S$ or $T$ and that has the same power function.
Now, consider the size-$\gamma$ translation-invariant test of the null hypothesis $H_0^+ : \sigma \le \sigma_0$ (versus the alternative hypothesis $H_1^+ : \sigma > \sigma_0$) with critical region $C^+ = \{\, y : T > \bar\chi^2_{\gamma} \,\}$. Further, let $\sigma^*$ represent any particular value of $\sigma$ greater than $\sigma_0$, let $h_0(\cdot)$ represent the pdf of the $\chi^2(N-P)$ distribution (which is the distribution of $T$ when $\sigma = \sigma_0$), let $h^*(\cdot)$ represent the pdf of the distribution of $T$ when $\sigma = \sigma^*$, and observe [in light of result (6.11)] that (for $t > 0$)
    $\dfrac{h^*(t)}{h_0(t)} = \Big(\dfrac{\sigma_0}{\sigma^*}\Big)^{N-P} e^{[1 - (\sigma_0/\sigma^*)^2]\, t/2}.$   (6.24)
Then, upon applying Theorem 7.4.1 (the Neyman–Pearson lemma) with $X = T$, $\theta = \sigma$, $\Theta = [\sigma_0, \infty)$, $\theta^{(0)} = \sigma_0$, and $\theta^* = \sigma^*$, we find that conditions (4.2) and (4.3) (of Theorem 7.4.1) are satisfied when the critical region is taken to be the set consisting of all values of $T$ for which $T > \bar\chi^2_{\gamma}$. And upon observing that the test of $H_0^+$ with critical region $C^+$ is such that $\Pr(y \in C^+) \le \gamma$ for $\sigma < \sigma_0$ as well as for $\sigma = \sigma_0$ and upon recalling the discussion following Theorem 7.4.1, we find that the test of $H_0^+$ with critical region $C^+$ is UMP among all (possibly randomized) level-$\gamma$ translation-invariant tests—note that the set consisting of all such tests is a subset of the set consisting of all (possibly randomized) translation-invariant tests for which the probability of rejecting $H_0^+$ is less than or equal to $\gamma$ when $\sigma = \sigma_0$.
By employing a similar argument, it can be shown that the size-$\gamma$ translation-invariant test of the null hypothesis $H_0^- : \sigma \ge \sigma_0$ (versus the alternative hypothesis $H_1^- : \sigma < \sigma_0$) with critical region $C^- = \{\, y : T < \bar\chi^2_{1-\gamma} \,\}$ is UMP among all (possibly randomized) level-$\gamma$ translation-invariant tests (of $H_0^-$ versus $H_1^-$).
It remains to consider the size-$\gamma$ translation-invariant test of the null hypothesis $H_0 : \sigma = \sigma_0$ (versus the alternative hypothesis $H_1 : \sigma \ne \sigma_0$) with critical region $C$. In the special case where $C = C^*$ (i.e., the special case where $\gamma_1 = \gamma_1^*$ and $\gamma_2 = \gamma_2^* = \gamma - \gamma_1^*$), this test is (strictly) unbiased. In fact, the test of $H_0$ with critical region $C^*$ is optimal in the sense that it is UMP among all (possibly randomized) level-$\gamma$ translation-invariant unbiased tests of $H_0$ (versus $H_1$).
Let us verify the optimality of this test. Accordingly, take $\phi(T)$ to be a function of $T$ that represents the critical function of a (translation-invariant, possibly randomized) test of $H_0$ (versus $H_1$). Further, denote by $\pi(\sigma)$ the power function of the test with critical function $\phi(T)$. And observe that (by definition) this test is of level $\gamma$ if $\pi(\sigma_0) \le \gamma$ or, equivalently, if
    $\int_0^{\infty} \phi(t)\, h_0(t)\, dt \le \gamma$   (6.25)
[where, as before, $h_0(\cdot)$ represents the pdf of the $\chi^2(N-P)$ distribution].
If the test with critical function $\phi(T)$ is of level $\gamma$ and is unbiased, then $\pi(\sigma_0) = \gamma$ (i.e., the test is of size $\gamma$) or, equivalently,
    $\int_0^{\infty} \phi(t)\, h_0(t)\, dt = \gamma,$   (6.26)
as becomes evident upon observing that $\pi(\cdot)$ is a continuous function—refer, e.g., to Lehmann and Romano (2005b, sec. 3.1). And if the test is of size $\gamma$ and is unbiased, then $d\pi(\sigma)/d\sigma\,|_{\sigma=\sigma_0} = 0$ or, equivalently,
    $\int_0^{\infty} \phi(t)\Big[\dfrac{t}{N-P} - 1\Big] h_0(t)\, dt = 0$   (6.27)
—refer to Subsection a.
Denote by $\sigma^*$ an arbitrary value of $\sigma$ other than $\sigma_0$. And consider the problem of determining the choice of the critical function $\phi(\cdot)$ that maximizes $\pi(\sigma^*)$ subject to the constraint that $\phi(\cdot)$ satisfy conditions (6.26) and (6.27). Note [in regard to the maximization of $\pi(\sigma^*)$] that
    $\pi(\sigma^*) = \int_0^{\infty} \phi(t)\, h^*(t)\, dt,$   (6.28)
where $h^*(\cdot)$ represents the pdf of the distribution of $T$ when $\sigma = \sigma^*$.
To show that any particular choice for $\phi(\cdot)$, say $\phi^*(\cdot)$, maximizes $\pi(\sigma^*)$ subject to the constraints imposed by conditions (6.26) and (6.27), it suffices [according to a generalized version of the Neyman–Pearson lemma stated by Lehmann and Romano (2005b) in the form of their Theorem 3.6.1] to demonstrate that $\phi^*(\cdot)$ satisfies conditions (6.26) and (6.27) and to establish the existence of constants $k_1$ and $k_2$ such that
    $\phi^*(t) = 1$ when $h^*(t) > k_1 h_0(t) + k_2 h_0(t)\{[t/(N-P)] - 1\}$, and $\phi^*(t) = 0$ when $h^*(t) < k_1 h_0(t) + k_2 h_0(t)\{[t/(N-P)] - 1\}$.   (6.29)
And condition (6.29) is reexpressible in the form
    $\phi^*(t) = 1$ when $k_1^* + k_2^* t < e^{bt}$, and $\phi^*(t) = 0$ when $k_1^* + k_2^* t > e^{bt}$,   (6.30)
where $k_1^* = (\sigma_0/\sigma^*)^{-(N-P)}(k_1 - k_2)$, $k_2^* = (\sigma_0/\sigma^*)^{-(N-P)}\, k_2/(N-P)$, and $b = (1/2)[1 - (\sigma_0/\sigma^*)^2]$. Moreover, among the choices for the critical function $\phi(\cdot)$ are choices that satisfy conditions (6.26) and (6.27) and for which $\pi(\sigma^*) \ge \gamma$, as is evident upon observing that one such choice is that obtained upon setting $\phi(t) = \gamma$ (for all $t$). Thus, if the critical function $\phi^*(\cdot)$ satisfies conditions (6.26) and (6.27) and is such that corresponding to every choice of $\sigma^*$ there exist constants $k_1$ and $k_2$ that satisfy condition (6.29) or, equivalently, condition (6.30), then the test with critical function $\phi^*(\cdot)$ is a size-$\gamma$ translation-invariant unbiased test and [since the tests for which the critical function $\phi(\cdot)$ is such that the test is of level $\gamma$ and is unbiased constitute a subset of those tests for which $\phi(\cdot)$ satisfies conditions (6.26) and (6.27)] is UMP among all level-$\gamma$ translation-invariant unbiased tests.
Suppose that the choice $\phi^*(\cdot)$ for $\phi(\cdot)$ is as follows:
    $\phi^*(t) = 1$ when $t < \bar\chi^2_{1-\gamma_2^*}$ or $t > \bar\chi^2_{\gamma_1^*}$, and $\phi^*(t) = 0$ when $\bar\chi^2_{1-\gamma_2^*} \le t \le \bar\chi^2_{\gamma_1^*}$.   (6.31)
And observe that for this choice of $\phi(\cdot)$, the test with critical function $\phi(\cdot)$ is identical to the (nonrandomized) test of $H_0$ with critical region $C^*$. By construction, this test is such that $\phi^*(\cdot)$ satisfies conditions (6.26) and (6.27). Thus, to verify that the test of $H_0$ with critical region $C^*$ is UMP among all (possibly randomized) level-$\gamma$ translation-invariant unbiased tests, it suffices to show that (corresponding to every choice of $\sigma^*$) there exist constants $k_1^*$ and $k_2^*$ such that $\phi^*(\cdot)$ is expressible in the form (6.30).
Accordingly, suppose that $k_1^*$ and $k_2^*$ are the constants defined (implicitly) by taking $k_1^*$ and $k_2^*$ to be such that
    $k_1^* + k_2^* c_0 = e^{b c_0}$ and $k_1^* + k_2^* c_1 = e^{b c_1},$
where $c_0 = \bar\chi^2_{1-\gamma_2^*}$ and $c_1 = \bar\chi^2_{\gamma_1^*}$. Further, let $u(t) = e^{bt} - (k_1^* + k_2^* t)$ (a function of $t$ with domain $0 < t < \infty$), and observe that $u(c_1) = u(c_0) = 0$. Observe also that
    $\dfrac{du(t)}{dt} = b e^{bt} - k_2^*$ and $\dfrac{d^2 u(t)}{dt^2} = b^2 e^{bt} > 0,$
so that $u(\cdot)$ is a strictly convex function and its derivative is a strictly increasing function. Then, clearly,
    $u(t) < 0$ for $c_0 < t < c_1$.   (6.32)
And $du(t)/dt < 0$ for $t \le c_0$ and $du(t)/dt > 0$ for $t \ge c_1$, which implies that
    $u(t) > 0$ for $t < c_0$ and $t > c_1$
and hence, in combination with result (6.32), implies that $\phi^*(t)$ is expressible in the form (6.30), thereby completing the verification that the test of $H_0$ (versus $H_1$) with critical region $C^*$ is UMP among all level-$\gamma$ translation-invariant unbiased tests.
Corresponding to the test of $H_0$ with critical region $C^*$ is the $100(1-\gamma)\%$ translation-invariant strictly unbiased confidence interval (6.23). Among translation-invariant confidence sets (for $\sigma$) that have a probability of coverage greater than or equal to $1-\gamma$ and that are unbiased (in the sense that the probability of covering the true value of $\sigma$ is greater than or equal to the probability of covering any value other than the true value), the confidence interval (6.23) is optimal; it is optimal in the sense that the probability of covering any value of $\sigma$ other than the true value is minimized.
The $100(1-\gamma)\%$ translation-invariant confidence interval (6.2) is optimal in a different sense; among all translation-invariant confidence sets (for $\sigma$) that have a probability of coverage greater than or equal to $1-\gamma$, it is optimal in the sense that it is the confidence set that minimizes the probability of covering values of $\sigma$ smaller than the true value. Analogously, among all translation-invariant confidence sets that have a probability of coverage greater than or equal to $1-\gamma$, the $100(1-\gamma)\%$ translation-invariant confidence interval (6.3) is optimal in the sense that it is the confidence set that minimizes the probability of covering values of $\sigma$ larger than the true value.
Optimality in the absence of a restriction to translation-invariant procedures. Let $z = O'y$ represent an observable $N$-dimensional random column vector that follows the canonical form of the G–M model (as defined in Section 7.3a) in the special case where $M = P$. Then, $\alpha$ and its least squares estimator $\hat\alpha$ are $P$-dimensional, and $z = (\hat\alpha', d')'$, where (as before) $d = L'y$, so that the critical function of any (possibly randomized) test of $H_0^+$, $H_0^-$, or $H_0$ is expressible as a function of $\hat\alpha$ and $d$. Moreover, $\hat\alpha$ and $T = S/\sigma_0^2 = d'd/\sigma_0^2 = y'(I - P_X)y/\sigma_0^2$ form a sufficient statistic, as is evident upon recalling the results of the next-to-last part of Section 7.3a. And corresponding to any (possibly randomized) test of $H_0^+$, $H_0^-$, or $H_0$, say one with critical function $\tilde\phi(\hat\alpha, d)$, there is a test with a critical function, say $\phi(T, \hat\alpha)$, that depends on $d$ only through the value of $T$ and that has the same power function—take $\phi(T, \hat\alpha) = E[\tilde\phi(\hat\alpha, d) \mid T, \hat\alpha]$. Thus, for present purposes, it suffices to restrict attention to tests with critical functions that are expressible in the form $\phi(T, \hat\alpha)$.
Suppose that $\phi(T, \hat\alpha)$ is the critical function of a level-$\gamma$ test of the null hypothesis $H_0^+ : \sigma \le \sigma_0$ versus the alternative hypothesis $H_1^+ : \sigma > \sigma_0$, and consider the choice of the function $\phi(T, \hat\alpha)$. Further, let $\pi(\sigma, \alpha)$ represent the power function of the test. Then, by definition, $\pi(\sigma, \alpha) = E[\phi(T, \hat\alpha)]$, and, in particular, $\pi(\sigma_0, \alpha) = E_0[\phi(T, \hat\alpha)]$, where $E_0$ represents the expectation operator in the special case where $\sigma = \sigma_0$. Since the test is of level $\gamma$, $\pi(\sigma_0, \alpha) \le \gamma$ (for all $\alpha$).
Now, suppose that the level-$\gamma$ test with critical function $\phi(T, \hat\alpha)$ and power function $\pi(\sigma, \alpha)$ is unbiased. Then, upon observing that $\pi(\cdot, \cdot)$ is a continuous function, it follows—refer, e.g., to Lehmann and Romano (2005b, sec. 4.1)—that
    $\pi(\sigma_0, \alpha) = \gamma$ (for all $\alpha$).   (6.33)
Clearly,
    $\pi(\sigma, \alpha) = E\{E[\phi(T, \hat\alpha) \mid \hat\alpha]\}.$   (6.34)
In particular, $\pi(\sigma_0, \alpha) = E_0\{E_0[\phi(T, \hat\alpha) \mid \hat\alpha]\}$, so that result (6.33) can be restated as
    $E_0\{E_0[\phi(T, \hat\alpha) \mid \hat\alpha]\} = \gamma$ (for all $\alpha$).   (6.35)
Moreover, with $\sigma$ fixed (at $\sigma_0$), $\hat\alpha$ is a complete sufficient statistic—refer to the next-to-last part of Section 7.3a. Thus, $E_0[\phi(T, \hat\alpha) \mid \hat\alpha]$ does not depend on $\alpha$, and condition (6.35) is equivalent to the condition
    $E_0[\phi(T, \hat\alpha) \mid \hat\alpha] = \gamma$ (wp1).   (6.36)
Let $\sigma^*$ represent any particular value of $\sigma$ greater than $\sigma_0$, let $\alpha^*$ represent any particular value of $\alpha$, and let $E^*$ represent the expectation operator in the special case where $\sigma = \sigma^*$ and $\alpha = \alpha^*$. Then, in light of result (6.34), the choice of the critical function $\phi(T, \hat\alpha)$ that maximizes $\pi(\sigma^*, \alpha^*)$ subject to the constraint (6.36), and hence subject to the constraint (6.33), is that obtained by choosing (for each value of $\hat\alpha$) $\phi(\cdot, \hat\alpha)$ so as to maximize $E^*[\phi(T, \hat\alpha) \mid \hat\alpha]$ subject to the constraint $E_0[\phi(T, \hat\alpha) \mid \hat\alpha] = \gamma$. Moreover, upon observing that $T$ is distributed independently of $\hat\alpha$ and hence that the distribution of $T$ conditional on $\hat\alpha$ is the same as the unconditional distribution of $T$, and upon proceeding as in the preceding part of the present subsection (in determining the optimal translation-invariant test), we find that (for every choice of $\sigma^*$ and $\alpha^*$ and for every value of $\hat\alpha$) $E^*[\phi(T, \hat\alpha) \mid \hat\alpha]$ can be maximized subject to the constraint $E_0[\phi(T, \hat\alpha) \mid \hat\alpha] = \gamma$ by taking
    $\phi(t, \hat\alpha) = 1$ when $t > \bar\chi^2_{\gamma}$, and $\phi(t, \hat\alpha) = 0$ when $t \le \bar\chi^2_{\gamma}$.   (6.37)
And it follows that the test with critical function (6.37) is UMP among all tests of $H_0^+$ (versus $H_1^+$) with a critical function that satisfies condition (6.33).
Clearly, the test with critical function (6.37) is identical to the test with critical region $C^+$, which is the UMP level-$\gamma$ translation-invariant test. And upon recalling (from Subsection a) that the test with critical region $C^+$ is unbiased and upon observing that the tests with a critical function for which the test is of level $\gamma$ and is unbiased constitute a subset of the tests with a critical function that satisfies condition (6.33), we conclude that the test with critical region $C^+$ is UMP among all level-$\gamma$ unbiased tests of $H_0^+$ (versus $H_1^+$).
It can be shown in similar fashion that the size-$\gamma$ translation-invariant unbiased test of $H_0^-$ versus $H_1^-$ with critical region $C^-$ is UMP among all level-$\gamma$ unbiased tests of $H_0^-$ versus $H_1^-$. However, as pointed out by Lehmann and Romano (2005b, sec. 3.9.1), the result on the optimality of the test of $H_0^+$ with critical region $C^+$ can be strengthened in a way that does not extend to the test of $H_0^-$ with critical region $C^-$. In the case of the test of $H_0^+$, the restriction to unbiased tests is unnecessary: it can be shown that the test of $H_0^+$ versus $H_1^+$ with critical region $C^+$ is UMP among all level-$\gamma$ tests, not just among those level-$\gamma$ tests that are unbiased.
It remains to consider the optimality of the test of the null hypothesis $H_0 : \sigma = \sigma_0$ (versus the alternative hypothesis $H_1 : \sigma \ne \sigma_0$) with critical region $C^*$. Accordingly, suppose that $\phi(T, \hat\alpha)$ is the critical function of a (possibly randomized) test of $H_0$ (versus $H_1$) with power function $\pi(\sigma, \alpha)$.
If the test is of level $\gamma$ and is unbiased, then [in light of the continuity of the function $\pi(\cdot, \cdot)$] $\pi(\sigma_0, \alpha) = \gamma$ (for all $\alpha$) or, equivalently,
    $\int_{R^P} \int_0^{\infty} \phi(t, \hat\alpha)\, h_0(t)\, f_0(\hat\alpha; \alpha)\, dt\, d\hat\alpha = \gamma$ (for all $\alpha$),   (6.38)
where $f_0(\cdot\,; \alpha)$ represents the pdf of the $N(\alpha, \sigma_0^2 I)$ distribution (which is the distribution of $\hat\alpha$ when $\sigma = \sigma_0$) and where (as before) $h_0(\cdot)$ represents the pdf of the $\chi^2(N-P)$ distribution (which is the distribution of $T$ when $\sigma = \sigma_0$)—condition (6.38) is analogous to condition (6.26). Moreover, if the test is such that condition (6.38) is satisfied and if the test is unbiased, then $\partial\pi(\sigma, \alpha)/\partial\sigma\,|_{\sigma=\sigma_0} = 0$ (for all $\alpha$) or, equivalently,
    $\int_{R^P} \int_0^{\infty} \phi(t, \hat\alpha) \Big[\dfrac{t}{N-P} - 1\Big] h_0(t)\, f_0(\hat\alpha; \alpha)\, dt\, d\hat\alpha = 0$ (for all $\alpha$),   (6.39)
analogous to condition (6.27)—the equivalence of condition (6.39) can be verified via a relatively straightforward exercise.
As in the case of testing the null hypothesis $H_0^+$,
    $\pi(\sigma, \alpha) = E\{E[\phi(T, \hat\alpha) \mid \hat\alpha]\}.$   (6.40)
Moreover, condition (6.38) is equivalent to the condition
    $\int_0^{\infty} \phi(t, \hat\alpha)\, h_0(t)\, dt = \gamma$ (wp1),   (6.41)
and condition (6.39) is equivalent to the condition
    $\int_0^{\infty} \phi(t, \hat\alpha) \Big[\dfrac{t}{N-P} - 1\Big] h_0(t)\, dt = 0$ (wp1),   (6.42)
as is evident upon recalling that, with $\sigma$ fixed (at $\sigma_0$), $\hat\alpha$ is a complete sufficient statistic.
Denote by $\alpha^*$ any particular value of $\alpha$ and by $\sigma^*$ any particular value of $\sigma$ other than $\sigma_0$, and (as before) let $h^*(\cdot)$ represent the pdf of the distribution of $T$ in the special case where $\sigma = \sigma^*$. And observe (in light of the statistical independence of $T$ and $\hat\alpha$) that when $\sigma = \sigma^*$,
    $E[\phi(T, \hat\alpha) \mid \hat\alpha] = \int_0^{\infty} \phi(t, \hat\alpha)\, h^*(t)\, dt.$   (6.43)
Observe also [in light of result (6.43) along with result (6.40) and in light of the equivalence of conditions (6.41) and (6.42) to conditions (6.38) and (6.39)] that to maximize $\pi(\sigma^*, \alpha^*)$ [with respect to the choice of the critical function $\phi(\cdot, \cdot)$] subject to the constraint that $\phi(\cdot, \cdot)$ satisfy conditions (6.38) and (6.39), it suffices to take (for each value of $\hat\alpha$) $\phi(\cdot, \hat\alpha)$ to be the critical function that maximizes
    $\int_0^{\infty} \phi(t, \hat\alpha)\, h^*(t)\, dt$   (6.44)
subject to the constraints imposed by the conditions
    $\int_0^{\infty} \phi(t, \hat\alpha)\, h_0(t)\, dt = \gamma$ and $\int_0^{\infty} \phi(t, \hat\alpha) \Big[\dfrac{t}{N-P} - 1\Big] h_0(t)\, dt = 0.$   (6.45)
A solution for $\phi(\cdot, \hat\alpha)$ to the latter constrained maximization problem can be obtained by applying the results obtained earlier (in the first part of the present subsection) in choosing a translation-invariant critical function $\phi(\cdot)$ so as to maximize the quantity (6.28) subject to the constraints imposed by conditions (6.26) and (6.27). Upon doing so, we find that, among those choices for the critical function $\phi(T, \hat\alpha)$ that satisfy conditions (6.38) and (6.39), $\pi(\sigma^*, \alpha^*)$ can be maximized (for every choice of $\sigma^*$ and $\alpha^*$) by taking
    $\phi(t, \hat\alpha) = 1$ when $t < \bar\chi^2_{1-\gamma_2^*}$ or $t > \bar\chi^2_{\gamma_1^*}$, and $\phi(t, \hat\alpha) = 0$ when $\bar\chi^2_{1-\gamma_2^*} \le t \le \bar\chi^2_{\gamma_1^*}$,   (6.46)
which is the critical function of the size-$\gamma$ translation-invariant unbiased test of $H_0$ (versus $H_1$) with critical region $C^*$. Since the set consisting of all level-$\gamma$ unbiased tests of $H_0$ versus $H_1$ is a subset of the set consisting of all tests with a critical function that satisfies conditions (6.38) and (6.39), it follows that the size-$\gamma$ translation-invariant unbiased test with critical region $C^*$ is UMP among all level-$\gamma$ unbiased tests.
The optimality properties of the various tests can be reexpressed as optimality properties of the corresponding confidence intervals. Each of the confidence intervals (6.2) and (6.23) is optimal in essentially the same sense as when attention is restricted to translation-invariant procedures. The confidence interval (6.3) is optimal in the sense that, among all confidence sets for $\sigma$ that have a probability of coverage greater than or equal to $1-\gamma$ and that are unbiased (in the sense that the probability of covering the true value of $\sigma$ is greater than or equal to the probability of covering any value larger than the true value), it minimizes the probability of covering any value larger than the true value.
7.7 Multiple Comparisons and Simultaneous Confidence Intervals: Some Enhancements
Let us revisit the topic of multiple comparisons and simultaneous confidence intervals, which was considered earlier in Section 7.3c. As before, let us take $y$ to be an $N \times 1$ observable random vector that follows the G–M model, take $\tau_i = \lambda_i'\beta$ ($i = 1, 2, \ldots, M$) to be estimable linear combinations of the elements of $\beta$, and take $\tau = (\tau_1, \tau_2, \ldots, \tau_M)'$ and $\Lambda = (\lambda_1, \lambda_2, \ldots, \lambda_M)$ (in which case $\tau = \Lambda'\beta$). Further, let us assume that none of the columns $\lambda_1, \lambda_2, \ldots, \lambda_M$ of $\Lambda$ is null or is a scalar multiple of another column of $\Lambda$. And let us assume (at least initially) that the distribution of the vector $e$ of residual effects in the G–M model is MVN, in which case $y \sim N(X\beta, \sigma^2 I)$.
Suppose that we wish to make inferences about each of the linear combinations $\tau_1, \tau_2, \ldots, \tau_M$. Among the forms the inferences may take is that of multiple comparisons. For $i = 1, 2, \ldots, M$, let $\tau_i^{(0)}$ represent a hypothesized value of $\tau_i$. And suppose that each of $M$ null hypotheses $H_i^{(0)} : \tau_i = \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$) is to be tested against the corresponding one of the $M$ alternative hypotheses $H_i^{(1)} : \tau_i \ne \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$).
Let $\tau^{(0)} = (\tau_1^{(0)}, \tau_2^{(0)}, \ldots, \tau_M^{(0)})'$, and assume that $\tau^{(0)} = \Lambda'\beta^{(0)}$ for some $P \times 1$ vector $\beta^{(0)}$ or, equivalently, that $\tau^{(0)} \in \mathcal{C}(\Lambda')$ (which insures that the collection of null hypotheses $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ is "internally consistent"). And (as is customary) let us refer to the probability of falsely rejecting one or more of the $M$ null hypotheses as the familywise error rate (FWER).
In devising multiple-comparison procedures, the traditional approach has been to focus on those alternatives (to the so-called one-at-a-time test procedures) that control the FWER (in the sense that FWER $\le \gamma$ for some specified scalar $\gamma$ such as 0.01 or 0.05), that are relatively simple in form, and that are computationally tractable. The multiple-comparison procedures that form what in Section 7.3c is referred to as the generalized S method are obtainable via such an approach. Corresponding to those multiple-comparison procedures are the procedures (discussed in Section 7.3c) for obtaining confidence intervals for the linear combinations $\tau_1, \tau_2, \ldots, \tau_M$ that have a probability of simultaneous coverage equal to $1-\gamma$.
While tests of $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ with an FWER equal to $\gamma$ and confidence intervals for $\tau_1, \tau_2, \ldots, \tau_M$ with a probability of simultaneous coverage equal to $1-\gamma$ can be achieved via the methods discussed in Section 7.3c, there is a downside to the adoption of such methods. For even relatively small values of $M^*$ ($= \operatorname{rank} \Lambda$) and for the customary values of $\gamma$ (such as 0.01, 0.05, and 0.10), the probability of rejecting any particular one (say the $i$th) of the null hypotheses $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ can be quite small, even when $\tau_i$ differs substantially from $\tau_i^{(0)}$. And the confidence intervals for $\tau_1, \tau_2, \ldots, \tau_M$ are likely to be very wide. As is discussed in Subsection a, test procedures that are much more likely to reject any of the $M$ null hypotheses $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ can be obtained by imposing a restriction on the false rejection of multiple null hypotheses less severe than that imposed by a requirement that the FWER be less than or equal to $\gamma$. And much shorter confidence intervals for $\tau_1, \tau_2, \ldots, \tau_M$ can be obtained by adopting a criterion less stringent than that inherent in a requirement that the probability of simultaneous coverage be greater than or equal to $1-\gamma$. Moreover, as is discussed in Subsection b, improvements can be effected in the various test procedures (at the expense of additional complexity and additional computational demands) by employing "step-down" methods.
In some applications of the testing of the $M$ null hypotheses, the linear combinations $\tau_1, \tau_2, \ldots, \tau_M$ may represent the "effects" of "genes" or other such entities, and the object may be to "discover" or "detect" those entities whose effects are nonnegligible and that should be subjected to further evaluation and/or future investigation. In such applications, $M$ can be very large. And limiting the number of rejections of true null hypotheses may be less of a point of emphasis than rejecting a high proportion of the false null hypotheses. In Subsection c, an example is presented of an application where $M$ is in the thousands. Methods that are well suited for such applications are discussed in Subsections d and e.
a. A further generalization of the S method

Some preliminaries. Let us incorporate the notation introduced in Section 7.3a (in connection with the canonical form of the G–M model) and take advantage of the results introduced therein. Accordingly, $\hat\tau = \Lambda'(X'X)^- X'y$ is the least squares estimator of the vector $\tau$. And $\operatorname{var}(\hat\tau) = \sigma^2 C$, where $C = \Lambda'(X'X)^- \Lambda$.
Corresponding to $\tau$ is the transformed vector $\alpha = S'\tau$, where $S$ is an $M \times M^*$ matrix such that $S'CS = I$. The least squares estimator of $\alpha$ is $\hat\alpha = S'\hat\tau$. And $\hat\alpha \sim N(\alpha, \sigma^2 I)$. Further, $\tau = W'\alpha$, $\hat\tau = W'\hat\alpha$, and $C = W'W$, where $W$ is the unique $M^* \times M$ matrix that satisfies the equality $\Lambda S W = \Lambda$; and as an estimator of $\sigma$, we have the (positive) square root $\hat\sigma$ of $\hat\sigma^2 = d'd/(N-P) = y'(I - P_X)y/(N-P)$.
Let $\hat\tau_1, \hat\tau_2, \ldots, \hat\tau_M$ represent the elements of $\hat\tau$, so that (for $i = 1, 2, \ldots, M$) $\hat\tau_i = \lambda_i'(X'X)^- X'y$ is the least squares estimator of $\tau_i$. And observe that $\tau_i$ and its least squares estimator are reexpressible as $\tau_i = w_i'\alpha$ and $\hat\tau_i = w_i'\hat\alpha$, where $w_i$ represents the $i$th column of $W$. Moreover (in light of the assumption that no column of $\Lambda$ is null or is a scalar multiple of another column of $\Lambda$), no column of $W$ is null or is a scalar multiple of another column of $W$; and upon observing (in light of Theorem 2.4.21) that (for $i \ne j = 1, 2, \ldots, M$)
    $|\operatorname{corr}(\hat\tau_i, \hat\tau_j)| = \dfrac{|w_i'w_j|}{(w_i'w_i)^{1/2} (w_j'w_j)^{1/2}} < 1,$   (7.1)
it follows that (for $i \ne j = 1, 2, \ldots, M$ and for any constants $\xi_i$ and $\xi_j$ and any nonzero constants $a_i$ and $a_j$)
    $a_i(\hat\tau_i - \xi_i) \ne a_j(\hat\tau_j - \xi_j)$ (wp1).   (7.2)
For $i = 1, 2, \ldots, M$, define
    $t_i = \dfrac{\hat\tau_i - \tau_i}{[\lambda_i'(X'X)^- \lambda_i]^{1/2}\, \hat\sigma}$ and $t_i^{(0)} = \dfrac{\hat\tau_i - \tau_i^{(0)}}{[\lambda_i'(X'X)^- \lambda_i]^{1/2}\, \hat\sigma}.$   (7.3)
And observe that $t_i$ and $t_i^{(0)}$ are reexpressible as
    $t_i = \dfrac{w_i'(\hat\alpha - \alpha)}{(w_i'w_i)^{1/2}\, \hat\sigma}$ and $t_i^{(0)} = \dfrac{w_i'(\hat\alpha - \alpha^{(0)})}{(w_i'w_i)^{1/2}\, \hat\sigma},$   (7.4)
where $\alpha^{(0)} = S'\tau^{(0)} = (\Lambda S)'\beta^{(0)}$. Further, let $t = (t_1, t_2, \ldots, t_M)'$, and observe that
    $t = D^{-1} W'[\hat\sigma^{-1}(\hat\alpha - \alpha)],$   (7.5)
where $D = \operatorname{diag}[(w_1'w_1)^{1/2}, (w_2'w_2)^{1/2}, \ldots, (w_M'w_M)^{1/2}]$, and that
    $\hat\sigma^{-1}(\hat\alpha - \alpha) \sim MVt(N-P,\, I).$   (7.6)
Multiple comparisons. Among the procedures for testing each of the $M$ null hypotheses $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ (and of doing so in a way that accounts for the multiplicity of tests) is that provided by the generalized S method—refer to Section 7.3. The generalized S method controls the FWER. That control comes at the expense of the power of the tests, which for even moderately large values of $M$ can be quite low.
A less conservative approach (i.e., one that strikes a better balance between the probability of false rejections and the power of the tests) can be achieved by adopting a criterion that is based on controlling what has been referred to by Lehmann and Romano (2005a) as the k-FWER (where $k$ is a positive integer). In the present context, the k-FWER is the probability of falsely rejecting $k$ or more of the null hypotheses $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$, that is, the probability of rejecting $H_i^{(0)}$ for $k$ or more of the values of $i$ (between 1 and $M$, inclusive) for which $\tau_i = \tau_i^{(0)}$. A procedure for testing $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ is said to control the k-FWER at level $\gamma$ (where $0 < \gamma < 1$) if k-FWER $\le \gamma$. Clearly, the FWER is a special case of the k-FWER; it is the special case where $k = 1$. And (for any $k$) FWER $\le \gamma \Rightarrow$ k-FWER $\le \gamma$ (so that k-FWER $\le \gamma$ is a less stringent criterion than FWER $\le \gamma$); more generally, for any $k' < k$, $k'$-FWER $\le \gamma \Rightarrow$ k-FWER $\le \gamma$ (so that k-FWER $\le \gamma$ is a less stringent criterion than $k'$-FWER $\le \gamma$).
For purposes of devising a procedure for testing $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ that controls the k-FWER at level $\gamma$, let $i_1, i_2, \ldots, i_M$ represent a permutation of the first $M$ positive integers $1, 2, \ldots, M$ such that
    $|t_{i_1}| \ge |t_{i_2}| \ge \cdots \ge |t_{i_M}|$   (7.7)
—as is evident from result (7.2), this permutation is unique (wp1). And (for $j = 1, 2, \ldots, M$) define $t_{(j)} = t_{i_j}$. Further, denote by $c_\gamma(j)$ the upper $100\gamma\%$ point of the distribution of $|t_{(j)}|$.
Now, consider the procedure for testing $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ that (for $i = 1, 2, \ldots, M$) rejects $H_i^{(0)}$ if and only if $y \in C_i$, where the critical region $C_i$ is defined as follows:
    $C_i = \{\, y : |t_i^{(0)}| > c_\gamma(k) \,\}.$   (7.8)
Clearly,
    $\Pr\big[y \in C_i$ for $k$ or more values of $i$ with $\tau_i = \tau_i^{(0)}\big]$
    $= \Pr\big[|t_i| > c_\gamma(k)$ for $k$ or more values of $i$ with $\tau_i = \tau_i^{(0)}\big]$
    $\le \Pr\big[|t_i| > c_\gamma(k)$ for $k$ or more values of $i\big]$
    $= \Pr\big[|t_{(k)}| > c_\gamma(k)\big] = \gamma.$   (7.9)
Thus, the procedure that tests each of the null hypotheses $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ on the basis of the corresponding one of the critical regions $C_1, C_2, \ldots, C_M$ controls the k-FWER at level $\gamma$; its k-FWER is less than or equal to $\gamma$. In the special case where $k = 1$, this procedure is identical to that obtained via the generalized S method, which was discussed earlier (in Section 7.3c).

Simultaneous confidence intervals. Corresponding to the test of Hi.0/ with critical region Ci is the
confidence interval, say Ai .y/, with end points
Oi ˙ Œ0i .X0 X/ i 1=2 O c P .k/ (7.10)
(i D 1; 2; : : : ; M ). The correspondence is that implicit in the following relationship:
i.0/ 2 Ai .y/ , y … Ci : (7.11)
The confidence intervals A1.y/; A2.y/; : : : ; AM .y/ are [in light of result (7.9)] such that
PrŒi 2 Ai .y/ for at least M kC1 values of i 
D PrŒjti j  c P .k/ for at least M kC1 values of i 
D PrŒjti j > c P .k/ for no more than k 1 values of i 
D1 PrŒjti j > c P .k/ for k or more values of i 
D1 P:
In the special case where k D 1, the confidence intervals A1.y/; A2.y/; : : : ; AM .y/ are identical
(when  is taken to be the set whose members are the columns of IM ) to the confidence intervals
Aı .y/ (ı 2 ) of Section 7.3c—refer to the representation (3.94)—and (in that special case) the
probability of simultaneous coverage by all M of the intervals is equal to 1 P .
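As an illustration of the intervals (7.10) (not taken from the text), the following sketch—assuming NumPy, with an artificial full-rank design matrix, $\Lambda = I$, and a placeholder value for $c_\gamma(k)$—computes the end points $\hat\tau_i \pm [\lambda_i'(X'X)^-\lambda_i]^{1/2}\hat\sigma\, c_\gamma(k)$; a Monte Carlo approximation of $c_\gamma(k)$ itself is sketched in the next part.

# A sketch (assuming NumPy) of the intervals (7.10) for a toy G-M model;
# the design matrix, response, and the constant c_gamma(k) below are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
N, P = 20, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=2.0, size=N)
Lam = np.eye(P)                                  # here tau = beta itself (Lambda = I)

XtX_inv = np.linalg.inv(X.T @ X)                 # X has full column rank in this toy example
beta_hat = XtX_inv @ X.T @ y
tau_hat = Lam.T @ beta_hat
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - P))
se = sigma_hat * np.sqrt(np.diag(Lam.T @ XtX_inv @ Lam))

c_gamma_k = 2.2                                  # placeholder for c_gamma(k); see the Monte Carlo sketch below
lower, upper = tau_hat - c_gamma_k * se, tau_hat + c_gamma_k * se
print(np.column_stack([lower, upper]))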
Computations/approximations. To implement the test and/or interval procedures, we require the upper $100\gamma\%$ point $c_\gamma(k)$ of the distribution of $|t_{(k)}|$. As discussed in Section 7.3c in the special case of the computation of $c_\gamma$—when $k = 1$, $c_\gamma(k) = c_\gamma$—Monte Carlo methods can (at least in principle) be used to compute $c_\gamma(k)$. Whether the use of Monte Carlo methods is feasible depends on the feasibility of making a large number of draws from the distribution of $|t_{(k)}|$. The process of making a large number of such draws can be facilitated by taking advantage of results (7.5) and (7.6). And by employing methods like those discussed by Edwards and Berry (1987), the resultant draws can be used to approximate $c_\gamma(k)$ to a high degree of accuracy.
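One way to carry out such a Monte Carlo computation is sketched below (assuming NumPy; the matrix $W$ supplied at the end is illustrative). Each draw of $\hat\sigma^{-1}(\hat\alpha - \alpha)$ is generated, per result (7.6), as $z/s$ with $z \sim N(0, I)$ and $s^2$ distributed as $\chi^2(N-P)/(N-P)$; result (7.5) then yields a draw of $t$, and $c_\gamma(k)$ is approximated by the empirical upper $100\gamma\%$ point of $|t_{(k)}|$.

# A Monte Carlo sketch (assuming NumPy) for approximating c_gamma(k), exploiting
# results (7.5) and (7.6).  The matrix W supplied at the end is illustrative.
import numpy as np

def c_gamma_k(W, df, k, gamma, n_draws=200_000, seed=0):
    rng = np.random.default_rng(seed)
    Mstar, M = W.shape
    D_inv = 1.0 / np.sqrt((W * W).sum(axis=0))           # 1 / (w_i' w_i)^{1/2}
    z = rng.standard_normal((n_draws, Mstar))            # draws of a N(0, I) vector
    s = np.sqrt(rng.chisquare(df, size=n_draws) / df)    # draws of sqrt(chi2(N-P)/(N-P))
    t = (z @ W) * D_inv / s[:, None]                     # rows: draws of t' = (t_1, ..., t_M)
    abs_sorted = np.sort(np.abs(t), axis=1)[:, ::-1]     # descending, so column k-1 is |t_(k)|
    return np.quantile(abs_sorted[:, k - 1], 1 - gamma)

W = np.eye(10)            # M* = M = 10, uncorrelated tau_hat's (as in Table 7.3), N - P = 25
print(c_gamma_k(W, df=25, k=1, gamma=0.10))   # roughly 2.74 (cf. Table 7.3)
print(c_gamma_k(W, df=25, k=2, gamma=0.10))   # roughly 2.06 (cf. Table 7.3)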
If the use of Monte Carlo methods is judged to be infeasible, overly burdensome, or aesthetically unacceptable, there remains the option of replacing $c_\gamma(k)$ with an upper bound. In that regard, it can be shown that for any $M$ random variables $x_1, x_2, \ldots, x_M$ and any constant $c$,
    $\Pr(x_i > c$ for $k$ or more values of $i) \le (1/k) \sum_{i=1}^{M} \Pr(x_i > c).$   (7.12)
And upon applying inequality (7.12) in the special case where $x_i = |t_i|$ ($i = 1, 2, \ldots, M$) and where $c = \bar t_{k\gamma/(2M)}(N-P)$ and upon observing that (for $i = 1, 2, \ldots, M$) $t_i \sim St(N-P)$ and hence that
    $(1/k) \sum_{i=1}^{M} \Pr[\, |t_i| > \bar t_{k\gamma/(2M)}(N-P) \,] = (2/k) \sum_{i=1}^{M} \Pr[\, t_i > \bar t_{k\gamma/(2M)}(N-P) \,] = \dfrac{2M}{k}\cdot\dfrac{k\gamma}{2M} = \gamma,$
we find that
    $\Pr(\, |t_i| > \bar t_{k\gamma/(2M)}(N-P)$ for $k$ or more values of $i \,) \le \gamma.$   (7.13)
Together, results (7.13) and (7.9) imply that
    $\bar t_{k\gamma/(2M)}(N-P) \ge c_\gamma(k).$   (7.14)
Thus, $\bar t_{k\gamma/(2M)}(N-P)$ is an upper bound for $c_\gamma(k)$, and upon replacing $c_\gamma(k)$ with $\bar t_{k\gamma/(2M)}(N-P)$ in the definitions of the critical regions of the tests of $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ and in the definitions of the confidence intervals for $\tau_1, \tau_2, \ldots, \tau_M$, we obtain tests and confidence intervals that are conservative; they are conservative in the sense that typically the tests are less sensitive and the confidence intervals wider than before.
Some numerical results. For purposes of comparison, some results were obtained on the values of $c_\gamma(k)$ and of (the upper bound) $\bar t_{k\gamma/(2M)}(N-P)$ for selected values of $M$, $k$, and $\gamma$—the value of $N-P$ was taken to be 25. These results are presented in Table 7.3. The values of $c_\gamma(k)$ recorded in the table are approximations that were determined by Monte Carlo methods from 599,999 draws—refer to Section 7.3e for some discussion pertaining to the nature and the accuracy of the Monte Carlo approximations. These values are those for the special case where $M^* = M$ and where $\hat\tau_1, \hat\tau_2, \ldots, \hat\tau_M$ are uncorrelated.
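For the settings of Table 7.3, the bound $\bar t_{k\gamma/(2M)}(N-P)$ is simply a quantile of the $t$ distribution with 25 degrees of freedom; a sketch (assuming SciPy) for a few of the table's rows follows.

# A sketch (assuming SciPy) of the conservative bound t-bar_{k*gamma/(2M)}(N - P) of
# result (7.14), evaluated at some of the settings of Table 7.3 (N - P = 25).
from scipy.stats import t

for M, k, gamma in [(10, 1, 0.10), (10, 2, 0.10), (100, 5, 0.10), (10_000, 100, 0.10)]:
    bound = t.ppf(1 - k * gamma / (2 * M), 25)     # upper 100*(k*gamma/(2M)) % point
    print(M, k, round(bound, 2))                   # 2.79, 2.49, 3.08, 3.73 (cf. Table 7.3)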
Extensions. The tests with critical regions $C_1, C_2, \ldots, C_M$ can be modified for use when the null and alternative hypotheses are either $H_i^{(0)} : \tau_i \le \tau_i^{(0)}$ and $H_i^{(1)} : \tau_i > \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$) or $H_i^{(0)} : \tau_i \ge \tau_i^{(0)}$ and $H_i^{(1)} : \tau_i < \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$) rather than $H_i^{(0)} : \tau_i = \tau_i^{(0)}$ and $H_i^{(1)} : \tau_i \ne \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$). And the confidence intervals $A_1(y), A_2(y), \ldots, A_M(y)$ can be modified for use in obtaining upper or lower confidence bounds for $\tau_1, \tau_2, \ldots, \tau_M$.
In defining (for $j = 1, 2, \ldots, M$) $t_{(j)} = t_{i_j}$, take $i_1, i_2, \ldots, i_M$ to be a permutation of the first $M$ positive integers $1, 2, \ldots, M$ such that $t_{i_1} \ge t_{i_2} \ge \cdots \ge t_{i_M}$ rather than (as before) a permutation such that $|t_{i_1}| \ge |t_{i_2}| \ge \cdots \ge |t_{i_M}|$. Further, take $a_\gamma(j)$ to be the upper $100\gamma\%$ point of the distribution of the redefined random variable $t_{(j)}$. Then, the modifications needed to obtain the critical regions for testing the null hypotheses $H_i^{(0)} : \tau_i \le \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$) and to obtain the lower confidence bounds for $\tau_1, \tau_2, \ldots, \tau_M$ are those that result in procedures identical to the procedures obtained by proceeding as in Section 7.5e (in the special case where $k = 1$) and by inserting $a_\gamma(k)$ in place of $a_\gamma$. Moreover, by introducing modifications analogous to those described in the third part of Section 7.3e and the second part of Section 7.5e, versions of the confidence intervals and confidence bounds can be obtained such that all $M$ intervals are of equal length and such that all $M$ bounds are equidistant from the least squares estimates.

TABLE 7.3. Values of $\bar t_{k\gamma/(2M)}(N-P)$ and $c_\gamma(k)$ for selected values of $M$, $k$, and $\gamma$ (and for $N-P = 25$).

                     $\bar t_{k\gamma/(2M)}(N-P)$            $c_\gamma(k)$
      M       k    γ=.10   γ=.20   γ=.50      γ=.10   γ=.20   γ=.50
      10      1     2.79    2.49    2.06       2.74    2.41    1.86
      10      2     2.49    2.17    1.71       2.06    1.82    1.42
     100      1     3.73    3.45    3.08       3.61    3.28    2.76
     100      2     3.45    3.17    2.79       3.11    2.85    2.43
     100      5     3.08    2.79    2.38       2.54    2.34    2.02
   1,000      1     4.62    4.35    4.00       4.37    4.02    3.46
   1,000      5     4.00    3.73    3.36       3.53    3.28    2.87
   1,000     10     3.73    3.45    3.08       3.21    2.99    2.62
  10,000      1     5.51    5.24    4.89       5.05    4.67    4.06
  10,000     10     4.62    4.35    4.00       4.08    3.80    3.35
  10,000     50     4.00    3.73    3.36       3.46    3.23    2.85
  10,000    100     3.73    3.45    3.08       3.18    2.96    2.61
Nonnormality. The assumption that the vector $e$ of residual effects in the G–M model has an MVN distribution is stronger than necessary. To insure that the probability of the tests falsely rejecting $k$ or more of the $M$ null hypotheses does not exceed $\gamma$ and to insure that the probability of the confidence intervals or bounds covering at least $M-k+1$ of the $M$ linear combinations $\tau_1, \tau_2, \ldots, \tau_M$ is equal to $1-\gamma$, it is sufficient that $e$ have an absolutely continuous spherical distribution. In fact, it is sufficient that the vector $\big((\hat\alpha - \alpha)',\, d'\big)'$ have an absolutely continuous spherical distribution.
b. Multiple comparisons: use of step-down methods to control the FWER or k-FWER

Let us consider further the multiple-comparison problem considered in Subsection a, that is, the problem of testing the $M$ null hypotheses $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ versus the $M$ alternative hypotheses $H_1^{(1)}, H_2^{(1)}, \ldots, H_M^{(1)}$ (and of doing so in a way that accounts for the multiplicity of tests). The tests considered in Subsection a are those with the critical regions
    $C_i = \{\, y : |t_i^{(0)}| > c_\gamma(k) \,\}$ $(i = 1, 2, \ldots, M)$   (7.15)
and those with the critical regions
    $\{\, y : |t_i^{(0)}| > \bar t_{k\gamma/(2M)}(N-P) \,\}$ $(i = 1, 2, \ldots, M)$   (7.16)
obtained upon replacing $c_\gamma(k)$ with the upper bound $\bar t_{k\gamma/(2M)}(N-P)$. The critical regions (7.15) and (7.16) of these tests are relatively simple in form. In what follows, some alternative procedures for testing $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ are considered. The critical regions of the alternative tests are of a more complex form; however, like the tests with critical regions (7.15) or (7.16), they control the FWER or, more generally, the k-FWER, and, at the same time, they are more powerful than the tests with critical regions (7.15) or (7.16)—their critical regions are larger than the critical regions (7.15) or (7.16).
Some additional notation and some preliminaries. Let us continue to employ the notation introduced in Subsection a. In particular, let us continue (for $i = 1, 2, \ldots, M$) to take $t_i$ and $t_i^{(0)}$ to be the random variables defined by expressions (7.3). And let us continue to take $t = (t_1, t_2, \ldots, t_M)'$, to take $i_1, i_2, \ldots, i_M$ to be as defined by inequalities (7.7), and (for $j = 1, 2, \ldots, M$) to define $t_{(j)} = t_{i_j}$. Further, let us extend these definitions to the $t_i^{(0)}$'s by taking $t^{(0)} = (t_1^{(0)}, t_2^{(0)}, \ldots, t_M^{(0)})'$, taking $\tilde i_1, \tilde i_2, \ldots, \tilde i_M$ to be a permutation of the integers $1, 2, \ldots, M$ such that
    $|t_{\tilde i_1}^{(0)}| \ge |t_{\tilde i_2}^{(0)}| \ge \cdots \ge |t_{\tilde i_M}^{(0)}|,$   (7.17)
and (for $j = 1, 2, \ldots, M$) letting $t_{(j)}^{(0)} = t_{\tilde i_j}^{(0)}$.
Let us use the symbol $I$ to represent the set $\{1, 2, \ldots, M\}$. Also, let us denote by $T$ the subset of $I$ consisting of those values of the integer $i$ ($1 \le i \le M$) for which $H_i^{(0)}$ is true, that is, those values for which $\tau_i = \tau_i^{(0)}$—the number of elements in this subset and the identity of the elements are of course unknown. Further, for any subset $S$ of $I$, let $M_S$ represent the size of $S$, that is, the number of elements in $S$.
Now, for an arbitrary (nonempty) subset $S = \{j_1, j_2, \ldots, j_{M_S}\}$ of $I$, let $t_S$ represent the $M_S$-dimensional subvector of $t$ whose elements are $t_{j_1}, t_{j_2}, \ldots, t_{j_{M_S}}$, and let $i_1(S), i_2(S), \ldots, i_{M_S}(S)$ represent a permutation of the elements of $S$ such that
    $|t_{i_1(S)}| \ge |t_{i_2(S)}| \ge \cdots \ge |t_{i_{M_S}(S)}|.$
Similarly, let $t_S^{(0)}$ represent the $M_S$-dimensional subvector of $t^{(0)}$ whose elements are $t_{j_1}^{(0)}, t_{j_2}^{(0)}, \ldots, t_{j_{M_S}}^{(0)}$, and let $\tilde i_1(S), \tilde i_2(S), \ldots, \tilde i_{M_S}(S)$ represent a permutation of the elements of $S$ such that
    $|t_{\tilde i_1(S)}^{(0)}| \ge |t_{\tilde i_2(S)}^{(0)}| \ge \cdots \ge |t_{\tilde i_{M_S}(S)}^{(0)}|.$
And note that when $S = T$ or, more generally, when $S \subset T$, $t_S^{(0)} = t_S$. Note also that the (marginal) distribution of $t_S$ is a multivariate $t$ distribution with $N-P$ degrees of freedom and with a correlation matrix that is the submatrix of the correlation matrix of $t$ formed by its $j_1, j_2, \ldots, j_{M_S}$th rows and columns—refer to result (6.4.48). Moreover, $i_1(S), i_2(S), \ldots, i_{M_S}(S)$ is a subsequence of the sequence $i_1, i_2, \ldots, i_M$ and $t_{i_1(S)}, t_{i_2(S)}, \ldots, t_{i_{M_S}(S)}$ a subsequence of the sequence $t_{(1)}, t_{(2)}, \ldots, t_{(M)}$ (or, equivalently, of $t_{i_1}, t_{i_2}, \ldots, t_{i_M}$)—specifically, they are the subsequences obtained upon striking out the $j$th member of the sequence for every $j \in I$ for which $i_j \notin S$. Similarly, $\tilde i_1(S), \tilde i_2(S), \ldots, \tilde i_{M_S}(S)$ is a subsequence of the sequence $\tilde i_1, \tilde i_2, \ldots, \tilde i_M$ and $t_{\tilde i_1(S)}^{(0)}, t_{\tilde i_2(S)}^{(0)}, \ldots, t_{\tilde i_{M_S}(S)}^{(0)}$ a subsequence of the sequence $t_{(1)}^{(0)}, t_{(2)}^{(0)}, \ldots, t_{(M)}^{(0)}$.
Additionally, for an arbitrary subset $S$ of $I$ (of size $M_S \ge k$), let $t_{k;S} = t_{i_k(S)}$ and $t_{k;S}^{(0)} = t_{\tilde i_k(S)}^{(0)}$. Further, let $c_\gamma(k; S)$ represent the upper $100\gamma\%$ point of the distribution of $|t_{k;S}|$. And observe that
    $t_{k;S}^{(0)} = t_{k;S}$ when $S \subset T$   (7.18)
and that
    $c_\gamma(k; S^*) < c_\gamma(k; S)$ for any (proper) subset $S^*$ of $S$ (of size $M_{S^*} \ge k$).   (7.19)
An alternative procedure for testing the null hypotheses $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ in a way that controls the FWER or, more generally, the k-FWER: definition, characteristics, terminology, and properties. The null hypotheses $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ can be tested in a way that controls the FWER or, more generally, the k-FWER by adopting the procedure with critical regions $C_1, C_2, \ldots, C_M$. For purposes of defining an alternative to this procedure, let (for $j = k, k+1, \ldots, M$) $\tilde\Omega_{k;j}$ represent a collection of subsets of $I = \{1, 2, \ldots, M\}$ consisting of every $S \subset I$ for which $M_S \ge k$ and for which $\tilde i_k(S) = \tilde i_j$. And let $\tilde\Omega^+_{k;j}$ represent a collection (of subsets of $I$) consisting of those subsets in $\tilde\Omega_{k;j}$ whose elements include all $M-j$ of the integers $\tilde i_{j+1}, \tilde i_{j+2}, \ldots, \tilde i_M$; that is, $\tilde\Omega^+_{k;j}$ is the collection of those subsets of $I$ whose elements consist of $k-1$ of the integers $\tilde i_1, \tilde i_2, \ldots, \tilde i_{j-1}$ and all $M-j+1$ of the integers $\tilde i_j, \tilde i_{j+1}, \ldots, \tilde i_M$. By definition,
    $\tilde\Omega^+_{k;j} \subset \tilde\Omega_{k;j}$ $(j = k, k+1, \ldots, M)$ and $\tilde\Omega^+_{k;k} = \{I\}.$   (7.20)
Moreover,
    $S \in \tilde\Omega_{k;j} \;\Rightarrow\; S \subset S^+$ for some $S^+ \in \tilde\Omega^+_{k;j},$   (7.21)
and (for $j > k$)
    $S \in \tilde\Omega_{k;j} \;\Rightarrow\; S \subset S^+$ for some $S^+ \in \tilde\Omega_{k;j-1}$   (7.22)
[where in result (7.22), $S^+$ is such that $S$ is a proper subset of $S^+$].
Now, for $j = k, k+1, \ldots, M$, define
    $\alpha_j = \max_{S \in \tilde\Omega_{k;j}} c_\gamma(k; S).$   (7.23)
And [recalling result (7.19)] note [in light of results (7.20) and (7.21)] that $\alpha_j$ is reexpressible as
    $\alpha_j = \max_{S \in \tilde\Omega^+_{k;j}} c_\gamma(k; S)$   (7.24)
—in the special case where $k = 1$, $\tilde\Omega^+_{1;j}$ contains a single set, say $\tilde S_j$ (the elements of which are $\tilde i_j, \tilde i_{j+1}, \ldots, \tilde i_M$), and in that special case, equality (7.24) simplifies to $\alpha_j = c_\gamma(1; \tilde S_j)$. Note also that
    $\alpha_k = c_\gamma(k; I) = c_\gamma(k).$   (7.25)
Moreover, for $k \le j' < j \le M$,
    $\alpha_{j'} > \alpha_j,$   (7.26)
as is evident from result (7.22) [upon once again recalling result (7.19)].
Let us extend the definition (7.23) of the $\alpha_j$'s by taking
    $\alpha_1 = \alpha_2 = \cdots = \alpha_{k-1} = \alpha_k,$   (7.27)
so that [in light of inequality (7.26)]
    $\alpha_1 = \cdots = \alpha_{k-1} = \alpha_k > \alpha_{k+1} > \alpha_{k+2} > \cdots > \alpha_M.$   (7.28)
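In the special case $k = 1$ with uncorrelated $\hat\tau_i$'s, the constants $\alpha_j = c_\gamma(1; \tilde S_j)$ of equality (7.24) can be approximated by Monte Carlo along the lines of the earlier sketch; the following code (assuming NumPy; the values $M = 10$, $N-P = 25$, and $\gamma = 0.10$ are illustrative) exploits the fact that, in this uncorrelated case, $c_\gamma(1; S)$ depends on $S$ only through its size.

# A sketch (assuming NumPy) of the step-down constants alpha_j of (7.24) in the
# special case k = 1 with uncorrelated tau_hat's: alpha_j = c_gamma(1; S_j~), the
# upper 100*gamma % point of the largest of M - j + 1 absolute t's.
import numpy as np

def alpha_steps(M, df, gamma, n_draws=200_000, seed=1):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, M))
    s = np.sqrt(rng.chisquare(df, size=n_draws) / df)
    abs_t = np.abs(z / s[:, None])
    # alpha_j is the upper gamma point of the maximum of M - j + 1 of the |t_i|'s;
    # by exchangeability (uncorrelated case), which M - j + 1 of them does not matter.
    return [np.quantile(abs_t[:, j - 1:].max(axis=1), 1 - gamma) for j in range(1, M + 1)]

print(np.round(alpha_steps(M=10, df=25, gamma=0.10), 2))
# decreasing from about 2.74 at j = 1 (cf. Table 7.3) to about 1.71 at j = M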
Further, take $J$ to be an integer (between 0 and $M$, inclusive) such that $|t_{(j)}^{(0)}| > \alpha_j$ for $j = 1, 2, \ldots, J$ and $|t_{(J+1)}^{(0)}| \le \alpha_{J+1}$—if $|t_{(j)}^{(0)}| > \alpha_j$ for $j = 1, 2, \ldots, M$, take $J = M$; and if $|t_{(1)}^{(0)}| \le \alpha_1$, take $J = 0$. And consider the following multiple-comparison procedure for testing the null hypotheses $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$: when $J \ge 1$, $H_{\tilde i_1}^{(0)}, H_{\tilde i_2}^{(0)}, \ldots, H_{\tilde i_J}^{(0)}$ are rejected; when $J = 0$, none of the null hypotheses are rejected. This procedure can be regarded as a stepwise procedure, one of a kind known as a step-down procedure. Specifically, the procedure can be regarded as one in which (starting with $H_{\tilde i_1}^{(0)}$) the null hypotheses are tested sequentially in the order $H_{\tilde i_1}^{(0)}, H_{\tilde i_2}^{(0)}, \ldots, H_{\tilde i_M}^{(0)}$ by comparing the $|t_{(j)}^{(0)}|$'s with the $\alpha_j$'s; the testing ceases upon encountering a null hypothesis $H_{\tilde i_j}^{(0)}$ for which $|t_{(j)}^{(0)}|$ does not exceed $\alpha_j$.
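A sketch of the step-down rule itself (assuming NumPy; the statistics and constants supplied at the end are illustrative) is given below: it orders the $|t_i^{(0)}|$'s, finds the largest $J$ for which $|t_{(j)}^{(0)}| > \alpha_j$ for every $j \le J$, and reports the indices of the rejected hypotheses.

# A sketch (assuming NumPy) of applying the step-down rule.
import numpy as np

def step_down_rejections(t0, alphas):
    """t0: the M statistics t_i^{(0)}; alphas: the constants alpha_1, ..., alpha_M (nonincreasing)."""
    abs_t0 = np.abs(np.asarray(t0))
    order = np.argsort(-abs_t0)                 # indices i~_1, i~_2, ..., i~_M
    exceeds = abs_t0[order] > np.asarray(alphas)
    J = int(np.argmin(exceeds)) if not exceeds.all() else len(order)
    return order[:J]                            # indices of the rejected hypotheses

t0 = [3.1, -0.4, 2.8, 1.9, -2.6]                # illustrative statistics (M = 5)
alphas = [2.74, 2.70, 2.66, 2.61, 2.54]         # illustrative step-down constants
print(step_down_rejections(t0, alphas))         # rejects the hypotheses indexed 0 and 2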
The step-down procedure for testing the null hypotheses $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ can be characterized in terms of its critical regions. The critical regions of this procedure, say $C_1^*, C_2^*, \ldots, C_M^*$, are expressible as follows:
    $C_i^* = \{\, y : J \ge 1$ and $i = \tilde i_{j'}$ for some integer $j'$ between 1 and $J$, inclusive $\}$   (7.29)
($i = 1, 2, \ldots, M$). Alternatively, $C_i^*$ is expressible in the form
    $C_i^* = \bigcup_{j'=1}^{M} \{\, y : i = \tilde i_{j'}$ and, for every $j \le j'$, $|t_{(j)}^{(0)}| > \alpha_j \,\}$   (7.30)
—for any particular value of $y$, $|t_{(j)}^{(0)}| > \alpha_j$ for every $j \le j'$ if and only if $j' \le J$.
It can be shown (and subsequently will be shown) that the step-down procedure for testing the null hypotheses $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$, like the test procedure considered in Subsection a, controls the k-FWER at level $\gamma$ (and in the special case where $k = 1$, controls the FWER at level $\gamma$). Moreover, in light of result (7.25) and definition (7.27), the critical regions $C_1, C_2, \ldots, C_M$ of the test procedure considered in Subsection a are reexpressible in the form
    $C_i = \bigcup_{j'=1}^{M} \{\, y : i = \tilde i_{j'}$ and $|t_{(j')}^{(0)}| > \alpha_1 \,\}$   (7.31)
($i = 1, 2, \ldots, M$). And upon comparing expression (7.31) with expression (7.30) and upon observing [in light of the relationships (7.28)] that $|t_{(j')}^{(0)}| > \alpha_1$ implies that $|t_{(j)}^{(0)}| > \alpha_j$ for every $j \le j'$, we find that
    $C_i \subset C_i^*$ $(i = 1, 2, \ldots, M).$   (7.32)
That is, the critical regions $C_1, C_2, \ldots, C_M$ of the test procedure considered in Subsection a are subsets of the corresponding critical regions $C_1^*, C_2^*, \ldots, C_M^*$ of the step-down procedure. In fact, $C_i$ is a proper subset of $C_i^*$ ($i = 1, 2, \ldots, M$).
We conclude that while both the step-down procedure and the procedure with critical regions $C_1, C_2, \ldots, C_M$ control the k-FWER (or, when $k = 1$, the FWER), the step-down procedure is more powerful in that its adoption can result in additional rejections. However, at the same time, it is worth noting that the increased power comes at the expense of some increase in complexity and computational intensity.
Verification that the step-down procedure for testing the null hypotheses $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$ controls the k-FWER (at level $\gamma$). Suppose (for purposes of verifying that the step-down procedure controls the k-FWER) that $M_T \ge k$—if $M_T < k$, then fewer than $k$ of the null hypotheses are true and hence at most there can be $k-1$ false rejections. Then, there exists an integer $j'$ (where $k \le j' \le M - M_T + k$) such that $\tilde i_{j'} = \tilde i_k(T)$—$j' = k$ when $T$ is the $k$-dimensional set whose elements are $\tilde i_1, \tilde i_2, \ldots, \tilde i_k$, and $j' = M - M_T + k$ when $T$ is the set whose $M_T$ elements are $\tilde i_{M-M_T+1}, \tilde i_{M-M_T+2}, \ldots, \tilde i_M$.
The step-down procedure results in $k$ or more false rejections if and only if
    $|t_{(1)}^{(0)}| > \alpha_1,\; |t_{(2)}^{(0)}| > \alpha_2,\; \ldots,\; |t_{(j'-1)}^{(0)}| > \alpha_{j'-1},$ and $|t_{(j')}^{(0)}| > \alpha_{j'}.$   (7.33)
Thus, the step-down procedure is such that
    $\Pr(k$ or more false rejections$) \le \Pr\big(|t_{(j')}^{(0)}| > \alpha_{j'}\big).$   (7.34)
Moreover,
    $t_{(j')}^{(0)} = t_{\tilde i_{j'}}^{(0)} = t_{\tilde i_k(T)}^{(0)} = t_{k;T}^{(0)} = t_{k;T},$   (7.35)
as is evident upon recalling result (7.18), and
    $\alpha_{j'} = \max_{S \in \tilde\Omega_{k;j'}} c_\gamma(k; S) \ge c_\gamma(k; T).$   (7.36)
Together, results (7.35) and (7.36) imply that
    $\Pr\big(|t_{(j')}^{(0)}| > \alpha_{j'}\big) = \Pr\big(|t_{k;T}| > \alpha_{j'}\big) \le \Pr\big[|t_{k;T}| > c_\gamma(k; T)\big] = \gamma.$   (7.37)
And upon combining result (7.37) with result (7.34), we conclude that the step-down procedure is such that
    $\Pr(k$ or more false rejections$) \le \gamma,$   (7.38)
thereby completing the verification that the step-down procedure controls the k-FWER (at level $\gamma$).
A caveat. In the specification of the step-down procedure, $\alpha_1, \alpha_2, \ldots, \alpha_{k-1}$ were set equal to $\alpha_k$. As is evident from the verification (in the preceding part of the present subsection) that the step-down procedure controls the k-FWER (at level $\gamma$), its ability to do so is not affected by the choice of $\alpha_1, \alpha_2, \ldots, \alpha_{k-1}$. In fact, if the procedure were modified in such a way that $k-1$ of the null hypotheses [specifically, the null hypotheses $H_i^{(0)}$ ($i = \tilde i_1, \tilde i_2, \ldots, \tilde i_{k-1}$)] were always rejected, the ability of the procedure to control the k-FWER would be unaffected—this modification corresponds to setting $\alpha_1, \alpha_2, \ldots, \alpha_{k-1}$ equal to $-\infty$ rather than to $\alpha_k$. However, convention (in the case of step-down procedures) suggests that $\alpha_1, \alpha_2, \ldots, \alpha_{k-1}$ be chosen in such a way that the sequence $\alpha_1, \alpha_2, \ldots, \alpha_{k-1}, \alpha_k, \alpha_{k+1}, \ldots, \alpha_M$ is nonincreasing—refer, e.g., to Lehmann and Romano (2005a, p. 1143) for some related remarks. Within the confines of that convention, the choice $\alpha_1 = \alpha_2 = \cdots = \alpha_{k-1} = \alpha_k$ is the best choice; it is best in the sense that it maximizes the size of the critical regions $C_1^*, C_2^*, \ldots, C_M^*$.
A potential improvement. Let us continue to denote by $I$ the set $\{1, 2, \ldots, M\}$ and to denote by $T$ the subset of $I$ consisting of those values of the integer $i$ ($1 \le i \le M$) for which $H_i^{(0)}$ is true, that is, those values for which $\tau_i = \tau_i^{(0)}$. Further, let us denote by $F$ the subset of $I$ consisting of those values of $i$ for which $H_i^{(0)}$ is false, that is, those values for which $\tau_i \ne \tau_i^{(0)}$. And denote by $\Omega$ the collection of all $2^M$ subsets of $I$ (including the empty set). Then, by definition, both $T$ and $F$ belong to the collection $\Omega$, $T$ and $F$ are disjoint (i.e., have no members in common), and $T \cup F = I$.
When the coefficient vectors $\lambda_1, \lambda_2, \ldots, \lambda_M$ of $\tau_1, \tau_2, \ldots, \tau_M$ are linearly independent, $T$ could potentially be any one of the $2^M$ subsets (of $I$) that form the collection $\Omega$. When $\lambda_1, \lambda_2, \ldots, \lambda_M$ are linearly dependent, that is no longer the case.
Suppose that $\lambda_1, \lambda_2, \ldots, \lambda_M$ are linearly dependent. Suppose further that the $i^*$th of the vectors $\lambda_1, \lambda_2, \ldots, \lambda_M$ is expressible in the form
    $\lambda_{i^*} = \sum_{i \in S} a_i \lambda_i,$
where $i^* \notin S \subset I$ and where the $a_i$'s are scalars—when the vectors $\lambda_1, \lambda_2, \ldots, \lambda_M$ are linearly dependent, at least one of them is expressible in terms of the others. Then, we find [upon recalling that for some $P \times 1$ vector $\beta^{(0)}$, $\tau_i^{(0)} = \lambda_i'\beta^{(0)}$ ($i = 1, 2, \ldots, M$)] that if $T$ were equal to $S$, it would be the case that
    $\tau_{i^*} = \lambda_{i^*}'\beta = \sum_{i \in T} a_i \lambda_i'\beta = \sum_{i \in T} a_i \tau_i = \sum_{i \in T} a_i \tau_i^{(0)} = \sum_{i \in T} a_i \lambda_i'\beta^{(0)} = \lambda_{i^*}'\beta^{(0)} = \tau_{i^*}^{(0)}.$   (7.39)
Thus, $T$ cannot equal $S$ [since $T = S \Rightarrow i^* \in F$, contrary to what is implied by result (7.39)]. In effect, the linear dependence of $\lambda_1, \lambda_2, \ldots, \lambda_M$ and the resultant existence of a relationship of the form $\lambda_{i^*} = \sum_{i \in S} a_i \lambda_i$ imposes a constraint on $T$.
Based on the immediately preceding development, we conclude that T 2 , where  is a
collection of subsets (of I ) defined as follows: a subset S is a member of  if for every integer
i  2 I such that i  … S , i  is linearly independent of the MS vectors i (i 2 S ) in the sense that i 
is not expressible as a linear combination of i (i 2 S ) or, equivalently, the rank of the P  .MS C1/
matrix with columns i (i 2 S ) and i  is greater (by 1) than that of the P MS matrix with columns
i (i 2 S )—refer to the discussion in Section 2.9a (on the consistency of linear systems) for some
relevant results. Note that the definition of  is such that    and is such that  includes I
and also includes the empty set. When 1 ; 2 ; : : : ; M are linearly independent,  D ; when
1 ; 2 ; : : : ; M are linearly dependent, there are some subsets of I (and hence some members of
) that are not members of .
Suppose that T is subject to the constraint T 2  or, more generally, to the constraint T 2  ,
where either  D  or  is some collection (of known identity) of subsets (of I ) other than .
A very simple special case (of mostly hypothetical interest) is that where the collection  (to which
T is constrained) is the collection whose only members are I and the empty set. Let us consider how
(when  does not contain every member of ) the information inherent in the constraint T 2 
might be used to effect improvements in the step-down procedure (for testing the null hypotheses
H1.0/; H2.0/; : : : ; HM
.0/
).
Consider the generalization of the step-down procedure obtained upon replacing (for j = k, k+1, ..., M) the definition of α_j given by expression (7.23) with the definition

α_j = max_{S ∈ Ω̃_{k;j} ∩ Λ*} c_α̇(k; S);   (7.40)

when Ω̃_{k;j} ∩ Λ* = ∅ (the empty set), set α_j = −∞. And in the generalization, continue [as in definition (7.27)] to set

α_1 = α_2 = ... = α_{k−1} = α_k.

Under the extended definition (7.40), it is no longer necessarily the case that α_{j′} > α_j for every j and j′ for which k ≤ j′ < j ≤ M and hence no longer necessarily the case that the sequence α_1, α_2, ..., α_M is nonincreasing. Nor is it necessarily the case that α_j is reexpressible (for k ≤ j ≤ M) as α_j = max_{S ∈ Ω̃⁺_{k;j} ∩ Λ*} c_α̇(k; S), contrary to what might have been conjectured on the basis of result (7.24).

Like the original version of the step-down procedure, the generalized (to account for the constraint T ∈ Λ*) version controls the k-FWER or (in the special case where k = 1) the FWER (at level α̇). That the generalized version has that property can be verified by proceeding in essentially the same way as in the verification (in a preceding part of the present subsection) that the original version has that property. In that regard, it is worth noting that for T ∈ Λ*, Ω̃_{k;j_0} ∩ Λ* is nonempty and hence α_{j_0} is finite. And in the extension of the verification to the generalized version, the maximization (with respect to S) in result (7.36) is over the intersection of Ω̃_{k;j_0} with Λ* rather than over Ω̃_{k;j_0} itself.

When Λ* = Ω, the generalized version of the step-down procedure is identical to the original version. When Λ* is "smaller" than Ω (as when λ_1, λ_2, ..., λ_M are linearly dependent and Λ* = Λ), some of the α_j's employed in the generalized version are smaller than (and the rest equal to) those employed in the original version. Thus, when Λ* is smaller than Ω, the generalized version is more powerful than the original (in that its use can result in additional rejections).

It is informative to consider the generalized version of the step-down procedure in the aforementioned simple special case where the only members of Λ* are I and the empty set. In that special case, the generalized version is such that α_k = c_α̇(k; I) = c_α̇(k) [refer to result (7.25)] and α_{k+1} = α_{k+2} = ... = α_M = −∞. Thus, in that special case, the generalized version of the step-down procedure rejects none of the null hypotheses H_1^{(0)}, H_2^{(0)}, ..., H_M^{(0)} if |t_{(1)}^{(0)}| ≤ c_α̇(k), rejects H_{ĩ_1}^{(0)}, H_{ĩ_2}^{(0)}, ..., H_{ĩ_{j′}}^{(0)} if |t_{(j)}^{(0)}| > c_α̇(k) (j = 1, 2, ..., j′) and |t_{(j′+1)}^{(0)}| ≤ c_α̇(k) for some integer j′ between 1 and k−1, inclusive, and rejects all M of the null hypotheses if |t_{(j)}^{(0)}| > c_α̇(k) (j = 1, 2, ..., k).
An illustrative example. For purposes of illustration, consider a setting where M = P = 3, with λ_1′ = (1, −1, 0), λ_2′ = (1, 0, −1), and λ_3′ = (0, 1, −1). This setting is of a kind that is encountered in applications where pairwise comparisons are to be made among some number of "treatments" (3 in this case).

Clearly, any two of the three vectors λ_1, λ_2, and λ_3 are linearly independent. Moreover, each of these three vectors is expressible as a difference between the other two (e.g., λ_3 = λ_2 − λ_1), implying in particular that λ_1, λ_2, and λ_3 are linearly dependent (and that the rank of the matrix with columns λ_1, λ_2, and λ_3 is 2). And the 2^M = 8 members of Ω are ∅ (the empty set), {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, and I = {1, 2, 3}; and the members of Λ are ∅, {1}, {2}, {3}, and {1, 2, 3}.

Now, suppose that k = 1. Then, we find that (for j = 1, 2, 3) the members of the collections Ω̃_{k;j}, Ω̃⁺_{k;j}, and Ω̃_{k;j} ∩ Λ are

Ω̃_{1;1}:  {ĩ_1}, {ĩ_1, ĩ_2}, {ĩ_1, ĩ_3}, and {ĩ_1, ĩ_2, ĩ_3};
Ω̃⁺_{1;1}:  {ĩ_1, ĩ_2, ĩ_3};
Ω̃_{1;1} ∩ Λ:  {ĩ_1} and {ĩ_1, ĩ_2, ĩ_3};
Ω̃_{1;2}:  {ĩ_2} and {ĩ_2, ĩ_3};
Ω̃⁺_{1;2}:  {ĩ_2, ĩ_3};
Ω̃_{1;2} ∩ Λ:  {ĩ_2};
Ω̃_{1;3}:  {ĩ_3};
Ω̃⁺_{1;3}:  {ĩ_3};
Ω̃_{1;3} ∩ Λ:  {ĩ_3}.

Alternatively, suppose that k = 2. Then, we find that (for j = 2, 3) the members of the collections Ω̃_{k;j}, Ω̃⁺_{k;j}, and Ω̃_{k;j} ∩ Λ are

Ω̃_{2;2}:  {ĩ_1, ĩ_2} and {ĩ_1, ĩ_2, ĩ_3};
Ω̃⁺_{2;2}:  {ĩ_1, ĩ_2, ĩ_3};
Ω̃_{2;2} ∩ Λ:  {ĩ_1, ĩ_2, ĩ_3};
Ω̃_{2;3}:  {ĩ_1, ĩ_3} and {ĩ_2, ĩ_3};
Ω̃⁺_{2;3}:  {ĩ_1, ĩ_3} and {ĩ_2, ĩ_3};
Ω̃_{2;3} ∩ Λ:  ∅ (the empty set).

Thus, when Λ* = Λ and k = 1, α_1 = c_α̇(1; I) = c_α̇(1), α_2 = c_α̇(1; {ĩ_2}) or α_2 = c_α̇(1; {ĩ_2, ĩ_3}) depending on whether T is subject to the constraint T ∈ Λ or T is unconstrained, and α_3 = c_α̇(1; {ĩ_3}). And when Λ* = Λ and k = 2, α_1 = α_2 = c_α̇(2; I) = c_α̇(2) and α_3 = −∞ or α_3 = max[c_α̇(2; {ĩ_1, ĩ_3}), c_α̇(2; {ĩ_2, ĩ_3})] depending on whether T is subject to the constraint T ∈ Λ or T is unconstrained.
Computations/approximations. To implement the step-down procedure for testing the null hypotheses H_1^{(0)}, H_2^{(0)}, ..., H_M^{(0)}, we require some or all of the constants α_k, α_{k+1}, ..., α_M defined by expression (7.23) or, more generally, by expression (7.40). And to obtain the requisite α_j's, we require the values of c_α̇(k; S) corresponding to various choices for S. As in the case of the constant c_α̇(k) required to implement the test procedure with critical regions (7.15), the requisite values of c_α̇(k; S) can be computed to a high degree of accuracy by employing Monte Carlo methods of a kind discussed by Edwards and Berry (1987). In that regard, note that (for any subset S of I) a draw from the distribution of t_{i*_k(S)} can be obtained from a draw from the distribution of the vector t, and recall (from the discussion in Subsection a) that the feasibility of making a large number of draws from the distribution of t can be enhanced by making use of results (7.5) and (7.6).
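As a rough indication of how such Monte Carlo computations might be organized, the sketch below approximates c_α̇(k; S) from simulated draws of the vector t. For simplicity it simulates t directly from an MVt(N−P, I) distribution, which corresponds to the simple case of uncorrelated estimators τ̂_1, ..., τ̂_M; that sampling scheme, and all names used here, are illustrative assumptions rather than the text's prescription.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_t(n_draws, M, df, rng):
    """Draws from MVt(df, I): standard normals divided by a common chi-based scale."""
    z = rng.standard_normal((n_draws, M))
    u = np.sqrt(rng.chisquare(df, size=n_draws) / df)
    return z / u[:, None]

def c_hat(alpha_dot, k, S, t_draws):
    """Monte Carlo estimate of the upper 100*alpha_dot % point of t_{k;S},
    the k-th largest of |t_i| for i in S."""
    abs_t = np.abs(t_draws[:, list(S)])
    kth_largest = np.sort(abs_t, axis=1)[:, -k]
    return np.quantile(kth_largest, 1.0 - alpha_dot)

M, df = 6, 20
t_draws = draw_t(200_000, M, df, rng)
print(c_hat(0.05, 2, range(M), t_draws))   # estimate of c_0.05(2; I)
```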
As in the case of the test procedure with critical regions (7.15), a computationally less demanding (but less powerful) version of the step-down procedure can be devised. In fact, such a version can be devised by making use of the inequality

t̄_{kα̇/(2M_S)}(N−P) ≥ c_α̇(k; S)  (where S ⊆ I),   (7.41)

which is a generalization of inequality (7.14) [when S = I, inequality (7.41) is equivalent to inequality (7.14)] and which can be derived from inequality (7.12) by proceeding in essentially the same way as in the derivation (in Subsection a) of inequality (7.14).

In the case of the original (unconstrained) version of the step-down procedure [where (for j = k, k+1, ..., M) α_j is defined by expression (7.23) and is reexpressible in the form (7.24)], it follows from inequality (7.41) that (for j = k, k+1, ..., M)

α_j = max_{S ∈ Ω̃⁺_{k;j}} c_α̇(k; S) ≤ max_{S ∈ Ω̃⁺_{k;j}} t̄_{kα̇/(2M_S)}(N−P)   (7.42)

and hence (since for S ∈ Ω̃⁺_{k;j}, M_S = k−1+M−j+1 = M+k−j) that

α_j ≤ t̄_{kα̇/[2(M+k−j)]}(N−P).   (7.43)

In the case of the generalized version of the step-down procedure [where T is subject to the constraint T ∈ Λ* and where (for j = k, k+1, ..., M) α_j is defined by expression (7.40)], it follows from inequality (7.41) that (for j = k, k+1, ..., M)

α_j = max_{S ∈ Ω̃_{k;j} ∩ Λ*} c_α̇(k; S) ≤ max_{S ∈ Ω̃_{k;j} ∩ Λ*} t̄_{kα̇/(2M_S)}(N−P) = t̄_{kα̇/[2M*(k;j)]}(N−P),   (7.44)

where M*(k; j) = max_{S ∈ Ω̃_{k;j} ∩ Λ*} M_S.
Now, consider a modification of the step-down procedure in which (for j = k, k+1, ..., M) α_j is replaced by t̄_{kα̇/[2(M+k−j)]}(N−P) or, more generally, by t̄_{kα̇/[2M*(k;j)]}(N−P) (and in which α_1, α_2, ..., α_{k−1} are replaced by the replacement for α_k). This modification results in a version of the step-down procedure for controlling the k-FWER that can be much less demanding from a computational standpoint but that is less powerful. When the replacement for α_j is t̄_{kα̇/[2(M+k−j)]}(N−P) (j = k, k+1, ..., M), the modified version of the step-down procedure is equivalent to the step-down procedure for controlling the k-FWER proposed by Lehmann and Romano (2005a, sec. 2), and in the special case where k = 1, it is equivalent to the step-down procedure for controlling the FWER proposed by Holm (1979). The modified version of the step-down procedure is more powerful than the test procedure with critical regions (7.16) in that its adoption can result in more rejections than the adoption of the latter procedure.
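A small sketch of this critical-value-based step-down procedure may help fix ideas; k = 1 reproduces a Holm-type procedure. The function name and arguments are ours, and scipy is assumed to be available.

```python
import numpy as np
from scipy import stats

def step_down_kfwer(t0, df, alpha_dot, k=1):
    """Step-down procedure using the replacement critical values
    alpha_j = upper (k*alpha_dot / [2(M + k - j)]) point of t(df) for j >= k,
    with alpha_1 = ... = alpha_{k-1} = alpha_k.  Returns the 0-based indices
    of the rejected null hypotheses."""
    t0 = np.asarray(t0, dtype=float)
    M = t0.size
    j = np.arange(1, M + 1)
    crit = stats.t.ppf(1.0 - k * alpha_dot / (2.0 * (M + k - np.maximum(j, k))), df)
    order = np.argsort(-np.abs(t0))                 # indices i~_1, i~_2, ..., i~_M
    exceeds = np.abs(t0[order]) > crit
    J = M if exceeds.all() else int(np.argmin(exceeds))   # stop at the first failure
    return order[:J]

t0 = np.array([4.2, -3.1, 2.5, 0.4, -1.9])
print(step_down_kfwer(t0, df=30, alpha_dot=0.05, k=1))
```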
Extensions. By introducing some relatively simple and transparent modifications, the coverage of the present subsection (on the use of step-down methods to control the FWER or k-FWER) can be extended to the case where the null and alternative hypotheses are either H_i^{(0)}: τ_i ≤ τ_i^{(0)} and H_i^{(1)}: τ_i > τ_i^{(0)} (i = 1, 2, ..., M) or H_i^{(0)}: τ_i ≥ τ_i^{(0)} and H_i^{(1)}: τ_i < τ_i^{(0)} (i = 1, 2, ..., M) rather than H_i^{(0)}: τ_i = τ_i^{(0)} and H_i^{(1)}: τ_i ≠ τ_i^{(0)} (i = 1, 2, ..., M). Suppose, in particular, that the null and alternative hypotheses are H_i^{(0)}: τ_i ≤ τ_i^{(0)} and H_i^{(1)}: τ_i > τ_i^{(0)} (i = 1, 2, ..., M). Then, instead of taking the permutation i_1, i_2, ..., i_M and the permutation ĩ_1, ĩ_2, ..., ĩ_M to be as defined by inequalities (7.7) and (7.17), we need to take them to be as defined by the inequalities

t_{i_1} ≥ t_{i_2} ≥ ... ≥ t_{i_M}   and   t_{ĩ_1}^{(0)} ≥ t_{ĩ_2}^{(0)} ≥ ... ≥ t_{ĩ_M}^{(0)}.

Similarly, we need to redefine i*_1(S), i*_2(S), ..., i*_{M_S}(S) and ĩ*_1(S), ĩ*_2(S), ..., ĩ*_{M_S}(S) to be permutations of the elements of the subset S (⊆ I) that satisfy the inequalities

t_{i*_1(S)} ≥ t_{i*_2(S)} ≥ ... ≥ t_{i*_{M_S}(S)}   and   t_{ĩ*_1(S)}^{(0)} ≥ t_{ĩ*_2(S)}^{(0)} ≥ ... ≥ t_{ĩ*_{M_S}(S)}^{(0)}.

And t_{(j)} and t_{(j)}^{(0)} (j = 1, 2, ..., M) and t_{k;S} and t_{k;S}^{(0)} (S ⊆ I; M_S ≥ k) need to be redefined accordingly. Further, redefine c_α̇(k; S) to be the upper 100α̇% point of the distribution of t_{k;S}, and redefine the α_j's in terms of the redefined c_α̇(k; S)'s.
In the form taken following its redefinition in terms of the redefined ĩ_j's, t_{(j)}^{(0)}'s, and α_j's, the step-down procedure rejects the null hypotheses H_{ĩ_1}^{(0)}, H_{ĩ_2}^{(0)}, ..., H_{ĩ_J}^{(0)}, where J is an integer (between 0 and M, inclusive) such that t_{(j)}^{(0)} > α_j for j = 1, 2, ..., J and t_{(J+1)}^{(0)} ≤ α_{J+1}; when t_{(j)}^{(0)} > α_j for j = 1, 2, ..., M, J = M and all M of the null hypotheses are rejected; when t_{(1)}^{(0)} ≤ α_1, J = 0 and none of the null hypotheses are rejected. In effect, the redefined step-down procedure tests the M null hypotheses sequentially in the order H_{ĩ_1}^{(0)}, H_{ĩ_2}^{(0)}, ..., H_{ĩ_M}^{(0)} by comparing the t_{(j)}^{(0)}'s with the α_j's; the testing ceases upon encountering a null hypothesis H_{ĩ_j}^{(0)} for which t_{(j)}^{(0)} does not exceed α_j.

The redefined c_α̇(k; S)'s satisfy the inequality

t̄_{kα̇/M_S}(N−P) ≥ c_α̇(k; S).

This inequality takes the place of inequality (7.41). And to obtain upper bounds for the redefined α_j's, results (7.42), (7.43), and (7.44) need to be revised accordingly. Upon replacing the redefined α_j's with the redefined upper bounds, we obtain versions of the redefined step-down procedure that are computationally less demanding but that are less powerful.
The various versions of the redefined step-down procedure control the k-FWER (at level α̇). By way of comparison, we have the tests of the null hypotheses with critical regions

{y : t_i^{(0)} > c_α̇(k)}  (i = 1, 2, ..., M),

where c_α̇(k) is redefined to be the upper 100α̇% point of the distribution of the redefined random variable t_{(k)}, and the tests with critical regions

{y : t_i^{(0)} > t̄_{kα̇/M}(N−P)}  (i = 1, 2, ..., M).

These tests are the redefined versions of the tests with critical regions (7.15) and (7.16). Like the redefined step-down procedures, they control the k-FWER; however, the redefined step-down procedures are more powerful.

Nonnormality. The assumption that the vector e of residual effects in the G–M model has an MVN distribution is stronger than necessary. To insure the validity of the various results on step-down methods, it is sufficient that e have an absolutely continuous spherical distribution. In fact, it is sufficient that the vector whose subvectors are α̂ − α and d have an absolutely continuous spherical distribution.
c. Alternative criteria for devising and evaluating multiple-comparison procedures: the false discovery proportion and the false discovery rate

Let us consider further alternatives to the control of the FWER as a basis for testing the M null hypotheses H_1^{(0)}, H_2^{(0)}, ..., H_M^{(0)} in a way that accounts for the multiplicity of tests. And in doing so, let us continue to make use of the notation and terminology employed in the preceding parts of the present section.
Multiple-comparison procedures are sometimes used as a “screening device.” That is, the objec-
tive may be to identify or “discover” which of the M linear combinations 1 ; 2 ; : : : ; M are worthy
of further investigation. In such a case, the rejection of a null hypothesis represents a “(true) discov-
ery” or a “false discovery” depending on whether the null hypothesis is false or true. And the success
of an application of a multiple-comparison procedure may be judged (at least in part) on the basis
of whether or not a large proportion of the false null hypotheses are rejected (discovered) and on
whether or not the proportion of rejected null hypotheses that are false rejections (false discoveries)
is small.
An example: microarray data. Data from so-called microarrays are sometimes used to identify or
discover which among a large number of “genes” are associated with some disease and are worthy
of further study. Among the examples of such data are the microarray data from a study of prostate
cancer that were obtained, analyzed, and discussed by Efron (2010). In that example, the data consist
of the “expression levels” obtained for 6033 genes on 102 men; 50 of the men were “normal control
subjects” and 52 were “prostate cancer patients.” According to Efron, “the principal goal of the study
was to discover a small number of ‘interesting’ genes, that is, genes whose expression levels differ
between the prostate and normal subjects.” He went on to say that “such genes, once identified, might
be further investigated for a causal link to prostate cancer development.”
The prostate data can be regarded as multivariate in nature, and as such could be formulated and modeled in the way described and discussed in Section 4.5. Specifically, the subjects could be regarded as R = 102 observational units and the genes (or, more precisely, their expression levels) as S = 6033 response variables. Further, the data on the sth gene could be regarded as the observed values of the elements of a 102-dimensional random vector y_s (s = 1, 2, ..., 6033). And the (N = 6033 × 102 = 615,366)-dimensional vector y = (y_1′, y_2′, ..., y_S′)′ could be regarded as following a general linear model, where the model equation is as defined by expression (4.5.2) or (4.5.3) and where (assuming that observational units 1 through 50 correspond to the normal subjects)

X_1 = X_2 = ... = X_S = diag(1_50, 1_52).

This model is such that each of the S = 6033 subvectors β_1, β_2, ..., β_S of the parameter vector β has two elements. Clearly, the first element of β_s represents the expected value of the response variable (expression level) for the sth gene when the subject is a normal subject, and the second element represents the expected value of that variable when the subject is a cancer patient; let us write μ_{s1} for the first element of β_s and μ_{s2} for the second element. The quantities of interest are represented by the M = 6033 linear combinations τ_1, τ_2, ..., τ_M defined as follows:

τ_s = μ_{s2} − μ_{s1}  (s = 1, 2, ..., 6033).

And, conceivably, the problem of discovering which genes are worthy of further study (i.e., might be associated with prostate cancer development) could be formulated as one of testing the 6033 null hypotheses H_s^{(0)}: τ_s = 0 (s = 1, 2, ..., 6033) versus the alternative hypotheses H_s^{(1)}: τ_s ≠ 0 (s = 1, 2, ..., 6033). In such an approach, the genes corresponding to whichever null hypotheses are rejected would be deemed to be the ones of interest.

Note that the model for the prostate data is such that the variance-covariance matrix of the residual effects is of the form (4.5.4). And recall that the variance-covariance matrix of the residual effects in the G–M model is of the form (4.1.17). Thus, to obtain a model for the prostate data that is a G–M model, we would need to introduce the simplifying assumptions that σ_{ss′} = 0 for s′ ≠ s = 1, 2, ..., S and that σ_11 = σ_22 = ... = σ_SS.
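Under those two simplifying assumptions, the t statistic for H_s^{(0)}: τ_s = 0 reduces to an ordinary two-sample comparison with a variance estimate pooled over all S genes. The sketch below illustrates the computation on simulated data of the same shape as the prostate study; it is not Efron's data set, and the array names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
S, n1, n2 = 6033, 50, 52                 # genes, normal subjects, cancer patients
Y = rng.standard_normal((S, n1 + n2))    # stand-in for the expression-level data

ybar1 = Y[:, :n1].mean(axis=1)           # estimate of mu_{s1}
ybar2 = Y[:, n1:].mean(axis=1)           # estimate of mu_{s2}
tau_hat = ybar2 - ybar1                  # least squares estimate of tau_s

# Residual sum of squares pooled over genes and groups (the G-M simplification:
# common variance, zero covariances across genes)
rss = ((Y[:, :n1] - ybar1[:, None])**2).sum() + ((Y[:, n1:] - ybar2[:, None])**2).sum()
df = S * (n1 + n2) - 2 * S               # N - P = 615366 - 12066
sigma2_hat = rss / df

t_stat = tau_hat / np.sqrt(sigma2_hat * (1.0/n1 + 1.0/n2))
print(df, t_stat[:5])
```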
Alternative (to the FWER or k-FWER) criteria for devising and evaluating multiple-comparison
procedures: the false discovery proportion and false discovery rate. In applications of multiple-
comparison procedures where the procedure is to be used as a screening device, the number M of null
hypotheses is generally large and sometimes (as in the example) very large. Multiple-comparison
procedures (like those described and discussed in Section 7.3c) that restrict the FWER (familywise
error rate) to the customary levels (such as 0:01 or 0:05) are not well suited for such applications.
Those procedures are such that among the linear combinations τ_i (i ∈ F), only those for which the true value differs from the hypothesized value τ_i^{(0)} by a very large margin have a reasonable chance of having their null hypotheses rejected (of being discovered). It would seem that the situation could be improved to at least some extent
by taking the level to which the FWER is restricted to be much higher than what is customary. Or one
could turn to procedures [like those considered in Subsections a and b (of the present section)] that
restrict the k-FWER to a specified level, taking the value of k to be larger (and perhaps much larger)
than 1 and perhaps taking the level to which the k-FWER is restricted to be higher or much higher
than the customary levels. An alternative (to be considered in what follows) is to adopt a procedure
like that devised by Benjamini and Hochberg (1995) on the basis of a criterion that more directly
reflects the objectives underlying the use of the procedure as a screening device.
The criterion employed by Benjamini and Hochberg is defined in terms of the false discovery proportion, as is a related, but somewhat different, criterion considered by Lehmann and Romano (2005a). By definition, the false discovery proportion is the number of rejected null hypotheses that
are true (number of false discoveries) divided by the total number of rejections—when the total
number of rejections equals 0, the value of the false discovery proportion is by convention (e.g.,
Benjamini and Hochberg 1995; Lehmann and Romano 2005a) taken to be 0. Let us write FDP for
the false discovery proportion. And let us consider the problem of devising multiple-comparison
procedures for which that proportion is likely to be small.
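In programming terms, the convention just described (FDP taken to be 0 when nothing is rejected) amounts to the following trivial helper; the names are ours.

```python
def false_discovery_proportion(rejected, true_nulls):
    """FDP = (# rejected hypotheses that are true) / (# rejected),
    taken to be 0 when nothing is rejected."""
    rejected = set(rejected)
    if not rejected:
        return 0.0
    return len(rejected & set(true_nulls)) / len(rejected)

print(false_discovery_proportion({1, 4, 7}, {4, 5, 6, 7}))   # 2/3
print(false_discovery_proportion(set(), {1, 2}))             # 0.0
```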
In what follows (in Subsections e and d), two approaches to this problem are described, one of which is that of Benjamini and Hochberg (1995) (and of Benjamini and Yekutieli 2001) and the other of which is that of Lehmann and Romano (2005a). The difference between the two approaches is attributable to a difference in the criterion adopted as a basis for exercising control over the FDP. In the Benjamini and Hochberg approach, that control takes the form of a requirement that for some specified constant δ (0 < δ < 1),

E(FDP) ≤ δ.   (7.45)

As has become customary, let us refer to the expected value E(FDP) of the false discovery proportion as the false discovery rate and denote it by the symbol FDR. In the Lehmann and Romano approach, control over the FDP takes a different form; it takes the form of a requirement that for some constants γ and ε [in the interval (0, 1)],

Pr(FDP > γ) ≤ ε;   (7.46)

typically, γ and ε are chosen to be much closer to 0 than to 1, so that the effect of the requirement (7.46) is to impose on the FDP a rather stringent upper bound γ that is violated only infrequently. As is to be demonstrated herein, multiple-comparison procedures that satisfy requirement (7.45) or requirement (7.46) can be devised; these procedures are stepwise in nature.

d. Step-down multiple-comparison procedures that bound the FDP (from above) with a high probability

Let us tackle the problem of devising a multiple-comparison procedure that controls the FDP in the sense defined by inequality (7.46). And in doing so, let us make further use of the notation and terminology employed in the preceding parts of the present section.
A step-down procedure: general form. Corresponding to any nonincreasing sequence of M strictly positive scalars α_1 ≥ α_2 ≥ ... ≥ α_M (> 0), there is a step-down procedure for testing the M null hypotheses H_i^{(0)}: τ_i = τ_i^{(0)} (i = 1, 2, ..., M) [versus the alternative hypotheses H_i^{(1)}: τ_i ≠ τ_i^{(0)} (i = 1, 2, ..., M)]. This procedure can be regarded as one in which the null hypotheses are tested sequentially in the order H_{ĩ_1}^{(0)}, H_{ĩ_2}^{(0)}, ..., H_{ĩ_M}^{(0)} [where ĩ_1, ĩ_2, ..., ĩ_M is a permutation of the integers 1, 2, ..., M defined implicitly (wp1) by the inequalities |t_{ĩ_1}^{(0)}| ≥ |t_{ĩ_2}^{(0)}| ≥ ... ≥ |t_{ĩ_M}^{(0)}|]. Specifically, this procedure consists of rejecting the first J of these null hypotheses (i.e., the null hypotheses H_{ĩ_1}^{(0)}, H_{ĩ_2}^{(0)}, ..., H_{ĩ_J}^{(0)}), where J is an integer (between 0 and M, inclusive) such that |t_{ĩ_j}^{(0)}| > α_j for j = 1, 2, ..., J and |t_{ĩ_{J+1}}^{(0)}| ≤ α_{J+1}; if |t_{ĩ_1}^{(0)}| ≤ α_1, take J = 0 (and reject none of the null hypotheses), and if |t_{ĩ_j}^{(0)}| > α_j for j = 1, 2, ..., M, take J = M (and reject all M of the null hypotheses).
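In computational terms, this general step-down procedure can be expressed as follows (a minimal sketch; the function name and arguments are ours, and the α_j's are supplied as an arbitrary nonincreasing sequence).

```python
import numpy as np

def step_down(t0, alphas):
    """Reject H_{i~_1},...,H_{i~_J}, where the i~_j order the |t_i^(0)| from largest
    to smallest and J is the largest integer such that |t^(0)_{i~_j}| > alpha_j for
    j = 1,...,J (J = 0 if |t^(0)_{i~_1}| <= alpha_1).  Returns 0-based indices."""
    t0, alphas = np.asarray(t0, float), np.asarray(alphas, float)
    order = np.argsort(-np.abs(t0))
    exceeds = np.abs(t0[order]) > alphas
    J = len(t0) if exceeds.all() else int(np.argmin(exceeds))
    return order[:J]

print(step_down([3.4, -0.2, 2.8, 1.1], [2.5, 2.0, 1.8, 1.6]))   # rejects indices 0 and 2
```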
Some general results pertaining to the dependence of the FDP on the choice of the α_j's. Let us consider how the FDP is affected by the choice of α_1, α_2, ..., α_M (where α_1, α_2, ..., α_M are constants). For purposes of doing so, suppose that M_T > 0 (i.e., at least one of the M null hypotheses H_1^{(0)}, H_2^{(0)}, ..., H_M^{(0)} is true); if M_T = 0, then the FDP = 0 regardless of how the α_j's are chosen. And let R_1 represent the number of values of s ∈ F for which |t_s^{(0)}| > α_1, and for j = 2, 3, ..., M, let R_j represent the number of values of s ∈ F for which α_{j−1} ≥ |t_s^{(0)}| > α_j; recall that, by definition, s ∈ F if H_s^{(0)} is false. Further, for j = 1, 2, ..., M, let R_j^+ = Σ_{i=1}^{j} R_i, so that R_j^+ represents the number of values of s ∈ F for which |t_s^{(0)}| > α_j. Clearly, R_M^+ ≤ M_F; in fact, M_F − R_M^+ equals the number of values of s ∈ F for which |t_s^{(0)}| ≤ α_M.

The ith of the M null hypotheses H_1^{(0)}, H_2^{(0)}, ..., H_M^{(0)} is among the J null hypotheses rejected by the step-down procedure if and only if |t_i^{(0)}| > α_J. Thus, of the J rejected null hypotheses, R_J^+ of them are false, and the other J − R_J^+ of them are true. And it follows that the false discovery proportion is expressible as

FDP = (J − R_J^+)/J  if J > 0,   and   FDP = 0  if J = 0.   (7.47)
Now, let j′ represent an integer between 0 and M, inclusive, defined as follows: if there exists an integer j ∈ I for which (j − R_j^+)/j > γ, take j′ to be the smallest such integer, that is, take

j′ = min{j ∈ I : (j − R_j^+)/j > γ}

or, equivalently, take

j′ = min{j ∈ I : j − R_j^+ > jγ};

otherwise [i.e., if there exists no j ∈ I for which (j − R_j^+)/j > γ], take j′ = 0. And observe [in light of expression (7.47)] that FDP ≤ γ if j′ = 0 or, alternatively, if j′ ≥ 1 and J < j′, and hence that

FDP > γ  ⇒  J ≥ j′ ≥ 1  ⇒  j′ ≥ 1 and |t_{ĩ_j}^{(0)}| > α_j (j = 1, 2, ..., j′).

Observe also that j′ is a random variable; its value depends on the values of R_1, R_2, ..., R_M. These observations lead to the inequality

Pr(FDP > γ) ≤ Pr[j′ ≥ 1 and |t_{ĩ_j}^{(0)}| > α_j (j = 1, 2, ..., j′)]   (7.48)

and ultimately to the conclusion that

Pr[j′ ≥ 1 and |t_{ĩ_j}^{(0)}| > α_j (j = 1, 2, ..., j′)] ≤ ε  ⇒  Pr(FDP > γ) ≤ ε.   (7.49)

The significance of result (7.49) is that it can be exploited for purposes of obtaining relatively tractable conditions that, when satisfied by the step-down procedure, insure that Pr(FDP > γ) ≤ ε.
Suppose [for purposes of exploiting result (7.49) for such purposes] that j′ ≥ 1. Then, as can with some effort be verified (and as is to be verified in a subsequent part of the present subsection),

R_{j′} = 0   (7.50)

and

j′ − R_{j′}^+ = [j′γ] + 1,   (7.51)

where (for any real number x) [x] denotes the largest integer that is less than or equal to x.

Let us denote by k′ the random variable defined (in terms of j′) as follows: k′ = [j′γ] + 1 for values of j′ ≥ 1 and k′ = 0 for j′ = 0. And observe [in light of result (7.51)] that (for j′ ≥ 1)

M_F ≥ R_{j′}^+ = j′ − k′.   (7.52)

Observe also (in light of the equality M_T = M − M_F) that

M_T ≤ M − (j′ − k′) = M + k′ − j′.   (7.53)

Further, let us (for j′ ≥ 1) denote by k the number of members in the set {ĩ_1, ĩ_2, ..., ĩ_{j′}} that are members of T, so that k of the null hypotheses H_{ĩ_1}^{(0)}, H_{ĩ_2}^{(0)}, ..., H_{ĩ_{j′}}^{(0)} are true (and the other j′ − k are false). And suppose (in addition to j′ ≥ 1) that |t_{ĩ_j}^{(0)}| > α_j (j = 1, 2, ..., j′). Then, clearly, R_{j′}^+ ≥ j′ − k (or, equivalently, k ≥ j′ − R_{j′}^+), so that [in light of result (7.51)]

k′ ≤ k ≤ j′.   (7.54)

Thus, for some strictly positive integer s ≤ j′,

ĩ*_{k′}(T) = ĩ_s.

And [upon observing that ĩ*_{k′}(T) = i*_{k′}(T) and that ĩ_s ∈ T] it follows that

|t_{k′;T}| = |t_{i*_{k′}(T)}| = |t_{ĩ_s}| = |t_{ĩ_s}^{(0)}| > α_s ≥ α_{j′}.   (7.55)
Upon applying result (7.55) [and realizing that result (7.54) implies that k′ ≤ M_T], we obtain the inequality

Pr[j′ ≥ 1 and |t_{ĩ_j}^{(0)}| > α_j (j = 1, 2, ..., j′)] ≤ Pr(j′ ≥ 1, k′ ≤ M_T, and |t_{k′;T}| > α_{j′}).   (7.56)

The relevance of inequality (7.56) is the implication that to obtain a step-down procedure for which Pr[j′ ≥ 1 and |t_{ĩ_j}^{(0)}| > α_j (j = 1, 2, ..., j′)] ≤ ε and ultimately [in light of relationship (7.49)] one for which Pr(FDP > γ) ≤ ε, it suffices to obtain a procedure for which

Pr(j′ ≥ 1, k′ ≤ M_T, and |t_{k′;T}| > α_{j′}) ≤ ε.   (7.57)

For purposes of obtaining a more tractable sufficient condition than condition (7.57), observe that (for j′ ≥ 1)

(k′ − 1)/γ ≤ j′ < k′/γ   (7.58)

and that

1 ≤ k′ ≤ [Mγ] + 1.   (7.59)

Accordingly, the nonzero values of the random variable j′ can be partitioned into mutually exclusive categories based on the value of k′: for u = 1, 2, ..., [Mγ] + 1, let

I_u = {j ∈ I : (u − 1)/γ ≤ j < u/γ},

and observe that (for j′ ≥ 1)

k′ = u  ⇔  j′ ∈ I_u.   (7.60)

Further, for u = 1, 2, ..., [Mγ] + 1, let j*_u represent the largest member of the set I_u, so that for u = 1, 2, ..., [Mγ], j*_u = [u/γ] − 1 or j*_u = [u/γ] depending on whether or not u/γ is an integer (and for u = [Mγ] + 1, j*_u = M). And define α*_u = α_{j*_u} or, equivalently,

α*_u = min_{j ∈ I : j < u/γ} α_j.
Let K = min([Mγ] + 1, M_T). Then, based on the partitioning of the nonzero values of j′ into the mutually exclusive categories defined by relationship (7.60), we find that

Pr(j′ ≥ 1, k′ ≤ M_T, and |t_{k′;T}| > α_{j′})
  = Σ_{u=1}^{K} Pr(|t_{u;T}| > α_{j′} and j′ ∈ I_u)
  ≤ Σ_{u=1}^{K} Pr(|t_{u;T}| > α*_u and j′ ∈ I_u)
  ≤ Σ_{u=1}^{K} Pr(|t_{r;T}| > α*_r for 1 or more values of r ∈ {1, 2, ..., K} and j′ ∈ I_u)
  ≤ Pr(|t_{r;T}| > α*_r for 1 or more values of r ∈ {1, 2, ..., K}).   (7.61)

Thus, to obtain a step-down procedure for which Pr(FDP > γ) ≤ ε, it suffices [in light of the sufficiency of condition (7.57)] to obtain a procedure for which

Pr(|t_{r;T}| > α*_r for 1 or more values of r ∈ {1, 2, ..., K}) ≤ ε.   (7.62)
Verification of results (7.50) and (7.51). Let us verify result (7.50). Suppose that j′ ≥ 2; if j′ = 1, then j′ − R_{j′}^+ = 1 − R_{j′} and j′γ = γ, so that 1 − R_{j′} > γ, which implies that R_{j′} < 1 − γ and hence (since 1 − γ < 1 and R_{j′} is a nonnegative integer) that R_{j′} = 0. And observe that (since j′ − R_{j′}^+ > j′γ and since R_{j′}^+ = R_{j′−1}^+ + R_{j′})

j′ − 1 − R_{j′−1}^+ − (R_{j′} − 1) > (j′ − 1)γ + γ

and hence that

j′ − 1 − R_{j′−1}^+ > (j′ − 1)γ + R_{j′} − 1 + γ.

Thus, R_{j′} = 0, since otherwise (i.e., if R_{j′} ≥ 1), it would be the case that R_{j′} − 1 + γ ≥ γ > 0 and hence that

j′ − 1 − R_{j′−1}^+ > (j′ − 1)γ,

contrary to the definition of j′ as the smallest integer j ∈ I for which j − R_j^+ > jγ.

Turning now to the verification of result (7.51), we find that

j − R_j^+ ≥ [jγ] + 1

for every integer j ∈ I for which j − R_j^+ > jγ, as is evident upon observing that (for j ∈ I) j − R_j^+ is an integer and hence that j − R_j^+ > jγ implies that either j − R_j^+ = [jγ] + 1 or j − R_j^+ > [jγ] + 1 depending on whether jγ ≥ j − R_j^+ − 1 or jγ < j − R_j^+ − 1. Thus,

j′ − R_{j′}^+ ≥ [j′γ] + 1.   (7.63)

Moreover, if j′ − R_{j′}^+ > [j′γ] + 1, it would (since R_{j′} = 0) be the case that

j′ − 1 − R_{j′−1}^+ > [j′γ]   (7.64)

and hence [since both sides of inequality (7.64) are integers] that

j′ − 1 − R_{j′−1}^+ ≥ [j′γ] + 1 > [j′γ] + 1 − γ > j′γ − γ = (j′ − 1)γ,

contrary to the definition of j′ as the smallest integer j ∈ I for which j − R_j^+ > jγ. And it follows that inequality (7.63) holds as an equality, that is,

j′ − R_{j′}^+ = [j′γ] + 1.
Step-down procedures of a particular kind. If and when the first j−1 of the (ordered) null hypotheses H_{ĩ_1}^{(0)}, H_{ĩ_2}^{(0)}, ..., H_{ĩ_M}^{(0)} are rejected by the step-down procedure, the decision as to whether or not H_{ĩ_j}^{(0)} is rejected is determined by whether or not |t_{ĩ_j}^{(0)}| > α_j and hence depends on the choice of α_j. If (following the rejection of H_{ĩ_1}^{(0)}, H_{ĩ_2}^{(0)}, ..., H_{ĩ_{j−1}}^{(0)}) H_{ĩ_j}^{(0)} is rejected, then (as of the completion of the jth step of the step-down procedure) there are j total rejections (total discoveries) and the proportion of those that are false rejections (false discoveries) equals k(j)/j, where k(j) represents the number of null hypotheses among the j rejected null hypotheses that are true. Whether k(j)/j > γ or k(j)/j ≤ γ [i.e., whether or not k(j)/j exceeds the prescribed upper bound] depends on whether k(j) ≥ [jγ] + 1 or k(j) ≤ [jγ]. As discussed by Lehmann and Romano (2005a, p. 1147), that suggests taking the step-down procedure to be of the form of a step-down procedure for controlling the k-FWER at some specified level α̇ and of taking k = [jγ] + 1 (in which case k varies with j). That line of reasoning leads (upon recalling the results of Subsection b) to taking α_j to be of the form

α_j = t̄_{([jγ]+1)α̇ / {2(M + [jγ] + 1 − j)}}(N−P).   (7.65)

Taking (for j = 1, 2, ..., M) α_j to be of the form (7.65) reduces the task of choosing the α_j's to one of choosing α̇.

Note that if (for j = 1, 2, ..., M) α_j is taken to be of the form (7.65), then (for j′ ≥ 1)

α_{j′} = t̄_{k′α̇/[2(M + k′ − j′)]}(N−P) ≥ t̄_{k′α̇/(2M_T)}(N−P),   (7.66)

as is evident upon recalling result (7.53).
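As a computational aid, critical values of the form (7.65) can be obtained as in the following sketch (scipy is assumed to be available; γ and α̇ are supplied by the user, and the function name is ours).

```python
import numpy as np
from scipy import stats

def lr_alphas(M, df, gamma, alpha_dot):
    """alpha_j = upper ( ([j*gamma]+1) * alpha_dot / {2(M + [j*gamma] + 1 - j)} ) point
    of the t(df) distribution, j = 1, ..., M  [expression (7.65)]."""
    j = np.arange(1, M + 1)
    k_j = np.floor(j * gamma).astype(int) + 1          # [j*gamma] + 1
    tail = k_j * alpha_dot / (2.0 * (M + k_j - j))
    return stats.t.ppf(1.0 - tail, df)

alphas = lr_alphas(M=20, df=40, gamma=0.1, alpha_dot=0.05)
print(alphas[:5])
```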
Control of the probability of the FDP exceeding γ: a special case. Let Σ represent the correlation matrix of the least squares estimators τ̂_i (i ∈ T) of the M_T estimable linear combinations τ_i (i ∈ T) of the elements of β. And suppose that λ_i (i ∈ T) are linearly independent (in which case Σ is nonsingular). Suppose further that Σ = I [in which case τ̂_i (i ∈ T) are (in light of the normality assumption) statistically independent] or, more generally, that there exists a diagonal matrix D with diagonal elements of ±1 for which all of the off-diagonal elements of the matrix −DΣ⁻¹D are nonnegative. Then, the version of the step-down procedure obtained upon taking (for j = 1, 2, ..., M) α_j to be of the form (7.65) and upon setting α̇ = ε is such that Pr(FDP > γ) ≤ ε.

Let us verify that this is the case. The verification makes use of an inequality known as the Simes inequality. There are multiple versions of this inequality. The version best suited for present purposes is expressible in the following form:

Pr(X_{(r)} > a_r for 1 or more values of r ∈ {1, 2, ..., n}) ≤ (1/n) Σ_{r=1}^{n} Pr(X_r > a_n),   (7.67)

where X_1, X_2, ..., X_n are absolutely continuous random variables whose joint distribution satisfies the so-called PDS condition, where X_{(1)}, X_{(2)}, ..., X_{(n)} are the random variables whose values are those obtained by ordering the values of X_1, X_2, ..., X_n from largest to smallest, and where a_1, a_2, ..., a_n are any constants for which a_1 ≥ a_2 ≥ ... ≥ a_n ≥ 0 and for which (1/r) Pr(X_j > a_r) is nondecreasing in r (for r = 1, 2, ..., n and for every j ∈ {1, 2, ..., n}); refer, e.g., to Sarkar (2008, sec. 1).
Now, suppose that (for j = 1, 2, ..., M) α_j is of the form (7.65). To verify that the version of the step-down procedure with α_j's of this form is such that Pr(FDP > γ) ≤ ε when α̇ = ε, it suffices to verify that (for α_j's of this form) condition (7.62) can be satisfied by taking α̇ = ε. For purposes of doing so, observe [in light of inequality (7.66)] that

Pr(|t_{r;T}| > α*_r for 1 or more values of r ∈ {1, 2, ..., K})
  ≤ Pr[|t_{r;T}| > t̄_{rα̇/(2M_T)}(N−P) for 1 or more values of r ∈ {1, 2, ..., K}]
  ≤ Pr[|t_{r;T}| > t̄_{rα̇/(2M_T)}(N−P) for 1 or more values of r ∈ {1, 2, ..., M_T}].   (7.68)

Next, apply the Simes inequality (7.67) to the "rightmost" side of inequality (7.68), taking n = M_T and (for r = 1, 2, ..., M_T) taking a_r = t̄_{rα̇/(2M_T)}(N−P) and taking X_r to be the rth of the M_T random variables |t_i| (i ∈ T). That the distributional assumptions underlying the Simes inequality are satisfied in this application follows from Theorem 3.1 of Sarkar (2008). Moreover, the assumption that (1/r) Pr(X_j > a_r) is nondecreasing in r is also satisfied, as is evident upon observing that Pr[|t_j| > t̄_{rα̇/(2M_T)}(N−P)] = rα̇/M_T. Thus, having justified the application of the Simes inequality, we find that

Pr[|t_{r;T}| > t̄_{rα̇/(2M_T)}(N−P) for 1 or more values of r ∈ {1, 2, ..., M_T}]
  ≤ (1/M_T) Σ_{i∈T} Pr[|t_i| > t̄_{M_T α̇/(2M_T)}(N−P)] = M_T α̇/M_T = α̇.   (7.69)

In combination with result (7.68), result (7.69) implies that

Pr(|t_{r;T}| > α*_r for 1 or more values of r ∈ {1, 2, ..., K}) ≤ α̇.

And it follows that condition (7.62) can be satisfied by taking α̇ = ε.
Control of the probability of the FDP exceeding γ: the general case. To insure that the step-down procedure is such that Pr(FDP > γ) ≤ ε, it suffices to restrict the choice of the α_j's to a choice that satisfies condition (7.62), that is, to a choice for which Pr(|t_{r;T}| > α*_r for 1 or more values of r ∈ {1, 2, ..., K}) ≤ ε. The probability of |t_{r;T}| exceeding α*_r for 1 or more values of r ∈ {1, 2, ..., K} depends on T, which is a set of unknown identity (and unknown size); it depends on T through the value of K as well as through the (absolute) values of the t_{r;T}'s. However, |t_{r;T}| ≤ |t_{i_r}| = |t_{(r)}| and K ≤ [Mγ] + 1, so that by substituting t_{(r)} for t_{r;T} and [Mγ] + 1 for K, we can obtain an upper bound for this probability that does not depend on T; clearly, the value of this probability is at least as great following these substitutions as before. And by restricting the choice of the α_j's to a choice for which the resultant upper bound does not exceed ε, that is, to one that satisfies the condition

Pr(|t_{(r)}| > α*_r for 1 or more values of r ∈ {1, 2, ..., [Mγ] + 1}) ≤ ε   (7.70)

[and hence also satisfies condition (7.62)], we can insure that the step-down procedure is such that Pr(FDP > γ) ≤ ε.

Let us consider the choice of the α_j's when (for j = 1, 2, ..., M) α_j is taken to be of the form (7.65). When α_j is taken to be of that form, the value of α_j corresponding to any particular value of α̇ can be regarded as the value of a function, say α̇_j(α̇), of α̇. And corresponding to the functions α̇_1(·), α̇_2(·), ..., α̇_M(·) is the function, say ℓ̇(α̇), of α̇ defined by the left side of inequality (7.70) when α_1 = α̇_1(α̇), α_2 = α̇_2(α̇), ..., α_M = α̇_M(α̇). Clearly, α̇_j(α̇) is a decreasing function of α̇, and ℓ̇(α̇) is an increasing function.

Now, let α̇* represent the solution (for α̇) to the equation ℓ̇(α̇) = ε, so that α̇* is the largest value of α̇ that satisfies the inequality ℓ̇(α̇) ≤ ε. Then, upon setting α_j = α̇_j(α̇*) (j = 1, 2, ..., M), we obtain a choice for the α_j's that satisfies condition (7.70) and hence one for which the step-down procedure is such that Pr(FDP > γ) ≤ ε.

To implement the step-down procedure when α_j = α̇_j(α̇*) (j = 1, 2, ..., M), we must be able to carry out the requisite computations for obtaining the solution α̇* to the equation ℓ̇(α̇) = ε. As in the case of the computation of c_α̇(k; S) (for any particular α̇, k, and S), resort can be made to Monte Carlo methods of the kind discussed by Edwards and Berry (1987).
Suppose that a large number of draws are made from the distribution of the vector t = (t_1, t_2, ..., t_M)′; the feasibility of making a large number of draws from that distribution can be enhanced by taking advantage of results (7.5) and (7.6). And observe that (for any choice of the α_j's)

|t_{(r)}| > α_r for 1 or more values of r ∈ {1, 2, ..., [Mγ] + 1}  ⇔  max_{r∈{1,2,...,[Mγ]+1}} (|t_{(r)}| − α_r) > 0,   (7.71)

so that

ℓ̇(α̇) = ε  ⇔  ċ(α̇) = 0,   (7.72)

where ċ(α̇) is the upper 100ε% point of the distribution of the random variable max_{r∈{1,2,...,[Mγ]+1}} (|t_{(r)}| − α_r) when α_1 = α̇_1(α̇), α_2 = α̇_2(α̇), ..., α_M = α̇_M(α̇). Clearly, ċ(α̇) is an increasing function (of α̇). Moreover, by making use of Monte Carlo methods of the kind discussed by Edwards and Berry (1987), the draws from the distribution of t can be used to approximate the values of ċ(α̇) corresponding to the various values of α̇. Thus, by solving the equation obtained from the equation ċ(α̇) = 0 upon replacing the values of ċ(α̇) with their approximations, we can [in light of result (7.72)] obtain an approximation to the solution α̇* to the equation ℓ̇(α̇) = ε.
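The sketch below indicates one way in which such a computation might be organized: a single set of draws from the distribution of t is reused for every trial value of α̇, ċ(α̇) is approximated by an empirical upper-ε quantile, and the equation ċ(α̇) = 0 is solved by bisection. The sampling scheme (uncorrelated τ̂_i's, so that t ~ MVt(N−P, I)) and all names are illustrative assumptions, not the text's prescription.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
M, df, gamma, eps = 20, 40, 0.1, 0.05

# Draws from the distribution of t (uncorrelated case: MVt(df, I))
n_draws = 100_000
z = rng.standard_normal((n_draws, M))
u = np.sqrt(rng.chisquare(df, size=n_draws) / df)
t_abs = np.abs(z / u[:, None])
t_ord = -np.sort(-t_abs, axis=1)            # |t_(1)| >= |t_(2)| >= ... for each draw
R = int(np.floor(M * gamma)) + 1            # [M*gamma] + 1

def alphas_765(alpha_dot):                  # critical values of the form (7.65)
    j = np.arange(1, M + 1)
    k_j = np.floor(j * gamma).astype(int) + 1
    return stats.t.ppf(1.0 - k_j * alpha_dot / (2.0 * (M + k_j - j)), df)

def c_dot(alpha_dot):                       # Monte Carlo estimate of c(alpha_dot)
    a = alphas_765(alpha_dot)[:R]
    return np.quantile((t_ord[:, :R] - a).max(axis=1), 1.0 - eps)

lo, hi = 1e-6, 0.5                          # bisection; c_dot is increasing in alpha_dot
for _ in range(40):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if c_dot(mid) < 0 else (lo, mid)
print("approximate alpha_dot* =", 0.5 * (lo + hi))
```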

A potential improvement. By definition [specifically, the definition of j*_1, j*_2, ..., j*_{[Mγ]+1}],

α_j ≥ α*_u  (j ∈ I_u; u = 1, 2, ..., [Mγ] + 1).   (7.73)

If (for 1 or more values of u) there exist values of j ∈ I_u for which α_j > α*_u, then the step-down procedure can be "improved" by setting α_j = α*_u for every such j, in which case

α_j = α*_u  (j ∈ I_u; u = 1, 2, ..., [Mγ] + 1).   (7.74)

This change in the α_j's can be expected to result in additional rejections (discoveries). Moreover, the change in the α_j's does not affect inequality (7.62) or inequality (7.70); so that if inequality (7.62) or inequality (7.70) and (as a consequence) the condition Pr(FDP > γ) ≤ ε are satisfied prior to the change, they are also satisfied following the change. Thus, when the objective is to produce a "large" number of rejections (discoveries) while simultaneously controlling Pr(FDP > γ) at level ε, the step-down procedure with

α_j = α̇_{j*_u}(α̇*)  (j ∈ I_u; u = 1, 2, ..., [Mγ] + 1)   (7.75)

[where α̇_j(·) (j = 1, 2, ..., M) and α̇* are as defined in the preceding part of the present subsection] may be preferable to that with α_j = α̇_j(α̇*) (j = 1, 2, ..., M).
Extensions. By introducing some relatively simple and (for the most part) transparent modifications, the coverage of the present subsection [pertaining to step-down methods for controlling Pr(FDP > γ) at a specified level ε] can be extended to the case where the null and alternative hypotheses are either H_i^{(0)}: τ_i ≤ τ_i^{(0)} and H_i^{(1)}: τ_i > τ_i^{(0)} (i = 1, 2, ..., M) or H_i^{(0)}: τ_i ≥ τ_i^{(0)} and H_i^{(1)}: τ_i < τ_i^{(0)} (i = 1, 2, ..., M) rather than H_i^{(0)}: τ_i = τ_i^{(0)} and H_i^{(1)}: τ_i ≠ τ_i^{(0)} (i = 1, 2, ..., M).

Suppose, in particular, that the null and alternative hypotheses are H_i^{(0)}: τ_i ≤ τ_i^{(0)} and H_i^{(1)}: τ_i > τ_i^{(0)} (i = 1, 2, ..., M). Then, the requisite modifications are similar in nature to those described in some detail (in Part 8 of Subsection b) for extending the coverage (provided in the first 7 parts of Subsection b) of step-down methods for controlling the FWER or k-FWER. In particular, the permutations i_1, i_2, ..., i_M and ĩ_1, ĩ_2, ..., ĩ_M and the permutations i*_1(S), i*_2(S), ..., i*_{M_S}(S) and ĩ*_1(S), ĩ*_2(S), ..., ĩ*_{M_S}(S) are redefined and the step-down procedure subjected to the same kind of modifications as in Part 8 of Subsection b.

The definitions of the R_j's are among the other items that require modification. The quantity R_1 is redefined to be the number of values of s ∈ F for which t_s^{(0)} > α_1; and for j = 2, 3, ..., M, R_j is redefined to be the number of values of s ∈ F for which α_{j−1} ≥ t_s^{(0)} > α_j. Then, for j = 1, 2, ..., M, R_j^+ = Σ_{i=1}^{j} R_i represents the number of values of s ∈ F for which t_s^{(0)} > α_j. Clearly, various other quantities (such as k′ and j′) are affected implicitly or explicitly by the redefinition of the R_j's.
In lieu of condition (7.62), we have the condition

Pr(t_{r;T} > α*_r for 1 or more values of r ∈ {1, 2, ..., K}) ≤ ε,   (7.76)

where t_{r;T} = t_{i*_r(T)}. And in lieu of condition (7.70), we have the condition

Pr(t_{(r)} > α*_r for 1 or more values of r ∈ {1, 2, ..., [Mγ] + 1}) ≤ ε,   (7.77)

where t_{(r)} = t_{i_r}; unlike condition (7.76), this condition does not involve T. When the α_j's satisfy condition (7.76) or (7.77), the step-down procedure is such that Pr(FDP > γ) ≤ ε.

The analogue of taking α_j to be of the form (7.65) is to take α_j to be of the form

α_j = t̄_{([jγ]+1)α̇ / (M + [jγ] + 1 − j)}(N−P),   (7.78)

where 0 < α̇ < 1/2. When (for j = 1, 2, ..., M) α_j is taken to be of the form (7.78), it is the case that (for j′ ≥ 1)

α_{j′} = t̄_{k′α̇/(M + k′ − j′)}(N−P) ≥ t̄_{k′α̇/M_T}(N−P);   (7.79)

this result is the analogue of result (7.66). And as before, results on controlling Pr(FDP > γ) at a specified level ε are obtainable (under certain conditions) by making use of the Simes inequality.
Let Σ represent the correlation matrix of the least squares estimators τ̂_i (i ∈ T) of the M_T estimable linear combinations τ_i (i ∈ T) of the elements of β. And suppose that λ_i (i ∈ T) are linearly independent (in which case Σ is nonsingular). Then, in the special case where (for j = 1, 2, ..., M) α_j is taken to be of the form (7.78) and where α̇ = ε < 1/2, the step-down procedure is such that Pr(FDP > γ) ≤ ε provided that the off-diagonal elements of the correlation matrix Σ are nonnegative; it follows from Theorem 3.1 of Sarkar (2008) that the distributional assumptions needed to justify the application of the Simes inequality are satisfied when the off-diagonal elements of Σ are nonnegative.

More generally, when (for j = 1, 2, ..., M) α_j is taken to be of the form (7.78), the value of α̇ needed to achieve control of Pr(FDP > γ) at a specified level ε can be determined "numerically" via an approach similar to that described in a preceding part of the present subsection. Instead of taking (for j = 1, 2, ..., M) α̇_j(α̇) to be the function of α̇ whose values are those of expression (7.65), take it to be the function whose values are those of expression (7.78). And take ℓ̇(α̇) to be the function of α̇ whose values are those of the left side of inequality (7.77) when (for j = 1, 2, ..., M) α_j = α̇_j(α̇). Further, take α̇* to be the solution (for α̇) to the equation ℓ̇(α̇) = ε (if ε is sufficiently small that a solution exists) or, more generally, take α̇* to be a value of α̇ small enough that ℓ̇(α̇) ≤ ε. Then, upon taking (for j = 1, 2, ..., M) α_j = α̇_j(α̇*), we obtain a version of the step-down procedure for which Pr(FDP > γ) ≤ ε. It is worth noting that an "improvement" in the resultant procedure can be achieved by introducing a modification of the α_j's analogous to modification (7.75).
Nonnormality. The approach taken herein [in insuring that the step-down procedure is such that Pr(FDP > γ) ≤ ε] is based on insuring that the α_j's satisfy condition (7.62) or condition (7.70). The assumption (introduced at the beginning of Section 7.7) that the distribution of the vector e (of the residual effects in the G–M model) is MVN can be relaxed without invalidating that approach. Whether or not condition (7.62) is satisfied by any particular choice for the α_j's is completely determined by the distribution of the M_T random variables t_i (i ∈ T), and whether or not condition (7.70) is satisfied is completely determined by the distribution of the M random variables t_i (i = 1, 2, ..., M). And each of the t_i's is expressible as a linear combination of the elements of the vector σ̂⁻¹(α̂ − α), as is evident from result (7.5). Moreover, σ̂⁻¹(α̂ − α) ~ MVt(N−P, I) [refer to result (7.6)] not only when the distribution of the vector whose subvectors are α̂ − α and d is MVN (as is the case when the distribution of e is MVN) but, more generally, when that vector has an absolutely continuous spherical distribution (as is the case when e has an absolutely continuous spherical distribution); refer to result (6.4.67). Thus, if the α_j's satisfy condition (7.62) or condition (7.70) when the distribution of e is MVN, then Pr(FDP > γ) ≤ ε not only in the case where the distribution of e is MVN but also in various other cases.

e. Step-up multiple-comparison procedures for controlling the FDR

Having considered (in Subsection d) the problem of devising a multiple-comparison procedure that controls Pr(FDP > γ) at a specified level, let us consider the problem of devising a multiple-comparison procedure that controls the FDR [i.e., controls E(FDP)] at a specified level. And in doing so, let us make further use of various of the notation and terminology employed in the preceding parts of the present section.
A step-up procedure: general form. Let α_1, α_2, ..., α_M represent a nonincreasing sequence of M strictly positive scalars (so that α_1 ≥ α_2 ≥ ... ≥ α_M > 0). Then, corresponding to this sequence, there is (as discussed in Subsection d) a step-down procedure for testing the M null hypotheses H_i^{(0)}: τ_i = τ_i^{(0)} (i = 1, 2, ..., M) [versus the alternative hypotheses H_i^{(1)}: τ_i ≠ τ_i^{(0)} (i = 1, 2, ..., M)]. There is also a step-up procedure for testing these M null hypotheses. Like the step-down procedure, the step-up procedure rejects (for some integer J between 0 and M, inclusive) the first J of the null hypotheses in the sequence H_{ĩ_1}^{(0)}, H_{ĩ_2}^{(0)}, ..., H_{ĩ_M}^{(0)} [where ĩ_1, ĩ_2, ..., ĩ_M is a permutation of the integers 1, 2, ..., M defined implicitly (wp1) by the inequalities |t_{ĩ_1}^{(0)}| ≥ |t_{ĩ_2}^{(0)}| ≥ ... ≥ |t_{ĩ_M}^{(0)}|]. Where it differs from the step-down procedure is in the choice of J. In the step-up procedure, J is taken to be the largest value of j for which |t_{ĩ_j}^{(0)}| > α_j; if |t_{ĩ_j}^{(0)}| ≤ α_j for j = 1, 2, ..., M, set J = 0. The step-up procedure can be visualized as a procedure in which the null hypotheses are tested sequentially in the order H_{ĩ_M}^{(0)}, H_{ĩ_{M−1}}^{(0)}, ..., H_{ĩ_1}^{(0)}; by definition, |t_{ĩ_M}^{(0)}| ≤ α_M, |t_{ĩ_{M−1}}^{(0)}| ≤ α_{M−1}, ..., |t_{ĩ_{J+1}}^{(0)}| ≤ α_{J+1}, and |t_{ĩ_J}^{(0)}| > α_J.

Note that in contrast to the step-down procedure (where the choice of J is such that |t_{ĩ_j}^{(0)}| > α_j for j = 1, 2, ..., J), the choice of J in the step-up procedure is such that |t_{ĩ_j}^{(0)}| does not necessarily exceed α_j for every value of j ≤ J (though |t_{ĩ_j}^{(0)}| does necessarily exceed α_J for every value of j ≤ J). Note also that (when the α_j's are the same in both cases) the number of rejections produced by the step-up procedure is at least as great as the number produced by the step-down procedure.
For j = 1, 2, ..., M, let

α_j′ = Pr(|t| > α_j),   (7.80)

where t ~ St(N−P). In their investigation of the FDR as a criterion for evaluating and devising multiple-comparison procedures, Benjamini and Hochberg (1995) proposed a step-up procedure in which (as applied to the present setting) α_1, α_2, ..., α_M are of the form defined implicitly by equality (7.80) upon taking (for j = 1, 2, ..., M) α_j′ to be of the form

α_j′ = jα̇/M,   (7.81)

where 0 < α̇ < 1. Clearly, taking the α_j's to be of that form is equivalent to taking them to be of the form

α_j = t̄_{jα̇/(2M)}(N−P)   (7.82)

(j = 1, 2, ..., M).
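For reference, a step-up procedure with critical values of the form (7.82) can be coded as in the following sketch (scipy is assumed; equivalently one may work with the two-sided p-values and the bounds jα̇/M). The function name and example values are ours.

```python
import numpy as np
from scipy import stats

def bh_step_up(t0, df, alpha_dot):
    """Step-up procedure with alpha_j = upper (j*alpha_dot/(2M)) point of t(df):
    reject the J hypotheses with the largest |t^(0)|, where J is the largest j
    such that |t^(0)_{i~_j}| > alpha_j (J = 0 if there is no such j)."""
    t0 = np.asarray(t0, float)
    M = t0.size
    j = np.arange(1, M + 1)
    crit = stats.t.ppf(1.0 - j * alpha_dot / (2.0 * M), df)
    order = np.argsort(-np.abs(t0))
    exceeds = np.abs(t0[order]) > crit
    J = (int(np.nonzero(exceeds)[0][-1]) + 1) if exceeds.any() else 0
    return order[:J]

t0 = np.array([4.0, 2.6, -2.4, 0.3, -1.0, 2.2])
print(bh_step_up(t0, df=60, alpha_dot=0.05))
```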
The FDR of a step-up procedure. Let B_i represent the event consisting of those values of the vector t^{(0)} (with elements t_1^{(0)}, t_2^{(0)}, ..., t_M^{(0)}) for which H_i^{(0)} is rejected by the step-up procedure (i = 1, 2, ..., M). Further, let X_i represent a random variable defined as follows: X_i = 1 if t^{(0)} ∈ B_i, and X_i = 0 if t^{(0)} ∉ B_i. Then, Σ_{i∈T} X_i equals the number of falsely rejected null hypotheses, and Σ_{j∈I} X_j equals the total number of rejections; recall that I = {1, 2, ..., M} and that T is the subset of I consisting of those values of i ∈ I for which H_i^{(0)} is true. And FDP > 0 only if Σ_{j∈I} X_j > 0, in which case

FDP = Σ_{i∈T} X_i / Σ_{j∈I} X_j.   (7.83)

Observe [in light of expression (7.83)] that the false discovery rate is expressible as

FDR = E(FDP) = Σ_{i∈T} E(FDP_i),   (7.84)

where FDP_i = X_i / Σ_{j∈I} X_j if Σ_{j∈I} X_j > 0 and FDP_i = 0 if Σ_{j∈I} X_j = 0. Observe also that FDP_i > 0 only if t^{(0)} ∈ B_i, in which case FDP_i = 1/Σ_{j∈I} X_j, and that (for k = 1, 2, ..., M)

Σ_{j∈I} X_j = k  ⇔  t^{(0)} ∈ A_k,

where A_k is the event consisting of those values of t^{(0)} for which exactly k of the M null hypotheses H_1^{(0)}, H_2^{(0)}, ..., H_M^{(0)} are rejected by the step-up procedure. Accordingly,

FDR = Σ_{i∈T} Σ_{k=1}^{M} (1/k) Pr(t^{(0)} ∈ B_i ∩ A_k).   (7.85)

Clearly, A_M = {t^{(0)} : |t_{ĩ_M}^{(0)}| > α_M}; and for k = M−1, M−2, ..., 1,

A_k = {t^{(0)} : |t_{ĩ_j}^{(0)}| ≤ α_j (j = M, M−1, ..., k+1); |t_{ĩ_k}^{(0)}| > α_k}.   (7.86)

Or, equivalently,

A_k = {t^{(0)} : max_{j ∈ I : |t_{ĩ_j}^{(0)}| > α_j} j = k}   (7.87)

(k = M, M−1, ..., 1).
Now, for purposes of obtaining an expression for the FDR that is "more useful" than expression (7.85), let S_i represent the (M−1)-dimensional subset of the set I = {1, 2, ..., M} obtained upon deleting the integer i, and denote by t^{(−i)} the (M−1)-dimensional subvector of t^{(0)} obtained upon striking out the ith element t_i^{(0)}. And recall that for any (nonempty) subset S of I, ĩ*_1(S), ĩ*_2(S), ..., ĩ*_{M_S}(S) is a permutation of the elements of S such that |t_{ĩ*_1(S)}^{(0)}| ≥ |t_{ĩ*_2(S)}^{(0)}| ≥ ... ≥ |t_{ĩ*_{M_S}(S)}^{(0)}|; and observe that for i ∈ I and for j′ such that ĩ_{j′} = i,

ĩ_j = ĩ*_j(S_i) for j = 1, 2, ..., j′−1,   ĩ_j = i for j = j′,   and   ĩ_j = ĩ*_{j−1}(S_i) for j = j′+1, j′+2, ..., M,   (7.88)

and, conversely,

ĩ*_{j−1}(S_i) = ĩ_{j−1} for j = 2, 3, ..., j′,   and   ĩ*_{j−1}(S_i) = ĩ_j for j = j′+1, j′+2, ..., M.   (7.89)

Further, for i = 1, 2, ..., M, define A*_{M;i} = {t^{(−i)} : |t_{ĩ*_{M−1}(S_i)}^{(0)}| > α_M},

A*_{k;i} = {t^{(−i)} : |t_{ĩ*_{j−1}(S_i)}^{(0)}| ≤ α_j for j = M, M−1, ..., k+1; |t_{ĩ*_{k−1}(S_i)}^{(0)}| > α_k}

(k = M−1, M−2, ..., 2), and

A*_{1;i} = {t^{(−i)} : |t_{ĩ*_{j−1}(S_i)}^{(0)}| ≤ α_j for j = M, M−1, ..., 2}.

For i, k = 1, 2, ..., M, A*_{k;i} is interpretable in terms of the results that would be obtained if the step-up procedure were applied to the M−1 null hypotheses H_1^{(0)}, H_2^{(0)}, ..., H_{i−1}^{(0)}, H_{i+1}^{(0)}, ..., H_M^{(0)} (rather than to all M of the null hypotheses) with the role of α_1, α_2, ..., α_M being assumed by α_2, α_3, ..., α_M; if t^{(−i)} ∈ A*_{k;i}, then exactly k−1 of these M−1 null hypotheses would be rejected.

It can be shown and (in the next part of the present subsection) will be shown that for i, k = 1, 2, ..., M,

t^{(0)} ∈ B_i ∩ A_k  ⇔  t_i^{(0)} ∈ B*_{k;i} and t^{(−i)} ∈ A*_{k;i},   (7.90)

where B*_{k;i} = {t_i^{(0)} : |t_i^{(0)}| > α_k}. And upon recalling result (7.85) and making use of relationship (7.90), we find that

FDR = Σ_{i∈T} Σ_{k=1}^{M} (1/k) Pr(t_i ∈ B*_{k;i} and t^{(−i)} ∈ A*_{k;i})
    = Σ_{i∈T} Σ_{k=1}^{M} (α_k′/k) Pr(t^{(−i)} ∈ A*_{k;i} | |t_i| > α_k)   (7.91)

[where α_k′ is as defined by expression (7.80) and t_i as defined by expression (7.3)].
Note that (for k = M, M−1, ..., 2)

A*_{k;i} = {t^{(−i)} : max_{j ∈ {M, M−1, ..., 2} : |t_{ĩ*_{j−1}(S_i)}^{(0)}| > α_j} j = k},   (7.92)

analogous to result (7.87). Note also that the sets A*_{1;i}, A*_{2;i}, ..., A*_{M;i} are mutually disjoint and that

∪_{k=1}^{M} A*_{k;i} = R^{M−1}   (7.93)

(i = 1, 2, ..., M).
Verification of result (7.90). Suppose that t^{(0)} ∈ B_i ∩ A_k. Then,

|t_{ĩ_j}^{(0)}| ≤ α_j for j > k   and   |t_{ĩ_k}^{(0)}| > α_k.   (7.94)

Further,

i = ĩ_{j′} for some j′ ≤ k,   (7.95)

and [in light of result (7.89)]

ĩ*_{j−1}(S_i) = ĩ_j for j > j′ and hence for j > k.   (7.96)

Results (7.95) and (7.94) imply that

|t_i^{(0)}| = |t_{ĩ_{j′}}^{(0)}| ≥ |t_{ĩ_k}^{(0)}| > α_k

and hence that t_i^{(0)} ∈ B*_{k;i}. Moreover, it follows from results (7.96) and (7.94) that for j > k,

|t_{ĩ*_{j−1}(S_i)}^{(0)}| = |t_{ĩ_j}^{(0)}| ≤ α_j.   (7.97)

And for k > 1,

ĩ*_{k−1}(S_i) = ĩ_k if k > j′,   and   ĩ*_{k−1}(S_i) = ĩ_{k−1} if k = j′,

as is evident from result (7.89), so that [observing that |t_{ĩ_{k−1}}^{(0)}| ≥ |t_{ĩ_k}^{(0)}| and making use of result (7.94)]

|t_{ĩ*_{k−1}(S_i)}^{(0)}| ≥ |t_{ĩ_k}^{(0)}| > α_k  (k > 1).   (7.98)

Together, results (7.97) and (7.98) imply that t^{(−i)} ∈ A*_{k;i}.
Conversely, suppose that t_i^{(0)} ∈ B*_{k;i} and t^{(−i)} ∈ A*_{k;i}. Further, denote by j′ the integer defined (implicitly and uniquely) by the equality i = ĩ_{j′}. And observe that j′ ≤ k, since otherwise (i.e., if j′ > k) it would [in light of result (7.88)] be the case that

|t_i^{(0)}| = |t_{ĩ_{j′}}^{(0)}| ≤ |t_{ĩ_{j′−1}}^{(0)}| = |t_{ĩ*_{j′−1}(S_i)}^{(0)}| ≤ α_{j′} ≤ α_k,

contrary to the supposition (which implies that |t_i^{(0)}| > α_k). Then, making use of result (7.88), we find that t_{ĩ_k}^{(0)} = t_i^{(0)} or t_{ĩ_k}^{(0)} = t_{ĩ*_{k−1}(S_i)}^{(0)} (depending on whether j′ = k or j′ < k) and that in either case

|t_{ĩ_k}^{(0)}| > α_k,

and we also find that for j > k (≥ j′)

|t_{ĩ_j}^{(0)}| = |t_{ĩ*_{j−1}(S_i)}^{(0)}| ≤ α_j,

leading to the conclusion that t^{(0)} ∈ A_k and (since i = ĩ_{j′} and j′ ≤ k) to the further conclusion that t^{(0)} ∈ B_i and ultimately to the conclusion that t^{(0)} ∈ B_i ∩ A_k.
Control of the FDR in a special case. For i = 1, 2, ..., M, let

z_i^{(0)} = (τ̂_i − τ_i^{(0)}) / {[λ_i′(X′X)⁻λ_i]^{1/2} σ}   and   z_i = (τ̂_i − τ_i) / {[λ_i′(X′X)⁻λ_i]^{1/2} σ}.

And define z^{(0)} = (z_1^{(0)}, z_2^{(0)}, ..., z_M^{(0)})′, and observe that z_i^{(0)} = z_i for i ∈ T. Further, for i = 1, 2, ..., M, denote by z^{(−i)} the (M−1)-dimensional subvector of z^{(0)} obtained upon deleting the ith element; and let u = σ̂/σ. Then, t_i^{(0)} = z_i^{(0)}/u, t_i = z_i/u, t^{(0)} = u⁻¹z^{(0)}, and t^{(−i)} = u⁻¹z^{(−i)}; u and z^{(0)} are statistically independent; and (N−P)u² ~ χ²(N−P) and z^{(0)} ~ N(μ, Σ), where μ is the M×1 vector with ith element μ_i = [σ²λ_i′(X′X)⁻λ_i]^{−1/2}(τ_i − τ_i^{(0)}) and Σ is the M×M matrix with ijth element

σ_ij = λ_i′(X′X)⁻λ_j / {[λ_i′(X′X)⁻λ_i]^{1/2} [λ_j′(X′X)⁻λ_j]^{1/2}};

recall the results of Section 7.3a, including results (3.9) and (3.21).
Let f(·) represent the pdf of the N(0, 1) distribution, and denote by h(·) the pdf of the distribution of the random variable u. Then, the conditional probability Pr[t^{(−i)} ∈ A*_{k;i} | |t_i| > α_k], which appears in expression (7.91) (for the FDR), is expressible as follows:

Pr[t^{(−i)} ∈ A*_{k;i} | |t_i| > α_k] = Pr[u⁻¹z^{(−i)} ∈ A*_{k;i} | |z_i| > uα_k]
  = ∫_0^∞ ∫_{z : |z| > wα_k} Pr[w⁻¹z^{(−i)} ∈ A*_{k;i} | z_i = z, u = w] · [f(z) / Pr(|z_i| > wα_k)] dz h(w) dw.   (7.99)

Moreover, the distribution of z^{(0)} conditional on u = w does not depend on w; and if σ_ij = 0 for all j ≠ i, then the distribution of z^{(−i)} conditional on z_i = z does not depend on z. Thus, if σ_ij = 0 for all j ≠ i, then

Pr[w⁻¹z^{(−i)} ∈ A*_{k;i} | z_i = z, u = w] = Pr[w⁻¹z^{(−i)} ∈ A*_{k;i} | u = w].   (7.100)

And upon substituting expression (7.100) in formula (7.99), we find that (in the special case where σ_ij = 0 for all j ≠ i)

Pr[t^{(−i)} ∈ A*_{k;i} | |t_i| > α_k] = E{Pr[u⁻¹z^{(−i)} ∈ A*_{k;i} | u]} = Pr[t^{(−i)} ∈ A*_{k;i}].   (7.101)
Now, suppose that σ_ij = 0 for all i ∈ T and j ∈ I such that j ≠ i. Then, in light of result (7.101), formula (7.91) for the FDR can be reexpressed as follows:

FDR = Σ_{i∈T} Σ_{k=1}^{M} (α_k′/k) Pr[t^{(−i)} ∈ A*_{k;i}].   (7.102)

Moreover, in the special case where (for j = 1, 2, ..., M) α_j is of the form (7.82) [and hence where α_j′ is of the form (7.81)], expression (7.102) can be simplified. In that special case, we find [upon recalling that the sets A*_{1;i}, A*_{2;i}, ..., A*_{M;i} are mutually disjoint and that their union is R^{M−1}] that

FDR = Σ_{i∈T} (α̇/M) Σ_{k=1}^{M} Pr[t^{(−i)} ∈ A*_{k;i}]
    = (M_T/M) α̇ Pr[t^{(−i)} ∈ R^{M−1}]
    = (M_T/M) α̇ ≤ α̇.   (7.103)
Based on inequality (7.103), we conclude that if ij D 0 for j ¤ i D 1; 2; : : : ; M (so that ij D 0
for all i 2 T and j 2 I such that j ¤ i regardless of the unknown identity of the set T ), then the FDR
can be controlled at level ı (in the sense that FDR  ı) by taking (for j D 1; 2; : : : ; M ) ˛j to be of
466 Confidence Intervals (or Sets) and Tests of Hypotheses

the form (7.82) and by setting P D ı. The resultant step-up procedure is that proposed by Benjamini
and Hochberg (1995).
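To make the mechanics of this step-up procedure concrete, the following is a minimal Python sketch of the Benjamini–Hochberg rule in its familiar p-value form. The function name, the synthetic p-values, and the use of numpy are illustrative choices made for the sketch, not part of the text's development.

```python
import numpy as np

def benjamini_hochberg(p_values, delta):
    """Step-up (Benjamini-Hochberg) procedure: reject the nulls whose p-values
    are among the J smallest, where J is the largest j such that the j-th
    smallest p-value is at most j*delta/M."""
    p = np.asarray(p_values, dtype=float)
    M = p.size
    order = np.argsort(p)                   # indices of p-values, smallest first
    thresholds = delta * np.arange(1, M + 1) / M
    below = p[order] <= thresholds          # compare ordered p-values with j*delta/M
    if not below.any():
        return np.zeros(M, dtype=bool)      # J = 0: no rejections
    J = int(np.max(np.nonzero(below)[0])) + 1
    reject = np.zeros(M, dtype=bool)
    reject[order[:J]] = True                # reject the J most significant hypotheses
    return reject

# Toy usage with 20 hypothetical two-sided p-values and FDR level delta = 0.05.
rng = np.random.default_rng(0)
p_vals = np.concatenate([rng.uniform(0, 0.002, 5), rng.uniform(0, 1, 15)])
print(benjamini_hochberg(p_vals, 0.05))
```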
Note that when (for $j = 1, 2, \ldots, M$) $\alpha_j$ is taken to be of the form (7.82), the FDR could be reduced by decreasing the value of $\dot\gamma$, but the reduction in FDR would come at the expense of a potential reduction in the number of rejections (discoveries)—there would be a potential reduction in the number of rejections of false null hypotheses (number of true discoveries) as well as in the number of rejections of true null hypotheses (number of false discoveries). Note also that the validity of results (7.102) and (7.103) depends only on various characteristics of the distribution of the random variables $z_j$ ($j \in T$) and $u$ and on those random variables being distributed independently of the random variables $z_j$ ($j \in F$); the (marginal) distribution of the random variables $z_j$ ($j \in F$) is "irrelevant."
An extension. The step-up procedure and the various results on its FDR can be readily extended to the case where the null and alternative hypotheses are either $H_i^{(0)}\!: \tau_i \le \tau_i^{(0)}$ and $H_i^{(1)}\!: \tau_i > \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$) or $H_i^{(0)}\!: \tau_i \ge \tau_i^{(0)}$ and $H_i^{(1)}\!: \tau_i < \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$) rather than $H_i^{(0)}\!: \tau_i = \tau_i^{(0)}$ and $H_i^{(1)}\!: \tau_i \ne \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$). Suppose, in particular, that the null and alternative hypotheses are $H_i^{(0)}\!: \tau_i \le \tau_i^{(0)}$ and $H_i^{(1)}\!: \tau_i > \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$), so that the set $T$ of values of $i \in I$ for which $H_i^{(0)}$ is true and the set $F$ for which it is false are $T = \{i \in I : \tau_i \le \tau_i^{(0)}\}$ and $F = \{i \in I : \tau_i > \tau_i^{(0)}\}$. And consider the extension of the step-up procedure and of the various results on its FDR to this case.

In regard to the procedure itself, it suffices to redefine the permutation $\tilde i_1, \tilde i_2, \ldots, \tilde i_M$ and the integer $J$: take $\tilde i_1, \tilde i_2, \ldots, \tilde i_M$ to be the permutation (of the integers $1, 2, \ldots, M$) defined implicitly (wp1) by the inequalities $\tilde t^{(0)}_{\tilde i_1} \ge \tilde t^{(0)}_{\tilde i_2} \ge \cdots \ge \tilde t^{(0)}_{\tilde i_M}$, and take $J$ to be the largest value of $j$ for which $\tilde t^{(0)}_{\tilde i_j} > \alpha_j$—if $\tilde t^{(0)}_{\tilde i_j} \le \alpha_j$ for $j = 1, 2, \ldots, M$, set $J = 0$. As before, $\alpha_1, \alpha_2, \ldots, \alpha_M$ represents a nonincreasing sequence of scalars (so that $\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_M$); however, unlike before, some or all of the $\alpha_j$'s can be negative.
In regard to the FDR of the step-up procedure, we find by proceeding in the same fashion as in arriving at expression (7.91) and by redefining (for an arbitrary nonempty subset $S$ of $I$) $\tilde i_1(S), \tilde i_2(S), \ldots, \tilde i_{M_S}(S)$ to be a permutation of the elements of $S$ for which
$\tilde t^{(0)}_{\tilde i_1(S)} \ge \tilde t^{(0)}_{\tilde i_2(S)} \ge \cdots \ge \tilde t^{(0)}_{\tilde i_{M_S}(S)}$,   (7.104)
by redefining $A_{k;-i}$ for $k = M$ as
$A_{M;-i} = \{t^{(-i)} : \tilde t^{(0)}_{\tilde i_{M-1}(S_{-i})} > \alpha_M\}$,   (7.105)
for $k = M-1, M-2, \ldots, 2$ as
$A_{k;-i} = \{t^{(-i)} : \tilde t^{(0)}_{\tilde i_{j-1}(S_{-i})} \le \alpha_j \text{ for } j = M, M-1, \ldots, k+1;\; \tilde t^{(0)}_{\tilde i_{k-1}(S_{-i})} > \alpha_k\}$,   (7.106)
and for $k = 1$ as
$A_{1;-i} = \{t^{(-i)} : \tilde t^{(0)}_{\tilde i_{j-1}(S_{-i})} \le \alpha_j \text{ for } j = M, M-1, \ldots, 2\}$,   (7.107)
by redefining $B_{k;-i}$ (for $k = 1, 2, \ldots, M$) as
$B_{k;-i} = \{t_i^{(0)} : t_i^{(0)} > \alpha_k\}$,   (7.108)
and by redefining $\alpha_j'$ (for $j = 1, 2, \ldots, M$) as
$\alpha_j' = \Pr(t > \alpha_j)$   (7.109)
[where $t \sim S\,t(N-P)$], that
$\mathrm{FDR} = \sum_{i \in T}\sum_{k=1}^{M} (1/k)\, \Pr[\,t_i^{(0)} \in B_{k;-i} \text{ and } t^{(-i)} \in A_{k;-i}\,]$
$= \sum_{i \in T}\sum_{k=1}^{M} (1/k)\, \Pr[\,t_i^{(0)} > \alpha_k\,]\, \Pr[\,t^{(-i)} \in A_{k;-i} \mid t_i^{(0)} > \alpha_k\,]$
$\le \sum_{i \in T}\sum_{k=1}^{M} (\alpha_k'/k)\, \Pr[\,t^{(-i)} \in A_{k;-i} \mid t_i^{(0)} > \alpha_k\,]$.   (7.110)
Note that (for $k = M, M-1, \ldots, 2$) $A_{k;-i}$ [as redefined by expression (7.105) or (7.106)] is reexpressible as
$A_{k;-i} = \{t^{(-i)} : \max\{\,j \in \{M, M-1, \ldots, 2\} : \tilde t^{(0)}_{\tilde i_{j-1}(S_{-i})} > \alpha_j\,\} = k\}$,
analogous to expression (7.92). Note also that subsequent to the redefinition of the sets $A_{1;-i}, A_{2;-i}, \ldots, A_{M;-i}$, it is still the case that they are mutually disjoint and that $\bigcup_{k=1}^{M} A_{k;-i} = \mathbb R^{M-1}$.
Take $u$, $z_i$, $z^{(-i)}$, $f(\cdot)$, and $h(\cdot)$ to be as defined in the preceding part of the present subsection, and take $A_{k;-i}$ to be as redefined by expression (7.105), (7.106), or (7.107). Then, analogous to result (7.99), we find that
$\Pr[\,t^{(-i)} \in A_{k;-i} \mid t_i^{(0)} > \alpha_k\,] = \Pr[\,u^{-1}z^{(-i)} \in A_{k;-i} \mid z_i > u\alpha_k - \mu_i\,]$
$= \displaystyle\int_0^\infty \int_{u\alpha_k - \mu_i}^{\infty} \Pr[\,u^{-1}z^{(-i)} \in A_{k;-i} \mid z_i = z_i,\, u = u\,]\, \frac{f(z_i)}{\Pr(z_i > u\alpha_k - \mu_i)}\, dz_i\, h(u)\, du$.   (7.111)
Now, suppose that $\rho_{ij} = 0$ for all $i \in T$ and $j \in I$ such that $j \ne i$. Then, by proceeding in much the same fashion as in arriving at result (7.103), we find that in the special case where (for $j = 1, 2, \ldots, M$) $\alpha_j'$ [as redefined by expression (7.109)] is taken to be of the form (7.81) and hence where (for $j = 1, 2, \ldots, M$) $\alpha_j = \bar t_{j\dot\gamma/M}(N-P)$,
$\mathrm{FDR} \le (M_T/M)\,\dot\gamma \le \dot\gamma$.
Thus, as in the case of the step-up procedure for testing the null hypotheses $H_i^{(0)}\!: \tau_i = \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$) when (for $j = 1, 2, \ldots, M$) $\alpha_j = \bar t_{j\delta/(2M)}(N-P)$, we find that if $\rho_{ij} = 0$ for $j \ne i = 1, 2, \ldots, M$, then in the special case where (for $j = 1, 2, \ldots, M$) $\alpha_j = \bar t_{j\delta/M}(N-P)$, the step-up procedure for testing the null hypotheses $H_i^{(0)}\!: \tau_i \le \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$) controls the FDR at level $\delta$ (in the sense that $\mathrm{FDR} \le \delta$) and does so regardless of the unknown identity of the set $T$.
Nonindependence. Let us consider further the step-up procedures for testing the null hypotheses $H_i^{(0)}\!: \tau_i = \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$) and for testing the null hypotheses $H_i^{(0)}\!: \tau_i \le \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$). Suppose that the $\alpha_j$'s are those that result from taking (for $j = 1, 2, \ldots, M$) $\alpha_j'$ to be of the form $\alpha_j' = j\dot\gamma/M$ and from setting $\dot\gamma = \delta$. If $\rho_{ij} = 0$ for $j \ne i = 1, 2, \ldots, M$, then (as indicated in the preceding two parts of the present subsection) the step-up procedures control the FDR at level $\delta$. To what extent does this property (i.e., control of the FDR at level $\delta$) extend to cases where $\rho_{ij} \ne 0$ for some or all $j \ne i = 1, 2, \ldots, M$?

Suppose that $\lambda_1, \lambda_2, \ldots, \lambda_M$ are linearly independent and that $\Sigma$ is nonsingular (as would necessarily be the case if $\rho_{ij} = 0$ for $j \ne i = 1, 2, \ldots, M$). Then, it can be shown (and subsequently will be shown) that in the case of the step-up procedure for testing the null hypotheses $H_i^{(0)}\!: \tau_i \le \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$) with $\alpha_j' = j\dot\gamma/M$ or, equivalently, $\alpha_j = \bar t_{j\dot\gamma/M}(N-P)$ (for $j = 1, 2, \ldots, M$),
$\rho_{ij} \ge 0$ for all $i \in T$ and $j \in I$ such that $j \ne i$ $\;\Rightarrow\;$ $\mathrm{FDR} \le (M_T/M)\,\dot\gamma$.   (7.112)
Thus, when $\alpha_j = \bar t_{j\dot\gamma/M}(N-P)$ for $j = 1, 2, \ldots, M$ and when $\dot\gamma = \delta$, the step-up procedure for testing the null hypotheses $H_i^{(0)}\!: \tau_i \le \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$) controls the FDR at level $\delta$ (regardless of the unknown identity of $T$) provided that $\rho_{ij} \ge 0$ for $j \ne i = 1, 2, \ldots, M$.
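As a quick empirical check on the independence case, the following hedged Python sketch simulates a step-up procedure of this kind (applied, for simplicity, to p-values computed from known-variance normal test statistics rather than t statistics) and estimates its FDR; under independence the estimate should be close to the bound $(M_T/M)\dot\gamma$. The problem sizes, effect size, number of replicates, and the use of scipy are assumptions made only for the illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
M, M_T, delta, n_rep = 200, 150, 0.10, 2000
effect = 3.0  # mean shift for the M - M_T false nulls (illustrative)

def step_up_reject(p, delta):
    # Reject the J smallest p-values, where J = max{j : p_(j) <= j*delta/M}.
    order = np.argsort(p)
    ok = p[order] <= delta * np.arange(1, p.size + 1) / p.size
    J = 0 if not ok.any() else int(np.max(np.nonzero(ok)[0])) + 1
    reject = np.zeros(p.size, dtype=bool)
    reject[order[:J]] = True
    return reject

true_null = np.arange(M) < M_T
fdp = np.empty(n_rep)
for r in range(n_rep):
    z = rng.standard_normal(M) + np.where(true_null, 0.0, effect)
    p = 2 * norm.sf(np.abs(z))                      # two-sided p-values
    rej = step_up_reject(p, delta)
    fdp[r] = (rej & true_null).sum() / max(rej.sum(), 1)

print("estimated FDR:", fdp.mean(), "  bound (M_T/M)*delta:", M_T / M * delta)
```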
Turning now to the case of the step-up procedure for testing the null hypotheses $H_i^{(0)}\!: \tau_i = \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$), let $\Sigma_T$ represent the $M_T \times M_T$ submatrix of $\Sigma$ obtained upon striking out the $i$th row and $i$th column for every $i \in F$, and suppose that (for $j = 1, 2, \ldots, M$) $\alpha_j' = j\dot\gamma/M$ or, equivalently, $\alpha_j = \bar t_{j\dot\gamma/(2M)}(N-P)$. Then, it can be shown (and will be shown) that in this case,
the existence of an $M_T \times M_T$ diagonal matrix $D_T$ with diagonal elements of $\pm 1$ for which all of the off-diagonal elements of the matrix $D_T \Sigma_T^{-1} D_T$ are nonnegative, together with the condition $\rho_{ij} = 0$ for all $i \in T$ and $j \in F$,
$\;\Rightarrow\; \mathrm{FDR} \le (M_T/M)\,\dot\gamma$.   (7.113)
Relationship (7.113) serves to define (for every $T$) a collection of values of $\Sigma$ for which $\mathrm{FDR} \le (M_T/M)\,\dot\gamma$; this collection may include values of $\Sigma$ in addition to those for which $\rho_{ij} = 0$ for all $i \in T$ and $j \in I$ such that $j \ne i$. Note, however, that when $T$ contains only a single member, say the $i$th member, of the set $\{1, 2, \ldots, M\}$, this collection consists of those values of $\Sigma$ for which $\rho_{ij} = 0$ for every $j \ne i$. Thus, relationship (7.113) does not provide a basis for adding to the collection of values of $\Sigma$ for which the step-up procedure [for testing the null hypotheses $H_i^{(0)}\!: \tau_i = \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$) with (for $j = 1, 2, \ldots, M$) $\alpha_j = \bar t_{j\dot\gamma/(2M)}(N-P)$ and with $\dot\gamma = \delta$] controls the FDR at level $\delta$ (regardless of the unknown value of $T$).
In the case of the step-up procedure for testing the null hypotheses $H_i^{(0)}\!: \tau_i = \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$) or the null hypotheses $H_i^{(0)}\!: \tau_i \le \tau_i^{(0)}$ ($i = 1, 2, \ldots, M$) with the $\alpha_j$'s chosen so that (for $j = 1, 2, \ldots, M$) $\alpha_j'$ is of the form $\alpha_j' = j\dot\gamma/M$, control of the FDR at level $\delta$ can be achieved by setting $\dot\gamma = \delta$ only when $\Sigma$ satisfies certain relatively restrictive conditions. However, such control can be achieved regardless of the value of $\Sigma$ by setting $\dot\gamma$ equal to a value that is sufficiently smaller than $\delta$. Refer to Exercise 31 for some specifics.
Verification of results (7.112) and (7.113). Suppose [for purposes of verifying result (7.112)] that
ij  0 for all i 2 T and j 2 I such that j ¤ i. And consider the function g.z. i / I k 0; u/ of z. i /
defined for each strictly positive scalar u and for each integer k 0 between 1 and M , inclusive, as
follows:
(
1; if uz. i / 2 kk 0 AkI i ,
S
. i/ 0
g.z I k ; u/ D
0; otherwise,
 . i/
where AkI i is the set of t -values defined by expression (7.105), (7.106), or (7.107).
. i/
Clearly, the function g. I k 0; u/ is a nonincreasing function [in the sense that g.z2 I k 0; u/ 
. i/ . i/ . i/ . i/
g.z1 I k 0; u/ for any 2 values z1 and z2 of z. i / for which the elements of z2 are greater
than or equal to the corresponding elements of z.1 i / ]. And the distribution of z. i / conditional on
zi D z i is MVN with
0
E z. i / j zi D z i D . i / C . i / .z i i / and var z. i / j zi D z i D † . i / . i /. i / ;
 

where . i / is the (M 1)-dimensional subvector of the vector  and . i / the (M 1)-dimensional


subvector of the vector .1i ; 2i ; : : : ; M i /0 obtained upon excluding the i th element and where
† . i / is the .M 1/  .M 1/ submatrix of † obtained upon excluding the i th row and i th column.
Now, let
q.z i I k 0; u/ D Pr u 1 z. i / 2 kk 0 AkI i j zi D z i ; u D u :
S 

And regard q.z i I k 0; u/ as a function of z i , and observe (in light of the statistical independence of u
and z.0/ ) that
q.z i I k 0; u/ D Pr u 1 z. i / 2 kk 0 AkI i j zi D z i D E g.z. i / I k 0; u/ j zi D z i :
S   

Then, based on a property of the MVN distribution that is embodied in Theorem 5 of Müller (2001),
it can be deduced that q. I k 0; u/ is a nonincreasing function. Moreover, for “any” nonincreasing
function, say q.z i /, of z i (and for k 0 D 1; 2; : : : ; M 1),
Z 1 Z 1
f .z i / f .z i /
q.z i / dz i  q.z i / dz ; (7.114)
u˛k 0C1 i Pr.zi > u ˛k 0C1 i / u˛k 0 i Pr.zi > u ˛k 0 i / i
as can be readily verified. Thus,
Z 1
f .z i /
Pr u 1 z. i / 2 Ak 0C1I i j zi D z i ; u D u

dz
u˛k 0C1 i Pr.zi > u ˛k 0C1 i / i
Z 1  [  f .z i /
D Pr u 1 z. i / 2 AkI i j zi D z i ; u D u dz
u˛k 0C1 i 0C1
Pr.zi > u ˛k 0C1 i / i
Z 1 kk
 [  f .z i /
Pr u 1 z. i / 2 AkI i j zi D z i ; u D u dz
u˛k 0C1 i Pr.zi > u ˛k 0C1 i / i
kk 0
1
f .z i /
Z  [ 
 Pr u 1 z. i / 2 AkI i j zi D z i ; u D u dz
u˛k 0C1 i 0
Pr.z i > u ˛k 0C1 i / i
Z 1 kk C1
 [  f .z i /
Pr u z 1 . i/
2 AkI i j zi D z i ; u D u dz : (7.115)
u˛k 0 i 0
Pr.z i > u ˛k 0 i / i
kk

Finally, upon summing both “sides” of inequality (7.115) over k 0 (from 1 to M 1), we find that
M Z 1
X f .z i /
Pr u 1 z. i / 2 AkI i j zi D z i ; u D u

dz
u˛k i Pr.zi > u ˛k i / i
kD1
1
f .z i /
Z
Pr u 1 . i/
2 A1I i j zi D z i ; u D u

D z dz
u˛1 i Pr.zi > u ˛1 i / i
M
X1 Z 1 f .z i /
Pr u 1 . i/
2 Ak 0C1I i j zi D z i ; u D u

C z dz
u˛k 0C1 i Pr.zi > u ˛k 0C1 i / i
k 0 D1
1
f .z i /
Z  [ 
 Pr u 1 . i/
z 2 AkI i j zi D z i ; u D u dz
u˛M i Pr.zi > u ˛M i / i
kM
1
f .z i /
Z
D 1 dz D 1I
u˛M i Pr.zi > u ˛M i / i
so that to complete the verification of result (7.112), it remains only to observe [in light of expressions
(7.110) and (7.111)] that
k P =M
FDR  i 2T M Pr t . i / 2 AkI i j ti.0/ > ˛k
P P 
kD1
k
D . P =M / i 2T kD1 Pr t . i / 2 AkI i j ti.0/ > ˛k
P PM 
P R1
 . P =M / i 2T 0 1 h.u/ d u
D .MT =M / P :
Turning to the verification of result (7.113), suppose that there exists an MT  MT diagonal
matrix DT with diagonal elements of ˙1 for which all of the off-diagonal elements of the matrix
DT †T 1 DT are nonnegative, and suppose in addition that ij D 0 for all i 2 T and j 2 F. Then,
Pr t . i / 2 AkI i j jti j > ˛k


D Pr u 1 z. i / 2 AkI i j jzi j > u ˛k



Z 1Z Z 1
Pr u 1 z. i / 2 AkI i j jzi jD z i ; z.0/

D F D zF ; u D u
0 RMF u˛k
f  .z i /
dz p.zF / d zF h.u/ d u;
Pr.jzi j > u ˛k / i
FIGURE 7.5. A display (in the form of a histogram with intervals of width 0.025) of the frequencies (among the 6033 genes) of the various values of the $\hat\tau_s$'s.
where z.0/
F is the MF 1 random vector whose elements are zj
.0/
(j 2 F ), where p./ is the pdf of the
.0/ 
distribution of zF , and where f ./ D 2f ./ is the pdf of the distribution of the absolute value of a
random variable that has an N.0; 1/ distribution. And whether or not u 1 z. i / 2 AkI i when z.0/
F D zF
and u D u is determined by the absolute values jzj j (j 2 T; j ¤ i ) of the MT 1 random variables zj
(j 2 T; j ¤ i ). Moreover, it follows from the results of Karlin and Rinott (1980, theorem 4.1; 1981,
theorem 3.1) that for “any” nonincreasing function gŒjzj j .j 2 T; j ¤ i / of the absolute values of
zj (j 2 T; j ¤ i /, the conditional expected value of gŒjzj j .j 2 T; j ¤ i / given that jzi j D z i is a
nonincreasing function of z i . Accordingly, result (7.113) can be verified by proceeding in much the
same way as in the verification of result (7.112).

f. An illustration
Let us use the example from Part 1 of Subsection c to illustrate various of the alternative multiple-comparison procedures. In that example, the data consist of the expression levels obtained for 6033 genes on 102 men, 50 of whom were normal (control) subjects and 52 of whom were prostate cancer patients. And the objective was presumed to be that of testing each of the 6033 null hypotheses $H_s^{(0)}\!: \tau_s = 0$ ($s = 1, 2, \ldots, 6033$) versus the corresponding one of the alternative hypotheses $H_s^{(1)}\!: \tau_s \ne 0$ ($s = 1, 2, \ldots, 6033$), where $\tau_s = \mu_{s2} - \mu_{s1}$ represents the expected difference (between the cancer patients and the normal subjects) in the expression level of the $s$th gene.

Assume that the subjects have been numbered in such a way that the first through 50th subjects are the normal (control) subjects and the 51st through 102nd subjects are the cancer patients. And for $s = 1, 2, \ldots, 6033$ and $j = 1, 2, \ldots, 102$, denote by $y_{sj}$ the random variable whose value is the value obtained for the expression level of the $s$th gene on the $j$th subject. Then, the least squares estimator of $\tau_s$ is $\hat\tau_s = \hat\mu_{s2} - \hat\mu_{s1}$, where $\hat\mu_{s1} = (1/50)\sum_{j=1}^{50} y_{sj}$ and $\hat\mu_{s2} = (1/52)\sum_{j=51}^{102} y_{sj}$. The values of $\hat\tau_1, \hat\tau_2, \ldots, \hat\tau_{6033}$ are displayed (in the form of a histogram) in Figure 7.5.
FIGURE 7.6. A display (in the form of a histogram with intervals of width 0.04 that has been rescaled so that it encloses an area equal to 1 and that has been overlaid with a plot of the pdf of the $N\{-0.127,\, 1/50\}$ distribution) of the relative frequencies (among the 6033 genes) of the various values of the $\log\hat\sigma_s^2$'s.

Assume that $\sigma_{ss'} = 0$ for $s' \ne s = 1, 2, \ldots, S$, as would be the case if the results obtained for each of the $S$ genes are "unrelated" to those obtained for each of the others—this assumption is consistent with Efron's (2010) assumptions about these data. Further, for $s = 1, 2, \ldots, S$, write $\sigma_s^2$ for $\sigma_{ss}$; and take $\hat\sigma_s^2$ to be the unbiased estimator of $\sigma_s^2$ defined as
$\hat\sigma_s^2 = \Big[\sum_{j=1}^{N_1}(y_{sj} - \hat\mu_{s1})^2 + \sum_{j=N_1+1}^{N_1+N_2}(y_{sj} - \hat\mu_{s2})^2\Big]\big/(N_1 + N_2 - 2)$,
where $N_1 = 50$ and $N_2 = 52$.

The results of Lehmann (1986, sec. 7.3), which are asymptotic in nature, suggest that (at least in the case where the joint distribution of $y_{s1}, y_{s2}, \ldots, y_{s,102}$ is MVN) the distribution of $\log\hat\sigma_s^2$ can be approximated by the $N\{\log\sigma_s^2,\, 1/50\ [= 2/(N_1 + N_2 - 2)]\}$ distribution. The values of $\log\hat\sigma_1^2, \log\hat\sigma_2^2, \ldots, \log\hat\sigma_{6033}^2$ are displayed in Figure 7.6 in the form of a histogram that has been rescaled so that it encloses an area equal to 1 and that has been overlaid with a plot of the pdf of the $N\{-0.127\ [= (1/6033)\sum_{s=1}^{6033}\log\hat\sigma_s^2],\, 1/50\}$ distribution. As is readily apparent from Figure 7.6, it would be highly unrealistic to assume that $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_S^2$, that is, to assume that the variability of the expression levels (from one normal subject to another or one cancer patient to another) is the same for all $S$ genes. The inappropriateness of any such assumption is reflected in the results obtained upon applying various of the many procedures proposed for testing for the homogeneity of
variances, including that proposed by Hartley (1950) as well as that proposed by Lehmann (1986, sec. 7.3).

FIGURE 7.7. Plot of the values of the $\hat\sigma_s^2$'s against the values of the corresponding $\hat\mu_s$'s.
Not only does the variance $\sigma_s^2$ (among the normal subjects or the cancer patients) of the expression levels of the $s$th gene appear to depend on $s$, but there appears to be a strong tendency for $\sigma_s^2$ to increase with the mean ($\mu_{s1}$ in the case of the normal subjects and $\mu_{s2}$ in the case of the cancer patients). Let $\hat\mu_s = (\hat\mu_{s1} + \hat\mu_{s2})/2$. Then, the tendency for $\sigma_s^2$ to increase with $\mu_{s1}$ or $\mu_{s2}$ is clearly evident in Figure 7.7, in which the values of the $\hat\sigma_s^2$'s are plotted against the values of the corresponding $\hat\mu_s$'s. In Figure 7.8, the values of the $\hat\sigma_s^2$'s are plotted against the values of the corresponding $\hat\tau_s$'s. This figure suggests that while $\sigma_s^2$ may vary to a rather considerable extent with $\mu_{s1}$ and $\mu_{s2}$ individually and with their average (and while small values of $\sigma_s^2$ may be somewhat more likely when $|\tau_s|$ is small), any tendency for $\sigma_s^2$ to vary with $\tau_s$ is relatively inconsequential.

Consider (in the context of the present application) the quantities $t_1, t_2, \ldots, t_M$ and $t_1^{(0)}, t_2^{(0)}, \ldots, t_M^{(0)}$ defined by equalities (7.3)—in the present application, $M = S$. The various multiple-comparison procedures described and discussed in Subsections a, b, d, and e depend on the data only through the (absolute) values of $t_1^{(0)}, t_2^{(0)}, \ldots, t_M^{(0)}$. And the justification for those procedures is based on their ability to satisfy criteria defined in terms of the distribution of the random vector $t = (t_1, t_2, \ldots, t_M)'$. Moreover, the vector $t$ is expressible in the form (7.5), so that its distribution is determined by the distribution of the random vector $\hat\sigma^{-1}(\hat\alpha - \alpha)$. When the observable random vector $y$ follows the G–M model and when in addition the distribution of the vector $e$ of residual effects is MVN (or, more generally, is any absolutely continuous spherical distribution), the distribution of $\hat\sigma^{-1}(\hat\alpha - \alpha)$ is $MVt(N-P,\, I)$—refer to result (7.6).
FIGURE 7.8. Plot of the values of the $\hat\sigma_s^2$'s against the values of the corresponding $\hat\tau_s$'s.

In the present application, the assumption (inherent in the G–M model) of the homogeneity of the variances of the residual effects appears to be highly unrealistic, and consequently the distribution of the vector $\hat\sigma^{-1}(\hat\alpha - \alpha)$ may differ appreciably from the $MVt(N-P,\, I)$ distribution. Allowance can be made for the heterogeneity of the variances of the residual effects by redefining the $t_i$'s and the $t_i^{(0)}$'s (and by modifying the various multiple-comparison procedures accordingly).

For $s = 1, 2, \ldots, S$, redefine $t_s$ and $t_s^{(0)}$ as
$t_s = \dfrac{\hat\tau_s - \tau_s}{[(1/50) + (1/52)]^{1/2}\,\hat\sigma_s}$ and $t_s^{(0)} = \dfrac{\hat\tau_s}{[(1/50) + (1/52)]^{1/2}\,\hat\sigma_s}$   (7.116)
(where $\hat\sigma_s$ represents the positive square root of $\hat\sigma_s^2$). Further, let $y_{s1} = (y_{s1}, y_{s2}, \ldots, y_{s,50})'$ and $y_{s2} = (y_{s,51}, y_{s,52}, \ldots, y_{s,102})'$, take $L_1$ to be any $50 \times 49$ matrix whose columns form an orthonormal basis for $\mathcal N(\mathbf 1_{50}')$ and $L_2$ to be any $52 \times 51$ matrix whose columns form an orthonormal basis for $\mathcal N(\mathbf 1_{52}')$, and assume that the 6033 101-dimensional vectors
$\big(\,(\hat\tau_s - \tau_s)/[(1/50) + (1/52)]^{1/2},\; (L_1' y_{s1})',\; (L_2' y_{s2})'\,\big)'$   $(s = 1, 2, \ldots, 6033)$
are distributed independently and that each of them has an $N(\mathbf 0, \sigma_s^2 \mathbf I_{101})$ distribution or, more generally, has an absolutely continuous spherical distribution with variance-covariance matrix $\sigma_s^2 \mathbf I_{101}$. And observe that under that assumption, the random variables $t_1, t_2, \ldots, t_{6033}$ are statistically independent and each of them has an $S\,t[100\ (= N_1 + N_2 - 2)]$ distribution—refer to the final part of Section 6.6.4d.
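As a concrete companion to these definitions, the following Python sketch computes, for each row of a genes-by-subjects matrix laid out like the prostate data (first 50 columns controls, last 52 cancer patients), the estimates $\hat\mu_{s1}$, $\hat\mu_{s2}$, $\hat\tau_s$, $\hat\sigma_s^2$, and the redefined statistic $t_s^{(0)}$ of equalities (7.116). The synthetic data and the variable names are assumptions made for the sketch, not part of the example.

```python
import numpy as np

rng = np.random.default_rng(2024)
S, N1, N2 = 6033, 50, 52
# Synthetic stand-in for the prostate expression matrix (S genes x 102 subjects),
# with heterogeneous per-gene standard deviations.
sigma = np.exp(rng.normal(-0.064, np.sqrt(1 / 50), size=S))
y = rng.normal(0.0, sigma[:, None], size=(S, N1 + N2))

mu1_hat = y[:, :N1].mean(axis=1)                  # control-group means
mu2_hat = y[:, N1:].mean(axis=1)                  # cancer-group means
tau_hat = mu2_hat - mu1_hat                       # least squares estimator of tau_s
ss1 = ((y[:, :N1] - mu1_hat[:, None]) ** 2).sum(axis=1)
ss2 = ((y[:, N1:] - mu2_hat[:, None]) ** 2).sum(axis=1)
sigma2_hat = (ss1 + ss2) / (N1 + N2 - 2)          # unbiased pooled variance estimator
t0 = tau_hat / np.sqrt((1 / N1 + 1 / N2) * sigma2_hat)   # statistic (7.116)

print(t0[:5])   # under these null data, the t0's behave like S t(100) draws
```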
FIGURE 7.9. A display (in the form of a histogram with intervals of width 0.143 that has been rescaled so that it encloses an area equal to 1 and that has been overlaid with a plot of the pdf of the $S\,t(100)$ distribution) of the relative frequencies (among the 6033 genes) of the various values of the $t_s^{(0)}$'s.

In what follows, modified versions of the various multiple-comparison procedures [in which $t_s$ and $t_s^{(0)}$ have (for $s = 1, 2, \ldots, 6033$) been redefined by equalities (7.116)] are described and applied. It is worth noting that the resultant procedures do not take advantage of any "relationships" among the $\mu_{s1}$'s, $\mu_{s2}$'s, and $\sigma_s^2$'s of the kind reflected in Figures 7.5, 7.6, 7.7, and (to a lesser extent) 7.8. At least in principle, a more sophisticated model that reflects those relationships could be devised and could serve as a basis for constructing improved procedures—refer, e.g., to Efron (2010). And/or one could seek to transform the data in such a way that the assumptions underlying the various unmodified multiple-comparison procedures (including that of the homogeneity of the residual variances) are applicable—refer, e.g., to Durbin et al. (2002).

The values of the redefined $t_s^{(0)}$'s are displayed in Figure 7.9 in the form of a histogram that has been rescaled so that it encloses an area equal to 1 and that has been overlaid with a plot of the pdf of the $S\,t(100)$ distribution. The genes with the most extreme $t^{(0)}$-values are listed in Table 7.4.

FWER and k-FWER. Let us consider the modification of the multiple-comparison procedures described in Subsections a and b for controlling the FWER or k-FWER. Among the quantities affected by the redefinition of the $t_i$'s (either directly or indirectly) are the following: the permutations $i_1, i_2, \ldots, i_M$ and [for any (nonempty) subset $S$ of $I$] $i_1(S), i_2(S), \ldots, i_{M_S}(S)$; the quantities $t_{(j)} = t_{i_j}$ ($j = 1, 2, \ldots, M$) and $t_{j;S} = t_{i_j(S)}$ ($j = 1, 2, \ldots, M_S$); the upper $100\dot\gamma\%$ point $c_{\dot\gamma}(j)$ of the distribution of $|t_{(j)}|$ and the upper $100\dot\gamma\%$ point $c_{\dot\gamma}(j; S)$ of the distribution of $|t_{j;S}|$; and (for
TABLE 7.4. The 100 most extreme among the values of the $t_s^{(0)}$'s obtained for the 6033 genes represented in the prostate data.

  s    $t_s^{(0)}$      s    $t_s^{(0)}$      s    $t_s^{(0)}$      s    $t_s^{(0)}$      s    $t_s^{(0)}$


610 5:65 4073 3:98 2945 3:65 4515 3:35 2811 3:19
1720 5:11 735 3:87 2856 3:64 637 3:34 478 3:19
364 4:67 3665 3:84 3017 3:61 4496 3:34 1507 3:18
332 4:64 1130 3:83 698 3:59 298 3:32 3313 3:18
914 4:61 1346 3:82 3292 3:59 292 3:31 3585 3:16
3940 4:57 1589 3:81 905 3:59 1659 3:31 2852 3:16
4546 4:54 921 3:80 4396 3:57 1491 3:30 3242 3:14
1068 4:40 4549 3:79 4552 3:56 718 3:30 5159 3:13
579 4:35 739 3:77 3930 3:56 3208 3:29 4378 3:12
4331 4:34 4104 3:75 1588 3:55 1966 3:29 1329 3:12
1089 4:31 4981 3:75 721 3:54 452 3:28 4500 3:07
3647 4:31 1314 3:75 3260 3:49 3879 3:27 4671 3:06
1113 4:22 702 3:74 4154 3:48 3200 3:27 354 3:06
1077 4:13 2897 3:72 805 3:46 1647 3:26 341 3:05
4518 4:10 4000 3:71 4040 3:44 2968 3:25 995 3:04
1557 4:10 2 3:68 11 3:44 4013 3:22 3961 3:04
4088 4:06 3282 3:66 3505 3:44 2912 3:20 3696 3:03
3991 4:05 694 3:66 3269 3:42 684 3:20 1097 3:03
3375 4:01 2370 3:66 4492 3:40 1572 3:19 3343 3:03
4316 3:99 3600 3:65 377 3:38 913 3:19 641 3:03

$j = k, k+1, \ldots, M$) the quantity
$\alpha_j = \max_S\, c_{\dot\gamma}(k; S)$,   (7.117)
where the maximum is over the collection of subsets $S$ of $I$ whose elements include $k-1$ of the integers $\tilde i_1, \tilde i_2, \ldots, \tilde i_{j-1}$ and all $M-j+1$ of the integers $\tilde i_j, \tilde i_{j+1}, \ldots, \tilde i_M$ (and where $\tilde i_1, \tilde i_2, \ldots, \tilde i_M$ is a permutation of the integers $1, 2, \ldots, M$ such that $|\tilde t^{(0)}_{\tilde i_1}| \ge |\tilde t^{(0)}_{\tilde i_2}| \ge \cdots \ge |\tilde t^{(0)}_{\tilde i_M}|$).

In regard to expression (7.117), note that subsequent to the redefinition of the $t_i$'s, $c_{\dot\gamma}(k; S)$ depends on the subset $S$ only through the size $M_S$ of $S$. Moreover, for every subset $S$ in the collection over which the maximum in (7.117) is taken,
$M_S = M - j + 1 + k - 1 = M + k - j$.
Thus, expression (7.117) simplifies to the following expression:
$\alpha_j = c_{\dot\gamma}(k; S)$ for any $S \subset I$ such that $M_S = M + k - j$.   (7.118)
And in lieu of inequalities (7.14) and (7.43), we have (subsequent to the redefinition of the $t_i$'s) the inequalities
$c_{\dot\gamma}(k) \le \bar t_{k\dot\gamma/(2M)}(100)$   (7.119)
and (for $j \ge k$)
$\alpha_j \le \bar t_{k\dot\gamma/[2(M+k-j)]}(100)$.   (7.120)
Based on these results, we can obtain suitably modified versions of the multiple-comparison procedures described in Subsections a and b for controlling the FWER or k-FWER. As a multiple-comparison procedure (for testing $H_1^{(0)}, H_2^{(0)}, \ldots, H_M^{(0)}$) that controls the k-FWER at level $\dot\gamma$, we have the procedure that rejects $H_i^{(0)}$ if and only if $y \in C_i$, where
$C_i = \{y : |t_i^{(0)}| > c_{\dot\gamma}(k)\}$   (7.121)
TABLE 7.5. The number (No.) of rejected null hypotheses (discoveries) obtained (for $k = 1, 2, \ldots, 20$ and for 3 values of $\dot\gamma$) upon applying the multiple-comparison procedure for controlling the k-FWER and the number obtained upon applying its more conservative counterpart, along with the values of $c_{\dot\gamma}(k)$ and $\bar t = \bar t_{k\dot\gamma/(2M)}(100)$ and along with the number (if any) of additional rejections (discoveries) obtained upon applying the step-down versions of those procedures.

            $\dot\gamma = 0.05$                 $\dot\gamma = 0.10$                 $\dot\gamma = 0.20$

  k   $c_{\dot\gamma}(k)$  No.   $\bar t$  No.    $c_{\dot\gamma}(k)$  No.   $\bar t$  No.    $c_{\dot\gamma}(k)$  No.   $\bar t$  No.
1 4:70 2 4:70 2 4:52 7 4:53 7 4:32 10 4:35 9
2 4:20 13 4:53 7 4:09 16 4:35 9 3:97 21 4:16 13
3 3:97 21 4:42 7 3:89 21 4:24 12 3:79 28 4:05 17
4 3:83 23C1 4:35 9 3:76 29 4:16 13 3:68 35C1 3:98 21
5 3:72 33 4:29 12 3:66 36 4:10 14 3:59 43C1 3:91 21
6 3:64 42 4:24 12 3:59 46 4:05 17 3:52 51 3:86 22
7 3:57 46C1 4:20 13 3:52 51 4:01 19 3:46 53 3:82 24C1
8 3:52 51 4:16 13 3:47 53 3:98 21 3:41 58 3:78 28
9 3:47 53 4:13 13 3:42 58 3:94 21 3:37 60 3:75 31
10 3:42 58 4:10 14 3:38 60 3:91 21 3:33 63 3:72 33C1
11 3:38 59C1 4:08 16 3:34 61C1 3:89 21 3:30 67C1 3:69 35
12 3:35 61 4:05 17 3:31 66 3:86 22 3:27 73C1 3:67 36
13 3:32 64 4:03 18 3:28 70C1 3:84 23 3:24 75 3:64 42
14 3:29 70 4:01 19 3:25 74C1 3:82 24C1 3:21 76 3:62 42
15 3:26 74 3:99 19 3:22 75C1 3:80 26 3:18 83C1 3:60 43
16 3:23 75 3:98 21 3:20 78 3:78 28 3:16 84C2 3:58 46
17 3:21 76C1 3:96 21 3:18 84 3:76 29 3:14 86C1 3:56 48
18 3:18 82C2 3:94 21 3:15 86 3:75 31 3:12 89C1 3:55 50
19 3:16 84 3:93 21 3:13 87C1 3:73 33 3:10 90 3:53 51
20 3:14 86 3:91 21 3:11 90 3:72 33C1 3:08 90 3:51 51

(and where the definition of $c_{\dot\gamma}(k)$ is in terms of the distribution of the redefined $t_i$'s). And as a less computationally intensive but more "conservative" variation on this procedure, we have the procedure obtained by replacing $c_{\dot\gamma}(k)$ with the upper bound $\bar t_{k\dot\gamma/(2M)}(100)$, that is, the procedure obtained by taking
$C_i = \{y : |t_i^{(0)}| > \bar t_{k\dot\gamma/(2M)}(100)\}$   (7.122)
rather than taking $C_i$ to be the set (7.121). Further, suitably modified versions of the step-down procedures (described in Subsection b) for controlling the k-FWER at level $\dot\gamma$ are obtained by setting $\alpha_1 = \alpha_2 = \cdots = \alpha_k$ and (for $j = k, k+1, \ldots, M$) taking $\alpha_j$ to be as in equality (7.118) [where the definition of $c_{\dot\gamma}(k; S)$ is in terms of the redefined $t_i$'s] and taking the replacement for $\alpha_j$ (in the more conservative of the step-down procedures) to be $\bar t_{k\dot\gamma/[2(M+k-j)]}(100)$. As before, the computation of $c_{\dot\gamma}(k)$ and $c_{\dot\gamma}(k; S)$ is amenable to the use of Monte Carlo methods.

The modified versions of the various procedures for controlling the k-FWER were applied to the prostate data. Results were obtained for $k = 1, 2, \ldots, 20$ and for three different choices for $\dot\gamma$ ($\dot\gamma = 0.05$, $0.10$, and $0.20$). The requisite values of $c_{\dot\gamma}(k)$ and $c_{\dot\gamma}(k; S)$ were determined by Monte Carlo methods from 149999 draws (from the joint distribution of the redefined $t_i$'s). The number of rejected null hypotheses (discoveries), along with the values of $c_{\dot\gamma}(k)$ and $\bar t_{k\dot\gamma/(2M)}(100)$, is listed (for each of the various combinations of $k$- and $\dot\gamma$-values) in Table 7.5.

The results clearly indicate that (at least in this kind of application and at least for $k > 1$) the adoption of the more conservative versions of the procedures for controlling the k-FWER can result in a drastic reduction in the total number of rejections (discoveries). Also, while the total number of rejections (discoveries) increases with the value of $k$, it appears to do so at a mostly decreasing rate.
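The Monte Carlo determination of $c_{\dot\gamma}(k)$ alluded to above can be sketched as follows in Python: under the independence adopted for the redefined $t_i$'s, each draw consists of $M$ independent $S\,t(100)$ variates, and $c_{\dot\gamma}(k)$ is estimated by the empirical upper $100\dot\gamma\%$ point of the $k$th largest absolute value. The number of draws, the function name, and the use of numpy's t generator are assumptions of the sketch.

```python
import numpy as np

def c_hat(k, gamma, M=6033, df=100, n_draws=149_999, seed=7):
    """Monte Carlo estimate of the upper 100*gamma% point of the k-th largest
    of |t_1|, ..., |t_M| for independent S t(df) random variables."""
    rng = np.random.default_rng(seed)
    kth_largest = np.empty(n_draws)
    for r in range(n_draws):
        abs_t = np.abs(rng.standard_t(df, size=M))
        # np.partition puts the (M-k)-th order statistic in slot M-k,
        # which is exactly the k-th largest absolute value.
        kth_largest[r] = np.partition(abs_t, M - k)[M - k]
    return np.quantile(kth_largest, 1.0 - gamma)

# Small toy call (reduced M and n_draws so that it runs quickly).
print(c_hat(k=1, gamma=0.05, M=50, n_draws=5_000))
```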
Controlling the probability of the FDP exceeding a specified level. Consider the modification of
the step-down procedure described in Subsection d for controlling Pr.FDP > / (where  is a
TABLE 7.6. The number of rejected null hypotheses (discoveries) obtained (for each of 4 values of  and
each of 3 values of ) upon applying the step-down multiple-comparison procedure for controlling
Pr.FDP > / at level  (along with the value of P ) when (for j D 1; 2; : : : ; M ) ˛j is taken to be
of the form (7.123) and when P is set equal to P and the value of P taken to be an approximate
value determined by Monte Carlo methods.

 D :05  D :10  D :20


Number of Number of Number of
 P rejections P rejections P rejections
:05 :04 2 :05 2 :05 2
:10 :05 2 :09 6 :10 7
:20 :05 2 :10 8 :19 21
:50 :05 13 :10 22 :20 60

specified constant) at a specified level . And for purposes of doing so, continue to take ts and ts.0/
(for s D 1; 2; : : : ; S) to be as redefined by equalities (7.116). By employing essentially the same
reasoning as before (i.e., in Subsection d), it can be shown that the step-down procedure controls
Pr.FDP > / at level  if the ˛j ’s satisfy condition (7.62), where now the random variable trIS is
(for any S  I including S D T ) as follows its modification to reflect the redefinition of the ts ’s.
Further, the same line of reasoning that led before to taking ˛j to be of the form (7.65) leads now to
taking ˛j to be of the form
˛j D tN.ŒjC1/ P =f2.M CŒjC1 j /g .100/ (7.123)
(j D 1; 2; : : : ; M ), following which the problem of choosing the ˛j ’s so as to satisfy condition
(7.62) is reduced to that of choosing P . And to insure that ˛j ’s of the form (7.123) satisfy condition
(7.62) and hence that the step-down procedure controls Pr.FDP > / at level , it suffices to take
P D P , where P is the solution to the equation cP . P / D 0 and where cP ./ is a modified (to account
for the redefinition of the ts ’s) version of the function cP ./ defined in Subsection d.
The step-down procedure [in which (for j D 1; 2; : : : ; M ) ˛j was taken to be of the form (7.123)
and P was set equal to P ] was applied to the prostate data. Results were obtained for four different
values of  ( D :05; :10; :20; and :50) and three different values of  ( D :05; :10; and :20).
The value of P was taken to be that of the approximation provided by the solution to the equation
cR . P / D 0, where cR ./ is a function whose values are approximations to the values of cP ./ determined
by Monte Carlo methods from 149999 draws. For each of the various combinations of - and -
values, the number of rejected null hypotheses (discoveries) is reported in Table 7.6 along with the
value of P . It is worth noting that the potential improvement that comes from resetting various of the
˛j ’s [so as to achieve conformance with condition (7.74)] did not result in any additional rejections
in any of these cases.
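For contrast with the step-up rule sketched earlier, here is a minimal Python sketch of a generic step-down procedure of the kind applied throughout this subsection: the ordered $|t^{(0)}|$-values are compared with a nonincreasing sequence of cutoffs $\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_M$, and rejection proceeds from the most significant statistic downward, stopping at the first failure. The cutoffs passed in are left unspecified here; they would be the $\alpha_j$'s of whichever procedure is being applied.

```python
import numpy as np

def step_down_reject(t0, alphas):
    """Generic step-down rule: with the |t0|'s ordered from largest to smallest,
    reject the first J hypotheses, where J is the largest j such that the m-th
    largest |t0| exceeds alphas[m-1] for every m <= j (J = 0 if the first
    comparison already fails)."""
    t0 = np.asarray(t0, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    order = np.argsort(-np.abs(t0))            # indices, most significant first
    exceed = np.abs(t0[order]) > alphas        # compare ordered |t|'s with cutoffs
    # np.argmin returns the position of the first False, i.e. the value of J.
    J = int(np.argmin(exceed)) if not exceed.all() else t0.size
    reject = np.zeros(t0.size, dtype=bool)
    reject[order[:J]] = True
    return reject
```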
.0/
An improvement. In the present setting [where (for s D 1; 2; : : : ; S) ts and ts have been redefined as
.0/
in equalities (7.116)], the ts ’s are statistically independent and the ts ’s are statistically independent.
By taking advantage of the statistical independence, we can attempt to devise improved versions of
the step-down procedure for controlling Pr.FDP > /. In that regard, recall (from Subsection d) that
Pr.j 0  1; k 0  MT ; and jtk 0 IT j > ˛j 0 /   ) Pr.FDP > /   (7.124)
0 0 0
and that k D Œj  C1 for values of j  1. And observe that
Pr.j 0  1; k 0  MT ; and jtk 0 IT j > ˛j 0 /
D Pr.j 0  1 and k 0  MT / Pr.jtk 0 IT j > ˛j 0 j j 0  1 and k 0  MT /
 Pr.jtk 0 IT j > ˛j 0 j j 0  1 and k 0  MT /
Pr.jtk 0 IT j > ˛j 0 j j 0 D j 0 / Pr.j 0 D j 0 j j 0  1 and k 0  MT /; (7.125)
P
D
TABLE 7.7. The number of rejected null hypotheses (discoveries) obtained (for each of 4 values of  and
each of 3 values of ) upon applying the step-down multiple-comparison procedure for controlling
Pr.FDP > / at level  when (for j D 1; 2; : : : ; M ) ˛j is as specified by equality (7.129) or,
alternatively, as specified by equality (7.130).

˛j D c .kj ; S/ where MS D M C kj j ˛j D tNkj =Œ2.MCkj j / .100/

  D :05  D :10  D :20  D :05  D :10  D :20


:05 2 7 10 2 7 9
:10 2 7 28 2 7 13
:20 2 78 90 2 12 21
:50 230 248 262 13 22 60

where the summation is with respect to j 0 and is over the set fj 0 W j 0  1 and Œj 0  C1  MT g.
For s 2 T, ts D ts.0/ ; and (by definition) the values of the random variables j 0 and k 0 are
completely determined by the values of the MF random variables ts.0/ (s 2 F ). Thus, in the present
.0/
setting (where the ts ’s are statistically independent) we find that (for j 0 such that j 0  1 and
0
Œj  C1  MT )
Pr.jtk 0 IT j > ˛j 0 j j 0 D j 0 / D Pr.jtk 0 IT j > ˛j 0 /; (7.126)
where k 0 D Œj 0  C1. Moreover,
Pr.jtk 0 IT j > c .k 0 I S /   (7.127)
for any S  I such that T  S and hence (when as in the present setting, the ts ’s are statistically
independent) for any S  I such that MS  MT .
Now, recall [from result (7.53)] that
M C k0 j 0  MT : (7.128)
And observe that, together, results (7.125), (7.126), (7.127), and (7.128) imply that the condition
Pr.j 0  1; k 0  MT ; and jtk 0 IT j > ˛j 0 /  
and hence [in light of result (7.124)] the condition Pr.FDP > /   can be satisfied by taking (for
j D 1; 2; : : : ; M )
˛j D c .kj I S / for any S  I such that MS D M Ckj j ; (7.129)
where kj D ŒjC1.
Resort to Monte Carlo methods may be needed to effect the numerical evaluation of expression
(7.129). A more conservative but less computationally intensive version of the step-down procedure
for controlling Pr.FDP > / at level  can be effected [on the basis of inequality (7.120)] by taking
(for j D 1; 2; : : : ; M )
˛j D tNkj =Œ2.M Ckj j / .100/: (7.130)
The number of rejections (discoveries) resulting from the application of the step-down procedure
(to the prostate data) was determined for the case where (for j D 1; 2; : : : ; M ) ˛j is as specified
by equality (7.129) and also for the case where ˛j is as specified by equality (7.130). The values of
the c .kj I S /’s (required in the first of the 2 cases) were taken to be approximate values determined
by Monte Carlo methods from 149999 draws. The number of rejections was determined for each of
four values of  ( D :05; :10; :20; and :50) and for each of three values of  ( D :05; :10; and :20).
The results are presented in Table 7.7.
Both the results presented in Table 7.6 and those presented in the right half of Table 7.7 are for
cases where (for j D 1; 2; : : : ; M ) ˛j is of the form (7.123). The latter results are those obtained
when P D , and the former those obtained when P D P — P depends on  as well as on . For
P < , the number of rejections (discoveries) obtained when P D  is at least as great as the number
obtained when P D P . In the application to the prostate data, there are three combinations of - and
-values for which  exceeds P by a substantial amount (that where  D :05 and  D :10, that where
 D :05 and  D :20, and that where  D :10 and  D :20). When  D :05 and  D :20, P D :05 and
setting P D  rather than P D P results in an additional 7 rejections (discoveries); similarly, when
 D :10 and  D :20, P D :10 and setting P D  rather than P D P results in an additional 6 rejections.
Based on the results presented in Table 7.7, it appears that the difference in the number of
rejections produced by the step-down procedure in the case where (for j D 1; 2; : : : ; M ) ˛j is as
specified by equality (7.129) and the number produced in the case where ˛j is as specified by equality
(7.130) can be either negligible or extremely large. This difference tends to be larger for the larger
values of  and also tends to be larger for the larger values of .
Controlling the FDR. Consider the multiple-comparison procedure obtained upon modifying the step-up procedure described in Subsection e (for controlling the FDR) so as to reflect the redefinition of the $t_j$'s and $t_j^{(0)}$'s. The requisite modifications include that of taking the distribution of the random variable $t$ in definition (7.80) (of $\alpha_j'$) to be $S\,t(100)$ rather than $S\,t(N-P)$. As a consequence of this modification, taking $\alpha_j'$ to be of the form (7.81) leads to taking $\alpha_j$ to be of the form
$\alpha_j = \bar t_{j\dot\gamma/(2M)}(100)$,   (7.131)
rather than of the form (7.82).

The FDR of the modified step-up procedure is given by expression (7.91) (suitably reinterpreted to reflect the redefinition of the $t_j$'s and $t_j^{(0)}$'s). Moreover, subsequent to the redefinition of the $t_j$'s and $t_j^{(0)}$'s, the $t_j^{(0)}$'s are statistically independent and hence (since $t_j^{(0)} = t_j$ for $j \in T$)
$\Pr[\,t^{(-i)} \in A_{k;-i} \mid |t_i| > \alpha_k\,] = \Pr[\,t^{(-i)} \in A_{k;-i}\,]$.
And upon recalling that the sets $A_{1;-i}, A_{2;-i}, \ldots, A_{M;-i}$ are mutually disjoint and that their union equals $\mathbb R^{M-1}$ and upon proceeding in the same way as in the derivation of result (7.103), we find that when (for $j = 1, 2, \ldots, M$) $\alpha_j'$ is taken to be of the form (7.81) [in which case $\alpha_j$ is of the form (7.131)], the revised step-up procedure is such that
$\mathrm{FDR} = (M_T/M)\,\dot\gamma \le \dot\gamma$.
Thus, when $\alpha_j'$ is taken to be of the form (7.81) and when $\dot\gamma$ is set equal to $\delta$, the revised step-up procedure controls the FDR at level $\delta$.

When $\alpha_j'$ is taken to be of the form (7.81), the number of rejections (discoveries) obtained upon applying the modified step-up procedure to the prostate data is (for each of 3 values of $\dot\gamma$) as follows:

    $\dot\gamma = 0.05$: 21        $\dot\gamma = 0.10$: 59        $\dot\gamma = 0.20$: 105
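To show how the cutoffs (7.131) would be used in practice, the following hedged Python sketch computes $\alpha_j = \bar t_{j\dot\gamma/(2M)}(100)$ from the Student-$t$ quantile function and applies the step-up rule directly to the $|t_s^{(0)}|$'s. The availability of scipy is assumed, and the input array is whatever vector of redefined $t^{(0)}$-statistics one has in hand.

```python
import numpy as np
from scipy.stats import t as student_t

def fdr_step_up(t0, gamma, df=100):
    """Step-up procedure with alpha_j equal to the upper 100*(j*gamma/(2M))%
    point of S t(df), applied to the absolute values of the statistics in t0."""
    abs_t = np.abs(np.asarray(t0, dtype=float))
    M = abs_t.size
    j = np.arange(1, M + 1)
    alphas = student_t.ppf(1.0 - j * gamma / (2.0 * M), df)   # nonincreasing cutoffs
    order = np.argsort(-abs_t)                                 # largest |t| first
    exceed = abs_t[order] > alphas
    J = 0 if not exceed.any() else int(np.max(np.nonzero(exceed)[0])) + 1
    reject = np.zeros(M, dtype=bool)
    reject[order[:J]] = True
    return reject
```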
It is of interest to compare these results with the results presented in the preceding part of the
present subsection, which are those obtained from the application of the step-down procedure for
controlling Pr.FDP > / at level . In that regard, it can be shown that (for j D 1; 2; : : : ; M and
0 <  < 1)
tNj P =.2M / .100/  tNkj P =Œ2.M Ckj j / .100/; (7.132)
where kj D ŒjC1—refer to Exercise 32. Thus, the application of a step-up procedure in which
(for j D 1; 2; : : : ; M ) ˛j is of the form (7.131) and P D  produces as least as many rejections
(discoveries) as the application of a step-down procedure in which ˛j is as specified by equality
(7.130)—even if the ˛j ’s were the same in both cases, the application of the step-up procedure
would result in at least as many rejections as the application of the step-down procedure.
In the application of these two stepwise procedures to the prostate data, the step-up procedure
produced substantially more rejections than the step-down procedure and did so even for  as large
as :50—refer to the entries in the right half of Table 7.7. It is worth noting that (in the case of the
prostate data) the step-up procedure produced no more rejections than would have been produced

by a step-down procedure with the same ˛j ’s, so that (in this case) the difference in the number of
rejections produced by the two stepwise procedures is due entirely to the differences between the
˛j ’s.
The number of rejections (discoveries) produced by the step-up procedure [that in which ˛j is
of the form (7.131) and P D ] may or may not exceed the number produced by the step-down
procedure in which ˛j is as specified by equality (7.129) rather than by equality (7.130). When (in
the application to the prostate data)  D :50 or when  D :10 and  D :20, this step-down procedure
[which, like that in which ˛j is specified by equality (7.130), controls Pr.FDP > / at level ]
produced substantially more rejections than the step-up procedure (which controls the FDR at level
)—refer to the entries in the left half of Table 7.7.

7.8 Prediction
The prediction of the realization of an unobservable random variable or vector by a single observable
“point” was discussed in Section 5.10, both in general terms and in the special case where the relevant
information is the information provided by an observable random vector that follows a G–M, Aitken
or general linear model. Let us extend the discussion of prediction to include predictions that take
the form of intervals or sets and to include tests of hypotheses about the realizations of unobservable
random variables.

a. Some general results


Let $y$ represent an $N \times 1$ observable random vector. And as in Section 5.10a, consider the use of $y$ in making inferences about the realization of an unobservable random variable or, more generally, the realization of an unobservable random vector, say the realization of an $M \times 1$ unobservable random vector $w = (w_1, w_2, \ldots, w_M)'$. Further, assume the existence of the first- and second-order moments of the joint distribution of $w$ and $y$; define $\mu_y = \mathrm E(y)$, $\mu_w = \mathrm E(w)$, $V_y = \mathrm{var}(y)$, $V_{yw} = \mathrm{cov}(y, w)$, and $V_w = \mathrm{var}(w)$; and assume that $\mathrm{rank}\begin{pmatrix} V_y & V_{yw} \\ V_{yw}' & V_w \end{pmatrix} = N + M$. In considering the special case $M = 1$, let us write $w$, $\mu_w$, $v_{yw}$, and $v_w$ for $w$, $\mu_w$, $V_{yw}$, and $V_w$, respectively.

Prediction intervals and sets. Let $S(y)$ represent a set of $w$-values that varies with the value of $y$ and that could potentially be used to predict the realization of $w$. And consider each of the following two conditions:
$\Pr[w \in S(y) \mid y] = 1 - \dot\gamma$ (for "every" value of $y$)   (8.1)
and
$\Pr[w \in S(y)] = 1 - \dot\gamma$.   (8.2)
Clearly, condition (8.1) implies condition (8.2), but (in general) condition (8.2) does not imply condition (8.1).

Now, suppose that the joint distribution of $w$ and $y$ is known or, more generally, that (for "every" value of $y$) the conditional distribution of $w$ given $y$ is known. Then, among the choices for the prediction set $S(y)$ are sets that reflect the characteristics of the conditional distribution of $w$ given $y$ and that satisfy condition (8.1).

In a Bayesian framework, the probability $\Pr[w \in S(y) \mid y]$ would be referred to as a posterior probability; and in the case of a choice for $S(y)$ that satisfies condition (8.1), $S(y)$ would be referred to as a $100(1-\dot\gamma)$ percent credible set. Any choice for $S(y)$ that satisfies condition (8.1) is such that in repeated sampling from the conditional distribution of $w$ given $y$ or from the joint distribution of $w$ and $y$, the value of $w$ would be included in $S(y)$ $100(1-\dot\gamma)$ percent of the time.

Alternatively, suppose that what is known about the joint distribution of $w$ and $y$ is more limited.
Specifically, suppose that (while insufficient to determine the conditional distribution of $w$ given $y$) knowledge about the joint distribution is sufficient to form an unbiased (point) predictor $\tilde w(y)$ (of the realization of $w$) with a prediction error, say $e = \tilde w(y) - w$, whose (unconditional) distribution is known—clearly, $\mathrm E(e) = 0$. For example, depending on the state of knowledge, $\tilde w(y)$ might be
$\tilde w(y) = \mathrm E(w \mid y)$,   (8.3)
$\tilde w(y) = \nu + V_{yw}' V_y^{-1} y$,   (8.4)
where $\nu = \mu_w - V_{yw}' V_y^{-1}\mu_y$, or
$\tilde w(y) = \tilde\nu(y) + V_{yw}' V_y^{-1} y$,   (8.5)
where $\tilde\nu(y)$ is a function of $y$ with an expected value of $\nu$—refer to Section 5.10. Then, the knowledge about the joint distribution of $w$ and $y$ may be insufficient to obtain a prediction set that satisfies condition (8.1). However, a prediction set $S(y)$ that satisfies condition (8.2) can be obtained by taking
$S(y) = \{\,w : w = \tilde w(y) - e,\ e \in S^*\,\}$,   (8.6)
where $S^*$ is any set of $M \times 1$ vectors for which
$\Pr[\tilde w(y) - w \in S^*] = 1 - \dot\gamma$.   (8.7)
Let us refer to any prediction set that satisfies condition (8.2) as a $100(1-\dot\gamma)\%$ prediction set. In repeated sampling from the joint distribution of $w$ and $y$, the value of $w$ would be included in a $100(1-\dot\gamma)\%$ prediction set $100(1-\dot\gamma)$ percent of the time.
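As a small numerical illustration of the predictor (8.4) and its error variance, the following Python sketch computes $\tilde w(y) = \nu + V_{yw}'V_y^{-1}y$ and $\mathrm{var}(e) = V_w - V_{yw}'V_y^{-1}V_{yw}$ for a made-up mean vector and variance-covariance matrix; all numerical values here are invented purely for illustration.

```python
import numpy as np

# Invented moments for an example with N = 3 observables and M = 2 unobservables.
mu_y = np.array([1.0, 2.0, 0.5])
mu_w = np.array([0.0, 1.0])
V_y = np.array([[2.0, 0.5, 0.2],
                [0.5, 1.5, 0.3],
                [0.2, 0.3, 1.0]])
V_yw = np.array([[0.4, 0.1],
                 [0.2, 0.3],
                 [0.1, 0.2]])          # cov(y, w), an N x M matrix
V_w = np.array([[1.0, 0.2],
                [0.2, 0.8]])

A = np.linalg.solve(V_y, V_yw)          # V_y^{-1} V_yw
nu = mu_w - A.T @ mu_y                  # nu = mu_w - V_yw' V_y^{-1} mu_y

def w_tilde(y):
    return nu + A.T @ y                 # predictor (8.4)

var_e = V_w - A.T @ V_yw                # prediction-error variance-covariance matrix

y_obs = np.array([1.2, 2.5, 0.1])
print(w_tilde(y_obs), np.diag(var_e))
```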
HPD prediction sets. Suppose that (for "every" value of $y$) the conditional distribution of $w$ given $y$ is known and is absolutely continuous with pdf $f(\cdot \mid y)$. Among the various choices for a set $S(y)$ that satisfies condition (8.1) is a set of the form
$S(y) = \{w : f(w \mid y) \ge k\}$,   (8.8)
where $k$ is a (strictly) positive constant. In a Bayesian framework, a set of the form (8.8) that satisfies condition (8.1) would be referred to as a $100(1-\dot\gamma)\%$ HPD (highest posterior density) credible set. In the present setting, let us refer to such a set as a $100(1-\dot\gamma)\%$ HPD prediction set.

A $100(1-\dot\gamma)\%$ HPD prediction set has the following property: Among all choices for the set $S(y)$ that satisfy the condition
$\Pr[w \in S(y) \mid y] \ge 1 - \dot\gamma$,   (8.9)
it is the smallest, that is, it minimizes the quantity
$\int_{S(y)} dw$.   (8.10)
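For a one-dimensional conditional density, an HPD prediction set of the form (8.8) can be approximated numerically by thresholding the density on a grid and retaining the highest-density grid points until the enclosed probability reaches $1-\dot\gamma$. The following Python sketch does this for an arbitrary density supplied as a function; the grid limits, the resolution, and the use of scipy in the example call are arbitrary choices made for the sketch.

```python
import numpy as np

def hpd_set_1d(pdf, lo, hi, coverage=0.95, n_grid=20_001):
    """Grid approximation to {w : pdf(w) >= k}, with the threshold k chosen so
    that the set has probability approximately equal to `coverage`."""
    w = np.linspace(lo, hi, n_grid)
    dens = pdf(w)
    dw = w[1] - w[0]
    order = np.argsort(-dens)                   # grid points, highest density first
    cum_prob = np.cumsum(dens[order]) * dw      # accumulated probability
    n_keep = int(np.searchsorted(cum_prob, coverage)) + 1
    members = np.zeros(n_grid, dtype=bool)
    members[order[:n_keep]] = True
    return w[members]                           # grid points belonging to the HPD set

# Example with a normal conditional density (mean 1.3, sd 0.8).
from scipy.stats import norm
pts = hpd_set_1d(lambda w: norm.pdf(w, loc=1.3, scale=0.8), -3, 6, coverage=0.95)
print(pts.min(), pts.max())   # approximately 1.3 plus or minus 1.96 * 0.8
```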
Let us verify that a 100.1 P /% HPD prediction set has this property. For purposes of doing so,
take ./ to be a function defined (for M 1 vectors) as follows:
(
1; if w 2 S.y/,
.w/ D
0; if w … S.y/.

And observe that the quantity (8.10) is reexpressible as


R R
S.y/ d w D RM .w/ d w

and that condition (8.9) is reexpressible in the form


R
RM .w/f .w j y/ d w  1 P:
Further, let S  .y/ represent a 100.1 P /% HPD prediction set, and define
(
 1; if w 2 S  .y/,
 .w/ D
0; if w … S  .y/.
Now, define sets S C .y/ and S .y/ as follows:


S C .y/ D fw W   .w/ .w/ > 0g
and
S .y/ D fw W   .w/ .w/ < 0g:
And observe that if w 2 S .y/, then   .w/ D 1 and hence f .wj y/ k  0; and similarly if
C

w 2 S .y/, then   .w/ D 0 and hence f .wj y/ k < 0. Thus,


 
R R
RM Œ .w/ .w/Œf .wj y/ k d w D S C .y/[S .y/ Œ .w/ .w/Œf .wj y/ k d w  0;
so that R
k RM Œ  .w/ .w/ d w  RM Œ  .w/ .w/f .wj y/ d w  1
R
P .1 P / D 0;
in which case
  .w/ d w  RM .w/ d w
R R
RM
or, equivalently, R R
S  .y/ d w  S.y/ d w
(as was to be verified).
100.1 P /% prediction sets of the form (8.6): minimum size. Suppose that knowledge about the joint
distribution of w and y is sufficient to form an unbiased (point) predictor w.y/
Q (of the realization of
w) with a prediction error e D w.y/
Q w whose distribution is known. Then, as discussed earlier (in
Part 1 of the present subsection), any set S.y/ of the form (8.6) is a 100.1 P /% prediction set.
How do the various 100.1 P /% prediction sets of the form (8.6) compare with each other? And
how do they compare with the additional prediction sets that would result if condition (8.7) were
replaced by the condition
PrŒw.y/
Q w 2 S   1 P (8.11)
—the additional prediction sets are ones whose (unconditional) probability of coverage exceeds 1 P .
Specifically, how do they compare with respect to size.
Clearly, for any of the prediction sets of the form (8.6) and for any of the additional prediction
sets that would result if condition (8.7) were replaced by condition (8.11), the size of the prediction
set is the same as the size of the set S . Thus, for purposes of comparing any of these prediction sets
with respect to size, it suffices to compare (with respect to size) the choices for the set S  that gave
rise to the prediction sets.
Now, suppose that the (unconditional) distribution of the vector e is absolutely continuous, say
absolutely continuous with pdf g./. Then, among the choices for the set S  that satisfy condition
(8.7) is a choice of the form
S  D fe W g.e/ > kg (8.12)
(where 0 < k < 1). Among all choices for S  that satisfy condition (8.11), this choice is of minimum
size—by definition, the size of a set S of e-values is S d e. That this choice is of minimum size can
R
be verified by proceeding in much the same fashion as in the preceding part of the present subsection
(in verifying that HPD prediction sets are of minimum size).
A special case: multivariate normality. If the joint distribution of $w$ and $y$ were MVN, the conditional distribution of $w$ given $y$ would be the $N(\nu + V_{yw}'V_y^{-1}y,\; V_w - V_{yw}'V_y^{-1}V_{yw})$ distribution (where $\nu = \mu_w - V_{yw}'V_y^{-1}\mu_y$). Accordingly, suppose that the joint distribution of $w$ and $y$ is MVN or of some other form for which the conditional distribution of $w$ given $y$ is the $N(\nu + V_{yw}'V_y^{-1}y,\; V_w - V_{yw}'V_y^{-1}V_{yw})$ distribution. Suppose further that the values of the vector $\nu$ and of the matrices $V_{yw}'V_y^{-1}$ and $V_w - V_{yw}'V_y^{-1}V_{yw}$ are known, in which case the conditional distribution of $w$ given $y$ would be known. Then, a $100(1-\dot\gamma)\%$ HPD prediction set (for the realization of $w$) would be
$\{w : [w - \mu(y)]'(V_w - V_{yw}'V_y^{-1}V_{yw})^{-1}[w - \mu(y)] \le \bar\chi^2_{\dot\gamma}(M)\}$,   (8.13)
where $\mu(y) = \nu + V_{yw}'V_y^{-1}y$ and where $\bar\chi^2_{\dot\gamma}(M)$ is the upper $100\dot\gamma\%$ point of the $\chi^2(M)$ distribution [as can be readily verified by making use of formula (3.5.32) for the pdf of an MVN distribution and of result (6.6.14) on the distribution of quadratic forms].

In the special case where $M = 1$, the $100(1-\dot\gamma)\%$ HPD prediction set is reexpressible as the interval
$\{w : \mu(y) - \bar z_{\dot\gamma/2}(v_w - v_{yw}'V_y^{-1}v_{yw})^{1/2} \le w \le \mu(y) + \bar z_{\dot\gamma/2}(v_w - v_{yw}'V_y^{-1}v_{yw})^{1/2}\}$,   (8.14)
where (in this special case) $\mu(y) = \nu + v_{yw}'V_y^{-1}y$ (with $\nu = \mu_w - v_{yw}'V_y^{-1}\mu_y$) and where (for $0 < \alpha < 1$) $\bar z_\alpha$ represents the upper $100\alpha\%$ point of the $N(0,1)$ distribution. The prediction interval (8.14) satisfies condition (8.1); among all prediction intervals that satisfy condition (8.1), it is the shortest. Among the other prediction intervals (in the special case where $M = 1$) that satisfy condition (8.1) are
$\{w : -\infty < w \le \mu(y) + \bar z_{\dot\gamma}(v_w - v_{yw}'V_y^{-1}v_{yw})^{1/2}\}$   (8.15)
and
$\{w : \mu(y) - \bar z_{\dot\gamma}(v_w - v_{yw}'V_y^{-1}v_{yw})^{1/2} \le w < \infty\}$.   (8.16)
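Continuing the illustrative example used after (8.7), the following hedged Python sketch evaluates the $M = 1$ interval (8.14): it computes $\mu(y)$ and the conditional standard deviation and forms $\mu(y) \pm \bar z_{\dot\gamma/2}\,\mathrm{sd}$. Scipy's normal quantile function is assumed available, and the numerical inputs are again invented.

```python
import numpy as np
from scipy.stats import norm

# Invented moments for a single unobservable w (M = 1) and N = 3 observations.
mu_y, mu_w = np.array([1.0, 2.0, 0.5]), 0.7
V_y = np.array([[2.0, 0.5, 0.2],
                [0.5, 1.5, 0.3],
                [0.2, 0.3, 1.0]])
v_yw = np.array([0.4, 0.2, 0.1])        # cov(y, w)
v_w = 1.1                               # var(w)
gamma = 0.05                            # 1 - gamma = 0.95 coverage

a = np.linalg.solve(V_y, v_yw)          # V_y^{-1} v_yw
nu = mu_w - a @ mu_y
cond_sd = np.sqrt(v_w - a @ v_yw)       # sd of the conditional (error) distribution

y_obs = np.array([1.2, 2.5, 0.1])
center = nu + a @ y_obs                 # mu(y)
half_width = norm.ppf(1 - gamma / 2) * cond_sd
print(center - half_width, center + half_width)   # interval (8.14)
```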

Alternatively, suppose that knowledge about the joint distribution of w and y is limited to that
0 0
obtained from knowing the values of the vector  D w Vyw Vy 1 y and the matrix Vyw Vy 1 and
0
from knowing the distribution of the random vector e D w.y/Q w, where w.y/
Q D  C Vyw Vy 1 y.
And observe that regardless of the form of the distribution of e, it is the case that
0
cov.y; e/ D 0 and var.e/ D Vw Vyw Vy 1 Vyw
[and that E.e/ D 0]. Further, suppose that the (unconditional) distribution of e is MVN, which (in
0
light of the supposition that the distribution of e is known) implies that the matrix Vw Vyw Vy 1 Vyw
is known.
If the joint distribution of w and y were MVN, then e would be distributed independently of y
[and hence independently of w.y/],
Q in which case the conditional distribution of w.y/
Q e given y
[or given w.y/]
Q would be identical to the conditional distribution of w given y—both conditional
0
distributions would be N Œw.y/;
Q Vw Vyw Vy 1 Vyw —and any prediction set of the form (8.6) would
satisfy condition (8.1) as well as condition (8.2). More generally, a prediction set of the form (8.6)
would satisfy condition (8.2) [and hence would qualify as a 100.1 P /% prediction set], but would
not necessarily satisfy condition (8.1). In particular, the prediction set (8.13) and (in the special case
where M D 1) prediction intervals (8.14), (8.15), and (8.16) would satisfy condition (8.2), however
aside from special cases (like that where the joint distribution of w and y is MVN) they would not
necessarily satisfy condition (8.1).
Simultaneous prediction intervals. Let us consider further the case where, conditionally on $y$, the distribution of the vector $w = (w_1, w_2, \ldots, w_M)'$ is $N[\mu(y),\; V_w - V_{yw}'V_y^{-1}V_{yw}]$, with $\mu(y) = \nu + V_{yw}'V_y^{-1}y$ and with $\nu = \mu_w - V_{yw}'V_y^{-1}\mu_y$, and where the value of $\nu$ and of the matrices $V_{yw}'V_y^{-1}$ and $V_w - V_{yw}'V_y^{-1}V_{yw}$ are known. Interest in this case might include the prediction of the realizations of some or all of the random variables $w_1, w_2, \ldots, w_M$ or, more generally, the realizations of some or all linear combinations of these random variables.

Predictive inference for the realization of any particular linear combination of the random variables $w_1, w_2, \ldots, w_M$, say for that of the linear combination $\delta'w$ (where $\delta \ne 0$), might take the form of one of the following intervals:
$\{w : \delta'\mu(y) - \bar z_{\dot\gamma/2}[\delta'(V_w - V_{yw}'V_y^{-1}V_{yw})\delta]^{1/2} \le w \le \delta'\mu(y) + \bar z_{\dot\gamma/2}[\delta'(V_w - V_{yw}'V_y^{-1}V_{yw})\delta]^{1/2}\}$,   (8.17)
$\{w : -\infty < w \le \delta'\mu(y) + \bar z_{\dot\gamma}[\delta'(V_w - V_{yw}'V_y^{-1}V_{yw})\delta]^{1/2}\}$,   (8.18)
or
$\{w : \delta'\mu(y) - \bar z_{\dot\gamma}[\delta'(V_w - V_{yw}'V_y^{-1}V_{yw})\delta]^{1/2} \le w < \infty\}$.   (8.19)
In an application to a single linear combination or in an application to any one of a number of linear combinations that is assessed independently of the application to any of the others, the probability of coverage (both conditionally on $y$ and unconditionally) of interval (8.17), (8.18), or (8.19) equals
$1 - \dot\gamma$. However, when prediction intervals are obtained for each of a number of linear combinations, it is often the case that those linear combinations identified with the "most extreme" intervals receive the most attention. "One-at-a-time" prediction intervals like intervals (8.17), (8.18), and (8.19) do not account for any such identification and, as a consequence, their application can sometimes lead to erroneous conclusions. In the case of intervals (8.17), (8.18), and (8.19), this potential pitfall can be avoided by introducing modifications that convert these one-at-a-time intervals into prediction intervals that provide for control of the probability of simultaneous coverage.
Accordingly, let  represent a finite or infinite collection of (nonnull) M -dimensional column
vectors. And suppose that we wish to obtain a prediction interval for the realization of each of the
linear combinations ı 0w (ı 2 ). Suppose further that we wish for the intervals to be such that the
probability of simultaneous coverage equals 1 P. Such intervals can be obtained by taking (for each
ı 2 ) the interval for the realization of ı 0w to be a modified version of interval (8.17), (8.18), or
(8.19); the requisite modification consists of introducing a suitable replacement for zN P =2 or zN P .
0
Let R represent an M  M nonsingular matrix such that Vw Vyw Vy 1 Vyw D R0 R—upon
0
observing [in light of Corollary 2.13.33 and result (2.5.5)] that Vw Vyw Vy 1 Vyw is a symmetric
positive definite matrix, the existence of the matrix R follows from Corollary 2.13.29. Further, let
z D .R 1 /0 Œ.y/ w, so that z  N.0; I/ (both conditionally on y and unconditionally) and (for
every nonnull M 1 vector ı)
ı 0.y/ ı 0w .Rı/0 z
D  N.0; 1/: (8.20)
Œı 0 .Vw Vyw 0 V 1 V /ı1=2
y yw Œ.Rı/0 Rı1=2
And take the replacement for z P =2 in interval (8.17) to be the upper 100 P % point of the distribution
of the random variable j.Rı/0 zj
max : (8.21)
ı2 Œ.Rı/0 Rı1=2

Similarly, take the replacement for z P in intervals (8.18) and (8.19) to be the upper 100 P % point of
the distribution of the random variable
.Rı/0 z
max : (8.22)
ı2 Œ.Rı/0 Rı1=2

Then, as is evident from result (8.20), the prediction intervals obtained for the realizations of the linear
combinations ı 0w (ı 2 ) upon the application of the modified version of interval (8.17), (8.18),
or (8.19), are such that the probability of simultaneous coverage equals 1 P (both conditionally
on y and unconditionally). In fact, when the unconditional distribution of .y/ w is MVN, the
unconditional probability of simultaneous coverage of the prediction intervals obtained upon the
application of the modified version of interval (8.17), (8.18), or (8.19) would equal 1 P even if the
0
conditional distribution of w given y differed from the N Œ.y/; Vw Vyw Vy 1 Vyw  distribution.
When  Dpfı 2 RM W ı ¤ 0g, the upper 100 P % point of the distribution of the random variable
(8.21) equals N 2P .M /, and ı 0w is contained in the modified version of interval (8.17) for every
ı 2  if and only if w is contained in the set (8.13), as can be verified by proceeding in much the
same way as in Section 7.3c in the verification of some similar results. When the members of 
consist of the columns of the M M identity matrix, the linear combinations ı 0w (ı 2 ) consist of
the M random variables w1 ; w2 ; : : : ; wM.
For even moderately large values of M, a requirement that the prediction intervals achieve si-
multaneous coverage with a high probability can be quite severe. A less stringent alternative would
be to require (for some integer k greater than 1) that with a high probability, no more than k of the
intervals fail to cover. Thus, in modifying interval (8.17) for use in obtaining prediction intervals
for all of the linear combinations ı 0w (ı 2 ), we could replace z P =2 with the upper 100 P % point
of the distribution of the kth largest of the random variables j.Rı/0 zj=Œ.Rı/0 Rı1=2 (ı 2 ), rather
than with the upper 100 P % point of the distribution of the largest of these random variables. Simi-
larly, in modifying interval (8.18) or (8.19), we could replace z P with the upper 100 P % point of the
Prediction 485

distribution of the kth largest (rather than the largest) of the random variables .Rı/0 z=Œ.Rı/0 Rı1=2
(ı 2 ). In either case [and in the case of the distribution of the random variable (8.21) or (8.22), the
upper 100 P % point of the relevant distribution could be determined numerically via Monte Carlo
methods.
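As an illustration, when the collection of vectors δ under consideration is finite, the requisite upper percentage point of the largest (or kth largest) of the random variables in question can be approximated by simulation along the following lines. This is only a sketch (in Python, with numpy); the names mc_upper_point, Delta, and gamma (for the nominal level) are illustrative and not notation from the text.

```python
import numpy as np

def mc_upper_point(R, Delta, gamma, k=1, two_sided=True, n_rep=100000, seed=0):
    """Monte Carlo approximation of the upper 100*gamma% point of the k-th largest
    of the statistics |(R d)'z| / [(R d)'(R d)]^{1/2} (or of the signed version when
    two_sided is False) over the rows d of Delta, with z ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    A = Delta @ R.T                                  # row i of A equals (R d_i)'
    A = A / np.linalg.norm(A, axis=1, keepdims=True) # normalize so rows have unit length
    stats = np.empty(n_rep)
    for r in range(n_rep):
        z = rng.standard_normal(R.shape[0])
        vals = A @ z                                 # (R d)'z / ||R d|| for each d
        if two_sided:
            vals = np.abs(vals)
        stats[r] = np.sort(vals)[-k]                 # k-th largest over the collection
    return np.quantile(stats, 1.0 - gamma)           # empirical upper percentage point
```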
Hypothesis tests. Let S0 represent a (nonempty but proper) subset of the set R^M of all M-dimensional
column vectors, and let S1 represent the complement of S0, that is, the subset of R^M consisting of
all M-dimensional column vectors other than those in S0 (so that S0 ∩ S1 is the empty set, and
S0 ∪ S1 = R^M). And consider the problem of testing the null hypothesis H0: w ∈ S0 versus the
alternative hypothesis H1: w ∈ S1—note that w ∈ S1 ⟺ w ∉ S0.
Let φ(·) represent the critical function of a (nonrandomized) test of H0 versus H1, in which case
φ(·) is of the form
$$\phi(y) = \begin{cases}1, & \text{if } y \in A,\\ 0, & \text{if } y \notin A,\end{cases}$$
where A is a (nonempty but proper) subset of R^N known as the critical region, and H0 is rejected
if and only if φ(y) = 1. Or, more generally, let φ(·) represent the critical function of a possibly
randomized test, in which case 0 ≤ φ(y) ≤ 1 (for y ∈ R^N) and H0 is rejected with probability φ(y)
when y = y. Further, let
$$\pi_0 = \Pr(w \in S_0) \quad\text{and}\quad \pi_1 = \Pr(w \in S_1)\ (= 1 - \pi_0),$$
and take p0(·) and p1(·) to be functions defined (on R^N) as follows:
$$p_0(y) = \Pr(w \in S_0 \mid y) \quad\text{and}\quad p_1(y) = \Pr(w \in S_1 \mid y)\ [= 1 - p_0(y)].$$
Now, suppose that the joint distribution of w and y is such that the (marginal) distribution of
y is known and is absolutely continuous with pdf f(·) and is such that p0(·) is known (in which
case π0, π1, and p1(·) would also be known and π0 and π1 would constitute what in a Bayesian
framework would be referred to as prior probabilities and p0(y) and p1(y) what would be referred
to as posterior probabilities). Suppose further that 0 < π0 < 1 (in which case 0 < π1 < 1)—if π0 = 0
or π0 = 1, then a test for which the probability of an error of the first kind (i.e., of falsely rejecting
H0) and the probability of an error of the second kind (i.e., of falsely accepting H0) are both 0 could
be achieved by taking φ(y) = 1 for every value of y or by taking φ(y) = 0 for every value of y.
Note that when the distribution of w is absolutely continuous, the supposition that π0 > 0 rules
out the case where S0 consists of a single point—refer, e.g., to Berger (2002, sec. 4.3.3) for some
related discussion. Further, take f0(·) and f1(·) to be functions defined (on R^N) as follows: for
y ∈ R^N,
$$f_0(y) = p_0(y)f(y)/\pi_0 \quad\text{and}\quad f_1(y) = p_1(y)f(y)/\pi_1$$
—when (as is being implicitly assumed herein) the function p0(·) is "sufficiently well-behaved,"
the conditional distribution of y given that w ∈ S0 is absolutely continuous with pdf f0(·) and the
conditional distribution of y given that w ∈ S1 is absolutely continuous with pdf f1(·). And observe
that the probability of rejecting the null hypothesis H0 when H0 is true is expressible as
$$E[\phi(y) \mid w \in S_0] = \int_{R^N} \phi(y)\,f_0(y)\,dy, \tag{8.23}$$
and the probability of rejecting H0 when H0 is false is expressible as
$$E[\phi(y) \mid w \in S_1] = \int_{R^N} \phi(y)\,f_1(y)\,dy. \tag{8.24}$$

Upon applying a version of the Neyman–Pearson lemma stated by Lehmann and Romano (2005b,
sec. 3.2) in the form of their theorem 3.2.1, we find that there exists a critical function φ*(·) defined
by taking
$$\phi^*(y) = \begin{cases}1, & \text{when } f_1(y) > k f_0(y),\\ c, & \text{when } f_1(y) = k f_0(y),\\ 0, & \text{when } f_1(y) < k f_0(y),\end{cases} \tag{8.25}$$
and by taking c and k (0 ≤ c ≤ 1, 0 ≤ k < ∞) to be constants for which
EŒ  .y/ j w 2 S0  D P I (8.26)


and we find that among all choices for the critical function ./ for which
EŒ.y/ j w 2 S0   P ;
the power EŒ.y/ j w 2 S1  of the test attains its maximum value when ./ is taken to be   ./.
In regard to expression (8.25), note that for every N × 1 vector y for which f0(y) > 0,
$$f_1(y) \gtreqless k f_0(y) \;\Leftrightarrow\; B \gtreqless k \;\Leftrightarrow\; p_1(y)/p_0(y) \gtreqless k' \;\Leftrightarrow\; p_0(y) \lesseqgtr k'', \tag{8.27}$$
where
$$B = \frac{f_1(y)}{f_0(y)} = \frac{p_1(y)/\pi_1}{p_0(y)/\pi_0} = \frac{p_1(y)/p_0(y)}{\pi_1/\pi_0}, \qquad k' = (\pi_1/\pi_0)\,k, \quad\text{and}\quad k'' = (1 + k')^{-1}.$$
In Bayesian terms, the ratio π1/π0 represents the prior odds in favor of H1 and against H0 and (when
y is regarded as the observed value of y) p1(y)/p0(y) represents the posterior odds in favor of H1
and against H0 and B represents the Bayes factor in favor of H1 (e.g., Berger 1985, sec 4.3.3).
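For a numerical illustration of the relationships in (8.27), the quantities involved can be computed directly from p0(y) and π0. The following Python sketch (the function and argument names are illustrative) returns the prior odds, the posterior odds, and the Bayes factor, and evaluates the nonrandomized form of the rejection rule p0(y) < k''.

```python
def posterior_summary(p0_y, pi0):
    """Prior odds, posterior odds, and Bayes factor in favor of H1, given
    p0(y) = Pr(w in S0 | y) and the prior probability pi0 = Pr(w in S0)."""
    pi1, p1_y = 1.0 - pi0, 1.0 - p0_y
    prior_odds = pi1 / pi0
    posterior_odds = p1_y / p0_y
    bayes_factor = posterior_odds / prior_odds   # B = [p1(y)/p0(y)] / (pi1/pi0)
    return prior_odds, posterior_odds, bayes_factor

def reject_H0(p0_y, k, pi0):
    """Nonrandomized form of the rule in (8.25)/(8.27): reject H0 when p0(y) < k''."""
    k_prime = (1.0 - pi0) / pi0 * k              # k' = (pi1/pi0) k
    k_double_prime = 1.0 / (1.0 + k_prime)       # k'' = (1 + k')^{-1}
    return p0_y < k_double_prime
```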
The result on the existence of the critical function   ./ and on the optimality of the test with
critical function   ./ can be generalized. Let z D z.y/ represent an N  -dimensional column vector
whose elements are (known) functions of the N -dimensional observable random column vector y
(where N   N ). And suppose that the (marginal) distribution of z is known and is absolutely
continuous with pdf f  ./ and also that Pr.w 2 S0 j z/ is a known function of the value of z. Suppose
further that the choice of a test procedure is restricted to procedures that depend on y only through
the value of z.
The result on existence and optimality can be generalized so as to achieve coverage of this
situation. It is a simple matter of replacing [in the definitions of p0 ./, p1 ./, f0 ./, and f1 ./ and
in the statement of the result] the observable random vector y and the pdf f ./ with the observable
random vector z and the pdf f  ./. In the generalized version of the result, the optimality is in regard
to P -level test procedures that depend on y only through the value of z rather than (as in the original
version) in regard to all P -level test procedures.
Suppose, in particular, that enough is known about the joint distribution of w and y that there
exists an unbiased predictor w.y/Q of the realization of w. And writing wQ for w.y/,
Q suppose further
that the distribution of wQ is known and is absolutely continuous and that Pr.w 2 S0 j w/ Q is a known
function of the value of w. Q Then, as a special case of the generalized version of the result on the
existence of the critical function   ./ and on the optimality of the test with critical function   ./,
we have the case where z = w̃. In that special case, the functions p0(·) and p1(·) can be expressed
in terms of the conditional (on w̃) distribution of w and reexpressed in terms of the conditional
distribution of the prediction error e = w̃ − w as follows:
$$p_0(\tilde w) = \Pr(w \in S_0 \mid \tilde w) = \Pr[\,e \in S_0(\tilde w) \mid \tilde w\,], \tag{8.28}$$
where (for w̃ ∈ R^M) S0(w̃) = {e ∈ R^M : e = w̃ − w, w ∈ S0}, and, similarly,
$$p_1(\tilde w) = \Pr(w \in S_1 \mid \tilde w) = \Pr[\,e \in S_1(\tilde w) \mid \tilde w\,], \tag{8.29}$$
where S1(w̃) = {e ∈ R^M : e = w̃ − w, w ∈ S1}.
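As a concrete special case (assumed here only for illustration), if M = 1, S0 is an interval [ℓ, u], and the conditional distribution of the prediction error given w̃ is taken to be normal with mean 0 and variance ṽ, then p0(w̃) reduces to a difference of two normal cdf values (compare Exercise 34(c)). A Python sketch, with illustrative argument names:

```python
from scipy.stats import norm

def p0_interval(w_tilde, lower, upper, v_tilde):
    """Pr(w in [lower, upper] | w_tilde) when, conditionally on w_tilde, the
    prediction error e = w_tilde - w is N(0, v_tilde); then w given w_tilde is
    N(w_tilde, v_tilde), so p0 is a difference of normal cdf values."""
    s = v_tilde ** 0.5
    return norm.cdf((upper - w_tilde) / s) - norm.cdf((lower - w_tilde) / s)
```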
Multiple comparisons. Suppose that we wish to make inferences about the realization of each of the
M random variables w1, w2, ..., wM. Suppose, in particular, that (for j = 1, 2, ..., M) we wish
to test the null hypothesis Hj^(0): wj ∈ Sj^(0) versus the alternative hypothesis Hj^(1): wj ∉ Sj^(0), where
Sj^(0) is "any" particular (nonempty) subset of R^1—the M subsets S1^(0), S2^(0), ..., SM^(0) may or may
not be identical.
In a “one-at-a-time” approach, each of the M null hypotheses H1.0/; H2.0/; : : : ; HM .0/
would be
tested individually at a specified level, in which case the probability of falsely rejecting one or more
of the M null hypotheses would be larger (and for even moderately large values of M , could be much
larger) than the specified level. Let us consider an alternative approach in which the probability of
falsely rejecting one or more of the null hypotheses (the FWER) or, more generally, the probability
of falsely rejecting k or more of the null hypotheses (the k-FWER) is controlled at a specified level,
say P . Such an approach can be devised by making use of the results of the preceding part of the
present subsection and by invoking the so-called closure principle (e.g., Bretz, Hothorn, and Westfall
2011, sec. 2.2.3; Efron 2010, p. 38), as is to be demonstrated in what follows.
Let z D z.y/ represent an N -dimensional column vector whose elements are (known) functions
of y—e.g., z.y/ D y or (at the other extreme) z.y/ D w.y/, Q where w.y/
Q is an unbiased (point)
predictor of the realization of the vector w D .w1 ; w2 ; : : : ; wM /0. And suppose that the (marginal)
distribution of z is known and is absolutely continuous with pdf f  ./.
Now, let k represent the collection of all subsets of I D f1; 2; : : : ; M g of size k or greater, and
let kj represent the collection consisting of those subsets of I of size k or greater that include the
integer j. Further, for I 2 k ; let
S.I / D fw D .w 1 ; w 2 ; : : : ; wM /0 2 RM W wj 2 Sj.0/ for j 2 I gI
and suppose that (for I 2 k ) the joint distribution of w and z is such that PrŒw 2 S.I / j z is a
known function of the value of z—in which case PrŒw 2 S.I / would be known—and is such that

0 < PrŒw 2 S.I / < 1. And (for I 2 k ) take   .  I I / to be the critical function (defined on RN )
of the most-powerful P -level procedure for testing the null hypothesis H0 .I / W w 2 S.I / [versus
the alternative hypothesis H1 .I / W w … S.I /] on the basis of z; this procedure is that obtained
upon applying [with S0 D S.I /] the generalized version of the result of the preceding part of the
present subsection (i.e., the version of the result in which the choice of a test procedure is restricted
to procedures that depend on y only through the value of z). Then, the k-FWER (and in the special
case where k D 1, the FWER) can be controlled at level P by employing a (multiple-comparison)
procedure in which (for j D 1; 2; : : : ; M ) Hj.0/ W wj 2 Sj.0/ is rejected (in favor of Hj.1/ W wj … Sj.0/ )
if and only if for every I 2 kj the null hypothesis H0 .I / W w 2 S.I / is rejected by the P -level
test with critical function   .  I I /.
Let us verify that this procedure controls the k-FWER at level P . Define
T = {j ∈ I : Hj^(0) is true} = {j ∈ I : wj ∈ Sj^(0)}.
Further, denote by RT the subset of T defined as follows: for j 2 T, j 2 RT if Hj.0/ is among
the null hypotheses rejected by the multiple-comparison procedure. And (denoting by MS the size
of an arbitrary set S ) suppose that MRT  k, in which case MT  k and it follows from the very
definition of the multiple-comparison procedure that the null hypothesis H0 .T / is rejected by the
P -level test with critical function   .  I T /. Thus,
Pr.MRT  k/  P
(so that the multiple-comparison procedure controls the k-FWER at level P as was to be verified).
For some discussion (in the context of the special case where k D 1) of shortcut procedures
for achieving an efficient implementation of this kind of multiple-comparison procedure, refer, for
example, to Bretz, Hothorn, and Westfall (2011).
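For small M, the rule just described can be carried out by brute-force enumeration of the subsets I. The following Python sketch does so; reject_intersection stands in for the level-controlled test of H0(I) based on z (it is not supplied here), and the enumeration is exponential in M, which is why shortcut procedures of the kind cited above are needed in practice.

```python
from itertools import combinations

def closure_kfwer(M, k, reject_intersection):
    """Rejections made by the closure-type rule described above: H_j^(0) is rejected
    iff the intersection hypothesis H_0(I) is rejected for EVERY subset I of
    {1, ..., M} of size >= k that contains j.  reject_intersection(I) -> bool is a
    placeholder for the level-controlled test of H_0(I); intended for small M only."""
    items = range(1, M + 1)
    rejected = []
    for j in items:
        keep = True
        for size in range(k, M + 1):
            for I in combinations(items, size):
                if j in I and not reject_intersection(frozenset(I)):
                    keep = False
                    break
            if not keep:
                break
        if keep:
            rejected.append(j)
    return rejected
```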
False discovery proportion: control of the probability of its exceeding a specified constant. Let us
consider further the multiple-comparison procedure (described in the preceding part of the present
.0/
subsection) for testing the null hypotheses Hj (j D 1; 2; : : : ; M ); that procedure controls the
k-FWER at level P . Denote by Rk the total number of null hypotheses rejected by the multiple-
comparison procedure—previously (in Section 7.7d) that symbol was used to denote something
else—and denote by RTk (rather than simply by RT, as in the preceding part of the present subsection)
the number of true null hypotheses rejected by the procedure. Further, define
$$\mathrm{FDP}_k = \begin{cases} R^T_k / R_k, & \text{if } R_k > 0,\\ 0, & \text{if } R_k = 0; \end{cases}$$
this quantity represents what (in the present context) constitutes the false discovery proportion—refer
to Section 7.7c.
For any scalar  in the interval .0; 1/, the multiple-comparison procedure can be used to control
the probability Pr.FDPk > / at level P. As is to be shown in what follows, such control can be
achieved by making a judicious choice for k (a choice based on the value of the observable random
vector z).
Upon observing that Rk  RTk , we find that
FDPk >  , RTk > Rk , RTk  ŒRk C1 (8.30)
(where for any real number x, Œx denotes the largest integer that is less than or equal to x). Moreover,
for any k for which k  ŒRk C1,
RTk  ŒRk C1 ) the null hypothesis H0 .T / is rejected by the
P -level test with critical function   .  I T /: (8.31)
The quantity ŒRk C1 is nondecreasing in k and is bounded from below by 1 and from above by
ŒM C1. Let K D K.z/ represent the largest value of k for which k  ŒRk C1 (or, equivalently,
the largest value for which k D ŒRk C1). Then, it follows from results (8.30) and (8.31) that
PrŒFDPK.z/ >   P : (8.32)
Thus, by taking k D K, the multiple-comparison procedure for controlling the k-FWER at level P
can be configured to control (at the same level P ) the probability of the false discovery proportion
exceeding . Moreover, it is also the case—refer to Exercise 28—that when k D K, the false discovery
rate E.FDPK.z/ / of this procedure is controlled at level ı D P C .1 P /.
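Computationally, the choice of K amounts to a search over k = 1, 2, ..., M. A Python sketch follows; theta and num_rejections are illustrative placeholders for the constant and for the data-dependent rejection count R_k of the k-FWER procedure run with a given k.

```python
import math

def choose_K(num_rejections, M, theta):
    """Largest k in {1, ..., M} satisfying k <= floor(theta * R_k) + 1, where
    num_rejections(k) returns R_k.  (k = 1 always qualifies, since floor(.) >= 0.)"""
    K = 1
    for k in range(1, M + 1):
        if k <= math.floor(theta * num_rejections(k)) + 1:
            K = k
    return K
```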

b. Prediction on the basis of a G–M, Aitken, or general linear model


Suppose that y is an N1 observable random vector that follows a general linear model. And consider
the use of the observed value of y in making predictive inferences about the realization of an M 1
vector w D .w1 ; w2 ; : : : ; wM /0 of unobservable random variables that is expressible in the form
w D ƒ0ˇ C d; (8.33)
0
where ƒ is a P M matrix of (known) constants (such that ƒ ˇ is estimable) and where d is an
M  1 (unobservable) random vector with E.d/ D 0, var.d/ D Vw ./, and (denoting by e the
vector of residual effects in the general linear model) cov.e; d/ D Vyw ./ for some matrices Vw ./
and Vyw ./ whose elements [like those of the matrix var.e/ D V./] are known functions of the
parametric vector . Further, write Vy ./ for V./, and observe that
       
y X y Vy ./ Vyw ./
E D ˇ and var D :
w ƒ0 w ŒVyw ./0 Vw ./
The Aitken model can be regarded as the special case of the general linear model where  D ./
and where V./ D  2 H, and the G–M model can be regarded as the further special case where
H D I. Let us suppose that when (in making predictive inferences about the realization of w) the
model is taken to be the Aitken model or the G–M model, Vw ./ and Vyw ./ are taken to be of the
form
Vw ./ D  2 Hw and Vyw ./ D  2 Hyw ;
where Hw and Hyw are (known) matrices of constants. Thus, when the model is taken to be the
Aitken model or the G–M model,
       
y X y 2 Hy Hyw
E D ˇ and var D  ; (8.34)
w ƒ0 w 0
Hyw Hw
where (in the case of the Aitken model) Hy D H and (in the case of the G–M model) Hy D I.
The joint distribution of y and w is such that (regardless of the specific form of the distribution)
the variance-covariance matrix of the vector (y′, w′)′ depends on unknown parameters (σ² or, more
generally, the elements of θ). Thus, aside from trivial special cases and in the absence of "additional
information” about the value of  or , the information available about the joint distribution of y and
w falls short of what would be needed to apply the prediction procedures described in Subsection
a. Among the kinds of additional information that would allow the application of some or all of
those procedures, the simplest and most extreme is the kind that imparts “exact” knowledge of the
value of  or . In some cases, the information about the value of  or  provided by the value of
y (perhaps in combination with information available from “external sources”) may be adequate to
provide justification for proceeding as though the value of  or  is known.
A special case: G–M model. Let us focus on the special case where y follows the G–M model and
hence where the expected value and the variance-covariance matrix of (y′, w′)′ are of the form (8.34)
with Hy = I. In that special case, an unbiased (point) predictor of the realization of w is
$$\tilde w(y) = \Lambda'(X'X)^{-}X'y + H_{yw}'(I - P_X)\,y. \tag{8.35}$$
In fact, as discussed in Section 5.10b, this predictor is the BLUP (best linear unbiased predictor).
And the variance-covariance matrix of its prediction error is
$$\mathrm{var}[\tilde w(y) - w] = \sigma^2 H_*, \tag{8.36}$$
where $H_* = (\Lambda - X'H_{yw})'(X'X)^{-}(\Lambda - X'H_{yw}) + H_w - H_{yw}'H_{yw}$—refer to result (10.38). Then,
clearly, the distribution of the prediction error is not known; regardless of its form, it depends on ,
the value of which is unknown. Thus, formula (8.6) for obtaining 100.1 P /% prediction sets is not
applicable.
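For reference, the predictor (8.35) and the matrix appearing in (8.36) can be computed directly. The following Python sketch uses the Moore–Penrose inverse as one admissible choice of generalized inverse of X'X (any choice yields the same value of the predictor when Λ'β is estimable); the argument names are illustrative.

```python
import numpy as np

def blup_and_prediction_variance(X, y, Lam, H_yw, H_w):
    """BLUP of the form (8.35) and the matrix in (8.36) (called H_pred here).
    X: (N, P) model matrix; Lam: (P, M); H_yw: (N, M); H_w: (M, M)."""
    G_inv = np.linalg.pinv(X.T @ X)                  # a generalized inverse of X'X
    P_X = X @ G_inv @ X.T                            # orthogonal projector onto C(X)
    M_X = np.eye(X.shape[0]) - P_X
    w_tilde = Lam.T @ G_inv @ X.T @ y + H_yw.T @ M_X @ y          # (8.35)
    D = Lam - X.T @ H_yw
    H_pred = D.T @ G_inv @ D + H_w - H_yw.T @ H_yw                # (8.36): var = sigma^2 * H_pred
    return w_tilde, H_pred
```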
In effect, formula (8.6) is based on regarding the prediction error w.y/
Q w as a “pivotal quantity.”
In the present setting, the prediction error cannot be so regarded. Suppose, however, that we divide
the prediction error by O [where O 2 D y 0 .I PX /y=.N P / is the usual unbiased estimator of  2 ],
thereby obtaining a vector that is expressible as
Q
O w.y/
.1=/Œ w D Œ.1= 2 /y 0 .I PX /y=.N P / 1=2
Q
.1=/Œw.y/ w: (8.37)
Moreover,
y 0 .I PX /y D Œ.I PX /y0 .I PX /y; (8.38)
so that y 0 .I PX /y depends on y only through the value of the vector .I PX /y, and
covŒw.y/
Q w; .I PX /y
D  2 Œƒ0 .X0 X/ X0 .I PX / C Hyw
0
.I PX /.I PX / 0
Hyw .I PX / D 0: (8.39)
Now, suppose that the joint distribution of y and w is MVN. Then, .1= 2 /y 0 .I PX /y and
Q
.1=/Œw.y/ w are statistically independent [as is evident from results (8.38) and (8.39)]. Further,
Q
.1=/Œw.y/ w  N.0; H /: (8.40)
And
.1= 2 /y 0 .I PX /y  2 .N P / (8.41)
(as is evident upon recalling the results of Section 7.3). Thus, the distribution of .1=/ΠQ
O w.y/ w does
not depend on any unknown parameters, and hence .1=/Œ O wQ .y/ w can serve [in lieu of w.y/
Q w]
as a “pivotal quantity.”
It is now clear that a 100.1 P /% prediction set S.y/ (for the realization of w) can be obtained
by taking
S.y/ D f w W w D w.y/ Q O u; u 2 S  g; (8.42)
where S  is any set of M 1 vectors for which
PrŒ.1=/Œ Q
O w.y/ w 2 S   D 1 P: (8.43)
A prediction set of a particular kind. Suppose that the matrix H_* is nonsingular, as would be the case
if Hw − Hyw′Hyw were nonsingular or, equivalently, if the matrix
$$\begin{pmatrix} I & H_{yw}\\ H_{yw}' & H_w \end{pmatrix}$$
were nonsingular. Further, define u = (1/σ̂)[w̃(y) − w], and let
$$R = (1/M)\,u'H_*^{-1}u = (1/M)[\tilde w(y) - w]'H_*^{-1}[\tilde w(y) - w]/\hat\sigma^2. \tag{8.44}$$
And observe [in light of result (6.6.15) and the results of the preceding part of the present subsection]
R  SF .M; N P /; (8.45)
N N
and write F P for F P .M; N P / [the upper 100 P % point of the SF .M; N P / distribution]. Then,
among the choices for the set S  defined by equality (8.43) is the set
S  D f u W .1=M / u0 H 1 u  FN P g: (8.46)
Corresponding to this choice for S  is the 100.1 P /% prediction set
0
Q
S.y/ D f w W .1=M / Œw w.y/ H 1 Œw w.y/
Q =O 2  FN P g (8.47)
defined by equality (8.42). This set is similar in form to the set (8.13).
Special case: M D 1. Let us continue the discussion of the preceding two parts of the present
subsection (pertaining to the prediction of the realization of w in the special case where y follows
a G–M model) by further specializing to the case where M D 1. In this further special case, let
us write hw , hyw , h , w, w.y/,
Q and u for Hw , Hyw , H , w, w.y/,
Q and u, respectively. Then,
1 1=2
u D O Œw.y/
Q w and u= h  S t.N P /. And R is reexpressible as
R D u2= h D .u= h1=2 2
 / : (8.48)
For 0 < ˛ < 1, let us write tN˛ for tN˛ .N P / [i.e., for the upper 100 ˛% point of the S t.N P /
distribution]. And observe [in light of result (6.4.26)] that (when M D 1)
1=2
FNP D tNP =2 :
Thus, when M D 1, the 100.1 P /% prediction set (8.47) takes the form of the interval
f w W w.y/
Q tNP =2 h1=2
  O  w  w.y/
Q C tNP =2 h1=2
  O g: (8.49)
This interval is analogous to interval (8.14).
Two other 100.1 P /% prediction intervals are
fw W 1 < w  w.y/
Q C tNP h1=2
 gO (8.50)
and
f w W w.y/
Q tNP h1=2
  O  w < 1g: (8.51)
These intervals are analogous to intervals (8.15) and (8.16). They represent the special cases of the
100.1 P /% prediction set (8.42) obtained when M D 1 and when [in the case of interval (8.50)]
1=2
S  is taken to be the set fu W u = h  tNP g or [in the case of interval (8.51)] to be the set
1=2
fu W u = h  tNP g.
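Putting the pieces together for the case M = 1, a two-sided interval of the form (8.49) can be computed as in the following Python sketch, which assumes (for simplicity) that X has full column rank; the argument names and the nominal level gamma are illustrative.

```python
import numpy as np
from scipy.stats import t as t_dist

def prediction_interval_M1(X, y, lam, h_yw, h_w, gamma=0.05):
    """Two-sided 100(1 - gamma)% prediction interval of the form (8.49) for a scalar
    w under the G-M model.  lam: (P,) coefficient vector; h_yw: (N,); h_w: scalar."""
    N, P = X.shape                                   # full column rank assumed
    XtX_inv = np.linalg.inv(X.T @ X)
    M_X = np.eye(N) - X @ XtX_inv @ X.T              # I - P_X
    w_tilde = lam @ XtX_inv @ X.T @ y + h_yw @ M_X @ y          # BLUP, cf. (8.35)
    g = lam - X.T @ h_yw
    h_pred = g @ XtX_inv @ g + h_w - h_yw @ h_yw                # cf. (8.36) with M = 1
    sigma_hat = np.sqrt(y @ M_X @ y / (N - P))                  # usual estimator of sigma
    half = t_dist.ppf(1 - gamma / 2, N - P) * np.sqrt(h_pred) * sigma_hat
    return w_tilde - half, w_tilde + half
```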
Simultaneous prediction intervals. Let us continue the discussion of the preceding three parts
of the present subsection [pertaining to predictive inference about the realization of w D
.w1 ; w2 ; : : : ; wM /0 when y follows a G–M model]. And in doing so, let us continue to suppose
0
that the matrix H D Hw Hyw Hyw is nonsingular.
Suppose that we wish to undertake predictive inference for the realizations of some or all of
the random variables w1 ; w2 ; : : : ; wM or, more generally, for the realizations of some or all linear
combinations of these random variables. Predictive inference for the realization of any particular
linear combination, say for that of the linear combination ı 0 w (where ı ¤ 0), might take the form
of the interval
f w W ı 0 w.y/
Q tNP =2 .ı 0 H ı/1=2 O  w  ı 0 w.y/
Q C tNP =2 .ı 0 H ı/1=2 g;
O (8.52)
fw W 1 < w  ı 0 w.y/
Q C tNP .ı 0 H ı/1=2 O g; (8.53)
or
f w W ı 0 w.y/
Q tNP .ı 0 H ı/1=2 O  w < 1g: (8.54)
It follows from the results of the preceding part of the present subsection that when considered
in “isolation,” interval (8.52), (8.53), or (8.54) has a probability of coverage equal to 1 P . However,
when such an interval is obtained for the realization of each of a number of linear combinations, the
probability of all of the intervals covering (or even that of all but a “small number” of the intervals
covering) can be much less than 1 P .
Accordingly, let  represent a finite or infinite collection of (nonnull) M -dimensional column
vectors. And suppose that we wish to obtain a prediction interval for the realization of each of the
linear combinations ı 0 w (ı 2 ) and that we wish to do so in such a way that the probability of
simultaneous coverage equals 1 P or, more generally, that the probability of coverage by all but at
most some number k of the intervals equals 1 P . Such intervals can be obtained by taking (for each
ı 2 ) the interval for the realization of ı 0 w to be a modified version of interval (8.52), (8.53), or
(8.54) in which the constant tNP =2 or tNP is replaced by a larger constant.
Let R represent an M M nonsingular matrix such that H D R0 R . Further, let
z D .1=/.R 1 /0 Œw.y/
Q w and v D .1= 2 /y 0 .I PX /y;
so that z  N.0; I/, v  2 .N P /, z and v are statistically independent, and (for every M 1
vector ı)
ı 0 w.y/
Q ı0 w .R ı/0 z
D  S t.N P / (8.55)
.ı 0 H ı/1=2 O
p
Œ.R ı/0 R ı1=2 v=.N P /
—result (8.55) is analogous to result (8.20). Then, prediction intervals for the realizations of the
random variables ı 0 w (ı 2 ) having a probability of simultaneous coverage equal to 1 P can be
obtained from the application of a modified version of interval (8.52) in which tNP =2 is replaced by
the upper 100 P % point of the distribution of the random variable
j.R ı/0 zj
max p (8.56)
ı2 Œ.R ı/0 R ı1=2 v=.N P /

or from the application of a modified version of interval (8.53) or (8.54) in which tNP is replaced by
the upper 100 P % point of the distribution of the random variable
.R ı/0 z
max p : (8.57)
ı2 Œ.R ı/0 R ı1=2 v=.N P /

More generally, for k  1, prediction intervals for which the probability of coverage by all but at
most k of the intervals can be obtained from the application of a modified version of interval (8.52) in
which tNP =2 is replaced by the upper 100 P % point of the distribution of the kth largest of the random
j.R ı/0 zj
variables p (ı 2 ) or from the application of a modified version of
Œ.R ı/0 R ı1=2 v=.N P /
interval (8.53) or (8.54) in which tNP is replaced by the upper 100 P % point of the distribution of the
.R ı/0 z
kth largest of the random variables p (ı 2 ). The resultant intervals
Œ.R ı/0 R ı1=2 v=.N P /
are analogous to those devised in Part 5 of Subsection a by employing a similar approach. And as in
the case of the latter intervals, the requisite percentage points could be determined via Monte Carlo
methods.
When  D fı 2 RM W ı ¤ 0g, the upper 100 P % point of the distribution of the random variable
(8.56) equals ŒM FN P .M; N P /1=2, and ı 0 w is contained in the modified version of interval (8.52)
[in which tNP =2 has been replaced by the upper 100 P % point of the distribution of the random variable
(8.56)] if and only if w is contained in the set (8.47)—refer to results (3.106) and (3.113) and to the
ensuing discussion. When the members of  consist of the columns of IM , the linear combinations
ı 0 w (ı 2 ) consist of the M random variables w1 ; w2 ; : : : ; wM .
Extensions and limitations. Underlying the results of the preceding four parts of the present sub-
section is a supposition that the joint distribution of y and w is MVN. This supposition is stronger
than necessary.
As in Section 7.3, let L represent an N  .N P / matrix whose columns form an orthonormal
basis for N.X0 /. And let x D .1=/L0 y, and continue to define z D .1=/.R 1 /0 Œw.y/
Q w. Further,
1 0 Q 0
let t D .1=/.R
O  / Œw .y/ w; and recalling result (3.21), observe that x x D .1= 2
/ y 0 .I PX /y
and hence that
t D Œx0 x=.N P / 1=2 z: (8.58)
Observe also that
Q
O w.y/
.1=/Œ w D R0 t; (8.59)
so that the distribution of the vector .1=/ΠQ
O w.y/ w is determined by the distribution of the vector
t.
When the joint distribution of y and w isMVN, then
z
 N.0; I/; (8.60)
x
as is evident upon observing the L0 L DI and  upon recalling expression (8.35) and recalling [from
0 z
result (3.10)] that L X D 0. And when  N.0; I/,
x
t  MV t.N P ; I/; (8.61)
as
 is 1evident from expression (8.58). More generally, t  MV t.N P ; I/ when the vector
.R /0 ŒwQ .y/ w
has an absolutely continuous spherical distribution—refer to result (6.4.67).
L0 y
Thus, the supposition (underlying the results of the preceding four parts of the present subsection)
that thejoint distribution of y and w is MVN could be replaced by the weaker supposition that the
.R 1 /0 Œw.y/
Q w
vector has an MVN distribution or, more generally, that it has an absolutely
L0 y
continuous spherical distribution. In fact, ultimately, it could be replaced by a supposition that the
1 0 Q
vector t D .1=/.R
O  / Œw.y/ w has an MV t.N P ; I/ distribution.
The results of the preceding four parts of the present subsection (which are for the case where y
follows a G–M model) can be readily extended to the more general case where y follows an Aitken
model. Let N* = rank H. And let Q represent an N × N* matrix such that Q'HQ = I_{N*}—the
existence of such a matrix follows, e.g., from Corollary 2.13.23, which implies the existence of an
N* × N matrix P (of full row rank) for which H = P'P, and from Lemma 2.5.1, which implies the
existence of a right inverse of P. Then,
$$\mathrm{var}\begin{pmatrix} Q'y\\ w \end{pmatrix} = \sigma^2\begin{pmatrix} I_{N^*} & Q'H_{yw}\\ (Q'H_{yw})' & H_w \end{pmatrix}.$$
Thus, by applying the results of the preceding four parts of the present subsection with Q'y in place
of y and with Q'H_{yw} in place of H_{yw} (and with N* in place of N), those results can be extended
to the case where y follows an Aitken model.
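One concrete way to construct such a matrix Q (a sketch, not the construction used in the text) is via the eigendecomposition of the symmetric nonnegative definite matrix H:

```python
import numpy as np

def aitken_transform(H, tol=1e-10):
    """Return an N x N* matrix Q with Q' H Q = I, where N* = rank(H): keep the
    eigenvectors of H with positive eigenvalues and scale each by lambda^{-1/2}."""
    lam, U = np.linalg.eigh(H)
    keep = lam > tol * lam.max()                 # numerically positive eigenvalues
    Q = U[:, keep] / np.sqrt(lam[keep])          # column j scaled by lam_j^{-1/2}
    return Q                                     # Q' H Q = I_{N*} (up to rounding)
```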
In the case of the general linear model, the existence of a pivotal quantity for the realization of w
is restricted to special cases. These special cases are ones where (as in the special case of the G–M
or Aitken model) the dependence of the values of the matrices Vy ./, Vyw ./, and Vw ./ on the
elements of the vector  is of a relatively simple form.

Exercises
Exercise 1. Take the context to be that of Section 7.1 (where a second-order G–M model is applied
to the results of an experimental study of the yield of lettuce plants for purposes of making inferences
about the response surface and various of its characteristics). Assume that the second-order regression
coefficients ˇ11 , ˇ12 , ˇ13 , ˇ22 , ˇ23 , and ˇ33 are such that the matrix A is nonsingular. Assume also
that the distribution of the vector e of residual effects in the G–M model is MVN (in which case the
matrix A O is nonsingular with probability 1). Show that the large-sample distribution of the estimator
uO 0 of the stationary point u0 of the response surface is MVN with mean vector u0 and variance-
covariance matrix
.1=4/ A 1 var.Oa C 2AuO 0 / A 1: (E.1)
Do so by applying standard results on multi-parameter maximum likelihood estimation—refer, e.g.,
to McCulloch, Searle, and Neuhaus (2008, sec. S.4) and Zacks (1971, chap. 5).
Exercise 2. Taking the context to be that of Section 7.2a (and adopting the same notation and
Q 0 X0 y to have the same expected value under the
terminology as in Section 7.2a), show that for R
augmented G–M model as under the original G–M model, it is necessary (as well as sufficient) that
X0 Z D 0.
Exercise 3. Adopting the same notation ıand terminology as in Section 7.2, consider the expected
value of the usual estimator y 0 .I PX /y ŒN rank.X/ of the variance of the residual effects of
the (original) G–M model y D Xˇ C e. How is the expected value of this estimator affected when
the model equation is augmented via the inclusion of the additional “term” Z? That is, what is the
expected value of this estimator when its expected value is determined under the augmented G–M
model (rather than under the original G–M model)?
Exercise 4. Adopting the same notation and terminology as in Sections 7.1 and 7.2, regard the lettuce
yields as the observed values of the N .D 20/ elements of the random column vector y, and take
the model to be the “reduced” model derived from the second-order G–M model (in the 3 variables
Cu, Mo, and Fe) by deleting the four terms involving the variable Mo—such a model would be
consistent with an assumption that Mo is “more-or-less inert,” i.e., has no discernible effect on the
yield of lettuce.
(a) Compute the values of the least squares estimators of the regression coefficients (ˇ1 , ˇ2 , ˇ4 ,
ˇ11 , ˇ13 , and ˇ33 ) of the reduced model, and determine the standard errors, estimated standard
errors, and correlation matrix of these estimators.
(b) Determine the expected values of the least squares estimators (of ˇ1 , ˇ2 , ˇ4 , ˇ11 , ˇ13 , and ˇ33 )
from Part (a) under the complete second-order G–M model (i.e., the model that includes the
4 terms involving the variable Mo), and determine (on the basis of the complete model) the
estimated standard errors of these estimators.
(c) Find four linearly independent linear combinations of the four deleted regression coefficients
(ˇ3 , ˇ12 , ˇ22 , and ˇ23 ) that, under the complete second-order G–M model, would be estimable
and whose least squares estimators would be uncorrelated, each with a standard error of ;
and compute the values of the least squares estimators of these linearly independent linear
combinations, and determine the estimated standard errors of the least squares estimators.

Exercise 5. Suppose that y is an N  1 observable random vector that follows the G–M model.
Show that Q
E.y/ D XRS˛ C U;
Q S, U, ˛, and  are as defined in Section 7.3a.
where R,
Exercise 6. Taking the context to be that of Section 7.3, adopting the notation employed therein,
supposing that the distribution of the vector e of residual effects (in the G–M model) is MVN, and
assuming that N > P C2, show that
.˛ ˛.0/ /0 .˛ ˛.0/ /
 
N P
EŒFQ .˛.0/ / D 1C
N P 2 M  2
.0/ 0
.  / C .  .0/ /
 
N P
D 1C :
N P 2 M  2

Exercise 7. Take the context to be that of Section 7.3, adopt the notation employed therein, and
suppose that the distribution of the vector e of residual effects (in the G–M model) is MVN. For
P 2 C.ƒ0 /, the distribution of F ./ P from that of FQ .˛/:
P is obtainable (upon setting ˛P D S0 ) P in light
of the relationship (3.32) and results (3.24) and (3.34),
P  SF ŒM ; N P ; .1= 2 /.P /0 C .P /:
F ./
Provide an alternative derivation of the distribution of F ./
P by (1) taking b to be a P  1 vector such
that P D ƒ0 b and establishing that F ./
P is expressible in the form
.1= 2 /.y Xb/0 P Q
.y Xb/=M
XR
P D
F ./
.1= 2 /.y Xb/0 .I PX /.y Xb/=.N P /
and by (2) regarding .1= 2 /.y Xb/0 P Q .y Xb/ and .1= 2 /.y Xb/0 .I PX /.y Xb/ as quadratic
XR
forms (in y Xb) and making use of Corollaries 6.6.4 and 6.8.2.
Exercise 8. Take the context to be that of Section 7.3, and adopt the notation employed therein.
Taking the model to be the canonical form of the G–M model and taking the distribution of the vector
of residual effects to be N.0;  2 I/, derive (in terms of the transformed vector z) the size- P likelihood
ratio test of the null hypothesis HQ 0 W ˛ D ˛.0/ (versus the alternative hypothesis HQ 1 W ˛ ¤ ˛.0/ )—
refer, e.g., to Casella and Berger (2002, sec. 8.2) for a discussion of likelihood ratio tests. Show that
the size- P likelihood ratio test is identical to the size- P F test.
Exercise 9. Verify result (3.67).
Exercise 10. Verify the equivalence of conditions (3.59) and (3.68) and the equivalence of conditions
(3.61) and (3.69).
Exercise 11. Taking the context to be that of Section 7.3 and adopting the notation employed therein,
show that, corresponding to any two choices S1 and S2 for the matrix S (i.e., any two M  M
matrices S1 and S2 such that T S1 and T S2 are orthogonal), there exists a unique M  M matrix
Q such that
Q 2 D XRS
XRS Q 1 Q;
and show that this matrix is orthogonal.
Exercise 12. Taking the context to be that of Section 7.3, adopting the notation employed therein,
and making use of the results of Exercise 11 (or otherwise), verify that none of the groups G0 , G1 ,
G01 , G2 . .0/ /, G3 . .0/ /, and G4 (of transformations of y) introduced in the final three parts of
Subsection b of Section 7.3 vary with the choice of the matrices S, U, and L.
Exercise 13. Consider the set AQıQ (of ıQ 0 ˛-values) defined (for ıQ 2 ) Q by expression (3.89) or
(3.90). Underlying this definition is an implicit assumption that (for any M1 vector tP ) the function
Q D jıQ 0 tP j=.ıQ 0 ı/
f .ı/ Q 1=2, with domain fıQ 2 
Q W ıQ ¤ 0g, attains a maximum value. Show (1) that this
function has a supremum and (2) that if the set
R D fıR 2 RM W 9 a nonnull vector ıQ in 
 Q such that ıR D .ıQ 0 ı/
Q 1=2 Q
ıg
Q at which this function attains a maximum value.
is closed, then there exists a nonnull vector in 
Exercise 14. Take the context to be that of Section 7.3, and adopt the notation employed therein.
Further, let r D O c P , and for ıQ 2 RM, let
APıQ D fP 2 R1 W P D ıQ 0 ˛; Q
P ˛P 2 Ag;
where AQ is the set (3.121) or (3.122) and is expressible in the form
jıP 0 .˛P ˛/j
 
O
AQ D ˛P 2 RM W  r for every nonnull ıP 2 
Q :
.ıP 0 ı/
P 1=2
For ı … , AıQ is identical to the set AıQ defined by expression (3.123). Show that for ıQ 2 ,
Q Q P Q Q AP Q is
ı
identical to the set AQ Q defined by expression (3.89) or (3.90) or, equivalently, by the expression
ı
AQıQ D fP 2 R1 W jP ıQ 0 ˛/j Q 1=2 rg:
O  .ıQ 0 ı/ (E.2)

Exercise 15. Taking the sets  and , Q the matrix C, and the random vector t to be as defined
in Section 7.3, supposing that the set fı 2  W ı 0 Cı ¤ 0g consists of a finite number of
vectors ı1 ; ı2 ; : : : ; ıQ , and letting K represent a Q  Q (correlation) matrix with ij th element
.ıi0 Cıi / 1=2 .ıj0 Cıj / 1=2 ıi0 Cıj , show that
jıQ 0 tj
max D max.ju1 j; ju2 j; : : : ; juQ j/;
Q 
fı2 Q
Q W ı¤0g .ıQ 0 ı/
Q 1=2
where u1 ; u2 ; : : : ; uQ are the elements of a random vector u that has an MV t.N P; K/ distribution.
Exercise 16. Define , Q c P , and t as in Section 7.3c [so that t is an M 1 random vector that has
an MV t.N P ; IM / distribution]. Show that
ıQ 0 t
 
Pr max > c P  P =2;
Q 
fı2 Q
Q W ı¤0g .ıQ 0 ı/
Q 1=2
with equality holding if and only if there exists a nonnull M  1 vector ıR (of norm 1) such that
Q 1=2 ıQ D ıR for every nonnull vector ıQ in .
.ıQ 0 ı/ Q

Exercise 17.
(a) Letting E1, E2, ..., EL represent any events in a probability space and (for any event E) denoting
by Ē the complement of E, verify the following (Bonferroni) inequality:
$$\Pr(\bar E_1 \cap \bar E_2 \cap \cdots \cap \bar E_L) \ge 1 - \sum_{i=1}^{L} \Pr(E_i).$$

(b) Take the context to be that of Section 7.3c, where y is an N 1 observable random vector that
follows a G–M model with N P model matrix X of rank P and where  D ƒ0ˇ is an M 1
vector of estimable linear combinations of the elements of ˇ (such that ƒ ¤ 0). Further, suppose
that the distribution of the vector e of residual effects is N.0;  2 I/ (or is some other spherically
symmetric distribution with mean vector 0 and variance-covariance matrix  2 I), let O represent
the least squares estimator of , let C D ƒ0.X0 X/ ƒ, let ı1 ; ı2 ; : : : ; ıL represent M 1 vectors
of constants such that (for i D 1; 2; : : : ; L) ıi0 Cıi > 0, let O represent the positive square root of
the usual estimator of  2 (i.e., the estimator obtained upon dividing the residual sum of squares
by N P ), and let P1 ; P2 ; : : : ; PL represent positive scalars such that L i D1 Pi D P . And (for
P
i D 1; 2; : : : ; L) denote by Ai .y/ a confidence interval for ıi0  with end points
ıi0 O ˙ .ıi0 Cıi /1=2 O tNPi =2 .N P /:
Use the result of Part (a) to show that
PrŒıi0  2 Ai .y/ .i D 1; 2; : : : ; L/  1 P
and hence that the intervals A1 .y/; A2 .y/; : : : ; AL .y/ are conservative in the sense that their
probability of simultaneous coverage is greater than or equal to 1 P—when P1 D P2 D    D PL ,
the end points of interval Ai .y/ become
ıi0 O ˙ .ıi0 Cıi /1=2 O tNP =.2L/ .N P /;
and the intervals A1 .y/; A2 .y/; : : : ; AL .y/ are referred to as Bonferroni t-intervals.
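A computational sketch of the equal-allocation Bonferroni t-intervals of Part (b) (in Python, with illustrative argument names) is as follows:

```python
import numpy as np
from scipy.stats import t as t_dist

def bonferroni_t_intervals(deltas, lam_hat, C, sigma_hat, df, gamma):
    """Bonferroni t-intervals with equal allocation gamma_i = gamma / L.
    deltas : (L, M) array whose rows are delta_1, ..., delta_L
    lam_hat: (M,) least squares estimate; C: (M, M) matrix; df = N - rank(X).
    Returns an (L, 2) array whose i-th row holds the end points of A_i(y)."""
    L = deltas.shape[0]
    crit = t_dist.ppf(1 - gamma / (2 * L), df)       # upper gamma/(2L) point of St(df)
    center = deltas @ lam_hat
    halfwidth = crit * sigma_hat * np.sqrt(np.einsum('ij,jk,ik->i', deltas, C, deltas))
    return np.column_stack([center - halfwidth, center + halfwidth])
```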
Exercise 18. Suppose that the data (of Section 4.2b) on the lethal dose of ouabain in cats are regarded
as the observed values of the elements y1; y2 ; : : : ; yN of an N.D 41/-dimensional observable random
vector y that follows a G–M model. Suppose further that (for i D 1; 2; : : : ; 41) E.yi / D ı.ui /, where
u1 ; u2 ; : : : ; u41 are the values of the rate u of injection and where ı.u/ is the third-degree polynomial
ı.u/ D ˇ1 C ˇ2 u C ˇ3 u2 C ˇ4 u3:
And suppose that the distribution of the vector e of residual effects is N.0;  2 I/ (or is some other
spherically symmetric distribution with mean vector 0 and variance-covariance matrix  2 I).
(a) Compute the values of the least squares estimators ˇO1 ; ˇO2 ; ˇO3 , and ˇO4 of ˇ1 ; ˇ2 ; ˇ3 , and ˇ4 ,
respectively, and the value of the positive square root O of the usual unbiased estimator of
 2 —it follows from the results of Section 5.3d that P .D rank X/ D P D 4, in which case
N P D N P D 37, and that ˇ1 ; ˇ2 ; ˇ3 , and ˇ4 are estimable.
(b) Find the values of tN:05 .37/ and Œ4FN:10 .4; 37/1=2, which would be needed if interval Iu.1/.y/ with
end points (3.169) and interval Iu.2/.y/ with end points (3.170) (where in both cases P is taken
to be :10) were used to construct confidence bands for the response surface ı.u/.
(c) By (for example) making use of the results in Liu’s (2011) Appendix E, compute Monte Carlo
approximations to the constants c:10 and c:10 
that would be needed if interval Iu.3/.y/ with end
.4/
points (3.171) and interval Iu .y/ with end points (3.172) were used to construct confidence
bands for ı.u/; compute the approximations for the case where u is restricted to the interval 1 
u  8, and (for purposes of comparison) also compute c:10 for the case where u is unrestricted.
O
(d) Plot (as a function of u) the value of the least squares estimator ı.u/ D ˇO1 C ˇO2 uC ˇO3 u2 C ˇO4 u3
and (taking P D :10) the values of the end points (3.169) and (3.170) of intervals Iu.1/.y/
.2/
and Iu .y/ and the values of the approximations to the end points (3.171) and (3.172) of
intervals Iu.3/.y/ and Iu.4/.y/ obtained upon replacing c:10 and c:10

with their Monte Carlo

approximations—assume (for purposes of creating the plot and for approximating c:10 and c:10 )
that u is restricted to the interval 1  u  8.

Exercise 19. Taking the setting to be that of the final four parts of Section 7.3b (and adopting
the notation and terminology employed therein) and taking GQ 2 to be the group of transformations
consisting of the totality of the groups GQ 2 .˛.0/ / (˛.0/ 2 RM ) and GQ 3 the group consisting of
Q
the totality of the groups GQ 3 .˛.0/ / (˛.0/ 2 RM ), show that (1) if a confidence set A.z/ for ˛ is
Q Q
equivariant with respect to the groups G0 and G2 .0/, then it is equivariant with respect to the group
GQ 2 and (2) if a confidence set A.z/
Q for ˛ is equivariant with respect to the groups GQ 0 and GQ 3 .0/,
then it is equivariant with respect to the group GQ 3 .
Exercise 20. Taking the setting to be that of Section 7.4e (and adopting the assumption of normality
and the notation and terminology employed therein), suppose that M D 1, and write ˛O for ˛, O ˛ for
˛, and ˛ .0/ for ˛.0/. Further, let .Q ˛; O d/ represent the critical function of an arbitrary (possibly
O ;
randomized) level- P test of the null hypothesis HQ 0 W ˛ D ˛ .0/ versus the alternative hypothesis
HQ 1 W ˛ ¤ ˛ .0/, and let Q .˛; ; / represent its power function fso that Q .˛; ; / D EŒ .Q ˛;
O ;
O d/g.
And define s D .˛O ˛ .0/ /=Œ.˛O ˛ .0/ /2 C d0 d1=2 and w D .˛O ˛ .0/ /2 C d0 d, denote by .s; R w; / O
the critical function of a level- P test (of HQ 0 versus HQ 1 ) that depends on ˛, O and d only through the
O ,
values of s, w, and ,O and write E0 for the expectation operator E in the special case where ˛ D ˛ .0/.
Q ˛;
(a) Show that if the level- P test with critical function . O d/ is an unbiased test, then
O ;
Q .˛ .0/; ; / D P for all  and  (E.3)
and @ Q .˛; ; / ˇˇ
ˇ
D 0 for all  and : (E.4)
@˛ ˇ
˛D˛ .0/
(b) Show that  


@ Q .˛; ; / ˛O ˛ Q
DE O ;
.˛; O d/ : (E.5)
@˛ 2
(c) Show that corresponding to the level- P test (of HQ 0 ) with critical function .
Q ˛; O d/, there is a
O ;
(level- P ) test that depends on ˛, O and d only through the values of s, w, and O and that has the
O ,
same power function as the test with critical function .Q ˛;
O ;
O d/.
(d) Show that when ˛ D ˛ .0/, (1) w and O form a complete sufficient statistic and (2) s is statistically
independent of w and O and has an absolutely continuous distribution, the pdf of which is the
pdf h./ given by result (6.4.7).
(e) Show that when the critical function . Q ˛; R w; /,
O d/ (of the level- P test) is of the form .s;
O ; O
condition (E.3) is equivalent to the condition
R w; /
E0 Π.s; O D P (wp1)
O j w;  (E.6)
and condition (E.4) is equivalent to the condition
E0 Œsw 1=2 .s;
R w; / O D 0 (wp1).
O j w;  (E.7)

(f) Using the generalized Neyman–Pearson lemma (Lehmann and Romano 2005b, sec. 3.6; Shao
R w; /
2010, sec. 6.1.1), show that among critical functions of the form .s; O that satisfy (for any
particular values of w and )
O the conditions
R w; /
E0 Œ .s; O D P and E0 Œsw 1=2 .s;
O j w;  R w; /O j w; 
O D 0; (E.8)
R w; /
the value of EŒ .s; O j w; O (at those particular values of w and )O is maximized [for any
particular value of ˛ (¤ ˛ .0/ ) and any particular values of  and ] when the critical function is
taken to be the critical function R .s; w; /
O defined (for all s, w, and )
O as follows:
(
1; if s < c or s > c,
R .s; w; /
O D
0; if c  s  c,
where c is the upper 100. P =2/% point of the distribution with pdf h./ [given by result (6.4.7)].
(g) Use the results of the preceding parts to conclude that among all level- P tests of HQ 0 versus HQ 1
that are unbiased, the size- P two-sided t test is a UMP test.

Exercise 21. Taking the setting to be that of Section 7.4e and adopting the assumption of normality
and the notation and terminology employed therein, let Q .˛; ; / represent the power function of a
size- P similar test of H0 W  D  .0/ or HQ 0 W ˛ D ˛.0/ versus H1 W  ¤  .0/ or HQ 1 W ˛ ¤ ˛.0/. Show
that min˛2S.˛.0/;/ Q .˛; ; / attains its maximum value when the size- P similar test is taken to be
the size- P F test.
Exercise 22. Taking the setting to be that of Section 7.4e and adopting the assumption of normality
and the notation and terminology employed therein, let . Q ˛; O d/ represent the critical function
O ;
Q
of an arbitrary size- P test of the null hypothesis H0 W ˛ D ˛.0/ versus the alternative hypothesis
HQ 1 W ˛ ¤ ˛.0/. Further, let Q .  ;  ;  I /
Q represent the power function of the test with critical function
Q Q
.  ;  ;  /, so that Q .˛; ; I / D E Œ. Q ˛; O d/. And take Q .  ;  ;  / to be the function defined as
O ;
follows:
Q .˛; ; / D supQ Q .˛; ; I /:
Q
This function is called the envelope power function.
(a) Show that Q .˛; ; / depends on ˛ only through the value of .˛ ˛.0/ /0 .˛ ˛.0/ /.
(b) Let QF .˛; O d/ represent the critical function of the size- P F test of HQ 0 versus HQ 1 . And as a
O ;
Q  ;  ;  /, consider the use of the criterion
basis for evaluating the test with critical function .
max Œ Q .˛; ; / Q .˛; ; I /; Q (E.9)
˛2S.˛.0/; /
which reflects [for ˛ 2 S.˛.0/; /] the extent to which the power function of the test deviates
from the envelope power function. Using the result of Exercise 21 (or otherwise), show that the
size- P F test is the “most stringent” size- P similar test in the sense that (for “every” value of )
the value attained by the quantity (E.9) when Q D QF is a minimum among those attained when
Q is the critical function of some (size- P ) similar test.

Exercise 23. Take the setting to be that of Section 7.5a (and adopt the assumption of normality and
the notation and terminology employed therein). Show that among all tests of the null hypothesis
H0C W    .0/ or HQ 0C W ˛  ˛ .0/ (versus the alternative hypothesis H1C W  >  .0/ or HQ 1C W ˛ > ˛ .0/ )
that are of level P and that are unbiased, the size- P one-sided t test is a UMP test. (Hint. Proceed
stepwise as in Exercise 20.)
Exercise 24.
(a) Let (for an arbitrary positive integer M ) fM ./ represent the pdf of a 2 .M / distribution. Show
that (for 0 < x < 1)
xfM .x/ D MfM C2 .x/:
(b) Verify [by using Part (a) or by other means] result (6.22).

Exercise 25. This exercise is to be regarded as a continuation of Exercise 18. Suppose (as in
Exercise 18) that the data (of Section 4.2 b) on the lethal dose of ouabain in cats are regarded as
the observed values of the elements y1 ; y2 ; : : : ; yN of an N.D 41/-dimensional observable random
vector y that follows a G–M model. Suppose further that (for i D 1; 2; : : : ; 41) E.yi / D ı.ui /,
where u1 ; u2 ; : : : ; u41 are the values of the rate u of injection and where ı.u/ is the third-degree
polynomial
ı.u/ D ˇ1 C ˇ2 u C ˇ3 u2 C ˇ4 u3:
And suppose that the distribution of the vector e of residual effects is N.0;  2 I/.
(a) Determine for P D 0:10 and also for P D 0:05 (1) the value of the 100.1 P /% lower confidence
bound for  provided by the left end point of interval (6.2) and (2) the value of the 100.1 P /%
upper confidence bound for  provided by the right end point of interval (6.3).
(b) Obtain [via an implementation of interval (6.23)] a 90% two-sided strictly unbiased confidence
interval for .

Exercise 26. Take the setting to be that of the final part of Section 7.6c, and adopt the notation and
terminology employed therein. In particular, take the canonical form of the G–M model to be that
identified with the special case where M D P , so that ˛ and ˛O are P -dimensional. Show that the
(size- P ) test of the null hypothesis H0C W   0 (versus the alternative hypothesis H1C W  > 0 )
with critical region C C is UMP among all level- P tests. Do so by carrying out the following steps.
(a) Let .T; ˛/O represent the critical function of a level- P test of H0C versus H1C [that depends on
the vector d only through the value of T (D d0 d=02 )]. And let .; ˛/ represent the power
function of the test with critical function .T; ˛/.
O Further, let  represent any particular value
of  greater than 0 , let ˛ represent any particular value of ˛, and denote by h.  I / the pdf of
the distribution of T, by f .  I ˛; / the pdf of the distribution of ˛,
O and by s./ the pdf of the
N.˛ ; 2 02 / distribution. Show (1) that
Z
.0 ; ˛/ s.˛/ d ˛  P (E.10)
RP
and (2) that
Z Z Z 1
.0 ; ˛/ s.˛/ d ˛ D O h.tI 0 /f .˛I
.t; ˛/ O ˛ ;  / dt d ˛:
O (E.11)
RP RP 0
(b) By, for example, using a version of the Neyman–Pearson lemma like that stated by Casella and
Berger (2002) in the form of their Theorem 8.3.12, show that among those choices for the critical
function .T; ˛/
O for which the power function .; / satisfies condition (E.10), . ; ˛ / can
be maximized by taking .T; ˛/ O to be the critical function  .T; ˛/
O defined as follows:
(
1; when t > N 2P ,
O D
 .t; ˛/
0; when t  N 2P .

(c) Use the results of Parts (a) and (b) to reach the desired conclusion, that is, to show that the test
of H0C (versus H1C ) with critical region C C is UMP among all level- P tests.

Exercise 27. Take the context to be that of Section 7.7a, and adopt the notation employed therein.
Using Markov’s inequality (e.g., Casella and Berger 2002, lemma 3.8.3; Bickel and Doksum 2001,
sec. A.15) or otherwise, verify inequality (7.12), that is, the inequality
$$\Pr(x_i > c \text{ for } k \text{ or more values of } i) \le (1/k)\sum_{i=1}^{M} \Pr(x_i > c).$$

Exercise 28.
(a) Letting X represent any random variable whose values are confined to the interval [0, 1] and
letting θ (0 < θ < 1) represent a constant, show (1) that
$$E(X) \le \theta \Pr(X \le \theta) + \Pr(X > \theta) \tag{E.12}$$
and then use inequality (E.12) along with Markov's inequality (e.g., Casella and Berger 2002,
sec. 3.8) to (2) show that
$$\frac{E(X) - \theta}{1 - \theta} \le \Pr(X > \theta) \le \frac{E(X)}{\theta}. \tag{E.13}$$
(b) Show that the requirement that the false discovery rate (FDR) satisfy condition (7.45) and the
requirement that the false discovery proportion (FDP) satisfy condition (7.46) are related as
follows:
(1) if FDR  ı, then Pr.FDP > /  ı=; and
(2) if Pr.FDP > /  , then FDR   C .1 /.

Exercise 29. Taking the setting to be that of Section 7.7 and adopting the terminology and notation
employed therein, consider the use of a multiple-comparison procedure in testing (for every i 2 I D
.0/ .0/ .1/ .0/
f1; 2; : : : ; M g) the null hypothesis Hi W i D i versus the alternative hypothesis Hi W i ¤ i
(or Hi.0/ W i  i.0/ versus Hi.1/ W i > i.0/ ). And denote by T the set of values of i 2 I for which
Hi.0/ is true and by F the set for which Hi.0/ is false. Further, denote by MT the size of the set T
and by XT the number of values of i 2 T for which Hi.0/ is rejected. Similarly, denote by MF the
size of the set F and by XF the number of values of i 2 F for which Hi.0/ is rejected. Show that
(a) in the special case where MT D 0, FWER D FDR D 0;
(b) in the special case where MT D M, FWER D FDR; and
(c) in the special case where 0 < MT < M, FWER  FDR, with equality holding if and only if
Pr.XT > 0 and XF > 0/ D 0.

Exercise 30.
(a) Let pO1 ; pO2 ; : : : ; pO t represent p-values [so that Pr.pOi  u/  u for i D 1; 2; : : : ; t and for every
u 2 .0; 1/]. Further, let pO.j / D pOij (j D 1; 2; : : : ; t), where i1 ; i2 ; : : : ; i t is a permutation of the
first t positive integers 1; 2; : : : ; t such that pOi1  pOi2      pOi t . And let s represent a positive
integer such that s  t, and let c0 ; c1 ; : : : ; cs represent constants such that 0 D c0  c1     


cs  1. Show that
Pr.pO.j /  cj for 1 or more values of j 2 f1; 2; : : : ; sg/  t js D1 .cj cj 1 /=j: (E.14)
P

(b) Take the setting to be that of Section 7.7d, and adopt the notation and terminology employed
therein. And suppose that the ˛j ’s of the step-down multiple-comparison procedure for testing
the null hypotheses H1.0/; H2.0/; : : : ; HM
.0/
are of the form
˛j D tN.ŒjC1/ P =f2.M CŒjC1 j /g .N P / .j D 1; 2; : : : ; M / (E.15)
[where P 2 .0; 1/].
(1) Show that if
Pr.jtuIT j > tNu P =.2MT / .N P / for 1 or more values of u 2 f1; 2; : : : ; Kg/  ; (E.16)
then the step-down procedure [with ˛j ’s of the form (E.15)] is such that Pr.FDP > /  .
(2) Reexpress the left side of inequality (E.16) in terms of the left side of inequality (E.14).
(3) Use Part (a) to show that
Pr.jtuIT j > tNu P =.2MT / .N P /
PŒM C1
for 1 or more values of u 2 f1; 2; : : : ; Kg/  P uD1 1=u: (E.17)
(4) Show that the version of the step-down procedure [with ˛j ’s of the form (E.15)] obtained
upon setting P D = ŒM
P C1
uD1 1=u is such that Pr.FDP > /  .

Exercise 31. Take the setting to be that of Section 7.7e, and adopt the notation and terminology em-
ployed therein. And take ˛1 ; ˛2 ; : : : ; ˛M

to be scalars defined implicitly (in terms of ˛1 ; ˛2 ; : : : ; ˛M )
by the equalities
˛k0 D jkD1 ˛j (E.18)
P
.k D 1; 2; : : : ; M /
or explicitly as (
Pr ˛j 1  jtj > ˛j ; for j D 2; 3; : : : ; M,

˛j D
Pr jtj > ˛j ; for j D 1,


where t  S t.N P /.

(a) Show that the step-up procedure for testing the null hypotheses Hi.0/ W i D i.0/ (i D 1; 2; : : : ;
M ) is such that (1) the FDR is less than or equal to MT jMD1 ˛j =j ; (2) when M jMD1 ˛j =j < 1,
P P

the FDR is controlled at level M jMD1 ˛j =j (regardless of the identity of the set T ); and (3) in
P

the special case where (for j D 1; 2; : : : ; M ) ˛j0 is of the form ˛j0 D j P =M, the FDR is less than
 1
or equal to P .MT =M / jMD1 1=j and can be controlled at level ı by taking P D ı
PM
.
P
j D1 1=j
PM
(b) The sum j D1 1=j is “tightly” bounded from above by the quantity
C log.M C0:5/ C Œ24.M C0; 5/2  1; (E.19)
where is the Euler–Mascheroni constant (e.g., Chen 2010)—to 10 significant digits, D
0:5772156649. Determine the value of jMD1 1=j and the amount by which this value is exceeded
P
by the value of expression (E.19). Do so for each of the following values of M : 5, 10, 50, 100,
500, 1;000, 5;000, 10;000, 20;000, and 50;000.
(c) What modifications are needed to extend the results encapsulated in Part (a) to the step-up
procedure for testing the null hypotheses Hi.0/ W i  i.0/ (i D 1; 2; : : : ; M ).

Exercise 32. Take the setting to be that of Section 7.7, and adopt the terminology and notation
employed therein. Further, for j D 1; 2; : : : ; M , let
˛Pj D tNkj P =Œ2.M Ckj j / .N P /;
where [for some scalar  (0 <  < 1)] kj D ŒjC1, and let
˛Rj D tNj P =.2M / .N P /:
And consider two stepwise multiple-comparison procedures for testing the null hypotheses
H1.0/; H2.0/; : : : ; HM
.0/
: a stepwise procedure for which ˛j is taken to be of the form ˛j D ˛Pj [as
in Section 7.7d in devising a step-down procedure for controlling Pr.FDP > /] and a stepwise
procedure for which ˛j is taken to be of the form ˛j D ˛Rj (as in Section 7.7e in devising a step-up
procedure for controlling the FDR). Show that (for j D 1; 2; : : : ; M ) ˛Pj  ˛Rj , with equality holding
if and only if j  1=.1 / or j D M.
Exercise 33. Take the setting to be that of Section 7.7f, and adopt the terminology and notation
employed therein. And consider a multiple-comparison procedure in which (for i D 1; 2; : : : ; M )
the i th of the M null hypotheses H1.0/; H2.0/; : : : ; HM
.0/
is rejected if jti.0/ j > c, where c is a strictly
positive constant. Further, recall that T is the subset of the set I D f1; 2; : : : ; M g such that i 2 T if
.0/ .0/
Hi is true, denote by R the subset of I such that i 2 R if Hi is rejected, and (for i D 1; 2; : : : ; M )
take Xi to be a random variable defined as follows:
(
.0/
1; if jti j > c,
Xi D
0; if jti.0/ j  c.
(a) Show that
EŒ.1=M / Xi  D .MT =M / Pr.jtj > c/  Pr.jtj > c/;
P
i 2T
where t  S t.100/.
PM
(b) Based on the observation that [when .1=M / i D1 Xi > 0]
P
.1=M / i 2T Xi
FDP D PM ;
.1=M / i D1 Xi

on the reasoning that for large M the quantity .1=M / M i D1 Xi can be regarded as a (strictly
P
positive) constant, and on the result of Part (a), the quantity MT Pr.jtj > c/=MR can be regarded
as an “estimator” of the FDR [D E.FDP/] and M Pr.jtj > c/=MR can be regarded as an estimator
of maxT FDR (Efron 2010, chap. 2)—if MR D 0, take the estimate of the FDR or of maxT FDR
to be 0. Consider the application to the prostate data of the multiple-comparison procedure in
the case where c D c P .k/ and also in the case where c D tNk P=.2M / .100/. Use the information
provided by the entries in Table 7.5 to obtain an estimate of maxT FDR for each of these two
cases. Do so for P D :05; :10; and :20 and for k D 1; 5; 10; and 20.

Exercise 34. Take the setting to be that of Part 6 of Section 7.8a (pertaining to the testing of $H_0\colon w \in S_0$ versus $H_1\colon w \in S_1$), and adopt the notation and terminology employed therein.
(a) Write $p_0$ for the random variable $p_0(y)$, and denote by $G_0(\cdot)$ the cdf of the conditional distribution of $p_0$ given that $w \in S_0$. Further, take $k$ and $c$ to be the constants that appear in the definition of the critical function $\phi^*(\cdot)$, take $k'' = [1 + (\pi_1/\pi_0)k]^{-1}$, and take $\phi^{**}(\cdot)$ to be a critical function defined as follows:
$$\phi^{**}(y) = \begin{cases} 1, & \text{when } p_0(y) < k'',\\ c, & \text{when } p_0(y) = k'',\\ 0, & \text{when } p_0(y) > k''.\end{cases}$$
Show that (1) $\phi^{**}(y) = \phi^*(y)$ when $f(y) > 0$, (2) that $k''$ equals the smallest scalar $p'$ for which $G_0(p') \ge \dot\gamma$, and (3) that
$$c = \frac{\dot\gamma - \Pr(p_0 < k'' \mid w \in S_0)}{\Pr(p_0 = k'' \mid w \in S_0)} \quad \text{when } \Pr(p_0 = k'' \mid w \in S_0) > 0;$$
when $\Pr(p_0 = k'' \mid w \in S_0) = 0$, $c$ can be chosen arbitrarily.
(b) Show that if the joint distribution of $w$ and $y$ is MVN, then there exists a version of the critical function $\phi^*(\cdot)$ defined by equalities (8.25) and (8.26) for which $\phi^*(y)$ depends on the value of $y$ only through the value of $\tilde w(y) = \mu + V_{yw}'V_y^{-1}y$ (where $\mu = \mu_w - V_{yw}'V_y^{-1}\mu_y$).
(c) Suppose that $M = 1$ and that $S_0 = \{w : \ell \le w \le u\}$, where $\ell$ and $u$ are (known) constants (with $\ell < u$). Suppose also that the joint distribution of $w$ and $y$ is MVN and that $v_{yw} \ne 0$. And letting $\tilde w = \tilde w(y) = \mu + v_{yw}'V_y^{-1}y$ (with $\mu = \mu_w - v_{yw}'V_y^{-1}\mu_y$) and $\tilde v = v_w - v_{yw}'V_y^{-1}v_{yw}$, define
$$d = d(y) = F\{[u - \tilde w(y)]/\tilde v^{1/2}\} - F\{[\ell - \tilde w(y)]/\tilde v^{1/2}\},$$
where $F(\cdot)$ is the cdf of the $N(0, 1)$ distribution. Further, let
$$C_0 = \{y \in \mathbb{R}^N : d(y) < \ddot d\,\},$$
where $\ddot d$ is the lower $100\dot\gamma\%$ point of the distribution of the random variable $d$. Show that among all $\dot\gamma$-level tests of the null hypothesis $H_0$, the nonrandomized $\dot\gamma$-level test with critical region $C_0$ achieves maximum power.
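Under the MVN assumptions of Part (c), $\tilde w(y)$, $\tilde v$, and $d(y)$ can be computed directly, and the lower $100\dot\gamma\%$ point of the distribution of $d$ can be approximated by simulating $y$ from a normal distribution. The following R sketch uses entirely hypothetical values of $\mu_w$, $\mu_y$, $v_w$, $v_{yw}$, and $V_y$, and it takes the relevant distribution of $d$ to be the one induced by the marginal $N(\mu_y, V_y)$ distribution of $y$ (which distribution is meant depends on the definitions in Section 7.8a).

```r
# Monte Carlo sketch of the test in Exercise 34(c); all numerical inputs are hypothetical.
set.seed(1)
mu_w <- 0; mu_y <- c(0, 0)                     # means of w and y
v_w  <- 2; v_yw <- c(0.8, 0.5)                 # var(w) and cov(y, w)
V_y  <- matrix(c(1, 0.3, 0.3, 1), 2, 2)        # var(y)
l <- -1; u <- 1                                # S0 = {w : l <= w <= u}
gdot <- 0.05                                   # test level (gamma_dot)
A <- solve(V_y)
v_tilde <- v_w - drop(t(v_yw) %*% A %*% v_yw)  # conditional variance of w given y
d_fun <- function(y) {
  w_tilde <- mu_w + drop(t(v_yw) %*% A %*% (y - mu_y))  # E(w | y)
  pnorm((u - w_tilde) / sqrt(v_tilde)) - pnorm((l - w_tilde) / sqrt(v_tilde))
}
# Approximate the lower 100*gdot% point of the distribution of d = d(y).
R <- chol(V_y)
ysim  <- t(mu_y + t(R) %*% matrix(rnorm(2 * 50000), 2))  # draws from N(mu_y, V_y)
d_sim <- apply(ysim, 1, d_fun)
d_crit <- quantile(d_sim, gdot)                # lower 100*gdot% point of d
# The test rejects H0: w in S0 when d(y_obs) < d_crit, e.g.:
d_fun(c(1.5, 1.2)) < d_crit
```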

Bibliographic and Supplementary Notes


§1. The term standard error is used herein to refer to the standard deviation of an estimator. In some
presentations, this term is used to refer to what is herein referred to as an estimated standard error.
§3b. The results of Lemmas 7.3.1, 7.3.2, and 7.3.3 are valid for noninteger degrees of freedom as well as
integer degrees of freedom. The practice adopted herein of distinguishing between equivariance and invariance
is consistent with that adopted by, e.g., Lehmann and Romano (2005b).
§3c. In his account of the S method, Scheffé (1959) credits Holbrook Working and Harold Hotelling with
having devised a special case of the method applicable to obtaining confidence intervals for the points on a
regression line.
§3c, §3e, and §3f. The approach to the construction of simultaneous confidence intervals described herein
is very similar to the approach described (primarily in the context of confidence bands) by Liu (2011).
§3e. The discussion of the use of Monte Carlo methods to approximate the percentage point $c_{\dot\gamma}$ or $c^*_{\dot\gamma}$ is patterned after the discussion of Edwards and Berry (1987, sec. 2). In particular, Lemmas 7.3.4 and 7.3.5 are nearly identical to Edwards and Berry's Lemmas 1 and 2.
§3e and §3f. Hsu and Nelson (1990 and 1998) described ways in which a control variate could be used to advantage in obtaining a Monte Carlo approximation to the upper $100\dot\gamma\%$ point of the distribution of a random variable like $\max_{\{\tilde\delta\in\tilde\Delta:\,\tilde\delta\ne 0\}} |\tilde\delta' t|/(\tilde\delta'\tilde\delta)^{1/2}$ or $\max_{\{\tilde\delta\in\tilde\Delta\}} |\tilde\delta' t|$. A control variate is a random variable having a distribution whose upper $100\dot\gamma\%$ point can be determined “exactly” and that is “related to” the distribution whose upper $100\dot\gamma\%$ point is to be approximated. Result (3.154) suggests that for purposes of obtaining a Monte Carlo approximation to $c_{\dot\gamma}$ [which is the upper $100\dot\gamma\%$ point of the distribution of $\max_{\{\tilde\delta\in\tilde\Delta:\,\tilde\delta\ne 0\}} |\tilde\delta' t|/(\tilde\delta'\tilde\delta)^{1/2}$], the random variable $t't/M$ [which has an $SF(M, N-P)$ distribution] could serve as a control variate. The K = 599999 draws used (in Subsection f) in obtaining a Monte Carlo approximation to $c_{.10}$ were such that, subsequent to rearranging the 599999 sample values of $t't/M$ from smallest to largest, the 540000th of these values was 2.317968; in actuality, this value is the upper 10.052% point of the $SF(10, 10)$ distribution, not the upper 10.000% point.
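The kind of check reported above (comparing the 540000th ordered value of $t't/M$ with the exact upper 10% point of its distribution) can be mimicked in a few lines of R. The sketch below is an independent re-simulation using F-distributed draws standing in for $t't/M$, so it will not reproduce the value 2.317968 quoted in the passage; the numbers of draws and the degrees of freedom simply mirror those quoted there.

```r
# Empirical vs. exact upper 10% point of t't/M ~ SF(M, N - P), mirroring the passage.
set.seed(1)
K  <- 599999
M  <- 10; df2 <- 10                 # degrees of freedom as quoted for the SF(10, 10) reference
draws <- sort(rf(K, M, df2))        # K simulated values playing the role of t't/M
draws[540000]                       # empirical upper 10% point (the 540000th ordered value)
qf(0.90, M, df2)                    # exact upper 10% point of SF(10, 10)
1 - pf(draws[540000], M, df2)       # exact upper-tail probability of the empirical point
```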
§3f. In computing the Monte Carlo approximations to $c_{.10}$ and $c^*_{.10}$, the value of $\max_{\{u\in U\}} |[x(u)]'W't| \,/\, \{[x(u)]'(X'X)^{-1}x(u)\}^{1/2}$ and the value of $\max_{\{u\in U\}} |[x(u)]'W't|$ had to be computed for each of the 599999 draws from the distribution of $t$. The value of $\max_{\{u\in U\}} |[x(u)]'W't| \,/\, \{[x(u)]'(X'X)^{-1}x(u)\}^{1/2}$ was determined by (1) using an implementation (algorithm nlxb in the R package nlmrt) of a nonlinear least squares algorithm proposed (as a variant on the Marquardt procedure) by Nash (1990) to find the minimum value of the sum of squares $[t - \theta Wx(u)]'[t - \theta Wx(u)]$ with respect to the scalar $\theta$ and $u$ (subject to the constraint $u \in U$) and by then (2) exploiting relationship (3.173). The value of $\max_{\{u\in U\}} |[x(u)]'W't|$ was determined by finding the maximum values of $[x(u)]'W't$ and $[x(u)]'W'(-t)$ with respect to $u$ (subject to the constraint $u \in U$) and by taking the value of $\max_{\{u\in U\}} |[x(u)]'W't|$ to be the larger of these two values. The determination of the maximum value of each of the quantities $[x(u)]'W't$ and $[x(u)]'W'(-t)$ (subject to the constraint $u \in U$) was based on the observation that either the maximum value is attained at one of the 8 values of $u$ such that $u_i = \pm 1$ ($i = 1, 2, 3$) or it is attained at a value of $u$ such that (1) one, two, or three of the components of $u$ are less than 1 in absolute value, (2) the first-order partial derivatives of $[x(u)]'W't$ or $[x(u)]'W'(-t)$ with respect to these components equal 0, and (3) the matrix of second-order partial derivatives with respect to these components is negative definite; a square matrix A is said to be negative definite if $-$A is positive definite.
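A rough alternative to the scheme just described is to run a box-constrained optimizer from several starting points, including the 8 corners of the cube $U$ with $|u_i| \le 1$ ($i = 1, 2, 3$). In the R sketch below, the function x_fun and the quantities W and t_draw are placeholders standing in for the quantities defined in Sections 7.3e and 7.3f; this is an illustrative sketch, not the computation actually used in the text.

```r
# Maximize [x(u)]' W' t over the cube U = [-1, 1]^3 via multi-start box-constrained optimization.
# x_fun, W, and t_draw are placeholders for the quantities defined in Sections 7.3e and 7.3f.
maximize_over_cube <- function(x_fun, W, t_draw) {
  obj <- function(u) drop(crossprod(x_fun(u), crossprod(W, t_draw)))  # [x(u)]' W' t
  corners <- as.matrix(expand.grid(c(-1, 1), c(-1, 1), c(-1, 1)))     # the 8 corners of U
  starts  <- rbind(corners, c(0, 0, 0))                               # corners plus an interior start
  best <- -Inf
  for (i in seq_len(nrow(starts))) {
    fit <- optim(starts[i, ], obj, method = "L-BFGS-B",
                 lower = rep(-1, 3), upper = rep(1, 3),
                 control = list(fnscale = -1))                        # fnscale = -1 => maximize
    best <- max(best, fit$value)
  }
  best
}
# max_u |[x(u)]' W' t| is then the larger of maximize_over_cube(x_fun, W, t_draw)
# and maximize_over_cube(x_fun, W, -t_draw).
```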
§4e. Refer to Chen, Hung, and Chen (2007) for some general discussion of the use of maximum average-
power as a criterion for evaluating hypothesis tests.
§4e and Exercise 20. The result that the size-$\dot\gamma$ 2-sided t test is a UMP level-$\dot\gamma$ unbiased test (of $H_0$ or $\tilde H_0$ versus $H_1$ or $\tilde H_1$) can be regarded as a special case of results on UMP tests for exponential families like those discussed by Lehmann and Romano (2005b) in their chapters 4 and 5.
§7a, §7b, and Exercise 27. The procedures proposed by Lehmann and Romano (2005a) for the control
of k-FWER served as “inspiration” for the results of Subsection a, for Exercise 27, and for some aspects of
what is presented in Subsection b. Approaches similar to the approach proposed by Lehmann and Romano for
controlling the rate of multiple false rejections were considered by Victor (1982) and by Hommel and Hoffmann
(1988).
§7b. Various of the results presented in this subsection are closely related to the results of Westfall and
Tobias (2007).
§7c. Benjamini and Hochberg’s (1995) paper has achieved landmark status. It has had a considerable impact
on statistical practice and has inspired a great deal of further research into multiple-comparison methods of a kind
better suited for applications to microarray data (and other large-scale applications) than the more traditional
kind of methods. The newer methods have proved to be popular, and their use in large-scale applications has
proved to be considerably more effective than that of the more traditional methods. Nevertheless, it could be
argued that even better results could be achieved by regarding and addressing the problem of screening or
discovery in a way that is more in tune with the true nature of the problem (than simply treating the problem as
one of multiple comparisons).
§7c and §7f. The microarray data introduced in Section 7.7c and used for illustrative purposes in Section
7.7f are the data referred to by Efron (2010, app. B) as the prostate data and are among the data made available
by him on his website. Those data were obtained by preprocessing the “raw” data from a study by Singh et al.
(2002). Direct information about the nature of the preprocessing does not seem to be available. Presumably, the
preprocessing was similar in nature to that described by Dettling (2004) and applied by him to the results of the
same study—like Efron, Dettling used the preprocessed data for illustrative purposes (and made them available
via the internet). In both cases, the preprocessed data are such that the data for each of the 50 normal subjects
and each of the 52 cancer patients has been centered and rescaled so that the average value is 0 and the sample
variance equals 1. Additionally, it could be argued that (in formulating the objectives underlying the analysis
of the prostate data in terms of multiple comparisons) it would be more realistic to take the null hypotheses
to be $H_s^{(0)}\colon |\tau_s| \le \tau_s^{(0)}$ ($s = 1, 2, \ldots, 6033$) and the alternative hypotheses to be $H_s^{(1)}\colon |\tau_s| > \tau_s^{(0)}$ ($s = 1, 2, \ldots, 6033$), where $\tau_s^{(0)}$ is a “threshold” such that (absolute) values of $\tau_s$ smaller than $\tau_s^{(0)}$ are regarded as “unimportant”; this presumes the existence of enough knowledge about the underlying processes that a suitable threshold is identifiable.
§7d. As discussed in this section, a step-down procedure for controlling $\Pr(\mathrm{FDP} > \gamma)$ at a specified level can be obtained by taking (for $j = 1, 2, \ldots, M$) $\alpha_j$ to be of the form (7.65) (in which case $\alpha_j$ can be regarded as a function of $\dot\gamma$) and by taking the value of $\dot\gamma$ to be the largest value that satisfies condition (7.70). Instead of taking $\alpha_j$ to be of the form (7.65), we could take it to be of the form $\alpha_j = \max_{S} c_{\dot\gamma}([\gamma j]+1;\, S)$, where the maximum is over the collection of subsets defined in Section 7.7b. This change could result in additional rejections (discoveries), though any such gains would come at the expense of greatly increased computational demands.
§7d and Exercises 28 and 30. The content of this section and these exercises is based to a considerable
extent on the results of Lehmann and Romano (2005a, sec. 3).
§7e and Exercises 29 and 31. The content of this section and these exercises is based to a considerable
extent on the results of Benjamini and Yekutieli (2001).
§7f. The results (on the total number of rejections or discoveries) reported in Table 7.5 for the conservative
counterpart of the k-FWER multiple-comparison procedure (in the case where $\dot\gamma = .05$) differ somewhat from
the results reported by Efron (2010, fig. 3.3). It is worth noting that the latter results are those for the case where
the tests of the M null hypotheses are one-sided tests rather than those for the case where the tests are two-sided
tests.
§8a. The approach taken (in the last 2 parts of Section 7.8a) in devising (in the context of prediction)
multiple-comparison procedures is based on the use of the closure principle (as generalized from control of the
FWER to control of the k-FWER). It would seem that this use of the closure principle (as generalized thusly)
in devising multiple-comparison procedures could be extended to other settings.
Exercise 17 (b). The Bonferroni t-intervals [i.e., the intervals $A_1(y), A_2(y), \ldots, A_L(y)$ in the special case where $\dot\gamma_1 = \dot\gamma_2 = \cdots = \dot\gamma_L$] can be regarded as having been obtained from interval (3.94) (where $\delta \in \Delta$ with $\Delta = \{\delta_1, \delta_2, \ldots, \delta_L\}$) by replacing $c_{\dot\gamma}$ with $\bar t_{\,\dot\gamma/(2L)}(N-P)$, which constitutes an upper bound for $c_{\dot\gamma}$. As discussed by Fuchs and Sampson (1987), intervals for $\delta_1'\tau, \delta_2'\tau, \ldots, \delta_L'\tau$ that are superior to the Bonferroni t-intervals are obtainable from interval (3.94) by replacing $c_{\dot\gamma}$ with $\bar t_{\,[1-(1-\dot\gamma)^{1/L}]/2}(N-P)$, which is a tighter upper bound for $c_{\dot\gamma}$ than $\bar t_{\,\dot\gamma/(2L)}(N-P)$. And an even greater improvement can be effected by replacing $c_{\dot\gamma}$ with the upper $100\dot\gamma\%$ point, say $\bar t_{\dot\gamma}(L, N-P)$, of the distribution of the random variable $\max(|t_1|, |t_2|, \ldots, |t_L|)$, where $t_1, t_2, \ldots, t_L$ are the elements of an L-dimensional random vector that has an $MVt(N-P, I_L)$ distribution; the distribution of $\max(|t_1|, |t_2|, \ldots, |t_L|)$ is that of the Studentized maximum modulus, and its upper $100\dot\gamma\%$ point is an even tighter upper bound (for $c_{\dot\gamma}$) than $\bar t_{\,[1-(1-\dot\gamma)^{1/L}]/2}(N-P)$. These alternatives to the Bonferroni t-intervals are based on the results of Šidák (1967 and 1968); refer, for example, to Khuri (2010, sec. 7.5.4) and to Graybill (1976, sec. 6.6) for some relevant details. As a simple example, consider the confidence intervals (3.143), for which the probability of simultaneous coverage is .90. In this example, $\dot\gamma = .10$, $L = 3$, $N-P = 10$, $\bar t_{\dot\gamma}(L, N-P) = c_{\dot\gamma} = 2.410$, $\bar t_{\,[1-(1-\dot\gamma)^{1/L}]/2}(N-P) = 2.446$, and $\bar t_{\,\dot\gamma/(2L)}(N-P) = 2.466$.
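For the numerical illustration just given ($\dot\gamma = .10$, $L = 3$, $N-P = 10$), the Bonferroni and Šidák-type bounds can be obtained directly from the t quantile function, and the Studentized-maximum-modulus point can be approximated by simulation. The sketch below assumes only base R; the simulated value is an approximation to the quoted 2.410.

```r
# Bonferroni, Sidak-type, and Studentized-maximum-modulus bounds for c_gamma
# in the example with gamma_dot = .10, L = 3, N - P = 10.
gdot <- 0.10; L <- 3; df <- 10
qt(1 - gdot / (2 * L), df)                      # Bonferroni bound: approx. 2.466
qt(1 - (1 - (1 - gdot)^(1 / L)) / 2, df)        # Sidak-type bound: approx. 2.446
# Studentized maximum modulus: max(|t_1|, ..., |t_L|) with t_i = z_i / s,
# z ~ N(0, I_L), s^2 ~ chi-square(df)/df (common denominator).
set.seed(1)
nsim <- 200000
z <- matrix(rnorm(nsim * L), nsim, L)
s <- sqrt(rchisq(nsim, df) / df)
smm <- apply(abs(z / s), 1, max)
quantile(smm, 1 - gdot)                          # approx. 2.41 (the point t-bar_gdot(L, N - P))
```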
Exercises 21 and 22 (b). The size-$\dot\gamma$ F test is optimal in the senses described in Exercises 21 and 22 (b) not only among size-$\dot\gamma$ similar tests, but among all tests whose size does not exceed $\dot\gamma$; refer, e.g., to Lehmann and Romano (2005b, chap. 8). The restriction to size-$\dot\gamma$ similar tests serves to facilitate the solution of these exercises.
References

Agresti, A. (2013), Categorical Data Analysis (3rd ed.), New York: Wiley.
Albert, A. (1972), Regression and the Moore–Penrose Pseudoinverse, New York: Academic Press.
Albert, A. (1976), “When Is a Sum of Squares an Analysis of Variance?,” The Annals of Statistics,
4, 775–778.
Anderson, T. W., and Fang, K.-T. (1987), “Cochran’s Theorem for Elliptically Contoured Distribu-
tions,” Sankhyā, Series A, 49, 305–315.
Arnold, S. F. (1981), The Theory of Linear Models and Multivariate Analysis, New York: Wiley.
Atiqullah, M. (1962), “The Estimation of Residual Variance in Quadratically Balanced Least-Squares
Problems and the Robustness of the F-Test,” Biometrika, 49, 83–91.
Baker, J. A. (1997), “Integration over Spheres and the Divergence Theorem for Balls,” The American
Mathematical Monthly, 104, 36–47.
Bartle, R. G. (1976), The Elements of Real Analysis (2nd ed.), New York: Wiley.
Bartle, R. G., and Sherbert, D. R. (2011), Introduction to Real Analysis (4th ed.), Hoboken, NJ:
Wiley.
Bates, D. M., and Watts, D. G. (1988), Nonlinear Regression Analysis and Its Applications, New
York: Wiley.
Beaumont, R. A., and Pierce, R. S. (1963), The Algebraic Foundations of Mathematics, Reading,
MA: Addison-Wesley.
Benjamini, Y., and Hochberg, Y. (1995), “Controlling the False Discovery Rate: a Practical and
Powerful Approach to Multiple Testing,” Journal of the Royal Statistical Society, Series B, 57,
289–300.
Benjamini, Y., and Yekutieli, D. (2001), “The Control of the False Discovery Rate in Multiple Testing
Under Dependency,” The Annals of Statistics, 29, 1165–1188.
Bennett, J. H., ed. (1990), Statistical Inference and Analysis: Selected Correspondence of R. A.
Fisher, Oxford, U.K.: Clarendon Press.
Berger, J. O. (1985), Statistical Decision Theory and Bayesian Analysis (2nd ed.), New York:
Springer-Verlag.
Bickel, P. J., and Doksum, K. A. (2001), Mathematical Statistics: Basic Ideas and Selected Topics
(Vol. I, 2nd ed.), Upper Saddle River, NJ: Prentice-Hall.
Billingsley, P. (1995), Probability and Measure (3rd ed.), New York: Wiley.
Box, G. E. P., and Draper, N. R. (1987), Empirical Model-Building and Response Surfaces, New
York: Wiley.
Bretz, F., Hothorn, T., and Westfall, P. (2011), Multiple Comparisons Using R, Boca Raton, FL:
Chapman & Hall/CRC.
Cacoullos, T., and Koutras, M. (1984), “Quadratic Forms in Spherical Random Variables: Generalized Noncentral $\chi^2$ Distribution,” Naval Research Logistics Quarterly, 31, 447–461.
Carroll, R. J., and Ruppert, D. (1988), Transformation and Weighting in Regression, New York:
Chapman & Hall.
Casella, G., and Berger, R. L. (2002), Statistical Inference (2nd ed.), Pacific Grove, CA: Duxbury.
Chen, C.-P. (2010), “Inequalities for the Euler–Mascheroni constant,” Applied Mathematics Letters,
23, 161–164.
Chen, L.-A., Hung, H.-N., and Chen, C.-R. (2007), “Maximum Average-Power (MAP) Tests,” Com-
munications in Statistics—Theory and Methods, 36, 2237–2249.
Cochran, W. G. (1934), “The Distribution of Quadratic Forms in a Normal System, with Applications
to the Analysis of Covariance,” Proceedings of the Cambridge Philosophical Society, 30, 178–191.
Cornell, J. A. (2002), Experiments with Mixtures: Designs, Models, and the Analysis of Mixture Data
(3rd ed.), New York: Wiley.
Cressie, N. A. C. (1993), Statistics for Spatial Data (rev. ed.), New York: Wiley.
David, H. A. (2009), “A Historical Note on Zero Correlation and Independence,” The American
Statistician, 63, 185–186.
Davidian, M., and Giltinan, D. M. (1995), Nonlinear Models for Repeated Measurement Data,
London: Chapman & Hall.
Dawid, A. P. (1982), “The Well-Calibrated Bayesian” (with discussion), Journal of the American
Statistical Association, 77, 605–613.
Dettling, M. (2004), “BagBoosting for Tumor Classification with Gene Expression Data,” Bioinfor-
matics, 20, 3583–3593.
Diggle, P. J., Heagerty, P., Liang, K.-Y., and Zeger, S. L. (2002), Analysis of Longitudinal Data (2nd
ed.), Oxford, U.K.: Oxford University Press.
Driscoll, M. F. (1999), “An Improved Result Relating Quadratic Forms and Chi-Square Distribu-
tions,” The American Statistician, 53, 273–275.
Driscoll, M. F., and Gundberg, W. R., Jr. (1986), “A History of the Development of Craig’s Theorem,”
The American Statistician, 40, 65–70.
Driscoll, M. F., and Krasnicka, B. (1995), “An Accessible Proof of Craig’s Theorem in the General
Case,” The American Statistician, 49, 59–62.
Durbin, B. P., Hardin, J. S., Hawkins, D. M., and Rocke, D. M. (2002), “A Variance-Stabilizing
Transformation for Gene-Expression Microarray Data,” Bioinformatics, 18, S105–S110.
Edwards, D., and Berry, J. J. (1987), “The Efficiency of Simulation-Based Multiple Comparisons,”
Biometrics, 43, 913–928.
Efron, B. (2010), Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and
Prediction, New York: Cambridge University Press.
Fang, K.-T., Kotz, S., and Ng, K.-W. (1990), Symmetric Multivariate and Related Distributions,
London: Chapman & Hall.
Feller, W. (1971), An Introduction to Probability Theory and Its Applications (Vol. II, 2nd ed.), New
York: Wiley.
Fuchs, C., and Sampson, A. R. (1987), “Simultaneous Confidence Intervals for the General Linear
Model,” Biometrics, 43, 457–469.
Gallant, A. R. (1987), Nonlinear Statistical Models, New York: Wiley.
Gentle, J. E. (1998), Numerical Linear Algebra for Applications in Statistics, New York: Springer-
Verlag.
Golub, G. H., and Van Loan, C. F. (2013), Matrix Computations (4th ed.), Baltimore: The Johns
Hopkins University Press.
Graybill, F. A. (1961), An Introduction to Linear Statistical Models (Vol. I), New York: McGraw-Hill.
Graybill, F. A. (1976), Theory and Application of the Linear Model, North Scituate, MA: Duxbury.
Grimmett, G., and Welsh, D. (1986), Probability: An Introduction, Oxford, U.K.: Oxford University
Press.
Gupta, A. K., and Song, D. (1997), “Lp-Norm Spherical Distribution,” Journal of Statistical Planning
and Inference, 60, 241–260.
Hader, R. J., Harward, M. E., Mason, D. D., and Moore, D. P. (1957), “An Investigation of Some of
the Relationships Between Copper, Iron, and Molybdenum in the Growth and Nutrition of Lettuce:
I. Experimental Design and Statistical Methods for Characterizing the Response Surface,” Soil
Science Society of America Proceedings, 21, 59–64.
Hald, A. (1952), Statistical Theory with Engineering Applications, New York: Wiley.
Halmos, P. R. (1958), Finite-Dimensional Vector Spaces (2nd ed.), Princeton, NJ: Van Nostrand.
Hartigan, J. A. (1969), “Linear Bayesian Methods,” Journal of the Royal Statistical Society, Series
B, 31, 446–454.
Hartley, H. O. (1950), “The Maximum F -Ratio as a Short-Cut Test for Heterogeneity of Variance,”
Biometrika, 37, 308–312.
Harville, D. A. (1980), “Predictions for National Football League Games Via Linear-Model Method-
ology,” Journal of the American Statistical Association, 75, 516–524.
Harville, D. A. (1985), “Decomposition of Prediction Error,” Journal of the American Statistical
Association, 80, 132–138.
Harville, D. A. (1997), Matrix Algebra from a Statistician’s Perspective, New York: Springer-Verlag.
Harville, D. A. (2003a), “The Expected Value of a Conditional Variance: an Upper Bound,” Journal
of Statistical Computation and Simulation, 73, 609–612.
Harville, D. A. (2003b), “The Selection or Seeding of College Basketball or Football Teams for
Postseason Competition,” Journal of the American Statistical Association, 98, 17–27.
Harville, D. A. (2014), “The Need for More Emphasis on Prediction: a ‘Nondenominational’ Model-
Based Approach” (with discussion), The American Statistician, 68, 71–92.
Harville, D. A., and Kempthorne, O. (1997), “An Alternative Way to Establish the Necessity Part of
the Classical Result on the Statistical Independence of Quadratic Forms,” Linear Algebra and Its
Applications, 264 (Sixth Special Issue on Linear Algebra and Statistics), 205–215.
Henderson, C. R. (1984), Applications of Linear Models in Animal Breeding, Guelph, ON: Univer-
sity of Guelph.
Hinkelmann, K., and Kempthorne, O. (2008), Design and Analysis of Experiments, Volume I: Intro-
duction to Experimental Design (2nd ed.), Hoboken, NJ: Wiley.
Hodges, J. L., Jr., and Lehmann, E. L. (1951), “Some Applications of the Cramér–Rao Inequality,”
in Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability,
ed. J. Neyman, Berkeley and Los Angeles: University of California Press, pp. 13–22.
Holm, S. (1979), “A Simple Sequentially Rejective Multiple Test Procedure,” Scandinavian Journal
of Statistics, 6, 65–70.
Hommel, G., and Hoffmann, T. (1988), “Controlled Uncertainty,” in Multiple Hypotheses Testing,
eds. P. Bauer, G. Hommel, and E. Sonnemann, Heidelberg: Springer, pp. 154–161.
Hsu, J. C., and Nelson, B. L. (1990), “Control Variates for Quantile Estimation,” Management
Science, 36, 835–851.
Hsu, J. C., and Nelson, B. (1998), “Multiple Comparisons in the General Linear Model,” Journal of
Computational and Graphical Statistics, 7, 23–41.
Jensen, D. R. (1985), “Multivariate Distributions,” in Encyclopedia of Statistical Sciences (Vol. 6),
eds. S. Kotz, N. L. Johnson, and C. B. Read, New York: Wiley, pp. 43–55.
Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995), Continuous Univariate Distributions (Vol. 2,
2nd ed.), New York: Wiley.
Karlin, S., and Rinott, Y. (1980), “Classes of Orderings of Measures and Related Correlation In-
equalities. I. Multivariate Totally Positive Distributions,” Journal of Multivariate Analysis, 10,
467–498.
Karlin, S., and Rinott, Y. (1981), “Total Positivity Properties of Absolute Value Multinormal Variables
with Applications to Confidence Interval Estimates and Related Probabilistic Inequalities,” The
Annals of Statistics, 9, 1035–1049.
Kempthorne, O. (1980), “The Term Design Matrix” (letter to the editor), The American Statistician,
34, 249.
Khuri, A. I. (1992), “Response Surface Models with Random Block Effects,” Technometrics, 34,
26–37.
Khuri, A. I. (1999), “A Necessary Condition for a Quadratic Form to Have a Chi-Squared Distribution:
an Accessible Proof,” International Journal of Mathematical Education in Science and Technology,
30, 335–339.
Khuri, A. I. (2010), Linear Model Methodology, Boca Raton, FL: Chapman & Hall/CRC.
Kollo, T., and von Rosen, D. (2005), Advanced Multivariate Statistics with Matrices, Dordrecht, The
Netherlands: Springer.
Laha, R. G. (1956), “On the Stochastic Independence of Two Second-Degree Polynomial Statistics
in Normally Distributed Variates,” The Annals of Mathematical Statistics, 27, 790–796.
Laird, N. (2004), Analysis of Longitudinal and Cluster-Correlated Data—Volume 8 in the NSF-
CBMS Regional Conference Series in Probability and Statistics, Beachwood, OH: Institute of
Mathematical Statistics.
LaMotte, L. R. (2007), “A Direct Derivation of the REML Likelihood Function,” Statistical Papers,
48, 321–327.
Lehmann, E. L. (1986), Testing Statistical Hypotheses (2nd ed.), New York: Wiley.
Lehmann, E. L., and Casella, G. (1998), Theory of Point Estimation (2nd ed.), New York: Springer-
Verlag.
Lehmann, E. L., and Romano, J. P. (2005a), “Generalizations of the Familywise Error Rate,” The
Annals of Statistics, 33, 1138–1154.
Lehmann, E. L., and Romano, J. P. (2005b), Testing Statistical Hypotheses (3rd ed.), New York:
Springer.
Littell, R. C., Milliken, G. A., Stroup, W. W., Wolfinger, R. D., and Schabenberger, O. (2006), SAS ®
System for Mixed Models (2nd ed.), Cary, NC: SAS Institute Inc.
Liu, W. (2011), Simultaneous Inference in Regression, Boca Raton, FL: Chapman & Hall/CRC.
Luenberger, D. G., and Ye, Y. (2016), Linear and Nonlinear Programming (4th ed.), New York:
Springer.
McCullagh, P., and Nelder, J. A. (1989), Generalized Linear Models (2nd ed.), London: Chapman
& Hall.
McCulloch, C. E., Searle, S. R., and Neuhaus, J. M. (2008), Generalized, Linear, and Mixed Models
(2nd ed.), Hoboken, NJ: Wiley.
Milliken, G. A., and Johnson, D. E. (2009), Analysis of Messy Data, Volume I: Designed Experiments
(2nd ed.), Boca Raton, FL: Chapman & Hall/CRC.
Moore, D. P., Harward, M. E., Mason, D. D., Hader, R. J., Lott, W. L., and Jackson, W. A. (1957),
“An Investigation of Some of the Relationships Between Copper, Iron, and Molybdenum in the
Growth and Nutrition of Lettuce: II. Response Surfaces of Growth and Accumulations of Cu and
Fe,” Soil Science Society of America Proceedings, 21, 65–74.
Müller, A. (2001), “Stochastic Ordering of Multivariate Normal Distributions,” Annals of the Institute
of Statistical Mathematics, 53, 567–575.
Myers, R. H., Montgomery, D. C., and Anderson-Cook, C. M. (2016), Response Surface Method-
ology: Process and Product Optimization Using Designed Experiments (4th ed.), Hoboken, NJ:
Wiley.
Nash, J. C. (1990), Compact Numerical Methods for Computers: Linear Algebra and Function
Minimisation (2nd ed.), Bristol, England: Adam Hilger/Institute of Physics Publications.
Nocedal, J., and Wright, S. J. (2006), Numerical Optimization (2nd ed.), New York: Springer.
Ogawa, J. (1950), “On the Independence of Quadratic Forms in a Non-Central Normal System,”
Osaka Mathematical Journal, 2, 151–159.
Ogawa, J., and Olkin, I. (2008), “A Tale of Two Countries: the Craig–Sakamoto–Matusita Theorem,”
Journal of Statistical Planning and Inference, 138, 3419–3428.
Parzen, E. (1960), Modern Probability Theory and Its Applications, New York: Wiley.
Patterson, H. D., and Thompson, R. (1971), “Recovery of Inter-Block Information When Block Sizes
Are Unequal,” Biometrika, 58, 545–554.
Pawitan, Y. (2001), In All Likelihood: Statistical Modelling and Inference Using Likelihood, New
York: Oxford University Press.
Pinheiro, J. C., and Bates, D. M. (2000), Mixed-Effects Models in S and S-PLUS, New York: Springer-
Verlag.
Plackett, R. L. (1972), “Studies in the History of Probability and Statistics. XXIX: The Discovery
of the Method of Least Squares,” Biometrika, 59, 239–251.
Potthoff, R. F., and Roy, S. N. (1964), “A Generalized Multivariate Analysis of Variance Model
Useful Especially for Growth Curve Problems,” Biometrika, 51, 313–326.
Rao, C. R. (1965), Linear Statistical Inference and Its Applications, New York: Wiley.
Rao, C. R. (1973), Linear Statistical Inference and Its Applications (2nd ed.), New York: Wiley.
Rao, C. R., and Mitra, S. K. (1971), Generalized Inverse of Matrices and Its Applications, New York:
Wiley.
Ravishanker, N., and Dey, D. K. (2002), A First Course in Linear Model Theory, Boca Raton, FL:
Chapman & Hall/CRC.
Reid, J. G., and Driscoll, M. F. (1988), “An Accessible Proof of Craig’s Theorem in the Noncentral
Case,” The American Statistician, 42, 139–142.
Sanders, W. L., and Horn, S. P. (1994), “The Tennessee Value-Added Assessment System (TVAAS):
Mixed-Model Methodology in Educational Assessment,” Journal of Personnel Evaluation in Ed-
ucation, 8, 299–311.
Sarkar, S. K. (2008), “On the Simes Inequality and Its Generalization,” in Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen, eds. N. Balakrishnan, E. A. Peña, and M. J. Silvapulle, Beachwood, OH: Institute of Mathematical Statistics, pp. 231–242.
Schabenberger, O., and Gotway, C. A. (2005), Statistical Methods for Spatial Data Analysis, Boca
Raton, FL: Chapman & Hall/CRC.
Scheffé, H. (1953), “A Method for Judging All Contrasts in the Analysis of Variance,” Biometrika,
40, 87–104.
Scheffé, H. (1959), The Analysis of Variance, New York: Wiley.
Schervish, M. J. (1995), Theory of Statistics, New York: Springer-Verlag.
Schmidt, R. H., Illingworth, B. L., Deng, J. C., and Cornell, J. A. (1979), “Multiple Regression and
Response Surface Analysis of the Effects of Calcium Chloride and Cysteine on Heat-Induced
Whey Protein Gelation,” Journal of Agricultural and Food Chemistry, 27, 529–532.
Seal, H. L. (1967), “The Historical Development of the Gauss Linear Model,” Biometrika, 54, 1–24.
Searle, S. R. (1971), Linear Models, New York: Wiley.
Sen, P. K. (1989), “The Mean-Median-Mode Inequality and Noncentral Chi Square Distributions,” Sankhyā, Series A, 51, 106–114.
Severini, T. A. (2000), Likelihood Methods in Statistics, New York: Oxford University Press.
Shanbhag, D. N. (1968), “Some Remarks Concerning Khatri’s Result on Quadratic Forms,”
Biometrika, 55, 593–595.
Shao, J. (2010), Mathematical Statistics (2nd ed.), New York: Springer-Verlag.
Šidák, Z. (1967), “Rectangular Confidence Regions for the Means of Multivariate Normal Distribu-
tions,” Journal of the American Statistical Association, 62, 626–633.
Šidák, Z. (1968), “On Multivariate Normal Probabilities of Rectangles: Their Dependence on Cor-
relations,” The Annals of Mathematical Statistics, 39, 1425–1434.
Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A.,
D’Amico, A. V., Richie, J. P., Lander, E. S., Loda, M., Kantoff, P. W., Golub, T. R., and Sellers,
W. R. (2002), “Gene Expression Correlates of Clinical Prostate Cancer Behavior,” Cancer Cell,
1, 203–209.
Snedecor, G. W., and Cochran, W. G. (1989), Statistical Methods (8th ed.), Ames, IA: Iowa State
University Press.
Sprott, D. A. (1975), “Marginal and Conditional Sufficiency,” Biometrika, 62, 599–605.
Stigler, S. M. (1986), The History of Statistics: The Measurement of Uncertainty Before 1900,
Cambridge, MA: Belknap Press of Harvard University Press.
Stigler, S. M. (1999), Statistics on the Table: The History of Statistical Concepts and Methods,
Cambridge, MA: Harvard University Press.
Student (1908), “The Probable Error of a Mean,” Biometrika, 6, 1–25.
Thompson, W. A., Jr. (1962), “The Problem of Negative Estimates of Variance Components,” The
Annals of Mathematical Statistics, 33, 273–289.
Trefethen, L. N., and Bau, D., III (1997), Numerical Linear Algebra, Philadelphia: Society for
Industrial and Applied Mathematics.
Verbyla, A. P. (1990), “A Conditional Derivation of Residual Maximum Likelihood,” Australian
Journal of Statistics, 32, 227–230.
Victor, N. (1982), “Exploratory Data Analysis and Clinical Research,” Methods of Information in
Medicine, 21, 53–54.
Westfall, P. H., and Tobias, R. D. (2007), “Multiple Testing of General Contrasts: Truncated Closure
and the Extended Shaffer–Royen Method,” Journal of the American Statistical Association, 102,
487–494.
Wolfowitz, J. (1949), “The Power of the Classical Tests Associated with the Normal Distribution,”
The Annals of Mathematical Statistics, 20, 540–551.
Woods, H., Steinour, H. H., and Starke, H. R. (1932), “Effect of Composition of Portland Cement
on Heat Evolved During Hardening,” Industrial and Engineering Chemistry, 24, 1207–1214.
Zabell, S. L. (2008), “On Student’s 1908 Article ‘The Probable Error of a Mean’ ” (with discussion),
Journal of the American Statistical Association, 103, 1–20.
Zacks, S. (1971), The Theory of Statistical Inference, New York: Wiley.
Zhang, L., Bi, H., Cheng, P., and Davis, C. J. (2004), “Modeling Spatial Variation in Tree Diameter-
Height Relationships,” Forest Ecology and Management, 189, 317–329.
Index

A mean and variance, 258


adjoint matrix, see under cofactors (of the elements mgf, 257
of a square matrix) moments, 258
Aitken equations, 216 noninteger degrees of freedom, 256
Aitken model, see under model(s) (statistical) pdf of, 256, 347–348, 498
angle (between 2 vectors), 183–184 relationship to gamma distribution, 256
augmentation of the model equation (of a G–M Cholesky decomposition, see under nonnegative
model) with additional terms, 361–362, matrix (or matrices)
493 Cochran’s theorem, see under quadratic form(s) or
effects on the expected values of least squares 2nd-degree polynomial(s) (in a random
estimators, 357–358, 363, 493 vector), distribution of
effects on the (usual) estimator of  2 , 363–364, cofactors (of the elements of a square matrix)
493 adjoint matrix (transpose of cofactor matrix),
effects on what is estimable, 358–360, 363–364 343
least squares estimators (of functions that are basic properties of, 342–343
estimable under the model with the definition of, 342
additional terms) and their variances and matrix of (the cofactor matrix), 343
covariances, 358, 360–361, 363–364 complete sufficient statistic [when
y  N.Xˇ;  2 I/], 211–212, 248–249,
B 369–370
beta distribution, 312–314, 345 confidence bands, see under confidence intervals for
cdf of, 255 .ƒı/0ˇ (ı 2 ) with probability of
mean and variance, 345 simultaneous coverage 1 P : a
moments, 345 generalized S method
pdf/definition of, 255 confidence intervals for the standard deviation  (of
relationship to the F distribution, 281–282 the residual effects in the G–M model)
beta function and the corresponding tests of hypotheses,
definition of, 255 431–432, 498
relationship to gamma function, 255 an illustration, 435
Bonferroni inequality, 495 background and underlying assumptions,
Bonferroni t-intervals, see under confidence intervals 430–431
for .ƒı/0ˇ (ı 2 ) with probability of reexpression of various properties of the tests of
simultaneous coverage 1 P : a hypotheses as properties of the
generalized S method corresponding confidence intervals, 435,
438, 440
C tests of H0C W   0 vs. H1C W  > 0 and of
canonical form (of the G–M model), 365–369, H0 W   0 vs. H1 W  < 0
493–494 optimality: UMP among translation-invariant
Cauchy–Schwarz inequality, 38, 122 tests, 436
characteristic polynomial and equation, 315, 318, 350 optimality of the test of H0C in the special
chi distribution case where rank ƒ D rank X: UMP
definition of, 264 among “all” tests, 438–439, 498–499
pdf of, 264, 346 optimality of the test of H0 in the special
chi-square distribution (central) case where rank ƒ D rank X: UMP
basic properties of, 256–257, 375, 502 among unbiased tests, 439
cumulants, 258 unbiasedness of, 432
definition of, 256
tests of H0 W  D 0 vs. H1 W  ¤ 0 conjugate normal equations, see under least squares,


optimality of the unbiased version: UMP method of
among translation-invariant unbiased correlation (of 2 random variables)
tests, 436–438 basic properties of, 91–92
optimality of the unbiased version in the definition of, 91
special case where rank ƒ D rank X: correlation matrix, definition of, 94
UMP among unbiased tests, 439–440 covariance (of 2 random variables)
special case: unbiased version, 432–435, 498 basic properties of, 90–91, 97, 100–101, 119
translation invariance of, 432 conditional, 100–101
confidence intervals for .ƒı/0ˇ (ı 2 ) with definition of, 90
probability of simultaneous coverage existence of, 90
1 P : a generalized S method, 502 for statistically independent random variables,
a conservative alternative: Bonferroni 95
t-intervals, 495–496, 504 of 2 linear combinations of random variables,
a requisite upper 100 P % point c P , 383–384, 95–96, 200
494–495 of 2 quadratic forms (in a random vector),
a variation: constant-width confidence intervals, 201–204
398–401 of a linear form and quadratic form (in a
background/underlying assumptions, 382–383 random vector), 201, 203–204
computation of c P : use of Monte Carlo covariance matrix
methods, 392, 396–398, 400–401, 496, basic properties of, 93, 101
502–503 conditional, 101
confidence bands, 401–404, 496 definition of, 93
end points of interval for .ƒı/0ˇ, 384 for c C Ax and k C By, 96
for c C N
PT
implicitly defined confidence intervals for i D1 ai xi and k C j D1 bj yj , 96
P
.ƒı/0ˇ (ı … ), 388–393, 395–396, covariance of 2 random vectors, see covariance
494–495 matrix
multiple-comparison procedure for testing
.ı/ .0/
H0 W .ƒı/0ˇ D ı (ı 2 ), 384–386 D
special case where c P is the upper 100 P % point data sets (illustrative)
of the Studentized maximum modulus cement data, 132–133, 160–161
distribution, 391, 395 corn-milling data, 143–145
special case (where  D RM ): S (Scheffé’s) dental data, 150–152
method, 386–387, 391, 395, 502 lettuce data, 132–134, 161–162, 351–357,
special case: T (Tukey’s) method, 391–392 363–364, 393–396, 403–404, 435, 493,
confidence set (ellipsoidal) for ƒ0ˇ [when 504
y  N.Xˇ;  2 I/], 394–395, 405 microarray data, 453–454, 470–480, 501,
a connection to the S method for obtaining 503–504
confidence intervals for .ƒı/0ˇ ouabain data, 130–131, 171, 496, 498
(ı 2 RM ), 387–388 shear-strength data, 145–146, 162
alternative confidence sets (of a different form), tree-height data, 155–157
388–390, 395, 494–495 whey-protein-gel data, 159–160, 162
basic properties, 376 determinant, 319–320, 342–343
defining inequality, 372–373 definition of, 71–72, 84
invariance/equivariance properties, 380–382 of a block-diagonal or block-triangular matrix,
optimality 77–78
minimum “average” probability of false of a diagonal or triangular matrix, 73
coverage, 421 of a matrix with permuted rows or columns,
UMA invariant/equivariant, 413–414 74–75
pivotal quantity, 370–372 of a nonsingular matrix, 79
probability of false coverage, 374, 376 of a partitioned matrix, 79–80
relationship to F test, 374 of a positive definite matrix, 79
special case: t interval (for a single estimable of a positive semidefinite matrix, 79
function 0 ˇ), 376–377 of a product of matrices, 76, 78, 84
validity for spherical distributions, 377 of a scalar multiple of a matrix, 74
of a singular matrix, 79 linear transformation of an elliptically


of a transposed matrix, 73–74 distributed random vector, 232–233
of an inverse matrix, see under inverse matrix marginal distributions, 233
of an orthogonal matrix, see under orthogonal mgf, 230–232
matrix (or matrices) pdf of, 232
of C 1AC, 84 envelope power function, 497
of I tA, of I uB, and of I tA uB, error contrast(s), 367
338–342, 350 definition of, 217
of RCST U, 298–299 linearly independent, number of, 217
diagonalization (of a matrix) special case: least squares residual(s), 219
definition of, 317 translation invariance of, 218
orthogonal, 317–318 estimability (of 0ˇ)
differentiation basic results on, 167–168, 246
of a matrix, 338 definition of, 167
wrt a vector, 176–177 identifiability (and its relationship to
gradient vector or matrix, 176–177, 352 estimability), 170
Hessian matrix, 176, 353 in the case of mixture data, 171–174
Jacobian matrix, 177 necessary and sufficient conditions for,
of a linear or quadratic form, 176–177, 352 167–169, 171–173, 180–181, 246
of a product, 176 restrictions: inherent vs. noninherent, 173–175
of a vector of linear forms, 177 estimation of the variance  2 of the residual effects
stationary point, 352–353 of the G–M model, 248
Dirichlet distribution, 322–323, 336–338, 346 Hodges–Lehmann estimator: best scalar
marginal distributions, 261 multiple of the residual sum of squares,
means, variances, and covariances, 345–346 206–207, 250–251
moments, 345 bias of, 207
pdf/definition of, 259–260 MSE of, 207
properties of, 260–261, 349 ML estimator, 213, 249
relationship quadratic unbiased translation-invariant
to spherical distributions, 264–265, 285–286 estimators, variance of, 209–210
to the beta distribution, 260–261 REML estimator (D O 2 ), 221
to the multivariate standard normal “usual” estimator O 2 : unbiased scalar multiple
distribution, 262 of the residual sum of squares, 205–206,
distance (between 2 vectors), 183 249, 353, 496
distribution of quadratic form(s), see quadratic optimality among quadratic unbiased
form(s) or 2nd-degree polynomial(s) (in a translation-invariant estimators, 209–210,
random vector), distribution of 248
duplication matrix, see under vech (of a symmetric optimality among unbiased estimators [when
matrix) y  N.Xˇ;  2 I/], 211
variance of, 206, 354–355
E estimators (of 0ˇ), types of
eigenvalue(s), 319–320, 348 linear, 165–167
definition(s) of, 315 translation (location) equivariant, 166–167,
distinct eigenvalue(s), 318–319 192, 248, 252
multiplicity, 318–319 unbiased, 166–167, 192
spectrum, 318 expected value, 87
existence of, 315–317 conditional, 100–101
eigenvector(s) existence of, 87
definition of, 315 of a conditional expected value, 100–101
orthogonality, 317 of a function of a random variable or vector
scalar multiples of, 315 when distribution is absolutely continuous,
elliptical (or elliptically contoured or elliptically 88, 122
symmetric) distribution, 225, 337–338 when distribution is discrete, 87
definition of, 231–232 of a linear combination of random variables, 88,
E.w j y/ for random vectors w and y whose 200
joint distribution is elliptical, 245
of a nonnegative function of a random vector, 88 validity for spherical distributions, 377


of a product of 2 random variables, 91–92, 119, false discovery rate (FDR), see under multiple
122 “comparison” procedures [for testing
of a product of functions of 2 statistically .0/ .0/
Hi W 0i ˇ D i (i D 1; 2; : : : ; M )]
independent random vectors, 88 that control the “size” of the FDP
of a quadratic form (in a random vector),
200–201, 203 G
of a random vector or matrix, 88, 101 gamma distribution
of c C Ax,
P 88–89 basic properties of, 254–255, 259–260, 345
of C C jND1 aj Xj , 88–89 cumulants and cumulant generating function,
of C C AXK, 89 258
mean and variance, 257–258
F mgf, 257
F distribution (central) moments, 257
basic properties of, 282–283 pdf/definition of, 253–254
definition of, 281 relationship to the Poisson distribution, 345
mean and variance, 283–284 gamma function, 102, 257
moments, 283–284 Gauss–Markov (G–M) model, see under model(s)
multivariate version (statistical)
formulation of, 284 Gauss–Markov theorem, see under least squares,
relationship to spherical distributions, method of
286–287 general linear model, see under model(s) (statistical)
relationship to the Dirichlet distribution, 285 generalized inverse matrix, 53–55, 83
relationship to the multivariate standard existence of, 54–55
normal distribution, 285 for a matrix of full row or column rank, 54
noninteger degrees of freedom, 284 for a nonsingular matrix, 54
pdf of, 283, 346 for a product of matrices, 56–57
relationship for a scalar multiple of a matrix, 56
to spherical distributions, 286 for a symmetric matrix, 56, 83
to the beta distribution, 281–282 for the transpose of a matrix, 56
to the multivariate standard normal general form of, 55
distribution, 284 least squares generalized inverse, 246
Snedecor’s contributions, 349 minimum norm generalized inverse, 246–247
F test of H0 W ƒ0ˇ D  .0/ [when y  N.Xˇ;  2 I/], Moore–Penrose inverse, 247
393–395, 405 number of generalized inverses, 55
basic properties of the form B.CAB/ C (where C is of full
similarity, 376 column rank and B of full row rank),
unbiasedness, 376 221–222
critical region, 373–374 properties of AA and A A, 56–58
equivalence to the likelihood ratio test, 494 rank of, 56
invariance properties, 380–382 Gram–Schmidt orthogonalization, 39
optimality
maximum “average” power, 416–421, 503 H
maximum minimum (over a hypersphere) Helmert matrix, see under orthogonal matrix (or
power, 497, 504 matrices)
most stringent, 497–498, 504
UMP invariant, 410–413 I
UMP unbiased (in the special case of the idempotent matrix (or matrices), 50, 56, 82–83,
2-sided t test), 496–497, 503 330–333
power function, 374, 376, 494 eigenvalues of, 320
relationship to ellipsoidal confidence set for expression in the form QQ0 , 220
ƒ0ˇ, 374 rank of, 50–51
special case: 2-sided t test, 376–377, 421, trace of, 50–51
496–497 transformation of form B 1AB, 82
test statistic, 370–372, 493–494 incomplete beta function ratio
a basic property of, 255
definition of, 255 a geometrical perspective, 184–187


inner product (of 2 vectors or matrices), 37–38, 82, an extension: least squares estimator of a vector
183 of estimable functions, 194
integration of a function over a hypersphere, 414–416 a basic property of linear combinations of the
intraclass correlation, see under model(s) (statistical) elements of the least squares estimator,
invariance/equivariance (as applied to tests of 194–195
hypotheses about and confidence sets for conjugate normal equations, 194
ƒ0ˇ under the G–M model), 377–380, generalization of results on best linear
494, 496 translation-equivariant estimation,
a relationship between sufficiency and 197–198, 248
invariance, 408 generalizations of the Gauss–Markov
form and distribution of an invariant function, theorem, 195–197, 248
405–410, 423–424 linearity, unbiasedness, and translation
invariance/equivariance wrt orthogonal equivariance, 194–195, 197
transformations, 381–382, 494 variance-covariance matrix, 195
scale invariance/equivariance, 380–381, 494 computational considerations: QR
translation (location) invariance/equivariance, decomposition of the model matrix,
380, 494 187–191, 247–248
inverse matrix, 42–43, 343 conjugate normal equations, 181–182, 252
determinant of, 79 “generalized” or “weighted” least squares,
for a diagonal or block-diagonal matrix, 45 213–214
for a partitioned matrix, 48–49, 82 least squares estimator (of an estimable
for a positive definite matrix, see under positive function), definition of, 179–180
definite matrix (or matrices) linearity of least squares estimators, 180, 182
for a product of matrices, 43 normal equations, 177–181, 247
for a scalar multiple of a matrix, 43 optimality of a least squares estimator, 249
for a symmetric matrix, 43 best linear translation-equivariant estimator,
for a triangular or block-triangular matrix, 193–194
46–47 best unbiased estimator [when
for an orthogonal matrix, see under orthogonal y  N.Xˇ;  2 I/], 211
matrix (or matrices) Gauss–Markov theorem: best linear unbiased
for I tA, 338–342 estimator (BLUE), 192–193
for the inverse of a matrix, 43 residuals, see least squares residuals
for the transpose of a matrix, 43 translation equivariance of least squares
estimators, 192
K unbiasedness of least squares estimators, 182,
Kronecker product (of 2 matrices) 192
definition of, 199 variances and covariances of least squares
transpose of, 200 estimators, 182–183, 353–356
left or right inverse (of a matrix), 41–42, 82
L likelihood or log-likelihood function (ordinary)
least squares generalized inverse, see under for the Aitken model [in the special case where
generalized inverse matrix y  N.Xˇ;  2 H/]
least squares residuals, 204–205 maximizing values of ˇ and , 216
residual sum of squares and its expected value profile log-likelihood function (for ), 216
and variance, 205–206 for the G–M model [in the special case where
residual vector and its expected value and y  N.Xˇ;  2 I/], 212–213
variance-covariance matrix, 204–205 maximizing values of ˇ and , 213
statistical independence of the residual vector or profile log-likelihood function (for ), 213
the residual sum of squares and least for the general linear model [in the special case
squares estimators of estimable functions, where y  N.Xˇ; V .//], 214–215,
207–208 249–250
translation invariance of the residual vector and maximizing values of ˇ and , 215–216
the residual sum of squares, 209 profile likelihood or log-likelihood function
least squares, method of, 175–179, 246, 353–356, (for ), 215–216
493, 496
for the general linear model [in the case where invertible, 42–43
y is distributed elliptically about Xˇ], involutory, 82
233–234 matrices of 1’s, 26
maximizing values of ˇ and , 234 nonnegative definite, see nonnegative definite
profile likelihood function (for ), 234 matrix (or matrices)
likelihood or log-likelihood function (REML) nonsingular, 36, 42–44, 46–49
for the Aitken model [in the special case where null, 26
y  N.Xˇ;  2 H/], 224–225 orthogonal, see orthogonal matrix (or matrices)
for the general linear model [in the case where positive definite, see positive definite matrix (or
y is distributed elliptically about Xˇ], 235 matrices)
for the general linear model [in the special case row and column vectors, 26
where y  N.Xˇ; V .//], 216–219, singular, 36
222–224, 249 square, 25
interpretation of the REML likelihood symmetic, 25, 81
function as a marginal likelihood, 218–219 triangular, 26, 82
relationship of the REML likelihood function matrix operations
to the profile likelihood function (for ), addition and subtraction, 24
219, 222–224 matrix multiplication, 24–25, 81
for the G–M model [in the special case where scalar multiplication, 23–24
y  N.Xˇ;  2 I/], 221 transposition, 25, 81
linear dependence and independence, 34, 38–39 matrix, definition of, 23
linear expectation of one random variable or vector maximum likelihood (ML) estimator of a function of
given another, 240 the parameter vector  in the general
linear space(s), 32–33, 81, 85 linear model
basis for, 34–35, 82, 311 when [in the special case where the model is the
dimension of, 35 Aitken model and  D ./]
essentially disjoint, 40–41 y  N.Xˇ;  2 H/, 216
orthonormal basis for, 39, 82, 311 when y  N ŒXˇ; V ./, 215–216
row and column spaces, 33–36 when y is distributed elliptically about Xˇ, 234
of a partitioned matrix, 41, 52 maximum likelihood (ML) estimator of an estimable
of a product of matrices, 44 function (of ˇ)
of a sum of matrices, 41 under the Aitken model when
of X0 W X (where W is symmetric and y  N.Xˇ;  2 H/, 216
nonnegative definite), 214 under the G–M model
of X0 X, 61–62 when y  N.Xˇ;  2 I/, 213, 356–357,
subspaces of, 33–34, 82 492–493
linear system, 51–52 under the general linear model
coefficient matrix of, 51, 52 when y  N ŒXˇ; V ./, 215
consistency of, 52–53, 57–58, 83, 168–169 when y is distributed elliptically about Xˇ,
homogeneous, 52, 58–59, 83 234
solution to, 51, 52, 83 mean or mean vector, definition of, 88
general form of, 58–60, 83 mean squared error (MSE) or MSE matrix of a
minimum norm, 247 (point) predictor, definition of, 236
solution set, 53, 58–60 mean squared error (MSE) and root MSE (of an
uniqueness of, 59–60 estimator of 0ˇ), 165–166
linear variance or variance-covariance matrix of one minimum norm generalized inverse, see under
random variable or vector given another, generalized inverse matrix
240 mixture data, 171–174
model(s) (statistical)
M assumption
Markov’s inequality, 499 of ellipticity, 233–234
matrices, types of of multivariate normality, 18–21, 128,
diagonal, 26 133–136
full row or column rank, 36, 44 of the linearity of E.y j u/, 161
identity, 26 classificatory models, 4–7
a “variation”: cell-means models, 6 correlated: intraclass correlation and


factors: qualitative vs. quantitative, 4–5 compound symmetry, 139–146, 161, 163
omitted factors and the role of decomposition (into uncorrelated
randomization, 6–7 components) to account for grouping (and
reformulation as a multiple linear regression possibly for intragroup competition),
model, 5–6 142–146, 161
special case: 1-way-classification definition of, 126
(fixed-effects) model, 5–7 heteroscedastic, 137–139, 162, 470–475
for multivariate data, 157–160, 162 spatially related, 152–157, 162–163
formulation of temporally related (as in the case of
in general, 1–2, 123–125 longitudinal data), 146–152, 162–163
in the presence of a “linearity” constraint, simple or multiple linear “regression” models, 3
2–3, 124–126 use (in statistical inference) of multiple models,
full rank, 168, 170–171 21
hierarchical models (models obtained from moment generating function (mgf) vs. characteristic
other models via a hierarchical appoach), function, 122
7–11 Moore–Penrose conditions, 247
inferring “causal” relationships from Moore–Penrose inverse, see under generalized
“explanatory” relationships, 4 inverse matrix
linear models multiple “comparison” procedures [for testing
Aitken model, 126–127 .0/ .0/
Hi W 0i ˇ D i (i D 1; 2; : : : ; M )]
Gauss–Markov (G–M) model, 126–127, that control the k-FWER, 503
133–136, 351–357 an illustration: application to the microarray
general linear model, 126–127 data, 474–476, 501, 503–504
mean vector and variance-covariance matrix, background/underlying assumptions, 441–442
127–128, 133–136 k-FWER vs. FWER, 442–443
model equation, 128 one-step procedure, 443
alternative form (for multivariate data), a conservative approximation: replacement of
158–159, 163 c P .k/ with an upper bound, 444–445, 499
augmentation of, see augmentation of the a requisite upper 100 P % point c P .k/,
model equation (of a G–M model) with 443–445
additional terms computation of c P .k/: use of Monte Carlo
first-order (vs. higher-order), 130–136, methods, 443–444
143–146 corresponding confidence intervals for 0i ˇ
polynomials (in 1 variable), 129–131, (i D 1; 2; : : : ; M ), 443
170–171, 496 .0/
critical region for test of Hi , 443
polynomials (in general), 129, 131–133, extensions, 444–445
143–146, 150–152, 159–160, 351–352 FDR of, 501
response surface (defined by the model validity for nonnormal distributions, 445
equation), 351–357 step-down procedure, 445–447, 503
second-order (vs. higher-order), 393–394 “superiority” to 1-step procedure, 447–448
model matrix, 128, 162 a caveat, 448–449
potential pitfall: making inferences from a computationally less demanding (but less
unrepresentative data and/or on the basis powerful) version, 451–452, 499
of an unsuitable model, 4, 7, 15 a potential improvement, 449–451
quantities of interest, 3–4, 6–11, 15–17, 129, computations: use of Monte Carlo methods,
352–353 451
random-effects models control of the k-FWER: verification, 448
a “conventional” formulation, 11–13 .0/
critical region for test of Hi , 447
derivation via a hierarchical approach, 9–11
extensions, 452–453
special case: 1-way-classification
validity for nonnormal distributions, 453
random-effects model, 13–15
regression, 133–136, 161–162
residual effects (in a linear model)
multiple-comparison procedure for testing H_0^(δ): (Λδ)′β = δ′τ^(0) (δ ∈ Δ), see under confidence intervals for (Λδ)′β (δ ∈ Δ) with probability of simultaneous coverage 1 − α: a generalized S method
multiple “comparison” procedures [for testing H_i^(0): λ_i′β = τ_i^(0) (i = 1, 2, …, M)] that control the “size” of the FDP
    background/underlying assumptions, 441–442, 453–455, 503
    illustrative example: microarray data, 453–454, 470–480, 503–504
    size of the FDP: alternative measures, 454–455, 499, 503
    step-down procedures for controlling Pr(FDP > γ), 455–461, 503
        a potential improvement, 460
        an illustration: application to the microarray data, 476–479
        critical values of a particular form, 458, 500–501
        critical values: a special case, 458–459
        critical values: a sufficient condition, 455–458
        critical values: general case, 459–460, 499–500, 503
        extensions, 460–461
        validity for nonnormal distributions, 461
    step-up procedures for controlling E(FDP) (the FDR), 461–470, 503
        an illustration: application to the microarray data, 479–480
        control of the FDR (in special cases) by the Benjamini–Hochberg procedure, 465–466
        critical values: Benjamini–Hochberg procedure, 462, 500–501
        expressions for the FDR in special cases, 465–466
        expressions for the FDR: general, 462–465
        extensions, 466–470, 500, 503
        step-up vs. step-down, 462
multivariate normal (MVN) distribution
    bivariate normal distribution, 108–109, 121
    conditional distributions
        general case, 115–116, 121
        special case: positive definite variance-covariance matrix, 113–114, 121
    definition/pdf of, 105–109
    linear transformation of an MVN random vector, 110, 121
    marginal distributions, 110
    mgf, 118, 121
    standard, see multivariate standard normal distribution
    statistical independence of subvectors of an MVN random vector, 111–113, 121
    symmetry of, 110
    third- and fourth-order central moments, 116–117
    univariate characterization of, 118–119
multivariate standard normal distribution, 121
    definition of, 105
    mean vector and variance-covariance matrix, 105
    pdf of, 105
multivariate t distribution, 347
    definition of, 290–291
    marginal distributions, 299
    moments, 301–302
    noninteger degrees of freedom, 302
    pdf of, 300–301
    relationship
        to F distribution, 299
        to multivariate normal distribution, 299, 303
        to spherical and elliptical distributions, 302–303

N
Neyman–Pearson (fundamental) lemma and its implications, 408–409
noncentral chi-square distribution
    a basic property of, 271–272
    cumulants and cumulant generating function, 273
    definition of, 268–269, 349
    extensions
        “noncentral gamma distribution”, 277–279, 346
        distribution of the sum of squared elements of a random vector that is distributed spherically around a vector of constants, 279–281, 349
    mean and variance, 273–274
    mgf, 272–273
    moments, 274–277, 346
    noninteger degrees of freedom, 270–271
    pdf of, 269–271
    probability of a random variable with a noncentral chi-square distribution exceeding a specified constant: an increasing function of the noncentrality parameter, 375, 502
noncentral F distribution
    definition of, 281
    mean and variance, 289
    moments, 288–289
    noninteger degrees of freedom, 289–290
    pdf of, 288
    probability of a random variable with a noncentral F distribution exceeding a specified constant: an increasing function of the noncentrality parameter, 375–376, 502
    related distribution: noncentral beta distribution, 287–288, 290
noncentral t distribution, 347
    definition of, 290
    moments, 297–298, 347
    noninteger degrees of freedom, 298
    pdf of, 296–297
    relationship
        to the noncentral F distribution, 296
        to the St(N, ·) distribution, 298
nonnegative definite matrix (or matrices), 63, 84–85
    Cholesky decomposition of, 67–68
    diagonal elements of, 65–66
    eigenvalues of, 320
    existence of unit upper triangular matrix U and diagonal matrix D such that U′AU = D, 66–67, 83–84
    expression in form P′P, 66–68
    of the form aI + b11′, 140–141
    principal submatrices of, 65
    scalar multiples of, 63–64
    sum of, 64, 83
    transformation of form Q′AQ, 64–67
norm (of a vector or matrix), 38, 183
normal distribution (univariate)
    central moments, 104
    pdf/definition of, 104–105
    standard, see standard (univariate) normal distribution
normal equations, see under least squares, method of
null space (of a matrix), 53, 58–59, 220–221

O
order statistics, some basic results, 399–400
orthogonal and orthonormal sets, 38–39
orthogonal complement (of a subspace), 184, 246
orthogonal matrix (or matrices), 49, 82
    a basic property of, 225
    determinant of, 78–79
    Helmert matrix (and its generalizations), 267–268, 346
    inverse of, 49
    product of, 49–50
orthogonal projection of a vector on a subspace, see projection of a vector on a subspace
orthogonality of a vector to a subspace, 184
orthogonality of one vector or matrix to another, 38, 184

P
partitioned matrices, 27–30
    addition of, 28
    block-diagonal, 29, 69–70
    block-triangular, 29
    conformal partitioning
        for addition, 28
        for matrix multiplication, 28–29
    determinant of, see under determinant
    inverse of, see under inverse matrix
    multiplication of, 28–30
    nonnegative definiteness, positive definiteness, or positive semidefiniteness of, 69–70, 80
    partitioned into rows or columns, 29–30
    rank of, see under rank
    scalar multiples of, 28
    trace of, see under trace
    transposition of, 28
permutation matrices, 189–190
polynomials (in a single variable), properties of, 323–325, 350
positive definite matrix (or matrices), 63, 84–85
    determinant of, see under determinant
    diagonal elements of, 65
    eigenvalues of, 320
    expression in form P′P, 69
    inverse of, 65, 84
    nonsingularity of, 64, 69
    of the form aI + b11′, 140–141
    of the form I − tA (or generalizations thereof), 303–305
    principal submatrices of
        determinants of, 80
        positive definiteness, 65
    transformation of form Q′AQ, 64–65
prediction (point prediction): use of the value of an observable random vector y to predict the realization of an unobservable random variable or vector w, 236
    when joint distribution of y and w is known
        MSE matrix of optimal predictor, 237
        optimal predictor: E(w | y), 236–237
    when “only” the mean vector and variance-covariance matrix are known, 251
        best linear predictor, 238–240
        decomposition of prediction error and of MSE or MSE matrix, 238–239
        MSE matrix of best linear predictor, 239, 251
    when “only” the variance-covariance matrix is known
        best linear unbiased predictor (BLUP), 241
        correspondence between predictors and estimators of μ_w − V_yw′V_y⁻¹μ_y, 240
        decomposition of prediction error and of MSE or MSE matrix, 240–242
        MSE or MSE matrix of a linear predictor, 241–242
            when y and w follow an “extended” general linear model, 242
        predictability of w vs. unpredictability, 242
        translation-equivariant predictors, 243, 252
        when y and w follow an “extended” Aitken model, 243
        when y and w follow an “extended” G–M model, 243, 252, 357
            BLUP: ŵ_L(y), 243–244
            MSE matrix of the BLUP, 244
            best linear translation-equivariant predictor (= ŵ_L(y)), 244–245
prediction error, definition of, 236
predictors, types of point predictors
    linear, 236
    translation equivariant, 243, 252
    unbiased, 236
prediction intervals or sets for the realization of an unobservable random variable or vector w based on the value of an observable random vector y
    background/underlying assumptions and notation, 480
    conditional probability of coverage vs. unconditional, 480–481
    simultaneous prediction intervals [for δ′w (δ ∈ Δ)], 483–485
    when the conditional distribution of w given y is known, 480
        HPD prediction sets, 481–483
        special case: conditional distribution is MVN, 482–485
    when “only” an unbiased predictor and the (unconditional) distribution of its prediction error are known, 480–481
        prediction sets of minimum size, 482
        special case: the unbiased predictor is the best linear predictor and the distribution of its prediction error is MVN, 483–485
    when y and w follow an extended G–M model, 489–492
        “ellipsoidal” prediction set, 490
        extensions to nonnormal distributions, 492
        prediction intervals for the realization of a single unobservable random variable w, 490
        simultaneous prediction intervals, 490–491
        special case: joint distribution of y and w is MVN, 489–491
    when y and w follow an extended general linear or extended Aitken model, 488–489, 492
projection matrix, 60–62, 178–179, 185
    a generalization: WX(X′WX)⁻X′ (where W is symmetric and nonnegative definite), 214
    idempotency of, 61, 83
    rank of, 61
    symmetry of, 61
    trace of, 205
projection of a vector on a subspace, 185

Q
QR decomposition (of a matrix), 188
quadratic (functional) form, 62
    expression as a sum of squares, 68–69
    matrix of, 62
        nonnegative definiteness, positive definiteness, or positive semidefiniteness of, 63, 85
        nonuniqueness of, 62–63
        symmetric, 63
        triangular, 83
    nonnegative definiteness, positive definiteness, or positive semidefiniteness of, 63, 85
quadratic form(s) or 2nd-degree polynomial(s) (in a random vector), distribution of
    chi-squareness, 308–311, 325–326, 347–348, 350, 494
    Cochran’s theorem (and related results), 330–338, 350
    mean, see under expected value
    mgf, 305–308, 321–322, 348–349
    reexpression of x′Ax/x′Σ⁻x as a linear combination of random variables whose joint distribution is a Dirichlet distribution, 322–323
    reexpression of a quadratic form or 2nd-degree polynomial as a sum or linear combination of independently distributed random variables, 320–322
    statistical independence, 326–328, 344–345, 348–350, 494
        an extension (to include vectors of linear forms), 328–329
        vs. zero correlation, 330, 349
    variances and covariances, see under variance and under covariance (of 2 random variables)
    x′Ax/x′Σ⁻x ∼ Be[R/2, (P − R)/2]
        when ΣAΣAΣ = ΣAΣ and x ∼ N(0, Σ), 312–314
        when ΣAΣAΣ = ΣAΣ and x is distributed elliptically about 0 with var(x) ∝ Σ, 314

R
random vector or matrix, definition of, 87
rank, 35–37, 319–320
    of a diagonal or block-diagonal matrix, 44–45
    of a generalized inverse matrix, see under generalized inverse matrix
    of a partitioned matrix, 39–41, 47–48, 52
    of a product of matrices, 44
    of a projection matrix, see under projection matrix
    of a sum of matrices, 41
    of a triangular or block-triangular matrix, 45–46
    of an idempotent matrix, see under idempotent matrix (or matrices)
    of X′WX (where W is symmetric and nonnegative definite), 214
    of X′X, 61–62
regression, see under model(s) (statistical)
restricted (or residual) maximum likelihood (REML) estimator of a function of the parameter vector θ in the general linear model
    when [in the special case where the model is the Aitken model and θ = (σ²)] y ∼ N(Xβ, σ²H), 225
    when y ∼ N[Xβ, V(θ)], 222
    when y is distributed elliptically about Xβ, 235
row and column spaces, see under linear space(s)

S
scale equivariance, 250
Schur complement, 49, 70, 79–80
Simes inequality, 458–459
span (of a set of matrices), 33, 82
spectral decomposition, 318–323, 338–339, 348
spherical (or spherically symmetric) distribution, 226, 337–338, 346–348
    definition of, 226
    distribution of a sum of squares, see under noncentral chi-square distribution
    linear transformation of a spherically distributed random vector, 229–230
    marginal distributions, 230–231, 251
    mean vector and variance-covariance matrix, 226, 251
    mgf, 228–230
    pdf of, 226–228, 230–231
    symmetry of, 226
standard (univariate) normal distribution
    moments, 103–104
    pdf/definition of, 101–102, 293, 295
standard deviation, definition of, 89
standardized version of a random variable, 97–98
statistical inference (model-based parametric or predictive inference for parametric functions or for the realizations of unobservable random variables or for a vector of such entities), some forms of, 17
    multiple “comparisons”, 20
    “1-at-a-time” confidence intervals or sets, 18–20
    “1-at-a-time” hypothesis tests, 18–20
    point estimation or prediction, 17–19
    “simultaneous” confidence intervals or sets, 20
Student’s t distribution, see t distribution (central)
submatrices and subvectors, 26–27
    leading principal submatrices, 27
    principal submatrices, 26–27, 37

T
t distribution (central)
    definition of, 290
    kurtosis, 294
    mean and variance, 293–294
    moments, 292–294
    noninteger degrees of freedom, 294
    pdf of, 291–293, 346
    percentage points, 294–295
    relationship
        to spherical distributions, 295–296
        to the F distribution, 291, 295
        to the multivariate standard normal distribution, 295
        to the multivariate t distribution, 291
        to the noncentral t distribution, 290
    special case: Cauchy distribution, 292–293
    standardized version, 294–295
    symmetry of, 293
t interval, see under confidence set (ellipsoidal) for Λ′β [when y ∼ N(Xβ, σ²I)]
t test, 1-sided [for H_0^+: λ′β ≤ τ^(0) vs. H_1^+: λ′β > τ^(0) when y ∼ N(Xβ, σ²I)], 422–423
    invariance properties, 423
    multiple (wrt λ) tests or “comparisons”, 429–430
    optimality
        UMP invariant, 425–427
        UMP unbiased, 427, 498
    validity for nonnormal distributions, 430
t test, 2-sided, see under F test of H_0: Λ′β = τ^(0) [when y ∼ N(Xβ, σ²I)]
t upper or lower confidence bound [for λ′β when y ∼ N(Xβ, σ²I)], 422
    a generalization, 423
    confidence bands, 429
    invariance/equivariance properties, 423
    optimality, 427
    simultaneous (for multiple λ) upper or lower confidence bounds, 428–429
    validity for nonnormal distributions, 430
test of H_0: w ∈ S_0 (where w is an unobservable random variable or vector) based on the value of an observable random vector y
    multiple “comparisons”: tests (for j = 1, 2, …, M) of H_j^(0): w_j ∈ S_j^(0) (where w_j is an unobservable random variable)
        control of the FWER or k-FWER: use of the closure principle, 486–487, 504
        control of Pr(FDP > γ), 487–488
    when Pr(w ∈ S_0 | y) and the marginal distribution of y are known, 485–486, 501–502
    when (for some possibly vector-valued function z of y) Pr(w ∈ S_0 | z) and the marginal distribution of z are known, 486, 502
test of the null hypothesis H_0: Λ′β = τ^(0) (under the G–M model), see F test of H_0: Λ′β = τ^(0) [when y ∼ N(Xβ, σ²I)]
testability of the null hypothesis H_0: Λ′β = τ^(0) (under the G–M model), 365
trace, 31–32, 319–320
    of a partitioned matrix, 31
    of a product of matrices, 31–32, 81, 200
    of an idempotent matrix, see under idempotent matrix (or matrices)
transformation of a random vector
    to a vector of standardized random variables, 98
    to a vector of uncorrelated and standardized random variables, 99
translation invariance, 249, 432
    a maximal invariant: N − rank(X) linearly independent error contrasts, 218
    as applied to a quadratic form, 209
    in “general”, 208–209, 249

U
uncorrelated random variables or vectors, 95
uniform distribution on the surface of a (K+1)-dimensional ball
    definition of (in terms of a (K+1)-variate standard normal distribution), 262–264
    extended definition of [in terms of a (K+1)-variate spherical distribution], 265–267, 285–286
    pdf of marginal distributions, 262–264

V
Vandermonde matrix, 170–171
variance
    basic properties of, 90, 94, 100–101
    conditional, 100–101
    definition of, 89
    existence of, 89–90
    of a linear combination of random variables, 95–96, 200
    of a quadratic form (in a random vector), 203–204, 248
variance-covariance matrix
    basic properties of, 93–94, 97, 101
    conditional, 101
    definition of, 93
    for c + Ax, 96
    for c + Σ_{i=1}^N a_i x_i, 96
    of a partitioned random vector, 94
    positive definite vs. positive semidefinite, 97
vec (of a matrix)
    definition of, 198
    for a product of 3 matrices, 200
vech (of a symmetric matrix)
    definition of, 198–199
    duplication matrix, 199
vector space(s), see linear space(s)