Covariance Matrix 
Suppose that p variables have been measured on each of n sample individuals and the results have been displayed in an $n \times p$ data matrix. Write $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})'$ for the vector of p values observed on the ith individual, so that $x_i$ constitutes the ith row of the data matrix ($i = 1, \ldots, n$). In order to provide a framework for parametric inference, the $x_i$ are generally viewed as independent realizations of a random vector $X = (X_1, X_2, \ldots, X_p)'$, the distribution of which specifies the population from which the sample has been taken. Geometrically, the population can be represented as a swarm of points in p-dimensional space by associating each variable $X_j$ with an orthogonal axis in this space and assigning the observed value $x_i$ to the point with coordinates $(x_{i1}, x_{i2}, \ldots, x_{ip})$ on these axes (see Axes in Multivariate Analysis). The main characteristics of this swarm are its location in space and its dispersion. The former is specified by the mean vector

$$\mu = E(X) = [E(X_1), E(X_2), \ldots, E(X_p)]',$$

and the latter by the matrix

$$\Sigma = E(X - \mu)(X - \mu)',$$

which contains variances of the variables down its principal diagonal and covariances between every pair of variables in its off-diagonal positions.

Whether any more parameters are required fully to specify the population depends on the specific assumptions made about the distributional form of X. In the vast majority of practical applications, however, multivariate central limit arguments suggest that multivariate normality is a suitable assumption. In this case, the probability density function only depends on $\mu$ and $\Sigma$, so interest has focused very heavily on these two parameters within multivariate inference. In this article we concern ourselves with questions about $\Sigma$.

First, we consider its estimation. Maximum likelihood is the most commonly adopted method of obtaining estimates of parameters in a frequentist approach to inference. Assuming normality, the likelihood of the sample is

$$L = \frac{1}{(2\pi)^{np/2}\,|\Sigma|^{n/2}} \exp\left[-\frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)'\Sigma^{-1}(x_i - \mu)\right],$$

and a little algebra (see, for example, [1, pp. 60–65]) establishes that the maximum likelihood estimators of $\mu$ and $\Sigma$ are

$$\hat{\mu} = \bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)',$$

i.e. the sample mean vector (where $\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$), and

$$\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})'.$$

If we write

$$A = \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})',$$

then the diagonal elements of A are the corrected sums of squares $\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2$ of each variable, the off-diagonal elements are the corrected sums of products $\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$ between every pair of variables, and

$$\hat{\Sigma} = \frac{1}{n}A.$$

The sampling distribution of A was derived first by Wishart [17] (see Wishart Distribution), whose results show that $E(A) = (n - 1)\Sigma$. Hence the maximum likelihood estimator of $\Sigma$ is biased. When corrected for bias we obtain the estimator

$$S = \frac{1}{n-1}A,$$

which has the usual sample variances of each variable down the main diagonal and sample covariances between every pair of variables in the off-diagonal positions. This is the sample covariance matrix; it is the estimator of $\Sigma$ preferred by many practitioners.
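As a concrete illustration of these two estimators, the following sketch (my own, not part of the original article; it assumes NumPy and uses made-up data and variable names) computes A, $\hat{\Sigma}$ and S from an $n \times p$ data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                       # sample size and number of variables
X = rng.normal(size=(n, p))        # n x p data matrix, rows are the observations x_i

x_bar = X.mean(axis=0)             # sample mean vector
D = X - x_bar                      # centred data
A = D.T @ D                        # corrected sums of squares and products matrix

Sigma_hat = A / n                  # maximum likelihood estimator of Sigma
S = A / (n - 1)                    # unbiased sample covariance matrix

# np.cov with rowvar=False uses the 1/(n - 1) divisor, so it should agree with S
assert np.allclose(S, np.cov(X, rowvar=False))
```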
Adopting approaches to inference other than the frequentist produces different estimators of $\Sigma$. The two main approaches are decision theoretic and Bayesian, and we now briefly summarize the competitor estimates under each of these philosophies.

In decision theory we are required to supply a loss function $l(\Sigma, T)$ that quantifies the “loss” incurred when $\Sigma$ is estimated by T. The expectation of this loss over the distribution of the data defines the risk function $R(\Sigma, T)$ associated with that loss, and this risk function is used to compare different estimators $T_i$. An estimator $T_1$ beats another estimator $T_2$ if $R(\Sigma, T_1) \le R(\Sigma, T_2)$ for all $\Sigma$ and $R(\Sigma, T_1) < R(\Sigma, T_2)$ for at least one $\Sigma$, and an estimator is admissible, i.e. “best”, if no other estimator beats it. The unbiased estimator S turns out to be the best estimator of the form $\alpha A$ under the loss function

$$l(\Sigma, T) = \operatorname{tr}(\Sigma^{-1}T) - \log\det(\Sigma^{-1}T) - p,$$

but other (more complicated) estimators are best if we either look outside the class of estimators of the form $\alpha A$ or consider other loss functions. Muirhead [11, pp. 128–136] summarizes the main results.
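This loss function is often called the entropy or Stein loss. A small Monte Carlo sketch (my own illustration, not from the article; it assumes NumPy and invented simulation settings) can be used to compare the risks of estimators of the form $\alpha A$; the multiplier $\alpha = 1/(n-1)$, i.e. the estimator S, should come out with the smallest estimated risk.

```python
import numpy as np

def entropy_loss(Sigma, T):
    """l(Sigma, T) = tr(Sigma^-1 T) - log det(Sigma^-1 T) - p."""
    p = Sigma.shape[0]
    M = np.linalg.solve(Sigma, T)            # Sigma^{-1} T
    _, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - p

rng = np.random.default_rng(1)
n, p = 20, 4
Sigma = np.eye(p)
alphas = [1 / (n + 1), 1 / n, 1 / (n - 1), 1 / (n - 2)]
risks = dict.fromkeys(alphas, 0.0)
reps = 2000

# Monte Carlo estimate of the risk R(Sigma, alpha*A) for each multiplier alpha
for _ in range(reps):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    D = X - X.mean(axis=0)
    A = D.T @ D
    for a in alphas:
        risks[a] += entropy_loss(Sigma, a * A) / reps

for a, r in risks.items():
    print(f"alpha = {a:.4f}  estimated risk = {r:.4f}")
```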
Turning to the Bayesian approach, it is first necessary to specify a joint prior distribution for all the unknown parameters. This is combined with the likelihood of the data to yield a joint posterior distribution of the parameters. Any parameters not of direct interest are then integrated out to give a marginal distribution of the parameters to be estimated, and a suitable summary measure of this marginal distribution (typically the mode) provides the estimator of the parameters. Assuming again that sampling is from a normal distribution, Press [14, p. 168] suggests using the “natural conjugate” prior

$$\pi(\mu, \Sigma) \propto |\Sigma|^{-(m+1)/2}\exp\left\{-\frac{1}{2}\left[\operatorname{tr}\Sigma^{-1}G + (\mu - \phi)'\Sigma^{-1}(\mu - \phi)\right]\right\},$$

where $\phi$, G, and $m > 2p - 1$ are parameters the values of which quantify the prior knowledge about $\Sigma$. Following through the above steps, Press then shows that the Bayes estimator of $\Sigma$ is

$$\frac{A + G + \dfrac{n}{1+n}(\bar{x} - \phi)(\bar{x} - \phi)'}{n + m - 2p - 2}.$$

If there is no prior knowledge about $\Sigma$, then a suitable choice of prior distribution is

$$\pi(\mu, \Sigma) \propto |\Sigma|^{-(p+1)/2},$$

which yields the Bayes estimator $[1/(n - p - 2)]A$ on working through the same steps as before. Further results concerning Bayes estimation of $\Sigma$ are given by Dickey et al. [3] and Leonard & Hsu [9].
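For comparison with the frequentist estimators, a minimal sketch (my own; the conjugate-prior line implements the formula as reconstructed above and should be treated as illustrative, and all function names, prior settings and data are invented) computing these Bayes-type estimators might look as follows.

```python
import numpy as np

def dispersion_estimators(X, G=None, phi=None, m=None):
    """Frequentist and Bayes-type estimators of the dispersion matrix."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    D = X - xbar
    A = D.T @ D

    est = {
        "maximum likelihood, A/n": A / n,
        "unbiased S, A/(n-1)": A / (n - 1),
        "vague prior, A/(n-p-2)": A / (n - p - 2),
    }
    if G is not None:
        d = np.atleast_2d(xbar - phi)                          # row vector xbar - phi
        numer = A + G + (n / (1 + n)) * (d.T @ d)
        est["conjugate prior"] = numer / (n + m - 2 * p - 2)   # formula as reconstructed above
    return est

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
for name, T in dispersion_estimators(X, G=np.eye(3), phi=np.zeros(3), m=10).items():
    print(name, np.round(np.diag(T), 3))
```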
All of the above results are appropriate when sampling from a multivariate normal distribution. In recent years there has been some interest in theory associated with elliptic distributions. These distributions share many of the features of the normal distribution (which is itself a member of this class of distributions), but they encompass distributions such as the multivariate t, the multivariate Cauchy, and the multivariate logistic, all of which have heavier tails than the multivariate normal. Elliptic distributions thus provide good models for data involving either outliers or other contaminants. An elliptic distribution with mean $\mu$ and dispersion matrix $\Sigma$ has a density function of the form

$$f(x) = |\Sigma|^{-1/2}\,\psi[(x - \mu)'\Sigma^{-1}(x - \mu)]$$

for some function $\psi(\cdot)$. Fang & Zhang [4] show that if $x_1, x_2, \ldots, x_n$ form a random sample from this distribution, then the maximum likelihood estimator of $\Sigma$ is $\lambda_0 A$, where $\lambda_0$ is the value of $\lambda$ that maximizes the function $\phi(\lambda) = \lambda^{-np/2}\psi(p/\lambda)$. It is easy to check that in the case of a normal distribution, $\psi(z) = (2\pi)^{-p/2}\exp(-z/2)$, so that $\lambda_0 = 1/n$ and we recover the maximum likelihood estimator $\hat{\Sigma}$.
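As a quick numerical check of this last statement (my own illustration with arbitrary values of n and p, assuming NumPy), the normal-theory choice of $\psi$ does give a maximizer of $\phi(\lambda)$ close to 1/n.

```python
import numpy as np

n, p = 25, 3
lams = np.linspace(0.005, 0.5, 2000)

# log phi(lambda) = -(np/2) log(lambda) + log psi(p/lambda) with the normal-theory
# psi(z) = (2*pi)^(-p/2) exp(-z/2), evaluated in log form to avoid underflow
log_phi = -(n * p / 2) * np.log(lams) - (p / 2) * np.log(2 * np.pi) - p / (2 * lams)

print(lams[np.argmax(log_phi)], 1 / n)   # both values should be about 0.04
```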
Despite all of the above results, in the overwhelming majority of practical applications $\Sigma$ is routinely estimated in frequentist fashion either by the maximum likelihood estimator $\hat{\Sigma} = (1/n)A$ or by the unbiased matrix $S = [1/(n - 1)]A$, so we restrict our attention to these estimators for the rest of the present section.

The asymptotic distribution of S provides a mechanism for obtaining large-sample inferences about $\Sigma$ without making any assumptions of normality for the data. We simply require $x_1, \ldots, x_n$ to be independent realizations of the random vector $X = (X_1, X_2, \ldots, X_p)'$, the distribution of which has mean vector $\mu$ and dispersion matrix $\Sigma = (\sigma_{ij})$. There are $\frac{1}{2}p(p+1)$ distinct elements of S, and these elements can be written as a vector s. [One common way of doing this is by stacking successive columns of the lower-triangular portion of S on top of each other in a column vector; such a vector is denoted vech(S).] The corresponding vector representation of $\Sigma$ can be denoted $\sigma = \operatorname{vech}(\Sigma)$. Layard [8] has studied the joint distribution of elements of s. He uses the multivariate central limit theorem to show that, asymptotically for $n \to \infty$, s has a multivariate normal distribution in which the mean vector is $\sigma$, the variance of any element $s_{jk}$ is

$$\frac{1}{n}\{E(Z_j^2 Z_k^2) - [E(Z_j Z_k)]^2\},$$

and the covariance between any two elements $s_{jk}$ and $s_{ms}$ is

$$\frac{1}{n}[E(Z_j Z_k Z_m Z_s) - E(Z_j Z_k)E(Z_m Z_s)],$$

where $Z_i = X_i - \mu_i$ for $i = 1, \ldots, p$. Convergence to normality can be speeded up by transforming the elements of s, taking logarithms of the variances $s_{jj}$ and using $\tanh^{-1}[s_{jk}/(s_{jj}s_{kk})^{1/2}]$ in place of the covariances $s_{jk}$. However, this improvement in speed of convergence comes at the expense of complicating the terms in the asymptotic dispersion matrix. Details are given by Seber [16, pp. 99–101]. Asymptotically, of course, $\hat{\Sigma}$ has the same distribution as S.
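A small sketch (mine, not the article's; it assumes NumPy and uses invented data) of the vech operation and the variance-stabilizing transformation just described:

```python
import numpy as np

def vech(M):
    """Stack successive columns of the lower triangle of a symmetric matrix."""
    p = M.shape[0]
    return np.concatenate([M[j:, j] for j in range(p)])

def transformed_elements(S):
    """Log the variances and apply tanh^-1 to s_jk/(s_jj s_kk)^(1/2),
    the transformation suggested above for speeding convergence to normality."""
    p = S.shape[0]
    T = np.zeros_like(S)
    for j in range(p):
        for k in range(p):
            if j == k:
                T[j, j] = np.log(S[j, j])
            else:
                T[j, k] = np.arctanh(S[j, k] / np.sqrt(S[j, j] * S[k, k]))
    return T

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
S = np.cov(X, rowvar=False)
print(vech(S))                     # the (1/2) p (p+1) distinct elements of S
print(transformed_elements(S))
```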
This asymptotic distribution enables large-sample (approximate) confidence regions and hypothesis tests to be constructed for elements of $\Sigma$, irrespective of the distribution from which the sample has been drawn (see Large-sample Theory). However, for small-sample exact tests or for tests of specified structure of $\Sigma$ we need to assume normality of the data. Moreover, even then it is virtually impossible to employ optimal theory of hypothesis testing, as uniformly most powerful tests are derivable only in rather artificial circumstances. In most practical circumstances, therefore, recourse must be made to some general principle that can be relied on to produce a “good” test. The principle of invariance will often focus attention on a particular class of test statistics within which to search, but may not necessarily pinpoint one specific test. To do this, the most common approaches are to use either the likelihood ratio or the union–intersection principles of test construction.

Suppose that the null hypothesis $H_0$ imposes a set of d constraints on the parameters, say $\theta = \theta_0$, where $\theta$ has d elements, and the alternative hypothesis $H_a$ is the general “not $H_0$”. Usually, also, there will be other (nuisance) parameters $\psi$. Write $l(\hat{\theta}, \hat{\psi})$ for the log likelihood of the sample when $\hat{\theta}$ and $\hat{\psi}$ are unconstrained maximum likelihood estimates of all the parameters, and $l(\theta_0, \hat{\psi}_0)$ for the log likelihood when $\theta = \theta_0$ and $\hat{\psi}_0$ is the maximum likelihood estimate of $\psi$ conditional on $\theta = \theta_0$. Then, under regularity conditions, the likelihood ratio test statistic is

$$\omega = 2l(\hat{\theta}, \hat{\psi}) - 2l(\theta_0, \hat{\psi}_0)$$

(or some monotonic function of $\omega$).

On the other hand, any null hypothesis involving d > 1 constraints can be regarded as the intersection of an infinite set of simpler hypotheses. For example, $\theta = \theta_0$ implies $a'\theta = a'\theta_0$ for any vector a. This is a univariate hypothesis, and univariate theory will generally supply some test statistic, V say, for this hypothesis. Finding the value of a (up to a multiplying factor) that maximizes V, choosing this hypothesis and then testing it (making due allowance for the maximization) is the basis of the union–intersection principle of test construction.

A full account of these principles of test construction can be found in most multivariate textbooks; see, for example, Mardia et al. [10]. Here we simply summarize the test statistics and their null distributions for the most common tests about $\Sigma$; illustrative computational sketches of the likelihood ratio versions are given below, after the list. In all of these tests we assume normality of the data, unknown population mean vector $\mu$, and the general alternative $H_a$: not $H_0$. Thus, for deriving the likelihood ratio test statistic, $\theta$ and $\psi$ above are $\Sigma$ and $\mu$ respectively, while the unconstrained maximum likelihood estimators are given by $\hat{\Sigma} = (1/n)A$ and $\hat{\mu} = \bar{x}$ as above.
1. $H_0$: $\Sigma = \Sigma_0$, a specified matrix. The likelihood ratio test statistic is $\omega = n\operatorname{tr}(\Sigma_0^{-1}\hat{\Sigma}) - n\log|\Sigma_0^{-1}\hat{\Sigma}| - np = np(a - \log g - 1)$, where a and g are the arithmetic and geometric means of the eigenvalues of $\Sigma_0^{-1}\hat{\Sigma}$. For the exact null distribution of this statistic, see Anderson [1] and Korin [6]. However, this distribution is not easy to use, so recourse has to be made to the general result that asymptotically $\omega$ has a $\chi^2_{\frac{1}{2}p(p+1)}$ distribution (chi-square distribution with $\frac{1}{2}p(p+1)$ degrees of freedom) under $H_0$. The union–intersection statistic, on the other hand, is a function of just the extreme eigenvalues of $\Sigma_0^{-1}\hat{\Sigma}$. This test rejects $H_0$ if either $\lambda_p < c_1$ or $\lambda_1 > c_2$, where $\lambda_i$ is the ith largest eigenvalue of $\Sigma_0^{-1}\hat{\Sigma}$ and $c_1$ and $c_2$ are chosen to make the size of the test $\alpha$ (see Level of a Test). Tables for carrying out this test are given in Pearson & Hartley [13].

2. $H_0$: $\Sigma = k\Sigma_0$, for unknown k. The maximum likelihood estimate of k is given by $\hat{k} = \operatorname{tr}(\Sigma_0^{-1}\hat{\Sigma})/p$ and the likelihood ratio statistic is $\omega = np\log(a_0/g_0)$, where $a_0$ and $g_0$ are the arithmetic and geometric means of the eigenvalues of $\Sigma_0^{-1}\hat{\Sigma}$. Asymptotically this statistic has a $\chi^2_{\frac{1}{2}(p-1)(p+2)}$ distribution under $H_0$. The special case of $\Sigma_0 = I$ leads to the sphericity test, for which we have $a_0 = (1/p)\operatorname{tr}\hat{\Sigma}$ and $g_0 = |\hat{\Sigma}|^{1/p}$. No straightforward union–intersection tests exist in these situations, but Olkin & Tomsky [12] give some modified versions. The test of sphericity plays an important role in analysis of variance. In general, the data vector should have a covariance matrix consonant with the sphericity hypothesis for the F tests on means in this analysis to be valid. Also, more particular model structures can be reduced to this hypothesis and tested. The most important of these is the usual covariance structure assumed for repeated measures data (see Longitudinal Data Analysis, Overview), in which all variances (diagonal elements of $k\Sigma_0$) are assumed to be equal to $\sigma^2$ and all covariances (off-diagonal elements of $k\Sigma_0$) are assumed to be equal to $\rho\sigma^2$. It can be shown that any p-element random vector $X = (X_1, \ldots, X_p)'$ satisfies this covariance structure if and only if the $(p-1)$-element vector $Y = CX$ satisfies the sphericity hypothesis, where C is any $(p-1) \times p$ matrix the rows of which are orthogonal to each other and to the vector $\mathbf{1} = (1, 1, \ldots, 1)'$.

3. $H_0$: $\Sigma$ is diagonal. This is the hypothesis that all the variables are uncorrelated with each other. Under $H_0$, the mean and variance of each variable are estimated separately, whence $\hat{\Sigma}_0^{-1}\hat{\Sigma} = R$, the sample correlation matrix. This has trace p, so $\omega = -n\log|R|$. Under $H_0$, $\omega$ has an asymptotic $\chi^2_{\frac{1}{2}p(p-1)}$ distribution; Box [2] showed that the $\chi^2$ approximation is improved if n is replaced by $n' = n - \frac{1}{6}(2p + 11)$. There is no straightforward union–intersection test in this case either.

Of course, the maximum likelihood estimate $\hat{\Sigma}$ is equal to $[(n - 1)/n]S$, so each of the above test statistics can be expressed in terms of S if so desired.
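The three likelihood ratio statistics above translate directly into code. The sketch below (my own illustration, not from the article; it assumes NumPy and SciPy, uses the asymptotic chi-square approximations quoted in the list, and all function names and data are invented) shows one way of computing them.

```python
import numpy as np
from scipy import stats

def _sigma_hat(X):
    """Maximum likelihood estimator (1/n) A of the dispersion matrix."""
    D = X - X.mean(axis=0)
    return (D.T @ D) / X.shape[0]

def lr_test_sigma_equals(X, Sigma0):
    """Test 1: H0: Sigma = Sigma0, a fully specified matrix."""
    n, p = X.shape
    eig = np.linalg.eigvals(np.linalg.solve(Sigma0, _sigma_hat(X))).real
    a, log_g = eig.mean(), np.mean(np.log(eig))      # arithmetic mean, log geometric mean
    omega = n * p * (a - log_g - 1)
    return omega, stats.chi2.sf(omega, p * (p + 1) / 2)

def lr_test_sphericity(X):
    """Test 2 with Sigma0 = I: H0: Sigma = k*I for some unknown k."""
    n, p = X.shape
    S_hat = _sigma_hat(X)
    a0 = np.trace(S_hat) / p
    g0 = np.linalg.det(S_hat) ** (1.0 / p)
    omega = n * p * np.log(a0 / g0)
    return omega, stats.chi2.sf(omega, (p - 1) * (p + 2) / 2)

def lr_test_diagonal(X):
    """Test 3: H0: Sigma is diagonal, with Box's n' = n - (2p + 11)/6 correction."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    omega = -(n - (2 * p + 11) / 6) * np.log(np.linalg.det(R))
    return omega, stats.chi2.sf(omega, p * (p - 1) / 2)

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(4), np.eye(4), size=80)
print(lr_test_sigma_equals(X, np.eye(4)))
print(lr_test_sphericity(X))
print(lr_test_diagonal(X))
```

For the repeated measures structure mentioned under test 2, the sphericity function would be applied to the transformed data $Y = CX$ rather than to X itself.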
Properties such as the unbiasedness and invariance of these statistics are discussed by Giri [5, Chapter 8] and Muirhead [11, Chapters 8 and 11]. Muirhead also details modifications to the statistics in order to ensure unbiased tests, and establishes asymptotic null and nonnull distributional results for samples from elliptic as well as from normal distributions. A general review of all the tests, along with some significance levels, is provided by Krishnaiah & Lee [7].

One other problem of common interest is the testing of equality of dispersion matrices in several multivariate populations, since the assumption of equal dispersion matrices is made in multivariate techniques such as canonical variate analysis (see Canonical Correlation) and multivariate analysis of variance. The likelihood ratio test is a generalization of Bartlett's test of homogeneity of variance in univariate populations. We assume that random samples of sizes $n_1, n_2, \ldots, n_g$ are available from each of g populations, and we write $N = \sum_{i=1}^{g} n_i$. Suppose that $A_i$ is the sums of squares and products matrix for the sample from population i, so that $\hat{\Sigma}_i = (1/n_i)A_i$ is the maximum likelihood estimator of the dispersion matrix for this population and $S_i = [1/(n_i - 1)]A_i$ is the unbiased version. Under the null hypothesis that all dispersion matrices are equal to $\Sigma$, we have $\hat{\Sigma} = (1/N)\sum_{i=1}^{g} A_i$ and the corresponding unbiased version $S = [1/(N - g)]\sum_{i=1}^{g} A_i$. (The latter matrix is known as the pooled within-sample covariance matrix.) Then the likelihood ratio test statistic for testing the null hypothesis against the general alternative that at least one dispersion matrix differs from the rest is $N\log|\hat{\Sigma}| - \sum_{i=1}^{g} n_i\log|\hat{\Sigma}_i|$, and under the null hypothesis this statistic is asymptotically distributed as $\chi^2_{\frac{1}{2}p(p+1)(g-1)}$. Box [2] proposed the alternative statistic $(N - g)\log|S| - \sum_{i=1}^{g}(n_i - 1)\log|S_i|$, which has the same asymptotic chi-square distribution under the null hypothesis. He also gave an F approximation to the null distribution, and tables based on this latter approximation are given by Seber [16].
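A sketch of Box's chi-square form of this test (again my own illustration, assuming NumPy and SciPy, with invented group sizes and function names):

```python
import numpy as np
from scipy import stats

def equal_dispersion_lr_test(samples):
    """Box's chi-square form of the test that several populations share a common
    dispersion matrix; `samples` is a list of (n_i x p) data matrices."""
    g = len(samples)
    p = samples[0].shape[1]
    A_list, n_list = [], []
    for X in samples:
        D = X - X.mean(axis=0)
        A_list.append(D.T @ D)
        n_list.append(X.shape[0])
    N = sum(n_list)
    S_pooled = sum(A_list) / (N - g)                   # pooled within-sample covariance matrix
    # Box's statistic: (N - g) log|S| - sum_i (n_i - 1) log|S_i|
    M = (N - g) * np.log(np.linalg.det(S_pooled))
    for A, n_i in zip(A_list, n_list):
        M -= (n_i - 1) * np.log(np.linalg.det(A / (n_i - 1)))
    df = p * (p + 1) * (g - 1) / 2
    return M, stats.chi2.sf(M, df)

rng = np.random.default_rng(6)
groups = [rng.multivariate_normal(np.zeros(3), np.eye(3), size=m) for m in (30, 40, 35)]
print(equal_dispersion_lr_test(groups))
```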
The union–intersection approach is viable in the special case g = 2, and produces a test based on the largest and smallest eigenvalues of $S_1 S_2^{-1}$, with tables given by Schurrmann et al. [15]. However, this test does not generalize easily to the case g > 2.

This section has been concerned with inferential aspects of the sample covariance matrix. This matrix is at the heart of many multivariate techniques; see especially principal components analysis and factor analysis.

References

[1] Anderson, T.W. (1984). An Introduction to Multivariate Statistical Analysis, 2nd Ed. Wiley, New York.
[2] Box, G.E.P. (1949). A general distribution theory for a class of likelihood criteria, Biometrika 36, 317–346.
[3] Dickey, J.M., Lindley, D.V. & Press, S.J. (1985). Bayesian estimation of the dispersion matrix of a multivariate normal distribution, Communications in Statistics – Theory and Methods 14, 1019–1034.
[4] Fang, K.-T. & Zhang, Y.-T. (1990). Generalized Multivariate Analysis. Science Press, Beijing/Springer-Verlag, Berlin.
[5] Giri, N.C. (1977). Multivariate Statistical Inference. Academic Press, New York.
[6] Korin, B.P. (1968). On the distribution of a statistic used for testing a covariance matrix, Biometrika 55, 171–178.
[7] Krishnaiah, P.R. & Lee, J.C. (1980). Likelihood ratio tests for mean vectors and covariance matrices, in Handbook of Statistics, Vol. 1, P.R. Krishnaiah, ed. North-Holland, Amsterdam, pp. 513–570.
[8] Layard, M.W.J. (1972). Large sample tests for the equality of two covariance matrices, Annals of Mathematical Statistics 43, 123–141.
[9] Leonard, T. & Hsu, J.S.J. (1992). Bayesian inference for a covariance matrix, Annals of Statistics 20, 1669–1696.
[10] Mardia, K.V., Kent, J.T. & Bibby, J.M. (1979). Multivariate Analysis. Academic Press, London.
[11] Muirhead, R.J. (1982). Aspects of Multivariate Statistical Theory. Wiley, New York.
[12] Olkin, I. & Tomsky, T.L. (1975). A new class of multivariate tests based on the union–intersection principle, Bulletin of the International Statistical Institute 46, Part 4, 202–204.
[13] Pearson, E.S. & Hartley, H.O. (1972). Biometrika Tables for Statisticians, Vol. 2. Cambridge University Press, Cambridge.
[14] Press, S.J. (1972). Applied Multivariate Analysis. Holt, Rinehart & Winston, New York.
[15] Schurrmann, F.J., Waikar, V.B. & Krishnaiah, P.R. (1973). Percentage points of the joint distribution of the extreme roots of the random matrix $S_1(S_1 + S_2)^{-1}$, Journal of Statistical Computation and Simulation 2, 17–38.
[16] Seber, G.A.F. (1984). Multivariate Observations. Wiley, New York.
[17] Wishart, J. (1928). The generalized product moment distribution in samples from a normal multivariate distribution, Biometrika 20A, 32–52 (correction: 20A, 424).

(See also Inference, Foundations of; Multivariate Analysis, Overview; Multivariate Bartlett Test)

W.J. KRZANOWSKI
