STAT 8250 | Applied Multivariate Analysis

Lecture Notes

Introduction

\Multivariate" refers to the presence of multiple


random variables.
So Multivariate Data are data that are thought of
as the realizations of several random variables,
and Multivariate Analysis can be de ned broadly
as an inquiry into the structure of interrelationships
among multiple random variables.
Haven't we already studied that?
(e.g., Regression, Anova)

1
Regression:
Simple: explain variability in Y from X.
Multiple: explain variability in Y from X1, X2, ..., Xr.
Anova:
One-way: explain variability in Y based on X, where X is a factor.
These examples are not usually considered to be
multivariate analyses.
Why?
We distinguish between the Response Variable (Y) and the Explanatory Variables (X's), and in the analysis we treat the X's as fixed.
⇒ only 1 random variable in each situation.
(multiple regression ≠ multivariate regression)

2
One type of analysis that is a good example of a
multivariate method is correlation analysis.
Recall:
Correlation Analysis: measure linear association between X and Y without distinguishing response versus explanatory.
Other examples:
M'variate 1 & 2 Samp. Inference: Hypothesis tests and confidence intervals for the mean vector(s) from one (or two) group(s). Procedures analogous to t and Z in the univariate setting.
MANOVA: Acronym for multivariate analysis of variance. Inference for mean vectors from several populations.
Multivariate Linear Regression: Explain variability in response variables Y1, Y2, ..., Ym based on explanatory variables X1, ..., Xr.

3
Principal Components Analysis: Reduce the dimension of a data set by examining a small number of linear combinations of the original variables that explain most of the variability among the original variables.
Factor Analysis: Similar to P.C.A. Describe the variance/covariance relationships among many variables in terms of a few underlying, but unobservable, random quantities called factors.
Canonical Correlation Analysis: Determine the linear association between two sets of random variables. Find a linear combination of several X's and a linear combination of several Y's such that the two linear combinations have maximum correlation.
Classification and Discrimination: Methods for separating distinct sets of objects (experimental units) and for allocating (classifying) objects to previously defined groups.

4
Cluster Analysis: Methods for grouping objects based on distances (which objects are most similar to which others). Groups are not predefined and no assumptions are made concerning the number of groups or the group structure.
Others: Multidimensional Scaling, Repeated Measures Analyses, Path Analysis, Econometric Methods.

5
Variable- vs. Individual-Directed Methods
Variable-directed Techniques: primarily concerned with relationships among the response variables being measured. E.g., PCA, FA, M'var. Reg., Can. Corr.
Individual-directed Techniques: primarily concerned with relationships among the experimental units or individuals being measured. E.g., Discrim. Anal., Clustering, MANOVA.
• Some methods can be used as either variable-directed or individual-directed. E.g., clustering can be done to group individuals or to group variables.
• Multivariate methods also differ with respect to how informal or exploratory their typical uses are. E.g., cluster analysis and P.C.A. are often preliminary or exploratory tools used as part of a larger analysis.

6
Theory vs. Methods:
(1) problem: estimate population mean μ (given x_1, ..., x_n)
model: X_1, ..., X_n iid ~ N(μ, σ²), with μ, σ² unknown
theory: X̄ ~ N(μ, σ²/n),
(n-1)S²/σ² ~ χ²(n-1),
X̄, S² independent,
⇒ √n(X̄ - μ)/S ~ t(n-1)
method: x̄ ± t_{1-α/2}(n-1) s/√n forms a 100(1-α)% C.I. for μ
application: n = 20, α = 0.10
x̄ = 49.70, s = 0.32
t_{.95}(19) = 1.729
90% CI = 49.70 ± 1.729(0.32/√20) = (49.58, 49.82)
(2) (problem → method → application) is not sufficient.
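As a quick illustration of the "method/application" step above, here is a minimal sketch of the same interval computed with SciPy (the course itself uses SAS; this is only a check of the arithmetic):

```python
# 90% t-interval for the mean, using the summary numbers quoted above.
from scipy import stats
import numpy as np

n, alpha = 20, 0.10
xbar, s = 49.70, 0.32

t_crit = stats.t.ppf(1 - alpha / 2, n - 1)   # t_{.95}(19) = 1.729
half_width = t_crit * s / np.sqrt(n)
print(xbar - half_width, xbar + half_width)  # approximately (49.58, 49.82)
```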

7
(3) validity of the method depends on the validity of the model. Is the model valid? (need to know something about iid ~ N(μ, σ²)) If not valid, how is the method affected? Alternatives?
(4) How do we interpret the results of our application? (need to understand what the model means)

8
Preliminaries

1. Vector and matrix algebra.


2. Multivariate versions of basic statistical con-
cepts.
3. The multivariate normal distribution.
Vector and Matrix Algebra
(See section 1.4 and chapter 2 of Johnson and Wichern.)
Definitions:
Matrix: a rectangular array of numbers with n rows and p columns,
A = (a_ij) = [ a_11  a_12  ...  a_1p
               a_21  a_22  ...  a_2p
               ...   ...        ...
               a_n1  a_n2  ...  a_np ],   i = 1, ..., n;  j = 1, ..., p.
9
e.g.,
A = [  2  1
       1  0
      -2  3 ]   (n = 3, p = 2)

Dimension of a Matrix: the ordered pair (n, p). We say that A is n × p.
Vector: a matrix with p = 1 column.
x = [ x_1          e.g.,  x = [ 2
      x_2                       1
      ...                       2 ]
      x_n ]
Note: A = (a_1, a_2, ..., a_p). A "vector" in this course will always refer to a column array of numbers. Some authors define row vectors and column vectors separately.
Scalar: a real valued variable, or a 1 × 1 matrix.
We will typically denote matrices with bold capitals, vectors with bold lower case, and scalars with regular type lower case.
10
Basic Operations:
Transpose: If A is n × p then A' is the p × n matrix obtained by interchanging the rows and columns of A. E.g.,
A' = [ 2  1  -2
       1  0   3 ]
If we want a row vector version of x then we use x' (= (2 1 2) in the example).
A matrix with the property A = A' is called symmetric.
Scalar Multiplication: A scalar, c, times a matrix A is given by cA = (c a_ij). E.g.,
2A = [  4  2
        2  0
       -4  6 ]

11
Addition: A = (a_ij) added to B = (b_ij) is given by
A + B = (a_ij + b_ij).
A and B must have the same dimension. E.g., for A as before and B given by
B = [  1  -2
      -3   1
       0  -2 ]
we get
A + B = [  3  -1
          -2   1
          -2   1 ]
Subtraction: A - B = A + (-1)B (A, B of the same dimension). E.g.,
A - B = [  1   3
           4  -1
          -2   5 ]
(Matrix) Multiplication: AB is defined if and only if A has the same number of columns as B has rows; that is, if and only if A and B are conformable. AB will have as many rows as A and as many columns as B.

12
If A_{m×n} = (a_ij) and B_{n×p} = (b_ij) are conformable then the m × p matrix product AB has (i, j)th element
Σ_{k=1}^{n} a_ik b_kj.
E.g., take A_{3×2} as before and let B be given by
B_{2×4} = [ 2  0  -1  1
            1  3   1  2 ]
Then AB is a 3 × 4 matrix given by
AB = [  2  1                           [  5  3  -1  4
        1  0  [ 2  0  -1  1         =     2  0  -1  1
       -2  3 ]  1  3   1  2 ]            -1  9   5  4 ]
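A quick NumPy check of the 3 × 4 product above (illustration only; the signs in the scanned notes are reconstructed, so the matrices below reflect that reconstruction):

```python
import numpy as np

A = np.array([[2, 1],
              [1, 0],
              [-2, 3]])
B = np.array([[2, 0, -1, 1],
              [1, 3,  1, 2]])

print(A @ B)
# [[ 5  3 -1  4]
#  [ 2  0 -1  1]
#  [-1  9  5  4]]
```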

13
Special cases:
1. Matrix times a vector:
Ax = [ a_11  a_12  ...  a_1p     [ x_1
       a_21  a_22  ...  a_2p       x_2
       ...                         ...
       a_n1  a_n2  ...  a_np ]     x_p ]
   = [ a_11 x_1 + a_12 x_2 + ... + a_1p x_p
       a_21 x_1 + a_22 x_2 + ... + a_2p x_p
       ...
       a_n1 x_1 + a_n2 x_2 + ... + a_np x_p ]   (n × 1)
   = x_1 a_1 + x_2 a_2 + ... + x_p a_p = Σ_{i=1}^{p} x_i a_i,
a linear combination of the columns of A.

14
2. Dot product or inner product:
x'y = (x_1 x_2 ... x_p) [ y_1
                          y_2
                          ...
                          y_p ]  = Σ_{i=1}^{p} x_i y_i
Sometimes written x · y or <x, y>. Note: x'x = Σ_{i=1}^{p} x_i² (sum of squared elements).
3. Column vector times row vector:
xy' = [ x_1
        x_2
        ...
        x_p ] (y_1 y_2 ... y_p) = (x_i y_j)_{p×p}
3.a. Vector times its transpose:
xx' = [ x_1
        x_2
        ...
        x_p ] (x_1 x_2 ... x_p) = (x_i x_j)_{p×p}

15
Properties:
1. A + B = B + A.
2. (A + B) + C = A + (B + C).
3. c(A + B) = cA + cB.
4. (c + d)A = cA + dA.
5. (cd)A = c(dA).
6. c(AB) = (cA)B.
7. A(BC) = (AB)C.
8. A(B + C) = AB + AC.
9. (A + B)C = AC + BC.
10. (A + B)' = A' + B'.
11. (cA)' = cA'.
12. (AB)' = B'A'.
13. AB ≠ BA in general. E.g.,
A = [ 1  1       B = [ 0  1
      0  2 ],          1  2 ]
AB = [ 1  3      BA = [ 0  2
       2  4 ],         1  5 ]
BA may not be defined; if it is, AB and BA may not have the same dimension; if dim(AB) = dim(BA), that still doesn't imply that AB = BA.

16
Geometrical Concepts:
Vector: a line segment from the origin (0, all coordinates equal to zero) to the point indicated by the coordinates of the (algebraic) vector.
e.g., x = (1, 3, 2)'

Vector Addition: (figure in original notes)

17
Scalar Multiplication: Multiplication by a scalar scales a vector by shrinking or extending the vector in the same direction.
Length (Norm) of Vector: L_x = (x'x)^{1/2} is the length, or norm, of x. Sometimes written as ||x||. L_x^{-1} x is the vector of length 1 which lies in the direction of x.
Angle Between Vectors: If θ is the angle between two vectors x and y, then
cos(θ) = x'y / (L_x L_y).

18
Orthogonal (Perpendicular) Vectors: Two vectors x and y are orthogonal to one another if and only if x'y = 0 (⇔ cos(θ) = 0 ⇒ θ = π/2).
Projection: The projection of x on y is the vector in the direction of y that is closest to x and is given by
(x'y / y'y) y = (x'y / L_y) L_y^{-1} y.

19
Euclidean Distance: Since a vector x has the interpretation of a directed line segment from the origin (0) to the point (x_1, x_2, ..., x_p), the Euclidean (straight-line) distance between 0 and x is clearly the length of the vector, L_x:
d(x, 0) = L_x = √(x_1² + ... + x_p²).
More generally, the Euclidean distance between two arbitrary points x and y is
d(x, y) = L_{x-y} = √((x_1 - y_1)² + ... + (x_p - y_p)²).
• Many ways to define distance. Euclidean distance is just one (familiar) example.
• Properties of a distance measure d:
d(x, y) = d(y, x)
d(x, y) ≥ 0, with equality when x = y
d(x, y) ≤ d(x, z) + d(z, y)  (triangle inequality)
• We'll consider another distance measure, statistical distance, later.

20
More Matrix Functions and Definitions:
Determinant: The determinant of a square matrix A (same number of rows as columns) with dimension k × k is the scalar denoted |A| given by
|A| = a_11                                    if k = 1,
|A| = Σ_{j=1}^{k} a_1j |A_1j| (-1)^{1+j}      if k > 1,
where A_1j is the (k-1) × (k-1) matrix obtained by deleting the first row and the jth column of A. Note that |A| can be defined as Σ_{j=1}^{k} a_ij |A_ij| (-1)^{i+j}, using the ith row in place of the first.
Special Cases:
k = 2:
| a_11  a_12 |
| a_21  a_22 | = a_11 a_22 (-1)² + a_12 a_21 (-1)³ = a_11 a_22 - a_12 a_21
E.g.,
| 3  -1 |
| 2   4 | = 12 - (-2) = 14

21
k = 3:
| a_11  a_12  a_13 |
| a_21  a_22  a_23 | = +a_11 | a_22 a_23 |  - a_12 | a_21 a_23 |  + a_13 | a_21 a_22 |
| a_31  a_32  a_33 |         | a_32 a_33 |         | a_31 a_33 |         | a_31 a_32 |
= a_11(a_22 a_33 - a_23 a_32) - a_12(a_21 a_33 - a_23 a_31) + a_13(a_21 a_32 - a_22 a_31)

Properties:
1. |cA| = c^k |A|, for A_{k×k}.
2. |A'| = |A|.
3. |AB| = |A||B|.
4. For a diagonal matrix D_{k×k} of the form
D = [ d_11   0   ...   0
       0   d_22  ...   0
       ...
       0     0   ...  d_kk ],
|D| = d_11 d_22 ... d_kk.
22
Linear Dependence: A set of vectors x_1, ..., x_k is said to be linearly dependent if there exist constants c_1, ..., c_k, not all zero, such that
c_1 x_1 + c_2 x_2 + ... + c_k x_k = 0.
• Recall that collinearity/multicollinearity complications occurred in regression when the explanatory variables were linearly dependent; that is, when the columns of the X matrix were linearly dependent.
• A set of vectors which is not linearly dependent is said to be linearly independent.
|A| = 0 if and only if there exists a nonzero vector x such that Ax = 0; that is, if and only if the columns of A are linearly dependent.

23
Identity Matrix: A matrix with ones on the diagonal and zeros elsewhere is called an identity matrix and is denoted by I_{k×k}, or I when the dimension is clear from the context. E.g.,
I_{3×3} = [ 1  0  0
            0  1  0
            0  0  1 ]
I is the "identity" matrix because
AI = IA = A.
Matrix Inverse: Matrix division is defined as multiplication by the matrix inverse. Just as a/b can be defined as a b^{-1} for scalars a and b, we will think of A divided by B as AB^{-1}, where B^{-1} is that matrix such that
BB^{-1} = I
(just as b^{-1} = 1/b is that scalar such that bb^{-1} = 1).
• The matrix inverse is only defined for square matrices that have nonzero determinants.
• If A (square) has |A| = 0 then its columns are linearly dependent and it has no inverse. Such a matrix is called singular.
24
Computation of Matrix Inverse: A^{-1} has (i, j)th element (-1)^{i+j} |A_ji| / |A|, where A_ji is the matrix obtained by deleting the jth row and ith column of A. E.g.,
[ a  b ]^{-1}        1       [  d  -b
[ c  d ]      =   -------      -c   a ]
                  ad - bc
We will rely on the computer to compute inverses for matrices with dimension > 2.
Properties:
1. (A^{-1})' = (A')^{-1}.
2. (AB)^{-1} = B^{-1} A^{-1}.
3. |A^{-1}| = 1/|A|.

25
Eigenvalues and Eigenvectors: A square k × k matrix A is said to have an eigenvalue λ, with a corresponding eigenvector x ≠ 0, if
Ax = λx.   (1)
Since (1) can be written
(A - λI)x = 0
for some nonzero vector x, the eigenvalues of A are the solutions of
|A - λI| = 0.
If we look at how a determinant is calculated, it is not difficult to see that |A - λI| will be a polynomial in λ of order k, so there will be k (not necessarily real) eigenvalues (solutions).
1. If |A| = 0 then Ax = 0 for some nonzero x. That is, if the columns of A are linearly dependent then λ = 0 is an eigenvalue of A.

26
2. The eigenvector associated with a particular eigenvalue is unique only up to a scale factor. That is, if Ax = λx, then A(cx) = λ(cx), so x and cx are both eigenvectors for A corresponding to λ. We typically normalize eigenvectors to have length 1 (choose the x that has the property x'x = 1).
3. If λ_i and λ_j are distinct eigenvalues of the symmetric matrix A then their associated eigenvectors x_i, x_j are orthogonal.

27
4. Computation: By computer typically, but for the 2 × 2 case we have
A = [ a  b     ⇒   A - λI = [ a-λ   b
      c  d ]                   c   d-λ ]
|A - λI| = 0  ⇒  (a-λ)(d-λ) - bc = λ² - (a+d)λ + (ad - bc) = 0
⇒ λ = [ (a+d) ± √((a+d)² - 4(ad - bc)) ] / 2.
To obtain the associated eigenvector, solve Ax = λx. There are infinitely many solutions for x, so choose one by setting x_1 = 1, say, and solve for x_2. Normalize to have length one by computing L_x^{-1} x.

28
Orthogonal Matrices: We say that A is orthogonal if A' = A^{-1} or, equivalently, AA' = A'A = I, so that the columns (and rows) of A all have length 1 and are mutually orthogonal.
Spectral Decomposition: If A_{k×k} is a symmetric matrix (i.e., A = A') then it can be written (decomposed) as follows:
A = E_{k×k} Λ_{k×k} E'_{k×k},
where
Λ = [ λ_1   0   ...   0
       0   λ_2  ...   0
       ...
       0    0   ...  λ_k ]
and E is an orthogonal matrix with columns e_1, e_2, ..., e_k, the eigenvectors corresponding to eigenvalues λ_1, λ_2, ..., λ_k.
We haven't yet talked about how to interpret eigenvalues and eigenvectors, but the spectral decomposition says something about the significance of these quantities (why we're interested in them): the "information" in A can be broken apart into its eigenvalues and a set of orthogonal eigenvectors.
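A small NumPy illustration of the decomposition A = E Λ E' for a symmetric matrix (sketch only; the matrix below is made up for illustration):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                # symmetric, so A = A'

lam, E = np.linalg.eigh(A)                # eigenvalues and orthonormal eigenvectors
Lam = np.diag(lam)

print(np.allclose(E @ Lam @ E.T, A))      # True: A is recovered exactly
print(np.allclose(E.T @ E, np.eye(2)))    # True: E is orthogonal
```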
29
Results:
1. For Akk symmetric,
Yk
jAj = jEjjjjE0j = jEE0jjj = jIjjj = i
i=1
I.e., the determinant of a symmetric matrix is
the product of its eigenvalues.
2. If A has eigenvalues 1 ; : : : ; k , then A 1 has
the same associated eigenvectors and eigenval-
ues 1=1 ; : : : ; 1=k since
A 1 = (EE0) 1 = E 1E0 :

30
3. Similarly, if A has the additional property that it is positive definite (defined below), a square root matrix, A^{1/2}, can be obtained with the property A^{1/2} A^{1/2} = A:
A^{1/2} = E Λ^{1/2} E', where
Λ^{1/2} = [ √λ_1    0   ...    0
             0    √λ_2  ...    0
             ...
             0      0   ...  √λ_k ]
A^{1/2} is symmetric, has eigenvalues that are the square roots of the eigenvalues of A, and has the same associated eigenvectors as A.
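Sketch of the square-root matrix A^{1/2} = E Λ^{1/2} E' for a positive definite symmetric A (same made-up 2 × 2 matrix as in the previous sketch):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

lam, E = np.linalg.eigh(A)
A_half = E @ np.diag(np.sqrt(lam)) @ E.T

print(np.allclose(A_half @ A_half, A))    # True: A^{1/2} A^{1/2} = A
```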
Quadratic Forms: For a symmetric matrix A_{k×k}, a quadratic form in x_{k×1} is defined by
x'Ax = Σ_{i=1}^{k} Σ_{j=1}^{k} a_ij x_i x_j.
(x'Ax is a sum in squared (quadratic) terms, x_i², and x_i x_j terms.)

31
Positive Definite Matrices: A is positive definite (p.d.) if x'Ax > 0 for all x ≠ 0. If x'Ax ≥ 0 for all nonzero x then A is positive semi-definite (p.s.d.) or, sometimes, non-negative definite.
• If A_{k×k} is p.d. then for i = 1, ..., k, e_i'Ae_i = λ_i (e_i'e_i) > 0;
i.e., the eigenvalues of A are all positive.
• It can also be shown that if the eigenvalues of A are all positive then A is p.d.

32
Statistical Distance:
Consider points P and Q in a scatterplot of X_2 versus X_1 (figure in original notes).
• Euclidean Distance: d(0, P) > d(0, Q).
But suppose X_1 and X_2 represented random variables each with mean 0, but with different variances (var(X_1) > var(X_2)). If we are interested in how unusual the points P and Q are, then we want to know how far from the average (0) these points are. For this purpose it does not make sense that d(0, P) > d(0, Q).
• Euclidean distance does not account for variability inherent in the random variables representing the coordinates.

33
• Need a distance measure where coordinates are weighted differentially and distance is relative to the variability of the data.
One solution, Karl Pearson Distance: Distance of a point P = (x_1, x_2) from the origin 0 = (0, 0) takes into account how many (X_1) standard deviations x_1 is from 0 and how many (X_2) standard deviations x_2 is from 0:
d(0, P) = √( (x_1/s_1)² + (x_2/s_2)² ),
where s_1 is the standard deviation of X_1 and s_2 is the standard deviation of X_2. The generalization to k-dimensional space is straightforward:
d(0, P) = √( (x_1/s_1)² + ... + (x_k/s_k)² ).
• A problem: Karl Pearson distance doesn't account for correlation among the coordinates.

34
An alternative, Statistical Distance: Karl Pearson distance would be o.k. if it were computed with respect to the rotated axes X̃_1 and X̃_2. So, in two dimensions, we can define statistical distance as
d(0, P) = √( (x̃_1/s̃_1)² + (x̃_2/s̃_2)² ),
where
x̃_1 = x_1 cos(θ) + x_2 sin(θ),
x̃_2 = -x_1 sin(θ) + x_2 cos(θ),
s̃_1, s̃_2 are the standard deviations of X̃_1 and X̃_2, and θ is the angle of rotation.

35
With respect to the original coordinate system we have
d(0, P) = √( a_11 x_1² + 2 a_12 x_1 x_2 + a_22 x_2² ),
where a_11, a_12, and a_22 are coefficients depending on s_1, s_2, cov(X_1, X_2) and trigonometric functions of θ (not important now).
Notice that d(0, P) as defined above can be written as
d(0, P)² = (x_1 x_2) [ a_11  a_12   [ x_1
                       a_12  a_22 ]   x_2 ]  = x'Ax,
a quadratic form in x involving the symmetric matrix A.
• In fact, any positive definite quadratic form can be thought of as a squared statistical distance. The positive definiteness of the matrix A is required to ensure the properties of a distance metric (non-negativity, triangle inequality, etc.).
x'Ax: squared distance from x to 0;
(x - y)'A(x - y): squared distance from x to y.
36
Geometric Interpretation of the Spectral Decomposition: Consider the set of points a constant (statistical) distance c away from the origin. These points satisfy
x'Ax = c²  ⇒  x'EΛE'x = y'Λy = Σ_i λ_i y_i² = c²,
which is the equation of an ellipsoid in the transformed variable, y = E'x.
A transformation of this form, y = E'x, involving an orthogonal matrix E, defines a rotation from the x coordinate system into the y coordinate system which is rigid, in the sense that the orthogonality of the axes is maintained. The new axes are in the direction of the eigenvectors, e_1, e_2, ....

37
In two dimensions we have
λ_1 y_1² + λ_2 y_2² = c²,
the equation of an ellipse, with y_1 = x'e_1 and y_2 = x'e_2. If we let x = c λ_1^{-1/2} e_1 we see that this value satisfies x'Ax = c²:
x'Ax = λ_1 (x'e_1)² + λ_2 (x'e_2)²
     = λ_1 (c λ_1^{-1/2} e_1'e_1)² + λ_2 (c λ_1^{-1/2} e_1'e_2)² = c²,
since e_1'e_1 = 1 and e_1'e_2 = 0.
Therefore, ||c λ_1^{-1/2} e_1|| = c/√λ_1 gives the length of the ellipse along e_1. Similarly, c/√λ_2 gives the length of the ellipse in the perpendicular direction along e_2.

38
One More Matrix Function:
Trace of a Matrix: Defined only for square matrices. Let A_{k×k} = (a_ij) be a square matrix. The trace of A is the sum of its diagonal elements:
tr(A) = Σ_{i=1}^{k} a_ii.
Properties:
1. tr(cA) = c tr(A).
2. tr(A + B) = tr(A) + tr(B).
3. tr(ABC) = tr(CAB) as long as the matrix products are defined (matrices are conformable) and ABC and CAB are square. In particular,
tr(B^{-1}AB) = tr(A)
and
tr(Axx') = tr(x'Ax) = x'Ax   (a scalar).
4. tr(AA') = Σ_i Σ_j a_ij².
5. If A_{k×k} is symmetric with eigenvalues λ_1, ..., λ_k, then tr(A) = Σ_{i=1}^{k} λ_i.
39
Basic Multivariate Statistical Concepts

(See sections 2.5, 2.6 and chapter 3 of Johnson and Wichern.)
Definitions:
Random Vector: A vector whose elements are random variables. E.g.,
x_{p×1} = (x_1 x_2 ... x_p)',
where x_1, ..., x_p are each random variables.
Random Matrix: A matrix whose elements are random variables. E.g., X_{n×p} = (x_ij), where x_11, x_12, ..., x_np are each random variables.
Expected Value: The expected value (population mean) of a random matrix (vector) is the matrix (vector) of expected values. For X_{n×p},
E(X) = [ E(x_11)  E(x_12)  ...  E(x_1p)
         ...
         E(x_n1)  E(x_n2)  ...  E(x_np) ]

40
• Recall, for a univariate random variable X,
E(X) = ∫_{-∞}^{∞} x f_X(x) dx     if X is continuous;
E(X) = Σ_{all x} x p_X(x)         if X is discrete.
Here, f_X(x) is the probability density function of X in the continuous case, and p_X(x) is the probability function of X in the discrete case.
(Population) Variance-Covariance Matrix: For a random vector x_{p×1} = (x_1, x_2, ..., x_p)', the matrix
[ var(x_1)       cov(x_1, x_2)  ...  cov(x_1, x_p)
  cov(x_2, x_1)  var(x_2)       ...  cov(x_2, x_p)
  ...
  cov(x_p, x_1)  cov(x_p, x_2)  ...  var(x_p)      ]
≡ [ σ_11  σ_12  ...  σ_1p
    σ_21  σ_22  ...  σ_2p
    ...
    σ_p1  σ_p2  ...  σ_pp ]  ≡  Σ
is called the variance-covariance matrix of x and is denoted var(x).

41
• Recall: for a univariate random variable x_i with expected value μ_i,
σ_ii = var(x_i) = E[(x_i - μ_i)²].
• Recall: for univariate random variables x_i and x_j,
σ_ij = cov(x_i, x_j) = E[(x_i - μ_i)(x_j - μ_j)].
• var(x) is symmetric because σ_ij = σ_ji.
• In terms of vector/matrix algebra, var(x) has formula
var(x) = E[(x - μ)(x - μ)'].
• If the random variables x_1, ..., x_p in x are mutually independent, then cov(x_i, x_j) = 0 when i ≠ j, and var(x) is diagonal with (σ_11, σ_22, ..., σ_pp)' along the diagonal and zeros elsewhere.

42
(Population) Correlation Matrix: For a random vector x_{p×1}, the population correlation matrix is the matrix of correlations among the elements of x:
corr(x) = [ 1     ρ_12  ...  ρ_1p
            ρ_21  1     ...  ρ_2p
            ...
            ρ_p1  ρ_p2  ...  1    ]  ≡ ρ,
where ρ_ij = corr(x_i, x_j).
• Recall: for random variables x_i and x_j,
ρ_ij = corr(x_i, x_j) = σ_ij / (√σ_ii √σ_jj)
measures the amount of linear association between x_i and x_j.
• Correlation matrices are symmetric.
• For a random vector x_{p×1}, let ρ = corr(x), Σ = var(x) and V = diag(σ_11, σ_22, ..., σ_pp). The relationship between ρ and Σ is
Σ = V^{1/2} ρ V^{1/2}
ρ = (V^{1/2})^{-1} Σ (V^{1/2})^{-1}

43
Properties:
Let X, Y be random matrices of the same dimension and A, B be matrices of constants such that AXB is defined.
1. E(X + Y) = E(X) + E(Y).
2. E(AXB) = A E(X) B.
Now let x_{p×1} be a random vector with mean μ_x and variance Σ_x. Let c_{p×1} be a vector of constants, and C_{k×p} a matrix of constants.
For the linear combination c'x = c_1 x_1 + ... + c_p x_p,
3. E(c'x) = c'μ_x, and
4. var(c'x) = c'Σ_x c.
For the k-dimensional vector of linear combinations, Z_{k×1} = Cx,
5. μ_Z = E(Z) = E(Cx) = C E(x) = C μ_x and
6. Σ_Z = var(Z) = var(Cx) = C Σ_x C'.

44
All of the definitions and properties of the last 5 pages have been defined in terms of population quantities. There are corresponding definitions of sample quantities for which the same properties hold:
Examples (sample size n):
- Sample mean: x̄ = (x̄_1, ..., x̄_p)', where x̄_i = (1/n) Σ_{k=1}^{n} x_ki. (x_ki is the kth observation of random variable x_i.)
- Sample variance-covariance matrix:
S_X = [ s_11  ...  s_1p
        ...
        s_p1  ...  s_pp ]
    = (1/(n-1)) Σ_{k=1}^{n} (x_k - x̄)(x_k - x̄)',
where
s_ij = (1/(n-1)) Σ_{k=1}^{n} (x_ki - x̄_i)(x_kj - x̄_j).

45
- Sample correlation matrix:
R_X = [ 1     r_12  ...  r_1p
        r_21  1     ...  r_2p
        ...
        r_p1  r_p2  ...  1    ],
where r_ij is the sample correlation computed on the n sample pairs of observations (x_1i, x_1j), ..., (x_ni, x_nj):
r_ij = s_ij / (√s_ii √s_jj).

46
In the univariate setting we typically have a random
variable X , say, from which we observe a random
sample. That is, we assume that X has some dis-
tribution that is partially or completely unknown
and we observe n independent copies X1 ; : : : ; Xn of
X . The observed values of these copies we typi-
cally denote with lower case letters: x1 ; : : : ; xn , and
we infer features of the true distribution from the
sample.
In the multivariate setting we are interested in the true (multivariate) distribution of a random vector x_{p×1}. Typically the random variables that make up x, namely x_1, ..., x_p, are not independent, and that dependence (correlation) among x_1, ..., x_p is the reason for our interest. A random sample here consists of n copies of x that we can arrange in an n × p data matrix:
X = [ x_11  x_12  ...  x_1p
      x_21  x_22  ...  x_2p
      ...
      x_n1  x_n2  ...  x_np ]
(I'm using lower case here, no longer distinguishing between the random variable and its observed value.)
47
• Rows are assumed independent (n independent copies).
• Dependence typically among columns.
The sample mean and variance can be computed from the data matrix X as
x̄ = (1/n) X'1,
S = (1/(n-1)) X'(I - (1/n) 1 1')X,
where 1 is an n-dimensional vector of ones.
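A minimal sketch of these two matrix formulas, checked against NumPy's built-in summaries (random data used purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 45, 2
X = rng.normal(size=(n, p))                      # n x p data matrix
ones = np.ones((n, 1))

xbar = (X.T @ ones / n).ravel()                  # (1/n) X'1
S = X.T @ (np.eye(n) - ones @ ones.T / n) @ X / (n - 1)

print(np.allclose(xbar, X.mean(axis=0)))         # True
print(np.allclose(S, np.cov(X, rowvar=False)))   # True
```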


Generalized Sample Variance: S summarizes the variability in each of x_1, ..., x_p and the covariability among these variables. S involves p variances and p(p-1)/2 covariances, though. A more succinct summary is provided by the generalized sample variance, |S|.

48
• Sample variance-covariance matrices with quite different patterns can have the same generalized variance. E.g.,
S_1 = [ 5  4      S_2 = [  5  -4      S_3 = [ 3  0
        4  5 ],           -4   5 ],           0  3 ]
have values of r_12 equal to 0.8, -0.8 and 0.0, respectively. However,
|S_1| = |S_2| = |S_3| = 9.
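A quick check of the three generalized variances quoted above:

```python
import numpy as np

S1 = np.array([[5.0, 4.0], [4.0, 5.0]])
S2 = np.array([[5.0, -4.0], [-4.0, 5.0]])
S3 = np.array([[3.0, 0.0], [0.0, 3.0]])

print([round(np.linalg.det(S), 2) for S in (S1, S2, S3)])  # [9.0, 9.0, 9.0]
```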

• |S| is sensitive to large values of the sample variance, s_ii, for one or a few of the p variables in X. In some cases this may simply be an effect of scale. In such cases it is of interest to compute the generalized variance on the standardized data matrix, where x_ij has been replaced by (x_ij - x̄_i)/√s_ii. This procedure is equivalent to computing
|R| = |S| / (s_11 s_22 ... s_pp),
the determinant of the sample correlation matrix.
49
• |S| is only one way to quantify the total variability and covariability in S. Another possibility, one that ignores the covariance structure, is the total variance:
Σ_{i=1}^{p} s_ii.

50
Example, Kite Data: Consider the bivariate data
of Table 5.11 in the text. A wildlife ecologist mea-
sured x1 = tail length (mm) and x2 = wing length
(mm) for a sample of n = 45 female hook-billed
kites. The handout consists of a printout of a SAS
program, kite1.sas, and the associated output, kite1.lst.

51
 PROC PRINT prints out a listing of the data
on p. 1 of kite1.lst.
• PROC CORR computes the sample correlation matrix R for the variables on the VAR statement. In addition, the COV option on the PROC CORR line requests the sample covariance matrix S and, by default, PROC CORR computes summary statistics (x̄_1, x̄_2, etc.). All of these results appear on p. 2 of kite1.lst.
 PROC PLOT prints a scatter plot of winglen
versus taillen on p. 3 of kite1.lst.
• The generalized sample variance is
|S| = 120.69(208.54) - 122.35² = 10201.24
and we have
|R| = 10201.24 / (120.69(208.54)) = 1.0(1.0) - 0.77² = 0.41.
• The total variance is 120.69 + 208.54 = 329.23.
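The same summaries can be reproduced from the printed S matrix (values copied from the notes; small rounding differences are expected):

```python
import numpy as np

S = np.array([[120.69, 122.35],
              [122.35, 208.54]])

gen_var = np.linalg.det(S)                      # |S|, roughly 10200
D_inv = np.diag(1 / np.sqrt(np.diag(S)))
R = D_inv @ S @ D_inv                           # sample correlation matrix
print(gen_var, np.linalg.det(R), np.trace(S))   # |S|, |R| about 0.41, total variance 329.23
```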

52
Suppose we decide that we are interested in the size and shape of the birds in a way that is not captured by tail length (x_1) and wing length (x_2). Specifically, we might look at y_1 = x_1 + x_2 (size) and y_2 = x_1 - x_2 (shape).
To compute summary statistics on y = (y_1, y_2)', we can use our methods for linear combinations of x = (x_1, x_2)'. Notice,
y = Ax,
where
A = [ 1   1
      1  -1 ],
so
ȳ = A x̄ = [ 1   1   [ 193.62     [ 473.40
            1  -1 ]   279.78 ] =   -86.16 ]

53
In addition,
S_y = A S_x A'
    = [ 1   1   [ 120.69  122.35   [ 1   1
        1  -1 ]   122.35  208.54 ]   1  -1 ]
    = [ 243.04  330.89   [ 1   1
        -1.66  -86.19 ]    1  -1 ]
    = [ 573.93  -87.85
        -87.85   84.54 ]
54
The Multivariate Normal Distribution
(See chapter 4 of Johnson and Wichern.)
 Recall the central role that the (univariate) nor-
mal distribution played in the development of
univariate statistical methods and why:
a. For many continuous random variables the
assumption of normality is appropriate.
b. numerous statistics, particularly those that
can be expressed as sums or averages, have
normal sampling distributions regardless of
the underlying population distribution.
c. the normal distribution is easy to work with
mathematically and many useful results are
available.
 Items (a) and (b) above are due to the Central
Limit Theorem.

55
Recall:
Central Limit Theorem. If Y_1, ..., Y_n is a sequence of n i.i.d. random variables each with mean μ and variance σ² (both finite), then
Z_n = (Σ_{i=1}^{n} Y_i - nμ) / (√n σ)  is approximately  N(0, 1).
That is,
lim_{n→∞} F_{Z_n}(z) / Φ(z) → 1,
where F_{Z_n}(z) is the c.d.f. of Z_n and Φ(z) is the c.d.f. of a standard normal r.v.
• How large must n be? Depends on the problem (on the distribution of the Y's).
• The CLT directly implies (b) but also implies (a) because the measured realization of a random variable can often be thought of as including the sum of numerous random quantities (e.g., reaction time).
56
• The multivariate normal distribution plays an even more central role in multivariate analysis. The justification/motivation is the same, plus
d. There are few alternatives in the multivariate setting (few tractable multivariate distributions as alternatives). There are alternatives in Discrete Multivariate Analysis; see Bishop et al., STAT 8620.
Recall the univariate normal distribution: we write X ~ N(μ, σ²) to signify that the univariate r.v. X has the normal distribution with mean μ and variance σ², which means that X has probability density function (p.d.f.)
f_X(x) = (1/√(2πσ²)) exp( -(x - μ)² / (2σ²) ).
Meaning: for two values x_1 < x_2, the area under the graph of the p.d.f. between x_1 and x_2 gives Pr(x_1 < X < x_2).

57
• The cumulative distribution function (c.d.f.) of X is
F_X(x) = Pr(X ≤ x) = ∫_{-∞}^{x} f_X(t) dt.
F_X is not available in closed form but the standard normal version (with μ = 0, σ² = 1) is extensively tabled.
The multivariate normal distribution: for the vector x, we write x ~ N_p(μ, Σ) to signify that x is a p-dimensional random vector with mean μ and variance Σ. x has p.d.f.
f(x) = (2π)^{-p/2} |Σ|^{-1/2} exp( -(1/2)(x - μ)'Σ^{-1}(x - μ) ),
and c.d.f.
F(x) = Pr(x_1 ≤ x_1, x_2 ≤ x_2, ..., x_p ≤ x_p)
     = ∫_{-∞}^{x_p} ... ∫_{-∞}^{x_1} f(t) dt.
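A sketch of evaluating the N_p(μ, Σ) density with SciPy, compared against the formula above (μ, Σ, and x below are made-up values for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([1.0, -0.5])

d2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)   # squared statistical distance
f_manual = np.exp(-0.5 * d2) / ((2 * np.pi) ** (len(x) / 2) * np.sqrt(np.linalg.det(Sigma)))

print(np.isclose(f_manual, multivariate_normal(mu, Sigma).pdf(x)))   # True
```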

58
• We've replaced a univariate distance from x to μ with a multivariate statistical distance,
(x - μ)'Σ^{-1}(x - μ),
from x to μ.
• It is now the volume under the p.d.f. which represents the joint probability that x_1, ..., x_p fall within given intervals.

59
• Contours of constant density for the p-dimensional normal distribution are ellipsoids defined by x such that
(x - μ)'Σ^{-1}(x - μ) = c².
These ellipsoids are centered at μ and have axes ±c√λ_i e_i, for eigen-pairs λ_i, e_i of Σ.
• We will show that for x ~ N_p(μ, Σ) with |Σ| > 0,
(x - μ)'Σ^{-1}(x - μ) ~ χ²(p).
This result implies that the solid ellipsoid of x values satisfying
(x - μ)'Σ^{-1}(x - μ) ≤ χ²_α(p)
has probability 1 - α.
60
Additional Properties:
1. If x ~ N_p(μ, Σ) then any linear combination of variables a'x = a_1 x_1 + a_2 x_2 + ... + a_p x_p is distributed as N(a'μ, a'Σa).
2. If x ~ N_p(μ, Σ), the q linear combinations
A_{q×p} x = [ a_11 x_1 + ... + a_1p x_p
              a_21 x_1 + ... + a_2p x_p
              ...
              a_q1 x_1 + ... + a_qp x_p ]
are distributed as N_q(Aμ, AΣA').
3. For x ~ N_p(μ, Σ) and d_{p×1} a vector of constants, x + d is distributed as N_p(μ + d, Σ).
4. For a p-dimensional random vector x, if a'x ~ N(a'μ, a'Σa) for every p-vector of constants a, then x ~ N_p(μ, Σ).

61
5. For x ~ N_p(μ, Σ), all subsets of x are (multivariate) normally distributed. If we partition x, and μ and Σ accordingly, as
x_{p×1} = [ x_1   (q×1)                μ_{p×1} = [ μ_1   (q×1)
            x_2   ((p-q)×1) ],                     μ_2   ((p-q)×1) ],
Σ_{p×p} = [ Σ_11 (q×q)        Σ_12 (q×(p-q))
            Σ_21 ((p-q)×q)    Σ_22 ((p-q)×(p-q)) ],
then x_1 is distributed as N_q(μ_1, Σ_11).

62
6.
a. If x_1 and x_2 are independent q_1- and q_2-dimensional vectors, respectively, then regardless of the distributions of x_1 and x_2, cov(x_1, x_2) = 0_{q_1×q_2}.
b. If x_1, x_2 are as before and are each multivariate normally distributed, then cov(x_1, x_2) = 0 implies that x_1 and x_2 are independent.
c. If x_1 and x_2 are independent with N_{q_1}(μ_1, Σ_11) and N_{q_2}(μ_2, Σ_22) distributions, respectively, then
[ x_1     ~  N_{q_1+q_2}( [ μ_1    ,  [ Σ_11   0
  x_2 ]                     μ_2 ]        0    Σ_22 ] ).
If we combine (multivariate) normally distributed components into a single vector, the result is not necessarily multivariate normal. However, if the components are (multivariate) normal and independent then the result will be multivariate normal.

63
7. If x ~ N_p(μ, Σ), where
x = [ x_1      μ = [ μ_1      Σ = [ Σ_11  Σ_12
      x_2 ],         μ_2 ],          Σ_21  Σ_22 ],
with |Σ_22| > 0, then the conditional distribution of x_1 given x_2 = x_2 is (multivariate) normal with
Mean = μ_1 + Σ_12 Σ_22^{-1} (x_2 - μ_2)
and
Covariance = Σ_11 - Σ_12 Σ_22^{-1} Σ_21.
Notice that the conditional covariance does not depend on the value of the conditioning variable.
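A minimal sketch of the conditional mean and covariance formulas in item 7 (the partitioned μ and Σ below are made-up values):

```python
import numpy as np

mu1, mu2 = np.array([1.0]), np.array([2.0])
S11 = np.array([[2.0]])
S12 = np.array([[0.8]])
S22 = np.array([[1.5]])
x2 = np.array([3.0])                       # observed value of x_2

S22_inv = np.linalg.inv(S22)
cond_mean = mu1 + S12 @ S22_inv @ (x2 - mu2)
cond_cov = S11 - S12 @ S22_inv @ S12.T     # Sigma_21 = Sigma_12' here

print(cond_mean, cond_cov)
```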

64
8. For x ~ N_p(μ, Σ) with |Σ| > 0,
(x - μ)'Σ^{-1}(x - μ) ~ χ²(p),
where χ²(p) denotes the chi-square distribution with p degrees of freedom.
Proof: we can write the quadratic form as
(x - μ)'EΛ^{-1/2}Λ^{-1/2}E'(x - μ) = Z'Z,
where Z = Λ^{-1/2}E'(x - μ) is multivariate normal with mean
E(Z) = Λ^{-1/2}E' E(x - μ) = 0
and variance
var(Z) = Λ^{-1/2}E'ΣEΛ^{-1/2} = Λ^{-1/2}ΛΛ^{-1/2} = I,
so Z'Z is the sum of p squared independent N(0, 1) r.v.s ⇒ Z'Z ~ χ²(p).

65
Multivariate Sampling Distributions:
In this course we will most often consider data that are, or are assumed to be, a random sample from a multivariate population. That is, we will have n independent copies, x_1, x_2, ..., x_n, of a p-dimensional random vector, which we arrange in an n × p data matrix:
X = [ x_11  x_12  ...  x_1p        [ x_1'
      x_21  x_22  ...  x_2p     =    x_2'
      ...                            ...
      x_n1  x_n2  ...  x_np ]        x_n' ]
• Rows of X are independent.
• Columns of X are dependent and this dependence is of interest.

66
First we consider the general case where x_1, ..., x_n have an arbitrary (not necessarily normal) common distribution with mean μ and variance Σ:
1. The sample mean x̄ is unbiased for μ; i.e., E(x̄) = μ.
This is easily seen by writing x̄ as x̄ = (1/n) X'1. It follows that
E(x̄) = (1/n) E(X')1 = (1/n)(μ, μ, ..., μ) 1   (n copies of μ)   = μ.

67
2. var(x̄) = (1/n)Σ. We can write x̄ as
x̄ = ( (1/n)I, (1/n)I, ..., (1/n)I ) [ x_1
                                       x_2
                                       ...
                                       x_n ]   (n blocks).
Therefore, var(x̄) equals
( (1/n)I, ..., (1/n)I ) [ Σ  0  ...  0     [ (1/n)I
                          0  Σ  ...  0       (1/n)I
                          ...                 ...
                          0  0  ...  Σ ]     (1/n)I ]
= ( (1/n)I, ..., (1/n)I ) [ (1/n)Σ
                            ...
                            (1/n)Σ ]
= n (1/n²) Σ = (1/n)Σ.

68
3. The sample variance matrix S is unbiased for Σ; i.e., E(S) = Σ. Proof:
E(S) = (1/(n-1)) Σ_{i=1}^{n} E[(x_i - x̄)(x_i - x̄)']
     = (1/(n-1)) [ Σ_{i=1}^{n} E(x_i x_i') - n E(x̄ x̄') ].
As a general result,
Σ_y = E[(y - μ_y)(y - μ_y)'] = E(yy') - μ_y μ_y'
⇒ E(yy') = Σ_y + μ_y μ_y'.
So,
E(S) = (1/(n-1)) [ n(Σ + μμ') - n((1/n)Σ + μμ') ] = Σ.

69
Large Sample Results (general case):
1. x̄ →p μ,  S →p Σ.
2. √n(x̄ - μ) →d N_p(0, Σ).
3. Combining (1) and (2) we get
√n S^{-1/2}(x̄ - μ) →d N_p(0, I).
4. (3) implies
n(x̄ - μ)'S^{-1}(x̄ - μ) →d χ²(p).
• a →p b means that as the sample size gets large, a converges to b in the sense that the probability that a is "close" to b goes to 1.
• statistic →d distribution means that as the sample size gets large the distribution of the statistic on the left becomes approximately equal to the distribution on the right.
70
Next we consider the special case where we assume that the common distribution of our random sample x_1, ..., x_n is N_p(μ, Σ). Our additional assumption of normality buys us some stronger results:
1.
x̄ ~ N_p(μ, (1/n)Σ)
n(x̄ - μ)'Σ^{-1}(x̄ - μ) ~ χ²(p)
Unlike the general case, these results are exact and hold for any sample size.
2. The random p × p matrix (n-1)S has a distribution known as the Wishart distribution. The Wishart distribution has two parameters, the degrees of freedom (n-1 in this case) and the scale matrix (Σ in this case). We write (n-1)S ~ W_p(Σ, n-1).
3. x̄ and S are independent.

71
Properties of the Wishart Distribution:
1. W_p(Σ, m) is the distribution of Σ_{i=1}^{m} z_i z_i', where z_1, ..., z_m iid ~ N_p(0, Σ), which implies that
Σ_{i=1}^{n} (x_i - x̄)(x_i - x̄)' = (n-1)S ~ W_p(Σ, n-1)
for a multivariate normal sample x_1, ..., x_n iid ~ N_p(μ, Σ) with sample variance-covariance matrix S.
2. If A_i are independent with A_i ~ W_p(Σ, m_i), then
Σ_i A_i ~ W_p(Σ, Σ_i m_i).
3. If A ~ W_p(Σ, m) then for C_{q×p},
CAC' ~ W_q(CΣC', m).

72
Assessment of Normality:
• We have seen that x_{p×1} is multivariate normal if and only if every linear combination of x_1, ..., x_p is normal. Therefore, one check is whether selected linear combinations are (univariate) normal.
• As a special case of the above we can check whether each component of x is (univariate) normal.
• Other selected linear combinations of x_1, ..., x_p worth checking are ê_1'x and ê_p'x, where ê_1 and ê_p are the eigenvectors associated with λ̂_1 and λ̂_p, respectively, the largest and smallest eigenvalues of S.

73
How do we assess univariate normality?
One method is the quantile-quantile (Q-Q) plot (also
known as a probability plot).
 Idea is to plot the sample quantiles (percentiles)
for the variable in question versus the quan-
tiles one would expect to observe assuming ex-
act normality.
 If the data are normal, plot should follow a
straight line. Departures from a straight line
may indicate the type of non-normality (e.g.,
heavy tails, skewness).
 Q-Q plots can be misleading for small sample
sizes (n < 20, say). Small samples of truly nor-
mal data can look crooked.
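A minimal normal Q-Q plot sketch with SciPy/Matplotlib (illustration only; the sample below is simulated):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=2, size=45)    # made-up sample

stats.probplot(x, dist="norm", plot=plt)    # sample vs. theoretical normal quantiles
plt.title("Normal Q-Q plot")
plt.show()
```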

74
Example: (Q-Q plot figure in original notes)

75
• A formal test for normality is based on r_Q, the correlation coefficient of the observed and expected quantiles. The hypothesis of normality is rejected for values of r_Q below the tabulated value in Table 4.2 on p. 193 of the text.
• The test statistic based on r_Q is approximately equivalent to the Shapiro-Wilk test computed by PROC UNIVARIATE and, in general, these two tests give very similar results. Either may be used.
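A sketch of both univariate checks (the Shapiro-Wilk test and the Q-Q correlation coefficient r_Q), again in Python rather than SAS and with a made-up sample; the plotting positions (i - 0.5)/n are one common choice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=21)                      # made-up sample of size 21

W, p_value = stats.shapiro(x)                # Shapiro-Wilk test

n = len(x)
probs = (np.arange(1, n + 1) - 0.5) / n
r_Q = np.corrcoef(np.sort(x), stats.norm.ppf(probs))[0, 1]   # Q-Q correlation

print(W, p_value, r_Q)
```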
Example { Sparrow Data:
Recall: x1 = total length, x2 = alar extent, x3 =
length of beak and head, x4 = length of humerus,
x5 = length of keel of sternum.
• See sparrow2.sas and sparrow2.lst. Here we just consider x_1, x_2, x_3 to keep the size of the example manageable. In addition, we consider only the birds that died (birds 1-21).

76
• Notice that the Q-Q plots produced by PROC UNIVARIATE have such small scales as to make them practically worthless.
• The Shapiro-Wilk tests for x_1 through x_3 all give p-values > .05, so there is not strong evidence in any of the individual variables to indicate non-multivariate normality.
• E.g., the Shapiro-Wilk test statistic for x_1, the total length of the bird, is W = 0.934 with p-value 0.165 (see p. 1 of sparrow2.lst).
• However, we also should check the univariate normality of ê_1'x and ê_3'x. ê_1'x and ê_3'x are computed in PROC IML and printed on p. 9.
• The S-W test for ê_1'x is given by PROC UNIVARIATE on p. 11 as W = 0.861 with p-value 0.0067. Therefore, we reject the null hypothesis of normality for ê_1'x at the 0.05 level ⇒ we reject the hypothesis that x is trivariate normal.

77
• As an alternative to the Shapiro-Wilk test, r_Q = 0.920 is computed on p. 16. Since 0.920 falls below the critical value for α = 0.05 and a sample size of 20 (0.9269 from Table 4.2 in the text), we reject the null hypothesis of normality.
• The tests agree.
• Note that the procedure here deviates from the usual one by which we reject when the test statistic exceeds the critical value.
• Notice that as an alternative to letting PROC UNIVARIATE automatically produce the Q-Q plot, it is fairly easy to compute the quantiles to be plotted in a Q-Q plot and then to plot these quantiles with PROC PLOT. This I have done for both ê_1'x and ê_3'x to produce the plots on p. 15 and p. 17. These are obviously much easier to read than the corresponding plots produced by PROC UNIVARIATE on pp. 12 and 14, respectively.

78
A graphical procedure to examine multivariate normality more directly is the chi-square plot or gamma plot.
• When the sample size is moderate to large, the squared generalized distances
d_j² = (x_j - x̄)'S^{-1}(x_j - x̄),   j = 1, ..., n,
should be approximately distributed as χ²(p) random variables.
• The chi-square plot is a Q-Q plot of d_1², ..., d_n² versus the quantiles one would expect to observe based on the χ²(p) distribution.
• A straight line shape indicates multivariate normality.
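A sketch of the quantities that go into a chi-square (gamma) plot: the ordered d_j² against χ²(p) quantiles (random data and the (j - 0.5)/n plotting positions are used purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 45, 3
X = rng.normal(size=(n, p))

xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.array([(x - xbar) @ S_inv @ (x - xbar) for x in X])   # squared distances

d2_sorted = np.sort(d2)
chi2_quantiles = stats.chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=p)
# plotting d2_sorted against chi2_quantiles should look roughly linear
print(np.corrcoef(d2_sorted, chi2_quantiles)[0, 1])
```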

79
Sparrow Example (Continued):
• The Q-Q plot for d² appears on p. 4 of sparrow3.lst.
• Except for one point, this plot looks pretty close to a straight line. The unusual point, or outlier, is bird #8. This outlier was also seen in the Q-Q plot of ê_1'x and seems to be the main cause of non-multivariate normality in these data.
 It would be interesting to eliminate bird #8 and
then repeat our examination of the multivariate
normality of this data set. Perhaps bird #8's
data was incorrectly recorded, or perhaps bird
#8 happened to die for some reason other than
exposure to the severe storm.
 If a good reason can be found that this bird's
data do not belong in the data set, then this
outlier should be removed. Otherwise, a good
solution would be to analyze the data both with
and without the outlier and to report both sets
of results.

80
Suggested Procedure:
1. Examine Q-Q plots for all univariate components of x.
2. Examine Q-Q plots for ê_1'x and ê_p'x.
3. Examine a Q-Q plot for the d_j²'s.
• While one bad plot is enough to invalidate the multivariate normality assumption, it is not enough that plots on all examined linear combinations look good to confirm (with certainty) the normality assumption.
• However, we can only go so far in trying to confirm multivariate normality. If all plots look good, we will consider the multivariate normality assumption to be satisfied.
• Other "goodness of fit" tests of normality exist (chi-square, Kolmogorov-Smirnov), but these suffer serious drawbacks. They have low power to detect nonnormality in small samples and over-sensitivity in large samples.

81
Handling Non-normality - Transformations
• Basic idea: if x is not normal, find a function of x (square root, logarithm, etc.) that is.
• Most of the useful transformations are contained in the family of power transformations:
f(x) = x^λ        if λ ≠ 0;
f(x) = log(x),    which corresponds to λ = 0.
• Although there are analytic methods for picking the λ that gives the best power transformation, in practice, trial and error is an effective and acceptable approach (a small sketch follows this list).
• Trial and error is aided by the knowledge that power transformations with λ < 1 (x^{1/2} = √x, x^0 = log(x), x^{-1} = 1/x, etc.) shrink large values of x. Power transformations with λ > 1 (x², x³, etc.) increase large values of x.
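A sketch of trying a few powers by eye, plus SciPy's Box-Cox routine as one example of an "analytic" way to pick λ (simulated right-skewed data; this is an illustration, not the course's prescribed procedure):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.lognormal(mean=0.0, sigma=0.7, size=100)   # positive, right-skewed data

for lam in (1.0, 0.5, 0.0, -1.0):                  # candidate powers
    z = np.log(x) if lam == 0 else x ** lam
    print(lam, round(stats.shapiro(z)[1], 3))       # normality p-value after transforming

x_bc, lam_hat = stats.boxcox(x)                    # maximum-likelihood choice of lambda
print("Box-Cox lambda:", round(lam_hat, 2))
```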

82
• Power transformations are defined for positive valued random variables. However, they can be applied to variables with negative values by first adding a large value to each observation (to make all values positive) before taking the transformation.
• For certain special types of data, specific transformations are available that often work well:
Original Scale        Transformed Scale
Counts, y             √y
Proportions, p̂        logit(p̂) = (1/2) log( p̂ / (1 - p̂) )
Correlations, r       Fisher's z(r) = (1/2) log( (1 + r) / (1 - r) )
• Theory may also suggest an appropriate transformation.
• Often transforming the univariate components that are non-normal will be sufficient. Always re-check the multivariate normality of the transformed vectors.

83
Inference about a Mean Vector
(See chapter 5 of Johnson and Wichern.)
Finally, some data analysis!
Recall the univariate situation: t tests and Z tests for a single mean.
The simplest scenario is one in which we want to know whether the true mean of a population, μ, is equal to some specified value, μ_0. We frame the problem as a statistical test of competing hypotheses:
H_0: μ = μ_0 versus H_1: μ ≠ μ_0,
where μ is the mean of the population from which we have drawn a sample x_1, ..., x_n iid ~ N(μ, σ²).
σ² known: If the population variance is known then
Z = (x̄ - μ_0) / (σ/√n) ~ N(0, 1),
where
x̄ = (1/n) Σ_{i=1}^{n} x_i   (sample mean).
84
σ² unknown: (usual case) if the population variance is unknown we have to estimate it with the sample variance s², and the substitution of s for σ in our test statistic changes its distribution (the test statistic is more variable now because of errors in estimating σ²):
t = (x̄ - μ_0) / (s/√n) ~ t(n-1),   the t distribution with n-1 d.f.,
where
s² = (1/(n-1)) Σ_{i=1}^{n} (x_i - x̄)²   (sample variance).
We reject H_0 in favor of H_1 at the α level if the observed |t| exceeds a critical value:
Reject H_0 if |t| > t_{α/2}(n-1).
We can equivalently form a rejection rule based on t²: Reject H_0 if
t² = n(x̄ - μ_0)(s²)^{-1}(x̄ - μ_0) > t²_{α/2}(n-1).
(looks sort of like a quadratic form (statistical distance), doesn't it?)
85
As an alternative to looking at the problem from a testing perspective, we can focus on estimation.
• The acceptance region of our two sided test (those values of x̄ for which we do not reject H_0 in favor of H_1) at level α corresponds to a 100(1-α)% confidence interval for μ.
Why? Because
Pr( |x̄ - μ_0| / (s/√n) ≤ t_{α/2}(n-1) )
= Pr( |x̄ - μ_0| ≤ t_{α/2}(n-1) s/√n )
= Pr( -t_{α/2}(n-1) s/√n ≤ x̄ - μ_0 ≤ t_{α/2}(n-1) s/√n )
= Pr( x̄ - t_{α/2}(n-1) s/√n ≤ μ_0 ≤ x̄ + t_{α/2}(n-1) s/√n )

86
Extension to Multivariate Case:
From a N_p(μ, Σ) population we draw a sample
x_1, ..., x_n iid ~ N_p(μ, Σ).
The corresponding hypotheses in the multivariate case are
H_0: μ = μ_0 versus H_1: μ ≠ μ_0,
where
μ = (μ_1, ..., μ_p)',   μ_0 = (μ_10, ..., μ_p0)'.
Note that H_0 requires all component means to be equal to the specified values, μ_10, ..., μ_p0, and H_1 holds if any (one or more) μ_i ≠ μ_i0.
Test statistic (Hotelling's T²):
T² = n(x̄ - μ_0)'S^{-1}(x̄ - μ_0).
• We reject H_0 in favor of H_1 if T² is large.


87
How large? What critical value do we compare the observed T² to?
• We need to know the distribution of T² to answer these questions.
• t in the univariate case had a Student's t distribution because t is of the form
N(0, 1) / √( χ²(d.f.)/d.f. ).
• Random variables which are of the form
( χ²(d.f._1)/d.f._1 ) / ( χ²(d.f._2)/d.f._2 )
are known to have an F(d.f._1, d.f._2) distribution.

88
• So, since
t² = √n(x̄ - μ_0)(s²)^{-1}√n(x̄ - μ_0)
is of the form
(N(0, 1)) (χ²(d.f._2)/d.f._2)^{-1} (N(0, 1)) = (χ²(1)/1) / (χ²(d.f._2)/d.f._2),
it follows that t² ~ F(1, d.f._2).
• In the multivariate case
T² = √n(x̄ - μ_0)'S^{-1}√n(x̄ - μ_0)
is of the form
(N_p)'(W_p/d.f.)^{-1}(N_p).
Since the multivariate normal is a generalization of the normal and the Wishart is a generalization of the χ², it is not very surprising that it turns out that T² has a (scaled) F distribution:
T² ~ [ (n-1)p / (n-p) ] F(p, n-p).

89
• We reject H_0 in favor of H_1 if
T² > [ (n-1)p / (n-p) ] F_α(p, n-p),
where F_α(d.f._1, d.f._2) is the upper (100α)th percentile of the F(d.f._1, d.f._2) distribution.
• Notice that T² is the statistical distance between x̄ and μ_0.
Example - Kite Data:
p = 2 variables (x_1 = tail length, x_2 = wing length) measured on a sample of n = 45 birds. Suppose we want to test
H_0: μ = (190, 275)' versus H_1: μ ≠ (190, 275)'.

90
Recall
x̄ = [ 193.62       S = [ 120.69  122.35
      279.78 ],           122.35  208.54 ],
so S^{-1} equals
[ 1 / (120.69(208.54) - 122.35²) ] [  208.54  -122.35
                                     -122.35   120.69 ].
So
T² = 45 (193.62-190  279.78-275) (1/10201) [  208.54  -122.35   [ 193.62-190
                                             -122.35   120.69 ]   279.78-275 ]
   = 5.543.
For a significance level of α = 0.05, the critical value is
[ (n-1)p / (n-p) ] F_α(p, n-p) = (2(44)/43) F_{.05}(2, 43) = (2.047)(3.214) = 6.578,
so we do not reject H_0 and we conclude that μ_0 = (190, 275)' is plausible.
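A sketch of the same Hotelling T² test computed from the summary statistics printed in the notes (rather than from the raw kite data):

```python
import numpy as np
from scipy import stats

n, p = 45, 2
xbar = np.array([193.62, 279.78])
S = np.array([[120.69, 122.35],
              [122.35, 208.54]])
mu0 = np.array([190.0, 275.0])

diff = xbar - mu0
T2 = n * diff @ np.linalg.inv(S) @ diff
crit = (n - 1) * p / (n - p) * stats.f.ppf(0.95, p, n - p)

print(round(T2, 3), round(crit, 3))   # about 5.54 < 6.58, so do not reject H0
```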
91
• Hotelling's T² test is invariant under transformations of the form
y_{p×1} = C_{p×p} x_{p×1} + d_{p×1},
where C is nonsingular.
Kite Example (Continued):
So, for example, if we wanted to test the corresponding hypothesis on size (y_1 = x_1 + x_2) and shape (y_2 = x_1 - x_2) as we defined them before:
y = [ 1   1    x,
      1  -1 ]
we would test
H_0: μ_y = [ 190 + 275     [ 465
             190 - 275 ] =   -85 ].
For this problem, T² = 5.543 as before, with the same critical value, 6.578, so we do not reject H_0. Note that
| 1   1 |
| 1  -1 | ≠ 0   (nonsingular transformation).

92
Why not just do p separate (univariate) t tests for
each component of x?
 Does not take into account interrelationships
(correlations) among the components of x.
 The multivariate test controls the type I error
rate, the probability of rejecting H0 when H0
is true.
{ When doing multiple univariate tests the
probability that at least one univariate null
hypothesis will be rejected by chance alone
increases with the number of tests being
performed.
{ Procedures to account for this problem ex-
ist (e.g., Bonferroni) but generally such pro-
cedures are not as attractive as performing
a single multivariate test.

93
T 2 as a LRT:
 So far the use of T 2 as a test statistic has been
motivated by analogy to the univariate case.
Could we do better? (Is T 2 optimal? E.g., in terms
of power?)
 It turns out that T 2 does have several optimal-
ity properties because T 2 can also be derived
as a likelihood ratio test (LRT). The theory
of LRTs (don't worry) establishes a number of
optimality properties that T 2 inherits.

94
Likelihood Ratio Tests:
• The likelihood function is just the density function, but thought of as a function of the parameters rather than of the data.
For example, the univariate normal distribution based on a sample x_1, ..., x_n:
- We think of the density
f(x_1, ..., x_n) = (2πσ²)^{-n/2} exp( -Σ_{i=1}^{n} (x_i - μ)² / (2σ²) )
as a function of the data (the x_i's) for given values of the parameters (μ, σ²), giving, in some sense, the probability of observing x_1, ..., x_n for given μ, σ². Hence, the argument of f is x_1, ..., x_n.
- Actually, the density function involves both the data and the parameters.

95
- Once we have the data, we want to do inference concerning the parameters. Therefore, we can think of the density as a function of the parameters (μ, σ²) given the data (the x_i's).
- To avoid confusion, when we think of the density this way we give it a new name, the likelihood function, and write it as a function of the parameters:
L(μ, σ²) = (2πσ²)^{-n/2} exp( -Σ_{i=1}^{n} (x_i - μ)² / (2σ²) ).
- The likelihood function gives, in some sense, the probability (or likelihood) of the parameters given the data.

96
• The maximum likelihood approach to estimation says to estimate the parameters with those values that maximize the likelihood.
- For the univariate normal distribution, with some calculus we can establish that the maximizers of L(μ, σ²) (the maximum likelihood estimates or MLEs) are
μ̂ = x̄ = (1/n) Σ_{i=1}^{n} x_i,
σ̂² = ((n-1)/n) s² = (1/n) Σ_{i=1}^{n} (x_i - x̄)².
- For the multivariate normal distribution, the likelihood function is
L(μ, Σ) = (2π)^{-np/2} |Σ|^{-n/2} exp( -(1/2) Σ_{i=1}^{n} (x_i - μ)'Σ^{-1}(x_i - μ) ).

97
- Again, some calculus leads to the MLEs:
μ̂ = x̄ = (1/n) Σ_{i=1}^{n} x_i,
Σ̂ = ((n-1)/n) S = (1/n) Σ_{i=1}^{n} (x_i - x̄)(x_i - x̄)'.
- The idea is that μ̂ and Σ̂ are those choices for μ, Σ that make observing the data actually obtained most likely.
• Idea behind the LRT for some hypothesis H_0:
- Compare the maximum of the likelihood function assuming H_0 is true (the constrained maximum likelihood) and the maximum of the likelihood without assuming H_0 is true (the unconstrained maximum likelihood).
- If the constrained ML is "significantly" smaller, then we conclude that assuming H_0 makes our data much less likely than not assuming H_0 ⇒ we should reject H_0.
98
Application to H_0: μ = μ_0 for multivariate normal data:
• The unconstrained ML is the likelihood evaluated at the MLEs:
max_{μ,Σ} L(μ, Σ) = L(μ̂, Σ̂)
  = (2π)^{-np/2} |Σ̂|^{-n/2} exp( -(1/2) Σ_{i=1}^{n} (x_i - μ̂)'Σ̂^{-1}(x_i - μ̂) )
  = (2π)^{-np/2} |Σ̂|^{-n/2} exp( -np/2 )
after some matrix algebra.
after some matrix algebra.

99
• The constrained ML is the maximum of the likelihood function when we assume H_0 is true; that is, when we assume that μ = μ_0.
- Thus the constrained ML is the likelihood evaluated at the constrained MLEs, the estimators of μ, Σ that assume μ = μ_0.
- The constrained MLEs are μ_0 and the Σ that maximizes L(μ_0, Σ), which turns out to be (calculus)
Σ̂_0 = (1/n) Σ_{i=1}^{n} (x_i - μ_0)(x_i - μ_0)'.
- Therefore the constrained ML is
max_{Σ} L(μ_0, Σ) = L(μ_0, Σ̂_0)
  = (2π)^{-np/2} |Σ̂_0|^{-n/2} exp( -(1/2) Σ_{i=1}^{n} (x_i - μ_0)'Σ̂_0^{-1}(x_i - μ_0) )
  = (2π)^{-np/2} |Σ̂_0|^{-n/2} exp( -np/2 ).
100
