Lec1 (1) .Ps
Lecture Notes
Introduction
1
Regression:
Simple: explain variability in Y from X.
Multiple: explain variability in Y from X1, X2, …, Xr.
Anova:
One-way: explain variability in Y based on X,
where X is a factor.
These examples are not usually considered to be
multivariate analyses.
Why?
We distinguish between the Response Variable (Y)
and Explanatory Variables (X's), and in the analysis
we treat the X's as fixed.
⇒ only 1 random variable in each situation.
(multiple regression ≠ multivariate regression)
2
One type of analysis that is a good example of a
multivariate method is correlation analysis.
Recall:
Correlation Analysis: measure linear association
between X, Y without distinguishing response
versus explanatory.
Other examples:
M'variate 1 & 2 Samp. Inference: Hypothesis tests
and confidence intervals for the mean vector(s)
from one (or two) group(s). Procedures analogous
to t and Z in the univariate setting.
MANOVA: Acronym for multivariate analysis of
variance. Inference for mean vectors from sev-
eral populations.
Multivariate Linear Regression: Explain variability
in response variables Y1, Y2, …, Ym based on
explanatory variables X1, …, Xr.
3
Principal Components Analysis: Reduce the dimension
of a data set by examining a small number
of linear combinations of the original variables
that explain most of the variability among the
original variables.
Factor Analysis: Similar to P.C.A. Describe the
variance/covariance relationships among many
variables in terms of a few underlying, but
unobservable, random quantities called factors.
Canonical Correlation Analysis: Determine the lin-
ear association between two sets of random vari-
ables. Find a linear combination of several X 's
and a linear combination of several Y 's such
that the two linear combinations have maxi-
mum correlation.
Classification and Discrimination: Methods for
separating distinct sets of objects (experimental
units) and for allocating (classifying) objects
to previously defined groups.
4
Cluster Analysis: Methods for grouping objects
based on distances (which objects are most similar
to which others). Groups are not predefined
and no assumptions are made concerning the
number of groups or the group structure.
Others: Multidimensional Scaling, Repeated Mea-
sures Analyses, Path Analysis, Econometric
Methods.
5
Variable- vs. Individual-Directed Methods
Variable-directed Techniques: primarily concerned
with relationships among the response variables
being measured. E.g., PCA, FA, M'var. Reg.,
Can. Corr.
Individual-directed Techniques: primarily concerned
with relationships among the experimental
units or individuals being measured.
E.g., Discrim. Anal., Clustering, MANOVA.
Some methods can be used as either variable-
directed or individual-directed. E.g., clustering
can be done to group individuals or to group
variables.
Multivariate methods also differ with respect to
how informal or exploratory their typical uses
are. E.g., cluster analysis and P.C.A. are often
preliminary or exploratory tools used as part of
a larger analysis.
6
Theory vs. Methods:
(1) problem: estimate population mean μ
    (given x1, …, xn)

    model: X1, …, Xn iid N(μ, σ²),
           μ, σ² unknown

    theory: X̄ ~ N(μ, σ²/n),
            (n − 1)S²/σ² ~ χ²(n − 1),
            X̄, S² independent.
            ⇒ √n(X̄ − μ)/S ~ t(n − 1)

    method: x̄ ± t_{1−α/2}(n − 1) s/√n forms a
            100(1 − α)% C.I. for μ

    application: n = 20, α = 0.10
                 x̄ = 49.70, s = 0.32
                 t_{.95}(19) = 1.729
                 90% CI = 49.70 ± 1.729(0.32/√20)
                        = (49.58, 49.82)
(2) (problem → method → application) is not sufficient.
7
(3) validity of method depends on the validity of
the model. Is the model valid? (need to know
something about iid N(μ, σ²)) If not valid, how
is the method affected? Alternatives?
(4) How do we interpret the results of our application?
(need to understand what the model means)
8
Preliminaries
11
Addition: A = (aij) added to B = (bij) is given by

    A + B = (aij + bij)

A and B must have the same dimension.
E.g., for A as before and B given by

        ( 1 2 )                    ( 3 1 )
    B = ( 3 1 )   we get   A + B = ( 2 1 )
        ( 0 2 )                    ( 2 1 )
12
If A (m×n) = (aij) and B (n×p) = (bij) are conformable
then the m×p matrix product AB has (i, j)th element

    Σ_{k=1}^n aik bkj.
13
Special cases:
1. Matrix times a vector: for A (n×p) and x (p×1),

         ( a11 a12 … a1p ) ( x1 )
    Ax = ( a21 a22 … a2p ) ( x2 )
         (  ⋮   ⋮  ⋱  ⋮  ) (  ⋮ )
         ( an1 an2 … anp ) ( xp )

         ( a11 x1 + a12 x2 + … + a1p xp )
       = ( a21 x1 + a22 x2 + … + a2p xp )   (n×1)
         (               ⋮              )
         ( an1 x1 + an2 x2 + … + anp xp )

       = x1 a1 + x2 a2 + … + xp ap = Σ_{i=1}^p xi ai,

a linear combination of the columns of A.
14
2. Dot product or inner product:

                         ( y1 )
    x′y = ( x1 x2 … xp ) ( y2 ) = Σ_{i=1}^p xi yi
                         (  ⋮ )
                         ( yp )

Sometimes written x·y or <x, y>. Note:
x′x = Σ_{i=1}^p xi² (sum of squared elements).

3. Column vector times row vector:

          ( x1 )
    xy′ = ( x2 ) ( y1 y2 … yp ) = (xi yj)   (p×p)
          (  ⋮ )
          ( xp )

3.a. Vector times its transpose:

          ( x1 )
    xx′ = ( x2 ) ( x1 x2 … xp ) = (xi xj)   (p×p)
          (  ⋮ )
          ( xp )
15
Properties:
1. A + B = B + A.
2. (A + B) + C = A + (B + C).
3. c(A + B) = cA + cB.
4. (c + d)A = cA + dA.
5. (cd)A = c(dA).
6. c(AB) = (cA)B.
7. A(BC) = (AB)C.
8. A(B + C) = AB + AC.
9. (A + B)C = AC + BC.
10. (A + B)′ = A′ + B′.
11. (cA)′ = cA′.
12. (AB)′ = B′A′.
13. AB ≠ BA in general. E.g.,

    A = ( 1 1 )     B = ( 0 1 )
        ( 0 2 ),        ( 1 2 )

    AB = ( 1 3 )     BA = ( 0 2 )
         ( 2 4 ),         ( 1 5 )

BA may not be defined; if it is, AB and BA
may not have the same dimension; if dim(AB) =
dim(BA), that doesn't imply that AB = BA.
16
Geometrical Concepts:
Vector: a line segment from the origin (0, all coordinates
equal to zero) to the point indicated
by the coordinates of the (algebraic) vector.

              ( 1 )
    e.g., x = ( 3 )
              ( 2 )
Vector Addition:
17
Scalar Multiplication: Multiplication by a scalar
scales a vector by shrinking or extending the
vector in the same direction.
18
Orthogonal (Perpendicular) Vectors: Two
vectors x and y are orthogonal to one another if
and only if x′y = 0 (⟺ cos(θ) = 0 ⟺ θ = π/2).
19
Euclidean Distance: Since a vector x has the interpretation
of a directed line segment from the
origin (0) to the point (x1, x2, …, xp), the Euclidean
(straight-line) distance between 0 and x
is clearly the length of the vector, Lx:

    d(x, 0) = Lx = √(x1² + … + xp²).

More generally, the Euclidean distance between
two arbitrary points x and y is

    d(x, y) = L_{x−y} = √((x1 − y1)² + … + (xp − yp)²).

There are many ways to define distance. Euclidean
distance is just one (familiar) example.
Properties of a distance measure d:
    d(x, y) = d(y, x)
    d(x, y) ≥ 0, with equality when x = y
    d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)
We'll consider another distance measure, sta-
tistical distance, later.
20
More Matrix Functions and Definitions:
Determinant: The determinant of a square matrix
A (same number of rows as columns) with
dimension k×k is the scalar denoted as |A|
given by

    |A| = a11                               if k = 1
    |A| = Σ_{j=1}^k a1j |A1j| (−1)^{1+j}    if k > 1,

where A1j is the (k−1)×(k−1) matrix obtained
by deleting the first row and the jth column
of A. Note that |A| can be defined as
Σ_{j=1}^k aij |Aij| (−1)^{i+j} using the ith row in place
of the first.
Special Cases:
k = 2:

    | a11 a12 |
    | a21 a22 | = a11 a22 (−1)² + a12 a21 (−1)³ = a11 a22 − a12 a21

E.g.,

    |  3 1 |
    | −2 4 | = 12 − (−2) = 14
21
k = 3:

    | a11 a12 a13 |
    | a21 a22 a23 | = +a11 | a22 a23 | − a12 | a21 a23 | + a13 | a21 a22 |
    | a31 a32 a33 |        | a32 a33 |       | a31 a33 |        | a31 a32 |

    = a11(a22 a33 − a23 a32)
    − a12(a21 a33 − a23 a31)
    + a13(a21 a32 − a22 a31)
Properties:
1. |cA| = c^k |A|, for A (k×k).
2. |A′| = |A|.
3. |AB| = |A||B|.
4. For a diagonal matrix D (k×k) of the form

        ( d11  0  …  0  )
    D = (  0  d22 …  0  )
        (  ⋮   ⋮  ⋱  ⋮  )
        (  0   0  … dkk ),

    |D| = d11 d22 … dkk.
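The recursive cofactor definition above translates directly into code. A minimal Python sketch (the function name `det` is mine; real software computes determinants via factorizations, not this expensive recursion):

```python
# Cofactor expansion of the determinant, following the definition
# |A| = sum_j a_1j |A_1j| (-1)^(1+j) above.  Illustrative only.

def det(A):
    k = len(A)
    if k == 1:
        return A[0][0]
    total = 0.0
    for j in range(k):
        # A_1j: delete the first row and the (j+1)th column
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        total += A[0][j] * det(minor) * (-1) ** (1 + (j + 1))
    return total

print(det([[3, 1], [-2, 4]]))                   # the 2x2 example: -> 14.0
print(det([[2, 0, 0], [0, 3, 0], [0, 0, 4]]))   # diagonal: product of entries
```

The diagonal case reproduces Property 4: the determinant of a diagonal matrix is the product of its diagonal elements.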
22
Linear Dependence: A set of vectors x1 ; : : : ; xk
is said to be linearly dependent if there exist
constants c1 ; : : : ; ck not all zero such that
c1 x1 + c2 x2 + … + ck xk = 0
Recall collinearity/multicollinearity complications
occurred in regression when the explanatory vari-
ables were linearly dependent; that is, when the
columns of the X matrix were linearly depen-
dent.
A set of vectors which is not linearly dependent
is said to be linearly independent.
jAj = 0 if and only if there exists a nonzero vector x
such that Ax = 0; that is, if and only if the columns
of A are linearly dependent.
23
Identity Matrix: A matrix with ones on the diagonal
and zeros elsewhere is called an identity
matrix and is denoted by I (k×k) or I when the
dimension is clear from the context. E.g.,

           ( 1 0 0 )
    I3×3 = ( 0 1 0 ).
           ( 0 0 1 )

I is the "identity" matrix because

    AI = IA = A
Matrix Inverse: Matrix division is defined as multiplication
by the matrix inverse. Just as a/b
can be defined as ab⁻¹ for scalars a and b, we
will think of A divided by B as AB⁻¹ where
B⁻¹ is that matrix such that

    BB⁻¹ = I

(just as b⁻¹ = 1/b is that scalar such that bb⁻¹ = 1).
The matrix inverse is only defined for square
matrices that have nonzero determinants.
If A (square) has |A| = 0 then its columns are
linearly dependent and it has no inverse. Such
a matrix is called singular.
24
Computation of Matrix Inverse: A⁻¹ has (i, j)th
element (−1)^{i+j} |Aji| / |A|, where Aji is the matrix
obtained by deleting the jth row and ith
column of A. E.g.,

    ( a b )⁻¹      1      (  d −b )
    ( c d )   = ad − bc   ( −c  a )

We will rely on the computer to compute inverses
for matrices with dimension > 2.
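As a quick check of the 2×2 formula, a small Python sketch (the helper name `inv2` is mine, not from the notes):

```python
# The 2x2 inverse: (a b; c d)^{-1} = 1/(ad - bc) (d -b; -c a).
# Assumes a nonzero determinant; otherwise the matrix is singular.

def inv2(a, b, c, d):
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular matrix: |A| = 0, no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[3, 1], [-2, 4]]
Ainv = inv2(3, 1, -2, 4)
# first row of A A^{-1}, computed by hand -- should be (1, 0):
row0 = [A[0][0] * Ainv[0][0] + A[0][1] * Ainv[1][0],
        A[0][0] * Ainv[0][1] + A[0][1] * Ainv[1][1]]
print(row0)   # approximately [1.0, 0.0]
```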
Properties:
1. (A⁻¹)′ = (A′)⁻¹.
2. (AB)⁻¹ = B⁻¹A⁻¹.
3. |A⁻¹| = 1/|A|.
25
Eigenvalues and Eigenvectors: A square k×k
matrix A is said to have an eigenvalue λ, with
a corresponding eigenvector x ≠ 0, if

    Ax = λx                (1)

Since (1) can be written

    (A − λI)x = 0

for some nonzero vector x, the eigenvalues of A
are the solutions of

    |A − λI| = 0

If we look at how a determinant is calculated
it is not difficult to see that |A − λI| will be a
polynomial in λ of order k, so there will be k
(not necessarily real) eigenvalues (solutions).
1. If |A| = 0 then Ax = 0 for some nonzero x.
That is, if the columns of A are linearly dependent
then λ = 0 is an eigenvalue of A.
26
2. The eigenvector associated with a particular eigenvalue
is unique only up to a scale factor. That
is, if Ax = λx, then A(cx) = λ(cx), so x and
cx are both eigenvectors of A corresponding to
λ. We typically normalize eigenvectors to have
length 1 (choose the x that has the property
x′x = 1).
3. If λi and λj are distinct eigenvalues of the symmetric
matrix A then their associated eigenvectors
xi, xj are orthogonal.
27
4. Computation: By computer typically, but for
the 2×2 case we have

    A = ( a b )   ⇒   A − λI = ( a−λ   b  )
        ( c d )                (  c   d−λ )

    |A − λI| = 0 ⇒
    (a − λ)(d − λ) − bc = λ² − (a + d)λ + (ad − bc) = 0

    ⇒ λ = [(a + d) ± √((a + d)² − 4(ad − bc))] / 2

To obtain the associated eigenvector solve Ax =
λx. There are infinitely many solutions for x, so
choose one by setting x1 = 1, say, and solving for
x2. Normalize to have length one by computing
Lx⁻¹ x.
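The 2×2 recipe above can be sketched in a few lines of Python (the helper `eig2` is mine; it assumes a symmetric matrix with b ≠ 0 so the first-row equation determines x2):

```python
import math

# Eigenvalues from the quadratic formula
# lambda = [(a+d) +/- sqrt((a+d)^2 - 4(ad-bc))]/2, then an eigenvector
# with x1 = 1 solved from (a - lambda) x1 + b x2 = 0, normalized to
# length 1.  A sketch for symmetric 2x2 matrices with b != 0.

def eig2(a, b, c, d):
    tr, det = a + d, a * d - b * c
    disc = math.sqrt(tr * tr - 4 * det)
    pairs = []
    for lam in ((tr + disc) / 2, (tr - disc) / 2):
        x1, x2 = 1.0, (lam - a) / b
        L = math.sqrt(x1 ** 2 + x2 ** 2)
        pairs.append((lam, (x1 / L, x2 / L)))
    return pairs

pairs = eig2(2.0, 1.0, 1.0, 2.0)
print([lam for lam, _ in pairs])   # -> [3.0, 1.0]
```

For this symmetric example the two eigenvectors come out as (1, 1)/√2 and (1, −1)/√2, which are orthogonal, as Property 3 above requires.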
28
Orthogonal Matrices: We say that A is orthogonal
if A′ = A⁻¹ or, equivalently, AA′ = A′A =
I, so that the columns (and rows) of A all have
length 1 and are mutually orthogonal.
Spectral Decomposition: If A (k×k) is a symmetric
matrix (i.e., A = A′) then it can be written
(decomposed) as follows:

    A = E Λ E′   (each factor k×k),

where

        ( λ1  0  …  0  )
    Λ = (  0  λ2 …  0  )
        (  ⋮   ⋮  ⋱  ⋮ )
        (  0   0  … λk )

and E is an orthogonal matrix with columns
e1, e2, …, ek the eigenvectors corresponding to
eigenvalues λ1, λ2, …, λk.
We haven't yet talked about how to interpret eigenvalues
and eigenvectors, but the spectral decomposition
says something about the significance of these
quantities (why we're interested in them): the "information"
in A can be broken apart into its eigenvalues
and a set of orthogonal eigenvectors.
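The decomposition is easy to verify numerically. A numpy sketch (illustrative numbers, not from the notes; `np.linalg.eigh` is the routine for symmetric matrices and returns the eigenvectors as the columns of E):

```python
import numpy as np

# Spectral decomposition A = E Lambda E' for a symmetric matrix.

A = np.array([[4.0, 1.0], [1.0, 3.0]])
lam, E = np.linalg.eigh(A)              # eigenvalues ascending, columns = e_i

recon = E @ np.diag(lam) @ E.T          # E Lambda E'
print(np.allclose(recon, A))            # -> True
print(np.allclose(E.T @ E, np.eye(2)))  # E orthogonal -> True
```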
29
Results:
1. For A (k×k) symmetric,

    |A| = |E||Λ||E′| = |EE′||Λ| = |I||Λ| = Π_{i=1}^k λi

I.e., the determinant of a symmetric matrix is
the product of its eigenvalues.
2. If A has eigenvalues λ1, …, λk, then A⁻¹ has
the same associated eigenvectors and eigenvalues
1/λ1, …, 1/λk, since

    A⁻¹ = (EΛE′)⁻¹ = E Λ⁻¹ E′.
30
3. Similarly, if A has the additional property that
it is positive definite (defined below), a square
root matrix, A^{1/2}, can be obtained with the
property A^{1/2} A^{1/2} = A:

    A^{1/2} = E Λ^{1/2} E′,  where

              ( √λ1  0   …   0  )
    Λ^{1/2} = (  0  √λ2  …   0  )
              (  ⋮    ⋮  ⋱   ⋮  )
              (  0    0  …  √λk )

A^{1/2} is symmetric, has eigenvalues that are the
square roots of the eigenvalues of A, and has
the same associated eigenvectors as A.
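A numpy sketch of the square-root construction (illustrative matrix, not from the notes; `scipy.linalg.sqrtm` is the usual library routine for this):

```python
import numpy as np

# Square-root matrix A^{1/2} = E Lambda^{1/2} E' for a p.d. matrix.

A = np.array([[5.0, 4.0], [4.0, 5.0]])    # eigenvalues 9 and 1: p.d.
lam, E = np.linalg.eigh(A)
A_half = E @ np.diag(np.sqrt(lam)) @ E.T

print(np.allclose(A_half @ A_half, A))    # A^{1/2} A^{1/2} = A -> True
print(np.allclose(A_half, A_half.T))      # symmetric -> True
```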
Quadratic Forms: For a symmetric matrix A (k×k),
a quadratic form in x (k×1) is defined by

    x′Ax = Σ_{i=1}^k Σ_{j=1}^k aij xi xj.

(x′Ax is a sum of squared (quadratic) terms,
xi², and xi xj terms.)
31
Positive Definite Matrices: A is positive definite
(p.d.) if x′Ax > 0 for all x ≠ 0. If x′Ax ≥ 0
for all nonzero x then A is positive semi-definite
(p.s.d.) or, sometimes, non-negative
definite.
If A (k×k) is p.d. then for i = 1, …, k,

    ei′ A ei = λi (ei′ ei) > 0;

i.e., the eigenvalues of A are all positive.
It can also be shown that if the eigenvalues of
A are all positive then A is p.d.
32
Statistical Distance:
33
Need a distance measure where coordinates are
weighted differentially and distance is relative
to the variability of the data.
One solution, Karl Pearson Distance: Distance
of a point P = (x1, x2) from the origin 0 = (0, 0)
takes into account how many (X1) standard deviations
x1 is from 0 and how many (X2) standard
deviations x2 is from 0:

    d(0, P) = √( (x1/s1)² + (x2/s2)² ),

where s1 is the standard deviation of X1 and s2 is
the standard deviation of X2. The generalization
to k-dimensional space is straightforward:

    d(0, P) = √( (x1/s1)² + … + (xk/sk)² ).
34
An alternative, Statistical Distance: Karl Pearson
Distance would be o.k. if it were computed with
respect to the rotated axes X̃1 and X̃2. So, in two
dimensions, we can define statistical distance as

    d(0, P) = √( (x̃1/s̃1)² + (x̃2/s̃2)² ),

where

    x̃1 =  x1 cos(θ) + x2 sin(θ)
    x̃2 = −x1 sin(θ) + x2 cos(θ),

s̃1, s̃2 are the standard deviations of X̃1 and X̃2, and
θ is the angle of rotation.
35
With respect to the original coordinate system we
have

    d(0, P) = √( a11 x1² + 2 a12 x1 x2 + a22 x2² ),

where a11, a12, and a22 are coefficients depending
on s1, s2, cov(X1, X2) and trigonometric functions
of θ (not important now).
Notice that d(0, P) as defined above can be written
as

    d(0, P)² = ( x1 x2 ) ( a11 a12 ) ( x1 ) = x′Ax,
                         ( a12 a22 ) ( x2 )

a quadratic form in x involving the symmetric matrix
A.
In fact, any positive definite quadratic form can
be thought of as a squared statistical distance.
The positive definiteness of the matrix A is required
to ensure the properties of a distance
metric (non-negativity, triangle inequality, etc.).
    x′Ax: squared distance from x to 0,
    (x − y)′A(x − y): squared distance from x to y.
36
Geometric Interpretation of the Spectral Decomposition:
Consider the set of points a constant
(statistical) distance c away from the origin.
These points satisfy

    x′Ax = c²  ⇒  x′EΛE′x = y′Λy = Σ_i λi yi² = c²,

which is the equation of an ellipsoid in the transformed
variable, y = E′x.
A transformation of this form, y = E′x, involving
an orthogonal matrix E, defines a rotation from the
x coordinate system into the y coordinate system
which is rigid, in the sense that the orthogonality of
the axes is maintained. The new axes are in the
directions of the eigenvectors, e1, e2, ….
37
In two dimensions we have

    λ1 y1² + λ2 y2² = c²,

the equation of an ellipse, with y1 = x′e1 and y2 =
x′e2. If we let x = c λ1^{−1/2} e1 we see that this value
satisfies x′Ax = c²:

    x′Ax = λ1 (x′e1)² + λ2 (x′e2)²
         = λ1 (c λ1^{−1/2} e1′e1)² + λ2 (c λ1^{−1/2} e1′e2)² = c²,

using e1′e1 = 1 and e1′e2 = 0.
Therefore, ||c λ1^{−1/2} e1|| = c/√λ1 gives the length
of the ellipse along e1. Similarly, c/√λ2 gives the
length of the ellipse in the perpendicular direction
along e2.
38
One More Matrix Function:
Trace of a Matrix: Defined only for square matrices.
Let A (k×k) = (aij) be a square matrix. The
trace of A is the sum of its diagonal elements:

    tr(A) = Σ_{i=1}^k aii

Properties:
1. tr(cA) = c tr(A).
2. tr(A + B) = tr(A) + tr(B).
3. tr(ABC) = tr(CAB) as long as the matrix
products are defined (matrices are conformable)
and ABC and CAB are square. In particular,

    tr(B⁻¹AB) = tr(A)

and

    tr(Axx′) = tr(x′Ax) = x′Ax   (a scalar).

4. tr(AA′) = Σ_i Σ_j aij².
5. If A (k×k) is symmetric with eigenvalues λ1, …, λk,
then tr(A) = Σ_{i=1}^k λi.
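Properties 3 and 5 are easy to spot-check numerically. A numpy sketch with random matrices (illustrative only):

```python
import numpy as np

# Check tr(ABC) = tr(CAB) (cyclic property) and, for a symmetric
# matrix, tr(A) = sum of its eigenvalues.

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))   # -> True

S = A + A.T                      # a symmetric matrix
lam = np.linalg.eigvalsh(S)
print(np.isclose(np.trace(S), lam.sum()))                     # -> True
```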
39
Basic Multivariate Statistical Concepts
40
Recall, for a univariate random variable X,

    E(X) = ∫_{−∞}^{∞} x fX(x) dx   if X is continuous;
    E(X) = Σ_{all x} x pX(x)       if X is discrete.

Here, fX(x) is the probability density function
of X in the continuous case, and pX(x) is the probability
function of X in the discrete case.
(Population) Variance-Covariance Matrix: For
a random vector x (p×1) = (x1, x2, …, xp)′, the
matrix

    (  var(x1)     cov(x1, x2)  …  cov(x1, xp) )
    ( cov(x2, x1)   var(x2)     …  cov(x2, xp) )
    (      ⋮            ⋮       ⋱       ⋮      )
    ( cov(xp, x1)  cov(xp, x2)  …   var(xp)    )

      ( σ11 σ12 … σ1p )
    = ( σ21 σ22 … σ2p )
      (  ⋮   ⋮  ⋱  ⋮  )
      ( σp1 σp2 … σpp )

is called the variance-covariance matrix of x and
is denoted var(x).
41
Recall: for a univariate random variable xi with
expected value μi,

    σii = var(xi) = E[(xi − μi)²]
42
(Population) Correlation Matrix: For a random
vector x (p×1), the population correlation
matrix is the matrix of correlations among the
elements of x:

              (  1  ρ12 … ρ1p )
    corr(x) = ( ρ21  1  … ρ2p )
              (  ⋮   ⋮  ⋱  ⋮  )
              ( ρp1 ρp2 …  1  ),

where ρij = corr(xi, xj).
Recall: for random variables xi and xj,

    ρij = corr(xi, xj) = σij / (√σii √σjj)

measures the amount of linear association between
xi and xj.
Correlation matrices are symmetric.
For a random vector x (p×1), let ρ = corr(x), Σ =
var(x) and V = diag(σ11, σ22, …, σpp). The
relationship between ρ and Σ is

    Σ = V^{1/2} ρ V^{1/2}
    ρ = (V^{1/2})⁻¹ Σ (V^{1/2})⁻¹
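The second identity says: divide each row and column of Σ by the corresponding standard deviation. A numpy sketch with made-up numbers (the matrix is mine, not from the text):

```python
import numpy as np

# rho = (V^{1/2})^{-1} Sigma (V^{1/2})^{-1}, where V^{1/2} is the
# diagonal matrix of standard deviations sqrt(sigma_ii).

Sigma = np.array([[4.0,  2.0,  0.0],
                  [2.0,  9.0, -1.5],
                  [0.0, -1.5,  1.0]])
V_half_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
rho = V_half_inv @ Sigma @ V_half_inv

print(np.diag(rho))   # unit diagonal -> [1. 1. 1.]
print(rho[0, 1])      # 2 / (2 * 3) = 0.333...
```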
43
Properties:
Let X, Y be random matrices of the same dimension
and A, B be matrices of constants such
that AXB is defined.
1. E(X + Y) = E(X) + E(Y).
2. E(AXB) = A E(X) B.
Now let x (p×1) be a random vector with mean μx
and variance Σx. Let c (p×1) be a vector of constants,
and C (k×p) a matrix of constants.
For the linear combination c′x = c1 x1 + … + cp xp,
3. E(c′x) = c′μx, and
4. var(c′x) = c′ Σx c.
For the k-dimensional vector of linear combinations,
Z (k×1) = Cx,
5. μZ = E(Z) = E(Cx) = C E(x) = C μx and
6. ΣZ = var(Z) = var(Cx) = C Σx C′.
44
All of the definitions and properties of the last 5
pages have been defined in terms of population quantities.
There are corresponding definitions of sample
quantities for which the same properties hold:
Examples (sample size n):
- Sample mean: x̄ = (x̄1, …, x̄p)′, where x̄i =
  n⁻¹ Σ_{k=1}^n x_ki. (x_ki is the kth observation of
  random variable xi.)
- Sample variance-covariance matrix:

         ( s11 … s1p )
    SX = (  ⋮  ⋱  ⋮  )
         ( sp1 … spp )

       = 1/(n−1) Σ_{k=1}^n (xk − x̄)(xk − x̄)′,

  where

    sij = 1/(n−1) Σ_{k=1}^n (x_ki − x̄i)(x_kj − x̄j).
45
- Sample correlation matrix:

         (  1  r12 … r1p )
    RX = ( r21  1  … r2p )
         (  ⋮   ⋮  ⋱  ⋮  )
         ( rp1 rp2 …  1  ),

  where rij is the sample correlation computed on
  the n sample pairs of observations (x1i, x1j), …,
  (xni, xnj):

    rij = sij / (√sii √sjj).
46
In the univariate setting we typically have a random
variable X , say, from which we observe a random
sample. That is, we assume that X has some dis-
tribution that is partially or completely unknown
and we observe n independent copies X1 ; : : : ; Xn of
X . The observed values of these copies we typi-
cally denote with lower case letters: x1 ; : : : ; xn , and
we infer features of the true distribution from the
sample.
In the multivariate setting we are interested in the
true (multivariate) distribution of a random vector
x (p×1). Typically the random variables that make up
x, namely x1, …, xp, are not independent, and that
dependence (correlation) among x1, …, xp is the reason
for our interest. A random sample here consists of
n copies of x that we can arrange in an n×p data
matrix:

        ( x11 x12 … x1p )
    X = ( x21 x22 … x2p )
        (  ⋮   ⋮  ⋱  ⋮  )
        ( xn1 xn2 … xnp )

(I'm using lower case here, no longer distinguishing
between the random variable and its observed
value.)
47
Rows are assumed independent (n independent
copies).
Dependence is typically among columns.
The sample mean and variance can be computed
from the data matrix X as

    x̄ = (1/n) X′1

    S = 1/(n−1) X′(I − (1/n) 11′) X
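These two matrix formulas are easy to verify against direct computation. A numpy sketch on a small made-up data matrix (the notes use SAS; this is only a cross-check):

```python
import numpy as np

# x_bar = (1/n) X'1 and S = (1/(n-1)) X'(I - (1/n) 1 1')X,
# compared with np.mean and np.cov (which also uses divisor n-1).

X = np.array([[1.0, 2.0],
              [3.0, 5.0],
              [2.0, 4.0],
              [6.0, 1.0]])
n = X.shape[0]
one = np.ones((n, 1))

xbar = (X.T @ one / n).ravel()
S = X.T @ (np.eye(n) - one @ one.T / n) @ X / (n - 1)

print(np.allclose(xbar, X.mean(axis=0)))         # -> True
print(np.allclose(S, np.cov(X, rowvar=False)))   # -> True
```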
48
Sample variance-covariance matrices with quite
different patterns can have the same generalized
variance. E.g.,

    S1 = ( 5 4 )    S2 = (  5 −4 )    S3 = ( 3 0 )
         ( 4 5 ),        ( −4  5 ),        ( 0 3 )

have values of r12 equal to 0.8, −0.8 and 0.0,
respectively. However,

    |S1| = |S2| = |S3| = 9.
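A numpy check of this example, printing each matrix's correlation and its generalized variance:

```python
import numpy as np

# The three covariance matrices above: different correlation
# patterns, identical generalized variance |S| = 9.

S1 = np.array([[5.0, 4.0], [4.0, 5.0]])
S2 = np.array([[5.0, -4.0], [-4.0, 5.0]])
S3 = np.array([[3.0, 0.0], [0.0, 3.0]])

for S in (S1, S2, S3):
    r12 = S[0, 1] / np.sqrt(S[0, 0] * S[1, 1])
    print(round(r12, 1), round(np.linalg.det(S), 6))
# -> 0.8 9.0, then -0.8 9.0, then 0.0 9.0
```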
50
Example, Kite Data: Consider the bivariate data
of Table 5.11 in the text. A wildlife ecologist mea-
sured x1 = tail length (mm) and x2 = wing length
(mm) for a sample of n = 45 female hook-billed
kites. The handout consists of a printout of a SAS
program, kite1.sas, and the associated output, kite1.lst.
51
PROC PRINT prints out a listing of the data
on p. 1 of kite1.lst.
PROC CORR computes the sample correlation
matrix R for the variables on the VAR statement.
In addition, the COV option on the PROC
CORR line requests the sample covariance matrix
S and, by default, PROC CORR computes
summary statistics (x̄1, x̄2, etc.). All of these
results appear on p. 2 of kite1.lst.
PROC PLOT prints a scatter plot of winglen
versus taillen on p. 3 of kite1.lst.
The generalized sample variance is

    |S| = 120.69(208.54) − 122.35² = 10201.24

and we have

    |R| = 10201.24 / (120.69(208.54)) = 1.0(1.0) − 0.77² = 0.41.

The total variance is 120.69 + 208.54 = 329.24.
52
Suppose we decide that we are interested in the
size and shape of the birds in a way that is not
captured by tail length (x1) and wing length
(x2). Specifically, we might look at y1 = x1 + x2
(size) and y2 = x1 − x2 (shape).
To compute summary statistics on y = (y1, y2)′,
we can use our methods for linear combinations
of x = (x1, x2)′. Notice,

    y = Ax,

where

    A = ( 1  1 )
        ( 1 −1 ),

so

    ȳ = A x̄ = ( 1  1 ) ( 193.62 ) = (  473.40 )
              ( 1 −1 ) ( 279.78 )   (  −86.16 ).
53
In addition,

    Sy = A Sx A′

       = ( 1  1 ) ( 120.69 122.35 ) ( 1  1 )
         ( 1 −1 ) ( 122.35 208.54 ) ( 1 −1 )

       = ( 243.04  330.89 ) ( 1  1 )
         (  −1.65  −86.19 ) ( 1 −1 )

       = ( 573.93  −87.85 )
         ( −87.85   84.54 )
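The same computation in numpy, using the rounded values of Sx printed above (so the result matches the slide only to within rounding; the course's own computation is in SAS):

```python
import numpy as np

# Sy = A Sx A' for y1 = x1 + x2 (size) and y2 = x1 - x2 (shape),
# kite data, with Sx entries rounded to two decimals.

A = np.array([[1.0, 1.0], [1.0, -1.0]])
Sx = np.array([[120.69, 122.35], [122.35, 208.54]])

Sy = A @ Sx @ A.T
print(Sy.round(2))
# roughly [[573.93, -87.85], [-87.85, 84.53]] (84.54 on the slide,
# which carried more decimal places)
```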
54
The Multivariate Normal Distribution
(See chapter 4 of Johnson and Wichern.)
Recall the central role that the (univariate) nor-
mal distribution played in the development of
univariate statistical methods and why:
a. For many continuous random variables the
assumption of normality is appropriate.
b. numerous statistics, particularly those that
can be expressed as sums or averages, have
normal sampling distributions regardless of
the underlying population distribution.
c. the normal distribution is easy to work with
mathematically and many useful results are
available.
Items (a) and (b) above are due to the Central
Limit Theorem.
55
Recall:
57
The cumulative distribution function (c.d.f.) of
X is

    FX(x) = Pr(X ≤ x) = ∫_{−∞}^x fX(t) dt.

FX is not available in closed form but the standard
normal version (with μ = 0, σ² = 1) is
extensively tabled.
The multivariate normal distribution: for the
vector x, we write x ~ Np(μ, Σ) to signify that x is
a p-dimensional random vector with mean μ and
variance Σ. x has p.d.f.

    f(x) = 1 / ((2π)^{p/2} |Σ|^{1/2}) exp( −½ (x − μ)′ Σ⁻¹ (x − μ) ),

and c.d.f.

    F(x) = Pr(x1 ≤ x1, x2 ≤ x2, …, xp ≤ xp)
         = ∫_{−∞}^{xp} … ∫_{−∞}^{x1} f(t) dt.
58
We've replaced a univariate distance from x to μ
with a multivariate statistical distance,

    (x − μ)′ Σ⁻¹ (x − μ),

from x to μ.
It is now the volume under the p.d.f. which
represents the joint probability that x1, …, xp
fall within given intervals.
59
Contours of constant density for the p-dimensional
normal distribution are ellipsoids defined by x
such that

    (x − μ)′ Σ⁻¹ (x − μ) = c².

These ellipsoids are centered at μ and have axes
c √λi ei, for eigen-pairs (λi, ei) of Σ.
61
5. For x ~ Np(μ, Σ), all subsets of x are (multivariate)
normally distributed. If we partition x,
μ and Σ accordingly as

    x (p×1) = ( x1 )  (q×1)         μ (p×1) = ( μ1 )  (q×1)
              ( x2 )  ((p−q)×1)               ( μ2 )  ((p−q)×1)

    Σ (p×p) = ( Σ11 Σ12 )   Σ11: q×q,      Σ12: q×(p−q),
              ( Σ21 Σ22 )   Σ21: (p−q)×q,  Σ22: (p−q)×(p−q),

then x1 is distributed as Nq(μ1, Σ11).
62
6.
a. If x1 and x2 are independent q1- and q2-dimensional
vectors, respectively, then regardless
of the distributions of x1 and x2,
cov(x1, x2) = 0 (q1×q2).
b. If x1, x2 are as before and are each multivariate
normally distributed, then
cov(x1, x2) = 0 implies that x1 and x2 are
independent.
c. If x1 and x2 are independent with Nq1(μ1, Σ11)
and Nq2(μ2, Σ22) distributions, respectively,
then

    ( x1 ) ~ N_{q1+q2} ( ( μ1 ) , ( Σ11  0  ) )
    ( x2 )             ( ( μ2 )   (  0  Σ22 ) )

If we combine (multivariate) normally distributed
components into a single vector,
the result is not necessarily multivariate normal.
However, if the components are (multivariate)
normal and independent then the
result will be multivariate normal.
63
7. If x ~ Np(μ, Σ) where

    x = ( x1 )    μ = ( μ1 )    Σ = ( Σ11 Σ12 )
        ( x2 ),       ( μ2 ),       ( Σ21 Σ22 )

with |Σ22| > 0, then the conditional distribution
of x1 given x2 = x2 is (multivariate) normal
with

    Mean = μ1 + Σ12 Σ22⁻¹ (x2 − μ2)

and

    Covariance = Σ11 − Σ12 Σ22⁻¹ Σ21.

Notice that the conditional covariance does not
depend on the value of the conditioning variable.
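A tiny worked instance of these two formulas, with p = 2 and q = 1 (the numbers are mine, chosen for easy arithmetic, not from the text):

```python
# Conditional distribution of x1 given x2 = x2* for a bivariate
# normal with mu = (1, 2) and Sigma = [[4, 2], [2, 2]].

mu1, mu2 = 1.0, 2.0
S11, S12, S22 = 4.0, 2.0, 2.0

x2_obs = 3.0
cond_mean = mu1 + S12 / S22 * (x2_obs - mu2)   # mu1 + S12 S22^{-1} (x2 - mu2)
cond_var = S11 - S12 / S22 * S12               # S11 - S12 S22^{-1} S21

print(cond_mean, cond_var)   # -> 2.0 2.0
```

Note the conditional variance (2.0) is smaller than the marginal variance of x1 (4.0): conditioning on a correlated variable reduces uncertainty, and by an amount that does not depend on the observed x2.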
64
8. For x ~ Np(μ, Σ) with |Σ| > 0,

    (x − μ)′ Σ⁻¹ (x − μ) ~ χ²(p)

where χ²(p) denotes the chi-square distribution
with p degrees of freedom.
Proof: we can write the quadratic form as

    (x − μ)′ E Λ^{−1/2} Λ^{−1/2} E′ (x − μ) = Z′Z,

where Z = Λ^{−1/2} E′ (x − μ) is multivariate normal
with mean

    E(Z) = Λ^{−1/2} E′ E(x − μ) = 0

and variance

    var(Z) = Λ^{−1/2} E′ Σ E Λ^{−1/2}
           = Λ^{−1/2} Λ Λ^{−1/2} = I

so Z′Z is the sum of p squared independent
N(0, 1) r.v.s ⇒ Z′Z ~ χ²(p).
65
Multivariate Sampling Distributions:
In this course we will most often consider data that
are, or are assumed to be, a random sample from a
multivariate population. That is, we will have n independent
copies, x1, x2, …, xn, of a p-dimensional
random variable which we arrange in an n×p data
matrix:

        ( x11 x12 … x1p )   ( x1′ )
    X = ( x21 x22 … x2p ) = ( x2′ )
        (  ⋮   ⋮  ⋱  ⋮  )   (  ⋮  )
        ( xn1 xn2 … xnp )   ( xn′ )

Rows of X are independent.
Columns of X are dependent and this dependence
is of interest.
66
First we consider the general case where x1, …, xn
have an arbitrary (not necessarily normal) common
distribution with mean μ and variance Σ:
1. The sample mean x̄ is unbiased for μ; i.e.,

    E(x̄) = μ.

This is easily seen by writing x̄ as x̄ = (1/n) X′1.
It follows that

    E(x̄) = (1/n) E(X′) 1
         = (1/n) (μ, μ, …, μ) 1   (n copies of μ)
         = μ.
67
2. var(x̄) = (1/n) Σ. We can write x̄ as

                                    ( x1 )
    x̄ = ( (1/n)I, (1/n)I, …, (1/n)I ) ( x2 )   (n blocks)
                                    (  ⋮ )
                                    ( xn )

Therefore, var(x̄) equals

                                    ( Σ 0 … 0 ) ( (1/n)I )
    ( (1/n)I, (1/n)I, …, (1/n)I )   ( 0 Σ … 0 ) ( (1/n)I )
                                    ( ⋮ ⋮ ⋱ ⋮ ) (   ⋮    )
                                    ( 0 0 … Σ ) ( (1/n)I )

                                    ( (1/n)Σ )
    = ( (1/n)I, (1/n)I, …, (1/n)I ) (   ⋮    )
                                    ( (1/n)Σ )

    = n (1/n²) Σ = (1/n) Σ
68
3. The sample variance matrix S is unbiased for
Σ; i.e., E(S) = Σ. Proof:

    E(S) = 1/(n−1) Σ_{i=1}^n E[(xi − x̄)(xi − x̄)′]
         = 1/(n−1) [ Σ_{i=1}^n E(xi xi′) − n E(x̄ x̄′) ].

As a general result,

    Σy = E[(y − μy)(y − μy)′]
       = E(yy′) − μy μy′
    ⇒ E(yy′) = Σy + μy μy′.

So,

    E(S) = 1/(n−1) [ n(Σ + μμ′) − n( (1/n)Σ + μμ′ ) ] = Σ.
69
Large Sample Results (general case):
1. x̄ →p μ, S →p Σ.
2. √n (x̄ − μ) →d Np(0, Σ).
3. Combining (1) and (2) we get

    √n S^{−1/2} (x̄ − μ) →d Np(0, I).

4. (3) implies

    n (x̄ − μ)′ S⁻¹ (x̄ − μ) →d χ²(p).
71
Properties of the Wishart Distribution:
1. Wp(Σ, m) is the distribution of Σ_{i=1}^m zi zi′ where
z1, …, zm iid Np(0, Σ), which implies that

    Σ_{i=1}^n (xi − x̄)(xi − x̄)′ = (n − 1) S ~ Wp(Σ, n − 1)

2. If Ai ~ind Wp(Σ, mi) then

    Σ_i Ai ~ Wp( Σ, Σ_i mi ).
72
Assessment of Normality:
We have seen that x (p×1) is multivariate normal
if and only if every linear combination of
x1, …, xp is normal. Therefore, one check is
whether selected linear combinations are (univariate)
normal.
As a special case of the above we can check
whether each component of x is (univariate)
normal.
Other selected linear combinations of x1, …, xp
worth checking are ê1′x and êp′x, where ê1 and
êp are the eigenvectors associated with λ̂1 and
λ̂p, respectively, the largest and smallest eigenvalues
of S.
73
How do we assess univariate normality?
One method is the quantile-quantile (Q-Q) plot (also
known as a probability plot).
Idea is to plot the sample quantiles (percentiles)
for the variable in question versus the quan-
tiles one would expect to observe assuming ex-
act normality.
If the data are normal, plot should follow a
straight line. Departures from a straight line
may indicate the type of non-normality (e.g.,
heavy tails, skewness).
Q-Q plots can be misleading for small sample
sizes (n < 20, say). Small samples of truly nor-
mal data can look crooked.
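The quantile pairs behind a normal Q-Q plot are simple to compute by hand. A stdlib-Python sketch (the data are made up; the (j − 0.5)/n plotting position is one common convention, and others exist):

```python
from statistics import NormalDist

# Sorted data versus standard normal quantiles Phi^{-1}((j - 0.5)/n).
# Near-normal data should give roughly a straight line of points.

data = [49.2, 50.1, 49.7, 50.4, 49.5, 50.0, 49.9, 50.6]
n = len(data)
sample_q = sorted(data)
theo_q = [NormalDist().inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]

for s, t in zip(sample_q, theo_q):
    print(round(t, 3), s)
```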
74
Example:
75
A formal test for normality is based on rq, the
correlation coefficient of the observed and expected
quantiles. The hypothesis of normality
is rejected for values of rq below the tabulated
value in Table 4.2 on p. 193 of the text.
The test statistic based on rq is approximately
equivalent to the Shapiro-Wilk test computed
by PROC UNIVARIATE and, in general, these
two tests give very similar results. Either may
be used.
Example { Sparrow Data:
Recall: x1 = total length, x2 = alar extent, x3 =
length of beak and head, x4 = length of humerus,
x5 = length of keel of sternum.
See sparrow2.sas and sparrow2.lst. Here we just
consider x1, x2, x3 to keep the size of the example
manageable. In addition, we consider only
the birds who died (birds 1–21).
76
Notice that the Q-Q plots produced by PROC
UNIVARIATE have such small scales as to make
them practically worthless.
The Shapiro-Wilk tests for x1–x3 all give p-values
> .05, so there is not strong evidence in any of
the individual variables to indicate non-multivariate
normality.
E.g., the Shapiro-Wilk test statistic for x1, the
total length of the bird, is W = 0.934 with
p-value 0.165 (see p. 1 of sparrow2.lst).
However, we also should check the univariate
normality of ê1′x and ê3′x. ê1′x and ê3′x are
computed in PROC IML and printed on p. 9.
The S-W test for ê1′x is given by PROC UNIVARIATE
on p. 11 as W = 0.861 with p-value
0.0067. Therefore, we reject the null hypothesis
of normality for ê1′x at the 0.05 level ⇒ we reject
the hypothesis that x is trivariate normal.
77
As an alternative to the Shapiro-Wilk test, rq =
0.920 is computed on p. 16. Since 0.920 falls
below the critical value for α = 0.05 and a sample
size of 20 (0.9269 from Table 4.2 in the text) we
reject the null hypothesis of normality.
The tests agree.
Note that the procedure here deviates from the
usual one, by which we reject when the test statistic
exceeds the critical value.
Notice that as an alternative to letting PROC
UNIVARIATE automatically produce the Q-Q
plot, it is fairly easy to compute the quantiles to
be plotted in a Q-Q plot and then to plot these
quantiles with PROC PLOT. This I have done
for both ê1′x and ê3′x to produce the plots on
p. 15 and p. 17. These are obviously much easier
to read than the corresponding plots produced
by PROC UNIVARIATE on pp. 12 and 14,
respectively.
78
A graphical procedure to examine multivariate nor-
mality more directly is the Chi-square plot or
gamma plot.
When the sample size is moderate to large, the
squared generalized distances

    dj² = (xj − x̄)′ S⁻¹ (xj − x̄),   j = 1, …, n,

should be approximately distributed as χ²(p)
random variables.
The chi-square plot is a Q-Q plot of d1², …, dn²
versus the quantiles one would expect to observe
based on the χ²(p) distribution.
A straight line shape indicates multivariate nor-
mality.
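The distances dj² are straightforward to compute. A numpy sketch on simulated data (not the sparrow data), including one useful sanity check: whatever the data, the dj² always sum to exactly (n − 1)p, since Σ dj² = tr(S⁻¹ Σ(xj − x̄)(xj − x̄)′) = tr(S⁻¹ (n − 1)S) = (n − 1)p:

```python
import numpy as np

# Squared generalized distances d_j^2 = (x_j - x_bar)' S^{-1} (x_j - x_bar)
# for a chi-square plot.

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))          # n = 20, p = 3, simulated
n, p = X.shape

xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.array([(x - xbar) @ S_inv @ (x - xbar) for x in X])

print(np.isclose(d2.sum(), (n - 1) * p))  # sanity check -> True
```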
79
Sparrow Example (Continued):
The Q-Q plot for d² appears on p. 4 of sparrow3.lst.
Except for one point, this plot looks pretty close
to a straight line. The unusual point, or outlier,
is bird #8. This outlier was also seen in the Q-Q
plot of ê1′x and seems to be the main cause
of non-multivariate normality in these data.
It would be interesting to eliminate bird #8 and
then repeat our examination of the multivariate
normality of this data set. Perhaps bird #8's
data was incorrectly recorded, or perhaps bird
#8 happened to die for some reason other than
exposure to the severe storm.
If a good reason can be found that this bird's
data do not belong in the data set, then this
outlier should be removed. Otherwise, a good
solution would be to analyze the data both with
and without the outlier and to report both sets
of results.
80
Suggested Procedure:
1. Examine Q-Q plots for all univariate components
of x.
2. Examine Q-Q plots for ê₁′x and êₚ′x.
3. Examine the Q-Q plot for the d²ⱼ's.
While one bad plot is enough to invalidate the
multivariate normality assumption, good-looking
plots for all the examined linear combinations are
not enough to confirm (with certainty) the
normality assumption.
However, we can only go so far in trying to
confirm multivariate normality. If all plots look
good, we will consider the multivariate normality
assumption to be satisfied.
Other "goodness of fit" tests of normality exist
(chi-square, Kolmogorov-Smirnov), but these suffer
from serious drawbacks: they have low power to
detect nonnormality in small samples and are
oversensitive in large samples.
Handling Non-normality - Transformations
Basic idea: if x is not normal, find a function of
x (square root, logarithm, etc.) that is.
Most of the useful transformations are contained
in the family of power transformations:

    f_λ(x) = x^λ,     if λ ≠ 0,
    f_λ(x) = log(x),  if λ = 0.
Power transformations are defined for
positive-valued random variables. However, they
can be applied to variables with negative values
by first adding a large value to each observation
(to make all values positive) before taking the
transformation.
For certain special types of data, specific
transformations are available that often work well:

    Original Scale      Transformed Scale
    Counts, y           √y
    Proportions, p̂      logit(p̂) = ½ log(p̂/(1 − p̂))
    Correlations, r     Fisher's z(r) = ½ log((1 + r)/(1 − r))
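These special-case transformations are one-liners to apply in practice; a small sketch (note the logit here carries the ½ factor, as in the table):

```python
import numpy as np

def sqrt_counts(y):
    """Variance-stabilizing square-root transform for counts."""
    return np.sqrt(np.asarray(y, dtype=float))

def logit_half(p_hat):
    """(1/2) log(p-hat / (1 - p-hat)) for proportions in (0, 1)."""
    p_hat = np.asarray(p_hat, dtype=float)
    return 0.5 * np.log(p_hat / (1.0 - p_hat))

def fisher_z(r):
    """Fisher's z(r) = (1/2) log((1 + r) / (1 - r)) for correlations in (-1, 1)."""
    r = np.asarray(r, dtype=float)
    return 0.5 * np.log((1.0 + r) / (1.0 - r))
```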
Inference about a Mean Vector
(See chapter 5 of Johnson and Wichern.)
Finally, some data analysis!
Recall the univariate situation: t tests and Z
tests for a single mean.
The simplest scenario is one in which we want to
know whether the true mean of a population, μ, is
equal to some specified value, μ₀. We frame the
problem as a statistical test of competing
hypotheses:

    H₀: μ = μ₀ versus H₁: μ ≠ μ₀,

where μ is the mean of the population from which
we have drawn a sample x₁, …, xₙ iid N(μ, σ²).
σ² known: If the population variance is known, then

    Z = (x̄ − μ₀)/(σ/√n) ~ N(0, 1),

where

    x̄ = (1/n) Σᵢ xᵢ  (sample mean).
σ² unknown: (usual case) If the population
variance is unknown, we have to estimate it with
the sample variance s², and the substitution of s
for σ in our test statistic changes its
distribution (the test statistic is more variable
now because of the error in estimating σ²):

    t = (x̄ − μ₀)/(s/√n) ~ t(n − 1),  the t dist'n with n − 1 d.f.,

where

    s² = [1/(n − 1)] Σᵢ (xᵢ − x̄)²  (sample variance).

We reject H₀ in favor of H₁ at the α level if the
observed |t| exceeds a critical value:

    Reject H₀ if |t| > t_{α/2}(n − 1).
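This univariate test takes a few lines to code directly; a sketch (the simulated data are illustrative, not from the notes, and scipy's ttest_1samp computes the same statistic):

```python
import numpy as np
from scipy import stats

def one_sample_t(x, mu0, alpha=0.05):
    """Two-sided one-sample t test of H0: mu = mu0.
    Returns (t statistic, critical value t_{alpha/2}(n-1), reject flag)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))  # s uses the n-1 divisor
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    return t, t_crit, abs(t) > t_crit

# illustrative data
rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, size=30)
t, t_crit, reject = one_sample_t(x, 0.0)
```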
Extension to Multivariate Case:
From an Nₚ(μ, Σ) population we draw a sample

    x₁, …, xₙ iid Nₚ(μ, Σ).

The corresponding hypotheses in the multivariate
case are

    H₀: μ = μ₀ versus H₁: μ ≠ μ₀,

where

    μ = (μ₁, …, μₚ)′ and μ₀ = (μ₁₀, …, μₚ₀)′.

Note that H₀ requires all component means to be
equal to the specified values μ₁₀, …, μₚ₀, and H₁
holds if any (one or more) μᵢ ≠ μᵢ₀.
Test statistic (Hotelling's T²):

    T² = n (x̄ − μ₀)′ S⁻¹ (x̄ − μ₀).
So, since

    t² = [√n (x̄ − μ₀)] (s²)⁻¹ [√n (x̄ − μ₀)]

is of the form

    [N(0,1)] [χ²(d.f.₂)/d.f.₂]⁻¹ [N(0,1)]
        = [χ²(1)/1] / [χ²(d.f.₂)/d.f.₂],

it follows that t² ~ F(1, d.f.₂).
In the multivariate case,

    T² = [√n (x̄ − μ₀)]′ S⁻¹ [√n (x̄ − μ₀)]

is of the form

    (Nₚ)′ (Wₚ/d.f.)⁻¹ (Nₚ).

Since the multivariate normal is a generalization
of the normal and the Wishart is a generalization
of the χ², it is not very surprising that T² turns
out to have a (scaled) F distribution:

    T² ~ [(n − 1)p/(n − p)] F(p, n − p).
We reject H₀ in favor of H₁ if

    T² > [(n − 1)p/(n − p)] F_α(p, n − p),

where F_α(d.f.₁, d.f.₂) is the upper (100α)th
percentile of the F(d.f.₁, d.f.₂) distribution.
Notice that T² is n times the squared statistical
distance between x̄ and μ₀.
Example - Kite Data:
p = 2 variables (x₁ = wing length, x₂ = tail
length) measured on a sample of n = 45 birds.
Suppose we want to test

    H₀: μ = (190, 275)′ versus H₁: μ ≠ (190, 275)′.
Recall

    x̄ = [193.62]    S = [120.69  122.35]
        [279.78]        [122.35  208.54]

so S⁻¹ equals

    1/[120.69(208.54) − (122.35)²] × [ 208.54  −122.35]
                                     [−122.35   120.69].

So

    T² = 45 (193.62 − 190, 279.78 − 275) (1/10201)
             × [ 208.54  −122.35] (193.62 − 190)
               [−122.35   120.69] (279.78 − 275)
       = 5.543.

For a significance level of α = 0.05, the critical
value is

    [(n − 1)p/(n − p)] F_α(p, n − p) = [2(44)/43] F_{.05}(2, 43)
                                     = (2.047)(3.214) = 6.578,

so we do not reject H₀ and we conclude that
μ₀ = (190, 275)′ is plausible.
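The arithmetic above is easy to check numerically. A sketch reproducing the kite calculation from the summary statistics quoted in the notes (T² comes out ≈ 5.54, matching 5.543 up to rounding):

```python
import numpy as np
from scipy import stats

p, n = 2, 45
xbar = np.array([193.62, 279.78])  # sample mean vector from the notes
mu0 = np.array([190.0, 275.0])     # hypothesized mean
S = np.array([[120.69, 122.35],
              [122.35, 208.54]])   # sample covariance from the notes

diff = xbar - mu0
# T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0)
T2 = n * diff @ np.linalg.solve(S, diff)

# critical value (n-1)p/(n-p) * F_{.05}(p, n-p)
crit = (n - 1) * p / (n - p) * stats.f.ppf(0.95, p, n - p)
reject = T2 > crit  # False here: do not reject H0
```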
Hotelling's T² test is invariant under
transformations of the form

    y(p×1) = C(p×p) x(p×1) + d(p×1),

where C is nonsingular.
Kite Example (Continued):
So, for example, if we wanted to test the
corresponding hypothesis on size (y₁ = x₁ + x₂)
and shape (y₂ = x₁ − x₂) as we defined them before:

    y = [1   1] x,
        [1  −1]

we would test

    H₀: μ_y = (190 + 275, 190 − 275)′ = (465, −85)′.

For this problem, T² = 5.543 as before with the
same critical value, 6.578, so we do not reject H₀.
Note that

    det [1   1] = −2 ≠ 0  (nonsingular transformation).
        [1  −1]
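This invariance is easy to verify numerically: transform the data by any nonsingular C (plus a shift d), transform the hypothesized mean the same way, and T² is unchanged. A sketch on simulated data (illustrative, not the actual kite measurements):

```python
import numpy as np

def hotelling_T2(X, mu0):
    """Hotelling's T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    diff = X.mean(axis=0) - mu0
    S = np.cov(X, rowvar=False)
    return n * diff @ np.linalg.solve(S, diff)

rng = np.random.default_rng(1)
X = rng.normal(size=(45, 2)) + [190, 275]  # simulated stand-in for the kite data
mu0 = np.array([190.0, 275.0])

C = np.array([[1.0, 1.0], [1.0, -1.0]])  # size/shape transform; det = -2, nonsingular
d = np.zeros(2)
Y = X @ C.T + d                          # y_j = C x_j + d for every observation
T2_x = hotelling_T2(X, mu0)
T2_y = hotelling_T2(Y, C @ mu0 + d)      # identical up to floating-point error
```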
Why not just do p separate (univariate) t tests
for each component of x?
Separate tests do not take into account the
interrelationships (correlations) among the
components of x.
The multivariate test controls the type I error
rate α, the probability of rejecting H₀ when H₀ is
true.
- When doing multiple univariate tests, the
probability that at least one univariate null
hypothesis will be rejected by chance alone
increases with the number of tests being
performed.
- Procedures to account for this problem exist
(e.g., Bonferroni), but generally such procedures
are not as attractive as performing a single
multivariate test.
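The inflation of the error rate can be quantified: if the p univariate tests were independent, each at level α, the chance of at least one spurious rejection is 1 − (1 − α)^p. A quick illustration:

```python
def familywise_error(alpha, p):
    """P(at least one type I error) for p independent level-alpha tests."""
    return 1 - (1 - alpha) ** p

# Five independent tests at alpha = 0.05 already reject something
# spurious more than 22% of the time.
rate = familywise_error(0.05, 5)
```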
T² as a LRT:
So far the use of T² as a test statistic has been
motivated by analogy to the univariate case.
Could we do better? (Is T² optimal, e.g., in terms
of power?)
It turns out that T² does have several optimality
properties, because T² can also be derived as a
likelihood ratio test (LRT). The theory of LRTs
(don't worry) establishes a number of optimality
properties that T² inherits.
Likelihood Ratio Tests:
The likelihood function is just the density
function, but thought of as a function of the
parameters rather than of the data.
For example, consider the univariate normal
distribution based on a sample x₁, …, xₙ:
- We think of the density

    f(x₁, …, xₙ) = (2πσ²)^{−n/2} exp[−Σᵢ (xᵢ − μ)²/(2σ²)]

as a function of the data (the xᵢ's) for given
values of the parameters (μ, σ²), giving, in some
sense, the probability of observing x₁, …, xₙ for
given μ, σ². Hence, the argument of f is
x₁, …, xₙ.
- Actually, the density function involves both the
data and the parameters.
- Once we have the data, we want to do inference
concerning the parameters. Therefore, we can think
of the density as a function of the parameters
(μ, σ²) given the data (the xᵢ's).
- To avoid confusion when we think of the density
this way, we give it a new name, the likelihood
function, and write it as a function of the
parameters:

    L(μ, σ²) = (2πσ²)^{−n/2} exp[−Σᵢ (xᵢ − μ)²/(2σ²)].
The maximum likelihood approach to estimation says
to estimate the parameters with those values that
maximize the likelihood.
- For the univariate normal distribution, with
some calculus we can establish that the maximizers
of L(μ, σ²) (the maximum likelihood estimates, or
MLEs) are

    μ̂ = x̄ = (1/n) Σᵢ xᵢ,

    σ̂² = [(n − 1)/n] s² = (1/n) Σᵢ (xᵢ − x̄)².
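These closed forms are easy to verify empirically: the log-likelihood evaluated at (x̄, ((n − 1)/n)s²) beats nearby parameter values. A sketch on simulated data (illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=200)
n = len(x)

def loglik(mu, sigma2):
    """Log-likelihood of an iid N(mu, sigma2) sample."""
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

mu_hat = x.mean()                        # MLE of mu: the sample mean
sigma2_hat = np.mean((x - mu_hat) ** 2)  # MLE of sigma^2: (n-1)/n times s^2
```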
- For the multivariate normal, again some calculus
leads to the MLEs:

    μ̂ = x̄ = (1/n) Σᵢ xᵢ,

    Σ̂ = [(n − 1)/n] S = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)′.
The constrained ML is the maximum of the
likelihood function when we assume H₀ is true;
that is, when we assume that μ = μ₀.
- Thus the constrained ML is the likelihood
evaluated at the constrained MLE, the estimator of
Σ that assumes μ = μ₀.
- The constrained MLEs are μ₀ and the Σ that
maximizes L(μ₀, Σ), which turns out (calculus) to
be

    Σ̂₀ = (1/n) Σᵢ (xᵢ − μ₀)(xᵢ − μ₀)′.
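The constrained MLE is computed just like Σ̂ but with deviations taken about μ₀ rather than x̄; the two are linked by Σ̂₀ = Σ̂ + (x̄ − μ₀)(x̄ − μ₀)′. A sketch:

```python
import numpy as np

def constrained_sigma_hat(X, mu0):
    """Constrained MLE of Sigma under H0: mu = mu0, i.e.
    (1/n) sum_i (x_i - mu0)(x_i - mu0)'."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    D = X - mu0          # deviations about mu0, not about xbar
    return D.T @ D / n
```

Compared with the unconstrained Σ̂, the extra rank-one term (x̄ − μ₀)(x̄ − μ₀)′ grows as x̄ moves away from μ₀, which is what drives the likelihood ratio down under a false H₀.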