Lec1 (1) .Ps
Lecture Notes
Introduction
1
Regression:
Simple: explain variability in Y from X.
Multiple: explain variability in Y from X1, X2, …, Xr.
Anova:
One-way: explain variability in Y based on X,
where X is a factor.
These examples are not usually considered to be
multivariate analyses.
Why?
We distinguish between the Response Variable (Y)
and Explanatory Variables (X's), and in the analysis
we treat the X's as fixed.
⇒ only 1 random variable in each situation.
(multiple regression ≠ multivariate regression)
2
One type of analysis that is a good example of a
multivariate method is correlation analysis.
Recall:
Correlation Analysis: measure linear association
between X, Y without distinguishing response
versus explanatory.
Other examples:
M'variate 1 & 2 Samp. Inference: Hypothesis tests
and confidence intervals for the mean vector(s)
from one (or two) group(s). Procedures analogous
to t and Z in the univariate setting.
MANOVA: Acronym for multivariate analysis of
variance. Inference for mean vectors from sev-
eral populations.
Multivariate Linear Regression: Explain variability
in response variables Y1, Y2, …, Ym based on
explanatory variables X1, …, Xr.
3
Principal Components Analysis: Reduce the dimension
of a data set by examining a small number
of linear combinations of the original variables
that explain most of the variability among the
original variables.
Factor Analysis: Similar to P.C.A. Describe the
variance/covariance relationships among many
variables in terms of a few underlying, but
unobservable, random quantities called factors.
Canonical Correlation Analysis: Determine the lin-
ear association between two sets of random vari-
ables. Find a linear combination of several X 's
and a linear combination of several Y 's such
that the two linear combinations have maxi-
mum correlation.
Classification and Discrimination: Methods for
separating distinct sets of objects (experimental
units) and for allocating (classifying) objects
to previously defined groups.
4
Cluster Analysis: Methods for grouping objects
based on distances (which objects are most similar
to which others). Groups are not predefined
and no assumptions are made concerning the
number of groups or the group structure.
Others: Multidimensional Scaling, Repeated Mea-
sures Analyses, Path Analysis, Econometric
Methods.
5
Variable- vs. Individual-Directed Methods
Variable-directed Techniques: primarily concerned
with relationships among the response variables
being measured. E.g., PCA, FA, M'var. Reg.,
Can. Corr.
Individual-directed Techniques: primarily concerned
with relationships among the experimental
units or individuals being measured.
E.g., Discrim. Anal., Clustering, MANOVA.
Some methods can be used as either variable-
directed or individual-directed. E.g., clustering
can be done to group individuals or to group
variables.
Multivariate methods also differ with respect to
how informal or exploratory their typical uses
are. E.g., cluster analysis and P.C.A. are often
preliminary or exploratory tools used as part of
a larger analysis.
6
Theory vs. Methods:
(1) problem: estimate population mean μ
    (given x1, …, xn)

    model: X1, …, Xn iid N(μ, σ²),
           μ, σ² unknown

    theory: X̄ ~ N(μ, σ²/n),
            (n − 1)S²/σ² ~ χ²(n − 1),
            X̄, S² independent.
            ⇒ √n(X̄ − μ)/S ~ t(n − 1)

    method: x̄ ± t_{1−α/2}(n − 1) s/√n forms a
            100(1 − α)% C.I. for μ

    application: n = 20, α = 0.10
                 x̄ = 49.70, s = 0.32
                 t_{.95}(19) = 1.729
                 90% CI = 49.70 ± 1.729(0.32/√20)
                        = (49.58, 49.82)
(2) (problem → method → application) is not sufficient.
7
(3) validity of method depends on the validity of
the model. Is the model valid? (need to know
something about iid N(μ, σ²)) If not valid, how
is the method affected? Alternatives?
(4) How do we interpret the results of our application?
(need to understand what the model means)
8
Preliminaries
11
Addition: A = (aij) added to B = (bij) is given by

    A + B = (aij + bij)

A and B must have the same dimension.
E.g., for A as before and B given by

        ( 1 2 )                    ( 3 1 )
    B = ( 3 1 )   we get   A + B = ( 2 1 )
        ( 0 2 )                    ( 2 1 )
12
If A (m×n) = (aij) and B (n×p) = (bij) are conformable
then the m×p matrix product AB has (i, j)th element

    Σ_{k=1}^n aik bkj.
13
Special cases:
1. Matrix times a vector: for A (n×p) and x (p×1),

         ( a11 a12 … a1p ) ( x1 )
    Ax = ( a21 a22 … a2p ) ( x2 )
         (  ⋮   ⋮  ⋱  ⋮  ) (  ⋮ )
         ( an1 an2 … anp ) ( xp )

         ( a11 x1 + a12 x2 + … + a1p xp )
       = ( a21 x1 + a22 x2 + … + a2p xp )   (n×1)
         (               ⋮              )
         ( an1 x1 + an2 x2 + … + anp xp )

       = x1 a1 + x2 a2 + … + xp ap = Σ_{i=1}^p xi ai,

a linear combination of the columns of A.
14
2. Dot product or inner product:

                         ( y1 )
    x′y = ( x1 x2 … xp ) ( y2 ) = Σ_{i=1}^p xi yi
                         (  ⋮ )
                         ( yp )

Sometimes written x·y or <x, y>. Note:
x′x = Σ_{i=1}^p xi² (sum of squared elements).

3. Column vector times row vector:

          ( x1 )
    xy′ = ( x2 ) ( y1 y2 … yp ) = (xi yj)   (p×p)
          (  ⋮ )
          ( xp )

3.a. Vector times its transpose:

          ( x1 )
    xx′ = ( x2 ) ( x1 x2 … xp ) = (xi xj)   (p×p)
          (  ⋮ )
          ( xp )
15
Properties:
1. A + B = B + A.
2. (A + B) + C = A + (B + C).
3. c(A + B) = cA + cB.
4. (c + d)A = cA + dA.
5. (cd)A = c(dA).
6. c(AB) = (cA)B.
7. A(BC) = (AB)C.
8. A(B + C) = AB + AC.
9. (A + B)C = AC + BC.
10. (A + B)′ = A′ + B′.
11. (cA)′ = cA′.
12. (AB)′ = B′A′.
13. AB ≠ BA in general. E.g.,

    A = ( 1 1 )     B = ( 0 1 )
        ( 0 2 ),        ( 1 2 )

    AB = ( 1 3 )     BA = ( 0 2 )
         ( 2 4 ),         ( 1 5 )

BA may not be defined; if it is, AB and BA
may not have the same dimension; if dim(AB) =
dim(BA), that doesn't imply that AB = BA.
16
Geometrical Concepts:
Vector: a line segment from the origin (0, all coordinates
equal to zero) to the point indicated
by the coordinates of the (algebraic) vector.

              ( 1 )
    e.g., x = ( 3 )
              ( 2 )
Vector Addition:
17
Scalar Multiplication: Multiplication by a scalar
scales a vector by shrinking or extending the
vector in the same direction.
18
Orthogonal (Perpendicular) Vectors: Two
vectors x and y are orthogonal to one another if
and only if x′y = 0 (⟺ cos(θ) = 0 ⟺ θ = π/2).
19
Euclidean Distance: Since a vector x has the interpretation
of a directed line segment from the
origin (0) to the point (x1, x2, …, xp), the Euclidean
(straight-line) distance between 0 and x
is clearly the length of the vector, Lx:

    d(x, 0) = Lx = √(x1² + … + xp²).

More generally, the Euclidean distance between
two arbitrary points x and y is

    d(x, y) = L_{x−y} = √((x1 − y1)² + … + (xp − yp)²).

There are many ways to define distance. Euclidean
distance is just one (familiar) example.
Properties of a distance measure d:
    d(x, y) = d(y, x)
    d(x, y) ≥ 0, with equality when x = y
    d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality)
We'll consider another distance measure, sta-
tistical distance, later.
20
More Matrix Functions and Definitions:
Determinant: The determinant of a square matrix
A (same number of rows as columns) with
dimension k×k is the scalar denoted as |A|
given by

    |A| = a11                               if k = 1
    |A| = Σ_{j=1}^k a1j |A1j| (−1)^{1+j}    if k > 1,

where A1j is the (k−1)×(k−1) matrix obtained
by deleting the first row and the jth column
of A. Note that |A| can be defined as
Σ_{j=1}^k aij |Aij| (−1)^{i+j} using the ith row in place
of the first.
Special Cases:
k = 2:

    | a11 a12 |
    | a21 a22 | = a11 a22 (−1)² + a12 a21 (−1)³ = a11 a22 − a12 a21

E.g.,

    |  3 1 |
    | −2 4 | = 12 − (−2) = 14
21
k = 3:

    | a11 a12 a13 |
    | a21 a22 a23 | = +a11 | a22 a23 | − a12 | a21 a23 | + a13 | a21 a22 |
    | a31 a32 a33 |        | a32 a33 |       | a31 a33 |        | a31 a32 |

    = a11(a22 a33 − a23 a32)
    − a12(a21 a33 − a23 a31)
    + a13(a21 a32 − a22 a31)
Properties:
1. |cA| = c^k |A|, for A (k×k).
2. |A′| = |A|.
3. |AB| = |A||B|.
4. For a diagonal matrix D (k×k) of the form

        ( d11  0  …  0  )
    D = (  0  d22 …  0  )
        (  ⋮   ⋮  ⋱  ⋮  )
        (  0   0  … dkk ),

    |D| = d11 d22 … dkk.
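The recursive cofactor definition above translates directly into code. A minimal Python sketch (the function name `det` is mine; real software computes determinants via factorizations, not this expensive recursion):

```python
# Cofactor expansion of the determinant, following the definition
# |A| = sum_j a_1j |A_1j| (-1)^(1+j) above.  Illustrative only.

def det(A):
    k = len(A)
    if k == 1:
        return A[0][0]
    total = 0.0
    for j in range(k):
        # A_1j: delete the first row and the (j+1)th column
        minor = [row[:j] + row[j + 1:] for row in A[1:]]
        total += A[0][j] * det(minor) * (-1) ** (1 + (j + 1))
    return total

print(det([[3, 1], [-2, 4]]))                   # the 2x2 example: -> 14.0
print(det([[2, 0, 0], [0, 3, 0], [0, 0, 4]]))   # diagonal: product of entries
```

The diagonal case reproduces Property 4: the determinant of a diagonal matrix is the product of its diagonal elements.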
22
Linear Dependence: A set of vectors x1 ; : : : ; xk
is said to be linearly dependent if there exist
constants c1 ; : : : ; ck not all zero such that
c1 x1 + c2 x2 + … + ck xk = 0
Recall collinearity/multicollinearity complications
occurred in regression when the explanatory vari-
ables were linearly dependent; that is, when the
columns of the X matrix were linearly depen-
dent.
A set of vectors which is not linearly dependent
is said to be linearly independent.
jAj = 0 if and only if there exists a nonzero vector x
such that Ax = 0; that is, if and only if the columns
of A are linearly dependent.
23
Identity Matrix: A matrix with ones on the diagonal
and zeros elsewhere is called an identity
matrix and is denoted by I (k×k) or I when the
dimension is clear from the context. E.g.,

           ( 1 0 0 )
    I3×3 = ( 0 1 0 ).
           ( 0 0 1 )

I is the "identity" matrix because

    AI = IA = A
Matrix Inverse: Matrix division is defined as multiplication
by the matrix inverse. Just as a/b
can be defined as ab⁻¹ for scalars a and b, we
will think of A divided by B as AB⁻¹ where
B⁻¹ is that matrix such that

    BB⁻¹ = I

(just as b⁻¹ = 1/b is that scalar such that bb⁻¹ = 1).
The matrix inverse is only defined for square
matrices that have nonzero determinants.
If A (square) has |A| = 0 then its columns are
linearly dependent and it has no inverse. Such
a matrix is called singular.
24
Computation of Matrix Inverse: A⁻¹ has (i, j)th
element (−1)^{i+j} |Aji| / |A|, where Aji is the matrix
obtained by deleting the jth row and ith
column of A. E.g.,

    ( a b )⁻¹      1      (  d −b )
    ( c d )   = ad − bc   ( −c  a )

We will rely on the computer to compute inverses
for matrices with dimension > 2.
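As a quick check of the 2×2 formula, a small Python sketch (the helper name `inv2` is mine, not from the notes):

```python
# The 2x2 inverse: (a b; c d)^{-1} = 1/(ad - bc) (d -b; -c a).
# Assumes a nonzero determinant; otherwise the matrix is singular.

def inv2(a, b, c, d):
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular matrix: |A| = 0, no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[3, 1], [-2, 4]]
Ainv = inv2(3, 1, -2, 4)
# first row of A A^{-1}, computed by hand -- should be (1, 0):
row0 = [A[0][0] * Ainv[0][0] + A[0][1] * Ainv[1][0],
        A[0][0] * Ainv[0][1] + A[0][1] * Ainv[1][1]]
print(row0)   # approximately [1.0, 0.0]
```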
Properties:
1. (A⁻¹)′ = (A′)⁻¹.
2. (AB)⁻¹ = B⁻¹A⁻¹.
3. |A⁻¹| = 1/|A|.
25
Eigenvalues and Eigenvectors: A square k×k
matrix A is said to have an eigenvalue λ, with
a corresponding eigenvector x ≠ 0, if

    Ax = λx                (1)

Since (1) can be written

    (A − λI)x = 0

for some nonzero vector x, the eigenvalues of A
are the solutions of

    |A − λI| = 0

If we look at how a determinant is calculated
it is not difficult to see that |A − λI| will be a
polynomial in λ of order k, so there will be k
(not necessarily real) eigenvalues (solutions).
1. If |A| = 0 then Ax = 0 for some nonzero x.
That is, if the columns of A are linearly dependent
then λ = 0 is an eigenvalue of A.
26
2. The eigenvector associated with a particular eigenvalue
is unique only up to a scale factor. That
is, if Ax = λx, then A(cx) = λ(cx), so x and
cx are both eigenvectors of A corresponding to
λ. We typically normalize eigenvectors to have
length 1 (choose the x that has the property
x′x = 1).
3. If λi and λj are distinct eigenvalues of the symmetric
matrix A then their associated eigenvectors
xi, xj are orthogonal.
27
4. Computation: By computer typically, but for
the 2×2 case we have

    A = ( a b )   ⇒   A − λI = ( a−λ   b  )
        ( c d )                (  c   d−λ )

    |A − λI| = 0 ⇒
    (a − λ)(d − λ) − bc = λ² − (a + d)λ + (ad − bc) = 0

    ⇒ λ = [(a + d) ± √((a + d)² − 4(ad − bc))] / 2

To obtain the associated eigenvector solve Ax =
λx. There are infinitely many solutions for x, so
choose one by setting x1 = 1, say, and solving for
x2. Normalize to have length one by computing
Lx⁻¹ x.
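The 2×2 recipe above can be sketched in a few lines of Python (the helper `eig2` is mine; it assumes a symmetric matrix with b ≠ 0 so the first-row equation determines x2):

```python
import math

# Eigenvalues from the quadratic formula
# lambda = [(a+d) +/- sqrt((a+d)^2 - 4(ad-bc))]/2, then an eigenvector
# with x1 = 1 solved from (a - lambda) x1 + b x2 = 0, normalized to
# length 1.  A sketch for symmetric 2x2 matrices with b != 0.

def eig2(a, b, c, d):
    tr, det = a + d, a * d - b * c
    disc = math.sqrt(tr * tr - 4 * det)
    pairs = []
    for lam in ((tr + disc) / 2, (tr - disc) / 2):
        x1, x2 = 1.0, (lam - a) / b
        L = math.sqrt(x1 ** 2 + x2 ** 2)
        pairs.append((lam, (x1 / L, x2 / L)))
    return pairs

pairs = eig2(2.0, 1.0, 1.0, 2.0)
print([lam for lam, _ in pairs])   # -> [3.0, 1.0]
```

For this symmetric example the two eigenvectors come out as (1, 1)/√2 and (1, −1)/√2, which are orthogonal, as Property 3 above requires.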
28
Orthogonal Matrices: We say that A is orthogonal
if A′ = A⁻¹ or, equivalently, AA′ = A′A =
I, so that the columns (and rows) of A all have
length 1 and are mutually orthogonal.
Spectral Decomposition: If A (k×k) is a symmetric
matrix (i.e., A = A′) then it can be written
(decomposed) as follows:

    A = E Λ E′   (each factor k×k),

where

        ( λ1  0  …  0  )
    Λ = (  0  λ2 …  0  )
        (  ⋮   ⋮  ⋱  ⋮ )
        (  0   0  … λk )

and E is an orthogonal matrix with columns
e1, e2, …, ek the eigenvectors corresponding to
eigenvalues λ1, λ2, …, λk.
We haven't yet talked about how to interpret eigenvalues
and eigenvectors, but the spectral decomposition
says something about the significance of these
quantities (why we're interested in them): the "information"
in A can be broken apart into its eigenvalues
and a set of orthogonal eigenvectors.
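The decomposition is easy to verify numerically. A numpy sketch (illustrative numbers, not from the notes; `np.linalg.eigh` is the routine for symmetric matrices and returns the eigenvectors as the columns of E):

```python
import numpy as np

# Spectral decomposition A = E Lambda E' for a symmetric matrix.

A = np.array([[4.0, 1.0], [1.0, 3.0]])
lam, E = np.linalg.eigh(A)              # eigenvalues ascending, columns = e_i

recon = E @ np.diag(lam) @ E.T          # E Lambda E'
print(np.allclose(recon, A))            # -> True
print(np.allclose(E.T @ E, np.eye(2)))  # E orthogonal -> True
```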
29
Results:
1. For A (k×k) symmetric,

    |A| = |E||Λ||E′| = |EE′||Λ| = |I||Λ| = Π_{i=1}^k λi

I.e., the determinant of a symmetric matrix is
the product of its eigenvalues.
2. If A has eigenvalues λ1, …, λk, then A⁻¹ has
the same associated eigenvectors and eigenvalues
1/λ1, …, 1/λk, since

    A⁻¹ = (EΛE′)⁻¹ = E Λ⁻¹ E′.
30
3. Similarly, if A has the additional property that
it is positive definite (defined below), a square
root matrix, A^{1/2}, can be obtained with the
property A^{1/2} A^{1/2} = A:

    A^{1/2} = E Λ^{1/2} E′,  where

              ( √λ1  0   …   0  )
    Λ^{1/2} = (  0  √λ2  …   0  )
              (  ⋮    ⋮  ⋱   ⋮  )
              (  0    0  …  √λk )

A^{1/2} is symmetric, has eigenvalues that are the
square roots of the eigenvalues of A, and has
the same associated eigenvectors as A.
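A numpy sketch of the square-root construction (illustrative matrix, not from the notes; `scipy.linalg.sqrtm` is the usual library routine for this):

```python
import numpy as np

# Square-root matrix A^{1/2} = E Lambda^{1/2} E' for a p.d. matrix.

A = np.array([[5.0, 4.0], [4.0, 5.0]])    # eigenvalues 9 and 1: p.d.
lam, E = np.linalg.eigh(A)
A_half = E @ np.diag(np.sqrt(lam)) @ E.T

print(np.allclose(A_half @ A_half, A))    # A^{1/2} A^{1/2} = A -> True
print(np.allclose(A_half, A_half.T))      # symmetric -> True
```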
Quadratic Forms: For a symmetric matrix A (k×k),
a quadratic form in x (k×1) is defined by

    x′Ax = Σ_{i=1}^k Σ_{j=1}^k aij xi xj.

(x′Ax is a sum of squared (quadratic) terms,
xi², and xi xj terms.)
31
Positive Definite Matrices: A is positive definite
(p.d.) if x′Ax > 0 for all x ≠ 0. If x′Ax ≥ 0
for all nonzero x then A is positive semi-definite
(p.s.d.) or, sometimes, non-negative
definite.
If A (k×k) is p.d. then for i = 1, …, k,

    ei′ A ei = λi (ei′ ei) > 0;

i.e., the eigenvalues of A are all positive.
It can also be shown that if the eigenvalues of
A are all positive then A is p.d.
32
Statistical Distance:
33
Need a distance measure where coordinates are
weighted differentially and distance is relative
to the variability of the data.
One solution, Karl Pearson Distance: Distance
of a point P = (x1, x2) from the origin 0 = (0, 0)
takes into account how many (X1) standard deviations
x1 is from 0 and how many (X2) standard
deviations x2 is from 0:

    d(0, P) = √( (x1/s1)² + (x2/s2)² ),

where s1 is the standard deviation of X1 and s2 is
the standard deviation of X2. The generalization
to k-dimensional space is straightforward:

    d(0, P) = √( (x1/s1)² + … + (xk/sk)² ).
34
An alternative, Statistical Distance: Karl Pearson
Distance would be o.k. if it were computed with
respect to the rotated axes X̃1 and X̃2. So, in two
dimensions, we can define statistical distance as

    d(0, P) = √( (x̃1/s̃1)² + (x̃2/s̃2)² ),

where

    x̃1 =  x1 cos(θ) + x2 sin(θ)
    x̃2 = −x1 sin(θ) + x2 cos(θ),

s̃1, s̃2 are the standard deviations of X̃1 and X̃2, and
θ is the angle of rotation.
35
With respect to the original coordinate system we
have

    d(0, P) = √( a11 x1² + 2 a12 x1 x2 + a22 x2² ),

where a11, a12, and a22 are coefficients depending
on s1, s2, cov(X1, X2) and trigonometric functions
of θ (not important now).
Notice that d(0, P) as defined above can be written
as

    d(0, P)² = ( x1 x2 ) ( a11 a12 ) ( x1 ) = x′Ax,
                         ( a12 a22 ) ( x2 )

a quadratic form in x involving the symmetric matrix
A.
In fact, any positive definite quadratic form can
be thought of as a squared statistical distance.
The positive definiteness of the matrix A is required
to ensure the properties of a distance
metric (non-negativity, triangle inequality, etc.).
    x′Ax: squared distance from x to 0,
    (x − y)′A(x − y): squared distance from x to y.
36
Geometric Interpretation of the Spectral Decomposition:
Consider the set of points a constant
(statistical) distance c away from the origin.
These points satisfy

    x′Ax = c²  ⇒  x′EΛE′x = y′Λy = Σ_i λi yi² = c²,

which is the equation of an ellipsoid in the transformed
variable, y = E′x.
A transformation of this form, y = E′x, involving
an orthogonal matrix E, defines a rotation from the
x coordinate system into the y coordinate system
which is rigid, in the sense that the orthogonality of
the axes is maintained. The new axes are in the
directions of the eigenvectors, e1, e2, ….
37
In two dimensions we have

    λ1 y1² + λ2 y2² = c²,

the equation of an ellipse, with y1 = x′e1 and y2 =
x′e2. If we let x = c λ1^{−1/2} e1 we see that this value
satisfies x′Ax = c²:

    x′Ax = λ1 (x′e1)² + λ2 (x′e2)²
         = λ1 (c λ1^{−1/2} e1′e1)² + λ2 (c λ1^{−1/2} e1′e2)² = c²,

using e1′e1 = 1 and e1′e2 = 0.
Therefore, ||c λ1^{−1/2} e1|| = c/√λ1 gives the length
of the ellipse along e1. Similarly, c/√λ2 gives the
length of the ellipse in the perpendicular direction
along e2.
38
One More Matrix Function:
Trace of a Matrix: Defined only for square matrices.
Let A (k×k) = (aij) be a square matrix. The
trace of A is the sum of its diagonal elements:

    tr(A) = Σ_{i=1}^k aii

Properties:
1. tr(cA) = c tr(A).
2. tr(A + B) = tr(A) + tr(B).
3. tr(ABC) = tr(CAB) as long as the matrix
products are defined (matrices are conformable)
and ABC and CAB are square. In particular,

    tr(B⁻¹AB) = tr(A)

and

    tr(Axx′) = tr(x′Ax) = x′Ax   (a scalar).

4. tr(AA′) = Σ_i Σ_j aij².
5. If A (k×k) is symmetric with eigenvalues λ1, …, λk,
then tr(A) = Σ_{i=1}^k λi.
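Properties 3 and 5 are easy to spot-check numerically. A numpy sketch with random matrices (illustrative only):

```python
import numpy as np

# Check tr(ABC) = tr(CAB) (cyclic property) and, for a symmetric
# matrix, tr(A) = sum of its eigenvalues.

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))
print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))   # -> True

S = A + A.T                      # a symmetric matrix
lam = np.linalg.eigvalsh(S)
print(np.isclose(np.trace(S), lam.sum()))                     # -> True
```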
39
Basic Multivariate Statistical Concepts
40
Recall, for a univariate random variable X,

    E(X) = ∫_{−∞}^{∞} x fX(x) dx   if X is continuous;
    E(X) = Σ_{all x} x pX(x)       if X is discrete.

Here, fX(x) is the probability density function
of X in the continuous case, and pX(x) is the probability
function of X in the discrete case.
(Population) Variance-Covariance Matrix: For
a random vector x (p×1) = (x1, x2, …, xp)′, the
matrix

    (  var(x1)     cov(x1, x2)  …  cov(x1, xp) )
    ( cov(x2, x1)   var(x2)     …  cov(x2, xp) )
    (      ⋮            ⋮       ⋱       ⋮      )
    ( cov(xp, x1)  cov(xp, x2)  …   var(xp)    )

      ( σ11 σ12 … σ1p )
    = ( σ21 σ22 … σ2p )
      (  ⋮   ⋮  ⋱  ⋮  )
      ( σp1 σp2 … σpp )

is called the variance-covariance matrix of x and
is denoted var(x).
41
Recall: for a univariate random variable xi with
expected value μi,

    σii = var(xi) = E[(xi − μi)²]
42
(Population) Correlation Matrix: For a random
vector x (p×1), the population correlation
matrix is the matrix of correlations among the
elements of x:

              (  1  ρ12 … ρ1p )
    corr(x) = ( ρ21  1  … ρ2p )
              (  ⋮   ⋮  ⋱  ⋮  )
              ( ρp1 ρp2 …  1  ),

where ρij = corr(xi, xj).
Recall: for random variables xi and xj,

    ρij = corr(xi, xj) = σij / (√σii √σjj)

measures the amount of linear association between
xi and xj.
Correlation matrices are symmetric.
For a random vector x (p×1), let ρ = corr(x), Σ =
var(x) and V = diag(σ11, σ22, …, σpp). The
relationship between ρ and Σ is

    Σ = V^{1/2} ρ V^{1/2}
    ρ = (V^{1/2})⁻¹ Σ (V^{1/2})⁻¹
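The second identity says: divide each row and column of Σ by the corresponding standard deviation. A numpy sketch with made-up numbers (the matrix is mine, not from the text):

```python
import numpy as np

# rho = (V^{1/2})^{-1} Sigma (V^{1/2})^{-1}, where V^{1/2} is the
# diagonal matrix of standard deviations sqrt(sigma_ii).

Sigma = np.array([[4.0,  2.0,  0.0],
                  [2.0,  9.0, -1.5],
                  [0.0, -1.5,  1.0]])
V_half_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
rho = V_half_inv @ Sigma @ V_half_inv

print(np.diag(rho))   # unit diagonal -> [1. 1. 1.]
print(rho[0, 1])      # 2 / (2 * 3) = 0.333...
```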
43
Properties:
Let X, Y be random matrices of the same dimension
and A, B be matrices of constants such
that AXB is defined.
1. E(X + Y) = E(X) + E(Y).
2. E(AXB) = A E(X) B.
Now let x (p×1) be a random vector with mean μx
and variance Σx. Let c (p×1) be a vector of constants,
and C (k×p) a matrix of constants.
For the linear combination c′x = c1 x1 + … + cp xp,
3. E(c′x) = c′μx, and
4. var(c′x) = c′ Σx c.
For the k-dimensional vector of linear combinations,
Z (k×1) = Cx,
5. μZ = E(Z) = E(Cx) = C E(x) = C μx and
6. ΣZ = var(Z) = var(Cx) = C Σx C′.
44
All of the definitions and properties of the last 5
pages have been defined in terms of population quantities.
There are corresponding definitions of sample
quantities for which the same properties hold:
Examples (sample size n):
- Sample mean: x̄ = (x̄1, …, x̄p)′, where x̄i =
  n⁻¹ Σ_{k=1}^n x_ki. (x_ki is the kth observation of
  random variable xi.)
- Sample variance-covariance matrix:

         ( s11 … s1p )
    SX = (  ⋮  ⋱  ⋮  )
         ( sp1 … spp )

       = 1/(n−1) Σ_{k=1}^n (xk − x̄)(xk − x̄)′,

  where

    sij = 1/(n−1) Σ_{k=1}^n (x_ki − x̄i)(x_kj − x̄j).
45
- Sample correlation matrix:

         (  1  r12 … r1p )
    RX = ( r21  1  … r2p )
         (  ⋮   ⋮  ⋱  ⋮  )
         ( rp1 rp2 …  1  ),

  where rij is the sample correlation computed on
  the n sample pairs of observations (x1i, x1j), …,
  (xni, xnj):

    rij = sij / (√sii √sjj).
46
In the univariate setting we typically have a random
variable X , say, from which we observe a random
sample. That is, we assume that X has some dis-
tribution that is partially or completely unknown
and we observe n independent copies X1 ; : : : ; Xn of
X . The observed values of these copies we typi-
cally denote with lower case letters: x1 ; : : : ; xn , and
we infer features of the true distribution from the
sample.
In the multivariate setting we are interested in the
true (multivariate) distribution of a random vector
x (p×1). Typically the random variables that make up
x, namely x1, …, xp, are not independent, and that
dependence (correlation) among x1, …, xp is the reason
for our interest. A random sample here consists of
n copies of x that we can arrange in an n×p data
matrix:

        ( x11 x12 … x1p )
    X = ( x21 x22 … x2p )
        (  ⋮   ⋮  ⋱  ⋮  )
        ( xn1 xn2 … xnp )

(I'm using lower case here, no longer distinguishing
between the random variable and its observed
value.)
47
Rows are assumed independent (n independent
copies).
Dependence is typically among columns.
The sample mean and variance can be computed
from the data matrix X as

    x̄ = (1/n) X′1

    S = 1/(n−1) X′(I − (1/n) 11′) X
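These two matrix formulas are easy to verify against direct computation. A numpy sketch on a small made-up data matrix (the notes use SAS; this is only a cross-check):

```python
import numpy as np

# x_bar = (1/n) X'1 and S = (1/(n-1)) X'(I - (1/n) 1 1')X,
# compared with np.mean and np.cov (which also uses divisor n-1).

X = np.array([[1.0, 2.0],
              [3.0, 5.0],
              [2.0, 4.0],
              [6.0, 1.0]])
n = X.shape[0]
one = np.ones((n, 1))

xbar = (X.T @ one / n).ravel()
S = X.T @ (np.eye(n) - one @ one.T / n) @ X / (n - 1)

print(np.allclose(xbar, X.mean(axis=0)))         # -> True
print(np.allclose(S, np.cov(X, rowvar=False)))   # -> True
```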
48
Sample variance-covariance matrices with quite
different patterns can have the same generalized
variance. E.g.,

    S1 = ( 5 4 )    S2 = (  5 −4 )    S3 = ( 3 0 )
         ( 4 5 ),        ( −4  5 ),        ( 0 3 )

have values of r12 equal to 0.8, −0.8 and 0.0,
respectively. However,

    |S1| = |S2| = |S3| = 9.
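A numpy check of this example, printing each matrix's correlation and its generalized variance:

```python
import numpy as np

# The three covariance matrices above: different correlation
# patterns, identical generalized variance |S| = 9.

S1 = np.array([[5.0, 4.0], [4.0, 5.0]])
S2 = np.array([[5.0, -4.0], [-4.0, 5.0]])
S3 = np.array([[3.0, 0.0], [0.0, 3.0]])

for S in (S1, S2, S3):
    r12 = S[0, 1] / np.sqrt(S[0, 0] * S[1, 1])
    print(round(r12, 1), round(np.linalg.det(S), 6))
# -> 0.8 9.0, then -0.8 9.0, then 0.0 9.0
```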
50
Example, Kite Data: Consider the bivariate data
of Table 5.11 in the text. A wildlife ecologist mea-
sured x1 = tail length (mm) and x2 = wing length
(mm) for a sample of n = 45 female hook-billed
kites. The handout consists of a printout of a SAS
program, kite1.sas, and the associated output, kite1.lst.
51
PROC PRINT prints out a listing of the data
on p. 1 of kite1.lst.
PROC CORR computes the sample correlation
matrix R for the variables on the VAR statement.
In addition, the COV option on the PROC
CORR line requests the sample covariance matrix
S and, by default, PROC CORR computes
summary statistics (x̄1, x̄2, etc.). All of these
results appear on p. 2 of kite1.lst.
PROC PLOT prints a scatter plot of winglen
versus taillen on p. 3 of kite1.lst.
The generalized sample variance is

    |S| = 120.69(208.54) − 122.35² = 10201.24

and we have

    |R| = 10201.24 / (120.69(208.54)) = 1.0(1.0) − 0.77² = 0.41.

The total variance is 120.69 + 208.54 = 329.24.
52
Suppose we decide that we are interested in the
size and shape of the birds in a way that is not
captured by tail length (x1) and wing length
(x2). Specifically, we might look at y1 = x1 + x2
(size) and y2 = x1 − x2 (shape).
To compute summary statistics on y = (y1, y2)′,
we can use our methods for linear combinations
of x = (x1, x2)′. Notice,

    y = Ax,

where

    A = ( 1  1 )
        ( 1 −1 ),

so

    ȳ = A x̄ = ( 1  1 ) ( 193.62 ) = (  473.40 )
              ( 1 −1 ) ( 279.78 )   (  −86.16 ).
53
In addition,

    Sy = A Sx A′

       = ( 1  1 ) ( 120.69 122.35 ) ( 1  1 )
         ( 1 −1 ) ( 122.35 208.54 ) ( 1 −1 )

       = ( 243.04  330.89 ) ( 1  1 )
         (  −1.65  −86.19 ) ( 1 −1 )

       = ( 573.93  −87.85 )
         ( −87.85   84.54 )
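The same computation in numpy, using the rounded values of Sx printed above (so the result matches the slide only to within rounding; the course's own computation is in SAS):

```python
import numpy as np

# Sy = A Sx A' for y1 = x1 + x2 (size) and y2 = x1 - x2 (shape),
# kite data, with Sx entries rounded to two decimals.

A = np.array([[1.0, 1.0], [1.0, -1.0]])
Sx = np.array([[120.69, 122.35], [122.35, 208.54]])

Sy = A @ Sx @ A.T
print(Sy.round(2))
# roughly [[573.93, -87.85], [-87.85, 84.53]] (84.54 on the slide,
# which carried more decimal places)
```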
54
The Multivariate Normal Distribution
(See chapter 4 of Johnson and Wichern.)
Recall the central role that the (univariate) nor-
mal distribution played in the development of
univariate statistical methods and why:
a. For many continuous random variables the
assumption of normality is appropriate.
b. numerous statistics, particularly those that
can be expressed as sums or averages, have
normal sampling distributions regardless of
the underlying population distribution.
c. the normal distribution is easy to work with
mathematically and many useful results are
available.
Items (a) and (b) above are due to the Central
Limit Theorem.
55
Recall:
57
The cumulative distribution function (c.d.f.) of
X is

    FX(x) = Pr(X ≤ x) = ∫_{−∞}^x fX(t) dt.

FX is not available in closed form but the standard
normal version (with μ = 0, σ² = 1) is
extensively tabled.
The multivariate normal distribution: for the
vector x, we write x ~ Np(μ, Σ) to signify that x is
a p-dimensional random vector with mean μ and
variance Σ. x has p.d.f.

    f(x) = 1 / ((2π)^{p/2} |Σ|^{1/2}) exp( −½ (x − μ)′ Σ⁻¹ (x − μ) ),

and c.d.f.

    F(x) = Pr(x1 ≤ x1, x2 ≤ x2, …, xp ≤ xp)
         = ∫_{−∞}^{xp} … ∫_{−∞}^{x1} f(t) dt.
58
We've replaced a univariate distance from x to μ
with a multivariate statistical distance,

    (x − μ)′ Σ⁻¹ (x − μ),

from x to μ.
It is now the volume under the p.d.f. which
represents the joint probability that x1, …, xp
fall within given intervals.
59
Contours of constant density for the p-dimensional
normal distribution are ellipsoids defined by x
such that

    (x − μ)′ Σ⁻¹ (x − μ) = c².

These ellipsoids are centered at μ and have axes
c √λi ei, for eigen-pairs (λi, ei) of Σ.
61
5. For x ~ Np(μ, Σ), all subsets of x are (multivariate)
normally distributed. If we partition x,
μ and Σ accordingly as

    x (p×1) = ( x1 )  (q×1)         μ (p×1) = ( μ1 )  (q×1)
              ( x2 )  ((p−q)×1)               ( μ2 )  ((p−q)×1)

    Σ (p×p) = ( Σ11 Σ12 )   Σ11: q×q,      Σ12: q×(p−q),
              ( Σ21 Σ22 )   Σ21: (p−q)×q,  Σ22: (p−q)×(p−q),

then x1 is distributed as Nq(μ1, Σ11).
62
6.
a. If x1 and x2 are independent q1- and q2-dimensional
vectors, respectively, then regardless
of the distributions of x1 and x2,
cov(x1, x2) = 0 (q1×q2).
b. If x1, x2 are as before and are each multivariate
normally distributed, then
cov(x1, x2) = 0 implies that x1 and x2 are
independent.
c. If x1 and x2 are independent with Nq1(μ1, Σ11)
and Nq2(μ2, Σ22) distributions, respectively,
then

    ( x1 ) ~ N_{q1+q2} ( ( μ1 ) , ( Σ11  0  ) )
    ( x2 )             ( ( μ2 )   (  0  Σ22 ) )

If we combine (multivariate) normally distributed
components into a single vector,
the result is not necessarily multivariate normal.
However, if the components are (multivariate)
normal and independent then the
result will be multivariate normal.
63
7. If x ~ Np(μ, Σ) where

    x = ( x1 )    μ = ( μ1 )    Σ = ( Σ11 Σ12 )
        ( x2 ),       ( μ2 ),       ( Σ21 Σ22 )

with |Σ22| > 0, then the conditional distribution
of x1 given x2 = x2 is (multivariate) normal
with

    Mean = μ1 + Σ12 Σ22⁻¹ (x2 − μ2)

and

    Covariance = Σ11 − Σ12 Σ22⁻¹ Σ21.

Notice that the conditional covariance does not
depend on the value of the conditioning variable.
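A tiny worked instance of these two formulas, with p = 2 and q = 1 (the numbers are mine, chosen for easy arithmetic, not from the text):

```python
# Conditional distribution of x1 given x2 = x2* for a bivariate
# normal with mu = (1, 2) and Sigma = [[4, 2], [2, 2]].

mu1, mu2 = 1.0, 2.0
S11, S12, S22 = 4.0, 2.0, 2.0

x2_obs = 3.0
cond_mean = mu1 + S12 / S22 * (x2_obs - mu2)   # mu1 + S12 S22^{-1} (x2 - mu2)
cond_var = S11 - S12 / S22 * S12               # S11 - S12 S22^{-1} S21

print(cond_mean, cond_var)   # -> 2.0 2.0
```

Note the conditional variance (2.0) is smaller than the marginal variance of x1 (4.0): conditioning on a correlated variable reduces uncertainty, and by an amount that does not depend on the observed x2.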
64
8. For x ~ Np(μ, Σ) with |Σ| > 0,

    (x − μ)′ Σ⁻¹ (x − μ) ~ χ²(p)

where χ²(p) denotes the chi-square distribution
with p degrees of freedom.
Proof: we can write the quadratic form as

    (x − μ)′ E Λ^{−1/2} Λ^{−1/2} E′ (x − μ) = Z′Z,

where Z = Λ^{−1/2} E′ (x − μ) is multivariate normal
with mean

    E(Z) = Λ^{−1/2} E′ E(x − μ) = 0

and variance

    var(Z) = Λ^{−1/2} E′ Σ E Λ^{−1/2}
           = Λ^{−1/2} Λ Λ^{−1/2} = I

so Z′Z is the sum of p squared independent
N(0, 1) r.v.s ⇒ Z′Z ~ χ²(p).
65
Multivariate Sampling Distributions:
In this course we will most often consider data that
are, or are assumed to be, a random sample from a
multivariate population. That is, we will have n independent
copies, x1, x2, …, xn, of a p-dimensional
random variable which we arrange in an n×p data
matrix:

        ( x11 x12 … x1p )   ( x1′ )
    X = ( x21 x22 … x2p ) = ( x2′ )
        (  ⋮   ⋮  ⋱  ⋮  )   (  ⋮  )
        ( xn1 xn2 … xnp )   ( xn′ )

Rows of X are independent.
Columns of X are dependent and this dependence
is of interest.
66
First we consider the general case where x1, …, xn
have an arbitrary (not necessarily normal) common
distribution with mean μ and variance Σ:
1. The sample mean x̄ is unbiased for μ; i.e.,

    E(x̄) = μ.

This is easily seen by writing x̄ as x̄ = (1/n) X′1.
It follows that

    E(x̄) = (1/n) E(X′) 1
         = (1/n) (μ, μ, …, μ) 1   (n copies of μ)
         = μ.
67
2. var(x̄) = (1/n) Σ. We can write x̄ as

                                    ( x1 )
    x̄ = ( (1/n)I, (1/n)I, …, (1/n)I ) ( x2 )   (n blocks)
                                    (  ⋮ )
                                    ( xn )

Therefore, var(x̄) equals

                                    ( Σ 0 … 0 ) ( (1/n)I )
    ( (1/n)I, (1/n)I, …, (1/n)I )   ( 0 Σ … 0 ) ( (1/n)I )
                                    ( ⋮ ⋮ ⋱ ⋮ ) (   ⋮    )
                                    ( 0 0 … Σ ) ( (1/n)I )

                                    ( (1/n)Σ )
    = ( (1/n)I, (1/n)I, …, (1/n)I ) (   ⋮    )
                                    ( (1/n)Σ )

    = n (1/n²) Σ = (1/n) Σ
68
3. The sample variance matrix S is unbiased for
Σ; i.e., E(S) = Σ. Proof:

    E(S) = 1/(n−1) Σ_{i=1}^n E[(xi − x̄)(xi − x̄)′]
         = 1/(n−1) [ Σ_{i=1}^n E(xi xi′) − n E(x̄ x̄′) ].

As a general result,

    Σy = E[(y − μy)(y − μy)′]
       = E(yy′) − μy μy′
    ⇒ E(yy′) = Σy + μy μy′.

So,

    E(S) = 1/(n−1) [ n(Σ + μμ′) − n( (1/n)Σ + μμ′ ) ] = Σ.
69
Large Sample Results (general case):
1. x̄ →p μ, S →p Σ.
2. √n (x̄ − μ) →d Np(0, Σ).
3. Combining (1) and (2) we get

    √n S^{−1/2} (x̄ − μ) →d Np(0, I).

4. (3) implies

    n (x̄ − μ)′ S⁻¹ (x̄ − μ) →d χ²(p).
71
Properties of the Wishart Distribution:
1. Wp(Σ, m) is the distribution of Σ_{i=1}^m zi zi′ where
z1, …, zm iid Np(0, Σ), which implies that

    Σ_{i=1}^n (xi − x̄)(xi − x̄)′ = (n − 1) S ~ Wp(Σ, n − 1)

2. If Ai ~ind Wp(Σ, mi) then

    Σ_i Ai ~ Wp( Σ, Σ_i mi ).
72
Assessment of Normality:
We have seen that x (p×1) is multivariate normal
if and only if every linear combination of
x1, …, xp is normal. Therefore, one check is
whether selected linear combinations are (univariate)
normal.
As a special case of the above we can check
whether each component of x is (univariate)
normal.
Other selected linear combinations of x1, …, xp
worth checking are ê1′x and êp′x, where ê1 and
êp are the eigenvectors associated with λ̂1 and
λ̂p, respectively, the largest and smallest eigenvalues
of S.
73
How do we assess univariate normality?
One method is the quantile-quantile (Q-Q) plot (also
known as a probability plot).
Idea is to plot the sample quantiles (percentiles)
for the variable in question versus the quan-
tiles one would expect to observe assuming ex-
act normality.
If the data are normal, plot should follow a
straight line. Departures from a straight line
may indicate the type of non-normality (e.g.,
heavy tails, skewness).
Q-Q plots can be misleading for small sample
sizes (n < 20, say). Small samples of truly nor-
mal data can look crooked.
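The quantile pairs behind a normal Q-Q plot are simple to compute by hand. A stdlib-Python sketch (the data are made up; the (j − 0.5)/n plotting position is one common convention, and others exist):

```python
from statistics import NormalDist

# Sorted data versus standard normal quantiles Phi^{-1}((j - 0.5)/n).
# Near-normal data should give roughly a straight line of points.

data = [49.2, 50.1, 49.7, 50.4, 49.5, 50.0, 49.9, 50.6]
n = len(data)
sample_q = sorted(data)
theo_q = [NormalDist().inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]

for s, t in zip(sample_q, theo_q):
    print(round(t, 3), s)
```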
74
Example:
75
A formal test for normality is based on rq, the
correlation coefficient of the observed and expected
quantiles. The hypothesis of normality
is rejected for values of rq below the tabulated
value in Table 4.2 on p. 193 of the text.
The test statistic based on rq is approximately
equivalent to the Shapiro-Wilk test computed
by PROC UNIVARIATE and, in general, these
two tests give very similar results. Either may
be used.
Example { Sparrow Data:
Recall: x1 = total length, x2 = alar extent, x3 =
length of beak and head, x4 = length of humerus,
x5 = length of keel of sternum.
See sparrow2.sas and sparrow2.lst. Here we just
consider x1, x2, x3 to keep the size of the example
manageable. In addition, we consider only
the birds who died (birds 1–21).
76
Notice that the Q-Q plots produced by PROC
UNIVARIATE have such small scales as to make
them practically worthless.
The Shapiro-Wilk tests for x1–x3 all give p-values
> .05, so there is not strong evidence in any of
the individual variables to indicate non-multivariate
normality.
E.g., the Shapiro-Wilk test statistic for x1, the
total length of the bird, is W = 0.934 with
p-value 0.165 (see p. 1 of sparrow2.lst).
However, we also should check the univariate
normality of ê1′x and ê3′x. ê1′x and ê3′x are
computed in PROC IML and printed on p. 9.
The S-W test for ê1′x is given by PROC UNIVARIATE
on p. 11 as W = 0.861 with p-value
0.0067. Therefore, we reject the null hypothesis
of normality for ê1′x at the 0.05 level ⇒ we reject
the hypothesis that x is trivariate normal.
77
As an alternative to the Shapiro-Wilk test, rq =
0.920 is computed on p. 16. Since 0.920 falls
below the critical value for α = 0.05 and a sample
size of 20 (0.9269 from Table 4.2 in the text) we
reject the null hypothesis of normality.
The tests agree.
Note that the procedure here deviates from the
usual one, by which we reject when the test statistic
exceeds the critical value.
Notice that as an alternative to letting PROC
UNIVARIATE automatically produce the Q-Q
plot, it is fairly easy to compute the quantiles to
be plotted in a Q-Q plot and then to plot these
quantiles with PROC PLOT. This I have done
for both ê1′x and ê3′x to produce the plots on
p. 15 and p. 17. These are obviously much easier
to read than the corresponding plots produced
by PROC UNIVARIATE on pp. 12 and 14,
respectively.
78
A graphical procedure to examine multivariate nor-
mality more directly is the Chi-square plot or
gamma plot.
When the sample size is moderate to large, the
squared generalized distances

    dj² = (xj − x̄)′ S⁻¹ (xj − x̄),   j = 1, …, n,

should be approximately distributed as χ²(p)
random variables.
The chi-square plot is a Q-Q plot of d1², …, dn²
versus the quantiles one would expect to observe
based on the χ²(p) distribution.
A straight line shape indicates multivariate nor-
mality.
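The distances dj² are straightforward to compute. A numpy sketch on simulated data (not the sparrow data), including one useful sanity check: whatever the data, the dj² always sum to exactly (n − 1)p, since Σ dj² = tr(S⁻¹ Σ(xj − x̄)(xj − x̄)′) = tr(S⁻¹ (n − 1)S) = (n − 1)p:

```python
import numpy as np

# Squared generalized distances d_j^2 = (x_j - x_bar)' S^{-1} (x_j - x_bar)
# for a chi-square plot.

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))          # n = 20, p = 3, simulated
n, p = X.shape

xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.array([(x - xbar) @ S_inv @ (x - xbar) for x in X])

print(np.isclose(d2.sum(), (n - 1) * p))  # sanity check -> True
```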
79
Sparrow Example (Continued):
The Q-Q plot for d² appears on p. 4 of sparrow3.lst.
Except for one point, this plot looks pretty close
to a straight line. The unusual point, or outlier,
is bird #8. This outlier was also seen in the Q-Q
plot of ê1′x and seems to be the main cause
of non-multivariate normality in these data.
It would be interesting to eliminate bird #8 and
then repeat our examination of the multivariate
normality of this data set. Perhaps bird #8's
data was incorrectly recorded, or perhaps bird
#8 happened to die for some reason other than
exposure to the severe storm.
If a good reason can be found that this bird's
data do not belong in the data set, then this
outlier should be removed. Otherwise, a good
solution would be to analyze the data both with
and without the outlier and to report both sets
of results.
80
Suggested Procedure:
1. Examine Q-Q plots for all univariate components
of x.
2. Examine Q-Q plots for ê₁′x and êₚ′x.
3. Examine the Q-Q plot for the d²ⱼ's.
While one bad plot is enough to invalidate the
multivariate normality assumption, good-looking
plots for all the examined linear combinations are
not enough to confirm (with certainty) the
normality assumption.
However, we can only go so far in trying to
confirm multivariate normality. If all plots look
good, we will consider the multivariate normality
assumption to be satisfied.
Other "goodness of fit" tests of normality exist
(chi-square, Kolmogorov-Smirnov), but these suffer
from serious drawbacks: they have low power to
detect nonnormality in small samples and are
oversensitive in large samples.
Handling Non-normality - Transformations
Basic idea: if x is not normal, find a function of
x (square root, logarithm, etc.) that is.
Most of the useful transformations are contained
in the family of power transformations:

    f_λ(x) = x^λ,     if λ ≠ 0,
    f_λ(x) = log(x),  if λ = 0.
Power transformations are defined for
positive-valued random variables. However, they
can be applied to variables with negative values
by first adding a large value to each observation
(to make all values positive) before taking the
transformation.
For certain special types of data, specific
transformations are available that often work well:

    Original Scale      Transformed Scale
    Counts, y           √y
    Proportions, p̂      logit(p̂) = ½ log(p̂/(1 − p̂))
    Correlations, r     Fisher's z(r) = ½ log((1 + r)/(1 − r))
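These special-case transformations are one-liners to apply in practice; a small sketch (note the logit here carries the ½ factor, as in the table):

```python
import numpy as np

def sqrt_counts(y):
    """Variance-stabilizing square-root transform for counts."""
    return np.sqrt(np.asarray(y, dtype=float))

def logit_half(p_hat):
    """(1/2) log(p-hat / (1 - p-hat)) for proportions in (0, 1)."""
    p_hat = np.asarray(p_hat, dtype=float)
    return 0.5 * np.log(p_hat / (1.0 - p_hat))

def fisher_z(r):
    """Fisher's z(r) = (1/2) log((1 + r) / (1 - r)) for correlations in (-1, 1)."""
    r = np.asarray(r, dtype=float)
    return 0.5 * np.log((1.0 + r) / (1.0 - r))
```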
Inference about a Mean Vector
(See chapter 5 of Johnson and Wichern.)
Finally, some data analysis!
Recall the univariate situation: t tests and Z
tests for a single mean.
The simplest scenario is one in which we want to
know whether the true mean of a population, μ, is
equal to some specified value, μ₀. We frame the
problem as a statistical test of competing
hypotheses:

    H₀: μ = μ₀ versus H₁: μ ≠ μ₀,

where μ is the mean of the population from which
we have drawn a sample x₁, …, xₙ iid N(μ, σ²).
σ² known: If the population variance is known, then

    Z = (x̄ − μ₀)/(σ/√n) ~ N(0, 1),

where

    x̄ = (1/n) Σᵢ xᵢ  (sample mean).
σ² unknown: (usual case) If the population
variance is unknown, we have to estimate it with
the sample variance s², and the substitution of s
for σ in our test statistic changes its
distribution (the test statistic is more variable
now because of the error in estimating σ²):

    t = (x̄ − μ₀)/(s/√n) ~ t(n − 1),  the t dist'n with n − 1 d.f.,

where

    s² = [1/(n − 1)] Σᵢ (xᵢ − x̄)²  (sample variance).

We reject H₀ in favor of H₁ at the α level if the
observed |t| exceeds a critical value:

    Reject H₀ if |t| > t_{α/2}(n − 1).
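This univariate test takes a few lines to code directly; a sketch (the simulated data are illustrative, not from the notes, and scipy's ttest_1samp computes the same statistic):

```python
import numpy as np
from scipy import stats

def one_sample_t(x, mu0, alpha=0.05):
    """Two-sided one-sample t test of H0: mu = mu0.
    Returns (t statistic, critical value t_{alpha/2}(n-1), reject flag)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))  # s uses the n-1 divisor
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    return t, t_crit, abs(t) > t_crit

# illustrative data
rng = np.random.default_rng(0)
x = rng.normal(loc=0.3, size=30)
t, t_crit, reject = one_sample_t(x, 0.0)
```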
Extension to Multivariate Case:
From an Nₚ(μ, Σ) population we draw a sample

    x₁, …, xₙ iid Nₚ(μ, Σ).

The corresponding hypotheses in the multivariate
case are

    H₀: μ = μ₀ versus H₁: μ ≠ μ₀,

where

    μ = (μ₁, …, μₚ)′ and μ₀ = (μ₁₀, …, μₚ₀)′.

Note that H₀ requires all component means to be
equal to the specified values μ₁₀, …, μₚ₀, and H₁
holds if any (one or more) μᵢ ≠ μᵢ₀.
Test statistic (Hotelling's T²):

    T² = n (x̄ − μ₀)′ S⁻¹ (x̄ − μ₀).
So, since

    t² = [√n (x̄ − μ₀)] (s²)⁻¹ [√n (x̄ − μ₀)]

is of the form

    [N(0,1)] [χ²(d.f.₂)/d.f.₂]⁻¹ [N(0,1)]
        = [χ²(1)/1] / [χ²(d.f.₂)/d.f.₂],

it follows that t² ~ F(1, d.f.₂).
In the multivariate case,

    T² = [√n (x̄ − μ₀)]′ S⁻¹ [√n (x̄ − μ₀)]

is of the form

    (Nₚ)′ (Wₚ/d.f.)⁻¹ (Nₚ).

Since the multivariate normal is a generalization
of the normal and the Wishart is a generalization
of the χ², it is not very surprising that T² turns
out to have a (scaled) F distribution:

    T² ~ [(n − 1)p/(n − p)] F(p, n − p).
We reject H₀ in favor of H₁ if

    T² > [(n − 1)p/(n − p)] F_α(p, n − p),

where F_α(d.f.₁, d.f.₂) is the upper (100α)th
percentile of the F(d.f.₁, d.f.₂) distribution.
Notice that T² is n times the squared statistical
distance between x̄ and μ₀.
Example - Kite Data:
p = 2 variables (x₁ = wing length, x₂ = tail
length) measured on a sample of n = 45 birds.
Suppose we want to test

    H₀: μ = (190, 275)′ versus H₁: μ ≠ (190, 275)′.
Recall

    x̄ = [193.62]    S = [120.69  122.35]
        [279.78]        [122.35  208.54]

so S⁻¹ equals

    1/[120.69(208.54) − (122.35)²] × [ 208.54  −122.35]
                                     [−122.35   120.69].

So

    T² = 45 (193.62 − 190, 279.78 − 275) (1/10201)
             × [ 208.54  −122.35] (193.62 − 190)
               [−122.35   120.69] (279.78 − 275)
       = 5.543.

For a significance level of α = 0.05, the critical
value is

    [(n − 1)p/(n − p)] F_α(p, n − p) = [2(44)/43] F_{.05}(2, 43)
                                     = (2.047)(3.214) = 6.578,

so we do not reject H₀ and we conclude that
μ₀ = (190, 275)′ is plausible.
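The arithmetic above is easy to check numerically. A sketch reproducing the kite calculation from the summary statistics quoted in the notes (T² comes out ≈ 5.54, matching 5.543 up to rounding):

```python
import numpy as np
from scipy import stats

p, n = 2, 45
xbar = np.array([193.62, 279.78])  # sample mean vector from the notes
mu0 = np.array([190.0, 275.0])     # hypothesized mean
S = np.array([[120.69, 122.35],
              [122.35, 208.54]])   # sample covariance from the notes

diff = xbar - mu0
# T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0)
T2 = n * diff @ np.linalg.solve(S, diff)

# critical value (n-1)p/(n-p) * F_{.05}(p, n-p)
crit = (n - 1) * p / (n - p) * stats.f.ppf(0.95, p, n - p)
reject = T2 > crit  # False here: do not reject H0
```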
Hotelling's T² test is invariant under
transformations of the form

    y(p×1) = C(p×p) x(p×1) + d(p×1),

where C is nonsingular.
Kite Example (Continued):
So, for example, if we wanted to test the
corresponding hypothesis on size (y₁ = x₁ + x₂)
and shape (y₂ = x₁ − x₂) as we defined them before:

    y = [1   1] x,
        [1  −1]

we would test

    H₀: μ_y = (190 + 275, 190 − 275)′ = (465, −85)′.

For this problem, T² = 5.543 as before with the
same critical value, 6.578, so we do not reject H₀.
Note that

    det [1   1] = −2 ≠ 0  (nonsingular transformation).
        [1  −1]
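This invariance is easy to verify numerically: transform the data by any nonsingular C (plus a shift d), transform the hypothesized mean the same way, and T² is unchanged. A sketch on simulated data (illustrative, not the actual kite measurements):

```python
import numpy as np

def hotelling_T2(X, mu0):
    """Hotelling's T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    diff = X.mean(axis=0) - mu0
    S = np.cov(X, rowvar=False)
    return n * diff @ np.linalg.solve(S, diff)

rng = np.random.default_rng(1)
X = rng.normal(size=(45, 2)) + [190, 275]  # simulated stand-in for the kite data
mu0 = np.array([190.0, 275.0])

C = np.array([[1.0, 1.0], [1.0, -1.0]])  # size/shape transform; det = -2, nonsingular
d = np.zeros(2)
Y = X @ C.T + d                          # y_j = C x_j + d for every observation
T2_x = hotelling_T2(X, mu0)
T2_y = hotelling_T2(Y, C @ mu0 + d)      # identical up to floating-point error
```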
Why not just do p separate (univariate) t tests
for each component of x?
Separate tests do not take into account the
interrelationships (correlations) among the
components of x.
The multivariate test controls the type I error
rate α, the probability of rejecting H₀ when H₀ is
true.
- When doing multiple univariate tests, the
probability that at least one univariate null
hypothesis will be rejected by chance alone
increases with the number of tests being
performed.
- Procedures to account for this problem exist
(e.g., Bonferroni), but generally such procedures
are not as attractive as performing a single
multivariate test.
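The inflation of the error rate can be quantified: if the p univariate tests were independent, each at level α, the chance of at least one spurious rejection is 1 − (1 − α)^p. A quick illustration:

```python
def familywise_error(alpha, p):
    """P(at least one type I error) for p independent level-alpha tests."""
    return 1 - (1 - alpha) ** p

# Five independent tests at alpha = 0.05 already reject something
# spurious more than 22% of the time.
rate = familywise_error(0.05, 5)
```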
T² as a LRT:
So far the use of T² as a test statistic has been
motivated by analogy to the univariate case.
Could we do better? (Is T² optimal, e.g., in terms
of power?)
It turns out that T² does have several optimality
properties, because T² can also be derived as a
likelihood ratio test (LRT). The theory of LRTs
(don't worry) establishes a number of optimality
properties that T² inherits.
Likelihood Ratio Tests:
The likelihood function is just the density
function, but thought of as a function of the
parameters rather than of the data.
For example, consider the univariate normal
distribution based on a sample x₁, …, xₙ:
- We think of the density

    f(x₁, …, xₙ) = (2πσ²)^{−n/2} exp[−Σᵢ (xᵢ − μ)²/(2σ²)]

as a function of the data (the xᵢ's) for given
values of the parameters (μ, σ²), giving, in some
sense, the probability of observing x₁, …, xₙ for
given μ, σ². Hence, the argument of f is
x₁, …, xₙ.
- Actually, the density function involves both the
data and the parameters.
- Once we have the data, we want to do inference
concerning the parameters. Therefore, we can think
of the density as a function of the parameters
(μ, σ²) given the data (the xᵢ's).
- To avoid confusion when we think of the density
this way, we give it a new name, the likelihood
function, and write it as a function of the
parameters:

    L(μ, σ²) = (2πσ²)^{−n/2} exp[−Σᵢ (xᵢ − μ)²/(2σ²)].
The maximum likelihood approach to estimation says
to estimate the parameters with those values that
maximize the likelihood.
- For the univariate normal distribution, with
some calculus we can establish that the maximizers
of L(μ, σ²) (the maximum likelihood estimates, or
MLEs) are

    μ̂ = x̄ = (1/n) Σᵢ xᵢ,

    σ̂² = [(n − 1)/n] s² = (1/n) Σᵢ (xᵢ − x̄)².
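These closed forms are easy to verify empirically: the log-likelihood evaluated at (x̄, ((n − 1)/n)s²) beats nearby parameter values. A sketch on simulated data (illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=200)
n = len(x)

def loglik(mu, sigma2):
    """Log-likelihood of an iid N(mu, sigma2) sample."""
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

mu_hat = x.mean()                        # MLE of mu: the sample mean
sigma2_hat = np.mean((x - mu_hat) ** 2)  # MLE of sigma^2: (n-1)/n times s^2
```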
- For the multivariate normal, again some calculus
leads to the MLEs:

    μ̂ = x̄ = (1/n) Σᵢ xᵢ,

    Σ̂ = [(n − 1)/n] S = (1/n) Σᵢ (xᵢ − x̄)(xᵢ − x̄)′.
The constrained ML is the maximum of the
likelihood function when we assume H₀ is true;
that is, when we assume that μ = μ₀.
- Thus the constrained ML is the likelihood
evaluated at the constrained MLE, the estimator of
Σ that assumes μ = μ₀.
- The constrained MLEs are μ₀ and the Σ that
maximizes L(μ₀, Σ), which turns out (calculus) to
be

    Σ̂₀ = (1/n) Σᵢ (xᵢ − μ₀)(xᵢ − μ₀)′.
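The constrained MLE is computed just like Σ̂ but with deviations taken about μ₀ rather than x̄; the two are linked by Σ̂₀ = Σ̂ + (x̄ − μ₀)(x̄ − μ₀)′. A sketch:

```python
import numpy as np

def constrained_sigma_hat(X, mu0):
    """Constrained MLE of Sigma under H0: mu = mu0, i.e.
    (1/n) sum_i (x_i - mu0)(x_i - mu0)'."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    D = X - mu0          # deviations about mu0, not about xbar
    return D.T @ D / n
```

Compared with the unconstrained Σ̂, the extra rank-one term (x̄ − μ₀)(x̄ − μ₀)′ grows as x̄ moves away from μ₀, which is what drives the likelihood ratio down under a false H₀.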