Solutions To The Exercises On Principal Component Analysis
Laurenz Wiskott
Institut für Neuroinformatik
Ruhr-Universität Bochum, Germany, EU
4 February 2017
Contents
1 Intuition
1.2 Projection and reconstruction error
1.4.3 Exercise: From data distribution to second-moment matrix
1.5 Covariance matrix and higher order structure
2 Formalism
2.2 Matrix V^T: Mapping from high-dimensional old coordinate system to low-dimensional new coordinate system
2.3 Matrix V: Mapping from low-dimensional new coordinate system to subspace in old coordinate system
2.4 Matrix (V^T V): Identity mapping within new coordinate system
2.5 Matrix (V V^T): Projection from high- to low-dimensional (sub)space within old coordinate system
2.6 Variance
2.9 Eigenvalue equation of the covariance matrix
2.15.2 Exercise: From data distribution to second-moment matrix via the eigenvectors
2.15.3 Exercise: From data distribution to second-moment matrix via the eigenvectors
2.16 PCA Algorithm
3 Application
4 Acknowledgment
1 Intuition
How are mean m, variance v, and 2nd moment s related to each other? In other words, if the mean and variance of a one-dimensional distribution were given, how could you compute the corresponding 2nd moment?
Hint: Assume x to be the data values and x̄ their mean. Then play around with the corresponding expressions for the mean x̄ = ⟨x⟩, the variance ⟨(x − x̄)²⟩, and the second moment ⟨x²⟩.
Solution: Let x be the data values and x̄ their mean. For the second moment we then get
s = ⟨x²⟩ (1)
  = ⟨((x − x̄) + x̄)²⟩ (2)
  = ⟨(x − x̄)² + 2(x − x̄)x̄ + x̄²⟩ (3)
  = ⟨(x − x̄)²⟩ + ⟨2(x − x̄)x̄⟩ + ⟨x̄²⟩ (4)
  = ⟨(x − x̄)²⟩ + 2(⟨x⟩ − x̄)x̄ + x̄²  (and ⟨x⟩ = x̄, so the middle term vanishes) (5)
  = ⟨(x − x̄)²⟩ + x̄² (6)
  = v + m² . (7)
Thus, the 2nd moment is the sum of the variance and the square of the mean.
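This relation is easy to confirm numerically; here is a minimal sketch (my own, not part of the original solutions), using NumPy:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=100_000)  # arbitrary test data

    s = np.mean(x**2)   # second moment <x^2>
    v = np.var(x)       # variance <(x - mean)^2>
    m = np.mean(x)      # mean <x>
    print(s, v + m**2)  # both values agree up to floating-point rounding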
Calculate the second moment of a uniform, i.e. flat, distribution in [−1, +1]. This is a distribution where
every value between −1 and +1 is equally likely and other values are impossible.
Solution: The density of the uniform distribution on [−1, +1] is 1/2, so the second moment (which here equals the variance, since the mean is zero) is

⟨x²⟩ = ∫_{−1}^{+1} x²/2 dx = [x³/6]_{−1}^{+1} = 1/3 .

This might be a bit surprising, since one might think that such a distribution has a standard deviation of 0.5 and therefore a variance of 0.5² = 0.25. However, due to the square in the second moment, larger values are weighted more than smaller values. Thus, the variance of this distribution is 1/3 and its standard deviation 1/√3 ≈ 0.577.
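A one-line Monte Carlo check of the value 1/3 (again my own sketch, not from the original text):

    import numpy as np

    x = np.random.default_rng(0).uniform(-1.0, 1.0, size=1_000_000)
    print(np.mean(x**2))  # ≈ 0.333, i.e. 1/3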
1.2 Projection and reconstruction error
x_∥ = v v^T x , (1)

where x is the data point and v is the unit vector along the principal axis of the projection. Show that the difference vector between the data point and the projected data point,

x_⊥ = x − x_∥ , (2)

is orthogonal to v.
Solution: Not available!
2. Give a reason why the orthogonality of the two vectors is useful.
Solution: Not available!
Why should the reconstruction error, E, be defined as the mean of the squared difference of the original and
reconstructed data vectors, and not simply the mean of the difference or the mean of the absolute difference?
Solution: In the mean of the difference, positive errors can cancel out with negative errors, and a poor
solution might have a low error value, which would render the error function useless.
The mean of the absolute difference does not have this flaw and might actually be a reasonable error function.
However, the square is mathematically more convenient than the absolute value in many ways, for instance,
the derivative is well defined everywhere. Thus, the square is more practical. (It also has a close relationship
to Gaussian noise, which would be a bit more involved to explain.)
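To illustrate the cancellation problem, consider a small sketch (my own): a reconstruction that is clearly poor still scores zero under the plain mean difference, while the other two error functions expose it.

    import numpy as np

    x     = np.array([1.0, -1.0, 2.0, -2.0])  # original data
    x_rec = -x                                # poor "reconstruction": all signs flipped

    diff = x - x_rec
    print(np.mean(diff))          # 0.0  -> plain mean difference misses the error
    print(np.mean(np.abs(diff)))  # 3.0  -> absolute error detects it
    print(np.mean(diff**2))       # 10.0 -> squared error detects it and is differentiable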
For a set of data vectors x^µ, µ = 1, ..., M, the second-moment matrix C is defined as C_ij := ⟨x_i^µ x_j^µ⟩_µ. What are the upper and lower limits of C_ij if C_ii and C_jj are known?
Solution: Interpret x_i^µ and x_j^µ as two M-dimensional vectors, like x_i := (x_i^1, x_i^2, ..., x_i^M), and let the inner product be defined as

(x_i, x_j) := (1/M) Σ_µ x_i^µ x_j^µ . (1)
Then

C_ii = (1/M) Σ_µ x_i^µ x_i^µ = (x_i, x_i) = ‖x_i‖² , (2)
C_jj = (1/M) Σ_µ x_j^µ x_j^µ = (x_j, x_j) = ‖x_j‖² , (3)
C_ij = (1/M) Σ_µ x_i^µ x_j^µ = (x_i, x_j) = ‖x_i‖ ‖x_j‖ cos(α) , (4)

⟹ |C_ij| ≤ √(C_ii C_jj)  (since −1 ≤ cos(α) ≤ 1) . (5)
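A quick numerical illustration of this Cauchy–Schwarz bound (my own sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))  # M = 1000 data vectors x^mu in two dimensions
    X[:, 1] += 0.7 * X[:, 0]        # correlate the two components

    C = X.T @ X / len(X)            # C_ij = <x_i^mu x_j^mu>_mu
    print(abs(C[0, 1]) <= np.sqrt(C[0, 0] * C[1, 1]))  # True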
Give an estimate of the second moment matrix for the following data distributions.
(Figure: three example data distributions in the (x_1, x_2)-plane; axes marked at 1.)
1. x_1 and x_2 are uncorrelated and the second-moment matrix is therefore diagonal. The first component is
a uniform distribution between −1 and +1, which we know has variance 1/3. The second component
is a uniform distribution between −1/4 and +1/4, the variance of which is therefore scaled by 1/16
compared to that of the first one resulting in a variance of 1/48. Thus,
C ≈ [0.33, 0; 0, 0.02] . (1)
2. x_1 and x_2 are again uncorrelated, plus the distribution is rotation symmetric, so that the variances are identical. The distribution of one component lies between −1 and +1, thus the variance is less than 1, but it is concentrated towards the ends, thus the variance is greater than 1/3. Let's guess 0.5. Thus,
C ≈ [0.5, 0; 0, 0.5] . (2)
3. If we just considered the mean of the distribution, the second-moment matrix would have the values (−1)² = 1, −1 · 0.5 = −0.5, 0.5 · (−1) = −0.5, and 0.5² = 0.25. We also know the variances of the distribution are 1/3 scaled by 1/4² = 1/16, because it has a width of 1/4 + 1/4 = 1/2 in both directions. Adding this to the diagonal elements of the second-moment matrix of the mean yields

C ≈ [1.02, −0.5; −0.5, 0.27] , (3)

assuming that the off-diagonal elements are not affected by the variance of the distribution.
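The decomposition used here, second moment = mean term plus variance term, can be verified numerically; a minimal sketch (my own reconstruction of a matching distribution):

    import numpy as np

    rng = np.random.default_rng(0)
    m = np.array([-1.0, 0.5])                            # mean of the cluster
    X = m + rng.uniform(-0.25, 0.25, size=(100_000, 2))  # uniform spread of width 1/2

    C = X.T @ X / len(X)                                 # empirical second-moment matrix
    C_pred = np.outer(m, m) + np.diag([1/48, 1/48])      # mean term plus per-axis variance
    print(np.round(C, 2))       # ≈ [[1.02, -0.5], [-0.5, 0.27]]
    print(np.round(C_pred, 2))  # same values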
1.4.3 Exercise: From data distribution to second-moment matrix
Give an estimate of the second moment matrix for the following data distributions.
(Figure: three example data distributions in the (x_1, x_2)-plane; axes marked at 1.)
(a) x_2 is apparently uniformly distributed in [−1, +1] and thus has a 2nd moment of C_22 ≈ 1/3. Since x_1 ≈ −x_2/2, its 2nd moment is C_11 = ⟨x_1 x_1⟩ ≈ ⟨(−x_2/2) · (−x_2/2)⟩ ≈ C_22/4 ≈ 1/12 and the mixed 2nd moments are C_12 = C_21 ≈ −C_22/2 ≈ −1/6. Thus,

C ≈ [1/12, −1/6; −1/6, 1/3] . (1)

(A numerical cross-check of this estimate follows after item (c).)
(b) If all points were exactly at x = (0.5, 1)^T, the 2nd-moment matrix would simply be

C = [0.25, 0.5; 0.5, 1] . (2)

Since the points are slightly spread in the x_2-direction, C_22 is slightly increased from 1 to, let's say, 1.1 (the 2nd moment of a variable is the square of its mean plus its variance). The other values are not affected. This is obvious for C_11 but also true for C_12 = C_21 = ⟨x_1 x_2⟩ = ⟨0.5 x_2⟩ = 0.5 ⟨x_2⟩ = 0.5 · 1 = 0.5. Thus,

C ≈ [0.25, 0.5; 0.5, 1.1] . (3)
(c) This distribution has a rotation symmetry of 120°, and since the variance does not change under a rotation of 180°, the directional variance of the distribution has a rotation symmetry of 60°. This effectively means that the directional variance, which in general is an ellipse, must be a circle. Thus the 2nd-moment matrix is diagonal with C_11 = C_22. If one projects the data onto the x_1-axis one might guess that C_11 is slightly larger than for a uniform distribution in [−1, +1]. Thus,

C ≈ [0.4, 0; 0, 0.4] . (4)
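As announced in item (a), a numerical cross-check (my own sketch; it simulates the distribution as read off the figure):

    import numpy as np

    rng = np.random.default_rng(0)
    x2 = rng.uniform(-1.0, 1.0, size=100_000)
    x1 = -x2 / 2                   # the linear relation guessed from the figure
    X = np.stack([x1, x2], axis=1)

    C = X.T @ X / len(X)           # second-moment matrix
    print(np.round(C, 3))          # ≈ [[1/12, -1/6], [-1/6, 1/3]]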
Draw a data distribution qualitatively consistent with the following second-moment matrices C.
(a) C = [1, −0.5; −0.5, 1]   (b) C = [1, 0; 0, 0.5]   (c) C = [1, 1; 1, 1]
Solution:
(Figure: three hand-drawn distributions (a), (b), (c) in the (x_1, x_2)-plane, axes from −1 to 1; © CC BY-SA 4.0)
The fat red squares indicate minimal sets of data points to generate the second-moment matrices.
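For matrix (c), for instance, such a minimal set is easy to verify; a sketch (my own choice of points, not necessarily the ones drawn in the figure):

    import numpy as np

    # A symmetric pair suffices for the rank-1 matrix (c): (1, 1) and its flipped copy.
    X = np.array([[1.0, 1.0], [-1.0, -1.0]])
    C = X.T @ X / len(X)
    print(C)  # [[1. 1.] [1. 1.]] reproduces matrix (c)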
1. Define a procedure by which you can turn any mean-free data distribution into a distribution with
finite (non-zero) mean but identical second-moment matrix. (Are there exceptions?)
Solution: If we flip a data point µ at the origin, i.e. if we replace x^µ by −x^µ, the second-moment matrix does not change, since x_i^µ x_j^µ = (−x_i^µ)(−x_j^µ). Thus, if we flip each point with negative first component, then the second-moment matrix has not changed but the first component of the mean should be positive. If the first component is always negative we can do the flipping with any other suitable component.
Only if all components are always zero are we stuck and cannot produce a non-zero-mean data dis-
tribution with identical second-moment matrix. In this case the second-moment matrix would be
zero.
2. Conversely, define a procedure by which you can turn any data distribution with finite mean into a
distribution with zero mean but identical second-moment matrix. (Are there exceptions?)
Solution: Here one can use the same trick as in the first part. However, one not only flips data points but also copies them. Thus, for each data point x^µ a flipped one x^{µ+M} := −x^µ is added. The second-moment matrix does not change but the mean vanishes.
There is no exception for this method. It always works.
Hint: Think about what happens if you flip a point µ at the origin, i.e. if you replace x^µ by −x^µ in the data set.
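A short sketch (my own) demonstrating the flip-and-copy construction from part 2:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(loc=[2.0, -1.0], size=(500, 2))  # data with clearly non-zero mean

    X_sym = np.vstack([X, -X])                      # add a flipped copy of every point

    C_orig = X.T @ X / len(X)
    C_sym = X_sym.T @ X_sym / len(X_sym)
    print(np.allclose(C_orig, C_sym))  # True: second-moment matrix unchanged
    print(X_sym.mean(axis=0))          # ≈ [0, 0]: the mean vanishes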
1.5 Covariance matrix and higher order structure
2 Formalism
Show that

‖v‖² = Σ_{i=1}^{N} v_i² . (2)
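No solution is given in the text; a one-line sketch: with the Euclidean inner product,

‖v‖² = (v, v) = v^T v = Σ_{i=1}^{N} v_i v_i = Σ_{i=1}^{N} v_i² .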
2.4 Matrix (V^T V): Identity mapping within new coordinate system
2.6 Variance
Show that a second-moment matrix C := ⟨x^µ (x^µ)^T⟩_µ is always positive semi-definite, i.e. for each vector v we find v^T C v ≥ 0. For which vectors v does v^T C v = 0 hold?
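No solution is given here either; a minimal sketch of the standard argument:

v^T C v = v^T ⟨x^µ (x^µ)^T⟩_µ v = ⟨(v^T x^µ)²⟩_µ ≥ 0 ,

with equality exactly for those vectors v that are orthogonal to every data point x^µ.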
2.9 Eigenvalue equation of the covariance matrix
Prove that the eigenvectors of a symmetric matrix are orthogonal, if their eigenvalues are different. Proceed
as follows:
1. Let A be a symmetric N-dimensional matrix, i.e. A = A^T. Show first that (v, Aw) = (Av, w) for any vectors v, w ∈ ℝ^N, with (·, ·) indicating the Euclidean inner product.

Solution:

(v, Aw) = v^T A w = v^T A^T w = (Av)^T w = (Av, w) . (1)

2. Let {a_i} be the eigenvectors of the matrix A with the eigenvalues λ_i. Show with the help of part one that (a_i, a_j) = 0 if λ_i ≠ λ_j.

Hint: λ_i (a_i, a_j) = ...

Solution:

λ_i (a_i, a_j) = (λ_i a_i, a_j) = (A a_i, a_j) = (a_i, A a_j) = (a_i, λ_j a_j) = λ_j (a_i, a_j)  (using (1) for the third step) (2)

⟹ (a_i, a_j) = 0 if λ_i ≠ λ_j . (3)
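A quick numerical illustration (my own sketch): the eigenvectors of a random symmetric matrix, as returned by NumPy, are mutually orthogonal.

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.normal(size=(4, 4))
    A = B + B.T                               # a symmetric matrix

    _, eigvecs = np.linalg.eigh(A)            # eigh handles symmetric matrices
    print(np.round(eigvecs.T @ eigvecs, 10))  # identity: the a_i are orthonormal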
The second-moment matrix is

C = (1/3) ( x^1 (x^1)^T + x^2 (x^2)^T + x^3 (x^3)^T ) (5)
  = (1/3) ( (−3, 2)^T (−3, 2) + (1, −1)^T (1, −1) + (−2, 3)^T (−2, 3) ) (6)
  = (1/3) ( [9, −6; −6, 4] + [1, −1; −1, 1] + [4, −6; −6, 9] ) (7)
  = (1/3) [14, −13; −13, 14] . (8)
(Figure: the three data points (blue) with the two eigenvectors (red arrows); © CC BY-SA 4.0)

The symmetry of the data points (blue points) indicates that two eigenvectors are c_1 = (1/√2) (−1, 1)^T and c_2 = (1/√2) (1, 1)^T (red arrows). Multiplying with the second-moment matrix verifies the eigenvectors and provides the eigenvalues.
C c_1 = (1/(3√2)) [14, −13; −13, 14] (−1, 1)^T (9)
      = (1/(3√2)) (−27, 27)^T = (1/√2) (−9, 9)^T = 9 c_1 (10)

⟹ λ_1 = 9 , (11)

C c_2 = (1/(3√2)) [14, −13; −13, 14] (1, 1)^T (12)
      = (1/(3√2)) (1, 1)^T = (1/3) c_2 (13)

⟹ λ_2 = 1/3 . (14)
y^µ = c_α^T x^µ , (16)
Hint: You don’t have to compute the projected data. There is a simpler way.
Solution: First consider the general equations. The first moment is

⟨y^µ⟩_µ = c_α^T ⟨x^µ⟩_µ ,

and the second moments, ⟨(y^µ)²⟩_µ = c_α^T C c_α = λ_α, are simply the eigenvalues 9 and 1/3.
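The numbers in this worked example are easy to verify; a minimal sketch (my own):

    import numpy as np

    X = np.array([[-3.0, 2.0], [1.0, -1.0], [-2.0, 3.0]])  # the three data points
    C = X.T @ X / len(X)           # = (1/3) [[14, -13], [-13, 14]]

    c1 = np.array([-1.0, 1.0]) / np.sqrt(2)
    c2 = np.array([1.0, 1.0]) / np.sqrt(2)
    print(C @ c1 / c1)             # [9. 9.]       -> lambda_1 = 9
    print(C @ c2 / c2)             # [0.333 0.333] -> lambda_2 = 1/3

    y1 = X @ c1                    # data projected onto c1
    print(np.mean(y1**2))          # 9.0: the second moment equals lambda_1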
2.15.2 Exercise: From data distribution to second-moment matrix via the eigenvectors
Give an estimate of the second-moment matrix for the following data distributions by first guessing the
eigenvalues and normalized eigenvectors from the distribution and then calculating the matrix.
(Figure: three example data distributions in the (x_1, x_2)-plane; axes marked at 1.)
(a) The two coefficients x_1 and x_2 are uncorrelated and therefore valid eigenvectors lie along the axes, i.e. u_1 = (1, 0)^T, u_2 = (0, 1)^T, resulting in U = 1. Since x_1 is uniformly distributed in [−1, +1], its variance is 1/3, thus λ_1 = 1/3. The other coefficient, x_2, is compressed by a factor of about 4, resulting in a variance that is a factor of 4² smaller, thus λ_2 = 1/48. The 2nd-moment matrix is therefore

C = U Λ U^T (1)
  = Λ  (since U = 1) (2)
  = [1/3, 0; 0, 1/48] (3)
  ≈ [0.33, 0; 0, 0.02] . (4)
(b) The two coefficients x_1 and x_2 are uncorrelated and therefore valid eigenvectors lie along the axes, i.e. u_1 = (1, 0)^T, u_2 = (0, 1)^T, resulting in U = 1. However, since the variance is the same in all directions for symmetry reasons, any other set of orthogonal unit vectors would do as well. If one projects the data onto one of the axes one sees that a single coefficient is not uniformly distributed but is heavier near ±1. Thus, the variance might be about 1/2 instead of 1/3 and λ_1 = λ_2 = 1/2. Therefore

C = U Λ U^T (5)
  = Λ  (since U = 1) (6)
  = [1/2, 0; 0, 1/2] . (7)
(c) This distribution clearly has its largest 2nd moment (not variance) in the direction of u_1 = (1/√(5/4)) (−1, 1/2)^T = (1/√5) (−2, 1)^T, and the corresponding value is a bit more than 1, let's say λ_1 = 61/48. The second eigenvector must be orthogonal to the first one, for instance u_2 = (1/√5) (1, 2)^T, and the corresponding 2nd moment (in this case even variance) is much smaller, let's say λ_2 = 1/48. The 2nd-moment matrix is therefore

C = U Λ U^T (8)
  = (u_1, u_2) diag(λ_1, λ_2) (u_1, u_2)^T (9)
  = (1/√5) [−2, 1; 1, 2] [61/48, 0; 0, 1/48] (1/√5) [−2, 1; 1, 2] (10)
  = (1/(5·48)) [−2, 1; 1, 2] [61, 0; 0, 1] [−2, 1; 1, 2] (11)
  = (1/(5·48)) [−2, 1; 1, 2] [−122, 61; 1, 2] (12)
  = (1/(5·48)) [245, −120; −120, 65] (13)
  = (1/48) [49, −24; −24, 13] (14)
  ≈ [1.02, −0.5; −0.5, 0.27] . (15)
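This product is again easy to check numerically; a brief sketch (my own):

    import numpy as np

    U = np.array([[-2.0, 1.0], [1.0, 2.0]]) / np.sqrt(5)  # columns u_1, u_2
    L = np.diag([61/48, 1/48])                            # guessed eigenvalues

    print(np.round(U @ L @ U.T, 2))  # [[ 1.02 -0.5 ] [ -0.5 0.27]]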
2.15.3 Exercise: From data distribution to second-moment matrix via the eigenvectors
Give an estimate of the second-moment matrix for the following data distributions by first guessing the
eigenvalues and normalized eigenvectors from the distribution and then calculating the matrix.
(Figure: three example data distributions in the (x_1, x_2)-plane; © CC BY-SA 4.0)
Solution: Not available!
Given some data in ℝ³ with the corresponding 3 × 3 second-moment matrix C with eigenvectors c_α and eigenvalues λ_α, with λ_1 = 3, λ_2 = 1 and λ_3 = 0.2.
1. Define a matrix A ∈ ℝ^{2×3} that maps the data into a two-dimensional space while preserving as much variance as possible.

Solution: The dimension with least variance is spanned by the eigenvector of λ_3. The two-dimensional subspace with largest variance is spanned by the eigenvectors of λ_1 and λ_2. The corresponding matrix reads

A := [c_1^T; c_2^T] . (1)
2. Define a matrix B ∈ ℝ^{3×2} that places the reduced data back into ℝ³ with minimal reconstruction error. How large is the reconstruction error?

Solution: Embedding the reduced data back into ℝ³ is done again with the eigenvectors:

B := (c_1, c_2) . (2)

The reconstruction error is the sum over the eigenvalues of the neglected eigenvectors, which is λ_3 = 0.2 in this case.
3. Prove that AB is an identity matrix. Why would one expect that intuitively?

Solution: Intuitively the matrix AB corresponds to a mapping from ℝ² into ℝ³ and back again. No information is lost in this process, which means that AB should be the identity matrix. We can also show formally that

AB = [c_1^T; c_2^T] (c_1, c_2) = [c_1^T c_1, c_1^T c_2; c_2^T c_1, c_2^T c_2] = [1, 0; 0, 1] . (3)
To show that BA is not the identity matrix we multiply it with the third eigenvector.
BA c_3 = (c_1, c_2) [c_1^T; c_2^T] c_3 (5)
       = (c_1, c_2) (0, 0)^T  (since c_3 is orthogonal to c_1 and c_2) (6)
       = 0 (7)
       ≠ c_3 . (8)
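A small numerical sketch (my own; it builds A and B from an arbitrary orthonormal basis, standing in for the eigenvectors):

    import numpy as np

    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # columns play the role of c_1, c_2, c_3

    A = Q[:, :2].T     # 2x3, rows c_1^T, c_2^T
    B = Q[:, :2]       # 3x2, columns c_1, c_2

    print(np.allclose(A @ B, np.eye(2)))  # True:  AB is the 2x2 identity
    print(np.allclose(B @ A, np.eye(3)))  # False: BA is only a projection onto a subspace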
2.16 PCA Algorithm
Prove that sphered zero-mean data x̂ projected onto two orthogonal vectors n_1 and n_2 is uncorrelated.

Hint: The correlation coefficient for two scalar data sets y_1 and y_2 with means ȳ_i := ⟨y_i⟩ is defined as

r := ⟨(y_1 − ȳ_1)(y_2 − ȳ_2)⟩ / ( √⟨(y_1 − ȳ_1)²⟩ √⟨(y_2 − ȳ_2)²⟩ ) . (1)
Solution: Projecting the data x̂ onto the vectors n_1 and n_2, which we assume are normalized without loss of generality, yields

y_i = n_i^T x̂ , (2)

which is zero-mean because x̂ is zero-mean. For the numerator of the correlation coefficient we get

⟨(y_1 − ȳ_1)(y_2 − ȳ_2)⟩ = ⟨y_1 y_2⟩  (since the y_i are zero-mean) (3)
  = ⟨(n_1^T x̂)(n_2^T x̂)⟩  (by (2)) (4)
  = ⟨(n_1^T x̂)(x̂^T n_2)⟩ (5)
  = n_1^T ⟨x̂ x̂^T⟩ n_2 (6)
  = n_1^T 1 n_2  (since x̂ is sphered) (7)
  = n_1^T n_2 (8)
  = 0  (since n_1 and n_2 are orthogonal) . (9)
This proves the assertion for data with finite variance. If the variance of the data is zero then the denominator
is zero and the correlation is not defined.
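The statement can be checked end to end with a small sphering (whitening) sketch (my own; any two orthogonal directions work):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])  # correlated data
    X -= X.mean(axis=0)                     # make the data zero-mean

    # Sphere the data so that <x x^T> becomes (approximately) the identity.
    eigval, eigvec = np.linalg.eigh(X.T @ X / len(X))
    X_hat = X @ eigvec / np.sqrt(eigval)

    n1 = np.array([1.0, 1.0]) / np.sqrt(2)  # two orthogonal unit vectors
    n2 = np.array([1.0, -1.0]) / np.sqrt(2)
    print(np.corrcoef(X_hat @ n1, X_hat @ n2)[0, 1])  # ≈ 0: uncorrelated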
3 Application
4 Acknowledgment