PCA Explained
Factor and Component Analysis,
esp. Principal Component Analysis (PCA)
Why Factor or Component Analysis?
• We study phenomena that cannot be directly observed
– ego, personality, intelligence in psychology
– Underlying factors that govern the observed data
[Figure: PC 1 and PC 2 as rotated orthogonal axes in the plane of Original Variables A and B]
• Properties
– It can be viewed as a rotation of the existing axes to new positions in the
space defined by original variables
– New axes are orthogonal and represent the directions with maximum
variability
Computing the Components
• Data points are vectors in a multidimensional space
• Projection of vector x onto an axis (dimension) u is u.x
• Direction of greatest variability is that in which the average
square of the projection is greatest
– i.e. u such that E((u.x)^2) over all x is maximized
– (we subtract the mean along each dimension, and center the
original axis system at the centroid of all data points, for simplicity)
– This direction of u is the direction of the first Principal Component
Computing the Components
• E((u.x)^2) = E((u.x)(u.x)T) = E(u.xxT.uT)
• The matrix C = x.xT contains the correlations (similarities) of the
original axes based on how the data values project onto them
• So we are looking for u that maximizes uCuT, subject to u
being unit-length
• It is maximized when u is the principal eigenvector of the
matrix C, in which case
– uCuT = λuuT = λ if u is unit-length, where λ is the principal
eigenvalue of the correlation matrix C
– The eigenvalue denotes the amount of variability captured along that
dimension
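This maximization can be checked numerically; a minimal sketch in numpy, with illustrative data (the variable names are not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 2-D data with one dominant direction, then centered
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 1.0]])
X -= X.mean(axis=0)

C = X.T @ X / len(X)                 # the matrix C of the slide
evals, evecs = np.linalg.eigh(C)
u = evecs[:, np.argmax(evals)]       # principal eigenvector, unit length

# uCu' equals the principal eigenvalue, and any other unit vector
# captures no more variance than u does
assert np.isclose(u @ C @ u, evals.max())
v = np.array([1.0, 0.0])             # some other unit-length direction
assert v @ C @ v <= u @ C @ u
```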
Why the Eigenvectors?
Maximise uTxxTu s.t. uTu = 1
Construct the Lagrangian uTxxTu − λuTu
Set the vector of partial derivatives to zero:
xxTu − λu = (xxT − λI)u = 0
Since u ≠ 0, u must be an eigenvector of xxT with
eigenvalue λ
Singular Value Decomposition
The first root is called the principal eigenvalue, which has an
associated orthonormal (uTu = 1) eigenvector u
Subsequent roots are ordered such that λ1 > λ2 > … > λM, with
rank(D) non-zero values.
Eigenvectors form an orthonormal basis, i.e. uiTuj = δij
The eigenvalue decomposition of xxT = UΣUT
where U = [u1, u2, …, uM] and Σ = diag[λ1, λ2, …, λM]
Similarly, the eigenvalue decomposition of xTx = VΣVT
The SVD is closely related to the above: x = U Σ^(1/2) VT
The left eigenvectors U, right eigenvectors V,
singular values = square roots of the eigenvalues.
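The relationship between the eigen-decomposition of xxT and the SVD of x can be verified numerically; a small sketch with an arbitrary random data matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))            # illustrative data matrix

# SVD: X = U diag(s) Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The eigenvalues of X X^T are the squared singular values of X
evals = np.linalg.eigvalsh(X @ X.T)    # returned in ascending order
assert np.allclose(np.sort(evals)[::-1], s ** 2)
```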
Computing the Components
• Similarly for the next axis, etc.
• So, the new axes are the eigenvectors of the matrix
of correlations of the original variables, which
captures the similarities of the original variables
based on how data samples project to them
[Figure: scree plot showing the variance (%) captured by each of PC1 through PC10]
• Factor Analysis
– Also known as Principal Axis Analysis
– Analyses only the shared variance: no error (unique) variance!
Vector Space Model for Documents
• Documents are vectors in multi-dim Euclidean space
– Each term is a dimension in the space: LOTS of them
– Coordinate of doc d in dimension t is prop. to TF(d,t)
• TF(d,t) is term frequency of t in document d
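A term-document matrix of this kind can be sketched in a few lines; the toy corpus below is hypothetical:

```python
import numpy as np

# Hypothetical toy corpus; each document becomes one column of the matrix
docs = ["feedback controller design",
        "observer feedback",
        "transfer function realization"]
terms = sorted({t for d in docs for t in d.split()})

# TF(d, t): coordinate of document d in dimension t is its term frequency
M = np.array([[d.split().count(t) for d in docs] for t in terms])
# M is |terms| x |docs|; column j is the vector for document j
```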
(In the sense of having to find quantities that are not directly observable.
Similarly, transcription factors in biology act as unobservable causal bridges
between experimental conditions and gene expression.)
Computing and Using LSI
Terms × Documents matrix: M = U S Vt ≈ Uk Sk Vkt
• Singular Value Decomposition (SVD):
convert the term-document matrix into three matrices U, S and V
• Reduce dimensionality:
throw out low-order rows and columns, keeping Uk, Sk and Vkt
• Recreate the matrix:
multiply Uk Sk Vkt to produce an approximate term-document matrix;
use the new matrix to process queries, or, better, map queries to the
reduced space
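These steps can be sketched with a plain SVD; `lsi_approximation` is an illustrative helper, not part of any standard library:

```python
import numpy as np

def lsi_approximation(M, k):
    """Rank-k (truncated SVD) approximation of a term-document matrix M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    return Uk @ Sk @ Vtk, Uk, Sk, Vtk

# Tiny illustrative term-document matrix (terms x documents)
M = np.array([[1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 1., 1.],
              [0., 0., 1., 1.]])
M2, Uk, Sk, Vtk = lsi_approximation(M, k=2)
# M2 is the best rank-2 approximation of M in the least-squares sense
```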
Following the Example

Term-document matrix (terms × chapters ch2–ch9):

term              ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9
controllability    1   1   0   0   1   0   0   1
observability      1   0   0   0   1   1   0   1
realization        1   0   1   0   1   0   1   0
feedback           0   1   0   0   0   1   0   0
controller         0   1   0   0   1   1   0   0
observer           0   1   1   0   1   1   0   0
transfer function  0   0   0   0   1   1   0   0
polynomial         0   0   0   0   1   0   1   0
matrices           0   0   0   0   1   0   1   1

This happens to be a rank-7 matrix, so USVT = U7S7V7T.

U (9x7) =
0.3996 -0.1037  0.5606 -0.3717 -0.3919 -0.3482  0.1029
0.4180 -0.0641  0.4878  0.1566  0.5771  0.1981 -0.1094
0.3464 -0.4422 -0.3997 -0.5142  0.2787  0.0102 -0.2857
0.1888  0.4615  0.0049 -0.0279 -0.2087  0.4193 -0.6629
0.3602  0.3776 -0.0914  0.1596 -0.2045 -0.3701 -0.1023
0.4075  0.3622 -0.3657 -0.2684 -0.0174  0.2711  0.5676
0.2750  0.1667 -0.1303  0.4376  0.3844 -0.3066  0.1230
0.2259 -0.3096 -0.3579  0.3127 -0.2406 -0.3122 -0.2611
0.2958 -0.4232  0.0277  0.4305 -0.3800  0.5114  0.2010

S (7x7) = diag(3.9901, 2.2813, 1.6705, 1.3522, 1.1818, 0.6623, 0.6487)

V (8x7) =
 0.2917 -0.2674  0.3883 -0.5393  0.3926 -0.2112 -0.4505
 0.3399  0.4811  0.0649 -0.3760 -0.6959 -0.0421 -0.1462
 0.1889 -0.0351 -0.4582 -0.5788  0.2211  0.4247  0.4346
-0.0000 -0.0000 -0.0000 -0.0000  0.0000 -0.0000  0.0000
 0.6838 -0.1913 -0.1609  0.2535  0.0050 -0.5229  0.3636
 0.4134  0.5716 -0.0566  0.3383  0.4493  0.3198 -0.2839
 0.2176 -0.5151 -0.4369  0.1694 -0.2893  0.3161 -0.5330
 0.2791 -0.2591  0.6442  0.1593 -0.1648  0.5455  0.2998
Low-rank approximations (compare with the original term-document matrix above):

U4 S4 V4T (K = 4, 3 components ignored):
 1.1630535    0.67789733   0.17131016   0.0  0.85744447   0.30088043  -0.025483057  1.0295205
 0.7278324    0.46981966  -0.1757451    0.0  1.0910251    0.6314231    0.11810507   1.0620605
 0.78863835   0.20257005   1.0048805    0.0  1.0692837   -0.20266426   0.9943222    0.106248446
-0.03825318   0.7772852    0.12343567   0.0  0.30284256   0.89999276  -0.3883498   -0.06326774
 0.013223715  0.8118903    0.18630582   0.0  0.8972661    1.1681904   -0.027708884  0.11395822
 0.21186034   1.0470067    0.76812166   0.0  0.960058     1.0562774    0.1336124   -0.2116417
-0.18525022   0.31930918  -0.048827052  0.0  0.8625925    0.8834896    0.23821498   0.1617572
-0.008397698 -0.23121      0.2242676    0.0  0.9548515    0.14579195   0.89278513   0.1167786
 0.30647483  -0.27917668  -0.101294056  0.0  1.1318822    0.13038804   0.83252335   0.70210195

U6 S6 V6T (K = 6, one component ignored):
 1.0299273    1.0099105   -0.029033005  0.0  0.9757162    0.019038305  0.035608776  0.98004794
 0.96788234  -0.010319378  0.030770123  0.0  1.0258299    0.9798115   -0.03772955   1.0212346
 0.9165214   -0.026921304  1.0805727    0.0  1.0673982   -0.052518982  0.9011715    0.055653755
-0.19373542   0.9372319    0.1868434    0.0  0.15639876   0.87798584  -0.22921464   0.12886547
-0.029890355  0.9903935    0.028769515  0.0  1.0242295    0.98121595  -0.03527296   0.020075336
 0.16586632   1.0537577    0.8398298    0.0  0.8660687    1.1044582    0.19631699  -0.11030859
 0.035988174  0.01172187  -0.03462495   0.0  0.9710446    1.0226605    0.04260301  -0.023878671
-0.07636017  -0.024632007  0.07358454   0.0  1.0615499   -0.048087567  0.909685     0.050844945
 0.05863098   0.019081593 -0.056740552  0.0  0.95253044   0.03693092   1.0695065    0.96087193
Querying

U2 (9x2) =
0.3996 -0.1037
0.4180 -0.0641
0.3464 -0.4422
0.1888  0.4615
0.3602  0.3776
0.4075  0.3622
0.2750  0.1667
0.2259 -0.3096
0.2958 -0.4232

S2 (2x2) =
3.9901 0
0 2.2813

V2 (8x2) =
 0.2917 -0.2674
 0.3399  0.4811
 0.1889 -0.0351
-0.0000 -0.0000
 0.6838 -0.1913
 0.4134  0.5716
 0.2176 -0.5151
 0.2791 -0.2591

To query for feedback controller, the query vector would be
q = [0 0 0 1 1 0 0 0 0]' (' indicates transpose)

Let q be the query vector. Then the document-space vector
corresponding to q is given by:
Dq = q'*U2*inv(S2)
This is a point at the centroid of the query terms' positions in the
new space.

For the feedback controller query vector, the result is:
Dq = [0.1376 0.3678]

To find the best document match, we compare the Dq vector against all
the document vectors in the 2-dimensional V2 space. The document
vector that is nearest in direction to Dq is the best match. The
cosine values for the eight document vectors and the query vector are:
-0.3747 0.9671 0.1735 -0.9413 0.0851 0.9642 -0.7265 -0.3805
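The query-folding step can be reproduced directly from the rank-2 factors of the worked example; a sketch in numpy, with the values copied from the slides:

```python
import numpy as np

# Rank-2 factors from the worked example
U2 = np.array([[0.3996, -0.1037], [0.4180, -0.0641], [0.3464, -0.4422],
               [0.1888,  0.4615], [0.3602,  0.3776], [0.4075,  0.3622],
               [0.2750,  0.1667], [0.2259, -0.3096], [0.2958, -0.4232]])
S2 = np.diag([3.9901, 2.2813])

# Query "feedback controller": 1s at the feedback and controller terms
q = np.array([0, 0, 0, 1, 1, 0, 0, 0, 0], dtype=float)

# Fold the query into the 2-D document space: Dq = q' U2 inv(S2)
Dq = q @ U2 @ np.linalg.inv(S2)
# Dq is approximately [0.1376, 0.3678], as in the example; the best
# matches are then found by cosine similarity against the rows of V2
```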
[Figure: retrieval performance on Medline data as a function of K, the number of singular values used, for matches within a .40 cosine threshold]
What LSI can do
• LSI analysis effectively does
– Dimensionality reduction
– Noise reduction
– Exploitation of redundant data
– Correlation analysis and Query expansion (with related words)
PCA vs ICA
[Figure: PCA finds an orthogonal coordinate system; ICA finds a non-orthogonal one]
Limitations of PCA
• Reducing the dimensions of complex distributions
may require non-linear processing
• Curvilinear Component Analysis (CCA)
– Non-linear extension of PCA
– Preserves the proximity between points in the input
space, i.e. the local topology of the distribution
– Enables some manifolds in the input data to be unfolded
– Keeps the local topology
Example of data representation
using CCA
http://en.wikipedia.org/wiki/Image:Eigenfaces.png
Eigenfaces – Face Recognition
• When properly weighted, eigenfaces can be
summed together to create an approximate
gray-scale rendering of a human face.
• Remarkably few eigenvector terms are needed
to give a fair likeness of most people's faces
• Hence eigenfaces provide a means of applying
data compression to faces for identification
purposes.
• Similarly, Expert Object Recognition in Video
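The weighting-and-summing idea can be sketched as follows; the random "faces" are stand-ins for a real dataset such as ORL, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in for a face dataset: 70 "images" of 32x32 = 1024 pixels each
faces = rng.normal(size=(70, 1024))
mean_face = faces.mean(axis=0)
A = faces - mean_face

# Eigenfaces are the principal components of the centered images;
# SVD avoids forming the 1024x1024 covariance matrix explicitly
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 20
eigenfaces = Vt[:k]                 # k x 1024

# Each face is summarized by k weights; summing the weighted
# eigenfaces (plus the mean face) gives an approximate rendering
weights = A @ eigenfaces.T          # 70 x k
approx = mean_face + weights @ eigenfaces
```

For identification, only the k weights per face need to be stored and compared, which is the compression the slide describes.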
Eigenfaces
• Experiment and Results
Data used here are from the ORL database of faces.
Facial images of 16 persons, each with 10 views, are used.
- Training set contains 16×7 images.
- Test set contains 16×3 images.
[Figure: recognition accuracy (roughly 0.4 to 0.6) vs number of eigenfaces (0 to 150)]
Causes of Process Variation
• Chance
– Natural variation inherent in a process. Cumulative effect
of many small, unavoidable causes.
• Assignable
– Variations in raw material, machine tools, mechanical
failure and human error. These are accountable
circumstances and are normally larger.
PCA for process monitoring
• Latent variables can sometimes be interpreted as
measures of physical characteristics of a process,
e.g. temperature or pressure.
• Variable reduction can increase the sensitivity of a
control scheme to assignable causes
• Application of PCA to monitoring is increasing
– Start with a reference set defining normal operation
conditions, look for assignable causes
– Generate a set of indicator variables that best describe the
dynamics of the process
– PCA is sensitive to data types
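One common way to turn the retained components into a monitoring statistic is Hotelling's T^2 on the PC scores; a minimal sketch, with illustrative data and no tuned control limit:

```python
import numpy as np

rng = np.random.default_rng(3)
# Reference set defining normal operating conditions
# (columns are hypothetical measurements, e.g. temperature, pressure)
normal = rng.normal(size=(200, 4))
mu = normal.mean(axis=0)
Xc = normal - mu

# Retain the first k principal components of the reference data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
P = Vt[:k].T                        # loadings, 4 x k
lam = (s[:k] ** 2) / (len(Xc) - 1)  # variance captured by each retained PC

def t2(x):
    """Hotelling's T^2 of a new sample in the retained-PC subspace."""
    t = (x - mu) @ P                # scores of the sample
    return float(np.sum(t ** 2 / lam))

# A sample far outside normal variation yields a large T^2, pointing
# to an assignable cause; in practice T^2 is compared against a
# control limit derived from the reference set.
```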
Independent Component Analysis (ICA)
Cocktail-party Problem
• Multiple sound sources in room (independent)
• Multiple sensors receiving signals which are
mixture of original signals
• Estimate original source signals from mixture
of received signals
• Can be viewed as Blind-Source Separation
as mixing parameters are not known
DEMO: BLIND SOURCE SEPARATION
http://www.cis.hut.fi/projects/ica/cocktail/cocktail_en.cgi
BSS and ICA
• Cocktail party or Blind Source Separation (BSS) problem
– An ill-posed problem, unless assumptions are made!
• The most common assumption is that the source signals are
statistically independent. This means that knowing the value of
one of them gives no information about the others.
• Methods based on this assumption are called Independent
Component Analysis methods
– statistical techniques for decomposing a complex data set into
independent parts.
• It can be shown that under some reasonable conditions, if
the ICA assumption holds, then the source signals can be
recovered up to permutation and scaling.
Source Separation Using ICA
[Figure: microphones 1 and 2 feed an unmixing network with weights
W11, W12, W21, W22, producing separated outputs 1 and 2]
Original signals (hidden sources): s1(t), s2(t), s3(t), s4(t), t = 1:T
The ICA model
[Figure: sources s1, s2, s3, s4 connected by mixing weights aij to
observations x1, x2, x3, x4]
xi(t) = ai1*s1(t) + ai2*s2(t) + ai3*s3(t) + ai4*s4(t)
Here, i = 1:4. This is what the microphones record: a linear
mixture of the sources.
In vector-matrix notation, and dropping the index t, this is
x = A*s
[Figure: recovered signals]
BSS
Problem: Determine the source
signals s, given only the
mixtures x.
• If we knew the mixing parameters aij then we would just need to
solve a linear system of equations.
• We know neither aij nor si.
• ICA was initially developed to deal with problems closely related
to the cocktail party problem
• Later it became evident that ICA has many other applications
– e.g. from electrical recordings of brain activity from different
locations of the scalp (EEG signals) recover underlying components
of brain activity
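A compact way to see this in code is a minimal deflation-style FastICA with a tanh nonlinearity; this is an illustrative sketch of one standard ICA method, with hypothetical signals, not a production implementation:

```python
import numpy as np

def fastica(X, n_components, n_iter=200):
    """Minimal deflation FastICA (tanh nonlinearity).
    X: (n_mixtures, n_samples) array of observed mixtures."""
    X = X - X.mean(axis=1, keepdims=True)
    # Whiten: transform so the mixtures are uncorrelated with unit variance
    d, E = np.linalg.eigh(np.cov(X))
    Z = E @ np.diag(d ** -0.5) @ E.T @ X
    W = np.zeros((n_components, Z.shape[0]))
    rng = np.random.default_rng(0)
    for i in range(n_components):
        w = rng.normal(size=Z.shape[0])
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            wz = w @ Z
            # Fixed-point update: w <- E[z g(w'z)] - E[g'(w'z)] w
            w = (Z * np.tanh(wz)).mean(axis=1) - (1 - np.tanh(wz) ** 2).mean() * w
            w -= W[:i].T @ (W[:i] @ w)   # deflate against components found so far
            w /= np.linalg.norm(w)
        W[i] = w
    return W @ Z                         # estimated sources (up to order and scale)

# Two independent non-Gaussian sources, linearly mixed
t = np.linspace(0, 8, 2000)
S = np.vstack([np.sign(np.sin(3 * t)), np.sin(5 * t)])
A = np.array([[1.0, 0.5], [0.7, 1.2]])   # "unknown" mixing matrix
recovered = fastica(A @ S, 2)
```

As the slides note, the sources come back only up to permutation and scaling (including sign), which is why recovered components are compared to the true sources by absolute correlation.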
ICA Solution and Applicability
• ICA is a statistical method, the goal of which is to decompose
given multivariate data into a linear sum of statistically
independent components.
• For example, given a two-dimensional vector x = [x1 x2]T, ICA
aims at finding the decomposition
[x1]   [a11]      [a12]
[  ] = [   ] s1 + [   ] s2
[x2]   [a21]      [a22]
i.e. x = a1 s1 + a2 s2
where a1, a2 are basis vectors and s1, s2 are basis coefficients.
[Figure: image denoising comparison: original image, noisy image,
result of Wiener filtering, result of ICA filtering]
Blind Separation of Information from Galaxy
Spectra
[Figure: a separated component of a galaxy spectrum]
Approaches to ICA
• Unsupervised Learning
– Factorial coding (Minimum entropy coding, Redundancy
reduction)
– Maximum likelihood learning
– Nonlinear information maximization (entropy maximization)
– Negentropy maximization
– Bayesian learning
• Dynamic imaging
– used for quantitative analysis (e.g. blood flow, metabolism)
– data acquired sequentially at time intervals
Dynamic Image Acquisition
Using PET
[Figure: PET frames 1 … n stacked along the temporal axis over the
spatial dimensions. The observed images y1, y2, y3, …, yN decompose
into components u1 (left ventricle), u2 (right ventricle),
u3 (myocardium), …, uN (noise).]
Other PET applications
Eigenfaces: PCA/FA
PCA finds a linear data representation that
best models the covariance structure of
face image data (second-order relations)
[Figure: a face image ≈ a1×(eigenface 1) + a2×(eigenface 2) + … + an×(eigenface n),
a low-dimensional PCA representation]
Factorial Code Representation:
Factorial Faces
Figure 3. First 20 basis images: (a) eigenface method; (b) factorial
code. They are ordered by column, then by row.
Experimental Results
• PCA
– Focus on uncorrelated and Gaussian components
– Second-order statistics
– Orthogonal transformation
• ICA
– Focus on independent and non-Gaussian components
– Higher-order statistics
– Non-orthogonal transformation
Gaussians and ICA
• If some components are Gaussian and some are
non-Gaussian:
– All non-Gaussian components can be estimated
– Gaussian components cannot be separated; only their linear
combination (subspace) can be estimated
– If there is only one Gaussian component, the model can still be
estimated
• ICA is sometimes viewed as non-Gaussian factor
analysis
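The Gaussian indeterminacy is easy to see numerically: rotating independent Gaussian sources leaves every second-order statistic unchanged, and rotated Gaussians are still Gaussian, so no ICA method can identify the rotation. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
S = rng.normal(size=(2, 100000))     # two independent Gaussian sources

theta = 0.7                          # an arbitrary rotation (mixing) angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
S_rot = R @ S                        # "mixed" signals

# Both the original and rotated signals have (near-)identity covariance,
# so nothing observable distinguishes mixed from unmixed Gaussians
assert np.allclose(np.cov(S), np.eye(2), atol=0.02)
assert np.allclose(np.cov(S_rot), np.eye(2), atol=0.02)
```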