
Dimensionality Reduction

Jayanta Mukhopadhyay
Dept. of Computer Science and Engg.
Books

- Chapter 6 of "Introduction to Machine Learning" by Ethem Alpaydin.
Why reduce dimension?

- To reduce the complexity of inference, memory and computation.
- In most learning algorithms, the complexity depends on
  - the number of input dimensions, d, and
  - the size of the data sample, N.
- To save the cost of extracting features.
- Simpler models are more robust on small datasets.
- Explanations with fewer features are more convenient for knowledge extraction.
- Convenient to plot, visualize, etc.
Two major approaches

- Feature selection
  - Find k of the d dimensions that give us the most information, discarding the other (d − k) dimensions.
  - Subset selection methods.
- Feature extraction
  - A new set of k dimensions that are combinations of the original d dimensions.
  - Supervised and unsupervised techniques, e.g. PCA (unsupervised) and LDA (supervised).
  - Projection of feature vectors onto a lower dimensional space.
Subset Selection

- F: a feature set of input dimensions xi, i = 1, 2, ..., d.
- E(F): error on the validation set when F is used as input.
- A supervised method (requires training and testing).
  - Any learning method would do.
- Two greedy methods:
  - Sequential forward selection.
  - Sequential backward selection.
Sequential forward selection

- Start with F = ∅.
- Select the xi that gives the least E(F ∪ {xi}).
- Add xi to F if E(F ∪ {xi}) < E(F).
- Repeat the above two steps until no more additions are possible.
- A local search method.
- Does not guarantee the optimal feature combination.
- The cost of training and testing is O(d²).
Sequential backward selection

- Start with F = the set of all features.
- Select the xi that gives the least E(F − {xi}).
- Remove xi from F if E(F − {xi}) < E(F).
- Repeat the above two steps until no more removals are possible.
- A local search method.
- Does not guarantee the optimal feature combination.
- The cost of training and testing is O(d²), and training with more features is more costly.
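
A minimal Python sketch of both greedy procedures, assuming a hypothetical evaluate(features) function that trains a model on the selected feature subset and returns its validation error E(F) (any learning method would do, as noted above):

    def sequential_forward_selection(evaluate, d):
        """Greedy forward selection: grow F while the validation error improves."""
        F, best_err = set(), float("inf")
        improved = True
        while improved:
            improved = False
            candidates = [i for i in range(d) if i not in F]
            if not candidates:
                break
            errs = {i: evaluate(F | {i}) for i in candidates}
            best = min(errs, key=errs.get)
            if errs[best] < best_err:          # add xi only if the error drops
                F.add(best)
                best_err = errs[best]
                improved = True
        return F, best_err

    def sequential_backward_selection(evaluate, d):
        """Greedy backward selection: shrink F while the validation error improves."""
        F = set(range(d))
        best_err = evaluate(F)
        improved = True
        while improved and len(F) > 1:
            improved = False
            errs = {i: evaluate(F - {i}) for i in F}
            best = min(errs, key=errs.get)
            if errs[best] < best_err:          # remove xi only if the error drops
                F.remove(best)
                best_err = errs[best]
                improved = True
        return F, best_err

Each call to evaluate costs one training-and-testing run; in the worst case O(d²) such runs are made, matching the cost noted above.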
Principal component analysis (PCA)

- To find a mapping from the inputs in the original d-dimensional space to a new (k < d)-dimensional space, with minimum loss of information.
- x: input feature vector of dimension d.
- w: a direction (unit vector) of dimension d.
- Projection of x along w: wᵀx.
- Make the data centered around the origin of the space.
- Principal component: the component along the direction w1 such that its variance is maximum among all possible projections.
Principal components

- Principal component: the component along the direction w1 such that its variance is maximum among all possible projections.
  - This is the 1st principal component.
- 2nd component:
  - The component along a direction w2, orthogonal to w1, having the maximum variance.
- The other principal components are defined similarly.
- For a d-dimensional space there are at most d principal components.
Computation of the 1st component

- z1 = w1ᵀx
- Let the corresponding random variable be denoted Z1.
- X is the random variable whose instance is x, with mean m and covariance matrix Σ.
- Mean of Z1: w1ᵀm
- Variance of Z1: w1ᵀΣw1
- Optimization problem:
  - Maximize the variance keeping w1 a unit vector:
    w1 = argmax_w { wᵀΣw − λ(wᵀw − 1) }
  - λ is the Lagrange multiplier.
Computation of the 1st component

- z1 = w1ᵀx
- w1 = argmax_w { wᵀΣw − λ(wᵀw − 1) }, where λ is the Lagrange multiplier.
- Taking the derivative of the argument w.r.t. w and setting it to 0:
  - 2Σw1 − 2λw1 = 0  ⇒  Σw1 = λw1
  - ⇒  w1ᵀΣw1 = w1ᵀλw1 = λw1ᵀw1 = λ  (the variance)
- Hence w1 is the eigen vector of Σ corresponding to the maximum eigen value.
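
A small numpy check of this result on synthetic data (a sketch; the seed and dimensions are arbitrary): the variance of the data projected onto the top eigen vector of Σ equals the largest eigen value, and no random unit direction gives a larger projected variance.

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 1000, 5
    X = rng.normal(size=(N, d)) @ rng.normal(size=(d, d))   # correlated synthetic data

    Xc = X - X.mean(axis=0)                    # center the data around the origin
    Sigma = (Xc.T @ Xc) / N                    # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigen values in ascending order
    w1 = eigvecs[:, -1]                        # eigen vector of the largest eigen value

    var_w1 = np.var(Xc @ w1)                   # variance of the projection z1 = w1ᵀx
    print(var_w1, eigvals[-1])                 # the two numbers agree

    for _ in range(1000):                      # random unit directions never do better
        w = rng.normal(size=d)
        w /= np.linalg.norm(w)
        assert np.var(Xc @ w) <= var_w1 + 1e-9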
Computation of the 2nd component

- z2 = w2ᵀx
- Optimization problem:
  - w2 = argmax_w { wᵀΣw − λ1(wᵀw − 1) − λ2(w1ᵀw − 0) }
  - λ1 and λ2 are Lagrange multipliers.
  - w2 is constrained to be orthogonal to w1.
- Taking the derivative of the argument w.r.t. w and setting it to 0:
  - 2Σw2 − 2λ1w2 − λ2w1 = 0
Computation of the 2nd component

- 2Σw2 − 2λ1w2 − λ2w1 = 0
- Pre-multiplying by w1ᵀ, and using w1ᵀw2 = 0 and w1ᵀw1 = 1:
  - 2w1ᵀΣw2 − 2λ1w1ᵀw2 − λ2w1ᵀw1 = 0
  - ⇒ 2w1ᵀΣw2 − λ2 = 0
- As w1ᵀΣw2 is a scalar, it equals w2ᵀΣw1.
- Replacing Σw1 by λw1: w2ᵀΣw1 = λw2ᵀw1 = 0. Hence λ2 = 0.
- 2Σw2 − 2λ1w2 = 0  ⇒  Σw2 = λ1w2
- ⇒ w2 is an eigen vector of Σ and λ1 is the variance.
- w2: the eigen vector of Σ corresponding to the 2nd largest eigen value, and so on.

In general, z = Wᵀ(x − m), where the columns of W are the chosen eigen vectors and m is the data mean.

PCA: Algorithm

- Input: a set of data points S = {xj = (x1j, x2j, …, xdj) | xj ∈ Rᵈ}.
- Output: a set of k eigen vectors providing the transformation matrix W = [w1, w2, …, wk].

1. Compute the mean of the data points.
2. Translate all data points to their mean (center the data).
3. Compute the covariance matrix of the set.
4. Compute the eigen vectors and eigen values, sorted in decreasing order of eigen value.
5. Choose k such that the fraction of variance accounted for exceeds a threshold.
6. Use those k components to represent any data point: z = Wᵀ(x − m).
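
A minimal numpy sketch of these six steps (the variance threshold of 0.95 is an arbitrary illustrative choice):

    import numpy as np

    def pca(X, var_threshold=0.95):
        """PCA on the rows of X (N x d). Returns the mean m, the matrix W and Z = (X - m) W."""
        m = X.mean(axis=0)                          # step 1: mean of the data points
        Xc = X - m                                  # step 2: translate to the mean
        Sigma = (Xc.T @ Xc) / X.shape[0]            # step 3: covariance matrix
        eigvals, eigvecs = np.linalg.eigh(Sigma)    # step 4: eigen values / vectors
        order = np.argsort(eigvals)[::-1]           #         sort in decreasing order
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        explained = np.cumsum(eigvals) / eigvals.sum()
        k = int(np.searchsorted(explained, var_threshold)) + 1   # step 5: choose k
        W = eigvecs[:, :k]                          # step 6: transformation matrix
        return m, W, Xc @ W

A call m, W, Z = pca(X) projects each data point x (a row of X) to z = Wᵀ(x − m).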
Example

- Data: {(5, 3, 2), (4, 6, 0), (3, −7, 14), (2, 5, 3), (3, 13, −6)}
- Perform PCA and, if applicable, reduce the dimension of the data.
Example (contd.)

- Mean of the data: m = (3.4, 4, 2.6).
- Covariance matrix (dividing by N = 5):

      C = [  1.04   -0.20   -0.84
            -0.20   41.60  -41.40
            -0.84  -41.40   42.24 ]

Example (contd.)

- Total variance: Trace(C) = 1.04 + 41.6 + 42.24 = 84.88.
- Eigen values of C: (83.3238, 1.5562, 0). The sum of the eigen values equals the total variance.
- The respective eigen vectors are the columns of the transformation matrix W.

Example (contd.)

- The eigen vector of the zero eigen value is along (1, 1, 1): all the data points lie in the plane X + Y + Z = 10.
- That direction carries no variance, so it is a redundant dimension; the data can be represented by the first two principal components.
Coordinate transformation

[Figure: the plane X + Y + Z = 10 drawn in the original (X, Y, Z) axes, with the eigen vectors e1 and e2 lying in the plane and e3 = (1, 1, 1)/√3 along its normal.]
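
A quick numpy check of these numbers (np.cov with bias=True matches the divide-by-N convention used above):

    import numpy as np

    X = np.array([[5, 3, 2], [4, 6, 0], [3, -7, 14], [2, 5, 3], [3, 13, -6]], dtype=float)

    m = X.mean(axis=0)                       # (3.4, 4.0, 2.6)
    C = np.cov(X, rowvar=False, bias=True)   # covariance, dividing by N = 5
    eigvals, eigvecs = np.linalg.eigh(C)

    print(np.trace(C))                       # 84.88
    print(eigvals[::-1])                     # approx. 83.3238, 1.5562, 0
    print(eigvecs[:, 0])                     # ~0 eigen value: direction proportional to (1, 1, 1)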
PCA properties

- PCA diagonalizes the data covariance matrix Σ:
  - Σ = CDCᵀ, where
  - D is a diagonal matrix, and
  - C has the unit eigen vectors of Σ as its columns, so CCᵀ = CᵀC = I.
- The components are uncorrelated, as the covariance between any two components is zero.
- By normalizing the components with their variances (eigen values), Euclidean distance can be used for classification.
- The reconstruction error from the lower dimensional space is the minimum among all linear transforms of the data.
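
A short numpy illustration of the decorrelation property (a sketch on arbitrary synthetic data): after projecting onto the eigen vectors, the covariance of the components is the diagonal matrix of eigen values; after normalizing each component by the square root of its eigen value, it becomes the identity.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 4)) @ rng.normal(size=(4, 4))   # correlated data
    Xc = X - X.mean(axis=0)

    Sigma = (Xc.T @ Xc) / len(Xc)
    eigvals, C = np.linalg.eigh(Sigma)            # Sigma = C diag(eigvals) Cᵀ

    Z = Xc @ C                                    # uncorrelated components
    print(np.round(Z.T @ Z / len(Z), 6))          # diagonal matrix of eigen values

    Zn = Z / np.sqrt(eigvals)                     # normalize by the variances
    print(np.round(Zn.T @ Zn / len(Zn), 6))       # approximately the identity matrix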
Application of PCA

- Data compression
  - PCA provides the optimum set of orthonormal basis vectors for a set of data points.
  - The basis is data dependent.
  - The basis vectors are also called the 'Karhunen-Loeve' basis, and the transform is called the 'Karhunen-Loeve Transform' (KLT).
  - Type-2 DCT basis vectors are approximately the eigen vectors of the matrix whose (j, k)-th entry is r^|j−k|:
    - the covariance matrix of a useful class of signals, where r is the measure of correlation between adjacent samples and has a value near 1.
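
A sketch of this approximation (r = 0.95 and N = 8 are arbitrary illustrative choices): build the matrix with entries r^|j−k|, compute its eigen vectors, and compare them with an orthonormal Type-2 DCT basis; the absolute inner products are close to 1 when r is near 1.

    import numpy as np

    N, r = 8, 0.95
    j, k = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    R = r ** np.abs(j - k)                    # covariance with (j, k) entry r^|j-k|

    eigvals, eigvecs = np.linalg.eigh(R)
    eigvecs = eigvecs[:, ::-1]                # order by decreasing eigen value

    n = np.arange(N)                          # orthonormal DCT-II basis vectors (rows of D)
    D = np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / N)
    D[0] *= np.sqrt(1.0 / N)
    D[1:] *= np.sqrt(2.0 / N)

    # |<k-th DCT vector, k-th eigen vector>| for each k: all close to 1.
    print(np.round(np.abs(np.sum(D * eigvecs.T, axis=1)), 3))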
Application of PCA

- Decorrelating components
  - Color images in RGB space are highly correlated.
  - By performing PCA on different blocks of color images, a color transformation matrix useful for segmentation is obtained:
    (R+G+B)/3, R−B, (2G−R−B)/2 (a sketch of this transform follows the reference below).
  - Multispectral, hyperspectral and ultraspectral remote sensing images:
    - Multispectral: 10s of bands.
    - Hyperspectral: 100s of bands.
    - Ultraspectral: 1000s of bands.
    - PCA is required to highlight the decorrelated information.

Y. I. Ohta, T. Kanade, and T. Sakai, "Color information for region segmentation", Computer Graphics and Image Processing, 13, 222-241.
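
A minimal sketch of applying that transformation to an RGB image held in a numpy array (the input img is a hypothetical H x W x 3 array; the three features follow the list above):

    import numpy as np

    def ohta_features(img):
        """Map an H x W x 3 RGB array to the three components listed above."""
        R = img[..., 0].astype(float)
        G = img[..., 1].astype(float)
        B = img[..., 2].astype(float)
        I1 = (R + G + B) / 3.0       # intensity-like component
        I2 = R - B                   # first chromatic component
        I3 = (2 * G - R - B) / 2.0   # second chromatic component
        return np.stack([I1, I2, I3], axis=-1)

    img = np.random.default_rng(0).integers(0, 256, size=(4, 4, 3))   # stand-in image
    print(ohta_features(img).shape)                                   # (4, 4, 3)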
PCA components of a hyperspectral image

[Figure: the first 20 PCA components (Band PCA 1 to Band PCA 20) of a hyperspectral image. After component 20, not much detail is available: PCA removes the data redundancy.]

Courtesy: Li et al., "A New Subspace Approach for Supervised Hyperspectral Image Classification", 2011 IEEE International Geoscience and Remote Sensing Symposium.
Application of PCA

- Factor analysis
  - Highlights decorrelated factors.
  - Useful for classification.
- For example, eigen faces for representing human faces:
  - Perform PCA on a large set of images of human faces cropped to the same size.
  - Any arbitrary face is then expressed as a linear combination of the eigen faces.
  - The coefficients of the linear combination represent the arbitrary face.
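
A minimal numpy sketch of the eigen-face idea, assuming a hypothetical array faces of shape (n_faces, h*w) holding the cropped and vectorized face images (an SVD of the centered data gives the same directions as the eigen decomposition of the covariance matrix):

    import numpy as np

    def eigenfaces(faces, k):
        """Return the mean face and the top-k eigen faces (as rows) of an (n, h*w) array."""
        mean_face = faces.mean(axis=0)
        centered = faces - mean_face
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)   # rows of Vt: eigen faces
        return mean_face, Vt[:k]

    def encode(face, mean_face, E):
        """Coefficients of the linear combination representing a face."""
        return E @ (face - mean_face)

    def decode(coeffs, mean_face, E):
        """Approximate reconstruction of a face from its coefficients."""
        return mean_face + coeffs @ E

    # Toy usage with random data standing in for real cropped face images.
    faces = np.random.default_rng(0).normal(size=(100, 32 * 32))
    mean_face, E = eigenfaces(faces, k=20)
    coeffs = encode(faces[0], mean_face, E)
    print(decode(coeffs, mean_face, E).shape)   # (1024,)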
PCA: Eigen faces

http://en.wikipedia.org/wiki/Image:Eigenfaces.png
Application of PCA

- Classification / high-level processing
  - Using the representation derived by principal component analysis.

[Figure: a block diagram in which the input is projected onto the PCA basis vectors and the resulting principal components are fed to a classifier to produce the classification output.]
Linear discriminant analysis

- For the purpose of classification, dimensionality reduction using PCA may not work:
  - PCA captures the direction of maximum variance of a data set.
  - For labelled data sets, it does not capture the direction of maximum separation between the groups of data points with differing labels.
[Figure: two classes that are well separated, but not along the direction of the principal component.]
Fisher linear discriminant

- Consider a set of data points S = {xi | xi ∈ Rᵈ}:
  - N1 points in class ω1,
  - N2 points in class ω2,
  - with N1 + N2 = N (the total number of data points).
- Consider a line with direction u.
- Projection of data point xi on u: yi = xiᵀu.
- This gives a one-dimensional subspace representing the data.
Separation between projected data of different classes

- m1 = mean of the data points in ω1.
- m2 = mean of the data points in ω2.
- Projections of the means:
  - my1 = m1ᵀu
  - my2 = m2ᵀu
- A measure of separation: D = |my1 − my2|.
  [Figure: the projected means my1 and my2 on the line, at distance D apart.]
- This measure does not consider the variance of the data.
A better measure of separation

- Normalize the separation by a factor proportional to the class variances.
- Scatter of the projected data belonging to class ω1 (class variance × number of samples):
  s1² = Σ_{y in ω1} (y − my1)²,  and similarly s2² for class ω2.
- Measure of separation:
  J(u) = (my1 − my2)² / (s1² + s2²)
  - The projected class means should be far apart, while the scatter of the projected samples of each class should be small.
- The goal is to obtain the u maximizing J(u).
  [Figure: projections of the two classes on the line, with means my1, my2 and the scatters of class ω1 and class ω2.]
Scatter matrix

- Scatter matrix for the samples of class C in the original space:
  S_C = Σ_{x in C} (x − m_C)(x − m_C)ᵀ
- S1 and S2: the scatter matrices of class ω1 and class ω2.
- Within-class scatter matrix: S_W = S1 + S2.
Between-class scatter matrix

- Between-class scatter matrix, defined from the means m1 and m2 of ω1 and ω2:
  S_B = (m1 − m2)(m1 − m2)ᵀ
- Rewriting the optimization function in terms of the scatter matrices:
  J(u) = (uᵀ S_B u) / (uᵀ S_W u),  to be maximized over u.
Solution

- To maximize J(u), u should satisfy S_B u = λ S_W u, i.e. S_W⁻¹ S_B u = λ u.
  - S_W should be invertible; this is an eigen value problem.
- For any vector z, S_B z is along (m1 − m2):
  S_B z = (m1 − m2)(m1 − m2)ᵀ z = k (m1 − m2)  for some scalar k.
- Dropping the scalar factors, since only the direction matters:
  u = S_W⁻¹ (m1 − m2)
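
A minimal numpy sketch of this solution (the class scatter matrices are computed as defined above; np.linalg.pinv is used in case S_W is singular):

    import numpy as np

    def scatter(X):
        """Scatter matrix of the rows of X about their mean (class variance x N)."""
        Xc = X - X.mean(axis=0)
        return Xc.T @ Xc

    def fisher_direction(X1, X2):
        """Fisher discriminant direction u = S_W^{-1} (m1 - m2), returned as a unit vector."""
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        S_W = scatter(X1) + scatter(X2)         # within-class scatter matrix
        u = np.linalg.pinv(S_W) @ (m1 - m2)     # only the direction matters
        return u / np.linalg.norm(u)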
Example

- Data points:
  - X1 = {(5, 3, 2), (4, 6, 0), (3, −7, 14)}
  - X2 = {(−2, −5, 17), (3, −13, 10), (−4, −2, 16)}
- Perform LDA and obtain the optimum direction. Check the separability of the data on the line of projection.
- Perform PCA on the whole data set, ignoring class information, and obtain the dominant principal direction. Check the separability of the projected points on it.
Example (contd.)

- LDA: compute the class means m1 and m2, the class scatter matrices S1 and S2, and the within-class scatter matrix S_W = S1 + S2; then u = S_W⁻¹(m1 − m2).

Example (contd.)

- LDA: separability of the projections on u.
- The two classes are well separated on the line of projection.

Example (contd.)

- PCA on the pooled data (class labels ignored):
  - Eigen values: 72.96, 20.29, 1.47.
  - The corresponding eigen vectors give the principal directions; the dominant direction is the eigen vector of the largest eigen value.

Example (contd.)

- PCA: separability of the projections on the dominant principal direction.
- The margin of separation is reduced compared with the projection on the LDA direction.
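
A numpy sketch reproducing this comparison on the example data (the sign of each direction is arbitrary, so only the relative positions of the projected points matter):

    import numpy as np

    X1 = np.array([[5, 3, 2], [4, 6, 0], [3, -7, 14]], dtype=float)
    X2 = np.array([[-2, -5, 17], [3, -13, 10], [-4, -2, 16]], dtype=float)
    X = np.vstack([X1, X2])

    def scatter(A):
        Ac = A - A.mean(axis=0)
        return Ac.T @ Ac

    # LDA direction: u = S_W^{-1} (m1 - m2).
    S_W = scatter(X1) + scatter(X2)
    u = np.linalg.pinv(S_W) @ (X1.mean(axis=0) - X2.mean(axis=0))
    u /= np.linalg.norm(u)

    # Dominant PCA direction of the pooled data, ignoring class labels.
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(X))
    w1 = eigvecs[:, -1]                         # eigen vector of the largest eigen value

    print("PCA eigen values:", np.round(eigvals[::-1], 2))   # approx. 72.96, 20.29, 1.47
    print("LDA projections:", np.round(X1 @ u, 2), np.round(X2 @ u, 2))
    print("PCA projections:", np.round(X1 @ w1, 2), np.round(X2 @ w1, 2))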
Summary

- Feature selection (subset selection):
  - Forward and backward sequential selection methods.
- Feature extraction:
  - PCA: an unsupervised dimension reduction method.
  - LDA: a supervised dimension reduction method.