Feature Extraction, Curse of Dimensionality, PCA

The curse of dimensionality
What does the curse of dimensionality mean, in practice?
– For a given sample size, there is a maximum number of features above
which the performance of our classifier will degrade rather than
improve
– In most cases, the additional information that is lost by discarding
some features is (more than) compensated by a more accurate
mapping in the lower-dimensional space
[Figure: classifier performance as a function of dimensionality, for a fixed sample size]

Dimensionality reduction
Two approaches are available to reduce dimensionality
– Feature extraction: creating a subset of new features by combinations
of the existing features
– Feature selection: choosing a subset of all the features
$$\text{Feature selection:}\quad \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} \rightarrow \begin{bmatrix} x_{i_1} \\ x_{i_2} \\ \vdots \\ x_{i_M} \end{bmatrix} \qquad\qquad \text{Feature extraction:}\quad \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} \rightarrow \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = f\!\left( \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} \right)$$
The problem of feature extraction can be stated as
– Given a feature space $x_i \in \mathbb{R}^N$, find a mapping $y = f(x): \mathbb{R}^N \rightarrow \mathbb{R}^M$ with $M < N$ such that the transformed feature vector $y \in \mathbb{R}^M$ preserves (most of) the information or structure in $\mathbb{R}^N$
– An optimal mapping $y = f(x)$ is one that does not increase $P(\mathrm{error})$
– That is, a Bayes decision rule applied to the initial space $\mathbb{R}^N$ and to the reduced space $\mathbb{R}^M$ yields the same classification rate
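
To make the distinction concrete, here is a minimal NumPy sketch (not from the original slides; the index set and the map `f` are arbitrary illustrations): feature selection keeps a subset of the original coordinates, whereas feature extraction builds each new feature from all of them.

```python
import numpy as np

# One sample x in R^N (N = 4); the values are arbitrary illustrations.
x = np.array([2.0, -1.0, 0.5, 3.0])

# Feature selection: keep a subset of the original features (here x_1 and x_4).
selected = [0, 3]                      # hypothetical index set {i_1, ..., i_M}
x_sel = x[selected]                    # lives in R^M, M = 2

# Feature extraction: each new feature combines ALL original features.
def f(x):
    # a hypothetical map R^N -> R^M; any such function would do
    return np.array([0.5 * x[0] + 0.5 * x[1],
                     0.7 * x[2] + 0.7 * x[3]])

y = f(x)                               # y in R^M, M = 2
print(x_sel, y)
```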

Dimensionality reduction …. Cntd.
Linear dimensionality reduction
– In general, the optimal mapping 𝑦 = 𝑓(𝑥) will be a non-linear function
• However, there is no systematic way to generate non-linear transforms
• The selection of a particular subset of transforms is problem dependent
– For these reasons, feature extraction is commonly based on linear transforms of the form $y = Wx$
$$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} \;\rightarrow\; \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1N} \\ \vdots & & \ddots & \vdots \\ w_{M1} & w_{M2} & \cdots & w_{MN} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}$$
• NOTE: When the mapping is a non-linear function, the reduced space is
called a manifold
– We will focus on linear feature extraction for now, and revisit non-linear techniques when we cover multi-layer perceptrons, manifold learning, and kernel methods
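
A quick NumPy sketch of the shapes involved (our own example, not from the slides): a hypothetical $M \times N$ matrix $W$ maps each $N$-dimensional sample to an $M$-dimensional one.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3                      # original and reduced dimensionality
X = rng.normal(size=(500, N))     # 500 toy samples, one per row

W = rng.normal(size=(M, N))       # a generic linear transform (not yet the PCA W)

# y = W x for every sample; with row-vector samples this is X @ W.T
Y = X @ W.T
print(Y.shape)                    # (500, M): each sample is now M-dimensional
```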
Signal representation versus classification
Finding the mapping 𝑦 = 𝑓(𝑥) is guided by an objective
function that we seek to maximize (or minimize)
–Depending on the criteria used by the objective function, feature
extraction techniques are grouped into two categories:
• Signal representation: The goal of the feature extraction
mapping is to represent the samples accurately in a lower-
dimensional space
• Classification: The goal of the feature extraction mapping is to
enhance the class-discriminatory information in the lower-
dimensional space

– Within the realm of linear feature extraction, two techniques are commonly used
• Principal components analysis (PCA): uses a signal representation criterion
• Linear discriminant analysis (LDA): uses a signal classification criterion

[Figure: scatter plot of two classes over Feature 1 and Feature 2, illustrating the two criteria]
Principal components analysis (PCA)
PCA seeks to preserve as much of the randomness (variance) in the high-dimensional space as possible
– Let $x \in \mathbb{R}^N$ be represented as a linear combination of orthonormal basis vectors $\varphi_1, \varphi_2, \ldots, \varphi_N$ as
$$x = \sum_{i=1}^{N} y_i \varphi_i \quad \text{where} \quad \varphi_i^T \varphi_j = \begin{cases} 0, & i \neq j \\ 1, & i = j \end{cases}$$
– Suppose we want to represent $x$ with only $M$ ($M < N$) basis vectors
– We can do this by replacing the components $[y_{M+1}, \ldots, y_N]^T$ with some pre-selected constants $b_i$
$$\hat{x}(M) = \sum_{i=1}^{M} y_i \varphi_i + \sum_{i=M+1}^{N} b_i \varphi_i$$
– The “representation error” is then
$$\Delta x(M) = x - \hat{x}(M) = \sum_{i=1}^{N} y_i \varphi_i - \left( \sum_{i=1}^{M} y_i \varphi_i + \sum_{i=M+1}^{N} b_i \varphi_i \right) = \sum_{i=M+1}^{N} (y_i - b_i)\, \varphi_i$$
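
The truncated expansion can be checked numerically. The following sketch (our own, using an arbitrary random orthonormal basis and $b_i = 0$) verifies that the residual equals $\sum_{i=M+1}^{N}(y_i - b_i)\varphi_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 3

# An orthonormal basis: the Q factor of a random matrix (columns are phi_1..phi_N).
Phi, _ = np.linalg.qr(rng.normal(size=(N, N)))

x = rng.normal(size=N)
y = Phi.T @ x                        # coefficients y_i = phi_i^T x
b = np.zeros(N)                      # pre-selected constants for the discarded terms

# Truncated representation: keep the first M coefficients, replace the rest by b_i.
x_hat = Phi[:, :M] @ y[:M] + Phi[:, M:] @ b[M:]

# The residual equals sum_{i>M} (y_i - b_i) phi_i, as in the derivation above.
delta = x - x_hat
print(np.allclose(delta, Phi[:, M:] @ (y[M:] - b[M:])))   # True
```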
PCA …. Cntd.
– We can measure this “representation error” by the mean-squared magnitude of $\Delta x$
– Our goal is to find the basis vectors $\varphi_i$ and constants $b_i$ that minimize this mean-square error
$$\bar{\epsilon}^2(M) = E\!\left[ \left\| \Delta x(M) \right\|^2 \right] = E\!\left[ \sum_{i=M+1}^{N} \sum_{j=M+1}^{N} (y_i - b_i)(y_j - b_j)\, \varphi_i^T \varphi_j \right] = \sum_{i=M+1}^{N} E\!\left[ (y_i - b_i)^2 \right]$$

– To find the optimal values of $b_i$ we compute the partial derivative of the objective function and equate it to zero
$$\frac{\partial}{\partial b_i} E\!\left[ (y_i - b_i)^2 \right] = -2\, E\!\left[ y_i - b_i \right] = 0 \;\Rightarrow\; b_i = E\!\left[ y_i \right]$$
– Therefore, we will replace the discarded dimensions by their expected value, which is an intuitive result
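
A small numerical check of this result (our own sketch, with an arbitrary distribution for $y_i$): sweeping candidate constants $b$ shows the mean-squared error is minimized near the sample mean of $y_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
y_i = rng.normal(loc=3.0, scale=2.0, size=10_000)   # samples of a discarded component y_i

# Mean-squared error of replacing y_i by a constant b, over a grid of candidates.
candidates = np.linspace(0.0, 6.0, 601)
mse = [np.mean((y_i - b) ** 2) for b in candidates]

best_b = candidates[int(np.argmin(mse))]
print(best_b, y_i.mean())   # the minimizing constant is (approximately) E[y_i]
```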
PCA …. Cntd.
– The MSE can then be written as
$$\bar{\epsilon}^2(M) = \sum_{i=M+1}^{N} E\!\left[ (y_i - E[y_i])^2 \right] = \sum_{i=M+1}^{N} E\!\left[ \left( \varphi_i^T x - E\!\left[ \varphi_i^T x \right] \right)^2 \right] = \sum_{i=M+1}^{N} \varphi_i^T \, E\!\left[ (x - E[x])(x - E[x])^T \right] \varphi_i = \sum_{i=M+1}^{N} \varphi_i^T \Sigma_x \varphi_i$$
– We seek the solution that minimizes this expression, subject to the
orthonormality constraint, which we incorporate into the expression
using a set of Lagrange multipliers 𝜆𝑖
$$\bar{\epsilon}^2(M) = \sum_{i=M+1}^{N} \varphi_i^T \Sigma_x \varphi_i + \sum_{i=M+1}^{N} \lambda_i \left( 1 - \varphi_i^T \varphi_i \right)$$
– Computing the partial derivative with respect to the basis vectors
$$\frac{\partial \bar{\epsilon}^2(M)}{\partial \varphi_i} = \frac{\partial}{\partial \varphi_i} \left[ \sum_{i=M+1}^{N} \varphi_i^T \Sigma_x \varphi_i + \sum_{i=M+1}^{N} \lambda_i \left( 1 - \varphi_i^T \varphi_i \right) \right] = 2\left( \Sigma_x \varphi_i - \lambda_i \varphi_i \right) = 0 \;\Rightarrow\; \Sigma_x \varphi_i = \lambda_i \varphi_i$$
• NOTE: $\frac{d}{dx}\left( x^T A x \right) = \left( A + A^T \right) x = 2Ax$ for $A$ symmetric

– So $\boldsymbol{\varphi_i}$ and $\boldsymbol{\lambda_i}$ are the eigenvectors and eigenvalues of $\boldsymbol{\Sigma_x}$
PCA …. Cntd.
– We can then express the sum-squared error as
$$\bar{\epsilon}^2(M) = \sum_{i=M+1}^{N} \varphi_i^T \Sigma_x \varphi_i = \sum_{i=M+1}^{N} \varphi_i^T \lambda_i \varphi_i = \sum_{i=M+1}^{N} \lambda_i$$
– In order to minimize this measure, the $\lambda_i$ in the sum must be the smallest eigenvalues
– Therefore, to represent $x$ with minimum MSE, we will choose the eigenvectors $\varphi_i$ corresponding to the largest eigenvalues $\lambda_i$

PCA dimensionality reduction

The optimal* approximation of a random vector $x \in \mathbb{R}^N$ by a linear combination of $M < N$ independent vectors is obtained by projecting $x$ onto the eigenvectors $\varphi_i$ corresponding to the largest eigenvalues $\lambda_i$ of the covariance matrix $\Sigma_x$

* Here, optimality is defined as the minimum of the sum-squared magnitude of the approximation error
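
This statement translates directly into a few lines of NumPy. The sketch below is our own (the helper name pca_project is arbitrary): it estimates $\Sigma_x$ from data, projects the centered samples onto the eigenvectors with the largest eigenvalues, and checks that the reconstruction MSE is approximately the sum of the discarded eigenvalues.

```python
import numpy as np

def pca_project(X, M):
    """Project rows of X (n_samples x N) onto the M eigenvectors of the
    sample covariance matrix with the largest eigenvalues."""
    mu = X.mean(axis=0)
    Xc = X - mu                               # center (b_i = E[y_i] for discarded terms)
    Sigma = np.cov(Xc, rowvar=False)          # sample covariance Sigma_x
    lam, Phi = np.linalg.eigh(Sigma)          # Sigma_x is symmetric -> eigh
    order = np.argsort(lam)[::-1]             # eigenvalues in decreasing order
    W = Phi[:, order[:M]]                     # N x M matrix of eigenvectors phi_1..phi_M
    return Xc @ W, W, lam[order], mu

# Usage on correlated toy data
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 5))
Y, W, lam, mu = pca_project(X, M=2)

# Reconstruction from the reduced space; its MSE ~ sum of the discarded eigenvalues
X_hat = Y @ W.T + mu
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(np.round(mse, 3), np.round(lam[2:].sum(), 3))
```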
PCA …. Cntd.
NOTE:
– The main limitation of PCA is that it does not consider class
separability since it does not take into account the class label of
the feature vector
• PCA simply performs a coordinate rotation that aligns the
transformed axes with the directions of maximum variance
• There is no guarantee that the directions of maximum
variance will contain good features for discrimination
– Historical remarks
• Principal Components Analysis is the oldest technique in
multivariate analysis
• PCA is also known as the Karhunen-Loève transform
(communication theory)
• PCA was first introduced by Pearson in 1901, and it experienced several modifications until it was generalized by Loève in 1963
Example I
3D Gaussian distribution $N(\mu, \Sigma)$ with
$$\mu = \begin{bmatrix} 0 & 5 & 2 \end{bmatrix}^T \quad \text{and} \quad \Sigma = \begin{bmatrix} 25 & -1 & 7 \\ -1 & 4 & -4 \\ 7 & -4 & 10 \end{bmatrix}$$
– The three pairs of PCA projections are shown below
• Notice that PC1 has the largest variance, followed by PC2
• Also notice how PCA decorrelates the axes

[Figure: 3D scatter of the data over $(x_1, x_2, x_3)$ and the three pairwise projections onto the principal components: $(y_1, y_2)$, $(y_1, y_3)$, and $(y_2, y_3)$]
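
As a numerical companion to this example (our own sketch, using the mean and covariance given above): sampling from the Gaussian and projecting onto the eigenvectors of the sample covariance shows that the projected covariance is nearly diagonal, with variances in decreasing order.

```python
import numpy as np

mu = np.array([0.0, 5.0, 2.0])
Sigma = np.array([[25.0, -1.0,  7.0],
                  [-1.0,  4.0, -4.0],
                  [ 7.0, -4.0, 10.0]])

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mu, Sigma, size=10_000)

# PCA: eigendecomposition of the sample covariance, eigenvalues in decreasing order
Xc = X - X.mean(axis=0)
lam, Phi = np.linalg.eigh(np.cov(Xc, rowvar=False))
Y = Xc @ Phi[:, ::-1]

# The covariance of the projections is ~diagonal (PCA decorrelates the axes),
# and the diagonal decreases: var(PC1) >= var(PC2) >= var(PC3).
print(np.round(np.cov(Y, rowvar=False), 2))
```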
Example I …. Cntd.
The example in the next slide shows a projection of a 3D data set
into two dimensions
–Initially, there is no apparent structure in the dataset, except for
the elongation of the point cloud
–Choosing an appropriate rotation unveils the underlying structure
• You can think of this rotation as "walking around" the 3D
dataset, looking for the best viewpoint
–PCA can help find such underlying structure
• It selects a rotation such that most of the variability within
the data set is represented in the first few dimensions of the
rotated data
• In our 3D case, this may seem of little use
• However, when the data is highly multidimensional (tens of dimensions), this analysis is quite powerful
Example II

Example III
Problem statement
– Compute the PCA for the dataset
$$X = \{(1,2),\ (3,3),\ (3,5),\ (5,4),\ (5,6),\ (6,5),\ (8,7),\ (9,8)\}$$
– Let's first plot the data to get an idea of which solution we should expect

[Figure: scatter plot of the dataset over $(x_1, x_2)$ with the principal directions $v_1$ and $v_2$]

Solution
– The sample covariance is
$$\Sigma_x = \begin{bmatrix} 6.25 & 4.25 \\ 4.25 & 3.5 \end{bmatrix}$$
– The eigenvalues are the zeros of the characteristic equation
$$\Sigma_x v = \lambda v \;\Rightarrow\; \left| \Sigma_x - \lambda I \right| = 0 \;\Rightarrow\; \begin{vmatrix} 6.25 - \lambda & 4.25 \\ 4.25 & 3.5 - \lambda \end{vmatrix} = 0 \;\Rightarrow\; \boldsymbol{\lambda_1 = 9.34;\ \lambda_2 = 0.41}$$
Example III …. Cntd.
– The eigenvectors are the solutions of the system

𝑣11 𝜆 𝑣 𝑣
6.25 4.25 = 1 11 ⇒ 11 = 0.81
4.25 3.5 𝑣12 𝜆1𝑣12 𝑣12 0.59
𝑣21 𝜆 𝑣 𝑣
6.25 4.25 = 2 21 ⇒ 21 = −0.59
4.25 3.5 𝑣22 𝜆2𝑣22 𝑣22 0.81

– HINT: To solve each system manually, first assume that one of the
variables is equal to one (i.e. 𝑣𝑖1 = 1), then find the other one and
finally normalize the vector to make it unit-length
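
The numbers above can be double-checked with NumPy (our own sketch; note that the slide's covariance uses the $1/n$ normalization, so it is computed explicitly instead of with np.cov's default $1/(n-1)$):

```python
import numpy as np

X = np.array([(1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8)], dtype=float)

# Sample covariance with 1/n normalization, matching the slide's Sigma_x
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(X)
print(Sigma)                          # [[6.25 4.25] [4.25 3.5 ]]

lam, V = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
print(np.round(lam[::-1], 2))         # [9.34 0.41]
print(np.round(V[:, ::-1], 2))        # columns ~ [0.81, 0.59] and [-0.59, 0.81], up to sign
```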

