Feature Extraction, Curse of Dimensionality, PCA

The curse of dimensionality
What does the curse of dimensionality mean, in practice?
– For a given sample size, there is a maximum number of features above
which the performance of our classifier will degrade rather than
improve
– In most cases, the additional information that is lost by discarding
some features is (more than) compensated by a more accurate
mapping in the lower-dimensional space
[Figure: classifier performance as a function of dimensionality, for a fixed sample size]

Dimensionality reduction
Two approaches are available to reduce dimensionality
– Feature extraction: creating a subset of new features by combinations
of the existing features
– Feature selection: choosing a subset of all the features
$$\text{Feature selection:}\quad \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} \rightarrow \begin{bmatrix} x_{i_1} \\ x_{i_2} \\ \vdots \\ x_{i_M} \end{bmatrix} \qquad\qquad \text{Feature extraction:}\quad \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} \rightarrow \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = f\!\left( \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} \right)$$
The problem of feature extraction can be stated as
– Given a feature space $x_i \in \mathbb{R}^N$, find a mapping $y = f(x): \mathbb{R}^N \rightarrow \mathbb{R}^M$ with $M < N$ such that the transformed feature vector $y \in \mathbb{R}^M$ preserves (most of) the information or structure in $\mathbb{R}^N$
– An optimal mapping $y = f(x)$ is one that does not increase $P(\mathrm{error})$
– That is, a Bayes decision rule applied to the initial space $\mathbb{R}^N$ and to the reduced space $\mathbb{R}^M$ yields the same classification rate
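
To make the distinction concrete, here is a minimal NumPy sketch (not from the original slides; the index set and the map `f` are arbitrary illustrations): feature selection keeps a subset of the original coordinates, whereas feature extraction builds each new feature from all of them.

```python
import numpy as np

# One sample x in R^N (N = 4); the values are arbitrary illustrations.
x = np.array([2.0, -1.0, 0.5, 3.0])

# Feature selection: keep a subset of the original features (here x_1 and x_4).
selected = [0, 3]                      # hypothetical index set {i_1, ..., i_M}
x_sel = x[selected]                    # lives in R^M, M = 2

# Feature extraction: each new feature combines ALL original features.
def f(x):
    # a hypothetical map R^N -> R^M; any such function would do
    return np.array([0.5 * x[0] + 0.5 * x[1],
                     0.7 * x[2] + 0.7 * x[3]])

y = f(x)                               # y in R^M, M = 2
print(x_sel, y)
```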

Dimensionality reduction …. Cntd.
Linear dimensionality reduction
– In general, the optimal mapping 𝑦 = 𝑓(𝑥) will be a non-linear function
• However, there is no systematic way to generate non-linear transforms
• The selection of a particular subset of transforms is problem dependent
– For these reasons, feature extraction is commonly based on linear transforms of the form $y = Wx$
$$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} \;\rightarrow\; \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1N} \\ \vdots & & \ddots & \vdots \\ w_{M1} & w_{M2} & \cdots & w_{MN} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}$$
• NOTE: When the mapping is a non-linear function, the reduced space is
called a manifold
– We will focus on linear feature extraction for now, and revisit non-linear techniques when we cover multi-layer perceptrons, manifold learning, and kernel methods
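
A quick NumPy sketch of the shapes involved (our own example, not from the slides): a hypothetical $M \times N$ matrix $W$ maps each $N$-dimensional sample to an $M$-dimensional one.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3                      # original and reduced dimensionality
X = rng.normal(size=(500, N))     # 500 toy samples, one per row

W = rng.normal(size=(M, N))       # a generic linear transform (not yet the PCA W)

# y = W x for every sample; with row-vector samples this is X @ W.T
Y = X @ W.T
print(Y.shape)                    # (500, M): each sample is now M-dimensional
```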
Signal representation versus classification
Finding the mapping 𝑦 = 𝑓(𝑥) is guided by an objective
function that we seek to maximize (or minimize)
–Depending on the criteria used by the objective function, feature
extraction techniques are grouped into two categories:
• Signal representation: The goal of the feature extraction
mapping is to represent the samples accurately in a lower-
dimensional space
• Classification: The goal of the feature extraction mapping is to
enhance the class-discriminatory information in the lower-
dimensional space

– Within the realm of linear feature extraction, two techniques are commonly used
• Principal components analysis (PCA): uses a signal representation criterion
• Linear discriminant analysis (LDA): uses a signal classification criterion

[Figure: scatter plot of two classes over Feature 1 and Feature 2, illustrating the two criteria]
Principal components analysis (PCA)
PCA seeks to preserve as much of the randomness (variance) in the high-dimensional space as possible
– Let $x \in \mathbb{R}^N$ be represented as a linear combination of orthonormal basis vectors $\varphi_1, \varphi_2, \ldots, \varphi_N$ as
$$x = \sum_{i=1}^{N} y_i \varphi_i \quad \text{where} \quad \varphi_i^T \varphi_j = \begin{cases} 0, & i \neq j \\ 1, & i = j \end{cases}$$
– Suppose we want to represent $x$ with only $M$ ($M < N$) basis vectors
– We can do this by replacing the components $[y_{M+1}, \ldots, y_N]^T$ with some pre-selected constants $b_i$
$$\hat{x}(M) = \sum_{i=1}^{M} y_i \varphi_i + \sum_{i=M+1}^{N} b_i \varphi_i$$
– The “representation error” is then
$$\Delta x(M) = x - \hat{x}(M) = \sum_{i=1}^{N} y_i \varphi_i - \left( \sum_{i=1}^{M} y_i \varphi_i + \sum_{i=M+1}^{N} b_i \varphi_i \right) = \sum_{i=M+1}^{N} (y_i - b_i)\, \varphi_i$$
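
The truncated expansion can be checked numerically. The following sketch (our own, using an arbitrary random orthonormal basis and $b_i = 0$) verifies that the residual equals $\sum_{i=M+1}^{N}(y_i - b_i)\varphi_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 3

# An orthonormal basis: the Q factor of a random matrix (columns are phi_1..phi_N).
Phi, _ = np.linalg.qr(rng.normal(size=(N, N)))

x = rng.normal(size=N)
y = Phi.T @ x                        # coefficients y_i = phi_i^T x
b = np.zeros(N)                      # pre-selected constants for the discarded terms

# Truncated representation: keep the first M coefficients, replace the rest by b_i.
x_hat = Phi[:, :M] @ y[:M] + Phi[:, M:] @ b[M:]

# The residual equals sum_{i>M} (y_i - b_i) phi_i, as in the derivation above.
delta = x - x_hat
print(np.allclose(delta, Phi[:, M:] @ (y[M:] - b[M:])))   # True
```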
PCA …. Cntd.
– We can measure this “representation error” by the mean-squared magnitude of $\Delta x$
– Our goal is to find the basis vectors $\varphi_i$ and constants $b_i$ that minimize this mean-square error
$$\bar{\epsilon}^2(M) = E\!\left[ \left\| \Delta x(M) \right\|^2 \right] = E\!\left[ \sum_{i=M+1}^{N} \sum_{j=M+1}^{N} (y_i - b_i)(y_j - b_j)\, \varphi_i^T \varphi_j \right] = \sum_{i=M+1}^{N} E\!\left[ (y_i - b_i)^2 \right]$$

– To find the optimal values of $b_i$ we compute the partial derivative of the objective function and equate it to zero
$$\frac{\partial}{\partial b_i} E\!\left[ (y_i - b_i)^2 \right] = -2\, E\!\left[ y_i - b_i \right] = 0 \;\Rightarrow\; b_i = E\!\left[ y_i \right]$$
– Therefore, we will replace the discarded dimensions by their expected value, which is an intuitive result
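
A small numerical check of this result (our own sketch, with an arbitrary distribution for $y_i$): sweeping candidate constants $b$ shows the mean-squared error is minimized near the sample mean of $y_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
y_i = rng.normal(loc=3.0, scale=2.0, size=10_000)   # samples of a discarded component y_i

# Mean-squared error of replacing y_i by a constant b, over a grid of candidates.
candidates = np.linspace(0.0, 6.0, 601)
mse = [np.mean((y_i - b) ** 2) for b in candidates]

best_b = candidates[int(np.argmin(mse))]
print(best_b, y_i.mean())   # the minimizing constant is (approximately) E[y_i]
```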
PCA …. Cntd.
– The MSE can then be written as
$$\bar{\epsilon}^2(M) = \sum_{i=M+1}^{N} E\!\left[ (y_i - E[y_i])^2 \right] = \sum_{i=M+1}^{N} E\!\left[ \left( \varphi_i^T x - E\!\left[ \varphi_i^T x \right] \right)^2 \right] = \sum_{i=M+1}^{N} \varphi_i^T \, E\!\left[ (x - E[x])(x - E[x])^T \right] \varphi_i = \sum_{i=M+1}^{N} \varphi_i^T \Sigma_x \varphi_i$$
– We seek the solution that minimizes this expression, subject to the
orthonormality constraint, which we incorporate into the expression
using a set of Lagrange multipliers 𝜆𝑖
$$\bar{\epsilon}^2(M) = \sum_{i=M+1}^{N} \varphi_i^T \Sigma_x \varphi_i + \sum_{i=M+1}^{N} \lambda_i \left( 1 - \varphi_i^T \varphi_i \right)$$
– Computing the partial derivative with respect to the basis vectors
$$\frac{\partial \bar{\epsilon}^2(M)}{\partial \varphi_i} = \frac{\partial}{\partial \varphi_i} \left[ \sum_{i=M+1}^{N} \varphi_i^T \Sigma_x \varphi_i + \sum_{i=M+1}^{N} \lambda_i \left( 1 - \varphi_i^T \varphi_i \right) \right] = 2\left( \Sigma_x \varphi_i - \lambda_i \varphi_i \right) = 0 \;\Rightarrow\; \Sigma_x \varphi_i = \lambda_i \varphi_i$$
• NOTE: $\frac{d}{dx}\left( x^T A x \right) = \left( A + A^T \right) x = 2Ax$ for $A$ symmetric

– So $\boldsymbol{\varphi_i}$ and $\boldsymbol{\lambda_i}$ are the eigenvectors and eigenvalues of $\boldsymbol{\Sigma_x}$
PCA …. Cntd.
– We can then express the sum-squared error as
$$\bar{\epsilon}^2(M) = \sum_{i=M+1}^{N} \varphi_i^T \Sigma_x \varphi_i = \sum_{i=M+1}^{N} \varphi_i^T \lambda_i \varphi_i = \sum_{i=M+1}^{N} \lambda_i$$
– In order to minimize this measure, the $\lambda_i$ in the sum must be the smallest eigenvalues
– Therefore, to represent $x$ with minimum MSE, we will choose the eigenvectors $\varphi_i$ corresponding to the largest eigenvalues $\lambda_i$

PCA dimensionality reduction

The optimal* approximation of a random vector $x \in \mathbb{R}^N$ by a linear combination of $M < N$ independent vectors is obtained by projecting $x$ onto the eigenvectors $\varphi_i$ corresponding to the largest eigenvalues $\lambda_i$ of the covariance matrix $\Sigma_x$

* Here, optimality is defined as the minimum of the sum-squared magnitude of the approximation error
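
This statement translates directly into a few lines of NumPy. The sketch below is our own (the helper name pca_project is arbitrary): it estimates $\Sigma_x$ from data, projects the centered samples onto the eigenvectors with the largest eigenvalues, and checks that the reconstruction MSE is approximately the sum of the discarded eigenvalues.

```python
import numpy as np

def pca_project(X, M):
    """Project rows of X (n_samples x N) onto the M eigenvectors of the
    sample covariance matrix with the largest eigenvalues."""
    mu = X.mean(axis=0)
    Xc = X - mu                               # center (b_i = E[y_i] for discarded terms)
    Sigma = np.cov(Xc, rowvar=False)          # sample covariance Sigma_x
    lam, Phi = np.linalg.eigh(Sigma)          # Sigma_x is symmetric -> eigh
    order = np.argsort(lam)[::-1]             # eigenvalues in decreasing order
    W = Phi[:, order[:M]]                     # N x M matrix of eigenvectors phi_1..phi_M
    return Xc @ W, W, lam[order], mu

# Usage on correlated toy data
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 5))
Y, W, lam, mu = pca_project(X, M=2)

# Reconstruction from the reduced space; its MSE ~ sum of the discarded eigenvalues
X_hat = Y @ W.T + mu
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(np.round(mse, 3), np.round(lam[2:].sum(), 3))
```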
PCA …. Cntd.
NOTE:
– The main limitation of PCA is that it does not consider class
separability since it does not take into account the class label of
the feature vector
• PCA simply performs a coordinate rotation that aligns the
transformed axes with the directions of maximum variance
• There is no guarantee that the directions of maximum
variance will contain good features for discrimination
– Historical remarks
• Principal Components Analysis is the oldest technique in
multivariate analysis
• PCA is also known as the Karhunen-Loève transform
(communication theory)
• PCA was first introduced by Pearson in 1901, and it experienced several modifications until it was generalized by Loève in 1963
Example I
3D Gaussian distribution $N(\mu, \Sigma)$ with
$$\mu = \begin{bmatrix} 0 & 5 & 2 \end{bmatrix}^T \quad \text{and} \quad \Sigma = \begin{bmatrix} 25 & -1 & 7 \\ -1 & 4 & -4 \\ 7 & -4 & 10 \end{bmatrix}$$
– The three pairs of PCA projections are shown below
• Notice that PC1 has the largest variance, followed by PC2
• Also notice how PCA decorrelates the axes

[Figure: 3D scatter of the data over $(x_1, x_2, x_3)$ and the three pairwise projections onto the principal components: $(y_1, y_2)$, $(y_1, y_3)$, and $(y_2, y_3)$]
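
As a numerical companion to this example (our own sketch, using the mean and covariance given above): sampling from the Gaussian and projecting onto the eigenvectors of the sample covariance shows that the projected covariance is nearly diagonal, with variances in decreasing order.

```python
import numpy as np

mu = np.array([0.0, 5.0, 2.0])
Sigma = np.array([[25.0, -1.0,  7.0],
                  [-1.0,  4.0, -4.0],
                  [ 7.0, -4.0, 10.0]])

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mu, Sigma, size=10_000)

# PCA: eigendecomposition of the sample covariance, eigenvalues in decreasing order
Xc = X - X.mean(axis=0)
lam, Phi = np.linalg.eigh(np.cov(Xc, rowvar=False))
Y = Xc @ Phi[:, ::-1]

# The covariance of the projections is ~diagonal (PCA decorrelates the axes),
# and the diagonal decreases: var(PC1) >= var(PC2) >= var(PC3).
print(np.round(np.cov(Y, rowvar=False), 2))
```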
Example I …. Cntd.
The example in the next slide shows a projection of a 3D data set
into two dimensions
–Initially, there is no apparent structure in the dataset, except for
the elongation of the point cloud
–Choosing an appropriate rotation unveils the underlying structure
• You can think of this rotation as "walking around" the 3D
dataset, looking for the best viewpoint
–PCA can help find such underlying structure
• It selects a rotation such that most of the variability within
the data set is represented in the first few dimensions of the
rotated data
• In our 3D case, this may seem of little use
• However, when the data is highly multidimensional (tens of dimensions), this analysis is quite powerful
Example II

Example III
Problem statement
– Compute the PCA for the dataset
$$X = \{(1,2),\ (3,3),\ (3,5),\ (5,4),\ (5,6),\ (6,5),\ (8,7),\ (9,8)\}$$
– Let's first plot the data to get an idea of which solution we should expect

[Figure: scatter plot of the dataset over $(x_1, x_2)$ with the principal directions $v_1$ and $v_2$]

Solution
– The sample covariance is
$$\Sigma_x = \begin{bmatrix} 6.25 & 4.25 \\ 4.25 & 3.5 \end{bmatrix}$$
– The eigenvalues are the zeros of the characteristic equation
$$\Sigma_x v = \lambda v \;\Rightarrow\; \left| \Sigma_x - \lambda I \right| = 0 \;\Rightarrow\; \begin{vmatrix} 6.25 - \lambda & 4.25 \\ 4.25 & 3.5 - \lambda \end{vmatrix} = 0 \;\Rightarrow\; \boldsymbol{\lambda_1 = 9.34;\ \lambda_2 = 0.41}$$
Example III …. Cntd.
– The eigenvectors are the solutions of the system

𝑣11 𝜆 𝑣 𝑣
6.25 4.25 = 1 11 ⇒ 11 = 0.81
4.25 3.5 𝑣12 𝜆1𝑣12 𝑣12 0.59
𝑣21 𝜆 𝑣 𝑣
6.25 4.25 = 2 21 ⇒ 21 = −0.59
4.25 3.5 𝑣22 𝜆2𝑣22 𝑣22 0.81

– HINT: To solve each system manually, first assume that one of the
variables is equal to one (i.e. 𝑣𝑖1 = 1), then find the other one and
finally normalize the vector to make it unit-length
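
The numbers above can be double-checked with NumPy (our own sketch; note that the slide's covariance uses the $1/n$ normalization, so it is computed explicitly instead of with np.cov's default $1/(n-1)$):

```python
import numpy as np

X = np.array([(1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8)], dtype=float)

# Sample covariance with 1/n normalization, matching the slide's Sigma_x
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(X)
print(Sigma)                          # [[6.25 4.25] [4.25 3.5 ]]

lam, V = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
print(np.round(lam[::-1], 2))         # [9.34 0.41]
print(np.round(V[:, ::-1], 2))        # columns ~ [0.81, 0.59] and [-0.59, 0.81], up to sign
```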

