
Feature Selection Feature Extraction

Multivariate Parametric Methods

Steven J Zeil

Old Dominion Univ.

Fall 2010


Outline

1 Feature Selection

2 Feature Extraction
Principal Components Analysis (PCA)
Factor Analysis (FA)
Multidimensional Scaling (MDS)
Linear Discriminant Analysis (LDA)
Motivation

- Reduction in complexity of prediction and training
- Reduction in cost of data extraction
- Simpler models – reduced variance
- Easier to visualize & analyze results, identify outliers, etc.
Basic Approaches

Given an input population characterized by d attributes:

- Feature Selection: find k < d dimensions that give the most
  information. Discard the other d − k.
  - subset selection
- Feature Extraction: find k ≤ d dimensions that are linear
  combinations of the original d
  - Principal Components Analysis (unsupervised)
  - Related: Factor Analysis and Multidimensional Scaling
  - Linear Discriminant Analysis (supervised)
- The text also mentions nonlinear methods, Isometric Feature
  Mapping and Locally Linear Embedding, but gives too little
  information to really justify covering them.
Subset Selection

Assume we have a suitable error function and can evaluate it for a
variety of models (e.g., via cross-validation):

- Misclassification error for classification problems
- Mean-squared error for regression

We can't evaluate all 2^d subsets of d features, so we search greedily:

- Forward selection: Start with an empty feature set. Repeatedly add
  the feature that reduces the error the most. Stop when the decrease
  is insignificant.
- Backward selection: Start with all features. Repeatedly remove the
  feature whose removal decreases the error the most (or increases it
  the least). Stop when any further removal increases the error
  significantly.

Both directions are O(d²). This is hill-climbing: not guaranteed to
find the global optimum.
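The forward-selection loop described above can be sketched in a few lines. This is a minimal illustration, not the text's algorithm verbatim: the caller supplies a hypothetical `error_fn` (e.g., the cross-validated error of a model trained on the candidate columns), and the `min_improvement` parameter stands in for the "decrease is insignificant" stopping test.

```python
import numpy as np

def forward_selection(X, y, error_fn, min_improvement=1e-3):
    """Greedy forward subset selection.

    error_fn(X_subset, y) -> validation error for a model trained
    on the given columns (e.g., estimated by cross-validation).
    """
    d = X.shape[1]
    selected = []        # indices of the features chosen so far
    best_err = np.inf
    while len(selected) < d:
        # Try adding each remaining feature; keep the best candidate.
        trial_errs = {
            j: error_fn(X[:, selected + [j]], y)
            for j in range(d) if j not in selected
        }
        j_best = min(trial_errs, key=trial_errs.get)
        if best_err - trial_errs[j_best] < min_improvement:
            break        # decrease is insignificant: stop
        selected.append(j_best)
        best_err = trial_errs[j_best]
    return selected, best_err
```

Backward selection is the mirror image: start from all d features and repeatedly drop the feature whose removal hurts the error least.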
Notes

- A variant, floating search, adds multiple features at once, then
  backtracks to see which features can be removed.
- Selection is less useful in very high-dimension problems where
  individual features are of limited use but clusters of features
  are significant.

Principal Components Analysis (PCA)

Find a mapping z = Ax onto a lower-dimension space.

- Unsupervised method: seeks to maximize the variance of the
  projected data
- Intuitively: try to spread the points apart as far as possible
1st Principal Component

Assume x ∼ N(μ, Σ). Then

    wᵀx ∼ N(wᵀμ, wᵀΣw)

Find z₁ = w₁ᵀx, with ‖w₁‖ = 1, that maximizes Var(z₁) = w₁ᵀΣw₁.

- Maximize w₁ᵀΣw₁ − α(w₁ᵀw₁ − 1) with α ≥ 0 (Lagrange multiplier)
- Solution: Σw₁ = αw₁
- This is an eigenvalue problem on Σ. We want the solution
  (eigenvector) corresponding to the largest eigenvalue α.
2nd Principal Component

Next find z₂ = w₂ᵀx, with w₂ᵀw₂ = 1 and w₂ᵀw₁ = 0, that maximizes
Var(z₂) = w₂ᵀΣw₂.

- Solution: Σw₂ = α₂w₂
- Choose the solution (eigenvector) corresponding to the 2nd largest
  eigenvalue α₂.
- Because Σ is symmetric, its eigenvectors are mutually orthogonal.
Visualizing PCA

    z = Wᵀ(x − m)

where the columns of W are the chosen eigenvectors and m is the
sample mean.
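The whole pipeline, from the covariance estimate through the projection z = Wᵀ(x − m), is a short eigendecomposition exercise. Here is a minimal NumPy sketch; the function name `pca` and its interface are illustrative, not from the text:

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the k leading principal components.

    Returns (Z, W, lam): projected data, eigenvectors (as columns),
    and eigenvalues sorted in decreasing order.
    """
    m = X.mean(axis=0)                 # sample mean
    S = np.cov(X, rowvar=False)        # estimate of Sigma
    lam, W = np.linalg.eigh(S)         # eigh: S is symmetric
    order = np.argsort(lam)[::-1]      # largest eigenvalue first
    lam, W = lam[order], W[:, order]
    Z = (X - m) @ W[:, :k]             # z = W^T (x - m)
    return Z, W, lam
```

Because `np.linalg.eigh` returns eigenvalues in ascending order, the sketch reorders them so that the first column of W is the 1st principal component.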
Is Spreading the Space Enough?

Although we can argue that spreading the points leads to a
better-conditioned problem, what does this have to do with reducing
dimensionality?
Detecting Linear Dependencies

Suppose that some subset of the inputs is linearly correlated:

    ∃q | qᵀx = 0

Then Σ is singular:

    E[qᵀx − qᵀμ] = 0
    Σq = 0

- q is an eigenvector of the problem Σw = αw with α = 0: the last
  eigenvector(s) we would consider using.
- Flip side: PCA can be overly sensitive to scaling issues
  [normalize] and to outliers.
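A quick numerical check of this point: make one input an exact linear combination of two others, and the smallest eigenvalue of the estimated covariance drops to (numerically) zero, with the corresponding eigenvector pointing along the dependency. This toy demo is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(300, 2))
X = np.column_stack([A, A[:, 0] + A[:, 1]])   # x3 = x1 + x2 exactly
S = np.cov(X, rowvar=False)                   # singular covariance
lam, V = np.linalg.eigh(S)                    # ascending eigenvalues
q = V[:, 0]   # eigenvector for the ~0 eigenvalue: proportional to (1, 1, -1)
```

Since q · x = x1 + x2 − x3 = 0 for every sample, Var(qᵀx) = qᵀΣq = 0, which is exactly the Σq = 0 case above.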
When to Stop?

Proportion of Variance (PoV) accounted for by the first k of the
eigenvalues λ₁, λ₂, …, λ_d:

    PoV(k) = (λ₁ + ⋯ + λₖ) / (λ₁ + ⋯ + λ_d)

- Plot and look for an elbow.
- Typically stop around PoV = 0.9.
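The PoV rule is one cumulative sum. A minimal helper (the name `choose_k` and its interface are mine, not the text's):

```python
import numpy as np

def choose_k(eigenvalues, threshold=0.9):
    """Smallest k whose Proportion of Variance reaches `threshold`."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # decreasing
    pov = np.cumsum(lam) / lam.sum()   # PoV(1), PoV(2), ..., PoV(d)
    return int(np.searchsorted(pov, threshold) + 1)
```

For example, `choose_k([12.0, 3.0, 1.0])` keeps k = 2, since the first two eigenvalues carry over 90% of the total variance.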
PoV

[figure: PoV plotted against the number of eigenvectors]
PCA & Visualization

If the first two eigenvectors account for the majority of the
variance, plot the data using symbols for classes or other features,
and visually search for patterns.
Factor Analysis (FA)

A kind of “inverted” PCA: find a set of factors z that can be
combined to generate x:

    xᵢ − μᵢ = (Σⱼ₌₁ᵏ vᵢⱼ zⱼ) + εᵢ

- zⱼ are latent factors:
  E[zⱼ] = 0, Var(zⱼ) = 1, Cov(zᵢ, zⱼ) = 0 for i ≠ j
- εᵢ are noise sources:
  E[εᵢ] = 0, Var(εᵢ) = φᵢ, Cov(εᵢ, εⱼ) = 0 for i ≠ j
- Cov(εᵢ, zⱼ) = 0
- vᵢⱼ are factor loadings
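A sanity check on the model, not an estimation procedure (fitting the loadings is a separate topic): under the assumptions above, the covariance of x implied by the factor model is V Vᵀ + Ψ, where V holds the loadings and Ψ is the diagonal matrix of noise variances. We can confirm this by simulation:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, n = 5, 2, 100_000
V = rng.normal(size=(d, k))            # factor loadings v_ij
psi = rng.uniform(0.1, 0.5, size=d)    # noise variances Var(eps_i)
Z = rng.normal(size=(n, k))            # latent factors: mean 0, var 1
eps = rng.normal(size=(n, d)) * np.sqrt(psi)
X = Z @ V.T + eps                      # x - mu = V z + eps (mu = 0 here)
S = np.cov(X, rowvar=False)
# The sample covariance approaches V V^T + diag(psi) as n grows.
gap = np.max(np.abs(S - (V @ V.T + np.diag(psi))))
```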
PCA vs FA

[figure: PCA maps from x to z; FA generates x from z]
Multidimensional Scaling (MDS)

Given the pairwise distances dᵢⱼ between N points, place those points
on a low-dimension map, preserving the distances:

    z = g(x|θ)

Choose θ to minimize the Sammon stress

    E(θ|X) = Σᵣ,ₛ (‖zʳ − zˢ‖ − ‖xʳ − xˢ‖)² / ‖xʳ − xˢ‖
           = Σᵣ,ₛ (‖g(xʳ|θ) − g(xˢ|θ)‖ − ‖xʳ − xˢ‖)² / ‖xʳ − xˢ‖

Use regression methods for g, using the above as the error function
to be minimized.
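The stress itself is straightforward to compute for a candidate mapping. A direct O(N²) sketch, summing over distinct pairs r < s:

```python
import numpy as np

def sammon_stress(X, Z):
    """Sammon stress between original points X and mapped points Z.

    Zero when the mapping preserves every pairwise distance exactly;
    positive otherwise.
    """
    n = X.shape[0]
    E = 0.0
    for r in range(n):
        for s in range(r + 1, n):
            dx = np.linalg.norm(X[r] - X[s])   # distance in input space
            dz = np.linalg.norm(Z[r] - Z[s])   # distance on the map
            E += (dz - dx) ** 2 / dx
    return E
```

An identity mapping gives zero stress; any map that distorts distances (e.g., uniform scaling by 2) gives a positive value, which is what a regression fit for g would drive down.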
Linear Discriminant Analysis (LDA)

Supervised method: find a projection of x onto a low-dimension space
where the classes are well-separated.

Find w maximizing

    J(w) = (m₁ − m₂)² / (s₁² + s₂²)

where

    mᵢ = wᵀmᵢ                  (mᵢ on the right is the mean vector of class i)
    sᵢ² = Σₜ (wᵀxᵗ − mᵢ)² rᵢᵗ   (rᵢᵗ = 1 iff xᵗ belongs to class i)
Scatter

    J(w) = (m₁ − m₂)² / (s₁² + s₂²)

The numerator can be rewritten as

    (m₁ − m₂)² = (wᵀm₁ − wᵀm₂)²
               = wᵀ S_B w

where S_B = (m₁ − m₂)(m₁ − m₂)ᵀ is the between-class scatter matrix
(here m₁, m₂ are the class mean vectors). Similarly,

    s₁² + s₂² = wᵀ S_W w

where S_W = S₁ + S₂ is the within-class scatter matrix.