Multivariate Parametric Methods
Steven J Zeil
Fall 2010
Outline

1. Feature Selection
2. Feature Extraction
   - Principal Components Analysis (PCA)
   - Factor Analysis (FA)
   - Multidimensional Scaling (MDS)
   - Linear Discriminants Analysis (LDA)
Feature Selection
Motivation
Basic Approaches

Feature selection: choose a subset of $k < d$ of the original features and discard the rest (subset selection).

Feature extraction: map the original $d$ features onto $k < d$ new dimensions computed from them (e.g., PCA, FA, MDS, LDA).
Subset Selection

Assume we have a suitable error function and can evaluate it for a variety of models (e.g., by cross-validation):

- Misclassification error for classification problems
- Mean-squared error for regression

We can't evaluate all $2^d$ subsets of $d$ features, so we search greedily:

Forward selection: Start with an empty feature set. Repeatedly add the feature that reduces the error the most. Stop when the decrease is insignificant.

Backward selection: Start with all features. Repeatedly remove the feature whose removal decreases the error the most (or increases it the least). Stop when any further removal increases the error significantly.

Both directions are $O(d^2)$. Both are hill-climbing searches, so neither is guaranteed to find the global optimum. A forward-selection sketch appears below.
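To make the greedy search concrete, here is a minimal sketch of forward selection, assuming scikit-learn is available for cross-validation; the estimator, the `min_gain` threshold, and 5-fold CV are illustrative choices, not part of the original slides.

```python
# Hedged sketch: greedy forward selection scored by cross-validation.
# `estimator`, `min_gain`, and cv=5 are illustrative assumptions.
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, estimator, min_gain=1e-3, cv=5):
    """Greedily add the feature whose addition improves the CV score most."""
    d = X.shape[1]
    selected, best_score = [], -np.inf
    while len(selected) < d:
        candidates = []
        for j in range(d):
            if j in selected:
                continue
            cols = selected + [j]
            score = cross_val_score(estimator, X[:, cols], y, cv=cv).mean()
            candidates.append((score, j))
        score, j = max(candidates)          # best single-feature addition
        if score - best_score < min_gain:   # improvement insignificant: stop
            break
        selected.append(j)
        best_score = score
    return selected
```

Backward selection is symmetric: start from all $d$ features and repeatedly drop the one whose removal hurts the score the least.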
Feature Extraction

Principal Components Analysis (PCA)
Assume $\vec{x} \sim N(\vec{\mu}, \Sigma)$. Then
$$\vec{w}^T\vec{x} \sim N(\vec{w}^T\vec{\mu},\ \vec{w}^T\Sigma\vec{w})$$

Find $z_1 = \vec{w}_1^T\vec{x}$, with $\vec{w}_1^T\vec{w}_1 = 1$, that maximizes
$$\mathrm{Var}(z_1) = \vec{w}_1^T\Sigma\vec{w}_1$$

Using a Lagrange multiplier $\alpha \ge 0$, find
$$\max_{\vec{w}_1}\ \vec{w}_1^T\Sigma\vec{w}_1 - \alpha(\vec{w}_1^T\vec{w}_1 - 1)$$

Solution: $\Sigma\vec{w}_1 = \alpha\vec{w}_1$

This is an eigenvalue problem on $\Sigma$. We want the solution (eigenvector) corresponding to the largest eigenvalue $\alpha$.
Next find $z_2 = \vec{w}_2^T\vec{x}$, with $\vec{w}_2^T\vec{w}_2 = 1$ and $\vec{w}_2^T\vec{w}_1 = 0$, that maximizes
$$\mathrm{Var}(z_2) = \vec{w}_2^T\Sigma\vec{w}_2$$

Solution: $\Sigma\vec{w}_2 = \alpha_2\vec{w}_2$, where we choose the solution (eigenvector) corresponding to the 2nd largest eigenvalue $\alpha_2$.

Because $\Sigma$ is symmetric, its eigenvectors are mutually orthogonal. A NumPy sketch of this construction follows.
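Below is a minimal NumPy sketch of this eigendecomposition view of PCA; estimating $\vec{\mu}$ and $\Sigma$ by the sample mean and covariance is an assumption of the sketch, and the function name is illustrative.

```python
# Hedged sketch: PCA via the eigenvalue problem on the (sample) covariance.
import numpy as np

def pca(X, k):
    """Project rows of X onto the k eigenvectors of the sample
    covariance having the largest eigenvalues."""
    m = X.mean(axis=0)                  # sample mean, estimates mu
    Sigma = np.cov(X, rowvar=False)     # sample covariance, estimates Sigma
    alphas, W = np.linalg.eigh(Sigma)   # eigh: Sigma is symmetric
    order = np.argsort(alphas)[::-1]    # sort eigenvalues, largest first
    W = W[:, order[:k]]                 # columns are w_1 .. w_k
    return (X - m) @ W                  # z = W^T (x - m), row-wise
```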
Visualizing PCA

$$\vec{z} = W^T(\vec{x} - \vec{m})$$
What if there exists $\vec{q}$ such that $\vec{q}^T\vec{x} = 0$ (a linear dependency among the features)? Then $\Sigma$ is singular:
$$E[\vec{q}^T\vec{x} - \vec{q}^T\vec{\mu}] = 0, \qquad \mathrm{Var}(\vec{q}^T\vec{x}) = \vec{q}^T\Sigma\vec{q} = 0 \quad\Rightarrow\quad \Sigma\vec{q} = 0$$
so $\vec{q}$ is an eigenvector of the problem $\Sigma\vec{w} = \alpha\vec{w}$ with $\alpha = 0$: among the last eigenvectors we would consider using. PCA therefore discards redundant directions automatically.

Flip side: PCA can be overly sensitive to scaling issues [normalize] and to outliers.
When to Stop?

Order the eigenvalues $\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_d$. The Proportion of Variance (PoV) explained by the first $k$ principal components is
$$\mathrm{PoV} = \frac{\alpha_1 + \alpha_2 + \cdots + \alpha_k}{\alpha_1 + \alpha_2 + \cdots + \alpha_d}$$
Stop adding components when the PoV is high enough (0.9 is a typical threshold) or when the eigenvalues level off. A small sketch of this stopping rule follows.
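A minimal sketch of the PoV stopping rule, assuming the eigenvalues have already been computed; the function name and the 0.9 default are illustrative.

```python
import numpy as np

def choose_k(alphas, threshold=0.9):
    """Smallest k whose leading eigenvalues explain `threshold` of the variance."""
    a = np.sort(np.asarray(alphas, dtype=float))[::-1]  # largest first
    pov = np.cumsum(a) / a.sum()                        # PoV for k = 1..d
    return int(np.searchsorted(pov, threshold)) + 1     # first k with PoV >= threshold
```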
PCA Visualization
Factor Analysis (FA)

Model each observed variable as a combination of a small number $k$ of latent factors $z_j$ plus a per-variable noise term:
$$x_i - \mu_i = \sum_{j=1}^{k} v_{ij} z_j + \varepsilon_i, \qquad \mathrm{Cov}(\varepsilon_i, z_j) = 0$$
The $v_{ij}$ are the factor loadings. A simulation sketch follows.
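A minimal sketch that simulates data from this factor model in NumPy, mainly to make the roles of $v_{ij}$, $z_j$, and $\varepsilon_i$ concrete; all sizes, loadings, and noise scales are illustrative assumptions.

```python
# Hedged sketch: generate samples from x_i = mu_i + sum_j v_ij z_j + eps_i.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 2, 1000                       # illustrative sizes
V = rng.normal(size=(d, k))                # factor loadings v_ij (illustrative)
mu = rng.normal(size=d)                    # means mu_i
Z = rng.normal(size=(n, k))                # latent factors: E[z]=0, Var(z)=1
eps = rng.normal(scale=0.1, size=(n, d))   # noise, independent of z
X = mu + Z @ V.T + eps                     # observed data, one sample per row
```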
PCA vs FA
Multidimensional Scaling (MDS)

Map the data to a low-dimensional space via $\vec{z} = \vec{g}(\vec{x}|\theta)$, choosing $\theta$ to minimize the Sammon stress:
$$E(\theta|X) = \sum_{r,s} \frac{\left(\|\vec{z}^r - \vec{z}^s\| - \|\vec{x}^r - \vec{x}^s\|\right)^2}{\|\vec{x}^r - \vec{x}^s\|}
= \sum_{r,s} \frac{\left(\|\vec{g}(\vec{x}^r|\theta) - \vec{g}(\vec{x}^s|\theta)\| - \|\vec{x}^r - \vec{x}^s\|\right)^2}{\|\vec{x}^r - \vec{x}^s\|}$$
A sketch for evaluating this stress follows.
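A minimal sketch for evaluating the Sammon stress of a candidate embedding `Z` against data `X`, in NumPy; each unordered pair is counted once, explicit loops are kept for clarity, and distinct points (nonzero pairwise distances) are assumed.

```python
import numpy as np

def sammon_stress(X, Z):
    """Sum over pairs r < s of (||z_r - z_s|| - ||x_r - x_s||)^2 / ||x_r - x_s||."""
    n = X.shape[0]
    E = 0.0
    for r in range(n):
        for s in range(r + 1, n):
            dx = np.linalg.norm(X[r] - X[s])  # distance in the original space
            dz = np.linalg.norm(Z[r] - Z[s])  # distance in the embedded space
            E += (dz - dx) ** 2 / dx          # assumes dx > 0 (distinct points)
    return E
```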
Linear Discriminants Analysis (LDA)

A supervised method: find a projection of $\vec{x}$ onto a low-dimensional space where the classes are well-separated.

For two classes, find $\vec{w}$ maximizing
$$J(\vec{w}) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$
where
$$m_i = \vec{w}^T\vec{m}_i, \qquad s_i^2 = \sum_t (\vec{w}^T\vec{x}^t - m_i)^2\, r_i^t$$
and $r_i^t = 1$ if $\vec{x}^t$ belongs to class $i$, 0 otherwise.
Scatter

$$J(\vec{w}) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$$

The numerator expands as
$$(m_1 - m_2)^2 = (\vec{w}^T\vec{m}_1 - \vec{w}^T\vec{m}_2)^2 = \vec{w}^T S_B \vec{w}$$
where $S_B = (\vec{m}_1 - \vec{m}_2)(\vec{m}_1 - \vec{m}_2)^T$ is the between-class scatter.

Similarly,
$$s_1^2 + s_2^2 = \vec{w}^T S_W \vec{w}$$
where $S_W = S_1 + S_2$ is the within-class scatter. A NumPy sketch built from these scatter matrices follows.
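A minimal NumPy sketch of the two-class case built from these scatter matrices; it uses the standard closed-form maximizer $\vec{w} \propto S_W^{-1}(\vec{m}_1 - \vec{m}_2)$ of $J(\vec{w})$ (not derived on these slides) and assumes $S_W$ is nonsingular.

```python
import numpy as np

def fisher_lda(X1, X2):
    """Projection direction w maximizing J(w) for two classes of samples."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)  # class mean vectors
    S1 = (X1 - m1).T @ (X1 - m1)               # scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)               # scatter of class 2
    Sw = S1 + S2                               # within-class scatter S_W
    w = np.linalg.solve(Sw, m1 - m2)           # w proportional to S_W^{-1}(m1 - m2)
    return w / np.linalg.norm(w)               # normalize for convenience
```

Projecting each sample onto this single direction ($z = \vec{w}^T\vec{x}$) gives the one-dimensional representation in which the two class means are far apart relative to the within-class spread.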