
Feature Selection Feature Extraction

Multivariate Parametric Methods

Steven J Zeil

Old Dominion Univ.

Fall 2010


Outline

1 Feature Selection

2 Feature Extraction
Principal Components Analysis (PCA)
Factor Analysis (FA)
Multidimensional Scaling (MDS)
Linear Discriminant Analysis (LDA)
Motivation

- Reduction in complexity of prediction and training
- Reduction in cost of data extraction
- Simpler models – reduced variance
- Easier to visualize & analyze results, identify outliers, etc.
Basic Approaches

Given an input population characterized by d attributes:

- Feature Selection: find k < d dimensions that give the most
  information. Discard the other d − k.
  - subset selection
- Feature Extraction: find k ≤ d dimensions that are linear
  combinations of the original d
  - Principal Components Analysis (unsupervised)
  - Related: Factor Analysis and Multidimensional Scaling
  - Linear Discriminant Analysis (supervised)
- The text also mentions nonlinear methods, Isometric Feature
  Mapping and Locally Linear Embedding, but gives too little
  information to really justify covering them.
Subset Selection

Assume we have a suitable error function and can evaluate it for a
variety of models (e.g., via cross-validation):

- Misclassification error for classification problems
- Mean-squared error for regression

We can't evaluate all 2^d subsets of d features, so we search greedily:

- Forward selection: Start with an empty feature set. Repeatedly add
  the feature that reduces the error the most. Stop when the decrease
  is insignificant.
- Backward selection: Start with all features. Repeatedly remove the
  feature whose removal decreases the error the most (or increases it
  the least). Stop when any further removal increases the error
  significantly.

Both directions are O(d²). This is hill-climbing: not guaranteed to
find the global optimum.
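The forward-selection loop described above can be sketched in a few lines. This is a minimal illustration, not the text's algorithm verbatim: the caller supplies a hypothetical `error_fn` (e.g., the cross-validated error of a model trained on the candidate columns), and the `min_improvement` parameter stands in for the "decrease is insignificant" stopping test.

```python
import numpy as np

def forward_selection(X, y, error_fn, min_improvement=1e-3):
    """Greedy forward subset selection.

    error_fn(X_subset, y) -> validation error for a model trained
    on the given columns (e.g., estimated by cross-validation).
    """
    d = X.shape[1]
    selected = []        # indices of the features chosen so far
    best_err = np.inf
    while len(selected) < d:
        # Try adding each remaining feature; keep the best candidate.
        trial_errs = {
            j: error_fn(X[:, selected + [j]], y)
            for j in range(d) if j not in selected
        }
        j_best = min(trial_errs, key=trial_errs.get)
        if best_err - trial_errs[j_best] < min_improvement:
            break        # decrease is insignificant: stop
        selected.append(j_best)
        best_err = trial_errs[j_best]
    return selected, best_err
```

Backward selection is the mirror image: start from all d features and repeatedly drop the feature whose removal hurts the error least.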
Notes

- A variant, floating search, adds multiple features at once, then
  backtracks to see which features can be removed.
- Selection is less useful in very high-dimension problems where
  individual features are of limited use but clusters of features
  are significant.

Principal Components Analysis (PCA)

Find a mapping z = Ax onto a lower-dimension space.

- Unsupervised method: seeks to maximize the variance of the
  projected data
- Intuitively: try to spread the points apart as far as possible
1st Principal Component

Assume x ∼ N(μ, Σ). Then

    wᵀx ∼ N(wᵀμ, wᵀΣw)

Find z₁ = w₁ᵀx, with ‖w₁‖ = 1, that maximizes Var(z₁) = w₁ᵀΣw₁.

- Maximize w₁ᵀΣw₁ − α(w₁ᵀw₁ − 1) with α ≥ 0 (Lagrange multiplier)
- Solution: Σw₁ = αw₁
- This is an eigenvalue problem on Σ. We want the solution
  (eigenvector) corresponding to the largest eigenvalue α.
2nd Principal Component

Next find z₂ = w₂ᵀx, with w₂ᵀw₂ = 1 and w₂ᵀw₁ = 0, that maximizes
Var(z₂) = w₂ᵀΣw₂.

- Solution: Σw₂ = α₂w₂
- Choose the solution (eigenvector) corresponding to the 2nd largest
  eigenvalue α₂.
- Because Σ is symmetric, its eigenvectors are mutually orthogonal.
Visualizing PCA

    z = Wᵀ(x − m)

where the columns of W are the chosen eigenvectors and m is the
sample mean.
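The whole pipeline, from the covariance estimate through the projection z = Wᵀ(x − m), is a short eigendecomposition exercise. Here is a minimal NumPy sketch; the function name `pca` and its interface are illustrative, not from the text:

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the k leading principal components.

    Returns (Z, W, lam): projected data, eigenvectors (as columns),
    and eigenvalues sorted in decreasing order.
    """
    m = X.mean(axis=0)                 # sample mean
    S = np.cov(X, rowvar=False)        # estimate of Sigma
    lam, W = np.linalg.eigh(S)         # eigh: S is symmetric
    order = np.argsort(lam)[::-1]      # largest eigenvalue first
    lam, W = lam[order], W[:, order]
    Z = (X - m) @ W[:, :k]             # z = W^T (x - m)
    return Z, W, lam
```

Because `np.linalg.eigh` returns eigenvalues in ascending order, the sketch reorders them so that the first column of W is the 1st principal component.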
Is Spreading the Space Enough?

Although we can argue that spreading the points leads to a
better-conditioned problem, what does this have to do with reducing
dimensionality?
Detecting Linear Dependencies

Suppose that some subset of the inputs is linearly correlated:

    ∃q | qᵀx = 0

Then Σ is singular:

    E[qᵀx − qᵀμ] = 0
    Σq = 0

- q is an eigenvector of the problem Σw = αw with α = 0: the last
  eigenvector(s) we would consider using.
- Flip side: PCA can be overly sensitive to scaling issues
  [normalize] and to outliers.
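A quick numerical check of this point: make one input an exact linear combination of two others, and the smallest eigenvalue of the estimated covariance drops to (numerically) zero, with the corresponding eigenvector pointing along the dependency. This toy demo is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(300, 2))
X = np.column_stack([A, A[:, 0] + A[:, 1]])   # x3 = x1 + x2 exactly
S = np.cov(X, rowvar=False)                   # singular covariance
lam, V = np.linalg.eigh(S)                    # ascending eigenvalues
q = V[:, 0]   # eigenvector for the ~0 eigenvalue: proportional to (1, 1, -1)
```

Since q · x = x1 + x2 − x3 = 0 for every sample, Var(qᵀx) = qᵀΣq = 0, which is exactly the Σq = 0 case above.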
When to Stop?

Proportion of Variance (PoV) accounted for by the first k of the
eigenvalues λ₁, λ₂, …, λ_d:

    PoV(k) = (λ₁ + ⋯ + λₖ) / (λ₁ + ⋯ + λ_d)

- Plot and look for an elbow.
- Typically stop around PoV = 0.9.
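The PoV rule is one cumulative sum. A minimal helper (the name `choose_k` and its interface are mine, not the text's):

```python
import numpy as np

def choose_k(eigenvalues, threshold=0.9):
    """Smallest k whose Proportion of Variance reaches `threshold`."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # decreasing
    pov = np.cumsum(lam) / lam.sum()   # PoV(1), PoV(2), ..., PoV(d)
    return int(np.searchsorted(pov, threshold) + 1)
```

For example, `choose_k([12.0, 3.0, 1.0])` keeps k = 2, since the first two eigenvalues carry over 90% of the total variance.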
PoV

[figure: PoV plotted against the number of eigenvectors]
PCA & Visualization

If the first two eigenvectors account for the majority of the
variance, plot the data using symbols for classes or other features,
and visually search for patterns.
Factor Analysis (FA)

A kind of “inverted” PCA: find a set of factors z that can be
combined to generate x:

    xᵢ − μᵢ = (Σⱼ₌₁ᵏ vᵢⱼ zⱼ) + εᵢ

- zⱼ are latent factors:
  E[zⱼ] = 0, Var(zⱼ) = 1, Cov(zᵢ, zⱼ) = 0 for i ≠ j
- εᵢ are noise sources:
  E[εᵢ] = 0, Var(εᵢ) = φᵢ, Cov(εᵢ, εⱼ) = 0 for i ≠ j
- Cov(εᵢ, zⱼ) = 0
- vᵢⱼ are factor loadings
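A sanity check on the model, not an estimation procedure (fitting the loadings is a separate topic): under the assumptions above, the covariance of x implied by the factor model is V Vᵀ + Ψ, where V holds the loadings and Ψ is the diagonal matrix of noise variances. We can confirm this by simulation:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, n = 5, 2, 100_000
V = rng.normal(size=(d, k))            # factor loadings v_ij
psi = rng.uniform(0.1, 0.5, size=d)    # noise variances Var(eps_i)
Z = rng.normal(size=(n, k))            # latent factors: mean 0, var 1
eps = rng.normal(size=(n, d)) * np.sqrt(psi)
X = Z @ V.T + eps                      # x - mu = V z + eps (mu = 0 here)
S = np.cov(X, rowvar=False)
# The sample covariance approaches V V^T + diag(psi) as n grows.
gap = np.max(np.abs(S - (V @ V.T + np.diag(psi))))
```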
PCA vs FA

[figure: PCA maps from x to z; FA generates x from z]
Multidimensional Scaling (MDS)

Given the pairwise distances dᵢⱼ between N points, place those points
on a low-dimension map, preserving the distances:

    z = g(x|θ)

Choose θ to minimize the Sammon stress

    E(θ|X) = Σᵣ,ₛ (‖zʳ − zˢ‖ − ‖xʳ − xˢ‖)² / ‖xʳ − xˢ‖
           = Σᵣ,ₛ (‖g(xʳ|θ) − g(xˢ|θ)‖ − ‖xʳ − xˢ‖)² / ‖xʳ − xˢ‖

Use regression methods for g, using the above as the error function
to be minimized.
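The stress itself is straightforward to compute for a candidate mapping. A direct O(N²) sketch, summing over distinct pairs r < s:

```python
import numpy as np

def sammon_stress(X, Z):
    """Sammon stress between original points X and mapped points Z.

    Zero when the mapping preserves every pairwise distance exactly;
    positive otherwise.
    """
    n = X.shape[0]
    E = 0.0
    for r in range(n):
        for s in range(r + 1, n):
            dx = np.linalg.norm(X[r] - X[s])   # distance in input space
            dz = np.linalg.norm(Z[r] - Z[s])   # distance on the map
            E += (dz - dx) ** 2 / dx
    return E
```

An identity mapping gives zero stress; any map that distorts distances (e.g., uniform scaling by 2) gives a positive value, which is what a regression fit for g would drive down.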
Linear Discriminant Analysis (LDA)

Supervised method: find a projection of x onto a low-dimension space
where the classes are well-separated.

Find w maximizing

    J(w) = (m₁ − m₂)² / (s₁² + s₂²)

where

    mᵢ = wᵀmᵢ                  (mᵢ on the right is the mean vector of class i)
    sᵢ² = Σₜ (wᵀxᵗ − mᵢ)² rᵢᵗ   (rᵢᵗ = 1 iff xᵗ belongs to class i)
Scatter

    J(w) = (m₁ − m₂)² / (s₁² + s₂²)

The numerator can be rewritten as

    (m₁ − m₂)² = (wᵀm₁ − wᵀm₂)²
               = wᵀ S_B w

where S_B = (m₁ − m₂)(m₁ − m₂)ᵀ is the between-class scatter matrix
(here m₁, m₂ are the class mean vectors). Similarly,

    s₁² + s₂² = wᵀ S_W w

where S_W = S₁ + S₂ is the within-class scatter matrix.