4 Feature Selection
● The goals:
➢ Select the “optimum” number ℓ of features
➢ Select the “best” ℓ features
➢ Given N:
• ℓ must be large enough to learn
– what makes classes different
– what makes patterns in the same class similar
• ℓ must be small enough not to learn what makes patterns of the same class different
• In practice, ℓ < N/3 has been reported to be a sensible choice for a number of cases
● The basic philosophy
➢ Discard individual features with poor information content
➢ The remaining information-rich features are examined jointly as vectors
➢ The steps:
• N measurements $x_i,\ i = 1, 2, \ldots, N$, are known
• Define a function of them, $q = f(x_1, x_2, \ldots, x_N)$, the test statistic, so that its pdf $p_q(q;\theta)$ is easily parameterized in terms of θ
• Let D be an interval in which q lies with high probability under H0, and let $\bar{D}$ be its complement
➢ Probability of an error: $p_q(q \in \bar{D} \mid H_0) = \rho$, so the acceptance interval D carries probability 1 − ρ under H0
• ρ is preselected and is known as the significance level.
● Application: The known variance case
➢ Let x be a random variable and let the experimental samples $x_i,\ i = 1, 2, \ldots, N$, be assumed mutually independent. Also let
$$E[x] = \mu, \qquad E[(x - \mu)^2] = \sigma^2$$
➢ Compute the sample mean
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
➢ This is also a random variable, with mean value
$$E[\bar{x}] = \frac{1}{N}\sum_{i=1}^{N} E[x_i] = \mu$$
That is, it is an unbiased estimator.
➢ The variance $\sigma_{\bar{x}}^2$:
$$E[(\bar{x} - \mu)^2] = E\left[\left(\frac{1}{N}\sum_{i=1}^{N} x_i - \mu\right)^2\right] = \frac{1}{N^2}\sum_{i=1}^{N} E[(x_i - \mu)^2] + \frac{1}{N^2}\sum_{i}\sum_{j \neq i} E[(x_i - \mu)(x_j - \mu)]$$
Due to independence the cross terms vanish, so
$$\sigma_{\bar{x}}^2 = \frac{\sigma^2}{N}$$
That is, the estimator is asymptotically efficient.
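As a quick numerical check of these two properties (unbiasedness and variance σ²/N), the following sketch simulates repeated sampling; the use of numpy and the specific constants are illustrative choices, not part of the original text.

```python
import numpy as np

# Illustrative constants (arbitrary choices): true mean, std, sample size.
mu, sigma, N = 1.4, 0.23, 16
rng = np.random.default_rng(0)

# Draw many independent samples of size N and compute the sample mean of each.
sample_means = rng.normal(mu, sigma, size=(100_000, N)).mean(axis=1)

print("E[x_bar]   ~", sample_means.mean())   # close to mu (unbiased)
print("var[x_bar] ~", sample_means.var())    # close to sigma**2 / N
print("sigma^2/N  =", sigma**2 / N)
```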
➢ Hypothesis test:
$$H_1 : E[x] \neq \hat{\mu}, \qquad H_0 : E[x] = \hat{\mu}$$
➢ Test statistic: define the variable
$$q = \frac{\bar{x} - \hat{\mu}}{\sigma / \sqrt{N}}$$
➢ Central limit theorem under H0:
$$p_{\bar{x}}(\bar{x}) = \frac{\sqrt{N}}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{N(\bar{x} - \hat{\mu})^2}{2\sigma^2}\right)$$
➢ Thus, under H0,
$$p_q(q) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{q^2}{2}\right), \qquad q \sim N(0, 1)$$
➢ The decision steps:
• Compute q from $x_i,\ i = 1, 2, \ldots, N$
• Choose the significance level ρ
• Compute from the N(0,1) tables the acceptance interval $D = [-x_\rho, x_\rho]$, which carries probability 1 − ρ
• If $q \in D$, accept H0; if $q \notin D$, reject H0
➢ An example: A random variable x has variance $\sigma^2 = (0.23)^2$. N = 16 measurements are obtained, giving $\bar{x} = 1.35$. The significance level is ρ = 0.05. Test the hypothesis
$$H_0 : \mu = \hat{\mu} = 1.4, \qquad H_1 : \mu \neq \hat{\mu}$$
➢ Since $\sigma^2$ is known, $q = \dfrac{\bar{x} - \hat{\mu}}{\sigma / 4}$ is N(0,1). From the tables we obtain, for 1 − ρ = 0.95, the acceptance interval $[-x_\rho, x_\rho] = [-1.967, 1.967]$ for the normal N(0,1).
➢ Thus
$$\mathrm{Prob}\left\{-1.967 < \frac{\bar{x} - \hat{\mu}}{0.23/4} < 1.967\right\} = 0.95$$
or
$$\mathrm{Prob}\{-0.113 < \bar{x} - \hat{\mu} < 0.113\} = 0.95$$
or
$$\mathrm{Prob}\{1.237 < \hat{\mu} < 1.463\} = 0.95$$
➢ Since $\hat{\mu} = 1.4$ lies within the above acceptance interval, we accept H0, i.e., $\mu = \hat{\mu} = 1.4$.
The interval [1.237, 1.463] is also known as the confidence interval at the 1 − ρ = 0.95 level.
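A minimal sketch of the above decision procedure for the known-variance case, using scipy to obtain the acceptance interval (the exact quantile 1.96 differs slightly from the table value 1.967 quoted above); the numbers reproduce the example.

```python
from scipy.stats import norm

x_bar, mu_hat, sigma, N, rho = 1.35, 1.4, 0.23, 16, 0.05

# Test statistic q = (x_bar - mu_hat) / (sigma / sqrt(N)); N(0,1) under H0.
q = (x_bar - mu_hat) / (sigma / N**0.5)

# Acceptance interval D = [-x_rho, x_rho] carrying probability 1 - rho.
x_rho = norm.ppf(1 - rho / 2)            # ~1.96 for rho = 0.05

print(f"q = {q:.3f}, D = [{-x_rho:.3f}, {x_rho:.3f}]")
print("accept H0" if abs(q) <= x_rho else "reject H0")
```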
● The Unknown Variance Case
➢ Estimate the variance from the samples,
$$\hat{\sigma}^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2,$$
which is an unbiased estimate, i.e., $E[\hat{\sigma}^2] = \sigma^2$.
➢ Define the test statistic
$$q = \frac{\bar{x} - \hat{\mu}}{\hat{\sigma} / \sqrt{N}}$$
➢ This is no longer Gaussian. If x is Gaussian, then q follows a t-distribution with N − 1 degrees of freedom.
➢ An example: Assume the same setup as before (N = 16, $\bar{x} = 1.35$), but now the variance is unknown and its estimate is $\hat{\sigma}^2 = (0.23)^2$.
➢ Acceptance intervals $[-t, t]$ for the t-distribution are tabulated per degrees of freedom and for 1 − ρ = 0.9, 0.95, 0.975, 0.99. For N − 1 = 15 degrees of freedom and 1 − ρ = 0.975 the table gives t = 2.49. Thus
$$\mathrm{Prob}\left\{-2.49 < \frac{\bar{x} - \hat{\mu}}{\hat{\sigma}/4} < 2.49\right\} = 0.975$$
or
$$1.207 < \hat{\mu} < 1.493$$
Thus, $\hat{\mu} = 1.4$ is accepted.
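A sketch of the same test when σ² must be estimated, now using the t-distribution with N − 1 degrees of freedom via scipy; values as in the example above.

```python
from scipy.stats import t

x_bar, mu_hat, sigma_hat, N = 1.35, 1.4, 0.23, 16
one_minus_rho = 0.975                      # interval probability used in the example

# q = (x_bar - mu_hat) / (sigma_hat / sqrt(N)) follows t with N-1 dof under H0.
q = (x_bar - mu_hat) / (sigma_hat / N**0.5)

t_rho = t.ppf(1 - (1 - one_minus_rho) / 2, df=N - 1)   # ~2.49 for 15 dof
print(f"q = {q:.3f}, acceptance interval = [{-t_rho:.2f}, {t_rho:.2f}]")
print("accept H0" if abs(q) <= t_rho else "reject H0")
```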
● Application in Feature Selection
The test is now used to check, for a given feature, whether its mean values in two classes, ω1 and ω2, differ significantly:
$$H_0 : \Delta\mu = \mu_1 - \mu_2 = 0, \qquad H_1 : \Delta\mu \neq 0$$
➢ Define z = x − y, where x, y denote the values of the feature in the two classes.
➢ Obviously, $E[z] = \mu_1 - \mu_2$ and
$$\bar{z} = \frac{1}{N}\sum_{i=1}^{N}(x_i - y_i) = \bar{x} - \bar{y}$$
➢ Known Variance Case: Define the test statistic
$$q = \frac{(\bar{x} - \bar{y}) - (\hat{\mu}_1 - \hat{\mu}_2)}{\sqrt{\dfrac{2\sigma^2}{N}}}$$
This is N(0,1), and one follows the same procedure as before.
➢ Unknown Variance Case: Define the test statistic
$$q = \frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{2 S_z^2}{N}}}, \qquad S_z^2 = \frac{1}{2N - 2}\left(\sum_{i=1}^{N}(x_i - \bar{x})^2 + \sum_{i=1}^{N}(y_i - \bar{y})^2\right)$$
• q follows a t-distribution with 2N − 2 degrees of freedom.
• Then apply the appropriate tables as before.
➢ An example: we have
$$\omega_1 : \bar{x} = 3.73,\ \hat{\sigma}_1^2 = 0.0601, \qquad \omega_2 : \bar{y} = 3.25,\ \hat{\sigma}_2^2 = 0.0672$$
For N = 10,
$$S_z^2 = \frac{1}{2}\left(\hat{\sigma}_1^2 + \hat{\sigma}_2^2\right), \qquad q = \frac{(\bar{x} - \bar{y}) - 0}{\sqrt{\dfrac{2 S_z^2}{10}}} = 4.25$$
➢ From the table of the t-distribution with 2N − 2 = 18 degrees of freedom and ρ = 0.05, we obtain D = [−2.10, 2.10]. Since q = 4.25 lies outside D, H1 is accepted and the feature is selected.
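A Python sketch of this two-class, unknown-variance test for a single feature, following the formulas above; the inputs reproduce the ω1/ω2 example (per-class means and variances assumed to have been estimated from N = 10 samples each).

```python
from scipy.stats import t

# Estimated per-class statistics for the candidate feature (from the example).
x_bar, var1 = 3.73, 0.0601     # class omega_1
y_bar, var2 = 3.25, 0.0672     # class omega_2
N, rho = 10, 0.05

# Pooled variance S_z^2 = ((N-1)*var1 + (N-1)*var2) / (2N-2) = (var1 + var2) / 2.
S2 = (var1 + var2) / 2.0

# Test statistic; t-distributed with 2N-2 degrees of freedom under H0: mu1 = mu2.
q = (x_bar - y_bar) / (2 * S2 / N) ** 0.5

t_rho = t.ppf(1 - rho / 2, df=2 * N - 2)   # ~2.10 for 18 dof
print(f"q = {q:.2f}, D = [{-t_rho:.2f}, {t_rho:.2f}]")
print("feature selected (H1)" if abs(q) > t_rho else "feature rejected (H0)")
```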
● Class Separability Measures
The emphasis so far has been on features considered individually. Such an approach, however, cannot take into account existing correlations among the features: two features may each be rich in information, but if they are highly correlated we need not consider both of them. In order to search for possible correlations, we therefore consider features jointly, as elements of vectors, and combine the remaining features to search for the “best” combination.
➢ Class separability measures: Let x be the current feature combination vector.
• Divergence. To see the rationale behind this cost, consider the two-class case. Obviously, if on average the value of $\ln\dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)}$ is close to zero, then x should be a poor feature combination. Define:
$$D_{12} = \int_{-\infty}^{+\infty} p(x \mid \omega_1)\,\ln\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)}\,dx$$
$$D_{21} = \int_{-\infty}^{+\infty} p(x \mid \omega_2)\,\ln\frac{p(x \mid \omega_2)}{p(x \mid \omega_1)}\,dx$$
$$d_{12} = D_{12} + D_{21}$$
$d_{12}$ is known as the divergence and can be used as a class separability measure.
– For the multi-class case, define $d_{ij}$ for every pair of classes $\omega_i, \omega_j$; the average divergence is defined as
$$d = \sum_{i=1}^{M}\sum_{j=1}^{M} P(\omega_i) P(\omega_j)\, d_{ij}$$
– Some properties:
$$d_{ij} \geq 0, \qquad d_{ij} = 0 \ \text{if}\ i = j, \qquad d_{ij} = d_{ji}$$
– Large values of d are indicative of a good feature combination.
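To make the definition concrete, the sketch below evaluates $D_{12}$, $D_{21}$ and $d_{12}$ numerically for two one-dimensional Gaussian class-conditional densities; the particular means/variances and the use of scipy for the integration are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative 1-D class-conditional densities p(x|w1), p(x|w2).
p1 = norm(loc=0.0, scale=1.0).pdf
p2 = norm(loc=1.5, scale=1.2).pdf

def directed_divergence(pa, pb):
    # D_ab = integral of p(x|a) * ln(p(x|a)/p(x|b)) dx
    integrand = lambda x: pa(x) * np.log(pa(x) / pb(x))
    value, _ = quad(integrand, -20, 20)   # effectively (-inf, +inf) here
    return value

D12 = directed_divergence(p1, p2)
D21 = directed_divergence(p2, p1)
print("d12 =", D12 + D21)   # a large d12 suggests a good feature (combination)
```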
➢ Scatter Matrices. These are used as a measure of the way data are scattered in the respective feature space.
• Within-class scatter matrix:
$$S_w = \sum_{i=1}^{M} P_i S_i$$
where $S_i = E\left[(x - \mu_i)(x - \mu_i)^T\right]$ and $P_i \equiv P(\omega_i) \approx \dfrac{n_i}{N}$, with $n_i$ the number of training samples in class $\omega_i$.
Trace{S_w} is a measure of the average variance of the features.
• Between-class scatter matrix:
$$S_b = \sum_{i=1}^{M} P_i (\mu_i - \mu_0)(\mu_i - \mu_0)^T, \qquad \mu_0 = \sum_{i=1}^{M} P_i \mu_i$$
Trace{S_b} is a measure of the average distance of the mean of each class from the respective global one.
➢ Measures based on Scatter Matrices. Let $S_m = S_w + S_b$ denote the mixture scatter matrix, i.e., the scatter of the data around the global mean $\mu_0$. Typical criteria are:
$$J_1 = \frac{\mathrm{Trace}\{S_m\}}{\mathrm{Trace}\{S_w\}}$$
$$J_2 = \frac{|S_m|}{|S_w|} = \left|S_w^{-1} S_m\right|$$
$$J_3 = \mathrm{Trace}\{S_w^{-1} S_m\}$$
• Other criteria are also possible, using various combinations of $S_m$, $S_b$, $S_w$.
The above $J_1$, $J_2$, $J_3$ criteria take high values when:
• the data are clustered together within each class, and
• the means of the various classes are far apart.
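A sketch of how $S_w$, $S_b$, $S_m$ and the $J_1$, $J_3$ criteria could be computed from a labelled data matrix; numpy, the function name and the random two-class data are illustrative assumptions, not part of the original text.

```python
import numpy as np

def scatter_criteria(X, y):
    """X: (n_samples, n_features), y: class labels. Returns (J1, J3)."""
    classes, counts = np.unique(y, return_counts=True)
    P = counts / len(y)                                  # class priors P_i ~ n_i / N
    means = [X[y == c].mean(axis=0) for c in classes]    # per-class means mu_i
    mu0 = np.average(means, axis=0, weights=P)           # global mean mu_0
    Sw = sum(p * np.cov(X[y == c], rowvar=False, bias=True)
             for p, c in zip(P, classes))                # within-class scatter
    Sb = sum(p * np.outer(m - mu0, m - mu0)
             for p, m in zip(P, means))                  # between-class scatter
    Sm = Sw + Sb                                         # mixture scatter matrix
    J1 = np.trace(Sm) / np.trace(Sw)
    J3 = np.trace(np.linalg.solve(Sw, Sm))               # Trace{Sw^-1 Sm}
    return J1, J3

# Illustrative two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
print(scatter_criteria(X, y))
```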
• Fisher’s discriminant ratio. In one dimension and for two equiprobable classes, the scatter quantities become scalars:
$$S_w \propto \sigma_1^2 + \sigma_2^2, \qquad S_b \propto (\mu_1 - \mu_2)^2$$
and
$$\frac{S_b}{S_w} = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$$
is known as Fisher’s discriminant ratio (FDR).
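In the scalar case this ratio can be used directly to rank individual features; a minimal sketch (the function name and use of numpy are assumptions for illustration):

```python
import numpy as np

def fdr_scores(X, y, c0=0, c1=1):
    """Fisher's discriminant ratio per feature, for two classes c0 and c1."""
    X0, X1 = X[y == c0], X[y == c1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2   # (mu1 - mu2)^2 per feature
    den = X0.var(axis=0) + X1.var(axis=0)            # sigma1^2 + sigma2^2 per feature
    return num / den                                 # larger is better
```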
● Ways to combine features:
Trying to form all possible combinations of ℓ features from an original set of m measured features is a computationally hard task. Thus, a number of suboptimal searching techniques have been derived.
➢ Sequential backward selection. Starting from all m features, eliminate them one at a time:
• At each step, from the currently selected feature vector eliminate one feature and, for each of the resulting combinations (e.g., $[x_1, x_2]^T$, $[x_2, x_3]^T$, $[x_1, x_3]^T$), compute the criterion C; select the best combination.
• The above procedure shows how one can start from the m features and end up with the “best” ℓ ones. Obviously, the choice is suboptimal. The number of required calculations is
$$1 + \frac{1}{2}\left((m+1)m - \ell(\ell+1)\right)$$
In contrast, a full search requires
$$\binom{m}{\ell} = \frac{m!}{\ell!\,(m-\ell)!}$$
operations.
➢ Sequential forward selection. Here the reverse procedure is followed.
• Compute C for each individual feature. Select the “best” one, say x1.
• For all possible 2-D combinations containing x1, i.e., $[x_1, x_2]^T$, $[x_1, x_3]^T$, $[x_1, x_4]^T$, compute C and choose the best, say $[x_1, x_3]^T$; continue in the same way until ℓ features have been selected (a sketch of this procedure is given below).
The number of required operations is
$$\ell m - \frac{\ell(\ell - 1)}{2}$$
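A generic sketch of sequential forward selection; the function name is hypothetical and the separability criterion C is passed in as a callable (e.g., a function evaluating $J_3$ or the FDR on the corresponding columns of the data).

```python
def sequential_forward_selection(n_features, ell, criterion):
    """Greedily grow a feature subset of size ell.

    criterion(subset) -> float: class separability measure C for that subset.
    Returns the list of selected feature indices.
    """
    selected = []
    remaining = list(range(n_features))
    while len(selected) < ell:
        # Add the single feature that maximizes C for the enlarged subset.
        best = max(remaining, key=lambda f: criterion(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```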
➢ Floating Search Methods
The above two procedures suffer from the nesting effect: once a (bad) choice has been made, there is no way to reconsider it in subsequent steps. Floating search methods alleviate this by allowing features to be reconsidered (added or removed) at later steps.
➢ Remarks:
• Besides the suboptimal techniques, optimal searching techniques can also be used, provided that the optimized cost has certain properties, e.g., monotonicity.
● Hints from Generalization Theory
Generalization theory aims at providing general bounds that relate the error performance of a classifier to the number of training points, N, on the one hand, and to some classifier-dependent parameters, on the other. Up to now, the classifier-dependent parameters that we considered were the number of its free parameters and the dimensionality ℓ of the subspace in which the classifier operates (ℓ also affects the number of free parameters).
➢ Definitions
• Let the classifier be a binary one, i.e., a function $f : \mathbb{R}^{\ell} \rightarrow \{0, 1\}$, and let F be the class of all such functions that the adopted classifier can realize.
• The shatter coefficient S(F, N) of the class F is defined as the maximum number of dichotomies of N points that can be formed by the functions in F. The maximum possible number of dichotomies is $2^N$. However, NOT ALL dichotomies can be realized by the set of functions in F.
– It can be shown that
$$S(F, N) \leq N^{V_c} + 1$$
where $V_c$ is the VC dimension of the class, i.e., the largest integer N for which $S(F, N) = 2^N$. That is, the shatter coefficient is either $2^N$ (the maximum possible number of dichotomies) or it is upper bounded as suggested by the above inequality.
• The VC dimension $V_c$ may or may not be related to the dimension ℓ and the number of free parameters.
– Perceptron: $V_c = \ell + 1$.
– Multilayer perceptron with hard-limiting activation functions:
$$2\left\lfloor \frac{k_n^h}{2} \right\rfloor \ell \;\leq\; V_c \;\leq\; 2 k_w \log_2(e\,k_n)$$
where $k_n^h$ is the total number of hidden-layer nodes, $k_n$ the total number of nodes, and $k_w$ the total number of weights.
– Let $x_i$ be the training data samples and assume that
$$\|x_i\| \leq r, \quad i = 1, 2, \ldots, N$$
Let also a hyperplane be such that
$$\|w\|^2 \leq c \quad \text{and} \quad y_i\left(w^T x_i + b\right) \geq 1$$
(i.e., the constraints we met in the SVM formulation). Then
$$V_c \leq \min(r^2 c, \ell) + 1$$
➢ Generalization Performance
• Let $P_e^N(f)$ be the error rate of classifier f estimated on the N training points, also known as the empirical error, and let $P_e(f)$ be its true (generalization) error probability.
• Let $f^*$ be the function resulting from minimizing the empirical (over the finite training set) error function, and let $P_e$ denote the minimum error probability attainable over the class F. Then:
$$\mathrm{prob}\left\{\max_{f \in F}\left|P_e^N(f) - P_e(f)\right| > \varepsilon\right\} \leq 8\,S(F, N)\exp\left(-\frac{N\varepsilon^2}{32}\right)$$
$$\mathrm{prob}\left\{P_e(f^*) - P_e > \varepsilon\right\} \leq 8\,S(F, N)\exp\left(-\frac{N\varepsilon^2}{128}\right)$$
Taking into account that, for finite VC dimension $V_c$, the growth of $S(F, N)$ is only polynomial, the above bounds tell us that for large N:
º $P_e^N(f)$ is close to $P_e(f)$, with high probability.
º $P_e(f^*)$ is close to $P_e$, with high probability.
• Some more useful bounds
– The minimum number of points, N(ε, ρ), that guarantees, with high probability, a good generalization error performance is given by
$$N(\varepsilon, \rho) \leq \max\left\{\frac{k_1 V_c}{\varepsilon^2}\ln\frac{k_2 V_c}{\varepsilon^2},\ \frac{k_3}{\varepsilon^2}\ln\frac{8}{\rho}\right\}$$
where $k_1, k_2, k_3$ are constants. That is, for any $N \geq N(\varepsilon, \rho)$,
$$\mathrm{prob}\left\{P_e(f^*) - P_e > \varepsilon\right\} \leq \rho$$
In words, for $N \geq N(\varepsilon, \rho)$ the performance of the classifier is guaranteed, with high probability, to be close to that of the optimal classifier in the class F. $N(\varepsilon, \rho)$ is known as the sample complexity.
– With probability of at least 1 − ρ, the following bound holds:
$$P_e(f) \leq P_e^N(f) + \Phi\!\left(\frac{V_c}{N}\right)$$
where
$$\Phi\!\left(\frac{V_c}{N}\right) = \sqrt{\frac{V_c\left(\ln\dfrac{2N}{V_c} + 1\right) - \ln\dfrac{\rho}{4}}{N}}$$
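A small helper evaluating the confidence term Φ(Vc/N) as reconstructed above (including the square root; treat the exact form/constants as a reconstruction and check against your reference), just to see how it shrinks with N:

```python
import math

def phi(vc, n, rho):
    # Confidence term of the VC bound, as written above (a reconstruction).
    return math.sqrt((vc * (math.log(2 * n / vc) + 1) - math.log(rho / 4)) / n)

for n in (100, 1_000, 10_000, 100_000):
    print(n, round(phi(vc=10, n=n, rho=0.05), 3))
```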
➢ Remark: Observe that all the bounds given so far are:
• Dimension free
• Distribution free
● Model Complexity vs. Performance
This issue has already been touched upon, in the form of overfitting in neural network modeling and in the form of the bias-variance dilemma. A different perspective on the issue is given below.
• Let $F^{(1)}, F^{(2)}, \ldots$ be a sequence of nested classes of functions,
$$F^{(1)} \subset F^{(2)} \subset \ldots$$
with increasing, yet finite, VC dimensions $V_c$, and such that
$$\lim_{i \to \infty}\ \inf_{f \in F^{(i)}} P_e(f) = P_B$$
where $P_B$ denotes the optimal (Bayes) error.
For each N and each class of functions $F^{(i)}$, $i = 1, 2, \ldots$, compute the optimum $f^*_{N,i}$ with respect to the empirical error. Then, from all these classifiers, choose the one that minimizes, over all i, the upper bound in
$$P_e(f^*_{N,i}) \leq P_e^N(f^*_{N,i}) + \Phi\!\left(\frac{V_{c, F^{(i)}}}{N}\right)$$
That is,
$$f^*_N = \arg\min_i\left[P_e^N(f^*_{N,i}) + \Phi\!\left(\frac{V_{c, F^{(i)}}}{N}\right)\right]$$
• Then, as $N \to \infty$,
$$P_e(f^*_N) \to P_B$$
– The term $\Phi\!\left(\dfrac{V_{c, F^{(i)}}}{N}\right)$ in the minimized bound is a complexity penalty term. If the classifier model is simple, the penalty term is small but the empirical error term is large; if the model is complex, the empirical error is small but the penalty term is large.
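The selection rule above can be sketched as follows; the function name is hypothetical, the per-class empirical errors and VC dimensions are assumed already computed, and `phi` is a confidence-term function such as the one defined earlier.

```python
def select_model(empirical_errors, vc_dims, n, rho, phi):
    """Structural-risk-style choice over nested classes F(1) c F(2) c ...

    empirical_errors[i]: empirical error of the best classifier in class i
    vc_dims[i]:          VC dimension of class i
    phi:                 confidence term, e.g. the phi() sketched above
    Returns the index of the class minimizing the bound.
    """
    bounds = [err + phi(vc, n, rho)
              for err, vc in zip(empirical_errors, vc_dims)]
    return min(range(len(bounds)), key=bounds.__getitem__)
```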
➢ Bayesian Information Criterion (BIC)
Let N be the size of the training set, $\theta_m$ the vector of the unknown parameters of the classifier, $K_m$ the dimensionality of $\theta_m$, and let m run over all candidate models. With $L(\hat{\theta}_m)$ the maximum of the log-likelihood for model m, the criterion is
$$\mathrm{BIC} = -2L(\hat{\theta}_m) + K_m \ln N$$
The closely related Akaike Information Criterion uses a different complexity penalty:
$$\mathrm{AIC} = -2L(\hat{\theta}_m) + 2K_m$$
In both cases, the model with the smallest criterion value is selected.
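A minimal sketch of model selection by these criteria, given the maximized log-likelihood and parameter count of each candidate model (the candidate values below are purely illustrative, not results from the text):

```python
import math

def bic(log_lik, k_m, n):
    # BIC = -2 L(theta_hat_m) + K_m * ln(N); lower is better.
    return -2.0 * log_lik + k_m * math.log(n)

def aic(log_lik, k_m):
    # AIC = -2 L(theta_hat_m) + 2 K_m; lower is better.
    return -2.0 * log_lik + 2.0 * k_m

# Illustrative candidates: model -> (maximized log-likelihood, #parameters K_m).
candidates = {"m1": (-512.3, 4), "m2": (-498.7, 9), "m3": (-497.9, 20)}
N = 200
best = min(candidates, key=lambda m: bic(candidates[m][0], candidates[m][1], N))
print("selected by BIC:", best)
```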