4 Feature Selection


FEATURE SELECTION
● The goals:
➢ Select the “optimum” number l of features
➢ Select the “best” l features

● Large l has a three-fold disadvantage:


➢ High computational demands
➢ Low generalization performance
➢ Poor error estimates

➢ Given N training points:
• l must be large enough to learn
  – what makes classes different
  – what makes patterns in the same class similar
• l must be small enough not to learn what makes patterns of the
  same class different
• In practice, l < N/3 has been reported to be a sensible choice for a
  number of cases

➢ Once l has been decided, choose the l most informative features


• Best: Large between class distance,
Small within class variance

● The basic philosophy
➢ Discard individual features with poor information content
➢ The remaining information rich features are examined jointly as
vectors

● Feature Selection based on statistical Hypothesis Testing


➢ The Goal: For each individual feature, find whether the values that
  the feature takes for the different classes differ significantly.
  That is, test:
  • H1 : θ1 ≠ θ0 : The values differ significantly
  • H0 : θ1 = θ0 : The values do not differ significantly
  If they do not differ significantly, reject the feature from subsequent
  stages.
● Hypothesis Testing Basics

➢ The steps:
• N measurements xi , i = 1, 2, ..., N, are known.
• Define a function of them (the test statistic)
  q = f(x1, x2, ..., xN),
  so that pq(q; θ) is easily parameterized in terms of θ.
• Let D be an interval in which q has a high probability of lying
  under H0, i.e., pq(q | θ0).
• Let D̄ be the complement of D:
  D : Acceptance Interval
  D̄ : Critical Interval
• If q, resulting from x1, x2, ..., xN, lies in D we accept H0;
  otherwise we reject it.
➢ Probability of an error:
  p_q(q \in \bar{D} \mid H_0) = \rho
• ρ is preselected and is known as the significance level.

● Application: The known variance case
➢ Let x be a random variable and let the experimental samples,
  xi , i = 1, 2, ..., N, be mutually independent. Also let
  E[x] = \mu, \qquad E[(x - \mu)^2] = \sigma^2
➢ Compute the sample mean
  \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i
➢ This is also a random variable, with mean value
  E[\bar{x}] = \frac{1}{N}\sum_{i=1}^{N} E[x_i] = \mu
  That is, it is an Unbiased Estimator.

➢ The variance \sigma_{\bar{x}}^2:
  E[(\bar{x} - \mu)^2] = E\Big[\Big(\frac{1}{N}\sum_{i=1}^{N} x_i - \mu\Big)^2\Big]
  = \frac{1}{N^2}\sum_{i=1}^{N} E[(x_i - \mu)^2] + \frac{1}{N^2}\sum_{i}\sum_{j \neq i} E[(x_i - \mu)(x_j - \mu)]
  Due to independence the cross terms vanish, hence
  \sigma_{\bar{x}}^2 = \frac{\sigma^2}{N}
  That is, the estimator is Asymptotically Efficient.
➢ Hypothesis test:
  H_1 : E[x] \neq \hat{\mu}
  H_0 : E[x] = \hat{\mu}
➢ Test statistic: Define the variable
  q = \frac{\bar{x} - \hat{\mu}}{\sigma / \sqrt{N}}
➢ Central limit theorem: under H0,
  p_{\bar{x}}(\bar{x}) = \frac{\sqrt{N}}{\sqrt{2\pi}\,\sigma}\exp\Big(-\frac{N(\bar{x} - \hat{\mu})^2}{2\sigma^2}\Big)
➢ Thus, under H0,
  p_q(q) = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{q^2}{2}\Big), \qquad q \sim N(0, 1)

➢ The decision steps:
• Compute q from xi , i = 1, 2, ..., N
• Choose the significance level ρ
• Compute, from N(0,1) tables, the acceptance interval D = [-x_ρ, x_ρ],
  in which q lies with probability 1 - ρ
• If q ∈ D, accept H0; if q ∈ D̄, reject H0

➢ An example: A random variable x has variance σ² = (0.23)². N = 16
  measurements are obtained, giving x̄ = 1.35. The significance level is
  ρ = 0.05. Test the hypothesis
  H_0 : \mu = \hat{\mu} = 1.4
  H_1 : \mu \neq \hat{\mu}
➢ Since σ² is known, q = \frac{\bar{x} - \hat{\mu}}{\sigma / 4} is N(0,1).
  From tables, we obtain the values x_ρ defining the acceptance
  intervals [-x_ρ, x_ρ] for the normal N(0,1):

  1-ρ   0.8   0.85  0.9   0.95  0.98  0.99  0.998  0.999
  x_ρ   1.28  1.44  1.64  1.96  2.32  2.57  3.09   3.29

➢ Thus
  Prob\Big\{-1.967 < \frac{\bar{x} - \hat{\mu}}{0.23/4} < 1.967\Big\} = 0.95
  or
  Prob\{-0.113 < \bar{x} - \hat{\mu} < 0.113\} = 0.95
  or
  Prob\{1.237 < \hat{\mu} < 1.463\} = 0.95
➢ Since μ̂ = 1.4 lies within the above acceptance interval, we accept
  H0, i.e., μ = μ̂ = 1.4.
  The interval [1.237, 1.463] is also known as the confidence interval
  at the 1-ρ = 0.95 level.
  We say that: there is no evidence at the 5% level that the mean value
  is not equal to μ̂.
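The decision procedure above is mechanical, so a short code sketch may help. The following minimal Python illustration of the known-variance test is not part of the original notes; the function name is illustrative and the small built-in table holds the same two-sided N(0,1) values listed above.

```python
import math

def z_test_known_variance(x, mu_hat, sigma, rho=0.05):
    """Two-sided test of H0: E[x] = mu_hat when the variance sigma^2 is known.

    Returns the test statistic q and True if H0 is accepted at
    significance level rho.
    """
    N = len(x)
    x_bar = sum(x) / N
    q = (x_bar - mu_hat) / (sigma / math.sqrt(N))
    # Two-sided acceptance values x_rho from the N(0,1) table above, keyed by 1 - rho
    table = {0.80: 1.28, 0.90: 1.64, 0.95: 1.96, 0.98: 2.32, 0.99: 2.57}
    x_rho = table[round(1 - rho, 2)]
    return q, abs(q) <= x_rho
```

For the example above (N = 16, σ = 0.23, x̄ = 1.35), the statistic is q = (1.35 - 1.4)/(0.23/4) ≈ -0.87, which lies well inside the acceptance interval, in agreement with the confidence-interval argument.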
● The Unknown Variance Case
➢ Estimate the variance. The estimate
  \hat{\sigma}^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2
  is unbiased, i.e.,
  E[\hat{\sigma}^2] = \sigma^2
➢ Define the test statistic
  q = \frac{\bar{x} - \hat{\mu}}{\hat{\sigma} / \sqrt{N}}

➢ This is no longer Gaussian. If x is Gaussian, then q follows a
  t-distribution with N-1 degrees of freedom.

➢ An example: x is Gaussian, N = 16 measurements are obtained, giving
  x̄ = 1.35 and σ̂² = (0.23)². Test the hypothesis
  H_0 : \mu = \hat{\mu} = 1.4
  at the significance level ρ = 0.025.

➢ Table of acceptance intervals for the t-distribution

  Degrees of freedom   1-ρ:  0.9   0.95  0.975  0.99
  12                         1.78  2.18  2.56   3.05
  13                         1.77  2.16  2.53   3.01
  14                         1.76  2.15  2.51   2.98
  15                         1.75  2.13  2.49   2.95
  16                         1.75  2.12  2.47   2.92
  17                         1.74  2.11  2.46   2.90
  18                         1.73  2.10  2.44   2.88

➢ For N-1 = 15 degrees of freedom and 1-ρ = 0.975, the table gives 2.49, so
  Prob\Big\{-2.49 < \frac{\bar{x} - \hat{\mu}}{\hat{\sigma}/4} < 2.49\Big\} = 0.975
  which gives 1.207 < μ̂ < 1.493.
  Thus, μ̂ = 1.4 is accepted.
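The same test can be coded with the variance estimated from the data. The sketch below is illustrative, not from the notes, and assumes scipy is available so that the t-quantile replaces the printed table.

```python
import math
from scipy import stats

def t_test_unknown_variance(x, mu_hat, rho=0.025):
    """Two-sided test of H0: E[x] = mu_hat when the variance is unknown.

    Under H0, q follows a t-distribution with N-1 degrees of freedom.
    """
    N = len(x)
    x_bar = sum(x) / N
    sigma2_hat = sum((xi - x_bar) ** 2 for xi in x) / (N - 1)  # unbiased estimate
    q = (x_bar - mu_hat) / math.sqrt(sigma2_hat / N)
    # Two-sided acceptance value; about 2.49 for N = 16, rho = 0.025 (see table)
    x_rho = stats.t.ppf(1 - rho / 2, df=N - 1)
    return q, abs(q) <= x_rho
```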
● Application in Feature Selection
➢ The goal here is to test against zero the difference μ1 - μ2 of the
  respective mean values that a single feature has in the two classes
  ω1, ω2.
➢ Let xi , i = 1, ..., N, be the values of the feature in ω1.
➢ Let yi , i = 1, ..., N, be the values of the same feature in ω2.
➢ Assume that in both classes (whether known or not)
  \sigma_1^2 = \sigma_2^2 = \sigma^2
➢ The test becomes
  H_0 : \Delta\mu = \mu_1 - \mu_2 = 0
  H_1 : \Delta\mu \neq 0
➢ Define
  z = x - y
➢ Obviously,
  E[z] = \mu_1 - \mu_2
➢ Define the average
  \bar{z} = \frac{1}{N}\sum_{i=1}^{N}(x_i - y_i) = \bar{x} - \bar{y}
➢ Known Variance Case: Define
  q = \frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{\sqrt{\frac{2\sigma^2}{N}}}
  This is N(0,1), and one follows the same procedure as before.

➢ Unknown Variance Case: Define the test statistic
  q = \frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{\sqrt{\frac{2 S_z^2}{N}}}
  where
  S_z^2 = \frac{1}{2N - 2}\Big(\sum_{i=1}^{N}(x_i - \bar{x})^2 + \sum_{i=1}^{N}(y_i - \bar{y})^2\Big)
• q follows a t-distribution with 2N-2 degrees of freedom.
• Then apply the appropriate tables, as before.

➢ Example: The values of a feature in two classes are:
  ω1: 3.5, 3.7, 3.9, 4.1, 3.4, 3.5, 4.1, 3.8, 3.6, 3.7
  ω2: 3.2, 3.6, 3.1, 3.4, 3.0, 3.4, 2.8, 3.1, 3.3, 3.6
  Test whether the mean values in the two classes differ significantly,
  at the significance level ρ = 0.05.

➢ We have
  ω1: \bar{x} = 3.73, \hat{\sigma}_1^2 = 0.0601
  ω2: \bar{y} = 3.25, \hat{\sigma}_2^2 = 0.0672
  For N = 10,
  S_z^2 = \frac{1}{2}(\hat{\sigma}_1^2 + \hat{\sigma}_2^2)
  q = \frac{(\bar{x} - \bar{y}) - 0}{\sqrt{\frac{2 S_z^2}{10}}} = 4.25
➢ From the table of the t-distribution with 2N-2 = 18 degrees of freedom
  and ρ = 0.05, we obtain D = [-2.10, 2.10]. Since q = 4.25 lies outside D,
  H1 is accepted and the feature is selected.
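The arithmetic of this example is easy to verify in a few lines of Python; the snippet below is only a check of the numbers above and assumes scipy is available for the t-quantile.

```python
import math
from scipy import stats

# Feature values in the two classes (the example above)
w1 = [3.5, 3.7, 3.9, 4.1, 3.4, 3.5, 4.1, 3.8, 3.6, 3.7]
w2 = [3.2, 3.6, 3.1, 3.4, 3.0, 3.4, 2.8, 3.1, 3.3, 3.6]

N = len(w1)
x_bar, y_bar = sum(w1) / N, sum(w2) / N
# Pooled variance estimate S_z^2 (equal but unknown class variances)
s2_z = (sum((v - x_bar) ** 2 for v in w1) +
        sum((v - y_bar) ** 2 for v in w2)) / (2 * N - 2)
q = (x_bar - y_bar) / math.sqrt(2 * s2_z / N)      # ~ 4.25
x_rho = stats.t.ppf(1 - 0.05 / 2, df=2 * N - 2)    # ~ 2.10
print(round(q, 2), abs(q) > x_rho)                 # 4.25 True -> keep the feature
```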
● Class Separability Measures
The emphasis so far has been on features considered individually. However,
such an approach cannot take into account existing correlations among the
features. That is, two features may each be rich in information, but if they
are highly correlated we need not consider both of them. In order to search
for possible correlations, we consider features jointly, as elements of
vectors. To this end:

➢ Discard features poor in information, by means of a statistical test.

➢ Choose the maximum number, ℓ, of features to be used. This is dictated
  by the specific problem (e.g., the number, N, of available training
  patterns and the type of the classifier to be adopted).

➢ Combine the remaining features to search for the “best” combination.
  To this end:

• Use different feature combinations to form the feature vector. Train
  the classifier, and choose the combination resulting in the best
  classifier performance.
  A major disadvantage of this approach is its high complexity. Also,
  local minima may give misleading results.

• Adopt a class separability measure and choose the best feature
  combination against this cost.

➢ Class separability measures: Let x be the current feature combination
  vector.
• Divergence. To see the rationale behind this cost, consider the
  two-class case. Obviously, if on average the value of
  \ln\frac{p(x\,|\,\omega_1)}{p(x\,|\,\omega_2)} is close to zero, then x
  should be a poor feature combination. Define:
  D_{12} = \int_{-\infty}^{+\infty} p(x\,|\,\omega_1)\,\ln\frac{p(x\,|\,\omega_1)}{p(x\,|\,\omega_2)}\,dx
  D_{21} = \int_{-\infty}^{+\infty} p(x\,|\,\omega_2)\,\ln\frac{p(x\,|\,\omega_2)}{p(x\,|\,\omega_1)}\,dx
  d_{12} = D_{12} + D_{21}
  d_{12} is known as the divergence and can be used as a class separability
  measure.

– For the multi-class case, define d_{ij} for every pair of classes
  ω_i, ω_j; the average divergence is defined as
  d = \sum_{i=1}^{M}\sum_{j=1}^{M} P(\omega_i)\,P(\omega_j)\,d_{ij}
– Some properties:
  d_{ij} \geq 0
  d_{ij} = 0, \ \text{if } i = j
  d_{ij} = d_{ji}
  Large values of d are indicative of a good feature combination.
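For Gaussian class densities the two integrals defining d₁₂ admit a closed form (a standard result, stated here without derivation and not part of these notes). The sketch below evaluates it with numpy; the function name is illustrative.

```python
import numpy as np

def gaussian_divergence(mu1, S1, mu2, S2):
    """Divergence d_12 = D_12 + D_21 when p(x|w1) = N(mu1, S1) and
    p(x|w2) = N(mu2, S2); closed form of the integrals defined above.
    """
    S1_inv, S2_inv = np.linalg.inv(S1), np.linalg.inv(S2)
    dm = np.asarray(mu1, float) - np.asarray(mu2, float)
    l = dm.size
    cov_term = 0.5 * np.trace(S1_inv @ S2 + S2_inv @ S1 - 2 * np.eye(l))
    mean_term = 0.5 * dm @ (S1_inv + S2_inv) @ dm
    return cov_term + mean_term
```

Note that for equal covariances the covariance term vanishes and d₁₂ reduces to a Mahalanobis-type distance between the two class means.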
➢ Scatter Matrices. These are used as a measure of the way data are
  scattered in the respective feature space.
• Within-class scatter matrix
  S_w = \sum_{i=1}^{M} P_i S_i
  where
  S_i = E\big[(x - \mu_i)(x - \mu_i)^T\big]
  and
  P_i \equiv P(\omega_i) \approx \frac{n_i}{N},
  with n_i the number of training samples in class ω_i.
  Trace{S_w} is a measure of the average variance of the features.

• Between-class scatter matrix
  S_b = \sum_{i=1}^{M} P_i (\mu_i - \mu_0)(\mu_i - \mu_0)^T
  where
  \mu_0 = \sum_{i=1}^{M} P_i \mu_i
  Trace{S_b} is a measure of the average distance of the mean of each
  class from the respective global mean.

• Mixture scatter matrix
  S_m = E\big[(x - \mu_0)(x - \mu_0)^T\big]
  It turns out that
  S_m = S_w + S_b

➢ Measures based on Scatter Matrices
• J_1 = \frac{\text{Trace}\{S_m\}}{\text{Trace}\{S_w\}}
• J_2 = \frac{|S_m|}{|S_w|} = \big|S_w^{-1} S_m\big|
• J_3 = \text{Trace}\big\{S_w^{-1} S_m\big\}
• Other criteria are also possible, using various combinations of
  S_m, S_b, S_w.
The above J_1, J_2, J_3 criteria take high values for the cases where:
• the data are clustered together within each class, and
• the means of the various classes are far apart.
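A compact numpy sketch of the three scatter matrices and the J₃ criterion, following the definitions above; the function name and the data layout (rows are patterns, columns are features) are assumptions of this illustration.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return S_w, S_b, S_m and J_3 for data X (N x l) with class labels y."""
    classes, counts = np.unique(y, return_counts=True)
    N, l = X.shape
    mu0 = X.mean(axis=0)                                   # global mean
    Sw = np.zeros((l, l))
    Sb = np.zeros((l, l))
    for c, n_i in zip(classes, counts):
        Xi = X[y == c]
        Pi = n_i / N
        mui = Xi.mean(axis=0)
        Sw += Pi * np.cov(Xi, rowvar=False, bias=True)     # P_i * S_i
        Sb += Pi * np.outer(mui - mu0, mui - mu0)
    Sm = Sw + Sb
    J3 = np.trace(np.linalg.inv(Sw) @ Sm)
    return Sw, Sb, Sm, J3
```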
• Fisher’s discriminant ratio. In one dimension and for two equiprobable
  classes, the determinants become
  S_w \propto \sigma_1^2 + \sigma_2^2 \quad \text{and} \quad S_b \propto (\mu_1 - \mu_2)^2
  so that
  \frac{S_b}{S_w} = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2},
  known as Fisher’s discriminant ratio (FDR).
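The FDR is commonly used to rank individual features cheaply. A tiny illustrative helper (the function name is assumed), reusing the two-class example from the hypothesis-testing section:

```python
import numpy as np

def fdr(x1, x2):
    """Fisher's discriminant ratio of one feature for two classes:
    (mu1 - mu2)^2 / (sigma1^2 + sigma2^2). Larger means better separation.
    """
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    return (x1.mean() - x2.mean()) ** 2 / (x1.var() + x2.var())

w1 = [3.5, 3.7, 3.9, 4.1, 3.4, 3.5, 4.1, 3.8, 3.6, 3.7]
w2 = [3.2, 3.6, 3.1, 3.4, 3.0, 3.4, 2.8, 3.1, 3.3, 3.6]
print(round(fdr(w1, w2), 2))   # ~2.01 for the example feature
```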
● Ways to combine features
Trying to form all possible combinations of ℓ features out of an original
set of m selected features is a computationally hard task. Thus, a number
of suboptimal searching techniques have been derived.

➢ Sequential backward selection. Let x1, x2, x3, x4 be the available
  features (m = 4). The procedure consists of the following steps:
• Adopt a class separability criterion (it could also be the error rate of
  the respective classifier) and compute its value C for ALL features
  considered jointly, [x1, x2, x3, x4]^T.

• Eliminate one feature and, for each of the possible resulting
  combinations, that is [x1, x2, x3]^T, [x1, x2, x4]^T, [x1, x3, x4]^T,
  [x2, x3, x4]^T, compute the class separability criterion value C.
  Select the best combination, say [x1, x2, x3]^T.
• From the above selected feature vector, eliminate one feature and, for
  each of the resulting combinations [x1, x2]^T, [x2, x3]^T, [x1, x3]^T,
  compute C and select the best combination.

The above selection procedure shows how one can start from m features and
end up with the “best” ℓ of them. Obviously, the choice is suboptimal. The
number of required calculations is
  1 + \frac{1}{2}\big((m + 1)m - \ell(\ell + 1)\big)
In contrast, a full search requires
  \binom{m}{\ell} = \frac{m!}{\ell!\,(m - \ell)!}
operations.

➢ Sequential forward selection. Here the reverse procedure is followed.
• Compute C for each individual feature. Select the “best” one, say x1.

• For all possible 2-dimensional combinations containing x1, i.e.,
  [x1, x2], [x1, x3], [x1, x4], compute C and choose the best, say [x1, x3].

• For all possible 3-dimensional combinations containing [x1, x3], e.g.,
  [x1, x3, x2], etc., compute C and choose the best one.

The above procedure is repeated until the “best” vector with ℓ features has
been formed. This is also a suboptimal technique, requiring
  \ell m - \frac{\ell(\ell - 1)}{2}
operations.
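A minimal sketch of the forward search just described; the criterion is passed in as a callable (for instance, J₃ evaluated on the candidate feature subset), and the function interface is an assumption of this illustration rather than part of the notes.

```python
def sequential_forward_selection(feature_ids, criterion, n_select):
    """Greedy forward search: start empty and repeatedly add the feature
    whose inclusion maximizes the separability criterion C.

    criterion(subset) must return C for a tuple of feature indices.
    Roughly l*m - l*(l-1)/2 criterion evaluations are needed.
    """
    selected = []
    remaining = list(feature_ids)
    while len(selected) < n_select and remaining:
        best = max(remaining, key=lambda f: criterion(tuple(selected + [f])))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Backward selection is the mirror image: start from the full set and repeatedly drop the feature whose removal hurts C the least.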
➢ Floating Search Methods

The above two procedures suffer from the nesting effect: once a bad choice
has been made, there is no way to reconsider it in the following steps.

In the floating search methods one is given the opportunity to reconsider a
previously discarded feature, or to discard a feature that was previously
chosen.

The method is still suboptimal; however, it leads to improved performance,
at the expense of complexity.

➢ Remarks:
• Besides suboptimal techniques, some optimal searching techniques can
  also be used, provided that the optimized cost has certain properties,
  e.g., monotonicity.

• Instead of using a class separability measure (filter techniques) or
  using the classifier directly (wrapper techniques), one can modify the
  cost function of the classifier appropriately, so as to perform feature
  selection and classifier design in a single step (embedded methods).

• For the choice of the separability measure, a multiplicity of costs has
  been proposed, including information-theoretic costs.

● Hints from Generalization Theory
Generalization theory aims at providing general bounds that relate the error
performance of a classifier to the number of training points, N, on the one
hand, and some classifier-dependent parameters, on the other. Up to now, the
classifier-dependent parameters that we considered were the number of its
free parameters and the dimensionality, ℓ, of the subspace in which the
classifier operates (ℓ also affects the number of free parameters).

➢ Definitions
• Let the classifier be a binary one, i.e.,
  f : \mathbb{R}^{\ell} \rightarrow \{0, 1\}
• Let F be the set of all functions f that can be realized by the adopted
  classifier (e.g., by changing the synapses of a given neural network,
  different functions are implemented).

• The shatter coefficient S(F, N) of the class F is defined as the maximum
  number of dichotomies of N points that can be formed by the functions in F.
  The maximum possible number of dichotomies is 2^N. However, NOT ALL
  dichotomies can be realized by the set of functions in F.

• The Vapnik-Chervonenkis (VC) dimension of a class F is the largest
  integer k for which S(F, k) = 2^k. If S(F, N) = 2^N for all N, we say
  that the VC dimension is infinite.
  – That is, the VC dimension is the largest integer k for which the class
    of functions F can achieve all 2^k possible dichotomies of k points.
  – It is easily seen that the VC dimension of the single perceptron class,
    operating in the ℓ-dimensional space, is ℓ + 1.

– It can be shown that
  S(F, N) \leq N^{V_c} + 1
  where V_c is the VC dimension of the class.
  That is, the shatter coefficient is either 2^N (the maximum possible
  number of dichotomies) or it is upper bounded as suggested by the above
  inequality.

  In words, for finite V_c and large enough N, the shatter coefficient
  exhibits only polynomial growth.
  º Note that in order to have polynomial growth of the shatter
    coefficient, N must be larger than the V_c dimension.
– The V_c dimension can be considered an intrinsic capacity of the
  classifier and, as we will soon see, only if the number of training
  vectors exceeds this number sufficiently can we expect good
  generalization performance.

• The V_c dimension may or may not be related to the dimension ℓ and the
  number of free parameters.
  – Perceptron: V_c = \ell + 1
  – Multilayer perceptron with hard-limiting activation functions:
    2\left\lfloor \frac{k_n^h}{2} \right\rfloor \ell \;\leq\; V_c \;\leq\; 2 k_w \log_2(e\,k_n)
    where k_n^h is the total number of hidden-layer nodes, k_n the total
    number of nodes, and k_w the total number of weights.
  – Let x_i be the training data samples and assume that
    \|x_i\| \leq r, \quad i = 1, 2, ..., N

  Let also a hyperplane be such that
  \|w\|^2 \leq c
  and
  y_i (w^T x_i + b) \geq 1
  (i.e., the constraints we met in the SVM formulation). Then
  V_c \leq r^2 c
  That is, by controlling the constant c, the V_c of the linear classifier
  can be kept less than ℓ. In other words, V_c can be controlled
  independently of the dimension. Thus, by minimizing \|w\|^2 in the SVM,
  one attempts to keep V_c as small as possible. Moreover, one can achieve
  a finite V_c dimension even for infinite-dimensional spaces. This is an
  explanation of the potential for good generalization performance of
  SVMs, as is readily deduced from the following bounds.

➢ Generalization Performance
• Let P_e^N(f) be the error rate of classifier f, based on the N training
  points, also known as the empirical error.

• Let P_e(f) be the true error probability of f (also known as the
  generalization error), when f is confronted with data outside the finite
  training set.

• Let P_e be the minimum error probability that can be attained over ALL
  functions in the set F.

• Let f^* be the function resulting from minimizing the empirical (over
  the finite training set) error function.

• It can be shown that:
  \text{prob}\Big\{\max_{f \in F}\big|P_e^N(f) - P_e(f)\big| > \varepsilon\Big\} \leq 8\,S(F, N)\exp\Big(-\frac{N\varepsilon^2}{32}\Big)
  \text{prob}\Big\{P_e(f^*) - P_e > \varepsilon\Big\} \leq 8\,S(F, N)\exp\Big(-\frac{N\varepsilon^2}{128}\Big)
  Taking into account that, for finite V_c dimension, the growth of
  S(F, N) is only polynomial, the above bounds tell us that for a large N:
  º P_e^N(f) is close to P_e(f), with high probability.
  º P_e(f^*) is close to P_e, with high probability.

• Some more useful bounds
  – The minimum number of points, N(ε, ρ), that guarantees, with high
    probability, a good generalization error performance is given by
    N(\varepsilon, \rho) \leq \max\Big\{\frac{k_1 V_c}{\varepsilon^2}\ln\frac{k_2 V_c}{\varepsilon^2},\; \frac{k_3}{\varepsilon^2}\ln\frac{8}{\rho}\Big\}
    where k_1, k_2, k_3 are constants. That is, for any N \geq N(\varepsilon, \rho),
    \text{prob}\big\{P_e(f^*) - P_e > \varepsilon\big\} \leq \rho
    In words, for N \geq N(\varepsilon, \rho) the performance of the
    classifier is guaranteed, with high probability, to be close to that of
    the optimal classifier in the class F. N(ε, ρ) is known as the sample
    complexity.

– With probability at least 1 - ρ, the following bound holds:
  P_e(f) \leq P_e^N(f) + \Phi\Big(\frac{V_c}{N}\Big)
  where
  \Phi\Big(\frac{V_c}{N}\Big) = \sqrt{\frac{V_c\big(\ln\frac{2N}{V_c} + 1\big) - \ln\frac{\rho}{4}}{N}}

➢ Remark: Observe that all the bounds given so far are:
• Dimension free
• Distribution free
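To get a feel for the bound, the complexity term can be evaluated numerically. The sketch below assumes the reconstructed form of Φ above and uses purely illustrative numbers; the function name is not from the notes.

```python
import math

def vc_confidence(vc, n, rho=0.05):
    """Complexity term Phi(Vc/N) in the bound Pe(f) <= Pe^N(f) + Phi(Vc/N)."""
    return math.sqrt((vc * (math.log(2 * n / vc) + 1) - math.log(rho / 4)) / n)

# The term shrinks only as N grows large relative to Vc:
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(vc_confidence(vc=10, n=n), 3))
```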
● Model Complexity vs Performance
This issue has already been touched upon, in the form of overfitting in
neural network modeling and in the form of the bias-variance dilemma. A
different perspective of the issue is dealt with below.

➢ Structural Risk Minimization (SRM)
• Let P_B be the Bayesian error probability for a given task.
• Let P_e(f^*) be the true (generalization) error of an optimally designed
  classifier f^*, from class F, given a finite training set.
• P_e is the minimum error attainable in F. Then
  P_e(f^*) - P_B = \big(P_e(f^*) - P_e\big) + \big(P_e - P_B\big)
  – If the class F is small, the first term is expected to be small and
    the second term is expected to be large. The opposite is true when the
    class F is large.
• Let F^{(1)}, F^{(2)}, ... be a sequence of nested classes,
  F^{(1)} \subset F^{(2)} \subset \ldots
  with increasing, yet finite, V_c dimensions:
  V_{c, F^{(1)}} \leq V_{c, F^{(2)}} \leq \ldots
  Also, let
  \lim_{i \to \infty} \; \inf_{f \in F^{(i)}} P_e(f) = P_B
  For each N and each class of functions F^{(i)}, i = 1, 2, ..., compute
  the optimum f^*_{N,i} with respect to the empirical error. Then, from all
  these classifiers, choose the one that minimizes, over all i, the upper
  bound in
  P_e(f^*_{N,i}) \leq P_e^N(f^*_{N,i}) + \Phi\Big(\frac{V_{c, F^{(i)}}}{N}\Big)
  That is,
  f^*_N = \arg\min_i \Big[ P_e^N(f^*_{N,i}) + \Phi\Big(\frac{V_{c, F^{(i)}}}{N}\Big) \Big]
• Then, as N → ∞,
  P_e(f^*_N) \rightarrow P_B
  – The term
    \Phi\Big(\frac{V_{c, F^{(i)}}}{N}\Big)
    in the minimized bound is a complexity penalty term. If the classifier
    model is simple, the penalty term is small but the empirical error term
    P_e^N(f^*_{N,i}) will be large. The opposite is true for complex models.

• The SRM criterion aims at achieving the best trade-off between
  performance and complexity.
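A schematic of the SRM choice in code: each nested class contributes its empirical error and its V_c, and the class minimizing the bound is picked. The candidate numbers are hypothetical and the penalty uses the Φ form reconstructed earlier.

```python
import math

def srm_select(candidates, n, rho=0.05):
    """Pick, among nested classes F^(i), the index minimizing
    empirical error + Phi(Vc/N), as in the SRM rule above.

    candidates: list of (empirical_error, vc) pairs, one per class (hypothetical).
    """
    def phi(vc):  # complexity penalty term
        return math.sqrt((vc * (math.log(2 * n / vc) + 1) - math.log(rho / 4)) / n)

    bounds = [err + phi(vc) for err, vc in candidates]
    return min(range(len(candidates)), key=bounds.__getitem__)

# Richer classes fit the training set better but pay a larger penalty.
print(srm_select([(0.20, 5), (0.12, 20), (0.05, 200)], n=500))
```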
➢ Bayesian Information Criterion (BIC)
Let N be the size of the training set, θ_m the vector of the unknown
parameters of the classifier, K_m the dimensionality of θ_m, and m an index
running over all candidate models.

• The BIC criterion chooses the model by minimizing:
  \text{BIC} = -2 L(\hat{\theta}_m) + K_m \ln N
  – L(\hat{\theta}_m) is the log-likelihood computed at the ML estimate
    \hat{\theta}_m, and it is the performance index.
  – K_m \ln N is the model complexity term.

• Akaike Information Criterion:
  \text{AIC} = -2 L(\hat{\theta}_m) + 2 K_m
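Both criteria are one-liners once the maximized log-likelihood and the parameter count are available; the model scores below are hypothetical.

```python
import math

def bic(log_likelihood, k_m, n):
    """BIC = -2 L(theta_hat_m) + K_m ln N (smaller is better)."""
    return -2.0 * log_likelihood + k_m * math.log(n)

def aic(log_likelihood, k_m):
    """AIC = -2 L(theta_hat_m) + 2 K_m (smaller is better)."""
    return -2.0 * log_likelihood + 2.0 * k_m

# Hypothetical models: (maximized log-likelihood, number of parameters)
models = {"simple": (-520.0, 4), "complex": (-510.0, 20)}
best = min(models, key=lambda name: bic(*models[name], n=1000))
print(best)   # the ln N factor makes BIC penalize extra parameters more than AIC
```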
