
CS 464

Introduction to Machine
Learning

Feature Selection

(Slides based on material by Mehmet Koyutürk, Öznur Taştan and Mark Craven)
Feature Selection
• The objective in classification/regression is to learn
a function that relates values of features to values
of outcome variable(s)
– Often, we are presented with many features
– Not all of these features are relevant

• Feature Selection is the task of identifying an
“optimal” (informally speaking) set of features
that are useful for accurately predicting the
outcome variable
Motivation for Feature Selection
• Accuracy
– Getting rid of irrelevant features can help learn better
predictive models by reducing confusion
• Generalizability
– Models with fewer features have lower complexity, so they
are less prone to overfitting
• Interpretability
– Identifying a small set of features can help in understanding
the mechanism underlying the relationship between the features
and the outcome variable(s)
• Efficiency
– With a smaller number of features, learning and prediction
may take less time/space
Three Main Approaches
1. Treat feature selection as a separate task
• Filtering-based feature selection
• Wrapper-based feature selection

2. Embed feature selection into the task of learning a model
• Regularization

3. Do not select features; instead, construct new features that
effectively represent combinations of the original features
• Dimensionality reduction
Feature Selection as a Separate Task
Filtering

Score Features → Rank Features Based on Score → Select Top k Features → Train Model

• Scores do not represent prediction performance,
since no validation is done at this stage
• Do NOT use validation/test samples to compute the scores

• k can be chosen heuristically
• Standard rules of thumb can be used to set a threshold (e.g.,
use features with statistically significant scores)
• Can use cross-validation to select an optimal value of k
(using prediction performance as the criterion)
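A minimal sketch of this filtering pipeline using scikit-learn (the choice of mutual information as the score, logistic regression as the downstream model, and the candidate values of k are illustrative assumptions, not prescribed by the slides):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data standing in for a real training set
X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=5, random_state=0)

# Try several values of k and pick the best by cross-validated accuracy;
# inside each split the feature scores are computed on the training fold only.
for k in (5, 10, 20, 50):
    pipe = Pipeline([
        ("select", SelectKBest(score_func=mutual_info_classif, k=k)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    accuracy = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"k={k:2d}  cv accuracy={accuracy:.3f}")
```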
Scoring Features for Filtering
• Mutual information
– Reduction in uncertainty on the value of the outcome variable
upon observation of the value of feature
– Already discussed

• Statistical tests
– t-statistic: Standardized difference of the mean value of the
feature in different classes (continuous features)
– Chi-square statistic: Difference between counts in different
classes (discrete features, related to mutual information)

• Variance/frequency
– Continuous features with low variance are usually not useful
– Discrete features that are too frequent or too rare are usually
not useful
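A quick illustration of the variance criterion (assuming scikit-learn's VarianceThreshold; the threshold value and the toy data are arbitrary):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 1.2, 5.0],
              [0.0, 0.9, 5.5],
              [0.0, 1.1, 4.5]])   # column 0 is constant

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
print(selector.get_support())     # [False  True  True]: the zero-variance column is dropped
print(X_reduced.shape)            # (3, 2)
```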
Feature Selection – In Text Classification
• In text classification, we usually represent documents with a
high-dimensional feature vector:
• Each dimension corresponds to a term
• Many dimensions correspond to rare words
• Rare words can mislead the classifier

• Rare misleading features are called noise features

• Eliminating noise features from the representation increases the
efficiency and effectiveness of text classification

Noisy Features
• A noise feature is one that increases the classification error on
new data.

• Suppose you are doing topic classification. One class is China

• A rare term, say arachnocentric, has no information about
documents about China, but all instances of arachnocentric in the
training data happen to occur in documents related to China

• The learner might produce a classifier that misassigns test
documents containing arachnocentric to China

• Such an incorrect generalization from an accidental property of
the training set is an example of overfitting
Feature Selection
• With N features, there are 2^N possible feature subsets

• If you fix the feature subset size to M, there are still
"N choose M" candidate subsets

• This number of combinations is infeasible to search
exhaustively, even for moderate M

• A search strategy is therefore needed to direct the
feature selection process as it explores the space of
all possible combinations of features
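For concreteness, a quick check of how fast these counts grow, using only the Python standard library (the particular N and M values are illustrative):

```python
from math import comb

N = 100                      # total number of features
for M in (2, 5, 10):
    print(f"subsets of size {M}: {comb(N, M):,}")
print(f"all subsets: {2**N:,}")
# Subsets of size 10 already number about 1.7e13; exhaustive search is hopeless.
```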

Filtering-Based Selection
• Use a simple measure to assess the relevance of
each feature to the outcome variable (class)
• Mutual information – reduction in the uncertainty in class
upon observation of the value of the feature
• Chi-square test: a statistical test that compares the frequencies
of a term between different classes

• Rank features, try models that include the top k
features as you increase k

• These methods are based on the rationale:
– good feature subsets contain features highly correlated
with (predictive of) the class
Information
• Information: reduction in uncertainty (amount of surprise in
the outcome)

$$I(X = x) = \log_2 \frac{1}{p(x)} = -\log_2 p(x)$$

• If the probability of an event is small and the event
nevertheless happens, the information is large:

Observing that the outcome of a coin flip is heads:
I = -log2(1/2) = 1 bit
Observing that the outcome of a die roll is 6:
I = -log2(1/6) ≈ 2.58 bits

Entropy
• The entropy of a random variable is the sum of the
information provided by its possible values, weighted by the
probability of each value
• Entropy is a measure of uncertainty
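The entropy formula the note below refers to (the standard definition; the slide's own equation did not survive extraction):

$$H(X) = -\sum_{x} p(x) \log_2 p(x) = \sum_{x} p(x) \log_2 \frac{1}{p(x)}$$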

The summation is over all possible values of the random variable.

[Figure: the entropy of a binary random variable as a function
of the probability of success]
Mutual Information
• Mutual information I(X,Y) is the reduction of uncertainty in
one variable upon observation of the other variable
• Mutual information is a measure of statistical dependency between
two random variables
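The corresponding formula (the standard definition; the slide's own equation did not survive extraction):

$$I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = \sum_{x}\sum_{y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}$$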
Mutual Information
• The mutual information between a feature (term) and the class
label measures the amount by which the uncertainty in the
class is decreased by knowledge of the feature. Here we compute
the mutual information (MI) of term t and class c.
• U is a random variable that takes value e_t = 1 (the document
contains term t) and e_t = 0 (the document does not contain t)
• C is a random variable that takes value e_c = 1 (the document is
in class c) and e_c = 0 (the document is not in class c)

• Definition:
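The definition referred to above, in the standard form used for text classification (the slide's own equation did not survive extraction):

$$I(U;C) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} P(U = e_t, C = e_c) \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t)\, P(C = e_c)}$$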
Mutual Information
• If a term’s occurrence is independent of the class (i.e., the
term’s distribution is the same in the class as it is in the
collection as a whole), then MI is 0

• MI is maximum if the term is a perfect indicator for class
membership (i.e., the term is present in a document if and
only if the document is in the class)

Mutual Information Example
• Consider the class poultry and the term export

• The counts of the number of documents with the four possible
combinations of indicator values are as follows (the numeric
table was shown as a figure on the slide):

• N_10: number of documents that contain t (e_t = 1) and are not in c (e_c = 0)
• N_11: number of documents that contain t (e_t = 1) and are in c (e_c = 1)
• N_01: number of documents that do not contain t (e_t = 0) and are in c (e_c = 1)
• N_00: number of documents that do not contain t (e_t = 0) and are not in c (e_c = 0)

How to compute Mutual Information
• Based on maximum likelihood estimates, the formula we
actually use is:

• N_10: number of documents that contain t (e_t = 1) and are not in c (e_c = 0)
• N_11: number of documents that contain t (e_t = 1) and are in c (e_c = 1)
• N_01: number of documents that do not contain t (e_t = 0) and are in c (e_c = 1)
• N_00: number of documents that do not contain t (e_t = 0) and are not in c (e_c = 0)
• N = N_00 + N_01 + N_10 + N_11
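The formula itself did not survive extraction; the standard maximum-likelihood form (with marginals N_1. = N_10 + N_11, N_.1 = N_01 + N_11, and so on) is:

$$I(U;C) = \frac{N_{11}}{N}\log_2\frac{N N_{11}}{N_{1.}N_{.1}} + \frac{N_{01}}{N}\log_2\frac{N N_{01}}{N_{0.}N_{.1}} + \frac{N_{10}}{N}\log_2\frac{N N_{10}}{N_{1.}N_{.0}} + \frac{N_{00}}{N}\log_2\frac{N N_{00}}{N_{0.}N_{.0}}$$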

Mutual Information Example
[The worked numeric computation was shown as a figure on the slide;
a sketch of the same computation follows.]
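A minimal sketch of the computation in Python, with hypothetical document counts (the numbers below are invented for illustration and are not the counts from the slide):

```python
import math

# Hypothetical counts for term t and class c (not the slide's data)
N11, N10 = 50, 25_000      # documents containing t: in class / not in class
N01, N00 = 150, 775_000    # documents without t:    in class / not in class
N = N00 + N01 + N10 + N11

N1_, N0_ = N11 + N10, N01 + N00   # marginals over the term indicator
N_1, N_0 = N11 + N01, N10 + N00   # marginals over the class indicator

def cell(count, n_t, n_c):
    """Contribution of one cell: (count/N) * log2(N*count / (n_t*n_c))."""
    return 0.0 if count == 0 else (count / N) * math.log2(N * count / (n_t * n_c))

mi = cell(N11, N1_, N_1) + cell(N01, N0_, N_1) + cell(N10, N1_, N_0) + cell(N00, N0_, N_0)
print(f"I(U;C) = {mi:.6f} bits")
```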
Chi-square statistic
• The Chi-square test is applied to test the
independence of two events, where two events A
and B are defined to be independent if
• P(AB) = P(A)P(B) or, equivalently,
• P(A|B) = P(A) and P(B|A) = P(B).

• The two events are the occurrence of the term and the
occurrence of the class. We rank the terms with respect
to the following quantity for dataset D, term t, and class c:

$$\chi^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}$$
Chi-square statistic
$$\chi^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}$$

• N_{e_t e_c} is the observed frequency in D and E_{e_t e_c} the expected frequency
• E_11, for example, is the expected frequency of t and c occurring together
in a document, assuming that the term and the class are independent
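A minimal sketch using SciPy on a hypothetical 2x2 term/class contingency table (the counts are invented for illustration; scipy.stats.chi2_contingency computes the expected counts and the statistic for us):

```python
from scipy.stats import chi2_contingency

# Rows: term present / absent; columns: in class / not in class (hypothetical counts)
observed = [[50, 25_000],
            [150, 775_000]]

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}")
print("expected counts under independence:")
print(expected)
# Terms are ranked by their chi2 value; larger values indicate stronger
# dependence between term occurrence and class membership.
```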
Chi-square statistic

• A measure of how much the expected counts E
and the observed counts N deviate from each other
Frequency based

• Select the terms that are most common in the class

• Document frequency: the number of documents in
class c that contain the term t -> more appropriate for the
binomial model

• Collection frequency: the number of tokens of t that occur
in documents of c -> more appropriate for the multinomial
model
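A small sketch of the two counts on a toy corpus (the documents and the term are invented for illustration):

```python
# Toy documents belonging to one class c (hypothetical)
docs_in_class = [
    "export of poultry rises",
    "poultry export export figures",
    "grain prices fall",
]
term = "export"

tokenized = [doc.split() for doc in docs_in_class]
document_frequency = sum(1 for tokens in tokenized if term in tokens)
collection_frequency = sum(tokens.count(term) for tokens in tokenized)

print(document_frequency)    # 2: documents of the class containing "export"
print(collection_frequency)  # 3: total occurrences of "export" in the class
```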
Why Feature Selection Helps

t-statistic
• We have n1 and n2 samples from each class, respectively

• For each feature, let x̄1, s1² be the sample mean and variance of
the first class, and x̄2, s2² be those of the second

• The distribution of the t-statistic approaches the normal
distribution as the number of samples grows
• We can set a threshold on the t-statistic for a feature to be selected
based on the t-distribution
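The slide's own equation did not survive extraction; the two-sample t-statistic in its common unequal-variance form is:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$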
Wrapper Methods
• Frame the feature selection task as a search
problem

• Evaluate each feature set by using the prediction
performance of the learning algorithm on that
feature set
– Cross-validation

• How to search the exponential space of feature
sets?
Searching for Feature Sets

state = a set of features

start state = empty (forward selection)
or full (backward elimination)

operators:
add/subtract a feature

scoring function:
cross-validation accuracy using the learning method on a
given state’s feature set
Forward Selection

Given: feature set X_1, …, X_n, training set D, learning
method L

F ← {}
While the score of F is improving:
    for i ← 1 to n do
        if X_i ∉ F
            G_i ← F ∪ {X_i}
            Score_i ← Evaluate(G_i, L, D)
    F ← the G_i with the best Score_i
return feature set F

(Evaluate scores a feature set G_i by learning model(s) with L
and assessing its/their accuracy.)
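A minimal Python sketch of this greedy loop (assuming scikit-learn; the estimator and the synthetic data are placeholders, and scikit-learn's SequentialFeatureSelector provides a ready-made version of the same idea):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=4, random_state=0)
estimator = LogisticRegression(max_iter=1000)

def evaluate(features):
    """Score a candidate feature set by cross-validated accuracy."""
    return cross_val_score(estimator, X[:, features], y, cv=5).mean()

selected, best_score = [], -np.inf
while True:
    candidates = [f for f in range(X.shape[1]) if f not in selected]
    if not candidates:
        break
    scores = {f: evaluate(selected + [f]) for f in candidates}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:   # stop when the score no longer improves
        break
    selected.append(f_best)
    best_score = scores[f_best]

print("selected features:", selected, "cv accuracy:", round(best_score, 3))
```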
Forward Selection vs. Backward Elimination

• Both use a hill-climbing search

Forward selection:
• Efficient for choosing a small subset of the features
• Misses features whose usefulness requires other features

Backward elimination:
• Efficient for discarding a small subset of features
• Preserves features whose usefulness requires other features
Embedded Methods (Regularization)

• Instead of explicitly selecting features, bias the
learning process towards using a small number
of features

• Key idea: the objective function has two parts
• A term representing error minimization (model fit)
• A term that “shrinks” the parameters toward 0
Ridge Regression
• Linear regression:

$$f(x) = w_0 + \sum_{j=1}^{d} x_j w_j$$

$$E(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(x^{(i)})\right)^2 = \sum_{i=1}^{n} \Big(y^{(i)} - w_0 - \sum_{j=1}^{d} x_j^{(i)} w_j\Big)^2$$

• Penalty term (L2 norm of the coefficients) added:

$$E(\mathbf{w}) = \sum_{i=1}^{n} \Big(y^{(i)} - w_0 - \sum_{j=1}^{d} x_j^{(i)} w_j\Big)^2 + \lambda \sum_{j=1}^{d} w_j^2$$
LASSO
• Ridge regression shrinks the weights, but does not
necessarily reduce the number of features
– We would like to force some coefficients to be exactly 0

• Add the L1 norm of the coefficients as the penalty term:

$$E(\mathbf{w}) = \sum_{i=1}^{n} \Big(y^{(i)} - w_0 - \sum_{j=1}^{d} x_j^{(i)} w_j\Big)^2 + \lambda \sum_{j=1}^{d} |w_j|$$

– Why does this result in more coefficients being set to 0,
effectively performing feature selection?
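A brief sketch contrasting the two penalties with scikit-learn (synthetic data; alpha plays the role of λ and its values here are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem in which only a few features are informative
X, y = make_regression(n_samples=200, n_features=30,
                       n_informative=5, noise=5.0, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Ridge shrinks weights toward 0 but rarely makes them exactly 0;
# Lasso drives many weights to exactly 0, implicitly selecting features.
print("ridge coefficients equal to 0:", int(np.sum(ridge.coef_ == 0)))
print("lasso coefficients equal to 0:", int(np.sum(lasso.coef_ == 0)))
```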
Ridge Regression vs. LASSO

Plot of the contours of the unregularized error function (red) along with the
constraint region for lasso (left) and ridge (right). 𝛽’s are the weights we
learn.
Generalizing Regularization
• L1 and L2 penalties can be used with other learning
methods (logistic regression, neural nets, SVMs,
etc.)
– Both can help avoid overfitting by reducing variance
• There are many variants with somewhat different
biases
– Elastic net: includes L1 and L2 penalties
– Group Lasso: bias towards selecting defined groups of
features
– Graph Lasso: bias towards selecting “adjacent” features
in a defined graph
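For concreteness, one common way to write the elastic net objective mentioned above (a standard parameterization, not taken from the slides):

$$E(\mathbf{w}) = \sum_{i=1}^{n} \Big(y^{(i)} - w_0 - \sum_{j=1}^{d} x_j^{(i)} w_j\Big)^2 + \lambda_1 \sum_{j=1}^{d} |w_j| + \lambda_2 \sum_{j=1}^{d} w_j^2$$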
