
CS 464

Introduction to Machine
Learning

Feature Selection

(Slides based on material by Mehmet Koyutürk, Öznur Taştan and Mark Craven)
Feature Selection
• The objective in classification/regression is to learn
a function that relates values of features to values
of outcome variable(s)
– Often, we are presented with many features
– Not all of these features are relevant

• Feature Selection is the task of identifying an
“optimal” (informally speaking) set of features
that are useful for accurately predicting the
outcome variable
Motivation for Feature Selection
• Accuracy
– Getting rid of irrelevant features can help learn better
predictive models by reducing confusion
• Generalizability
– Models with fewer features have lower complexity, so they
are less prone to overfitting
• Interpretability
– Identifying a small set of features can help in understanding
the mechanism underlying the relationship between the features
and the outcome variable(s)
• Efficiency
– With a smaller number of features, learning and prediction
may take less time/space
Three Main Approaches
1. Treat feature selection as a separate task
• Filtering-based feature selection
• Wrapper-based feature selection

2. Embed feature selection into the task of learning a model
• Regularization

3. Do not select features; instead, construct new features that
effectively represent combinations of the original features
• Dimensionality reduction
Feature Selection as a Separate Task
Filtering

Score Features → Rank Features Based on Score → Select Top k Features → Train Model

• Scores do not represent prediction performance,
since no validation is done at this stage
• Do NOT use validation/test samples to compute the scores

• k can be chosen heuristically
• Standard rules of thumb can be used to set a threshold (e.g.,
use features with statistically significant scores)
• Can use cross-validation to select an optimal value of k
(using prediction performance as the criterion)
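A minimal sketch of this filtering pipeline using scikit-learn (the choice of mutual information as the score, logistic regression as the downstream model, and the candidate values of k are illustrative assumptions, not prescribed by the slides):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data standing in for a real training set
X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=5, random_state=0)

# Try several values of k and pick the best by cross-validated accuracy;
# inside each split the feature scores are computed on the training fold only.
for k in (5, 10, 20, 50):
    pipe = Pipeline([
        ("select", SelectKBest(score_func=mutual_info_classif, k=k)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    accuracy = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"k={k:2d}  cv accuracy={accuracy:.3f}")
```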
Scoring Features for Filtering
• Mutual information
– Reduction in uncertainty on the value of the outcome variable
upon observation of the value of feature
– Already discussed

• Statistical tests
– t-statistic: Standardized difference of the mean value of the
feature in different classes (continuous features)
– Chi-square statistic: Difference between counts in different
classes (discrete features, related to mutual information)

• Variance/frequency
– Continuous features with low variance are usually not useful
– Discrete features that are too frequent or too rare are usually
not useful
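A quick illustration of the variance criterion (assuming scikit-learn's VarianceThreshold; the threshold value and the toy data are arbitrary):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 1.2, 5.0],
              [0.0, 0.9, 5.5],
              [0.0, 1.1, 4.5]])   # column 0 is constant

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
print(selector.get_support())     # [False  True  True]: the zero-variance column is dropped
print(X_reduced.shape)            # (3, 2)
```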
Feature Selection – In Text Classification
• In text classification, we usually represent documents with a
high-dimensional feature vector:
• Each dimension corresponds to a term
• Many dimensions correspond to rare words
• Rare words can mislead the classifier

• Rare misleading features are called noise features

• Eliminating noise features from the representation increases the
efficiency and effectiveness of text classification

Noisy Features
• A noise feature is one that increases the classification error on
new data.

• Suppose you are doing topic classification. One class is China

• A rare term, say arachnocentric, has no information about
documents about China, but all instances of arachnocentric in the
training data happen to occur in documents related to China

• The learner might produce a classifier that misassigns test
documents containing arachnocentric to China

• Such an incorrect generalization from an accidental property of
the training set is an example of overfitting
Feature Selection
• With N features, there are 2^N possible feature subsets

• If you fix the feature subset size to M, there are still
"N choose M" candidate subsets

• This number of combinations is infeasible to search
exhaustively, even for moderate M

• A search strategy is therefore needed to direct the
feature selection process as it explores the space of
all possible combinations of features
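For concreteness, a quick check of how fast these counts grow, using only the Python standard library (the particular N and M values are illustrative):

```python
from math import comb

N = 100                      # total number of features
for M in (2, 5, 10):
    print(f"subsets of size {M}: {comb(N, M):,}")
print(f"all subsets: {2**N:,}")
# Subsets of size 10 already number about 1.7e13; exhaustive search is hopeless.
```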

Filtering-Based Selection
• Use a simple measure to assess the relevance of
each feature to the outcome variable (class)
• Mutual information – reduction in the uncertainty in class
upon observation of the value of the feature
• Chi-square test: a statistical test that compares the frequencies
of a term between different classes

• Rank features, try models that include the top k
features as you increase k

• These methods are based on the rationale:
– good feature subsets contain features highly correlated
with (predictive of) the class
Information
• Information: reduction in uncertainty (amount of surprise in
the outcome)

$$I(X = x) = \log_2 \frac{1}{p(x)} = -\log_2 p(x)$$

• If the probability of an event is small and the event
nevertheless happens, the information is large:

Observing that the outcome of a coin flip is heads:
I = -log2(1/2) = 1 bit
Observing that the outcome of a die roll is 6:
I = -log2(1/6) ≈ 2.58 bits

Entropy
• The entropy of a random variable is the sum of the
information provided by its possible values, weighted by the
probability of each value
• Entropy is a measure of uncertainty
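The entropy formula the note below refers to (the standard definition; the slide's own equation did not survive extraction):

$$H(X) = -\sum_{x} p(x) \log_2 p(x) = \sum_{x} p(x) \log_2 \frac{1}{p(x)}$$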

The summation is over all possible values of the random variable.

[Figure: the entropy of a binary random variable as a function
of the probability of success]
Mutual Information
• Mutual information I(X,Y) is the reduction of uncertainty in
one variable upon observation of the other variable
• Mutual information is a measure of statistical dependency between
two random variables
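The corresponding formula (the standard definition; the slide's own equation did not survive extraction):

$$I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = \sum_{x}\sum_{y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}$$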
Mutual Information
• The mutual information between a feature (term) and the class
label measures the amount by which the uncertainty in the
class is decreased by knowledge of the feature. Here we compute
the mutual information (MI) of term t and class c.
• U is a random variable that takes value e_t = 1 (the document
contains term t) and e_t = 0 (the document does not contain t)
• C is a random variable that takes value e_c = 1 (the document is
in class c) and e_c = 0 (the document is not in class c)

• Definition:
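The definition referred to above, in the standard form used for text classification (the slide's own equation did not survive extraction):

$$I(U;C) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} P(U = e_t, C = e_c) \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t)\, P(C = e_c)}$$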
Mutual Information
• If a term’s occurrence is independent of the class (i.e., the
term’s distribution is the same in the class as it is in the
collection as a whole), then MI is 0

• MI is maximum if the term is a perfect indicator for class
membership (i.e., the term is present in a document if and
only if the document is in the class)

Mutual Information Example
• Consider the class poultry and the term export

• The counts of the number of documents with the four possible
combinations of indicator values are as follows (the numeric
table was shown as a figure on the slide):

• N_10: number of documents that contain t (e_t = 1) and are not in c (e_c = 0)
• N_11: number of documents that contain t (e_t = 1) and are in c (e_c = 1)
• N_01: number of documents that do not contain t (e_t = 0) and are in c (e_c = 1)
• N_00: number of documents that do not contain t (e_t = 0) and are not in c (e_c = 0)

How to compute Mutual Information
• Based on maximum likelihood estimates, the formula we
actually use is:

• N_10: number of documents that contain t (e_t = 1) and are not in c (e_c = 0)
• N_11: number of documents that contain t (e_t = 1) and are in c (e_c = 1)
• N_01: number of documents that do not contain t (e_t = 0) and are in c (e_c = 1)
• N_00: number of documents that do not contain t (e_t = 0) and are not in c (e_c = 0)
• N = N_00 + N_01 + N_10 + N_11
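The formula itself did not survive extraction; the standard maximum-likelihood form (with marginals N_1. = N_10 + N_11, N_.1 = N_01 + N_11, and so on) is:

$$I(U;C) = \frac{N_{11}}{N}\log_2\frac{N N_{11}}{N_{1.}N_{.1}} + \frac{N_{01}}{N}\log_2\frac{N N_{01}}{N_{0.}N_{.1}} + \frac{N_{10}}{N}\log_2\frac{N N_{10}}{N_{1.}N_{.0}} + \frac{N_{00}}{N}\log_2\frac{N N_{00}}{N_{0.}N_{.0}}$$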

Mutual Information Example
[The worked numeric computation was shown as a figure on the slide;
a sketch of the same computation follows.]
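A minimal sketch of the computation in Python, with hypothetical document counts (the numbers below are invented for illustration and are not the counts from the slide):

```python
import math

# Hypothetical counts for term t and class c (not the slide's data)
N11, N10 = 50, 25_000      # documents containing t: in class / not in class
N01, N00 = 150, 775_000    # documents without t:    in class / not in class
N = N00 + N01 + N10 + N11

N1_, N0_ = N11 + N10, N01 + N00   # marginals over the term indicator
N_1, N_0 = N11 + N01, N10 + N00   # marginals over the class indicator

def cell(count, n_t, n_c):
    """Contribution of one cell: (count/N) * log2(N*count / (n_t*n_c))."""
    return 0.0 if count == 0 else (count / N) * math.log2(N * count / (n_t * n_c))

mi = cell(N11, N1_, N_1) + cell(N01, N0_, N_1) + cell(N10, N1_, N_0) + cell(N00, N0_, N_0)
print(f"I(U;C) = {mi:.6f} bits")
```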
Chi-square statistic
• The Chi-square test is applied to test the
independence of two events, where two events A
and B are defined to be independent if
• P(AB) = P(A)P(B) or, equivalently,
• P(A|B) = P(A) and P(B|A) = P(B).

• The two events are the occurrence of the term and the
occurrence of the class. We rank the terms with respect
to the following quantity for dataset D, term t, and class c:

$$\chi^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}$$
Chi-square statistic
$$\chi^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}$$

• N_{e_t e_c} is the observed frequency in D and E_{e_t e_c} the expected frequency
• E_11, for example, is the expected frequency of t and c occurring together
in a document, assuming that the term and the class are independent
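A minimal sketch using SciPy on a hypothetical 2x2 term/class contingency table (the counts are invented for illustration; scipy.stats.chi2_contingency computes the expected counts and the statistic for us):

```python
from scipy.stats import chi2_contingency

# Rows: term present / absent; columns: in class / not in class (hypothetical counts)
observed = [[50, 25_000],
            [150, 775_000]]

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}")
print("expected counts under independence:")
print(expected)
# Terms are ranked by their chi2 value; larger values indicate stronger
# dependence between term occurrence and class membership.
```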
Chi-square statistic

• A measure of how much the expected counts E
and the observed counts N deviate from each other
Frequency based

• Select the terms that are most common in the class

• Document frequency: the number of documents in
class c that contain the term t -> more appropriate for the
binomial model

• Collection frequency: the number of tokens of t that occur
in documents of c -> more appropriate for the multinomial
model
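A small sketch of the two counts on a toy corpus (the documents and the term are invented for illustration):

```python
# Toy documents belonging to one class c (hypothetical)
docs_in_class = [
    "export of poultry rises",
    "poultry export export figures",
    "grain prices fall",
]
term = "export"

tokenized = [doc.split() for doc in docs_in_class]
document_frequency = sum(1 for tokens in tokenized if term in tokens)
collection_frequency = sum(tokens.count(term) for tokens in tokenized)

print(document_frequency)    # 2: documents of the class containing "export"
print(collection_frequency)  # 3: total occurrences of "export" in the class
```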
Why Feature Selection Helps

t-statistic
• We have n1 and n2 samples from each class, respectively

• For each feature, let x̄1, s1² be the sample mean and variance of
the first class, and x̄2, s2² be those of the second

• The distribution of the t-statistic approaches the normal
distribution as the number of samples grows
• We can set a threshold on the t-statistic for a feature to be selected
based on the t-distribution
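The slide's own equation did not survive extraction; the two-sample t-statistic in its common unequal-variance form is:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$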
Wrapper Methods
• Frame the feature selection task as a search
problem

• Evaluate each feature set by using the prediction
performance of the learning algorithm on that
feature set
– Cross-validation

• How to search the exponential space of feature
sets?
Searching for Feature Sets

state = a set of features

start state = empty (forward selection)
or full (backward elimination)

operators:
add/subtract a feature

scoring function:
cross-validation accuracy using the learning method on a
given state’s feature set
Forward Selection

Given: feature set X_1, …, X_n, training set D, learning
method L

F ← {}
While the score of F is improving:
    for i ← 1 to n do
        if X_i ∉ F
            G_i ← F ∪ {X_i}
            Score_i ← Evaluate(G_i, L, D)
    F ← the G_i with the best Score_i
return feature set F

(Evaluate scores a feature set G_i by learning model(s) with L
and assessing its/their accuracy.)
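A minimal Python sketch of this greedy loop (assuming scikit-learn; the estimator and the synthetic data are placeholders, and scikit-learn's SequentialFeatureSelector provides a ready-made version of the same idea):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=4, random_state=0)
estimator = LogisticRegression(max_iter=1000)

def evaluate(features):
    """Score a candidate feature set by cross-validated accuracy."""
    return cross_val_score(estimator, X[:, features], y, cv=5).mean()

selected, best_score = [], -np.inf
while True:
    candidates = [f for f in range(X.shape[1]) if f not in selected]
    if not candidates:
        break
    scores = {f: evaluate(selected + [f]) for f in candidates}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:   # stop when the score no longer improves
        break
    selected.append(f_best)
    best_score = scores[f_best]

print("selected features:", selected, "cv accuracy:", round(best_score, 3))
```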
Forward Selection vs. Backward Elimination

• Both use a hill-climbing search

Forward selection:
• Efficient for choosing a small subset of the features
• Misses features whose usefulness requires other features

Backward elimination:
• Efficient for discarding a small subset of features
• Preserves features whose usefulness requires other features
Embedded Methods (Regularization)

• Instead of explicitly selecting features, bias the
learning process towards using a small number
of features

• Key idea: the objective function has two parts
• A term representing error minimization (model fit)
• A term that “shrinks” the parameters toward 0
Ridge Regression
• Linear regression:

$$f(x) = w_0 + \sum_{j=1}^{d} x_j w_j$$

$$E(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(x^{(i)})\right)^2 = \sum_{i=1}^{n} \Big(y^{(i)} - w_0 - \sum_{j=1}^{d} x_j^{(i)} w_j\Big)^2$$

• Penalty term (L2 norm of the coefficients) added:

$$E(\mathbf{w}) = \sum_{i=1}^{n} \Big(y^{(i)} - w_0 - \sum_{j=1}^{d} x_j^{(i)} w_j\Big)^2 + \lambda \sum_{j=1}^{d} w_j^2$$
LASSO
• Ridge regression shrinks the weights, but does not
necessarily reduce the number of features
– We would like to force some coefficients to be exactly 0

• Add the L1 norm of the coefficients as the penalty term:

$$E(\mathbf{w}) = \sum_{i=1}^{n} \Big(y^{(i)} - w_0 - \sum_{j=1}^{d} x_j^{(i)} w_j\Big)^2 + \lambda \sum_{j=1}^{d} |w_j|$$

– Why does this result in more coefficients being set to 0,
effectively performing feature selection?
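A brief sketch contrasting the two penalties with scikit-learn (synthetic data; alpha plays the role of λ and its values here are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem in which only a few features are informative
X, y = make_regression(n_samples=200, n_features=30,
                       n_informative=5, noise=5.0, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Ridge shrinks weights toward 0 but rarely makes them exactly 0;
# Lasso drives many weights to exactly 0, implicitly selecting features.
print("ridge coefficients equal to 0:", int(np.sum(ridge.coef_ == 0)))
print("lasso coefficients equal to 0:", int(np.sum(lasso.coef_ == 0)))
```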
Ridge Regression vs. LASSO

Plot of the contours of the unregularized error function (red) along with the
constraint region for lasso (left) and ridge (right). 𝛽’s are the weights we
learn.
Generalizing Regularization
• L1 and L2 penalties can be used with other learning
methods (logistic regression, neural nets, SVMs,
etc.)
– Both can help avoid overfitting by reducing variance
• There are many variants with somewhat different
biases
– Elastic net: includes L1 and L2 penalties
– Group Lasso: bias towards selecting defined groups of
features
– Graph Lasso: bias towards selecting “adjacent” features
in a defined graph
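For concreteness, one common way to write the elastic net objective mentioned above (a standard parameterization, not taken from the slides):

$$E(\mathbf{w}) = \sum_{i=1}^{n} \Big(y^{(i)} - w_0 - \sum_{j=1}^{d} x_j^{(i)} w_j\Big)^2 + \lambda_1 \sum_{j=1}^{d} |w_j| + \lambda_2 \sum_{j=1}^{d} w_j^2$$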
