
Feature Selection

Slide 1
Feature Selection

Slide 2
Slide 3
Slide 4
Slide 5
Slide 6
Why we need Feature Selection:

1. To improve performance (in terms of speed, predictive power, and simplicity of the model).
2. To visualize the data for model selection.
3. To reduce dimensionality and remove noise.

• Feature Selection is a process that chooses an optimal subset of features according to a certain criterion.

Slide 7
Need for Feature Selection

• In machine learning, it is necessary to provide a well pre-processed input dataset in order to get better outcomes.
• We collect a huge amount of data to train our model and help it learn better. Generally, the dataset consists of noisy data, irrelevant data, and only some useful data.
• Moreover, the huge amount of data also slows down the training process of the model, and with noise and irrelevant data the model may not predict or perform well.
• So it is necessary to remove such noise and less important data from the dataset, and Feature Selection techniques are used to do this.

Slide 8
Overview

• Reasons for performing Feature Selection may include:

– removing irrelevant data.
– increasing the predictive accuracy of learned models.
– reducing the cost of the data.
– improving learning efficiency, such as reducing storage requirements and computational cost.
– reducing the complexity of the resulting model description, improving the understanding of the data and the model.

Slide 9
Feature Selection Techniques
• There are mainly two types of Feature Selection techniques (a sketch contrasting them follows below):
• Supervised Feature Selection techniques
Supervised Feature Selection techniques consider the target variable and are used with labelled datasets.
• Unsupervised Feature Selection techniques
Unsupervised Feature Selection techniques ignore the target variable and are used with unlabelled datasets.
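
A minimal sketch of the contrast, assuming scikit-learn is available; the synthetic dataset and the chosen selectors (SelectKBest for the supervised case, VarianceThreshold for the unsupervised case) are illustrative assumptions, not methods named on the slides:

# Sketch: supervised selection consults the labels y, unsupervised does not.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X[:, 5] = 0.001 * rng.normal(size=100)        # a near-constant, uninformative feature
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # labels depend on the first two features

# Supervised: scores every feature against the labelled target y.
supervised = SelectKBest(f_classif, k=3).fit(X, y)

# Unsupervised: drops low-variance features without ever looking at y.
unsupervised = VarianceThreshold(threshold=0.01).fit(X)

print("supervised keeps:  ", supervised.get_support(indices=True))
print("unsupervised keeps:", unsupervised.get_support(indices=True))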

Slide 10
• Filter Method:
• Features are kept or dropped based on how strongly they relate to the output. We use a statistical measure such as correlation to check whether a feature is positively or negatively related to the output labels, and drop features accordingly. E.g.: Information Gain, Chi-Square Test, Fisher's Score, etc. (a sketch follows below).
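
A minimal sketch of the filter approach, assuming scikit-learn; the Iris data and k=2 are illustrative choices, and the score functions correspond to the Chi-Square test and Information Gain (mutual information) mentioned above:

# Sketch: filter methods score each feature independently of any learned model.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)             # non-negative features, as chi2 requires

# Chi-Square test between each feature and the categorical target.
chi2_sel = SelectKBest(score_func=chi2, k=2).fit(X, y)

# Information Gain (mutual information) also captures nonlinear dependence.
mi_sel = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)

print("chi2 scores:", chi2_sel.scores_)
print("mutual information scores:", mi_sel.scores_)
print("features kept by chi2:", chi2_sel.get_support(indices=True))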

Slide 11
Wrapper Methods
• We split our features into subsets and train a model on each candidate subset. Based on the model's performance, we add or remove features and train the model again. The subsets are built with a greedy search that evaluates the accuracy of the candidate feature combinations. E.g.: Forward Selection, Backward Elimination, etc.
Some of the techniques under these methods are (see the sketch after this list):

Forward selection.
Backward selection.
Bi-directional selection.
Exhaustive selection.
Recursive selection.
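
A minimal sketch of forward and backward wrapper selection, assuming scikit-learn's SequentialFeatureSelector; the estimator, dataset, and number of features to keep are illustrative assumptions:

# Sketch: a greedy wrapper re-trains the model as features are added/removed.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
estimator = LogisticRegression(max_iter=5000)

# Forward selection: start empty, greedily add the feature that helps most.
forward = SequentialFeatureSelector(
    estimator, n_features_to_select=5, direction="forward", cv=5).fit(X, y)

# Backward elimination: start with all features, greedily remove the worst.
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=5, direction="backward", cv=5).fit(X, y)

print("forward keeps: ", forward.get_support(indices=True))
print("backward keeps:", backward.get_support(indices=True))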

Slide 12
• Intrinsic (Embedded) Method: This method combines the qualities of both the Filter and Wrapper methods to create the best subset.
• Feature selection is carried out as part of the model's own iterative training process, keeping the computational cost to a minimum. E.g.: Lasso and Ridge Regression (a Lasso sketch follows below).
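
A minimal sketch of the intrinsic approach with Lasso, as named on the slide; the dataset, scaling, and alpha value are illustrative assumptions:

# Sketch: the L1 penalty drives coefficients of unhelpful features to exactly
# zero while the model trains, so selection is a by-product of fitting.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)        # features with non-zero weight
print("selected feature indices:", selected)
print("coefficients:", np.round(lasso.coef_, 2))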

Slide 13
Slide 14
How to choose a Feature Selection Method?

Slide 15
How to Choose a Feature Selection Method
• Choosing the best feature selection method depends on the input and output in
consideration (a sketch matching methods to data types follows below):
• Numerical Input, Numerical Output: a regression feature selection problem with numerical
input variables - use a correlation coefficient, such as Pearson's correlation coefficient (for
linear relationships) or Spearman's rank coefficient (for nonlinear, monotonic relationships).
• Numerical Input, Categorical Output: a classification feature selection problem with
numerical input variables - use a correlation measure that takes the categorical target into
account, such as the ANOVA F-statistic (linear) or Kendall's rank coefficient (nonlinear).
• Categorical Input, Numerical Output: a regression predictive modeling problem with
categorical input variables (rare) - use the same measures (ANOVA or Kendall's rank
coefficient), but applied in reverse.
• Categorical Input, Categorical Output: a classification predictive modeling problem with
categorical input variables - use a statistical test such as the Chi-Squared test (on
contingency tables) or Mutual Information, a powerful measure that is agnostic to data types.
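
A minimal sketch matching statistics to data types, assuming scikit-learn and SciPy; the synthetic variables are illustrative, and the ANOVA and chi-square selectors stand in for the measures listed above:

# Sketch: pick the scoring statistic according to input/output data types.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.feature_selection import SelectKBest, f_classif, chi2

rng = np.random.default_rng(0)
X_num = rng.normal(size=(200, 5))                  # numerical inputs
y_num = X_num[:, 0] + 0.5 * rng.normal(size=200)   # numerical output
y_cat = (X_num[:, 1] > 0).astype(int)              # categorical output
X_cat = rng.integers(0, 3, size=(200, 4))          # categorical inputs (count-coded)

# Numerical in, numerical out: Pearson (linear) or Spearman (rank-based).
print("Pearson:", pearsonr(X_num[:, 0], y_num))
print("Spearman:", spearmanr(X_num[:, 0], y_num))

# Numerical in, categorical out: ANOVA F-test scores per feature.
print("ANOVA F:", SelectKBest(f_classif, k=2).fit(X_num, y_cat).scores_)

# Categorical in, categorical out: chi-square on non-negative counts.
print("chi2:", SelectKBest(chi2, k=2).fit(X_cat, y_cat).scores_)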

Slide 16
Perspectives

• Filters:
– measuring uncertainty, distances, dependence or consistency is usually
cheaper than measuring the accuracy of a learning process; thus, filter
methods are usually faster.

– they do not rely on a particular learning bias, so the
selected features can be used to learn different models.

– they can handle larger datasets, due to the simplicity and low time
complexity of their evaluation measures.

Slide 17
Slide 18
Perspectives

• Filters:

Slide 19
Perspectives

• Embedded FS:
– similar to the wrapper approach in the sense that the features are
specifically selected for a certain learning algorithm, but in this
approach the features are selected during the learning process itself.
– embedded methods can take advantage of all the available data by not
requiring the training data to be split into training and validation sets,
and they can reach a faster solution by avoiding re-training a predictor
for each feature subset explored (a sketch follows below).
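
A minimal sketch of an embedded selector, assuming scikit-learn's SelectFromModel wrapped around an L1-penalised linear SVM: one training run yields the subset, with no per-subset re-training and no separate validation split. The model, scaling, and C value are illustrative assumptions, not taken from the slides:

# Sketch: embedded selection happens inside a single model-training run.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

svc = LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=10000)
selector = SelectFromModel(svc).fit(X, y)     # fits svc once; keeps non-zero weights
print("features kept:", selector.get_support(indices=True))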

Slide 20
Aspects: Output of Feature Selection

• Feature Ranking Techniques:

– the expected output is a list of features ranked according to an evaluation measure.
– they return the relevance of each feature.
– to perform actual FS, the simplest way is to keep the first m features for the task at hand, provided we know an appropriate value of m (a sketch follows below).
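
A minimal sketch of a ranking technique, assuming scikit-learn; mutual information is used as the (illustrative) relevance score, and m = 10 is an assumed cut-off:

# Sketch: score every feature, sort by relevance, keep the first m.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

data = load_breast_cancer()
scores = mutual_info_classif(data.data, data.target, random_state=0)

ranking = np.argsort(scores)[::-1]            # ranked list, most relevant first
m = 10                                        # assumed, chosen by the practitioner
print("top-m features:", [data.feature_names[i] for i in ranking[:m]])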

Slide 21
Aspects: Output of Feature Selection

• Feature Ranking Techniques:

Slide 22
Aspects: Output of Feature Selection

• Minimum Subset Techniques:

– The number of relevant features is a parameter that is often not known by the practitioner.
– Hence a second category of techniques focuses on obtaining the minimum possible subset of features, without ordering them.
– Features inside the returned subset are relevant; everything outside it is treated as irrelevant (a sketch follows below).
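
A minimal sketch of a subset-style technique, assuming scikit-learn's RFECV, which picks the subset size itself via cross-validation rather than handing back a ranked list; RFECV is one possible instance under these assumptions, not necessarily the method the slides have in mind:

# Sketch: the method returns a subset (and its size) instead of a ranking.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

selector = RFECV(LogisticRegression(max_iter=5000), step=1, cv=5).fit(X, y)
print("subset size chosen by the method:", selector.n_features_)
print("selected features:", selector.get_support(indices=True))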

Slide 23
Aspects: Output of Feature Selection

• Minimum Subset Techniques:

Slide 24
Aspects: Evaluation

• Goals:
– Inferability: for predictive tasks, an improvement in the prediction of
unseen examples with respect to directly using the raw training data.
– Interpretability: since raw data is hard for humans to comprehend, data
mining is also used to generate a more understandable representation of
its structure that can explain the behavior of the data.
– Data Reduction: data with lower dimensionality is better and simpler to
handle in terms of efficiency and interpretability.

Slide 25
Aspects: Evaluation

• From these three goals we can derive several assessment measures:
– Accuracy
– Complexity
– Number of features selected
– Speed of the FS method
– Generality of the features selected


Slide 26
Aspects: Drawbacks
• The resulting subsets of many FS models are strongly dependent on
the training set size.
• It is not true that a high-dimensional input can always be reduced to
a small subset of features, because the target variable may actually be
related to many input features, and removing any of them would
seriously affect the learning performance.
• A backward removal strategy is very slow when working with large-
scale datasets, because in the first stages of the algorithm it
has to make decisions based on huge quantities of data.
• In some cases, the FS outcome will still contain a relatively large
number of relevant features, which may even inhibit the use of complex
learning methods.

Slide 27
Aspects: Using Decision Trees for FS

• Decision trees can be used to implement a trade-off between
the performance of the selected features and the computation
time required to find a subset.

• Decision tree inducers can be considered anytime
algorithms for FS, because they gradually improve
performance and can be stopped at any time, providing
sub-optimal feature subsets (a sketch follows below).
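
A minimal sketch of the idea, assuming scikit-learn; the features that appear in the tree's splits form the selected subset, and stopping at a smaller depth gives a cheaper, sub-optimal subset. The dataset and the depths tried are illustrative assumptions:

# Sketch: a decision tree as an "anytime" feature selector.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for depth in (2, 4, 8):                               # stop at any depth
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    used = np.flatnonzero(tree.feature_importances_)  # features used in splits
    print(f"max_depth={depth}: {len(used)} features selected -> {used}")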

Slide 28
Slide 29
