
Business Intelligence – DM4 Feature Selection

• Based on Chapter 8, 8.8 & 8.9 in Witten et al. (2017, 4th edition)

Stephan Poelmans – Introduction to BI


Feature (Attribute) Selection

• Finding the smallest set of attributes (features) that is crucial for predicting the class is an important issue, certainly if
there are many attributes. Although all attributes are considered when building the model, not all are equally
important (i.e. have discriminatory capacity). Some might even distort the quality of the model.

• So, feature or attribute selection is used to set aside a subset of attributes that really matter and add value in
creating the classifier.

• A typical example is an insurance company that has a huge data warehouse with historical data about its
clients, with many attributes. The company wants to predict the risk of a new client (from its perspective) and
get insight into which attributes really determine that risk. For example, age, gender, and marital status might be
the key factors in predicting a client's risk of future car accidents, rather than the college degree obtained or the
client's financial status.

• Feature selection is different from dimensionality reduction (e.g. Factor Analysis). Both methods seek to reduce
the number of attributes in the dataset, but a dimensionality reduction method does so by creating new
combinations of attributes, whereas feature selection methods include and exclude attributes present in the data
without changing them.



Reasons for feature selection
• Achieving a simpler model
  • More transparent
  • Easier to interpret
  • Faster model induction (creation)

• Structural knowledge
  • Knowing which attributes are important may be inherently important to the application and to the interpretation of the classifier

• Attributes that mislead the learning algorithm are harmful; even relevant attributes can hurt if they mislead the learner
  • For example, adding a random (i.e., irrelevant) attribute can significantly degrade J48's performance (a small experiment is sketched below)

• Instance-based learning (IBk) is particularly susceptible to irrelevant attributes

  • The number of training instances required increases exponentially with the number of irrelevant attributes

• Exception: naïve Bayes copes well with irrelevant attributes
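A minimal sketch of this effect using the Weka Java API (the class name, file path and random seed are illustrative, and the size of the degradation will vary by dataset and seed): J48 is cross-validated on glass.arff before and after a purely random numeric attribute is added.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Add;

public class RandomAttributeEffect {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");   // path is illustrative
        data.setClassIndex(data.numAttributes() - 1);

        // Baseline: 10-fold cross-validation of J48 on the original attributes
        Evaluation base = new Evaluation(data);
        base.crossValidateModel(new J48(), data, 10, new Random(1));

        // Insert a numeric attribute at the first position and fill it with random values
        // (irrelevant by construction)
        Add add = new Add();
        add.setAttributeName("random_noise");
        add.setAttributeIndex("1");
        add.setInputFormat(data);
        Instances noisy = Filter.useFilter(data, add);
        Random rnd = new Random(1);
        for (int i = 0; i < noisy.numInstances(); i++) {
            noisy.instance(i).setValue(0, rnd.nextDouble());
        }
        noisy.setClassIndex(noisy.numAttributes() - 1);

        Evaluation withNoise = new Evaluation(noisy);
        withNoise.crossValidateModel(new J48(), noisy, 10, new Random(1));

        System.out.printf("Accuracy without the random attribute: %.2f%%%n", base.pctCorrect());
        System.out.printf("Accuracy with the random attribute:    %.2f%%%n", withNoise.pctCorrect());
    }
}
```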



Feature selection methods
• There are three general classes of feature selection algorithms: filter methods, wrapper methods and embedded methods.
1. Filter Methods: ’Scheme-independent’
• Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by the score and
can be kept or removed from the dataset. These methods are often univariate and consider each feature independently, or only with regard to
the dependent variable.
• Examples of filter methods include the Chi-squared test, information gain and correlation coefficient scores.
2. Wrapper Methods: ‘Scheme-dependent’
• Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and
compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model
accuracy.
• The search process may be methodical such as a best-first search, it may be stochastic such as a random hill-climbing algorithm, or it may
use heuristics, like forward and backward passes to add and remove features.
3. Embedded Methods
• Embedded methods learn which features best contribute to the accuracy of the model while the model is being created. The most common
type of embedded feature selection method is regularization.
• Regularization methods, also called penalization methods, introduce additional constraints into the optimization of a predictive
algorithm (such as a regression algorithm) that bias the model toward lower complexity (fewer coefficients).
• Examples of regularization algorithms are the LASSO, Elastic Net and Ridge Regression.
• Embedded methods are, however, beyond the scope of this course.
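Even though embedded methods are not covered further here, the following minimal sketch illustrates the regularization idea in Weka, using the ridge parameter of the Logistic classifier as a stand-in penalization mechanism (it shrinks coefficients rather than selecting features outright; the class name, dataset and parameter values are illustrative):

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RidgeRegularizationSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");  // illustrative dataset
        data.setClassIndex(data.numAttributes() - 1);

        // A larger ridge value penalizes large coefficients more strongly,
        // biasing the model toward lower complexity (the "embedded" idea).
        for (double ridge : new double[] {1e-8, 1.0, 100.0}) {
            Logistic lr = new Logistic();
            lr.setRidge(ridge);
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(lr, data, 10, new Random(1));
            System.out.printf("ridge=%.0e  accuracy=%.2f%%%n", ridge, eval.pctCorrect());
        }
    }
}
```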
Attribute Evaluator and Search Method

• A feature selection approach combines an Attribute Evaluator and a Search Method. Each
component, evaluation and search, offers multiple techniques to choose from.

• The attribute evaluator is the technique by which each attribute in a dataset is evaluated in the
context of the output variable (e.g. the class). The search method is the technique by which
different combinations of attributes in the dataset are explored in order to arrive at a short list of
chosen features.

• As an example, calculating a correlation score for each attribute (with the class variable) is the
job of the attribute evaluator. Assigning a rank to each attribute and listing the attributes in
ranked order is done by the search method, enabling the selection of features.

• For some attribute evaluators, only certain search methods are compatible
  • E.g. the CorrelationAttributeEval attribute evaluator in Weka can only be used with the Ranker search method,
  which lists the attributes in ranked order.
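A minimal sketch of how an evaluator and a search method are combined programmatically, assuming the Weka Java API and a local glass.arff (class name and path are illustrative); it mirrors the CorrelationAttributeEval + Ranker pairing mentioned above:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CorrelationAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluatorPlusSearch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");        // illustrative path
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CorrelationAttributeEval()); // scores each attribute vs. the class
        selector.setSearch(new Ranker());                      // orders attributes by that score
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString());        // ranked list of attributes
    }
}
```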



Filter methods: Attribute selection via Correlation, Information gain
(cf. entropy)

• Using correlations for selecting the most relevant attributes in a dataset is quite popular. The idea is
simple: calculate the correlation between each attribute and the output variable and select only those
attributes that have a moderate-to-high positive or negative correlation (close to -1 or 1) and drop those
attributes with a low correlation (value close to zero). E.g. the CorrelationAttributeEval.

• As mentioned before: the CorrelationAttributeEval technique requires the use of a Ranker search
method.

• Another popular feature selection technique is to calculate the information gain. You can calculate the
information gain (see entropy) of each attribute with respect to the output variable. Values range from 0 (no
information) to 1 (maximum information). Attributes that contribute more information have a higher information
gain value and can be selected, whereas those that do not add much information have a lower score and can be
removed.

• Try the InfoGainAttributeEval Attribute Evaluator in Weka. Like the correlation technique above, the
Ranker Search Method must be used with this evaluator.
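A corresponding sketch for information gain, again assuming the Weka Java API; the cutoff of five attributes passed to the Ranker is an arbitrary illustration:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InfoGainRanking {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Ranker ranker = new Ranker();
        ranker.setNumToSelect(5);            // keep the 5 highest-scoring attributes (arbitrary cutoff)

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        selector.setSearch(ranker);
        selector.SelectAttributes(data);

        // Indices of the selected attributes (the class index is appended at the end)
        for (int idx : selector.selectedAttributes()) {
            System.out.println(data.attribute(idx).name());
        }
    }
}
```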



Filter methods

• Filter methods result in either:

  • A ranked list of attributes
    • Typical when each attribute is evaluated individually
    • The user then has to make the attribute selection for the classifier ‘manually’

  • A selected subset of attributes (see the sketch after this list), resulting from:
    • Forward selection
    • A best-first algorithm
    • A random search such as a genetic algorithm
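As an example of a filter method that returns a subset rather than a ranked list, the following sketch pairs Weka's CfsSubsetEval (a scheme-independent subset evaluator, not discussed above) with a forward GreedyStepwise search; all names and settings are illustrative:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FilterSubsetSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");
        data.setClassIndex(data.numAttributes() - 1);

        GreedyStepwise search = new GreedyStepwise();
        search.setSearchBackwards(false);           // forward selection

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval()); // scheme-independent subset evaluator
        selector.setSearch(search);
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString()); // a selected subset, not a ranked list
    }
}
```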



Filter, an example: Correlation

• Open glass.arff, go to “select attributes”, use the CorrelationAttributeEval evaluator (correlation of each
attribute with the class variable). “Ranker” is now mandatory as a search method. Use the full training set
(we are not evaluating a classifier).

The attributes are now ranked according to their Pearson correlation with the class variable. You can make a
selection and run any classifier in the Classify tab. You might decide to stop selecting attributes where a bigger
delta in the scores occurs, e.g. after Na.
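A programmatic version of this workflow might look as follows (a sketch, assuming the Weka Java API; keeping the top four attributes is an arbitrary stand-in for stopping at a bigger delta). Note that selecting attributes on the full training set and then cross-validating the classifier can give slightly optimistic estimates.

```java
import java.util.Random;
import weka.attributeSelection.CorrelationAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class CorrelationThenClassify {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Keep the top correlated attributes; the cutoff of 4 is purely illustrative
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(4);

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new CorrelationAttributeEval());
        filter.setSearch(ranker);
        filter.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, filter);
        reduced.setClassIndex(reduced.numAttributes() - 1);

        // Run any classifier on the reduced data (J48 here)
        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(new J48(), reduced, 10, new Random(1));
        System.out.printf("J48 accuracy on %d selected attributes: %.2f%%%n",
                reduced.numAttributes() - 1, eval.pctCorrect());
    }
}
```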



Filter, an example: Information gain

• Open glass.arff, use the InfoGainAttributeEval evaluator (information gain of each attribute). “Ranker” is
again mandatory as a search method.



Wrapper methods: Learner-based Attribute Selection

• A different and popular feature selection technique is to use a generic but powerful learning algorithm (e.g. a
classifier) and to evaluate its performance on the dataset with different subsets of attributes. (So there is no real
a priori evaluation of the attributes, as in the filter approach.)

• The subset that results in the best performance is taken as the selected subset. The algorithm used to evaluate
the subsets does not have to be the algorithm that is intended to be used for the classifier, but it should be
generally quick to train and powerful, like a decision tree method.

• So if the target algorithm is Naïve Bayes, a different algorithm could be chosen to select a subset of
attributes.

• This is a scheme-dependent approach because the target scheme, the actual classifier you want to develop, is in
the loop.

• In Weka this type of feature selection is supported by the WrapperSubsetEval technique, which must be combined with a
GreedyStepwise or BestFirst search method (cf. further; a code sketch follows below). The latter, BestFirst, is preferred but requires
more compute time.
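A minimal sketch of this wrapper setup in the Weka Java API (the class name, the internal fold count and the tag value for the search direction are assumptions to verify against your Weka version):

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class WrapperSelection {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // The wrapper scores each candidate subset by internally cross-validating J48 on it
        WrapperSubsetEval wrapper = new WrapperSubsetEval();
        wrapper.setClassifier(new J48());
        wrapper.setFolds(5);                       // internal folds used to score each subset

        BestFirst search = new BestFirst();
        search.setDirection(new SelectedTag(0, BestFirst.TAGS_SELECTION)); // 0 = backward (assumed tag value)

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(wrapper);
        selector.setSearch(search);
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString()); // selected subset and its "merit"
    }
}
```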



Wrappers

• “Wrap around” the learning algorithm
• Must therefore always evaluate subsets
• Return the best subset of attributes
• Apply for each learning algorithm considered

The wrapper loop: select a subset of attributes → induce the learning algorithm on this subset → evaluate the
resulting model (e.g., accuracy) → if the stopping criterion is met, return the best subset; otherwise, select the
next subset.


Searching the attribute space

• The available attributes in a dataset are often referred to as the ‘attribute space’ (in which to find a subset that
best predicts the class)
• When searching for an attribute subset, the number of possible subsets is exponential in the number of attributes
• Common greedy approaches:
  • Forward selection
  • Backward elimination
  • Recursive feature elimination

A greedy algorithm is any algorithm that simply picks the best choice it sees at the time and takes it.

• More sophisticated strategies:


• Bidirectional search
• Best-first search: can find an optimum solution
• Beam search: an approximation to the best-first search
• Genetic algorithms



The attribute space: local vs global optimum

At point z, level a seems like an optimum. However, this is only a local optimum (maximum): at position z, b is
not visible. You have to traverse the entire terrain to find the global optimum, b.

In feature selection terms: at point z, a combination of features is found that leads to a higher merit a (e.g. the
accuracy rate), which is then a local optimum. By extending the search (e.g. via the parameter
‘searchTermination’, see below), a better optimum, or even the global optimum, can be found. Greedy search
approaches are less likely to find a global optimum.



Common Greedy Approaches

• Forward Selection: Forward selection is an iterative method that starts with no features in the model. In
each iteration, it adds the feature that best improves the model, until adding a new attribute no longer improves
the performance of the model (such as a classifier). Because the accuracy of the classifier is evaluated with all the
features in a set together, this method will pick out features that work well together for classification. Features are
not assumed to be independent, so advantages may be gained from looking at their combined effect.
• Backward Elimination: In backward elimination, the model starts with all the features and, at each iteration,
removes the least significant feature, i.e. the one whose removal improves the performance of the model the most.
This is repeated until no improvement is observed on removing features.
• Recursive Feature Elimination: a greedy optimization algorithm which aims to find the best-performing
feature subset. It repeatedly creates models and sets aside the best- or worst-performing feature at each
iteration. It then constructs the next model with the remaining features, until all the features are exhausted, and
finally ranks the features based on the order of their elimination.
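A sketch of the first two approaches using Weka's GreedyStepwise search wrapped around J48 (only these two are shown; class name and settings are illustrative):

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GreedyStepwise;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ForwardVsBackward {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (boolean backwards : new boolean[] {false, true}) {
            GreedyStepwise search = new GreedyStepwise();
            search.setSearchBackwards(backwards);   // false = forward selection, true = backward elimination

            WrapperSubsetEval wrapper = new WrapperSubsetEval();
            wrapper.setClassifier(new J48());

            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(wrapper);
            selector.setSearch(search);
            selector.SelectAttributes(data);

            System.out.println((backwards ? "Backward elimination" : "Forward selection") + ":");
            System.out.println(selector.toResultsString());
        }
    }
}
```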



Search Methods
[Figure: the different search methods illustrated on an example with 9 attributes]



Wrapper approach: In sum

• Uses a classifier to find a good attribute set (“scheme-dependent”)


• E.g. J48, IBk, Naïve Bayes, ...

• Wraps a classifier in a feature selection loop


• Involves both an Attribute Evaluator and a Search Method

• Searching can be greedy forward, backward, or bidirectional


• Computationally intensive

• Greedy searching finds a local optimum in the search space

• The attribute space can be traversed further by increasing the searchTermination parameter in Weka (if
> 1, the attribute space is explored further, increasing the likelihood of finding a more global, less
local, optimum)



Wrapper: an example in Weka

• Example:
• Open glass.arff; in ‘Select attributes’, choose the attribute evaluator WrapperSubsetEval, select J48 as its classifier, and choose “Use full training set”

• Choose ‘BestFirst’ as the search method and set its direction to ‘Backward’

• The resulting attribute subset is RI, Mg, Al, K, Ba, with “merit” 0.73 (= the accuracy, i.e. 73%)

• How far should the search go? A little experiment:


• Set searchTermination = 1: Total number of subsets evaluated 36 (see the output)

• Set searchTermination = 3: Total number of subsets evaluated 45 (see the output)

• Set searchTermination = 5: Total number of subsets evaluated 49 (see the output)

• But the “merit” of the best subset is the same in all three: searching the attribute space further did not lead to a better optimum (a scripted version of this experiment is sketched below).
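The experiment above can be scripted as follows (a sketch against the Weka Java API; the tag value for the backward direction is an assumption to verify, and the exact counts of subsets evaluated depend on the Weka version):

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.SelectedTag;
import weka.core.converters.ConverterUtils.DataSource;

public class SearchTerminationExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (int termination : new int[] {1, 3, 5}) {
            WrapperSubsetEval wrapper = new WrapperSubsetEval();
            wrapper.setClassifier(new J48());

            BestFirst search = new BestFirst();
            search.setDirection(new SelectedTag(0, BestFirst.TAGS_SELECTION)); // 0 = backward (assumed)
            search.setSearchTermination(termination);  // non-improving nodes to consider before stopping

            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(wrapper);
            selector.setSearch(search);
            selector.SelectAttributes(data);

            System.out.println("searchTermination = " + termination);
            System.out.println(selector.toResultsString()); // reports merit and subsets evaluated
        }
    }
}
```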

• Now use cross-validation (instead of “Use full training set”), with searchTermination = 1: see the next slide.


Wrapper, an example:

• So, using cross-validation (instead of “Use full training set”) with searchTermination = 1:

• Certainly select RI (appeared in 9 folds), Al (appeared in 10 folds!), Ba and Mg, and maybe also Na



What Feature Selection Techniques To Use

• Typically, you do not know a priori which view, or subset of features, in your data will produce the most
accurate models.

• Therefore, it is a good idea to try a number of different feature selection techniques on your data, to
create several views of your data (several subsets of features).

[Figures: an example filter method and an example wrapper method configured in Weka]
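A sketch of comparing such views programmatically, assuming the Weka Java API: the AttributeSelectedClassifier meta-classifier re-selects attributes inside each training fold, so a filter view (correlation ranking with an arbitrary top-4 cutoff) and a wrapper view (J48 + BestFirst) can be compared with ordinary cross-validation. Class names and cutoffs are illustrative.

```java
import java.util.Random;
import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.ASSearch;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CorrelationAttributeEval;
import weka.attributeSelection.Ranker;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareViews {

    // Cross-validates J48 on the view produced by the given evaluator/search pair
    static double accuracy(Instances data, ASEvaluation eval, ASSearch search) throws Exception {
        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(eval);          // selection happens inside each training fold
        asc.setSearch(search);
        asc.setClassifier(new J48());
        Evaluation ev = new Evaluation(data);
        ev.crossValidateModel(asc, data, 10, new Random(1));
        return ev.pctCorrect();
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // View 1: filter method (correlation ranking, arbitrary top-4 cutoff)
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(4);
        double filterAcc = accuracy(data, new CorrelationAttributeEval(), ranker);

        // View 2: wrapper method (J48-based subset search)
        WrapperSubsetEval wrapper = new WrapperSubsetEval();
        wrapper.setClassifier(new J48());
        double wrapperAcc = accuracy(data, wrapper, new BestFirst());

        System.out.printf("Filter view (correlation, top 4): %.2f%%%n", filterAcc);
        System.out.printf("Wrapper view (J48 + BestFirst):   %.2f%%%n", wrapperAcc);
    }
}
```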

