Business Intelligence DM4 Feature Selection
• Based on Chapter 8, 8.8 & 8.9 in Witten et al. (2017, 4th edition)
• Finding the smallest attribute (feature) set that is crucial for predicting the class is an important issue, certainly when
there are many attributes. Although all attributes are considered when building the model, not all are equally
important (i.e. have discriminatory capacity). Some might even distort the quality of the model.
• So, feature or attribute selection is used to single out a subset of attributes that really matter and add
value when creating the classifier.
• A typical example is an insurance company that has a huge data warehouse with historical data about its
clients, with many attributes. The company wants to predict the risk of a new client (from the company's perspective) and
gain insight into which attributes really determine that risk. E.g. age, gender, and marital status might be the key
factors for predicting a client's risk of future car accidents, rather than the college degree obtained or financial
status.
• Feature selection is different from dimensionality reduction (e.g. Factor Analysis). Both methods seek to reduce
the number of attributes in the dataset, but a dimensionality reduction method does so by creating new
combinations of attributes, whereas feature selection methods include and exclude attributes present in the data
without changing them.
• Easier to interpret
• Structural knowledge
• Knowing which attributes are important may be inherently important to the application and interpretation of the classifier
• Note that relevant attributes can also be harmful if they mislead the learning algorithm
• Also, adding a random (i.e., irrelevant) attribute can significantly degrade J48's performance
• A feature selection approach combines an Attribute Evaluator and a Search Method. Each
component, evaluation and search, offers multiple techniques from which to choose.
• The attribute evaluator is the technique by which each attribute in a dataset is evaluated in the
context of the output variable (e.g. the class). The search method is the technique by which to
navigate different combinations of attributes in the dataset in order to arrive at a short list of
chosen features.
• As an example, calculating a correlation score for each attribute (with the class variable) is
done by the attribute evaluator. Assigning a rank to each attribute and listing the attributes in
ranked order is done by the search method, enabling the selection of features.
• For some attribute evaluators, only certain search methods are compatible
• E.g. the CorrelationAttributeEval attribute evaluator in Weka can only be used with a Ranker search method,
which lists the attributes in ranked order.
• Using correlations for selecting the most relevant attributes in a dataset is quite popular. The idea is
simple: calculate the correlation between each attribute and the output variable and select only those
attributes that have a moderate-to-high positive or negative correlation (close to -1 or 1) and drop those
attributes with a low correlation (value close to zero). E.g. the CorrelationAttributeEval.
• As mentioned before: the CorrelationAttributeEval technique requires the use of a Ranker search
method.
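• To make the evaluator + search combination concrete, here is a minimal sketch using the Weka Java API (assumptions: Weka 3.7 or later is on the classpath, the ARFF path is a placeholder, and the class attribute is the last one in the file):

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CorrelationAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CorrelationRankingDemo {
    public static void main(String[] args) throws Exception {
        // Load a dataset (any ARFF file; the path is a placeholder)
        Instances data = DataSource.read("glass.arff");
        // Assume the class attribute is the last one
        data.setClassIndex(data.numAttributes() - 1);

        // Attribute evaluator: scores each attribute by its correlation with the class
        CorrelationAttributeEval evaluator = new CorrelationAttributeEval();
        // Search method: Ranker simply lists the attributes in ranked order
        Ranker search = new Ranker();

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(evaluator);
        selector.setSearch(search);
        selector.SelectAttributes(data);   // run evaluator + search on the full data

        System.out.println(selector.toResultsString());   // ranked list of attributes
    }
}
```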
• Another popular feature selection technique is to calculate the information gain. You can calculate the
information gain (based on entropy) of each attribute with respect to the output variable. Values range from 0 (no
information) to 1 (maximum information). Attributes that contribute more information have a
higher information gain value and can be selected, whereas those that do not add much information
have a lower score and can be removed.
• Try the InfoGainAttributeEval Attribute Evaluator in Weka. Like the correlation technique above, the
Ranker Search Method must be used with this evaluator.
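• A similar sketch with InfoGainAttributeEval, additionally using the Ranker's numToSelect option to keep only the top-scoring attributes (the value 5 and the ARFF path are just example choices):

```java
import java.util.Arrays;

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InfoGainRankingDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");   // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluator: information gain of each attribute with respect to the class
        InfoGainAttributeEval evaluator = new InfoGainAttributeEval();

        // Ranker lists attributes by score; optionally keep only the top k
        Ranker search = new Ranker();
        search.setNumToSelect(5);   // keep the 5 highest-scoring attributes (example value)

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(evaluator);
        selector.setSearch(search);
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString());
        System.out.println("Selected attribute indices: "
                + Arrays.toString(selector.selectedAttributes()));
    }
}
```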
• Open glass.arff, go to “select attributes”, use the CorrelationAttributeEval evaluator (correlation of each
attribute with the class variable). “Ranker” is now mandatory as a search method. Use the full training set
(we are not evaluating a classifier).
• Open glass.arff, use the InfoGainAttributeEval evaluator (information gain of each attribute). “Ranker” is
again mandatory as a search method.
• A different and popular feature selection technique is to use a generic but powerful learning algorithm (e.g. a
classifier) and evaluate the performance of that algorithm on the dataset with different subsets of attributes. (So
there is no a priori evaluation of the individual attributes; this is not a filter approach.)
• The subset that results in the best performance is taken as the selected subset. The algorithm used to evaluate
the subsets does not have to be the algorithm that is intended to be used for the classifier, but it should be
generally quick to train and powerful, like a decision tree method.
• So if the target algorithm is Naïve Bayes, a different algorithm could be chosen to select a subset of
attributes.
• This is a scheme-dependent approach because the target scheme, the actual classifier you want to develop, is in
the loop.
• In Weka this type of feature selection is supported by the WrapperSubsetEval technique, which must be used with a
GreedyStepwise or BestFirst search method (see below). The latter, BestFirst, is preferred but requires more
compute time.
• "Wrap around" the learning algorithm
• Must therefore always evaluate subsets
• Return the best subset of attributes
• Apply for each learning algorithm considered
[Figure: the wrapper loop — select a subset of attributes → induce the learning algorithm on this subset → evaluate the resulting model (e.g., accuracy) → stop? If no, select a new subset; if yes, return the best subset of attributes.]
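• A minimal sketch of the wrapper approach in the Weka Java API, with J48 as the classifier in the loop (the fold count and the ARFF path are example values):

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WrapperSelectionDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");   // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Wrapper evaluator: scores attribute subsets by the cross-validated
        // accuracy of a classifier trained on that subset
        WrapperSubsetEval evaluator = new WrapperSubsetEval();
        evaluator.setClassifier(new J48());   // the scheme "in the loop"
        evaluator.setFolds(5);                // internal cross-validation folds (example value)

        // BestFirst explores the attribute space more thoroughly than GreedyStepwise
        BestFirst search = new BestFirst();

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(evaluator);
        selector.setSearch(search);
        selector.SelectAttributes(data);

        System.out.println(selector.toResultsString());   // best subset and its "merit"
    }
}
```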
• The available attributes in a dataset are often referred to as the ‘attribute space’ (in which to find a subset that
best predicts the class)
• When searching for an attribute subset, the number of possible subsets is exponential in the number of attributes
• Common greedy approaches:
• forward selection
• backward elimination
• Recursive feature elimination
A greedy algorithm is any algorithm that simply picks the best choice it sees at the time and takes it.
• Forward Selection: Forward selection is an iterative method that starts with no features in the model. In
each iteration, the method adds the feature that best improves the model, until adding a new
attribute no longer improves the performance of the model (such as a classifier). Because the accuracy of
the classifier is evaluated with all the features in a set, this method will pick out features that work well together
for classification. Features are not assumed to be independent, so advantages may be gained from looking
at their combined effect. (A minimal sketch of this greedy loop follows this list.)
• Backward Elimination: In backward elimination, the model starts with all the features and removes, at each
iteration, the least significant feature, i.e. the feature whose removal most improves the performance of the model.
This is repeated until no further improvement is observed from removing features.
• Recursive Feature Elimination: This is a greedy optimization algorithm that aims to find the best-performing
feature subset. It repeatedly creates models and sets aside the best- or worst-performing feature at each
iteration. It then constructs the next model with the remaining features until all features are exhausted, and
finally ranks the features based on the order of their elimination.
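• The following classifier-agnostic sketch (in Java) illustrates the greedy forward-selection loop described above; the class name and the toy scoring function are purely illustrative, and in practice the score would be, e.g., the cross-validated accuracy of a classifier on the candidate subset:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class ForwardSelection {

    // Greedy forward selection over feature indices 0..numFeatures-1,
    // guided by an arbitrary subset-quality measure.
    public static List<Integer> select(int numFeatures,
                                       ToDoubleFunction<List<Integer>> score) {
        List<Integer> selected = new ArrayList<>();
        double bestScore = Double.NEGATIVE_INFINITY;

        while (selected.size() < numFeatures) {
            int bestFeature = -1;
            double bestCandidateScore = bestScore;

            // Try adding each remaining feature and keep the one that helps most
            for (int f = 0; f < numFeatures; f++) {
                if (selected.contains(f)) continue;
                List<Integer> candidate = new ArrayList<>(selected);
                candidate.add(f);
                double s = score.applyAsDouble(candidate);
                if (s > bestCandidateScore) {
                    bestCandidateScore = s;
                    bestFeature = f;
                }
            }

            // Stop as soon as no single addition improves the score
            if (bestFeature == -1) break;
            selected.add(bestFeature);
            bestScore = bestCandidateScore;
        }
        return selected;
    }

    public static void main(String[] args) {
        // Toy scoring function: pretend features 1 and 3 are the useful ones
        ToDoubleFunction<List<Integer>> toyScore = subset -> {
            double s = 0;
            if (subset.contains(1)) s += 0.4;
            if (subset.contains(3)) s += 0.3;
            return s - 0.01 * subset.size();   // small penalty per extra feature
        };
        System.out.println(select(5, toyScore));   // prints [1, 3]
    }
}
```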
• The attribute space can be traversed further by increasing the searchTermination parameter in Weka (if
> 1, the attribute space is explored more thoroughly, increasing the likelihood of finding a more global, less
local optimum)
• Example:
• Open glass.arff; choose the attribute evaluator WrapperSubsetEval in 'Select attributes', select J48, "use full training set"
• This yields the attribute subset RI, Mg, Al, K, Ba with "merit" 0.73 (i.e. about 73% accuracy)
• But the "merit" and the best subset are the same for all three searchTermination settings tried; searching the attribute space further did not lead to a higher optimum.
• When cross-validation is used instead of the full training set: certainly select RI (appeared in 9 folds), Al (appeared in 10 folds!), Ba and Mg, and maybe also Na
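• The same experiment can be repeated programmatically; the sketch below reruns the wrapper selection for a few searchTermination values (1, 5 and 10 are arbitrary examples) via BestFirst's setSearchTermination option:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SearchTerminationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");
        data.setClassIndex(data.numAttributes() - 1);

        WrapperSubsetEval evaluator = new WrapperSubsetEval();
        evaluator.setClassifier(new J48());

        // searchTermination = number of consecutive non-improving subsets
        // tolerated before BestFirst stops searching
        for (int termination : new int[] {1, 5, 10}) {
            BestFirst search = new BestFirst();
            search.setSearchTermination(termination);

            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(evaluator);
            selector.setSearch(search);
            selector.SelectAttributes(data);

            System.out.println("searchTermination = " + termination);
            System.out.println(selector.toResultsString());
        }
    }
}
```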
• Typically, you do not know a priori which view, or subset of features, in your data will produce the most
accurate models.
• Therefore, it is a good idea to try a number of different feature selection techniques on your data, to
create several views on your data (several subsets of features)
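• For instance, the sketch below runs the three techniques discussed above (correlation, information gain, and a J48 wrapper) on the same dataset, producing three different views; the ARFF path is a placeholder:

```java
import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.ASSearch;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CorrelationAttributeEval;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MultipleViewsDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");   // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Wrapper evaluator with J48 as the classifier in the loop
        WrapperSubsetEval wrapper = new WrapperSubsetEval();
        wrapper.setClassifier(new J48());

        // Three evaluator/search combinations, each producing a different "view"
        ASEvaluation[] evaluators = {
            new CorrelationAttributeEval(),
            new InfoGainAttributeEval(),
            wrapper
        };
        ASSearch[] searches = {
            new Ranker(),
            new Ranker(),
            new BestFirst()
        };

        for (int i = 0; i < evaluators.length; i++) {
            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(evaluators[i]);
            selector.setSearch(searches[i]);
            selector.SelectAttributes(data);
            System.out.println("=== View " + (i + 1) + ": "
                    + evaluators[i].getClass().getSimpleName() + " ===");
            System.out.println(selector.toResultsString());
        }
    }
}
```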
[Figures: filter method vs. wrapper method]