
FEATURE SELECTION, EXTRACTION

WHAT IS FEATURE EXTRACTION/SELECTION?
 Extraction: Getting useful features from existing data.
 Selection: Choosing a subset of the original pool of features.
WHY MUST WE APPLY FEATURE
EXTRACTION/SELECTION?
 Feature extraction is the process of translating raw data into the inputs that a
particular Machine Learning algorithm requires.
 While some inherent features can be obtained directly from raw data, we
usually need to derive new features from these inherent ones that are actually
relevant to the underlying problem.
 A simple model fed with meaningful features will often perform better than an
elaborate algorithm fed with low-quality features – “garbage in, garbage out”.
WHY MUST WE APPLY FEATURE
EXTRACTION/SELECTION?
 Feature extraction fills this requirement: it builds valuable information from
raw data – the features – by reformatting, combining, and transforming primary
features into new ones, until it yields a new set of data that can be
consumed by Machine Learning models to achieve their goals.
 Feature selection, for its part, is a simpler task: given a set of candidate
features, select some of them and discard the rest. Feature selection is applied
either to remove redundant and/or irrelevant features, or simply to limit the
number of features and so prevent overfitting.
WHY IS FEATURE SELECTION REQUIRED?
1. To remove irrelevant data.
2. To increase the predictive accuracy of learned models.
3. To reduce the cost of processing the data.
4. To reduce storage requirements and computational cost.
5. To reduce the complexity of the resulting model description, improving the
understanding of the data and the model.
6. To reduce dimensionality and remove noise.
HOW TO APPLY FEATURE
EXTRACTION/SELECTION?
FILTER METHODS
 Filter methods select features from a dataset independently of any machine
learning algorithm. These methods rely only on the characteristics of the
variables, so features are filtered out of the data before learning begins.
 These methods are simple yet powerful and help to quickly remove features;
they are generally the first step in any feature selection pipeline.
FILTER METHODS: ADVANTAGES
 Selected features can be used in any machine learning algorithm.
 They’re computationally inexpensive: thousands of features can be processed
in a matter of seconds.
 Filter methods are very good for eliminating irrelevant, redundant, constant,
duplicated, and correlated features.
FILTER METHODS: TYPES
 Filter methods can be Univariate or Multivariate.
 Univariate filter methods evaluate and rank each feature individually,
independently of the rest of the feature space, according to certain criteria.
 The highest-ranking features according to those criteria are then selected.
 Univariate methods may select redundant variables, since they do not take the
relationships between features into consideration.
MULTIVARIATE FILTER METHODS
 Multivariate filter methods evaluate the entire feature space: they take
features into account in relation to the other features in the dataset.
 These methods are able to handle duplicated, redundant, and correlated
features.
BASIC FILTER METHODS
1. Constant Features, which show a single value in all the observations in the
dataset. These features provide no information that allows ML models to
predict the target.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# a VarianceThreshold with threshold=0 flags features with zero variance.
vs_constant = VarianceThreshold(threshold=0)

# select the numerical columns only.
numerical_x_train = x_train[x_train.select_dtypes([np.number]).columns]

# fit the object to our data.
vs_constant.fit(numerical_x_train)

# get the constant column names.
constant_columns = [column for column in numerical_x_train.columns
                    if column not in numerical_x_train.columns[vs_constant.get_support()]]

# detect constant categorical variables.
constant_cat_columns = [column for column in x_train.columns
                        if (x_train[column].dtype == "O" and len(x_train[column].unique()) == 1)]

# concatenate the two lists.
all_constant_columns = constant_cat_columns + constant_columns

# drop the constant columns.
x_train.drop(labels=all_constant_columns, axis=1, inplace=True)
x_test.drop(labels=all_constant_columns, axis=1, inplace=True)

2. Quasi-Constant Features in which a value occupies the majority of the records.

# features where a single value occupies at least 98% of the rows.
threshold = 0.98

# create an empty list to hold the quasi-constant feature names.
quasi_constant_feature = []

# loop over all the columns.
for feature in x_train.columns:

    # calculate the ratio of the most frequent value.
    predominant = (x_train[feature].value_counts() / float(len(x_train))).sort_values(ascending=False).values[0]

    # append the column name if the ratio reaches the threshold.
    if predominant >= threshold:
        quasi_constant_feature.append(feature)

print(quasi_constant_feature)

# drop the quasi-constant columns.
x_train.drop(labels=quasi_constant_feature, axis=1, inplace=True)
x_test.drop(labels=quasi_constant_feature, axis=1, inplace=True)

3. Duplicated Features, which are self-explanatory: the same feature appears more than once.

# transpose the feature matrix so duplicated columns become duplicated rows.
train_features_T = x_train.T

# print the number of duplicated features.
print(train_features_T.duplicated().sum())

# select the duplicated features' column names.
duplicated_columns = train_features_T[train_features_T.duplicated()].index.values

# drop those columns.
x_train.drop(labels=duplicated_columns, axis=1, inplace=True)
x_test.drop(labels=duplicated_columns, axis=1, inplace=True)


CORRELATION FILTER METHODS
 If two variables are highly correlated with each other, they provide
redundant information with regard to the target. Essentially, we can make an
accurate prediction of the target with just one of the correlated variables.
 In these cases, the second variable doesn’t add information, so
removing it helps to reduce both dimensionality and noise.
PEARSON CORRELATION COEFFICIENT
 It’s used to summarize the strength of the linear relationship between two data
variables. The coefficient can vary between 1 and -1:
 1 means a positive correlation: the values of one variable increase as the values of
another increase.
 -1 means a negative correlation: the values of one variable decrease as the values of
another increase.
 0 means no linear correlation between the two variables.

The assumptions that the Pearson correlation coefficient makes:
 Both variables should be normally distributed.
 There is a straight-line relationship between the two variables.
 Data is equally distributed around the regression line.

The following formula is used to calculate the value of the Pearson correlation
coefficient:
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √( Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)² )
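As a quick sanity check, the coefficient and its p-value can be computed for a single pair of variables; this is a sketch assuming SciPy is available, and 'feat_a' and 'feat_b' are placeholder column names:

from scipy.stats import pearsonr

# 'feat_a' and 'feat_b' are hypothetical column names used only for illustration.
r, p_value = pearsonr(x_train["feat_a"], x_train["feat_b"])
print("Pearson r = %.3f, p-value = %.3f" % (r, p_value))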
import matplotlib.pyplot as plt
import seaborn as sns

# set to hold the names of the correlated features to drop.
corr_features = set()

# create the correlation matrix (defaults to pearson).
corr_matrix = x_train.corr()

# optional: display a heatmap of the correlation matrix.
plt.figure(figsize=(11, 11))
sns.heatmap(corr_matrix)

# flag the second feature of every pair whose absolute correlation exceeds 0.8.
for i in range(len(corr_matrix.columns)):
    for j in range(i):
        if abs(corr_matrix.iloc[i, j]) > 0.8:
            colname = corr_matrix.columns[i]
            corr_features.add(colname)

x_train.drop(labels=list(corr_features), axis=1, inplace=True)
x_test.drop(labels=list(corr_features), axis=1, inplace=True)

STATISTICAL & RANKING FILTER METHODS
 These methods are statistical tests that evaluate each feature individually.
 In light of the target, they evaluate whether the variable is
important for discriminating between the target classes.
 Essentially, these methods rank the features based on certain criteria or
metrics and then select the features with the highest ranking.
1. MUTUAL INFORMATION
 It determines how much we can know about one variable by observing
another; it’s a little like correlation, but mutual information is more general.
 In machine learning, mutual information measures how much information the
presence/absence of a feature contributes to making the correct prediction of the target Y.
 If X and Y are independent, their MI is zero.

 The mutual information between two random variables X and Y can be stated
formally as follows:
MI(X ; Y) = H(X) – H(X | Y)
where MI(X ; Y) is the mutual information for X and Y, H(X) is the entropy
of X, and H(X | Y) is the conditional entropy of X given Y.
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import SelectKBest

# select the number of features you want to retain.
select_k = 10

# get only the numerical features.
numerical_x_train = x_train[x_train.select_dtypes([np.number]).columns]

# create the SelectKBest with the mutual info strategy and fit it.
selection = SelectKBest(mutual_info_classif, k=select_k).fit(numerical_x_train, y_train)

# display the retained features.
features = numerical_x_train.columns[selection.get_support()]
print(features)
CHI-SQUARED SCORE
 This is another statistical method that’s commonly used for testing
relationships between categorical variables.
 It’s suited for categorical features and a categorical target, and the features
should be non-negative: typically booleans, frequencies, or counts.
 It compares the observed distribution of each feature against the distribution
expected if the feature were independent of the target variable.
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

# select how many of the top-scoring features to keep.
select_k = 10

# apply the chi2 score on the data and target (features must be non-negative).
selection = SelectKBest(chi2, k=select_k).fit(x_train, y_train)

# display the k selected features.
features = x_train.columns[selection.get_support()]
print(features)
ANOVA UNIVARIATE TEST
 A univariate test, or more specifically ANOVA (short for ANalysis Of
VAriance), is similar to the previous scores, as it measures the dependence
between two variables.
 ANOVA assumes a linear relationship between the variables and the target,
and also that the variables are normally distributed.
 It is well-suited for continuous variables and requires a categorical target.
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import SelectKBest

# select the number of features you want to retain.
select_k = 10

# create the SelectKBest with the anova strategy and fit it.
selection = SelectKBest(f_classif, k=select_k).fit(x_train, y_train)

# display the retained features.
features = x_train.columns[selection.get_support()]
print(features)
WRAPPER METHODS
 The feature selection process is based on a specific machine learning
algorithm that we are trying to fit on a given dataset.
 Wrapper methods follow a greedy search approach, evaluating candidate
combinations of features against an evaluation criterion.
 For regression, the evaluation criterion can be p-values, R-squared, or
Adjusted R-squared.
 For classification, the evaluation criterion can be accuracy, precision, recall,
f1-score, etc.
 Finally, the method selects the combination of features that gives the optimal
results for the specified machine learning algorithm.
WRAPPER METHODS
The most commonly used techniques under wrapper methods are:
 Forward selection
 Backward elimination
 Bi-directional elimination (Stepwise Selection)


FORWARD SELECTION
 In forward selection, we start with a null model and then start fitting the
model with each individual feature one at a time and select the feature with
the minimum p-value.
 Now fit a model with two features by trying combinations of the earlier
selected feature with all other remaining features. Again select the feature
with the minimum p-value.
 Now fit a model with three features by trying combinations of two previously
selected features with other remaining features.
 Repeat this process until the best remaining feature no longer has a p-value
below the significance level; every selected feature is then individually significant.
The steps for the forward selection technique are as follows :
1. Choose a significance level (e.g. SL = 0.05 with a 95% confidence).
2. Fit all possible simple regression models by considering one feature at a
time. A total of ‘n’ models are possible. Select the feature with the lowest p-value.
3. Fit all possible models with one extra feature added to the previously
selected feature(s).
4. Again, select the feature with the minimum p-value. If p_value < significance
level then go to Step 3, otherwise terminate the process.
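A minimal sketch of these steps, assuming a pandas DataFrame x_train with only numeric features, a target y_train, and the statsmodels package (the OLS model is an illustrative choice for a regression problem):

import statsmodels.api as sm

SL = 0.05                      # significance level to enter the model
selected = []                  # features chosen so far
remaining = list(x_train.columns)

while remaining:
    # p-value of each candidate feature when added to the current model.
    pvals = {}
    for feature in remaining:
        X = sm.add_constant(x_train[selected + [feature]])
        model = sm.OLS(y_train, X).fit()
        pvals[feature] = model.pvalues[feature]

    # pick the candidate with the lowest p-value.
    best = min(pvals, key=pvals.get)
    if pvals[best] < SL:
        selected.append(best)
        remaining.remove(best)
    else:
        break                  # no remaining feature is significant

print(selected)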
BACKWARD ELIMINATION
 In backward elimination, we start with the full model (including all the
independent variables) and then remove the insignificant feature with the
highest p-value(> significance level).
 This process repeats until only the final set of significant features remains.
In short, the steps involved in backward elimination are as follows:
1. Choose a significance level (e.g. SL = 0.05 with a 95% confidence).
2. Fit a full model including all the features.
3. Consider the feature with the highest p-value. If the p-value > significance
level then go to Step 4, otherwise terminate the process.
4. Remove the feature which is under consideration.
5. Fit a model without this feature. Repeat the entire process from Step 3.
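A minimal sketch of backward elimination under the same assumptions as the forward-selection sketch above (numeric x_train, y_train, and statsmodels OLS as an illustrative model):

import statsmodels.api as sm

SL = 0.05                          # significance level to stay in the model
selected = list(x_train.columns)   # start from the full model

while selected:
    X = sm.add_constant(x_train[selected])
    model = sm.OLS(y_train, X).fit()

    # p-values of the current features (the intercept is excluded).
    pvals = model.pvalues.drop("const")
    worst = pvals.idxmax()

    if pvals[worst] > SL:
        selected.remove(worst)     # drop the least significant feature and refit
    else:
        break                      # all remaining features are significant

print(selected)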
BI-DIRECTIONAL ELIMINATION(STEP-WISE
SELECTION)
 It is similar to forward selection, but while adding a new feature it also
checks the significance of the features already added; if any previously
selected feature becomes insignificant, it is removed through backward
elimination.
 Hence, it is a combination of forward selection and backward elimination.

The steps involved in bi-directional elimination are as follows:
1. Choose a significance level to enter and exit the model (e.g. SL_in = 0.05 and
SL_out = 0.05 with 95% confidence).
2. Perform the next step of forward selection (a newly added feature must have
p-value < SL_in to enter).
3. Perform all steps of backward elimination (any previously added feature with
p-value > SL_out exits the model).
4. Repeat steps 2 and 3 until we get a final optimal set of features.
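Libraries usually implement stepwise selection with a cross-validated model score rather than p-values. A minimal sketch using mlxtend's SequentialFeatureSelector (assuming mlxtend is installed; the LinearRegression estimator and r2 scoring are illustrative choices), where forward=True together with floating=True gives the bi-directional behaviour:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression

# forward selection with a "floating" step that can drop
# previously added features, i.e. bi-directional elimination.
sfs = SFS(LinearRegression(),
          k_features="best",       # let the selector pick the best subset size
          forward=True,
          floating=True,
          scoring="r2",            # evaluation criterion instead of p-values
          cv=5)

sfs = sfs.fit(x_train, y_train)
print(sfs.k_feature_names_)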
EMBEDDED METHODS
 Embedded methods perform feature selection during the model training,
which is why we call them embedded methods.
 A learning algorithm takes advantage of its own variable selection process
and performs feature selection and classification/regression at the same time.
EMBEDDED METHODS: ADVANTAGES
 Embedded methods address the drawbacks of the filter and wrapper
methods by combining their advantages.
Here’s how:
 They take into consideration the interaction of features, like wrapper methods do.
 They are faster, like filter methods.
 They are more accurate than filter methods.
 They find the feature subset for the specific algorithm being trained.
 They are much less prone to overfitting.


EMBEDDED METHODS: PROCESS
All embedded methods work as follows:
 First, these methods train a machine learning model.
 They then derive feature importances from this model: a measure of how
much each feature contributes to the model’s predictions.
 Finally, they remove non-important features using the derived feature
importances, as shown in the sketch below.
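A minimal sketch of this process with scikit-learn's SelectFromModel (the random forest estimator here is an illustrative choice, not prescribed by the slides):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# train a model and keep only features whose importance is above
# the mean importance (SelectFromModel's default threshold for trees).
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
selector.fit(x_train, y_train)

# display the retained features and transform the data.
features = x_train.columns[selector.get_support()]
print(features)

x_train_selected = selector.transform(x_train)
x_test_selected = selector.transform(x_test)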
REGULARIZATION
 Regularization in machine learning adds a penalty to the parameters of a
model to reduce its freedom. In a linear model, this penalty is applied to the
coefficients that multiply each of the features, and it is done to avoid
overfitting, make the model robust to noise, and improve its generalization.
There are three main types of regularization for linear models:
 lasso regression or L1 regularization
 ridge regression or L2 regularization
 elastic nets or L1/L2 regularization


REGULARIZATION
 L1 regularization shrinks some of the coefficients to exactly zero, meaning
that certain features will be multiplied by zero when estimating the target.
Those features therefore contribute nothing to the final prediction and can be
removed.
 L2 regularization, on the other hand, doesn’t set coefficients to zero; it only
shrinks them towards zero. That’s why we use only L1 for feature selection.
 L1/L2 regularization is a combination of L1 and L2. It incorporates both
penalties, and therefore we can still end up with features whose coefficient is
zero, similar to L1.
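A minimal sketch of L1-based feature selection with scikit-learn, assuming a classification problem (for regression, Lasso would play the same role as the L1-penalised logistic regression used here):

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# L1-penalised logistic regression; features whose coefficient
# is shrunk to zero are discarded by SelectFromModel.
l1_selector = SelectFromModel(
    LogisticRegression(penalty="l1", C=1.0, solver="liblinear"))
l1_selector.fit(x_train, y_train)

# display the retained features.
features = x_train.columns[l1_selector.get_support()]
print(features)

x_train_selected = l1_selector.transform(x_train)
x_test_selected = l1_selector.transform(x_test)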
