11. Feature Selection, Extraction
Filter methods can be univariate or multivariate.

UNIVARIATE FILTER METHODS
Univariate filter methods evaluate and rank a single feature at a time
according to certain criteria. They treat each feature individually and
independently of the feature space, so they may select redundant variables,
since they do not take the relationships between features into consideration.
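A minimal sketch (toy data, hypothetical feature names) of this weakness: the second column is an exact copy of the first, yet a univariate score rates both equally high, so a top-2 selection keeps the redundant copy.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1, rng.normal(size=100)])  # column 1 duplicates column 0
y = (x1 > 0).astype(int)                             # target driven by column 0

# each feature is scored on its own, so both copies get the same high score
selector = SelectKBest(f_classif, k=2).fit(X, y)
kept = selector.get_support()  # both redundant copies survive
```

A multivariate method would notice the duplication and keep only one of the two columns.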
MULTIVARIATE FILTER METHODS
Multivariate filter methods evaluate the entire feature space: they take each
feature into account in relation to the others in the dataset.
These methods are able to handle duplicated, redundant, and correlated
features.
BASIC FILTER METHODS
1. Constant Features show a single value in all the observations in the
dataset. These features provide no information that allows an ML model to
predict the target.

from sklearn.feature_selection import VarianceThreshold

# a zero-variance threshold flags constant features
vs_constant = VarianceThreshold(threshold=0)
vs_constant.fit(x_train)

2. Quasi-Constant Features show the same value for the great majority of the
observations.

# flag features whose dominant value covers more than 98% of observations
threshold = 0.98
quasi_constant_feature = [
    col for col in x_train.columns
    if x_train[col].value_counts(normalize=True).iloc[0] > threshold
]
print(quasi_constant_feature)

3. Duplicated Features are exact copies of another feature.

# transposing makes duplicated columns detectable as duplicated rows
train_features_T = x_train.T
duplicated_columns = train_features_T[train_features_T.duplicated()].index.values
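The three basic filters can be combined into one runnable sketch. This uses a toy DataFrame with hypothetical column names; the quasi-constant cut-off is lowered from the 0.98 in the slides to 0.85 so the effect is visible on only ten rows.

```python
import pandas as pd

# toy frame: 'const' is constant, 'quasi' nearly so, 'dup' duplicates 'a'
x_train = pd.DataFrame({
    "a":     [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "const": [7] * 10,
    "quasi": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "dup":   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

# constant: a single unique value
constant = [c for c in x_train.columns if x_train[c].nunique() == 1]

# quasi-constant: dominant value above the chosen threshold
threshold = 0.85
quasi = [c for c in x_train.columns
         if x_train[c].value_counts(normalize=True).iloc[0] > threshold]

# duplicated: transpose, then pandas' duplicated() flags repeated rows
dups = x_train.T[x_train.T.duplicated()].index.values
```

Note that a constant feature is also quasi-constant by this definition, so in practice constants are usually removed first.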
CORRELATION
The following formula is used to calculate the value of the Pearson correlation
coefficient:

r = Σ (x_i − x̄)(y_i − ȳ) / sqrt( Σ (x_i − x̄)² · Σ (y_i − ȳ)² )
import matplotlib.pyplot as plt
import seaborn as sns

corr_features = set()
corr_matrix = x_train.corr()

plt.figure(figsize=(11,11))
sns.heatmap(corr_matrix)

# collect one member of each highly correlated pair for removal
for i in range(len(corr_matrix.columns)):
    for j in range(i):
        if abs(corr_matrix.iloc[i, j]) > 0.8:
            corr_features.add(corr_matrix.columns[i])
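A runnable sketch (toy data; a hypothetical 0.8 cut-off) of using the correlation matrix to drop one feature from each highly correlated pair:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=200)
x_train = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # almost a copy of 'a'
    "c": rng.normal(size=200),                      # independent noise
})

corr_matrix = x_train.corr()
corr_features = set()
for i in range(len(corr_matrix.columns)):
    for j in range(i):
        if abs(corr_matrix.iloc[i, j]) > 0.8:
            corr_features.add(corr_matrix.columns[i])
# 'b' is flagged as redundant with 'a'; 'c' is untouched
```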
CHI-SQUARED SCORE
This is another statistical method that is commonly used to test
relationships between categorical variables.
It is suited to categorical variables and categorical (e.g. binary) targets,
and the variable values should be non-negative, and typically boolean,
frequencies, or counts.
It compares the observed distribution of the various features in the dataset
against the target variable.
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest

# change this to how many features you want to keep from the top ones.
select_k = 10

# apply the chi2 score on the data and target (target must be categorical).
selection = SelectKBest(chi2, k=select_k).fit(x_train, y_train)

# display the retained features.
features = x_train.columns[selection.get_support()]
print(features)
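A toy sketch (synthetic count data, hypothetical column names) of chi-squared selection with SelectKBest: one column tracks the target, the other is independent noise, and keeping the single best feature recovers the informative one.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(2)
y_train = rng.integers(0, 2, size=200)              # binary target
x_train = pd.DataFrame({
    "informative": y_train * 3 + rng.integers(0, 2, size=200),  # tracks target
    "noise": rng.integers(0, 4, size=200),                      # independent counts
})

# chi2 requires non-negative values, which these counts satisfy
selection = SelectKBest(chi2, k=1).fit(x_train, y_train)
features = x_train.columns[selection.get_support()]
```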
ANOVA UNIVARIATE TEST
A univariate test, or more specifically ANOVA (short for ANalysis Of
VAriance), is similar to the previous scores in that it measures the
dependence between two variables.
ANOVA assumes a linear relationship between the variables and the target,
and also that the variables are normally distributed.
It is well suited to continuous variables and requires a categorical target.
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import SelectKBest

select_k = 10

# apply the ANOVA F-test and keep the top select_k features.
selection = SelectKBest(f_classif, k=select_k).fit(x_train, y_train)
features = x_train.columns[selection.get_support()]
print(features)
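A toy sketch (synthetic continuous data, hypothetical column names) of the ANOVA F-test via SelectKBest: the informative column has different class means, the noise column does not.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(3)
y_train = rng.integers(0, 2, size=200)              # binary target
x_train = pd.DataFrame({
    "informative": y_train + rng.normal(scale=0.3, size=200),  # class means differ
    "noise": rng.normal(size=200),                             # same mean per class
})

# f_classif scores each continuous feature against the categorical target
selection = SelectKBest(f_classif, k=1).fit(x_train, y_train)
features = x_train.columns[selection.get_support()]
```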
WRAPPER METHODS
The feature selection process is based on the specific machine learning
algorithm we are trying to fit to a given dataset.
It follows a greedy search approach, evaluating candidate combinations of
features against the evaluation criterion.
For regression, the evaluation criterion can be p-values, R-squared, or
adjusted R-squared; for classification, it can be accuracy, precision, recall,
f1-score, etc.
Finally, it selects the combination of features that gives the optimal results
for the specified machine learning algorithm.
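A minimal forward-selection sketch on toy data, using scikit-learn's SequentialFeatureSelector (available since scikit-learn 0.24); the target depends only on the first two columns, so a greedy search that adds one feature at a time should recover them.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only the first two columns matter

# greedily add features one at a time, keeping the subset that maximizes
# cross-validated accuracy of the chosen estimator
sfs = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=2, direction="forward"
).fit(X, y)
kept = sfs.get_support()
```

Setting direction="backward" would instead start from all features and greedily remove them, i.e. backward elimination.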
Most commonly used techniques under wrapper methods are:
Forward selection
Backward elimination
They search for the feature subset that works best for the algorithm being
trained.
They then derive feature importances from this model, a measure of how
important each feature is when making a prediction.
Finally, they remove non-important features using the derived feature
importances.
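The importance-based removal step can be sketched with scikit-learn's SelectFromModel (toy data; the random forest and threshold are illustrative choices, not prescribed by the slides):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)  # only column 0 drives the target

# fit a model, read off its feature importances, and drop features whose
# importance falls below the threshold (the mean importance by default)
sfm = SelectFromModel(RandomForestClassifier(random_state=0)).fit(X, y)
kept = sfm.get_support()
```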
REGULARIZATION
Regularization in machine learning adds a penalty to the different
parameters of a model to reduce its freedom. In a linear model, the penalty
is applied to the coefficients that multiply each of the features; this is
done to avoid overfitting, to make the model robust to noise, and to improve
its generalization.
There are three main types of regularization for linear models:
Lasso regression or L1 regularization
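A minimal sketch (toy data; the Lasso alpha is an arbitrary illustrative choice) of how the L1 penalty performs feature selection: it drives the coefficients of uninformative features to exactly zero, so the surviving nonzero coefficients identify the selected features.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 4))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only column 0 matters

# the L1 penalty zeroes the useless coefficients, so a fitted Lasso
# doubles as a feature selector via its nonzero coefficients
sfm = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
kept = sfm.get_support()
```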