Feature Selection

From a machine learning perspective, data engineering involves dataset
collection, dataset cleansing/transformation, feature selection, and
feature transformation. Here we focus on feature selection to show how
it benefits a machine learning process.
Feature Analysis
We are using the famous iris dataset in our example. It is well-formed,
clean, and already balanced.
import pandas as pd
from sklearn import datasets
# load data into a Bunch, a dict-derived class
iris = datasets.load_iris()
target = iris.target
# convert to a DataFrame for processing
iris = pd.DataFrame(iris.data, columns=iris.feature_names)
First, check that the data is balanced. It is in our case: each class
has the same 50 samples.
import matplotlib.pyplot as plt
import seaborn as sns
plt.hist(target)
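The balance check can also be done numerically rather than by reading a histogram. A minimal sketch, assuming the class labels are the integers 0, 1, 2 as returned by load_iris(); the variable names are illustrative and not from the original code:

```python
import numpy as np
from sklearn import datasets

iris_bunch = datasets.load_iris()
# count how many samples carry each integer class label (0, 1, 2)
class_counts = np.bincount(iris_bunch.target)
for name, count in zip(iris_bunch.target_names, class_counts):
    print(f"{name}: {count}")
```

Equal counts across classes mean no resampling or class weighting is needed before training.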
Next, check the min, max, and other basic statistics to make sure we
don’t have outliers.
iris.describe()
Now let’s normalize it and visualize the features’ correlations with the classes.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
# to normalize the dataset, we use the handy MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(iris)
iris_norm = scaler.transform(iris)
# visualize features and target
iris_norm = pd.DataFrame(iris_norm, columns=iris.columns)
iris_norm_ = pd.DataFrame(np.hstack((iris_norm, target[:, np.newaxis])),
                          columns=iris.columns.tolist() + ['class'])
sns.pairplot(iris_norm_, hue='class', diag_kind='hist')
As we can see, every pair of features separates the 3 classes clearly
except the sepal width/sepal length pair, where class 1 and class 2 are
tangled in the chart.
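As a side note on the normalization step above, min-max scaling maps each feature to the range [0, 1] with (x - min) / (max - min), computed per column. A minimal sketch that reproduces the scaler's output by hand, assuming the default feature_range of (0, 1); the variable names are illustrative:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler

X = datasets.load_iris().data
# min-max scaling by hand: (x - min) / (max - min), one min/max per column
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_manual = (X - col_min) / (col_max - col_min)
# the same transformation through scikit-learn
X_sklearn = MinMaxScaler().fit_transform(X)
print(np.allclose(X_manual, X_sklearn))
```

Because the scaling is applied per feature, it changes each feature's units but not the ordering of samples within a feature.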
Furthermore, let’s check the covariances among the features and the class:
# manually verify the covariance among features and classes
iris_cov = iris_norm_.cov()
sns.heatmap(iris_cov, annot = True, cbar = False)
Feature Selection
Ideally we want a feature that is a) more relevant to the class and
b) less relevant to other features. a) is the most important factor,
because a feature can’t contribute to an algorithm if it is totally
irrelevant. b) can make the process more efficient, but it is another
topic beyond this article.
From the covariance heatmap, we can see ‘sepal width’ is the least
relevant to the class. This can explain why class 1 and class 2 are
tangled in the pairplot chart from the previous section.
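The ranking read off the covariance heatmap can be cross-checked with Pearson correlation, which does not depend on each feature's scale. A minimal sketch, assuming the class labels 0/1/2 are treated as a numeric column in the same way as in the heatmap; the variable names are illustrative and this is a sanity check, not part of the original workflow:

```python
import pandas as pd
from sklearn import datasets

bunch = datasets.load_iris()
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
df['class'] = bunch.target
# absolute Pearson correlation of each feature with the class column,
# sorted so the weakest relationship comes first
class_corr = df.corr()['class'].drop('class').abs().sort_values()
print(class_corr)
least_relevant = class_corr.index[0]
print('least relevant:', least_relevant)
```

Treating an integer class label as a number is only a rough heuristic, since it imposes an ordering on the classes.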
Let’s use sklearn’s SelectKBest to select the best 3 features. Since
all 4 features are continuous and the target is categorical, we use the
ANOVA F-test (f_classif) to do this. Our goal is to remove the
‘sepal width’ feature.
from sklearn.feature_selection import SelectKBest, f_classif
bestfeatures = SelectKBest(score_func=f_classif, k=3)
iris_trim = bestfeatures.fit_transform(iris_norm, target)
print(bestfeatures.scores_)
print(bestfeatures.pvalues_)
print(iris_trim.shape)
As you can see, the second feature has the lowest score and the largest
p-value. The resulting dataset has shape 150 x 3; the second feature
(sepal width) was removed.
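Rather than matching scores to features by position, the fitted selector can report which columns it kept. A minimal sketch using get_support(), assuming the raw (unscaled) features for brevity; the F statistic is unchanged by min-max scaling because that is a per-feature affine transform. The variable names are illustrative:

```python
import pandas as pd
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_classif

bunch = datasets.load_iris()
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
selector = SelectKBest(score_func=f_classif, k=3).fit(X, bunch.target)
# get_support() returns a boolean mask over the input columns
mask = selector.get_support()
kept = X.columns[mask].tolist()
dropped = X.columns[~mask].tolist()
# label the F scores with feature names, lowest first
scores = pd.Series(selector.scores_, index=X.columns).sort_values()
print(scores)
print('kept:', kept)
print('dropped:', dropped)
```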
Let’s see the pairplot again:
We can now draw a 3D chart of the 3 features for a more intuitive
view.
from mpl_toolkits import mplot3d
fig = plt.figure(figsize=(8,8))
ax = plt.axes(projection='3d')
ax.scatter3D(iris_trim[:, 0], iris_trim[:, 1], iris_trim[:, 2], c = target,
cmap='Accent', marker = '>')
Validation
Now let’s compare the 4-feature case and the 3-feature case. Define a
training and validation function first, then prepare both datasets.
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def train_and_validate(X_train, X_test, y_train, y_test):
    model = GaussianNB()
    model.fit(X_train, y_train)
    y_calc = model.predict(X_test)
    y_prob = model.predict_proba(X_test)
    #print(y_prob)
    mat = confusion_matrix(y_test, y_calc)
    # transpose so rows are predicted classes and columns are true classes
    sns.heatmap(mat.T, annot=True, cbar=False)

X_train4, X_test4, y_train, y_test = train_test_split(
    iris_norm, target, test_size=0.10, stratify=None, random_state=0)
X_train3, X_test3 = (X_train4.drop(['sepal width (cm)'], axis=1),
                     X_test4.drop(['sepal width (cm)'], axis=1))
Run and compare
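The calls that produced the comparison are not shown here. A minimal self-contained sketch of the same comparison follows, assuming accuracy_score as the metric in place of the confusion-matrix heatmap; fit_and_score is a hypothetical helper, not the article's train_and_validate function:

```python
import pandas as pd
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler

bunch = datasets.load_iris()
X = pd.DataFrame(MinMaxScaler().fit_transform(bunch.data),
                 columns=bunch.feature_names)
# same split settings as in the text: 10% test set, random_state=0
X_train4, X_test4, y_train, y_test = train_test_split(
    X, bunch.target, test_size=0.10, random_state=0)
X_train3 = X_train4.drop(columns=['sepal width (cm)'])
X_test3 = X_test4.drop(columns=['sepal width (cm)'])

def fit_and_score(X_tr, X_te):
    """Hypothetical helper: fit Gaussian naive Bayes, return test accuracy."""
    model = GaussianNB().fit(X_tr, y_train)
    return accuracy_score(y_test, model.predict(X_te))

acc4 = fit_and_score(X_train4, X_test4)
acc3 = fit_and_score(X_train3, X_test3)
print(f'4 features: {acc4:.3f}, 3 features: {acc3:.3f}')
```

With a test set of only 15 samples, a single misclassification moves accuracy by almost 7 percentage points, so small differences should be read with care.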
As we can see, the reduced feature set gives a better result. In the
confusion matrices, the 3-feature dataset yields 100% accuracy, while
the 4-feature model misses one sample.
I changed random_state to generate different splits of the data and
repeated the process; the 3-feature dataset performed better than, or
at least as well as, the 4-feature dataset.
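Repeating the experiment over several splits can be automated with a loop. A minimal sketch, assuming ten seeds (0 through 9) as an arbitrary illustrative choice; the original does not say which random_state values were tried, and as in the article the scaler is fitted on the full dataset before splitting:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler

bunch = datasets.load_iris()
y = bunch.target
X4 = MinMaxScaler().fit_transform(bunch.data)
X3 = np.delete(X4, 1, axis=1)  # drop column 1, which is sepal width

results = []
for seed in range(10):  # assumed set of seeds, for illustration only
    # split row indices once so both feature sets use identical samples
    idx_train, idx_test = train_test_split(
        np.arange(len(y)), test_size=0.10, random_state=seed)
    acc4 = GaussianNB().fit(X4[idx_train], y[idx_train]).score(
        X4[idx_test], y[idx_test])
    acc3 = GaussianNB().fit(X3[idx_train], y[idx_train]).score(
        X3[idx_test], y[idx_test])
    results.append((seed, acc4, acc3))

for seed, acc4, acc3 in results:
    print(f'seed={seed}: 4 features={acc4:.3f}, 3 features={acc3:.3f}')
```

Splitting indices rather than the data itself keeps the train and test rows identical for the two feature sets, so any difference in accuracy comes from the dropped feature alone.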
Conclusion
A better-prepared dataset can benefit a machine learning process.

A properly selected feature set not only saves model training time and
storage space, but can also lead to more accurate results.
