
From a machine learning perspective, data engineering involves dataset
collection, dataset cleansing and transformation, feature selection,
and feature transformation. Here we focus on feature selection to show
how it benefits a machine learning process.

Feature Analysis

We are using the famous iris dataset in our example. It is already
well-formed, clean, and balanced.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

# load data into a Bunch, a dict-derived class
iris = datasets.load_iris()
target = iris.target

# convert to a DataFrame for processing
iris = pd.DataFrame(iris.data, columns=iris.feature_names)

First, plot the class labels to make sure the data is balanced. It is
in our case: each class has the same 50 samples.
import seaborn as sns
plt.hist(target)
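
The histogram shows the balance visually. As a numeric cross-check, the class counts can also be printed directly. This is a minimal sketch, and the variable name class_counts is illustrative rather than part of the original code.

```python
import numpy as np
from sklearn import datasets

# Load the integer class labels (0, 1, 2) of the iris dataset.
target = datasets.load_iris().target

# np.bincount returns how many times each non-negative integer label occurs.
class_counts = np.bincount(target)
print(class_counts)  # [50 50 50]
```
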
Check the min, max, and other basic statistics to make sure we don’t
have outliers.
iris.describe()

Now let’s normalize it and visualize the features’ correlations with the classes.


# to normalize the dataset, we use the handy MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(iris)
iris_norm = scaler.transform(iris)
# visualize features and target
iris_norm = pd.DataFrame(iris_norm, columns=iris.columns)
iris_norm_ = pd.DataFrame(np.hstack((iris_norm, target[:, np.newaxis])),
                          columns=iris.columns.tolist() + ['class'])
sns.pairplot(iris_norm_, hue='class', diag_kind='hist')
As we can see, every pair other than sepal width/sepal length separates
the 3 classes very clearly. For that pair, class 1 and class 2 are
tangled in the chart.
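
For reference, the normalization step rescales each feature to the range [0, 1] by subtracting the column minimum and dividing by the column range. The sketch below reproduces that by hand and compares it with MinMaxScaler at its default settings. The names X_manual and X_sklearn are illustrative only.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler

X = datasets.load_iris().data

# Min-max scaling by hand: (x - column min) / (column max - column min).
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_manual = (X - col_min) / (col_max - col_min)

# The same transformation from scikit-learn, using the default feature_range=(0, 1).
X_sklearn = MinMaxScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))
```

Because the scaling is applied per column, it changes the units of each feature but not the ordering of samples within a feature.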

Furthermore, let’s check the covariances among the features and the class:
# check the covariance among features and classes
iris_cov = iris_norm_.cov()
sns.heatmap(iris_cov, annot=True, cbar=False)
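
Covariance depends on the scale of each column (here the features lie in [0, 1] while the class label runs from 0 to 2). As a complementary check, the sketch below computes Pearson correlations, which are scale-free. It treats the integer class label as a numeric variable, which is a simplification, and the names df and class_corr are illustrative rather than taken from the original.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['class'] = iris.target

# Pearson correlation is covariance divided by the product of the two
# standard deviations, so every entry lies in [-1, 1].
corr = df.corr()

# Rank features by the absolute strength of their linear relationship
# with the class label, weakest first.
class_corr = corr['class'].drop('class').abs().sort_values()
print(class_corr)
```

The first entry of class_corr is the feature with the weakest linear relationship to the class label.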
Feature Selection

Ideally we want a feature that is a) more relevant to the class and
b) less relevant to other features. a) is the most important factor,
because a feature can’t contribute to an algorithm if it is totally
irrelevant. b) can make the process more efficient, but that is another
topic beyond this article.

From the covariance heatmap, we can see ‘sepal width’ is the least
relevant to the class. This can explain why classes 1 and 2 are tangled
in the pairplot chart from the previous section.

Let’s use sklearn’s SelectKBest to select the best 3 features. Since
all 4 features are continuous, we use the F-test to do this. Our goal
is to remove the ‘sepal width’ feature.
from sklearn.feature_selection import SelectKBest, f_classif
bestfeatures = SelectKBest(score_func=f_classif, k=3)
iris_trim = bestfeatures.fit_transform(iris_norm, target)
print(bestfeatures.scores_)
print(bestfeatures.pvalues_)
print(iris_trim.shape)

As you can see, the second feature has the lowest score and the largest
p-value. The resulting dataset has shape 150 x 3: the second feature
(sepal width) was removed.
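
To make the F-test less of a black box, the sketch below recomputes the per-feature scores as a one-way ANOVA across the 3 classes using scipy.stats.f_oneway, and uses get_support() to list the kept features by name. It runs on the raw (unnormalized) data, which should not matter here because the F statistic is unaffected by rescaling a feature. The names manual_f and kept are illustrative.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_classif

iris = datasets.load_iris()
X, y = iris.data, iris.target

# One-way ANOVA per feature: compare that feature's values across the classes.
manual_f = np.array([
    f_oneway(*(X[y == c, j] for c in np.unique(y))).statistic
    for j in range(X.shape[1])
])

selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)

# get_support() returns a boolean mask over the input columns that were kept.
kept = [name for name, keep in zip(iris.feature_names, selector.get_support()) if keep]

print(np.allclose(manual_f, selector.scores_))
print(kept)
```
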

Let’s see the pairplot again:

We can now draw a 3D chart of the 3 features for a more intuitive
view.
from mpl_toolkits import mplot3d
fig = plt.figure(figsize=(8,8))
ax = plt.axes(projection='3d')
ax.scatter3D(iris_trim[:, 0], iris_trim[:, 1], iris_trim[:, 2], c = target,
cmap='Accent', marker = '>')

Validation

Now let’s compare the 4-feature case with the 3-feature case. Define a
training and validation function first, then prepare both datasets.
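from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def train_and_validate(X_train, X_test, y_train, y_test):
    # fit a Gaussian naive Bayes model and plot its confusion matrix
    model = GaussianNB()
    model.fit(X_train, y_train)
    y_calc = model.predict(X_test)
    y_prob = model.predict_proba(X_test)
    #print(y_prob)
    mat = confusion_matrix(y_test, y_calc)
    sns.heatmap(mat.T, annot=True, cbar=False)

# hold out 10% of the samples for validation
X_train4, X_test4, y_train, y_test = train_test_split(
    iris_norm, target, test_size=0.10, stratify=None, random_state=0)
X_train3, X_test3 = (X_train4.drop(['sepal width (cm)'], axis=1),
                     X_test4.drop(['sepal width (cm)'], axis=1))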

Run and compare
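
The calls that produce the two confusion matrices are not shown in the text. One way the comparison might look is sketched below. It is self-contained and returns the matrices instead of plotting them. The helper confusion_for and the variables X4, X3, errors4 and errors3 are hypothetical names introduced for illustration, and the split settings (10% test size, random_state=0, no stratification) are copied from the code above.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler

iris = datasets.load_iris()
y = iris.target
X4 = MinMaxScaler().fit_transform(iris.data)  # all 4 normalized features
X3 = X4[:, [0, 2, 3]]                         # drop column 1, 'sepal width (cm)'

def confusion_for(X, y, seed=0):
    """Hypothetical helper: fit GaussianNB on a 90/10 split, return the test confusion matrix."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.10, stratify=None, random_state=seed)
    model = GaussianNB().fit(X_train, y_train)
    return confusion_matrix(y_test, model.predict(X_test), labels=[0, 1, 2])

mat4 = confusion_for(X4, y)
mat3 = confusion_for(X3, y)

# Off-diagonal entries are misclassified test samples.
errors4 = int(mat4.sum() - np.trace(mat4))
errors3 = int(mat3.sum() - np.trace(mat3))
print("misclassified with 4 features:", errors4)
print("misclassified with 3 features:", errors3)
```

Using the same seed for both calls gives both models the same train/test rows, so the two matrices are directly comparable.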

As we can see, the reduced feature set has a better result. In the
confusion matrices, the 3-feature dataset yields 100% accuracy, while
the 4-feature model misses one sample.
I changed the random_state to generate different splits of the data and
repeated the process, and the 3-feature dataset performed better than,
or at least as well as, the 4-feature dataset.
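
Repeating the experiment over several values of random_state can be automated. The sketch below is one possible way to do it, assuming 20 seeds (an arbitrary choice) and the same 10% hold-out as above. The names acc4, acc3 and seeds are illustrative, and no particular outcome is implied.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler

iris = datasets.load_iris()
y = iris.target
X4 = MinMaxScaler().fit_transform(iris.data)  # all 4 normalized features
X3 = X4[:, [0, 2, 3]]                         # without 'sepal width (cm)'

seeds = range(20)  # assumption: 20 different splits
acc4, acc3 = [], []
for seed in seeds:
    # Split row indices once so both feature sets see identical train/test rows.
    idx_train, idx_test = train_test_split(
        np.arange(len(y)), test_size=0.10, random_state=seed)
    for X, scores in ((X4, acc4), (X3, acc3)):
        model = GaussianNB().fit(X[idx_train], y[idx_train])
        scores.append(model.score(X[idx_test], y[idx_test]))

acc4, acc3 = np.array(acc4), np.array(acc3)
print("mean accuracy, 4 features:", acc4.mean())
print("mean accuracy, 3 features:", acc3.mean())
print("splits where 3 features >= 4 features:",
      int((acc3 >= acc4).sum()), "of", len(seeds))
```

With only 15 test samples per split, a single misclassification moves the accuracy by about 6.7 percentage points, so differences between the two feature sets on any one split should be read with caution.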

Conclusion

A better-prepared dataset can benefit a machine learning process.
A properly selected feature set not only saves model training time and
storage space, but can also lead to more accurate results.
