Advanced Machine Learning and Feature Engineering: Stacking


Advanced Machine Learning and Feature Engineering

Stacking

Stacking is a technique that takes several regression or classification models and uses their outputs as the input to a meta-classifier/regressor. Stacking is an ensemble learning technique, much like Random Forests, in which prediction quality is improved by combining models.

There are several libraries through which stacking can be deployed. They are:

1. Vecstack

2. Sklearn stacking

3. mlxtend

We have chosen sklearn stacking. The setup we used in sklearn stacking is as
mentioned below.
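
The following sketch reconstructs the stacking setup described in the next paragraph; the constructor arguments and the py-earth/lightgbm imports are illustrative assumptions, not the exact original code.

# Stacking with scikit-learn: three base estimators combined by RidgeCV.
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Lasso, RidgeCV
from lightgbm import LGBMRegressor
from pyearth import Earth  # MARS implementation from the py-earth package

earth = Earth()                            # MARS base estimator
lin_model = Lasso(alpha=0.01)              # Lasso base estimator (alpha assumed)
lgmbr = LGBMRegressor(n_estimators=500)    # LightGBM base estimator (assumed)

stack = StackingRegressor(
    estimators=[("MARS", earth), ("Lasso", lin_model), ("LightGBM", lgmbr)],
    final_estimator=RidgeCV(),             # default meta-learner
)
# stack.fit(X_train, y_train); predictions = stack.predict(X_test)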

Here, “earth”, “lin_model”, and “lgmbr” are the estimator objects, while “MARS”, “Lasso”, and “LightGBM” are their string names, and “final_estimator” is the parameter that specifies the model used to combine the base estimators; by default, “final_estimator” is “RidgeCV”.
The benefit of stacking is that it can harness the capabilities of a range of well-performing models on a classification or regression task and make predictions with better performance than any single model in the ensemble.

Once we obtain the best model, we can perform feature selection depending on the model; the features that are selected are model-dependent, as in the sketch below.
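
A hypothetical sketch of model-dependent feature selection using scikit-learn's SelectFromModel; it assumes the LightGBM estimator lgmbr from the stacking example has already been fitted, and uses the default importance threshold.

# Keep only the features whose importance in the fitted model
# exceeds the (default) threshold.
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(lgmbr, prefit=True)   # reuse the fitted model
X_selected = selector.transform(X_train)         # drop low-importance features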

Usually, hyperparameter tuning in machine learning is done using
“RandomizedSearchCV”, “GridSearchCV”, “HyperOpt”, or “BayesianOptimization”. But
while using deep learning models we can also perform hyperparameter optimization
using KerasTuner. It needs to be installed using the command below:
pip install keras-tuner

Hyperparameters remain constant over training time; they are set before training begins and govern how the training process of a model behaves.
The hyperparameter choices to be made in deep learning are:
• Number of layers to choose
• Number of neurons in a layer to choose
• Choice of the optimization function
• Choice of the learning rate for optimization function
• Choice of the loss function
• Choice of metrics
• Choice of activation function
• Choice of layer weight initialization

From all the above options we are selecting the number of layers, the number of neurons per layer, and the activation function. Our main aim is to get a low Root Mean Squared Error (RMSE) value, so we use Mean Squared Error (MSE) as the loss; a KerasTuner sketch of this setup follows.
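
A minimal KerasTuner sketch along these lines, tuning the number of layers, the neurons per layer, and the activation function, with MSE as the loss; the search ranges and trial count are illustrative assumptions.

import keras_tuner as kt
from tensorflow import keras

def build_model(hp):
    # KerasTuner calls this with a HyperParameters object (hp).
    model = keras.Sequential()
    # Tune the number of hidden layers and the neurons per layer.
    for i in range(hp.Int("num_layers", 1, 3)):
        model.add(keras.layers.Dense(
            units=hp.Int(f"units_{i}", 32, 256, step=32),
            activation=hp.Choice(f"activation_{i}", ["relu", "tanh"]),
        ))
    model.add(keras.layers.Dense(1))  # single regression output
    # MSE loss, so minimizing it also minimizes RMSE.
    model.compile(optimizer="adam", loss="mse",
                  metrics=[keras.metrics.RootMeanSquaredError()])
    return model

tuner = kt.RandomSearch(build_model, objective="val_loss", max_trials=10)
# tuner.search(X_train, y_train, validation_split=0.2, epochs=20)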

Since we have used multiple machine learning and deep learning models, we select the model with the highest “R squared score” and the lowest “Root Mean Squared Error”, computed as sketched below.
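
A minimal sketch of the two selection criteria, assuming a fitted candidate model and a held-out test split (variable names are assumptions):

from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

preds = model.predict(X_test)                      # any fitted candidate model
rmse = np.sqrt(mean_squared_error(y_test, preds))  # lower is better
r2 = r2_score(y_test, preds)                       # higher is better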

We selected the best model and checked the loss that was obtained. The Root Mean Squared Error is high for the MLP trained with KerasTuner: the error is 3646.
We now try an LSTM model. Long short-term memory (LSTM) is an artificial recurrent neural network architecture used in the field of deep learning. Unlike standard feedforward neural networks, an LSTM has feedback connections, so it can process not only single data points but also entire sequences of data.

The figure below explains how an LSTM works.

Now we have a look at how to build an LSTM model in Python. The bidirectional LSTM with dropout works fairly well compared with a unidirectional LSTM; a sketch follows.
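
A minimal sketch of a bidirectional LSTM with dropout in Keras; the layer sizes, dropout rate, and the 100-step input window are assumptions.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(100, 1)),               # last 100 time steps, 1 feature
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Dropout(0.2),                       # regularization between layers
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dropout(0.2),
    layers.Dense(1),                           # predicted value
])
model.compile(optimizer="adam", loss="mse")
model.summary()  # prints layer order, output shapes, and parameter counts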

The model summary (model.summary()) provides the following information:


• The layers and their order in the model.
• The output shape of each layer.
• The number of parameters (weights) in each layer.
• The total number of parameters (weights) in the model.

The LSTM model is trained on the previous 100 days and predicts the next 60 days, i.e., two months of future data; a windowing sketch is shown below.
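
A hypothetical windowing helper matching that description; make_windows and its defaults are illustrative, not from the original.

import numpy as np

def make_windows(series, n_in=100, n_out=60):
    # Build (100-day input, 60-day target) pairs from a 1-D series.
    X, y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])                  # previous 100 days
        y.append(series[i + n_in:i + n_in + n_out])   # next 60 days
    return np.array(X), np.array(y)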
Error Analysis

The results above report the RMSE and R-squared value for the different kinds of dataset. We can clearly say that LSTM and LightGBM have given a lower root mean squared error compared with the others.

Model Interpretability
The major methods used to check model interpretability are:
• LIME
• SHAP
The SHAP method is much better than LIME since, in the LIME method, it is difficult to define the neighbourhood (an exponential kernel is used for it), and the method is unstable due to a lack of robustness. So we are using SHAP rather than LIME; the formula used to calculate SHAP values is given below.
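
The standard Shapley value formula on which SHAP is based is:

\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \bigl[v(S \cup \{i\}) - v(S)\bigr]

where F is the set of all features, S ranges over subsets not containing feature i, and v(S) is the model's prediction using only the features in S.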

To work with SHAP we need to run the command below in the command prompt to install the relevant library.
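
pip install shap

A minimal usage sketch for a tree-based model; the variable names lgmbr and X_test are assumptions carried over from the earlier examples.

import shap

explainer = shap.TreeExplainer(lgmbr)        # works for LightGBM models
shap_values = explainer.shap_values(X_test)  # per-feature contributions
shap.summary_plot(shap_values, X_test)       # ranks features by importance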

The summary plot shows which features are most important, with feature importance ranked in descending order. “onpromotion”, “f_GROCERY I”, “f_CLEANING”, “store_nbr”, and “f_BEVERAGES” are the top five features for predicting the output.
As we already know, Multiple Linear Regression works well, and moreover it is considered the baseline model. When an unseen dataset is evaluated with Multiple Linear Regression, the Root Mean Squared Error is 10.72; a baseline sketch follows.
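
A minimal baseline sketch, assuming training and test splits are available:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

mlr = LinearRegression().fit(X_train, y_train)   # baseline model
rmse = np.sqrt(mean_squared_error(y_test, mlr.predict(X_test)))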
