
How I used ensemble learning in my most recent machine learning project

Bagging Briefing

High variance

When a model is overfitted, there is usually a problem of high variance.

What is high variance?

High variance is when the test error varies largely due to the selection of the training samples: train the same model on different samples of the data and you'd realize that the test errors swing widely. This is a very common problem in machine learning.
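A small sketch of this (not from the deck itself): train an unpruned decision tree on several different random subsets of the same training data and watch the test accuracy shift from run to run. The dataset here is synthetic, generated just for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

scores = []
rng = np.random.RandomState(0)
for _ in range(5):
    # draw a different random subset of the training data each time
    idx = rng.choice(len(X_train), size=300, replace=False)
    model = DecisionTreeClassifier()  # unpruned tree, prone to overfitting
    model.fit(X_train[idx], y_train[idx])
    scores.append(model.score(X_test, y_test))

print("test accuracies:", np.round(scores, 3))
print("spread:", round(max(scores) - min(scores), 3))
```

The spread between the best and worst run is the "variance" being talked about: the model's quality depends heavily on which rows it happened to train on.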
The theory behind the bagging ensemble method

Why an ensemble method?
High variance is a very common machine learning problem, and there are many ways to solve it: bagging and boosting are two of these methods. In this work, I used the bagging method.

How does bagging work?
Instead of using all of our dataset in training, we can create various samples from our dataset; this method is called resampling with replacement.

Creating subsets from the original dataset for training
Note that you can create n subsets from the initial dataset; this is all based on your choice, and with multiple trials, you can get the desired results.
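Resampling with replacement, as described above, can be sketched in a few lines; the dataset and subset count here are placeholders, chosen only for illustration.

```python
import numpy as np

rng = np.random.RandomState(42)
data = np.arange(10)  # stand-in for the rows of the original dataset

n_subsets = 3  # n is your choice, as noted above
for i in range(n_subsets):
    # sample with replacement: some rows repeat, some are left out entirely
    subset = rng.choice(data, size=len(data), replace=True)
    print(f"subset {i}: {subset}")
```

Because each draw is with replacement, every subset is the same size as the original data but contains a slightly different mix of rows, which is what gives each bagged model its own view of the data.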
How the bagging method works

Fig 1
Image from : Codebasics machine learning course
Explanation of Fig 1

Training different models on different datasets

After creating the different subsets, we train a different model on each of these subsets.

Then taking the majority vote

In a classification problem, we take the majority vote; in a regression problem, we take the average, as the case may be.
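The aggregation step above can be sketched as two tiny helpers; the function names are my own, for illustration only.

```python
from collections import Counter

import numpy as np

def majority_vote(predictions):
    # classification: return the label most of the models predicted
    return Counter(predictions).most_common(1)[0][0]

def average_prediction(predictions):
    # regression: return the mean of the models' predictions
    return float(np.mean(predictions))

print(majority_vote([1, 0, 1, 1, 0]))       # -> 1 (three of five models said 1)
print(average_prediction([2.0, 3.0, 4.0]))  # -> 3.0
```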
What my dataset looks like

I used the same dataset from my last project, so I will be doing the same data preprocessing. As you can see, I can't throw this dataset into my machine learning model yet because it contains a lot of categorical variables.
Scaling of my dataset

Why I scaled my dataset:

● The variables in my dataset are terribly lopsided; for example, my dummies are zeros and ones, while other variables are much larger. Throwing them into the model without scaling might confuse the model.
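A minimal sketch of that scaling step, assuming scikit-learn's StandardScaler; the two columns here (a dummy next to a much larger numeric feature) are made up, not from the actual dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# dummy column (0/1) next to a feature on a far larger scale
X = np.array([[0, 120000.0],
              [1,  45000.0],
              [0,  80000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # each column now centred near 0
print(X_scaled.std(axis=0))   # and with unit variance
```

After scaling, both columns sit on comparable ranges, so neither dominates the model simply because of its units.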
Data preparation

Missing values
I explored my dataset to see if there were any missing values, and fortunately, there were none.

Categorical variables
● Next, I explored each column with categorical variables.
● This gave me insight into how I can handle them.

Label encoding
● For columns with just two unique values, I used label encoding, while I used one-hot encoding for the rest.

One-hot encoding
NB: When using one-hot encoding, you must remember to drop the first category in the arguments, as this helps to remove the issue of multicollinearity.
Solutions

1. To get a baseline model, I tried other machine learning models before using the bagging model.
2. I used a Random Forest classifier and a Decision Tree classifier.
3. Then I used the bagging model to compare my results.
4. I also used cross validation, just to get several results I can compare with.
Kfold with Decision Tree classifier
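A sketch of K-fold cross-validation with a decision tree, assuming scikit-learn; the data is synthetic, so the scores stand in for (and will differ from) the author's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 5 folds: each fold takes a turn as the held-out test set
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print("fold accuracies:", scores.round(3))
print("mean accuracy:", round(scores.mean(), 3))
```

The spread across folds is itself a quick read on variance: a model that swings widely between folds is the kind of high-variance model bagging is meant to stabilise.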
Bagging method
Accuracy from my
bagging model is:

83%
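A minimal sketch of a bagging setup like the one described, using scikit-learn's BaggingClassifier (whose default base estimator is a decision tree). The synthetic data stands in for the author's dataset, and the hyperparameters are my own choices, so the score will not match the 83% above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(
    n_estimators=100,   # number of bootstrap-trained trees
    max_samples=0.8,    # each tree sees 80% of the training rows
    oob_score=True,     # also score on each tree's out-of-bag rows
    random_state=0,
)
bag.fit(X_train, y_train)

print("test accuracy:", round(bag.score(X_test, y_test), 3))
print("out-of-bag accuracy:", round(bag.oob_score_, 3))
```

The out-of-bag score is a useful free estimate: each tree is evaluated on the rows its bootstrap sample left out, so no separate validation split is needed.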
CONCLUSION

While the accuracy from the bagging model seems better than SVM and the Decision Tree, in my previous work on the same dataset, after applying PCA and a random forest classifier, I got about 90% accuracy. It is always advisable to try different methods.
THANK YOU
