ML Project - Monica Sharma
MACHINE LEARNING PROJECT REPORT | MONICA SHARMA
TABLE OF CONTENTS
1. Problem 1
1.1 Define the problem and perform Exploratory Data Analysis - Problem definition - Check shape,
Data types, statistical summary - Univariate analysis - Bivariate analysis - Use appropriate
visualizations to identify the patterns and insights - Key meaningful observations on individual
variables and the relationship between variables.
1.2 Data Pre-processing Prepare the data for modelling: - Outlier Detection (treat, if needed) -
Feature Engineering / drop redundant features (if needed) - Encode the data - Train-test split
1.3 Model Building - Bagging - Build a Bagging classifier - Build a Random Forest classifier - Check
the performance of the models across train and test set using different metrics and comment on the
same.
1.4 Model Improvement - Bagging - Try and improve the model performance by tuning the model
(minimum 2 parameters to be tuned) - Bagging Classifier - Random Forest Classifier - Comment on
model performance after tuning the model.
1.5 Model Building - Boosting - Build a Boosting classifier - Check the performance of the models
across train and test set using different metrics and comment on the same Note: AdaBoost or
GradientBoosting classifier can be built.
1.6 Model Improvement - Boosting - Try and improve the model performance by tuning the model
(minimum 2 parameters to be tuned) - Comment on model performance after tuning the model.
1.7 Actionable Insights & Recommendations - Compare all the models and choose the best model
with proper rationale - Conclude with the key takeaways (actionable insights and recommendations)
for the business.
2. Problem 2
2.1 Data Preparation Data preparation and exploratory data analysis - Pick out the Deal
(Dependent Variable) and Description columns into a separate dataframe - Create two corpora - one
with those who secured a deal and the other with those who did not secure a deal - Find the number
of characters for both the corpuses Text preprocessing on corpora which secured the deal
2.1.1 Pick out the Deal (Dependent Variable) and Description columns into a separate data
frame.
2.1.2 Create two corpora - one with those who secured a deal and the other with those who
did not secure a deal
2.1.3 Find the number of characters for both the corpuses Text preprocessing on corpora which
secured the deal.
2.1.4 Text pre-processing on corpora which secured the deal.
2.2 Insight Generation - Create a wordcloud of common words used by companies who secure a
deal - Provide insights from the preprocessed data.
2.3 Business Report Quality - Adhere to the business report checklist
LIST OF FIGURES
Figure 1-1: No. of male & female using different transport modes
Figure 1-2: Distribution of Age
Figure 1-3: Distribution of Work Experience
Figure 1-4: Observation on Gender
Figure 1-5: Distribution on preferred mode of transport
Figure 1-6: Gender Impact on mode of transport
Figure 1-7: Work Exp Impact on mode of transport
Figure 1-8: Age Impact on mode of transport
Figure 1-9: Outlier Plot
LIST OF TABLES
Table 1-1: Data Information
Table 1-2: Duplicate Value information
Table 1-3: Shape of the data
Table 1-4: Statistical Information of the dataset
Table 1-6: Preferred mode of Transport wrt Gender
Table 1-5: Multivariate Analysis (Heat Map)
Table 2-1: head of the dataset (Part 1)
Table 2-2: head of the dataset (Part 1)
Table 2-3: Shape of the data
Table 2-4: Dataset type
Table 2-5: Dataset Information
Table 2-6: Null value of Dataset
1. Problem 1
Context
You are in discussions with ABC Consulting company about providing transport for their employees. For this
purpose, you are tasked with understanding how the employees of ABC Consulting currently prefer to
commute between home and office. Based on parameters such as age, salary and work
experience given in the data set ‘Transport.csv’, you are required to predict the preferred mode of
transport. The project requires you to build several Machine Learning models and compare them so that
a final model can be selected.
Objective
The objective is to build various Machine Learning models on this data set and, based on the accuracy
metrics, decide which model should be finalized for predicting the mode of transport chosen by the
employee.
Data Dictionary
Age: Age of the Employee in Years
1.1 Define the problem and perform Exploratory Data Analysis - Problem definition - Check shape,
Data types, statistical summary - Univariate analysis - Bivariate analysis - Use appropriate
visualizations to identify the patterns and insights - Key meaningful observations on individual
variables and the relationship between variables.
Data is imported and the following are the observations:
Statistical Summary
• 50% of the employees have less than 5 years of work experience, and 75% of the employees have
less than 8 years.
• Average employee age is 27.75 years.
• The average salary of an employee is 16.23.
• 75% of the employees have travel distance of less than 13
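The checks behind these observations (shape, data types, duplicates and statistical summary) can be sketched with pandas. The small frame below is made-up illustrative data, not the contents of ‘Transport.csv’; the column names Age, Work Exp, Salary and Distance follow the report, while the name `df` and all values are assumptions.

```python
import pandas as pd

# Made-up rows standing in for Transport.csv (values are illustrative only).
df = pd.DataFrame({
    "Age": [24, 27, 31, 26, 29, 35],
    "Work Exp": [1, 4, 9, 3, 6, 14],
    "Salary": [8.5, 12.0, 21.3, 10.4, 15.8, 36.9],
    "Distance": [7.2, 10.1, 13.5, 9.0, 11.8, 17.4],
})

n_rows, n_cols = df.shape                    # number of records and columns
dtypes = df.dtypes                           # data type of each column
n_duplicates = int(df.duplicated().sum())    # count of fully duplicated rows
summary = df.describe()                      # count, mean, std, min, quartiles, max

# The "50%" and "75%" rows of the summary are the source of quartile
# statements such as "75% of the employees have ... below ...".
median_work_exp = summary.loc["50%", "Work Exp"]
q3_work_exp = summary.loc["75%", "Work Exp"]
```

Reading the quartile rows directly from `describe()` keeps the written observations tied to the same numbers shown in the summary table.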
Figure 1-1: No. of male & female using different transport modes.
• The distribution of age is right-skewed. From the plot it is inferred that most of the employees
are aged between 23 and 30 years.
• The Work Exp variable looks right-skewed, with most of the employees having work experience
between 0 and 8 years.
- The dataset is 71.2% male and 28.8% female.
- 300 people use public transport and the remaining 144 prefer private transport.
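The class balance implied by these counts can be checked with simple arithmetic. Only the counts 300 and 144 come from the report; the variable names are illustrative.

```python
# Counts reported above for the target variable.
public_count = 300
private_count = 144
total = public_count + private_count      # all employees in the data set

public_share = public_count / total       # fraction preferring public transport
private_share = private_count / total     # fraction preferring private transport

print(f"Public: {public_share:.1%}, Private: {private_share:.1%}")
```

A split of roughly two to one is a moderate class imbalance, which is relevant when choosing evaluation metrics and class weights later on.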
Multivariate Analysis
- The heat map shows that Work Exp is highly correlated with both Salary and Age.
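The correlation matrix that a heat map visualizes can be sketched as follows. The values are made up for illustration and are not the real data; only the column names Age, Work Exp and Salary come from the report.

```python
import pandas as pd

# Illustrative values only; not the real data set.
df = pd.DataFrame({
    "Age": [24, 27, 31, 26, 29, 35],
    "Work Exp": [1, 4, 9, 3, 6, 14],
    "Salary": [8.5, 12.0, 21.3, 10.4, 15.8, 36.9],
})

# Pairwise Pearson correlation between the numeric columns.
corr = df.corr()

# Correlation of every other column with Work Exp, strongest first.
work_exp_corr = corr["Work Exp"].drop("Work Exp").sort_values(ascending=False)
```

Passing `corr` to a plotting function such as `seaborn.heatmap` is one common way to draw the heat map itself.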
- People with more work experience prefer to travel by private transport rather than public
transport.
- People aged over 30 generally prefer to travel by private transport rather than public
transport.
1.2 Data Pre-processing Prepare the data for modelling: - Outlier Detection (treat, if needed) - Feature
Engineering / drop redundant features (if needed) - Encode the data - Train-test split
Outlier Detection
• There are outliers present. However, for now we will keep the outliers and proceed with model
building.
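The report does not state which rule was used to flag outliers. A box plot conventionally marks points beyond 1.5 times the interquartile range, so the sketch below assumes that rule; the function name `iqr_outlier_mask` and the sample Salary values are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up salaries with one clearly extreme value.
salary = pd.Series([8.5, 9.1, 10.4, 12.0, 13.2, 15.8, 16.5, 57.0], name="Salary")

mask = iqr_outlier_mask(salary)   # boolean flag per record
outliers = salary[mask]           # the flagged values themselves
```

Keeping the flagged rows, as done here, is reasonable for tree-based ensembles, since their splits depend on the ordering of values rather than their magnitude.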
Data Split
• Percentage of classes in training set:
1 0.674194
0 0.325806
Name: Transport, dtype: float64
• Percentage of classes in test set:
1 0.679104
0 0.320896
Name: Transport, dtype: float64
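A split with these proportions can be sketched with scikit-learn. The 70/30 ratio, the stratification and the `random_state` are assumptions that happen to be consistent with the proportions above; the labels are synthetic, built only from the class counts of 300 and 144 reported earlier, and `row_id` is a placeholder feature.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target with the reported class counts: 300 in class 1, 144 in class 0.
y = pd.Series([1] * 300 + [0] * 144, name="Transport")
X = pd.DataFrame({"row_id": range(len(y))})   # placeholder feature column

# stratify=y keeps the class proportions nearly identical in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

train_share = y_train.value_counts(normalize=True)   # class proportions, training set
test_share = y_test.value_counts(normalize=True)     # class proportions, test set
```

Comparing `train_share` and `test_share` is the same check as the two printed blocks above: both parts should reflect the overall class balance.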
1.3 Model Building - Bagging - Build a Bagging classifier - Build a Random Forest classifier - Check the
performance of the models across train and test set using different metrics and comment on the
same.
The model can make two kinds of wrong prediction:
1. The model predicts that the public mode of transport is preferred, but the employee prefers the
private mode.
2. The model predicts that the private mode of transport is preferred, but the employee prefers the
public mode.
Both errors matter, because the goal is to correctly estimate the number of employees who prefer private transport.
• F1 score can be used as the evaluation metric: the higher the F1 score, the better the chance of
minimizing both False Negatives and False Positives.
• We will use balanced class weights so that the model focuses equally on both classes.
We have created functions to calculate different metrics and confusion matrix so that we don't have to
use the same code repeatedly for each model.
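The original helper functions are not reproduced in this report. A minimal sketch of such a helper, assuming scikit-learn's metric functions and a hypothetical name `classification_scores`, might look like this:

```python
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

def classification_scores(y_true, y_pred) -> pd.DataFrame:
    """Return accuracy, recall, precision and F1 as a one-row DataFrame."""
    return pd.DataFrame({
        "Accuracy": [accuracy_score(y_true, y_pred)],
        "Recall": [recall_score(y_true, y_pred)],
        "Precision": [precision_score(y_true, y_pred)],
        "F1": [f1_score(y_true, y_pred)],
    })

# Tiny hand-made labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

scores = classification_scores(y_true, y_pred)
# Rows are actual classes 0/1, columns are predicted classes 0/1.
cm = confusion_matrix(y_true, y_pred)
```

Calling the same helper on the training and test predictions of every model gives directly comparable rows for the final model comparison.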
- As we can see, the model is overfitting here. We will try to tune the model and reduce
overfitting.
- As with the bagging model, the random forest model is overfitting
here. We will try to tune the model and reduce overfitting.
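The comparison of training and test scores for the two bagging-type models can be sketched as follows. The data is synthetic (generated with `make_classification` to roughly match the size and class balance of the report's data), and the number of features, the default hyperparameters and the `random_state` values are assumptions, so the scores will not match the report's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared Transport data.
X, y = make_classification(
    n_samples=444, n_features=8, n_informative=4,
    weights=[0.32, 0.68], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

models = {
    # The default base learner of BaggingClassifier is a decision tree.
    "Bagging": BaggingClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(class_weight="balanced", random_state=1),
}

# F1 on both sets; a large gap between them is the sign of overfitting.
f1_results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    f1_results[name] = {
        "train": f1_score(y_train, model.predict(X_train)),
        "test": f1_score(y_test, model.predict(X_test)),
    }
```

Fully grown trees tend to fit the training set almost perfectly, which is why untuned bagging and random forest models often show the train/test gap described above.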
1.4 Model Improvement - Bagging - Try and improve the model performance by tuning the model
(minimum 2 parameters to be tuned) - Bagging Classifier - Random Forest Classifier - Comment
on model performance after tuning the model.
- The model is still found to overfit the training data, as the training metrics are high, but
the testing metrics are not.
- The model is still found to overfit the training data, as the training metrics are high, but
the testing metrics are not.
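The report does not list which parameters were tuned. As one possible sketch, a grid search over two random forest parameters that limit tree size (`max_depth` and `min_samples_leaf`) is shown below; the grid values, `cv=3`, `n_estimators=50` and the synthetic data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared Transport data.
X, y = make_classification(
    n_samples=444, n_features=8, n_informative=4,
    weights=[0.32, 0.68], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# Two parameters that restrict how far each tree can grow.
param_grid = {
    "max_depth": [3, 5],
    "min_samples_leaf": [5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=1),
    param_grid=param_grid,
    scoring="f1",   # same metric chosen for evaluation in section 1.3
    cv=3,
)
search.fit(X_train, y_train)

best_params = search.best_params_       # best combination found on the grid
tuned_model = search.best_estimator_    # model refit on the full training set
```

Shallower trees and larger leaves usually narrow the gap between training and test scores, though, as noted above, tuning does not always remove overfitting.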
1.5 Model Building - Boosting - Build a Boosting classifier - Check the performance of the models
across train and test set using different metrics and comment on the same Note: AdaBoost or
GradientBoosting classifier can be built.
a. Boosting- Model Building and Hyperparameter Tuning
• Checking model performance on training set
- The confusion matrix shows 206 true positives, 3 false negatives, 32 false
positives and 69 true negatives.
- On the test set, the confusion matrix shows 87 true positives, 4 false negatives, 17 false
positives and 26 true negatives.
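Building a boosting classifier and reading the four confusion-matrix counts can be sketched as follows. AdaBoost is used because the report later names it; the default hyperparameters, the `random_state` and the synthetic data are assumptions, so the counts will differ from those quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared Transport data.
X, y = make_classification(
    n_samples=444, n_features=8, n_informative=4,
    weights=[0.32, 0.68], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

ada = AdaBoostClassifier(random_state=1)
ada.fit(X_train, y_train)

# For a binary problem, ravel() on the 2x2 matrix yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_test, ada.predict(X_test)).ravel()
```

The four counts always sum to the number of records scored, which is a quick sanity check on which data set a given confusion matrix belongs to.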
1.6 Model Improvement - Boosting - Try and improve the model performance by tuning the model
(minimum 2 parameters to be tuned) - Comment on model performance after tuning the model.
- On the training set, the confusion matrix shows 204 true positives, 5 false negatives, 40 false
positives and 61 true negatives.
- On the test set, the confusion matrix shows 86 true positives, 5 false negatives, 19 false
positives and 24 true negatives.
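The metrics implied by the confusion-matrix counts quoted in sections 1.5 and 1.6 can be derived with a few lines of arithmetic. The counts come from the report; the labels "train" and "test" for the last three sets are inferred from their totals (310 and 134 records, matching the split), and the function name `scores_from_counts` is illustrative.

```python
def scores_from_counts(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# (TP, FN, FP, TN) as quoted in sections 1.5 (default) and 1.6 (tuned).
reported = {
    "default, train": (206, 3, 32, 69),
    "default, test": (87, 4, 17, 26),
    "tuned, train": (204, 5, 40, 61),
    "tuned, test": (86, 5, 19, 24),
}

derived = {name: scores_from_counts(*counts) for name, counts in reported.items()}

for name, s in derived.items():
    print(name, {metric: round(value, 3) for metric, value in s.items()})
```

Placing the derived scores for the default and tuned models side by side makes it easy to see whether tuning changed the test-set performance.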
1.7 Actionable Insights & Recommendations - Compare all the models and choose the best model
with proper rationale - Conclude with the key takeaways (actionable insights and
recommendations) for the business.
Observation
- Based on the results for all the models, the AdaBoost classifier
is expected to provide the best predictions. Compared with the other models, the AdaBoost
classifier shows better accuracy and precision.
- Looking at the feature importances of the AdaBoost classifier model, the top three
features are Salary, Distance and Age.
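Ranking features by importance can be sketched with the `feature_importances_` attribute of a fitted AdaBoost model. The data below is synthetic and the feature names are attached to random columns purely for illustration, so the resulting order says nothing about the real data set.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Names follow the report; the values behind them are synthetic.
feature_names = ["Age", "Work Exp", "Salary", "Distance"]
X_values, y = make_classification(
    n_samples=444, n_features=4, n_informative=3, n_redundant=1, random_state=1
)
X = pd.DataFrame(X_values, columns=feature_names)

ada = AdaBoostClassifier(random_state=1).fit(X, y)

# One importance value per feature, sorted from most to least important.
importances = pd.Series(
    ada.feature_importances_, index=feature_names
).sort_values(ascending=False)
top_three = importances.head(3).index.tolist()
```

A bar chart of `importances` is a common way to present this ranking in a business report.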
2. Problem 2
Context
A dataset of Shark Tank episodes is made available. It contains 495 entrepreneurs making their pitch
to the VC sharks. You will ONLY use the “Description” column for the initial text mining exercise.
2.1 Data Preparation Data preparation and exploratory data analysis - Pick out the
Deal (Dependent Variable) and Description columns into a separate dataframe -
Create two corpora - one with those who secured a deal and the other with those
who did not secure a deal - Find the number of characters for both the corpuses
Text preprocessing on corpora which secured the deal
a. Data Description
2.1.1 Pick out the Deal (Dependent Variable) and Description columns into a separate data
frame.
- The new dataframe “df2” has 495 rows and 2 columns, i.e., Deal and Description.
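Selecting the two columns into a separate dataframe can be sketched as follows. The rows are made up, the source frame name `shark` and the extra `Episode` column are hypothetical, and the column names Deal and Description are taken from the report.

```python
import pandas as pd

# Made-up rows standing in for the Shark Tank data set.
shark = pd.DataFrame({
    "Deal": [True, False, True],
    "Description": [
        "We make a reusable bottle.",
        "A subscription box for pet toys.",
        "Our app offers online tutoring.",
    ],
    "Episode": [1, 1, 2],   # hypothetical extra column that is not needed
})

# Keep only the dependent variable and the text column; copy() avoids
# modifying the original frame through the new one.
df2 = shark[["Deal", "Description"]].copy()
```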
2.1.2 Create two corpora - one with those who secured a deal and the other with those who did not
secure a deal
We created two corpora. Corpus 1: deal secured, and Corpus 2: deal not secured.
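One way to build the two corpora is to join the descriptions of each group into a single string. The sketch assumes that Deal is stored as a boolean column; the rows and the names `deal_corpus` and `no_deal_corpus` are illustrative.

```python
import pandas as pd

# Made-up rows; Deal is assumed to be a boolean column.
df2 = pd.DataFrame({
    "Deal": [True, False, True],
    "Description": [
        "We make a reusable bottle.",
        "A subscription box for pet toys.",
        "Our app offers online tutoring.",
    ],
})

# Corpus 1: all descriptions of pitches that secured a deal.
deal_corpus = " ".join(df2.loc[df2["Deal"], "Description"])

# Corpus 2: all descriptions of pitches that did not secure a deal.
no_deal_corpus = " ".join(df2.loc[~df2["Deal"], "Description"])
```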
2.1.3 Find the number of characters for both the corpuses Text preprocessing on corpora which
secured the deal.
- The number of characters in the corpus which did not secure the deal is 47184.
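Counting the characters of a corpus held as a single string is a call to `len`. The two short strings below are made-up stand-ins for the real corpora, so the counts are illustrative only.

```python
# Made-up stand-ins for the two corpora.
deal_corpus = "We make a reusable bottle. Our app offers online tutoring."
no_deal_corpus = "A subscription box for pet toys."

# len() counts every character, including spaces and punctuation.
deal_chars = len(deal_corpus)
no_deal_chars = len(no_deal_corpus)

print(f"Deal secured: {deal_chars} characters")
print(f"Deal not secured: {no_deal_chars} characters")
```

Because spaces and punctuation are included, the count taken before preprocessing will be larger than any count taken after it.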
We'll be doing text preprocessing on the corpus for those who secured the deal
b. De-contraction of words
c. Tokenization
d. Lowercasing: Lowercasing ALL your text data, although commonly overlooked, is one of the
simplest and most effective forms of text preprocessing.
e. Removal of Punctuation
- Examples of stop words in English are “a”, “the”, “is”, “are” etc. The intuition behind
removing stop words is that, by dropping low-information words from the text, we can focus on
the important words instead.
f. Lemmatization
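The preprocessing steps listed above (de-contraction, tokenization, lowercasing, punctuation removal, stop-word removal and lemmatization) can be sketched end to end using only the Python standard library. The lookup tables `CONTRACTIONS`, `STOP_WORDS` and `LEMMAS` are tiny hypothetical stand-ins; in practice a library such as NLTK would usually supply the stop-word list and the lemmatizer, and the report does not state which tools were used.

```python
import re
import string

# Tiny illustrative lookup tables (assumptions, not the report's actual resources).
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'ve", " have"),
]
STOP_WORDS = {"a", "an", "the", "is", "are", "we", "our", "that", "and", "not", "will"}
LEMMAS = {"makes": "make", "products": "product", "designs": "design", "offers": "offer"}

PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def decontract(text: str) -> str:
    """Expand common English contractions (handles straight apostrophes only)."""
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


def preprocess(text: str) -> list:
    """Apply the preprocessing steps in the order used in this section."""
    text = decontract(text)                                     # b. de-contraction
    tokens = text.split()                                       # c. tokenization (whitespace)
    tokens = [t.lower() for t in tokens]                        # d. lowercasing
    tokens = [t.translate(PUNCT_TABLE) for t in tokens]         # e. removal of punctuation
    tokens = [t for t in tokens if t and t not in STOP_WORDS]   # stop-word removal
    tokens = [LEMMAS.get(t, t) for t in tokens]                 # f. lemmatization (lookup)
    return tokens


sample = "We're a company that makes reusable products. Our designs won't break!"
tokens = preprocess(sample)
```

De-contraction runs first so that pieces such as "not" survive punctuation removal as whole words, and lemmatization runs last so that it only sees clean, lowercased tokens.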
2.2 Insight Generation - Create a wordcloud of common words used by companies who secure a deal
- Provide insights from the preprocessed data.
- From the word cloud, entrepreneurs who secured a deal used words
such as ‘product’, ‘make’, ‘design’, ‘online’, ‘offer’ and ‘need’, along with other positive and
product-descriptive words, to attract the customer’s interest, hence securing the deal.
- Hence, to improve the chance of securing a deal, a pitch should use more words that attract
the customer’s interest, in particular product- and design-oriented words.
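A word cloud sizes each word by how often it occurs, so the underlying insight can be checked with a plain frequency count. The token list below is made up for illustration and is not the preprocessed corpus.

```python
from collections import Counter

# Made-up preprocessed tokens from the "deal secured" corpus.
tokens = [
    "product", "make", "design", "product", "online",
    "offer", "product", "make", "need",
]

word_counts = Counter(tokens)             # frequency of each token
top_words = word_counts.most_common(3)    # the three most frequent tokens

for word, count in top_words:
    print(f"{word}: {count}")
```

The same frequency mapping can be passed to the `wordcloud` package (`WordCloud.generate_from_frequencies`) to draw the cloud itself.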