ML Project - Monica Sharma

MACHINE LEARNING PROJECT REPORT | MONICA SHARMA
MACHINE LEARNING PROJECT
TABLE OF CONTENTS
1. Problem 1
1.1 Define the problem and perform Exploratory Data Analysis - Problem definition - Check shape,
Data types, statistical summary - Univariate analysis - Bivariate analysis - Use appropriate
visualizations to identify the patterns and insights - Key meaningful observations on individual
variables and the relationship between variables.
1.2 Data Pre-processing Prepare the data for modelling: - Outlier Detection (treat, if needed) -
Feature Engineering / drop redundant features (if needed) - Encode the data - Train-test split
1.3 Model Building - Bagging - Build a Bagging classifier - Build a Random Forest classifier - Check
the performance of the models across train and test set using different metrics and comment on the
same.
1.4 Model Improvement - Bagging - Try and improve the model performance by tuning the model
(minimum 2 parameters to be tuned) - Bagging Classifier - Random Forest Classifier - Comment on
model performance after tuning the model.
1.5 Model Building - Boosting - Build a Boosting classifier - Check the performance of the models
across train and test set using different metrics and comment on the same Note: AdaBoost or
GradientBoosting classifier can be built.
1.6 Model Improvement - Boosting - Try and improve the model performance by tuning the model
(minimum 2 parameters to be tuned) - Comment on model performance after tuning the model.
1.7 Actionable Insights & Recommendations - Compare all the models and choose the best model
with proper rationale - Conclude with the key takeaways (actionable insights and recommendations)
for the business.
2. Problem 2
2.1 Data Preparation Data preparation and exploratory data analysis - Pick out the Deal
(Dependent Variable) and Description columns into a separate dataframe - Create two corpora - one
with those who secured a deal and the other with those who did not secure a deal - Find the number
of characters for both the corpuses Text preprocessing on corpora which secured the deal
2.1.1 Pick out the Deal (Dependent Variable) and Description columns into a separate data
frame.
2.1.2 Create two corpora - one with those who secured a deal and the other with those who
did not secure a deal
2.1.3 Find the number of characters for both the corpuses Text preprocessing on corpora which
secured the deal.
2.1.4 Text pre-processing on corpora which secured the deal.
2.2 Insight Generation - Create a wordcloud of common words used by companies who secure a
deal - Provide insights from the preprocessed data.
2.3 Business Report Quality - Adhere to the business report checklist
LIST OF FIGURES
Figure 1-1: No. of male & female using different transport modes.
Figure 1-2: Distribution of Age
Figure 1-3: Distribution of Work Experience
Figure 1-4: Observation on Gender
Figure 1-5: Distribution on preferred mode of transport
Figure 1-6: Gender Impact on mode of transport
Figure 1-7: Work Exp Impact on mode of transport
Figure 1-8: Age Impact on mode of transport
Figure 1-9: Outlier Plot
LIST OF TABLES
Table 1-1:Data Information
Table 1-2:Duplicate Value information
Table 1-3:Shape of the data
Table 1-4:Statistical Information of the dataset
Table 1-5:Preferred mode of Transport wrt Gender
Table 1-6:Multivariate Analysis (Heat Map)
Table 2-1:head of the dataset (Part 1)
Table 2-2:head of the dataset (Part 2)
Table 2-3:Shape of the data
Table 2-4:Dataset type
Table 2-5:Dataset Information
Table 2-6:Null value of Dataset
1. Problem 1
Context
You are in discussions with ABC Consulting to provide transport for their employees. For this
purpose, you are tasked with understanding how the employees of ABC Consulting currently prefer to
commute (between home and office). Based on parameters such as age, salary and work
experience given in the data set ‘Transport.csv’, you are required to predict the preferred mode of
transport. The project requires you to build several Machine Learning models and compare them so that
a final model can be selected.
Objective
The objective is to build various Machine Learning models on this data set and, based on the accuracy
metrics, decide which model is to be finalized for predicting the mode of transport chosen by the
employee.
Data Dictionary
Age: Age of the Employee in Years
Gender: Gender of the Employee
Engineer: 1 for Engineer, 0 for Non-Engineer
MBA: 1 for MBA, 0 for Non-MBA
Work Exp: Experience in years
Salary: Salary in Lakhs per Annum
Distance: Distance in km from Home to Office
license: 1 if the employee has a driving licence, 0 if not
Transport: Mode of Transport
1.1 Define the problem and perform Exploratory Data Analysis - Problem definition - Check shape,
Data types, statistical summary - Univariate analysis - Bivariate analysis - Use appropriate
visualizations to identify the patterns and insights - Key meaningful observations on individual
variables and the relationship between variables.
Data is imported and the following are the observations:
Table 1-1:Data Information
Table 1-2:Duplicate Value information
Table 1-3:Shape of the data
• There are 444 employee records.
• There are 9 variables in total; Transport is the dependent variable and the others are independent variables.
• There are no duplicate records.
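The checks above can be reproduced with a few pandas calls. This is a minimal sketch: the helper name summarize_dataset and the three sample rows are hypothetical, and on the real data the frame would come from reading ‘Transport.csv’.

```python
import pandas as pd


def summarize_dataset(df: pd.DataFrame) -> dict:
    """Collect the basic checks: shape, data types and duplicate rows."""
    return {
        "shape": df.shape,                          # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),  # column name -> data type
        "duplicates": int(df.duplicated().sum()),   # fully repeated rows
    }


# Hypothetical sample rows; the real frame would be pd.read_csv("Transport.csv").
sample = pd.DataFrame(
    {
        "Age": [28, 23, 29],
        "Gender": ["Male", "Female", "Male"],
        "Transport": ["Public Transport", "Public Transport", "Private Transport"],
    }
)
summary = summarize_dataset(sample)
print(summary)
```

On the full data set the same call would be expected to report 444 rows, 9 columns and 0 duplicates, in line with the observations above.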
Statistical Summary
Table 1-4:Statistical Information of the dataset
• 50% of the employees have work experience of less than 5 years and 75% of the employees have
work experience below 8 years.
• The average employee age is 27.75 years.
• The average salary of an employee is 16.23 lakhs per annum.
• 75% of the employees have a travel distance of less than 13 km.
Figure 1-1: No. of male & female using different transport modes.
• Out of 444 records, 316 are ‘Male’ and the remaining 128 are ‘Female’.
• 300 employees travel by Public transport and 144 by Private transport.
Figure 1-2: Distribution of Age
• The distribution of age is right-skewed. The plot shows that most of the employees
are aged between 23 and 30 years.
Figure 1-3: Distribution of Work Experience
• The Work Exp variable looks right-skewed, with most of the employees having work experience
between 0 and 8 years.
Figure 1-4: Observation on Gender
- As can be observed, the dataset is 71.2% male and 28.8% female.
Figure 1-5: Distribution on preferred mode of transport
- 300 people use Public transport and the remaining 144 prefer Private transport.
Table 1-5:Preferred mode of Transport wrt Gender
Multivariate Analysis
Table 1-6:Multivariate Analysis (Heat Map)
- As can be observed from the heat map, Work Exp is highly correlated with Salary and Age.
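A heat map of this kind is drawn from a pairwise correlation matrix. The sketch below shows how such a matrix could be computed with pandas; the five sample rows are invented for illustration and are not taken from ‘Transport.csv’.

```python
import pandas as pd

# Hypothetical numeric sample; the column names follow the data dictionary.
sample = pd.DataFrame(
    {
        "Age": [23, 25, 28, 31, 36],
        "Work Exp": [1, 3, 6, 9, 14],
        "Salary": [7.5, 9.0, 13.5, 18.0, 36.0],
        "Distance": [12.1, 5.2, 9.8, 14.3, 10.4],
    }
)

# Pearson correlation between every pair of numeric columns.
corr = sample.corr()

# Correlation of Work Exp with the remaining variables, strongest first.
work_exp_corr = corr["Work Exp"].drop("Work Exp").sort_values(ascending=False)
print(work_exp_corr.round(2))
```

Sorting one column of the matrix, as done here for Work Exp, is a quick way to read off the strongest relationships without inspecting the whole plot.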
Figure 1-6: Gender Impact on mode of transport
- Females tend to prefer Private transport more than males do.
Figure 1-7: Work Exp Impact on mode of transport
- People with higher work experience prefer Private transport to Public
transport.
Figure 1-8: Age Impact on mode of transport
- People aged over 30 generally prefer Private transport to Public
transport.
1.2 Data Pre-processing Prepare the data for modelling: - Outlier Detection (treat, if needed) - Feature
Engineering / drop redundant features (if needed) - Encode the data - Train-test split
Outlier Detection
Figure 1-9: Outlier Plot
• There are outliers present. However, for now we will keep the outliers and proceed with model
building.
Data Split
• Shape of training set: (310, 8)
• Shape of test set: (134, 8)
• Proportion of classes in training set:
1    0.674194
0    0.325806
Name: Transport, dtype: float64
• Proportion of classes in test set:
1    0.679104
0    0.320896
Name: Transport, dtype: float64
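A split of this shape can be obtained with scikit-learn's train_test_split. The sketch below is an assumption about how it might have been done: the report does not state the test fraction, whether stratification was used, or the random seed, so test_size=0.30, stratify=y and random_state=1 are illustrative choices. The feature matrix is random stand-in data with the same dimensions as the encoded data set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the encoded data: 444 rows, 8 predictors,
# with class sizes of 300 and 144 as reported in the EDA.
rng = np.random.default_rng(1)
X = rng.normal(size=(444, 8))
y = np.array([1] * 300 + [0] * 144)

# Assumed settings: 30% test data, class proportions preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

print(X_train.shape, X_test.shape)
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

With 444 rows, a 30% test fraction yields 310 training and 134 test records, which matches the shapes listed above. Stratifying keeps the share of class 1 nearly identical in both parts.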
1.3 Model Building - Bagging - Build a Bagging classifier - Build a Random Forest classifier - Check the
performance of the models across train and test set using different metrics and comment on the
same.
Model evaluation criterion:
The model can make wrong predictions in two ways:
1. The model predicts that the Public mode of transport is preferred, but the employee prefers the Private
mode.
2. The model predicts that the Private mode of transport is preferred, but the employee prefers the
Public mode.
Which case is more important?
Both are important, because the number of employees who prefer Private transport must be estimated correctly.
How to reduce the losses?
• F1 score can be used as the evaluation metric; the greater the F1 score, the higher
the chances of minimizing both False Negatives and False Positives.
• We will use balanced class weights so that the model focuses equally on both classes.
We have created functions to calculate the different metrics and the confusion matrix so that we don't have to
repeat the same code for each model.
• The model_performance_classification_sklearn function will be used to check the
performance of the models.
• The confusion_matrix_sklearn function will be used to plot the confusion matrix.
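The body of model_performance_classification_sklearn is not shown in this report. The sketch below is one way such a helper could be written with scikit-learn's metric functions; the returned dictionary layout and the tiny demo data are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier


def model_performance_classification_sklearn(model, predictors, target):
    """Return accuracy, recall, precision and F1 for a fitted classifier."""
    pred = model.predict(predictors)
    return {
        "Accuracy": accuracy_score(target, pred),
        "Recall": recall_score(target, pred),
        "Precision": precision_score(target, pred),
        "F1": f1_score(target, pred),
    }


# Tiny hypothetical data set, used only to show the call.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])
demo_model = DecisionTreeClassifier(random_state=1).fit(X_demo, y_demo)
demo_scores = model_performance_classification_sklearn(demo_model, X_demo, y_demo)
print(demo_scores)
```

Calling the same helper once on the training set and once on the test set gives the pair of results needed to judge overfitting.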
a. Bagging - Model Building

• Checking model performance on training set
• Checking model performance on test set
- As we can see, the model is overfitting here. We will try to tune the model and reduce
overfitting.
b. Random Forest - Model Building
- Similar to the bagging model, the random forest model is overfitting
here. We will try to tune the model and reduce overfitting.
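The comparison of training and test performance for the two untuned models can be sketched as follows. The data is synthetic (generated with make_classification, not the Transport data), and default hyperparameters are assumed apart from the balanced class weights mentioned earlier, which BaggingClassifier itself does not accept.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Transport data): 444 rows, 8 features.
X, y = make_classification(
    n_samples=444, n_features=8, n_informative=4, flip_y=0.1, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

models = {
    "Bagging": BaggingClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(class_weight="balanced", random_state=1),
}

f1_gap = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_f1 = f1_score(y_train, model.predict(X_train))
    test_f1 = f1_score(y_test, model.predict(X_test))
    f1_gap[name] = train_f1 - test_f1  # a large positive gap suggests overfitting
    print(f"{name}: train F1 = {train_f1:.3f}, test F1 = {test_f1:.3f}")
```

A training score close to 1 combined with a clearly lower test score is the pattern described above as overfitting.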
1.4 Model Improvement - Bagging - Try and improve the model performance by tuning the model
(minimum 2 parameters to be tuned) - Bagging Classifier - Random Forest Classifier - Comment
on model performance after tuning the model.
a. Hyperparameter Tuning – Bagging Classifier

- The model is still found to overfit the training data, as the training metrics are high, but
the testing metrics are not.
b. Hyperparameter Tuning – Random Forest Classifier
- The model is still found to overfit the training data, as the training metrics are high, but
the testing metrics are not.
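Tuning of this kind is commonly done with a cross-validated grid search. The sketch below tunes two random forest parameters with GridSearchCV and F1 as the scoring metric. The parameter grid, the number of folds and the synthetic data are all assumptions, since the report does not list the values that were actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the Transport data).
X, y = make_classification(
    n_samples=444, n_features=8, n_informative=4, flip_y=0.1, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# Hypothetical grid: shallower trees and larger leaves limit how closely
# each tree can fit the training data.
param_grid = {
    "max_depth": [3, 5],
    "min_samples_leaf": [5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=1),
    param_grid,
    scoring="f1",
    cv=3,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Restricting tree depth and leaf size is a usual first step when a tree ensemble overfits. If the gap between training and test metrics remains, a wider grid or more data may be needed.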
1.5 Model Building - Boosting - Build a Boosting classifier - Check the performance of the models
across train and test set using different metrics and comment on the same Note: AdaBoost or
GradientBoosting classifier can be built.
a. Boosting- Model Building and Hyperparameter Tuning
- We can see that there are 206 true positives, 3 false negatives, 32 false
positives and 69 true negatives.
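The standard metrics follow directly from these four counts. The short calculation below applies the usual definitions of accuracy, precision, recall and F1 to the quoted numbers.

```python
# Confusion-matrix counts quoted above.
tp, fn, fp, tn = 206, 3, 32, 69

total = tp + fn + fp + tn
accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted positives that are right
recall = tp / (tp + fn)               # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"records={total} accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The four counts add up to 310, the size of the training set, so these figures describe training performance. Recall is very high because only 3 positives are missed, while the 32 false positives pull precision down.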
1.6 Model Improvement - Boosting - Try and improve the model performance by tuning the model
(minimum 2 parameters to be tuned) - Comment on model performance after tuning the model.
1.7 Actionable Insights & Recommendations - Compare all the models and choose the best model
with proper rationale - Conclude with the key takeaways (actionable insights and
recommendations) for the business.
Observation
- Based on the above results for all the models, it can be observed that the AdaBoost classifier
model will be able to provide better predictions. Compared to the other models, the AdaBoost
classifier shows better accuracy and precision.
- Looking at the feature importance of the AdaBoost classifier model, the top three
features are Salary, Distance and Age.
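Feature importances of a fitted AdaBoost model can be read from its feature_importances_ attribute. The sketch below shows the mechanics only: the data is synthetic, the hyperparameter values are illustrative, and the column names are borrowed from the data dictionary purely as labels, so the ranking it prints says nothing about the actual Transport data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Labels borrowed from the data dictionary; the values below are synthetic.
feature_names = [
    "Age", "Gender", "Engineer", "MBA", "Work Exp", "Salary", "Distance", "license",
]
X, y = make_classification(
    n_samples=444, n_features=8, n_informative=4, random_state=1
)
X = pd.DataFrame(X, columns=feature_names)

# Illustrative settings, not the tuned values from the report.
model = AdaBoostClassifier(n_estimators=50, learning_rate=0.5, random_state=1)
model.fit(X, y)

# Rank the features by their contribution to the ensemble.
importances = pd.Series(model.feature_importances_, index=feature_names)
top_three = importances.sort_values(ascending=False).head(3)
print(top_three)
```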
Actionable Insights and Recommendations:
- The important variables are Salary, Age, Work Exp and Distance.
- Age and Work Exp are correlated.
- People with higher salaries prefer to use Private transport. However, there are outliers in
the dataset.
- People aged over 30 generally prefer Private transport to Public
transport.
- People with higher work experience tend to prefer the Private mode of transport. There
are outliers with high work experience among the Public transport users.
2. Problem 2
Context
A dataset of Shark Tank episodes is made available. It covers 495 entrepreneurs making their pitch
to the VC sharks. You will ONLY use the “Description” column for the initial text mining exercise.
2.1 Data Preparation Data preparation and exploratory data analysis - Pick out the
Deal (Dependent Variable) and Description columns into a separate dataframe -
Create two corpora - one with those who secured a deal and the other with those
who did not secure a deal - Find the number of characters for both the corpuses
Text preprocessing on corpora which secured the deal
a. Data Description
Table 2-1:head of the dataset (Part 1)
Table 2-2:head of the dataset (Part 2)
Table 2-3:Shape of the data
Table 2-4:Dataset type
Table 2-5:Dataset Information
Table 2-6:Null value of Dataset
- There are 495 rows and 19 columns.
- The dataset contains 2 Boolean, 5 integer and 12 object columns.
- There are null values in the entrepreneur and website columns. However, as we will
not be using these columns for our study, we can leave them as they are.
2.1.1 Pick out the Deal (Dependent Variable) and Description columns into a separate data
frame.
- The new dataframe “df2” has 495 rows and 2 columns, i.e., Deal and Description.
2.1.2 Create two corpora - one with those who secured a deal and the other with those who did not
secure a deal
We created two corpora – Corpus 1: deal secured and Corpus 2: deal not secured.
2.1.3 Find the number of characters for both the corpuses Text preprocessing on corpora which
secured the deal.
- The number of characters in the corpus which secured the deal is 45002.
- The number of characters in the corpus which did not secure the deal is 47184.
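The column selection, the two corpora and the character counts can be sketched with pandas as below. The three rows are invented, the column spellings follow this report, and joining the descriptions with a single space is an assumption: the separator adds to the character count, and the report does not say how the texts were combined.

```python
import pandas as pd

# Hypothetical rows; the real data has 495 pitches and 19 columns.
df = pd.DataFrame(
    {
        "Deal": [True, False, True],
        "Description": [
            "Organic snack bars made for kids.",
            "A mobile app for sharing parking spots.",
            "Reusable bags designed for online grocery orders.",
        ],
        "Episode": [1, 1, 2],
    }
)

# Keep only the dependent variable and the text column.
df2 = df[["Deal", "Description"]]

# One corpus per outcome: all descriptions joined into a single string.
corpus_deal = " ".join(df2.loc[df2["Deal"], "Description"])
corpus_no_deal = " ".join(df2.loc[~df2["Deal"], "Description"])

print(len(corpus_deal), len(corpus_no_deal))
```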
2.1.4 Text pre-processing on corpora which secured the deal.
We'll be doing text preprocessing on the corpus for those who secured the deal
a. Removal of http links
b. De-contraction of words
c. Tokenization
d. Lowercasing: Lowercasing all the text data, although commonly overlooked, is one of the
simplest and most effective forms of text preprocessing.
e. Removal of punctuation
f. Removal of stop words:
- Stop words are a set of commonly used words in a language.
- Examples of stop words in English are “a”, “the”, “is”, “are” etc. The intuition behind
removing stop words is that, by removing low-information words from the text, we can focus on
the important words instead.
g. Lemmatization
- Lemmatization on the surface is very similar to stemming, where the goal is to
remove inflections and map a word to its root form.
h. Normalization (aggregating the pre-processing functions into one):
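A combined normalization function could look like the sketch below, which uses only the Python standard library. The contraction table and the stop-word set are deliberately tiny, hypothetical stand-ins for the fuller lists a real run would use. Lowercasing is applied before de-contraction so that the lookup matches, which differs slightly from the order listed above. Lemmatization is left out because it needs a dictionary-backed library such as NLTK's WordNetLemmatizer.

```python
import re
import string

# Hypothetical, deliberately tiny lookup tables; a real run would use fuller lists.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "we're": "we are"}
STOP_WORDS = {"a", "an", "the", "is", "are", "it", "for", "and", "to", "of", "we", "do", "not"}


def normalize(text: str) -> list:
    """Apply the preprocessing steps (except lemmatization) to one string."""
    text = re.sub(r"https?://\S+", " ", text)           # remove http links
    text = text.lower()                                  # lowercase first so contractions match
    for short, full in CONTRACTIONS.items():             # de-contract
        text = text.replace(short, full)
    tokens = text.split()                                # tokenize on whitespace
    table = str.maketrans("", "", string.punctuation)
    tokens = [tok.translate(table) for tok in tokens]    # strip punctuation
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]  # drop stop words


tokens = normalize("It's a product we're proud of: see https://example.com today!")
print(tokens)  # ['product', 'proud', 'see', 'today']
```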
2.2 Insight Generation - Create a wordcloud of common words used by companies who secure a deal
- Provide insights from the preprocessed data.
- From the word cloud, we can say that entrepreneurs who secured a deal used words
like ‘product’, ‘make’, ‘design’, ‘online’, ‘offer’ and ‘need’, along with other positive and
product-descriptive words, to attract the customer’s interest and thus secure the deal.
- Hence, to improve the chances of securing a deal, one should use more words that attract
the customer’s interest, particularly product- and design-oriented words.
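A word cloud sizes each word by how often it occurs, so the same insight can be checked numerically by counting tokens. The sketch below uses collections.Counter on a short, invented token list standing in for the preprocessed deal-secured corpus.

```python
from collections import Counter

# Hypothetical preprocessed tokens from the deal-secured corpus.
tokens = ["product", "design", "online", "product", "make", "design", "product"]

# Count occurrences and list the most frequent words first.
word_counts = Counter(tokens)
top_words = word_counts.most_common(3)
print(top_words)
```

Listing the top words this way gives a ranked view that is easier to compare across the two corpora than a picture.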
2.3 Business Report Quality - Adhere to the business report checklist