
MACHINE LEARNING PROJECT REPORT | MONICA SHARMA

MACHINE LEARNING PROJECT


TABLE OF CONTENTS
1. Problem 1
1.1 Define the problem and perform Exploratory Data Analysis - Problem definition - Check shape,
Data types, statistical summary - Univariate analysis - Bivariate analysis - Use appropriate
visualizations to identify the patterns and insights - Key meaningful observations on individual
variables and the relationship between variables.
1.2 Data Pre-processing Prepare the data for modelling: - Outlier Detection (treat, if needed) -
Feature Engineering / drop redundant features (if needed) - Encode the data - Train-test split
1.3 Model Building - Bagging - Build a Bagging classifier - Build a Random Forest classifier - Check
the performance of the models across train and test set using different metrics and comment on the
same.
1.4 Model Improvement - Bagging - Try and improve the model performance by tuning the model
(minimum 2 parameters to be tuned) - Bagging Classifier - Random Forest Classifier - Comment on
model performance after tuning the model.
1.5 Model Building - Boosting - Build a Boosting classifier - Check the performance of the models
across train and test set using different metrics and comment on the same Note: AdaBoost or
GradientBoosting classifier can be built.
1.6 Model Improvement - Boosting - Try and improve the model performance by tuning the model
(minimum 2 parameters to be tuned) - Comment on model performance after tuning the model.
1.7 Actionable Insights & Recommendations - Compare all the models and choose the best model
with proper rationale - Conclude with the key takeaways (actionable insights and recommendations)
for the business.
2. Problem 2
2.1 Data Preparation Data preparation and exploratory data analysis - Pick out the Deal
(Dependent Variable) and Description columns into a separate dataframe - Create two corpora - one
with those who secured a deal and the other with those who did not secure a deal - Find the number
of characters for both the corpuses Text preprocessing on corpora which secured the deal
2.1.1 Pick out the Deal (Dependent Variable) and Description columns into a separate data
frame.
2.1.2 Create two corpora - one with those who secured a deal and the other with those who
did not secure a deal
2.1.3 Find the number of characters for both corpora
2.1.4 Text pre-processing on corpora which secured the deal.
2.2 Insight Generation - Create a wordcloud of common words used by companies who secure a
deal - Provide insights from the preprocessed data.
2.3 Business Report Quality - Adhere to the business report checklist


LIST OF FIGURES
Figure 1-1: No. of male & female using different transport modes
Figure 1-2: Distribution of Age
Figure 1-3: Distribution of Work Experience
Figure 1-4: Observation on Gender
Figure 1-5: Distribution on preferred mode of transport
Figure 1-6: Gender Impact on mode of transport
Figure 1-7: Work Exp Impact on mode of transport
Figure 1-8: Age Impact on mode of transport
Figure 1-9: Outlier Plot

LIST OF TABLES
Table 1-1:Data Information
Table 1-2:Duplicate Value information
Table 1-3:Shape of the data
Table 1-4:Statistical Information of the dataset
Table 1-5:Preferred mode of Transport wrt Gender
Table 1-6:Multivariate Analysis (Heat Map)
Table 2-1:head of the dataset (Part 1)
Table 2-2:head of the dataset (Part 2)
Table 2-3:Shape of the data
Table 2-4:Dataset type
Table 2-5:Dataset Information
Table 2-6:Null value of Dataset


1. Problem 1
Context
You are in discussions with ABC Consulting to provide transport for their employees. For this
purpose, you are tasked with understanding how the employees of ABC Consulting currently prefer to
commute between home and office. Based on parameters such as age, salary and work
experience given in the data set ‘Transport.csv’, you are required to predict the preferred mode of
transport. The project requires you to build several Machine Learning models and compare them so that
a final model can be selected.

Objective
The objective is to build various Machine Learning models on this data set and, based on the accuracy
metrics, decide which model should be finalized for predicting the mode of transport chosen by the
employee.

Data Dictionary
Age: Age of the Employee in Years

Gender: Gender of the Employee

Engineer: 1 = Engineer, 0 = Non-Engineer

MBA: 1 = MBA, 0 = Non-MBA

Work Exp: Experience in years

Salary: Salary in Lakhs per Annum

Distance: Distance in km from Home to Office

license: 1 if the employee has a driving licence, 0 if not

Transport: Mode of Transport


1.1 Define the problem and perform Exploratory Data Analysis - Problem definition - Check shape,
Data types, statistical summary - Univariate analysis - Bivariate analysis - Use appropriate
visualizations to identify the patterns and insights - Key meaningful observations on individual
variables and the relationship between variables.
Data is imported and the following are the observations:

Table 1-1:Data Information

Table 1-2:Duplicate Value information

Table 1-3:Shape of the data

• There are 444 employee records.
• There are 9 variables in total: Transport is the dependent variable and the other 8 are independent.
• There are no duplicate records.
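The checks above (shape, data types, duplicates) can be sketched as follows. This is a minimal illustration rather than the project's actual code: the `overview` helper is hypothetical, and the three-row frame is a made-up stand-in for the data that would normally come from `pd.read_csv("Transport.csv")`.

```python
import pandas as pd


def overview(df: pd.DataFrame) -> dict:
    """Collect the basic checks reported above: shape, dtypes, duplicates."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "n_duplicates": int(df.duplicated().sum()),
    }


# Made-up stand-in rows (assumption); the real data has 444 rows and 9 columns.
sample = pd.DataFrame({
    "Age": [28, 23, 29],
    "Gender": ["Male", "Female", "Male"],
    "Transport": ["Public Transport", "Public Transport", "Private Transport"],
})

info = overview(sample)
print(info["shape"], info["n_duplicates"])
```

On the full data set the same call would be expected to report 444 rows, 9 columns and no duplicates, in line with the bullets above.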

Statistical Summary


Table 1-4:Statistical Information of the dataset

• 50% of the employees have less than 5 years of work experience, and 75% have less than 8 years.
• The average employee age is 27.75 years.
• The average salary of an employee is 16.23 lakhs per annum.
• 75% of the employees travel less than 13 km between home and office.
Figure 1-1: No. of male & female using different transport modes.

• Out of 444 records, 316 are ‘Male’ and the remaining 128 are ‘Female’.
• 300 employees travel by Public transport and 144 by Private transport.


Figure 1-2: Distribution of Age

• The distribution of age is right-skewed. The plot shows that most of the employees
are aged between 23 and 30 years.

Figure 1-3: Distribution of Work Experience

• The Work Exp variable is right-skewed, with most of the employees having between 0 and 8
years of work experience.


Figure 1-4: Observation on Gender

- The dataset is 71.2% male and 28.8% female.

Figure 1-5: Distribution on preferred mode of transport

- 300 employees use Public transport and the remaining 144 prefer Private transport.


Table 1-5:Preferred mode of Transport wrt Gender

Multivariate Analysis

Table 1-6:Multivariate Analysis (Heat Map)

- The heat map shows that Work Exp is highly correlated with both Salary and Age.
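The correlations behind a heat map like this one can be computed and screened as in the sketch below. The `strongest_pairs` helper, the 0.7 threshold and the five illustrative rows are assumptions made for the example, not values taken from the project.

```python
import pandas as pd


def strongest_pairs(df: pd.DataFrame, threshold: float = 0.7) -> list:
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = float(corr.loc[a, b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    # Strongest relationships first
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Illustrative rows only (assumption): Work Exp rises with Age, Distance does not.
demo = pd.DataFrame({
    "Age": [23, 25, 27, 30, 35],
    "Work Exp": [1, 3, 5, 8, 13],
    "Distance": [10, 4, 12, 5, 9],
})
pairs = strongest_pairs(demo)
print(pairs)
```

Highly correlated predictors such as Work Exp and Age carry overlapping information, which is worth keeping in mind when reading feature importances later.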


Figure 1-6: Gender Impact on mode of transport

- More females tend to prefer Private transport as compared to males

Figure 1-7: Work Exp Impact on mode of transport

- Employees with more work experience tend to prefer Private transport over Public
transport.


Figure 1-8: Age Impact on mode of transport

- Employees older than 30 generally prefer Private transport over Public
transport.


1.2 Data Pre-processing Prepare the data for modelling: - Outlier Detection (treat, if needed) - Feature
Engineering / drop redundant features (if needed) - Encode the data - Train-test split

Outlier Detection

Figure 1-9: Outlier Plot

• Outliers are present. For now, we keep them and proceed with model
building.
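The report does not state which rule was used to flag outliers. A common choice, and the one a box plot draws, is the 1.5 × IQR rule, sketched here under that assumption. The `iqr_outlier_mask` helper and the salary values are illustrative only.

```python
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (the box-plot whisker rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Made-up salaries (assumption); one value sits far above the rest.
salary = pd.Series([8, 9, 10, 11, 12, 13, 14, 57], name="Salary")
mask = iqr_outlier_mask(salary)
outliers = salary[mask].tolist()
print(outliers)
```

Keeping the flagged rows, as done here, is usually reasonable for tree-based ensembles because their splits depend on the ordering of values rather than on their magnitude.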

Data Split

• Shape of training set: (310, 8)
• Shape of test set: (134, 8)
• Percentage of classes in training set:

1 0.674194
0 0.325806
Name: Transport, dtype: float64
• Percentage of classes in test set:

1 0.679104
0 0.320896
Name: Transport, dtype: float64
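The shapes and class shares above are consistent with a stratified 70/30 split of the 444 records. The sketch below reproduces the shapes on synthetic data. The `random_state`, the placeholder feature names and the use of `stratify` are assumptions, since the report does not show the call that was used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 8 predictors, 300 rows of class 1, 144 of class 0.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(444, 8)), columns=[f"x{i}" for i in range(8)])
y = pd.Series(np.r_[np.ones(300, dtype=int), np.zeros(144, dtype=int)],
              name="Transport")

# stratify=y keeps the class shares nearly equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```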


1.3 Model Building - Bagging - Build a Bagging classifier - Build a Random Forest classifier - Check the
performance of the models across train and test set using different metrics and comment on the
same.

Model evaluation criterion:

The model can make two kinds of wrong prediction:

1. The model predicts that the Public mode of transport is preferred, but the employee prefers the
Private mode.
2. The model predicts that the Private mode of transport is preferred, but the employee prefers the
Public mode.

Which case is more important?

Both cases matter, because either kind of error distorts the estimate of how many employees prefer Private transport.

How to reduce the losses?

• The F1 score can be used as the evaluation metric: the higher the F1 score, the fewer False
Negatives and False Positives the model makes relative to its True Positives.
• We will use balanced class weights so that the model focuses equally on both classes.

We have created functions to calculate the different metrics and the confusion matrix so that we do not
have to repeat the same code for each model.

• The model_performance_classification_sklearn function will be used to check the
performance of the models.
• The confusion_matrix_sklearn function will be used to plot the confusion matrix.
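The bodies of these helpers are not shown in the report, so the following is only a plausible sketch of what they might look like. The signature of `model_performance_classification_sklearn` is assumed, and `confusion_counts` is a hypothetical non-plotting stand-in for `confusion_matrix_sklearn` that returns the four cell counts instead of drawing them.

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)


def model_performance_classification_sklearn(model, predictors, target):
    """Return accuracy, recall, precision and F1 as a one-row DataFrame."""
    pred = model.predict(predictors)
    return pd.DataFrame({
        "Accuracy": [accuracy_score(target, pred)],
        "Recall": [recall_score(target, pred)],
        "Precision": [precision_score(target, pred)],
        "F1": [f1_score(target, pred)],
    })


def confusion_counts(model, predictors, target):
    """Return the confusion-matrix cells, treating class 1 as positive."""
    pred = model.predict(predictors)
    tn, fp, fn, tp = confusion_matrix(target, pred, labels=[0, 1]).ravel()
    return {"TN": int(tn), "FP": int(fp), "FN": int(fn), "TP": int(tp)}
```

Both helpers accept any fitted estimator with a `predict` method, so the same two calls can be reused for every model on both the training and the test set.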

a. Bagging - Model Building


• Checking model performance on training set


• Checking model performance on test set


- As we can see, the model is overfitting here. We will try to tune the model and reduce
overfitting.
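A baseline bagging model of the kind built in this subsection can be sketched as below. The data is synthetic (`make_classification`), and `n_estimators=50` and `random_state=1` are assumed values, so the printed scores are not the report's results. The point is only to show how a gap between training and test F1 reveals overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 444 x 8 transport data (assumption).
X, y = make_classification(n_samples=444, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Default base learner is a decision tree; each tree sees a bootstrap sample.
bagging = BaggingClassifier(n_estimators=50, random_state=1)
bagging.fit(X_train, y_train)

train_f1 = f1_score(y_train, bagging.predict(X_train))
test_f1 = f1_score(y_test, bagging.predict(X_test))
print(f"train F1 = {train_f1:.3f}, test F1 = {test_f1:.3f}")
```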

b. Random Forest- Model Building


• Checking model performance on training set

• Checking model performance on test set


- As with the bagging model, the random forest model is overfitting
here. We will try to tune the model and reduce overfitting.
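A random forest with balanced class weights, as mentioned in the evaluation criteria, might be set up as in the sketch below. The synthetic data, its class proportions, `n_estimators=100` and `random_state=1` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in data (assumption).
X, y = make_classification(n_samples=444, n_features=8,
                           weights=[0.32, 0.68], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# class_weight="balanced" reweights samples inversely to class frequency.
forest = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                random_state=1)
forest.fit(X_train, y_train)

train_f1 = f1_score(y_train, forest.predict(X_train))
test_f1 = f1_score(y_test, forest.predict(X_test))
print(f"train F1 = {train_f1:.3f}, test F1 = {test_f1:.3f}, "
      f"gap = {train_f1 - test_f1:.3f}")
```

Fully grown trees tend to memorise the training set, which is why an untuned forest often shows near-perfect training scores next to weaker test scores.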


1.4 Model Improvement - Bagging - Try and improve the model performance by tuning the model
(minimum 2 parameters to be tuned) - Bagging Classifier - Random Forest Classifier - Comment
on model performance after tuning the model.

a. Hyperparameter Tuning – Bagging Classifier


• Checking model performance on test set


- The model is still found to overfit the training data, as the training metrics are high, but
the testing metrics are not.
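Tuning of this kind is commonly done with a cross-validated grid search scored on F1. The sketch below shows the pattern for the bagging classifier. The parameter grid, `cv=3` and the synthetic data are assumptions, as the report does not list the values that were searched. The same pattern carries over to the random forest by swapping the estimator and the grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=444, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Hypothetical grid: smaller samples and feature subsets make the trees less alike.
param_grid = {
    "n_estimators": [20, 50],
    "max_samples": [0.5, 0.8],
    "max_features": [0.5, 0.8],
}
search = GridSearchCV(BaggingClassifier(random_state=1), param_grid,
                      scoring="f1", cv=3)
search.fit(X_train, y_train)

tuned_bagging = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```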

b. Hyperparameter Tuning – Random Forest Classifier


- The model is still found to overfit the training data, as the training metrics are high, but
the testing metrics are not.


1.5 Model Building - Boosting - Build a Boosting classifier - Check the performance of the models
across train and test set using different metrics and comment on the same Note: AdaBoost or
GradientBoosting classifier can be built.
a. Boosting- Model Building and Hyperparameter Tuning
• Checking model performance on training set

- The confusion matrix shows 206 true positives, 3 false negatives, 32 false
positives and 69 true negatives.
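An AdaBoost model and its confusion-matrix counts can be obtained as in this sketch. It runs on synthetic data with assumed settings (`n_estimators=50`, `learning_rate=1.0`, `random_state=1`), so its counts will differ from those reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=444, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Each boosting round reweights the samples the previous learners got wrong.
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=1)
ada.fit(X_train, y_train)

# With labels=[0, 1], ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, ada.predict(X_test),
                                  labels=[0, 1]).ravel()
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```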

• Checking model performance on test set


- The confusion matrix shows 87 true positives, 4 false negatives, 17 false
positives and 26 true negatives.
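The usual metrics follow directly from these four counts. The short sketch below works them out for the test set using the standard definitions. The `metrics_from_counts` helper is introduced here for illustration only.

```python
def metrics_from_counts(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Test-set counts reported for the boosting model.
test_metrics = metrics_from_counts(tp=87, fn=4, fp=17, tn=26)
for name, value in test_metrics.items():
    print(f"{name}: {value:.3f}")
```

For these counts the result is roughly 0.843 accuracy, 0.837 precision, 0.956 recall and 0.892 F1. Recall is high because only 4 positives are missed, while the 17 false positives pull precision down.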


1.6 Model Improvement - Boosting - Try and improve the model performance by tuning the model
(minimum 2 parameters to be tuned) - Comment on model performance after tuning the model.

- On the training set, the tuned model gives 204 true positives, 5 false negatives, 40 false
positives and 61 true negatives.


- On the test set, the tuned model gives 86 true positives, 5 false negatives, 19 false
positives and 24 true negatives.


1.7 Actionable Insights & Recommendations - Compare all the models and choose the best model
with proper rationale - Conclude with the key takeaways (actionable insights and
recommendations) for the business.

Observation
- Based on the above results for all the models, the AdaBoost classifier
is expected to provide better predictions. Compared with the other models, it
shows better accuracy and precision.


- Looking at the feature importances of the AdaBoost classifier, the top three
features are Salary, Distance and Age.
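A ranking like this is typically read from the fitted model's `feature_importances_` attribute, as sketched below. The column names follow the data dictionary, but the values are synthetic, so the ranking printed by this example has no bearing on the one reported above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

feature_names = ["Age", "Gender", "Engineer", "MBA",
                 "Work Exp", "Salary", "Distance", "license"]

# Synthetic values under the real column names (assumption).
X_raw, y = make_classification(n_samples=444, n_features=8, random_state=1)
X = pd.DataFrame(X_raw, columns=feature_names)

model = AdaBoostClassifier(random_state=1).fit(X, y)

# Importances are normalised to sum to 1; a higher value means more influence.
importances = (pd.Series(model.feature_importances_, index=feature_names)
               .sort_values(ascending=False))
print(importances.head(3))
```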

Actionable Insights and Recommendations:

- The important variables are Salary, Age, Work Exp and Distance.
- Age and Work Exp are correlated.
- Employees with higher salaries prefer Private transport. However, there are outliers in
the dataset.
- Employees older than 30 generally prefer Private transport over Public
transport.
- Employees with more work experience tend to prefer the Private mode of transport. There
are outliers among Public transport users with high work experience.


2. Problem 2
Context

A dataset of Shark Tank episodes is made available. It contains 495 entrepreneurs making their pitch
to the VC sharks. You will ONLY use “Description” column for the initial text mining exercise.

2.1 Data Preparation Data preparation and exploratory data analysis - Pick out the
Deal (Dependent Variable) and Description columns into a separate dataframe -
Create two corpora - one with those who secured a deal and the other with those
who did not secure a deal - Find the number of characters for both the corpuses
Text preprocessing on corpora which secured the deal

a. Data Description

Table 2-1:head of the dataset (Part 1)


Table 2-2:head of the dataset (Part 2)

Table 2-3:Shape of the data


Table 2-4:Dataset type

Table 2-5:Dataset Information


Table 2-6:Null value of Dataset

- There are 495 rows and 19 columns.
- The dataset contains 2 Boolean, 5 integer and 12 object columns.
- There are null values in the entrepreneur and website columns. However, as we will
not be using these columns for our study, we can leave them as they are.


2.1.1 Pick out the Deal (Dependent Variable) and Description columns into a separate data
frame.
- The new dataframe “df2” has 495 rows and 2 columns, i.e., Deal and Description.

2.1.2 Create two corpora - one with those who secured a deal and the other with those who did not
secure a deal
We created two corpora. Corpus 1: deal secured; Corpus 2: deal not secured.

2.1.3 Find the number of characters for both corpora

- The corpus of pitches that secured a deal contains 45002 characters.


- The corpus of pitches that did not secure a deal contains 47184 characters.
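Steps 2.1.1 to 2.1.3 can be sketched as follows. The three descriptions are made up, and the sketch assumes that the Deal column is Boolean and that descriptions are joined with a single space. The report does not say how the texts were concatenated, and that choice changes the character count slightly.

```python
import pandas as pd

# Made-up stand-in for the two selected columns (assumption).
df2 = pd.DataFrame({
    "Deal": [True, False, True],
    "Description": ["Organic snack bars.", "A smart umbrella.",
                    "Custom phone cases."],
})

# One corpus per outcome: join every description with the same Deal value.
deal_corpus = " ".join(df2.loc[df2["Deal"], "Description"])
no_deal_corpus = " ".join(df2.loc[~df2["Deal"], "Description"])

print(len(deal_corpus), len(no_deal_corpus))
```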

2.1.4 Text pre-processing on corpora which secured the deal.

We'll be doing text preprocessing on the corpus for those who secured the deal

a. Removal of http links

b. De-contraction of words

c. Tokenization


d. Lowercasing: Lowercasing all the text data, although commonly overlooked, is one of the
simplest and most effective forms of text preprocessing.

e. Removal of Punctuation

• Removal of stop words:

- Stop words are a set of commonly used words in a language.

- Examples of stop words in English are “a”, “the”, “is”, “are” etc. The intuition behind
removing stop words is that, by dropping low-information words from the text, we can focus on
the important words instead.


f. Lemmatization

- Lemmatization on the surface is very similar to stemming, where the goal is to
remove inflections and map a word to its root form.

g. Normalization (aggregating pre-processing function into one):
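A single normalization function that chains steps a to e and the stop-word removal might look like the sketch below. It uses only the standard library. The contraction map and the stop-word set are small illustrative samples rather than the lists used in the project, and lemmatization is deliberately left out because it needs an external resource such as NLTK's WordNetLemmatizer.

```python
import re
import string

# Small illustrative samples (assumptions), not complete lists.
CONTRACTIONS = {"won't": "will not", "can't": "cannot", "n't": " not",
                "'re": " are", "'ll": " will", "'ve": " have"}
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "of", "to", "in",
              "for", "that", "it", "we"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> list:
    """Apply link removal, de-contraction, tokenization, lowercasing,
    punctuation removal and stop-word removal, in that order."""
    text = re.sub(r"https?://\S+", " ", text)               # a. remove http links
    for short, full in CONTRACTIONS.items():                # b. expand contractions
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    tokens = text.split()                                   # c. whitespace tokens
    tokens = [t.lower() for t in tokens]                    # d. lowercase
    tokens = [t.translate(PUNCT_TABLE) for t in tokens]     # e. strip punctuation
    return [t for t in tokens if t and t not in STOP_WORDS]  # drop stop words


cleaned = normalize("We can't wait: visit https://example.com for the BEST designs!")
print(cleaned)
```

Expanding contractions before stripping punctuation matters here, because the apostrophe is what identifies a contraction.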


2.2 Insight Generation - Create a wordcloud of common words used by companies who secure a deal
- Provide insights from the preprocessed data.

- From the word cloud, we can say that entrepreneurs who secured a deal used words
like ‘product’, ‘make’, ‘design’, ‘online’, ‘offer’ and ‘need’, along with other positive and
product-descriptive words, to attract the customer’s interest and so secure the deal.
- Hence, to improve the chances of a deal, a pitch should use more words that attract
the customer’s interest, especially product- and design-oriented words.
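A word cloud is a picture of word frequencies, so the same insight can be cross-checked with a plain count. The sketch below uses `collections.Counter` on a made-up token list. With the third-party wordcloud package, a frequency dictionary of this kind could be passed to `WordCloud.generate_from_frequencies` to draw the image. The `top_words` helper is hypothetical.

```python
from collections import Counter


def top_words(tokens: list, n: int = 10) -> list:
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)


# Made-up tokens (assumption); in practice these come from the preprocessed corpus.
tokens = ["product", "design", "product", "online", "make", "product", "design"]
most_common = top_words(tokens, 2)
print(most_common)  # [('product', 3), ('design', 2)]
```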

2.3 Business Report Quality - Adhere to the business report checklist
