
GREAT LEARNING

Power Ahead

MACHINE LEARNING

SHREYA PRAKASH

PGP – DSBA ONLINE

MARCH ‘22

DATE – 03/08/2022

Table of Contents

Problem Part 1

INTRODUCTION

SAMPLE OF DATASET

EXPLORATORY DATA ANALYSIS

Q1 Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers
and missing values treatment (if necessary) and check the basic descriptive statistics of the
dataset.

Q2 Split the data into train and test in the ratio 70:30. Is scaling necessary or not?

Q3 Build the following models on the 70% training data and check the performance of these
models on the Training as well as the 30% Test data using the various inferences from the
Confusion Matrix and plotting an AUC-ROC curve along with the AUC values. Tune the models
wherever required for optimum performance:
a. Logistic Regression Model
b. Linear Discriminant Analysis
c. Decision Tree Classifier – CART model
d. Naïve Bayes Model
e. KNN Model
f. Random Forest Model
g. Boosting Classifier Model using Gradient Boost

Q4 Which model performs the best?

Q5 What are your business insights?

Part 2:

A dataset of Shark Tank episodes is made available. It contains 495 entrepreneurs making their
pitch to the VC sharks. You will ONLY use “Description” column for the initial text mining
exercise.

Q1 Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.

Q2 Create two corpora, one with those who secured a Deal, the other with those who did not
secure a deal.

Q3 The following exercise is to be done for both the corpora:


a) Find the number of characters for both the corpuses.
b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’
and ‘company’ are to be removed)
c) What were the top 3 most frequently occurring words in both corpuses (after removing stop
words)?
d) Plot the Word Cloud for both the corpora.

Q4 Refer to both the word clouds. What do you infer?

Q5 Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less
likely to secure a deal based on your analysis?

LIST OF FIGURES
Scatter Plot
Pairplot
Correlation HeatMap
Boxplot
Scree Plot

LIST OF TABLES
Dataset Sample
Contingency Table

Problem 1

INTRODUCTION

Here we are analysing which mode of transport is chosen by the employees of ABC Company. The
decision is based on parameters such as age, salary and work experience in the dataset
'Transport.csv'. In this project we build several machine learning models and compare them so
that we can find the best one.

SAMPLE OF DATASET

EXPLORATORY DATA ANALYSIS


Question 1

Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and missing
values treatment (if necessary) and check the basic descriptive statistics of the dataset.

Answer:-

As checked with the info function, there are 9 attributes in the dataset: 2 of float, 5 of int and 2
of object data type.

The dataset has a total of 444 rows and 9 columns.

From the descriptive statistics we can see that license has the lowest mean.
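A minimal sketch of these checks, assuming the data is loaded as below (the file path and variable names are illustrative):

import pandas as pd

df = pd.read_csv('Transport.csv')  # load the dataset
df.info()                          # data types and non-null counts per column
print(df.shape)                    # (rows, columns), here (444, 9)
print(df.describe())               # basic descriptive statistics of the numeric columns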

Checking NULL values

As checked, there are no NULL values.

Checking duplicate values

There are no duplicate values either.

Checking outliers

As checked, there are outliers, so we treat them next.
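A common treatment is IQR-based capping; a minimal sketch follows, where the list of columns is an assumption based on the boxplots discussed below:

# Cap each numeric column at the usual IQR whiskers (Q1 - 1.5*IQR and Q3 + 1.5*IQR)
for col in ['Age', 'Work Exp', 'Salary', 'Distance']:  # assumed column names
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)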

Univariate Analysis

As checked, most employees are roughly 23 to 28 years old, and there are more male employees than
female ones.

There are more Engineers than non-Engineers, and fewer MBAs than non-MBAs.

Work experience ranges from 0 to a maximum of 8 years. Salary mostly falls between 8 and 17-18 lakhs.

Distance follows a roughly normal distribution. More employees hold a license than do not.

Boxplot

As seen, Age, Work Exp, Salary and Distance have outliers.

Bivariate Analysis

Age, Work Exp and Salary are positively correlated.

Question 2

Split the data into train and test in the ratio 70:30. Is scaling necessary or not?

Yes, scaling is necessary: the variables have very different ranges, so we scale the data to bring
them onto one common scale.

Now we split the data 70:30. Transport is the dependent variable, so we split the data accordingly.
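A minimal sketch of the split and scaling, assuming the target column is named 'Transport' and the categorical columns have already been encoded to numbers:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop('Transport', axis=1)  # predictors
y = df['Transport']               # dependent variable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)  # 70:30 split
scaler = StandardScaler().fit(X_train)     # fit the scaler on the training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)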

Question 3

Build the following models on the 70% training data and check the performance of these models on the
training as well as the 30% test data, using the various inferences from the confusion matrix and
plotting an AUC-ROC curve along with the AUC values. Tune the models wherever required for optimum
performance:
a. Logistic Regression Model
b. Linear Discriminant Analysis
c. Decision Tree Classifier – CART model
d. Naïve Bayes Model
e. KNN Model
f. Random Forest Model
g. Boosting Classifier Model using Gradient Boost

Answer

Here we build the different models and compare their accuracy on the training and test sets,
following the common pattern sketched below.
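A minimal sketch of that pattern, with default hyperparameters as an assumption (each model can be tuned further where required):

from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'LDA': LinearDiscriminantAnalysis(),
    'CART': DecisionTreeClassifier(random_state=1),
    'Naive Bayes': GaussianNB(),
    'KNN': KNeighborsClassifier(),
    'Random Forest': RandomForestClassifier(random_state=1),
    'Gradient Boosting': GradientBoostingClassifier(random_state=1),
}
for name, model in models.items():
    model.fit(X_train_s, y_train)  # train on the 70% split
    print(name,
          'train:', model.score(X_train_s, y_train),  # training accuracy
          'test:', model.score(X_test_s, y_test))     # test accuracy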

Logistic Regression

We are using logistic regression to find the accuracy of the model.

Logistic Regression Accuracy

We can see the accuracy is 0.7611; the classification report also shows an accuracy of 0.76.

We can also check the model coefficients.

Through the coefficients we can see:

A) Gender, Age and MBA have a positive impact on the target variable.
B) Work Exp, Salary and Distance have a negative impact on the target variable.
C) Engineer and license have almost no impact.
D) The train and test accuracies are about the same, so there is no sign of overfitting.
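A minimal sketch of reading those coefficients, assuming the fitted LogisticRegression from the sketch above:

import pandas as pd

logreg = models['Logistic Regression']               # fitted LogisticRegression
coefs = pd.Series(logreg.coef_[0], index=X.columns)  # one coefficient per feature
print(coefs.sort_values())                           # the sign shows the direction of impact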

Linear Discriminant Analysis

Here we are using Linear Discriminant Analysis to find the accuracy. We can see the accuracy is
almost the same as for logistic regression.

Here we can see the training set accuracy is 76% and the test set accuracy is 78%.

Now let's check the accuracy using a Decision Tree.

The accuracy through the Decision Tree Classifier is 0.73. We can verify the same through the
classification report.

Checking the confusion matrix for both the training and test sets

Classification report for both the training and test sets

The training and test accuracies are 1 and 0.73 respectively, so there is a chance of overfitting.
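A minimal sketch of this train-versus-test comparison, assuming the fitted CART model from the earlier sketch:

from sklearn.metrics import confusion_matrix, classification_report

cart = models['CART']  # fitted DecisionTreeClassifier
for split, (Xs, ys) in {'train': (X_train_s, y_train),
                        'test': (X_test_s, y_test)}.items():
    pred = cart.predict(Xs)
    print(split, 'accuracy:', cart.score(Xs, ys))  # 1.0 vs ~0.73 signals overfitting
    print(confusion_matrix(ys, pred))              # rows: actual, columns: predicted
    print(classification_report(ys, pred))         # precision, recall and F1 per class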

GaussianNB

The accuracy of GaussianNB is 0.813.

The test and training set accuracies are 0.76 and 0.81 respectively.

Confusion matrix for the test and train data

KNeighborsClassifier

Now we check the classification report.

The training and test set accuracies are 0.85 and 0.78 respectively.

Confusion matrix for KNN

Through the confusion matrix we can read off the true positives, false positives, true negatives and
false negatives.
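A minimal sketch of reading those counts for a binary target, assuming the fitted KNN model from the earlier sketch and a 0/1-encoded target:

from sklearn.metrics import confusion_matrix

knn = models['KNN']  # fitted KNeighborsClassifier
tn, fp, fn, tp = confusion_matrix(
    y_test, knn.predict(X_test_s)).ravel()  # binary case: the 2x2 matrix flattened
print('TP:', tp, 'FP:', fp, 'TN:', tn, 'FN:', fn)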

Random Forest Model

We can see the model accuracy is 0.79.

Looking at the classification reports of the training set and the test data, the training set
accuracy is 1 and the test data accuracy is 0.79.

Confusion matrix of the Random Forest Classifier

Gradient Boosting

The accuracy of the model is 0.73.

The classification reports show a training set accuracy of 0.95 and a test set accuracy of 0.73.

Confusion matrix of Gradient Boosting

AUC-ROC Curve

We can find the AUC-ROC score for all the models.

We can also plot the ROC curve for all the models, as sketched below.
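A minimal sketch for one model; looping over the models dictionary above extends it to all of them. It assumes the target is 0/1-encoded with 1 as the positive class:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

probs = models['Random Forest'].predict_proba(X_test_s)[:, 1]  # P(class 1)
print('AUC:', roc_auc_score(y_test, probs))
fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label='Random Forest')
plt.plot([0, 1], [0, 1], linestyle='--')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()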

Question 4

Which model performs the best?

Answer:-

We compare model performance on both the training and test sets. The performance of both can change
with the features used, so we should know how to extract features and which features to use together
to obtain the maximum business benefit.

Here we can see that GaussianNB has the best accuracy, followed by the Random Forest Classifier and
then the KNN model. With respect to accuracy, GNB is the best model.

Question 5

What are your business insights?

Answer:- Age, Work Exp and Salary contribute the most, so these parameters should be monitored when
looking at the choice of transport.

Work should be done to encourage non-MBA employees to choose the new transport medium, as they are
currently using private transport.

Part 2

A dataset of Shark Tank episodes is made available. It contains 495 entrepreneurs making their pitch to
the VC sharks. You will ONLY use “Description” column for the initial text mining exercise.

Question 1

Pick out the Deal (Dependent Variable) and Description columns into a separate data frame.

Data description

Checking NULL values

Removing Null values

Checking duplicate values

There are no duplicate rows in the data frame.

Here we make a new data frame named data and place the deal and description columns into it.
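A minimal sketch of this step; the source file name and the exact column names ('deal', 'description') are assumptions:

import pandas as pd

shark = pd.read_csv('Shark Tank Companies.csv')  # assumed file name
data = shark[['deal', 'description']].copy()     # keep only the two relevant columns
data = data.dropna()                             # remove null values
print(data.shape)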

Question 2

Create two corpora, one with those who secured a Deal, the other with those who did not secure a deal.

Here we create two corpora: one for those who secured a deal and one for those who did not.

First we make two data frames: df_true for those who secured a deal and df_false for those who did
not. We split the data dataframe on the deal column: rows whose deal value is True go into df_true,
and rows whose deal value is False go into df_false.

The original data dataframe has 495 rows in total.
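A minimal sketch of the split, continuing from the data frame above:

df_true = data[data['deal'] == True]    # pitches that secured a deal
df_false = data[data['deal'] == False]  # pitches that did not
print(len(df_true), len(df_false))      # 251 and 244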

After making the corpora, we can see there are 251 rows in df_true and 244 rows in df_false.

Let's check the data frame of those who secured a deal.

The df_true data frame, for those who secured a deal, has True for every value of the deal column.

Likewise, the df_false data frame, for those who did not secure a deal, has False for every value of
the deal column.

Question 3

The following exercise is to be done for both the corpora:
a) Find the number of characters for both the corpuses.
b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’
and ‘company’ are to be removed)
c) What were the top 3 most frequently occurring words in both corpuses (after removing stop words)?
d) Plot the Word Cloud for both the corpora.

a) Find number of characters for both the corpuses.

Answer

Here we first drop the deal column from both the df_true and df_false data frames and then calculate
the total number of characters.

df_true after dropping the deal column

df_false after dropping the deal column

Now, using the sum function, we can calculate the total number of characters in both corpora, as
sketched below.
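A minimal sketch of the count, continuing from the two data frames above:

chars_true = df_true['description'].str.len().sum()    # total characters, deal secured
chars_false = df_false['description'].str.len().sum()  # total characters, no deal
print(chars_true, chars_false)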

b) Remove Stop Words from the corpora. (Words like ‘also’, ‘made’, ‘makes’, ‘like’, ‘this’, ‘even’ and
‘company’ are to be removed)

Answer

Here we have to remove the stop words from both corpora: the one that cracked the deal and the one
that did not.

First we remove the stop words from df_true, which cracked the deal, using the description column.
Below is the data frame after removing the stop words.

Now we do the same for df_false, which did not crack the deal, again using the description column to
remove the stop words.
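A minimal sketch of the removal, using NLTK's English stop word list extended with the words the question names; nsw_true and nsw_false are the names used for the cleaned text below:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download
stop = set(stopwords.words('english'))
stop.update(['also', 'made', 'makes', 'like', 'this', 'even', 'company'])

def remove_stop_words(text):
    # keep only the words that are not in the stop word list
    return ' '.join(w for w in text.lower().split() if w not in stop)

nsw_true = df_true['description'].apply(remove_stop_words)
nsw_false = df_false['description'].apply(remove_stop_words)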

c) What were the top 3 most frequently occurring words in both corpuses (after removing stop
words)?

Answer

Here we check the 3 most frequently used words in both corpora after removing the stop words.

We use the nltk.FreqDist() function to get the word frequencies for each corpus, and then the
freq.most_common(3) function to pick the 3 most common words from each of the cleaned data frames,
nsw_false and nsw_true.

The first row shows the 3 most common words in nsw_false and the second row shows the 3 most common
words in nsw_true.
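A minimal sketch of that frequency check, continuing from the cleaned series above:

import nltk

freq_true = nltk.FreqDist(' '.join(nsw_true).split())    # word counts, deal secured
freq_false = nltk.FreqDist(' '.join(nsw_false).split())  # word counts, no deal
print(freq_false.most_common(3))
print(freq_true.most_common(3))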

d) Plot the Word Cloud for both the corpora.

Answer

A word cloud is a data visualization technique for representing text data, in which the size of each
word indicates its frequency or importance.

Here we plot the word cloud for both corpora: the one that cracked the deal and the one that did not.
First we look at the word cloud of those who cracked the deal.

We combine all the words of those who cracked the deal into one string using the join function, use
the WordCloud function to visualize the data, and use the generate function to build the cloud of
words, as sketched below.
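A minimal sketch of the plot, using the wordcloud package together with matplotlib (their availability is an assumption):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text_true = ' '.join(nsw_true)  # all deal-securing descriptions as one string
wc = WordCloud(background_color='white').generate(text_true)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()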

Word cloud of the true corpus

Now the word cloud of those who did not crack the deal.

We combine all the words of those who did not crack the deal into one string using the join function,
use the WordCloud function to visualize the data, and again use the generate function to build the
cloud of words.

Word cloud of those who did not get the deal

Question 4

Refer to both the word clouds. What do you infer?

Answer

The word cloud of those who secured a deal contains words like 'one', 'design', 'free', 'children',
'offer', 'easy', 'online' and 'use'. These words indicate that pitches catering to what children
like, and pitches that provided offers or a free sample/product, were easy to use, had a good design
and were unique in their creativity, were more likely to get a deal.

The word cloud of those who did not secure a deal contains words such as 'one', 'designed', 'help',
'device', 'bottle', 'premium' and 'use'. These words indicate that pitches with a mediocre design, a
low chance of solving a real problem, products involving water bottles, a higher, premium price and
lower usability had less chance of securing a deal.

It is observed that words such as 'one', 'designed', 'system' and 'use' have a high weight in both
word clouds. This indicates that either these were not the deciding factors in whether a deal was
cracked, or they were used in a different context in the descriptions in each case.

Question 5

Looking at the word clouds, is it true that the entrepreneurs who introduced devices are less likely to
secure a deal based on your analysis?

Answer

The word 'device' is hard to find in the word cloud of secured deals, while it stands out in the word
cloud of deals that were not secured. This indicates that the word 'device' occurred frequently when
a deal was rejected or not made, which supports the statement in the question: based on this
analysis, entrepreneurs who introduced devices are less likely to secure a deal.

