Professional Documents
Culture Documents
Machine Learning Business Report PDF
Machine Learning Business Report PDF
Submitted to:
Great Learning
Submitted by:
Manoj Kumbhare
PGPDSBA Online July_C 2021
Data Science and Business Analytics.
Contents
Introduction:..................................................................................................................................... 5
1. Problem 1 ..................................................................................................................................... 5
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check.
Write an inference on it. ............................................................................................................... 6
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers. ........................................................................................................................................ 8
Univariate Analysis ................................................................................................................... 8
Bivariate Analysis .................................................................................................................... 12
Multivariate Analysis .............................................................................................................. 15
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not?
Data Split: Split the data into train and test (70:30). ................................................................... 17
1.4 Apply Logistic Regression and LDA (linear discriminant analysis). ................................... 18
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results. ................................... 24
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting. . 27
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final
Model: Compare the models and write inference which model is best/optimized. .................... 37
1.8 Based on these predictions, what are the insights? .............................................................. 39
2. Problem 2 ................................................................................................................................... 40
2.1 Find the number of characters, words, and sentences for the mentioned documents. ........ 40
2.2 Remove all the stopwords from all three speeches............................................................... 45
2.3 Which word occurs the most number of times in his inaugural address for each president?
Mention the top three words. (after removing the stopwords) .................................................. 49
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords) [
refer to the End-to-End Case Study done in the Mentored Learning Session ] ............................ 52
LIST OF TABLES
Table 1: Data dictionary for Election Dataset ..................................................................................... 6
Table 2: Model Comparison ............................................................................................................. 38
LIST OF EQUATION
No table of figures entries found.
LIST OF FIGURES
Figure 1: Libraries Used ..................................................................................................................... 5
Figure 2: Election Dataset Head ......................................................................................................... 6
Figure 3: Election Dataset Tail ............................................................................................................ 6
Figure 4: Descriptive Statistics of Election Dataset ............................................................................. 6
Figure 5: Info of Election Dataset ....................................................................................................... 7
Figure 6: Null value check for Election Dataset................................................................................... 7
Figure 7: Univariant Analysis of age ................................................................................................... 8
Figure 8: Univariant Analysis of vote .................................................................................................. 8
Figure 9: Univariant Analysis of economic.cond.national ................................................................... 9
Figure 10: Univariant Analysis of economic.cond.household .............................................................. 9
Figure 11: Univariant Analysis of Blair .............................................................................................. 10
Figure 12: Univariant Analysis of Hague ........................................................................................... 10
Figure 13: Univariant Analysis of Europe .......................................................................................... 11
Figure 14: Univariant Analysis of political.knowledge....................................................................... 11
Figure 15: Univariant Analysis of gender .......................................................................................... 12
Figure 16: Vote v/s other variables graphs ....................................................................................... 13
Figure 17: Gender v/s other variables Graphs .................................................................................. 14
Figure 18: Pair plot for dataset Education ........................................................................................ 15
Figure 19: Heatmap for Dataset Education ...................................................................................... 16
Figure 20: Box plot for Dataset Education ........................................................................................ 16
Figure 21: Logistic Regression Confusion Matrices ........................................................................... 18
Figure 22: Logistic Regression Classification Reports ........................................................................ 18
Figure 23: Logistic Regression AUC and ROC curve ........................................................................... 18
Figure 24: Logistic Regression_Importance feature .......................................................................... 19
Figure 25: Tuned Logistic Regression Confusion Matrices ................................................................ 20
Figure 26: Tuned Logistic Regression Clssification Reports ............................................................... 20
Figure 27: Tuned Logistic Regression AUC-ROC curves ..................................................................... 20
Figure 28: Tuned Logistic Regression_Importance feature ............................................................... 21
Figure 29: LDA- Confusion Matrices ................................................................................................. 22
Figure 30: LDA - Classification Reports ............................................................................................. 22
Figure 31: LDA - AUC-ROC curve ...................................................................................................... 22
Figure 32: Tuned LDA - Confusion Matrices ..................................................................................... 23
Figure 33: Tuned LDA Classification Reports .................................................................................... 23
Figure 34: Tuned LDA - AUC ROC curve ............................................................................................ 23
Figure 35: KNN Model- Confusion Matrices ..................................................................................... 24
Figure 36: KNN Model- Classification report .................................................................................... 24
Figure 37: KNN Model - AUC-ROC curves ......................................................................................... 24
Figure 38: Tuned KNN Model - Confusion Matrices .......................................................................... 25
Figure 39: Tuned KNN Model - Classification Reports ....................................................................... 25
Figure 40: Tuned KNN Model- AUC-ROC curve ................................................................................. 25
Figure 41: Naïve Model - Confusion Matrices................................................................................... 26
Figure 42: Naïve Model - Classification Reports ............................................................................... 26
Figure 43: Naïve Model - AUC-ROC curves ....................................................................................... 26
Figure 44: Bagging- Confusion Matrices ........................................................................................... 27
Figure 45: Bagging - Classification reports ........................................................................................ 27
Figure 46: Bagging - AUC-ROC curves ............................................................................................... 27
Figure 47: Bagging tuned - Confusion Matrices ................................................................................ 28
Figure 48: Bagging tuned - Classification Reports ............................................................................. 28
Figure 49: Bagging tuned - AUC-ROC curves..................................................................................... 28
Figure 50: Ada Boosting - Confusion Matrices .................................................................................. 29
Figure 51: Ada Boosting - Classification reports ............................................................................... 29
Figure 52: Ada Boosting- AUC-ROC curves ....................................................................................... 29
Figure 53: Ada Boosting- Importance of features ............................................................................. 30
Figure 54: Tuned Ada Boosting- Confusion Matrices ........................................................................ 31
Figure 55: Tuned Ada Boosting- Classification Reports ..................................................................... 31
Figure 56: Tuned Ada Boosting- AUC-ROC curves............................................................................. 31
Figure 57: Tuned Ada Boosting- Importance features ...................................................................... 32
Figure 58: Gradient Boosting- Confusion Matrices ........................................................................... 33
Figure 59: Gradient Boosting- Classification Report ......................................................................... 33
Figure 60: Gradient Boosting- AUC-ROC Curve ................................................................................. 33
Figure 61: Gradient Boosting- Importance Features ......................................................................... 34
Figure 62: Tuned Gradient Boosting- Confusion Report ................................................................... 35
Figure 63: Tuned Gradient Boosting-Classification Reports .............................................................. 35
Figure 64: Tuned Gradient Boosting-AUC-ROC Curves...................................................................... 35
Figure 65: Tuned Gradient Boosting-Importance Features ............................................................... 36
Figure 66: All Models AUC-ROC curves for training data................................................................... 37
Figure 67: All Models AUC-ROC curves for testing data .................................................................... 38
Figure 68: President Speech Data frame .......................................................................................... 40
Figure 69: President Roosevelt-1941 Speech ................................................................................... 42
Figure 70: President Kennedy-1961 Speech ..................................................................................... 43
Figure 71: President Nixon-1973 Speech after ................................................................................. 44
Figure 72: Counts after removing stop word. ................................................................................... 45
Figure 73: Word Count after removing Stop Words ......................................................................... 45
Figure 74: Character Count after removing Stop words .................................................................... 46
Figure 75: Sentence Count after removing stop words ..................................................................... 46
Figure 76: President Roosevelt-1941 Speech after removing stop words ......................................... 47
Figure 77: President Kennedy-1961 Speech after removing stop words ........................................... 47
Figure 78: President Nixon-1973 Speech after removing stop words ................................................ 48
Figure 79: Inagural speech of Roosevelt........................................................................................... 49
Figure 80: Inagural speech of Kennedy ............................................................................................ 49
Figure 81: Inagural speech of Nixon ................................................................................................. 50
Figure 82: Inagural speech of Roosevelt- after 2nd clean up ............................................................ 50
Figure 83: Inagural speech of Kennedy- after 2nd clean up .............................................................. 51
Figure 84: Inagural speech of Nixon- after 2nd clean up................................................................... 51
Figure 85: President Roosevelt-1941: Word Cloud ........................................................................... 52
Figure 86: President Kennedy-1961: Word Cloud ............................................................................. 53
Figure 87: President Nixon-1973: Word Cloud ................................................................................. 54
Introduction:
This Business report is prepared for project of Predictive Modelling. Under this report we are
analysing 2 different problem statement with the help of question asked.
All problem statement section will contain problem statement, Exploratory Data Analysis (EDA),
Descriptive Data Analysis, Null value check and Answers to the questions. Tables and graphs are
added as an image for the better references and data visibility.
Tool used for solution is Jupyter Notebook (server version: 6.3.0) with Python 3.8.8.
1. Problem 1
You are hired by one of the leading news channels CNBE who wants to analyze recent elections. This
survey was conducted on 1525 voters with 9 variables. You have to build a model, to predict which
party a voter will vote for on the basis of the given information, to create an exit poll that will help in
predicting overall win and seats covered by a particular party.
Data Dictionary:
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an
inference on it.
Univariate Analysis
By Seeing the above hist plot we can say age column is normally distributed.
There are no outlier present in age column.
Skewness for age column is 0.14.
On the scale of 1 to 5 (where 5 is greater), approx. 81% voters rated 3 or greater than 3.
On the scale of 1 to 5 (where 5 is greater), approx. 75% voters rated 3 or greater than 3.
Figure 11: Univariant Analysis of Blair
Blair is the labour party leader, 65% voter rated him 4 or more than 4.
We can say labour party is more popular among voters.
Hague is the Conservative party leader, 41% voter rated him 4 or more than 4.
We can say Conservative party is not much popular among voters.
Figure 13: Univariant Analysis of Europe
On the scale of 1 to 11 (where 11 is greater), we can say that the ‘Eurosceptic’ sentiment is
high.
On the scale of 0 to 3 (where 3 is greater), we can say 65% present voter ranked 2-3.
Figure 15: Univariant Analysis of gender
Bivariate Analysis
Box Plot
Above are the importance feature which play major role in decision.
Above are the importance feature which play major role in decision.
KNN Model
Model perform good in training data but testing data performance is low.
We can see significant drop in testing AUC-ROC curves.
We can say our KNN model is not stable and overfitting.
Tuned KNN Model
We can see tuned KNN model are more stable than basic model.
F1 score and Recall value are close to each other.
AUC-ROC curves are almost similar.
AUC values are also close to each other.
Hence, we can say tuned KNN model works more stable than basic KNN model.
Naïve Model
Bagging model perform good on training data but not on testing data.
We can see drop in AUC-ROC curve of testing data as compare to training data.
Hence, we can say our model in overfitting.
Bagging tuned using Grid
Model Performance are similar for both training and testing data set.
AUC scores are very close to each other.
F1 score and recall values are also close to each other.
Hence, we can say model is stable.
Figure 53: Ada Boosting- Importance of features
Model Performance are similar for both training and testing data set.
AUC scores are very close to each other.
F1 score and recall values are also close to each other.
Hence, we can say model is stable.
Both Tuned Ada boosting and Basic Ada boosting model are stable and their performance
are also almost similar.
Figure 57: Tuned Ada Boosting- Importance features
Above is the feature importance graph for Tuned Ada boosting model.
Variable Hague is most significant variable and gender is least important variable in decision
making.
Gradient Boosting
Above is the feature importance graph for Tuned Gradient Boosting model.
Variable Hague is most significant variable and gender is least important variable in decision
making.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare
the models and write inference which model is best/optimized.
From the above plot, we can clearly see Bagging model perform good on train data followed by
gradient Boost, Tuned gradient boost and KNN models.
Figure 67: All Models AUC-ROC curves for testing data
From the above plot, we can say Gradient Boost and Tuned Gradient Boost model perform well on
testing data set followed by Tuned LDA and LDA model.
we can clearly see Basic bagging model is having the highest accuracy.
As well as, bagging model is capturing the Maximum True Negative and True Positive values
in the dataset.
1.8 Based on these predictions, what are the insights?
From the above comparison we can say Basic Bagging model is the finest model among all 13
models.
We have built our model for vote – Labour. Below are the insights based on our models.
2.1 Find the number of characters, words, and sentences for the mentioned documents.
Characters Count
Sentence Count
Note : All above counts are with stop word and spacing.
President Roosevelt-1941 Speech
The word count in President Roosevelt-1941 speech (after removing stop words) is 627.
The word count in President Kennedy-1961 speech (after removing stop words) is 693.
The word count in President Nixon-1973 speech (after removing stop words) is 833.
Character Count after removing Stop words
The Character count in President Roosevelt-1941 speech (after removing stop words) is
4590.
The Character count in President Kennedy-1961 speech (after removing stop words) is 4771.
The Character count in President Nixon-1973 speech (after removing stop words) is 5950.
The sentence count in President Roosevelt-1941 speech (after removing stop words) is 1.
The sentence count in President Kennedy-1961 speech (after removing stop words) is 1.
The sentence count in President Nixon-1973 speech (after removing stop words) is 1.
President Roosevelt-1941 Speech after removing stop words
Note : the word ‘let’ and ‘us’ is more like unwanted words for us. We will remove it.
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords) [ refer
to the End-to-End Case Study done in the Mentored Learning Session ]
By seeing above word cloud, we can say president Roosevelt mostly talked about nation, spirit,
people, life etc.
President Kennedy-1961: Word Cloud
By seeing above word cloud, we can say president Kennedy mostly talked about world, sides, new,
nation etc.
President Nixon-1973: Word Cloud
By seeing above word cloud, we can say president Nixon mostly talked about America, nation, world,
peace, new, responsibility etc.