Machine Learning Business Report PDF

Machine Learning Project
Submitted to:
Great Learning
Submitted by:
Manoj Kumbhare
PGPDSBA Online July_C 2021
Data Science and Business Analytics.
Contents
Introduction:..................................................................................................................................... 5
1. Problem 1 ..................................................................................................................................... 5
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check.
Write an inference on it. ............................................................................................................... 6
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers. ........................................................................................................................................ 8
Univariate Analysis ................................................................................................................... 8
Bivariate Analysis .................................................................................................................... 12
Multivariate Analysis .............................................................................................................. 15
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not?
Data Split: Split the data into train and test (70:30). ................................................................... 17
1.4 Apply Logistic Regression and LDA (linear discriminant analysis). ................................... 18
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results. ................................... 24
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting. . 27
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final
Model: Compare the models and write inference which model is best/optimized. .................... 37
1.8 Based on these predictions, what are the insights? .............................................................. 39
2. Problem 2 ................................................................................................................................... 40
2.1 Find the number of characters, words, and sentences for the mentioned documents. ........ 40
2.2 Remove all the stopwords from all three speeches............................................................... 45
2.3 Which word occurs the most number of times in his inaugural address for each president?
Mention the top three words. (after removing the stopwords) .................................................. 49
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords) [
refer to the End-to-End Case Study done in the Mentored Learning Session ] ............................ 52
LIST OF TABLES
Table 1: Data dictionary for Election Dataset ..................................................................................... 6
Table 2: Model Comparison ............................................................................................................. 38
LIST OF EQUATION
No table of figures entries found.
LIST OF FIGURES
Figure 1: Libraries Used ..................................................................................................................... 5
Figure 2: Election Dataset Head ......................................................................................................... 6
Figure 3: Election Dataset Tail ............................................................................................................ 6
Figure 4: Descriptive Statistics of Election Dataset ............................................................................. 6
Figure 5: Info of Election Dataset ....................................................................................................... 7
Figure 6: Null value check for Election Dataset................................................................................... 7
Figure 7: Univariant Analysis of age ................................................................................................... 8
Figure 8: Univariant Analysis of vote .................................................................................................. 8
Figure 9: Univariant Analysis of economic.cond.national ................................................................... 9
Figure 10: Univariant Analysis of economic.cond.household .............................................................. 9
Figure 11: Univariant Analysis of Blair .............................................................................................. 10
Figure 12: Univariant Analysis of Hague ........................................................................................... 10
Figure 13: Univariant Analysis of Europe .......................................................................................... 11
Figure 14: Univariant Analysis of political.knowledge....................................................................... 11
Figure 15: Univariant Analysis of gender .......................................................................................... 12
Figure 16: Vote v/s other variables graphs ....................................................................................... 13
Figure 17: Gender v/s other variables Graphs .................................................................................. 14
Figure 18: Pair plot for dataset Education ........................................................................................ 15
Figure 19: Heatmap for Dataset Education ...................................................................................... 16
Figure 20: Box plot for Dataset Education ........................................................................................ 16
Figure 21: Logistic Regression Confusion Matrices ........................................................................... 18
Figure 22: Logistic Regression Classification Reports ........................................................................ 18
Figure 23: Logistic Regression AUC and ROC curve ........................................................................... 18
Figure 24: Logistic Regression_Importance feature .......................................................................... 19
Figure 25: Tuned Logistic Regression Confusion Matrices ................................................................ 20
Figure 26: Tuned Logistic Regression Clssification Reports ............................................................... 20
Figure 27: Tuned Logistic Regression AUC-ROC curves ..................................................................... 20
Figure 28: Tuned Logistic Regression_Importance feature ............................................................... 21
Figure 29: LDA- Confusion Matrices ................................................................................................. 22
Figure 30: LDA - Classification Reports ............................................................................................. 22
Figure 31: LDA - AUC-ROC curve ...................................................................................................... 22
Figure 32: Tuned LDA - Confusion Matrices ..................................................................................... 23
Figure 33: Tuned LDA Classification Reports .................................................................................... 23
Figure 34: Tuned LDA - AUC ROC curve ............................................................................................ 23
Figure 35: KNN Model- Confusion Matrices ..................................................................................... 24
Figure 36: KNN Model- Classification report .................................................................................... 24
Figure 37: KNN Model - AUC-ROC curves ......................................................................................... 24
Figure 38: Tuned KNN Model - Confusion Matrices .......................................................................... 25
Figure 39: Tuned KNN Model - Classification Reports ....................................................................... 25
Figure 40: Tuned KNN Model- AUC-ROC curve ................................................................................. 25
Figure 41: Naïve Model - Confusion Matrices................................................................................... 26
Figure 42: Naïve Model - Classification Reports ............................................................................... 26
Figure 43: Naïve Model - AUC-ROC curves ....................................................................................... 26
Figure 44: Bagging- Confusion Matrices ........................................................................................... 27
Figure 45: Bagging - Classification reports ........................................................................................ 27
Figure 46: Bagging - AUC-ROC curves ............................................................................................... 27
Figure 47: Bagging tuned - Confusion Matrices ................................................................................ 28
Figure 48: Bagging tuned - Classification Reports ............................................................................. 28
Figure 49: Bagging tuned - AUC-ROC curves..................................................................................... 28
Figure 50: Ada Boosting - Confusion Matrices .................................................................................. 29
Figure 51: Ada Boosting - Classification reports ............................................................................... 29
Figure 52: Ada Boosting- AUC-ROC curves ....................................................................................... 29
Figure 53: Ada Boosting- Importance of features ............................................................................. 30
Figure 54: Tuned Ada Boosting- Confusion Matrices ........................................................................ 31
Figure 55: Tuned Ada Boosting- Classification Reports ..................................................................... 31
Figure 56: Tuned Ada Boosting- AUC-ROC curves............................................................................. 31
Figure 57: Tuned Ada Boosting- Importance features ...................................................................... 32
Figure 58: Gradient Boosting- Confusion Matrices ........................................................................... 33
Figure 59: Gradient Boosting- Classification Report ......................................................................... 33
Figure 60: Gradient Boosting- AUC-ROC Curve ................................................................................. 33
Figure 61: Gradient Boosting- Importance Features ......................................................................... 34
Figure 62: Tuned Gradient Boosting- Confusion Report ................................................................... 35
Figure 63: Tuned Gradient Boosting-Classification Reports .............................................................. 35
Figure 64: Tuned Gradient Boosting-AUC-ROC Curves...................................................................... 35
Figure 65: Tuned Gradient Boosting-Importance Features ............................................................... 36
Figure 66: All Models AUC-ROC curves for training data................................................................... 37
Figure 67: All Models AUC-ROC curves for testing data .................................................................... 38
Figure 68: President Speech Data frame .......................................................................................... 40
Figure 69: President Roosevelt-1941 Speech ................................................................................... 42
Figure 70: President Kennedy-1961 Speech ..................................................................................... 43
Figure 71: President Nixon-1973 Speech after ................................................................................. 44
Figure 72: Counts after removing stop word. ................................................................................... 45
Figure 73: Word Count after removing Stop Words ......................................................................... 45
Figure 74: Character Count after removing Stop words .................................................................... 46
Figure 75: Sentence Count after removing stop words ..................................................................... 46
Figure 76: President Roosevelt-1941 Speech after removing stop words ......................................... 47
Figure 77: President Kennedy-1961 Speech after removing stop words ........................................... 47
Figure 78: President Nixon-1973 Speech after removing stop words ................................................ 48
Figure 79: Inagural speech of Roosevelt........................................................................................... 49
Figure 80: Inagural speech of Kennedy ............................................................................................ 49
Figure 81: Inagural speech of Nixon ................................................................................................. 50
Figure 82: Inagural speech of Roosevelt- after 2nd clean up ............................................................ 50
Figure 83: Inagural speech of Kennedy- after 2nd clean up .............................................................. 51
Figure 84: Inagural speech of Nixon- after 2nd clean up................................................................... 51
Figure 85: President Roosevelt-1941: Word Cloud ........................................................................... 52
Figure 86: President Kennedy-1961: Word Cloud ............................................................................. 53
Figure 87: President Nixon-1973: Word Cloud ................................................................................. 54
Introduction:
This Business report is prepared for project of Predictive Modelling. Under this report we are
analysing 2 different problem statement with the help of question asked.
All problem statement section will contain problem statement, Exploratory Data Analysis (EDA),
Descriptive Data Analysis, Null value check and Answers to the questions. Tables and graphs are
added as an image for the better references and data visibility.
Tool used for solution is Jupyter Notebook (server version: 6.3.0) with Python 3.8.8.
Below are mentioned all necessary libraries used.
Figure 1: Libraries Used
1. Problem 1
You are hired by one of the leading news channels CNBE who wants to analyze recent elections. This
survey was conducted on 1525 voters with 9 variables. You have to build a model, to predict which
party a voter will vote for on the basis of the given information, to create an exit poll that will help in
predicting overall win and seats covered by a particular party.
Dataset for Problem: Election_Data.xlsx
Data Dictionary:
Variable Name Description

vote Party choice: Conservative or Labour
age in years
economic.cond.national Assessment of current national economic conditions, 1 to 5.
economic.cond.household Assessment of current household economic conditions, 1 to 5.
Blair Assessment of the Labour leader, 1 to 5.
Hague Assessment of the Conservative leader, 1 to 5.
Europe an 11-point scale that measures respondents' attitudes toward
European integration. High scores represent ‘Eurosceptic’
sentiment.
political.knowledge Knowledge of parties' positions on European integration, 0 to 3.
gender female or male.
Table 1: Data dictionary for Election Dataset
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an
inference on it.
Loading the data and viewing head and tail of dataset.
Figure 2: Election Dataset Head
Figure 3: Election Dataset Tail
 Data has been loaded successfully.

 We have total 10 variables in the dataset.
Descriptive Statistics of Dataset
Figure 4: Descriptive Statistics of Election Dataset

 We have total 1525 records in the dataset.
 Vote and gender are the categorical columns.
 Age is continuous variable.
 Economic.cond.national, economic.cond.household, Blair, Hague, Europe and
political.knowledge are listed as continuous but we can consider it as categorical columns or
ordinal variable.
 Unnamed: 0 is an identical or index variables.
 Maximum voters are from female group.
 Maximum votes go to labour party.
Checking for null value
Figure 5: Info of Election Dataset
Figure 6: Null value check for Election Dataset
No null values available for any of variable in the dataset.

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
Univariate Analysis
Figure 7: Univariant Analysis of age
 By Seeing the above hist plot we can say age column is normally distributed.
 There are no outlier present in age column.
 Skewness for age column is 0.14.
Figure 8: Univariant Analysis of vote
 Labour party is getting maximum vote.

Figure 9: Univariant Analysis of economic.cond.national
 On the scale of 1 to 5 (where 5 is greater), approx. 81% voters rated 3 or greater than 3.
Figure 10: Univariant Analysis of economic.cond.household
 On the scale of 1 to 5 (where 5 is greater), approx. 75% voters rated 3 or greater than 3.
Figure 11: Univariant Analysis of Blair
 Blair is the labour party leader, 65% voter rated him 4 or more than 4.
 We can say labour party is more popular among voters.
Figure 12: Univariant Analysis of Hague
 Hague is the Conservative party leader, 41% voter rated him 4 or more than 4.
 We can say Conservative party is not much popular among voters.
Figure 13: Univariant Analysis of Europe
 On the scale of 1 to 11 (where 11 is greater), we can say that the ‘Eurosceptic’ sentiment is
high.
Figure 14: Univariant Analysis of political.knowledge
 On the scale of 0 to 3 (where 3 is greater), we can say 65% present voter ranked 2-3.
Figure 15: Univariant Analysis of gender
 Female voters are more than the male voters.
Bivariate Analysis
 Vote v/s other variables

Figure 16: Vote v/s other variables graphs
 Gender v/s other variables

Figure 17: Gender v/s other variables Graphs
Multivariate Analysis
Figure 18: Pair plot for dataset Education

Figure 19: Heatmap for Dataset Education
Box Plot
Figure 20: Box plot for Dataset Education
 There are outliers presents for variable ‘economic.cond.national’ and

‘economic.cond.household’.
 As ‘economic.cond.national’ and ‘economic.cond.household’ are the assessment variable we
are ignoring the outliers.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split:
Split the data into train and test (70:30).
Data pre-processing steps
 Removing unnamed: 0 identity column.

 Removing 8 duplicate rows.( These records can be of different person but still it won’t be
useful in our study)
 Converting age variables into bins for improve performance of model.
 Creating 6 bins for age column (0,25,40,55,70,85,100) with name (1,2,3,4,5) respectively.
 Creating dummy variable for Male column.
 Changing the data types of variables to categorical data type.
 No scaling is needed in this scenario, As now all variables are of types categorical. Only Age
column was the issue, but we have binned that on as well.
 Splitting the data into training and test set into 70:30 ratio.
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
Logistic Regression Model
Below is the result for Logistic Regression model.
Figure 21: Logistic Regression Confusion Matrices
Figure 22: Logistic Regression Classification Reports
Figure 23: Logistic Regression AUC and ROC curve
 Model perform good on both train and test data.

 AUC scores are approx. same for both Train(0.89) and Test Data(0.88)
 F1 score and recall score are also very close to each other.
 We can say our model working well.
Figure 24: Logistic Regression_Importance feature
Above are the importance feature which play major role in decision.
 Economic.cond.national and Blair are the most important feature.

 Hugue is the least important feature.
Logistic Regression Model using gridSearch CV
Figure 25: Tuned Logistic Regression Confusion Matrices
Figure 26: Tuned Logistic Regression Clssification Reports
Figure 27: Tuned Logistic Regression AUC-ROC curves
 Model perform almost same on both training and testing dataset.

 F1 score and recalls values almost same for both training and testing dataset.
 Hence, we can say Logistic regression with tuned parameter are also worked stabled.
Figure 28: Tuned Logistic Regression_Importance feature
Above are the importance feature which play major role in decision.

 Hugue is the least important feature.
 Importance features in basic Logistic regression and tuned logistic regression model are
same.
Linear Discriminant analysis
Figure 29: LDA- Confusion Matrices
Figure 30: LDA - Classification Reports
Figure 31: LDA - AUC-ROC curve
 Accuracy is same for both training and testing dataset.

 F1 and rcall values are almost similar for training and testing dataset.
 Hence, we can say our model is stable.
Linear Discriminant analysis using gridSearch CV
Figure 32: Tuned LDA - Confusion Matrices
Figure 33: Tuned LDA Classification Reports
Figure 34: Tuned LDA - AUC ROC curve
 Model performance is same for both training and testing dataset.

 AUC-ROC curve is almost similar for both training and testing dataset.
 Accuracy is also almost similar for both training and testing dataset.
 Hence, we can say our tuned LDA model is also stable.
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
KNN Model
Figure 35: KNN Model- Confusion Matrices
Figure 36: KNN Model- Classification report
Figure 37: KNN Model - AUC-ROC curves
 Model perform good in training data but testing data performance is low.
 We can see significant drop in testing AUC-ROC curves.
 We can say our KNN model is not stable and overfitting.
Tuned KNN Model
Figure 38: Tuned KNN Model - Confusion Matrices
Figure 39: Tuned KNN Model - Classification Reports
Figure 40: Tuned KNN Model- AUC-ROC curve
 We can see tuned KNN model are more stable than basic model.
 F1 score and Recall value are close to each other.
 AUC-ROC curves are almost similar.
 AUC values are also close to each other.
 Hence, we can say tuned KNN model works more stable than basic KNN model.
Naïve Model
Figure 41: Naïve Model - Confusion Matrices
Figure 42: Naïve Model - Classification Reports
Figure 43: Naïve Model - AUC-ROC curves
 Model performance are almost similar on training and testing dataset.

 We can see AUC values are close to each other.
 F1 score are also close to each other for training and test dataset.
 Hence, we can say Naïve Bayes model is stable.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
Bagging (Random Forest)
Figure 44: Bagging- Confusion Matrices
Figure 45: Bagging - Classification reports
Figure 46: Bagging - AUC-ROC curves
 Bagging model perform good on training data but not on testing data.
 We can see drop in AUC-ROC curve of testing data as compare to training data.
 Hence, we can say our model in overfitting.
Bagging tuned using Grid
Figure 47: Bagging tuned - Confusion Matrices
Figure 48: Bagging tuned - Classification Reports
Figure 49: Bagging tuned - AUC-ROC curves
 Model perform good for both training and testing dataset.

 F1 score and recall values are close to each other.
 Hence, we can say tuned bagging model perform good and stable as compare to basic
bagging mode.
Ada Boosting
Figure 50: Ada Boosting - Confusion Matrices
Figure 51: Ada Boosting - Classification reports
Figure 52: Ada Boosting- AUC-ROC curves
 Model Performance are similar for both training and testing data set.
 AUC scores are very close to each other.
 F1 score and recall values are also close to each other.
 Hence, we can say model is stable.
Figure 53: Ada Boosting- Importance of features
 Above is the feature importance graph for Ada boosting model.

 Variable Europe is most significant variable and gender is least important variable in decision
making.
Ada Boosting – Tuned using Grid
Figure 54: Tuned Ada Boosting- Confusion Matrices
Figure 55: Tuned Ada Boosting- Classification Reports
Figure 56: Tuned Ada Boosting- AUC-ROC curves
 Model Performance are similar for both training and testing data set.
 AUC scores are very close to each other.
 F1 score and recall values are also close to each other.
 Hence, we can say model is stable.
 Both Tuned Ada boosting and Basic Ada boosting model are stable and their performance
are also almost similar.
Figure 57: Tuned Ada Boosting- Importance features
 Above is the feature importance graph for Tuned Ada boosting model.
 Variable Hague is most significant variable and gender is least important variable in decision
making.
Gradient Boosting
Figure 58: Gradient Boosting- Confusion Matrices
Figure 59: Gradient Boosting- Classification Report
Figure 60: Gradient Boosting- AUC-ROC Curve
 Model perform good on Training data but not on Testing data.

 AUC value for training data is higher than testing data.
 We can see significant drop in AUC-ROC curve also in testing data as compare to training
data.
 F1 score and recall values is also having difference in between training data and test data.
 Hence, we can say Gradient Boosting model is not stable.
Figure 61: Gradient Boosting- Importance Features
 Above is the feature importance graph for Gradient Boosting model.

making.
Gradient Boosting Tuned with Grid
Figure 62: Tuned Gradient Boosting- Confusion Report
Figure 63: Tuned Gradient Boosting-Classification Reports
Figure 64: Tuned Gradient Boosting-AUC-ROC Curves
 Model perform good on Training data but not on Testing data.

 AUC value for training data is higher than testing data.
 We can see significant drop in AUC-ROC curve also in testing data as compare to training
data.
 Hence, we can say Tuned Gradient Boosting model is not stable.
 Both Tuned and basic Gradient Boosting model perform similarly but both are not stable.
Figure 65: Tuned Gradient Boosting-Importance Features
 Above is the feature importance graph for Tuned Gradient Boosting model.
making.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare
the models and write inference which model is best/optimized.
Figure 66: All Models AUC-ROC curves for training data
From the above plot, we can clearly see Bagging model perform good on train data followed by
gradient Boost, Tuned gradient boost and KNN models.
Figure 67: All Models AUC-ROC curves for testing data
From the above plot, we can say Gradient Boost and Tuned Gradient Boost model perform well on
testing data set followed by Tuned LDA and LDA model.
Accuracy Confusion Matrix

Models
Train Accuracy Test Accuracy True Negative: Train Data True Negative: Test Data True Positive: Train Data True Positive: Test Data
Logistic Regression 0.89 0.882 194 111 689 268
Tuned Logistic Regression 0.89 0.879 196 110 691 266
LDA 0.889 0.886 199 111 686 269
Tuned LDA 0.889 0.886 199 111 686 269
KNN Model 0.934 0.877 231 102 689 272
Tuned KNN Model 0.91 0.898 219 104 680 270
Naive Bayes Model 0.888 0.875 210 113 675 263
Bagging 0.961 0.895 228 102 720 278
Tuned Bagging 0.9 0.879 145 78 725 287
Ada Boosting 0.906 0.88 207 106 686 268
Tuned Ada Boosting 0.899 0.881 181 95 702 271
Gradient Boosting 0.946 0.897 238 108 706 270
Tuned Gradient Boosting 0.937 0.897 226 106 706 274
Table 2: Model Comparison
Above is the comparison table of all the models
 we can clearly see Basic bagging model is having the highest accuracy.
 As well as, bagging model is capturing the Maximum True Negative and True Positive values
in the dataset.
1.8 Based on these predictions, what are the insights?
From the above comparison we can say Basic Bagging model is the finest model among all 13
models.
We have built our model for vote – Labour. Below are the insights based on our models.

 So, people who have ranked Economic.cond.national and Blair high, most likely to vote for
labour party.
 Hugue is the least important feature, so people ranked for Hague is most likely to vote for
Conservative party.
 Gender is not having any effect on deciding the vote.
2. Problem 2
In this particular project, we are going to work on the inaugural corpora from the nltk in Python. We
will be looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
(Hint: use .words(), .raw(), .sent() for extracting counts)
Figure 68: President Speech Data frame
2.1 Find the number of characters, words, and sentences for the mentioned documents.
Characters Count
 The character count in President Roosevelt-1941 speech was 7571.

 The character count in President Kennedy-1961 speech was 7618.
 The character count in President Nixon-1973 speech was 9991.
Word Count
 The word count in President Roosevelt-1941 speech was 1360.

 The word count in President Kennedy-1961 speech was 1390.
 The word count in President Nixon-1973 speech was 1819.
Sentence Count
 The sentence count in President Roosevelt-1941 speech was 68.

 The sentence count in President Kennedy-1961 speech was 52.
 The sentence count in President Nixon-1973 speech was 68.
Note : All above counts are with stop word and spacing.
President Roosevelt-1941 Speech
Figure 69: President Roosevelt-1941 Speech

President Kennedy-1961 Speech
Figure 70: President Kennedy-1961 Speech

President Nixon-1973 Speech after
Figure 71: President Nixon-1973 Speech after

2.2 Remove all the stopwords from all three speeches.
 We will convert the speech text into lower case.

 We have removed all un-necessary spaces from the speech text.
 Then we will remove all stop words and special character from the speech.
Figure 72: Counts after removing stop word.
Word Count after removing Stop Words
Figure 73: Word Count after removing Stop Words
 The word count in President Roosevelt-1941 speech (after removing stop words) is 627.
 The word count in President Kennedy-1961 speech (after removing stop words) is 693.
 The word count in President Nixon-1973 speech (after removing stop words) is 833.
Character Count after removing Stop words
Figure 74: Character Count after removing Stop words
 The Character count in President Roosevelt-1941 speech (after removing stop words) is
4590.
 The Character count in President Kennedy-1961 speech (after removing stop words) is 4771.
 The Character count in President Nixon-1973 speech (after removing stop words) is 5950.
Sentence Count after removing stop words
Figure 75: Sentence Count after removing stop words
 The sentence count in President Roosevelt-1941 speech (after removing stop words) is 1.
 The sentence count in President Kennedy-1961 speech (after removing stop words) is 1.
 The sentence count in President Nixon-1973 speech (after removing stop words) is 1.
President Roosevelt-1941 Speech after removing stop words
Figure 76: President Roosevelt-1941 Speech after removing stop words
President Kennedy-1961 Speech after removing stop words
Figure 77: President Kennedy-1961 Speech after removing stop words

President Nixon-1973 Speech after removing stop words
Figure 78: President Nixon-1973 Speech after removing stop words

2.3 Which word occurs the most number of times in his inaugural address for each president? Mention
the top three words. (after removing the stopwords)
President Roosevelt-1941: Top Words
Figure 79: Inagural speech of Roosevelt
 President Roosevelt used word ‘nation’ maximum time.

 Top 3 words used by President Roosevelt are:
1. Nation
2. Know
3. Democracy
President Kennedy-1961: Top Words
Figure 80: Inagural speech of Kennedy
 President Kennedy used word ‘let’ maximum time.

 Top 3 words used by President Kennedy are:
1. let
2. us
3. sides
President Nixon-1973: Top Words
Figure 81: Inagural speech of Nixon
 President Nixon used word ‘us’ maximum time.

 Top 3 words used by President Nixon are:
1. us
2. let
3. peace
Note : the word ‘let’ and ‘us’ is more like unwanted words for us. We will remove it.
President Roosevelt-1941: Top Words- after second treatment
Figure 82: Inagural speech of Roosevelt- after 2nd clean up
 President Roosevelt used word ‘nation’ maximum time.

 Top 3 words used by President Roosevelt are:
1. Nation
2. Know
3. democracy
President Kennedy-1961: Top Words- after second treatment
Figure 83: Inagural speech of Kennedy- after 2nd clean up
 President Kennedy used word ‘sides’ maximum time.

 Top 3 words used by President Kennedy are:
1. sides
2. world
3. new
President Nixon-1973: Top Words- after second treatment
Figure 84: Inagural speech of Nixon- after 2nd clean up
 President Nixon used word ‘peace’ maximum time.

 Top 3 words used by President Nixon are:
1. peace
2. world
3. new
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords) [ refer
to the End-to-End Case Study done in the Mentored Learning Session ]
President Roosevelt-1941: Word Cloud
Figure 85: President Roosevelt-1941: Word Cloud
By seeing above word cloud, we can say president Roosevelt mostly talked about nation, spirit,
people, life etc.
President Kennedy-1961: Word Cloud
Figure 86: President Kennedy-1961: Word Cloud
By seeing above word cloud, we can say president Kennedy mostly talked about world, sides, new,
nation etc.
President Nixon-1973: Word Cloud
Figure 87: President Nixon-1973: Word Cloud
By seeing above word cloud, we can say president Nixon mostly talked about America, nation, world,
peace, new, responsibility etc.

Machine Learning Business Report PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Machine Learning Business Report PDF

Uploaded by

Copyright:

Available Formats

Machine Learning Project

Below are mentioned all necessary libraries used.

Figure 1: Libraries Used

Dataset for Problem: Election_Data.xlsx

Variable Name Description

Loading the data and viewing head and tail of dataset.

Figure 2: Election Dataset Head

Figure 3: Election Dataset Tail

 Data has been loaded successfully.

Descriptive Statistics of Dataset

Figure 4: Descriptive Statistics of Election Dataset

Checking for null value

Figure 5: Info of Election Dataset

Figure 6: Null value check for Election Dataset

No null values available for any of variable in the dataset.

Figure 7: Univariant Analysis of age

Figure 8: Univariant Analysis of vote

 Labour party is getting maximum vote.

Figure 10: Univariant Analysis of economic.cond.household

Figure 12: Univariant Analysis of Hague

Figure 14: Univariant Analysis of political.knowledge

 Female voters are more than the male voters.

 Vote v/s other variables

 Gender v/s other variables

Figure 18: Pair plot for dataset Education

Figure 20: Box plot for Dataset Education

 There are outliers presents for variable ‘economic.cond.national’ and

 Removing unnamed: 0 identity column.

Below is the result for Logistic Regression model.

Figure 21: Logistic Regression Confusion Matrices

Figure 22: Logistic Regression Classification Reports

Figure 23: Logistic Regression AUC and ROC curve

 Model perform good on both train and test data.

 Economic.cond.national and Blair are the most important feature.

Figure 25: Tuned Logistic Regression Confusion Matrices

Figure 26: Tuned Logistic Regression Clssification Reports

Figure 27: Tuned Logistic Regression AUC-ROC curves

 Model perform almost same on both training and testing dataset.

 Economic.cond.national and Blair are the most important feature.

Figure 29: LDA- Confusion Matrices

Figure 30: LDA - Classification Reports

Figure 31: LDA - AUC-ROC curve

 Accuracy is same for both training and testing dataset.

Figure 32: Tuned LDA - Confusion Matrices

Figure 33: Tuned LDA Classification Reports

Figure 34: Tuned LDA - AUC ROC curve

 Model performance is same for both training and testing dataset.

Figure 35: KNN Model- Confusion Matrices

Figure 36: KNN Model- Classification report

Figure 37: KNN Model - AUC-ROC curves

Figure 38: Tuned KNN Model - Confusion Matrices

Figure 39: Tuned KNN Model - Classification Reports

Figure 40: Tuned KNN Model- AUC-ROC curve

Figure 41: Naïve Model - Confusion Matrices

Figure 42: Naïve Model - Classification Reports

Figure 43: Naïve Model - AUC-ROC curves

 Model performance are almost similar on training and testing dataset.

Bagging (Random Forest)

Figure 44: Bagging- Confusion Matrices

Figure 45: Bagging - Classification reports

Figure 46: Bagging - AUC-ROC curves

Figure 47: Bagging tuned - Confusion Matrices

Figure 48: Bagging tuned - Classification Reports