Machine Learning Business Report - Reshma

C2 - Restricted use
Machine Learning PROJECT

REPORT
DSBA
Reshma A
PGP-DSBA Online May-2022
Date:27/11/22
Contents
Problem 1
1.1. Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.
1.2. Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
1.3. Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
1.4. Apply Logistic Regression and LDA (linear discriminant analysis).
1.5. Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6. Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7. Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
Problem 2
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords)
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)
List of Figures
Figure 1: Univariate Analysis: Histogram
Figure 2: Boxplots with outliers
Figure 3: Vote countplot
Figure 4: Gender countplot
Figure 5: Heatmap
Figure 6: Boxplot after scaling
List of Tables
Table 1: Sample of dataset
Table 2: Datatypes
Table 3: Statistical data
Table 4: Pairplot
Table 5: Dataset after encoding
Problem 1
You are hired by one of the leading news channels, CNBE, which wants to analyze recent elections. This survey
was conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will
vote for on the basis of the given information, in order to create an exit poll that will help in predicting the
overall win and the seats covered by a particular party.
1.1. Read the dataset. Do the descriptive statistics and do the null value condition check. Write an
inference on it.
Sample of dataset:
Table 1: Sample of dataset
Datatypes:
The following are the datatypes associated with the dataset
Table 2:Datatypes
Statistical Data:
The following is the statistical data associated with the dataset and the information:
Table 3: Statistical data
Null Value Check:
No null values were observed in the null value check.
Duplicates:
8 duplicate records were observed and dropped, leaving 1517 records for analysis.
Skewness:
Observations:
• The dataset has 1525 rows and 9 columns. The Unnamed field was dropped because it wasn’t relevant
to the analysis, leaving a 1525 x 8 dataset.
• ‘vote’ and ‘gender’ are of object datatype while all others are int64.
• No null values were observed in the null value check.
• 8 duplicate records were found and dropped.
• The skewness table shows that the data is fairly symmetrical. Only ‘Blair’ has a higher skewness;
all other variables have skewness between -0.5 and 0.5.
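The checks in this section can be sketched with pandas as below. This is a minimal illustration and not the code behind the report: the helper name `data_quality_summary` and the five sample rows are made up, and only the column names ‘vote’, ‘age’, ‘Blair’ and ‘gender’ come from the report.

```python
import pandas as pd


def data_quality_summary(df: pd.DataFrame) -> dict:
    """Collect shape, dtypes, null counts, duplicate count and skewness."""
    numeric = df.select_dtypes(include="number")
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "null_counts": df.isnull().sum().to_dict(),
        "n_duplicates": int(df.duplicated().sum()),
        "skewness": numeric.skew().round(2).to_dict(),
        "describe": df.describe(include="all"),
    }


# Made-up rows standing in for the survey data (not the real records).
sample = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour", "Labour", "Labour"],
    "age": [43, 36, 35, 24, 43],
    "Blair": [4, 4, 5, 2, 4],
    "gender": ["female", "male", "male", "female", "female"],
})

summary = data_quality_summary(sample)

# Drop exact duplicate rows, as the report does for its 8 duplicates.
deduplicated = sample.drop_duplicates().reset_index(drop=True)
```

In the sample, the last row repeats the first one, so one duplicate is reported and four rows remain after `drop_duplicates()`.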
1.2. Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
Figure 1: Univariate Analysis: Histogram
Univariate analysis is the simplest form of data analysis. Uni means one, so each variable is analyzed
separately, without reference to the others. The data is gathered for the purpose of answering a specific
research question, which here is the party a voter will vote for.
Figure 2:Boxplots with outliers
Figure 3:Vote countplot

Figure 4:Gender countplot
Bivariate Analysis
Table 4:Pairplot
Heatmap
Figure 5:Heatmap
Observations
• Labour receives more votes than the Conservative party.
• There are more female voters than male voters.
• Economic.cond.household and economic.cond.national have outliers, which need to be treated.
• The age of the voters is spread across a wide range, mostly between 40 and 70, and the
distribution is slightly right skewed.
• The distributions of both economic condition fields are left skewed.
• The distribution of Hague is slightly right skewed.
• The distribution of Europe is left skewed.
• The distribution of Blair is slightly left skewed.
• Political knowledge is highest in the 35-50 age range.
• Europe sentiment has not affected voters’ ratings of the economic conditions of household and
nation.
• Voters with Europe sentiment favour the Conservative party.
• Younger people tend to vote Labour, and the Labour party receives a high share of votes across
all ages.
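The outlier check behind the boxplots can be sketched with the usual 1.5 x IQR rule. The report does not state which outlier treatment was applied, so capping at the whisker limits is an assumption here; the function names and the example ratings are made up.

```python
import pandas as pd


def iqr_limits(series: pd.Series, k: float = 1.5) -> tuple:
    """Return the lower and upper boxplot whisker limits."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values that fall outside the whisker limits."""
    lower, upper = iqr_limits(series, k)
    return (series < lower) | (series > upper)


def cap_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Assumed treatment: clip outliers to the whisker limits."""
    lower, upper = iqr_limits(series, k)
    return series.clip(lower=lower, upper=upper)


# Made-up ratings on a 1-5 scale; the single rating of 1 is an outlier.
ratings = pd.Series([3, 3, 4, 3, 4, 3, 3, 4, 3, 1])
mask = outlier_mask(ratings)
capped = cap_outliers(ratings)
```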
1.3. Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split:
Split the data into train and test (70:30).
Encoding
The two categorical variables ‘vote’ and ‘gender’ were converted to numeric values. ‘vote’ had two values,
Labour and Conservative, which were replaced with 0 and 1 respectively. ‘gender’, with values male
and female, was replaced with 1 and 0 respectively. The dataset after encoding is given below.
Table 5: Dataset after encoding
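The encoding described above can be sketched with a dictionary mapping per column. The mapping values follow the report (Labour=0, Conservative=1, female=0, male=1); the exact spelling and casing of the labels in the raw data, and the helper name `encode_categoricals`, are assumptions.

```python
import pandas as pd

# Codes taken from the report; label spelling in the raw data is assumed.
VOTE_CODES = {"Labour": 0, "Conservative": 1}
GENDER_CODES = {"female": 0, "male": 1}


def encode_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with 'vote' and 'gender' replaced by integer codes."""
    out = df.copy()
    out["vote"] = out["vote"].map(VOTE_CODES)
    out["gender"] = out["gender"].map(GENDER_CODES)
    # map() yields NaN for labels missing from the dictionaries.
    if out[["vote", "gender"]].isnull().any().any():
        raise ValueError("unexpected category label in 'vote' or 'gender'")
    return out.astype({"vote": "int64", "gender": "int64"})


# Made-up rows for illustration.
raw = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "gender": ["female", "male", "male"],
    "age": [43, 36, 35],
})
encoded = encode_categoricals(raw)
```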
Scaling
The exploratory data analysis showed that the variables are spread over different scales, so scaling
was applied. Scaling matters most for distance-based models such as KNN, where a wide-ranged variable
like age would otherwise dominate the distance. The boxplots in the previous section also showed
outliers, so outlier treatment was performed as well.
Figure 6:Boxplot after scaling
Test and Train Split

The data was split in the ratio of 70:30 for train and test respectively.
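A 70:30 split followed by standard scaling can be sketched with scikit-learn as below. The report does not say whether the scaler was fitted before or after the split, or which scaler was used; this sketch assumes `StandardScaler` fitted on the training part only, which keeps test information out of the scaling step. The data is randomly generated for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: one wide-ranged column (like age) and two rating columns.
rng = np.random.default_rng(1)
X = rng.normal(loc=[50, 3, 3], scale=[15, 1, 1], size=(200, 3))
y = rng.integers(0, 2, size=200)

# 70:30 split; stratify keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Fit the scaler on the training data only, then apply it to both parts.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```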
1.4. Apply Logistic Regression and LDA (linear discriminant analysis).
Logistic Regression
Test
Accuracy: 82.89%
AUC: 0.881
Train
Accuracy: 83.69%
AUC: 0.891
The train and test scores are close, so there is no sign of overfitting or underfitting.

Linear Discriminant Analysis
Test
Accuracy: 83.11%
AUC: 0.888
Train
Accuracy: 83.41%
AUC: 0.890
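The train/test comparison reported for these two models can be sketched as follows. The election data is not reproduced here, so the sketch uses a synthetic dataset from `make_classification`; the helper `evaluate` and all settings are assumptions, and the numbers it produces are unrelated to those in the report.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the survey.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_redundant=0, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)


def evaluate(model) -> dict:
    """Fit on the training part and report accuracy and AUC on both parts."""
    model.fit(X_train, y_train)
    scores = {}
    for name, X_part, y_part in (
        ("train", X_train, y_train),
        ("test", X_test, y_test),
    ):
        scores[name] = {
            "accuracy": accuracy_score(y_part, model.predict(X_part)),
            "auc": roc_auc_score(y_part, model.predict_proba(X_part)[:, 1]),
        }
    return scores


results = {
    "Logistic Regression": evaluate(LogisticRegression(max_iter=1000)),
    "LDA": evaluate(LinearDiscriminantAnalysis()),
}
```

A small gap between the train and test rows of `results` is what the report reads as the absence of overfitting.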
1.5. Apply KNN Model and Naïve Bayes Model. Interpret the results.
KNN Model
Test
Accuracy: 82.67%
AUC: 0.882
Train
Accuracy: 85.67%
AUC: 0.982
Hyperparameters used: none (default settings).
There is slight overfitting: the error on the test set is slightly greater than on the training set.
Naïve Bayes Model

Test
Accuracy: 82.3%
AUC: 0.876
Train
Accuracy: 83.4%
AUC: 0.889
Hyperparameters used: none (default settings).
There is no overfitting or underfitting. The error on the test set is only slightly greater than on the
training set.
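The following sketch fits both models on synthetic data and shows how the number of neighbours changes the train/test gap for KNN. The report used default settings; the values of `n_neighbors` tried here, the helper name and the data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the survey data.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_redundant=0, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)


def train_test_accuracy(model) -> tuple:
    """Return (train accuracy, test accuracy) after fitting on the train part."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)


# With k=1 every training point is its own nearest neighbour, so the
# training accuracy is perfect and the gap to the test accuracy is largest.
knn_scores = {
    k: train_test_accuracy(KNeighborsClassifier(n_neighbors=k)) for k in (1, 5, 15)
}
nb_scores = train_test_accuracy(GaussianNB())
```

Comparing the pairs in `knn_scores` is one way to judge how much of the train/test gap comes from the choice of k.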
1.6. Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
Bagging
A Random Forest was fitted first and then used as the base estimator of the bagging classifier.

Hyperparameters used for model tuning:
base estimator=RandomForestClassifier,
max_depth=10,
min_samples_leaf=15,
min_samples_split=30,
n_estimators=50,
random_state=1
Because the initial model overfitted, GridSearchCV was used to tune it.
Test
Accuracy: 81.5%
AUC: 0.888
Train
Accuracy: 84.6%
AUC: 0.909
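A bagging classifier with a Random Forest base estimator can be sketched as below. The report lists the hyperparameters without saying which estimator each belongs to; this sketch assumes they all configure the Random Forest, and it picks 5 bagging rounds only to keep the example fast. The data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_redundant=0, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Hyperparameter values from the report, assumed to belong to the forest.
forest = RandomForestClassifier(
    max_depth=10,
    min_samples_leaf=15,
    min_samples_split=30,
    n_estimators=50,
    random_state=1,
)

# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
bagging = BaggingClassifier(forest, n_estimators=5, random_state=1)
bagging.fit(X_train, y_train)

scores = {}
for name, X_part, y_part in (("train", X_train, y_train), ("test", X_test, y_test)):
    scores[name] = {
        "accuracy": bagging.score(X_part, y_part),
        "auc": roc_auc_score(y_part, bagging.predict_proba(X_part)[:, 1]),
    }
```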
Boosting
Hyperparameters used for model tuning:
Test
Accuracy: 89.9%
AUC: 0.884
Train
Accuracy: 83.4%
AUC: 0.902
The model initially overfitted and was tuned to get better results.
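Tuning a boosting model with a small grid search can be sketched as follows. The report does not name the boosting algorithm or its grid, so `GradientBoostingClassifier`, the grid values and the `roc_auc` scoring are all assumptions; shallow trees and fewer rounds are the usual levers against overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_redundant=0, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Hypothetical grid: shallow trees and a modest number of boosting rounds.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [1, 2],
    "learning_rate": [0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

# The best estimator is refitted on the whole training part by default.
best_model = search.best_estimator_
train_accuracy = best_model.score(X_train, y_train)
test_accuracy = best_model.score(X_test, y_test)
```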

1.7. Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models
and write inference which model is best/optimized.
Logistic Regression
Test
Classification Report
Confusion Matrix
ROC Curve
Train
Confusion Matrix
ROC Curve
Linear Discriminant Analysis
Test
Confusion Matrix
ROC Curve
Train
Confusion Matrix
ROC Curve
KNN Model
Test
Confusion Matrix
ROC Curve
Train
Confusion Matrix
ROC Curve
Naïve Bayes Model
Test
Confusion Matrix
ROC Curve
Train
Confusion Matrix
ROC Curve
Bagging
Test
Confusion Matrix
ROC Curve
Train
Confusion Matrix
ROC Curve
Boosting
Train
Confusion Matrix
ROC Curve
Test
Confusion Matrix
ROC Curve
For Train data

Model                Accuracy  Recall  Precision  AUC score
Logistic Regression  0.84      0.71    0.75       0.891
LDA                  0.83      0.65    0.74       0.890
KNN                  0.87      0.72    0.77       0.928
NB                   0.83      0.69    0.72       0.889
Bagging              0.85      0.81    0.61       0.902
Boosting             0.84      0.78    0.61       0.909
For Test data

Model                Accuracy  Recall  Precision  AUC score
Logistic Regression  0.82      0.72    0.76       0.881
LDA                  0.83      0.73    0.76       0.888
KNN                  0.83      0.68    0.78       0.882
NB                   0.822     0.73    0.74       0.876
Bagging              0.82      0.60    0.80       0.882
Boosting             0.81      0.64    0.75       0.888
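The four columns of these tables can be computed from true labels, predicted labels and predicted scores as sketched below, in plain Python. The function names and the eight example records are made up; class 1 is treated as the positive class, which in the report's encoding is Conservative.

```python
def classification_metrics(y_true, y_pred) -> dict:
    """Confusion matrix, accuracy, recall and precision for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        # Rows are true classes 0 and 1, columns are predicted classes 0 and 1.
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


def roc_auc(y_true, scores) -> float:
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for eight voters.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]
y_pred = [int(s >= 0.5) for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

For this example the confusion matrix is [[3, 1], [1, 3]], accuracy, recall and precision are all 0.75, and the AUC is 0.9375.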
Observations
• In terms of accuracy, KNN is the most accurate model on the training data (0.87); on the test data
KNN and LDA are tied at 0.83.
• The bagging and boosting models, which initially underfitted or overfitted, were tuned to get better results.
• The boosting classifier performed best, with an AUC score of 88.8% on test and 90.9% on train.
• Overfitting was observed in the Random Forest that was used for Bagging, so both models
overfitted. They were tuned using GridSearchCV.
1.8 Based on these predictions, what are the insights?
• The Labour party has considerably more votes than the Conservative party.
• All models had good accuracy.
• Blair is rated higher than Hague, with an average score of 3.3 against 2.74 for Hague.
• People who gave low values for a party also ended up voting for the same party.
• Political knowledge is low for a large section of people, with an average of only 1.7.
• Political knowledge is highest in the 35-50 age range.
• Europe sentiment has not affected voters’ ratings of the economic conditions of household and
nation.
• The economic condition for household averaged 3.13 and the economic condition for nation
averaged 3.24.
• Voters with Europe sentiment favour the Conservative party.
• Younger people tend to vote Labour, and the Labour party receives a high share of votes across
all ages.
• Overfitting was observed in the Bagging model as well as in the Random Forest.
• The boosting classifier is the best of the models used.
Problem 2
In this project, we work on the inaugural corpora from nltk in Python. We look at the following
speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
(Hint: use .words(), .raw(), .sents() for extracting counts)
2.1 Find the number of characters, words, and sentences for the mentioned documents.
Roosevelt file:
Number of characters in file: 7571
Number of words in file: 1390
Number of sentences in file: 52
Kennedy file:
Number of characters in file: 7618
Number of sentences in file: 68
Nixon file:
Number of sentences in file: 67
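The three counts can be sketched with the corpus reader methods named in the hint. The helper `speech_counts` and the `TinyCorpus` stand-in are hypothetical; the stand-in only mimics the three methods so the example runs without downloading the corpus, and the commented lines show how the same helper would be called on the NLTK inaugural corpus.

```python
def speech_counts(corpus, fileid: str) -> dict:
    """Count characters, words and sentences of one file in a corpus reader."""
    return {
        "characters": len(corpus.raw(fileid)),
        "words": len(corpus.words(fileid)),
        "sentences": len(corpus.sents(fileid)),
    }


# With NLTK installed and the corpus downloaded (nltk.download("inaugural")):
#   from nltk.corpus import inaugural
#   speech_counts(inaugural, "1941-Roosevelt.txt")
#   speech_counts(inaugural, "1961-Kennedy.txt")
#   speech_counts(inaugural, "1973-Nixon.txt")


class TinyCorpus:
    """Hypothetical stand-in exposing raw(), words() and sents()."""

    def __init__(self, sentences):
        self._sentences = sentences

    def raw(self, fileid=None):
        return " ".join(" ".join(sentence) for sentence in self._sentences)

    def words(self, fileid=None):
        return [word for sentence in self._sentences for word in sentence]

    def sents(self, fileid=None):
        return self._sentences


demo = TinyCorpus([["We", "observe", "today", "."], ["Let", "us", "begin", "."]])
counts = speech_counts(demo, "demo.txt")
```

Note that `.words()` counts punctuation marks as tokens, so the word count depends on whether punctuation is filtered out first.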
2.2 Remove all the stopwords from all three speeches.
Roosevelt
There are 871 words left after removal of stopwords

Kennedy
There are 904 words left after removal of stopwords
Nixon
There are 1094 words left after removal of stopwords
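Stopword removal can be sketched as a simple filter over the word tokens. The report does not show its stopword list or whether punctuation was also removed; this sketch assumes lowercasing and punctuation removal, and it uses a tiny made-up stopword set where the full NLTK English list (`nltk.corpus.stopwords.words("english")`) would normally be used.

```python
import string

# Tiny illustrative set; a real run would use the full NLTK English list.
DEMO_STOPWORDS = {
    "the", "of", "and", "to", "a", "in", "is", "we", "that", "it", "not", "but",
}


def remove_stopwords(words, stopwords=DEMO_STOPWORDS) -> list:
    """Lowercase tokens and drop stopwords and punctuation-only tokens."""
    cleaned = []
    for word in words:
        token = word.lower()
        if token in stopwords:
            continue
        if all(ch in string.punctuation for ch in token):
            continue
        cleaned.append(token)
    return cleaned


tokens = [
    "We", "observe", "today", "not", "a", "victory", "of", "party", ",",
    "but", "a", "celebration", "of", "freedom", ".",
]
kept = remove_stopwords(tokens)
```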
2.3 Which word occurs the most number of times in his inaugural address for each president?
Mention the top three words. (after removing the stopwords)
Roosevelt’s speech used ‘nation’ most often; the top three words are given below
Kennedy’s speech used ‘world’ most often; the top three words are given below
Nixon’s speech used ‘America’ most often; the top three words are given below
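The top three words can be obtained with a frequency counter over the cleaned tokens. The token list below is invented to show the mechanics and does not reproduce any speech; `top_words` is a hypothetical helper name.

```python
from collections import Counter


def top_words(tokens, n: int = 3) -> list:
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)


# Invented tokens, already lowercased and free of stopwords.
tokens = [
    "nation", "know", "nation", "spirit", "nation",
    "know", "life", "spirit", "know", "nation",
]
top_three = top_words(tokens)
```

For this list the result is [("nation", 4), ("know", 3), ("spirit", 2)].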
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)
Word cloud for Roosevelt’s speech

Word cloud for Kennedy’s speech

Word cloud for Nixon’s speech
