Machine Learning Business Report - Reshma


C2 - Restricted use

Machine Learning PROJECT REPORT

DSBA

Reshma A

PGP-DSBA Online May-2022

Date:27/11/22

Contents
Problem 1
1.1. Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.
1.2. Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
1.3. Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
1.4. Apply Logistic Regression and LDA (linear discriminant analysis).
1.5. Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6. Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7. Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
Problem 2
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords)
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)
List of Figures
Figure 1: Univariate Analysis: Histogram
Figure 2: Boxplots with outliers
Figure 3: Vote countplot
Figure 4: Gender countplot
Figure 5: Heatmap
Figure 6: Boxplot after scaling

List of Tables
Table 1: Sample of dataset
Table 2: Datatypes
Table 3: Statistical data
Table 4: Pairplot
Table 5: Dataset after encoding

Problem 1

You are hired by one of the leading news channels, CNBE, which wants to analyze recent elections. The survey
was conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will
vote for on the basis of the given information, in order to create an exit poll that will help in predicting the
overall win and the seats covered by a particular party.

1.1. Read the dataset. Do the descriptive statistics and do the null value condition check. Write an
inference on it. 

Sample of dataset:

Table 1: Sample of dataset

Datatypes:

The following are the datatypes associated with the dataset

Table 2:Datatypes

Statistical Data:

The following is the statistical data associated with the dataset and the information:

Table 3:Statistical data

Null Value Check:

No null values were observed in the null value check

Duplicates:
There were 8 duplicates observed. They were dropped for better data analysis

Skewness:

Observations:
• The dataset has 1525 rows and 9 columns. The Unnamed field was dropped because it is not relevant
to the analysis, leaving a 1525 x 8 dataset.
• ‘vote’ and ‘gender’ are of object datatype, while all others are of int64 datatype.
• The null value check found no null values.
• There were 8 duplicate records, which were dropped.
• The skewness table shows that the data is fairly symmetrical. Only ‘Blair’ has a higher skewness;
all other variables have skewness between -0.5 and 0.5.
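The checks described above can be sketched with pandas. This is a minimal illustration on a made-up five-row frame, not the survey data itself; the column name "Unnamed: 0" and all values are assumptions.

```python
import pandas as pd

# Toy stand-in for the survey data. Column names follow the report where
# possible; "Unnamed: 0" and every value are made up for illustration.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4, 5],
    "vote": ["Labour", "Conservative", "Labour", "Labour", "Labour"],
    "age": [43, 36, 35, 35, 35],
    "Blair": [4, 4, 5, 5, 5],
    "gender": ["female", "male", "male", "male", "male"],
})

# Drop the index-like column that carries no information.
df = df.drop(columns=["Unnamed: 0"])

null_counts = df.isnull().sum()              # number of nulls per column
n_duplicates = int(df.duplicated().sum())    # rows that repeat an earlier row
df = df.drop_duplicates().reset_index(drop=True)

summary = df.describe(include="all")         # descriptive statistics
skewness = df.skew(numeric_only=True)        # skewness of the numeric columns
```

Note that duplicates are counted only after the index-like column is removed; with a unique row index still present, no row would register as a duplicate.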

1.2. Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.

Figure 1: Univariate Analysis: Histogram

Univariate analysis is the simplest form of data analysis. ‘Uni’ means one, so each variable is analyzed
on its own, separately from the others. The data is gathered for the purpose of answering a question, or
more specifically, a research question.

Figure 2: Boxplots with outliers

Figure 3: Vote countplot

Figure 4: Gender countplot

Bivariate Analysis

Table 4: Pairplot

Heatmap

Figure 5: Heatmap

Observations
• Labour receives more votes than the Conservative party.
• There are more female voters than male voters.
• economic.cond.household and economic.cond.national have outliers, which need to be treated.
• Voter age is spread across a wide range, mostly between 40 and 70, and the plot is slightly
right-skewed.
• The distributions of both economic-condition fields are left-skewed.
• The distribution of Hague is slightly right-skewed.
• The distribution of Europe is left-skewed.
• The distribution of Blair is slightly left-skewed.
• Political knowledge is highest in the 35-50 age range.
• Europe sentiment has not affected voters’ ratings of the economic conditions of household and
nation.
• Voters with Europe sentiment favour the Conservative party.
• Younger people tend to vote Labour, and Labour receives a high share of votes across
all ages.
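The report does not state which outlier rule was applied. One common choice, assumed here, is the 1.5 x IQR rule used by boxplot whiskers, with values outside the bounds capped at the bounds. The helper names and the ratings below are hypothetical.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) whisker bounds used by a standard boxplot."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside the IQR bounds to the nearest bound."""
    lower, upper = iqr_bounds(series, k)
    return series.clip(lower=lower, upper=upper)

# Made-up ratings on a 1-5 scale; the single rating of 1 is the outlier.
ratings = pd.Series([3, 3, 3, 4, 3, 4, 3, 1, 3, 4], name="economic.cond.national")

lower, upper = iqr_bounds(ratings)
outliers = ratings[(ratings < lower) | (ratings > upper)]
capped = cap_outliers(ratings)
```

Capping keeps every row in the dataset, which matters for a survey of this size; dropping the rows instead would be an equally valid design choice.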

1.3. Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split:
Split the data into train and test (70:30). 

Encoding
The two categorical variables ‘vote’ and ‘gender’ were converted to numeric. ‘Vote’ had two values,
Labour and Conservative, which were replaced with ‘0’ and ‘1’ respectively. ‘Gender’, with values male
and female, was replaced with ‘1’ and ‘0’ respectively. Given below is the dataset after encoding.

Table 5: Dataset after encoding
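A minimal sketch of this encoding with pandas, using the codes stated above (Labour = 0, Conservative = 1, female = 0, male = 1). The three rows are made up, and the exact label spellings in the real file are an assumption.

```python
import pandas as pd

# Made-up rows; label spellings are assumed to match the dataset.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "gender": ["female", "male", "female"],
})

vote_codes = {"Labour": 0, "Conservative": 1}
gender_codes = {"female": 0, "male": 1}

df["vote"] = df["vote"].map(vote_codes)
df["gender"] = df["gender"].map(gender_codes)

# Series.map turns any label missing from the dictionary into NaN,
# so checking for NaN afterwards catches misspelled categories.
all_mapped = bool(df.notna().all().all())
```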

Scaling
The exploratory data analysis showed that the variables are on different scales: age runs mostly
between 40 and 70, while the rating fields take small values. Scaling was therefore applied, which
matters most for distance-based models such as KNN. The boxplots in the previous section also showed
outliers, so outlier treatment was performed as well.

Figure 6: Boxplot after scaling

Test and Train Split


The data was split in the ratio of 70:30 for train and test respectively.
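The split and the scaling step can be sketched with scikit-learn as follows. The report does not name the scaler or say whether scaling happened before or after the split; this sketch assumes StandardScaler and fits it on the training portion only, so that no test-set statistics leak into training. The data is random and purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in data: an age-like column and a rating-like column.
rng = np.random.default_rng(1)
X = rng.normal(loc=[50, 3], scale=[15, 1], size=(100, 2))
y = rng.integers(0, 2, size=100)

# 70:30 split; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Learn mean and standard deviation from the training data only,
# then apply the same transformation to both parts.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```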

1.4. Apply Logistic Regression and LDA (linear discriminant analysis).

Logistic Regression
Test
Accuracy: 82.89%
AUC: 0.881

Train
Accuracy: 83.69%
AUC: 0.891

Train and test scores differ by less than one percentage point, so there is no sign of overfitting
or underfitting.

Linear Discriminant Analysis

Test
Accuracy: 83.11%
AUC: 0.888

Train
Accuracy: 83.41%
AUC: 0.890
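A sketch of how train and test accuracy and AUC could be obtained for both models with scikit-learn. It runs on synthetic data from make_classification, so its scores are unrelated to the figures reported above; max_iter=1000 is an assumption, not a setting taken from the report.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the survey.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}
splits = {"train": (X_train, y_train), "test": (X_test, y_test)}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    for split, (X_part, y_part) in splits.items():
        accuracy = accuracy_score(y_part, model.predict(X_part))
        # AUC needs a score per row, here the probability of class 1.
        auc = roc_auc_score(y_part, model.predict_proba(X_part)[:, 1])
        scores[(name, split)] = (accuracy, auc)
```

Comparing the train and test entries of each model is what supports a statement about overfitting: a large drop from train to test would indicate it.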

1.5. Apply KNN Model and Naïve Bayes Model. Interpret the results.

KNN Model
Test
Accuracy: 82.67%
AUC: 0.882
Hyperparameters used:

Train
Accuracy: 85.67%
AUC: 0.982
There is slight overfitting: the error on the test set is slightly greater than on the train set.
No hyperparameters were tuned for this model; it was run with its default settings.

Naïve Bayes Model

Test
Accuracy: 82.3%
AUC: 0.876
Hyperparameters used:

Train
Accuracy: 83.4%
AUC: 0.889

There is no overfitting or underfitting, although the error on the test set is slightly greater than
on the train set. No hyperparameters were tuned for this model.
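For reference, scikit-learn's KNeighborsClassifier does expose hyperparameters, most notably n_neighbors. The sketch below, on synthetic data, shows how the train/test gap can be inspected for a few values of k alongside Gaussian Naive Bayes. The choice of k values and of GaussianNB as the Naive Bayes variant are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the survey.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

def train_test_accuracy(model):
    """Fit on the training part and return (train accuracy, test accuracy)."""
    model.fit(X_train, y_train)
    return (
        accuracy_score(y_train, model.predict(X_train)),
        accuracy_score(y_test, model.predict(X_test)),
    )

results = {}
for k in (1, 5, 15):
    # KNN is distance-based, so the features are scaled inside the pipeline.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    results[f"KNN k={k}"] = train_test_accuracy(knn)

results["Naive Bayes"] = train_test_accuracy(GaussianNB())
```

With k=1 every training point is its own nearest neighbour, so the training accuracy is perfect and says nothing about generalisation; larger k usually narrows the gap between train and test.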

1.6. Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.

Bagging

A Random Forest was fitted first and then used as the base estimator of the Bagging classifier.

Hyperparameters used for model tuning:

base estimator=RandomForestClassifier
max_depth=10,
min_samples_leaf=15,
min_samples_split=30,
n_estimators=50,
random_state=1

Because the initial model overfitted, GridSearchCV was used to tune it.

Test
Accuracy: 81.5%
AUC: 0.888
Train
Accuracy: 84.6%
AUC: 0.909
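One way to combine these pieces, sketched on synthetic data: tune the Random Forest with GridSearchCV, then wrap the best estimator in a BaggingClassifier. The grid values, the number of folds, the scoring metric and the number of bagged estimators are all assumptions; only n_estimators=50 and random_state=1 for the forest come from the list above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the survey.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Small illustrative grid; larger leaves and shallower trees limit overfitting.
param_grid = {"max_depth": [5, 10], "min_samples_leaf": [15, 25]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=1),
    param_grid,
    cv=3,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
bagging = BaggingClassifier(search.best_estimator_, n_estimators=10, random_state=1)
bagging.fit(X_train, y_train)

train_acc = bagging.score(X_train, y_train)
test_acc = bagging.score(X_test, y_test)
```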

Boosting

Hyperparameters used for model tuning:

Test
Accuracy: 89.9%
AUC: 0.884

Train
Accuracy: 83.4%
AUC: 0.902

The model initially overfitted and was therefore tuned to get better results.
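The report does not name the boosting algorithm or its tuned settings. As an illustration only, the sketch below assumes scikit-learn's GradientBoostingClassifier and compares the default model with a more constrained one (fewer, shallower trees and a lower learning rate), which is a typical way to reduce overfitting. All parameter values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the survey.
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

def train_test_auc(model):
    """Fit on the training part and return (train AUC, test AUC)."""
    model.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return train_auc, test_auc

default_auc = train_test_auc(GradientBoostingClassifier(random_state=1))
constrained_auc = train_test_auc(
    GradientBoostingClassifier(
        n_estimators=50, max_depth=2, learning_rate=0.05, random_state=1
    )
)

# The gap between train and test AUC is the overfitting signal to watch.
default_gap = default_auc[0] - default_auc[1]
constrained_gap = constrained_auc[0] - constrained_auc[1]
```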



1.7. Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models
and write inference which model is best/optimized.

Logistic Regression
Test

Classification Report

Confusion Matrix

ROC Curve

Train

Classification Report
Confusion Matrix

ROC Curve

Linear Discriminant Analysis


Test

Classification Report

Confusion Matrix

ROC Curve

Train

Classification Report

Confusion Matrix

ROC Curve

KNN Model
Test

Classification Report

Confusion Matrix

ROC Curve

Train

Classification Report

Confusion Matrix

ROC Curve

Naïve Bayes Model


Test

Classification Report

Confusion Matrix

ROC Curve

Train

Classification Report

Confusion Matrix

ROC Curve

Bagging
Test

Classification Report

Confusion Matrix

ROC Curve

Train

Classification Report

Confusion Matrix

ROC Curve

Boosting
Train

Classification Report

Confusion Matrix

ROC Curve
Test

Classification Report

Confusion Matrix

ROC Curve

For Train data

Model                Accuracy  Recall  Precision  AUC score
Logistic Regression  0.84      0.71    0.75       0.891
LDA                  0.83      0.65    0.74       0.890
KNN                  0.87      0.72    0.77       0.928
NB                   0.83      0.69    0.72       0.889
Bagging              0.85      0.81    0.61       0.902
Boosting             0.84      0.78    0.61       0.909

For Test data

Model                Accuracy  Recall  Precision  AUC score
Logistic Regression  0.82      0.72    0.76       0.881
LDA                  0.83      0.73    0.76       0.888
KNN                  0.83      0.68    0.78       0.882
NB                   0.822     0.73    0.74       0.876
Bagging              0.82      0.60    0.80       0.882
Boosting             0.81      0.64    0.75       0.888
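For reference, the four columns in these tables can be derived from a confusion matrix and from the predicted scores. The sketch below uses only the Python standard library; the counts and scores are made up, and treating class 1 (Conservative under the encoding in 1.3) as the positive class is an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, recall and precision for the positive class."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }

def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Made-up confusion-matrix counts for 200 test rows.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)

# Made-up labels and predicted probabilities for four rows.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this toy case the accuracy is 0.85 while the recall is only about 0.67, which shows why accuracy alone is not enough to compare models when one class is larger than the other.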

Observations
• KNN has the highest train accuracy (0.87); on the test set, KNN and LDA share the highest
accuracy (0.83).
• The underfitting and overfitting Bagging and Boosting models were tuned to get better results.
• The Boosting classifier reached the joint-highest test AUC of 0.888 (tied with LDA), with a train
AUC of 0.909.
• Overfitting was observed in the Random Forest that was used as the base estimator for Bagging, so
both models overfitted. They were tuned using GridSearchCV.

1.8 Based on these predictions, what are the insights?

• The Labour party has considerably more votes than the Conservative party.
• All models had good accuracy.
• Blair is rated higher than Hague: Blair scored an average of 3.3, while Hague scored
lower at an average of 2.74.
• Some people who gave a party low values still ended up voting for that party.
• Political knowledge is low for a large section of people, with an average of only 1.7.
• Political knowledge is highest in the 35-50 age range.
• Europe sentiment has not affected voters’ ratings of the economic conditions of household and
nation.
• The economic condition for household averaged 3.13 and the economic condition for nation
averaged 3.24.
• Voters with Europe sentiment favour the Conservative party.
• Younger people tend to vote Labour, and Labour receives a high share of votes across
all ages.
• Overfitting was observed in the Bagging model as well as in the Random Forest.
• The Boosting classifier is the best of the models used.

Problem 2
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United States of
America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

(Hint: use .words(), .raw(), .sents() for extracting counts)

2.1 Find the number of characters, words, and sentences for the mentioned
documents.

Roosevelt file:
Number of characters in file: 7571
Number of words in file: 1390
Number of sentences in file: 52

Kennedy file:
Number of characters in file: 7618
Number of words in file: 1819
Number of sentences in file: 68

Nixon file:
Number of characters in file: 9991
Number of words in file: 1819
Number of sentences in file: 67
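A sketch of how such counts can be produced. The helper speech_counts is hypothetical and works on any raw text, word list and sentence list; load_inaugural shows the NLTK corpus calls from the hint and is not executed here because it needs NLTK and a one-time nltk.download("inaugural"). The file id in the comment is assumed to follow the corpus naming.

```python
def speech_counts(raw, words, sents):
    """Count characters, word tokens and sentences of one speech."""
    return {
        "characters": len(raw),
        "words": len(words),
        "sentences": len(sents),
    }

def load_inaugural(fileid):
    """Load one speech from the NLTK inaugural corpus.

    Example file id (assumed): "1941-Roosevelt.txt".
    """
    from nltk.corpus import inaugural  # imported lazily; requires NLTK
    return inaugural.raw(fileid), inaugural.words(fileid), inaugural.sents(fileid)

# Tiny stand-in text so the example runs without the corpus.
raw = "Ask not what your country can do for you. Ask what you can do for your country."
words = raw.replace(".", " .").split()            # crude tokens, "." kept as a token
sents = [s for s in raw.split(".") if s.strip()]  # crude sentence split

counts = speech_counts(raw, words, sents)
```

With the corpus installed, counts for a speech would come from speech_counts(*load_inaugural(fileid)). Word counts depend on the tokenizer, for example on whether punctuation marks are counted as words.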

2.2 Remove all the stopwords from all three speeches.

Roosevelt

There were 871 words left after the stopwords were removed.

Kennedy

There were 904 words left after the stopwords were removed.

Nixon

There were 1094 words left after the stopwords were removed.
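Stopword removal can be sketched as a simple filter. The stopword set below is a tiny made-up list for illustration; with NLTK one would typically use nltk.corpus.stopwords.words("english") instead. Lower-casing the tokens and dropping punctuation are assumptions about the preprocessing, which the report does not describe.

```python
import string

# Tiny illustrative stopword list (an assumption, not the NLTK list).
STOPWORDS = {
    "the", "of", "and", "to", "a", "in", "we", "our", "is",
    "not", "what", "your", "can", "do", "for", "you",
}

def remove_stopwords(words, stopwords=STOPWORDS):
    """Lower-case tokens, drop punctuation-only tokens and stopwords."""
    cleaned = []
    for word in words:
        token = word.lower().strip(string.punctuation)
        if token and token not in stopwords:
            cleaned.append(token)
    return cleaned

words = ["Ask", "not", "what", "your", "country", "can", "do", "for", "you", "."]
kept = remove_stopwords(words)
```

The number of words left depends directly on these choices, so counts such as those above are only comparable when the same stopword list and punctuation handling are used for all three speeches.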

2.3 Which word occurs the most number of times in his inaugural address
for each president? Mention the top three words. (after removing the
stopwords)

Roosevelt’s speech used ‘nation’ most often. The top three word occurrences are given below.

Kennedy’s speech used ‘world’ most often. The top three word occurrences are given below.

Nixon’s speech used ‘America’ most often. The top three word occurrences are given below.
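Once the stopwords are removed, the most frequent words can be read off with collections.Counter. The token list below is made up for illustration and is not taken from any of the speeches; top_words is a hypothetical helper.

```python
from collections import Counter

def top_words(words, n=3):
    """Return the n most frequent words with their counts."""
    return Counter(words).most_common(n)

# Made-up cleaned tokens standing in for one speech.
cleaned = [
    "nation", "know", "nation", "spirit", "democracy",
    "nation", "know", "spirit", "life",
]

top3 = top_words(cleaned)
```

Words with equal counts are returned in the order in which they first appear, so ties at third place should be checked by hand before reporting a top three.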

2.4 Plot the word cloud of each of the speeches of the variable. (after
removing the stopwords)

Word cloud for Roosevelt’s speech



Word cloud for Kennedy’s speech



Word cloud for Nixon’s speech

