Anmol Sehgal ML

DATA ANALYSIS
Anmol Sehgal
PGP-DSBA
Online Sept-21
Date: 16/11/2022
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution
prohibited
Problem 1
You are hired by one of the leading news channels CNBE who wants to analyze recent elections. This survey
was conducted on 1525 voters with 9 variables. You have to build a model, to predict which party a voter will
vote for on the basis of the given information, to create an exit poll that will help in predicting overall win and
seats covered by a particular party.
Data Description
1. vote: Party choice: Conservative or Labour
2. age: in years
3. economic.cond.national: Assessment of current national economic conditions, 1 to 5.
4. economic.cond.household: Assessment of current household economic conditions, 1 to 5.
5. Blair: Assessment of the Labour leader, 1 to 5.
6. Hague: Assessment of the Conservative leader, 1 to 5.
7. Europe: an 11-point scale that measures respondents' attitudes toward European integration. High
scores represent ‘Eurosceptic’ sentiment.
8. political.knowledge: Knowledge of parties' positions on European integration, 0 to 3.
9. gender: female or male.
Database’s sample
Table 1. Database
1.1 Read the dataset. Do the descriptive statistics and
do the null value condition check. Write an inference
on it.
Data Information
Table 2. Data Information

The data has 1525 rows and 9 columns
Data Describe
Table 2. Data Describe
Null Check
Table 3:- Null sum per variable
Gender Distribution
Table 4:- Gender Distribution
There are 812 male voters and 713 female voters in the dataset
VOTES
Table 5:- Votes per parties
1063 respondents voted Labour, whereas only 462 voted Conservative
Duplicates and removal of duplicates
Table 6:- Duplicates
There are 8 duplicate rows, which we remove.

After removal, 0 duplicates remain
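The null check and duplicate removal described above can be sketched with pandas. The small frame below is a made-up stand-in for the survey table (an assumption, since the real file is not reproduced in this report); only `isnull`, `duplicated` and `drop_duplicates` are the point here.

```python
import pandas as pd

# Hypothetical miniature of the survey table, not the real data.
df = pd.DataFrame(
    {
        "vote": ["Labour", "Labour", "Conservative", "Labour", "Conservative"],
        "age": [43, 43, 36, 35, 36],
        "Blair": [4, 4, 2, 5, 2],
        "gender": ["female", "female", "male", "male", "male"],
    }
)

# Missing values per column.
null_counts = df.isnull().sum()

# Rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Keep the first copy of each repeated row and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)

print(null_counts)
print("duplicates before:", n_duplicates)
print("duplicates after:", int(deduped.duplicated().sum()))
```

On the real data the same three calls would presumably report the 8 duplicates mentioned above before removal and 0 afterwards.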
Skewness Check
Table 7:- Skewness
Observations
1. Gender is well balanced in the data, but the vote classes in the sample are unequally distributed.
2. The skewness of Blair and Europe is only just within the acceptable range; hence the data is asymmetric
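A skewness check of this kind can be reproduced with pandas' `DataFrame.skew`. The sketch below uses two made-up columns (an assumption, not the survey variables) to show how a symmetric and a right-tailed variable differ. The report does not state which range it treats as acceptable, so no threshold is applied here.

```python
import pandas as pd

# Hypothetical toy columns: one symmetric, one with a long right tail.
toy = pd.DataFrame(
    {
        "symmetric": [1, 2, 3, 4, 5],
        "right_tailed": [1, 1, 1, 2, 10],
    }
)

# Sample skewness per numeric column: 0 means symmetric,
# positive means a longer right tail, negative a longer left tail.
skewness = toy.skew(numeric_only=True)
print(skewness)
```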
1.2 Perform Univariate and Bivariate Analysis. Do
exploratory data analysis. Check for Outliers.
UNIVARIATE ANALYSIS
Boxplot
Histogram
BIVARIATE ANALYSIS
Pairplot
Graph 1 :- Pair plot with hue as vote
Graph 2 :- Pair plot with hue as vote
Graph 3:- Votes vs Age
Graph 4:- Votes vs economics cond national
Graph 5:- Votes vs political knowledge
Graph 6:- Votes vs Hague
Graph 7:- Votes vs Blair
Graph 8:- Votes vs economics cond household
Graph 9:- Votes vs gender
Correlation Heat Map
Insights
1. No variable is normally distributed. The only one close to normal is "age"
2. Outliers occur only in "economic.cond.national" and "economic.cond.household"
3. Voters younger than 30 or older than 80 vote more for Labour than for the Conservatives
4. The Conservative vote bank is concentrated in category "2" of economic.cond.national, while all voters in
category "4" voted for Labour
5. Voters with the highest political knowledge ("3") voted for the Conservatives, while those with a score of "2"
preferred Labour
6. Voters who rate Hague low tend to vote Labour, while those who rate him high prefer the Conservatives
7. The highest ratings of Blair come from Labour voters
8. Voters who rate economic.cond.household as "2" prefer the Conservatives, while those with good household
economic conditions vote for Labour
9. The share of female voters is comparatively higher for the Conservatives, while male voters are evenly distributed
1.3 Encode the data (having string values) for
Modelling. Is Scaling necessary here or not? Data
Split: Split the data into train and test
Table 8 :- Data after encoding
Table 9 :- Data info
We are splitting the data into train and test

Train = 70% of data
Test = 30% of data
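A 70/30 split like this is commonly done with scikit-learn's `train_test_split`. The sketch below runs on synthetic arrays (an assumption, not the survey data). The `random_state` value and the use of `stratify` to keep the class ratio equal in both parts are illustrative choices, not something the report confirms.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 3 features, 70% of labels in class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 70 + [0] * 30)

# 30% of rows go to the test set; stratify keeps the class ratio similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

print("train shape:", X_train.shape, "test shape:", X_test.shape)
print("share of class 1 in train/test:", y_train.mean(), y_test.mean())
```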
Scaling
Table 10:- After scaling
Scaling will be done for the following reasons:

1. There is high variance in the data
2. The range (max minus min) differs across most of the variables, e.g. age versus vote
3. We will use the Z-score scaling technique
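Z-score scaling replaces each value x with (x - mean) / standard deviation, so every column ends up with mean 0 and standard deviation 1. A minimal sketch with scikit-learn's `StandardScaler` follows, using made-up numbers. Fitting the scaler on the training rows only and reusing it for the test rows is a common convention assumed here; the report does not say whether it scaled before or after splitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: column 0 looks like an age, column 1 like a 1-to-5 rating.
train = np.array([[24.0, 1.0], [40.0, 3.0], [56.0, 5.0]])
test = np.array([[40.0, 2.0]])

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(train)

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # reuses the training mean and std

print(train_scaled)
print(test_scaled)
```

After scaling, age and the rating columns sit on the same scale, which matters most for distance-based models such as KNN.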
1.4 Apply Logistic Regression and LDA (linear
discriminant analysis)
Logistic Regression
Train Data
Accuracy
Classification Report
Confusion Matrix
Test Data
Accuracy
Confusion Matrix
Validation of Model
1. The accuracy of the model is good. It is neither under-fit nor over-fit
2. Here all of FN, FP, TN and TP are important because we are viewing the data with no bias towards
any party
3. There is not much difference between the test and train accuracy, hence the model is good
Linear Discriminant Analysis
Train Data
Accuracy
Confusion Matrix
Test Data
Accuracy
Confusion Matrix
Validation of Model
2. Here all of FN, FP, TN and TP are important because we are viewing the data with no bias towards any
party
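The train-versus-test comparison used to validate both models in this section can be sketched as follows. The data comes from `make_classification` (a synthetic assumption, not the survey), and the hyperparameters are scikit-learn defaults apart from `max_iter`, so the printed accuracies will not match the report's.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data with roughly a 30/70 class imbalance.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.3, 0.7], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}

# Fit each model and record (train accuracy, test accuracy).
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.3f} test={scores[name][1]:.3f}")
```

A small difference between the two numbers is what the validation notes above read as "neither under-fit nor over-fit".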
1.5 Apply KNN Model and Naïve Bayes Model. Interpret
the results.
KNN Model
Train Data
Accuracy
Confusion Matrix
Test Data
Accuracy
Confusion Matrix
Validation of Model
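2. Here all of FN, FP, TN and TP are important because we are viewing the data with no bias towards any
party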
3. There is an 8% difference between the test and train accuracy, which is less than 10%, hence the model is
acceptable
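KNN shows the largest train-test gap of the models in this report. One common cause of such a gap is a small number of neighbours, since with k = 1 every training point is its own nearest neighbour. The sketch below illustrates this on synthetic data (an assumption). The report does not state which k it used, so the values tried here are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled survey features.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Record (train accuracy, test accuracy) for a few neighbourhood sizes.
gaps = {}
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc = knn.score(X_train, y_train)
    test_acc = knn.score(X_test, y_test)
    gaps[k] = (train_acc, test_acc)
    print(f"k={k:>2} train={train_acc:.3f} test={test_acc:.3f} "
          f"gap={train_acc - test_acc:.3f}")
```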
Naïve Bayes
Train Data
Accuracy
Confusion Matrix
Test Data
Accuracy
Confusion Matrix
Validation of Model
2. There is no difference between the test and train accuracy (well under the 10% limit), hence the model is good
1.6 Model Tuning, Bagging (Random Forest should be
applied for Bagging), and Boosting
Ada Boost
From the above, we will select the best input to get maximum accuracy. Following is the best input:-
Train Data
Accuracy
Confusion Matrix
Test Data
Accuracy
Confusion Matrix
Validation of Model
2. Here all of FN, FP, TN and TP are important because we are viewing the data with no bias towards
any party
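Selecting the input that gives maximum accuracy is typically done with a cross-validated grid search. The sketch below tunes an AdaBoost classifier with scikit-learn's `GridSearchCV` on synthetic data. The parameter grid, the 3-fold setting and the data are all assumptions, since the report does not list the values it searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Hypothetical grid: every combination is scored by 3-fold cross-validation.
param_grid = {"n_estimators": [25, 50], "learning_rate": [0.5, 1.0]}
search = GridSearchCV(
    AdaBoostClassifier(random_state=1), param_grid, cv=3, scoring="accuracy"
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

The same pattern applies to the Decision Tree and Random Forest models by swapping the estimator and the grid.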
Decision Tree
Train Data
Accuracy
Confusion Matrix
Test Data
Accuracy
Confusion Matrix
Validation of Model
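2. Here all of FN, FP, TN and TP are important because we are viewing the data with no bias towards any
party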
Random Forest
Train Data
Accuracy
Confusion Matrix
Test Data
Accuracy
Confusion Matrix
Validation of Model
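2. Here all of FN, FP, TN and TP are important because we are viewing the data with no bias towards any
party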
1.7 Performance Metrics: Check the performance of
Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC
score for each model. Final Model: Compare the
models and write inference which model is
best/optimized.
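For reference, the metrics reported per model in this section can all be derived from true labels, predicted labels and predicted probabilities. The sketch below uses eight made-up observations (an assumption) and treats class 1 as the positive class. Which party the report encoded as 1 is not stated, so that mapping is left open.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical labels and predicted probabilities of class 1.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_prob = [0.2, 0.6, 0.3, 0.9, 0.4, 0.8, 0.7, 0.55]

# Turn probabilities into class predictions with a 0.5 cut-off.
y_pred = [int(p >= 0.5) for p in y_prob]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "auc": roc_auc_score(y_true, y_prob),  # uses probabilities, not labels
}

print("TN, FP, FN, TP:", tn, fp, fn, tp)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that AUC is computed from the probabilities rather than the thresholded predictions, which is why it can differ from accuracy even on the same data.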
Logistic Regression
Train Data
Accuracy
Confusion Matrix
AUC and ROC
Test Data
Accuracy
Confusion Matrix
AUC and ROC
Linear Discriminant Analysis
Train Data
Accuracy
Confusion Matrix
AUC and ROC
Test Data
Accuracy
Confusion Matrix
AUC and ROC
KNN Model
Train Data
Accuracy
Confusion Matrix
AUC and ROC
Test Data
Accuracy
Confusion Matrix
AUC and ROC
Naïve Bayes
Train Data
Accuracy
Confusion Matrix
AUC and ROC
Test Data
Accuracy
Confusion Matrix
AUC and ROC
Ada Boost
Train Data
Accuracy
Confusion Matrix
AUC and ROC
Test Data
Accuracy
Confusion Matrix
AUC and ROC
Decision Tree
Train Data
Accuracy
Confusion Matrix
AUC and ROC
Test Data
Accuracy
Confusion Matrix
AUC and ROC
Random Forest
Train Data
Accuracy
Confusion Matrix
AUC and ROC
Test Data
Accuracy
Confusion Matrix
AUC and ROC
COMPARISON OF ALL MODELS
TRAIN DATA
Model Name            Accuracy   Precision   Recall   f1    AUC
Logistic Regression   84%        87%         91%      89%   89%
LDA                   84%        90%         87%      88%   89%
KNN                   86%        92%         89%      90%   92%
Naïve Bayes           83%        88%         88%      88%   89%
AdaBoost              84%        85%         92%      88%   90%
Decision Tree         84%        87%         90%      88%   90%
Random Forest         84%        85%         93%      89%   91%
TEST DATA
Model Name            Accuracy   Precision   Recall   f1    AUC
Logistic Regression   82%        87%         89%      88%   88%
LDA                   82%        88%         87%      87%   88%
KNN                   79%        85%         85%      85%   83%
Naïve Bayes           83%        87%         89%      88%   89%
AdaBoost              83%        87%         90%      88%   89%
Decision Tree         81%        87%         87%      87%   87%
Random Forest         83%        86%         91%      89%   90%
Conclusion
1. No model is under-fit or over-fit
2. All models are good, but the model that fits both test and train best is Random Forest
3. In Random Forest, accuracy is 84% and 83% for train and test respectively. The train accuracy is not
the highest (KNN reaches 86%), but the test accuracy ties for the highest and the train-test gap of one
percentage point is among the smallest; only Naïve Bayes (83% on both) has a smaller gap
4. Random Forest has the best recall on both train (93%) and test (91%) and the best test AUC (90%); its
train AUC (91%) is second only to KNN (92%)
5. Random Forest's precision and f1 are also good
6. Hence, the best model is Random Forest
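The comparison behind this conclusion can be checked directly from the accuracy and AUC columns of the two tables. The short sketch below copies those percentages and computes the train-test accuracy gap per model. The dictionary layout and variable names are illustrative choices.

```python
# (accuracy %, AUC %) per model, copied from the TRAIN DATA and TEST DATA tables.
train = {
    "Logistic Regression": (84, 89), "LDA": (84, 89), "KNN": (86, 92),
    "Naive Bayes": (83, 89), "AdaBoost": (84, 90),
    "Decision Tree": (84, 90), "Random Forest": (84, 91),
}
test = {
    "Logistic Regression": (82, 88), "LDA": (82, 88), "KNN": (79, 83),
    "Naive Bayes": (83, 89), "AdaBoost": (83, 89),
    "Decision Tree": (81, 87), "Random Forest": (83, 90),
}

# Train accuracy minus test accuracy, in percentage points.
gaps = {name: train[name][0] - test[name][0] for name in train}

best_test_auc = max(test, key=lambda name: test[name][1])
largest_gap = max(gaps, key=gaps.get)

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:<20} gap={gap} test AUC={test[name][1]}%")
print("best test AUC:", best_test_auc)
print("largest train-test gap:", largest_gap)
```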
1.8 Based on these predictions, what are the insights?
Insights
1. Labour's chances of winning are about double those of the Conservatives (1063 versus 462 votes in the sample)
2. Voters younger than 30 or older than 80 vote more for Labour than for the Conservatives
3. The Conservative vote bank is concentrated in category "2" of economic.cond.national, while all voters in
category "4" voted for Labour
4. Voters with the highest political knowledge ("3") voted for the Conservatives, while those with a score of
"2" preferred Labour
5. Voters who rate Hague low tend to vote Labour, while those who rate him high prefer the Conservatives
6. The highest ratings of Blair come from Labour voters
7. Voters who rate economic.cond.household as "2" prefer the Conservatives, while those with good
household economic conditions vote for Labour
8. The share of female voters is comparatively higher for the Conservatives, while male voters are evenly distributed
9. Voters with the least political knowledge voted for Labour
10. Voters with lower Eurosceptic sentiment voted for Labour
11. All models fit well, but the best model is Random Forest
Recommendations
1. The data is imbalanced towards Labour. More data should be collected
2. The Conservative party should hold rallies targeting voters with less political knowledge
Problem 2
In this project, we work on the inaugural corpus from nltk in Python. We look at the following
speeches of Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.1 Find the number of characters, words, and sentences for
the mentioned documents.
Characters
Words
Sentences
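With nltk, these three counts are usually taken from the corpus reader's `raw`, `words` and `sents` methods for each speech file. Since the corpus has to be downloaded first, the sketch below instead counts on a short made-up sample string using only the standard library. The regular expressions are a rough assumption and will not match nltk's tokenisation exactly.

```python
import re

# Hypothetical sample text, not taken from any of the three speeches.
sample = "We dare not forget. Let us begin anew! Ask what you can do."

# Characters: every character, including spaces and punctuation.
n_chars = len(sample)

# Words: runs of letters and apostrophes (punctuation is not counted as a word).
words = re.findall(r"[A-Za-z']+", sample)

# Sentences: split after '.', '!' or '?' when followed by whitespace.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", sample) if s]

print("characters:", n_chars)
print("words:", len(words))
print("sentences:", len(sentences))
```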
2.2 Remove all the stopwords from all three speeches
2.3 Which word occurs the most number of times in his
inaugural address for each president? Mention the top
three words. (after removing the stopwords)
The most common words in the Roosevelt, Kennedy and Nixon speeches were "nation", "let" and "us" respectively
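Stopword removal followed by a frequency count can be sketched with `collections.Counter`. The stopword set below is a tiny hypothetical subset (the report presumably used nltk's English stopword list), and the sample sentence is made up, so the output only illustrates the mechanics.

```python
import re
from collections import Counter

# Hypothetical small stopword subset; a real list is much longer.
STOPWORDS = {"the", "of", "and", "to", "a", "in", "we", "is", "our", "for"}


def top_words(text, n=3):
    """Return the n most frequent non-stopword tokens with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [token for token in tokens if token not in STOPWORDS]
    return Counter(kept).most_common(n)


sample = (
    "The nation and the people of the nation know the spirit "
    "of the nation and the people."
)
top = top_words(sample)
print(top)
```

Lower-casing before counting matters here: without it "Nation" and "nation" would be counted as different words.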
2.4 Plot the word cloud of each of the speeches of the
variable. (after removing the stopwords)
Word cloud of Roosevelt
Word cloud of Kennedy
Word cloud of Nixon
-------------------THE END-------------------
