
DATA ANALYSIS

Anmol Sehgal
PGP-DSBA
Online Sept-21
Date: 16/11/2022

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution
prohibited
Problem 1
You are hired by one of the leading news channels CNBE who wants to analyze recent elections. This survey
was conducted on 1525 voters with 9 variables. You have to build a model, to predict which party a voter will
vote for on the basis of the given information, to create an exit poll that will help in predicting overall win and
seats covered by a particular party.

Data Description
1. vote: Party choice: Conservative or Labour
2. age: in years
3. economic.cond.national: Assessment of current national economic conditions, 1 to 5.
4. economic.cond.household: Assessment of current household economic conditions, 1 to 5.
5. Blair: Assessment of the Labour leader, 1 to 5.
6. Hague: Assessment of the Conservative leader, 1 to 5.
7. Europe: an 11-point scale that measures respondents' attitudes toward European integration. High
scores represent ‘Eurosceptic’ sentiment.
8. political.knowledge: Knowledge of parties' positions on European integration, 0 to 3.
9. gender: female or male.

Database’s sample

Table 1. Database

1.1 Read the dataset. Do the descriptive statistics and
do the null value condition check. Write an inference
on it.

Data Information

Table 2. Data Information


Data has 1525 rows and 9 columns

Data Describe

Table 2. Data Describe

Null Check

Table 3:- Null sum per variable

Gender Distribution

Table 4:- Gender Distribution

There are 812 male voters and 713 female voters in the dataset

VOTES

Table 5:- Votes per parties

1063 people voted for Labour whereas only 462 voted for the Conservatives
Duplicates and removal of duplicates

Table 6:- Duplicates

There are 8 duplicates. We shall be removing them.

After treatment, the number of duplicates is 0.
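
The null check and duplicate treatment described above can be sketched with pandas. The report does not show its code, so this is a minimal illustration on a hypothetical miniature frame; the column values and the names `null_counts`, `n_duplicates` and `df_clean` are assumptions, not the author's.

```python
import pandas as pd

# Hypothetical miniature of the election data (the real file is not shown in the report).
df = pd.DataFrame({
    "vote": ["Labour", "Labour", "Conservative", "Labour", "Labour"],
    "age": [43, 36, 35, 43, 24],
    "Blair": [4, 4, 5, 4, 2],
    "gender": ["female", "male", "male", "female", "female"],
})

# Null check: number of missing values per column.
null_counts = df.isnull().sum()

# Duplicate check: rows that repeat an earlier row exactly.
n_duplicates = int(df.duplicated().sum())

# Treatment: keep the first copy of each row and rebuild the index.
df_clean = df.drop_duplicates().reset_index(drop=True)

print(null_counts)
print("duplicates before:", n_duplicates)
print("duplicates after:", int(df_clean.duplicated().sum()))
```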

Skewness Check

Table 7:- Skewness

Observations
1. Gender is fairly evenly distributed in the sample, but the vote is not: Labour voters far outnumber
Conservative voters.
2. Only Blair and Europe have skewness within the acceptable range; hence the data is asymmetric
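
A skewness table like the one above can be produced with pandas' `DataFrame.skew`. This is a sketch on made-up columns, not the report's data; the report does not state which threshold it treats as "acceptable", so none is assumed here.

```python
import pandas as pd

# Hypothetical columns: one symmetric, one with a long right tail.
df = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 1, 2, 10],
})

# Sample skewness per column: about 0 for symmetric data,
# positive for a right tail, negative for a left tail.
skewness = df.skew()
print(skewness)
```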

1.2 Perform Univariate and Bivariate Analysis. Do
exploratory data analysis. Check for Outliers.

UNIVARIATE ANALYSIS
Boxplot

Histogram

BIVARIATE ANALYSIS
Pairplot

Graph 1 :- Pair plot with hue as vote

Graph 2 :- Pair plot with hue as vote

Graph 3:- Votes vs Age

Graph 4:- Votes vs economics cond national

Graph 5:- Votes vs political knowledge

Graph 6:- Votes vs Hague

Graph 7:- Votes vs Blair

Graph 8:- Votes vs economics cond household

Graph 9:- Votes vs gender

Correlation Heat Map

Insights
1. None of the variables is normally distributed. The only one that is near normal is "age"
2. Outliers appear only in "economic.cond.national" and "economic.cond.household"
3. Voters younger than 30 or older than 80 vote more for Labour than for the Conservatives
4. The Conservative vote bank is concentrated in the "2nd category" of economic.cond.national, while all of
the "4th category" voted for Labour
5. Those with the highest political knowledge ("3") voted for the Conservatives, while those with a
knowledge score of "2" preferred Labour
6. Lower ratings of Hague are found among Labour voters, while those who rated him higher preferred the
Conservatives
7. The highest ratings of Blair are among Labour voters
8. Those who rate economic.cond.household as "2" prefer the Conservatives, while those with good
household economic conditions vote for Labour
9. The proportion of females is comparatively higher among Conservative voters, while males are evenly
distributed
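
The outlier check behind the boxplots can also be done numerically. The sketch below applies the usual boxplot rule (points beyond 1.5 × IQR from the quartiles) to a hypothetical 1-to-5 rating column; the values are illustrative and are not taken from the survey.

```python
import pandas as pd

# Hypothetical 1-to-5 ratings, loosely shaped like an economic-condition score.
ratings = pd.Series([1, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5])

q1 = ratings.quantile(0.25)
q3 = ratings.quantile(0.75)
iqr = q3 - q1

# Boxplot whisker limits: anything outside is drawn as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = ratings[(ratings < lower) | (ratings > upper)]

print("limits:", lower, upper)
print("outliers:", outliers.tolist())
```

On an ordinal scale this narrow, a single low category can fall outside the whiskers, so such "outliers" are usually valid responses rather than errors.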

1.3 Encode the data (having string values) for
Modelling. Is Scaling necessary here or not? Data
Split: Split the data into train and test

Table 8 :- Data after encoding

Table 9 :- Data info

We are splitting the data into train and test:

Train = 70% of data
Test = 30% of data
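
A minimal sketch of the encoding and the 70/30 split, assuming pandas and scikit-learn. The toy frame, the use of categorical codes, the `random_state` and the stratified split are all assumptions; the report does not say which encoder or split settings were used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 20-row frame with the two string columns named in the data description.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 10,
    "age": list(range(30, 50)),
    "gender": ["female", "male", "male", "female"] * 5,
})

# Encode string columns as integer codes (categories are ordered alphabetically,
# so Conservative -> 0, Labour -> 1 and female -> 0, male -> 1).
for col in ["vote", "gender"]:
    df[col] = pd.Categorical(df[col]).codes

X = df.drop(columns="vote")
y = df["vote"]

# 70% train, 30% test; stratify keeps the party proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(len(X_train), len(X_test))
```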

Scaling

Table 10:- After scaling

Scaling will be done because:

1. There is high variance in the data
2. The range (max minus min) differs for most of the variables. Ex:- Age vs vote
3. We will be using the Z-Score scaling technique
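
Z-score scaling replaces each value x with (x - mean) / standard deviation. A sketch using scikit-learn's `StandardScaler` on made-up numbers follows; fitting the scaler on the train set only and reusing it on the test set is an assumption about good practice, not something the report states.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: column 0 looks like age, column 1 like a 1-to-5 rating.
X_train = np.array([[20.0, 1.0], [40.0, 3.0], [60.0, 5.0]])
X_test = np.array([[40.0, 3.0], [80.0, 1.0]])

scaler = StandardScaler()
# Learn mean and standard deviation from the train set only.
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same train-set statistics to the test set.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```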

1.4 Apply Logistic Regression and LDA (linear
discriminant analysis)

Logistic Regression

Train Data

Accuracy

Classification Report

Confusion Matrix

Test Data

Accuracy

Classification Report

Confusion Matrix

Validation of Model
1. The accuracy of the model is good. It is neither under-fit nor over-fit
2. FN, FP, TN and TP are all equally important here because we view the data with no bias towards
either party
3. There is little difference between the train and test accuracy, hence the model is good

Linear Discriminant Analysis

Train Data

Accuracy

Classification Report

Confusion Matrix

Test Data

Accuracy

Classification Report

Confusion Matrix

Validation of Model
1. The accuracy of the model is good. It is neither under-fit nor over-fit
2. FN, FP, TN and TP are all equally important here because we view the data with no bias towards either
party
3. There is little difference between the train and test accuracy, hence the model is good
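
The train-versus-test comparison used in these validations can be sketched as follows, assuming scikit-learn. The data is synthetic (`make_classification`), so the numbers will not match the report; the `max_iter` setting and the `results` dictionary are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded, scaled election data (8 predictors).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_pred = model.predict(X_test)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, test_pred)
    results[name] = {
        "train": train_acc,
        "test": test_acc,
        "gap": train_acc - test_acc,  # a large positive gap hints at over-fitting
        "confusion": confusion_matrix(y_test, test_pred),
    }
    print(f"{name}: train={train_acc:.2f} test={test_acc:.2f}")
```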

1.5 Apply KNN Model and Naïve Bayes Model. Interpret
the results.

KNN Model

Train Data

Accuracy

Classification Report

Confusion Matrix

Test Data

Accuracy

Classification Report

Confusion Matrix

Validation of Model
1. The accuracy of the model is good. It is neither under-fit nor over-fit
2. FN, FP, TN and TP are all equally important here because we view the data with no bias towards either
party
3. The difference between train and test accuracy is 8%, which is less than 10%, hence the model is
acceptable
Naïve Bayes

Train Data

Accuracy

Classification Report

Confusion Matrix

Test Data

Accuracy

Classification Report

Confusion Matrix

Validation of Model
1. The accuracy of the model is good. It is neither under-fit nor over-fit
2. There is no difference between the train and test accuracy (both 83%), hence the model is good
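
A comparable sketch for the two models in this section, again on synthetic data. `n_neighbors=5` and the Gaussian variant of Naïve Bayes are assumptions; the report does not state which settings it used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled election data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),  # k is an assumed value
    "Naive Bayes": GaussianNB(),
}

gaps = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)  # mean accuracy
    test_acc = model.score(X_test, y_test)
    gaps[name] = train_acc - test_acc
    print(f"{name}: train={train_acc:.2f} test={test_acc:.2f} gap={gaps[name]:.2f}")
```

KNN memorises the training points, so it tends to show a larger train-test gap than Naïve Bayes, which is consistent with the gaps reported in this section.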

1.6 Model Tuning, Bagging (Random Forest should be
applied for Bagging), and Boosting

Ada Boost

From the above, we will select the best input to get maximum accuracy. Following is the best input:-

Train Data

Accuracy

Classification Report

Confusion Matrix

Test Data

Accuracy
Classification Report

Confusion Matrix

Validation of Model
1. The accuracy of the model is good. It is neither under-fit nor over-fit
2. FN, FP, TN and TP are all equally important here because we view the data with no bias towards
either party
3. There is little difference between the train and test accuracy, hence the model is good

Decision Tree

From the above, we will select the best input to get maximum accuracy. Following is the best input:-

Train Data

Accuracy

Classification Report

Confusion Matrix

Test Data

Accuracy

Classification Report

Confusion Matrix

Validation of Model
1. The accuracy of the model is good. It is neither under-fit nor over-fit
2. FN, FP, TN and TP are all equally important here because we view the data with no bias towards either
party
3. There is little difference between the train and test accuracy, hence the model is good

Random Forest

From the above, we will select the best input to get maximum accuracy. Following is the best input:-

Train Data

Accuracy

Classification Report

Confusion Matrix

Test Data

Accuracy

Classification Report

Confusion Matrix

Validation of Model
1. The accuracy of the model is good. It is neither under-fit nor over-fit
2. FN, FP, TN and TP are all equally important here because we view the data with no bias towards either
party
3. There is little difference between the train and test accuracy, hence the model is good
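
The report does not show how the "best input" for each tuned model was found. One common approach is a cross-validated grid search; the sketch below assumes scikit-learn's `GridSearchCV` with a Random Forest on synthetic data. The parameter grid, `cv=3` and the scoring metric are illustrative assumptions, not the settings used in the report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=1)

# Hypothetical grid; kept small so the search runs quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
print("best parameters:", best_params)
print("cross-validated accuracy:", round(search.best_score_, 3))
```

The same pattern applies to AdaBoost and Decision Tree by swapping the estimator and its grid.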

1.7 Performance Metrics: Check the performance of
Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC
score for each model. Final Model: Compare the
models and write inference which model is
best/optimized.

Logistic Regression

Train Data

Accuracy

Classification Report

Confusion Matrix

AUC and ROC

Test Data

Accuracy

Classification Report

Confusion Matrix

AUC and ROC

Linear Discriminant Analysis

Train Data

Accuracy

Classification Report

Confusion Matrix

AUC and ROC

Test Data

Accuracy

Classification Report

Confusion Matrix

AUC and ROC

KNN Model

Train Data

Accuracy

Classification Report

Confusion Matrix

AUC and ROC

Test Data

Accuracy

Classification Report

Confusion Matrix

AUC and ROC

Naïve Bayes

Train Data

Accuracy

Classification Report

Confusion Matrix

AUC and ROC

Test Data

Accuracy

Classification Report

Confusion Matrix

AUC and ROC

Ada Boost

Train Data

Accuracy

Classification Report

Confusion Matrix
AUC and ROC

Test Data

Accuracy

Classification Report

Confusion Matrix

AUC and ROC

Decision Tree

Train Data

Accuracy

Classification Report

Confusion Matrix

AUC and ROC

Test Data

Accuracy

Classification Report

Confusion Matrix

AUC and ROC

Random Forest

Train Data
Accuracy

Classification Report

Confusion Matrix

AUC and ROC

Test Data

Accuracy

Classification Report

Confusion Matrix

AUC and ROC

COMPARISON OF ALL MODELS

TRAIN DATA
Model Name           Accuracy  Precision  Recall  f1   AUC
Logistic Regression  84%       87%        91%     89%  89%
LDA                  84%       90%        87%     88%  89%
KNN                  86%       92%        89%     90%  92%
Naïve Bayes          83%       88%        88%     88%  89%
AdaBoost             84%       85%        92%     88%  90%
Decision Tree        84%       87%        90%     88%  90%
Random Forest        84%       85%        93%     89%  91%

TEST DATA
Model Name           Accuracy  Precision  Recall  f1   AUC
Logistic Regression  82%       87%        89%     88%  88%
LDA                  82%       88%        87%     87%  88%
KNN                  79%       85%        85%     85%  83%
Naïve Bayes          83%       87%        89%     88%  89%
AdaBoost             83%       87%        90%     88%  89%
Decision Tree        81%       87%        87%     87%  87%
Random Forest        83%       86%        91%     89%  90%
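
Each column of the tables above can be computed from the true labels and the predicted probabilities. The sketch below uses scikit-learn's metric functions on eight hand-made observations; the labels, probabilities and the 0.5 cut-off are assumptions chosen so the arithmetic is easy to follow, not values from the report.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical true classes (1 = positive) and predicted probabilities of class 1.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.90, 0.60, 0.70, 0.55])

# Turn probabilities into class predictions with a 0.5 cut-off.
y_pred = (y_prob >= 0.5).astype(int)

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 6 / 8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 4 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 4 / 5
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_prob)          # uses the probabilities, not the cut-off

print(accuracy, precision, recall, f1, auc)
```

AUC is computed from the ranking of the probabilities, which is why a model can have the best AUC without having the best accuracy at one particular cut-off.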

Conclusion
1. No model is under-fit or over-fit
2. All models are good, but the model that fits both test and train best is the Random Forest model
3. For Random Forest, accuracy is 84% and 83% for train and test respectively. These are not the highest
values, but the gap between train and test accuracy is among the smallest (1 percentage point) while
accuracy remains good
4. Random Forest has the best recall on both train and test (93% and 91%) and the best test AUC (90%);
on train, only KNN has a higher AUC (92% against 91%)
5. Random Forest's precision and f1 are also good
6. Hence, the best model is the Random Forest model

1.8 Based on these predictions, what are the insights?

Insights
1. Labour's chances of winning are roughly double those of the Conservative party
2. Voters younger than 30 or older than 80 vote more for Labour than for the Conservatives
3. The Conservative vote bank is concentrated in the "2nd category" of economic.cond.national, while all
of the "4th category" voted for Labour
4. Those with the highest political knowledge ("3") voted for the Conservatives, while those with a
knowledge score of "2" preferred Labour
5. Lower ratings of Hague are found among Labour voters, while those who rated him higher preferred
the Conservatives
6. The highest ratings of Blair are among Labour voters
7. Those who rate economic.cond.household as "2" prefer the Conservatives, while those with good
household economic conditions vote for Labour
8. The proportion of females is comparatively higher among Conservative voters, while males are evenly
distributed
9. People with the least political knowledge have voted for the Labour party
10. People with lower Eurosceptic sentiment have voted for the Labour party
11. All models fit well, but the best model is the Random Forest model

Recommendations
1. The data is imbalanced, with far more Labour voters than Conservative voters. More data should be
collected
2. The Conservative party should hold rallies that target people with less political knowledge

Problem 2
In this particular project, we are going to work on the inaugural corpora from the nltk in Python. We will be
looking at the following speeches of the Presidents of the United States of America:

1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for
the mentioned documents.

Characters

Words

Sentences
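
The three counts can be illustrated on a short stand-in text. The report works on the nltk inaugural corpus, which needs a separate download, so this sketch uses only the standard library with simple regular-expression tokenisation; the sample text and the splitting rules are assumptions, and nltk's own tokenisers will give somewhat different counts on real speeches.

```python
import re

# Hypothetical three-sentence text standing in for a speech.
text = "We are a nation. We shall not fail! Shall we try again?"

# Characters: every character, including spaces and punctuation.
n_chars = len(text)

# Words: runs of letters (apostrophes kept so contractions stay whole).
words = re.findall(r"[A-Za-z']+", text)

# Sentences: split after ., ! or ? when followed by whitespace.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

print(n_chars, len(words), len(sentences))
```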

2.2 Remove all the stopwords from all three speeches

2.3 Which word occurs the most number of times in his
inaugural address for each president? Mention the top
three words. (after removing the stopwords)

After removing the stopwords, the most common words in the Roosevelt, Kennedy and Nixon speeches were
"nation", "let" and "us" respectively.
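
Stopword removal followed by a frequency count can be sketched with the standard library alone. The stopword set and the sample text below are small hypothetical stand-ins; the report presumably used nltk's English stopword list on the full speeches.

```python
import re
from collections import Counter

# Hypothetical mini stopword list; a real list is much longer.
STOPWORDS = {"the", "a", "of", "and", "to", "is", "we", "our", "in"}

text = (
    "The nation is strong and the nation is free. "
    "We know our nation, and we know the spirit of freedom."
)

# Lower-case the text and keep only alphabetic tokens.
tokens = re.findall(r"[a-z]+", text.lower())

# Drop stopwords, then count what remains.
filtered = [t for t in tokens if t not in STOPWORDS]
top_three = Counter(filtered).most_common(3)

print(top_three)
```

Lower-casing before filtering matters: otherwise "The" and "the" are counted as different words and the capitalised form escapes the stopword list.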

2.4 Plot the word cloud of each of the speeches of the
variable. (after removing the stopwords)

Word cloud of Roosevelt

Word cloud of Kennedy

Word cloud of Nixon

-------------------THE END-------------------
