Anmol Sehgal ML
ANALYSIS
Anmol Sehgal
PGP-DSBA
Online Sept-21
Date: 16/11/2022
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution
prohibited
Problem 1
You are hired by CNBE, one of the leading news channels, which wants to analyze recent elections. A survey
was conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will
vote for on the basis of the given information, in order to create an exit poll that will help predict the overall
win and the seats covered by a particular party.
Data Description
1. vote: Party choice: Conservative or Labour
2. age: in years
3. economic.cond.national: Assessment of current national economic conditions, 1 to 5.
4. economic.cond.household: Assessment of current household economic conditions, 1 to 5.
5. Blair: Assessment of the Labour leader, 1 to 5.
6. Hague: Assessment of the Conservative leader, 1 to 5.
7. Europe: an 11-point scale that measures respondents' attitudes toward European integration. High
scores represent ‘Eurosceptic’ sentiment.
8. political.knowledge: Knowledge of parties' positions on European integration, 0 to 3.
9. gender: female or male.
Sample of the dataset
Table 1. Database
1.1 Read the dataset. Do the descriptive statistics and
do the null value condition check. Write an inference
on it.
Data Information
Data Describe
Null Check
Gender Distribution
There are 812 male voters and 713 female voters in the dataset
VOTES
1063 people voted for Labour, whereas only 462 voted for the Conservatives.
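The checks in this section (data information, descriptive statistics, null check and category counts) could be reproduced roughly as follows. This is a minimal sketch, not the report's actual code: the six-row DataFrame is invented for illustration and only borrows the column names from the data description.

```python
import pandas as pd

# Hypothetical miniature of the survey data; the values are invented.
df = pd.DataFrame({
    "vote": ["Labour", "Labour", "Conservative", "Labour", "Conservative", "Labour"],
    "age": [43, 36, 35, 24, 41, 47],
    "economic.cond.national": [3, 4, 4, 4, 2, 3],
    "economic.cond.household": [3, 4, 4, 2, 2, 4],
    "Blair": [4, 4, 5, 2, 1, 4],
    "Hague": [1, 4, 2, 1, 1, 4],
    "Europe": [2, 5, 3, 4, 6, 4],
    "political.knowledge": [2, 2, 2, 0, 2, 2],
    "gender": ["female", "male", "male", "female", "male", "male"],
})

df.info()                                  # column dtypes and non-null counts
summary = df.describe().T                  # count, mean, std, min, quartiles, max (numeric columns)
null_counts = df.isnull().sum()            # missing values per column
vote_counts = df["vote"].value_counts()    # class balance of the target
gender_counts = df["gender"].value_counts()

print(summary)
print(null_counts)
print(vote_counts)
print(gender_counts)
```

On the real data, the same `value_counts()` calls would be the source of the gender and vote totals quoted above.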
Duplicates and removal of duplicates
Skewness Check
Observations
1. The sample is reasonably balanced by gender, but it is imbalanced by vote (1063 Labour versus 462
Conservative).
2. Only Blair and Europe have skewness within the acceptable range; the remaining variables are asymmetric.
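The duplicate removal and skewness check described above can be sketched as below. The small DataFrame is invented for illustration, and no particular "acceptable range" for skewness is assumed, since the report does not state one.

```python
import pandas as pd

# Hypothetical data with one exact duplicate row (rows 0 and 1).
df = pd.DataFrame({
    "age": [30, 30, 45, 52, 61, 88],
    "Blair": [4, 4, 2, 4, 5, 1],
    "Europe": [2, 2, 6, 11, 8, 3],
})

n_duplicates = int(df.duplicated().sum())                 # rows identical to an earlier row
df_unique = df.drop_duplicates().reset_index(drop=True)   # keep the first copy only
skewness = df_unique.skew().sort_values()                 # sample skewness per numeric column

print(f"duplicates removed: {n_duplicates}")
print(skewness)
```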
1.2 Perform Univariate and Bivariate Analysis. Do
exploratory data analysis. Check for Outliers.
UNIVARIATE ANALYSIS
Boxplot
Histogram
BIVARIATE ANALYSIS
Pairplot
Graph 2: Pair plot with hue as vote
Graph 4: Vote vs economic.cond.national
Graph 8: Vote vs economic.cond.household
Correlation Heat Map
Insights
1. None of the variables is normally distributed; "age" is the closest to normal.
2. Outliers appear only in "economic.cond.national" and "economic.cond.household".
3. Voters younger than 30 or older than 80 vote more for Labour than for the Conservatives.
4. Conservative voters are concentrated in category "2" of economic.cond.national, while all voters in
category "4" voted for Labour.
5. Voters with the highest political knowledge ("3") favoured the Conservatives, while those with a knowledge
score of "2" preferred Labour.
6. Voters who rate Hague low tend to vote Labour, while those who rate him high prefer the Conservatives.
7. The highest ratings of Blair come from Labour voters.
8. Voters who rate their household economic conditions "2" prefer the Conservatives, while those who rate
them as good vote Labour.
9. The share of females is comparatively higher among Conservative voters, while males are evenly distributed.
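The outlier check behind insight 2 is presumably the boxplot rule (points beyond 1.5 times the interquartile range). A minimal sketch of that rule, assuming the conventional 1.5 multiplier and using invented values, is shown below; `iqr_outlier_counts` is a hypothetical helper, not a function from the report.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (frame < lower) | (frame > upper)   # compared column by column
    return mask.sum()

# Hypothetical values: a 1-to-5 rating with one low response, and a spread of ages.
demo = pd.DataFrame({
    "economic.cond.national": [3, 3, 4, 3, 2, 4, 3, 4, 3, 1],
    "age": [24, 35, 41, 47, 52, 58, 63, 70, 76, 88],
})
outlier_counts = iqr_outlier_counts(demo)
print(outlier_counts)
```

On a 1-to-5 ordinal scale most answers cluster on one or two values, so the interquartile range is narrow and a legitimate rating of 1 can be flagged. Such points are usually kept rather than treated.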
1.3 Encode the data (having string values) for
Modelling. Is Scaling necessary here or not? Data
Split: Split the data into train and test
Scaling
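A minimal sketch of the encode, split and scale steps follows. The 70:30 split, the `random_state`, the integer codes and the use of `StandardScaler` are assumptions for illustration, and the ten rows are invented. Scaling mainly matters for distance-based models such as KNN, because age spans a far wider range than the 1-to-5 ratings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical rows using a subset of the survey's columns.
df = pd.DataFrame({
    "vote": ["Labour", "Labour", "Conservative", "Labour", "Conservative",
             "Labour", "Labour", "Conservative", "Labour", "Conservative"],
    "age": [43, 36, 35, 24, 41, 47, 57, 77, 39, 70],
    "Europe": [2, 5, 3, 4, 6, 4, 11, 1, 11, 10],
    "gender": ["female", "male", "male", "female", "male",
               "male", "female", "male", "female", "male"],
})

# Encode the two string columns as integers (the code values are arbitrary).
df["vote"] = df["vote"].map({"Conservative": 0, "Labour": 1})
df["gender"] = df["gender"].map({"female": 0, "male": 1})

X = df.drop(columns="vote")
y = df["vote"]

# Stratify so both splits keep roughly the same Labour/Conservative ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training rows alone keeps information from the test set out of the model.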
Logistic Regression
Train Data
Accuracy
Classification Report
Confusion Matrix
Test Data
Accuracy
Classification Report
Confusion Matrix
Validation of Model
1. The model's accuracy is good (84% on train, 82% on test); it is neither under-fit nor over-fit.
2. FN, FP, TN and TP all matter here because the analysis is not biased towards either party.
3. Train and test accuracy differ only slightly, so the model generalizes well.
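The three outputs listed for each model (accuracy, classification report, confusion matrix) can be produced along these lines. This is a sketch on synthetic data from `make_classification`, not the report's code; the split ratio, `random_state` and the `evaluate` helper are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded survey data (8 predictors, imbalanced classes).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, weights=[0.3, 0.7], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def evaluate(fitted_model, X_part, y_part):
    """Return the three summaries reported for each model in this section."""
    pred = fitted_model.predict(X_part)
    return {
        "accuracy": accuracy_score(y_part, pred),
        "report": classification_report(y_part, pred),
        "confusion": confusion_matrix(y_part, pred),   # rows: actual, columns: predicted
    }

train_eval = evaluate(model, X_train, y_train)
test_eval = evaluate(model, X_test, y_test)

for name, result in (("Train", train_eval), ("Test", test_eval)):
    print(f"{name} accuracy: {result['accuracy']:.2f}")
    print(result["report"])
    print(result["confusion"])
```

Comparing the two accuracy values is what the validation notes above rely on when they call a model neither under-fit nor over-fit.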
LDA
Train Data
Accuracy
Classification Report
Confusion Matrix
Test Data
Accuracy
Classification Report
Confusion Matrix
Validation of Model
1. The model's accuracy is good (84% on train, 82% on test); it is neither under-fit nor over-fit.
2. FN, FP, TN and TP all matter here because the analysis is not biased towards either party.
3. Train and test accuracy differ only slightly, so the model generalizes well.
KNN Model
Train Data
Accuracy
Classification Report
Confusion Matrix
Test Data
Accuracy
Classification Report
Confusion Matrix
Validation of Model
1. The model's accuracy is good; it is neither under-fit nor over-fit.
2. FN, FP, TN and TP all matter here because the analysis is not biased towards either party.
3. Train and test accuracy differ by about 8 percentage points, which is within the 10% tolerance, so the
model is acceptable.
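KNN shows the widest train-test gap of the models here, and that gap depends strongly on the number of neighbours. The sketch below, on synthetic data, measures the gap for a few values of k. The values of k, the scaling step and the split are assumptions, not the report's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

results = {}
for k in (1, 5, 15):
    # Scale inside the pipeline so distances are not dominated by one feature.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(X_train, y_train)
    train_acc = knn.score(X_train, y_train)
    test_acc = knn.score(X_test, y_test)
    results[k] = {"train": train_acc, "test": test_acc, "gap": train_acc - test_acc}

for k, r in results.items():
    print(f"k={k:2d}  train={r['train']:.2f}  test={r['test']:.2f}  gap={r['gap']:.2f}")
```

With k=1 every training point is its own nearest neighbour, so train accuracy is perfect and the gap is at its widest. Larger k trades some train accuracy for a smaller gap.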
Naïve Bayes
Train Data
Accuracy
Classification Report
Confusion Matrix
Test Data
Accuracy
Classification Report
Confusion Matrix
Validation of Model
1. The model's accuracy is good; it is neither under-fit nor over-fit.
2. Train and test accuracy are the same (83%), well within the 10% tolerance, so the model is good.
Ada Boost
From the above, we select the input (parameter settings) that gives the maximum accuracy. The best input is:
Train Data
Accuracy
Classification Report
Confusion Matrix
Test Data
Accuracy
Classification Report
Confusion Matrix
Validation of Model
1. The model's accuracy is good (84% on train, 83% on test); it is neither under-fit nor over-fit.
2. FN, FP, TN and TP all matter here because the analysis is not biased towards either party.
3. Train and test accuracy differ only slightly, so the model generalizes well.
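Selecting "the best input" for AdaBoost is typically done with a cross-validated grid search. The sketch below shows one way to do it with `GridSearchCV`; the parameter grid, the 3-fold setting and the synthetic data are assumptions, since the report does not list the values it searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Hypothetical candidate values; a real search would usually cover a wider grid.
param_grid = {"n_estimators": [25, 50], "learning_rate": [0.5, 1.0]}

search = GridSearchCV(
    AdaBoostClassifier(random_state=1),
    param_grid,
    scoring="accuracy",   # pick the combination with the highest mean CV accuracy
    cv=3,
)
search.fit(X_train, y_train)

best_params = search.best_params_
best_model = search.best_estimator_   # refitted on the whole training split
test_accuracy = best_model.score(X_test, y_test)

print(best_params)
print(f"CV accuracy: {search.best_score_:.2f}, test accuracy: {test_accuracy:.2f}")
```

The same pattern applies to the Decision Tree and Random Forest sections by swapping the estimator and the grid.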
Decision Tree
From the above, we select the input (parameter settings) that gives the maximum accuracy. The best input is:
Train Data
Accuracy
Classification Report
Confusion Matrix
Test Data
Accuracy
Classification Report
Confusion Matrix
Validation of Model
1. The model's accuracy is good (84% on train, 81% on test); it is neither under-fit nor over-fit.
2. FN, FP, TN and TP all matter here because the analysis is not biased towards either party.
3. Train and test accuracy differ only slightly, so the model generalizes well.
Random Forest
From the above, we select the input (parameter settings) that gives the maximum accuracy. The best input is:
Train Data
Accuracy
Classification Report
Confusion Matrix
Test Data
Accuracy
Classification Report
Confusion Matrix
Validation of Model
1. The model's accuracy is good (84% on train, 83% on test); it is neither under-fit nor over-fit.
2. FN, FP, TN and TP all matter here because the analysis is not biased towards either party.
3. Train and test accuracy differ only slightly, so the model generalizes well.
Logistic Regression
Train Data
Accuracy
Classification Report
Confusion Matrix
AUC and ROC
Test Data
Accuracy
Classification Report
Confusion Matrix
AUC and ROC
LDA
Train Data
Accuracy
Classification Report
Confusion Matrix
AUC and ROC
Test Data
Accuracy
Classification Report
Confusion Matrix
AUC and ROC
KNN Model
Train Data
Accuracy
Classification Report
Confusion Matrix
AUC and ROC
Test Data
Accuracy
Classification Report
Confusion Matrix
AUC and ROC
Naïve Bayes
Train Data
Accuracy
Classification Report
Confusion Matrix
AUC and ROC
Test Data
Accuracy
Classification Report
Confusion Matrix
AUC and ROC
Ada Boost
Train Data
Accuracy
Classification Report
Confusion Matrix
AUC and ROC
Test Data
Accuracy
Classification Report
Confusion Matrix
AUC and ROC
Decision Tree
Train Data
Accuracy
Classification Report
Confusion Matrix
Test Data
Accuracy
Classification Report
Confusion Matrix
Random Forest
Train Data
Accuracy
Classification Report
Confusion Matrix
Test Data
Accuracy
Classification Report
Confusion Matrix
TRAIN DATA
Model Name            Accuracy  Precision  Recall  F1    AUC
Logistic Regression   84%       87%        91%     89%   89%
LDA                   84%       90%        87%     88%   89%
KNN                   86%       92%        89%     90%   92%
Naïve Bayes           83%       88%        88%     88%   89%
AdaBoost              84%       85%        92%     88%   90%
Decision Tree         84%       87%        90%     88%   90%
Random Forest         84%       85%        93%     89%   91%

TEST DATA
Model Name            Accuracy  Precision  Recall  F1    AUC
Logistic Regression   82%       87%        89%     88%   88%
LDA                   82%       88%        87%     87%   88%
KNN                   79%       85%        85%     85%   83%
Naïve Bayes           83%       87%        89%     88%   89%
AdaBoost              83%       87%        90%     88%   89%
Decision Tree         81%       87%        87%     87%   87%
Random Forest         83%       86%        91%     89%   90%
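A comparison table with these five columns can be assembled as sketched below. The data is synthetic and only two models are included to keep the example short. The `metric_row` helper and the choice of class 1 as the positive class are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

def metric_row(fitted_model, X_part, y_part):
    """Compute one row of the comparison table for a fitted classifier."""
    pred = fitted_model.predict(X_part)
    prob = fitted_model.predict_proba(X_part)[:, 1]   # probability of the positive class
    return {
        "Accuracy": accuracy_score(y_part, pred),
        "Precision": precision_score(y_part, pred),
        "Recall": recall_score(y_part, pred),
        "F1": f1_score(y_part, pred),
        "AUC": roc_auc_score(y_part, prob),            # AUC uses scores, not hard labels
    }

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}
rows = {name: metric_row(m.fit(X_train, y_train), X_test, y_test) for name, m in models.items()}
table = pd.DataFrame.from_dict(rows, orient="index").round(2)
print(table)
```

Calling `metric_row` with `X_train, y_train` instead gives the matching train-side table.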
Conclusion
1. No model is clearly under-fit or over-fit; KNN shows the largest gap between train and test accuracy
(86% vs 79%).
2. All models perform well, but the model that fits both train and test data best is Random Forest.
3. Random Forest's accuracy is 84% on train and 83% on test. These are not the highest values, but the
gap between train and test accuracy is only one percentage point, among the smallest of all models.
4. Random Forest has the best recall on both train (93%) and test (91%) and the best test AUC (90%); its
train AUC (91%) is second only to KNN (92%).
5. Random Forest's precision and F1 are also good.
6. Hence, the best model is Random Forest.
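The train-test comparison behind this conclusion can be checked with a few lines of arithmetic on the rounded accuracies from the summary tables (values in percent, copied from the tables):

```python
# Rounded accuracies (percent) taken from the TRAIN DATA and TEST DATA tables.
train_acc = {
    "Logistic Regression": 84, "LDA": 84, "KNN": 86, "Naïve Bayes": 83,
    "AdaBoost": 84, "Decision Tree": 84, "Random Forest": 84,
}
test_acc = {
    "Logistic Regression": 82, "LDA": 82, "KNN": 79, "Naïve Bayes": 83,
    "AdaBoost": 83, "Decision Tree": 81, "Random Forest": 83,
}

# Gap in percentage points; a large positive gap hints at over-fitting.
gaps = {model: train_acc[model] - test_acc[model] for model in train_acc}

for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{model:<20} train={train_acc[model]}%  test={test_acc[model]}%  gap={gap} pts")
```

Because the table values are rounded to whole percentages, gaps computed this way can differ by about a point from gaps computed on the unrounded scores.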
Insights
1. Labour received more than twice as many votes as the Conservatives in the sample (1063 vs 462).
2. Voters younger than 30 or older than 80 vote more for Labour than for the Conservatives.
3. Conservative voters are concentrated in category "2" of economic.cond.national, while all voters in
category "4" voted for Labour.
4. Voters with the highest political knowledge ("3") favoured the Conservatives, while those with a
knowledge score of "2" preferred Labour.
5. Voters who rate Hague low tend to vote Labour, while those who rate him high prefer the Conservatives.
6. The highest ratings of Blair come from Labour voters.
7. Voters who rate their household economic conditions "2" prefer the Conservatives, while those who rate
them as good vote Labour.
8. The share of females is comparatively higher among Conservative voters, while males are evenly distributed.
9. Voters with the least political knowledge voted for Labour.
10. Voters with lower Eurosceptic sentiment voted for Labour.
11. All models fit well, but the best model is Random Forest.
Recommendations
1. The data is imbalanced, with far more Labour voters than Conservative voters; more data should be collected.
2. The Conservative party should hold rallies targeting voters with low political knowledge.
Problem 2
In this project, we work with the inaugural corpus from NLTK in Python. We will look at the following
speeches of Presidents of the United States of America:
Characters
Words
Sentences
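The three counts above (characters, words, sentences) can be approximated without any corpus download, as in the sketch below. It uses plain regular expressions on an invented sample text. The report itself most likely used the NLTK corpus reader (`raw()`, `words()`, `sents()`), whose tokenization also counts punctuation marks, so its numbers may differ from this approximation.

```python
import re

def text_counts(text: str) -> dict:
    """Rough character, word and sentence counts for a piece of text."""
    words = re.findall(r"\w+", text)                       # runs of letters, digits, underscores
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "characters": len(text),                           # includes spaces and punctuation
        "words": len(words),
        "sentences": len(sentences),
    }

# Invented sample; not taken from any inaugural address.
sample = "We observe today a celebration. It is a new beginning! Shall we begin?"
counts = text_counts(sample)
print(counts)
```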
2.3 Which word occurs the most number of times in his
inaugural address for each president? Mention the top
three words. (after removing the stopwords)
The most common words in the Roosevelt, Kennedy and Nixon speeches were "nation", "let" and "us" respectively.
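Finding the top three words after stopword removal comes down to lowercasing, tokenizing, filtering and counting. The sketch below shows those steps on an invented sample. The small `STOPWORDS` set is an illustrative assumption, far shorter than the English list that ships with NLTK, and `top_words` is a hypothetical helper.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption, not NLTK's list).
STOPWORDS = {"the", "a", "of", "and", "to", "in", "we", "is", "that", "our", "for", "it"}

def top_words(text: str, stopwords: set, n: int = 3) -> list:
    """Return the n most frequent non-stopword tokens as (word, count) pairs."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [token for token in tokens if token not in stopwords]
    return Counter(kept).most_common(n)

# Invented sample text; not taken from any inaugural address.
sample = (
    "Let us begin. Let us explore the stars together. "
    "Let the nation and the world know that freedom endures, and let freedom guide us."
)
top_three = top_words(sample, STOPWORDS)
print(top_three)
```

Which words survive depends on the stopword list used: "us" is kept here only because it is not in the set above.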
Word cloud of Kennedy
Word cloud of Nixon
-------------------THE END-------------------