
TABLE OF CONTENTS
Executive Summary
Problem Statement 1
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
1.8 Based on these predictions, what are the insights?
Problem Statement 2
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords)
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)

Executive Summary
In this report, we apply several machine learning techniques to the given dataset, compare the resulting models, and draw insights for Problem Statement 1. For Problem Statement 2, we remove stopwords from three presidential inaugural speeches and plot a word cloud for each speech.

Problem Statement 1:

You are hired by one of the leading news channels, CNBE, which wants to analyze the recent elections. This survey was conducted on 1,525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on the basis of the given information, in order to create an exit poll that will help in predicting the overall win and the seats covered by a particular party.

Problem 1.1

Read the dataset. Do the descriptive statistics and do the null value condition check. Write an
inference on it.

Apart from the categorical columns 'vote' and 'gender', the attributes are numeric.

Sample of the dataset:

Table 1. Dataset Sample - Problem 1

There are 1,525 rows and 9 columns, i.e., attributes, in the given dataframe.

Exploratory Data Analysis

Let us check the types of variables in the data frame.

Checking the information and missing values in the data

Information of the data

Missing values

• There are 1,525 entries, indexed from 0 to 1524.
• There are 9 columns, which summarize the survey of the voters in the election.
• There are no missing values in the dataset.

Checking for Duplicate rows

Total number of duplicate rows = 8, i.e., there are 8 duplicate rows in our data.

Describing the data

Table 2. Statistical Analysis - Problem 1

• From the above table we can see that the age of the people who participated in voting lies in the range of 24 to 93.
• The ratings are in the range of 1-5 for Blair, Hague, the national economic condition, and the household economic condition.

Dropping unnecessary columns

New data frame

We have dropped the 'Unnamed: 0' column as it had no significance in our analysis.
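A minimal sketch of the steps in this section, assuming pandas is used and the data file is named Election_Data.csv (a hypothetical file name, since the report does not name the file):

```python
import pandas as pd

# Hypothetical file name; the actual data file is not named in this report.
df = pd.read_csv("Election_Data.csv")

print(df.shape)               # number of rows and columns
print(df.head())              # sample of the dataset (Table 1)
df.info()                     # data types and non-null counts
print(df.isnull().sum())      # null value condition check
print(df.duplicated().sum())  # number of duplicate rows
print(df.describe())          # descriptive statistics (Table 2)

# Drop the column that carries no information for the analysis.
df = df.drop("Unnamed: 0", axis=1)
```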

Problem 1.2

Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.

Univariate Analysis:

Fig 3. Univariate Analysis - Histogram

• We performed univariate analysis to check the distribution of the data using histograms, and also checked for outliers using box plots.
• We can see from the above histograms that the data is approximately normally distributed for all the attributes.
• There are outliers in the two economic condition columns (national and household). A sketch of these plots is given below.
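A minimal sketch of how these univariate plots can be produced (the dataframe name df and the use of seaborn are assumptions):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Numeric columns of the cleaned dataframe (df is assumed from Section 1.1).
num_cols = df.select_dtypes(include="number").columns

fig, axes = plt.subplots(len(num_cols), 2, figsize=(10, 3 * len(num_cols)))
for i, col in enumerate(num_cols):
    sns.histplot(df[col], kde=True, ax=axes[i, 0])  # distribution of each attribute
    sns.boxplot(x=df[col], ax=axes[i, 1])           # outlier check for the same attribute
    axes[i, 0].set_title(f"Histogram - {col}")
    axes[i, 1].set_title(f"Box plot - {col}")
plt.tight_layout()
plt.show()
```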

Bivariate Analysis

Fig 4. Bivariate Analysis(Pairplot) - Problem 1

Multi-Variate Analysis

Fig 5. Multivariate Analysis - Correlation plot

• From the above heatmap, we can infer that there is no significant correlation among the attributes in the dataset. A sketch of the pair plot and the correlation heatmap is given below.
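A minimal sketch of the bivariate pair plot and the correlation heatmap shown above (again assuming a dataframe named df and seaborn):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Bivariate relationships between all pairs of attributes.
sns.pairplot(df, diag_kind="kde")
plt.show()

# Correlation heatmap on the numeric columns only.
plt.figure(figsize=(8, 6))
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm")
plt.show()
```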

Outlier Treatment:

• As we could see from the box plots, there were outliers which needed to be treated. Below is the box plot after the treatment.

Fig 6. Box Plot - Outlier Treatment

• After the treatment, we can see that there are no outliers present. A sketch of one common way to perform such a treatment is given below.
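The report does not show the exact treatment used; the sketch below assumes IQR-based capping, a common approach, applied to the two economic condition columns (the dotted column names are assumptions):

```python
def cap_outliers_iqr(frame, column):
    """Cap values outside the 1.5 * IQR whiskers at the whisker limits."""
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    frame[column] = frame[column].clip(lower=lower, upper=upper)
    return frame

# Columns flagged by the box plots in Section 1.2 (names are assumptions).
for col in ["economic.cond.national", "economic.cond.household"]:
    df = cap_outliers_iqr(df, col)
```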

Problem 1.3

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data
Split: Split the data into train and test (70:30).

We have encoded the data using the one-hot encoding method for the categorical columns, i.e., vote and gender.

Dataset after Encoding

After encoding the data, we have split it into a training dataset and a test dataset in the ratio of 70:30 for further analysis.

Training Dataset Sample

Test Dataset Sample

Scaling is not strictly necessary for most of the models used here; it is, however, required for the distance-based KNN model. A sketch of the encoding and split is given below.
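A minimal sketch of the encoding and the 70:30 split (the lowercase column names and the one-hot target column 'vote_Labour' are assumptions about the encoded dataframe):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode the categorical columns; drop_first avoids redundant dummy columns.
df_encoded = pd.get_dummies(df, columns=["vote", "gender"], drop_first=True)

# Hypothetical target column produced by the encoding.
X = df_encoded.drop("vote_Labour", axis=1)
y = df_encoded["vote_Labour"]

# 70:30 train/test split; stratification on the target is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
```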

1.4 Apply Logistic Regression and LDA (linear discriminant analysis).

LDA Model:

Confusion Matrix for Training and Test Data :

Fig 7. Confusion Matrix - LDA

Classification Report for Training and Test Data :

Fig 8. Classification Report - LDA

AUC and ROC for Training and Test Data :

Fig 8. AUC and ROC Curve - LDA

Logistic Regression Model:

Confusion Matrix for Training and Test Data :

(First - Train, Second - Test)

Fig 9. Confusion Matrix - LR

Classification Report for Training and Test Data :

Fig 10. Classification Report - LR

AUC and ROC for Training and Test Data :


(First - Train, Second - Test)

Fig 8. AUC and ROC Curve - LR

• For both models, the AUC score is similar: approximately 0.89 on both the train and the test data sets.
• The accuracy is approximately 84% on the training set and 82% on the test set.
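A minimal sketch of fitting and evaluating the two models above with scikit-learn, using the train/test split from Section 1.3 (the solver settings are assumptions, not necessarily those used in this report):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    for split, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
        pred = model.predict(X_)
        proba = model.predict_proba(X_)[:, 1]  # probability of the positive class
        print(f"{name} - {split}")
        print(confusion_matrix(y_, pred))
        print(classification_report(y_, pred))
        print("AUC:", round(roc_auc_score(y_, proba), 3))
```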

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.

Naive Bayes:

Confusion Matrix and Classification report for Training Data :

Fig 11. Confusion Matrix & Classification Report - Naive Bayes - Train

Confusion Matrix and Classification report for Test Data :

Fig 12. Confusion Matrix & Classification Report - Naive Bayes - Test
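A minimal sketch of the Naive Bayes fit (GaussianNB is assumed, since the predictors are numeric after encoding):

```python
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import GaussianNB

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

for split, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = nb_model.predict(X_)
    print(f"Naive Bayes - {split}")
    print(confusion_matrix(y_, pred))
    print(classification_report(y_, pred))
```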

KNN:

• We have used a plot of misclassification error versus k (with the k value on the X-axis) to find the optimum k value for our analysis.

• Based on the plot, the misclassification error is lowest at K = 15, hence we proceed to build the model with K = 15. A sketch of this search is given below.
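A minimal sketch of the error-vs-k search and the final KNN fit (the scaling step, the odd-k-only grid, and the use of the test set for the error curve are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# KNN is distance based, so the predictors from Section 1.3 are scaled first.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Misclassification error for odd k values (evaluated on the test set here).
k_values = list(range(1, 30, 2))
errors = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    errors.append(1 - knn.score(X_test_s, y_test))

plt.plot(k_values, errors, marker="o")
plt.xlabel("k")
plt.ylabel("Misclassification error")
plt.show()

# Final model with the chosen value K = 15.
knn_final = KNeighborsClassifier(n_neighbors=15).fit(X_train_s, y_train)
```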

Confusion Matrix and Classification report for Training Data :

Fig 13. Confusion Matrix & Classification Report - KNN - Train

Confusion Matrix and Classification report for Test Data :

Fig 14. Confusion Matrix & Classification Report - KNN - Test

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final
Model: Compare the models and write inference which model is best/optimized.
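The report does not state how the tuning in the following subsections was performed; the sketch below assumes GridSearchCV with illustrative parameter grids for two of the models:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Illustrative grids; the report's own grids are not shown, so these are assumptions.
param_grids = {
    "LR": (LogisticRegression(max_iter=1000),
           {"C": [0.01, 0.1, 1, 10], "solver": ["lbfgs", "liblinear"]}),
    "KNN": (KNeighborsClassifier(),
            {"n_neighbors": [5, 9, 15, 21], "weights": ["uniform", "distance"]}),
}

best_models = {}
for name, (estimator, grid) in param_grids.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring="roc_auc", n_jobs=-1)
    search.fit(X_train, y_train)
    best_models[name] = search.best_estimator_
    print(name, search.best_params_, round(search.best_score_, 3))
```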

LDA Tuning:

Confusion Matrix for Training and Test Data :


(First - Train, Second - Test)

Fig 15. Confusion Matrix - Tuning LDA

Classification Report for Training and Test Data :

Fig 16. Classification Report - Tuning LDA

AUC and ROC for Training and Test Data :

Fig 17. AUC and ROC Curve - LDA Tuning

LR Tuning:

Confusion Matrix for Training and Test Data :


(First - Train, Second - Test)

Fig 18. Confusion Matrix - Tuning LR

Classification Report for Training and Test Data :

Fig 19. Classification Report - Tuning LR

AUC and ROC for Training and Test Data :

Training

Test
Fig 20. AUC and ROC Curve - LR tuning

KNN Tuning:

Confusion Matrix for Training and Test Data :

Fig 21. Confusion Matrix - KNN Tuning

AUC and ROC for Training and Test Data :

Training

Test
Fig 22. AUC and ROC Curve - KNN tuning

Bagging using Random Forest:

Confusion Matrix and Classification for Training Data :

Fig 23. Confusion Matrix & Classification Report - Train - Bagging

Confusion Matrix and Classification for Test Data :

Fig 24. Confusion Matrix & Classification Report - Test - Bagging

AUC and ROC for Training and Test Data :

Fig 25. AUC and ROC Curve - Bagging
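A minimal sketch of the bagging step; a Random Forest is used, as required by the problem statement, since it is itself a bagging ensemble of decision trees (the hyperparameters are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

rf_model = RandomForestClassifier(n_estimators=100, random_state=1)
rf_model.fit(X_train, y_train)

for split, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = rf_model.predict(X_)
    proba = rf_model.predict_proba(X_)[:, 1]
    print(f"Bagging (Random Forest) - {split}")
    print(confusion_matrix(y_, pred))
    print(classification_report(y_, pred))
    print("AUC:", round(roc_auc_score(y_, proba), 3))
```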

AdaBoost:

Confusion Matrix and Classification for Training Data :

Fig 26. Confusion Matrix & Classification Report - Train - AdaBoost

Confusion Matrix and Classification for Test Data :

Fig 27. Confusion Matrix & Classification Report - Test - AdaBoost

AUC and ROC for Training and Test Data :

First - Train, Second - Test

Fig 28. AUC and ROC Curve - AdaBoost

Gradient Boosting:

Confusion Matrix and Classification for Training Data :

Fig 29. Confusion Matrix & Classification Report - Train - Gradient Boosting

Confusion Matrix and Classification for Test Data :

Fig 30. Confusion Matrix & Classification Report - Test - Gradient Boosting

AUC and ROC for Training and Test Data :

First - Train, Second - Test

Fig 31. AUC and ROC Curve - Gradient Boosting
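A minimal sketch of the two boosting models above (default scikit-learn hyperparameters are assumed):

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

boosters = {
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
}

for name, model in boosters.items():
    model.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: train AUC = {train_auc:.3f}, test AUC = {test_auc:.3f}")
```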

1.8 Based on these predictions, what are the insights?
• The Gradient Boosting model is the best model, with an AUC score of 0.95 on the training data and 0.89 on the test data.
• We have used several different models, and each has its own advantages and disadvantages.
• Linear discriminant analysis makes stronger assumptions about the underlying data; logistic regression is therefore generally the more flexible and robust method when these assumptions are violated.
• A general difference between KNN and the other models is the large amount of computation KNN requires at prediction time. Naive Bayes is much faster than KNN because KNN defers most of its work to prediction time.
• Bagging and Boosting are two types of ensemble learning. Both reduce the variance of a single estimate by combining several estimates from different models, so the result tends to be a more stable model. This is reflected in the AUC scores of the individual models.

Insights:

• The age of the voters who participated falls in the range of 24 to 93; however, the 30-70 age group is more likely to participate in voting.
• The voters who gave 4 stars to Blair are largely the same voters who gave 2 stars to Hague.
• The Labour party receives more votes than the Conservative party.

Problem Statement 2

In this project, we work with the inaugural corpus from nltk in Python. We will be looking at the following speeches of the Presidents of the United States of America:

President Franklin D. Roosevelt in 1941

President John F. Kennedy in 1961

President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the mentioned documents.

The number of characters (char_count), words (word_count), and sentences (sents_count) was computed for each of the three speeches; a sketch of the computation is given below.
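A minimal sketch of these counts using nltk's inaugural corpus (the variable names mirror char_count, word_count, and sents_count mentioned above):

```python
import nltk
from nltk.corpus import inaugural

nltk.download("inaugural")
nltk.download("punkt")

speeches = ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]

for fileid in speeches:
    char_count = len(inaugural.raw(fileid))    # number of characters
    word_count = len(inaugural.words(fileid))  # number of word tokens
    sents_count = len(inaugural.sents(fileid)) # number of sentences
    print(fileid, char_count, word_count, sents_count)
```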

2.2 Remove all the stopwords from all three speeches.

With the help of the nltk.corpus package, we can use the stopwords list to remove the stopwords from the given speeches; a sketch of this step is given below.
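A minimal sketch of the stopword removal (lower-casing and punctuation removal are assumptions about the exact preprocessing used):

```python
import string

import nltk
from nltk.corpus import inaugural, stopwords

nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
speeches = ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]

def remove_stopwords(fileid):
    """Return the speech tokens, lower-cased, with stopwords and punctuation removed."""
    words = [w.lower() for w in inaugural.words(fileid)]
    return [w for w in words if w not in stop_words and w not in string.punctuation]

cleaned = {fileid: remove_stopwords(fileid) for fileid in speeches}
```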

Below is the output after removing the stopwords.

2.3 Which word occurs the most number of times in his inaugural address for each president? Mention
the top three words. (after removing the stopwords)

Top 3 words in Roosevelt Speech

Top 3 words in Kennedy's Speech

Top 3 words in Nixon's Speech
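A sketch of how the top three words can be obtained from the cleaned tokens produced in the previous sketch (using collections.Counter):

```python
from collections import Counter

for fileid, tokens in cleaned.items():
    top_three = Counter(tokens).most_common(3)  # (word, frequency) pairs
    print(fileid, top_three)
```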

2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)

Word cloud for Roosevelt Speech

Fig 32. Word cloud for Roosevelt Speech

Word cloud for Kennedy's Speech

Fig 33. Word cloud for Kennedy's Speech

Word cloud for Nixon's Speech

Fig 34. Word cloud for Nixon's Speech
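A minimal sketch of how each word cloud can be generated from the cleaned tokens (the wordcloud package is assumed to be available):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

for fileid, tokens in cleaned.items():
    wc = WordCloud(width=800, height=400, background_color="white").generate(" ".join(tokens))
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Word cloud - {fileid}")
    plt.show()
```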

