
TABLE OF CONTENTS
Executive Summary
Problem Statement 1
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
1.4 Apply Logistic Regression and LDA (linear discriminant analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
1.8 Based on these predictions, what are the insights?
Problem Statement 2
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords)
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)

Executive Summary
In this report, we apply several machine learning techniques to the given dataset, compare the resulting models, and draw insights for Problem Statement 1. For Problem Statement 2, we remove stopwords from three presidential inaugural speeches and plot a word cloud for each speech.

Problem Statement 1:

You are hired by one of the leading news channels, CNBE, which wants to analyze the recent elections. This survey was conducted on 1,525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on the basis of the given information, in order to create an exit poll that will help in predicting the overall win and the seats covered by a particular party.

Problem 1.1

Read the dataset. Do the descriptive statistics and do the null value condition check. Write an
inference on it.

Apart from the categorical columns 'vote' and 'gender', the attributes are numeric.

Sample of the dataset:

Table 1. Dataset Sample - Problem 1

There are 1,525 rows and 9 columns, i.e., attributes, in the given dataframe.

Exploratory Data Analysis

Let us check the types of variables in the data frame.

Checking the information and missing values in the data

Information of the data

Missing values

• There are 1,525 entries, indexed from 0 to 1524.
• There are 9 columns, which summarize the survey of the voters in the election.
• There are no missing values in the dataset.

Checking for Duplicate rows

Total number of duplicate rows = 8, i.e., there are 8 duplicate rows in our data.

Describing the data

Table 2. Statistical Analysis - Problem 1

• From the above table we can see that the age of the people who participated in voting lies in the range of 24 to 93.
• The ratings are in the range of 1-5 for Blair, Hague, the national economic condition, and the household economic condition.

Dropping unnecessary columns

New data frame

We have dropped the 'Unnamed: 0' column as it had no significance in our analysis.
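A minimal sketch of the steps in this section, assuming pandas is used and the data file is named Election_Data.csv (a hypothetical file name, since the report does not name the file):

```python
import pandas as pd

# Hypothetical file name; the actual data file is not named in this report.
df = pd.read_csv("Election_Data.csv")

print(df.shape)               # number of rows and columns
print(df.head())              # sample of the dataset (Table 1)
df.info()                     # data types and non-null counts
print(df.isnull().sum())      # null value condition check
print(df.duplicated().sum())  # number of duplicate rows
print(df.describe())          # descriptive statistics (Table 2)

# Drop the column that carries no information for the analysis.
df = df.drop("Unnamed: 0", axis=1)
```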

Problem 1.2

Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.

Univariate Analysis:

Fig 3. Univariate Analysis - Histogram

• We performed univariate analysis to check the distribution of the data using histograms, and also checked for outliers using box plots.
• We can see from the above histograms that the data is approximately normally distributed for all the attributes.
• There are outliers in the two economic condition columns (national and household). A sketch of these plots is given below.
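A minimal sketch of how these univariate plots can be produced (the dataframe name df and the use of seaborn are assumptions):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Numeric columns of the cleaned dataframe (df is assumed from Section 1.1).
num_cols = df.select_dtypes(include="number").columns

fig, axes = plt.subplots(len(num_cols), 2, figsize=(10, 3 * len(num_cols)))
for i, col in enumerate(num_cols):
    sns.histplot(df[col], kde=True, ax=axes[i, 0])  # distribution of each attribute
    sns.boxplot(x=df[col], ax=axes[i, 1])           # outlier check for the same attribute
    axes[i, 0].set_title(f"Histogram - {col}")
    axes[i, 1].set_title(f"Box plot - {col}")
plt.tight_layout()
plt.show()
```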

Bivariate Analysis

Fig 4. Bivariate Analysis(Pairplot) - Problem 1

Multi-Variate Analysis

Fig 5. Multivariate Analysis - Correlation plot

• From the above heatmap, we can infer that there is no significant correlation among the attributes in the dataset. A sketch of the pair plot and the correlation heatmap is given below.
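A minimal sketch of the bivariate pair plot and the correlation heatmap shown above (again assuming a dataframe named df and seaborn):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Bivariate relationships between all pairs of attributes.
sns.pairplot(df, diag_kind="kde")
plt.show()

# Correlation heatmap on the numeric columns only.
plt.figure(figsize=(8, 6))
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm")
plt.show()
```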

Outlier Treatment:

• As we could see from the box plots, there were outliers which needed to be treated. Below is the box plot after the treatment.

Fig 6. Box Plot - Outlier Treatment

• After the treatment, we can see that there are no outliers present. A sketch of one common way to perform such a treatment is given below.
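The report does not show the exact treatment used; the sketch below assumes IQR-based capping, a common approach, applied to the two economic condition columns (the dotted column names are assumptions):

```python
def cap_outliers_iqr(frame, column):
    """Cap values outside the 1.5 * IQR whiskers at the whisker limits."""
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    frame[column] = frame[column].clip(lower=lower, upper=upper)
    return frame

# Columns flagged by the box plots in Section 1.2 (names are assumptions).
for col in ["economic.cond.national", "economic.cond.household"]:
    df = cap_outliers_iqr(df, col)
```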

Problem 1.3

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data
Split: Split the data into train and test (70:30).

We have encoded the data using the one-hot encoding method for the categorical columns, i.e., vote and gender.

Dataset after Encoding

After encoding the data, we have split it into a training dataset and a test dataset in the ratio of 70:30 for further analysis.

Training Dataset Sample

Test Dataset Sample

Scaling is not strictly necessary for most of the models used here; it is, however, required for the distance-based KNN model. A sketch of the encoding and split is given below.
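A minimal sketch of the encoding and the 70:30 split (the lowercase column names and the one-hot target column 'vote_Labour' are assumptions about the encoded dataframe):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode the categorical columns; drop_first avoids redundant dummy columns.
df_encoded = pd.get_dummies(df, columns=["vote", "gender"], drop_first=True)

# Hypothetical target column produced by the encoding.
X = df_encoded.drop("vote_Labour", axis=1)
y = df_encoded["vote_Labour"]

# 70:30 train/test split; stratification on the target is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
```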

1.4 Apply Logistic Regression and LDA (linear discriminant analysis).

LDA Model:

Confusion Matrix for Training and Test Data :

Fig 7. Confusion Matrix - LDA

Classification Report for Training and Test Data :

Fig 8. Classification Report - LDA

AUC and ROC for Training and Test Data :

Fig 8. AUC and ROC Curve - LDA

Logistic Regression Model:

Confusion Matrix for Training and Test Data :

(First - Train, Second - Test)

Fig 9. Confusion Matrix - LR

Classification Report for Training and Test Data :

Fig 10. Classification Report - LR

AUC and ROC for Training and Test Data :


(First - Train, Second - Test)

Fig 8. AUC and ROC Curve - LR

• For both models, the AUC score is similar: approximately 0.89 on both the train and the test data sets.
• The accuracy is approximately 84% on the training set and 82% on the test set.
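A minimal sketch of fitting and evaluating the two models above with scikit-learn, using the train/test split from Section 1.3 (the solver settings are assumptions, not necessarily those used in this report):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    for split, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
        pred = model.predict(X_)
        proba = model.predict_proba(X_)[:, 1]  # probability of the positive class
        print(f"{name} - {split}")
        print(confusion_matrix(y_, pred))
        print(classification_report(y_, pred))
        print("AUC:", round(roc_auc_score(y_, proba), 3))
```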

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.

Naive Bayes:

Confusion Matrix and Classification report for Training Data :

Fig 11. Confusion Matrix & Classification Report - Naive Bayes - Train

Confusion Matrix and Classification report for Test Data :

Fig 12. Confusion Matrix & Classification Report - Naive Bayes - Test
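A minimal sketch of the Naive Bayes fit (GaussianNB is assumed, since the predictors are numeric after encoding):

```python
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import GaussianNB

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

for split, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = nb_model.predict(X_)
    print(f"Naive Bayes - {split}")
    print(confusion_matrix(y_, pred))
    print(classification_report(y_, pred))
```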

KNN:

• We have used a plot of misclassification error versus k (with the k value on the X-axis) to find the optimum k value for our analysis.

• Based on the plot, the misclassification error is lowest at K = 15, hence we proceed to build the model with K = 15. A sketch of this search is given below.
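A minimal sketch of the error-vs-k search and the final KNN fit (the scaling step, the odd-k-only grid, and the use of the test set for the error curve are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# KNN is distance based, so the predictors from Section 1.3 are scaled first.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Misclassification error for odd k values (evaluated on the test set here).
k_values = list(range(1, 30, 2))
errors = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    errors.append(1 - knn.score(X_test_s, y_test))

plt.plot(k_values, errors, marker="o")
plt.xlabel("k")
plt.ylabel("Misclassification error")
plt.show()

# Final model with the chosen value K = 15.
knn_final = KNeighborsClassifier(n_neighbors=15).fit(X_train_s, y_train)
```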

Confusion Matrix and Classification report for Training Data :

Fig 13. Confusion Matrix & Classification Report - KNN - Train

Confusion Matrix and Classification report for Test Data :

Fig 14. Confusion Matrix & Classification Report - KNN - Test

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final
Model: Compare the models and write inference which model is best/optimized.
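The report does not state how the tuning in the following subsections was performed; the sketch below assumes GridSearchCV with illustrative parameter grids for two of the models:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Illustrative grids; the report's own grids are not shown, so these are assumptions.
param_grids = {
    "LR": (LogisticRegression(max_iter=1000),
           {"C": [0.01, 0.1, 1, 10], "solver": ["lbfgs", "liblinear"]}),
    "KNN": (KNeighborsClassifier(),
            {"n_neighbors": [5, 9, 15, 21], "weights": ["uniform", "distance"]}),
}

best_models = {}
for name, (estimator, grid) in param_grids.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring="roc_auc", n_jobs=-1)
    search.fit(X_train, y_train)
    best_models[name] = search.best_estimator_
    print(name, search.best_params_, round(search.best_score_, 3))
```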

LDA Tuning:

Confusion Matrix for Training and Test Data :


(First - Train, Second - Test)

Fig 15. Confusion Matrix - Tuning LDA

Classification Report for Training and Test Data :

Fig 16. Classification Report - Tuning LDA

AUC and ROC for Training and Test Data :

Fig 17. AUC and ROC Curve - LDA Tuning

LR Tuning:

Confusion Matrix for Training and Test Data :


(First - Train, Second - Test)

Fig 18. Confusion Matrix - Tuning LR

Classification Report for Training and Test Data :

Fig 19. Classification Report - Tuning LR

AUC and ROC for Training and Test Data :

Training

Test
Fig 20. AUC and ROC Curve - LR tuning

KNN Tuning:

Confusion Matrix for Training and Test Data :

Fig 21. Confusion Matrix - KNN Tuning

AUC and ROC for Training and Test Data :

Training

Test
Fig 22. AUC and ROC Curve - KNN tuning

Bagging using Random Forest:

Confusion Matrix and Classification for Training Data :

Fig 23. Confusion Matrix & Classification Report - Train - Bagging

Confusion Matrix and Classification for Test Data :

Fig 24. Confusion Matrix & Classification Report - Test - Bagging

AUC and ROC for Training and Test Data :

Fig 25. AUC and ROC Curve - Bagging
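A minimal sketch of the bagging step; a Random Forest is used, as required by the problem statement, since it is itself a bagging ensemble of decision trees (the hyperparameters are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

rf_model = RandomForestClassifier(n_estimators=100, random_state=1)
rf_model.fit(X_train, y_train)

for split, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = rf_model.predict(X_)
    proba = rf_model.predict_proba(X_)[:, 1]
    print(f"Bagging (Random Forest) - {split}")
    print(confusion_matrix(y_, pred))
    print(classification_report(y_, pred))
    print("AUC:", round(roc_auc_score(y_, proba), 3))
```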

AdaBoost:

Confusion Matrix and Classification for Training Data :

Fig 26. Confusion Matrix & Classification Report - Train - AdaBoost

Confusion Matrix and Classification for Test Data :

Fig 27. Confusion Matrix & Classification Report - Test - AdaBoost

AUC and ROC for Training and Test Data :

First - Train, Second - Test

Fig 28. AUC and ROC Curve - AdaBoost

Gradient Boosting:

Confusion Matrix and Classification for Training Data :

Fig 29. Confusion Matrix & Classification Report - Train - Gradient Boosting

Confusion Matrix and Classification for Test Data :

Fig 30. Confusion Matrix & Classification Report - Test - Gradient Boosting

AUC and ROC for Training and Test Data :

First - Train, Second - Test

Fig 31. AUC and ROC Curve - Gradient Boosting
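A minimal sketch of the two boosting models above (default scikit-learn hyperparameters are assumed):

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

boosters = {
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
}

for name, model in boosters.items():
    model.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: train AUC = {train_auc:.3f}, test AUC = {test_auc:.3f}")
```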

1.8 Based on these predictions, what are the insights?
• The Gradient Boosting model is the best model, with an AUC score of 0.95 on the training data and 0.89 on the test data.
• We have used several different models, and each has its own advantages and disadvantages.
• Linear discriminant analysis makes stronger assumptions about the underlying data; logistic regression is therefore generally the more flexible and robust method when these assumptions are violated.
• A general difference between KNN and the other models is the large amount of computation KNN requires at prediction time. Naive Bayes is much faster than KNN because KNN defers most of its work to prediction time.
• Bagging and Boosting are two types of ensemble learning. Both reduce the variance of a single estimate by combining several estimates from different models, so the result tends to be a more stable model. This is reflected in the AUC scores of the individual models.

Insights:

• The age of the voters who participated falls in the range of 24 to 93; however, the 30-70 age group is more likely to participate in voting.
• The voters who gave 4 stars to Blair are largely the same voters who gave 2 stars to Hague.
• The Labour party receives more votes than the Conservative party.

Problem Statement 2

In this project, we work with the inaugural corpus from nltk in Python. We will be looking at the following speeches of the Presidents of the United States of America:

President Franklin D. Roosevelt in 1941

President John F. Kennedy in 1961

President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the mentioned documents.

The number of characters (char_count), words (word_count), and sentences (sents_count) was computed for each of the three speeches; a sketch of the computation is given below.
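A minimal sketch of these counts using nltk's inaugural corpus (the variable names mirror char_count, word_count, and sents_count mentioned above):

```python
import nltk
from nltk.corpus import inaugural

nltk.download("inaugural")
nltk.download("punkt")

speeches = ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]

for fileid in speeches:
    char_count = len(inaugural.raw(fileid))    # number of characters
    word_count = len(inaugural.words(fileid))  # number of word tokens
    sents_count = len(inaugural.sents(fileid)) # number of sentences
    print(fileid, char_count, word_count, sents_count)
```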

2.2 Remove all the stopwords from all three speeches.

With the help of the nltk.corpus package, we can use the stopwords list to remove the stopwords from the given speeches; a sketch of this step is given below.
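A minimal sketch of the stopword removal (lower-casing and punctuation removal are assumptions about the exact preprocessing used):

```python
import string

import nltk
from nltk.corpus import inaugural, stopwords

nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
speeches = ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]

def remove_stopwords(fileid):
    """Return the speech tokens, lower-cased, with stopwords and punctuation removed."""
    words = [w.lower() for w in inaugural.words(fileid)]
    return [w for w in words if w not in stop_words and w not in string.punctuation]

cleaned = {fileid: remove_stopwords(fileid) for fileid in speeches}
```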

Below is the output after removing the stopwords.

2.3 Which word occurs the most number of times in his inaugural address for each president? Mention
the top three words. (after removing the stopwords)

Top 3 words in Roosevelt Speech

Top 3 words in Kennedy's Speech

Top 3 words in Nixon's Speech
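A sketch of how the top three words can be obtained from the cleaned tokens produced in the previous sketch (using collections.Counter):

```python
from collections import Counter

for fileid, tokens in cleaned.items():
    top_three = Counter(tokens).most_common(3)  # (word, frequency) pairs
    print(fileid, top_three)
```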

2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)

Word cloud for Roosevelt Speech

Fig 32. Word cloud for Roosevelt Speech

Word cloud for Kennedy's Speech

Fig 33. Word cloud for Kennedy's Speech

Word cloud for Nixon's Speech

Fig 34. Word cloud for Nixon's Speech
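A minimal sketch of how each word cloud can be generated from the cleaned tokens (the wordcloud package is assumed to be available):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

for fileid, tokens in cleaned.items():
    wc = WordCloud(width=800, height=400, background_color="white").generate(" ".join(tokens))
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Word cloud - {fileid}")
    plt.show()
```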

