
Project Report on:

MACHINE LEARNING

By:
Shivani Pandey
PGP-DSBA, Dec’21
Problem 1:
You are hired by one of the leading news channels, CNBE, which wants to analyse recent elections.
This survey was conducted on 1525 voters with 9 variables. You have to build a model to
predict which party a voter will vote for on the basis of the given information, in order to create an exit
poll that will help in predicting the overall win and seats covered by a particular party.
Dataset for Problem: Election_Data.xlsx

Data Ingestion:
1. Read the dataset. Perform descriptive statistics and check for null values.
Write an inference on it.
Exploratory Data Analysis

Shape of the dataset:


The Election dataset has 1525 rows and 9 columns (after removing the unnamed column).
We remove the unwanted variable "Unnamed: 0", as it does not carry any meaningful
information, and display the head of the Election dataset.
With the unnamed column included, the dataset would have 1525 rows and 10 columns.

Below are the variables and their data types,


All the variables except vote and gender are of int64 datatype.
However, looking at the values in the dataset, all the other variables appear to be categorical
columns, except age.

Using the isnull().sum() function, we find that the dataset has no null values.

The dataset has a few duplicates, and removing them is the best choice, as duplicates do not add any
value.

After removing the duplicates, we are left with 1517 rows and 9 columns, on
which we will perform our analysis.
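The ingestion steps above can be sketched as follows. Since Election_Data.xlsx is not reproduced here, a tiny illustrative frame with the same kinds of columns stands in for the real file:

```python
import pandas as pd

# Stand-in for pd.read_excel("Election_Data.xlsx"); the real file is not
# bundled here, so a tiny frame with the same kinds of columns (including
# the exported index column "Unnamed: 0") is used for illustration.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4],
    "vote": ["Labour", "Labour", "Conservative", "Labour"],
    "age": [43, 36, 35, 36],
    "gender": ["female", "male", "male", "male"],
})

df = df.drop(columns=["Unnamed: 0"])     # drop the unwanted index column
print(df.isnull().sum().sum())           # total missing values
n_dupes = df.duplicated().sum()          # count exact duplicate rows
df = df.drop_duplicates()
print(df.shape)
```

On the real file the same sequence reports 0 nulls and leaves 1517 rows and 9 columns.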
Descriptive Statistics for the dataset

We will convert the necessary variables to the object type, as they are meant to be: these
variables have numeric values but are categorical columns.

Checking the descriptive statistic of the dataset again,


From the above snippet we can conclude that the dataset has only one integer
column, which is 'age'.
The mean and median of 'age' are almost the same, indicating the column
is approximately normally distributed.

'vote', which is also the dependent variable, has two unique values: Labour and Conservative.

'gender' has two unique values: male and female.

All the remaining columns are object variables, with 'Europe' having the most unique
values (11).
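The dtype conversion described above can be sketched on a hypothetical slice of the survey, where 'Blair' and 'Hague' are 1-5 leader ratings stored as int64 but really categorical:

```python
import pandas as pd

# Hypothetical slice: 'Blair' and 'Hague' hold 1-5 ratings stored as
# int64, but they behave as categorical columns, so cast them to object.
df = pd.DataFrame({
    "age": [43, 36, 35],
    "Blair": [3, 4, 4],
    "Hague": [2, 2, 4],
})

cat_cols = ["Blair", "Hague"]            # numeric codes, not true numerics
df[cat_cols] = df[cat_cols].astype(object)

print(df.dtypes)
desc = df.describe(include="all")        # now summarises both kinds
```

After the cast, describe(include="all") reports unique/top/freq for the rating columns and mean/std only for 'age'.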

2. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
Check for Outliers.
Univariate Analysis and Outlier Check

'age' is the only integer variable and it has no outliers. The distribution plot also shows that
the variable is approximately normally distributed.
Looking at the frequency distributions of the categorical variables, votes
for 'Labour' are far larger in number,

and 'female' voters outnumber 'male' voters.

Bivariate Analysis
Labour gets the highest share of votes from both female and male voters.

In almost all the categories, Labour is getting the maximum votes.

Conservative gets slightly more votes from respondents with a Europe rating of 11.

From the above we can see that people who vote Conservative tend to be older,

and respondents with a Europe rating of 1 are older as well.


Pair Plot

Heat Map
There is no notable correlation between the variables.
Data Preparation:
1. Encode the data (having string values) for Modelling. Is Scaling necessary
here or not? Data Split: Split the data into train and test (70:30).
Encoding the dataset
The variables 'vote' and 'gender' have string values, so we convert them into
numeric values for modelling.
Both vote and gender have 2 unique values each. Now we will look at the head of the
data.
Scaling

We are not going to scale the data for the Logistic Regression, LDA and Naive Bayes models, as it
is not necessary for them.
For KNN, however, scaling is necessary, as it is a distance-based algorithm
(typically based on Euclidean distance). Scaling gives similar weightage to all the
variables.
Splitting the data into train and test: we copy all the predictor variables into the X data frame and
the target variable 'vote' into the y data frame.
We then split the data into train and test sets (70:30) as below:
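The encoding and 70:30 split can be sketched as below, on a small synthetic frame standing in for the encoded election data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the 1517-row election frame.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 10,
    "gender": ["female", "male"] * 10,
    "age": range(30, 50),
})

# Encode the two string columns as numeric codes.
df["vote"] = df["vote"].map({"Labour": 1, "Conservative": 0})
df["gender"] = df["gender"].map({"female": 1, "male": 0})

X = df.drop(columns=["vote"])            # predictors
y = df["vote"]                           # target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape)
```

With 20 illustrative rows, the 70:30 split yields 14 training and 6 testing rows.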

Modelling:
1. Logistic Regression.

We apply Logistic Regression and fit it on the training data.

We then predict on the train and test sets and check the values as below:
To assess the train and test accuracy, we examine the confusion matrix and classification report
for both sets.

Confusion matrix and Classification report for Train,

Confusion matrix and Classification report for Test,


The model is neither overfitting nor underfitting. The training and testing results show that the model
performs well, with good precision and recall values.
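The fit-and-evaluate flow above can be sketched as follows, on synthetic binary data standing in for the encoded election frame (the accuracy figures will therefore differ from the report's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class data in place of the real election features.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)                 # fit on the training split only

train_acc = lr.score(X_train, y_train)
test_acc = lr.score(X_test, y_test)
print(confusion_matrix(y_test, lr.predict(X_test)))
print(classification_report(y_test, lr.predict(X_test)))
```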

LDA (linear discriminant analysis).


Similarly, we apply LDA and fit it on the training data as below:

We then predict on the train and test sets, as shown below:

On calculating the accuracies of the test and train data, we find them to be 0.833 and 0.834 respectively, which is a good
83%.

Train and test performance can be seen in the confusion matrices and classification reports below:
Confusion matrix and Classification report for Train,

Confusion matrix and Classification report for Test,

The training and testing results show that the model performs well, with good precision and recall
values.
The LDA model is better than Logistic Regression, with better test accuracy and recall values.
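The LDA step follows the same pattern; a sketch on the same kind of synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, as for Logistic Regression.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
train_acc = lda.score(X_train, y_train)
test_acc = lda.score(X_test, y_test)
print(round(train_acc, 3), round(test_acc, 3))
```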

2. KNN Model.
Scaling the dataset is required here because KNN is a distance-based algorithm. So we scale the data and
then inspect the head, which looks like below:
We then apply KNN and fit it on the training data.
After predicting on the train and test sets, we can see the values as below:

Computing the accuracies, we see that the train accuracy is 86% and the
test accuracy is 82%.

Confusion matrix and Classification report for Train,


Confusion matrix and Classification report for Test,

The training and testing results show that the model performs well, with good precision and recall
values.
This KNN model has good accuracy and recall values.
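Scaling plus KNN can be combined in one pipeline, sketched here on the synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The pipeline scales before fitting KNN, so every feature contributes
# comparably to the Euclidean distances.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
train_acc = knn.score(X_train, y_train)
test_acc = knn.score(X_test, y_test)
```

Using a pipeline also guarantees the scaler is fitted on the training split only, avoiding leakage into the test set.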

Naive Bayes.
Importing GaussianNB from sklearn, fitting it on the training data, and predicting
on both the train and test sets.

After predicting on train and test, the results can be seen below:
Computing the accuracies, we get 84% for the train data and
82% for the test data.
Confusion matrix and Classification report for Train,

Confusion matrix and Classification report for Test,


The training and testing results show that the model is neither overfitting nor underfitting.
The Naive Bayes model also performs well, with good accuracy and recall values.
Even though NB and KNN have similar train and test accuracies, the recall
values on the test dataset show that KNN performs better than Naive Bayes.
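The Naive Bayes step can be sketched the same way on the synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

nb = GaussianNB()
nb.fit(X_train, y_train)                 # fit on the training split only
train_acc = nb.score(X_train, y_train)
test_acc = nb.score(X_test, y_test)
```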

3. Model Tuning, Bagging (Random Forest should be applied for Bagging)
and Boosting.
Using GridSearchCV to tune the models, which helps us find the best parameters for
each model.

For Logistic Regression, the search suggests a maximum of 100 iterations.

We then predict on the train and test sets, as shown below:
For the KNN model, we tune the hyperparameters leaf_size and n_neighbors to estimate
the best model parameters,
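The KNN tuning can be sketched as below; the grid values are illustrative, not the report's exact search space:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data and an illustrative parameter grid.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

param_grid = {"leaf_size": [20, 30, 40], "n_neighbors": [3, 5, 7, 9]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid,
                    cv=3, scoring="accuracy")
grid.fit(X, y)                           # cross-validated search
print(grid.best_params_)
```

grid.best_estimator_ is then refitted on the full data and can be used for the train/test predictions.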
Building a basic Decision Tree classifier with the gini index and a random state of 1.

Then, using Bagging to improve the performance of the model: we apply the model and
predict on the train and test data.
Applying Ada Boosting model and predicting the train and test,

Applying Gradient Boosting model with random state 1 and predicting the dependent
variable,

Applying Random Forest and tuning the model to get the best parameters.
Above is the parameter grid searched for Random Forest.

The best parameters found are,

We then apply Bagging on a Random Forest model to check its performance, predicting on the
train and test sets.
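Bagging over a Random Forest base estimator can be sketched as follows on the synthetic stand-in data (the base estimator is passed positionally for sklearn-version neutrality):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Random Forest as the base estimator inside a bagging wrapper; the
# tree counts are illustrative, not the report's tuned values.
rf = RandomForestClassifier(n_estimators=25, random_state=1)
bag = BaggingClassifier(rf, n_estimators=5, random_state=1)
bag.fit(X_train, y_train)
train_acc = bag.score(X_train, y_train)
test_acc = bag.score(X_test, y_test)
```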

4. Performance Metrics: Check the performance of predictions on the train and
test sets using Accuracy and the Confusion Matrix; plot the ROC curve and get the
ROC_AUC score for each model. Final Model: Compare the models and
write an inference on which model is best/optimized.
Tuned Logistic Regression
The tuned LR model performs well, with approximately 84% accuracy on both the train and test sets.
The train accuracy is 84%.

The test accuracy is 84%.

Confusion Matrix and Classification report - Train


Confusion Matrix and Classification report - Test

The Logistic Regression model performs well with good Precision, recall and f1 score values.

Tuned KNN
The KNN model is slightly better than the LR model, with a good train accuracy of 84% and a test accuracy of 83%,
which is also good.
Its precision, recall and f1 scores are the same as Logistic Regression's; the train accuracy is 84%
and the test accuracy is 83%.

Confusion Matrix and Classification report - Train


Confusion Matrix and Classification report - Test

Decision Tree
The DT model is overfitted, with 100% train accuracy but only 80% test accuracy.
A gap of around 10% between train and test is generally acceptable, but here it exceeds that. Hence, OVERFITTING.
Confusion Matrix and Classification report - Train

Confusion Matrix and Classification report -Test


Decision Tree Pruned
Pruning/tuning the DT with the gini index and max depth = 4 makes the model perform better:
it is no longer overfitting, with 84% train accuracy and 79% test accuracy.

Ensemble Technique - Bagging (DT used)


DT is used as the base estimator for bagging. The model is OVERFITTING: the training data
has an accuracy of 98% while the test data reaches just 80%.
Confusion Matrix and Classification report - Train
Confusion Matrix and Classification report - Test
Ada Boosting
Applying the Ada Boosting model and predicting on the train and test sets.
The train and test accuracies are 84% and 82% respectively. We have seen models that perform
better than this.

Confusion Matrix and Classification report - Train


Confusion Matrix and Classification report - Test
Gradient Boosting
Gradient Boosting model performs the best with 89% train accuracy and with 83% test
accuracy. The precision, recall and f1 score is also good.

Confusion Matrix and Classification report - Train


Confusion Matrix and Classification report - Test
Random Forest

Random forest model’s train and test accuracy scores.

Random Forest - Bagging


The RF model with bagging applied performs similarly to the plain RF model, as they are not very different.
The model also has good recall and precision, with accuracies of 85% on the train data and 81% on the
test data.

Confusion Matrix and Classification report - Train


Confusion Matrix and Classification report - Test
Model Comparison and Best Model
The Gradient Boosting model performs the best, with 89% train accuracy, 91%
precision and 94% recall, which is better than any other model we have built
on the Election dataset. Its accuracy scores are 89% and 83% on the train and
test data respectively.

The rest of the models all have more or less the same accuracy of around 84%.
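The ROC/AUC check asked for in this section can be sketched as follows, again on synthetic stand-in data and with Gradient Boosting as the example model:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

gbc = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)
probs = gbc.predict_proba(X_test)[:, 1]  # probability of the positive class
fpr, tpr, _ = roc_curve(y_test, probs)   # points for the ROC plot
auc = roc_auc_score(y_test, probs)
print(round(auc, 3))
```

Passing fpr and tpr to matplotlib's plt.plot draws the ROC curve; repeating this per model gives the ROC_AUC comparison.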
Inference:
1. Based on these predictions, what are the insights?
The most important variables in predicting the dependent variable are
'Blair' and 'Hague'.

These are the ratings that people gave to the leaders of the 'Labour' and 'Conservative'
parties respectively.

As the frequency distributions suggest, most people gave 4 stars to 'Blair', while a larger
number of people gave 2 stars to 'Hague', which made an impact on the dependent variable 'vote'.

Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in Python.
We will be looking at the following speeches of the Presidents of the United States of America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

2.1 Find the number of characters, words and sentences for the
mentioned documents.
(Hint: use .words(), .raw(), .sents() for extracting counts)
President Franklin D. Roosevelt's speech has 7571 characters (including spaces) and 1360
words.
President John F. Kennedy's speech has 7618 characters (including spaces) and 1390
words.
President Richard Nixon's speech has 9991 characters (including spaces) and 1819 words.
President Franklin D. Roosevelt's and President Richard Nixon's speeches each have 68
sentences, while
President John F. Kennedy's speech has 52 sentences.
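The counting can be sketched as below. The report reads the speeches via nltk's inaugural corpus (e.g. inaugural.raw("1941-Roosevelt.txt"), with .words() and .sents() for the other counts, after nltk.download("inaugural")); here a stand-in string and rough splits are used so the snippet runs without downloading the corpus:

```python
# Stand-in text; with nltk one would instead use:
#   from nltk.corpus import inaugural
#   raw = inaugural.raw("1941-Roosevelt.txt")
#   len(raw), len(inaugural.words(...)), len(inaugural.sents(...))
speech = "We must act. We must act quickly."

n_chars = len(speech)                    # characters, spaces included
n_words = len(speech.split())            # rough word count
n_sents = speech.count(".")              # rough sentence count
print(n_chars, n_words, n_sents)
```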

2.2 Remove all the stop words from all the three speeches.
Converting all the characters to lower case and removing all the punctuation.

Counting the number of stop words and removing them.


Word count after removing all the stop words.

All the stop words have been removed from all the three speeches.
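A minimal sketch of this cleaning step; a small hardcoded stop-word set stands in for nltk.corpus.stopwords so the snippet is self-contained:

```python
import string

# Illustrative stop-word subset; the report uses nltk's English list.
stop_words = {"the", "of", "and", "to", "a", "in", "we", "is", "it", "our"}

text = "The spirit of the Nation, and the faith of our Democracy."

# Lower-case, strip punctuation, then drop stop words.
clean = text.lower().translate(str.maketrans("", "", string.punctuation))
tokens = [w for w in clean.split() if w not in stop_words]
print(tokens)
```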

2.3 Which word occurs the most number of times in his inaugural
address for each president? Mention the top three words. (after
removing the stop words)
In the snippets below we can see the words that occurred the most in each president's
inaugural address.
After removing additional stop words from the above snippets and re-checking the
frequencies:
Most frequently used words from President Franklin D. Roosevelt’s speech are

• Nation
• Democracy
• Spirit

Most frequently used words from President Richard Nixon’s Speech are

• Peace
• World
• New
• America
Most frequently used words from President John F. Kennedy’s Speech are
• World
• New
• Pledge
• Power
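Given the cleaned tokens, the top words fall out of a simple frequency count; the token list below is an illustrative stand-in for a cleaned speech:

```python
from collections import Counter

# Illustrative token list standing in for one cleaned speech.
tokens = ["peace", "world", "peace", "new", "america", "peace", "world"]

top3 = Counter(tokens).most_common(3)    # (word, count) pairs, descending
print(top3)  # [('peace', 3), ('world', 2), ('new', 1)]
```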
2.4 Plot the word cloud of each of the speeches of the variable. (after
removing the stop words)
Word Cloud for President Franklin D. Roosevelt's speech (after cleaning)

Word Cloud for President John F. Kennedy's speech (after cleaning)

Word Cloud for President Richard Nixon's speech (after cleaning)
