Professional Documents
Culture Documents
Business Report ML Jan2022
Business Report ML Jan2022
Business Report ML Jan2022
Anshul Dyundi
You are hired by one of the leading news channel CNBE who wants to analyse recent elections. This
survey was conducted on 1525 voters with 9 variables. You have to build a model, to predict which
party a voter will vote for on the basis of the given information, to create an exit poll that will help in
predicting overall win and seats covered by a particular party.
Dataset for Problem: Election_Data.xlsx
Data Ingestion:
1. Read the dataset. Do the descriptive statistics and do null value condition check. Write an
inference on it.
Exploratory Data Analysis
From the below snippet it is evident that the dataset does not have null values.
The dataset has few duplicates and removing them is the best choice as duplicates does not add any
value.
Below snippet also shows the shape of dataset after removing the duplicates.
Converting the necessary variables to object as it is meant to be. Because these variables have values
that are numeric but are a categorical column.
‘age’ is the only integer variable and it is not having outliers. Also, the dist. plot shows that the
variable is normally distributed.
Frequency distribution of the categorical variables.
votes are large in number for ‘Labour’.
‘female’ voters large in number than ‘male’
Bivariate Analysis
Labour gets the highest voting from both female and male voters.
Almost in all the categories Labour is getting the maximum votes.
Conservative gets a little bit high votes from Europe ‘11’.
From the above we could see people who vote Conservative are the people who are older. In
variable Europe ‘1’ are older people.
Pair Plot
Heat Map
There is no correlation between the variables.
Data Preparation:
1. Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data
Split: Split the data into train and test (70:30).
Modelling:
1. Logistic Regression.
Applying Logistic Regression and fitting the training data
Predicting train and test,
Training and Testing results shows that the model is excellent with good precision and recall values.
The LDA model is better than Logistic regression with better Test accuracy and recall values.
2. KNN Model.
Scaling the dataset as it is required because KNN is a distance-based algorithm,
Applying KNN and fitting the training data
Training and Testing results shows that the model is excellent with good precision and recall values.
This KNN model have good accuracy and recall values.
Naive Bayes.
3. Model Tuning, Bagging (Random Forest should be applied for Bagging) and Boosting.
Using GridSearchCV and tuning the model which helps us in finding the best parameters for the
model,
Predicting the Train and test,
Applying KNN model and using the hyperparameter Leaf size and n_neighbour to estimate the
model parameters,
Basic Decision Tree classifier with gini index and random state of 1
Applying Gradient Boosting model with random state 1 and predicting the dependent variable,
Applying Random forest, tuning the model to get the best parameters.
The best parameters are,
Applying Bagging on a Random forest model to check its performance and predicting the train and
test,
4. Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final
Model: Compare the models and write inference which model is best/optimized.
The Logistic Regression model performs well with good Precision, recall and f1 score values.
Tuned KNN
KNN model is better than LR model performs good train accuracy of 84%, but the test is 83%, which
is also good.
The precision, recall and f1 score is the same as Logistic regression.
Confusion Matrix and Classification report - Train
Confusion Matrix and Classification report - Test
Decision Tree
DT model is overfitted with 100% train accuracy and 80% test accuracy.
Statistical 10% + or – is acceptable, but here it is over 10%. Hence, OVERFITTING
Confusion Matrix and Classification report - Train
Rest all the models are more or less have same accuracy of 84%
Inference:
1. Based on these predictions, what are the insights? The
important variable in predicting the dependent variables are ‘Hague’
and ‘Blair’
These are the ratings that the people gave to the Leaders of the ‘Labour’ and ‘Conservative’ party.
As the frequency distribution suggests most of the people gave 4 stars to ‘Blair’ and there are larger
number of people gave 2 stars to ‘Hague’ which made an impact in the dependent variable ‘vote’
Problem 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in Python. We
will be looking at the following speeches of the Presidents of the United States of America:
2.1 Find the number of characters, words and sentences for the mentioned documents.
(Hint: use .words(), .raw(), .sent() for extracting counts)
President Franklin D. Roosevelt’s speech have 7571 Characters (including spaces) and 1360
words.
President John F. Kennedy’s Speech have 7618 Characters (including spaces) and 1390
words.
President Richard Nixon’s Speech have 9991 Characters (including spaces) and 1819 words.
President Franklin D. Roosevelt’s speech and President Richard Nixon’s Speech have 68
Sentences and,
President John F. Kennedy’s Speech have 52 Sentences.
2.2 Remove all the stop words from all the three speeches.
Converting all the character to lower case and removing all the punctuations.
Counting the number of stop words and removing them.
All the stop words have been removed from all the three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stop words)
In the below snippets we could see the words that occurred most number of times in their inaugural
address.
Removing the additional stop words from the above snippets and when checking the
frequency
Most frequently used words from President Franklin D. Roosevelt’s speech are
Nation
Democracy
Spirit
Most frequently used words from President Richard Nixon’s Speech are
Peace
World
New
America
Most frequently used words from President John F. Kennedy’s Speech are
World
New
Pledge
Power
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the
stop words)
Word Cloud for President Franklin D. Roosevelt’s speech (after cleaning)!!
Word Cloud for President John F. Kennedy’s Speech (after cleaning)!!
Word Cloud for President Richard Nixon’s Speech (after cleaning)!!