Umendra Pratap Singh Solanki ML Graded Project 18-12-2022


GRADED PROJECT

MACHINE LEARNING

Submitted By - Umendra Pratap Singh


Submission Date – 18/12/2022
CONTENT
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an
inference on it.

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.

Data Preparation: 4 marks

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split:
Split the data into train and test (70:30).

Modelling: 22 marks

1.4 Apply Logistic Regression and LDA (linear discriminant analysis).

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare
the models and write inference which model is best/optimized.

Inference: 5 marks

1.8 Based on these predictions, what are the insights?


2.1 Find the number of characters, words, and sentences for the mentioned documents.

2.2 Remove all the stop words from all three speeches.

2.3 Which word occurs most often in each president's inaugural address? Mention
the top three words. (after removing the stop words)

2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stop words)
Problem 1:
You are hired by one of the leading news channels, CNBE, which wants to analyze
recent elections. This survey was conducted on 1525 voters with 9 variables. You
have to build a model to predict which party a voter will vote for on the basis of
the given information, in order to create an exit poll that will help in predicting the
overall win and the seats covered by a particular party.
Dataset for Problem: Election_Data.xlsx
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write
an inference on it.
• Top five and last five rows of the data set

Figure 1 Top five rows of the data

Figure 2 last five rows of the dataset

• There are 1525 rows and 10 columns in the dataset. The column "Unnamed: 0" carries
no meaningful information, so we drop it, leaving 9 columns.
• There are 8 integer and 2 object datatypes in the dataset; we convert the object
columns to categorical.
• Of the 9 features, 8 are categorical and one is numerical.

Figure 3

• vote and gender each have two unique values
• Most votes are for Labour, and most voters are female
• The table below shows the descriptive summary: mean, min., 25%, 50%, 75%, and max.
• For age, the mean and median are almost the same, so it is approximately normally distributed.

Figure 4 Descriptive statistics of the data

• The dataset does not have any duplicate rows.


• The categorical column vote has 2 unique values:
Labour – 1063, Conservative – 462
• Gender also has 2 unique values:
Male – 713, Female – 812
• Age is a continuous variable:
minimum age 24, maximum age 93, average age 54.18
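The loading and inspection steps above can be sketched as follows. The real report reads Election_Data.xlsx; the few rows here are made-up stand-in values, not the actual survey data:

```python
import pandas as pd

# Sketch of the loading/inspection steps; these rows are illustrative
# stand-ins for the Election_Data.xlsx survey.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4],
    "vote": ["Labour", "Labour", "Conservative", "Labour"],
    "age": [43, 36, 35, 24],
    "gender": ["female", "male", "male", "female"],
})

df = df.drop(columns=["Unnamed: 0"])   # drop the meaningless index column
print(df.shape)                        # (rows, remaining columns)
print(df.describe(include="all"))      # descriptive statistics
print(df.isnull().sum())               # null value condition check
print(df.duplicated().sum())           # duplicate-row check
```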

Sol.2
UNIVARIATE ANALYSIS –
Age-
• The Age feature does not have any outliers
• We see that it is slightly right skewed

Figure 5 hist plot and box plot of Age

Economic condition national –


• economic.cond.national has outliers
• Voters mostly gave ratings of 3 and 4 for economic condition national

Figure 6 histplot and boxplot of economic.cond.national


Economic.cond.household distribution –
• economic.cond.household has outliers
• Voters mostly gave ratings of 3 and 4 for economic condition household

Figure 7 histplot and boxplot of economic.cond.household

Blair –
• Voters have mostly given a rating of 4 to Labour party leader Blair

Figure 8 histplot and boxplot of Blair


Europe-
• Most voters gave a rating of 11 on Europe, and the least given rating is 3

Figure 9 histplot and boxplot of Europe

Hague-
• Most voters gave ratings of 2 and 4 to Conservative party leader Hague

Figure 10 histplot and box plot of Hague

Political knowledge –
• Most of the voters chose rating 2
Figure 11 hist plot and box plot of political knowledge

Vote-

• Labour party voters are in the majority

Figure 12 count plot of vote

Gender –
• Female voters are in the majority
Figure 13 count plot of Gender
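A minimal sketch of the histogram/boxplot pair used throughout the univariate analysis, with made-up ages standing in for the survey's age column:

```python
import matplotlib
matplotlib.use("Agg")                  # non-interactive backend for scripts
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages stand in for the survey's age column.
age = pd.Series([24, 35, 36, 43, 54, 60, 67, 72, 81, 93])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(age, bins=5)
ax1.set_title("Age histogram")
ax2.boxplot(age)
ax2.set_title("Age boxplot")
fig.savefig("age_univariate.png")

print(round(age.skew(), 2))            # sign indicates skew direction
```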

BIVARIATE ANALYSIS -
• Conservative party voters are older on average; among Conservative voters, females are
older than males, whereas among Labour voters males are older

Figure 14

• Voters of the Labour party give higher ratings for economic condition national


Figure 15

• Voters of the Labour party give higher ratings for economic condition household

Figure 16
• Voters of the Conservative party give higher ratings on political knowledge

Figure 17

• Voters of the Labour party give higher ratings to Blair, and male voters are more
likely to favour Blair

Figure 18
• Voters of the Conservative party favour Hague more

Figure 19

• Conservative party voters give higher ratings on Europe than Labour party voters

Figure 20
PAIRPLOT

Figure 21

There is no strong relationship present between the variables; the correlations are a
mixture of positive and negative.
HEATMAP

• Economic_cond_national and Economic_cond_household have the maximum correlation,
35%, but that is not a big percentage.
• The maximum negative correlation is between Blair and Europe at 30%, which is also
not large.
OUTLIER CHECK -

Only Economic_cond_national and Economic_cond_household have outliers.

They do not materially affect our data, so we keep the outliers and proceed with the
data as-is.
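The outlier check above can be reproduced with the usual 1.5 × IQR rule; the ratings below are illustrative stand-ins for Economic_cond_national, not the real survey values:

```python
import pandas as pd

# 1.5 * IQR rule for the outlier check; the ratings are illustrative.
econ_national = pd.Series([1, 1, 3, 3, 3, 4, 4, 4, 4, 5])

q1, q3 = econ_national.quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whisker range count as outliers.
outliers = econ_national[(econ_national < low) | (econ_national > high)]
print(low, high, len(outliers))
```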

Sol. 3
Data split-
• Most voters voted for the Labour party; the votes are roughly in a 1:2 ratio.
• The chance of a model predicting Labour is therefore higher than Conservative.
• We split the data into train and test sets in a 70:30 ratio.

Encoding
• 6 ordinal features are already of integer datatype, but 2 features are of object
datatype and need to be converted into a categorical datatype.
Figure 22 encoded table of dataset

Scaling
• Age is measured in years while the other features are ratings on different scales,
hence scaling is required for some models (distance-based ones in particular) for
accurate results.
• The scaler is fit on the training set only, and the same fitted scaler is then
applied to the test set, so the test data is transformed with the training statistics.
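A sketch of the encoding and 70:30 split, assuming a small synthetic stand-in for the survey data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same kinds of columns as the survey.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 10,
    "gender": ["female", "male"] * 10,
    "age": range(24, 44),
})

# Encode the object columns as categorical integer codes.
for col in ["vote", "gender"]:
    df[col] = df[col].astype("category").cat.codes

X = df.drop(columns=["vote"])
y = df["vote"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
print(X_train.shape, X_test.shape)     # 70:30 split
```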
Sol. 4

Modelling-
Logistic regression-
• Scaling is not required for logistic regression; it does not affect the test accuracy.
• We build the logistic regression model using grid-search cross-validation.
• The train set accuracy of the logistic regression model is 83% and the test set
accuracy is 85.4%. The train and test accuracies are almost similar, so the model is
valid; the difference between the two lies within a 10% range.

Linear discriminant analysis-


• Scaling is not required for linear discriminant analysis; it does not affect the
test accuracy.
• The train set accuracy of the linear discriminant analysis model is 83.6% and the
test set accuracy is 81.5%. The train and test accuracies are almost similar, so the
model is valid; the difference between the two lies within a 10% range.

Comparing the logistic regression and LDA models, the logistic regression model has
the better accuracy, so we conclude that the logistic regression model is better for
predicting the party.
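The two models above can be sketched as follows; `make_classification` stands in for the election data, and the grid of `C` values is an assumed example of the grid-search step:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary-classification data stands in for the election survey.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Logistic regression tuned with grid-search cross-validation;
# the grid of C values is an assumed example.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_tr, y_tr)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

print("LR  train/test:", grid.score(X_tr, y_tr), grid.score(X_te, y_te))
print("LDA train/test:", lda.score(X_tr, y_tr), lda.score(X_te, y_te))
```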

Sol. 5
Naïve Bayes Model-
• Scaling is not required for Naïve Bayes; it does not affect the accuracy.
• For numerical features, the Naïve Bayes model assumes that the features are normally
distributed, and it also assumes that the predictors are independent of each other.
• We build a GaussianNB classifier, train it with the fit() method, and then make
predictions with the predict() method.
• The train set accuracy of the Naïve Bayes model is 83.2% and the test set accuracy
is 82.3%. The train and test accuracies are almost similar, so the model is valid;
the difference between the two lies within a 10% range.

KNN Model-
• For KNN, pre-processing is required to put the independent variables on the same
scale, so we z-score all the numeric features used in the distance calculation.
• The default number of neighbours is 5; we try different k values and pick the one
with the least misclassification error.
• From the data, the best k value is 13. We build a k-neighbours classifier with
metric='euclidean', train it with the fit() method, and make predictions on the test
set with the predict() method.
• The train set accuracy of the KNN model is 84.9% and the test set accuracy is 86%.
The train and test accuracies are almost similar, so the model is valid; the
difference between the two lies within a 10% range.
Comparing the Naïve Bayes and KNN models, the KNN model has the better accuracy, so we
conclude that the KNN model is better for predicting the party.
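A sketch of the Naïve Bayes and KNN steps, again on synthetic stand-in data; note that the scaler is fit on the train split only and then applied to both splits:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Gaussian Naive Bayes needs no scaling.
nb = GaussianNB().fit(X_tr, y_tr)

# KNN is distance-based, so z-score the features: fit the scaler on the
# train split only, then apply it to both splits.
scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier(n_neighbors=13, metric="euclidean")
knn.fit(scaler.transform(X_tr), y_tr)

print("NB  test:", nb.score(X_te, y_te))
print("KNN test:", knn.score(scaler.transform(X_te), y_te))
```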

Sol. 6
Random forest model-
• Scaling is not required for the random forest model.
• The model is built using grid-search cross-validation.
• The train set accuracy of the random forest model is 82.7% and the test set accuracy
is 84.1%. The train and test accuracies are almost similar, so the model is valid;
the difference between the two lies within a 10% range.

Bagging model –
• Here we use a base estimator with n_estimators=100; the classifier is trained with
the fit() method, and then the model makes predictions on the test features with the
predict() method.
• The train set accuracy of the bagging model is 81.7% and the test set accuracy is
82.9%. The train and test accuracies are almost similar, so the model is valid; the
difference between the two lies within a 10% range.
Comparing the random forest and bagging models, the random forest model has the better
accuracy, so we conclude that the random forest model is better for predicting the
party.

Gradient Boosting-

• In gradient boosting we set n_estimators=100; the classifier is trained with the
fit() method, and then the model makes predictions on the test features with the
predict() method.
• The train set accuracy of the gradient boosting model is 88.7% and the test set
accuracy is 83.9%. The train and test accuracies are almost similar, so the model is
valid; the difference between the two lies within a 10% range.

Ada Boosting –
• In Ada boosting we set n_estimators=100; the classifier is trained with the fit()
method, and then the model makes predictions on the test features with the predict()
method.
• The train set accuracy of the Ada boosting model is 84.7% and the test set accuracy
is 83.6%. The train and test accuracies are almost similar, so the model is valid;
the difference between the two lies within a 10% range.
Comparing gradient boosting and Ada boosting, the accuracy of both models is good, but
the gap between train and test accuracy is much smaller for Ada boosting, so we
conclude that Ada boosting is better for predicting the party.
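The four ensemble models can be fit side by side as below; hyperparameters other than n_estimators=100 are left at their sklearn defaults, which is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data in place of the election survey.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "bagging": BaggingClassifier(n_estimators=100, random_state=1),
    "grad_boost": GradientBoostingClassifier(n_estimators=100, random_state=1),
    "ada_boost": AdaBoostClassifier(n_estimators=100, random_state=1),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Compare train vs test accuracy, as the report does for each model.
    print(name, model.score(X_tr, y_tr), model.score(X_te, y_te))
```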

Sol 7
Model Evaluation-
Logistic Regression-
• Classification Report -

Figure 23 classification report of train data set


Figure 24 classification report of test data set

• Confusion matrix-

Figure 25 confusion matrix of train data set

Figure 26 confusion matrix of test data set


AUC scores of the train and test datasets are 87.4% and 91.5%.
• ROC curve-

Figure 27 ROC curve of train and test dataset

From the above output we see that the logistic regression model is a good model.

Linear discriminant analysis-


• Classification Report -

Figure 28 classification report of train dataset


Figure 29 classification report of test dataset

Confusion matrix-

Figure 30 confusion matrix of train data set

Figure 31 confusion matrix of test dataset

AUC scores for the training and test data are 88.9% and 88.4%.


ROC curve-

Figure 32 ROC curve of train and test data

We see that the train and test data set results are almost similar, so the model is good.
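The evaluation steps above (classification report, confusion matrix, AUC score, and the points for the ROC curve) can be sketched for any fitted model; logistic regression on stand-in data is used here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Stand-in data and model; the same calls apply to each model in the report.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

preds = model.predict(X_te)
probs = model.predict_proba(X_te)[:, 1]        # P(class 1), for ROC/AUC

print(classification_report(y_te, preds))      # precision / recall / f1
cm = confusion_matrix(y_te, preds)
print(cm)
auc = roc_auc_score(y_te, probs)
print("AUC:", auc)
fpr, tpr, _ = roc_curve(y_te, probs)           # points for the ROC plot
```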
INSIGHTS-
• Comparing precision and recall scores across all the models, the Naïve Bayes model is
good for predicting the parties.
• On the Naïve Bayes test data, recall is 86%, so only 14% of actual Labour supporters
are predicted as voting against the Labour party.
• On the Naïve Bayes test data, precision is 89%, so only 11% of the votes predicted
for Labour were actually against the Labour party.
• Votes are mostly in favour of the Labour party, and it wins.
• People who are not Eurosceptic are likely to vote for the Labour party.
• People who do not have political knowledge vote for the Labour party.

Problem -2
Load the packages and extract the 3 speeches from the given code.

Sol 1
Speech given by President Franklin D. Roosevelt in 1941:
Total no. of words: 1536; total no. of characters: 7571; total no. of sentences: 68.
Speech given by President John F. Kennedy in 1961:
Total no. of words: 1546; total no. of characters: 7618; total no. of sentences: 52.
Speech given by President Richard Nixon in 1973:
Total no. of words: 2028; total no. of characters: 9991; total no. of sentences: 69.
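The counting step can be sketched as below; `speech` is a toy stand-in, since the actual texts come from the inaugural-speech files loaded by the given code:

```python
# Toy stand-in text; the report loads the actual inaugural speeches.
speech = "Four freedoms guide us. We face the future. We do not retreat."

n_chars = len(speech)                            # character count
n_words = len(speech.split())                    # whitespace-separated words
n_sentences = sum(speech.count(p) for p in ".!?")  # terminal punctuation
print(n_chars, n_words, n_sentences)
```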

Sol 2
For removing stop words:
All three speeches are converted to lower case, then stop words and special characters
are removed, and the result is assigned to a new list variable.
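A self-contained sketch of the lower-casing and stop-word removal; the report uses nltk's English stop-word list, while a small hand-written subset stands in here:

```python
import re

# Hand-written stop-word subset standing in for nltk's English list.
STOPWORDS = {"we", "the", "do", "not", "us", "a", "of", "and", "to", "in"}

speech = "We face the future. We do not retreat!"
tokens = re.findall(r"[a-z]+", speech.lower())   # lowercase, strip punctuation
filtered = [w for w in tokens if w not in STOPWORDS]
print(filtered)
```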

Sol 3
The most used words in speech 1 are: nation (12 times), know (19 times), spirit (9
times), democracy (9 times), life (9 times).
The most used words in speech 2 are: let (16 times), us (12 times), sides (8 times),
world (8 times), new (7 times).
The most used words in speech 3 are: us (28 times), let (22 times), America (21
times), peace (19 times), world (18 times).
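The top-word counts can be computed with collections.Counter; the speech text and stop-word subset below are illustrative assumptions, not the real inaugural data:

```python
import re
from collections import Counter

# Illustrative speech text and stop-word subset (assumptions).
STOPWORDS = {"the", "of", "for", "in", "a", "and", "to"}
speech = "Let us seek peace. Peace for America. America wants peace in the world."

tokens = [w for w in re.findall(r"[a-z]+", speech.lower())
          if w not in STOPWORDS]
top_three = Counter(tokens).most_common(3)   # top three words after stop-word removal
print(top_three)
```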
