Umendra Pratap Singh Solanki ML Graded Project 18-12-2022


GRADED PROJECT

MACHINE LEARNING

Submitted By - Umendra Pratap Singh


Submission Date – 18/12/2022
CONTENT
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an
inference on it.

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.

Data Preparation: 4 marks

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split:
Split the data into train and test (70:30).

Modelling: 22 marks

1.4 Apply Logistic Regression and LDA (linear discriminant analysis).

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.

1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare
the models and write inference which model is best/optimized.

Inference: 5 marks

1.8 Based on these predictions, what are the insights?


2.1 Find the number of characters, words, and sentences for the mentioned documents.

2.2 Remove all the stop words from all three speeches.

2.3 Which word occurs most often in each president's inaugural address? Mention
the top three words. (after removing the stop words)

2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stop words)
Problem 1:
You are hired by one of the leading news channels, CNBE, which wants to analyze
recent elections. This survey was conducted on 1525 voters with 9 variables. You
have to build a model to predict which party a voter will vote for on the basis of
the given information, in order to create an exit poll that will help in predicting the
overall win and the seats covered by a particular party.
Dataset for Problem: Election_Data.xlsx
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write
an inference on it.
• Top five and last five rows of the data set

Figure 1 Top five rows of the data

Figure 2 last five rows of the dataset

• There are 1525 rows and 10 columns in the dataset. The column "Unnamed: 0" carries
no meaningful information, so we drop it, leaving 9 columns.
• There are 8 integer and 2 object datatypes in the dataset; we convert the object
columns to categorical.
• Of the 9 features, 8 are categorical and one is numerical.

Figure 3

• vote and gender each have two unique values
• Most votes are for Labour, and most voters are female
• The table below shows the descriptive summary: mean, min., 25%, 50%, 75%, and max.
• For age, the mean and median are almost the same, so it is approximately normally distributed.

Figure 4 Descriptive statistics of the data

• The dataset does not have any duplicate rows.


• The categorical column vote has 2 unique values:
Labour – 1063, Conservative – 462
• Gender also has 2 unique values:
Male – 713, Female – 812
• Age is a continuous variable:
minimum age 24, maximum age 93, average age 54.18
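The loading and inspection steps above can be sketched as follows. The real report reads Election_Data.xlsx; the few rows here are made-up stand-in values, not the actual survey data:

```python
import pandas as pd

# Sketch of the loading/inspection steps; these rows are illustrative
# stand-ins for the Election_Data.xlsx survey.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4],
    "vote": ["Labour", "Labour", "Conservative", "Labour"],
    "age": [43, 36, 35, 24],
    "gender": ["female", "male", "male", "female"],
})

df = df.drop(columns=["Unnamed: 0"])   # drop the meaningless index column
print(df.shape)                        # (rows, remaining columns)
print(df.describe(include="all"))      # descriptive statistics
print(df.isnull().sum())               # null value condition check
print(df.duplicated().sum())           # duplicate-row check
```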

Sol.2
UNIVARIATE ANALYSIS –
Age-
• The Age feature does not have any outliers
• We see that it is slightly right skewed

Figure 5 hist plot and box plot of Age

Economic condition national –


• economic.cond.national has outliers
• Voters mostly gave ratings of 3 and 4 for economic condition national

Figure 6 histplot and boxplot of economic.cond.national


Economic.cond.household distribution –
• economic.cond.household has outliers
• Voters mostly gave ratings of 3 and 4 for economic condition household

Figure 7 histplot and boxplot of economic.cond.household

Blair –
• Voters have mostly given a rating of 4 to Labour party leader Blair

Figure 8 histplot and boxplot of Blair


Europe-
• Most voters gave a rating of 11 on Europe, and the least given rating is 3

Figure 9 histplot and boxplot of Europe

Hague-
• Most voters gave ratings of 2 and 4 to Conservative party leader Hague

Figure 10 histplot and box plot of Hague

Political knowledge –
• Most of the voters chose rating 2
Figure 11 hist plot and box plot of political knowledge

Vote-

• Labour party voters are in the majority

Figure 12 count plot of vote

Gender –
• Female voters are in the majority
Figure 13 count plot of Gender
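A minimal sketch of the histogram/boxplot pair used throughout the univariate analysis, with made-up ages standing in for the survey's age column:

```python
import matplotlib
matplotlib.use("Agg")                  # non-interactive backend for scripts
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages stand in for the survey's age column.
age = pd.Series([24, 35, 36, 43, 54, 60, 67, 72, 81, 93])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(age, bins=5)
ax1.set_title("Age histogram")
ax2.boxplot(age)
ax2.set_title("Age boxplot")
fig.savefig("age_univariate.png")

print(round(age.skew(), 2))            # sign indicates skew direction
```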

BIVARIATE ANALYSIS -
• Conservative party voters are older on average; among Conservative voters, females are
older than males, whereas among Labour voters males are older

Figure 14

• Voters of the Labour party give higher ratings for economic condition national


Figure 15

• Voters of the Labour party give higher ratings for economic condition household

Figure 16
• Voters of the Conservative party give higher ratings on political knowledge

Figure 17

• Voters of the Labour party give higher ratings to Blair, and male voters are more
likely to favour Blair

Figure 18
• Voters of the Conservative party favour Hague more

Figure 19

• Conservative party voters give higher ratings on Europe than Labour party voters

Figure 20
PAIRPLOT

Figure 21

There is no strong relationship present between the variables; the correlations are a
mixture of positive and negative.
HEATMAP

• Economic_cond_national and Economic_cond_household have the maximum correlation,
35%, but that is not a big percentage.
• The maximum negative correlation is between Blair and Europe at 30%, which is also
not large.
OUTLIER CHECK -

Only Economic_cond_national and Economic_cond_household have outliers.

They do not materially affect our data, so we keep the outliers and proceed with the
data as-is.
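The outlier check above can be reproduced with the usual 1.5 × IQR rule; the ratings below are illustrative stand-ins for Economic_cond_national, not the real survey values:

```python
import pandas as pd

# 1.5 * IQR rule for the outlier check; the ratings are illustrative.
econ_national = pd.Series([1, 1, 3, 3, 3, 4, 4, 4, 4, 5])

q1, q3 = econ_national.quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whisker range count as outliers.
outliers = econ_national[(econ_national < low) | (econ_national > high)]
print(low, high, len(outliers))
```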

Sol. 3
Data split-
• Most voters voted for the Labour party; the votes are roughly in a 1:2 ratio.
• The chance of a model predicting Labour is therefore higher than Conservative.
• We split the data into train and test sets in a 70:30 ratio.

Encoding
• 6 ordinal features are already of integer datatype, but 2 features are of object
datatype and need to be converted into a categorical datatype.
Figure 22 encoded table of dataset

Scaling
• Age is measured in years while the other features are ratings on different scales,
hence scaling is required for some models (distance-based ones in particular) for
accurate results.
• The scaler is fit on the training set only, and the same fitted scaler is then
applied to the test set, so the test data is transformed with the training statistics.
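A sketch of the encoding and 70:30 split, assuming a small synthetic stand-in for the survey data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same kinds of columns as the survey.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 10,
    "gender": ["female", "male"] * 10,
    "age": range(24, 44),
})

# Encode the object columns as categorical integer codes.
for col in ["vote", "gender"]:
    df[col] = df[col].astype("category").cat.codes

X = df.drop(columns=["vote"])
y = df["vote"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
print(X_train.shape, X_test.shape)     # 70:30 split
```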
Sol. 4

Modelling-
Logistic regression-
• Scaling is not required for logistic regression; it does not affect the test accuracy.
• We build the logistic regression model using grid-search cross-validation.
• The train set accuracy of the logistic regression model is 83% and the test set
accuracy is 85.4%. The train and test accuracies are almost similar, so the model is
valid; the difference between the two lies within a 10% range.

Linear discriminant analysis-


• Scaling is not required for linear discriminant analysis; it does not affect the
test accuracy.
• The train set accuracy of the linear discriminant analysis model is 83.6% and the
test set accuracy is 81.5%. The train and test accuracies are almost similar, so the
model is valid; the difference between the two lies within a 10% range.

Comparing the logistic regression and LDA models, the logistic regression model has
the better accuracy, so we conclude that the logistic regression model is better for
predicting the party.
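The two models above can be sketched as follows; `make_classification` stands in for the election data, and the grid of `C` values is an assumed example of the grid-search step:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary-classification data stands in for the election survey.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Logistic regression tuned with grid-search cross-validation;
# the grid of C values is an assumed example.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_tr, y_tr)

lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

print("LR  train/test:", grid.score(X_tr, y_tr), grid.score(X_te, y_te))
print("LDA train/test:", lda.score(X_tr, y_tr), lda.score(X_te, y_te))
```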

Sol. 5
Naïve Bayes Model-
• Scaling is not required for Naïve Bayes; it does not affect the accuracy.
• For numerical features, the Naïve Bayes model assumes that the features are normally
distributed, and it also assumes that the predictors are independent of each other.
• We build a GaussianNB classifier, train it with the fit() method, and then make
predictions with the predict() method.
• The train set accuracy of the Naïve Bayes model is 83.2% and the test set accuracy
is 82.3%. The train and test accuracies are almost similar, so the model is valid;
the difference between the two lies within a 10% range.

KNN Model-
• For KNN, pre-processing is required to put the independent variables on the same
scale, so we z-score all the numeric features used in the distance calculation.
• The default number of neighbours is 5; we try different k values and pick the one
with the least misclassification error.
• From the data, the best k value is 13. We build a k-neighbours classifier with
metric='euclidean', train it with the fit() method, and make predictions on the test
set with the predict() method.
• The train set accuracy of the KNN model is 84.9% and the test set accuracy is 86%.
The train and test accuracies are almost similar, so the model is valid; the
difference between the two lies within a 10% range.
Comparing the Naïve Bayes and KNN models, the KNN model has the better accuracy, so we
conclude that the KNN model is better for predicting the party.
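A sketch of the Naïve Bayes and KNN steps, again on synthetic stand-in data; note that the scaler is fit on the train split only and then applied to both splits:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Gaussian Naive Bayes needs no scaling.
nb = GaussianNB().fit(X_tr, y_tr)

# KNN is distance-based, so z-score the features: fit the scaler on the
# train split only, then apply it to both splits.
scaler = StandardScaler().fit(X_tr)
knn = KNeighborsClassifier(n_neighbors=13, metric="euclidean")
knn.fit(scaler.transform(X_tr), y_tr)

print("NB  test:", nb.score(X_te, y_te))
print("KNN test:", knn.score(scaler.transform(X_te), y_te))
```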

Sol. 6
Random forest model-
• Scaling is not required for the random forest model.
• The model is built using grid-search cross-validation.
• The train set accuracy of the random forest model is 82.7% and the test set accuracy
is 84.1%. The train and test accuracies are almost similar, so the model is valid;
the difference between the two lies within a 10% range.

Bagging model –
• Here we use a base estimator with n_estimators=100; the classifier is trained with
the fit() method, and then the model makes predictions on the test features with the
predict() method.
• The train set accuracy of the bagging model is 81.7% and the test set accuracy is
82.9%. The train and test accuracies are almost similar, so the model is valid; the
difference between the two lies within a 10% range.
Comparing the random forest and bagging models, the random forest model has the better
accuracy, so we conclude that the random forest model is better for predicting the
party.

Gradient Boosting-

• In gradient boosting we set n_estimators=100; the classifier is trained with the
fit() method, and then the model makes predictions on the test features with the
predict() method.
• The train set accuracy of the gradient boosting model is 88.7% and the test set
accuracy is 83.9%. The train and test accuracies are almost similar, so the model is
valid; the difference between the two lies within a 10% range.

Ada Boosting –
• In Ada boosting we set n_estimators=100; the classifier is trained with the fit()
method, and then the model makes predictions on the test features with the predict()
method.
• The train set accuracy of the Ada boosting model is 84.7% and the test set accuracy
is 83.6%. The train and test accuracies are almost similar, so the model is valid;
the difference between the two lies within a 10% range.
Comparing gradient boosting and Ada boosting, the accuracy of both models is good, but
the gap between train and test accuracy is much smaller for Ada boosting, so we
conclude that Ada boosting is better for predicting the party.
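The four ensemble models can be fit side by side as below; hyperparameters other than n_estimators=100 are left at their sklearn defaults, which is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data in place of the election survey.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "bagging": BaggingClassifier(n_estimators=100, random_state=1),
    "grad_boost": GradientBoostingClassifier(n_estimators=100, random_state=1),
    "ada_boost": AdaBoostClassifier(n_estimators=100, random_state=1),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Compare train vs test accuracy, as the report does for each model.
    print(name, model.score(X_tr, y_tr), model.score(X_te, y_te))
```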

Sol 7
Model Evaluation-
Logistic Regression-
• Classification Report -

Figure 23 classification report of train data set


Figure 24 classification report of test data set

• Confusion matrix-

Figure 25 confusion matrix of train data set

Figure 26 confusion matrix of test data set


AUC scores of the train and test datasets are 87.4% and 91.5%.
• ROC curve-

Figure 27 ROC curve of train and test dataset

From the above output we see that the logistic regression model is a good model.

Linear discriminant analysis-


• Classification Report -

Figure 28 classification report of train dataset


Figure 29 classification report of test dataset

Confusion matrix-

Figure 30 confusion matrix of train data set

Figure 31 confusion matrix of test dataset

AUC scores for the training and test data are 88.9% and 88.4%.


ROC curve-

Figure 32 ROC curve of train and test data

We see that the train and test data set results are almost similar, so the model is good.
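The evaluation steps above (classification report, confusion matrix, AUC score, and the points for the ROC curve) can be sketched for any fitted model; logistic regression on stand-in data is used here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# Stand-in data and model; the same calls apply to each model in the report.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

preds = model.predict(X_te)
probs = model.predict_proba(X_te)[:, 1]        # P(class 1), for ROC/AUC

print(classification_report(y_te, preds))      # precision / recall / f1
cm = confusion_matrix(y_te, preds)
print(cm)
auc = roc_auc_score(y_te, probs)
print("AUC:", auc)
fpr, tpr, _ = roc_curve(y_te, probs)           # points for the ROC plot
```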
INSIGHTS-
• Comparing precision and recall scores across all the models, the Naïve Bayes model is
good for predicting the parties.
• On the Naïve Bayes test data, recall is 86%, so only 14% of actual Labour supporters
are predicted as voting against the Labour party.
• On the Naïve Bayes test data, precision is 89%, so only 11% of the votes predicted
for Labour were actually against the Labour party.
• Votes are mostly in favour of the Labour party, and it wins.
• People who are not Eurosceptic are likely to vote for the Labour party.
• People who do not have political knowledge vote for the Labour party.

Problem -2
Load the packages and extract the 3 speeches from the given code.

Sol 1
Speech given by President Franklin D. Roosevelt in 1941:
Total no. of words: 1536; total no. of characters: 7571; total no. of sentences: 68.
Speech given by President John F. Kennedy in 1961:
Total no. of words: 1546; total no. of characters: 7618; total no. of sentences: 52.
Speech given by President Richard Nixon in 1973:
Total no. of words: 2028; total no. of characters: 9991; total no. of sentences: 69.
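The counting step can be sketched as below; `speech` is a toy stand-in, since the actual texts come from the inaugural-speech files loaded by the given code:

```python
# Toy stand-in text; the report loads the actual inaugural speeches.
speech = "Four freedoms guide us. We face the future. We do not retreat."

n_chars = len(speech)                            # character count
n_words = len(speech.split())                    # whitespace-separated words
n_sentences = sum(speech.count(p) for p in ".!?")  # terminal punctuation
print(n_chars, n_words, n_sentences)
```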

Sol 2
For removing stop words:
All three speeches are converted to lower case, then stop words and special characters
are removed, and the result is assigned to a new list variable.
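A self-contained sketch of the lower-casing and stop-word removal; the report uses nltk's English stop-word list, while a small hand-written subset stands in here:

```python
import re

# Hand-written stop-word subset standing in for nltk's English list.
STOPWORDS = {"we", "the", "do", "not", "us", "a", "of", "and", "to", "in"}

speech = "We face the future. We do not retreat!"
tokens = re.findall(r"[a-z]+", speech.lower())   # lowercase, strip punctuation
filtered = [w for w in tokens if w not in STOPWORDS]
print(filtered)
```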

Sol 3
The most used words in speech 1 are: nation (12 times), know (19 times), spirit (9
times), democracy (9 times), life (9 times).
The most used words in speech 2 are: let (16 times), us (12 times), sides (8 times),
world (8 times), new (7 times).
The most used words in speech 3 are: us (28 times), let (22 times), America (21
times), peace (19 times), world (18 times).
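The top-word counts can be computed with collections.Counter; the speech text and stop-word subset below are illustrative assumptions, not the real inaugural data:

```python
import re
from collections import Counter

# Illustrative speech text and stop-word subset (assumptions).
STOPWORDS = {"the", "of", "for", "in", "a", "and", "to"}
speech = "Let us seek peace. Peace for America. America wants peace in the world."

tokens = [w for w in re.findall(r"[a-z]+", speech.lower())
          if w not in STOPWORDS]
top_three = Counter(tokens).most_common(3)   # top three words after stop-word removal
print(top_three)
```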
