Umendra Pratap Singh Solanki ML Graded Project 18-12-2022
MACHINE LEARNING
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split:
Split the data into train and test (70:30).
Modelling: 22 marks
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy,
Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare
the models and write inference which model is best/optimized.
Inference: 5 marks
2.2 Remove all the stop words from all three speeches.
2.3 Which words occur most frequently in each president's inaugural address? Mention
the top three words. (after removing the stop words)
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stop words)
Problem 1:
You are hired by CNBE, one of the leading news channels, which wants to analyse
the recent elections. A survey was conducted on 1525 voters with 9 variables. You
have to build a model that predicts which party a voter will vote for on the basis
of the given information, in order to create an exit poll that helps predict the
overall win and the seats covered by a particular party.
Dataset for Problem: Election_Data.xlsx
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write
an inference on it.
• The top five and last five rows of the dataset are displayed.
• There are 1525 rows and 10 columns in the dataset. The column "Unnamed: 0" carries no meaningful information, so we drop it, leaving 9 columns.
• The dataset has 8 integer and 2 object dtypes; we convert the object columns to categorical.
• Of the 9 features, 8 are categorical and one is numerical.
Figure 3
Sol.2
UNIVARIATE ANALYSIS –
Age-
• The age feature has no outliers.
• Its distribution is slightly right-skewed.
Blair –
• Voters have mostly given a rating of 4 to the Labour party leader, Blair.
Hague-
• Most voters gave ratings of 2 and 4 to the Conservative party leader, Hague.
Political knowledge –
• Most voters chose a rating of 2.
Figure 11 hist plot and box plot of political knowledge
Vote-
Gender –
• Female voters outnumber male voters.
Figure 13 count plot of Gender
BIVARIATE ANALYSIS -
• Conservative party voters are older on average; among Conservative voters, females are older than males, whereas among Labour voters males are older.
Figure 14
Figure 16
• Conservative party voters give higher political-knowledge ratings.
Figure 17
• Labour party voters rate Blair more highly, and male voters are more likely to favour Blair.
Figure 18
• Conservative party voters rate Hague more highly.
Figure 19
• Conservative party voters score higher on the Europe (Euroscepticism) scale than Labour voters.
Figure 20
PAIRPLOT
Figure 21
Sol. 3
Data split-
• Most voters voted for the Labour party; the Conservative-to-Labour vote ratio is about 1:2.
• A model is therefore more likely to predict Labour than Conservative.
• We split the data into train and test sets in a 70:30 ratio.
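The 70:30 split described above can be sketched as follows. This is an illustrative example on a tiny synthetic frame (the column names `age` and `vote` and the 2:1 Labour ratio are taken from the report, not from the actual Election_Data.xlsx):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in with the assumed ~1:2 Conservative:Labour ratio
df = pd.DataFrame({
    "age": range(30, 45),
    "vote": ["Labour"] * 10 + ["Conservative"] * 5,
})
X = df.drop(columns="vote")
y = df["vote"]

# 70:30 split; stratify=y keeps the class ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
print(len(X_train), len(X_test))  # 10 5
```

Stratifying matters here precisely because the classes are imbalanced 1:2.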
Encoding
• Six ordinal features are already of integer data type, but two features are of object data type and need to be converted to categorical codes.
Figure 22 encoded table of dataset
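A minimal sketch of the encoding step, using a hypothetical two-row sample that mirrors the two object-typed columns mentioned above (the real dataset's values may differ):

```python
import pandas as pd

# Hypothetical sample mirroring the two object-typed columns
df = pd.DataFrame({"vote": ["Labour", "Conservative", "Labour"],
                   "gender": ["female", "male", "female"]})

# Convert each object column to a pandas category, then to integer codes
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes

print(df)  # Conservative->0, Labour->1; female->0, male->1 (alphabetical)
```

`cat.codes` assigns integers by alphabetical category order, which is why Conservative maps to 0 and Labour to 1 here.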
Scaling
• Age is measured in years while the other features are ratings on different scales, so scaling is required for distance-based models to give accurate results.
• The scaler must be fitted on the training set only and then applied to both the train and test sets, so that no information from the test set leaks into model training.
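The fit-on-train, transform-both pattern can be sketched like this (the toy age values are made up; the point is the pattern, not the data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy age values standing in for the real feature
X_train = np.array([[24.0], [35.0], [58.0], [70.0]])
X_test = np.array([[41.0]])

# Fit the scaler on the training set ONLY, then apply the same
# transformation to both sets, so the test set never influences the fit
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.ravel(), X_test_s.ravel())
```

After this, the training column has mean 0 and standard deviation 1, while the test value is expressed in the training set's units.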
Sol. 4
Modelling-
Logistic regression-
• Scaling is not required for logistic regression; it does not affect test accuracy here.
• We build the logistic regression model using grid-search cross-validation.
• The train-set accuracy is 83% and the test-set accuracy is 85.4%. The two are similar, with a difference well within 10 percentage points, so the model is valid.
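A sketch of building logistic regression with grid-search cross-validation. Synthetic data stands in for the election dataset, and the grid values for `C` are assumptions, not the project's exact settings:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the election data (binary target)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

# 5-fold grid search over the regularisation strength C
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.score(X_te, y_te), 3))
```

`grid.score` on the held-out set gives the test accuracy that the report compares against the train accuracy.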
Sol. 5
Naïve Bayes Model-
• Scaling is not required for Naïve Bayes; it does not affect accuracy.
• For numerical features, Gaussian Naïve Bayes assumes a normal distribution within each class, and it also assumes the predictors are independent of each other.
• We build a GaussianNB classifier, train it with the fit() method, and then generate predictions with the predict() method.
• The train-set accuracy is 83.2% and the test-set accuracy is 82.3%. The two are similar, with a difference well within 10 percentage points, so the model is valid.
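The fit/predict steps above can be sketched as follows, again on synthetic stand-in data rather than the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the election features
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

# Gaussian NB: assumes each feature is normal within a class and
# that predictors are conditionally independent
nb = GaussianNB().fit(X_tr, y_tr)   # train with fit()
pred = nb.predict(X_te)             # predict with predict()
print(round(nb.score(X_te, y_te), 3))
```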
KNN Model-
• KNN requires pre-processing to put the independent variables on a common scale, so we z-score all the numeric features used in the distance calculation.
• Starting from the default of 5 neighbours, we try different values of k and look for the one with the least misclassification error.
• The best value is k = 13; we build a k-neighbours classifier with metric='euclidean', fit it on the training data, and evaluate it with the predict() method on the test set.
• The train-set accuracy is 84.9% and the test-set accuracy is 86%. The two are similar, with a difference well within 10 percentage points, so the model is valid.
Comparing the Naïve Bayes and KNN models, KNN has the higher accuracy, so we conclude that KNN is the better model for predicting the party.
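The k-selection loop can be sketched as below. This version estimates the misclassification error by cross-validation on the (scaled) training set; the synthetic data and the odd-k range are assumptions for illustration, so the best k it finds will not be the report's k = 13:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in; z-score first because KNN is distance-based
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Try odd k values and keep the one with the least misclassification error
errors = {k: 1 - cross_val_score(
                  KNeighborsClassifier(n_neighbors=k, metric="euclidean"),
                  X_tr, y_tr, cv=5).mean()
          for k in range(1, 20, 2)}
best_k = min(errors, key=errors.get)

knn = KNeighborsClassifier(n_neighbors=best_k, metric="euclidean").fit(X_tr, y_tr)
print(best_k, round(knn.score(X_te, y_te), 3))
```

Odd values of k avoid ties in the majority vote for a binary target.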
Sol. 6
Random forest model-
• Scaling is not required for the random forest model.
• The model is built using grid-search cross-validation.
• The train-set accuracy is 82.7% and the test-set accuracy is 84.1%. The two are similar, with a difference well within 10 percentage points, so the model is valid.
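A sketch of the random forest grid search; the parameter grid here is a minimal assumed example, not the project's actual grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the election data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

# Small assumed grid over forest size and tree depth
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid={"n_estimators": [100, 200],
                                "max_depth": [4, 6, None]},
                    cv=3)
grid.fit(X_tr, y_tr)
print(grid.best_params_, round(grid.score(X_te, y_te), 3))
```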
Bagging model –
• We use a bagging classifier with n_estimators=100; it is trained with the fit() method and predictions on the test features are made with the predict() method.
• The train-set accuracy is 81.7% and the test-set accuracy is 82.9%. The two are similar, with a difference well within 10 percentage points, so the model is valid.
Comparing the random forest and bagging models, random forest has the higher accuracy, so we conclude that random forest is the better model for predicting the party.
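A sketch of the bagging setup on synthetic stand-in data; scikit-learn's `BaggingClassifier` defaults to a decision tree as the base estimator, and `n_estimators=100` matches the setting described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the election data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

# 100 bootstrap-trained decision trees (the default base estimator)
bag = BaggingClassifier(n_estimators=100, random_state=42)
bag.fit(X_tr, y_tr)        # train with fit()
pred = bag.predict(X_te)   # predict with predict()
print(round(bag.score(X_te, y_te), 3))
```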
Gradient Boosting-
Ada Boosting –
• In AdaBoost we set n_estimators=100; the classifier is trained with the fit() method and predictions on the test features are made with the predict() method.
• The train-set accuracy is 84.7% and the test-set accuracy is 83.6%. The two are similar, with a difference well within 10 percentage points, so the model is valid.
Comparing gradient boosting and AdaBoost, both models have good accuracy, but the gap between train and test accuracy is smaller for AdaBoost, so we conclude that AdaBoost is the better model for predicting the party.
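A sketch that fits both boosting models side by side and reports the train-test gap used in the comparison above (synthetic stand-in data; the resulting numbers will differ from the report's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the election data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

scores = {}
for model in (AdaBoostClassifier(n_estimators=100, random_state=42),
              GradientBoostingClassifier(n_estimators=100, random_state=42)):
    model.fit(X_tr, y_tr)
    # Record test accuracy and the train-test gap (overfitting indicator)
    name = type(model).__name__
    scores[name] = round(model.score(X_te, y_te), 3)
    gap = model.score(X_tr, y_tr) - model.score(X_te, y_te)
    print(name, scores[name], "gap:", round(gap, 3))
```

The model with the smaller gap generalises more consistently, which is the criterion the comparison above applies.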
Sol 7
Model Evaluation-
Logistic Regression-
• Classification Report -
• Confusion matrix-
From the above output we see that the logistic regression model performs well.
The Naïve Bayes model's train-set accuracy is 83.2% and its test-set accuracy is 82.3%. The two are similar, with a difference well within 10 percentage points, so the model is valid.
Confusion matrix-
The train and test results are very similar, so the model is sound.
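The evaluation described in this section (accuracy, confusion matrix, classification report, ROC_AUC score) can be sketched as below; logistic regression on synthetic stand-in data serves as the example model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the election data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]   # scores for the ROC curve / AUC
print(confusion_matrix(y_te, pred))
print(classification_report(y_te, pred))
print("accuracy:", round(accuracy_score(y_te, pred), 3),
      "AUC:", round(roc_auc_score(y_te, proba), 3))
```

AUC is computed from the predicted probabilities, not the hard class labels; the ROC curve itself can be drawn from the same `proba` scores.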
INSIGHTS-
• Comparing precision and recall across all the models, the Naïve Bayes model is the best for predicting the parties.
• On the test data the Naïve Bayes recall is 86%, so only 14% of actual Labour voters are predicted as voting against the Labour party.
• On the test data the Naïve Bayes precision is 89%, so only 11% of voters predicted to support Labour actually voted against the Labour party.
• Votes are largely in favour of the Labour party, and it wins.
• Voters who are not Eurosceptic are more likely to vote for the Labour party.
• Voters with little political knowledge tend to vote for the Labour party.
Problem -2
Load the package and extract the three speeches using the given code.
Speech 1: 1536 words, 7571 characters, 68 sentences.
Speech 2, given by President John F. Kennedy in 1961: 1546 words, 7618 characters, 52 sentences.
Speech 3: 2028 words, 9991 characters, 69 sentences.
Sol 2
For removing stop words-
We convert all three speeches to lower case, remove the stop words and special characters, and assign the cleaned tokens to new list variables.
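A sketch of the cleaning step. For self-containment this uses a tiny illustrative stop-word set rather than a full list such as NLTK's English stop words, and the sample sentence is one line from Kennedy's address:

```python
import re

# Tiny illustrative stop-word subset (the project would use a full list)
STOPWORDS = {"a", "an", "and", "but", "in", "is", "of", "the", "to", "we"}

def clean(text):
    # Lower-case, keep alphabetic tokens only, then drop stop words
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

sample = "We observe today not a victory of party, but a celebration of freedom."
print(clean(sample))
```

The regex pass removes punctuation and digits in the same step as tokenisation, which covers the "special characters" part of the cleaning.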
Sol 3
The most frequent words in speech 1 are: nation (12), know (19), spirit (9), democracy (9), life (9).
The most frequent words in speech 2 are: let (16), us (12), sides (8), world (8), new (7).
The most frequent words in speech 3 are: us (28), let (22), America (21), peace (19), world (18).
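The top-word counts can be produced with `collections.Counter` over the cleaned token list. The token list below is a made-up miniature, so the counts follow from it and not from any actual speech:

```python
from collections import Counter

# Hypothetical cleaned token list standing in for a full speech
tokens = ["let", "us", "let", "peace", "us", "let", "world", "peace", "let"]

# most_common(3) returns the top three (word, count) pairs
top3 = Counter(tokens).most_common(3)
print(top3)  # [('let', 4), ('us', 2), ('peace', 2)]
```

Running the same two lines on each cleaned speech yields the per-president top-three lists reported above.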