Machine Learning Business Report - Reshma
C2 - Restricted use
DSBA
Reshma A
Date:27/11/22
Contents
Problem 1
1.1. Read the dataset. Do the descriptive statistics and do the null value condition check. Write an inference on it.
1.2. Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
1.3. Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30).
1.4. Apply Logistic Regression and LDA (linear discriminant analysis).
1.5. Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6. Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7. Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
Problem 2
2.1 Find the number of characters, words, and sentences for the mentioned documents.
2.2 Remove all the stopwords from all three speeches.
2.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords)
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the stopwords)
List of Figures
Figure 1: Univariate Analysis: Histogram
Figure 2: Boxplots with outliers
Figure 3: Vote countplot
Figure 4: Gender countplot
List of Tables
Table 1: Sample of dataset
Table 2: Datatypes
Table 3: Statistical data
Table 5: Pairplot
Table 6: Dataset after encoding
Table 7: Boxplot after scaling
Problem 1
You are hired by CNBE, one of the leading news channels, which wants to analyze the recent elections. The survey
was conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will
vote for on the basis of the given information, in order to create an exit poll that will help predict the overall win and
the seats covered by a particular party.
1.1. Read the dataset. Do the descriptive statistics and do the null value condition check. Write an
inference on it.
Sample of dataset:
Datatypes:
Table 2:Datatypes
Statistical Data:
The following is the statistical data associated with the dataset and the information:
Duplicates:
There were 8 duplicates observed. They were dropped for better data analysis
Skewness:
Observations:
The dataset has 1525 rows and 9 columns. The Unnamed field was dropped because it was not relevant
to the analysis, leaving a 1525 x 8 dataset
‘vote’ and ‘gender’ are of object datatype while all others are of int64 datatype
No null values were found in the null value check
There were 8 duplicate records, which were dropped
From the skewness table it is observed that the data is fairly symmetrical. Only ‘Blair’ has a
higher skewness; all other variables have skewness between -0.5 and 0.5
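The checks described above (null values, duplicates, skewness) can be sketched in plain Python. This is a minimal illustration, not the code used for the report: the rows below are hypothetical, the helper names are our own, and the skewness shown is the simple population formula, so a library such as pandas, which applies a bias correction, may return slightly different values.

```python
from statistics import mean

# Hypothetical sample rows; the real survey values are not reproduced here.
rows = [
    {"age": 43, "Blair": 4, "gender": "female"},
    {"age": 36, "Blair": 4, "gender": "male"},
    {"age": 43, "Blair": 4, "gender": "female"},  # exact duplicate of the first row
    {"age": 70, "Blair": 1, "gender": "male"},
]

def count_nulls(records):
    """Count None values per column."""
    counts = {key: 0 for key in records[0]}
    for rec in records:
        for key, value in rec.items():
            if value is None:
                counts[key] += 1
    return counts

def drop_duplicates(records):
    """Keep the first occurrence of each identical row, preserving order."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    mu = mean(values)
    m2 = mean((v - mu) ** 2 for v in values)
    m3 = mean((v - mu) ** 3 for v in values)
    return m3 / m2 ** 1.5

null_counts = count_nulls(rows)
deduped = drop_duplicates(rows)
age_skew = skewness([r["age"] for r in deduped])
```

A skewness between -0.5 and 0.5 is the range the observations above treat as fairly symmetrical.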
1.2. Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
Univariate analysis is the simplest form of data analysis. ‘Uni’ means one, so the analysis deals with
one variable at a time: each variable is examined separately. Data is gathered for the purpose of
answering a question, or more specifically, a research question.
Bivariate Analysis
Table 4:Pairplot
Heatmap
Figure 5:Heatmap
Observations
Labour receives more votes than the Conservative party
There are more female voters than male voters
Economic.cond.household and economic.cond.national have outliers, which need to be
treated
The age of the voters is spread across a wide range, mostly between 40 and 70, and the
plot is slightly right skewed
The distributions of both economic condition fields are left skewed
The distribution of Hague is slightly right skewed
The distribution of Europe is left skewed
The distribution of Blair is slightly left skewed
Political knowledge is highest in the 35-50 age range
Europe sentiment has not affected voters’ ratings of the economic conditions of household and
nation
Voters with Europe sentiment favour the Conservative party
Younger people tend to vote Labour. The Labour party receives a high voting rate among
all ages
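One common way to flag and treat the outliers seen in the boxplots is the 1.5 x IQR rule, with values outside the whiskers capped at the whisker limits. The sketch below assumes that rule; the report does not state which treatment was actually applied, and the ratings list is hypothetical.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) whisker limits Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap_outliers(values, k=1.5):
    """Clip every value to the whisker limits instead of dropping rows."""
    lower, upper = iqr_bounds(values, k)
    return [min(max(v, lower), upper) for v in values]

# Hypothetical 1-5 ratings, standing in for a field such as economic.cond.national.
ratings = [3, 4, 1, 4, 3, 5, 4, 3, 4]
bounds = iqr_bounds(ratings)
capped = cap_outliers(ratings)
```

Capping keeps the row count unchanged, which matters when the dataset is small; dropping the rows would be the alternative.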
1.3. Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split:
Split the data into train and test (70:30).
Encoding
The two categorical variables ‘vote’ and ‘gender’ were converted to numeric values. ‘Vote’ had two values,
Labour and Conservative, which were replaced with ‘0’ and ‘1’ respectively. ‘Gender’, with values male
and female, was likewise replaced with ‘1’ and ‘0’. Given below is the dataset after encoding
Scaling
From the exploratory data analysis it was evident that the variables were on different scales,
so the data was scaled. The boxplots analysed in the previous segment also showed
outliers, so outlier treatment was performed as well.
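The encoding, scaling and 70:30 split can be sketched as follows. This is an illustration under assumptions: the records are hypothetical, z-score standardisation is assumed because the report does not name the scaler, and the helper names and seed are our own.

```python
import random
from statistics import mean, pstdev

# Codes follow the Encoding paragraph: Labour=0, Conservative=1, female=0, male=1.
VOTE_CODES = {"Labour": 0, "Conservative": 1}
GENDER_CODES = {"female": 0, "male": 1}

def encode(record):
    """Replace the string values of 'vote' and 'gender' with integer codes."""
    encoded = dict(record)
    encoded["vote"] = VOTE_CODES[record["vote"]]
    encoded["gender"] = GENDER_CODES[record["gender"]]
    return encoded

def zscore(values):
    """Standardise to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def train_test_split(records, test_fraction=0.30, seed=1):
    """Shuffle a copy of the records and split off the test fraction."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical records standing in for the survey rows.
records = [
    {"vote": "Labour" if i % 3 else "Conservative",
     "gender": "female" if i % 2 else "male",
     "age": 25 + 5 * i}
    for i in range(10)
]
encoded = [encode(r) for r in records]
scaled_age = zscore([r["age"] for r in encoded])
train, test = train_test_split(encoded)
```

Scaling matters most for distance-based models such as KNN, where a wide-range variable like age would otherwise dominate the 1-5 rating fields.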
1.4. Apply Logistic Regression and LDA (linear discriminant analysis).
Logistic Regression
Test
Accuracy: 82.89%
AUC: 0.881
Train
Accuracy: 83.69%
AUC: 0.891
LDA
Test
Accuracy: 83.11%
AUC: 0.888
Train
Accuracy: 83.41%
AUC: 0.890
1.5. Apply KNN Model and Naïve Bayes Model. Interpret the results.
KNN Model
Test
Accuracy: 82.67%
AUC: 0.882
Hyper parameters used:
Train
Accuracy: 85.67%
AUC: 0.982
There is slight overfitting: the train scores (accuracy 85.67%, AUC 0.982) exceed the test scores
(accuracy 82.67%, AUC 0.882).
Naïve Bayes Model
Train
Accuracy: 83.4%
AUC: 0.889
There is no overfitting or underfitting. No hyper-parameters were tuned for this model. The error
on the test set is slightly greater than on the train set.
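To make the KNN behaviour concrete, here is a minimal k-nearest-neighbours classifier in plain Python. It is a sketch only: the two-feature points and labels are hypothetical, and the report's model would have been fitted with a library on the full feature set.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, point, k=3):
    """Majority vote among the k training points nearest to `point`."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical scaled features; label 0 = Labour, 1 = Conservative.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]

near_first_cluster = knn_predict(train_X, train_y, (0.05, 0.05))
near_second_cluster = knn_predict(train_X, train_y, (1.0, 0.95))
```

With k=1 every training point is its own nearest neighbour, so the train score is perfect by construction. That is one reason a KNN model can show a train score well above its test score, and why the number of neighbours is usually tuned.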
1.6. Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
Bagging
base estimator = RandomForestClassifier
max_depth = 10
min_samples_leaf = 15
min_samples_split = 30
n_estimators = 50
random_state = 1
Test
Accuracy: 81.5%
AUC: 0.888
Train
Accuracy: 84.6%
AUC: 0.909
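The idea behind bagging is to fit many copies of a base model on bootstrap samples (rows drawn with replacement) and combine them by majority vote. The sketch below shows only that mechanism. It deliberately swaps the Random Forest base estimator used in the report for a one-feature threshold rule, and the data and function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Pick the threshold on one feature with the fewest training errors.
    The stump predicts 1 when x > threshold, else 0."""
    best_threshold, best_errors = None, len(xs) + 1
    for threshold in sorted(set(xs)):
        errors = sum((x > threshold) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def fit_bagging(xs, ys, n_estimators=50, seed=1):
    """Fit one stump per bootstrap sample and return the list of thresholds."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def predict_bagging(thresholds, x):
    """Majority vote over all fitted stumps."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical single feature with a clean class boundary between 4 and 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_bagging(xs, ys)
```

Averaging over resampled fits reduces variance, which is why constraints such as max_depth and min_samples_leaf on the base estimator are used alongside it to limit overfitting.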
Boosting
Test
Accuracy: 89.9%
AUC: 0.884
Train
Accuracy: 83.4%
AUC: 0.902
1.7. Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
Logistic Regression
Test
Classification Report
Confusion Matrix
ROC Curve
Train
Classification Report
Confusion Matrix
ROC Curve
LDA
Test
Classification Report
Confusion Matrix
ROC Curve
Train
Classification Report
Confusion Matrix
ROC Curve
KNN Model
Test
Classification Report
Confusion Matrix
ROC Curve
Train
Classification Report
Confusion Matrix
ROC Curve
Naïve Bayes Model
Test
Classification Report
Confusion Matrix
ROC Curve
Train
Classification Report
Confusion Matrix
ROC Curve
Bagging
Test
Classification Report
Confusion Matrix
ROC Curve
Train
Classification Report
Confusion Matrix
ROC Curve
Boosting
Train
Classification Report
Confusion Matrix
ROC Curve
Test
Classification Report
Confusion Matrix
ROC Curve
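The three metrics reported for every model (accuracy, confusion matrix, ROC AUC) can be computed by hand as shown below. This is a sketch on hypothetical labels and scores. The AUC here uses the pairwise-ranking definition, which equals the area under the ROC curve, and the confusion matrix uses rows for the true class and columns for the predicted class.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0/1."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.
    Ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
y_pred = [int(s >= 0.5) for s in scores]  # default 0.5 decision threshold

cm = confusion_matrix(y_true, y_pred)
acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy depends on the chosen threshold while AUC does not, so two models can rank differently on the two metrics.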
Observations
In terms of accuracy, the KNN model is the most accurate
The underfitting and overfitting bagging and boosting models were tuned to get better results
The boosting classifier performed the best, with a maximum AUC score of 88.8% in test and 90.9%
in train
Overfitting was observed in the Random Forest that was used for Bagging, so both models
were overfitted. They were tuned using GridSearchCV
The Labour party has considerably more votes than the Conservative party
All models had good accuracy
Blair scored higher than Hague, with an average of 3.3 against Hague’s
2.74
People who gave low values for a party also ended up voting for the same party
Political knowledge is low for a large section of people, with an average of only 1.7
Political knowledge is highest in the 35-50 age range
Europe sentiment has not affected voters’ ratings of the economic conditions of household and
nation
The economic condition for household averaged 3.13 and the economic condition for nation
averaged 3.24
Voters with Europe sentiment favour the Conservative party
Younger people tend to vote Labour. The Labour party receives a high voting rate among
all ages
Overfitting was observed in the Bagging model as well as in Random Forest
The boosting classifier is the best of the models used.
Problem 2
In this particular project, we are going to work on the inaugural corpora from the nltk in
Python. We will be looking at the following speeches of the Presidents of the United States of
America:
1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.1 Find the number of characters, words, and sentences for the mentioned
documents.
Roosevelt file:
Number of characters in file: 7571
Number of words in file: 1390
Number of sentences in file: 52
Kennedy File:
Nixon File:
Number of characters in file: 9991
Number of words in file: 1819
Number of sentences in file: 67
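The three counts can be reproduced in outline with plain Python. This is a sketch under assumptions: words are split on whitespace and sentences on runs of '.', '!' or '?', whereas the nltk corpus readers tokenise differently (for example, punctuation marks become separate tokens), so the figures above will not match exactly. The sample text is hypothetical, not taken from the speeches.

```python
import re

def text_counts(text):
    """Return (characters, words, sentences) for a raw text string."""
    characters = len(text)
    words = len(text.split())
    sentences = len([s for s in re.split(r"[.!?]+", text) if s.strip()])
    return characters, words, sentences

# Hypothetical three-sentence sample.
sample = "The nation endures. The world changes! Will America lead?"
counts = text_counts(sample)
```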
Nixon
2.3 Which word occurs the most number of times in his inaugural address
for each president? Mention the top three words. (after removing the
stopwords)
Roosevelt’s speech used ‘nation’ the most number of times and the top three word
occurrences are given below
Kennedy’s speech used ‘world’ the most number of times and the top three word
occurrences are given below
Nixon’s speech used ‘America’ the most number of times and the top three word occurrences
are given below
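Finding the top three words after stopword removal comes down to lower-casing, tokenising, filtering and counting. The sketch below assumes a tiny illustrative stopword set rather than the full nltk English list, and runs on a hypothetical sentence rather than on the speeches themselves.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; an assumption, not the nltk list.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "we", "our", "that", "for"}

def top_words(text, n=3):
    """Return the n most frequent non-stopword tokens with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    return Counter(kept).most_common(n)

# Hypothetical sample text.
sample = "The nation and the world. Our nation is the hope of the world and of the nation."
top_three = top_words(sample)
```

Lower-casing before counting means that ‘America’ and ‘america’ are treated as the same word.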
2.4 Plot the word cloud of each of the speeches of the variable. (after
removing the stopwords)