Machine Learning Report

12/5/2021
BUSINESS ANALYSIS
REPORT
MACHINE LEARNING
SANDYA VB
CONTENTS
PROBLEM 1:
Data Ingestion: 11 marks
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check.
Write an inference on it. (4 Marks)
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers. (7 Marks)
Data Preparation: 4 marks

1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not?
Data Split: Split the data into train and test (70:30). (4 Marks)
Modelling: 22 marks
1.4 Apply Logistic Regression and LDA (linear discriminant analysis). (4 marks)
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results. (4 marks)
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and
Boosting. (7 marks)
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each
model. Final Model: Compare the models and write inference which model is
best/optimized. (7 marks)
Inference: 5 marks
1.8 Based on these predictions, what are the insights? (5 marks)
PROBLEM 2:
2.1 Find the number of characters, words, and sentences for the mentioned documents. – 3
Marks
2.2 Remove all the stopwords from all three speeches. – 3 Marks
2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords) – 3 Marks
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the
stopwords) – 3 Marks
Problem 1
You are hired by one of the leading news channels CNBE who wants to analyze
recent elections. This survey was conducted on 1525 voters with 9 variables. You
have to build a model, to predict which party a voter will vote for on the basis of
the given information, to create an exit poll that will help in predicting overall
win and seats covered by a particular party.
Dataset for Problem 1: Election_Data.xlsx
Data Dictionary:
1. vote: Party choice: Conservative or Labour
2. age: in years
3. economic.cond.national: Assessment of current national economic conditions, 1 to 5.
4. economic.cond.household: Assessment of current household economic conditions, 1 to 5.
5. Blair: Assessment of the Labour leader, 1 to 5.
6. Hague: Assessment of the Conservative leader, 1 to 5.
7. Europe: an 11-point scale that measures respondents' attitudes toward European integration. High scores represent ‘Eurosceptic’ sentiment.
8. political.knowledge: Knowledge of parties' positions on European integration, 0 to 3.
9. gender: female or male.

1.1 Read the dataset. Do the descriptive statistics and do the null
value condition check. Write an inference on it.
Reading the data file (Election_Data.xlsx) and checking the head and tail of the dataset.
Describing the dataset.
• The dataset has 9 columns.
• They are vote, age, economic.cond.national, economic.cond.household, Blair, Hague, Europe, political.knowledge and gender.
• Vote is the target variable.
Shape of the data = (1525,9)

Info
• There are integer and object data types in the dataset.
• There are no null values present in the dataset.
Duplicate values
• There are 8 duplicate rows present in the dataset.
• The duplicate rows are removed from the dataset, leaving 1517 rows.
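The checks described above (shape, null values, duplicate rows) can be sketched with pandas as follows. This is a minimal illustration and not the code used for the report: the helper name `basic_checks` and the three-column toy frame are assumptions standing in for Election_Data.xlsx.

```python
import pandas as pd


def basic_checks(df: pd.DataFrame) -> dict:
    """Collect the shape, per-column null counts and duplicate-row count."""
    return {
        "shape": df.shape,
        "nulls_per_column": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Hypothetical miniature frame; the real data would be loaded from the Excel file.
toy = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour", "Labour"],
    "age": [43, 36, 43, 52],
    "gender": ["female", "male", "female", "male"],
})

report = basic_checks(toy)
# Drop exact duplicate rows and renumber the index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(report)
print(deduplicated.shape)
```

On the real dataset the same calls would be applied to the frame returned by `pd.read_excel`.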
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data
analysis. Check for Outliers.
Univariate Analysis:
• Vote is the target variable.

• Vote: The plot shows that more voters chose the Labour party than the Conservative party.
• Age: The plot shows the age distribution of the voters.
• Economic.cond.national: The plot shows the assessment of the national economic conditions, on a scale of 1-5.
• Economic.cond.household: The plot shows the assessment of the household economic conditions, also on a scale of 1-5.
• Blair: The assessment of the Labour leader on a scale of 1-5.
• Hague: The assessment of the Conservative leader on a scale of 1-5.
• Political.knowledge: The plot shows the voters' knowledge of the parties' positions on European integration, on a scale of 0-3.
• Gender: The plot shows the number of female and male voters who voted for each party.
Bivariate Analysis:
Data Distribution:
Correlation graph:
Outliers Check:
• The outliers have been treated.
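The report does not say which outlier treatment was applied. One common choice is to cap values at the interquartile-range (IQR) fences, sketched below. The function name `cap_outliers_iqr`, the multiplier 1.5 and the sample values are assumptions for illustration only.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Hypothetical values; 11 lies far above the rest.
scores = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 11])
capped = cap_outliers_iqr(scores)

print(capped.tolist())
```

Capping keeps every row in the dataset, which matters here because the sample has only about 1500 voters.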

1.3 Encode the data (having string values) for Modelling. Is Scaling
necessary here or not? Data Split: Split the data into train and
test (70:30).
Encoding the data:
• The variables vote and gender are of object datatype. We need to convert them to integer datatype before modelling.
• For encoding, we use the replace function to convert the categorical string values to categorical numeric values.
• Vote has 2 categories: Conservative and Labour. We use the replace function to encode Conservative as 0 and Labour as 1.
• Gender also has 2 categories: male and female. We use the replace function to encode male as 0 and female as 1.
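The encoding described above could look like the following sketch. The toy frame is hypothetical, and the explicit `astype(int)` is an added safeguard so that the columns end up as integers regardless of pandas version.

```python
import pandas as pd

# Hypothetical miniature frame with the two string-valued columns.
toy = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "gender": ["female", "male", "male"],
})

encoded = toy.copy()
# Conservative -> 0, Labour -> 1
encoded["vote"] = encoded["vote"].replace({"Conservative": 0, "Labour": 1}).astype(int)
# male -> 0, female -> 1
encoded["gender"] = encoded["gender"].replace({"male": 0, "female": 1}).astype(int)

print(encoded)
```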
Data split :
• Data split is performed to split the data into train and test sets in the ratio 70:30.
• Train data is 70% and test data is 30%.
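A 70:30 split, with 70% of the rows used for training and 30% held out for testing, can be sketched with scikit-learn as below. The toy arrays and `random_state=1` are assumptions; the report does not state the seed used for the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 2 features, binary target.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# test_size=0.30 holds out 30% of the rows; the remaining 70% are used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

print(X_train.shape, X_test.shape)
```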
Scaling:
• Scaling is not required.
1.4 Apply Logistic Regression and LDA (linear discriminant
analysis).
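from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Fit logistic regression on the training split
Logit_model = LogisticRegression()
Logit_model.fit(X_train, y_train)
# Fit linear discriminant analysis on the same split
LDA_model = LinearDiscriminantAnalysis()
LDA_model.fit(X_train, y_train)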
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the
results.
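from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
# K-nearest neighbours with the default settings (n_neighbors=5)
KNN_model = KNeighborsClassifier()
KNN_model.fit(X_train, y_train)
# Gaussian Naïve Bayes
NB_model = GaussianNB()
NB_model.fit(X_train, y_train)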
1.6 Model Tuning, Bagging (Random Forest should be applied for
Bagging), and Boosting.
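Model Tuning:
from sklearn import tree
DT_model = tree.DecisionTreeClassifier()
DT_model.fit(X_train, y_train)
Bagging:
• Random Forest
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
RF_model = RandomForestClassifier(n_estimators=100, random_state=1)
RF_model.fit(X_train, y_train)
• Bagging Classifier
# cart is the base decision tree; its definition is not shown in this report
# (in scikit-learn 1.2 and later, the base_estimator argument is named estimator)
Bagging_model = BaggingClassifier(base_estimator=cart, n_estimators=100, random_state=1)
Bagging_model.fit(X_train, y_train)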
Boosting:
• AdaBoost
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
ADB_model = AdaBoostClassifier(n_estimators=100, random_state=1)
ADB_model.fit(X_train, y_train)
• Gradient Boosting
gbcl = GradientBoostingClassifier(random_state=1)
gbcl = gbcl.fit(X_train, y_train)
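The models above are fitted with default hyperparameters. One common way to tune them is a cross-validated grid search, sketched here for a decision tree. Everything specific in this block is an assumption for illustration: the synthetic data from `make_classification`, the grid values and the 5-fold setting are not taken from the report.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded election data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Hypothetical grid: shallower trees and larger leaves limit overfitting.
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 10, 30]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Restricting `max_depth` and `min_samples_leaf` is a usual first step when a tree reaches perfect accuracy on the training data.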
1.7 Performance Metrics: Check the performance of Predictions on
Train and Test sets using Accuracy, Confusion Matrix, Plot ROC
curve and get ROC_AUC score for each model. Final Model:
Compare the models and write inference which model is
best/optimized.
Logistic Regression:
• Accuracy score for train data is 0.83
• Accuracy score for test data is 0.83
• Model score for train data is 0.834
• Model score for test data is 0.831
• AUC and ROC for train data
AUC score is 0.89
• AUC and ROC for test data
AUC score is 0.89
• Confusion Matrix for train data
array([[197, 110],
       [ 66, 688]], dtype=int64)
• Confusion Matrix for test data
array([[112, 41],
       [ 36, 267]], dtype=int64)
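The model scores quoted for Logistic Regression can be cross-checked from its confusion matrices, since accuracy is the sum of the diagonal divided by the total count. The helper name `accuracy_from_confusion` is an assumption; the matrices are the ones listed above.

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified (diagonal) / all observations."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Logistic Regression confusion matrices from this report.
lr_train = [[197, 110], [66, 688]]
lr_test = [[112, 41], [36, 267]]

train_acc = accuracy_from_confusion(lr_train)
test_acc = accuracy_from_confusion(lr_test)

print(f"train accuracy: {train_acc:.3f}")  # 0.834
print(f"test accuracy:  {test_acc:.3f}")   # 0.831
```

The same helper can be applied to the confusion matrices of the other models to compare their train and test accuracy.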
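Linear Discriminant Analysis:
• Model score for train data is 0.834
• AUC and ROC for train data
AUC score is 0.89
• AUC and ROC for test data
AUC score is 0.89
• Confusion Matrix for train data
array([[200, 107],
       [ 69, 685]], dtype=int64)
• Confusion Matrix for test data
array([[111, 42],
       [ 35, 268]], dtype=int64)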
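KNN Model:
• Accuracy score for train data is 0.85
• Model score for train data is 0.853
• AUC and ROC for train data
AUC score is 0.923
• AUC and ROC for test data
AUC score is 0.923
• Confusion Matrix for train data
array([[204, 103],
       [ 52, 702]], dtype=int64)
• Confusion Matrix for test data
array([[ 99, 54],
       [ 30, 273]], dtype=int64)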
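GaussianNB Model:
• Accuracy score for train data is 0.83
• Model score for train data is 0.832
• AUC and ROC for train data
AUC score is 0.889
• AUC and ROC for test data
AUC score is 0.889
• Confusion Matrix for train data
array([[212, 95],
       [ 81, 673]], dtype=int64)
• Confusion Matrix for test data
array([[112, 41],
       [ 40, 263]], dtype=int64)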
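Decision Tree Classifier Model:
• Accuracy score for train data is 1.0
• Model score for train data is 1.0
• AUC and ROC for train data
AUC score is 0.889
• AUC and ROC for test data
AUC score is 0.889
• Confusion Matrix for train data
array([[307, 0],
       [ 0, 754]], dtype=int64)
• Confusion Matrix for test data
array([[100, 53],
       [ 45, 258]], dtype=int64)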
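Random Forest Classifier Model:
• Accuracy score for train data is 1.0
• Model score for train data is 1.0
• AUC and ROC for train data
AUC score is 0.889
• AUC and ROC for test data
AUC score is 0.889
• Confusion Matrix for train data
array([[307, 0],
       [ 0, 754]], dtype=int64)
• Confusion Matrix for test data
array([[104, 49],
       [ 28, 275]], dtype=int64)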
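Bagging Classifier Model:
• Accuracy score for train data is 1.0
• Model score for train data is 1.0
• AUC and ROC for train data
AUC score is 0.889
• AUC and ROC for test data
AUC score is 0.889
• Confusion Matrix for train data
array([[307, 0],
       [ 0, 754]], dtype=int64)
• Confusion Matrix for test data
array([[108, 45],
       [ 37, 266]], dtype=int64)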
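Ada Boost Classifier Model:
• AUC and ROC for train data
AUC score is 0.889
• AUC and ROC for test data
AUC score is 0.889
• Confusion Matrix for train data
array([[214, 93],
       [ 66, 688]], dtype=int64)
• Confusion Matrix for test data
array([[103, 50],
       [ 35, 268]], dtype=int64)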
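Gradient Boosting Classifier Model:
• AUC and ROC for train data
AUC score is 0.889
• AUC and ROC for test data
AUC score is 0.889
• Confusion Matrix for train data
array([[239, 68],
       [ 46, 708]], dtype=int64)
• Confusion Matrix for test data
array([[105, 48],
       [ 27, 276]], dtype=int64)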
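The metrics listed for each model (accuracy, confusion matrix, ROC curve and ROC_AUC score on both splits) can be produced with a small helper such as the one below. This is a sketch under stated assumptions: the data is synthetic, the helper name `evaluate` is hypothetical, and Logistic Regression is used only as an example model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded election data (assumption).
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)


def evaluate(model, X, y):
    """Return accuracy, confusion matrix, ROC AUC and ROC curve points."""
    pred = model.predict(X)
    prob = model.predict_proba(X)[:, 1]  # probability of the positive class
    fpr, tpr, _ = roc_curve(y, prob)
    return {
        "accuracy": accuracy_score(y, pred),
        "confusion_matrix": confusion_matrix(y, pred),
        "roc_auc": roc_auc_score(y, prob),
        "roc_points": (fpr, tpr),
    }


train_metrics = evaluate(model, X_train, y_train)
test_metrics = evaluate(model, X_test, y_test)

for name, m in (("train", train_metrics), ("test", test_metrics)):
    print(name, round(m["accuracy"], 3), round(m["roc_auc"], 3))
```

The AUC is computed from predicted probabilities, not from predicted labels, so it can differ between models that share the same accuracy.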
ROC Curve Analysis

1.8 Based on these predictions, what are the insights?
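Logistic Regression, LDA and Naïve Bayes have similar accuracy on train data and test data (about 0.83 on train and 0.82 to 0.83 on test, computed from the confusion matrices above).
The reported AUC scores are the same on train data and test data for every model.
For these three models, the confusion matrices show that the predicted classes are close to the actual classes on both sets, which reflects a right-fit model. Their F1 scores on train data and test data are also almost the same.
The Decision Tree, Random Forest and Bagging models classify every training record correctly (train accuracy 1.0), but their test accuracy is only about 0.79, 0.83 and 0.82 respectively. This gap between train and test performance is a sign of overfitting.
KNN, AdaBoost and Gradient Boosting score a few points higher on train data than on test data (for example, about 0.85 versus 0.82 for KNN).
Among the boosting techniques, Gradient Boosting shows good performance, with the highest test accuracy of all the models (about 0.84).
Based on the overall performance of all the models, there is no sign of underfitting in this case study, and Logistic Regression, LDA and Naïve Bayes show no overfitting either.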
Problem 2
In this particular project, we are going to work on the inaugural corpora from the
nltk in Python. We will be looking at the following speeches of the Presidents of
the United States of America:
1. President Franklin D. Roosevelt in 1941

2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973
2.1 Find the number of characters, words, and sentences for the
mentioned documents.
Number of Characters:
Character count for the 1941-Roosevelt speech = 7571
Character count for the 1961-Kennedy speech = 7618
Character count for the 1973-Nixon speech = 9991
Number of Words:
Word count for the 1941-Roosevelt speech = 1536
Word count for the 1961-Kennedy speech = 1546
Word count for the 1973-Nixon speech = 2028
Number of Sentences:
Sentence count for the 1941-Roosevelt speech = 68
Sentence count for the 1961-Kennedy speech = 52
Sentence count for the 1973-Nixon speech = 69
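Counts like these can be obtained from the `raw`, `words` and `sents` methods of an NLTK corpus reader. The sketch below wraps them in a hypothetical helper, `speech_counts`, and demonstrates it on a small stand-in class so that it runs without downloading any corpus; the exact code used for the report is not shown in the source.

```python
def speech_counts(corpus, fileid):
    """Return character, word and sentence counts for one corpus file."""
    return {
        "characters": len(corpus.raw(fileid)),
        "words": len(corpus.words(fileid)),
        "sentences": len(corpus.sents(fileid)),
    }


class TinyCorpus:
    """Hypothetical stand-in exposing the same three reader methods."""

    text = "Let us begin. We shall not fail."

    def raw(self, fileid):
        return self.text

    def words(self, fileid):
        # Treat each full stop as its own token, as NLTK's tokenizer would.
        return self.text.replace(".", " .").split()

    def sents(self, fileid):
        return [s for s in self.text.split(".") if s.strip()]


counts = speech_counts(TinyCorpus(), "toy.txt")
print(counts)

# With NLTK installed and its data downloaded, the real call would be:
#   import nltk; nltk.download("inaugural")  # sents() also needs the Punkt data
#   from nltk.corpus import inaugural
#   speech_counts(inaugural, "1941-Roosevelt.txt")
```

Note that the word count from `words` includes punctuation tokens, so it is usually higher than a count of whitespace-separated words.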
2.2 Remove all the stopwords from all three speeches.
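import string
from nltk.corpus import stopwords

def remove_stopwords(array, stopw):
    # Lowercase each token and keep it unless it is a stopword, punctuation or '--'
    filtered = []
    for a in array:
        al = a.lower()
        if al not in stopw and a != '--':
            filtered.append(al)
    return filtered

# English stopwords plus single punctuation characters
stopw = set(stopwords.words('english') + list(string.punctuation))
Rwords = remove_stopwords(R_words, stopw)
Kwords = remove_stopwords(K_words, stopw)
Nwords = remove_stopwords(N_words, stopw)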
2.3 Which word occurs the most number of times in his inaugural
address for each president? Mention the top three words. (after
removing the stopwords).
Top 3 words:
Top three words of Roosevelt: [('nation', 12), ('know', 10), ('spirit', 9)]
Top three words of Kennedy: [('let', 16), ('us', 12), ('world', 8)]
Top three words of Nixon: [('us', 26), ('let', 22), ('america', 21)]
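A top-three list of this form can be produced with `collections.Counter` from the standard library. The token list below is hypothetical and stands in for a speech that has already been lowercased and stripped of stopwords; the report itself does not show which counting function was used.

```python
from collections import Counter


def top_words(tokens, n=3):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)


# Hypothetical tokens, already lowercased and without stopwords.
tokens = ["nation", "spirit", "nation", "know", "nation", "know", "freedom"]
top_three = top_words(tokens)

print(top_three)
```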
2.4 Plot the word cloud of each of the speeches of the variable. (after
removing the stopwords).
Word Cloud for the 1941-Roosevelt speech:
Word Cloud for the 1961-Kennedy speech:
Word Cloud for the 1973-Nixon speech:
