Machine Learning Report

12/5/2021

BUSINESS ANALYSIS
REPORT
MACHINE LEARNING

SANDYA VB

CONTENTS

PROBLEM 1:
Data Ingestion: 11 marks
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check.
Write an inference on it. (4 Marks)
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers. (7 Marks)

Data Preparation: 4 marks


1.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not?
Data Split: Split the data into train and test (70:30). (4 Marks)

Modelling: 22 marks
1.4 Apply Logistic Regression and LDA (linear discriminant analysis). (4 marks)
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results. (4 marks)
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and
Boosting. (7 marks)
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each
model. Final Model: Compare the models and write inference which model is
best/optimized. (7 marks)

Inference: 5 marks
1.8 Based on these predictions, what are the insights? (5 marks)

PROBLEM 2:
2.1 Find the number of characters, words, and sentences for the mentioned documents. – 3
Marks

2.2 Remove all the stopwords from all three speeches. – 3 Marks

2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords) – 3 Marks

2.4 Plot the word cloud of each of the speeches of the variable. (after removing the
stopwords) – 3 Marks

Problem 1

You are hired by one of the leading news channels CNBE who wants to analyze
recent elections. This survey was conducted on 1525 voters with 9 variables. You
have to build a model, to predict which party a voter will vote for on the basis of
the given information, to create an exit poll that will help in predicting overall
win and seats covered by a particular party.

Dataset for Problem 1: Election_Data.xlsx

Data Dictionary:

1. vote: Party choice: Conservative or Labour.

2. age: in years.

3. economic.cond.national: Assessment of current national economic conditions, 1 to 5.

4. economic.cond.household: Assessment of current household economic conditions, 1 to 5.

5. Blair: Assessment of the Labour leader, 1 to 5.

6. Hague: Assessment of the Conservative leader, 1 to 5.

7. Europe: an 11-point scale that measures respondents' attitudes toward European integration.
High scores represent ‘Eurosceptic’ sentiment.

8. political.knowledge: Knowledge of parties' positions on European integration, 0 to 3.

9. gender: female or male.


1.1 Read the dataset. Do the descriptive statistics and do the null
value condition check. Write an inference on it.

Reading the Excel file and checking the head and tail of the dataset.

Describing the dataset.

• The dataset has 9 columns.
• They are vote, age, economic.cond.national, economic.cond.household, Blair,
Hague, Europe, political.knowledge and gender.
• Vote is the target variable.

Shape of the data = (1525,9)


Info

• There are integer and object data types in the dataset.

• There are no null values present in the dataset.

Duplicate values

• There are 8 duplicate rows present in the dataset.
• The duplicate rows are removed from the dataset, which leaves 1517 rows.
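The checks described in this section can be sketched with pandas as follows. This is only an illustration: the small DataFrame is a hypothetical stand-in for Election_Data.xlsx, and with the real file one would presumably load it with `pd.read_excel` instead.

```python
import pandas as pd

# Hypothetical stand-in for Election_Data.xlsx (values are made up).
# With the real file: df = pd.read_excel("Election_Data.xlsx")
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour", "Labour"],
    "age": [43, 36, 43, 35],
    "Blair": [4, 4, 4, 5],
    "gender": ["female", "male", "female", "male"],
})

print(df.head())                    # first rows of the data
print(df.describe(include="all"))   # descriptive statistics
df.info()                           # data types and non-null counts

null_counts = df.isnull().sum()             # null values per column
n_duplicates = int(df.duplicated().sum())   # fully duplicated rows

# Drop the duplicated rows and record the new shape.
df = df.drop_duplicates().reset_index(drop=True)
shape_after = df.shape
print(n_duplicates, shape_after)
```

In this toy example the first and third rows are identical, so one duplicate is found and removed.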

1.2 Perform Univariate and Bivariate Analysis. Do exploratory data
analysis. Check for Outliers.

Univariate Analysis:

• Vote is the target variable.
• Vote: The plot shows that more voters chose Labour than the Conservative party.
• Age: The plot shows the age distribution of the voters.
• Economic.cond.national: The plot shows the assessment of the national
economic condition, on a scale of 1-5.
• Economic.cond.household: The plot shows the assessment of the household
economic condition, also on a scale of 1-5.
• Blair: The assessment of the Labour leader on a scale of 1-5.
• Hague: The assessment of the Conservative leader on a scale of 1-5.
• Political.knowledge: The plot shows the voters' knowledge of the parties'
positions on European integration, on a scale of 0-3.
• Gender: The plot shows the number of female and male voters.

Bivariate Analysis:
Data Distribution:

Correlation graph:
Outliers Check:

• The outliers have been treated.
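The report does not state how the outliers were treated. One common approach, shown here purely as an assumption, is to cap values at 1.5 times the interquartile range beyond the quartiles. The function name `cap_outliers_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series) -> pd.Series:
    """Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] at those limits."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return series.clip(lower=lower, upper=upper)

# Hypothetical sample with one extreme value (30).
sample = pd.Series([1, 2, 3, 3, 3, 4, 4, 5, 30])
capped = cap_outliers_iqr(sample)
print(capped.tolist())
```

Capping keeps every row in the dataset, which matters when the sample is as small as this survey.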


1.3 Encode the data (having string values) for Modelling. Is Scaling
necessary here or not? Data Split: Split the data into train and
test (70:30).

Encoding the data:

• Variables Vote and Gender are of object datatype. We need to convert them
to integer datatype to perform operations.
• For encoding we use the replace function to convert the categorical string
values to categorical numeric values.
• Vote has 2 categories: Conservative and Labour. We use the replace function
to convert Conservative to 0 and Labour to 1.
• Gender also has 2 categories: Male and Female. We use the replace function
to convert Male to 0 and Female to 1.

Data split:

• Data split is performed to split the data in the ratio 70:30.
• Train data is 70% and Test data is 30%.

Scaling:
• Scaling is not required.
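A minimal sketch of the encoding and the 70:30 split, assuming pandas and scikit-learn. The miniature DataFrame and `random_state=1` are assumptions for illustration only, not the report's actual data or settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature data set with the two string columns and one numeric column.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative"] * 5,
    "age": [43, 36, 35, 24, 41, 47, 57, 77, 39, 70],
    "gender": ["female", "male"] * 5,
})

# Encode the string categories as integers, as described above.
df["vote"] = df["vote"].replace({"Conservative": 0, "Labour": 1}).astype(int)
df["gender"] = df["gender"].replace({"male": 0, "female": 1}).astype(int)

# Separate predictors and target, then split 70:30 (train:test).
X = df.drop(columns="vote")
y = df["vote"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(len(X_train), len(X_test))
```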

1.4 Apply Logistic Regression and LDA (linear discriminant
analysis).

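from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Fit both linear classifiers with default settings on the training data.
Logit_model = LogisticRegression()
Logit_model.fit(X_train, y_train)
LDA_model = LinearDiscriminantAnalysis()
LDA_model.fit(X_train, y_train)

1.5 Apply KNN Model and Naïve Bayes Model. Interpret the
results.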

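from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Fit KNN and Gaussian Naive Bayes with default settings.
KNN_model = KNeighborsClassifier()
KNN_model.fit(X_train, y_train)
NB_model = GaussianNB()
NB_model.fit(X_train, y_train)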

1.6 Model Tuning, Bagging (Random Forest should be applied for
Bagging), and Boosting.

Model Tuning:

from sklearn import tree

# Decision tree with default settings, fitted on the training data.
DT_model = tree.DecisionTreeClassifier()
DT_model.fit(X_train, y_train)
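The report fits the decision tree with default settings and does not show a parameter search. A possible way to tune it is a cross-validated grid search, sketched below. The parameter grid, the synthetic data from `make_classification` and `cv=5` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (the election data is not reproduced here).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Hypothetical grid: limit tree depth and leaf size to reduce overfitting.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [5, 10, 20],
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

best_tree = grid.best_estimator_
print(grid.best_params_)
print(best_tree.score(X_train, y_train), best_tree.score(X_test, y_test))
```

Restricting depth and leaf size is one way to keep the train score from reaching 1.0 while the test score lags behind.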
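Bagging:

• Random Forest

from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

RF_model = RandomForestClassifier(n_estimators=100, random_state=1)
RF_model.fit(X_train, y_train)

• Bagging Classifier

# `cart` is not defined in the report; a decision tree base learner is assumed here.
cart = tree.DecisionTreeClassifier()
# Recent scikit-learn releases name this parameter `estimator` instead of `base_estimator`.
Bagging_model = BaggingClassifier(base_estimator=cart, n_estimators=100, random_state=1)
Bagging_model.fit(X_train, y_train)

Boosting:

• AdaBoost

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

ADB_model = AdaBoostClassifier(n_estimators=100, random_state=1)
ADB_model.fit(X_train, y_train)

• Gradient Boosting

gbcl = GradientBoostingClassifier(random_state=1)
gbcl = gbcl.fit(X_train, y_train)
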
1.7 Performance Metrics: Check the performance of Predictions on
Train and Test sets using Accuracy, Confusion Matrix, Plot ROC
curve and get ROC_AUC score for each model. Final Model:
Compare the models and write inference which model is
best/optimized.
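The metrics listed in this section can be obtained along the following lines. This is a sketch on synthetic data with a single model. The helper name `evaluate` is hypothetical, and the report's own evaluation code is not shown in the source.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def evaluate(fitted_model, X_part, y_part):
    """Return accuracy, confusion matrix, ROC AUC and ROC points for one data split."""
    pred = fitted_model.predict(X_part)
    prob = fitted_model.predict_proba(X_part)[:, 1]  # probability of class 1
    fpr, tpr, _ = roc_curve(y_part, prob)
    return {
        "accuracy": accuracy_score(y_part, pred),
        "confusion_matrix": confusion_matrix(y_part, pred),
        "roc_auc": roc_auc_score(y_part, prob),
        "roc_points": (fpr, tpr),
    }

# Evaluate the train and test splits separately.
train_metrics = evaluate(model, X_train, y_train)
test_metrics = evaluate(model, X_test, y_test)
print(train_metrics["accuracy"], test_metrics["accuracy"])
print(train_metrics["roc_auc"], test_metrics["roc_auc"])
```

The `roc_points` pairs can be passed to a plotting library to draw the ROC curve for each split.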

Logistic Regression:

• Accuracy score for train data is 0.83
• Accuracy score for test data is 0.83
• Model score for train data is 0.834
• Model score for test data is 0.831
• AUC and ROC for train data
AUC score is 0.89
• AUC and ROC for test data
AUC score is 0.89
• Confusion Matrix for train data

array([[197, 110],
       [ 66, 688]], dtype=int64)

• Confusion Matrix for test data

array([[112,  41],
       [ 36, 267]], dtype=int64)
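As a worked check, the test scores for Logistic Regression can be recomputed from the test confusion matrix reported above. The sketch assumes the scikit-learn layout (rows are actual classes, columns are predicted classes) and the encoding from 1.3 (0 = Conservative, 1 = Labour).

```python
# Test confusion matrix reported for Logistic Regression.
# Assumed layout: rows = actual class, columns = predicted class.
tn, fp = 112, 41    # actual Conservative (0)
fn, tp = 36, 267    # actual Labour (1)

total = tn + fp + fn + tp
accuracy = (tn + tp) / total            # share of correct predictions
precision = tp / (tp + fp)              # predicted Labour that is actually Labour
recall = tp / (tp + fn)                 # actual Labour that is predicted Labour
f1 = 2 * precision * recall / (precision + recall)

print(total, round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# prints: 456 0.831 0.867 0.881 0.874
```

The recomputed accuracy of 0.831 agrees with the model score reported for the test data.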

Linear Discriminant Analysis:

• Accuracy score for train data is 0.83
• Accuracy score for test data is 0.83
• Model score for train data is 0.834
• Model score for test data is 0.831
• AUC and ROC for train data
AUC score is 0.89
• AUC and ROC for test data
AUC score is 0.89
• Confusion Matrix for train data

array([[200, 107],
       [ 69, 685]], dtype=int64)

• Confusion Matrix for test data

array([[111,  42],
       [ 35, 268]], dtype=int64)

KNN Model:

• Accuracy score for train data is 0.85
• Accuracy score for test data is 0.82
• Model score for train data is 0.853
• Model score for test data is 0.815
• AUC and ROC for train data
AUC score is 0.923
• AUC and ROC for test data
AUC score is 0.923
• Confusion Matrix for train data

array([[204, 103],
       [ 52, 702]], dtype=int64)

• Confusion Matrix for test data

array([[ 99,  54],
       [ 30, 273]], dtype=int64)

GaussianNB Model:

• Accuracy score for train data is 0.83
• Accuracy score for test data is 0.82
• Model score for train data is 0.832
• Model score for test data is 0.822
• AUC and ROC for train data
AUC score is 0.889
• AUC and ROC for test data
AUC score is 0.889
• Confusion Matrix for train data

array([[212,  95],
       [ 81, 673]], dtype=int64)

• Confusion Matrix for test data

array([[112,  41],
       [ 40, 263]], dtype=int64)

Decision Tree Classifier Model:

• Accuracy score for train data is 1.0
• Accuracy score for test data is 0.79
• Model score for train data is 1.0
• Model score for test data is 0.785
• AUC and ROC for train data
AUC score is 0.889
• AUC and ROC for test data
AUC score is 0.889
• Confusion Matrix for train data

array([[307,   0],
       [  0, 754]], dtype=int64)

• Confusion Matrix for test data

array([[100,  53],
       [ 45, 258]], dtype=int64)

Random Forest Classifier Model:

• Accuracy score for train data is 1.0
• Accuracy score for test data is 0.83
• Model score for train data is 1.0
• Model score for test data is 0.831
• AUC and ROC for train data
AUC score is 0.889
• AUC and ROC for test data
AUC score is 0.889
• Confusion Matrix for train data

array([[307,   0],
       [  0, 754]], dtype=int64)

• Confusion Matrix for test data

array([[104,  49],
       [ 28, 275]], dtype=int64)

Bagging Classifier Model:

• Accuracy score for train data is 1.0
• Accuracy score for test data is 0.82
• Model score for train data is 1.0
• Model score for test data is 0.82
• AUC and ROC for train data
AUC score is 0.889
• AUC and ROC for test data
AUC score is 0.889
• Confusion Matrix for train data

array([[307,   0],
       [  0, 754]], dtype=int64)

• Confusion Matrix for test data

array([[108,  45],
       [ 37, 266]], dtype=int64)

Ada Boost Classifier Model:

• Accuracy score for train data is 0.85
• Accuracy score for test data is 0.81
• Model score for train data is 0.850
• Model score for test data is 0.813
• AUC and ROC for train data
AUC score is 0.889
• AUC and ROC for test data
AUC score is 0.889
• Confusion Matrix for train data

array([[214,  93],
       [ 66, 688]], dtype=int64)

• Confusion Matrix for test data

array([[103,  50],
       [ 35, 268]], dtype=int64)

Gradient Boosting Classifier Model:

• Accuracy score for train data is 0.89
• Accuracy score for test data is 0.84
• Model score for train data is 0.892
• Model score for test data is 0.835
• AUC and ROC for train data
AUC score is 0.889
• AUC and ROC for test data
AUC score is 0.889
• Confusion Matrix for train data

array([[239,  68],
       [ 46, 708]], dtype=int64)

• Confusion Matrix for test data

array([[105,  48],
       [ 27, 276]], dtype=int64)

ROC Curve Analysis


1.8 Based on these predictions, what are the insights?

• Logistic Regression, LDA, KNN, GaussianNB and AdaBoost score similarly on
train data and test data (train accuracy 0.83 to 0.85, test accuracy 0.81 to
0.83), so these models generalise well.
• The reported AUC scores are the same on train data and test data for each
model, and the ROC curves appear similar.
• For these models, the confusion matrices show that the predicted classes
are close to the actual classes on both train data and test data. This is
the reflection of a right-fit model.
• The Decision Tree, Random Forest and Bagging models reach a train accuracy
of 1.0 but only 0.79 to 0.83 on test data. This gap points to overfitting
with the default settings, so these models would need tuning.
• Boosting shows good performance: Gradient Boosting has the highest test
accuracy (0.835, against 0.892 on train data).
• Based on the overall performance, there is no underfitting in this case
study, and overfitting is limited to the untuned tree-based models.

Problem 2

In this particular project, we are going to work on the inaugural corpora from the
nltk in Python. We will be looking at the following speeches of the Presidents of
the United States of America:

1. President Franklin D. Roosevelt in 1941
2. President John F. Kennedy in 1961
3. President Richard Nixon in 1973

2.1 Find the number of characters, words, and sentences for the
mentioned documents.

Number of Characters:

Characters count for 1941-Roosevelt speech is = 7571
Characters count for 1961-Kennedy speech is = 7618
Characters count for 1973-Nixon speech is = 9991

Number of Words:

Words count for 1941-Roosevelt speech is = 1536
Words count for 1961-Kennedy speech is = 1546
Words count for 1973-Nixon speech is = 2028

Number of Sentences:

Sentences count for 1941-Roosevelt speech is = 68
Sentences count for 1961-Kennedy speech is = 52
Sentences count for 1973-Nixon speech is = 69
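Counts of this kind are typically obtained by taking the length of the raw text, the word list and the sentence list. The sketch below uses a short made-up sample so that it runs without the corpus. The helper name `speech_counts` is hypothetical, and the commented NLTK calls show where the real inputs would presumably come from.

```python
def speech_counts(raw_text, words, sentences):
    """Return the number of characters, words and sentences of one speech."""
    return {
        "characters": len(raw_text),
        "words": len(words),
        "sentences": len(sentences),
    }

# With NLTK and its 'inaugural' corpus downloaded, the inputs would come from:
#   from nltk.corpus import inaugural
#   raw = inaugural.raw('1941-Roosevelt.txt')
#   words = inaugural.words('1941-Roosevelt.txt')
#   sents = inaugural.sents('1941-Roosevelt.txt')

# Hypothetical sample, tokenised by hand (punctuation counts as a word token).
raw = "Let us begin. Ask what you can do."
words = ["Let", "us", "begin", ".", "Ask", "what", "you", "can", "do", "."]
sents = [words[:4], words[4:]]

counts = speech_counts(raw, words, sents)
print(counts)
```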
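
2.2 Remove all the stopwords from all three speeches.

import string
from nltk.corpus import stopwords

def remove_stopwords(array, stopw):
    # Lower-case each token and keep it only if it is not a stopword,
    # a punctuation mark, or the '--' dash token.
    filtered = []
    for a in array:
        al = a.lower()
        if al not in stopw and a != '--':
            filtered.append(al)
    return filtered

# R_words, K_words and N_words hold the word lists of the three speeches.
stopw = set(stopwords.words('english') + list(string.punctuation))
Rwords = remove_stopwords(R_words, stopw)
Kwords = remove_stopwords(K_words, stopw)
Nwords = remove_stopwords(N_words, stopw)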

2.3 Which word occurs the most number of times in his inaugural
address for each president? Mention the top three words. (after
removing the stopwords).
Top 3 words:

Top three words of Roosevelt: [('nation', 12), ('know', 10), ('spirit', 9)]
Top three words of Kennedy: [('let', 16), ('us', 12), ('world', 8)]
Top three words of Nixon: [('us', 26), ('let', 22), ('america', 21)]
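A frequency list like the one above can be produced with `collections.Counter`. The word list here is a short hypothetical example, not one of the speeches.

```python
from collections import Counter

# Hypothetical filtered word list (stopwords already removed).
filtered_words = ["let", "us", "world", "let", "us", "let", "new", "world", "let"]

# most_common(3) returns the three most frequent words with their counts.
top_three = Counter(filtered_words).most_common(3)
print(top_three)
# prints: [('let', 4), ('us', 2), ('world', 2)]
```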

2.4 Plot the word cloud of each of the speeches of the variable. (after
removing the stopwords).

Word Cloud for the 1941-Roosevelt speech

Word Cloud for the 1961-Kennedy speech

Word Cloud for the 1973-Nixon speech
