Machine Learning Report
BUSINESS ANALYSIS
REPORT
MACHINE LEARNING
SANDYA VB
CONTENTS
PROBLEM 1:
Data Ingestion: 11 marks
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check.
Write an inference on it. (4 Marks)
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for
Outliers. (7 Marks)
Modelling: 22 marks
1.4 Apply Logistic Regression and LDA (linear discriminant analysis). (4 marks)
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results. (4 marks)
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and
Boosting. (7 marks)
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each
model. Final Model: Compare the models and write inference which model is
best/optimized. (7 marks)
Inference: 5 marks
1.8 Based on these predictions, what are the insights? (5 marks)
PROBLEM 2:
2.1 Find the number of characters, words, and sentences for the mentioned documents. – 3
Marks
2.2 Remove all the stopwords from all three speeches. – 3 Marks
2.3 Which word occurs the most number of times in his inaugural address for each
president? Mention the top three words. (after removing the stopwords) – 3 Marks
2.4 Plot the word cloud of each of the speeches of the variable. (after removing the
stopwords) – 3 Marks
Problem 1
You are hired by one of the leading news channels CNBE who wants to analyze
recent elections. This survey was conducted on 1525 voters with 9 variables. You
have to build a model, to predict which party a voter will vote for on the basis of
the given information, to create an exit poll that will help in predicting overall
win and seats covered by a particular party.
Data Dictionary:
2. age: in years
7. Europe: an 11-point scale that measures respondents' attitudes toward European integration.
High scores represent ‘Eurosceptic’ sentiment.
Reading the csv file and checking the head and tail of the dataset.
Duplicate value
Univariate Analysis:
Bivariate Analysis:
Data Distribution:
Correlation graph:
Outliers Check:
• The variables Vote and Gender are of object datatype. They must be converted
to integer datatype before modelling.
• For encoding, we use the replace function to convert the categorical string
values to numeric codes.
• Vote has 2 categories: Conservative and Labour. We use the replace function
to encode Conservative as 0 and Labour as 1.
• Gender also has 2 categories: Male and Female. We use the replace function
to encode Male as 0 and Female as 1.
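The encoding described above can be sketched as follows. The column names `vote` and `gender`, the exact spelling of the labels, and the three-row sample are assumptions, since the report does not show the data. The sketch uses `Series.map` instead of the replace function, so that any unexpected label becomes NaN and fails at the integer conversion rather than passing through unnoticed.

```python
import pandas as pd

# Hypothetical three-row sample; the column names and label spellings are assumed.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "gender": ["Female", "Male", "Male"],
})

# Codes stated in the report: Conservative -> 0, Labour -> 1, Male -> 0, Female -> 1.
vote_codes = {"Conservative": 0, "Labour": 1}
gender_codes = {"Male": 0, "Female": 1}

# map() swaps each label for its code; astype(int) raises if any label was unmapped.
df["vote"] = df["vote"].map(vote_codes).astype(int)
df["gender"] = df["gender"].map(gender_codes).astype(int)

print(df)
```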
Data split :
Scaling:
• Scaling is not required.
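from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Fit logistic regression with default settings
Logit_model = LogisticRegression()
Logit_model.fit(X_train, y_train)
# Fit linear discriminant analysis with default settings
LDA_model = LinearDiscriminantAnalysis()
LDA_model.fit(X_train, y_train)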
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the
results.
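from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
# KNN with the default of 5 neighbours
KNN_model = KNeighborsClassifier()
KNN_model.fit(X_train, y_train)
# Gaussian Naive Bayes
NB_model = GaussianNB()
NB_model.fit(X_train, y_train)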
Model Tuning:
from sklearn import tree
# Decision tree with default settings (no depth limit)
DT_model = tree.DecisionTreeClassifier()
DT_model.fit(X_train, y_train)
Bagging:
• Random Forest
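from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
RF_model = RandomForestClassifier(n_estimators=100, random_state=1)
RF_model.fit(X_train, y_train)
# cart must already hold the base decision tree; scikit-learn 1.2+ renames base_estimator to estimator
Bagging_model = BaggingClassifier(base_estimator=cart, n_estimators=100, random_state=1)
Bagging_model.fit(X_train, y_train)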
• AdaBoost (a boosting method):
from sklearn.ensemble import AdaBoostClassifier
ADB_model = AdaBoostClassifier(n_estimators=100, random_state=1)
ADB_model.fit(X_train, y_train)
Boosting:
from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(random_state=1)
gbcl.fit(X_train, y_train)
1.7 Performance Metrics: Check the performance of Predictions on
Train and Test sets using Accuracy, Confusion Matrix, Plot ROC
curve and get ROC_AUC score for each model. Final Model:
Compare the models and write inference which model is
best/optimized.
Logistic Regression:
• Accuracy score for train data is 0.83
• Accuracy score for test data is 0.83
• Confusion Matrix for train data
array([[197, 110],
       [ 66, 688]], dtype=int64)
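The accuracy score follows directly from a confusion matrix: the diagonal holds the correctly classified observations, so for the matrix above it is (197 + 688) / 1061 ≈ 0.83. A minimal sketch of that arithmetic, assuming scikit-learn's layout (rows are actual classes, columns are predicted classes); the helper name `accuracy_from_confusion` is hypothetical:

```python
def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified observations / all observations."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Logistic regression confusion matrix as reported above
# (assumed layout: rows = actual class, columns = predicted class).
logit_matrix = [[197, 110], [66, 688]]
logit_accuracy = accuracy_from_confusion(logit_matrix)

print(round(logit_accuracy, 2))  # prints 0.83
```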
KNN Model:
GaussianNB Model:
array([[307, 0],
[ 0, 754]], dtype=int64)
array([[100, 53],
[ 45, 258]], dtype=int64)
array([[307, 0],
[ 0, 754]], dtype=int64)
• Confusion Matrix for test data
array([[104, 49],
[ 28, 275]], dtype=int64)
array([[307, 0],
[ 0, 754]], dtype=int64)
array([[108, 45],
[ 37, 266]], dtype=int64)
Ada Boost Classifier Model:
array([[214, 93],
[ 66, 688]], dtype=int64)
array([[103, 50],
[ 35, 268]], dtype=int64)
array([[239, 68],
[ 46, 708]], dtype=int64)
• Confusion Matrix for test data
array([[105, 48],
[ 27, 276]], dtype=int64)
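The figures reported in this section (accuracy, confusion matrix, ROC_AUC on train and test sets) could be produced along the following lines. This is only a sketch: the data is synthetic, the 70:30 split and the helper name `evaluate` are assumptions, and logistic regression stands in for whichever fitted model is being checked.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, X, y):
    """Return accuracy, confusion matrix and ROC_AUC of a fitted classifier."""
    predicted = model.predict(X)
    # ROC_AUC needs scores, so use the predicted probability of class 1.
    probability = model.predict_proba(X)[:, 1]
    return {
        "accuracy": accuracy_score(y, predicted),
        "confusion_matrix": confusion_matrix(y, predicted),
        "roc_auc": roc_auc_score(y, probability),
    }


# Synthetic stand-in for the survey data (not the report's dataset).
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_metrics = evaluate(model, X_train, y_train)
test_metrics = evaluate(model, X_test, y_test)

for name, metrics in (("train", train_metrics), ("test", test_metrics)):
    print(name, round(metrics["accuracy"], 2), round(metrics["roc_auc"], 2))
```

Comparing the train and test rows for each model is what shows whether a model generalises: a large drop from train to test points to overfitting.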
Accuracy is broadly similar across the models on train data and test data.
AUC scores and ROC curves appear similar on train data and test data.
Model scores on train data and test data are close to each other across
the models.
The confusion matrices show that the predicted labels are close to the
actual labels, which indicates a well-fitted model.
F1 scores of the models on train data and test data are almost the same.
Model tuning gives better results, and bagging performs well on both train
data and test data.
The boosting technique also shows good performance.
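Based on the overall performance, most of the models show neither overfitting
nor underfitting in this case study. The exceptions are the models whose train
confusion matrix has no misclassifications (307 and 754 on the diagonal): their
test accuracy is clearly lower, which is a sign of overfitting.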
Problem 2
In this particular project, we are going to work on the inaugural corpora from the
nltk in Python. We will be looking at the following speeches of the Presidents of
the United States of America:
2.1 Find the number of characters, words, and sentences for the
mentioned documents.
Number of Character:
Number of Words:
Number of Sentences:
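A minimal sketch of how the three counts can be obtained. With NLTK and its inaugural corpus data installed, the inputs would come from `inaugural.raw(fileid)`, `inaugural.words(fileid)` and `inaugural.sents(fileid)`; the short text below is a hypothetical stand-in so the sketch runs without the corpus, and the helper name `text_counts` is also hypothetical.

```python
def text_counts(raw, words, sents):
    """Return the character, word and sentence counts for one document."""
    return {
        "characters": len(raw),   # length of the raw text string
        "words": len(words),      # number of tokens (punctuation included)
        "sentences": len(sents),  # number of tokenised sentences
    }

# Hypothetical stand-in for one speech; not taken from the corpus.
raw = "Let us begin. Let us go forward together."
sents = [
    ["Let", "us", "begin", "."],
    ["Let", "us", "go", "forward", "together", "."],
]
words = [token for sentence in sents for token in sentence]

counts = text_counts(raw, words, sents)
print(counts)
```

Note that a word count taken from a tokenised list includes punctuation tokens, so it is higher than a count of words alone.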
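import string
from nltk.corpus import stopwords

def remove_stopwords(array, stopw):
    # Lowercase each token and keep it unless it is a stopword, punctuation, or a '--' dash
    filtered = []
    for a in array:
        al = a.lower()
        if al not in stopw and a != '--':
            filtered.append(al)
    return filtered

# English stopwords plus single punctuation characters
stopw = set(stopwords.words('english') + list(string.punctuation))
Rwords = remove_stopwords(R_words, stopw)
Kwords = remove_stopwords(K_words, stopw)
Nwords = remove_stopwords(N_words, stopw)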
2.3 Which word occurs the most number of times in his inaugural
address for each president? Mention the top three words. (after
removing the stopwords).
Top 3 words:
Top three words of Roosevelt: [('nation', 12), ('know', 10), ('spirit', 9)]
Top three words of Kennedy: [('let', 16), ('us', 12), ('world', 8)]
Top three words of Nixon: [('us', 26), ('let', 22), ('america', 21)]
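Lists like the ones above can be obtained by counting the filtered tokens. A small sketch using `collections.Counter`, assuming the tokens have already been lowercased and had stopwords removed; the sample tokens and the helper name `top_words` are hypothetical, and the report may have used a different counting tool.

```python
from collections import Counter

def top_words(tokens, n=3):
    """Return the n most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(n)

# Hypothetical filtered tokens; not taken from any of the speeches.
sample = ["let", "us", "world", "let", "us", "peace", "let", "world", "us", "let"]
top_three = top_words(sample)

print(top_three)
```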
2.4 Plot the word cloud of each of the speeches of the variable. (after
removing the stopwords)