Predictive Modelling Project 2
Modeling
Project
Sneha Sharma
PGPDSBA Mar’21
Group 2
Table of Contents
1.1) Read the data and do exploratory data analysis. Describe the data
briefly. (Check the null values, data types, shape, EDA). Perform
Univariate and Bivariate Analysis. (8 marks)
1.2) Impute null values if present? Do you think scaling is necessary in
this case? (8 marks)
1.3) Encode the data (having string values) for Modelling. Data Split:
Split the data into test and train (70:30). Apply Linear regression.
Performance Metrics: Check the performance of Predictions on Train and
Test sets using R-square, RMSE. (8 marks)
1.4) Inference: Based on these predictions, what are the business
insights and recommendations. (6 marks)
2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and
do null value condition check, write an inference on it. Perform
Univariate and Bivariate Analysis. Do exploratory data analysis. (8
marks)
2.2) Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and
LDA (linear discriminant analysis). (8 marks)
2.3) Performance Metrics: Check the performance of Predictions on Train
and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score for each model. Compare both the models and write
inferences, which model is best/optimized. (8 marks)
2.4) Inference: Based on these predictions, what are the insights and
recommendations. (6 marks)
1.1 Read the data and do exploratory data analysis. Describe the data
briefly. (Check the null values, data types, shape, EDA). Perform Univariate
and Bivariate Analysis. (8 marks)
1.2 Impute null values if present? Do you think scaling is necessary in this
case? (8 marks)
1.3 Encode the data (having string values) for Modelling. Data Split: Split
the data into test and train (70:30). Apply Linear regression. Performance
Metrics: Check the performance of Predictions on Train and Test sets using
R-square, RMSE. (8 marks)
6. sp500: Membership of firms in the S&P 500 index. S&P is a stock market index
that measures the stock performance of 500 large companies listed on stock
exchanges in the United States
7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio between a
physical asset's market value and its replacement value.
The dataset has 759 rows and 10 features. sales, capital, randd, employment, tobinq, value and
institutions are of type float64; patents (the dependent variable) is of integer type; and sp500 is of
object type.
Let’s start the data exploration step by using the head function to look at the first 5 rows.
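A first look of this kind can be sketched as follows. The report does not give the file name, so the sketch builds a tiny hypothetical frame that reuses a few of the column names listed above; all values are invented for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the firm-level data set; every value is invented.
df = pd.DataFrame({
    "sales": [826.99, 407.75, 8407.84, 451.00, 174.93, 1250.10],
    "capital": [161.60, 122.10, 6221.14, 266.90, 140.12, 312.45],
    "patents": [10, 2, 138, 1, 2, 17],
    "sp500": ["no", "no", "yes", "no", "no", "yes"],
})

print(df.head())          # first 5 rows
print(df.shape)           # (number of rows, number of columns)
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # number of missing values per column
```

On the real data, the same four calls would give the row preview, the shape, the data types and the null counts described in this section.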
Count plot and box plot for the object variable, i.e. “sp500”. As the graph above shows, there are
outliers in sp500.
Because the attributes in the given data set are not well defined, we should scale the data in this
case. Accordingly, we scaled the data set after treating the outliers and converting the categorical
data into numeric form. Standard Scaler standardizes the data using the formula
(x - mean) / standard deviation.
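The formula can be illustrated with a few lines of plain Python. This is a minimal sketch on invented values, not the report's own code; it uses the population standard deviation (dividing by n), which is also what scikit-learn's StandardScaler uses.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (divides by n)
    return [(x - mean) / std for x in values]

sales = [100.0, 200.0, 300.0, 400.0, 500.0]  # invented values
scaled = standardize(sales)
print(scaled)
```

After scaling, the values have mean 0 and standard deviation 1, so features measured on very different scales become comparable.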
***************-----------------***************
Problem 2: Logistic Regression and Linear Discriminant
Analysis
You are hired by the Government to do an analysis of car crashes. You are provided
details of car crashes, among which some people survived and some didn't. You have
to help the government in predicting whether a person will survive or not on the
basis of the information given in the data set so as to provide insights that will help
the government to make stronger laws for car manufacturers to ensure safety
measures. Also, find out the important factors on the basis of which you made your
predictions.
Survived
survived 10037
Not_Survived 1180
Name: Survived, dtype: int64
airbag
airbag 7064
none 4153
Name: airbag, dtype: int64
seatbelt
belted 7849
none 3368
Name: seatbelt, dtype: int64
sex
m 6048
f 5169
Name: sex, dtype: int64
abcat
deploy 4365
unavail 4153
nodeploy 2699
Name: abcat, dtype: int64
occRole
driver 8786
pass 2431
Name: occRole, dtype: int64
caseid
73:100:2 7
75:84:2 6
49:106:1 6
78:2:1 6
73:110:1 6
..
78:151:1 1
47:39:1 1
72:184:1 1
78:85:2 1
81:105:1 1
Name: caseid, Length: 6488, dtype: int64
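Frequency tables like the ones above can be produced with value_counts. The sketch below assumes a hypothetical miniature of the crash data with invented rows; only the column names and level labels are taken from the output above.

```python
import pandas as pd

# Hypothetical miniature of the crash data; the rows are invented.
df = pd.DataFrame({
    "Survived": ["survived", "survived", "Not_Survived", "survived"],
    "airbag": ["airbag", "none", "none", "airbag"],
    "seatbelt": ["belted", "belted", "none", "belted"],
})

# Print the frequency of each level for every non-numeric column.
for col in df.select_dtypes(exclude="number").columns:
    print(df[col].value_counts())
    print()
```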
The value counts of the target column, 'Survived', show how the data is distributed among its
values.
Box-plot Visualization of Data set to check Outliers and Treating the same with IQR :
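One common way to treat outliers with the IQR is to cap values at the box-plot whiskers. The report does not say whether outliers were capped or removed, so the sketch below assumes capping at Q1 - 1.5*IQR and Q3 + 1.5*IQR; the function name and the sample values are hypothetical.

```python
from statistics import quantiles

def cap_outliers_iqr(values, k=1.5):
    """Clip each value to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(x, lower), upper) for x in values]

ages = [18, 22, 25, 27, 30, 33, 35, 40, 97]  # invented values; 97 is an outlier
capped = cap_outliers_iqr(ages)
print(capped)
```

Here Q1 is 25 and Q3 is 35, so the upper whisker is 50 and only the value 97 is pulled down to it.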
After outlier treatment:
dvcat
1 5414
1 3368
2 1344
809 -8
282
Name: dvcat, dtype: int64
Survived
1 10037
0 1180
Name: Survived, dtype: int64
Airbag
1 7064
0 4153
Name: airbag, dtype: int64
seatbelt
1 7849
0 3368
Name: seatbelt, dtype: int64
Sex
0 6048
1 5169
Name: sex, dtype: int64
abcat
1 4365
0 4153
2699
Name: abcat, dtype: int64
occRole
0 8786
1 2431
Name: occRole, dtype: int64
1 0.894831
0 0.105169
Name: Survived, dtype: float64
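The encoding step and the class proportions above can be sketched as follows. The rows are invented, and the mappings are inferred from the matching counts before and after encoding (for example, 6048 for "m" and for 0); they are an assumption, not code taken from the report.

```python
import pandas as pd

# Hypothetical miniature of the crash data; the rows are invented.
df = pd.DataFrame({
    "Survived": ["survived", "survived", "Not_Survived", "survived"],
    "sex": ["m", "f", "m", "f"],
})

# Mappings inferred from the counts shown before and after encoding.
df["Survived"] = df["Survived"].map({"survived": 1, "Not_Survived": 0})
df["sex"] = df["sex"].map({"m": 0, "f": 1})

# Share of each class in the target column.
proportions = df["Survived"].value_counts(normalize=True)
print(proportions)
```

On the full data set, the proportions show that about 89% of the records belong to the survived class, so the target is clearly imbalanced.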
Particulars Logistic Reg Test Logistic Train LDA Test LDA Train
Accuracy 96.96% 97.13%
AUC 98.6% 98.6% 97.1% 96.9%
Recall 99% 99% 98% 99%
Precision 98% 98% 97% 97%
F1 Score 98% 98% 98% 98%
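A comparison like the one in the table above could be produced along these lines. This is only a sketch: the data is synthetic (generated with make_classification), and the stratified split, random_state and max_iter values are assumptions rather than settings taken from the report.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the encoded crash data (about 90% positives).
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           weights=[0.1, 0.9], random_state=1)
# 70:30 split; stratification is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    splits = {"train": (X_train, y_train), "test": (X_test, y_test)}
    for split, (X_part, y_part) in splits.items():
        pred = model.predict(X_part)
        prob = model.predict_proba(X_part)[:, 1]  # probability of class 1
        results[(name, split)] = {
            "accuracy": accuracy_score(y_part, pred),
            "auc": roc_auc_score(y_part, prob),
            "confusion_matrix": confusion_matrix(y_part, pred),
        }
        print(name, split, results[(name, split)]["accuracy"],
              results[(name, split)]["auc"])
```

Comparing the train and test rows for each model is what shows whether performance holds up on unseen data.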
2.4) Inference: Based on these predictions, what are
the insights and recommendations. (6 marks)
The accuracy of the logistic regression model is almost the same on the training data and the
testing data, i.e. about 98%.
Similarly, the AUC of the logistic regression model is similar for the training and the testing data.
The other confusion-matrix metrics of the logistic regression model are also similar across the
two sets. Such close agreement between training and test performance indicates that the model
generalizes well and is not overfitted.
We also applied Grid Search CV to tune the hyperparameters of the model, after which the
F1 score was 98% on both the training and the test data.
In the case of LDA, the AUC is also nearly the same for the testing and the training data, about
97%, and the other confusion-matrix metrics are similar as well, so the LDA model does not
show signs of overfitting either.
For both models, the scores on the training and the test data are close to each other.
As the table above shows, logistic regression gives recall and precision equal to or slightly
higher than those of Linear Discriminant Analysis.
The LDA model can still be considered for further improvement, for example by balancing the
classes with SMOTE, which may enhance its predictive ability.
Overall, we can conclude that the logistic regression model is best suited for this data set, given
its level of accuracy compared with the Linear Discriminant Analysis model.
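The grid search mentioned above can be sketched as follows. The report does not list the parameters it searched, so the grid over C, the F1 scoring, the 5-fold cross-validation and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the encoded crash data.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           weights=[0.1, 0.9], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# Hypothetical grid; the report does not state which values were tried.
param_grid = {"C": [0.01, 0.1, 1, 10]}

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring="f1", cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)            # mean cross-validated F1 of the best setting
print(grid.score(X_test, y_test))  # F1 of the refitted model on the test set
```

Because the search is cross-validated on the training data only, the test-set F1 remains an independent check of the tuned model.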
************---------------------************