Predictive Modelling Project 2


Predictive Modeling Project

Sneha Sharma
PGPDSBA Mar'21
Group 2

Table of Contents
1.1) Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis. (8 marks)
1.2) Impute null values if present? Do you think scaling is necessary in this case? (8 marks)
1.3) Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (70:30). Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets using R-square, RMSE. (8 marks)
1.4) Inference: Based on these predictions, what are the business insights and recommendations. (6 marks)
2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis. (8 marks)
2.2) Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis). (8 marks)
2.3) Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Compare both the models and write inferences, which model is best/optimized. (8 marks)
2.4) Inference: Based on these predictions, what are the insights and recommendations. (6 marks)

Problem 1: Linear Regression

You are a part of an investment firm and your work is to do research about these 759 firms. You are provided with the dataset containing the sales and other attributes of these 759 firms. Predict the sales of these firms on the basis of the details given in the dataset so as to help your company invest consciously. Also, provide them with the 5 attributes that are most important.

Questions for Problem 1:

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis. (8 marks)
1.2 Impute null values if present? Do you think scaling is necessary in this case? (8 marks)
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (70:30). Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets using R-square, RMSE. (8 marks)
1.4 Inference: Based on these predictions, what are the business insights and recommendations. (6 marks)
Data Dictionary for Firm_level_data:

1. sales: Sales (in millions of dollars).

2. capital: Net stock of property, plant, and equipment.

3. patents: Granted patents.

4. randd: R&D stock (in millions of dollars).

5. employment: Employment (in 1000s).

6. sp500: Membership of firms in the S&P 500 index. S&P is a stock market index
that measures the stock performance of 500 large companies listed on stock
exchanges in the United States

7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio between a
physical asset's market value and its replacement value.

8. value: Stock market value.

9. institutions: Proportion of stock owned by institutions.

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis. (8 marks)
Load the required packages, set the working directory, and load the data file.

The dataset has 759 rows & 10 features. Sales (the dependent variable), capital, randd, employment, tobinq, value and institutions are float64 types, patents is of integer type, and sp500 is of object type.
Let's start the data exploration step with the head function to look at the first 5 rows.

The data has null values in the 'tobinq' feature.

In section 1.2 we will talk about the imputation of missing values.

The dataset has zero duplicate records.
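
A minimal sketch of this ingestion and inspection step is given below, assuming the file is named Firm_level_data.csv (the actual file name and path may differ):

import pandas as pd

# Load the firm-level dataset (file name assumed)
df = pd.read_csv("Firm_level_data.csv")

print(df.shape)               # expected: (759, 10)
print(df.dtypes)              # data type of each column
print(df.head())              # first 5 rows
print(df.isnull().sum())      # null count per column (tobinq shows nulls)
print(df.duplicated().sum())  # duplicate rows (expected: 0)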


Outliers are present in all variables, i.e. patents, sales, capital, randd, employment, tobinq & value. Institutions is the exception: it has no outliers and is somewhat normally distributed.

Count plot & box plot for the object variable "sp500". As can be seen in the graph above, outliers are present across the sp500 groups as well.

Correlation analysis using pair plot and heat map.

Correlation plot:

- Sales and employment are highly correlated (0.91).
- Sales is also strongly correlated with capital and randd (0.87 each).
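
A short sketch of how these plots can be generated with seaborn (assuming the DataFrame df from the ingestion step above):

import seaborn as sns
import matplotlib.pyplot as plt

# Pair plot of the numeric variables
sns.pairplot(df.select_dtypes(include="number"))
plt.show()

# Heat map of pairwise correlations between the numeric variables
plt.figure(figsize=(8, 6))
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm")
plt.show()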

1.2 Impute null values if present? Do you think scaling is necessary in this case? (8 marks)
We have imputed the null values.

Since the attributes in this dataset are on very different scales, we should scale the data in this case. Accordingly, the dataset was scaled after treating the outliers and converting the categorical variable into numeric columns. Standard Scaler standardizes the data using the formula (x - mean) / standard deviation.

Median of the data set:

- sales = 448.577082
- capital = 202.179023
- patents = 3.000000
- randd = 36.864136
- employment = 2.924000
- tobinq = 1.680303
- value = 410.793529
- institutions = 44.110000
- sp500_no = 1.000000
- sp500_yes = 0.000000
As can be seen from the info output in section 1.1, the dataset has 759 rows, while the variable tobinq has only 738 non-null entries, i.e. 21 null values. These nulls are replaced with the median value of tobinq.

Scaling: When the data is scaled, all variables come to a comparable level. Scaling is important when the features differ widely in volume and magnitude. Data normalization is the method used to standardize the range of features; since the range of values may vary widely, it becomes a necessary pre-processing step when using machine learning algorithms. Although the data here is in a reasonably good state, Standard Scaler (the Z-score) was used to scale it.
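
A minimal sketch of the imputation and scaling steps (assuming the DataFrame df from the ingestion step above; the sp500 dummy column names follow the median list shown earlier):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Impute missing tobinq values with the column median
df["tobinq"] = df["tobinq"].fillna(df["tobinq"].median())

# One-hot encode the sp500 categorical variable (gives sp500_no / sp500_yes)
df_encoded = pd.get_dummies(df, columns=["sp500"])

# Standard (z-score) scaling of all columns: (x - mean) / std
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_encoded),
                         columns=df_encoded.columns)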
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (70:30). Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets using R-square, RMSE. (8 marks)
Data post encoding:

Data types post conversion:

Linear regression applied on the data set:

The intercept for our model is 409.4806712368136.

R-square on training data: 0.9359702538559448
R-square on testing data: 0.9240311293641782
RMSE on training data: 394.33716619028996
RMSE on testing data: 400.002114937274
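
A sketch of the 70:30 split, the model fit and the evaluation with scikit-learn (assuming the encoded frame df_encoded built above, with sales as the target; random_state is an assumption):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

X = df_encoded.drop("sales", axis=1)
y = df_encoded["sales"]

# 70:30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

lr = LinearRegression()
lr.fit(X_train, y_train)

print("Intercept:", lr.intercept_)
print("R-square (train):", r2_score(y_train, lr.predict(X_train)))
print("R-square (test):", r2_score(y_test, lr.predict(X_test)))
print("RMSE (train):", np.sqrt(mean_squared_error(y_train, lr.predict(X_train))))
print("RMSE (test):", np.sqrt(mean_squared_error(y_test, lr.predict(X_test))))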

Stats Model - Apply Linear Regression:

RMSE: 394.33716619029013
Stats-based linear regression score on the test set: 0.9252349323952496
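
A sketch of the same regression fitted with statsmodels (this is an assumed reconstruction; the predictor names follow the encoded columns above, and only sp500_yes is included to avoid the redundant dummy):

import pandas as pd
import statsmodels.formula.api as smf

# Combine target and predictors into one frame for the formula API
# (X_train, y_train from the split above)
train = pd.concat([X_train, y_train], axis=1)

formula = ("sales ~ capital + patents + randd + employment + tobinq "
           "+ value + institutions + sp500_yes")
ols_model = smf.ols(formula=formula, data=train).fit()

print(ols_model.summary())   # coefficients, p-values, R-square, etc.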
Scatter plot pre- and post-scaling with Z-score:
Pre-scaling
Post-scaling

Coefficients of the scaled data set:

Variance inflation factor:

The intercept for our model on the scaled data is 1.670974735376873e-17.
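
A sketch of how the variance inflation factors can be computed with statsmodels (assuming X_train from the split above; casting to float guards against boolean dummy columns):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Add a constant so VIFs are computed against a model with an intercept
X_vif = add_constant(X_train).astype(float)

vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns)
print(vif)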

1.4 Inference: Based on these predictions, what are the business insights and recommendations. (6 marks)
Inference:
- The investment criteria of any new investor are largely based on the capital invested in the company by the promoters, and investors favour firms where the capital investment is good, as is also reflected in the scatter plot.
- To generate capital, the company should have a combination of the following attributes: value, employment, sales and patents.
- The highest contributing attribute is employment, followed by patents.

The final linear regression equation is:

Sales = b0 (intercept) + b1*patents + b2*employment + b3*randd + b4*tobinq + b5*value + b6*institutions + b7*capital

Sales = 54.6 (intercept) + (-4.15)*patents + (81.33)*employment + (0.65)*randd + (-41.56)*tobinq + (0.26)*value + (0.62)*institutions + (0.41)*capital

When employment increases by 1 unit, sales increases by 81.33 units, keeping all other predictors constant. Similarly, when capital increases by 1 unit, sales increases by 0.41 units, keeping all other predictors constant.

There are also some negative coefficients; for instance, patents has a coefficient of -4.15. This implies that when patents increase by 1 unit, sales decreases by 4.15 units, keeping all other predictors constant.

***************-----------------***************
Problem 2: Logistic Regression and Linear Discriminant
Analysis
You are hired by the Government to do an analysis of car crashes. You are provided
details of car crashes, among which some people survived and some didn't. You have
to help the government in predicting whether a person will survive or not on the
basis of the information given in the data set so as to provide insights that will help
the government to make stronger laws for car manufacturers to ensure safety
measures. Also, find out the important factors on the basis of which you made your
predictions.

Questions for Problem 2:


2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and
do null value condition check, write an inference on it. Perform
Univariate and Bivariate Analysis. Do exploratory data analysis. (8
marks)
2.2) Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and
LDA (linear discriminant analysis). (8 marks)
2.3) Performance Metrics: Check the performance of Predictions on
Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve
and get ROC_AUC score for each model. Compare both the models and
write inferences, which model is best/optimized. (8 marks)
2.4) Inference: Based on these predictions, what are the insights and recommendations. (6 marks)
Data Dictionary for Car_Crash
1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39,
40-54, 55+
2. weight: Observation weights, albeit of uncertain accuracy, designed to
account for varying sampling probabilities. (The inverse probability weighting
estimator can be used to demonstrate causality when the researcher cannot
conduct a controlled experiment but has observed data to model)
3. Survived: factor with levels Survived or not_survived
4. airbag: a factor with levels none or airbag
5. seatbelt: a factor with levels none or belted
6. frontal: a numeric vector; 0 = non-frontal, 1=frontal impact
7. sex: a factor with levels f: Female or m: Male
8. ageOFocc: age of occupant in years
9. yearacc: year of accident
10. yearVeh: Year of model of vehicle; a numeric vector
11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor
has levels deploy, nodeploy and unavail
12. occRole: a factor with levels driver or pass: passenger
13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy;
1 if one or more bags deployed.
14. injSeverity: a numeric vector; 0: none, 1: possible injury, 2: no incapacity,
3: incapacity, 4: killed; 5: unknown, 6: prior death
15. caseid: character, created by pasting together the populations sampling
unit, the case number, and the vehicle number. Within each year, use this to
uniquely identify the vehicle.
2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis. (8 marks)

The dataset has 11217 rows & 16 features.
- weight, yearVeh and injSeverity are float64 types.
- frontal, ageOFocc, yearacc and deploy are int64 types.
- 'dvcat', 'Survived' (the dependent variable), 'airbag', 'seatbelt', 'sex', 'abcat', 'occRole' and 'caseid' are object types.
Head of data:

Information about the data:

Duplicate data: zero duplicate records found. The unnecessary variable 'Unnamed: 0' is dropped from the data set. caseid is treated as the index, which helps in identifying information about an accident using its id.

Checking the median values of the data frame:

The median value is used to replace the null values in injSeverity.
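
A minimal sketch of this ingestion and cleaning step, assuming the file is named Car_Crash.csv (the actual file name may differ):

import pandas as pd

df = pd.read_csv("Car_Crash.csv")

# Drop the unnecessary index column and use caseid as the index
df = df.drop("Unnamed: 0", axis=1)
df = df.set_index("caseid")

print(df.shape)            # 11217 rows
print(df.isnull().sum())   # injSeverity shows missing values
print(df.duplicated().sum())

# Replace missing injSeverity values with the column median
df["injSeverity"] = df["injSeverity"].fillna(df["injSeverity"].median())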

Getting unique counts of all object variables:


dvcat
10-24 5414
25-39 3368
40-54 1344
55+ 809
1-9km/h 282
Name: dvcat, dtype: int64

Survived
survived 10037
Not_Survived 1180
Name: Survived, dtype: int64

airbag
airbag 7064
none 4153
Name: airbag, dtype: int64

seatbelt
belted 7849
none 3368
Name: seatbelt, dtype: int64

sex
m 6048
f 5169
Name: sex, dtype: int64
abcat
deploy 4365
unavail 4153
nodeploy 2699
Name: abcat, dtype: int64

occRole
driver 8786
pass 2431
Name: occRole, dtype: int64

caseid
73:100:2 7
75:84:2 6
49:106:1 6
78:2:1 6
73:110:1 6
..
78:151:1 1
47:39:1 1
72:184:1 1
78:85:2 1
81:105:1 1
Name: caseid, Length: 6488, dtype: int64

Exploratory Data Analysis:

The target column 'Survived' is examined to understand how the data is distributed amongst its values.

Box-plot visualization of the data set to check for outliers, which are then treated with the IQR method:

After outlier treatment:
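
The IQR-based treatment referred to above could look like the following sketch (an assumed implementation, applied to the numeric columns of df):

# Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric column
def treat_outliers_iqr(frame, columns):
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        frame[col] = frame[col].clip(lower=lower, upper=upper)
    return frame

numeric_cols = df.select_dtypes(include="number").columns
df = treat_outliers_iqr(df, numeric_cols)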

Univariate and Bivariate Analysis:

Count plots for the object variables:

Checking for correlations: there is hardly any correlation between the numerical variables.

2.2) Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis). (8 marks)
Value counts of the object variables after encoding (each category is replaced by an integer code; the counts are unchanged from section 2.1):

- dvcat: 5414, 3368, 1344, 809, 282
- Survived: 1 (survived) 10037, 0 (not survived) 1180
- airbag: 7064, 4153
- seatbelt: 7849, 3368
- sex: 6048, 5169
- abcat: 4365, 4153, 2699
- occRole: 8786, 2431
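
One way the object variables could have been converted to integer codes is sketched below (pandas categorical codes; the exact mapping used in the original notebook is not shown, so this is an assumption):

import pandas as pd

# Convert every remaining object column to integer category codes
for col in df.select_dtypes(include="object").columns:
    df[col] = pd.Categorical(df[col]).codes

df.info()                              # all columns are now numeric
print(df["Survived"].value_counts())   # 1 = survived, 0 = not survived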

Data info post encoding:

Train & test split of the data set. Distribution of the target variable 'Survived' in the train and test sets:

1    0.89479
0    0.10521
Name: Survived, dtype: float64

1    0.894831
0    0.105169
Name: Survived, dtype: float64

Getting the predicted classes and probabilities:
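
A sketch of the 70:30 split and the two model fits (assuming the encoded frame df with 'Survived' as the target; random_state and stratification are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = df.drop("Survived", axis=1)
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# Logistic Regression
log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)

# Linear Discriminant Analysis
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Predicted classes and class probabilities on the training data
train_pred = log_reg.predict(X_train)
train_prob = log_reg.predict_proba(X_train)[:, 1]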

2.3) Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Compare both the models and write inferences, which model is best/optimized. (8 marks)
Logistic Regression

Training data accuracy: 0.9811488982295249
AUC & ROC:
AUC (train): 0.992
Test data accuracy: 0.9830659536541889
AUC (test): 0.992

Confusion matrix (training):

array([[ 730,   96],
       [  52, 6973]], dtype=int64)

Visualization of the confusion matrix (training):

Classification report (training):

              precision    recall  f1-score   support
           0       0.93      0.88      0.91       826
           1       0.99      0.99      0.99      7025
    accuracy                           0.98      7851
   macro avg       0.96      0.94      0.95      7851
weighted avg       0.98      0.98      0.98      7851

Confusion matrix (test):

array([[ 319,   35],
       [  22, 2990]], dtype=int64)

Classification report (test):

              precision    recall  f1-score   support
           0       0.94      0.90      0.92       354
           1       0.99      0.99      0.99      3012
    accuracy                           0.98      3366
   macro avg       0.96      0.95      0.95      3366
weighted avg       0.98      0.98      0.98      3366
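
A sketch of how these metrics and the ROC curves can be produced (shown for the logistic regression model; the same calls apply to the LDA model):

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score, roc_curve)

for name, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = log_reg.predict(X_)
    prob = log_reg.predict_proba(X_)[:, 1]

    print(name, "accuracy:", accuracy_score(y_, pred))
    print(confusion_matrix(y_, pred))
    print(classification_report(y_, pred))
    print(name, "AUC:", roc_auc_score(y_, prob))

    fpr, tpr, _ = roc_curve(y_, prob)
    plt.plot(fpr, tpr, label=f"{name} ROC")

plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()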

LDA (Linear Discriminant Analysis)

Train and test confusion matrices and classification reports for LDA:
Classification report of the training data:

              precision    recall  f1-score   support
           0       0.92      0.68      0.78       826
           1       0.96      0.99      0.98      7025
    accuracy                           0.96      7851
   macro avg       0.94      0.84      0.88      7851
weighted avg       0.96      0.96      0.96      7851

Classification report of the test data:

              precision    recall  f1-score   support
           0       0.92      0.68      0.78       354
           1       0.96      0.99      0.98      3012
    accuracy                           0.96      3366
   macro avg       0.94      0.84      0.88      3366
weighted avg       0.96      0.96      0.96      3366

Training data and test data confusion matrix comparison:

AUC for the Training Data: 0.967


AUC for the Test Data: 0.965

Particulars    Logistic Reg Test    Logistic Reg Train    LDA Test    LDA Train
Accuracy       96.96%               97.13%                -           -
AUC            98.6%                98.6%                 97.1%       96.9%
Recall         99%                  99%                   98%         99%
Precision      98%                  98%                   97%         97%
F1 Score       98%                  98%                   98%         98%
2.4) Inference: Based on these predictions, what are the insights and recommendations. (6 marks)
- The accuracy of the logistic regression model on the training data and the testing data is almost the same, i.e. about 98%.
- Similarly, the AUC of the logistic regression model is nearly identical for the training and testing data.
- The other confusion-matrix-based metrics of the logistic regression model are also similar across train and test, so the model does not appear to be overfitted.
- We additionally applied Grid Search CV to hyper-tune the model, after which the F1 score on both the training and test data remained 98%.
- In the case of LDA, the AUC for the training and testing data is also nearly the same at about 97%; the other confusion-matrix metrics of the LDA model are likewise similar across train and test, so this model does not appear to be overfitted either.

The scores on the train and test data are close to each other for both models.

The LDA model could be considered further after upgrading it with SMOTE to handle the class imbalance, whereby its predictive ability for the minority (not survived) class may be enhanced.

Overall, we can conclude that the logistic regression model is best suited for this data set, given its higher accuracy and AUC compared with the Linear Discriminant Analysis model.
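
A sketch of the Grid Search CV hyper-tuning mentioned above (the parameter grid is an assumption, not the grid used in the original notebook):

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l2"],
    "solver": ["lbfgs", "liblinear"],
}

# Tune logistic regression with 5-fold cross-validation, optimizing F1
grid = GridSearchCV(LogisticRegression(max_iter=10000),
                    param_grid, scoring="f1", cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("F1 score on the test set:", grid.score(X_test, y_test))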

************---------------------************
