Sukanya Linear LogisticRegression Report


Problem 1: 

Linear Regression
You are a part of an investment firm and your work is to do research about these
759 firms. You are provided with a dataset containing the sales and other
attributes of these 759 firms. Predict the sales of these firms on the basis of the
details given in the dataset, so as to help your company invest wisely.
Also, provide them with the 5 attributes that are most important.
Solution:
The following are the observations after exploring the data initially:

 Dataset consists of 759 rows and 10 columns


 9 columns are numeric in nature and 1 column ‘sp500’ is a categorical variable and of
object data type
 Missing values observed in the ‘tobinq’ variable
 No bad data is present in the dataset
 No duplicates are observed in the dataset
 Since there are no duplicate values, we can drop the sequence column ‘Unnamed: 0’
from the dataset.
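The initial checks above can be sketched as follows. The report does not show its code, so the miniature stand-in frame below is an illustrative assumption; with the real CSV, `pd.read_csv` would replace the hand-built frame.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the 759-row firm dataset, so the sketch runs
# end to end without the original file.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "sales": [826.99, 407.75, 8407.85],
    "sp500": ["no", "no", "yes"],
    "tobinq": [11.05, np.nan, 5.21],
})

print(df.shape)                # (rows, columns)
print(df.duplicated().sum())   # 0 -> no duplicate rows
print(df.isnull().sum())       # missing values per column (tobinq has some)

# With no duplicates, the running-index column carries no information.
df = df.drop(columns=["Unnamed: 0"])
```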

EDA

 Description of the dataset is as follows

              count   mean         std          min       25%         50%         75%          max
sales         759.0   2689.705158  8722.060124  0.138000  122.920000  448.577082  1822.547366  135696.788200
capital       759.0   1977.747498  6466.704896  0.057000  52.650501   202.179023  1075.790020  93625.200560
patents       759.0   25.831357    97.259577    0.000000  1.000000    3.000000    11.500000    1220.000000
randd         759.0   439.938074   2007.397588  0.000000  4.628262    36.864136   143.253403   30425.255860
employment    759.0   14.164519    43.321443    0.006000  0.927500    2.924000    10.050001    710.799925
tobinq        738.0   2.794910     3.366591     0.119001  1.018783    1.680303    3.139309     20.000000
value         759.0   2732.734750  7071.072362  1.971053  103.593946  410.793529  2054.160386  95191.591160
institutions  759.0   43.020540    21.685586    0.000000  25.395000   44.110000   60.510000    90.150000
 Shape of the dataset
(759, 10)
 Columns in the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 759 entries, 0 to 758
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 759 non-null int64
1 sales 759 non-null float64
2 capital 759 non-null float64
3 patents 759 non-null int64
4 randd 759 non-null float64
5 employment 759 non-null float64
6 sp500 759 non-null object
7 tobinq 738 non-null float64
8 value 759 non-null float64
9 institutions 759 non-null float64
dtypes: float64(7), int64(2), object(1)

 The top 5 records in the dataset are as below

   sales        capital      patents  randd        employment  sp500  tobinq     value         institutions
0  826.995050   161.603986   10       382.078247   2.306000    no     11.049511  1625.453755   80.27
1  407.753973   122.101012   2        0.000000     1.860000    no     0.844187   243.117082    59.02
2  8407.845588  6221.144614  138      3296.700439  49.659005   yes    5.205257   25865.233800  47.70
3  451.000010   266.899987   1        83.540161    3.071000    no     0.305221   63.024630     26.88
4  174.927981   140.124004   2        14.233637    1.947000    no     1.063300   67.406408     49.46

 The last 5 records in the dataset are as below

     sales        capital     patents  randd       employment  sp500  tobinq    value       institutions
754  1253.900196  708.299935  32       412.936157  22.100002   yes    0.697454  267.119487  33.50
755  171.821025   73.666008   1        0.037735    1.684000    no     NaN       228.475701  46.41
756  202.726967   123.926991  13       74.861099   1.460000    no     5.229723  580.430741  42.25
757  785.687944   138.780992  6        0.621750    2.900000    yes    1.625398  309.938651  61.39
758  22.701999    14.244999   5        18.574360   0.197000    no     2.213070  18.940140   7.50
Univariate and Bivariate Analysis

We can observe many outliers across most of the variables.
From the histogram we can observe that tobinq (Tobin's q) is right skewed.
Institutions is roughly bell-shaped, close to a normal distribution.
Correlation Plot

We can observe from the correlation plot:

 Employment and Sales are the highly correlated variables


 Patents and Randd (R&D stock) are also highly correlated
 Capital and Sales are correlated
Pairplot

From the pairplot also we can observe that there is at least a mild correlation between the
variables.
Count Plot

1.2) Impute null values if present? Do you think scaling is necessary in this case?

Yes, null values are present in the dataset for the ‘tobinq’ variable.

Since only 21 of the 759 records have a null tobinq value, we drop these records rather
than impute them. Scaling is not strictly necessary for ordinary least-squares
predictions, though it helps when comparing coefficient magnitudes.

Post dropping the records, we can observe that no null values remain
sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 0
value 0
institutions 0
dtype: int64

The shape of the dataset is now


(738, 9)
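The null-handling step above can be sketched like this; the four-row frame is a synthetic stand-in for the real 759-row dataset (in which 21 rows have a missing tobinq).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in frame: one of the four rows has a missing tobinq.
df = pd.DataFrame({
    "sales": [826.99, 407.75, 171.82, 451.00],
    "tobinq": [11.05, 0.84, np.nan, 0.31],
})

before = df.shape[0]
df = df.dropna(subset=["tobinq"])     # drop rows where tobinq is missing
print(before - df.shape[0], "rows dropped")
print(df.isnull().sum().sum())        # 0 -> no missing values remain
```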
1.3) Encode the data (having string values) for Modelling. Data Split: Split the data into
test and train (30:70). Apply Linear regression. Performance Metrics: Check the
performance of Predictions on Train and Test sets using R-square, RMSE.

From the data we can see that ‘sp500’ is the only categorical variable

   sales        capital      patents  randd        employment  sp500  tobinq     value         institutions
0  826.995050   161.603986   10       382.078247   2.306000    no     11.049511  1625.453755   80.27
1  407.753973   122.101012   2        0.000000     1.860000    no     0.844187   243.117082    59.02
2  8407.845588  6221.144614  138      3296.700439  49.659005   yes    5.205257   25865.233800  47.70
3  451.000010   266.899987   1        83.540161    3.071000    no     0.305221   63.024630     26.88
4  174.927981   140.124004   2        14.233637    1.947000    no     1.063300   67.406408     49.46

Using get_dummies we can encode the data having string values

   sales        capital      patents  randd        employment  tobinq     value         institutions  sp500_no  sp500_yes
0  826.995050   161.603986   10       382.078247   2.306000    11.049511  1625.453755   80.27         1         0
1  407.753973   122.101012   2        0.000000     1.860000    0.844187   243.117082    59.02         1         0
2  8407.845588  6221.144614  138      3296.700439  49.659005   5.205257   25865.233800  47.70         0         1
3  451.000010   266.899987   1        83.540161    3.071000    0.305221   63.024630     26.88         1         0
4  174.927981   140.124004   2        14.233637    1.947000    1.063300   67.406408     49.46         1         0

Now the shape of the dataset is


(738, 10)
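The encoding step can be sketched as below; the two-row frame stands in for the firm dataset, and only sp500 needs encoding since it is the lone object-dtype column.

```python
import pandas as pd

# Minimal stand-in frame with the one categorical column.
df = pd.DataFrame({
    "sales": [826.99, 8407.85],
    "sp500": ["no", "yes"],
})

# One indicator column per category, as in the report's sp500_no / sp500_yes.
encoded = pd.get_dummies(df, columns=["sp500"])
print(list(encoded.columns))   # ['sales', 'sp500_no', 'sp500_yes']
```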

We now split the data into Train and Test sets in a 70:30 ratio using the
‘train_test_split’ function. We drop the target variable ‘sales’ from the ‘X’ dataset and
assign it to ‘y’.
Comparison of Performance of Predictions on Train and Test sets using R-square, RMSE

Performance on Train dataset

R-square (model score):
0.942609247903072

RMSE (Root Mean Squared Error):
2359.982798372816

Performance on Test dataset

R-square (model score):
0.7833858809094182

RMSE (Root Mean Squared Error):
2686.692289357481
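A minimal sketch of the split-fit-score pipeline behind these numbers. It uses synthetic data in place of the firm dataset, so the printed scores will differ from those above; note RMSE is the square *root* of the mean squared error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded firm data (the real X has 9 predictors).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -1.5, 2.0]) + rng.normal(scale=0.5, size=200)

# 70:30 train:test split, as in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

model = LinearRegression().fit(X_train, y_train)

for name, Xs, ys in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(Xs)
    rmse = np.sqrt(mean_squared_error(ys, pred))   # root, not square
    print(name, "R2:", round(r2_score(ys, pred), 3),
          "RMSE:", round(rmse, 3))
```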

1.4 Inference: Based on these predictions, what are the business insights and
recommendations.

The business insights are:

 Sales is the target variable, with the other variables acting as predictors.
 The dataset consists of 759 rows and 10 columns; 9 columns are numeric and sp500 is
the only categorical variable.
 The distribution of the data is essentially unchanged after dropping the null records,
so the decision not to scale the data did not affect the model.
 Membership of a firm in the S&P 500 index (sp500) is observed to be one of the crucial
attributes that determine the firm's sales.
 We can consider the following 5 attributes to be the determining factors of sales:
R&D stock (randd), capital, institutions, Tobin's Q and sp500.
 We observe high correlation between patents and randd, hence we can take either one
of the attributes into consideration.
 A larger stock of granted patents is associated with higher sales, as observed from
the linear regression on the training and test datasets.
 The RMSE values on the training and test sets differ only slightly.
 Sales can be increased by focusing more on Tobin's Q and sp500. Since stock
exchanges in the United States have a huge impact, we need to speculate carefully.
 Institutional investment can also be improved, as it has a significant impact on
sales.
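One way the "top 5 attributes" could be extracted is by ranking the absolute values of the regression coefficients. This sketch uses synthetic data with the report's feature names, so the ranking it prints is illustrative only, not the report's actual result.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with the report's predictor names; the true coefficients
# below are invented so the sketch has a known answer.
rng = np.random.default_rng(0)
cols = ["capital", "patents", "randd", "employment", "tobinq",
        "value", "institutions", "sp500_yes"]
X = pd.DataFrame(rng.normal(size=(300, len(cols))), columns=cols)
y = 4 * X["randd"] + 2 * X["capital"] + X["tobinq"] + rng.normal(size=300)

model = LinearRegression().fit(X, y)

# Rank predictors by the magnitude of their fitted coefficients.
importance = pd.Series(np.abs(model.coef_), index=cols)
importance = importance.sort_values(ascending=False)
print(importance.head(5))   # the five largest |coefficients|
```

Because the synthetic features share a common scale, the coefficient magnitudes are directly comparable; with the real data, standardizing the predictors first would make the ranking fair.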
2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do
exploratory data analysis.

Solution:
The following are the observations after exploring the data initially:

 Dataset consists of 11217 rows and 16 columns

 Info of the dataset


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11217 entries, 0 to 11216
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 11217 non-null int64
1 dvcat 11217 non-null object
2 weight 11217 non-null float64
3 Survived 11217 non-null object
4 airbag 11217 non-null object
5 seatbelt 11217 non-null object
6 frontal 11217 non-null int64
7 sex 11217 non-null object
8 ageOFocc 11217 non-null int64
9 yearacc 11217 non-null int64
10 yearVeh 11217 non-null float64
11 abcat 11217 non-null object
12 occRole 11217 non-null object
13 deploy 11217 non-null int64
14 injSeverity 11140 non-null float64
15 caseid 11217 non-null object
dtypes: float64(3), int64(5), object(8)

 8 columns are numeric variables of integer/float datatype and 8 columns are of object
datatype and are categorical variables
 No duplicates are observed in the dataset
 Shape of the dataset
(11217, 16)
 There are 77 missing values in the ‘injSeverity’ variable

Unnamed: 0 0
dvcat 0
weight 0
Survived 0
airbag 0
seatbelt 0
frontal 0
sex 0
ageOFocc 0
yearacc 0
yearVeh 0
abcat 0
occRole 0
deploy 0
injSeverity 77
caseid 0
 Imputing null values with mode for the variable ‘injSeverity’
Post imputing
dvcat 0
weight 0
Survived 0
airbag 0
seatbelt 0
frontal 0
sex 0
ageOFocc 0
yearacc 0
yearVeh 0
abcat 0
occRole 0
deploy 0
injSeverity 0
caseid 0
dtype: int64
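The mode imputation above can be sketched as follows; injSeverity is an ordinal code, so the most frequent value is a reasonable fill. The five-row frame is a synthetic stand-in for the 11217-row dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one missing injSeverity value.
df = pd.DataFrame({"injSeverity": [3.0, 1.0, np.nan, 1.0, 2.0]})

mode_val = df["injSeverity"].mode()[0]            # most frequent severity
df["injSeverity"] = df["injSeverity"].fillna(mode_val)
print(df["injSeverity"].isnull().sum())           # 0 -> nothing missing
```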

 Description of the dataset

             count    mean         std          min     25%       50%       75%       max
weight       11217.0  431.405309   1406.202941  0.0     28.292    82.195    324.056   31694.04
frontal      11217.0  0.644022     0.478830     0.0     0.000     1.000     1.000     1.00
ageOFocc     11217.0  37.427654    18.192429    16.0    22.000    33.000    48.000    97.00
yearacc      11217.0  2001.103236  1.056805     1997.0  2001.000  2001.000  2002.000  2002.00
yearVeh      11217.0  1994.177944  5.658704     1953.0  1991.000  1995.000  1999.000  2003.00
deploy       11217.0  0.389141     0.487577     0.0     0.000     0.000     1.000     1.00
injSeverity  11140.0  1.825583     1.378535     0.0     1.000     2.000     3.000     5.00
 Top 5 in dataset

   Unnamed: 0  dvcat  weight  Survived      airbag  seatbelt  frontal  sex  ageOFocc  yearacc  yearVeh  abcat     occRole  deploy  injSeverity  caseid
0  0           55+    27.078  Not_Survived  none    none      1        m    32        1997     1987.0   unavail   driver   0       4.0          2:13:2
1  1           25-39  89.627  Not_Survived  airbag  belted    0        f    54        1997     1994.0   nodeploy  driver   0       4.0          2:17:1
2  2           55+    27.078  Not_Survived  none    belted    1        m    67        1997     1992.0   unavail   driver   0       4.0          2:79:1
3  3           55+    27.078  Not_Survived  none    belted    1        f    64        1997     1992.0   unavail   pass     0       4.0          2:79:1
4  4           55+    13.374  Not_Survived  none    none      1        m    23        1997     1986.0   unavail   driver   0       4.0          4:58:1

 The ‘Unnamed: 0’ field can be dropped, since we observe no duplicates and it is a
sequencing variable. Post dropping:

   dvcat  weight  Survived      airbag  seatbelt  frontal  sex  ageOFocc  yearacc  yearVeh  abcat     occRole  deploy  injSeverity  caseid
0  55+    27.078  Not_Survived  none    none      1        m    32        1997     1987.0   unavail   driver   0       4.0          2:13:2
1  25-39  89.627  Not_Survived  airbag  belted    0        f    54        1997     1994.0   nodeploy  driver   0       4.0          2:17:1
2  55+    27.078  Not_Survived  none    belted    1        m    67        1997     1992.0   unavail   driver   0       4.0          2:79:1
3  55+    27.078  Not_Survived  none    belted    1        f    64        1997     1992.0   unavail   pass     0       4.0          2:79:1
4  55+    13.374  Not_Survived  none    none      1        m    23        1997     1986.0   unavail   driver   0       4.0          4:58:1

Histogram
Boxplot
Bivariate Analysis
Heatmap

 There is a positive correlation between deploy and frontal
 There is a strong positive correlation between deploy and yearVeh
 There is a high negative correlation between injSeverity and weight
 There is a moderate negative correlation between yearacc and injSeverity
Correlation Map

2.2 Encode the data (having string values) for Modelling. Data Split: Split the data into train and test
(70:30). Apply Logistic Regression and Linear Discriminant Analysis (LDA).

After replacing the string values and encoding, most columns are of numeric datatype;
only dvcat, abcat and caseid remain as object columns and are handled below
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11217 entries, 0 to 11216
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dvcat 11217 non-null object
1 weight 11217 non-null float64
2 Survived 11217 non-null int64
3 airbag 11217 non-null int64
4 seatbelt 11217 non-null int64
5 frontal 11217 non-null int64
6 sex 11217 non-null int64
7 ageOFocc 11217 non-null int64
8 yearacc 11217 non-null int64
9 yearVeh 11217 non-null float64
10 abcat 11217 non-null object
11 occRole 11217 non-null int64
12 deploy 11217 non-null int64
13 injSeverity 11217 non-null float64
14 caseid 11217 non-null object
dtypes: float64(3), int64(9), object(3)

For abcat and dvcat fields, we use the LabelEncoder from sklearn library to perform the
encoding

Since there are no duplicates, we can drop the caseid which is a sequence variable

After encoding and dropping have been performed, we can observe the info of the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11217 entries, 0 to 11216
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dvcat 11217 non-null int32
1 weight 11217 non-null float64
2 Survived 11217 non-null int64
3 airbag 11217 non-null int64
4 seatbelt 11217 non-null int64
5 frontal 11217 non-null int64
6 sex 11217 non-null int64
7 ageOFocc 11217 non-null int64
8 yearacc 11217 non-null int64
9 yearVeh 11217 non-null float64
10 abcat 11217 non-null int32
11 occRole 11217 non-null int64
12 deploy 11217 non-null int64
13 injSeverity 11217 non-null float64
dtypes: float64(3), int32(2), int64(9)
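The LabelEncoder step for dvcat and abcat can be sketched as below; the five-row frame and its category labels are illustrative stand-ins for the dataset's actual levels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in frame with the two remaining string columns.
df = pd.DataFrame({
    "dvcat": ["1-9km/h", "10-24", "25-39", "40-54", "55+"],
    "abcat": ["unavail", "deploy", "nodeploy", "unavail", "deploy"],
})

# LabelEncoder maps each category to an integer code (assigned in
# sorted label order), leaving one numeric column per original column.
for col in ["dvcat", "abcat"]:
    df[col] = LabelEncoder().fit_transform(df[col])
print(df.dtypes)   # both columns are now integer-typed
```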

Data Split: Split the data into train and test (70:30). 

We drop the field ‘Survived’ from the ‘X’ dataset and assign the ‘Survived’ target
variable to ‘y’

Shape of X
(11217, 13)

Shape of y
(11217,)

We split the Train and Test dataset using the ‘train_test_split’ function which is imported
from Sklearn library

Class proportions of ‘Survived’ in y_train
1 0.89479
0 0.10521
Name: Survived, dtype: float64

Class proportions of ‘Survived’ in y_test
1 0.894831
0 0.105169
Name: Survived, dtype: float64
Using Logistic Regression, the first five predicted class probabilities are:

          0         1
0  0.021088  0.978912
1  0.001645  0.998355
2  0.002648  0.997352
3  0.001732  0.998268
4  0.013719  0.986281

Accuracy - Training Data


0.9603872118201503

Confusion Matrix – Training Data


Classification Report – Training Data
precision recall f1-score support

0 0.93 0.88 0.91 826


1 0.99 0.99 0.99 7025

accuracy 0.98 7851


macro avg 0.96 0.94 0.95 7851
weighted avg 0.98 0.98 0.98 7851

ROC Curve – Training Data

AUC: 0.968
Accuracy – Test Data
0.9607843137254902

Confusion Matrix – Test Data

Classification Report – Test Data


precision recall f1-score support

0 0.94 0.89 0.92 354


1 0.99 0.99 0.99 3012

accuracy 0.98 3366


macro avg 0.96 0.94 0.95 3366
weighted avg 0.98 0.98 0.98 3366

Precision Score – Test Data


0.9874587458745875
ROC Curve – Test Data
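The question asks for LDA alongside Logistic Regression. A sketch of fitting both and comparing accuracy and AUC, using synthetic data as a stand-in for the encoded crash dataset (so the printed scores will differ from those above):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the Survived target.
rng = np.random.default_rng(7)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# 70:30 split; stratify keeps the class proportions equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

for name, model in [("LogReg", LogisticRegression(max_iter=1000)),
                    ("LDA", LinearDiscriminantAnalysis())]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]   # P(class 1), for the AUC
    print(name,
          "accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3),
          "AUC:", round(roc_auc_score(y_test, proba), 3))
```

On a linearly separable problem like this, the two models usually score very closely, which matches the near-identical metrics reported above.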

2.4 Inference: Based on these predictions, what are the insights and recommendations.

Business Insights are:

 Survived is the target variable and all others can be considered as predictor
variables.
 The dataset consists of 11217 rows and 16 columns.
 8 columns are numeric variables and 8 are of categorical nature; we convert the
categorical variables into numeric ones using encoding.
 The injSeverity column has 77 missing values, hence we impute the null values with
the mode.
 The major factors on which we can predict survival are: dvcat (estimated impact
speed), airbag, seatbelt, frontal and deploy.
 Casualties in car crashes can be mitigated by paying more attention to whether
airbags deploy properly and by testing cars at the impact speeds at which crashes
occur.
 The survival rate can be increased if the injury severity is determined quickly and
acted upon faster.
 Sex and year of accident contribute least to the prediction, as observed.
 Frontal impact and abcat could be moderately useful for predicting survival rates.
