Sukanya Linear LogisticRegression Report


Problem 1: 

Linear Regression
You are a part of an investment firm and your work is to do research about these
759 firms. You are provided with a dataset containing the sales and other
attributes of these 759 firms. Predict the sales of these firms on the basis of the
details given in the dataset, so as to help your company invest wisely.
Also, provide them with the 5 attributes that are most important.
Solution:
The following are the observations after exploring the data initially:

 Dataset consists of 759 rows and 10 columns


 9 columns are numeric in nature and 1 column ‘sp500’ is a categorical variable and of
object data type
 Missing values observed in the ‘tobinq’ variable
 No bad data is present in the dataset
 No duplicates are observed in the dataset
 Since there are no duplicate values, we can drop the sequence column ‘Unnamed: 0’
from the dataset.
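The initial checks above can be sketched as follows. The report does not show its code, so the miniature stand-in frame below is an illustrative assumption; with the real CSV, `pd.read_csv` would replace the hand-built frame.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the 759-row firm dataset, so the sketch runs
# end to end without the original file.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "sales": [826.99, 407.75, 8407.85],
    "sp500": ["no", "no", "yes"],
    "tobinq": [11.05, np.nan, 5.21],
})

print(df.shape)                # (rows, columns)
print(df.duplicated().sum())   # 0 -> no duplicate rows
print(df.isnull().sum())       # missing values per column (tobinq has some)

# With no duplicates, the running-index column carries no information.
df = df.drop(columns=["Unnamed: 0"])
```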

EDA

 Description of the dataset is as follows

              count   mean         std          min       25%         50%         75%          max
sales         759.0   2689.705158  8722.060124  0.138000  122.920000  448.577082  1822.547366  135696.788200
capital       759.0   1977.747498  6466.704896  0.057000  52.650501   202.179023  1075.790020  93625.200560
patents       759.0   25.831357    97.259577    0.000000  1.000000    3.000000    11.500000    1220.000000
randd         759.0   439.938074   2007.397588  0.000000  4.628262    36.864136   143.253403   30425.255860
employment    759.0   14.164519    43.321443    0.006000  0.927500    2.924000    10.050001    710.799925
tobinq        738.0   2.794910     3.366591     0.119001  1.018783    1.680303    3.139309     20.000000
value         759.0   2732.734750  7071.072362  1.971053  103.593946  410.793529  2054.160386  95191.591160
institutions  759.0   43.020540    21.685586    0.000000  25.395000   44.110000   60.510000    90.150000
 Shape of the dataset
(759, 10)
 Columns in the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 759 entries, 0 to 758
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 759 non-null int64
1 sales 759 non-null float64
2 capital 759 non-null float64
3 patents 759 non-null int64
4 randd 759 non-null float64
5 employment 759 non-null float64
6 sp500 759 non-null object
7 tobinq 738 non-null float64
8 value 759 non-null float64
9 institutions 759 non-null float64
dtypes: float64(7), int64(2), object(1)

 The top 5 records in the dataset are as below

   sales        capital      patents  randd        employment  sp500  tobinq     value         institutions
0  826.995050   161.603986   10       382.078247   2.306000    no     11.049511  1625.453755   80.27
1  407.753973   122.101012   2        0.000000     1.860000    no     0.844187   243.117082    59.02
2  8407.845588  6221.144614  138      3296.700439  49.659005   yes    5.205257   25865.233800  47.70
3  451.000010   266.899987   1        83.540161    3.071000    no     0.305221   63.024630     26.88
4  174.927981   140.124004   2        14.233637    1.947000    no     1.063300   67.406408     49.46

 The last 5 records in the dataset are as below

     sales        capital     patents  randd       employment  sp500  tobinq    value       institutions
754  1253.900196  708.299935  32       412.936157  22.100002   yes    0.697454  267.119487  33.50
755  171.821025   73.666008   1        0.037735    1.684000    no     NaN       228.475701  46.41
756  202.726967   123.926991  13       74.861099   1.460000    no     5.229723  580.430741  42.25
757  785.687944   138.780992  6        0.621750    2.900000    yes    1.625398  309.938651  61.39
758  22.701999    14.244999   5        18.574360   0.197000    no     2.213070  18.940140   7.50
Univariate and Bivariate Analysis

We can observe many outliers across most of the variables.
From the histogram we can observe that tobinq (Tobin's q) is right skewed.
Institutions is roughly bell-shaped, close to a normal distribution.
Correlation Plot

We can observe from the correlation plot:

 Employment and Sales are the highly correlated variables


 Patents and Randd (R&D stock) are also highly correlated
 Capital and Sales are correlated
Pairplot

From the pairplot also we can observe that there is at least a mild correlation between the
variables.
Count Plot

1.2) Impute null values if present? Do you think scaling is necessary in this case?

Yes, null values are present in the dataset for the ‘tobinq’ variable.

Since only 21 of the 759 records have a null tobinq value, we drop these records rather
than impute them. Scaling is not strictly necessary for ordinary least-squares
predictions, though it helps when comparing coefficient magnitudes.

Post dropping the records, we can observe that no null values remain
sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 0
value 0
institutions 0
dtype: int64

The shape of the dataset is now


(738, 9)
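The null-handling step above can be sketched like this; the four-row frame is a synthetic stand-in for the real 759-row dataset (in which 21 rows have a missing tobinq).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in frame: one of the four rows has a missing tobinq.
df = pd.DataFrame({
    "sales": [826.99, 407.75, 171.82, 451.00],
    "tobinq": [11.05, 0.84, np.nan, 0.31],
})

before = df.shape[0]
df = df.dropna(subset=["tobinq"])     # drop rows where tobinq is missing
print(before - df.shape[0], "rows dropped")
print(df.isnull().sum().sum())        # 0 -> no missing values remain
```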
1.3) Encode the data (having string values) for Modelling. Data Split: Split the data into
test and train (30:70). Apply Linear regression. Performance Metrics: Check the
performance of Predictions on Train and Test sets using R-square, RMSE.

From the data we can see that ‘sp500’ is the only categorical variable

   sales        capital      patents  randd        employment  sp500  tobinq     value         institutions
0  826.995050   161.603986   10       382.078247   2.306000    no     11.049511  1625.453755   80.27
1  407.753973   122.101012   2        0.000000     1.860000    no     0.844187   243.117082    59.02
2  8407.845588  6221.144614  138      3296.700439  49.659005   yes    5.205257   25865.233800  47.70
3  451.000010   266.899987   1        83.540161    3.071000    no     0.305221   63.024630     26.88
4  174.927981   140.124004   2        14.233637    1.947000    no     1.063300   67.406408     49.46

Using get_dummies we can encode the data having string values

   sales        capital      patents  randd        employment  tobinq     value         institutions  sp500_no  sp500_yes
0  826.995050   161.603986   10       382.078247   2.306000    11.049511  1625.453755   80.27         1         0
1  407.753973   122.101012   2        0.000000     1.860000    0.844187   243.117082    59.02         1         0
2  8407.845588  6221.144614  138      3296.700439  49.659005   5.205257   25865.233800  47.70         0         1
3  451.000010   266.899987   1        83.540161    3.071000    0.305221   63.024630     26.88         1         0
4  174.927981   140.124004   2        14.233637    1.947000    1.063300   67.406408     49.46         1         0

Now the shape of the dataset is


(738, 10)
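The encoding step can be sketched as below; the two-row frame stands in for the firm dataset, and only sp500 needs encoding since it is the lone object-dtype column.

```python
import pandas as pd

# Minimal stand-in frame with the one categorical column.
df = pd.DataFrame({
    "sales": [826.99, 8407.85],
    "sp500": ["no", "yes"],
})

# One indicator column per category, as in the report's sp500_no / sp500_yes.
encoded = pd.get_dummies(df, columns=["sp500"])
print(list(encoded.columns))   # ['sales', 'sp500_no', 'sp500_yes']
```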

We now split the data into Train and Test sets in a 70:30 ratio using the
‘train_test_split’ function. We drop the target variable ‘sales’ from the ‘X’ dataset and
assign it to ‘y’.
Comparison of Performance of Predictions on Train and Test sets using R-square, RMSE

Performance on Train dataset

R-square (model score):
0.942609247903072

RMSE (Root Mean Squared Error):
2359.982798372816

Performance on Test dataset

R-square (model score):
0.7833858809094182

RMSE (Root Mean Squared Error):
2686.692289357481
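A minimal sketch of the split-fit-score pipeline behind these numbers. It uses synthetic data in place of the firm dataset, so the printed scores will differ from those above; note RMSE is the square *root* of the mean squared error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded firm data (the real X has 9 predictors).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -1.5, 2.0]) + rng.normal(scale=0.5, size=200)

# 70:30 train:test split, as in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

model = LinearRegression().fit(X_train, y_train)

for name, Xs, ys in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(Xs)
    rmse = np.sqrt(mean_squared_error(ys, pred))   # root, not square
    print(name, "R2:", round(r2_score(ys, pred), 3),
          "RMSE:", round(rmse, 3))
```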

1.4 Inference: Based on these predictions, what are the business insights and
recommendations.

The business insights are:

 Sales is the target variable, with the other variables acting as predictors.
 The dataset consists of 759 rows and 10 columns; 9 columns are numeric and sp500 is
the only categorical variable.
 The distribution of the data is essentially unchanged after dropping the null records,
so the decision not to scale the data did not affect the model.
 Membership of a firm in the S&P 500 index (sp500) is observed to be one of the crucial
attributes that determine the firm's sales.
 We can consider the following 5 attributes to be the determining factors of sales:
R&D stock (randd), capital, institutions, Tobin's Q and sp500.
 We observe high correlation between patents and randd, hence we can take either one
of the attributes into consideration.
 A larger stock of granted patents is associated with higher sales, as observed from
the linear regression on the training and test datasets.
 The RMSE values on the training and test sets differ only slightly.
 Sales can be increased by focusing more on Tobin's Q and sp500. Since stock
exchanges in the United States have a huge impact, we need to speculate carefully.
 Institutional investment can also be improved, as it has a significant impact on
sales.
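One way the "top 5 attributes" could be extracted is by ranking the absolute values of the regression coefficients. This sketch uses synthetic data with the report's feature names, so the ranking it prints is illustrative only, not the report's actual result.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with the report's predictor names; the true coefficients
# below are invented so the sketch has a known answer.
rng = np.random.default_rng(0)
cols = ["capital", "patents", "randd", "employment", "tobinq",
        "value", "institutions", "sp500_yes"]
X = pd.DataFrame(rng.normal(size=(300, len(cols))), columns=cols)
y = 4 * X["randd"] + 2 * X["capital"] + X["tobinq"] + rng.normal(size=300)

model = LinearRegression().fit(X, y)

# Rank predictors by the magnitude of their fitted coefficients.
importance = pd.Series(np.abs(model.coef_), index=cols)
importance = importance.sort_values(ascending=False)
print(importance.head(5))   # the five largest |coefficients|
```

Because the synthetic features share a common scale, the coefficient magnitudes are directly comparable; with the real data, standardizing the predictors first would make the ranking fair.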
2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do
exploratory data analysis.

Solution:
The following are the observations after exploring the data initially:

 Dataset consists of 11217 rows and 16 columns

 Info of the dataset


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11217 entries, 0 to 11216
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 11217 non-null int64
1 dvcat 11217 non-null object
2 weight 11217 non-null float64
3 Survived 11217 non-null object
4 airbag 11217 non-null object
5 seatbelt 11217 non-null object
6 frontal 11217 non-null int64
7 sex 11217 non-null object
8 ageOFocc 11217 non-null int64
9 yearacc 11217 non-null int64
10 yearVeh 11217 non-null float64
11 abcat 11217 non-null object
12 occRole 11217 non-null object
13 deploy 11217 non-null int64
14 injSeverity 11140 non-null float64
15 caseid 11217 non-null object
dtypes: float64(3), int64(5), object(8)

 8 columns are numeric variables of integer/float datatype and 8 columns are of object
datatype and are categorical variables
 No duplicates are observed in the dataset
 Shape of the dataset
(11217, 16)
 There are 77 missing values in the ‘injSeverity’ variable

Unnamed: 0 0
dvcat 0
weight 0
Survived 0
airbag 0
seatbelt 0
frontal 0
sex 0
ageOFocc 0
yearacc 0
yearVeh 0
abcat 0
occRole 0
deploy 0
injSeverity 77
caseid 0
 Imputing null values with mode for the variable ‘injSeverity’
Post imputing
dvcat 0
weight 0
Survived 0
airbag 0
seatbelt 0
frontal 0
sex 0
ageOFocc 0
yearacc 0
yearVeh 0
abcat 0
occRole 0
deploy 0
injSeverity 0
caseid 0
dtype: int64
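The mode imputation above can be sketched as follows; injSeverity is an ordinal code, so the most frequent value is a reasonable fill. The five-row frame is a synthetic stand-in for the 11217-row dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one missing injSeverity value.
df = pd.DataFrame({"injSeverity": [3.0, 1.0, np.nan, 1.0, 2.0]})

mode_val = df["injSeverity"].mode()[0]            # most frequent severity
df["injSeverity"] = df["injSeverity"].fillna(mode_val)
print(df["injSeverity"].isnull().sum())           # 0 -> nothing missing
```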

 Description of the dataset

             count    mean         std          min     25%       50%       75%       max
weight       11217.0  431.405309   1406.202941  0.0     28.292    82.195    324.056   31694.04
frontal      11217.0  0.644022     0.478830     0.0     0.000     1.000     1.000     1.00
ageOFocc     11217.0  37.427654    18.192429    16.0    22.000    33.000    48.000    97.00
yearacc      11217.0  2001.103236  1.056805     1997.0  2001.000  2001.000  2002.000  2002.00
yearVeh      11217.0  1994.177944  5.658704     1953.0  1991.000  1995.000  1999.000  2003.00
deploy       11217.0  0.389141     0.487577     0.0     0.000     0.000     1.000     1.00
injSeverity  11140.0  1.825583     1.378535     0.0     1.000     2.000     3.000     5.00
 Top 5 in dataset

   Unnamed: 0  dvcat  weight  Survived      airbag  seatbelt  frontal  sex  ageOFocc  yearacc  yearVeh  abcat     occRole  deploy  injSeverity  caseid
0  0           55+    27.078  Not_Survived  none    none      1        m    32        1997     1987.0   unavail   driver   0       4.0          2:13:2
1  1           25-39  89.627  Not_Survived  airbag  belted    0        f    54        1997     1994.0   nodeploy  driver   0       4.0          2:17:1
2  2           55+    27.078  Not_Survived  none    belted    1        m    67        1997     1992.0   unavail   driver   0       4.0          2:79:1
3  3           55+    27.078  Not_Survived  none    belted    1        f    64        1997     1992.0   unavail   pass     0       4.0          2:79:1
4  4           55+    13.374  Not_Survived  none    none      1        m    23        1997     1986.0   unavail   driver   0       4.0          4:58:1

 The ‘Unnamed: 0’ field can be dropped, since we observe no duplicates and it is a
sequencing variable. Post dropping:

   dvcat  weight  Survived      airbag  seatbelt  frontal  sex  ageOFocc  yearacc  yearVeh  abcat     occRole  deploy  injSeverity  caseid
0  55+    27.078  Not_Survived  none    none      1        m    32        1997     1987.0   unavail   driver   0       4.0          2:13:2
1  25-39  89.627  Not_Survived  airbag  belted    0        f    54        1997     1994.0   nodeploy  driver   0       4.0          2:17:1
2  55+    27.078  Not_Survived  none    belted    1        m    67        1997     1992.0   unavail   driver   0       4.0          2:79:1
3  55+    27.078  Not_Survived  none    belted    1        f    64        1997     1992.0   unavail   pass     0       4.0          2:79:1
4  55+    13.374  Not_Survived  none    none      1        m    23        1997     1986.0   unavail   driver   0       4.0          4:58:1

Histogram
Boxplot
Bivariate Analysis
Heatmap

 There is a positive correlation between deploy and frontal
 There is a strong positive correlation between deploy and yearVeh
 There is a high negative correlation between injSeverity and weight
 There is a moderate negative correlation between yearacc and injSeverity
Correlation Map

2.2 Encode the data (having string values) for Modelling. Data Split: Split the data into train and test
(70:30). Apply Logistic Regression and Linear Discriminant Analysis (LDA).

After replacing the string values and encoding, most columns are of numeric datatype;
only dvcat, abcat and caseid remain as object columns and are handled below
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11217 entries, 0 to 11216
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dvcat 11217 non-null object
1 weight 11217 non-null float64
2 Survived 11217 non-null int64
3 airbag 11217 non-null int64
4 seatbelt 11217 non-null int64
5 frontal 11217 non-null int64
6 sex 11217 non-null int64
7 ageOFocc 11217 non-null int64
8 yearacc 11217 non-null int64
9 yearVeh 11217 non-null float64
10 abcat 11217 non-null object
11 occRole 11217 non-null int64
12 deploy 11217 non-null int64
13 injSeverity 11217 non-null float64
14 caseid 11217 non-null object
dtypes: float64(3), int64(9), object(3)

For abcat and dvcat fields, we use the LabelEncoder from sklearn library to perform the
encoding

Since there are no duplicates, we can drop the caseid which is a sequence variable

After encoding and dropping have been performed, we can observe the info of the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11217 entries, 0 to 11216
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dvcat 11217 non-null int32
1 weight 11217 non-null float64
2 Survived 11217 non-null int64
3 airbag 11217 non-null int64
4 seatbelt 11217 non-null int64
5 frontal 11217 non-null int64
6 sex 11217 non-null int64
7 ageOFocc 11217 non-null int64
8 yearacc 11217 non-null int64
9 yearVeh 11217 non-null float64
10 abcat 11217 non-null int32
11 occRole 11217 non-null int64
12 deploy 11217 non-null int64
13 injSeverity 11217 non-null float64
dtypes: float64(3), int32(2), int64(9)
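The LabelEncoder step for dvcat and abcat can be sketched as below; the five-row frame and its category labels are illustrative stand-ins for the dataset's actual levels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in frame with the two remaining string columns.
df = pd.DataFrame({
    "dvcat": ["1-9km/h", "10-24", "25-39", "40-54", "55+"],
    "abcat": ["unavail", "deploy", "nodeploy", "unavail", "deploy"],
})

# LabelEncoder maps each category to an integer code (assigned in
# sorted label order), leaving one numeric column per original column.
for col in ["dvcat", "abcat"]:
    df[col] = LabelEncoder().fit_transform(df[col])
print(df.dtypes)   # both columns are now integer-typed
```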

Data Split: Split the data into train and test (70:30). 

We drop the field ‘Survived’ from the ‘X’ dataset and assign the ‘Survived’ target
variable to ‘y’

Shape of X
(11217, 13)

Shape of y
(11217,)

We split the Train and Test dataset using the ‘train_test_split’ function which is imported
from Sklearn library

Class proportions of ‘Survived’ in y_train
1 0.89479
0 0.10521
Name: Survived, dtype: float64

Class proportions of ‘Survived’ in y_test
1 0.894831
0 0.105169
Name: Survived, dtype: float64
Using Logistic Regression, the first five predicted class probabilities are:

          0         1
0  0.021088  0.978912
1  0.001645  0.998355
2  0.002648  0.997352
3  0.001732  0.998268
4  0.013719  0.986281

Accuracy - Training Data


0.9603872118201503

Confusion Matrix – Training Data


Classification Report – Training Data
precision recall f1-score support

0 0.93 0.88 0.91 826


1 0.99 0.99 0.99 7025

accuracy 0.98 7851


macro avg 0.96 0.94 0.95 7851
weighted avg 0.98 0.98 0.98 7851

ROC Curve – Training Data

AUC: 0.968
Accuracy – Test Data
0.9607843137254902

Confusion Matrix – Test Data

Classification Report – Test Data


precision recall f1-score support

0 0.94 0.89 0.92 354


1 0.99 0.99 0.99 3012

accuracy 0.98 3366


macro avg 0.96 0.94 0.95 3366
weighted avg 0.98 0.98 0.98 3366

Precision Score – Test Data


0.9874587458745875
ROC Curve – Test Data
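The question asks for LDA alongside Logistic Regression. A sketch of fitting both and comparing accuracy and AUC, using synthetic data as a stand-in for the encoded crash dataset (so the printed scores will differ from those above):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the Survived target.
rng = np.random.default_rng(7)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# 70:30 split; stratify keeps the class proportions equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

for name, model in [("LogReg", LogisticRegression(max_iter=1000)),
                    ("LDA", LinearDiscriminantAnalysis())]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]   # P(class 1), for the AUC
    print(name,
          "accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3),
          "AUC:", round(roc_auc_score(y_test, proba), 3))
```

On a linearly separable problem like this, the two models usually score very closely, which matches the near-identical metrics reported above.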

2.4 Inference: Based on these predictions, what are the insights and recommendations.

Business Insights are:

 Survived is the target variable and all others can be considered as predictor
variables.
 The dataset consists of 11217 rows and 16 columns.
 8 columns are numeric variables and 8 are of categorical nature; we convert the
categorical variables into numeric ones using encoding.
 The injSeverity column has 77 missing values, hence we impute the null values with
the mode.
 The major factors on which we can predict survival are: dvcat (estimated impact
speed), airbag, seatbelt, frontal and deploy.
 Casualties in car crashes can be mitigated by paying more attention to whether
airbags deploy properly and by testing cars at the impact speeds at which crashes
occur.
 The survival rate can be increased if the injury severity is determined quickly and
acted upon faster.
 Sex and year of accident contribute least to the prediction, as observed.
 Frontal impact and abcat could be moderately useful for predicting survival rates.
