Sukanya Linear LogisticRegression Report
Linear Regression
You are part of an investment firm, and your work is to do research on these
759 firms. You are provided with a dataset containing the sales and other
attributes of these 759 firms. Predict the sales of these firms on the basis of the
details given in the dataset, so as to help your company invest consciously.
Also, provide the 5 attributes that are most important.
Solution:
The following are the observations after exploring the data initially:
EDA
        count  mean      std       min       25%       50%       75%       max
tobinq  738.0  2.794910  3.366591  0.119001  1.018783  1.680303  3.139309  20.000000
We can observe many outliers in the dataset, spread across most of the variables.
From the histograms we can observe that tobinq (Tobin's q) is right-skewed, while
institutions is roughly bell-shaped and fairly evenly spread.
Correlation Plot
From the pairplot we can also observe at least a mild correlation between the
variables.
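The correlation check described above can be sketched as follows; a small synthetic frame stands in for the actual firm dataset (column names follow the report, the values are made up):

```python
# Minimal correlation-analysis sketch; synthetic values, illustrative only.
import pandas as pd

df = pd.DataFrame({
    "sales":   [100.0, 250.0, 80.0, 400.0, 150.0],
    "capital": [50.0, 120.0, 40.0, 200.0, 70.0],
    "patents": [2, 5, 1, 9, 3],
    "tobinq":  [1.2, 2.5, 0.8, 3.9, 1.6],
})

corr = df.corr()          # pairwise Pearson correlations
print(corr.round(2))
# seaborn can render the same matrix graphically:
#   sns.heatmap(corr, annot=True); sns.pairplot(df)
```

On the real data, the heatmap and pairplot built from this matrix are what reveal the mild correlations noted above.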
Count Plot
1.2) Impute null values if present? Do you think scaling is necessary in this case?
Yes, null values are present in the dataset for the tobinq variable.
Since only 21 of the 759 records have a null tobinq, we can drop these records
rather than impute them. After dropping them, no null values remain:
sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 0
value 0
institutions 0
dtype: int64
From the data we can see that sp500 is the only categorical variable
We now split the data into train and test sets in a 70:30 ratio using the ‘train_test_split’
function. We drop the ‘Sales’ target variable from the ‘X’ dataset and assign ‘Sales’ to the ‘y’
dataset.
Comparison of Performance of Predictions on Train and Test sets using R-square, RMSE
R-Square on Train Data
0.942609247903072
R-Square on Test Data
0.7833858809094182
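The R-square and RMSE comparison above can be reproduced with the pattern below; a small synthetic regression problem stands in for the firm data, so the exact scores differ:

```python
# Sketch of scoring a fitted LinearRegression on train and test sets.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
model = LinearRegression().fit(X_train, y_train)

r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(round(r2_train, 3), round(r2_test, 3), round(rmse_test, 3))
```

A large gap between the train and test R-square, as seen in the report's figures, hints at some overfitting on the training data.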
1.4 Inference: Based on these predictions, what are the business insights and
recommendations.
Sales is the target variable with other variables acting as the Predictor variables.
The dataset consists of 759 rows and 10 columns. 9 columns are numeric variables
while sp500 is the only Categorical variable.
There is little difference in the data before and after handling the null values,
and linear regression does not require scaled predictors, so not scaling is a
reasonable choice.
Membership of firms in S&P 500 index (sp500) is observed as one of the crucial
attributes that determine Sales of the firm.
We can consider the following 5 attributes to be the determining factors of
Sales: R&D stock (randd), capital, institutions, Tobin's Q (tobinq) and sp500.
We observe a high correlation between patents and randd; hence we can take either
one of these attributes into consideration.
Investing in more granted patents is associated with higher revenue, as observed
from the linear regression on the training and test datasets.
Across the models, the RMSE values differ only slightly.
Sales can be increased by focusing more on Tobin's Q and sp500; since the United
States stock exchanges have a huge impact, we need to speculate carefully.
Institutional investment can also be improved, as it has a significant impact
on Sales.
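One common way to arrive at such an attribute ranking (an assumption here, not necessarily the report's exact method) is to compare the absolute coefficients of a regression fitted on standardized predictors; a synthetic frame with the report's column names illustrates the idea:

```python
# Ranking attributes by absolute standardized regression coefficients.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = pd.DataFrame(rng.normal(size=(300, 5)),
                 columns=["randd", "capital", "institutions", "tobinq", "sp500"])
# Synthetic target: randd matters most, capital second, the rest are noise.
y = 4 * X["randd"] + 2 * X["capital"] + rng.normal(scale=0.5, size=300)

coefs = LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_
ranking = pd.Series(np.abs(coefs), index=X.columns).sort_values(ascending=False)
print(ranking)   # largest |coefficient| = most influential attribute
```

Standardizing first puts all predictors on the same scale, so the coefficient magnitudes are directly comparable.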
2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do
exploratory data analysis.
Solution:
The following are the observations after exploring the data initially:
Unnamed: 0 0
dvcat 0
weight 0
Survived 0
airbag 0
seatbelt 0
frontal 0
sex 0
ageOFocc 0
yearacc 0
yearVeh 0
abcat 0
occRole 0
deploy 0
injSeverity 77
caseid 0
We impute the null values with the mode for the variable ‘injSeverity’.
Post imputing:
dvcat 0
weight 0
Survived 0
airbag 0
seatbelt 0
frontal 0
sex 0
ageOFocc 0
yearacc 0
yearVeh 0
abcat 0
occRole 0
deploy 0
injSeverity 0
caseid 0
dtype: int64
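The mode imputation above can be sketched as follows; a tiny synthetic column stands in for the 11217-row injSeverity series:

```python
# Sketch of filling injSeverity nulls with the most frequent value (mode).
import numpy as np
import pandas as pd

df = pd.DataFrame({"injSeverity": [3.0, 1.0, np.nan, 2.0, 1.0, np.nan, 1.0]})
mode_val = df["injSeverity"].mode()[0]            # most frequent value
df["injSeverity"] = df["injSeverity"].fillna(mode_val)
print(df["injSeverity"].isnull().sum())           # no nulls remain
```

Mode imputation is a reasonable default for an ordinal category like injury severity, since the mean would produce values outside the observed severity levels.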
             count    mean       std        min   25%     50%     75%     max
frontal      11217.0  0.644022   0.478830   0.0   0.000   1.000   1.000   1.00
ageOFocc     11217.0  37.427654  18.192429  16.0  22.000  33.000  48.000  97.00
deploy       11217.0  0.389141   0.487577   0.0   0.000   0.000   1.000   1.00
injSeverity  11140.0  1.825583   1.378535   0.0   1.000   2.000   3.000   5.00
Top 5 records in the dataset
   Unnamed: 0  dvcat   weight  Survived      airbag  seatbelt  frontal  sex  ageOFocc  yearacc  yearVeh  abcat     occRole  deploy  injSeverity  caseid
0  0           55+     27.078  Not_Survived  none    none      1        m    32        1997     1987.0   unavail   driver   0       4.0          2:13:2
1  1           25-39   89.627  Not_Survived  airbag  belted    0        f    54        1997     1994.0   nodeploy  driver   0       4.0          2:17:1
2  2           55+     27.078  Not_Survived  none    belted    1        m    67        1997     1992.0   unavail   driver   0       4.0          2:79:1
3  3           55+     27.078  Not_Survived  none    belted    1        f    64        1997     1992.0   unavail   pass     0       4.0          2:79:1
4  4           55+     13.374  Not_Survived  none    none      1        m    23        1997     1986.0   unavail   driver   0       4.0          4:58:1
Histogram
Boxplot
Bivariate Analysis
Heatmap
2.2 Encode the data (having string values) for Modelling. Data Split: Split the data into train and test
(70:30). Apply Logistic Regression and Linear Discriminant Analysis (LDA).
After replacing and encoding the string values, most columns are numeric; only dvcat, abcat and caseid remain as object dtype:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11217 entries, 0 to 11216
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dvcat 11217 non-null object
1 weight 11217 non-null float64
2 Survived 11217 non-null int64
3 airbag 11217 non-null int64
4 seatbelt 11217 non-null int64
5 frontal 11217 non-null int64
6 sex 11217 non-null int64
7 ageOFocc 11217 non-null int64
8 yearacc 11217 non-null int64
9 yearVeh 11217 non-null float64
10 abcat 11217 non-null object
11 occRole 11217 non-null int64
12 deploy 11217 non-null int64
13 injSeverity 11217 non-null float64
14 caseid 11217 non-null object
dtypes: float64(3), int64(9), object(3)
For the abcat and dvcat fields, we use the LabelEncoder from the sklearn library to
perform the encoding.
Since there are no duplicates, we can drop caseid, which is a sequence-like identifier.
After the encoding and dropping have been performed, we can observe the info of the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11217 entries, 0 to 11216
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dvcat 11217 non-null int32
1 weight 11217 non-null float64
2 Survived 11217 non-null int64
3 airbag 11217 non-null int64
4 seatbelt 11217 non-null int64
5 frontal 11217 non-null int64
6 sex 11217 non-null int64
7 ageOFocc 11217 non-null int64
8 yearacc 11217 non-null int64
9 yearVeh 11217 non-null float64
10 abcat 11217 non-null int32
11 occRole 11217 non-null int64
12 deploy 11217 non-null int64
13 injSeverity 11217 non-null float64
dtypes: float64(3), int32(2), int64(9)
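The encoding and dropping steps can be sketched as below; a few synthetic rows stand in for the 11217-record crash dataset:

```python
# Sketch: LabelEncoder for dvcat/abcat, then drop the caseid identifier.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "dvcat":  ["55+", "25-39", "55+", "10-24"],
    "abcat":  ["unavail", "nodeploy", "unavail", "deploy"],
    "caseid": ["2:13:2", "2:17:1", "2:79:1", "4:58:1"],
})

for col in ["dvcat", "abcat"]:
    df[col] = LabelEncoder().fit_transform(df[col])  # object -> integer codes

df = df.drop("caseid", axis=1)   # identifier with no predictive value
print(df.dtypes)
```

Note that LabelEncoder assigns arbitrary integer codes; for an ordered category like dvcat (speed bands), an explicit ordinal mapping would preserve the ordering, but the report uses LabelEncoder as stated.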
Data Split: Split the data into train and test (70:30).
We drop the field Survived from ‘X’ dataset and assign Survived target variable to ‘y’ dataset
Shape of X
(11217, 13)
Shape of y
(11217,)
We split the data into train and test sets using the ‘train_test_split’ function imported
from the sklearn library.
Y_train
1 0.89479
0 0.10521
Name: Survived, dtype: float64
Y_test
1 0.894831
0 0.105169
Name: Survived, dtype: float64
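The class-balance check above can be sketched as below; synthetic labels stand in for the Survived column, and the `stratify` argument is an assumption (the near-identical train/test proportions in the report are consistent with it, but a plain random split on this many rows would look similar):

```python
# Sketch of the 70:30 split with a class-proportion check on both halves.
import pandas as pd
from sklearn.model_selection import train_test_split

y = pd.Series([1] * 90 + [0] * 10, name="Survived")  # imbalanced labels
X = pd.DataFrame({"x": range(100)})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

With `stratify=y`, both halves keep the original class ratio exactly, which matters for an imbalanced target like Survived (about 89% survivors).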
Using LogisticRegression, the predicted class probabilities for the first five test records are:
        0         1
0  0.021088  0.978912
1  0.001645  0.998355
2  0.002648  0.997352
3  0.001732  0.998268
4  0.013719  0.986281
AUC: 0.968
Accuracy – Test Data
0.9607843137254902
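The classification step above follows this pattern; a synthetic problem from `make_classification` stands in for the crash data, so the scores differ from the report's:

```python
# Sketch: fit LogisticRegression, inspect class probabilities, score AUC/accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)          # columns: P(class 0), P(class 1)
auc = roc_auc_score(y_test, proba[:, 1])   # AUC uses the class-1 probability
acc = accuracy_score(y_test, clf.predict(X_test))
print(proba[:5].round(6))
print("AUC:", round(auc, 3), "Accuracy:", round(acc, 3))
```

With roughly 89% of records in the Survived class, accuracy alone can be misleading, which is why the report's AUC of 0.968 is the more informative figure.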
2.4 Inference: Based on these predictions, what are the insights and recommendations.
Survived is the target variable and all others can be considered as the Predictor
variables
The dataset consists of 11217 rows and 16 columns.
7 are numeric variables and 9 are of a categorical nature.
We then convert the categorical variables into numeric variables using encoding
injSeverity column has 77 missing values, hence we impute the null values with Mode
The major factors based on which we can predict survival are: dvcat (estimated
impact speed), airbag, seatbelt, frontal and deploy.
Car-crash fatalities can be mitigated by paying more attention to whether airbags
deploy properly and by testing cars at the impact speeds at which they are likely
to be hit.
The survival rate can be increased if the injury severity is determined quickly
and acted upon faster.
Sex and year of accident are the least important factors, as observed.
Frontal impact and abcat can be moderately considered for predicting survival
rates.