
ANSHUL AJAY DYUNDI

G5 SUN 11:30 JAN_22 A

ALTERNATIVE PROJECT
PREDICTIVE MODELLING
Problem 1: Linear Regression
Problem Statement: You are part of an investing firm and your work is to do
research about these 759 firms. You are provided with the dataset containing the sales
and other attributes of these 759 firms. Predict the sales of these firms on the basis of
the details given in the dataset so as to help your company in investing consciously.
Also, provide them with the 5 attributes that are most important.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check
the null values, data types, shape, EDA). Perform Univariate and Bivariate Analysis.
(8 marks)
1.2 Impute null values if present? Do you think scaling is necessary in this case? (8
marks)
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data
into test and train (70:30). Apply Linear regression. Performance Metrics: Check the
performance of Predictions on Train and Test sets using Rsquare, RMSE. (8 marks)
1.4 Inference: Based on these predictions, what are the business insights and
recommendations? (6 marks)
-------------------------------------------------------------------------------------------------------
Data Dictionary for Firm_level_data:
1. sales: Sales (in millions of dollars).
2. capital: Net stock of property, plant, and equipment.
3. patents: Granted patents.
4. randd: R&D stock (in millions of dollars).
5. employment: Employment (in 1000s).
6. sp500: Membership of firms in the S&P 500 index. The S&P 500 is a stock market
index that measures the stock performance of 500 large companies listed on stock
exchanges in the United States.
7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio between a
physical asset's market value and its replacement value.
8. value: Stock market value.
9. institutions: Proportion of stock owned by institutions.
Ans 1.1 We have read the data and have done exploratory data analysis. The brief
summary is as given below:

Some features of the given dataset are as given below:



The description of the data is as given below:

Univariate Analysis: This refers to the analysis of a single variable. The main purpose of
univariate analysis is to summarize the data and find patterns in it; the key point is that
only one variable is involved in the analysis.
Let us check the distribution of each variable of the data. A sketch of this check is given below.
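A minimal sketch of how these checks might look in Python, assuming the dataset is loaded from a file named Firm_level_data.csv (the file name is an assumption; the actual path is not given in the report):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (file name assumed for illustration)
firm_df = pd.read_csv("Firm_level_data.csv")

# Basic structure checks: shape, data types, and null values
print(firm_df.shape)
print(firm_df.dtypes)
print(firm_df.isnull().sum())

# Univariate analysis: distribution and boxplot for each numeric variable
for col in firm_df.select_dtypes(include="number").columns:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(firm_df[col], kde=True, ax=axes[0])  # distribution shape
    sns.boxplot(x=firm_df[col], ax=axes[1])           # outlier check
    fig.suptitle(col)
    plt.show()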
Bivariate Analysis: Through bivariate analysis we try to analyze two variables
simultaneously. As opposed to univariate analysis, where we check the characteristics
of a single variable, in bivariate analysis we try to determine whether there is any
relationship between two variables.

There are essentially three major scenarios that we will come across when we perform
bivariate analysis:
1. Both variables of interest are qualitative
2. One variable is qualitative and the other is quantitative
3. Both variables are quantitative

We have performed bivariate analysis choosing the following variables:

 Capital
 Value

In the above scenario, both variables are quantitative. A sketch of this check is given below.
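A brief sketch of this bivariate check, continuing with the firm_df DataFrame from the sketch above:

import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot of the two quantitative variables chosen above
sns.scatterplot(data=firm_df, x="capital", y="value")
plt.title("capital vs. value")
plt.show()

# Correlation of each numeric attribute with sales
print(firm_df.corr(numeric_only=True)["sales"].sort_values(ascending=False))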

1.2 Impute null values if present? Do you think scaling is necessary in this case? (8
marks)
Ans 1.2 We have imputed the null values.
Scaling – Purpose: In this method, we convert variables with different scales of
measurement into a single scale.

In the given dataset the attributes are measured on very different scales (for example,
sales in millions of dollars versus employment in thousands); therefore we should scale
our data in this case.
Accordingly, we have scaled the dataset after treating the outliers and converting the
categorical data into continuous form.
Standard Scaler normalizes the data using the formula (x - mean) / standard deviation.
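A minimal sketch of these steps, continuing with firm_df. Median imputation and IQR-based capping are common choices assumed here; the report does not state the exact methods used:

from sklearn.preprocessing import StandardScaler

# Impute null values with the column median (robust to outliers)
firm_df = firm_df.fillna(firm_df.median(numeric_only=True))

# Treat outliers by capping values outside 1.5 * IQR beyond the quartiles
for col in firm_df.select_dtypes(include="number").columns:
    q1, q3 = firm_df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    firm_df[col] = firm_df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Standard Scaler: (x - mean) / standard deviation
num_cols = firm_df.select_dtypes(include="number").columns
firm_df[num_cols] = StandardScaler().fit_transform(firm_df[num_cols])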

Ans 1.3 We have encoded the data (having string values) for Modelling and have also
done the Data Split: split the data into test and train (70:30).

For model building we have followed the steps given below (a sketch in code follows
the list):

Step 1: Separate X and y.

Step 2: Train-test split into X_train, X_test, y_train, y_test.

Step 3: The model is introduced.

Step 4: model.fit(X_train, y_train)

Step 5: model.predict(X_train) and model.predict(X_test)

Step 6: Validation
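Putting the steps together, a minimal sketch. It assumes sales is the target and that the string-valued sp500 column has levels "yes"/"no" (an assumption), continuing with firm_df:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Step 1: encode the string column (levels assumed) and separate X and y
firm_df["sp500"] = firm_df["sp500"].map({"yes": 1, "no": 0})
X = firm_df.drop("sales", axis=1)
y = firm_df["sales"]

# Step 2: 70:30 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Steps 3-5: introduce the model, fit, and predict on both sets
model = LinearRegression().fit(X_train, y_train)
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)

# Step 6: validation with R-square and RMSE
print("Train R2  :", r2_score(y_train, pred_train))
print("Test  R2  :", r2_score(y_test, pred_test))
print("Train RMSE:", np.sqrt(mean_squared_error(y_train, pred_train)))
print("Test  RMSE:", np.sqrt(mean_squared_error(y_test, pred_test)))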

We have applied Linear Regression to the given dataset, and after fitting it the
performance metrics are as given below:

R-square on training data: 84.5%
RMSE on training data: 6%
RMSE on testing data: 5.19%

Ans 1.4 Inference:

 The investment criteria of any new investor are mainly based on the capital
invested in the company by the promoters, and investors favour firms where the
capital investment is good, as is also reflected in the scatter plot.
 To generate capital, the company should have a combination of attributes such
as value, employment, sales and patents.
 The highest contributing attribute is employment, followed by patents; a sketch
of how such a ranking can be read off the fitted model is given below.
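One way such a ranking can be obtained from the fitted model of Ans 1.3 (since the inputs were standardized, the absolute coefficients are roughly comparable); this is a sketch, not the exact code used in the original work:

import pandas as pd

# Rank attributes by the absolute value of their standardized coefficients
coef = pd.Series(model.coef_, index=X.columns)
print(coef.abs().sort_values(ascending=False).head(5))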
Problem 2: Logistic Regression and LDA
You are hired by the Government to do analysis on car crashes. You are provided details
of car crashes, among which some people survived and some didn't. You have to help
the government predict whether a person will survive or not on the basis of the
information given in the dataset, so as to provide insights that will help the government
make stronger laws for car manufacturers to ensure safety measures. Also, find out
the important factors on the basis of which you made your predictions.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do the null value
condition check; write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis. (8 marks)
2.2 Encode the data (having string values) for Modelling. Data Split: Split the data
into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant
analysis). (8 marks)
2.3 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
each model. Compare both the models and write inferences, which model is
best/optimized. (8 marks)
2.4 Inference: Based on these predictions, what are the insights and recommendations.
(6 marks)
-------------------------------------------------------------------------------------------------------
Data Dictionary for Car Crash:
1. dvcat: factor with levels (estimated impact speeds) 1-9 km/h, 10-24, 25-39, 40-54,
55+
2. weight: Observation weights, albeit of uncertain accuracy, designed to account for
varying sampling probabilities. (The inverse probability weighting estimator can be
used to demonstrate causality when the researcher cannot conduct a controlled
experiment but has observed data to model.) For further information go to this link:
https://en.wikipedia.org/wiki/Inverse_probability_weighting
3. Survived: factor with levels Survived or not survived
4. air bag: a factor with levels none or airbag
5. seatbelt: a factor with levels none or belted
6. frontal: a numeric vector; 0 = non-frontal, 1 = frontal impact
7. sex: a factor with levels f: Female or m: Male
8. age of occ: age of occupant in years
9. year of acc: year of accident
10. year of Veh model: Year of model of vehicle; a numeric vector
11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has
levels deploy, no deploy and unavailing
12. occ role: a factor with levels driver or pass: passenger
13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if
one or more bags deployed.
Ans 2.1 Data Ingestion: We have loaded the dataset and done the descriptive
statistics.
Data Analysis: In data analysis we use two main statistical methods, descriptive
and inferential, which are as given below:
 Descriptive statistics uses tools like the mean and standard deviation on a sample
to summarize data.
 Inferential statistics, on the other hand, looks at data that can randomly vary,
and then draws conclusions from it.
Therefore, under descriptive statistics fall two sets of properties: central tendency and
dispersion.
 Central tendency characterizes one central value for the entire
distribution. Measures under this include the mean, median, and mode.
 Dispersion characterizes how far apart the members of the distribution are
from the center and from each other. Variance/standard deviation is one such
measure of variability. A sketch of these checks in code is given after this list.
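A minimal sketch of these checks, assuming the crash data is loaded from a file named Car_Crash.csv (the file name is an assumption):

import pandas as pd

crash_df = pd.read_csv("Car_Crash.csv")  # file name assumed

# Shape, data types, and null value condition check
print(crash_df.shape)
crash_df.info()
print(crash_df.isnull().sum())

# Descriptive statistics: central tendency and dispersion in one view
print(crash_df.describe(include="all"))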
Inference on dataset

Univariate analysis explores each variable in a dataset separately. It looks at the
range of values as well as the central tendency of the values, describes the pattern
of response to the variable, and describes each variable on its own. Descriptive
statistics describe and summarize data.
The main purpose of univariate analysis is to summarize the data and find patterns
in it; the key point is that only one variable is involved in the analysis.
We have performed univariate analysis as given below.
Bivariate Analysis: Through bivariate analysis we try to analyze two variables
simultaneously. As opposed to univariate analysis, where we check the characteristics
of a single variable, in bivariate analysis we try to determine whether there is any
relationship between two variables.

There are essentially three major scenarios that we will come across when we perform
bivariate analysis:

 Both variables of interest are qualitative
 One variable is qualitative and the other is quantitative
 Both variables are quantitative
We have performed bivariate analysis choosing the following variable pairs:

 Survived and frontal impact
 Survived and airbag deployment

Further, we have also examined the relationship between the numerical variables,
i.e. frontal and airbag deployment. A sketch of these checks is given below.
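A short sketch of these bivariate checks, continuing with crash_df. The column names Survived, frontal, abcat and deploy follow the data dictionary; their exact spellings in the file are an assumption:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Survived vs. frontal impact (target vs. 0/1 flag)
print(pd.crosstab(crash_df["Survived"], crash_df["frontal"]))

# Survived vs. airbag deployment status
sns.countplot(data=crash_df, x="abcat", hue="Survived")
plt.show()

# Two numeric variables: frontal impact vs. deploy
print(pd.crosstab(crash_df["frontal"], crash_df["deploy"]))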

Ans 2.2 We have encoded the data (having string values) for Modelling.
Data Split: We have split the data into train and test (70:30).
We have applied Logistic Regression and LDA (linear discriminant analysis) on the
given dataset, taking Survived as the target variable. A sketch of these steps is given below.
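A minimal sketch of these steps, continuing with crash_df. Integer category codes are one common encoding choice, assumed here; the report does not state which encoding was used:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Encode every string-valued column as integer category codes
for col in crash_df.select_dtypes(include="object").columns:
    crash_df[col] = pd.Categorical(crash_df[col]).codes

# Survived is the target; 70:30 split
X = crash_df.drop("Survived", axis=1)
y = crash_df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Fit Logistic Regression and LDA
log_model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
lda_model = LinearDiscriminantAnalysis().fit(X_train, y_train)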

Ans 2.3 The performance metrics of the Logistic Regression and Linear Discriminant
Analysis models are as given below:

Particulars   Logistic Reg (Test)   Logistic Reg (Train)   LDA (Test)   LDA (Train)
Accuracy      96.96%                97.13%                 98.6%        -
AUC           99%                   98.6%                  97.1%        96.9%
Recall        98%                   99%                    98%          99%
Precision     98%                   98%                    97%          97%
F1 Score      98%                   98%                    98%          -
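A sketch of how these metrics can be computed and the ROC curves plotted, using the fitted log_model and lda_model from Ans 2.2:

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, roc_curve)

for name, m in [("Logistic Regression", log_model), ("LDA", lda_model)]:
    for split, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
        pred = m.predict(X_)
        prob = m.predict_proba(X_)[:, 1]
        print(name, split, "accuracy:", accuracy_score(y_, pred))
        print(confusion_matrix(y_, pred))
        print(name, split, "AUC:", roc_auc_score(y_, prob))
        fpr, tpr, _ = roc_curve(y_, prob)
        plt.plot(fpr, tpr, label=f"{name} ({split})")

plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()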
Ans 2.4 Inference
 The model accuracy of logistic regression on both the training data and the
testing data is almost the same, i.e. about 97%.
 Similarly, the AUC in logistic regression for the training data and the testing
data is also similar.
 The other parameters of the confusion matrix in logistic regression are likewise
similar, so we can presume that the model is not overfitted.
 We have additionally applied GridSearchCV to hyper-tune the model, after
which the F1 score on both the training and the test data was 98% (a sketch of
this step is given after this list).
 In the case of LDA, the AUC for the testing and training data is also about the
same, roughly 97%; the other parameters of the LDA confusion matrix are
similar as well, which shows that this model is not overfitted either.
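A hedged sketch of that tuning step; the parameter grid shown is illustrative, as the actual search space is not stated in the report:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Illustrative search space over regularization strength and solver
param_grid = {"C": [0.01, 0.1, 1, 10], "solver": ["lbfgs", "liblinear"]}
grid = GridSearchCV(LogisticRegression(max_iter=10000),
                    param_grid, scoring="f1", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print("Best cross-validated F1:", grid.best_score_)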
Overall, we can conclude that the logistic regression model is better suited for this
dataset than the Linear Discriminant Analysis model, given its consistent level of
accuracy across the training and test sets.
