Download as pdf or txt
Download as pdf or txt
You are on page 1of 40

NANDURI NAGA SOWRI PGP-DSBA_OCTA_G2 GREAT LEARNING

NANDURI NAGA SOWRI PGP-DSBA_OCTA_G2 GREAT LEARNING

Problem 1: Linear Regression


You are hired by a company Gem Stones co ltd, which is a cubic zirconia
manufacturer. You are provided with the dataset containing the prices and other
attributes of almost 27,000 cubic zirconia (which is an inexpensive diamond
alternative with many of the same qualities as a diamond). The company is earning
different profits on different prize slots. You have to help the company in predicting
the price for the stone on the bases of the details given in the dataset so it can
distinguish between higher profitable stones and lower profitable stones so as to
have better profit share. Also, provide them with the best 5 attributes that are most
important.

Data Dictionary:
Variable Name Description
Carat Carat weight of the cubic zirconia.
Cut Describe the cut quality of the cubic zirconia. Quality is increasing order Fair, Good,
Very Good, Premium, Ideal.
Color Colour of the cubic zirconia. With D being the worst and J the best.
Clarity cubic zirconia Clarity refers to the absence of the Inclusions and Blemishes. (In order
from Worst to Best) IF, VVS1, VVS2, VS1, VS2, Sl1, Sl2, l1
Depth The Height of cubic zirconia, measured from the Culet to the table, divided by its
average Girdle Diameter.
Table The Width of the cubic zirconia's Table expressed as a Percentage of its Average
Diameter.
Price the Price of the cubic zirconia.
X Length of the cubic zirconia in mm.
Y Width of the cubic zirconia in mm.
Z Height of the cubic zirconia in mm.
NANDURI NAGA SOWRI PGP-DSBA_OCTA_G2 GREAT LEARNING

1.1) Read the data and do exploratory data analysis. Describe the data briefly.
(Check the null values, Data types, shape, EDA, duplicate values). Perform
Univariate and Bivariate Analysis.

Ans) Checking if data has flown in properly:

Data Sample

Data Dimensions: (26967 rows, 11 columns)

Description of data:
NANDURI NAGA SOWRI PGP-DSBA_OCTA_G2 GREAT LEARNING

Univariate and Bivariate Analysis


Skewness of Data:

Most preferred cut is ideal according to below graphs


Count plot based on color

Plot based on color and price


Multivariate Analysis
1.2) Impute null values if present, also check for the values which are equal to zero.
Do they have any meaning or do we need to change them or drop them? Check for
the possibility of combining the sub levels of a ordinal variables and take actions
accordingly. Explain why you are combining these sub levels with appropriate
reasoning.
Ans) Based on the below, all columns except for depth has no null values.

Sine depth column is continuous, either mean or median imputation can be carried
out.
After imputation is done, we see that there are no null values present.

1.3) Encode the data (having string values) for Modelling. Split the data into train
and test (70:30). Apply Linear regression using scikit learn. Perform checks for
significant variables using appropriate method from statsmodel. Create multiple
models and check the performance of Predictions on Train and Test sets using
Rsquare, RMSE & Adj Rsquare. Compare these models and select the best one with
appropriate reasoning.

Ans) Dummies have to be encoded since linear regression models don’t take
categorical variables.
Splitting of the data: X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3 , random_state=1)

Linear Regression Model:

The coefficient for carat is 1.3672709359491833


The coefficient for depth is -0.02715729778195777
The coefficient for table is -0.015129062503321425
The coefficient for x is -0.3109893370891057
The coefficient for y is -0.0008718302715287778
The coefficient for z is -0.009459310770528272
The coefficient for cut_Good is 0.13136322591216373
The coefficient for cut_Ideal is 0.194050821929168
The coefficient for cut_Premium is 0.1695361974418707
The coefficient for cut_Very Good is 0.1637510414681528
The coefficient for color_E is -0.04582992110650405
The coefficient for color_F is -0.06423152006658835
The coefficient for color_G is -0.1093432236463441
The coefficient for color_H is -0.2373503481063302
The coefficient for color_I is -0.36122694997710136
The coefficient for color_J is -0.5838191499347705
The coefficient for clarity_IF is 1.2899471399673714
The coefficient for clarity_SI1 is 0.8895287879225799
The coefficient for clarity_SI2 is 0.6446204697130623
The coefficient for clarity_VS1 is 1.111858158570466
The coefficient for clarity_VS2 is 1.0384035090938615
The coefficient for clarity_VVS1 is 1.2151510670753518
The coefficient for clarity_VVS2 is 1.1977150915884154
R Square and RMSW values for training and testing data are as below:

VIF Values
Best params summary:
1.4) Inference: Basis on these predictions, what are the business insights and
recommendations.
Ans: Business Insights: Based on the EDA analysis, it is clear that ideal cut brings in
the maximum profit to the company and the colors H,I and J bring in profit whereas
the other colors don’t. Additionally, the fair and good cuts are not bringing any profit
to the company.

Recommendations: Company should focus on carat and clarity of the stone to increase
pricing and thereby the profit. Good customer base and marketing strategy needs to
be adopted to attract customers to buy the stones which gives more profit.
Problem 2: Logistic Regression and LDA
You are hired by a tour and travel agency which deals in selling holiday packages.
You are provided details of 872 employees of a company. Among these employees,
some opted for the package and some didn't. You have to help the company in
predicting whether an employee will opt for the package or not on the basis of the
information given in the data set. Also, find out the important factors on the basis of
which the company will focus on particular employees to sell their packages.

Data Dictionary:
Variable Name Description
Holiday_Package Opted for Holiday Package yes/no?
Salary Employee salary
age Age in years
edu Years of formal education
no_young_children The number of young children (younger than 7 years)
no_older_children Number of older children
foreign foreigner Yes/No

2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and do null
valuecondition check, write an inference on it. Perform Univariate and Bivariate
Analysis. Do exploratory data analysis.
Ans) The data was inputted and sample rows were can be viewed below:
The dimension of dataset is (872,8) with no null values.
The summary of dataset is as given below:

As per the below table, it can be understood that there are no missing values:
Univariate Analysis
dsafgvsdgg
Inferences:

 Employees over the age of 50 seems to be not taking holiday packages as


compared to younger employees
 Employees with salary <150000 seems to be taking holiday packages
 Based on the analysis, it looks like only 45% people are interested in holiday
packages
Bivariate Analysis

There is not much of a difference between data distribution among the holiday
packages.
2.2) Do not scale the data. Encode the data (having string values) for Modelling.
Data Split: Split the data into train and test (70:30). Apply Logistic Regression and
LDA (linear discriminant analysis).
Ans) We have split the data into test and train in the ratio of 70:30 and the data
hasbeen encoded as below:
Logistic Regression Model

Accuracy Scores:
Creating the LDA Model
Model scores and classification report:
2.3) Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
each model Final Model: Compare Both the models and write inference which
model is best/optimized.
Ans)

Accuracy of the data sets:

Confusion Matrix:
AUC and ROC Curve
Checking optimal value that gives better value and accuracy:
AUC and ROC Curve:

LDA works better when there is category target variable is present, else both results
are pretty much same.

2.4) Inference: Basis on these predictions, what are the insights


andrecommendations.
Ans) Insights from the analysis are as below:
 Important factors predicting people’s interest in holiday packages are age, salary
and education.
 People above the age of 50 generally don’t prefer the holiday package people
having salary less than 50k have opted for it
Recommendations:
 A survey to understand good destinations for people above 50 may help attract
them to take holiday packages
 Targeting parents with younger children should be avoided as conversion rate
seems to be less

You might also like