Machine Learning - Nabeel Khan - Final Project Report - Problem 2
MACHINE LEARNING
September ‘21
Date: 25/09/2021
Table of Contents
1. Executive Summary
2. Introduction
3. Data Details
3.1 Sample of the Dataset
4. Exploratory Data Analysis
4.1. Variable Types in the Dataset
4.2. Data Description
4.3. Data Shape
4.4. Outlier Analysis
4.5. Univariate Analysis
4.6. Bivariate Analysis
4.7. Exploratory Data Analysis for Categorical Features: cut, color and clarity
4.8. Checking for Duplicate Values
5. Checking and Treating Missing and Zero Values
5.1 Checking for Missing Values in the Dataset
5.2 Imputing Missing Values in the Dataset
Box and Distribution Plots for Depth
5.3 Checking for Zero Values in the Dataset
5.4 Do you think scaling is necessary in this case?
6. Applying Linear Regression
6.1 Encoding the Data for Modelling
6.2 Splitting the Data into Train and Test (70:30)
6.3 Obtaining the Linear Regression Model
6.4 Defining the Linear Regression Equation
6.5 Performance Metrics
Rsquare
RMSE
7. Inferences, Insights & Recommendations
7.1 Insights
7.2 Recommendations
1. Executive Summary
A company called Gem Stones Co. Ltd. is a cubic zirconia manufacturer that handles various
kinds of zirconia stones. They have shared a dataset that contains information on various
attributes based on nearly 27000 zirconia stones used by their company. The company is interested in
predicting the price of a stone based on the attributes in the dataset and wants to understand
the five attributes that are most important for price determination. In this project, we will
explore the various attributes of a zirconia stone and its contribution to the price of the stone.
2. Introduction
The intent of this exercise is to perform an analysis on the cubic zirconia dataset and
determine the relation between the price of a cubic zirconia stone and the other variables in
the dataset. We will explore this dataset using measures of central tendency and other
analyses. The data contains details on about 27000 zirconia stones, and I will try to analyse
the different characteristics of the zirconia stone which can help in determining the price of
the stone.
3. Data Details
The first column contains an index variable, which is simply the serial number of the entry. I
dropped the index column as it is useless for the model. Following are the data variables:
Carat : Carat weight of the cubic zirconia stone (continuous, from 0.2 to 45)
Clarity : Clarity of the stone, ordered from worst to best as i1 < Si2 < Si1 < VS2 < VS1 < VVS2 < VVS1 < IF
Table : Width of the stone, expressed as a percentage of its average diameter (continuous, from 49 to 79)
Price : Price of the cubic zirconia stone (continuous, from 326 to 18818)
The dataset has 10 variables with each stone having the same set of characteristics. Here, we
can say that price is the dependent or target feature and the rest of the variables are the
independent or predictor variables. Based on these characteristics, the price of a cubic zirconia stone can be predicted. The variable types reported for the dataset are:
carat float64
cut object
color object
clarity object
depth float64
table float64
x float64
y float64
z float64
price int64
dtype: object
As we can see, there are around 27000 (26967 to be exact) rows and 10 columns in the given
cubic zirconia dataset. Of these 10 columns, 6 are of type float, 3 of type object, and 1 of type integer. The median values of the numeric variables are:
carat 0.70
depth 61.80
table 57.00
x 5.69
y 5.70
z 3.52
price 2373.00
dtype: float64
As we can see from the data description above, the mean and median values of the numeric
predictor variables are fairly close together, which suggests that most of the numeric variables
are roughly symmetrically distributed.
(26967, 10)
It can be seen from the shape of the dataframe that it contains 26967 rows and 10 columns of
data.
As we can observe, there are a number of outliers in the following columns: carat, table, x, y,
z, and price; these may impact the efficacy of the regression model I will build. I have treated
the outliers in the dataset using the 25th and 75th percentiles (the IQR method). After that, I re-checked the boxplots for outliers.
As we can observe from Figure 2, the outliers in the dataset have been treated and we cannot
see any outliers in the boxplots for the numeric variables.
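A minimal sketch of this outlier treatment using pandas; the helper name and the toy data below are illustrative, not taken from the actual dataset:

```python
import pandas as pd

def cap_outliers_iqr(df, cols):
    """Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] at the whisker limits."""
    df = df.copy()
    for col in cols:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        df[col] = df[col].clip(lower, upper)  # values beyond the whiskers are capped
    return df

# Toy data; in the report this is applied to carat, table, x, y, z and price.
data = pd.DataFrame({"carat": [0.3, 0.4, 0.5, 0.6, 5.0]})
treated = cap_outliers_iqr(data, ["carat"])
```

Capping (rather than dropping) the outlier rows keeps the row count unchanged, which preserves all 26967 observations for modelling.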
I checked the skewness of the dataset variables using the skew() function:
carat 0.917096
depth -0.028618
table 0.480441
x 0.394470
y 0.390750
z 0.384198
price 1.158126
dtype: float64
We can observe that the distributions for the variables carat and price (the variable to be
predicted) are right-skewed. Moreover, the skewness coefficients are large, 0.917096
and 1.158126 for carat and price respectively, so the two variables can be said to be noticeably skewed.
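The skewness check above can be reproduced with pandas' skew(); the small series below is illustrative only:

```python
import pandas as pd

# A series with a long right tail: skew() returns a positive coefficient,
# just as it did for carat and price in the report.
s = pd.Series([1.0, 1.1, 1.2, 1.3, 5.0])
skewness = s.skew()
```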
For Bivariate Analysis, I plotted pair plots using Kernel Density Estimation for the
probability density function.
We can get some idea from the Correlation Heatmap above of the effect of each feature
on the price of the cubic zirconia stone. Moreover, the correlation coefficients tell us the
degree of effect each variable has on the target feature (price in this case):
price 1.000000
carat 0.936741
y 0.914191
x 0.912759
z 0.905737
table 0.137971
depth 0.000340
Name: price, dtype: float64
As we can see from the correlation heatmap and the correlation coefficients above (derived
using the corr() function), most of the attributes in the dataset are strongly correlated with the price of the
cubic zirconia stone. One variable that is visibly not correlated with the price is depth (coefficient of about 0.0003).
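The correlation ranking above can be reproduced with pandas' corr(); the frame below is a toy stand-in that mirrors the pattern in the report (price rises with carat while depth stays roughly flat):

```python
import pandas as pd

# Illustrative data, not the actual dataset.
df = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 0.9],
    "depth": [61.0, 62.0, 61.5, 61.8],
    "price": [400, 900, 1600, 2500],
})

# Pearson correlation of every numeric column with price, strongest first.
corr_with_price = df.corr()["price"].sort_values(ascending=False)
```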
4.7. Exploratory Data Analysis for Categorical Features: cut, color and clarity
As we can see from the categorical plots above, the cubic zirconia stones with “Premium”
cut command the highest price, followed by stones with “Very Good” cut.
Now, let’s see the categorical plots for the variable color.
Now, as we can see from the categorical plots for color above, the price of a stone varies noticeably with its color grade.
Now, let’s see the categorical plots for the variable clarity.
Now, as we can see from the categorical plots for clarity above, cubic zirconia stones having
the clarity value of “VS1” and “VS2” command the highest price.
I checked for duplicate rows using the duplicated() function and found that 40 rows are
duplicated. I dropped these duplicated rows; the shape of the dataset is now (26927, 10).
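The duplicate check and removal can be sketched as follows (toy data, not the actual dataset):

```python
import pandas as pd

# Two identical rows: duplicated() flags the second occurrence only.
df = pd.DataFrame({"carat": [0.3, 0.3, 0.5], "price": [500, 500, 900]})
n_duplicates = int(df.duplicated().sum())

# Drop the flagged rows and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)
```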
carat 0
cut 0
color 0
clarity 0
depth 697
table 0
x 0
y 0
z 0
price 0
dtype: int64
As you can see, of the ~27000 rows, there are 697 Null values in the Depth column.
I could drop those 697 rows with missing values, but that may not be an ideal solution in all
situations. Instead, I will replace these null values with either the median or the mean of the
Depth column. If the data were skewed, I would prefer the median to impute the missing
values. But when I drew the box plot and distribution plot for the Depth column, I observed that the
data is not skewed but symmetrically distributed, so either the median or the mean could be used.
Here I have used the median for imputing the missing depth values.
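A minimal sketch of this imputation step with pandas (illustrative depth values; the report's column had 697 nulls):

```python
import numpy as np
import pandas as pd

# Toy depth column with missing values.
df = pd.DataFrame({"depth": [61.0, 62.0, np.nan, 61.8, np.nan]})

# median() skips NaN by default; fillna() writes it into the gaps.
median_depth = df["depth"].median()
df["depth"] = df["depth"].fillna(median_depth)
```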
A thing unique to this problem is that a zirconia stone should have a nonzero value for each
of its dimensions; the value of any of the dimensional variables (length, width, or height)
cannot be zero. Also, since the depth of a stone is measured as the ratio of its height to its diameter,
a zero height would imply a zero depth.
When I checked for such entries, I found that there are no entries where any of length,
width, height, or depth is equal to zero. If there were any such entries, we would have to drop
them, as a cubic zirconia stone with any of these variables equal to zero would be an invalid entry.
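This zero-value check can be sketched as follows (toy data; the second row has a zero length and would be dropped):

```python
import pandas as pd

# Illustrative frame, not the actual dataset.
df = pd.DataFrame({
    "x": [5.69, 0.0],
    "y": [5.70, 5.10],
    "z": [3.52, 3.10],
    "depth": [61.8, 60.0],
})

dims = ["x", "y", "z", "depth"]
zero_rows = df[(df[dims] == 0).any(axis=1)]   # rows with any zero dimension
df_clean = df[(df[dims] != 0).all(axis=1)]    # rows with all dimensions nonzero
```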
If we are applying gradient descent, scaling can definitely help the algorithm converge
faster: with scaling, the model reaches the global minimum sooner (without scaling as part of
pre-processing, the starting point can be quite far from the minimum), and it can help
make coefficient analysis simpler. Basically, if our variables are on different scales (i.e.
without scaling), the scales affect the coefficients of the model and can make it hard for us to
compare and interpret them.
However, I don’t think that scaling is necessary for simple Linear Regression. It is
definitely recommended when the number of variables is larger, as it reduces the time needed
for running the model, but we can obtain a good-fit model even without applying any scaling.
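If scaling were applied, a standard approach is sklearn's StandardScaler, which centres each feature at zero mean and unit standard deviation; the two-feature toy data below (mimicking the very different scales of carat and price) is illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, as carat and price are in the report.
X = np.array([[0.3, 326.0], [0.7, 2373.0], [1.2, 18818.0]])

# fit_transform learns per-column mean/std and standardizes each column.
X_scaled = StandardScaler().fit_transform(X)
```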
I encoded the categorical variables cut, color and clarity in the order of worst to best as
follows:
color : D – J → 0 – 6
clarity : i1 – IF → 0 – 7
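The ordinal encoding can be sketched as below. The color and clarity orderings follow the report; the five cut labels are the standard grades (Fair to Ideal) and are an assumption, since the report's cut mapping is not listed:

```python
import pandas as pd

# Ordinal maps from worst to best. The cut labels here are assumed.
cut_map = {"Fair": 0, "Good": 1, "Very Good": 2, "Premium": 3, "Ideal": 4}
color_map = {c: i for i, c in enumerate(["D", "E", "F", "G", "H", "I", "J"])}
clarity_map = {c: i for i, c in enumerate(
    ["i1", "Si2", "Si1", "VS2", "VS1", "VVS2", "VVS1", "IF"])}

# One illustrative row.
df = pd.DataFrame({"cut": ["Premium"], "color": ["J"], "clarity": ["VS1"]})
df["cut"] = df["cut"].map(cut_map)
df["color"] = df["color"].map(color_map)
df["clarity"] = df["clarity"].map(clarity_map)
```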
I will use the randomized training and test data splitting function from Sklearn package to
split the data into train and test datasets in the ratio 70:30.
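The split-and-fit step can be sketched with sklearn as follows; the synthetic predictors below stand in for the encoded dataset, which is not reproduced here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded predictors and price (noiseless linear target).
rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + 5.0

# 70:30 randomized split, then fit ordinary least squares on the training part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
model = LinearRegression().fit(X_train, y_train)
```

Because the toy target is exactly linear, the fitted coefficients and intercept recover the generating values; on the real data they become the Table 3 coefficients.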
When I applied LinearRegression to obtain the best-fit model on the training data, I found the following coefficients:
#  Variable  Coefficient
1  carat      8895.48
2  cut         105.33
3  color      -271.15
4  clarity     433.26
5  depth       -28.08
6  table       -16.11
7  x         -1382.55
8  y          1065.18
9  z          -126.37
Table 3: Feature coefficients for the independent variables
Hence, we can conclude that for a unit increase in carat, cut, clarity, or width (y) of the
zirconia stone, the price increases by 8895.48, 105.33, 433.26, and 1065.18 respectively; and
for a unit increase in color, depth, table, length (x), or height (z) of the stone, the price
decreases by 271.15, 28.08, 16.11, 1382.55, and 126.37 respectively, holding the other variables constant.
The linear regression equation takes the form:

Y = m1*x1 + m2*x2 + m3*x3 + m4*x4 + m5*x5 + m6*x6 + m7*x7 + m8*x8 + m9*x9 + c

where Y is the price; x1 … x9 are carat, cut, color, clarity, depth, table, length (x), width (y), and height (z); and
m1 … m9 are as given in Table 3: Feature coefficients for the independent variables.
From the regression model, we find the value of the intercept, c = 675.92.
This intercept c is the predicted mean value of price when all predictor variables are zero.
Substituting all the derived m values and the intercept value, we get the final linear regression equation.
Rsquare
Using Rsquare to determine the performance of the model, I got the following values:
Rsquare (Training) = 0.9319
Rsquare (Test) = 0.9295
RMSE
Using RMSE on the training data and test data, I got values that are close:
RMSE (Training) = 900.75
RMSE (Test) = 930.51
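Both metrics can be computed with sklearn; the tiny arrays below are illustrative, not the report's predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative actual vs predicted prices.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

r2 = r2_score(y_true, y_pred)                          # fraction of variance explained
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))  # error in price units
```

Comparing these scores on the training and test sets, as the report does, is a quick check for overfitting: a good-fit model shows similar values on both.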
I have also plotted the predicted price values against the actual price values from the test data,
to see how close the predictions of our model are to the actual target feature values.
Programming Files
Predictive_Modelling_Nabeel_Khan_Final_Project_Report_Linear_Regression.pdf
Predictive_Modelling_Nabeel_Khan_Final_Project_Report_Linear_Regression.ipynb
7. Inferences, Insights & Recommendations
7.1 Insights
1. As we know, Rsquare measures the percentage of the variance of the target feature that is
explained by the model. This means that Rsquare will always be a value between 0 and 1 (0% and
100%). Here, 0 implies that the model is incapable of explaining any variation of the
target feature about its mean, while 1 implies that the model can explain all variation of
the target feature about its mean. The above values of Rsquare clearly imply that our
model can explain most of the variation of price around its mean.
2. Also, from the Predicted Y vs Actual Y plot, we can see a strong correlation between the
predicted and actual price values.
3. Since the training data and test data scores are quite close, and the correlation between
predicted and actual target feature values is strong, we can say that the model is a good
fit.
4. From the final linear regression equation, we can say that the key variables for the cubic
zirconia stone price are carat, color, clarity, width and cut. Based on the coefficient
values, the variable depth seems to have none or negligible impact on the value of stone
price.
5. The Correlation Heatmap, correlation coefficient values, and the pair plots suggest that
carat and the dimensional variables x, y, and z are strongly correlated with price.
6. The high negative coefficient value for x, the length of a zirconia stone, implies that the
greater the length of a stone, the less it contributes to the price. The same holds
true for z, the height of the stone, which also has a negative coefficient.
7. We can relate these observations with the real world also. For example, a zirconia stone
with a high value of cut will appear very bright as more light will enter into the stone,
hence it will command a higher price. Same thing is true for the clarity of the zirconia
stone. On the other hand, a stone with a high x (length) value may not appear as bright, as
it will not reflect a significant portion of the light entering it, appearing darker,
which a prospective buyer may not find attractive; this will cause the stone to command a
lower price.
7.2 Recommendations
1. My recommendation is that Gem Stones Co Ltd should keep in mind the zirconia
stone features carat, color, clarity, width and cut as the key driving factors for predicting
the price of a stone.
2. When we consider the variable cut, we can observe that the stones with cut value of
'Premium’ command the highest price, which is followed by the stones with 'Very Good'
cut. So the Gem Stone Co Ltd should try to focus on stones that fall in these two cut
values.
3. Moreover, the cubic zirconia stones with clarity value of 'VS1' & 'VS2' have the highest
price. So the company should treat stones with these two clarity values as the most
profitable.
4. If we talk about the dimensions of the stone, we can see that width has a higher positive
impact on the price of the stone, so Gem Stone Co Ltd should utilize width to improve
their profit share by using it to differentiate between stones that may bring in high
profits and those that may not.
5. Moreover, the high negative coefficients for x, the length of the stone, and z, the height of
the stone, imply that the higher the values of length and height of a stone, the lower will
be the price it commands.