Machine Learning - Nabeel Khan - Final Project Report - Problem 2
MACHINE LEARNING
September ‘21
Date: 25/09/2021
Table of Contents
1. Executive Summary
2. Introduction
3. Data Details
3.1 Sample of the Dataset
4. Exploratory Data Analysis
4.1. Variable Types in the Dataset
4.2. Data Description
4.3. Data Shape
4.4. Outlier Analysis
4.5. Univariate Analysis
4.6. Bivariate Analysis
4.7. Exploratory Data Analysis for Categorical Features: cut, color and clarity
4.8. Checking for Duplicate Values
5. Checking and Treating Missing and Zero Values
5.1 Checking for Missing Values in the Dataset
5.2 Imputing Missing Values in the Dataset
Box and Distribution Plots for Depth
5.3 Checking for Zero Values in the Dataset
5.4 Do you think scaling is necessary in this case?
6. Applying Linear Regression
6.1 Encoding the Data for Modelling
6.2 Splitting the Data into Train and Test (70:30)
6.3 Obtaining the Linear Regression Model
6.4 Defining the Linear Regression Equation
6.5 Performance Metrics
Rsquare
RMSE
7. Inferences, Insights & Recommendations
7.1 Insights
7.2 Recommendations
1. Executive Summary
A company called Gem Stones Co. Ltd. is a cubic zirconia manufacturer that handles various
kinds of zirconia stones. They have shared a dataset that contains information on various
attributes based on nearly 27000 zirconia stones used by their company. The company is interested in
predicting the price of a stone based on the attributes in the dataset and wants to understand
the five attributes that are most important for price determination. In this project, we will
explore the various attributes of a zirconia stone and its contribution to the price of the stone.
2. Introduction
The intent of this exercise is to perform an analysis on the cubic zirconia dataset and
determine the relation between the price of a cubic zirconia stone and the other variables in
the dataset. We will explore this dataset using measures of central tendency and other
analyses. The data contains details on about 27000 zirconia stones, and I will try to analyse
the different characteristics of the zirconia stone which can help in determining the price of
the stone.
3. Data Details
The first column contains an index variable, which is simply the serial number of the entry. I
dropped the index column as it is useless for the model. Following are the data variables:
Carat : Carat weight of the cubic zirconia stone (continuous, from 0.2 to 45)
Clarity : Clarity of the stone, ordered from worst to best as i1 < Si2 < Si1 < VS2 < VS1 < VVS2 < VVS1 < IF
Table : Width of the stone, expressed as a percentage of its average diameter (continuous, from 49 to 79)
Price : Price of the cubic zirconia stone (continuous, from 326 to 18818)
The dataset has 10 variables with each stone having the same set of characteristics. Here, we
can say that price is the dependent or target feature and the rest of the variables are the
independent or predictor variables. Based on these characteristics, the price of a cubic zirconia stone can be predicted. The variable types reported for the dataset are:
carat float64
cut object
color object
clarity object
depth float64
table float64
x float64
y float64
z float64
price int64
dtype: object
As we can see, there are around 27000 (26967 to be exact) rows and 10 columns in the given
cubic zirconia dataset. Of these 10 columns, 6 are of type float, 3 of type object, and 1 of type integer. The median values of the numeric variables are:
carat 0.70
depth 61.80
table 57.00
x 5.69
y 5.70
z 3.52
price 2373.00
dtype: float64
As we can see from the data description above, the mean and median values of the numeric
predictor variables are fairly close together, which suggests that most of the numeric variables
are roughly symmetrically distributed.
(26967, 10)
It can be seen from the shape of the dataframe that it contains 26967 rows and 10 columns of
data.
As we can observe, there are a number of outliers in the following columns: carat, table, x, y,
z, and price; these may impact the efficacy of the regression model I will build. I have treated
the outliers in the dataset using the 25th and 75th percentiles (the IQR method). After that, I re-checked the boxplots for outliers.
As we can observe from Figure 2, the outliers in the dataset have been treated and we cannot
see any outliers in the boxplots for the numeric variables.
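A minimal sketch of this outlier treatment using pandas; the helper name and the toy data below are illustrative, not taken from the actual dataset:

```python
import pandas as pd

def cap_outliers_iqr(df, cols):
    """Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] at the whisker limits."""
    df = df.copy()
    for col in cols:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        df[col] = df[col].clip(lower, upper)  # values beyond the whiskers are capped
    return df

# Toy data; in the report this is applied to carat, table, x, y, z and price.
data = pd.DataFrame({"carat": [0.3, 0.4, 0.5, 0.6, 5.0]})
treated = cap_outliers_iqr(data, ["carat"])
```

Capping (rather than dropping) the outlier rows keeps the row count unchanged, which preserves all 26967 observations for modelling.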
I checked the skewness of the dataset variables using the skew() function:
carat 0.917096
depth -0.028618
table 0.480441
x 0.394470
y 0.390750
z 0.384198
price 1.158126
dtype: float64
We can observe that the distributions for the variables carat and price (the variable to be
predicted) are right-skewed. Moreover, the skewness coefficients are large, 0.917096
and 1.158126 for carat and price respectively, so the two variables can be said to be noticeably skewed.
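The skewness check above can be reproduced with pandas' skew(); the small series below is illustrative only:

```python
import pandas as pd

# A series with a long right tail: skew() returns a positive coefficient,
# just as it did for carat and price in the report.
s = pd.Series([1.0, 1.1, 1.2, 1.3, 5.0])
skewness = s.skew()
```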
For Bivariate Analysis, I plotted pair plots using Kernel Density Estimation for the
probability density function.
We can get some idea from the Correlation Heatmap above of the effect of each feature
on the price of the cubic zirconia stone. Moreover, the correlation coefficients tell us the
degree of effect each variable has on the target feature (price in this case):
price 1.000000
carat 0.936741
y 0.914191
x 0.912759
z 0.905737
table 0.137971
depth 0.000340
Name: price, dtype: float64
As we can see from the correlation heatmap and the correlation coefficients above (derived
using the corr() function), most of the attributes in the dataset are strongly correlated with the price of the
cubic zirconia stone. One variable that is visibly not correlated with the price is depth (coefficient of about 0.0003).
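The correlation ranking above can be reproduced with pandas' corr(); the frame below is a toy stand-in that mirrors the pattern in the report (price rises with carat while depth stays roughly flat):

```python
import pandas as pd

# Illustrative data, not the actual dataset.
df = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 0.9],
    "depth": [61.0, 62.0, 61.5, 61.8],
    "price": [400, 900, 1600, 2500],
})

# Pearson correlation of every numeric column with price, strongest first.
corr_with_price = df.corr()["price"].sort_values(ascending=False)
```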
4.7. Exploratory Data Analysis for Categorical Features: cut, color and clarity
As we can see from the categorical plots above, the cubic zirconia stones with “Premium”
cut command the highest price, followed by stones with “Very Good” cut.
Now, let’s see the categorical plots for the variable color.
Now, as we can see from the categorical plots for color above, the price of a stone varies noticeably with its color grade.
Now, let’s see the categorical plots for the variable clarity.
Now, as we can see from the categorical plots for clarity above, cubic zirconia stones having
the clarity value of “VS1” and “VS2” command the highest price.
I checked for duplicate rows using the duplicated() function and found that 40 rows are
duplicated. I dropped these duplicated rows; the shape of the dataset is now (26927, 10).
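The duplicate check and removal can be sketched as follows (toy data, not the actual dataset):

```python
import pandas as pd

# Two identical rows: duplicated() flags the second occurrence only.
df = pd.DataFrame({"carat": [0.3, 0.3, 0.5], "price": [500, 500, 900]})
n_duplicates = int(df.duplicated().sum())

# Drop the flagged rows and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)
```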
carat 0
cut 0
color 0
clarity 0
depth 697
table 0
x 0
y 0
z 0
price 0
dtype: int64
As you can see, of the ~27000 rows, there are 697 Null values in the Depth column.
I could drop those 697 rows with missing values, but that may not be an ideal solution in all
situations. Instead, I will replace these null values with either the median or the mean of the
Depth column. If the data were skewed, I would prefer the median to impute the missing
values. But when I drew the box plot and distribution plot for the Depth column, I observed that the
data is not skewed but symmetrically distributed, so either the median or the mean could be used.
Here I have used the median for imputing the missing depth values.
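A minimal sketch of this imputation step with pandas (illustrative depth values; the report's column had 697 nulls):

```python
import numpy as np
import pandas as pd

# Toy depth column with missing values.
df = pd.DataFrame({"depth": [61.0, 62.0, np.nan, 61.8, np.nan]})

# median() skips NaN by default; fillna() writes it into the gaps.
median_depth = df["depth"].median()
df["depth"] = df["depth"].fillna(median_depth)
```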
A thing unique to this problem is that a zirconia stone should have a nonzero value for each
of its dimensions; the value of any of the dimensional variables (length, width, or height)
cannot be zero. Also, since the depth of a stone is measured as the ratio of its height to its diameter,
a zero height would imply a zero depth.
When I checked for such entries, I found that there are no entries where any of length,
width, height, or depth is equal to zero. If there were any such entries, we would have to drop
them, as a cubic zirconia stone with any of these variables equal to zero would be an invalid entry.
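This zero-value check can be sketched as follows (toy data; the second row has a zero length and would be dropped):

```python
import pandas as pd

# Illustrative frame, not the actual dataset.
df = pd.DataFrame({
    "x": [5.69, 0.0],
    "y": [5.70, 5.10],
    "z": [3.52, 3.10],
    "depth": [61.8, 60.0],
})

dims = ["x", "y", "z", "depth"]
zero_rows = df[(df[dims] == 0).any(axis=1)]   # rows with any zero dimension
df_clean = df[(df[dims] != 0).all(axis=1)]    # rows with all dimensions nonzero
```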
If we are applying gradient descent, scaling can definitely help the algorithm converge
faster: with scaling, the model reaches the global minimum sooner (without scaling as part of
pre-processing, the starting point can be quite far from the minimum), and it can help
make coefficient analysis simpler. Basically, if our variables are on different scales (i.e.
without scaling), the scales affect the coefficients of the model and can make it hard for us to
compare and interpret them.
However, I don’t think that scaling is necessary for simple Linear Regression. It is
definitely recommended when the number of variables is larger, as it reduces the time needed
for running the model, but we can obtain a good-fit model even without applying any scaling.
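If scaling were applied, a standard approach is sklearn's StandardScaler, which centres each feature at zero mean and unit standard deviation; the two-feature toy data below (mimicking the very different scales of carat and price) is illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, as carat and price are in the report.
X = np.array([[0.3, 326.0], [0.7, 2373.0], [1.2, 18818.0]])

# fit_transform learns per-column mean/std and standardizes each column.
X_scaled = StandardScaler().fit_transform(X)
```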
I encoded the categorical variables cut, color and clarity in the order of worst to best as
follows:
color : D – J → 0 – 6
clarity : i1 – IF → 0 – 7
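The ordinal encoding can be sketched as below. The color and clarity orderings follow the report; the five cut labels are the standard grades (Fair to Ideal) and are an assumption, since the report's cut mapping is not listed:

```python
import pandas as pd

# Ordinal maps from worst to best. The cut labels here are assumed.
cut_map = {"Fair": 0, "Good": 1, "Very Good": 2, "Premium": 3, "Ideal": 4}
color_map = {c: i for i, c in enumerate(["D", "E", "F", "G", "H", "I", "J"])}
clarity_map = {c: i for i, c in enumerate(
    ["i1", "Si2", "Si1", "VS2", "VS1", "VVS2", "VVS1", "IF"])}

# One illustrative row.
df = pd.DataFrame({"cut": ["Premium"], "color": ["J"], "clarity": ["VS1"]})
df["cut"] = df["cut"].map(cut_map)
df["color"] = df["color"].map(color_map)
df["clarity"] = df["clarity"].map(clarity_map)
```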
I will use the randomized training and test data splitting function from Sklearn package to
split the data into train and test datasets in the ratio 70:30.
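The split-and-fit step can be sketched with sklearn as follows; the synthetic predictors below stand in for the encoded dataset, which is not reproduced here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded predictors and price (noiseless linear target).
rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + 5.0

# 70:30 randomized split, then fit ordinary least squares on the training part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
model = LinearRegression().fit(X_train, y_train)
```

Because the toy target is exactly linear, the fitted coefficients and intercept recover the generating values; on the real data they become the Table 3 coefficients.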
When I applied LinearRegression to obtain the best-fit model on the training data, I found the following coefficients:
#  Variable  Coefficient
1  carat      8895.48
2  cut         105.33
3  color      -271.15
4  clarity     433.26
5  depth       -28.08
6  table       -16.11
7  x         -1382.55
8  y          1065.18
9  z          -126.37
Table 3: Feature coefficients for the independent variables
Hence, we can conclude that for a unit increase in carat, cut, clarity, or width (y) of the
zirconia stone, the price increases by 8895.48, 105.33, 433.26, and 1065.18 respectively; and
for a unit increase in color, depth, table, length (x), or height (z) of the stone, the price
decreases by 271.15, 28.08, 16.11, 1382.55, and 126.37 respectively, holding the other variables constant.
The linear regression equation takes the form:

Y = m1*x1 + m2*x2 + m3*x3 + m4*x4 + m5*x5 + m6*x6 + m7*x7 + m8*x8 + m9*x9 + c

where Y is the price; x1 … x9 are carat, cut, color, clarity, depth, table, length (x), width (y), and height (z); and
m1 … m9 are as given in Table 3: Feature coefficients for the independent variables.
From the regression model, we find the value of the intercept, c = 675.92.
This intercept c is the predicted mean value of price when all predictor variables are zero.
Substituting all the derived m values and the intercept value, we get the final linear regression equation.
Rsquare
Using Rsquare to determine the performance of the model, I got the following values:
Rsquare (Training) = 0.9319
Rsquare (Test) = 0.9295
RMSE
Using RMSE on the training data and test data, I got values that are close:
RMSE (Training) = 900.75
RMSE (Test) = 930.51
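Both metrics can be computed with sklearn; the tiny arrays below are illustrative, not the report's predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative actual vs predicted prices.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

r2 = r2_score(y_true, y_pred)                          # fraction of variance explained
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))  # error in price units
```

Comparing these scores on the training and test sets, as the report does, is a quick check for overfitting: a good-fit model shows similar values on both.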
I have also plotted the predicted price values against the actual price values from the test data,
to see how close the predictions of our model are to the actual target feature values.
Programming Files
Predictive_Modelling_Nabeel_Khan_Final_Project_Report_Linear_Regression.pdf
Predictive_Modelling_Nabeel_Khan_Final_Project_Report_Linear_Regression.ipynb
7. Inferences, Insights & Recommendations
7.1 Insights
1. As we know, Rsquare measures the percentage of the variance of the target feature that is
explained by the model. This means that Rsquare will always be a value between 0 and 1 (0% and
100%). Here, 0 implies that the model is incapable of explaining any variation of the
target feature about its mean, while 1 implies that the model can explain all variation of
the target feature about its mean. The above values of Rsquare clearly imply that our
model can explain most of the variation of price around its mean.
2. Also, from the Predicted Y vs Actual Y plot, we can see a strong correlation between the
predicted and actual price values.
3. Since the training data and test data scores are quite close, and the correlation between
predicted and actual target feature values is strong, we can say that the model is a good
fit.
4. From the final linear regression equation, we can say that the key variables for the cubic
zirconia stone price are carat, color, clarity, width and cut. Based on the coefficient
values, the variable depth seems to have none or negligible impact on the value of stone
price.
5. The Correlation Heatmap, correlation coefficient values, and the pair plots suggest that
carat and the dimensional variables x, y, and z are strongly correlated with price.
6. The high negative coefficient value for x, the length of a zirconia stone, implies that the
greater the length of a stone, the less it contributes to the price. The same holds
true for z, the height of the stone, which also has a negative coefficient.
7. We can relate these observations with the real world also. For example, a zirconia stone
with a high value of cut will appear very bright as more light will enter into the stone,
hence it will command a higher price. Same thing is true for the clarity of the zirconia
stone. On the other hand, a stone with a high x (length) value may not appear as bright, as
it will not reflect a significant portion of the light entering it, appearing darker,
which a prospective buyer may not find attractive; this will cause the stone to command a
lower price.
7.2 Recommendations
1. My recommendation is that Gem Stones Co Ltd should keep in mind the zirconia
stone features carat, color, clarity, width and cut as the key driving factors for predicting
the price of a stone.
2. When we consider the variable cut, we can observe that the stones with cut value of
'Premium’ command the highest price, which is followed by the stones with 'Very Good'
cut. So the Gem Stone Co Ltd should try to focus on stones that fall in these two cut
values.
3. Moreover, the cubic zirconia stones with clarity value of 'VS1' & 'VS2' have the highest
price. So the company should treat stones with these two clarity values as the most
profitable.
4. If we talk about the dimensions of the stone, we can see that width has a higher positive
impact on the price of the stone, so Gem Stone Co Ltd should utilize width to improve
their profit share by using it to differentiate between stones that may bring in high
profits and those that may not.
5. Moreover, the high negative coefficients for x, the length of the stone, and z, the height of
the stone, imply that the higher the values of length and height of a stone, the lower will
be the price it commands.