Machine Learning - Nabeel Khan - Final Project Report - Problem 2



Final Project Report

Linear Regression – Cubic Zirconia Price Analysis

MACHINE LEARNING

Nabeel Ahmed Khan

September ‘21

Date: 25/09/2021

Table of Contents

Table of Contents
Table of Figures
Table of Tables
1. Executive Summary
2. Introduction
3. Data Details
3.1 Sample of the Dataset
4. Exploratory Data Analysis
4.1 Variable Types in the Dataset
4.2 Data Description
4.3 Data Shape
4.4 Outlier Analysis
4.5 Univariate Analysis
4.6 Bivariate Analysis
4.7 Exploratory Data Analysis for Categorical Features: cut, color and clarity
4.8 Checking for Duplicate Values
5. Checking and Treating Missing and Zero Values
5.1 Checking for Missing Values in the Dataset
5.2 Imputing Missing Values in the Dataset
Box and Distribution Plots for Depth
5.3 Checking for Zero Values in the Dataset
5.4 Do you think scaling is necessary in this case?
6. Applying Linear Regression
6.1 Encoding the Data for Modelling
6.2 Splitting the Data into Train and Test (70:30)
6.3 Obtaining the Linear Regression Model
6.4 Defining the Linear Regression Equation
6.5 Performance Metrics
Rsquare
RMSE
7. Programming Files
8. Inferences: Insights & Recommendations
8.1 Insights
8.2 Recommendations

Table of Figures

Figure 1: Outlier Analysis before Treating Outliers
Figure 2: Outlier Analysis post Outlier Treatment
Figure 3: Histograms for the Cubic Zirconia Dataset Attributes
Figure 4: Bivariate Analysis
Figure 5: Correlation Heatmap
Figure 6: Categorical Plot for the variable "cut"
Figure 7: Categorical Plot for the variable "color"
Figure 8: Categorical Plot for the variable "clarity"
Figure 9: Box Plot for Depth Column
Figure 10: Distribution Plot for Depth Column
Figure 11: Predicted Y values vs Actual Y values

Table of Tables

Table 1: Dataset Sample
Table 2: Data Description
Table 3: Feature coefficients for the independent variables

1. Executive Summary

Gem Stones Co. Ltd. is a cubic zirconia manufacturer that handles various kinds of zirconia stones. The company has shared a dataset containing information on various attributes of nearly 27,000 zirconia stones, and the price of a stone is determined on the basis of these attributes/characteristics. The company needs a way of predicting the price of a stone from the attributes in the dataset, and wants to understand which five attributes are most important for price determination. In this project, we explore the various attributes of a zirconia stone and their contribution to the price of the stone.

2. Introduction

The intent of this exercise is to perform an analysis of the cubic zirconia dataset and determine the relation between the price of a cubic zirconia stone and the other variables in the dataset. We will explore the dataset using measures of central tendency and other analyses. The data contains details on about 27,000 zirconia stones, and I will analyse the different characteristics of a zirconia stone that can help in determining the price of the stone.

3. Data Details

The first column contains an index variable, which is simply the serial number of the entry. I

dropped the index column as it is useless for the model. Following are the data variables:

Carat : Carat weight of the cubic zirconia stone. Continuous from 0.2 to 4.5

Cut : Cut quality of the stone. Fair < Good < Very Good < Premium < Ideal

Color : Colour of the stone. D < E < F < G < H < I < J

Clarity : The absence of inclusions and blemishes. I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF

Depth : Height of the stone as a percentage of its average diameter. Continuous from 50.8 to 73.6

Table : Width of the stone's top facet, expressed as a percentage of its average diameter. Continuous from 49 to 79

Price : Price of the cubic zirconia stone. Continuous from 326 to 18818

X : Length of the stone in mm. Continuous from 3.73 to 10.23

Y : Width of the stone in mm. Continuous from 3.71 to 10.76

Z : Height of the stone in mm. Continuous from 1.07 to 8.06
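The loading and index-drop step described above might look like the following minimal sketch in Python (the file name cubic_zirconia.csv is an assumption):

import pandas as pd

# File name is an assumption, for illustration only
df = pd.read_csv("cubic_zirconia.csv")

# Drop the first (serial-number) column -- it carries no information for the model
df = df.drop(columns=df.columns[0])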

3.1 Sample of the Dataset

Table 1: Dataset Sample

The dataset has 10 variables, with each stone described by the same set of characteristics. Here, price is the dependent or target feature, and the rest of the variables are the independent or predictor variables from which the price of a cubic zirconia stone is determined.



4. Exploratory Data Analysis

4.1. Variable Types in the Dataset

I checked the types of variables in the dataset.

carat float64
cut object
color object
clarity object
depth float64
table float64
x float64
y float64
z float64
price int64
dtype: object

As we can see, there are around 27,000 rows (26,967 to be exact) and 10 columns in the given cubic zirconia dataset. Of these 10 columns, six are of type float, three of type object, and one of type integer.

4.2. Data Description

The median values of the numeric variables are:

carat 0.70
depth 61.80
table 57.00
x 5.69
y 5.70
z 3.52
price 2373.00
dtype: float64

Table 2: Data Description

As we can see from the data description above, the mean and median values of the numeric predictor variables are quite close, which suggests that their distributions are roughly symmetric.
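A minimal sketch of how the mean/median comparison can be produced with pandas:

# Compare mean and median of the numeric columns to gauge symmetry
desc = df.describe().T[["mean", "50%"]]
desc.columns = ["mean", "median"]
print(desc)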

4.3. Data Shape

(26967, 10)

It can be seen from the shape of the dataframe that it contains 26967 rows and 10 columns of

data.

4.4. Outlier Analysis



Figure 1: Outlier Analysis before Treating Outliers

As we can observe, there are a number of outliers in the columns carat, table, x, y, z, and price; these may impact the efficacy of the regression model I will build. I treated the outliers in the dataset using bounds derived from the 25th and 75th percentiles, and then re-checked for outliers (see Figure 2 below). A sketch of the capping step follows.
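A common way to treat outliers using the 25th and 75th percentiles is IQR-based capping, sketched below (the 1.5 × IQR whisker rule is an assumption):

def cap_outliers(series):
    """Cap values beyond the 1.5 * IQR whiskers at the whisker values."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

for col in ["carat", "table", "x", "y", "z", "price"]:
    df[col] = cap_outliers(df[col])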



Figure 2: Outlier Analysis post Outlier Treatment

As we can observe from Figure 2, the outliers in the dataset have been treated and we cannot
see any outliers in the boxplots for the numeric variables.

4.5. Univariate Analysis

I plotted the histograms for the numeric dataset variables below.



Figure 3: Histograms for the Cubic Zirconia Dataset Attributes

I checked the skewness of the dataset variables using the skew() function.

carat 0.917096
depth -0.028618
table 0.480441
x 0.394470
y 0.390750
z 0.384198
price 1.158126
dtype: float64

We can observe that the distributions for carat and price (the variable to be predicted) are right-skewed. Moreover, with large skewness coefficients of 0.917096 for carat and 1.158126 for price, these two variables can be said to be heavily skewed to the right.
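A minimal sketch of how the histograms and skewness coefficients above can be produced:

import matplotlib.pyplot as plt

# Histograms and skewness coefficients for the numeric variables
num_cols = df.select_dtypes(include="number").columns
df[num_cols].hist(figsize=(12, 8))
plt.show()
print(df[num_cols].skew())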



4.6. Bivariate Analysis

For bivariate analysis, I plotted pair plots, using kernel density estimation for the probability density functions on the diagonal.
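A minimal sketch of the pair plot, assuming seaborn:

import seaborn as sns

# Pair plot with kernel density estimates on the diagonal
sns.pairplot(df, diag_kind="kde")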

Figure 4: Bivariate Analysis

I created the correlation heatmap for the dataset.



Figure 5: Correlation Heatmap

The correlation heatmap above gives us some idea of the effect of each feature on the price of the cubic zirconia stone. Moreover, the correlation coefficients tell us the degree of association each variable has with the target feature (price in this case):

price 1.000000
carat 0.936741
y 0.914191
x 0.912759
z 0.905737
table 0.137971
depth 0.000340
Name: price, dtype: float64

As we can see from the correlation heatmap and the correlation coefficients above (derived using the corr() function), most of the attributes in the dataset are strongly correlated with the price of the cubic zirconia stone. One variable that is visibly not correlated with price is depth, whose correlation coefficient is close to zero.
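A minimal sketch of how the heatmap and the price correlations above can be produced:

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr(numeric_only=True)  # numeric_only requires pandas >= 1.5
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()

# Correlation of every numeric feature with the target
print(corr["price"].sort_values(ascending=False))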



4.7. Exploratory Data Analysis for Categorical Features: cut, color and clarity

Let’s look at the categorical plot for the variable cut.

Figure 6: Categorical Plot for the variable "cut"

As we can see from the categorical plot above, the cubic zirconia stones with the “Premium” cut command the highest price, followed by “Very Good”.

Now, let’s see the categorical plots for the variable color.

Figure 7: Categorical Plot for the variable "color"

Now, as we can see from the categorical plots for color above, stones with the color values “I” and “J” have the highest prices.

Now, let’s see the categorical plots for the variable clarity.

Figure 8: Categorical Plot for the variable "clarity"

Now, as we can see from the categorical plots for clarity above, cubic zirconia stones having

the clarity value of “VS1” and “VS2” command the highest price.

4.8. Checking for Duplicate values

I checked for duplicate rows using the duplicated() function and found 40 duplicated rows, which I dropped. The shape of the dataset is now (26927, 10). A sketch of this step follows.
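# Count exact duplicate rows, then drop them
print(df.duplicated().sum())  # 40 duplicates in this dataset
df = df.drop_duplicates()
print(df.shape)               # (26927, 10)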

5. Checking and Treating Missing and Zero Values

5.1 Checking for Missing Values in the Dataset

carat 0
cut 0
color 0
clarity 0
depth 697
table 0
x 0
y 0
z 0
price 0
dtype: int64
As you can see, of the ~27,000 rows, there are 697 null values in the depth column.

5.2 Imputing Missing Values in the Dataset

I could drop the 697 rows with missing values, but that may not be an ideal solution in all situations. Instead, I will replace these null values with either the median or the mean of the depth column. If the data were skewed, I would prefer the median for imputing the missing values. However, the box plot and distribution plot for the depth column (below) show that the data is not skewed but symmetrically distributed, so either the median or the mean can be used.

Here I have used the median to impute the missing depth values, as sketched below.
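# Impute the missing depth values with the column median
df["depth"] = df["depth"].fillna(df["depth"].median())

# Verify that no missing values remain
print(df.isnull().sum())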

Box and Distribution Plots for Depth

Figure 9: Box Plot for Depth Column



Figure 10: Distribution Plot for Depth Column


After imputing, I re-checked for missing values and found none.

5.3 Checking for Zero Values in the Dataset

A consideration unique to this problem is that a zirconia stone should have a nonzero value for each of its dimensions; none of the dimensional variables (length, width, or height) can be zero. Also, since the depth of a stone is measured as the ratio of its height to its diameter, it cannot be zero either.

When I checked for such entries, I found none where length, width, height, or depth is equal to zero. If there had been any, we would have had to drop them, since a cubic zirconia stone with any of these variables equal to zero would be an invalid entry (a stone without a dimension).
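A minimal sketch of the zero-value check:

# An entry is invalid if any physical dimension (or depth) is zero
dims = ["x", "y", "z", "depth"]
zero_rows = (df[dims] == 0).any(axis=1)
print(zero_rows.sum())  # 0 such rows in this dataset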

5.4 Do you think scaling is necessary in this case?

If we are applying gradient descent, scaling can definitely help the algorithm converge faster: with scaling, the model reaches the global minimum sooner, whereas without any scaling in pre-processing the starting point can be quite far from the minimum. Scaling can also make the analysis of coefficients simpler. Basically, if our variables are on different scales (i.e. without scaling), the coefficients of the model are affected, and it can be hard for us to interpret them.



However, I don’t think that scaling is necessary for simple linear regression. It is definitely recommended when the number of variables is larger, as it reduces the time needed to run the model, but we can obtain a well-fitted model even without applying any scaling.
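For illustration only, standardisation of the numeric predictors could be done as follows (scaling was not applied in this analysis):

from sklearn.preprocessing import StandardScaler
import pandas as pd

# Illustration only -- scaling was not applied in this analysis
num_cols = ["carat", "depth", "table", "x", "y", "z"]
df_scaled = pd.DataFrame(StandardScaler().fit_transform(df[num_cols]),
                         columns=num_cols)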

6. Applying Linear Regression

6.1 Encoding the data for Modelling

I encoded the categorical variables cut, color and clarity in the order of worst to best as

follows:

cut : Fair – Ideal → 0 – 4

color : D – J → 0 – 6

clarity : I1 – IF → 0 – 7
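A minimal sketch of this ordinal encoding; the exact category label strings are assumptions:

# Ordinal encoding from worst to best, per the orders given above
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
color_order = ["D", "E", "F", "G", "H", "I", "J"]
clarity_order = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

for col, order in [("cut", cut_order), ("color", color_order),
                   ("clarity", clarity_order)]:
    df[col] = df[col].map({label: rank for rank, label in enumerate(order)})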

6.2 Splitting the data into train and test (70:30)

I used the randomized train/test splitting function (train_test_split) from the sklearn package to split the data into training and test datasets in the ratio 70:30, as sketched below.
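from sklearn.model_selection import train_test_split

X = df.drop(columns="price")
y = df["price"]

# The random_state value is an assumption, used only for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=1)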

6.3 Obtaining the Linear Regression Model

When I applied LinearRegression to obtain the best-fit model on the training data, I found the feature coefficients for the independent variables to be as follows:

#  Variable  Coefficient
1  carat        8895.48
2  cut           105.33
3  color        -271.15
4  clarity       433.26
5  depth         -28.08
6  table         -16.11
7  x           -1382.55
8  y            1065.18
9  z            -126.37

Table 3: Feature coefficients for the independent variables
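A minimal sketch of how the model and its coefficients can be obtained with sklearn:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

# Pair every predictor with its fitted coefficient
for name, coef in zip(X_train.columns, model.coef_):
    print(f"{name}: {coef:.2f}")
print(f"Intercept: {model.intercept_:.2f}")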

Hence, we can conclude that for a unit increase in carat, cut, clarity, or width (y) of the zirconia stone, the price increases by 8895.48, 105.33, 433.26, and 1065.18 respectively; and for a unit increase in color, depth, table, length (x), or height (z), the price decreases by 271.15, 28.08, 16.11, 1382.55, and 126.37 respectively.

Now, per the linear regression model,

Y = m1*x1 + m2*x2 + m3*x3 + m4*x4 + m5*x5 + m6*x6 + m7*x7 + m8*x8 + m9*x9 + c

where Y is the price; x1 through x9 are the carat, cut, color, clarity, depth, table, length, width, and height of the stone; and m1 through m9 are as given in Table 3: Feature coefficients for the independent variables.

Here, c is the intercept, which we need to determine. From the regression model, we find the value of the intercept to be c = 675.92. This intercept is the predicted mean value of price when all predictor variables (carat, cut, color, clarity, etc.) are equal to zero.

6.4 Defining the Linear Regression Equation

According to the linear regression model,

price = m1*carat + m2*cut + m3*color + m4*clarity + m5*depth + m6*table + m7*length + m8*width + m9*height + intercept

Substituting all the derived m values and the intercept, we get the final linear regression equation:

price = (8895.48)*carat + (105.33)*cut + (-271.15)*color + (433.26)*clarity + (-28.08)*depth + (-16.11)*table + (-1382.55)*length + (1065.18)*width + (-126.37)*height + 675.92

6.5 Performance Metrics

Rsquare

Using Rsquare to evaluate the model's predictions on the training and test data, I got the following values:

Rsquare (training) = 0.9319
Rsquare (test) = 0.9295

RMSE

Using RMSE on the training and test data, I got values that are close:

RMSE (training) = 900.75
RMSE (test) = 930.51

I have also plotted the predicted price values against the actual price values from the test data, to see how close the predictions of our model are to the actual target feature values; a sketch of these computations follows.
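import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

print("Rsquare (training):", r2_score(y_train, y_pred_train))
print("Rsquare (test):    ", r2_score(y_test, y_pred_test))
print("RMSE (training):", np.sqrt(mean_squared_error(y_train, y_pred_train)))
print("RMSE (test):    ", np.sqrt(mean_squared_error(y_test, y_pred_test)))

# Predicted vs actual price on the test data
plt.scatter(y_test, y_pred_test, alpha=0.3)
plt.xlabel("Actual price")
plt.ylabel("Predicted price")
plt.show()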

Figure 11: Predicted Y values vs Actual Y values



7. Programming Files

Predictive_Modelling_Nabeel_Khan_Final_Project_Report_Linear_Regression.pdf
Predictive_Modelling_Nabeel_Khan_Final_Project_Report_Linear_Regression.ipynb

8. Inferences: Insights & Recommendations

8.1 Insights

1. As we know, Rsquare measures the percentage of the variance in a target feature that is explained by the independent variable(s) in a linear regression model. This means that Rsquare is always a value between 0 and 1 (0% and 100%): 0 implies that the model is incapable of explaining any variation of the target feature about its mean, while 1 implies that the model can explain all of that variation. The above values of Rsquare clearly imply that our model can explain most of the variation of price around its mean.

2. Also, from the Predicted Y vs Actual Y plot, we can see a strong correlation between the predicted and actual price values.

3. Since the training data and test data scores are quite close, and the correlation between

predicted and actual target feature values is strong, we can say that the model is a good

fit.

4. From the final linear regression equation, we can say that the key variables for the cubic zirconia stone price are carat, color, clarity, width, and cut. Based on the coefficient values, the variable depth seems to have little or no impact on the stone's price.

5. The correlation heatmap, correlation coefficient values, and the pair plots suggest that there is a high degree of multicollinearity present in the cubic zirconia dataset.



6. The high negative coefficient for x, the length of a zirconia stone, implies that the greater the length of a stone, the lower its predicted price, and hence the less profitable it will be. The same holds for z, the height of the stone, which also has a high negative coefficient.

7. We can relate these observations to the real world as well. For example, a zirconia stone with a high cut value will appear very bright, as more light enters the stone, and hence it will command a higher price. The same is true for the clarity of the stone. On the other hand, a stone with a high z (height) value will not appear as bright, since it will not reflect a significant portion of the light entering it and will appear darker, which a prospective buyer may not find attractive; this causes the stone to command a lower price.

8.2 Recommendations

1. My recommendation is that Gem Stones Co. Ltd. should keep in mind the zirconia stone features carat, color, clarity, width, and cut as the key driving factors for predicting the price of a stone.

2. When we consider the variable cut, we can observe that stones with the cut value 'Premium' command the highest price, followed by stones with a 'Very Good' cut. So Gem Stones Co. Ltd. should focus on stones that fall in these two cut categories.

3. Moreover, the cubic zirconia stones with clarity values 'VS1' and 'VS2' have the highest prices, so the company should treat stones with these two clarity values as the most profitable.

4. As for the dimensions of the stone, width has the strongest positive impact on the price, so Gem Stones Co. Ltd. can use width to differentiate between stones likely to bring in high revenue and those likely to bring in lower revenue, and thereby improve their profit share.

5. Moreover, the high negative coefficients for x, the length of the stone, and z, the height of the stone, imply that the higher the length and height of a stone, the lower its price, and consequently the lower its profitability.
