
EDA and Hypothesis Testing on KC Housing Data
Daniele Sammarco | Exploratory Data Analysis for Machine Learning by IBM
Summary
1. Dataset and features description
2. Data cleaning and EDA
3. Hypothesis testing
4. Further developments and conclusion

1. Dataset and features description


The dataset I’m going to evaluate is a well-documented one in the international Data Science community, known as the KC Housing Data. It is commonly used for academic purposes, since it lends itself to clustering or classification analyses as well as to regression techniques and/or time series analysis on economic variables, with the realized sale price of each observed house as the target variable under scrutiny. Since those considerations are out of scope for this assignment, I’ll provide an EDA and hypothesis testing on the available data.

As an initial step, it is of course useful to load the necessary Python 3 packages, as shown in the lectures of this module, and to take a first glance at our observations by defining a path and choosing a column as the unique index for each row:
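A minimal sketch of this step (the file name kc_house_data.csv and the 'id' index column are assumptions):

import pandas as pd

# Hypothetical path to the CSV; 'id' is assumed to be the unique index column
path = "kc_house_data.csv"
df_kchouse = pd.read_csv(path, index_col="id")

# Drop the 'date' column and take a first look at the observations
df_kchouse = df_kchouse.drop(columns=["date"])
print(df_kchouse.shape)       # (21613, 19), as discussed below
print(df_kchouse.columns)
df_kchouse.info()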

After choosing to .drop the “date” column, I applied the .shape, .columns and .info() functions in order to start inquiring into the available observations, and discovered that there are 21613 observations/rows described by 19 columns, 18 of which are explanatory variables and one the price, as stated above. To give a glimpse of what we are talking about, I’ll quickly recap the features involved:

- ‘bedrooms’ is an int64 data type giving the number of bedrooms available in the house
- ‘bathrooms’ is a float64 Dtype (might be converted to int64) giving the number of bathrooms available
- ‘sqft_living’, ‘sqft_lot’, ‘sqft_above’, ‘sqft_basement’, ‘sqft_living15’ and ‘sqft_lot15’ are all int64 data types measuring different classes of sizes at different points in time: living area, lot area, square footage above ground, square footage of the basement, etc.
- ‘floors’, float64, the number of floors forming the house
- ‘waterfront’, int64, a dummy variable indicating whether a waterfront is present or not
- ‘view’, int64 variable with unspecified meaning
- ‘condition’, int64 feature providing a score for the overall condition of the house
- ‘grade’, int64 with unknown meaning
- ‘yr_built’ specifies the year the house was built
- ‘yr_renovated’: if populated, tells when the house was renovated
- 'zipcode', 'lat', and 'long' provide information about the geographical position

The next steps will be preprocessing, data cleaning, EDA and possible feature engineering according to the findings.

2. Data cleaning and EDA


A first approach to EDA should be to check whether our data frame presents missing data, in order to deal with NaNs, 99s, zeroes or missing values. A quick method is to compute column-wise sums of null values. In our case we do not have missing values.
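A one-line sketch of this check:

# Column-wise sum of null flags; every column returns 0 for this dataset
df_kchouse.isnull().sum()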

Since there are many columns, and at this point of the course we are interested in descriptive statistics and in finding distributions for possible feature engineering, we may want to reduce the useful columns and produce a smaller df that does not account for the “time-series like” behavior of our data. I therefore chose to reduce the size of my data frame and include fewer columns, applying right after the descriptive statistics method and transposing it for easier readability:
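(A possible version of this step; the exact column subset kept in the original notebook is an assumption.)

# Hypothetical reduced set of columns, dropping the '15' time-varying
# counterparts and other columns not used further on
cols = ['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
        'floors', 'waterfront', 'view', 'condition', 'grade',
        'sqft_above', 'sqft_basement', 'yr_built']
df_kchouse = df_kchouse[cols]

# Descriptive statistics, transposed for easier readability
df_kchouse.describe().T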
At this point, having fewer features, I was able to use a more computationally intensive strategy through seaborn.pairplot(), feeding the whole df_kchouse as input data and scatter-plotting the features against the target variable ‘price’, so as to get a sense of how each feature may (or may not) be skewed and how they could possibly be described by linear or non-linear relationships. As an example, I display scatter plots and distribution plots of ‘sqft_living’, ‘sqft_above’, and ‘sqft_basement’ to see how they behave:
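(A sketch of the pairplot call.)

import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatter plots of all features, with each feature's
# distribution on the diagonal
sns.pairplot(df_kchouse)
plt.show()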

Regarding the first two, at least visually there seems to be a positive linear relationship, meaning that house prices generally get larger as the square footage increases, as logic would suggest. Both distributions are right skewed, since the mean > median (a.k.a. 50th percentile or Q2) in the summary statistics shown above. A possible action would hence be to normalize these distributions by applying the natural logarithm to both. With respect to 'sqft_basement', one can observe that most observations display no basement area at all; on top of this, the scatterplot does not seem to suggest an easily modellable connection. A last possible check that could help in the choice of the variables to model or engineer is a heatmap, available in seaborn, where one can apply a correlation function and check for possible linear ties with the target. Since my purpose was to highlight only the correlations with the highest magnitude, I arbitrarily chose a minimum threshold of 40% correlation:
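(One possible way to apply the 40% threshold, masking the weaker coefficients before plotting; the styling options are assumptions.)

# Correlation matrix, hiding coefficients with absolute value below 0.4
corr = df_kchouse.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, mask=corr.abs() < 0.4, annot=True, cmap="coolwarm")
plt.show()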

Given this assignment’s purpose and the picture above, I modeled just some of the most important features among ‘bathrooms’, ‘sqft_living’, ‘view’, ‘grade’ and ‘sqft_above’. Just as shown in lecture 01D, I selected only the numerical dtype columns, set a threshold (skew_limit) of 0.75 to search for the most skewed distributions and put them in descending order before applying the log transformation on ‘price’, ‘sqft_living’ and ‘sqft_above’, as they appear to be the most positively correlated variables.
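A sketch of this skewness check and log transformation might look as follows:

import numpy as np

# Skewness of the numerical columns, keeping only the most skewed ones
skew_limit = 0.75
skew_vals = df_kchouse.select_dtypes(include=[np.number]).skew()
skew_cols = skew_vals[skew_vals.abs() > skew_limit].sort_values(ascending=False)
print(skew_cols)

# Natural-log transform of the chosen columns (all strictly positive here)
for col in ['price', 'sqft_living', 'sqft_above']:
    df_kchouse[col] = np.log(df_kchouse[col])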

The next step I chose was to divide the data frame into X and y (y being the Series of target prices and X being the features sub-data frame with all the aforementioned columns except ‘year’, since it appears to have almost no correlation) to easily work on the variables, and to add to X two new polynomial variables by squaring ‘sqft_living’ and ‘sqft_above’, plus the ratio ‘bedrooms/bathrooms’, because I wanted to check whether a higher or lower proportion explains part of the target. The code is presented here below:
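(A sketch under the assumption that the dropped “year” column is ‘yr_built’ and using the engineered column names referenced later in the text.)

# Target series and features sub-data frame
y = df_kchouse['price']
X = df_kchouse.drop(columns=['price', 'yr_built'])   # 'yr_built' assumed to be the dropped 'year' column

# New polynomial features
X['sqft_liv2'] = X['sqft_living'] ** 2
X['sqft_above2'] = X['sqft_above'] ** 2

# Ratio of bedrooms to bathrooms
X['beds/baths'] = X['bedrooms'] / X['bathrooms']

print(X.shape)   # (21613, 14)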

The resulting X dataset is a (21613, 14) pandas object that includes the new engineered features. I then reconstructed the entire data frame by concatenating X and y, positioning y as the last column and renaming the result new_df_kc, thus getting ready to run and discuss three hypothesis tests.
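A one-line sketch of this reassembly:

# Rebuild the full data frame with the target price as the last column
new_df_kc = pd.concat([X, y], axis=1)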

3. Hypothesis testing
In this section I’ll present the discussion, code, and results of three different hypothesis tests I conducted on different characteristics of the KC housing data: the two-sample t-test, the Jarque-Bera normality test and the Pearson correlation test. First things first, let’s import the needed packages/modules:
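(A minimal version of these imports.)

import numpy as np
from scipy import stats   # provides ttest_ind, jarque_bera and pearsonr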

Two-sample t-test: the goal of this test was to check whether the mean price of the group with a higher bedrooms/bathrooms ratio statistically differs from the mean price of the group with a lower ratio. The two hypotheses are expressed as follows:

- H0: mean price df1 = mean price df2


- H1: mean price df1 != mean price df2 (beds/baths has an influence)

where df1 is the sub-group of the original data whose observations have a 'beds/baths' ratio <= 1.6 and df2 the sub-group with 'beds/baths' greater than this value. I chose this cut-off on a trial-and-error basis, so that the two dfs contain as similar a number of rows/observations as possible. Additionally, an extra data cleaning step was to remove 3 observations where beds/baths returned an ‘inf’ result. After this, two equal-sized random samples from df1 and df2 were selected to run the t-test on. The code is self-explanatory and presented here:
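(A sketch of that procedure; the sample size and random_state are assumptions.)

# Drop the rows where the ratio evaluates to infinity (zero bathrooms)
clean = new_df_kc.replace([np.inf, -np.inf], np.nan).dropna(subset=['beds/baths'])

# Split on the 1.6 cut-off chosen by trial and error
df1 = clean[clean['beds/baths'] <= 1.6]
df2 = clean[clean['beds/baths'] > 1.6]

# Two equal-sized random samples from the two groups (sample size is an assumption)
n = 5000
df1_sample = df1.sample(n=n, random_state=42)
df2_sample = df2.sample(n=n, random_state=42)

# Two-sample t-test on the mean prices
t_stat, p_value = stats.ttest_ind(df1_sample['price'], df2_sample['price'])
print(t_stat, p_value)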

The scipy.stats.ttest_ind function calculates the t-test for the means of two independent samples of scores. In this case I fed it the ‘price’ columns of df1_sample and df2_sample and found clear evidence that the mean prices of the higher and lower beds/baths groups are statistically different at the 95% confidence level, since the p-value is lower than the selected alpha. Or, better said, there is sufficient evidence to reject the hypothesis that the mean house prices of these two groups are equal.

Jarque-Bera normality test: the goal of this second experiment was to test the normality assumption on the engineered polynomial variables 'sqft_liv2' and 'sqft_above2'. From the Wikipedia definition: “In statistics, the Jarque–Bera test is a goodness-of-fit test of whether sample data have the skewness and kurtosis matching a normal distribution. [...] The test statistic is always nonnegative. If it is far from zero, it signals the data do not have a normal distribution.” As before, I present H0 and H1:

- H0: the variable under probe follows a normal distribution


- H1: the variable does not follow a normal distribution

PAGE 6
The stats module again provides a ready-to-use function, stats.jarque_bera, which accepts as input an ‘array like’ of observations of a random variable and outputs the computed Jarque-Bera statistic and, of course, the p-value. The results I obtained are displayed here below:

Both features exhibit JB values very far from zero and p-values well below 5%. This would hence also have been the case at stricter significance levels, meaning that I could confidently reject H0 and state that these two variables, derived through polynomial engineering, do not have normal distributions. I further note that before feeding any machine learning model with sqft_liv2 and sqft_above2, one should normalize them first.

Pearson correlation test: the third and last hypothesis test I chose was the Pearson correlation. From the correlation matrix shown earlier, it would seem that 'sqft_living' and 'sqft_above' have a strongly positive, linear Pearson correlation coefficient (0.88). Let's formalize this statement by setting up an appropriate test. The underlying assumptions are: 1. observations in each sample are i.i.d.; 2. observations in each sample are normally distributed (already checked above); 3. observations in each sample have the same variance (I haven’t checked this). H0 and H1 are:

- H0: 'sqft_living' and 'sqft_above' are independent


- H1: 'sqft_living' and 'sqft_above' show some degree of dependence

Again, the Pearson test is provided in ‘stats’:
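(A sketch of the call.)

# Pearson correlation test between the two living-area measures
r, p_value = stats.pearsonr(new_df_kc['sqft_living'], new_df_kc['sqft_above'])
print(f"Pearson r = {r:.2f}, p-value = {p_value:.4g}")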

As logic would indeed suggest, at the 95% confidence level I rejected the null hypothesis that sqft_living and sqft_above are independent, confirming the insight previously observed. A test such as this can be a valid reinforcement tool for feature selection or dimensionality reduction techniques before training ML algorithms.

4. Further developments and conclusion
A few further steps could be undertaken:

- Instead of dropping columns such as ‘sqft_lot15’ and ‘sqft_living15’, I could have kept all the variables displaying time-varying attributes and checked for possible (statistically significant) parametric differences between them and their up-to-date counterparts.
- I could look for further types of interactions between features, not limited to ratios but also
extended to their products, to check for other underlying patterns.
- I could engineer columns up to higher polynomial degrees, not limited to squares.
- Other tests on regressors could be undertaken, not limited to the most correlated ones resulting
from the heatmap matrix.
- I could have used geographical coordinates such as those in 'zipcode', 'lat', 'long' to gain more
insights about interactions with the house price.
- Categorical regressors could be reshaped via one-hot encoding.

As for the quality of the data set, I selected a fairly easy sample, extensively researched in the Data Science community (as stated at the beginning). There were no NaNs/nulls, -99/99 placeholders or missing values in the original data; minimal cleaning was performed on derived regressors that could display some odd values, such as ‘inf’ in the ‘beds/baths’ case.

