
Modeling Factors That Affect Housing Prices in King County, USA
Methodology
Obtaining Data
All data was imported from the kc_house_data CSV file. No other queries were needed.

• Initial inspection showed 21597 data entries with 21 columns, which is a sizeable
amount of data.
• Some columns, such as sqft of the living area, are continuous data, while others, such
as waterfront and grade, are categorical and will need to be adjusted for effective model
application.
Methodology
Obtaining and Cleaning Data

Figure: Null values

Data was scanned for null values, which would interfere with statistical modeling.
Three columns showed null values, and a different approach was taken for each:

• The 63 entries with null values in view were dropped since they were a relatively
small percentage of the data (about 0.3% of 21597 entries).
• Null values in waterfront were replaced with the median (0).
• Year renovated was dropped entirely as a column due to the high amount of missing
data.
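
The three approaches above can be sketched with pandas. This is a minimal illustration on a made-up five-row table, not the original notebook code; the column names `view`, `waterfront`, and `yr_renovated` are assumed to match the CSV headers, and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for kc_house_data; values are made up.
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0],
    "view": [0.0, np.nan, 0.0, 2.0, 0.0],
    "waterfront": [0.0, 0.0, np.nan, 1.0, 0.0],
    "yr_renovated": [0.0, 1991.0, np.nan, np.nan, np.nan],
})

# Count missing values per column to decide how to treat each one.
null_counts = df.isna().sum()

# Drop the few rows with a missing view.
cleaned = df.dropna(subset=["view"]).copy()

# Replace missing waterfront flags with the column median.
cleaned["waterfront"] = cleaned["waterfront"].fillna(cleaned["waterfront"].median())

# Drop the renovation column, which is mostly missing.
cleaned = cleaned.drop(columns=["yr_renovated"])

print(null_counts)
print(cleaned)
```

Filling waterfront with the median keeps every row, at the cost of treating unknown entries as non-waterfront homes.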
Methodology
Data Type Conversion

-ID and date were dropped due to their high number of unique entries

-sqft_living15 and sqft_lot15 were also dropped since year renovated was dropped
Assessing Multicollinearity

Figure: Collinearity

Using a heatmap, sqft living was found to be collinear with many other columns,
which justified dropping the column altogether. This may seem wasteful of data,
but sqft living was simply the sum of sqft above and sqft basement in most cases,
so the final model would still have this information.
Assessing Multicollinearity
Absolute Correlation > 0.75

Sqft living's high correlation with bathrooms and sqft above justified dropping the column.
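
A threshold check like the one above can be sketched as follows. This is only an illustration on synthetic data, where `sqft_living` is deliberately built as the sum of `sqft_above` and `sqft_basement`; the column names and value ranges are assumptions, not figures from the study.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic features; sqft_living is built as above + basement.
rng = np.random.default_rng(0)
n = 500
sqft_above = rng.uniform(500, 4000, n)
sqft_basement = rng.uniform(0, 300, n)
features = pd.DataFrame({
    "sqft_above": sqft_above,
    "sqft_basement": sqft_basement,
    "sqft_living": sqft_above + sqft_basement,
    "lat": rng.uniform(47.2, 47.8, n),
})

# Absolute pairwise Pearson correlations.
corr = features.corr().abs()

# Keep only the upper triangle so each pair is listed once
# and the diagonal (always 1.0) is ignored.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs whose absolute correlation exceeds the 0.75 threshold.
high_pairs = pairs[pairs > 0.75]
print(high_pairs)
```

Any column that appears in several flagged pairs is a candidate for removal, which is the reasoning applied to sqft living here.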
Normalization and Scaling
Most columns contained highly skewed data, which warranted log transformation.
However, columns with negative values or zeroes, such as longitude and sqft
basement, could not be log transformed.
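
One way to apply this rule is to log-transform only the columns whose values are all strictly positive. The sketch below assumes the column names `price`, `sqft_lot`, `sqft_basement`, and `long`, and uses a few made-up rows; it is not taken from the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; values are for illustration only.
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 1225000.0],
    "sqft_lot": [5650.0, 7242.0, 101930.0],
    "sqft_basement": [0.0, 400.0, 1530.0],
    "long": [-122.257, -122.319, -122.005],
})

# The natural log is only defined for strictly positive values,
# so select the columns where every entry is greater than zero.
loggable = [col for col in df.columns if (df[col] > 0).all()]

# Transform those columns and leave the rest unchanged.
transformed = df.copy()
transformed[loggable] = np.log(transformed[loggable])

print(loggable)
print(transformed)
```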
Methodology
Post-Normalization
Methodology
Stepwise Selection to Ensure Significance
Methodology
K-Fold Cross Validation to Ensure Fit

MSE for all folds is extremely low, and so is the average. This suggests the model
generalizes well and can predict new data.
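
The idea behind K-fold cross validation can be sketched with NumPy alone. The helper `k_fold_mse` below is a hypothetical name, the data is synthetic, and the fit is a plain ordinary least squares regression; the original study may have used a different library, model setup, or number of folds.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Return the held-out MSE of an ordinary least squares fit for each fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Prepend a column of ones so the fit includes an intercept.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        # Fit on the training folds only.
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Score on the fold that was held out.
        residuals = y[test_idx] - A_test @ coef
        scores.append(float(np.mean(residuals ** 2)))
    return scores

# Hypothetical synthetic data standing in for the transformed features and price.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 13.0 + X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

fold_mse = k_fold_mse(X, y, k=5)
mean_mse = float(np.mean(fold_mse))
print(fold_mse, mean_mse)
```

MSE is expressed in squared units of the target, so a value computed on log-transformed price is not directly comparable to an error in dollars.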
Methodology
The Final Model

• This model is able to explain 74% of the variation in price.
• Most entries are still preserved.
• Lack of categorical data columns reduces kurtosis.
Interpretation
A grade of 11 or higher greatly
increases the selling price of a
house
Interpretation
A waterfront view also has a high
impact on price
Interpretation
A general increase in views leads
to higher selling prices
Interpretation
Those further north in King
County should consider higher
prices
Interpretation
Do not advertise too many
bedrooms or too big a lot, as that
can have a negative impact on price
Interpretation
As expected, general living area,
apart from the basement,
increases the price.
Overall Takeaways
• Renovate home, if necessary, to ensure a high grade from
King County grading system
• Increase the price of the home if it has a waterfront view, or
if it is located further north in King County.
• Put some effort into advertising and getting people to
view your home
• When renovating, do not expand lot size or number of
bedrooms and bathrooms. Instead, expand the general
living space, except for the basement.
