
The House Price Prediction Problem

In this problem, we are given a dataset that records the sale prices of 9,761 houses in King County, Washington, US. The prices are recorded along with other attributes such as the area of the house, the number of bedrooms, and the number of bathrooms. We are required to do the following tasks:

1) Question the data -


Understand the variables very carefully and formulate your questions/hypotheses. (Note that these are just your initial hypotheses, which may or may not turn out to be true after the EDA step.)

In [588…
import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from pandas.plotting import scatter_matrix

from sklearn.model_selection import cross_val_score

from sklearn.linear_model import LinearRegression

from sklearn.tree import DecisionTreeRegressor


from sklearn.metrics import mean_squared_error as MSE

from sklearn.metrics import r2_score


from sklearn import tree

In [589…
#Read the train data

data = pd.read_csv("kc_house_train_data.csv")

pd.set_option("display.max_columns", None) #to display all the columns

data.head(5)

Out[589… id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_li

0 2487200875 20141209T000000 604000.00000 4 3.00000 1960 5000 1.00000 0 0 5 7 1050 910 1965 0 98136 47.52080 -122.39300

1 7237550310 20140512T000000 1225000.00000 4 4.50000 5420 101930 1.00000 0 0 3 11 3890 1530 2001 0 98053 47.65610 -122.00500

2 9212900260 20140527T000000 468000.00000 2 1.00000 1160 6000 1.00000 0 0 4 7 860 300 1942 0 98115 47.69000 -122.29200

3 114101516 20140528T000000 310000.00000 3 1.00000 1430 19901 1.50000 0 0 4 7 1430 0 1927 0 98028 47.75580 -122.22900

4 6054650070 20141007T000000 400000.00000 3 1.75000 1370 9680 1.00000 0 0 4 7 1370 0 1977 0 98074 47.61270 -122.04500

List of Predictors of the dataset

date: Categorical variable showing the date of sale

bedrooms: Categorical variable showing the number of bedrooms in the house

bathrooms: Continuous variable showing the number of bathrooms in the house

sqft_living: Continuous variable showing the living area in sq. ft.

sqft_lot: Continuous variable showing the lot area in sq. ft.

floors: Continuous variable showing the number of floors

waterfront: Categorical variable, 0 indicating no waterfront and 1 indicating a waterfront near the house

condition: Categorical variable showing the condition of the house on a scale of 5

grade: Categorical variable showing the grade (quality of construction) on a scale of 13

sqft_above: Continuous variable showing the above-ground area of the house in sq. ft.

sqft_basement: Continuous variable showing the basement area of the house in sq. ft.

yr_built: Categorical variable showing the year the house was built

yr_renovated: Categorical variable showing the year the house was renovated

zipcode: Zip code of the location of the house

lat / long: Latitude and longitude of the house

sqft_living15: Continuous variable showing the living area in sq. ft. as of 2015.

sqft_lot15: Continuous variable showing the lot area in sq. ft. as of 2015.

Target Variable

price: Continuous variable showing the price of the house.

Creating Hypothesis and asking Questions


After a first look at the data and its variables, we can formulate the following questions/hypotheses:

House prices usually increase with the size of the house. Does an increase in sqft_living or sqft_lot show an increase in price?

The bigger the house, the higher the price. Do the number of floors, bedrooms, and bathrooms have any effect on price?

Housing prices are often seen to depend on the age of the house, with older houses getting lower prices. Does our data support this?

Does the condition of the house have any effect on the price? Are houses with a high condition rating priced higher than houses with a low condition rating?

People usually like to live close to nature. Do they pay higher prices to live near a waterfront?

Better constructed houses last for many years. Does grade (construction quality of the house) have any effect on price? Do lower-grade houses get lower prices?

Do houses that have undergone renovation get higher prices?

2) Exploratory Data Analysis(EDA)


Explore the dataset very carefully. Do univariate and bivariate analysis by choosing appropriate graphs, charts and descriptive measures. Report the surprising elements (i.e. something you believed would be true in step 1 that did not turn out to be true, or a result that was beyond your expectation, etc.)

In [590…
#data

data.head(5)

Out[590… id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_li

0 2487200875 20141209T000000 604000.00000 4 3.00000 1960 5000 1.00000 0 0 5 7 1050 910 1965 0 98136 47.52080 -122.39300

1 7237550310 20140512T000000 1225000.00000 4 4.50000 5420 101930 1.00000 0 0 3 11 3890 1530 2001 0 98053 47.65610 -122.00500

2 9212900260 20140527T000000 468000.00000 2 1.00000 1160 6000 1.00000 0 0 4 7 860 300 1942 0 98115 47.69000 -122.29200

3 114101516 20140528T000000 310000.00000 3 1.00000 1430 19901 1.50000 0 0 4 7 1430 0 1927 0 98028 47.75580 -122.22900

4 6054650070 20141007T000000 400000.00000 3 1.75000 1370 9680 1.00000 0 0 4 7 1370 0 1977 0 98074 47.61270 -122.04500

In [591…
#Shape of train data(#no. of rows , no of columns)

data.shape

Out[591… (9761, 21)

In [592…
#info of train data to check dtype and null values

data.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 9761 entries, 0 to 9760

Data columns (total 21 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 id 9761 non-null int64

1 date 9761 non-null object

2 price 9761 non-null float64

3 bedrooms 9761 non-null int64

4 bathrooms 9761 non-null float64

5 sqft_living 9761 non-null int64

6 sqft_lot 9761 non-null int64

7 floors 9761 non-null float64

8 waterfront 9761 non-null int64

9 view 9761 non-null int64

10 condition 9761 non-null int64

11 grade 9761 non-null int64

12 sqft_above 9761 non-null int64

13 sqft_basement 9761 non-null int64

14 yr_built 9761 non-null int64

15 yr_renovated 9761 non-null int64

16 zipcode 9761 non-null int64

17 lat 9761 non-null float64

18 long 9761 non-null float64

19 sqft_living15 9761 non-null int64

20 sqft_lot15 9761 non-null int64

dtypes: float64(5), int64(15), object(1)

memory usage: 1.6+ MB

In [593…
#Checking for Null values

sum(data.isnull().sum())

#There are no null values in train data set

Out[593… 0

Segmenting the dataset into continuous and categorical columns.

In [594…
continuous_cols=data[['price', 'bathrooms',

'sqft_living','sqft_lot', 'floors',

'sqft_above', 'sqft_basement', 'yr_built',

'lat', 'long', 'sqft_living15', 'sqft_lot15']]

categorical_cols=data[['bedrooms', 'waterfront', 'view', 'condition', 'grade','yr_renovated', 'zipcode']]

In [595…
#Describing the data

pd.set_option('display.float_format', lambda x: '%.5f' % x)

data.describe()

Out[595… id price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode

count 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 976

mean 4605288287.66919 542734.95164 3.37588 2.11718 2086.73415 15215.26063 1.48607 0.00840 0.24803 3.41553 7.66151 1793.29116 293.44299 1970.79951 86.06659 98077.79019 4

std 2876044376.02698 379527.63854 0.96070 0.77397 927.19430 41266.73460 0.53232 0.09127 0.78788 0.65055 1.18268 835.76382 442.61272 29.24001 405.41737 53.20359

min 1200019.00000 80000.00000 0.00000 0.00000 290.00000 520.00000 1.00000 0.00000 0.00000 1.00000 1.00000 290.00000 0.00000 1900.00000 0.00000 98001.00000 4

25% 2126049290.00000 320000.00000 3.00000 1.75000 1420.00000 5100.00000 1.00000 0.00000 0.00000 3.00000 7.00000 1190.00000 0.00000 1951.00000 0.00000 98033.00000 4

50% 3905040800.00000 450000.00000 3.00000 2.25000 1910.00000 7642.00000 1.50000 0.00000 0.00000 3.00000 7.00000 1570.00000 0.00000 1975.00000 0.00000 98065.00000 4

75% 7338402850.00000 649000.00000 4.00000 2.50000 2570.00000 10660.00000 2.00000 0.00000 0.00000 4.00000 8.00000 2230.00000 570.00000 1996.00000 0.00000 98117.00000 4

max 9900000190.00000 7700000.00000 33.00000 8.00000 12050.00000 1651359.00000 3.50000 1.00000 4.00000 5.00000 13.00000 8860.00000 3480.00000 2015.00000 2015.00000 98199.00000 4

QUESTIONS
Price
How is the distribution of the price?

From the plots below it is clear that prices are right skewed: a few houses are priced far above the rest and appear as outliers at the right end of the distribution.
From the description of the data, the average price of a house is about 542K dollars, with a minimum price of 80K and a maximum price of 7.7 million dollars. More than 50% of the houses fall below the average (the median is 450K).
Since the price is heavily right skewed and has no zero or negative values, we can take the log transformation to reduce the skewness.

In [596…
#plotting the graphs

fig = plt.figure(figsize=(10,5)) #Creating figure

plt.subplot(221) #creating subplot

plt.title('Price Distribution') # title

sns.histplot(data=data.price) #histogram

plt.subplot(223)

plt.title('Price Boxplot')

sns.boxplot(data=data.price, orient='h') #boxplot

plt.subplot(222)

plt.title('Log_Price Distribution')

sns.histplot(data=np.log(data.price)) #histogram of log price

plt.subplot(224)

plt.title('Log_Price Boxplot') #boxplot using log transformation

sns.boxplot(data=np.log(data.price), orient='h')

plt.tight_layout()
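The effect of the log transformation on skewness can also be checked numerically. The sketch below is illustrative only: it uses synthetic, strictly positive, right-skewed values rather than the King County prices, and the helper `skewness` is a hypothetical name introduced here.

```python
import numpy as np

def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Synthetic right-skewed "prices" (an assumption, not the notebook's data).
rng = np.random.default_rng(0)
prices = np.exp(rng.normal(loc=13.0, scale=0.5, size=5000))

skew_raw = skewness(prices)
skew_log = skewness(np.log(prices))
print(f"skewness of price:      {skew_raw:.2f}")
print(f"skewness of log(price): {skew_log:.2f}")
```

A value near zero after the transformation indicates a roughly symmetric distribution. On the real data the same helper could be applied to `data.price` and `np.log(data.price)`.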

What is range of prices?

In [597…
min(data.price),max(data.price)

Out[597… (80000.0, 7700000.0)

How many outliers are in house prices?

In [598…
iqr = data.price.quantile(0.75) - data.price.quantile(0.25)

ub = data.price.quantile(0.75) + 1.5*iqr #Q3 + 1.5*IQR

lb = data.price.quantile(0.25) - 1.5*iqr #Q1 - 1.5*IQR

print('interquartile range, upper boundary, lower boundary:',(iqr,ub,lb))

interquartile range, upper boundary, lower boundary: (329000.0, 1142500.0, -173500.0)

In [599…
#There are 522 houses priced above the upper boundary (the outlier houses)

sum(data.price > ub)

Out[599… 522

In [600…
#OUTLIERS below lb

sum(data.price < lb)

Out[600… 0
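Since the same 1.5*IQR rule is useful for other columns, it can be wrapped in a small helper. This is a minimal sketch on a toy array; `iqr_bounds` is a hypothetical name, and `np.percentile` with its default linear interpolation is assumed to match the `quantile` calls used above.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values, not the notebook's data: Q1 = 2, Q3 = 4, so IQR = 2.
toy = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
lower, upper = iqr_bounds(toy)
n_high = int(np.sum(toy > upper))   # values above the upper fence
n_low = int(np.sum(toy < lower))    # values below the lower fence
print(lower, upper, n_high, n_low)
```

With the quartiles reported by `data.describe()` (Q1 = 320000, Q3 = 649000), the same arithmetic gives the fences shown above: 649000 + 1.5 * 329000 = 1142500 and 320000 - 1.5 * 329000 = -173500.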

sqft_living area
Univariate analysis on living area

Q. What is the distribution of the living area?

The distribution is right skewed, similar to the price distribution. The average living area is about 2087 sq. ft.

Since the living area is right skewed and has no zero or negative values, we can take the log transformation to reduce the skewness.

In [601…
#Plotting a distribution graph

fig = plt.figure(figsize=(10,5))

plt.subplot(221)

plt.title('living area Distribution')
sns.histplot(data=data.sqft_living)

plt.subplot(223) #plotting a Boxplot for living area

plt.title('living area Boxplot')

sns.boxplot(data=data.sqft_living, orient='h')

plt.subplot(222) #plotting a Log_living area

plt.title('Log_living area Distribution')

sns.histplot(data=np.log(data.sqft_living))

plt.subplot(224) #plotting a Log_living area Boxplot

plt.title('Log_living area Boxplot')


sns.boxplot(data=np.log(data.sqft_living), orient='h')

plt.tight_layout()

Bivariate analysis on living area

Q. Compare price and living area. Is there a linear relation?

The scatter plot suggests a linear relation between price and living area: as living area increases, price increases.

In [602…
#Price vs sqft_living
fig = plt.figure(figsize=(10,4))

plt.subplot(121)

plt.title('price vs sqr_living')

sns.scatterplot(x=data.sqft_living, y=data.price)

plt.subplot(122) #log_Price vs sqft_living

sns.scatterplot(x=np.log(data.sqft_living), y=np.log(data.price))

plt.title('Log_price vs Log_sqr_living')

Out[602… Text(0.5, 1.0, 'Log_price vs Log_sqr_living')

sqft_lot
Univariate analysis on sqft_lot using log transformation, as the data is highly skewed

Q. What is the range of the lot area?

The lot area has a very wide range and the data is skewed.

In [603…
#The extreme values lie at the far end of the distribution

min(data.sqft_lot),max(data.sqft_lot)

Out[603… (520, 1651359)

Q. What is the distribution of sqft_lot?

The outliers cause the distribution to become skewed.

In [604…
# Distribution plot

fig = plt.figure(figsize=(6,6))

plt.subplot(211)

plt.title('Log_lot area Distribution')

sns.histplot(data=np.log(data.sqft_lot))

plt.subplot(212)

plt.title('Log_lot area Boxplot')

sns.boxplot(data=np.log(data.sqft_lot), orient='h')

plt.tight_layout()

Bivariate analysis on lot area

Q. Compare price and lot area. Is there a linear relation?

The scatter plot suggests no linear relation between price and lot area, contrary to our hypothesis.

In [605…
#Price vs sqft_lot

sns.scatterplot(x=np.log(data.sqft_lot), y=np.log(data.price))

plt.title('Log_price vs Log_lot area')

Out[605… Text(0.5, 1.0, 'Log_price vs Log_lot area')

bedrooms
Q. Houses with how many bedrooms sold the most? Does price increase with the number of bedrooms?

Most houses have 2, 3 or 4 bedrooms, and the average price increases slightly with the number of bedrooms up to 6 bedrooms.

In [606…
# graphs for bedrooms
fig = plt.figure(figsize=(10,4))

plt.subplot(121)

plt.title('Countplot_bedrooms')

sns.countplot(x='bedrooms',data=data)

plt.subplot(122) # Boxplot for bedrooms

sns.boxplot(x=data.bedrooms, y=data.price) #price by number of bedrooms (no log transform applied here)

plt.title('Boxplot_bedrooms')

Out[606… Text(0.5, 1.0, 'Boxplot_bedrooms')

bathroom
Q. Houses with how many bathrooms sold the most? Does price increase with the number of bathrooms?

Houses with 2 bathrooms sold the most, followed by 3 and 1 bathrooms. The price of the house increases with the number of bathrooms, and people are ready to pay well above the upper boundary for houses with 2, 3 or 4 bathrooms.

In [607…
#Graphs for bathroom

fig = plt.figure(figsize=(15,4))

plt.subplot(131)

plt.title('bathrooms')

sns.histplot(data=np.round(data.bathrooms)) #Since bathrooms data is float, rounding off to integer

plt.subplot(132) #Boxplots for bathroom

sns.boxplot(x=np.round(data.bathrooms), y=(data.price))

plt.title('price vs bathrooms')

plt.subplot(133) #Boxplots for bathroom vs log_price

sns.boxplot(x=np.round(data.bathrooms), y=np.log((data.price)))

plt.title('log_price vs bathrooms')

Out[607… Text(0.5, 1.0, 'log_price vs bathrooms')

floors
Q. Houses with how many floors sold the most? Does price increase with the number of floors?

Houses with 1 and 2 floors sold the most, but there isn't much of a significant change in price due to the number of floors.

In [608…
#graphs for floors

fig = plt.figure(figsize=(10,4))

plt.subplot(121)

plt.title('floors')

sns.countplot(x='floors',data=np.round(data))

plt.subplot(122) #boxplot for floors

sns.boxplot(x=np.round(data.floors), y=data.price) # rounding of floors to integer

plt.title('floors')

Out[608… Text(0.5, 1.0, 'floors')

waterfront
Q. How many houses are near a waterfront? Does it have any effect on price?

Most of the houses are not near a waterfront; less than 1% are. The data suggest that the average price of a house near a waterfront is higher than that of a non-waterfront house, but there are very few waterfront houses.
With so few sales, the data do not show that buyers in general favor houses with a waterfront.

In [609…
#graphs for waterfront

fig = plt.figure(figsize=(10,4))

plt.subplot(121)

plt.title('Univariate_waterfront')

sns.countplot(x='waterfront',data=data)

plt.subplot(122) #Boxplot for waterfront

sns.boxplot(x=data.waterfront, y=data.price)

plt.title('Biivariate_waterfront')

Out[609… Text(0.5, 1.0, 'Biivariate_waterfront')

In [610…
# % of houses near waterfront is 0.84%

(data.waterfront[data.waterfront==1].sum()/len(data))*100

Out[610… 0.8400778608749104

view
Q. Which houses sold the most? Any effect on price?

Houses with a view rating of zero sold the most. The average price changes slightly with the view rating, but not significantly.

In [611…
#graph for view

fig = plt.figure(figsize=(10,4))

plt.subplot(121)

plt.title('view')

sns.countplot(x='view',data=data)

plt.subplot(122)

sns.boxplot(x=data.view, y=data.price) #price by view rating (no log transform applied here)

plt.title('view')

Out[611… Text(0.5, 1.0, 'view')

In [612…
# % number of houses with no views

(data.view[data.view==0].value_counts())/len(data)*100

Out[612… 0 89.62196

Name: view, dtype: float64

condition
Q. Houses in which condition sold the most? Any effect on price?

Houses with condition ratings 3, 4 and 5 sold the most. But the data do not suggest that the condition rating has much effect on the average price of the house; the average price remains more or less the same.

In [613…
#graph for condition

fig = plt.figure(figsize=(10,4))

plt.subplot(121)

plt.title('condition')

sns.countplot(x='condition',data=data)

plt.subplot(122) #boxplot for condition

sns.boxplot(x=data.condition, y=data.price)

plt.title('condition')

Out[613… Text(0.5, 1.0, 'condition')

grade
Q. Houses of which grade sold the most? Any effect on price?

Houses with grades 6, 7, 8 and 9 sold the most. The grade of the house shows a strong relation with price: as the grade increases, the price increases roughly exponentially.

In [614…
#graph for grade
fig = plt.figure(figsize=(15,4))

plt.subplot(131)

plt.title('grade')

sns.countplot(x='grade',data=data)

plt.subplot(132) #Boxplot for grade

sns.boxplot(x=data.grade, y=data.price)

plt.title('price vs grade')

plt.subplot(133)

sns.boxplot(x=data.grade, y=np.log(data.price))#Taking log price to understand better

plt.title('log_price vs grade')

Out[614… Text(0.5, 1.0, 'log_price vs grade')

sqft_above
Q. What is the distribution of sqft_above?

The distribution is right skewed, similar to the price distribution, with a mean of about 1793 sq. ft.

In [615…
#Graph for sqft_above

fig = plt.figure(figsize=(6,6))

plt.subplot(211)

plt.title('sqft_above')

sns.histplot(data=data.sqft_above)

plt.subplot(212) #Boxplot for sqft_above

plt.title('sqft_above Boxplot')

sns.boxplot(data=data.sqft_above, orient='h')

plt.tight_layout()

Q. Is there any effect of sqft_above on House prices?

The House prices seem to follow moderate linearity with the sqft_above.

In [616…
#Price vs sqft_above

sns.scatterplot(x=np.log(data.sqft_above), y=np.log(data.price))

plt.title('Log_price vs Log_sqft_above')

Out[616… Text(0.5, 1.0, 'Log_price vs Log_sqft_above')

yr_built vs yr_renovated
This shows that the renovated houses are older houses, built before the year 2000, and that they have not seen much of a price increase from renovation compared to other houses built at the same time.

In [617…
#Scatter plot for yr_built and yr_renovated

sns.scatterplot(x='yr_built', y='price', data=data, hue='yr_renovated')

Out[617… <AxesSubplot:xlabel='yr_built', ylabel='price'>

Correlation heatmap
This shows the correlation matrix for the variable present in the data set.

In [618…
#heatmap

plt.figure(figsize = (18,10))

sns.heatmap(data.select_dtypes('number').corr(),cmap='coolwarm',annot=True, linewidths=.5) #numeric columns only ('date' is a string)

Out[618… <AxesSubplot:>
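Reading a large annotated heatmap can be tedious, so it may help to rank the predictors by the absolute value of their correlation with the target. The sketch below uses a small synthetic frame (an assumption, not the notebook's data) in which `sqft_living` drives `price` by construction and `sqft_lot` is unrelated to it.

```python
import numpy as np
import pandas as pd

# Synthetic frame for illustration only.
rng = np.random.default_rng(1)
n = 500
sqft_living = rng.uniform(500, 4000, n)
toy = pd.DataFrame({
    "sqft_living": sqft_living,
    "sqft_lot": rng.uniform(1000, 20000, n),   # independent of price here
    "price": 200 * sqft_living + rng.normal(0, 50000, n),
})

# Correlation of every predictor with the target, strongest first.
corr_with_price = toy.corr()["price"].drop("price")
ranked = corr_with_price.abs().sort_values(ascending=False)
print(ranked)
```

On the real data the same two lines could be applied to the numeric columns of `data`. Note that a correlation coefficient only measures linear association, so a low value does not rule out a non-linear relation.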

Summary of EDA
Predictors that may affect the Target

sqft_living
grade
bathrooms
sqft_above (highly correlated with sqft_living and can be dropped)
sqft_living15 (highly correlated with sqft_living, can be dropped)
view (less significant)
bedrooms (less significant)
waterfront (less significant)
lat (less significant)

Predictors that may NOT affect the Target

id
date
sqft_lot
floors
condition
yr_built
yr_renovated
zipcode
long

Results that supports our Hypothesis

The house price increases with the size of the house, i.e. its living area.
Houses with more bathrooms and bedrooms see higher prices.
People pay higher prices for better built houses, i.e. houses with a higher grade.
Houses with views get slightly higher prices.
Most people do not live near a waterfront, but prices are higher for waterfront houses.

Results that do not support our hypothesis

The lot area of the house does not affect its price. People prefer a bigger living area over a bigger lot area.
An increased number of floors does not show an increase in price.
The condition of the house does not affect its price.
House prices are not very dependent on the year built and the year of renovation.

3) Initial model fitting step -


Fit a couple of linear regression models by considering different sets of predictors on the training dataset. Argue the reasons for considering those predictor sets. Report 10-fold cross-validation RMSE and R-squared values. Discuss the results.

Fitting Linear Regression model

Creating a function called model_house(predictors, target, model number) which reports RMSE, r_square and adjusted r_square.
The function takes the predictors, the target variable and a model number. It runs 10-fold cross-validation to calculate RMSE and r_square, and also calculates adjusted r_square.

In [619…
#Function to calculate RMSE and r_square using linear regression and 10-fold cross validation

def model_house(X,y,model_number=None):
    model=LinearRegression() #instance of model
    RMSE = -cross_val_score(model, X, y, cv=10, scoring='neg_root_mean_squared_error').mean() #mean RMSE over the folds
    r2 = cross_val_score(model, X, y, cv=10, scoring='r2').mean() #mean r_square over the folds
    n, p = len(X), X.shape[1] #number of observations and number of predictors
    adj_r2=np.round(1-(1-r2)*(n-1)/(n-p-1),2) #adjusted r_square
    print('Model',model_number,':\nRMSE:',np.round(RMSE,4),'\nr_square:',np.round(r2,2),'\nAdjusted r2:',adj_r2)
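For reference, the usual definition of adjusted r_square for n observations and p predictors (not counting the intercept) is 1 - (1 - R^2)(n - 1)/(n - p - 1). The sketch below implements that formula on its own; `adjusted_r2` is a hypothetical helper name, and the second call uses made-up numbers to show how the penalty grows when the sample is small.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# With 9,761 rows and one predictor the penalty is negligible.
large_sample = adjusted_r2(0.49, 9761, 1)

# Hypothetical small sample: 20 rows and 9 predictors.
small_sample = adjusted_r2(0.76, 20, 9)

print(round(large_sample, 4), round(small_sample, 4))
```

This explains why r_square and adjusted r_square agree to two decimals throughout this notebook: the number of rows is very large compared to the number of predictors.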

In [620…
#X_train and y_train data

X_train=data[['id', 'date','bedrooms', 'bathrooms', 'sqft_living',

'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',

'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',

'lat', 'long', 'sqft_living15', 'sqft_lot15']]

y_train = data[['price']]

Model_1 : Target = [ Price ] , Predictors = [ sqft_living ]

Considering sqft_living as predictor for our first SLR model as it has highest correlation with target price.

In [621…
model_house(X_train[['sqft_living']],np.array(y_train),1)

Model 1 :

RMSE: 268686.2432

r_square: 0.49

Adjusted r2: 0.49

In [623…
model_house(X_train[['bathrooms']],np.array(y_train),1)

Model 1 :

RMSE: 321723.1503

r_square: 0.27

Adjusted r2: 0.27

Model_2 : Target = [ Price ] , Predictors = [ sqft_living, 'grade' ]

To see if grade (quality of construction) together with living area helps improve the performance of the model.

In [624…
X=X_train[['sqft_living','grade']] #Predictor

y=np.array(y_train) #Target

model_house(X,y,2)

Model 2 :

RMSE: 258217.7122

r_square: 0.53

Adjusted r2: 0.53

Model_3 : Target = [ Price ] , Predictors = [ sqft_living,bathrooms ]

We observed in the EDA that price increases with the number of bathrooms. We add bathrooms to sqft_living to see if the model's performance improves.

In [625…
X=X_train[['sqft_living','bathrooms']] #Predictor

y=np.array(y_train) #Target

model_house(X,y,3)

Model 3 :

RMSE: 268676.5899

r_square: 0.49

Adjusted r2: 0.49

Model_4 : Target = [Price] , Predictors = ['sqft_living','waterfront', 'view']

Considering waterfront and view together with sqft_living, as waterfront and view are not directly related to attributes of the house itself (like number of rooms or condition) and could be subjective from person to person.

In [487…
X=X_train[['sqft_living','view','waterfront']] #Predictor

y=np.array(y_train) #Target

model_house(X,y,4)

Model 4 :

RMSE: 252037.9863

r_square: 0.55

Adjusted r2: 0.55

Model_5 : Target = [Price] , Predictors = ['sqft_living','lat', 'long']

Considering latitude and longitude to check if the location of the house contributes to model performance.

In [488…
X=X_train[['sqft_living','lat','long']] #Predictor

y=np.array(y_train) #Target

model_house(X,y,5)

Model 5 :

RMSE: 245937.9951

r_square: 0.57

Adjusted r2: 0.57

Results

After fitting a couple of initial models, we see that sqft_living, being the feature most highly correlated with price, contributes most to the performance of the model. Keeping sqft_living as the baseline and adding a few combinations of features, we tried to evaluate which features could improve the performance of the model.

The quality of construction (grade) helps improve the model's performance (r_square rises from 0.49 to 0.53).

The number of bathrooms, when used along with sqft_living, does not contribute enough to improve the r_square (it stays at 0.49).
View and waterfront with sqft_living improve the r_square from 0.49 to 0.55.
The location of the house ('lat','long') contributes to the model performance (r_square of 0.57).

4) Feature engineering -
(A) Suggest some possible feature transformations (like log(X), sqrt(X), X^2, X1*X2, etc.) with reasons, which you believe could improve the performance of the previous models. Experiment to check if such transformations actually help. (B) Suggest some new feature generation techniques (e.g. creating dummy variables, using one-hot encoding, or transforming an existing feature into a new feature, such as converting the variable 'year built' to the 'age of the house'). Check if such transformations help in improving the performance of the models. Report the RMSE and R-squared values in each case.

A) Log Transformation:
The dataset has many outliers, which cause the data to be right skewed for a few variables. The effect of the outliers can be reduced using a log transformation, which reduces or removes the skewness of the original data.
A log transformation cannot be used on data with zero or negative values.

The distribution of the target price is right skewed, so we can use a log transformation. The predicted values will then be the log of price, which needs to be back-transformed with exp(log_price).

The sqft_living feature is right skewed due to outliers, so we try a log transformation to check if it helps.

1. Price to Log_price

In [489…
log_price = np.log(y_train.price) #taking log of price

In [490…
y_train.insert(0,'log_price',log_price) ##INSERTING new column log_price at 0 index

In [491…
y_train.head(1) #added log_price

Out[491… log_price price

0 13.31133 604000.00000

In [492…
model_house(X_train[['sqft_living','grade']],y_train[['price']],'without_log')

print('--------------------------------')

model_house(X_train[['sqft_living','grade']],y_train[['log_price']],'with_log_price')

Model without_log :

RMSE: 258217.7122

r_square: 0.53

Adjusted r2: 0.53

--------------------------------

Model with_log_price :

RMSE: 0.3509

r_square: 0.56

Adjusted r2: 0.56

With log_price as the target, r_square increases from 0.53 to 0.56 for this combination of features. The two RMSE values are not directly comparable, because one is in dollars and the other is in log units.
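To compare errors on the same scale, predictions made on the log scale can be back-transformed with exp before computing RMSE. The sketch below illustrates this on synthetic data with a plain least-squares line fitted by `np.polyfit`. It is an in-sample illustration under stated assumptions, not the notebook's cross-validated procedure, and the helper names `fit_line` and `rmse` are hypothetical.

```python
import numpy as np

# Synthetic data (an assumption): log(price) is linear in sqft plus noise.
rng = np.random.default_rng(2)
n = 2000
sqft = rng.uniform(500, 4000, n)
price = np.exp(12.0 + 0.0004 * sqft + rng.normal(0, 0.3, n))

def fit_line(x, y):
    """Least-squares fit of y ~ a + b*x; returns (a, b)."""
    b, a = np.polyfit(x, y, deg=1)   # polyfit returns the slope first
    return a, b

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

a_raw, b_raw = fit_line(sqft, price)           # model on price
a_log, b_log = fit_line(sqft, np.log(price))   # model on log(price)

pred_raw = a_raw + b_raw * sqft
pred_log = a_log + b_log * sqft      # predictions in log units
pred_back = np.exp(pred_log)         # back-transformed to dollars

rmse_log_scale = rmse(np.log(price), pred_log)   # log units
rmse_raw_model = rmse(price, pred_raw)           # dollars
rmse_log_model = rmse(price, pred_back)          # dollars, comparable to the line above
print(rmse_log_scale, rmse_raw_model, rmse_log_model)
```

Only the last two numbers are on the same scale. Which model wins in dollars depends on the data, so this should be checked rather than assumed.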

2. sqft_living to log_sqft_living

In [493…
log_sqft_living = np.log(X_train.sqft_living) #Taking log of sq ft living

In [494…
#INSERTING new column log_sqft_living at 0 index

X_train.insert(0,'log_sqft_living',log_sqft_living)

In [495…
X_train.head(1)

Out[495… log_sqft_living id date bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_li

0 7.58070 2487200875 20141209T000000 4 3.00000 1960 5000 1.00000 0 0 5 7 1050 910 1965 0 98136 47.52080 -122.39300

In [496…
model_house(X_train[['sqft_living','lat','view','grade']],y_train[['price']],'without_log')

print('--------------------------------')

model_house(X_train[['log_sqft_living','lat','view','grade']],y_train[['price']],'with_log_sqft_living')

Model without_log :

RMSE: 229000.1176

r_square: 0.63

Adjusted r2: 0.63

--------------------------------

Model with_log_sqft_living :

RMSE: 244648.0592

r_square: 0.58

Adjusted r2: 0.58

Using log sqft_living does not help here: r_square drops from 0.63 to 0.58.

3. sqrt Transformation (bedrooms):


The square root transformation is typically used when the data is moderately skewed. It has a moderate effect on the shape of the distribution and is generally used to reduce right skew. It is an alternative to the log transformation when the data contains zeros; like the log, it is not defined for negative values.
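The difference at zero can be seen directly. The sketch below uses a small toy array spanning the bedroom range reported by `describe()` (minimum 0, maximum 33). It also shows `np.log1p`, which computes log(1 + x); that is an alternative not used in this notebook and is included only for comparison.

```python
import numpy as np

# Toy values only; 0 and 33 are the min and max of bedrooms in describe().
bedrooms = np.array([0, 1, 2, 3, 4, 33], dtype=float)

with np.errstate(divide="ignore"):      # silence the divide-by-zero warning
    log_vals = np.log(bedrooms)         # log(0) evaluates to -inf
sqrt_vals = np.sqrt(bedrooms)           # sqrt(0) is 0, all values finite
log1p_vals = np.log1p(bedrooms)         # log(1 + x), so 0 maps to 0

print(log_vals[0], sqrt_vals[0], log1p_vals[0])
```

A column containing -inf cannot be passed to a linear regression, which is why the plain log is only applied to strictly positive columns such as price and sqft_living.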

In [497…
bedrooms_sqrt= np.sqrt(X_train.bedrooms) #taking sqrt of bedrooms

In [498…
X_train.insert(0,'bedrooms_sqrt',bedrooms_sqrt)

In [586…
model_house(X_train[['sqft_living','bedrooms']],y_train[['price']],'with_bedrooms') #model with sqft_living and bedrooms

print('--------------------------------')

model_house(X_train[['bedrooms_sqrt']],y_train[['price']],'with_bedrooms_sqrt') #model with bedrooms_sqrt only

Model with_bedrooms :

RMSE: 265640.7144

r_square: 0.5

Adjusted r2: 0.5

--------------------------------

Model with_bedrooms_sqrt :

RMSE: 359968.6937

r_square: 0.09

Adjusted r2: 0.09

4. sqrt Transformation (bathrooms):


In [500…
bathrooms_sqrt= np.sqrt(X_train.bathrooms) #Taking sqrt transformation

In [501…
X_train.insert(0,'bathrooms_sqrt',bathrooms_sqrt)

In [502…
model_house(X_train[['bathrooms']],y_train[['price']],'with_bathrooms')

print('--------------------------------')

model_house(X_train[['bathrooms_sqrt']],y_train[['price']],'with_bathrooms_sqrt')

Model with_bathrooms :

RMSE: 321723.1503

r_square: 0.27

Adjusted r2: 0.27

--------------------------------

Model with_bathrooms_sqrt :

RMSE: 329842.3662

r_square: 0.23

Adjusted r2: 0.23

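bathrooms does not improve r_square after the sqrt transformation (0.27 drops to 0.23). For bedrooms the two models above are not like-for-like, since the first also includes sqft_living, so the drop from 0.5 to 0.09 cannot be attributed to the sqrt transformation alone.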

B) New feature generation


In [503…
house_age = 2015 - (X_train.yr_built) #creating new feature

In [504…
X_train.insert(0,'house_age',house_age)

In [505…
model_house(X_train[['sqft_living']],y_train[['price']],'without_new feature house_age')

print('--------------------------------')

model_house(X_train[['sqft_living','house_age']],y_train[['price']],'with_house_age')

Model without_new feature house_age :

RMSE: 268686.2432

r_square: 0.49

Adjusted r2: 0.49

--------------------------------

Model with_house_age :

RMSE: 259925.0546

r_square: 0.52

Adjusted r2: 0.52

The r2 increases from 0.49 to 0.52 after adding the new feature, age of the house, to sqft_living.
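The task statement also mentions dummy variables and one-hot encoding, which are not tried above. The sketch below shows what such features could look like on a three-row toy frame built from values in the `data.head()` output. It assumes that `yr_renovated == 0` means the house was never renovated; the toy renovation year 1990 and the column name `was_renovated` are hypothetical.

```python
import pandas as pd

# Toy rows only; the 1990 renovation year is made up for illustration.
toy = pd.DataFrame({
    "yr_built": [1965, 2001, 1942],
    "yr_renovated": [0, 0, 1990],     # assumption: 0 means never renovated
    "zipcode": [98136, 98053, 98136],
})

features = pd.DataFrame({
    "house_age": 2015 - toy["yr_built"],                       # as in the cell above
    "was_renovated": (toy["yr_renovated"] > 0).astype(int),    # dummy variable
})

# One-hot encode zipcode so it is treated as a category, not a number.
zip_dummies = pd.get_dummies(toy["zipcode"], prefix="zip", dtype=int)
features = pd.concat([features, zip_dummies], axis=1)
print(features)
```

Whether these features improve RMSE and r_square would have to be checked with `model_house`, as was done for `house_age`. With a linear model, one dummy column is usually dropped to avoid perfect collinearity with the intercept.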

5) Model fitting step 2 (Linear regression) -


You may use model selection methods like (forward selection or backward elimination methods) to select an appropriate model.

Forward Selection Method:


Feature variables are examined one at a time.
A stopping rule (such as r square, adjusted r square, or p-value) acts as a hurdle that a variable must overcome to be allowed into the model.
An added feature must reduce the error significantly.
Once a variable is added to the model it stays and is never removed.
The process repeats until all variables are in the model or no variable passes the stopping rule.
The ability of a variable to reduce error can change as other variables enter the model.
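The steps above can be sketched as a small greedy loop. The version below is a minimal illustration, not the procedure used in the cells that follow (which add features by hand). It uses a NumPy least-squares fit with contiguous 10-fold cross-validation, synthetic data, and a hypothetical stopping rule `min_gain` on the cross-validated r_square.

```python
import numpy as np

def cv_r2(X, y, k=10):
    """Mean out-of-fold R^2 of an OLS fit with intercept over k contiguous folds."""
    n = len(y)
    scores = []
    for test_idx in np.array_split(np.arange(n), k):
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        A_train = np.column_stack([np.ones(train.sum()), X[train]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        resid = y[test_idx] - A_test @ coef
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1.0 - np.sum(resid ** 2) / ss_tot)
    return float(np.mean(scores))

def forward_select(X, y, names, min_gain=0.01):
    """Add the feature with the best CV R^2 until the gain falls below min_gain."""
    selected, best = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        score, j = max((cv_r2(X[:, selected + [j]], y), j) for j in remaining)
        if score - best < min_gain:     # stopping rule: gain too small
            break
        selected.append(j)              # once added, a feature is never removed
        remaining.remove(j)
        best = score
    return [names[j] for j in selected], best

# Synthetic data (an assumption): only x0 and x2 actually drive y.
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 4))
y = 3 * X[:, 0] + 1 * X[:, 2] + rng.normal(0, 1, 600)
chosen, best_r2 = forward_select(X, y, ["x0", "x1", "x2", "x3"])
print(chosen, round(best_r2, 2))
```

The threshold of 0.01 is arbitrary. A stricter or looser rule changes how many features are kept, which is the same judgment made by hand when choosing among the competing models.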

Model_1 : Target = [ log_price ] , Predictors = [ sqft_living ]

In [506…
X=X_train[['sqft_living']]#Predictor

y=y_train[['log_price']] #Target

model_house(X,y,1)

Model 1 :

RMSE: 0.3786

r_square: 0.49

Adjusted r2: 0.49

Model_2 : Target = [ log_price ] , Predictors = [ sqft_living, lat ]

In [507…
X=X_train[['sqft_living','lat']]#Predictor

y=y_train[['log_price']] #Target

model_house(X,y,2)

Model 2 :

RMSE: 0.3112

r_square: 0.65

Adjusted r2: 0.65

Model_3 : Target = [ log_price ] , Predictors = [ sqft_living, lat, grade ]

In [508…
X=X_train[['sqft_living','lat','grade']]#Predictor

y=y_train[['log_price']] #Target

model_house(X,y,3)

Model 3 :

RMSE: 0.2869

r_square: 0.71

Adjusted r2: 0.71

Model_4 : Target = [ log_price ] , Predictors = ['sqft_living','lat','grade','house_age']

In [509…
X=X_train[['sqft_living','lat','grade','house_age']]#Predictor

y=y_train[['log_price']] #Target

model_house(X,y,4)

Model 4 :

RMSE: 0.273

r_square: 0.73

Adjusted r2: 0.73

Model_5 (printed as Model 5.1) : Target = [ log_price ] , Predictors = ['sqft_living','lat','grade','house_age','waterfront']

In [510…
X=X_train[['sqft_living','lat','grade','house_age','waterfront']] #Predictor

y=y_train[['log_price']] #Target

model_house(X,y,5.1)

Model 5.1 :

RMSE: 0.2671

r_square: 0.75

Adjusted r2: 0.75

Model_6 (printed as Model 5.2) : Target = [ log_price ] , Predictors = ['sqft_living','lat','grade','house_age','waterfront','bathrooms']

In [511…
X=X_train[['sqft_living','lat','grade','house_age','waterfront','bathrooms']] #Predictor

y=y_train[['log_price']] #Target

model_house(X,y,5.2)

Model 5.2 :

RMSE: 0.2646

r_square: 0.75

Adjusted r2: 0.75

Model_7 : Target = [ Price ] , Predictors = ['sqft_living','lat','grade','house_age','view' ]

In [512…
X=X_train[['sqft_living','lat','grade','house_age','view']] #Predictor

y=y_train[['log_price']] #Target

model_house(X,y,7)

Model 7 :

RMSE: 0.2661

r_square: 0.75

Adjusted r2: 0.75

Model_8 : Target = [ Price ] , Predictors = ['sqft_living','lat','grade','house_age','view','waterfront']

In [513…
X=X_train[['sqft_living','lat','grade','house_age','view','waterfront']] #Predictor

y=y_train[['log_price']] #Target

model_house(X,y,8)

Model 8 :

RMSE: 0.2637

r_square: 0.75

Adjusted r2: 0.75

Model_9 : Target = [ Price ] , Predictors = ['sqft_living','lat','grade','house_age','view','waterfront','bathrooms']

In [514…
X=X_train[['sqft_living','lat','grade','house_age','view','waterfront','bathrooms']] #Predictor

y=y_train[['log_price']] #Target

model_house(X,y,9)

Model 9 :

RMSE: 0.2612

r_square: 0.76

Adjusted r2: 0.76

Model_10 : Target = [ Price ] , Predictors = ['sqft_living','lat','grade','house_age','view','waterfront','bathrooms','condition']

In [515…
X=X_train[['sqft_living','lat','grade','house_age','view','waterfront','bathrooms','condition']] #Predictor

y=y_train[['log_price']] #Target

model_house(X,y,10)

Model 10 :
RMSE: 0.2593

r_square: 0.76

Adjusted r2: 0.76

Model_11 : Target = [ Price ] , Predictors = ['sqft_living','lat','grade','house_age','view','waterfront','bathrooms','condition','floors']

In [516…
X=X_train[['sqft_living','lat','grade','house_age','view','waterfront','bathrooms','condition','floors']] #Predictor

y=y_train[['log_price']] #Target

model_house(X,y,11)

Model 11 :
RMSE: 0.2577

r_square: 0.76

Adjusted r2: 0.76

Competing Models:
Models 9, 10 and 11 have the same r_square value (0.76). We choose the one with the fewest variables (Model_9), since adding more variables does not noticeably improve performance and a smaller model reduces the risk of overfitting.
Models 5, 6, 7 and 8 have the same r_square value (0.75). We choose the one with the fewest variables (Model_5).
Model 5 and Model 9 will be evaluated on the test data.

6) Model fitting step 3 (decision tree) -


Experiment and check if a decision tree model can be used to fit the data more accurately. You are free to use any kind of hyperparameter tuning to fit the model. Experiment using all the feature sets you have created before
(including all the transformed sets and new feature-generated sets).

Defining a function to calculate r_square and RMSE for the decision tree [model_house_DT(X,y,model,model_number=None)]

In [517…
#Creating function to calculate RMSE and r_square for decision tree using cross validation

def model_house_DT(X,y,model,model_number=None):
    #10-fold cross-validated RMSE (the scorer returns negative RMSE, so flip the sign)
    RMSE = -cross_val_score(model, X, y, cv=10, scoring='neg_root_mean_squared_error').mean()

    #10-fold cross-validated r_square
    r2 = cross_val_score(model, X, y, cv=10, scoring='r2').mean()

    #adjusted r_square penalises r_square for the number of predictors
    adj_r2=np.round(r2-((X.shape[1])-1)/(len(X)-(X.shape[1]))*(1-r2),2)

    #print the metrics for this model
    print('Model',model_number,':\nRMSE:',np.round(RMSE,2),'\nr_square:',np.round(r2,3),'\nAdjusted r2:',adj_r2)

In [518…
X_train=X_train.drop(['date'],axis=1)

Decision Tree (Target =[ price ], Predictors=All the features)


In [534…
#fitting decision tree model 1

model = DecisionTreeRegressor(max_depth=5, min_samples_leaf=0.05, random_state=100)

model_house_DT(X_train,y_train[['price']],model)

Model None :

RMSE: 256317.78

r_square: 0.538

Adjusted r2: 0.54

Decision Tree (Target =[ log_price ], Predictors=All the features)


In [535…
#fitting decision tree model 2

model = DecisionTreeRegressor(max_depth=5, min_samples_leaf=0.05, random_state=100)

model_house_DT(X_train,y_train[['log_price']],model)

Model None :

RMSE: 0.29

r_square: 0.699

Adjusted r2: 0.7

The decision tree gives better results with the transformed target variable log_price than with the original variable price.
The r_square for the decision tree is 0.70, which is slightly lower than that of the multiple linear regression models selected above (0.75 and 0.76).
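The two trees above use fixed settings (max_depth=5, min_samples_leaf=0.05). Since the task allows hyperparameter tuning, one possible next step is a cross-validated grid search. The sketch below assumes scikit-learn's GridSearchCV; the grid values and the synthetic data are illustrative assumptions, not tuned settings or results for this dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the King County dataset).
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 10, size=(400, 3))
demo_y = np.sin(demo_X[:, 0]) + 0.5 * demo_X[:, 1] + rng.normal(scale=0.2, size=400)

# Hypothetical grid; the ranges are illustrative only.
param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 5, 20],
}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=100),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(demo_X, demo_y)

best_params = search.best_params_
best_rmse = -search.best_score_  # scorer is negated RMSE, so flip the sign
print(best_params, round(best_rmse, 3))
```

On the real data, X_train and y_train[['log_price']] would take the place of demo_X and demo_y, and search.best_estimator_ could then be passed to model_house_DT.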

7) Model testing -
Consider the best competing models and test their performances on the test data. Report the results.

In [370…
#Read the test data

test = pd.read_csv("kc_house_test_data.csv")

In [371…
#X_test and y_test data

X_test=test[['id','bedrooms', 'bathrooms', 'sqft_living',

'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',

'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',

'lat', 'long', 'sqft_living15', 'sqft_lot15']]

y_test= test[['price']]

Adding log_price transformed feature

In [372…
#taking log of price

log_price = np.log(y_test.price)

In [373…
y_test.insert(0,'log_price',log_price) #inserting new feature

In [374…
y_test.head(1)

Out[374… log_price price

0 12.68541 323000.00000

Adding house_age new feature

In [375…
house_age = 2015 - (X_test.yr_built) #creating new feature house age

In [376…
X_test.insert(0,'house_age',house_age) #adding house age to data
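The test set needs exactly the same derived features as the training set (log_price and house_age). One way to keep the two in sync is a single helper applied to both frames. The sketch below is an assumption about how that could look; add_features and reference_year are hypothetical names, and the two demo rows reuse values from the first rows of the training data shown earlier.

```python
import numpy as np
import pandas as pd


def add_features(df, reference_year=2015):
    """Return a copy of df with house_age and (if price exists) log_price."""
    out = df.copy()  # work on a copy so the input frame is left unchanged
    out["house_age"] = reference_year - out["yr_built"]
    if "price" in out.columns:
        out["log_price"] = np.log(out["price"])
    return out


# Tiny illustrative frame using the dataset's column names.
demo = pd.DataFrame({"yr_built": [1965, 2001], "price": [604000.0, 1225000.0]})
prepared = add_features(demo)
print(prepared)
```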

Testing Models:
Model 9 : Target = [ Price ] , Predictors = ['sqft_living','lat','grade','house_age','view','waterfront','bathrooms']

In [546…
#training the model on the train data (previously we only used the cross val function)

X=X_train[['sqft_living','lat','grade','house_age','view','waterfront','bathrooms']]

y=y_train[['log_price']]

model=LinearRegression()

model.fit(X,y)

Out[546… LinearRegression()

In [547…
#testing model

test=X_test[['sqft_living','lat','grade','house_age','view','waterfront','bathrooms']]

pred_log=model.predict(test)

pred=np.exp(pred_log)

In [548…
#Residual distribution showing normal distribution

fig = plt.figure(figsize=(5,4))

sns.histplot(np.array(y_test[['log_price']])-pred_log)

Out[548… <AxesSubplot:ylabel='Count'>

In [575…
#coefficients

coeff_df = pd.DataFrame(model.coef_.reshape(7,),test.columns,columns=['Coefficient'])

coeff_df

Out[575… Coefficient

sqft_living 0.00018

lat 1.31417

grade 0.19035

house_age 0.00369

view 0.06263

waterfront 0.42935

bathrooms 0.07760
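Because the model is fit on log_price, each coefficient in the table above is on the log scale: a one-unit increase in a feature multiplies the predicted price by exp(coefficient), holding the other features fixed. The sketch below converts the Model 9 coefficients into approximate percent changes; pct_change_per_unit is a hypothetical helper name. Note that one unit of lat is a full degree of latitude, which is larger than the spread within the county, so that figure should be read with care.

```python
import math

# Model 9 coefficients on the log_price scale, copied from the table above.
coefficients = {
    "sqft_living": 0.00018,
    "lat": 1.31417,
    "grade": 0.19035,
    "house_age": 0.00369,
    "view": 0.06263,
    "waterfront": 0.42935,
    "bathrooms": 0.07760,
}


def pct_change_per_unit(coef):
    """Percent change in predicted price for a one-unit increase."""
    return (math.exp(coef) - 1.0) * 100.0


effects = {name: pct_change_per_unit(b) for name, b in coefficients.items()}
for name, pct in effects.items():
    print(f"{name}: {pct:.2f}% per unit")
```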

In [585…
#r_square and RMSE

print('RMSE:', np.round(np.sqrt(MSE(np.array(y_test[['price']]), pred)),2))

print('r_square:', np.round(r2_score(np.array(y_test[['price']]), pred),2))

RMSE: 193872.65

r_square: 0.71
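The RMSE and r_square above are computed on the original price scale, after exponentiating the log predictions. The same steps are repeated for each model under test, so they could be wrapped in one helper. The sketch below is an assumption about how that might look using NumPy only; evaluate_on_price_scale is a hypothetical name and the numbers in the demo call are made up.

```python
import numpy as np


def evaluate_on_price_scale(y_price, pred_log):
    """Back-transform log predictions and score them against actual prices."""
    y_price = np.asarray(y_price, dtype=float).ravel()
    pred = np.exp(np.asarray(pred_log, dtype=float).ravel())
    ss_res = np.sum((y_price - pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((y_price - y_price.mean()) ** 2)  # total sum of squares
    rmse = np.sqrt(ss_res / len(y_price))
    return {"rmse": rmse, "r2": 1.0 - ss_res / ss_tot}


# Made-up numbers purely to show the call; not results from the dataset.
actual = np.array([300000.0, 450000.0, 600000.0])
predicted_log = np.log(np.array([310000.0, 440000.0, 590000.0]))
metrics = evaluate_on_price_scale(actual, predicted_log)
print(metrics)
```

With the notebook's variables, the call would be evaluate_on_price_scale(y_test[['price']], pred_log).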

Model 5 : Target = [ Price ] , Predictors = ['sqft_living','lat','grade','house_age','waterfront']

In [576…
#training the model on the train data (previously we only used the cross val function)

X2=X_train[['sqft_living','lat','grade','house_age','waterfront']]

y2=y_train[['log_price']]

model=LinearRegression()

model.fit(X2,y2)

Out[576… LinearRegression()

In [577…
#testing model

test2=X_test[['sqft_living','lat','grade','house_age','waterfront']]

pred_log2=model.predict(test2)

pred2=np.exp(pred_log2)

In [578…
#Residual distribution showing normal distribution

fig = plt.figure(figsize=(5,4))

sns.histplot(np.array(y_test[['log_price']])-pred_log2)

Out[578… <AxesSubplot:ylabel='Count'>

In [583…
#coefficients

coeff_df = pd.DataFrame(model.coef_.reshape(5,),test2.columns,columns=['Coefficient'])

coeff_df

Out[583… Coefficient

sqft_living 0.00022

lat 1.30541

grade 0.20096

house_age 0.00345

waterfront 0.62825

In [584…
#r_square and RMSE

print('RMSE:', np.sqrt(MSE(np.array(y_test[['price']]), pred2)))

print('R_square:', np.round(r2_score(np.array(y_test[['price']]), pred2),3))

RMSE: 197251.55678376346

R_square: 0.697

RESULTS:
Model 9 has a lower RMSE (193872.65 vs 197251.56) and a higher r^2 (0.71 vs 0.697) than Model 5 on the test data, so Model 9 is the best-performing model.
In Model 9, the features ['sqft_living','lat','grade','house_age','view','waterfront','bathrooms'] explain 71% of the variation in house prices.

https://youtu.be/b1YDtGr8bqU

https://youtu.be/8uAom0sYNKY
