House Price Prediction
In this problem, we are given a dataset that records the prices of 9,761 houses in King County, Washington, US. The prices are recorded along with other attributes such as the area of the house, the number of
bedrooms, the number of bathrooms, etc. We are required to do the following tasks:
In [588…
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
In [589…
#Read the train data
data = pd.read_csv("kc_house_train_data.csv")
data.head(5)
Out[589… id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_li
0 2487200875 20141209T000000 604000.00000 4 3.00000 1960 5000 1.00000 0 0 5 7 1050 910 1965 0 98136 47.52080 -122.39300
1 7237550310 20140512T000000 1225000.00000 4 4.50000 5420 101930 1.00000 0 0 3 11 3890 1530 2001 0 98053 47.65610 -122.00500
2 9212900260 20140527T000000 468000.00000 2 1.00000 1160 6000 1.00000 0 0 4 7 860 300 1942 0 98115 47.69000 -122.29200
3 114101516 20140528T000000 310000.00000 3 1.00000 1430 19901 1.50000 0 0 4 7 1430 0 1927 0 98028 47.75580 -122.22900
4 6054650070 20141007T000000 400000.00000 3 1.75000 1370 9680 1.00000 0 0 4 7 1370 0 1977 0 98074 47.61270 -122.04500
waterfront: Categorical variable, with 0 indicating no waterfront and 1 indicating a waterfront near the house
Target Variable
House prices usually increase with the size of the house. So, does an increase in sqft_living or sqft_lot area show an increase in price?
The bigger the house, the higher the price. Do the number of floors, bedrooms and bathrooms have any effect on price?
House prices are often seen to depend on the age of the house, with older houses getting lower prices. Does our data support this?
Does the condition of the house have any effect on the price? Are houses with a high condition rating (good condition) priced higher than houses with a low condition rating (bad condition)?
Do people pay higher prices to live near a waterfront? People usually like to live close to nature.
Better-constructed houses last for many years. Does grade (construction quality of the house) have any effect on price? Do houses with a lower grade get lower prices?
In [591…
#Shape of train data (no. of rows, no. of columns)
data.shape
In [592…
#info of train data to check dtype and null values
data.info()
<class 'pandas.core.frame.DataFrame'>
In [593…
#Checking for Null values
sum(data.isnull().sum())
Out[593… 0
In [594…
continuous_cols=data[['price', 'bathrooms',
'sqft_living','sqft_lot', 'floors',
In [595…
#Describing the data
data.describe()
Out[595… id price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode
count 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 9761.00000 976
mean 4605288287.66919 542734.95164 3.37588 2.11718 2086.73415 15215.26063 1.48607 0.00840 0.24803 3.41553 7.66151 1793.29116 293.44299 1970.79951 86.06659 98077.79019 4
std 2876044376.02698 379527.63854 0.96070 0.77397 927.19430 41266.73460 0.53232 0.09127 0.78788 0.65055 1.18268 835.76382 442.61272 29.24001 405.41737 53.20359
min 1200019.00000 80000.00000 0.00000 0.00000 290.00000 520.00000 1.00000 0.00000 0.00000 1.00000 1.00000 290.00000 0.00000 1900.00000 0.00000 98001.00000 4
25% 2126049290.00000 320000.00000 3.00000 1.75000 1420.00000 5100.00000 1.00000 0.00000 0.00000 3.00000 7.00000 1190.00000 0.00000 1951.00000 0.00000 98033.00000 4
50% 3905040800.00000 450000.00000 3.00000 2.25000 1910.00000 7642.00000 1.50000 0.00000 0.00000 3.00000 7.00000 1570.00000 0.00000 1975.00000 0.00000 98065.00000 4
75% 7338402850.00000 649000.00000 4.00000 2.50000 2570.00000 10660.00000 2.00000 0.00000 0.00000 4.00000 8.00000 2230.00000 570.00000 1996.00000 0.00000 98117.00000 4
max 9900000190.00000 7700000.00000 33.00000 8.00000 12050.00000 1651359.00000 3.50000 1.00000 4.00000 5.00000 13.00000 8860.00000 3480.00000 2015.00000 2015.00000 98199.00000 4
QUESTIONS
Price
How is the price distributed?
The plots below show that prices are right-skewed; a few houses are priced much higher than the rest and appear as outliers at the right end of the distribution.
From the description of the data, the average house price is about 542K, with a minimum of 80K and a maximum of 7,700K (7.7 million) dollars. More than 50% of the houses fall below the average (the median is 450K).
Since the price is heavily right-skewed and has no zero or negative values, we can take the log transformation to reduce the skewness.
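The effect of the log transformation on skewness can be checked with a number as well as with a plot. The sketch below is only an illustration: it uses synthetic log-normal "prices" as a stand-in for data.price (an assumption, since the data file is not reproduced here) and compares the sample skewness before and after taking the log with pandas' Series.skew().

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (assumption: a stand-in for data.price)
rng = np.random.default_rng(0)
price = pd.Series(np.exp(rng.normal(loc=13.0, scale=0.5, size=5000)))

raw_skew = price.skew()          # sample skewness of the raw prices
log_skew = np.log(price).skew()  # sample skewness after the log transform

print("skewness of price:     ", round(raw_skew, 2))
print("skewness of log(price):", round(log_skew, 2))
```

On the notebook's own frame the same two calls would be data.price.skew() and np.log(data.price).skew(); a value close to zero after the transform indicates a roughly symmetric distribution.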
In [596…
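#plotting the graphs
fig = plt.figure(figsize=(10,5))
plt.subplot(221)
plt.title('Price Distribution')
sns.histplot(data=data.price) #histogram
plt.subplot(223)
plt.title('Price Boxplot')
sns.boxplot(data=data.price, orient='h')
plt.subplot(222)
plt.title('Log_Price Distribution')
sns.histplot(data=np.log(data.price)) #histogram
plt.subplot(224)
plt.title('Log_Price Boxplot')
sns.boxplot(data=np.log(data.price), orient='h')
plt.tight_layout()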
In [597…
min(data.price),max(data.price)
In [598…
iqr = data.price.quantile(0.75)- data.price.quantile(0.25)
In [599…
#There are 522 houses priced above the upper boundary (the outlier houses)
Out[599… 522
In [600…
#OUTLIERS below lb
Out[600… 0
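The cells above report how many prices fall outside the IQR boundaries, but the counting code itself is not shown. A minimal sketch of such a count follows, assuming the conventional 1.5 x IQR fences; the multiplier, the helper name count_iqr_outliers and the toy numbers are assumptions for illustration, not the notebook's own code.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series, k: float = 1.5):
    """Return (n_below, n_above) relative to the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int((values < lower).sum()), int((values > upper).sum())

# Toy example (made-up numbers, not the King County data)
toy = pd.Series([100, 110, 120, 130, 140, 150, 1000])
n_below, n_above = count_iqr_outliers(toy)
print(n_below, n_above)
```

Applied to the notebook's frame, the call would be count_iqr_outliers(data.price).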
sqft_living area
Univariate analysis on living area
The distribution is right-skewed, similar to the price distribution. The average living area is about 2,087 sq.ft.
Since sqft_living is heavily right-skewed and has no zero or negative values, we can take the log transformation to reduce the skewness.
In [601…
#Plotting a distribution graph
fig = plt.figure(figsize=(10,5))
plt.subplot(221)
plt.title('Living area distribution')
sns.histplot(data=data.sqft_living)
plt.subplot(223)
plt.title('Living area boxplot')
sns.boxplot(data=data.sqft_living, orient='h')
plt.subplot(222)
plt.title('Log living area distribution')
sns.histplot(data=np.log(data.sqft_living))
plt.tight_layout()
The scatter plot suggests a linear relation between price and living area: as living area increases, price increases.
In [602…
#Price vs sqft_living
fig = plt.figure(figsize=(10,4))
plt.subplot(121)
plt.title('price vs sqft_living')
sns.scatterplot(x=data.sqft_living, y=data.price)
plt.subplot(122)
plt.title('Log_price vs Log_sqft_living')
sns.scatterplot(x=np.log(data.sqft_living), y=np.log(data.price))
sqft_lot
Univariate analysis on sqft_lot using log transformation, as the data is highly skewed
The range of the lot area is wide and the data is skewed.
In [603…
#The extreme values lie at the far ends of the distribution
min(data.sqft_lot),max(data.sqft_lot)
In [604…
# Distribution plot
fig = plt.figure(figsize=(6,6))
plt.subplot(211)
sns.histplot(data=np.log(data.sqft_lot))
plt.subplot(212)
sns.boxplot(data=np.log(data.sqft_lot), orient='h')
plt.tight_layout()
The scatter plot suggests no linear relation between price and lot area, contrary to our hypothesis.
In [605…
#Price vs sqft_lot
sns.scatterplot(x=np.log(data.sqft_lot), y=np.log(data.price))
bedrooms
Q. Houses with how many bedrooms sold most? Is there any increase in price with the number of bedrooms?
We see that most houses have 2, 3 or 4 bedrooms, and the average price increases slightly with the number of bedrooms up to 6 bedrooms.
In [606…
# graphs for bedrooms
fig = plt.figure(figsize=(10,4))
plt.subplot(121)
plt.title('Countplot_bedrooms')
sns.countplot(x='bedrooms',data=data)
plt.subplot(122)
plt.title('Boxplot_bedrooms')
sns.boxplot(x=data.bedrooms, y=data.price)
bathroom
Q. Houses with how many bathrooms sold most? Is there any increase in price with the number of bathrooms?
We can see that houses with 2 bathrooms sold most, followed by 3 and 1 bathrooms. The price of the house increases with the number of bathrooms, and many houses with 2, 3 or 4 bathrooms sold well above the upper boundary.
In [607…
#Graphs for bathroom
fig = plt.figure(figsize=(15,4))
plt.subplot(131)
plt.title('bathrooms')
sns.countplot(x=np.round(data.bathrooms))
plt.subplot(132)
plt.title('price vs bathrooms')
sns.boxplot(x=np.round(data.bathrooms), y=(data.price))
plt.subplot(133)
plt.title('log_price vs bathrooms')
sns.boxplot(x=np.round(data.bathrooms), y=np.log((data.price)))
floors
Q. Houses with how many floors sold most? Is there any increase in price with the number of floors?
Houses with 1 and 2 floors sold most, but there is no significant change in price due to the number of floors.
In [608…
#graphs for floors
fig = plt.figure(figsize=(10,4))
plt.subplot(121)
plt.title('floors')
sns.countplot(x=np.round(data.floors))
plt.subplot(122)
plt.title('price vs floors')
sns.boxplot(x=np.round(data.floors), y=data.price)
waterfront
Q. How many houses are near a waterfront? Does it have any effect on price?
Most of the houses are not near a waterfront; less than 1% of the houses are. The data suggests that the average price of a house near a waterfront is higher than that of a non-waterfront house, but there are few
houses near a waterfront.
With so few waterfront houses, the data does not show a general bias among buyers towards them.
In [609…
#graphs for waterfront
fig = plt.figure(figsize=(10,4))
plt.subplot(121)
plt.title('Univariate_waterfront')
sns.countplot(x='waterfront',data=data)
plt.subplot(122)
plt.title('Bivariate_waterfront')
sns.boxplot(x=data.waterfront, y=data.price)
In [610…
# % of houses near waterfront is 0.84%
(data.waterfront[data.waterfront==1].sum()/len(data))*100
Out[610… 0.8400778608749104
view
Q. Which houses sold most? Any effect on price?
Houses with a view rating of 0 sold most. The average price changes slightly with the view rating, but not significantly.
In [611…
#graph for view
fig = plt.figure(figsize=(10,4))
plt.subplot(121)
plt.title('view')
sns.countplot(x='view',data=data)
plt.subplot(122)
plt.title('price vs view')
sns.boxplot(x='view', y='price', data=data)
In [612…
# % number of houses with no views
(data.view[data.view==0].value_counts())/len(data)*100
Out[612… 0 89.62196
condition
Q. Houses in which condition sold most? Any effect on price?
Houses with a condition rating of 3, 4 or 5 sold most. But the data does not suggest that the condition rating has much effect on the average price of the house; the average price remains more or less the same.
In [613…
#graph for condition
fig = plt.figure(figsize=(10,4))
plt.subplot(121)
plt.title('condition')
sns.countplot(x='condition',data=data)
plt.subplot(122)
plt.title('price vs condition')
sns.boxplot(x=data.condition, y=data.price)
grade
Q. Houses of which grade sold most? Any effect on price?
Houses with grade 6, 7, 8 or 9 sold most. The grade of the house shows a strong relation with price: as the grade increases, the price increases roughly exponentially.
In [614…
#graph for grade
fig = plt.figure(figsize=(15,4))
plt.subplot(131)
plt.title('grade')
sns.countplot(x='grade',data=data)
plt.subplot(132)
plt.title('price vs grade')
sns.boxplot(x=data.grade, y=data.price)
plt.subplot(133)
plt.title('log_price vs grade')
sns.boxplot(x=data.grade, y=np.log(data.price))
sqft_above
Q. What is the distribution of sqft_above?
The distribution is right-skewed, similar to the price distribution, with a mean of about 1,793 sq.ft.
In [615…
#Graph for sqft_above
fig = plt.figure(figsize=(6,6))
plt.subplot(211)
plt.title('sqft_above')
sns.histplot(data=data.sqft_above)
plt.subplot(212)
plt.title('sqft_above Boxplot')
sns.boxplot(data=data.sqft_above, orient='h')
plt.tight_layout()
House prices seem to follow a moderately linear relation with sqft_above.
In [616…
#Price vs sqft_above
sns.scatterplot(x=np.log(data.sqft_above), y=np.log(data.price))
plt.title('Log_price vs Log_sqft_above')
yr_built vs yr_renovated
This shows that the renovated houses are old, built before the year 2000, and have not seen much of a price increase from renovation compared to other houses built at the same time.
In [617…
#Scatter plot for yr_built and yr_renovated
Correlation heatmap
This shows the correlation matrix for the variables present in the data set.
In [618…
#heatmap
plt.figure(figsize = (18,10))
sns.heatmap(data.select_dtypes('number').corr(),cmap='coolwarm',annot=True, linewidths=.5) #numeric columns only; 'date' is a string
Out[618… <AxesSubplot:>
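A heatmap of this size can be hard to read for a single target. As a complement, the sketch below ranks the numeric columns by the absolute value of their Pearson correlation with the target. The helper name rank_by_correlation and the synthetic frame are assumptions for illustration; only the column names mimic the notebook.

```python
import numpy as np
import pandas as pd

def rank_by_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numeric column with `target`, strongest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Synthetic frame (assumption: made-up values, notebook-like column names)
rng = np.random.default_rng(1)
sqft = rng.uniform(500, 4000, 300)
toy = pd.DataFrame({
    "sqft_living": sqft,
    "noise": rng.normal(size=300),
    "date": ["20140101T000000"] * 300,   # non-numeric column, skipped by select_dtypes
    "price": 200 * sqft + rng.normal(scale=50_000, size=300),
})
ranking = rank_by_correlation(toy, "price")
print(ranking)
```

On the notebook's frame the call would be rank_by_correlation(data, 'price').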
Summary of EDA
Predictors that may affect the Target
sqft_living
grade
bathrooms
sqft_above (highly correlated with sqft_living and can be dropped)
sqft_living15 (highly correlated with sqft_living, can be dropped)
view (less significant)
bedroom (less significant)
waterfront (less significant)
lat (less significant)
id
date
sqft_lot
floor
condition
yr_built
yr_renovated
zipcode
long
The house price increases with the size of the house, i.e. the living area.
Houses with more bathrooms and bedrooms see higher prices.
People pay higher prices for better-built houses, i.e. houses with a higher grade.
Houses with views get slightly higher prices.
Most houses are not near a waterfront, but prices are higher for those that are.
The lot area of the house does not affect the price. People prefer a bigger living area over a bigger lot area.
An increased number of floors does not show an increase in price.
The condition of the house does not affect the price.
House prices are not very dependent on the year built or the year of renovation.
Creating a function called model_house(predictors, target, model number) which reports RMSE, r_square and adjusted r_square
The function takes the predictors, the target variable and a model number. It runs 10-fold cross-validation to calculate RMSE and r_square, and also calculates the adjusted r_square.
In [619…
#Function to calculate RMSE and r_square using linear regression and cross validation
def model_house(X,y,model_number=None):
    adj_r2=np.round(1-(1-r2)*(len(X)-1)/(len(X)-X.shape[1]-1),2) #adjusted r2 for n=len(X) rows and k=X.shape[1] predictors
    print('Model',model_number,':\nRMSE:',np.round(RMSE,4),'\nr_square:',np.round(r2,2),'\nAdjusted r2:',adj_r2)
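The body of model_house is only partly visible in this export, so the lines that compute RMSE and r2 cannot be read off. The sketch below shows one way such a function could be written with scikit-learn's cross_val_score and 10 folds. The function name cv_linear_metrics, the scoring strings and the synthetic data are assumptions, not the notebook's code, and it is not guaranteed to reproduce the numbers reported in the following cells.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_linear_metrics(X, y, folds=10):
    """Mean cross-validated RMSE and R^2 of a linear regression, plus adjusted R^2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    model = LinearRegression()
    # scikit-learn returns the negated RMSE so that "higher is better"
    rmse = -cross_val_score(model, X, y, cv=folds,
                            scoring="neg_root_mean_squared_error").mean()
    r2 = cross_val_score(model, X, y, cv=folds, scoring="r2").mean()
    n, k = X.shape  # n rows, k predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return rmse, r2, adj_r2

# Synthetic check (made-up data): y depends linearly on the first column only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=500)
rmse, r2, adj_r2 = cv_linear_metrics(X_demo, y_demo)
print(round(rmse, 3), round(r2, 3), round(adj_r2, 3))
```

Returning the three values instead of only printing them makes it easier to collect the results of several feature sets in one table.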
In [620…
#X_train and y_train data
X_train = data.drop(['price'],axis=1)
y_train = data[['price']].copy()
We use sqft_living as the predictor for our first SLR model, as it has the highest correlation with the target price.
In [621…
model_house(X_train[['sqft_living']],np.array(y_train),1)
Model 1 :
RMSE: 268686.2432
r_square: 0.49
In [623…
model_house(X_train[['bathrooms']],np.array(y_train),1)
Model 1 :
RMSE: 321723.1503
r_square: 0.27
Next, we check whether grade (quality of construction) together with living area improves the performance of the model.
In [624…
X=X_train[['sqft_living','grade']] #Predictor
y=np.array(y_train) #Target
model_house(X,y,2)
Model 2 :
RMSE: 258217.7122
r_square: 0.53
We observed a strong exponential relation between price and bathrooms in the EDA. We add bathrooms to sqft_living to see if the model's performance improves.
In [625…
X=X_train[['sqft_living','bathrooms']] #Predictor
y=np.array(y_train) #Target
model_house(X,y,3)
Model 3 :
RMSE: 268676.5899
r_square: 0.49
Considering waterfront and view together with sqft_living. Waterfront and view are not directly related to the attributes of the house (like the number of rooms or the condition), and their value could be subjective from person to person.
In [487…
X=X_train[['sqft_living','view','waterfront']] #Predictor
y=np.array(y_train) #Target
model_house(X,y,4)
Model 4 :
RMSE: 252037.9863
r_square: 0.55
In [488…
X=X_train[['sqft_living','lat','long']] #Predictor
y=np.array(y_train) #Target
model_house(X,y,5)
Model 5 :
RMSE: 245937.9951
r_square: 0.57
Results
After fitting a couple of initial models, we see that sqft_living, being the most highly correlated feature, contributes the most to the performance of the model. Keeping sqft_living as the baseline and adding a few combinations of features, we tried to
evaluate the features that could improve the performance of the model.
The number of bathrooms, when used along with sqft_living, does not contribute enough to improve the r_square (it stays at 0.49).
View and waterfront with sqft_living improve the r_square from 0.49 to 0.55.
The location of the house ('lat','long') contributes to the model performance, raising the r_square to 0.57.
4) Feature engineering -
(A) Suggest some possible feature transformations (like log(X), sqrt(X), X^2, X1*X2, etc.) with reasons, which you believe could have improved the performances of the previous models. Experiment to check if such
transformations actually help to do so. (B) Suggest some new feature generation techniques (e.g. creating dummy variables, using one-hot encoding, or transforming an existing feature into a new feature, as you may convert
the variable 'year built' to the 'age of the house'). Check if such transformations help in improving the performances of the models. Report the RMSE and R-squared values in each case.
A) Log Transformation:
The dataset has a lot of outliers, which cause the data to be right-skewed for a few parameters. The effect of the outliers can be reduced using a log transformation, which reduces or removes the skewness of the original data.
A log transformation cannot be used on data with zero or negative observations.
The distribution of the target price is right-skewed, so we can use a log transformation. The predicted values will then be the log of price, which needs to be back-transformed with exp(log_price).
The sqft_living feature is right-skewed due to outliers, so we try a log transformation to check if it helps.
1. Price to Log_price
In [489…
log_price = np.log(y_train.price) #taking log of price
In [490…
y_train.insert(0,'log_price',log_price) ##INSERTING new column log_price at 0 index
In [491…
y_train.head(1) #added log_price
0 13.31133 604000.00000
In [492…
model_house(X_train[['sqft_living','grade']],y_train[['price']],'without_log')
print('--------------------------------')
model_house(X_train[['sqft_living','grade']],y_train[['log_price']],'with_log_price')
Model without_log :
RMSE: 258217.7122
r_square: 0.53
--------------------------------
Model with_log_price :
RMSE: 0.3509
r_square: 0.56
With the log of price as target, the r_square increases from 0.53 to 0.56 for this combination of features. The two RMSE values are on different scales (dollars versus log of dollars), so they cannot be compared directly.
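To compare a model trained on log_price with one trained on price, the log-scale predictions can be back-transformed with exp() before the RMSE is computed. The sketch below illustrates this with made-up prices and made-up prediction errors; the helper name rmse and all numbers are assumptions, not results from the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices and log-scale predictions (not taken from the notebook)
price = np.array([300_000.0, 450_000.0, 650_000.0, 1_200_000.0])
pred_log = np.log(price) + np.array([0.10, -0.05, 0.02, -0.20])

rmse_log = rmse(np.log(price), pred_log)      # error measured in log-dollars
rmse_dollars = rmse(price, np.exp(pred_log))  # error measured in dollars, after exp()
print(rmse_log, rmse_dollars)
```

Only the second number is on the same scale as the RMSE of a model fitted directly to price.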
2. sqft_living to log_sqft_living
In [493…
log_sqft_living = np.log(X_train.sqft_living) #Taking log of sq ft living
In [494…
#INSERTING new column log_sqft_living at 0 index
X_train.insert(0,'log_sqft_living',log_sqft_living)
In [495…
X_train.head(1)
Out[495… log_sqft_living id date bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_li
0 7.58070 2487200875 20141209T000000 4 3.00000 1960 5000 1.00000 0 0 5 7 1050 910 1965 0 98136 47.52080 -122.39300
In [496…
model_house(X_train[['sqft_living','lat','view','grade']],y_train[['price']],'without_log')
print('--------------------------------')
model_house(X_train[['log_sqft_living','lat','view','grade']],y_train[['price']],'with_log_sqft_living')
Model without_log :
RMSE: 229000.1176
r_square: 0.63
--------------------------------
Model with_log_sqft_living :
RMSE: 244648.0592
r_square: 0.58
Using log sqft_living does not improve the model: the r_square drops from 0.63 to 0.58 and the RMSE increases.
In [497…
bedrooms_sqrt= np.sqrt(X_train.bedrooms) #taking sqrt of bedrooms
In [498…
X_train.insert(0,'bedrooms_sqrt',bedrooms_sqrt)
In [586…
model_house(X_train[['sqft_living','bedrooms']],y_train[['price']],'with_bedrooms') #model with bedrooms
print('--------------------------------')
Model with_bedrooms :
RMSE: 265640.7144
r_square: 0.5
--------------------------------
Model with_bedrooms_sqrt :
RMSE: 359968.6937
r_square: 0.09
In [501…
bathrooms_sqrt= np.sqrt(X_train.bathrooms) #taking sqrt of bathrooms
X_train.insert(0,'bathrooms_sqrt',bathrooms_sqrt)
In [502…
model_house(X_train[['bathrooms']],y_train[['price']],'with_bathrooms')
print('--------------------------------')
model_house(X_train[['bathrooms_sqrt']],y_train[['price']],'with_bathrooms_sqrt')
Model with_bathrooms :
RMSE: 321723.1503
r_square: 0.27
--------------------------------
Model with_bathrooms_sqrt :
RMSE: 329842.3662
r_square: 0.23
In [504…
house_age = 2015 - (X_train.yr_built) #creating new feature house age
X_train.insert(0,'house_age',house_age)
In [505…
model_house(X_train[['sqft_living']],y_train[['price']],'without_new feature house_age')
print('--------------------------------')
model_house(X_train[['sqft_living','house_age']],y_train[['price']],'with_house_age')
Model without_new feature house_age :
RMSE: 268686.2432
r_square: 0.49
--------------------------------
Model with_house_age :
RMSE: 259925.0546
r_square: 0.52
The r2 increases from 0.49 to 0.52 after adding the new feature, the age of the house.
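The task statement also mentions dummy variables and one-hot encoding, which are not tried in the cells shown here. As a hedged illustration of both feature-generation ideas, the sketch below derives house_age the same way as above and one-hot encodes zipcode with pandas.get_dummies. Treating zipcode as a category is a suggestion rather than something the notebook does, and the toy rows are made up.

```python
import pandas as pd

# Toy rows (assumption: made-up sample using the notebook's column names)
toy = pd.DataFrame({
    "yr_built": [1965, 2001, 1942],
    "zipcode": [98136, 98053, 98136],
})

# Age of the house relative to 2015, the latest yr_built in the data description
toy["house_age"] = 2015 - toy["yr_built"]

# One-hot encode zipcode so it is treated as a category rather than a number
dummies = pd.get_dummies(toy["zipcode"], prefix="zip")
features = pd.concat([toy[["house_age"]], dummies], axis=1)
print(features)
```

With a linear model and an intercept, one dummy column is usually dropped (get_dummies has a drop_first option) to avoid perfectly collinear predictors.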
In [506…
X=X_train[['sqft_living']]#Predictor
y=y_train[['log_price']] #Target
model_house(X,y,1)
Model 1 :
RMSE: 0.3786
r_square: 0.49
In [507…
X=X_train[['sqft_living','lat']]#Predictor
y=y_train[['log_price']] #Target
model_house(X,y,2)
Model 2 :
RMSE: 0.3112
r_square: 0.65
In [508…
X=X_train[['sqft_living','lat','grade']]#Predictor
y=y_train[['log_price']] #Target
model_house(X,y,3)
Model 3 :
RMSE: 0.2869
r_square: 0.71
In [509…
X=X_train[['sqft_living','lat','grade','house_age']]#Predictor
y=y_train[['log_price']] #Target
model_house(X,y,4)
Model 4 :
RMSE: 0.273
r_square: 0.73
In [510…
X=X_train[['sqft_living','lat','grade','house_age','waterfront']] #Predictor
y=y_train[['log_price']] #Target
model_house(X,y,5.1)
Model 5.1 :
RMSE: 0.2671
r_square: 0.75
In [511…
X=X_train[['sqft_living','lat','grade','house_age','waterfront','bathrooms']] #Predictor
y=y_train[['log_price']] #Target
model_house(X,y,5.2)
Model 5.2 :
RMSE: 0.2646
r_square: 0.75
In [512…
X=X_train[['sqft_living','lat','grade','house_age','view']] #Predictor
y=y_train[['log_price']] #Target
model_house(X,y,7)
Model 7 :
RMSE: 0.2661
r_square: 0.75
In [513…
X=X_train[['sqft_living','lat','grade','house_age','view','waterfront']] #Predictor
y=y_train[['log_price']] #Target
model_house(X,y,8)
Model 8 :
RMSE: 0.2637
r_square: 0.75
In [514…
X=X_train[['sqft_living','lat','grade','house_age','view','waterfront','bathrooms']] #Predictor
y=y_train[['log_price']] #Target
model_house(X,y,9)
Model 9 :
RMSE: 0.2612
r_square: 0.76
In [515…
X=X_train[['sqft_living','lat','grade','house_age','view','waterfront','bathrooms','condition']] #Predictor
y=y_train[['log_price']] #Target
model_house(X,y,10)
Model 10 :
RMSE: 0.2593
r_square: 0.76
In [516…
X=X_train[['sqft_living','lat','grade','house_age','view','waterfront','bathrooms','condition','floors']] #Predictor
y=y_train[['log_price']] #Target
model_house(X,y,11)
Model 11 :
RMSE: 0.2577
r_square: 0.76
Competing Models:
Models 9, 10 and 11 have the same r_square value (0.76). Since adding variables does not improve performance, we choose the one with the fewest variables, which also reduces the risk of overfitting (Model 9).
Models 5.1, 5.2, 7 and 8 have the same r_square value (0.75); we choose the one with the fewest variables (Model 5.1, referred to as Model 5 below).
Model 5 and Model 9 will be taken forward to the test data.
Defining a function to calculate r_square and RMSE for the decision tree [model_house_DT(X,y,model,model_number=None)]
In [517…
#Creating function to calculate RMSE and r_square for decision tree using cross validation
def model_house_DT(X,y,model,model_number=None):
    adj_r2=np.round(1-(1-r2)*(len(X)-1)/(len(X)-X.shape[1]-1),2) #adjusted r2 for n=len(X) rows and k=X.shape[1] predictors
    print('Model',model_number,':\nRMSE:',np.round(RMSE,2),'\nr_square:',np.round(r2,3),'\nAdjusted r2:',adj_r2)
In [518…
X_train=X_train.drop(['date'],axis=1)
model_house_DT(X_train,y_train[['price']],model)
Model None :
RMSE: 256317.78
r_square: 0.538
model_house_DT(X_train,y_train[['log_price']],model)
Model None :
RMSE: 0.29
r_square: 0.699
The decision tree gives better results with the transformed target variable log_price than with the original variable price.
The r_square for the decision tree is 0.70, which is a little lower than that of the multiple linear regression models we have selected.
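The definition of the tree passed to model_house_DT is not visible in this export. For orientation only, the sketch below cross-validates a scikit-learn DecisionTreeRegressor on synthetic data. The max_depth and min_samples_leaf values are illustrative choices, not the notebook's settings, and the data is made up.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (made up): a step-like target that a tree can capture
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(600, 2))
y_demo = np.where(X_demo[:, 0] > 5, 10.0, 0.0) + rng.normal(scale=0.5, size=600)

# Hyperparameters are illustrative assumptions; limiting depth and leaf size curbs overfitting
tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=10, random_state=0)
r2_scores = cross_val_score(tree, X_demo, y_demo, cv=10, scoring="r2")
tree_r2 = r2_scores.mean()
print(round(tree_r2, 3))
```

An unconstrained tree tends to fit the training folds almost perfectly and generalise worse, so the depth limit is usually worth tuning.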
7) Model testing -
Consider the best competing models and test their performances on the test data. Report the results.
In [370…
#Read the test data
test = pd.read_csv("kc_house_test_data.csv")
In [371…
#X_test and y_test data
X_test = test.drop(['price'],axis=1)
y_test= test[['price']].copy()
In [372…
#taking log of price
log_price = np.log(y_test.price)
In [373…
y_test.insert(0,'log_price',log_price) #inserting new feature
In [374…
y_test.head(1)
0 12.68541 323000.00000
In [375…
house_age = 2015 - (X_test.yr_built) #creating new feature house age
In [376…
X_test.insert(0,'house_age',house_age) #adding house age to data
Testing Models:
Model 9 : Target = [ log_price ] , Predictors = ['sqft_living','lat','grade','house_age','view','waterfront','bathrooms']
In [546…
#training the model on the train data (as previously we have used only the cross val function)
X=X_train[['sqft_living','lat','grade','house_age','view','waterfront','bathrooms']]
y=y_train[['log_price']]
model=LinearRegression()
model.fit(X,y)
Out[546… LinearRegression()
In [547…
#testing model
test=X_test[['sqft_living','lat','grade','house_age','view','waterfront','bathrooms']]
pred_log=model.predict(test)
pred=np.exp(pred_log)
In [548…
#Residual distribution showing normal distribution
fig = plt.figure(figsize=(5,4))
sns.histplot(np.array(y_test[['log_price']])-pred_log)
Out[548… <AxesSubplot:ylabel='Count'>
In [575…
#coefficients
coeff_df = pd.DataFrame(model.coef_.reshape(7,),test.columns,columns=['Coefficient'])
coeff_df
Out[575… Coefficient
sqft_living 0.00018
lat 1.31417
grade 0.19035
house_age 0.00369
view 0.06263
waterfront 0.42935
bathrooms 0.07760
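Because the target of this model is log_price, a coefficient b means that a one-unit increase in the feature multiplies the predicted price by exp(b), with the other features held fixed. The sketch below converts the coefficients from the table above into approximate percent changes; the helper name percent_change is an assumption for illustration.

```python
import math

# Coefficients copied from the table above (target is log_price)
coefficients = {
    "sqft_living": 0.00018,
    "lat": 1.31417,
    "grade": 0.19035,
    "house_age": 0.00369,
    "view": 0.06263,
    "waterfront": 0.42935,
    "bathrooms": 0.07760,
}

def percent_change(coef: float, delta: float = 1.0) -> float:
    """Percent change in predicted price when a feature rises by `delta`, others fixed."""
    return (math.exp(coef * delta) - 1.0) * 100.0

for name, coef in coefficients.items():
    print(f"{name:12s} {percent_change(coef):8.2f} % per unit")

# A full degree of latitude is a large step, so a 0.1-degree step may be easier to read
print("lat, per 0.1 degree:", round(percent_change(coefficients["lat"], 0.1), 2), "%")
```

For example, the waterfront coefficient of 0.42935 corresponds to a predicted price roughly 54% higher for a waterfront house, and one extra grade point to roughly 21% higher.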
In [585…
#r_square and RMSE
RMSE: 193872.65
r_square: 0.71
In [576…
#training the model on the train data (as previously we have used only the cross val function)
X2=X_train[['sqft_living','lat','grade','house_age','waterfront']]
y2=y_train[['log_price']]
model=LinearRegression()
model.fit(X2,y2)
Out[576… LinearRegression()
In [577…
#testing model
test2=X_test[['sqft_living','lat','grade','house_age','waterfront']]
pred_log2=model.predict(test2)
pred2=np.exp(pred_log2)
In [578…
#Residual distribution showing normal distribution
fig = plt.figure(figsize=(5,4))
sns.histplot(np.array(y_test[['log_price']])-pred_log2)
Out[578… <AxesSubplot:ylabel='Count'>
In [583…
#coefficients
coeff_df = pd.DataFrame(model.coef_.reshape(5,),test2.columns,columns=['Coefficient'])
coeff_df
Out[583… Coefficient
sqft_living 0.00022
lat 1.30541
grade 0.20096
house_age 0.00345
waterfront 0.62825
In [584…
#r_square and RMSE
RMSE: 197251.55678376346
R_square: 0.697
RESULTS:
Model 9 has a lower RMSE (193,872.65 vs 197,251.56) and a higher r^2 (0.71 vs 0.697) than Model 5, so Model 9 is the best performing model.
In Model 9, the features ['sqft_living','lat','grade','house_age','view','waterfront','bathrooms'] are able to explain 71% of the variation in house prices.
https://youtu.be/b1YDtGr8bqU
https://youtu.be/8uAom0sYNKY