Analyze House Price For King County
Business Problem: What is the predicted sale value of a home in King County?
Data Source: The most significant obstacle I encountered was cleaning the original dataset.
I first planned to use data from data.boston.gov, but quickly realized that this dataset
needed a large amount of work, did not have enough variables, and was old. I chose a new,
already cleaned dataset to work on instead of the Boston city dataset. The new dataset covers
King County in Washington state.
The new dataset is available at https://www.kaggle.com/harlfoxem/housesalesprediction/data.
Preliminary Work: Planned Analysis: I build a simple linear regression model and a multiple
linear regression model to predict the value of houses in King County, WA.
Before fitting the models, I run a graphical analysis of the relevant variables.
Displaying the information gives insight into which variables may be correlated, with the primary
goal of determining whether these variables impact each other or just move in the same direction.
Risk assessment: Predictive analytics projects carry risks that depend on the type of
analysis being done. These affect the ability to assess the available data and to judge whether the
selected variables actually influence home prices. Overfitting is a major potential issue: with
so many different variables, it would be easy to assume we can predict an exact amount. It
will be important to recognize trends driven by factors such as geographic location in the city,
the population of the city, and forecasted jobs. Accurately predicting prices
relies on a lot of macroeconomic information, which rests on its own assumptions and may be
difficult to include in our analysis. Data will be plentiful, but it will be essential to identify
the critical data that will drive the report.
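One common way to check for the overfitting risk described above is to hold out part of the data and compare the error on the training and test portions. The sketch below uses simulated data rather than the King County file; the simulated values, the 800/200 split, and the `rmse` helper name are assumptions made only for illustration.

```r
# Simulated stand-in for the housing data (values are made up for illustration)
set.seed(42)
n <- 1000
sim <- data.frame(sqft_living = round(runif(n, 500, 5000)))
sim$price <- -30000 + 270 * sim$sqft_living + rnorm(n, sd = 240000)

# Hold out 200 of the 1000 rows as a test set
test_idx <- sample(n, size = 200)
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

# Fit the regression on the training rows only
fit <- lm(price ~ sqft_living, data = train)

# Root mean squared error helper (hypothetical name)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

train_rmse <- rmse(train$price, predict(fit, newdata = train))
test_rmse  <- rmse(test$price,  predict(fit, newdata = test))

# A test error far above the training error would suggest overfitting
c(train = train_rmse, test = test_rmse)
```

With a single predictor the two errors should be close; the comparison becomes more informative as more variables are added to the model.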
Home prices are continually shifting. Not knowing whether the price of a potential
house is going to increase makes it extremely challenging for future homeowners to plan
and to be ready in time and financially. This factor plays into the role of influencing the
Predicting house prices is not an easy task. In fact, it is very challenging and requires
exploring many factors to arrive at the findings. For this project, we are
using a single dataset from Kaggle, ‘kc_house_data.csv’. This dataset contains house sale prices
for King County, which includes Seattle, between May 2014 and May 2015. Our team aims to
explore the correlation between home price and other attributes in the dataset to arrive at
the final findings and prediction. Additionally, we are not taking into consideration any potential
unique events in the real estate world or any macroeconomic factors, since they are out of scope and
we don’t have the right resources for this task.
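A minimal sketch of the kind of correlation screening described above: compute the correlation of each attribute with price and rank the attributes by strength. The small `houses` data frame is invented for illustration and is not taken from kc_house_data.csv; only the column names follow the dataset.

```r
# Small invented data frame standing in for kc_house_data (not real sales)
houses <- data.frame(
  price       = c(250000, 400000, 320000, 610000, 480000, 900000),
  bedrooms    = c(2, 3, 3, 4, 3, 5),
  sqft_living = c(900, 1600, 1300, 2400, 1900, 3800),
  floors      = c(1, 2, 1, 2, 1, 2)
)

# Pearson correlation of every other column with price
predictors <- setdiff(names(houses), "price")
cors <- sapply(houses[predictors], function(x) cor(x, houses$price))

# Rank attributes by absolute correlation, strongest first
ranked <- sort(abs(cors), decreasing = TRUE)
ranked
```

On the real data the same pattern would be applied to the numeric columns of `kc_house_data`.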
3.Basic methodology
To answer the project question, what will be the house price for King County, Washington,
I build a simple regression model to predict the value of a house in
King County. We have to understand the variables involved by running a graphical
4.Findings
#Data Summary
> summary(kc_house_data)
id date price bedrooms
Min. :1.000e+06 20140623T000000: 142 Min. : 75000 Min. : 0.000
1st Qu.:2.123e+09 20140625T000000: 131 1st Qu.: 321950 1st Qu.: 3.000
Median :3.905e+09 20140626T000000: 131 Median : 450000 Median : 3.000
Mean :4.580e+09 20140708T000000: 127 Mean : 540182 Mean : 3.371
3rd Qu.:7.309e+09 20150427T000000: 126 3rd Qu.: 645000 3rd Qu.: 4.000
Max. :9.900e+09 20150325T000000: 123 Max. :7700000 Max. :33.000
(Other) :20833
bathrooms sqft_living sqft_lot floors view
Min. :0.000 Min. : 290 Min. : 520 Min. :1.000 Min. :0.0000
1st Qu.:1.750 1st Qu.: 1427 1st Qu.: 5040 1st Qu.:1.000 1st Qu.:0.0000
Median :2.250 Median : 1910 Median : 7618 Median :1.500 Median :0.0000
Mean :2.115 Mean : 2080 Mean : 15107 Mean :1.494 Mean :0.2343
3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10688 3rd Qu.:2.000 3rd Qu.:0.0000
Max. :8.000 Max. :13540 Max. :1651359 Max. :3.500 Max. :4.0000
condition grade sqft_above sqft_basement yr_built
Min. :1.000 Min. : 1.000 Min. : 290 Min. : 0.0 Min. :1900
1st Qu.:3.000 1st Qu.: 7.000 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951
Median :3.000 Median : 7.000 Median :1560 Median : 0.0 Median :1975
Mean :3.409 Mean : 7.657 Mean :1788 Mean : 291.5 Mean :1971
3rd Qu.:4.000 3rd Qu.: 8.000 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997
Max. :5.000 Max. :13.000 Max. :9410 Max. :4820.0 Max. :2015
yr_renovated zipcode lat long sqft_living15
Min. : 0.0 Min. :98001 Min. :47.16 Min. :-122.5 Min. : 399
1st Qu.: 0.0 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490
Median : 0.0 Median :98065 Median :47.57 Median :-122.2 Median :1840
Mean : 84.4 Mean :98078 Mean :47.56 Mean :-122.2 Mean :1987
3rd Qu.: 0.0 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360
Max. :2015.0 Max. :98199 Max. :47.78 Max. :-121.3 Max. :6210
> cor(price,bedrooms)
[1] 0.307069
> cor.test(kc_house_data$price, kc_house_data$sqft_living)
t = 31.397, df = 998, p-value < 2.2e-16
# The very small p-value indicates a statistically significant correlation between the two variables.
95 percent confidence interval:
0.6723072 0.7348024
sample estimates:
cor
0.7049203
> cor.test(kc_house_data$price, kc_house_data$bedrooms)
t = 10.193, df = 998, p-value < 2.2e-16
# The very small p-value indicates a statistically significant correlation between the two variables.
95 percent confidence interval:
0.2498317 0.3621678
sample estimates:
cor
0.307069
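The t statistics reported by cor.test can be reproduced from the correlation and the degrees of freedom with t = r * sqrt(df) / sqrt(1 - r^2). The sketch below plugs in the values from the output above; the helper name `t_from_r` is hypothetical.

```r
# Correlations and degrees of freedom as reported in the output above
r_sqft <- 0.7049203
r_bed  <- 0.307069
df     <- 998

# t statistic for testing whether a Pearson correlation is zero
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

t_sqft <- t_from_r(r_sqft, df)   # about 31.40
t_bed  <- t_from_r(r_bed, df)    # about 10.19

# Two-sided p-values from the t distribution
p_sqft <- 2 * pt(-abs(t_sqft), df)
p_bed  <- 2 * pt(-abs(t_bed), df)
```

Both p-values fall far below the 2.2e-16 threshold that R prints, which is why the output shows only "< 2.2e-16".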
# model 1
Call:
lm(formula = price ~ sqft_living)
Residuals:
Min 1Q Median 3Q Max
-684539 -135676 -25205 96142 2042894
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -32626.173 19195.049 -1.7 0.0895 .
sqft_living 269.652 8.589 31.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> coef(summary(mdl1))
Estimate Std. Error t value Pr(>|t|)
(Intercept) -32626.1735 19195.049377 -1.699718 8.949556e-02
sqft_living 269.6519 8.588552 31.396669 4.717948e-151
> va1[,1]
(Intercept) sqft_living
-32626.1735 269.6519
> va1[,4]
(Intercept) sqft_living
8.949556e-02 4.717948e-151
> anova(mdl1)
Response: price
Df Sum Sq Mean Sq F value Pr(>F)
sqft_living 1 5.7270e+13 5.7270e+13 985.75 < 2.2e-16 ***
Residuals 998 5.7982e+13 5.8098e+10
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] 630.1119 # the square root of the residual Mean Sq in the ANOVA table should equal the residual standard error in the regression summary
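As a worked example of how the model 1 coefficients turn into a prediction, the sketch below evaluates the fitted line for a living area of 1180 sq ft and builds a 95% confidence interval for the slope. The coefficient values are copied from the output above; the helper name `predict_price` and the choice of 1180 sq ft are assumptions for illustration.

```r
# Coefficients of model 1 as reported above
intercept <- -32626.1735
slope     <- 269.6519
slope_se  <- 8.588552
df_resid  <- 998

# Predicted price for a given living area (helper name is hypothetical)
predict_price <- function(sqft_living) intercept + slope * sqft_living

pred_1180 <- predict_price(1180)   # about 285,563

# 95% confidence interval for the slope: estimate +/- t quantile * std. error
t_crit   <- qt(0.975, df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se
```

The prediction for 1180 sq ft agrees with the first fitted value listed in the appendix (285563.11) up to rounding of the coefficients.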
# model 2
> summary(mdl2)
Call:
Residuals:
Coefficients:
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
F-statistic: 103.9 on 1 and 998 DF, p-value: < 2.2e-16
# The very small p-value indicates a statistically significant relationship between the two variables.
> coef(summary(mdl2))
> va2[,1]
(Intercept) bedrooms
110517.5 122414.2
> va2[,4]
(Intercept) bedrooms
7.867344e-03 2.806871e-23
confint(mmdl1, level = 0.95)
summary(mmdl1)
plot(mmdl1)
# ggplot was applied to check for redundancy; see Appendix 1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -32626.173 19195.049 -1.7 0.0895 .
sqft_living 269.652 8.589 31.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interpretation text:
The slope coefficient (sqft_living) comes out to be 269.652, with a t-statistic of 31.4 and a p-value
< 2e-16, so it is statistically significant at any conventional level, since p << 0.01. Each
additional square foot of living area is associated with an increase of about 270 in predicted price.
As the value of Multiple R-squared comes out to be 0.4969, this model explains about 49.7% of the
variation in price.
The p-value is incredibly small, implying that this variable is extremely significant in
determining the price. The moderate R-squared also shows that the square footage of the
home accounts for a substantial share of the price, while about half of the variation remains
unexplained by this single predictor.
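The R-squared of 0.4969 and the F value of 985.75 can both be recovered from the sums of squares in the ANOVA table for model 1. The sketch below is a check of that arithmetic using the figures printed earlier; the variable names are chosen here for illustration.

```r
# Sums of squares and degrees of freedom from the ANOVA table for model 1
ss_model <- 5.7270e13   # sqft_living
ss_resid <- 5.7982e13   # Residuals
df_model <- 1
df_resid <- 998

# R-squared: share of the total variation explained by the model
r_squared <- ss_model / (ss_model + ss_resid)

# F statistic: model mean square over residual mean square
f_stat <- (ss_model / df_model) / (ss_resid / df_resid)

# With a single predictor, R-squared equals the squared correlation
r_squared_from_cor <- 0.7049203^2
```

Both routes give an R-squared of roughly 0.497, which is the proportion of price variation accounted for by sqft_living alone.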
• Business Impact:
Square footage being a key driver of the price of a home makes sense from a business
perspective: larger homes mean more expensive real estate. This is confirmed by
our analysis.
Residuals:
Min 1Q Median 3Q Max
-566003 -197773 -64410 100865 2592240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 110518 41500 2.663 0.00787 **
Interpretation text:
The slope coefficient (bedrooms) comes out to be 122414, with a t-statistic of 10.193 and a p-value <
2e-16. As the value of Multiple R-squared comes out to be 0.09429, this model explains only
about 9.4% of the variation. As the output shows, the predictor explains about 9% of the
observed variance in the dependent variable ‘price’ (adj-R2 = 0.09338); the amount of variance
explained differs significantly from zero, F(1, 998) = 103.9, p < .01.
The p-value again is small, implying that this variable is significant in determining price.
Bedrooms has a much smaller R-squared than anticipated, but it still helps determine a
portion of the overall price of a home.
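The adjusted R-squared and F statistic quoted for the bedrooms model follow directly from the Multiple R-squared and the sample size. The sketch below reproduces them; the sample size of 1000 is inferred from the 998 residual degrees of freedom plus two estimated coefficients, and the variable names are illustrative.

```r
# Values reported for model 2 (price ~ bedrooms)
r_squared <- 0.09429
n         <- 1000   # 998 residual df + 2 estimated coefficients
p         <- 1      # number of predictors

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# F statistic implied by R-squared
f_stat <- (r_squared / p) / ((1 - r_squared) / (n - p - 1))
```

These come out at about 0.0934 and 103.9, matching the reported adj-R2 and F(1, 998).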
Business Impact:
Although the share of price explained by the number of bedrooms is smaller, it is important to
understand why this might be the case. The effect of this attribute is potentially skewed because
square footage is closely related to bedroom count, which reduces the apparent significance
of the bedroom number itself.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7453.44 34231.87 -0.218 0.828
Interpretation text:
The slope coefficient for bedrooms comes out to be -45113.80 with a t-statistic of -4.265; for
sqft_living it comes out to be 230.84 with a t-statistic of 14.692; and for sqft_living15 it comes out
to be 103.43 with a p-value of 5.34e-08. The predictors bedrooms, sqft_living, and sqft_living15
all have p-values below the 0.05 significance level. The data analysis results
show that the three predictors have a significant effect on price. The model is fit with three
predictors.
The low p-values for each attribute imply they are significant in determining the price. The negative
intercept is the predicted price when all three attributes are zero (something we know to be
impossible), so it has no practical meaning; with a p-value of 0.828 it is also not significantly
different from zero. The negative bedrooms coefficient means that, for a fixed living area, an extra
bedroom is associated with a lower predicted price. Taken together, these variables account for a
large portion of the final housing price and would be useful in determining future home prices.
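To illustrate how the three-predictor model produces an estimate, the sketch below applies the coefficients quoted above to one house. The house itself (3 bedrooms, 2000 sq ft, neighbours averaging 1800 sq ft) is hypothetical and chosen only to show the arithmetic.

```r
# Coefficients of the three-predictor model as quoted in the text above
coefs <- c(
  intercept     = -7453.44,
  bedrooms      = -45113.80,
  sqft_living   = 230.84,
  sqft_living15 = 103.43
)

# Hypothetical house used only for illustration
house <- c(bedrooms = 3, sqft_living = 2000, sqft_living15 = 1800)

# Linear prediction: intercept plus the sum of coefficient * value
predicted_price <- unname(coefs["intercept"] + sum(coefs[names(house)] * house))
predicted_price
```

For this made-up house the estimate is a little over 505,000; with the fitted model object available, `predict()` with a `newdata` data frame would give the same kind of result.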
• Business Impact:
This model gives the user the ability to estimate other homes’ prices. It is beneficial to test
additional data on how current prices might impact future prices or how adding a new bedroom
could change the price of a home. Leveraging this model in combination with the cost of building
an addition to a home could allow an investor to know whether it would be beneficial to incur that
additional cost.
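The investor question above can be sketched as a comparison between the change in predicted price and the cost of the work. The slopes are the ones quoted for the three-predictor model; the size of the addition and the construction cost are made-up numbers, and the model describes associations in past sales rather than a guaranteed effect of remodelling.

```r
# Slopes of the three-predictor model as quoted in the interpretation above
coef_bedrooms    <- -45113.80
coef_sqft_living <- 230.84

# Hypothetical addition: one bedroom that adds 250 sq ft of living area
added_bedrooms <- 1
added_sqft     <- 250
build_cost     <- 40000   # made-up construction cost, for illustration only

# Estimated change in predicted price, holding sqft_living15 fixed
value_change <- coef_bedrooms * added_bedrooms + coef_sqft_living * added_sqft

# Net effect after the assumed cost
net_gain <- value_change - build_cost
c(value_change = value_change, net_gain = net_gain)
```

Under these assumptions the predicted price rises by about 12,600, less than the assumed cost, so the addition would not pay for itself; different sizes or costs would change that conclusion.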
The results of both the linear regression and the multiple linear regression
analyses showed that overall prices for King County houses keep increasing yearly, even
though the correlation between price and living square footage was lower in 2015 than in
2014. The interpretations of the linear regression and multiple linear regression analyses are
straightforward.
6.Management recommendation
#libraries
library(ggplot2)
library(ggmap)
library(dplyr)
library(glmnet)
library(rpart)
library(randomForest)
library(leaps)
library(rpart.plot)
library(ISLR)
#Import Data
kc_house_data <- read.csv('kc_house_data.csv') # file name as given in the report
View(kc_house_data)
str(kc_house_data)
#Data Summary
summary(kc_house_data)
attach(kc_house_data)
glimpse(kc_house_data)
#price distribution
hist(kc_house_data$price, main = 'price distribution', xlab = 'House price', ylab = 'recurrence', col = 'blue')
#Condition distribution
hist(kc_house_data$condition, main = 'condition distribution', col = 'red')
#linear Regression
cor(sqft_living, price)
cor(sqft_living15, price)
cor(sqft_basement, price)
cor(bedrooms, price)
cor.test(kc_house_data$price, kc_house_data$sqft_living)
cor.test(kc_house_data$price, kc_house_data$bedrooms)
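# model 1
mdl1 = lm(price ~ sqft_living) # formula as shown in the model 1 output
summary(mdl1)
coef(summary(mdl1))
va1 = coef(summary(mdl1))
va1[,1] # estimates
va1[,4] # p-values
predict(mdl1)
anova(mdl1)
# the square root of the residual Mean Sq in the ANOVA table equals the residual standard error in the regression summary
sqrt(anova(mdl1)['Residuals', 'Mean Sq'])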
# model 2
mdl2 = lm(price ~ bedrooms) # bedrooms is the predictor shown in the model 2 output
summary(mdl2)
coef(summary(mdl2))
va2 = coef(summary(mdl2))
va2[,1] # estimates
va2[,4] # p-values
predict(mdl2)
anova(mdl2)
# the square root of the residual Mean Sq in the ANOVA table equals the residual standard error in the regression summary
sqrt(anova(mdl2)['Residuals', 'Mean Sq'])
# multiple regression model with the three predictors discussed in the findings
mmdl1 = lm(price ~ bedrooms + sqft_living + sqft_living15)
confint(mmdl1, level = 0.95)
summary(mmdl1)
plot(mmdl1)
Appendix:
predict(mdl1)
1 2 3 4 5 6 7 8 9 10
285563.11 660379.31 175005.82 495891.63 420389.08 1428887.33 429826.90 253204.88 447354.28 477015.99
11 12 13 14 15 16 17 18 19 20
927334.73 280170.08 352976.10 336796.98 455443.84 762847.04 477015.99 398816.93 290956.15 304438.75
21 22 23 24 25 26 27 28 29 30
404209.97 789812.24 579483.73 255901.40 628021.08 428478.64 628021.08 344886.54 377244.77 660379.31
31 32 33 34 35 36 37 38 39 40
592966.32 288259.63 595662.84 261294.44 522856.82 587573.28 414996.04 603752.40 296349.19 673861.91
41 42 43 44 45 46 47 48 49 50
660379.31 1105305.01 936772.55 390727.37 312528.31 819473.95 234329.25 584876.77 304438.75 709725.61
51 52 53 54 55 56 57 58 59 60
288259.63 816777.43 347583.06 501284.66 703523.62 730488.81 574090.69 619931.52 843742.63 466229.91
61 62 63 64 65 66 67 68 69 70
547125.49 307135.27 646627.06 382637.81 414996.04 714309.70 700827.10 571394.17 237025.76 830260.03
71 72 73 74 75 76 77 78 79 80
1253613.57 307135.27 708916.66 609145.44 450050.80 892279.98 441961.24 247811.84 347583.06 897673.01
81 82 83 84 85 86 87 88 89 90
601055.88 479712.51 512070.74 420389.08 226239.69 544428.97 684647.98 714309.70 401513.45 245115.32
91 92 93 94 95 96 97 98 99 100
501284.66 916548.65 290956.15 393423.89 393423.89 857225.22 495891.63 280170.08 455443.84 592966.32
101 102 103 104 105 106 107 108 109 110
525553.34 501284.66 557911.57 754757.49 293652.67 598359.36 417692.56 301742.23 814080.91 514767.26
111 112 113 114 115 116 117 118 119 120
590269.80 307135.27 382637.81 528249.86 838349.59 1148449.32 396120.41 204667.53 390727.37 401513.45
121 122 123 124 125 126 127 128 129 130
614538.48 358369.14 175005.82 533642.90 752060.97 708916.66 533642.90 549822.01 592966.32 525553.34
131 132 133 134 135 136 137 138 139 140
253204.88 509374.22 1032498.98 509374.22 544428.97 323314.39 512070.74 665772.35 288259.63 282866.59
141 142 143 144 145 146 147 148 149 150
266687.48 727792.29 401513.45 253204.88 514767.26 956996.44 654986.27 619931.52 576787.21 352976.10
151 152 153 154 155 156 157 158 159 160
334100.46 266687.48 304438.75 1364170.87 156130.18 285563.11 1035195.50 679254.94 309831.79 441961.24
161 162 163 164 165 166 167 168 169 170
522856.82 447354.28 884190.42 482409.03 512070.74 393423.89 328707.42 690041.02 690041.02 336796.98
171 172 173 174 175 176 177 178 179 180
388030.85 549822.01 328707.42 1013623.35 665772.35 269384.00 498588.15 296349.19 493195.11 331403.94
181 182 183 184 185 186 187 188 189 190
417692.56 609145.44 625324.56 250508.36 811384.39 1070250.25 369155.21 479712.51 326010.90 568697.65
191 192 193 194 195 196 197 198 199 200
412299.52 288259.63 544428.97 555215.05 253204.88 423085.60 498588.15 547125.49 482409.03 331403.94
201 202 203 204 205 206 207 208 209 210
199274.49 490498.59 239722.28 317921.35 212757.09 636110.63 625324.56 239722.28 210060.57 587573.28
211 212 213 214 215 216 217 218 219 220
385334.33 309831.79 571394.17 699209.19 431175.16 196577.97 857225.22 574090.69 1019016.39 323314.39
221 222 223 224 225 226 227 228 229 230
711613.18 439264.72 595662.84 566001.13 512070.74 304438.75 374548.25 431175.16 352976.10 366458.70
231 232 233 234 235 236 237 238 239 240
358369.14 582180.25 760150.53 237025.76 636110.63 981265.11 566001.13 498588.15 1000140.75 1156538.87
241 242 243 244 245 246 247 248 249 250
352976.10 191184.94 352976.10 317921.35 245115.32 706220.14 951603.40 161523.22 509374.22 388030.85
251 252 253 254 255 256 257 258 259 260
455443.84 873404.34 374548.25 344886.54 433871.68 350279.58 603752.40 393423.89 299045.71 630717.60
261 262 263 264 265 266 267 268 269 270
414996.04 309831.79 533642.90 175005.82 172309.30 425782.12 269384.00 255901.40 525553.34 1329116.12
271 272 273 274 275 276 277 278 279 280
1399225.62 247811.84 425782.12 317921.35 258597.92 682760.42 584876.77 997444.23 563304.61 611841.96
281 282 283 284 285 286 287 288 289 290
predict(mdl2)
1 2 3 4 5 6 7 8 9 10 11 12
477760.2 477760.2 355346.0 600174.5 477760.2 600174.5 477760.2 477760.2 477760.2 477760.2 477760.2 355346.0
13 14 15 16 17 18 19 20 21 22 23 24
477760.2 477760.2 722588.7 600174.5 477760.2 600174.5 355346.0 477760.2 600174.5 477760.2 722588.7 355346.0
25 26 27 28 29 30 31 32 33 34 35 36
477760.2 477760.2 477760.2 477760.2 477760.2 600174.5 477760.2 355346.0 600174.5 477760.2 600174.5 477760.2
37 38 39 40 41 42 43 44 45 46 47 48
600174.5 600174.5 600174.5 600174.5 600174.5 600174.5 722588.7 477760.2 477760.2 477760.2 477760.2 600174.5
49 50 51 52 53 54 55 56 57 58 59 60
477760.2 477760.2 477760.2 722588.7 477760.2 355346.0 722588.7 600174.5 600174.5 477760.2 722588.7 600174.5
61 62 63 64 65 66 67 68 69 70 71 72
477760.2 477760.2 477760.2 477760.2 477760.2 477760.2 600174.5 600174.5 477760.2 722588.7 722588.7 477760.2
73 74 75 76 77 78 79 80 81 82 83 84
600174.5 600174.5 477760.2 600174.5 600174.5 477760.2 477760.2 600174.5 477760.2 600174.5 355346.0 477760.2
85 86 87 88 89 90 91 92 93 94 95 96
477760.2 477760.2 722588.7 477760.2 355346.0 355346.0 600174.5 722588.7 477760.2 477760.2 477760.2 600174.5
97 98 99 100 101 102 103 104 105 106 107 108
477760.2 600174.5 477760.2 477760.2 477760.2 477760.2 477760.2 477760.2 477760.2 477760.2 477760.2 355346.0
109 110 111 112 113 114 115 116 117 118 119 120
600174.5 722588.7 600174.5 477760.2 477760.2 477760.2 600174.5 477760.2 477760.2 355346.0 600174.5 600174.5
121 122 123 124 125 126 127 128 129 130 131 132
477760.2 477760.2 355346.0 600174.5 477760.2 600174.5 477760.2 600174.5 722588.7 600174.5 477760.2 600174.5
133 134 135 136 137 138 139 140 141 142 143 144
477760.2 477760.2 600174.5 477760.2 600174.5 600174.5 355346.0 477760.2 355346.0 722588.7 477760.2 477760.2
145 146 147 148 149 150 151 152 153 154 155 156
600174.5 600174.5 600174.5 355346.0 722588.7 477760.2 477760.2 477760.2 477760.2 600174.5 232931.7 477760.2
157 158 159 160 161 162 163 164 165 166 167 168
722588.7 600174.5 477760.2 600174.5 722588.7 477760.2 477760.2 477760.2 477760.2 477760.2 355346.0 600174.5
169 170 171 172 173 174 175 176 177 178 179 180
477760.2 477760.2 477760.2 477760.2 477760.2 600174.5 600174.5 477760.2 722588.7 477760.2 477760.2 355346.0
181 182 183 184 185 186 187 188 189 190 191 192
477760.2 600174.5 600174.5 355346.0 600174.5 722588.7 600174.5 477760.2 600174.5 477760.2 477760.2 477760.2
193 194 195 196 197 198 199 200 201 202 203 204
477760.2 477760.2 477760.2 477760.2 600174.5 477760.2 477760.2 477760.2 477760.2 477760.2 477760.2 477760.2
205 206 207 208 209 210 211 212 213 214 215 216
477760.2 600174.5 477760.2 477760.2 355346.0 845003.0 600174.5 477760.2 600174.5 477760.2 600174.5 355346.0
217 218 219 220 221 222 223 224 225 226 227 228
722588.7 477760.2 355346.0 355346.0 600174.5 477760.2 600174.5 477760.2 477760.2 477760.2 355346.0 600174.5
229 230 231 232 233 234 235 236 237 238 239 240
477760.2 477760.2 355346.0 600174.5 845003.0 355346.0 477760.2 722588.7 477760.2 600174.5 477760.2 845003.0
241 242 243 244 245 246 247 248 249 250 251 252
477760.2 355346.0 355346.0 477760.2 355346.0 477760.2 600174.5 355346.0 600174.5 477760.2 477760.2 600174.5
253 254 255 256 257 258 259 260 261 262 263 264
477760.2 477760.2 477760.2 355346.0 600174.5 477760.2 477760.2 600174.5 477760.2 477760.2 477760.2 355346.0
265 266 267 268 269 270 271 272 273 274 275 276
232931.7 600174.5 477760.2 355346.0 477760.2 600174.5 600174.5 355346.0 477760.2 477760.2 477760.2 477760.2
277 278 279 280 281 282 283 284 285 286 287 288
477760.2 600174.5 600174.5 477760.2 600174.5 355346.0 722588.7 600174.5 477760.2 600174.5 722588.7 477760.2
289 290 291 292 293 294 295 296 297 298 299 300
722588.7 355346.0 600174.5 600174.5 600174.5 600174.5 477760.2 600174.5 477760.2 477760.2 477760.2 477760.2
301 302 303 304 305 306 307 308 309 310 311 312
600174.5 477760.2 477760.2 477760.2 477760.2 477760.2 600174.5 600174.5 355346.0 600174.5 600174.5 722588.7
313 314 315 316 317 318 319 320 321 322 323 324
722588.7 600174.5 600174.5 600174.5 477760.2 477760.2 355346.0 600174.5 722588.7 477760.2 477760.2 477760.2