A REGRESSION MODEL OF SINGLE HOUSE PRICE IN LA

CONSTRUCTING A PREDICTED MODEL FOR HOUSE PRICES

A Project

Presented to the

Faculty of

California State Polytechnic University, Pomona

In Partial Fulfillment

Of the Requirements for the Degree

Master of Science

In

Economics

By

Lishun Yuan

2019
SIGNATURE PAGE

PROJECT: A REGRESSION MODEL OF SINGLE HOUSE PRICE IN LA


CONSTRUCTING A PREDICTED MODEL FOR HOUSE PRICES

AUTHOR: Lishun Yuan

DATE SUBMITTED: Spring 2019

Department of Economics

Dr. Craig Kerr


Project Committee Chair
Economics

Dr. Carsten Lange


Economics

Dr. Shih-Tang Hwu


Economics

ACKNOWLEDGMENTS

I want to thank Dr. Craig Kerr and Dr. Shih-Tang Hwu for help and advice that improved

this paper.

ABSTRACT

Knowing the factors that influence the real estate market is beneficial not only for realtors completing sales but also for buyers, who gain a more thorough view of the market and can evaluate properties better. Many factors can affect the real estate market.

In this study, I collect the latest sale prices of houses in seven major cities in Los Angeles County and construct a multiple linear regression model to estimate the factors that affect house sale prices in the current real estate market. The regression is based on 140 properties currently on the market. These properties are described by seven explanatory variables that are widely used by realtors and buyers: internal square feet, number of bedrooms, number of bathrooms, year built, local school quality, median household income, and city population. The regression is checked against the Gauss-Markov assumptions, with tests for multicollinearity and heteroskedasticity. The analysis is carried out in R. The results also provide suggestions to improve and inspire further study of the real estate market.

Contents

Signature Page

Acknowledgments

Abstract

1 Introduction

1.1 Background and overview

1.2 Literature Review

2 Empirical Work

2.1 Data Selection and Framework

2.2 Methodology

3 Conclusion

Bibliography

Chapter 1

Introduction

1.1 Background and overview

The idea behind predictive modeling is to use a statistical approach to build a prediction function from observed data. The function is then used to estimate the value of the dependent variable for new data. Predictive modeling has been widely used in many research areas, from business to the social and natural sciences. This paper applies multivariate linear regression to estimate the values of single-family houses in Los Angeles County, building a regression model of property prices from a dataset of 140 homes.

When people consider buying a home, the location has usually been constrained to a certain area, for example one not too far from the workplace. With the location largely fixed, the characteristics of the property weigh more heavily in the home price. Many factors determine the price of a house, and they do not weigh equally in determining its value. This paper presents a modeling process for estimating home values using a multivariate linear regression model based on the characteristics of the houses, in order to examine the key factors affecting their values. The project also gives a general way of judging whether a transaction is a good deal based on the information provided.

Real estate economics is the application of economic techniques to the real estate market. Real estate transactions play an essential part in the US economy. In recent years, as an increasing number of immigrants have chosen to live in the United States, Los Angeles has become one of the most popular areas in the country, which has significantly increased housing demand in the local market. The main demographic variables that can affect housing prices are population growth and population size. In other words, the more people in a country or region, the greater the demand for housing in that area.

However, it is an oversimplification to take only these population factors into consideration. Conventionally, buyers determine the sale price of a house with simple methods, such as calculating the average price of nearby properties or looking for a reasonable local median price. This approach has a disadvantage: it relies heavily on the subjective perception and experience of realtors and local residents, which creates bias, inaccuracy, and inconsistency in the price-determining process.

Determining the price of a single-family house can be extremely complicated, since many factors can affect the value of a house, such as the crime rate, number of rooms, age of the property, and school district (El Mahmah, 2012). The sale price of a house available in the market is a fairly accurate index reflecting the intrinsic value of the house.

1.2 Literature Review

To construct a reasonable model that can be widely used to predict single-family house prices in a given location, researchers use econometric theory to set up a regression model. This is called the econometric approach, in contrast with the conventional approach. This study aims to construct a best linear unbiased estimator to predict the sale price of properties in a certain area based on essential variables, in order to help amateur house buyers understand the price-determining procedure and the housing market thoroughly.

Different researchers use different regression models and methods in their studies. The independent variables, however, are similar across several studies. Bourassa et al. (2010) use interior size, location, year built, lot size, number of bedrooms, and number of bathrooms as the independent variables in a linear regression model. Hu et al. (2013) use a dataset of 81 single-family houses to construct multivariate regression models of home prices. They then apply the maximum information coefficient statistic to the dependent variable, home value (Y), and the independent variables as an evaluation of the regression models. The result shows a strong relationship between the dependent and independent variables.

Case et al. (1991) and El Mahmah (2012) add median household income and local population size to their models because these two factors play important roles in determining property sale prices in the market. The impacts of the two factors, however, differ from region to region. Therefore, it is necessary to include both factors when constructing the regression model. Neighborhood quality is also an important consideration in determining house value. Since it is hard to evaluate neighborhood quality quantitatively, Dubin (1998) uses proxy variables such as crime rate, school quality measures, and race to capture it. The study points out that if neighborhood variables such as crime rate and school district ratings are not included in the regression model, the error terms of nearby houses are likely to be correlated because the houses are in the same neighborhood.

After specifying the model, testing it is important. Based on the Gauss-Markov assumptions of regression, various researchers test and improve the quality of their regression models. Bourassa et al. (2010) use scatter plots of the independent variables against the dependent variable (i.e., the sale price) to give a visual representation of the relationship between X and Y. In this way, the study identifies which pairs of variables are most strongly correlated and also detects multicollinearity between independent variables. Hu et al. (2013) also build a multiple regression and suggest that it is necessary to diagnose the regression being built. A scatter plot of residuals against predicted values can be used to show whether there is heteroskedasticity in the model being constructed (Hu et al., 2013).

Chapter 2

Empirical Work

2.1 Data Selection and Framework

This study uses 140 single-family houses as observations, with detailed information on each property currently on the market. Seven common variables might help determine the sale price of a house: internal square feet, number of bedrooms, number of bathrooms, year built, local school quality, median household income, and city population.

This study’s data is randomly selected from realtor.com, a professional real estate database in the United States. The selected properties are located in seven major cities in Los Angeles County: San Dimas, Sierra Madre, Claremont, La Verne, Pomona, Montebello, and Glendale. Specifically, 20 properties are randomly selected in each city. Based on current data, this study aims to construct a linear regression model that can forecast the prices of other properties from these essential variables. The measurements and sources of the seven explanatory variables are listed in Table 2.1 below.

Table 2.1: Properties of Variables

Name                       Abbreviation   Measurement         Source
Internal Square Feet       Intersf        Square feet         Realtor.com
Number of Bedrooms         Bed            Units               Realtor.com
Number of Bathrooms        Bath           Units               Realtor.com
Year Built                 Year           Years               Realtor.com
Median Household Income    Income         U.S. Dollars        City Officials
City Population            Ctpop          Number of People    City Officials
Median School Rating       School         School Evaluation   City Officials

The data processing in this study is carried out in R, and the tests that follow are reported together with their results. The basis of the study is the simple linear regression model below:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \tag{2.1}$$

where X is the single independent variable, ε is the regression error, and Y is the dependent variable. This simple model is extended to a multiple linear regression model that includes more variables, so that it can forecast housing prices in Los Angeles County efficiently and consistently.
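As a minimal sketch of how the two parameters in equation 2.1 can be estimated by ordinary least squares, the Python snippet below computes the closed-form estimates. The study itself uses R; the function name fit_simple_ols and the four data points are made up for illustration and are not taken from the study's dataset.

```python
def fit_simple_ols(x, y):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x + error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: sample covariance of x and y divided by sample variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data: size in square feet and price in thousands of dollars.
# The points are constructed to lie exactly on price = 50 + 0.3 * sqft.
sqft = [1000, 1500, 2000, 2500]
price = [350, 500, 650, 800]

b0, b1 = fit_simple_ols(sqft, price)
print(b0, b1)
```

With several explanatory variables the same idea applies, but the estimates are obtained by solving the normal equations jointly, which is what a regression routine such as R's lm does.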

This regression model is a multiple linear regression model with seven independent variables and one dependent variable. It is important to consider the Gauss-Markov assumptions and the corresponding statistical tests for this study. Under the Gauss-Markov assumptions, the least-squares estimators are the “best linear unbiased estimators” (BLUE).

1. Linearity in parameters. The population process has to be linear in parameters. In other words, there is no multiplicative effect among the parameters.

2. Random sampling. Each individual in the population is equally likely to be selected for the study. More importantly, all of the data come from the same population.

3. Zero conditional mean of the error. The expectation of the error given any independent variable should be equal to zero, so knowing the value of any variable does not help the researcher predict whether an observation will lie above or below the population regression line.

It is also necessary to consider multicollinearity among the independent variables and heteroskedasticity of the errors to achieve the ideal of BLUE:

4. No perfect collinearity in the regression (checked with the Variance Inflation Factor (VIF) test).

5. No heteroskedasticity (checked with the Breusch-Pagan test).

All the data in this study are recent and from the same period of time, so no time series issues are involved in this dataset. Tests for serial correlation and corrections for non-stationary variables are therefore not needed. There is also no two-way causality in this topic.

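$$\log(\text{Price}_i) = \beta_0 + \beta_1\,\text{intersf}_i + \beta_2\,\text{bed}_i + \beta_3\,\text{bath}_i + \beta_4\,\text{year}_i + \beta_5\,\text{income}_i + \beta_6\,\text{ctpop}_i + \beta_7\,\text{school}_i + \varepsilon_i \tag{2.2}$$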

Equation 2.2 is the log-linear regression model used to estimate housing prices in Los Angeles County. It is the basis of the regression analysis in the following section.
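Because the dependent variable is in logs, each slope has a proportional interpretation. The short derivation below is a sketch under the assumption that Log denotes the natural logarithm (the default of R's log function; the source does not state the base). Holding the other variables fixed and raising regressor x_j by one unit gives:

```latex
\log \text{Price}(x_j + 1) - \log \text{Price}(x_j) = \beta_j
\quad\Longrightarrow\quad
\frac{\text{Price}(x_j + 1)}{\text{Price}(x_j)} = e^{\beta_j},
\qquad
\%\Delta \text{Price} = 100\,\bigl(e^{\beta_j} - 1\bigr) \approx 100\,\beta_j
\ \text{ for small } \beta_j .
```

So a coefficient of 0.01 on a regressor corresponds to roughly a 1% higher price per additional unit of that regressor.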

2.2 Methodology

A log-linear regression is run on the variables listed in Table 2.1 above. The results are shown in Table 2.2 below.

Table 2.2: The corrected regression model with heteroskedasticity adjusted

Coefficient   Estimate      t-value   Pr(>|t|)
Intercept      1.860e+01    11.903    <2e-16
Intersqft      3.274e-04    10.436    <2e-16
Bedroom       -6.075e-02    -1.906    0.059
Bathroom       9.004e-02     2.682    0.008
Year          -3.294e-03    -4.144    6.53e-05
Income         5.644e-06     3.334    0.001
Citypop        1.659e-06     4.885    3.34e-06
School         3.191e-02     2.677    0.008

Residual standard error: 0.1635 on 116 degrees of freedom
Multiple R-squared: 0.819, Adjusted R-squared: 0.808
F-statistic: 75.33 on 7 and 116 df, p-value: < 2.2e-16

The significance level is the probability of rejecting the null hypothesis when it is true. In most analyses, 0.05 is used as the cutoff, and the null hypothesis is rejected when the p-value is less than 0.05. According to Table 2.2, the p-value of the number of bedrooms is 0.05916, which is slightly greater than the 0.05 significance level. The p-values of all the other variables are less than 0.05. As a result, internal square feet, number of bathrooms, year built, median household income, city population, and school rating are significant in this study.
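The summary statistics in Table 2.2 can be cross-checked against each other with the standard formulas for adjusted R-squared and the overall F-statistic. The Python sketch below is only an arithmetic check on the reported numbers, not a re-estimation (the study itself uses R), and the variable names are chosen here for illustration. One thing it makes visible: 116 residual degrees of freedom with seven regressors and an intercept correspond to 124 observations in the fit, while the text describes 140 properties; the source does not explain the difference.

```python
# Summary statistics as reported in Table 2.2.
r_squared = 0.819
k = 7            # number of explanatory variables
df_resid = 116   # residual degrees of freedom

# Residual df = n - k - 1, so the number of observations used in the fit is:
n_used = df_resid + k + 1

# Adjusted R-squared penalises R-squared for the number of regressors.
adj_r_squared = 1 - (1 - r_squared) * (n_used - 1) / df_resid

# Overall F-statistic for H0: all seven slope coefficients are zero.
f_stat = (r_squared / k) / ((1 - r_squared) / df_resid)

print(n_used)
print(round(adj_r_squared, 3))
print(round(f_stat, 2))
```

The recomputed adjusted R-squared matches the reported 0.808, and the recomputed F-statistic is close to, though not identical with, the reported 75.33.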

To support these inferences, a Variance Inflation Factor (VIF) test is also needed to check whether there is a multicollinearity issue in the data above. In statistics, the VIF evaluates the severity of multicollinearity in an OLS regression analysis. It provides an index that measures how much the variance of an estimated regression coefficient is increased because of collinearity.

Table 2.3: The calculation output of VIF test

Variable    VIF Result
Intersqft   3.552
Bedroom     3.289
Bathroom    3.779
Year        1.514
Income      4.930
Citypop     2.428
School      3.145

According to the VIF results in Table 2.3 above, all the values are less than 5, which suggests there is no multicollinearity issue in the data. Therefore, none of these variables needs to be dropped to construct a better regression model.
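To make the VIF values in Table 2.3 more concrete, recall the standard definition VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing variable j on the other explanatory variables. The Python sketch below inverts that formula for the reported values; it is an illustration of the definition, not output from the study, and the dictionary name vif is chosen here.

```python
# VIF values as reported in Table 2.3.
vif = {
    "Intersqft": 3.552,
    "Bedroom": 3.289,
    "Bathroom": 3.779,
    "Year": 1.514,
    "Income": 4.930,
    "Citypop": 2.428,
    "School": 3.145,
}

# VIF_j = 1 / (1 - R_j^2)  implies  R_j^2 = 1 - 1 / VIF_j.
implied_r2 = {name: 1 - 1 / value for name, value in vif.items()}

for name, r2 in implied_r2.items():
    print(f"{name:10s} VIF = {vif[name]:.3f}   implied R^2 = {r2:.3f}")

# Variable with the highest VIF, i.e. the one best explained by the others.
worst = max(vif, key=vif.get)
print(worst)
```

Median household income has the highest VIF, just under the cutoff of 5 used in the text, which means about four fifths of its variation is explained by the other six variables.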

Although the regression has no multicollinearity problem and the model is highly significant (i.e., most of the coefficient p-values are smaller than 0.05), it is still not an ideal model, since there might be a heteroskedasticity problem. The next step is therefore to run a Breusch-Pagan (BP) test. The BP test, developed in 1979, is a method to test for heteroskedasticity in linear regression models. Its null hypothesis H0 is that the errors are homoskedastic, and the alternative H1 is that they are heteroskedastic. The BP test gives a p-value greater than 0.05 (Table 2.4). Therefore, H0 cannot be rejected, and it is reasonable to claim that there is no heteroskedasticity problem in the dataset.
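The p-value in Table 2.4 can be reproduced from the reported statistic, since under H0 the BP statistic is compared with a chi-square distribution whose degrees of freedom equal the number of regressors (7 here). The Python sketch below uses a closed-form expression for the chi-square upper tail that is valid for odd degrees of freedom; the function name chi2_sf_odd_df is made up for this illustration, and in R the same number would come from pchisq with lower.tail = FALSE.

```python
import math


def chi2_sf_odd_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with odd df."""
    if df < 1 or df % 2 != 1:
        raise ValueError("this closed form only covers odd degrees of freedom")
    root = math.sqrt(x)
    # Standard normal upper tail and density, both evaluated at sqrt(x).
    normal_tail = 0.5 * math.erfc(root / math.sqrt(2))
    normal_pdf = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    # Sum of x^((2r-1)/2) / (1 * 3 * ... * (2r-1)) for r = 1 .. (df-1)/2.
    series = 0.0
    term = root
    for r in range(1, (df - 1) // 2 + 1):
        series += term
        term *= x / (2 * r + 1)
    return 2 * normal_tail + 2 * normal_pdf * series


# BP statistic and degrees of freedom as reported in Table 2.4.
bp_statistic = 5.207
bp_df = 7

p_value = chi2_sf_odd_df(bp_statistic, bp_df)
print(round(p_value, 3))
print(p_value > 0.05)  # True means H0 (homoskedasticity) is not rejected
```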

Table 2.4: The calculation output of BP test

Name      Result
BP        5.207
Df        7
P-value   0.635

Having finished the tests for multicollinearity and heteroskedasticity, the regression model can be used to predict single-family house prices in Los Angeles County.

Table 2.5: Log-linear Regression Model

Independent Variable       Coefficient (dependent variable: Log(Price))
Internal Square Feet        0.0003
Number of Bedrooms         -0.061
Number of Bathrooms         0.090
Year Built                 -0.003
Median Household Income     0.0000056
City Population             0.0000017
Median School Rating        0.032
Constant                   18.597

Based on Table 2.5, the log-linear regression model of single-family house prices in Los Angeles County can be formulated as follows:

$$\log(\text{Price}_i) = 18.597 + 0.0003X_{1i} - 0.061X_{2i} + 0.090X_{3i} - 0.003X_{4i} + 0.0000056X_{5i} + 0.0000017X_{6i} + 0.032X_{7i} + \varepsilon_i$$

where Price is the sale price of a single-family house; X1 is the internal square feet; X2 is the number of bedrooms in the property; X3 is the number of bathrooms in the property; X4 is the year the property was built; X5 is the median household income; X6 is the city population; X7 is the median school rating of the city.
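As a rough illustration of how the fitted equation can be used, the Python sketch below plugs a house's characteristics into the model and converts the fitted log price back to dollars. It uses the unrounded estimates from Table 2.2 (with the constant from Table 2.5) and assumes Log is the natural logarithm, as with R's log function. The function name predict_price and the example house are hypothetical and do not come from the study's data, and simply exponentiating the fitted log price ignores any retransformation adjustment.

```python
import math

# Estimates from Table 2.2; the constant is the value shown in Table 2.5.
COEF = {
    "intercept": 18.597,
    "intersf": 3.274e-04,
    "bed": -6.075e-02,
    "bath": 9.004e-02,
    "year": -3.294e-03,
    "income": 5.644e-06,
    "ctpop": 1.659e-06,
    "school": 3.191e-02,
}


def predict_price(house):
    """Fitted sale price in dollars implied by the log-linear model."""
    log_price = COEF["intercept"] + sum(
        COEF[name] * value for name, value in house.items()
    )
    # Assumes the model's Log is the natural logarithm.
    return math.exp(log_price)


# Hypothetical house, not taken from the study's dataset.
example_house = {
    "intersf": 1500,    # internal square feet
    "bed": 3,           # bedrooms
    "bath": 2,          # bathrooms
    "year": 1990,       # year built
    "income": 70000,    # city median household income, dollars
    "ctpop": 100000,    # city population
    "school": 7,        # median school rating
}

predicted = predict_price(example_house)
print(round(predicted))
```

For these made-up inputs the fitted price is on the order of six hundred thousand dollars. Comparing such a fitted value with an asking price is one way to use the model to judge whether a listing looks expensive or cheap relative to its characteristics.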

Chapter 3

Conclusion

Multivariate regression is widely used in many fields. This paper presents a process for building a multivariate regression model for a simplified problem of estimating private property prices. The steps of building a predictive model are: (a) apply a subset-selection procedure to choose the best variables; (b) build a linear regression model from the selected variables; (c) conduct diagnostics to find out whether there is a multicollinearity problem or a heteroskedasticity issue. This is a typical process of building regression models that may apply to many applications.

More generally, the econometric approach to the real estate market can also be used in government property tax estimation, home buyers’ financing estimates, and realtors’ property evaluation. Researchers are currently exploring non-linear regression models, which might provide a better estimator of real estate prices.

Although my model is not completely reliable for predicting all house prices, the study provides an attempt to estimate single-family house prices in Los Angeles County using a log-linear regression model as a scientific tool within the econometric approach (El Mahmah, 2012). Since this model’s results are significant, future researchers may be able to improve it by adding variables omitted from the equations above and by adding more specific variables.

Bibliography

Bourassa, Steven, Eva Cantoni, and Martin Hoesli. 2010. “Predicting house prices with spatial dependence: a comparison of alternative methods.” Journal of Real Estate Research, 32(2): 139–159.

Case, Bradford, Henry O. Pollakowski, and Susan M. Wachter. 1991. “On choosing among house price index methodologies.” Real Estate Economics, 19(3): 286–307.

Dubin, Robin A. 1998. “Predicting house prices using multiple listings data.” The Journal of Real Estate Finance and Economics, 17(1): 35–59.

El Mahmah, Assil. 2012. “Constructing a real estate price index: the Moroccan experience.” IFC Bulletin, 28: 134.

Hu, Gongzhu, Jinping Wang, and Wenying Feng. 2013. “Multivariate regression modeling for home value estimates with evaluation using maximum information coefficient.” In Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing 2012. 69–81. Springer.
