YuanLishun Project2019
A Project
Presented to the
Faculty of
In Partial Fulfillment
Master of Science
In
Economics
By
Lishun Yuan
2019
SIGNATURE PAGE
Department of Economics
ACKNOWLEDGMENTS
I want to thank Dr. Craig Kerr and Dr. Shih-Tang Hwu for help and advice that improved
this paper.
ABSTRACT
Knowing the factors that influence the real estate market not only helps realtors
complete sales but also gives buyers a thorough view of the market and a better way to
evaluate properties. There are many factors that can affect the price of a house.
In my study, I collect the latest house sale prices in seven major cities in Los Angeles
County and construct a multiple linear regression model to estimate the factors
that affect house sale prices in the current real estate market. The regression is based on 140
properties currently on the market. These properties are measured by the eight critical
variables that are widely used by realtors and buyers: the sale price and seven explanatory
variables, namely internal square feet, lot square feet, number of bedrooms, number of
bathrooms, local school quality, median household income, and city population. Regression
inaccuracy and other statistical pitfalls are checked against the assumptions of the
Gauss-Markov theorem. The analysis is carried out in R. The results also provide
suggestions to improve and inspire further study of the real estate market.
Contents
Signature Page
Acknowledgments
Abstract
1 Introduction
2 Empirical Work
2.2 Methodology
3 Conclusion
Bibliography
Chapter 1
Introduction
The idea behind predictive modeling is to use a statistical approach to build a prediction
function from the observed data. The function is then used to estimate the value of the
dependent variable for a new data set. Predictive modeling has been widely used in many
research areas, from business to the social and natural sciences. This paper applies multivariate
linear regression to estimate the values of single-family houses in LA County, using a dataset
of 140 homes.
When people consider buying a home, the location is usually constrained to a
certain area, for example not too far from the workplace. With the location factor largely
fixed, the characteristics of the property weigh more heavily in the home price. Many
factors determine the price of a house, and they do not weigh equally in determining
the home value. This paper presents a modeling process for estimating home values using
a multivariate linear regression model based on the condition information of the houses in
order to examine the key factors affecting their values. The project also provides a general
way of figuring out whether a transaction is a good deal based on the information provided.
Real estate economics is the application of economic techniques to the real estate
market. Real estate transactions play an essential part in the US economy. In recent years, as an
increasing number of immigrants have chosen to live in the United States, Los Angeles has
become one of the most popular areas in the country, which has significantly increased
housing demand in the local market. The main demographic variables that can affect the
housing price are population growth and population size. In other words, the more people
in a country or a region, the greater the demand for housing in that area.
sideration. Conventionally, buyers determine the sale price of a house with simple
methods, such as calculating the average price of nearby properties or looking for a reasonable
local median price. This method has a disadvantage: it relies heavily on the subjective
perception and experience of realtors and local residents to estimate house prices.
There are many factors that can affect the value of a house, such as the crime rate, number
of rooms, age of the property, and school district (Assil, 2012). The sale price of a house
available in the market is a fairly accurate index reflecting the intrinsic value of the house.
1.2 Literature Review
For the purposes of constructing a reasonable model which can be universally utilized to
predict the single-house price in an assigned location, researchers use econometrics theory
to set up a regression model. This approach is called the econometric approach in contrast
with the conventional approach. This study aims to construct a best linear unbiased estima-
tor to predict the sale price of the properties in a certain area based on essential variables
in order to help amateur house buyers understand the price determining procedure and the
Different researchers have different regression models and methods to construct their
own studies. The independent variables, however, are used similarly in some research
studies. Bourassa et al. (2010) use interior size, location, year built, lot size, number of
bedrooms and number of bathrooms as the independent variables in the linear regression
regression models of home prices. Hu et al. then apply the maximum information
coefficient statistic to the dependent variable, home value (Y), and the independent
variables as an evaluation of the regression models. The result shows a high strength of the
Case et al. (2014) and El Mahmah (2012) add the median household income and local
population size to their model because these two factors play important roles in determin-
ing the property sale price in the market. But the impacts of the two factors are different
from region to region. Therefore, it is necessary to take these two factors into the construc-
determining the house value. Since it is hard to evaluate neighborhood quality
quantitatively, Dubin (1998) uses proxy variables such as crime rate, school quality measures, and
race to determine neighborhood quality. The study points out that if neighborhood
variables such as crime rate and school district ratings are not included in the regression
model, the error terms from nearby houses are likely to be correlated because they
are in the same neighborhood.
After confirming the model, testing it is important. Based on the Gauss-Markov
assumptions of regression, various studies test and improve the quality of the regression
model. Bourassa et al. (2010) use scatter plots of the independent variables versus the
dependent variable (i.e., the sale price) to give visual representations of the relationship
between X and Y. In this way, the study identifies which pairs of variables are most strongly
correlated and, furthermore, recognizes multicollinearity between independent
variables. Hu et al. (2013) also build a multiple regression and suggest
that it is necessary to diagnose the regression being built. A scatter diagram (based
Chapter 2
Empirical Work
This study uses 140 single-family houses as observations, with detailed information
on each property currently on the market. Seven common variables
might help determine the sale price of a house: internal square feet, lot square
feet, number of bedrooms, number of bathrooms, local school quality, median household
income, and city population.
The data for this study are randomly selected from realtor.com, a professional real estate
database in the United States. The selected properties are located in seven major
cities in Los Angeles County: San Dimas, Sierra Madre, Claremont, La Verne, Pomona,
Montebello, and Glendale. Specifically, 20 properties are randomly selected in each city.
Based on these data, the study aims to construct a linear regression model that
can forecast the prices of other properties from essential and related variables. The measurements
and properties of the eight variables are listed in Table 2.1 below.
The data processing for the study is carried out in R. All the following tests and operation
commands will be demonstrated along with the results. The basis of the study is the simple
linear regression model:
Table 2.1: Properties of Variables
\[
Y = \beta_0 + \beta_1 X + \varepsilon_i \tag{2.1}
\]
where there is one independent variable “X”, the regression residual/error “ε”, and the dependent
variable “Y”. From the simplified regression model, the multivariable linear regression model
that includes more variables to construct a model that can forecast the housing price in LA
This regression model is a multiple linear regression model with seven independent
variables and one dependent variable. It is important to consider the Gauss-Markov
assumptions and the corresponding statistical tests in this study. Under the Gauss-Markov
assumptions, the least-squares estimators are the “best linear
unbiased estimators” (BLUE).
2. Each individual sample in the population is equally likely to be selected in the study.
More importantly, all of the data come from the same population.
3. Zero conditional mean of the error. Knowing the value of any independent variable does not
help the researcher predict whether an observation will fall above or below the population
regression line.
dependent variables to achieve the ideal concept of BLUE. All the data in this study are
recent and from the same period of time, so there are no time-series issues
in this dataset. Consequently, corrections for serial correlation and non-stationary
variables do not need to be carried out. Also, there is no two-way causality in this
topic.
The expectation of errors given any independent variable should be equal to zero.
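\[
\log(\mathit{Price}) = \beta_0 + \beta_1\,\mathit{intersf} + \beta_2\,\mathit{bed} + \beta_3\,\mathit{bath} + \beta_4\,\mathit{year} + \beta_5\,\mathit{income} + \beta_6\,\mathit{ctpop} + \beta_7\,\mathit{school} + \varepsilon_i \tag{2.2}
\]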
Equation 2.2 is the log-linear regression model used to estimate the housing price in
LA County. This model is used to construct a regression analysis for housing price in LA
2.2 Methodology
A log-linear regression is run on the data described in Table 2.1 above. The results are shown
in Table 2.2.
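The study itself was run in R, and its commands are not reproduced here. As a rough illustration of what fitting a log-linear model involves, the following is a minimal Python sketch on synthetic data. The data, the two regressors, the "true" coefficients, and the helper name `ols` are all assumptions made for the example, not the study's data or code.

```python
import random

def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of rows that already include a leading 1.0 for the intercept.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda row: abs(A[row][c]))
        A[c], A[p] = A[p], A[c]
        for row in range(k):
            if row != c:
                f = A[row][c] / A[c][c]
                A[row] = [a - f * b for a, b in zip(A[row], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

random.seed(0)
# Hypothetical coefficients: intercept, interior size (thousand sq ft), school rating.
true_beta = [12.0, 0.4, 0.03]

rows, log_price = [], []
for _ in range(140):  # same sample size as the study, but synthetic values
    sqft_k = random.uniform(0.8, 3.5)
    school = float(random.randint(1, 10))
    x = [1.0, sqft_k, school]
    rows.append(x)
    # The dependent variable is log(price), as in Equation 2.2.
    log_price.append(sum(b * v for b, v in zip(true_beta, x)) + random.gauss(0, 0.05))

beta_hat = ols(rows, log_price)
print([round(b, 3) for b in beta_hat])
```

Because the dependent variable is the logarithm of the price, each estimated slope is read as an approximate proportional change in price per unit change in the regressor.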
The significance level is the probability of rejecting the null hypothesis when it is true. In
most analyses, 0.05 is used as the cutoff point: we reject the null hypothesis when
the p-value is less than 0.05. According to Table 2.2, the p-value of number of
bedrooms is 0.05916, which is slightly greater than the 0.05 significance level. For all the
other variables, the p-values are less than 0.05. As a result, lot square feet, number of
bathrooms, local school quality, median household income, and city population and school
To support this reasoning, a Variance Inflation Factor (VIF) test is also
necessary to determine whether there is a multicollinearity issue in the data above. In
statistics, the VIF evaluates the severity of multicollinearity in an
OLS regression analysis. The VIF test provides an index that measures how much
the variance of an estimated regression coefficient is inflated by collinearity among the
independent variables.
Table 2.3: Result of the VIF test
Variable VIF
Intersqft 3.552
Bedroom 3.289
Bathroom 3.779
Year 1.514
Income 4.930
Citypop 2.428
School 3.145
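For readers who want to see how such an index is obtained, the VIF of a regressor is 1 / (1 - R²), where R² comes from regressing that regressor on all the others. The Python sketch below illustrates the calculation on synthetic data. The study used R, and the variable names, values, and helper functions here are assumptions made for the example.

```python
import random

def ols(X, y):
    """OLS coefficients via the normal equations; rows of X include the intercept."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda row: abs(A[row][c]))
        A[c], A[p] = A[p], A[c]
        for row in range(k):
            if row != c:
                f = A[row][c] / A[c][c]
                A[row] = [a - f * b for a, b in zip(A[row], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def r_squared(X, y):
    """Share of the variance of y explained by an OLS fit on X."""
    b = ols(X, y)
    fitted = [sum(bi * v for bi, v in zip(b, r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

def vif(columns):
    """columns maps a regressor name to its values; returns name -> VIF."""
    result = {}
    for name, target in columns.items():
        others = [vals for n, vals in columns.items() if n != name]
        X = [[1.0] + [vals[i] for vals in others] for i in range(len(target))]
        result[name] = 1.0 / (1.0 - r_squared(X, target))
    return result

random.seed(0)
n = 140
# Synthetic regressors: bedrooms are built to move with interior size,
# income is drawn independently of both.
sqft_k = [random.uniform(0.8, 3.5) for _ in range(n)]
bed = [1.0 + 1.2 * s + random.gauss(0, 0.5) for s in sqft_k]
income = [random.uniform(4.0, 12.0) for _ in range(n)]

vifs = vif({"sqft_k": sqft_k, "bed": bed, "income": income})
for name, value in vifs.items():
    print(name, round(value, 3), "above 5" if value > 5 else "below 5")
```

In this sketch the two correlated regressors receive noticeably higher VIF values than the independent one, which is the pattern the cutoff of 5 used in the text is meant to screen for.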
According to the VIF results in Table 2.3 above, all the values are less than 5,
which suggests there is no multicollinearity issue in the data, so none of these
variables needs to be dropped. Although the model is
highly significant (i.e., most of the p-values are smaller than 0.05), it
is still not an ideal model since there might be a heteroskedasticity problem. The next step
is therefore to run a Breusch-Pagan (BP) test. The BP test, developed in 1979, is a method to test for
heteroskedasticity in linear regression models. The BP test yields a
p-value of 0.635 (Table 2.4), which is greater than 0.05. Therefore, the null hypothesis H0 of
homoskedasticity is not rejected. It is therefore reasonable to conclude that
heteroskedasticity is not a problem in this model.
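The mechanics of the test can be sketched in a few lines of Python. The sketch assumes the studentized (Koenker) form of the statistic, LM = n · R², where R² comes from regressing the squared OLS residuals on the regressors; the text does not say which form the study used. The synthetic data are deliberately extreme (errors whose size grows with the regressor), and all names are illustrative. The last lines compare the statistic reported in Table 2.4 with the 5% chi-square critical value for 7 degrees of freedom.

```python
import random

def ols(X, y):
    """OLS coefficients via the normal equations; rows of X include the intercept."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda row: abs(A[row][c]))
        A[c], A[p] = A[p], A[c]
        for row in range(k):
            if row != c:
                f = A[row][c] / A[c][c]
                A[row] = [a - f * b for a, b in zip(A[row], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def r_squared(X, y):
    b = ols(X, y)
    fitted = [sum(bi * v for bi, v in zip(b, r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

def breusch_pagan_lm(X, y):
    """Studentized BP statistic: n * R^2 of squared residuals regressed on X."""
    b = ols(X, y)
    resid_sq = [(t - sum(bi * v for bi, v in zip(b, r))) ** 2 for r, t in zip(X, y)]
    return len(y) * r_squared(X, resid_sq)

# 5% critical values of the chi-square distribution, keyed by degrees of freedom.
CHI2_95 = {1: 3.841, 7: 14.067}

random.seed(1)
X, y = [], []
for i in range(140):
    x = random.uniform(0.8, 3.5)
    # Alternating-sign errors whose magnitude grows with x: strong heteroskedasticity.
    err = (1 if i % 2 == 0 else -1) * 0.1 * x
    X.append([1.0, x])
    y.append(12.0 + 0.4 * x + err)

lm_het = breusch_pagan_lm(X, y)
print("synthetic data reject H0:", lm_het > CHI2_95[1])

# Statistic and degrees of freedom as reported in Table 2.4.
paper_bp, paper_df = 5.207, 7
print("Table 2.4 fails to reject H0:", paper_bp < CHI2_95[paper_df])
```

A statistic below the critical value, as in Table 2.4, means the hypothesis of constant error variance is not rejected at the 5% level.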
Table 2.4: The calculation output of BP test
Name Result
BP 5.207
Df 7
P-value 0.635
Having finished the tests for multicollinearity and heteroskedasticity, the regression
model is able to predict the single-house price in LA County. Based on the table below, a
Constant 18.597
0.0000017X6 + 0.032X7 + εi
where Y represents the predicted sale price of a single-family house; X1 is the internal
square feet; X2 is the number of bedrooms in the property; X3 is the number of bathrooms
in the property; X4 is the year the property was built; X5 is the median household income;
X6 is the city population; X7 is the median school rating of the city.
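To make the fitted coefficients easier to read, the short Python sketch below converts a log-linear coefficient into a percentage effect on price. It assumes the dependent variable of the fitted equation is log(price), as in Equation 2.2, and uses only the coefficient on X7 (0.032), which is visible in the fitted equation; the function name `pct_effect` is hypothetical.

```python
import math

def pct_effect(coef, delta=1.0):
    """Percentage change in price when a regressor changes by `delta`,
    assuming the dependent variable is log(price)."""
    return (math.exp(coef * delta) - 1.0) * 100.0

school_coef = 0.032  # coefficient on X7 (median school rating) in the fitted equation

one_point = pct_effect(school_coef)        # school rating higher by 1 point
two_points = pct_effect(school_coef, 2.0)  # school rating higher by 2 points
print(round(one_point, 2), round(two_points, 2))
```

Under that assumption, a one-point higher median school rating is associated with a price roughly 3 percent higher, holding the other variables fixed.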
Chapter 3
Conclusion
Multivariate regression is widely used in many areas of life. This paper presents
private property prices. The steps of building a predictive model involve: (a) applying the
subsets procedure to select the best variables; (b) building a linear regression model from the
a heteroskedasticity issue. This is a typical process of building regression models that may
Generally, the ideas of the econometric approach to the real estate market can also be used
in government property tax estimation, home buyers' financing estimation, realtors'
property evaluation, and so on. Researchers are currently exploring non-linear regression models,
which might provide a better estimator of real estate prices.
Although my model is not completely reliable for predicting all house prices, the study
provides an attempt to estimate the single-house price in Los Angeles County by using
a log-linear regression model as a scientific vehicle in the econometric approach (El
Mahmah, 2012). Since this model's results are significant, future researchers may be
able to improve it by adding the omitted variables as shown above in the system of formulas
Bibliography
Assil, EL MAHMAH. 2012. “Constructing a real estate price index: the Moroccan expe-
Bourassa, Steven, Eva Cantoni, and Martin Hoesli. 2010. “Predicting house prices with
Case, Bradford, Henry O Pollakowski, and Susan M Wachter. 1991. “On choosing
among house price index methodologies.” Real estate economics, 19(3): 286–307.
Dubin, Robin A. 1998. “Predicting house prices using multiple listings data.” The Journal
Hu, Gongzhu, Jinping Wang, and Wenying Feng. 2013. “Multivariate regression modeling
for home value estimates with evaluation using maximum information coefficient.” In