Article
A Comparative Study of Machine Learning and Spatial
Interpolation Methods for Predicting House Prices
Jeonghyeon Kim, Youngho Lee, Myeong-Hun Lee and Seong-Yun Hong *

Department of Geography, Kyung Hee University, Seoul 02447, Korea; zmsdkdle@khu.ac.kr (J.K.);
emfo0124@khu.ac.kr (Y.L.); lmh9265@khu.ac.kr (M.-H.L.)
* Correspondence: syhong@khu.ac.kr

Abstract: As the volume of spatial data has rapidly increased over the last several decades, there is a
growing concern about missing and incomplete observations that may result in biased conclusions.
Several recent studies have reported that machine learning techniques can more efficiently address this
limitation in emerging data sets than conventional interpolation approaches, such as inverse distance
weighting and kriging. However, most existing studies focus on data from environmental sciences;
so, further evaluations are required to assess their strengths and limitations for socioeconomic data,
such as house price data. In this study, we conducted a comparative analysis of four commonly used
methods: neural networks, random forests, inverse distance weighting, and kriging. We applied
these methods to the real estate transaction data of Seoul, South Korea, and demonstrated how the
values of houses for which no transactions are recorded could be predicted. Our empirical analysis
suggested that the neural networks and random forests can provide a more accurate estimation
than the interpolation counterparts. Of the two machine learning techniques, the results from a
random forest model were slightly better than those from a neural network model. However, the
neural network appeared to be more sensitive to the amount of training data, implying that it has the
potential to outperform the other methods when there are sufficient data available for training.
Keywords: house prices; machine learning; spatial interpolation; neural networks; random forests; real estate transactions data

1. Introduction

Spatial data are multidimensional and include the location information of certain events or objects and their non-spatial attributes. Ideally, the relationship between locations and attributes should be investigated using complete data for every location in the study region. In practice, however, such spatially continuous and complete data are difficult, if not impossible, to collect due to the time and cost required [1]. Missing and incomplete observations are common in most real-world datasets, and thus, the use of reliable and accurate imputation methods has become crucial to complement them [2].

In recent decades, there have been numerous studies on the estimation of missing values in spatial data. The relevant literature can be divided into two broad groups: (1) empirical studies that focus on utilizing spatial statistics and other imputation methods and (2) more methodological works that aim to identify the most accurate and reliable imputation method through comparative analysis. In the former, spatial interpolation techniques, such as inverse distance weighting (IDW) and kriging, have been commonly used to estimate unobserved values, especially when spatial autocorrelation is present in the data. However, machine learning techniques, such as neural networks and random forests, have become popular in recent years because of their successful application in various domains [3–8].

Although machine learning has shown its potential and usefulness in various academic and industrial fields, it is still essential to understand its advantages and limitations as a
spatial imputation tool compared to the existing interpolation methods [9]. While previous
studies have demonstrated that machine learning techniques, such as neural networks
and random forests, can estimate the missing values more accurately than the traditional
distance-based functions, the results varied depending on the training data size and the
characteristics of the target variable to be imputed. Considering that much of the existing
literature is in the field of environmental sciences, the application of machine learning
techniques to socio-spatial data still needs further evaluation.
To complement this limitation in the current literature, this work evaluates the
strengths and drawbacks of machine learning as an imputation method in the context
of social and spatial studies. More specifically, we aim to address the following two
research questions:
1. Are the state-of-the-art machine learning models capable of estimating missing values
in the spatial data representing complex urban phenomena, such as house prices?
2. If so, are they more accurate than the conventional spatial interpolation approaches?
To answer these questions, we employ two of the most popular techniques, namely
neural networks and random forests, to estimate the prices of the houses for which no
transactions are recorded in the real estate transaction data of Seoul, South Korea. The
results are then compared to those from IDW and kriging using a set of accuracy metrics,
such as the root mean square error (RMSE) and mean absolute error (MAE). The findings
from this empirical analysis will help us better understand the capabilities of machine
learning in social sciences and urban analytics, especially for imputing missing values in
massive spatial data with socioeconomic attributes.

2. Past Studies
In statistics, imputation refers to the process of replacing missing values with sub-
stituted ones using ancillary data [10]. Proper use of the imputation can minimize any
bias caused by missing values and ensure the efficiency of data processing and analysis.
Single imputation methods, such as hot-deck, mean substitution, and regression, have
been widely used because of their simplicity. However, multiple imputation, which reflects the uncertainty of the estimates by performing a set of single imputations, has also been increasingly adopted in recent years.
In the fields of geography, geology, and environmental sciences, interpolation methods
such as IDW and kriging have been commonly used to predict unobserved values in
datasets [11]. Past studies show that a simple distance-based function, such as IDW, works
reasonably well in certain settings (e.g., the case of bathymetry in [12]), but kriging is
often considered a more robust tool as it can take auxiliary variables into account. For
example, Wu and Li [13] argued that the temperature could be more accurately interpolated
when the variable ‘altitude’ was included in the model, along with latitude and longitude.
Bhattacharjee et al. [14] also demonstrated that a cokriging model incorporating a range of
weather variables into the temperature prediction could produce more precise estimations
than a model relying only on distances.
While these interpolation methods have usually been applied to environmental vari-
ables, such as precipitation, temperature, and humidity, some efforts have been made
to use IDW and kriging for the imputation of socioeconomic data with location infor-
mation [15,16]. Montero and Larraz [17] attempted to estimate the prices of commercial
properties in Toledo using IDW, kriging, and cokriging and concluded that cokriging per-
formed better than the others when auxiliary variables were chosen appropriately. Similarly,
Kuntz and Helbich [18] found that cokriging predicted the house prices in Vienna more
precisely than its other variants. These results seem to be supported by the cases of other
cities and countries [19,20].
Meanwhile, more recent studies focus on how the development of machine learning
can enhance the accuracy of imputation and replace the existing interpolation
methods. Because machine learning techniques, such as neural networks and random
forests, can handle the complexity and nonlinearity of real-world datasets, they can be
advantageous in predicting missing and unobserved values [21]. For the imputation of
environmental variables, several studies have been conducted to assess the performance of
machine learning, and promising results have been reported over the past decade [3,9]. In
particular, random forests yielded a significant improvement in various empirical cases,
suggesting that they can be a reliable alternative to conventional statistical approaches.
These efforts were not limited to environmental studies. Pérez-Rave and colleagues [22], for example, proposed an integrated approach of the hedonic price regression model and machine learning and demonstrated its advantages in predicting real estate data.
Čeh et al. [23] also compared random forests and the hedonic model based on multiple
regression to predict apartment prices. They found that the random forest models could
detect and predict data variability more accurately than the traditional approaches.
Nonetheless, it is still difficult to generalize that machine learning outperforms inter-
polation methods in predicting missing values. While many studies illustrate the successful
applications of such state-of-the-art techniques, their effectiveness and efficiency for a
particular problem depend on the size and nature of the data at hand [24]. In the case of
spatial data representing socioeconomic attributes, such as house prices, more empirical
analyses should be conducted to verify the relative strengths and drawbacks of machine
learning [2]. To this end, this study applies neural networks and random forests, currently
the two most popular machine learning techniques, to estimate unobserved values in real
estate transaction data and compares the results to those from the interpolation methods.

3. Data and Methods


3.1. Data
This study uses the real estate transaction data for Seoul, South Korea. It is a public
dataset provided by the Ministry of Land, Infrastructure, and Transport (MOLIT) and
contains all the sales and rental transaction records from 2006. We utilized only the sales
records between 2015 and 2019 as input data because the rental prices are determined by
various external factors and may not represent the actual value of the properties.

The real estate transactions data consist of several transaction-related variables, ranging from the dates of the transactions to the sales (or rental) prices. However, they do not include cadastral information, such as the year of construction, land and building areas, and the number of floors. MOLIT provides such information as a separate dataset called the integrated building information data. Therefore, we merged the two datasets manually to prepare the training and test datasets, as illustrated in Figure 1.

Figure 1. Joining process of the real estate transactions data and the integrated building information data.

The real estate transactions data have two fields for location information, namely lot
numbers and street addresses (or sometimes called road name addresses). The integrated
building information data have the same fields as well, but some adjustments are required
to match the location information of the two datasets. Therefore, we concatenated the
two fields to form a single field of [street address, lot number] and used it as a key field
during the joining process. Any transactions not matched to the integrated building
information data were removed from the final dataset.
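As a minimal sketch of this joining step, the following code builds the concatenated key field and performs an inner join with pandas. The file names and column names are illustrative assumptions, not the actual MOLIT schema.

```python
import pandas as pd

# Load the two MOLIT datasets (file and column names are assumptions).
trans = pd.read_csv("real_estate_transactions.csv")
bldg = pd.read_csv("integrated_building_info.csv")

# Concatenate street address and lot number into a single key field,
# as described in the text.
for df in (trans, bldg):
    df["key"] = df["street_address"].str.strip() + " " + df["lot_number"].astype(str)

# An inner join keeps only transactions matched to building records;
# unmatched transactions are dropped, mirroring the cleaning step above.
merged = trans.merge(bldg.drop(columns=["street_address", "lot_number"]),
                     on="key", how="inner")
```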
It may be worth noting that the sales transactions were made multiple times for some
properties during the study period. Consequently, there were duplicate points in the final
dataset, each representing one transaction. Although we tried to minimize data loss during
the merging process, some transaction records were dropped because of incomplete or
inaccurate address information.
The final dataset consisted of 287,034 observations with 16 variables. We randomly
selected approximately 70% of these, or 200,924 observations, to train the prediction models;
we tested the accuracy of each model using the remaining 30% (i.e., 86,110 observations).
Table 1 lists the names, descriptions, and types of the variables. The sales price per unit
was the target variable to be imputed, and all the other variables were used as predictor
variables. Note that the interpolation methods adopted in the present study cannot consider
auxiliary variables other than the x and y coordinates; therefore, the estimation is based
only on the distances between the properties.
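A minimal sketch of this split, assuming the merged table from the previous step and the variable names in Table 1; the fixed random seed is an assumption added for reproducibility (the paper does not report one).

```python
from sklearn.model_selection import train_test_split

X = merged.drop(columns=["price"])   # the 15 predictor variables
y = merged["price"]                  # sales price per unit (target)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```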

Table 1. List of predictor and target variables and descriptions.

Type       Name       Description                                                Class
Predictor  bldg_area  Land area occupied by the building                         Numeric
           yr_built   Year built                                                 Numeric
           flr_area   Total floor area of the building                           Numeric
           site_area  Area of the site on which the building is located          Numeric
           height     Height of the building (in metres)                         Numeric
           bcr        Building coverage ratio                                    Numeric
           far        Floor area ratio                                           Numeric
           x          Latitude                                                   Numeric
           y          Longitude                                                  Numeric
           district   District in which the building is located                  Categorical
           yr         Year of the sales transaction                              Numeric
           month      Month of the sales transaction                             Numeric
           net_area   Floor area of the property                                 Numeric
           flr_lv     Floor level                                                Integer
           type       Type of the property (e.g., apartments, detached houses)   Categorical
Target     price      Sales price                                                Numeric

3.2. Methods
In this study, we examine the accuracy and efficiency of four methods—neural net-
works, random forests, IDW, and kriging—as imputation tools for spatial data representing
socioeconomic attributes, specifically house prices.
A neural network model comprises one input layer and one output layer, and there
can be multiple hidden layers between the two. The input layer takes raw data values and
feeds them into one or more hidden layers in the middle of the model. The data values are
weighted and summed at each hidden layer and then passed through an activation function
to the next hidden or output layer [25]. The number of hidden layers in a neural network
model is directly related to its ability to handle nonlinearity in the data. As the number
of hidden layers increases, a more complex relationship between the input and output
variables can be addressed; however, the risk of overfitting simultaneously escalates [26].
The learning process of a neural network involves backpropagation, which propagates
errors in the reverse direction from the output layer to the input layer and updates the
weights in each layer. While this is a crucial step for ensuring the accuracy and reliability of
the model [27,28], backpropagation in a large network with many hidden layers and nodes
can be computationally expensive. Therefore, for successful and effective use of neural
networks, it is essential to choose the appropriate number of hidden layers and nodes,
along with other hyperparameters, through iterative refinement.
Random forest is one of the most popular ensemble techniques; it forms multiple
decision trees using subsets of the input data, trains them through random bootstrapping,
and aggregates the trees into a final model [29]. In a random forest model, one predictor
variable is randomly selected at each branch of decision trees, and such randomness
minimizes bias and reduces the correlation between the trees [29]. Previous studies found
that random forests are robust to missing observations and outliers [30] and can produce
more accurate and reliable results than traditional decision trees.
One practical drawback of random forests is that they cannot visualize the results in
an intuitive graph form, as in the case of decision trees. If the significance of predictor
variables is required to simplify and optimize the model, it should be estimated separately.
In this study, we estimate the relative importance of variables by calculating the out-of-bag
(OOB) errors for decision trees with n-1 variables. If the error changes marginally when a
specific variable is removed, the variable can be eliminated from the model, as it has only a
slight impact on the result [31]. To obtain the best possible prediction results from random
forests, the predictor variables should be chosen based on the OOB errors, and the other
hyperparameters, such as the number of trees, should be carefully tuned [32].
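As a rough sketch of this OOB-based screening, the code below refits a forest with each variable removed in turn, assuming the categorical predictors have already been numerically encoded. Note that scikit-learn reports an OOB R² score rather than a raw OOB error, so it is used here as a proxy; feature_names, X_train, and y_train carry over from the earlier sketches.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Fit a forest with out-of-bag scoring enabled.
rf = RandomForestRegressor(n_estimators=200, oob_score=True,
                           n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
baseline = rf.oob_score_  # OOB R^2 with all predictors

# Refit with each variable removed; a marginal change suggests the
# variable contributes little and could be dropped from the model.
for j, name in enumerate(feature_names):
    X_j = np.delete(np.asarray(X_train, dtype=float), j, axis=1)
    rf_j = RandomForestRegressor(n_estimators=200, oob_score=True,
                                 n_jobs=-1, random_state=0)
    rf_j.fit(X_j, y_train)
    print(f"{name}: change in OOB score = {baseline - rf_j.oob_score_:+.4f}")
```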
IDW is the simplest spatial interpolation method and has been used for several
decades [11]. For each location to be imputed, it computes a weighted average of all the
observed values, where the weight of observations is determined based on their distance to
the location of interest, $s_0$ [33]. The equation for IDW can be formulated as

$$\hat{Z}(s_0) = \frac{\sum_{i=1}^{n} w(s_i)\,Z(s_i)}{\sum_{i=1}^{n} w(s_i)} \tag{1}$$

where $\hat{Z}(s_0)$ refers to the estimated value at point $s_0$, and $Z(s_i)$ is the observed value at point $s_i$. $w(s_i)$ represents the weight applied to $Z(s_i)$ and is calculated as

$$w(s_i) = \lVert s_i - s_0 \rVert^{-p} \tag{2}$$

The distance decay parameter, p, defines how rapidly the influence of distant points
decreases. As p increases, the weights for points distant from $s_0$ become significantly smaller
than those for the close points. It is set to 2 by default in most computer software; however,
because the choice of p is crucial for the performance of IDW, it is desirable to test several
candidates and choose the one that produces the most accurate and explainable results.
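The following is a minimal NumPy sketch of Equations (1) and (2). It builds the full distance matrix, so for data of the size used in this study one would in practice process the prediction points in chunks or restrict the sum to the k nearest neighbours.

```python
import numpy as np

def idw_predict(coords_obs, z_obs, coords_new, p=2.0):
    """Inverse distance weighting following Equations (1) and (2).

    coords_obs: (n, 2) observed x/y coordinates
    z_obs:      (n,)   observed values
    coords_new: (m, 2) locations to be imputed
    p:          distance decay parameter
    """
    # Pairwise distances between prediction and observation points
    d = np.linalg.norm(coords_new[:, None, :] - coords_obs[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)            # guard against zero distances
    w = d ** (-p)                        # Equation (2)
    return (w @ z_obs) / w.sum(axis=1)   # Equation (1)
```

In Section 4, p is then swept from 2 to 8 in steps of 0.1 and the value yielding the lowest RMSE is retained.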
Kriging is a weighted linear regression that utilizes observed values from neighboring
locations to predict missing or unknown values [34]. Kriging has many variants, including
simple, ordinary, and universal kriging. The most appropriate variant depends on the
variable of interest. If the mean of the target variable is constant in the study area and its
exact value is known, simple kriging can minimize the prediction errors. However, if the
mean of the target variable is unknown, or if it is non-stationary, the use of either ordinary
or universal kriging is ideal [35–37].
Cokriging refers to a family of methods that incorporate associated variables into
the kriging models. Cokriging has the same variants as kriging—simple, ordinary, and
universal—and it tends to perform better than its univariate counterparts, as briefly dis-
cussed in Section 2 (see, for example, [14]). However, in this study, none of the ancillary
variables available (i.e., those presented in Table 1) was sufficiently correlated with the
target variable. The strongest correlation of 0.285 was observed between the land areas and
sales prices per unit, but it was not strong enough to expect any improvements from the
use of cokriging [38]. Furthermore, there was no apparent trend in the mean of the target
variable; so, we chose ordinary kriging as the most suitable method for the provided data.
The semivariogram parameters, including nugget, sill, and range, were chosen through
iterations to obtain the best fitting model for ordinary kriging.
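A minimal ordinary kriging sketch is shown below using the pykrige library; the paper does not name its software, so this library choice, the subsample (kriging on all ~200k training points would be impractical), and the placeholder semivariogram values are assumptions. train_xy, train_price, and test_xy are the coordinate and price arrays from the earlier sketches.

```python
import numpy as np
from pykrige.ok import OrdinaryKriging

# Fit ordinary kriging with a spherical semivariogram on a random subsample.
idx = np.random.default_rng(0).choice(len(train_xy), 5000, replace=False)
ok = OrdinaryKriging(
    train_xy[idx, 0], train_xy[idx, 1], train_price[idx],
    variogram_model="spherical",
    # Placeholder nugget/sill/range; Section 4 selects these by iteration.
    variogram_parameters={"sill": 46000.0, "range": 1000.0, "nugget": 18000.0},
)
z_pred, ss = ok.execute("points", test_xy[:, 0], test_xy[:, 1])
```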

4. Model Optimization
To obtain the most accurate predictions from each method presented in the previous
section, it is crucial to optimize the model hyperparameters [39]. The hyperparameters
were iteratively changed in this study, and the accuracies were assessed through cross-
validation. To avoid overfitting, we divided all of the data into training and test sets for
model building and validation, respectively. For IDW and kriging, however, there were
no learning processes involved; therefore, these methods were directly applied to the test
set. The root mean square error (RMSE) was employed as the primary metric during the
optimization process, and the hyperparameters that minimized the RMSE were selected for
each model.
The neural network models were constructed by adjusting the number of hidden
layers and the number of nodes in the hidden layers. The batch size, indicating the total
number of training data in a single batch, was fixed at 64, and the epochs for the model
optimization were set at 150, considering the size of the training set. The activation and
loss functions were set to a rectified linear unit (ReLU) and mean square error (MSE),
respectively. Table 2 indicates that the RMSE decreases in general as the number of hidden
layers and nodes increases. However, it also suggests that the RMSE would increase if the
scale of the network exceeded a certain point. Therefore, in this study, we chose a model
with four hidden layers, a maximum of 256 nodes, and a minimum of 128 nodes as the
optimal neural network model, as it yielded the lowest RMSE.
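A minimal Keras sketch of the selected network follows: four hidden layers with at most 256 and at least 128 nodes, ReLU activations, MSE loss, batch size 64, and 150 epochs, as reported above. The exact intermediate layer widths, the Adam optimizer, and the validation fraction are assumptions not stated in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),
    layers.Dense(256, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(1),  # regression output: sales price per unit
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, batch_size=64, epochs=150,
          validation_split=0.1, verbose=0)
```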

Table 2. RMSE of neural network models with different combinations of hyperparameters.

Model  No. of Hidden Layers  No. of Nodes (Max)  No. of Nodes (Min)  RMSE
1      2                     32                  16                  193.75
2      2                     128                 64                  152.35
3      2                     256                 128                 143.19
4      3                     32                  16                  210.97
5      3                     128                 64                  114.08
6      3                     256                 128                 103.03
7      4                     32                  16                  124.04
8      4                     128                 64                  101.99
9      4                     256                 128                 97.90
10     5                     32                  16                  117.59
11     5                     128                 64                  98.65
12     5                     256                 128                 107.60

The random forest models were built by varying the numbers of trees and predictor variables for node branching. The number of trees increased by 50 from 50 to 200. For each decision tree, the number of predictor variables was set to 4, 8, and 16, making 12 models in total. The cross-validation results show that the error tends to decrease as we increase the number of trees and predictor variables. In Table 3, the model with 200 trees and 16 predictor variables has the lowest RMSE; so, we chose it as the optimal model for the random forest.

Table 3. RMSE of random forest models with different numbers of decision trees and predictor variables.

Model  No. of Trees  Predictor Variables  RMSE
1      50            4                    116.12
2      50            8                    87.47
3      50            16                   82.58
4      100           4                    115.44
5      100           8                    86.85
6      100           16                   82.07
7      150           4                    115.27
8      150           8                    86.68
9      150           16                   81.92
10     200           4                    115.13
11     200           8                    86.57
12     200           16                   81.86
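The 12-model grid of Table 3 can be sketched as below, with the number of candidate predictors per split mapped to scikit-learn's max_features. Scoring on the held-out test set here stands in for the paper's cross-validation, which is a simplification.

```python
from itertools import product
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

for n_trees, m in product([50, 100, 150, 200], [4, 8, 16]):
    rf = RandomForestRegressor(n_estimators=n_trees, max_features=m,
                               n_jobs=-1, random_state=0)
    rf.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, rf.predict(X_test)) ** 0.5
    print(f"trees={n_trees:3d}  max_features={m:2d}  RMSE={rmse:.2f}")
```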

In IDW, because the model estimation depends only on the distance decay parameter p, we examined the accuracy by gradually increasing it from 2 to 8 by 0.1. The results are presented in Figure 2, where the RMSE is the lowest at p = 2.2; therefore, it was chosen as the optimal parameter for IDW.

Figure 2. RMSE with different distance decaying parameters, p.

In ordinary kriging, the models were constructed by varying the semivariogram parameters, namely the model type, nugget, sill, and range (Figure 3). As shown in Table 4, we tested the spherical, exponential, and Gaussian models to fit the semivariogram. For each model, we used four different range values (i.e., 1000, 3000, 5000, and 7000) with default nugget and sill values. Overall, the spherical function provided more accurate results, and the RMSE seemed to decrease as the range decreased (Table 4).

Figure 3. Semivariogram and fitting functions representing the spherical model, the exponential model, and the Gaussian model.

Table 4. RMSE of ordinary kriging models with different sets of parameters.

Model  Fitting Function  Nugget     Sill       Range  RMSE
1      Spherical         18,189.04  45,968.56  1000   131.00
2      Spherical         25,750.27  55,018.88  3000   134.11
3      Spherical         27,728.88  61,251.09  5000   134.95
4      Spherical         28,872.34  66,397.40  7000   135.38
5      Exponential       21,004.69  54,192.23  1000   131.89
6      Exponential       26,625.02  67,663.44  3000   134.32
7      Exponential       28,158.26  78,786.51  5000   134.97
8      Exponential       29,120.28  94,920.26  7000   135.37
9      Gaussian          28,021.85  51,190.93  1000   136.91
10     Gaussian          31,722.12  65,187.40  3000   137.62
11     Gaussian          33,177.85  76,133.24  5000   137.74
12     Gaussian          34,099.72  98,675.17  7000   137.79

5. Results

The models selected from the previous section were applied to the test datasets. For comparison, Figure 4 shows the distribution of the actual house prices and the residuals (i.e., the difference between the actual and the predicted prices) from each method. Due to the volume of the data and the presence of many duplicate points, the results could not be visualized using raw points. Instead, we created grids and calculated the average values for each grid. In Figure 4a, the grids with high average sales prices are marked in dark red, whereas those with low prices are marked in light yellow. The red and blue colors in Figure 4b–e indicate the positive and negative residuals, respectively.

Figure 4. Geographic distributions of the actual (observed) sales prices and the residuals from each method: (a) actual values, (b) neural networks, (c) random forests, (d) IDW, and (e) ordinary kriging.

Figure 4a shows that the high sales prices are clustered around the central and bottom
parts of the city, as well as along the Han River. The residuals from the neural network
model tend to be small and positive, but there are some large discrepancies between
the actual and the predicted values around where the expensive properties are located
(Figure 4b). The random forest model has a similar visual impression (Figure 4c), and it is
difficult to conclude which model is better from these maps.
In contrast, it is relatively apparent that IDW and kriging produced higher residuals
in some parts of the city. Figure 4d implies that both over- and underestimation of house
prices have occurred. There are both negative and positive residuals, and some of those
along the river are considerable in size. The pattern in Figure 4e is similar; however, the
residuals tend to be smaller than those from IDW, and there appears to be no clustering
of large residuals in the central part of the city. Overall, these results imply that IDW and
kriging are less reliable for predicting house prices than the machine learning methods.
House prices are determined by a complex interplay of various geographic, demographic,
and social factors, and thus, the conventional spatial interpolation methods that rely on a
distance-based function may not be sufficient to establish an accurate model.
In addition to these visual comparisons, we adopted several accuracy metrics, including the mean absolute error (MAE), mean absolute percentage error (MAPE), mean absolute scaled error (MASE), and RMSE, for a more comprehensive evaluation.

The MAE is the mean of the absolute errors between the actual and predicted values. Because the MAE represents the absolute magnitude of the error, a smaller MAE value implies a higher accuracy of the model. As it does not involve squaring the errors, it is less affected by outliers than the RMSE. The MAPE is defined as the error divided by the actual, observed value. As the actual value is used as a denominator, it must be non-zero; as an actual value smaller than one approaches zero, the MAPE approaches infinity. The last metric in this study, the MASE, divides the error by the normal variation range of the data to determine whether the error falls outside that range. This metric is particularly useful when comparing variables with different variances.
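The four metrics can be sketched as follows for 1-D arrays of actual (y) and predicted (yhat) prices. The MASE here scales the MAE by the mean absolute deviation of the training target, which is one common reading of the "normal variation range" described above.

```python
import numpy as np

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

def mape(y, yhat):
    return np.mean(np.abs((y - yhat) / y))  # requires non-zero actuals

def mase(y, yhat, y_train):
    scale = np.mean(np.abs(y_train - np.mean(y_train)))
    return mae(y, yhat) / scale
```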
Table 5 presents the calculated accuracy metrics. All metrics indicate that the random
forest model is the most accurate, followed by the neural network model. The accuracies of
IDW and kriging were found to be the lowest. These findings indicate that the machine
learning techniques could be superior to the distance-based interpolation methods for
predicting house prices.

Table 5. Prediction accuracy of machine learning and conventional interpolation techniques measured
by four error metrics, MAE, RMSE, MAPE, and MASE.

Method MAE RMSE MAPE MASE
Neural network 64.4615 102.0348 0.1260 0.2759
Random forest 48.1897 80.8789 0.0952 0.2063
IDW 78.4847 126.5288 0.1544 0.3360
Ordinary kriging 83.5145 131.0042 0.1655 0.3575

This result can be attributed to the number of data points used in this study. In
ordinary kriging, although the imputation process is simple when the amount of data is
small, the calculation time increases significantly when a large amount of data is available,
making it difficult to fully utilize the given data [40]. On the other hand, most machine
learning techniques are designed to work with large data, and the accuracy of the model
improves as the volume of the data increases in general. The data in this study amounted
to almost 300,000 transactions, resulting in a higher accuracy with the machine learning
techniques than the spatial interpolation counterparts.
Among the machine learning techniques, the random forest model yielded higher
accuracy than the neural network model. One reason behind this is that the real estate
transaction data are in a tabular form. When tabular data are used in neural networks, the
structure of the model can become highly dimensional or overparameterized, resulting in
overfitting issues during the analysis process [41–43]. Conversely, the tree-based ensemble
model can be more efficient for high-dimensional tabular data [44], and it has the advantage
of preventing overfitting by considering the balance between biases and variances [29]. For
this reason, the random forest model might perform better than the neural network for
predicting house prices in this study.
As mentioned in the previous section, the relative importance of the predictor variables in a random forest model can be calculated. This can compensate for a disadvantage of random forests, namely that their results are difficult to visualize and interpret intuitively.

Figure 5 shows the relative importance of the predictor variables in the selected random forest model. The most important variable, the y coordinate, has a score of 100, and the scores of the other variables are estimated relative to it. Following the y coordinates, the x coordinates, the years built, the heights, and a binary variable indicating whether the building is in Gangnam-gu, one of the 25 districts in Seoul, were relatively significant. Considering that five of the listed variables (i.e., site_area, far, bcr, flr_area, and bldg_area) are area-related, the areas are among the most important factors that affect house prices, which is consistent with the results from previous studies. In addition, the location-related variables also have considerable influences on the target variable: not only the absolute locations represented by the x and y coordinates but also the districts in which the properties are located seem to have a significant influence.

Figure 5. Relative variable importance in the selected random forest model.
6. Conclusions

Although it is crucial to secure data integrity across the entire study region when utilizing spatial data, missing values almost always exist for various reasons. Several studies have been conducted to develop imputation methods for spatial data to address this problem. In geography, IDW and kriging have been commonly used, and they have contributed to solving the discontinuity of data. Recently, it has been suggested that the development of machine learning can replace existing spatial interpolation methods, and its superior performance has been demonstrated in several fields. However, there are still uncertainties about whether it can be applied to spatial data representing complex urban phenomena.

While there is a considerable number of studies that compare machine learning and the existing interpolation methods, the existing works tend to focus on the environmental fields. To fill this gap in the literature, we compared machine learning (i.e., neural networks and random forests) and spatial interpolation methods (i.e., IDW and kriging) using real estate transactions data in Seoul, South Korea. In this process, we constructed a dataset by combining the real estate transactions data and the integrated building information data.

We proceeded with model optimization by varying the parameters to obtain the best results
from each method.
From the maps of the residuals from each model, we found that the machine learning
models performed better than the spatial imputation methods. The neural network model
and the random forest model did not show high residuals across the study area. This
visual impression was confirmed by a set of accuracy metrics, i.e., MAE, RMSE, MAPE, and
MASE. The random forest model was the most accurate, followed by the neural network.
The spatial interpolation methods showed clusters of high residuals, and kriging followed
IDW in terms of the accuracy indicators. High residuals tend to appear in the regions with
high transaction prices. To confirm how the selected variables affect the prediction results,
the relative variable importance was calculated, and those related to areas and locations
were found to have a significant influence.
In this study, we conducted a comparative analysis of machine learning and spatial
interpolation methods for predicting house prices. The application of machine learning
to the imputation of spatial data has been successful in many studies. Nonetheless, this
study is significant as it provides more empirical evidence to support the use of machine
learning for social science research and urban analytics. It is also expected that this study
can contribute to the application of machine learning in the imputation of spatial data.
Nevertheless, it is important to note that this result should not be generalized to other
fields of study and other types of spatial data. With the development of machine learning,
many methods have been proposed, and their performance is continuously improving.
However, the performance of a specific model for a specific application is affected by
various factors, such as the size and structure of the data. Therefore, it is important to
ensure that the method we adopt and the model we are building are superior to other
candidates through careful evaluation.

Author Contributions: Conceptualization, S.-Y.H.; methodology, J.K. and S.-Y.H.; validation, J.K. and
S.-Y.H.; data curation, J.K. and M.-H.L.; formal analysis, J.K.; investigation, J.K.; writing—original
draft preparation, J.K. and Y.L.; writing—review and editing, S.-Y.H.; visualization, J.K.; supervision,
S.-Y.H.; project administration, S.-Y.H.; funding acquisition, S.-Y.H. All authors have read and agreed
to the published version of the manuscript.
Funding: This work was supported by the National Research Foundation of Korea (NRF) grants
funded by the Korean government (Ministry of Science and ICT) (NRF-2021R1C1C1009849; NRF-
2019R1C1C1006555).
Data Availability Statement: The real estate transactions data of South Korea are available to down-
load from the following website: https://rt.molit.go.kr/ (accessed on 13 June 2022).
Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design
of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or
in the decision to publish the results.

References
1. Li, J.; Heap, A.D.; Potter, A.; Huang, Z.; Daniell, J.J. Can we improve the spatial predictions of seabed sediments? A case study of
spatial interpolation of mud content across the southwest Australian margin. Cont. Shelf Res. 2011, 31, 1365–1376. [CrossRef]
2. Tadić, J.M.; Ilić, V.; Biraud, S. Examination of geostatistical and machine-learning techniques as interpolators in anisotropic
atmospheric environments. Atmos. Environ. 2015, 111, 28–38. [CrossRef]
3. Appelhans, T.; Mwangomo, E.; Hardy, D.R.; Hemp, A.; Nauss, T. Evaluating machine learning approaches for the interpolation of
monthly air temperature at Mt. Kilimanjaro, Tanzania. Spat. Stat. 2015, 14, 91–113. [CrossRef]
4. Mariano, C.; Mónica, B. A random forest-based algorithm for data-intensive spatial interpolation in crop yield mapping. Comput.
Electron. Agric. 2021, 184, 106094. [CrossRef]
5. Zhu, D.; Cheng, X.; Zhang, F.; Yao, X.; Gao, Y.; Liu, Y. Spatial interpolation using conditional generative adversarial neural
networks. Int. J. Geogr. Inf. Sci. 2020, 34, 735–758. [CrossRef]
6. Hu, Q.; Li, Z.; Wang, L.; Huang, Y.; Wang, Y.; Li, L. Rainfall Spatial Estimations: A Review from Spatial Interpolation to
Multi-Source Data Merging. Water 2019, 11, 579. [CrossRef]
7. Nghiep, N.; Al, C. Predicting Housing Value: A Comparison of Multiple Regression Analysis and Artificial Neural Networks. J.
Real Estate Res. 2001, 22, 313–336. [CrossRef]
8. Lin, G.-F.; Chen, L.-H. A spatial interpolation method based on radial basis function networks incorporating a semivariogram
model. J. Hydrol. 2004, 288, 288–298. [CrossRef]
9. Li, J.; Heap, A.D.; Potter, A.; Daniell, J.J. Application of machine learning methods to spatial interpolation of environmental
variables. Environ. Model. Softw. 2011, 26, 1647–1659. [CrossRef]
10. Kleinke, K.; Reinecke, J.; Salfrán, D.; Spiess, M. Applied Multiple Imputation; Springer International Publishing: Cham, Switzerland, 2020.
11. Meng, Q.; Liu, Z.; Borders, B.E. Assessment of regression kriging for spatial interpolation—Comparisons of seven GIS interpolation
methods. Cartogr. Geogr. Inf. Sci. 2013, 40, 28–39. [CrossRef]
12. Henrico, I. Optimal interpolation method to predict the bathymetry of Saldanha Bay. Trans. GIS 2021, 25, 1991–2009. [CrossRef]
13. Wu, T.; Li, Y. Spatial interpolation of temperature in the United States using residual kriging. Appl. Geogr. 2013, 44, 112–120.
[CrossRef]
14. Bhattacharjee, S.; Chen, J.; Ghosh, S.K. Spatio-temporal prediction of land surface temperature using semantic kriging. Trans. GIS
2020, 24, 189–212. [CrossRef]
15. Martínez, M.G.; Lorenzo, J.M.M.; Rubio, N.G. Kriging methodology for regional economic analysis: Estimating the housing price
in Albacete. Int. Adv. Econ. Res. 2000, 6, 438–450. [CrossRef]
16. McCluskey, W.J.; Deddis, W.G.; Lamont, I.G.; Borst, R.A. The application of surface generated interpolation models for the
prediction of residential property values. J. Prop. Investig. Financ. 2000, 18, 162–176. [CrossRef]
17. Montero, J.; Larraz, B. Interpolation Methods for Geographical Data: Housing and Commercial Establishment Markets. J. Real
Estate Res. 2011, 33, 233–244. [CrossRef]
18. Kuntz, M.; Helbich, M. Geostatistical mapping of real estate prices: An empirical comparison of kriging and cokriging. Int. J.
Geogr. Inf. Sci. 2014, 28, 1904–1921. [CrossRef]
19. Kim, G.; Lee, B.; Park, B. A comparative analysis on spatial interpolation techniques for price estimation of housing facilities.
Geogr. J. Korea 2013, 47, 119–127.
20. Choi, J.H.; Kim, B.J. A study for applicability of cokriging techniques for estimating the real transaction price of land. J. Korean
Soc. Geospat. Inf. Sci. 2015, 23, 55–63.
21. Rigol, J.P.; Jarvis, C.H.; Stuart, N. Artificial neural networks as a tool for spatial interpolation. Int. J. Geogr. Inf. Sci. 2001, 15,
323–343. [CrossRef]
22. Pérez-Rave, J.I.; Correa-Morales, J.C.; González-Echavarría, F. A machine learning approach to big data regression analysis of real
estate prices for inferential and predictive purposes. J. Prop. Res. 2019, 36, 59–96. [CrossRef]
23. Čeh, M.; Kilibarda, M.; Lisec, A.; Bajat, B. Estimating the Performance of Random Forest versus Multiple Regression for Predicting
Prices of the Apartments. ISPRS Int. J. Geo-Inf. 2018, 7, 168. [CrossRef]
24. Seya, H.; Shiroi, D. A Comparison of Residential Apartment Rent Price Predictions Using a Large Data Set: Kriging versus Deep
Neural Network. Geogr. Anal. 2022, 54, 239–260. [CrossRef]
25. Abraham, A. Artificial Neural Networks. In Handbook of Measuring System Design; Oklahoma State University: Stillwater, OK,
USA, 2005.
26. Minsky, M.; Papert, S. Perceptrons: An Introduction to Computational Geometry; The MIT Press: Cambridge, MA, USA, January 1969.
Available online: https://mitpress.mit.edu/books/perceptrons (accessed on 12 June 2022).
27. Montavon, G.; Samek, W.; Müller, K.-R. Methods for interpreting and understanding deep neural networks. Digit. Signal
Processing 2018, 73, 1–15. [CrossRef]
28. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [CrossRef]
29. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
30. Antipov, E.A.; Pokryshevskaya, E.B. Mass appraisal of residential apartments: An application of Random forest for valuation and
a CART-based approach for model diagnostics. Expert Syst. Appl. 2012, 39, 1772–1778. [CrossRef]
31. Strobl, C.; Malley, J.; Tutz, G. An introduction to recursive partitioning: Rationale, application, and characteristics of classification
and regression trees, bagging, and random forests. Psychol. Methods 2009, 14, 323–348. [CrossRef]
32. Liaw, A.; Wiener, M. Classification and regression by randomForest. R News 2002, 2, 18–22.
33. Bivand, R.; Pebesma, E.J.; Gómez-Rubio, V. Applied Spatial Data Analysis with R. In Use R!; Springer: New York, NY, USA, 2013.
34. Cressie, N. The origins of kriging. Math. Geol. 1990, 22, 239–252. [CrossRef]
35. Armstrong, M. Problems with universal kriging. J. Int. Assoc. Math. Geol. 1984, 16, 101–108. [CrossRef]
36. Oliver, M.A.; Webster, R. Kriging: A method of interpolation for geographical information systems. Int. J. Geogr. Inf. Syst. 1990, 4,
313–332. [CrossRef]
37. Webster, R.; McBratney, A.B. Mapping soil fertility at Broom’s Barn by simple kriging. J. Sci. Food Agric. 1987, 38, 97–115.
[CrossRef]
38. Van der Meer, F. Introduction to Geostatistics; ITC Lecture Notes: Enschede, The Netherlands, 1993.
39. Probst, P.; Boulesteix, A.-L.; Bischl, B. Tunability: Importance of hyperparameters of machine learning algorithms. J. Mach. Learn.
Res. 2019, 20, 1934–1965.
40. Cressie, N.; Johannesson, G. Fixed rank kriging for very large spatial data sets. J. R. Stat. Soc. Ser. B 2008, 70, 209–226. [CrossRef]
41. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
42. Shavitt, I.; Segal, E. Regularization learning networks: Deep learning for tabular datasets. Adv. Neural Inf. Processing Syst. 2018,
31. Available online: https://proceedings.neurips.cc/paper/2018 (accessed on 12 June 2022).
43. Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling tabular data using conditional GAN. Adv. Neural Inf.
Processing Syst. 2019, 32. Available online: https://proceedings.neurips.cc/paper/2019 (accessed on 12 June 2022).
44. Lundberg, S.M.; Erion, G.G.; Lee, S.-I. Consistent individualized feature attribution for tree ensembles. arXiv 2018,
arXiv:1802.03888.
