DIFFERENCES IN AIRBNB PRICES

DETERMINANTS BETWEEN
COUNTRIES: A MACHINE
LEARNING APPROACH

XINKUN LIU

thesis submitted in partial fulfillment


of the requirements for the degree of
master of science in data science & society
at the school of humanities and digital sciences
of tilburg university

word counts: 8610


student number
u485290

committee
Dr. Dan Stowell
Chris Emmery MSc

location
Tilburg University
School of Humanities and Digital Sciences
Department of Cognitive Science &
Artificial Intelligence
Tilburg, The Netherlands

date
June 24, 2022

acknowledgments
I appreciate the guidance from Dr. Dan Stowell on this study.

Abstract
Airbnb has become increasingly popular since 2008 and offers a
holiday rental market with variable rental prices. Proposed price
determinants for Airbnb listings include several attributes, which have
been analysed for specific cities. However, the influence of countries,
as well as the relative contributions of a country and its cities to the
importance of Airbnb rental price determinants, is unknown. This thesis
investigates the differences in Airbnb price determinants between
countries to fill these gaps. Concretely, several Machine Learning
models are applied to publicly available datasets of Airbnb listings in
Germany, the United States, Australia, and China. In prediction, the
best performing model explains up to 80% of the variance with a minimum
MSE of 0.093 when applied per country, as opposed to a maximum of 65.9%
for a global baseline model with a minimum MSE of 0.205. Feature
importance analysis is then conducted based on the baseline and the best
performing model. The main finding from both models is that some of the
Airbnb rental price determinants differ between countries. In
particular, review scores are not important for any of the countries
according to the Extra Trees model, which is a novel finding compared to
existing literature. This thesis can be a starting point for broader
studies on the importance of Airbnb price determinants that extend from
cities to countries. These findings can also help Airbnb hosts better
understand the international market when they expand their businesses to
other countries.

1 data source and code

Work on this thesis did not involve collecting data directly from human
participants from Germany, the United States, Australia, or China. No
experiments were carried out for original data collection and no new data
were gathered for this study. The data used in this thesis are made publicly
available online by InsideAirbnb.com under the Creative Commons Attribution
4.0 International License. No "private" information is used. Names,
listings, and review details are all publicly displayed on the Airbnb website.
InsideAirbnb.com is the original owner of the data and retains ownership
of the data during and after the completion of this thesis. Furthermore, I
do not hold any legal claim or authority to the data used in this thesis.

2 introduction

According to Hamari, Sjöklint, and Ukkonen (2016), the sharing economy
refers to:

"peer-to-peer-based (P2P) activity of obtaining, giving, or sharing
the access to goods and services, coordinated through
community-based online services."

Accommodation is a major business sector in the P2P economy, and
Airbnb is a prominent part of it (Fang, Ye, & Law, 2016). Airbnb has
grown dramatically since its establishment in 2008. It has more than
six million listings in more than 220 countries and regions (Airbnb, 2021).
Within the hospitality industry, which includes Airbnb, price is considered
crucial to customers when choosing their accommodation while traveling
(Zhang, Chen, Han, & Yang, 2017). At the same time, setting an appropriate
rental price is among the pivotal business strategies for hosts because
they face greater competition than before (Gibbs, Guttentag,
Gretzel, Morton, & Goodwill, 2018). Nowadays, internationalization
influences not only travelers but also hosts. An increasing
number of people are expanding their Airbnb businesses from one country
to another. Thus, it is worthwhile to analyze the importance of Airbnb
location in terms of countries to help hosts understand
different markets. Furthermore, the platform, Airbnb.com, also provides
suggestions to hosts and travelers from all over the world, which means
the platform operators can also benefit from understanding the differences
in Airbnb rental price determinants between countries to give better advice.

From a scientific perspective, many existing studies are inspired by
research from the hotel industry and have revealed key factors for the
prediction of Airbnb rental price, such as listing location, room amenities,
and host attributes (Cai, Zhou, Ma, & Scott, 2019; Chang & Li, 2020; Dudás,
Kovalcsik, Vida, Boros, & Nagy, 2020; Gibbs et al., 2018). However, these
studies only focus on the listing itself and the neighbourhood of the listing.
For instance, they consider whether the property is in the city center or
close to attractions (Pérez Sánchez, Serrano Estrada, Martí-Ciriquián, & García,
2018; Wang & Nicolau, 2017). Although Pérez Sánchez et al. (2018) and
Chang and Li (2020) have suggested a possible influence of the city where
the listing is located, it is not analyzed in depth. There is also no specific
differentiation between countries when determining the importance level
of key Airbnb rental price determinants. Wang and Nicolau (2017)
mention in their study that a global model describing the Airbnb rental
price determinants based on larger geographical locations, such as multiple
cities combined or countries, can better reflect the real market situation.
They also assume that this global model can indicate the association between
Airbnb rental price and price determinants in terms of market equilibrium,
if Airbnb listing prices are acceptable to travelers from different
countries around the world (Wang & Nicolau, 2017). Therefore, it is necessary
to identify the possible similarities or differences in Airbnb rental price
determinants between countries to provide a starting point for further
studies. Additionally, as current research focuses on cities, it is necessary
to identify the relative contributions of countries and their cities to the
importance level of Airbnb rental price determinants. The results can be
compared with existing literature to find supporting evidence and obtain
new findings.

In this research, data on Airbnb listings are collected for four countries from
four continents, and the features are incorporated in the prediction of Airbnb
price using Machine Learning algorithms, including Linear Regression,
XGBoost, Support Vector Regression (SVR), Gradient Boosting, and
Extra Trees. Model performance and comparison are based on R squared
scores and MSE. Feature importance analysis based on variable importance
measures (VIM) is further conducted to determine the differences in the
importance level of Airbnb rental price determinants between countries.

This thesis tackles the main research question:

What are the differences in Airbnb rental price determinants between
Germany, the United States, Australia, and China?

The following sub-questions aim to answer the main research question:

RQ1 Which of the selected Machine Learning algorithms is best for predicting
Airbnb rental price?
RQ2 For both the baseline and best performing models, what are the differences
in feature importance and rank-ordering of the strongest Airbnb rental price
determinants between Germany, the United States, Australia, and China?
RQ3 What are the relative contributions of the country, and the city within the
country, to the importance of Airbnb rental price determinants in Germany,
the United States, Australia, and China?

The remainder of this thesis is organized as follows: Section 3 discusses
previous studies on Airbnb rental price determinants, and Section 4 explains
the data and methods applied in this research. In Section 5, the results
of the models and the feature importance analysis are presented. Finally,
discussions and conclusions are stated in Section 6 and Section 7.

3 related work

In this section, the comparison between the hotel industry and Airbnb is
addressed first because many existing works are inspired by papers on the
hotel industry. Then previous studies related to Airbnb price determinants
are summarized in detail. Lastly, studies regarding
the algorithms and feature importance analysis are described.

3.1 Hotel and Airbnb

Airbnb, as the largest online accommodation service, brings economic
benefits to both travellers and local people who rent out the listings
(Oskam & Boswijk, 2015). Economic benefit refers to “better value for
money” (Pérez Sánchez et al., 2018). Airbnb has received considerable
attention from scholars (Oskam & Boswijk, 2015). Several past studies focus
on the relation between Airbnb and the hotel industry. For instance,
Zervas, Proserpio, and Byers (2017) conclude that the entry of Airbnb had
a negative effect on local hotel revenue in the Texas market. However,
Yang, Nieto García, Viglia, and Nicolau (2021) indicate that hotel revenue
is not significantly influenced by Airbnb despite its harm to hotel
occupancy, potentially because the substitution effect is more likely
to occur for hotels with low-priced rooms.

From the perspective of customers, Airbnb is just an accommodation type
that, like hotels, meets basic accommodation needs (Chen &
Xie, 2017). Customers value accommodations based on the amenities or
functionalities that the accommodation can offer (Chen & Xie, 2017). From
the perspective of hosts, price setting depends largely on the services or
amenities provided in their Airbnb listings (Wang & Nicolau, 2017). Thus,
it is no surprise that Airbnb is widely recognized as a viable alternative
to traditional hotels, with competition existing between them (Yang et
al., 2021). There are three main drivers for consumers to choose Airbnb
instead of hotels: lower cost, responsibility for society and the environment,
and social interactions (Tussyadiah, 2016). "Lower cost" specifically implies
the vital role that price plays in the hospitality industry. Guttentag,
Smith, Potwarka, and Havitz (2018) support the importance of price in
their study by forming a segment called “Money Saver”, whose members are
attracted to Airbnb by its relatively low price. Therefore, it is of great
value to know the determinants that influence Airbnb rental price.

3.2 Airbnb Price Determinants

Early analyses of price determinants in the hospitality industry are mainly
for hotels (Zhang et al., 2017). In these analyses, the hedonic pricing method
is widely used to measure the effect of certain key factors and estimate
hotel room pricing (Soler, Gemar, Correia, & Serra, 2019). In
hedonic research, it is generally agreed that the factors with the
most significant influence on hotel pricing are hotel location and category
(Abrate, Capriello, & Fraquelli, 2011). Additionally, other studies have
identified many determinants of hotel price, including brand name,
hotel age, star rating, number of rooms, hotel amenities and services, as well
as customer reviews (Zhang et al., 2017). Inspired by the hotel industry,
Gibbs et al. (2018) apply the hedonic pricing method to Airbnb
to determine the effects on pricing. Their study is based on 15,716 Airbnb
listings in Canada and finds that the number of bathrooms and
beds has a positive influence on Airbnb price. Nevertheless, the study by
Dudás et al. (2020), conducted for Airbnb listings in Hungary, shows
that the influence of room amenities is negative. This finding indicates
that the inner characteristics of Airbnb listings in the Lake Balaton Tourism
Region are different from those of listings in big cities, so that hosts
consider room amenities differently when setting prices (Dudás et al., 2020).

Chen and Xie (2017) also apply hedonic pricing analysis to Airbnb
and conclude that the room type has the most influence on the
Airbnb price. They specifically mention that the entire home or apartment
room type has the most positive effect, followed by the private room type
and the shared room type. Other functionalities of the Airbnb listing,
including property type, room amenities, and location, are also crucial to
the listing price (Chen & Xie, 2017).

Several studies focus on the host attributes of Airbnb listings. Ert, Fleischer,
and Magen (2016) study the role of hosts' personal
photos on Airbnb based on the available listings in Stockholm.
They conclude that hosts' profile pictures are crucial because they signal
reliability and trustworthiness to customers. Kakar, Franco, Voelz, and
Wu (2016) analyze some host features, including gender and race,
based on 2,772 listings with 2,161 unique hosts in San Francisco. They
indicate that the Airbnb rental price differs by race. For instance,
the prices of Airbnb listings from Asian hosts are 9.3% lower than
those of listings from white hosts (Kakar et al., 2016). Moreover, Chen and
Xie (2017) show other host attributes that are statistically significant, such
as the speed of hosts' responses to requests and the number of available
verification methods.

Numerous studies of Airbnb rental price prediction divide the various
factors into sub-categories. In a study based
on listings from Hong Kong, Cai et al. (2019) divide the main determinants
of Airbnb listing price into five groups: room attributes, host attributes,
reputation, location, and rental policies. These five groups consist of 20
explanatory sub-categories, such as the presence of a doorman and the
distance to tourist attractions. According to the Ordinary Least Squares
(OLS) regression, the majority of the factors are significantly related to
the Airbnb price in Hong Kong (Cai et al., 2019). The results confirm that
several factors have effects similar to those in previous studies. However,
three conclusions differ: first, the room type has an extraordinarily high
effect; second, the influence of host listing counts is negative; third, the
distance to the city center or shopping center does not have a significant
linear effect on the Airbnb rental price (Cai et al., 2019). Regarding the
second difference, Wang and Nicolau (2017) identify a strong positive influence
of host listing counts on Airbnb price in their analysis, which is based on
Airbnb listings from 33 western cities.

Similar to the analysis of Cai et al. (2019), Chang and Li (2020) identify
Airbnb rental price determinants from five categories based on a dataset
with 65,130 observations drawn from Airbnb.com. They employ two OLS
models to estimate key factors of Airbnb price and explore interaction
effects between features. The top five price determinants from the five
categories are the room type, amenities, pictures of the listing, distance to
sightseeing spots, and the city where the listing is located (Chang & Li, 2020).
Besides, Zhang et al. (2017) implement the general linear model (GLM)
as a baseline and the geographically weighted regression (GWR) model
for comparison in their study. They conclude that there are significant
connections between Airbnb rental price and the distance to the conference
center, the number of reviews, and the review scores (Zhang et al., 2017).
Additionally, the GWR model outperforms the GLM based on R-squared score
(Zhang et al., 2017).

3.3 Location

At the early stage of Airbnb price prediction, the effects of location on
price were unclear (Wang & Nicolau, 2017). Inspired by previous studies
in the hotel industry, Wang and Nicolau (2017) determine that location,
represented by the variable “distance”, has a dramatic negative effect on
the Airbnb rental price. Their conclusion indicates that
the price is lower when the distance between the city center and the Airbnb
listing is larger. Furthermore, latitude and longitude are also related to
location. Masrom, Baharun, Razi, Abdul Rahman, and Abd Rahman
(2022) show a negative correlation of latitude and a slightly positive
correlation of longitude with Airbnb price in their study based on Singapore
Airbnb data. However, Cai et al. (2019) perform a quantile regression
to focus on the estimation of locational factors, which
suggests a different result. They underline that locational factors do not
have statistically significant impacts on high-priced Airbnb listings,
while low-priced listings are more sensitive, especially to the distance to
the shopping center (Cai et al., 2019). Other research has included the
distance between the Airbnb listing and tourist attractions, hotels, and
key central buildings (Pérez Sánchez et al., 2018).

In the opinion of Wang and Nicolau (2017), describing the determinants of
Airbnb price in different cities around the world better reflects
the market environment. A similar idea is implemented by
Pérez Sánchez et al. (2018), who extend the factors related to location
from the listing itself and its neighbourhood to the city where the listing
is located. They suggest that the effects of location on Airbnb price can
be both positive and negative, depending on the city. For instance, the
price of Airbnb listings located in Valencia is 18% higher than that of the
other listings, especially listings with the same attributes but located in
Alicante (Pérez Sánchez et al., 2018). Chang and Li (2020) support this view
in their analysis by showing that the city is the second most important Airbnb
rental price determinant because living costs differ across cities.

3.4 Algorithm and Feature Analysis

Much of the existing literature applies OLS models as a baseline to explore the
structure of the data and to compare with results from improved models
(e.g. Masrom et al., 2022; Pérez Sánchez et al., 2018; Zhang et al., 2017).
To compare with the baseline, both Kalehbasti, Nikolenko, and Rezaei
(2019) and Luo, Zhou, and Zhou (2019) implement various models, such
as K-means Clustering with Ridge Regression, Gradient Boosting Tree
Ensemble, and Support Vector Regression. The results show that the SVR
model achieves the highest R squared score among all the algorithms
(Kalehbasti et al., 2019). Masrom et al. (2022) apply Lasso, Ridge, Decision
Tree, and Random Forest. Their results indicate that Decision Tree and
Random Forest are the two superior algorithms, with R squared scores
very close to 1 and RMSE below 0.040. Thus, the Linear Regression model
is used as a baseline in this study, while multiple models are applied to
select the model that outperforms the others.

Variable importance measures (VIMs) can be defined as measures that
quantify the contribution of the input variables to the model output
(Wei, Lu, & Song, 2015). In the last few decades, a large number of researchers
have agreed that computing VIMs is necessary, especially considering the
dramatically increased data volume (Wei et al., 2015). To identify the most
important Airbnb rental price determinants, feature importance analysis
based on VIMs is suitable for this study.

Inspired by the novel research of Pérez Sánchez et al. (2018), this study
contributes to previous literature by extending the concept of location
further to the country where the listings are located.

4 method

As this study combines Airbnb price prediction and feature importance
analysis for multiple available datasets, the methods applied in this study
are presented in detail. This section presents the general research
process, a detailed description of the data,
and the different algorithms applied. In the last part of this section, the
evaluation criteria are described.

4.1 Pipeline

As shown in Figure 1, eight separate datasets, one for each city, are generated
to start the research process. Pre-processing, such as standardization and
dummy coding, is performed individually for each dataset. Afterwards, the
cleaned datasets are combined to form four datasets for the four countries.
During training, two steps are carried out on the 12 datasets (eight city
datasets and four country datasets). First,
the Linear Regression algorithm is applied to obtain the baseline. Second,
four models selected based on the literature are applied to the training set
of each dataset to find the model that outperforms the baseline and the other
models. Hyperparameter tuning is then performed for the selected model
based on the validation set. Then predictions on the test set are made for all
models to see the performance of each model based on R squared score.
MSE is calculated for each dataset on the test set for error analysis. In the
stage of feature importance analysis, Recursive Feature Elimination (RFE)
is used for the baseline model. RFE recursively
considers smaller and smaller sets of features to select the desired number
of the most important features (Demir & Sahin, 2021). For the Linear
Regression model, the rank-ordering of the strongest features is identified
based on coefficients. For the selected model, the method of obtaining the
rank-ordering of the strongest features depends on the model itself. All the
processes are conducted using Python packages.

Figure 1: Research process pipeline

[Flowchart with four stages: data pre-processing, train, predict, and feature
importance analysis. The original Airbnb dataset is split into eight city
datasets grouped by country: Germany (Berlin, Munich), US (Los Angeles,
New York), Australia (Sydney, Melbourne), and China (Beijing, Hong Kong).
Baseline branch: Linear Regression, prediction, Recursive Feature Elimination,
coefficients. Second branch: XGBoost, SVR, Gradient Boosting, and Extra Trees
lead to the selected model, followed by hyperparameter tuning, prediction, and
feature importance based on the selected model.]

4.2 Data

4.2.1 Data Source


In this thesis, Germany, the United States, Australia, and China are chosen
as the study focus. The four countries are located on different continents,
which can maximize the potential distinction in the ranking of Airbnb price
determinants between countries. For each country, two cities are chosen (see
Figure 1). The Airbnb datasets, scraped in December 2021, are retrieved
from InsideAirbnb.com as CSV files. InsideAirbnb.com
provides publicly available information on Airbnb listings located in
different cities around the world (I. Airbnb, 2021). The eight datasets contain
information about all the Airbnb listings for Berlin, Munich, Los Angeles,
New York, Sydney, Melbourne, Beijing, and Hong Kong. The cleaning
and pre-processing of the initial data are inspired by the code from Carrillo
(2019) on GitHub. Some features are dropped because of duplicate or

Table 1: Overview of datasets (number of samples)

Country     City          Initial   After processing
Germany     Berlin        17290     13657
            Munich         4995     12000
US          Los Angeles   33329     24892
            New York      38277     28017
Australia   Sydney        20880     14397
            Melbourne     17830     13190
China       Beijing        5159      2989
            Hong Kong      5944      2836

irrelevant information, while some samples are dropped because of missing
information. When processing the Munich dataset, sampling with
replacement is applied to balance the two cities in the final combined dataset
of Germany.
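
The resampling step described above can be sketched as follows. This is a minimal illustration, not the thesis code: the helper name `upsample_with_replacement`, the toy columns, and the random seed are assumptions, and the real Munich dataset would be resampled to the target size reported in Table 1.

```python
import pandas as pd


def upsample_with_replacement(df: pd.DataFrame, n_samples: int, seed: int = 0) -> pd.DataFrame:
    """Draw n_samples rows from df with replacement (rows may repeat)."""
    return df.sample(n=n_samples, replace=True, random_state=seed).reset_index(drop=True)


# Toy stand-in for a cleaned city dataset (hypothetical columns and values).
munich = pd.DataFrame({"price": [80, 120, 95], "accommodates": [2, 4, 3]})

# Grow the smaller city so it carries similar weight in the combined country dataset.
munich_balanced = upsample_with_replacement(munich, n_samples=10)
```

Because rows are drawn with replacement, the resampled dataset can be larger than the original one, which is how a city with fewer listings can be brought to a size comparable to the other city.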

Multicollinearity analysis is also applied to the features after dropping
columns that contain duplicate information or have no values.
The correlation heat map of Munich is shown as an example in Appendix
A. Based on the map, features that are highly correlated with others
are dropped, such as the number of bedrooms, which is positively
correlated with the number of guests the listing accommodates. The number of
samples for each dataset is indicated in Table 1 and the description of
features after processing is presented in Appendix B.
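
A correlation-based filter of this kind might look like the sketch below. The threshold of 0.8, the helper name `drop_highly_correlated`, and the toy values are assumptions for illustration; the thesis selects features by inspecting the heat map and does not state a numeric cut-off.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.8):
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair of features is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy listings: bedrooms closely tracks accommodates, minimum_nights does not.
listings = pd.DataFrame({
    "accommodates": [2, 4, 6, 8, 3],
    "bedrooms": [1, 2, 3, 4, 1],
    "minimum_nights": [3, 1, 2, 1, 5],
})
reduced, dropped = drop_highly_correlated(listings, threshold=0.8)
```

In this toy example the bedroom count is removed and the guest capacity is kept, mirroring the example given in the text.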

4.2.2 Numerical features


The date on which a person first joined Airbnb as a host is provided in the
dataset and is converted into a number of days so that it can be analyzed
as a numerical variable. The review scores in the initial datasets are
separated into seven sub-categories: rating, accuracy, cleanliness, check-in,
communication, location, and value. As this study focuses on identifying
the most important features and guests may focus on different review scores,
these sub-categories are not combined into one single category. Furthermore,
replacing missing values in review scores would strongly influence the
final distribution due to the large number of missing values and the negatively
skewed distribution. Thus, samples with missing values are removed from the
datasets, which is the main cause of the sample loss after processing.
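
As an illustration of the date conversion, the sketch below turns a join date into a day count relative to a reference date. The reference date of 1 December 2021 and the new column name `host_days_active` are assumptions; the thesis only states that the data were scraped in December 2021 and does not say which reference date was used.

```python
import pandas as pd

# Toy data; the last listing has no join date.
listings = pd.DataFrame({"host_since": ["2015-06-01", "2020-12-31", None]})

# Assumed reference date (hypothetical): the day the data were scraped.
reference_date = pd.Timestamp("2021-12-01")

host_since = pd.to_datetime(listings["host_since"])
# Number of days between joining Airbnb as a host and the reference date.
listings["host_days_active"] = (reference_date - host_since).dt.days
```

Listings without a join date keep a missing value here and would have to be handled separately.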

The distance to the city center is calculated from longitudes and latitudes
with the Haversine formula. The Haversine formula is specifically used for
calculating the great-circle distance between two locations on the earth
(Wikipedia contributors, 2022b). The distance d is explicitly expressed as:

$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cdot \cos\varphi_2 \cdot \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right) \qquad (1)$$

where $r$ denotes the radius of the earth, $\varphi_1$ and $\varphi_2$ denote
the latitudes and $\lambda_1$ and $\lambda_2$ the longitudes of the two
locations, and arcsin is the inverse sine function (Wikipedia contributors,
2022b). The average radius of the earth is 6371.0 kilometers (Wikipedia
contributors, 2022a).
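
Equation (1) translates directly into a few lines of Python. The function name `haversine_km` is hypothetical, and the coordinates in the usage line are approximate, illustrative values rather than the city-center coordinates used in the thesis.

```python
import math

EARTH_RADIUS_KM = 6371.0  # average radius of the earth, as stated in the text


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float,
                 radius: float = EARTH_RADIUS_KM) -> float:
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    # Term under the square root in Equation (1).
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


# Illustrative use: a hypothetical listing and an approximate Berlin city center.
distance_to_center = haversine_km(52.50, 13.45, 52.52, 13.405)
```

A quick sanity check is that two points a quarter of the way around the equator are separated by a quarter of the circumference, that is, $\pi r / 2$.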

Other numerical features are highly skewed in all datasets. For instance,
host acceptance rate is negatively skewed and the number of guests
accommodated is positively skewed. Thus, log transformation is applied to
bring the distributions closer to normal. The comparison before and after log
transformation of the numerical features in the Germany dataset is presented
as an example in Appendix C.
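
The effect of a log transformation on a positively skewed feature can be checked with a small example. The sketch assumes `log1p`, that is, log(1 + x), which is one common choice that also handles zeros; the thesis does not state which variant of the log transformation was used, and the toy values are invented.

```python
import numpy as np
import pandas as pd

# Toy guest capacities with one very large listing (positively skewed).
accommodates = pd.Series([1, 2, 2, 2, 3, 4, 16], name="accommodates")

# Assumed variant: log(1 + x).
log_accommodates = np.log1p(accommodates)

# Sample skewness before and after the transformation.
skew_before = accommodates.skew()
skew_after = log_accommodates.skew()
```

For this toy series the skewness stays positive but becomes smaller after the transformation.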

4.2.3 Categorical features


The first stage of this study is to predict the Airbnb listing price, which
calls for regression models rather than classification models.
Therefore, categorical features are transformed by dummy coding. The
amenities are the facilities provided in the Airbnb listings. During processing,
amenities with the same meaning but different descriptions are combined into
broader categories, such as coffee machine and child friendly. Additionally,
amenities present in less than 10% of the total listings are dropped
to keep the most frequent amenities for predicting Airbnb price. The same
process is applied to the feature "host verification", which refers to the
methods hosts use to verify their identities on Airbnb. Only the most
frequent methods are selected and dummy encoded.
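
The dummy coding and the 10% frequency filter can be sketched as below. The helper name `frequent_amenity_dummies` and the toy data are assumptions, and the sketch presumes that the amenities have already been parsed into Python lists and merged into broader categories.

```python
import pandas as pd


def frequent_amenity_dummies(amenities: pd.Series, min_share: float = 0.10) -> pd.DataFrame:
    """One 0/1 column per amenity, keeping amenities present in at least min_share of listings."""
    exploded = amenities.explode()  # one row per (listing, amenity) pair
    dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()
    share = dummies.mean()  # fraction of listings offering each amenity
    return dummies.loc[:, share >= min_share]


# Toy data: 12 listings, only one of which has a pool (about 8% of listings).
amenity_lists = [["Wifi", "Pool"]] + [["Wifi"]] * 5 + [["Wifi", "Kitchen"]] * 6
room_types = ["Entire home/apt", "Private room", "Shared room"] * 4
listings = pd.DataFrame({"room_type": room_types, "amenities": amenity_lists})

room_dummies = pd.get_dummies(listings["room_type"], prefix="room_type", dtype=int)
amenity_dummies = frequent_amenity_dummies(listings["amenities"], min_share=0.10)
features = pd.concat([room_dummies, amenity_dummies], axis=1)
```

The pool column falls below the 10% threshold and is dropped, while the more frequent amenities are kept as binary features next to the room type dummies.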

4.3 Models

4.3.1 Linear Regression Model


The Ordinary Least Squares Linear Regression (OLS) fits the linear model
with independent variables to estimate the dependent variable. In this
study, the independent variables are the possible Airbnb price determinants
and the dependent variable is the price. The Linear Regression model can
be expressed as:

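$$y_i = \beta_0 + \sum_{k=1}^{p} \beta_k X_{ik} + \varepsilon_i \qquad (2)$$

where $y_i$ is the $i$th observation of the dependent variable,
$\beta = (\beta_0, \beta_1, \beta_2, \ldots, \beta_p)$ are the coefficients
estimated by the linear model, $X_{ik}$ is the value of the $k$th independent
variable for the $i$th observation, and $\varepsilon_i$ denotes the random
error (Zhang et al., 2017).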

The processed data is split into 80% for the training set, used to fit the
Linear Regression model, and 20% for the test set, used to check
generalization on new data. Both R squared score and MSE are calculated on the
test set. Afterwards, recursive feature elimination (RFE) is applied. For
the Linear Regression model, RFE is used to identify the ranking of each
Airbnb price determinant based on the coefficient of each feature. Both the
Linear Regression algorithm and RFE are applied using the scikit-learn
package in Python.
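
A baseline of this kind can be sketched with scikit-learn as follows. The synthetic data, the random seeds, and the choice of `n_features_to_select=1` (which makes RFE return a complete ranking) are assumptions for illustration and are not taken from the thesis.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed listings: the target depends strongly on
# feature 0, moderately on feature 1, weakly on feature 2, and not on 3 and 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=500)

# 80% training set, 20% test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

baseline = LinearRegression().fit(X_train, y_train)
y_pred = baseline.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

# Eliminating one feature per step down to a single feature yields a full
# ranking, where rank 1 marks the most important feature.
selector = RFE(LinearRegression(), n_features_to_select=1, step=1).fit(X_train, y_train)
ranking = selector.ranking_
```

Because RFE ranks features by the magnitude of the fitted coefficients, the ranking is only meaningful when the features are on comparable scales, which is one reason for standardizing them during pre-processing.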

4.3.2 Other Models


The other Machine Learning algorithms used for predicting Airbnb price
and their application in existing literature are listed in Table 2. All the
algorithms are applied using the scikit-learn package and the XGBoost
package in Python.

Table 2: Algorithms used for prediction

Algorithm                                     Literature
XGBoost / Gradient Boosting                   Cai et al. (2019); Carrillo (2019)
Support Vector Regression                     Kalehbasti et al. (2019)
Extra Trees (Decision Trees, Random Forest)   Masrom et al. (2022); Luo et al. (2019)

Gradient Boosting is a boosting regression algorithm that iteratively
minimizes a loss function to find an approximation of the objective function
(Bentéjac, Csörgő, & Martínez-Muñoz, 2019). XGBoost is a decision tree
ensemble Machine Learning algorithm
based on gradient boosting, and it is highly scalable (Bentéjac et al., 2019).
Like Gradient Boosting, it minimizes the loss function by building
trees sequentially so that each subsequent tree learns from the previous trees.

Support Vector Regression (SVR) is a supervised Machine Learning
algorithm used to predict continuous values. In this study, the Airbnb
rental price is recorded as an integer but treated as a continuous target,
so SVR is applicable. Furthermore, SVR has excellent generalization capacity
because it minimizes the generalization error bound rather than the observed
training error (Basak, Pal, & Patranabis, 2007).

The Extra Trees algorithm is very similar to Random Forest:
both are composed of many decision trees, and the final prediction is
based on every tree. Unlike Random Forest, Extra Trees uses
the whole original sample instead of bootstrap replicas, and it chooses
cut points randomly rather than choosing the optimum split (Geurts,
Ernst, & Wehenkel, 2006). Thus, both bias and variance are reduced. The
computational process is also faster for the Extra Trees algorithm.
Decision Trees and Random Forest have been used in previous studies
of Airbnb rental price determinants, such as the study by
Masrom et al. (2022). This suggests that the Extra Trees algorithm is
applicable in this study because it belongs to the same family of decision
tree ensembles and adds randomization to the tree-building process.

Table 3: Hyperparameter tuning for Extra Trees

Germany, Australia, China
Hyperparameter      Values tested               Best setting
n_estimators        100, 150, 200, 250, 300     250
min_samples_leaf    1, 20, 40, 60, 80, 100      1

US
Hyperparameter      Values tested               Best setting
n_estimators        100, 150, 200, 250, 300     300
min_samples_leaf    1, 20, 40, 60, 80, 100      1

When applying the different algorithms, 64% of the data is used as the
training set, 16% as the validation set, and 20% as the test set. The training
set is used for choosing the model that outperforms the baseline and the
other models on Airbnb rental price prediction based on R squared score,
while the validation set is used for hyperparameter tuning of the selected
model. The details of hyperparameter tuning can be found in Table 3. The
test set is used for checking generalization. MSE is calculated on the test
set. Furthermore, the importance level of each feature is identified based
on the selected model. In this study, the attribute "feature_importances_"
of the Extra Trees model in Python is used to find the most important
features. The "feature_importances_" attribute is known as the Gini
importance, which means the importance level of a feature is calculated as
the (normalized) total reduction of the criterion brought by that feature.
A higher "feature_importances_" score indicates a more important feature.
This attribute serves a similar purpose to RFE, which determines feature
importance by removing features and identifying their contribution to
the prediction. Therefore, RFE is not applied for the Extra Trees model.
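
The split, the tuning on the validation set, and the importance ranking can be sketched together as below. This is only an illustration under several assumptions: the data are synthetic, the grid is a reduced subset of the values in Table 3, and the manual loop stands in for whatever tuning procedure was actually used, which the thesis does not specify.

```python
from itertools import product

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one processed dataset; feature 0 dominates the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=400)

# Hold out 20% for testing, then 20% of the remaining 80% for validation,
# which gives the 64% / 16% / 20% split.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.2, random_state=0)

# Reduced grid for illustration (Table 3 lists the full set of tested values).
grid = {"n_estimators": [100, 150], "min_samples_leaf": [1, 20]}
best_score, best_params = -np.inf, None
for n_estimators, min_samples_leaf in product(grid["n_estimators"], grid["min_samples_leaf"]):
    model = ExtraTreesRegressor(n_estimators=n_estimators,
                                min_samples_leaf=min_samples_leaf,
                                random_state=0).fit(X_train, y_train)
    score = r2_score(y_val, model.predict(X_val))  # R squared on the validation set
    if score > best_score:
        best_score = score
        best_params = {"n_estimators": n_estimators, "min_samples_leaf": min_samples_leaf}

final_model = ExtraTreesRegressor(**best_params, random_state=0).fit(X_train, y_train)
test_r2 = r2_score(y_test, final_model.predict(X_test))

# Impurity-based importances sum to 1; sort from most to least important.
importances = final_model.feature_importances_
order = np.argsort(importances)[::-1]
```

The resulting `order` array lists feature indices from most to least important, which corresponds to the rank-ordering of price determinants discussed in the text.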

4.4 Evaluation Criteria

As suggested by Zhang et al. (2017) and Luo et al. (2019), the performance
of the previously mentioned models is compared based on the R squared
score and evaluated by MSE. The R squared score gives the proportion of
variance explained by the model, which means it shows how much of the
variation in the data is explained by all features fed into the model.
A higher R squared score indicates a better fit of the model to the data.
As this study aims to identify the importance level of different features,
it is important to first let the features explain as much of the variation
in the data as possible. If the model only explains a tiny fraction of the
variation, the feature importance ranking will not be reliable. Therefore,
the R squared score is a suitable evaluation criterion. In model
selection, the R squared scores obtained from different models are also
used for comparison. MSE measures the quality of fit and is suitable for
regression problems. The lower the MSE, the more accurate the model and
the closer its predictions are to the actual data. MSE is also widely
applied in existing literature (e.g. Kalehbasti et al., 2019; Luo et al.,
2019; Zhang et al., 2017), so it can be used as a reference point for
model evaluation. Both the R squared score and MSE on the test set are
calculated as a further indication of the performance of the different
models.
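
As a minimal sketch of the two criteria, the following standard-library
Python computes MSE and the R squared score from their usual definitions.
The target and prediction values are hypothetical and are not taken from
the datasets of this study.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Proportion of variance explained: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical target values and predictions, for illustration only.
y_true = [4.0, 4.5, 5.0, 5.5, 6.0]
y_pred = [4.2, 4.4, 5.1, 5.3, 6.1]

print(round(mean_squared_error(y_true, y_pred), 3))
print(round(r_squared(y_true, y_pred), 3))

# A model that always predicts the mean of the targets explains no variance.
mean_only = [sum(y_true) / len(y_true)] * len(y_true)
print(r_squared(y_true, mean_only))
```

The last line shows why an R squared score near zero signals a weak model:
such a model does no better than predicting the average price for every
listing.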

5 results

In this section, the Airbnb price prediction results of all algorithms are
presented and compared. As feature importance analysis is the main focus
of this study, the results from the Linear Regression model and the Extra
Trees model are addressed in detail, including the comparison between
countries and the comparison between each country and its own cities.

5.1 Model comparison on Airbnb Price Prediction

As shown in Table 4, the Linear Regression baseline achieves R squared
scores above 0.5 for all countries on the training set, with a minimum of
0.499 on the test set. The R squared scores are slightly higher on the
test set than on the training set for Australia and China, which is
attributed to the inherent variability in the data. The MSE of the Linear
Regression model is above 0.200 for all countries on the test set. All
other models outperform the baseline except the SVR model, whose R squared
scores are close to zero and whose MSE is higher than that of the
baseline. The model that outperforms the others on predicting the Airbnb
rental price is Extra Trees. It has an R squared score close to 1 for all
countries on the training set and a higher R squared score than the other
models on the test set. The test-set MSE of the Extra Trees model is also
the lowest among all models for every country (for Australia it equals
that of XGB Regression, 0.188). In particular, the Germany dataset has the
highest R squared score and the lowest MSE among all countries, because of
the resampling applied to the Munich dataset (R2 = 0.965, MSE = 0.014).
Based on these results, the Extra Trees algorithm is selected as the model
for feature importance analysis. The details of model performance for the
cities within each country are shown in Appendix D.

Table 4: Model performance on predicting Airbnb price

Model                          Country     R2 (train)   R2 (test)   MSE (test)
Linear Regression              Germany     0.515        0.499       0.232
Linear Regression              US          0.596        0.582       0.250
Linear Regression              Australia   0.650        0.659       0.205
Linear Regression              China       0.625        0.639       0.374
XGB Regression                 Germany     0.626        0.610       0.185
XGB Regression                 US          0.696        0.672       0.196
XGB Regression                 Australia   0.686        0.687       0.188
XGB Regression                 China       0.731        0.706       0.305
Support Vector Regression      Germany     0.009        0.004       0.461
Support Vector Regression      US          0.004        0.003       0.596
Support Vector Regression      Australia   0.004        0.003       0.599
Support Vector Regression      China       0.036        0.046       0.999
Gradient Boosting Regression   Germany     0.628        0.604       0.183
Gradient Boosting Regression   US          0.697        0.670       0.197
Gradient Boosting Regression   Australia   0.687        0.686       0.189
Gradient Boosting Regression   China       0.740        0.699       0.312
Extra Trees                    Germany     1.0          0.800       0.093
Extra Trees                    US          1.0          0.715       0.171
Extra Trees                    Australia   0.999        0.692       0.188
Extra Trees                    China       0.999        0.731       0.280

Table 5: Feature importance of countries (Linear Regression)

Features                 Germany   US      Australia   China
Accommodates             0.252     0.293   0.321       0.394
Host response time       0.031     -       0.038       0.102
Room type                0.278     0.390   0.346       0.296
Neighbourhood            0.123     0.101   0.155       0.310
Review scores            0.106     -       0.078       -
Hot tub sauna or pool    -         0.155   -           -
Gym                      -         0.073   -           -
Host verification        -         -       -           0.366

Note: "-" marks a feature that is not among the top five for that country.

5.2 Feature importance Analysis

5.2.1 Linear Regression Model


Based on RFE, the important features affecting the Airbnb rental price in
each country are identified. Because the host response time, room type,
neighbourhood, and review scores include sub-categories, the average of
the absolute coefficient values per feature is calculated to show their
overall influence on Airbnb price prediction. RFE and the coefficient
calculation are also applied to cities, which are then compared with the
country in which they are located. The detailed coefficients of features
and their sub-categories for each country and city are presented in
Appendix E.
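
The averaging step can be illustrated with the room type coefficients
reported in Table 6. The sketch below is an assumption about how the
per-feature value is obtained (a plain mean of absolute sub-category
coefficients); the dictionary layout and function name are hypothetical.

```python
# Room type coefficients per sub-category, copied from Table 6.
ROOM_TYPE_COEFFICIENTS = {
    "Germany": {"Entire home/apt": 0.192, "Hotel room": 0.364,
                "Private room": -0.139, "Shared room": -0.417},
    "US": {"Entire home/apt": 0.218, "Hotel room": 0.562,
           "Private room": -0.173, "Shared room": -0.607},
    "Australia": {"Entire home/apt": 0.251, "Hotel room": 0.440,
                  "Private room": -0.187, "Shared room": -0.504},
    "China": {"Entire home/apt": 0.453, "Hotel room": 0.047,
              "Private room": 0.091, "Shared room": -0.592},
}


def mean_absolute_coefficient(coefficients):
    """Average the absolute values of the sub-category coefficients."""
    values = list(coefficients.values())
    return sum(abs(value) for value in values) / len(values)


for country, coefficients in ROOM_TYPE_COEFFICIENTS.items():
    print(f"{country}: {mean_absolute_coefficient(coefficients):.4f}")
```

Up to rounding, the resulting values agree with the room type row of
Table 5 (0.278, 0.390, 0.346, and 0.296), which suggests that a plain mean
of absolute coefficients is a reasonable reading of the procedure for this
feature.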

As shown in Table 5, the capacity of accommodates, room type, and
neighbourhood (where the listing is located) are among the top five
features in all countries. In particular, the capacity of accommodates and
room type are the top two Airbnb rental price determinants in Germany, the
United States, and Australia. Within the sub-categories of the room type
(see Table 6), the entire home/apartment (b > 0.190) and the hotel room
type (b > 0.045) have a positive influence on the Airbnb rental price. The
private room type (b < -0.135) and the shared room type (b < -0.415) have
a negative influence on the Airbnb price, except for the private room type
in China, which shows a slightly positive influence (b = 0.091).

The other two of the top five Airbnb rental price determinants differ
between countries. The host response time is one of the top five features
in Germany, Australia, and China. It is also the sixth most important
Airbnb price determinant in the United States (b = 0.023). Specifically, a
response time of a few days or more has the largest coefficient among the
response-time categories in every country (see Table 6). The review scores
only show an important influence in Germany and Australia. As indicated in
Table 5, the host verification method, rather than the review scores,
affects the Airbnb rental price in China (b = 0.366). Instead of the
review scores, the hot tub sauna or pool (b = 0.155) and the gym
(b = 0.073) amenities affect the Airbnb price in the United States.

Table 6: Feature importance of common features (Linear Regression)

Features        Sub-categories        Germany   US       Australia   China
Room type       Entire home/apt       0.192     0.218    0.251       0.453
Room type       Hotel room            0.364     0.562    0.440       0.047
Room type       Private room          -0.139    -0.173   -0.187      0.091
Room type       Shared room           -0.417    -0.607   -0.504      -0.592
Host response   Within an hour        -0.0008   0.022    0.005       -0.139
Host response   Within a few hours    0.018     -0.002   -0.027      0.019
Host response   Within a day          0.014     -0.011   0.040       -0.116
Host response   A few days or more    0.045     0.036    0.060       0.130

When comparing the importance of Airbnb price determinants between cities
and the country in which they are located, only Australia and both of its
cities, Sydney and Melbourne, share the same top five features. The top
five Airbnb rental price determinants are also the same in Germany and its
city Berlin. However, as shown in Table 7, the cooking basic amenity
rather than the host response time shows a bigger influence in Munich
(b = -0.123).

Moreover, the review scores appear among the top five Airbnb price
determinants in Los Angeles and New York, but their importance does not
show at the country level (see Table 7). The hot tub sauna or pool and the
gym are important features in the United States, driven by Los Angeles and
New York respectively. For China, according to Table 7, four Airbnb rental
price determinants are the same for the country and its cities: capacity
of accommodates, host response time, room type, and neighbourhood.
However, whether the host has a profile picture only appears to be
important in Beijing (b = 0.313), while host verification shows its
importance only in China and its city Hong Kong.

Table 7: Feature importance of countries and cities (Linear Regression)

Features Germany Berlin Munich
Accommodates 0.252 0.216 0.272
Room type 0.278 0.389 0.175
Review scores 0.106 0.094 0.078
Neighbourhood 0.123 0.066 0.060
Host response time 0.031 0.062
Cooking basic -0.123
US Los Angeles New York
Accommodates 0.293 0.300 0.256
Room type 0.390 0.384 0.308
Neighbourhood 0.101 0.066 0.097
Hot tub sauna or pool 0.155 0.199
Gym 0.073 0.173
Review scores 0.125 0.078
Host response time 0.062
Bathrooms 0.183
Outdoor space 0.096
China Beijing Hong Kong
Accommodates 0.394 0.430 0.306
Host response time 0.102 0.076 0.105
Room type 0.296 0.266 0.347
Neighbourhood 0.310 0.254 0.240
Host verification 0.366 0.555
Host has profile picture 0.313

5.2.2 Extra Trees Model

The feature importance results provided by the Extra Trees model are
presented in Table 8, which shows the top five features for each country.
The attribute "feature_importances_" of the Extra Trees implementation in
Python indicates the importance level of each feature by assigning it a
weight between 0 and 1. A higher weight means the feature is more
important, with 1 as the most important. Additionally, each sub-category
of a feature is treated separately in the results and has its own weight.
Due to the large number of sub-categories within the neighbourhood and
host response time, their maximum weight is used in the results for
comparison.
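
The aggregation by maximum weight can be sketched as follows. The column
names and weights are hypothetical, and the assumption that sub-categories
appear as columns named with a shared prefix is ours; the thesis does not
describe the encoding.

```python
def aggregate_by_max(importances, prefixes):
    """Collapse columns that share a prefix into one entry holding the maximum weight.

    Columns that match no prefix are kept under their own name.
    """
    aggregated = {}
    for column, weight in importances.items():
        group = next((p for p in prefixes if column.startswith(p + "_")), column)
        aggregated[group] = max(weight, aggregated.get(group, 0.0))
    return aggregated


# Hypothetical per-column weights, for illustration only.
importances = {
    "accommodates": 0.150,
    "room_type_entire_home": 0.140,
    "neighbourhood_A": 0.040,
    "neighbourhood_B": 0.010,
    "host_response_time_within_an_hour": 0.009,
    "host_response_time_a_few_days_or_more": 0.004,
}

summary = aggregate_by_max(importances, ("neighbourhood", "host_response_time"))
for feature, weight in sorted(summary.items(), key=lambda item: -item[1]):
    print(f"{feature}: {weight:.3f}")
```

Only the neighbourhood and host response time columns are collapsed here,
so the room type sub-categories keep their own weights, in line with the
way they are reported separately in the tables.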

The room type has four sub-categories: entire home/apartment, hotel room,
private room, and shared room. As shown in Table 8, the entire
home/apartment, capacity of accommodates, and number of bathrooms are
crucial Airbnb rental price determinants in all countries. Similar to the
results from the Linear Regression model, the other Airbnb rental price
determinants also differ between countries according to the Extra Trees
model. The neighbourhood is identified as important in Germany, the United
States, and Australia (maximum w > 0.020), but not in China. The distance
to city center is important in the United States (w = 0.033), Australia
(w = 0.016), and China (w = 0.021), but not in Germany. Furthermore, TV is
one of the top five determinants only in Germany (w = 0.034), while the
shared room type shows its importance only in China (w = 0.028).

Table 8: Feature importance of countries (Extra Trees)

Features                       Germany   US      Australia   China
Accommodates                   0.157     0.130   0.145       0.330
Entire home/apt (Room type)    0.147     0.288   0.339       0.108
Neighbourhood (max value)      0.043     0.044   0.022       -
TV                             0.034     -       -           -
Bathrooms                      0.031     0.089   0.060       0.087
Distance to city center        -         0.033   0.016       0.021
Shared room (Room type)        -         -       -           0.028

Note: "-" marks a feature that is not among the top five for that country.

To compare the results between countries and cities, the attribute
"feature_importances_" of the Extra Trees model is also applied to each
city. The capacity of accommodates and the entire home/apartment room type
are among the top five Airbnb rental price determinants for all cities and
the countries in which they are located.

Table 9: Feature importance for Germany (Extra Trees)

Features                       Germany   Berlin   Munich
Accommodates                   0.162     0.117    0.225
Room type (Entire home/apt)    0.141     0.312    0.059
Neighbourhood (max)            0.043     -        0.039
Amenity (TV)                   0.033     -        -
Bathrooms                      0.030     0.047    -
Host response time (max)       -         0.039    -
Minimum nights                 -         0.026    0.032
Distance to city center        -         -        0.027

Note: "-" marks a feature that is not among the top five for that column.

In Germany (see Table 9), although the minimum nights of stay is one of
the top five features in Berlin (w = 0.026) and Munich (w = 0.032), it is
not identified as important for the Airbnb rental price in Germany at the
country level. On the other hand, TV is the fourth most important Airbnb
rental price determinant in Germany, but it is not crucial in either
Berlin or Munich. Nevertheless, Germany shares some determinants with each
of its cities. The neighbourhood is crucial in both Germany and Munich,
with maximum weights of 0.043 and 0.039 respectively. Among the other
Airbnb rental price determinants, the host response time is only important
in Berlin (w = 0.039) and the distance to city center is only crucial in
Munich (w = 0.027).

Table 10: Feature importance for US (Extra Trees)

Features                           US      Los Angeles   New York
Room type (Entire home/apt)        0.288   0.202         0.290
Accommodates                       0.130   0.258         0.093
Bathrooms                          0.089   0.122         -
Neighbourhood (max)                0.044   -             0.078
Distance to city center            0.033   -             0.060
Amenity (Hot tub sauna or pool)    -       0.026         -
Minimum nights                     -       0.022         0.045

Note: "-" marks a feature that is not among the top five for that column.

As for the United States (see Table 10), its top five Airbnb price
determinants are more consistent with those of New York. In addition to
the entire home/apartment room type and the capacity of accommodates, both
the neighbourhood and the distance to city center are important in the
United States and New York. For New York, the maximum importance weight of
the neighbourhood is 0.078 and the weight of the distance to city center
is 0.060. Apart from the room type and capacity of accommodates, the
consistency between the United States and Los Angeles only shows in the
number of bathrooms. As in Germany, the minimum nights of stay is crucial
in both Los Angeles (w = 0.022) and New York (w = 0.045), but it does not
show a significant influence at the country level of the United States.
Additionally, the hot tub sauna or pool is an important Airbnb price
determinant only in Los Angeles (w = 0.026).

Table 11: Feature importance for Australia (Extra Trees)

Features                       Australia   Sydney   Melbourne
Room type (Entire home/apt)    0.339       0.364    0.337
Accommodates                   0.145       0.156    0.135
Bathrooms                      0.060       0.063    0.060
Neighbourhood (max)            0.022       0.029    0.027
Distance to city center        0.016       0.018    -
Room type (Hotel room)         -           -        0.022

Note: "-" marks a feature that is not among the top five for that column.

As shown in Table 11, Australia and its cities show more consistency than
the other countries and their cities. The top four Airbnb price
determinants are the same for Australia and its cities, Sydney and
Melbourne. These four features are the entire home/apartment room type,
capacity of accommodates, number of bathrooms, and neighbourhood. The
fifth feature, the distance to city center, is shared by Australia and
Sydney. However, the fifth determinant differs between Australia and
Melbourne. Instead of the distance to city center, the hotel room type is
more crucial in Melbourne (w = 0.022).
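
The country-versus-city comparison amounts to a set comparison of top-five
lists. The following sketch uses the feature names from Table 11; the
dictionary layout and function name are hypothetical.

```python
# Top five determinants per column of Table 11 (Extra Trees).
TOP_FIVE = {
    "Australia": {
        "Room type (Entire home/apt)", "Accommodates", "Bathrooms",
        "Neighbourhood (max)", "Distance to city center",
    },
    "Sydney": {
        "Room type (Entire home/apt)", "Accommodates", "Bathrooms",
        "Neighbourhood (max)", "Distance to city center",
    },
    "Melbourne": {
        "Room type (Entire home/apt)", "Accommodates", "Bathrooms",
        "Neighbourhood (max)", "Room type (Hotel room)",
    },
}


def shared_determinants(place_a, place_b):
    """Return the determinants that appear in both top-five sets."""
    return TOP_FIVE[place_a] & TOP_FIVE[place_b]


for city in ("Sydney", "Melbourne"):
    shared = shared_determinants("Australia", city)
    city_only = TOP_FIVE[city] - TOP_FIVE["Australia"]
    print(city, len(shared), sorted(city_only))
```

Sydney shares all five determinants with Australia, while Melbourne shares
four and differs only in the hotel room type. The same comparison can be
repeated for the other countries by adding their top-five sets.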

Table 12: Feature importance for China (Extra Trees)

Features                           China   Beijing   Hong Kong
Accommodates                       0.324   0.381     0.157
Room type (Entire home/apt)        0.109   0.079     0.183
Bathrooms                          0.092   0.202     -
Room type (Shared room)            0.026   -         -
Distance to city center            0.020   0.018     -
Amenity (Hot tub sauna or pool)    -       0.020     -
Host listings count                -       -         0.030
Minimum nights                     -       -         0.028
Instant bookable                   -       -         0.023

Note: "-" marks a feature that is not among the top five for that column.

Based on Table 12, the capacity of accommodates and the entire
home/apartment room type are identified as important in both China and its
cities, Beijing and Hong Kong. Moreover, Beijing has only one Airbnb
rental price determinant that differs from China, which is the hot tub
sauna or pool. Apart from the capacity of accommodates and entire
home/apartment room type, the other three Airbnb price determinants differ
between China and Hong Kong. The host listings count (w = 0.030), minimum
nights of stay (w = 0.028), and whether the listing is instant bookable
(w = 0.023) are the other three most important features affecting the
Airbnb rental price in Hong Kong.

6 discussion

In this section, the results from Section 5 are discussed and compared
with the results of previous literature. Possible reasons, based on the
environment of each country, are also discussed. The limitations of this
study are stated at the end.

6.1 Research Questions

This thesis addresses the research question "What are the differences in
Airbnb rental price determinants between Germany, the United States,
Australia, and China?". To answer this main question, three sub-research
questions are answered first, based on the results.

RQ1 Which of the selected Machine Learning algorithms is better for
predicting Airbnb rental price?

To answer this question, the Linear Regression model is applied as a
baseline and several Machine Learning algorithms are applied to every
dataset. According to the comparison in Section 5.1, the algorithm that
outperforms the baseline and the other models is Extra Trees, with a
minimum R squared score of 0.692 and a maximum MSE of 0.280 on the test
set across all countries. This is in line with the results of Masrom et
al. (2022) for their similar implementation of Decision Trees and Random
Forest.

RQ2 For both the baseline and best performing models, what are the differ-
ences in feature importance and rank-ordering of the strongest Airbnb rental price
determinants between Germany, the United States, Australia, and China?

For the Linear Regression model, described in Sub-section 5.2.1, the
importance of Airbnb rental price determinants differs between countries,
except for Germany and Australia, which are in line with each other. The
top five Airbnb rental price determinants in Germany and Australia are the
capacity of accommodates, host response time, room type, neighbourhood,
and review scores. The first four of these, excluding review scores, are
also identified as important for Airbnb listings in China. The fifth
Airbnb price determinant in China is the host verification method. In the
United States, only three of the determinants mentioned above are in the
top five: the capacity of accommodates, room type, and neighbourhood. The
hot tub sauna or pool and gym amenities are the other important Airbnb
rental price determinants in the United States.

The top five Airbnb rental price determinants differ more clearly between
countries in the results of the Extra Trees model in Sub-section 5.2.2.
Germany, the United States, and Australia have more in common, because in
all three the capacity of accommodates, entire home/apartment room type,
neighbourhood, and number of bathrooms are among the most important Airbnb
price determinants. The fifth determinant differs between countries: TV is
important in Germany, and the distance to city center matters in the
United States and Australia. China shares some Airbnb rental price
determinants with the other countries, including the capacity of
accommodates, entire home/apartment room type, and number of bathrooms.
The other determinants in China are the distance to city center and the
shared room type.

RQ3 What are the relative contributions of the country, and the city within
the country, to the importance of Airbnb rental price determinants in Germany,
the United States, Australia, and China?

Both the Linear Regression model and the Extra Trees model show that
Australia is the most consistent with its cities in terms of Airbnb rental
price determinants. Furthermore, the capacity of accommodates and room
type are two crucial determinants of the Airbnb rental price in all
countries and cities, regardless of the model fitted. The relative
contributions of the country, and the cities within the country, to the
other three Airbnb price determinants depend on the model used. The
differences between a country and its cities are smaller under the Linear
Regression model than under the Extra Trees model. Under the Linear
Regression model, Munich and Beijing each have only one Airbnb rental
price determinant that differs from their countries, while Berlin and Hong
Kong have the same determinants as their countries. However, the
differences in Airbnb price determinants between the US and its cities are
larger, because they have only three determinants in common. Under the
Extra Trees model, all countries and their cities have only two Airbnb
rental price determinants in common, except for Australia and its cities.
These findings suggest that Airbnb rental price prediction based on
specific cities is not a replacement for prediction based on countries.
Therefore, considering the country as a whole is necessary for the Airbnb
rental price, especially from the perspective of the Airbnb platform
operators.

6.2 Comparison with Existing Literature

Based on Section 5, the capacity of accommodates, room type, and
neighbourhood are the most important Airbnb price determinants in all
countries with both the Linear Regression and Extra Trees models. This
finding is supported by Cai et al. (2019), who claim that both
neighbourhood and capacity of accommodates have a positive influence on
Airbnb price. Chang and Li (2020) also support it by showing the effect of
the listing room type on Airbnb price. In particular, they indicate that
the entire home/apartment has a positive influence while the shared room
type has a negative influence, which is consistent with the findings of
this study. The number of bathrooms is also one of the most important
Airbnb price determinants in all countries based on the results from the
Extra Trees model. This conclusion is supported by several previous
studies, such as Gibbs et al. (2018) and Wang and Nicolau (2017).

Listing amenities are also among the top five Airbnb price determinants
and have a positive effect, especially for listings in the United States.
The amenities include hot tub sauna or pool, gym, and TV. This finding is
supported by Gibbs et al. (2018), who indicate that listing amenities have
a positive influence on Airbnb price, with positive coefficients of 0.051
for the pool amenity and 0.243 for the gym amenity. However, the positive
influence of TV in this study contrasts with the conclusion of Dudás et
al. (2020), who claim that TV negatively affects Airbnb price.

The host response time, which describes a characteristic of the host,
counts as an Airbnb price determinant at the country level. The importance
of host response time has been addressed by Chen and Xie (2017), who
suggest that a higher Airbnb price is related to a longer response time.
They also conclude that the host acceptance rate is not a significant
determinant of price, which is consistent with the results obtained in all
countries. Moreover, host profile pictures and host verification methods
are important in China, which is supported by Ert et al. (2016), who
indicate a positive influence of the host profile on Airbnb rental price
due to trust building. The host listings count appears as an Airbnb price
determinant when comparing Hong Kong with other cities, which is
consistent with the conclusion of Cai et al. (2019), who claim that the
host listings count has a negative influence on Airbnb price. Wang and
Nicolau (2017) also support this finding by showing the statistical
significance of the host listings count for Airbnb price.

According to the results from the Linear Regression model, the review
scores show their importance in Germany and Australia, especially the
review scores for value, cleanliness, and location. The analysis of Chen
and Xie (2017) supports this finding, because they indicate that some
sub-categories of customer review scores have a statistically significant
influence on Airbnb price. For instance, the score measuring value for
money has the largest influence on Airbnb price, while the score for
listing location has a positive effect (Chen & Xie, 2017), which is
consistent with the results obtained in this study. Additionally, this
finding is supported by Kakar et al. (2016), who indicate that user
reviews positively affect Airbnb price.

The distance to city center is a primary component when considering the
influence of location on listing price. Based on the Extra Trees model, it
is a predominant feature of the Airbnb rental price in the United States,
Australia, and China. This finding is supported by Wang and Nicolau
(2017), who show that a larger distance from the listing to the city
center implies a lower Airbnb price. On the other hand, this finding
contrasts with the conclusion of Masrom et al. (2022), who suggest that
the distance to city center has no significant impact on the listing
price. Moreover, an important price determinant at the city level is the
instant bookable feature of listings, which is shown and supported by the
studies of Gibbs et al. (2018) and Wang and Nicolau (2017).

There are three main new findings from this thesis that differ from, or
were not analyzed in, previous studies. First, the minimum nights of stay
is a crucial Airbnb rental price determinant for several cities across the
analyzed countries, such as Berlin, Los Angeles, and Hong Kong. However,
this determinant is not considered or analyzed in previous studies of
Airbnb rental price. Secondly, cooking basic amenities and outdoor space
are identified as important Airbnb price determinants in Munich and New
York respectively, which is not indicated by existing literature. Finally,
the review scores are not among the top five Airbnb price determinants for
any country based on the Extra Trees algorithm in this study, which is
contrary to existing literature. A possible reason is that the review
scores are divided into seven sub-categories, each of which is treated
separately. This results in a smaller influence per sub-category than
other features have on the Airbnb rental price.

6.3 Comparison between Countries

It is necessary to discuss the different environments of the countries
because there is limited literature regarding Airbnb rental price
determinants at the country level.

Because the land area of Germany is 357,588 square kilometers, which is
much smaller than that of the other countries analyzed in this study, the
distance to city center is not a crucial factor in determining the Airbnb
rental price. Instead, a large number of people choose Airbnb listings
based on the districts of cities in Germany, which makes the neighbourhood
an important price determinant. As for the United States, sauna culture
has developed since 1960. Nowadays, it is increasingly popular in the
country, especially in North America. In this study, both Los Angeles and
New York are located in North America, which increases the importance of
the hot tub sauna or pool as an Airbnb rental price determinant in the
results. Moreover, Australia has many coastal cities that are popular with
tourists, and these cities are large, which makes both the neighbourhood
feature and the distance to city center important for predicting Airbnb
price.

In China, not only guests but also hosts are required to register
themselves in the system. Travelers also pay special attention to the
identity of hosts to ensure their security. Therefore, the host
verification method has more influence on the Airbnb rental price than in
other countries. Furthermore, Airbnb is more often considered by Chinese
travelers who stay with more than three family members or friends. The
time they spend in the listing during their stay is limited. Thus, the
capacity of accommodates, number of bathrooms, and room type are
considered more important than amenity features such as the hot tub sauna
or pool and gym. The distance to city center is important for customers in
China due to the size of the cities.

6.4 Limitations

There are several limitations to this study. First, the price in the
datasets retrieved from InsideAirbnb.com is the advertised price shown on
the platform, which means the actual rental price paid by guests might
differ. Secondly, some correlated features remain after pre-processing.
Nevertheless, the Extra Trees algorithm selected as the best-performing
model is robust to correlation, so the interpretation of the best model
and its results should not be influenced. Moreover, the name of listings
is a variable included in the initial datasets. During the first attempt
at running the model, words from the names were tokenized for Germany, the
United States, and Australia. The results show that the name does not
increase the variation explained by the model. However, tokenization
focuses on the frequency of words and there are different languages in
each dataset, which might hide the real influence of the name variable.
Thirdly, only two cities are chosen for each country and the datasets
combined from the cities are not completely balanced. Although sampling
with replacement is applied to Munich to balance the Germany dataset, the
model prediction results for Munich become dramatically better than those
for other cities. Therefore, bias might be introduced into the final
results. Additionally, only the R squared score and MSE are considered for
model evaluation, following existing literature. Further evaluation
methods could be added for a better overview of model performance.
Finally, hyperparameter tuning is only applied to the best-performing
model after model selection. As a result, some of the other models may
perform worse than they would with tuned hyperparameters.
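
One possible mechanism behind the unusually good Munich results is that
rows duplicated by sampling with replacement end up on both sides of the
train/test split. The thesis does not state whether resampling happened
before or after splitting, so the sketch below is purely an assumption-based
illustration. All names and sizes are hypothetical.

```python
import random


def leaked_fraction(n_unique, n_resampled, test_share=0.2, seed=0):
    """Share of test rows whose original listing also occurs in the training rows.

    Assumes the data is first resampled with replacement and only then split.
    """
    rng = random.Random(seed)
    # Sampling with replacement: each draw picks one of the original listings.
    resampled = [rng.randrange(n_unique) for _ in range(n_resampled)]
    n_test = int(test_share * n_resampled)
    test_ids = resampled[:n_test]
    train_ids = set(resampled[n_test:])
    return sum(1 for listing in test_ids if listing in train_ids) / n_test


# Hypothetical sizes: 500 original listings oversampled to 2,000 rows.
print(leaked_fraction(n_unique=500, n_resampled=2000))
```

Under this assumption, most test rows have an identical copy in the
training set, which a model with min_samples_leaf = 1 can memorize.
Resampling only the training portion after the split would avoid this
effect.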

7 conclusion

This study investigates the key Airbnb price determinants and answers
the main research question, "What are the differences in Airbnb rental price
determinants between Germany, the United States, Australia, and China?".

To answer this question, eight datasets are extracted for eight cities and
combined into four datasets for Germany, the United States, Australia,
and China. The study has two stages: predicting the Airbnb rental price
and conducting feature importance analysis for each dataset. In the first
stage, multiple models are applied, including Linear Regression, XGB
Regression, Support Vector Regression, Gradient Boosting Regression, and
Extra Trees. Linear Regression is used as a baseline, while Extra Trees is
selected as the best-performing algorithm based on the R squared score and
MSE. During the second stage, feature importance analysis is conducted for
the Linear Regression model and the Extra Trees model. Both models suggest
that the top five Airbnb rental price determinants differ between
countries. The Linear Regression model reveals the most important Airbnb
price determinants, including the capacity of accommodates, host response
time, room type, neighbourhood, review scores, listing amenities (hot tub
sauna or pool, gym), and host verification method. The Airbnb price
determinants are the same in Germany and Australia but different in the
United States and China. The Extra Trees model shows that the most crucial
price determinants are the capacity of accommodates, room type,
neighbourhood, number of bathrooms, distance to city center, and listing
amenity (TV). No two countries have exactly the same top five Airbnb price
determinants, although all countries have some price determinants in
common.

This research contributes to existing literature mainly by extending the
study of Airbnb price determinants from the city level to the country
level. The results not only confirm some conclusions of previous
literature but also provide evidence that contradicts existing studies.
Additionally, there are new findings that are not covered by existing
studies: first, the minimum nights of stay, cooking basic amenities, and
outdoor space are important Airbnb rental price determinants that have not
been considered before; secondly, the review scores are not among the top
five Airbnb price determinants for any country according to the Extra
Trees model, which contradicts previous findings. Moreover, the Extra
Trees model has not been applied before in the analysis of Airbnb rental
price. The results of this study show that the Extra Trees algorithm
achieves an R squared score very close to 1.0 on the training set and up
to 0.800 on the test set. The test-set MSE of the Extra Trees algorithm is
also the lowest among all algorithms applied in this study.

Although numerous studies have been conducted on the Airbnb rental price,
it is still necessary to analyze this topic further from different
perspectives. There can be factors that are not explained by the Machine
Learning approaches. Inspired by this thesis, other countries can be
analyzed to discover new features that are crucial to the Airbnb listing
price. From a business perspective, it could be useful for hosts or
companies to understand the Airbnb market in different countries. For
instance, European countries are close to each other, which makes it
viable for hosts to have listings in different countries. Thus, by
analyzing and understanding the differences in Airbnb price determinants
between countries, hosts or companies can attract more guests and maximize
their revenue. The Airbnb platform operators can also provide better
suggestions to their hosts and carry out comprehensive data analysis for
each country.

references

Abrate, G., Capriello, A., & Fraquelli, G. (2011, 08). When quality signals
talk: Evidence from the turin hotel industry. Tourism Management, 32,
912-921.
Airbnb. (2021). About us. Retrieved from https://news.airbnb.com/
about-us/
Airbnb, I. (2021, 12). Retrieved from http://insideairbnb.com/
Basak, D., Pal, S., & Patranabis, D. (2007, 11). Support vector regression.
Neural Information Processing – Letters and Reviews, 11.
Bentéjac, C., Csörgő, A., & Martínez-Muñoz, G. (2019, 11). A comparative
analysis of xgboost.
Cai, Y., Zhou, Y., Ma, J., & Scott, N. (2019, 05). Price determinants of airbnb
listings: Evidence from hong kong. Tourism Analysis, 24, 227-242.
Carrillo, G. (2019, 10). Exploration of edinburgh’s short rental market.
Retrieved from https://github.com/gracecarrillo/Predicting
-Airbnb-prices-with-machine-learning-and-location-data/
blob/gh-pages/Exploring_Edinburgh_Graciela_Carrillo.ipynb
Chang, C., & Li, S. (2020, 12). Study of price determinants of sharing
economy-based accommodation services: Evidence from airbnb.com.
Journal of Theoretical and Applied Electronic Commerce Research, 16, 584-
601.
Chen, Y., & Xie, K. (2017). Consumer valuation of airbnb listings: A hedo-
nic pricing approach. International journal of contemporary hospitality
management.
Demir, S., & Sahin, E. (2021, 09). Assessment of feature selection for
liquefaction prediction based on recursive feature elimination. European
Journal of Science and Technology, 28, 290-294. doi: 10.31590/ejosat.998033
Dudás, G., Kovalcsik, T., Vida, G., Boros, L., & Nagy, G. (2020, 05). Price
determinants of airbnb listing prices in lake balaton touristic region,
hungary. European Journal of Tourism Research, 24(10), 1-18.
Ert, E., Fleischer, A., & Magen, N. (2016). Trust and reputation in the
sharing economy: The role of personal photos in airbnb. Tourism
Management, 55, 62-73. Retrieved from https://www.sciencedirect
.com/science/article/pii/S0261517716300127 doi: https://doi
.org/10.1016/j.tourman.2016.01.013
Fang, B., Ye, Q., & Law, R. (2016, 03). Effect of sharing economy on tourism
industry employment. Annals of Tourism Research, 57, 264-267. doi:
10.1016/j.annals.2015.11.018
Geurts, P., Ernst, D., & Wehenkel, L. (2006, 04). Extremely randomized
trees. Machine Learning, 63, 3-42. doi: 10.1007/s10994-006-6226-1
Gibbs, C., Guttentag, D., Gretzel, U., Morton, J., & Goodwill, A. (2018).
Pricing in the sharing economy: a hedonic pricing model applied to
airbnb listings. Journal of Travel & Tourism Marketing, 35(1), 45-56.
Guttentag, D., Smith, S., Potwarka, L., & Havitz, M. (2018, 04). Why tourists
choose airbnb: A motivation-based segmentation study. Journal of
Travel Research, 57(3), 342-359.
Hamari, J., Sjöklint, M., & Ukkonen, A. (2016, 09). The sharing economy:
Why people participate in collaborative consumption. Journal of the
Association for Information Science and Technology, 67, 2047-2059. doi:
10.1002/asi.23552
Kakar, V., Franco, J., Voelz, J., & Wu, J. (2016, March). Effects of Host Race
Information on Airbnb Listing Prices in San Francisco (MPRA Paper).
University Library of Munich, Germany. Retrieved from https://
ideas.repec.org/p/pra/mprapa/69974.html
Kalehbasti, P., Nikolenko, L., & Rezaei, H. (2019, 07). Airbnb price
prediction using machine learning and sentiment analysis. Retrieved
from https://arxiv.org/pdf/1907.12665.pdf
Luo, Y., Zhou, X., & Zhou, Y. (2019, 12). Predicting airbnb listing price
across different cities. Retrieved from http://cs229.stanford.edu/
proj2019aut/data/assignment_308832_raw/26647491.pdf
Masrom, S., Baharun, N., Razi, N., Abdul Rahman, R., & Abd Rahman,
A. S. (2022, 01). Particle swarm optimization in machine learning
prediction of airbnb hospitality price prediction. International Journal
of Emerging Technology and Advanced Engineering, 12, 146-151. doi:
10.46338/ijetae0122_14
Oskam, J., & Boswijk, A. (2015, 03). Airbnb: The future of networked
hospitality businesses. Journal of Tourism Futures, 2, 22-42.
Pérez Sánchez, R., Serrano Estrada, L., Martí-Ciriquián, P., & García, R. T.
(2018, 12). The what, where, and why of airbnb price determinants.
Sustainability, 10(12), 4596.
Soler, I., Gemar, G., Correia, M., & Serra, F. (2019, 02). Algarve hotel price
determinants: A hedonic pricing model. Tourism Management, 70,
311-321.
Tussyadiah, I. (2016, 05). Factors of satisfaction and intention to use peer-to-
peer accommodation. International Journal of Hospitality Management,
55, 70-80. doi: 10.1016/j.ijhm.2016.03.005
Wang, D., & Nicolau, J. (2017, 04). Price determinants of sharing economy
based accommodation rental: A study of listings from 33 cities on
airbnb.com. International Journal of Hospitality Management, 62, 120-
131.
Wei, P., Lu, Z., & Song, J. (2015). Variable importance analysis: A
comprehensive review. Reliability Engineering & System Safety, 142,
399-432. Retrieved from
https://www.sciencedirect.com/science/article/pii/S0951832015001672
doi: https://doi.org/10.1016/j.ress.2015.05.018
Wikipedia contributors. (2022a). Earth — Wikipedia, the free ency-
clopedia. Retrieved from https://en.wikipedia.org/w/index.php
?title=Earth&oldid=1086578959 ([Online; accessed 13-May-2022])
Wikipedia contributors. (2022b). Haversine formula — Wikipedia, the free en-
cyclopedia. Retrieved from https://en.wikipedia.org/w/index.php
?title=Haversine_formula&oldid=1083812895 ([Online; accessed
13-May-2022])
Yang, Y., Nieto García, M., Viglia, G., & Nicolau, J. (2021, 09). Competitors
or complements: A meta-analysis of the effect of airbnb on hotel per-
formance. Journal of Travel Research. doi: 10.1177/00472875211042670
Zervas, G., Proserpio, D., & Byers, J. (2017, 01). The rise of the sharing
economy: Estimating the impact of airbnb on the hotel industry.
Journal of Marketing Research, 54. doi: 10.1509/jmr.15.0204
Zhang, Z., Chen, R., Han, L., & Yang, L. (2017, 09). Key factors affecting
the price of airbnb listings: A geographically weighted approach.
Sustainability, 9, 1635.

appendix a

Figure 2: Correlation heat map (Munich)



appendix b

Table 13: Descriptions of features after pre-processing

Features                Descriptions
Host ID                 Airbnb's unique identifier for the host.
Host acceptance rate    The rate at which a host accepts booking requests.
Host response time      The time a host takes to respond to booking requests.
Host is superhost       Whether the host is an experienced host who is
                        enthusiastic about making stays memorable for guests.
Host listings count     The number of Airbnb listings a host has.
Host has profile pic    Whether the host has a profile picture.
Host identity verified  Whether the host identity is verified.
Host verification       Whether the host is verified through one or more
                        methods (Facebook, email, phone, google, government
                        id, identity manual, jumio, kba, manual
                        offline/online, work email, photographer, reviews,
                        selfie, zhima selfie).
Neighbourhood           The district where the Airbnb listing is located.
Room type               Entire home/apt, Private room, or Shared room.
Accommodates            The maximum capacity of the listing.
Bathrooms               The number of bathrooms in the listing.
Amenity                 The amenities provided in the listing (24h check-in,
                        air conditioning, high-end electronics, BBQ, balcony,
                        nature and views, bed linen, breakfast, TV, coffee
                        machine, cooking basics, white goods, elevator, gym,
                        child friendly, parking, outdoor space, host
                        greeting, hot tub sauna or pool, internet, long-term
                        stays, pets allowed, private entrance, secure,
                        self-check-in, smoking allowed, accessible, event
                        suitable).
Minimum nights          Minimum number of nights per stay for the listing.
Maximum nights          Maximum number of nights per stay for the listing.
Has availability        Whether the listing has availability.
Number of reviews       The number of reviews the listing has.
Review scores           The review scores given by guests (rating, accuracy,
                        cleanliness, check-in, communication, location,
                        value).
Instant bookable        Whether the guest can book the listing automatically,
                        without the host having to accept the booking
                        request.
Host duration           The number of days the person has been a host.

appendix c

Figure 3: Histogram of numerical features before log transformation (Germany)



Figure 4: Histogram of numerical features after log transformation (Germany)



appendix d

Table 14: Model performance on predicting Airbnb price

Models                        Countries             R2 (train set)  R2 (test set)  MSE (test set)
Linear Regression             Berlin, Germany       0.573           0.572          0.180
                              Munich, Germany       0.449           0.459          0.230
                              Los Angeles, US       0.628           0.623          0.253
                              New York, US          0.610           0.587          0.211
                              Sydney, Australia     0.692           0.676          0.213
                              Melbourne, Australia  0.611           0.550          0.245
                              Beijing, China        0.726           0.713          0.306
                              Hong Kong, China      0.467           0.328          0.511
XGB Regression                Berlin, Germany       0.655           0.610          0.164
                              Munich, Germany       0.624           0.601          0.169
                              Los Angeles, US       0.734           0.713          0.193
                              New York, US          0.700           0.657          0.175
                              Sydney, Australia     0.746           0.719          0.185
                              Melbourne, Australia  0.691           0.604          0.215
                              Beijing, China        0.817           0.759          0.256
                              Hong Kong, China      0.711           0.479          0.396
Support Vector Regression     Berlin, Germany       0.035           0.017          0.413
                              Munich, Germany       -0.008          -0.019         0.432
                              Los Angeles, US       -0.004          -0.007         0.676
                              New York, US          0.010           0.010          0.506
                              Sydney, Australia     0.014           0.005          0.656
                              Melbourne, Australia  0.004           0.002          0.542
                              Beijing, China        0.004           0.004          1.058
                              Hong Kong, China      -0.022          -0.006         0.764
Gradient Boosting Regression  Berlin, Germany       0.660           0.610          0.164
                              Munich, Germany       0.628           0.608          0.166
                              Los Angeles, US       0.734           0.717          0.191
                              New York, US          0.701           0.656          0.176
                              Sydney, Australia     0.749           0.717          0.186
                              Melbourne, Australia  0.697           0.602          0.216
                              Beijing, China        0.824           0.763          0.254
                              Hong Kong, China      0.735           0.460          0.409
Extra Trees                   Berlin, Germany       1.0             0.582          0.176
                              Munich, Germany       1.0             0.965          0.014
                              Los Angeles, US       1.0             0.751          0.168
                              New York, US          0.999           0.689          0.160
                              Sydney, Australia     1.0             0.717          0.190
                              Melbourne, Australia  0.999           0.604          0.216
                              Beijing, China        0.999           0.795          0.221
                              Hong Kong, China      0.999           0.535          0.347

appendix e

Table 15: Feature importance for Germany (Linear Regression)

Features  Sub-categories  Germany  Berlin  Munich

Accommodates  0.252  0.216  0.272
Host response time
    Unknown  -0.075  -0.154
    Within an hour  -0.0008  0.005
    Within a few hours  0.018  0.029
    Within a day  0.014  0.046
    A few days or more  0.045  0.074
    Absolute average  0.031  0.062
Room type
    Entire home/apt  0.192  0.252  0.143
    Hotel room  0.364  0.525  0.207
    Private room  -0.139  -0.176  -0.088
    Shared room  -0.417  -0.602  -0.262
    Absolute average  0.278  0.389  0.175
Review scores
    Rating  0.102  0.128  0.064
    Accuracy  -0.037  -0.054  0.032
    Cleanliness  0.150  0.111  0.103
    Check-in  -0.018  -0.028  0.059
    Communication  0.036  0.003  -0.002
    Location  0.133  0.135  0.041
    Value  -0.266  -0.197  -0.248
    Absolute average  0.106  0.094  0.078
Neighbourhood
    Reinickendorf (max)  -0.335  -0.156
    Ludwigsvorstadt-Isarvorstadt  0.330  0.185
    Marzahn - Hellersdorf  -0.310  -0.142
    Absolute average  0.123  0.066  0.060
Amenity
    Cooking basic  -0.123
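
The "Absolute average" rows appear to be the mean of the absolute values of the sub-category coefficients. This reading is inferred from the table values rather than stated here as a definition, and the function name below is a hypothetical helper. A minimal check, using the Room type coefficients from the Germany column of Table 15:

```python
from typing import Iterable


def absolute_average(coefficients: Iterable[float]) -> float:
    """Mean of the absolute values of a feature's sub-category coefficients."""
    values = [abs(c) for c in coefficients]
    if not values:
        raise ValueError("at least one coefficient is required")
    return sum(values) / len(values)


# Room type coefficients, Germany column of Table 15.
room_type_germany = {
    "Entire home/apt": 0.192,
    "Hotel room": 0.364,
    "Private room": -0.139,
    "Shared room": -0.417,
}

print(round(absolute_average(room_type_germany.values()), 3))  # 0.278, as in Table 15
```

Taking absolute values before averaging keeps positive and negative sub-category effects from cancelling out, so a categorical feature can be ranked against single-coefficient features such as Accommodates.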

Table 16: Feature importance for US (Linear Regression)

Features  Sub-categories  US  Los Angeles  New York

Accommodates  0.293  0.300  0.256
Bathrooms  0.183
Host response time
    Unknown  -0.044  -0.154
    Within an hour  0.022  0.005
    Within a few hours  -0.002  0.029
    Within a day  -0.011  0.046
    A few days or more  0.036  0.074
    Absolute average  0.023  0.062
Room type
    Entire home/apt  0.218  0.394  0.146
    Hotel room  0.562  0.373  0.470
    Private room  -0.173  -0.039  -0.226
    Shared room  -0.607  -0.729  -0.390
    Absolute average  0.390  0.384  0.308
Review scores
    Rating  0.160  0.064
    Accuracy  -0.060  0.032
    Cleanliness  0.122  0.103
    Check-in  -0.062  0.059
    Communication  -0.037  -0.002
    Location  0.216  0.041
    Value  -0.218  -0.248
    Absolute average  0.125  0.078
Neighbourhood
    Manhattan (max)  0.351  0.243
    Staten Island  -0.192  -0.151
    Unincorporated Areas  -0.088  -0.142
    Absolute average  0.101  0.066  0.097
Amenity
    Hot tub sauna or pool  0.155  0.199
    Gym  0.073  0.173
    Outdoor space  0.096

Table 17: Feature importance for Australia (Linear Regression)

Features  Sub-categories  Australia  Sydney  Melbourne

Accommodates  0.321  0.330  0.301
Host response time
    Unknown  -0.078  -0.077  -0.041
    Within an hour  0.005  -0.001  0.042
    Within a few hours  -0.027  0.016  -0.053
    Within a day  0.040  0.029  0.031
    A few days or more  0.060  0.033  0.021
    Absolute average  0.042  0.031  0.038
Room type
    Entire home/apt  0.251  0.316  0.206
    Hotel room  0.440  0.373  0.533
    Private room  -0.187  -0.126  -0.231
    Shared room  -0.504  -0.564  -0.508
    Absolute average  0.345  0.345  0.369
Review scores
    Rating  0.129  0.123  0.144
    Accuracy  -0.003  0.004  -0.001
    Cleanliness  0.096  0.094  0.095
    Check-in  0.006  0.012  -0.003
    Communication  -0.028  -0.009  -0.043
    Location  0.076  0.081  0.057
    Value  -0.210  -0.230  -0.211
    Absolute average  0.078  0.079  0.079
Neighbourhood
    Pittwater (max)  0.769  0.704
    Yarra Ranges  0.455  0.498
    Mosman  0.512  0.414
    Absolute average  0.153  0.152  0.119

Table 18: Feature importance for China (Linear Regression)

Features  Sub-categories  China  Beijing  Hong Kong

Accommodates  0.394  0.430  0.306
Host has profile picture  0.313
Host response time
    Unknown  0.106  0.055  0.129
    Within an hour  -0.139  -0.090  -0.181
    Within a few hours  0.019  0.018  0.025
    Within a day  -0.116  -0.100  -0.082
    A few days or more  0.130  0.117  0.109
    Absolute average  0.102  0.076  0.105
Room type
    Entire home/apt  0.453  0.340  0.501
    Hotel room  0.047  0.088
    Private room  0.091  0.059  0.105
    Shared room  -0.592  -0.400  -0.694
    Absolute average  0.296  0.266  0.347
Neighbourhood
    Tuen Mun (max)  1.592  0.163
    Tsuen Wan  1.341  1.219
    Shijingshan  -0.938  -0.723
    Fengtai  -0.751  -0.523
    Absolute average  0.310  0.254  0.240
Host verification
    Jumio  0.331
    Government ID  -0.400
    Zhima selfie  0.555
    Absolute average  0.366
