DIFFERENCES IN AIRBNB PRICES
DETERMINANTS BETWEEN
COUNTRIES: A MACHINE
LEARNING APPROACH
XINKUN LIU
committee
Dr. Dan Stowell
Chris Emmery MSc
location
Tilburg University
School of Humanities and Digital Sciences
Department of Cognitive Science &
Artificial Intelligence
Tilburg, The Netherlands
date
June 24, 2022
acknowledgments
I appreciate the guidance from Dr. Dan Stowell on this study.
DIFFERENCES IN AIRBNB PRICES
DETERMINANTS BETWEEN
COUNTRIES: A MACHINE
LEARNING APPROACH
xinkun liu
Abstract
Airbnb has become increasingly popular since 2008 and offers a
holiday rental market with variable rental prices. Several attributes
have been proposed as price determinants for Airbnb listings and
analysed for specific cities. However, the influence of the country,
as well as the relation between a country and its cities with respect
to the importance of Airbnb rental price determinants, is unknown.
This thesis investigates the differences in Airbnb price determinants
between countries to fill this research gap. Concretely, several
Machine Learning models are applied to publicly available datasets
of Airbnb listings in Germany, the United States, Australia, and
China. In prediction, the best performing model explains up to 80%
of the variance with a minimum MSE of 0.093 when applied per
country, as opposed to a maximum of 65.9% with a minimum MSE of
0.205 for a global baseline model. Feature importance analysis is
then conducted based on the baseline and the best performing model.
The main finding from both models is that some of the Airbnb rental
price determinants differ between countries. In particular, review
scores are not among the top determinants for any country according
to the Extra Trees model, which is a novel finding compared to the
existing literature. This thesis can be a starting point for broader
studies on the importance of Airbnb price determinants that extend
from cities to countries. The findings can also help Airbnb hosts
better understand the international market when they expand their
businesses to other countries.
Work on this thesis did not involve collecting data directly from human
participants from Germany, the United States, Australia, and China. No
experiments were carried out for original data collection and no new data
were gathered for this study. The data used in this thesis is made publicly
available online by InsideAirbnb.com under the Creative Commons Attribution
4.0 International License. No "private" information is being used. Names,
listings and review details are all publicly displayed on the Airbnb website.
InsideAirbnb.com is the original owner of the data and retains ownership
of the data during and after the completion of this thesis. Furthermore, I
do not hold any legal claim or authority to the data used in this thesis.
2 introduction
For instance, whether the property is in the city center or close to
attractions (Pérez Sánchez, Serrano Estrada, Martí-Ciriquián, & García,
2018; Wang & Nicolau, 2017). Although Pérez Sánchez et al. (2018) and
Chang and Li (2020) have suggested a possible influence of the city where
the listing is located, this influence has not been analyzed in depth.
There is also no specific differentiation between countries when
determining the importance level of key Airbnb rental price determinants.
Wang and Nicolau (2017) mention in their study that a global model
describing the Airbnb rental price determinants based on larger
geographical areas, such as multiple cities combined or countries, can
better reflect the real market situation. They also assume that such a
global model can indicate the association between Airbnb rental price and
price determinants in terms of market equilibrium, if Airbnb listing prices
are acceptable to travelers from different countries around the world
(Wang & Nicolau, 2017). Therefore, it is necessary to identify the possible
similarities or differences in Airbnb rental price determinants between
countries to provide a starting point for further studies. Additionally,
as current research focuses on cities, it is necessary to identify the
relative contributions of countries and their cities to the importance
level of Airbnb rental price determinants. The results can be compared
with existing literature to find supporting evidence and obtain new
findings.
In this research, Airbnb listing data are collected for four countries from
four continents, and the features are incorporated in the prediction of
Airbnb price using Machine Learning algorithms, including Linear
Regression, XGBoost, Support Vector Regression (SVR), Gradient Boosting,
and Extra Trees. Model performance and comparison are based on R squared
scores and MSE. Feature importance analysis based on variable importance
measures (VIM) is further conducted to determine the differences in the
importance level of Airbnb rental price determinants between countries.
RQ3 What are the relative contributions of the country, and the city within the
country, to the importance of Airbnb rental price determinants in Germany,
the United States, Australia, and China?
3 related work
In this section, the comparison between the hotel industry and Airbnb is
addressed first, because many existing works are inspired by papers on the
hotel industry. Then previous studies related to the Airbnb price
determinants are summarized in detail. Lastly, studies regarding the
algorithms and feature importance analysis are described.
Chen and Xie (2017) also apply hedonic pricing analysis in their Airbnb
study and conclude that the room type has the most influence on the
Airbnb price. They specifically mention that the entire home or apartment
room type has the most positive effect, followed by the private room type
and shared room type. Other functionalities of the Airbnb listing, including
property type, room amenities, and location, are also crucial to the listing
price (Chen & Xie, 2017).
Several studies focus on the host attributes of Airbnb listings. Ert, Fleischer,
and Magen (2016) study the role of hosts' personal photos in the Airbnb
industry based on the available listings in Stockholm.
They conclude that the profile picture of a host is crucial because it means
Similar to the analysis of Cai et al. (2019), Chang and Li (2020) identify
Airbnb rental price determinants from five categories based on a dataset
with 65,130 observations drawn from Airbnb.com. They employ two OLS
models to estimate key factors of Airbnb price and explore interaction
effects between features. The top five price determinants from the five
categories include the room type, amenities, pictures of the listing,
distance to sightseeing spots, and the city where the listing is located
(Chang & Li, 2020).
Besides, Zhang et al. (2017) implement the general linear model (GLM)
as baseline and the geographically weighted regression model (GWR)
for comparison in their study. They conclude that there are significant
connections between Airbnb rental price and the distance to the conference
center, the review amount and scores (Zhang et al., 2017). Additionally, the
GWR model outperforms the GLM based on R-squared score (Zhang et al.,
2017).
3.3 Location
Ensemble, and Support Vector Regression. The results show that the SVR
model achieved the highest R squared score among all the algorithms
(Kalehbasti et al., 2019). Masrom et al. (2022) apply Lasso, Ridge, Decision
Tree, and Random Forest. Their results indicate that the Decision Tree and
Random Forest are the two superior algorithms, with R squared scores
very close to 1 and RMSE below 0.040. Thus, the Linear Regression model
is used as a baseline in this study, while multiple models are applied to
select the one that outperforms the others.
Inspired by the novel research from Pérez Sánchez et al. (2018), this study
contributes to previous literature by extending the concept of location
further to the country where the listings are located.
4 method
4.1 Pipeline
[Figure: pipeline. The original Airbnb dataset is organized by city and
country (Germany: Berlin, Munich; US: Los Angeles, New York; Australia:
Sydney, Melbourne; Beijing). Baseline branch: Linear Regression,
prediction, recursive feature elimination, coefficients. Model selection
branch: XGBoost, SVR, Gradient Boosting; prediction, selected model,
hyperparameter tuning, feature importance based on the selected model.]
4.2 Data
                             Samples
Country      City            Initial    After processing
*Germany     Berlin          17290      13657
             Munich          4995       12000
*US          Los Angeles     33329      24892
             New York        38277      28017
*Australia   Sydney          20880      14397
             Melbourne       17830      13190
*China       Beijing         5159       2989
             Hong Kong       5944       2836
$$ d = 2r \arcsin\left( \sqrt{ \sin^2\left( \frac{\varphi_2 - \varphi_1}{2} \right) + \cos\varphi_1 \cdot \cos\varphi_2 \cdot \sin^2\left( \frac{\lambda_2 - \lambda_1}{2} \right) } \right) \tag{1} $$
where r denotes the radius of the earth, φ denotes the latitude, λ denotes
the longitude, and arcsin is the inverse sine function (Wikipedia
contributors, 2022b). The average radius of the earth is 6371.0 kilometers
(Wikipedia contributors, 2022a).
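The distance in Equation (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the thesis; the function name and the assumption that coordinates are given in decimal degrees are choices made here for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # average Earth radius quoted in the text


def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1                 # latitude difference
    d_lam = math.radians(lon2 - lon1)   # longitude difference
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lam / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))
```

Calling `haversine_km(listing_lat, listing_lon, center_lat, center_lon)` with a listing's coordinates and a chosen city-center point would give a distance-to-center feature in kilometers; which point counts as the city center is left open here.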
Other numerical features are highly skewed in all datasets. For instance,
host acceptance rate is negatively skewed and the capacity of accommodates
is positively skewed. Thus, log transformation is applied for
normalization. A comparison of the numerical features in the Germany
dataset before and after log transformation is presented as an example in
Appendix C.
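The effect of a log transformation on a positively skewed feature can be illustrated on synthetic data. The sketch below is not taken from the thesis: the lognormal sample stands in for a right-skewed feature such as the capacity of accommodates, and the use of `np.log1p` (log of 1 + x, which tolerates zeros) is an assumption, since the text does not say which variant was applied.

```python
import numpy as np


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-in for a numerical listing feature.
raw_feature = rng.lognormal(mean=1.0, sigma=0.8, size=5000)
log_feature = np.log1p(raw_feature)  # assumed variant of the log transform

skew_before = skewness(raw_feature)
skew_after = skewness(log_feature)
```

Comparing `skew_before` with `skew_after` shows how much of the right tail the transform removes. The example covers only the positively skewed case.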
4.3 Models
$$ y_i = \beta_0 + \sum_{k=1}^{p} \beta_k X_{ik} + \varepsilon_i \tag{2} $$
The processed data is split into 80% for the training set, used to fit the
Linear Regression model, and 20% for the test set, used to check
generalization on new data. Both R squared score and MSE are calculated
on the test set. Afterwards, recursive feature elimination (RFE) is applied.
For the Linear Regression model, RFE is used to rank each Airbnb price
determinant based on the coefficient of each feature. Both the Linear
Regression algorithm and RFE are applied using the scikit-learn package
in Python.
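The baseline procedure described above could look roughly as follows in scikit-learn. This is a sketch on synthetic data, not the thesis code: the data from `make_regression`, the random seeds, and the choice `n_features_to_select=1` (which makes RFE produce a full ranking) are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed listing features and (log) price.
X, y = make_regression(n_samples=1000, n_features=8, n_informative=4,
                       noise=10.0, random_state=0)

# 80% training set, 20% test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

baseline = LinearRegression().fit(X_train, y_train)
y_pred = baseline.predict(X_test)
test_r2 = r2_score(y_test, y_pred)
test_mse = mean_squared_error(y_test, y_pred)

# RFE repeatedly drops the feature with the smallest absolute coefficient;
# eliminating down to one feature yields a rank for every feature (1 = best).
rfe = RFE(estimator=LinearRegression(), n_features_to_select=1, step=1)
rfe.fit(X_train, y_train)
feature_ranking = rfe.ranking_
```

Because RFE ranks linear-model features by coefficient size, the ranking is only meaningful when the features are on comparable scales.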
Algorithm                                      Literature
XGBoost / Gradient Boosting                    Cai et al. (2019); Carrillo (2019)
Support Vector Regression                      Kalehbasti et al. (2019)
Extra Trees (Decision Trees, Random Forest)    Masrom et al. (2022); Luo et al. (2019)
rental price is integer and treated as discrete values so that SVR is appli-
cable. Furthermore, SVR has excellent generalization capacity because it
minimizes the generalization error bound rather than the observed training
error (Basak, Pal, & Patranabis, 2007).
The Extra Trees algorithm is very similar to the Random Forest: both are
composed of many decision trees, and the final prediction is based on
every tree. Differing from the Random Forest, Extra Trees uses
the whole original sample instead of bootstrap replicas, and it chooses
cut points randomly rather than choosing the optimum split (Geurts,
Ernst, & Wehenkel, 2006). Using the whole sample is intended to reduce
bias, while the random cut points are intended to reduce variance. The
computational process is also faster for the Extra Trees algorithm.
Decision Trees and Random Forest have been used in previous studies
of Airbnb rental price determinants, such as the study from
Masrom et al. (2022). This suggests that the Extra Trees algorithm can be
applied in this study, because it belongs to the ensembles of Decision
Trees and adds randomization while still optimizing.
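The sampling difference between the two ensembles is visible in the scikit-learn defaults. The sketch below fits both regressors on synthetic data (the data and seeds are assumptions for illustration) and inspects the `bootstrap` setting and the training fit.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Synthetic data with continuous, distinct feature values.
X, y = make_regression(n_samples=500, n_features=6, noise=5.0, random_state=0)

extra_trees = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)
random_forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Extra Trees trains every tree on the whole sample by default,
# whereas Random Forest draws a bootstrap replica per tree.
uses_bootstrap = {"extra_trees": extra_trees.bootstrap,
                  "random_forest": random_forest.bootstrap}

# Training-set R squared of both ensembles.
train_r2 = {"extra_trees": extra_trees.score(X, y),
            "random_forest": random_forest.score(X, y)}
```

With fully grown trees and no bootstrap, every Extra Trees member sees and memorizes each training row, so a training R squared of (almost) exactly 1 is expected and says little about generalization.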
US
Hyperparameters Values tested Best Setting
n_estimators 100, 150, 200, 250, 300 300
min_samples_leaf 1, 20, 40, 60, 80, 100 1
When applying the different algorithms, 64% of the data is used as the
training set, 16% as the validation set, and 20% as the test set. The training
set is used for choosing the model that outperforms the baseline and the
other models on Airbnb rental price prediction based on R squared score,
while the validation set is used for hyperparameter tuning of the selected
model. The details of hyperparameter tuning can be found in Table 3. The
test set is used for checking generalization, and MSE is calculated on the
test set. Furthermore, the importance level of features is identified based
on the selected model. In this study, the attribute "feature_importances_"
of the Extra Trees model in Python is used to find the most important
features. The "feature_importances_" attribute is known as the Gini
importance, which means the importance level of a feature is calculated
based on the total
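The split, tuning, and importance steps described in this paragraph might be sketched as follows. The synthetic data, the seeds, and the reduced grid (a subset of the values listed in the hyperparameter table, chosen to keep the example fast) are assumptions; the thesis does not show its own code. In scikit-learn, `feature_importances_` is the normalized total reduction of the split criterion brought by each feature, which for a regressor is squared error by default.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=6, n_informative=3,
                       noise=10.0, random_state=0)

# Hold out 20% as the test set, then 20% of the remaining 80% as the
# validation set: 64% train / 16% validation / 20% test.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.20, random_state=0)

# Small illustrative grid, scored by R squared on the validation set.
best_score, best_params = -np.inf, None
for n_estimators in (100, 300):
    for min_samples_leaf in (1, 20):
        params = {"n_estimators": n_estimators,
                  "min_samples_leaf": min_samples_leaf}
        model = ExtraTreesRegressor(random_state=0, **params)
        val_r2 = model.fit(X_train, y_train).score(X_val, y_val)
        if val_r2 > best_score:
            best_score, best_params = val_r2, params

final_model = ExtraTreesRegressor(random_state=0, **best_params)
final_model.fit(X_train, y_train)
test_r2 = final_model.score(X_test, y_test)

# Impurity-based importances sum to 1; sort indices from most to least important.
importances = final_model.feature_importances_
ranking = np.argsort(importances)[::-1]
```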
As suggested by Zhang et al. (2017) and Luo et al. (2019), the performance
of the previously mentioned models is compared based on the R squared
score and evaluated by MSE. The R squared score is the proportion of
variance explained by the model, which means it shows how much of the
variation in the data is explained by all features fed into the model.
A higher R squared score indicates a better fit of the model to the data.
As this study aims to identify the importance level of different features,
it is important to first let the features explain as much of the variation
in the data as possible. If the model explains only a tiny fraction of the
variation, the feature importance ranking will not be reliable. Therefore,
the R squared score is a suitable evaluation criterion. In model
selection, R squared scores obtained from different models are also used
for comparison. MSE measures the quality of fit and is suitable
for regression problems. A lower MSE means that the predictions are
closer to the actual data. MSE is also widely applied
in existing literature (e.g. Kalehbasti et al., 2019; Luo et al., 2019;
Zhang et al., 2017), which allows comparison in model evaluation. Both
R squared score and MSE are calculated on the test set as a further
indication of the performance of the different models.
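Both metrics can be written out directly from their definitions. The following sketch uses four made-up values purely to show the arithmetic; the numbers are hypothetical and are not results from this study.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: average squared residual."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


# Hypothetical log prices and predictions, for illustration only.
y_true = [4.0, 4.5, 5.0, 5.5]
y_pred = [4.2, 4.4, 5.1, 5.3]

example_mse = mse(y_true, y_pred)       # 0.10 / 4 = 0.025
example_r2 = r_squared(y_true, y_pred)  # 1 - 0.10 / 1.25 = 0.92
```

The two metrics are linked through the residual sum of squares: for a fixed test set, a lower MSE always corresponds to a higher R squared.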
5 results
In this section, the Airbnb price prediction results from all algorithms
are presented and compared. As the feature importance analysis is the
main focus of this study, the results from the Linear Regression model
and the Extra Trees model are addressed in detail, including the comparison
between countries and the comparison between each country and its
own cities.
                                            R2 score               MSE
Models                         Countries    Train Set   Test Set   Test Set
Linear Regression              Germany      0.515       0.499      0.232
                               US           0.596       0.582      0.250
                               Australia    0.650       0.659      0.205
                               China        0.625       0.639      0.374
XGB Regression                 Germany      0.626       0.610      0.185
                               US           0.696       0.672      0.196
                               Australia    0.686       0.687      0.188
                               China        0.731       0.706      0.305
Support Vector Regression      Germany      0.009       0.004      0.461
                               US           0.004       0.003      0.596
                               Australia    0.004       0.003      0.599
                               China        0.036       0.046      0.999
Gradient Boosting Regression   Germany      0.628       0.604      0.183
                               US           0.697       0.670      0.197
                               Australia    0.687       0.686      0.189
                               China        0.740       0.699      0.312
Extra Trees                    Germany      1.0         0.800      0.093
                               US           1.0         0.715      0.171
                               Australia    0.999       0.692      0.188
                               China        0.999       0.731      0.280
minimum at 0.499 on the test set. The R squared scores are slightly higher
on the test set than on the training set for Australia and China, because of
the inherent variability in the data. The MSE is above 0.200 for all
countries on the test set for the Linear Regression model. All other models
outperform the baseline except the SVR model, which has a far lower
R squared score and a higher MSE than the baseline. The model that
outperforms the others in predicting the Airbnb rental price is Extra Trees.
It has an R squared score close to 1 for all countries on the training set
and a higher R squared score than the other models on the test set. The MSE
of the Extra Trees model on the test set is also the lowest among all models
for every country, tied with XGB Regression for Australia. In particular,
the Germany dataset has the highest R squared score and lowest MSE among
all countries, because of the resampling applied to the Munich dataset
(R2 = 0.965, MSE = 0.014). Based on the results, the Extra Trees algorithm
is selected as the model for feature importance analysis.
The details about model performance of cities within each country are
shown in Appendix D.
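One way resampling can inflate test scores is through duplicated rows that end up on both sides of the train/test split. The sketch below simulates this with row identifiers only. It assumes that sampling with replacement happens before the 80/20 split, which the text does not state, and uses the Munich sample sizes from the data table (4995 original rows, 12000 after resampling).

```python
import numpy as np

rng = np.random.default_rng(0)
n_original, n_resampled = 4995, 12000  # Munich: initial and resampled sizes

# Sampling with replacement: each drawn row points back to an original row.
drawn_ids = rng.integers(0, n_original, size=n_resampled)

# Shuffle, then split 80% / 20% into training and test rows.
shuffled = rng.permutation(drawn_ids)
n_test = n_resampled // 5
test_ids, train_ids = shuffled[:n_test], shuffled[n_test:]

# Share of test rows whose original listing also appears in the training set.
leaked_share = np.isin(test_ids, train_ids).mean()
```

Under this assumed order of operations, a test row has a copy in the training set with probability 1 - (1 - 1/4995)^9600, about 85%, so a model that memorizes training rows would score very well on such a test set without generalizing better.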
The other two of the top five Airbnb rental price determinants differ
between countries. The host response time is one of the top five significant
features in Germany, Australia, and China. It is also the sixth most
important Airbnb price determinant in the United States (β = 0.023).
Specifically, a host response time of a few days or more has a larger effect
than the other response time categories (see Table 6). Review scores show
an important influence only in Germany and Australia. However, as indicated
in Table 5, the host verification method, rather than review scores, affects
the Airbnb rental price in China (β = 0.366). Instead of review scores,
the amenities hot tub sauna or pool (β = 0.155) and gym (β = 0.073)
affect the Airbnb price in the United States.
Moreover, review scores appear among the top five Airbnb price
determinants in Los Angeles and New York, but this importance does not
show at the country level (see Table 7). The hot tub sauna or pool and the
gym are important features in the United States, driven by
Los Angeles and New York respectively. As for China, according to Table 7,
four Airbnb rental price determinants are the same for the country
and the cities: capacity of accommodates, host response time,
room type, and neighbourhood. However, whether the host has a profile
picture appears to be important only in Beijing (β = 0.313), while host
verification shows its importance only in China and its city Hong Kong.
To compare the results between countries and cities, the attribute
"feature_importances_" of the Extra Trees model is also applied to each city.
The top two Airbnb rental price determinants are the same for all
cities and the countries in which they are located: the capacity of
accommodates and the entire home or apartment room type.
In Germany (see Table 9), although minimum nights of stay is one of the
top five important features in Berlin (w = 0.026) and Munich (w = 0.032),
it is not identified as important for the Airbnb rental price in Germany at
the country level. On the other hand, TV is the fourth Airbnb rental price
determinant in Germany, but it is not crucial in either Berlin or Munich.
Nevertheless, there is consistency existing between Germany and its cities
As for the United States (see Table 10), its top five Airbnb price
determinants are more consistent with New York. In addition to the entire
home/apartment and the capacity of accommodates, both the neighbourhood
and distance to city center are important in the United States and
New York. The maximum importance weight of the neighbourhood is
0.078 and the maximum weight of distance to city center is 0.060 for New
York. Apart from the room type and capacity of accommodates, the
consistency between the United States and Los Angeles only shows in the
number of bathrooms feature. As in Germany, the minimum nights is considered
crucial in both Los Angeles (w = 0.022) and New York (w = 0.045), but it
does not show significant influence at the country level of the United
States. Additionally, the hot tub sauna or pool is an important Airbnb price
determinant only in Los Angeles (w = 0.026).
As shown in Table 11, Australia and its cities show more consistency
than the other countries with their cities. The top four important Airbnb
price determinants are the same for Australia and its cities, Sydney and
Melbourne. These four features are: the entire home/apartment room type,
capacity of accommodates, number of bathrooms, and neighbourhood. The
fifth feature, the distance to city center, is the same for Australia and
Sydney. However, the fifth determinant differs between Australia
and Melbourne. Instead of the distance to city center, the hotel room type
is more crucial in Melbourne (w = 0.022).
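The country-versus-city comparison amounts to intersecting top-five sets. The short sketch below uses the Australian determinants named in this paragraph; the string labels are shorthand chosen here, and the lists are treated as sets because the text does not give the full rank order within each city.

```python
# Top-five determinants as described in the text for Australia and its cities.
top_five = {
    "Australia": ["entire home/apartment", "accommodates", "bathrooms",
                  "neighbourhood", "distance to city center"],
    "Sydney": ["entire home/apartment", "accommodates", "bathrooms",
               "neighbourhood", "distance to city center"],
    "Melbourne": ["entire home/apartment", "accommodates", "bathrooms",
                  "neighbourhood", "hotel room type"],
}


def shared_determinants(country, city):
    """Determinants that appear in both the country and the city top five."""
    return sorted(set(top_five[country]) & set(top_five[city]))


shared_sydney = shared_determinants("Australia", "Sydney")
shared_melbourne = shared_determinants("Australia", "Melbourne")
```

The same function could be applied to the other countries once their top-five lists are filled in from the corresponding tables.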
6 discussion
In this section, the results from Section 5 are discussed and compared with
the results from previous literature. Possible explanations based on
the environment of each country are also discussed. The limitations of this
study are stated at the end.
This paper aims to analyze the research question "What are the differences in
Airbnb rental price determinants between Germany, the United States, Australia,
and China?". To answer this main question, three sub-research questions
RQ1 Which of the selected Machine Learning algorithms is better for predict-
ing Airbnb rental price?
RQ2 For both the baseline and best performing models, what are the differ-
ences in feature importance and rank-ordering of the strongest Airbnb rental price
determinants between Germany, the United States, Australia, and China?
RQ3 What are the relative contributions of the country, and the city within
the country, to the importance of Airbnb rental price determinants in Germany,
the United States, Australia, and China?
Both the Linear Regression model and the Extra Trees model show that
Australia is more consistent with its cities in Airbnb rental price
determinants than the other countries. Furthermore, the capacity of
accommodates and room type are two crucial determinants of Airbnb rental
price in all countries and cities regardless of the model fitted. The
relative contributions of the country, and the cities within the country,
to the other three Airbnb price determinants depend on the model
implemented. The differences between the country and its cities are smaller
for the Linear Regression model than for the Extra Trees model. Munich and
Beijing each have only one Airbnb rental price determinant that differs
from their countries, while Berlin and Hong Kong have the same determinants
as their countries. However, the differences in Airbnb price determinants
between the US and its cities are larger because they only have three
determinants in common. As for the Extra Trees model, all countries and
their cities only have two Airbnb rental price determinants in common,
except for Australia and its cities. These findings suggest that
Airbnb rental price prediction based on specific cities is not a replacement
for prediction based on countries. Therefore, considering the country
as a whole for Airbnb rental price is necessary, especially from the
perspective of the Airbnb platform operators.
studies, such as Gibbs et al. (2018) and Wang and Nicolau (2017).
Listing amenities are also among the top five Airbnb price determinants
and have a positive effect, especially for listings in the United States.
The amenities include hot tub sauna or pool, gym, and TV. This finding is
supported by Gibbs et al. (2018), who indicate that listing amenities have
a positive influence on Airbnb price, with positive coefficients of 0.051 for
the pool amenity and 0.243 for the gym amenity. However, the positive
influence of TV in this study is in contrast to the conclusion from Dudás et
al. (2020), who claim that the TV negatively affects Airbnb price.
The host response time, which describes a characteristic of the host,
counts as an Airbnb price determinant at the country level. The
importance of host response time has been addressed by Chen and Xie
(2017), who suggest that a higher Airbnb price is related to a longer
response time. They also conclude that host acceptance rate is not a
significant determinant of price, which is consistent with the results
obtained in all countries. Moreover, host profile pictures and host
verification methods are significantly important in China, which is
supported by Ert et al. (2016), who indicate the positive influence of the
host profile on Airbnb rental price due to trust building. The host listings
count appears to be an Airbnb price determinant when comparing Hong Kong
with other cities, which is consistent with the conclusion from Cai et al.
(2019). They claim the host listings count has a negative influence on
Airbnb price. Wang and Nicolau (2017) also support this finding by showing
the statistical significance of the host listings count for Airbnb price.
According to the results from the Linear Regression model, review
scores show their importance in Germany and Australia, especially the
review scores for value, cleanliness and location. The analysis from
Chen and Xie (2017) supports this finding because they indicate that some
sub-categories of the review scores from customers have a statistically
significant influence on the Airbnb price. For instance, the score measuring
the value for money has the largest influence on Airbnb price, while the
score regarding listing location has a positive effect (Chen & Xie, 2017),
which is consistent with the results obtained in this study. Additionally,
this finding is also supported by Kakar et al. (2016), who indicate that user
reviews positively affect the Airbnb price.
Australia, and China. This finding is supported by Wang and Nicolau (2017),
who show that a larger distance from the listing to the city center implies
a lower Airbnb price. On the other hand, this finding is in contrast to the
conclusion from Masrom et al. (2022), who suggest non-significant impacts
of the distance to city center on the listing price. Moreover, an important
price determinant at the city level is the instant bookable feature of
listings, which is shown and supported by the studies of Gibbs et al. (2018)
and Wang and Nicolau (2017).
There are three main new findings from this thesis that differ from
or are not analyzed in previous studies. First of all, the minimum nights of
stay is a crucial Airbnb rental price determinant for several cities across the
analyzed countries, such as Berlin, Los Angeles, and Hong Kong. However,
this determinant is not considered or analyzed in previous studies about
Airbnb rental price. Secondly, cooking basic amenities and outdoor space
are identified as important Airbnb price determinants in Munich and New
York respectively, which is not indicated by existing literature. Last but not
least, review scores are not among the top five Airbnb price determinants
for any country based on the Extra Trees algorithm in this study, which
is contrary to existing literature. A possible reason is that the
review scores are divided into seven sub-categories, and each of them
is treated separately. This results in a smaller influence per sub-category
than other categories on the Airbnb rental price.
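The dilution argument can be illustrated by summing impurity-based importances over a group of related features. The sketch below is built entirely on synthetic data: one strong feature (labelled `accommodates` for readability) and seven weaker features with the hypothetical names `review_score_1` to `review_score_7`. It is an illustration of the mechanism, not a re-analysis of the thesis data.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
names = ["accommodates"] + [f"review_score_{i}" for i in range(1, 8)]

# Synthetic target: one strong driver plus seven weaker, equally sized ones.
X = rng.normal(size=(2000, len(names)))
y = X[:, 0] + 0.5 * X[:, 1:].sum(axis=1) + rng.normal(scale=0.1, size=2000)

model = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = dict(zip(names, model.feature_importances_))

# Importance of the review group as a whole versus its largest member.
review_values = [v for k, v in importances.items()
                 if k.startswith("review_score_")]
review_total = sum(review_values)
largest_single_review = max(review_values)
```

Comparing `review_total` with `importances["accommodates"]` and with `largest_single_review` shows how a group can matter collectively while no single member reaches the top of a per-feature ranking.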
In China, not only guests but also hosts are required to register themselves
in the system. Travelers also pay special attention to the identity of hosts
to ensure their security. Therefore, the host verification method has more
influence on the Airbnb rental price than in other countries. Furthermore,
Airbnb is more often considered by Chinese travelers who stay with more
than three family members or friends. The time
they spend in the listing is limited during their stay. Thus, the capacity of
accommodates, number of bathrooms, and room type are considered more
important than other amenity features, such as the hot tub sauna or pool
and gym. The distance to city center is important for customers in China
due to the size of the city.
6.4 Limitations
There are several limitations to this study. First of all, the price
in the datasets retrieved from InsideAirbnb.com is the advertised price
shown on the platform, which means the actual rental price paid by
guests might differ. Secondly, there are still some correlated features left
after pre-processing. Nevertheless, the Extra Trees algorithm selected as
the best-performing model is robust to correlation, so the interpretation
of the best model and its results should not be influenced. Moreover, the
name of listings is a variable included in the initial datasets. During
the first attempt at running the model, words from names are tokenized for
Germany, the United States, and Australia. The results show that the
name does not increase the variation explained by the model. However,
tokenization focuses on the frequency of words and there are different
languages in each dataset, which might hide the real influence of the
name variable. Thirdly, only two cities are chosen for each country and
the datasets combined from cities are not completely balanced. Although
sampling with replacement is applied to Munich to balance the Germany
dataset, the prediction results for Munich become markedly better than
those of the other cities. Therefore, bias might be introduced in the final
results. Additionally, only R squared score and MSE are considered for
model evaluation based on existing literature; more evaluation methods
could be applied for a better overview of model performance.
Last but not least, hyperparameter tuning is only applied to the
best-performing model after model selection. As a result, some models may
perform worse than they would with hyperparameter tuning.
7 conclusion
This study investigates the key Airbnb price determinants and answers
the main research question, "What are the differences in Airbnb rental price
determinants between Germany, the United States, Australia, and China?".
To answer this question, eight datasets are extracted for eight cities and
combined into four datasets for Germany, the United States, Australia,
and China. The study has two stages: predicting the
Airbnb rental price and conducting feature importance analysis for each
dataset. In the first stage, multiple models are applied, including
Linear Regression, XGB Regression, Support Vector Regression, Gradient
Boosting Regression, and Extra Trees. Linear Regression is used as a
baseline, while Extra Trees is selected as the best performing algorithm
based on R squared score and MSE. During the second stage of this study,
feature importance analysis is conducted for the Linear Regression model
and the Extra Trees model. Both models suggest that the top five Airbnb
rental price determinants differ between countries. The Linear Regression
model reveals the most important Airbnb price determinants, including the
capacity of accommodates, host response time, room type, neighbourhood,
review scores, listing amenities (hot tub sauna or pool, gym), and host
verification method. The Airbnb price determinants are the same in Germany
and Australia but different in the United States and China. The
Extra Trees model shows that the most crucial price determinants
are the capacity of accommodates, room type, neighbourhood, number of
bathrooms, distance to city center, and listing amenity (TV). No two
countries have exactly the same top five Airbnb price determinants, although
all countries have some price determinants in common.
Extra Trees algorithm is also the least among all algorithms applied in this
study.
references
Abrate, G., Capriello, A., & Fraquelli, G. (2011, 08). When quality signals
talk: Evidence from the turin hotel industry. Tourism Management, 32,
912-921.
Airbnb. (2021). About us. Retrieved from https://news.airbnb.com/
about-us/
Inside Airbnb. (2021, 12). Retrieved from http://insideairbnb.com/
Basak, D., Pal, S., & Patranabis, D. (2007, 11). Support vector regression.
Neural Information Processing – Letters and Reviews, 11.
Bentéjac, C., Csörgő, A., & Martínez-Muñoz, G. (2019, 11). A comparative
analysis of xgboost.
Cai, Y., Zhou, Y., Ma, J., & Scott, N. (2019, 05). Price determinants of airbnb
listings: Evidence from hong kong. Tourism Analysis, 24, 227-242.
Carrillo, G. (2019, 10). Exploration of edinburgh’s short rental market.
Retrieved from https://github.com/gracecarrillo/Predicting
-Airbnb-prices-with-machine-learning-and-location-data/
blob/gh-pages/Exploring_Edinburgh_Graciela_Carrillo.ipynb
Chang, C., & Li, S. (2020, 12). Study of price determinants of sharing
economy-based accommodation services: Evidence from airbnb.com.
Journal of Theoretical and Applied Electronic Commerce Research, 16, 584-
601.
Chen, Y., & Xie, K. (2017). Consumer valuation of airbnb listings: A hedo-
nic pricing approach. International journal of contemporary hospitality
management.
Demir, S., & Sahin, E. (2021, 09). Assessment of feature selection for liquefac-
tion prediction based on recursive feature elimination. European Jour-
nal of Science and Technology, 28, 290-294. doi: 10.31590/ejosat.998033)
Dudás, G., Kovalcsik, T., Vida, G., Boros, L., & Nagy, G. (2020, 05). Price
determinants of airbnb listing prices in lake balaton touristic region,
hungary. European Journal of Tourism Research, 24(10), 1-18.
Ert, E., Fleischer, A., & Magen, N. (2016). Trust and reputation in the
sharing economy: The role of personal photos in airbnb. Tourism
Management, 55, 62-73. Retrieved from https://www.sciencedirect
.com/science/article/pii/S0261517716300127 doi: https://doi
.org/10.1016/j.tourman.2016.01.013
Fang, B., Ye, Q., & Law, R. (2016, 03). Effect of sharing economy on tourism
industry employment. Annals of Tourism Research, 57, 264-267. doi:
10.1016/j.annals.2015.11.018
Geurts, P., Ernst, D., & Wehenkel, L. (2006, 04). Extremely randomized
trees. Machine Learning, 63, 3-42. doi: 10.1007/s10994-006-6226-1
Gibbs, C., Guttentag, D., Gretzel, U., Morton, J., & Goodwill, A. (2018).
appendix a
appendix b
Features                Descriptions
Host ID                 Airbnb’s unique identifier for the host.
Host acceptance rate    The rate at which a host accepts booking requests.
Host response time      The time a host takes to respond to booking requests.
Host is superhost       Whether the host is an experienced host who is enthusiastic about making stays memorable for guests.
Host listings count     The number of Airbnb listings a host has.
Host has profile pic    Whether the host has a profile picture.
Host identity verified  Whether the host identity is verified.
Host verification       The methods through which the host is verified (Facebook, email, phone, google, government id, identity manual, jumio, kba, manual offline/online, work email, photographer, reviews, selfie, zhima selfie).
Neighbourhood           The district where the Airbnb listing is located.
Room type               Entire home/apt, Private room, or Shared room.
Accommodates            The maximum capacity of the listing.
Bathrooms               The number of bathrooms in the listing.
Amenity                 The amenities offered in the listing (24h check-in, air conditioning, high-end electronics, BBQ, balcony, nature and views, bed linen, breakfast, TV, coffee machine, cooking basics, white goods, elevator, gym, child friendly, parking, outdoor space, host greeting, hot tub sauna or pool, internet, long-term stays, pets allowed, private entrance, secure, self-check-in, smoking allowed, accessible, event suitable).
Minimum nights          Minimum number of nights per stay for the listing.
Maximum nights          Maximum number of nights per stay for the listing.
Has availability        Whether the listing has availability.
Number of reviews       The number of reviews the listing has.
Review scores           The review scores given by guests (rating, accuracy, cleanliness, check-in, communication, location, value).
Instant bookable        Whether the guest can automatically book the listing without the host needing to accept the booking request.
Host duration           The number of days the person has been a host.
appendix c
appendix d
Models                        Countries             R2 (Train)  R2 (Test)  MSE (Test)
Linear Regression             Berlin, Germany       0.573       0.572      0.180
Linear Regression             Munich, Germany       0.449       0.459      0.230
Linear Regression             Los Angeles, US       0.628       0.623      0.253
Linear Regression             New York, US          0.610       0.587      0.211
Linear Regression             Sydney, Australia     0.692       0.676      0.213
Linear Regression             Melbourne, Australia  0.611       0.550      0.245
Linear Regression             Beijing, China        0.726       0.713      0.306
Linear Regression             Hong Kong, China      0.467       0.328      0.511
XGB Regression                Berlin, Germany       0.655       0.610      0.164
XGB Regression                Munich, Germany       0.624       0.601      0.169
XGB Regression                Los Angeles, US       0.734       0.713      0.193
XGB Regression                New York, US          0.700       0.657      0.175
XGB Regression                Sydney, Australia     0.746       0.719      0.185
XGB Regression                Melbourne, Australia  0.691       0.604      0.215
XGB Regression                Beijing, China        0.817       0.759      0.256
XGB Regression                Hong Kong, China      0.711       0.479      0.396
Support Vector Regression     Berlin, Germany       0.035       0.017      0.413
Support Vector Regression     Munich, Germany       -0.008      -0.019     0.432
Support Vector Regression     Los Angeles, US       -0.004      -0.007     0.676
Support Vector Regression     New York, US          0.010       0.010      0.506
Support Vector Regression     Sydney, Australia     0.014       0.005      0.656
Support Vector Regression     Melbourne, Australia  0.004       0.002      0.542
Support Vector Regression     Beijing, China        0.004       0.004      1.058
Support Vector Regression     Hong Kong, China      -0.022      -0.006     0.764
Gradient Boosting Regression  Berlin, Germany       0.660       0.610      0.164
Gradient Boosting Regression  Munich, Germany       0.628       0.608      0.166
Gradient Boosting Regression  Los Angeles, US       0.734       0.717      0.191
Gradient Boosting Regression  New York, US          0.701       0.656      0.176
Gradient Boosting Regression  Sydney, Australia     0.749       0.717      0.186
Gradient Boosting Regression  Melbourne, Australia  0.697       0.602      0.216
Gradient Boosting Regression  Beijing, China        0.824       0.763      0.254
Gradient Boosting Regression  Hong Kong, China      0.735       0.460      0.409
Extra Trees                   Berlin, Germany       1.0         0.582      0.176
Extra Trees                   Munich, Germany       1.0         0.965      0.014
Extra Trees                   Los Angeles, US       1.0         0.751      0.168
Extra Trees                   New York, US          0.999       0.689      0.160
Extra Trees                   Sydney, Australia     1.0         0.717      0.190
Extra Trees                   Melbourne, Australia  0.999       0.604      0.216
Extra Trees                   Beijing, China        0.999       0.795      0.221
Extra Trees                   Hong Kong, China      0.999       0.535      0.347
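For reference, the two metrics reported in the table above can be computed as in the following minimal sketch. It uses the standard definitions of MSE and the R2 score in plain Python. The function names and the target values are hypothetical and chosen only for illustration; they are not the data or the implementation used in this study.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical target values and predictions (illustration only).
y_true = [4.0, 5.0, 6.0, 7.0]
y_pred = [4.5, 5.0, 5.5, 7.0]

print(mse(y_true, y_pred))
print(r2_score(y_true, y_pred))

# A model that always predicts the mean of the targets has an R2 of zero.
mean_only = [sum(y_true) / len(y_true)] * len(y_true)
print(r2_score(y_true, mean_only))
```

A model that always predicts the mean of the targets scores an R2 of 0, and a model that does worse than that scores below 0, which is how negative values such as those in the Support Vector Regression rows can arise.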
appendix e