Download as pdf or txt
Download as pdf or txt
You are on page 1of 61

Rochester Institute of Technology

RIT Scholar Works

Theses

1-2022

Prediction of Dubai Apartment Prices Using Machine Learning


Ali Alshamsi
ama8663@rit.edu

Follow this and additional works at: https://scholarworks.rit.edu/theses

Recommended Citation
Alshamsi, Ali, "Prediction of Dubai Apartment Prices Using Machine Learning" (2022). Thesis. Rochester
Institute of Technology. Accessed from

This Master's Project is brought to you for free and open access by RIT Scholar Works. It has been accepted for
inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact
ritscholarworks@rit.edu.
RIT

Prediction of Dubai Apartment Prices Using


Machine Learning

by

Ali Alshamsi

A Capstone Submitted in Partial Fulfilment of the Requirements for

the Degree of Master of Science in Professional Studies:

Data Analytics

Department of Graduate Programs & Research

Rochester Institute of Technology

RIT Dubai

Jan 2022

i
RIT

Master of Science in Professional Studies:

Data Analytics

Graduate Capstone Approval

Student Name: Ali Mohammed Alshamsi

Graduate Capstone Title: Prediction Of Dubai Apartment Prices Using Machine

Learning

Graduate Capstone Committee:

Name: Dr. Sanjay Modak Date:

Chair of committee

Name: Dr. Ehsan Warriach Date:

Member of committee

ii
Acknowledgments
To begin with, I would like to thank my parents for their belief and support in order for my

successful completion of this program. Their constant encouragement and advice helped me to

think outside the box and better understand my goals for success.

I would also like to thank Dr. Sanjay and his department for introducing this program in this

region. It opens doors to several opportunities considering the growing demand in Data Science.

Along with the department, I would also like to extend my gratitude to RIT staff members, who

helped make a seamless shift from regular classes to online sessions during such unprecedented

times.

I extend my heartfelt thanks to Dr. Ehsan for his persistent advice and guidance throughout my

graduate study, especially as a mentor for my successful completion of this capstone. Apart from

his mentorship, he has been a great guide in know more about the field of Data Analytics and

Business Intelligence.

iii
Abstract
The Real Estate Industry has been facing an unprecedented fluctuation in prices in the recent

past. The most alarming of all is the drastic fall in the prices of Apartments in Arab regions from

Dubai and Abu Dhabi in the past year. In comparison to the rest of the cities in the world, Knight

Frank Global Residential Cities Index shows a drastic drop in the prices of apartments by about

6.6 percent in Dubai and 8.3 percent in the recent 12 months of business (Arabianbusiness.Com,

2021).

In this regard, this study aims at developing a machine learning model able to predict the future

prices of Apartments in Dubai. By incorporating the essential variables, the model should give

the expected valuations and prices of apartments in the region and the real estate industry.

Analysis of existing data is crucial in determining the success of the model. Gupta et al. (2021)

assert that data analysis techniques such as machine learning models and artificial intelligence

play an important role in developing the most appropriate model to give reliable predictions. The

model is helpful in both the prospective Real Estate developers, and clients determining the most

appropriate time to invest in the industry or instead do business.

Key words: Real Estate Industry, Apartment price fluctuation, machine learning model.

iv
TABLE OF CONTENTS
Acknowledgments.......................................................................................................................... iii

Abstract .......................................................................................................................................... iv

TABLE OF CONTENTS ................................................................................................................ v

CHAPTER 1: INTRODUCTION ................................................................................................... 1

1.1 Background ........................................................................................................................... 1

1.3 Problem Statement ................................................................................................................ 2

1.4 Research questions ................................................................................................................ 2

1.5 Scope of Study ...................................................................................................................... 2

CHAPTER 2: LITERATURE REVIEW ........................................................................................ 4

CHAPTER 3: METHODOLOGY .................................................................................................. 8

3.1 Data collection....................................................................................................................... 8

3.2 Data Cleaning and Processing ............................................................................................... 8

3.3 Methods of data analysis ....................................................................................................... 9

3.3.1. Decision Tree ................................................................................................................. 9

3.3.2 Random Forest .............................................................................................................. 10

3.3.3 K Nearest Neighbours .................................................................................................. 10

3.3.4 Linear regression .......................................................................................................... 10

3.3.5 Support vector regression ............................................................................................. 11

3.3.6 Boosted Regression model ........................................................................................... 11

v
3.4 Model Tuning ...................................................................................................................... 11

3.5 Cross-validation .................................................................................................................. 12

CHAPTER 4: ANALYSIS AND RESULTS ............................................................................... 13

4.1 Data source .......................................................................................................................... 13

4.2 Data partitioning .................................................................................................................. 15

4.3 Exploratory Data Analysis (EDA) ...................................................................................... 15

4.3.1 Descriptive statistics for continuous variables ............................................................. 15

4.3.2 Distribution of continuous variables............................................................................. 16

4.3.3 Correlation Between Variables ..................................................................................... 17

4.3.4 Descriptive statistics for price per square feet broken by categories ........................... 18

4.3.5 Spatial distribution of apartment sizes.......................................................................... 21

4.3.6 Spatial distribution of apartment prices ........................................................................ 22

4.4 Modelling ............................................................................................................................ 24

4.4.1 Linear regression .......................................................................................................... 24

4.4.2 Decision Tree ................................................................................................................ 28

4.4.3 Random Forest .............................................................................................................. 31

4.4.4 K- nn Model.................................................................................................................. 35

4.4.5 Support Vector Regression. .......................................................................................... 39

4.4.6 Boosted Regression Model ........................................................................................... 42

CHAPTER 5: SUMMARY, DISCUSSION, AND RECOMMENDATIONS ............................. 46

vi
5.1 Introduction ......................................................................................................................... 46

5.2 Summary ............................................................................................................................. 46

5.3 Findings ............................................................................................................................... 47

5.4 Discussion ........................................................................................................................... 47

5.5 Conclusion........................................................................................................................... 48

5.6 Limitations and Recommendations for Future Study ......................................................... 48

REFERENCES ............................................................................................................................. 49

APPENDICES ...............................................................................Error! Bookmark not defined.

vii
CHAPTER 1: INTRODUCTION
1.1 Background

The Real Estate industry has been rapidly growing in recent. According to Avtar et al.

(2019), the major contributing factors to the growth of the Real Estate industry include the ever-

increasing population of the world, urbanization, and the people's lifestyle changes with time.

The rapid growth has come in handy to meet the demands of the market as well as meet the

needs of the ever-increasing city population.

Muldoon-Smith & Greenhalgh (2019) believes that the geographical location of the Real

Estate project highly influences its performance. Those cities within the tropics have a fairly

constant value in the prices of apartments. The cities in the tropics experiencing the four seasons

of the year have a highly volatile price variation depending on the time and season of the year.

This significant change is mainly attributed to the significant variations in human habits and

activities over the year's seasons. According to (Core, 2017), Apartment sales peak during the

month of March, June, and December. These are the end months of Q1, Q2, and Q4, which

makes Q3 more favorable for house purchases. The strategic location of the estate determines the

expected value and consequently the average price fluctuations of the apartments. Wang et al.

(2020) assert that the economic growth rate of a city also significantly affects the average price

fluctuations of the apartments in the industry. Besides, old valuation methods are becoming

irrelevant in the modern real estate market practice (AlHathboor, 2020). Well-developed cities,

including Dubai, being the epitome of economic growth, have no significant changes in the

valuations of apartments, though prices are relatively higher than in the developing cities.

1
1.3 Problem Statement

The real estate industry is among the fast-growing financial industries topping the list.

Though the growth rates vary depending on the city's economy, location, and population, the

difference is insignificant. However, the prices and valuations of apartments have significant

unprecedented changes over time. Both the investors and the users of the apartments and the

entire real estate industry face a significant challenge of price fluctuations depending on seasons,

location, economy, and lifestyle. There is a need to develop a reliable predictive model to give an

expected price of an apartment soon. Having a predetermined future help developer plan their

future investment trends per demand and economic viability.

1.4 Research questions

i. What are the factors contributing to the price variations of apartments in Dubai?

ii. What are the trends in apartment prices that can be used in developing a price

prediction model for the Real Estate Industry?

1.5 Scope of Study

This work focuses on the trends and volatility of apartment prices in Dubai City. The

study is limited to analysis of the Real Estate industry in Dubai using the available data.

Considering the global factors affecting the prices of apartments, such as the strength of the US

Dollar, mortgage laws and regulations and market supply, and expected future prices can be

determined using previous trends.

This proposed project aims to come up with s machine learning model using the available

data on the trends of prices of apartments in Dubai compared to the rest of the cities in the world.

The underlying economic factors specific to Dubai are the essential considerations in developing

an accurate predictive model. The model's predicted price is highly dependent on crucial factors,

2
including market demand, location of the apartments, strategic location, and proximity to

essential amenities. Other factors such as soil fertility and the face of the apartments towards the

sea or the city also contribute to the final predicted price. In regard to the limitations of the

available data, the cross-validation method comes in handy in developing the preferred model.

3
CHAPTER 2: LITERATURE REVIEW
Recent advancements in technology have played an essential role in developing the Real

Estate Industry by enhancing the ease of access, mode of acquisition, and the way we conduct

transactions in the sector (Wouda & Opdenakker, 2019; Xu & Zhang, 2022). Yas et al. (2020), el

Khatib & Alzouebi (2021), and Ying (2021) found improved tourism in the United Arab

Emirates region and her cities, with Abu Dhabi and Dubai taking the lead in the industry.

Infrastructural growth in the cities as mentioned earlier has spurred rapid growth in the Real

Estate Industry over time. As a result, more facilities have been made available for the market,

making it easy for individuals, investors, and prospective buyers to conduct business in the

industry (Rao, 2019; Khalaf et al., 2020). Well-developed housing plan, including the optimal

locations of the facilities, an economic shift that emphasizes the development of the industry, and

good government policies highly influence the changes in apartment prices (Krishna, 2021;

Kamei et al., 2021).

The instability and volatility of apartment prices in Dubai, as evidenced in the recent

trends, is an unprecedented occurrence getting most investors and the entire market unaware

(Humphrey, 2020). According to Global Property Guide (2021), the residential property price

index (RPPI) experienced a steady decline of about 2.1 percent as of May 2021. The previous

year experienced the most outstanding price decline, the lowest depression since the last fall out

in 2017. With inflation adjustment factored in, the apartment prices in Dubai had a 3.8 percent

price drop with a slight improvement in prices during the second half of the year 2021 as from

May. The increase experienced was about 0.6 percent of the initial price in the market at the

time. In the second quarter of the year 2021, the demand for apartments in Dubai rose steeply,

recording the highest number of units demanded more than eight years ago (Global Property

4
Guide, 2021). This increase was a cushion to the investors despite the adverse effects of the

Corona Virus Pandemic that hit the region hard.

Figure 1:Apartment Price Trend in Dubai

Market analysis of the Real estate industry showed that most of the Dubai Real Estate

market property buyers were outsiders. Together with the Russians, Europeans accounted for

about 20 percent of all the sales made. Nationals of the Arab region, including the United Arab

Emirates, accounted for 28%, while Iran accounted for 12 percent of all the total apartments sold.

Asian nationals and those of Asian origin were the most buyers in the market, contributing to

more than 40 percent of the total available units (Global Property Guide, 2021). Hoang et al.

(2021) and Rizvi et al. (2020) explained that during the market depression and hard times due to

the impact of the Corona Virus Pandemic, most investors put on hold the ongoing mega projects

in the region. However, the development projects resumed when the economy recovered and

grew in 2021. Nazemi & Rafiean (2021) illustrates that designing and developing an artificial

intelligence model that can predict the factor influencing real estate prices in Dubai can be

meaningful for the industry players.

5
Other factors influencing price fluctuations in the global and local markets of Dubai are

government taxes on properties (Alawadi et al. 2018; Katranis, 2021). In 2018, as a matter of

consideration, a 5 percent value-added tax was implemented in the region. These additional

government charges are applied to units sold after three years of project completion. However,

those units of projects sold within the first three months of completion were exempted from the

additional taxation. Due to the immense increase in the volume of transactions in the Dubai

property market to a tune of 74 percent raise, as of May 2021, investors have been forced to

make a turnaround in the mode of conducting business. Secondary off-plan sale has been the new

normal even when the units in business are still under construction. Due to this demand, there

has been an increase in the price per unit by about 6.6 percent in the market from 2019.

Rental rates in Dubai have shown a lot of inconsistencies in the price fluctuations for a

more extended period (Lee et al. 2019; Baba et al. 2020. According to the research from

Oxborrow (2021), tenants have paid even half their annual rent in some periods during the price

fluctuations. Social and economic functions also have a great impact on the direction of price

change in rental apartments. The market prices also consistently rose when Dubai hosted the

Expo 2020 before declining due to the low oil prices and the rising strength of the US Dollar (el

Amrousi et al, 2019). As the market became surplus and the demand declined, the market

registered a steady fall in prices per unit.

Therefore, it is imperative to note that technological advances are critical as far as the

development of the Real Estate Industry is concerned (Bishr, 2019; Lambourne, 2021). It has

resulted in new transaction methods in the sector, thus promoting the development in the Real

Estate Industry. On the other hand, the instability and volatility of the apartment prices have

greatly influenced the sector, thus, causing considerable losses to the investors

6
(Dufitinema,2021). Additionally, the government taxes dramatically influence the global and

local market in the Real Estate Industry, causing fluctuations. Developing a predictive artificial

intelligence approach is a viable alternative in modern society and can aid real estate investment

decisions in the Dubai market (Aljohani, 2021).

7
CHAPTER 3: METHODOLOGY
This project proposal intends to utilize emerging technologies such as machine learning

techniques to develop a predictive price model to predetermine the future price trends of

apartments in Dubai. The required data are residential and real estate data of apartments and

residents in Dubai and its environs.

3.1 Data collection

With the help of an online database, the data on relevant location, social and economic factors

affecting prices, and the impact of supply and demand can be obtained from the Kaggle website

for analysis. These include data of real estates located in Dubai downtown and Dubai Marina

areas where the ecological conditions are more conducive for life.

The study will obtain data regarding the real estate industries from the Kaggle website for

analysis. The data will purely comprise information on Real Estate in Dubai downtown as well

as Dubai Marina regions.

3.2 Data Cleaning and Processing

The data collected may contain abnormalities or inappropriate data entries and therefore

need to be fixed; this is achieved through the process of data cleaning. Screening is done to

ensure that the data is consistent with the trends in the industry to achieve reliable results after

analysis. The data is checked for accuracy through various dimensions. The first dimension is

accuracy which determines whether the information constitutes the real situation and identifies

any incorrect data to be fixed. The second dimension is completeness which assesses the

availability of all the requisite information to identify any missing element. The third dimension

is consistency which assesses whether the information is conflicting in any instance to avoid

inconsistency. The fourth dimension is timelines which determine whether the information is

8
available for use at the right time. Validity is another dimension of checking data quality. It

determines whether the information is aligned with the business rules or follows a particular

format. Lastly, uniqueness is a vital dimension that ensures the information does not match any

other work in the database.

The data obtained is separated from its identifiers to ensure anonymity from the

associated properties. Each property will be assigned an identifier different from the identity of

the property

3.3 Methods of data analysis

3.3.1. Decision Tree

This is the most used machine learning model when conducting regression or data

classification. In most cases, recursive tree partitioning, being the most popular, will be the most

helpful method to determine the patterns and the trends in the available data. Data grouping is

done during training and reduces the residual sum of squares. As a matter of great consideration,

the rules used to predict the data and develop the model are stored for use by future sets of data.

Control conditions called hyperparameters limit the number of tree patterns formed by the rules

to reduce complexity. These parameters include: -

i. the mini-splits- the minimum number of cases in a node before a split is initiated.

ii. The mini buckets- is the condition set such that whenever a node meets the mini-split

condition, the resulting node should not have cases more petite than the mini buckets.

iii. Maxdepth- is the maximum number of nodes allowable in the algorithm

iv. Parameter of complexity – this is the limiting condition that reduces the splits that do

not help in enhancing the general accuracy of the model.

9
3.3.2 Random Forest

This technique utilizes a system of decision trees called a forest. Through training, fitting

of decision trees on samples of bootstraps. These techniques ensemble both the observed data

and the attributes to develop a reliable predictive model. With this algorithm, data are connected

through links called trees. Each tree splits its links to a possible prediction from each of the

variables in the data. The average mean prediction from the data forms the final model prediction

from the data. This technique may either use the tree method, considering the number of trees

used, or the tree, which considers the specific number of features used in each tree.

3.3.3 K Nearest Neighbours

This technique of data analysis utilizes the property of the neighboring data points. The

technique calculates metric distances between the data points to determine the best node to

connect to. For every calculation and choice of the path taken, the k nearest neighbors' values are

taken, and the mean is calculated to give the overall predicted value.

3.3.4 Linear regression

This technique finds a best-fitting line for the available data points on a plot eventually

used for predicting output values absent in the available dataset. The technique is associated with

an assumption that the output would fit on the line. Factors such as data cleanliness and

consistency are the key determinants of performance. Additionally, the model performance or

generalizability can be enhanced. Nevertheless, each one presents unique pros and cons,

rendering the selection of application-dependent methods.

10
3.3.5 Support vector regression

SVR is one of the most popular techniques for classification in machine learning. The

Support Vector Regression is usually supervised and used in predicting discrete values. It utilizes

Support Vector Machine in handling classification issues in machine learning. However, the

utilization of SVMs in regression is not well documented since it recognizes the presence of non-

linearity within the available data, hence offering an adept prediction model. In other words, the

learning algorithm aims at fitting the best line in a threshold value instead of reducing the

inaccuracy between the predicted and the actual values like other Regression models.

3.3.6 Boosted Regression model

This model comprises two techniques, including boosting techniques and decision tree

algorithms. It fits several decision trees in enhancing the model accuracy. The two techniques

randomly adopt a subset of all data of newly built trees. Additionally, the random subsets entail a

similar number of data points chosen from the entire dataset. The model puts back the user data

in the complete dataset for selection in the subsequent trees. The Boosted Regression Tests

utilize boosting methods in weighing the input data. The validation error is determined by taking

the averaged mean squared error of the sample subsequent trees to enhance accuracy.

3.4 Model Tuning

The performance of the models developed above will be compared to determine the

model with the best performance characteristics as the most preferred model or rather the model

of choice. An iterative model helps determine the best model through the parameter condition of

the model each time an iteration is done; this is done until a model design with the slightest

11
deviation and error margin is minimized. Conducting the root mean square error gives the best

choice of the model to be selected.

3.5 Cross-validation

The model fitting is evaluated by testing its performance for new data, apart from the

sample, using the k-fold method of cross-validation. The data used to train the model is

randomly split into k equal samples. After training using all the k-1 data sets, the remaining

sample is used to test the model performance. The cross.

12
CHAPTER 4: ANALYSIS AND RESULTS
4.1 Data source

The study will use secondary data by the name "properties data.csv," retrieved from the

Kaggle repository. It consists of 38 features of 1905 Dubai apartments scrapped from different

real estate websites in 2020. There are no missing values on the dataset. Of the 38 features, 21

are numeric, 2 are factors, while 28 are binary (TRUE/FALSE) features denoting the

presence/absence of a feature, respectively. Fig 2 and fig 3 provide feature description and

feature overview, respectively.

13
Figure 2: Variable description

Figure 3:Data overview

14
4.2 Data partitioning

The dataset was split into 75% training set and 25% testing set to evaluate the models with

data they hadn't seen before. The training set will be used to train the models as well as to obtain

the summary statistics. This way, we will ensure that the models don't learn patterns on the

testing set. Similarly, obtaining summary statistics only from the training ensures that the

researcher doesn't get hints about patterns on the testing set; this is important because future

price patterns will always be uncertain. For this reason, information about patterns on the testing

set will only be admitted during model evaluation.

4.3 Exploratory Data Analysis (EDA)

In this part, we used Visualizations and descriptive data analysis to overview the distribution

of price, apartment features, and patterns of relationship between these features and price.

4.3.1 Descriptive statistics for continuous variables

For the sampled apartments, the number of bedrooms ranges between 1 and 6, with an

average house having 2.5 bathrooms. The median number is two bathrooms. On the other hand,

the number of bedrooms ranges between 0 and 5, with an average apartment having 1.78

bedrooms. The median number is 2 bedrooms. For sizes, the houses range between 306 and 8722

square feet; an average house has around 1409.821 square feet, and the median space is 1253.500

square feet.

The cheapest of the sampled properties costs around AED 220,000, while the most expensive

costs AED 34,340,000. The average price is AED 2,041,773, while the median price is

1,400,000. These prices don't consider the house sizes. The measurement "price per square feet"

helps figure out overpriced properties. The prices per square foot range between AED 375.68

15
and AED 4,561. An average house costs about AED 1323.059; the median price is AED

1140.605. figure 4 below reports these statistics.

Figure 4: Descriptive statistics

4.3.2 Distribution of continuous variables

Boxplots of the continuous variables show that raw prices and per foot are right-skewed

and have outliers on the upper tail; this means that most properties have less than the average

price, but the sample consists of some expensive properties which influence the average house

property. For this reason, the median price is a better estimate of typical values. A similar

distribution was observed for the number of bathrooms, bedrooms, and square feet. See figure 5

below.

16
Figure 5: Distribution of continuous variables

4.3.3 Correlation Between Variables

Correlation analysis shows that all the five variables are positively and significantly

correlated at 5%. Price is most correlated to size per square foot and least correlated to the

number of bathrooms. It means that the number of sizes determines the price more than the

number of bathrooms. Figure 6 below reports this information. Blank cells of the figure denote

insignificant correlations at a 0.05 level of significance.

17
Figure 6: Correlation analysis

4.3.4 Descriptive statistics for price per square feet broken by categories

Each categorical variable breaks down descriptive statistics of price per square feet to

explain how categorical features affect the price. Bar graphs of median prices per square feet

were done to visualize the relationships. Figure 5 below reports these statistics. The median price

per foot was higher on properties with concierge, kitchen appliances, private garden, private

gym, private jacuzzi, private pool, and water view compared to those without. For a shared gym,

the difference is too small to call. For the other features, the median prices become less if

included. see figure 7-9 below

18
Figure 7: Descriptive statistics for price broken by categorical variable
19
Figure 8: Median price per square feet by categorical variables
20
Figure 9: Extends figure 7

4.3.5 Spatial distribution of apartment sizes

For people buying a house to reside in, a significant concern is family size; one would want

to buy a house that can accommodate their family size. The map below visualizes house sizes in

square feet; bubble sizes and color scaling represent house sizes. A list of top and bottom

neighborhoods with big/small houses. The ranking is based on how many houses from the

neighborhoods fall in the top 25% in terms of size. We see that Palm Jumeirah, Downtown

Dubai, and Dubai marina have the biggest houses. Similarly, Jumeirah village circle, Downtown

Dubai, and Dubai marina have the highest number of small houses (based on the number of

houses in the bottom 25%).

21
Figure 10: Distribution of apartment sizes

4.3.6 Spatial distribution of apartment prices

To understand the distribution of the unit cost of apartments, the map below gives an

overview of where property prices per square foot are high or low. The bubble size and color

represent the price magnitude per square foot. A list of the most expensive and cheapest

neighborhoods was also overlayed. Similar to the previous case, the ranking depends on how

many houses from the neighborhood contribute to the top 25% of the expensive house and the

bottom 25% in terms of price. Downtown Dubai, Palm Jumeirah, and Dubai beach residents

have the highest numbers of houses in the top 25% of the prices. On the other hand, Jumeirah

village circle, Downtown Dubai, and Jumeirah Lake towers contribute the cheapest houses.

22
Figure 11:Distribution of apartment prices

A better view can be achieved by setting colors to represent quantiles. Places with a high

concentration of red color show where most houses are expensive, while a high concentration of

yellow represents the cheapest neighborhood.

23
4.4 Modelling

4.4.1 Linear regression

A linear regression model was used to estimate the underlying relationship between these

features and price per square foot. From figure 5, it can be seen that the price per square foot is

skewed to the right. We expect this to cause heteroskedasticity error variance. A logarithmic

transformation to the variable.

Residual analysis

Linear regression makes critical assumptions whose extreme assumptions are likely to affect

the estimated slopes or predicted values incorrectly. Consequently, conclusions made from such

results would be invalid. One of these assumptions is that a multivariate linear exists between the

response and the predictor variables. A residual versus fitted values plot shows a random spread

above and below the horizontal line; this means linearity is reasonably met. A normal quartile to

quartile plot, on the other hand, approximates a linear pattern, which means the assumption of

normality is met. A scale-location plot also shows a random spread above and below the

horizontal line indicating homoskedasticity of error variance. From a residual versus fitted plot,

it can be seen that even the outliers seen before are not large enough to leverage the regression

slope. See figures 11-12 for these results

24
Figure 12: Residual plots set 1

Figure 13:Residual plots set 2


Model summary

The expected number of variations in log prices the model can explain is 59.95%; this is a

significant amount (F(88,1341) =22.81,p<0.001), and at least one of the specified predictors has a

significant impact on price.

Figure 14: model summary

Figure 14 below reports the regression coefficients estimates significantly at the 5% level.

Because the price was in log format, estimates the last column contains the exponentiated

coefficients. For neighborhoods, Al Barari was used as the reference neighborhood. The general

rule for interpreting these estimates is;

- If the exponentiated estimate is greater than 1, subtract one from the exponentiated estimate

and multiply by 100 to convert the result to a percentage.

- If the exponentiated estimate is less than 1, subtract it from one from the exponentiated

estimate and multiply by 100 to convert the result to a percentage.

25
For instance, the estimate for Culture Village is significant and positive (B= 0.49,

p=0.046, exp(B) = 1.632); this means that while holding the effect of other features constant,

the average price per square feet is 100*(1.632-1) =63.2% times high in this neighborhood,

compared to Al Barari. On the other hand, Discovery Gardens boasts lower prices than Al

Barari. Prices there are, 100*(1- 0.419) =68.1% times less than Al Barari. Having these

differences in average prices provides evidence that geographical location in Dubai affects

apartment prices. For continuous features, converting the exponentiated coefficients as

explained above gives the expected percentage change in price per square foot for every unit

increase on the feature, assuming other factors stay constant. A full list of the estimates is

reported in fig 15 below.

26
Figure 15: Regression coefficients

Evaluation

27
The model was evaluated on the testing set to get an overview of how it would perform on

new data. The root means the squared error is 514.37, while the explained amount of variance(R-

squared) is 46.42%. This significant fall in r squared indicates that the model might be

overfitting the data

Figure 16:Evaluation metrics

Fig 17 below visualizes predicted values against the actual values. The graph is color scaled

to show the trend in variability. The red points at the upper extreme indicate that the model

prediction ability falls as the price per square foot increases.

Figure 17: Predicted versus fitted

4.4.2 Decision Tree

We kept in mind that the recursive partitioning algorithm is greedy with decision trees. If the

splitting is not controlled, the model might create unnecessary splits, which do not improve
28
prediction accuracy by any significant amount. Such a large tree will tend to overfit. Different

values of complexity parameters were tried to find tree sizes which led to higher prediction

accuracy (lowest RMSE).

In Figure 18 below, the left panel reports the results of the 20 tried values of cp and their

corresponding error measures; the right panel visualizes the change in RMSE as the complexity

parameter increases. The optimal cp is 0.0038402; as this value is increased (as we allow the tree

to grow deeper), the prediction error increases. This behavior indicates a possibility that the

model more extensive than this size is overfitting the data. See appendix 1 for the classification

tree and rules

29
Figure 18: DT Tuning results

Model Evaluation

On predicting the testing set, the prediction error (RMSE) is 459.07; the model was able to

explain 55.64% of the] variations in price per square feet; this is a better performance as

compared to the previously fitted linear regression model.

Figure 19: model performance

30
A visual representation of predicted values versus observed values shows that as houses

get more expensive, the model's ability to predict prices correctly begins to fall.

Figure 20: Graphing predicted variables against observed values

4.4.3 Random Forest

As mentioned previously, decision trees tend to overfit. Random forest tries to solve this

problem by fitting a forest (many decision trees) instead of a single tree. The algorithm does this

by drawing bootstrap samples from the training data and fitting a tree in each sample; this means

we have to decide how many trees to use and the number of variables randomly sampled as

candidates at each split (mtry). Choosing the number of trees is quite cumbersome because they

can range is too broad. The r default of 500 was used. For mtry it is feasible to tune the

31
parameter because we know the number of variables available. About 20 different values were

tried, which pointed to the optimal value to be 61 features. See figure 20 below.

32
Figure 21:model tuning results

33
Feature Importance

Geographical coordinates (latitude and longitude) happen to be among the most critical

variables in determining apartment prices; this means that location is critical in determining

prices. Similarly, some neighborhoods seem to appear among the top essential features. These

are among the neighborhoods previously ranked as expensive/ cheap neighborhoods. Apartment

size, number of bathrooms, number of bedrooms, and whether the apartment is networked also

appear among the top 10 essential features to be considered when estimating prices.

Figure 22: variable Importance

Model evaluation

The amount of error is less than the previous models for predicting the testing set (RMSE=

354.952). On the other hand, the model explained 73.12% of the variations in price per square

foot. See figure 23 below.

34
Figure 23: model evaluation

Figure 23 visualizes the relationship between the predicted and actual prices. The

improvement in prediction accuracy resulted from the model being able to predict prices of high

prices correctly. The previous model failed on that, but for this model, only a few red points can

be seen in figure 23.

Figure 24: Prediction vs. obs values

4.4.4 K- nn Model

In the previous chapter, we saw that K- nn algorithm assigns predicted values to a new case

based on the data point it resembles on the training set. For this case, the predicted price of an

apartment is the average of the apartments with similar features. The similarity is measured in

35
terms of Euclidian distance. The most critical decision is how many neighbors (similar

apartments) do we average? This hyperparameter is called k. around 20 different values of k

were tried to determine which value corresponds to the highest accuracy. The optimal k was

found to be 17 neighbors. See figure 25 below.

36
Figure 25: Tuning results

Model evaluation

37
On predicting the testing set, the error is higher than for the previously fitted models

(RMSE=608.50). The model explains only 21.83% of variations in price per square foot; this is

too low. See figure 26 below.

Figure 26:Accuracy metrics

By visualizing the predicted values against the actual values, it is clear that the model suffers

severely from an inability to classify expensive houses correctly (see the red points on fig 27)

hence the reason behind poor performance.

Figure 27: Observed vs. predicted values

38
4.4.5 Support Vector Regression.

With support vector regression, a range of hyperparameters needs to be chosen. Some of

them depend on the kernel type chosen. The first kernel type tried was a linear kernel. The main

parameter for this kernel is "cost," which specifies the extent to which we want to smooth the

data. We started with a coarse grid of costs between 0 and 1000 by steps of 100. C=100 had the

highest performance (Rsq =0.14). Next, the search was refined by trying a range between 50 and

150 by steps of 10, C=70 had the best performance (Rsq =0.35), Converging more on this range

by trying range 68 to 72 by steps of 2 also a finer tune by steps of 0.5 pointed C=70 as the best

value of C.

Model tuning results

650
RMSE(cross validation)

600

550
k=70
69.0 69.5 70.0 70.5 71.0
C
Figure 28: linear tuning results

For the Radial kernel, the default is to try C= 2x; with a tuned length of 20, we got results

for x =-2 to x=17 where x=17 had the highest accuracy; this shows the need to keep increasing x

39
(Rsq =0.19). The grid was made coarser because as x increases, computation becomes more

intensive. We tried x= 20,25,30, 35,40 and 50 were also tried. The best performance was at

x=20. Next was to try a fine search around the value 20, x=21 had is optimal. See figure 29

below

40
Model tuning results

660

RMSE(cross validation)
630

600

C=2097152
0e+00 1e+07 2e+07 3e+07
C

Figure 29: tuning results

41
Evaluation

The linear kernel model performed better (RMSE=590.60). The model was able to explain

30.73% of the testing set correctly. See fig 30 below

Figure 30: model performance

Graphing predicted versus observed values shows that the model is among the worst for

predicting houses with considerable prices. See the red values in fig 31 below.

2000

(obs - pred)^2
1500
1.0e+07

7.5e+06
pred

5.0e+06

1000 2.5e+06

500

1000 2000 3000 4000 5000


obs

Figure 31:Observed vs. predicted.

4.4.6 Boosted Regression Model

For all the previous models, the only random forest could predict large house prices with

high precision; the rest made huge errors when predicting expensive houses. Therefore, it is

42
expected that these models will perform poorly in neighborhoods where most houses are

expensive; this can be attributed to men factors, one being that the data is left-skewed, which

means there are significantly few expensive houses on the dataset as compared to less expensive

houses. For this reason, the models had not enough opportunity to learn the characteristics of

expensive houses; this brings in the need for a model which gives more weight to such cases.

The algorithms focus on compensating the previously made errors (boosting week learner).

Multiple hyperparameters were tuned. And the optimal combination was; nrounds = 250,

max_depth = 4, eta = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1 and

subsample = 1. From figure 29 below, this combination is in the last graph. The optimal values

correspond to Cross-validation error 338.67(RMSE), and R squared = 73.45%.

Max Tree Depth


1 4 7 10
2 5 8
3 6 9
100 200 300 400 500 100 200 300 400 500 100 200 300 400 500 100 200 300 400 500 100 200 300 400 500

colsample_bytree: colsample_bytree:
0.6 colsample_bytree:
0.6 colsample_bytree:
0.6 colsample_bytree:
0.6 colsample_bytree:
0.6 colsample_bytree:
0.6 colsample_bytree:
0.6 colsample_bytree:
0.6 colsample_bytree:
0.6 0.6
eta: 0.4 eta: 0.4 eta: 0.4 eta: 0.4 eta: 0.4 eta: 0.4 eta: 0.4 eta: 0.4 eta: 0.4 eta: 0.4
subsample: 0.5000subsample: 0.5556subsample: 0.6111subsample: 0.6667subsample: 0.7222subsample: 0.7778subsample: 0.8333subsample: 0.8889subsample: 0.9444subsample: 1.0000

450
RMSE (Cross-Validation)

400

350

colsample_bytree: colsample_bytree:
0.6 colsample_bytree:
0.6 colsample_bytree:
0.6 colsample_bytree:
0.6 colsample_bytree:
0.6 colsample_bytree:
0.6 colsample_bytree:
0.6 colsample_bytree:
0.6 colsample_bytree:
0.6 0.6
eta: 0.3 eta: 0.3 eta: 0.3 eta: 0.3 eta: 0.3 eta: 0.3 eta: 0.3 eta: 0.3 eta: 0.3 eta: 0.3
subsample: 0.5000subsample: 0.5556subsample: 0.6111subsample: 0.6667subsample: 0.7222subsample: 0.7778subsample: 0.8333subsample: 0.8889subsample: 0.9444subsample: 1.0000

450

400

350

100 200 300 400 500 100 200 300 400 500 100 200 300 400 500 100 200 300 400 500 100 200 300 400 500

# Boosting Iterations

Figure 32:Accuracy

43
Figure 33: Snapshot of the tuning results

Model Evaluation

On predicting the testing set, the model RMSE is 389.76, while the explained variation on

prices is 68.26%; this comes close to the random forest performance. But does not outdo it.

Figure 34: performance measures

Graphing the observed versus predicted values shows that the model has improved

significantly in predicting significant prices, although not as perfect as random forest. See

figure 35 below

44
4000

3000
(obs - pred)^2
8e+06
pred

6e+06
2000
4e+06

2e+06

1000

0
1000 2000 3000 4000 5000
obs

Figure 35: Observed vs. predicted

Summary

In this analysis, a linear regression model was used, with the primary purpose being to

estimate the relationship between candidate factors and price per square feet. On the other hand,

a decision tree, random forest, KNN, and SVM models were trained to predict price per square

fit with the highest precision possible. Lastly, a boosted regression model was used to improve

the prediction of expensive houses, which the rest of the models missed.

The performance of the models are as follows;

S/No Model Performance

1 Random Forest R squared =73.12%

2 Boosted Regression Trees R squared =68.26%

3 Decision Tree R squared =55.64%

4 Linear Regression R squared =46.42%

5 SVM R squared =30.73%

6 KNN model R squared =21.83%

45
CHAPTER 5: SUMMARY, DISCUSSION, AND
RECOMMENDATIONS
5.1 Introduction

This chapter entails the conclusions derived from apartments and residents' residential

and real estate data in Dubai and its environs. The conclusions were obtained from the research

questions, the purpose of the study, and the study findings. The chapter also sets forth the

implications of the study findings and concludes with the recommendations established and the

study's purpose, limitations, and areas for future study.

5.2 Summary

This qualitative research aimed to develop a machine learning model by using the

available data on the trends of prices of apartments in Dubai compared to the rest of the cities

globally. The study utilized secondary data called "properties data.csv," obtained from Kaggle

Repository. The data contained 38 features of 1905 Dubai apartments discarded from various

real estate websites in 2020. Their dataset had no missing values. The 38 features involved 21

numeric, 2 factors, and 28 binary features that denoted the feature presence/absence of a feature,

respectively.

The study exploited emerging technologies like machine learning techniques to develop a

predictive price model to predetermine the future price trends of apartments in Dubai. The data

was collected from an online database, the Kaggle website, for cleaning, processing, and taken

for analysis. The study adopted various data analysis methods, including Decision Tree, Random

Forest, K nearest neighbors, Linear regression, Support vector regression, and Boosted

Regression model to determine the best performance characteristics. This study adopted a cross-

validation method in developing the most preferred model

46
5.3 Findings

The study established that geographical coordinates such as latitudes and longitudes are

critical in determining apartment prices. Similarly, some neighborhoods seem to appear among

the top essential features, including those previously ranked as expensive/cheap neighborhoods.

According to the study, apartment size, number of bathrooms, number of bedrooms, and

apartment networking are also critical determinants of prices. Also, the study findings reveal that

the number of apartment sizes determines price more than the number of bathrooms. A

significant concern for people buying a house to reside in is family size; one would want to buy a

house that can accommodate their family size. Additionally, the median price per foot was higher

on properties with features such as concierge, kitchen appliances, private garden, private gym,

private jacuzzi, private pool, and water view compared to those without. The findings align with

the Global Property Guide (2021) findings which established that a well-developed housing plan

including the optimal locations of the facilities, an economic shift that emphasizes the

development of the industry, and good government policies highly influence the changes in

apartment prices.

5.4 Discussion

Geographical coordinates, including latitudes and longitudes, are the most significant

variables determining apartment prices. In other words, the location remains a critical aspect that

determines the house prices. In the same way, various neighborhoods are some of the crucial

features that were ranked as cheap or expensive. The number of bathrooms, apartment

networking, number of bedrooms, and apartment size is critical price estimation features.

47
5.5 Conclusion

The unprecedented price fluctuations in the Real Estate Industry, especially the drastic

price fall of apartment prices, especially in the Arab regions, require an effective price prediction

model to give the expected valuations and prices of apartments in the region and the Real Estate

Industry large. Data analysis techniques such as machine learning models and artificial

intelligence play an important role in developing the most appropriate model to give reliable

predictions. The model is helpful to both the prospective Real Estate developers, and clients to

determine the most appropriate time to invest in the industry or instead do business. Random

forest was the best price estimation model, according to the study, as far as the Real Estate

Industry is concerned.

5.6 Limitations and Recommendations for Future Study

• Future research study on the targeted research areas is critical in strengthening the

findings. Developing another research study using a different research methodology on

the future price trends of apartments is critical.

• Another research study with datasets from different countries across the world, coupled

with the findings of this research, potentially comparing the future price trends of

apartments may aid in achieving better results. In the same way, utilizing a broader

dataset may offer more significant insights into the future price trends of apartments in

the Real Estate Industry.

• The third quarter of the year is most favorable for a house purchase because generally,

house demand is low as well as prices.

48
REFERENCES

Alawadi, K., Khanal, A., & Almulla, A. (2018). Land, urban form, and politics: A study on

Dubai’s housing landscape and rental affordability. Cities, 81, 115–130.

https://doi.org/10.1016/j.cities.2018.04.001

AlHathboor, A. (2020, December 1). Price Prediction and Valuation Using Data Mining in

Dubai Real Estate Market. RIT Scholar Works.

https://scholarworks.rit.edu/theses/10694/

Aljohani, O. (2021). Developing a stable house price estimator using regression analysis. The 5th

International Conference on Future Networks & Distributed Systems.

https://doi.org/10.1145/3508072.3508091

Arabianbusiness.com. (2021). Arabian Business.

https://www.arabianbusiness.com/property/452753-why-property-prices-are-falling-

faster-in-the-uae-than-the-rest-of-the-world

Avtar, R., Tripathi, S., Aggarwal, A. K., & Kumar, P. (2019). Population–urbanization–energy

nexus: a review. Resources, 8(3), 136.

Baba, H., Nishi, H., Seetharamapura, A.M., & Shimizu, C. (2020). Dynamic Hedonic Analysis

Using Time-Varying Coefficients: Application to Dubaiʼs Housing Market.

Bishr, A. B. (2019). Dubai: A City Powered by Blockchain. Innovations: Technology,

Governance, Globalization, 12(3–4), 4–8. https://doi.org/10.1162/inov_a_00271

Dufitinema, J. (2021). Forecasting the Finnish house price returns and volatility: a comparison of

time series models. International Journal of Housing Markets and Analysis, 15(1), 165–

187. https://doi.org/10.1108/ijhma-12-2020-0145

49
el Amrousi, M., ElHakeem, M., Paleologos, E., & Misuri, M. A. (2019). Engineered Landscapes:

The New Dubai Canal and Emerging Public Spaces. International Review for Spatial

Planning and Sustainable Development, 7(3), 33–44.

https://doi.org/10.14246/irspsd.7.3_33

el Khatib, M. M., & Alzouebi, K. (2021). Collaborative Business Intelligence A Case Study of

the Dubai Smart City Strategy. International Journal of Innovative Technology and

Exploring Engineering, 10(4), 83–90. https://doi.org/10.35940/ijitee.d8496.0210421

Global Property Guide. (2021). UAE's housing market on the cusp of recovery?

https://www.globalpropertyguide.com/Middle-East/United-Arab-Emirates/Price-History

Gupta, R., Srivastava, D., Sahu, M., Tiwari, S., Ambasta, R. K., & Kumar, P. (2021). Artificial

intelligence to deep learning: machine intelligence approach for drug

discovery. Molecular diversity, 25(3), 1315-1360.

Hoang, A. T., Nižetić, S., Olcer, A. I., Ong, H. C., Chen, W. H., Chong, C. T., ... & Nguyen, X.

P. (2021). Impacts of COVID-19 pandemic on the global energy system and the shift

progress to renewable energy: Opportunities, challenges, and policy implications. Energy

Policy, 154, 112322.

Humphrey, C. (2020). Real estate speculation: volatile social forms at a global frontier of capital.

Economy and Society, 49(1), 116–140. https://doi.org/10.1080/03085147.2019.1690256

Kamei, M., Wangmo, T., Leibowicz, B. D., & Nishioka, S. (2021). Urbanization, carbon

neutrality, and Gross National Happiness: Sustainable development pathways for

Bhutan. Cities, 111, 102972.

Katranis, D. (2021). How Economic Conditions of a country impact its local tourism: Identifying

the economic factors influencing tourism sector.

50
Khalaf Ahmed Al Qatawneh, D., Azrin Adnan, A., A. A. AlZoughool, S., & Hussain Ahmed

Alqudah, T. (2020). A Study of the Leading Factors that Induced Confidence in the

Foreign Investors for Investing in Real Estate Business for the Period 1970- 2019.

Scholars Journal of Economics, Business and Management, 07(01), 39–43.

https://doi.org/10.36347/sjebm.2020.v07i01.007

Krishna, N. (2021, May 1). Dubai Real Estate Investment: A Predictive and Time Series

Analysis. RIT Scholar Works. https://scholarworks.rit.edu/theses/10943/

Lambourne, T. (2021). Valuing sustainability in real estate: a case study of the United Arab

Emirates. Journal of Property Investment & Finance. https://doi.org/10.1108/jpif-04-

2020-0040

Lee, K., Kim, H., & Shin, D. H. (2019). Forecasting Short-Term Housing Transaction Volumes

using Time-Series and Internet Search Queries. KSCE Journal of Civil Engineering,

23(6), 2409–2416. https://doi.org/10.1007/s12205-019-1926-9

Muldoon-Smith, K., & Greenhalgh, P. (2019). Suspect foundations: Developing an

understanding of climate-related stranded assets in the global real estate sector. Energy

Research & Social Science, 54, 60-67.

Nazemi, B., & Rafiean, M. (2021). Modelling the affecting factors of housing price using

GMDH-type artificial neural networks in Isfahan city of Iran. International Journal of

Housing Markets and Analysis, 15(1), 4–18. https://doi.org/10.1108/ijhma-08-2020-0095

Oxborrow, I. (2021, July 5). How Dubai property rental rates have changed over the past decade.

The National. https://www.thenationalnews.com/business/property/how-dubai-property-

rental-rates-have-changed-over-the-past-decade-1.983112

51
Rao, I. (2019). Competing values in Asian business: evidence from India and Dubai. Journal of

Asia Business Studies.

Rizvi, S., Rustum, R., Deepak, M., Wright, G. B., & Arthur, S. (2020). Identifying and analyzing

residential water demand profile; including the impact of COVID-19 and month of

Ramadan, for selected developments in Dubai, United Arab Emirates. Water Supply,

21(3), 1144–1156. https://doi.org/10.2166/ws.2020.319

Wang, J., Huang, X., Gong, Z., & Cao, K. (2020). Dynamic assessment of tourism carrying

capacity and its impacts on tourism economic growth in urban tourism destinations in

China. Journal of Destination Marketing & Management, 15, 100383.

Wouda, H. P., & Opdenakker, R. (2019). Blockchain technology in commercial real estate

transactions. Journal of property investment & Finance.

Xu, X., & Zhang, Y. (2022). Residential housing price index forecasting via neural networks.

Neural Computing and Applications. https://doi.org/10.1007/s00521-022-07309-y

Yas, H., Mardani, A., Albayati, Y. K., Lootah, S. E., & Streimikiene, D. (2020). The Positive

Role of the Tourism Industry for Dubai City in the United Arab Emirates. Contemporary

Economics, 14(4), 601-617.

Ying, D. (2021). Optimization of Tourism Real Estate Development Project Based on Option

Premium Model. Discrete Dynamics in Nature and Society, 2021, 1–8.

https://doi.org/10.1155/2021/2954795

52
53

You might also like