
Faculty of Economics

University of Ljubljana

Airbnb Apartments in
New York City
Applied Data Analytics

Professor: Ales Popovic

Made by: Soomin Kim

Burak Cakir

Raúl Díaz Burillo


Date: 24 / 05 / 2020
TABLE OF CONTENTS
1. INTRODUCTION
2. DATA PREPARATION
2.1. About Dataset
2.2. Data preparation and modeling for descriptive analysis in PowerBI
2.3. Data preparation for text mining
2.3.1. Data reduction
2.3.2. Data transformation
3. DESCRIPTIVE MODEL IN POWERBI
3.1. Results: features and patterns
4. DATA MINING BY RAPIDMINER
4.1. Process Documents
4.2. Results
4.2.1. Word Example
4.2.2. Process Documents
5. CONCLUSION
6. REFERENCES
1. INTRODUCTION
Our dataset covers the Airbnb apartments available in New York City, along with related
information such as the district, neighborhood, room type, availability, price and nº of reviews of
each apartment. In addition, the dataset contains the name of the apartment owner and the
description the owner wrote about the apartment on Airbnb. Every apartment is uniquely identified
by an ID.

This data allows us to extract relevant information about the characteristics of the apartments in
every New York City district, such as features related to prices, availability and reviews, along
with room types and descriptions.

After reading and thinking about the data we have, we decided to split our project into two
parts:

- On one hand, we perform a descriptive analysis aimed at prospective renters, with the goal
of giving them relevant information about the apartments located in each district of New
York City so that they can find the best option for their needs or desires within their
possibilities. This analysis is developed in PowerBI Desktop as a single dashboard where
the user has different tools to interact with the data.

- On the other hand, we perform a business predictive analysis aimed at Airbnb landlords, to
help them find the factors that can make a listing more attractive and the best option to
increase demand. This analysis is developed in RapidMiner with several operators.

To achieve these goals, we first had to follow a few steps to preprocess the data, since the dataset
we acquired was not completely clean. We then generated information using data visualization
and text mining, and tried to determine the best possible decisions for renters and landlords of
Airbnb based on the data.

2. DATA PREPARATION
As in most data analytics projects, understanding, cleaning and preparing the data is often the
bulk of the work. We therefore first cleaned the dataset, since it was not completely clean, and
then divided the data preparation into two parts, one for each analysis we wanted to perform.

2.1. About Dataset


The dataset we have acquired came from a CSV file and it describes the listing activity and metrics
in New York City, for 2019:

● “id” identifies each Airbnb listing with a unique number.
● “name” is the listing name that people who are looking to rent can see. The name
often also includes some features and a short description of the room.
● “host_id” and “host_name” show the id number and the name of the host.
● “neighborhood_group” shows the New York City district of the apartment and
“neighborhood” is the neighborhood where it is located.
● “latitude” and “longitude” determine the exact location of each apartment.
● “room_type” is the type of space offered: entire home/apt, private room or shared
room.
● “price” is the room price per night in $.
● “minimum_nights” shows the minimum number of nights that must be booked to
rent the apartment.
● “number_of_reviews” contains the nº of reviews that each apartment has on the
Airbnb website.
● “last_review” shows the date of the last review the apartment received.
● “reviews_per_month” gives, as a decimal number, the average number of reviews
each apartment receives per month.
● “calculated_host_listings_count” is the number of listings per host.
● “availability_365” is the number of days per year on which the listing is available
for booking.

As a first step, we read the data to understand what it contains and whether there were errors,
missing values or noise in it. To do so, we imported the CSV file into Excel and applied table
format to explore the dataset more easily.

The data was mostly clean, but in some cases we had to apply changes.

As we can see in picture 9 (in references), rows 2287, 2288 and 2289 were not properly filled in
or had missing values. This problem appeared several times, and we decided to remove the
affected rows, since we could not replace or fill in the wrong and missing values without knowing
their actual content.
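
The row removal described above was done by hand in Excel. As a rough illustration only, the same rule could be sketched in Python; the sample data, the `REQUIRED` column list and the function name `drop_incomplete_rows` are hypothetical and not part of the original workflow.

```python
import csv
import io

# Hypothetical sample standing in for the real CSV file (assumption:
# only a few of the report's columns are shown here).
SAMPLE_CSV = """id,name,price,room_type
1,Cozy room,80,Private room
2,,120,Entire home/apt
3,Bright loft,,Entire home/apt
4,Central studio,150,Entire home/apt
"""

# Columns that must be filled in for a row to be kept (illustrative choice).
REQUIRED = ("id", "name", "price", "room_type")


def drop_incomplete_rows(rows, required=REQUIRED):
    """Keep only rows where every required field is present and non-blank."""
    return [
        row for row in rows
        if all((row.get(col) or "").strip() for col in required)
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
clean_rows = drop_incomplete_rows(rows)
print(len(rows), "rows read,", len(clean_rows), "rows kept")
```

Dropping rows rather than imputing values matches the decision in the text: the missing content cannot be reconstructed, so the row is discarded.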

After handling these errors in the data, we also saw that the “latitude” and “longitude” columns
did not have the proper data type. This is corrected in the data preparation for the analysis
where these columns are used.

2.2. Data preparation and modeling for descriptive analysis in PowerBI


To get a proper data model for our descriptive analysis, we first needed to apply some
transformations to the data.

First of all, we deleted some columns we knew we were not going to use, such as “name”,
“host_name” and “host_id”, since we wanted to show data about prices, availability and reviews
by district, room type or price range. After that, we checked the data type of each column and
applied some modifications. For example, the “latitude” and “longitude” columns were recorded
as whole numbers when they should be decimal numbers, so we had to change the data type and
then apply math operations to get the appropriate decimal values.
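
The report does not state which math operations were applied. One plausible reading, sketched below in Python, is that the decimal separator was lost on import, so each whole number has to be divided by a power of ten until two digits remain before the decimal point (New York latitudes are around 40 and longitudes around -73 or -74). The function name `rescale_coordinate` and the sample values are hypothetical.

```python
def rescale_coordinate(raw, integer_digits=2):
    """Re-insert the decimal point after `integer_digits` leading digits.

    Assumption: the coordinate was stored as a whole number because the
    decimal separator was dropped, e.g. 40.64749 became 4064749.
    """
    n_digits = len(str(abs(int(raw))))
    if n_digits <= integer_digits:
        return float(raw)
    return raw / 10 ** (n_digits - integer_digits)


# Illustrative values only, not taken from the dataset.
print(rescale_coordinate(4064749))
print(rescale_coordinate(-7397237))
```

Counting digits instead of using a fixed divisor matters because values with fewer decimals (for example 40.7 stored as 407) would otherwise be scaled incorrectly.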

We also renamed some columns to make them more descriptive.

During the preparation process, we tried different visualization tools to decide which were best
suited to the analysis and what data we would need. One of the key visualizations was a map
showing the distribution of apartments, since we have location data (latitude and longitude).
However, this was not enough for an accurate map, because some neighborhood names also exist
in other places around the world (e.g. Chelsea is a neighborhood in Manhattan but also in
London). To solve this, we first created a new column called “City” with the value New York in
every row, and then created another column joining the “Neighborhood”, “District” and “City”
columns with a “,” between them. This way, PowerBI can determine that the neighborhoods are
all in New York (picture 10, in references).
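
The combined location column was built in PowerBI. As a minimal sketch of the same idea in Python, the function below joins the three hierarchy levels into one string; the function name `full_location` and the exact separator (a comma followed by a space) are assumptions.

```python
def full_location(neighborhood, district, city="New York"):
    """Join the hierarchy levels so a geocoder cannot confuse the
    neighborhood with a place of the same name elsewhere."""
    return ", ".join([neighborhood, district, city])


print(full_location("Chelsea", "Manhattan"))
```

With the district and city appended, "Chelsea" is no longer ambiguous between Manhattan and London.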

Another column we created is the percentage of availability, obtained by dividing
“availability_365” by 365 and formatting the result as a percentage. The purpose of this column
is to give prospective renters a more understandable view of availability.

The last column we created is “Price Range”, intended to be used as a filter: apartments under
$100/night are defined as low cost, apartments between $100/night and $300/night as medium
cost, and apartments above $300/night as high cost. We also created a measure called “Price
Range Measure” with the same values (picture 11, in references).
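
The two derived columns above were created in PowerBI. The Python sketch below restates the same rules for illustration; the function names are hypothetical, the “High Cost” label is assumed by analogy with the “Low Cost” and “Medium Cost” labels used in the results, and treating exactly 100 and 300 per night as medium cost is one reading of “between”.

```python
def availability_pct(availability_365):
    """Share of the year, in percent, on which the listing can be booked."""
    return 100 * availability_365 / 365


def price_range(price_per_night):
    """Classify a nightly price into the three ranges used as a filter."""
    if price_per_night < 100:
        return "Low Cost"
    if price_per_night <= 300:
        return "Medium Cost"
    return "High Cost"


print(availability_pct(73), price_range(150))
```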

2.3. Data preparation for text mining

The purpose of the text mining was to analyze the sentences that Airbnb landlords write to attract
customers, identify their features and then try to find the best option. To proceed with text
mining we first needed a corpus, that is, a collection of documents. Our data, however, was in an
Excel file, which was difficult to use with the RapidMiner operators without prior processing.

2.3.1. Data reduction


To analyze the names of the Airbnb listings, the other columns were not needed. To make it
easier to transform the Excel file into documents and to make the analysis more efficient, we
reduced the dimensionality by eliminating all columns except “name”, “reviews_per_month” and
“number_of_reviews”, which will be used in this text mining with Power BI.

2.3.2. Data transformation


Next, we needed to transform the data type of the reduced data. The “name” column was
polynominal, whereas text mining requires documents. We therefore first used the “Nominal to
Text” operator to transform the data type. Because the “name” column had some missing values,
we used the “Filter Examples” operator with a condition that keeps only the values that are not
missing. Finally, with the “Data to Documents” operator, we transformed the data into documents,
ready to be used with the “Process Documents” operator.

3. DESCRIPTIVE MODEL IN POWERBI
Once the data model was ready to be analyzed, we proceeded to develop the visualizations. Our
goal in building the descriptive model as a single dashboard was to give the user all the available
information on one page with an attractive design, where they can interact with the data through
different filters to get the information that matches their needs or desires, or simply “play”
with the dashboard for entertainment.

With this goal in mind, we had to think about which information we wanted to show and how we
would do it.

To explain the visuals and their characteristics, we first show the dashboard we designed.

Picture 1. New York Airbnb apartments dashboard focused on renters.

One of the visualizations, as mentioned before, is a map. We use it to represent the average price
and nº of apartments in each neighborhood, where the size of each bubble varies with the nº of
apartments in the neighborhood/district. The average price appears as a tooltip when mousing
over a bubble (picture 12, in references). The colors represent the districts. The user can also use
the location hierarchy levels to get general information about each district as a summary
(picture 13, in references).

The matrix table below the map shows more data about availability, reviews, minimum nights
and price in a more traditional way, with colors, data bars and icons. For the “Price Range”
column we used a star icon that shows an empty, half or full star according to the average price,
following the price range measure explained earlier. This visualization also supports hierarchy
levels, so each neighborhood can be seen in more detail by clicking on the “+” symbol
(picture 14, in references).

Next to the map, there is a donut chart where the user can see the nº of apartments in each district.
It is a simple graphic, but we consider it very useful for knowing how many options the renter has.
To count the apartments, we applied the count function to the “id” column.

The other two visuals are line and clustered charts. One shows the most expensive neighborhoods
and their average availability, for renters looking for luxury or for those who want to avoid high
prices. The other shows the ten neighborhoods that offer the most rental options along with their
average number of reviews. We consider this last graphic important because renters may be more
interested in neighborhoods with more options, since this can be a sign of a popular place in the
district.
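
The ranking behind the second chart (neighborhoods with the most listings, together with their average number of reviews) was produced by PowerBI aggregations. A minimal Python sketch of an equivalent aggregation follows; the sample listings and the function name `top_neighborhoods` are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows; the real data has one row per listing.
listings = [
    {"neighborhood": "Williamsburg", "number_of_reviews": 10},
    {"neighborhood": "Williamsburg", "number_of_reviews": 20},
    {"neighborhood": "Williamsburg", "number_of_reviews": 30},
    {"neighborhood": "Harlem", "number_of_reviews": 4},
    {"neighborhood": "Harlem", "number_of_reviews": 6},
    {"neighborhood": "Flushing", "number_of_reviews": 7},
]


def top_neighborhoods(listings, n=10):
    """Return (neighborhood, listing count, average reviews) for the
    n neighborhoods with the most listings, largest first."""
    counts = defaultdict(int)
    review_totals = defaultdict(int)
    for row in listings:
        hood = row["neighborhood"]
        counts[hood] += 1
        review_totals[hood] += row["number_of_reviews"]
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts, key=lambda h: (-counts[h], h))
    return [(h, counts[h], review_totals[h] / counts[h]) for h in ranked[:n]]


for hood, count, avg_reviews in top_neighborhoods(listings, n=2):
    print(hood, count, avg_reviews)
```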

With the visualizations finished, we added filters such as district, room type and price range to
give the user an easy way to get deeper information about their needs or desires. The
visualizations can also be used as filters: the user can, for example, select one of the top 10
neighborhoods with the most apartments to filter the other visualizations, seeing where the
neighborhood is located on the map or what its availability and reviews are in the matrix table
(picture 15, in references).

As the last step, we included cards on the left side that show information in text format, since
we thought that complementing the visuals with textual information is a good way to provide a
summary (picture 16, in references).

For the color palette, we chose mainly white along with dark blue and golden yellow over a grey
background, a contrast that shows the information clearly. Each district also has its own color,
indicated in the legend of the map visualization.

3.1. Results: features and patterns


At first sight, we can see that Brooklyn and Manhattan account for more than 80% of the Airbnb
apartments and also have the highest prices among the five districts. A possible reason is that
demand in these districts is high, leading more hosts to rent out their rooms or apartments. In
fact, despite their nº of apartments and prices, Brooklyn and Manhattan are the districts with the
lowest availability.

Overall, more than half of the apartments are in the “Medium Cost” range, between $100 and
$300 per night, closely followed by the “Low Cost” apartments, which are the ones with the
greatest number of reviews.

If we look at the most expensive neighborhoods, Riverdale in the Bronx is, surprisingly, as
expensive as Manhattan. We can also see two Staten Island neighborhoods among the most
expensive ones, but each has just one apartment. This suggests that Staten Island has few rental
options but some luxury ones, since it is a quiet and residential location.

Regarding the minimum nights required, most apartments require less than a week, in line with
the original principle of Airbnb as a short-term accommodation service. Entire homes, however,
have an average minimum stay of more than a week and are also more expensive, because the
whole apartment is being offered.

Looking more closely at the matrix table, we cannot see any apparent relationship between prices
and nº of reviews. Instead, almost all apartments get only a few reviews per month.

Regarding room types, entire homes/apartments have the highest share, with Manhattan as the
main supplier, followed by private rooms, where Brooklyn offers the most options. The number
of shared rooms, on the other hand, is very low.

The most expensive entire apartments are found in the three most expensive districts: Staten
Island, Manhattan and Brooklyn. This pattern changes for private rooms, where Queens gains
prominence with two neighborhoods in the top 5, although the number is very low compared to
the others.

4. DATA MINING BY RAPIDMINER
With this process we first wanted to measure the frequency of certain words in the names of
Airbnb listings and discover whether it has any relationship with user demand. We measured
demand by the number of reviews. This does not match demand precisely, but since there was no
information about user preference, and since more reviews require more users, we assumed that
the number of reviews can represent user demand.

4.1. Process Documents

Picture 2. Data Mining process in RapidMiner.

Picture 3. Tokenization. Process inside “Process Documents” element in RapidMiner.

The first step was to “tokenize” each sentence into words. In this step, each term is treated as a
single token so that we can measure its frequency. We then used the “Transform Cases” operator
to convert all letters to lower case so that words can be grouped more precisely. Next, we used
“Filter Stopwords (English)” to filter out unnecessary words or numbers that are used frequently
but give little unique information, because those words can hide the meaningful ones. After
filtering stop words, we used the “Generate n-Grams (Terms)” operator to create term n-grams
of the tokens in a document, that is, all series of consecutive tokens up to length n; here we set
the max length to 2. Finally, we used “Filter Tokens (by Length)” to filter tokens by the number
of characters they contain, which removes very short tokens; we set min chars to 4.
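
The operator chain above runs inside RapidMiner. To make the order of the steps concrete, here is a rough Python sketch of a comparable pipeline. It is not the RapidMiner implementation: the tiny `STOP_WORDS` set, the choice to split on non-letter characters, the underscore used to join n-grams and the function name `process_name` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); RapidMiner uses its own,
# much larger English list.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with", "for", "near"}


def process_name(name, max_ngram=2, min_chars=4):
    """Tokenize, lower-case, drop stop words, add n-grams, filter by length."""
    tokens = re.findall(r"[A-Za-z]+", name)               # tokenize on non-letters
    tokens = [t.lower() for t in tokens]                  # transform cases
    tokens = [t for t in tokens if t not in STOP_WORDS]   # filter stop words
    terms = list(tokens)
    for n in range(2, max_ngram + 1):                     # generate n-grams
        terms += ["_".join(tokens[i:i + n])
                  for i in range(len(tokens) - n + 1)]
    return [t for t in terms if len(t) >= min_chars]      # filter by length


# Hypothetical listing names, used only to show the term counts.
names = ["Cozy room in the East Village", "Cozy private studio"]
frequency = Counter(term for name in names for term in process_name(name))
print(frequency.most_common(3))
```

Counting the resulting terms over all listing names gives a word-frequency table of the kind discussed in the results.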

4.2. Results

4.2.1. Word Example

Picture 4. Matrix table of word frequencies as a result of our tokenization.

Firstly, this result can help Airbnb hosts become aware of the words that other hosts frequently
use in listing names, and take them into account when naming their next room in a way that
attracts users.

This table shows how frequently each word appears in the names of Airbnb listings. If we sort by
the largest count, we can find some meaningful words that can attract users. The word “private”
is used 4368 times and can attract users who are looking for a private and quiet room. The word
“cozy” is used 3291 times and gives a calm and warm image of the listing. The word “spacious”
is used 2191 times and suggests a room with plenty of space, which can be attractive to users
who have a lot of luggage or are travelling in a group. The word “beautiful” is used 1768 times
and can attract users who think the interior is important or who want to take good pictures in the
apartment. The word “luxury” is used 1290 times and can attract users who are seeking luxury
travel, although users who are trying to save money might avoid listings that contain this word.
The word “bright” is used 1287 times and evokes a large window or bright light in the room. The
word “loft” is used 1259 times and suggests a high space, so users who prefer high floors or who
want to enjoy the view of the city might prefer it. The word “central” is used 1073 times and
indicates a location in the central part of the city, which can be preferred by users for whom
distance is an important factor.

4.2.2. Process Documents

Picture 5. Relation between nº of reviews and frequency of words.

This graph shows the relation between the number of reviews and the frequency of some words
that we chose because they might be meaningful: {cozy, beautiful, luxury, spacious, bright,
central, clean}. We used the median value for each word to best show their relationship.
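
The report does not spell out how the median was computed. One plausible reading, sketched below in Python, is the median of “reviews_per_month” over the listings whose name contains a given word. The sample listings and the function name `median_reviews_by_word` are hypothetical.

```python
from statistics import median

# Hypothetical sample rows, not taken from the dataset.
listings = [
    {"name": "Cozy room in Harlem", "reviews_per_month": 4.0},
    {"name": "Cozy and bright studio", "reviews_per_month": 2.0},
    {"name": "Spacious loft", "reviews_per_month": 1.0},
    {"name": "Bright spacious apartment", "reviews_per_month": 3.0},
]

KEYWORDS = ["cozy", "beautiful", "luxury", "spacious", "bright", "central", "clean"]


def median_reviews_by_word(listings, keywords=KEYWORDS):
    """Median reviews_per_month of the listings whose name contains each word.

    Words that appear in no listing name are left out of the result.
    """
    result = {}
    for word in keywords:
        values = [row["reviews_per_month"] for row in listings
                  if word in row["name"].lower().split()]
        if values:
            result[word] = median(values)
    return result


print(median_reviews_by_word(listings))
```

The median is less sensitive than the mean to a few listings with unusually many reviews, which may be why it shows the relationship more clearly.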

Picture 6. “Cozy” as one of the key words in Airbnb descriptions.

When we looked at the results word by word, we found that “cozy” appears at the highest
reviews_per_month values and also shows many data points at other reviews_per_month values.
We can therefore think that “cozy” is more attractive to users than the other words.

Picture 7. More key words such as “beautiful”, “luxury” or “bright”.

In this screenshot it is not easy to distinguish, but if we click on each word and check its
distribution, the words “beautiful”, “luxury” and “bright” show the next highest
reviews_per_month and also many data points, and the words “clean” and “central” follow
behind. From this relation we can see that these words were not as attractive as “cozy” but still
fairly attractive to users, especially “clean”, whose total frequency was the lowest among these
words.

Picture 8. Not all of the most frequent words come with a higher number of reviews; “spacious” is an example.

As for the word “spacious”, even though it has a very high frequency compared to the other
words in the graph, it shows few data points and is located at 5~9 reviews_per_month, which is
not very high. From this relation, we can expect that “spacious” is less attractive to users than its
frequency in the names would suggest.

5. CONCLUSION
As we all know, New York City is one of the most famous cities in the world, but its districts
also differ considerably from one another in social class. This difference carries over to the
supply of Airbnb apartments: Brooklyn and Manhattan, the most touristic, modern and developed
districts with the most economic activity, have most of the apartments and also the highest
prices, with Manhattan above Brooklyn on both counts.

The density of apartments offered in both districts is much higher than in the other districts, and
demand is also the highest there since, as we said, they account for a large share of the touristic
and economic activity, where people with different purposes need or want to stay for a while.

On the other hand, the Bronx still has low prices and, with 2.3% of the apartments in New York,
considerably high availability. The reason could be that the Bronx is known around the world as
one of the dangerous locations in New York.

Something similar, though for other reasons, happens with Staten Island, known as the forgotten
district of New York, where the supply is less than 1% and availability is the highest. The reason
is that it is a very quiet island (the third biggest district but the one with the smallest
population), far from public transport and without tourist attractions that could interest
visitors.

Regarding room types, more than half of the apartments are entire homes, probably because most
of the people renting apartments are families, for whom it is more convenient to rent the whole
house. These are followed by private rooms, where Queens could be a good alternative for those
travelling alone to get a good room at a better price than in the main districts. In Queens you can
find well-known neighborhoods such as Long Island City or Flushing.

Looking at the neighborhoods with the most Airbnb apartments, we can see that they are very
popular places such as Williamsburg, a trendy Brooklyn neighborhood that has been renovated
with a focus on modern restaurants and where some Netflix series have been filmed, or Harlem in
Manhattan, a popular neighborhood known for its African American roots and culture and its
jazz clubs.

As for the descriptions of Airbnb apartments, the word list shows that hosts try to attract users
with words that convey a positive image, such as “cozy”, “beautiful” and “clean”. When we
examined some important and meaningful words, we saw that “cozy” is the most used attractive
word. “Beautiful”, “luxury”, “clean”, “central” and “bright” were fairly attractive, and
“spacious” was not as appealing. We can therefore conclude that, since these words are related to
user preference as measured by reviews, Airbnb hosts should include positive words to attract
users, especially “cozy”, and should try to use words other than “spacious”.

6. REFERENCES

Picture 9. Rows with empty or wrong values were deleted since we cannot replace them.

Picture 10. Neighborhood, district and city combined in a column with a “,” between each hierarchy level.

Picture 11. “Price Range Measure” calculation with an IF function.

Picture 12. The bubble size describes the nº of the apartments of each neighborhood. We have more information if we mouse
over the bubbles. Example of Bedford-Stuyvesant neighborhood in Brooklyn.

Picture 13. District hierarchy level, a global visual where we can appreciate that Manhattan and Brooklyn have more apartments
than the rest.

Picture 14. If we click on the “+” next to the District name, a list of its neighborhoods is displayed.

Picture 15. Example of filtering with visual elements.

Picture 16. Cards that change when filters are applied, giving us information in a textual way.

