
Math Horizons

ISSN: 1072-4117 (Print) 1947-6213 (Online) Journal homepage: http://www.tandfonline.com/loi/umho20

Exploring Real Data A Look at Airbnb

Amanda Francis & Eric Sullivan

To cite this article: Amanda Francis & Eric Sullivan (2018) Exploring Real Data A Look at Airbnb,
Math Horizons, 25:3, 14-17, DOI: 10.1080/10724117.2018.1424459

To link to this article: https://doi.org/10.1080/10724117.2018.1424459

Published online: 25 Jan 2018.

Full Terms & Conditions of access and use can be found at http://www.tandfonline.com/action/journalInformation?journalCode=umho20

Exploring Real Data
A Look at Airbnb
Amanda Francis and Eric Sullivan

“Data is the sword of the 21st century,” wrote Jonathan Rosenberg, former senior vice president of products at Google, “those who wield it well, the samurai.” The world of data and data analysis is growing in importance, and those with the interest and appropriate skills are uniquely positioned to solve intriguing problems and answer vital questions.

In this article, we look at a freely available, rich, and complex data set that you might use to ask and answer data-driven questions. We propose a collection of explorations that you can try out along with us.

Airbnb

We have all experienced the traveler’s dilemma: I want to travel cheaply so that my trip can last longer, but the cost of lodging is prohibitively high. In 2008 the San Francisco–based company Airbnb changed the nature of finding lodging by providing an alternative to traditional hotels.

Airbnb describes itself as “a trusted community marketplace for people to list, discover, and book unique accommodations around the world. . . . Whether an apartment for a night, a castle for a week, or a villa for a month, Airbnb connects people to unique travel experiences, at any price point, in more than 65,000 cities and 191 countries” (airbnb.com).

As Airbnb has grown in popularity, a wealth of data has been accumulated about many aspects of the Airbnb experience. Insideairbnb.com, an independent website, has developed “a set of tools and data that allows you to explore how Airbnb is really being used in cities around the world.” This site has gathered public data from Airbnb sites around the world, providing us with a treasure trove of data to explore!

insideairbnb.com/seattle
Figure 1. A graphical visualization for Seattle on the website Inside Airbnb.

14 February 2018 : : Math Horizons : : www.maa.org/mathhorizons


Exploration 1: Visit insideairbnb.com, click “Get the Data,” find a city that interests you, and choose to see your city’s data visually. Spend some time exploring the visual aids, exploring the data set, and proposing questions that can be answered from the data.

We are located near Seattle, so we’ll use the Seattle data for our discussions. The graphical visualization tool (shown in figure 1) allows you to see several descriptive statistics for your chosen city, animate through the frequency and location of reviews, see top-rated hosts, and much more.

Initial Data Analysis

Let’s explore some questions related to your city.

Exploration 2: You’re trying to budget your travel money. Give a practical estimate for the daily prices of Airbnb rentals in a particular neighborhood.

There is a beautiful beach on the Puget Sound in the Alki neighborhood of Seattle. Let’s filter the data to explore the prices of rentals from the Alki beach area.

Start with some basic descriptive statistics: The average rental price in the Alki neighborhood is $172, but the standard deviation is $97!

The question in this exploration is best answered with a range of prices, and, based on the large standard deviation, the estimate will likely have a large range. If this data was normally distributed, we would expect most of the rentals (approximately 68 percent of them) to be within one standard deviation of the mean: between $75 and $269. The large variance in the prices leads us to believe that there is more going on here. Let’s create a data visualization to investigate further.

The plot in figure 2 is called a violin plot. The curved shape tells us where clusters of the data are located, while the boxplot tells us where the median and quartiles are. We observe that this data does not appear to be normally distributed. In fact, a relatively low concentration of the rentals in the Alki neighborhood are near the mean price; many are between $40 and $100 away from the center. You may want to investigate the types of rentals in this data, or perhaps their distance from the coastline, to figure out what is creating this interesting shape.

Figure 2. A violin plot showing Airbnb prices in the Alki beach area of Seattle.

If we want to further analyze prices in this neighborhood, we might consider performing a variable transformation. Since the prices are greater than zero, bimodal, and fairly right skewed, a square root or log transformation might prove useful for further analysis.

Exploration 3: Do hosts with cleaner rentals charge more? More specifically, is there a significant difference between the daily price of the cleanest rentals versus the least clean rentals?

The first order of business in this exploration is to decide what cleanest and least clean mean to us. The cleanliness rating scale goes from 0 to 10, but we shouldn’t assume that ratings are evenly distributed among these numbers, or even that the numbers 0 through 10 represent a linear progression of cleanliness. Many travelers might feel guilty giving a low cleanliness rating, so we may expect the ratings to be artificially inflated.

Based on what we learned from our last exploration, let’s look at a visualization of the data first. Figure 3 shows a bar graph of the counts of the cleanliness scores.

Figure 3. Cleanliness ratings for Airbnb rentals.

It appears that the number of ratings drops off drastically below a cleanliness score of 8. Hence, we conjecture that the ratings between 0 and 7 represent the people who are truly unhappy with the cleanliness of the rental.

To include price in the discussion, we now consider a scatter plot of cleanliness score versus price. An immediate concern, however, is that many of the points will overlap and make the plot difficult to read. One option is to add a “jitter” to our graphic so that we can see more points, which will give us a sense of where the clusters are.

Figure 4. A jittered scatterplot of the price of rentals with given cleanliness ratings.

Notice that the sample sizes are wildly different: 80 rentals with lower cleanliness ratings and 3,165 with higher ratings. It looks like the rentals rated higher for cleanliness get better prices, perhaps with a few outliers. However, our scatterplot dots are still so clustered that it is difficult to see what’s going on.

Let’s try side-by-side violin plots (see figure 5). In this case, we’ve added a log scale to the price variable to make the shapes easier to see.

We see that the bulk of the rental prices for both high and low cleanliness ratings are centered near $100, but the clean rentals have some outliers with much higher prices. On the whole, the average rental prices in both categories aren’t drastically different. When we conducted a statistical t-test on two means, we came to roughly the same conclusion (with a p-value of about 0.23), so there isn’t strong evidence of a difference in price between the two groups.

Exploration 4: You want to have an out in case your travel plans fall through. Is there a difference between the proportion of rentals in one neighborhood that have a strict cancellation policy as compared with the rentals in another neighborhood?

Let’s say that we want to stay on the waterfront in Seattle. Therefore, we’ll compare Alki beach and Belltown, both of which are on the water.

Figure 6 shows the distribution of cancellation policies in the two neighborhoods, and a statistical test on the difference of the proportions of strict policies in each neighborhood gives us a p-value much less than 0.01. So there is evidence to suggest that rentals in Belltown have a much stricter cancellation policy than those in Alki. The safe bet seems to be Alki, but if we want to spend our social time downtown, near Belltown, we need to consider the transportation costs.

Figure 6. Cancellation policies in Belltown and Alki beach.

Exploration 5: You are considering inviting some friends on your trip. You want to know how much more to expect to pay for each additional bed.

Let’s start with two visualizations of our data, in a jittered scatterplot and in a contour plot showing which combinations of number of beds and rental price occur most frequently.
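
Before turning to the regression, the two significance tests reported in Explorations 3 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say which variant of the t-test was used, so Welch's unequal-variance form is chosen here; p-values use a large-sample normal approximation rather than Student's t; and every number below is made up (synthetic prices and hypothetical policy counts), not the Seattle data.

```python
import math
import random
from statistics import NormalDist, fmean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic with a two-sided p-value (normal approximation)."""
    se = math.sqrt(variance(sample_a) / len(sample_a)
                   + variance(sample_b) / len(sample_b))
    t = (fmean(sample_a) - fmean(sample_b)) / se
    # For large samples the t distribution is close to the standard normal.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Pooled two-proportion z statistic with a two-sided p-value."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (hits_a / n_a - hits_b / n_b) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Synthetic right-skewed prices; only the group sizes (80 and 3,165)
# come from the article.
rng = random.Random(1)
less_clean = [rng.lognormvariate(4.6, 0.5) for _ in range(80)]
cleaner = [rng.lognormvariate(4.6, 0.5) for _ in range(3165)]
t_stat, t_p = welch_t(cleaner, less_clean)

# Hypothetical counts of strict-policy rentals in two neighborhoods.
z_stat, z_p = two_proportion_z(120, 200, 15, 40)
print(f"t = {t_stat:.2f}, p = {t_p:.3f}; z = {z_stat:.2f}, p = {z_p:.4f}")
```

A large p-value, like the 0.23 reported for cleanliness, means the observed gap in means is consistent with chance; a very small one, like the cancellation-policy result, means it is not.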
A simple linear regression gives us a best-fit line for our data, from which I should expect to pay about $93 for a rental with one bed and $47 more per additional bed.
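
Written as an equation, the two figures just quoted correspond, up to rounding, to the line below. This is a sketch inferred from those figures alone: the symbol names are illustrative, and the intercept of 46 is simply 93 − 47, not a reported coefficient.

```latex
\text{price} \approx 93 + 47\,(\text{beds} - 1) = 46 + 47\,\text{beds}
```

Under this line, for example, a rental with three beds would be expected to cost about 46 + 47 · 3 = $187.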
You might next want to find a 95 percent confidence interval for the true slope (marginal price) of this regression line. What about a 95 percent confidence interval for the average price of rentals with three beds? What about a prediction interval for the actual price of a rental with three beds? (How are these questions different from each other?)

Figure 5. Log scale violin plots showing the prices for more and less clean rentals.
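
The three interval questions posed for Exploration 5 can be compared side by side in code. The sketch below is a minimal illustration, not the authors' workfile: the beds and price values are synthetic, generated around a line of roughly 46 + 47 × beds (the form implied by the $93 and $47 figures), and the critical value 1.96 is the large-sample normal approximation to Student's t.

```python
import math
import random
from statistics import fmean


def regression_intervals(x, y, x0, z=1.96):
    """Least-squares line plus three approximate 95% intervals at x0."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual standard error with n - 2 degrees of freedom.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    fit = intercept + slope * x0
    se_slope = s / math.sqrt(sxx)
    # Uncertainty in the AVERAGE price at x0.
    se_mean = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    # Uncertainty in ONE rental's price at x0: adds the scatter about the line.
    se_pred = s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    return {
        "slope": slope,
        "intercept": intercept,
        "slope_ci": (slope - z * se_slope, slope + z * se_slope),
        "mean_ci": (fit - z * se_mean, fit + z * se_mean),
        "prediction_interval": (fit - z * se_pred, fit + z * se_pred),
    }


# Synthetic stand-in data; not the Seattle listings.
rng = random.Random(0)
beds = [rng.choice([1, 1, 1, 2, 2, 3, 4, 5]) for _ in range(500)]
price = [46 + 47 * b + rng.gauss(0, 40) for b in beds]

result = regression_intervals(beds, price, x0=3)
for name, value in result.items():
    print(name, value)
```

The prediction interval is always the widest of the three at a given number of beds, because it covers the spread of individual rentals around the line in addition to the uncertainty in the line itself.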
Figure 7. A jittered scatter plot and a contour plot relating the number of beds and the price of the rental.

Moving beyond the Basics

In this section, we pose more questions and leave the investigations open to you. If you would like to see how we answered these questions with the Seattle data, you can visit our workfile at maa.org/mathhorizons/supplemental.htm.

Several of these exercises require more advanced techniques, such as multiple regression, logistic regression, or machine learning techniques to answer. For each exploration, it would be wise to split your data into a training set and a test set so you can test your predictions on data that your model has not seen.

As you work on these explorations, consider the following questions.
• Which variables are the most important? The least important?
• Are any variables just “noise” in the analysis?
• How reliable is your model?
• Would this same model work in other regions of the world? Why or why not?
• Is it necessary to preprocess the data in any way?
• Should the variables be transformed in any way?
• Should any additional features be included in the data (such as the square of a feature or the product of two features)?

Exploration 6: What variables in this data set are most useful in predicting the price of a particular rental? For example, try using such variables as accommodates, beds, bathrooms, guests_included, review_scores_rating, and various neighborhoods, and see if you can build a model (maybe a multiple regression, regression tree, or artificial neural network) that is good at predicting prices.

Exploration 7: Can you predict whether a rental is an entire home or a private room based on the other characteristics of that rental? What variables are the most useful? Try using the same variables as in the last exploration (including price) to build a model.

Exploration 8: Can you predict which neighborhood a rental is in based on other characteristics in the data? Pick three or four neighborhoods that contain many rentals, and keep data only from these neighborhoods. Can you build a model (maybe hierarchical clustering, k-means, or something else) that can predict which neighborhood you’re in, based on certain important variables? Which variables should you use? Is it cheating to use latitude and/or longitude? How well does your model work?

Some of our favorite places to find free, interesting, and real data are
• data.gov
• kaggle.com
• archive.ics.uci.edu/ml
• Quandl.com
• gapminder.org
• flowingdata.com

Recently, the field of data science has become a successful and popular way to analyze data like that presented in this article. To learn more, you can enroll in starter courses on the web (such as at Coursera and DataCamp), and many universities and colleges are implementing data science programs. As the last three explorations show, the level of mathematical sophistication associated with data science can be high. So if you find data-driven problems interesting, take more courses in statistics, computer science, and computational mathematics!

As Jennifer Pahlka, founder and executive director of Code for America, said in a press statement: “Our ability to do great things with data will make a real difference in every aspect of our lives.”

Amanda Francis teaches mathematics at Carroll College in Helena, Montana. She loves exploring unanswered questions and enjoys reading, running, and watching cheesy movies with her family.

Eric Sullivan teaches mathematics at Carroll College in Montana. Beyond the usual academic pursuits of teaching and writing, he can be found backpacking around the wilderness areas of Montana.

This work was funded by a grant from the W. M. Keck Foundation.

10.1080/10724117.2018.1424459
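
As a starting point for Explorations 6 through 8, the advice about holding out a test set can be sketched as follows. This is a minimal illustration under stated assumptions: `train_test_split` is a hypothetical helper written here (not a library call), the 80/20 split is one common choice rather than a recommendation from the article, and the listing rows are invented.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    # A fixed seed makes the split reproducible from run to run.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Invented rows standing in for rental listings.
listings = [
    {"id": i, "beds": 1 + i % 4, "price": 46 + 47 * (1 + i % 4)}
    for i in range(50)
]

train, test = train_test_split(listings, test_fraction=0.2, seed=42)
print(len(train), "training rows and", len(test), "test rows")
```

Fit each model on `train` only, then judge it by how well it predicts the rows in `test`, which it has never seen.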
