
“Never assume you can't do something. Push yourself to redefine the boundaries.”


Brian Chesky, CEO of Airbnb
Agenda
• Problem Definition
• Data Analysis & Visualization.
• Challenges
• Feature Selection
• Model Creation
Problem Definition
• What is Airbnb?
• A real problem experienced by hosts:
- Deciding the correct price for a short-term rental.
• Our goal: build a model that estimates the correct price of a rental, given the features of the property.
• Dataset: 77,000+ records and 97 columns.

Dataset source: http://insideairbnb.com/get-the-data.html


Preprocessing Phase
• Removed the unwanted features from our dataset.

• Performed Exploratory Data Analysis (EDA) to identify further columns to remove.
E.g.: the 'square_feet' column is mostly blank, so we removed it.

• Performed data cleaning to identify and remove errors, noise and duplicate records, which improves the quality of the training data and supports better decision-making.

• Removed or replaced records whose target value (price) exceeds the 99th percentile within each category level of a given column.

• Deleted records containing noisy or irrelevant data.

• Handled misleading information in the data.
E.g.: a listing shows that 1 bedroom is rented, but it is priced for the total number of rooms in the house.
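The per-category 99th percentile rule can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the helper name drop_price_outliers, the choice of room_type as the grouping column, and the toy prices are all assumptions made for the example, and it shows only the "remove" variant of the rule.

```python
import pandas as pd

def drop_price_outliers(df, group_col, target_col="price", q=0.99):
    """Keep rows whose target is at or below the q-th quantile of their own group."""
    # transform() broadcasts each group's quantile back onto that group's rows
    cutoff = df.groupby(group_col)[target_col].transform(lambda s: s.quantile(q))
    return df[df[target_col] <= cutoff]

# Toy data, made up for illustration (not taken from the real dataset)
listings = pd.DataFrame({
    "room_type": ["Private room"] * 5 + ["Entire home/apt"] * 5,
    "price": [30, 35, 40, 45, 900, 100, 110, 120, 130, 5000],
})

cleaned = drop_price_outliers(listings, "room_type")
print(cleaned)
```

Computing the cutoff per group rather than over the whole column keeps a cheap category from inheriting the much higher cutoff of an expensive one.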
Data Analysis & Visualization.
Correlation Matrix:

• After visualizing the correlation between features, the 'beds' column shows a strong correlation with accommodation.
• The 'beds' column was therefore removed.
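A correlation check of this kind might look like the sketch below. The 0.8 threshold, the helper name highly_correlated_pairs, the column name accommodates and the toy values are assumptions for illustration; the slides do not state the threshold that was used.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is reported once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold]

# Toy data, made up for illustration
features = pd.DataFrame({
    "accommodates": [1, 2, 2, 4, 6, 8],
    "beds":         [1, 1, 2, 3, 4, 5],
    "bathrooms":    [1, 2, 1, 1, 2, 1],
})

pairs = highly_correlated_pairs(features)
print(pairs)

# Drop one column from each highly correlated pair, here 'beds'
reduced = features.drop(columns=["beds"])
```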
Visualizing london_borough wrt price

• This plot shows the relationship between the categories of london_borough and price, drawn with barplot().

• Westminster, Kensington and Chelsea, and City of London have the highest prices. The black lines are error bars showing the uncertainty in the values, which is highest for Enfield.
Visualizing different property types wrt price.
• This plot shows the relationship between the categories of property_type and price, drawn with barplot().

• Some categories have longer error bars, which represent higher uncertainty in the price values, so those categories were removed.
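The slides do not say how "longer" error bars were judged. One possible way to make that decision numeric is sketched below, assuming a normal-approximation 95% interval and a hypothetical cutoff of 0.5 on the interval half-width relative to the mean. By default seaborn's barplot() draws a bootstrapped 95% confidence interval, so its bars will differ somewhat from these values.

```python
import pandas as pd

def error_bar_summary(df, cat_col, target_col="price"):
    """Mean price per category with an approximate 95% interval half-width."""
    stats = df.groupby(cat_col)[target_col].agg(["mean", "sem", "count"])
    stats["half_width"] = 1.96 * stats["sem"]  # normal approximation
    stats["relative_width"] = stats["half_width"] / stats["mean"]
    return stats

# Toy data, made up for illustration
listings = pd.DataFrame({
    "property_type": ["Apartment"] * 4 + ["Boat"] * 3,
    "price": [80, 90, 100, 110, 50, 400, 1000],
})

summary = error_bar_summary(listings, "property_type")
MAX_RELATIVE_WIDTH = 0.5  # hypothetical cutoff, not from the slides
noisy = summary.index[summary["relative_width"] > MAX_RELATIVE_WIDTH].tolist()
filtered = listings[~listings["property_type"].isin(noisy)]
print(noisy)
```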
Visualizing bedrooms wrt price
• This visualization uses regplot() to fit a linear regression between bedrooms, which takes discrete values, and price.

• As the number of bedrooms increases, the price also increases and shows more variance.

• The plot shows that even some 1-bedroom listings are priced at 0, which is not possible, so those records were deleted.
Challenges:
• After applying Exploratory Data Analysis (EDA) to our data, we found that some records in the 'price' column were suspicious and could lead to incorrect results.

• To get accurate price values, we scraped the listings using the selenium library.

• We used the listing_url column to reach each listing and developed the scraping logic with the BeautifulSoup library.

• The dataset also contained inaccurate listing data provided by hosts.
E.g.: a host has listed 1 guest for a private room that has 5 beds and 5 bedrooms.
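Inconsistencies like the one in the example can be surfaced with a simple rule. The sketch below assumes the guest capacity is stored in a column named accommodates and flags listings that report more beds than guests; both the column name and the rule itself are illustrative assumptions, not the project's documented logic.

```python
import pandas as pd

def flag_inconsistent_capacity(df):
    """True for listings that report more beds than guests they accommodate."""
    return df["beds"] > df["accommodates"]

# Toy data, made up for illustration
listings = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Entire home/apt"],
    "accommodates": [1, 2, 4],
    "beds": [5, 1, 2],
    "bedrooms": [5, 1, 2],
})

flags = flag_inconsistent_capacity(listings)
suspicious = listings[flags]
print(suspicious)
```

Flagged rows can then be reviewed, corrected or dropped, depending on how much the rest of the listing can be trusted.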
Feature Selection & Feature Engineering
• After cleaning the data and removing the unwanted features, we selected features such as 'amenities' for feature transformation.

• The amenities feature holds multiple labels (TV, Internet, WiFi, etc.), so we used MultiLabelBinarizer() to get a binary column for each label.

• By checking the correlation between amenities, we removed some highly correlated features, as they do not convey extra information.
E.g.: bathroom essentials, cooking basics, etc.

• Removed amenities that are most common or most uncommon in the data.

• We set a threshold to remove amenities for which either the 0s or the 1s make up less than 3% of the values.

• The data also has many different property type values, so we replaced those whose count is 100 or fewer with “Others”.
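The amenity encoding, the 3% filter and the “Others” grouping can be sketched as below. The helper names and toy data are assumptions for illustration. The example also assumes the amenities have already been parsed into Python lists, a step the slides do not describe.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def binarize_amenities(amenity_lists, min_share=0.03):
    """One binary column per amenity, dropping near-constant columns."""
    mlb = MultiLabelBinarizer()
    encoded = pd.DataFrame(mlb.fit_transform(amenity_lists), columns=mlb.classes_)
    share = encoded.mean()  # fraction of listings that have each amenity
    # Drop amenities where the 1s or the 0s are below min_share of the rows
    keep = share[(share >= min_share) & (share <= 1 - min_share)].index
    return encoded[keep]

def lump_rare_categories(series, min_count=100, other="Others"):
    """Replace categories with min_count or fewer rows by a single label."""
    counts = series.value_counts()
    rare = counts[counts <= min_count].index
    return series.where(~series.isin(rare), other)

# Toy data, made up for illustration: Wifi everywhere, TV in half, Sauna once
amenities = [["Wifi", "TV"] if i % 2 == 0 else ["Wifi"] for i in range(100)]
amenities[0] = ["Wifi", "TV", "Sauna"]
amenity_features = binarize_amenities(amenities)

property_type = pd.Series(["Apartment"] * 150 + ["House"] * 120 + ["Boat"] * 3)
lumped = lump_rare_categories(property_type)
print(amenity_features.columns.tolist(), lumped.value_counts().to_dict())
```

Here "Wifi" is dropped because every listing has it and "Sauna" because only 1% of listings do, leaving "TV" as the only informative column.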
Insights
Model Selection and Creation
• After the feature engineering step, we binned 'price' into 2 bins, 0-100 and 101-2001, to create the price_bins feature.

• Split the data into train and test sets.

• Before performing regression, we first ran classification to predict price_bins.

Classification Models       Accuracy (Train)   Accuracy (Test)   Precision   Recall
Random Forest Classifier    0.8873             0.8688            0.88        0.87
Logistic Regression         0.8683             0.8633            0.87        0.86
Vote Classifier             0.8788             0.8668            0.87        0.87
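The binning and classification step can be sketched with pandas and scikit-learn as follows. The synthetic features, the hyperparameters, the soft-voting setup and the weighted averaging of precision and recall are all assumptions for illustration; the slides do not specify them, and the scores this toy example produces are unrelated to the reported ones.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real listings)
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "bedrooms": rng.integers(1, 5, size=n),
    "accommodates": rng.integers(1, 9, size=n),
})
price = 30 * X["bedrooms"] + 10 * X["accommodates"] + rng.normal(0, 10, size=n)
price = price.clip(lower=1, upper=2001)

# Two bins: (0, 100] -> 1 and (100, 2001] -> 2
price_bins = pd.cut(price, bins=[0, 100, 2001], labels=[1, 2]).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, price_bins, test_size=0.25, random_state=0, stratify=price_bins
)

models = {
    "Random Forest Classifier": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Vote Classifier": VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",  # assumption: average predicted probabilities
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "train_accuracy": accuracy_score(y_train, model.predict(X_train)),
        "test_accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, average="weighted"),
        "recall": recall_score(y_test, pred, average="weighted"),
    }

print(pd.DataFrame(results).T.round(4))
```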
Model Selection and Creation (Contd.)
• After classifying price_bins, we built 2 XGBRegressor models, one for each price bin.

Regression Models               Median Absolute Error (Train)   Median Absolute Error (Test)
XGBRegressor For Price_bin 1    4.9123                          7.8473
XGBRegressor For Price_bin 2    16.6995                         25.9586
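Training one regressor per price bin and scoring it with median absolute error can be sketched as below. To stay within scikit-learn, the example swaps in GradientBoostingRegressor for XGBRegressor; XGBRegressor follows the same fit/predict interface, so it can be passed as make_model instead if xgboost is installed. The helper name fit_per_bin and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import train_test_split

def fit_per_bin(X, y, bins, make_model):
    """Train one regressor per price bin; report train/test median absolute error."""
    report = {}
    for b in np.unique(bins):
        mask = bins == b
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[mask], y[mask], test_size=0.25, random_state=0
        )
        model = make_model().fit(X_tr, y_tr)
        report[int(b)] = {
            "model": model,
            "train_medae": median_absolute_error(y_tr, model.predict(X_tr)),
            "test_medae": median_absolute_error(y_te, model.predict(X_te)),
        }
    return report

# Synthetic stand-in data (not the real listings)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(600, 3))
y = 20 + 200 * X[:, 0] + 30 * X[:, 1] + rng.normal(0, 5, size=600)
y = np.clip(y, 1, 2001)
bins = np.where(y <= 100, 1, 2)  # same two price ranges as price_bins

report = fit_per_bin(X, y, bins, lambda: GradientBoostingRegressor(random_state=0))
for b, r in report.items():
    print(b, round(r["train_medae"], 4), round(r["test_medae"], 4))
```

Median absolute error is less sensitive to a few extreme prices than the mean absolute error, which suits a target with a long upper tail.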


Thank You
