Final
”
Brian Chesky, CEO of Airbnb
Agenda
• Problem Definition
• Data Analysis & Visualization
• Challenges
• Feature Selection
• Model Creation
Problem Definition
• What is Airbnb?
• A real problem experienced by hosts:
- Deciding the correct price for a short-term rental.
• Our Goal: build a model that estimates the correct price for a host's rental, given the features of the property.
• Dataset: 77,000+ records and 97 columns.
• Performed Exploratory Data Analysis (EDA) to identify columns that could be removed from the data.
E.g., the 'square_feet' column has the most blank values, so we removed it.
• Performed Data Cleaning to identify and remove errors, noise, and duplicate data, which improves the quality of the training data and supports better decision-making.
• Removed or replaced records whose target column value is greater than the 99th percentile for each category level in a particular column.
• Deleted records that contain noisy or irrelevant data.
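
The column-dropping and percentile steps above could be sketched as follows. This is a minimal illustration with pandas, not the project's actual code: the helper names, the `room_type` grouping column, the toy data, and the missing-value cutoff are all assumptions, and capping at the percentile is only one way to "replace" extreme values.

```python
import pandas as pd

def drop_sparse_columns(df, max_missing=0.9):
    """Drop columns whose share of missing values exceeds max_missing (assumed cutoff)."""
    missing_share = df.isna().mean()
    return df.loc[:, missing_share <= max_missing]

def cap_by_group(df, group_col, target_col, q=0.99):
    """Cap target_col at the q-th quantile computed within each level of group_col."""
    caps = df.groupby(group_col)[target_col].transform(lambda s: s.quantile(q))
    out = df.copy()
    out[target_col] = out[target_col].clip(upper=caps)
    return out

# Toy data: column names and values are hypothetical.
listings = pd.DataFrame({
    "room_type": ["Private room"] * 4 + ["Entire home/apt"] * 4,
    "price": [30.0, 35.0, 40.0, 900.0, 90.0, 100.0, 110.0, 5000.0],
    "square_feet": [None] * 7 + [450.0],
})

without_sparse = drop_sparse_columns(listings, max_missing=0.8)
cleaned = cap_by_group(without_sparse, "room_type", "price")
print(cleaned)
```

Capping keeps every record while limiting the influence of extreme prices; dropping the rows above the cap instead would be the "remove" variant of the same step.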
• After visualizing the correlation between different features, we found that the 'beds' column is strongly correlated with accommodation.
• The 'beds' column is therefore removed.
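
A correlation check of this kind might look like the sketch below. The column name `accommodates`, the 0.8 threshold, and the sample values are assumptions made for illustration; the slides do not state which threshold was used.

```python
import pandas as pd

def highly_correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, |r|) for numeric column pairs with |Pearson r| >= threshold."""
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if r >= threshold:
                pairs.append((cols[i], cols[j], r))
    return pairs

# Hypothetical sample: 'beds' tracks 'accommodates' closely, 'bathrooms' does not.
features = pd.DataFrame({
    "accommodates": [2, 4, 6, 2, 8, 3],
    "beds": [1, 2, 3, 1, 4, 2],
    "bathrooms": [1, 1, 2, 2, 1, 1],
})

pairs = highly_correlated_pairs(features)
print(pairs)

# Keep one column from each correlated pair; here 'beds' is the one dropped.
reduced = features.drop(columns=["beds"])
```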
Visualizing london_borough wrt price
• The amenities feature holds multiple labels per listing (TV, Internet, WiFi, etc.), so we used MultiLabelBinarizer() to turn each label into a binary column.
• After checking the correlation between amenities, we removed some highly correlated features from the data, as they do not convey extra information.
E.g., bathroom essentials, cooking basics, etc.
• Removed amenities that are either very common or very uncommon in the data.
• We set a threshold: an amenity is removed if fewer than 3% of its values are 0's or fewer than 3% are 1's.
• The data also has many different property type values, so we replace those whose count is below or equal to 100 with “Others”.
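
The binarization, the 3% threshold, and the property type grouping described above could be sketched as follows. Only `MultiLabelBinarizer` is named in the slides; the helper functions, variable names, and toy data are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def binarize_amenities(amenity_lists):
    """One 0/1 column per amenity label, built with MultiLabelBinarizer."""
    mlb = MultiLabelBinarizer()
    matrix = mlb.fit_transform(amenity_lists)
    return pd.DataFrame(matrix, columns=mlb.classes_, index=amenity_lists.index)

def drop_near_constant(flags, min_share=0.03):
    """Drop 0/1 columns where fewer than min_share of the values are 1's or 0's."""
    share_of_ones = flags.mean()
    keep = (share_of_ones >= min_share) & (share_of_ones <= 1 - min_share)
    return flags.loc[:, keep]

def group_rare(values, min_count=100, other="Others"):
    """Replace categories that occur min_count times or fewer with `other`."""
    counts = values.map(values.value_counts())
    return values.where(counts > min_count, other)

# Toy amenities: 'Wifi' is in every listing, 'Hot tub' in 1 of 40 (2.5%).
amenities = pd.Series(
    [["Wifi", "TV"]] * 20 + [["Wifi"]] * 19 + [["Wifi", "Hot tub"]]
)
flags = binarize_amenities(amenities)
kept = drop_near_constant(flags)
print(list(kept.columns))

# Toy property types with counts 150, 101, 100, and 3.
property_type = pd.Series(
    ["Apartment"] * 150 + ["House"] * 101 + ["Boat"] * 100 + ["Yurt"] * 3
)
grouped = group_rare(property_type)
print(grouped.value_counts())
```

An amenity present in nearly every listing, or in almost none, barely varies across records, which is why both ends are filtered out.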
Insights
Model Selection and Creation
• After the feature engineering step, we split 'price' into 2 bins, 0-100 and 101-2001, to create the price_bins feature.
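
One way to produce such a feature is `pandas.cut`, shown here as a minimal sketch. The bin edges come from the slide; the label strings, the function name, and the sample prices are assumptions.

```python
import pandas as pd

def add_price_bins(df, price_col="price"):
    """Add a 'price_bins' column with two bins: 0-100 and 101-2001."""
    out = df.copy()
    out["price_bins"] = pd.cut(
        out[price_col],
        bins=[0, 100, 2001],           # intervals [0, 100] and (100, 2001]
        labels=["0-100", "101-2001"],  # hypothetical label names
        include_lowest=True,           # keep a price of exactly 0 in the first bin
    )
    return out

prices = pd.DataFrame({"price": [0, 100, 101, 2001]})
binned = add_price_bins(prices)
print(binned)
```

Prices outside the 0-2001 range would receive a missing value in `price_bins`, so they need to be handled before modeling.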