Challenge
Challenge Statement
Objective: Predict the nightly price of AirBnB listings in London based on their characteristics.
Data: Inside AirBnB collects publicly-available information from AirBnB listings to estimate the effect of AirBnB on urban economies.
Guidelines
This challenge aims to give you experience with all of the basic tasks in a data science project. I suggest following a workflow like this:
- Read the dataset with
pandas
. - Do some exploratory analysis to understand the data.
- Transform the data into relevant variables. Consider:
- Handling missing data.
- Re-coding categorical variables.
- Standardizing / normalizing variables.
- Extracting information from text variables.
- Split the dataset into training / evaluation datasets.
- Choose a regression to model the price of AirBnBs. Consider:
- Linear regression.
- Decision trees and random forests.
- Neural networks (also see here).
- See more model suggestions here.
- Fit the regression model.
- Evaluate the model performance.
- Visualize / communicate the results.
To help you get started, I have made a basic project scaffold. In this file, I try to predict the price of AirBnB listings with a linear regression based only on the number of bedrooms in the listing. The model performs terribly, but this should be a good starting point. Consider how you can extract more relevant information from the dataset and make a more appropriate model choice to improve the accuracy of your model.
Regression
I recommend using sklearn for statistical modelling, a popular library for machine learning in Python.
The process of training and testing regression models with sklearn follows the same basic pattern for all supported models. You can find information and example code in the documentation for individual models, or you can look at the official sklearn tutorial. I don’t recommend working through this whole tutorial (it is about image classification). Instead, look through it and try to apply the same methodology (training/testing split, model fitting, model evaluation) to your own workflow.
Dataset
The Inside AirBnB dataset comes from information “scraped” from AirBnB. This means that the dataset is created by automatically recording information on public AirBnB listings in different urban areas.
I highly recommend that you look through the Inside AirBnB Documentation before getting started.
Also look through the Data Dictionary to get an understanding of the different dataset variables and which variables might be relevant to your analysis.
Recommendations
Before getting started, take a moment to plan your approach.
Consider:
- How was the dataset collected?
- What variables are in the dataset?
- What are the data types of the variables?
- Is there any missing data?
- What regression methodology should you use?
- What data format does your regression analysis need?
- Which variables are / are not relevant to your methodology?
- Do you need to transform these variables in any way?
- Does restricting the dataset to certain property types / locations improve predictability?
By the end of the session, aim to make progress on each major item in the Guidelines. Start simple, then refine your solution iteratively.
If you run into any difficulties, document them, adapt your solution, and feel free to ask for help!
Debrief
On the afternoon of Day 4, I will ask you to add your solution file to a shared GitHub repository.
We will then compare everyone’s solutions and discuss your results, the different challenges that you faced, and how you overcame them.