As of March
2024, there are over 8,900 Airbnb listings in the city of Amsterdam,
the Netherlands. However, the ongoing presence of these listings is not
without controversy. For instance, Geerte
Udo — director of Amsterdam’s tourism promotion agency — argues that
short-term rental properties exacerbate the acute housing shortage that
Amsterdam has been experiencing, while harming the character of local
communities. With housing affordability and supply being a perennial
policy concern for the electorate, both local and national politicians
have called for greater regulatory scrutiny over the operations of
platforms such as Airbnb. Indeed, in line with expectations, the
conversion of residential properties to short-term rental properties has
generated upward pressure on surrounding property values, which can have
an adverse impact on local housing affordability. A study on Airbnb activity in the Netherlands by Morpey (2022), for example, found that a 1% increase in neighborhood-level Airbnb activity leads to a 0.038% increase in residential property values. You can view the presentation introducing our application here: Fairbnb APP Introduction.
“Private holiday rental was harmful in cities like Amsterdam… it took away a lot of houses from the market. We as Amsterdam people have to fight hard to find a house anyway. So that’s what we think: from a city perspective, it takes out social cohesion and it doesn’t add any houses.”
Geerte Udo
amsterdam&partners foundation
2022 Bloomberg CityLab global cities summit
Cities around the world are at a crossroads for short-term rental
regulation, with policies falling on a spectrum from outright bans to
laissez-faire attitudes towards such platforms. More drastic approaches
include that of New York, which effectively bans Airbnbs through Local
Law 18 (Short-Term Rental Registration Law) with exceptionally
stringent compliance rules. Likewise, Singapore bans all short-term
rentals with durations
below three months, while Vienna places extremely tight restrictions on the neighborhoods in which short-term rentals are permitted, with the 2023 update to the Building Code expanding the scope
of such restrictions. In comparison, cities such as London have less
stringent regulations over short-term rentals, often with a cap on
the maximum number of nights permitted in a year.
We believe that the level of stringency should be determined through the democratic choice of local communities and the wider populace. But good democracy requires well-informed citizens. For the vast majority of cities that stop short of a total ban on Airbnb, policymakers and citizens need a versatile decision-making tool to evaluate the impact of Airbnb activity at both the city and neighborhood levels. This forms the basis for Fairbnb — our proposed short-term rental market intelligence tool for policymakers (along with community advocates and interested members of the public) to quickly diagnose short-term rental supply-demand mismatches at the hyper-local level, opening possibilities for policies that respond dynamically to the market, as introduced in the next section.
“Houses aren’t used for what they’re meant to be used: for living, and some neighborhoods are completely taken over by tourists. That’s a severe threat to the livability of our inhabitants.”
Laurens Ivens
Amsterdam’s Deputy Mayor for Housing
2020 interview with Politico
The Fairbnb web application primarily serves policymakers in the housing and planning fields. At its core is the reporting of a Fairness Index — a local, below-neighborhood-level measure of how much Airbnb prices deviate from baseline price levels, as determined by property features and locational characteristics. A value of 0 for the Fairness Index indicates that Airbnb prices in a particular area are in line with market expectations, while a negative value shows that Airbnb prices are lower than what would be expected for that area. The Fairness Index would be derived from a hedonic pricing model for Airbnb prices using a robust set of property-level predictors and controlling for time and neighborhood fixed effects. A fishnet of cells overlaid across Amsterdam then aggregates errors from model predictions to produce the Fairness Index metric.
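As a minimal sketch of the idea (the full implementation appears in the modelling sections below), the Index for a grid cell could be computed as the average prediction error of the listings falling within that cell. The object names hedonic_fit, listings_sf and grid_sf here are hypothetical placeholders, not objects defined elsewhere in this document:
library(dplyr)
library(sf)
# Hypothetical inputs (placeholders for illustration only):
#   hedonic_fit - a fitted lm() of log price on property and locational predictors
#   listings_sf - sf point layer of listings with an observed log_price column
#   grid_sf     - sf polygon fishnet with a cell_id column
listings_sf$error <- listings_sf$log_price - predict(hedonic_fit, newdata = listings_sf)
fairness_by_cell <- listings_sf %>%
  st_join(grid_sf, join = st_within) %>%                 # assign each listing to its cell
  st_drop_geometry() %>%
  group_by(cell_id) %>%
  summarise(fairness_index = mean(error, na.rm = TRUE))  # negative = priced below baseline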
A persistently large, negative Fairness Index may be of particular concern to policymakers, given that its implications are two-fold: it signals that, for a given area (below the neighborhood level), Airbnb prices are below the market equilibrium, indicating an excess supply of listings. This in turn reflects the excessive conversion of residential housing into short-term rentals, housing that could be put to better use as long-term residential tenancies on both efficiency and social justice grounds. The Index thus serves as a starting point for policymakers to further explore the underlying drivers of persistent deviations from the market equilibrium in an area, which will differ from place to place. Indeed, there are myriad possibilities for policy interventions in areas with a consistently large, negative Fairness Index, and here it is apt to introduce the regulatory environment that property owners work within.
As of December 2024, the City of Amsterdam requires homeowners to obtain a permit before renting out their properties to tourists, with each permit priced at a flat rate of €71.00. In addition, the number of nights allowed for short-term rentals is capped at 30 per calendar year, and the nightly guest limit is 4. Homeowners must also notify the municipality (the City of Amsterdam) of each short-term rental prior to the arrival of each guest.
Such regulations have evolved significantly over time. For instance, in January 2018 the city council tightened the limit on the maximum number of days each property can be let out in a year from 60 to 30, a change that came into force on 1 January 2019. The datasets provided contain information scraped from the platform prior to the implementation of this policy update.
In addition, the City of Amsterdam attempted to ban Airbnb listings altogether in three neighborhoods in the red-light district and parts of the canal ring from 2020, on the grounds of nuisance. However, this decision was overturned by the country's highest administrative court (the Council of State) in 2023. The judgment noted that the City should have first explored a reduced quota system or considered withdrawing permits from owners whose properties had received nuisance complaints from neighbors.
Meanwhile, property market analysts have observed that the recently tightened limits on rentals have cut Airbnb supply in Amsterdam by about half, although rising tourist inflows have placed upward pressure on the listing of residential properties for short-term rental.
In the European Union (EU), draft legislation is in place to reconcile the demand for short-term rental platforms such as Airbnb with the adverse impacts they can bring to local communities. This is in response to pressure from 13 EU cities, including Amsterdam and Barcelona, to address the conversion of residential properties intended for long-term leases into short-term rentals.
Under the draft Directive (set to come into force in 2026), companies such as Airbnb are required to share data with local government agencies on a monthly basis regarding 1) the number of nights the unit was rented out, 2) the number of guests, 3) the specific address, and 4) the URL of the listing. This policy is intended to enable local governments to actively monitor market dynamics in the short-term accommodation rental sector, and it is indeed the key proposition for an intelligence tool such as Fairbnb — exploiting granular short-term rental data for more precise policymaking.
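Purely for illustration, such a monthly data share could be represented as a simple table on the municipality's side; the field names and values below are hypothetical and are not taken from the Directive text:
library(tibble)
# Hypothetical structure of one record in a monthly platform data share
monthly_share <- tibble(
  nights_rented = 12,                                     # 1) nights the unit was rented out
  guest_count   = 28,                                     # 2) number of guests
  address       = "Voorbeeldstraat 1, 1011 AB Amsterdam", # 3) specific address (fictional)
  listing_url   = "https://www.airbnb.com/rooms/0000000"  # 4) URL of the listing (placeholder)
)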
With this regulatory context in mind, there are various actions policymakers can take based on the Fairbnb Index. For instance, if a certain part of a neighborhood has a consistently large, negative Fairness Index (indicating an over-supply of Airbnb properties), policymakers could adopt a combination of the following local policies, working along with residents to:
Set a limit on the total number of permits for that particular area while introducing a competitive bidding process for permits, with a pre-determined reserve price. This is in contrast to the flat €71 permit price under existing regulations.
Introduce a minimum quantum of affordable / long-term leasehold units for that specific area, to preserve its character and community.
Levy taxes on vacancy in between short-term rental periods, to disincentivize speculation by Airbnb property owners.
Shorten the maximum allowed number of days for a property to be rented out on a short-term basis.
Implement an additional community tax on short-term rental hosts.
Reinvest all short-term rental related revenue into local community infrastructure and housing programs.
The strength of hyper-local market intelligence, in contrast to city-wide blanket policies, lies in the ability to differentiate policy across and within neighborhoods. The small scale of intervention, in combination with market-based mechanisms, also mitigates unintended spillover effects on supply-demand dynamics in adjacent areas and neighborhoods.
Against this backdrop, the next section introduces the 2018 – 19 Airbnb data available for the present analysis, along with supplementary datasets that are used for the calculation of the Fairbnb Index.
This section introduces the original datasets provided relating to the scraped Airbnb listings for Amsterdam, and some external datasets we have obtained that may improve the predictive power of models of Airbnb rental prices.
For reproducibility, the data wrangling workflow is presented below for both the original datasets and the external datasets obtained.
The datasets provided were scraped by Erik Bruin on 6 Dec 2018, covering daily price data for 20,030 unique Airbnb listings in Amsterdam from 6 Dec 2018 to 6 Dec 2019 (365 days). In total, there are six datasets available for analysis; the key ones are described as follows:
calendar.csv — records the availability of the 20,030 unique Airbnb listings throughout the study period (each listing having 365 records), along with price information for each available day; this forms the primary source of data for this analysis
listings_details.csv — contains information on each listing, such as host name, neighborhood, listing type (private room or entire place), reviews available, and a description
reviews_details.csv — provides the text content for each review under each listing, and the corresponding date
Calendar data
For calendar.csv, the cleaning process involves 1) removing listings where there is no price information available for any day in the 365-day study period and 2) converting the $price variable to numeric.
# Load the calendar data and set appropriate column types
calendar <- read.csv("./data/calendar.csv")
calendar$listing_id <- as.factor(calendar$listing_id)
calendar$available <- as.factor(calendar$available)
calendar$date <- as.POSIXct(calendar$date, format = "%Y-%m-%d", tz = "UTC")
# Convert price strings (e.g. "$1,200.00") to numeric
calendar$price[calendar$price == ""] <- NA
calendar$price <- gsub("\\$", "", calendar$price)
calendar$price <- gsub(",", "", calendar$price)
calendar$price <- as.numeric(calendar$price)
# Drop listings with no price information on any day of the study period
calendar <- calendar %>%
group_by(listing_id) %>%
filter(!all(is.na(price))) %>%
ungroup()
# Keep only priced (available) days and drop the now-redundant availability flag
calendar <- calendar[!is.na(calendar$price), ]
calendar <- calendar %>%
select(-available)
Listings details data
For listings_details.csv, variables for geometry, superhost status, property and room type, bedding, occupancy, and the numbers of bathrooms, bedrooms and beds are extracted.
listings_details <- read.csv("./data/listings_details.csv")
listings_details <- listings_details %>%
select(
id,
latitude,
longitude,
# categorical variables
host_is_superhost,
property_type,
room_type,
bed_type,
# numeric variables
accommodates,
bathrooms,
bedrooms,
beds) %>%
mutate(host_is_superhost = ifelse(is.na(host_is_superhost), FALSE, host_is_superhost))
Reviews data
There are a few limitations to using the reviews data as an input for subsequent models, namely that 1) 12.0% of listings in the original dataset are new listings with 0 reviews, and 2) 31.0% of listings with reviews available have 5 or fewer reviews. It is therefore difficult, and indeed problematic, to assume that listings without (or with very few) reviews share similar price trends with well-reviewed listings, all else being equal. As such, the reviews dataset will not be used as a model input in the present analysis.
reviews <- read.csv("./data/reviews.csv")
reviews_summary <- reviews %>%
group_by(listing_id) %>%
summarise(count_reviews = n())
# Proportion without reviews (full listings dataset)
1 - nrow(reviews_summary) / nrow(listings_details)
# Proportion with 5 or fewer reviews
reviews_summary %>%
summarise(prop_leq_5 = mean(count_reviews <= 5)) %>%
pull(prop_leq_5)
ggplot(reviews_summary, aes(x = count_reviews)) +
geom_histogram(binwidth = 10, fill = "#006e70", color = "white") +
labs(
title = "Histogram of number of reviews per listing",
subtitle = "Data: 17,624 listings where reviews are available",
x = "Number of reviews",
y = "Count"
) +
scale_y_continuous(labels = comma) +
theme_minimal(base_family = "ubuntu")
Beyond the original datasets provided, additional features are incorporated into our analysis, as described below:
attractions.geojson — using Tripadvisor, the 10 most popular places of interest for tourists are obtained, including the Rijksmuseum and the Red Light District
trammetro-2018.geojson — locations of tram and metro stops in 2018, available directly on Amsterdam’s open data portal
restaurants-osm.geojson — locations of 1,801 restaurants located in Amsterdam, obtained using the QuickOSM plugin in QGIS
hotels-osm.geojson — locations of 382 hotels located in Amsterdam, obtained using the QuickOSM plugin in QGIS
holidays-ref.csv — dates of 2018 and 2019 public holidays in the Netherlands, obtained from timeanddate.com
sf.neighborhd25 and sf.neighborhd110 — neighborhood polygon layers downloaded from Amsterdam’s open data portal
Attractions data
The locations of the top 10 most popular places of interest for tourists on Tripadvisor are plotted manually on QGIS and loaded as attractions.geojson.
attraction <- st_read('./data/attractions.geojson')
attraction <- attraction %>%
st_transform('EPSG:28992')
Neighborhood data
The neighborhood layers data are directly queried from Amsterdam’s open data portal in the code chunk below, and neighborhood information is joined to each listing.
neighborhd25.url <- "https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=INDELING_GEBIED_EXWATER&THEMA=gebiedsindeling"
neighborhd110.url <- "https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=INDELING_WIJK_EXWATER&THEMA=gebiedsindeling"
sf.neighborhd25 <- st_read(neighborhd25.url) %>%
st_transform('EPSG:28992')
sf.neighborhd110 <- st_read(neighborhd110.url) %>%
st_transform('EPSG:28992')
listings_details <- listings_details %>%
filter(id %in% unique(calendar$listing_id))
listings_details_sf <- st_as_sf(
listings_details,
coords = c("longitude", "latitude"), # Specify the coordinate columns
crs = 4326 # Use the WGS84 CRS (standard for lat/long)
) %>%
st_transform('EPSG:28992')
listings_details_sf <- listings_details_sf %>%
filter(apply(st_intersects(., sf.neighborhd25, sparse = FALSE), 1, any)) %>%
st_join(., sf.neighborhd25[, c("Gebied")],
left = TRUE
) %>%
rename(neighborhood25 = Gebied) %>%
st_join(., sf.neighborhd110[, c("Wijk")],
left = TRUE
) %>%
rename(neighborhood110 = Wijk)
listings_details <- listings_details %>%
filter(id %in% listings_details_sf$id)
Transit data
trammetro-2018.geojson from Amsterdam’s open data portal is directly available and is visualized below:
transit <- st_read("./data/trammetro-2018.php.geojson")
transit <- transit %>%
distinct(geometry, .keep_all = TRUE)%>%
st_transform('EPSG:28992')
ggplot() +
geom_sf(data = sf.neighborhd25, fill = NA, color = "gray80") +
geom_point(data = subset(transit, Modaliteit == "Tram"),
aes(x = st_coordinates(geometry)[,1],
y = st_coordinates(geometry)[,2]),
color = "#00b2b3", alpha = 0.5, size = 1) +
geom_point(data = subset(transit, Modaliteit == "Metro"),
aes(x = st_coordinates(geometry)[,1],
y = st_coordinates(geometry)[,2]),
color = "#e74c3c", alpha = 0.5, size = 1) +
labs(title = "Transit stations in Amsterdam",
subtitle = "The GVB network: Tram (blue) and Metro (red) stops",
x = "Longitude", y = "Latitude") +
theme_minimal(base_family = "ubuntu") +
theme(
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank(),
axis.title = element_blank()
)
Hotels and restaurants data
As above, hotel and restaurant data are obtained in QGIS using QuickOSM, with the outputs loaded below.
hotels <- st_read("./data/hotels-osm.geojson")
hotels <- hotels %>%
select(full_id, name, geometry) %>%
st_transform('EPSG:28992')
restaurant <- st_read("./data/restaurants-osm.geojson")
restaurant <- restaurant%>%
select(full_id, name, geometry)%>%
st_transform('EPSG:28992')
Holidays data
A $pub_hol dummy variable is created for the calendar dataset to indicate if a given day falls on a public holiday in the Netherlands. A similar variable $pub_hol_week is created to indicate if a given day falls within ±3 days from a public holiday.
holidays <- read.csv("./data/holidays-ref.csv")
calendar$date <- as.POSIXct(calendar$date, format = "%Y-%m-%d", tz = "UTC")
holidays$Date <- as.POSIXct(holidays$Date, format = "%d/%m/%Y", tz = "UTC")
calendar <- calendar %>%
mutate(pub_hol = ifelse(date %in% unique(holidays$Date), 1, 0))
holiday_ranges <- holidays %>%
mutate(Date = as.Date(Date)) %>% # Ensure Date is in Date format
rowwise() %>%
summarise(
Date = list(Date + c(-3, -2, -1, 0, 1, 2, 3)), # Generate date ranges
.groups = "drop"
) %>%
unnest(cols = c(Date)) %>% # Unnest the list of dates
distinct() %>% # Remove duplicates
arrange(Date) # Sort the dates
holiday_ranges <- holiday_ranges %>%
mutate(Date = as.POSIXct(Date, tz = "UTC"))
calendar <- calendar %>%
mutate(pub_hol_week = ifelse(date %in% unique(holiday_ranges$Date), 1, 0))
To account for time fixed effects (seasonality in Airbnb pricing), the variables $month and $day (day of the week) are created.
calendar <- calendar %>%
mutate(
day = format(calendar$date, "%a"), # Day of the week (Mon, Tue, ...)
month = format(calendar$date, "%b") # Month abbreviation (Jan, Feb, ...)
)
## plot mean price by month
# calendar_bymonth <- calendar_cleaned %>%
# group_by(month) %>%
# summarise(mean_price = mean(price, na.rm = T))
#
# calendar_bymonth$month <- as.numeric(as.character(calendar_bymonth$month))
# plot(x = calendar_bymonth$month, y = calendar_bymonth$mean_price, type = "l")
For the 10,510 listings, the daily rate ranges from €21.43 to €8,500, with a mean of €179.90 and a median of €148.75. It is evident that ultra-luxurious units (with particularly high daily rates, given their bespoke features and services) are contained within the dataset. These outliers are likely to distort the conclusions drawn from the subsequent regression analysis. Thus, for the present analysis, we apply the commonly used rule that a data point is an outlier if it lies more than 1.5 x IQR above the third quartile (that is, above Q3 + 1.5 x IQR) or more than 1.5 x IQR below the first quartile (below Q1 - 1.5 x IQR).
Applying this rule to the mean daily rate of each listing yields an upper price threshold of €354.62 per night, which removes 621 (5.9%) of the 10,510 listings. Individual price records are then further filtered to those below twice this threshold (€709.23), as some properties have anomalous prices (such as listing #22807937, which charges €8,070 on a few days and €80 on all other days), likely a minor error introduced during the web scraping process.
listings_details$id <- as.factor(listings_details$id)
listings_details_sf$id <- as.factor(listings_details_sf$id)
calendar <- calendar %>%
filter(listing_id %in% unique(listings_details$id))
calendar_price_summary <- calendar %>%
group_by(listing_id) %>%
summarise(mean_price = mean(price, na.rm = T))
summary(calendar_price_summary$mean_price)
q1 <- quantile(calendar_price_summary$mean_price, 0.25)
q3 <- quantile(calendar_price_summary$mean_price, 0.75)
iqr <- q3 - q1
# Define outlier thresholds
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
lower_bound
upper_bound
upper_bound * 2
calendar_price_summary %>%
filter(mean_price < lower_bound | mean_price > upper_bound) %>%
nrow()
# [1] 621
calendar_price_summary <- calendar_price_summary %>%
filter(mean_price < 354.62)
calendar <- calendar %>%
filter(listing_id %in% unique(calendar_price_summary$listing_id))
listings_details <- listings_details %>%
filter(id %in% unique(calendar_price_summary$listing_id))
listings_details_sf <- listings_details_sf %>%
filter(id %in% unique(calendar_price_summary$listing_id))
# Remove price records outliers for calendar
calendar <- calendar %>%
filter(price < 709.23)
listings_details <- listings_details %>%
filter(id %in% unique(calendar$listing_id))
listings_details_sf <- listings_details_sf %>%
filter(id %in% unique(calendar$listing_id))
Within the listings dataset, there are 14 listings lacking information on property attributes (such as the number of beds). These are removed from the dataset, leaving us with 9,875 listings.
listings_details %>%
summarise(na_rows = sum(rowSums(is.na(.)) > 0))
listings_details <- na.omit(listings_details)
listings_details_sf <- na.omit(listings_details_sf)
calendar <- calendar %>%
filter(listing_id %in% unique(listings_details$id))
For each listing, the average distances to the three nearest transit stops, the fifteen nearest restaurants and the three nearest popular tourist attractions are calculated.
predictors.unique <- listings_details_sf %>%
mutate(tran_nn3 = nn_function(st_coordinates(listings_details_sf), st_coordinates(transit), 3),
rest_nn15 = nn_function(st_coordinates(listings_details_sf), st_coordinates(restaurant), 15),
attr_nn3 = nn_function(st_coordinates(listings_details_sf), st_coordinates(attraction), 3)) %>%
rename(listing_id = id)
listings_details <- st_drop_geometry(predictors.unique)
listings_details_sf <- predictors.unique
calendar <- calendar %>%
left_join(predictors.unique, by = "listing_id") %>%
na.omit()
nrow(calendar)
# [1] 1059688
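The nn_function() helper used above is not defined in this document; it is assumed to return, for each point in the first layer, the mean distance to its k nearest features in the second layer. A minimal sketch of an equivalent helper, assuming the FNN package, is shown below:
library(FNN)
# Sketch of an nn_function()-style helper (an assumption, not the original implementation):
# for each row of measureFrom (a coordinate matrix), compute the mean distance to its k
# nearest neighbours among the rows of measureTo. With EPSG:28992 coordinates, distances
# are in meters.
nn_function <- function(measureFrom, measureTo, k) {
  nn <- FNN::get.knnx(data = measureTo, query = measureFrom, k = k)
  rowMeans(nn$nn.dist)
}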
Here, hotels, hostels and non-conventional property types such as boats and barns are removed from the dataset, as the focus of the present analysis is on mainstream, residential properties.
calendar <- calendar %>%
mutate(across(where(is.character), as.factor))
# summary(calendar$property_type)
exclude_property_types <- c("Houseboat", "Boutique hotel", "Boat", "Other",
"Hotel", "Cabin", "Cottage", "Hostel", "Camper/RV",
"Barn", "Casa particular (Cuba)", "Aparthotel", "Nature lodge", "Tiny house", "Chalet")
calendar <- calendar %>%
filter(!property_type %in% exclude_property_types)
calendar <- calendar %>%
filter(room_type != "Shared room")
calendar <- calendar %>%
mutate(entire_place = ifelse(room_type == "Entire home/apt", 1, 0))
calendar$room_type = NULL
summary(calendar$property_type)
listings_details <- listings_details %>%
filter(listing_id %in% unique(calendar$listing_id))
listings_details_sf <- listings_details_sf %>%
filter(listing_id %in% unique(calendar$listing_id))
unique(calendar$host_is_superhost)
calendar <- calendar %>%
mutate(superhost = if_else(host_is_superhost %in% c("", "f"), 0, 1))
listings_details <- listings_details %>%
mutate(superhost = if_else(host_is_superhost %in% c("", "f"), 0, 1))
listings_details_sf <- listings_details_sf %>%
mutate(superhost = if_else(host_is_superhost %in% c("", "f"), 0, 1))
calendar <- calendar %>%
mutate(superhost = factor(superhost))
listings_details <- listings_details %>%
mutate(superhost = factor(superhost))
listings_details_sf <- listings_details_sf %>%
mutate(superhost = factor(superhost))
# Convert binary variables to factors
calendar <- calendar %>%
mutate(
entire_place = factor(entire_place),
pub_hol = factor(pub_hol),
pub_hol_week = factor(pub_hol_week)
)
From above, the data wrangling process results in 9,378 Airbnb listings in Amsterdam, associated with 976,204 price records available for regression analysis. Each price record is associated with the following variables:
Dependent variable
price — daily rate of Airbnb listing

Continuous predictor variables
accommodates — number of guests permitted
bathrooms — number of bathrooms in unit
bedrooms — number of bedrooms in unit
beds — number of beds in unit
tran_nn3 — average distance to the three nearest transit stops (metro / tram)
rest_nn15 — average distance to the fifteen nearest restaurants
attr_nn3 — average distance to the three nearest tourist attractions

Categorical predictor variables
property_type — type of property, namely: Apartment, Townhouse, Bed and breakfast, House, Guesthouse, Condominium, Guest suite, Loft, Serviced apartment, Villa, and Bungalow
entire_place — whether guests have the entire place to themselves (1) or just a private room (0)
pub_hol — whether the date falls on a public holiday in the Netherlands
pub_hol_week — whether the date falls within ±3 days from a public holiday in the Netherlands
day — day of the week
month — month of the year
superhost — whether the host is a superhost (1) or not (0)
bed_type — type of beds available, namely: Real Bed, Pull-out Sofa, Futon, or Airbed
neighborhood110 — neighborhood name
Out of the 365-day study period, the 9,378 listings are marked as available for an average of 104 days, well above the pre-2019 yearly limit of 60 days set by the City of Amsterdam. This illuminates one interesting behavioral aspect of property owners — the tendency to speculatively list their properties for a wider date range than what is legally permitted. It is likely that such properties sit vacant during the short periods between bookings — a rather inefficient use of space that forms one key motivation for this analysis.
calendar_count <- calendar %>%
group_by(listing_id) %>%
summarise(record_count = n(), .groups = "drop")
mean(calendar_count$record_count)
# [1] 104.0951
The histogram below shows the distribution of daily rates for the filtered Airbnb listings. As the figure shows a skew to the right, it is useful to perform a log-transformation of Airbnb prices for subsequent analyses.
calendar_price_summary <- calendar %>%
group_by(listing_id) %>%
summarise(mean_price = mean(price, na.rm = T))
ggplot(calendar_price_summary, aes(x = mean_price)) +
geom_histogram(binwidth = 20, fill = "#006e70", color = "white") +
labs(
title = "Histogram of mean daily rates for Airbnb listings in Amsterdam",
subtitle = "Data: 9,378 listings scraped on 6 Dec 2018 (after cleaning)",
x = "Mean daily rate (€)",
y = "Count"
) +
scale_y_continuous(labels = comma) +
theme_minimal(base_family = "ubuntu")
The result of the log-transformation (creating a new variable log_price) is indicated by the histogram below. While there is now a slight left skew, the degree of skew seems to be reduced, with the shape following more closely that of a normal distribution.
calendar <- calendar %>%
mutate(log_price = log(price))
calendar_price_summary <- calendar %>%
group_by(listing_id) %>%
summarise(mean_logged_price = mean(log_price, na.rm = T))
ggplot(calendar_price_summary, aes(x = mean_logged_price)) +
geom_histogram(binwidth = 0.15, fill = "#00999e", color = "white") +
labs(
title = "Log-transformed mean daily rates in Amsterdam",
subtitle = "Data: 9,378 listings scraped on 6 Dec 2018 (after cleaning)",
x = "Mean daily rate (€)",
y = "Count"
) +
scale_y_continuous(labels = comma) +
theme_minimal(base_family = "ubuntu")
The table below shows the summary statistics for the numeric predictor variables in the dataset.
str(calendar)
summ_stats <- calendar %>% st_drop_geometry()
summ_stats <- summ_stats %>%
select(accommodates, bathrooms, bedrooms, beds, tran_nn3, rest_nn15, attr_nn3)
summ_stats <- summ_stats %>%
psych::describe() %>%
dplyr::select(mean:median,min,max) %>%
mutate(Mean = as.character(round(mean)),
SD = as.character(round(sd)),
Median = as.character(round(median)),
Min = as.character(round((min))),
Max = as.character(round(max))) %>%
dplyr::select(-(mean:max)) %>%
# filter(Max > 1) %>%
as.data.frame()
summ_stats$Description <-
c("Occupancy",
"Number of bathrooms",
"Number of bedrooms",
"Number of beds",
"Average distance to 3 nearest transit stations",
"Average distance to 3 nearest restaurants",
"Average distance to 3 nearest tourist attractions")
summ_stats %>%
dplyr::select(Description, Mean, SD, Median, Min, Max) %>%
kable(format = "html", caption = "Summary statistics for numeric predictor variables") %>%
kable_styling(bootstrap_options = c("hover", "striped", "condensed"),
fixed_thead = TRUE)
| Variable | Description | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|---|
| accommodates | Occupancy | 3 | 1 | 2 | 1 | 12 |
| bathrooms | Number of bathrooms | 1 | 0 | 1 | 0 | 10 |
| bedrooms | Number of bedrooms | 1 | 1 | 1 | 0 | 6 |
| beds | Number of beds | 2 | 1 | 1 | 0 | 12 |
| tran_nn3 | Average distance to 3 nearest transit stations (m) | 480 | 501 | 339 | 132 | 4448 |
| rest_nn15 | Average distance to 15 nearest restaurants (m) | 379 | 413 | 245 | 17 | 3762 |
| attr_nn3 | Average distance to 3 nearest tourist attractions (m) | 2046 | 1729 | 1533 | 189 | 10365 |
Here, histogram plots of the seven numeric predictors are produced to examine any potential skew, which can be mitigated with log-transformation.
numeric_vars <- calendar %>%
st_drop_geometry() %>%
select(accommodates, bathrooms, bedrooms, beds, tran_nn3, rest_nn15, attr_nn3)
ggplot(numeric_vars %>%
gather(key = "variable", value = "value"), aes(x = value)) +
geom_histogram(bins = 15, fill = "#006e70", color = "white", alpha = 1) +
facet_wrap(~ variable, scales = "free", ncol = 2) +
theme_minimal(base_family = "ubuntu") +
labs(
title = "Histograms for numeric predictor variables",
subtitle = "Includes property attributes and locational characteristics",
x = "Value",
y = "Count"
) +
theme(
text = element_text(family = "Ubuntu"),
plot.title = element_text(family = "ubuntu", size = 16),
axis.title.x = element_text(family = "ubuntu"),
axis.title.y = element_text(family = "ubuntu")
) +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma)
Indeed, there is an obvious skew for five of these predictors (attr_nn3, bathrooms, beds, rest_nn15 and tran_nn3), and a log transformation is thus applied to them, with the resulting histograms as follows:
calendar <- calendar %>%
mutate(
log_attr_nn3 = log(attr_nn3),
log_bathrooms = log(bathrooms + 1),
log_beds = log(beds + 1),
log_rest_nn15 = log(rest_nn15),
log_tran_nn3 = log(tran_nn3)
)
numeric_vars <- calendar %>%
st_drop_geometry() %>%
select(log_bathrooms, log_beds, log_tran_nn3, log_rest_nn15, log_attr_nn3)
ggplot(numeric_vars %>%
gather(key = "variable", value = "value"), aes(x = value)) +
geom_histogram(bins = 15, fill = "#00999e", color = "white", alpha = 1) +
facet_wrap(~ variable, scales = "free", ncol = 2) +
theme_minimal(base_family = "ubuntu") +
labs(
title = "Histograms for log-transformed numeric predictor variables",
subtitle = "Log-transformation involving adding 1 to the number of beds and bathrooms",
x = "Value",
y = "Count"
) +
theme(
text = element_text(family = "Ubuntu"),
plot.title = element_text(family = "ubuntu", size = 16),
axis.title.x = element_text(family = "ubuntu"),
axis.title.y = element_text(family = "ubuntu")
) +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma)
The correlation matrix below shows the pairwise Pearson correlation coefficients between numeric variables (with log-transformation, if applicable), including the dependent variable (log-transformed Airbnb prices).
numeric_vars <- calendar %>%
st_drop_geometry() %>%
select(log_price, accommodates, log_bathrooms, bedrooms, log_beds, log_tran_nn3, log_rest_nn15, log_attr_nn3)
corr <- round(cor(numeric_vars), 2)
p <- ggcorrplot(corr,
colors = rev(corr_mat_pal),
lab = FALSE,
type = "lower",
insig = "blank",
tl.cex = 6) +
labs(title = "Correlation matrix for continuous predictor variables") +
theme_minimal(base_family = "Ubuntu") +
theme(
text = element_text(family = "Ubuntu"),
axis.text.x = element_text(size = 5, family = "Ubuntu"),
axis.text.y = element_text(size = 8, family = "Ubuntu"),
plot.title = element_text(size = 14, family = "Ubuntu"),
plot.subtitle = element_text(size = 10, family = "Ubuntu"),
axis.title = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()
)
p +
geom_text(aes(label = value),
family = "Ubuntu",
size = 4,
fontface = "plain")
There appears to be a strong collinearity between log_rest_nn15 and log_attr_nn3. This may be due to the fact that tourist hotspots in Amsterdam are likely to have greater restaurant offerings, given the higher footfall. Likewise, the strong collinearity between log_rest_nn15 and log_tran_nn3 suggests that restaurants are also likely to cluster around transit areas, which intuitively have greater visitor traffic. Unsurprisingly, the correlation between number of beds and number of bedrooms is high. These findings will be taken into account in the subsequent modelling section when making comparisons between different model specifications.
Meanwhile, the figure below presents the correlation of different numeric predictor variables with the dependent variable, in descending order. All numeric predictors have a moderate-to-weak correlation with log-transformed daily rates, suggesting that interaction terms may be helpful in subsequent regression analysis.
numeric_vars %>%
correlate() %>%
select(term, log_price) %>%
mutate(Correlation = if_else(log_price > 0, "Positive", "Negative")) %>%
drop_na() %>%
mutate(term = fct_reorder(term, desc(abs(log_price)))) %>%
ggplot(aes(x = term, y = log_price, fill = Correlation)) +
geom_col(alpha = 0.8, width = 0.5) +
scale_fill_manual(values = c("#e74c3c", "#006e70")) +
theme_minimal(base_family = "ubuntu") +
labs(
title = "Correlation coefficients of numeric variables",
subtitle = "With log-transformed Airbnb daily rates",
x = "Numeric variables",
y = "Correlation with logged daily rates"
) +
theme(
text = element_text(family = "ubuntu"),
plot.title = element_text(family = "ubuntu", size = 16),
axis.text.x = element_text(angle = 45, hjust = 1, family = "ubuntu"),
axis.text.y = element_text(family = "ubuntu"),
axis.title.x = element_text(family = "ubuntu"),
axis.title.y = element_text(family = "ubuntu")
)
As a complement to the correlation values calculated, the figures below present selected scatterplots of numeric predictor variables with log-transformed Airbnb prices.
Distance to transit
ggplot(calendar, aes(x = log_tran_nn3, y = log_price)) +
geom_point(color = "black", alpha = 0.008) +
geom_smooth(method = "lm", color = "#e74c3c", se = FALSE) + # Best fit line
labs(
title = "Scatterplot of logged daily rates vs logged distance to transit",
subtitle = "Best-fit line indicated in red",
x = "Log-transformed distance to transit",
y = "Log-transformed Airbnb daily rates"
) +
theme_minimal(base_family = "ubuntu") +
theme(
text = element_text(family = "ubuntu"),
plot.title = element_text(family = "ubuntu"),
plot.subtitle = element_text(family = "ubuntu"),
axis.title.x = element_text(family = "ubuntu"),
axis.title.y = element_text(family = "ubuntu")
)
Number of beds
ggplot(calendar, aes(x = log_beds, y = log_price)) +
geom_point(color = "black", alpha = 0.008) +
geom_smooth(method = "lm", color = "#e74c3c", se = FALSE) + # Best fit line
labs(
title = "Scatterplot of logged daily rates vs logged number of beds",
subtitle = "Best-fit line indicated in red",
x = "Log-transformed number of beds",
y = "Log-transformed Airbnb daily rates"
) +
theme_minimal(base_family = "ubuntu") +
theme(
text = element_text(family = "ubuntu"),
plot.title = element_text(family = "ubuntu"),
plot.subtitle = element_text(family = "ubuntu"),
axis.title.x = element_text(family = "ubuntu"),
axis.title.y = element_text(family = "ubuntu")
)
The box plots below summarize the (log-transformed) Airbnb daily rates for each level of the categorical predictor variables in the dataset (with the exception of neighborhood110, which has 110 levels).
categorical_vars <- calendar %>%
st_drop_geometry() %>%
select(log_price, property_type, entire_place, pub_hol, pub_hol_week, day, month,
superhost, bed_type)
cat_vars <- c("property_type", "entire_place", "pub_hol", "pub_hol_week", "day", "month",
"superhost", "bed_type")
plot_list <- list()
for (var in cat_vars) {
plot <- ggplot(categorical_vars, aes(x = .data[[var]], y = log_price)) +
geom_boxplot(fill = '#006e70') +
scale_y_continuous(labels = comma) +
labs(title = paste("Logged daily rates by", var),
x = var,
y = "Logged daily rates") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, family = "ubuntu"),
axis.text.y = element_text(family = "ubuntu"),
axis.title.x = element_text(family = "ubuntu"),
axis.title.y = element_text(size = 8, family = "ubuntu"),
plot.title = element_text(size = 10, family = "ubuntu")
)
plot_list[[var]] <- plot
}
combined_plot <- wrap_plots(plot_list, ncol = 3)
combined_plot
In line with expectations, there is a price premium for public holidays and weeks with public holidays. Having access to the entire place also fetches a large premium.
Meanwhile, on the Airbnb platform, selected property owners are granted the “Superhost” badge signifying exceptional hospitality standards. In the cleaned dataset, 2,034 (21.7%) property owners are classified as Superhosts. Interestingly, the box plots above do not show a price premium for Superhost properties.
# Create a data frame for plotting
host_status <- as.data.frame(table(listings_details$host_is_superhost))
colnames(host_status) <- c("Superhost", "Count")
host_status <- host_status %>%
mutate(Proportion = round(Count / sum(Count), 4))
# Plot
ggplot(host_status, aes(x = Count, y = Superhost)) +
geom_bar(stat = "identity", color = "#006e70", fill = "#006e70", width = 0.5) +
labs(
title = "Superhost status of Airbnb listings",
subtitle = "Data: 9,378 listings scraped on 6 Dec 2018 (after removing outliers)",
x = "Number of listings",
y = "Superhost status"
) +
scale_x_continuous(labels = comma) +
theme_minimal(base_family = "ubuntu")
The map below shows the distribution of average daily rates for Airbnb listings, aggregated at the neighborhood level. With the exception of the business district of Amstel III/Bullewijk in the southeast, average daily rates are highest in the traditional city center and decrease with distance from it, in agreement with the classic bid-rent theory in real estate. The seemingly anomalous observation for Amstel III/Bullewijk could be due to a client base consisting more of business than leisure travelers, given the corporate setting of the overall neighborhood. Another explanation could be the lack of alternative accommodation providers in the vicinity and the lack of competition within the platform (the number of Airbnb listings there being the 3rd lowest among all 110 neighborhoods).
price_summary <- calendar %>%
group_by(neighborhood110) %>%
summarise(
mean_price = mean(price, na.rm = TRUE),
count = n() # Add count of records in each group
)
price_summary_sf <- price_summary %>%
left_join(sf.neighborhd110, by = c("neighborhood110" = "Wijk")) %>%
st_as_sf()
ggplot(price_summary_sf) +
geom_sf(aes(fill = mean_price)) +
scale_fill_viridis(option = "magma", name = "Mean daily rate", direction = -1) +
theme_minimal() +
labs(
title = "Mean daily rate by neighborhood",
subtitle = "Base layer: Amsterdam Open Data Portal"
) +
theme(
text = element_text(family = "ubuntu"),
plot.title = element_text(family = "ubuntu"),
plot.subtitle = element_text(family = "ubuntu"),
axis.title = element_blank(),
axis.text = element_blank(),
panel.grid = element_blank()
)
Prior to variable selection, the cleaned data is divided into a 70% training set and a 30% test set. The createDataPartition function is applied to the neighborhood variable to stratify the split, ensuring that each of the 110 neighborhoods is covered in both the training and test sets. For reproducibility, a seed of 5080 is set.
set.seed(5080)
train_indices <- createDataPartition(calendar$neighborhood110, p = 0.7, list = FALSE)
train_data <- calendar[train_indices, ]
test_data <- calendar[-train_indices, ]
Using the variables described above (including log-transformations when necessary), a base linear model (Model #1) is created containing all predictors. Based on this base specification, Model #2 modifies the neighborhood predictor, using a coarser spatial unit of analysis (25 districts rather than 110 neighborhoods). Following the observations of collinearity among some locational characteristics predictors, Model #3 modifies the base model by dropping the distance to the nearest restaurants (log_rest_nn15). Lastly, Models #4 and #5 experiment with interaction terms, with Model #4 testing whether the interaction between occupancy and property type improves the model, and Model #5 investigating the interaction between occupancy and whether guests have access to the entire place.
These models are fitted on the training set and evaluated on the test set, with the mean absolute error (MAE) and mean absolute percentage error (MAPE) calculated for each regression specification. These two metrics are then compared across the five models, with lower values indicating a better model fit, as presented in the summary table below.
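Concretely, with y_i the observed log-transformed price of test record i and ŷ_i the model prediction, the error metrics computed by calc_metrics() below are:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|, \qquad \mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|\hat{y}_i - y_i\right|}{y_i}$$

Note that both metrics are computed on the log scale here, so the MAPE values reported below are percentage errors of log prices rather than of raw euro rates.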
any(is.na(calendar))
# reg1
reg1_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood110 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3
reg1 <- lm(reg1_formula, data = train_data)
# reg2
reg2_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood25 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3
reg2 <- lm(reg2_formula, data = train_data)
# reg3
reg3_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood110 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_tran_nn3
reg3 <- lm(reg3_formula, data = train_data)
# reg4 with interaction: accommodates * property_type
reg4_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood110 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3 +
accommodates:property_type
reg4 <- lm(reg4_formula, data = train_data)
# reg5 with interaction: accommodates * entire_place
reg5_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood110 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3 +
accommodates:entire_place
reg5 <- lm(reg5_formula, data = train_data)
summary(reg1)
summary(reg2)
summary(reg3)
summary(reg4)
summary(reg5)
reg_models <- list(reg1 = reg1, reg2 = reg2, reg3 = reg3, reg4 = reg4, reg5 = reg5)
calc_metrics <- function(model, test_data) {
test_data <- test_data %>%
mutate(
log_pricePredict = predict(model, test_data),
log_price.Error = log_pricePredict - log_price,
log_price.AbsError = abs(log_price.Error),
log_price.APE = log_price.AbsError / log_price
)
test_data %>%
summarise(
Mean_Error = mean(log_price.Error, na.rm = TRUE),
Mean_Absolute_Error = mean(log_price.AbsError, na.rm = TRUE),
Mean_APE = mean(log_price.APE, na.rm = TRUE)
)
}
metrics_list <- map(reg_models, calc_metrics, test_data = test_data)
metrics_df <- bind_rows(metrics_list, .id = "Model") %>%
mutate(across(where(is.numeric), ~ signif(.x, 4)))
save(metrics_df, file = "comparisons_models.RData")
load("comparisons_models.RData")
metrics_df %>%
kable(format = "html", caption = "Error metrics for different model specifications") %>%
kable_styling(bootstrap_options = c("hover", "striped", "condensed"),
fixed_thead = TRUE)
Model | Mean_Error | Mean_Absolute_Error | Mean_APE |
---|---|---|---|
reg1 | -0.001238 | 0.2444 | 0.04979 |
reg2 | -0.001193 | 0.2489 | 0.05073 |
reg3 | -0.001253 | 0.2445 | 0.04981 |
reg4 | -0.001259 | 0.2436 | 0.04963 |
reg5 | -0.001208 | 0.2427 | 0.04942 |
From the outputs above, one can observe that:
Controlling for more granular neighborhood fixed effects results in lower MAE and MAPE (comparing Models #1 and #2)
Removing distance to the nearest restaurants from the predictors does not improve MAE and MAPE (comparing Models #1 and #3)
Controlling for the interaction between occupancy and entire_place and between occupancy and property_type improves both the MAE and MAPE.
Following this, the final model incorporates:
Neighborhood fixed effects (110 neighborhoods of Amsterdam)
Average distance to the 15 nearest restaurants
Interaction effect between occupancy and entire_place
Interaction effect between occupancy and property_type
reg6_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood110 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3 +
accommodates:entire_place +
accommodates:property_type
# reg6 <- lm(reg6_formula, data = train_data)
In this section, the goodness of fit of the final model is evaluated with reference to its generalizability (whether it can predict new data well and under different spatial and temporal contexts) and accuracy (whether there is a low degree of error between actual and predicted values).
Here, the final model is cross-validated with k = 100 folds, with the folds stratified by neighborhood so that each fold covers the neighborhoods of Amsterdam present in the training data. For each fold, the mean absolute error (MAE) between actual and predicted (log-transformed) Airbnb daily rates is calculated, and the results are displayed in the histogram below.
calendar$neighborhood110 <- droplevels(calendar$neighborhood110)
train_data$neighborhood110 <- droplevels(train_data$neighborhood110)
test_data$neighborhood110 <- droplevels(test_data$neighborhood110)
# Set formula and train data
reg6_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood110 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3 +
accommodates:entire_place +
accommodates:property_type
# Create k stratified folds by neighborhood110
set.seed(5080) # For reproducibility
folds <- caret::createFolds(train_data$neighborhood110, k = 100, list = TRUE)
# Perform cross-validation
cv_results <- map_dfr(folds, function(test_indices) {
# Split the data into training and testing
test_data <- train_data[test_indices, ]
train_data_fold <- train_data[-test_indices, ]
# Train the model
reg6 <- lm(reg6_formula, data = train_data_fold)
# Predict on the test set
test_data <- test_data %>%
mutate(
log_pricePredict = predict(reg6, test_data),
log_price.Error = log_pricePredict - log_price,
log_price.AbsError = abs(log_pricePredict - log_price),
log_price.APE = abs(log_pricePredict - log_price) / log_price
)
# Calculate metrics for this fold
metrics <- test_data %>%
summarise(
MSE = mean(log_price.Error^2),
MAE = mean(log_price.AbsError),
MAPE = mean(log_price.APE)
)
return(metrics)
})
# cv_results_summary <- cv_results %>%
# summarise(
# Avg_MSE = mean(MSE),
# Avg_MAE = mean(MAE),
# Avg_MAPE = mean(MAPE)
# )
save(cv_results, file = "cv_results.RData")
load("cv_results.RData")
ggplot(cv_results, aes(x = MAE)) +
geom_histogram(binwidth = 0.001, fill = "#006e70", color = "white") +
theme_minimal(base_family = "Ubuntu") +
labs(
title = "Distribution of Mean Absolute Error of 100 folds",
subtitle = "Final model",
x = "Mean Absolute Error (MAE)",
y = "Frequency"
) +
theme(
text = element_text(family = "Ubuntu"),
plot.title = element_text(size = 16, family = "Ubuntu"),
plot.subtitle = element_text(size = 12, family = "Ubuntu"),
axis.title = element_text(size = 12, family = "Ubuntu"),
axis.text = element_text(size = 10, family = "Ubuntu")
)
The histogram above shows that the final model is able to generalize well for new Airbnb listing data, as the mean absolute errors of the folds cluster tightly together.
To determine if the final model is generalizable across space, a map aggregating the mean absolute percentage errors (MAPE) of the test set predictions at the neighborhood level is produced, as presented in the figure below. If there is no clustering of high or low MAPE in certain regions, then the model is able to generalize well across space.
reg6_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood110 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3 +
accommodates:entire_place +
accommodates:property_type
reg6 <- lm(reg6_formula, data = train_data)
test_data <- test_data %>%
mutate(
log_pricePredict = predict(reg6, test_data),
log_price.Error = log_pricePredict - log_price,
log_price.AbsError = abs(log_pricePredict - log_price),
log_price.APE = abs(log_pricePredict - log_price) / log_price
)
neighborhood_mean_APE <- test_data %>%
group_by(neighborhood110) %>%
summarise(mean_APE = mean(log_price.APE, na.rm = TRUE))
sf_with_mean_APE <- sf.neighborhd110 %>%
left_join(neighborhood_mean_APE, by = c("Wijk" = "neighborhood110"))
sf_with_mean_APE <- sf_with_mean_APE %>%
filter(!is.na(mean_APE))
ggplot(sf_with_mean_APE) +
geom_sf(aes(fill = mean_APE)) +
scale_fill_viridis(option = "mako", name = "MAPE", direction = -1) +
theme_minimal() +
labs(
title = "Mean absolute percentage error (MAPE) by neighborhood",
subtitle = "Calculations based on the final model"
) +
theme(
text = element_text(family = "ubuntu"),
plot.title = element_text(family = "ubuntu"),
plot.subtitle = element_text(family = "ubuntu"),
axis.title = element_blank(),
axis.text = element_blank(),
panel.grid = element_blank()
)
The map above shows rather decent generalizability across space. The central neighborhoods have a decent spread of MAPE values from 3% to 7%, and even in the city outskirts (areas such as Amsterdam-Zuidoost to the southeast and Amsterdam Nieuw-West to the west) there is a wide range of values from 2% to 9%. Thus, on the whole, one can conclude that the model generalizes well across space.
The accuracy of the model can be determined with reference to the MAE and MAPE values calculated for the test dataset. These are reported in the table below:
mean_abs_error <- mean(test_data$log_price.AbsError, na.rm = TRUE)
mean_ape <- mean(test_data$log_price.APE, na.rm = TRUE)
error_metrics <- data.frame(
Metric = c("Mean Absolute Error (MAE)", "Mean Absolute Percentage Error (MAPE)"),
Value = c(mean_abs_error, mean_ape)
)
error_metrics %>%
kable(
col.names = c("Error Metric", "Value"),
caption = "Measuring model accuracy: MAE and MAPE",
digits = 3
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
font_size = 12
)
Error Metric | Value |
---|---|
Mean Absolute Error (MAE) | 0.242 |
Mean Absolute Percentage Error (MAPE) | 0.049 |
As the dependent variable (the Airbnb daily rate) is log-transformed, the more appropriate measure for determining accuracy is the MAPE, since the range of Airbnb daily rates can vary quite widely. Here, the MAPE of 4.9% (computed on log-transformed prices) is rather decent, showing that the model is generally accurate.
Leaving out the 110 coefficient rows for the neighborhood fixed effects, the output of the final model is as follows:
reg6_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood110 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3 +
accommodates:entire_place +
accommodates:property_type
# Note that the data here is calendar
reg6 <- lm(reg6_formula, data = calendar)
reg6_summary <- summary(reg6)
reg6_summary$coefficients <- reg6_summary$coefficients[!grepl("neighborhood110", rownames(reg6_summary$coefficients)), ]
coefficients_df <- as.data.frame(reg6_summary$coefficients)
coefficients_df <- coefficients_df[!grepl("neighborhood110", rownames(coefficients_df)), ]
colnames(coefficients_df)[4] <- "p-value"
coefficients_df[3] <- NULL
kable(coefficients_df, caption = "Final model output") %>%
kable_styling(bootstrap_options = c("hover", "striped", "condensed"))
The regression output shows that all predictors are statistically significant at the 5% level, with the exception of the property_type “Villa” and the interaction between the property_type “Loft” and occupancy.
In this final modelling section, a fishnet grid is produced over the Airbnb listings, with each cell measuring 500 meters by 500 meters, offering a good granularity for bespoke short-term rental policy evaluation and interventions. For each cell, the mean percentage error of the model predictions for its listings is aggregated, producing the Index value.
The curve below shows the cumulative distribution function of the aggregated errors across cells, and reflects a decent spread of values to be represented in the Fairbnb Index.
fishnet <-
st_make_grid(sf.neighborhd110,
cellsize = 500, # 1640.42 feet / 500 meters
square = TRUE) %>%
.[sf.neighborhd110] %>%
st_sf() %>%
mutate(uniqueID = 1:n())
fairbnb_net <-
dplyr::select(listings_details_sf) %>%
mutate(countBnb = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countBnb = replace_na(countBnb, 0))
fairbnb_net <- fairbnb_net %>%
mutate(sufficient_data = if_else(countBnb >= 10, TRUE, FALSE))
fairbnb_net <- fairbnb_net %>%
mutate(Cell_ID = row_number())
calendar <- calendar %>%
mutate(
log_pricePredict = predict(reg6, calendar),
log_price.Error = log_price - log_pricePredict,
log_price.AbsError = abs(log_price - log_pricePredict),
log_price.PE = (log_price - log_pricePredict) / log_price
)
calendar_sf <- calendar %>%
left_join(
select(listings_details_sf, listing_id, geometry),
by = "listing_id"
)
calendar_sf <- st_as_sf(calendar_sf, sf_column_name = "geometry.y", crs = st_crs(listings_details_sf))
aggregated_data <- calendar_sf %>%
st_join(fairbnb_net, join = st_within) %>%
group_by(Cell_ID) %>%
summarise(
mean_log_price_error = mean(log_price.Error, na.rm = TRUE),
mean_log_price_PE = mean(log_price.PE, na.rm = TRUE),
mean_price = mean(price, na.rm = TRUE)
)
fairbnb_net_with_errors <- fairbnb_net %>%
left_join(aggregated_data %>% st_drop_geometry(), by = "Cell_ID")
fairbnb_net_with_errors <- fairbnb_net_with_errors %>%
filter(sufficient_data == TRUE)
range(fairbnb_net_with_errors$mean_log_price_error, na.rm = TRUE)
range(fairbnb_net_with_errors$mean_log_price_PE, na.rm = TRUE)
# ggplot() +
# geom_sf(data = fairbnb_net, fill = "#ded9ca", color = "white", size = 5) +
# theme_minimal(base_family = "Ubuntu") +
# labs(
# title = 'Base fishnet',
# subtitle = "500m x 500m grids"
# ) +
# theme(
# plot.title = element_text(size = 14, family = "Ubuntu"),
# axis.title = element_blank(),
# axis.text = element_blank(),
# panel.grid = element_blank()
# )
ggplot(fairbnb_net_with_errors, aes(x = mean_log_price_PE)) +
stat_ecdf(geom = "step", color = "#006e70") +
labs(
title = "Cumulative density function of model percentage errors",
subtitle = "Final model outputs aggregated into fishnet cells",
x = "Mean Log Price Error",
y = "Cumulative Density"
) +
theme_minimal() +
theme(
text = element_text(family = "ubuntu"),
plot.title = element_text(family = "ubuntu"),
plot.subtitle = element_text(family = "ubuntu"),
axis.title = element_text(family = "ubuntu"),
axis.text = element_text(family = "ubuntu")
)
Next, the Fairbnb Index is calculated by aggregating the percentage errors in each fishnet cell, and is visualized in the map below.
fairbnb_net_with_errors <- fairbnb_net_with_errors %>%
mutate(Fairbnb_Index = round(100 * mean_log_price_PE, 5))
fairbnb_net_with_errors <- fairbnb_net_with_errors %>%
mutate(Fairbnb_Index_Category = case_when(
Fairbnb_Index < -4 ~ "#c7522a",
Fairbnb_Index >= -4 & Fairbnb_Index < -2 ~ "#e5c185",
Fairbnb_Index >= -2 & Fairbnb_Index < 2 ~ "#c8c7d188",
Fairbnb_Index >= 2 & Fairbnb_Index <= 4 ~ "#74a892",
Fairbnb_Index > 4 ~ "#008585",
TRUE ~ "#FFFFFF" # Default color if not in any range
))
ggplot(fairbnb_net_with_errors) +
geom_sf(data = sf.neighborhd110, fill = NA, color = "#c8c7d188", size = 5) +
geom_sf(aes(fill = Fairbnb_Index_Category)) +
scale_fill_identity() + # Use the colors directly without modification
theme_minimal() +
labs(
title = "Fairbnb Index across Amsterdam",
subtitle = "<span style='color:black; font-weight:bold;'>Color guide:<br>
<span style='color:#c7522a;'>Dark red:</span> Below -4<br>
<span style='color:#e5c185;'>Orange:</span> -4 to -2<br>
<span style='color:black;'>Gray:</span> -2 to 2<br>
<span style='color:#74a892;'>Light green:</span> 2 to 4<br>
<span style='color:#008585;'>Dark green:</span> Above 4"
) +
theme(
text = element_text(family = "ubuntu"),
plot.title = element_text(size = 16, family = "ubuntu"),
plot.subtitle = element_markdown(size = 8),
axis.title = element_blank(),
axis.text = element_blank(),
panel.grid = element_blank()
)
An interactive version of the map, embedded as a Leaflet widget, can be previewed below.
fairbnb_net_with_errors_4326 <- st_transform(fairbnb_net_with_errors, crs = 4326)
leaflet(fairbnb_net_with_errors_4326) %>%
addProviderTiles(providers$CartoDB.Positron) %>% # Base map
addPolygons(
fillColor = ~Fairbnb_Index_Category, # Color based on the Fairbnb Index
fillOpacity = 0.7,
color = "#ffffff", # Border color for polygons
weight = 1, # Border weight
popup = ~paste("Fairbnb Index value: ", Fairbnb_Index), # Popup with Fairbnb Index value
label = ~paste("Fairbnb Index value: ", Fairbnb_Index), # Tooltip with Fairbnb Index value
labelOptions = labelOptions(
direction = "auto", # Tooltip direction
textsize = "15px" # Tooltip font size
)
) %>%
addLegend(
position = "topright",
colors = c("#c7522a", "#e5c185", "#c8c7d188", "#74a892", "#008585"),
labels = c("Below -4", "-4 to -2", "-2 to 2", "2 to 4", "Above 4"),
title = "Fairbnb Index Categories"
)
# Attach neighborhood (Wijk) names to each cell via its centroid
fairbnb_net_with_errors_centroids <- st_centroid(fairbnb_net_with_errors)
fairbnb_net_with_errors_joined <- st_join(fairbnb_net_with_errors_centroids, sf.neighborhd110[, c("Wijk")])
fairbnb_detailed <- fairbnb_net_with_errors %>%
  left_join(fairbnb_net_with_errors_joined %>% st_drop_geometry() %>% select(Cell_ID, Wijk), by = c("Cell_ID" = "Cell_ID")) %>%
  st_as_sf()

# Reproject to WGS84 and swap the semi-transparent neutral color for an opaque one
fairbnb_detailed_4326 <- st_transform(fairbnb_detailed, crs = 4326)
fairbnb_detailed_4326 <- fairbnb_detailed_4326 %>%
  mutate(Fairbnb_Index_Category = ifelse(Fairbnb_Index_Category == "#c8c7d188", "#b6c1d5", Fairbnb_Index_Category))

# Replace NA values in the Wijk column
fairbnb_detailed_4326$Wijk[is.na(fairbnb_detailed_4326$Wijk)] <- "Partially outside City administrative boundary"

# Export the enriched fishnet as GeoJSON for the web application
st_write(fairbnb_detailed_4326, "fairbnb_detailed.geojson")
Meanwhile, a web application prototype has been developed using JavaScript, and can be viewed at the following link:
https://tinyurl.com/musa-5080-final
In summary, the Fairbnb modelling exercise and web application give policymakers in Amsterdam a useful starting point for examining the nuanced effects of short-term rental properties on neighborhoods. In particular, they uncover groups of blocks with an apparent chronic over-supply of short-term rental properties: dwelling spaces that could be put to better use if returned to residential occupation. By tracking the Fairbnb Index (and how it changes over time as new data arrive), policymakers can pinpoint specific areas below the neighborhood level for market interventions, such as setting price floors for permits or restricting the local supply of permits. Such steps help rebalance supply and demand for short-term rentals and stem the loss of residential housing stock in areas where, as the exploratory analysis has shown, market participants tend to behave speculatively.
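As a purely illustrative sketch (not a policy recommendation), cells could be tagged with an intervention tier using the same thresholds as the map categories above; the intervention_tier field and its labels are hypothetical placeholders introduced here for exposition.
# Illustrative only: tag cells with a hypothetical intervention tier,
# reusing the thresholds from the map categories above
fairbnb_net_with_errors <- fairbnb_net_with_errors %>%
  mutate(intervention_tier = case_when(
    Fairbnb_Index < -4                       ~ "Review local permit supply (prices well below baseline)",
    Fairbnb_Index >= -4 & Fairbnb_Index < -2 ~ "Monitor for emerging over-supply",
    Fairbnb_Index >= -2 & Fairbnb_Index < 2  ~ "No action indicated",
    TRUE                                     ~ "Prices above baseline: monitor affordability pressure"
  ))

# Quick count of cells per tier
table(fairbnb_net_with_errors$intervention_tier)
The thresholds and labels would of course need to be set through the deliberative process described below, not hard-coded by analysts.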
A key advantage of the modelling process is its replicability, which supports the long-term sustainability of the Fairbnb platform, particularly given upcoming EU legislation requiring companies like Airbnb to share more data with local and national governments. While the current analysis covers a one-year time frame, the price model can easily be applied to more recent listing data using the same code base contained in this R Markdown file.
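As a minimal sketch of that re-application, assuming a refreshed calendar extract with the same columns used to fit the model, and reusing the fitted reg6 object and the fairbnb_net grid from above (the helper name update_fairbnb_index and the new_calendar input are hypothetical):
# Hypothetical helper: re-score a newer calendar extract and rebuild the Index.
# Assumes new_calendar has the predictor columns used to fit reg6, a log_price
# column, and a listing_id key that matches listings_details_sf.
update_fairbnb_index <- function(new_calendar, model = reg6, grid = fairbnb_net) {
  new_calendar %>%
    mutate(log_pricePredict = predict(model, new_calendar),
           log_price.PE = (log_price - log_pricePredict) / log_price) %>%
    left_join(select(listings_details_sf, listing_id, geometry),
              by = "listing_id") %>%
    st_as_sf() %>%
    st_join(grid, join = st_within) %>%
    st_drop_geometry() %>%
    group_by(Cell_ID) %>%
    summarise(Fairbnb_Index = round(100 * mean(log_price.PE, na.rm = TRUE), 5))
}

# Usage (assuming a freshly downloaded extract named calendar_2025):
# fairbnb_index_2025 <- update_fairbnb_index(calendar_2025)
In practice, the countBnb and sufficient_data screens would also be re-checked before reporting updated Index values for cells whose listing counts have changed.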
There are a few areas where the present analysis could be augmented. For instance, text-mining techniques (including those offered by large language model services such as the OpenAI API) could be used in the feature engineering stage to extract property-level information not captured in the current analysis, such as views from the apartment or special interior features that may fetch a price premium over otherwise similar properties.
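As a rough illustration (not part of the current pipeline), the sketch below derives two binary amenity flags from the free-text listing descriptions using simple keyword matching with stringr; the description column name and the keyword lists are assumptions, and an LLM-based extraction would follow the same pattern of converting free text into structured predictors for the hedonic model.
library(stringr) # for str_to_lower() and str_detect()

# Hypothetical feature engineering step: flag listings whose description
# mentions a water/canal view or a balcony/terrace. Assumes listings_details_sf
# has a character column named `description`.
listings_details_sf <- listings_details_sf %>%
  mutate(
    desc_lower  = str_to_lower(replace_na(description, "")),
    has_view    = str_detect(desc_lower, "canal view|water view|harbour view"),
    has_balcony = str_detect(desc_lower, "balcony|terrace|roof terrace")
  ) %>%
  select(-desc_lower)
The resulting has_view and has_balcony flags could then be added to the model formula alongside the existing property-level predictors.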
As for the modelling phase, the current linear regression approach could be extended with more advanced supervised machine learning algorithms such as decision trees, which may capture the relationships between the predictor variables more robustly. The code block below provides one such example using the open-source machine learning library XGBoost, which fits an ensemble of regression trees to predict Airbnb daily rates. Initial results already show a lower mean absolute error and mean absolute percentage error than our best linear model, demonstrating one strength of these more advanced approaches.
library(xgboost) # gradient-boosted tree models

# Model specification using the engineered predictors from the hedonic analysis
formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
  bed_type + accommodates + bathrooms + bedrooms + beds +
  neighborhood110 + entire_place + superhost +
  log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3 +
  accommodates:entire_place + accommodates:property_type

# Create model matrix for features (one-hot encodes factors and interactions)
model_matrix <- model.matrix(formula, data = train_data)
# Extract target variable
target <- train_data$log_price
# Convert to DMatrix (efficient storage for xgboost)
dtrain <- xgb.DMatrix(data = model_matrix, label = target)
# Parameters for xgboost
params <- list(
objective = "reg:squarederror", # Regression task
eval_metric = "rmse", # Root Mean Squared Error
max_depth = 6, # Depth of the tree
eta = 0.3, # Learning rate
subsample = 0.8, # Fraction of data used per iteration
colsample_bytree = 0.8 # Fraction of features used per tree
)
# Train the model
set.seed(123) # For reproducibility
model <- xgb.train(
params = params,
data = dtrain,
nrounds = 100, # Number of boosting rounds
watchlist = list(train = dtrain),
verbose = 1
)
# Build the test design matrix; this assumes test_data shares the training
# factor levels so that its columns line up with the training matrix
test_matrix <- model.matrix(formula, data = test_data)
# Convert test_matrix to DMatrix
dtest <- xgb.DMatrix(data = test_matrix)
# Predict using the trained model
test_predictions <- predict(model, dtest)
# Output predictions
head(test_predictions)
if ("log_price" %in% colnames(test_data)) {
actual <- test_data$log_price
# Root Mean Squared Error (RMSE)
rmse_test <- sqrt(mean((actual - test_predictions)^2))
# Mean Absolute Error (MAE)
mae_test <- mean(abs(actual - test_predictions))
# Mean Absolute Percentage Error (MAPE)
mape_test <- mean(abs((actual - test_predictions) / actual)) * 100
print(paste("Test RMSE:", round(rmse_test, 4)))
print(paste("Test MAE:", round(mae_test, 4)))
print(paste("Test MAPE (%):", round(mape_test, 2)))
}
Regardless of the final modelling approach chosen, it should be up to Amsterdam's citizens to negotiate the final set of regulations governing short-term rental properties in their area: the people priced out by investment properties in their own neighborhoods, the residents who cross paths with short-term visitors in their apartment blocks, the activists concerned about the erosion of their districts' character, and the property owners who profit from the speculative letting of their portfolios. Indeed, Fairbnb is not intended to be wielded as a blunt technocratic tool for intervening in housing and short-term rental markets, but rather as an open platform that empowers policymakers to make informed decisions and enables citizens to scrutinize and influence that decision-making process. After all, it is homes, livelihoods and communities that are at stake.
References
Ioannides, D., Röslmaier, M., & Van Der Zee, E. (2019). Airbnb as an instigator of ‘tourism bubble’ expansion in Utrecht’s Lombok neighbourhood. Tourism Geographies, 21(5), 822-840.
Koolhoven, R. (2018). Regulating Airbnb in the Netherlands. Journal of European Consumer and Market Law, 7, 253.
Morpey, U. (2022). The effect of Airbnb activity on residential real estate values and livability: The case for the Netherlands (Doctoral dissertation).
Image credits:
Cover image — Adrien Olichon via Unsplash
https://unsplash.com/photos/view-of-apartment-complex-buildings-082xic2aif8
Fairbnb logo — Own work